# LDA introduction.

Natural language processing (NLP) employs the probabilistic topic modeling method known as Latent Dirichlet Allocation (LDA).

By identifying the topics that most accurately represent each theme, LDA aims to reveal the hidden thematic structure of a group of documents.

LDA can be used to find the topics that are most pertinent to computer science, mathematics, and physics in the case of abstracts for STEM subjects.

We begin by preprocessing the text data for this purpose in order to get rid of stop words, punctuation, and other extraneous details.

In order to represent the frequency of each term in each document, we tokenize the text to separate it into individual words or phrases.

The topics that best explain the variation in the data can be found using LDA once we have the document-term matrix.

Until the model converges on a stable solution, this entails repeatedly assigning each word in each document to a topic and adjusting the topic probabilities.

The LDA model produces a list of topics, each of which is represented by a distribution over the vocabulary words.

The topics can then be understood by looking at the most frequently occurring words in each topic and using domain knowledge to assign them to pertinent STEM subject areas.

We previously conduct a similar procedure using TF-IDF, and this model will work in tandom with the website that we have created for the users to input their abstracts.

We first begin by importing the libraries that will be using.

In [None]:
pip install pyLDAvis


In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('wordnet')

You can use the following command line to import the dataset file in case youre using Google Colab.

In [None]:
from google.colab import files

uploaded = files.upload()

Saving abstracts.csv to abstracts.csv


In [4]:
import pandas as pd
import numpy as np
import re
import string
import spacy
import pickle

# libraries for visualization
import pyLDAvis
import pyLDAvis.lda_model
import matplotlib.pyplot as plt
import seaborn as sns
import gensim
from gensim import corpora
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis.gensim_models as gensimvis
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

  if LooseVersion(mpl.__version__) >= "3.0":
  other = LooseVersion(other)
scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
  _deprecated()


We will then import our data set as we did before.

In [5]:
train_df = pd.read_csv('abstracts.csv')
train_df.head()

Unnamed: 0,ID,TITLE,ABSTRACT
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...


In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20972 entries, 0 to 20971
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   ID        20972 non-null  int64 
 1   TITLE     20972 non-null  object
 2   ABSTRACT  20972 non-null  object
dtypes: int64(1), object(2)
memory usage: 491.7+ KB


## Data cleaning and preprocessing.
And afterwords, we will have our cleaning function. this function performs a series of common text preprocessing steps to remove noise and irrelevant information from the input text, which can improve the accuracy of natural language processing tasks. These steps include removing punctuation, removing words that are entirely composed of digits, and removing short words.



In [7]:
def clean_text(text):
    #The first line of the function creates a dictionary called clean_dict that maps each punctuation character in the string.punctuation string to an empty string.
    #This will be used to remove all punctuation characters from the text.
    clean_dict = {special_char: '' for special_char in string.punctuation}
    clean_dict[' '] = ' '
    #A translation table is created using the str.maketrans() method, which takes the clean_dict dictionary as input
    #and returns a translation table that can be used with the translate() method to remove punctuation from the text.
    table = str.maketrans(clean_dict)
    text_1 = text.translate(
        table
    )  #he translate() method is called on the input text using the translation table to remove all punctuation characters.
    text_Array = text_1.split()

    """
    A list comprehension is used to remove any words that are entirely composed of digits (isdigit()) or that have a length less than or equal to 3 characters.
    The remaining words are joined back together into a string with spaces between them using the join() method.
    The resulting cleaned text is converted to lowercase using the lower() method.
    """
    text_2 = ' '.join([
        word for word in text_Array
        if (not word.isdigit() and (not word.isdigit() and len(word) > 3))
    ])

    return text_2.lower()

We will then be using a Natural Language Toolkit (NLTK) library to remove stopwords from text data.

Stopwords are words that occur frequently in a language but do not carry much meaning, such as "a", "an", "the", "in", "of", etc.

The first two lines of the code import the stopwords module from NLTK and create a variable stop_words that contains a list of English stopwords.

Next, a function remove_stopwords is defined that takes a single argument text, which is a string containing text data.

The function splits the input text into an array of words using the split() method and then uses a list comprehension to remove any words that appear in the stop_words list.

The filtered words are then joined back together into a string using the join() method and returned.

Finally, the apply() method is used to apply the remove_stopwords function to every row in the 'ABSTRACT' column of the train_df DataFrame.

This removes the stopwords from the text data in each row and updates the 'ABSTRACT' column in-place with the cleaned text.

In [8]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')


#We then create a function to remove the stopwords in our text.
def remove_stopwords(text):
    text_Array = text.split(' ')
    remove_words = " ".join([i for i in text_Array if i not in stop_words])
    return remove_words


#And here we will apply the remove_stopwords function. This will remove the stopwords from our dataset's text
train_df['ABSTRACT'] = train_df['ABSTRACT'].apply(remove_stopwords)


We will use the SpaCy library to perform lemmatization on a list of input texts. Lemmatization is the process of reducing words to their base or dictionary form, which can be useful for standardizing text data and reducing noise in natural language processing tasks. However before beginning to use the Spacy library, you must first install the required tools to begin using the Spacy library. Run the following commands in the terminal to install the required tools.
````
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
````



In [15]:
'''
The first line of the code loads the 'en_core_web_md' SpaCy model, 
which is a medium-sized English language model that includes word vectors and supports part-of-speech tagging, 
named entity recognition, and dependency parsing. The 'parser' and 'ner' components are disabled using the disable parameter, 
which speeds up the processing time since these components are not needed for lemmatization.
'''
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *    

nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])
 
def lemmatization(texts,allowed_postags=['VERB', 'ADV','ADJ']): 
       output = []
       for sent in texts:
             doc = nlp(sent) 
             output.append([token.lemma_ for token in doc if token.pos_ in allowed_postags ])
       return output



In [16]:
train_df['ABSTRACT'].dtypes

dtype('O')

In [17]:
train_df['ABSTRACT'].isnull()

0        False
1        False
2        False
3        False
4        False
         ...  
20967    False
20968    False
20969    False
20970    False
20971    False
Name: ABSTRACT, Length: 20972, dtype: bool

In [18]:

text_list=train_df['ABSTRACT'].tolist()
print(text_list[1])
tokenized_reviews = lemmatization(text_list)
print(tokenized_reviews[1])

  Rotation invariance translation invariance great values image
recognition tasks. In paper, bring new architecture convolutional
neural network (CNN) named cyclic convolutional layer achieve rotation
invariance 2-D symbol recognition. We also get position and
orientation 2-D symbol network achieve detection purpose for
multiple non-overlap target. Last least, architecture achieve
one-shot learning cases using invariance.

['great', 'bring', 'new', 'convolutional', 'neural', 'name', 'cyclic', 'convolutional', 'achieve', 'd', 'also', 'get', 'd', 'achieve', 'multiple', 'non', '-', 'overlap', 'last', 'least', 'achieve', 'use']


In [19]:
dictionary = corpora.Dictionary(tokenized_reviews)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_reviews]
bow_corpus[4310]

[(13, 3),
 (48, 1),
 (128, 2),
 (129, 1),
 (170, 1),
 (235, 1),
 (249, 1),
 (333, 1),
 (409, 1),
 (464, 1),
 (556, 2),
 (688, 1),
 (766, 1),
 (822, 3),
 (873, 1),
 (1080, 1),
 (1325, 3),
 (1534, 3),
 (1624, 2),
 (1838, 6),
 (3500, 1)]

In [20]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 -
1 aim
2 allow
3 also
4 analyse
5 analyze
6 associate
7 binary
8 bootstrap
9 compare
10 compose


In [21]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

In [22]:
bow_doc_4310 = bow_corpus[4310]

for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(
        bow_doc_4310[i][0], dictionary[bow_doc_4310[i][0]],
        bow_doc_4310[i][1]))


Word 13 ("diagnostic") appears 3 time.
Word 48 ("reduce") appears 1 time.
Word 128 ("let") appears 2 time.
Word 129 ("maximum") appears 1 time.
Word 170 ("molecular") appears 1 time.
Word 235 ("rather") appears 1 time.
Word 249 ("diagonal") appears 1 time.
Word 333 ("ongoing") appears 1 time.
Word 409 ("analytic") appears 1 time.
Word 464 ("parallel") appears 1 time.
Word 556 ("major") appears 2 time.
Word 688 ("significant") appears 1 time.
Word 766 ("exotic") appears 1 time.
Word 822 ("iteratively") appears 3 time.
Word 873 ("effectively") appears 1 time.
Word 1080 ("exponentially") appears 1 time.
Word 1325 ("conformal") appears 3 time.
Word 1534 ("drastically") appears 3 time.
Word 1624 ("sized") appears 2 time.
Word 1838 ("schrödinger") appears 6 time.


KeyError: 3500

In [23]:
train_df['ABSTRACT'].shape

(20972,)

In [24]:
train_df['ABSTRACT'].size
print(train_df['ABSTRACT'][0])

  Predictive models allow subject-specific inference analyzing disease
related alterations neuroimaging data. Given subject's data, inference can
be made two levels: global, i.e. identifiying condition presence the
subject, local, i.e. detecting condition effect individual
measurement extracted subject's data. While global inference widely
used, local inference, used form subject-specific effect maps,
is rarely used existing models often yield noisy detections composed of
dispersed isolated islands. In article, propose reconstruction
method, named RSM, improve subject-specific detections predictive
modeling approaches particular, binary classifiers. RSM specifically
aims reduce noise due sampling error associated using finite
sample examples train classifiers. The proposed method wrapper-type
algorithm used different binary classifiers diagnostic
manner, i.e. without information condition presence. Reconstruction posed
as Maximum-A-Posteriori problem prior model whose parameters are
es

In [25]:

LDA = gensim.models.ldamodel.LdaModel
# Build LDA model
lda_model = LDA(corpus = bow_corpus,
                                       num_topics=5,
                                       id2word=dictionary,
                                       random_state=100,
                                       chunksize=1000,
                                       passes=50,
                                       iterations=100)

IndexError: index 3117 is out of bounds for axis 1 with size 3117

In [None]:
lda_model_multi = gensim.models.LdaMulticore(corpus = bow_corpus,
                                       num_topics=5,
                                       id2word=dictionary,
                                       workers=2,
                                       random_state=100,
                                       chunksize=1000,
                                       passes=50,
                                       iterations=100)


  and should_run_async(code)


In [None]:
lda_model.print_topics()

  and should_run_async(code)


[(0,
  '0.022*"give" + 0.021*"show" + 0.020*"prove" + 0.015*"also" + 0.010*"set" + 0.009*"obtain" + 0.009*"define" + 0.009*"use" + 0.008*"study" + 0.008*"consider"'),
 (1,
  '0.026*"use" + 0.018*"propose" + 0.017*"base" + 0.016*"learn" + 0.010*"show" + 0.009*"neural" + 0.009*"different" + 0.009*"deep" + 0.008*"new" + 0.008*"present"'),
 (2,
  '0.020*"propose" + 0.019*"use" + 0.015*"show" + 0.015*"-" + 0.013*"base" + 0.011*"optimal" + 0.010*"random" + 0.009*"provide" + 0.008*"consider" + 0.008*"well"'),
 (3,
  '0.057*"-" + 0.030*"non" + 0.019*"dimensional" + 0.013*"topological" + 0.010*"nonlinear" + 0.010*"critical" + 0.009*"study" + 0.008*"spatial" + 0.008*"describe" + 0.008*"couple"'),
 (4,
  '0.015*"use" + 0.014*"high" + 0.014*"find" + 0.010*"show" + 0.010*"low" + 0.010*"large" + 0.010*"magnetic" + 0.008*"observe" + 0.008*"also" + 0.007*"present"')]

In [None]:
lda_model_multi.print_topics()

  and should_run_async(code)


[(0,
  '0.024*"give" + 0.021*"show" + 0.020*"prove" + 0.014*"also" + 0.013*"set" + 0.010*"define" + 0.008*"use" + 0.008*"bound" + 0.008*"bind" + 0.007*"study"'),
 (1,
  '0.025*"use" + 0.016*"propose" + 0.016*"base" + 0.014*"learn" + 0.010*"show" + 0.009*"neural" + 0.009*"different" + 0.009*"deep" + 0.008*"present" + 0.008*"-"'),
 (2,
  '0.020*"propose" + 0.019*"use" + 0.014*"show" + 0.014*"-" + 0.013*"base" + 0.011*"optimal" + 0.009*"provide" + 0.008*"random" + 0.008*"well" + 0.007*"also"'),
 (3,
  '0.029*"-" + 0.020*"non" + 0.013*"dimensional" + 0.012*"show" + 0.011*"use" + 0.010*"obtain" + 0.009*"consider" + 0.009*"study" + 0.009*"nonlinear" + 0.009*"also"'),
 (4,
  '0.014*"use" + 0.012*"high" + 0.012*"find" + 0.010*"show" + 0.009*"magnetic" + 0.009*"-" + 0.009*"low" + 0.008*"large" + 0.007*"observe" + 0.007*"present"')]

In [None]:
for index, score in sorted(lda_model[bow_corpus[4310]],
                           key=lambda tup: -1 * tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score,
                                             lda_model.print_topic(index, 10)))



Score: 0.5677973031997681	 
Topic: 0.015*"use" + 0.014*"high" + 0.014*"find" + 0.010*"show" + 0.010*"low" + 0.010*"large" + 0.010*"magnetic" + 0.008*"observe" + 0.008*"also" + 0.007*"present"

Score: 0.2265579253435135	 
Topic: 0.026*"use" + 0.018*"propose" + 0.017*"base" + 0.016*"learn" + 0.010*"show" + 0.009*"neural" + 0.009*"different" + 0.009*"deep" + 0.008*"new" + 0.008*"present"

Score: 0.1152733564376831	 
Topic: 0.057*"-" + 0.030*"non" + 0.019*"dimensional" + 0.013*"topological" + 0.010*"nonlinear" + 0.010*"critical" + 0.009*"study" + 0.008*"spatial" + 0.008*"describe" + 0.008*"couple"

Score: 0.08490440994501114	 
Topic: 0.022*"give" + 0.021*"show" + 0.020*"prove" + 0.015*"also" + 0.010*"set" + 0.009*"obtain" + 0.009*"define" + 0.009*"use" + 0.008*"study" + 0.008*"consider"


  and should_run_async(code)


In [None]:
for index, score in sorted(lda_model_multi[bow_corpus[4310]],
                           key=lambda tup: -1 * tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score,
                                             lda_model_multi.print_topic(index, 10)))


Score: 0.692303478717804	 
Topic: 0.014*"use" + 0.012*"high" + 0.012*"find" + 0.010*"show" + 0.009*"magnetic" + 0.009*"-" + 0.009*"low" + 0.008*"large" + 0.007*"observe" + 0.007*"present"

Score: 0.2264566421508789	 
Topic: 0.025*"use" + 0.016*"propose" + 0.016*"base" + 0.014*"learn" + 0.010*"show" + 0.009*"neural" + 0.009*"different" + 0.009*"deep" + 0.008*"present" + 0.008*"-"

Score: 0.07028019428253174	 
Topic: 0.024*"give" + 0.021*"show" + 0.020*"prove" + 0.014*"also" + 0.013*"set" + 0.010*"define" + 0.008*"use" + 0.008*"bound" + 0.008*"bind" + 0.007*"study"


  and should_run_async(code)


In [None]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, bow_corpus, dictionary)
vis

  and should_run_async(code)
  default_term_info = default_term_info.sort_values(


In [None]:
pyLDAvis.save_html(vis, 'lda_model.html')

  and should_run_async(code)


In [None]:
pyLDAvis.enable_notebook()
vis2 = gensimvis.prepare(lda_model_multi, bow_corpus, dictionary)
vis2

  and should_run_async(code)
  default_term_info = default_term_info.sort_values(


In [None]:
pyLDAvis.save_html(vis2, 'lda_model_multi.html')

  and should_run_async(code)


In [None]:
print('\nPerplexity: ', lda_model.log_perplexity(bow_corpus,total_docs=80000))
from gensim.models.coherencemodel import CoherenceModel
coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_reviews, dictionary=dictionary , coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

  and should_run_async(code)



Perplexity:  -7.195498084202839

Coherence Score:  0.4097290552536677


In [None]:
print('\nPerplexity: ', lda_model_multi.log_perplexity(bow_corpus,total_docs=80000))
from gensim.models.coherencemodel import CoherenceModel
coherence_model_lda_multi = CoherenceModel(model=lda_model_multi, texts=tokenized_reviews, dictionary=dictionary , coherence='c_v')
coherence_lda_multi = coherence_model_lda_multi.get_coherence()
print('\nCoherence Score: ', coherence_lda_multi)

  and should_run_async(code)



Perplexity:  -7.150351881079179

Coherence Score:  0.40879715468728806


In [None]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.022*"give" + 0.021*"show" + 0.020*"prove" + 0.015*"also" + 0.010*"set" + 0.009*"obtain" + 0.009*"define" + 0.009*"use" + 0.008*"study" + 0.008*"consider"
Topic: 1 
Words: 0.026*"use" + 0.018*"propose" + 0.017*"base" + 0.016*"learn" + 0.010*"show" + 0.009*"neural" + 0.009*"different" + 0.009*"deep" + 0.008*"new" + 0.008*"present"
Topic: 2 
Words: 0.020*"propose" + 0.019*"use" + 0.015*"show" + 0.015*"-" + 0.013*"base" + 0.011*"optimal" + 0.010*"random" + 0.009*"provide" + 0.008*"consider" + 0.008*"well"
Topic: 3 
Words: 0.057*"-" + 0.030*"non" + 0.019*"dimensional" + 0.013*"topological" + 0.010*"nonlinear" + 0.010*"critical" + 0.009*"study" + 0.008*"spatial" + 0.008*"describe" + 0.008*"couple"
Topic: 4 
Words: 0.015*"use" + 0.014*"high" + 0.014*"find" + 0.010*"show" + 0.010*"low" + 0.010*"large" + 0.010*"magnetic" + 0.008*"observe" + 0.008*"also" + 0.007*"present"


  and should_run_async(code)


In [None]:
for idx, topic in lda_model_multi.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.024*"give" + 0.021*"show" + 0.020*"prove" + 0.014*"also" + 0.013*"set" + 0.010*"define" + 0.008*"use" + 0.008*"bound" + 0.008*"bind" + 0.007*"study"
Topic: 1 
Words: 0.025*"use" + 0.016*"propose" + 0.016*"base" + 0.014*"learn" + 0.010*"show" + 0.009*"neural" + 0.009*"different" + 0.009*"deep" + 0.008*"present" + 0.008*"-"
Topic: 2 
Words: 0.020*"propose" + 0.019*"use" + 0.014*"show" + 0.014*"-" + 0.013*"base" + 0.011*"optimal" + 0.009*"provide" + 0.008*"random" + 0.008*"well" + 0.007*"also"
Topic: 3 
Words: 0.029*"-" + 0.020*"non" + 0.013*"dimensional" + 0.012*"show" + 0.011*"use" + 0.010*"obtain" + 0.009*"consider" + 0.009*"study" + 0.009*"nonlinear" + 0.009*"also"
Topic: 4 
Words: 0.014*"use" + 0.012*"high" + 0.012*"find" + 0.010*"show" + 0.009*"magnetic" + 0.009*"-" + 0.009*"low" + 0.008*"large" + 0.007*"observe" + 0.007*"present"


  and should_run_async(code)


In [None]:

lda_model.save('Model/lda_model.model')


  and should_run_async(code)


FileNotFoundError: ignored

In [None]:
import pickle
data = {'model': lda_model, 'topic': train_df['ABSTRACT_Topic']}
with open('model.pkl', 'wb') as file:
    pickle.dump(data, file)

with open('model.pkl', 'rb') as file:
    data = pickle.load(file)

model_loaded = data['model']
topic_loaded = data['topic']