# LDA introduction.

Natural language processing (NLP) employs the probabilistic topic modeling method known as Latent Dirichlet Allocation (LDA).

By identifying the topics that most accurately represent each theme, LDA aims to reveal the hidden thematic structure of a group of documents.

LDA can be used to find the topics that are most pertinent to computer science, mathematics, and physics in the case of abstracts for STEM subjects.

We begin by preprocessing the text data for this purpose in order to get rid of stop words, punctuation, and other extraneous details.

In order to represent the frequency of each term in each document, we tokenize the text to separate it into individual words or phrases.

The topics that best explain the variation in the data can be found using LDA once we have the document-term matrix.

Until the model converges on a stable solution, this entails repeatedly assigning each word in each document to a topic and adjusting the topic probabilities.

The LDA model produces a list of topics, each of which is represented by a distribution over the vocabulary words.

The topics can then be understood by looking at the most frequently occurring words in each topic and using domain knowledge to assign them to pertinent STEM subject areas.

We previously conduct a similar procedure using TF-IDF, and this model will work in tandom with the website that we have created for the users to input their abstracts.

We first begin by importing the libraries that will be using.

In [40]:
pip install pyLDAvis


Note: you may need to restart the kernel to use updated packages.


In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('wordnet')

You can use the following command line to import the dataset file in case youre using Google Colab.

In [None]:
from google.colab import files

uploaded = files.upload()

Saving abstracts.csv to abstracts.csv


In [1]:
import pandas as pd
import numpy as np
import re
import string
import spacy
import pickle

# libraries for visualization
import pyLDAvis
import pyLDAvis.lda_model
import matplotlib.pyplot as plt
import seaborn as sns
import gensim
from gensim import corpora
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis.gensim_models as gensimvis
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

  if LooseVersion(mpl.__version__) >= "3.0":
  other = LooseVersion(other)
scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
  _deprecated()


We will then import our data set as we did before.

In [2]:
train_df = pd.read_csv('abstracts.csv')
train_df.head()

Unnamed: 0,ID,TITLE,ABSTRACT
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...


In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20972 entries, 0 to 20971
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   ID        20972 non-null  int64 
 1   TITLE     20972 non-null  object
 2   ABSTRACT  20972 non-null  object
dtypes: int64(1), object(2)
memory usage: 491.7+ KB


## Data cleaning and preprocessing.
And afterwords, we will have our cleaning function. this function performs a series of common text preprocessing steps to remove noise and irrelevant information from the input text, which can improve the accuracy of natural language processing tasks. These steps include removing punctuation, removing words that are entirely composed of digits, and removing short words.



In [4]:
def clean_text(text):
    #The first line of the function creates a dictionary called clean_dict that maps each punctuation character in the string.punctuation string to an empty string.
    #This will be used to remove all punctuation characters from the text.
    clean_dict = {special_char: '' for special_char in string.punctuation}
    clean_dict[' '] = ' '
    #A translation table is created using the str.maketrans() method, which takes the clean_dict dictionary as input
    #and returns a translation table that can be used with the translate() method to remove punctuation from the text.
    table = str.maketrans(clean_dict)
    text_1 = text.translate(
        table
    )  #he translate() method is called on the input text using the translation table to remove all punctuation characters.
    text_Array = text_1.split()

    """
    A list comprehension is used to remove any words that are entirely composed of digits (isdigit()) or that have a length less than or equal to 3 characters.
    The remaining words are joined back together into a string with spaces between them using the join() method.
    The resulting cleaned text is converted to lowercase using the lower() method.
    """
    text_2 = ' '.join([
        word for word in text_Array
        if (not word.isdigit() and (not word.isdigit() and len(word) > 3))
    ])

    return text_2.lower()

We will then be using a Natural Language Toolkit (NLTK) library to remove stopwords from text data.

Stopwords are words that occur frequently in a language but do not carry much meaning, such as "a", "an", "the", "in", "of", etc.

The first two lines of the code import the stopwords module from NLTK and create a variable stop_words that contains a list of English stopwords.

Next, a function remove_stopwords is defined that takes a single argument text, which is a string containing text data.

The function splits the input text into an array of words using the split() method and then uses a list comprehension to remove any words that appear in the stop_words list.

The filtered words are then joined back together into a string using the join() method and returned.

Finally, the apply() method is used to apply the remove_stopwords function to every row in the 'ABSTRACT' column of the train_df DataFrame.

This removes the stopwords from the text data in each row and updates the 'ABSTRACT' column in-place with the cleaned text.

In [5]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')


#We then create a function to remove the stopwords in our text.
def remove_stopwords(text):
    text_Array = text.split(' ')
    remove_words = " ".join([i for i in text_Array if i not in stop_words])
    return remove_words


#And here we will apply the remove_stopwords function. This will remove the stopwords from our dataset's text
train_df['ABSTRACT'] = train_df['ABSTRACT'].apply(remove_stopwords)


We will use the SpaCy library to perform lemmatization on a list of input texts. Lemmatization is the process of reducing words to their base or dictionary form, which can be useful for standardizing text data and reducing noise in natural language processing tasks. However before beginning to use the Spacy library, you must first install the required tools to begin using the Spacy library. Run the following commands in the terminal to install the required tools.
````
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
````



In [6]:
'''
The first line of the code loads the 'en_core_web_md' SpaCy model, 
which is a medium-sized English language model that includes word vectors and supports part-of-speech tagging, 
named entity recognition, and dependency parsing. The 'parser' and 'ner' components are disabled using the disable parameter, 
which speeds up the processing time since these components are not needed for lemmatization.
'''
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *    

 
def lemmatization(texts,allowed_postags=['VERB', 'ADV','ADJ']): 
      nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
      output = []
      for sent in texts:
            doc = nlp(sent) 
            output.append([token.lemma_ for token in doc if token.pos_ in allowed_postags ])
      return output





In [7]:
train_df['ABSTRACT'].dtypes

dtype('O')

In [8]:
train_df['ABSTRACT'].isnull()

0        False
1        False
2        False
3        False
4        False
         ...  
20967    False
20968    False
20969    False
20970    False
20971    False
Name: ABSTRACT, Length: 20972, dtype: bool

In [9]:

text_list=train_df['ABSTRACT'].tolist()
print(text_list[1])
tokenized_reviews = lemmatization(text_list)
print(tokenized_reviews[1])

  Rotation invariance translation invariance great values image
recognition tasks. In paper, bring new architecture convolutional
neural network (CNN) named cyclic convolutional layer achieve rotation
invariance 2-D symbol recognition. We also get position and
orientation 2-D symbol network achieve detection purpose for
multiple non-overlap target. Last least, architecture achieve
one-shot learning cases using invariance.

['great', 'bring', 'new', 'convolutional', 'neural', 'name', 'cyclic', 'convolutional', 'achieve', 'd', 'also', 'get', 'd', 'achieve', 'multiple', 'non', '-', 'overlap', 'last', 'least', 'achieve', 'use']


In [46]:
dictionary = corpora.Dictionary(tokenized_reviews)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_reviews]
bow_corpus[4310]

[(11, 3),
 (45, 1),
 (125, 1),
 (126, 1),
 (170, 1),
 (234, 1),
 (248, 1),
 (335, 1),
 (416, 1),
 (467, 1),
 (554, 2),
 (683, 1),
 (758, 1),
 (806, 1),
 (817, 3),
 (866, 1),
 (1083, 1),
 (1232, 1),
 (1346, 3),
 (1572, 3),
 (1667, 1),
 (1879, 6)]

In [29]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 -
1 aim
2 allow
3 also
4 analyse
5 analyze
6 associate
7 binary
8 compare
9 compose
10 cortical


In [30]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

In [31]:
bow_doc_4310 = bow_corpus[4310]

for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(
        bow_doc_4310[i][0], dictionary[bow_doc_4310[i][0]],
        bow_doc_4310[i][1]))


Word 11 ("detect") appears 3 time.
Word 45 ("reduce") appears 1 time.
Word 125 ("finally") appears 1 time.
Word 126 ("interest") appears 1 time.
Word 170 ("metallic") appears 1 time.
Word 234 ("main") appears 1 time.
Word 248 ("utilize") appears 1 time.
Word 335 ("giant") appears 1 time.
Word 416 ("adequate") appears 1 time.
Word 467 ("check") appears 1 time.
Word 554 ("geometric") appears 2 time.
Word 683 ("create") appears 1 time.
Word 758 ("odd") appears 1 time.
Word 806 ("possible") appears 1 time.
Word 817 ("internal") appears 3 time.
Word 866 ("begin") appears 1 time.
Word 1083 ("logarithmic") appears 1 time.
Word 1232 ("fine") appears 1 time.
Word 1346 ("employ") appears 3 time.
Word 1572 ("categorical") appears 3 time.
Word 1667 ("stimulate") appears 1 time.
Word 1879 ("probe") appears 6 time.


In [32]:
train_df['ABSTRACT'].shape

(20972,)

In [33]:
train_df['ABSTRACT'].size

20972

In [47]:

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=bow_corpus,
                num_topics=5,
                id2word=dictionary,
                random_state=100,
                chunksize=1000,
                passes=50,
                iterations=100,
                update_every=1, alpha='auto')


In [45]:
lda_model_multi = gensim.models.LdaMulticore(corpus = bow_corpus,
                                       num_topics=5,
                                       id2word=dictionary,
                                       workers=2,
                                       random_state=100,
                                       chunksize=1000,
                                       passes=50,
                                       iterations=100)


KeyboardInterrupt: 

In [None]:
lda_model.print_topics()

[(0,
  '0.023*"give" + 0.022*"prove" + 0.020*"show" + 0.014*"also" + 0.010*"define" + 0.009*"set" + 0.008*"obtain" + 0.008*"study" + 0.008*"finite" + 0.007*"positive"'),
 (1,
  '0.027*"use" + 0.018*"propose" + 0.018*"base" + 0.014*"learn" + 0.010*"different" + 0.009*"show" + 0.008*"present" + 0.008*"new" + 0.008*"neural" + 0.008*"deep"'),
 (2,
  '0.019*"use" + 0.018*"-" + 0.018*"show" + 0.013*"propose" + 0.012*"consider" + 0.010*"provide" + 0.010*"base" + 0.010*"non" + 0.009*"well" + 0.009*"also"'),
 (3,
  '0.036*"-" + 0.022*"non" + 0.018*"dimensional" + 0.017*"topological" + 0.013*"critical" + 0.012*"couple" + 0.010*"spatial" + 0.009*"boundary" + 0.009*"nonlinear" + 0.009*"algebra"'),
 (4,
  '0.015*"find" + 0.014*"high" + 0.012*"magnetic" + 0.011*"use" + 0.011*"low" + 0.010*"show" + 0.009*"large" + 0.009*"observe" + 0.007*"also" + 0.007*"optical"')]

In [None]:
lda_model_multi.print_topics()

[(0,
  '0.024*"give" + 0.020*"show" + 0.020*"prove" + 0.014*"also" + 0.012*"set" + 0.010*"define" + 0.008*"use" + 0.007*"study" + 0.007*"bound" + 0.007*"new"'),
 (1,
  '0.025*"use" + 0.017*"propose" + 0.016*"base" + 0.014*"learn" + 0.010*"show" + 0.009*"neural" + 0.009*"different" + 0.009*"deep" + 0.008*"present" + 0.007*"new"'),
 (2,
  '0.019*"propose" + 0.019*"use" + 0.014*"show" + 0.014*"-" + 0.013*"base" + 0.011*"optimal" + 0.009*"provide" + 0.008*"random" + 0.008*"well" + 0.007*"also"'),
 (3,
  '0.026*"-" + 0.018*"non" + 0.013*"dimensional" + 0.012*"show" + 0.010*"use" + 0.010*"nonlinear" + 0.009*"consider" + 0.009*"obtain" + 0.009*"boundary" + 0.009*"study"'),
 (4,
  '0.014*"use" + 0.012*"high" + 0.012*"find" + 0.010*"show" + 0.009*"low" + 0.009*"magnetic" + 0.009*"-" + 0.008*"large" + 0.007*"observe" + 0.007*"present"')]

In [None]:
for index, score in sorted(lda_model[bow_corpus[4310]],
                           key=lambda tup: -1 * tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score,
                                             lda_model.print_topic(index, 10)))



Score: 0.3617786169052124	 
Topic: 0.023*"give" + 0.022*"prove" + 0.020*"show" + 0.014*"also" + 0.010*"define" + 0.009*"set" + 0.008*"obtain" + 0.008*"study" + 0.008*"finite" + 0.007*"positive"

Score: 0.2938082814216614	 
Topic: 0.027*"use" + 0.018*"propose" + 0.018*"base" + 0.014*"learn" + 0.010*"different" + 0.009*"show" + 0.008*"present" + 0.008*"new" + 0.008*"neural" + 0.008*"deep"

Score: 0.20413623750209808	 
Topic: 0.015*"find" + 0.014*"high" + 0.012*"magnetic" + 0.011*"use" + 0.011*"low" + 0.010*"show" + 0.009*"large" + 0.009*"observe" + 0.007*"also" + 0.007*"optical"

Score: 0.11722999066114426	 
Topic: 0.036*"-" + 0.022*"non" + 0.018*"dimensional" + 0.017*"topological" + 0.013*"critical" + 0.012*"couple" + 0.010*"spatial" + 0.009*"boundary" + 0.009*"nonlinear" + 0.009*"algebra"

Score: 0.023046845570206642	 
Topic: 0.019*"use" + 0.018*"-" + 0.018*"show" + 0.013*"propose" + 0.012*"consider" + 0.010*"provide" + 0.010*"base" + 0.010*"non" + 0.009*"well" + 0.009*"also"


In [None]:
for index, score in sorted(lda_model_multi[bow_corpus[4310]],
                           key=lambda tup: -1 * tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score,
                                             lda_model_multi.print_topic(index, 10)))


Score: 0.6847151517868042	 
Topic: 0.014*"use" + 0.012*"high" + 0.012*"find" + 0.010*"show" + 0.009*"low" + 0.009*"magnetic" + 0.009*"-" + 0.008*"large" + 0.007*"observe" + 0.007*"present"

Score: 0.2423679679632187	 
Topic: 0.025*"use" + 0.017*"propose" + 0.016*"base" + 0.014*"learn" + 0.010*"show" + 0.009*"neural" + 0.009*"different" + 0.009*"deep" + 0.008*"present" + 0.007*"new"

Score: 0.061950888484716415	 
Topic: 0.024*"give" + 0.020*"show" + 0.020*"prove" + 0.014*"also" + 0.012*"set" + 0.010*"define" + 0.008*"use" + 0.007*"study" + 0.007*"bound" + 0.007*"new"


In [None]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, bow_corpus, dictionary)
vis

  default_term_info = default_term_info.sort_values(


In [None]:
pyLDAvis.save_html(vis, 'lda_model.html')

In [None]:
pyLDAvis.enable_notebook()
vis2 = gensimvis.prepare(lda_model_multi, bow_corpus, dictionary)
vis2

  default_term_info = default_term_info.sort_values(


In [None]:
pyLDAvis.save_html(vis2, 'lda_model_multi.html')

In [None]:
print('\nPerplexity: ', lda_model.log_perplexity(bow_corpus,total_docs=80000))
from gensim.models.coherencemodel import CoherenceModel
coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_reviews, dictionary=dictionary , coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -7.197719773848608

Coherence Score:  0.4277841211568861


In [None]:
print('\nPerplexity: ', lda_model_multi.log_perplexity(bow_corpus,total_docs=80000))
from gensim.models.coherencemodel import CoherenceModel
coherence_model_lda_multi = CoherenceModel(model=lda_model_multi, texts=tokenized_reviews, dictionary=dictionary , coherence='c_v')
coherence_lda_multi = coherence_model_lda_multi.get_coherence()
print('\nCoherence Score: ', coherence_lda_multi)


Perplexity:  -7.154807698518829

Coherence Score:  0.4128407170282179


In [None]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.023*"give" + 0.022*"prove" + 0.020*"show" + 0.014*"also" + 0.010*"define" + 0.009*"set" + 0.008*"obtain" + 0.008*"study" + 0.008*"finite" + 0.007*"positive"
Topic: 1 
Words: 0.027*"use" + 0.018*"propose" + 0.018*"base" + 0.014*"learn" + 0.010*"different" + 0.009*"show" + 0.008*"present" + 0.008*"new" + 0.008*"neural" + 0.008*"deep"
Topic: 2 
Words: 0.019*"use" + 0.018*"-" + 0.018*"show" + 0.013*"propose" + 0.012*"consider" + 0.010*"provide" + 0.010*"base" + 0.010*"non" + 0.009*"well" + 0.009*"also"
Topic: 3 
Words: 0.036*"-" + 0.022*"non" + 0.018*"dimensional" + 0.017*"topological" + 0.013*"critical" + 0.012*"couple" + 0.010*"spatial" + 0.009*"boundary" + 0.009*"nonlinear" + 0.009*"algebra"
Topic: 4 
Words: 0.015*"find" + 0.014*"high" + 0.012*"magnetic" + 0.011*"use" + 0.011*"low" + 0.010*"show" + 0.009*"large" + 0.009*"observe" + 0.007*"also" + 0.007*"optical"


In [None]:
for idx, topic in lda_model_multi.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.024*"give" + 0.020*"show" + 0.020*"prove" + 0.014*"also" + 0.012*"set" + 0.010*"define" + 0.008*"use" + 0.007*"study" + 0.007*"bound" + 0.007*"new"
Topic: 1 
Words: 0.025*"use" + 0.017*"propose" + 0.016*"base" + 0.014*"learn" + 0.010*"show" + 0.009*"neural" + 0.009*"different" + 0.009*"deep" + 0.008*"present" + 0.007*"new"
Topic: 2 
Words: 0.019*"propose" + 0.019*"use" + 0.014*"show" + 0.014*"-" + 0.013*"base" + 0.011*"optimal" + 0.009*"provide" + 0.008*"random" + 0.008*"well" + 0.007*"also"
Topic: 3 
Words: 0.026*"-" + 0.018*"non" + 0.013*"dimensional" + 0.012*"show" + 0.010*"use" + 0.010*"nonlinear" + 0.009*"consider" + 0.009*"obtain" + 0.009*"boundary" + 0.009*"study"
Topic: 4 
Words: 0.014*"use" + 0.012*"high" + 0.012*"find" + 0.010*"show" + 0.009*"low" + 0.009*"magnetic" + 0.009*"-" + 0.008*"large" + 0.007*"observe" + 0.007*"present"


In [None]:

lda_model.save('Model/lda_model.model')


In [None]:
test_df = pd.read_csv('test.csv')

In [None]:
user_list = test_df['ABSTRACT']
user_list.dropna()
user_list = user_list.apply(remove_stopwords)
user_list.head(10)

0      We present novel understandings Gamma-Poisso...
1      Meteorites contain minerals Solar System ast...
2      Frame aggregation mechanism multiple frames ...
3      Milky Way open clusters diverse terms age, c...
4      Proving cryptographic protocol correct secre...
5      This paper proposes regularized pairwise dif...
6      A central issue theory extreme values focuse...
7      Astrophysics cosmology rich data. The advent...
8      A number recent works proposed techniques en...
9      We use hydrodynamical galaxy formation simul...
Name: ABSTRACT, dtype: object

In [None]:
from matplotlib import pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import matplotlib.colors as mcolors

cols = [color for name, color in mcolors.TABLEAU_COLORS.items()
        ]  # more colors: 'mcolors.XKCD_COLORS'

cloud = WordCloud(stopwords=stop_words,
                  background_color='white',
                  width=2500,
                  height=1800,
                  max_words=10,
                  colormap='tab10',
                  color_func=lambda *args, **kwargs: cols[i],
                  prefer_horizontal=1.0)

topics = lda_model.show_topics(formatted=False)

fig, axes = plt.subplots(2, 2, figsize=(10, 10), sharex=True, sharey=True)

for i, ax in enumerate(axes.flatten()):
    fig.add_subplot(ax)
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    plt.gca().imshow(cloud)
    plt.gca().set_title('Topic ' + str(i), fontdict=dict(size=16))
    plt.gca().axis('off')

plt.subplots_adjust(wspace=0, hspace=0)
plt.axis('off')
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()

In [None]:
num_topics = lda_model.num_topics
    for i, topic in enumerate(topics):
        if i >= num_topics:
            break

        topic_id = topic[0]
        topic_prob = topic[1]
        if topic_id < num_topics:
        
            print(f'Topic {topic_id}: {topic_prob:.3f} - {lda_model.print_topic(topic_id)}')
