# LDA introduction.

Natural language processing (NLP) employs the probabilistic topic modeling method known as Latent Dirichlet Allocation (LDA).

By identifying the topics that most accurately represent each theme, LDA aims to reveal the hidden thematic structure of a group of documents.

LDA can be used to find the topics that are most pertinent to computer science, mathematics, and physics in the case of abstracts for STEM subjects.

We begin by preprocessing the text data for this purpose in order to get rid of stop words, punctuation, and other extraneous details.

In order to represent the frequency of each term in each document, we tokenize the text to separate it into individual words or phrases.

The topics that best explain the variation in the data can be found using LDA once we have the document-term matrix.

Until the model converges on a stable solution, this entails repeatedly assigning each word in each document to a topic and adjusting the topic probabilities.

The LDA model produces a list of topics, each of which is represented by a distribution over the vocabulary words.

The topics can then be understood by looking at the most frequently occurring words in each topic and using domain knowledge to assign them to pertinent STEM subject areas.

We previously conduct a similar procedure using TF-IDF, and this model will work in tandom with the website that we have created for the users to input their abstracts.

We first begin by importing the libraries that will be using.

In [5]:
pip install pyLDAvis


^C
Note: you may need to restart the kernel to use updated packages.




In [6]:
import pandas as pd
import numpy as np
import re
import string
import spacy
import pickle

# libraries for visualization
import pyLDAvis
import pyLDAvis.lda_model
import matplotlib.pyplot as plt
import seaborn as sns
import gensim
from gensim import corpora
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis.gensim_models as gensimvis
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

You can use the following command line to import the dataset file in case youre using Google Colab.

In [8]:
from google.colab import files

uploaded = files.upload()

ModuleNotFoundError: No module named 'google.colab'

We will then import our data set as we did before.

In [9]:
train_df = pd.read_csv('abstracts.csv')
train_df.head()

Unnamed: 0,ID,TITLE,ABSTRACT
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...


## Data cleaning and preprocessing.
And afterwords, we will have our cleaning function. this function performs a series of common text preprocessing steps to remove noise and irrelevant information from the input text, which can improve the accuracy of natural language processing tasks. These steps include removing punctuation, removing words that are entirely composed of digits, and removing short words.



In [10]:
def clean_text(text):
    #The first line of the function creates a dictionary called clean_dict that maps each punctuation character in the string.punctuation string to an empty string.
    #This will be used to remove all punctuation characters from the text.
    clean_dict = {special_char: '' for special_char in string.punctuation}
    clean_dict[' '] = ' '
    #A translation table is created using the str.maketrans() method, which takes the clean_dict dictionary as input
    #and returns a translation table that can be used with the translate() method to remove punctuation from the text.
    table = str.maketrans(clean_dict)
    text_1 = text.translate(
        table
    )  #he translate() method is called on the input text using the translation table to remove all punctuation characters.
    text_Array = text_1.split()

    """
    A list comprehension is used to remove any words that are entirely composed of digits (isdigit()) or that have a length less than or equal to 3 characters.
    The remaining words are joined back together into a string with spaces between them using the join() method.
    The resulting cleaned text is converted to lowercase using the lower() method.
    """
    text_2 = ' '.join([
        word for word in text_Array
        if (not word.isdigit() and (not word.isdigit() and len(word) > 3))
    ])

    return text_2.lower()

We will then be using a Natural Language Toolkit (NLTK) library to remove stopwords from text data.

Stopwords are words that occur frequently in a language but do not carry much meaning, such as "a", "an", "the", "in", "of", etc.

The first two lines of the code import the stopwords module from NLTK and create a variable stop_words that contains a list of English stopwords.

Next, a function remove_stopwords is defined that takes a single argument text, which is a string containing text data.

The function splits the input text into an array of words using the split() method and then uses a list comprehension to remove any words that appear in the stop_words list.

The filtered words are then joined back together into a string using the join() method and returned.

Finally, the apply() method is used to apply the remove_stopwords function to every row in the 'ABSTRACT' column of the train_df DataFrame.

This removes the stopwords from the text data in each row and updates the 'ABSTRACT' column in-place with the cleaned text.

In [11]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Rayni\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rayni\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Rayni\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [12]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')


#We then create a function to remove the stopwords in our text.
def remove_stopwords(text):
    text_Array = text.split(' ')
    remove_words = " ".join([i for i in text_Array if i not in stop_words])
    return remove_words


#And here we will apply the remove_stopwords function. This will remove the stopwords from our dataset's text
train_df['ABSTRACT'] = train_df['ABSTRACT'].apply(remove_stopwords)


In [13]:


tf_vectorizer = CountVectorizer(analyzer=clean_text,
                                strip_accents='unicode',
                                stop_words='english',
                                lowercase=True,
                                token_pattern=r'\b[a-zA-Z]{3,}\b',
                                max_df=0.5,
                                min_df=10)
dtm_tf = tf_vectorizer.fit_transform(train_df['ABSTRACT'])
dtm_tf.shape



(20972, 27)

In [14]:
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(train_df['ABSTRACT'])
dtm_tfidf.shape



(20972, 27)

In [15]:
lda_tfidf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tfidf.fit(dtm_tfidf)

In [16]:
pyLDAvis.lda_model.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)

  default_term_info = default_term_info.sort_values(


PreparedData(topic_coordinates=              x         y  topics  cluster       Freq
topic                                                
10     0.285814 -0.160587       1        1  20.853575
4     -0.124704  0.118689       2        1   7.237380
14    -0.167533 -0.049187       3        1   5.990865
15    -0.207415  0.016300       4        1   5.926171
13    -0.200304  0.036943       5        1   5.795898
19    -0.201335  0.013156       6        1   5.748620
2     -0.199290  0.033195       7        1   5.282359
11    -0.010216  0.228120       8        1   5.231331
6     -0.154773  0.055658       9        1   4.982194
7     -0.201975 -0.137921      10        1   4.452306
3     -0.190784 -0.049468      11        1   4.323611
16     0.194800 -0.006614      12        1   3.532901
5     -0.150529 -0.028978      13        1   3.309932
12     0.266893  0.270302      14        1   3.013260
9      0.266621 -0.052828      15        1   2.555161
8      0.244879 -0.029036      16        1   2.5189

We will use the SpaCy library to perform lemmatization on a list of input texts. Lemmatization is the process of reducing words to their base or dictionary form, which can be useful for standardizing text data and reducing noise in natural language processing tasks. However before beginning to use the Spacy library, you must first install the required tools to begin using the Spacy library. Run the following commands in the terminal to install the required tools.
````
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
````



In [17]:
'''
The first line of the code loads the 'en_core_web_md' SpaCy model, 
which is a medium-sized English language model that includes word vectors and supports part-of-speech tagging, 
named entity recognition, and dependency parsing. The 'parser' and 'ner' components are disabled using the disable parameter, 
which speeds up the processing time since these components are not needed for lemmatization.
'''

nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])


def lemmatization(texts, allowed_postags=['VERB', 'ADV', 'ADJ']):
    output = []
    for sent in texts:
        doc = nlp(sent)
        output.append(
            [token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return output


In [18]:

text_list=train_df['ABSTRACT'].tolist()
print(text_list[1])
tokenized_reviews = lemmatization(text_list)
print(tokenized_reviews[1])

  Rotation invariance translation invariance great values image
recognition tasks. In paper, bring new architecture convolutional
neural network (CNN) named cyclic convolutional layer achieve rotation
invariance 2-D symbol recognition. We also get position and
orientation 2-D symbol network achieve detection purpose for
multiple non-overlap target. Last least, architecture achieve
one-shot learning cases using invariance.

['great', 'bring', 'new', 'convolutional', 'neural', 'name', 'cyclic', 'convolutional', 'achieve', 'd', 'also', 'get', 'd', 'achieve', 'multiple', 'non', '-', 'overlap', 'last', 'least', 'achieve', 'use']


In [19]:
dictionary = corpora.Dictionary(tokenized_reviews)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in tokenized_reviews]

In [20]:
LDA = gensim.models.ldamodel.LdaModel

# Build LDA model
lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=4, random_state=100,
                chunksize=1000, passes=50,iterations=100)

In [21]:
lda_model.print_topics()

[(0,
  '0.019*"give" + 0.018*"show" + 0.017*"prove" + 0.013*"also" + 0.012*"-" + 0.009*"non" + 0.009*"obtain" + 0.008*"study" + 0.008*"set" + 0.008*"use"'),
 (1,
  '0.024*"use" + 0.015*"propose" + 0.015*"base" + 0.014*"learn" + 0.009*"show" + 0.008*"neural" + 0.008*"different" + 0.008*"deep" + 0.008*"present" + 0.007*"new"'),
 (2,
  '0.020*"propose" + 0.019*"use" + 0.015*"show" + 0.015*"-" + 0.013*"base" + 0.010*"optimal" + 0.009*"random" + 0.009*"provide" + 0.008*"consider" + 0.008*"well"'),
 (3,
  '0.013*"use" + 0.011*"find" + 0.011*"high" + 0.011*"-" + 0.010*"show" + 0.008*"low" + 0.008*"magnetic" + 0.008*"large" + 0.007*"present" + 0.007*"observe"')]

In [22]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, doc_term_matrix, dictionary)
vis

  default_term_info = default_term_info.sort_values(


In [23]:
pyLDAvis.save_html(vis, 'lda_model.html')

In [24]:
print('\nPerplexity: ', lda_model.log_perplexity(doc_term_matrix,total_docs=80000))
from gensim.models.coherencemodel import CoherenceModel
coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_reviews, dictionary=dictionary , coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -7.158010503543251

Coherence Score:  0.41018808458104605


In [25]:
import pickle
pickle.dump(lda_model, open('model.pkl', 'wb'))
lda_model = pickle.load(open('model.pkl', 'rb'))