# Topic modelling with Gensim: the LDA algorithm

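LDA (Latent Dirichlet Allocation) is a generative statistical model that explains sets of observations through unobserved groups, which account for why some parts of the data are similar. In topic modelling, the observations are the words of the documents and the unobserved groups are the topics.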
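The generative view can be made concrete with a small simulation. The sketch below is only an illustration of the idea, not part of the analysis pipeline: the vocabulary, the number of topics and the Dirichlet parameters are all hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and sizes, chosen only for illustration.
vocab = ["price", "index", "trade", "export", "census", "household"]
n_topics = 2
doc_length = 8

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=n_topics)

# Each document has its own mixture of topics.
doc_topic = rng.dirichlet(alpha=np.full(n_topics, 0.5))

# Generate the document word by word: draw a topic, then a word from that topic.
words = []
for _ in range(doc_length):
    z = rng.choice(n_topics, p=doc_topic)
    w = rng.choice(len(vocab), p=topic_word[z])
    words.append(vocab[w])

print("topic mixture:", np.round(doc_topic, 3))
print("generated document:", words)
```

Fitting an LDA model goes in the opposite direction: given only the words, it estimates the topic-word distributions and the per-document topic mixtures.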

### 1. Loading of the required libraries and vocabularies.
***



In [None]:
import gensim
import pandas as pd

import spacy
import sys

## Run this chunk once with the command below active. Then comment it out and run the notebook.
!{sys.executable} -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

import nltk
import re
import os

from gensim import corpora

from nltk import sent_tokenize
from nltk.corpus import stopwords

##nltk.download('stopwords')
##stop_words = stopwords.words('english')

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

from pprint import pprint



### 2. Pre-processing input data.
***

* Read two files exported from the database, which contain the definitions and the titles of the SE Glossary articles.
* In later versions, the **corresponding tables will be directly exported from the database**.
* Merge by _id_ and discard records with duplicate or missing titles and/or definitions.
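The effect of the merge and clean-up steps can be seen on a toy example. The two small dataframes below are hypothetical stand-ins for the exported files; only the column names _id_, _title_ and _definition_ are taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the two exported files.
concepts = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "definition": ["A price index.", "A price index.", None, "A population count."],
})
links = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["Price index", "CPI", "Empty entry", "Census"],
})

merged = pd.merge(concepts, links, on="id")              # inner join on the shared key
merged = merged.dropna(subset=["definition", "title"])   # drop records with missing values
merged = merged.drop_duplicates(subset=["definition"])   # keep the first of identical definitions
merged = merged.drop_duplicates(subset=["title"])        # keep the first of identical titles
merged = merged.reset_index(drop=True)
print(merged)
```

Record 2 is dropped because its definition repeats that of record 1, and record 3 is dropped because its definition is missing.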


In [None]:
dat1= pd.read_csv("ESTAT_dat_concepts_2021_04_08.csv",sep=";")
dat2= pd.read_csv("ESTAT_dat_link_info_2021_04_08.csv",sep=";")
Gloss_concepts = pd.merge(dat1,dat2,on=['id'])
del(dat1,dat2)

Gloss_concepts = Gloss_concepts[['id','title','definition']]

Gloss_concepts = Gloss_concepts.drop_duplicates(subset=["definition"])
Gloss_concepts = Gloss_concepts.dropna(axis=0,subset=["definition"])
Gloss_concepts = Gloss_concepts.drop_duplicates(subset=["title"])
Gloss_concepts = Gloss_concepts.dropna(axis=0,subset=["title"])

Gloss_concepts.reset_index(drop=True, inplace=True)
Gloss_concepts

### 3. Pre-processing input data (cont).
***

Next we tokenize the texts (the definitions in the articles), delete stop words, keep only tokens with a minimum length of 5 and apply a simple pre-processing step (convert to lowercase, drop accents).

The result, _texts_, is a list with 1286 elements corresponding to the records in the dataframe _Gloss_concepts_.
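To make the individual cleaning steps visible, here is a standard-library approximation of them. It is a sketch under stated assumptions, not Gensim's implementation: the stop-word list is a tiny hypothetical one and the helper names `strip_accents` and `clean` are invented for this example.

```python
import re
import unicodedata

# Tiny hypothetical stop-word list; Gensim ships a much longer one.
STOP_WORDS = {"which", "their", "there", "about", "these"}

def strip_accents(text):
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def clean(text, min_len=5):
    # Lowercase, remove accents, split into alphabetic tokens,
    # then keep tokens that are long enough and are not stop words.
    text = strip_accents(text.lower())
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if len(t) >= min_len and t not in STOP_WORDS]

print(clean("There are données about household expenditure, which vary by region."))
# ['donnees', 'household', 'expenditure', 'region']
```

In this sketch the text is lowercased before the stop-word check, so a capitalised stop word at the start of a sentence is also removed.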

In [None]:
#Tokenize and clean up the texts.
from gensim.parsing.preprocessing import remove_stopwords
def sent_to_words(sentences):
    for sentence in sentences:
        sentence = remove_stopwords(sentence) ## remove stop words
        tokens = gensim.utils.tokenize(sentence)
        tokens = [token for token in tokens if len(token) >= 5] ## minimum length = 5
        ## lowercase the tokens; deacc=True strips accent marks (punctuation is already dropped by the tokenizer)
        yield gensim.utils.simple_preprocess(" ".join(tokens), deacc=True)
        
texts = list(sent_to_words(Gloss_concepts['definition']))
print('\nFirst 10 texts: \n',texts[:10])
print('\nTotal texts: ',len(texts),'\n')


### 4. Creation of corpus and term frequencies.
***

Next we create:
* a dictionary named _id2word_, built from _texts_. It has 7258 unique tokens.
* a mapping named _corpus_, which turns each text into a list of tuples (word id, frequency in that text).
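The two structures can be mimicked with the standard library to show what they contain. This is only a conceptual sketch on hypothetical toy texts; the ids that Gensim assigns to the same tokens may differ from the ones produced here.

```python
from collections import Counter

# Hypothetical tokenized texts.
toy_texts = [
    ["price", "index", "price"],
    ["trade", "index"],
]

# Assign an integer id to each distinct token, in order of first appearance.
token2id = {}
for text in toy_texts:
    for token in text:
        token2id.setdefault(token, len(token2id))

# Represent every text as sorted (token id, count) pairs: a bag of words.
toy_corpus = [sorted(Counter(token2id[t] for t in text).items()) for text in toy_texts]

print(token2id)    # {'price': 0, 'index': 1, 'trade': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is lost in this representation; only the counts per text remain, which is all that LDA uses.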


In [None]:
#Create Dictionary
id2word = corpora.Dictionary(texts) #Gensim assigns a unique integer id to each distinct token in the texts.
print(id2word,'\n')

#Create Corpus: each text becomes a list of (word_id, word_frequency) tuples.
corpus = [id2word.doc2bow(text) for text in texts]

print('First 10 texts:\n')
print(corpus[:10])
print('\nTotal texts: ',len(corpus))

#Human readable format of corpus (term-frequency)
#[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

### 5. Dominant topic in each document.
***
At this stage we build the _lda_model_, from which we extract the dominant topic of each document together with the weight of that topic and its keywords. We define a dataframe _df_topic_sents_keywords_ to store these dominant topics.

In function _format_topics_sentences()_ below, _ldamodel[corpus]_ yields one element per text. Because the model is built with _per_word_topics=True_, each element is a tuple whose first item is the list of tuples (topic, contribution). We sort this list by descending contribution to find the dominant topic and then retrieve, for this topic, the list _wp_ of tuples (word, probability) for the most probable words, using the function _ldamodel.show_topic()_. We join these words into a comma-separated string and put the result in column 'Topic_Keywords' of the dataframe.
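The selection of the dominant topic for a single document can be sketched in isolation. The (topic, contribution) pairs and the (word, probability) pairs below are hypothetical values standing in for the model output.

```python
# Hypothetical (topic, contribution) pairs for one document.
doc_topics = [(3, 0.12), (7, 0.61), (11, 0.27)]

# Sort by descending contribution; the first entry is the dominant topic.
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
topic_num, prop_topic = ranked[0]

# Hypothetical (word, probability) pairs, in the shape returned by show_topic().
topic_words = [("price", 0.08), ("index", 0.05), ("consumer", 0.03)]
topic_keywords = ", ".join(word for word, prob in topic_words)

print(topic_num, prop_topic, topic_keywords)  # 7 0.61 price, index, consumer
```

When only the top entry is needed, `max(doc_topics, key=lambda pair: pair[1])` gives the same pair without sorting the whole list.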


In [None]:
#What is the Dominant topic and its percentage contribution in each document.

def format_topics_sentences(ldamodel=None, corpus=corpus):
    # Collect one row per document (DataFrame.append was removed in pandas 2.0)
    rows = []

    # Get main topic in each document
    for i, row_list in enumerate(ldamodel[corpus]):
        # With per_word_topics=True each element is a tuple; its first item is the (topic, contribution) list
        row = row_list[0] if ldamodel.per_word_topics else row_list
        row = sorted(row, key=lambda x: (x[1]), reverse=True) ## sort the nested list by descending contribution
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        topic_num, prop_topic = row[0]
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original tokenized text and title to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    sent_topics_df = pd.concat([sent_topics_df, Gloss_concepts['id'],Gloss_concepts['title']], axis=1)
    return(sent_topics_df)

In [None]:
#Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [None]:
df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus)
df_topic_sents_keywords.rename(columns = {0:'Text tokenized definition'}, inplace = True)
df_topic_sents_keywords.rename(columns = {'id':'Text id','title':'Text title'}, inplace = True)
df_topic_sents_keywords = df_topic_sents_keywords[['Text id','Text title','Text tokenized definition','Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']]
df_topic_sents_keywords['Dominant_Topic'] = df_topic_sents_keywords['Dominant_Topic'].astype(int)
df_topic_sents_keywords


### 6. Most representative document for each topic.
***

To complete the analysis, we look for the document that best represents each topic, that is, the document in which the topic has the highest contribution. The code below keeps this single most representative document per topic, together with its id and title, and stores the result in the dataframe _sent_topics_sorteddf_mallet_, shown below along with the topic keywords.
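The same selection can be written without an explicit loop. The following is a minimal sketch on a hypothetical toy dataframe that reuses the column names of the notebook; it is an alternative formulation, not the code used in this analysis.

```python
import pandas as pd

# Hypothetical dominant topics and contributions for four documents.
toy = pd.DataFrame({
    "Text id": [10, 11, 12, 13],
    "Dominant_Topic": [0, 0, 1, 1],
    "Perc_Contribution": [0.40, 0.75, 0.90, 0.55],
})

# Sort by contribution, then keep the first (highest) row of every topic.
best = (toy.sort_values("Perc_Contribution", ascending=False)
           .groupby("Dominant_Topic")
           .head(1)
           .sort_values("Dominant_Topic")
           .reset_index(drop=True))
print(best)
```

Replacing `head(1)` with `head(5)` would keep the five most representative documents per topic instead of one.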

In [None]:
# Keep the most representative document (highest contribution) under each topic
sent_topics_sorteddf_mallet = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords[['Text id','Text title','Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']].groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet, 
                                             grp.sort_values(['Perc_Contribution'], ascending=False).head(1)], 
                                            axis=0)


In [None]:
#Find the most representative document for each topic
# Reset Index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Text id','Text title','Topic_Num', 'Topic_Perc_Contrib', 'Topic Keywords' ]
sent_topics_sorteddf_mallet = sent_topics_sorteddf_mallet[['Topic_Num', 'Topic_Perc_Contrib', 'Topic Keywords','Text id','Text title']]

# Show
sent_topics_sorteddf_mallet

### 7. The topics as a mix of keywords.
***

The above LDA model is built with 20 different topics, where each topic is a combination of keywords and each keyword contributes with a certain weight to the topic.
Finally, we want to understand the volume and distribution of the topics in order to judge how widely each one is discussed, so we define the dataframe _df_dominant_topics_ (computed in the topic distribution cell further below).
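The volume of each topic is simply the number of documents for which it is the dominant topic, and its share is that number divided by the total. A small sketch on a hypothetical series of dominant topics:

```python
import pandas as pd

# Hypothetical dominant topic of six documents.
dominant = pd.Series([2, 0, 2, 1, 2, 0], name="Dominant_Topic")

# Both columns are indexed by topic number, so they align row by row.
summary = pd.DataFrame({
    "Num_Documents": dominant.value_counts(),
    "Perc_Documents": dominant.value_counts(normalize=True).round(4),
}).sort_index()
print(summary)
```

Note that both columns share the topic number as index; mixing them with a per-document table would misalign the rows.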


In [None]:
#Print the keywords of the 20 topics

from pprint import pprint
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

### 8. Visualization of the topics.
***

To visualize the fitted LDA model we use the _pyLDAvis_ package. This is the Python port of the R package _LDAvis_; see the [LDAvis vignette](https://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf) for details and Chuang, Jason, Manning, Christopher D., and Heer, Jeffrey (2012). Termite: Visualization Techniques for Assessing Textual Topic Models, *Advanced Visual Interfaces*, for the theory behind the visualization algorithm. The paper is available [here](https://dl.acm.org/doi/pdf/10.1145/2254556.2254572?casa_token=q2BavKP415QAAAAA:MhcYHzz4PJpC7dNkkm12GL-ohQRUXBgumPJ9l1t_5n3M4qVE1kdDqKGfPmtnR7qbale_ukS-2nJs).



In [None]:
#Visualize the topics

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
vis

In [None]:
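#Topic distribution across documents

# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

# Topic Number and Keywords: one row per topic, indexed by topic number so that it aligns with the counts
topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']].drop_duplicates().set_index('Dominant_Topic')

# Concatenate Column wise (rows are aligned on the topic number)
df_dominant_topics = pd.concat([topic_num_keywords,
                                topic_counts.rename('Num_Documents'),
                                topic_contribution.rename('Perc_Documents')], axis=1).sort_index()

# Restore the topic number as a column and fix the column order
df_dominant_topics = df_dominant_topics.rename_axis('Dominant_Topic').reset_index()
df_dominant_topics = df_dominant_topics[['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']]

# Show
df_dominant_topics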

### 9. The model's coherence score.
***

This is based on the work in Röder, M., Both, A., & Hinneburg, A. (2015, February). Exploring the space of topic coherence measures. In *Proceedings of the eighth ACM international conference on Web search and data mining* (pp. 399-408), available [here](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf).
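The _c_v_ measure used below builds on normalized pointwise mutual information (NPMI) between the top words of a topic, which it combines with sliding-window counts and a cosine similarity step. The sketch below illustrates only the NPMI ingredient, with co-occurrence counted per document on hypothetical toy data; it is a simplification and does not reproduce the _c_v_ score.

```python
import math

# Hypothetical documents, each reduced to its set of tokens.
docs = [
    {"price", "index", "consumer"},
    {"price", "index"},
    {"trade", "export"},
    {"price", "trade"},
]

def npmi(w1, w2, docs):
    # Probabilities are estimated as the fraction of documents containing the word(s).
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0  # the words never co-occur
    if p12 == 1:
        return 1.0   # the words co-occur in every document
    pmi = math.log(p12 / (p1 * p2))
    return pmi / -math.log(p12)

print(npmi("price", "index", docs))   # positive: the words tend to appear together
print(npmi("index", "export", docs))  # -1.0: the words never appear together
```

Words that belong to a coherent topic should tend to co-occur, which gives NPMI values closer to 1.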

In [None]:
from gensim.models import CoherenceModel

coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

In [None]:
#Topics with probability of at least 0.3 in each document
docs_topics = lda_model.get_document_topics(corpus,minimum_probability=0.3)
res  = [d for d in docs_topics]
print(len(res))
res

In [None]:
print(len(lda_model[corpus]))
[a for a in lda_model[corpus]]