# Exercises Lecture 12: Topic Modeling (Gensim)

In this notebook, we use LDA to perform topic modeling on a corpus of Wikipedia articles from 16 categories:

Airports, Artists, Astronauts, Astronomical_objects, Building,City,Comics_characters, Companies,Foods, Monuments_and_memorials,Politicians,Sports_teams,Sportspeople, Transport, Universities_and_colleges, Written_communication..

The assignment involves the following steps:

* Preparing the data  
* Training an LDA model
* Interpreting the results of the LDA model
   - Printing out the topk relevant tokens for each topic
   - Computing coherence
   - Visualising the topic graph

Data: wkp_sorted.zip      

Python libraries
- sklearn.datasets to load data 
- pandas
- WordCloud
- gensim for topic modeling  

Cheat sheets
- clustering_cheat_sheet.ipynb   
- topic_modeling_cheat_sheet.ipynb
 

#### Installing dependent packages

In [None]:
# To be run only once
#!pip install gensim
#!pip install PyLDAvis
#!pip install spacy
#!python -m spacy download en_core_web_sm

In [1]:
from sklearn.datasets import load_files
import pandas as pd
import numpy as np
import os
import re
import nltk
import spacy
from nltk.corpus import stopwords
nlp = spacy.load("en_core_web_sm")

In [2]:
stopword = stopwords.words('english')

## Generating a word cloud

**Exercise 1** 

* Create a pandas dataframe containing a column for the text of each Wikipedia article included in  "wkp_sorted/". 
* Use sklearn.datasets load_files method (cf. clustering CS) 

In [3]:
os.chdir('/experiments/cours nlp/data science/lecture12/')
d = load_files("wkp_sorted/", encoding = "utf8")
df = pd.DataFrame(zip(d['data'], d['target'], d['filenames']),  columns=['Text','Target', 'Filenames'])
df.Filenames = df.Filenames.apply(os.path.basename)
df['Target_name'] = df.Target.apply(lambda x : d['target_names'][x])

In [4]:
df.head()

Unnamed: 0,Text,Target,Filenames,Target_name
0,The Cardiff Roller Collective (CRoC) are a rol...,11,Cardiff_Roller_Collective_Sports_teams.txt,Sports_teams
1,"""Go! Pack Go!"" is the fight song of the Green ...",11,Go!_You_Packers_Go!_Sports_teams.txt,Sports_teams
2,Al-Machriq (English translation: The East) was...,14,Al-Machriq_Universities_and_colleges.txt,Universities_and_colleges
3,Ajman International Airport (Arabic: مطار عجما...,0,Ajman_International_Airport_Airports.txt,Airports
4,Kapla is a construction set for children and a...,4,Kapla_Building.txt,Building


**Exercise 2:** Generate a word cloud (topic_modeling CS)

* The WordCloud method takes as input the corpus as a single string. 
* Use pandas str.cat method to concatenate the content of the "story_str" column into a single string

In [5]:
story_str = df['Text'].str.cat(sep = ' ')

## Pre-processing the data

**Exercise 3:** Preparing the corpus for topic modeling

Gensim topic modeling module takes as input a list of tokens.

 - Define a clean_up function which takes as input a list of texts and outputs the list of lemmas for tokens in the input which :
* are not stop words  (spacy CS)
* only contains characters (python_basic CS) 
* whose length is greater than 2
* whose spacy POS tag is not 'ADV','PRON','CCONJ','PUNCT','PART','DET','ADP'or 'SPACE'  (spacy CS)

- Apply this function to the 'text' column of the Wikipedia dataframe (cf. Ex. 1 and 2)   
_**Help**_ : use pandas apply method

In [6]:
pos = ['ADV','PRON','CCONJ','PUNCT','PART','DET','ADP','SPACE']
def clean_up(texts:str):
    text_ = " ".join([text.lower() for text in texts.split() if text not in stopword and \
                     text.isalpha() and len(text) > 2])
    texts = nlp(text_)
    l = [text.lemma_ for text in texts if text.pos_ not in pos]
    return l

In [7]:
clean_text = df['Text'].apply(clean_up)

## Learn a topic model

In [8]:
clean_text

0      [cardiff, roller, collective, roller, sport, l...
1      [pack, fight, song, green, bay, first, profess...
2      [journal, found, jesuit, chaldean, priest, lou...
3      [ajman, international, airport, مطار, عجمان, u...
4      [kapla, construction, set, child, set, consist...
                             ...                        
155    [air, route, traffic, control, center, one, un...
156    [suliman, yari, august, may, afghan, son, nek,...
157    [sapphire, stagg, fictional, character, appear...
158    [relocation, professional, sport, team, occur,...
159    [uncommanded, rotation, undesirable, character...
Name: Text, Length: 160, dtype: object

**Exercise 4:** Create a vocabulary for the lda model and convert your list of list of lemmas into a document-term matrix

* Use [Gensim dictionary method](https://radimrehurek.com/gensim/corpora/dictionary.html) to create a dictionary 
* Use Gensim doc2bow method (from Corpora module) to convert each synopsis to a list of integers

In [9]:
from gensim.corpora import Dictionary
dic = Dictionary(clean_text)

In [None]:
doc_token_matrix = [dic.doc2bow(text) for text in clean_text]

**Exercise 5:** Create an LDA model with 16 topics and apply it to your data

In [11]:
import gensim
lda_model = gensim.models.LdaMulticore(corpus=doc_token_matrix,
                                       id2word=dic,
                                       num_topics=10, 
                                       random_state=100,
                                       chunksize=100,
                                       passes=10,
                                       per_word_topics=True)

**Exercise 6:** Print out the keywords of the 16 topics (Airports, Artists, Astronauts, Astronomical_objects, Building,City,Comics_characters, Companies,Foods, Monuments_and_memorials,Politicians,Sports_teams,Sportspeople, Transport, Universities_and_colleges, Written_communication)


Each topic is a combination of keywords.

* Use `lda_model.print_topics()` to see the keywords for each topic and the weight of each keyword for that topic
* Retrain you LDA model with different numbers of topics and examine the top keywords to determine which number of topics is best
* Can you match the topics to the Wikipedia categories ?

In [12]:
print(lda_model.print_topics())

[(0, '0.018*"airspace" + 0.014*"area" + 0.012*"caste" + 0.011*"people" + 0.009*"class" + 0.007*"control" + 0.006*"route" + 0.006*"slave" + 0.006*"aircraft" + 0.006*"include"'), (1, '0.016*"shinto" + 0.013*"kami" + 0.011*"shrine" + 0.009*"japanese" + 0.008*"new" + 0.007*"know" + 0.006*"cannonball" + 0.006*"term" + 0.005*"use" + 0.005*"many"'), (2, '0.017*"airport" + 0.014*"burial" + 0.008*"natural" + 0.008*"international" + 0.006*"use" + 0.005*"new" + 0.005*"united" + 0.005*"ajman" + 0.005*"beacon" + 0.005*"green"'), (3, '0.014*"airport" + 0.012*"aldrin" + 0.011*"security" + 0.006*"passenger" + 0.006*"use" + 0.005*"stohlman" + 0.004*"isbn" + 0.004*"first" + 0.004*"leather" + 0.004*"become"'), (4, '0.010*"college" + 0.010*"program" + 0.009*"cement" + 0.009*"university" + 0.008*"student" + 0.006*"catherine" + 0.005*"school" + 0.005*"art" + 0.004*"one" + 0.004*"make"'), (5, '0.012*"nebula" + 0.011*"crab" + 0.011*"club" + 0.007*"move" + 0.006*"star" + 0.006*"use" + 0.005*"supernova" + 0.005

## Evaluate your model

**Exercise 7:** Compute Model Perplexity and Coherence Score

* A lower perplexity score indicates better generalization performance
* Coherence measures score a  topic by measuring the degree of semantic similarity between high scoring words in the topic.

1. `C_v` measure is based on a sliding window, one-set segmentation of the top words and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and the cosine similarity
2. `C_p` is based on a sliding window, one-preceding segmentation of the top words and the confirmation measure of Fitelson's coherence
3. `C_uci` measure is based on a sliding window and the pointwise mutual information (PMI) of all word pairs of the given top words
4. `C_umass` is based on document cooccurrence counts, a one-preceding segmentation and a logarithmic conditional probability as confirmation measure
5. `C_npmi` is an enhanced version of the C_uci coherence using the normalized pointwise mutual information (NPMI)
6. `C_a` is based on a context window, a pairwise comparison of the top words and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and the cosine similarit

In [13]:
from gensim.models import CoherenceModel
coherence_model_lda = CoherenceModel(model=lda_model, texts=clean_text, dictionary=dic, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Coherence Score:  0.3355186752565105


**Visualize the topic model using pyLDAvis (PROVIDED)**

In [15]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, doc_token_matrix, dic)
vis

  default_term_info = default_term_info.sort_values(


### Hyperparameter tuning (PROVIDED)

First, let's differentiate between model hyperparameters and model parameters :

- `Model hyperparameters` can be thought of as settings for a machine learning algorithm that are tuned by the data scientist before training. Examples would be the number of trees in the random forest, or in our case, number of topics K

- `Model parameters` can be thought of as what the model learns during training, such as the weights for each word in a given topic.

Now that we have the baseline coherence score for the default LDA model, let's perform a series of sensitivity tests to help determine the following model hyperparameters: 
- Number of Topics (K)
- Dirichlet hyperparameter alpha: Document-Topic Density
- Dirichlet hyperparameter beta: Word-Topic Density

We'll perform these tests in sequence, one parameter at a time by keeping others constant and run them over the two difference validation corpus sets. We'll use `C_v` as our choice of metric for performance comparison 

In [None]:
# supporting function
def compute_coherence_values(corpus, dictionary, k, a, b):
    
    lda_model = gensim.models.LdaMulticore(corpus=doc_term_matrix,
                                           id2word=dictionary,
                                           num_topics=k, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=a,
                                           eta=b)
    
    coherence_model_lda = CoherenceModel(model=lda_model, texts=doc_term_matrix, dictionary=dictionary, coherence='c_v')
    
    return coherence_model_lda.get_coherence()

Let's call the function, and iterate it over the range of topics, alpha, and beta parameter values

In [None]:
import numpy as np
import tqdm
import gensim

corpus = doc_term_matrix
grid = {}
grid['Validation_Set'] = {}

# Topics range
min_topics = 2
max_topics = 11
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

# Alpha parameter
alpha = list(np.arange(0.01, 1, 0.3))
alpha.append('symmetric')
alpha.append('asymmetric')

# Beta parameter
beta = list(np.arange(0.01, 1, 0.3))
beta.append('symmetric')

# Validation sets
num_of_docs = len(corpus)
corpus_sets = [corpus]

corpus_title = ['100% Corpus']

model_results = {'Validation_Set': [],
                 'Topics': [],
                 'Alpha': [],
                 'Beta': [],
                 'Coherence': []
                }

# Can take a long time to run
if 1 == 1:
    pbar = tqdm.tqdm(total=(len(beta)*len(alpha)*len(topics_range)*len(corpus_title)))
    
    # iterate through validation corpuses
    for i in range(len(corpus_sets)):
        # iterate through number of topics
        for k in topics_range:
            # iterate through alpha values
            for a in alpha:
                # iterare through beta values
                for b in beta:
                    # get the coherence score for the given parameters
                    cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=dictionary, 
                                                  k=k, a=a, b=b)
                    # Save the model results
                    model_results['Validation_Set'].append(corpus_title[i])
                    model_results['Topics'].append(k)
                    model_results['Alpha'].append(a)
                    model_results['Beta'].append(b)
                    model_results['Coherence'].append(cv)
                    
                    pbar.update(1)
    pd.DataFrame(model_results).to_csv('lda_tuning_results.csv', index=False)
    pbar.close()

### Final Model Training

Based on external evaluation (Code to be added from Excel based analysis), train the final model

In [None]:
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=16, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=0.01,
                                           eta=0.9)

In [None]:
from gensim.models import CoherenceModel

# Compute Perplexity
print('Perplexity:', lda_model.log_perplexity(doc_term_matrix))
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=lemmas, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)