# Topic Modelling with LDA   
In this notebook we will use the `gensim` library to perform topic modelling and find out what the sarcasm headlines dataset is about. We will use the cleaned tokens saved from Notebook 1 as input for our models.  
 

**Outline**  
- Create bag of words and tf-idf representations for the documents  
- Use LDA model to do topic modelling  
- Analyse and visualise topic results  


**Estimated time:** 30 mins

In [None]:
### Change notebook directory, for Gadi environment only
import os
working_path = os.path.expandvars("/scratch/vp91/$USER/Introduction-to-NLP/")
os.chdir(working_path)
data_path = '/scratch/vp91/NLP-2024/data/'
model_path = '/scratch/vp91/NLP-2024/model/'

In [None]:
# local paths
# working_path = './'
# data_path = '../data/'
# model_path = '../model/'

In [None]:
import gensim
from gensim.corpora import Dictionary
from gensim.corpora import MmCorpus
from gensim.models import LdaMulticore
from gensim.models import CoherenceModel
from gensim.models import TfidfModel
from nltk import word_tokenize

# Enable logging so we can check in the training log that most documents have converged by the final passes.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import pickle
from pprint import pprint
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis 

## Load Tokens from Headlines Dataset 
We again use the smaller dataset and feed the cleaned tokens to the `Dictionary` class. 

<div class="alert alert-block alert-info">
Remember, if you want to use your own dataset, you need to preprocess the text as demonstrated in Notebook 1 before feeding it to the model.
</div>  


In [None]:
data = pd.read_json(data_path + 'Sarcasm_Headlines_Dataset_v2.json',lines=True)
data
# load our tokens back
with open(working_path + 'tokens.pkl', 'rb') as f:
    tokens = pickle.load(f)
print(len(tokens),tokens[:10])

## Add Bi-grams
The results of a topic model are weighted tokens for each topic and weighted topics for each document. The model does NOT provide topic names, only topic indexes, so it is up to us to understand and explain the topic results.  

One way to help with that is to create n-grams from our tokens.  

An n-gram is a phrase made of n words. Adding n-grams shows us frequently occurring phrases and helps us understand the topic results qualitatively.
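
To make the idea of frequent two-word phrases concrete, here is a minimal plain-Python sketch that counts adjacent token pairs in a few made-up documents. The documents and the threshold of 2 are assumptions for illustration only. `Phrases` in `gensim` applies its own scoring on top of raw counts, so this is not a reimplementation of it.

```python
from collections import Counter

# Hypothetical toy documents, already tokenised (not taken from the headlines dataset).
toy_docs = [
    ["white", "house", "press", "briefing"],
    ["white", "house", "staff", "meeting"],
    ["new", "house", "prices"],
]

# Count every pair of adjacent tokens across all documents.
bigram_counts = Counter(
    (first, second)
    for doc in toy_docs
    for first, second in zip(doc, doc[1:])
)

# Keep pairs seen at least twice and join them with "_",
# the separator the notebook later checks for in each token.
frequent_bigrams = sorted(
    "_".join(pair) for pair, count in bigram_counts.items() if count >= 2
)
print(frequent_bigrams)
```

Only `white_house` passes the threshold here, because every other pair occurs once.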

In [None]:
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(tokens, min_count=20)
for idx in range(len(tokens)):
    for token in bigram[tokens[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            tokens[idx].append(token)

In [None]:
tokens[:5]

## Create Input    
Using the `Dictionary` class of the library, we can easily filter the words with extremely low or high frequency counts.
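
The filtering idea can be sketched in plain Python on a toy corpus. The documents and thresholds below are assumptions chosen for illustration. The names `no_below` and `no_above` mirror the arguments of `filter_extremes`, but this is a simplified stand-in rather than the library's code.

```python
from collections import Counter

# Hypothetical toy documents (not taken from the headlines dataset).
toy_docs = [
    ["man", "says", "economy", "strong"],
    ["man", "says", "weather", "nice"],
    ["man", "visits", "economy", "summit"],
    ["man", "says", "nothing"],
]

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

no_below = 2     # keep words found in at least 2 documents
no_above = 0.75  # keep words found in at most 75% of the documents
n_docs = len(toy_docs)

kept = sorted(
    word for word, df in doc_freq.items()
    if df >= no_below and df / n_docs <= no_above
)
print(kept)
```

In this toy corpus `man` is dropped for being in every document, and words that appear in only one document are dropped for being too rare.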

In [None]:
%%time
texts = tokens
# Create a dictionary representation of the documents.
dictionary = Dictionary(texts)
# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
# remove gaps in id sequence after words that were removed
dictionary.compactify()  



<div class="alert alert-block alert-info">

Right click on the left side of the output, select **Enable Scrolling for Outputs**
</div>  

### Vectorization  
Now we vectorize the documents using two different weightings: bag-of-words, which counts how often each word occurs in each document, and tf-idf, which reweights those counts to highlight words that are distinctive for a document relative to the rest of the corpus.
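
As a rough illustration of the difference between the two weightings, the sketch below computes raw counts and a plain `count * log2(N / df)` tf-idf weight on a made-up corpus. The documents are assumptions, and `TfidfModel` applies its own weighting and normalisation defaults, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy documents (not taken from the headlines dataset).
toy_docs = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["dog", "ate", "bone", "bone"],
]

# Bag-of-words: raw term counts per document.
bow = [Counter(doc) for doc in toy_docs]

# Document frequency and a plain log2(N / df) inverse document frequency.
n_docs = len(toy_docs)
doc_freq = Counter(word for doc in toy_docs for word in set(doc))
idf = {word: math.log2(n_docs / df) for word, df in doc_freq.items()}

# tf-idf weight = term count in the document * idf of the term.
tfidf = [{word: count * idf[word] for word, count in doc.items()} for doc in bow]

print(bow[2]["bone"], round(tfidf[2]["bone"], 3))
print(bow[2]["ate"], round(tfidf[2]["ate"], 3))
```

The word `bone` occurs only in the third document, so it receives a much larger tf-idf weight there than `ate`, which is shared with another document.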

In [None]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in texts]

In [None]:
# tf-idf representation of the documents
## train the tfidf model
tfidf = TfidfModel(corpus) 
### apply the model to whole corpus
corpus_tfidf = tfidf[corpus]

In [None]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

## Training LDAMulticore  
Here we use the `LdaMulticore` class to distribute the training on multiple workers.  

`num_topics`  
Now we are ready to train the model. But how many topics should we define? It really depends on your data and your interpretation. Let's start trying with 10 topics and see what the results are like.  

`chunksize`  
It controls how many documents are processed at a time during training. For example, if the chunksize equals the size of the corpus, the model will process all documents in one go. 
<div class="alert alert-block alert-info">
Be careful when increasing the chunksize, because each chunk needs to fit in memory. 
</div>   

`passes`   
It is equivalent to `epoch` in neural networks, specifying how many times we want to go over the corpus.  


`iterations`  
This is the maximum number of inference iterations run on each document in a chunk. Higher `iterations` and `passes` usually improve the result, at the cost of longer training.  

`eval_every`  
Log perplexity is evaluated every `eval_every` updates. Setting it to 1 slows training down by roughly 2x. 

`alpha`  
This is the a priori belief about the document-topic distribution. It can also be a 1D array of length `num_topics` that sets an asymmetric, user-defined prior for each topic.  

`eta`  
This is the a priori belief about the topic-word distribution. `auto` learns an asymmetric prior from the corpus.  
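
A small arithmetic sketch may help relate `chunksize` and `passes` to the amount of work done, and show what a symmetric prior looks like. The corpus size of 28,000 documents is a hypothetical figure used only for illustration, and the bookkeeping here ignores how updates are batched across workers.

```python
import math

# Hypothetical corpus size, chosen only for illustration.
n_docs = 28_000
chunksize = 2000
passes = 5
num_topics = 10

# Number of chunks the corpus is split into on each pass.
chunks_per_pass = math.ceil(n_docs / chunksize)

# Every pass goes through all chunks once.
total_chunks = chunks_per_pass * passes

# A symmetric prior gives every topic the same weight, 1 / num_topics.
symmetric_alpha = [1.0 / num_topics] * num_topics

print(chunks_per_pass, total_chunks)
print(symmetric_alpha[:3])
```

Doubling `chunksize` halves the number of chunks per pass but doubles the memory needed to hold one chunk.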



In [None]:
%%time
# Set training parameters.
num_topics = 10
chunksize = 2000
passes = 5
iterations = 20
eval_every = 1  
alpha = 'symmetric'
eta = 'auto'

# Build LDA model
lda = LdaMulticore(corpus=corpus, id2word=dictionary,num_topics=num_topics, 
                   workers = 20,
                   chunksize=chunksize, passes=passes, iterations=iterations, 
                   # pass eval_every so the value set above is actually used
                   eval_every=eval_every,
                   alpha=alpha, eta = eta,
#                    decay=0.9,
                   # Topics with a probability lower than this threshold will be filtered out.
#                    minimum_probability=0.5,
                   # for reproducibility.
#                    random_state=100,
                   # If True, the model also computes a list of topics, sorted in descending order of most likely topics for each word, along with their phi values multiplied by the feature length (i.e. word count).
#                    per_word_topics=True,
                   # if per_word_topics is True, this represents a lower bound on the term probabilities.
#                    minimum_phi_value = 0.5,
                  )

ldatopics = [[word for word, prob in topic] for topicid, topic in lda.show_topics(formatted=False)]
coh= CoherenceModel(topics=ldatopics, texts=texts, dictionary=dictionary, coherence = 'u_mass')
print('\nPerplexity: ', lda.log_perplexity(corpus))
print('\nCoherence Score: ', coh.get_coherence())


In [None]:
%%time
# Set training parameters.
num_topics = 10
chunksize = 2000
passes = 5
iterations = 20
eval_every = 1  
alpha = 'symmetric'
eta = 'auto'

# Build LDA model on the tf-idf corpus
lda_tfidf = LdaMulticore(corpus=corpus_tfidf, id2word=dictionary,num_topics=num_topics, 
                   workers = 20,
                   chunksize=chunksize, passes=passes, iterations=iterations, 
                   # pass eval_every so the value set above is actually used
                   eval_every=eval_every,
                   alpha=alpha, eta = eta,
#                    decay=0.9,
#                    minimum_probability=0.5,
#                    minimum_phi_value = 0.5,
#                    random_state=100,
#                    per_word_topics=True
                  )

# Evaluate the tf-idf model itself (not the bag-of-words model trained above)
lda_tfidf_topics = [[word for word, prob in topic] for topicid, topic in lda_tfidf.show_topics(formatted=False)]
coh_tfidf = CoherenceModel(topics=lda_tfidf_topics, texts=texts, dictionary=dictionary, coherence = 'u_mass')
print('\nPerplexity: ', lda_tfidf.log_perplexity(corpus_tfidf))
print('\nCoherence Score: ', coh_tfidf.get_coherence())


## Result Analysis  
### Topic Coherence

Now we use the `top_topics()` function to get the coherence score for each topic. Here it computes the `u_mass` measure. Scores are typically negative, and values closer to zero indicate better coherence.   
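
To give an intuition for why co-occurring top words score closer to zero, here is a simplified plain-Python sketch of a UMass-style score on made-up documents. It averages `log((D(w_i, w_j) + 1) / D(w_j))` over ordered word pairs, where `D` counts documents. The toy data and the helper names `doc_count` and `umass` are assumptions for illustration, and `gensim` differs in details such as smoothing and aggregation.

```python
import math
from itertools import combinations

# Hypothetical toy documents, each reduced to its set of words.
toy_docs = [
    {"election", "vote", "senate"},
    {"election", "vote", "poll"},
    {"vote", "senate"},
    {"recipe", "cake"},
    {"vote", "tax"},
    {"election", "debate"},
]

def doc_count(*words):
    """Number of toy documents that contain all of the given words."""
    return sum(1 for doc in toy_docs if all(w in doc for w in words))

def umass(top_words):
    """Mean of log((D(later, earlier) + 1) / D(earlier)) over ordered word pairs."""
    scores = [
        math.log((doc_count(later, earlier) + 1) / doc_count(earlier))
        for earlier, later in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)

coherent = umass(["vote", "election", "senate"])
mixed = umass(["vote", "cake", "senate"])
print(round(coherent, 3), round(mixed, 3))
```

The list that mixes in `cake`, which never co-occurs with the other words, gets the lower (more negative) score.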

In [None]:
top_topics = lda.top_topics(corpus, coherence='u_mass')

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

> **Think**  
> Can you interpret each topic based on the top words?  

### Topic Coherence Distribution

In [None]:
# Number of top words in each topic used to compute coherence.
topw = 15
topic_coh = lda.top_topics(corpus=corpus, texts=texts, dictionary=dictionary,
                           window_size=None, coherence='u_mass', topn=topw, processes=-1)

# Each entry is (list of (probability, word) pairs, coherence score).
tt = [topic_words for topic_words, _ in topic_coh]
c = [score for _, score in topic_coh]

df_tt = pd.DataFrame(tt)
df_coh = pd.DataFrame(c)
df_coh.rename(columns={0:'coherence'},inplace=True)
df_coh = df_coh.merge(df_tt,left_index=True,right_index=True)
# Plot the distribution of the numeric coherence column only.
df_coh[['coherence']].plot.hist()

> **Think**  
> How is the quality of the topics?

### Document-Topic Representation
Sorted by topic probabilities
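
The sorting step is easier to follow on a tiny example. The sketch below uses made-up `(topic_id, probability)` pairs in the same shape the model returns for each document. The values are hypothetical.

```python
# Hypothetical output for two documents: lists of (topic_id, probability) pairs.
toy_doc_topics = [
    [(0, 0.10), (1, 0.75), (2, 0.15)],
    [(0, 0.60), (1, 0.05), (2, 0.35)],
]

# Sort each document's topics from most to least probable.
sorted_topics = [
    sorted(pairs, key=lambda pair: pair[1], reverse=True)
    for pairs in toy_doc_topics
]

# The first pair of each sorted list holds the dominant topic.
top_topic_ids = [pairs[0][0] for pairs in sorted_topics]
print(top_topic_ids)
```

After sorting, the first column of the resulting table holds each document's most probable topic, which is what the notebook calls the "Top Topic".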

In [None]:
# get doc topic info from array
dt = lda.get_document_topics(bow = corpus, minimum_probability=0, minimum_phi_value=None, per_word_topics=False)
dt = [sorted(e,key = lambda x: x[1],reverse=True) for e in dt ]
df_t = pd.DataFrame(dt).rename({0:'Top Topic'},axis=1)
df_t3 = df_t.loc[:,:2]
# get top topicID from tuple
top=[]
for doc in dt:
    top.append(doc[0][0])
df_top = pd.DataFrame(top)

df_DT = data
df_DT = pd.merge(df_t3,df_DT,how='outer',left_index=True, right_index=True)
df_DT = pd.merge(df_top,df_DT,how='outer',left_index=True, right_index=True)
df_DT.rename(columns={0:'Top Topic','Top Topic':0},inplace=True)
df_DT
df_DT.groupby('Top Topic')['headline'].nunique().sort_values(ascending=False).plot(kind='bar')

> **Think**  
> How are documents distributed?  
> If you know the corpus, is the distribution consistent with your understanding?  
> Can this help us to change any hyperparameter?


### Topic-Word Representation
Ordered word probabilities in each topic

In [None]:
# rows -> topics
tw = lda.print_topics(num_topics=-1, num_words=15)
t = []
for i in range(num_topics):
    t.append(lda.show_topic(i,topw))
pd.DataFrame(t)

> **Think**  
> Are there duplicate words with top weight in different topics?  
> Is there diversity for words in different topics?  
> How should we filter the words when we define the Dictionary to improve the result?
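
One way to approach the first question is to count in how many topics each top word appears. The sketch below does this for made-up top-word lists. The topics are hypothetical and are not real model output.

```python
from collections import Counter

# Hypothetical top words for three topics (not real model output).
toy_topics = {
    0: ["new", "man", "year", "study"],
    1: ["new", "woman", "school", "study"],
    2: ["new", "report", "police", "day"],
}

# Count in how many topics each word appears among the top words.
word_topic_count = Counter(
    word for words in toy_topics.values() for word in set(words)
)

# Words that are shared by more than one topic.
shared = {word: n for word, n in word_topic_count.items() if n > 1}
print(sorted(shared.items()))
```

Words shared by most topics carry little information about any single topic, so they are natural candidates for a stricter `no_above` threshold or a stop-word list.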

### Visualise Topics

> **Think**  
> Are topics separated or overlapping?
> Which topics might be merged based on their location and key words?

In [None]:
# Render the visualisation inline in the notebook
pyLDAvis.enable_notebook()
# Uses the default PCoA scaling; pass mds='mmds' to try metric MDS instead
gensimvis.prepare(lda, corpus, dictionary)