# Evaluating the hSBM and the LDA model

This notebook exemplifies how we evaluated the hSBM and the LDA topic models. We ran this code for all three datasets.

In [None]:
# import packages
import pickle
import pandas as pd
import numpy as np
import ast
import re
import graph_tool.all as gt
from hSBM_Topicmodel.sbmtm import sbmtm
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
import gensim
from gensim.corpora.dictionary import Dictionary

In [None]:
# load the dataframe
df = pd.read_csv('de_bigrams_data2.csv',  lineterminator='\n')
df.head()

# the hSBM was run on a random sample and we reproduce this sample here
df = df.sample(n=20000, random_state=40)

In [None]:
# function to turn the tokenized list into a readable format
def string_list(text):
    
    # we transform the string representation of the list into an actual list
    text = ast.literal_eval(text)
    
    return text
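
When a list column is written to CSV, each list comes back as its string representation, which is why the conversion above is needed. A minimal sketch of what ``ast.literal_eval`` does, using a made-up cell value (the example string is hypothetical and not taken from the dataset):

```python
import ast

# hypothetical cell value as it would come back from pd.read_csv
raw = "['klima', 'wandel', 'klima_wandel']"

# literal_eval only evaluates Python literals, so it is safer than eval()
parsed = ast.literal_eval(raw)

print(type(raw).__name__, type(parsed).__name__)
print(parsed)
```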

In [None]:
# apply function to all tokenized columns (we'll use lemma_uni_bi, which holds unigrams and bigrams, further down)
df['token'] = df['token'].apply(string_list)
df['lemma'] = df['lemma'].apply(string_list)
df['token_no_mention'] = df['token_no_mention'].apply(string_list)
df['lemma_no_mention'] = df['lemma_no_mention'].apply(string_list)
df['lemma_uni_bi'] = df['lemma_uni_bi'].apply(string_list)

# Inspecting the hSBM model 
Here we inspect some properties of the model and visualize the topic distribution.

In [None]:
# load the model
with open(r'/home/root/model_sav/de_hsbm_bigram_sample20_nmin2.sav', 'rb') as f:
    hsbm_model = pickle.load(f)

## Overview of the topics

In the ``hSBM`` package, the method ``.topics()`` returns all the topics that the model detected and their most predictive words.

In [None]:
# get an overview of the topics and their most predictive words
topics = hsbm_model.topics()
num_topics = len(topics)
print(f'The model detected {num_topics} topics')

In [None]:
# print the topics
for t in topics:
    print(f"\nTopic {t}: \n")
    for tup in topics[t]:
        print(tup[0])
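
Judging from the loop above, ``topics`` maps each topic index to a list of tuples whose first element is the word. The sketch below uses a hypothetical stand-in dictionary to show how the words can be collected per topic; the words and numbers are invented, and reading the second tuple element as a probability is an assumption.

```python
# hypothetical stand-in for the object returned by hsbm_model.topics()
toy_topics = {
    0: [("klima", 0.12), ("wandel", 0.08)],
    1: [("wahl", 0.20), ("partei", 0.05)],
}

# keep only the words of each topic and drop the second tuple element
topic_word_lists = {t: [word for word, _ in pairs] for t, pairs in toy_topics.items()}

print(topic_word_lists)
```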

In order to analyse the results, we need to access the probability of each word-group (= topic) in each document (= tweet).

We use the ``.get_groups()`` method, which returns a dictionary with information on the word-groups and doc-groups. We are interested in the ``p_tw_d`` entry, a ``numpy`` array that specifies the probability of each word-group (= topic) in each document (= tweet).

For reference, in the ``sbmtm.py`` file, the ``.get_groups()`` method is described as follows:
```
'''
extract statistics on group membership of nodes form the inferred state.
return dictionary
- B_d, int, number of doc-groups
- B_w, int, number of word-groups
- p_tw_w, array B_w x V; word-group-membership:
     prob that word-node w belongs to word-group tw: P(tw | w)
- p_td_d, array B_d x D; doc-group membership:
     prob that doc-node d belongs to doc-group td: P(td | d)
- p_w_tw, array V x B_w; topic distribution:
     prob of word w given topic tw P(w | tw)
- p_tw_d, array B_w x d; doc-topic mixtures:
     prob of word-group tw in doc d P(tw | d)
'''
```

Once we've accessed this ``numpy`` array, we turn it into a ``pandas`` dataframe and we will retrieve the tweets that have the highest probabilities of each word-group (= topic).
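
To make the shapes concrete, here is a minimal sketch with an invented ``p_tw_d`` array of 2 topics and 4 documents. The variable names prefixed with ``toy_`` are hypothetical; only the transpose-then-rank pattern mirrors what the notebook does.

```python
import numpy as np
import pandas as pd

# hypothetical p_tw_d with B_w = 2 topics (rows) and D = 4 documents (columns)
toy_p_tw_d = np.array([
    [0.9, 0.2, 0.5, 0.1],
    [0.1, 0.8, 0.5, 0.9],
])

# transpose so that rows are documents and columns are topics
toy_topic_docs = pd.DataFrame(toy_p_tw_d).transpose()

# row positions of the 2 documents with the highest probability for topic 1
top_docs = toy_topic_docs[1].nlargest(2).index.tolist()

print(toy_topic_docs.shape, top_docs)
```

Because the dataframe keeps a default integer index, the returned labels are row positions and can be passed to ``df.iloc``, provided the documents are in the same order as in the model.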

In [None]:
# we need to access the numpy array which is stored in 
# model.get_groups()['p_tw_d']
dist_dict = hsbm_model.get_groups()['p_tw_d']

# we turn the numpy array into a pandas dataframe and transpose it
# so that the topics are in the columns and the tweets are in the rows
topic_docs = pd.DataFrame(dist_dict).transpose()

# we drop all rows that contain NaN values: words below the minimum frequency threshold were excluded from the
# model, so some tweets were not used in the topic model and therefore contain NaN values
topic_docs.dropna(inplace=True)

# display the topic_docs dataframe
topic_docs.head(3)

In [None]:
### Getting the most prevalent topics
# we sum up the weights of each topic over all tweets and divide by the total number of tweets
topic_sum = pd.DataFrame(topic_docs.sum() / len(df), columns=['prevalence'])

# get the 25 most prevalent topics
prev_topics = topic_sum.sort_values(by='prevalence', ascending=False)
prev_topics.head(25)
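
Prevalence in this sense is the average weight a topic receives across tweets. A toy illustration with invented values, assuming each row is a document's topic mixture that sums to 1:

```python
import pandas as pd

# hypothetical document-topic matrix: 4 documents (rows), 2 topics (columns)
toy = pd.DataFrame([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.9]])

# column sums divided by the number of rows give the mean weight per topic
toy_prev = (toy.sum() / len(toy)).to_frame("prevalence")
ranked = toy_prev.sort_values(by="prevalence", ascending=False)

print(ranked)
```

When the divisor equals the number of rows that were summed, the prevalences add up to 1. If rows were dropped beforehand and the divisor is the size of the full sample, the values are scaled down slightly, but the ranking of topics is unaffected.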

In [None]:
# save the data to be plotted
# reset_index() returns a new dataframe, which avoids modifying a slice of prev_topics in place
plot_df = prev_topics.head(25).reset_index()
plot_df.to_csv('de_topicmodel_plot.csv', index=False)

In [None]:
# save the prevalent topics
prev_index = prev_topics.head(25).index

## Getting the most representative tweets

In [None]:
# now we want to retrieve the 10 most representative tweets for each topic

# empty list to store the indices in
index_list = []

# we iterate through the number of topics
for i in range(len(topics)):
    
    # we retrieve the index of the 10 most representative tweets (returns a pandas index object)
    index = topic_docs.sort_values(by=[i], ascending=False).iloc[:10].index
    index_list.append(index)

### Tweets for the most prevalent topics
#### ...in descending order, i.e. starting with the most prevalent topic

In [None]:
# printing the text of the most representative tweets for the most prevalent topics

# iterate through the indices of the most prevalent topics
for i in prev_index:
    
    # print the topic number, the words of each topic and its most representative tweets
    print(f"""*************************************************************************************************
          \n \033[1m --- Topic {i} --- \033[0m \n 
          \n \033[1m --- Words: --- \033[0m \n 
          \n {[j[0] for j in topics[i]]} \n 
          \n \033[1m --- Most representative tweets: --- \033[0m \n\n""")
    
    for text in df.iloc[index_list[i]]['text']:
        print(f'>> {text} \n')
    
    print('\n \033[1m --- Most representative list of lemmas: --- \033[0m \n\n')
    
    for lemma in df.iloc[index_list[i]]['lemma_uni_bi']:
        print(f'>> {lemma} \n')

### Tweets for all topics

In [None]:
# printing the text of the most representative tweets for all topics

# iterate through the number of topics
for i in range(len(topics)):
    
    # print the topic number, the words of each topic and its most representative tweets
    print(f"""*************************************************************************************************
          \n \033[1m --- Topic {i} --- \033[0m \n 
          \n \033[1m --- Words: --- \033[0m \n 
          \n {[j[0] for j in topics[i]]} \n 
          \n \033[1m --- Most representative tweets: --- \033[0m \n\n""")
    
    for text in df.iloc[index_list[i]]['text']:
        print(f'>> {text} \n')
    
    print('\n \033[1m --- Most representative list of lemmas: --- \033[0m \n\n')
    
    for lemma in df.iloc[index_list[i]]['lemma_uni_bi']:
        print(f'>> {lemma} \n')

# Visualization

In [None]:
# now we visualize the topic_sum dataframe (which we created further up) in a seaborn barplot
ax = sns.barplot(x=topic_sum.index, y = 'prevalence', color = '#175985', data=topic_sum)
xtickslocs = ax.get_xticks()
_ = ax.set_xticks(np.arange(xtickslocs[0], xtickslocs[-1], 50))
_ = ax.set_xticklabels(np.arange(xtickslocs[0], xtickslocs[-1], 50))
_ = ax.set_xlabel("Topics")
_ = ax.set_ylabel("Weighted contribution of each topic")
_ = ax.set_title("hSBM: Distribution of topics by weights")

### Plotting the topic-document distribution
The code for this heatmap is taken from the ``sbmtm.py`` file of the hSBM package.

In [None]:
# plotting the distribution of topics over documents

# use the full document-topic dataframe (rows = tweets, columns = topics)
p_tw_d = topic_docs

# figure
fig=plt.figure(figsize=(12,10))

# get color map: see https://matplotlib.org/stable/tutorials/colors/colormaps.html
# for reversed cmaps do:
# color_map = plt.cm.get_cmap('PuBuGn')
# reversed_color_map = color_map.reversed()

# plot the heatmap
plt.imshow(p_tw_d,origin='lower',aspect='auto',interpolation='none', cmap='Reds')
plt.title(r'hSBM: Probability of each word-group (= topic) in each document (= tweet)')
plt.xlabel('Topic')
plt.ylabel('Document')
_ = plt.colorbar()

# LDA

In [None]:
# load model
with open(r'/home/root/model_sav/de_lda_sample20_nmin2.sav', 'rb') as f:
    lda_model = pickle.load(f)

## Create corpus

In [None]:
#Create a id2word dictionary

#Insert the column where you saved unigram and bigram tokens between the parentheses
id2word = Dictionary(df['lemma_uni_bi']) 

#Viewing how many words are in our vocabulary
print(len(id2word))

# Use filter_extremes to remove very infrequent words: no_below=2 keeps only words that appear in
# at least 2 documents; no_above=1 means that no word is removed for being too frequent
id2word.filter_extremes(no_below=2, no_above=1)

#Viewing how many words remain in our vocabulary after filtering
print(len(id2word))

# creating a corpus object
corpus = [id2word.doc2bow(doc) for doc in df['lemma_uni_bi']]
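
The effect of the two ``filter_extremes`` thresholds can be illustrated without gensim. The sketch below is a simplified re-implementation of the document-frequency rule only: it ignores gensim's additional ``keep_n`` cap, ``filter_by_doc_freq`` is a hypothetical helper, and the toy documents are invented.

```python
from collections import Counter

# hypothetical tokenized documents
toy_docs = [
    ["klima", "wandel", "klima_wandel"],
    ["klima", "politik"],
    ["klima", "wandel"],
]

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens that occur in at least `no_below` documents and in
    no more than the fraction `no_above` of all documents."""
    # count in how many documents each token appears (not how often overall)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = int(no_above * len(docs))
    return {tok for tok, n in doc_freq.items() if no_below <= n <= max_docs}

kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=1)
print(sorted(kept))
```

With ``no_above=1`` even a token that appears in every document survives, so only the lower bound has an effect here.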

#### Topic overview

In [None]:
# topic overview
# note: num_topics was derived from the hSBM model above; this assumes the LDA model has the same number of topics
words = [re.findall(r'"([^"]*)"', t[1]) for t in lda_model.print_topics(num_topics, 10)]
topic_words = []

# printing the topics in a nice format
for topic_id, t in enumerate(words):
    print(f"------ Topic {topic_id} ------")
    print(' '.join(t), end="\n\n")
    topic_words.append(' '.join(t))

### Getting the most representative tweets for each topic

In [None]:
# the gamma probabilities show how prevalent a topic is in a given document

# gamma_list contains one list per document; each inner list holds (topic, probability) tuples
gamma_list = list(lda_model.get_document_topics(corpus, minimum_probability = 0.0))

In [None]:
# empty list to store the probabilities in
gamma_df_list = []

# iterating through the gamma_list
for doc in gamma_list:
    
    # retrieving all probabilities and storing them in a list
    prob = [tup[1] for tup in doc]
    
    # adding list to the gamma_df_list that we'll turn into a pandas dataframe
    gamma_df_list.append(prob)
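
The loop above relies on every document listing all topics in order. If gensim drops near-zero topics from its sparse output (an assumption worth checking for the installed version), the rows would end up with different lengths and misaligned columns. A more defensive variant fills a dense row by topic id; ``to_dense_row`` is a hypothetical helper and the values are invented.

```python
def to_dense_row(doc_topics, n_topics):
    """Turn a list of (topic_id, probability) tuples into a fixed-length row."""
    row = [0.0] * n_topics
    for topic_id, prob in doc_topics:
        row[topic_id] = prob
    return row

# hypothetical output for two documents; the second one lacks topic 1
toy_gamma_list = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.4), (2, 0.6)],
]

dense = [to_dense_row(doc, n_topics=3) for doc in toy_gamma_list]
print(dense)
```

Each row now has one entry per topic, so it can be passed to ``pd.DataFrame`` without producing NaN values or shifted columns.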

### Most prevalent topics

In [None]:
# turning the list into a dataframe
gamma_df = pd.DataFrame(gamma_df_list)

# sum up the weights of each topic and divide by the number of documents
gamma_sum = pd.DataFrame(gamma_df.sum() / len(gamma_df), columns=['prevalence'])

# get the 25 most prevalent topics
lda_prev_topics = gamma_sum.sort_values(by='prevalence', ascending=False)
lda_prev_topics.head(25)

In [None]:
# save the prevalent topics
lda_prev_index = lda_prev_topics.head(25).index

### Tweets for the most prevalent topics
#### ...in descending order, i.e. starting with the most prevalent topic

In [None]:
# now we want to retrieve the 10 most representative tweets for each topic

# empty list to store the indices in
lda_index_list = []

# we iterate through the number of topics
for i in range(num_topics):
    
    # we retrieve the index of the 10 most representative tweets (returns a pandas index object)
    index = gamma_df.sort_values(by=[i], ascending=False).iloc[:10].index
    lda_index_list.append(index)

In [None]:
gamma_df

In [None]:
gamma_df.sort_values(by=[i], ascending=False).iloc[:10].index

In [None]:
# printing the text of the most representative tweets for the most prevalent topics

# iterate through the indices of the most prevalent topics, so that the words and
# tweets are looked up by topic id rather than by rank
for i in lda_prev_index:
    
    # print the topic number, the words of each topic and its most representative tweets
    print(f"""*************************************************************************************************
          \n \033[1m --- Topic {i} --- \033[0m \n 
          \n \033[1m --- Words: --- \033[0m \n 
          \n {topic_words[i]} \n 
          \n \033[1m --- Most representative tweets: --- \033[0m \n\n""")
    
    for text in df.iloc[lda_index_list[i]]['text']:
        print(f'>> {text} \n')
    
    print('\n \033[1m --- Most representative list of lemmas: --- \033[0m \n\n')
    
    for lemma in df.iloc[lda_index_list[i]]['lemma_uni_bi']:
        print(f'>> {lemma} \n')

### Tweets for all topics

In [None]:
# printing the text of the most representative tweets for all topics

# iterate through the number of topics
for i in range(num_topics):
    
    # print the topic number, the words of each topic and its most representative tweets
    print(f"""*************************************************************************************************
          \n \033[1m --- Topic {i} --- \033[0m \n 
          \n \033[1m --- Words: --- \033[0m \n 
          \n {topic_words[i]} \n 
          \n \033[1m --- Most representative tweets: --- \033[0m \n\n""")
    
    for text in df.iloc[lda_index_list[i]]['text']:
        print(f'>> {text} \n')
    
    print('\n \033[1m --- Most representative list of lemmas: --- \033[0m \n\n')
    
    for lemma in df.iloc[lda_index_list[i]]['lemma_uni_bi']:
        print(f'>> {lemma} \n')

#### Visualization of topic distribution

In [None]:
# we visualize the gamma_sum dataframe (which we created further up) in a seaborn barplot
ax = sns.barplot(x=gamma_sum.index, y = 'prevalence', color = '#175985', data=gamma_sum)
xtickslocs = ax.get_xticks()
_ = ax.set_xticks(np.arange(xtickslocs[0], xtickslocs[-1], 50))
_ = ax.set_xticklabels(np.arange(xtickslocs[0], xtickslocs[-1], 50))
_ = ax.set_xlabel("Topics")
_ = ax.set_ylabel("Weighted contribution of each topic")
_ = ax.set_title("LDA: Distribution of topics by weights")

### Plotting the topic-document distribution
The code for this heatmap is taken from the ``sbmtm.py`` file of the hSBM package; once we have the gamma probabilities, the same code works for the LDA model.

In [None]:
# plotting the distribution of topics over documents

# use the full document-topic dataframe of gamma probabilities (rows = tweets, columns = topics)
p_tw_d = gamma_df

# figure
fig=plt.figure(figsize=(12,10))

# get color map: see https://matplotlib.org/stable/tutorials/colors/colormaps.html
# for reversed cmaps do:
# color_map = plt.cm.get_cmap('PuBuGn')
# reversed_color_map = color_map.reversed()

# plot the heatmap
plt.imshow(p_tw_d,origin='lower',aspect='auto',interpolation='none', cmap='Blues')
plt.title(r'LDA: Probability of each topic in each document (= tweet)')
plt.xlabel('Topic')
plt.ylabel('Document')
_ = plt.colorbar()