(explainer_topic_model)=
hSBM topic modelling
==================

# Topic Modelling with hSBM: Community Detection in a Topic Model Context

This section introduces the topic-modelling part of our text analysis. Topic models are mainly used to cluster a collection of documents into so-called 'topics'.

In [None]:
# Import relevant modules
import joblib
from hSBM_Topicmodel.sbmtm import sbmtm
from matplotlib import pyplot as plt
from wordcloud import WordCloud
import matplotlib.colors as mcolors
import networkx as nx
import pandas as pd

Inspired by the approach of [Gerlach et al. 2018](https://www.science.org/doi/10.1126/sciadv.aaq1360), we show the hierarchical clustering of documents and words from our Wikipedia dataset for all six scientific disciplines using a **hierarchical Stochastic Block Model** (hSBM). We only consider words that appear more than ten times in the text corpus and end up with 4810 articles (in our case, nodes). Unlike other popular topic models such as LDA, hSBM does not require us to specify the number of groups, levels or topics beforehand, since it detects these automatically. The model is inspired by community detection in networks and creates a bipartite-like network of words and documents. It splits the network into groups on different hierarchical levels organized as a tree. On each level, the hSBM clusters the documents and words into groups of varying size.
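
To make the network view concrete, here is a minimal sketch of how a corpus can be read as a bipartite word-document network. The two toy documents and their tokens are made up for illustration, and plain Python stands in for the graph library the model uses internally.

```python
from collections import Counter

# Toy corpus (hypothetical): document label -> list of preprocessed tokens
docs = {
    "political_science-anarchism": ["state", "power", "state", "labour"],
    "economics-market": ["market", "price", "labour"],
}

# One weighted edge per (document, word) pair; the weight is the token count
edges = Counter((doc, word) for doc, tokens in docs.items() for word in tokens)

# The two node types of the bipartite network
doc_nodes = set(docs)
word_nodes = {word for _, word in edges}

for (doc, word), count in sorted(edges.items()):
    print(f"{doc} -- {word} (x{count})")
```

Edges only ever run between a document and a word, never between two documents or two words, which is what makes the network bipartite.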

We feed the model preprocessed tokens from each Wikipedia article, after removing infrequent tokens with a threshold of at least 10 occurrences per token. We can extract documents from the model by document number, which corresponds to the index of our dataframe. We have merged the title of each article with the discipline it originates from, e.g. 'Anarchism' becomes 'political_science-anarchism', which we need for the forthcoming analysis.
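
The two preprocessing steps can be sketched roughly as follows. The toy articles, the variable names and the lowered threshold are assumptions for illustration only, and the sketch treats the threshold as inclusive (`>=`); the actual preprocessing code is not shown in this section.

```python
from collections import Counter

# Hypothetical toy input: (discipline, title, tokens) per article
articles = [
    ("Political science", "Anarchism", ["state", "power", "state"]),
    ("Economics", "Market", ["state", "price"]),
]
min_count = 3  # the notebook uses 10; lowered here so the toy data shows an effect

# Count every token across the whole corpus
corpus_counts = Counter(tok for _, _, tokens in articles for tok in tokens)

# Keep only tokens that occur at least `min_count` times in the corpus
filtered = [
    [tok for tok in tokens if corpus_counts[tok] >= min_count]
    for _, _, tokens in articles
]

# Merge discipline and title into one label, e.g. 'political_science-anarchism'
labels = [
    f"{discipline.lower().replace(' ', '_')}-{title.lower()}"
    for discipline, title, _ in articles
]

print(labels)
print(filtered)
```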


In [None]:
# load model
model = joblib.load("hSBM_model.joblib")

Let's take a look at the structure of the model:

In [None]:
# plot the model
model.plot(nedges=10000)

comment on the visualization

And now for the real fun: let's extract the topics. We have experimented with extracting topics at different levels of the model (our model has 7 levels). Generally, we find that the topics at level 1 are the most cohesive and specific, which is useful for our analysis. We ask the model to return the 20 most probable words per topic (remember that the model itself chooses the number of topics).

In [None]:
topics_l1 = model.topics(l=1,n=20)

The model generates 76 topics on level 1, ranging from very method-related topics to more theme-specific ones such as 'socialism' or 'US politics'. On level 2, we find 15 topics which are broader and more discipline-related. The lower the level, the less semantically cohesive the topics. We choose and name 10 topics for our content analysis:

In [2]:
# the 10 topics from level 1 of our model
chosen_topics = {
    'administrative_science': topics_l1[7], 'market_economy': topics_l1[18],
    'museum_anthropology': topics_l1[23], 'cognitive_psychology': topics_l1[25],
    'academia': topics_l1[30], 'statistical_methods': topics_l1[32],
    'socialism': topics_l1[43], 'labour_economics': topics_l1[53],
    'class_racialization': topics_l1[68], 'US_politics_SoMe': topics_l1[75],
}

# we create a pandas dataframe for our topics and topic words
topic_names = list(chosen_topics.keys())
topic_values = list(chosen_topics.values())

topic_df = pd.DataFrame(index = topic_names)
topic_df['topics'] = topic_values
topic_df

The topics are cherry-picked: we chose them based on word clusters we find interesting, both some we believe to be 'pure' discipline topics and some we believe capture overlap between disciplines. The next step is to extract, for each chosen topic, its distribution over the articles to which it contributes most. Luckily, the hSBM library has a function we can use:

In [None]:
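# get model parameters from level 1
model_params = model.get_groups(l=1)

# create a dataframe to store the topic distributions from the model parameters
# (rows are topics, columns are documents; `new_names` is assumed to hold the
# merged 'discipline-title' label of each article, defined earlier)
topic_dist_df = pd.DataFrame(model_params['p_tw_d'], columns=new_names)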

(`p_tw_d` is the probability of a word group, i.e. a topic, in a given document.)
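
As a rough illustration of how such a matrix can be read, the sketch below uses a made-up 3-topic by 4-document table in place of `model_params['p_tw_d']`. It assumes that rows are topics and columns are documents, so that each column is a probability distribution over topics; the labels, the values and the helper `top_documents` are hypothetical.

```python
# Hypothetical 3-topic x 4-document matrix standing in for model_params['p_tw_d']
doc_labels = ["doc_a", "doc_b", "doc_c", "doc_d"]
p_tw_d = [
    [0.7, 0.1, 0.2, 0.5],  # topic 0
    [0.2, 0.6, 0.3, 0.5],  # topic 1
    [0.1, 0.3, 0.5, 0.0],  # topic 2
]

# Each column is one document's distribution over topics, so it should sum to 1
column_sums = [sum(row[j] for row in p_tw_d) for j in range(len(doc_labels))]


def top_documents(topic, n=2):
    """Return the n (label, probability) pairs in which `topic` is most probable."""
    ranked = sorted(
        zip(doc_labels, p_tw_d[topic]), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:n]


print(top_documents(0))
```

Ranking one row in this way is the plain-Python counterpart of selecting a row of the dataframe and sorting it in descending order.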

In [None]:
# map each level-1 topic number to the name we gave that topic
new_keys_names = list(chosen_topics.keys())
topic_num = [7, 18, 23, 25, 30, 32, 43, 53, 68, 75]
topic_num_names = dict(zip(topic_num, new_keys_names))

Now we are ready to extract the topic distributions for each topic. We create a function which returns the topic name, the articles and their probabilities, and the words and word probabilities for each topic.

In [None]:
def get_topic_dist(key):
    """Return topic name, top-30 article probabilities and labels, words and word probabilities."""
    topic = topics_l1[key]
    # each entry of `topic` is a (word, probability) pair
    words = [pair[0] for pair in topic]
    word_prob = [pair[1] for pair in topic]
    # the 30 articles in which the topic has the highest probability
    topic_dist = topic_dist_df.iloc[key].sort_values(ascending=False)[:30]
    article_probabilities = list(topic_dist)
    articles = list(topic_dist.index)
    return topic_num_names[key], article_probabilities, articles, words, word_prob

All we need now is to store the different values for our following analysis. 

In [None]:
# lists of the different values we need 
topics = []
probabilities = []
article_names = []
words = []
word_prob = []

# we use our topic distribution function to extract all relevant values
for key in topic_num:
    element = get_topic_dist(key)
    topics.append(element[0])
    probabilities.append(element[1])
    article_names.append(element[2])
    words.append(element[3])
    word_prob.append(element[4])

### Bipartite networks of the 10 topics: topic-article and topic-word

We want to visualize the topics with their related articles and their related words in two separate bipartite networks. The topics are one type of node and the articles/words the other. Links connect articles/words to a given topic and are weighted by the probability that the article/word contributes to the topic.

In [None]:
# build the edge lists for the two bipartite networks
y_word = list()
w_word = list()

# we create two different lists of tuples
for topic, topic_words, topic_word_prob in zip(topics, words, word_prob):
    for word, prob in zip(topic_words, topic_word_prob):
        # topic name and related word
        y_word.append((topic, word))
        # topic name, word and its probability
        w_word.append((topic, word, prob))

# we repeat the process for articles
y = list()
w = list()

for topic, names, probs in zip(topics, article_names, probabilities):
    for article, prob in zip(names, probs):
        y.append((topic, article))
        w.append((topic, article, prob))

In [None]:
# two separate dataframes to create the network from 
bipart_nx = pd.DataFrame(w, columns = ['sender', 'receiver', 'weight'])
word_nx = pd.DataFrame(w_word, columns = ['sender', 'receiver', 'weight'])

In [None]:
# bipartite NetworkX graph object of topic-words
B_word = nx.Graph()
B_word.add_nodes_from(word_nx['receiver'], bipartite=0)
B_word.add_nodes_from(word_nx['sender'], bipartite=1)
B_word.add_weighted_edges_from(w_word)

# bipartite NetworkX graph object of topic-articles
B = nx.Graph()
B.add_nodes_from(bipart_nx['receiver'], bipartite=0)
B.add_nodes_from(bipart_nx['sender'], bipartite=1)
B.add_weighted_edges_from(w)
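
One thing worth checking before plotting: NetworkX identifies a node by its label, so a topic name that also occurs as a word (for example a topic called 'socialism' containing the word 'socialism') would collapse into a single node. The sketch below runs that check on made-up edges in the same `(topic, word, weight)` shape as `w_word`; the example values and the tagging scheme are assumptions, not part of the original analysis.

```python
# Hypothetical edge list in the same (topic, word, weight) shape as `w_word`
edges = [
    ("socialism", "socialism", 0.05),
    ("socialism", "worker", 0.04),
    ("academia", "university", 0.06),
]

topic_labels = {topic for topic, _, _ in edges}
word_labels = {word for _, word, _ in edges}

# Labels used on both sides would be merged into one node in the graph
clashes = topic_labels & word_labels
print("labels on both sides:", clashes)

# One possible fix: tag each node with its side so the labels stay distinct
safe_edges = [(("topic", t), ("word", w), p) for t, w, p in edges]
```

If `clashes` is empty for the real edge lists, the graphs above are unaffected; otherwise tagged node labels such as those in `safe_edges` keep the two node types apart.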

### To do

* show the SVG file or visualize with netwulf
* make word clouds
* write the content analysis based on the visualizations

Gerlach et al. point to the hSBM's ability to identify groups of stopwords. We, however, chose to remove stopwords in our initial preprocessing, so we do not pass any stopword list to the word clouds.

In [None]:
# word clouds of the top 20 words for the first four level-1 topics
cols = list(mcolors.TABLEAU_COLORS.values())

# color_func overrides colormap; it reads the loop variable `i` when the cloud
# is generated, so each topic gets its own colour
cloud = WordCloud(background_color='white',
                  width=2500,
                  height=1800,
                  max_words=20,
                  colormap='tab10',
                  color_func=lambda *args, **kwargs: cols[i],
                  prefer_horizontal=1.0)

fig, axes = plt.subplots(2, 2, figsize=(10,10), sharex=True, sharey=True)

for i, ax in enumerate(axes.flatten()):
    # topics_l1[i] is a list of (word, probability) pairs
    topic_words = dict(topics_l1[i])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    ax.imshow(cloud)
    ax.set_title('Topic ' + str(i), fontdict=dict(size=16))
    ax.axis('off')


plt.subplots_adjust(wspace=0, hspace=0)
plt.axis('off')
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()

In [None]:
# subsample 

In [None]:
# clustering

In [None]:
# dataframes

In [None]:
# plots 