<a href="https://colab.research.google.com/github/MatJohaDH/LDA_playground/blob/main/LDA_playground.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:

#@title Welcome to Jupyter notebooks! { run: "auto", vertical-output: true, display-mode: "form" }
#@markdown This _notebook_ is made to be run in Google Colab, which is a free
#@markdown (though somewhat limited) cloud computing service offered by ... 
#@markdown Google.

#@markdown A very brief summary: it lets us run Python-backed calculations
#@markdown in a web browser without having to install anything on our local
#@markdown machines.

#@markdown To use the notebook, you need to be aware of a few things about how
#@markdown it works.
#@markdown **Firstly**, this box (ending at the horizontal line below) is
#@markdown called a *cell*. As a rule, each *cell* contains code. To *run* the
#@markdown *cell*, that is, to execute the instructions in the code, we can either
#@markdown click the circle with a 'play' triangle in the cell's top-left corner
#@markdown or hold Shift and press Enter while the cell is selected.
#@markdown **Secondly**, the instructions in subsequent cells often depend on
#@markdown the successful execution of preceding cells, so try to execute them
#@markdown in sequence.
#@markdown **Finally**, if you want to read the code in any of the
#@markdown cells, just double-click on the text, and double-click again to hide the code.

from tqdm.notebook import tqdm
import os
import ipywidgets as widgets
from ipywidgets import interact_manual
import gensim
import re
import pandas as pd
from nltk.stem.snowball import SnowballStemmer as stemmer
!pip install stopwords
# !pip install wordcloud
from wordcloud import WordCloud as wc
from stopwords import get_stopwords


#add the swedish stopwords too...
stopwords = {'English': {_ for _ in gensim.parsing.preprocessing.STOPWORDS},
             'Swedish': {_ for _ in pd.read_csv('https://raw.githubusercontent.com/peterdalle/svensktext/master/stoppord/stoppord.csv', header=0, encoding='utf-8')['all']},
             'Latin' : {_ for _ in get_stopwords('latin')}}

stemmers = {'English': gensim.parsing.preprocessing.stem_text,
            'Swedish': stemmer('swedish').stem,
            'Latin': lambda token: token}
import matplotlib.pyplot as plt

%matplotlib inline
from pprint import pprint

if not os.path.exists('collection_txt.zip'):
    !wget https://github.com/MatJohaDH/LDA_playground/raw/main/collection_txt.zip

#@markdown ---



# Topic Modelling Playground

This notebook has been prepared to make the topic modelling process more transparent by giving easier access to some of its many knobs and dials. If anything breaks, does not work as promised, is unclear, or is missing, do not hesitate to contact me and I will do my best to make things right: Mathias.Johansson@kom.lu.se

## What is topic modelling?

- Techniques for detecting _latent topics_ within a corpus
- Therefore, a _topic_ here does not refer to what we would normally consider a _topic_

## What is it used for?

- Document retrieval
- Literature review
- Distant reading

# 1. Uploading corpus

In order to run a topic model we need a corpus, so the first thing we need to do
is to upload a single *.txt* file or a collection of *.txt* files packaged in
a *.zip* file. 

In [5]:
#@title Upload zip file { run: "auto", vertical-output: true, display-mode: "form" }
#@markdown In order to have some data to play with, we need to upload it into 
#@markdown the running notebook. Running this cell (shift+enter) or pressing the
#@markdown circle with an arrow in the top left corner of the cell will start the
#@markdown file uploading widget.

#@markdown The widget works like most file-upload dialogs:
#@markdown press "Browse..." and select the file(s) you want to upload.

#@markdown Any type of file _can_ be uploaded, but the notebook is set to read
#@markdown only *.txt* files and can unzip *.zip* files to access them if necessary.
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  
#@markdown By default the notebook will run a small toy example. If you want to
#@markdown work with it, running this cell is not necessary. However, if you did
#@markdown run the cell, you will have to press the button "Cancel upload" to
#@markdown continue to the next cell.
#@markdown The example contains 149 texts taken from the description portion of
#@markdown games listed on https://boardgamegeek.com, specifically those games
#@markdown that were recorded as being in my collection when the __corpus__ was
#@markdown prepared.

In [6]:
#@title Select zip to extract and use { run: "auto", vertical-output: true, display-mode: "form" }
#@markdown Select the file you want to load your _corpus_ from and press
#@markdown "Run Interact" to load it into memory.

#@markdown **Note:** You have to rerun the cell (shift+enter) to refresh the
#@markdown list of available files.

sources = [_ for _ in os.listdir() if _.endswith('.txt') or _.endswith('.zip')]
# source = widgets.Select(options=sources, description='Source:')
# display(source)

def load_files(source):
    if source.endswith('.zip'):
        if not os.path.exists(source[:-4]):
            os.system(f'unzip {source}')
        sources = [os.path.join(p,f) for p,d,fs in os.walk(source[:-4]) for f in fs if f.endswith('.txt')]
        if len(sources) == 1:
            return load_files(sources[0])
        else:
            texts = []
            for fpath in sources:
                with open(fpath, 'r') as f:
                    texts.append(f.read())
            df = pd.DataFrame({'source': sources, 'raw': texts})
            return df
    elif source.endswith('.txt'):
        with open(source, 'r') as f:
            lines = f.read()
        sources = []
        texts = [] 
        for nr, line in enumerate(re.split('\n+', lines)):
            sources.append(f'{source}_{nr}')
            texts.append(line)
        df = pd.DataFrame({'source': sources, 'raw': texts})
        return df

df = ''
@interact_manual(source=widgets.Select(options=sources, description='Source:', rows=len(sources)+1),
                 description='load')
def set_df(source):
    global df
    df = load_files(source)
    n_docs = len(df)
    display(f'Loaded {n_docs} documents into corpus from {source}')



# 2. Preprocessing corpus
Before we feed the corpus into the topic modelling algorithm we need to make it readable by the machine learning algorithm. There are many ways to
do this in practice, but in essence they are all the same: turning
strings into numerical vectors:

$\vec{d}=(0, 1, 4, 5, 0, 1, ..., 0)$

To do this we will treat the texts as Bag of Words (BoW); we do not pay 
attention to sentence structure, we only count how many times each word
(*token*) appears in each *document*.

### The process in six steps:
1. Turn all _documents_ to lowercase
2. Strip _documents_ of punctuation, multiple whitespaces and numbers.
3. Remove stopwords (and short words)
4. Stem *tokens*
5. Establish a vocabulary
6. Count vocabulary *tokens* in each document 
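
The six steps can be sketched in plain Python. This is only an illustration: the two example sentences, the `toy_stopwords` set and the `toy_stem` suffix stripper are hypothetical stand-ins, and the notebook itself relies on gensim's preprocessing filters rather than on this code.

```python
import re
from collections import Counter

# Hypothetical toy inputs, not taken from the notebook's corpus.
documents = [
    "The players build 3 roads; roads cost wood!",
    "Players are building roads, and trading wood.",
]
toy_stopwords = {"the", "are", "and"}

def toy_stem(token):
    """Very naive suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = text.lower()                        # step 1: lowercase
    text = re.sub(r"[^a-z\s]", " ", text)      # step 2: drop punctuation and numbers
    tokens = text.split()                      # splitting also collapses whitespace
    tokens = [t for t in tokens
              if t not in toy_stopwords and len(t) >= 3]  # step 3: stopwords, short words
    return [toy_stem(t) for t in tokens]       # step 4: stem

docs = [preprocess(d) for d in documents]
vocabulary = sorted({token for doc in docs for token in doc})  # step 5: vocabulary

vectors = []
for doc in docs:
    counts = Counter(doc)
    vectors.append([counts[term] for term in vocabulary])      # step 6: count tokens

print(vocabulary)
print(vectors)
```

Each row of `vectors` is one document expressed as counts over the shared vocabulary, which is the Bag of Words representation described above. Note that the naive stemmer turns "trading" into "trad"; a real stemmer makes more careful choices.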

In [7]:
#@title Pick a language { run: "auto", vertical-output: true, display-mode: "form" }
#@markdown Since some of the preprocessing steps are language dependent we first
#@markdown need to select which language to use. For this exercise there are three
#@markdown options: English, Swedish or Latin.
#@markdown (There is no stemming prepared for Latin)
language = 'English' #@param['English', 'Swedish', 'Latin']
gensim.parsing.preprocessing.DEFAULT_FILTERS[-1] = stemmers[language]

print(f'You have selected: {language}')

You have selected: English


In [None]:
#@title Stopwords { run: "auto", vertical-output: true, display-mode: "form" }
#@markdown __Stopwords__: words that are used too
#@markdown often to convey any *real* meaning
#@markdown on their own. For topic modelling
#@markdown these terms can get in the way of
#@markdown reaching relevant topics and are
#@markdown therefore removed.

#@markdown The simplest way to detect stopwords
#@markdown is to use a preexisting list, which is
#@markdown exactly what we are doing here.

show_stopwords = False #@param {type:"boolean"}

add_stopwords = '' #@param {type: "string"}
new_stopwords = {_ for _ in re.findall(r'\w+', add_stopwords)}

remove_stopwords = '' #@param {type: "string"}
not_stopwords = {_ for _ in re.findall(r'\w+', remove_stopwords)}

gensim.parsing.preprocessing.STOPWORDS = (stopwords[language] | new_stopwords) - not_stopwords

if show_stopwords:
    pprint(gensim.parsing.preprocessing.STOPWORDS)

In [None]:
#@title Preparing Vocabulary { run: "auto", vertical-output: true, display-mode: "form" }
#@markdown The terms from the corpus that we end up using to vectorize the
#@markdown documents are called the *vocabulary*
#@markdown (but are as often referred to as the *lexicon* or *dictionary*).

#@markdown Check this box if you want to skip all the regular
#@markdown preprocessing steps.
use_raw = False #@param{type: "boolean"}

# Tokenizing the documents
if use_raw:
    # Raw mode: keep every run of letters (including Swedish å, ä, ö) and digits as-is.
    df['doc'] = df['raw'].apply(lambda raw: re.findall(r'[a-zA-ZåäöÅÄÖ\d]+', raw))
else:
    # Regular mode: lowercase, strip punctuation and numbers, remove stopwords, and stem.
    df['doc'] = df['raw'].apply(gensim.parsing.preprocessing.preprocess_string)
    old_shape = df.shape
    # We also have to make sure that we remove the _documents_ that have no representation in this space.
    df = df[df['doc'].apply(lambda doc: len(doc)>0)]
    # print(f'New shape: {df.shape}\nOld shape: {old_shape}')

# Preparing the vocabulary
vocab = gensim.corpora.Dictionary(documents=df['doc'])

#@markdown Selecting the vocabulary by how many documents the terms appear in.
#@markdown Everything below the lower threshold and above the upper threshold
#@markdown will be removed from the *vocabulary*.

@interact_manual(thresholds = widgets.IntRangeSlider(
    value=[5,50], min=0, max=100, description='frequency (%)'
))
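def vocab_prep(thresholds):
    lower, upper = thresholds
    global vocab
    before = len(vocab)
    if lower > 0 or upper < 100:
        # filter_extremes expects no_below as an absolute number of documents
        # and no_above as a fraction of the corpus, so convert the percentages.
        no_below = int(round(lower / 100 * vocab.num_docs))
        no_above = upper / 100
        vocab.filter_extremes(no_below=no_below, no_above=no_above)
        vocab.compactify()
        after = len(vocab)
        print(f'Reduced the vocabulary from {before} terms to {after} terms.')
    else:
        print(f'The vocabulary has {before} terms.')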
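
To make the thresholds concrete, here is a minimal sketch of document-frequency filtering on a toy corpus. The data and the helper `filter_by_document_frequency` are hypothetical and only mirror the idea behind the slider; they are not gensim's implementation.

```python
# Toy tokenised corpus (hypothetical data).
docs = [
    ["game", "card", "dice"],
    ["game", "card"],
    ["game", "board"],
    ["game", "dice", "card"],
]

def filter_by_document_frequency(docs, lower_pct, upper_pct):
    """Keep terms found in at least lower_pct % and at most upper_pct % of documents."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        term for term, freq in doc_freq.items()
        if lower_pct <= 100 * freq / n_docs <= upper_pct
    )

kept = filter_by_document_frequency(docs, lower_pct=30, upper_pct=80)
print(kept)
```

Here "game" appears in every document and "board" in only a quarter of them, so both fall outside the 30 to 80 percent window and are dropped, while "card" and "dice" are kept.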

In [None]:
#@title **Vectorizing the corpus** { run: "auto", vertical-output: true, display-mode: "form" }

#@markdown This is where the magic happens

tqdm.pandas(desc="Vectorizing corpus")
df['corpus'] = df['doc'].progress_apply(vocab.doc2bow)


# 3. Topic modelling crash course

Three important things to know about topic modelling:
1. "Topic" refers to a *__latent topic__*, which is not what we normally mean when we use the term.
2. Texts are considered random compositions of **preexisting topics**
3. The model will find exactly as many topics as we tell it to.

The most common topic model is called **Latent Dirichlet Allocation (LDA)** and uses
the Dirichlet distribution to assign probabilities to each topic-token pair.
It is also the basis for other, more specialized, topic modelling algorithms:
- Author Topic Modelling (ATM)
- Structural Topic Modelling (STM) - developed for surveys; adds metadata
- Dynamic Topic Modelling (DTM) - adds a temporal element

Since this process relies on a **random** initialisation, different instances of the model will yield different topics, even when using the exact same corpus. The topics will be similar, but not identical.
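
The "random composition" view of documents, and the role a seed plays in any random process, can be sketched as follows. This toy generator is not LDA inference; the topics, the mixture weights and the function `generate_document` are hypothetical and only illustrate the assumption that each token is drawn by first picking a topic and then picking a term from that topic.

```python
import random

# Hypothetical topics: each maps terms to probabilities that sum to 1.
topics = {
    "topic 0": {"dice": 0.5, "card": 0.3, "board": 0.2},
    "topic 1": {"trade": 0.6, "build": 0.3, "road": 0.1},
}

def generate_document(topic_mix, length, seed):
    """Draw `length` tokens: first pick a topic, then a term from that topic."""
    rng = random.Random(seed)  # a private generator, fully determined by the seed
    names = list(topic_mix)
    weights = [topic_mix[name] for name in names]
    tokens = []
    for _ in range(length):
        topic = rng.choices(names, weights=weights)[0]
        terms = list(topics[topic])
        probs = [topics[topic][term] for term in terms]
        tokens.append(rng.choices(terms, weights=probs)[0])
    return tokens

mix = {"topic 0": 0.7, "topic 1": 0.3}
first = generate_document(mix, length=8, seed=5)
again = generate_document(mix, length=8, seed=5)
print(first)
print(first == again)  # the same seed reproduces the same draws
```

Fixing the seed makes the random draws repeatable, which is the same mechanism that makes a seeded model run reproducible.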

In short, the result of an LDA model is a large table that gives, for every pairing of a topic with a term in the *vocabulary*, the probability $p_{i, j}$ of word $i$ in topic $j$:

| | topic 0 | topic 1 | ... | topic n |
| -- | --- | --- | --- | --- |
| word 0 | $p_{0, 0}$ | $p_{0, 1}$ | ... | $p_{0, n}$ |
| word 1 | $p_{1, 0}$ | $p_{1, 1}$ | ... | $p_{1, n}$ |
| ... | ... | ... | ... | ... |
| word m | $p_{m, 0}$ | $p_{m, 1}$ | ... | $p_{m, n}$ |
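
A small sketch of how such a table can be read, using made-up probabilities. The `table` values and the helper `top_terms` are hypothetical; the point is only that each topic column is a probability distribution over the vocabulary, and that a topic is usually summarised by its most probable terms.

```python
# Hypothetical probabilities: table[word][topic]; each topic column sums to 1.
table = {
    "dice":  [0.50, 0.05],
    "card":  [0.30, 0.05],
    "trade": [0.15, 0.60],
    "road":  [0.05, 0.30],
}

def top_terms(table, topic, n):
    """Return the n most probable (term, probability) pairs for one topic."""
    ranked = sorted(table.items(), key=lambda item: item[1][topic], reverse=True)
    return [(term, probs[topic]) for term, probs in ranked[:n]]

# Check that every topic column is a probability distribution.
column_sums = [sum(probs[topic] for probs in table.values()) for topic in range(2)]
print(column_sums)

print(top_terms(table, topic=0, n=2))
print(top_terms(table, topic=1, n=2))
```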




In [None]:
#@title Picking the number of topics { run: "auto", vertical-output: true, display-mode: "form" }

#@markdown Since the model will find exactly as many topics as we tell it to,
#@markdown how do we select the **correct** number of topics?

#@markdown There are essentially three approaches to this:
#@markdown 1. Rely on your knowledge of the corpus, how many topics do you expect
#@markdown to find?
#@markdown 2. Pick a few, but disparate levels: 50, 250, 500 to use for distant
#@markdown reading on different levels.
#@markdown 3. Select a few metrics and calculate which is the best fit!

#@markdown All of these are arbitrary to some degree, and which is most appropriate depends on:
#@markdown - Familiarity with the corpus
#@markdown - Size of the corpus
#@markdown - Research question


#@markdown ---
#@markdown If you are interested in the metric-based approach, run this cell to
#@markdown get a graph of three such metrics calculated for the following numbers of topics:
ntops = []
cv = []
umass = []
logper = []
topics = '3, 5, 7, 11, 13, 17, 19, 23' #@param {type: "string"}
topics = sorted(set(int(_) for _ in re.findall(r'\d+', topics)))
for ntop in tqdm(topics, desc='Calculating scores'):
    ntops.append(ntop)
    lda =gensim.models.LdaModel(corpus=df['corpus'], id2word=vocab, 
                                num_topics=ntop, alpha='auto', 
                                per_word_topics=True)
    cm = gensim.models.coherencemodel.CoherenceModel(model=lda, 
                                                     corpus=df['corpus'], 
                                                     dictionary=vocab, 
                                                     texts=df['doc'],
                                                     coherence='c_v')
    cv.append(cm.get_coherence())
    cm_u = gensim.models.coherencemodel.CoherenceModel(model=lda, 
                                                     corpus=df['corpus'], 
                                                     dictionary=vocab, 
                                                     texts=df['doc'],
                                                     coherence='u_mass')
    umass.append(cm_u.get_coherence())

    # logper.append(lda.log_perplexity(df['corpus']))
    logper.append(lda.bound(df['corpus']))

# Creating a single plot with three different y-axes
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax3 = ax1.twinx()

# Plotting each metric on a different y-axis
p1, = ax1.plot(ntops, cv, 'r', label='Coherence$_v$ - maximise')
p2, = ax2.plot(ntops, umass, 'b', label='U-mass - maximise')
# logper holds lda.bound(...), the variational lower bound, where higher is better
p3, = ax3.plot(ntops, logper, 'g--', label='Variational bound - maximise')

# removing ticks and labels on all three y-axes
for ax in [ax1, ax2, ax3]:
    ax.tick_params(left=False)
    ax.set_yticks([])

# Setting the x-tick positions first, then their labels
ax1.set_xticks(ntops)
ax1.set_xticklabels(ntops)
ax1.set_xlabel('Topics')

# Adding legend
ax1.legend(handles=[p1, p2, p3])

# Showing plot
plt.show()
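
Coherence scores are based on how often the top terms of a topic occur together in the same documents. The following is a simplified sketch of the idea behind U-mass, assuming the commonly cited pairwise form $\log\frac{D(w_i, w_j) + 1}{D(w_j)}$, where $D$ counts documents. The toy corpus and the helpers `doc_count` and `umass` are hypothetical, and gensim's implementation differs in details such as how pairs are aggregated, so the numbers will not match the plot.

```python
import math

# Toy tokenised corpus (hypothetical data), one set of terms per document.
docs = [
    {"dice", "card", "game"},
    {"dice", "card"},
    {"trade", "road", "game"},
    {"trade", "road"},
]

def doc_count(*terms):
    """Number of documents containing all the given terms."""
    return sum(1 for doc in docs if all(term in doc for term in terms))

def umass(top_terms):
    """Sum log((D(w_i, w_j) + 1) / D(w_j)) over pairs with w_j ranked before w_i.

    Assumes every term occurs in at least one document, so D(w_j) > 0.
    """
    score = 0.0
    for i in range(1, len(top_terms)):
        for j in range(i):
            joint = doc_count(top_terms[i], top_terms[j])
            score += math.log((joint + 1) / doc_count(top_terms[j]))
    return score

coherent = umass(["dice", "card"])  # terms that appear together
mixed = umass(["dice", "road"])     # terms that never co-occur
print(coherent, mixed)
print(coherent > mixed)
```

Terms that share documents score higher than terms that never meet, which is why higher values are read as more coherent topics.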

In [None]:
#@title The topic model { run: "auto", vertical-output: true, display-mode: "form" }

#@markdown With a number of topics in mind we can finally fit an LDA model to the
#@markdown preprocessed corpus. Just fill in the number of topics and the seed below.

#@markdown ###Selecting the number of topics
number_of_topics =  5#@param {type: "integer", min: 2, max: 500}

#@markdown ###A note on reproducibility
#@markdown In this instance an algorithm called __Latent Dirichlet Allocation__ (LDA) is applied. Part of the initialisation of this algorithm is randomized, which means that we cannot reliably get identical results every time we run the algorithm. By assigning a value to the _seed_ we can make sure that every time __we__ run the same algorithm with the same data and the same number of topics, the results will be identical.   
seed = 5 #@param {type: "number"}

lda = gensim.models.LdaModel(corpus=df['corpus'], id2word=vocab, num_topics=number_of_topics, alpha='auto', per_word_topics=True, random_state=seed)


cm = gensim.models.CoherenceModel(coherence='c_v', texts=df['doc'], model=lda)

top_topics = sorted([(c,i) for i,c in enumerate(cm.get_coherence_per_topic())], reverse=True)

print(f'Calculation done with {number_of_topics} topics.')

# 4. Interpreting the results

There are many ways to explore the resulting topics, and we will stick to the most basic approach:

- Looking at the top terms of the topics.

Note:
- The topics are numbered **arbitrarily** starting from zero; the numbers are no indication of which topic is *better*.

In [None]:
#@title  Picking a specific topic { run: "auto", vertical-output: true, display-mode: "form" }

#@markdown To inspect the topic select a topic number and the number of terms you want to display. 
#@markdown Remember that the topic number has to be less than the number of topics you selected when running the model above.
#@markdown It is also worth noting that it starts counting at _0_.

topic_nr =  0#@param {type: "integer"}
topic_nr = min(topic_nr, number_of_topics -1)
number_of_words =  13 #@param {type : "slider", min: 1, max: 20}
draw_cloud = True#@param {type: "boolean"}

print(f'Topic: {topic_nr}')
def print_topic_terms(topic_nr):
    for term, p in lda.show_topic(topic_nr, number_of_words):
        print(f'\t{term} (p. {p:.4f})')
print_topic_terms(topic_nr)

def cloud(cloud_topic, cloud_terms, plim=0):
    """Draw a word cloud of the top `cloud_terms` terms of topic `cloud_topic`."""
    if 0 <= cloud_topic < lda.num_topics:
        wordcloud = wc(background_color='white', max_words=cloud_terms, contour_width=3, contour_color='steelblue')
        term_ps = lda.show_topic(topicid=cloud_topic, topn=cloud_terms)
        # Only keep terms whose probability is at least plim
        term_freq_dic = {term:p for term, p in term_ps if p>=plim}
        wordcloud.generate_from_frequencies(frequencies=term_freq_dic)

        plt.figure(figsize=(7, 4))
        plt.axis("off")
        plt.imshow(wordcloud)
        plt.show()
if draw_cloud:
    cloud(topic_nr, number_of_words)

In [None]:
#@title Inspecting top topics { run: "auto", vertical-output: true, display-mode: "form" }
#@markdown Alternatively select how many of the __top__ topics and number of terms to inspect
nr_top_topics =  20#@param {type: "integer"}
number_of_words =  20 #@param {type : "slider", min: 1, max: 20}
draw = 'Cloud' #@param ["Table", "Cloud", "Both"]

for coherence, topic in top_topics[:nr_top_topics]:
    print(f'Topic: {topic} (coherence {coherence:.4f})')
    if draw != 'Cloud':
        print_topic_terms(topic)
    if draw != 'Table':
        # The second argument is the number of terms to draw, not the number of topics
        cloud(topic, number_of_words)
