# Topic Modelling in Python



This notebook guides you through all the steps required to create a topic model with Gensim LDA and NMF and demonstrates some visualisation options:
- Heatmap
- pyLDAvis
- Wordclouds






## Preparation


### Connect Google Drive

(As we did yesterday: first mount Google Drive, then change to the summer school directory.)

In [213]:
from google.colab import drive
drive.mount('/content/drive/')



Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [214]:
%cd /content/drive/MyDrive/didip_ss/

/content/drive/MyDrive/didip_ss




If you haven't done so this morning, check for updates:

In [215]:
!git pull



Already up to date.


### Import of Python packages

We will use `nltk` and `spacy` for preprocessing and `gensim` for calculating the model. `glob` is for file management, `numpy` and `pandas` are for structuring the data. `matplotlib`, `seaborn` and `pyLDAvis` are for visualisation. `re` is for regular expressions (we will use it only for one minor task, I promise).

We might need to install the `pyLDAvis` package first; this is done with `pip` in the next line.


In [None]:
!pip install pyLDAvis --quiet

In [None]:
import nltk, re, numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords

import spacy


from gensim import corpora
from gensim.models import LdaModel, LdaMulticore, Nmf
from gensim.models import CoherenceModel
import gensim

import glob
import pandas as pd


import pyLDAvis
import pyLDAvis.gensim_models  # this module was named pyLDAvis.gensim before pyLDAvis 3.3
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


### Import a Stopword list from nltk

(Adjust the value for the language variable if required)



In [None]:
nltk.download('stopwords')
nltk.download('punkt')

In [None]:
language = 'english'
stopword = stopwords.words(language)

**Optional**: Inspect the stop word list




In [None]:
print('\n'.join(stopword))
print(len(stopword))

**Optional**: Extend stop word list:




In [None]:
stopword = stopword + ['wordxyz','wordyyz']

**Alternative**: load your own stop word list from file

The input is expected to be a txt file with one word per line.

We will load a stop word list from our drive; it is in the `stopwords` folder of the `data` folder from `D02`. You can upload your own lists to that folder on your Google Drive.

In [None]:
with open("./D02/data/stopwords/Stopwords_en.txt", "r", encoding='utf8') as stopfile:
    stopword = stopfile.read().splitlines()


**Optional**: Inspect the stop word list

In [None]:
print('\n'.join(stopword))
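
One practical note on speed: the preprocessing routines below check every token against the stop word list, and a membership test on a Python list scans the list from the start each time. The following minimal sketch shows the same filter with a set instead. The names `stopword_list`, `stopword_set` and `tokens` are made up for illustration and stand in for the real data.

```python
# Stand-in for the stop word list loaded above (hypothetical example data)
stopword_list = ["the", "and", "of"]

# A set offers constant-time lookups on average, a list has to be scanned
stopword_set = set(stopword_list)

tokens = ["the", "abbot", "of", "the", "monastery"]
kept = [t for t in tokens if t not in stopword_set]
print(kept)
```

In this notebook, the same effect could be had with `stopword = set(stopword)` once the list is final, i.e. after any extensions, because a set can no longer be extended with `+`.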

## Text import and Preprocessing





### Preprocessing

In the following, we will write two different functions for preprocessing, one with `nltk`, one with `spacy`.

In both settings, the texts are tokenised, all letters are converted to lower case, punctuation marks are removed, and only words that do not appear in the stop word list are kept.

For some applications, however, it is useful to consider only words with certain part-of-speech tags (POS tags), e.g. only nouns. `spacy` is said to be faster at POS tagging, but (as far as I know) it mostly supports modern languages; therefore I also present the `nltk` routine for older languages.







So, first, the `nltk`-routine:

The text is tokenized and converted to lower case, only words consisting of alphabetic characters are kept, and finally only the words that do not appear in the stop word list are included in the output.

In [None]:
def nltk_prep(text):
    ppwords = nltk.word_tokenize(text)
    ppwords = [w.lower() for w in ppwords if w.isalpha()]
    ppwords = [w for w in ppwords if w not in stopword]

    return ppwords

Second, the `spacy`-routine:

We first have to load a language package (which should be changed of course, if you use other languages).

In [None]:
nlp = spacy.load("en_core_web_sm")

By default, spacy only processes texts of up to 1,000,000 characters. Since some of our texts are longer, we raise the limit. Be warned: this might exceed the available memory. There is a workaround at the end of the notebook, but it is slightly more complicated.

In [None]:
nlp.max_length = 2000000

Then, we do the preprocessing by converting the text to a spacy object. For each token, `.text` provides the token text and `.pos_` the POS tag (don't forget the underscore at the end).

In [None]:
def spacy_prep(text):
    doc = nlp(text)
    # Keep lower-cased alphabetic nouns that are not in the stop word list
    ppword = [t.text.lower() for t in doc
              if t.pos_ == "NOUN" and t.is_alpha and t.text.lower() not in stopword]
    return ppword

**Hint**: In some scenarios, you may also want to lemmatize the text. You can do that by replacing `t.text` with `t.lemma_` in the `spacy_prep` function.

Let's see what the two routines do with an example:



In [None]:
text = "This is a nice example text 4 preprocessing, 4thewin!"
print(nltk_prep(text))
print(spacy_prep(text))


### Text import

Now we are ready to import our texts and convert them into preprocessed tokens.

First, we create two empty lists, `docs` for the tokenized text, `docnames` for the filenames of the documents. We fill these lists by looping through our uploaded files and sending their text to either one of our preprocessing routines.

In [None]:
docs=[]
docnames=[]




directory = './D02/data/corpus_of_english_fiction/'
for filename in glob.glob(directory +"*.txt"):
    with open(filename, encoding='utf8') as textfile:
        text = textfile.read()

    print("Import file ",filename," length of text: ", len(text))


    ### Call nltk-preprocessing: Tokenize, keep only words that consist only of alphabet letters, and remove stopwords
    wd = nltk_prep(text)

    ### Alternative: Call Spacy-preprocessing: Tokenize, keep only words that consist only of alphabet letters, remove stopwords and keep only certain POS

    #wd = spacy_prep(text)

    print("Output ", len(wd)," tokens")
    ### Append the output of the preprocessing routines to the docs-list
    docs.append(wd)


    ### Append the filename to the docnames-list
    docnames.append(filename)

print("Import done!")







The previous import drew on a folder with single `txt`-files.

If you want to import a `tsv` file, you can use (and adapt) the following routine, which imports the monastery sample data. We read the file into a dataframe and obtain our `docnames` from the column `atom_id`. Our texts are the English abstracts in the column `translated_abstract_opus`, and we also build an additional list `docdates` holding the dates of the documents (which could be handy for making a timeline).

Note that, compared to the previous cell (and apart from the creation of the `docdates` list), only the first three lines after the declaration of the empty lists have changed; the rest stays the same as above.

In [None]:
docs=[]
docnames=[]
docdates=[]




text_df = pd.read_csv('./DATA/mom_1000_sample.tsv', sep='\t')
for doc,text,date in zip(text_df['atom_id'], text_df['translated_abstract_opus'],text_df['year']):
    docname = doc


    print("Import file ",docname," length of text: ", len(text))


    ### Call nltk-preprocessing: Tokenize, keep only words that consist only of alphabet letters, and remove stopwords
    wd = nltk_prep(text)

    ### Alternative: Call Spacy-preprocessing: Tokenize, keep only words that consist only of alphabet letters, remove stopwords and keep only certain POS

    #wd = spacy_prep(text)

    print("Output ", len(wd)," tokens")
    ### Append the output of the preprocessing routines to the docs-list
    docs.append(wd)


    ### Append the filename to the docnames-list
    docnames.append(docname)

    ### Append the year to the docdate-list
    docdates.append(date)

print("Import done!")

#### Inspect single preprocessed documents

The preprocessed texts are located in `docs`, a list of lists. This will be the base for our model.

**Optional:** To access the individual preprocessed texts, simply index docs. The following line outputs the third text (as usual in Python, the index count starts with 0):

In [None]:
docs[2]



### Convert preprocessed documents into Bag-of-Words-model (BOW-model)

The starting point for Gensim models is usually a bag-of-words model, i.e. a document-term matrix that records which tokens occur in each document and how often. `dct` maps every token to an integer id, and `corpus` stores each document as a list of (token id, count) pairs.


In [None]:
dct = corpora.Dictionary(docs)
corpus = [dct.doc2bow(line) for line in docs]
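
To make the bag-of-words idea concrete, here is a minimal plain-Python sketch of what a dictionary and a BOW corpus contain. It deliberately does not use Gensim: the names `toy_docs`, `token2id` and `to_bow` are made up for illustration, and Gensim's own dictionary may assign the ids in a different order.

```python
from collections import Counter

# Two tiny, already preprocessed documents (hypothetical example data)
toy_docs = [["king", "castle", "king"], ["castle", "monk", "abbey"]]

# Assign an integer id to every distinct token, in order of first appearance
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Note that the word order inside each document is lost in this representation; only the counts remain.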


## Training of LDA-Model

We will train our first model with `LdaMulticore` from the `gensim` package.

Before training, we set the number of topics and store it in a variable (for convenience reasons). You can change the number here and experiment with different numbers of topics.






In [None]:
num_topics = 20

Now we train our model on the basis of our BOW corpus and our dictionary. (This might take a while.)

Besides the input and `num_topics`, we have the following parameters:

- `passes` specifies the number of training passes over the whole corpus.
- `chunksize` sets how many documents are processed together in each training chunk (here 500).

The so-called priors, `alpha` and `eta`, are especially interesting to experiment with.
- `alpha` is the prior on the document-topic distributions: the higher the value, the more widely the topics are scattered across the documents.
- `eta` is the prior on the topic-word distributions: the higher the value, the more widely each topic is scattered across the words.

The default values for `alpha` and `eta` in Gensim are `1/num_topics`.
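
To get a feeling for what `alpha` does, the following sketch draws random topic distributions from a symmetric Dirichlet prior and reports how large the strongest topic is on average. It only illustrates the prior, not a trained model, and it uses the standard library instead of Gensim. The helper names `sample_dirichlet` and `mean_top_share` and the two `alpha` values are assumptions chosen for the demonstration.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one topic distribution from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_top_share(alpha, k=20, n=500, seed=0):
    """Average share of the strongest topic over n sampled distributions."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n


low = mean_top_share(0.05)  # small alpha: one topic tends to dominate
high = mean_top_share(5.0)  # large alpha: topics are spread more evenly
print(f"alpha=0.05: {low:.2f}   alpha=5.0: {high:.2f}")
```

With a small `alpha`, most of the probability mass sits on a single topic; with a large `alpha`, it is shared among many topics.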


In [None]:
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=dct,
                                       num_topics = num_topics,
                                       passes=70,
                                       chunksize=500,
                                       alpha = 1/num_topics,
                                       eta = None)


#### Exploration of Topics



The topics and their most relevant words can be output with `print_topics()`

In [None]:
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic: {idx} \nWords: {topic}\n")


Get the most dominant topic for each text:

In [None]:
for i, corp in enumerate(corpus):
    top_topics = lda_model.get_document_topics(corp)
    top_topic = sorted(top_topics, key=lambda x: x[1], reverse=True)[0]
    print(f"Text {i + 1} - Dominant topic: {top_topic[0]} (Score: {top_topic[1]:.2f})")
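
The selection step above can also be written with `max()`, which avoids sorting the whole list just to read its first entry. A minimal sketch on made-up numbers (the list `doc_topics` only imitates the shape of the `get_document_topics` output, a list of `(topic_id, probability)` pairs):

```python
# Imitation of the output for one document (made-up numbers)
doc_topics = [(0, 0.12), (3, 0.71), (7, 0.17)]

# Pick the pair with the highest probability
topic_id, score = max(doc_topics, key=lambda pair: pair[1])
print(f"Dominant topic: {topic_id} (Score: {score:.2f})")
```

As far as I know, `get_document_topics` leaves out topics below a small probability threshold unless `minimum_probability=0` is passed, which is why the list here does not contain all topic ids.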

Get topic distribution for each text:

(We can use this to produce a heat map, see below)

In [None]:
for i, corp in enumerate(corpus):
    top_topics = lda_model.get_document_topics(corp, minimum_probability=0)
    print(re.sub(r'.*/(.*)\.txt',r'\1',docnames[i]))
    print(top_topics)

Get the relevance of a certain word for each topic:

In [None]:
lda_model.get_term_topics("thing", minimum_probability=0)

Get the representation for a single topic:

In [None]:
wot = lda_model.get_topic_terms(0, topn=15)
for w in wot:
    print(dct.get(w[0]), w[1])

### Evaluation of the model

One way of evaluating the quality of a topic model is to check its coherence. Gensim provides `CoherenceModel` for this, as shown below. The higher the score, the more coherent the topics are. The calculation takes a little time.



In [None]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=docs, dictionary=dct, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
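
Coherence measures rest on how often the top words of a topic occur together in the same documents. The sketch below computes a simplified UMass-style score by hand on toy data to show the idea. It is neither the `c_v` measure used above nor Gensim's implementation; the function `umass_coherence` and the toy documents are assumptions for illustration only.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Simplified UMass-style coherence for one topic.

    Sums log((D(w_i, w_j) + 1) / D(w_i)) over all word pairs, where w_i
    ranks above w_j and D counts the documents containing the words.
    Every word in top_words must occur in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    # combinations() keeps the input order, so `earlier` ranks above `later`
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


toy_docs = [["king", "castle", "knight"],
            ["king", "castle"],
            ["monk", "abbey"],
            ["monk", "abbey", "king"]]

coherent = umass_coherence(["king", "castle"], toy_docs)
mixed = umass_coherence(["king", "abbey"], toy_docs)
print(coherent, mixed)
```

Words that often share a document ("king" and "castle") score higher than words that rarely do ("king" and "abbey").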

## Visualization of the model

### Visualization in Heatmap

The heatmap shows the share of each topic in each document.

We use the `get_document_topics` function as seen above to generate a dataframe and visualize it with a seaborn heatmap.

In [None]:
t_dist=[]

for corp in corpus:
    top_dist = lda_model.get_document_topics(corp, minimum_probability=0)
    t_dist.append([v[1] for v in top_dist])

t_dist

df = pd.DataFrame(t_dist, [re.sub(r'.*/(.*)\.txt',r'\1',n) for n in docnames])

ax = sns.heatmap(df, linewidth=0.5)
plt.yticks(rotation=0)
plt.show()
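
If the documents carry dates (as the `docdates` list from the `tsv` import does), the same per-document distributions can be averaged per year as a starting point for a timeline. The following is a minimal sketch on made-up data: `toy_dist` stands in for `t_dist`, `toy_dates` for `docdates`, and the names `by_year` and `timeline` are hypothetical.

```python
from collections import defaultdict

# Made-up stand-ins: topic shares per document and one year per document
toy_dist = [[0.8, 0.2], [0.4, 0.6], [0.1, 0.9]]
toy_dates = [1250, 1250, 1300]

# Collect the topic distributions of all documents from the same year
by_year = defaultdict(list)
for year, dist in zip(toy_dates, toy_dist):
    by_year[year].append(dist)

# Average each topic column within a year
timeline = {year: [sum(col) / len(col) for col in zip(*dists)]
            for year, dists in sorted(by_year.items())}

for year, shares in timeline.items():
    print(year, [round(s, 2) for s in shares])
```

The resulting dictionary (one list of average topic shares per year) could then be turned into a dataframe and plotted in the same way as the heatmap above.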



### Using the pyLDAvis package

pyLDAvis is a visually attractive tool for displaying the topic distribution in a two-dimensional space (see below for more details on the visualisation).

First, the output for Jupyter notebooks (such as this one) is activated, next the pyLDAvis model is prepared and stored in the variable `vis`, which can then simply be called:

In [None]:
pyLDAvis.enable_notebook()


In [None]:
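import pyLDAvis.gensim_models  # pyLDAvis >= 3.3; older releases name this module pyLDAvis.gensim
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dct)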


In [None]:
vis


Explanation of the visualization:

Left side:

pyLDAvis places the topics on a two-dimensional plane. By default it uses multidimensional scaling (principal coordinate analysis on the Jensen-Shannon divergence between the topics), so that topics with similar word distributions end up close together.

The larger the circle representing a topic, the greater its share of the texts in the corpus.

Right side:

The blue bars show the overall frequency of each word in the entire corpus; the red bars (which appear when hovering over a topic circle with the mouse or when selecting a topic in the top left-hand navigation) show its estimated frequency within the selected topic.



### Visualization in Wordclouds

The following code is adapted from the tutorial https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/

We create a table with Wordclouds for every topic, that visualize the most relevant words in a topic in a font size according to that relevance (thus, the largest word is the most relevant).

First, we import some packages:

In [None]:
from matplotlib import pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import matplotlib.colors as mcolors

Next, we specify the number of topics to be displayed. We use the number of topics in the model as the default, but it is also possible to show fewer clouds than the total number of topics in the model.

To modify, simply change the value of the variable. The number of rows or columns below may then have to be adjusted as well. The following assumes 20 topics, which are output in 4 rows of 5 word clouds (feel free to write a routine that automatically calculates the distribution of rows and columns).

In [None]:
number_topics = num_topics
table_cols = 5
table_rows = 4
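
One possible routine for calculating the rows and columns automatically is sketched below. The function name `grid_shape` and the default of at most 5 columns are assumptions; the function expects at least one topic.

```python
import math


def grid_shape(n_topics, max_cols=5):
    """Return (rows, cols) for a grid with room for n_topics word clouds."""
    cols = min(n_topics, max_cols)
    rows = math.ceil(n_topics / cols)
    return rows, cols


print(grid_shape(20))  # (4, 5)
print(grid_shape(7))   # (2, 5)
```

It could be used as `table_rows, table_cols = grid_shape(number_topics)`. Note that the last row may contain empty cells (7 topics in a 2 x 5 grid leave 3 cells unused), so a plotting loop that runs over all cells would have to skip those beyond the number of topics.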

In the tutorial from which the code originates, each word cloud is displayed in a different colour. As this display is not always easy to read, I implemented a small workaround that sets black as the colour for all word clouds: a list with one entry per topic is created, each holding the hex colour code for black. If you want a different colour, adjust the colour code `#000000` accordingly.

In [None]:
cols = ["#000000"] * number_topics

However, if you prefer a multi-coloured display as in the tutorial, simply remove the comment marker `#` in the following line and execute it. (This overrides the previous line of code.)

The colours are taken from the XKCD_COLORS colour list and can be changed, but there must be enough colours for the number of topics to be displayed.

In [None]:
#cols = [color for name, color in mcolors.XKCD_COLORS.items()]

Next, the cloud is calculated.

The parameters are largely self-explanatory and can be customised

In [None]:
cloud = WordCloud(stopwords=set(stopword),  # our stop word list, not the nltk `stopwords` module
                  background_color='white',
                  width=250,
                  height=180,
                  max_words=15,
                  colormap='Dark2',
                  color_func=lambda *args, **kwargs: cols[i],
                  prefer_horizontal=1.0)


In the following line, the number of desired words can be adjusted with `num_words` (but be aware that the number should not exceed the value of `max_words` from the previous command that generates the cloud).

In [None]:
topics = lda_model.show_topics(num_topics=number_topics, num_words= 15,formatted=False)


Finally, the table is pre-structured and plotted.

In the for loop, the font size can be regulated with max_font_size.

Within Python, the first topic has index 0, the second has index 1 and so on. For a more attractive display, 1 is added to each of the figure titles (str(i+1)).


In [None]:
fig, axes = plt.subplots(table_rows, table_cols, figsize=(10,10), sharex=True, sharey=True)

for i, ax in enumerate(axes.flatten()):
    fig.add_subplot(ax)
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=400)
    plt.gca().imshow(cloud)
    plt.gca().set_title('Topic ' + str(i+1), fontdict=dict(size=16))
    plt.gca().axis('off')


plt.subplots_adjust(wspace=0, hspace=0)
plt.axis('off')
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()

#### Visualize a single topic in a Wordcloud

`mytopic` contains the number of the topic (indexed with the internal count that starts at 0. For the title of the visualization, this count is increased by one).

Thus, to display the word clouds for other topics, simply adjust the value of `mytopic`.

In [None]:
mytopic=0

topic_words = dict(topics[mytopic][1])
cloud.generate_from_frequencies(topic_words, max_font_size=400)
plt.gca().imshow(cloud)
plt.gca().set_title('Topic ' + str(mytopic+1), fontdict=dict(size=16))
plt.gca().axis('off')

## Other methods

### Non-Negative Matrix factorization

Implementation with Gensim is as easy as LDA:

In [None]:
nmf_model = gensim.models.Nmf(corpus, num_topics = num_topics, id2word=dct, passes=70, chunksize=500)

Once the model is calculated, we can reuse most of the code above by replacing the variable `lda_model` with `nmf_model`. (The pyLDAvis visualisation is an exception, since it expects an LDA model.)

For instance, we can show the topics with their most relevant words:

In [None]:
for idx, topic in nmf_model.print_topics(-1):
    print(f"Topic: {idx} \nWords: {topic}\n")

... or show the topic distribution

In [None]:
for i, corp in enumerate(corpus):
    top_topics = nmf_model.get_document_topics(corp, minimum_probability=0)
    print(re.sub(r'.*/(.*)\.txt',r'\1',docnames[i]))
    print(top_topics)

## Leftovers

### Split longer text into pieces for spacy


In [None]:
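    # Drop-in replacement for the line `wd = spacy_prep(text)` inside the import loop
    chunk_size = 1000000  # spacy's default limit (nlp.max_length)
    wd = []
    for start in range(0, len(text), chunk_size):
        # Slice ends are exclusive, so consecutive chunks neither overlap nor skip characters
        print(start, min(start + chunk_size, len(text)))
        # extend (not append) keeps wd a flat list of tokens
        wd.extend(spacy_prep(text[start:start + chunk_size]))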
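
Cutting at fixed character offsets can split a word in two at each chunk boundary. A possible refinement is to cut at the last whitespace before the limit. The helper below is a plain-Python sketch of that idea; the name `split_text` is an assumption, and the function falls back to a hard cut if a window contains no whitespace.

```python
def split_text(text, max_len=1000000):
    """Split text into chunks of at most max_len characters, cutting at whitespace where possible."""
    chunks = []
    start = 0
    while len(text) - start > max_len:
        # Look for the last space or line break inside the current window
        cut = max(text.rfind(" ", start, start + max_len),
                  text.rfind("\n", start, start + max_len))
        if cut <= start:
            # No usable whitespace found: fall back to a hard cut
            cut = start + max_len
        chunks.append(text[start:cut])
        start = cut
    chunks.append(text[start:])
    return chunks


sample = "alpha beta gamma delta"
print(split_text(sample, max_len=11))
```

Inside the import loop, the chunks could then be preprocessed one by one, e.g. `wd = [tok for chunk in split_text(text) for tok in spacy_prep(chunk)]`.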