# Topic Modelling Fundamentals by Example

---
---
**Let's dive right in! 🏊🏽‍♀️**

We have a corpus of 56 presidential speeches that we've prepared into clean lemmas.

> Our question: **What are the underlying themes of these texts as a group?**

---
---
## Loading the Corpus into Gensim from Files

Get a list of all the files in the `data/inaugural` folder:

In [50]:
from pathlib import Path
inaugural = Path('data', 'inaugural')
files = list(inaugural.iterdir())

Open all the files in turn and add their contents to a big list called `text`, where each document is itself a list of tokens:

In [51]:
text = []
for file in files:
    with open(file, 'r') as reader:
        # Each document becomes a list of whitespace-separated tokens
        text.append(reader.read().split())

print(f'There are {len(text)} documents loaded.')

There are 56 documents loaded.


(👆👆👆 If you don't understand this code above yet, don't worry. You can skip over it and still follow along with the topic modelling.)
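To make the shape of `text` concrete, here is a minimal, self-contained sketch of the same loading step. The helper name `load_documents`, the file names and the two tiny "speeches" are made up for illustration; the real notebook reads from `data/inaugural`.

```python
import tempfile
from pathlib import Path

def load_documents(folder):
    """Read every file in `folder` and split it into whitespace-separated tokens."""
    documents = []
    for file in sorted(Path(folder).iterdir()):  # sorted only to get a stable order
        documents.append(file.read_text(encoding='utf-8').split())
    return documents

# Two made-up files stand in for the real corpus (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'a.txt').write_text('citizen nation liberty', encoding='utf-8')
    Path(tmp, 'b.txt').write_text('nation union', encoding='utf-8')
    toy_text = load_documents(tmp)

print(toy_text)
print(f'There are {len(toy_text)} documents loaded.')
```

The result is a list of documents, and each document is a list of tokens. That nested list is the input format the rest of the notebook relies on.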

### Loading the Tokens into a Dictionary

Now we have to load the tokens into a _dictionary_. Gensim provides a special class of dictionary for us to work with called `gensim.corpora.Dictionary`. It has some extra stuff in it over and above the ordinary Python `dict`, but we don't need to worry about the details.

In [52]:
from gensim.corpora import Dictionary

dictionary = Dictionary(text)
str(dictionary)

"Dictionary(6164 unique tokens: ['2', 'abandon', 'abide', 'abundance', 'abundantly']...)"

From this we can understand that Gensim has found 6164 unique tokens in the corpus. But exactly what information does this `Dictionary` contain?

In short, it is a _mapping_ between each _token_ and a _unique id number_:

In [53]:
dictionary.token2id

{'2': 0,
 'abandon': 1,
 'abide': 2,
 'abundance': 3,
 'abundantly': 4,
 'achieve': 5,
 'action': 6,
 'add': 7,
 'advance': 8,
 'allow': 9,
 'altar': 10,
 'america': 11,
 'american': 12,
 'americans': 13,
 'ancient': 14,
 'and': 15,
 'ant': 16,
 'article': 17,
 'aside': 18,
 'ask': 19,
 'aspire': 20,
 'await': 21,
 'barely': 22,
 'battalion': 23,
 'bear': 24,
 'before': 25,
 'belief': 26,
 'believe': 27,
 'believer': 28,
 'belong': 29,
 'bend': 30,
 'betray': 31,
 'bill': 32,
 'bind': 33,
 'body': 34,
 'bounty': 35,
 'brave': 36,
 'brighten': 37,
 'bring': 38,
 'build': 39,
 'but': 40,
 'call': 41,
 'candle': 42,
 'capitalist': 43,
 'cause': 44,
 'century': 45,
 'change': 46,
 'changeless': 47,
 'character': 48,
 'child': 49,
 'choice': 50,
 'citizen': 51,
 'city': 52,
 'clamor': 53,
 'clear': 54,
 'clerk': 55,
 'close': 56,
 'color': 57,
 'come': 58,
 'common': 59,
 'companion': 60,
 'conceive': 61,
 'conquer': 62,
 'constantly': 63,
 'continent': 64,
 'control': 65,
 'conviction': 66
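If the idea of a token-to-id mapping feels abstract, the plain-Python sketch below builds one by hand. It is only an illustration of the concept, not Gensim's implementation: the function name `build_token2id` and the toy documents are made up, and Gensim may assign ids in a different order.

```python
def build_token2id(documents):
    """Give each distinct token the next free integer id, in order of first appearance."""
    token2id = {}
    for document in documents:
        for token in document:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# Hypothetical toy corpus: a list of tokenised documents.
toy_text = [['citizen', 'nation', 'liberty'], ['nation', 'union']]
toy_token2id = build_token2id(toy_text)
print(toy_token2id)
```

Each token appears once in the mapping no matter how many times it occurs in the corpus, which is why the dictionary reports _unique_ tokens.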

### Vocabulary Size and Filtering Extremes

The tokens are collectively known as a _vocabulary_ and the size and nature of the vocabulary you choose is important. A large vocabulary will take more computational power and memory to analyse. A vocabulary with many rare words has less useful information in it (so you are wasting time and memory processing it).

To reduce the size of the vocabulary and increase the density of information content, we can filter out the extremes with `filter_extremes()`. You can experiment with different values, but here we filter out tokens that appear in fewer than 5 documents (`no_below=5`) or in more than 50% of the documents (`no_above=0.5`):

In [54]:
dictionary.filter_extremes(no_below=5, no_above=0.5)
str(dictionary)

"Dictionary(1679 unique tokens: ['2', 'abandon', 'abide', 'abundance', 'achieve']...)"

Now we have 1679 unique tokens, compared with 6164 for the original `Dictionary`.
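The filtering rule can be sketched in plain Python. This is a simplified illustration under our own assumptions (the function name `filter_extremes_sketch` and the toy documents are made up, and Gensim's exact rounding of the upper threshold may differ), but it shows that both limits are based on the number of _documents_ a token appears in.

```python
from collections import Counter

def filter_extremes_sketch(documents, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for document in documents:
        doc_freq.update(set(document))  # count each token once per document
    num_docs = len(documents)
    return sorted(
        token for token, df in doc_freq.items()
        if df >= no_below and df / num_docs <= no_above
    )

# Hypothetical toy corpus of four tokenised documents.
toy_text = [
    ['nation', 'liberty', 'union'],
    ['nation', 'liberty', 'war'],
    ['nation', 'union', 'peace'],
    ['nation', 'liberty', 'trade'],
]
kept = filter_extremes_sketch(toy_text, no_below=2, no_above=0.75)
print(kept)
```

Here 'nation' is dropped because it occurs in every document, and 'war', 'peace' and 'trade' are dropped because they each occur in only one.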

### Saving the Dictionary To File

Gensim provides an easy way to save the `Dictionary` to file so you can reload it later.

In [55]:
dict_file = str(Path('data', 'saved', '1-dictionary.gensim')) # Transform Path to string as `save()` only accepts strings
dictionary.save(dict_file)

You can check that we now have a file named `1-dictionary.gensim` in the `data/saved` folder. NB: This file is not human-readable.

### Loading the Dictionary From File

Here is how you can load the dictionary, or any other object you create in Gensim, back into the notebook:

In [56]:
dictionary = Dictionary.load(dict_file)
str(dictionary)

"Dictionary(1679 unique tokens: ['2', 'abandon', 'abide', 'abundance', 'achieve']...)"

Once we have the corpus loaded we can start to analyse its contents. The first step is to count the words to create a **bag-of-words** corpus.

---
---
## Bag of Words Corpus

>**The key to understanding Natural Language Processing (NLP) is that the computer can only do computations on _numbers_. We have to present our corpus for analysis in a numerical form — typically _vectors_ — and make human sense of everything at the end.**

A bag-of-words (BoW) corpus is the _vocabulary_ of known tokens (words) in the corpus together with some _measure_ of how often they occur. The measurement may be:
* binary (presence or absence)
* count (how many times the token occurs)
* frequency (count divided by the total number of tokens).
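As a rough illustration of the three measures listed above, the sketch below computes each of them for one made-up document against a small made-up vocabulary, using only the standard library.

```python
from collections import Counter

# Hypothetical document and vocabulary, for illustration only.
document = ['nation', 'liberty', 'nation', 'union', 'nation']
vocabulary = ['liberty', 'nation', 'trade', 'union']

counts = Counter(document)
total = len(document)

binary = {token: int(counts[token] > 0) for token in vocabulary}   # present or absent
count = {token: counts[token] for token in vocabulary}             # raw occurrences
frequency = {token: counts[token] / total for token in vocabulary} # share of all tokens

print(binary)
print(count)
print(frequency)
```

All three describe the same document; they differ only in how much weight repeated words are given.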

In our example, we will use Gensim's `doc2bow()`, which simply counts the tokens:

In [57]:
corpus = [dictionary.doc2bow(doc) for doc in text]
corpus[0]

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 2),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 2),
 (10, 1),
 (11, 1),
 (12, 1),
 (13, 2),
 (14, 1),
 (15, 1),
 (16, 2),
 (17, 1),
 (18, 2),
 (19, 1),
 (20, 1),
 (21, 1),
 (22, 4),
 (23, 1),
 (24, 1),
 (25, 1),
 (26, 1),
 (27, 2),
 (28, 1),
 (29, 1),
 (30, 1),
 (31, 1),
 (32, 2),
 (33, 1),
 (34, 1),
 (35, 6),
 (36, 1),
 (37, 1),
 (38, 1),
 (39, 1),
 (40, 1),
 (41, 1),
 (42, 1),
 (43, 1),
 (44, 2),
 (45, 1),
 (46, 1),
 (47, 1),
 (48, 1),
 (49, 2),
 (50, 1),
 (51, 1),
 (52, 1),
 (53, 1),
 (54, 2),
 (55, 1),
 (56, 2),
 (57, 1),
 (58, 1),
 (59, 1),
 (60, 2),
 (61, 3),
 (62, 1),
 (63, 1),
 (64, 1),
 (65, 1),
 (66, 1),
 (67, 1),
 (68, 1),
 (69, 1),
 (70, 1),
 (71, 2),
 (72, 1),
 (73, 1),
 (74, 1),
 (75, 1),
 (76, 2),
 (77, 1),
 (78, 1),
 (79, 1),
 (80, 1),
 (81, 1),
 (82, 1),
 (83, 1),
 (84, 5),
 (85, 1),
 (86, 1),
 (87, 1),
 (88, 2),
 (89, 1),
 (90, 1),
 (91, 1),
 (92, 1),
 (93, 1),
 (94, 2),
 (95, 1),
 (96, 1),
 (97, 1),
 (98, 1),
 (99, 2),
 (100, 1),

What `corpus[0]` shows us is a list of _token ids_ and their _counts_ for the first document in the corpus. For example, `(7, 1)` is the token id `7` and its count `1`, i.e. the token was found once in this document.

You can look at the counts for any of the documents from 0-55 by changing the index number (remember indexing starts at 0).
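Token ids are hard to read on their own, so it helps to see how the pairs are produced and how to turn them back into words. The sketch below is a plain-Python approximation of the counting step, not Gensim's code; `doc2bow_sketch` and the toy mapping are made up for illustration. It assumes, as Gensim's `doc2bow()` does by default, that tokens missing from the dictionary are simply ignored.

```python
from collections import Counter

# Hypothetical token-to-id mapping and its reverse.
token2id = {'liberty': 0, 'nation': 1, 'union': 2}
id2token = {token_id: token for token, token_id in token2id.items()}

def doc2bow_sketch(document, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in document if token in token2id)
    return sorted(counts.items())

bow = doc2bow_sketch(['nation', 'liberty', 'nation', 'war'], token2id)
print(bow)

# Swap each id back for its token to make the counts readable.
readable = [(id2token[token_id], count) for token_id, count in bow]
print(readable)
```

In the notebook itself, the equivalent lookup would be something like `dictionary[token_id]` for each pair in `corpus[0]`.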

### The 'Bag' in Bag of Words

In this basic BoW model the order and location of words are discarded. For example, it does not matter if the words 'red' and 'nose' are adjacent ('red nose'), or at the beginning or end of a sentence; BoW just treats the words individually. It is like a 'bag' of Scrabble™ tiles, where each tile is a word, all rattling around together in no particular order.
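A quick way to convince yourself that order is thrown away is to count the words of two made-up sentences that use the same words in a different order:

```python
from collections import Counter

# Two hypothetical sentences containing the same words in a different order.
first = 'the red nose glows'.split()
second = 'glows the nose red'.split()

same_bag = Counter(first) == Counter(second)
print(same_bag)  # → True
```

As far as a bag-of-words model is concerned, these two sentences are identical.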

---
## Going Further: Term Frequency–Inverse Document Frequency (TF-IDF)
Highly frequent words can come to dominate your results and yet they may not be as meaningful or interesting as rarer words. For example, if you are looking at articles about the history of the Moon landings, even if you have removed all the stopwords, you may well find that the words 'lunar', 'moon', 'landing', 'orbit', and 'earth' predominate. Subtle differences in topic between documents may be lost.

We have already done some filtering of extremes (above) by removing tokens that appeared in fewer than 5 documents or in more than 50% of the documents.

Another way to deal with this is to use a _weighting factor_ called **TF-IDF**. A value is calculated for each word in each document that combines:
* Term frequency (TF) - the number of times the word appears in the document (more occurrences raise the weight)
* Document frequency (DF) - the number of documents in the corpus that contain the word (the weight uses its _inverse_, so words found in many documents are weighted down)

For example, if a very uncommon word is present in two documents, this word is weighted more highly than a word that is present in all documents in a corpus.
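To see how the weighting behaves, here is a minimal sketch of one common TF-IDF variant: the raw count multiplied by log2(N / DF), with each document's weights then scaled to unit length. We believe this is close to Gensim's default behaviour, but treat that as an assumption; the function name `tfidf_sketch` and the toy corpus are made up for illustration.

```python
import math

def tfidf_sketch(bow_corpus):
    """Weight each (token_id, count) pair by count * log2(N / DF),
    then scale every document to unit (L2) length."""
    num_docs = len(bow_corpus)
    doc_freq = {}
    for bow in bow_corpus:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_corpus = []
    for bow in bow_corpus:
        weights = [
            (token_id, count * math.log2(num_docs / doc_freq[token_id]))
            for token_id, count in bow
        ]
        norm = math.sqrt(sum(weight * weight for _, weight in weights))
        if norm > 0:
            weights = [(token_id, weight / norm) for token_id, weight in weights]
        weighted_corpus.append(weights)
    return weighted_corpus

# Hypothetical BoW corpus: token 0 is in every document, tokens 2 and 3 are rare.
toy_corpus = [
    [(0, 1), (1, 2)],
    [(0, 1), (2, 1)],
    [(0, 3), (1, 1)],
    [(0, 1), (3, 1)],
]
weighted = tfidf_sketch(toy_corpus)
for document in weighted:
    print(document)
```

Token `0` appears in all four documents, so its weight is zero everywhere, even in the document where it occurs three times. The sketch keeps those zero weights for clarity; a real implementation may drop them.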

Gensim provides `TfidfModel`, which accepts a BoW corpus and creates the weightings for each token in each document:

In [58]:
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)
tfidf.num_docs

56

We can take a peek at the weightings for any document like this, where the first number in each pair (tuple) is the _token id_ and the second is the _weighting_:

In [59]:
for doc in tfidf[corpus[0]]:
    print(doc)

(0, 0.04658544133182475)
(1, 0.049138727352675916)
(2, 0.0442214677128228)
(3, 0.07706534495155763)
(4, 0.06896016880051575)
(5, 0.028385604597972087)
(6, 0.049138727352675916)
(7, 0.03448008440025788)
(8, 0.032843877238734824)
(9, 0.10990922219029248)
(10, 0.07706534495155763)
(11, 0.049138727352675916)
(12, 0.07124946120908732)
(13, 0.08404132185143978)
(14, 0.03448008440025788)
(15, 0.07124946120908732)
(16, 0.046541654027646435)
(17, 0.031287517543396454)
(18, 0.14249892241817463)
(19, 0.07124946120908732)
(20, 0.029803573420781278)
(21, 0.023270827013823218)
(22, 0.11921429368312511)
(23, 0.0442214677128228)
(24, 0.03802807606047904)
(25, 0.027027993496264512)
(26, 0.03448008440025788)
(27, 0.13266440313846842)
(28, 0.06633220156923421)
(29, 0.05495461109514624)
(30, 0.04658544133182475)
(31, 0.032843877238734824)
(32, 0.0442214677128228)
(33, 0.029803573420781278)
(34, 0.024474707475413353)
(35, 0.46239206970934577)
(36, 0.027027993496264512)
(37, 0.049138727352675916)
(38, 0.028

This model could be used instead of the BoW model as input to topic modelling algorithms.

---

---
---

## Latent Dirichlet Allocation (LDA)
If you've heard of topic modelling before, you may have heard of Latent Dirichlet Allocation. LDA is a popular statistical model for topics and one that is almost synonymous with topic modelling in general.

However, it's important to understand that LDA is only _one_ type of topic model; there are many others with equally dull acronyms (e.g. LSA, HDP, LSI, NNMF). Also, Gensim provides an _implementation_ of LDA called `LdaModel` (based on the LDA mathematics), but there are many other implementations in different libraries and software. Different implementations of LDA should give you more or less the same results, but different topic models may well give you different results.

**Let's get started with our topic modelling!**

### Training the LDA Model

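We `import` the `LdaModel`, pass it our BoW `corpus` and set the number of topics to `5`. The `id2word` argument lets the model show words instead of token ids, and `passes` is the number of times training runs through the whole corpus. Feel free to experiment with the number of topics.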

In [60]:
from gensim.models.ldamodel import LdaModel

ldamodel = LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)

lda_file = str(Path('data', 'saved', '2-lda.gensim'))
ldamodel.save(lda_file)

### Topics for the Whole Corpus

> **NB: The topics, words and probabilities you see may differ from the examples below, because LDA training starts from a random initialisation. Passing a fixed `random_state` to `LdaModel` should make your runs repeatable.**

To view the topics we can use `show_topics` — and we can optionally limit it to the number of topics and words we are interested in:

In [61]:
ldamodel.show_topics(num_topics=5, num_words=5)

[(0,
  '0.005*"object" + 0.005*"institution" + 0.005*"executive" + 0.005*"opinion" + 0.004*"general"'),
 (1,
  '0.007*"counsel" + 0.006*"wish" + 0.006*"mankind" + 0.006*"help" + 0.005*"fear"'),
 (2,
  '0.008*"congress" + 0.007*"business" + 0.004*"race" + 0.004*"increase" + 0.004*"ought"'),
 (3,
  '0.011*"today" + 0.009*"americans" + 0.008*"century" + 0.007*"democracy" + 0.007*"child"'),
 (4,
  '0.007*"help" + 0.007*"problem" + 0.006*"face" + 0.006*"moral" + 0.005*"leadership"')]

What can we see here? These 5 topics were learned from the **corpus as a whole**, and each one is a probability distribution over the words in the vocabulary.

Let's take an example topic (this one comes from a different training run, so it does not exactly match the output above):

```
(3, '0.008*"congress" + 0.007*"business" + 0.005*"increase" + 0.004*"trade" + 0.004*"ought"')
```

The first number `3` in the tuple is the topic number. In front of each word is the probability of that word within the topic. For example, `0.007*"business"` means that 'business' has a probability of 0.007 in this topic, i.e. the topic is 0.7% business-y.

Overall, the topic _appears_ to be about the role of congress in increasing business and trade.
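The topic strings are convenient to read but awkward to work with in code. As a rough sketch, the snippet below splits one of the strings from the output above into `(word, probability)` pairs with a regular expression. The helper name `parse_topic` is made up for illustration.

```python
import re

# Topic 3 as printed by show_topics() in the output above.
topic = '0.011*"today" + 0.009*"americans" + 0.008*"century" + 0.007*"democracy" + 0.007*"child"'

def parse_topic(topic_string):
    """Turn '0.011*"today" + ...' into [('today', 0.011), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(probability)) for probability, word in pairs]

terms = parse_topic(topic)
print(terms)
print(round(sum(probability for _, probability in terms), 3))
```

The five probabilities shown add up to only about 0.042, because the rest of the topic's probability is spread over the remaining words in the vocabulary. In practice you may not need to parse strings at all: `LdaModel` also has a `show_topic()` method that, as far as we know, returns the pairs directly.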

### Topics for an Individual Document
We can ask the model to give us the topic distribution for any individual document.

For example, if we pass in document `10` from the BoW corpus, it gives us two topics for that document: it is about 40% topic `2` and 60% topic `3`.

In [62]:
ldamodel.get_document_topics(corpus[10])

[(2, 0.40281704), (3, 0.59535265)]
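Once you have a list of `(topic, share)` pairs like this, picking out the dominant topic takes one line of plain Python. The sketch below uses the output shown above as its input.

```python
# The (topic_id, share) pairs returned for document 10 above.
doc_topics = [(2, 0.40281704), (3, 0.59535265)]

# The dominant topic is the pair with the largest share.
dominant_topic, dominant_share = max(doc_topics, key=lambda pair: pair[1])
total_share = sum(share for _, share in doc_topics)

print(dominant_topic, dominant_share)
print(total_share)
```

Notice that the shares do not quite add up to 1. Our assumption is that this is because topics with a very small share fall below the method's `minimum_probability` threshold and are left out of the result.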

## Visualising Topics with pyLDAvis
Understanding the data that underlies a topic model is vital, but fortunately we also have a more human-friendly option to help us interpret the topics!

[pyLDAvis](https://github.com/bmabey/pyLDAvis) is a library for creating interactive topic model visualisations. It even has a helper function specifically for Gensim that we can use.

In [63]:
# Silence an annoying warning we cannot do anything about
import warnings
warnings.filterwarnings('ignore')

# pyLDAvis code starts here
# NB: in newer pyLDAvis releases this module is named `pyLDAvis.gensim_models`
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)

Here are some hints to help you interpret the visualisation:

* On the **left-hand side** is a scatterplot of some bubbles:
 * Each **bubble** represents a topic.
 * The **size of a bubble** represents how _prevalent_ or popular the topic is overall.
 * The **distance** from one bubble to another represents how similar the topics are to each other. If they overlap then the topics share significant similarity.
 
* On the **right-hand side** is a bar chart of terms (tokens):
 * Select a bubble and it shows the top-30 **most relevant terms** for that topic.
 * The **red bar** represents how frequent a term is in the topic.
 * The **blue bar** represents how frequent the term is overall in all topics. So a long red bar with only a short blue bar indicates a term that is highly specific to that particular topic. Conversely, a red bar with a long blue bar means the term is also present in many other topics.
 * Mousing over a particular term resizes the bubbles to show the relative frequency of that term in the various topics.
 * Adjusting the slider changes the **_relevance_ value (λ)**, which is the weight given to whether a term appears exclusively in a particular topic or is spread over topics more evenly. If λ = 1 terms are ranked according to their probabilities in the particular topic only; if λ = 0 terms are ranked higher if they are unusual terms that occur almost exclusively in that topic. Typically, the optimal value is around 0.6, but it is interesting to adjust it and observe any differences.
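To get a feel for what the slider does, here is a small sketch of the relevance score as we understand it from the method behind pyLDAvis: λ · log p(term | topic) + (1 − λ) · log(p(term | topic) / p(term)). The probabilities below are made-up numbers chosen for illustration, not values from our model.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """Blend a term's probability in the topic with how exclusive it is to the topic."""
    return (lam * math.log(p_term_given_topic)
            + (1 - lam) * math.log(p_term_given_topic / p_term))

# Hypothetical probabilities for two terms within one topic and across the corpus.
terms = {
    'nation':   {'in_topic': 0.020, 'overall': 0.020},  # common everywhere
    'congress': {'in_topic': 0.008, 'overall': 0.002},  # concentrated in this topic
}

def rank(terms, lam):
    """Order the terms from most to least relevant for a given λ."""
    return sorted(
        terms,
        key=lambda term: relevance(terms[term]['in_topic'], terms[term]['overall'], lam),
        reverse=True,
    )

print(rank(terms, lam=1.0))  # → ['nation', 'congress']
print(rank(terms, lam=0.0))  # → ['congress', 'nation']
```

With λ = 1 the generally common word wins; with λ = 0 the word that is concentrated in this topic comes first. Values in between trade off the two.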

---
---
## Summary

Well done for getting to the end of the topic modelling fundamentals notebook! Here is what we have done:

* Loaded and saved the cleaned tokens in a Gensim corpus dictionary
* Created a bag-of-words (BoW) corpus
* Looked at Term Frequency–Inverse Document Frequency (TF-IDF)
* Trained a Latent Dirichlet Allocation (LDA) topic model
* Visualised the resulting topics with pyLDAvis


👌👌👌

In the next notebook `3-xxx` we will look at...