# Topic Modelling Fundamentals by Example

---
---
**Let's dive right in! 🏊🏽‍♀️**

We have a corpus of 56 presidential speeches that we've prepared into clean lemmas.

> Our question: **What are the underlying themes of these texts as a group?**

---
---
## Loading the Corpus into Gensim from Files

First, we have to load the corpus from text files into a _dictionary_. Gensim provides a special class of dictionary for us to work with called `gensim.corpora.Dictionary`.

Get a list of all the files in the `data/inaugural` folder:

In [56]:
from pathlib import Path
files = Path('data', 'inaugural').iterdir()

Open each file in turn and add its contents, as a list of lemmas, to a big list called `text` (one inner list per document):

In [57]:
text = []
for file in files:
    with open(file, 'r') as reader:
        # Each file holds whitespace-separated lemmas; keep one list per document
        document = reader.read().split()
        text.append(document)

print(f'There are {len(text)} documents loaded.')

There are 56 documents loaded.


(👆👆👆 The above code is a bit tricky; if you don't understand it yet, don't worry. You can skip over it and still follow along with the topic modelling.)

Now we load the `text` into the Gensim `Dictionary`:

In [67]:
from gensim.corpora import Dictionary

dictionary = Dictionary(text)
str(dictionary)

"Dictionary(6164 unique tokens: ['2', 'abandon', 'abide', 'abundance', 'abundantly']...)"

From this we can understand that Gensim has found 6164 unique tokens in the corpus. But exactly what information does this `Dictionary` contain?

In short, it is a _mapping_ between each token and a unique id number:

In [82]:
dictionary.token2id

{'2': 0,
 'abandon': 1,
 'abide': 2,
 'abundance': 3,
 'abundantly': 4,
 'achieve': 5,
 'action': 6,
 'add': 7,
 'advance': 8,
 'allow': 9,
 'altar': 10,
 'america': 11,
 'american': 12,
 'americans': 13,
 'ancient': 14,
 'and': 15,
 'ant': 16,
 'article': 17,
 'aside': 18,
 'ask': 19,
 'aspire': 20,
 'await': 21,
 'barely': 22,
 'battalion': 23,
 'bear': 24,
 'before': 25,
 'belief': 26,
 'believe': 27,
 'believer': 28,
 'belong': 29,
 'bend': 30,
 'betray': 31,
 'bill': 32,
 'bind': 33,
 'body': 34,
 'bounty': 35,
 'brave': 36,
 'brighten': 37,
 'bring': 38,
 'build': 39,
 'but': 40,
 'call': 41,
 'candle': 42,
 'capitalist': 43,
 'cause': 44,
 'century': 45,
 'change': 46,
 'changeless': 47,
 'character': 48,
 'child': 49,
 'choice': 50,
 'citizen': 51,
 'city': 52,
 'clamor': 53,
 'clear': 54,
 'clerk': 55,
 'close': 56,
 'color': 57,
 'come': 58,
 'common': 59,
 'companion': 60,
 'conceive': 61,
 'conquer': 62,
 'constantly': 63,
 'continent': 64,
 'control': 65,
 'conviction': 66

>The key to understanding Natural Language Processing (NLP) is that the computer can only do computations on **numbers**. We have to present our corpus for analysis in a numerical form — and make human sense of it at the end.
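
To make the token-to-id idea concrete, here is a minimal pure-Python sketch of such a mapping. It is only an illustration, not Gensim's implementation: the function name `build_token2id` and the two tiny documents are made up for this example, and ids are simply assigned in order of first appearance.

```python
def build_token2id(documents):
    """Assign each unique token an integer id, in order of first appearance."""
    token2id = {}
    for document in documents:
        for token in document:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


# Two tiny, made-up documents of lemmas (not taken from the speeches)
documents = [["citizen", "believe", "america"], ["america", "build", "citizen"]]

token2id = build_token2id(documents)
# The reverse mapping lets us turn ids back into readable tokens
id2token = {token_id: token for token, token_id in token2id.items()}

print(token2id)
print(id2token[0])
```

The reverse mapping matters as much as the forward one: the model works on ids, but we need the tokens back to make human sense of the results.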

Now that the corpus is loaded, we can start to analyse its contents. The first step is to count the words to create a **bag-of-words** corpus.

---
---
## Bag of words

A bag-of-words (BoW) corpus is a _vocabulary_ of the known words in the corpus together with some _measure_ of how often they occur. The measurement may be:
* binary (presence or absence)
* count (how many times the word occurs)
* frequency (count divided by the total number of words).  

For a single document, the list of measurements, one per vocabulary word, is called a **document vector**.
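
As a rough sketch, the three kinds of measurement can be computed like this. The token list and vocabulary are made up for illustration, and the variable names are not part of any library.

```python
from collections import Counter

# A tiny made-up document, already split into tokens
tokens = ["hand", "to", "hand", "man", "to", "man", "to"]
# A fixed vocabulary; 'bow' does not occur in this document
vocabulary = ["hand", "man", "to", "bow"]

counts = Counter(tokens)
total = len(tokens)

binary_vector = [1 if counts[word] > 0 else 0 for word in vocabulary]
count_vector = [counts[word] for word in vocabulary]
frequency_vector = [counts[word] / total for word in vocabulary]

print(binary_vector)  # [1, 1, 1, 0]
print(count_vector)   # [2, 2, 3, 0]
print(frequency_vector)
```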

#### Example

Here is a simplified example to demonstrate the principles of creating a vector from a document.

Document (20 words):

>'No room to poise the lance or bend the bow;
> But hand to hand, and man to man, they grow:'
 
 (from _The Iliad of Homer_, translated by Alexander Pope (1899)) 
  
Vocabulary of unique words (15 words):

* no
* room
* to
* poise
* the
* lance
* or
* bend
* bow
* but
* hand
* and
* man
* they
* grow

Count measurements (how many times each word appears in the document):

* no = 1
* room = 1
* to = 3
* poise = 1
* the = 2
* lance = 1
* or = 1
* bend = 1
* bow = 1
* but = 1
* hand = 2
* and = 1
* man = 2
* they = 1
* grow = 1

If we treat this vocabulary as a list with a fixed order, we can just extract the counts into a list. This is the document vector.

`[1, 1, 3, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1]`
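
The same vector can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes a very simple tokeniser (lowercase, split on whitespace, strip punctuation); real preprocessing is usually more careful.

```python
import string
from collections import Counter

document = ("No room to poise the lance or bend the bow; "
            "But hand to hand, and man to man, they grow:")

# Lowercase, split on whitespace and strip surrounding punctuation
tokens = [word.strip(string.punctuation).lower() for word in document.split()]

# Vocabulary in order of first appearance (dict keys preserve insertion order)
vocabulary = list(dict.fromkeys(tokens))

counts = Counter(tokens)
vector = [counts[word] for word in vocabulary]

print(len(tokens), len(vocabulary))  # 20 15
print(vector)  # [1, 1, 3, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1]
```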

In order to compare other documents with this one for similarity, we could generate a document vector with the same vocabulary list for each document, or expand the vocabulary list to cover all the words in all the documents we are interested in.
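
Here is a small sketch of the second option: building one shared vocabulary and vectorising every document against it. The two documents and the helper name `to_vector` are made up for illustration.

```python
from collections import Counter

# Two tiny, made-up tokenised documents
doc_a = ["hand", "to", "hand", "man", "to", "man"]
doc_b = ["man", "bend", "the", "bow"]

# Shared vocabulary: every word from either document, in sorted order
vocabulary = sorted(set(doc_a) | set(doc_b))


def to_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vector_a = to_vector(doc_a, vocabulary)
vector_b = to_vector(doc_b, vocabulary)

print(vocabulary)  # ['bend', 'bow', 'hand', 'man', 'the', 'to']
print(vector_a)    # [0, 0, 2, 2, 0, 2]
print(vector_b)    # [1, 1, 0, 1, 1, 0]
```

Because both vectors have the same length and the same word at each position, they can be compared position by position.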

#### The 'bag' in bag of words

In this most basic BoW model, all information about the order and location of the words is discarded. For example, it does not matter whether the words 'red' and 'nose' are adjacent ('red nose'), or whether they fall at the beginning or end of a sentence; BoW just treats the words individually. It is like a 'bag' of Scrabble™ tiles, where each tile is a word, all rattling around together in no particular order.

It is also possible to create a BoW corpus that counts sequences of two or more adjacent words, which preserves some local word order. For example, if you measure all pairs of adjacent words in our example document (above), you end up with a vocabulary that looks like this:

* no room
* room to
* to poise
* poise the
* the lance
* lance or
* or bend
* bend the
* the bow
* bow but
* but hand
* hand to
* to hand
* hand and
* and man
* man to
* to man
* man they
* they grow

### n-grams

A pair of adjacent words like this is known as a **bigram**. The earlier case, where we took just one word at a time, is called a **unigram**. A sequence of three words is a **trigram**, and so on. These are all special cases of the **n-gram**, where _n_ is the number of adjacent words.
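
Generating n-grams from a list of tokens takes only a sliding window. The helper below is a minimal sketch; the name `ngrams` is our own choice, not a Gensim function.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "no room to poise the lance or bend the bow".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[:3])   # [('no', 'room'), ('room', 'to'), ('to', 'poise')]
print(trigrams[-1])  # ('bend', 'the', 'bow')
```

A document of 10 tokens yields 9 bigrams and 8 trigrams, so the vocabulary tends to grow quickly as _n_ increases.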

### Vocabulary choice

As you may have suspected by now, the size and nature of the vocabulary you choose is vitally important. A large vocabulary takes more computational power and memory to analyse. A vocabulary with many rare words produces what is called a _sparse_ vector: most of its entries are 0 because most documents do not contain those words, so it carries little useful information for its size. Likewise, very common but largely meaningless words are often wasteful to include; for example, we would probably want to exclude a list of **stopwords** such as 'the', 'and', and 'to'.
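
One way to prune a vocabulary is sketched below in plain Python: drop stopwords, then drop words that appear in too few documents. The corpus, the stopword list, and the `min_documents` threshold are all made up for illustration. Gensim's `Dictionary.filter_extremes` offers a similar kind of frequency-based pruning.

```python
from collections import Counter

# Tiny made-up corpus of tokenised documents
documents = [
    ["the", "moon", "landing", "and", "the", "orbit"],
    ["the", "moon", "and", "the", "earth"],
    ["the", "lunar", "module", "and", "the", "earth"],
]

# Hypothetical, deliberately tiny stopword list for illustration only
stopwords = {"the", "and"}

# Document frequency: the number of documents that contain each word
document_frequency = Counter(word for doc in documents for word in set(doc))

# Assumed threshold: keep words found in at least 2 documents
min_documents = 2
vocabulary = sorted(
    word for word, df in document_frequency.items()
    if word not in stopwords and df >= min_documents
)

print(vocabulary)  # ['earth', 'moon']
```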

### Term Frequency–Inverse Document Frequency (TF-IDF)
If you measure word frequency, highly frequent words come to dominate your results and yet they may not be as meaningful or interesting as rarer words. For example, if you are looking at articles about the history of the Moon landings, even if you have removed all the stopwords, you may well find that the words 'lunar', 'moon', 'landing', 'orbit', and 'earth' predominate. Subtle differences in topic between documents may be lost.

One way to deal with this is to use a _weighting factor_ called **TF-IDF**. A value is calculated for each word that reflects:
* Term frequency (TF) - the number of times the word appears in the document
* Document frequency (DF) - the number of documents in the corpus that contain the word

For example, a word that appears in only two documents is weighted more highly than a word that appears in every document in the corpus, because the latter tells us little about what distinguishes one document from another.
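
To see the weighting at work, here is a minimal sketch of one common formulation: weight = TF × log(N / DF), where N is the number of documents. The corpus is made up, and libraries (including Gensim) differ in details such as the logarithm base and normalisation, so treat the exact numbers as illustrative.

```python
import math
from collections import Counter

# Tiny made-up corpus of tokenised documents
documents = [
    ["moon", "landing", "orbit", "moon"],
    ["moon", "earth", "orbit"],
    ["moon", "lunar", "module"],
]

n_documents = len(documents)
# Document frequency: the number of documents that contain each word
document_frequency = Counter(word for doc in documents for word in set(doc))


def tf_idf(document):
    """Weight each word by its count times log(N / document frequency)."""
    term_frequency = Counter(document)
    return {
        word: tf * math.log(n_documents / document_frequency[word])
        for word, tf in term_frequency.items()
    }


weights = tf_idf(documents[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word}: {weight:.3f}")
```

Although 'moon' is the most frequent word in the first document, it appears in every document, so its weight falls to zero; 'landing', which appears in only one document, ends up with the highest weight.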

In [None]:
# Convert to a bag-of-words corpus using Gensim (corpora.Dictionary.doc2bow)

In [None]:
# Save bag-of-words corpus object to disc using pickle (alternatives?)

## LDA

In [None]:
# Use LDA model to find topics using Gensim (gensim.models.ldamodel.LdaModel)

In [10]:
from gensim.corpora import Dictionary
dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
str(dct)
for item in dct.items():
    print(item)

(0, 'maso')
(1, 'mele')
(2, 'máma')
(3, 'ema')
(4, 'má')


In [11]:
dct.doc2bow(["this", "is", "máma"])

[(2, 1)]

In [12]:
dct.doc2bow(["this", "is", "máma"], return_missing=True)

([(2, 1)], {'is': 1, 'this': 1})
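
Conceptually, `doc2bow` counts the tokens it knows and reports them as (id, count) pairs, optionally returning the unknown tokens as well. The pure-Python sketch below mimics the behaviour shown in the two cells above; `doc2bow_sketch` is a made-up name and this is not Gensim's actual implementation.

```python
from collections import Counter


def doc2bow_sketch(token2id, document):
    """Return (bag_of_words, missing) for a tokenised document.

    bag_of_words is a list of (token_id, count) pairs sorted by id;
    missing maps each token that is not in token2id to its count.
    """
    counts = Counter(document)
    bow = sorted((token2id[t], c) for t, c in counts.items() if t in token2id)
    missing = {t: c for t, c in counts.items() if t not in token2id}
    return bow, missing


# The same mapping as the small Gensim dictionary above
token2id = {"maso": 0, "mele": 1, "máma": 2, "ema": 3, "má": 4}

bow, missing = doc2bow_sketch(token2id, ["this", "is", "máma"])
print(bow)      # [(2, 1)]
print(missing)  # {'this': 1, 'is': 1}
```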

In [13]:
from gensim import corpora
# Build the dictionary from `text`, the list of tokenised documents loaded earlier
dictionary = corpora.Dictionary(text)
dictionary

In [14]:
# Convert every document into a bag-of-words list of (token_id, count) pairs
corpus = [dictionary.doc2bow(document) for document in text]
corpus[0][:10]  # first ten (token_id, count) pairs of the first document

In [None]:
import pickle
# Use a context manager so the file is closed even if pickling fails
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)

In [None]:
import gensim
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

In [None]:
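new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
# `prepare_text_for_lda` is not defined in this notebook. As a stand-in (an assumption),
# lowercase and split on whitespace; ideally, lemmatise exactly as the corpus was prepared.
new_doc = new_doc.lower().split()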
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))
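
`get_document_topics` returns a list of (topic_id, probability) pairs. The sketch below shows one way to pick out the dominant topic from such a list; the numbers are invented for illustration and are not real model output.

```python
# Hypothetical output in (topic_id, probability) format; the numbers are made up
document_topics = [(0, 0.05), (2, 0.71), (4, 0.24)]

# Sort from the most to the least probable topic
ranked = sorted(document_topics, key=lambda pair: pair[1], reverse=True)
dominant_topic, probability = ranked[0]

print(f"Dominant topic: {dominant_topic} (probability {probability:.2f})")
```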

In [None]:
dictionary.save('dictionary.gensim')


### A Bit About Machine Learning

You may not have realised it when you started this notebook, but topic modelling is a Machine Learning (ML) method. ML is, of course, something of a hot topic...

Topic modelling is described as an **unsupervised** **classification** technique.

**Model**
... 

**Classification**

**Supervised and unsupervised techniques**

Since topic modelling is an _unsupervised_ technique, we need to spend some time evaluating the topic models that are produced. We will do this in our worked example, below.

---
---
## Summary


👌👌👌

The next notebook ...