# Topic Modelling

Before we start with anything fancy, let's define some key concepts that will serve us on our journey with topic modelling.

1. `Document` -- some text.
2. `Corpus` -- a collection of documents (texts).
3. `Vector` -- a mathematical representation of a document (text).
4. `Model` -- an algorithm for transforming vectors from one representation to another.

## Document

A document is basically a string. It might be as short as a single tweet or as long as [A Little Life](https://en.wikipedia.org/wiki/A_Little_Life). In _Python_ it looks more or less like what we would expect:

In [None]:
document = "Human machine interface for lab abc computer applications"


## Corpus

A corpus, on the other hand, is a collection of documents (texts). Corpora have two roles:

1. Input for training a model. This is going to be our main use case. It allows for separating common themes and topics.
2. Documents to organize. After training, a topic model can be used to extract topics from new documents (documents not seen in the training corpus).

At this point we can think about a corpus as a list of strings. Consider the following.


In [19]:
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]


**Important**: The above example loads the entire corpus into memory. In practice, corpora may be very large, so loading them into memory may be inconvenient (to say the least). However, for now we are leaving it as it is.
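For the curious, here is a minimal sketch of how a corpus could be streamed one document at a time instead of being held in a list. It assumes a plain text file with one document per line; the `StreamingCorpus` class and the temporary file are hypothetical and exist only to keep the example runnable.

```python
import os
import tempfile


class StreamingCorpus:
    """Yield one document per line, reading the file lazily."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Only one line is held in memory at a time
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line


# Hypothetical corpus file, written here only so the sketch can run
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Human machine interface for lab abc computer applications\n")
    tmp.write("A survey of user opinion of computer system response time\n")
    corpus_path = tmp.name

# Process documents one by one without building a list of all texts
doc_lengths = [len(doc.split()) for doc in StreamingCorpus(corpus_path)]

# Clean up the temporary file
os.remove(corpus_path)
```

Because the class defines `__iter__`, it can be looped over several times, and each pass re-reads the file rather than keeping the texts around.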

Before we move any further, we need to tokenize our corpus and remove stop words. We can just use the same code we discussed last week.

In [27]:
## Import NLTK module
import nltk
## Import stop words
from nltk.corpus import stopwords
## Import function to tokenize text
from nltk.tokenize import word_tokenize
## A new type of dictionary
from collections import defaultdict

## Download stopwords list
nltk.download('stopwords')

## Assign the list of English stop words to stop_words
stop_words = stopwords.words('english')

## Tokenize every single text. Meanwhile drop stop words
## and tokens that are shorter than 2 characters
texts = [ [ token for token in word_tokenize(doc.lower()) if token not in stop_words and len(token) > 1 ] 
          for doc in text_corpus ]

## We define the dictionary which will not raise an error
## when the key is not present. Instead it will create a 
## new pairing with default value of 0
frequency = defaultdict(int)

## Count how often each token occurs across all texts
for text in texts:
    for token in text:
        frequency[token] += 1

## With a normal dictionary our code would look like 
## the following
## frequency = {}
## for text in texts:
##     for token in text:
##         if token in frequency:
##             frequency[token] += 1
##         else:
##             frequency[token] = 1

            
## Remove tokens that appear only once in the whole corpus
processed_corpus = [ [ token for token in text if frequency[token] > 1 ] 
                     for text in texts ]

processed_corpus

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mikolaj/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

Before proceeding, we want to associate each word in the corpus with a unique integer ID. For that we will use a new module, `gensim`. It lets us create an object that maps each unique token in our texts to an ID. In other words, to save space we will not store the tokens themselves in our lists but rather references to unique tokens. It sounds more complicated than it actually is. Please consider the following.

In [29]:
## Import gensim
from gensim import corpora

## Create a new object called Dictionary
dictionary = corpora.Dictionary(processed_corpus)

## Let's print it out
print(dictionary)

Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...>


We have a relatively small corpus here, so there are only 12 different tokens. For larger corpora, dictionaries can become massive and may easily contain hundreds of thousands of tokens.

## Vector

What has happened so far was relatively easy. More or less, we converted a set of texts into a new object called a Dictionary (it is not the same as a `dict`). Now we would like to convert the documents we have into vectors. We need that because, to infer the latent structure in our corpus, we need a way to represent documents that we can manipulate mathematically. In other words, we need to transform strings that are understandable to us into something that is understandable to a computer.

We can infer the latent structure (find similarities between texts) by understanding the meaning of the texts, but computers cannot really do that (don't believe people who state otherwise). What they can do well and fast is perform mathematical operations. Therefore, we need a way to convert a text into a mathematical object (a vector) and ask the computer to find similarities between a set of vectors using math (you probably computed the distance between two vectors in high school).

In other words, we need to find a way to represent a document (text) as a vector of features. For example, a single feature may be thought of as a question-answer pair:

1. How many times does the word *splonge* appear in the document? Zero.
2. How many paragraphs does the document consist of? Two.
3. How many fonts does the document use? Five.

If we are going to apply the same questions to every single text, we can skip the question text and only refer to the questions by their integer IDs (such as 1, 2, and 3). The representation of a document then becomes a series of pairs like `(1, 0), (2, 2), (3, 5)`. This is called a **dense vector**, because it contains an explicit answer to each of the above questions.

In practice, we can omit the `0` values. We do that to save memory (usually we have many more features than just 3). Therefore, our vector would actually look like `(2, 2), (3, 5)`. This is known as a **sparse vector** or **bag-of-words vector**.
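The step from a dense to a sparse vector can be sketched in a few lines of plain Python. The helper `to_dense` and the variable names are made up for this illustration; the numbers are the ones from the questions above.

```python
# Dense vector: one (feature_id, value) pair for every question
dense_vector = [(1, 0), (2, 2), (3, 5)]

# Sparse vector: keep only the pairs whose value is not zero
sparse_vector = [(feature_id, value)
                 for feature_id, value in dense_vector
                 if value != 0]


def to_dense(sparse, feature_ids):
    """Rebuild the dense form, filling in 0 for every missing feature."""
    lookup = dict(sparse)
    return [(feature_id, lookup.get(feature_id, 0))
            for feature_id in feature_ids]


# Nothing is lost as long as we remember which features exist
restored = to_dense(sparse_vector, [1, 2, 3])
```

The sparse form only pays off when most values are zero, which is exactly the situation with word counts over a large vocabulary.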

Assuming the questions are the same, we can compare two texts. For example, texts with the **bag-of-words** vectors `(2, 2), (3, 5)` and `(1, .1), (2, 1.9), (3, 4.9)` are clearly similar, at least on the examined features. We don't have to do any heavy math to see it.
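If we did want to put a number on that similarity, one common choice is cosine similarity. The text above does not prescribe this measure, so treat it as an assumption of this sketch; the function name `cosine_similarity` is also ours.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (id, value) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    # Features missing from a sparse vector count as 0
    dot = sum(value * b.get(feature_id, 0.0) for feature_id, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


first = [(2, 2), (3, 5)]
second = [(1, 0.1), (2, 1.9), (3, 4.9)]

# Values close to 1 mean the vectors point in almost the same direction
similarity = cosine_similarity(first, second)
```

For the two vectors above the result is very close to 1, which agrees with what we could already see by eye.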

OK, so let's look at something more involved than our silly questions. Under the **bag-of-words** model, each document is represented by a vector containing the frequency count of each word in the dictionary. For example, let's assume we have a dictionary containing the following words: `['coffee', 'milk', 'sugar', 'spoon']`. A document consisting of the string `"coffee milk coffee"` would then be represented by the vector `(1, 2), (2, 1), (3, 0), (4, 0)`, where the entries of the vector are (in order) the occurrences of "coffee", "milk", "sugar", and "spoon" in the document. The only issue here is that this approach ignores the order of words. But that is understandable, since the approach is called **bag-of-words**.
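The coffee example can be reproduced with a short sketch in plain Python. The helper `bag_of_words` is hypothetical, and the IDs start at 1 only to match the numbering used in the text.

```python
from collections import Counter

vocabulary = ['coffee', 'milk', 'sugar', 'spoon']


def bag_of_words(text, vocabulary):
    """Return (id, count) pairs for every word in the vocabulary."""
    # Counter returns 0 for words that never occur in the text
    counts = Counter(text.split())
    return [(word_id, counts[word])
            for word_id, word in enumerate(vocabulary, start=1)]


coffee_vector = bag_of_words("coffee milk coffee", vocabulary)

# A different word order gives exactly the same vector
shuffled_vector = bag_of_words("milk coffee coffee", vocabulary)
```

Both calls return `[(1, 2), (2, 1), (3, 0), (4, 0)]`, which shows in practice that word order is thrown away.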

OK, this is all very good, but how does it connect with the Dictionary object we created? Quite directly, in fact. Our processed corpus has `12` unique tokens in it, which means that each document will be represented by a 12-dimensional vector under the bag-of-words model. We can use the dictionary to turn tokenized documents into these 12-dimensional vectors. Let's first see these IDs:

In [30]:
print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}
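To see what such a 12-dimensional vector looks like, here is a plain-Python sketch that uses the mapping printed above. It is not gensim's own conversion routine; the helper `to_dense_bow` is made up for this illustration, and `token2id` is copied by hand from the output.

```python
# Mapping copied from the printed dictionary.token2id output
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'trees': 9, 'graph': 10, 'minors': 11}


def to_dense_bow(tokens, token2id):
    """Count each known token at the position given by its ID."""
    vector = [0] * len(token2id)
    for token in tokens:
        # Tokens that are not in the dictionary are silently ignored
        if token in token2id:
            vector[token2id[token]] += 1
    return vector


# The fourth document of our processed corpus
example_doc = ['system', 'human', 'system', 'eps']
dense_bow = to_dense_bow(example_doc, token2id)
```

Here `dense_bow` has 12 entries: a 2 at position 5 (`system`), a 1 at positions 1 (`human`) and 8 (`eps`), and zeros everywhere else.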
