# Tutorial: Topic Models

In this module, we will learn how to apply and visualize topic models in Python. We will use the package `gensim` (see [here](https://radimrehurek.com/gensim) for further information and tutorials).

First of all, we import the package, along with `numpy`:

In [None]:
import gensim
import numpy as np

## Preprocessing the data

We start with a toy example to illustrate how to preprocess and visualize data. Consider a set of four documents, each consisting of one or two short sentences:

In [None]:
doc1 = "I like to eat broccoli and bananas. Broccoli and bananas are healthy."
doc2 = "I eat broccoli smoothie and bananas for breakfast."
doc3 = "Hamsters and kittens are cute."
doc4 = "My sister says she wants to adopt two cute kittens, but we already have three hamsters at home."

# complete list of documents
doc_complete = [doc1, doc2, doc3, doc4]

These are the steps that we will go through:
1. Remove punctuation.
2. Remove "stop words".
3. Remove high- and low-frequency words.
4. Create the dictionary.
5. Create the bag-of-words representation.

### 1. Remove punctuation

First, we remove the punctuation marks (commas, periods, etc.).

In [None]:
import string
exclude = set(string.punctuation)
print(exclude)

In [None]:
doc_noPunc = [''.join(ch for ch in doc if ch not in exclude) for doc in doc_complete]
print(doc_noPunc)
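
As an alternative to the character-by-character join above, the same step can be sketched with `str.translate`, which deletes every punctuation character in a single pass. The names `punct_table`, `docs`, and `docs_no_punct` are illustrative only, and two of the toy documents are repeated so that the snippet runs on its own.

```python
import string

# Translation table that maps every ASCII punctuation character to None,
# i.e. deletes it when passed to str.translate.
punct_table = str.maketrans('', '', string.punctuation)

# Two of the toy documents, repeated here so the snippet is self-contained.
docs = [
    "I eat broccoli smoothie and bananas for breakfast.",
    "Hamsters and kittens are cute.",
]

docs_no_punct = [doc.translate(punct_table) for doc in docs]
print(docs_no_punct)
```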

### 2. Remove "stop words"

In computing and natural language processing, stop words are words that are filtered out as a preprocessing step. Stop words are usually common words that are *semantically insignificant* (e.g., articles like "the" or "a"). The specific list of stop words varies from application to application; indeed, some applications do not remove stop words at all.

In our toy example, we will use the following list of stop words:
```
stoplist = set('i my to and a for are at this on of she or but we'.split())
```
Of course, in more realistic applications, we do not specify the stop words by hand. Rather, we use existing lists for that purpose. Two examples of stop word lists in English are:
- http://xpo6.com/list-of-english-stop-words
- http://www.textfixer.com/resources/common-english-words.txt

Additionally, the package [`nltk`](http://www.nltk.org/book/ch02.html) contains stopword lists in several languages.

In the cell below, remove the stop words from the punctuation-free document list `doc_noPunc` using the `stoplist` defined above. Do not forget to *lowercase* all words before comparing them with the stop list.

In [None]:
# remove common words and tokenize
stoplist = set('i my to and a for are at this on of she or but we'.split())
doc_noStop = [[word for word in document.lower().split() if word not in stoplist]
               for document in doc_noPunc]
print(doc_noStop)

### 3. Remove high- and low-frequency words

Very frequent and very rare words are typically removed or simply ignored because they carry little discriminative information: a word that appears in almost every document, or in only one, does not help to tell topics apart. In the cell below, we first count how many times each term appears in the whole collection, and then remove all words that appear only once.

In [None]:
import collections
# obtain the frequency of each word
frequency = collections.defaultdict(int)
for doc in doc_noStop:
    for token in doc:
        frequency[token] += 1
# remove words that appear only once
doc_noLowFreq = [[token for token in text if frequency[token] > 1]
                  for text in doc_noStop]
print(doc_noLowFreq)
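
The counting step can also be sketched with `collections.Counter`, which builds the same frequency table without an explicit loop. The variable names `doc_tokens`, `frequency`, and `pruned` are illustrative, and the token lists are the four toy documents after punctuation and stop-word removal, typed out so that the snippet is self-contained.

```python
from collections import Counter

# The four toy documents after punctuation and stop-word removal.
doc_tokens = [
    ["like", "eat", "broccoli", "bananas", "broccoli", "bananas", "healthy"],
    ["eat", "broccoli", "smoothie", "bananas", "breakfast"],
    ["hamsters", "kittens", "cute"],
    ["sister", "says", "wants", "adopt", "two", "cute", "kittens",
     "already", "have", "three", "hamsters", "home"],
]

# Count every token across the whole collection.
frequency = Counter(token for doc in doc_tokens for token in doc)

# Keep only the tokens that appear more than once in the collection.
pruned = [[token for token in doc if frequency[token] > 1] for doc in doc_tokens]
print(pruned)
```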

To remove high-frequency words, we could remove the top (say) 25 most common words by replacing
```python
# remove words that appear only once
doc_noLowFreq = [[token for token in text if frequency[token] > 1]
                  for text in doc_noStop]
```
with
```python
# obtain the frequency of the words as a numpy array
n_most_common = 25
np_freq = np.array(list(frequency.values()))
# sort the frequencies in ascending order
np_freq_sorted = np.sort(np_freq)
# obtain the maximum allowed frequency: the count of the n-th most common word
# (this indexing assumes the vocabulary has at least n_most_common words)
max_freq = np_freq_sorted[-n_most_common]

# remove words that appear only once or at least max_freq times
doc_noLowFreq = [[token for token in text if 1 < frequency[token] < max_freq]
                  for text in doc_noStop]
```

We will not do that in this tutorial because this is a toy example whose vocabulary contains only a handful of words (far fewer than 25).
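
Purely as an illustration, the same idea can be sketched on the toy data with `collections.Counter.most_common`, here dropping only the 2 most common words. This is a simplified variant rather than the snippet above: it removes exactly the listed words instead of thresholding on a frequency, and the names `doc_tokens`, `too_common`, and `filtered` are hypothetical.

```python
from collections import Counter

# Toy documents after removing words that appear only once.
doc_tokens = [
    ["eat", "broccoli", "bananas", "broccoli", "bananas"],
    ["eat", "broccoli", "bananas"],
    ["hamsters", "kittens", "cute"],
    ["cute", "kittens", "hamsters"],
]

n_most_common = 2  # illustrative value; the text above suggests 25 for real corpora

frequency = Counter(token for doc in doc_tokens for token in doc)

# Set of the n most frequent words in the collection.
too_common = {word for word, _ in frequency.most_common(n_most_common)}

# Drop those words from every document.
filtered = [[token for token in doc if token not in too_common] for doc in doc_tokens]
print(sorted(too_common))
print(filtered)
```

On such a small vocabulary this strips most of the "food" words, which is why aggressive high-frequency pruning only makes sense for larger corpora.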


### 4. Create the dictionary

In topic modeling, we represent documents using an approach called "bag-of-words". It consists of two steps. First, each vocabulary word is assigned a unique integer id. Second, each document is represented by a vector whose $n$-th element contains the number of times that the word with id $n$ appears in the document.
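
To make the two steps concrete, here is a minimal from-scratch sketch in plain Python. It is not how `gensim` implements its dictionary; the toy documents, the alphabetical id assignment, and the helper `to_count_vector` are assumptions made for illustration.

```python
# Toy tokenized documents (hypothetical stand-ins for preprocessed text).
docs = [
    ["eat", "broccoli", "bananas", "broccoli"],
    ["cute", "kittens", "cute"],
]

# Step 1: assign each distinct word a unique integer id (here, alphabetical order).
vocabulary = sorted({token for doc in docs for token in doc})
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}


def to_count_vector(tokens, word_to_id):
    """Step 2: return a vector whose n-th entry counts the word with id n."""
    vector = [0] * len(word_to_id)
    for token in tokens:
        vector[word_to_id[token]] += 1
    return vector


vectors = [to_count_vector(doc, word_to_id) for doc in docs]
print(word_to_id)
print(vectors)
```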

The mapping between the vocabulary words and the ids is called a *dictionary*. Below, we create a dictionary using the `gensim` package:

In [None]:
dictionary = gensim.corpora.Dictionary(doc_noLowFreq)
print(dictionary)

The attribute `token2id` (a plain Python dictionary, not a function) allows us to recover the id that was assigned to each vocabulary word:

In [None]:
print(dictionary.token2id)

### 5. Create the bag-of-words representation

The function `doc2bow()` counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector. For instance,

```
   dictionary.doc2bow("kittens are cute and hamsters are also cute".lower().split())
```

returns `[(3, 2), (4, 1), (5, 1)]`, indicating that the word with id 3 ("cute") appears twice, while the words with id 4 ("hamsters") and id 5 ("kittens") each appear once. Words that are not in the dictionary ("are", "and", "also") are silently ignored.


In [None]:
print(dictionary.doc2bow("kittens are cute and hamsters are also cute".lower().split()))

We now apply the `doc2bow()` function to our collection of documents:

In [None]:
corpus = [dictionary.doc2bow(doc) for doc in doc_noLowFreq]
print(corpus)
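
Each entry of `corpus` is a sparse vector, i.e., a list of `(word id, count)` pairs that omits all zero counts. The sketch below shows how such a list maps back to the dense count vector described in the previous section. The helper `sparse_to_dense` is hypothetical, written only for illustration; `gensim.matutils` offers similar helpers (e.g., `sparse2full`).

```python
def sparse_to_dense(bow, vocab_size):
    """Expand a list of (word_id, count) pairs into a dense count vector."""
    dense = [0] * vocab_size
    for word_id, count in bow:
        dense[word_id] = count
    return dense


# Sparse vector taken from the doc2bow example in the text; the toy
# vocabulary has 6 words, so the dense vector has 6 entries.
bow = [(3, 2), (4, 1), (5, 1)]
dense = sparse_to_dense(bow, vocab_size=6)
print(dense)
```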

## Applying LDA

We now apply Latent Dirichlet Allocation (LDA) to our preprocessed corpus. The idea behind LDA is that each document can be understood as a mixture of "topics". For instance, documents 1 and 2 are about food because they contain the words "broccoli", "bananas", and "eat", whereas documents 3 and 4 are about animals ("kittens", "hamsters", "cute"). A document that mentioned both "hamsters" and "broccoli" would be about both animals and food. LDA uncovers these topics automatically from the data.
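
The "mixture of topics" idea can be illustrated by sampling a document from the generative story that LDA assumes. The sketch below uses `numpy` only and does not fit anything: the two topic distributions, the Dirichlet parameter `alpha`, and the document length are all made-up values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

vocabulary = ["bananas", "broccoli", "eat", "cute", "hamsters", "kittens"]

# Hypothetical topics: each row is a probability distribution over the vocabulary.
topics = np.array([
    [0.30, 0.35, 0.30, 0.02, 0.02, 0.01],  # a "food" topic
    [0.01, 0.02, 0.02, 0.30, 0.35, 0.30],  # an "animals" topic
])

# Draw the document's topic proportions from a symmetric Dirichlet prior.
theta = rng.dirichlet(alpha=[0.5, 0.5])

# For each word position: pick a topic according to theta,
# then pick a word from that topic's distribution.
n_words = 8
topic_assignments = rng.choice(len(topics), size=n_words, p=theta)
words = [vocabulary[rng.choice(len(vocabulary), p=topics[k])]
         for k in topic_assignments]

print("topic proportions:", theta)
print("sampled document:", words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topics and the per-document proportions.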

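We use the package `gensim` to fit LDA, specifying the corpus, the dictionary, and the number of topics that we want to find. In this case, we use only 2 topics (realistic applications typically use many more, e.g., 50-200). Because the fit starts from a random initialization, the results may differ from run to run; passing `random_state` to `LdaModel` makes them repeatable.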

In [None]:
model = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=2)

### Analyze the topics

We can print the topics found by the LDA model using the method `print_topics()`.

In [None]:
model.print_topics(2)

In general, one of the topics (e.g., topic 0) will assign higher probability to the words "broccoli", "eat", and "bananas", whereas the other topic (e.g., topic 1) will be mostly about "cute", "hamsters", and "kittens". This is consistent with our earlier intuition of having one topic about food and another about animals. With such a tiny corpus, however, the separation will not be equally clean in every run.

Recall that a topic is formally defined as a distribution over the entire vocabulary.

### Obtain the topic proportions

We now want to find the topic proportions of each individual document. For instance, we know that document 1 is mostly about food, while document 4 is mostly about animals. The following commands allow us to obtain the topic distribution of each document.

In [None]:
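# document 1: use its preprocessed bag-of-words, since the raw doc1.split()
# would miss tokens such as "Broccoli" (capitalized) and "bananas." (punctuation)
print(model[corpus[0]])

In [None]:
# document 4: same idea, using the preprocessed bag-of-words
print(model[corpus[3]])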

Note that this can be applied to unseen documents too. For instance, consider the following new document, which is about both animals and food:

In [None]:
doc5 = "Look at these hamsters munching on a piece of broccoli".lower()
print(model[dictionary.doc2bow(doc5.split())])

Both topic proportions should be roughly $0.5$, indicating that this document expresses both topics. Expect only a moderately close match, since all of these documents are very short.