# Exercise 11: Topic modeling with latent Dirichlet allocation (LDA) 

In this practical exercise, we will use latent Dirichlet allocation (LDA) to discover the topics prevalent in a document collection, and assign topics to the documents. As discussed in the lecture, an LDA model describes each topic in terms of a distribution over words, and each document as a distribution over topics. The problem setting is unsupervised in the sense that only the text in the documents is observed, and all other variables are latent and need to be inferred by the model.

We will study the <a href="https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection"> Reuters-21578 </a> document collection, which contains news wire articles that appeared on Reuters in 1987.

First, we load the Reuters-21578 dataset using the *Natural Language Toolkit* (*NLTK*), a Python platform for working with natural language that provides access to several text corpora and functions for handling text data. The code below downloads the corpus and stores the raw text of each article in the list *articles*.

In [1]:
! pip install nltk
import nltk
from nltk.corpus import reuters

# Download the Reuters corpus into the local NLTK data directory
nltk.download('reuters')

# Collect the raw text of every article in the corpus
articles = [reuters.raw(doc_id) for doc_id in reuters.fileids()]

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\lucas\AppData\Roaming\nltk_data...


The variable *articles* now contains 10788 documents, each stored as a string.

As discussed in the lecture, LDA is a bag-of-words model, meaning that the probabilistic process for a document does not take into account the order in which the words appear in it. For inference with the LDA model, it therefore suffices to know which words appear in a document and how often each of them appears. We pass the document collection to the model as a document word frequency matrix: a 2D array in which rows represent documents, columns represent words, and each cell counts how many times the given word appears in the given document.

The following code constructs the document word matrix *articles_words* using a CountVectorizer object *tf*, pruning the vocabulary so that it only contains words that:

* consist only of Latin letters and are at least 3 characters long (token_pattern='[a-zA-Z]{3,}'),

* are not English stop words, that is, frequent but uninformative words such as "the" or "a" (stop_words='english'),

* occur in at least 0.2% and at most 95% of all documents (max_df=0.95, min_df=0.002),

and, of these, keeps only the 2000 most frequent words (max_features=2000).

In [6]:
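from sklearn.feature_extraction.text import CountVectorizer
# Vectorizer that applies the vocabulary pruning rules listed above
tf = CountVectorizer(token_pattern='[a-zA-Z]{3,}', stop_words='english',
                     max_df=0.95, min_df=0.002, max_features=2000)
# Sparse document word frequency matrix (rows: documents, columns: words)
articles_words = tf.fit_transform(articles)
# Array mapping each column index of the matrix to its word
word_index = tf.get_feature_names_out()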

#### Exercise 1:  
Using sklearn.decomposition.LatentDirichletAllocation, train an LDA model with $K=20$ topics on the document collection, using the document word frequency matrix *articles_words* as input. You can find the function documentation <a href="http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html"> here</a>. You can leave the Dirichlet prior parameters $\alpha$ and $\eta$ at their default values of $1/K$. Hint: use the argument that enables multicore processing to speed up inference. Use the argument *random_state=0* to set the random seed of the inference algorithm so that the results are reproducible.
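
The general calling pattern can be sketched on a tiny made-up count matrix. This is only an illustration under stated assumptions, not the solution: *toy_counts*, *toy_lda* and *N_JOBS* are hypothetical names, and the toy model uses 2 topics instead of 20.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Made-up document word matrix: 4 documents, 6 words (hypothetical data)
toy_counts = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 1, 2, 3, 2],
])

# n_jobs=1 keeps this sketch single-process; n_jobs=-1 requests all cores
N_JOBS = 1

toy_lda = LatentDirichletAllocation(
    n_components=2,   # number of topics K
    n_jobs=N_JOBS,
    random_state=0,   # fixed seed for reproducible results
)
toy_lda.fit(toy_counts)

# One row per topic, one column per word
print(toy_lda.components_.shape)
# One row per document, one column per topic
toy_doc_topic = toy_lda.transform(toy_counts)
print(toy_doc_topic.shape)
```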

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
# YOUR CODE GOES HERE


#### Exercise 2:  
Describe each topic by listing its 10 most probable words. Check the LatentDirichletAllocation documentation to find out how to obtain the word distribution of each topic.
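
The ranking step itself can be sketched with NumPy alone. The matrix *toy_topic_word* and the vocabulary *toy_vocab* below are made up; in the exercise, the corresponding quantities would plausibly be the fitted model's *components_* attribute (one row of word weights per topic) and *word_index*.

```python
import numpy as np

# Made-up vocabulary and topic word weights: 2 topics, 5 words
toy_vocab = np.array(["bank", "gas", "oil", "price", "rate"])
toy_topic_word = np.array([
    [0.1, 2.0, 5.0, 3.0, 0.2],
    [4.0, 0.1, 0.3, 1.0, 6.0],
])

n_top = 3
# Sort each row in ascending order, reverse it, and keep the first n_top indices
top_idx = np.argsort(toy_topic_word, axis=1)[:, ::-1][:, :n_top]
top_words = toy_vocab[top_idx]

for topic_id, words in enumerate(top_words):
    print(topic_id, " ".join(words))
```

Dividing each row by its sum would turn the weights into probabilities, but the ranking of words within a topic is unaffected by that normalization.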

In [None]:
# YOUR CODE GOES HERE


#### Exercise 3: 
From the inferred topic distributions, we can define the topic distance between two documents $d_1$, $d_2$ to be the Kullback-Leibler divergence between their topic distributions. Let $\theta_{d_j}$ for $j \in \{1,2\}$ be the parameter vector of the categorical topic distributions for documents $d_1$ and $d_2$, and let $\theta_{d_j,i}$ denote their $i$-th component, that is, the probability for topic $i$. 

The Kullback-Leibler divergence is defined by

$$KL(d_1,d_2) = \sum_i{\theta_{d_1,i}}\log{\frac{\theta_{d_1,i}}{\theta_{d_2,i}}}$$
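
As a quick numerical check of this formula, the sketch below evaluates it for two made-up topic distributions over two topics. The helper *kl_divergence* and the vectors *theta_1* and *theta_2* are hypothetical, and the code assumes that all probabilities are strictly positive.

```python
import numpy as np

def kl_divergence(p, q):
    """Sum over i of p_i * log(p_i / q_i), assuming strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Made-up topic distributions for two documents
theta_1 = [0.5, 0.5]
theta_2 = [0.25, 0.75]

kl_12 = kl_divergence(theta_1, theta_2)
kl_21 = kl_divergence(theta_2, theta_1)
kl_11 = kl_divergence(theta_1, theta_1)

print(round(kl_12, 4), round(kl_21, 4), kl_11)
```

The two directions give different values (about 0.1438 and 0.1308), so the divergence is not symmetric, and a distribution has divergence 0 from itself.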

Implement a function *get_similar(doc_id,doc_topic_distribution)* that takes an integer *doc_id* representing the document index and a matrix *doc_topic_distribution* that gives the distribution over topics for each document. The function should return a list that contains the indices of all documents in the collection, in order of increasing topic distance to *doc_id*, so that the most similar documents come first.


In [None]:
# YOUR CODE GOES HERE


#### Exercise 4:  

Get the 10 documents that are most similar to the document with index 1 according to topic distance, and inspect manually whether they do indeed cover similar content.

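To get the distribution over topics for each document, use the transform function of LatentDirichletAllocation, which takes the document word matrix and returns a document topic matrix. Older scikit-learn versions returned this matrix un-normalized, whereas recent versions return rows that already sum to 1. Normalizing each row before using the matrix is harmless in either case and guarantees that every row is a valid probability distribution.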
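
Row normalization can be sketched as follows on a made-up matrix; *toy_doc_topic* and *normalized* are hypothetical names, and the values are not produced by any model.

```python
import numpy as np

# Made-up un-normalized document topic weights: 2 documents, 3 topics
toy_doc_topic = np.array([
    [2.0, 1.0, 1.0],
    [0.5, 0.5, 4.0],
])

# Divide every row by its own sum; keepdims=True makes the division broadcast per row
normalized = toy_doc_topic / toy_doc_topic.sum(axis=1, keepdims=True)

print(normalized)
print(normalized.sum(axis=1))
```

Applying the same operation to rows that already sum to 1 leaves them unchanged.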

In [None]:
# YOUR CODE GOES HERE
