# Topic Modelling

In this part of the assignment you will analyze a data set consisting of abstracts from the 2017 Conference on Neural Information Processing Systems (NeurIPS). The file `papers2017.csv` contains a list of 679 titles and abstracts from the conference proceedings.

The task is to compute the posterior distribution in the LDA model that was discussed during the lectures. Set the hyperparameters of the prior to $\alpha = \eta = 1$. The number of topics can be set to $K = 5$.

Before we can begin the inference, we need to load and pre-process the data.

In [None]:
import pandas as pd
import gensim
import nltk
import numpy as np
import scipy.special  # provides digamma, used in the variational inference part
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
stemmer = SnowballStemmer('english')
nltk.download('wordnet')

documents = pd.read_csv('papers2017.csv')

# Pre-process the data

* **Tokenization:** Split each abstract into words, lowercase the words and remove punctuation.
* **Small words:** Remove all words with fewer than 3 characters.
* **Stopwords:** Remove all stopwords.
* **Lemmatize:** Change words in third person to first person and change verbs in past or future tense to present tense.
* **Stem:** Reduce the words to their root form.
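
As a rough illustration of the first three steps, the sketch below uses only the standard library. The stopword set `TOY_STOPWORDS` and the helper `toy_tokenize` are hypothetical stand-ins introduced here; the notebook itself relies on gensim's `simple_preprocess` and `STOPWORDS`, and lemmatizing and stemming are left out of this sketch.

```python
import re

# Tiny hypothetical stopword list; gensim's STOPWORDS is much larger.
TOY_STOPWORDS = {"the", "and", "for", "with", "are", "this"}

def toy_tokenize(text):
    """Lowercase, strip punctuation, then drop short words and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= 3 and t not in TOY_STOPWORDS]

print(toy_tokenize("We propose a new model for the analysis of topics."))
```

Words such as "we", "a" and "of" are removed by the length filter, while "for" and "the" are removed by the stopword filter.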

In [None]:
# lemmatizing and stemming
def lem_stem(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text,pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in STOPWORDS and len(token) >= 3:
            result.append(lem_stem(token))
    return result

The next step is to run all of the abstracts through this pre-processing function.

In [None]:
processed_docs = documents['abstract'].map(preprocess)

Finally, we create our dictionary and filter out words that appear in fewer than 10 documents as well as words that appear in more than half of the documents.

We then create a bag of words, where each processed document is replaced by a list of (word id, count) pairs giving the words it contains and the number of times each one appears.

In [None]:
dictionary = gensim.corpora.Dictionary(processed_docs)
dictionary.filter_extremes(no_below=10, no_above=0.5)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
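
To make the bag-of-words format concrete, here is a minimal standard-library sketch on a made-up corpus. The names `toy_docs`, `token2id` and `toy_doc2bow` are hypothetical, the threshold of 2 documents is chosen only to suit the tiny example, and gensim may assign word ids in a different order.

```python
from collections import Counter

# Hypothetical pre-processed toy documents (already tokenized and stemmed).
toy_docs = [
    ["model", "topic", "model", "infer"],
    ["topic", "sampl", "infer"],
    ["network", "model", "train"],
]

# Document frequency: in how many documents does each token occur?
doc_freq = Counter(token for doc in toy_docs for token in set(doc))

# Keep tokens that occur in at least 2 documents (analogous to no_below=2).
vocab = sorted(t for t, df in doc_freq.items() if df >= 2)
token2id = {t: i for i, t in enumerate(vocab)}

def toy_doc2bow(doc):
    """Return sorted (token id, count) pairs for tokens kept in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_bow = [toy_doc2bow(doc) for doc in toy_docs]
print(vocab)
print(toy_bow)
```

Each document becomes a short list of pairs, and tokens that were filtered out of the vocabulary simply disappear from it.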

# Collapsed Gibbs

**Q1:** For a sweep of the collapsed Gibbs sampler we need to sample from the distribution $p(\mathbf{z} \mid \mathbf{w}, \alpha, \eta)$, where $\mathbf{z}$ contains the topic assignment of each word in each document.

Write out the full distribution and simplify the computations as much as possible. 

Also explain how to obtain the posterior distribution of the variables $\mathbf{\theta}$ (topic proportions for each document) and $\beta$ (word distributions for each topic) from the output of the Gibbs sampler. That is, how do we find the distribution of these quantities given the distribution of $\mathbf{z}$?

_hint: see the slides_

**Q2:** Implement the collapsed Gibbs sampler on the provided NeurIPS dataset to approximate the posterior $p(\mathbf{z} \mid \mathbf{w}, \alpha, \eta)$. To help with this, code for the outer loop is provided; you need to implement the initialization algorithm and the function for calculating the probabilities of the posterior distribution.

Here `c` and `ct` are $c$ and $\tilde{c}$ from the slides: the count of each topic per document and the count of each topic per word, respectively.
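
The relationship between the topic assignments and the two count matrices can be sketched on a tiny hand-made example. Everything prefixed with `toy_` is hypothetical, and the assignments are fixed by hand rather than sampled, so this only illustrates the bookkeeping.

```python
import numpy as np

# Hypothetical toy setup: 2 documents, 3 vocabulary words, 2 topics.
toy_bow = [[(0, 2), (2, 1)], [(1, 1), (2, 2)]]
# One topic per word copy, in the order the copies appear in toy_bow.
toy_z = [np.array([0, 1, 1]), np.array([1, 0, 1])]
num_topics, num_words = 2, 3

c = np.zeros((len(toy_bow), num_topics), dtype=int)  # topic counts per document
ct = np.zeros((num_topics, num_words), dtype=int)    # word counts per topic

for d, doc in enumerate(toy_bow):
    indx = 0
    for v, count in doc:
        for _ in range(count):
            k = toy_z[d][indx]
            c[d, k] += 1    # one more word in document d assigned to topic k
            ct[k, v] += 1   # one more copy of word v assigned to topic k
            indx += 1

print(c)
print(ct)
```

Both matrices count the same word copies, so their totals agree with the total number of words in the corpus.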

In [None]:
# Initialization function, sets the topic for each word in each document at random.
def initialization_gibbs(bow_corpus, num_topics, num_words):
    num_docs = len(bow_corpus)
    z = []
    c = np.zeros((num_docs,num_topics), dtype = int)
    ct = np.zeros((num_topics,num_words), dtype = int)
    for d in range(num_docs):
        topics_in_doc = np.zeros(0,dtype = int)
        for (v,i) in bow_corpus[d]:
            # Inner loop to set topic for each copy of a word independently.
            for _ in range(i):
                # Sample a random topic
                k = 
                # Update list of topics in document
                topics_in_doc = np.append(topics_in_doc,k)

                # Update c and ct based on document d, word v and topic k
                c[ ][ ] =
                ct[ ][ ] =
        z.append(topics_in_doc)
    return(z, c, ct)

# Calculates the probability for each topic on the current word and document.
def calc_probs(c, ct, d, v, alpha, eta, num_topics):
    p = np.zeros(num_topics)
    for k in range(num_topics):
        # For each topic calculate the probability that word v in document d belongs to topic k.
        p[k] =
        
    # Normalize the vector before returning it.
    return p/sum(p)
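
For orientation, one common form of the collapsed conditional in the LDA literature is $p(z = k \mid \cdot) \propto (c_{d,k} + \alpha)\,\frac{\tilde{c}_{k,v} + \eta}{\sum_{v'} \tilde{c}_{k,v'} + V\eta}$, where the counts exclude the word currently being resampled. The vectorized sketch below assumes this form and symmetric priors; the slides may use different but equivalent notation, and `toy_topic_probs` is a hypothetical name.

```python
import numpy as np

def toy_topic_probs(c, ct, d, v, alpha, eta):
    """Normalized topic probabilities for word v in document d.

    Assumes c and ct already exclude the word being resampled.
    """
    num_words = ct.shape[1]
    doc_part = c[d, :] + alpha
    word_part = (ct[:, v] + eta) / (ct.sum(axis=1) + num_words * eta)
    p = doc_part * word_part
    return p / p.sum()

# Hypothetical counts: 2 documents, 2 topics, 3 vocabulary words.
c = np.array([[1, 2], [1, 2]])
ct = np.array([[1, 0, 1], [1, 1, 2]])
probs = toy_topic_probs(c, ct, d=0, v=2, alpha=1.0, eta=1.0)
print(probs)
```

In the outer loop of the sampler, the counts are decremented before the probabilities are computed, which is what makes them exclude the current word.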

In [None]:
# Hyperparameters
alpha = 
eta =

# Number of iterations (number of samples from the posterior)
max_itr_gibbs = 

# Number of topics in the model
num_topics =

# Number of words and documents (may help you later)
num_words = len(dictionary)
num_docs = len(bow_corpus)

# Start by initializing all values: z (topics) should be set randomly and c and ct (tilde c) should be calculated based on the randomly set topics.
(z, c, ct) = initialization_gibbs(bow_corpus, num_topics, num_words)

# Gibbs sampling:
for itr in range(max_itr_gibbs):
    for d in range(num_docs):
        # indx keeps track of the index of the words in each document
        indx = 0
        for (v,i) in bow_corpus[d]:
            for tmp in range(i):
                # k is current topic
                k = z[d][indx]

                # Decrease c and ct based on the current topic
                c[d][k] -= 1
                ct[k][v] -= 1

                # Calculate probabilities for the posterior distribution
                probs = calc_probs(c, ct, d, v, alpha, eta, num_topics)

                # Sample new topic
                new_k = np.random.choice(num_topics, p = probs)

                # Increase c and ct based on the new topic
                c[d][new_k] += 1
                ct[new_k][v] += 1

                # Set the word (index indx) to the new topic
                z[d][indx] = new_k
                indx += 1
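
Once the sampler has run, point estimates of the topic proportions and word distributions can be read off the counts. The sketch below assumes the usual Dirichlet posterior means given a single sample of the assignments, with symmetric priors; `toy_posterior_means` is a hypothetical helper, and averaging over several samples is a natural refinement.

```python
import numpy as np

def toy_posterior_means(c, ct, alpha, eta):
    """Posterior means of theta (documents x topics) and beta (topics x words)."""
    num_topics, num_words = ct.shape
    theta = (c + alpha) / (c.sum(axis=1, keepdims=True) + num_topics * alpha)
    beta = (ct + eta) / (ct.sum(axis=1, keepdims=True) + num_words * eta)
    return theta, beta

# Hypothetical counts: 2 documents, 2 topics, 3 vocabulary words.
c = np.array([[1, 2], [1, 2]])
ct = np.array([[1, 0, 1], [1, 1, 2]])
theta, beta = toy_posterior_means(c, ct, alpha=1.0, eta=1.0)
print(theta)
print(beta)
```

Every row of `theta` and `beta` sums to one, so each row can be read directly as a probability distribution.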

**Q3:** Present the top 5 words based on term-score for each topic, and give a name to each of the topics.
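
A minimal sketch of a term-score computation follows, assuming the common definition $\text{score}_{k,v} = \beta_{k,v} \log\big(\beta_{k,v} / (\prod_{j=1}^{K} \beta_{j,v})^{1/K}\big)$; check it against the definition used in the lectures. The matrix `toy_beta` and the function name are hypothetical, and all entries of $\beta$ must be strictly positive.

```python
import numpy as np

def toy_term_score(beta):
    """beta_kv times the log ratio of beta_kv to its geometric mean over topics."""
    log_beta = np.log(beta)
    return beta * (log_beta - log_beta.mean(axis=0, keepdims=True))

# Hypothetical word distributions: 2 topics, 3 vocabulary words.
toy_beta = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.2, 0.7]])
scores = toy_term_score(toy_beta)

# Indices of the 2 highest-scoring words per topic.
top_words = np.argsort(-scores, axis=1)[:, :2]
print(top_words)
```

A word that is equally probable under every topic gets a score of zero, which is why the score favours words that are distinctive for a topic rather than merely frequent. With the notebook's objects, a word id can be mapped back to its token via `dictionary[word_id]`.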

# Variational inference

**Q4:** Write down explicit update expressions for a CAVI algorithm for approximating the posterior distribution $p(\mathbf{\theta},\mathbf{z},\beta \mid \mathbf{w}, \alpha, \eta)$.

**Q5:** Implement the CAVI algorithm for the NeurIPS data. To help you with this task some code is provided that you need to fill in.

In the code we work with $\log(\phi)$, since this simplifies some of the expressions and is in general more numerically stable.
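
Two building blocks that tend to appear in this kind of update can be sketched as follows: the expected log of a Dirichlet-distributed vector, $\mathbb{E}[\log \theta_k] = \psi(\gamma_k) - \psi(\sum_j \gamma_j)$, and normalization carried out in log space. The helper names are hypothetical, and using `scipy.special.logsumexp` is one possible choice rather than a requirement of the assignment.

```python
import numpy as np
from scipy.special import digamma, logsumexp

def toy_expected_log_dirichlet(params):
    """E[log x_k] for x ~ Dirichlet(params), taken along the last axis."""
    return digamma(params) - digamma(params.sum(axis=-1, keepdims=True))

def toy_normalize_log(logp):
    """Shift log-weights so that their exponentials sum to one."""
    return logp - logsumexp(logp)

# Log-weights this small would underflow to zero if exponentiated directly.
unnormalized = np.array([-1000.0, -1001.0, -1002.0])
logphi = toy_normalize_log(unnormalized)
phi = np.exp(logphi)
print(phi)
```

Exponentiating `unnormalized` directly gives all zeros in double precision, so dividing by the sum would fail; subtracting the log-sum-exp first avoids that.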

In [None]:
# Initialization function, sets lambdas and gammas randomly, sets all logphis to 0.
def initialization_cavi(bow_corpus, num_topics, num_words):
    num_docs = len(bow_corpus)
    logphis = []
    wordMatrix = []
    docLengths = []
    lambdas = np.random.gamma(1, size = (num_topics, num_words))
    gammas = np.random.gamma(1, size = (num_docs, num_topics))
    for d in range(num_docs):
        words_in_doc = 0
        word_vec = np.zeros(0, dtype = int)
        for (v,i) in bow_corpus[d]:
            words_in_doc += i
            for _ in range(i):
                word_vec = np.append(word_vec, v)
        logphis.append(np.zeros((words_in_doc,num_topics)))
        docLengths.append(words_in_doc)
        wordMatrix.append(word_vec)
    return(logphis, lambdas, gammas, docLengths, wordMatrix)


In [None]:
# Hyperparameters
alpha = 
eta =

# Number of iterations (full sweeps of the coordinate ascent updates)
max_itr_cavi = 

# Number of topics in the model
num_topics =

# Number of words and documents (may help you later)
num_words = len(dictionary)
num_docs = len(bow_corpus)

# Start by initializing all values: we set all logphis to zero and draw lambdas and gammas from the gamma distribution. This also calculates doc_lengths and word_matrix.
(logphis, lambdas, gammas, doc_lengths, word_matrix) = initialization_cavi(bow_corpus, num_topics, num_words)
for itr in range(max_itr_cavi):
    for d in range(num_docs):
        indx = 0
        for (v,i) in bow_corpus[d]:
            for _ in range(i):
                for k in range(num_topics):
                    # Calculate each logphi based on the expected value of the natural parametrization.
                    # The digamma function is available in scipy.special.digamma
                    logphis[d][indx][k] = 
                # Normalize the logphis
                logphis[d][indx, :] = 
                indx += 1
        for k in range(num_topics):
            # Calculate the gammas based on the phis
            gammas[d][k] = 
    for k in range(num_topics):
        for v in range(num_words):
            # Update the lambdas. 
            lambdas[k][v] = 
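
For reference, in the standard mean-field treatment of LDA the Dirichlet parameters are the prior plus expected counts: $\gamma_{d,k} = \alpha + \sum_n \phi_{d,n,k}$ and $\lambda_{k,v} = \eta + \sum_d \sum_n \phi_{d,n,k}\,\mathbb{1}[w_{d,n} = v]$. The vectorized sketch below assumes these forms and takes the $\phi$ values themselves, so the notebook's `logphis` would need to be exponentiated first; the function name and toy inputs are hypothetical.

```python
import numpy as np

def toy_update_gamma_lambda(phis, word_matrix, alpha, eta, num_topics, num_words):
    """Recompute gammas and lambdas from per-word topic responsibilities."""
    gammas = np.full((len(phis), num_topics), alpha, dtype=float)
    lambdas_t = np.full((num_words, num_topics), eta, dtype=float)  # words x topics
    for d, (phi, words) in enumerate(zip(phis, word_matrix)):
        gammas[d] += phi.sum(axis=0)
        # Unbuffered add, so repeated word ids accumulate correctly.
        np.add.at(lambdas_t, words, phi)
    return gammas, lambdas_t.T

# Hypothetical input: one document containing word 0 twice, 2 topics, 2 words.
toy_phis = [np.array([[1.0, 0.0], [0.5, 0.5]])]
toy_words = [np.array([0, 0])]
gammas, lambdas = toy_update_gamma_lambda(toy_phis, toy_words, 1.0, 1.0, 2, 2)
print(gammas)
print(lambdas)
```

A plain fancy-indexed `+=` would count a repeated word id only once, which is why the sketch uses `np.add.at`.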

**Q6:** Present the top-5 words based on term-score for each topic and also give a name to each of the topics.

# Comparing Gibbs and Variational Inference

**Q7:** Choose one of the abstracts, present the top 5 topics of the document and present the titles of the 5 closest other abstracts. Do this for both of the algorithms on the same abstract. Discuss similarities and differences between the results from the two algorithms.
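
The question leaves the notion of "closest" open. One possible choice, assumed here purely for illustration, is the Hellinger distance between the estimated topic proportions of two documents. The function `toy_closest_documents` and the matrix `toy_theta` are hypothetical.

```python
import numpy as np

def toy_closest_documents(theta, doc_id, n_closest=5):
    """Indices of the documents closest to doc_id in Hellinger distance."""
    diff = np.sqrt(theta) - np.sqrt(theta[doc_id])
    dist = np.sqrt(0.5 * (diff ** 2).sum(axis=1))
    order = np.argsort(dist)
    order = order[order != doc_id]  # exclude the query document itself
    return order[:n_closest], dist

# Hypothetical topic proportions: 4 documents, 3 topics.
toy_theta = np.array([[0.8, 0.1, 0.1],
                      [0.7, 0.2, 0.1],
                      [0.1, 0.1, 0.8],
                      [0.3, 0.4, 0.3]])
closest, dist = toy_closest_documents(toy_theta, doc_id=0, n_closest=2)
print(closest)
```

With the notebook's data, the returned indices could be used to look up titles in `documents`, assuming the row order of `documents` matches that of `bow_corpus`.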

**Q8:** Discuss the key conceptual differences between the Gibbs sampler and the CAVI algorithm. What are the pros and cons of each method? Which method do you prefer and why?