## Tutorial 2: Mixture Models and Expectation Maximization

### Exercise 1: Categorical Mixture Model (CMM)

In [None]:
# Import libraries
import numpy as np
import pandas as pd
from ast import literal_eval
import matplotlib.pyplot as plt

import gensim
from wordcloud import WordCloud, STOPWORDS

from categorical_em import CategoricalEM

#### 1.4) Play around with the dataset

##### Load and pre-process the data
Load the data from the `tweets_cleaned` CSV file as a `pandas` DataFrame. It contains the documents, already pre-processed and cleaned by applying the following steps:

1. Tokenization
2. Homogenization, which includes:
    1. Removing capitalization.
    2. Removing non-alphanumeric tokens (e.g. punctuation signs).
    3. Stemming/lemmatization.
3. Cleaning
4. Vectorization
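
To make the steps above concrete, here is a minimal sketch of tokenization, lowercasing, removal of non-alphanumeric tokens, and stop-word cleaning using only the standard library. It is not the pipeline that produced the provided file: the `preprocess` helper and the tiny `STOP_WORDS` set are hypothetical, and stemming/lemmatization is left out because it needs an external library.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "is", "and", "to", "of"}

def preprocess(text):
    """Lowercase the text, keep alphanumeric tokens only, and drop stop words."""
    # Lowercasing (homogenization) and tokenization in one pass:
    # the pattern keeps runs of letters/digits, so punctuation is discarded.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Cleaning: remove very common words that carry little topical information.
    return [t for t in tokens if t not in STOP_WORDS]

example_tokens = preprocess("The model is FITTING, and the data is noisy!")
print(example_tokens)
```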

In [None]:
df = # FILL HERE
# FILL HERE  # drop duplicate tweets
df['tokens'] = df['tokens'].apply(# FILL HERE)  # transform the string into a list of tokens
X_tokens = list(df['tokens'].values)
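
When a list of tokens is written to a CSV file, it is stored as its string representation, so it has to be parsed back into a Python list after loading. The sketch below shows what `literal_eval` does on a single made-up string (`raw_tokens` is an illustrative value, not a row from the dataset); the same function could be applied to every entry of a column.

```python
from ast import literal_eval

# A token list as it would look after a round trip through a CSV file.
raw_tokens = "['mixture', 'model', 'em']"

# literal_eval safely parses Python literals (lists, strings, numbers, ...)
# without executing arbitrary code, unlike eval.
parsed_tokens = literal_eval(raw_tokens)

print(type(raw_tokens).__name__, '->', type(parsed_tokens).__name__)
print(parsed_tokens)
```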

In [None]:
print('Columns: {}\n'.format(' | '.join(df.columns.values)))

print('Tweet:\n{}'.format(df.loc[1, 'tweet']))
print('Tweet cleaned:\n{}'.format(df.loc[1, 'tweets_clean']))
print('Tweet tokens:\n{}'.format(X_tokens[1]))

##### Create the dictionary
We have transformed the raw text collection into a list of documents stored in `X_tokens`, where each document is the list of words retained as most relevant for the semantic analysis.

We now convert these data (a list of token lists) into a numerical representation (a list of vectors, or a matrix). For this purpose we use the `gensim` library.

In [None]:
I = 120  # hyperparameter: number of different words to keep

In [None]:
dictionary = gensim.# FILL HERE
print(dictionary)
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=I)
print(dictionary)
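
The `filter_extremes` call above prunes the vocabulary by document frequency: it drops tokens that appear in fewer than `no_below` documents or in more than a fraction `no_above` of all documents, and then keeps at most the `keep_n` most frequent ones. The following pure-Python sketch mimics that logic on a toy corpus. The helper `filter_vocabulary` is hypothetical and only approximates the library's behavior (for instance, ties are broken alphabetically here).

```python
from collections import Counter

def filter_vocabulary(docs, no_below, no_above, keep_n):
    """Return the tokens that survive a document-frequency filter."""
    num_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    kept = [
        tok for tok, df in doc_freq.items()
        if no_below <= df <= no_above * num_docs
    ]
    # Most frequent first; alphabetical order breaks ties (an assumption).
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]

toy_docs = [
    ["em", "mixture"],
    ["em", "topic"],
    ["em", "mixture", "word"],
    ["topic", "mixture"],
]
toy_vocab = filter_vocabulary(toy_docs, no_below=2, no_above=0.75, keep_n=2)
print(toy_vocab)
```

Here `word` is dropped because it appears in a single document, and `topic` is dropped by the `keep_n` cut.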

##### Create Bag of Words (BoW)
Let's create the numerical version of our corpus using the `doc2bow` method. In general, 
`D.doc2bow(token_list)` transforms any list of tokens into a list of tuples `(token_id, n)`, one for each distinct token in 
`token_list` that appears in the dictionary, where `token_id` is the token identifier (according to dictionary `D`) and `n` is the number of occurrences 
of that token in `token_list`.
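
To illustrate the output format without relying on the library, the sketch below reproduces the bag-of-words idea with a plain dictionary. The helper `doc_to_bow` and the mapping `toy_token2id` are hypothetical stand-ins for the `gensim` dictionary; tokens that are not in the mapping are simply ignored.

```python
from collections import Counter

def doc_to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) tuples."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Hypothetical token-to-id mapping for the illustration.
toy_token2id = {"em": 0, "mixture": 1, "topic": 2}

toy_bow = doc_to_bow(["em", "topic", "em", "unknown"], toy_token2id)
print(toy_bow)
```

The token `em` (id 0) occurs twice, `topic` (id 2) once, and `unknown` is skipped because it has no identifier.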

In [None]:
X_bow = list()
keep_tweet = list()
for tweet in X_tokens:
    tweet_bow = # FILL HERE
    if len(tweet_bow) > 1:
        X_bow.append(tweet_bow)
        keep_tweet.append(True)
    else:
        keep_tweet.append(False)

df_data = df[keep_tweet]
N = len(df_data)

##### Create the matrix
Finally, we transform the BoW representation `X_bow` into a matrix, namely `X_matrix`, in which the entry in the n-th row and j-th column is the 
number of occurrences of the j-th word of the dictionary in the n-th document. This will be the matrix used in the algorithm.
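
As a small illustration of this conversion, the sketch below builds a count matrix from a toy bag-of-words list. The names `X_bow_toy`, `N_toy` and `I_toy` are made up for the example and stand for a corpus of two documents over a three-word dictionary.

```python
import numpy as np

# Two toy documents in bag-of-words form: lists of (token_id, count) tuples.
X_bow_toy = [[(0, 2), (2, 1)], [(1, 3)]]
N_toy, I_toy = len(X_bow_toy), 3

X_toy = np.zeros((N_toy, I_toy))
for n, doc_bow in enumerate(X_bow_toy):
    for token_id, count in doc_bow:
        # Row n is the document, column token_id is the dictionary word.
        X_toy[n, token_id] = count

print(X_toy)
```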

In [None]:
X_matrix = np.zeros([N, I])
for i, doc_bow in enumerate(X_bow):
    # FILL HERE
X_matrix.shape

#### 1.5) Implement the EM algorithm

In [None]:
K = 6  # hyperparameter: number of topics
i_theta = 5
i_pi = 5
model = CategoricalEM(K, I, N, delta=0.01, epochs=200, init_params={'theta': i_theta, 'pi': i_pi})
model.fit(X_matrix)
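
For reference, here is a minimal NumPy sketch of EM for a mixture of categorical distributions over word counts. It is not the `CategoricalEM` class imported above, whose internals are not shown here: the function name `em_categorical_mixture`, the `tol` and `seed` parameters, the random Dirichlet initialization, and the small smoothing constant are all assumptions made for this illustration. It uses plain maximum-likelihood updates and tracks the log-likelihood up to a constant (the multinomial coefficient is omitted).

```python
import numpy as np

def em_categorical_mixture(X, K, epochs=100, tol=1e-4, seed=0):
    """Fit mixing weights pi (K,) and word probabilities theta (K, I) by EM."""
    rng = np.random.default_rng(seed)
    N, I = X.shape
    eps = 1e-10                                 # smoothing to avoid log(0)
    pi = np.full(K, 1.0 / K)
    theta = rng.dirichlet(np.ones(I), size=K)   # random starting point
    prev_ll = -np.inf
    for _ in range(epochs):
        # E-step: log pi_k + sum_j x_nj * log theta_kj, normalized per document.
        log_joint = np.log(pi) + X @ np.log(theta).T            # shape (N, K)
        log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        resp = np.exp(log_joint - log_norm)                     # responsibilities
        # Stop when the log-likelihood no longer improves noticeably.
        ll = log_norm.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M-step: re-estimate the parameters from responsibility-weighted counts.
        Nk = resp.sum(axis=0)
        pi = (Nk + eps) / (N + K * eps)
        counts = resp.T @ X + eps                               # shape (K, I)
        theta = counts / counts.sum(axis=1, keepdims=True)
    # resp holds the responsibilities from the last E-step.
    return pi, theta, resp

# Toy corpus: three documents use words 0-1, three use words 2-3.
X_demo = np.array([[5, 4, 0, 0], [6, 3, 0, 0], [4, 5, 0, 0],
                   [0, 0, 5, 5], [0, 0, 4, 6], [0, 0, 6, 4]], dtype=float)
pi_demo, theta_demo, resp_demo = em_categorical_mixture(X_demo, K=2)
print(np.round(resp_demo, 3))
```

As with any EM procedure, the result depends on the initialization, so different seeds may lead to different local optima.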


#### 1.6) Show the ten most representative words for each topic using a wordcloud, and the ten most relevant documents for each topic

Words per topic
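
One possible way to pick the most representative words is to sort each topic's word probabilities in decreasing order. The sketch below does this on a made-up matrix; `theta_toy` (topics by words) and `id2word_toy` are hypothetical stand-ins for the fitted parameters and the dictionary, and "representative" is assumed to mean "highest probability under the topic".

```python
import numpy as np

# Toy word probabilities: 2 topics (rows) over 4 dictionary words (columns).
theta_toy = np.array([[0.5, 0.1, 0.3, 0.1],
                      [0.1, 0.6, 0.1, 0.2]])
id2word_toy = {0: "em", 1: "topic", 2: "mixture", 3: "word"}
top_n = 2

# argsort is ascending, so reverse each row and keep the first top_n columns.
top_ids = np.argsort(theta_toy, axis=1)[:, ::-1][:, :top_n]
top_words = [[id2word_toy[int(j)] for j in row] for row in top_ids]

# Word -> probability mapping per topic, e.g. as input for a word cloud.
top_freqs = [
    {id2word_toy[int(j)]: float(theta_toy[k, j]) for j in row}
    for k, row in enumerate(top_ids)
]
print(top_words)
```

A dictionary such as `top_freqs[k]` is the kind of input that `WordCloud.generate_from_frequencies` accepts.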

In [None]:
# FILL HERE

In [None]:
fig, axs = plt.subplots(2, 3, figsize=(30, 10))
for k in range(K):
    # FILL HERE

Documents per topic
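
A similar ranking can be sketched for documents, assuming that relevance is measured by the posterior probability (responsibility) of each topic given the document. The matrix `resp_toy` (documents by topics) is made up for the illustration.

```python
import numpy as np

# Toy responsibilities: 4 documents (rows), 2 topics (columns).
resp_toy = np.array([[0.90, 0.10],
                     [0.20, 0.80],
                     [0.60, 0.40],
                     [0.05, 0.95]])
top_docs = 2

# Sort each column in decreasing order and keep the first top_docs indices;
# after the transpose, row k lists the document indices ranked for topic k.
rnk_toy = np.argsort(-resp_toy, axis=0)[:top_docs].T
print(rnk_toy)
```

The resulting indices could then be used to look up the corresponding rows of the filtered DataFrame by position.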

In [None]:
rnk = # FILL HERE

In [None]:
# FILL HERE