## Topic Models with LDA

In this exercise, we will learn how to apply and visualize topic models in Python.  
We will use the `scikit-learn` package (imported as `sklearn`).


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import _stop_words as stop_words
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from collections import Counter

import matplotlib.pyplot as plt

### Exercise 1: process a bag-of-words representation and analyze the results

We start with a toy example to illustrate how to preprocess and visualize data. Consider a set of four documents, each consisting of one single sentence:

In [None]:
doc1 = "I like to eat broccoli and bananas; Broccoli and bananas are healthy."
doc2 = "I eat broccoli smoothie and bananas for breakfast."
doc3 = "Hamsters and kittens are cute."
doc4 = "My sister says she wants to adopt two cute kittens, but we already have three hamsters at home."

# complete list of documents
doc_complete = [doc1, doc2, doc3, doc4]

#### a) Tokenize the documents
The tokenization function below performs the following steps:
1. Remove punctuation.
2. Remove "stop words".
3. Remove terms that appear in too many or too few documents (controlled by `max_df` and `min_df`).
4. Create the dictionary.
5. Create the bag-of-words representation.
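
As a rough illustration of these steps, the sketch below runs scikit-learn's `CountVectorizer` on two made-up sentences (not the exercise documents). The names `toy_docs`, `toy_vect`, and `toy_counts` are placeholders introduced here. With these settings the vectorizer lowercases the text, drops punctuation through its default token pattern, removes English stop words, and returns a sparse document-term count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, used only for this illustration
toy_docs = ["The cat sat on the mat.", "The dog chased the cat!"]

# stop_words='english' selects scikit-learn's built-in English stop-word list
toy_vect = CountVectorizer(stop_words='english')
toy_counts = toy_vect.fit_transform(toy_docs)   # sparse matrix of shape D x V

# vocabulary_ maps each kept term to its column index in the count matrix
print(sorted(toy_vect.vocabulary_.items(), key=lambda kv: kv[1]))
print(toy_counts.toarray())
```

Sorting the vocabulary by column index makes it easy to line up terms with the columns of the count matrix.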

In [None]:
list_stop = list(stop_words.ENGLISH_STOP_WORDS)

##### Tokenization function

In [None]:
def tokenize_documents(documents,stoplist,max_df0=0.80, min_df0=0.02,print_vocabulary=False,outfolder=None,output_vocabulary_fname='vocabulary.dat'):
    '''
    Build a D x V document-term count matrix from a list of raw-text documents.
    D: number of documents
    V: size of the vocabulary, i.e. number of unique terms kept from the whole set of documents
    '''
    count_vect = CountVectorizer(stop_words=stoplist,max_df=max_df0, min_df=min_df0)
    corpus = # FILL

    vocabulary_dict=# FILL
    vocabulary_list=[(key,value) for # FILL]
    vocabulary_list.sort(# FILL)
        
    if print_vocabulary:
        # NOTE: output_vocabulary is not defined in this notebook; define it before enabling this option
        output_vocabulary(outfolder,count_vect,outfile=output_vocabulary_fname)
    return corpus,vocabulary_list,vocabulary_dict,count_vect

In [None]:
corpus,vocabulary_list,vocabulary_dict,count_vect=# FILL
print(corpus.shape, len(vocabulary_list))

In [None]:
D,V=corpus.shape
D,V

In [None]:
corpus.toarray()

In [None]:
vocabulary_list

#### b) Run LDA

We now apply Latent Dirichlet Allocation (LDA) to our preprocessed corpus. The idea behind LDA is that each document can be understood as a mixture of "topics". For instance, documents 1 and 2 are about food because they contain the words "broccoli", "bananas", and "eat"; documents 3 and 4 are about animals ("kittens", "hamsters", "cute"); and a document that mentions both "hamsters" and "broccoli", such as the new document introduced in part d), would be about both animals and food. LDA infers these topics automatically from the data.
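
To make the mixture idea concrete, here is a minimal sketch that fits scikit-learn's `LatentDirichletAllocation` to a small hand-made count matrix. The matrix `toy_counts`, the choice of two topics, and `random_state=0` are assumptions made purely for illustration; they are not the settings required by the exercise.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hand-made counts: 4 documents x 6 terms (illustrative values only)
toy_counts = np.array([
    [2, 2, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 2, 1],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = toy_lda.fit_transform(toy_counts)   # D x K, each row sums to 1
topic_term = toy_lda.components_                # K x V, unnormalized topic-word weights

print(doc_topic.shape, topic_term.shape)
print(doc_topic.sum(axis=1))
```

Which topic ends up with index 0 or 1 is arbitrary and can change with the random seed, so topics should be identified by their top words rather than by their index.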



In [None]:
# Fit LDA
n_topics = # FILL
lda_model = # FILL
topic_proportions = # FILL
topics = # FILL

# Print log-likelihood
print('\nLog likelihood: ' + str(lda_model.score(# FILL)))

In [None]:
topic_proportions

In [None]:
topics

#### c) Analyze the topics

In [None]:
# Check the size of the resulting matrices
print(topic_proportions.shape)   # D x K
print(topics.shape)              # K x V

In general, one of the topics will assign higher probability to the words "broccoli", "eat", and "bananas", whereas the other topic will be mostly about "cute", "hamsters", and "kittens". This is consistent with our earlier intuition of one topic about food and another about animals.

Recall that a topic is formally defined as a distribution over the entire vocabulary.
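
In scikit-learn, the `components_` attribute of a fitted LDA model stores unnormalized topic-word weights, so each row has to be divided by its sum to obtain a proper distribution. The sketch below shows this normalization on a small hypothetical weight matrix (`weights` is made up for the example).

```python
import numpy as np

# Hypothetical K x V matrix of non-negative topic-word weights (K=2, V=3)
weights = np.array([[4.0, 1.0, 1.0],
                    [1.0, 2.0, 5.0]])

# Divide each row by its sum so that every topic sums to 1 over the vocabulary
topic_word_dist = weights / weights.sum(axis=1, keepdims=True)

print(topic_word_dist)
print(topic_word_dist.sum(axis=1))
```

The ranking of words within a topic is unchanged by this step, so it matters only when the values are interpreted as probabilities.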

##### Obtain the topic proportions

We now want to find the topic proportions of each individual document. For instance, we know that document 1 is mostly about food, while document 4 is mostly about animals. The following commands allow us to obtain the topic distribution of each document.

In [None]:
# Build id2term (inverse dictionary)
id2term = {v: k for k, v in # FILL}

In [None]:
# Visualize topics
n_max = 5
for kk in range(n_topics):
    print('+ Topic ' + str(kk) + ':')
    idx = np.argsort(-topics[kk,:])
    print_str = ''
    for nn in range(n_max):
        print_str += id2term[idx[nn]] + ' '
    print('   ' + print_str)

In [None]:
for d in range(D):
    print(d,# FILL)

The 1st and 2nd documents are mostly about food. The remaining two are instead about animals.
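
One simple way to read off the main topic of each document is to take the index of the largest entry in each row of the document-topic matrix. The sketch below does this on a hypothetical matrix `proportions` (the values are invented for the example).

```python
import numpy as np

# Hypothetical D x K matrix of topic proportions (3 documents, 2 topics)
proportions = np.array([[0.9, 0.1],
                        [0.2, 0.8],
                        [0.5, 0.5]])

# Index of the largest entry in each row; ties resolve to the first index
dominant_topic = proportions.argmax(axis=1)
for d, k in enumerate(dominant_topic):
    print('Document', d, '-> topic', k, 'with proportion', proportions[d, k])
```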

#### d) Apply to new documents

The fitted model can also be applied to unseen documents. For instance, consider the following new document, which is about both animals and food:

In [None]:
doc5 = "Look at these hamsters munching on a piece of broccoli".lower()
doc5_tokenized=# FILL
print(doc5_tokenized.shape)
doc5_tokenized

In [None]:
lda_model.# FILL

The resulting topic proportions should be around $0.5$ (at least moderately close; keep in mind that these are all very short documents), indicating that this document expresses both topics.
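
The general pattern for unseen text is sketched below on made-up sentences: the already fitted vectorizer and model are reused through `transform`, without refitting. All names (`train_docs`, `vect`, `lda`, `new_counts`, `new_proportions`) and the training sentences are assumptions introduced for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up training sentences, used only for this illustration
train_docs = ["apples and pears are fruit", "pears and plums are fruit",
              "cars and trucks are vehicles", "trucks and buses are vehicles"]
vect = CountVectorizer(stop_words='english')
train_counts = vect.fit_transform(train_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(train_counts)

# transform (not fit_transform) reuses the fitted vocabulary and topics;
# words that were not seen during fitting are ignored
new_counts = vect.transform(["trucks full of apples and unicorns"])
new_proportions = lda.transform(new_counts)   # shape 1 x K, row sums to 1

print(new_counts.shape)
print(new_proportions)
```

Because the vocabulary is fixed at fitting time, a new document made up entirely of unseen words would produce an all-zero count vector and an uninformative topic estimate.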

#### e) Visualize results

##### Show topics over document

In [None]:
plt.figure()
idx_D = np.arange(D)   # x-axis locations
bar_width = 0.5
plots = []
height_cumulative = np.zeros(D)
for kk in range(n_topics):
    color = plt.cm.coolwarm(kk/n_topics, 1)
    if kk==0:
        p = plt.bar(idx_D, topic_proportions[:, kk], bar_width, color=color)
    else:
        p = plt.bar(idx_D, topic_proportions[:, kk], bar_width, bottom=height_cumulative, color=color)
    height_cumulative += topic_proportions[:, kk]
    plots.append(p)
plt.ylim((0, 1))  # proportions sum to 1
plt.ylabel('Topic proportions')
plt.title('Topic proportions in documents')
plt.yticks(np.linspace(0, 1, 11))  # ticks at 0.0, 0.1, ..., 1.0
plt.xticks([0,1,2,3], labels=[1,2,3,4])
plt.xlabel('Documents')
topic_labels = ['Topic {}'.format(kk) for kk in range(n_topics)]
plt.legend([p[0] for p in plots], topic_labels)
plt.show()

##### Visualize heatmap

In [None]:
plt.figure()
plt.pcolor(topic_proportions, norm=None, cmap='Blues')
topic_labels = ['Topic {}'.format(kk) for kk in range(n_topics)]
plt.xticks(np.arange(topic_proportions.shape[1])+0.5, topic_labels)
plt.gca().invert_yaxis()
plt.xticks(rotation=90)
plt.yticks(np.arange(topic_proportions.shape[0])+0.5, [1,2,3,4])
plt.ylabel('Documents')
plt.colorbar()
plt.tight_layout()
plt.show()

##### Plot topic proportions individually

In [None]:
plt.figure(figsize=(16,8))
for kk in range(n_topics):
    plt.subplot(1, 2, kk+1)
    plt.scatter(np.arange(D), topic_proportions[:, kk])
    plt.ylim((0, 1))
    plt.ylabel('Proportions')
    plt.title('Topic '+str(kk))
    if kk+2>=n_topics:
        plt.xticks(np.arange(D), [d for d in range(D)] )
        plt.xticks(rotation=90)
        plt.yticks(np.linspace(0, 1, 11))  # ticks at 0.0, 0.1, ..., 1.0
plt.show() 

##### Show words over topic

In [None]:
words = [x[0] for x in vocabulary_list]
plt.figure(figsize=(10,6))
plt.imshow(topics, cmap='Blues')
plt.xticks(np.arange(V), labels=words, rotation=90)
plt.yticks(np.arange(topic_proportions.shape[1]), ['Topic 0', 'Topic 1'])
plt.colorbar()
plt.tight_layout()
plt.show()

### Exercise 2: analyze real dataset of NY Times articles

#### b) Run a shell command from the terminal

    tail -n +4 docword.nytimes.txt > nytimes.txt

This removes the first 3 lines from the file.  
The resulting `nytimes.txt` has 3 columns:

* 1st: document id
* 2nd: word id
* 3rd: frequency of the word in that document

For instance, the first lines are:

    1 413 1
    1 534 1
    1 2340 1
    1 2806 1
    1 3059 1
    1 3070 1
    1 3294 1
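
A small sketch of reading this three-column format with pandas is given below. It parses the sample lines above from an in-memory string instead of the real file; `sample_text` and `df_sample` are names introduced only for this example.

```python
import io
import pandas as pd

# The sample lines shown above, kept in memory instead of reading nytimes.txt
sample_text = "1 413 1\n1 534 1\n1 2340 1\n1 2806 1\n1 3059 1\n1 3070 1\n1 3294 1\n"

# Columns are separated by whitespace and the data has no header row
df_sample = pd.read_csv(io.StringIO(sample_text), sep=r'\s+', header=None,
                        names=['docId', 'wordId', 'wordFreq'])

print(df_sample.head())
print(df_sample.groupby('docId')['wordFreq'].sum())   # total word count per document
```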

#### c) Import data into the proper format

##### Import corpus

In [None]:
df0=pd.read_csv('nytimes.txt',sep=r'\s+', header=None,names=['docId','wordId','wordFreq'])  # raw string avoids an invalid-escape warning
df0.head()

Reduce the dataset size to speed up computation

In [None]:
max_D=1000

In [None]:
df=df0[df0.docId<=max_D]

In [None]:
D=max(df.docId.unique())
V=max(df.wordId.unique())
D,V

Transform into a sparse matrix
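
One possible way to build such a matrix from (document id, word id, count) triples is SciPy's coordinate-style `csr_matrix` constructor, sketched below on invented triples. The arrays `doc_ids`, `word_ids`, and `counts` are hypothetical, and the sketch assumes that ids are 1-based, so they are shifted to 0-based row and column indices.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical (docId, wordId, count) triples with 1-based ids
doc_ids = np.array([1, 1, 2, 3, 3])
word_ids = np.array([2, 4, 1, 2, 3])
counts = np.array([3, 1, 2, 1, 5])

n_docs, n_words = doc_ids.max(), word_ids.max()
# Subtract 1 so that ids become 0-based row and column indices
toy_sparse = csr_matrix((counts, (doc_ids - 1, word_ids - 1)), shape=(n_docs, n_words))

print(toy_sparse.shape)
print(toy_sparse.toarray())
```

With this convention, row `d` of the matrix corresponds to document id `d + 1`, and column `w` to word id `w + 1`.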

In [None]:
corpus_nyt=# FILL
corpus_nyt.data

In [None]:
corpus_nyt.nonzero()

##### Import vocabulary

In [None]:
df_voc=pd.read_csv('vocab.nytimes.txt',header=None)
df_voc.head()

#### d) Run LDA

In [None]:
# Fit LDA
n_topics = 100
lda_model_nyt = # FILL
topic_proportions =# FILL
topics =# FILL

# Print log-likelihood
print('\nLog likelihood: ' + str(lda_model_nyt.score(# FILL)))

#### e) Analyze results

##### Topic proportions
Documents with mostly one topic only
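
The idea can be sketched on a small hypothetical matrix: keep the documents whose largest topic proportion exceeds a threshold, then count how often each topic is the dominant one among them. The matrix `props` and the names `single_topic_docs` and `dominant_counts` are invented for this illustration.

```python
from collections import Counter
import numpy as np

# Hypothetical D x K topic proportions (4 documents, 3 topics)
props = np.array([[0.99, 0.005, 0.005],
                  [0.40, 0.30, 0.30],
                  [0.01, 0.985, 0.005],
                  [0.995, 0.0, 0.005]])

threshold = 0.98
# Indices of documents whose largest proportion exceeds the threshold
single_topic_docs = np.flatnonzero(props.max(axis=1) > threshold)
# Count how often each topic is the dominant one among those documents
dominant_counts = Counter(props[single_topic_docs].argmax(axis=1).tolist())

print(single_topic_docs)
print(dominant_counts)
```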

In [None]:
threshold=0.98
docs_one_topic=[d for d in range(D) if max(topic_proportions[d])>threshold]
main_topics=[# FILL]
main_topics_histo=Counter(# FILL)

df_topics_histo = pd.DataFrame.from_dict(main_topics_histo, orient='index')
df_topics_histo=df_topics_histo.sort_values(by=[0],ascending=False) 
df_topics_histo.plot(kind='bar',figsize=(12,6))
plt.tight_layout()

##### Visualize main topics

In [None]:
max_topics=list(df_topics_histo.iloc[:5].index)
n_max = 10
for kk in max_topics:
    print('+ Topic ' + str(kk) + ':')
    idx = np.argsort(-topics[kk,:])
    main_words_in_this_topic=list(np.concatenate([df_voc.iloc[idx[nn]].values for nn in range(n_max)]))
    print(main_words_in_this_topic)
    print()

##### Visualize one doc that has only one topic.  
Pick one that has a topic among the most frequent ones.

In [None]:
t=max_topics[0]
possible_d=[d for d in docs_one_topic if # FILL]
sample_d=np.random.choice(# FILL)
print('Chosen doc:',sample_d)

In [None]:
sample_d += 1

In [None]:
df_sample_d=# FILL
main_wordsId_in_this_doc=df_sample_d.iloc[:20]['wordId'].values
# word ids in the data file are 1-based, while df_voc rows are 0-based
main_words_in_this_doc=np.concatenate([df_voc.iloc[w-1].values for w in main_wordsId_in_this_doc])
main_words_in_this_doc

##### Show more topic proportions
Pick a sample of documents and show their main topic proportions.

In [None]:
plt.figure(figsize=(8,6))
n_docs_shown = 100   # show at most 100 documents
n_top = 5            # show the 5 largest topics of each document
idx_D = np.arange(min(D, n_docs_shown))   # x-axis locations
bar_width = 0.5
plots = []
height_cumulative = np.zeros(len(idx_D))

# For each document, indices of its n_top largest topics, in decreasing order
idx = np.argsort(-topic_proportions[idx_D], axis=1)[:, :n_top]
for kk in range(n_top):
    color = plt.cm.coolwarm(kk/n_top, 1)
    heights = topic_proportions[idx_D, idx[:, kk]]
    p = plt.bar(idx_D, heights, bar_width, bottom=height_cumulative, color=color)
    height_cumulative += heights
    plots.append(p)
plt.ylim((0, 1))  # proportions sum to 1
plt.ylabel('Topic proportions')
plt.title('Topic proportions in documents')
plt.yticks(np.linspace(0, 1, 11))  # ticks at 0.0, 0.1, ..., 1.0
# Bars are ordered by rank within each document, not by topic index
topic_labels = ['Rank {} topic'.format(kk+1) for kk in range(n_top)]
plt.legend([p[0] for p in plots], topic_labels)
plt.show()

##### Visualize HeatMap

In [None]:
plt.figure(figsize=(12,12))
plt.pcolor(topic_proportions, norm=None, cmap='Blues')
topic_labels = ['Topic {}'.format(kk) for kk in range(n_topics)]
plt.xticks(np.arange(topic_proportions.shape[1])+0.5, topic_labels);
plt.gca().invert_yaxis()
plt.xticks(rotation=90)
plt.colorbar()
plt.tight_layout()
plt.show()

##### Weighted impact of topic over documents.
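
Here the impact of a topic is taken to be the sum of its proportions over all documents. A minimal sketch on a hypothetical matrix (`props`, `impact`, and `ranking` are names invented for the example):

```python
import numpy as np

# Hypothetical D x K topic proportions (3 documents, 3 topics)
props = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.8, 0.1]])

impact = props.sum(axis=0)      # total weight of each topic over all documents
ranking = np.argsort(-impact)   # topic indices, most impactful first

print(impact)
print(ranking)
```

Since each row sums to 1, the impacts sum to the number of documents, so dividing by that number gives the average share of each topic in the corpus.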

In [None]:
topic_impact=topic_proportions.sum(axis=0)         # total proportion of each topic over all documents
most_impactful_topics=np.argsort(-topic_impact)    # topic indices, most impactful first

Visualize least impactful topics

In [None]:
n_max=10
for t in most_impactful_topics[-5:]:
    idx = np.argsort(-topics[t,:])
    main_words_in_this_topic=list(np.concatenate([df_voc.iloc[idx[nn]].values for nn in range(n_max)]))
    print(main_words_in_this_topic)
    print()

Visualize documents containing the least impactful topic

In [None]:
t_least=most_impactful_topics[-1]
d_least=np.argsort(-topic_proportions[:,t_least])[0]

In [None]:
topic_proportions[d_least,t_least]

`t_least`: the least impactful topic

In [None]:
idx = # FILL
main_words_in_this_topic=# FILL
print(main_words_in_this_topic)
print()

`d_least`: the document with the largest proportion of the least impactful topic

In [None]:
df_sample_d=# FILL
main_wordsId_in_this_doc=df_sample_d.iloc[:100]['wordId'].values
# word ids in the data file are 1-based, while df_voc rows are 0-based
main_words_in_this_doc=np.concatenate([df_voc.iloc[w-1].values for w in main_wordsId_in_this_doc])
main_words_in_this_doc

Visualize most impactful topics

In [None]:
max_topics=most_impactful_topics[:5]
n_max = 10
for kk in max_topics:
    print('+ Topic ' + str(kk) + ':')
    idx = np.argsort(-topics[kk,:])
    main_words_in_this_topic=list(np.concatenate([df_voc.iloc[idx[nn]].values for nn in range(n_max)]))
    print(main_words_in_this_topic)
    print()