
# Computational Linguistics - Topic Modeling

In this lab, we will explore two topic modeling methods: LDA and NMF. Please check the readings to learn about these methods. 

To understand LDA and NMF, we will create a toy dataset by downloading some Wikipedia text. 

In [None]:
import wikipedia
import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

wikipedia.set_lang("en")

In [None]:
# # this class fetches summary text from a wiki page

# class TextFetcher:

#     def __init__(self, title):
#         self.title = title
#         page = wikipedia.page(title)
#         self.text = page.summary

#     def getText(self):
#         return self.text

In [None]:
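def preprocessor(text):
    # import nltk; nltk.download('stopwords')  # uncomment if the stop-word list is missing
    english_stopwords = set(stopwords.words('english'))  # build the set once, not once per token
    tokens = word_tokenize(text)
    # compare in lowercase because the NLTK stop-word list is lowercase
    return " ".join(word for word in tokens if word.lower() not in english_stopwords)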

## Create a toy dataset

Let's create a toy dataset of 6 wiki articles. For simplicity, we will extract only the summary text of each page. 

In [None]:
nyc = wikipedia.page("New York City", auto_suggest=False)
text1 = nyc.summary
nlp = wikipedia.page("Natural Language Processing", auto_suggest=False)
text2 = nlp.summary
tgg = wikipedia.page("The Great Gatsby", auto_suggest=False)
text3 = tgg.summary
ml = wikipedia.page("Machine Learning", auto_suggest=False)
text4 = ml.summary
la = wikipedia.page("Los Angeles", auto_suggest=False)
text5 = la.summary
covid = wikipedia.page("Coronavirus",  auto_suggest=False)
text6 = covid.summary

docs = [text1, text2, text3, text4, text5, text6]

## Create a term frequency matrix

LDA works on a term-frequency (count) matrix: one row per document, one column per term.

In [None]:
count_vectorizer = CountVectorizer(stop_words='english')
# count_vectorizer = CountVectorizer(stop_words='english', max_features=100)
term_frequency = count_vectorizer.fit_transform(docs)
feature_names = count_vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [None]:
print(f"Shape of term freq matrix = {term_frequency.shape}")
print(f"Num of features identified = {len(feature_names)}")

`CountVectorizer` accepts a `max_features` argument that caps the vocabulary at the most frequent terms in the corpus, if required.
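
To make the idea of a term-frequency matrix concrete, here is a minimal pure-Python sketch of what counting terms and capping the vocabulary amounts to. The corpus `toy_docs` and the helper `build_count_matrix` are made up for illustration; the sketch skips stop-word removal and does not claim to reproduce `CountVectorizer`'s tokenization or tie-breaking exactly.

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased and free of punctuation.
toy_docs = [
    "the city is the largest city",
    "the novel is a great novel",
    "machine learning uses data",
]

def build_count_matrix(documents, max_features=None):
    """Return (terms, matrix) where matrix[d][j] counts term j in document d."""
    tokenized = [doc.split() for doc in documents]
    corpus_counts = Counter(token for tokens in tokenized for token in tokens)
    terms = sorted(corpus_counts)  # alphabetical vocabulary
    if max_features is not None:
        # Keep the most frequent terms across the corpus (ties broken alphabetically here).
        kept = sorted(terms, key=lambda t: (-corpus_counts[t], t))[:max_features]
        terms = sorted(kept)
    matrix = [[tokens.count(term) for term in terms] for tokens in tokenized]
    return terms, matrix

vocab, counts = build_count_matrix(toy_docs)
small_vocab, small_counts = build_count_matrix(toy_docs, max_features=3)

print(len(vocab), "terms in the full vocabulary")
print(small_vocab)   # vocabulary after capping at 3 terms
print(small_counts)  # one row per document, one column per kept term
```

Each row of the matrix is the bag-of-words representation of one document, which is the input LDA expects.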

## Fit an LDA model

Let's fit an LDA model with 5 topics (aka components). 

In [None]:
lda = LatentDirichletAllocation(n_components=5, random_state=0)  # random_state is for replicating the result
lda.fit(term_frequency)    

## Analyze the topics

Now, let's print the top 10 words of each topic, ranked by the word weights learned by LDA. 
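
Ranking words by weight is just a sort on one row of `components_`. The sketch below shows the same selection in plain Python on made-up weights (`toy_weights` and `toy_names` are hypothetical); the NumPy idiom `argsort()[:-k - 1:-1]` used later in this notebook picks the same top-k indices, heaviest first.

```python
def top_k_terms(term_weights, feature_names, k):
    """Return the k terms with the largest weights, heaviest first."""
    ranked = sorted(range(len(term_weights)),
                    key=lambda i: term_weights[i],
                    reverse=True)
    return [feature_names[i] for i in ranked[:k]]

# Hypothetical weights for a single topic over a five-word vocabulary.
toy_names = ["city", "genome", "novel", "rna", "virus"]
toy_weights = [0.1, 2.5, 0.1, 3.0, 4.2]

print(top_k_terms(toy_weights, toy_names, 3))
```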

In [None]:
print(f"Num of topics = {len(lda.components_)}")

# words' weights associated with topic 0
lda.components_[0]

In [None]:
feature_names[:25]

In [None]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, term_weights in enumerate(model.components_):
        # indices of the top-k terms, heaviest first
        top_indices = term_weights.argsort()[:-no_top_words - 1:-1]
        topk_words = [feature_names[i] for i in top_indices]
        print(f"Topic {topic_idx}:")
        print(" ".join(topk_words))


In [None]:
display_topics(lda, feature_names, 10)

I obtained the following output (yours may differ, depending on the current Wikipedia text and library versions): 

```
Topic 0:
cause viruses coronaviruses rna birds genome lethal mild humans order
Topic 1:
language natural documents computers computer understanding processing speech recognition intelligence
Topic 2:
los angeles city learning machine data area largest california metropolitan
Topic 3:
new york city world largest area united metropolitan county states
Topic 4:
novel fitzgerald gatsby american great work following island believed title
```

Looking at the top 10 words in each topic, we could name the topics as follows: 

* Topic 0: Coronavirus
* Topic 1: Natural Language Processing
* Topic 2: Los Angeles (mixed with Machine Learning terms)
* Topic 3: New York
* Topic 4: The Great Gatsby Novel

Naming topics is subjective, and there are no correct answers. 


### Visualizing a topic

A word cloud (or tag cloud) is a popular way to visualize topics. 

In [None]:
from wordcloud import WordCloud

If the above code produces an error, install the package and execute the line again. 

In [None]:
!pip install wordcloud

Create a dictionary of the most heavily weighted terms and their weights.

In [None]:
topic = lda.components_[0]  # the coronavirus topic: Topic 0 in the output listed above; adjust if your run differs
no_top_words = 10

weights_lda = {}
for i in topic.argsort()[:-no_top_words - 1:-1]:
    print(feature_names[i], topic[i])
    weights_lda[feature_names[i]] = topic[i]


In [None]:
wc = WordCloud(background_color='black')
wc.generate_from_frequencies(weights_lda)
wc.to_image()

## Topic mixture in a document

For each document, we can see how strongly each topic is represented in it. 

In [None]:
topic_mixture = lda.transform(term_frequency[-1:])
np.around(topic_mixture, decimals=2)

This vector represents the topic mixture of the Coronavirus wiki page (the last document in `docs`). As expected, the coronavirus topic (Topic 0 in the output listed above) should be the dominant one for this document. 

Remember the dimensionality reduction methods from the Applied Machine Learning course. **In a shallow sense, topic modeling can be considered a dimensionality reduction method.** Here we represent a document as a topic mixture, and the number of topics is far smaller than the number of terms in the vocabulary.

## Using a topic model for classification/clustering

Although the aim of a topic model is to identify the underlying structure of the documents in a corpus in terms of topics, we can also use it for classification and clustering. E.g., we can estimate the topic mixture of a new document and use the topic with the largest proportion as its class label.
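
As a minimal sketch of that labeling rule, the snippet below picks the dominant topic from a mixture. The values in `toy_mixture` are made up, and `dominant_topic` is a hypothetical helper, not part of scikit-learn.

```python
def dominant_topic(topic_mixture):
    """Return (index, proportion) of the largest entry in a topic mixture."""
    index = max(range(len(topic_mixture)), key=lambda i: topic_mixture[i])
    return index, topic_mixture[index]

# Hypothetical mixture over five topics for one new document.
toy_mixture = [0.02, 0.03, 0.15, 0.78, 0.02]

label, share = dominant_topic(toy_mixture)
print(f"Predicted label: Topic {label} (proportion {share:.2f})")
```

With the 2-D array returned by `lda.transform`, NumPy's `argmax(axis=1)` gives the same label for every row at once.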

In [None]:
chicago = wikipedia.page("Chicago", auto_suggest=False)
text7 = chicago.summary
print(lda.transform(count_vectorizer.transform([text7])))

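The code above estimates the topic mixture of the "Chicago" wiki page. We expect the dominant topic to be one of the city topics (Topic 2 or Topic 3 in the output listed earlier; topic indices can differ between runs and library versions). 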

Given the topic mixture of all of the documents, we can perform clustering on the documents in the topic space. 
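
One simple way to see why this works: documents about similar subjects end up with similar topic mixtures, so a similarity measure in topic space groups them. The sketch below uses cosine similarity on made-up mixtures (`toy_mixtures` is hypothetical, as are the helper names); a real clustering algorithm such as k-means could be applied to the same vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical topic mixtures (three topics) for four documents.
toy_mixtures = {
    "city_a": [0.90, 0.05, 0.05],
    "city_b": [0.80, 0.10, 0.10],
    "novel": [0.05, 0.90, 0.05],
    "virus": [0.05, 0.05, 0.90],
}

def most_similar(name, mixtures):
    """Return the name of the other document closest to `name` in topic space."""
    scored = [(cosine_similarity(mixtures[name], vec), other)
              for other, vec in mixtures.items() if other != name]
    return max(scored)[1]

print(most_similar("city_a", toy_mixtures))
```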

## Evaluating a topic model and choosing the number of topics

Because topic modeling is an unsupervised method, it is hard to evaluate: there is no gold standard. Here are a few approaches that are commonly used for evaluation (see [here](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0)):

- Eyeballing Models
    - Top N words
    - Topics / Documents
- Intrinsic Evaluation Metrics
    - Capturing model semantics
    - Topics interpretability
- Human Judgements
    - What is a topic
- Extrinsic Evaluation Metrics/Evaluation at task
    - Is the model good at performing predefined tasks, such as classification?


As with clustering, we set the number of topics manually. However, we can use an intrinsic or extrinsic measure to choose a suitable number. One such intrinsic measure is the **perplexity** score, which is derived from the predictive likelihood and measures goodness of fit. Lower perplexity is better.   
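
To show what the score measures, here is a minimal sketch of perplexity as the exponential of the negative average log-probability per token. It uses a made-up unigram model instead of LDA (`uniform`, `peaked`, and `held_out` are hypothetical); scikit-learn's `perplexity` method relies on an approximate bound on the likelihood rather than this exact calculation, but the interpretation is the same.

```python
import math

def unigram_perplexity(tokens, word_probs):
    """exp of the negative average log-probability per token."""
    log_likelihood = sum(math.log(word_probs[token]) for token in tokens)
    return math.exp(-log_likelihood / len(tokens))

# Hypothetical unigram models over a four-word vocabulary.
uniform = {"city": 0.25, "novel": 0.25, "virus": 0.25, "data": 0.25}
peaked = {"city": 0.70, "novel": 0.10, "virus": 0.10, "data": 0.10}

# Hypothetical held-out text that is mostly about cities.
held_out = ["city", "city", "city", "novel"]

print(unigram_perplexity(held_out, uniform))  # a uniform model is as "surprised" as the vocabulary size
print(unigram_perplexity(held_out, peaked))   # the better-fitting model scores lower
```

A model that assigns higher probability to the observed words is less "surprised" by them, hence the lower score.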

In [None]:
scores = []
for n_topics in range(2, 7): 
    # use a separate name so the 5-topic `lda` fitted earlier is not overwritten
    candidate = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    candidate.fit(term_frequency)    
    score = candidate.perplexity(term_frequency)
    scores.append(score)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(range(2,7), scores)
plt.xlabel('Num of Topics')
plt.ylabel('Perplexity')

As shown in the above plot, we can pick either 4 or 5 as the number of topics.

## Non-negative Matrix Factorization (NMF)

We can repeat the above exercise with Non-negative Matrix Factorization (NMF). For LDA we use a TF matrix as input, but NMF can take either a TF or a TFIDF matrix. This time we will create a TFIDF matrix to represent the documents and then apply NMF. 
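
For reference, the sketch below computes TFIDF weights by hand from a small count matrix. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 row normalization, which matches the documented defaults of `TfidfVectorizer`; the counts in `toy_counts` and the helper `tfidf_rows` are made up for illustration.

```python
import math

def tfidf_rows(count_matrix):
    """Return (idf, rows): smoothed idf per term and L2-normalized TFIDF rows."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # document frequency: in how many documents each term occurs
    doc_freq = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    rows = []
    for row in count_matrix:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        rows.append([v / norm if norm > 0 else 0.0 for v in weighted])
    return idf, rows

# Hypothetical counts: 3 documents x 3 terms; term 0 occurs in every document.
toy_counts = [
    [2, 1, 0],
    [1, 0, 3],
    [1, 0, 0],
]

toy_idf, toy_tfidf = tfidf_rows(toy_counts)
print(toy_idf)    # the term found in every document gets the smallest idf
print(toy_tfidf)  # each row has unit length
```

Terms that occur in every document are down-weighted relative to rarer, more distinctive terms.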



In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(docs)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2


In [None]:
print(f"Shape of tfidf matrix = {tfidf.shape}")
print(f"Num of features identified = {len(tfidf_feature_names)}")

### Fit an NMF
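
NMF approximates a non-negative document-term matrix V as a product W H of two smaller non-negative matrices: the rows of H play the role of topics and the rows of W are the topic weights of each document. The sketch below is a toy pure-Python version using the classic multiplicative update rules. This is an assumption for illustration only: scikit-learn's `NMF` uses a coordinate-descent solver by default, and the matrix `toy_v` and all helper names are made up.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf_multiplicative(v, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factorize V ~ W H with multiplicative updates; returns (W, H, initial_error)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # strictly positive random initialization
    w = [[rng.random() + 0.1 for _ in range(n_components)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(n_components)]
    initial_error = frobenius_error(v, w, h)
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * numer[a][j] / (denom[a][j] + eps) for j in range(n_cols)]
             for a in range(n_components)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * numer[i][a] / (denom[i][a] + eps) for a in range(n_components)]
             for i in range(n_rows)]
    return w, h, initial_error

# Hypothetical 4 documents x 4 terms count matrix with two obvious "topics".
toy_v = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
]

toy_w, toy_h, start_error = nmf_multiplicative(toy_v, n_components=2)
final_error = frobenius_error(toy_v, toy_w, toy_h)
print(f"reconstruction error: {start_error:.3f} -> {final_error:.3f}")
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is what makes the factors readable as additive topic weights.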

In [None]:
# Run NMF
nmf = NMF(n_components=5, random_state=0)
nmf.fit(tfidf)  # fit on the TFIDF matrix built above, not on the raw counts

### Display topics

In [None]:
display_topics(nmf, tfidf_feature_names, 10)

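Let's visualize the coronavirus topic. Its index depends on the run, so check the output above and adjust the index below if needed. 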

In [None]:
topic = nmf.components_[4]  # take the corona topic
no_top_words = 10

weights_nmf = {}
for i in topic.argsort()[:-no_top_words - 1:-1]:
    weights_nmf[tfidf_feature_names[i]] = topic[i]
weights_nmf

In [None]:
wc = WordCloud(background_color='black')
wc.generate_from_frequencies(weights_nmf)
wc.to_image()

In [None]:
import pandas as pd
df1 = pd.DataFrame(weights_lda.items())
df2 = pd.DataFrame(weights_nmf.items())

df = pd.concat([df1, df2], axis=1)
df

We can see that both LDA and NMF have the same word set for the Coronavirus topic, although the ordering is a little different.
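
To check such a comparison programmatically rather than by eye, one option is the Jaccard overlap of the two word sets plus a direct comparison of the orderings. The dictionaries below are made-up stand-ins for `weights_lda` and `weights_nmf`, and `jaccard` is a hypothetical helper.

```python
def jaccard(words_a, words_b):
    """Size of the intersection divided by size of the union of two word collections."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# Hypothetical top-word weights, standing in for weights_lda and weights_nmf.
toy_lda = {"viruses": 4.1, "rna": 3.2, "genome": 2.0}
toy_nmf = {"rna": 0.9, "viruses": 0.7, "genome": 0.4}

print("word-set overlap:", jaccard(toy_lda, toy_nmf))
print("same ordering:", list(toy_lda) == list(toy_nmf))
```

An overlap of 1.0 with a different ordering corresponds to the situation described above: the same words, ranked differently by the two models.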

---

# Save your notebook, then `File > Close and Halt`