# Topic Modeling with Latent Dirichlet Allocation

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
from requests import get
import re

import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import nltk

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn import preprocessing

from nltk.corpus import stopwords
from nltk import SnowballStemmer
from nltk.tokenize import RegexpTokenizer

import string
import time

## Text Analysis - Topic Modeling

Suppose we have a corpus of text data that we want to understand better. For example, let's say we have all articles that were published in the New York Times in 2021, and we want to get a sense of what the news covered that year. We could read all of them, but that would take far too long, and it would be hard for us to process that amount of text manually.

Instead, we can use a technique called **topic modeling** using **Latent Dirichlet Allocation**. This will allow us to automatically generate topics that describe the documents within a corpus, as well as determine which documents fit into which topics. 

To do this, though, we first need to get the data into a matrix form.

In [None]:
nyt_2021 = pd.read_csv('nyt_2021.csv')

In [None]:
nyt_2021.head()

## Recall: Text Data into a Matrix Format

We're going to treat each token as a variable and each document as an observation. So, in the case of NYT article abstracts, we will treat each individual abstract as an observation. There will be as many columns as there are unique tokens in the overall corpus (so there will be many, many variables!). The dataset that we end up with will look something like this:

|document ID|about|america|author|ask|...|
|-|-|-|-|-|-|
|1|0|0|0|0|...|
|2|0|1|0|0|...|
|3|0|0|3|0|...|
|4|1|0|0|0|...|
|5|0|0|0|2|...|
|...|...|...|...|...|...|
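
To make the table above concrete, here is a minimal sketch that builds the same kind of matrix by hand for a tiny made-up corpus. The three "documents" are invented for illustration and are not from the NYT data; `CountVectorizer` does this work for us later on.

```python
from collections import Counter

# Toy corpus (made up for illustration, not from the NYT data)
docs = [
    "america ask about america",
    "author author author",
    "ask about author",
]

# One Counter of token frequencies per document
counts = [Counter(doc.split()) for doc in docs]

# Columns are the sorted unique tokens across the whole corpus
vocab = sorted(set(token for c in counts for token in c))

# Rows are documents; each cell is how often that token appears in that document
matrix = [[c[token] for token in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Each row has one entry per unique token, and most entries are zero. That is why the real matrix gets stored in a sparse format.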

To convert our abstracts into this format, we first take a Series of the abstracts with everything lowercased.

In [None]:
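# Drop missing abstracts first, lowercase the rest, then renumber the rows 0..n-1
# so that positions line up with the rows of the matrices we build later
abstracts = nyt_2021.abstract.dropna().str.lower().reset_index(drop=True)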
abstracts.head()

Next, we create a `tokenize` function that does the tokenizing and stemming steps that we had done before. Rather than calling this function directly, we will pass it to `CountVectorizer` below.

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer("english")
stop = stopwords.words('english')

In [None]:
def tokenize(text):
    tokens = tokenizer.tokenize(text)
    return [stemmer.stem(token) for token in tokens]
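
As a quick check of what `tokenize` returns, the sketch below repeats the definitions from the cells above so that it runs on its own, then applies the function to a made-up sentence. The sentence is an assumption for illustration only.

```python
from nltk import SnowballStemmer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')   # keep runs of letters, digits, and underscores
stemmer = SnowballStemmer("english")

def tokenize(text):
    # Split into word tokens, then reduce each token to its stem
    tokens = tokenizer.tokenize(text)
    return [stemmer.stem(token) for token in tokens]

# Made-up sentence for illustration
print(tokenize("running runs, and the runner's run!"))

# Punctuation is dropped, so a contraction splits into separate tokens
print(tokenize("don't"))
```

Note that different forms of the same word collapse to one stem, and that contractions are split at the apostrophe. The second point matters when we stem the stop word list below.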

## CountVectorizer

We can apply this to each abstract in our corpus using `CountVectorizer`. This will not only do the tokenizing, but it will also count how many times each word occurs in each abstract and create a matrix that contains those frequencies. This will be quite a large matrix (the number of columns will be the number of unique words that we keep), so it outputs the data as a sparse matrix.

We will first create the `vectorizer` object (you can think of this like a model object), and then fit it with our abstracts. This should give us back our overall corpus bag of words, as well as a list of features (that is, the unique words in all the abstracts).

In [None]:
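# Run the stop words through the same tokenizer and stemmer so they match our tokens.
# Some stop words split into several tokens (e.g. "don't" -> "don", "t"), so keep every piece.
eng_stopwords = sorted({token for s in stop for token in tokenize(s)})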

In [None]:
vectorizer = CountVectorizer(analyzer= "word", # unit of features are single words rather than phrases of words 
                             tokenizer=tokenize, # function to create tokens
                             ngram_range=(1,1), # Tokens are individual words (unigrams) for now
                             strip_accents='unicode',
                             stop_words= eng_stopwords,
                             min_df = 0.01, # drop words that appear in fewer than 1% of abstracts
                             max_df = 0.99) # drop words that appear in more than 99% of abstracts

Once we have created the vectorizer, we can use it to transform our abstracts.

In [None]:
bag_of_words = vectorizer.fit_transform(abstracts) #transform our corpus into a bag of words 
features = vectorizer.get_feature_names_out()

Note that since this can be quite large, it will be stored as a sparse matrix. That is, it stores only the non-zero values along with their row and column positions, rather than every cell.
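
To see what a sparse matrix looks like in practice, here is a small sketch using SciPy with made-up counts. The `bag_of_words` object above is a SciPy sparse matrix of the same general kind, so the same attributes should apply to it.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up counts: 3 documents x 4 words, mostly zeros
dense = np.array([
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 2],
])
sparse = csr_matrix(dense)

print(sparse.shape)      # dimensions of the full matrix
print(sparse.nnz)        # number of stored non-zero entries
print(sparse.toarray())  # back to a dense array (only sensible for small matrices)
```

Here only 3 of the 12 cells are actually stored. With thousands of abstracts and hundreds of words, the savings are much larger.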

## Latent Dirichlet Allocation

Next, we fit the Latent Dirichlet Allocation (LDA) model. LDA is a statistical model that describes each document as a mixture of topics and each topic as a distribution over words, so words that tend to appear together end up in the same topic. This is an example of an **unsupervised machine learning model**. That is, we don't have any sort of outcome variable. We're just trying to group the abstracts into rough categories.

![LDA Topics](lda1.png)

The model does this by associating words with topics and assigning documents to topics based on the words in each document. The more words a document has relating to a topic, the more likely it is to be about that topic. The model is fit using an iterative process, and the topics are described by their **top words.**

![LDA Words](lda2.png)
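
One way to build intuition for the pictures above is to simulate the process that LDA assumes created the documents. The sketch below is a rough illustration with NumPy, not how scikit-learn fits the model. The sizes and Dirichlet parameters are arbitrary assumptions, and words are represented by column numbers rather than actual text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary sizes chosen for illustration
n_topics, vocab_size, doc_length = 3, 8, 20

# Each topic is a probability distribution over the vocabulary
topic_word = rng.dirichlet(np.full(vocab_size, 0.1), size=n_topics)

# Each document has its own mixture of topics
doc_topic = rng.dirichlet(np.full(n_topics, 0.5))

# For every word slot: pick a topic from the document's mixture,
# then pick a word from that topic's distribution
topics_drawn = rng.choice(n_topics, size=doc_length, p=doc_topic)
words_drawn = np.array([rng.choice(vocab_size, p=topic_word[k]) for k in topics_drawn])

print(doc_topic.round(2))
print(words_drawn)
```

Fitting LDA works in the opposite direction: we only observe the words, and the model estimates the topic-word distributions and document-topic mixtures that could plausibly have produced them.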


Let's try fitting an LDA model. We first create a `LatentDirichletAllocation` object, then fit it using our corpus bag of words.

In [None]:
# Create LDA model object
lda = LatentDirichletAllocation(n_components = 5, learning_method='online') 

# Fit using data (bag_of_words)
doctopic = lda.fit_transform(bag_of_words)
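
To see the shape of what comes back, here is a minimal sketch that fits LDA on a tiny made-up count matrix. The counts are invented, and `random_state=0` is set only so the toy result is repeatable.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Made-up count matrix: 6 documents x 4 words
toy_counts = np.array([
    [5, 4, 0, 0],
    [4, 6, 1, 0],
    [6, 3, 0, 1],
    [0, 1, 5, 4],
    [1, 0, 4, 6],
    [0, 0, 6, 5],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doctopic = toy_lda.fit_transform(toy_counts)

print(toy_doctopic.shape)         # (documents, topics)
print(toy_doctopic.sum(axis=1))   # each row sums to 1
print(toy_lda.components_.shape)  # (topics, words)
```

The output of `fit_transform` has one row per document and one column per topic, and each row gives that document's share of each topic. The fitted model's `components_` has one row per topic and one column per word.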

Using `lda.fit_transform` fits our model with our data (`bag_of_words`). Now, we just need to access it to see the results. One way to do this is by looking at the top words within each topic to see what each of the topics were about. 

In order to get the word membership within topics, we can use `lda.components_`. You can think of each value as a pseudo-count of how many times a word was assigned to a topic. The values are not probabilities, but dividing a row by its sum turns it into a distribution over words. Each row of `lda.components_` is an array representing one topic, with one weight for each word in `features`. So, we can loop through each topic and sort the weights to print out its most heavily weighted words.

The code to do this is shown below.

In [None]:
# Displaying the top keywords in each topic
ls_keywords = []
ls_freqs = []
topic_id = []

for i,topic in enumerate(lda.components_):
    # Sorting and finding top keywords
    word_idx = np.argsort(topic)[::-1][:10]  # positions of the 10 largest weights
    freqs = list(topic[word_idx])  # the weights themselves, in the same order
    keywords = [features[j] for j in word_idx]
    
    # Saving keywords and frequencies for later
    ls_keywords = ls_keywords + keywords
    ls_freqs = ls_freqs + freqs
    topic_id = topic_id + [i+1] * 10

    # Printing top keywords for each topic
    # Print i+1 so the label matches the topic_id saved above
    print(i + 1, ', '.join(keywords))
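
The sorting step inside the loop is easier to follow on a small example. The sketch below uses made-up weights and hypothetical stemmed words in place of one row of `lda.components_` and our `features`.

```python
import numpy as np

# Hypothetical stems and made-up weights standing in for features and one topic row
toy_features = np.array(["vaccin", "elect", "game", "climat", "trial"])
toy_topic = np.array([2.0, 9.5, 0.3, 4.1, 7.2])

# argsort gives positions from smallest to largest weight;
# reversing and slicing keeps the positions of the 3 largest
top_idx = np.argsort(toy_topic)[::-1][:3]

print(toy_features[top_idx])  # words with the largest weights
print(toy_topic[top_idx])     # their weights, in the same order

# Dividing by the total turns the weights into proportions that sum to 1
print(toy_topic / toy_topic.sum())
```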

<font color ='red'>**Question 1: Copy and paste the code in the cell above, then change it to display the top 20 most frequent words. What are the topics that you see?**</font>

## Visualizing Word Frequencies within Topics

We were able to get the top words within each topic, but we can also look at the frequencies of those words to get a better idea of which words were most prominent within each topic. This can also help us identify any words that might end up being context-specific stop words.

To do this, we'll first make a DataFrame using the keywords and frequencies that we saved, as well as the topic IDs. Then, we can display them all using a bar graph. 

In [None]:
top_words_df = pd.DataFrame({'keywords':ls_keywords, 'frequency':ls_freqs, 'topic_id':topic_id})

In [None]:
sns.catplot(data = top_words_df, x = 'frequency', y = 'keywords', col = 'topic_id', kind = 'bar', sharey = False)

<font color ='red'>**Question 2: What are some context-specific stop words that you can identify from these bar graphs? Add those stop words to the existing `eng_stopwords` and call the new list `full_stopwords`.**</font>

In [None]:
full_stopwords = eng_stopwords + []




## Topic Memberships

We can link the topic memberships in `doctopic` back to the original documents so that we can see which documents are getting categorized into which topics. To do this, we just make a DataFrame with the `doctopic` object, then add the `abstracts`. The columns are numbered by topic rather than named, so you can rename each one with an appropriate title based on what you determined that topic to be.

In [None]:
topic_memberships = pd.DataFrame(doctopic)
# Use .values so the abstracts are matched by position rather than by index label
topic_memberships['abstract'] = abstracts.values
topic_memberships.head()
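
Once the memberships are in a DataFrame, one simple way to summarize each document is to find the topic with the largest share. The sketch below uses made-up membership values, abstracts, and placeholder column names. It is an illustration, not output from our model.

```python
import numpy as np
import pandas as pd

# Made-up topic shares for 3 documents and 3 topics (each row sums to 1)
toy_doctopic = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.20, 0.70],
    [0.30, 0.60, 0.10],
])
toy_abstracts = pd.Series(["first abstract", "second abstract", "third abstract"])

# Placeholder column names; in practice these would describe each topic
toy_memberships = pd.DataFrame(toy_doctopic, columns=["topic_1", "topic_2", "topic_3"])

# idxmax(axis=1) returns the column name holding the largest value in each row
toy_memberships["dominant_topic"] = toy_memberships.idxmax(axis=1)
toy_memberships["abstract"] = toy_abstracts.values

print(toy_memberships)
```

The dominant topic is a convenient label, but keep in mind that a document can have a meaningful share of more than one topic.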

<font color ='red'>**Question 3: Based on the words observed earlier, rename the column names in `topic_memberships`. What topic does the first abstract seem to be about? Does this make sense?**</font>

## Iterating the Process

We've fit one LDA model. This gives us some insights into the data, but we shouldn't just stop there. We can refine this model and try to see if we can get more out of it by changing some things, like adding new stop words or changing the number of topics we think there might be. 

Let's try one example of an update we can make. Since we've defined a new stop word list, we can use that instead of the original `eng_stopwords`. 

In [None]:
vectorizer = CountVectorizer(analyzer= "word", # unit of features are single words rather than phrases of words 
                             tokenizer=tokenize, # function to create tokens
                             ngram_range=(1,1), # Tokens are individual words (unigrams) for now
                             strip_accents='unicode',
                             stop_words= full_stopwords,
                             min_df = 0.01, # drop words that appear in fewer than 1% of abstracts
                             max_df = 0.99) # drop words that appear in more than 99% of abstracts

In [None]:
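# Rebuild the bag of words and feature list with the updated vectorizer
bag_of_words = vectorizer.fit_transform(abstracts)
features = vectorizer.get_feature_names_out()

# Create LDA model object
lda = LatentDirichletAllocation(n_components = 5, learning_method='online') 

# Fit using data (bag_of_words)
doctopic = lda.fit_transform(bag_of_words)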

In [None]:
# Displaying the top keywords in each topic
ls_keywords = []
ls_freqs = []
topic_id = []

for i,topic in enumerate(lda.components_):
    # Sorting and finding top keywords
    word_idx = np.argsort(topic)[::-1][:10]  # positions of the 10 largest weights
    freqs = list(topic[word_idx])  # the weights themselves, in the same order
    keywords = [features[j] for j in word_idx]
    
    # Saving keywords and frequencies for later
    ls_keywords = ls_keywords + keywords
    ls_freqs = ls_freqs + freqs
    topic_id = topic_id + [i+1] * 10

    # Printing top keywords for each topic
    # Print i+1 so the label matches the topic_id saved above
    print(i + 1, ', '.join(keywords))

<font color ='red'>**Question 4: Try running the code above again using a different number of topics. What are the topics that show up when you do this?**</font>