# Refining Latent Dirichlet Allocation Models

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
from requests import get
import re

import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import nltk

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn import preprocessing

from nltk.corpus import stopwords
from nltk import SnowballStemmer
from nltk.tokenize import RegexpTokenizer

import string
import time

## Data - NYT 2021 Archive

As before, let's bring in all articles from the NYT Archive.

In [None]:
nyt_2021 = pd.read_csv('nyt_2021.csv')

In [None]:
nyt_2021.head()

Next, we do the setup for the pre-processing steps such as tokenizing, stemming, and removing stop words.

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer("english")
stop = stopwords.words('english')

In [None]:
abstracts = nyt_2021['abstract'].dropna().str.lower()
abstracts.head()

In [None]:
# Reuses the `tokenizer` and `stemmer` objects created above
def tokenize(text):
    """Split text into word tokens and stem each one."""
    tokens = tokenizer.tokenize(text)
    return [stemmer.stem(token) for token in tokens]
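
To see what the tokenizing and stop-word steps do on their own, here is a minimal sketch that uses only the standard library. It is an illustration under assumptions, not the notebook's pipeline: `simple_tokenize`, `remove_stopwords`, and `TOY_STOPWORDS` are hypothetical names, the stop list is a tiny stand-in for NLTK's English list, and stemming is left out.

```python
import re

# Hypothetical, very small stop list; the notebook uses NLTK's full English list
TOY_STOPWORDS = {"the", "a", "of", "in", "to"}

def simple_tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())

def remove_stopwords(tokens, stopwords=TOY_STOPWORDS):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in stopwords]

tokens = simple_tokenize("The president's trip to Ohio, in brief.")
print(tokens)
print(remove_stopwords(tokens))
```

Note that the pattern `\w+` splits "president's" into two tokens, "president" and "s", because the apostrophe is not a word character.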

## First Attempt with LDA

We can apply the pre-processing to each abstract in our corpus using `CountVectorizer`. This will not only do the tokenizing, but it will also count any duplicates of words and create a matrix that contains the frequency of each word. This will be quite a large matrix (number of columns will be number of unique words), so it outputs the data as a sparse matrix.

We will first create the `vectorizer` object (you can think of this like a model object), and then fit it with our abstracts. This should give us back our overall corpus bag of words, as well as a list of features (that is, the unique words in all the abstracts).
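
As a rough picture of what the bag of words contains, the sketch below builds a small document-term matrix by hand. The three-sentence corpus is made up for illustration, the tokenizer is a plain whitespace split, and the matrix is dense rather than sparse, so this only mirrors the idea behind `CountVectorizer`, not its implementation.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the NYT data)
docs = [
    "the senate passed the bill",
    "the bill stalled in the house",
    "voters backed the senate candidate",
]

# Count tokens per document (whitespace tokenizer for simplicity)
counts = [Counter(doc.split()) for doc in docs]

# Vocabulary: sorted unique tokens, one column per token
vocab = sorted(set().union(*counts))

# Dense document-term matrix: rows are documents, columns are vocabulary terms
dtm = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```

Each row sums to the number of tokens in that document, and most entries are zero, which is why the real matrix is stored in sparse form.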

In [None]:
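# Run the stop words through the same tokenizer and stemmer so they match the
# processed text. Words like "don't" split into several tokens ("don", "t"),
# so keep every piece rather than only the first.
eng_stopwords = sorted({token for s in stop for token in tokenize(s)})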

In [None]:
vectorizer = CountVectorizer(analyzer= "word", # features are built from words rather than characters
                             tokenizer=tokenize, # function to create tokens
                             ngram_range=(1,1), # Tokens are individual words for now
                             strip_accents='unicode',
                             stop_words= eng_stopwords,
                             min_df = 0.01,
                             max_df = 0.99)

Once we have created the vectorizer, we can use it to transform our abstracts.

In [None]:
bag_of_words = vectorizer.fit_transform(abstracts) #transform our corpus into a bag of words 
features = vectorizer.get_feature_names_out()

In [None]:
# Fitting LDA model

# Create LDA model object
lda = LatentDirichletAllocation(n_components = 5, learning_method='online') 

# Fit using data (bag_of_words)
doctopic = lda.fit_transform( bag_of_words )

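Calling `lda.fit_transform` fits the model to our data (`bag_of_words`) and returns the topic distribution for each document (`doctopic`). Now we just need to inspect the topics themselves, which are stored as topic-word weights in `lda.components_`. We'll define a function that does this so that it is easier to do for later cases as well.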

In [None]:
def display_keywords(lda, nwords, verbose = True):
    '''
    Display the top words within each topic after running Latent Dirichlet Allocation.
    
    Arguments:
        lda: fitted lda object
        nwords: number of words to display for each topic
        verbose: if True, print the top keywords for each topic
    Returns:
        A DataFrame containing keywords, topic-word weights (the 'frequency' column), and topic ID.

    Note: relies on the global `features` array from the most recent vectorizer.
    '''
    ls_keywords = []
    ls_freqs = []
    topic_id = []

    for i, topic in enumerate(lda.components_):
        # Indices of the nwords largest weights, in descending order
        word_idx = np.argsort(topic)[::-1][:nwords]
        freqs = list(topic[word_idx])
        keywords = [features[j] for j in word_idx]

        # Saving keywords and weights for later
        ls_keywords = ls_keywords + keywords
        ls_freqs = ls_freqs + freqs
        topic_id = topic_id + [i+1] * len(keywords)

        # Printing top keywords for each topic, numbered from 1 to match topic_id
        if verbose:
            print(i+1, ', '.join(keywords))
    
    return pd.DataFrame({'keywords':ls_keywords, 'frequency':ls_freqs, 'topic_id':topic_id})

In [None]:
lda_df = display_keywords(lda, 15)

### N-grams - Adding context by creating N-grams

Obviously, reducing a document to a bag of words means losing much of its meaning - we put words in certain orders, and group words together in phrases and sentences, precisely to give them more meaning. If you follow the processing steps we've gone through so far, splitting your document into individual words and then removing stopwords, you'll completely lose all phrases like "kick the bucket," "commander in chief," or "sleeps with the fishes." 

One way to address this is to break down each document similarly, but rather than treating each word as an individual unit, treat each group of 2 words, or 3 words, or *n* words, as a unit. We call this a "bag of *n*-grams," where *n* is the number of words in each chunk. Then you can analyze which groups of words commonly occur together (in a fixed order). 
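
To make the idea concrete, here is a minimal sketch of how *n*-grams can be generated from a list of tokens. The helper name `ngrams` and the example tokens are hypothetical; `CountVectorizer` does this internally when `ngram_range` allows more than one word per token.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["commander", "in", "chief", "of", "the", "army"]

print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A document with *k* tokens yields *k* - *n* + 1 *n*-grams, so the vocabulary grows quickly as *n* increases. This is one reason the notebook filters rare terms with `min_df`.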

In [None]:
vectorizer = CountVectorizer(analyzer= "word", # features are built from words rather than characters
                             tokenizer=tokenize, # function to create tokens
                             ngram_range=(1,2), # Allow for single words and bigrams
                             strip_accents='unicode',
                             stop_words=eng_stopwords,
                             min_df = 0.001,
                             max_df = 0.999)

# Creating bag of words
bag_of_words = vectorizer.fit_transform(abstracts) #transform our corpus into a bag of words 
features = vectorizer.get_feature_names_out()

# Fitting LDA model
lda = LatentDirichletAllocation(n_components = 10, learning_method='online') 
doctopic = lda.fit_transform( bag_of_words )

In [None]:
# Displaying the top keywords in each topic
lda_df = display_keywords(lda, 15)

<font color ='red'>**Question 1: Take the code above and run it again but let it include tokens that are three words long as well. Are there any three word sequences that show up in the top words?**</font>

### TF-IDF - Weighting terms based on frequency

One additional step we can add in cleaning and processing our text data is **Term Frequency-Inverse Document Frequency (TF-IDF)**. TF-IDF is based on the idea that the words (or terms) that are most related to a certain topic will occur frequently in documents on that topic, and infrequently in unrelated documents. TF-IDF re-weights each term so that terms that are frequent within a document but rare across the corpus are emphasized, while terms that are common throughout the corpus are suppressed. It does this by multiplying each term's frequency within a document by the inverse of the fraction of documents in the corpus that contain the term.

Recall that our data might look something like this:

|document ID|about|america|author|ask|...|
|-|-|-|-|-|-|
|1|0|0|0|0|...|
|2|0|1|0|0|...|
|3|0|0|3|0|...|
|4|1|0|0|0|...|
|5|0|0|0|2|...|
|...|...|...|...|...|...|

The values in the cells are the term frequencies. TF-IDF takes those values and re-weights them by the inverse of how often the term occurs across documents. So, for example, if a term occurs in almost every document, its term frequency is multiplied by a number close to 1 (since the fraction of documents the term occurs in is close to 1). However, if the term occurs in only a small fraction of documents (such as 1/10th of documents), then the term frequency is multiplied by a much larger number (since we use the inverse document frequency). In practice the inverse is usually log-scaled, so the weight grows more slowly than the raw ratio.
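
The arithmetic can be sketched by hand. The example below assumes the formulas that scikit-learn documents for `TfidfTransformer` with `smooth_idf=True` and `sublinear_tf=True`: an inverse document frequency of ln((1 + n) / (1 + df)) + 1 and a dampened term frequency of 1 + ln(tf). The function names and the document counts are hypothetical.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1 + math.log(count) if count > 0 else 0.0

n_docs = 1000                               # hypothetical corpus size
common = smooth_idf(n_docs, doc_freq=950)   # term found in 95% of documents
rare = smooth_idf(n_docs, doc_freq=100)     # term found in 10% of documents

print("IDF for common term:", round(common, 3))
print("IDF for rare term:", round(rare, 3))

# A term appearing 3 times in a document, weighted by each IDF
print("TF-IDF (common):", round(sublinear_tf(3) * common, 3))
print("TF-IDF (rare):", round(sublinear_tf(3) * rare, 3))
```

With the same count in a document, the rare term ends up with roughly three times the weight of the common one, which is the emphasis TF-IDF is meant to provide.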

Let's look at how to use TF-IDF:

In [None]:
stop = stopwords.words('english') + ['said', 'new', 'year', 'one', 'case']
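# Keep every token each stop word produces (e.g. "don't" gives "don" and "t")
full_stopwords = sorted({token for s in stop for token in tokenize(s)})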

In [None]:
vectorizer = CountVectorizer(analyzer= "word", # features are built from words rather than characters
                             tokenizer=tokenize, # function to create tokens
                             ngram_range=(1,2), # Allow for single words and bigrams
                             strip_accents='unicode',
                             stop_words=full_stopwords,
                             min_df = 0.001,
                             max_df = 0.999)
# Creating bag of words
bag_of_words = vectorizer.fit_transform(abstracts) #transform our corpus into a bag of words 
features = vectorizer.get_feature_names_out()

# Use TfidfTransformer to re-weight bag of words 
transformer = TfidfTransformer(norm = None, smooth_idf = True, sublinear_tf = True)
tfidf = transformer.fit_transform(bag_of_words)

# Fitting LDA model
lda = LatentDirichletAllocation(n_components = 5, learning_method='online') 
doctopic = lda.fit_transform(tfidf)

# Displaying the top keywords in each topic
lda_df = display_keywords(lda, 15)

<font color ='red'>**Question 2: Look at the top words after implementing TF-IDF. What are the differences compared to without using TF-IDF?**</font>

## Next Steps: Document Classification with Supervised Learning

We used topic modeling to determine what topics were discussed within NYT articles. That is an example of unsupervised learning: we were looking to uncover structure in the form of topics, or groups of articles, but we did not necessarily know the ground truth of what topics there were.

We can also do supervised learning with text data. In supervised learning, we have a *known* outcome or label (*Y*) that we want to produce given some data (*X*), and in general, we want to be able to produce this *Y* when we *don't* know it, or when we *only* have *X*. 

In order to produce labels, we first need examples our algorithm can learn from: a "training set." In the context of text analysis, developing a training set can be very expensive, as it can require a large amount of human labor or linguistic expertise. **Document classification** is an example of supervised learning in which we want to assign our documents to categories (*Y*) based on their contents (*X*). A common example of document classification is spam e-mail detection. Another example is *part-of-speech tagging*, where *X* are individual words and *Y* is the part of speech.
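
As a rough illustration of learning from a labeled training set, the sketch below classifies short messages as spam or not. Everything in it is an assumption made for the example: the four training messages are invented, the classifier is a hand-written naive Bayes with add-one smoothing (one common choice, not a method prescribed here), and `fit` and `predict` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training set: (text X, label Y)
train = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

def fit(examples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts

def predict(text, word_counts, label_counts):
    """Return the label with the highest log-probability under naive Bayes."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        n_words = sum(counts.values())
        score = math.log(label_counts[label] / total_docs)
        for word in text.split():
            # Add-one (Laplace) smoothing so unseen words do not zero out the score
            score += math.log((counts[word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

word_counts, label_counts = fit(train)
print(predict("claim your free prize", word_counts, label_counts))
print(predict("agenda for the team lunch", word_counts, label_counts))
```

The word counts here play the same role as the bag of words used for topic modeling. The difference is that each training document comes with a known label.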