# LDA and PCA

In today's review session, we go over how to run Latent Dirichlet Allocation (LDA) and Principal Component Analysis (PCA) on document vectors.

Before we start, we load up the libraries.

In [1]:
import numpy as np
import requests
from pprint import pprint
import pandas as pd
import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from statsmodels.multivariate.pca import PCA

We will be working with New York Times headlines from January through March 2019, downloaded from the NYT Archive API.

In [2]:
api_key = 'Your key'  # replace with your own NYT API key
nyt_headlines = []
nyt_abstract = []
nyt_sections = []
nyt_dates = []
year = 2019
# One request per month; each response holds that month's articles
for month in [1, 2, 3]:
    url = f'https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={api_key}'
    response = requests.get(url)
    response.raise_for_status()  # stop early if the request failed (e.g. a bad key)
    content = response.json()
    for article in content['response']['docs']:
        nyt_headlines.append(article['headline']['main'])
        nyt_sections.append(article['section_name'])
        nyt_abstract.append(article['abstract'])
        nyt_dates.append(article['pub_date'])
# Collect the fields into one DataFrame, one row per article
nyt_df = pd.DataFrame({'headline': nyt_headlines, 'abstract': nyt_abstract, 'date': nyt_dates, 'section': nyt_sections})

## Latent Dirichlet Allocation

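We will first go over how Latent Dirichlet Allocation works in Python. LDA assumes a data-generating process for topics, documents, and words: each topic is a probability distribution over words, each document is a mixture of topics, and each word in a document is generated by first drawing a topic from the document's mixture and then drawing a word from that topic. Given the observed documents, LDA backs out the probability distribution over words for each topic and the distribution over topics for each document.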
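The data-generating process that LDA assumes can be simulated directly. The sketch below is only an illustration, not part of the original notebook: the function name `simulate_lda_corpus` and the Dirichlet parameters `alpha` and `beta` are assumptions chosen for the example.

```python
import numpy as np

def simulate_lda_corpus(n_docs, doc_length, n_topics, vocab_size,
                        alpha=0.5, beta=0.5, seed=0):
    """Simulate word counts from the LDA data-generating process."""
    rng = np.random.default_rng(seed)
    # Each topic is a probability distribution over the vocabulary.
    topic_word = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # Each document is a probability distribution over topics.
    doc_topic = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    counts = np.zeros((n_docs, vocab_size), dtype=int)
    for d in range(n_docs):
        # Draw a topic for every word slot, then a word from that topic.
        topics = rng.choice(n_topics, size=doc_length, p=doc_topic[d])
        for z in topics:
            word = rng.choice(vocab_size, p=topic_word[z])
            counts[d, word] += 1
    return topic_word, doc_topic, counts

topic_word, doc_topic, counts = simulate_lda_corpus(
    n_docs=5, doc_length=8, n_topics=3, vocab_size=12)
print(counts.shape)        # (5, 12)
print(counts.sum(axis=1))  # every document contains 8 words
```

Fitting LDA runs this story in reverse: only `counts` is observed, and the model estimates `topic_word` and `doc_topic`.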

In [3]:
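# Raw word counts per headline, with English stop words removed
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(nyt_df['headline'])
# Fit a 30-topic model; random_state makes the result reproducible
lda = LatentDirichletAllocation(n_components=30, random_state=1680).fit(X)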

Note that we keep the headline vectors as raw term frequencies (`CountVectorizer`) rather than TF-IDF weights, because LDA models word counts.
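To make "term frequency" concrete, here is a minimal standard-library sketch of a count matrix. It is a simplification and an assumption of this example: it splits on whitespace, whereas `CountVectorizer` applies its own tokenization and stop-word removal. The helper name `term_frequency_matrix` and the toy headlines are made up for illustration.

```python
from collections import Counter

def term_frequency_matrix(docs):
    """Return (vocab, rows) where rows[i][j] counts vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocab])
    return vocab, rows

docs = ["wall street rallies", "border wall debate", "wall wall wall"]
vocab, rows = term_frequency_matrix(docs)
print(vocab)  # ['border', 'debate', 'rallies', 'street', 'wall']
print(rows)   # [[0, 0, 1, 1, 1], [1, 1, 0, 0, 1], [0, 0, 0, 0, 3]]
```

Every entry is a non-negative integer count, which is the kind of input the LDA word-generation story describes.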

In [4]:
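topic_words = {}
n_top_words = 10
# get_feature_names() was removed in recent scikit-learn; use get_feature_names_out()
vocab = vectorizer.get_feature_names_out()
for topic, comp in enumerate(lda.components_):
    # Indices of the n_top_words largest weights, in descending order
    word_idx = np.argsort(comp)[::-1][:n_top_words]
    topic_words[topic] = [vocab[i] for i in word_idx]
for topic, words in topic_words.items():
    print(f"Topic {topic}: {', '.join(words)}")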

Topic 0: quiz, word, climate, change, economy, strong, texas, review, williams, face
Topic 1: trump, kim, college, scandal, film, north, admissions, korea, family, weekend
Topic 2: year, white, super, bowl, game, say, women, school, old, history
Topic 3: crisis, review, venezuela, stone, roger, border, pelosi, nancy, leaders, china
Topic 4: need, judge, new, brazil, strike, students, dies, teachers, john, workers
Topic 5: art, trump, season, weekend, true, save, view, dance, american, scene
Topic 6: best, says, 11, rights, reads, iran, dies, worst, civil, changed
Topic 7: el, years, chapo, trial, accused, health, care, abuse, faces, abortion
Topic 8: know, smollett, jussie, sexual, homes, sale, review, law, like, connecticut
Topic 9: 2019, march, corrections, going, february, letter, year, women, money, home
Topic 10: russia, saudi, trump, netflix, bezos, new, jeff, justice, says, arabia
Topic 11: golden, globes, house, return, review, 30, hunting, john, island, russian
Topic 12: repor
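The ranking in the loop above uses `lda.components_` directly. In scikit-learn these rows are unnormalized pseudo-counts, so dividing each row by its sum turns it into a probability distribution over words without changing the ranking. The sketch below uses a small hypothetical matrix (`components`, with a made-up four-word vocabulary) in place of the fitted model.

```python
import numpy as np

# Hypothetical stand-in for lda.components_: 2 topics over a 4-word vocabulary.
components = np.array([[8.0, 1.0, 0.5, 0.5],
                       [0.2, 0.3, 4.5, 5.0]])
vocab = ["trump", "wall", "film", "review"]

# Dividing each row by its sum gives a probability distribution over words.
topic_word_probs = components / components.sum(axis=1, keepdims=True)

for topic, probs in enumerate(topic_word_probs):
    # The two most probable words for this topic, with their probabilities.
    top = np.argsort(probs)[::-1][:2]
    print(topic, [(vocab[i], round(float(probs[i]), 2)) for i in top])
```

Printing the probabilities next to the words shows how much weight the top words carry, which the bare word lists do not reveal.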

The topics above are hard to label from their top words alone. Topic 15 looks like it is about Trump-related policy, so we check this by reading some of the headlines that load heavily on it.

In [5]:
# Retrieve the probability of topic 15 for every headline
loadings = lda.transform(X)[:, 15]
# Randomly sample 10 headlines whose loading is at least the threshold of 0.3
list(nyt_df['headline'][loadings >= 0.3].sample(10, random_state=1680))

['Two Lives in Art, and a Collection Tracing Their Trajectory',
 'Steady as She Glows',
 'Automakers Retool Marketing Machines as They Go Electric',
 'Trump Laid Out Evidence That a Wall Is Needed. We Took a Hard Look.',
 'National Emergency Powers and Trump’s Border Wall, Explained',
 'Trump’s Wall of Shame',
 'DealBook Briefing: What the State of the Union Means for Business',
 'After Falling Under Obama, America’s Uninsured Rate Looks to Be Rising',
 'U.A.E. to Use Equipment From Huawei Despite American Pressure',
 'A Tipster Pointed to Where a Body Was Buried, Revealing a 40-Year-Old Mystery']

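Several of the sampled headlines concern Trump, the border wall, or national policy, but others (an art collection, automakers' marketing) are unrelated, so the topic is only loosely coherent.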
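One possible reason for the off-topic headlines, offered here as a guess rather than a finding, is that a threshold of 0.3 also admits headlines for which another topic has a larger weight. The sketch below uses a hypothetical document-topic matrix (`doc_topic`, standing in for `lda.transform(X)`) to compare "above the threshold" with "dominant topic".

```python
import numpy as np

# Hypothetical stand-in for lda.transform(X): 4 headlines, 3 topics.
doc_topic = np.array([[0.70, 0.20, 0.10],
                      [0.10, 0.35, 0.55],
                      [0.34, 0.33, 0.33],
                      [0.05, 0.90, 0.05]])

topic = 1
threshold = 0.3
weights = doc_topic[:, topic]

# Headlines that clear the threshold for the chosen topic.
above = np.flatnonzero(weights >= threshold)
# Of those, headlines for which the chosen topic is also the largest one.
dominant = above[doc_topic[above].argmax(axis=1) == topic]

print(above.tolist())     # [1, 2, 3]
print(dominant.tolist())  # [3]
```

Restricting the sample to headlines where the topic is dominant, or raising the threshold, is one way to get a cleaner read on what a topic captures.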

## Principal Component Analysis

If we want to look at historical headlines, we have to make multiple requests, as we did when downloading the three months of data above. Most APIs limit the number of requests per minute or per day. To avoid hitting the limit, it is good practice to "sleep" between requests, for example with `time.sleep`.
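A minimal sketch of sleeping between requests follows. The helper `fetch_with_pauses` and the 6-second pause are assumptions for illustration; the right pause depends on the limits the API provider documents. The `fetch` argument would wrap the actual `requests.get` call, and the example swaps in stand-ins so it runs without network access.

```python
import time

def fetch_with_pauses(fetch, months, pause_seconds=6.0, sleep=time.sleep):
    """Call fetch(month) for each month, sleeping between consecutive calls."""
    results = []
    for i, month in enumerate(months):
        if i > 0:
            sleep(pause_seconds)  # wait before every request except the first
        results.append(fetch(month))
    return results

# Usage with a stand-in fetch function and a stand-in sleep that records pauses.
pauses = []
pages = fetch_with_pauses(lambda m: f"page-{m}", [1, 2, 3],
                          pause_seconds=6.0, sleep=pauses.append)
print(pages)   # ['page-1', 'page-2', 'page-3']
print(pauses)  # [6.0, 6.0]
```

Passing `sleep` as a parameter keeps the pause logic testable; in real use the default `time.sleep` applies.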

In [6]:
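# Use the first 200 headlines to keep the dense matrix small
toks = nyt_df['headline'][0:200].tolist()
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(toks)
# statsmodels PCA needs a dense array; toarray() replaces the deprecated .A attribute
pca = PCA(X.toarray(), ncomp=1, standardize=True)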

The principal component analysis above reduces the word dimensions (one per vocabulary term) to a single component, which is particularly useful since document vectors are usually very high-dimensional. However, as with PCA on numerical data, the resulting component is hard to interpret.
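To show what the reduction to one dimension involves, here is a small NumPy sketch of a first principal component computed through the singular value decomposition. It is an illustration on random toy data, not a reproduction of the statsmodels internals, whose sign and scaling conventions may differ; the function name `first_component` is made up for the example.

```python
import numpy as np

def first_component(X):
    """Return (scores, loadings) of the first principal component of X."""
    # Standardize each column to mean 0 and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[0]        # one weight per original column (word)
    scores = Z @ loadings   # one number per row (document)
    return scores, loadings

rng = np.random.default_rng(0)
X = rng.random((6, 4))      # 6 toy "documents", 4 toy "words"
scores, loadings = first_component(X)
print(scores.shape, loadings.shape)  # (6,) (4,)
```

Each document is summarized by a single score, and the loadings record how much each word contributes to that score.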

In [7]:
# get_feature_names() was removed in recent scikit-learn; use get_feature_names_out()
loadings_df = pd.DataFrame({'feature': vectorizer.get_feature_names_out(), 'loadings': pca.loadings[:, 0]})
print(loadings_df.nlargest(10, 'loadings'))
print(loadings_df.nsmallest(10, 'loadings'))

         feature  loadings
90         barry  0.305857
169  christopher  0.305857
320        flynn  0.305857
357      gillian  0.305857
447      jenkins  0.305857
452         john  0.305857
474    krasinski  0.305857
538    mcquarrie  0.305857
783       script  0.305857
790      secrets  0.305857
     feature  loadings
400      his -0.057047
121     bowl -0.055538
355     gift -0.055538
360    gives -0.055538
547    meyer -0.055538
605     ohio -0.055538
638  parting -0.055538
755     rose -0.055538
956    urban -0.055538
968  victory -0.055538


One way to check what a principal component might mean is to look at its loadings on the words. The loadings give a sense of what type of document would have a high value on that component. Here, I printed the 10 words with the largest loadings and the 10 words with the smallest loadings. The top 10 words seem to relate to people like John Krasinski, Christopher McQuarrie, and Gillian Flynn, who are writers and directors, whereas the bottom 10 seem to relate to Urban Meyer and Ohio State football.
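The ten largest loadings in the output above are all exactly 0.305857. A plausible explanation, which is an inference and not something checked against the data here, is that these words occur only in one and the same headline: their standardized columns are then identical, and identical columns receive identical loadings. The toy matrix below is made up to illustrate that property.

```python
import numpy as np

# Toy TF-IDF-like matrix: 5 documents (rows) by 4 words (columns).
# Words 0 and 1 occur only in document 0, with the same weight.
X = np.array([[0.5, 0.5, 0.0, 0.1],
              [0.0, 0.0, 0.4, 0.3],
              [0.0, 0.0, 0.2, 0.0],
              [0.0, 0.0, 0.0, 0.6],
              [0.0, 0.0, 0.3, 0.2]])

# Standardize the columns, then take the first principal direction.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
loadings = Vt[0]

# The two words that always appear together share one loading.
print(np.isclose(loadings[0], loadings[1]))  # True
```

If that is what happened, the first component mostly singles out one unusual headline rather than a broad theme, which is worth keeping in mind when reading the word lists.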