In [3]:
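import glob, os
import pandas as pd

def parse_speech(path):
  # File names look like 'speeches/<date> <speaker>.txt'. str.lstrip/rstrip strip
  # character sets, not prefixes/suffixes, so they can eat letters from the name.
  stem = os.path.splitext(os.path.basename(path))[0]
  date, speaker = stem.split(' ', 1)
  with open(path, 'r') as f:
    text = f.readline()  # keeps only the first line; use f.read() if a speech spans several lines
  return {'date': pd.to_datetime(date), 'speaker': speaker, 'text': text}

speeches = pd.DataFrame([parse_speech(p) for p in glob.glob('speeches/*.txt')])
speeches = speeches.set_index(['speaker', 'date'])
speeches.head()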

speaker,date,text
George Washington,1789-04-30,Fellow Citizens of the Senate and the House of...
George Washington,1789-10-03,Whereas it is the duty of all Nations to ackno...
George Washington,1790-01-08,Fellow Citizens of the Senate and House of Rep...
George Washington,1790-12-08,Fellow citizens of the Senate and House of Rep...
George Washington,1790-12-29,I the President of the United States by my own...


# Task A
Choose a reasonable number of topics for this corpus. One way to think about topics is to consider the number of issues that may have been important in the past as well as those that may have come up over the centuries. Provide a brief explanation of how you chose this number.

In [2]:
N_TOPICS = 40
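
One way to sanity-check a choice such as `N_TOPICS = 40` is to compare held-out perplexity across a few candidate topic counts. The sketch below is an illustration, not the method used in this notebook: the helper name `perplexity_by_topic_count`, the 80/20 split, and the candidate values are all assumptions. Lower perplexity is usually read as a better fit, although it does not guarantee that the topics are interpretable.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split


def perplexity_by_topic_count(texts, candidates, random_state=42):
    """Return {n_topics: held-out perplexity} for each candidate topic count.

    Hypothetical helper: the split ratio and LDA settings are assumptions.
    """
    train, test = train_test_split(list(texts), test_size=0.2,
                                   random_state=random_state)
    # Fit the vocabulary on the training speeches only
    vectorizer = CountVectorizer(stop_words='english')
    X_train = vectorizer.fit_transform(train)
    X_test = vectorizer.transform(test)

    scores = {}
    for n in candidates:
        lda = LatentDirichletAllocation(n_components=n, learning_method='batch',
                                        max_iter=10, random_state=random_state)
        lda.fit(X_train)
        # Perplexity on speeches the model has not seen
        scores[n] = lda.perplexity(X_test)
    return scores
```

In this notebook it could be called as `perplexity_by_topic_count(speeches.text, [10, 20, 40, 80])`, bearing in mind that each candidate requires a full LDA fit.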

# Task B
Now perform a topic modeling exercise with LDA. Show the word distributions for each topic as well as topic distributions for each speech. Do you see any shifts over time? Explain.

In [3]:
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV  # not used below; kept in case n_components needs tuning
from sklearn.pipeline import make_pipeline
# Bag-of-words counts (English stop words removed) feed a batch-mode LDA with N_TOPICS topics
pipe = make_pipeline(
  CountVectorizer(stop_words='english'),
  LDA(n_components=N_TOPICS, learning_method='batch', n_jobs=-1, max_iter=10, random_state=42)
).fit(speeches.text)

In [4]:
pd.DataFrame(pipe.transform(speeches.text), index=speeches.index)

speaker,date,0,1,2,3,4,5,6,7,8,9,...,30,31,32,33,34,35,36,37,38,39
George Washington,1789-04-30 00:00:00,0.000043,0.000043,0.000043,0.000043,0.071253,0.000043,0.000043,0.000043,0.000043,0.000043,...,0.000043,0.000043,0.000043,0.000043,0.000043,0.927115,0.000043,0.000043,0.000043,0.000043
George Washington,1789-10-03 00:00:00,0.000135,0.000135,0.000135,0.000135,0.000135,0.000135,0.000135,0.000135,0.000135,0.000135,...,0.000135,0.000135,0.000135,0.000135,0.000135,0.613065,0.000135,0.000135,0.000135,0.000135
George Washington,1790-01-08 00:00:00,0.000068,0.000068,0.000068,0.000068,0.000068,0.000068,0.000068,0.000068,0.000068,0.087376,...,0.000068,0.000068,0.000068,0.000068,0.000068,0.910021,0.000068,0.000068,0.000068,0.000068
George Washington,1790-12-08 00:00:00,0.000045,0.000045,0.000045,0.000045,0.000045,0.000045,0.000045,0.000045,0.032758,0.049275,...,0.000045,0.000045,0.000045,0.000045,0.000045,0.916303,0.000045,0.000045,0.000045,0.000045
George Washington,1790-12-29 00:00:00,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,...,0.000044,0.000044,0.000044,0.000044,0.000044,0.000044,0.998280,0.000044,0.000044,0.000044
George Washington,1791-10-25 00:00:00,0.000026,0.000026,0.000026,0.000026,0.000026,0.000026,0.000026,0.000026,0.000026,0.011215,...,0.000026,0.000026,0.000026,0.000026,0.000026,0.982909,0.000026,0.000026,0.000026,0.004897
George Washington,1792-04-05 00:00:00,0.000385,0.000385,0.000385,0.000385,0.000385,0.000385,0.000385,0.000385,0.000385,0.000385,...,0.000385,0.000385,0.000385,0.000385,0.000385,0.348060,0.422625,0.000385,0.000385,0.000385
George Washington,1792-11-06 00:00:00,0.000026,0.000026,0.000026,0.000026,0.000026,0.000026,0.000026,0.000026,0.000026,0.236936,...,0.000026,0.000026,0.000026,0.000026,0.000026,0.762071,0.000026,0.000026,0.000026,0.000026
George Washington,1792-12-12 00:00:00,0.000287,0.000287,0.000287,0.258707,0.000287,0.000287,0.000287,0.000287,0.000287,0.118057,...,0.000287,0.000287,0.000287,0.000287,0.000287,0.000287,0.612604,0.000287,0.000287,0.000287
George Washington,1793-03-04 00:00:00,0.000439,0.000439,0.000439,0.000439,0.000439,0.000439,0.000439,0.000439,0.000439,0.000439,...,0.000439,0.000439,0.000439,0.179743,0.000439,0.403967,0.166912,0.000439,0.000439,0.000439


# Task C
In terms of topics addressed “heavily” in a speech, which 3 former presidents does President Trump share the highest similarity with? How did you arrive at your conclusion?
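
One possible way to approach this, sketched under stated assumptions: treat a topic as "heavy" in a speech when its weight is at least some threshold, average the thresholded weights per speaker, and rank the other speakers by cosine similarity to the target speaker. The helper name `most_similar_speakers`, the 0.1 threshold, and the exact speaker label passed in are all assumptions; `doc_topics` stands for the speech-by-topic table produced in `In [4]`.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def most_similar_speakers(doc_topics, speaker, heavy=0.1, top_n=3):
    """Rank other speakers by cosine similarity of their heavy-topic profiles.

    Hypothetical helper: doc_topics is indexed by (speaker, date) with one
    column per topic; `heavy` is an assumed cut-off for a "heavy" topic.
    """
    # Zero out weights below the threshold so only heavy topics count
    heavy_only = doc_topics.where(doc_topics >= heavy, 0.0)
    # One average topic profile per speaker
    profiles = heavy_only.groupby(level='speaker').mean()
    sims = cosine_similarity(profiles.loc[[speaker]], profiles)[0]
    return (pd.Series(sims, index=profiles.index)
              .drop(speaker)
              .sort_values(ascending=False)
              .head(top_n))
```

The result depends on the threshold, so it is worth checking whether the top three names stay the same for a few values of `heavy`.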

# Task D
In terms of his own speeches, do you see President Trump shifting the emphasis on certain topics over time? Explain your response.
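
A minimal sketch for inspecting shifts in one speaker's emphasis, assuming `doc_topics` is the speech-by-topic table from `In [4]` with its `(speaker, date)` index. The helper name `topic_shares_over_time` is hypothetical, and the speaker label must match whatever the file names produced (for example `'Donald Trump'`, which is an assumption here).

```python
import pandas as pd


def topic_shares_over_time(doc_topics, speaker, freq='YS'):
    """Average topic weights of one speaker's speeches per period.

    Hypothetical helper: freq='YS' groups by calendar year; a finer
    frequency such as 'QS' may suit a short time span better.
    """
    # Select this speaker's speeches; the remaining index is the date
    own = doc_topics.xs(speaker, level='speaker').sort_index()
    # Mean topic weight per period, dropping periods without speeches
    return own.resample(freq).mean().dropna(how='all')
```

Plotting the few topics with the largest change between the first and last rows (for example with `DataFrame.plot`) is one way to make a shift visible.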

# Task E
If you do a K-means clustering with the same number of clusters as topics, do you see President Trump’s speeches and those of the 3 former presidents you identified in Task C in the same cluster? What was the basis of clustering (e.g., tf-idf, cosine similarity, etc.)? Discuss your findings.
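
A sketch of one possible basis for the clustering: tf-idf vectors fed to K-means. `TfidfVectorizer` L2-normalises each row by default, so Euclidean K-means on these vectors behaves much like clustering by cosine similarity. The helper name `cluster_speeches` and the `n_init=10` setting are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_speeches(texts, n_clusters, random_state=42):
    """Assign each speech to a K-means cluster based on its tf-idf vector.

    Hypothetical helper: `texts` is a pandas Series of speech texts; the
    returned Series of cluster ids keeps the same index.
    """
    # Rows are L2-normalised by default (norm='l2')
    X = TfidfVectorizer(stop_words='english').fit_transform(texts)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return pd.Series(km.fit_predict(X), index=texts.index, name='cluster')
```

With `labels = cluster_speeches(speeches.text, N_TOPICS)`, the expression `labels.groupby(level='speaker').unique()` lists the clusters each president's speeches fall into, which can then be compared for the presidents identified in Task C.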

# Task F
Provide a visualization of both clusters (with colors) and cosine scores using MDS. 
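
A sketch of how such a plot could be produced, assuming a feature matrix `X` (for example tf-idf vectors or the topic distributions) and one cluster id per speech. The helper names `mds_coordinates` and `plot_clusters` are hypothetical. MDS is run on pairwise cosine distances (1 minus cosine similarity), so speeches with high cosine scores land close together, and the colors come from the K-means labels.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances


def mds_coordinates(X, random_state=42):
    """Project rows of X to 2-D so that distances approximate cosine distance."""
    distances = cosine_distances(X)
    # 'precomputed' tells MDS that the input is already a distance matrix.
    # Assumption: this parameter spelling matches the installed scikit-learn;
    # check the MDS documentation for your version.
    mds = MDS(n_components=2, dissimilarity='precomputed',
              random_state=random_state)
    return mds.fit_transform(distances)


def plot_clusters(coords, labels):
    """Scatter the 2-D coordinates, colored by cluster id."""
    # Imported here so that mds_coordinates works without matplotlib installed
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(figsize=(8, 6))
    points = ax.scatter(coords[:, 0], coords[:, 1], c=np.asarray(labels),
                        cmap='tab20', s=20)
    fig.colorbar(points, ax=ax, label='K-means cluster')
    ax.set_xlabel('MDS dimension 1')
    ax.set_ylabel('MDS dimension 2')
    return ax
```

With 40 clusters the colors of a 20-color map repeat, so annotating the points for President Trump and the three presidents from Task C may be easier to read than relying on color alone.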