# News Recommender: Natural Language Processing

*insert title page* with links

## CLEANING AND VECTORIZATION
In the first section of this notebook, we will pre-process, clean, tag, lemmatize, and vectorize our web-scraped tweets. The goal is to create an optimized document-term matrix for topic modeling.

In [670]:
import pandas as pd
import numpy as np
import spacy
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from langdetect import detect
import unicodedata

### Standardize Encoding

Read in CSV as dataframe, and standardize encoding.

In [671]:
df_raw = pd.read_csv('df_raw.csv', encoding='utf-8')

In [672]:
#drop the old index column, since read_csv created a new index
df_raw = df_raw.drop(columns = 'Unnamed: 0')

In [673]:
#reorder the columns (move 'url' to position 2) for easier viewing
cols = df_raw.columns.tolist()

cols.insert(2, cols.pop(cols.index('url')))

df_raw= df_raw.reindex(columns= cols)

Decode our web-scraped tweets into ASCII so we can easily remove emojis and other non-ASCII characters during our pre-processing steps.

In [674]:
#normalize unicode (NFKD), then drop any character that has no ASCII form
df_raw['clean'] = df_raw['content'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('ascii'))
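
To see what this normalization keeps and drops, here is a minimal sketch on a few hand-picked strings (the sample strings and the helper name to_ascii are ours, not part of the dataset). NFKD splits accented letters into a base letter plus a combining mark, and the ASCII encode step then discards whatever has no ASCII form:

```python
import unicodedata

def to_ascii(text):
    """Decompose characters (NFKD), then drop anything that has no ASCII form."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

samples = ["café", "“election-security”", "😬 By Pat Byrnes"]
for s in samples:
    # accents are stripped to the base letter; curly quotes and emojis vanish
    print(repr(s), "->", repr(to_ascii(s)))
```

Note that curly quotes and emojis are deleted rather than replaced, which matches row 3 of the table below, where the quotation marks around election-security disappear.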

In [476]:
df_raw.head(5)

Unnamed: 0,user,date,url,outlinks,content,clean
0,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-11-01T01:08:48+00:00,https://twitter.com/TheAtlantic/status/1322707...,['http://on.theatln.tc/YXH6gyR'],The Atlantic Daily: Will this decade be the ne...,The Atlantic Daily: Will this decade be the ne...
1,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-11-01T00:38:45+00:00,https://twitter.com/TheAtlantic/status/1322699...,['http://on.theatln.tc/l89Uzv7'],There's plenty that's going wrong for Trump. H...,There's plenty that's going wrong for Trump. H...
2,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-11-01T00:06:48+00:00,https://twitter.com/TheAtlantic/status/1322691...,['http://on.theatln.tc/ZGvkM7u'],"If Trump tries to steal the election, people w...","If Trump tries to steal the election, people w..."
3,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-10-31T23:34:45+00:00,https://twitter.com/TheAtlantic/status/1322683...,['http://on.theatln.tc/kypt5Zc'],The Trump campaign's “election-security operat...,The Trump campaign's election-security operati...
4,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-10-31T23:04:31+00:00,https://twitter.com/TheAtlantic/status/1322675...,['http://on.theatln.tc/rNbarVc'],"Even if Joe Biden wins decisively next week, t...","Even if Joe Biden wins decisively next week, t..."


### Preprocessing

In [675]:
def preprocess(tweet):
    """
    Takes in tweet and performs initial text cleaning/preprocessing.
    """
    #make sure doc is string
    tweet=str(tweet)
    #get rid of urls
    rem_url=re.sub(r'http\S+', '', tweet)
    #gets rid of @ tags
    rem_tag = re.sub(r'@\S+', '', rem_url)
    #gets rid of # in hashtag but keeps content of hashtag
    rem_hashtag = re.sub('#', '', rem_tag)
    #gets rid of special characters, numbers, etc.
    clean_text = re.sub(r'[^A-Za-z\s]','', rem_hashtag)

    return clean_text

In [676]:
df_raw['clean']=df_raw['clean'].map(preprocess)

In [677]:
df_raw[['content', 'clean']].head(3)

Unnamed: 0,content,clean
0,The Atlantic Daily: Will this decade be the ne...,The Atlantic Daily Will this decade be the new s
1,There's plenty that's going wrong for Trump. H...,Theres plenty thats going wrong for Trump Here...
2,"If Trump tries to steal the election, people w...",If Trump tries to steal the election people wi...
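
The side effects of these regexes are easier to see on a single made-up tweet. The sketch below repeats the notebook's steps and, for comparison, a hypothetical variant (preprocess_spaced, our own name) that replaces non-letters with spaces instead of deleting them, so hyphenated words are split rather than fused:

```python
import re

def preprocess(tweet):
    """Same steps as the notebook's preprocess(): strip URLs, @tags, '#', then non-letters."""
    text = re.sub(r"http\S+", "", str(tweet))
    text = re.sub(r"@\S+", "", text)
    text = text.replace("#", "")
    return re.sub(r"[^A-Za-z\s]", "", text)

def preprocess_spaced(tweet):
    """Hypothetical variant: non-letters become spaces, then whitespace is collapsed."""
    text = re.sub(r"http\S+", " ", str(tweet))
    text = re.sub(r"@\S+", " ", text)
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return " ".join(text.split())

tweet = "99-year-old veteran raises $23 million @WSJ #Covid https://t.co/x"
print(repr(preprocess(tweet)))         # digits and hyphens are deleted, fusing the words
print(repr(preprocess_spaced(tweet)))  # digits and hyphens become word boundaries
```

The first version yields the fused token yearold; the variant yields year and old as separate tokens. Which is preferable depends on how the tokens are used downstream, so treat this as an option rather than a correction.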


### Standardize Language

Let's remove any non-English tweets so that we focus only on English text.

In [678]:
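def english_only(x):
    """
    Take tweet, detect language, and return only English tweets, coding non-English or undetectable tweets as NaN.
    """
    try:
        if detect(x) == 'en':
            return x
        else:
            return np.nan
    except Exception:
        #langdetect raises when a tweet has no detectable features (e.g. an empty string)
        return np.nan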


In [679]:
%%time
#remove any non-English tweets
df_raw['clean'] = df_raw['clean'].apply(english_only)


CPU times: user 12min 13s, sys: 6.07 s, total: 12min 19s
Wall time: 12min 22s
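
Language detection is the slowest step so far (about 12 minutes). Since the dataset contains repeated tweets, one option is to cache the detector's answer per distinct text. The sketch below is only an illustration: make_cached_filter and toy_detect are hypothetical names, and toy_detect is a stand-in so the example runs without langdetect; in the notebook one would pass langdetect's detect instead. How much time this saves depends on how many texts repeat, which we have not measured.

```python
import math
from functools import lru_cache

def make_cached_filter(detect_fn):
    """Wrap a language detector so each distinct text is classified only once."""
    @lru_cache(maxsize=None)
    def cached_lang(text):
        try:
            return detect_fn(text)
        except Exception:
            return None  # undetectable text is treated as non-English

    def english_only_cached(text):
        return text if cached_lang(text) == "en" else math.nan

    return english_only_cached

# Stand-in detector for illustration only (assumption: not a real language model).
calls = []
def toy_detect(text):
    calls.append(text)
    if not text.strip():
        raise ValueError("no features in text")
    return "en" if " the " in f" {text.lower()} " else "xx"

english_only_cached = make_cached_filter(toy_detect)
tweets = ["Take a look at the front page", "Take a look at the front page", "Hola mundo", ""]
results = [english_only_cached(t) for t in tweets]
print(results)
print(len(calls), "detector calls for", len(tweets), "tweets")
```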


In [680]:
df_raw.clean.isnull().sum()


1231

In [681]:
#fill NaNs (the non-English tweets in the "clean" column) with a single-space placeholder
df_raw = df_raw.fillna(" ")

In [682]:
df_raw.shape

(200000, 6)

### Remove Duplicate Tweets
Let's make sure we remove any duplicate tweets from the same news source.

In [683]:
#explore duplicate tweets from the same account
df_raw['Is_Duplicate']= df_raw.duplicated(subset = ['user', 'clean'])
df_raw[df_raw['Is_Duplicate']== True].sort_values(by=['clean'])


Unnamed: 0,user,date,url,outlinks,content,clean,Is_Duplicate
170097,"{'username': 'WIRED', 'displayname': 'WIRED', ...",2019-12-04T03:00:13+00:00,https://twitter.com/WIRED/status/1202060008298...,['https://wired.trib.al/G1zf2Pg'],😬\n\nBy Pat Byrnes with @collectcartoons \nhtt...,\n\nBy Pat Byrnes with \n,True
169823,"{'username': 'WIRED', 'displayname': 'WIRED', ...",2019-12-12T14:34:31+00:00,https://twitter.com/WIRED/status/1205133834553...,['https://wired.trib.al/h4tFC82'],😬\n\nBy Pat Byrnes with @collectcartoons \nhtt...,\n\nBy Pat Byrnes with \n,True
92028,"{'username': 'Medium', 'displayname': 'Medium'...",2019-02-24T02:11:25+00:00,https://twitter.com/Medium/status/109949196171...,['http://read.medium.com/QYr4Vwi'],In defense of self-doubt https://t.co/xbGMRThL0R,,True
108171,"{'username': 'NewYorker', 'displayname': 'The ...",2020-04-15T19:28:03+00:00,https://twitter.com/NewYorker/status/125050618...,[],A cartoon by Danny Shanahan. https://t.co/MGjo...,,True
169846,"{'username': 'WIRED', 'displayname': 'WIRED', ...",2019-12-11T18:26:12+00:00,https://twitter.com/WIRED/status/1204829752344...,['https://twitter.com/jack/status/120476607846...,👀👀👀👀 https://t.co/PnW8ETVEhX,,True
...,...,...,...,...,...,...,...
189149,"{'username': 'WSJ', 'displayname': 'The Wall S...",2020-07-19T15:15:06+00:00,https://twitter.com/WSJ/status/128486937995664...,['https://on.wsj.com/3heSnBN'],8-year-old Jamya Eubanks largely finished the ...,yearold Jamya Eubanks largely finished the aca...,True
156110,"{'username': 'washingtonpost', 'displayname': ...",2020-05-06T13:05:22+00:00,https://twitter.com/washingtonpost/status/1258...,['https://wapo.st/3dmXCxr'],"1,000-year-old mill starts up again to keep ho...",yearold mill starts up again to keep homes in ...,True
158061,"{'username': 'washingtonpost', 'displayname': ...",2020-04-17T17:18:37+00:00,https://twitter.com/washingtonpost/status/1251...,['https://wapo.st/2VFbuMj'],99-year-old veteran raises $23 million for Bri...,yearold veteran raises million for Britains h...,True
157917,"{'username': 'washingtonpost', 'displayname': ...",2020-04-19T06:20:08+00:00,https://twitter.com/washingtonpost/status/1251...,['https://wapo.st/2Vi4aqQ'],99-year-old veteran raises $23 million for Bri...,yearold veteran raises million for Britains h...,True


In [684]:
#get index values of duplicate rows
dupe_index = df_raw[df_raw['Is_Duplicate']==True].index
#change "clean" column of these rows to NaN (use .loc to avoid chained assignment)
df_raw.loc[dupe_index, 'clean'] = np.nan
#fill the 'clean' column NaNs with a single-space placeholder
df_raw = df_raw.fillna(" ")
#drop the helper column now that the duplicates are blanked out
df_raw.drop(columns = 'Is_Duplicate', inplace = True)
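
As a small, self-contained illustration of how duplicated flags rows, the toy frame below (made-up data) follows the same steps as the cell above. By default only the second and later occurrences are marked, so the first copy of each tweet keeps its text:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user":  ["WSJ", "WSJ", "WIRED", "WIRED", "WSJ"],
    "clean": ["yearold veteran raises million", "yearold veteran raises million", " ", " ", " "],
})

# keep='first' (the default) marks only the second and later occurrences.
df["Is_Duplicate"] = df.duplicated(subset=["user", "clean"])
flags = df["Is_Duplicate"].tolist()

# Blank out repeats with label-based assignment, then drop the helper column.
df.loc[df["Is_Duplicate"], "clean"] = np.nan
df["clean"] = df["clean"].fillna(" ")
df = df.drop(columns="Is_Duplicate")
print(flags)
print(df)
```

Note that already-blank tweets from the same account are flagged as duplicates of one another, which is why rows with an empty clean column show up in the duplicate listing above.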

In [685]:
df_raw[df_raw.clean == " "]

Unnamed: 0,user,date,url,outlinks,content,clean
11,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-10-31T16:39:45+00:00,https://twitter.com/TheAtlantic/status/1322579...,['http://on.theatln.tc/Vj7AzDM'],"""The lack of universal coverage has made the U...",
16,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-10-31T14:42:13+00:00,https://twitter.com/TheAtlantic/status/1322549...,['http://on.theatln.tc/6QI8XY4'],"If Trump tries to steal the election, people w...",
17,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-10-31T14:19:17+00:00,https://twitter.com/TheAtlantic/status/1322543...,['http://on.theatln.tc/VURlE4o'],"Endings are seductive. They suggest order, and...",
18,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-10-31T13:58:37+00:00,https://twitter.com/TheAtlantic/status/1322538...,['http://on.theatln.tc/dNldIY4'],The Trump campaign's “election-security operat...,
20,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-10-31T13:02:02+00:00,https://twitter.com/TheAtlantic/status/1322524...,['http://on.theatln.tc/sUR08l8'],The Atlantic Daily: Will this decade be the ne...,
...,...,...,...,...,...,...
199921,"{'username': 'WSJ', 'displayname': 'The Wall S...",2020-03-19T15:21:04+00:00,https://twitter.com/WSJ/status/124065956037495...,['https://on.wsj.com/3a9PS0I'],"From @WSJopinion: From gloves to respirators, ...",
199935,"{'username': 'WSJ', 'displayname': 'The Wall S...",2020-03-19T12:30:15+00:00,https://twitter.com/WSJ/status/124061657381723...,['https://on.wsj.com/2J03P5i'],"A British research lab has 20,000 volunteers w...",
199947,"{'username': 'WSJ', 'displayname': 'The Wall S...",2020-03-19T09:30:09+00:00,https://twitter.com/WSJ/status/124057125130912...,['https://on.wsj.com/2UgLdmR'],"The tale of the man in his 50s, who reportedly...",
199948,"{'username': 'WSJ', 'displayname': 'The Wall S...",2020-03-19T09:15:11+00:00,https://twitter.com/WSJ/status/124056748274368...,['http://www.wsj.com'],Take an early look at the front page of The Wa...,


### Tagging and Lemmatizing

We only want the nouns (and proper nouns) in each tweet for topic modeling, as they carry the subject of each article. Let's tag the nouns and return their lemmatized forms in a single step.

Because we have so much data, we will use spaCy's nlp.pipe to process the tweets in batches and shorten the processing time. Tips on how to do this were found here: https://towardsdatascience.com/turbo-charge-your-spacy-nlp-pipeline-551435b664ad

In [686]:
nlp = spacy.load('en_core_web_sm', disable=[ 'parser', 'ner'])

In [854]:
def noun_lemmatize_pipe(doc):
    """
    Takes in tweet and returns only the lemmatized version of nouns (including proper nouns).
    """
    lemma_list = [token.lemma_.lower() for token in doc
                  if token.pos_ == "NOUN" or token.pos_ =="PROPN"] 
    return lemma_list

#create a pipeline in order to shorten processing time
def preprocess_pipe(texts):
    """
    Runs the noun_lemmatize_pipe function over texts with nlp.pipe, which batches documents for faster processing.
    """
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=50):
        preproc_pipe.append(noun_lemmatize_pipe(doc))
    return preproc_pipe

In [933]:
%%time
#apply function and create a new column to house the outputs
df_raw['clean_lemmatized'] = preprocess_pipe(df_raw['clean'])


CPU times: user 1min 55s, sys: 595 ms, total: 1min 55s
Wall time: 1min 56s


In [856]:
df_raw[['content', 'clean', 'clean_lemmatized']].head(3)

Unnamed: 0,content,clean,clean_lemmatized
0,The Atlantic Daily: Will this decade be the ne...,The Atlantic Daily Will this decade be the new s,"[atlantic, daily, decade, s]"
1,There's plenty that's going wrong for Trump. H...,Theres plenty thats going wrong for Trump Here...,"[plenty, trump, thing, campaign, gap, joe, bid..."
2,"If Trump tries to steal the election, people w...",If Trump tries to steal the election people wi...,"[trump, election, people, coup, strategy, write]"


We've successfully extracted the nouns and proper nouns and lemmatized them, in just under two minutes of processing time!

### Remove Additional Words

Let's filter out any additional words that may appear in the tweets but aren't related to article subjects, like the names of the publications and common headline section titles.

Our removal words also include ones that surfaced during our iterative topic modeling process: any words that pop up as significant 'topic words' but carry little meaning or are blanket terms will hinder our models, and are thus added to this list.

In [934]:
removal_words= ["times", "wall", "street", "journal", "new", "yorker", "york", "medium", "wired", "financial", "washington", "post", "business", "insider", "economist", "the", "atlantic", "daily", "weekly", "monthly", "week", "day", "month", "quarter", "year", "sponsored", "breaking", "news", 'way', 'page', 'edition', 'morning', 'monday', 'tuesday', 'wednesday', 'thursday','friday','saturday','sunday', 'thing', 'briefing', 'day', 'story', 'year', 'life', 'write', 'wrote', 'time', 'closer', 'look', 'opinion', 'opinions', 'looks', 'wsjwhatsnow', 'analysis', 'world', 'today', 'house', "today's", 'tomorrow', 'yesterday', 'house', 'american', 'americans', "america's", 'america', "american's", 'government', 'governments', 'city', 'country', 'state', 'question', 'questions', 'report', 'reports', 'death', 'number']

In [935]:
df_raw['clean_removed'] = df_raw['clean_lemmatized'].apply(lambda x: [word for word in x if word not in removal_words])
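
Because the lemmas are lowercased, every entry in the removal list needs to be lowercase to match, and a set makes each lookup constant-time. A minimal sketch with a short made-up list (removal_set and drop_removal_words are our own names, and the list is only in the style of the notebook's):

```python
# A small hypothetical list with one capitalized entry and one repeat.
removal_words = ["times", "daily", "Breaking", "news", "day", "day", "year"]

# Lowercase and deduplicate once; set membership is O(1) per lookup.
removal_set = {word.lower() for word in removal_words}

def drop_removal_words(lemmas, blocked=removal_set):
    """Return the lemmas that are not in the blocked set."""
    return [lemma for lemma in lemmas if lemma not in blocked]

print(drop_removal_words(["breaking", "news", "trump", "election"]))
```

With the full list, the same function could be passed to df_raw['clean_lemmatized'].apply in place of the lambda.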

### Vectorize
Now let's rejoin our twice-cleaned, lemmatized lists of nouns and proper nouns into strings!

In [936]:
df_raw['clean_final'] = df_raw['clean_removed'].apply(lambda x: ' '.join(x))

In [937]:
df_raw[['content', 'clean', 'clean_lemmatized', 'clean_removed', 'clean_final']].head(3)

Unnamed: 0,content,clean,clean_lemmatized,clean_removed,clean_final
0,The Atlantic Daily: Will this decade be the ne...,The Atlantic Daily Will this decade be the new s,"[atlantic, daily, decade, s]","[decade, s]",decade s
1,There's plenty that's going wrong for Trump. H...,Theres plenty thats going wrong for Trump Here...,"[plenty, trump, thing, campaign, gap, joe, bid...","[plenty, trump, campaign, gap, joe, biden]",plenty trump campaign gap joe biden
2,"If Trump tries to steal the election, people w...",If Trump tries to steal the election people wi...,"[trump, election, people, coup, strategy, write]","[trump, election, people, coup, strategy]",trump election people coup strategy


In [695]:
#how many tweets contain the lemma "coronavirus"?
counter = 0
for lemmas in df_raw['clean_lemmatized']:
    if "coronavirus" in lemmas:
        counter = counter +1
print(counter)

14252
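
The same count can be generalized to every lemma at once. The sketch below uses collections.Counter on a tiny made-up stand-in for df_raw['clean_lemmatized'] to compute document frequency (the number of tweets containing a lemma), which is the quantity the vectorizer's min_df and max_df thresholds are applied to:

```python
from collections import Counter

# Hypothetical stand-in for df_raw['clean_lemmatized'] (one list of lemmas per tweet).
lemmatized = [
    ["coronavirus", "case", "florida"],
    ["trump", "election", "coronavirus", "coronavirus"],
    ["market", "stock"],
]

# Count each lemma at most once per tweet, i.e. its document frequency.
doc_freq = Counter(lemma for lemmas in lemmatized for lemma in set(lemmas))

print(doc_freq["coronavirus"])
print(doc_freq.most_common(1))
```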


In [816]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt 

clean_words = ' '.join(df_raw['clean_final'])

#use wordcloud's built-in English stop word set
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white', 
                stopwords = STOPWORDS, 
                min_font_size = 10).generate(clean_words) 
  
# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show() 


ImportError: cannot import name 'query_integral_image' from 'wordcloud.query_integral_image' (unknown location)

We will be using the TF-IDF vectorizer (TfidfVectorizer) rather than the count vectorizer, because TF-IDF down-weights words that appear in many tweets and so gives rare words more influence. Because our focus is on nouns and proper nouns, rare words are likely to be just as impactful as frequently used words, if not more so.

We'll also use some of the built-in parameters of the TF-IDF vectorizer as final checks before we output our doc-term matrix.

In [938]:
#define vectorizer: lowercase everything, remove English stop words, and drop any word that appears in
#fewer than 0.005% of tweets (about 10 tweets) or in more than 30% of tweets
tfidf = TfidfVectorizer(lowercase = True, stop_words= 'english', min_df = 0.00005, max_df = 0.30)
#fit on fully cleaned dataframe column
doc_term_matrix = tfidf.fit_transform(df_raw['clean_final'])
#turn matrix into a dataframe with words as columns (get_feature_names_out requires scikit-learn 1.0+)
matrix_df = pd.DataFrame(doc_term_matrix.toarray(), columns=tfidf.get_feature_names_out())
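
To make the thresholds concrete, this sketch converts the fractional min_df and max_df into tweet counts and evaluates the smoothed IDF formula from the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, at both cut-offs. The helper name smooth_idf is ours, and the final TF-IDF values also depend on term frequency and row normalization, which are left out here:

```python
import math

n_docs = 200_000                 # rows in df_raw, per df_raw.shape
min_df, max_df = 0.00005, 0.30   # fractions passed to TfidfVectorizer

# Fractional thresholds are proportions of documents.
print(round(min_df * n_docs), round(max_df * n_docs))

def smooth_idf(doc_freq, n=n_docs):
    """IDF as documented for scikit-learn's default smooth_idf=True."""
    return math.log((1 + n) / (1 + doc_freq)) + 1

rare, common = smooth_idf(10), smooth_idf(60_000)
print(round(rare, 2), round(common, 2))
```

A word at the min_df boundary gets an IDF several times larger than one at the max_df boundary, which is the sense in which TF-IDF favors rare words.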

In [939]:
matrix_df

Unnamed: 0,aaron,abbey,abbott,abc,abe,abenomics,abes,abigail,ability,abiy,...,zip,zombie,zone,zoo,zoological,zoom,zoos,zora,zuckerberg,zuckerbergs
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
199996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
199997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
199998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Our doc-term matrix looks great! We have cleaned out special characters and non-English tweets, and we have extracted the nouns and proper nouns in their lemmatized forms. Now we can topic model!


## TOPIC MODELING
In the second section of this notebook, we will try LSA, NMF, and LDA topic models on our doc-term matrix.

In [837]:
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

### LSA

In [940]:
lsa = TruncatedSVD(30)
doc_topic = lsa.fit_transform(doc_term_matrix)
lsa.explained_variance_ratio_

array([0.00513988, 0.0050063 , 0.0041349 , 0.0035138 , 0.00252051,
       0.00263113, 0.00248147, 0.00236477, 0.00234013, 0.00219939,
       0.00220586, 0.002142  , 0.00209055, 0.00203364, 0.00192536,
       0.00190596, 0.00189404, 0.00185413, 0.00185073, 0.00181862,
       0.00177205, 0.00171688, 0.00167118, 0.00162416, 0.00157869,
       0.00153289, 0.00152641, 0.00148511, 0.00145473, 0.00142768])
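
Summing these ratios gives a rough sense of how much of the TF-IDF variance the 30 LSA components capture. The values below are copied from the output above:

```python
# Ratios copied from the lsa.explained_variance_ratio_ output.
ratios = [
    0.00513988, 0.0050063, 0.0041349, 0.0035138, 0.00252051,
    0.00263113, 0.00248147, 0.00236477, 0.00234013, 0.00219939,
    0.00220586, 0.002142, 0.00209055, 0.00203364, 0.00192536,
    0.00190596, 0.00189404, 0.00185413, 0.00185073, 0.00181862,
    0.00177205, 0.00171688, 0.00167118, 0.00162416, 0.00157869,
    0.00153289, 0.00152641, 0.00148511, 0.00145473, 0.00142768,
]

total = sum(ratios)
print(f"{len(ratios)} components explain {total:.1%} of the variance")
```

The total comes to roughly 7%, so most of the variance is left unexplained, which likely reflects how sparse the matrix is.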

In [285]:
#topic_word = pd.DataFrame(lsa.components_.round(3),
             #index = ["component_1","component_2", "component_3", "component_4", "component_5", 
             #"component_6", "component_7", "component_8", "component_9", "component_10", 
             #"component_11", "component_12", "component_13", "component_14", "component_15", 
             #"component_16", "component_17", "component_18", "component_19", "component_20", "component_21", "component_22", "component_23", "component_24", "component_25", 
             #"component_26", "component_27", "component_28", "component_29", "component_30"],
             #columns = tfidf.get_feature_names())

In [700]:
def display_topics(model, feature_names, no_top_words, topic_names=None):
    """
    Prints the no_top_words highest-weighted words for each topic in model.components_.
    """
    for idx, topic in enumerate(model.components_):
        if not topic_names or not topic_names[idx]:
            print("\nTopic ", idx)
        else:
            print("\nTopic: '",topic_names[idx],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
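
The slice topic.argsort()[:-no_top_words - 1:-1] is compact but easy to misread. This sketch, using made-up weights, breaks it into two steps:

```python
import numpy as np

feature_names = ["case", "virus", "surge", "florida"]
topic = np.array([0.10, 0.90, 0.30, 0.50])   # hypothetical weights for one topic
no_top_words = 2

ascending = topic.argsort()                   # indices from smallest to largest weight
top_idx = ascending[:-no_top_words - 1:-1]    # walk backwards from the end, n steps
print([feature_names[i] for i in top_idx])
```

Reading the sorted indices from the end is what puts the highest-weighted word first in each printed topic.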

In [941]:
display_topics(lsa, tfidf.get_feature_names_out(), 10)


Topic  0
coronavirus, pandemic, trump, people, president, case, outbreak, health, donald, trumps

Topic  1
trump, president, donald, biden, joe, election, trumps, campaign, debate, voter

Topic  2
people, covid, home, police, company, work, virus, thousand, lot, facebook

Topic  3
pandemic, company, economy, job, market, worker, home, tech, industry, money

Topic  4
company, market, election, stock, tech, china, covid, trumps, investor, home

Topic  5
biden, joe, election, trumps, campaign, voter, debate, case, president, court

Topic  6
case, covid, pandemic, court, supreme, virus, school, police, trump, child

Topic  7
election, uk, court, trump, supreme, ballot, november, donald, voting, voter

Topic  8
president, trumps, covid, health, police, white, court, woman, protest, response

Topic  9
company, case, president, trumps, election, people, pandemic, stock, court, facebook

Topic  10
covid, market, china, stock, economy, investor, vaccine, biden, virus, price

Topic  11
china, p

### NMF

In [942]:
nmf = NMF(30)
doc_topic = nmf.fit_transform(doc_term_matrix)

In [290]:
#topic_word = pd.DataFrame(nmf.components_.round(3),
             #index = ["component_1","component_2", "component_3", "component_4", "component_5", 
             #"component_6", "component_7", "component_8", "component_9", "component_10", 
             #"component_11", "component_12", "component_13", "component_14", "component_15", 
             #"component_16", "component_17", "component_18", "component_19", "component_20", "component_21", "component_22", "component_23", "component_24", "component_25", 
             #"component_26", "component_27", "component_28", "component_29", "component_30"],
             #columns = tfidf.get_feature_names())
#topic_word

In [866]:
display_topics(nmf, tfidf.get_feature_names_out(), 10)


Topic  0
coronavirus, outbreak, vaccine, infection, test, spread, response, lockdown, update, testing

Topic  1
trump, donald, administration, white, republicans, borowitz, debate, democrats, impeachment, poll

Topic  2
people, virus, thousand, lot, matter, color, place, mask, million, disease

Topic  3
pandemic, industry, sale, response, perspective, travel, effect, restaurant, food, impact

Topic  4
company, employee, tech, history, ceo, technology, executive, office, worker, product

Topic  5
biden, joe, debate, campaign, bidens, voter, race, harris, bernie, poll

Topic  6
case, virus, surge, florida, record, california, infection, rise, texas, increase

Topic  7
election, voter, ballot, november, voting, result, campaign, party, mail, poll

Topic  8
president, vice, white, power, obama, mike, leader, pence, trump, impeachment

Topic  9
police, protest, officer, man, george, protester, floyd, violence, department, video

Topic  10
market, stock, investor, price, oil, share, future,

In [943]:
display_topics(nmf, tfidf.get_feature_names_out(), 10)


Topic  0
coronavirus, outbreak, infection, toll, test, spread, lockdown, update, response, restriction

Topic  1
trump, donald, administration, white, republicans, borowitz, debate, democrats, impeachment, poll

Topic  2
people, virus, thousand, lot, matter, color, place, million, mask, rate

Topic  3
pandemic, industry, sale, response, travel, effect, restaurant, impact, food, demand

Topic  4
company, employee, tech, history, ceo, technology, executive, product, oil, startup

Topic  5
biden, joe, debate, campaign, bidens, voter, harris, race, bernie, poll

Topic  6
case, virus, surge, florida, infection, record, california, increase, rise, texas

Topic  7
election, voter, ballot, november, voting, result, party, campaign, mail, vote

Topic  8
president, vice, white, power, obama, mike, pence, leader, impeachment, trump

Topic  9
covid, test, patient, hospital, doctor, symptom, disease, study, spread, drug

Topic  10
market, stock, investor, price, oil, share, future, bank, tech, ral
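
The two NMF listings above differ in places (topic 9 is about policing in one and about COVID testing in the other). The notebook does not record what changed between the runs; one possible contributor is that NMF(30) is created without a fixed random_state. The sketch below, on a small random matrix standing in for the doc-term matrix, shows that fixing the seed makes repeated fits identical. The matrix, the fit_topics helper, and the parameter values are assumptions for illustration only:

```python
import numpy as np
from sklearn.decomposition import NMF

# Small random non-negative matrix standing in for the TF-IDF doc-term matrix.
rng = np.random.default_rng(0)
X = rng.random((40, 12))

def fit_topics(seed):
    # init="random" makes the dependence on the seed explicit in this toy example
    model = NMF(n_components=3, init="random", random_state=seed, max_iter=500)
    model.fit(X)
    return model.components_

same_seed = np.allclose(fit_topics(42), fit_topics(42))
print(same_seed)
```

In the notebook this would amount to passing random_state to NMF(30), so that topic numbers stay stable between runs.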

### LDA

In [383]:
# gensim
from gensim import corpora, models, similarities, matutils
from gensim.corpora import Dictionary

# logging for gensim (set to INFO)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [379]:
# Convert our sparse TF-IDF matrix to a gensim corpus.
# Rows of doc_term_matrix are documents, so tell gensim not to treat columns as documents.
corpus = matutils.Sparse2Corpus(doc_term_matrix, documents_columns=False)

In [380]:
#mapping (dict) of column id to word (token) for later use by gensim
id2word = dict((v, k) for k, v in tfidf.vocabulary_.items())
len(id2word)

10635

In [385]:
#build the gensim Dictionary from the bag-of-words corpus and the id-to-word mapping
dictionary = Dictionary.from_corpus(corpus, id2word=id2word)

In [377]:
#create model
lda = models.LdaModel(corpus=corpus, num_topics=10, id2word=id2word, passes=5)

NMF yields cleaner, more distinct topics than LSA, whose components repeat the same few words (trump, covid, election) across many topics. Getting here took several rounds of topic modeling and re-cleaning, but these results look very clean. Using more topics should make each topic more specific, so let's keep a larger number of topics and take a look at the specific tweets behind them as well.

### Topic Interpretation

In [944]:
#nmf = NMF(30)
#note: fit_transform refits the model, so topics can shift slightly from the listing above
tweet_topic_matrix = nmf.fit_transform(doc_term_matrix)
tweet_topic_matrix_df = pd.DataFrame(tweet_topic_matrix).add_prefix('topic_')



In [945]:
tweet_topic_matrix_df[['content', 'clean_final']] = df_raw[['content', 'clean_final']]

In [946]:
tweet_topic_matrix_df.shape

(200000, 32)

In [947]:
word_topic_matrix_df = pd.DataFrame(nmf.components_, columns=tfidf.get_feature_names_out()).T.add_prefix('topic_')


In [948]:
word_topic_matrix_df.shape

(9144, 30)

In [949]:
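def top_tweets(tweet_topic_matrix_df, topic, n_tweets):
    """
    Returns the original text of the n_tweets tweets with the highest weight in the given topic column.
    """
    return (tweet_topic_matrix_df
            .sort_values(by=topic, ascending=False)
            .head(n_tweets)['content']
            .values)

def top_words(word_topic_matrix_df, topic, n_words):
    """
    Returns the n_words words with the highest weight in the given topic column.
    """
    return (word_topic_matrix_df
            .sort_values(by=topic, ascending=False)
            .head(n_words))[topic]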

#### Topic 1: Coronavirus

In [950]:
top_words(word_topic_matrix_df, 'topic_0', 10)

coronavirus    8.939366
outbreak       1.092181
infection      0.556388
toll           0.482746
test           0.409660
spread         0.387336
lockdown       0.319365
update         0.306768
response       0.297580
testing        0.257257
Name: topic_0, dtype: float64

In [951]:
top_tweets(tweet_topic_matrix_df ,'topic_0', 10)

array(['Find all the latest on the coronavirus here https://t.co/NLp2lJgrJJ',
       "Here's what happens after you call 911 for the coronavirus https://t.co/c5ei3QH0o6",
       'How to hike — safely and responsibly — during coronavirus https://t.co/YfSOFsUuNa',
       'New York struggles to bury its coronavirus dead https://t.co/zfRpY5Uxfj',
       '“I’ve come close to dying a few times, and I’m not afraid anymore, just sad,” @caitlinpacific writes. “But if I die from the corona\xadvirus, it will be one more un\xadnecessary American death.” https://t.co/X3Q2GNoA49',
       'She came to the U.S. legally and was trying to do everything right. Then came the coronavirus. https://t.co/10V1sGgErd',
       'The coronavirus changed everything, but not T.J. Maxx https://t.co/mubSxdSiYt',
       'How dangerous is new coronavirus and other questions answered https://t.co/CSJKUh4wHA',
       'As states reopen, cities are staying shut. That could mean more coronavirus in rural America. https://t.c

#### Topic 2: Trump

In [952]:
top_words(word_topic_matrix_df, 'topic_1', 10)

trump             5.248614
donald            1.429961
administration    0.380295
white             0.208782
republicans       0.137328
borowitz          0.121526
debate            0.113250
democrats         0.108134
impeachment       0.102637
poll              0.101975
Name: topic_1, dtype: float64

In [953]:
top_tweets(tweet_topic_matrix_df ,'topic_1', 10)

array(['Opinion: The pandemic Trump cannot ignore https://t.co/eEOMwar60Y',
       'Opinion: Why Trump can’t understand what Americans are feeling https://t.co/YISQmMEcUF',
       'Opinion: Trump must think autoworkers are stupid https://t.co/Ubem7DL9U7',
       'Opinion: Trump keeps claiming he cannot legitimately lose.\n\nWe should take that seriously. https://t.co/od4IJtBEdB',
       'Opinion: Trump is uniting Americans — against him https://t.co/XgSWz4m375',
       'Analysis: What Trump had in 2016 https://t.co/1b87qndlxL',
       'Opinion: Trump might not have reached bottom https://t.co/SOlrbXJDXB',
       'Opinion: Trump has exposed the "deep state" — and it is him https://t.co/p0zVzDgW5w',
       'Opinion | Trump did get one thing right: This isn’t going to end well https://t.co/Zi2kijzEbk',
       'Opinion: I can’t believe you’re forcing me to vote for Trump, which I definitely didn’t already want to do https://t.co/eF9n3tz4kh'],
      dtype=object)

#### Topic 3: People/human interest

In [954]:
top_words(word_topic_matrix_df, 'topic_2', 10)

people      4.975677
virus       0.126252
thousand    0.124961
lot         0.106357
matter      0.102437
color       0.091667
place       0.091541
mask        0.082126
million     0.078380
rate        0.069577
Name: topic_2, dtype: float64

In [955]:
top_tweets(tweet_topic_matrix_df ,'topic_2', 10)

array(['The richest people in America, ranked https://t.co/ZFeX7w5mIL',
       'By 2025, 800 million people will identify with Pentecostalism. What is it? https://t.co/iGhYpwCmGw https://t.co/UsAx77FhUi',
       '10 ways white people can develop more racial stamina: https://t.co/VhtZRIod9I',
       'Average people are important, too. https://t.co/Bv3kyBapJt',
       "Some people still aren't taking this seriously. The TL;DR: No one is immune. https://t.co/WuUDSdcqKo",
       'Far from being burned out, some people are fired up by keeping busy. https://t.co/VBhpxrZR8R',
       'Opinion: When you drown the government in the bathtub, people die https://t.co/ay6qLBsNGK',
       '"I know that people are just trying to help, but I also know that they don’t really understand." —@Rachael1013 https://t.co/00NEJMLcSL (via @humanparts)',
       "13 things mentally strong people don't do https://t.co/03D0pziI8E",
       'The 3 cities most people want to move out of right now https://t.co/cVSmKi6Yt

#### Topic 4: Pandemic Impact

In [956]:
top_words(word_topic_matrix_df, 'topic_3', 10)

pandemic       4.964991
industry       0.108869
sale           0.091772
response       0.078301
travel         0.073035
perspective    0.072812
effect         0.062418
restaurant     0.059556
impact         0.057853
food           0.056242
Name: topic_3, dtype: float64

In [957]:
top_tweets(tweet_topic_matrix_df ,'topic_3', 10)

array(['Opinion | You can be both social and safe during the pandemic. By @birthdaymoney https://t.co/xUB6gUJ89Z',
       'My third-grader was remote learning in a pandemic.\n\nI wasn’t going to worry about a C- in PE. https://t.co/Hfinec8hAT',
       'What to know about being pregnant during the covid-19 pandemic https://t.co/HV0BxCrjC8',
       "9 affordable 'covid compatible interests' to actually enjoy yourself more during the pandemic https://t.co/iHw9uYYg1C",
       'Mask-phobic countries have fared worse in the pandemic than mask-loving ones https://t.co/qTceA0APc1',
       'The 20 best countries for Americans to move to after the pandemic https://t.co/bJLgtnthqf',
       'You may be exhausted but the covid-19 pandemic is barely getting started https://t.co/Y6I6lvAH7n',
       'The recovered: How it feels to be alive on the other side of the pandemic https://t.co/27eBdnJVtF',
       'How to protest safely in a pandemic https://t.co/bTS3OG8rSl',
       '4 ways to rethink and pivo

#### Topic 5: Business

In [958]:
top_words(word_topic_matrix_df, 'topic_4', 10)

company       7.888003
employee      0.754682
tech          0.563495
history       0.411291
ceo           0.318969
technology    0.300853
executive     0.293038
worker        0.246182
office        0.236424
product       0.214292
Name: topic_4, dtype: float64

In [960]:
top_tweets(tweet_topic_matrix_df ,'topic_4', 20)

array(['Three recent reports have made some foreign companies wake up https://t.co/iw3mTKyATQ',
       'Uniqlo’s fleeces have barely changed in 20 years. How has the company sold so many? https://t.co/qqblxXIyaQ From @1843mag',
       'Uniqlo’s fleeces have barely changed in 20 years. How has the company sold so many? https://t.co/Pz5IEMedrL From @1843mag',
       'Uniqlo’s fleeces have barely changed in 20 years. How has the company sold so many? https://t.co/xtSYsakq2n From @1843mag',
       "How Taboola and Outbrain's plan to create a $2 billion clickbait company fell apart https://t.co/REcQaNUSPQ",
       '10 in-demand skills companies are hiring for right now — and how to learn them for free https://t.co/cziPtAcT1G',
       'Here are the seven most diverse and inclusive companies headquartered in the US https://t.co/5VarGLfcvB',
       'Would you discuss your horoscope in polite company? https://t.co/pwSacKhz2n',
       'Companies like @away are indeed changing the world, and not 

#### Topic 6: Biden

In [961]:
top_words(word_topic_matrix_df, 'topic_5', 10)

biden       3.517683
joe         2.904160
debate      0.726015
campaign    0.635360
bidens      0.415304
voter       0.344113
harris      0.286956
race        0.280959
bernie      0.279985
poll        0.271599
Name: topic_5, dtype: float64

In [962]:
top_tweets(tweet_topic_matrix_df, 'topic_5', 10)

array(['Opinion: Joe Biden must be doing something right https://t.co/fiZQrJcpKC',
       "Analysis: Who's afraid of Joe Biden? https://t.co/9pF4ExJH5D",
       'Opinion: Does Joe Biden have to be inspiring to win? Probably not. https://t.co/0v0zdLXrow',
       'From @WSJopinion: Now we know who knew about the Russian calls, including Joe Biden https://t.co/WRgq9Tjp1M',
       'Analysis: Joe Biden has shifted left https://t.co/TJYcKU9iar',
       'Opinion: Joe Biden, the inspirational plodder https://t.co/Nm126lmW6I',
       'Opinion: How Joe Biden — yes, Joe Biden — could revolutionize American politics https://t.co/2Kx1d5d1Tc',
       'Joe Biden tests negative for coronavirus https://t.co/4pneXbiAkO',
       'From @WSJopinion: A media-and-tech wall has been built to protect Joe Biden in the final days of the campaign, writes @gerardtbaker https://t.co/1QQuOm3zsx',
       'Joe Biden rises with a less-is-more campaign https://t.co/rpIx2qpnTQ'],
      dtype=object)
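
The `top_tweets` helper is likewise defined outside this section. A minimal sketch of the idea, assuming `tweet_topic_matrix_df` has one row per tweet, one numeric column per topic, and a text column (called `content` here as an assumption), could rank tweets by their weight for a topic and return the text of the top `n`. The name `top_tweets_sketch` and the toy data are hypothetical.

```python
import pandas as pd

def top_tweets_sketch(tweet_topic_df, topic, n, text_col="content"):
    """Return the text of the n tweets with the highest weight for a topic.

    Assumes one row per tweet, one numeric column per topic, and a text
    column named `text_col`. These are assumptions for illustration.
    """
    ranked = tweet_topic_df.sort_values(topic, ascending=False)
    return ranked[text_col].head(n).values

# Toy data for illustration only.
toy_tweets = pd.DataFrame({
    "content": ["tweet a", "tweet b", "tweet c"],
    "topic_0": [0.1, 0.7, 0.4],
})
top = top_tweets_sketch(toy_tweets, "topic_0", 2)
print(top)
```

Returning `.values` yields a NumPy object array, which matches the `array([...], dtype=object)` form of the outputs shown in this section.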

#### Topic 7: Covid spread

In [963]:
top_words(word_topic_matrix_df, 'topic_6', 10)

case          4.784447
virus         0.275697
surge         0.216582
florida       0.201637
infection     0.194398
record        0.188892
california    0.150953
increase      0.142977
rate          0.140180
rise          0.137798
Name: topic_6, dtype: float64

In [964]:
top_tweets(tweet_topic_matrix_df, 'topic_6', 10)

array(['In case you know someone who needs this. https://t.co/Flou7OLaZj',
       "Just in case you're looking ... https://t.co/1hVXxtieJY",
       'Just in case you know someone who needs this. https://t.co/2gsiRBqhA8',
       'Confirmed cases globally have passed 2.06 million, about a third of them in the U.S., which has logged over 30,000 of the more than 137,000 deaths world-wide https://t.co/E2QFKJkIdf',
       'World-wide, there were 310,000 new confirmed cases added on Saturday, after three straight days of about 280,000 new cases https://t.co/dbrO0i9haU',
       'In case you missed it:\nhttps://t.co/NtR1VG2RJN',
       'Confirmed cases near 84,000 in New York https://t.co/FtNUBO2DR2',
       'You might want to know this ... just in case https://t.co/pHf7YiobqL',
       'In case you missed it: \nhttps://t.co/NGAwWMj8cE',
       'Just in case you know someone who needs this. https://t.co/C2480wLK4k'],
      dtype=object)
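
Some of the tweets listed above are identical apart from their shortened `t.co` link (for example, "Just in case you know someone who needs this." appears twice). One possible way to collapse such repeats before inspecting a topic is sketched below. It is not part of the original pipeline, and `drop_url_duplicates` is a hypothetical helper name.

```python
import re

# Matches http(s) links such as the t.co URLs at the end of each tweet.
URL_RE = re.compile(r"https?://\S+")

def drop_url_duplicates(texts):
    """Keep the first of any tweets that are identical once URLs are removed."""
    seen = set()
    kept = []
    for text in texts:
        # Strip URLs, then normalize whitespace to build a comparison key.
        key = " ".join(URL_RE.sub("", text).split())
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept

sample = [
    "Just in case you know someone who needs this. https://t.co/2gsiRBqhA8",
    "In case you missed it:\nhttps://t.co/NtR1VG2RJN",
    "Just in case you know someone who needs this. https://t.co/C2480wLK4k",
]
unique = drop_url_duplicates(sample)
print(len(unique))
```

The original text of the first occurrence is kept, so the link stays available for the recommender.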

#### Topic 8: Election

In [965]:
top_words(word_topic_matrix_df, 'topic_7', 10)

election    4.055427
voter       0.463508
ballot      0.383116
november    0.353057
voting      0.342344
result      0.322172
campaign    0.259658
party       0.254662
mail        0.228394
poll        0.210834
Name: topic_7, dtype: float64

In [966]:
top_tweets(tweet_topic_matrix_df, 'topic_7', 10)

array(['Zoran Milanovic wins Croatia’s presidential election https://t.co/cFRS0pNJDT',
       'Opinion: This election is all about who gets to vote https://t.co/St6cpvlf4n',
       'Of course, there’s no way to really know if our election was safe and secure until after it’s over. Here’s how we’ll be able to tell: 7/ https://t.co/AReC0OQeEF',
       'If you have an election question that we haven’t covered, let us know here: https://t.co/gXhd7fKK4S',
       'Opinion: Could Trump steal the election? Here’s one way to find out. https://t.co/FGnOh4EDJp',
       "Do you know the ins and outs of your local elections? Here's how to find out. https://t.co/TacTHw6wVH",
       '“Egypt’s Undemocratic Election” by @ForeignPolicy https://t.co/7Lufv3fB6S',
       'Americans will vote in 11 gubernatorial elections in 2020. Only one is considered a toss-up. https://t.co/EEuj2dDphw',
       'Analysis: What’s happened in the last 20 days of the last 13 elections https://t.co/F8JWhipxje',
       '"The y

#### Topic 9: World leadership

In [967]:
top_words(word_topic_matrix_df, 'topic_8', 10)

president      5.195691
vice           0.299896
white          0.234035
power          0.224681
obama          0.153108
mike           0.135055
leader         0.130674
pence          0.129769
trump          0.122139
impeachment    0.122087
Name: topic_8, dtype: float64

In [968]:
top_tweets(tweet_topic_matrix_df, 'topic_8', 10)

array(['Analysis: An isolated Trump receives an eager Polish president https://t.co/SfiijCcyHS',
       'Opinion: The "wartime president" has gone AWOL. More Americans will die. https://t.co/weOno4rwf9',
       'President Milo Djukanovic has ruled Montenegro since 1989. But things have not been entirely placid https://t.co/a1si20tbtZ',
       '“We are witnessing the steady, uninterrupted intellectual and psychological decomposition of an American president,” @Peter_Wehner argues. https://t.co/G2MiNZMBD9',
       'Opinion: The cowardly president hides — again https://t.co/6HX8RKzlxn',
       'Opinion: The worst president ever keeps getting worse https://t.co/oWIRjeJxnG',
       'Former Egyptian president Hosni Mubarak dies aged 91 https://t.co/gfKqQquQvt',
       'It’s 5 o’clock. Do you know where your president is? https://t.co/e6Qmf74UID',
       'It looks as if the country, abetted by its President, is self-destructing, @johncassidy writes. https://t.co/rCYy6LsQAv',
       'Opinion: 

#### Topic 10: Covid

In [969]:
top_words(word_topic_matrix_df, 'topic_9', 10)

covid       11.046658
test         0.799889
patient      0.767636
hospital     0.620734
doctor       0.532481
symptom      0.500157
disease      0.447184
study        0.445934
spread       0.397324
risk         0.335668
Name: topic_9, dtype: float64

In [970]:
top_tweets(tweet_topic_matrix_df, 'topic_9', 10)

array(['Are you under 35 and sick with or recovered from Covid-19? We want to hear your story, big or small. Tell us here: https://t.co/Jm912Yg6J5',
       'Opinion: Covid-19 threatens to overwhelm the developing world https://t.co/52Kc7KkUt7',
       '.@Elemental explains how to protect yourself\u200a—\u200aand others\u200a—\u200afrom COVID-19. https://t.co/mXBHqVtmWo',
       '#Covid19 now appears to invade more than the respiratory and digestive systems. https://t.co/dEMW4aRRrj',
       'With 1m dead, are we any better at treating Covid-19? https://t.co/XXjIBojEFU',
       '‘Covid-19 has planted political time-bombs,’ says @KuperSimon\nhttps://t.co/XW4MCWHHl5',
       'Covid-19 is helping wealthy countries talk about death https://t.co/q9ZqMb4NNe',
       'Read more about how it actually feels to have Covid-19, from people who’ve had it. https://t.co/jqxyLJH7pa',
       'Within the next few days the global recorded deaths from covid-19 will surpass 1m. Too many governments are still

#### Topic 11: Stock market

In [971]:
top_words(word_topic_matrix_df, 'topic_10', 10)

market      4.200785
stock       3.698759
investor    1.640727
price       0.639436
oil         0.462679
share       0.449003
future      0.415987
bank        0.355038
tech        0.330312
rally       0.297121
Name: topic_10, dtype: float64

In [972]:
top_tweets(tweet_topic_matrix_df, 'topic_10', 10)

array(['Analysis: The world is falling apart. But the stock market keeps surging. https://t.co/sd9undf2xH',
       '#WSJWhatsNow: While U.S. stocks are up Thursday, @AmrithRamkumar explains why stocks have fallen this week and what the market is looking forward to https://t.co/Gjzip6MeLD',
       "The stock market is about to enter its worst month of the year after a historic August — but that doesn't mean investors should sell stock, LPL says https://t.co/JCaBRiGQv4",
       'Despite the market’s bounce off its March lows, investors who rely on technical analysis remain skeptical about stocks https://t.co/mw2bvNNlvg',
       'Analysis: Now more than ever, the stock market is not the economy https://t.co/27wfxW18tS',
       'Opinion: The stock market’s "fear gague" hit its highest point ever. Here’s why. https://t.co/3KkidCvZrZ',
       'Heard on the Street: China’s stock market suddenly has its mojo back https://t.co/JsOk748kBc',
       '5 reasons to be bullish on the US stock market,

#### Topic 12: Police/protests

In [973]:
top_words(word_topic_matrix_df, 'topic_11', 10)

police        4.201453
protest       1.602029
officer       1.434913
man           0.961998
george        0.933267
protester     0.745213
floyd         0.711075
violence      0.564586
department    0.491017
video         0.481483
Name: topic_11, dtype: float64

In [974]:
top_tweets(tweet_topic_matrix_df, 'topic_11', 10)

array(['Online rabble-rousers are intimidated by the police into keeping quiet https://t.co/jYC2m5oeMH',
       'Opinion: We are the governed. We no longer consent to let the police kill us. https://t.co/5sOtJEWsmI',
       'Opinion: Defund the police? Here’s what that really means. https://t.co/lwh5ODaCiC',
       'Analysis: What "defund the police" might look like https://t.co/NM9Sa2TUjK',
       'Defund the police? Here’s what has worked in other countries. https://t.co/pVrNcIkYvv',
       'Opinion: Prosecuting police officers won’t make us safer. Here’s what’s needed. https://t.co/kVAhEPW36G',
       'Clare Ramirez-Raftree, 23, said that when officers finally removed her zip-tie cuffs, she had large, red welts that were visible for more than a week afterward. “Seeing just how blatant the police were with everything, it kind of just spurred me on to go out again,” she said. https://t.co/EospVCbeh5',
       'Breaking News: At least one police officer was injured and a suspect was sho

#### Topic 13: Staying at Home

In [975]:
top_words(word_topic_matrix_df, 'topic_12', 10)

home        4.692780
nursing     0.431144
office      0.359644
family      0.307796
sale        0.166931
resident    0.161789
tip         0.159341
kid         0.128124
place       0.124506
employee    0.106067
Name: topic_12, dtype: float64

In [976]:
top_tweets(tweet_topic_matrix_df, 'topic_12', 10)

array(["Peek inside 'the most beautiful home in the world' https://t.co/0HUkDDDS0z",
       'Three ways in which our homes have changed in the past 10 years https://t.co/MuVgaEmoit',
       'Working from home does not always mean working at home.\nhttps://t.co/2yd1327mKF',
       'How to actually do this remote-learning thing while also working from home https://t.co/G2dcSKSl7G',
       'Would you live in this tech-filled tiny home? https://t.co/fNBhRDhZ8H',
       '⚡️ “Where Americans are still staying at home the most”\n\nhttps://t.co/GAwMALyrbH',
       "Don't try this at home. https://t.co/zR8THul0mL",
       'Working from home is starting to pall https://t.co/InXwdBb2T4',
       "A wildly hyped $95,000 'Loft' tiny home will be available in the US soon — see inside https://t.co/HsX2CCFl4k",
       'Americans working from home are producing more waste—and it costs more to get rid of it https://t.co/34wmYVpLJo'],
      dtype=object)
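
Note that the headings number topics from 1 while the matrix columns are numbered from 0, so "Topic 13" corresponds to the `topic_12` column. A small lookup can make that offset explicit. The sketch below is illustrative only: `topic_column` and `topic_labels` are hypothetical names, and only a few of the labels from the headings in this section are included.

```python
def topic_column(heading_number):
    """Map a 1-based heading number to the 0-based column name."""
    return f"topic_{heading_number - 1}"

# Labels copied from a few of the headings in this section.
topic_labels = {
    topic_column(5): "Business",
    topic_column(6): "Biden",
    topic_column(13): "Staying at Home",
}

print(topic_labels["topic_12"])
```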

#### Topic 14: Asia

In [977]:
top_words(word_topic_matrix_df, 'topic_13', 10)

china             4.607835
war               0.381923
trade             0.286697
india             0.208873
outbreak          0.198531
beijing           0.190847
tension           0.183361
russia            0.163867
administration    0.152042
south             0.136322
Name: topic_13, dtype: float64

In [980]:
top_tweets(tweet_topic_matrix_df, 'topic_13', 10)

array(['Analysis: China strangles its world city https://t.co/Ei4Y4Zeoyd',
       'We’ve grown used to Americans getting fired for social media posts that offend other Americans. Now it’s clear they can also be punished for posts that offend China. @willoremus writes. https://t.co/lpBv0l9th5',
       'Opinion: "Decoupling" the U.S. from China would backfire https://t.co/q0ae4XsxoV',
       'Several countries have castigated China but few have done much about it https://t.co/8TPEmq0MYV',
       'Analysis: It’s not just Trump who’s angry at China https://t.co/RJABiYxNBR',
       '“I Modeled in China” by @karengeier https://t.co/pfy9THsb1W',
       '“#FakeNews: Made in China” by @DFRLab https://t.co/bwAZAbP3xj',
       'How China Inc can be tripped up by miscalculations abroad https://t.co/qcd3H31pJA',
       'Analysis: The world doesn’t want to pick between the U.S. and China https://t.co/gXfSeHpw67',
       '“This Is How China Is Feeding Itself” by @tomassidenfaden https://t.co/yFJstfrV

#### Topic 15: Economy
Focused on coronavirus.

In [981]:
top_words(word_topic_matrix_df, 'topic_14', 10)

economy         6.330098
rate            0.526465
lockdown        0.476092
recession       0.323982
unemployment    0.295928
recovery        0.260892
chinas          0.237053
growth          0.215992
bank            0.210637
record          0.209990
Name: topic_14, dtype: float64

In [982]:
top_tweets(tweet_topic_matrix_df, 'topic_14', 10)

array(['How the nesting economy is changing the way Americans consume, maybe forever https://t.co/TfH8DjUmZU',
       '“The Life and Death of an Economy” by @umairh https://t.co/O3iVj3gAJM',
       'Analysis: What’s happening in states that reopened their economies? It’s complicated. https://t.co/KO2HBfeNJG',
       'No one knows for sure yet which countries’ economies have fared well or badly https://t.co/75dyLJ9Rni',
       'What will it take to create a post-pandemic economy that is fairer, more unified, and more resilient?\nhttps://t.co/qmjPtOKdAN',
       'Millennial-bashing is not funny or cute — it’s classism, and it’s destroying our economy, @MattiasLehman writes  https://t.co/oeX3D1Zh19',
       'The dire state of the global economy has economists asking whether more can be done https://t.co/OMcEJDsil4',
       'Don’t count on a fast global economy bounceback https://t.co/46WrLxINda',
       'Consciously decoupling the US economy https://t.co/rjl9fOM8sn',
       'Opinion: The 

#### Topic 16: Education

In [983]:
top_words(word_topic_matrix_df, 'topic_15', 10)

school        3.508265
student       1.511556
child         1.063251
parent        0.696992
college       0.565868
kid           0.534637
teacher       0.454093
family        0.395875
fall          0.385628
university    0.330183
Name: topic_15, dtype: float64

In [984]:
top_tweets(tweet_topic_matrix_df, 'topic_15', 10)

array(['The world’s toughest business school https://t.co/zj1tWxh6bb',
       'Here’s how countries around the world are preparing to go back to school safely https://t.co/llJxHGveBl',
       'Reopening the world’s schools safely will not be cheap. But it is essential  https://t.co/G9yW9c1c8A',
       '“These schools should be applying to have me”: inside the world of tutoring for the super rich https://t.co/TRVDWDRkg8 From @1843mag',
       'Can schools actually reopen safely? https://t.co/jAb9yDc0Hg',
       'Governments should be working out how to reopen schools as soon as it is safe https://t.co/RvQQAT6BH0',
       '“These schools should be applying to have me”: inside the world of tutoring for the super rich https://t.co/Q6V0YMAdyP From @1843mag',
       'From @petridishes in Opinions: We can’t wait for schools to reopen safely! https://t.co/AbsRDpeAeI',
       '“Like many rich Americans, I used to think better schools could heal the country’s ills,” writes @NickHanauer, “but I w

#### Topic 17: Supreme Court

In [985]:
top_words(word_topic_matrix_df, 'topic_16', 10)

court       3.189040
supreme     2.454603
justice     0.773789
amy         0.499676
coney       0.486853
barrett     0.421772
ruth        0.347114
bader       0.346876
judge       0.339498
ginsburg    0.338543
Name: topic_16, dtype: float64

In [986]:
top_tweets(tweet_topic_matrix_df, 'topic_16', 10)

array(['Perspective: The Supreme Court is leaking. That’s a good thing. https://t.co/e7yVURnHWY',
       'Opinion: Raging Trump wants the Supreme Court to save him.\n\nHere’s why it probably won’t. https://t.co/2XjXVmZft0',
       'Analysis: Who Trump might pick for the Supreme Court https://t.co/bNJb2DovIm',
       '"This advice may sound strange, but anyone who cares about the future of the Supreme Court needs to speak as little as possible about the Supreme Court," @anneapplebaum writes: https://t.co/WgWGxhNFKr',
       'Perspective: The Supreme Court just made the president more powerful https://t.co/vJGOBnXFW7',
       'Efforts to reform the Supreme Court should "recognize that the problem is not who serves on the Supreme Court but what power it has," @rddoerfler and @samuelmoyn write: https://t.co/okHOi3LLBr',
       'Perspective: The Supreme Court rules us. Here’s how to curb its power. https://t.co/uO7chabgpX',
       'Trump and Biden quickly clash over Supreme Court https://t.

#### Topic 18: U.S. Gender equality

In [987]:
top_words(word_topic_matrix_df, 'topic_17', 10)

woman      4.621350
man        1.150727
child      0.287562
color      0.273561
history    0.173704
harris     0.148753
mother     0.138528
kamala     0.126687
right      0.120364
group      0.110477
Name: topic_17, dtype: float64

In [988]:
top_tweets(tweet_topic_matrix_df, 'topic_17', 10)

array(['“How to ‘Make’ a Woman” by @BexvanKoot https://t.co/0qBHTB7wm4',
       '“How To Talk To A Woman” by @ThisIsGorman https://t.co/v0domU0xh0',
       '“Tiny Women,” “Where the Heck Are You, Godot?,” and other upcoming prequels. https://t.co/PCAEAcHWP2',
       'For conservative women, it’s still the 1980s https://t.co/9OyFZpZwFw',
       'Bacardi targeted women with its new reduced-alcohol vodkas. It went over as well as you’d expect. https://t.co/G28neAQXVE',
       'When Pocahontas is dehumanized and objectified, all Native women are (via @zoramag) https://t.co/jf1oYl0YAq',
       'Women are not monolithic. Pretending they are holds everyone back. https://t.co/vn33WcIyTH (via @genmag)',
       '“These women came ready to fight” by @meaganmday https://t.co/s9OnZE79kt',
       'Opinion: America hates to let Black women speak https://t.co/dqtNh24PNN',
       '“If Franken resigns, women lose twice” by @amandarosewhy https://t.co/U0mwARD6Lw'],
      dtype=object)

#### Topic 19: Trump

In [989]:
top_words(word_topic_matrix_df, 'topic_18', 10)

trumps         4.609630
donald         1.117977
campaign       0.550484
response       0.218502
policy         0.215458
white          0.203167
reelection     0.198600
republicans    0.183089
presidency     0.177242
voter          0.174994
Name: topic_18, dtype: float64

In [990]:
top_tweets(tweet_topic_matrix_df, 'topic_18', 10)

array(["Opinion: New numbers suggest Trump's magical race-baiting demagoguery is failing him https://t.co/adv757wzba",
       'Analysis: Trump’s playbook on "Obamagate" is extremely — and dubiously — familiar https://t.co/x5ntqAXCq7',
       'Opinion: Trump’s bizarre ranting isn’t good for the country — or for him https://t.co/tnYBJejYBI',
       'Opinion: Trump’s latest rage-fest is one of his most absurd and dangerous yet https://t.co/ad9o2fVv09',
       'Opinion: Trump’s name should live in infamy https://t.co/DKlijxRjTX',
       'Opinion: The media still hasn’t learned to corner Trump’s lackeys https://t.co/LEVhQgH71L',
       'Analysis: Trump’s coronavirus briefings were a political "Groundhog Day" https://t.co/249Yz8UtMC',
       'Analysis: Trump’s propaganda-laden, off-the-rails coronavirus briefing https://t.co/1Y7nvhBv1Z',
       'Opinion: Get ready, America. Trump’s campaign is just getting started. https://t.co/h7NVexUxLD',
       'What we know — and still want to know — abo

#### Topic 20: Health & Wellness

In [991]:
top_words(word_topic_matrix_df, 'topic_19', 10)

health          4.003602
official        1.000889
care            0.935071
expert          0.930891
risk            0.371318
mask            0.311777
organization    0.271794
hospital        0.260898
patient         0.256943
worker          0.254370
Name: topic_19, dtype: float64

In [992]:
top_tweets(tweet_topic_matrix_df, 'topic_19', 10)

array(['Trump increasingly preoccupied with defending his physical and mental health https://t.co/WWeLkvTIPY',
       "Eight exciting new health technologies—and where they're headed next https://t.co/E8ZOzrALBs",
       'Both regular and decaf boost your health (via @elemental) https://t.co/eZEMOoXrmd',
       "A hot toddy may *sound* nice, but it won't actually do very much for your health. (via @elemental) https://t.co/Z67DScPo44",
       'Bausch Health skyrockets 27% after saying it will spin off its eye-care business https://t.co/aHSPnwgzqj',
       "7 questions you should ask someone if you're worried about their mental health (via @AmyMorinLCSW) https://t.co/w9RxihBf4u",
       'The Borowitz Report: “For the past two months, it’s all I’ve been hearing about, health this, health that,” the President said. “I wish health would just disappear.”\nhttps://t.co/ydn9dga7aX',
       'When it comes to choosing a health plan, health care and policy experts are just as confused as you. htt

#### Topic 21: Climate Change

In [993]:
top_words(word_topic_matrix_df, 'topic_20', 10)

change       3.198692
climate      2.732013
scientist    0.314765
problem      0.264981
policy       0.241382
issue        0.200067
impact       0.151704
wildfire     0.138354
future       0.137654
risk         0.136558
Name: topic_20, dtype: float64

In [994]:
top_tweets(tweet_topic_matrix_df, 'topic_20', 10)

array(['If you’re looking for a new normal when it comes to climate change, you’re not going to find it. There isn’t one. https://t.co/gVuHhMekkZ',
       'If anything is going to be done about climate change, it may have to be led by cities. https://t.co/F95EY5ckuO',
       'If anything is going to be done about climate change, it may have to be led by cities. https://t.co/qHDZ8rVFFH',
       'How bad is climate change now? Here are 7 fundamental things to know. https://t.co/MDZAHzZddW',
       '“Climate Change, My Microbiome, and Me” by @grist https://t.co/Kc1ZvAkyzZ',
       'Climate change affects everything — even the coronavirus https://t.co/zamNVKZ2mF',
       '1/ We are now living through one of the first pandemics brought on by climate change. https://t.co/zjAg1lO48i https://t.co/SIg1lEKQOT',
       "Tackling climate change is a global political problem. Here's why https://t.co/08hd0nowEJ https://t.co/s20hPaEFiQ",
       "Tackling climate change is a global political problem. 

#### Topic 22: (Un)employment

In [995]:
top_words(word_topic_matrix_df, 'topic_21', 10)

job             3.493465
worker          1.664617
unemployment    0.462153
benefit         0.276424
loss            0.242148
employee        0.197977
rate            0.166513
industry        0.163773
office          0.162800
million         0.157345
Name: topic_21, dtype: float64

In [996]:
top_tweets(tweet_topic_matrix_df, 'topic_21', 10)

array(['1m more British workers set to lose jobs this year, warn economists https://t.co/H3Z590I5hS',
       "You're doing a great job. But if you're having a hard time keeping it together, we're here to help.  https://t.co/hKHqdmdH6T",
       'The 9 most in-demand jobs hiring right now that pay $50,000 or more https://t.co/bqDjyUWSBj',
       'Opinion: The job numbers are much-needed good news. And they’re likely to get better. https://t.co/vp35UMrlfm',
       "These are the highest-paying remote jobs in computing that you can do anywhere in the world, and they're all hiring right now https://t.co/1rvN6ghYD9",
       'Some Americans fear their jobs will be lost forever https://t.co/ysaFQBt7XF',
       "4. You'll learn on the job in more ways than one. https://t.co/IMSZROWSCj https://t.co/MXycOWflRT",
       'First of all, you’re doing a great job.  https://t.co/yuRMZL993h',
       "These are the 20 highest-paying remote jobs you can do from anywhere in the world, and they're all hirin

#### Topic 23: Europe

In [997]:
top_words(word_topic_matrix_df, 'topic_22', 10)

uk          3.081612
boris       1.138090
johnson     1.037765
brexit      0.676449
deal        0.655482
ft          0.564709
minister    0.558994
eu          0.547404
lockdown    0.512748
prime       0.363423
Name: topic_22, dtype: float64

In [999]:
top_tweets(tweet_topic_matrix_df, 'topic_22', 20)

array(['Just published: front page of the Financial Times, UK edition, Monday 25 May           https://t.co/vurYm4KtdG https://t.co/r75KZVoHzA',
       'Just published: front page of the Financial Times UK edition Saturday May 30 https://t.co/xsGMtpJUdA https://t.co/soR1s61ZeE',
       'Just published: front page of the Financial Times UK edition Tuesday May 19\xa0https://t.co/SZIyZ564k8 https://t.co/0El8ArNByR',
       'Just published: front page of the Financial Times, UK edition, Wednesday 13 May https://t.co/I4GjQQgTcN https://t.co/ucoAaPiHbh',
       'Just published: front page of the Financial Times, UK edition, Tuesday 26 May       https://t.co/tNGqCfXXdN https://t.co/8sxwBf8Yli',
       'Just published: front page of the Financial Times UK edition, https://t.co/5gXWtrt5I2 https://t.co/d1nwuJ3lyq',
       'Just published: front page of the Financial Times UK edition Monday May 18\xa0https://t.co/yGEHS9H0Bx https://t.co/U9fBIWyOzc',
       '⚡️ “Will coronavirus break the UK?” by 

#### Topic 24: Crises

In [1000]:
top_words(word_topic_matrix_df, 'topic_23', 10)

crisis      4.502173
leader      0.301858
bank        0.217256
response    0.165888
industry    0.151905
power       0.122101
food        0.106583
housing     0.095988
moment      0.095483
history     0.092001
Name: topic_23, dtype: float64

In [1001]:
top_tweets(tweet_topic_matrix_df, 'topic_23', 10)

array(['Opinion | The US government finally figured out how to help average Americans during an economic crisis. And then it all fell apart. By @MattZeitlin. https://t.co/d4baMvxWm4',
       '“Don’t tell anyone, but we’re thinking to ride this whole thing out in Zurich, where the numbers are better.”\n\nTo be privileged during a global crisis: https://t.co/n8iNubzWiT',
       'The midlife crisis is no longer about yearning to be young again https://t.co/rHbR7xxjm2',
       'Opinion: Trump refuses to lead a country in crisis https://t.co/oU9whPxZA5',
       '⚡️ “What does it take to lead during a crisis?”\n\nhttps://t.co/KBMWY1JbPu',
       '⚡️ “What does it take to lead during a crisis?” by @FinancialTimes https://t.co/XbpVaXWsdR',
       'It\'s not a midlife crisis, it\'s just plain "stupid," says @ChrisHogan360 https://t.co/Uv5gzhkVCK',
       'The only epiphany you need to have in a crisis is that you deserve to be happy. https://t.co/26FWRFtsu8',
       'Nothing symbolizes the cris

#### Topic 25: Technology

In [1002]:
top_words(word_topic_matrix_df, 'topic_24', 10)

facebook    2.213741
apple       1.469661
tech        1.354153
google      1.258868
app         0.964057
datum       0.860694
amazon      0.794331
user        0.539059
group       0.529596
video       0.521765
Name: topic_24, dtype: float64

In [1003]:
top_tweets(tweet_topic_matrix_df, 'topic_24', 10)

array(['Apple is reacting differently to Facebook, Google, and Twitter. https://t.co/C8B35MqcIw',
       'A damning Congressional antitrust report into Amazon, Apple, Google and Facebook outlined the biggest assault on corporate power in the tech industry since the 1990s https://t.co/1jcbBrDwR7',
       '25 questions Congress should ask the CEOs of Apple, Amazon, Facebook, and Google when they testify, according to @profgalloway https://t.co/iUWPZ8jA8i',
       "Congress is gearing up to grill the CEOs of Facebook, Google, Amazon, and Apple in an antitrust hearing Wednesday — here's how to watch it https://t.co/HJP4AmoTFN",
       'A big day in tech: Amazon, Apple, Facebook, Google and Twitter report earnings after the market closes. Follow along for results and analysis. https://t.co/yqw8eUVaP6',
       'Find Out What Google and Facebook Know About You, by @baratunde https://t.co/tk11IHaKQA',
       'The combined market value of Apple, Amazon, Google and Facebook soared by $230bn afte

#### Topic 26: Asia (Hong Kong)

In [1004]:
top_words(word_topic_matrix_df, 'topic_25', 10)

hong                2.855730
kong                2.488578
law                 1.118356
security            0.687658
kongs               0.511221
chinas              0.467829
beijing             0.349034
protest             0.345457
prodemocracy        0.197372
nationalsecurity    0.178600
Name: topic_25, dtype: float64

In [1005]:
top_tweets(tweet_topic_matrix_df, 'topic_25', 20)

array(['Hello, Hong Kong. While you were sleeping, this was one of our most read stories: https://t.co/m1cxY8jcGh',
       'Good morning, Hong Kong. While you were sleeping, this was our most read story: https://t.co/uHMn867XBR',
       'Good morning, Hong Kong. While you were sleeping, this was one of our most read stories: https://t.co/PvX3eO17NW',
       'Good morning, Hong Kong. While you were sleeping, this was our most ready story: https://t.co/MNx1jlJs2A',
       'Good morning Hong Kong. While you were sleeping this was one of our most-read stories: https://t.co/7GbuaKj0wp',
       'Why business in Hong Kong should be worried https://t.co/rprnDMRXoT',
       'Good morning Hong Kong, while you were sleeping, this was our most read story: https://t.co/7BB4xNXbGs',
       'Hello, Hong Kong. While you were sleeping, this was our most read story:  https://t.co/jQSjZtfWbZ',
       'Hello, Hong Kong. While you were sleeping, this was one of our most read stories: https://t.co/5Vmj9z5vx

#### Topic 27: Self-Help / Personal Development

In [1006]:
top_words(word_topic_matrix_df, 'topic_26', 10)

work           3.952141
book           0.465716
office         0.370224
employee       0.283743
future         0.201817
art            0.188304
artist         0.147435
culture        0.145909
writer         0.127777
perspective    0.127231
Name: topic_26, dtype: float64

In [1007]:
top_tweets(tweet_topic_matrix_df, 'topic_26', 20)

array(['If you want to do better, you have to put in the work to unlearn. Here’s why you should. (via @levelmag) https://t.co/yek8FXZakd',
       'With a little mechanical work, the three-wheeled Piaggo Ape makes an impressive racecar https://t.co/sqaDjhmAYL',
       'How you should respond when someone tells you no at work (via @Entrepreneur) https://t.co/5UPd43mVVE',
       'Perspective: What you need to know about going back to work https://t.co/QpSYXYUPou',
       'Sometimes things happen at work that are inherently ridiculous. Laughter is the only recourse https://t.co/umNxycTeFK',
       'Come work with us! https://t.co/JgyrnWMv1s',
       'Perspective: What you need to know about going back to work https://t.co/BaWgfEXsqQ',
       'Not everyone can, or should, go back to work. https://t.co/YWz8gVgkp2',
       'There’s much more work to be done to take down machismo (via @levelmag) https://t.co/a3uTONp1dC',
       'Five smart ways to handle remote work, and two dumb ones https://

#### Topic 28: Cars

In [1008]:
top_words(word_topic_matrix_df, 'topic_27', 10)

car        3.851752
tesla      0.356559
future     0.291703
vehicle    0.260003
model      0.229848
sport      0.226277
custom     0.173571
concept    0.172030
elon       0.162479
driver     0.160584
Name: topic_27, dtype: float64

In [1009]:
top_tweets(tweet_topic_matrix_df, 'topic_27', 20)

array(['Flying cars are coming. https://t.co/fl2a4NN4Xo',
       'Why the SYMBIOZ is the smartest car https://t.co/SRhqGpGUjp',
       'The SYMBIOZ is the smartest car https://t.co/gWYKTKay7W',
       'Forget cars. These are the world’s 20 most bicycle-friendly cities. https://t.co/UQQaYeaHYX',
       'This is how car parts get recycled https://t.co/NLWi5ZEUgi',
       'Check out how a car is transformed into a limousine https://t.co/xxYZV3mzSB',
       'The Autozam AZ-1 is a highly sought-after car https://t.co/BISZFzUnp2',
       'Forget cars. These are the world’s 20 most bicycle-friendly cities. https://t.co/k6DqQ4p5Ur',
       'How soon until we have self-driving cars? https://t.co/othCz7LA2u',
       'This is how some of the most classic car looks are achieved https://t.co/qpp4d8K6FV',
       'This car with bench-style seating for two is unlike any you have ever seen https://t.co/kLbWSPyY6b',
       'Driverless cars could be coming sooner than you think — are you ready? https://t

#### Topic 29: Personal Finance

In [1010]:
top_words(word_topic_matrix_df, 'topic_28', 10)

money      3.905622
talks      0.333282
bank       0.311538
fund       0.170065
service    0.163045
lot        0.156441
podcast    0.154100
rate       0.150018
account    0.149463
plan       0.146684
Name: topic_28, dtype: float64

In [1011]:
top_tweets(tweet_topic_matrix_df, 'topic_28', 20)

array(['“How to Reinvent Money” by @umairh https://t.co/bZECx0wJVf',
       '"If you want to be successful, you must be able to account for how much and where you are spending your money" https://t.co/mwwnFAzwUM',
       '"I always think about losing money as opposed to making money." https://t.co/oXqfjvRL6P',
       "If you need to wire money, here's what you should expect https://t.co/vtrzj8nnbm",
       '"Ever since we combined our money, we’ve been fighting all the time." (via @forgemag) https://t.co/kunlK7zOdZ',
       'How will you manage your money post Covid-19? https://t.co/XZFCCAKbLV',
       'Economists would have to consider two questions: how much to pay, and how best to spend the money https://t.co/RSz2HcOauG',
       '"It\'s easy to just kill yourself," Sullins said, "and, be like, \'Yay, I made some money!\' But you can\'t keep that up forever." https://t.co/YfNonAO3hQ',
       "If you're looking for a stylish, customizable PC/PS4 gamepad, and you don't mind spending to

#### Topic 30: Covid Vaccine Research

In [1012]:
top_words(word_topic_matrix_df, 'topic_29', 10)

vaccine        3.833632
virus          1.520065
trial          0.742382
race           0.578194
scientist      0.392905
result         0.226542
drug           0.222187
perspective    0.207619
expert         0.196578
study          0.187622
Name: topic_29, dtype: float64

In [1013]:
top_tweets(tweet_topic_matrix_df, 'topic_29', 10)

array(['“The fastest vaccine we previously developed was for mumps, and that took four years to develop. And typically it takes 10 to 15 years to develop a vaccine. So 12 to 18 months would be record-breaking.” https://t.co/5iTJHk1irY',
       '⚡️ “How close are we to having a Covid-19 vaccine?”\n\nhttps://t.co/H38f1LMI06',
       'An inside look at why we are so far from a Covid-19 vaccine: https://t.co/dIJTex9Kx0',
       'There will not be a vaccine soon. https://t.co/2ZsWd5NX4H https://t.co/QKF3NWNUVs',
       'We could know if a vaccine works this month https://t.co/uauVAtqr6j',
       "We are so far from a Covid-19 vaccine. Here's why: https://t.co/U1BL3Glxq6",
       'Perspective: Here’s how hard it will be to distribute a covid-19 vaccine https://t.co/0tJzLPwr7T',
       'Opinion: What the government must do to successfully administer a covid-19 vaccine https://t.co/g7leVRPzR5',
       'A vaccine is coming soon. https://t.co/9XC71BoIeH',
       'No matter who develops an effect

### Next Steps:
Our topics look coherent, but there is one thing left to check.
The Financial Times uses a number of recurring tweet templates, such as "__ Hong Kong, While you were sleeping, this was our most read story" and "Just published: front page of the Financial Times, UK edition". These repeated phrasings may be skewing certain topics, and they become visible when we look at the top tweets. Let's clean them out before we finalize our topics, just to be safe.
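
One way this template cleanup could look is sketched below. It is only an illustration: the pattern list is written by hand from the two templates quoted above, the helper name `drop_template_tweets` is hypothetical, and the sample rows are made up for the demo. It assumes the tweet text lives in the `content` column, as in the raw dataframe at the top of the notebook.

```python
import pandas as pd

# Hypothetical patterns, hand-written from the two templates quoted above.
# A real list would likely grow after inspecting more top tweets.
TEMPLATE_PATTERNS = [
    r"while you were sleeping, this was our most read story",
    r"just published: front page of the financial times",
]
TEMPLATE_PATTERN = "|".join(TEMPLATE_PATTERNS)


def drop_template_tweets(df, text_col="content"):
    """Return a copy of df without rows whose text matches a known template."""
    # case=False makes the match case-insensitive; na=False treats missing text as no match
    is_template = df[text_col].str.contains(
        TEMPLATE_PATTERN, case=False, regex=True, na=False
    )
    return df.loc[~is_template].reset_index(drop=True)


# Made-up sample rows, only to show the behaviour of the filter.
sample = pd.DataFrame({
    "content": [
        "__ Hong Kong, While you were sleeping, this was our most read story",
        "Just published: front page of the Financial Times, UK edition",
        "Flying cars are coming.",
        "A vaccine is coming soon.",
    ]
})

cleaned = drop_template_tweets(sample)
print(cleaned["content"].tolist())
```

Matching on a short, distinctive fragment of each template (rather than the full sentence) keeps the filter robust to the part of the tweet that varies, such as the city name or the edition. After filtering, the vectorizer and topic model would need to be refit on the cleaned dataframe for the topics to reflect the change.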