### Topic Analysis via Latent Semantic Analysis

0. load data (Trump's tweets)

1. create term vectors

2. calculate TF-IDF and perform SVD on it

3. look at topics and look at tweets with terms of interest

---

**big question:** What was Trump tweeting about at different periods?

**planned additions:** Perform sentiment analysis and see how that changes over time.

---
### load data and clean
Trump tweets from http://www.trumptwitterarchive.com/archive

looking at two time periods in this notebook

* time period 1 - Trump campaign announcement in 2015 to election day 2016
* time period 2 - Start to end of Mueller investigation

In [2]:
d = load_data()
c = clean_data(d)
dt, drt = split_retweets(c)  # split tweets and retweets

# campaign announcement to election day
# NOTE: the announcement was on 2015-06-16; this window starts on 2015-07-16
cond1 = ((dt['date'] >= pd.Timestamp('2015-07-16'))
          & (dt['date'] <= pd.Timestamp('2016-11-08')))

# start to end of Mueller investigation
cond2 = ((dt['date'] >= pd.Timestamp('2017-05-17'))
          & (dt['date'] <= pd.Timestamp('2019-03-22')))

d1 = dt[cond1]
d2 = dt[cond2]

d1.head(3)

| | text | date |
|---|---|---|
| 11120 | LIVE on #Periscope: Join me for a few minutes ... | 2016-11-07 23:28:48 |
| 11121 | Hey Missouri let's defeat Crooked Hillary @ko... | 2016-11-07 22:21:53 |
| 11122 | 'America must decide between failed policies o... | 2016-11-07 21:37:25 |
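
One detail of the date filter is worth checking: `pd.Timestamp('2016-11-08')` represents midnight at the start of that day, so a `<=` comparison against it drops anything posted later on election day itself. This is consistent with the first row above being from the evening of 2016-11-07. The sketch below illustrates the same behavior with the standard library's `datetime` and shows a half-open window as one possible alternative. The helper name `in_window` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def in_window(ts, start_day, end_day):
    """True if ts falls on or after start_day and before the day after end_day."""
    return start_day <= ts < end_day + timedelta(days=1)

start = datetime(2015, 7, 16)
end = datetime(2016, 11, 8)       # midnight at the start of election day

# hypothetical timestamps, for illustration only
evening_before = datetime(2016, 11, 7, 23, 28, 48)
election_day_noon = datetime(2016, 11, 8, 12, 0, 0)

# closed interval against a midnight timestamp: election day itself is dropped
closed = [start <= t <= end for t in (evening_before, election_day_noon)]

# half-open interval ending at the next midnight: election day is kept
half_open = [in_window(t, start, end) for t in (evening_before, election_day_noon)]

print(closed)      # [True, False]
print(half_open)   # [True, True]
```

In pandas, the equivalent change would be to compare with `dt['date'] < pd.Timestamp('2016-11-09')`. The same consideration applies to the 2019-03-22 upper bound of the second period.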


---
### create term vectors

In [3]:
# arg 2 - minimum number of terms in a term vector for the tweet to be used in the analysis
tvsdf1 = twtsdf_to_tvsdf(d1, min_vec_len=8)
tvsdf2 = twtsdf_to_tvsdf(d2, min_vec_len=8)

In [4]:
tvsdf1.head(3)

| | text | date | tvs |
|---|---|---|---|
| 0 | Hey Missouri let's defeat Crooked Hillary @ko... | 2016-11-07 22:21:53 | [hey, missouri, let, defeat, crook, hillari, k... |
| 1 | 'America must decide between failed policies o... | 2016-11-07 21:37:25 | [must, decid, fail, polici, fresh, perspect, c... |
| 2 | Just landed in North Carolina- heading to the ... | 2016-11-07 19:30:12 | [land, north, carolina, head, js, dorton, aren... |


In [5]:
tvsdf2.head(3)

| | text | date | tvs |
|---|---|---|---|
| 0 | Today we celebrate the lives and achievements ... | 2019-03-21 21:12:05 | [celebr, live, achiev, american, syndrom, alwa... |
| 1 | We are here today to take historic action to d... | 2019-03-21 20:12:40 | [take, histor, action, defend, american, stude... |
| 2 | After 52 years it is time for the United State... | 2019-03-21 16:50:46 | [year, unit, fulli, recogn, israel, sovereignt... |
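
To make the term-vector step concrete, here is a minimal sketch of the cleaning stage applied to a single made-up tweet. It uses only the standard library: `str.split` stands in for `nltk.word_tokenize`, the stop-word set is a tiny hypothetical subset, and stemming is omitted, so the result is close to but not the same as what ends up in the `tvs` column.

```python
import re
import string

ODD_CHARS = '“”’‘'   # curly quotes, which are not in string.punctuation
PUNCT_RE = re.compile('[%s]' % re.escape(string.punctuation + ODD_CHARS))

# tiny hypothetical stop-word subset, for illustration only
STOP_WORDS = {'me', 'at', 'in', 'the', 'a', 'is', 'pm'}

def clean_and_split(twt):
    """Lower-case, strip URLs, digits, hashtags, mentions and punctuation, then split."""
    twt = twt.lower()
    twt = re.sub(r'http\S+', '', twt)           # URLs
    twt = re.sub(r'\d+', '', twt)               # digits
    twt = re.sub(r'\B#\w*[a-z]+\w*', '', twt)   # hashtags
    twt = re.sub(r'@\S+', '', twt)              # @mentions
    twt = PUNCT_RE.sub('', twt)                 # punctuation and curly quotes
    return twt.split()                          # whitespace split (stand-in tokenizer)

# hypothetical tweet, for illustration only
sample = 'Join me at 7pm in Ohio! #MAGA @someuser https://example.com “Huge” crowd expected.'

tokens = clean_and_split(sample)
terms = [t for t in tokens if t not in STOP_WORDS]

print(tokens)  # ['join', 'me', 'at', 'pm', 'in', 'ohio', 'huge', 'crowd', 'expected']
print(terms)   # ['join', 'ohio', 'huge', 'crowd', 'expected']
```

Stripping digits turns `7pm` into the bare token `pm`, which is presumably why `pm` appears in the notebook's custom stop-word list.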


---
### calculate TF-IDF and perform SVD

In [6]:
# arg 2 - number of features (vocabulary size kept by TF-IDF)
# arg 3 - number of SVD components
svd1, v1 = tvs_to_svd(tvsdf1['tvs'], 
                      num_features=1000, 
                      num_comps=100)

svd2, v2 = tvs_to_svd(tvsdf2['tvs'], 
                      num_features=1000, 
                      num_comps=100)

print(svd1.components_.shape)
print(svd2.components_.shape)

(100, 1000)
(100, 1000)
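
For intuition about what the TF-IDF step computes, the sketch below reproduces by hand the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. The three-document corpus is hypothetical, and the code is an illustration of the formula rather than the notebook's implementation.

```python
import math
from collections import Counter

# hypothetical term vectors (already cleaned and stemmed), for illustration only
docs = [
    ['border', 'wall', 'secur'],
    ['border', 'trade', 'deal'],
    ['border', 'korea', 'trade'],
]

vocab = sorted({t for d in docs for t in d})
n_docs = len(docs)

# document frequency: number of documents containing each term
df = Counter(t for d in docs for t in set(d))

# smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """TF-IDF weights for one document, L2-normalized."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

matrix = [tfidf_row(d) for d in docs]

# 'border' occurs in every document, so it carries the lowest weight in each row
for row in matrix:
    print({t: round(w, 3) for t, w in zip(vocab, row) if w > 0})
```

The resulting matrix has one row per tweet and one column per term. `TruncatedSVD` then factors that matrix, and its `components_` attribute has shape `(n_components, n_features)`, which matches the `(100, 1000)` printed above.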


---
### results

#### topics

In [7]:
# arg 3 - number of topics
# arg 4 - number of top words from PC to show
print_top_topics(svd1, v1, num_topics=3, num_words=30)


Topic 0: 
fals hillari clinton android poll crook say cruz said debat win bad watch total support job show run lead cant ted need jeb campaign fail beat last rubio know email 

Topic 1: 
hillari clinton crook email berni bad judgement sander rig isi said obama system run bill corrupt beat year total question fbi video want person husband decis scandal made lie dishonest 

Topic 2: 
cruz poll ted rubio carson iowa marco kasich jeb bush lead wow debat win lyin last senat show fail lightweight beat total number report campaign money said say candid john 



In [8]:
print_top_topics(svd2, v2, num_topics=3, num_words=30)


Topic 0: 
border job work wall secur tax crime fake senat militari must need american cut strong total deal come trade nation year elect hous win dem done unit meet hard endors 

Topic 1: 
border wall secur crime strong militari need endors vet tax amend southern love nd immigr must drug cut vote senat full law governor congressman congress open build weak tough stop 

Topic 2: 
korea trade north china meet deal unit tariff kim tax billion honor nation american year forward talk un work xi cut jong south dollar world pay first farmer negoti come 
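
When reading these lists, it may help to remember that `print_top_topics` ranks terms by their signed weight within each component. The overall sign of an SVD component carries no meaning of its own, so terms with large negative weights can describe the opposite pole of the same axis and are not shown by a positive-only ranking. The sketch below uses a made-up weight vector (not taken from the fitted models) to contrast the three ways of ranking; the helper `top_terms` is hypothetical.

```python
# hypothetical weights of six terms in a single SVD component
terms = ['border', 'wall', 'korea', 'trade', 'tax', 'china']
weights = [0.52, 0.47, -0.41, -0.38, 0.05, -0.30]

def top_terms(terms, weights, k, key=lambda w: w):
    """Return the k terms with the largest key(weight)."""
    ranked = sorted(zip(terms, weights), key=lambda tw: key(tw[1]), reverse=True)
    return [t for t, _ in ranked[:k]]

positive_pole = top_terms(terms, weights, 3)                     # largest signed weight
negative_pole = top_terms(terms, weights, 3, key=lambda w: -w)   # most negative weight
by_magnitude = top_terms(terms, weights, 4, key=abs)             # largest of either sign

print(positive_pole)   # ['border', 'wall', 'tax']
print(negative_pole)   # ['korea', 'trade', 'china']
print(by_magnitude)    # ['border', 'wall', 'korea', 'trade']
```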



#### finding tweets with terms of interest

In [9]:
print_twts_with_terms(tvsdf2, terms=['amend'], n=3)

tweet date: 2019-02-21 20:10:38
---
terms: ['senat', 'john', 'cornyn', 'done', 'outstand', 'job', 'texa', 'strong', 'crime', 'border', 'second', 'amend', 'love', 'militari', 'vet', 'john', 'complet', 'total', 'endors']
---
tweet: Senator John Cornyn has done an outstanding job for the people of Texas. He is strong on Crime the Border the Second Amendment and loves our Military and Vets. John has my complete and total endorsement. MAKE AMERICA GREAT AGAIN!


tweet date: 2019-02-15 03:16:23
---
terms: ['tri', 'use', 'th', 'amend', 'tri', 'circumv', 'elect', 'despic', 'act', 'unconstitut', 'power', 'grabbingwhich', 'happen', 'third', 'world', 'obey', 'law', 'attack', 'system', 'constitut', 'alan', 'dershowitz']
---
tweet: “Trying to use the 25th Amendment to try and circumvent the Election is a despicable act of unconstitutional power grabbing...which happens in third world countries. You have to obey the law. This is an attack on our system  Constitution.” Alan Dershowitz. @TuckerCarlson
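
Note that the search matches against the stemmed terms in `tvs`, not the raw tweet text, which is why the query above is `'amend'` rather than `'amendment'`. A minimal sketch of the same set-intersection test on two hypothetical term vectors (the helper name `has_any_term` is an assumption for illustration):

```python
def has_any_term(tv, terms):
    """True if the term vector shares at least one term with the query."""
    return not set(terms).isdisjoint(tv)

# hypothetical stemmed term vectors, for illustration only
tvs = [
    ['strong', 'crime', 'border', 'second', 'amend'],
    ['trade', 'deal', 'china', 'tariff'],
]

stem_hits = [has_any_term(tv, ['amend']) for tv in tvs]
word_hits = [has_any_term(tv, ['amendment']) for tv in tvs]

print(stem_hits)   # [True, False]
print(word_hits)   # [False, False]
```

Passing query words through the same stemmer before searching would avoid having to guess the stemmed form.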

---
## code

In [1]:
#-------------------------------------------------------------------------------
### packages
#-------------------------------------------------------------------------------
# general
import numpy as np
import pandas as pd 
import datetime

# packages for text analysis
import nltk
import re
import string
#nltk.download('punkt')
#nltk.download('stopwords')

# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# LSA via SVD
from sklearn.decomposition import TruncatedSVD

#-------------------------------------------------------------------------------
### load data (from http://www.trumptwitterarchive.com/archive)
#-------------------------------------------------------------------------------

def load_data():
    d = pd.read_csv('./data/tweets_all.csv')
    d = d[['text', 'created_at']]
    d.columns = ['text', 'date']
    return d

#-------------------------------------------------------------------------------
### clean data 
#-------------------------------------------------------------------------------

def clean_data(d):
    c = d.copy()

    ## cleaning from issues in raw data

    # missing dates
    c = c.loc[c['date'].notnull(), :]

    # unreadable dates
    c['date'] = pd.to_datetime(c['date'], errors='coerce')
    c = c.loc[c['date'].notnull(), :]

    # remove HTML-escaped ampersands ('&amp;') left in the raw text
    c['text'] = c['text'].str.replace('&amp;', '', regex=False)
    
    return c

# split tweets and retweets
def split_retweets(d):
    # retweets start with 'RT'; use the argument d rather than a global
    cond = d['text'].str.find('RT', 0, 2) != -1
    twts = d.loc[~cond, :].reset_index(drop=True)
    rtwts = d.loc[cond, :].reset_index(drop=True)
    return twts, rtwts

#-------------------------------------------------------------------------------
### create term vectors (including stop word removal and stemming)
#-------------------------------------------------------------------------------

def twts_to_tvs(twts):
    '''turns tweets into term vectors'''
    tvs = twts.apply(twt_clean_split_to_tv)
    tvs = tvs.apply(tv_remove_stopwords)
    tvs = tvs.apply(tv_stem)
    return tvs

def twt_clean_split_to_tv(twt):
    '''cleans characters and splits into term vector'''
    twt = twt.lower() # lower case
    twt = re.sub(r'http\S+', '', twt) # remove URL
    twt = re.sub(r'\d+', '', twt) # remove digits
    twt = re.sub(r'\B#\w*[a-zA-Z]+\w*', '', twt) # remove hashtag
    twt = re.sub(r'@[^\s]+', '', twt) # remove @username

    # odd characters found not in string.punctuation
    odd_chars = ('“', '”', '’', '‘')
    chrs = string.punctuation + ''.join(odd_chars)
    twt = (re.compile('[%s]' % re.escape(chrs))
             .sub('', twt))
    
    twt = nltk.word_tokenize(twt)
    return twt

def create_swords():
    '''function for defining stop words to be used'''
    
    # these are largely chosen when they were found to 
    # obscure the meaning of a topic grouping in latent
    # semantic analysis
    r_names = ['donald', 'trump', 'fox']
    r_politics = ['democrat', 'democrats', 
                  'republican', 'republicans',
                  'maga', 'president', 'presidents', 'presidency',
                  'us', 'state', 'states', 'country', 'countries',
                  'vote', 'usa']
    r_nonwords = ['pm', 'pme']
    r_numwords = ['one', 'two', 'three']
    r_days = ['monday', 'tuesday', 'wednesday', 'thursday',
              'friday', 'saturday', 'sunday',
              'morning', 'night', 'tonight', 
              'day', 'week', 'year',
              'today']
    r_other = ['make', 'america', 'great', 'again',
               'thank', 'thanks', 'you', 'tonight', 'get', 'go',
               'people', 'new', 'news', 'twitter', 'media',
               'much', 'good', 'big', 'want', 'look', 'like',
               'many', 'morning', 'tonight', 'night', 'time',
               'never', 'would', 'back', 'go', 'even',
               'one', 'going']
    
    rmv = (r_names + r_politics + r_nonwords + 
           r_numwords + r_days + r_other)
    swords = nltk.corpus.stopwords.words('english') + rmv
    
    swords = [re.sub('[^A-Za-z0-9]+', '', s) 
              for s in swords] # remove punc
    
    return swords

def tv_remove_stopwords(tv):
    '''removes stop words from a term vector'''
    swords = set(create_swords())  # set makes membership tests fast
    return [t for t in tv if t not in swords]

def tv_stem(tv):
    '''stem a term vector'''
    
    stemmer = nltk.stem.porter.PorterStemmer()
    for i in range(0, len(tv)):
        tv[i] = stemmer.stem(tv[i])
    return tv

def twtsdf_to_tvsdf(twtsdf, min_vec_len):
    '''adds column of term vectors to df with tweets'''
    twtsdf = twtsdf.copy()
    twtsdf['tvs'] = twts_to_tvs(twtsdf['text'])
    twtsdf = twtsdf[twtsdf['tvs'].map(len) >= min_vec_len]
    twtsdf = twtsdf.reset_index(drop=True)
    return twtsdf

#-------------------------------------------------------------------------------
### create TF-IDF matrix and perform SVD on it (i.e. LSA)
#-------------------------------------------------------------------------------

def tvs_to_svd(tvs, num_features, num_comps):
    '''take term vectors (tvs) and perform svd on tf-idf'''
    tvs = list(tvs.apply(lambda x: ' '.join(x)))
    
    # the vectorizer is returned so that feature names can be looked up later
    v = TfidfVectorizer(max_features=num_features)
    tfidf = v.fit_transform(tvs)
    
    svd = TruncatedSVD(n_components=num_comps, 
                       algorithm='randomized', 
                       n_iter=100, random_state=123)
    svd.fit(tfidf)    
    return svd, v

#-------------------------------------------------------------------------------
### exploring results
#-------------------------------------------------------------------------------

def print_top_topics(svd, v, num_topics, num_words):
    '''prints top words from top topics from LSA'''
    # scikit-learn >= 1.0; older versions use get_feature_names()
    terms = v.get_feature_names_out()

    for i, comp in enumerate(svd.components_[:num_topics]):
        terms_comp = zip(terms, comp)
        sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:num_words]
        print("\nTopic "+str(i)+": ")
        for t in sorted_terms:
            print(t[0], end=' ')
        print()
    print()
        
def print_twts_with_terms(tvsdf, terms, n):
    '''prints the first n tweets containing any of the given (stemmed) terms'''
    count = 0
    for i in range(len(tvsdf)):
        if count==n:
            break
        if bool(set(terms) & set(tvsdf['tvs'][i])):
            print('tweet date: {}\n---'.format(tvsdf['date'][i]))
            print('terms: {}\n---'.format(tvsdf['tvs'][i]))
            print('tweet: {}\n\n'.format(tvsdf['text'][i]))
            count += 1