# Keeping Trump on Topic: LIN353C Final Project

By Hannah Brinsko and Aditya Kharosekar

In [1]:
import pandas as pd
import numpy as np
import nltk
import string
import re
import csv
import datetime
import gensim
from gensim.models.keyedvectors import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)



## Preprocessing - Scraping tweets, and cleaning them up

Importing tweets from the CSV file - 

In [2]:
tweets_csv = pd.read_csv("tweets.csv")
trump_tweets = tweets_csv[tweets_csv['handle']=="realDonaldTrump"]
print(trump_tweets.shape)
print(trump_tweets['time'].head())

(3218, 28)
5     2016-09-27T22:13:24
8     2016-09-27T21:08:22
11    2016-09-27T20:31:14
12    2016-09-27T20:14:33
13    2016-09-27T20:06:25
Name: time, dtype: object


Looking at the CSV file, we see that it only contains tweets up to 09/27/2016. We need his more recent tweets as well.

## Getting his most recent ~3200 tweets

The Twitter API's user timeline endpoint, which we access through Tweepy, returns at most roughly the 3,200 most recent tweets of an account, in pages of up to 200 tweets. As it turns out, this is more than enough for our purposes when combined with our CSV.

In [3]:
import tweepy
import json
from tweepy import OAuthHandler
import codecs

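import os

# Credentials are read from environment variables instead of being hard-coded in the notebook.
# The variable names below are placeholders; use whatever names you export in your shell.
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_secret = os.environ["TWITTER_ACCESS_SECRET"]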

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

all_tweets = []

# The first request returns the 200 most recent tweets.
new_tweets = api.user_timeline(screen_name="realDonaldTrump", count=200)
all_tweets.extend(new_tweets)

# Page backwards: each request asks for tweets older than the oldest one seen so far,
# and the loop stops as soon as a request comes back empty.
while len(new_tweets) > 0:
    oldest = all_tweets[-1].id - 1
    new_tweets = api.user_timeline(screen_name="realDonaldTrump", count=200, max_id=oldest)
    all_tweets.extend(new_tweets)

In [4]:
print(len(all_tweets))

3244


We now have his most recent tweets.

However, there is significant overlap between the tweets that we have scraped from his account and the tweets that are in the CSV file.

The latest tweet in the CSV file was posted on September 27, 2016 at 22:13:24. So, we need to keep any scraped tweets which were posted after this.

In [5]:
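# Tweets come back newest first, so we walk the list until we reach the timestamp
# of the latest tweet in the CSV file. The length check keeps the loop from running
# past the end of the list if that exact timestamp is never found.
cutoff = datetime.datetime(2016, 9, 27, 22, 13, 24)
tweets = []
i = 0
while i < len(all_tweets) and all_tweets[i].created_at > cutoff:
    tweets.append(all_tweets[i].text)
    i += 1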

In [6]:
# Append the tweets from the CSV file to the scraped ones.
tweets1 = trump_tweets['text'].tolist()
tweets.extend(tweets1)
print(type(tweets))
print("We have ", len(tweets), "tweets to work with")

<class 'list'>
We have  4694 tweets to work with


Next, we want to make a distributional vector from each tweet.

To do that, we need to:
1. Remove any twitter links and image links
2. Remove any stopwords
3. Make sure that we have a list of tweets where each tweet is a string
4. Then use CountVectorizer http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage
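
To make step 4 concrete, here is a minimal sketch of the kind of document-term count matrix that CountVectorizer produces, written in plain Python so that it runs without scikit-learn. The two example tweets and the `count_matrix` helper are made up for illustration. CountVectorizer itself also lowercases and tokenizes the text, which this sketch leaves to the caller.

```python
from collections import Counter

def count_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]
    # One column per distinct word, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Made-up example tweets (already lowercased and cleaned).
example_tweets = ["great great rally tonight", "great crowd tonight"]
vocabulary, rows = count_matrix(example_tweets)
print(vocabulary)  # ['crowd', 'great', 'rally', 'tonight']
print(rows)        # [[0, 2, 1, 1], [1, 1, 0, 1]]
```

Each row is the distributional vector of one tweet: its length equals the vocabulary size, and most entries are zero for short texts such as tweets.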

### Removing links

In [7]:
temp_tweets = []
for t in tweets:
    temp_tweets.append(t.lower().split())

print(temp_tweets[1])
for t in temp_tweets:
    for w in t:
        if "http" in w or "@" in w: #I've removed any instances where he tags anyone in his tweets. 
                                    #I thought the word vectors might be too sparse if I left those in.
            t.remove(w)
print(temp_tweets[1]) 

['no', 'matter', 'how', 'much', 'i', 'accomplish', 'during', 'the', 'ridiculous', 'standard', 'of', 'the', 'first', '100', 'days,', '&amp;', 'it', 'has', 'been', 'a', 'lot', '(including', 's.c.),', 'media', 'will', 'kill!']
['no', 'matter', 'how', 'much', 'i', 'accomplish', 'during', 'the', 'ridiculous', 'standard', 'of', 'the', 'first', '100', 'days,', '&amp;', 'it', 'has', 'been', 'a', 'lot', '(including', 's.c.),', 'media', 'will', 'kill!']


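NOTE: This link removal is not working properly and has to be fixed later. The loop calls t.remove(w) on the list it is iterating over, so the word right after each removed word is never checked, and a link or mention in that position survives. The printed tweet is unchanged simply because it contains no link or mention to begin with, so it does not tell us whether the removal works.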
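
One way to avoid modifying a list while looping over it is to build a new list instead. The following is a minimal sketch of that idea, not the code that produced the outputs in this notebook. The helper names, the example tweet and its URLs are all hypothetical.

```python
def remove_in_place(words):
    """Same pattern as the loop above: removes from the list being iterated."""
    words = list(words)  # work on a copy so the caller's list is untouched
    for w in words:
        if "http" in w or "@" in w:
            words.remove(w)
    return words

def strip_links_and_mentions(words):
    """Build a new list without any word that contains 'http' or '@'."""
    return [w for w in words if "http" not in w and "@" not in w]

# Made-up tweet with a mention followed by two links in a row.
example = "thank you @foxnews https://t.co/abc https://t.co/def great crowd".split()

print(remove_in_place(example))
# ['thank', 'you', 'https://t.co/abc', 'great', 'crowd']  (one link is skipped)
print(strip_links_and_mentions(example))
# ['thank', 'you', 'great', 'crowd']
```

The same list-comprehension pattern would apply to any other filter that is written as a removal loop.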

### Removing stopwords

In [8]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

for t in temp_tweets:
    for w in t:
        if w in stop:
            t.remove(w)

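This removes most stopwords, but not all of them: the loop again removes items from the list it is iterating over, so the word right after each removed stopword is never checked. Some stopwords, such as 'the' and 'is', are therefore still visible in the tweets printed further down. At this point, each tweet is a list of words and temp_tweets is a list of such lists. To use CountVectorizer, we need a list where each element is a string.

Therefore, we need to convert each tweet from a list of words to a string.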

In [9]:
tweets = []
for t in temp_tweets:
    tweets.append(' '.join(t))
type(tweets[0])

str

## Our methodology for classifying tweets

Step 0 - Download a pre-trained Word2Vec model. We tried training our own model, but we did not have enough data.

Step 1 - Hand tag some number of tweets (we ended up tagging about 280 tweets) and classify them into the following categories - 
1. Foreign Policy / International News
2. Domestic Policy / Domestic News
3. Tweets about the media
4. Attack tweets
5. Other tweets
6. Tweets about the election

Step 2 - From our hand-tagged corpus, and for each category, create a list of words used.

Step 3 - Create a word vector for each category by summing up the individual word vectors

Step 4 - For each subsequent tweet, find cosine similarity between it and each category vector. Assign that tweet to the category it is most similar to
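
Steps 3 and 4 can be sketched on toy data as follows. The three-dimensional vectors in `toy_vectors` are invented stand-ins for the 300-dimensional Word2Vec vectors, and the helper names are ours for illustration only. The sketch just shows the mechanics: sum the word vectors, then pick the category with the highest cosine similarity.

```python
import math

# Invented 3-dimensional "word vectors"; the real model uses 300 dimensions.
toy_vectors = {
    "border": [0.9, 0.1, 0.0],
    "jobs":   [0.8, 0.2, 0.1],
    "russia": [0.1, 0.9, 0.0],
    "china":  [0.0, 0.8, 0.2],
}

def sum_vectors(words):
    """Add up the vectors of all known words; unknown words are skipped."""
    total = [0.0, 0.0, 0.0]
    for word in words:
        if word in toy_vectors:
            total = [a + b for a, b in zip(total, toy_vectors[word])]
    return total

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no known words: treat as dissimilar to everything
    return dot / (norm_u * norm_v)

# Step 3: one vector per category, built from that category's word list.
category_vectors = {
    "domestic": sum_vectors(["border", "jobs"]),
    "foreign": sum_vectors(["russia", "china"]),
}

def classify(tweet):
    """Step 4: return the category whose vector is most similar to the tweet's."""
    tweet_vector = sum_vectors(tweet.split())
    return max(category_vectors, key=lambda c: cosine(tweet_vector, category_vectors[c]))

print(classify("talks with china and russia"))  # foreign
```

Because cosine similarity only depends on the direction of the vectors, a category with many more tagged words is not favored merely for having a longer summed vector.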

### Step 0 - Downloading a pre-trained Word2Vec model

In [10]:
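def loadGloveModel(gloveFile):
    """Read GloVe vectors from a text file into a dict mapping word -> list of floats."""
    print("Loading Glove Model")
    model = {}
    # Each line holds a word followed by its embedding values, separated by spaces.
    # The with-block makes sure the file is closed once it has been read.
    with open(gloveFile, 'r', encoding='utf8') as f:
        for line in f:
            splitLine = line.split()
            word = splitLine[0]
            embedding = [float(val) for val in splitLine[1:]]
            model[word] = embedding
    print("Done.",len(model)," words loaded")
    return model

glove_model = loadGloveModel('glove.txt')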

Loading Glove Model
Done. 400000  words loaded


In [11]:
glove_model['hillary']

[0.14675,
 1.1692,
 0.69416,
 -0.061429,
 -0.13677,
 0.42015,
 -0.716,
 0.019014,
 -0.52896,
 -0.83643,
 -1.8561,
 -0.18324,
 0.057648,
 -0.31188,
 0.024997,
 0.045878,
 -0.098728,
 -0.21451,
 0.14298,
 -0.0080809,
 -0.14569,
 0.38326,
 0.63811,
 -0.46426,
 1.0953,
 -2.15,
 -0.18462,
 0.1738,
 -0.50607,
 0.00057719,
 0.52828,
 0.6685,
 -0.89692,
 -0.34346,
 -0.15456,
 -0.97313,
 -0.69441,
 0.59201,
 -1.2194,
 -1.3469,
 -0.25691,
 0.34537,
 -0.43824,
 -0.096233,
 0.29882,
 -0.29174,
 -0.47201,
 -0.32221,
 0.079279,
 0.59419]

In [12]:
google_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary = True)

In [13]:
google_model['campaign']

array([ 0.24023438, -0.046875  , -0.05786133, -0.17285156,  0.13476562,
       -0.03466797,  0.05957031, -0.02209473,  0.00334167, -0.03564453,
       -0.04589844,  0.04248047, -0.09570312,  0.21582031, -0.12597656,
       -0.06835938,  0.15332031,  0.17773438, -0.03662109,  0.03515625,
        0.04418945,  0.28320312,  0.05297852, -0.01953125, -0.27929688,
       -0.23828125,  0.00238037, -0.04345703,  0.26367188,  0.06591797,
       -0.02624512,  0.03369141,  0.02880859, -0.15332031,  0.11083984,
       -0.046875  , -0.02355957,  0.01000977,  0.23632812, -0.07421875,
        0.27734375, -0.14746094,  0.02478027,  0.10351562, -0.33007812,
        0.0050354 , -0.04736328,  0.16699219,  0.015625  ,  0.30859375,
        0.15039062, -0.09472656,  0.08349609,  0.05883789, -0.17578125,
       -0.00273132, -0.04101562, -0.30859375, -0.15332031, -0.05200195,
       -0.19140625,  0.13476562, -0.28515625, -0.06445312, -0.00058365,
        0.01348877, -0.00527954,  0.10498047,  0.20605469,  0.01

google_model is the pre-trained Word2Vec model that we will be using. It maps each word in its vocabulary to a 300-dimensional vector.

### Step 1 - Hand tagging tweets

Using a .csv file of hand-tagged tweets, we place each tweet into one of the preselected categories.

In [14]:
foreign = []
domestic =[]
media =[]
attack = []
election = []
other = []

In [15]:
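# Each row holds the tweet text in the first column and its category label in the second.
# A tweet goes into the first category that matches. Category 4 (attack tweets) is not
# read here, so the `attack` list stays empty.
with open('tagged_tweets.csv') as f:
    csv_f = csv.reader(f)

    for row in csv_f:
        tweet = row[0]
        cats = row[1]
        if "1" in cats:
            foreign.append(tweet)
        elif "2" in cats:
            domestic.append(tweet)
        elif "3" in cats:
            media.append(tweet)
        elif "5" in cats:
            other.append(tweet)
        elif "6" in cats:
            election.append(tweet)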



In [16]:
print("Domestic: ",len(domestic))
print("Foreign: ", len(foreign))
print("Media: ", len(media))
print("Other: ",len(other))
print("Election: ",len(election))

Domestic:  104
Foreign:  56
Media:  83
Other:  95
Election:  79


### Step 2 - Making a list of words used in each category

In [17]:
# Each tweet is a string right now.
# This function splits every tweet into individual words and strips a leading '@'
# from any word that has one (i.e. our generated tweets won't have any tags).
# Punctuation is removed separately, by remove_punctuation.

def clean_up(tweets):
    # Flatten all tweets into a single list of words.
    tweets_words = []
    for t in tweets:
        tweets_words.extend(t.split())

    # Removing '@' from any word which starts with it. The google_model does not have any words which start with @
    temp_words = []
    for word in tweets_words:
        if word[0]=='@':
            temp_words.append(word[1:])
        else:
            temp_words.append(word)
    return temp_words

In [18]:
domestic_words = clean_up(domestic)
foreign_words = clean_up(foreign)
media_words = clean_up(media)
election_words = clean_up(election)

In [19]:
def remove_punctuation(words):
    x = [''.join(c for c in s if c not in string.punctuation) for s in words]
    return x

In [20]:
domestic_words = remove_punctuation(domestic_words)
foreign_words = remove_punctuation(foreign_words)
media_words = remove_punctuation(media_words)
election_words = remove_punctuation(election_words)

In [21]:
def remove_short_words(words):
    # Keep only the words with at least 4 characters; shorter words are dropped.
    temp_words = []
    for word in words:
        if len(word) >=4:
            temp_words.append(word)
    return temp_words

In [22]:
d_short_words = remove_short_words(domestic_words)
f_short_words = remove_short_words(foreign_words)

In [23]:
domestic_words

['thank',
 'to',
 'law',
 'enforcement',
 'officers',
 'lesm',
 'trump2016',
 'jeb',
 'bush',
 'did',
 'poorly',
 'last',
 'night',
 'the',
 'debate',
 'whose',
 'chances',
 'winning',
 'zero',
 'got',
 'graham',
 'endorsement',
 'graham',
 'quit',
 'o',
 'sen',
 'lindsey',
 'graham',
 'embarrassed',
 'with',
 'failed',
 'run',
 'president',
 'now',
 'embarrasses',
 'with',
 'endorsement',
 'bush',
 'will',
 'the',
 'greatest',
 'jobproducing',
 'president',
 'american',
 'history',
 'trump2016',
 'votetrump',
 'httpstcooc480lwvqg',
 'How',
 'low',
 'has',
 'President',
 'Obama',
 'gone',
 'to',
 'tapp',
 'my',
 'phones',
 'during',
 'the',
 'very',
 'sacred',
 'election',
 'process',
 'This',
 'is',
 'NixonWatergate',
 'Bad',
 'or',
 'sick',
 'guy',
 'Id',
 'bet',
 'a',
 'good',
 'lawyer',
 'could',
 'make',
 'a',
 'great',
 'case',
 'out',
 'of',
 'the',
 'fact',
 'that',
 'President',
 'Obama',
 'was',
 'tapping',
 'my',
 'phones',
 'in',
 'October',
 'just',
 'prior',
 'to',
 'Elec

### Step 3 - Create a category vector by adding up individual word vectors

In [24]:
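def create_category_vector(words):
    # Sum the 300-dimensional word vectors of every word in the category.
    # Note: the sum starts from a vector of ones rather than zeros, which adds a
    # constant offset of 1 to every dimension of the result.
    vector = np.ones(300)
    for word in words:
        try:
            vector = vector + google_model[word]
        except KeyError: # some words are not in the model; we skip them instead of pre-filtering the word lists
            pass
    return vector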

In [25]:
domestic_vector = create_category_vector(domestic_words)
foreign_vector = create_category_vector(foreign_words)
media_vector = create_category_vector(media_words)
election_vector = create_category_vector(election_words)
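# The argument has to be a list of words: a bare string would be iterated character by character.
obamacare = create_category_vector(["Health"])
isis = create_category_vector(["Russia"])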

In [26]:
tweets[27]

'military building is rapidly becoming stronger ever before. frankly, have choice!'

In [27]:
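def create_tweet_vector(tweet):
    # Note: the tweets passed to this function are strings, so this loop iterates over
    # single characters, not words. Looping over tweet.split() would sum word vectors instead.
    # As in create_category_vector, the sum starts from ones rather than zeros.
    vector = np.ones(300)
    for word in tweet:
        try:
            vector = vector + google_model[word]
        except KeyError:
            pass
    return vector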
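
Iterating over a Python string yields its characters, so a tweet has to be split into words before each one is looked up in the model. The sketch below illustrates the difference on a toy lookup table. `toy_model`, `vector_from_words` and `vector_from_characters` are hypothetical names, and the 2-dimensional vectors are invented. The sums start from zeros so that the result contains nothing but the looked-up vectors.

```python
# Invented 2-dimensional vectors; note that the single letter "a" is also a key.
toy_model = {"great": [1.0, 0.0], "wall": [0.0, 2.0], "a": [5.0, 5.0]}

def add_known(items):
    """Sum the toy vectors of all items that are in toy_model, starting from zeros."""
    total = [0.0, 0.0]
    for item in items:
        if item in toy_model:
            total = [a + b for a, b in zip(total, toy_model[item])]
    return total

def vector_from_words(tweet):
    """Split the tweet into words first, then sum their vectors."""
    return add_known(tweet.split())

def vector_from_characters(tweet):
    """Loop over the string directly: every item is a single character."""
    return add_known(tweet)

print(vector_from_words("great wall"))       # [1.0, 2.0]
print(vector_from_characters("great wall"))  # [10.0, 10.0], from the two letters "a"
```

In the character version, only the two occurrences of the letter "a" are found in the table, so the result says nothing about the words "great" and "wall".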

In [28]:
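def calc_cosine_similarity(tweet_vector, category_vector):
    # scikit-learn's cosine_similarity expects 2D arrays, so each vector is reshaped to a single row.
    return cosine_similarity(tweet_vector.reshape(1, -1), category_vector.reshape(1, -1))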

In [29]:
specific_foreign = []
specific_domestic = []
for word in foreign_words:
    if word not in domestic_words:
        specific_foreign.append(word)
for word in domestic_words:
    if word not in foreign_words:
        specific_domestic.append(word)
        
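# Note: the two targets below look swapped. Words found only in d_short_words are
# appended to specific_fshort, and words found only in f_short_words to specific_dshort.
specific_dshort = []
specific_fshort = []
for word in d_short_words:
    if word not in f_short_words:
        specific_fshort.append(word)
for word in f_short_words:
    if word not in d_short_words:
        specific_dshort.append(word)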
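
The same kind of filtering can be written with a set lookup, which keeps the membership tests fast and makes the direction of each difference easy to read. This is a sketch on made-up word lists. `only_in` is a hypothetical helper, and unlike a plain set difference it keeps the order and the duplicates of the first list, as the loops above do.

```python
def only_in(words, other_words):
    """Return the words from `words` that never occur in `other_words`."""
    other = set(other_words)  # set lookup instead of scanning a list each time
    return [w for w in words if w not in other]

# Made-up word lists standing in for the domestic and foreign categories.
domestic_example = ["jobs", "border", "great", "jobs"]
foreign_example = ["china", "great", "russia"]

print(only_in(domestic_example, foreign_example))  # ['jobs', 'border', 'jobs']
print(only_in(foreign_example, domestic_example))  # ['china', 'russia']
```

Naming the result after the first argument (for example, domestic-only words come from `only_in(domestic, foreign)`) helps keep the two directions apart.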

In [30]:
def calc_scores(tweet):
#     score = calc_cosine_similarity(create_tweet_vector(tweet), specific_domestic_vector)
#     print("Domestic:", score)
    score = calc_cosine_similarity(create_tweet_vector(tweet), dshort_vector)
    print("Domestic Long words:", score)
#     score = calc_cosine_similarity(create_tweet_vector(tweet), specific_foreign_vector)
#     print("Foreign: ",score)
    score = calc_cosine_similarity(create_tweet_vector(tweet), fshort_vector)
    print("Foreign Long words:", score)

In [31]:
specific_foreign = clean_up(specific_foreign)
specific_domestic = clean_up(specific_domestic)

specific_foreign = remove_punctuation(specific_foreign)
specific_domestic = remove_punctuation(specific_domestic)

specific_foreign_vector = create_category_vector(specific_foreign)
specific_domestic_vector = create_category_vector(specific_domestic)
dshort_vector = create_category_vector(specific_dshort)
fshort_vector = create_category_vector(specific_fshort)

In [34]:
for t in tweets[1:40]:
    print(t)
    print(calc_scores(t))
    print()

matter much accomplish the ridiculous standard the first 100 days, &amp; has a lot (including s.c.), media kill!
Domestic Long words: [[ 0.29918605]]
Foreign Long words: [[ 0.29549825]]
None

another terrorist attack paris. people france not take much of this. have big effect presidential election!
Domestic Long words: [[ 0.29141667]]
Foreign Long words: [[ 0.29708134]]
None

rt nyt editor apologizes misleading tweet new england patriots' visit the white house (via h…
Domestic Long words: [[ 0.2990066]]
Foreign Long words: [[ 0.30049044]]
None

great honor host pm paolo gentiloni italy the white house afternoon! #icymi- joint press conference…
Domestic Long words: [[ 0.28118862]]
Foreign Long words: [[ 0.28845017]]
None

we're going use american steel, we're going use american labor, are going come first all deals. ➡️…
Domestic Long words: [[ 0.27667859]]
Foreign Long words: [[ 0.28249208]]
None

failing has calling wrong two years, got caught a big lie concerning new england patriots 

In [33]:
for t in specific_domestic[20:40]:
    print(t)
    print(calc_scores(t))
    print()

o
Domestic Long words: [[ 0.01014544]]
Foreign Long words: [[-0.00438806]]
None

sen
Domestic Long words: [[ 0.05965827]]
Foreign Long words: [[ 0.04615618]]
None

lindsey
Domestic Long words: [[ 0.14880492]]
Foreign Long words: [[ 0.13815822]]
None

graham
Domestic Long words: [[ 0.08706938]]
Foreign Long words: [[ 0.06181411]]
None

embarrassed
Domestic Long words: [[ 0.18554745]]
Foreign Long words: [[ 0.1607589]]
None

failed
Domestic Long words: [[ 0.13266373]]
Foreign Long words: [[ 0.11580782]]
None

run
Domestic Long words: [[ 0.0693841]]
Foreign Long words: [[ 0.05718254]]
None

president
Domestic Long words: [[ 0.19738317]]
Foreign Long words: [[ 0.18692844]]
None

embarrasses
Domestic Long words: [[ 0.17816068]]
Foreign Long words: [[ 0.15297143]]
None

endorsement
Domestic Long words: [[ 0.20781486]]
Foreign Long words: [[ 0.19515333]]
None

bush
Domestic Long words: [[ 0.10439766]]
Foreign Long words: [[ 0.08620976]]
None

greatest
Domestic Long words: [[ 0.16548235]]
Fore