# Keeping Trump on Topic: LIN353C Final Project

By Hannah Brinsko and Aditya Kharosekar

In [51]:
import pandas as pd
import numpy as np
import nltk
import string
import re
import csv
import datetime
import gensim
from gensim.models.keyedvectors import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Preprocessing - Scraping tweets, and cleaning them up

Importing tweets from the CSV file - 

In [52]:
tweets_csv = pd.read_csv("tweets.csv")
trump_tweets = tweets_csv[tweets_csv['handle']=="realDonaldTrump"]
print(trump_tweets.shape)
print(trump_tweets['time'].head())

(3218, 28)
5     2016-09-27T22:13:24
8     2016-09-27T21:08:22
11    2016-09-27T20:31:14
12    2016-09-27T20:14:33
13    2016-09-27T20:06:25
Name: time, dtype: object


Looking at the CSV file, we see that it contains tweets only up to 09/27/2016. We need his more recent tweets as well

## Getting his most recent ~3200 tweets.

3200 is approximately the limit to how many tweets Tweepy allows us to scrape. As it turns out, this is more than enough for our use when combined with our CSV.

In [53]:
import tweepy
import json
from tweepy import OAuthHandler
import codecs

consumer_key = "i387QW7Eqgh12UHmK3VoQO9K5"
consumer_secret = "BQI8c5eKale4etdA21mawnFqOmAziDQpnThm679V7UtLjbWlMG"
access_token = "816857419338764288-S8Ay111O2Mo32QAs88tSnv5uKvmGCkF"
access_secret = "HVU19yLuV0klltJl1fsDibAi7Hiq1U4GwsEV9kozTAc1m"

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

all_tweets = []

new_tweets = api.user_timeline(screen_name="realDonaldTrump", count=200)


all_tweets.extend(new_tweets)
oldest = all_tweets[-1].id-1

t = new_tweets[0];

while len(new_tweets) > 0:
   new_tweets = api.user_timeline(screen_name = "realDonaldTrump", count=200, max_id = oldest)
   all_tweets.extend(new_tweets)
   oldest = all_tweets[-1].id-1;

In [54]:
print(len(all_tweets))

3221


We now have his most recent tweets.

However, there is significant overlap between the tweets that we have scraped from his account and the tweets that are in the CSV file.

The latest tweet in the CSV file was posted on September 27, 2016 at 22:13:24. So, we need to keep any scraped tweets which were posted after this.

In [55]:
tweets = []
i = 0
while (all_tweets[i].created_at!=datetime.datetime(2016, 9, 27, 22, 13, 24)):
    tweets.append(all_tweets[i].text)
    i+=1

In [56]:
tweets1 = trump_tweets['text']
tweets1 = tweets1.tolist()
for t in tweets1:
    tweets.append(t)
print(type(tweets))
print("We have ", len(tweets), "tweets to work with")

<class 'list'>
We have  4722 tweets to work with


Making distributional vectors from each tweet

But to do that, we need to - 
1. Remove any twitter links and image links
2. Remove any stopwords
3. Make sure that we have a list of tweets where each tweet is a string
4. Then use CountVectorizer http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage

### Removing links

In [57]:
temp_tweets = []
for t in tweets:
    temp_tweets.append(t.lower().split())

print(temp_tweets[1])
for t in temp_tweets:
    for w in t:
        if "http" in w or "@" in w: #I've removed any instances where he tags anyone in his tweets. 
                                    #I thought the word vectors might be too sparse if I left those in.
            t.remove(w)
print(temp_tweets[1]) 

["don't", 'let', 'the', 'fake', 'media', 'tell', 'you', 'that', 'i', 'have', 'changed', 'my', 'position', 'on', 'the', 'wall.', 'it', 'will', 'get', 'built', 'and', 'help', 'stop', 'drugs,', 'human', 'trafficking', 'etc.']
["don't", 'let', 'the', 'fake', 'media', 'tell', 'you', 'that', 'i', 'have', 'changed', 'my', 'position', 'on', 'the', 'wall.', 'it', 'will', 'get', 'built', 'and', 'help', 'stop', 'drugs,', 'human', 'trafficking', 'etc.']


## NOTE: This link removal is not working properly. Have to fix later

### Removing stopwords

In [58]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

for t in temp_tweets:
    for w in t:
        if w in stop:
            t.remove(w)

Succesfully removed stopwords. At this point, each tweet is a list of words and temp_tweets is a list. What we need to use CountVectorizer is a list where each element is a string.

Therefore, we need to convert each tweet from a lists of words to a string.

In [59]:
tweets = []
for t in temp_tweets:
    tweets.append(' '.join(t))
type(tweets[0])

str

## Our methodology for classifying tweets

Step 0 - Download a pre-trained Word2Vec model. We tried training our own model, but we did not have enough data.

Step 1 - Hand tag some number of tweets (we ended up tagging about 280 tweets) and classify them into the following categories - 
1. Foreign Policy / International News
2. Domestic Policy / domestic news
3. Tweets about the media
4. Attack tweets
5. Other tweets
6. Tweets about the election

Step 2 - From our hand-tagged corpus, and for each category, create a list of words used.

Step 3 - Create a word vector for each category by summing up the individual word vectors

Step 4 - For each subsequent tweet, find cosine similarity between it and each category vector. Assign that tweet to the category it is most similar to

### Step 0 - Downloading a pre-trained Word2Vec model

In [60]:
google_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary = True)

In [61]:
google_model['campaign']

array([ 0.24023438, -0.046875  , -0.05786133, -0.17285156,  0.13476562,
       -0.03466797,  0.05957031, -0.02209473,  0.00334167, -0.03564453,
       -0.04589844,  0.04248047, -0.09570312,  0.21582031, -0.12597656,
       -0.06835938,  0.15332031,  0.17773438, -0.03662109,  0.03515625,
        0.04418945,  0.28320312,  0.05297852, -0.01953125, -0.27929688,
       -0.23828125,  0.00238037, -0.04345703,  0.26367188,  0.06591797,
       -0.02624512,  0.03369141,  0.02880859, -0.15332031,  0.11083984,
       -0.046875  , -0.02355957,  0.01000977,  0.23632812, -0.07421875,
        0.27734375, -0.14746094,  0.02478027,  0.10351562, -0.33007812,
        0.0050354 , -0.04736328,  0.16699219,  0.015625  ,  0.30859375,
        0.15039062, -0.09472656,  0.08349609,  0.05883789, -0.17578125,
       -0.00273132, -0.04101562, -0.30859375, -0.15332031, -0.05200195,
       -0.19140625,  0.13476562, -0.28515625, -0.06445312, -0.00058365,
        0.01348877, -0.00527954,  0.10498047,  0.20605469,  0.01

google_model is our pre-trained model which we will be using.

### Step 1 - Hand tagging tweets

Using a .csv file of a number of hand tagged tweets, we place the tweets into the preselected categories

In [62]:
foreign = []
domestic =[]

In [63]:
f = open('tagged_tweets.csv')
csv_f = csv.reader(f)

for row in csv_f:
    tweet = row[0]
    cats = row[1]
    if "1" in cats:
        foreign.append(tweet)
    elif "2" in cats:
        domestic.append(tweet)

In [64]:
print("Domestic: ",len(domestic))
print("Foreign: ", len(foreign))

Domestic:  104
Foreign:  56
Media:  83
Other:  95
Election:  79


### Step 2 - Making a list of words used in each category

In [65]:
#Each tweet is a string right now.
#This function will split up the string into individual words, remove any words which start with @
#(i.e our generated tweets won't have any tags) and remove any punctuation

def clean_up(tweets):
    tweets1 = []
    for t in tweets:
        tweets1.append(t.split())
        
    tweets_words = []
    for t in tweets1:
        for w in t:
            tweets_words.append(w)
    tweets_words = tweets_words
    
    #removing '@' from any word which has it. The google_model does not have any words which start with @
    temp_words = []
    for word in tweets_words:
        if word[0]=='@':
            temp_words.append(word[1:])
        else:
            temp_words.append(word)
    return temp_words

In [66]:
domestic_words = clean_up(domestic)
foreign_words = clean_up(foreign)

In [67]:
def remove_punctuation(words):
    x = [''.join(c for c in s if c not in string.punctuation) for s in words]
    return x

In [68]:
domestic_words = remove_punctuation(domestic_words)
foreign_words = remove_punctuation(foreign_words)

When we generate tweets, we will use these words as the basis for our bigram model.

In [69]:
def remove_short_words(words, length):
    temp_words = []
    for word in words:
        if len(word)>=length:
            temp_words.append(word)
    return temp_words

In [76]:
d_short_words = remove_short_words(domestic_words, 4)
f_short_words = remove_short_words(foreign_words, 4)

### Step 3 - Create a category vector by adding up individual word vectors

In [85]:
def create_vector(words):
    vector = np.ones(300)
    for word in words:
        try:
            vector = vector + google_model[word]
        except KeyError: #some words are not in model. I don't want to pre-process everything so I'm just handling each exception
            pass
    return vector

In [86]:
domestic_vector = create_vector(domestic_words)
foreign_vector = create_vector(foreign_words)
media_vector = create_vector(media_words)
election_vector = create_vector(election_words)

In [73]:
tweets[27]

'welcome home, aya! #godblesstheusa🇺🇸'

In [87]:
def create_vector(tweet):
    vector = np.ones(300)
    for word in tweet:
        try:
            vector = vector + google_model[word]
        except KeyError:
            pass
    return vector

In [78]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

specific_foreign = []
specific_domestic = []
for word in foreign_words:
    if word not in domestic_words:
        specific_foreign.append(word)
for word in domestic_words:
    if word not in foreign_words:
        specific_domestic.append(word)

specific_dshort = []
specific_fshort = []
for word in d_short_words:
    #if word not in f_short_words and not in stop:
    if word not in stop:
        specific_fshort.append(word)
for word in f_short_words:
    #if word not in d_short_words:
    if word not in stop:
        specific_dshort.append(word)

In [97]:
dshort_vector = create_vector(specific_dshort)
fshort_vector = create_vector(specific_fshort)

In [89]:
domestic_tags = ['press', 'media', 'election', 'healthcare', 'Obamacare', 'obamacare','american', 'immigrant', 'immigrants',
                'Committee', 'wall','Wall', 'jobs', 'taxes', 'senate', 'congress', 'dems','drugs']
foreign_tags = ['Russia', 'russia', 'China', 'trade', 'mexico', 'terrorist', 'terrorists', 'terrorism',
               'migrants', 'immigration','Immigration', 'President', 'Egypt', 'Syria', 'Minister', 'Ambassador','Korea','war']

def count_tag_occurrences(tweet):
    dcount= 0
    fcount = 0
    tweet = tweet.split()
    tweet = remove_punctuation(tweet)
    for word in tweet:
        if word in domestic_tags:
            dcount+=1
        if word in foreign_tags:
            fcount+=1
    return dcount, fcount

In [90]:
def calc_cosine_similarity(tweet_vector, category_vector):
    return cosine_similarity(tweet_vector, category_vector)

In [91]:
def calc_scores(tweet):
    dcount, fcount = count_tag_occurrences(tweet)
    dscore = calc_cosine_similarity(create_tweet_vector(tweet), dshort_vector)
    fscore = calc_cosine_similarity(create_tweet_vector(tweet), fshort_vector)
    return dscore, fscore, dcount, fcount

In [124]:
def model(dom, fore):
    dcount = 0
    fcount = 0
    total = 0
    for tweet in dom:
        domestic_score, foreign_score, domestic_tag_count, foreign_tag_count = calc_scores(tweet)
        total+=1
        if abs(domestic_score - foreign_score) <= 0.005:
            if domestic_tag_count >=foreign_tag_count:
                dcount+=1
            else:
                fcount+=1
        else:
            if domestic_score > foreign_score:
                dcount+=1
            else:
                fcount+=1

    print("Number of domestic tweets = ", total)
    print("Number of domestic tweets tagged as domestic = ", dcount)
    print("Accuracy of domestic tweets = ", dcount / total)

    dcount = 0
    fcount = 0
    total = 0
    for tweet in fore:
        domestic_score, foreign_score,domestic_tag_count, foreign_tag_count = calc_scores(tweet)
        total+=1
        if domestic_score > foreign_score:
            dcount+=1
        else:
            fcount+=1
        
    print("\nNumber of foreign tweets = ", total)
    print("Number of foreign tweets tagged as foreign = ", fcount)
    print("Accuracy of foreign tweets = ", fcount / total)

In [125]:
model(domestic, foreign)

Number of domestic tweets =  104
Number of domestic tweets tagged as domestic =  76
Accuracy of domestic tweets =  0.7307692307692307

Number of foreign tweets =  56
Number of foreign tweets tagged as foreign =  39
Accuracy of foreign tweets =  0.6964285714285714


The above accuracy scores are based on testing our model on our training set.

We will now test our model on a small test set.

In [126]:
domestic_index = [1, 5, 8, 9, 11, 13, 38, 41, 42, 43, 50, 51, 54, 57, 66, 67, 68, 74, 79, 100, 105, 112, 120]
foreign_index = [16, 28, 30, 32, 59, 64, 65, 75, 76, 82, 87, 96, 98, 123, 157, 175, 178, 224, 332, 333]

dom_tweets = [tweets[i] for i in domestic_index]
fore_tweets = [tweets[i] for i in foreign_index]

model(dom_tweets, fore_tweets)

Number of domestic tweets =  23
Number of domestic tweets tagged as domestic =  17
Accuracy of domestic tweets =  0.7391304347826086

Number of foreign tweets =  20
Number of foreign tweets tagged as foreign =  18
Accuracy of foreign tweets =  0.9


In [122]:
for index in range(0, 100):
    print(index, ":", tweets[index])

0 : remarks the united states holocaust memorial museum's national days remembrance. full remarks:…
1 : don't let fake media tell that have changed position the wall. will get built help stop drugs, human trafficking etc.
2 : canada made business our dairy farmers wisconsin other border states difficult. will stand this. watch!
3 : proud for leadership these important issues. looking forward hearing speak the w20!
4 : today, signed holocaust remembrance proclamation: #icymi- statement last night at…
5 : our healthcare plan approved, see real healthcare premiums will start tumbling down. obamacare in death spiral!
6 : join in congratulating @astropeggy using hashtag #congratspeggy! earlier today:…
7 : ....the wall not built, be, drug situation never fixed way it should be! #buildthewall
8 : wall a important tool stopping drugs pouring country poisoning our youth (and many others)!
9 : two fake news polls released yesterday, abc &amp; nbc, containing very positive info, totally wrong gen