# Final Project - Trump/Hillary Tweets
### Question: 

Which combination of words results in the highest number of retweets? In other words, is there a set of N words that Trump or Hillary could tweet to garner the most retweets and likes?

We use Trump's and Hillary's tweets from before the 2016 presidential election and analyze the two datasets to find the words each candidate used most often.

In [156]:
#Basic packages to be used in the project. 
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as skl
import string

#Natural Language Toolkit
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords

#NLTK tokenizer for tweets.
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\brand\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\brand\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Loading and Cleaning the Data

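We load both datasets so we can retrieve the Trump and Hillary tweets. The second file holds tweets from both candidates, so we keep only the rows whose `handle` is `HillaryClinton`.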

In [157]:
#Dataframe of Trump tweets.
df_trump = pd.read_csv('Trump_Tweets.csv', encoding='latin-1')

#Dataframe of Hillary tweets, filtered from the combined Trump/Hillary file.
df_th = pd.read_csv('Trump_Hillary_Tweets.csv')
df_hillary = df_th[df_th['handle'] == 'HillaryClinton']

In [158]:
#Remove the unnecessary columns from the Trump tweets.
del df_trump['Unnamed: 10']
del df_trump['Unnamed: 11']

In [159]:
#TESTING
#df_trump

In [160]:
#TESTING
#tweet = tknzr.tokenize(df_trump['Tweet_Text'][0])
#tweet = words_stop(tweet)
#tweet = words_only(tweet)
#tweet = words_extra(tweet)
#print(tweet);

## Data Manipulation

In [161]:
#Helper method to filter out stopwords, punctuation, and the tokens 'rt' and 'via'.
def words_stop(tweet_list):
    #Use a set so each membership test is fast.
    stop = set(stopwords.words('english')) | set(string.punctuation) | {'rt', 'via'}
    return [word for word in tweet_list if word not in stop]

#Helper method to filter out hashtags, mentions, links, and tokens starting with the stray 'û' character.
def words_only(tweet_list):
    return [word for word in tweet_list if not word.startswith(('#','@','û','https'))]

#Helper method to filter out leftover encoding debris, ellipses, curly quotes, and dashes.
def words_extra(tweet_list):
    extra = ['\x89','...','…','“','”','’','—']
    return [word for word in tweet_list if word not in extra]
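
To make the effect of the three filters concrete, here is a minimal sketch that applies them in the same order to a hand-written token list. It is an illustration only: the small `STOP` set stands in for NLTK's English stopword list, and the sample tokens (including the link) are made up rather than taken from the datasets.

```python
import string

# Stand-in stop list (assumption): the notebook uses NLTK's English stopwords instead.
STOP = {'the', 'is', 'a', 'to', 'we', 'will'} | set(string.punctuation) | {'rt', 'via'}
EXTRA = {'\x89', '...', '…', '“', '”', '’', '—'}
PREFIXES = ('#', '@', 'û', 'https')

def clean(tokens):
    """Apply the three filters in the same order as the notebook."""
    tokens = [t for t in tokens if t not in STOP]               # like words_stop
    tokens = [t for t in tokens if not t.startswith(PREFIXES)]  # like words_only
    tokens = [t for t in tokens if t not in EXTRA]              # like words_extra
    return tokens

# Made-up tokens, roughly what a lowercasing tweet tokenizer might emit.
sample = ['rt', 'we', 'will', 'make', 'america', 'great', 'again', '!',
          '#maga', 'https://t.co/abc', '…']
cleaned = clean(sample)
print(cleaned)  # ['make', 'america', 'great', 'again']
```

Only the plain content words survive, which is what the frequency counts in the next sections are built on.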

### 1. Parsing the Trump Tweets.

We parse the Trump tweets to build a frequency distribution of the words they contain.

In [162]:
#METHOD: Parse Trump tweets and create a frequency distribution of words.

#Tokenizes the Trump tweets.
trump_list = []
for trump_tweets in df_trump['Tweet_Text']:
    trump_list.extend(tknzr.tokenize(trump_tweets))

#Filters the tweets.
trump_list = words_stop(trump_list)
trump_list = words_only(trump_list)
trump_list = words_extra(trump_list)

#Create the frequency distribution.
fdist_t = nltk.FreqDist(trump_list)

In [163]:
#print(fdist_t);
#trump_list

In [164]:
fdist_t.most_common(20);
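
As a rough picture of what the frequency distribution holds, the sketch below counts a toy token list with `collections.Counter`, which `nltk.FreqDist` subclasses. The tokens are invented for illustration and are not drawn from the Trump tweets.

```python
from collections import Counter

# Toy token list (made up for illustration), assumed to be already cleaned.
tokens = ['great', 'america', 'great', 'jobs', 'america', 'great']

# Count how often each word occurs, then take the two most frequent words.
counts = Counter(tokens)
top_two = counts.most_common(2)
print(top_two)  # [('great', 3), ('america', 2)]
```

`most_common(20)` in the cell above works the same way, returning (word, count) pairs sorted from most to least frequent.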

### 2. Parsing the Hillary Tweets.

We parse the Hillary tweets to build a frequency distribution of the words they contain.

In [165]:
#Tokenizes the Hillary tweets.
hillary_list = []
for hillary_tweets in df_hillary['text']:
    hillary_list.extend(tknzr.tokenize(hillary_tweets))

#Filters the tweets.
hillary_list = words_stop(hillary_list)
hillary_list = words_only(hillary_list)
hillary_list = words_extra(hillary_list)

#Create the frequency distribution.
fdist_h = nltk.FreqDist(hillary_list)

In [166]:
#print(fdist_h);
#hillary_list

In [167]:
fdist_h.most_common(20);

### 3. Dictionary of words with favorites and retweets.

We create a dictionary keyed by word, where each value is a tuple of the total favorites and total retweets of the tweets that contain that word, unweighted. Later, we weigh retweets more heavily by multiplying them by the ratio of favorites to retweets.
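
To show the shape of this dictionary, here is a small worked example on two made-up tweets. The token lists and counts are invented for illustration; only the `Tweets` namedtuple mirrors the one used in the cells that follow.

```python
from collections import namedtuple

Tweets = namedtuple('Tweets', 'favourites retweets')

# Made-up tweets: (cleaned tokens, favourites, retweets).
toy_tweets = [
    (['great', 'jobs'], 100, 40),
    (['great', 'rally'], 50, 10),
]

# Add each tweet's counts to the running totals of every word it contains.
totals = {}
for tokens, fav, rtwt in toy_tweets:
    for word in tokens:
        prev = totals.get(word, Tweets(0, 0))
        totals[word] = Tweets(prev.favourites + fav, prev.retweets + rtwt)

print(totals['great'])  # Tweets(favourites=150, retweets=50)
```

A word that appears in several tweets accumulates the counts of all of them, so frequent words tend to have larger totals.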

In [168]:
#Trump tweets dictionary: maps each word to the total favourites and retweets
#of the tweets that contain it.
from collections import namedtuple
Tweets = namedtuple('Tweets', 'favourites retweets')

trump_dict = {}

#Walk the text and the engagement counts together instead of tracking an index by hand.
trump_rows = zip(df_trump['Tweet_Text'],
                 df_trump['twt_favourites_IS_THIS_LIKE_QUESTION_MARK'],
                 df_trump['Retweets'])

for trump_tweet, num_fav, num_rtwt in trump_rows:

    tweet = tknzr.tokenize(trump_tweet)
    tweet = words_stop(tweet)
    tweet = words_only(tweet)
    tweet = words_extra(tweet)

    #Add this tweet's counts to the running totals of every word it contains.
    for word in tweet:
        total = trump_dict.get(word, Tweets(0, 0))
        trump_dict[word] = Tweets(total.favourites + num_fav, total.retweets + num_rtwt)

In [169]:
trump_dict;

In [170]:
#Hillary tweets dictionary: maps each word to the total favourites and retweets
#of the tweets that contain it.
from collections import namedtuple
Tweets = namedtuple('Tweets', 'favourites retweets')

hillary_dict = {}

#Walk the text and the engagement counts together instead of tracking an index by hand.
hillary_rows = zip(df_hillary['text'],
                   df_hillary['favorite_count'],
                   df_hillary['retweet_count'])

for hillary_tweet, num_fav, num_rtwt in hillary_rows:

    tweet = tknzr.tokenize(hillary_tweet)
    tweet = words_stop(tweet)
    tweet = words_only(tweet)
    tweet = words_extra(tweet)

    #Add this tweet's counts to the running totals of every word it contains.
    for word in tweet:
        total = hillary_dict.get(word, Tweets(0, 0))
        hillary_dict[word] = Tweets(total.favourites + num_fav, total.retweets + num_rtwt)

In [171]:
hillary_dict;

### Finding the ratio between favorites and retweets.

In [172]:
#METHOD: Find the ratio of total favorites to total retweets across the Trump tweets.

#To weigh retweets more heavily than favorites, multiply retweets by this ratio.

sum_fav = df_trump['twt_favourites_IS_THIS_LIKE_QUESTION_MARK'].sum()
sum_rtwt = df_trump['Retweets'].sum()

ratio = sum_fav/sum_rtwt

print('The ratio of favorites to retweet is: ', ratio)

The ratio of favorites to retweet is:  2.5952112676056336
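
One way the ratio could be applied is sketched below. The notebook does not yet state how the two counts are combined, so the scoring rule here (favourites plus ratio times retweets) is an assumption, and `top_words`, `toy_totals`, and `toy_ratio` are hypothetical names with made-up values.

```python
from collections import namedtuple

Tweets = namedtuple('Tweets', 'favourites retweets')

def top_words(word_totals, ratio, n=3):
    """Rank words by favourites + ratio * retweets (assumed scoring rule)."""
    scored = {w: t.favourites + ratio * t.retweets for w, t in word_totals.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Made-up per-word totals in the same shape as trump_dict and hillary_dict.
toy_totals = {
    'great': Tweets(150, 50),
    'jobs': Tweets(100, 40),
    'rally': Tweets(50, 10),
}
toy_ratio = 2.0  # placeholder value, not the ratio computed above

ranking = top_words(toy_totals, toy_ratio, n=2)
print(ranking)  # [('great', 250.0), ('jobs', 180.0)]
```

Because the totals are sums, words that appear in many tweets score higher. Dividing each total by the number of tweets containing the word would give a per-tweet average instead, which may be closer to the original question.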


# Data Visualizations

In [13]:
#Make a histogram. (1-2 graphs)

# Analysis