# Project Code
#### Joey Livorno | gil15@pitt.edu | 2.24.2020

The purpose of this project is to process the contents of Donald Trump's Twitter feed and make linguistic and statistical discoveries based on the data, specifically using sentiment analysis.

These are the required libraries for this assignment:

In [1]:
#import libraries
import numpy as np
import pandas as pd
import re
from textblob import TextBlob as tb  # TextBlob objects allow for quick sentiment analysis
from emo_unicode import EMO_UNICODE  # dictionaries mapping emoticons to their text representations,
from emo_unicode import UNICODE_EMO  # courtesy of NeelShah18 on GitHub

## Basic Data Processing

This section of the code focuses on reading in the Twitter data and manipulating it into a much more workable object than the raw .csv file.

The first step is to read the .csv file into a pandas dataframe:

In [2]:
#read tweet information into dataframe
tweets = pd.read_csv('../data/tweets.csv')
tweets.head() #preview the df

Unnamed: 0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str
0,Twitter for iPhone,RT @SenBillCassidy: January #JobsReport:✅22500...,02-21-2020 16:06:40,896.0,0.0,True,1.230887e+18
1,Twitter for iPhone,RT @SteveDaines: Obama sure didn’t build this ...,02-21-2020 16:06:20,2086.0,0.0,True,1.230886e+18
2,Twitter for iPhone,RT @JohnBoozman: Our servicemembers stand read...,02-21-2020 16:06:10,521.0,0.0,True,1.230886e+18
3,Twitter for iPhone,RT @RoyBlunt: ▶️ Unemployment is at a nearly 5...,02-21-2020 16:05:52,623.0,0.0,True,1.230886e+18
4,Twitter for iPhone,RT @JimInhofe: Happy 79th birthday to @USCGRes...,02-21-2020 16:05:25,533.0,0.0,True,1.230886e+18


The next step is to make the text fully parseable. Right now it is not, because the tweets contain emojis. We can use the emoji dictionaries and this method to convert all of the emojis to text:

(right now this does not function correctly, so for now we will keep the emojis in the text)

In [3]:
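#replace emojis with their text representation
##this method is courtesy of kaggle.com user SRK

#invert EMO_UNICODE so that each emoji maps to its text name (this overrides the imported UNICODE_EMO)
UNICODE_EMO = {v: k for k, v in EMO_UNICODE.items()}

def convert_emojis(text):
    for emot in UNICODE_EMO:
        #re.escape treats the emoji as literal text; some emoji sequences contain regex metacharacters such as '*'
        text = re.sub(re.escape(emot), "_".join(UNICODE_EMO[emot].replace(",","").replace(":","").split()), text)
    return text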
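
One thing that may trip up a regex-based replacement is an emoji sequence that contains a regex metacharacter, such as the keycap asterisk. The sketch below illustrates the idea with a two-entry stand-in dictionary. `EMOJI_NAMES` and its names are hypothetical and only mimic the shape of the real `emo_unicode` dictionaries, and escaping is not confirmed to be the cause of the problem noted above.

```python
import re

# Hypothetical stand-in for the emoji -> name mapping (NOT the real emo_unicode data).
EMOJI_NAMES = {
    "\u2705": ":check_mark_button:",       # check mark emoji
    "*\ufe0f\u20e3": ":keycap_asterisk:",  # keycap sequence starting with a literal '*'
}

def convert_emojis(text, mapping):
    """Replace every emoji found in `mapping` with its underscore-joined name."""
    for emoji, name in mapping.items():
        # Strip commas and colons, then join the remaining words with underscores.
        label = "_".join(name.replace(",", "").replace(":", "").split())
        # re.escape makes the emoji match literally, even if it contains '*' or '+'.
        text = re.sub(re.escape(emoji), label, text)
    return text

converted = convert_emojis("Jobs report \u2705 press *\ufe0f\u20e3", EMOJI_NAMES)
print(converted)
```

Because no pattern features are needed here, `text.replace(emoji, label)` would be an equally reasonable choice and avoids regular expressions altogether.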

Next we will use the TextBlob library to assign each tweet a polarity and a subjectivity value. Polarity ranges from -1 to 1 and describes how negative or positive the tweet is, with -1 being the most negative and 1 the most positive. Subjectivity ranges from 0 to 1 and describes whether the text is presented as fact or as opinion, with 0 being fully objective and 1 being fully subjective.

In [5]:
#create series objects of the polarity and subjectivity of the tweets
polarity = tweets.text.map(lambda x: tb(x).sentiment.polarity)
subjectivity = tweets.text.map(lambda x: tb(x).sentiment.subjectivity)

Then we will create new columns for each of the new series:

In [6]:
#add new columns to tweets df corresponding to the new info
tweets['polarity'] = polarity
tweets['subjectivity'] = subjectivity
tweets.head() #preview the df

Unnamed: 0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str,polarity,subjectivity
0,Twitter for iPhone,RT @SenBillCassidy: January #JobsReport:✅22500...,02-21-2020 16:06:40,896.0,0.0,True,1.230887e+18,0.068182,0.377273
1,Twitter for iPhone,RT @SteveDaines: Obama sure didn’t build this ...,02-21-2020 16:06:20,2086.0,0.0,True,1.230886e+18,0.625,0.888889
2,Twitter for iPhone,RT @JohnBoozman: Our servicemembers stand read...,02-21-2020 16:06:10,521.0,0.0,True,1.230886e+18,0.2,0.5
3,Twitter for iPhone,RT @RoyBlunt: ▶️ Unemployment is at a nearly 5...,02-21-2020 16:05:52,623.0,0.0,True,1.230886e+18,0.1,0.4
4,Twitter for iPhone,RT @JimInhofe: Happy 79th birthday to @USCGRes...,02-21-2020 16:05:25,533.0,0.0,True,1.230886e+18,0.75,0.75


Since we are going to be looking at change over time, we need a way to group the tweets chronologically. Luckily, the timestamp is included in the original csv. Grouping by year seems the most logical choice, so let's make a new column that isolates that value:

In [7]:
#make new column that shows year
tweets['created_at'] = pd.to_datetime(tweets['created_at']) #convert created_at column to timestamp data type
tweets['year'] = tweets['created_at'].dt.year.astype('Int64') #store the dates in a new column, isolating the year
tweets.head() #preview df

Unnamed: 0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str,polarity,subjectivity,year
0,Twitter for iPhone,RT @SenBillCassidy: January #JobsReport:✅22500...,2020-02-21 16:06:40,896.0,0.0,True,1.230887e+18,0.068182,0.377273,2020
1,Twitter for iPhone,RT @SteveDaines: Obama sure didn’t build this ...,2020-02-21 16:06:20,2086.0,0.0,True,1.230886e+18,0.625,0.888889,2020
2,Twitter for iPhone,RT @JohnBoozman: Our servicemembers stand read...,2020-02-21 16:06:10,521.0,0.0,True,1.230886e+18,0.2,0.5,2020
3,Twitter for iPhone,RT @RoyBlunt: ▶️ Unemployment is at a nearly 5...,2020-02-21 16:05:52,623.0,0.0,True,1.230886e+18,0.1,0.4,2020
4,Twitter for iPhone,RT @JimInhofe: Happy 79th birthday to @USCGRes...,2020-02-21 16:05:25,533.0,0.0,True,1.230886e+18,0.75,0.75,2020
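
The year extraction can be sketched on a tiny made-up frame as follows. The explicit `format` string is inferred from the timestamps shown in the preview, and `errors='coerce'` is an assumption about how unparseable rows should be handled; neither is taken from the original cell.

```python
import pandas as pd

# Hypothetical sample rows; the third timestamp is deliberately malformed.
sample = pd.DataFrame({
    "created_at": ["02-21-2020 16:06:40", "12-31-2019 23:59:59", "not a date"],
})

# Parse month-day-year timestamps; rows that do not match become NaT instead of raising.
sample["created_at"] = pd.to_datetime(
    sample["created_at"], format="%m-%d-%Y %H:%M:%S", errors="coerce"
)

# The nullable 'Int64' dtype keeps years as integers while still allowing missing values.
sample["year"] = sample["created_at"].dt.year.astype("Int64")
print(sample)
```

Using the nullable `Int64` type matters here because a plain integer column cannot hold a missing year.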


Now the data is much more workable: the measurements we want to compare are neatly organized within their corresponding rows, and we have a way of grouping the data chronologically. The last step is to create subgroups for different political issues that we can use for comparison:

(these are subject to change)

In [19]:
#create a boolean mask for each topic (True where the tweet text mentions it)
russia = tweets['text'].str.contains('russia|Russia|moscow|Moscow|putin|Putin')
iran = tweets['text'].str.contains('iran|Iran|tehran|Tehran|Nuclear Deal|Rouhani')
nkorea = tweets['text'].str.contains('north korea|North Korea|DPRK|Pyongyang|pyongyang|Jong-Un|Jong-Il')
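
Each of these variables is a boolean series rather than a dataframe, so it has to be used as an index to obtain the actual subset. A minimal sketch of that step on made-up data follows; the sample rows, the `na=False` argument, and the per-year mean are assumptions for illustration, not part of the analysis above.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tweets dataframe (all values are made up).
sample = pd.DataFrame({
    "text": ["Talks with Russia went well", "Great jobs numbers!", "Putin called today", None],
    "polarity": [0.5, 0.8, 0.0, 0.0],
    "year": [2018, 2018, 2019, 2019],
})

# na=False counts rows with missing text as non-matches, giving a clean True/False mask.
russia_mask = sample["text"].str.contains(
    "russia|Russia|moscow|Moscow|putin|Putin", na=False
)

# Index with the mask to get the subset, then average polarity within each year.
russia_by_year = sample[russia_mask].groupby("year")["polarity"].mean()
print(russia_by_year)
```

The same pattern would apply to the `iran` and `nkorea` masks.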