As of Saturday March 23:

To run this program you need the three datasets (trump_raw.csv, clinton_raw.csv, congress_raw.csv) and the stopwords file (stopwords.csv) in the same directory as this program.

The program does the following:
Puts the three tweet datasets into DataFrames.
Performs sentiment analysis on the tweets using the nltk.vader tool. This is a lexicon-based sentiment analyzer attuned to social media text, so we assume it is reasonably applicable to tweets. The analysis is added to the DataFrames in two forms: Vader_Score, the overall score from -1 to 1 showing the direction and magnitude of sentiment, and Trinary_Score, an integer of -1, 0, or 1 (negative, neutral, positive) showing only the direction of sentiment.

A shortcoming of this analysis as it stands is that new slang terms, invented words, and hashtags are unlikely to be in the lexicon, so the classifier will simply count them as neutral. It may therefore miss some of the sentiment in the data.
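
One way to see which hashtags fall through is to count those whose text is not in the lexicon. The sketch below is only an illustration under stated assumptions: `unscored_hashtags` and the toy lexicon (with made-up scores) are hypothetical, and the whitespace tokenization is cruder than what VADER does internally. With NLTK, the analyzer's `lexicon` attribute (a dict keyed by lowercase word) could be passed in place of the toy lexicon.

```python
import collections

def unscored_hashtags(texts, lexicon, num=10):
    """Count hashtags whose text (without '#') is not a key in `lexicon`."""
    counts = collections.Counter()
    for text in texts:
        for token in str(text).lower().split():
            if token.startswith("#") and token.lstrip("#") not in lexicon:
                counts[token] += 1
    return counts.most_common(num)

# Toy lexicon for illustration only; the scores are made up.
toy_lexicon = {"great": 3.1, "sad": -2.1}
sample = ["#MAGA is great", "Join us #MAGA #great", "#sad day"]
print(unscored_hashtags(sample, toy_lexicon))
```

Hashtags that show up often in this list are candidates for manual review, since they contribute nothing to the score.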

In [1]:
#import math
#import numpy as np
import pandas as pd
import collections
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [2]:
###Downloads the lexicon used for sentiment analysis.  Can comment out after it has been run once.
nltk.downloader.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/chris/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

Import the datasets and turn into DataFrames.

In [3]:
trump=pd.read_csv('trump_raw.csv')
clinton=pd.read_csv('clinton_raw.csv')
congress=pd.read_csv('congress_raw.csv')

Stopwords for basic text cleaning for wordclouds

In [4]:
stopwords = set()   #a set keeps the membership test in make_cloud fast
#file of stopwords from another project...may need to make this file bigger.
with open('stopwords.csv', encoding='utf8') as file2:   #the with block closes the file even if reading fails
    for stopword in file2.read().split():
        stopwords.add(stopword.replace('"',''))

In [5]:
#####Setting up wordcloud creation.   Takes a list of text entries that we create later.
###Second argument is the number of words to print.  Default is the top 20 words.
def make_cloud(words, num=20):
    wordcount={}
    for line in words:
        for word in line.split():
            if word in stopwords:
                pass
        
            elif word not in wordcount:
                wordcount[word] = 1
            else:
                wordcount[word] += 1
    
    d = collections.Counter(wordcount)
        
    for word, count in d.most_common(num):
        print(word, ": ", count)
    print("\n")
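
The counting step can also be written directly with collections.Counter, returning the pairs instead of printing them so they can be reused later. This is a sketch rather than a replacement for the cell above; `top_words` is a hypothetical name, and the stopwords are passed in as an argument instead of being read from the global.

```python
import collections

def top_words(lines, stopwords, num=20):
    """Return the `num` most frequent non-stopword tokens as (word, count) pairs."""
    counts = collections.Counter(
        word
        for line in lines
        for word in line.split()
        if word not in stopwords
    )
    return counts.most_common(num)

print(top_words(["thank you ohio", "thank you", "thank ohio"], {"you"}, num=2))
```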

Grab all of the text from the tweets in a given DataFrame, for use with wordcloud generation.

In [6]:
def grab_text(df):
    cw =[]
    df = df["text"]   #Grab just the text column
        
    for x in df.index:    #Iterate over the valid indices. Need this since congress_lib/congress_con are partials.
        temp=str(df[x])
        temp=temp.strip()             ##Strips leading/trailing whitespace. str.strip returns a new string, so assign it.
        temp=temp.replace('.','')  #stripping out common punctuation so words ending with commas and periods don't count as two different words.
        temp=temp.replace(',','')
        temp=temp.replace('“','')
        temp=temp.replace('”','')
        temp=temp.replace('&amp','')
        temp=temp.replace(';','')
        temp=temp.replace('-',' ')
        temp=temp.lower()
        cw.append(temp)
    return cw
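
The same substitutions can be expressed as a single translation table. The sketch below is an assumption-labelled alternative for one tweet at a time: `clean_tweet` and `_TABLE` are hypothetical names, and the set of characters is just the one used in grab_text. Other punctuation, such as "!", is left in place, which is why tokens like "you!" can still appear in the counts.

```python
# Characters to delete outright, plus '-' mapped to a space.
_TABLE = str.maketrans({".": None, ",": None, "“": None, "”": None, ";": None, "-": " "})

def clean_tweet(text):
    """Lowercase a tweet and remove the punctuation listed in _TABLE."""
    text = str(text).replace("&amp", "")   # drop HTML-escaped ampersands first
    return text.translate(_TABLE).lower().strip()

print(clean_tweet("  Thank you, Ohio. “Big” win&amp; well-earned!  "))
```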

In [7]:
###Example of word cloud.  Not sure if we'll use this.  We may use its helper functions if we build our own 
###classifier...


make_cloud(grab_text(trump))

thank :  749
great :  663
@realdonaldtrump :  534
hillary :  467
#trump2016 :  453
trump :  441
#makeamericagreatagain :  295
new :  288
people :  274
america :  274
clinton :  258
crooked :  224
cruz :  200
big :  185
you! :  185
join :  165
poll :  164
one :  160
@cnn :  158
going :  149




Sentiment Analysis Stuff follows.

In [8]:
###The analyzer.  polarity_scores returns neg/neu/pos proportions plus a compound score in [-1,1], with -1 being very negative and 1 being very positive.
###Use the compound output as the overall sentiment. (It is a normalized combination of the word-level scores.)
vader = SentimentIntensityAnalyzer()

In [9]:
###Gives a Trinary Positive/Negative/Neutral answer.  Will be useful for strict counts of positive/negative/neutral.

def vader_polarity(text):
    """ Transform the VADER output to a trinary result: 1 (positive), -1 (negative), 0 (neutral) """
    score = vader.polarity_scores(text)
    if score['pos'] > score['neg']:
        x=1
    elif score['pos'] < score['neg']:
        x=-1
    else:
        x=0
    return x
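
Note that vader_polarity compares the pos and neg proportions rather than using the compound score. An alternative sometimes used with VADER is to threshold the compound score, with +/-0.05 as the conventional cutoffs suggested by the VADER authors. The sketch below shows that alternative; it is not what this notebook does, and `trinary_from_compound` and its `threshold` parameter are hypothetical names.

```python
def trinary_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to -1, 0, or 1 using a symmetric cutoff."""
    if compound >= threshold:
        return 1
    if compound <= -threshold:
        return -1
    return 0

# Compound scores taken from the example tweets in this notebook.
for c in (-0.3614, 0.9259, 0.0):
    print(c, trinary_from_compound(c))
```

The two approaches can disagree on weakly scored tweets, so whichever is used should be applied consistently across all three datasets.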

In [10]:
##Example of the sentiment analyzer and the trinary classifications
##Shows a tweet, the nltk.vader analysis, and the trinary classification.

x1=trump['text'][0]
print(x1)
print(vader.polarity_scores(x1))
print(str(vader_polarity(x1))+"\n")
x2=trump['text'][5]
print(x2)
print(vader.polarity_scores(x2))
print(str(vader_polarity(x2))+"\n")
x3=trump['text'][4]
print(x3)
print(vader.polarity_scores(x3))
print(vader_polarity(x3))

Would be nice if @jmartNYT learned how to read the polls before writing his next story. Probably done on purpose, but not good reporting!
{'neg': 0.125, 'neu': 0.806, 'pos': 0.07, 'compound': -0.3614}
-1

.@RobertGBeckel Please thank your brother for his nice words on television. Seems like a great guy and character! @CNN
{'neg': 0.0, 'neu': 0.473, 'pos': 0.527, 'compound': 0.9259}
1

"@essygalloway: @realDonaldTrump @nbcsnl @Sia  I can't wait to watch snl tomorrow." A really big show!
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
0


Note that the output from vader gives 4 numbers.  We should use the compound number.  It is calculated by squashing the summed word scores into the range [-1, 1] behind the scenes. Its exact function is not important to this analysis and the number should work fine.
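
For reference, the normalization VADER applies to the summed word scores is, as far as we know, x / sqrt(x^2 + alpha) with alpha = 15. The sketch below reproduces only that formula in isolation. `squash` is a hypothetical name, and the inputs are arbitrary sums rather than real tweet scores.

```python
import math

def squash(total, alpha=15):
    """Map an unbounded sum of word scores into the open interval (-1, 1)."""
    return total / math.sqrt(total * total + alpha)

for total in (-4.0, 0.0, 1.0, 4.0):
    print(total, round(squash(total), 4))
```

The function is odd and monotonic, so the sign and ordering of the raw sums are preserved. Only the scale is compressed.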


Function for analyzing sentiment of all tweets and appending to datasets.

In [11]:
def grab_data_sentiment(df):
    Vader_Score = []
    Trinary_Score = []
    df = df["text"]   #Grab just the text column
        
    for x in df.index:    #Iterate over the valid indices. Need this since congress_lib/congress_con are partials.
        temp=str(df[x])
        vad_score=vader.polarity_scores(temp)['compound']
        trin_score=vader_polarity(temp)
        
        Vader_Score.append(vad_score)
        Trinary_Score.append(trin_score)
    return Vader_Score, Trinary_Score

In [12]:
y,z=grab_data_sentiment(trump)
trump.insert(0,'Vader_Score',y)
trump.insert(0,'Trinary_Score',z)

y,z=grab_data_sentiment(clinton)
clinton.insert(0,'Vader_Score',y)
clinton.insert(0,'Trinary_Score',z)

##This one takes a while, large dataset...
y,z=grab_data_sentiment(congress)
congress.insert(0,'Vader_Score',y)
congress.insert(0,'Trinary_Score',z)

Showing the dataframe with the sentiment scores inserted.

In [13]:
trump.head()

Unnamed: 0,Trinary_Score,Vader_Score,created_at,favorite_count,retweet_count,text,time,AffectCount,MoralCount,shared
0,-1,-0.3614,11/7/2015 0:07,1824,796,Would be nice if @jmartNYT learned how to read...,11/7/2015,1,0,1
1,0,0.0,11/7/2015 0:08,2285,4029,"""@nbcsnl: One more day! Donald Trump hosts #SN...",11/7/2015,0,0,0
2,-1,-0.6616,11/7/2015 3:23,2333,986,"""@Bubbachitchat1: THIS IS WHY THE POLLS ARE WR...",11/7/2015,0,0,1
3,-1,-0.7712,11/7/2015 5:20,3012,1215,One of the dumbest political pundits on televi...,11/7/2015,1,1,1
4,0,0.0,11/7/2015 5:23,1892,703,"""@essygalloway: @realDonaldTrump @nbcsnl @Sia ...",11/7/2015,0,0,0
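
With Trinary_Score in place, strict counts of negative, neutral, and positive tweets follow directly. This is a minimal sketch using only the five rows shown above; `tally` is a hypothetical helper. On the full DataFrame, `trump['Trinary_Score'].value_counts()` would give the same kind of tally.

```python
import collections

def tally(scores):
    """Return counts of negative, neutral, and positive entries."""
    counts = collections.Counter(scores)
    return {"negative": counts[-1], "neutral": counts[0], "positive": counts[1]}

# Trinary_Score values of the first five Trump tweets.
first_five = [-1, 0, -1, -1, 0]
print(tally(first_five))
```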


Create separate DataFrames for the liberal (dw_score < 0) and conservative (dw_score > 0) members of Congress for basic analysis.

In [14]:
congress['dw_score']=pd.to_numeric(congress['dw_score'])  ##negatives were parsing as strings. to_numeric returns a new Series, so assign it back.
lib_filter=congress['dw_score']<0
con_filter=congress['dw_score']>0
congress_lib=congress[lib_filter].copy()  ##Creating explicit copies to get rid of indexing issues.
congress_con=congress[con_filter].copy()
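
pd.to_numeric returns a new Series rather than converting the column in place, so the result has to be assigned back for the comparison filters to see numbers. The sketch below shows the pattern on a toy frame with made-up rows. The `errors="coerce"` argument is an addition not used in this notebook: it turns unparseable entries into NaN, which then fall out of both filters.

```python
import pandas as pd

# Toy data for illustration only; the values are not from congress_raw.csv.
toy = pd.DataFrame({
    "elite": ["A", "B", "C", "D"],
    "dw_score": ["-0.462", "0.31", "0.0", "n/a"],
})

# Assign the converted Series back to the column.
toy["dw_score"] = pd.to_numeric(toy["dw_score"], errors="coerce")

toy_lib = toy[toy["dw_score"] < 0].copy()   # scores below zero
toy_con = toy[toy["dw_score"] > 0].copy()   # scores above zero

print(toy_lib)
print(toy_con)
```

Rows with a score of exactly 0 or a missing score land in neither subset, which is worth keeping in mind when comparing subset sizes to the full dataset.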

In [15]:
congress_lib.head()

Unnamed: 0,Trinary_Score,Vader_Score,text,elite,retweet_count,media,gender,dw_extr,dwextr_rs,dw_score,race,age,followers
41,1,0.4019,Today is the 1 yr anniversary of the #STEMEduc...,Adams,3,-0.5,-0.5,0.462,0.462,-0.462,-0.5,71,8386.0
42,-1,-0.6597,Find a Breast Cancer screening provider near y...,Adams,0,0.5,-0.5,0.462,0.462,-0.462,-0.5,71,8386.0
43,0,0.0,October is #BreastCancerAwarenessMonth. Find e...,Adams,0,-0.5,-0.5,0.462,0.462,-0.462,-0.5,71,8386.0
44,1,0.3818,My office is now accepting applications for no...,Adams,0,-0.5,-0.5,0.462,0.462,-0.462,-0.5,71,8386.0
45,1,0.3262,Please stay safe this week NC. For more inform...,Adams,2,-0.5,-0.5,0.462,0.462,-0.462,-0.5,71,8386.0
