## HW2 - Stream Analysis
This notebook looks at the data collected by the hw2_stream.py script. The script connects the Twitter Streaming API to a MongoDB database and extracts key fields from each tweet for analysis. The stream collected roughly 28k tweets pertaining to leaders from around the world, namely: 'Trump', 'Xi Jinping', 'Maduro', 'Kim Jong Un', 'Elizabeth Warren', and 'Theresa May'. Let's look at the data.

In [1]:
# connecting to local mongoclient
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017')
db = client.twitterdb_final
tweets = db.twitter_stream
print('Total Record for the collection: ' + str(tweets.estimated_document_count()))

Total Record for the collection: 28243
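
A side note on the connection string: in a MongoDB URI the port follows the host after a colon, and anything after the slash is read as a default database name. The sketch below uses only the standard library's URL parser (not pymongo's own parser, so treat it as an illustration) to show how the two spellings split. The helper name `host_port_path` is introduced here and is not part of the notebook.

```python
from urllib.parse import urlsplit

def host_port_path(uri):
    """Split a URI into (hostname, port, path) with the stdlib URL parser."""
    parts = urlsplit(uri)
    return parts.hostname, parts.port, parts.path

with_colon = host_port_path('mongodb://localhost:27017')
with_slash = host_port_path('mongodb://localhost/27017')

print(with_colon)  # port is parsed as a number
print(with_slash)  # no port; '27017' ends up in the path
```

With the slash form the driver should fall back to its default port, which is also 27017, so both spellings would reach the same local server.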


Now the MongoDB data can be loaded into pandas quickly, thanks to the Python god and GOAT, Wes McKinney, and the pandas team.

In [2]:
# load data into dataframe
import pandas as pd
db = client.twitterdb_final
collection = db.twitter_stream
data = pd.DataFrame(list(collection.find()))

In [3]:
data.head(5)

Unnamed: 0,_id,created,followers,hashtags,id,language,text,username
0,5cb4a408f970c156282d477c,2019-04-15 15:32:26,241,[],1117812934846091264,es,RT @EmmaZeinep: Mippcivzla: RT SMoncada_VEN: ¿...,AliCrack25
1,5cb4a408f970c156282d477e,2019-04-15 15:32:26,56,[],1117812934644727809,en,My confidence in the government under Trump an...,louisiana2times
2,5cb4a408f970c156282d4780,2019-04-15 15:32:26,3334,[],1117812935093624834,en,RT @nathanTbernard: Trump and @benshapiro cont...,RicanInBoston2
3,5cb4a408f970c156282d4782,2019-04-15 15:32:26,12006,[],1117812935252955136,en,RT @RISINGforum: Still warm off the press: wha...,CSS_Zurich
4,5cb4a408f970c156282d4784,2019-04-15 15:32:26,244,[],1117812935441756164,en,RT @Mimirocah1: Reminder from Stone indictment...,ishkabibble54


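Let's quickly add a unique ID field so that we can join this dataframe with the sentiment analysis dataframe below. Both dataframes get the same ID range, starting at 880, because the scores are computed in the same row order as the tweets.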

In [4]:
data.insert(0, 'New_ID', range(880, 880 + len(data)))

VADER Sentiment (the vaderSentiment package) is a fast, rule-based sentiment scorer. We apply it to each tweet in the text column of the pandas dataframe.

In [5]:
# import the VADER sentiment analyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Use the sentiment analyzer to iterate over the text of the tweets, collecting polarity scores for later analysis.

In [6]:
analyser = SentimentIntensityAnalyzer()

results = []

for line in data['text']:
    score = analyser.polarity_scores(line)
    score['description'] = line
    results.append(score)

Load the scores into a pandas dataframe.

In [7]:
df_scores = pd.DataFrame.from_records(results)

In [8]:
df_scores = df_scores[['description','compound','neg','neu','pos']]

Create a unique ID to join the scores back with the original dataframe.

In [9]:
df_scores.insert(0, 'New_ID', range(880, 880 + len(df_scores)))

We can now add a label based on the compound sentiment score: tweets scoring above 0.2 are labeled positive (1), tweets scoring below -0.2 are labeled negative (-1), and the rest are labeled neutral (0).

In [10]:
df_scores['label'] = 0
df_scores.loc[df_scores['compound'] > 0.2, 'label'] = 1
df_scores.loc[df_scores['compound'] < -0.2, 'label'] = -1
df_scores.head()

Unnamed: 0,New_ID,description,compound,neg,neu,pos,label
0,880,RT @EmmaZeinep: Mippcivzla: RT SMoncada_VEN: ¿...,0.0,0.0,1.0,0.0,0
1,881,My confidence in the government under Trump an...,-0.3473,0.205,0.678,0.118,-1
2,882,RT @nathanTbernard: Trump and @benshapiro cont...,0.1027,0.191,0.604,0.205,0
3,883,RT @RISINGforum: Still warm off the press: wha...,0.2263,0.0,0.913,0.087,1
4,884,RT @Mimirocah1: Reminder from Stone indictment...,-0.4939,0.144,0.856,0.0,-1
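
The labeling rule can also be written as a plain function, which makes the boundary behavior explicit: a score of exactly 0.2 or -0.2 stays neutral. This is only a sketch; `label_from_compound` and its `threshold` parameter are names introduced here for illustration and are not part of the notebook.

```python
def label_from_compound(compound, threshold=0.2):
    """Map a compound score to 1 (positive), -1 (negative) or 0 (neutral)."""
    if compound > threshold:
        return 1
    if compound < -threshold:
        return -1
    return 0

# Compound scores of the first five tweets shown in the table above.
sample_scores = [0.0, -0.3473, 0.1027, 0.2263, -0.4939]
sample_labels = [label_from_compound(s) for s in sample_scores]
print(sample_labels)  # [0, -1, 0, 1, -1], matching the label column
```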


In [11]:
df2 = pd.merge(data, df_scores, on='New_ID')

In [12]:
df2.head()

Unnamed: 0,New_ID,_id,created,followers,hashtags,id,language,text,username,description,compound,neg,neu,pos,label
0,880,5cb4a408f970c156282d477c,2019-04-15 15:32:26,241,[],1117812934846091264,es,RT @EmmaZeinep: Mippcivzla: RT SMoncada_VEN: ¿...,AliCrack25,RT @EmmaZeinep: Mippcivzla: RT SMoncada_VEN: ¿...,0.0,0.0,1.0,0.0,0
1,881,5cb4a408f970c156282d477e,2019-04-15 15:32:26,56,[],1117812934644727809,en,My confidence in the government under Trump an...,louisiana2times,My confidence in the government under Trump an...,-0.3473,0.205,0.678,0.118,-1
2,882,5cb4a408f970c156282d4780,2019-04-15 15:32:26,3334,[],1117812935093624834,en,RT @nathanTbernard: Trump and @benshapiro cont...,RicanInBoston2,RT @nathanTbernard: Trump and @benshapiro cont...,0.1027,0.191,0.604,0.205,0
3,883,5cb4a408f970c156282d4782,2019-04-15 15:32:26,12006,[],1117812935252955136,en,RT @RISINGforum: Still warm off the press: wha...,CSS_Zurich,RT @RISINGforum: Still warm off the press: wha...,0.2263,0.0,0.913,0.087,1
4,884,5cb4a408f970c156282d4784,2019-04-15 15:32:26,244,[],1117812935441756164,en,RT @Mimirocah1: Reminder from Stone indictment...,ishkabibble54,RT @Mimirocah1: Reminder from Stone indictment...,-0.4939,0.144,0.856,0.0,-1


In [13]:
df2_clean = df2.drop(['description','New_ID','id'],axis=1)

In [14]:
df2_clean.head()

Unnamed: 0,_id,created,followers,hashtags,language,text,username,compound,neg,neu,pos,label
0,5cb4a408f970c156282d477c,2019-04-15 15:32:26,241,[],es,RT @EmmaZeinep: Mippcivzla: RT SMoncada_VEN: ¿...,AliCrack25,0.0,0.0,1.0,0.0,0
1,5cb4a408f970c156282d477e,2019-04-15 15:32:26,56,[],en,My confidence in the government under Trump an...,louisiana2times,-0.3473,0.205,0.678,0.118,-1
2,5cb4a408f970c156282d4780,2019-04-15 15:32:26,3334,[],en,RT @nathanTbernard: Trump and @benshapiro cont...,RicanInBoston2,0.1027,0.191,0.604,0.205,0
3,5cb4a408f970c156282d4782,2019-04-15 15:32:26,12006,[],en,RT @RISINGforum: Still warm off the press: wha...,CSS_Zurich,0.2263,0.0,0.913,0.087,1
4,5cb4a408f970c156282d4784,2019-04-15 15:32:26,244,[],en,RT @Mimirocah1: Reminder from Stone indictment...,ishkabibble54,-0.4939,0.144,0.856,0.0,-1


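Using a nice little chunk of code that makes use of Python's dictionary data structure and regular expressions, we can label the tweets based on their contents, searching for each leader's name and adding a dummy variable per leader. Note that the match is case-insensitive and not anchored to word boundaries, so a short pattern such as 'May' also matches the ordinary word 'may'.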

In [15]:
topics = {"Trump": ["Trump", "trump"],
          "Xi_Jinping": ["Xi Jinping", "xi jinping"],
          "Maduro": ["Maduro", "maduro"],
          "Kim_Jong_Un": ["Kim Jong Un", "kim jong un"],
          "Elizabeth_Warren": ["Elizabeth Warren", "Warren", "elizabeth warren"],
          "Theresa_May": ["Theresa May", "May"]}

for k,v in topics.items():
    df2_clean[k] = df2_clean.text.str.contains('|'.join(v), 
                                               case=False, regex=True).astype(int)

In [16]:
df2_clean.head()

Unnamed: 0,_id,created,followers,hashtags,language,text,username,compound,neg,neu,pos,label,Trump,Xi_Jinping,Maduro,Kim_Jong_Un,Elizabeth_Warren,Theresa_May
0,5cb4a408f970c156282d477c,2019-04-15 15:32:26,241,[],es,RT @EmmaZeinep: Mippcivzla: RT SMoncada_VEN: ¿...,AliCrack25,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0
1,5cb4a408f970c156282d477e,2019-04-15 15:32:26,56,[],en,My confidence in the government under Trump an...,louisiana2times,-0.3473,0.205,0.678,0.118,-1,1,0,0,0,0,0
2,5cb4a408f970c156282d4780,2019-04-15 15:32:26,3334,[],en,RT @nathanTbernard: Trump and @benshapiro cont...,RicanInBoston2,0.1027,0.191,0.604,0.205,0,1,0,0,0,0,1
3,5cb4a408f970c156282d4782,2019-04-15 15:32:26,12006,[],en,RT @RISINGforum: Still warm off the press: wha...,CSS_Zurich,0.2263,0.0,0.913,0.087,1,1,0,0,0,0,0
4,5cb4a408f970c156282d4784,2019-04-15 15:32:26,244,[],en,RT @Mimirocah1: Reminder from Stone indictment...,ishkabibble54,-0.4939,0.144,0.856,0.0,-1,1,0,0,0,0,0
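
Because `str.contains` with `regex=True` runs a regular-expression search on each string, the same patterns can be tried out with the standard `re` module. The sketch below compares the alternation used for Theresa May with a stricter, hypothetical variant that requires the full name at word boundaries. The example sentences are made up for illustration and are not tweets from the collection. Loose matching of this kind may be why row 2 above is flagged for Theresa May.

```python
import re

# Same alternation the notebook builds for Theresa May, matched case-insensitively.
loose = re.compile('|'.join(['Theresa May', 'May']), re.IGNORECASE)

# Hypothetical stricter variant: full name only, anchored at word boundaries.
strict = re.compile(r'\bTheresa May\b', re.IGNORECASE)

# Made-up example sentences, not tweets from the collection.
examples = [
    'Theresa May meets EU leaders',
    'Trump may visit next week',
    'The mayor spoke about Brexit',
]

loose_hits = [bool(loose.search(text)) for text in examples]
strict_hits = [bool(strict.search(text)) for text in examples]

for text, lo_hit, st_hit in zip(examples, loose_hits, strict_hits):
    print(lo_hit, st_hit, text)
```

The loose pattern flags all three sentences, while the strict one flags only the first. The trade-off is that the strict pattern would miss tweets that refer to her by surname alone.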


In [17]:
sum(df2_clean.Trump)

17910

In [18]:
sum(df2_clean.Elizabeth_Warren)

331

In [19]:
sum(df2_clean.Maduro)

1356

In [20]:
sum(df2_clean.Theresa_May)

500

In [23]:
sum(df2_clean.Kim_Jong_Un)

28

In [24]:
sum(df2_clean.Xi_Jinping)

6

As we can see, Trump dominated the stream, while the Chinese leader Xi Jinping was mentioned in only a handful of tweets. We can now group the tweets by leader and view the sentiment scores.
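
To put the counts on a common scale, they can be divided by the collection size. This is a small sketch that assumes the dataframe holds the 28,243 documents reported at the top of the notebook; the counts are copied from the cell outputs above. A tweet can match more than one leader (or none), so the shares need not add up to 100%.

```python
total_tweets = 28243  # collection size reported at the top of the notebook

# Per-leader match counts printed in the cells above.
counts = {
    'Trump': 17910,
    'Maduro': 1356,
    'Theresa_May': 500,
    'Elizabeth_Warren': 331,
    'Kim_Jong_Un': 28,
    'Xi_Jinping': 6,
}

shares = {name: n / total_tweets for name, n in counts.items()}
for name, share in shares.items():
    print(f'{name:<17}{share:7.2%}')
```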

In [30]:
trump=df2_clean.loc[df2_clean.Trump == 1]

In [31]:
warren=df2_clean.loc[df2_clean.Elizabeth_Warren == 1]

In [32]:
maduro=df2_clean.loc[df2_clean.Maduro == 1]

In [33]:
may=df2_clean.loc[df2_clean.Theresa_May == 1]

In [54]:
trump[['compound','neg','neu','pos']].describe()

Unnamed: 0,compound,neg,neu,pos
count,17910.0,17910.0,17910.0,17910.0
mean,-0.066966,0.087417,0.849002,0.063596
std,0.423968,0.121202,0.148157,0.101547
min,-0.9689,0.0,0.196,0.0
25%,-0.3612,0.0,0.748,0.0
50%,0.0,0.0,0.87,0.0
75%,0.0772,0.134,1.0,0.12
max,0.9773,0.737,1.0,0.804


In [55]:
warren[['compound','neg','neu','pos']].describe()

Unnamed: 0,compound,neg,neu,pos
count,331.0,331.0,331.0,331.0
mean,0.142885,0.035172,0.878634,0.08616
std,0.332276,0.070001,0.111375,0.097116
min,-0.8625,0.0,0.448,0.0
25%,0.0,0.0,0.804,0.0
50%,0.0,0.0,0.895,0.098
75%,0.4019,0.07,1.0,0.106
max,0.8555,0.508,1.0,0.552


In [56]:
maduro[['compound','neg','neu','pos']].describe()

Unnamed: 0,compound,neg,neu,pos
count,1356.0,1356.0,1356.0,1356.0
mean,-0.047766,0.030037,0.956493,0.01347
std,0.205158,0.062217,0.082494,0.044688
min,-0.886,0.0,0.4,0.0
25%,0.0,0.0,0.903,0.0
50%,0.0,0.0,1.0,0.0
75%,0.0,0.0,1.0,0.0
max,0.8156,0.391,1.0,0.458


In [57]:
may[['compound','neg','neu','pos']].describe()

Unnamed: 0,compound,neg,neu,pos
count,500.0,500.0,500.0,500.0
mean,-0.016461,0.074742,0.855818,0.069456
std,0.409589,0.099836,0.138639,0.099405
min,-0.8886,0.0,0.445,0.0
25%,-0.2732,0.0,0.758,0.0
50%,0.0,0.0,0.8805,0.0
75%,0.3164,0.141,1.0,0.11925
max,0.8856,0.421,1.0,0.493


In [58]:
trump['label'].sum()

-1968

In [59]:
may['label'].sum()

-13

In [61]:
maduro['label'].sum()

-143

In [63]:
warren['label'].sum()

113
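
Because each label is 1, 0, or -1, the sum of the label column equals the number of positive tweets minus the number of negative ones, so it grows with tweet volume. Dividing by the tweet count gives a per-tweet figure that is easier to compare across leaders. The sketch below uses the sums and counts printed above; `net_label_rate` is a helper name introduced here for illustration.

```python
# (label sum, tweet count) per leader, copied from the outputs above.
label_sums = {
    'Trump': (-1968, 17910),
    'Maduro': (-143, 1356),
    'Theresa_May': (-13, 500),
    'Elizabeth_Warren': (113, 331),
}

def net_label_rate(label_sum, count):
    """Average label: (positive tweets - negative tweets) / all tweets."""
    return label_sum / count

rates = {name: net_label_rate(s, n) for name, (s, n) in label_sums.items()}
for name, rate in rates.items():
    print(f'{name:<17}{rate:+.3f}')
```

On this scale Trump and Maduro both land near -0.11, Theresa May near -0.03, and Elizabeth Warren near +0.34.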

Looking at the summary statistics for the sentiment columns of each leader, and at the sum of the label column, we can see which leaders are receiving positive, neutral, and negative sentiment. Trump, Maduro, and (marginally) Theresa May come out net negative, with label sums of -1968, -143, and -13, while Elizabeth Warren, a Democratic candidate in the 2020 election, comes out net positive at 113. For grading purposes, I will include a .csv file as well.

In [70]:
#df2_clean.to_csv('hw2_file.csv',index=False)