# Challenges for week 3

Now that we've seen how to clean data in Pandas, it's time for you to apply this knowledge. This week has three challenges. Make sure to try and complete all of them.

**Some important notes for the challenges:**
1. These challenges are a warm-up and help you get ready for class. Make sure to give them a try. If you get an error message, try to troubleshoot it (using Google often helps). If all else fails, go to the next challenge (but make sure to hand it in).
2. While we of course like it when you get all the answers right, the important thing is to practice and apply the knowledge. So we will still accept challenges that may not be complete, as long as we see enough effort *for each challenge*. This means that if one of the challenges is not delivered (not started and no attempt shown), we unfortunately will not be able to provide a full grade for that week.
3. Delivering the challenge on time via the Canvas assignment is critical, as it also helps you prepare for the DA live session. Check on Canvas how to hand it in.

### Facing issues? 

We are constantly monitoring the issues on the GitHub general repository (https://github.com/uva-cw-digitalanalytics/2021s2/issues) to help you out. Don't hesitate to log an issue there, clearly explaining what the problem is and showing the code you are using and any error message you receive.

**Important:** We only monitor the repository on weekdays, from 9.30 to 17.00. Issues logged after this time will most likely be answered the next day. This means you should not wait for our response before submitting a challenge :-)

## Getting set up for the challenges

We will use actual Twitter data for the challenges of this week. To do so, you need:
* To download DMI-TCAT data that you may already be collecting for yourself, or from a colleague (if you haven't requested data collection yet). Please use **the same data** that you requested sentiment analysis for
* The sentiment analysis results (get them from SurfDrive)

If you don't have sentiment analysis results, get them from a colleague (in SurfDrive), but then make sure to also download their Twitter data from DMI-TCAT. Otherwise the merge won't work.

**All the challenges below use this Twitter data. Make sure to start each challenge by doing the basics of loading and inspecting the data, even if this is not specified in the challenge itself.**



## Challenge 1

Create two binary variables for the Twitter data based on the text column. They should be two **meaningful** categories for your data, and they should have either the value 0 (when the tweet is not of that category) or 1 (when the tweet is of that category). 

Make sure to explain (in Markdown) what these variables are, and provide some descriptives when they are done.
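
A minimal sketch of one way to build such variables, assuming the text lives in a column called `full_text`. The toy tweets, the keyword, and the column names `is_retweet` and `mentions_climate` are invented for illustration; pick categories that are meaningful for your own data.

```python
import pandas as pd

# Toy data standing in for the real tweets (invented for illustration).
toy = pd.DataFrame({
    "full_text": [
        "RT @someone: Happy 50th birthday Greenpeace!",
        "The climate crisis needs action now",
        "Watching a documentary tonight",
        "RT @other: Climate change is real #ClimateCrisis",
    ]
})

# 1 when the tweet is a retweet (text starts with "RT @"), else 0.
toy["is_retweet"] = toy["full_text"].str.startswith("RT @").astype(int)

# 1 when the tweet mentions "climate" (case-insensitive), else 0.
toy["mentions_climate"] = toy["full_text"].str.contains("climate", case=False).astype(int)

# Descriptives: counts per category and the share of tweets coded 1.
print(toy["is_retweet"].value_counts())
print(toy["mentions_climate"].value_counts())
print(toy[["is_retweet", "mentions_climate"]].mean())
```

Because the variables are coded 0/1, their mean is simply the proportion of tweets in that category.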

In [8]:
import pandas as pd

In [31]:
df_jsonl = pd.read_json('greenpeacetweets.jsonl', lines=True)

In [32]:
len(df_jsonl) 

1197

In [33]:
df_jsonl.head()

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,metadata,source,...,is_quote_status,retweet_count,favorite_count,favorited,retweeted,possibly_sensitive,lang,quoted_status_id,quoted_status_id_str,quoted_status
0,2021-09-15 20:21:15+00:00,1438236491394785288,1438236491394785280,"RT @GreenpeaceArg: - ¿Por qué tan elegante, Ho...",False,"[0, 117]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 1438173629850832905, 'id_str...","{'iso_language_code': 'es', 'result_type': 're...","<a href=""https://mobile.twitter.com"" rel=""nofo...",...,False,14,0,False,False,0.0,es,,,
1,2021-09-15 20:21:13+00:00,1438236482532151301,1438236482532151296,"¡Excusas, excusas, excusas! Cómo las empresas ...",False,"[0, 115]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"{'iso_language_code': 'es', 'result_type': 're...","<a href=""http://twitter.com/download/iphone"" r...",...,False,0,0,False,False,0.0,es,,,
2,2021-09-15 20:21:03+00:00,1438236443638370308,1438236443638370304,RT @Herbert_Diess: Greenpeace müsste man erfin...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"{'iso_language_code': 'de', 'result_type': 're...","<a href=""http://twitter.com/download/android"" ...",...,True,41,0,False,False,,de,1.438075e+18,1.438075e+18,
3,2021-09-15 20:21:02+00:00,1438236436084391936,1438236436084391936,RT @PBIcanada: We congratulate @GreenpeaceCA o...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/iphone"" r...",...,False,7,0,False,False,,en,,,
4,2021-09-15 20:20:54+00:00,1438236402114772995,1438236402114772992,RT @cadebe_: @GJGamble @BorisJohnson @trussliz...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/android"" ...",...,False,1,0,False,False,,en,,,


In [34]:
df_jsonl.columns

Index(['created_at', 'id', 'id_str', 'full_text', 'truncated',
       'display_text_range', 'entities', 'extended_entities', 'metadata',
       'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'retweeted_status', 'is_quote_status', 'retweet_count',
       'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive',
       'lang', 'quoted_status_id', 'quoted_status_id_str', 'quoted_status'],
      dtype='object')

In [35]:
df_jsonl[['lang']].value_counts()

lang
en      420
es      286
de      238
fr       80
pt       50
nl       44
und      31
it       15
pl       10
ca        5
cs        4
no        3
sv        2
th        2
et        1
in        1
ja        1
el        1
ro        1
ru        1
ar        1
dtype: int64

In [36]:
df_jsonl.isna().sum()

created_at                      0
id                              0
id_str                          0
full_text                       0
truncated                       0
display_text_range              0
entities                        0
extended_entities            1102
metadata                        0
source                          0
in_reply_to_status_id         982
in_reply_to_status_id_str     982
in_reply_to_user_id           973
in_reply_to_user_id_str       973
in_reply_to_screen_name       973
user                            0
geo                          1197
coordinates                  1197
place                        1195
contributors                 1197
retweeted_status              401
is_quote_status                 0
retweet_count                   0
favorite_count                  0
favorited                       0
retweeted                       0
possibly_sensitive            899
lang                            0
quoted_status_id             1031
quoted_status_

In [37]:
df_jsonl.dtypes

created_at                   datetime64[ns, UTC]
id                                         int64
id_str                                     int64
full_text                                 object
truncated                                   bool
display_text_range                        object
entities                                  object
extended_entities                         object
metadata                                  object
source                                    object
in_reply_to_status_id                    float64
in_reply_to_status_id_str                float64
in_reply_to_user_id                      float64
in_reply_to_user_id_str                  float64
in_reply_to_screen_name                   object
user                                      object
geo                                      float64
coordinates                              float64
place                                     object
contributors                             float64
retweeted_status    

In [38]:
en_df = df_jsonl[df_jsonl['lang']=='en'][['id','user','created_at','source','full_text','retweet_count','favorite_count']]

In [45]:
en_df.head()

Unnamed: 0,id,user,created_at,source,full_text,retweet_count,favorite_count
3,1438236436084391936,"{'id': 27281240, 'id_str': '27281240', 'name':...",2021-09-15 20:21:02+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @PBIcanada: We congratulate @GreenpeaceCA o...,7,0
4,1438236402114772995,"{'id': 1119365287, 'id_str': '1119365287', 'na...",2021-09-15 20:20:54+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @cadebe_: @GJGamble @BorisJohnson @trussliz...,1,0
6,1438236275283214343,"{'id': 738106329080987648, 'id_str': '73810632...",2021-09-15 20:20:23+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...","RT @GreenpeaceCA: I scream, you scream, WE ALL...",2,0
9,1438235849058045956,"{'id': 20565828, 'id_str': '20565828', 'name':...",2021-09-15 20:18:42+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @LongTimeAmy: TODAY IN HISTORY\n\n1971–Envi...,16,0
10,1438235834415714308,"{'id': 59791627, 'id_str': '59791627', 'name':...",2021-09-15 20:18:38+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @zoev213: Happy 50th @greenpeaceusa Join me...,2,0


In [52]:
len(en_df)

420

In [51]:
en_df['user'][10]

{'id': 59791627,
 'id_str': '59791627',
 'name': 'Mark Trudgeon',
 'screen_name': 'mark_trudgeon',
 'location': 'Southampton',
 'description': 'Green Supporter and campaigner, Saints season ticket holder. Dad and husband.',
 'url': None,
 'entities': {'description': {'urls': []}},
 'protected': False,
 'followers_count': 340,
 'friends_count': 1516,
 'listed_count': 12,
 'created_at': 'Fri Jul 24 14:10:52 +0000 2009',
 'favourites_count': 24437,
 'utc_offset': None,
 'time_zone': None,
 'geo_enabled': True,
 'verified': False,
 'statuses_count': 5730,
 'lang': None,
 'contributors_enabled': False,
 'is_translator': False,
 'is_translation_enabled': False,
 'profile_background_color': 'C0DEED',
 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
 'profile_background_tile': False,
 'profile_image_url': 'http://pbs.twimg.com/profile_images/1185628113713188873/jmnlXN

In [44]:
en_df[['retweet_count','favorite_count']].describe()

Unnamed: 0,retweet_count,favorite_count
count,420.0,420.0
mean,24.057143,1.890476
std,57.408028,29.456613
min,0.0,0.0
25%,0.0,0.0
50%,4.0,0.0
75%,25.0,0.0
max,648.0,602.0


In [46]:
users = pd.json_normalize(en_df['user'])

In [47]:
users

Unnamed: 0,id,id_str,name,screen_name,location,description,url,protected,followers_count,friends_count,...,default_profile,default_profile_image,following,follow_request_sent,notifications,translator_type,withheld_in_countries,entities.description.urls,profile_banner_url,entities.url.urls
0,27281240,27281240,wei wu wei,Osmotheque78000,"California, USA",Buddhist - Daoist - Quaker Questioner,,False,5665,6235,...,False,False,False,False,False,none,[],[],,
1,1119365287,1119365287,Elize⚘,ElizeCronje1,Big Bay,I am a happy postive person and don't get angr...,,False,1165,1722,...,True,False,False,False,False,none,[],[],,
2,738106329080987648,738106329080987648,Craig Laferriere,JackLaferriere,"Mississauga, Ontario","Craig Laferrière lives in Mississauga, Ontario...",https://t.co/r2Bw6rEwjo,False,25,116,...,True,False,False,False,False,none,[],[],https://pbs.twimg.com/profile_banners/73810632...,"[{'url': 'https://t.co/r2Bw6rEwjo', 'expanded_..."
3,20565828,20565828,CitizenWonk #DCStatehood 💉,CitizenWonk,"Washington, DC USA 🇺🇸",anti-fascist Political Scientist #ONEV1 Z22. R...,,False,32259,27188,...,False,False,False,False,False,none,[],[],https://pbs.twimg.com/profile_banners/20565828...,
4,59791627,59791627,Mark Trudgeon,mark_trudgeon,Southampton,"Green Supporter and campaigner, Saints season ...",,False,340,1516,...,False,False,False,False,False,none,[],[],https://pbs.twimg.com/profile_banners/59791627...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
415,1022655642,1022655642,Backdoor Conquistador 🇬🇷,LiteralDiego,,Just 🐝-ing myself,,False,35,386,...,True,False,False,False,False,none,[],[],https://pbs.twimg.com/profile_banners/10226556...,
416,2925571294,2925571294,emgfind,emgfind,"Alberta, Canada",interests:exposed environmentalconcerns&coveru...,,False,4663,3828,...,True,False,False,False,False,none,[],[],,
417,882766377660166144,882766377660166144,Adrienne Moreau,AdriMoreau,Dark dungeons,"Cat. Co-writer of brutal fiction and ""fan-fict...",,False,1521,1385,...,True,False,False,False,False,none,[],[],https://pbs.twimg.com/profile_banners/88276637...,
418,792280784468111360,792280784468111360,Innocent Indeje,ian_indeje,Kenya,"SRHR Advocate & TOT, Peer Educator, teacher by...",,False,1521,1103,...,True,False,False,False,False,none,[],[],https://pbs.twimg.com/profile_banners/79228078...,


In [48]:
len(users)

420

In [49]:
users.columns

Index(['id', 'id_str', 'name', 'screen_name', 'location', 'description', 'url',
       'protected', 'followers_count', 'friends_count', 'listed_count',
       'created_at', 'favourites_count', 'utc_offset', 'time_zone',
       'geo_enabled', 'verified', 'statuses_count', 'lang',
       'contributors_enabled', 'is_translator', 'is_translation_enabled',
       'profile_background_color', 'profile_background_image_url',
       'profile_background_image_url_https', 'profile_background_tile',
       'profile_image_url', 'profile_image_url_https', 'profile_link_color',
       'profile_sidebar_border_color', 'profile_sidebar_fill_color',
       'profile_text_color', 'profile_use_background_image',
       'has_extended_profile', 'default_profile', 'default_profile_image',
       'following', 'follow_request_sent', 'notifications', 'translator_type',
       'withheld_in_countries', 'entities.description.urls',
       'profile_banner_url', 'entities.url.urls'],
      dtype='object')

In [50]:
users.isna().sum()

id                                      0
id_str                                  0
name                                    0
screen_name                             0
location                                0
description                             0
url                                   262
protected                               0
followers_count                         0
friends_count                           0
listed_count                            0
created_at                              0
favourites_count                        0
utc_offset                            420
time_zone                             420
geo_enabled                             0
verified                                0
statuses_count                          0
lang                                  420
contributors_enabled                    0
is_translator                           0
is_translation_enabled                  0
profile_background_color                0
profile_background_image_url      

In [53]:
tweets = en_df.join(users, rsuffix = '_user')

In [55]:
len(tweets)

420

In [58]:
tweets.columns

Index(['id', 'user', 'created_at', 'source', 'full_text', 'retweet_count',
       'favorite_count', 'id_user', 'id_str', 'name', 'screen_name',
       'location', 'description', 'url', 'protected', 'followers_count',
       'friends_count', 'listed_count', 'created_at_user', 'favourites_count',
       'utc_offset', 'time_zone', 'geo_enabled', 'verified', 'statuses_count',
       'lang', 'contributors_enabled', 'is_translator',
       'is_translation_enabled', 'profile_background_color',
       'profile_background_image_url', 'profile_background_image_url_https',
       'profile_background_tile', 'profile_image_url',
       'profile_image_url_https', 'profile_link_color',
       'profile_sidebar_border_color', 'profile_sidebar_fill_color',
       'profile_text_color', 'profile_use_background_image',
       'has_extended_profile', 'default_profile', 'default_profile_image',
       'following', 'follow_request_sent', 'notifications', 'translator_type',
       'withheld_in_countries', 'ent

In [60]:
tweets.isna().sum()

id                                      0
user                                    0
created_at                              0
source                                  0
full_text                               0
retweet_count                           0
favorite_count                          0
id_user                               252
id_str                                252
name                                  252
screen_name                           252
location                              252
description                           252
url                                   351
protected                             252
followers_count                       252
friends_count                         252
listed_count                          252
created_at_user                       252
favourites_count                      252
utc_offset                            420
time_zone                             420
geo_enabled                           252
verified                          
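
The 252 missing values after the join are consistent with an index mismatch rather than with genuinely missing user data: `en_df` keeps the row labels of the original dataframe (3, 4, 6, 9, ...), whereas `pd.json_normalize` returns a fresh index (0, 1, 2, ...), and `.join` aligns rows on index labels. This would also explain why the tweet labelled 3 ends up with the follower count of the user in row 3 of `users` rather than that of its own author. The sketch below reproduces the effect on invented toy data and shows one possible fix (reusing the tweet index on the user table); it is an illustration under those assumptions, not a re-run of the cells above.

```python
import pandas as pd

# Toy tweets with a non-contiguous index, as left behind by a row filter.
toy_tweets = pd.DataFrame(
    {
        "id": [101, 102, 103],
        "user": [
            {"screen_name": "a", "followers_count": 10},
            {"screen_name": "b", "followers_count": 20},
            {"screen_name": "c", "followers_count": 30},
        ],
    },
    index=[3, 4, 6],
)

# Normalizing the list of user dicts yields rows labelled 0, 1, 2.
toy_users = pd.json_normalize(toy_tweets["user"].tolist())

# Joining as-is aligns on labels; none match here, so all user columns are NaN.
misaligned = toy_tweets.join(toy_users)
print(misaligned["followers_count"].isna().sum())

# Possible fix: give the user table the same index as the tweets before joining.
toy_users.index = toy_tweets.index
aligned = toy_tweets.join(toy_users)
print(aligned[["id", "screen_name", "followers_count"]])
```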

In [73]:
final_df = tweets[['id','screen_name','created_at','source','full_text','retweet_count','favorite_count','followers_count','friends_count']]

In [74]:
final_df

Unnamed: 0,id,created_at,source,full_text,retweet_count,favorite_count,followers_count,friends_count
3,1438236436084391936,2021-09-15 20:21:02+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @PBIcanada: We congratulate @GreenpeaceCA o...,7,0,32259.0,27188.0
4,1438236402114772995,2021-09-15 20:20:54+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @cadebe_: @GJGamble @BorisJohnson @trussliz...,1,0,340.0,1516.0
6,1438236275283214343,2021-09-15 20:20:23+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...","RT @GreenpeaceCA: I scream, you scream, WE ALL...",2,0,387.0,672.0
9,1438235849058045956,2021-09-15 20:18:42+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @LongTimeAmy: TODAY IN HISTORY\n\n1971–Envi...,16,0,1.0,35.0
10,1438235834415714308,2021-09-15 20:18:38+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @zoev213: Happy 50th @greenpeaceusa Join me...,2,0,446.0,443.0
...,...,...,...,...,...,...,...,...
1167,1438192539400429577,2021-09-15 17:26:36+00:00,"<a href=""http://twitter.com/download/iphone"" r...",@cowboy_satin @23brookside @People4Bernie @Ber...,0,1,,
1168,1438192531787825157,2021-09-15 17:26:34+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @jubileam8: #Glyphosate spraying purposeful...,5,0,,
1177,1438192131206782979,2021-09-15 17:24:59+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @EDProgramme: Companies like ExxonMobil kne...,17,0,,
1183,1438192017847226369,2021-09-15 17:24:32+00:00,"<a href=""http://twitter.com/download/android"" ...","RT @AbukaAlfred: Different countries, differen...",5,0,,


In [75]:
final_df.isna().sum()

id                   0
created_at           0
source               0
full_text            0
retweet_count        0
favorite_count       0
followers_count    252
friends_count      252
dtype: int64

In [78]:
final_df[['friends_count','followers_count']].describe()

Unnamed: 0,friends_count,followers_count
count,168.0,168.0
mean,1837.047619,15220.12
std,3291.62433,146543.7
min,1.0,0.0
25%,264.0,193.25
50%,893.0,860.0
75%,1990.5,2084.0
max,27188.0,1889553.0


In [79]:
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [80]:
final_df[['friends_count','followers_count']].describe()

Unnamed: 0,friends_count,followers_count
count,168.0,168.0
mean,1837.04762,15220.11905
std,3291.62433,146543.71983
min,1.0,0.0
25%,264.0,193.25
50%,893.0,860.0
75%,1990.5,2084.0
max,27188.0,1889553.0


In [84]:
final_df['friendsCount_no_na'] = final_df['friends_count'].fillna(0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_df['friendsCount_no_na'] = final_df['friends_count'].fillna(0)


In [85]:
final_df['followers_count_no_na'] = final_df['followers_count'].fillna(0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_df['followers_count_no_na'] = final_df['followers_count'].fillna(0)
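
The warning above appears because `final_df` was created by selecting columns from another dataframe, so pandas cannot tell whether the new column is meant for the original or for a copy. One common way to avoid it is to take an explicit `.copy()` when creating the subset. A minimal sketch on invented toy data (the names `toy` and `subset` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3],
    "friends_count": [5.0, None, 7.0],
    "other": ["x", "y", "z"],
})

# Take an explicit copy of the column subset, so later assignments are unambiguous.
subset = toy[["id", "friends_count"]].copy()

# Adding a column to the copy no longer triggers SettingWithCopyWarning.
subset["friends_count_no_na"] = subset["friends_count"].fillna(0)

print(subset)
```

Keep in mind that replacing missing counts with 0 changes means and standard deviations, so whether this is appropriate depends on why the values are missing.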


In [86]:
final_df.isna().sum()

id                         0
created_at                 0
source                     0
full_text                  0
retweet_count              0
favorite_count             0
followers_count          252
friends_count            252
friendsCount_no_na         0
followers_count_no_na      0
dtype: int64

In [89]:
final_df2 = final_df[['id','created_at','source','full_text','retweet_count','favorite_count','followers_count_no_na','friendsCount_no_na']]

In [90]:
final_df2.isna().sum()

id                       0
created_at               0
source                   0
full_text                0
retweet_count            0
favorite_count           0
followers_count_no_na    0
friendsCount_no_na       0
dtype: int64

In [97]:
final_df2.dtypes

id                                     int64
created_at               datetime64[ns, UTC]
source                                object
full_text                             object
retweet_count                          int64
favorite_count                         int64
followers_count_no_na                float64
friendsCount_no_na                   float64
dtype: object

## Challenge 2

Merge the sentiment analysis results with your data. Make sure to check whether the length of the dataframe generated by the merge makes sense.


In [91]:
sent = pd.read_pickle('FlorindoArgondizzo_EN_completed.pkl')

In [104]:
len(sent)

420

In [92]:
sent

Unnamed: 0,id,positive,negative,neutral
3,1438236436084391936,3,-1,1
4,1438236402114772995,1,-1,0
6,1438236275283214343,2,-5,-1
9,1438235849058045956,2,-1,1
10,1438235834415714308,3,-1,1
...,...,...,...,...
1167,1438192539400429577,2,-1,1
1168,1438192531787825157,1,-3,-1
1177,1438192131206782979,2,-1,1
1183,1438192017847226369,1,-2,-1


In [94]:
sent.columns

Index(['id', 'positive', 'negative', 'neutral'], dtype='object')

In [96]:
sent.isna().sum()

id          0
positive    0
negative    0
neutral     0
dtype: int64

In [98]:
sent.dtypes

id          object
positive    object
negative    object
neutral     object
dtype: object

In [102]:
sent[['id','positive','negative','neutral']] = sent[['id','positive','negative','neutral']].apply(pd.to_numeric)

In [103]:
sent.dtypes

id          int64
positive    int64
negative    int64
neutral     int64
dtype: object

In [105]:
len(final_df2.merge(sent, on='id', how='left'))

420

In [106]:
len(final_df2.merge(sent, on='id', how='right'))

420

In [107]:
len(final_df2.merge(sent, on='id', how='inner'))

420

In [108]:
len(final_df2.merge(sent, on='id', how='outer'))

420
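
Comparing the lengths across the four `how` options is a good first check. As a complementary sketch (toy data invented for illustration), `merge` can also report where each row came from with `indicator=True`, and `validate` can assert that the key is unique on both sides:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "full_text": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "positive": [1, 2, 3]})

# indicator=True adds a `_merge` column (left_only, right_only or both);
# validate="one_to_one" raises an error if `id` is duplicated on either side.
checked = left.merge(right, on="id", how="outer", indicator=True, validate="one_to_one")

print(checked["_merge"].value_counts())
print(len(left), len(right), len(checked))
```

If every row is marked `both`, the two dataframes contain exactly the same ids, which is what identical lengths for all four merge types suggest.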

In [109]:
complete = final_df2.merge(sent, on='id')

In [110]:
complete

Unnamed: 0,id,created_at,source,full_text,retweet_count,favorite_count,followers_count_no_na,friendsCount_no_na,positive,negative,neutral
0,1438236436084391936,2021-09-15 20:21:02+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @PBIcanada: We congratulate @GreenpeaceCA o...,7,0,32259.00000,27188.00000,3,-1,1
1,1438236402114772995,2021-09-15 20:20:54+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @cadebe_: @GJGamble @BorisJohnson @trussliz...,1,0,340.00000,1516.00000,1,-1,0
2,1438236275283214343,2021-09-15 20:20:23+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...","RT @GreenpeaceCA: I scream, you scream, WE ALL...",2,0,387.00000,672.00000,2,-5,-1
3,1438235849058045956,2021-09-15 20:18:42+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @LongTimeAmy: TODAY IN HISTORY\n\n1971–Envi...,16,0,1.00000,35.00000,2,-1,1
4,1438235834415714308,2021-09-15 20:18:38+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @zoev213: Happy 50th @greenpeaceusa Join me...,2,0,446.00000,443.00000,3,-1,1
...,...,...,...,...,...,...,...,...,...,...,...
415,1438192539400429577,2021-09-15 17:26:36+00:00,"<a href=""http://twitter.com/download/iphone"" r...",@cowboy_satin @23brookside @People4Bernie @Ber...,0,1,0.00000,0.00000,2,-1,1
416,1438192531787825157,2021-09-15 17:26:34+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @jubileam8: #Glyphosate spraying purposeful...,5,0,0.00000,0.00000,1,-3,-1
417,1438192131206782979,2021-09-15 17:24:59+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @EDProgramme: Companies like ExxonMobil kne...,17,0,0.00000,0.00000,2,-1,1
418,1438192017847226369,2021-09-15 17:24:32+00:00,"<a href=""http://twitter.com/download/android"" ...","RT @AbukaAlfred: Different countries, differen...",5,0,0.00000,0.00000,1,-2,-1


In [111]:
complete.columns

Index(['id', 'created_at', 'source', 'full_text', 'retweet_count',
       'favorite_count', 'followers_count_no_na', 'friendsCount_no_na',
       'positive', 'negative', 'neutral'],
      dtype='object')

In [112]:
complete.isna().sum()

id                       0
created_at               0
source                   0
full_text                0
retweet_count            0
favorite_count           0
followers_count_no_na    0
friendsCount_no_na       0
positive                 0
negative                 0
neutral                  0
dtype: int64

## Challenge 3

The sentiment analysis results have three interesting columns: ```neutral```, ```positive```, and ```negative```. They come from the trinary version of the SentiStrength (http://sentistrength.wlv.ac.uk/) algorithm.

For this challenge, you need to:
1. Create one variable that summarizes the sentiment (i.e., that somehow aggregates the information of it being positive or negative - or potentially neutral - into one single variable)
2. Using the ```.groupby``` function, compare the means and standard deviations of that variable per category (that you created in Challenge 1).

*Tip: Pandas makes it easy to run numerical operations across columns. Let's say that I want to multiply the value that is in column A by the value that is in column B and store it in column C... I can simply use:*
```df['C'] = df['A'] * df['B']```


**Note:** if you cannot complete #1, make sure to at least complete #2 with each column separately. But do give it a try ;-)
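
As an illustration of both steps, here is a minimal sketch on invented toy data. The column `is_retweet` stands in for one of your Challenge 1 variables, and adding the positive and negative scores is only one of several reasonable ways to summarize sentiment.

```python
import pandas as pd

toy = pd.DataFrame({
    "positive": [3, 1, 2, 1],
    "negative": [-1, -1, -5, -2],
    "is_retweet": [1, 0, 1, 0],  # hypothetical Challenge 1 variable
})

# Summary score: the positive strength plus the (negative-signed) negative strength.
toy["sent_score"] = toy["positive"] + toy["negative"]

# Mean and standard deviation of the score for each category.
summary = toy.groupby("is_retweet")["sent_score"].agg(["mean", "std"])
print(summary)
```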

In [113]:
complete['positive'].value_counts()

1    188
2    143
3     81
4      8
Name: positive, dtype: int64

In [114]:
complete['negative'].value_counts()

-1    286
-2     72
-3     41
-4     14
-5      7
Name: negative, dtype: int64

In [115]:
complete['neutral'].value_counts()

 1    190
-1    124
 0    106
Name: neutral, dtype: int64

In [117]:
complete['sent_score'] = complete['negative'] + complete['positive']

In [118]:
complete

Unnamed: 0,id,created_at,source,full_text,retweet_count,favorite_count,followers_count_no_na,friendsCount_no_na,positive,negative,neutral,sent_score
0,1438236436084391936,2021-09-15 20:21:02+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @PBIcanada: We congratulate @GreenpeaceCA o...,7,0,32259.00000,27188.00000,3,-1,1,2
1,1438236402114772995,2021-09-15 20:20:54+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @cadebe_: @GJGamble @BorisJohnson @trussliz...,1,0,340.00000,1516.00000,1,-1,0,0
2,1438236275283214343,2021-09-15 20:20:23+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...","RT @GreenpeaceCA: I scream, you scream, WE ALL...",2,0,387.00000,672.00000,2,-5,-1,-3
3,1438235849058045956,2021-09-15 20:18:42+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @LongTimeAmy: TODAY IN HISTORY\n\n1971–Envi...,16,0,1.00000,35.00000,2,-1,1,1
4,1438235834415714308,2021-09-15 20:18:38+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @zoev213: Happy 50th @greenpeaceusa Join me...,2,0,446.00000,443.00000,3,-1,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...
415,1438192539400429577,2021-09-15 17:26:36+00:00,"<a href=""http://twitter.com/download/iphone"" r...",@cowboy_satin @23brookside @People4Bernie @Ber...,0,1,0.00000,0.00000,2,-1,1,1
416,1438192531787825157,2021-09-15 17:26:34+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @jubileam8: #Glyphosate spraying purposeful...,5,0,0.00000,0.00000,1,-3,-1,-2
417,1438192131206782979,2021-09-15 17:24:59+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @EDProgramme: Companies like ExxonMobil kne...,17,0,0.00000,0.00000,2,-1,1,1
418,1438192017847226369,2021-09-15 17:24:32+00:00,"<a href=""http://twitter.com/download/android"" ...","RT @AbukaAlfred: Different countries, differen...",5,0,0.00000,0.00000,1,-2,-1,-1


In [120]:
pd.set_option('display.max_rows', 500)

In [125]:
pd.options.display.max_colwidth = 10000

In [126]:
complete[['full_text','sent_score']]

Unnamed: 0,full_text,sent_score
0,"RT @PBIcanada: We congratulate @GreenpeaceCA on their 50th anniversary today! We recall that one of our founders, George Willoughby, sailed…",2
1,RT @cadebe_: @GJGamble @BorisJohnson @trussliz @ChrisGPackham @Greenpeace @seashepherd @Seasaver @CarolineLucas @natalieben I have been fee…,0
2,"RT @GreenpeaceCA: I scream, you scream, WE ALL SCREAM TO #TAXTHERICH!\n\nPolls consistently show: Canadians are overwhelmingly in favour of a…",-3
3,RT @LongTimeAmy: TODAY IN HISTORY\n\n1971–Environmental group Greenpeace is founded.\n\n2021–Cows can be 'potty trained like children'for green…,1
4,RT @zoev213: Happy 50th @greenpeaceusa Join me and support their important work to protect the environment today! https://t.co/pOKUWxVGZS,2
5,"RT @Greenpeace: ""Today we need huge, unprecedented change. For billions to unite as a movement, taking action around the world and demandin…",0
6,Im fundraising a speed climb supporting Greenpeace climbing a tree 100 climbs on 21102021 Watch me climb 38 times in sample video\n\nhttps://t.co/mBa11woVPi #treeclimb @alexrotas @fitforover60 @GreenpeaceUK,1
7,RT @cosmicadriestar: HAPPY BIRTHDAY GREENPEACE!!!,2
8,@Greenpeace @climatemorgan Dolphins from the Faroe Islands say thank you! You haven't even honored them with a mention these days! Your birthday is definitely more important! https://t.co/ybcJdlPMSV,2
9,"@AllatRaTV @BardemAntarctic @carlosbardem @greenpeace_esp Pay attention to the climate problem! Mr. Javier Bardem, invite you to speak at the conference ""Global Crisis. Time for Truth"" on December 04! Your word counts! @BardemAntarctic @carlosbardem @greenpeace_esp #JavierBardem #ClimateCrisis",-1


In [134]:
complete['sent_score'].value_counts()

 0    124
 1    119
 2     64
-1     64
-2     29
-3     11
 3      7
-4      2
Name: sent_score, dtype: int64

In [132]:
complete['sent_score'].value_counts(normalize=True)

 0   0.29524
 1   0.28333
 2   0.15238
-1   0.15238
-2   0.06905
-3   0.02619
 3   0.01667
-4   0.00476
Name: sent_score, dtype: float64

In [135]:
def recategorize(score):
    # Map the summed sentiment score to a label based on its sign.
    if score == 0:
        return 'neutral'
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    # Anything else (e.g. a missing value) is labelled 'Other'.
    return 'Other'

In [136]:
complete['sentiment'] = complete['sent_score'].apply(recategorize)
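
As a possible alternative to applying a function row by row, the same labelling can be sketched in a vectorized way: `np.sign` reduces each score to -1, 0 or 1, and `.map` turns those into labels. The toy scores below are invented for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.Series([2, 0, -3, 1, -1], name="sent_score")

# np.sign gives -1, 0 or 1; map translates those values into labels.
labels = np.sign(scores).map({-1: "negative", 0: "neutral", 1: "positive"})
print(labels.tolist())
```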

In [139]:
complete.head()

Unnamed: 0,id,created_at,source,full_text,retweet_count,favorite_count,followers_count_no_na,friendsCount_no_na,positive,negative,neutral,sent_score,sentiment
0,1438236436084391936,2021-09-15 20:21:02+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","RT @PBIcanada: We congratulate @GreenpeaceCA on their 50th anniversary today! We recall that one of our founders, George Willoughby, sailed…",7,0,32259.0,27188.0,3,-1,1,2,positive
1,1438236402114772995,2021-09-15 20:20:54+00:00,"<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>",RT @cadebe_: @GJGamble @BorisJohnson @trussliz @ChrisGPackham @Greenpeace @seashepherd @Seasaver @CarolineLucas @natalieben I have been fee…,1,0,340.0,1516.0,1,-1,0,0,neutral
2,1438236275283214343,2021-09-15 20:20:23+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofollow"">Twitter Web App</a>","RT @GreenpeaceCA: I scream, you scream, WE ALL SCREAM TO #TAXTHERICH!\n\nPolls consistently show: Canadians are overwhelmingly in favour of a…",2,0,387.0,672.0,2,-5,-1,-3,negative
3,1438235849058045956,2021-09-15 20:18:42+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @LongTimeAmy: TODAY IN HISTORY\n\n1971–Environmental group Greenpeace is founded.\n\n2021–Cows can be 'potty trained like children'for green…,16,0,1.0,35.0,2,-1,1,1,positive
4,1438235834415714308,2021-09-15 20:18:38+00:00,"<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>",RT @zoev213: Happy 50th @greenpeaceusa Join me and support their important work to protect the environment today! https://t.co/pOKUWxVGZS,2,0,446.0,443.0,3,-1,1,2,positive


In [141]:
complete.groupby('sentiment')[['retweet_count','favorite_count']].describe().transpose()

Unnamed: 0,sentiment,negative,neutral,positive
retweet_count,count,106.0,124.0,190.0
retweet_count,mean,11.41509,13.32258,38.11579
retweet_count,std,34.58898,36.35161,73.60518
retweet_count,min,0.0,0.0,0.0
retweet_count,25%,0.0,0.0,1.0
retweet_count,50%,0.0,2.0,16.0
retweet_count,75%,15.0,14.0,62.75
retweet_count,max,268.0,333.0,648.0
favorite_count,count,106.0,124.0,190.0
favorite_count,mean,0.48113,0.55645,3.54737
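
If the full `describe()` table is more than you need, `.agg()` lets you pick just a few statistics per group. A minimal sketch on a made-up dataframe `toy` (the column names mirror the ones used above, the numbers are invented):

```python
import pandas as pd

# Hypothetical mini version of the `complete` dataframe
toy = pd.DataFrame({
    'sentiment': ['positive', 'positive', 'negative', 'neutral'],
    'retweet_count': [10, 20, 0, 4],
    'favorite_count': [1, 3, 0, 2],
})

# Only the mean and the median per sentiment group
summary = toy.groupby('sentiment')[['retweet_count', 'favorite_count']].agg(['mean', 'median'])

print(summary)
```

The median is often worth showing next to the mean here, because retweet counts tend to be skewed by a few very popular tweets.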


In [146]:
complete['sent_score'].describe()

count   420.00000
mean      0.25000
std       1.31337
min      -4.00000
25%      -1.00000
50%       0.00000
75%       1.00000
max       3.00000
Name: sent_score, dtype: float64

In [147]:
sample = complete.sample(15)

In [149]:
complete.to_pickle('COMPLETE.pkl')
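
A pickled dataframe can be loaded again with `pd.read_pickle()`, which keeps the column types intact. The sketch below shows the round trip on a made-up dataframe `toy`, written to a temporary folder so it does not touch your own files (the file name `toy.pkl` is just an example):

```python
import os
import tempfile

import pandas as pd

# Hypothetical small dataframe standing in for `complete`
toy = pd.DataFrame({'full_text': ['a', 'b'], 'sent_score': [1, -2]})

# Save to a temporary location and load it back
path = os.path.join(tempfile.mkdtemp(), 'toy.pkl')
toy.to_pickle(path)
restored = pd.read_pickle(path)

# True when values and dtypes survived the round trip
print(restored.equals(toy))
```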

In [150]:
import re
from collections import Counter

In [151]:
texts = complete['full_text'].values.tolist()

In [152]:
total_words = Counter()
for text in texts:
    # making text lower case
    text = text.lower()
    # removing URLS 
    text = text.split(' ')
    newtext = []
    for item in text:
        if 'http' not in item:
            newtext.append(item)
    
    newtext = ' '.join(newtext)
        
    # splitting the text in words (tokens)
    tokens = re.findall(r"[\w']+|[.,!?;$@]", newtext)
    for token in tokens:
        total_words[token] += 1
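
Two details of the loop above are worth knowing. Splitting on spaces only means that a URL directly after a line break is glued to the previous word, so that word is dropped together with the URL. And tokens such as `gt` and `;` in the counts are probably leftovers of HTML entities like `&gt;` and `&amp;` (this is an assumption, we did not check the raw tweets). A possible variant, sketched here with a hypothetical helper `tokenize` and made-up example texts, removes URLs with a regular expression and decodes entities first:

```python
import html
import re
from collections import Counter


def tokenize(text):
    # lower case, and turn entities such as '&amp;' back into '&'
    text = html.unescape(text.lower())
    # remove anything that starts with http up to the next whitespace
    text = re.sub(r'http\S+', ' ', text)
    # same token pattern as in the loop above
    return re.findall(r"[\w']+|[.,!?;$@]", text)


# Made-up example texts, not taken from the dataset
sample_texts = [
    "RT @Greenpeace: Happy birthday!\n\nhttps://t.co/abc",
    "Cats &amp; dogs &gt; nothing",
]

counts = Counter()
for text in sample_texts:
    counts.update(tokenize(text))

print(counts.most_common(5))
```

Because the token pattern is unchanged, the counts stay comparable to `total_words`, apart from the URL and entity handling.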
    

In [154]:
total_words.most_common(100)

[('@', 1127),
 (',', 415),
 ('.', 360),
 ('the', 295),
 ('greenpeace', 287),
 ('rt', 255),
 ('to', 248),
 ('a', 162),
 ('of', 160),
 ('and', 148),
 ('!', 141),
 ('is', 140),
 ('in', 134),
 ('you', 131),
 ('for', 109),
 ('we', 86),
 ('today', 72),
 ('this', 67),
 ('happy', 61),
 ('our', 60),
 ('are', 58),
 ('on', 56),
 ('it', 55),
 (';', 54),
 ('years', 54),
 ('like', 52),
 ('i', 51),
 ('from', 51),
 ('all', 49),
 ('birthday', 48),
 ('?', 46),
 ('that', 45),
 ('action', 43),
 ('day', 42),
 ('can', 41),
 ('50th', 39),
 ('s', 38),
 ('green', 34),
 ('gt', 34),
 ('be', 32),
 ('greenpeaceusa', 32),
 ('thank', 31),
 ('have', 30),
 ('more', 30),
 ('greenpeaceca', 29),
 ('need', 29),
 ('climate', 29),
 ('so', 29),
 ('anniversary', 28),
 ('luisamneubauer', 28),
 ('seashepherd', 27),
 ('as', 27),
 ('50', 27),
 ('borisjohnson', 26),
 ('chrisgpackham', 26),
 ('seasaver', 26),
 ('carolinelucas', 26),
 ('change', 26),
 ('what', 26),
 ('has', 26),
 ('gjgamble', 25),
 ('trussliz', 25),
 ('environmental