## Challenge 1

For this project I have been asked to answer the following question: 

**"To what extent does the sentiment expressed in a tweet influence user engagement with the tweet (likes and retweets)?"**

### Programming challenge

Using the Twitter data collected via ```twarc```, the tasks I have to complete are: 

1. Determine which variables are relevant for the research question (including relevant control variables related to users and tweets).
2. Create a minimized dataset (a dataset with only the variables necessary to answer the research question).
3. Make sure that the minimized dataset is pseudonymized (identifying information about users is removed from user-related columns and from text).

I have to create two binary variables for the Twitter data based on the text column. They should be two **meaningful** categories for this data, and they should have either the value 0 (when the tweet is not part of that category) or 1 (when the tweet is in that category). 
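A minimal sketch of how two such binary variables could be derived from the text column. The two categories used here (`is_retweet` for tweets that start with "RT @" and `has_hashtag` for tweets that contain a hashtag) and the toy tweets are illustrative assumptions, not necessarily the categories chosen for this assignment.

```python
import pandas as pd

# Toy tweets standing in for the real `full_text` column (assumed data).
toy = pd.DataFrame({
    "full_text": [
        "RT @someone: Happy 50th Greenpeace!",
        "Greenpeace protest today #climate",
        "Just read a report about Greenpeace",
    ]
})

# 1 when the tweet is a retweet (text starts with "RT @"), else 0.
toy["is_retweet"] = toy["full_text"].str.startswith("RT @").astype(int)

# 1 when the tweet contains at least one hashtag, else 0.
toy["has_hashtag"] = toy["full_text"].str.contains(r"#\w+", regex=True).astype(int)

print(toy[["is_retweet", "has_hashtag"]])
```

Casting the boolean result to `int` gives exactly the 0/1 coding the task asks for.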



* **Technical** summary:

**The following variables have been retained:**

created_at, id, full_text, retweet_count, favorite_count, followers_count, friends_count, screen_name.

**The following can be considered control variables: the time the tweet was created (Created At), because a tweet posted during an environmental scandal or emergency may express more sentiment and therefore attract more engagement; and the Id, in case the person who posted the tweet is a famous journalist or politician, or is otherwise connected to the scandal/emergency or to the topic (the tweets were collected with a query on the word "Greenpeace"). Friends Count and Followers Count have been kept as control variables as well, since the number of followers/friends may influence user engagement with the tweets. The sentiment expressed in the tweet is the main independent variable, while Retweet Count and Favorite Count are the dependent variables, since they measure engagement with the tweet itself.**

### Step-by-step Interpretation:

**Importing Pandas**

In [1]:
import pandas as pd

**Importing the Dataset which is made of Tweets containing the word "Greenpeace"**

In [2]:
df_jsonl = pd.read_json('greenpeacetweets.jsonl', lines=True)

**Checking the number of rows the dataset contains**

In [3]:
len(df_jsonl) 

1197

**Displaying first 5 rows to have a preview of the Dataframe**

In [4]:
df_jsonl.head()

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,metadata,source,...,is_quote_status,retweet_count,favorite_count,favorited,retweeted,possibly_sensitive,lang,quoted_status_id,quoted_status_id_str,quoted_status
0,2021-09-15 20:21:15+00:00,1438236491394785288,1438236491394785280,"RT @GreenpeaceArg: - ¿Por qué tan elegante, Ho...",False,"[0, 117]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 1438173629850832905, 'id_str...","{'iso_language_code': 'es', 'result_type': 're...","<a href=""https://mobile.twitter.com"" rel=""nofo...",...,False,14,0,False,False,0.0,es,,,
1,2021-09-15 20:21:13+00:00,1438236482532151301,1438236482532151296,"¡Excusas, excusas, excusas! Cómo las empresas ...",False,"[0, 115]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"{'iso_language_code': 'es', 'result_type': 're...","<a href=""http://twitter.com/download/iphone"" r...",...,False,0,0,False,False,0.0,es,,,
2,2021-09-15 20:21:03+00:00,1438236443638370308,1438236443638370304,RT @Herbert_Diess: Greenpeace müsste man erfin...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"{'iso_language_code': 'de', 'result_type': 're...","<a href=""http://twitter.com/download/android"" ...",...,True,41,0,False,False,,de,1.438075e+18,1.438075e+18,
3,2021-09-15 20:21:02+00:00,1438236436084391936,1438236436084391936,RT @PBIcanada: We congratulate @GreenpeaceCA o...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/iphone"" r...",...,False,7,0,False,False,,en,,,
4,2021-09-15 20:20:54+00:00,1438236402114772995,1438236402114772992,RT @cadebe_: @GJGamble @BorisJohnson @trussliz...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/android"" ...",...,False,1,0,False,False,,en,,,


**Checking which columns are present in the Data**

In [5]:
df_jsonl.columns

Index(['created_at', 'id', 'id_str', 'full_text', 'truncated',
       'display_text_range', 'entities', 'extended_entities', 'metadata',
       'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'retweeted_status', 'is_quote_status', 'retweet_count',
       'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive',
       'lang', 'quoted_status_id', 'quoted_status_id_str', 'quoted_status'],
      dtype='object')

**Checking the languages present in the data**

In [6]:
df_jsonl[['lang']].value_counts()

lang
en      420
es      286
de      238
fr       80
pt       50
nl       44
und      31
it       15
pl       10
ca        5
cs        4
no        3
sv        2
th        2
et        1
in        1
ja        1
el        1
ro        1
ru        1
ar        1
dtype: int64

**Looking for missing values**

In [7]:
df_jsonl.isna().sum()

created_at                      0
id                              0
id_str                          0
full_text                       0
truncated                       0
display_text_range              0
entities                        0
extended_entities            1102
metadata                        0
source                          0
in_reply_to_status_id         982
in_reply_to_status_id_str     982
in_reply_to_user_id           973
in_reply_to_user_id_str       973
in_reply_to_screen_name       973
user                            0
geo                          1197
coordinates                  1197
place                        1195
contributors                 1197
retweeted_status              401
is_quote_status                 0
retweet_count                   0
favorite_count                  0
favorited                       0
retweeted                       0
possibly_sensitive            899
lang                            0
quoted_status_id             1031
quoted_status_

**Checking the data types in which the data have been stored**

In [8]:
df_jsonl.dtypes

created_at                   datetime64[ns, UTC]
id                                         int64
id_str                                     int64
full_text                                 object
truncated                                   bool
display_text_range                        object
entities                                  object
extended_entities                         object
metadata                                  object
source                                    object
in_reply_to_status_id                    float64
in_reply_to_status_id_str                float64
in_reply_to_user_id                      float64
in_reply_to_user_id_str                  float64
in_reply_to_screen_name                   object
user                                      object
geo                                      float64
coordinates                              float64
place                                     object
contributors                             float64
retweeted_status    

**Making a minimised DataFrame with only the variables relevant to my question (English-language tweets only). Tasks 1 and 2 completed.**

In [9]:
en_df = df_jsonl[df_jsonl['lang']=='en'][['id','user','created_at','source','full_text','retweet_count','favorite_count']]

**Testing the new dataset by displaying the first 5 rows. As we can see, the "user" column contains nested dictionaries, which pandas cannot display in a readable way**

In [10]:
en_df.head()

Unnamed: 0,id,user,created_at,source,full_text,retweet_count,favorite_count
3,1438236436084391936,"{'id': 27281240, 'id_str': '27281240', 'name':...",2021-09-15 20:21:02+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @PBIcanada: We congratulate @GreenpeaceCA o...,7,0
4,1438236402114772995,"{'id': 1119365287, 'id_str': '1119365287', 'na...",2021-09-15 20:20:54+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @cadebe_: @GJGamble @BorisJohnson @trussliz...,1,0
6,1438236275283214343,"{'id': 738106329080987648, 'id_str': '73810632...",2021-09-15 20:20:23+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...","RT @GreenpeaceCA: I scream, you scream, WE ALL...",2,0
9,1438235849058045956,"{'id': 20565828, 'id_str': '20565828', 'name':...",2021-09-15 20:18:42+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @LongTimeAmy: TODAY IN HISTORY\n\n1971–Envi...,16,0
10,1438235834415714308,"{'id': 59791627, 'id_str': '59791627', 'name':...",2021-09-15 20:18:38+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @zoev213: Happy 50th @greenpeaceusa Join me...,2,0


**Checking the length of the new dataset**

In [11]:
len(en_df)

420

In [12]:
en_df[['retweet_count','favorite_count']].describe()

Unnamed: 0,retweet_count,favorite_count
count,420.0,420.0
mean,24.057143,1.890476
std,57.408028,29.456613
min,0.0,0.0
25%,0.0,0.0
50%,4.0,0.0
75%,25.0,0.0
max,648.0,602.0


**Creating a minimised DataFrame for the users. With this command I am asking pandas to flatten the nested dictionaries and convert them into separate columns**

In [13]:
users = pd.json_normalize(en_df['user'])
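As a rough illustration of what the flattening does, here is a small sketch on toy data. The nested records and their values are assumptions made for the example; the sketch passes a plain list of dictionaries, whereas the cell above passes the column itself.

```python
import pandas as pd

# Toy `user` column with nested dictionaries (assumed data). The row labels
# are non-consecutive, as they are in a filtered DataFrame.
toy = pd.DataFrame(
    {"user": [
        {"id": 1, "screen_name": "alice", "entities": {"url": {"urls": []}}},
        {"id": 2, "screen_name": "bob", "entities": {"url": {"urls": ["x"]}}},
    ]},
    index=[3, 7],
)

# Nested keys are joined with "." into flat column names.
flat = pd.json_normalize(toy["user"].tolist())

print(flat.columns.tolist())
print(flat.index.tolist())   # labels of the flattened frame
print(toy.index.tolist())    # labels of the original frame
```

The flattened frame built from a list carries a fresh 0-based index rather than the labels 3 and 7 of the original rows, which is worth keeping in mind before combining the two frames.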

**Testing if the command worked. It did**

In [14]:
users

Unnamed: 0,id,id_str,name,screen_name,location,description,url,protected,followers_count,friends_count,...,default_profile,default_profile_image,following,follow_request_sent,notifications,translator_type,withheld_in_countries,entities.description.urls,profile_banner_url,entities.url.urls
0,27281240,27281240,wei wu wei,Osmotheque78000,"California, USA",Buddhist - Daoist - Quaker Questioner,,False,5665,6235,...,False,False,False,False,False,none,[],[],,
1,1119365287,1119365287,Elize⚘,ElizeCronje1,Big Bay,I am a happy postive person and don't get angr...,,False,1165,1722,...,True,False,False,False,False,none,[],[],,
2,738106329080987648,738106329080987648,Craig Laferriere,JackLaferriere,"Mississauga, Ontario","Craig Laferrière lives in Mississauga, Ontario...",https://t.co/r2Bw6rEwjo,False,25,116,...,True,False,False,False,False,none,[],[],https://pbs.twimg.com/profile_banners/73810632...,"[{'url': 'https://t.co/r2Bw6rEwjo', 'expanded_..."
3,20565828,20565828,CitizenWonk #DCStatehood 💉,CitizenWonk,"Washington, DC USA 🇺🇸",anti-fascist Political Scientist #ONEV1 Z22. R...,,False,32259,27188,...,False,False,False,False,False,none,[],[],https://pbs.twimg.com/profile_banners/20565828...,
4,59791627,59791627,Mark Trudgeon,mark_trudgeon,Southampton,"Green Supporter and campaigner, Saints season ...",,False,340,1516,...,False,False,False,False,False,none,[],[],https://pbs.twimg.com/profile_banners/59791627...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
415,1022655642,1022655642,Backdoor Conquistador 🇬🇷,LiteralDiego,,Just 🐝-ing myself,,False,35,386,...,True,False,False,False,False,none,[],[],https://pbs.twimg.com/profile_banners/10226556...,
416,2925571294,2925571294,emgfind,emgfind,"Alberta, Canada",interests:exposed environmentalconcerns&coveru...,,False,4663,3828,...,True,False,False,False,False,none,[],[],,
417,882766377660166144,882766377660166144,Adrienne Moreau,AdriMoreau,Dark dungeons,"Cat. Co-writer of brutal fiction and ""fan-fict...",,False,1521,1385,...,True,False,False,False,False,none,[],[],https://pbs.twimg.com/profile_banners/88276637...,
418,792280784468111360,792280784468111360,Innocent Indeje,ian_indeje,Kenya,"SRHR Advocate & TOT, Peer Educator, teacher by...",,False,1521,1103,...,True,False,False,False,False,none,[],[],https://pbs.twimg.com/profile_banners/79228078...,


**Checking the columns contained in the nested dictionaries** 

In [15]:
users.columns

Index(['id', 'id_str', 'name', 'screen_name', 'location', 'description', 'url',
       'protected', 'followers_count', 'friends_count', 'listed_count',
       'created_at', 'favourites_count', 'utc_offset', 'time_zone',
       'geo_enabled', 'verified', 'statuses_count', 'lang',
       'contributors_enabled', 'is_translator', 'is_translation_enabled',
       'profile_background_color', 'profile_background_image_url',
       'profile_background_image_url_https', 'profile_background_tile',
       'profile_image_url', 'profile_image_url_https', 'profile_link_color',
       'profile_sidebar_border_color', 'profile_sidebar_fill_color',
       'profile_text_color', 'profile_use_background_image',
       'has_extended_profile', 'default_profile', 'default_profile_image',
       'following', 'follow_request_sent', 'notifications', 'translator_type',
       'withheld_in_countries', 'entities.description.urls',
       'profile_banner_url', 'entities.url.urls'],
      dtype='object')

**Checking for missing values in the new minimised Dataset**

In [16]:
users.isna().sum()

id                                      0
id_str                                  0
name                                    0
screen_name                             0
location                                0
description                             0
url                                   262
protected                               0
followers_count                         0
friends_count                           0
listed_count                            0
created_at                              0
favourites_count                        0
utc_offset                            420
time_zone                             420
geo_enabled                             0
verified                                0
statuses_count                          0
lang                                  420
contributors_enabled                    0
is_translator                           0
is_translation_enabled                  0
profile_background_color                0
profile_background_image_url      

**Joining my 2 minimised Datasets together**

In [17]:
tweets = en_df.join(users, rsuffix = '_user')

**As we can see, the length hasn't changed**

In [18]:
len(tweets)

420
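An unchanged length does not by itself show that each tweet was paired with its own user, because `DataFrame.join` matches rows on index labels rather than on position. The sketch below uses toy frames (all names and values are assumptions) to show what happens when one frame keeps the labels left over from filtering and the other has a fresh 0-based index, together with one possible way to align them first.

```python
import pandas as pd

# Toy stand-ins (assumed data): `tweets_toy` keeps the labels left over from
# filtering, `users_toy` has the fresh 0..n-1 index of a flattened frame.
tweets_toy = pd.DataFrame({"full_text": ["a", "b", "c"]}, index=[0, 2, 5])
users_toy = pd.DataFrame({"screen_name": ["u_a", "u_b", "u_c"]})

# join() pairs rows by label: tweet "b" (label 2) receives user "u_c",
# and tweet "c" (label 5) finds no match at all.
misaligned = tweets_toy.join(users_toy)
print(misaligned)

# One possible fix: give the user frame the same labels before joining.
aligned = tweets_toy.join(users_toy.set_index(tweets_toy.index))
print(aligned)
```

In the aligned version the number of rows is the same as before, but every tweet keeps the user that was on the same position in the original column.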

**All the columns are now combined in a single DataFrame**

In [19]:
tweets.columns

Index(['id', 'user', 'created_at', 'source', 'full_text', 'retweet_count',
       'favorite_count', 'id_user', 'id_str', 'name', 'screen_name',
       'location', 'description', 'url', 'protected', 'followers_count',
       'friends_count', 'listed_count', 'created_at_user', 'favourites_count',
       'utc_offset', 'time_zone', 'geo_enabled', 'verified', 'statuses_count',
       'lang', 'contributors_enabled', 'is_translator',
       'is_translation_enabled', 'profile_background_color',
       'profile_background_image_url', 'profile_background_image_url_https',
       'profile_background_tile', 'profile_image_url',
       'profile_image_url_https', 'profile_link_color',
       'profile_sidebar_border_color', 'profile_sidebar_fill_color',
       'profile_text_color', 'profile_use_background_image',
       'has_extended_profile', 'default_profile', 'default_profile_image',
       'following', 'follow_request_sent', 'notifications', 'translator_type',
       'withheld_in_countries', 'ent

In [20]:
tweets.isna().sum()

id                                      0
user                                    0
created_at                              0
source                                  0
full_text                               0
retweet_count                           0
favorite_count                          0
id_user                               252
id_str                                252
name                                  252
screen_name                           252
location                              252
description                           252
url                                   351
protected                             252
followers_count                       252
friends_count                         252
listed_count                          252
created_at_user                       252
favourites_count                      252
utc_offset                            420
time_zone                             420
geo_enabled                           252
verified                          

**Creating my final minimised DataFrame with all the information I need to answer my question**

In [21]:
final_df = tweets[['id','screen_name','created_at','source','full_text','retweet_count','favorite_count','followers_count','friends_count']]

In [22]:
final_df

Unnamed: 0,id,screen_name,created_at,source,full_text,retweet_count,favorite_count,followers_count,friends_count
3,1438236436084391936,CitizenWonk,2021-09-15 20:21:02+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @PBIcanada: We congratulate @GreenpeaceCA o...,7,0,32259.0,27188.0
4,1438236402114772995,mark_trudgeon,2021-09-15 20:20:54+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @cadebe_: @GJGamble @BorisJohnson @trussliz...,1,0,340.0,1516.0
6,1438236275283214343,AspieandMe,2021-09-15 20:20:23+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...","RT @GreenpeaceCA: I scream, you scream, WE ALL...",2,0,387.0,672.0
9,1438235849058045956,gazetka75,2021-09-15 20:18:42+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @LongTimeAmy: TODAY IN HISTORY\n\n1971–Envi...,16,0,1.0,35.0
10,1438235834415714308,arynpekik,2021-09-15 20:18:38+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @zoev213: Happy 50th @greenpeaceusa Join me...,2,0,446.0,443.0
...,...,...,...,...,...,...,...,...,...
1167,1438192539400429577,,2021-09-15 17:26:36+00:00,"<a href=""http://twitter.com/download/iphone"" r...",@cowboy_satin @23brookside @People4Bernie @Ber...,0,1,,
1168,1438192531787825157,,2021-09-15 17:26:34+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @jubileam8: #Glyphosate spraying purposeful...,5,0,,
1177,1438192131206782979,,2021-09-15 17:24:59+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @EDProgramme: Companies like ExxonMobil kne...,17,0,,
1183,1438192017847226369,,2021-09-15 17:24:32+00:00,"<a href=""http://twitter.com/download/android"" ...","RT @AbukaAlfred: Different countries, differen...",5,0,,


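**As we can see, there are 252 missing values in three of my columns (`screen_name`, `followers_count` and `friends_count`). They come from the join above: `users` has a fresh 0 to 419 index, while `en_df` kept its original row labels, so only 168 labels match, and the user data on those rows can belong to a different tweet**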

In [23]:
final_df.isna().sum()

id                   0
screen_name        252
created_at           0
source               0
full_text            0
retweet_count        0
favorite_count       0
followers_count    252
friends_count      252
dtype: int64

In [24]:
final_df[['friends_count','followers_count']].describe()

Unnamed: 0,friends_count,followers_count
count,168.0,168.0
mean,1837.047619,15220.12
std,3291.62433,146543.7
min,1.0,0.0
25%,264.0,193.25
50%,893.0,860.0
75%,1990.5,2084.0
max,27188.0,1889553.0


In [25]:
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [26]:
final_df[['friends_count','followers_count']].describe()

Unnamed: 0,friends_count,followers_count
count,168.0,168.0
mean,1837.04762,15220.11905
std,3291.62433,146543.71983
min,1.0,0.0
25%,264.0,193.25
50%,893.0,860.0
75%,1990.5,2084.0
max,27188.0,1889553.0


**I have been asked to replace the missing values with 0, which is what the next cells do**

In [27]:
final_df['friendsCount_no_na'] = final_df['friends_count'].fillna(0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_df['friendsCount_no_na'] = final_df['friends_count'].fillna(0)


In [28]:
final_df['followers_count_no_na'] = final_df['followers_count'].fillna(0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_df['followers_count_no_na'] = final_df['followers_count'].fillna(0)
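The warning above appears because the DataFrame being modified was itself created by selecting columns from another DataFrame. A minimal sketch of one common way to avoid it, on toy data (the frame and its values are assumptions): take an explicit `.copy()` of the selection before adding columns.

```python
import pandas as pd

# Toy frame with a missing follower count (assumed data).
source = pd.DataFrame({
    "id": [1, 2, 3],
    "followers_count": [10.0, None, 5.0],
    "other": ["x", "y", "z"],
})

# An explicit copy makes the subset an independent DataFrame, so adding a
# column to it is unambiguous and leaves `source` untouched.
subset = source[["id", "followers_count"]].copy()
subset["followers_count_no_na"] = subset["followers_count"].fillna(0)

print(subset)
```

The original column is kept next to the filled one, so it stays possible to tell a real zero from a replaced missing value.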


**Slicing the Dataframe excluding the columns with missing values (they have been replaced by 2 new columns for which the missing values have been replaced by a zero)**

In [29]:
final_df2 = final_df[['id','screen_name','created_at','source','full_text','retweet_count','favorite_count','followers_count_no_na','friendsCount_no_na']]

In [30]:
final_df2.isna().sum()

id                         0
screen_name              252
created_at                 0
source                     0
full_text                  0
retweet_count              0
favorite_count             0
followers_count_no_na      0
friendsCount_no_na         0
dtype: int64

In [31]:
final_df2.dtypes

id                                     int64
screen_name                           object
created_at               datetime64[ns, UTC]
source                                object
full_text                             object
retweet_count                          int64
favorite_count                         int64
followers_count_no_na                float64
friendsCount_no_na                   float64
dtype: object

**Now let's make sure that the minimized dataset is pseudonymized (identifying information about users is removed from user-related columns and from text). Unique users are selected by asking pandas to drop the duplicates (in case the same user posted more than one tweet)**

In [32]:
users = tweets[['screen_name']].drop_duplicates()

In [33]:
users.head()

Unnamed: 0,screen_name
3,CitizenWonk
4,mark_trudgeon
6,AspieandMe
9,gazetka75
10,arynpekik


**Here the index is reset, so that the original row labels become a regular column named "index"; this column will serve as the pseudID**

In [34]:
users = users.reset_index()

**The "index" column is renamed to "pseudID", so that each screen name now sits next to its pseudID**

In [35]:
users = users.rename(columns={'index': 'pseudID'})

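**A dictionary containing the pseudID corresponding to each screen name is created, and a function looks up each row's screen name in it. Screen names are converted to lower case on both sides so that every lookup succeeds. The "screen_name" column is then deleted and every @mention in the text is replaced. Note that rows with a missing screen name all share a single pseudID (421 in the output below). Task 3 completed**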

In [36]:
userids = {}
for screen_name, pseudID in users[['screen_name', 'pseudID']].values.tolist():
    screen_name = str(screen_name).lower()
    userids[screen_name] = str(pseudID).lower()

In [37]:
def pseudonimise_user(row):
    row['pseudID'] = userids[str(row['screen_name']).lower()]
    return row

In [38]:
df_ps = final_df2.apply(pseudonimise_user, axis=1)

In [39]:
del df_ps['screen_name']

In [40]:
df_pseudo = df_ps.replace(to_replace=r'@\S+', value='@mention', regex=True)
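The same idea can be sketched more compactly on toy data. This version swaps in `pd.factorize` to assign the pseudonyms instead of the dictionary lookup used in this notebook; the frame, its values and the column names are assumptions made for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "screen_name": ["Alice", "bob", "alice", None],
    "full_text": ["RT @bob: hi", "thanks @Alice!", "no mentions", "@x @y hello"],
})

# Lower-case the names so that "Alice" and "alice" share one pseudonym.
names = toy["screen_name"].str.lower()

# factorize() numbers the distinct names in order of first appearance;
# missing names receive the sentinel -1 instead of a regular pseudonym.
codes, _ = pd.factorize(names)
toy["pseudID"] = codes

# Mask every @mention in the text, then drop the identifying column.
toy["full_text"] = toy["full_text"].str.replace(r"@\S+", "@mention", regex=True)
toy = toy.drop(columns="screen_name")

print(toy)
```

Restricting the replacement to the text column avoids touching other columns, and the -1 sentinel keeps rows without a known user visibly separate from real users.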

In [41]:
df_pseudo

Unnamed: 0,id,created_at,source,full_text,retweet_count,favorite_count,followers_count_no_na,friendsCount_no_na,pseudID
3,1438236436084391936,2021-09-15 20:21:02+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @mention We congratulate @mention on their ...,7,0,32259.00000,27188.00000,3
4,1438236402114772995,2021-09-15 20:20:54+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @mention @mention @mention @mention @mentio...,1,0,340.00000,1516.00000,4
6,1438236275283214343,2021-09-15 20:20:23+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...","RT @mention I scream, you scream, WE ALL SCREA...",2,0,387.00000,672.00000,6
9,1438235849058045956,2021-09-15 20:18:42+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @mention TODAY IN HISTORY\n\n1971–Environme...,16,0,1.00000,35.00000,9
10,1438235834415714308,2021-09-15 20:18:38+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @mention Happy 50th @mention Join me and su...,2,0,446.00000,443.00000,10
...,...,...,...,...,...,...,...,...,...
1167,1438192539400429577,2021-09-15 17:26:36+00:00,"<a href=""http://twitter.com/download/iphone"" r...",@mention @mention @mention @mention @mention @...,0,1,0.00000,0.00000,421
1168,1438192531787825157,2021-09-15 17:26:34+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @mention #Glyphosate spraying purposefully ...,5,0,0.00000,0.00000,421
1177,1438192131206782979,2021-09-15 17:24:59+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @mention Companies like ExxonMobil knew abo...,17,0,0.00000,0.00000,421
1183,1438192017847226369,2021-09-15 17:24:32+00:00,"<a href=""http://twitter.com/download/android"" ...","RT @mention Different countries, different rac...",5,0,0.00000,0.00000,421


## Challenge 2

For this challenge I will need to merge the sentiment analysis results with the Dataset and check whether the length of the dataframe generated by the merge makes sense.

### Step-by-step Interpretation:



**Importing my Sentiment Analysis**

In [42]:
sent = pd.read_pickle('FlorindoArgondizzo_EN_completed.pkl')

**Checking the length of the data**

In [43]:
len(sent)

420

**Exploring the Dataframe**

In [44]:
sent

Unnamed: 0,id,positive,negative,neutral
3,1438236436084391936,3,-1,1
4,1438236402114772995,1,-1,0
6,1438236275283214343,2,-5,-1
9,1438235849058045956,2,-1,1
10,1438235834415714308,3,-1,1
...,...,...,...,...
1167,1438192539400429577,2,-1,1
1168,1438192531787825157,1,-3,-1
1177,1438192131206782979,2,-1,1
1183,1438192017847226369,1,-2,-1


In [45]:
sent.columns

Index(['id', 'positive', 'negative', 'neutral'], dtype='object')

In [46]:
sent.isna().sum()

id          0
positive    0
negative    0
neutral     0
dtype: int64

In [47]:
sent.dtypes

id          object
positive    object
negative    object
neutral     object
dtype: object

**Converting the columns of the sentiment analysis results to numeric types. This matters especially for the "id" column, which I will use as the unique identifier for the merge: the key must have the same data type in both DataFrames (integers in this case)**

In [48]:
sent[['id','positive','negative','neutral']] = sent[['id','positive','negative','neutral']].apply(pd.to_numeric)

In [49]:
sent.dtypes

id          int64
positive    int64
negative    int64
neutral     int64
dtype: object
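A small sketch of why the conversion matters, using toy frames (the values are assumptions): the key column is stored as text on one side and as integers on the other, and `pd.to_numeric` brings both to the same dtype.

```python
import pandas as pd

# Toy frames (assumed data): ids stored as strings on one side, ints on the other.
scores = pd.DataFrame({"id": ["101", "102"], "positive": ["3", "1"]})
tweets_toy = pd.DataFrame({"id": [101, 102], "retweet_count": [7, 1]})

# Convert every column to a numeric dtype, column by column.
scores = scores.apply(pd.to_numeric)

# The merge key now has the same dtype on both sides.
same_dtype = scores["id"].dtype == tweets_toy["id"].dtype
print(scores.dtypes)
print(same_dtype)
```

Comparing the two dtypes explicitly before merging is a cheap check that the keys can actually match.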

**Performing an inner merge of the 2 DataFrames using the "id" column as unique identifier**

In [50]:
complete = final_df2.merge(sent, on='id', how='inner')
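To check whether the length of a merged DataFrame makes sense, the `validate` and `indicator` parameters of `merge` can help. The sketch below runs on toy frames (names and values are assumptions) and is only one possible way of doing the check.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "full_text": ["a", "b", "c"]})
right = pd.DataFrame({"id": [1, 2, 4], "positive": [3, 1, 2]})

# validate= raises an error if `id` is duplicated on either side;
# indicator= adds a `_merge` column saying where each row came from.
checked = left.merge(right, on="id", how="outer",
                     validate="one_to_one", indicator=True)

# Rows found in both frames are exactly the ones an inner merge keeps.
n_matched = int((checked["_merge"] == "both").sum())
n_inner = len(left.merge(right, on="id", how="inner"))

print(checked["_merge"].value_counts())
print(n_matched, n_inner)
```

If every id appears once in each frame and all ids match, the inner merge has the same length as both inputs, which is the situation expected here.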

**Checking the Dataframe**

In [51]:
complete

Unnamed: 0,id,screen_name,created_at,source,full_text,retweet_count,favorite_count,followers_count_no_na,friendsCount_no_na,positive,negative,neutral
0,1438236436084391936,CitizenWonk,2021-09-15 20:21:02+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @PBIcanada: We congratulate @GreenpeaceCA o...,7,0,32259.00000,27188.00000,3,-1,1
1,1438236402114772995,mark_trudgeon,2021-09-15 20:20:54+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @cadebe_: @GJGamble @BorisJohnson @trussliz...,1,0,340.00000,1516.00000,1,-1,0
2,1438236275283214343,AspieandMe,2021-09-15 20:20:23+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...","RT @GreenpeaceCA: I scream, you scream, WE ALL...",2,0,387.00000,672.00000,2,-5,-1
3,1438235849058045956,gazetka75,2021-09-15 20:18:42+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @LongTimeAmy: TODAY IN HISTORY\n\n1971–Envi...,16,0,1.00000,35.00000,2,-1,1
4,1438235834415714308,arynpekik,2021-09-15 20:18:38+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @zoev213: Happy 50th @greenpeaceusa Join me...,2,0,446.00000,443.00000,3,-1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
415,1438192539400429577,,2021-09-15 17:26:36+00:00,"<a href=""http://twitter.com/download/iphone"" r...",@cowboy_satin @23brookside @People4Bernie @Ber...,0,1,0.00000,0.00000,2,-1,1
416,1438192531787825157,,2021-09-15 17:26:34+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @jubileam8: #Glyphosate spraying purposeful...,5,0,0.00000,0.00000,1,-3,-1
417,1438192131206782979,,2021-09-15 17:24:59+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @EDProgramme: Companies like ExxonMobil kne...,17,0,0.00000,0.00000,2,-1,1
418,1438192017847226369,,2021-09-15 17:24:32+00:00,"<a href=""http://twitter.com/download/android"" ...","RT @AbukaAlfred: Different countries, differen...",5,0,0.00000,0.00000,1,-2,-1


In [52]:
complete.columns

Index(['id', 'screen_name', 'created_at', 'source', 'full_text',
       'retweet_count', 'favorite_count', 'followers_count_no_na',
       'friendsCount_no_na', 'positive', 'negative', 'neutral'],
      dtype='object')

In [53]:
complete.isna().sum()

id                         0
screen_name              252
created_at                 0
source                     0
full_text                  0
retweet_count              0
favorite_count             0
followers_count_no_na      0
friendsCount_no_na         0
positive                   0
negative                   0
neutral                    0
dtype: int64

## Challenge 3

The results of the sentiment analysis present three different columns: ```neutral```,  ```positive```, and ```negative```.

For this challenge, I will need to:
1. Create one variable that summarizes the sentiment (i.e., that somehow aggregates the information of it being positive or negative - or potentially neutral - into one single variable)
2. Describe the sentiment of my tweets (mean, SD, mode; selecting metrics that make sense depending on how the sentiment variable was created).
3. Create a new dataframe by taking a random sample of 15 tweets from my dataset. 
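Task 3 can be sketched with `DataFrame.sample`. The toy frame below and the seed value are assumptions; fixing `random_state` is optional and only serves to make the draw repeatable.

```python
import pandas as pd

# Toy frame with 100 rows standing in for the merged dataset (assumed data).
toy = pd.DataFrame({
    "id": range(100),
    "sent_score": [i % 5 - 2 for i in range(100)],
})

# Draw 15 distinct rows; the seed (arbitrary) makes the sample reproducible.
sample_15 = toy.sample(n=15, random_state=42)

print(len(sample_15))
print(sample_15.head())
```

Sampling is done without replacement by default, so no tweet can appear twice in the new DataFrame.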


### Step-by-step Interpretation:


**Exploring in detail the results of the Sentiment Analysis**

In [54]:
complete['positive'].value_counts()

1    188
2    143
3     81
4      8
Name: positive, dtype: int64

In [55]:
complete['negative'].value_counts()

-1    286
-2     72
-3     41
-4     14
-5      7
Name: negative, dtype: int64

In [56]:
complete['neutral'].value_counts()

 1    190
-1    124
 0    106
Name: neutral, dtype: int64

**Summarizing the scores by creating a new column with the sum of the negative and the positive scores (the neutral scores will be ignored)**

In [57]:
complete['sent_score'] = complete['negative'] + complete['positive']

In [58]:
complete

Unnamed: 0,id,screen_name,created_at,source,full_text,retweet_count,favorite_count,followers_count_no_na,friendsCount_no_na,positive,negative,neutral,sent_score
0,1438236436084391936,CitizenWonk,2021-09-15 20:21:02+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @PBIcanada: We congratulate @GreenpeaceCA o...,7,0,32259.00000,27188.00000,3,-1,1,2
1,1438236402114772995,mark_trudgeon,2021-09-15 20:20:54+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @cadebe_: @GJGamble @BorisJohnson @trussliz...,1,0,340.00000,1516.00000,1,-1,0,0
2,1438236275283214343,AspieandMe,2021-09-15 20:20:23+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...","RT @GreenpeaceCA: I scream, you scream, WE ALL...",2,0,387.00000,672.00000,2,-5,-1,-3
3,1438235849058045956,gazetka75,2021-09-15 20:18:42+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @LongTimeAmy: TODAY IN HISTORY\n\n1971–Envi...,16,0,1.00000,35.00000,2,-1,1,1
4,1438235834415714308,arynpekik,2021-09-15 20:18:38+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @zoev213: Happy 50th @greenpeaceusa Join me...,2,0,446.00000,443.00000,3,-1,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
415,1438192539400429577,,2021-09-15 17:26:36+00:00,"<a href=""http://twitter.com/download/iphone"" r...",@cowboy_satin @23brookside @People4Bernie @Ber...,0,1,0.00000,0.00000,2,-1,1,1
416,1438192531787825157,,2021-09-15 17:26:34+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @jubileam8: #Glyphosate spraying purposeful...,5,0,0.00000,0.00000,1,-3,-1,-2
417,1438192131206782979,,2021-09-15 17:24:59+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @EDProgramme: Companies like ExxonMobil kne...,17,0,0.00000,0.00000,2,-1,1,1
418,1438192017847226369,,2021-09-15 17:24:32+00:00,"<a href=""http://twitter.com/download/android"" ...","RT @AbukaAlfred: Different countries, differen...",5,0,0.00000,0.00000,1,-2,-1,-1


**Asking Pandas to show me up to 500 rows of the Dataframe**

In [59]:
pd.set_option('display.max_rows', 500)

In [60]:
complete[['full_text','sent_score']]

Unnamed: 0,full_text,sent_score
0,RT @PBIcanada: We congratulate @GreenpeaceCA o...,2
1,RT @cadebe_: @GJGamble @BorisJohnson @trussliz...,0
2,"RT @GreenpeaceCA: I scream, you scream, WE ALL...",-3
3,RT @LongTimeAmy: TODAY IN HISTORY\n\n1971–Envi...,1
4,RT @zoev213: Happy 50th @greenpeaceusa Join me...,2
5,"RT @Greenpeace: ""Today we need huge, unprecede...",0
6,Im fundraising a speed climb supporting Green...,1
7,RT @cosmicadriestar: HAPPY BIRTHDAY GREENPEACE!!!,2
8,@Greenpeace @climatemorgan Dolphins from the F...,2
9,@AllatRaTV @BardemAntarctic @carlosbardem @gre...,-1


**Checking final results of the sum of the 2 scores (negative and positive)**

In [61]:
complete['sent_score'].value_counts()

 0    124
 1    119
 2     64
-1     64
-2     29
-3     11
 3      7
-4      2
Name: sent_score, dtype: int64

**Asking pandas to show the results as proportions: 29.5% of the Tweets scored 0, 45.2% scored positive (28.3% scored 1, 15.2% scored 2, 1.7% scored 3) and 25.2% scored negative**

In [62]:
complete['sent_score'].value_counts(normalize=True)

 0   0.29524
 1   0.28333
 2   0.15238
-1   0.15238
-2   0.06905
-3   0.02619
 3   0.01667
-4   0.00476
Name: sent_score, dtype: float64
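
The positive and negative shares quoted for this step are sums over several rows of the output. A minimal sketch of that arithmetic, assuming the counts from the earlier `value_counts()` output are copied by hand into a Series (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the value_counts() output (score -> number of tweets).
counts = pd.Series({0: 124, 1: 119, 2: 64, -1: 64, -2: 29, -3: 11, 3: 7, -4: 2})
total = counts.sum()

# Select rows by the sign of the score (the index) and divide by the total.
share_positive = counts[counts.index > 0].sum() / total
share_neutral = counts[counts.index == 0].sum() / total
share_negative = counts[counts.index < 0].sum() / total

print(f"positive: {share_positive:.1%}, neutral: {share_neutral:.1%}, negative: {share_negative:.1%}")
```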

**Making a function that maps each sentiment score to a label and storing the result in a new column, which says whether the Tweet is positive, negative or neutral. Task 1 completed**

In [63]:
def recategorize(category):
    # a score of 0 means no sentiment, or positive and negative scores that cancel out
    if category == 0:
        return 'neutral'
    # every score above zero counts as positive
    if category in (1, 2, 3, 4):
        return 'positive'
    # every score below zero counts as negative
    if category in (-1, -2, -3, -4, -5):
        return 'negative'
    # anything outside the expected range of scores
    return 'Other'

In [64]:
complete['sentiment'] = complete['sent_score'].apply(recategorize)
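
The same labels can also be derived from the sign of the score. The sketch below is an alternative, not the method used in this notebook: it assumes integer scores and, unlike `recategorize`, has no `'Other'` fallback, so any score outside the expected range is labelled by its sign. The `scores` Series is a hypothetical stand-in for `complete['sent_score']`.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; in the notebook this would be complete['sent_score'].
scores = pd.Series([2, 0, -3, 1, 2])

# np.sign collapses every score to -1, 0 or 1, which is then mapped to a label.
labels = np.sign(scores).map({-1: 'negative', 0: 'neutral', 1: 'positive'})
print(labels.tolist())
```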

**The sentiment scoring algorithm seems to perform well! The first tweet of the sample shows a positive text and it is reported with a positive sentiment score. The third tweet of the sample (index 2) instead shows a quite negative text and the algorithm reported it as negative. It does look quite reliable and I agree with all of the judgements it made in the sample**

In [65]:
complete.head()

Unnamed: 0,id,screen_name,created_at,source,full_text,retweet_count,favorite_count,followers_count_no_na,friendsCount_no_na,positive,negative,neutral,sent_score,sentiment
0,1438236436084391936,CitizenWonk,2021-09-15 20:21:02+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @PBIcanada: We congratulate @GreenpeaceCA o...,7,0,32259.0,27188.0,3,-1,1,2,positive
1,1438236402114772995,mark_trudgeon,2021-09-15 20:20:54+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @cadebe_: @GJGamble @BorisJohnson @trussliz...,1,0,340.0,1516.0,1,-1,0,0,neutral
2,1438236275283214343,AspieandMe,2021-09-15 20:20:23+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...","RT @GreenpeaceCA: I scream, you scream, WE ALL...",2,0,387.0,672.0,2,-5,-1,-3,negative
3,1438235849058045956,gazetka75,2021-09-15 20:18:42+00:00,"<a href=""http://twitter.com/download/iphone"" r...",RT @LongTimeAmy: TODAY IN HISTORY\n\n1971–Envi...,16,0,1.0,35.0,2,-1,1,1,positive
4,1438235834415714308,arynpekik,2021-09-15 20:18:38+00:00,"<a href=""http://twitter.com/download/android"" ...",RT @zoev213: Happy 50th @greenpeaceusa Join me...,2,0,446.0,443.0,3,-1,1,2,positive


**As we can see, the average number of Retweets and Favorites is higher for positive Tweets (means of 38.1 Retweets and 3.5 Favorites) than for neutral Tweets (13.3 and 0.6) or negative Tweets (11.4 and 0.5). This already partially answers my Research Question: in this sample, positive sentiment expressed in a post (a Tweet in this case) is associated with higher user engagement**

In [66]:
complete.groupby('sentiment')[['retweet_count','favorite_count']].describe().transpose()

Unnamed: 0,sentiment,negative,neutral,positive
retweet_count,count,106.0,124.0,190.0
retweet_count,mean,11.41509,13.32258,38.11579
retweet_count,std,34.58898,36.35161,73.60518
retweet_count,min,0.0,0.0,0.0
retweet_count,25%,0.0,0.0,1.0
retweet_count,50%,0.0,2.0,16.0
retweet_count,75%,15.0,14.0,62.75
retweet_count,max,268.0,333.0,648.0
favorite_count,count,106.0,124.0,190.0
favorite_count,mean,0.48113,0.55645,3.54737
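
In the table above the standard deviation of `retweet_count` is larger than the mean in every group, which suggests skewed distributions in which a few heavily retweeted Tweets pull the mean upwards. A minimal sketch of comparing the mean with the median per group, using made-up numbers (the `toy` frame is not taken from the dataset):

```python
import pandas as pd

# Made-up numbers for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    'sentiment': ['positive', 'positive', 'positive', 'negative', 'negative', 'negative'],
    'retweet_count': [0, 10, 500, 0, 2, 4],
})

# One viral tweet inflates the mean of the positive group; the median is unaffected by its size.
summary = toy.groupby('sentiment')['retweet_count'].agg(['mean', 'median'])
print(summary)
```

In the real data the medians (the `50%` rows) point in the same direction as the means: 16 Retweets for positive Tweets against 2 for neutral and 0 for negative ones.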


**The average sentiment score is 0.25, slightly above zero, so the Tweets extracted with twarc lean positive overall. Task 2 completed**

In [67]:
complete['sent_score'].describe()

count   420.00000
mean      0.25000
std       1.31337
min      -4.00000
25%      -1.00000
50%       0.00000
75%       1.00000
max       3.00000
Name: sent_score, dtype: float64

**Making a sample of the Dataframe as requested by the project**

In [68]:
sample = complete.sample(15)
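
Without a seed, `sample(15)` returns different rows on every run. If the sample needs to be reproducible, a seed can be passed through `random_state`. A minimal sketch, assuming a hypothetical stand-in `df` for the `complete` DataFrame and an arbitrary seed of 42:

```python
import pandas as pd

# Hypothetical stand-in for the `complete` DataFrame.
df = pd.DataFrame({'id': range(100), 'sent_score': [i % 5 - 2 for i in range(100)]})

# random_state fixes the seed, so both calls select the same 15 rows.
sample_a = df.sample(n=15, random_state=42)
sample_b = df.sample(n=15, random_state=42)
print(sample_a.equals(sample_b))
```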

**Saving the Dataframe**

In [69]:
complete.to_pickle('COMPLETE.pkl')
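
A pickle file keeps the column types of the DataFrame and can be loaded back with `pd.read_pickle`. The round trip is sketched below on a small hypothetical frame and a temporary file (the file name `example.pkl` is an assumption, not the file saved above):

```python
import os
import tempfile

import pandas as pd

# Small hypothetical frame standing in for `complete`.
df = pd.DataFrame({'full_text': ['a', 'b'], 'sent_score': [1, -2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.pkl')  # hypothetical file name
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# True when values, column names and dtypes all survived the round trip.
print(restored.equals(df))
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.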

**Importing `re` and `Counter` (from the `collections` module), just out of curiosity, to see which words were most common in the Tweets**

In [70]:
import re
from collections import Counter

In [71]:
texts = complete['full_text'].values.tolist()

**Writing a loop that counts the tokens (words and punctuation marks) across all the Tweets**

In [72]:
total_words = Counter()
for text in texts:
    # making text lower case
    text = text.lower()
    # removing URLs: dropping every space-separated item that contains 'http'
    newtext = ' '.join(item for item in text.split(' ') if 'http' not in item)
        
    # splitting the text in words (tokens)
    tokens = re.findall(r"[\w']+|[.,!?;$@]", newtext)
    for token in tokens:
        total_words[token] += 1
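
Because the pattern also matches punctuation marks, and very frequent function words are counted too, the top of the ranking is dominated by tokens such as `@`, `,` and `the`. A possible variant that counts only content words is sketched below. The function name, the tiny stopword list and the two example Tweets are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {'the', 'to', 'a', 'of', 'and', 'is', 'in', 'rt'}

def count_content_words(texts):
    """Count word tokens, skipping URLs, punctuation and stopwords."""
    counts = Counter()
    for text in texts:
        # Drop every space-separated item that contains 'http'.
        kept = ' '.join(item for item in text.lower().split(' ') if 'http' not in item)
        # Match words only, so '@', ',' and '!' are never counted as tokens.
        for token in re.findall(r"[\w']+", kept):
            if token not in STOPWORDS:
                counts[token] += 1
    return counts

# Made-up example Tweets.
example = [
    "RT @Greenpeace: Happy birthday to the planet! https://t.co/xyz",
    "Happy 50th, Greenpeace",
]
word_counts = count_content_words(example)
print(word_counts.most_common(3))
```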
    

**Asking Python to show me the 100 most common words used in the Tweets**

In [73]:
total_words.most_common(100)

[('@', 1127),
 (',', 415),
 ('.', 360),
 ('the', 295),
 ('greenpeace', 287),
 ('rt', 255),
 ('to', 248),
 ('a', 162),
 ('of', 160),
 ('and', 148),
 ('!', 141),
 ('is', 140),
 ('in', 134),
 ('you', 131),
 ('for', 109),
 ('we', 86),
 ('today', 72),
 ('this', 67),
 ('happy', 61),
 ('our', 60),
 ('are', 58),
 ('on', 56),
 ('it', 55),
 (';', 54),
 ('years', 54),
 ('like', 52),
 ('i', 51),
 ('from', 51),
 ('all', 49),
 ('birthday', 48),
 ('?', 46),
 ('that', 45),
 ('action', 43),
 ('day', 42),
 ('can', 41),
 ('50th', 39),
 ('s', 38),
 ('green', 34),
 ('gt', 34),
 ('be', 32),
 ('greenpeaceusa', 32),
 ('thank', 31),
 ('have', 30),
 ('more', 30),
 ('greenpeaceca', 29),
 ('need', 29),
 ('climate', 29),
 ('so', 29),
 ('anniversary', 28),
 ('luisamneubauer', 28),
 ('seashepherd', 27),
 ('as', 27),
 ('50', 27),
 ('borisjohnson', 26),
 ('chrisgpackham', 26),
 ('seasaver', 26),
 ('carolinelucas', 26),
 ('change', 26),
 ('what', 26),
 ('has', 26),
 ('gjgamble', 25),
 ('trussliz', 25),
 ('environmental