# Challenges for week 3

Now that we've seen how to clean in Pandas, it's time for you to apply this knowledge. This week has three challenges. Make sure to give it a try and complete all of them. 

Each challenge has three components:
1. **Programming**: Applying one of the programming or data analysis steps in Python you learned in the tutorial
2. **Interpretation**: Explaining what you are doing and interpreting the results of the data analysis in MarkDown 
3. **Reflection**: Connecting these concepts with the literature of the week in a short reflection (*max 300 words*)

**Some important notes for the challenges:**
1. These challenges are a warming up, and help you get ready for class. Make sure to give them a try on all of them. If you get an error message, try to troubleshoot it (using Google often helps). If all else fails, go to the next challenge (but make sure to hand it in).
2. While we of course like when you get all the answers right, the important thing is to exercise and apply the knowledge. So we will still accept challenges that may not be complete, as long as we see enough effort *for each challenge*. The rubric (see Canvas) reflects this.
3. Delivering the challenge on time on Canvas assignment is critical, as it helps also prepare for the DA live session. Check on Canvas how to hand it in.

### Facing issues? 

We are constantly monitoring the issues on the GitHub to help you out. Don't hesitate to log an issue there, explaining well what the problem is, showing the code you are using, and the error message you may be receiving. 

**Important:** We are only monitoring the repository in weekdays, from 9.30 to 17.00. Issues logged after this time will most likely be answered the next day. This means you should not wait for our response before submitting a challenge :-)




## Getting setup for the challenges

We will again use  Twitter data for the challenges of this week. This means you need:
* Twitter data you have collected using twarc
* The sentiment analysis results (get them from SurfDrive)

**All the challenges below are with this Twitter data. Make sure to start your challenge by doing the basics of loading and inspecting the data, even if not specified in challenge itself.**



## Challenge 1

In the last tutorial, we have talked about the importance of data minimization and pseudonimization. Now you know it, we would like to ask you to prepare your Twitter dataset in this way.

Imagine you are currenlty working on the following research question:

To what extent does the sentiment expressed in a tweet influence user engagement with the tweet (likes and retweets)?

### Programming challenge

Using your Twitter data collected via ```twarc```, load it  with ```pandas``` and
1. Determine which variables are relevant for the research question (don't forget to include control variables that you think are relevant related both to users and tweets).
2. Create a minimized dataset (dataset with only variables necessary to answer the research question).
3. Make sure that the minimized dataset is pseudonymized (identifying information about users is removed from user-related columns and from text).

### Reflection

Tucker (2019) discusses privacy risks involved when data is used for artificial intelligence applications and digital analytics, namely *data persistance, repurposing and spillovers*. Select one of the risks and reflect to what extent such technical solutions as data minimization and pseudonimization can mitigate these risks. Is more needed to protect privacy of users?


### Choosing releavnt columns:

IV: Sentiment - I need text

DV: Engagement - I need likes and retweets

Control variables - number of followers of the author, author of the tweet

In [1]:
import pandas as pd

In [2]:
df_jsonl = pd.read_json('/Users/jstrych1/Documents/Digital_Analytics/2223s1_materials/DA-StudentFiles/LocalFiles/results_privacy.jsonl', lines=True)

In [4]:
df_jsonl.head()

Unnamed: 0,data,includes,meta,__twarc,errors
0,"[{'source': 'Twitter for Android', 'id': '1496...",{'users': [{'pinned_tweet_id': '14960329870493...,"{'newest_id': '1496434995132575748', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,
1,[{'attachments': {'media_keys': ['3_1496434149...,{'media': [{'url': 'https://pbs.twimg.com/medi...,"{'newest_id': '1496434152077471745', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,"[{'resource_id': '1496433036564791297', 'param..."
2,"[{'lang': 'en', 'author_id': '285509027', 'sou...",{'users': [{'public_metrics': {'followers_coun...,"{'newest_id': '1496433251954606081', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,"[{'parameter': 'entities.mentions.username', '..."
3,"[{'reply_settings': 'everyone', 'public_metric...","{'users': [{'protected': False, 'location': 'L...","{'newest_id': '1496432538725883904', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,"[{'parameter': 'entities.mentions.username', '..."
4,"[{'conversation_id': '1496431783487086592', 'a...",{'users': [{'profile_image_url': 'https://pbs....,"{'newest_id': '1496431783487086592', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,


Now I need to make sure I get the relevant information from this dataframe using functions shown in week 2. It looks like I also need to get the information about users to create my control variables

In [5]:
def get_public_metrics(row):
    if 'public_metrics' in row.keys():
        if type(row['public_metrics']) == dict:
            for key, value in row['public_metrics'].items():
                row['metric_' + str(key)] = value
    return row

def get_tweets(df):
    if 'data' not in df.columns:
        return None
    results = pd.DataFrame()
    for item in df['data'].values.tolist():
        results = pd.concat([results, pd.DataFrame(item)])
        
    results = results.apply(get_public_metrics, axis=1)
        
    results = results.reset_index()
    del results['index']
        
    return results

In [6]:
#This dataframe contains all tweets - one tweet per row
tweets = get_tweets(df_jsonl) 

In [7]:
def get_users(df):
    if 'includes' not in df.columns:
        return None
    results = pd.DataFrame()
    for item in df['includes'].values.tolist():
        results = pd.concat([results,pd.DataFrame(item['users'])])
    
    results = results.apply(get_public_metrics, axis=1)
       
    results = results.reset_index()
    del results['index']
        
    return results

In [8]:
#This dataframe contains all users - one user per row
users = get_users(df_jsonl)

In [8]:
df_user.head()

Unnamed: 0,id,id_str,name,screen_name,location,description,url,protected,followers_count,friends_count,...,has_extended_profile,default_profile,default_profile_image,following,follow_request_sent,notifications,translator_type,withheld_in_countries,entities.description.urls,entities.url.urls
0,1427919498,1427919498,DarSzym 💯🇵🇱,darszym1,"Warszawa, Polska","Naród może przetrwać gdy rządzą nim głupcy, a ...",,False,1759,1757,...,False,False,False,False,False,False,none,[],[],
1,1405176502396895239,1405176502396895239,Hye,nhnaychiw,United States,|🐳|\n•b a n g t a n b o y s•\n\n\n\n\n\n\n\n\n...,,False,21,205,...,True,True,False,False,False,False,none,[],[],
2,3371551073,3371551073,Natamas,s4nuy3,,7 και μπλε,,False,879,166,...,False,False,False,False,False,False,none,[],[],
3,1389404162899673090,1389404162899673090,Madeleine Love,MaddyLoveSpare,Indi electorate,A pandemic is the consequence of two things - ...,,False,230,222,...,False,True,False,False,False,False,none,[],[],
4,188739286,188739286,Médias Libres,mediaslibres,,Agrégateur des (glorieux) médias alternatifs f...,https://t.co/7aNjkNZHAn,False,6332,26,...,False,False,False,False,False,False,none,[],"[{'url': 'https://t.co/7aNjkNZHAn', 'expanded_...","[{'url': 'https://t.co/7aNjkNZHAn', 'expanded_..."


Now I have two dataframes - with users and with Tweets. To make sure I have my control variables, I need to merge them. Before merging, I make sure that I have a dataframe with unique user information.

In [9]:
#users_unique has unique users only
users_unique = users.drop_duplicates(subset=['id'])

In [10]:
df = tweets.merge(users_unique, how='left', left_on='author_id', right_on='id', suffixes=('_tweets', '_users'))

In [11]:
df.columns

Index(['source', 'id_tweets', 'author_id', 'possibly_sensitive',
       'reply_settings', 'created_at_tweets', 'conversation_id',
       'public_metrics_tweets', 'referenced_tweets', 'entities_tweets', 'lang',
       'text', 'context_annotations', 'in_reply_to_user_id', 'attachments',
       'geo', 'metric_retweet_count', 'metric_reply_count',
       'metric_like_count', 'metric_quote_count', 'pinned_tweet_id',
       'description', 'created_at_users', 'profile_image_url', 'name',
       'username', 'id_users', 'verified', 'public_metrics_users', 'url',
       'protected', 'entities_users', 'location', 'metric_followers_count',
       'metric_following_count', 'metric_tweet_count', 'metric_listed_count'],
      dtype='object')

It looks like the merge has worked well and I have a dataframe with all the necessary information. However, it is more than I need for my research question so let's make sure to minimize and pseudonymize the data.

Now, let's minimize the dataset.

It looks like I need the following columns:
* id
* full text
* retweets
* likes
* user name
* followers count

In [12]:
df_min = df[['id_tweets', 'text', 'metric_retweet_count', 'metric_like_count', 'username', 
                  'metric_followers_count']]

In [13]:
df_min.to_pickle('minimized_tweets.pkl')

In [14]:
df_min.head()

Unnamed: 0,id_tweets,text,metric_retweet_count,metric_like_count,username,metric_followers_count
0,1496434995132575748,RT @tatuadoysafado: Gravei meu primeiro conteu...,208,0,Thirolagrossa,33
1,1496434990413975557,Conservazione sostitutiva e messa a norma in a...,0,0,hitechlaw,454
2,1496434987666755588,RT @Mindfulness_DQ: More prayer. More self-car...,2011,0,cleopatrabbg,282
3,1496434984332275715,RT @TechHerNG: Google is rethinking its privac...,4,0,CCConsultingSL,417
4,1496434984076419074,RT @odisha_police: OTP is not only stand for t...,64,0,spdeogarh,3951


### Anonymization

Now, let's make sure we do not have identifying information in the Tweet (at least when it comes to user names)

In [15]:
users = df[['username']].drop_duplicates()

In [16]:
len(users)

57987

In [17]:
len(df_min)

71537

In [18]:
users.head(20)

darszym1          1
twilgaming        1
msmack68          1
Duskyqueen20      1
PersistFighter    1
                 ..
EcnErin           1
MGHAllergy        1
PeterLougee       1
dassakaye         1
GerardAraud       1
Name: screen_name, Length: 57987, dtype: int64

In [20]:
users = users.reset_index()

In [21]:
users.head()

Unnamed: 0,index,screen_name
0,0,darszym1
1,1,nhnaychiw
2,2,s4nuy3
3,3,MaddyLoveSpare
4,4,mediaslibres


In [22]:
users = users.rename(columns={'index': 'pseudID'})

In [23]:
users.head()

Unnamed: 0,pseudID,screen_name
0,0,darszym1
1,1,nhnaychiw
2,2,s4nuy3
3,3,MaddyLoveSpare
4,4,mediaslibres


In [24]:
df_min = df_min.merge(users, how='left', on='username')

Let's check if everything has worked out.

In [25]:
df_min['username'].value_counts()

'2'

In [26]:
df_min['pseudID'].value_counts()

In [27]:
del df_min['username']

In [32]:
df_min.head()

Unnamed: 0,id,full_text,retweet_count,followers_count,favorite_count,pseudID
0,1438405325690974209,RT @janek917: Tabela 5 raportu „Vaccine Survei...,3,1759,0,0
1,1438405322603999239,RT @JusticeMyanmar: .@TelenorGroup is trying t...,118,21,0,1
2,1438405320204853248,“conspiracy practices —the methods by which tr...,0,879,0,2
3,1438405307693027328,@NSWHealth Have you done any work to estimate ...,0,230,0,3
4,1438405243365101568,De quel type de surveillance le passe sanitair...,0,6332,0,4


Let's also make sure to remove all users names from tweet texts.

In [33]:
df_min= df_min.replace(to_replace ='@\S+', value = '@mention', regex = True)

In [15]:
df_min.head()

Unnamed: 0,id_tweets,text,metric_retweet_count,metric_like_count,username,metric_followers_count
0,1496434995132575748,RT @tatuadoysafado: Gravei meu primeiro conteu...,208,0,Thirolagrossa,33
1,1496434990413975557,Conservazione sostitutiva e messa a norma in a...,0,0,hitechlaw,454
2,1496434987666755588,RT @Mindfulness_DQ: More prayer. More self-car...,2011,0,cleopatrabbg,282
3,1496434984332275715,RT @TechHerNG: Google is rethinking its privac...,4,0,CCConsultingSL,417
4,1496434984076419074,RT @odisha_police: OTP is not only stand for t...,64,0,spdeogarh,3951


## Challenge 2

Last week, you requested sentiment analysis from us. Now, you will work further with it.

### Programming challenge

Merge the sentiment analysis results with your Twitter data. Make sure to check whether the length of the dataframe generated by the merge makes sense.

### Reflection
Possler et al. (2019) describe different ways communication scientists can access digital trace data. Looking at the different ways they describe and the advanatages and disadvantages they mention, reflect on the way you have collected data with twarc. In what way is the data collection method limiting? What are the opportunities and challanges you see?  


In [16]:
df_min = pd.read_pickle('minimized_tweets.pkl')

In [17]:
df_min.dtypes

id_tweets                  int64
text                      object
metric_retweet_count       int64
metric_followers_count     int64
metric_like_count          int64
pseudID                   object
dtype: object

In [19]:
df_min['id_tweets'].value_counts()

1438405325690974209    1
1437227708480643074    1
1437227535683698690    1
1437227569611296777    1
1437227587944783874    1
                      ..
1437775002535878667    1
1437775000401022983    1
1437774993203638281    1
1437774984248627201    1
1436627917321166852    1
Name: id_tweets, Length: 71537, dtype: int64

In [20]:
sentiment = pd.read_pickle('/Users/jstrych1/Downloads/JoannaStrycharz_EN_completed.pkl')

In [21]:
sentiment.head()

Unnamed: 0,id,positive,negative,neutral
1,1438405322603999239,1,-1,0
2,1438405320204853248,2,-2,-1
3,1438405307693027328,1,-1,0
5,1438405226801807364,2,-3,-1
6,1438405184544067589,2,-4,-1


In [22]:
sentiment.dtypes

id          object
positive    object
negative    object
neutral     object
dtype: object

In [23]:
sentiment['id'].value_counts()

1438405322603999239    1
1437182041381089283    1
1437182574665871365    1
1437182572732182529    1
1437182544647237633    1
                      ..
1437792639538057224    1
1437792634555228168    1
1437792520205918210    1
1437792488773799950    1
1436627917321166852    1
Name: id, Length: 60926, dtype: int64

Conditions for merging:
* Same name
* Same data type
* Unique key

In [24]:
sentiment['id'] = sentiment['id'].apply(pd.to_numeric)

In [25]:
sentiment.dtypes

id           int64
positive    object
negative    object
neutral     object
dtype: object

In [28]:
df_min = df_min.rename(columns={'id_tweets': 'id'})

In [29]:
len(df_min), len(sentiment)

(71537, 60926)

In [30]:
df_min.merge(sentiment, on='id')

Unnamed: 0,id,text,metric_retweet_count,metric_followers_count,metric_like_count,pseudID,positive,negative,neutral
0,1438405322603999239,RT @mention .@mention is trying to skirt human...,118,21,0,1,1,-1,0
1,1438405320204853248,“conspiracy practices —the methods by which tr...,0,879,0,2,2,-2,-1
2,1438405307693027328,@mention Have you done any work to estimate th...,0,230,0,3,1,-1,0
3,1438405226801807364,"📝We have examined the design, implementation a...",0,12992,0,5,2,-3,-1
4,1438405184544067589,"first time trying out the #sarkargame. and, it...",0,197,0,6,2,-4,-1
...,...,...,...,...,...,...,...,...,...
60921,1436628090692808705,"RT @mention Twenty years on from 9/11, and the...",11,2103,0,71530,1,-2,-1
60922,1436628088901840898,Despite the heavy surveillance of SAC forces i...,0,556,0,71531,1,-2,-1
60923,1436628042512695297,RT @mention More surveillance cameras going up...,4,8202,0,71532,1,-1,0
60924,1436627933167296513,RT @mention Holyyyyy shit. I can’t believe thi...,2072,103,0,71535,2,-2,-1


In [31]:
len(df_min.merge(sentiment, on='id'))

60926

In [32]:
len(df_min.merge(sentiment, on='id', how='left'))

71537

In [33]:
df_min.merge(sentiment, on='id', how='left').isna().sum()

id                            0
text                          0
metric_retweet_count          0
metric_followers_count        0
metric_like_count             0
pseudID                       0
positive                  10611
negative                  10611
neutral                   10611
dtype: int64

In [34]:
len(df_min.merge(sentiment, on='id', how='right'))

60926

In [35]:
df_min.merge(sentiment, on='id', how='right').isna().sum()

id                        0
text                      0
metric_retweet_count      0
metric_followers_count    0
metric_like_count         0
pseudID                   0
positive                  0
negative                  0
neutral                   0
dtype: int64

In [36]:
len(df_min.merge(sentiment, on='id', how='outer'))

71537

In [37]:
df_min.merge(sentiment, on='id', how='outer').isna().sum()

id                            0
text                          0
metric_retweet_count          0
metric_followers_count        0
metric_like_count             0
pseudID                       0
positive                  10611
negative                  10611
neutral                   10611
dtype: int64

In [38]:
len(df_min.merge(sentiment, on='id', how='inner'))

60926

In [39]:
df_merged = df_min.merge(sentiment, on='id', how='inner')

## Challenge 3

The sentiment analysis results has three interesting columns: ```neutral```,  ```positive```, and ```negative```. It is coming from the SentiStrength (http://sentistrength.wlv.ac.uk/) algorithm, trinary version.

### Programming challenge

1. Create one variable that summarizes the sentiment (i.e., that somehow aggregates the information of it being positive or negative - or potentially neutral - into one single variable)
2. Describe the sentiment of your tweets (mean, SD, mode - select metrics that make sense depending on how you created the sentiment variable).
3. Create a new dataframe taking a random sample of 15 tweets from your dataset. 

*Tip1: Pandas makes it easy to run numerical operations across columns. Let's say that I want to multiply the value that is in column A by the value that is in column B and store it in column C... I can simply use:*
```df['C'] = df['A'] * df['B']```

*Tip2: Use ```df.sample``` that we learned in the last tutorial to take a random sample of your tweets.*

### Reflection
In this challenge you were asked to assess on reliability of sentiment analysis. For the reflection, we would like to ask you to reflect on the implication of large-scale usage of sentiment analysis. What does the reliability mean for automated decision made based on sentiment analysis of different digital traces? What challanges and threats can you identify?




### Creating sentiment variable

In [40]:
df_merged.dtypes

id                         int64
text                      object
metric_retweet_count       int64
metric_followers_count     int64
metric_like_count          int64
pseudID                   object
positive                  object
negative                  object
neutral                   object
dtype: object

In [41]:
df_merged['positive'] = df_merged['positive'].apply(pd.to_numeric)

In [42]:
df_merged['negative'] = df_merged['negative'].apply(pd.to_numeric)

In [43]:
df_merged.dtypes

id                         int64
text                      object
metric_retweet_count       int64
metric_followers_count     int64
metric_like_count          int64
pseudID                   object
positive                   int64
negative                   int64
neutral                   object
dtype: object

In [44]:
df_merged['overall_sent'] = df_merged['negative'] + df_merged['positive']

In [45]:
df_merged.head()

Unnamed: 0,id,text,metric_retweet_count,metric_followers_count,metric_like_count,pseudID,positive,negative,neutral,overall_sent
0,1438405322603999239,RT @mention .@mention is trying to skirt human...,118,21,0,1,1,-1,0,0
1,1438405320204853248,“conspiracy practices —the methods by which tr...,0,879,0,2,2,-2,-1,0
2,1438405307693027328,@mention Have you done any work to estimate th...,0,230,0,3,1,-1,0,0
3,1438405226801807364,"📝We have examined the design, implementation a...",0,12992,0,5,2,-3,-1,-1
4,1438405184544067589,"first time trying out the #sarkargame. and, it...",0,197,0,6,2,-4,-1,-2


In [46]:
df_merged['overall_sent'].describe()

count    60926.000000
mean        -0.536027
std          1.149694
min         -4.000000
25%         -1.000000
50%          0.000000
75%          0.000000
max          4.000000
Name: overall_sent, dtype: float64

In [47]:
df_sample_sent = df_merged.sample(15)

In [48]:
df_sample_sent

Unnamed: 0,id,text,metric_retweet_count,metric_followers_count,metric_like_count,pseudID,positive,negative,neutral,overall_sent
5809,1438214311223218181,@mention Erm....empty them......then save the ...,0,182,0,6362,3,-2,1,1
52571,1436785704290357251,RT @mention Holyyyyy shit. I can’t believe thi...,2072,142,0,62243,2,-2,-1,0
3577,1438277219219017729,@mention @mention @mention @mention @mention I...,0,38,1,3898,2,-2,-1,0
40795,1437176415770775557,RT @mention #UPDATE The head of the UN's nucle...,32,240,0,21915,1,-2,-1,-1
51241,1436815332342124546,"RT @mention George W. Bush, the man who brough...",1232,107,0,60802,1,-4,-1,-3
34541,1437434706186883085,RT @mention dear @mention - stocks are moved b...,67,122,0,39899,2,-1,1,1
15203,1437937575126253572,RT @mention Activision Blizzard workers are su...,395,31,0,17069,1,-3,-1,-2
28387,1437568645258006533,"RT @mention Of course, the answer to the incre...",1,11,0,32088,1,-2,-1,-1
46131,1437007661351018496,RT @mention As a police officer when undertaki...,339,508,0,55190,1,-1,0,0
1359,1438350143934418945,RT @mention Woman struck by a stray bullet whi...,10,45,0,1549,1,-2,-1,-1


In [49]:
pd.set_option('display.max_colwidth', None)

In [51]:
df_sample_sent[['text', 'overall_sent']]

Unnamed: 0,text,overall_sent
5809,@mention Erm....empty them......then save the bloody money spent on QR code surveillance!,1
52571,RT @mention Holyyyyy shit. I can’t believe this is a thing (a mouse over a watch face to keep from going into “away” status in Teams). T…,0
3577,@mention @mention @mention @mention @mention Im genuinely curious how you think that works. The only way you could think there is a causality from crime to police interaction and not the other way around is if a) whenever a crime is committed police just materialises or b) we are all under perfect surveillance at all time,0
40795,RT @mention #UPDATE The head of the UN's nuclear watchdog hailed a deal struck with Iran over access to surveillance equipment at Iranian nucl…,-1
51241,"RT @mention George W. Bush, the man who brought us torture, indefinite detention, assassination, and mass surveillance, rebukes those who’…",-3
34541,RT @mention dear @mention - stocks are moved back to ASM 4 in a day...can't find the circular on nse notices. can u plz explain t…,1
15203,"RT @mention Activision Blizzard workers are suing, saying the company is using coercive tactics to stop them from unionizing — incl. intimi…",-2
28387,"RT @mention Of course, the answer to the increasing intrusion of the mass surveillance state into every aspect of our lives is to give i…",-1
46131,"RT @mention As a police officer when undertaking surveillance, it required 6 cars and 12 men. We did this for drug dealers and ot…",0
1359,RT @mention Woman struck by a stray bullet while inside of her residence last night on the 2300 block of 16th St SE. She was taken t…,-1


In [53]:
df_sample_sent['text'][52571]

'RT @mention Holyyyyy shit. I can’t believe this is a thing (a mouse over a watch face to keep from going into “away” status in Teams). T…'