# Warm-up challenges for week 2

Now that we've seen how Python and Jupyter Notebooks work and you have read about Digital Analytics and Computational Social Science, it's time for you to apply this knowledge. This week has three preparatory challenges.

Each challenge has two components:
1. **Programming and interpretation**
2. **Reflection**

**Some important notes for the challenges:**
1. These challenges are a warm-up and help you get ready for class. Make sure to give all of them a try.
2. If you get an error message, try to troubleshoot it (using Google often helps). If all else fails, go to the next challenge (but make sure to give it a try).
3. These challenges are ungraded, yet they help you prepare for the graded challenges in the portfolio. If you want to be efficient, have a look at what you need to do for the upcoming graded challenges and see how to combine the work.

### Facing issues? 

We are constantly monitoring the issues on GitHub to help you out. Don't hesitate to log an issue there, clearly explaining the problem and showing the code you are using and any error message you receive.

**Important:** We only monitor the repository on weekdays, from 9.30 to 17.00. Issues logged after this time will most likely be answered the next day.


### Using Markdown

1. Make sure to combine code *and* markdown to answer all questions. Mention specifically the question (and question number) and the answer in markdown, relating to the code and the output of the code. For the graded challenges, failing to do so will impact the grade, as we will not be able to see whether you answered the question.
2. For every line of code, please include a Markdown cell explaining what the code is expected to do.


## Challenge 1


### Programming challenge

Collect Twitter data using ```twarc``` if your developer credentials have been approved (see the Twitter API training details on Canvas), or request Twitter data from the lecturers (see Canvas) and use that file. Load that file using ```pandas``` and:
1. Display the first 5 rows
2. Check which columns the dataframe contains
3. Check which columns have missing values
4. Check how many tweets are available in the dataset
5. Calculate the average and the standard deviation of retweets and likes
6. Calculate the average, minimum and maximum number of followers that users in the dataset have
7. Indicate the most popular languages in the dataset


### Reflection

Wagner et al. (2021) discuss a set of important challenges for measurement in what they call *algorithmically infused societies*. Select one relevant challenge that they discuss in their text, and relate it to the Twitter data you have just loaded and reviewed. How are the measures that you have just reported to stakeholders (in the interpretation section) affected by this challenge? Please motivate your response, and be as specific as possible.

In [1]:
import pandas as pd

In [2]:
df = pd.read_json('TheoAraujo.jsonl', lines=True)

In [3]:
df.head()

Unnamed: 0,data,includes,meta,__twarc,errors
0,[{'text': 'RT @I_Am_The_ICT: Screenshot this p...,"{'users': [{'username': 'Awoken_Soul_', 'prote...","{'newest_id': '1570085814817722368', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,
1,[{'text': 'RT @jeffreyboadi_: All those exclai...,"{'users': [{'username': 'liligotthekeys', 'pro...","{'newest_id': '1570084255195070468', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,"[{'resource_id': '1569959556025036800', 'param..."
2,"[{'lang': 'en', 'id': '1570082754871582721', '...",{'users': [{'public_metrics': {'followers_coun...,"{'newest_id': '1570082754871582721', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,
3,"[{'public_metrics': {'retweet_count': 130, 're...",{'users': [{'profile_image_url': 'https://pbs....,"{'newest_id': '1570082705618120705', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,
4,"[{'id': '1570082631567691776', 'source': 'Twit...",{'users': [{'public_metrics': {'followers_coun...,"{'newest_id': '1570082631567691776', 'oldest_i...",{'url': 'https://api.twitter.com/2/tweets/sear...,


In [4]:
df.columns

Index(['data', 'includes', 'meta', '__twarc', 'errors'], dtype='object')

In [5]:
def get_public_metrics(row):
    # Copy each entry of the nested public_metrics dict into its own 'metric_<name>' column.
    if 'public_metrics' in row.keys():
        if isinstance(row['public_metrics'], dict):
            for key, value in row['public_metrics'].items():
                row['metric_' + str(key)] = value
    return row

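def get_tweets(df):
    # Each row of df['data'] holds one page of results: a list of tweet dicts.
    if 'data' not in df.columns:
        return None
    results = pd.DataFrame()
    for item in df['data'].values.tolist():
        results = pd.concat([results, pd.DataFrame(item)])

    # Expand the nested public_metrics dict into separate metric_* columns.
    results = results.apply(get_public_metrics, axis=1)

    # Rows from different pages share index values, so rebuild the index.
    results = results.reset_index(drop=True)

    return results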
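
The loop in `get_tweets` calls `pd.concat` once per page, which copies the growing dataframe each time. A possible alternative, shown here only as a minimal sketch, is to build all page-level dataframes first and concatenate them once. The `pages` list and the name `get_tweets_fast` are hypothetical and stand in for `df['data']`; the sketch also assumes every tweet has a `public_metrics` dict, as the missing-value check in this notebook suggests.

```python
import pandas as pd

# Hypothetical stand-in for df['data']: one list of tweet dicts per API page.
pages = [
    [
        {"id": "1", "text": "first", "public_metrics": {"retweet_count": 2, "like_count": 5}},
        {"id": "2", "text": "second", "public_metrics": {"retweet_count": 0, "like_count": 1}},
    ],
    [
        {"id": "3", "text": "third", "public_metrics": {"retweet_count": 7, "like_count": 0}},
    ],
]

def get_tweets_fast(pages):
    # Build one small dataframe per page, then concatenate them in a single call.
    frames = [pd.DataFrame(page) for page in pages if isinstance(page, list)]
    tweets = pd.concat(frames, ignore_index=True)
    # Expand the public_metrics dicts into their own columns with a 'metric_' prefix.
    metrics = pd.DataFrame(tweets["public_metrics"].tolist()).add_prefix("metric_")
    return pd.concat([tweets, metrics], axis=1)

tweets_sample = get_tweets_fast(pages)
print(tweets_sample[["id", "metric_retweet_count", "metric_like_count"]])
```

With the real file, the call would presumably be `get_tweets_fast(df['data'].tolist())`. For a few hundred tweets the difference is negligible, so the original function is fine for this challenge.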

In [6]:
tweets = get_tweets(df)

## 1. Display the first 5 rows

In [7]:
tweets.head()

Unnamed: 0,text,entities,public_metrics,reply_settings,lang,source,possibly_sensitive,author_id,referenced_tweets,conversation_id,created_at,id,attachments,context_annotations,in_reply_to_user_id,geo,metric_retweet_count,metric_reply_count,metric_like_count,metric_quote_count
0,RT @I_Am_The_ICT: Screenshot this please so wh...,"{'mentions': [{'start': 3, 'end': 16, 'usernam...","{'retweet_count': 239, 'reply_count': 0, 'like...",everyone,en,Twitter for Android,False,1206072760239349761,"[{'type': 'retweeted', 'id': '1568180588191911...",1570085814817722368,2022-09-14T16:23:23.000Z,1570085814817722368,,,,,239,0,0,0
1,Do you have difficulty in preparing neetpg cho...,"{'urls': [{'start': 268, 'end': 291, 'url': 'h...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter Web App,False,1569551671969333249,,1570085811206627329,2022-09-14T16:23:22.000Z,1570085811206627329,{'media_keys': ['7_1570085212662669313']},,,,0,0,1,0
2,UHM? jd probably there cause i was rting shit ...,"{'urls': [{'start': 99, 'end': 122, 'url': 'ht...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for Android,False,1002918664113442816,,1570085801098383360,2022-09-14T16:23:20.000Z,1570085801098383360,{'media_keys': ['3_1570085798472568832']},"[{'domain': {'id': '45', 'name': 'Brand Vertic...",,,0,0,0,0
3,cant believe i missed so many q2han videos 😢 i...,,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for Android,False,748598123732185088,,1570085777710907392,2022-09-14T16:23:14.000Z,1570085777710907392,,,,,0,0,0,0
4,@8x5tl8 The holy algorithm does that. :) When ...,"{'mentions': [{'start': 0, 'end': 7, 'username...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for iPhone,False,1035008103987769346,"[{'type': 'replied_to', 'id': '157007157427923...",1570071574279237636,2022-09-14T16:23:05.000Z,1570085737869070339,,,23709361.0,,0,0,0,0


## 2. Check which columns the dataframe contains

In [8]:
tweets.columns

Index(['text', 'entities', 'public_metrics', 'reply_settings', 'lang',
       'source', 'possibly_sensitive', 'author_id', 'referenced_tweets',
       'conversation_id', 'created_at', 'id', 'attachments',
       'context_annotations', 'in_reply_to_user_id', 'geo',
       'metric_retweet_count', 'metric_reply_count', 'metric_like_count',
       'metric_quote_count'],
      dtype='object')

## 3. Check which columns have missing values

In [9]:
tweets.isna().sum()

text                      0
entities                 25
public_metrics            0
reply_settings            0
lang                      0
source                    0
possibly_sensitive        0
author_id                 0
referenced_tweets        88
conversation_id           0
created_at                0
id                        0
attachments             526
context_annotations     389
in_reply_to_user_id     485
geo                     594
metric_retweet_count      0
metric_reply_count        0
metric_like_count         0
metric_quote_count        0
dtype: int64
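
Raw counts of missing values are easier to interpret next to the share of rows they represent. The sketch below shows one way to report both and to list only the columns that actually have gaps; the `sample` dataframe is a hypothetical miniature of `tweets`, not the real data.

```python
import pandas as pd

# Hypothetical miniature of the tweets dataframe; None marks a missing value.
sample = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "geo": [None, None, None, {"place_id": "x"}],
    "attachments": [None, {"media_keys": ["k1"]}, None, {"media_keys": ["k2"]}],
})

# isna().sum() counts missing cells; isna().mean() gives the fraction of rows missing.
missing = pd.DataFrame({
    "n_missing": sample.isna().sum(),
    "share_missing": sample.isna().mean(),
})

# Keep only columns with at least one missing value, most incomplete first.
missing_only = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing_only)
```

Replacing `sample` with `tweets` should give the same kind of overview for the full dataset.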

## 4. Check how many tweets are available in the dataset

In [10]:
tweets.count()

text                    599
entities                574
public_metrics          599
reply_settings          599
lang                    599
source                  599
possibly_sensitive      599
author_id               599
referenced_tweets       511
conversation_id         599
created_at              599
id                      599
attachments              73
context_annotations     210
in_reply_to_user_id     114
geo                       5
metric_retweet_count    599
metric_reply_count      599
metric_like_count       599
metric_quote_count      599
dtype: int64

In [11]:
len(tweets)

599

## 5. Calculate the average and the standard deviation of retweets and likes

In [12]:
tweets.describe()

Unnamed: 0,metric_retweet_count,metric_reply_count,metric_like_count,metric_quote_count
count,599.0,599.0,599.0,599.0
mean,99.806344,0.086811,0.248748,0.001669
std,122.740185,0.373642,1.333851,0.040859
min,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0
50%,19.0,0.0,0.0,0.0
75%,220.0,0.0,0.0,0.0
max,1522.0,6.0,21.0,1.0


In [13]:
tweets[['metric_retweet_count', 'metric_like_count']].describe()

Unnamed: 0,metric_retweet_count,metric_like_count
count,599.0,599.0
mean,99.806344,0.248748
std,122.740185,1.333851
min,0.0,0.0
25%,0.0,0.0
50%,19.0,0.0
75%,220.0,0.0
max,1522.0,21.0


In [14]:
tweets[['metric_retweet_count', 'metric_like_count']].mean()

metric_retweet_count    99.806344
metric_like_count        0.248748
dtype: float64

In [15]:
tweets[['metric_retweet_count', 'metric_like_count']].std()

metric_retweet_count    122.740185
metric_like_count         1.333851
dtype: float64

In [16]:
tweets[['metric_retweet_count', 'metric_like_count']].agg(['mean', 'std'])

Unnamed: 0,metric_retweet_count,metric_like_count
mean,99.806344,0.248748
std,122.740185,1.333851


## 6. Calculate the average, minimum and maximum number of followers that users in the dataset have

In [17]:
tweets.columns

Index(['text', 'entities', 'public_metrics', 'reply_settings', 'lang',
       'source', 'possibly_sensitive', 'author_id', 'referenced_tweets',
       'conversation_id', 'created_at', 'id', 'attachments',
       'context_annotations', 'in_reply_to_user_id', 'geo',
       'metric_retweet_count', 'metric_reply_count', 'metric_like_count',
       'metric_quote_count'],
      dtype='object')

In [18]:
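def get_users(df):
    # Each row of df['includes'] is a dict whose 'users' entry lists the user objects for that page.
    if 'includes' not in df.columns:
        return None
    results = pd.DataFrame()
    for item in df['includes'].values.tolist():
        results = pd.concat([results, pd.DataFrame(item['users'])])

    # Expand the nested public_metrics dict into separate metric_* columns.
    results = results.apply(get_public_metrics, axis=1)

    # Rows from different pages share index values, so rebuild the index.
    results = results.reset_index(drop=True)

    return results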

In [19]:
users = get_users(df)
users.head()

Unnamed: 0,username,protected,verified,profile_image_url,name,public_metrics,id,created_at,url,pinned_tweet_id,description,entities,location,metric_followers_count,metric_following_count,metric_tweet_count,metric_listed_count
0,Awoken_Soul_,False,False,https://pbs.twimg.com/profile_images/147175193...,Awoken Soul,"{'followers_count': 470, 'following_count': 4,...",1206072760239349761,2019-12-15T04:45:53.000Z,,1.4181538628133967e+18,Humanity can't take the truth.,,,470,4,51,0
1,I_Am_The_ICT,False,False,https://pbs.twimg.com/profile_images/151902731...,The Inner Circle Trader,"{'followers_count': 109092, 'following_count':...",1519027155467911170,2022-04-26T18:55:09.000Z,https://t.co/olU3CAMIam,1.568180588191912e+18,The Ghost In The Machine,"{'url': {'urls': [{'start': 0, 'end': 23, 'url...",,109092,0,2801,767
2,collegecounsel_,False,False,https://pbs.twimg.com/profile_images/156955193...,College counsel,"{'followers_count': 1, 'following_count': 6, '...",1569551671969333249,2022-09-13T05:01:08.000Z,,,,,,1,6,11,0
3,fruitbleed,False,False,https://pbs.twimg.com/profile_images/156748208...,ً☆,"{'followers_count': 287, 'following_count': 16...",1002918664113442816,2018-06-02T14:23:37.000Z,,1.5374440616734925e+18,kpop & on the political and economic state of ...,,she they ☭ 20,287,161,29468,13
4,pinktyongs,False,False,https://pbs.twimg.com/profile_images/151521171...,nana ८´ ᵔ `ა,"{'followers_count': 641, 'following_count': 85...",748598123732185088,2016-06-30T19:24:35.000Z,https://t.co/DuYItHoeP2,1.5261094720490496e+18,love makes us\n#TAEYONG\n#NCT127,"{'url': {'urls': [{'start': 0, 'end': 23, 'url...","tyongf, she/her, bi",641,852,43119,12


In [20]:
users['metric_followers_count'].describe()

count    8.510000e+02
mean     7.353244e+05
std      6.849113e+06
min      0.000000e+00
25%      8.500000e+00
50%      3.740000e+02
75%      3.361500e+03
max      1.056357e+08
Name: metric_followers_count, dtype: float64

In [21]:
pd.set_option('display.float_format', lambda x: '%.5f' % x)


In [22]:
users['metric_followers_count'].describe()

count         851.00000
mean       735324.38190
std       6849112.88098
min             0.00000
25%             8.50000
50%           374.00000
75%          3361.50000
max     105635707.00000
Name: metric_followers_count, dtype: float64

In [23]:
users[['metric_followers_count']].agg(['min', 'max', 'mean'])

Unnamed: 0,metric_followers_count
min,0.0
max,105635707.0
mean,735324.3819
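
Because `get_users` stacks the `users` list of every page, the same account could appear in more than one row, which would weight it more heavily in the mean. Whether that happens in your file is something to check rather than assume. The sketch below uses a hypothetical `users_sample` table to show how duplicates on `id` could be dropped before summarising followers.

```python
import pandas as pd

# Hypothetical users table in which account 'a' was returned on two pages.
users_sample = pd.DataFrame({
    "id": ["a", "b", "a", "c"],
    "username": ["alice", "bob", "alice", "carol"],
    "metric_followers_count": [100, 10, 100, 1000],
})

# Keep the first row for each account id.
unique_users = users_sample.drop_duplicates(subset="id")

stats = unique_users["metric_followers_count"].agg(["mean", "min", "max"])
print(len(users_sample), "rows,", len(unique_users), "unique accounts")
print(stats)
```

If the number of rows and the number of unique accounts are equal in your data, the statistics above are unaffected.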


## 7. Indicate the most popular languages in the dataset

In [24]:
tweets['lang'].value_counts()

en     523
ja      41
es      21
ca       6
fr       4
th       2
und      1
et       1
Name: lang, dtype: int64
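
When reporting which languages are most popular, percentages are often easier to read than raw counts. The sketch below rebuilds the counts from the output above as a standalone Series (so it runs without the data file) and converts them to shares.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
lang_counts = pd.Series(
    {"en": 523, "ja": 41, "es": 21, "ca": 6, "fr": 4, "th": 2, "und": 1, "et": 1}
)

# Share of each language as a percentage of all tweets, rounded to one decimal.
lang_share = (lang_counts / lang_counts.sum() * 100).round(1)
print(lang_share.head(3))
```

On the dataframe itself, `tweets['lang'].value_counts(normalize=True)` returns the same proportions as fractions between 0 and 1.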

**NOTE:** The code above shows several alternative ways to answer the same question. Always check the video for more context. For the graded challenges, keep in mind you need to explain each step in Markdown.

## Challenge 2

### Programming challenge

Download your own data from a digital platform in **JSON** or **CSV** format. You can use Facebook or Instagram data (which make the data available almost immediately), or other platforms (e.g., Google, TikTok, Spotify etc.) - but be mindful of how much time the platform says they will take to make the data available to you.

Download the data to your computer, and select one of the files that has advertising-related data (or profile interests) - but **not** your posts, personal information, or network (friends or followers).

After finding this file, move it to the appropriate folder where you are running the notebook. Load it in Pandas and: 

1. Display the first 5 rows
2. Check which columns the dataframe contains
3. Check which columns have missing values
4. Summarize the information for at least one column (if it is numeric, descriptive statistics should do, if it is categorical or text, then counting frequencies and showing the top 5 items is enough).

**Note:** as shown in the videos, sometimes you may need to apply the function ```expand_dictionary``` to get meaningful data. We will cover this in more detail on DA3. If the function does not work, log an issue on GitHub with the problem (if there's still time before the submission) or select a different file in your data download package. 
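
To illustrate why nested data needs expanding at all, the sketch below compares loading nested records directly with flattening them using `pd.json_normalize`. This is not the course's `expand_dictionary` function, whose implementation is not shown here; the `records` list and its field names are hypothetical.

```python
import pandas as pd

# Hypothetical records, loosely shaped like a platform data export with nested fields.
records = [
    {"name": "Topic A", "details": {"category": "music", "timestamp": 1660000000}},
    {"name": "Topic B", "details": {"category": "travel", "timestamp": 1660000500}},
]

# Loaded directly, the nested 'details' dicts stay packed inside a single column.
nested = pd.DataFrame(records)

# json_normalize turns each nested key into its own column, e.g. 'details.category'.
flat = pd.json_normalize(records)

print(nested.columns.tolist())
print(flat.columns.tolist())
```

Once the nested keys are separate columns, the usual steps (`head()`, `isna().sum()`, `value_counts()`) work on them like on any other column.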

### Reflection

Salz & Dewar (2019) discuss a set of important challenges in their proposed ethics framework. Imagine that you will conduct a research project and ask multiple respondents to donate their platform usage data to you. Please select two challenges from those suggested by the authors, and explain how these challenges relate to doing research using these data donations.

In [25]:
advertisers = pd.read_json('advertisers_who_uploaded_a_contact_list_with_your_information.json')

In [26]:
advertisers.head()

Unnamed: 0,custom_audiences_v2
0,"80,000 Hours"
1,Abstract Home Art
2,Adobe Commerce
3,Adobe Design
4,Adobe Photoshop


In [27]:
advertisers.isna().sum()

custom_audiences_v2    0
dtype: int64

In [28]:
advertisers.value_counts()

custom_audiences_v2           
80,000 Hours                      1
Serasa                            1
Seguralta Corretora de Seguros    1
Sebastian Wilson                  1
Scarface 1920                     1
                                 ..
Hilton Hotels & Resorts           1
Hilton Honors                     1
Het Parool                        1
Headspace                         1
wijnvoordeel.nl                   1
Length: 301, dtype: int64

In [29]:
len(advertisers)

301

## Challenge 3

### Programming challenge

Social media data and datasets with other types of digital traces are often "messy" - they include missing data, duplicate information or unwanted observations. In the upcoming weeks, you will learn how to deal with such data.

The dataset you are now working with is also an example of such a messy dataset. One of the steps to clean it involves removing unwanted or irrelevant observations. Next week, you will continue working with the Twitter data and will focus on analysing text. To be able to, e.g., classify the tweets or analyse their sentiment, you will be asked to focus on tweets in one language. To prepare for next week, in this challenge you need to select only the tweets in a single language (for example, tweets whose language is set to English, i.e., `en`). You can choose any language that is substantially represented in the dataset (see your answer to Challenge 1, question 7). Make sure to save this selection.

### Reflection

For challenges 1, 2 and 3, you have done many data science activities, from collecting tweets or your own data, loading the data, inspecting it and even requesting sentiment analysis. Using the CRISP-DM explanation found in Larose (2014), provide a short overview of the actions you took, and to which step(s) of the CRISP-DM process they belong. Motivate your answer.

What I need to do:
* The most popular language is English - I will select the rows that have tweets in 'en'
* I will save this new dataframe as a pickle file

In [9]:
tweets[tweets['lang']=='en']

Unnamed: 0,text,entities,public_metrics,reply_settings,lang,source,possibly_sensitive,author_id,referenced_tweets,conversation_id,created_at,id,attachments,context_annotations,in_reply_to_user_id,geo,metric_retweet_count,metric_reply_count,metric_like_count,metric_quote_count
0,RT @I_Am_The_ICT: Screenshot this please so wh...,"{'mentions': [{'start': 3, 'end': 16, 'usernam...","{'retweet_count': 239, 'reply_count': 0, 'like...",everyone,en,Twitter for Android,False,1206072760239349761,"[{'type': 'retweeted', 'id': '1568180588191911...",1570085814817722368,2022-09-14T16:23:23.000Z,1570085814817722368,,,,,239,0,0,0
1,Do you have difficulty in preparing neetpg cho...,"{'urls': [{'start': 268, 'end': 291, 'url': 'h...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter Web App,False,1569551671969333249,,1570085811206627329,2022-09-14T16:23:22.000Z,1570085811206627329,{'media_keys': ['7_1570085212662669313']},,,,0,0,1,0
2,UHM? jd probably there cause i was rting shit ...,"{'urls': [{'start': 99, 'end': 122, 'url': 'ht...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for Android,False,1002918664113442816,,1570085801098383360,2022-09-14T16:23:20.000Z,1570085801098383360,{'media_keys': ['3_1570085798472568832']},"[{'domain': {'id': '45', 'name': 'Brand Vertic...",,,0,0,0,0
3,cant believe i missed so many q2han videos 😢 i...,,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for Android,False,748598123732185088,,1570085777710907392,2022-09-14T16:23:14.000Z,1570085777710907392,,,,,0,0,0,0
4,@8x5tl8 The holy algorithm does that. :) When ...,"{'mentions': [{'start': 0, 'end': 7, 'username...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for iPhone,False,1035008103987769346,"[{'type': 'replied_to', 'id': '157007157427923...",1570071574279237636,2022-09-14T16:23:05.000Z,1570085737869070339,,,23709361,,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
594,@dmrachnik For money and popularity algorithm....,"{'mentions': [{'start': 0, 'end': 10, 'usernam...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for Android,False,1503651649805422592,"[{'type': 'replied_to', 'id': '157007881416357...",1570075717219532801,2022-09-14T16:00:06.000Z,1570079953638813698,,"[{'domain': {'id': '123', 'name': 'Ongoing New...",297355518,,0,0,0,0
595,RT @legendzsport: Download it : \n👉🏽https://t...,"{'mentions': [{'start': 3, 'end': 16, 'usernam...","{'retweet_count': 6, 'reply_count': 0, 'like_c...",everyone,en,AdvanceML,False,1280935835789975553,"[{'type': 'retweeted', 'id': '1570048900525965...",1570079947330584576,2022-09-14T16:00:04.000Z,1570079947330584576,,"[{'domain': {'id': '46', 'name': 'Business Tax...",,,6,0,0,0
596,Should I make a new account at this point I ki...,,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for Android,False,1079905658504466433,,1570079946067922945,2022-09-14T16:00:04.000Z,1570079946067922945,,,,,0,0,1,0
597,@MelanieMoser_ @British_Airways For example I ...,"{'mentions': [{'start': 0, 'end': 14, 'usernam...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for Android,False,1486039748,"[{'type': 'replied_to', 'id': '157007963096653...",1570069285170565121,2022-09-14T16:00:02.000Z,1570079937658327040,,"[{'domain': {'id': '45', 'name': 'Brand Vertic...",1486039748,,0,0,0,0


In [10]:
tweets_english = tweets[tweets['lang']=='en']

In [11]:
len(tweets_english)

523

In [12]:
tweets_english.head()

Unnamed: 0,text,entities,public_metrics,reply_settings,lang,source,possibly_sensitive,author_id,referenced_tweets,conversation_id,created_at,id,attachments,context_annotations,in_reply_to_user_id,geo,metric_retweet_count,metric_reply_count,metric_like_count,metric_quote_count
0,RT @I_Am_The_ICT: Screenshot this please so wh...,"{'mentions': [{'start': 3, 'end': 16, 'usernam...","{'retweet_count': 239, 'reply_count': 0, 'like...",everyone,en,Twitter for Android,False,1206072760239349761,"[{'type': 'retweeted', 'id': '1568180588191911...",1570085814817722368,2022-09-14T16:23:23.000Z,1570085814817722368,,,,,239,0,0,0
1,Do you have difficulty in preparing neetpg cho...,"{'urls': [{'start': 268, 'end': 291, 'url': 'h...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter Web App,False,1569551671969333249,,1570085811206627329,2022-09-14T16:23:22.000Z,1570085811206627329,{'media_keys': ['7_1570085212662669313']},,,,0,0,1,0
2,UHM? jd probably there cause i was rting shit ...,"{'urls': [{'start': 99, 'end': 122, 'url': 'ht...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for Android,False,1002918664113442816,,1570085801098383360,2022-09-14T16:23:20.000Z,1570085801098383360,{'media_keys': ['3_1570085798472568832']},"[{'domain': {'id': '45', 'name': 'Brand Vertic...",,,0,0,0,0
3,cant believe i missed so many q2han videos 😢 i...,,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for Android,False,748598123732185088,,1570085777710907392,2022-09-14T16:23:14.000Z,1570085777710907392,,,,,0,0,0,0
4,@8x5tl8 The holy algorithm does that. :) When ...,"{'mentions': [{'start': 0, 'end': 7, 'username...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,en,Twitter for iPhone,False,1035008103987769346,"[{'type': 'replied_to', 'id': '157007157427923...",1570071574279237636,2022-09-14T16:23:05.000Z,1570085737869070339,,,23709361.0,,0,0,0,0


In [13]:
tweets_english.to_pickle('Tweets_EN.pkl')

In [14]:
tweets[tweets['lang']=='en'].to_pickle('Tweets2_EN.pkl')
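
Before relying on the saved file next week, it can be worth checking that it loads back correctly. The sketch below filters a hypothetical miniature dataframe, saves it to a temporary folder, and reloads it with `pd.read_pickle`. The file name and the `sample` data are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the tweets dataframe.
sample = pd.DataFrame({
    "text": ["hello", "hola", "good morning"],
    "lang": ["en", "es", "en"],
})

# .copy() makes the selection an independent dataframe, so later edits do not touch `sample`.
english = sample[sample["lang"] == "en"].copy()

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "tweets_en_sample.pkl")  # hypothetical file name
    english.to_pickle(path)
    reloaded = pd.read_pickle(path)

# The reloaded dataframe should contain only English tweets and match what was saved.
print(len(reloaded), reloaded["lang"].unique().tolist())
print(reloaded.equals(english))
```

Pickle files preserve the nested dicts and lists in columns such as `entities`, which a CSV export would turn into plain strings. Only load pickle files that you created yourself or that come from a source you trust.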