# Challenges for week 2

Now that we've seen how to load and explore data in Pandas, it's time for you to apply this knowledge. This week has three challenges. Make sure to give them a try and complete all of them.

**Some important notes for the challenges:**
1. These challenges are a warm-up and help you get ready for class. Make sure to give them a try. If you get an error message, try to troubleshoot it (using Google often helps). If all else fails, go to the next challenge (but make sure to hand it in).
2. While we of course like when you get all the answers right, the important thing is to exercise and apply the knowledge. So we will still accept challenges that may not be complete, as long as we see enough effort *for each challenge*. This means that if one of the challenges is not delivered (not started and no attempt shown), we unfortunately will not be able to provide a full grade for that week.
3. Delivering the challenge to the right place is a critical part of the challenge. This means we will only be able to grade and accept challenges that are live on your own private GitHub repository (so with a link starting with https://github.com/uva-cw-digitalanalytics/2021s1-) **and** delivered on time as a Canvas assignment. Watch the videos on Canvas on how to hand in your challenges.

### Facing issues? 

We are constantly monitoring the issues on the GitHub general repository (https://github.com/uva-cw-digitalanalytics/2021s1/issues) to help you out. Don't hesitate to log an issue there, explaining well what the problem is, showing the code you are using, and the error message you may be receiving. 

**Important:** We only monitor the repository on weekdays, and until 17.00. Issues logged after this time will most likely be answered the next day. This means you should not wait for our response before submitting a challenge :-)

## Getting set up for the challenges

We will use actual Twitter data for the challenges of this week. To do so, you need to download DMI-TCAT data that you may already be collecting for yourself, or from a colleague (if you haven't requested data collection yet).

After exporting the data (see the export options from DMI-TCAT - *Export all tweets from the selection*), make sure you add the ```.csv``` file **to the same folder** where you are working on the assignment.

**Note:** If you have a lot of tweets (over 200,000), it's probably better to select a smaller timeframe, so your dataset is not too large.
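If re-exporting a shorter timeframe is not convenient, pandas can also limit how much of the file is read. The sketch below is only an illustration: it uses a small made-up CSV held in memory as a stand-in for the DMI-TCAT export, and relies on the standard `nrows` and `usecols` parameters of `read_csv`.

```python
import io
import pandas as pd

# Made-up stand-in for a DMI-TCAT export; with real data, pass the file name instead.
sample_csv = io.StringIO(
    "id,text,lang,retweet_count\n"
    "1,hello world,en,0\n"
    "2,ola mundo,pt,0\n"
    "3,hallo wereld,nl,0\n"
)

# Read only the first two rows, and only the columns that are needed.
subset = pd.read_csv(sample_csv, nrows=2, usecols=["id", "text", "lang"])
print(subset)
```

Note that `nrows` simply takes the first rows of the file, so it is not a random sample of the tweets.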

***

In [11]:
import pandas as pd

In [12]:
tweets = pd.read_csv('tcat_NuriaVila-20210207-20210208------------fullExport--9654fe3ff4.csv')

In [13]:
len(tweets)

5121

***

## Challenge 1

Load the Twitter data in Pandas, and:
1. Display the first few rows
2. Check which columns the dataframe contains
3. Check which columns contain missing values

***

Because the Twitter data contains many columns, I use ```transpose()``` so that each column is shown as a row and the first few tweets appear side by side.

In [14]:
#read first few rows
tweets.head().transpose()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| id | 1358446207450951681 | 1358446277013491718 | 1358446292922433550 | 1358446328729137152 | 1358446424250318853 |
| time | 1612713789 | 1612713805 | 1612713809 | 1612713818 | 1612713840 |
| created_at | 2021-02-07 16:03:09 | 2021-02-07 16:03:25 | 2021-02-07 16:03:29 | 2021-02-07 16:03:38 | 2021-02-07 16:04:00 |
| from_user_name | KenFry10 | JoyceSchneider1 | melasface | sdsimper | mo_content |
| text | Volume up! Listen to this sample from The Brod... | RT @UviPoznansky: Uvi 💕 You're such a tease #... | RT @CharlieBCuff: Go and buy Candice Braithwai... | RT @sdsimper: @EvieDrae The Fate of Stars is a... | RT @mo_content: The Saving Raphael Santiago - ... |
| filter_level | low | low | low | low | low |
| possibly_sensitive | 0.0 | 0.0 | NaN | NaN | 0.0 |
| withheld_copyright | NaN | NaN | NaN | NaN | NaN |
| withheld_scope | NaN | NaN | NaN | NaN | NaN |
| truncated | NaN | NaN | NaN | NaN | NaN |


In [15]:
#check columns
tweets.columns

Index(['id', 'time', 'created_at', 'from_user_name', 'text', 'filter_level',
       'possibly_sensitive', 'withheld_copyright', 'withheld_scope',
       'truncated', 'retweet_count', 'favorite_count', 'lang', 'to_user_name',
       'in_reply_to_status_id', 'quoted_status_id', 'source', 'location',
       'lat', 'lng', 'from_user_id', 'from_user_realname',
       'from_user_verified', 'from_user_description', 'from_user_url',
       'from_user_profile_image_url', 'from_user_utcoffset',
       'from_user_timezone', 'from_user_lang', 'from_user_tweetcount',
       'from_user_followercount', 'from_user_friendcount',
       'from_user_favourites_count', 'from_user_listed',
       'from_user_withheld_scope', 'from_user_created_at'],
      dtype='object')

In [16]:
#check missing values
tweets.isna().sum()

id                                0
time                              0
created_at                        0
from_user_name                    0
text                              0
filter_level                      0
possibly_sensitive             2975
withheld_copyright             5121
withheld_scope                 5121
truncated                      5121
retweet_count                     0
favorite_count                    0
lang                              0
to_user_name                   4248
in_reply_to_status_id          4280
quoted_status_id               4863
source                           17
location                       1478
lat                            5119
lng                            5119
from_user_id                      0
from_user_realname                0
from_user_verified                0
from_user_description           371
from_user_url                  1958
from_user_profile_image_url       0
from_user_utcoffset            5121
from_user_timezone          

The result shows that this Twitter data has plenty of missing values. Some columns, including the copyright and timezone columns, contain no data at all, while others (user description, possibly sensitive, etc.) are missing only part of their values.
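To tell fully empty columns apart from partly empty ones without reading the whole list, the counts can be filtered. This is a minimal sketch on a made-up miniature dataframe (the values are not from the real export); `isna()`, `sum()` and `mean()` are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the tweet data, for illustration only.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["a", "b", "c", "d"],
    "location": ["NL", None, "BR", None],
    "withheld_scope": [np.nan, np.nan, np.nan, np.nan],
})

missing = df.isna().sum()

# Keep only the columns that have at least one missing value.
with_missing = missing[missing > 0]

# Columns in which every single value is missing.
all_missing = missing[missing == len(df)].index.tolist()

# Share of missing values per column (between 0.0 and 1.0).
share_missing = df.isna().mean()

print(with_missing)
print(all_missing)
```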

***

## Challenge 2

Still with the Twitter data, please answer the following questions:
1. How many tweets are there in the dataset?
2. What is the average number of friends that the users have?
3. What is the average number of followers that the users have?
4. What is the average volume of retweets that the tweets have?
5. What is the most popular language?

**Make sure to run the code to get the information, and include your responses in markdown**


***

The missing-value check above shows that the column 'text' has no missing values, so every row is a tweet. I can therefore take the length of the dataset as the number of tweets, instead of selecting this column and counting its values.

In [17]:
#tweets number
len(tweets)

5121
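The difference between counting rows and counting values in a column can be sketched as follows. The dataframe is made up for illustration; the point is that `len()` counts every row, while `count()` skips missing values.

```python
import pandas as pd

# Made-up example: three tweets, only one of which is a reply.
df = pd.DataFrame({
    "text": ["a", "b", "c"],
    "to_user_name": ["x", None, None],
})

n_rows = len(df)                        # every row, missing values or not
n_rows_shape = df.shape[0]              # the same number via the shape
n_text = df["text"].count()             # non-missing values in 'text'
n_replies = df["to_user_name"].count()  # non-missing values in 'to_user_name'

print(n_rows, n_rows_shape, n_text, n_replies)
```

Because 'text' has no missing values, `len(df)` and `df["text"].count()` agree; for a column with gaps they would not.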

In [18]:
#average friend
tweets[['from_user_friendcount']].describe()

| | from_user_friendcount |
|---|---|
| count | 5121.0 |
| mean | 4180.694786 |
| std | 14583.895399 |
| min | 0.0 |
| 25% | 177.0 |
| 50% | 600.0 |
| 75% | 2352.0 |
| max | 372887.0 |


On average, the users have 4181 friends. This is quite a lot.
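One caveat: each row in the data is a tweet, so ```describe()``` averages over tweets, and a user who tweeted several times is counted several times. A possible way to get a per-user average is sketched below. It assumes that `from_user_id` identifies a user and keeps only the first row for each user; the numbers are made up.

```python
import pandas as pd

# Made-up example: user 10 tweeted three times, user 20 once.
df = pd.DataFrame({
    "from_user_id": [10, 10, 10, 20],
    "from_user_friendcount": [100, 100, 100, 500],
})

# Average over tweets: frequent tweeters weigh more heavily.
mean_per_tweet = df["from_user_friendcount"].mean()

# Average over users: keep one row per user first.
users = df.drop_duplicates(subset="from_user_id")
mean_per_user = users["from_user_friendcount"].mean()

print(mean_per_tweet, mean_per_user)
```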

In [19]:
#average follower

In [20]:
tweets[['from_user_followercount']].describe()

| | from_user_followercount |
|---|---|
| count | 5121.0 |
| mean | 9272.073618 |
| std | 41807.896419 |
| min | 0.0 |
| 25% | 124.0 |
| 50% | 660.0 |
| 75% | 3891.0 |
| max | 843855.0 |


On average, the users have 9272 followers. The mean is far above the median (660) because a small number of accounts have very many followers (the maximum is 843,855). This data may contain tweets sent by some celebrities.
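The gap between the mean (9272) and the median (660) in the output above is typical of skewed data. A small made-up example shows how a single very large value pulls the mean up while the median barely moves:

```python
import pandas as pd

# Made-up follower counts: four ordinary accounts and one very large one.
followers = pd.Series([100, 200, 300, 400, 100000])

mean_followers = followers.mean()      # pulled up by the large account
median_followers = followers.median()  # the middle value, unaffected by it

print(mean_followers, median_followers)
```

For data like this, the median is often the more informative "typical" value to report next to the mean.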

In [21]:
#average retweets
tweets[['retweet_count']].describe()

| | retweet_count |
|---|---|
| count | 5121.0 |
| mean | 0.0 |
| std | 0.0 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 0.0 |
| max | 0.0 |


Every tweet in this dataset has a retweet count of 0. Seen alongside the friend and follower numbers above, this is quite strange: some accounts in this dataset have many friends and followers, yet none of their tweets have been retweeted.

In [22]:
#confirm the retweet counts
tweets[['retweet_count']].value_counts()

retweet_count
0                5121
dtype: int64

I use ```value_counts()``` as a second check to confirm that no tweet has any retweets.
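The same check can also be written as a single yes/no test. This is a minimal sketch on made-up data, using the standard `all()` and `nunique()` methods:

```python
import pandas as pd

# Made-up example in which every retweet count is zero.
df = pd.DataFrame({"retweet_count": [0, 0, 0]})

# True only if every value in the column equals 0.
all_zero = bool((df["retweet_count"] == 0).all())

# Number of distinct values; 1 means the column is constant.
n_distinct = df["retweet_count"].nunique()

print(all_zero, n_distinct)
```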

In [23]:
#most popular language
tweets[['lang']].value_counts()

lang
en      4002
pt       471
th       117
in       110
es        92
und       63
de        61
ja        44
pl        42
fr        33
it        23
nl        15
tl        10
tr         7
el         4
da         4
ko         3
hi         3
cs         3
ht         2
hu         2
et         2
sv         1
ar         1
is         1
no         1
iw         1
fi         1
fa         1
zh         1
dtype: int64

The most popular language in this Twitter data is English, with 4002 tweets. Portuguese comes second, with 471 tweets.
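Instead of reading the top row off the table, the most frequent language and its share can be extracted directly. The sketch below uses made-up data and calls `value_counts()` on a single column (a Series) rather than on a one-column dataframe; `idxmax()` and `normalize=True` are standard pandas options.

```python
import pandas as pd

# Made-up example: five tweets in three languages.
df = pd.DataFrame({"lang": ["en", "en", "en", "pt", "th"]})

counts = df["lang"].value_counts()

# Label of the most frequent language.
top_lang = counts.idxmax()

# Shares instead of counts (they sum to 1.0).
shares = df["lang"].value_counts(normalize=True)

print(top_lang, shares[top_lang])
```

In the real data above, English accounts for 4002 of the 5121 tweets, which is about 78%.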

***

## Challenge 3

We will run sentiment analysis on the tweets for next week. So this week, you need to **request** the sentiment analysis from us. For this challenge, you need to:
1. Select only tweets that have language set to English (i.e., ```en```)
2. Select only the columns ```id``` and ```text```

After slicing the dataframe as described above, you need to:
1. Save the results to a pickle file containing your name, and the language of the tweets (in ISO two-letter codes). In my case, it would be:
    * **TheoAraujo_EN.pkl**
2. Upload the pickle file to SurfDrive (see the link in the homepage of the General Repository).

***

In [24]:
#select English
tweets_senti = tweets[tweets['lang'] == 'en']

In [25]:
#select id and text
tweets_senti = tweets_senti[['id', 'text']]

In [26]:
#save file
tweets_senti.to_pickle('AstridHe_EN.pkl')
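The two selection steps can also be combined with ```.loc```, and reading the pickle back is a quick way to check the file before uploading it. This is a minimal sketch on made-up data; the file name `Example_EN.pkl` is hypothetical, and the file is written to a temporary folder so nothing is left behind.

```python
import os
import tempfile
import pandas as pd

# Made-up miniature of the tweet data.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["hello", "ola", "bye"],
    "lang": ["en", "pt", "en"],
})

# Filter the rows and select the columns in one step.
english = df.loc[df["lang"] == "en", ["id", "text"]]

# Save to a temporary folder, then read the file back as a check.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Example_EN.pkl")  # hypothetical file name
    english.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored)
```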