# KEN1435 - Principles of Data Science | Lab 3: The Normal Approximation for Data

First, we load the necessary Python packages.

In [1]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates # you can use this to customize your figures for exercises 9, 10 and 14
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd
import seaborn as sns

from palmerpenguins import load_penguins

%matplotlib inline

Let's start off with a normal approximation for the bill length of the Palmer penguins. Recall that we already displayed the mean absolute deviation of this quantity in exercise 7 of the previous lab. First, we load the data.

In [2]:
penguins = load_penguins()

1. Calculate the average and the standard deviation of the bill length for each species.

In [None]:
## YOUR CODE HERE

2. Calculate the standardized bill lengths for all penguins, taking into account which species they belong to. Save your result in a column named `bill_sd` in the `penguins` data frame.
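As a hint, one possible way to standardize within groups is `groupby(...).transform`, which returns values aligned with the original rows. The sketch below uses a small made-up data frame (the names `toy`, `group`, `value` and `value_sd` are placeholders, not the penguin data). It relies on pandas' default sample standard deviation (`ddof=1`); whether the population version (`ddof=0`) is expected here is an assumption to check against the lecture.

```python
import pandas as pd

# Made-up data: two groups with different means and spreads.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# transform() broadcasts each group's statistic back to its rows,
# so every row is compared with the mean and SD of its own group.
grouped = toy.groupby("group")["value"]
toy["value_sd"] = (toy["value"] - grouped.transform("mean")) / grouped.transform("std")
print(toy)
```

Both groups end up on the same scale even though their raw values differ by a factor of ten, which is the point of standardizing per group.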

In [None]:
## YOUR CODE HERE

3. Plot histograms of the column `bill_sd`, such that each species is displayed in its own panel. Also calculate which fraction of the penguins fall within two standard deviations from the mean. Based on the output, could you use a normal approximation for the bill length?
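For reference when judging the fraction you compute, the area under the normal curve within $k$ standard deviations of the mean is $\operatorname{erf}(k/\sqrt{2})$, which gives roughly 68%, 95% and 99.7% for $k = 1, 2, 3$. A minimal sketch using only the standard library (the helper name `fraction_within` is made up for illustration):

```python
import math

def fraction_within(k):
    """Area under the standard normal curve between -k and +k."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
```

If the data follow the normal curve reasonably well, the observed fraction within two standard deviations should be close to the value printed for `k = 2`.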

In [None]:
## YOUR CODE HERE

***Answer:*** *YOUR ANSWER HERE*

## COVID-19: sentiment in Maastricht
Let us now consider another data set, extracted from social media. It covers a collection of users who were retrieved based on posts that referenced a list of keywords related to the COVID-19 pandemic. Among all messages acquired with these keywords, we extracted those users who indicated that their location was a city in the Netherlands. Specifically, we take a look at the accounts that specified "Maastricht" as their location.

First, we load a data frame with general information about the accounts. Because the file uses a tab as separator, we specify `sep="\t"` when loading it. Because the index is stored in the first column of the file, we also pass `index_col=0`.

In [3]:
uinfo = pd.read_csv("users_maastricht.tsv", sep="\t", index_col=0)
uinfo.head()

user_id,followers,friends
043u0001,733,894
043u0002,3882,539
043u0003,92,647
043u0004,87,122
043u0005,3271,3021


Let's plot the distributions of this data.

4. Visualize the distributions of the friend and follower counts by plotting a histogram of each.

In [None]:
## YOUR CODE HERE

As you can see from the figures, this is far from a normal distribution. Let's first look at the minimum and maximum observations for the friends and followers.

5. Determine the minimum and the maximum counts for both the friend and follower distributions.

In [None]:
## YOUR CODE HERE

As you can see, the values span multiple orders of magnitude, with only a few very large observations. This is a clear indication of a heavy-tailed distribution. However, there is another way that we can try to perform a normal approximation for this data.

In this alternative, we first calculate the logarithm of all observations (with `np.log10`, see documentation [here](https://numpy.org/doc/stable/reference/generated/numpy.log10.html)) and use those values for a normalization.
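A minimal sketch of this log-then-standardize idea, on synthetic counts rather than the lab data (the variable names and the generated numbers are made up for illustration). One caveat: `np.log10(0)` evaluates to `-inf`, so the sketch drops non-positive counts before taking the logarithm. That is only one possible choice, and whether it is needed depends on the minimum you found in exercise 5.

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed counts, NOT the Maastricht data.
rng = np.random.default_rng(0)
counts = pd.Series(np.round(10 ** rng.normal(2.5, 0.8, size=1000)))

# log10(0) is -inf, so keep strictly positive counts only (an assumption).
positive = counts[counts > 0]
log_counts = np.log10(positive)

# Standardize on the log scale: mean 0 and standard deviation 1.
log_sd = (log_counts - log_counts.mean()) / log_counts.std()
print(log_sd.describe())
```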

6. Plot the normal approximation of the logarithm of the friend and follower data.

In [None]:
## YOUR CODE HERE

7. Which fraction of the logarithmic scale observations are within two standard deviations from the mean?

In [None]:
## YOUR CODE HERE

The fact that taking the logarithm of the data brings our observations close to a normal approximation tells us that the data can be approximated by a log-normal distribution.

Next, we move on to the messages placed by the users. For this, we load the next data file `tweets_maastricht.tsv`. In this file, we have information about which user posted the tweet and at which time, in addition to the sentiment scores for the valence, arousal and dominance dimensions. We set the column `tweet_id` as the index. Moreover, the file is again tab-separated, so we use `sep="\t"`. Finally, to process the dates at which the tweets are posted, we also specify a function that converts them to the correct timestamps, namely `pd.to_datetime`. Note that loading this dataframe might take some time as a result of this conversion to datetime objects. Therefore, we include the `%%time` command, which allows us to track how long it takes to run a particular cell of code.

In [None]:
%%time
sent = pd.read_csv("tweets_maastricht.tsv", sep="\t", index_col="tweet_id", converters={"created_at": pd.to_datetime})
sent.head()

Let's take a closer look at the `created_at` column. It is a `Series` object that consists of `datetime64` values. Note that these timestamps are in UTC (Coordinated Universal Time).

In [None]:
sent.created_at

When we have a `Series` of datetime objects, we can use the `.dt` accessor (see documentation [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dt-accessor)) to extract properties of each timestamp across the entire `Series`. For instance, we can determine the date on which each tweet was posted as follows:

In [None]:
sent.created_at.dt.date

8. Construct a time-series that contains the daily number of tweets contained in the data set.
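As a hint, counting rows per calendar day can be sketched on a handful of made-up timestamps by grouping on `.dt.date` (the names `stamps` and `daily` are placeholders):

```python
import pandas as pd

# Six made-up UTC timestamps spread over three distinct days.
stamps = pd.Series(pd.to_datetime([
    "2020-03-01 08:15", "2020-03-01 21:40",
    "2020-03-02 09:00",
    "2020-03-04 12:30", "2020-03-04 12:31", "2020-03-04 23:59",
], utc=True))

# Group the timestamps by their calendar date and count the rows per group.
daily = stamps.groupby(stamps.dt.date).size()
print(daily)
```

Note that 2020-03-03 has no entry at all in this result, since grouping only produces days that occur in the data. If days without tweets should show up as zeros, resampling a series with a `DatetimeIndex` by day is a possible alternative.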

In [None]:
## YOUR CODE HERE

9. Determine the time series of the daily number of tweets in the data set (save the result in the variable `ts`) and plot the number of daily tweets from January 1st, 2017 up to the latest day in the data set.

In [None]:
## YOUR CODE HERE

It is clear that the majority of the tweets are observed in the last months of the timespan covered by the data. Let's also look at the weekly average number of tweets. We can compute it with `.rolling` (see documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html)) and a window of 7. After calling `.rolling(7)`, we can perform a calculation on each rolling window, in our case by calling `mean`. We get the following output:

In [None]:
ts.rolling(7).mean()

As you can see, the first observations are `NaN` values, because fewer than seven observations are available up to that point to average over.
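The same effect is easier to see on a tiny made-up series with a window of 3: the first two positions are `NaN` because the window is not yet full. If partial windows are acceptable, the `min_periods` argument of `.rolling` lowers the number of observations required. Whether that is desirable for a weekly average is a judgment call, since the early averages are then based on fewer days.

```python
import pandas as pd

toy = pd.Series([1, 2, 3, 4, 5])

# Default: a value is produced only once the window holds 3 observations.
full = toy.rolling(3).mean()

# min_periods=1: average over whatever is available so far.
partial = toy.rolling(3, min_periods=1).mean()

print(pd.DataFrame({"full": full, "partial": partial}))
```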

10. Include the rolling weekly average in the plot that you made in the previous exercise.

In [None]:
## YOUR CODE HERE

Looking at this weekly average, we see a slight increase in the number of tweets over time. Let's take a closer look at this. First, we will determine the times of the first and last tweets per account in the dataset and store this information in the variables `first_tweet` and `last_tweet`, respectively.

11. Determine the time of the first and last tweet of all accounts in the data set

In [None]:
## YOUR CODE HERE

12. Calculate the difference in days between the two, so that we can sort the accounts by the length of their timelines. Use the `.dt` accessor to do so.
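As a hint for the two steps above, here is a sketch on a made-up data frame (the column name `user_id` and all values are assumptions for illustration). Subtracting two datetime `Series` gives a `Series` of timedeltas, and `.dt.days` extracts the whole-day component, so a timeline shorter than 24 hours counts as 0 days.

```python
import pandas as pd

# Made-up tweets for two hypothetical accounts.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "created_at": pd.to_datetime([
        "2020-01-01 10:00", "2020-01-05 09:00", "2020-01-11 12:00",
        "2020-02-01 08:00", "2020-02-01 20:00",
    ], utc=True),
})

# Earliest and latest timestamp per account.
per_user = toy.groupby("user_id")["created_at"]
first = per_user.min()
last = per_user.max()

# Timedelta per account, reduced to its whole-day component.
days = (last - first).dt.days
print(days)
```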

In [None]:
## YOUR CODE HERE

With this days difference between first and last tweet, we can sort the accounts based on the length of their timeline for visualization purposes. To do so, we have to determine the rank of the observations and subsequently sort these ranks. Save the ordering in the variable `order`.
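One possible reading of "rank, then sort the ranks" is sketched below on made-up durations. The account labels are hypothetical, and `method="first"` is an assumption made here so that tied accounts receive distinct ranks in order of appearance.

```python
import pandas as pd

# Made-up timeline lengths in days for four hypothetical accounts.
days = pd.Series({"u1": 10, "u2": 0, "u3": 42, "u4": 10})

# Rank the accounts; ties (u1 and u4) are broken by order of appearance.
ranks = days.rank(method="first")

# Sorting the ranks yields the accounts from shortest to longest timeline.
order = ranks.sort_values().index
print(list(order))
```

Apart from how ties are handled, `days.sort_values().index` would give the same ordering in a single step.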

13. Sort the accounts based on the number of days in their timeline.

In [None]:
## YOUR CODE HERE

Let's visualize these quantities by drawing horizontal lines that connect the first and last tweet timings.
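As a hint, `Axes.hlines` draws one horizontal segment per account when given a y position together with a start and an end on the x axis. The sketch below uses three made-up accounts; all dates and labels are placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up first and last tweet times for three hypothetical accounts.
first = pd.to_datetime(["2020-01-01", "2019-06-15", "2020-03-10"]).to_numpy()
last = pd.to_datetime(["2020-04-01", "2020-04-02", "2020-03-30"]).to_numpy()

fig, ax = plt.subplots(figsize=(6, 2))

# One segment per account: y is the position in the ordering,
# xmin and xmax are the first and last tweet times.
lines = ax.hlines(y=[0, 1, 2], xmin=first, xmax=last)

ax.set_xlabel("time")
ax.set_ylabel("account")
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

With many accounts, passing the positions in sorted order makes the shortest and longest timelines easy to compare at a glance.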

14. Visualize the lengths of the timelines of the accounts

In [None]:
## YOUR CODE HERE

As you can see in the figure above, the time series of messages for some users are rather short. This is because we could only obtain a limited number of tweets per user, so the more active a user is, the shorter the history we have for that user. This also directly explains why the number of tweets steadily increases as we move closer to the end of the observed period.

## Sentiment distribution
Next, we turn our attention to the sentiment data. As we saw in the first overview of the data, there are several `NaN`-values in the data. First, we want to investigate how many observations have `NaN`-values.

15. Save the names of the sentiment columns in the variable `sent_cols`

In [None]:
## YOUR CODE HERE

16. Determine how many observations are `NaN` in the three sentiment columns in the data frame `sent`
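As a hint, `isna()` returns a boolean data frame that can be summed per column or combined per row. The sketch uses a made-up frame, and the column names `valence`, `arousal` and `dominance` are assumptions based on the description of the file, so check them against `sent.columns`.

```python
import numpy as np
import pandas as pd

# Made-up sentiment scores with some missing values.
toy = pd.DataFrame({
    "valence": [5.1, np.nan, 6.3, np.nan],
    "arousal": [4.0, np.nan, 3.9, 4.4],
    "dominance": [5.5, np.nan, 5.8, 6.0],
})
cols = ["valence", "arousal", "dominance"]  # assumed column names

# True counts as 1, so summing the boolean frame counts NaNs per column.
nan_per_column = toy[cols].isna().sum()

# Rows in which every sentiment column is missing at the same time.
rows_all_nan = toy[cols].isna().all(axis=1).sum()

print(nan_per_column)
print(rows_all_nan)
```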

In [None]:
## YOUR CODE HERE

Next, let's take a look at the distributions of the observations by visualizing them.

17. Display the distribution of the sentiment values in a histogram for each of the dimensions. Bin the observations from `1` to `9` in bins of width `0.1`.
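A note on building the bin edges: with a floating-point step, `np.arange` may or may not include the end point because of rounding, so `np.linspace` with an explicit number of edges is often the safer choice. From 1 to 9 in steps of 0.1 there are 80 bins and therefore 81 edges. The sketch below checks this on synthetic scores, not the lab data.

```python
import numpy as np

# 81 evenly spaced edges from 1 to 9 define 80 bins of width 0.1.
edges = np.linspace(1, 9, 81)

# Synthetic scores clipped to the [1, 9] scale, for illustration only.
rng = np.random.default_rng(1)
values = np.clip(rng.normal(5, 1, size=500), 1, 9)

# The same edges can be passed as bins to a plotting function.
counts, _ = np.histogram(values, bins=edges)
print(len(edges), counts.sum())
```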

In [None]:
## YOUR CODE HERE

Let's explore whether the normal approximation works for the sentiment data. First, we have to normalize the data.

18. Calculate how many standard deviations away from the mean each observation is for each sentiment dimension.

In [None]:
## YOUR CODE HERE

Now that we have the deviations from the mean, we can determine how many observations lie within two standard deviations of the mean.
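A minimal sketch of that calculation on synthetic data (the column names `bell` and `skewed` are made up): comparing the absolute standardized values with 2 gives a boolean frame, and the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Two synthetic columns: one bell-shaped, one strongly right-skewed.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "bell": rng.normal(5, 1, size=10_000),
    "skewed": rng.exponential(1.0, size=10_000),
})

# Standardize every column at once (pandas uses ddof=1 for std by default).
z = (toy - toy.mean()) / toy.std()

# Fraction of observations within two standard deviations, per column.
within_two = (z.abs() <= 2).mean()
print(within_two)
```

Both columns land near 95% here, even though the second is clearly not bell-shaped. A fraction close to 95% is therefore consistent with a normal approximation but does not establish it on its own, so it is worth reading alongside the histogram.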

19. Determine which fraction of the observations falls within two standard deviations from the mean.

In [None]:
## YOUR CODE HERE

Based on these observations, would you say that the normal approximation for the sentiment data is appropriate?

***Answer:*** *YOUR ANSWER HERE*