# Homework 1 (HW1)

---
By the end of this homework we expect you to be able to:  
1. Load data from different formats using [pandas](https://pandas.pydata.org/);  
2. Navigate the documentation of Python packages by yourself;  
3. Filter and tidy up noisy data sets;  
4. Aggregate your data in different (and hopefully helpful) ways;  
5. Create meaningful visualizations to analyze the data;

---

## Important Dates

- Homework release: Fri 2 Oct 2020
- **Homework due**: Fri 16 Oct 2020, 23:59
- Grade release: Fri 23 Oct 2020

---

##  Some rules

1. You are allowed to use any built-in Python library that comes with Anaconda. If you want to use an external library, you have to justify your choice.
2. Make sure you use the `data` folder provided in the repository in **read-only** mode.
3. Be sure to provide a textual description of your thought process, the assumptions you made, the solution you implemented, and explanations for your answers. A notebook that only has code cells will not suffice.
4. For questions containing the **/Discuss:/** prefix, answer not with code, but with a textual explanation (in either comments or markdown).
5. Back up any hypotheses and claims with data, since this is an important aspect of the course.
6. Please write all your comments in English, and use meaningful variable names in your code. Your repo should have a single notebook (plus the required data files) in the master branch. If there are multiple notebooks present, we will not grade anything.
7. Also, be sure to hand in a fully-run and evaluated notebook. We will not run your notebook for you; we will grade it as is. Only the results contained in your evaluated code cells will be considered, and we will not see the results of unevaluated code cells. To check whether everything looks as intended, view the rendered notebook on the GitHub website once you have pushed your solution there.
8. Make sure to print results or dataframes that confirm you have properly addressed the task.



## Context

The coronavirus pandemic has led to the implementation of unprecedented non-pharmaceutical interventions ranging from case isolation to national lockdowns. These interventions, along with the disease itself, have created massive shifts in people’s lives. For instance, in mid-May 2020, more than one third of the global population was under lockdown, and millions have since lost their jobs or have moved to work-from-home arrangements.


Importantly, the disease has shifted people's [needs](https://en.wikipedia.org/wiki/Toilet_paper), [interests](https://en.wikipedia.org/wiki/TikTok), and [concerns](https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Switzerland) across the globe.

In this homework, we will take a deep dive into Wikipedia data and try to uncover what changed with the pandemic. More specifically, we will be focusing on Wikipedia pageviews, that is, how many people read each article on Wikipedia each day.
A nice graphical user interface for playing with Wikipedia pageviews is available [here](https://pageviews.toolforge.org/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&range=latest-20&pages=Cat|Dog).
Also, the Wikimedia Foundation releases dump files with the number of pageviews per article across all Wikimedia websites, including Wikipedia in all its language editions [(amazing, right?)](https://dumps.wikimedia.org/other/pagecounts-ez/). 

#### But wait, what is a pageview?

> A pageview or page view, abbreviated in business to PV and occasionally called page impression, is a request to load a single HTML file (web page) of an Internet site. On the World Wide Web, a page request would result from a web surfer clicking on a link on another page pointing to the page in question. (Source: [Wikipedia article "Pageviews"](https://en.wikipedia.org/wiki/Pageview))

Pageviews in Wikipedia can tell us that people are looking for certain information online. Analyzing how the volume and the distribution of pageviews changed can tell us about how the behavior of Wikipedia readers has changed.

In this homework, you will take a deep dive into analyzing Wikipedia pageview logs and uncover shifts in interests associated with the current pandemic.

---

## The data

First, you need to download a **meraviglioso** dataset from the Italian Wikipedia that we prepared for you. The structure of the data is described next. 

**The dataset is available in the `data` directory pushed to the same GitHub repo as the homework**. Inside of the data directory, you will find three files:

### `articles.tsv.gz`

This is a tab-separated file containing daily pageviews for a subset of the articles from Italian Wikipedia. It is compressed! Each row corresponds to a different article, and each column (except the first) corresponds to the number of pageviews this article received on a given day. The example below shows the structure for two of the things [Kristina Gligorić](https://kristinagligoric.github.io/), one of your TAs, likes the most on her pizza:

**Example:**
~~~
index       2018-01-01 00:00:00    2018-01-02 00:00:00 (...)
Formaggio   100                    101                 (...)
Ananas      12                     54                  (...)
(...)       (...)                  (...)
~~~


### `topics.jsonl.gz`

This is a classification of which topics an article belongs to, according to a model released by the Wikimedia Foundation (the classes are derived from this [taxonomy](https://www.mediawiki.org/wiki/ORES/Articletopic)). Importantly, this file was obtained from English Wikipedia, while the previous one contains articles from the Italian Wikipedia. This is important because article titles in the Italian Wikipedia are in Italian, while article titles in the English Wikipedia are in English (duh!). In any case, each line contains a JSON object with:
1. the English name of the article (`index`);
2. a set of fields related to topics themselves. Each of these fields is set as either `True` (if the article belongs to this topic) or `False` (if it does not). Notice that the same article may belong to multiple topics. 

**Example:**
~~~
{"index": "Cheese", "Culture.Food and drink": True, "Culture.Literature": False (...)}
{"index": "Pineapple", "Culture.Food and drink": True, "Culture.Literature": False (...)}
(...)
~~~

 
### `mapping.pickle`

This is a `.pickle` file, that is, a serialized Python object. You can read about Python pickles  [here](https://wiki.python.org/moin/UsingPickle), 
but in short: the default Python library `pickle` allows you to save and load Python objects to and from disk. This is one object saved via the pickle library: a Python dictionary containing a mapping between the English names and the Italian names of Wikipedia articles:

**Example:**
~~~
{
    "Cheese": "Formaggio",
    "Pineapple": "Ananas",
    (...)
}
~~~
---
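
As a minimal sketch of what "saving and loading Python objects" means, the snippet below round-trips a small dictionary through `pickle`. The dictionary contents and the temporary file name are made up for illustration; reading the real `mapping.pickle` only needs the `pickle.load` half of this example.

```python
import os
import pickle
import tempfile

# Hypothetical toy mapping (English title -> Italian title), for illustration only
toy_mapping = {"Cheese": "Formaggio", "Pineapple": "Ananas"}

# Write the object to a temporary file so the `data` folder stays untouched
toy_path = os.path.join(tempfile.mkdtemp(), "toy_mapping.pickle")
with open(toy_path, "wb") as handle:      # pickles are binary, hence the "b"
    pickle.dump(toy_mapping, handle)

# Read it back: the result is an ordinary Python dict again
with open(toy_path, "rb") as handle:
    loaded_mapping = pickle.load(handle)

print(loaded_mapping["Cheese"])
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.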


## _Step 1:_ Loading the data

---
### **Task 1**

Your first task is to load all these datasets into memory using pandas and pickle. 
**You should load the files compressed.**

Here, the files at hand are rather small, and you could easily uncompress  the files to disk and work with them as plain text. 
Why, then, are we asking you to load the files compressed? The reason is that, in your life as a data scientist, this will often not be the case.

Working with compressed files is then key, so that you don't receive e-mails from your (often more responsible) colleagues demanding to know how you managed to fill the entire cluster with your datasets. 
Another big advantage of compressing files is simply that they can be read faster. You will often find that reading compressed data on the fly (uncompressing it as you go) is much faster than reading uncompressed data, since reading from and writing to the disk may be your [bottleneck](https://skipperkongen.dk/2012/02/28/uncompressed-versus-compressed-read/). 

 
---

**Hint:** `pandas` can open compressed files.

**Hint:** In the real world (and in ADA-homework), your file often comes with some weird lines! 
This time you can safely ignore them (but in the real world you must try to understand why they are there!). Check the `error_bad_lines` parameter of `read_csv` (replaced by `on_bad_lines` since pandas 1.3).
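
To illustrate both hints at once, here is a small sketch that writes a gzip-compressed, tab-separated toy file with one malformed line and reads it back without uncompressing it to disk. The file name and its contents are invented for this example, and the `on_bad_lines="skip"` argument assumes pandas 1.3 or newer.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical toy file: a header, two valid rows, and one row with too many fields
rows = [
    "index\t2018-01-01\t2018-01-02",
    "Formaggio\t100\t101",
    "a weird line\twith\ttoo\tmany\tfields",
    "Ananas\t12\t54",
]
toy_path = os.path.join(tempfile.mkdtemp(), "toy.tsv.gz")
with gzip.open(toy_path, "wt", encoding="utf-8") as handle:
    handle.write("\n".join(rows) + "\n")

# pandas infers gzip compression from the ".gz" extension and decompresses on the fly;
# on_bad_lines="skip" drops rows that have more fields than the header
toy = pd.read_csv(toy_path, sep="\t", on_bad_lines="skip")
print(toy)
```

Rows with too many fields are the ones the parser rejects; skipping them silently is convenient here, but it also hides how many lines were lost.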

In [None]:
import pandas as pd
import pickle

### ~ 1.1
# Data directory
data_dir = 'data/'

# Load datasets
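# `on_bad_lines='skip'` (pandas >= 1.3) silently drops malformed lines; on older
# versions use `error_bad_lines=False, warn_bad_lines=False` instead
articles_raw = pd.read_csv(data_dir + 'articles.tsv.gz', sep='\t', compression='gzip',
                           on_bad_lines='skip')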
topics_raw = pd.read_json(data_dir + 'topics.jsonl.gz', lines=True, compression='gzip')
mapping = pd.read_pickle(data_dir + 'mapping.pickle')

## _Step 2:_ Filtering the data

---
### **Task 2**

Oh no! Something seems wrong with your dataframe!
It seems that some of the lines in the `articles.tsv.gz` are weird! 
They have titles in the format `"Discussione:name_of_the_page"`.

Unsure of what they mean, you ask about them in the [Wiki-research mailing list](https://lists.wikimedia.org/mailman/listinfo/wiki-research-l).
Twenty minutes later a kind internet stranger comes with an answer! 
She tells you that these are talk pages, where people discuss what should and should not be in the article (in fact, they can be pretty funny to read, e.g., [you can read Italians debating about pizza](https://it.wikipedia.org/wiki/Discussione:Pizza)).

After understanding what they are, your task is now to filter these lines using `pandas`! After all, we are interested in pageviews going towards articles! Not discussion pages!

---

**Hint**: There is one of them in the position \#180 of the dataframe, if you want to check it!

In [None]:
### ~ 2.1
# First verify that the index values are unique
assert articles_raw['index'].is_unique
assert topics_raw['index'].is_unique

# Set the index column as index for both dataframes
articles = articles_raw.set_index('index')
topics = topics_raw.set_index('index')

# Remove all discussion (talk) pages, whose titles start with "Discussione:"
articles = articles[~articles.index.str.startswith("Discussione:", na=False)]
# Count removed rows with len(); .size would count cells (rows x columns) instead
print("==> {} discussion pages have been deleted".format(len(articles_raw) - len(articles)))

## *Step 3*: Understanding the data

---
### **Task 3.1**
Data cleaning is hard, huh? But now that this headache is behind us, we can go on to explore our data.

Let's begin with some basic stats. It is always important to do this as a sanity check.

You should:

1. Start by calculating how many topics and articles there are. Also, while you are at it, print the names of the topics to get a grasp of what they are about. 
2. Calculate the average daily number of pageviews in the dataset.
3. **Discuss:** As previously mentioned, your data is a sample of _some_ (and not all) Wikipedia articles! Estimate (roughly) what percentage of Italian Wikipedia articles are in your dataset comparing your daily average pageview numbers with the official statistics (which can be found [here](https://pageviews.toolforge.org/siteviews/?platform=desktop&source=pageviews&agent=user&start=2020-01-01&end=2020-09-21&sites=it.wikipedia.org)). Notice that we are focusing on the desktop version of Wikipedia.

---
**Hint**: topics are in the columns of the topic file!

In [None]:
### ~ 3.1.1
### Your code here! ###

In [None]:
### ~ 3.1.2
### Your code here! ###

In [None]:
### ~ 3.1.3
### Your text (and code if necessary) here! ###


### **Task 3.2**

Now that we have a better understanding of the data, let's look at some articles to get a feeling of what is happening. 

Your task is to:

1. Find all articles whose names contain the sequence of characters `"virus"` (case insensitive) and that received at least 7,000 pageviews across the entire period (no point in zooming in on very insignificant articles);
2. Find a way to nicely visualize __each__ one of the time series (in a single plot, which may have multiple panels; in the lecture, Bob referred to these as “small multiples”). Your visualization should allow one to see the overall trend of each of the different articles, with the least noise possible. Additionally, highlight two specific dates in your plot: 31 January ([first case reported in Italy](https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Italy#First_confirmed_cases)) and 21 February ([when multiple cases were confirmed in northern Italy](https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Italy#Clusters_in_northern_Italy)).
3. **Discuss**: What did you observe? Did all the articles behave similarly?

---

**Hint**: The column dates are currently strings which are not very plot friendly. You can turn them into datetime objects using: 

~~~python
your_dataframe_name.columns = pd.to_datetime(your_dataframe_name.columns)
~~~

Notice that this only works if you have only date-related columns. Fortunately, if you get rid of the `index` column by making it a real pandas index, things should work just fine.

**Hint**: Choose your axes wisely!



In [None]:
### ~ 3.2.1
### Your code here! ###

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(14,3)) # change this if needed
### ~ 3.2.2
### Your code here! ###

In [None]:
### ~ 3.2.3
### Your text here! ###


### **Task 3.3**

Before we move on, let's make a final sanity check and analyze the distribution of pageviews over all articles in our dataset. You are given a function to calculate the **cumulative distribution function** (CDF) of a sample. The CDF is a function f(x) associated with a data sample. For each value x, f(x) represents the fraction of elements in your sample that have values smaller than or equal to x (read more about it [here](https://en.wikipedia.org/wiki/Empirical_distribution_function)).
Your task is to:

1. Calculate the CDF of the distribution of pageviews across all days over articles. That is, a) calculate the total number of pageviews each article has received and then, b) calculate the CDF for these values.


2. Now plot this function using different scales for the x- and y-axis. You should plot it in 4 different ways:

    a. x-axis on linear scale, y-axis on linear scale
    
    b. x-axis on log scale, y-axis on linear scale
    
    c. x-axis on linear scale, y-axis on log scale
    
    d. x-axis on log scale, y-axis on log scale
    
3. **Discuss:** There is a pretty odd fact about the distribution of our data! Can you spot it and describe it? Which of the different plots (a-d) allows you to find this oddity? Why isn't this visible in the other plots?

---

**Hint:** You can use `plt.xscale` and `plt.yscale`.
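
To make the CDF definition in this task concrete, here is a tiny worked example on a made-up sample. The helper `empirical_cdf` is a hypothetical alternative written for this illustration (it is not the `get_cdf` function provided below) and it only reports the values that actually occur in the sample.

```python
import numpy as np

def empirical_cdf(sample):
    """Return the sorted unique values and the fraction of the sample <= each value."""
    values, counts = np.unique(np.asarray(sample), return_counts=True)
    return values, np.cumsum(counts) / counts.sum()

# Toy sample: one 1, two 2s, one 4
values, fractions = empirical_cdf([1, 2, 2, 4])

for value, fraction in zip(values, fractions):
    print(f"f({value}) = {fraction:.2f}")
```

For this sample, three of the four elements are smaller than or equal to 2, so f(2) = 0.75, and f reaches 1 at the maximum value.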

In [None]:
# This function is being given to you with a usage example :)! Make good use!

import numpy as np

def get_cdf(vals):
    # Input:
    # -- vals: an np.array of positive integers
    # Output:
    # -- x: an array containing all numbers from 1 to max(vals);
    # -- y: an array containing the (empirically calculated) probability that vals <= x
    
    y = np.cumsum(np.bincount(vals)[1:])  # number of values <= x, for x = 1..max(vals)
    y = y / y[-1]                         # normalise the counts to fractions
    x = list(range(1, max(vals) + 1))
    return x, y  

vals = np.array([1,2,3,4,1,2,4,3,4,4,5,4])
x, y = get_cdf(vals)
plt.plot(x, y)
plt.show()

In [None]:
### ~ 3.3.1
### Your code here! ###

In [None]:
### ~ 3.3.2
### Your code here! ###

In [None]:
### ~ 3.3.3
### Your text here! ###

## *Step 4*: Analyzing Overall Pageview Volume


---
### **Task 4.1**

So far we have seen anecdotal examples. Now let’s move to the big picture! How did Wikipedia pageviews change in general? To gain a better understanding of how Wikipedia’s overall pageview volume has changed during the pandemic, you should do the following:

1. Calculate and visualize the pageview trend, summed across **all** articles in the Italian Wikipedia, for the year 2020 (and only for 2020!). 
2. **Discuss**: what regular pattern (something that repeats over and over) do you see in the data?
3. Pre-process the data to remove this regular pattern and make the overall trend clearer. Repeat the plot with the processed data.

---

**Hint**: A convenient way to use `.groupby` alongside dates is to use the [`pd.Grouper`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Grouper.html) class. Basically, it allows you to group by date periods given frequencies determined by the parameter `freq`. To read how to specify different types of frequencies, see [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases). Recall that, in order to turn an index, column index -- or pretty much anything -- into a timestamp, you can use [`DataFrame.to_timestamp`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_timestamp.html).
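
As a rough sketch of how `pd.Grouper` behaves, the example below groups a made-up daily series into weekly sums. The series (one pageview per day) is invented for illustration, and the `"W-TUE"` anchor is just one possible choice of frequency.

```python
import pandas as pd

# Hypothetical toy series: one pageview per day for the first 14 days of 2020
days = pd.date_range("2020-01-01", periods=14, freq="D")
daily_views = pd.Series(1, index=days)

# "W-TUE" means weekly bins that end on a Tuesday; 1 January 2020 is a Wednesday,
# so each bin here covers exactly Wednesday..Tuesday
weekly_views = daily_views.groupby(pd.Grouper(freq="W-TUE")).sum()
print(weekly_views)
```

Each bin is labelled with its last day, so the first label is 7 January 2020. With a different anchor, the first and last bins would be partial weeks.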

### 4.1.1

In [None]:
articles_2020 = articles.copy()
# convert column names to timestamps
articles_2020.columns = pd.to_datetime(articles_2020.columns)
# drop all days before 2020
articles_2020 = articles_2020.loc[:, articles_2020.columns >= pd.Timestamp("2020-01-01")]
# compute total views of all articles for a given day
articles_2020_total = articles_2020.sum()
# visualize the results
plt.figure(figsize=(14,3))
articles_2020_total.plot()
plt.xlabel('Date')
plt.ylabel('Total number of pageviews')
plt.title('Pageview Volume in 2020');

**Observation** 

The graph has spikes and is hard to read. Nonetheless, we can observe a pattern of a sharp increase followed by a rapid decrease. Furthermore, there seem to be about four spikes per month. We can infer that the number of pageviews changes throughout the week. To confirm this, let's group pageviews by day of the week.

In [None]:
# function taken from the data visualisation tutorial. It allows us to compute non-parametric confidence intervals
def bootstrap_CI(data, nbr_draws):
    means = np.zeros(nbr_draws)
    data = np.array(data)

    for n in range(nbr_draws):
        indices = np.random.randint(0, len(data), len(data))
        data_tmp = data[indices] 
        means[n] = np.nanmean(data_tmp)

    return [np.nanpercentile(means, 2.5),np.nanpercentile(means, 97.5)]

In [None]:
# get back frame of total pageviews per day
articles_per_day_total = articles_2020_total.to_frame()
articles_per_day_total = articles_per_day_total.reset_index()
articles_per_day_total = articles_per_day_total.rename(columns={0: "total pageviews", "index": "day"})

# find day of the week for each date
articles_per_day_total.day = articles_per_day_total.day.map(lambda timestamp: timestamp.dayofweek)

# find mean page views for each weekday in 2020 with non-parametric confidence intervals
# this code was adapted from the data visualisation tutorial solution
def weekday_stats(day):
    # bootstrap once so that both bounds come from the same set of resamples
    lower, upper = bootstrap_CI(day["total pageviews"], 1000)
    return pd.Series({'average': day["total pageviews"].mean(),
                      'lower_err': lower,
                      'upper_err': upper})

articles_per_day_total_group = articles_per_day_total.groupby(articles_per_day_total.day).apply(weekday_stats)

# visualize the results
plt.figure(figsize=(7,3))
capsize = 5
# yerr expects non-negative distances below and above the mean
plt.errorbar(articles_per_day_total_group.index, articles_per_day_total_group.average,
             yerr = [articles_per_day_total_group.average - articles_per_day_total_group.lower_err,
                     articles_per_day_total_group.upper_err - articles_per_day_total_group.average],
             capsize = capsize)

plt.xticks(range(7), ['Monday','Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.xlabel('Day')
plt.ylabel('Mean number of pageviews')
plt.title('Mean number of pageviews per weekday in 2020');

**Observation**

The confidence intervals are non-parametric and are calculated at a confidence level of 95%. 
From the mean number of pageviews per weekday, we can separate the week into two categories: the business days (Monday to Friday), where the number of pageviews is high but decreases each day, and the weekend (Saturday and Sunday), where the number of pageviews is low. 

### 4.1.2

**Discussion**

This observation explains the pattern in the first graph: the peaks correspond to the sharp increase in pageviews after the weekend, and the rapid decrease corresponds to the decline seen throughout the week.

To smooth out this effect we should instead group **by week** so this effect is removed.

### 4.1.3

In [None]:
# group by week (we anchor on Tuesday because 2020 starts on a Wednesday, so each weekly bin runs from Wednesday to Tuesday)
articles_2020_total_week = articles_2020_total.to_frame().reset_index().groupby(pd.Grouper(key="index", freq="W-TUE")).sum()

# visualise smoothed result
plt.figure(figsize=(14,3))
plt.plot(articles_2020_total_week.index[:-1], articles_2020_total_week[0][:-1])
plt.xlabel('Date (weeks)')
plt.ylabel('Total number of pageviews')
plt.title('Pageview Volume aggregated by week in 2020');

In [None]:
# 7-day moving average: entry i is the mean of the daily totals for days i..i+6 of 2020
articles_mobile_avg_week_2020 = pd.DataFrame(data=np.convolve(articles_2020_total, [1/7.0]*7, 'valid'), columns=["Moving average by week"]) 
# visualise smoothed result
plt.figure(figsize=(14,3))
plt.plot(articles_mobile_avg_week_2020.index, articles_mobile_avg_week_2020["Moving average by week"])
plt.xlabel('Days since 1 January 2020 (start of the 7-day window)')
plt.ylabel('Mean daily number of pageviews')
plt.title('Pageview volume in 2020 smoothed with a 7-day moving average');

**Note for graphs grouped by year** (to be removed if we keep the moving average)

We group by week, starting on the first day of each year. For 2020 this is a Wednesday (so we anchor on Tuesday) and for 2019 it is a Tuesday (so we anchor on Monday). However, the last week of each grouping does not contain a full week's worth of pageviews, so we decided to remove it from the visualisations.

### **Task 4.2**

To get an even clearer picture, your task now is to compare the pageview time series of the current year (2020) with the time series of the previous year (2019).

1. Make a visualization where the two years are somehow "aligned", that is, where it is possible to compare the same time of year across the two years. Additionally, your visualization should highlight the date on which the nationwide lockdown started in Italy, 9 March 2020. Preprocess each one of the time series (for each year) the same way you did in Task 4.1.

2. **Discuss:** What changed from 2019 to 2020? Form and justify hypotheses about the reasons behind this change.

---

**Hint**: In order to use two different y-axes in the same plot, you can use `plt.twinx()` or `ax.twinx()` (the latter if you are using the subplots environment; [see this example](https://matplotlib.org/3.3.1/gallery/subplots_axes_and_figures/two_scales.html)).
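
One possible way to "align" the two years, sketched here on made-up data, is to re-index each daily series by its day of the year so that both share the same x-axis. The helper `by_day_of_year` and the constant toy values are assumptions made for this illustration; aligning by week number would be another reasonable choice.

```python
import pandas as pd

def by_day_of_year(series):
    """Re-index a daily series by day of the year (1..365, or 1..366 in a leap year)."""
    return pd.Series(series.to_numpy(), index=series.index.dayofyear)

# Hypothetical toy data: constant daily totals for each year
views_2019 = pd.Series(100, index=pd.date_range("2019-01-01", "2019-12-31", freq="D"))
views_2020 = pd.Series(120, index=pd.date_range("2020-01-01", "2020-12-31", freq="D"))

# One row per day of the year, one column per year
aligned = pd.DataFrame({"2019": by_day_of_year(views_2019),
                        "2020": by_day_of_year(views_2020)})

# Position of the lockdown date on the shared x-axis
lockdown_day = pd.Timestamp("2020-03-09").dayofyear
print(aligned.loc[lockdown_day])
```

Since 2020 is a leap year, from March onwards equal day-of-year values are one calendar date apart between the two years, and the 2019 column has no value for day 366.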

### 4.2.1 Grouping

In [None]:
articles_2019 = articles.copy()
# convert column names to timestamps
articles_2019.columns = pd.to_datetime(articles_2019.columns)
# keep only year 2019
articles_2019 = articles_2019.loc[:, articles_2019.columns.year == 2019]
# sum all articles per day
articles_2019_total = articles_2019.sum()



In [None]:
# group by week
articles_2019_total_week = articles_2019_total.to_frame().reset_index().groupby(pd.Grouper(key="index", freq="W-MON")).sum()

In [None]:
fig, ax1 = plt.subplots()

# 2019
color = "#91bfdb" #blue 
ax1.set_xlabel('time (week)')

size_of_weeks_2019 = len(articles_2019_total_week.index) - 1 # we remove the last week grouping as explained in the note earlier
ax1.plot(range(size_of_weeks_2019), articles_2019_total_week[0][:-1], color=color)
ax1.set_ylim(0,3400000) # share y axis
ax1.legend(["pageviews in 2019"],loc='upper right', bbox_to_anchor=(1, 1))


# 2020
color = "#fc8d59" #orange
ax2 = ax1.twinx()  # share x
size_of_weeks_2020 = len(articles_2020_total_week.index) - 1 # we remove the last week grouping as explained in the note earlier
ax2.plot(range(size_of_weeks_2020), articles_2020_total_week[0][:-1], color=color)
ax2.set_ylim(0,3400000) # share y axis
ax2.set_yticklabels([])
ax2.legend(["pageviews in 2020"],loc='upper right', bbox_to_anchor=(1, 0.92))

# lockdown

# find position of lockdown on 9th of March
# since the grouping takes weeks ending on Tuesday we simply need to find how 
# many Tuesday we have before the 9th of March. The 9th of March will be the positioned at the interval start 
# offset by 6 days which corresponds to +6/7
lockdown_position_in_year = len(articles_2020_total_week[articles_2020_total_week.index <= pd.Timestamp("2020-03-09")])
lockdown_position_in_year += 6.0/7.0
plt.axvline(x=lockdown_position_in_year, ymin=0, ymax=1, color="#a50026")
text_x_offset = 0.5
text_y_offset = 750000
plt.text(lockdown_position_in_year + text_x_offset,text_y_offset,'9 March 2020, Lockdown', color="#a50026", fontsize=20)

fig.set_figwidth(12)
fig.set_figheight(5)

plt.title('Comparing the pageview volume evolution between 2019 and 2020');
plt.show()

# add convolve

In [None]:
articles_mobile_avg_week_2019 = pd.DataFrame(data=np.convolve(articles_2019_total.values, [1/7.0]*7, 'valid'), columns=["Moving average by week"]) 


In [None]:
fig, ax1 = plt.subplots()

# 2019
color = "#91bfdb" #blue 
ax1.set_xlabel('time (days since 1 January)')
ax1.plot(articles_mobile_avg_week_2019.index, articles_mobile_avg_week_2019["Moving average by week"], color=color)
ax1.set_ylim(0,500000) # share y axis
ax1.legend(["moving average of pageviews in 2019"],loc='upper right', bbox_to_anchor=(1, 1))


# 2020
color = "#fc8d59" #orange
ax2 = ax1.twinx()  # share x
ax2.plot(articles_mobile_avg_week_2020.index, articles_mobile_avg_week_2020["Moving average by week"], color=color)
ax2.set_ylim(0,500000) # share y axis
ax2.set_yticklabels([])
ax2.legend(["moving average of pageviews in 2020"],loc='upper right', bbox_to_anchor=(1, 0.92))


# lockdown

# find position of lockdown on 9th of March (number of days since 1 January 2020)
lockdown_position_in_year = articles_2020_total.index.get_loc(pd.Timestamp("2020-03-09"))
plt.axvline(x=lockdown_position_in_year, ymin=0, ymax=1, color="#a50026")
text_x_offset = 10
text_y_offset = 25000
plt.text(lockdown_position_in_year + text_x_offset,text_y_offset,'9 March 2020, Lockdown', color="#a50026", fontsize=20)

fig.set_figwidth(12)
fig.set_figheight(5)

plt.title('Comparing the pageview volume evolution between 2019 and 2020 with a 7-day moving average');
plt.show()



**Discussion**

We use a moving average to smooth out the effect that the day of the week has on the pageviews. This means that the value plotted for each day is the average of the daily totals over a seven-day window starting on that day.
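
The smoothing described above can be sketched on a small made-up array. The values and the window of 3 (instead of 7) are chosen only to keep the arithmetic easy to follow; the `rolling` variant is shown as an equivalent pandas formulation, not as the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical daily totals and a short window, for illustration only
daily_totals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
window = 3

# Convolving with a uniform kernel averages each run of `window` consecutive days;
# mode="valid" keeps only the positions where the window fits entirely
smoothed = np.convolve(daily_totals, np.ones(window) / window, mode="valid")

# Same numbers with pandas: rolling mean, dropping the incomplete leading windows
rolling = pd.Series(daily_totals).rolling(window).mean().dropna().to_numpy()

print(smoothed)
print(rolling)
```

With `mode="valid"` the output has `len(daily_totals) - window + 1` entries, which is why the smoothed series is a few days shorter than the original one.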

Both 2019 and 2020 seem to follow the same trend over the first 10 weeks, with 2020 having fewer pageviews overall than 2019. However, around the lockdown on the 9th of March 2020, the number of pageviews increases dramatically. In contrast, in 2019 the number of pageviews stays relatively constant. Since the sharp increase in pageviews coincides with the lockdown, we can infer that the **lockdown and the increase in pageviews are correlated.**

However, why would one affect the other? Here are two hypotheses:
 1. The coronavirus prompted an interest in medical articles to understand what the situation was about which significantly increased the overall pageview volume.
 2. Similarly, lockdown boredom could simply have caused a global increase in pageviews.
 
In the following part we will have the opportunity to examine these hypotheses.

## *Step 5*: Fiddling with Topics

---
### **Task 5.1**

We now turn to a different question: what topics were impacted by the lockdown? 
To start unpacking this question, your task now is to aggregate, for each day, all pageviews that went to each one of the 64 topics. 

There are multiple ways to do this, but for the sake of this exercise, you must create a dataframe where each row contains the number of pageviews a topic obtained on a given day! Example:

~~~
index       date                   views             
TOPIC1      2019-01-01             101              
TOPIC1      2019-01-02             151             
(...)       (...)                  (...)
TOPICK      2019-01-01             1010              
TOPICK      2019-01-02             2123            
(...)       (...)                  (...)
~~~

---

**Hint**: You should find a way to make the index of the dataframe with the topics the same as the index of the dataframe with the articles. See the file `mapping.pickle`.

**Hint**: You may want to use `.melt`.
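
To show what `.melt` does, here is a minimal sketch that turns a small wide table (one column per day) into the long format from the example above. The topic names and numbers are the placeholder values from that example, not real data.

```python
import pandas as pd

# Hypothetical wide table: one row per topic, one column per day
wide = pd.DataFrame(
    {"2019-01-01": [101, 1010], "2019-01-02": [151, 2123]},
    index=pd.Index(["TOPIC1", "TOPICK"], name="index"),
)

# melt keeps the identifier column and stacks every other column
# into (variable, value) pairs, here renamed to (date, views)
long = wide.reset_index().melt(id_vars="index", var_name="date", value_name="views")
print(long)
```

The result has one row per (topic, day) pair, so a table with 2 topics and 2 days becomes 4 rows.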

In [None]:
### ~ 5.1.1
### Your code here! ###

### **Task 5.2**

Now to the **grand finale**. We will consider two periods:
- the 35 days before the quarantine started (on the 9th of March); and 
- the 34 days after the quarantine started (including the day of the quarantine itself).

Create a visualization where you can compare, for each topic, the mean **number of views** in the aforementioned periods (that is, before and after the quarantine started). **Although there is a very large number of topics, your visualization should be a compact panel, small enough to fit on an A4 page.**
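
The two periods can be written down explicitly as date ranges. The sketch below does this and computes a mean per period on a made-up constant series; the variable names and toy values are assumptions for illustration only.

```python
import pandas as pd

quarantine_start = pd.Timestamp("2020-03-09")

# 35 days ending on the day before the quarantine, 34 days starting on the quarantine day
before_days = pd.date_range(end=quarantine_start - pd.Timedelta(days=1), periods=35, freq="D")
after_days = pd.date_range(start=quarantine_start, periods=34, freq="D")

# Hypothetical toy series: 100 views per day before, 150 views per day after
toy_views = pd.Series(100, index=before_days.union(after_days))
toy_views.loc[after_days] = 150

mean_before = toy_views.loc[before_days].mean()
mean_after = toy_views.loc[after_days].mean()
print(before_days[0].date(), before_days[-1].date(), mean_before)
print(after_days[0].date(), after_days[-1].date(), mean_after)
```

Under this reading of the task, the first period runs from 3 February to 8 March 2020 and the second from 9 March to 11 April 2020.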

---

**Hint**: [Hoooray](https://seaborn.pydata.org/examples/index.html).

In [None]:
import seaborn as sns
plt.figure(figsize=(14,10)) # change this if needed

### ~ 5.2
### Your code here! ###

### **Task 5.3**

Notice that the previous analysis fails to isolate the increases or decreases in each individual topic from the overall increases or decreases in pageviews across Wikipedia in general. That is, it could be that all topics gained/lost pageviews, but some did so more than articles in general, while others did so less than articles in general. To address this issue, you should:


1. Normalize the pageviews counts in the dataframe created in Task 5.1. Instead of using the raw number of pageviews, you should compute, for each day, what fraction of all pageviews a topic received.

2. Create a second visualization that shows not the **raw** value of pageviews before and after, but the **relative** value that you just calculated.

3. **Discuss:** According to Task 5.2, what topics have increased in terms of the raw, absolute number of pageviews after the quarantine started? In relative, rather than absolute, terms, do these findings still hold? If not, what has changed?

---
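
As a hedged sketch of the normalization in step 1, the snippet below divides each topic's daily views by a per-day total, using a made-up long-format table. Here the denominator is the per-day sum over topics, which is only one possible choice: since an article may belong to several topics, the total over all articles would be a different (and smaller) denominator.

```python
import pandas as pd

# Hypothetical long-format table in the shape produced in Task 5.1
long = pd.DataFrame({
    "index": ["TOPIC1", "TOPICK", "TOPIC1", "TOPICK"],
    "date": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"],
    "views": [100, 300, 50, 150],
})

# transform("sum") returns the per-day total aligned with every original row
daily_total = long.groupby("date")["views"].transform("sum")
long["share"] = long["views"] / daily_total
print(long)
```

In this toy table the raw views halve from one day to the next, while each topic's share stays the same, which is exactly the kind of effect the relative view is meant to separate out.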

In [None]:
### ~ 5.3.1
### Your code here! ###

In [None]:
### ~ 5.3.2
### Your code here! ###

In [None]:
### ~ 5.3.3
### Your text here! ###

---