Course Human-Centered Data Science ([HCDS](https://www.mi.fu-berlin.de/en/inf/groups/hcc/teaching/winter_term_2020_21/course_human_centered_data_science.html)) - Winter Term 2020/21 - [HCC](https://www.mi.fu-berlin.de/en/inf/groups/hcc/index.html) | [Freie Universität Berlin](https://www.fu-berlin.de/)
***
# A2 - Reproducibility Workflow


Your assignment is to create a graph that looks a lot like the one below, starting from scratch and following best practices for reproducible research.

![wikipedia_pageViews_2008-2020.png](img/wikipedia_pageViews_2008-2020.png)

## Before you start
1. Read all instructions carefully before you begin.
1. Read all API documentation carefully before you begin.
1. Experiment with queries in the sandbox of the technical documentation for each API to familiarize yourself with the schema and the data.
1. Ask questions if you are unsure about anything!
1. When documenting your project, please keep the following questions in your mind:
   * _If I found this GitHub repository, and wanted to fully reproduce the analysis, what information would I want?_
   * _What information would I need?_

## Step 1: Data acquisition
In order to measure Wikipedia traffic from January 2008 until October 2020, you will need to collect data from two different APIs:

1. The **Legacy Pagecounts API** ([documentation](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts), [endpoint](https://wikimedia.org/api/rest_v1/#!/Pagecounts_data_(legacy)/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end)) provides access to desktop and mobile traffic data from December 2007 through July 2016.
1. The **Pageviews API** ([documentation](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews), [endpoint](https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end)) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through last month.

For each API, collect data for all months where data is available, then save the raw results into five separate `JSON` files (one file per API query type: two for the Legacy Pagecounts API and three for the Pageviews API) before continuing to step 2.

To get you started, you can use the following **sample code for API calls**:

In [35]:
# Source: https://public.paws.wmcloud.org/User:Jtmorgan/data512_a1_example.ipynb?format=raw
import json
import requests

endpoint_legacy = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'
endpoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

def get_param_pc(start, end, access_type):
    # Parameters for the Legacy Pagecounts API
    parameters = {"api_name" : "pagecounts",
                  "project" : "en.wikipedia.org",
                  "access-site" : access_type,
                  "granularity" : "monthly",
                  "start" : start,
                  # for end use 1st day of month following final month of data
                  "end" : end
                  }
    return parameters

def get_param_pv(start, end, access_type):
    # Parameters for the Pageviews API; agent "user" excludes crawlers and spiders
    parameters = {"api_name" : "pageviews",
                  "project" : "en.wikipedia.org",
                  "access" : access_type,
                  "agent" : "user",
                  "granularity" : "monthly",
                  "start" : start,
                  # for end use 1st day of month following final month of data
                  "end" : end
                  }
    return parameters

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/yourusername',
    'From': 'youremail@fu-berlin.de'
}

def api_call(endpoint, parameters):
    call = requests.get(endpoint.format(**parameters), headers=headers)
    # Fail loudly on HTTP errors instead of saving an error payload as data
    call.raise_for_status()
    response = call.json()
    return response

def get_filename(parameters):
    # Build '<api>_<access>_<YYYYMM>-<YYYYMM>.json' from the query parameters
    startdate = parameters['start'][:6]
    enddate = parameters['end'][:6]
    api = parameters['api_name']
    file_name = ''
    if api == "pagecounts":
        file_name = '{api_name}_{access-site}_'.format(**parameters) + startdate + '-' + enddate + '.json'
    elif api == "pageviews":
        file_name = '{api_name}_{access}_'.format(**parameters) + startdate + '-' + enddate + '.json'

    return file_name

def store_pagecounts(endpoint, start, end, access_type):
    params_legacy = get_param_pc(start, end, access_type)
    file_name = get_filename(params_legacy)
    monthly_pagecounts = api_call(endpoint, params_legacy)

    # Write the unedited API response to disk
    with open(file_name, "w") as file:
        json.dump(monthly_pagecounts, file)
    print("Saved pagecounts with access site "+access_type+" from "+start+" to "+end+" into file: "+file_name)

def store_pageviews(endpoint, start, end, access_type):
    params_pageviews = get_param_pv(start, end, access_type)
    file_name = get_filename(params_pageviews)
    monthly_pageviews = api_call(endpoint, params_pageviews)

    # Write the unedited API response to disk
    with open(file_name, "w") as file:
        json.dump(monthly_pageviews, file)
    print("Saved pageviews with access "+access_type+" from "+start+" to "+end+" into file: "+file_name)

In [36]:
store_pagecounts(endpoint_legacy, "2001010100", "2018100100","desktop-site")
store_pagecounts(endpoint_legacy, "2001010100", "2018100100", "mobile-site")

store_pageviews(endpoint_pageviews, "2001010100", '2018101000', "desktop")
store_pageviews(endpoint_pageviews, "2001010100", '2018101000', "mobile-web")
store_pageviews(endpoint_pageviews, "2001010100", '2018101000', "mobile-app")

Saved pagecounts with access site desktop-site from 2001010100 to 2018100100 into file: pagecounts_desktop-site_200101-201810.json
Saved pagecounts with access site mobile-site from 2001010100 to 2018100100 into file: pagecounts_mobile-site_200101-201810.json
Saved pageviews with access desktop from 2001010100 to 2018101000 into file: pageviews_desktop_200101-201810.json
Saved pageviews with access mobile-web from 2001010100 to 2018101000 into file: pageviews_mobile-web_200101-201810.json
Saved pageviews with access mobile-app from 2001010100 to 2018101000 into file: pageviews_mobile-app_200101-201810.json


Your `JSON`-formatted source data files must contain the complete and unedited output of your API queries. The naming convention for the source data files is `apiname_accesstype_firstmonth-lastmonth.json`. For example, the file with monthly desktop page counts from the Legacy Pagecounts API should be named `pagecounts_desktop-site_200712-202010.json`.
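
The naming convention and the query timestamps can be derived from the same two month strings. The following is a minimal sketch, not part of the original sample code: the helper names `first_of_next_month` and `source_filename` are hypothetical, and the end timestamp follows the rule from the sample code comment (first day of the month after the final month of data).

```python
from datetime import date

def first_of_next_month(yyyymm: str) -> str:
    """Return the YYYYMMDDHH timestamp for the first day of the month after yyyymm."""
    year, month = int(yyyymm[:4]), int(yyyymm[4:6])
    # month // 12 is 1 only for December, which rolls over into the next year
    following = date(year + month // 12, month % 12 + 1, 1)
    return following.strftime("%Y%m%d") + "00"

def source_filename(api_name: str, access_type: str, first_month: str, last_month: str) -> str:
    """Build a file name of the form apiname_accesstype_firstmonth-lastmonth.json."""
    return f"{api_name}_{access_type}_{first_month}-{last_month}.json"

first_month, last_month = "200712", "202010"
start = first_month + "0100"            # first day of the first month, hour 00
end = first_of_next_month(last_month)   # first day of the month after the last month

print(start, end)
print(source_filename("pagecounts", "desktop-site", first_month, last_month))
```

Keeping the first and last month separate from the query timestamps avoids naming a file after the month that follows the final month of data.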

### Important notes❗
1. As much as possible, we're interested in *organic* (user) traffic, as opposed to traffic by web crawlers or spiders. The Pageviews API (but not the Legacy Pagecounts API) allows you to filter by `agent=user`. Use that filter.
1. There is about one year of overlapping traffic data between the two APIs. You need to gather, and later graph, data from both APIs for this period of time.
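
One way to check the overlap mentioned above is to compare the months present in both responses. This is a small sketch on made-up data: it assumes each response item carries a `timestamp` in `YYYYMMDDHH` format, and the `count` and `views` field names are assumptions about the two APIs' response schemas that you should verify against your own files.

```python
def months(items):
    """Collect the YYYYMM part of each item's timestamp."""
    return {item["timestamp"][:6] for item in items}

# Made-up miniature item lists; real responses hold more fields and many more months
legacy_items = [
    {"timestamp": "2015060100", "count": 10},
    {"timestamp": "2015070100", "count": 12},
    {"timestamp": "2015080100", "count": 11},
]
pageview_items = [
    {"timestamp": "2015070100", "views": 9},
    {"timestamp": "2015080100", "views": 8},
    {"timestamp": "2015090100", "views": 7},
]

# Months covered by both APIs
overlap = sorted(months(legacy_items) & months(pageview_items))
print(overlap)
```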

## Step 2: Data processing

You will need to perform a series of processing steps on these data files to prepare them for analysis. Follow these steps exactly. At the end of this step, you will have a single `CSV`-formatted data file `en-wikipedia_traffic_200712-202010.csv` that can be used in your analysis (step 3) with no significant additional processing.

* For data collected from the Pageviews API, combine the monthly values for `mobile-app` and `mobile-web` to create a total mobile traffic count for each month.
* For all data, separate the value of `timestamp` into four-digit year (`YYYY`) and two-digit month (`MM`) and discard values for day and hour (`DDHH`).
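
The two steps above can be sketched with pandas as follows. The data is made up, and the `views` field name is an assumption about the Pageviews API response; the column name `pageview_mobile_views` is taken from the CSV headers in this assignment.

```python
import pandas as pd

# Made-up miniature 'items' lists standing in for the mobile-web and mobile-app responses
mobile_web_items = [
    {"timestamp": "2015070100", "views": 100},
    {"timestamp": "2015080100", "views": 120},
]
mobile_app_items = [
    {"timestamp": "2015070100", "views": 10},
    {"timestamp": "2015080100", "views": 15},
]

web = pd.DataFrame(mobile_web_items).set_index("timestamp")["views"]
app = pd.DataFrame(mobile_app_items).set_index("timestamp")["views"]

# Sum the two mobile series month by month; fill_value=0 keeps months
# that appear in only one of the two series
mobile = (
    web.add(app, fill_value=0)
    .astype(int)
    .rename("pageview_mobile_views")
    .reset_index()
)

# Split YYYYMMDDHH into year and month, then drop the day and hour part
mobile["year"] = mobile["timestamp"].str[:4]
mobile["month"] = mobile["timestamp"].str[4:6]
mobile = mobile[["year", "month", "pageview_mobile_views"]]
print(mobile)
```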

Combine all data into a single CSV file with the following headers:

| year | month |pagecount_all_views|pagecount_desktop_views|pagecount_mobile_views|pageview_all_views|pageview_desktop_views|pageview_mobile_views|
|------| ------|-------------------|-----------------------|----------------------|------------------|----------------------|---------------------|
| YYYY | MM    |num_views          |num_views              |num_views             |num_views         |num_views             |num_views            | 
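
A possible way to assemble this table is an outer merge on `year` and `month`, sketched below on made-up data. Filling months that one API does not cover with `0` is an assumption of this sketch, not a rule stated in the assignment, and the per-source frames are assumed to be already reduced to `year`, `month`, and one value column.

```python
import pandas as pd

# Made-up per-source frames; the two APIs cover different months
pc_desktop = pd.DataFrame({"year": ["2015", "2015"], "month": ["06", "07"],
                           "pagecount_desktop_views": [50, 40]})
pc_mobile = pd.DataFrame({"year": ["2015", "2015"], "month": ["06", "07"],
                          "pagecount_mobile_views": [20, 25]})
pv_desktop = pd.DataFrame({"year": ["2015", "2015"], "month": ["07", "08"],
                           "pageview_desktop_views": [45, 47]})
pv_mobile = pd.DataFrame({"year": ["2015", "2015"], "month": ["07", "08"],
                          "pageview_mobile_views": [30, 33]})

# An outer merge keeps every month that appears in at least one source
combined = pc_desktop
for frame in (pc_mobile, pv_desktop, pv_mobile):
    combined = combined.merge(frame, on=["year", "month"], how="outer")

# Assumption: months missing from a source are recorded as 0
combined = combined.sort_values(["year", "month"]).fillna(0)
combined["pagecount_all_views"] = (
    combined["pagecount_desktop_views"] + combined["pagecount_mobile_views"]
)
combined["pageview_all_views"] = (
    combined["pageview_desktop_views"] + combined["pageview_mobile_views"]
)

columns = ["year", "month",
           "pagecount_all_views", "pagecount_desktop_views", "pagecount_mobile_views",
           "pageview_all_views", "pageview_desktop_views", "pageview_mobile_views"]
combined = combined[columns].astype({name: int for name in columns[2:]})

# With no path, to_csv returns the CSV text; pass a file name to write the file instead
csv_text = combined.to_csv(index=False)
print(csv_text)
```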

In [37]:
import pandas as pd

In [38]:
def read_pagecounts(start, end, access_type):
    # Rebuild the file name from the same parameters used when storing
    params_legacy = get_param_pc(start, end, access_type)
    file_name = get_filename(params_legacy)

    with open(file_name, 'r') as openfile:
        # Reading from json file
        json_object = json.load(openfile)

    return json_object

def read_pageviews(start, end, access_type):
    # Rebuild the file name from the same parameters used when storing
    params_pageviews = get_param_pv(start, end, access_type)
    file_name = get_filename(params_pageviews)

    with open(file_name, 'r') as openfile:
        # Reading from json file
        json_object = json.load(openfile)

    return json_object

In [39]:
pagecounts_desktop_dict = read_pagecounts("2001010100", "2018100100","desktop-site")
pagecounts_mobile_dict  = read_pagecounts("2001010100", "2018100100","mobile-site")

pageviews_desktop_dict    = read_pageviews("2001010100", '2018101000', "desktop")
pageviews_mobile_web_dict = read_pageviews("2001010100", '2018101000', "mobile-web")
pageviews_mobile_app_dict = read_pageviews("2001010100", '2018101000', "mobile-app")

In [45]:
# Each response stores its monthly records as a list under the 'items' key
pagecounts_desktop = pd.DataFrame(pagecounts_desktop_dict['items'])
pagecounts_mobile = pd.DataFrame(pagecounts_mobile_dict['items'])
pageviews_desktop = pd.DataFrame(pageviews_desktop_dict['items'])
pageviews_mobile_web = pd.DataFrame(pageviews_mobile_web_dict['items'])
pageviews_mobile_app = pd.DataFrame(pageviews_mobile_app_dict['items'])

## Step 3: Analysis

For this assignment, the "analysis" will be fairly straightforward: you will visualize the dataset you have created as a **time series graph**. Your visualization will track three traffic metrics: mobile traffic, desktop traffic, and all traffic (mobile + desktop). To complete the analysis correctly and receive full credit, your graph must use a scale that makes the data readable; all units, axes, and values should be clearly labeled; and the graph should have a legend and a title. You must also generate an image of your final graph in `.png` or `.jpeg` format.
Please graph the data in your notebook, rather than using an external application!
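
A time series graph with the required elements can be sketched with matplotlib as shown here. The values are made up, the output file name `traffic_example.png` is hypothetical, and scaling the y-axis to millions is just one possible choice of scale.

```python
from datetime import date

import matplotlib
matplotlib.use("Agg")  # render without a display; can be dropped inside a notebook
import matplotlib.pyplot as plt

# Made-up monthly values; the real ones come from the CSV file created in step 2
months = [date(2015, m, 1) for m in (6, 7, 8)]
desktop = [5.0e9, 4.8e9, 4.9e9]
mobile = [3.0e9, 3.2e9, 3.3e9]
total = [d + m for d, m in zip(desktop, mobile)]

fig, ax = plt.subplots(figsize=(10, 5))
for label, values in (("desktop", desktop), ("mobile", mobile), ("all", total)):
    # Plot in millions so the axis ticks stay readable
    ax.plot(months, [v / 1e6 for v in values], label=label)

ax.set_xlabel("Month")
ax.set_ylabel("Page views (millions)")
ax.set_title("English Wikipedia traffic (example data)")
ax.legend()
fig.tight_layout()
fig.savefig("traffic_example.png", dpi=150)
```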

***

#### Credits

This exercise is slightly adapted from the course [Human Centered Data Science (Fall 2019)](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)) at the [University of Washington](https://www.washington.edu/datasciencemasters/) by [Jonathan T. Morgan](https://wiki.communitydata.science/User:Jtmorgan).

Like the original authors, we release the notebooks under the [Creative Commons Attribution license (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).