# Stage 0: SETUP

The libraries below are used in this project. For the full list of requirements and versions, see the requirements.txt file included in the repository.

In [3]:
import json
import requests
import os
import pandas as pd

# Stage 1: DATA ACQUISITION

## Overview
Data is acquired through the Wikimedia REST API and saved as JSON files. These files are included in the repository in the *data* folder; you may skip to Stage 2 and use the included files if desired.

We request data from both the [Legacy](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts) and [Pageviews](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews) APIs.

We define base templates for the parameters. English Wikipedia with monthly granularity is always requested, and on the Pageviews API we always request agent=user to filter out crawler and bot traffic. We also request a consistent date range for each API.

In [10]:
endpoint_legacy = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{site}/{granularity}/{start}/{end}'
endpoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{site}/{agent}/{granularity}/{start}/{end}'
params_legacy = {"project" : "en.wikipedia.org",
                 "granularity" : "monthly",
                 "start" : "2008010100",
                 "end" : "2016080100"
                }

params_pageviews = {"project" : "en.wikipedia.org",
                    "agent" : "user",
                    "granularity" : "monthly",
                    "start" : "2015070100",
                    "end" : "2021090100"
                }
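
To make the template mechanics concrete, here is a minimal sketch of how one request URL is assembled from the Pageviews template and its base parameters. It restates the values from the cell above so that it runs on its own; the choice of `"desktop"` as the access type is just one example value.

```python
# Restated from the cell above so the sketch is self-contained
endpoint_pageviews = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/'
                      '{project}/{site}/{agent}/{granularity}/{start}/{end}')
params_pageviews = {"project": "en.wikipedia.org",
                    "agent": "user",
                    "granularity": "monthly",
                    "start": "2015070100",
                    "end": "2021090100"}

# The base parameters fill most placeholders; the access type is supplied per request
url = endpoint_pageviews.format(**params_pageviews, site="desktop")
print(url)
```

Only the `site` placeholder changes between requests, which is why it is left out of the base dictionaries.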



We call each endpoint once for each access type, excluding the aggregate access types (totals are computed from the per-type data in Stage 2). All data is saved in the *data* folder.
    

In [11]:
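def api_call(endpoint, parameters):
    """Fill the endpoint template with the parameters and return the parsed JSON response."""
    # Identify the requester to the API
    headers = {
        'User-Agent': 'https://github.com/Cain93',
        'From': 'ccase20@uw.edu'
    }
    call = requests.get(endpoint.format(**parameters), headers=headers)
    # Fail early on an HTTP error instead of saving an error payload to disk
    call.raise_for_status()
    response = call.json()

    return response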

In [2]:
legacy_sites = ["desktop-site", "mobile-site"]
pageview_sites = ["desktop", "mobile-app", "mobile-web"]
file_template = "data/{apiname}_{site}_{daterange}.json"
for site in legacy_sites:
    data = api_call(endpoint_legacy, {**params_legacy, "site":site})
    fileName = file_template.format(apiname="pagecount", site=site, daterange = "200801-201607")
    with open(fileName, 'w') as outfile:
        json.dump(data, outfile)
for site in pageview_sites:
    data = api_call(endpoint_pageviews, {**params_pageviews, "site":site})
    fileName = file_template.format(apiname="pageview", site=site, daterange = "201507-202108")
    with open(fileName, 'w') as outfile:
        json.dump(data, outfile)  
    


NameError: name 'api_call' is not defined

# Stage 2: DATA PROCESSING

First we open each file and combine the contents into a single dataframe. While doing so, we rename columns to make them consistent between the legacy and pageview data.

In [4]:
col_names = {
    "access-site": "access",
    "count": "views"
}

# Load each JSON file and harmonize the legacy column names
frames = []
for filename in os.listdir("data"):
    with open(os.path.join("data", filename), "r") as file:
        file_data = json.load(file)
    frames.append(pd.DataFrame.from_records(file_data["items"]).rename(columns = col_names))

# DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame
combined_data = pd.concat(frames)

combined_data.head()

,project,access,granularity,timestamp,views,agent
0,en.wikipedia,desktop-site,monthly,2008010100,4930902570,
1,en.wikipedia,desktop-site,monthly,2008020100,4818393763,
2,en.wikipedia,desktop-site,monthly,2008030100,4955405809,
3,en.wikipedia,desktop-site,monthly,2008040100,5159162183,
4,en.wikipedia,desktop-site,monthly,2008050100,5584691092,
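
The effect of the renaming can be seen on a toy example. This is a sketch under assumptions: the two records below are hypothetical stand-ins shaped like the `items` entries of each API response, the field layout is inferred from the column names in the table above, and the pageview count of 123 is made up.

```python
import pandas as pd

# Hypothetical records; only the legacy count is taken from the table above
legacy_items = [{"project": "en.wikipedia", "access-site": "desktop-site",
                 "granularity": "monthly", "timestamp": "2008010100",
                 "count": 4930902570}]
pageview_items = [{"project": "en.wikipedia", "access": "desktop", "agent": "user",
                   "granularity": "monthly", "timestamp": "2015070100",
                   "views": 123}]

col_names = {"access-site": "access", "count": "views"}

# rename() ignores keys that are absent, so the same mapping is safe for both sources
frames = [pd.DataFrame.from_records(items).rename(columns=col_names)
          for items in (legacy_items, pageview_items)]
combined = pd.concat(frames)

print(list(combined.columns))
```

Legacy rows have no `agent` field, so that column is empty for them after the frames are combined, as in the output above.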


Then we parse the timestamp into year and month, and remove unused columns.

In [5]:
combined_data["year"] = combined_data["timestamp"].apply(lambda x: x[0:4])
combined_data["month"] = combined_data["timestamp"].apply(lambda x: x[4:6])
cleaned_data = combined_data.drop(columns=["timestamp", "granularity", "project", "agent"])
cleaned_data.head()

,access,views,year,month
0,desktop-site,4930902570,2008,1
1,desktop-site,4818393763,2008,2
2,desktop-site,4955405809,2008,3
3,desktop-site,5159162183,2008,4
4,desktop-site,5584691092,2008,5
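
As a quick illustration of the timestamp layout, the sketch below parses one value two ways. It assumes the timestamps follow a `YYYYMMDDHH` pattern, which is inferred from values such as `2008010100` rather than stated in the notebook.

```python
from datetime import datetime

timestamp = "2008010100"  # first timestamp in the combined data

# String slicing, as in the cell above, keeps zero-padded text
year_text, month_text = timestamp[0:4], timestamp[4:6]
print(year_text, month_text)

# Parsing with an explicit format yields integers and validates the layout
parsed = datetime.strptime(timestamp, "%Y%m%d%H")
print(parsed.year, parsed.month)
```

Slicing is sufficient here because the values are only used as labels; parsing would matter if date arithmetic or validation were needed.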


Now the data is pivoted to create a column for each access type. After pivoting:
1. Mobile-web and mobile-app columns are combined into mobile
1. Columns are renamed to more descriptive names
1. Aggregate columns for all pageview and pagecount views are created
1. Unused columns are dropped

In [6]:
# Pivot
pivot_data = cleaned_data.pivot(index = ["year", "month"], columns=["access"])
pivot_data.columns = pivot_data.columns.droplevel()

print(pivot_data.head())

# Combine mobile views
pivot_data["mobile"] = pivot_data["mobile-web"] + pivot_data["mobile-app"]
pivot_data = pivot_data.drop(columns = ["mobile-web", "mobile-app"])

# Rename and aggregate
pivot_data = pivot_data.rename(columns = {"desktop-site":"pagecount_desktop_views",
                                          "mobile-site": "pagecount_mobile_views",
                                          "desktop":"pageview_desktop_views",
                                          "mobile":"pageview_mobile_views",
                                          })
pivot_data["pagecount_all_views"] = pivot_data["pagecount_desktop_views"] + pivot_data["pagecount_mobile_views"]
pivot_data["pageview_all_views"] = pivot_data["pageview_desktop_views"] + pivot_data["pageview_mobile_views"]

pivot_data.head()


access      desktop  desktop-site  mobile-app  mobile-site  mobile-web
year month                                                            
2008 01         NaN  4.930903e+09         NaN          NaN         NaN
     02         NaN  4.818394e+09         NaN          NaN         NaN
     03         NaN  4.955406e+09         NaN          NaN         NaN
     04         NaN  5.159162e+09         NaN          NaN         NaN
     05         NaN  5.584691e+09         NaN          NaN         NaN


year,month,pageview_desktop_views,pagecount_desktop_views,pagecount_mobile_views,pageview_mobile_views,pagecount_all_views,pageview_all_views
2008,1,,4930903000.0,,,,
2008,2,,4818394000.0,,,,
2008,3,,4955406000.0,,,,
2008,4,,5159162000.0,,,,
2008,5,,5584691000.0,,,,
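
In the rows above, `pagecount_all_views` is empty even though a desktop count exists, because adding a missing value to a number yields a missing value. If totals that treat a missing access type as zero are preferred (an assumption about intent, not something the notebook states), one option is `Series.add` with `fill_value=0`. A minimal sketch on a one-row toy frame:

```python
import pandas as pd

# Toy frame: the desktop value comes from the table above, the mobile value is missing
toy = pd.DataFrame({
    "pagecount_desktop_views": [4930902570.0],
    "pagecount_mobile_views": [float("nan")],
})

# Plain addition: NaN in either operand gives NaN
plain_total = toy["pagecount_desktop_views"] + toy["pagecount_mobile_views"]

# add() with fill_value=0 treats a single missing operand as zero
filled_total = toy["pagecount_desktop_views"].add(
    toy["pagecount_mobile_views"], fill_value=0
)

print(plain_total.isna().iloc[0])
print(filled_total.iloc[0])
```

When both operands are missing, the result stays missing, so months outside an API's date range remain empty.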


The data is converted to csv and saved.

In [8]:
pivot_data.to_csv('en-wikipedia_traffic_200712-202108.csv')

# Stage 3: ANALYSIS

In [9]:
pivot_data.shape

(164, 6)
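
The row count can be sanity-checked by hand. The sketch below assumes the index covers every month from January 2008 (the first row shown earlier) through August 2021 (the last month named in the pageview file names), with no gaps.

```python
first_year, first_month = 2008, 1
last_year, last_month = 2021, 8

# Whole years between the endpoints, plus the month offset, counting both endpoints
expected_rows = (last_year - first_year) * 12 + (last_month - first_month) + 1
print(expected_rows)  # 164
```

This matches the 164 rows reported above; the 6 columns are the four per-device columns plus the two aggregate columns.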