The code in this repository shows how to gather, analyze, and graph monthly view traffic on the English Wikipedia. It covers data from January 1st 2008 through August 30th 2019.
├── README.md
├── assets
│   └── images
│       └── wikipedia_page_views.png
├── clean_data
│   └── en-wikipedia-traffic_200801-201908.csv
├── human.yml
├── raw_data
│   ├── legacy_desktop-site_200801-201908.json
│   ├── legacy_mobile-site_200801-201908.json
│   ├── pageviews_desktop_200801-201908.json
│   ├── pageviews_mobile-app_200801-201908.json
│   └── pageviews_mobile-web_200801-201908.json
└── src
    └── hcds-a1-data-curation.ipynb
We used two public APIs to gather the data:
- The Legacy Pagecounts API
- The Pageviews API
Both APIs are licensed under the CC-BY-SA 3.0 and GFDL licenses. More licensing and usage information is available on the API documentation website. The two APIs above are labeled 'Legacy data' and 'Pageviews data' respectively as of 10/02/2019.
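As a rough illustration of how a request to each API might be built, here is a minimal sketch using only the Python standard library. The endpoint templates, the parameter values (`en.wikipedia.org`, `desktop-site`, `mobile-web`, `user`), the timestamp bounds, and the helper names `build_url` and `fetch_json` are assumptions based on our reading of the public Wikimedia REST API documentation; they are not copied from the notebook, so check them against the API documentation before relying on them.

```python
import json
from urllib.request import Request, urlopen

# Assumed endpoint templates; verify against the API documentation.
LEGACY_ENDPOINT = (
    "https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/"
    "{project}/{access}/{granularity}/{start}/{end}"
)
PAGEVIEWS_ENDPOINT = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/"
    "{project}/{access}/{agent}/{granularity}/{start}/{end}"
)


def build_url(template, **params):
    """Fill an endpoint template with request parameters (hypothetical helper)."""
    return template.format(**params)


def fetch_json(url, user_agent):
    """Download and decode one API response (hypothetical helper).

    The caller supplies a User-Agent string that identifies them.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        return json.load(response)


# Assumed parameter values for monthly English Wikipedia traffic.
legacy_url = build_url(
    LEGACY_ENDPOINT,
    project="en.wikipedia.org",
    access="desktop-site",
    granularity="monthly",
    start="2008010100",
    end="2019090100",
)
pageviews_url = build_url(
    PAGEVIEWS_ENDPOINT,
    project="en.wikipedia.org",
    access="mobile-web",
    agent="user",  # assumed value for excluding crawlers and spiders
    granularity="monthly",
    start="2008010100",
    end="2019090100",
)
print(legacy_url)
print(pageviews_url)
```

Calling `fetch_json(legacy_url, user_agent="...")` with your own contact string would then return the decoded response; the sketch stops short of making the network call.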
You will need a computer with internet access and a command line with the privileges required to install open-source software.
- Install conda or miniconda.
- Replicate the conda environment using the human.yml file provided by running:
conda env create -f human.yml
- Activate the environment with:
conda activate human
- Using a terminal or cmd, navigate to the src folder.
- Launch jupyter by running:
jupyter notebook
- Select the hcds-a1-data-curation notebook.
The cleaned data file, en-wikipedia-traffic_200801-201908.csv, has the following format:
Column | Value | Description |
---|---|---|
year | YYYY | Year of view |
month | MM | Month of view |
pagecount_all_views | num_views | Combined desktop and mobile view count from the Legacy Pagecounts API |
pagecount_desktop_views | num_views | Desktop view count from the Legacy Pagecounts API |
pagecount_mobile_views | num_views | Mobile view count from the Legacy Pagecounts API |
pageview_all_views | num_views | Combined desktop and mobile view count from the Pageviews API |
pageview_desktop_views | num_views | Desktop view count from the Pageviews API |
pageview_mobile_views | num_views | Mobile view count from the Pageviews API |
Months for which no data was available have a value of 0.
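Because a 0 means "no data" rather than "no views", it can help to convert zeros to missing values before computing averages or plotting. The sketch below shows one way to do that with the standard library. It assumes the file is comma-delimited with a header row matching the column names in the table above; the helper name `read_traffic` is hypothetical and the sample row contains placeholder numbers, not real traffic counts.

```python
import csv
import io

COUNT_COLUMNS = [
    "pagecount_all_views",
    "pagecount_desktop_views",
    "pagecount_mobile_views",
    "pageview_all_views",
    "pageview_desktop_views",
    "pageview_mobile_views",
]


def read_traffic(handle):
    """Read the cleaned CSV, turning 0 counts into None (hypothetical helper)."""
    rows = []
    for record in csv.DictReader(handle):
        row = {"year": int(record["year"]), "month": int(record["month"])}
        for column in COUNT_COLUMNS:
            # int(float(...)) tolerates counts written as "300" or "300.0".
            value = int(float(record[column]))
            row[column] = value if value != 0 else None
        rows.append(row)
    return rows


# Placeholder values for illustration only, not real traffic counts.
sample = io.StringIO(
    "year,month," + ",".join(COUNT_COLUMNS) + "\n"
    "2008,01,300,300,0,0,0,0\n"
)
rows = read_traffic(sample)
print(rows[0])
```

To read the real file, pass an open file handle instead of the in-memory sample, for example `open("clean_data/en-wikipedia-traffic_200801-201908.csv", newline="")`.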
Data from the Legacy Pagecounts API has desktop and mobile data from December 2007 to July 2016.
- The last data point in these monthly series ('2016-08') is incomplete and was removed during analysis.
- The Legacy Pagecounts API includes views from crawlers and spiders.
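Removing the incomplete final month can be sketched as a simple filter. This is an illustration, not the notebook's code: it assumes each item in a legacy response carries a `timestamp` string of the form `YYYYMMDDHH` and a `count` field, the helper name `drop_month` is hypothetical, and the counts shown are placeholders.

```python
def drop_month(items, year_month):
    """Return items whose timestamp does not fall in year_month ('YYYYMM')."""
    return [item for item in items if not item["timestamp"].startswith(year_month)]


# Placeholder counts for illustration only.
legacy_items = [
    {"timestamp": "2016060100", "count": 30},
    {"timestamp": "2016070100", "count": 20},
    {"timestamp": "2016080100", "count": 1},  # incomplete month
]
complete_items = drop_month(legacy_items, "201608")
print([item["timestamp"] for item in complete_items])
```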
Data from the Pageviews API has desktop and mobile data from July 2015 through last month.
- The API can filter out crawlers and spiders.
- The API differentiates between the mobile app and the mobile site. We did not use this distinction, since we combined the two counts in our analysis.
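Combining the mobile app and mobile site counts amounts to summing the two series month by month. The following is a minimal sketch under stated assumptions: each Pageviews item is taken to have a `timestamp` string of the form `YYYYMMDDHH` and a `views` field, the helper name `monthly_totals` is hypothetical, and the numbers are placeholders rather than real counts.

```python
from collections import defaultdict


def monthly_totals(*item_lists, field="views"):
    """Sum a count field across several series, keyed by 'YYYYMM'."""
    totals = defaultdict(int)
    for items in item_lists:
        for item in items:
            month = item["timestamp"][:6]  # keep only the year and month
            totals[month] += item[field]
    return dict(totals)


# Placeholder counts for illustration only.
mobile_app = [
    {"timestamp": "2015070100", "views": 10},
    {"timestamp": "2015080100", "views": 20},
]
mobile_web = [
    {"timestamp": "2015070100", "views": 1},
    {"timestamp": "2015080100", "views": 2},
]
mobile = monthly_totals(mobile_app, mobile_web)
print(mobile)
```

Keying on the month rather than on list position means a month missing from one series still contributes the count from the other.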