Data 512A - A1: Data curation

Project goal

The code in this repository shows how to gather, analyze, and graph monthly view traffic on the English Wikipedia. It covers data from January 1st, 2008 through August 30th, 2019.

Repository structure

├── README.md  
├── assets  
│   └── images  
│       └── wikipedia_page_views.png  
├── clean_data  
│   └── en-wikipedia-traffic_200801-201908.csv  
├── human.yml  
├── raw_data  
│   ├── legacy_desktop-site_200801-201908.json  
│   ├── legacy_mobile-site_200801-201908.json  
│   ├── pageviews_desktop_200801-201908.json  
│   ├── pageviews_mobile-app_200801-201908.json  
│   └── pageviews_mobile-web_200801-201908.json  
└── src  
    └── hcds-a1-data-curation.ipynb

Resources

We used two public APIs to gather the data:

The Legacy Pagecounts API
The Pageviews API

Both are licensed under the CC-BY-SA 3.0 and GFDL licenses. More licensing and usage information is available on the API documentation website. As of 10/02/2019, that documentation labels the two APIs above 'Legacy data' and 'Pageviews data' respectively.
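As a rough illustration of how the two APIs are addressed, the sketch below builds monthly request URLs for English Wikipedia. The URL templates follow the public Wikimedia REST API documentation. The helper names, the `user` agent value, and the start and end timestamps are assumptions made for this example and are not taken from the notebook.

```python
# Sketch only: helper names and default values are illustrative assumptions.
BASE = "https://wikimedia.org/api/rest_v1/metrics"

LEGACY_TEMPLATE = (
    BASE + "/legacy/pagecounts/aggregate/"
    "{project}/{access}/{granularity}/{start}/{end}"
)
PAGEVIEWS_TEMPLATE = (
    BASE + "/pageviews/aggregate/"
    "{project}/{access}/{agent}/{granularity}/{start}/{end}"
)


def legacy_url(access, start="2008010100", end="2019090100"):
    """Build a monthly Legacy Pagecounts URL (access: desktop-site, mobile-site)."""
    return LEGACY_TEMPLATE.format(
        project="en.wikipedia.org",
        access=access,
        granularity="monthly",
        start=start,  # timestamps use the YYYYMMDDHH format
        end=end,      # assumed bound; check the API docs for how it is treated
    )


def pageviews_url(access, agent="user", start="2008010100", end="2019090100"):
    """Build a monthly Pageviews URL (access: desktop, mobile-app, mobile-web)."""
    return PAGEVIEWS_TEMPLATE.format(
        project="en.wikipedia.org",
        access=access,
        agent=agent,  # "user" is assumed here to exclude crawler traffic
        granularity="monthly",
        start=start,
        end=end,
    )


print(legacy_url("desktop-site"))
print(pageviews_url("mobile-web"))
```

Fetching each URL with an HTTP client and saving the JSON response would presumably yield files like those in raw_data, one per access type.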

How to run the notebook

You will need a computer with internet access and a command line with the privileges required to install open-source software.

Install conda or miniconda.
Replicate the conda environment using the human.yml file provided by running: conda env create -f human.yml
Activate the environment with: conda activate human
Using a terminal or cmd, navigate to the src folder.
Launch Jupyter by running: jupyter notebook
Select the hcds-a1-data-curation notebook.

Results

The cleaned data file, en-wikipedia-traffic_200801-201908.csv, has the following format:

Column	Value	Description
year	YYYY	Year of view
month	MM	Month of view
pagecount_all_views	num_views	Combined desktop and mobile view count from the Legacy Pagecounts API
pagecount_desktop_views	num_views	Desktop view count from the Legacy Pagecounts API
pagecount_mobile_views	num_views	Mobile view count from the Legacy Pagecounts API
pageview_all_views	num_views	Combined desktop and mobile view count from the Pageviews API
pageview_desktop_views	num_views	Desktop view count from the Pageviews API
pageview_mobile_views	num_views	Mobile view count from the Pageviews API

Months for which no data was available have a value of 0.
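Because a value of 0 stands for "no data" rather than a real count, it can help to convert those zeros when loading the file. The following is a minimal sketch using Python's standard csv module. It assumes the file is comma-separated with the column names listed above. The sample rows and the `load_rows` helper are made up for illustration and do not come from the real file.

```python
import csv
import io

# Hypothetical sample in the documented column layout; the counts are invented.
SAMPLE = """year,month,pagecount_all_views,pagecount_desktop_views,pagecount_mobile_views,pageview_all_views,pageview_desktop_views,pageview_mobile_views
2008,01,100,100,0,0,0,0
2015,07,90,60,30,80,50,30
2019,08,0,0,0,70,40,30
"""


def load_rows(handle):
    """Parse the CSV, turning 0 (no data available) into None."""
    rows = []
    for record in csv.DictReader(handle):
        row = {"year": record["year"], "month": record["month"]}
        for name, value in record.items():
            if name not in ("year", "month"):
                count = int(value)
                row[name] = count if count != 0 else None
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row["year"], row["month"], row["pagecount_all_views"], row["pageview_all_views"])
```

To read the real file instead of the sample, pass `open("clean_data/en-wikipedia-traffic_200801-201908.csv", newline="")` to `load_rows`.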

Considerations

Data from the Legacy Pagecounts API has desktop and mobile data from December 2007 to July 2016.

The last data point of these monthly series, '2016-08', is incomplete and was removed during analysis.
The Legacy Pagecounts API includes views from crawlers and spiders.
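Removing the incomplete final month could be done with a small filter like the one below. This is a sketch under assumptions: it supposes each legacy item carries a `timestamp` string in YYYYMMDDHH form and a `count` field, and the `drop_incomplete_month` helper and sample values are hypothetical.

```python
def drop_incomplete_month(items, incomplete="2016080100"):
    """Return the items whose timestamp is not the incomplete month."""
    return [item for item in items if item["timestamp"] != incomplete]


# Invented sample items shaped like the assumed legacy response entries.
legacy_items = [
    {"timestamp": "2016060100", "count": 300},
    {"timestamp": "2016070100", "count": 310},
    {"timestamp": "2016080100", "count": 20},
]

kept = drop_incomplete_month(legacy_items)
print([item["timestamp"] for item in kept])
```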

Data from the Pageviews API has desktop and mobile data from July 2015 through the month before the query date.

The API can filter out crawlers and spiders.
The API differentiates between the mobile app and the mobile site. This information was not leveraged, as we combined these counts in our analysis.
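Combining the mobile app and mobile site counts amounts to summing the two series month by month. The sketch below shows one way to do that. It assumes each Pageviews item has a `timestamp` string in YYYYMMDDHH form and a `views` field. The `monthly_totals` helper and the sample numbers are hypothetical and not the notebook's code.

```python
from collections import defaultdict


def monthly_totals(*series, field="views"):
    """Sum the given field across several series, keyed by YYYYMM."""
    totals = defaultdict(int)
    for items in series:
        for item in items:
            month = item["timestamp"][:6]  # keep only the YYYYMM part
            totals[month] += item[field]
    return dict(totals)


# Invented sample items shaped like the assumed Pageviews response entries.
mobile_app = [
    {"timestamp": "2015070100", "views": 10},
    {"timestamp": "2015080100", "views": 12},
]
mobile_web = [
    {"timestamp": "2015070100", "views": 30},
    {"timestamp": "2015080100", "views": 33},
]
desktop = [
    {"timestamp": "2015070100", "views": 50},
    {"timestamp": "2015080100", "views": 55},
]

mobile = monthly_totals(mobile_app, mobile_web)
all_views = monthly_totals(mobile_app, mobile_web, desktop)
print(mobile)
print(all_views)
```

The same helper could serve the legacy series by passing `field="count"`, assuming that is the name of the count field in those responses.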
