- Check whether a page titled TITLE exists on Wikipedia. If it is a redirect, use the page it redirects to.
- Get the redirects for the page from http://dispenser.homenet.org/~dispenser/cgi-bin/rdcheck.py and save them in a file named ./{output}/{lang}/{title}.redirects.txt
- Quote the page and redirect titles and save everything in a file called ./{output}/{lang}/{title}.quoted-redirects.txt
- Get the pageview data. This saves a set of files as ./{data}/part_data/part-XXXXXX.gz
- Change the permissions on the files in ./data/part_data/
- Extract the pageview data:
  - extract the pageviews only for the page named TITLE and save them in ./{output}/{lang}/{title}.clean.pageviews.txt.gz
  - extract the pageviews for the page named TITLE together with its redirects and save them in ./{output}/{lang}/{title}.quoted-redirects.pageviews.txt.gz
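
The quoting and extraction steps above can be sketched in Python. This is a minimal illustration, not the repository's actual scripts. It assumes that "quoting" means percent-encoding a title after replacing spaces with underscores, and that each line of a part file starts with two whitespace-separated fields, the language code and the quoted page title. The function names `quote_title` and `extract_pageviews` are hypothetical.

```python
import gzip
from urllib.parse import quote


def quote_title(title: str) -> str:
    """Percent-encode a title after replacing spaces with underscores.

    Assumption: the dataset stores titles in this percent-encoded form.
    """
    return quote(title.replace(" ", "_"), safe="")


def extract_pageviews(part_files, lang, quoted_titles, out_path):
    """Copy matching lines from gzipped part files into one gzipped file.

    Assumption: a line matches when its first field equals `lang` and its
    second field is one of `quoted_titles`. Returns the number of lines kept.
    """
    wanted = set(quoted_titles)
    kept = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for part in part_files:
            with gzip.open(part, "rt", encoding="utf-8", errors="replace") as src:
                for line in src:
                    fields = line.split()
                    if len(fields) >= 2 and fields[0] == lang and fields[1] in wanted:
                        out.write(line)
                        kept += 1
    return kept
```

Passing only the quoted TITLE would produce the `.clean.pageviews.txt.gz` output, while passing the quoted TITLE plus its quoted redirects would produce the `.quoted-redirects.pageviews.txt.gz` output.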
NGI4eu/engineroom-wikipedia-pageviews-extraction
A collection of scripts to extract data from the Wikipedia pagecounts dataset