NGI4eu/engineroom-wikipedia-pageviews-extraction: A collection of scripts to extract data from the Wikipedia pagecounts dataset

Wikipedia pageviews extraction

1. Check whether a page with title TITLE exists on Wikipedia. If it is a redirect, use the page it redirects to.
2. Get the redirects for the page from http://dispenser.homenet.org/~dispenser/cgi-bin/rdcheck.py and save them in a file named ./{output}/{lang}/{title}.redirects.txt
3. Quote the page and redirect titles and save them in a file named ./{output}/{lang}/{title}.quoted-redirects.txt
4. Get the pageview data. This saves a set of files in ./{data}/part_data/part-XXXXXX.gz
5. Change the permissions on the files in ./data/part_data/
6. Extract the pageview data:
   a. extract the pageviews only for the page named TITLE and save them in ./{output}/{lang}/{title}.clean.pageviews.txt.gz
   b. extract the pageviews for the page named TITLE and its redirects and save them in ./{output}/{lang}/{title}.quoted-redirects.pageviews.txt.gz
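The file naming, title quoting, and selection steps above can be sketched in Python. This is an illustration only, not the code in the repository's shell scripts: the function names `output_paths`, `quote_title`, and `select_pageviews` are hypothetical, the quoting rule (spaces to underscores, then percent-encoding) is an assumption, and so is the record layout (whitespace-separated fields with the project code first and the page title second).

```python
from pathlib import Path
from urllib.parse import quote


def output_paths(output: str, lang: str, title: str) -> dict:
    """Build the per-page file paths named in the steps above."""
    base = Path(output) / lang
    return {
        "redirects": base / f"{title}.redirects.txt",
        "quoted_redirects": base / f"{title}.quoted-redirects.txt",
        "clean_pageviews": base / f"{title}.clean.pageviews.txt.gz",
        "redirect_pageviews": base / f"{title}.quoted-redirects.pageviews.txt.gz",
    }


def quote_title(title: str) -> str:
    """Percent-encode a title (assumed rule: spaces become underscores first)."""
    return quote(title.replace(" ", "_"), safe="")


def select_pageviews(lines, lang, titles):
    """Yield records whose project code is `lang` and whose title is in `titles`.

    Assumed record layout: project code, page title, then the remaining fields.
    """
    wanted = set(titles)
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0] == lang and fields[1] in wanted:
            yield line.rstrip("\n")


if __name__ == "__main__":
    sample = [
        "en Albert_Einstein 10 2000\n",
        "en Einstein 3 500\n",
        "de Albert_Einstein 7 900\n",
    ]
    # Step 6a: the page alone; step 6b: the page plus its redirects.
    print(list(select_pageviews(sample, "en", ["Albert_Einstein"])))
    print(list(select_pageviews(sample, "en", ["Albert_Einstein", "Einstein"])))
```

Matching on exact, already-quoted titles keeps the selection a simple set lookup. Titles containing a slash would need extra handling before being used in a file name.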

input
output
.gitignore
README.md
build_index.sh
build_lists.sh
build_quoted_redirects.sh
copy_pageview_files.sh
data2csv.sh
extract_all.sh
extract_data.sh
extract_pageviews.sh
get_redirects.sh
normalize_title.sh
quote_pagetitle.sh
select_pageviews.sh
simplify_regexes.py