
jacobsanford/wikiHistoryParser

Easily process and browse a Wikimedia wiki's anonymous page edit history for specific IP ranges.

Browsing edits

Requirements

Overview and Notes

Parsing the entire edit history of one language's wiki takes significant computational resources and temporarily consumes large amounts of storage and bandwidth. Although the processQueue.sh script aggressively cleans up orphaned images and volumes, expect to need (conservatively) MAX_THREADS * 100GB of disk space while the files are being processed. The final storage requirements are minimal.
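As a rough pre-flight check, the estimate above can be compared against the free space on the working volume. The following is a minimal sketch, not part of this repository: the function names and the choice of path are assumptions, and the 100GB-per-thread figure is simply the conservative number quoted above.

```python
import shutil

GB = 1000 ** 3
PER_THREAD_GB = 100  # conservative per-thread estimate quoted in the notes above


def required_bytes(max_threads: int) -> int:
    """Return the conservative scratch-space estimate, in bytes."""
    return max_threads * PER_THREAD_GB * GB


def has_enough_space(path: str, max_threads: int) -> bool:
    """Check whether the volume holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes(max_threads)


if __name__ == "__main__":
    # Hypothetical value; use the MAX_THREADS you set in processQueue.sh.
    threads = 4
    needed_gb = required_bytes(threads) // GB
    print(f"Estimated scratch space for {threads} threads: {needed_gb} GB")
    print("Enough free space here:", has_enough_space(".", threads))
```

Point `path` at the volume where the archive files are actually downloaded and processed, which may differ from the current directory.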

Quick Start

Populate the wiki edit history page file queue

Fetch a list of archive .gz files to parse, storing them in the ElasticSearch index.

docker-compose up

Define IP ranges to parse

Change the processfile/src/FilterIpRanges.py file to set the CIDR ranges you wish to parse.
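For orientation, CIDR matching of this kind can be done with Python's standard `ipaddress` module. The sketch below is illustrative only and does not reproduce the actual contents of FilterIpRanges.py: the names `CIDR_RANGES`, `NETWORKS`, and `ip_in_ranges` are hypothetical, and the ranges shown are reserved documentation blocks rather than real targets.

```python
import ipaddress

# Hypothetical ranges (RFC 5737 documentation blocks); replace with your own.
CIDR_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]

# Parse each CIDR string once so every lookup reuses the network objects.
NETWORKS = [ipaddress.ip_network(cidr) for cidr in CIDR_RANGES]


def ip_in_ranges(contributor: str, networks=NETWORKS) -> bool:
    """Return True if `contributor` is an IP address inside any network.

    Values that are not valid IP addresses (for example, usernames)
    are treated as non-matching.
    """
    try:
        address = ipaddress.ip_address(contributor)
    except ValueError:
        return False
    return any(address in network for network in networks)


if __name__ == "__main__":
    for candidate in ("192.0.2.17", "203.0.113.5", "SomeUser"):
        print(candidate, ip_in_ranges(candidate))
```

Parsing the ranges up front keeps the per-edit check cheap, which matters when the same test runs against every revision in an archive file.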

Process a single page history archive file

Grab a single file from the index and process it.

docker-compose run processfile

Process many/all page history archive files

You will likely want to parse the entire history rather than a single file. To do so, edit the variables in the processQueue.sh script to suit your needs, then run it:

./processQueue.sh

View the edits

The docker-compose.yml file deploys a vanilla Kibana instance for browsing and filtering the ElasticSearch data. It is available at http://localhost:5601.

Update the wiki edit history page file queue

Later, you may wish to queue new archive files that have been added with more recent edits.

docker-compose run gethistoryfiles

Sources

License
