Easily process and browse a Wikimedia wiki's anonymous page edit history for specific IP ranges.
Parsing the entire history of a language's wiki pages takes significant computational resources and will temporarily consume large amounts of storage and bandwidth. Although the `processQueue.sh` script cleans up orphaned images and volumes with extreme prejudice, expect to need (conservatively) `MAX_THREADS` * 100GB of disk space while processing the files. The final storage requirements are minimal.
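As a rough pre-flight sketch, you can compare that estimate against free space before starting. The `MAX_THREADS=4` value below is a placeholder for whatever you set in `processQueue.sh`, and the check is run against the current directory; point `df` at the filesystem that holds Docker's data instead if it lives elsewhere.

```shell
# Estimate peak disk usage (~100GB per worker, per the note above) and
# compare it with the free space on the target filesystem.
MAX_THREADS=4                        # match your processQueue.sh setting
NEEDED_GB=$((MAX_THREADS * 100))
AVAIL_GB=$(df -BG --output=avail . | tail -1 | tr -dc '0-9')
if [ "$AVAIL_GB" -lt "$NEEDED_GB" ]; then
  echo "Only ${AVAIL_GB}GB free; ~${NEEDED_GB}GB recommended" >&2
fi
```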
Fetch a list of archive `.gz` files to parse, storing them in the ElasticSearch index:

```shell
docker-compose up
```
Edit the `processfile/src/FilterIpRanges.py` file to set the CIDR ranges you wish to parse.
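The core of that filtering can be sketched with the standard-library `ipaddress` module. The CIDR blocks and the `ip_in_ranges` helper below are illustrative only; `FilterIpRanges.py` defines its own ranges and names.

```python
import ipaddress

# Illustrative CIDR blocks; replace with the ranges you actually want
# to match (e.g. a government IP-range list).
CIDR_RANGES = [
    "192.197.82.0/24",
    "205.193.0.0/16",
]

NETWORKS = [ipaddress.ip_network(c) for c in CIDR_RANGES]

def ip_in_ranges(ip_string):
    """Return True if the IP falls inside any configured CIDR block."""
    try:
        ip = ipaddress.ip_address(ip_string)
    except ValueError:
        # Anonymous edits are keyed by contributor IP; skip anything
        # that is not a well-formed address.
        return False
    return any(ip in net for net in NETWORKS)
```

Only edits whose contributor IP matches one of the configured networks would be kept for indexing.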
Grab a single file from the index and process it:

```shell
docker-compose run processfile
```
You will likely want to parse the entire history rather than a single file. To do so, edit the `processQueue.sh` script's variables to suit your needs, then run it:

```shell
./processQueue.sh
```
The `docker-compose.yml` file deploys a vanilla Kibana instance that can be used to browse and filter the ElasticSearch data. You can access it at http://localhost:5601.
At a later time, you may wish to queue new files that have been added with recent edits:

```shell
docker-compose run gethistoryfiles
```
- The ELK stack components of this work were based on the wonderful Docker ELK stack by Antoine Cotten.
- IP ranges for the Government of Canada are based on Nick Ruest's list.
- wikiHistoryParser is licensed under the MIT License.
- Attribution is not required, but much appreciated:
wikiHistoryParser by Jacob Sanford