How to add a new wiki
Here are the steps for adding a new wiki to WikiChron (note that, by default, WikiChron already includes some wikis for testing and example purposes).
WikiChron can use any kind of MediaWiki wiki as long as the input data is in the right format. You will have to process the data into a CSV file with a specific format and content. To do that, we provide a script that transforms any MediaWiki XML dump into that CSV input file. Below you can find the steps to add a new wiki to your WikiChron instance.
First, you will need the XML file corresponding to the full revision history of the wiki.
If you have shell access to the server, there is an easy way to generate the dump manually using the dumpBackup.php maintenance script.
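As a minimal sketch (the output filename is just an example, and available flags may vary slightly between MediaWiki versions), run it from the wiki's installation directory:
php maintenance/dumpBackup.php --full > full_history_dump.xml
The --full option dumps every revision of every page, which is what WikiChron needs; see the MediaWiki documentation on dumpBackup.php for further options.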
If you don't have shell access to the server, you have different options to download it, depending on the hosting provider your wiki uses:
- Wikia wikis: Supposedly, Wikia automatically generates dumps and provides them for every wiki they host. However, there is a known bug in this generation that cuts off the output dumps for large wikis. The best option here is to use this nice script, which requests and downloads the complete XML dump through the Special:Export interface. Please keep in mind that this script does not download all the namespaces available when generating dumps for a wiki, but a wide subset of them (you can find more detailed info in the wiki).
- Wikimedia project wikis: For wikis belonging to the Wikimedia project, there is already a regularly updated repository with all the dumps here: http://dumps.wikimedia.org. Select your target wiki from the list, download the complete edit history dump and uncompress it.
- For other wikis, like self-hosted wikis, you should use the wikiteam's dumpgenerator.py script. There is a simple tutorial in their wiki: https://github.com/WikiTeam/wikiteam/wiki/Tutorial#I_have_no_shell_access_to_server. Its usage is very straightforward and the script is well maintained. Remember to use the --xml option to download the full history dump (see the sketch below).
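As a rough sketch, assuming your wiki exposes its API at the usual location (the URL below is only a placeholder; check the wikiteam tutorial for the options that apply to your wiki), the call could look like this:
python dumpgenerator.py --api=https://yourwiki.example.org/api.php --xml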
Remember to join all the parts of the dump into a single file, and make sure it contains the full history of every page of the wiki you want to analyze, and not only the current-revisions version of the dump.
Once you have your XML dump, you need to process it in order to get the corresponding .csv file. To do so, run the dump_parser script. This script is listed as a requirement for WikiChron, but you can also install it standalone with pip install wiki-dump-parser. It processes any MediaWiki dump and outputs a pre-processed, simplified csv file with all the information that WikiChron needs to generate its plots. Run the script using:
python -m wiki_dump_parser data/<name_of_your.xml>
This will create the corresponding .csv file in your local data/ directory. If you have more than one XML file, you can run the script as follows and process them all at once:
python -m wiki_dump_parser data/*.xml
Finally, you need to provide some metadata for your wiki in the wikis.json file. That file must have an entry for each wiki with, at least, the following fields: name, url, data, number of pages and, optionally, a list of user ids (typically, bots) whose activity you want to exclude.
There is an example wikis.json file in the data/ directory of the repo with a sample set of wikis. Note, however, that the information required in this file will change in the future, so keep an eye on that file and on upcoming updates.
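Purely as an illustration of the shape of such an entry (the field names below are assumptions based on the list above; always check the sample wikis.json in data/ for the real schema):
{
  "name": "Example Wiki",
  "url": "examplewiki.example.org",
  "data": "examplewiki.csv",
  "pages": 1234
}
The optional list of user ids to filter out (typically bots) would go in an additional field of that same entry.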
You can edit this file by hand and fill in the corresponding data or, in case you are using Wikia wikis or similarly compatible wikis, you can use the generate_wikis_json.py script provided for this purpose.
The script will look in the directory defined by the environment variable WIKICHRON_DATA_DIR for a file called wikis.csv, which lists the url of each wiki and the filename of the csv you want to add to WikiChron. It then fetches the metadata needed and updates the wikis.json file accordingly. The resulting wikis.json will be placed in the directory defined by WIKICHRON_DATA_DIR as well.
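For example, you could point WIKICHRON_DATA_DIR at your data directory before running the script (the path is just an example):
export WIKICHRON_DATA_DIR=/path/to/your/data
As a purely illustrative sketch of the wikis.csv layout, one wiki per row with its url and csv filename (the column names and separator here are assumptions; compare with the wikis.csv shipped with WikiChron before running the script):
url,csvfile
examplewiki.example.org,examplewiki.csv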
You can look at the wikis.csv given as an example for the wiki data provided with WikiChron by default. Once you have set up your wikis.csv file (you can just append your wikis to the one provided, since the script won't overwrite previously generated wikis.json data), just run:
python generate_wikis_json.py
Launch WikiChron and you should now see your new wikis added to the list. Note that you might need to restart WikiChron if it was already running before you added the new wiki.