WikiBank is a new partially annotated resource for multilingual frame-semantic parsing.
The datasets are available for 5 languages (EN, ES, DE, FR, and IT) and can be found in the dataset folder.
NOTE: the process requires around 1 TB of disk space, so be sure to have the required amount of free space before starting.
- MongoDB
- Python
- Download the Wikidata JSON dump from here
- Download the Wikipedia XML dump from here, or a JSON dump from here (download the "content" one).
If you use the XML dump, convert it to JSON using one of the tools listed on this page.
To merge Wikidata and Wikipedia, both sets of documents must contain the Wikidata id. If your Wikipedia dump doesn't contain this field, you can compute a mapping from Wikipedia id to Wikidata id using the script "src/scripts/wiki_props.py" and the dump of the Wikipedia page properties (here - called wiki-latest-page_props.sql), and then use the output file to add the Wikidata id to each JSON document (see the sketch below).
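For illustration, a minimal sketch of that last step, assuming the mapping file is a TSV with one `wikipedia_id<TAB>wikidata_id` pair per line and that the Wikipedia dump is line-delimited JSON with an "id" field (both are assumptions - adapt the field names and formats to your actual files):

```python
# Hypothetical sketch: attach Wikidata ids to a line-delimited Wikipedia JSON dump.
import json

def load_mapping(path):
    # Assumed output format of src/scripts/wiki_props.py: "<wikipedia_id>\t<wikidata_id>" per line.
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            wikipedia_id, wikidata_id = line.rstrip("\n").split("\t")
            mapping[wikipedia_id] = wikidata_id
    return mapping

def add_wikidata_ids(dump_in, dump_out, mapping_path):
    mapping = load_mapping(mapping_path)
    with open(dump_in, encoding="utf-8") as fin, \
         open(dump_out, "w", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)
            # Keep pages with no known Wikidata id instead of failing on them.
            wikidata_id = mapping.get(str(doc["id"]))
            if wikidata_id is not None:
                doc["wikidata_id"] = wikidata_id
            fout.write(json.dumps(doc, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    add_wikidata_ids("wikipedia_dump.json", "wikipedia_with_ids.json",
                     "wiki_props_mapping.tsv")
```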
- Import the Wikidata dump into MongoDB in its own collection using:
mongoimport --db WikiSRL --collection wikidata --file wikidata_dump.json --jsonArray
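The full Wikidata dump is distributed as one very large JSON array with one entity per line, which mongoimport --jsonArray can struggle with on some versions. If that happens, a streaming alternative along these lines may help (file name and batch size are illustrative; requires pymongo):

```python
# Sketch of a streaming import that reads the array dump one entity per line.
import json
from pymongo import MongoClient

def import_wikidata(path, batch_size=1000):
    collection = MongoClient()["WikiSRL"]["wikidata"]
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue  # skip the array brackets and blank lines
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                collection.insert_many(batch)
                batch = []
    if batch:
        collection.insert_many(batch)

if __name__ == "__main__":
    import_wikidata("wikidata_dump.json")
```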
- Create an index on the "id" field:
db.wikidata.createIndex({"id": 1})
- Import the JSON Wikipedia dump into MongoDB in its own collection (called "wikipedia" below)
- Create an index on the "wikidata_id" field:
db.wikipedia.createIndex({"wikidata_id": 1})
- To merge Wikidata and Wikipedia, configure the config.py file, and then run merge_wikis.py (a conceptual sketch of the merge is shown below)
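For intuition, this is roughly what the merge amounts to, sketched with pymongo. The collection names "wikipedia" and "merged" and the embedded field name are assumptions; merge_wikis.py remains the authoritative implementation:

```python
# Rough sketch: for each Wikipedia document, fetch the Wikidata entity with the
# matching id and embed it in the page document.
from pymongo import MongoClient

def merge_wikis():
    db = MongoClient()["WikiSRL"]
    merged = db["merged"]
    for page in db["wikipedia"].find({"wikidata_id": {"$exists": True}}):
        entity = db["wikidata"].find_one({"id": page["wikidata_id"]})
        if entity is not None:
            page["wikidata"] = entity  # embed the full entity in the page doc
            page.pop("_id", None)      # let MongoDB assign a fresh _id
            merged.insert_one(page)

if __name__ == "__main__":
    merge_wikis()
```

The two indexes created above are what keep these per-document lookups fast.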
- To extract the triples and create the SRL file, configure the config.py file, and run srl.py (an illustrative config.py is sketched below)
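As a purely illustrative example, config.py typically needs to point the scripts at the MongoDB instance and the data. The option names below are hypothetical; check the actual file for the real ones:

```python
# Hypothetical config.py contents - option names are illustrative only.
MONGO_HOST = "localhost"
MONGO_PORT = 27017
DB_NAME = "WikiSRL"
WIKIDATA_COLLECTION = "wikidata"
WIKIPEDIA_COLLECTION = "wikipedia"
LANGUAGES = ["en", "es", "de", "fr", "it"]
OUTPUT_DIR = "output/"
```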