This project accompanies our research paper: "WEC: Deriving a Large-scale Cross-document Event Coreference dataset from Wikipedia".
A video tutorial of our research is also available here.
This repository provides both the WEC-Eng cross-document dataset, derived from English Wikipedia, and the method for creating WEC for other languages.
Note: The original WEC paper used several methods, all of which are aggregated into this single project. To that end, we replaced some of the original Python implementations with corresponding Java ones (for example, the spaCy implementation was replaced with StanfordNLP).
WEC-Eng is hosted on the Hugging Face Hub and is available at: https://huggingface.co/datasets/biu-nlp/WEC-Eng
See the dataset card for instructions on how to read and use WEC-Eng.
Below are the instructions for generating a new version of WEC, whether from a more recent English Wikipedia dump or from one of the other supported languages (e.g., French, Spanish, German, Chinese).
- A Wikipedia ElasticSearch index created by the wikipedia-to-elastic project (the index must contain at least the Infobox "relationTypes").
- Java 11
This code repo contains two main processes:
- Code to generate the initial crude version of WEC-Lang
- Code to generate the final JSON of WEC-Lang
Configuration file - resources/application.properties
spring.datasource.url=jdbc:h2:file:/demo => H2 database file URL
poolSize=8 => Number of threads to run
elasticHost=localhost => Elastic engine host
elasticPort=9200 => Elastic engine port
elasticWikiIndex=enwiki_v3 => Elastic index to read from (as generated by *wikipedia-to-elastic*)
infoboxConfiguration=/infobox_config/en_infobox_config.json => Explained below
multiRequestInterval=100 => Controls the number of search pages to retrieve from Elastic (100 is the recommended value)
elasticSearchInterval=100 => Controls the number of pages read by the Elastic scroller (100 is the recommended value)
totalAmountToExtract=-1 => If < 0, read all Wikipedia pages; otherwise read up to the specified number of pages
main.lexicalThresh=4 => Lexical diversity threshold
main.outputDir=output => Output folder where the WEC JSON is created and saved
main.outputFile=GenWEC.json => WEC JSON file name; this file contains the final version of the generated dataset
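As a quick sanity check before a long run, the settings above can be read back and summarized. The following is a minimal sketch in Python, not part of this project: `parse_properties` and `describe_run` are hypothetical helpers, the sample text is an illustrative excerpt rather than the full file, and the parser handles only plain `key=value` lines.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse simple `key=value` lines, skipping blank lines and `#`/`!` comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        # Split on the first "=" only, so values such as JDBC URLs stay intact.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def describe_run(props: dict[str, str]) -> str:
    """Summarize the settings that decide where and how much data is read."""
    total = int(props.get("totalAmountToExtract", "-1"))
    # A negative value means "read all Wikipedia pages".
    scope = "all pages" if total < 0 else f"up to {total} pages"
    return (
        f"index={props['elasticWikiIndex']} "
        f"at {props['elasticHost']}:{props['elasticPort']}, "
        f"threads={props['poolSize']}, scope={scope}"
    )


# Hypothetical excerpt of resources/application.properties (illustration only).
sample = """\
# Elastic connection
elasticHost=localhost
elasticPort=9200
elasticWikiIndex=enwiki_v3
poolSize=8
totalAmountToExtract=-1
"""

print(describe_run(parse_properties(sample)))
```

To check a real configuration, the same helpers could be applied to the contents of the properties file instead of `sample`.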
We have extracted the relevant infobox configuration for the English Wikipedia.
To create a newer version of WEC-Eng, use or update the default infobox_config/en_infobox_config.json in the configuration.
To generate WEC in one of the supported languages (other than English), follow these steps:
- Export Wikipedia in the required language using the wikipedia-to-elastic project
- Explore the infobox categories. The script below can help by producing candidate categories along with the number of pages related to each infobox category.
- Run the infobox categories report:
./gradlew bootRun --args=infobox
- Now you can create a new infobox configuration file for the new language in src/main/resources/infobox_config/<lang>_infobox_config.json. The file should contain all the language-specific infobox configuration needed (based on the generated infobox categories report).
- Finally, set it as the infoboxConfiguration file in application.properties
{
"infoboxLangText" : "Infobox", // wikipedia markdown element name in the language (e.g., <Infobox sport>)
"infoboxConfigs": [
{
"corefType": "ACCIDENT_EVENT", // Type you would like to give the infobox category
"include": true, // Should be included when extracting WEC
"infoboxs": [ // list of infobox categories that should be included in this type (lowercased and concat)
"airlinerincident",
"airlineraccident",
"aircraftcrash",
"aircraftaccident",
"aircraftincident",
"aircraftoccurrence",
"railaccident",
"busaccident",
"publictransitaccident"
]
}
]
}
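To illustrate how such a configuration could be consumed, here is a minimal Python sketch that turns it into a lookup from infobox category to coreference type. It is not the project's Java implementation: `normalize_category` and `build_lookup` are hypothetical helpers, the second entry (`EXAMPLE_EXCLUDED_EVENT`) is invented to show the effect of `"include": false`, and the normalization (lowercase, whitespace removed) is an assumption based on the "lowercased and concat" comment. The sample omits the `//` comments because Python's `json` module rejects them.

```python
import json
import re

# Shortened sample configuration; the second entry is hypothetical.
config_text = """
{
  "infoboxLangText": "Infobox",
  "infoboxConfigs": [
    {
      "corefType": "ACCIDENT_EVENT",
      "include": true,
      "infoboxs": ["airlineraccident", "railaccident", "busaccident"]
    },
    {
      "corefType": "EXAMPLE_EXCLUDED_EVENT",
      "include": false,
      "infoboxs": ["exampleexcluded"]
    }
  ]
}
"""


def normalize_category(name: str) -> str:
    """Lowercase and remove whitespace, e.g. 'Rail accident' -> 'railaccident'."""
    return re.sub(r"\s+", "", name).lower()


def build_lookup(config: dict) -> dict[str, str]:
    """Map each included infobox category to its corefType."""
    lookup = {}
    for entry in config["infoboxConfigs"]:
        if not entry["include"]:
            continue  # entries marked include=false are skipped
        for category in entry["infoboxs"]:
            lookup[normalize_category(category)] = entry["corefType"]
    return lookup


lookup = build_lookup(json.loads(config_text))
print(lookup.get(normalize_category("Rail accident")))
```

A lookup like this also makes it easy to spot a category that was accidentally listed under two different types, since the later entry would overwrite the earlier one.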
Make sure the Wikipedia Elastic engine is running
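One way to verify this is to send an HTTP request to the engine's root endpoint. The sketch below is a Python illustration, not part of this project: `elastic_is_up` is a hypothetical helper, and it assumes an unsecured Elastic instance reachable over plain HTTP at the `elasticHost` and `elasticPort` values from the configuration (a cluster that requires HTTPS or authentication would be reported as unreachable).

```python
import urllib.request


def elastic_is_up(host: str = "localhost", port: int = 9200, timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET on the root endpoint answers with status 200."""
    url = f"http://{host}:{port}/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses
        # (urllib's URLError and HTTPError are subclasses of OSError).
        return False


if __name__ == "__main__":
    print("Elastic reachable:", elastic_is_up())
```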
- Run WikiToWECMain to generate the H2 database:
./gradlew bootRun --args=wecdb
Program output - an H2 database containing the crude extraction of coreference relations from Wikipedia (this resource can be used for experiments before generating the final version of WEC-Lang)
- Generate the WEC-Lang JSON format file:
./gradlew bootRun --args=wecjson
Program output - a JSON format resource of the WEC-Lang dataset
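Once the JSON file exists, a common first step is to group mentions into coreference clusters. The following Python sketch shows the idea on made-up records. It rests on assumptions that this README does not confirm: that the file is a JSON array of mention objects, and that each object carries `coref_chain` and `tokens_str` fields. Check the generated file or the dataset card for the actual schema before relying on it.

```python
import json
from collections import defaultdict

# Hypothetical records; field names and values are assumptions for illustration.
sample_json = """
[
  {"mention_id": "1", "coref_chain": 100, "tokens_str": "the plane crash"},
  {"mention_id": "2", "coref_chain": 100, "tokens_str": "the accident"},
  {"mention_id": "3", "coref_chain": 200, "tokens_str": "the derailment"}
]
"""


def group_by_chain(mentions: list[dict]) -> dict[int, list[str]]:
    """Collect the mention strings that share the same coreference chain id."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[mention["coref_chain"]].append(mention["tokens_str"])
    return dict(clusters)


clusters = group_by_chain(json.loads(sample_json))
for chain_id, spans in sorted(clusters.items()):
    print(chain_id, len(spans), spans)
```

For a real run, the sample string would be replaced by the contents of the file configured through main.outputDir and main.outputFile.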
To produce more statistics or a visualization of the generated dataset, refer to these scripts for more information.