Indirect Supervision for Relation Extraction Using Question-Answer Pairs (WSDM'18)

Relation Extraction with Question-Answer Pairs (ReQuest)

Source code and data for WSDM'18 paper Indirect Supervision for Relation Extraction Using Question-Answer Pairs.

Performance

Performance comparison with several relation extraction systems over KBP 2013 dataset (sentence-level extraction).

| Method | Precision | Recall | F1 |
|---|---|---|---|
| Mintz (our implementation, Mintz et al., 2009) | 0.296 | 0.387 | 0.335 |
| LINE + Dist Sup (Tang et al., 2015) | 0.360 | 0.257 | 0.299 |
| MultiR (Hoffmann et al., 2011) | 0.325 | 0.278 | 0.301 |
| FCM + Dist Sup (Gormley et al., 2015) | 0.151 | 0.498 | 0.300 |
| CoType-RM (Ren et al., 2017) | 0.342 | 0.339 | 0.340 |
| ReQuest (our model, Wu et al., 2018) | 0.386 | 0.410 | 0.397 |

Dependencies

The instructions below use Ubuntu as an example.

  • Python 2.7
  • Python library dependencies
$ pip install pexpect ujson tqdm
$ cd code/DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ cd ..
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip

Data

We processed two public RE datasets into our JSON format using our data pipeline. We ran Stanford NER on the training set to detect entity mentions, and performed distant supervision using DBpedia Spotlight to assign type labels:

  • NYT (Riedel et al., 2011): 1.18M sentences sampled from 294K New York Times news articles. 395 sentences are manually annotated with 24 relation types and 47 entity types. (Download JSON)
  • Wiki-KBP: the training corpus contains 1.5M sentences sampled from 780K Wikipedia articles (Ling & Weld, 2012) plus ~7,000 sentences from the 2013 KBP corpus. Test data consists of 14K manually labeled sentences from the 2013 KBP slot filling assessment results. It has 13 relation types and 126 entity types after filtering out numeric value-related relations. (Download JSON)
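
As a rough illustration of the idea behind distant supervision (labels are looked up in a knowledge base instead of being annotated by hand), here is a minimal toy sketch. It is not the data pipeline used for these datasets: the two dictionaries are hypothetical stand-ins for a knowledge base, the entries and label names are invented, and `distant_labels` is an illustrative helper, not a function from this repository.

```python
# Hypothetical stand-ins for a knowledge base. All entries are invented.
TOY_ENTITY_TYPES = {
    "Barack Obama": ["person", "politician"],
    "Honolulu": ["location", "city"],
}
TOY_RELATIONS = {
    ("Barack Obama", "Honolulu"): "born_in",
}


def distant_labels(mentions, entity_types, relations):
    """Assign labels to the entity mentions of one sentence by KB lookup.

    Returns (mention_types, pair_relations):
      mention_types:  mention -> list of types (empty if unknown to the KB)
      pair_relations: (head, tail) -> relation, or "None" if the ordered
                      pair is not in the KB
    """
    mention_types = dict((m, entity_types.get(m, [])) for m in mentions)
    pair_relations = {}
    for head in mentions:
        for tail in mentions:
            if head != tail:
                pair_relations[(head, tail)] = relations.get((head, tail), "None")
    return mention_types, pair_relations


mention_types, pair_relations = distant_labels(
    ["Barack Obama", "Honolulu", "Hawaii"], TOY_ENTITY_TYPES, TOY_RELATIONS
)
```

Labels obtained this way are noisy, because a sentence that mentions both entities does not necessarily express the relation stored in the knowledge base.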

Please put the data files in the corresponding subdirectories under ReQuest/data/source.

We use the answer sentence selection dataset from TREC QA as our source of indirect supervision. We ran Stanford NER to extract entity mentions from both question and answer sentences and processed the dataset into JSON format containing QA pairs. Details of how we construct the QA pairs can be found in our paper.

We provide the processed qa.json file; copy it into each dataset folder under ReQuest/data/source.
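
Copying qa.json into every dataset folder can be scripted. The sketch below is one possible way to do it, assuming qa.json has already been downloaded and that each dataset lives in its own immediate subdirectory of the source folder. `copy_qa_file` is a hypothetical helper and is not part of this repository.

```python
import os
import shutil


def copy_qa_file(qa_path, source_root):
    """Copy qa_path as qa.json into every immediate subdirectory of source_root.

    Returns the list of files that were written.
    """
    copied = []
    for name in sorted(os.listdir(source_root)):
        dataset_dir = os.path.join(source_root, name)
        if os.path.isdir(dataset_dir):
            target = os.path.join(dataset_dir, "qa.json")
            shutil.copyfile(qa_path, target)
            copied.append(target)
    return copied


# Example call (paths are assumptions; adjust to where you saved the file):
# copy_qa_file("qa.json", os.path.join("ReQuest", "data", "source"))
```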

Makefile

To compile request.cpp with your own g++ environment:

$ cd ReQuest/code/Model/request; make

Default Run & Parameters

Run ReQuest for relation extraction on the Wiki-KBP dataset as follows.

Start the Stanford CoreNLP server for the Python wrapper:

$ java -mx4g -cp "code/DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
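
Before running the pipeline, it can help to confirm that the server is accepting connections. The sketch below is a generic TCP reachability check, not part of this repository. It assumes the server was started without a `-port` option, in which case CoreNLP listens on port 9000. `corenlp_server_is_up` is a hypothetical helper name.

```python
import socket


def corenlp_server_is_up(host="localhost", port=9000, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    This only checks that the port is open; it does not verify that the
    listener is actually a CoreNLP server.
    """
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    conn.close()
    return True


# Example call (assumes the default port):
# print(corenlp_server_is_up())
```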

Run feature extraction, embedding learning on the training data, and evaluation on the test data:

$ ./run_kbp.sh

The hyperparameters for embedding learning are set in the run_{dataname}.sh script.

Evaluation

Evaluate relation extraction performance (precision, recall, F1) in two steps: first produce predictions along with their confidence scores, then filter the predicted instances by tuning the confidence threshold.

$ python code/Evaluation/emb_test.py extract KBP request cosine 0.0
$ python code/Evaluation/tune_threshold.py extract KBP emb request cosine
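
To make the idea of threshold tuning concrete, here is a minimal generic sketch. It is not the logic of emb_test.py or tune_threshold.py: the function names, the data layout, and the toy predictions are all assumptions made for illustration. Predictions below the threshold are discarded, and F1 is the harmonic mean of precision and recall over what remains.

```python
def precision_recall_f1(predictions, gold, threshold):
    """Score the predictions whose confidence is at least `threshold`.

    predictions: list of (instance_id, predicted_label, confidence)
    gold:        dict instance_id -> true label, containing only the
                 instances that actually express a relation
    """
    kept = [(i, label) for i, label, conf in predictions if conf >= threshold]
    correct = sum(1 for i, label in kept if gold.get(i) == label)
    precision = correct / float(len(kept)) if kept else 0.0
    recall = correct / float(len(gold)) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def best_threshold(predictions, gold, candidates):
    """Return (threshold, f1) for the candidate with the highest F1.

    Ties are resolved in favour of the earliest candidate.
    """
    best = None
    for threshold in candidates:
        f1 = precision_recall_f1(predictions, gold, threshold)[2]
        if best is None or f1 > best[1]:
            best = (threshold, f1)
    return best


# Invented toy data: (instance_id, predicted_label, confidence).
toy_predictions = [
    (0, "born_in", 0.9),
    (1, "works_for", 0.6),
    (2, "born_in", 0.2),
    (3, "works_for", 0.1),
]
toy_gold = {0: "born_in", 1: "works_for", 2: "works_for"}

threshold, f1 = best_threshold(toy_predictions, toy_gold, [0.0, 0.5, 0.8])
```

On this toy data, raising the threshold from 0.0 to 0.5 drops the two low-confidence mistakes, which raises precision without losing any correct prediction. Raising it further to 0.8 starts to cost recall.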