Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach (EMNLP 17)
C Python Shell Makefile
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
Data
DataProcessor
LabelGeneration
Model
docs
.gitattributes
LICENSE
README.md
demo_KBP.sh
demo_NYT.sh
labelGeneration.sh

README.md

Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach

Source code and data for Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach


ReHession conducts Relation Extraction with Heterogeneous Supervision, e.g., the labeling functions at left corner.

ReHession conducts Relation Extraction, featuring:

  • employ heterogeneous supervision, e.g., knowledge base and heuristic patterns, to train the model (as picked in left corner)
  • infers true label from noisy labels in a context-aware manner
  • true label discovery and relation extraction can mutually enhance each other

Quick Start

A demo is provided and can be execurated by:

bash demo_KBP.sh

Details

This Project are in an early-release beta. Expect some adventures...

Pipeline Overview

And the pipeline of ReHession is:

  • recognize entities for labeling functions
  • apply labeling functions to get heterogeneous supervision
  • generate pos-tagging and brown clustering
  • encoding training / testing corpus
  • training and evaluation

Data

We include corpus, labeling functions and knowledge base in the Data folder.

Corpus

We stored KBP[1] corpus under the path Data/source/KBP/corpus.txt.zip, and NYT[2] corpus under the folder Data/source/NYT/corpus.txt.zip. Pos-tagging and entity detection has been conducted by Stanford NER tools.

Patterns

We stored the pattern-based labeling functions in the folder of the corpus, named nlf.json. These files stored one pattern-based labeling function at one line in the json format.

Knowledge base

We stored the annotations generated by KB-based labeling functions in the folder of the corpus, named train.json. It's also used as the training file adopted by CoType

Labeling Functions

We adopted three kinds of labeling functions in ReHession: KB-based, pattern-based and inversed. And the annotations generated by those labeling functions are save in the path Data/intermediate/.

KB based

KB based labeling functions are adopted to encode information of KB. Accordingly, we adopted the training file generated by distant supervision, by treating annotations of the same relation type are generated by the same labeling function (in the form of if r(e1, e2) in KB: return r).

Pattern based

Pattern-based labeling functions would annotate entity pairs with matched entity type and texture pattern with preset relation types. And each pattern-based labeling functions (as stored in nlf.json) has the following fields:

  • reserved: whether entity 1 is before entity 2
  • PID: pattern id
  • rule: the rule of matching multiple entities
  • Texture: texture pattern
  • relationType: the detected type of this labeling function
  • Type1: the type of entity 1
  • Type2: the type of entity 2

The rule field has value in the format of [a,b], while a and b can be any number or n. It indicates how many entities would the labeling function try to match (n indicates matching all entities). For example, {"reserved": "1", "PID": 0, "rule": ["1", "1"], "Texture": "founder of", "relationType": "/business/company/founders", "Type1": "ORGANIZATION", "Type2": "PERSON"} requires the entity before texture pattern (indicated by reserved) to be Person (indicated by Type2), and entity after texture pattern to be ORGANIZATION (indicated by Type1). Also this labeling function would only annotate the entity pair most close to the texture pattern (indicated by ["1", "1"])

Inverse

In order to annotate None type, we designed another type of labeling function, i.e., if a set of labeling functions not annotate a instance, it would annotate it as None.

Specifically, for KBP dataset, we adopted a reverse labeling function who reserved all pattern-based labeling functions; for NYT dataset, a reverse labeling function reserving all kb-based labeling functions is adopted.

KBP Dataset

We now proceed to use KBP dataset as an example (also save as labelGeneration.sh) to demonstrate the pipeline of generating heterogeneous supervision.

python LabelGeneration/UIDExtractKB.py
python LabelGeneration/post_process_chunked_corpus.py
python LabelGeneration/applying_labelling_func.py --save_all
python LabelGeneration/applying_KB.py
python LabelGeneration/reCodeFuncs.py
python LabelGeneration/MergeLFS.py
python LabelGeneration/cal_Mention_Distance.py
python LabelGeneration/applyingInverse.py

These commands requires original data to be stored in Data/source/KBP/, while the specific requirements are stored in the default setting.

Feature Extraction

We provided the training files used in our experiments, which is saved in the path Data/intermediate/. Also we wrote a demo scripts for the whole feature extraction, model learning and evaluation pipeline. You can simply execute it by

bash demo_KBP.sh

With dataset with annotation and brown clustering file stored in the path Data/intermediate/KBP/, the feature extraction can be performed by

python DataProcessor/relation_feature_generation.py

Model Learning and Evaluation

Encoding

The model is designed to run on encoded training / evaluation / testing corpus. Each line in the encoded file is an instance, which is in the format of

InsID	FeatureNum	AnnotationNum	FeatureList	AnnotationList

InsID, FeatureNum and AnnotationNum are integers; FeatureList is in the format of featureId,featureId; AnnotationList is in the format of lfId typeId. For example, an instance in KBP dataset is like:

10      78      2       457984,153472,646018,120323,1119382,739279,152889,199945,1146378,1077643,1138574,501136,1091345,55555,65558,1125465,131097,947866,1128477,869918,485663,289,201307,1066993,359723,681233,767278,1053381,1019443,742453,324534,570977,244,1059642,434747,267324,99135,376075,815283,84805,382022,1119585,1107934,43594,43595,809036,676557,749903,485662,78163,297812,43278,95417,1092708,673498,538331,218205,977502,189791,991328,380641,354658,1030244,1173222,492263,209340,520684,379757,335215,409201,786164,835966,21497,16250,872059,562651,137214,498815        6,10,6,9

Compile

Run make under the folder of Model would compile the model

Execute

The Execute commands for KBP dataset and NYT dataset are:

./Model/ReHession -train ./Data/intermediate/KBP/train.data -test ./Data/intermediate/KBP/test.data -none_idx 6 -instances 225977 -test_instances 2111
./Model/ReHession -train ./Data/intermediate/NYT/train.data -test ./Data/intermediate/NYT/test.data -none_idx 0 -instances 530767 -test_instances 3803

Reference

Please cite the following paper if you find the codes and datasets useful:

@inproceedings{Liu2017rehession,
 title={Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach},
 author={Liu, Liyuan and Ren, Xiang and Zhu, Qi and Zhi, Shi and Gui, Huan and Ji, Heng and Han, Jiawei},
 booktitle={Proc. EMNLP},
 year={2017}
}