# LM-LSTM-CRF

Empower Sequence Labeling with Task-Aware Language Model

This project provides high-performance character-aware sequence labeling tools and tutorials. The implementation is based on the PyTorch library.

LM-LSTM-CRF achieves a mean F1 score of 91.71 on the CoNLL03 NER task, without using any additional corpus.

Documentation will be available soon.

## Quick Links

- [Model](#model-notes)
- [Installation](#installation)
- [Data](#data)
- [Usage](#usage)
- [Benchmarks](#benchmarks)
## Model Notes

<p align="center"><img width="100%" src="docs/framework.png"/></p>

As visualized above, we use a conditional random field (CRF) to capture label dependencies, and adopt a hierarchical LSTM to take character-level and word-level input.
The character-level structure is further guided by a language model, while pre-trained word embeddings are leveraged at the word level.
The language model and the sequence labeler both make predictions at the word level, and are trained at the same time.
[Highway networks](https://arxiv.org/abs/1507.06228) are used to transform the character-level output into different semantic spaces, which mediates the two tasks and allows them to enhance each other.
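As a rough illustration of the highway transform mentioned above, here is a minimal pure-Python sketch of a single highway layer (function and parameter names are illustrative, not this repo's actual PyTorch implementation):

```python
import math

def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: y = t * H(x) + (1 - t) * x, with a sigmoid gate t.

    x is a list of floats; w_h / w_t are square weight matrices
    (lists of rows) and b_h / b_t are bias vectors.
    """
    def affine(v, w, b):
        return [sum(wi * vi for wi, vi in zip(row, v)) + bi
                for row, bi in zip(w, b)]

    h = [max(0.0, z) for z in affine(x, w_h, b_h)]                  # candidate transform H(x)
    t = [1.0 / (1.0 + math.exp(-z)) for z in affine(x, w_t, b_t)]   # transform gate T(x)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]
```

When the gate saturates near zero, the layer simply carries its input through unchanged; this gated mixing is what lets the language-model and sequence-labeling heads share the character-level output without interfering with each other.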
## Installation

For training, a GPU is strongly recommended for speed. CPU is supported, but training could be very slow.

### PyTorch

The code is based on PyTorch; you can find installation instructions [here](http://pytorch.org/).

### Dependencies

The code is written in Python 3.6; its dependencies are listed in ```requirements.txt```. You can install them with:
```
pip install -r requirements.txt
```
## Data

We mainly focus on the CoNLL 2003 NER task, and the code takes its format as input.
However, due to license issues, we cannot distribute this dataset.
You should be able to get it [here](http://www.cnts.ua.ac.be/conll2003/ner/).
You can also search for it on GitHub; someone may have released it there.

### Format

We assume the corpus is formatted like the CoNLL03 NER corpus.
Specifically, empty lines are used as separators between sentences, and the separator between documents is a special line:
```
-DOCSTART- -X- -X- -X- O
```
Other lines contain words, labels, and other fields. The word must be the first field, the label must be the last, and the fields are separated by spaces.
For example, the WSJ portion of the PTB POS tagging corpus should look like:
```
-DOCSTART- -X- -X- -X- O
Pierre NNP
Vinken NNP
, ,
61 CD
years NNS
old JJ
, ,
will MD
join VB
the DT
board NN
as IN
a DT
nonexecutive JJ
director NN
Nov. NNP
29 CD
. .
```
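To make the format concrete, here is a minimal, hypothetical reader for such files (the repo ships its own loader; this sketch just follows the rules above, skipping ```-DOCSTART-``` separator lines):

```python
def read_conll(lines):
    """Parse CoNLL-style lines into sentences of (word, label) pairs.

    The word is the first whitespace-separated field and the label the
    last; blank lines separate sentences; -DOCSTART- lines are skipped.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # blank line: close the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        if fields[0] == "-DOCSTART-":
            continue  # document separator, not a token
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences
```

For instance, feeding it the POS example above yields one sentence of eighteen ```(word, tag)``` pairs.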
## Usage

Here we provide implementations of two models: LM-LSTM-CRF, and its variant LSTM-CRF, which contains only the word-level structure and the CRF.
```train_wc.py``` and ```eval_wc.py``` are the scripts for LM-LSTM-CRF, while ```train_w.py``` and ```eval_w.py``` are the scripts for LSTM-CRF.
The usage of these scripts can be accessed with ```-h```, e.g.,
```
python train_wc.py -h
```

The default training commands for NER, POS tagging, and NP chunking are:

- Named Entity Recognition (NER):
```
python train_wc.py --train_file ./data/ner/train.txt --dev_file ./data/ner/testa.txt --test_file ./data/ner/testb.txt --checkpoint ./checkpoint/ner_ --caseless --fine_tune --high_way --co_train
```

- Part-of-Speech (POS) Tagging:
```
python train_wc.py --train_file ./data/pos/train.txt --dev_file ./data/pos/testa.txt --test_file ./data/pos/testb.txt --eva_matrix a --checkpoint ./checkpoint/pos_ --lr 0.015 --caseless --fine_tune --high_way --co_train
```

- Noun Phrase (NP) Chunking:
```
python train_wc.py --train_file ./data/np/train.txt.iobes --dev_file ./data/np/testa.txt.iobes --test_file ./data/np/testb.txt.iobes --checkpoint ./checkpoint/np_ --caseless --fine_tune --high_way --co_train
```
## Benchmarks

Here we compare LM-LSTM-CRF with recent sequence labeling models on the CoNLL00 chunking, CoNLL03 NER, and WSJ PTB POS tagging tasks.

### NER
The pinned versions in ```requirements.txt``` are:
```
Cython==0.25.2
scipy==0.19.0
numpy==1.12.1
torch==0.2.0.post2
tqdm==4.15.0
```