update
LiyuanLucasLiu committed Sep 13, 2017
1 parent 0701a16 commit 35b0045
Showing 3 changed files with 105 additions and 9 deletions.
109 changes: 100 additions & 9 deletions README.md
@@ -1,23 +1,114 @@
# LM-LSTM-CRF
Empower Sequence Labeling with Task-Aware Language Model

## Training
This project provides high-performance character-aware sequence labeling tools and tutorials. The implementation is based on the PyTorch library.

LM-LSTM-CRF achieves a mean F1 score of 91.71 on the CoNLL03 NER task without using any additional corpus.

Documentation will be available soon.

## Quick Links

- [Model](#model-notes)
- [Installation](#installation)
- [Data](#data)
- [Usage](#usage)
- [Benchmarks](#benchmarks)

## Model Notes

<p align="center"><img width="100%" src="docs/framework.png"/></p>

As visualized above, we use a conditional random field (CRF) to capture label dependencies, and adopt a hierarchical LSTM to take char-level and word-level input.
The char-level structure is further guided by a language model, while pre-trained word embeddings are leveraged at the word level.
The language model and the sequence labeling model both make predictions at the word level and are trained at the same time.
[Highway networks](https://arxiv.org/abs/1507.06228) are used to transform the output of the char-level layers into different semantic spaces, which mediates between these two tasks and allows them to enhance each other.
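
As a rough illustration of the highway transform mentioned above, here is a minimal pure-Python sketch of a single highway layer, using the general form `y = t * h + (1 - t) * x` from the linked paper. It is not this repository's PyTorch implementation: the function names, the ReLU nonlinearity, and the toy weights are assumptions chosen only for illustration.

```python
import math


def sigmoid(z):
    """Logistic function used for the transform gate."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Compute weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: the gate t mixes a transformed input with the raw input."""
    h = [max(0.0, z) for z in affine(w_h, b_h, x)]   # candidate transform (ReLU, assumed)
    t = [sigmoid(z) for z in affine(w_t, b_t, x)]    # transform gate in (0, 1)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


# Toy example (hypothetical weights): identity transform, gate driven only by its bias.
x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

carried = highway_layer(x, identity, [0.0, 0.0], zeros, [-30.0, -30.0])      # gate nearly closed
transformed = highway_layer(x, identity, [0.0, 0.0], zeros, [30.0, 30.0])    # gate nearly open
```

With the gate nearly closed the layer passes its input through almost unchanged, and with the gate nearly open it returns the transformed candidate. This is what lets one shared char-level output be reshaped separately for each task.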

## Installation

For training, a GPU is strongly recommended for speed. CPU is supported but training could be very slow.

### PyTorch

The code is based on PyTorch; you can find installation instructions [here](http://pytorch.org/).

### Dependencies

The code is written in Python 3.6; its dependencies are in the file ```requirements.txt```. You can install these dependencies like this:
```
python train_nwc.py --checkpoint ./checkpoint/ner_ --gpu 0 --caseless --fine_tune --high_way --co_train
pip install -r requirements.txt
```

- Named Entity Recognition (NER)
## Data

We mainly focus on the CoNLL 2003 NER task, and the code takes its format as input.
However, due to licensing restrictions, we cannot distribute this dataset.
You should be able to get it [here](http://www.cnts.ua.ac.be/conll2003/ner/).
You can also search for it on GitHub; someone may have released it.

### Format

We assume the corpus is formatted like the CoNLL03 NER corpus.
Specifically, empty lines are used as separators between sentences, and the separator between documents is a special line:
```
python train_nwc.py --patience 15 --checkpoint ./checkpoint/w_${num}_ --gpu ${gpuid} --epoch 200 --lr 0.01 --lr_decay 0.05 --momentum 0.9 --caseless --fine_tune --mini_count 5 --char_hidden 300 --word_hidden 300 --pc_type w --high_way --co_train 2>> l_ner/out_${num}.txt
-DOCSTART- -X- -X- -X- O
```
Other lines contain words, labels, and other fields. The word must be the first field, the label must be the last, and fields are separated by spaces.
For example, the WSJ portion of the PTB POS tagging corpus should be formatted like:

- Part-of-Speech (POS) Tagging
```
python train_nwc.py --train_file ./data/pos/train.txt --dev_file ./data/pos/testa.txt --test_file ./data/pos/testb.txt --eva_matrix a --checkpoint ./checkpoint/pos_ --gpu 1 --lr 0.015 --caseless --fine_tune --high_way --co_train
-DOCSTART- -X- -X- -X- O
Pierre NNP
Vinken NNP
, ,
61 CD
years NNS
old JJ
, ,
will MD
join VB
the DT
board NN
as IN
a DT
nonexecutive JJ
director NN
Nov. NNP
29 CD
. .
```
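
To make the format rules above concrete, here is a minimal sketch of a reader for this layout. It is not the loader used by the training scripts; the function name `read_conll` and the in-memory sample are hypothetical, and the sketch assumes only what is stated above: blank lines separate sentences, `-DOCSTART-` lines separate documents, the word is the first field, and the label is the last.

```python
def read_conll(lines):
    """Yield each sentence as a list of (word, label) pairs.

    Blank lines and -DOCSTART- lines both end the current sentence.
    """
    sentence = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            if sentence:              # flush the sentence collected so far
                yield sentence
                sentence = []
            continue
        fields = line.split()         # fields are separated by spaces
        sentence.append((fields[0], fields[-1]))  # word first, label last
    if sentence:                      # the final sentence may lack a trailing blank line
        yield sentence


# Hypothetical sample in the format described above.
sample = [
    "-DOCSTART- -X- -X- -X- O",
    "Pierre NNP",
    "Vinken NNP",
    ", ,",
    "",
    "will MD",
    "join VB",
]
sentences = list(read_conll(sample))
```

To read from disk, pass an open file object instead of a list, for example `read_conll(open(path))`, where `path` points to a corpus file in this format.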

## Usage

Here we provide implementations of two models: LM-LSTM-CRF and its variant, LSTM-CRF, which contains only the word-level structure and the CRF.
```train_wc.py``` and ```eval_wc.py``` are scripts for LM-LSTM-CRF, while ```train_w.py``` and ```eval_w.py``` are scripts for LSTM-CRF.
The usage of these scripts can be displayed with ```-h```, e.g.,
```
python train_wc.py -h
```

The default commands for NER, POS tagging, and NP chunking are:

- Named Entity Recognition (NER):
```
python train_wc.py --train_file ./data/ner/train.txt --dev_file ./data/ner/testa.txt --test_file ./data/ner/testb.txt --checkpoint ./checkpoint/ner_ --caseless --fine_tune --high_way --co_train
```

- Part-of-Speech (POS) Tagging:
```
python train_wc.py --train_file ./data/pos/train.txt --dev_file ./data/pos/testa.txt --test_file ./data/pos/testb.txt --eva_matrix a --checkpoint ./checkpoint/pos_ --lr 0.015 --caseless --fine_tune --high_way --co_train
```

- Noun Phrase (NP) Chunking
- Noun Phrase (NP) Chunking:
```
python train_nwc.py --train_file ./data/np/train.txt.iobes --dev_file ./data/np/testa.txt.iobes --test_file ./data/np/testb.txt.iobes --checkpoint ./checkpoint/np_ --gpu 2 --caseless --fine_tune --high_way --co_train
python train_wc.py --train_file ./data/np/train.txt.iobes --dev_file ./data/np/testa.txt.iobes --test_file ./data/np/testb.txt.iobes --checkpoint ./checkpoint/np_ --caseless --fine_tune --high_way --co_train
```

## Benchmarks

Here we compare LM-LSTM-CRF with recent sequence labeling models on the CoNLL00 Chunking, CoNLL03 NER, and WSJ PTB POS Tagging tasks.

### NER

Binary file added docs/framework.png
5 changes: 5 additions & 0 deletions requirements.txt
@@ -0,0 +1,5 @@
Cython==0.25.2
scipy==0.19.0
numpy==1.12.1
torch==0.2.0.post2
tqdm==4.15.0
