# LM-LSTM-CRF

Empower Sequence Labeling with Task-Aware Language Model

This project provides high-performance character-aware sequence labeling tools and tutorials. The implementation is based on the PyTorch library.

LM-LSTM-CRF achieves a mean F1 score of 91.71 on the CoNLL03 NER task, without using any additional corpus.

Documentation will be available soon.

## Quick Links

- [Model](#model-notes)
- [Installation](#installation)
- [Data](#data)
- [Usage](#usage)
- [Benchmarks](#benchmarks)
## Model Notes

<p align="center"><img width="100%" src="docs/framework.png"/></p>

As visualized above, we use a conditional random field (CRF) to capture label dependencies, and adopt a hierarchical LSTM to take character-level and word-level input.
The character-level structure is further guided by a language model, while pre-trained word embeddings are leveraged at the word level.
The language model and the sequence labeler both make predictions at the word level, and are trained at the same time.
[Highway networks](https://arxiv.org/abs/1507.06228) are used to transform the character-level output into different semantic spaces, which mediates the two tasks and allows them to enhance each other.
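As a rough illustration of the highway transform mentioned above, here is a minimal pure-Python sketch of a single highway layer (function and parameter names are illustrative, not this repo's actual PyTorch implementation):

```python
import math

def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: y = t * H(x) + (1 - t) * x, with a sigmoid gate t.

    x is a list of floats; w_h / w_t are square weight matrices
    (lists of rows) and b_h / b_t are bias vectors.
    """
    def affine(v, w, b):
        return [sum(wi * vi for wi, vi in zip(row, v)) + bi
                for row, bi in zip(w, b)]

    h = [max(0.0, z) for z in affine(x, w_h, b_h)]                  # candidate transform H(x)
    t = [1.0 / (1.0 + math.exp(-z)) for z in affine(x, w_t, b_t)]   # transform gate T(x)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]
```

When the gate saturates near zero, the layer simply carries its input through unchanged; this gated mixing is what lets the language-model and sequence-labeling heads share the character-level output without interfering with each other.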
## Installation

For training, a GPU is strongly recommended for speed. CPU is supported, but training could be very slow.

### PyTorch

The code is based on PyTorch; you can find installation instructions [here](http://pytorch.org/).

### Dependencies

The code is written in Python 3.6; its dependencies are listed in ```requirements.txt```. You can install them with:
```
pip install -r requirements.txt
```
## Data

We mainly focus on the CoNLL 2003 NER task, and the code takes its format as input.
However, due to license issues, we cannot distribute this dataset.
You should be able to get it [here](http://www.cnts.ua.ac.be/conll2003/ner/).
You can also search for it on GitHub; someone may have released it there.

### Format

We assume the corpus is formatted like the CoNLL03 NER corpus.
Specifically, empty lines are used as separators between sentences, and the separator between documents is a special line:
```
-DOCSTART- -X- -X- -X- O
```
Other lines contain words, labels, and other fields. The word must be the first field, the label must be the last, and the fields are separated by spaces.
For example, the WSJ portion of the PTB POS tagging corpus should look like:
```
-DOCSTART- -X- -X- -X- O
Pierre NNP
Vinken NNP
, ,
61 CD
years NNS
old JJ
, ,
will MD
join VB
the DT
board NN
as IN
a DT
nonexecutive JJ
director NN
Nov. NNP
29 CD
. .
```
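To make the format concrete, here is a minimal, hypothetical reader for such files (the repo ships its own loader; this sketch just follows the rules above, skipping ```-DOCSTART-``` separator lines):

```python
def read_conll(lines):
    """Parse CoNLL-style lines into sentences of (word, label) pairs.

    The word is the first whitespace-separated field and the label the
    last; blank lines separate sentences; -DOCSTART- lines are skipped.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # blank line: close the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        if fields[0] == "-DOCSTART-":
            continue  # document separator, not a token
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences
```

For instance, feeding it the POS example above yields one sentence of eighteen ```(word, tag)``` pairs.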
## Usage

Here we provide implementations of two models: LM-LSTM-CRF, and its variant LSTM-CRF, which contains only the word-level structure and the CRF.
```train_wc.py``` and ```eval_wc.py``` are the scripts for LM-LSTM-CRF, while ```train_w.py``` and ```eval_w.py``` are the scripts for LSTM-CRF.
The usage of these scripts can be accessed with ```-h```, e.g.,
```
python train_wc.py -h
```

The default training commands for NER, POS tagging, and NP chunking are:

- Named Entity Recognition (NER):
```
python train_wc.py --train_file ./data/ner/train.txt --dev_file ./data/ner/testa.txt --test_file ./data/ner/testb.txt --checkpoint ./checkpoint/ner_ --caseless --fine_tune --high_way --co_train
```

- Part-of-Speech (POS) Tagging:
```
python train_wc.py --train_file ./data/pos/train.txt --dev_file ./data/pos/testa.txt --test_file ./data/pos/testb.txt --eva_matrix a --checkpoint ./checkpoint/pos_ --lr 0.015 --caseless --fine_tune --high_way --co_train
```

- Noun Phrase (NP) Chunking:
```
python train_wc.py --train_file ./data/np/train.txt.iobes --dev_file ./data/np/testa.txt.iobes --test_file ./data/np/testb.txt.iobes --checkpoint ./checkpoint/np_ --caseless --fine_tune --high_way --co_train
```
## Benchmarks

Here we compare LM-LSTM-CRF with recent sequence labeling models on the CoNLL00 chunking, CoNLL03 NER, and WSJ PTB POS tagging tasks.

### NER
The pinned versions in ```requirements.txt``` are:
```
Cython==0.25.2
scipy==0.19.0
numpy==1.12.1
torch==0.2.0.post2
tqdm==4.15.0
```