diff --git a/README.md b/README.md
index 9a4e122..01327c0 100644
--- a/README.md
+++ b/README.md
@@ -1,23 +1,114 @@
 # LM-LSTM-CRF
-Empower Sequence Labeling with Task-Aware Language Model
-## Training
+This project provides high-performance character-aware sequence labeling tools and tutorials. The implementation is based on the PyTorch library.
+LM-LSTM-CRF achieves a mean F1 score of 91.71 on the CoNLL03 NER task, without using any additional corpus.
+
+Documentation will be available soon.
+
+## Quick Links
+
+- [Model](#model-notes)
+- [Installation](#installation)
+- [Data](#data)
+- [Usage](#usage)
+- [Benchmarks](#benchmarks)
+
+## Model Notes
+
+![LM-LSTM-CRF framework](docs/framework.png)
+
+As visualized above, we use a conditional random field (CRF) to capture dependencies between labels, and adopt a hierarchical LSTM to take both character-level and word-level input.
+The character-level structure is further guided by a language model, while pre-trained word embeddings are leveraged at the word level.
+The language model and the sequence labeling task both make predictions at the word level, and are trained at the same time.
+[Highway networks](https://arxiv.org/abs/1507.06228) are used to transform the character-level output into different semantic spaces, which mediates the two tasks and allows them to enhance each other.
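+
+For intuition, the hierarchy can be sketched as below. This is a simplified, hypothetical illustration rather than the repository's actual model: the real character-level LSTMs run in both directions over the sentence, and the emission scores are consumed by a CRF layer rather than used directly. All module and parameter names here are illustrative.
+
+```
+import torch
+import torch.nn as nn
+
+class Highway(nn.Module):
+    # Highway layer: y = t * relu(W x) + (1 - t) * x, with a learned gate t.
+    def __init__(self, size):
+        super().__init__()
+        self.transform = nn.Linear(size, size)
+        self.gate = nn.Linear(size, size)
+
+    def forward(self, x):
+        t = torch.sigmoid(self.gate(x))
+        return t * torch.relu(self.transform(x)) + (1 - t) * x
+
+class CharWordTagger(nn.Module):  # hypothetical name, for illustration only
+    def __init__(self, n_chars, n_words, n_tags,
+                 char_dim=30, char_hidden=300, word_dim=100, word_hidden=300):
+        super().__init__()
+        self.char_emb = nn.Embedding(n_chars, char_dim)
+        self.char_lstm = nn.LSTM(char_dim, char_hidden, batch_first=True)
+        self.word_emb = nn.Embedding(n_words, word_dim)  # loaded from pre-trained vectors
+        # Separate highway transforms route the char-level output to each task.
+        self.hw_tag = Highway(char_hidden)
+        self.hw_lm = Highway(char_hidden)
+        self.word_lstm = nn.LSTM(word_dim + char_hidden, word_hidden, batch_first=True)
+        self.emission = nn.Linear(word_hidden, n_tags)  # scores for the CRF layer
+        self.lm_head = nn.Linear(char_hidden, n_words)  # co-trained language model
+
+    def forward(self, chars, words):
+        # chars: (batch * seq_len, max_word_len) character ids of each word
+        # words: (batch, seq_len) word ids
+        _, (h, _) = self.char_lstm(self.char_emb(chars))
+        char_feat = h[-1].view(words.size(0), words.size(1), -1)
+        lm_logits = self.lm_head(self.hw_lm(char_feat))   # word-level LM predictions
+        word_in = torch.cat([self.word_emb(words), self.hw_tag(char_feat)], dim=-1)
+        out, _ = self.word_lstm(word_in)
+        return self.emission(out), lm_logits              # both objectives trained jointly
+```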
+
+## Installation
+
+For training, a GPU is strongly recommended for speed. CPU is supported, but training will be much slower.
+
+### PyTorch
+
+The code is based on PyTorch; you can find installation instructions [here](http://pytorch.org/).
+
+### Dependencies
+
+The code is written in Python 3.6; its dependencies are listed in ```requirements.txt```. You can install them like this:
 ```
-python train_nwc.py --checkpoint ./checkpoint/ner_ --gpu 0 --caseless --fine_tune --high_way --co_train
+pip install -r requirements.txt
 ```
-- Named Entity Recognition (NER)
+
+## Data
+
+We mainly focus on the CoNLL 2003 NER task, and the code takes its format as input.
+However, due to license restrictions, we cannot distribute this dataset.
+You should be able to get it [here](http://www.cnts.ua.ac.be/conll2003/ner/).
+You can also search GitHub, where others may have released it.
+
+### Format
+
+We assume the corpus is formatted like the CoNLL03 NER corpus.
+Specifically, empty lines are used as separators between sentences, and the separator between documents is a special line:
 ```
-python train_nwc.py --patience 15 --checkpoint ./checkpoint/w_${num}_ --gpu ${gpuid} --epoch 200 --lr 0.01 --lr_decay 0.05 --momentum 0.9 --caseless --fine_tune --mini_count 5 --char_hidden 300 --word_hidden 300 --pc_type w --high_way --co_train 2>> l_ner/out_${num}.txt
+-DOCSTART- -X- -X- -X- O
 ```
+Other lines contain words, labels, and other fields. The word must be the first field, the label must be the last, and the fields are separated by spaces.
+For example, the WSJ portion of the PTB POS tagging corpus should look like:
-- Part-of-Speech (POS) Tagging
 ```
-python train_nwc.py --train_file ./data/pos/train.txt --dev_file ./data/pos/testa.txt --test_file ./data/pos/testb.txt --eva_matrix a --checkpoint ./checkpoint/pos_ --gpu 1 --lr 0.015 --caseless --fine_tune --high_way --co_train
+-DOCSTART- -X- -X- -X- O
+
+Pierre NNP
+Vinken NNP
+, ,
+61 CD
+years NNS
+old JJ
+, ,
+will MD
+join VB
+the DT
+board NN
+as IN
+a DT
+nonexecutive JJ
+director NN
+Nov. NNP
+29 CD
+. .
+
+
+```
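+
+For illustration, a minimal reader for this format (a hypothetical sketch, not part of this repository's code) could look like:
+
+```
+# Assumes the layout above: empty lines separate sentences, -DOCSTART- lines
+# mark document boundaries, the word is the first space-separated field and
+# the label is the last.
+def read_conll(path):
+    sentences, words, labels = [], [], []
+    with open(path) as f:
+        for line in f:
+            line = line.strip()
+            if not line or line.startswith('-DOCSTART-'):
+                if words:  # close the current sentence
+                    sentences.append((words, labels))
+                    words, labels = [], []
+                continue
+            fields = line.split(' ')
+            words.append(fields[0])
+            labels.append(fields[-1])
+    if words:
+        sentences.append((words, labels))
+    return sentences
+```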
+
+## Usage
+
+Here we provide implementations of two models: LM-LSTM-CRF and its variant LSTM-CRF, which contains only the word-level structure and the CRF.
+```train_wc.py``` and ```eval_wc.py``` are the scripts for LM-LSTM-CRF, while ```train_w.py``` and ```eval_w.py``` are the scripts for LSTM-CRF.
+The usage of these scripts can be displayed with ```-h```, e.g.:
+```
+python train_wc.py -h
+```
+
+The default training commands for NER, POS tagging, and NP chunking are:
+
+- Named Entity Recognition (NER):
+```
+python train_wc.py --train_file ./data/ner/train.txt --dev_file ./data/ner/testa.txt --test_file ./data/ner/testb.txt --checkpoint ./checkpoint/ner_ --caseless --fine_tune --high_way --co_train
+```
+
+- Part-of-Speech (POS) Tagging:
+```
+python train_wc.py --train_file ./data/pos/train.txt --dev_file ./data/pos/testa.txt --test_file ./data/pos/testb.txt --eva_matrix a --checkpoint ./checkpoint/pos_ --lr 0.015 --caseless --fine_tune --high_way --co_train
 ```
-- Noun Phrase (NP) Chunking
+- Noun Phrase (NP) Chunking:
 ```
-python train_nwc.py --train_file ./data/np/train.txt.iobes --dev_file ./data/np/testa.txt.iobes --test_file ./data/np/testb.txt.iobes --checkpoint ./checkpoint/np_ --gpu 2 --caseless --fine_tune --high_way --co_train
+python train_wc.py --train_file ./data/np/train.txt.iobes --dev_file ./data/np/testa.txt.iobes --test_file ./data/np/testb.txt.iobes --checkpoint ./checkpoint/np_ --caseless --fine_tune --high_way --co_train
 ```
+
+## Benchmarks
+
+Here we compare LM-LSTM-CRF with recent sequence labeling models on the CoNLL00 chunking, CoNLL03 NER, and WSJ PTB POS tagging tasks.
+
+### NER
+
diff --git a/docs/framework.png b/docs/framework.png
new file mode 100644
index 0000000..a3b75a7
Binary files /dev/null and b/docs/framework.png differ
diff --git a/requirements.txt b/requirements.txt
index e69de29..8710d25 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1,5 @@
+Cython==0.25.2
+scipy==0.19.0
+numpy==1.12.1
+torch==0.2.0.post2
+tqdm==4.15.0
\ No newline at end of file