Code associated with the paper Pyramid: A Layered Model for Nested Named Entity Recognition at ACL 2020.
@inproceedings{jue2020pyramid,
title={Pyramid: A Layered Model for Nested Named Entity Recognition},
author={Wang, Jue and Shou, Lidan and Chen, Ke and Chen, Gang},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
pages={5918--5928},
year={2020}
}
-
Preprocess the corpora
For ACE and GENIA, we follow the script from https://github.com/yahshibu/nested-ner-tacl2020-transformers to preprocess the corpora. For NNE, we use the preprocessing script from https://github.com/nickyringland/nested_named_entities.
Each dataset is further unified in JSON and placed in "./datasets/unified/" as three files: "train.{dataset}.json", "valid.{dataset}.json", and "test.{dataset}.json".
Each JSON file consists of a list of items, and each item looks like:
{ "tokens": ["token0", "token1", "token2"], "entities": [ { "entity_type": "PER", "span": [0, 1], }, { "entity_type": "ORG", "span": [2, 3], }, ] }
-
Generate embeddings
Then we prepare the pretrained word embeddings, such as GloVe, available at https://nlp.stanford.edu/projects/glove/.
Each line of this file represents a token or word. Here is an example with a vector of length 5:
word 0.002 1.9999 4.323 4.1231 -1.2323
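A minimal sketch (assuming a GloVe-style text file; the path is a placeholder) of reading such a file into a token-to-vector dictionary:

import numpy as np

# Read a whitespace-separated embedding file: each line is a token followed by its vector.
word_vectors = {}
with open("../wv/glove.6B.100d.txt", "r", encoding="utf-8") as f:  # placeholder path
    for line in f:
        parts = line.rstrip().split(" ")
        word_vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

The dimension of these vectors should match the "token_emb_dim" argument used at training time.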
(Optional) It is also recommended to use language-model-based contextualized embeddings, such as BERT. Check "./run/gen_XXX_emb.py" for how to generate them.
-
Start training
Run the following command to start training, e.g., on ACE05:
$ python train_ner.py \
    --batch_size 32 \
    --evaluate_interval 500 \
    --dataset ACE05 \
    --pretrained_wv ../wv/PATH_TO_WV_FILE \
    --max_epoches 500 \
    --model_class PyramidNestNER \
    --model_write_ckpt ./PATH_TO_CKPT_TO_WRITE \
    --optimizer sgd \
    --lr 0.01 \
    --tag_form iob2 \
    --cased 0 \
    --token_emb_dim 100 \
    --char_emb_dim 30 \
    --char_encoder lstm \
    --lm_emb_dim 0 \
    --lm_emb_path ../wv/PATH_TO_LM_EMB_PICKLE_OBJECT \
    --tag_vocab_size 100 \
    --vocab_size 20000 \
    --dropout 0.4 \
    --max_depth 16
Log samples are placed in "./logs/"
The command-line arguments are described below.
batch_size: The batch size to use.
evaluate_interval: The evaluation interval; the model is evaluated every {evaluate_interval} training steps.
dataset: The name of the dataset to use. The dataset must be unified in JSON and placed in "./datasets/unified/" as three files: "train.{dataset}.json", "valid.{dataset}.json", and "test.{dataset}.json".
Each JSON file consists of a list of items, and each item is as follows:
{
  "tokens": ["token0", "token1", "token2"],
  "entities": [
    {
      "entity_type": "PER",
      "span": [0, 1]
    },
    {
      "entity_type": "ORG",
      "span": [2, 3]
    }
  ]
}
pretrained_wv: The pretrained word vectors file, such as GloVe.
Each line of this file represents a token or word. Here is an example with a vector of length 5:
word 0.002 1.9999 4.323 4.1231 -1.2323
max_epoches: The maximum number of training epochs.
model_class: The model class, either PyramidNestNER or BiPyramidNestNER.
model_write_ckpt: Path to write the model checkpoint. None if you don't want to save checkpoints.
optimizer: The optimizer to use, "adam" or "sgd".
lr: The learning rate, e.g., 1e-2.
tag_form: The tagging scheme. Currently only IOB2 is supported.
cased: Whether the word embeddings are cased (0 or 1). Note that character embeddings are always cased.
token_emb_dim: Word embedding dimension. This should be in line with the "pretrained_wv" file.
char_emb_dim: Character embedding dimension; 0 disables it. 30 works fine.
char_encoder: The character encoder, "lstm" or "cnn".
lm_emb_dim: Dimension of the language-model-based embeddings; 0 disables them.
lm_emb_path: Path to the language model embeddings, required if "lm_emb_dim" > 0.
This is a pickle file containing a dictionary object that maps a tuple of tokens to a numpy matrix:
{
  (t0_0, t0_1, t0_2, ..., t0_23): <numpy array of shape [24, 1024]>,
  (t1_0, t1_1, t1_2, ..., t1_16): <numpy array of shape [17, 1024]>,
  ...
}
check "./run/gen_XXX_emb.py" to know how to generate the language model embeddings.
tag_vocab_size: Maximum tag vocab size. Use a value bigger than the possible number of IOB2 tags.
vocab_size: Maximum token vocab size.
dropout: The dropout rate.
max_depth: Maximum height of the Pyramid. Bigger values give better support for longer nested entities; smaller values give quicker training and inference.
The model is defined in "./models/pyramid_nest_ner.py".
Feel free to modify and test it.