# train-conll2003: Train NER on CoNLL-2003 data

In [1]:
import os
import gatenlp
import gatenlp_ml_tner
import torch
from gatenlp import Document
from gatenlp.corpora.dirs import DirFilesCorpus
from gatenlp.visualization import CorpusViewer
print("gatenlp:", gatenlp.__version__)
print("gatenlp_ml_tner:", gatenlp_ml_tner.__version__)
print("torch:", torch.__version__)
print("CUDA devices:", os.environ.get("CUDA_VISIBLE_DEVICES", "[not set]"))

gatenlp: 1.0.8.dev3
gatenlp_ml_tner: 0.2.0.dev1
torch: 1.12.0+cu113
CUDA devices: 1


In [2]:
HOME = os.environ["HOME"]
CDIR = os.path.join(HOME, "corpora", "conll2003-gatenlp", "eng", "train")

In [3]:
corpus = DirFilesCorpus(CDIR, fmt="bdocjs", ext="bdocjs")
len(corpus)

946

In [4]:
cviewer = CorpusViewer(corpus)
cviewer.show()


## Export a training file

This can be done from the command line with `gatenlp-tner-docs2dataset`, which reads documents from a directory of GateNLP documents (such as the corpus above) and creates a single training file in CoNLL-2003 format.
Detailed usage information is shown by running `gatenlp-tner-docs2dataset --help`.

The command lets us specify the annotation set to take the annotations from (`--annset_name`),
the annotation types for sentence annotations (`--sentence_type`) and token annotations (`--token_type`), and
the list of annotation types for named entities (`--chunk_types`). The output directory must already exist.

```
gatenlp-tner-docs2dataset $HOME/corpora/conll2003-gatenlp/eng/train/ $HOME/tmp/train \
   --annset_name '' --sentence_type Sentence --token_type Token --chunk_types LOC MISC ORG PER
```

This creates the training file `train.txt` in the output directory.
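Because the output directory must already exist, it can be convenient to create it and launch the export from Python. The following is a minimal sketch, not part of `gatenlp_ml_tner`: the helper names `build_export_command` and `export_training_file` are hypothetical, and only the command-line options described above are used.

```python
import subprocess
from pathlib import Path


def build_export_command(corpus_dir, out_dir, chunk_types,
                         annset_name="", sentence_type="Sentence",
                         token_type="Token"):
    """Return the argument list for gatenlp-tner-docs2dataset (hypothetical helper)."""
    return [
        "gatenlp-tner-docs2dataset", str(corpus_dir), str(out_dir),
        "--annset_name", annset_name,
        "--sentence_type", sentence_type,
        "--token_type", token_type,
        "--chunk_types", *chunk_types,
    ]


def export_training_file(corpus_dir, out_dir, chunk_types):
    """Create the output directory if needed, then run the export command."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # the tool expects an existing directory
    cmd = build_export_command(corpus_dir, out_dir, chunk_types)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if the command fails
    return out_dir / "train.txt"
```

A call such as `export_training_file(Path.home() / "corpora" / "conll2003-gatenlp" / "eng" / "train", Path.home() / "tmp" / "train", ["LOC", "MISC", "ORG", "PER"])` would then mirror the shell command shown above.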


In [5]:
!gatenlp-tner-docs2dataset $HOME/corpora/conll2003-gatenlp/eng/train/ $HOME/tmp/train --annset_name '' --sentence_type Sentence --token_type Token --chunk_types LOC MISC ORG PER

2022-07-02 13:42:45,668|INFO|/home/johann/software/anaconda3/envs/tner/bin/gatenlp-tner-docs2dataset|Number of documents read: 946
2022-07-02 13:42:45,668|INFO|/home/johann/software/anaconda3/envs/tner/bin/gatenlp-tner-docs2dataset|Number of errors: 0


In [6]:
!head -10 $HOME/tmp/train/train.txt

MOF	B-ORG
's	O
Kubo	B-PER
says	O
believes	O
BOJ	B-ORG
rate	O
policy	O
unchanged	O
.	O
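
The exported file holds one token and its tab-separated label per line. To sanity-check it before training, it can be read back with a few lines of Python. This is a sketch under an assumption: sentences are taken to be separated by blank lines, as is usual for CoNLL-style files, which the excerpt above does not show. The function names are hypothetical.

```python
from collections import Counter


def read_conll(lines):
    """Parse token<TAB>label lines into a list of sentences.

    Assumption: a blank line marks the end of a sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if current:  # close the sentence collected so far
                sentences.append(current)
                current = []
            continue
        token, label = line.rsplit("\t", 1)  # the label is the last column
        current.append((token, label))
    if current:  # file did not end with a blank line
        sentences.append(current)
    return sentences


def label_counts(sentences):
    """Count how often each label occurs across all sentences."""
    return Counter(label for sent in sentences for _, label in sent)
```

For example, `with open(path, encoding="utf-8") as f: sentences = read_conll(f)` followed by `label_counts(sentences)` gives a quick overview of the label distribution.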


## Train a model

To train a model, the command `gatenlp-tner-train` can be used (detailed usage info with `--help`). This trains a transformer-based token classification model for chunking from the CoNLL-format training file.

To train the model, the pretrained transformer model to use as a base should be specified (see https://huggingface.co/models). We use `distilbert-base-cased` here. Note that a model that has already been pretrained for your NER task may be better to start from. The model will get downloaded from the Huggingface servers and cached locally. 

Note: currently the model directory must be given as the name of a directory in the current working directory, not as a full path!

```
gatenlp-tner-train $HOME/tmp/train model --transformers_model distilbert-base-cased 
```

IMPORTANT: training a good model usually requires more experimentation, evaluation, hyperparameter search and
more. This is outside of the scope of this example.
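Since the model directory has to be a plain name in the current working directory, one way to control where the model ends up is to set the working directory when launching the command. The sketch below does this from Python. The helpers `build_train_command` and `train` are hypothetical, and only the `--transformers_model` option shown above is used.

```python
import subprocess
from pathlib import Path


def build_train_command(dataset_dir, model_name, transformers_model):
    """Return the argument list for gatenlp-tner-train (hypothetical helper)."""
    # Reject anything that is not a bare directory name, e.g. "/tmp/model" or "a/b".
    if not model_name or Path(model_name).name != model_name:
        raise ValueError("model_name must be a plain directory name, not a path")
    return ["gatenlp-tner-train", str(dataset_dir), model_name,
            "--transformers_model", transformers_model]


def train(dataset_dir, model_name, transformers_model, workdir="."):
    """Run training with workdir as the current directory of the command."""
    # Resolve first so a relative dataset path still works after changing cwd.
    dataset_dir = Path(dataset_dir).resolve()
    cmd = build_train_command(dataset_dir, model_name, transformers_model)
    subprocess.run(cmd, cwd=workdir, check=True)
    return Path(workdir) / model_name
```

With this, `train(Path.home() / "tmp" / "train", "model", "distilbert-base-cased")` corresponds to the command shown above, run from the current directory.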

In [7]:
!gatenlp-tner-train $HOME/tmp/train model --transformers_model distilbert-base-cased 

2022-07-02 13:42:47 INFO     *** initialize network ***
2022-07-02 13:42:47 INFO     create new checkpoint
2022-07-02 13:42:47 INFO     removed incomplete checkpoint model
2022-07-02 13:42:47 INFO     checkpoint: model
2022-07-02 13:42:47 INFO      - [arg] dataset: /home/johann/tmp/train
2022-07-02 13:42:47 INFO      - [arg] transformers_model: distilbert-base-cased
2022-07-02 13:42:47 INFO      - [arg] random_seed: 42
2022-07-02 13:42:47 INFO      - [arg] lr: 2e-05
2022-07-02 13:42:47 INFO      - [arg] total_step: 5000
2022-07-02 13:42:47 INFO      - [arg] warmup_step: 700
2022-07-02 13:42:47 INFO      - [arg] weight_decay: 1e-07
2022-07-02 13:42:47 INFO      - [arg] batch_size: 16
2022-07-02 13:42:47 INFO      - [arg] max_seq_length: 128
2022-07-02 13:42:47 INFO      - [arg] fp16: False
2022-07-02 13:42:47 INFO      - [arg] max_grad_norm: 1.0
2022-07-02 13:42:47 INFO      - [arg] lower_case: False
2022-07-02 13:42:47 INFO     Initialized trainer, running training ... 
2022-07-02 13:4

## Use the model

Once the model has been trained, it can be used to annotate new documents. 

Important: since this model is trained on sentences, the document we want to annotate also needs to contain sentence annotations, or be short enough to be used as a whole!
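For longer documents that lack sentence annotations, these have to be added before the annotator runs. The sketch below uses a deliberately naive regular-expression splitter as a stand-in for a real sentence splitter, and adds the resulting spans through `doc.annset(...).add(start, end, type)`. The function names are hypothetical, and the splitter will mis-split abbreviations such as "U.S.".

```python
import re


def naive_sentence_spans(text):
    """Return (start, end) offsets of sentences.

    Naive rule: a sentence ends at ., ! or ? followed by whitespace,
    or at the end of the text.
    """
    pattern = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", flags=re.DOTALL)
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


def add_sentence_annotations(doc, annset_name="", sentence_type="Sentence"):
    """Add one annotation of type sentence_type per detected sentence."""
    annset = doc.annset(annset_name)
    for start, end in naive_sentence_spans(doc.text):
        annset.add(start, end, sentence_type)
    return doc
```

The annotator would then presumably be created with `sentence_type="Sentence"` instead of `None`, so that it processes each sentence span separately. This is an assumption based on the parameter name.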

In [8]:
from gatenlp_ml_tner.annotators import TnerTokenClassificationAnnotator

In [9]:
# For our short test documents, we do not need to do Sentence splitting first.
anntr = TnerTokenClassificationAnnotator("model", annset_name="", outset_name="Entities", sentence_type=None)

2022-07-02 13:47:09,857|INFO|root|*** initialize network ***


In [10]:
doc = Document("Washington DC and Washington state were named after George Washington.")
doc = anntr(doc)
doc
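
To inspect the result as plain data rather than through the document viewer, the annotations in the output set can be listed directly. This is a minimal sketch with a hypothetical helper name. It relies only on `doc.annset(name)`, `doc.text`, and the `type`, `start` and `end` attributes of annotations. Which entity types appear depends on the trained model.

```python
def list_entities(doc, outset_name="Entities"):
    """Return (type, start, end, text) for each annotation in the output set."""
    rows = []
    for ann in doc.annset(outset_name):
        covered = doc.text[ann.start:ann.end]  # text span the annotation covers
        rows.append((ann.type, ann.start, ann.end, covered))
    return rows
```

For example, `for row in list_entities(doc): print(row)` prints one line per entity that was found.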