# How to train a NER model with Flair

In this notebook, you will find all the code you need to train or fine-tune a language model and a NER model with the flairNLP framework.

Note that this task is very computation-heavy and you will need a machine (server or local) with a CUDA-enabled video card.
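
A quick way to check whether a CUDA device is visible is sketched below. It assumes that PyTorch (which Flair builds on) may or may not be installed yet, so the check is guarded; the helper name `cuda_available` is made up for this notebook.

```python
import importlib.util


def cuda_available() -> bool:
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    # find_spec avoids an ImportError when torch has not been installed yet
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


print("CUDA available:", cuda_available())
```

If this prints `False` after Flair has been installed, training will most likely fall back to the CPU and be very slow.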

Another note: you can always abort the execution of a cell or script to stop a model training. Flair will then quit and save the best current model for you.

## Training a Language Model

In [None]:
# Make sure that flair is installed 
%pip install -U flair

In [None]:
# set the path to your project directory here (change this!)
root = "./home/project/"

### Training Data
You will need a good amount of training data.
The data can simply be provided in txt files,
and you may get a better model if you add segmentation markers such as `[START]`, `[END]` and/or `[SEP]` to separate sample documents or sentences.

For this example, we assume our training data is stored in the project folder in a folder `lm_data`. Inside this folder, Flair requires two files, `test.txt` and `valid.txt`, and one folder, `train/`. The folder `train/` can contain any number of files with arbitrary names. The idea here is that training data for modern languages is usually so large that it is stored in multiple files instead of a single one. If you have only a small amount of data, an 80/10/10 split (train/validation/test) is usually the way to go, but 70/15/15 may also work for you.
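
The folder layout described above could be produced with a small helper like the following. This is only a sketch: it assumes your raw data is a list of strings with one document per entry, and the function name `split_corpus` and its parameters are made up for this notebook. The demo writes to a temporary folder; point `out_dir` at your own `lm_data` folder for real use.

```python
import random
import tempfile
from pathlib import Path


def split_corpus(documents, out_dir, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle documents and write valid.txt, test.txt and train/train.txt."""
    documents = list(documents)
    random.Random(seed).shuffle(documents)  # seeded so the split is reproducible

    n_valid = int(len(documents) * valid_frac)
    n_test = int(len(documents) * test_frac)

    out = Path(out_dir)
    (out / "train").mkdir(parents=True, exist_ok=True)

    parts = {
        out / "valid.txt": documents[:n_valid],
        out / "test.txt": documents[n_valid:n_valid + n_test],
        out / "train" / "train.txt": documents[n_valid + n_test:],
    }
    for path, chunk in parts.items():
        path.write_text("\n".join(chunk) + "\n", encoding="utf-8")

    # report how many documents ended up in each file
    return {path.name: len(chunk) for path, chunk in parts.items()}


# Demo with 100 dummy documents in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    counts = split_corpus([f"document {i}" for i in range(100)], tmp)
    train_file_exists = (Path(tmp) / "train" / "train.txt").exists()
    print(counts)
```

For very large corpora you would write several files into `train/` instead of a single `train.txt`.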

#### Generating the character dictionary
You will need to create a custom dictionary for your dataset. Any character not included in this dictionary will be considered "unknown" by Flair. This doesn't break anything, but performance may decrease. If you don't have huge amounts of data, simply generate the dictionary from all files in your lm_data folder.

In [None]:
files = [
    root + "lm_data/test.txt",
    root + "lm_data/valid.txt",
    root + "lm_data/train/train.txt",
]

# we save the character dictionary simply to the project folder
out = root + 'char_dict.pkl'

In [None]:
# make an empty character dictionary
from flair.data import Dictionary
char_dictionary: Dictionary = Dictionary()

# counter object
import collections
counter = collections.Counter()

processed = 0

for file in files:
    print(file)

    with open(file, 'r', encoding='utf-8') as f:
        tokens = 0
        actual_tokens = 0
        for line in f:

            processed += 1
            chars = list(line)
            tokens += len(chars)

            # Add chars to the dictionary
            counter.update(chars)

            # uncomment the next line to speed things up (if the corpus is too large)
            # if tokens > 50000000: break

            tc = len([t for t in line.split() if t not in {"[START]", "[END]", "[SEP]"}])

            actual_tokens += tc

        print(actual_tokens)

    # break

total_count = 0
for letter, count in counter.most_common():
    total_count += count

print(total_count)
print(processed)

# running total of character counts (named so that it does not shadow the built-in sum)
running_total = 0
idx = 0
for letter, count in counter.most_common():
    running_total += count
    percentile = (running_total / total_count)

    # uncomment the next line to use only top X percentile of chars, otherwise filter later
    # if percentile < 0.00001: break

    char_dictionary.add_item(letter)
    idx += 1
    print('%d\t%s\t%7d\t%7d\t%f' % (idx, letter, count, running_total, percentile))

print(char_dictionary.item2idx)

import pickle
with open(out, 'wb') as f:
    mappings = {
        'idx2item': char_dictionary.idx2item,
        'item2idx': char_dictionary.item2idx
    }
    pickle.dump(mappings, f)
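
To see which characters of a new text would be treated as "unknown", you can read the saved mappings back and compare them against the text. The sketch below uses only the standard library. It assumes that the keys of `item2idx` are stored as UTF-8 encoded bytes (plain strings are handled as well), and the helper names `load_known_chars` and `unknown_chars` are made up for this notebook. The demo uses a tiny stand-in file rather than your real `char_dict.pkl`.

```python
import pickle
import tempfile
from pathlib import Path


def load_known_chars(path):
    """Return the set of items stored in a pickled character dictionary."""
    with open(path, "rb") as f:
        mappings = pickle.load(f)
    known = set()
    for item in mappings["item2idx"]:
        # assumption: items may be stored as bytes; decode them for comparison
        known.add(item.decode("utf-8") if isinstance(item, bytes) else item)
    return known


def unknown_chars(text, known):
    """Return the characters of text that are not in the dictionary."""
    return set(text) - known


# Demo on a tiny stand-in dictionary containing only 'a' and 'b'
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "char_dict.pkl"
    items = [b"<unk>", b"a", b"b"]
    with open(demo_path, "wb") as f:
        pickle.dump(
            {"idx2item": items, "item2idx": {it: i for i, it in enumerate(items)}}, f
        )
    known = load_known_chars(demo_path)
    missing = unknown_chars("abc", known)
    print(sorted(missing))
```

On your own data, you would pass `root + 'char_dict.pkl'` and the text you want to check.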

Finally, we move on to the actual training. We start with the case without a base model, i.e. training completely from scratch.
This trains a contextual character-level language model, and you should run it twice: once for the forward and once for the backward direction. Don't forget to change the parameters (`is_forward_lm` and the output path) to do so!

The training will run until the number of epochs is reached or until no improvements can be detected anymore.

In [None]:
#!/usr/bin/env python3
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# True => forward, False => backward
is_forward_lm = True

# we load our character dictionary that we created (change the path if necessary)
dictionary: Dictionary = Dictionary.load_from_file(root + 'char_dict.pkl')

# create our corpus object (change the path if necessary)
corpus = TextCorpus(root + 'lm_data/',
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# instantiate your language model, set hidden size and number of layers (you can leave it as is)
language_model = LanguageModel(dictionary,
                            is_forward_lm,
                            hidden_size=2048,
                            nlayers=1)

# prepare the trainer
trainer = LanguageModelTrainer(language_model, corpus)

# actually run the trainer (change the output path if necessary)
trainer.train(root + 'lm_forward/',
            sequence_length=250,
            mini_batch_size=100,
            max_epochs=50)
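
Since the training has to be run once per direction, it can be convenient to wrap the cell above in a function. The following is a sketch that reuses the same calls and settings as above; the names `lm_output_dir` and `train_lm` are made up for this notebook, and the `lm_backward/` folder name is an assumption that mirrors `lm_forward/`. The Flair imports sit inside the function so that the cell itself runs quickly and only the actual call starts a training.

```python
def lm_output_dir(root: str, is_forward_lm: bool) -> str:
    """Build the output folder for one direction, e.g. <root>lm_forward/."""
    direction = "forward" if is_forward_lm else "backward"
    return f"{root}lm_{direction}/"


def train_lm(root: str, is_forward_lm: bool, max_epochs: int = 50) -> None:
    """Train a character-level language model for one direction."""
    from flair.data import Dictionary
    from flair.models import LanguageModel
    from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

    dictionary = Dictionary.load_from_file(root + "char_dict.pkl")
    corpus = TextCorpus(root + "lm_data/", dictionary, is_forward_lm, character_level=True)
    language_model = LanguageModel(dictionary, is_forward_lm, hidden_size=2048, nlayers=1)

    trainer = LanguageModelTrainer(language_model, corpus)
    trainer.train(
        lm_output_dir(root, is_forward_lm),
        sequence_length=250,
        mini_batch_size=100,
        max_epochs=max_epochs,
    )


# Example call (commented out because it starts two long training runs):
# for is_forward in (True, False):
#     train_lm(root, is_forward)
```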

### Fine-tuning a Language Model
Especially if you don't have huge amounts of training data, you may prefer fine-tuning your model on an already existing model instead.

You can find base models in the [tutorial](https://flairnlp.github.io/docs/tutorial-embeddings/flair-embeddings) or on Huggingface.

In [None]:
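from flair.data import Dictionary
from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# fetch the base model, here 'multi-forward-fast'
language_model = FlairEmbeddings('multi-forward-fast', has_decoder=True).lm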

# we can copy the forward/backward setting from the base model
is_forward_lm = language_model.is_forward_lm

# we can use the character dictionary from the base model
dictionary: Dictionary = language_model.dictionary

# set up the corpus object
corpus = TextCorpus(root + 'lm_data/',
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# prepare the trainer with the pre-trained language model
trainer = LanguageModelTrainer(language_model, corpus)

# finetune your language model
trainer.train(root + 'lm_forward_finetuned/',
            sequence_length=250,
            mini_batch_size=100,
            learning_rate=20,
            patience=10,
            checkpoint=False)

## Training a NER model
We will now train a model using the embeddings trained above. You can of course use others as well: Flair lets you stack many different embedding types, including transformer embeddings such as BERT. You can find publicly available embeddings in the Flair [tutorial](https://flairnlp.github.io/docs/tutorial-embeddings/flair-embeddings) but also on Huggingface.

For this model, make sure you have three files containing your training, validation and test splits in IOB-annotated format.
In this example, we assume one column of words, one column of short NE tags and one column of longer NE tags.
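
To make the expected layout concrete, here is a sketch of such a three-column file together with a small reader that checks it. It assumes whitespace-separated columns and an empty line between sentences; the tag names (`B-PER`, `B-PERSON`, ...) and the helper `read_column_file` are invented for illustration and are not taken from a real dataset.

```python
import tempfile
from pathlib import Path

# word, short NE tag, long NE tag (tag names are made up for this example)
SAMPLE = (
    "Galcherus B-PER B-PERSON\n"
    "Brideine I-PER I-PERSON\n"
    ", O O\n"
    "miles O O\n"
    "\n"
    "Actum O O\n"
    "anno O O\n"
    "Domini O O\n"
)


def read_column_file(path, n_columns=3):
    """Read a column file into a list of sentences, each a list of tuples."""
    sentences, current = [], []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) != n_columns:
            raise ValueError(
                f"line {line_no}: expected {n_columns} columns, got {len(fields)}"
            )
        current.append(tuple(fields))
    if current:
        sentences.append(current)
    return sentences


# Demo: write the sample to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "train.txt"
    sample_path.write_text(SAMPLE, encoding="utf-8")
    sentences = read_column_file(sample_path)
    print(len(sentences), sentences[0][0])
```

Running a check like this over your three split files before training can catch rows with missing columns early.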

Flair also explains how to train your own model in their [tutorial](https://flairnlp.github.io/docs/tutorial-training/how-model-training-works). Among other things, they also [explain](https://flairnlp.github.io/docs/tutorial-training/how-to-train-sequence-tagger) there how to fine-tune BERT embeddings to annotate NER.

In [None]:
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

columns = {0: 'text', 1: 'ner', 2: 'ner_long'}

# set the path to the folder where train, validation and test files are in
data_folder = root + "ner_data/"

# initialize the corpus object with the paths to train, test and validation (dev)
corpus: ColumnCorpus = ColumnCorpus(data_folder, columns,
                            train_file='train.txt',
                            test_file='test.txt',
                            dev_file='dev.txt')

# If you have multiple label types, choose the labels you want to predict (from the names defined above in columns)
label_type = 'ner'

# Create a dictionary with all labels in your NE-annotation
label_dict = corpus.make_label_dictionary(label_type=label_type, add_dev_test=True)
print(label_dict)

# Put together the embeddings you want to use
embedding_types = [
    # you can add more embeddings here!
    FlairEmbeddings(root + "lm_forward_finetuned/best-lm.pt"),
    FlairEmbeddings(root + "lm_backward_finetuned/best-lm.pt"),
]

# Combine the embeddings
embeddings = StackedEmbeddings(embeddings=embedding_types)

# Initialize our sequence tagger
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=label_dict,
                        tag_type=label_type,
                        tag_format="BIO"  # depends on your data
                        )

# Initialize the trainer
trainer = ModelTrainer(tagger, corpus)

# Start training (set the out path!)
trainer.train(root + "ner_model/",
            learning_rate=0.1,
            mini_batch_size=32,  # depending on data and gpu, reduce to 16, 8, 4, 2 or 1
            max_epochs=100)  # I chose many epochs here due to the small dataset I used for testing

Now let's test our model!

In [None]:
# import the necessary parts of the library
from flair.data import Sentence
from flair.nn import Classifier

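# load the classifier from a local source by giving the path to the saved model file
# (the trainer typically writes final-model.pt, and best-model.pt when a dev set is used)
tagger = Classifier.load(root + "ner_model/final-model.pt")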

# create a sentence object that will be predicted
sentence = Sentence("Galcherus Brideine , miles , dedit nobis decem solidos super aquam suam de Eschemilliaco ( b ) . Ego Galcherus Brideine , miles , notum facio universis presentes litteras inspecturis quod , cum bone memorie defunctus Ansellus Brideine , pater meus , in vita sua dedisset in perpetuam elemosinam pro salute anime sue ecclesie et fratribus Pontiniaci viginti solidos tur . percipiendos singulis annis in festo sancti Remigii in redditibus suis de Deveisel ( c ) , quos ipse emerat a domino Milone de Lynieres , ego , quem medietas dicte elemosine contingebat , assedi dictis Pontiniacensibus decem solidos tur . pro ( Ii ) parte mea super aquam meam de Eschemiliaco , volens et concedens ut quicumque predictam aquam de cetero tenuerit , predictos decem solidos singulis annis in octavis sancti Andree apostoli reddere dictis fratribus sine contradictione et diminutione imperpetuum ( 6 ) teneatur . Quod ut ratum et stabile permaneat in futurum , sigillum meum in testimonium et munimen duxi litteris presentibus apponendum . Actum anno Domini M ° cc ° LO VIIlo , mense aprili .")

# predict the annotations
tagger.predict(sentence)

# show the annotations
for label in sentence.get_labels():
    print(label)

#### Fine-tuning a NER model
If you want to fine-tune a NER model, you can follow the same principle as for the LM: instead of initializing a new tagger, we load a base model, while the corpus and everything else is set up as usual. Note that this means you cannot define your own embeddings.

In [None]:
from flair.nn import Classifier
from flair.trainers import ModelTrainer

# initialize the corpus like when training from scratch

# load the base model
tagger = Classifier.load("ner-fast")

# Initialize the trainer
trainer = ModelTrainer(tagger, corpus)

# Start training (set the out path!)
trainer.fine_tune(root + "ner_model_finetuned/")