This code is based on the Flair code example. [Link](https://flairnlp.github.io/docs/tutorial-training/how-to-train-text-classifier)  

## Prepare dataset
The tutorial is [here](https://flairnlp.github.io/docs/tutorial-training/how-to-load-custom-dataset).  
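
For sequence labelling, Flair reads column-format text files: one token per line followed by its tag, with a blank line between sentences. The sketch below shows how OS full names could be turned into that layout. The `VER` label, the helper names, and the two example strings are assumptions made for illustration, not part of any existing dataset.

```python
from typing import List, Tuple

# (OS full name, version substring): a hypothetical input layout
Example = Tuple[str, str]


def tag_tokens(fullname: str, version: str) -> List[Tuple[str, str]]:
    """Whitespace-tokenize the full name and mark the version tokens with BIO tags."""
    tokens = fullname.split()
    version_tokens = version.split()
    tags = ["O"] * len(tokens)
    n = len(version_tokens)
    # Find the first position where the version tokens occur in the full name.
    for start in range(len(tokens) - n + 1):
        if n and tokens[start:start + n] == version_tokens:
            tags[start] = "B-VER"
            for i in range(start + 1, start + n):
                tags[i] = "I-VER"
            break
    return list(zip(tokens, tags))


def to_column_format(examples: List[Example]) -> str:
    """Render examples as 'token tag' lines, one blank line between sentences."""
    blocks = []
    for fullname, version in examples:
        lines = [f"{token} {tag}" for token, tag in tag_tokens(fullname, version)]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


examples = [
    ("Ubuntu 22.04 LTS", "22.04"),
    ("Windows Server 2019 Datacenter", "2019"),
]
print(to_column_format(examples))
```

Writing the returned string to a training file gives one sentence per OS name. Exact-match lookup is the simplest possible labelling rule; real data may need a more tolerant tokenizer.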

# Training models for predicting OS version from an OS fullname
Train an NER model using Flair. Refer to this code tutorial: https://huggingface.co/flair/ner-english

## How to use NER in Flair

In [1]:
from flair.data import Sentence
from flair.models import SequenceTagger

In [2]:
# Load the tagger.
# The first time this line runs, it takes a long time to finish
# because the pre-trained model is being downloaded.
tagger = SequenceTagger.load("flair/ner-english")

2024-04-17 19:59:53,156 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


In [3]:
# make example sentence
sentence = Sentence("George Washington went to Washington")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

Sentence[5]: "George Washington went to Washington" → ["George Washington"/PER, "Washington"/LOC]
The following NER tags are found:
Span[0:2]: "George Washington" → PER (0.9985)
Span[4:5]: "Washington" → LOC (0.9706)
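
The tag dictionary printed when the tagger loads uses the BIOES scheme: `S-` marks a single-token entity, `B-`, `I-` and `E-` mark the beginning, inside and end of a multi-token entity, and `O` marks tokens outside any entity. The sketch below shows how such a tag sequence maps to spans like the ones printed above. It is a simplified illustration written for this notebook, not Flair's own decoding code, and the tag list is hand-written rather than produced by the model.

```python
from typing import List, Optional, Tuple


def bioes_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) for each well-formed BIOES entity."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        if prefix == "S":
            # Single-token entity.
            spans.append((i, i + 1, name))
            start, label = None, None
        elif prefix == "B":
            # Open a new multi-token entity.
            start, label = i, name
        elif prefix == "I" and start is not None and name == label:
            # Still inside the open entity.
            continue
        elif prefix == "E" and start is not None and name == label:
            # Close the open entity.
            spans.append((start, i + 1, name))
            start, label = None, None
        else:
            # "O", special tags, or a malformed sequence: drop any open entity.
            start, label = None, None
    return spans


# Hand-written tags for "George Washington went to Washington"
tags = ["B-PER", "E-PER", "O", "O", "S-LOC"]
print(bioes_to_spans(tags))  # [(0, 2, 'PER'), (4, 5, 'LOC')]
```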


## Train a BERT-based model for NER

In [5]:
from flair.data import Corpus
from flair.datasets import CSVClassificationCorpus
from flair.datasets import CONLL_03
from flair.embeddings import TransformerDocumentEmbeddings
# from flair.embeddings import WordEmbeddings, StackedEmbeddings, FlairEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# from pytorch_pretrained_bert import BertModel


Check whether a GPU is available, to use it instead of the CPU:  
On this machine the answer is a cruel "False", so training will run on the CPU.

In [3]:
import torch
print("cuda is_available:")
print(torch.cuda.is_available())

cuda is_available:
False
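
A small helper can turn that check into a device name. This is a minimal sketch: `pick_device` is a hypothetical helper name, and the `torch` import is guarded so the snippet also runs where PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query, so fall back to the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device_name = pick_device()
print(f"Training would run on: {device_name}")
```

Flair picks its device on its own through the module-level `flair.device` setting, so a helper like this is mostly useful for logging or for deciding on a smaller model and batch size when only the CPU is available.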


### 1 Get the corpus, load dataset

In [None]:
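# Placeholder path: point base_path at the folder that holds the CoNLL-03 files.
# Flair does not download this corpus automatically; it has to be obtained manually.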
corpus: Corpus = CONLL_03(base_path='path/to/conll_data')
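
For the OS-version task, a custom column-format dataset would take the place of CoNLL-03. The sketch below uses Flair's `ColumnCorpus` for that. The file names `train.txt`, `dev.txt` and `test.txt`, the two-column layout, and the helper name `load_os_corpus` are assumptions; adjust them to the actual dataset.

```python
from pathlib import Path

# Assumed layout of each file: token in column 0, NER tag in column 1.
COLUMNS = {0: "text", 1: "ner"}


def load_os_corpus(data_folder: str):
    """Load a column-format corpus from data_folder using Flair."""
    # Imported lazily so that defining this helper does not require flair.
    from flair.datasets import ColumnCorpus

    return ColumnCorpus(
        Path(data_folder),
        COLUMNS,
        train_file="train.txt",  # hypothetical file names
        dev_file="dev.txt",
        test_file="test.txt",
    )
```

Calling `load_os_corpus('path/to/os_data')` would then return a corpus that can be passed to the later steps in place of the CoNLL-03 one.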

### 2 Define BERT embedding

In [7]:
# Use a pre-trained transformer from the Hugging Face hub.
# The first time this runs, the model is downloaded to ~/.cache/huggingface/hub.
# SequenceTagger needs one embedding per token, so TransformerWordEmbeddings is used here;
# TransformerDocumentEmbeddings yields a single vector per sentence and is meant for classifiers.
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings('distilbert-base-uncased', fine_tune=False)

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

### 3 Create BERT-based sequence tagger

In [None]:
# make_label_dictionary replaces the deprecated make_tag_dictionary
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                         embeddings=embeddings,
                                         tag_dictionary=corpus.make_label_dictionary(label_type='ner'),
                                         tag_type='ner',
                                         use_crf=True)

### 4 Initialize trainer

In [None]:
trainer = ModelTrainer(tagger, corpus)

### 5 Burn ~~GPU~~ CPU

In [None]:
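# The transformer is frozen (fine_tune=False), so train() with SGD is used,
# where learning_rate=0.1 is a typical value. fine_tune() uses AdamW and
# expects a much smaller learning rate, such as 5e-5.
trainer.train('path/to/save/model',
              learning_rate=0.1,
              mini_batch_size=8,
              max_epochs=1)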
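
On a CPU it helps to estimate the amount of work before starting. The sketch below counts the optimizer steps implied by `mini_batch_size` and `max_epochs`. The sentence count of 1000 is a made-up number for illustration; replace it with the size of the actual training split.

```python
import math


def optimizer_steps(num_sentences: int, mini_batch_size: int, max_epochs: int) -> int:
    """Number of mini-batches processed over the whole training run."""
    steps_per_epoch = math.ceil(num_sentences / mini_batch_size)
    return steps_per_epoch * max_epochs


# Hypothetical training set of 1000 sentences with the settings used above
print(optimizer_steps(num_sentences=1000, mini_batch_size=8, max_epochs=1))  # 125
```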