# Script to train an NER model

This script uses ThirdAI's NER library to train a model from scratch on an NER dataset. For this demonstration, we use the CoNLL-2003 dataset (`https://huggingface.co/datasets/conll2003`). We also show how to load a pre-trained ThirdAI NER model and fine-tune it further, provided the labels remain the same.

In [None]:
!pip3 install thirdai==0.8.1
!pip3 install datasets

Import necessary libraries

In [None]:
import json
from datasets import load_dataset
from thirdai import bolt, dataset

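Constants: map from tag name to integer label. `entries` lists the tags in label order, so `entries[i]` is the tag for label `i`.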

In [None]:
TAG_MAP = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
    "B-MISC": 7,
    "I-MISC": 8,
}

entries = list(TAG_MAP.keys())
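
`entries` gives the right tag for each label only because the keys of `TAG_MAP` happen to be written in label order. As a minimal sketch, an explicit inverse lookup stays correct even if the dictionary is later reordered. `ID_TO_TAG` is a hypothetical name that is not used elsewhere in this notebook.

```python
# Same mapping as the TAG_MAP cell above, repeated so this snippet runs on its own.
TAG_MAP = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
    "B-MISC": 7,
    "I-MISC": 8,
}

# Invert the mapping explicitly instead of relying on key insertion order.
ID_TO_TAG = {label_id: tag for tag, label_id in TAG_MAP.items()}

# If two tags shared a label id, the inverse would silently lose one of them.
if len(ID_TO_TAG) != len(TAG_MAP):
    raise ValueError("TAG_MAP contains duplicate label ids")

print(ID_TO_TAG[5])
```

With this in place, `ID_TO_TAG[tag]` could stand in for `entries[tag]` wherever integer labels are converted back to tag names.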

Functions to load the dataset from Hugging Face and save the training data as a JSONL file.
A `.txt` file works too, as long as each line contains a JSON object of the form:

`{"source": [List of Text Tokens], "target": [List of Corresponding Text Tags]}`.

Note: Every tag in the file must be a key of `TAG_MAP`.
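
The format rules above can be checked before training. The following is a minimal sketch in plain Python, not part of ThirdAI's library: `find_invalid_lines` and `ALLOWED_TAGS` are hypothetical names, and the sample file is made up for the demonstration.

```python
import json
import os
import tempfile

# Tags from the TAG_MAP cell, repeated so this snippet runs on its own.
ALLOWED_TAGS = [
    "O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]


def find_invalid_lines(path, allowed_tags, source_key="source", target_key="target"):
    """Return (line_number, reason) pairs for lines that break the expected format."""
    allowed = list(allowed_tags)
    problems = []
    with open(path) as file:
        for line_number, line in enumerate(file, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((line_number, "not valid JSON"))
                continue
            is_object = isinstance(record, dict)
            tokens = record.get(source_key) if is_object else None
            tags = record.get(target_key) if is_object else None
            if not isinstance(tokens, list) or not isinstance(tags, list):
                problems.append((line_number, "missing source or target list"))
            elif len(tokens) != len(tags):
                problems.append((line_number, "token and tag counts differ"))
            elif any(tag not in allowed for tag in tags):
                problems.append((line_number, "tag outside the allowed set"))
    return problems


# Small demonstration file: one good line followed by two bad ones.
sample_lines = [
    {"source": ["John", "Doe"], "target": ["B-PER", "I-PER"]},
    {"source": ["Paris"], "target": ["B-LOC", "O"]},
    {"source": ["Paris"], "target": ["B-CITY"]},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w") as file:
        for record in sample_lines:
            file.write(json.dumps(record) + "\n")
    problems = find_invalid_lines(sample_path, ALLOWED_TAGS)

print(problems)
```

An empty result means every line has equally long token and tag lists and uses only known tags.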

In [None]:
def save_dataset_as_jsonl(filename, loaded_data):
    with open(filename, "w") as file:
        for example in loaded_data:
            data = {
                "source": example["tokens"],
                "target": [entries[tag] for tag in example["ner_tags"]],
            }
            file.write(json.dumps(data) + "\n")


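def download_dataset_as_file(subset):
    # Load CoNLL-2003 and pick the requested split, e.g. "train" or "validation".
    # The local name avoids shadowing the `dataset` module imported from thirdai.
    conll_data = load_dataset("conll2003")
    loaded_data = conll_data[subset]
    filename = f"{subset}_ner_data.jsonl"
    save_dataset_as_jsonl(filename, loaded_data)
    return filename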

Initialize a Bolt NER model given the column names and TAG_MAP.

In [None]:
ner_model = bolt.NER("source", "target", TAG_MAP)

Use ThirdAI's `dataset` module to load the training file into a `NerDataSource` and pass it to the `train` function.

In [None]:
train_data_source = dataset.NerDataSource(download_dataset_as_file("train"))

ner_model.train(
    train_data=train_data_source,
    epochs=1,
    learning_rate=0.001,
    batch_size=1024,
    train_metrics=["loss"],
)

If you have validation data, you can pass it to the `train` function as well:

In [None]:
train_data_source = dataset.NerDataSource(download_dataset_as_file("train"))
val_data_source = dataset.NerDataSource(download_dataset_as_file("validation"))

ner_model.train(
    train_data=train_data_source,
    epochs=2,
    learning_rate=0.001,
    batch_size=1024,
    train_metrics=["loss"],
    val_data=val_data_source,
    val_metrics=["loss"]
)

Save the model, then load it again.

In [None]:
ner_model.save("thirdai_ner_model")
ner_model = bolt.NER.load("thirdai_ner_model")

Evaluation on Test Data

In [None]:
test_data = load_dataset("conll2003")["test"]
predictions = []
actuals = []

for example in test_data:
    tokens = example["tokens"]
    actual_tags = [entries[tag] for tag in example["ner_tags"]]

    # Predict tags for this sentence (passed as a batch of one)
    predicted_tags = ner_model.get_ner_tags([tokens])[0]

    predictions.extend(predicted_tags)
    actuals.extend(actual_tags)

# Compare the first tag predicted for each token, p[0][0], with the gold tag
correct_predictions = sum(p[0][0] == a for p, a in zip(predictions, actuals))
total_predictions = len(predictions)
accuracy = correct_predictions / total_predictions
print(f"Accuracy: {accuracy * 100:.2f}%")
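
The accuracy above counts every token, including the many `O` tokens, so it can look high even when entities are missed. As a rough complement, the sketch below computes token-level precision, recall and F1 per tag. It is plain Python, not a ThirdAI API, and `per_tag_scores` is a hypothetical helper. It assumes two flat lists of tag strings, which for the cell above could be built as `[p[0][0] for p in predictions]` and `actuals`.

```python
from collections import Counter


def per_tag_scores(predicted, actual, ignore=("O",)):
    """Token-level precision, recall and F1 for each tag not listed in `ignore`."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for pred_tag, gold_tag in zip(predicted, actual):
        if pred_tag == gold_tag:
            true_pos[gold_tag] += 1
        else:
            false_pos[pred_tag] += 1  # predicted this tag, but it was wrong
            false_neg[gold_tag] += 1  # missed the gold tag

    scores = {}
    for tag in sorted(set(true_pos) | set(false_pos) | set(false_neg)):
        if tag in ignore:
            continue
        predicted_count = true_pos[tag] + false_pos[tag]
        gold_count = true_pos[tag] + false_neg[tag]
        precision = true_pos[tag] / predicted_count if predicted_count else 0.0
        recall = true_pos[tag] / gold_count if gold_count else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Made-up example: the location is missed, yet 4 of 5 tokens are still "correct".
example_predicted = ["B-PER", "I-PER", "O", "O", "O"]
example_actual = ["B-PER", "I-PER", "O", "O", "B-LOC"]
example_scores = per_tag_scores(example_predicted, example_actual)

for tag, values in example_scores.items():
    print(tag, values)
```

In the made-up example the plain accuracy would be 80%, while the recall for `B-LOC` is zero. Note that this is still scored per token; entity-level scoring, where a multi-token entity counts only if all its tokens match, is stricter.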

You may want to fine-tune an already trained model further on data that uses only a subset of the tags. Here we create a small retraining dataset and save it to a file.

In [None]:
import json

# Sample sentences with corresponding NER tags
sentences = [
    ("John Doe went to Paris", ["B-PER", "I-PER", "O", "O", "B-LOC"]),
    (
        "Alice and Bob are from New York City",
        ["B-PER", "O", "B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC"],
    ),
    ("The Eiffel Tower is in France", ["O", "B-LOC", "I-LOC", "O", "O", "B-LOC"]),
    ("Microsoft Corporation was founded by Bill Gates", ["B-ORG", "I-ORG", "O", "O", "O", "B-PER", "I-PER"]),
    ("She visited the Louvre Museum in Paris last summer", ["O", "O", "O", "B-LOC", "I-LOC", "O", "B-LOC", "O", "O"]),
    ("Google and IBM are big tech companies", ["B-ORG", "O", "B-ORG", "O", "O", "O", "O"]),
    ("Mount Everest is the highest mountain in the world", ["B-LOC", "I-LOC", "O", "O", "O", "O", "O", "O", "O"]),
    ("Leonardo DiCaprio won an Oscar", ["B-PER", "I-PER", "O", "O", "O"]),
]

# File to write the data
retrain_filename = "retraining_ner_data.json"
with open(retrain_filename, "w") as file:
    for sentence, tags in sentences:
        tokens = sentence.split()
        data = {"source": tokens, "target": tags}
        json_line = json.dumps(data)
        file.write(json_line + "\n")


You can call the `train` function again to retrain the NER model on this smaller dataset.

In [None]:
retrain_data_source = dataset.NerDataSource(retrain_filename)

ner_model.train(
    train_data=retrain_data_source,
    epochs=3,
    learning_rate=0.001,
    batch_size=1024,
    train_metrics=["loss"],
)

Delete the generated files

In [None]:
import os
os.remove("thirdai_ner_model")
os.remove(retrain_filename)
os.remove("train_ner_data.jsonl")
os.remove("validation_ner_data.jsonl")
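
`os.remove` raises `FileNotFoundError` when a file is missing (for example, if the validation cell was skipped) and cannot delete a directory. A more forgiving cleanup could look like the sketch below. `remove_path` is a hypothetical helper, and the demonstration runs in a temporary directory so that no notebook files are touched.

```python
import os
import shutil
import tempfile


def remove_path(path):
    """Delete a file or a directory tree; return False if nothing exists at `path`."""
    if os.path.isdir(path) and not os.path.islink(path):
        shutil.rmtree(path)
    elif os.path.lexists(path):
        os.remove(path)
    else:
        return False
    return True


# Demonstration inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "train_ner_data.jsonl")
    dir_path = os.path.join(tmp_dir, "model_dir")
    open(file_path, "w").close()
    os.mkdir(dir_path)

    results = [
        remove_path(file_path),  # existing file
        remove_path(dir_path),   # existing directory
        remove_path(file_path),  # already gone, so nothing to do
    ]
    leftovers = os.listdir(tmp_dir)

print(results, leftovers)
```

In this notebook it could be called once for each of the four paths removed above.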