# Script to train an NER model

This script uses ThirdAI's NER library to train a model on an NER dataset from scratch. For this demonstration, we use the [CoNLL-2003](https://huggingface.co/datasets/conll2003) dataset. We also show how to load a saved ThirdAI NER model and fine-tune it further, provided the labels remain the same.

If you only want to use ThirdAI's pretrained multilingual PII detection model, please refer to [the other notebook](https://github.com/ThirdAILabs/Demos/blob/main/named_entity_recognition/pretrained_pii_model.ipynb) in this folder.

In [None]:
!pip3 install thirdai --upgrade
!pip3 install datasets

### Import necessary libraries

In [None]:
from thirdai import bolt, dataset, licensing
import utils

### Activate ThirdAI's license key

In [None]:
import os

if "THIRDAI_KEY" in os.environ:
    licensing.activate(os.environ["THIRDAI_KEY"])
else:
    licensing.activate("")  # Enter your ThirdAI key here

### Create a tag-to-label map

The tag-to-label map is used to map entity tags to their corresponding integer labels during training and inference.

Note: Every tag in your dataset must appear in `TAGS` or be the default tag (`"O"` in this script).

In [3]:
TAGS = [
    "B-PER",
    "I-PER",
    "B-ORG",
    "I-ORG",
    "B-LOC",
    "I-LOC",
    "B-MISC",
    "I-MISC",
]
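To make the note above concrete, here is a minimal sketch of a pre-training check that reports any tag not covered by `TAGS` or the default tag. The helper `find_unknown_tags` and the example rows are hypothetical and are not part of the ThirdAI API. The sketch assumes each target entry is a space-separated string of tags.

```python
# Hypothetical helper (not part of the ThirdAI API): report tags that fall
# outside the tag list the model is built with.
TAGS = ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
DEFAULT_TAG = "O"


def find_unknown_tags(target_rows, tags=TAGS, default_tag=DEFAULT_TAG):
    """Return the sorted list of tags in `target_rows` that are not allowed."""
    allowed = set(tags) | {default_tag}
    # Each row is assumed to be a space-separated string of tags.
    seen = {tag for row in target_rows for tag in row.split()}
    return sorted(seen - allowed)


# Made-up rows for illustration; "B-DATE" is not in TAGS.
example_targets = ["B-PER I-PER O O B-LOC", "B-ORG O B-DATE"]
unknown = find_unknown_tags(example_targets)
print(unknown)  # ['B-DATE']
```

An empty result means every tag in the checked rows is covered.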

#### Download and Process the dataset

In [4]:
train_file = utils.download_conll_dataset_as_file("train")
validation_file = utils.download_conll_dataset_as_file("validation")

Training and validation files must have a `source` column and a `target` column, where each token in the `source` column has a corresponding tag in the `target` column.

Note: Tags containing spaces are not supported. For example, `credit card` is not a valid tag, but `credit_card` is.

In [None]:
with open(train_file, "r") as f:
    for line in f.readlines()[:3]:
        print(line, end="")
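The format requirement can be checked before training. The sketch below assumes the files are CSVs with a `source,target` header and space-separated tokens and tags, matching how the retraining file is written later in this script. The helper `find_misaligned_rows` is hypothetical and not part of ThirdAI's library. It also illustrates one plausible reason a tag with a space causes trouble: it splits into two tags and breaks the token-to-tag alignment.

```python
import csv
import io


def find_misaligned_rows(csv_file):
    """Return (row_number, n_tokens, n_tags) for rows whose counts differ."""
    problems = []
    reader = csv.DictReader(csv_file)
    for row_number, row in enumerate(reader, start=1):
        n_tokens = len(row["source"].split())
        n_tags = len(row["target"].split())
        if n_tokens != n_tags:
            problems.append((row_number, n_tokens, n_tags))
    return problems


# Made-up data: the second row uses the invalid tag "credit card".
sample = io.StringIO(
    "source,target\n"
    "John Doe went to Paris,B-PER I-PER O O B-LOC\n"
    "My card number is 1234,O O O O credit card\n"
)
print(find_misaligned_rows(sample))  # [(2, 5, 6)]
```

To check a file on disk, pass an open file handle, for example `open(train_file, newline="")`.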

#### Initialize a Bolt NER model

In [None]:
ner_model = bolt.UniversalDeepTransformer(
    data_types={
        "source": bolt.types.text(),
        "target": bolt.types.token_tags(tags=TAGS, default_tag="O"),
    },
    target="target",
)

Call the model's `train` function with the training file, which is the only required argument. All other parameters are optional.

In [None]:
ner_model.train(
    train_file,
    epochs=2,
    learning_rate=0.001,
    batch_size=1024,
    metrics=["loss"],
    validation=bolt.Validation(filename=validation_file, metrics=["loss"]),
)

### Save and load the model

In [20]:
ner_model.save("thirdai_ner_model")
ner_model = bolt.UniversalDeepTransformer.load("thirdai_ner_model")

### Evaluation on Test Data

In [None]:
test_data = utils.load_dataset("conll2003")["test"]

predictions = []
actuals = []

labels = ["O"] + TAGS

for example in test_data:
    tokens = {"source": " ".join(example["tokens"])}
    actual_tags = [labels[tag] for tag in example["ner_tags"]]

    # Predict and evaluate
    predicted_tags = ner_model.predict(tokens, top_k=1)

    predictions.extend(predicted_tags)
    actuals.extend(actual_tags)

correct_predictions = sum(p[0][0] == a for p, a in zip(predictions, actuals))
total_predictions = len(predictions)
accuracy = correct_predictions / total_predictions
print(f"Accuracy: {accuracy * 100:.2f}%")
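Token-level accuracy counts `"O"` tokens as well, and these usually make up most of an NER dataset, so the figure can look high even when entity tags are often missed. As a complementary view, the following sketch computes per-tag precision and recall from two flat lists of tags. The function `per_tag_scores` and the toy lists are illustrative assumptions. With the variables above, the predicted list would presumably be built from the top-1 tag of each token, `p[0][0]`.

```python
from collections import Counter


def per_tag_scores(predicted, actual):
    """Return {tag: (precision, recall)} from two equal-length tag lists."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    true_pos, pred_count, actual_count = Counter(), Counter(), Counter()
    for p, a in zip(predicted, actual):
        pred_count[p] += 1
        actual_count[a] += 1
        if p == a:
            true_pos[a] += 1
    scores = {}
    for tag in sorted(set(pred_count) | set(actual_count)):
        # A tag that was never predicted (or never occurs) scores 0.0.
        precision = true_pos[tag] / pred_count[tag] if pred_count[tag] else 0.0
        recall = true_pos[tag] / actual_count[tag] if actual_count[tag] else 0.0
        scores[tag] = (precision, recall)
    return scores


# Toy example: the model misses the "I-PER" token and predicts "O" instead.
predicted = ["B-PER", "O", "O", "B-LOC", "O"]
actual = ["B-PER", "I-PER", "O", "B-LOC", "O"]
scores = per_tag_scores(predicted, actual)
for tag, (precision, recall) in scores.items():
    print(f"{tag}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy example, overall accuracy is 80%, while recall for `I-PER` is 0.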

### Fine-tune the model further

You can fine-tune an already trained model further, using data that covers only a subset of the tags. Here we create a small retraining dataset and save it to a file.

In [15]:
import pandas as pd

# Sample sentences with corresponding NER tags
sentences = [
    ("John Doe went to Paris", ["B-PER", "I-PER", "O", "O", "B-LOC"]),
    (
        "Alice and Bob are from New York City",
        ["B-PER", "O", "B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC"],
    ),
    ("The Eiffel Tower is in France", ["O", "B-LOC", "I-LOC", "O", "O", "B-LOC"]),
    (
        "Microsoft Corporation was founded by Bill Gates",
        ["B-ORG", "I-ORG", "O", "O", "O", "B-PER", "I-PER"],
    ),
    (
        "She visited the Louvre Museum in Paris last summer",
        ["O", "O", "O", "B-LOC", "I-LOC", "O", "B-LOC", "O", "O"],
    ),
    (
        "Google and IBM are big tech companies",
        ["B-ORG", "O", "B-ORG", "O", "O", "O", "O"],
    ),
    (
        "Mount Everest is the highest mountain in the world",
        ["B-LOC", "I-LOC", "O", "O", "O", "O", "O", "O", "O"],
    ),
    ("Leonardo DiCaprio won an Oscar", ["B-PER", "I-PER", "O", "O", "O"]),
]

# File to write the data
retrain_filename = "retraining_ner_data.csv"
data = {"source": [], "target": []}
for sentence, tags in sentences:
    data["source"].append(sentence)
    data["target"].append(" ".join(tags))
df = pd.DataFrame(data)
df.to_csv(retrain_filename, index=False)

You can then call the `train` function again to retrain the NER model on this smaller dataset.

In [None]:
ner_model.train(
    retrain_filename,
    epochs=3,
    learning_rate=0.001,
    batch_size=1024,
    metrics=["loss"],
)

Delete the files created above.

In [None]:
import os

os.remove("thirdai_ner_model")
os.remove(retrain_filename)
os.remove("train_ner_data.csv")
os.remove("validation_ner_data.csv")
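`os.remove` raises an error if a path is missing or is a directory, which can happen if the cleanup cell is run twice. A more forgiving variant might look like the sketch below. The helper `remove_path` is hypothetical, and the demonstration uses throwaway temporary paths rather than the files from this script.

```python
import os
import shutil
import tempfile


def remove_path(path):
    """Delete a file or directory tree; return False if nothing was there."""
    if os.path.isdir(path) and not os.path.islink(path):
        shutil.rmtree(path)
    elif os.path.lexists(path):
        os.remove(path)
    else:
        return False
    return True


# Demonstration on throwaway paths.
scratch = tempfile.mkdtemp()
demo_file = os.path.join(scratch, "demo.csv")
with open(demo_file, "w") as f:
    f.write("source,target\n")

print(remove_path(demo_file))  # True: the file existed and was removed
print(remove_path(demo_file))  # False: already gone, no error raised
print(remove_path(scratch))    # True: the directory was removed
```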