# Script to train an NER Model

This script uses ThirdAI's NER library to train a model from scratch on an NER dataset. For this demonstration, we use the `https://huggingface.co/datasets/conll2003` dataset. We also show how to load a saved ThirdAI NER model and fine-tune it further, provided the labels remain the same.

If you just want to use ThirdAI's pretrained multi-lingual PII detection model, please refer to [the other notebook](https://github.com/ThirdAILabs/Demos/blob/main/named_entity_recognition/pretrained_pii_model.ipynb) in this folder.

In [None]:
!pip3 install thirdai --upgrade
!pip3 install datasets

### Import necessary libraries

In [None]:
import json
from datasets import load_dataset
from thirdai import bolt, dataset, licensing

### Activate ThirdAI's license key

In [None]:
import os
if "THIRDAI_KEY" in os.environ:
    licensing.activate(os.environ["THIRDAI_KEY"])
else:
    licensing.activate("")  # Enter your ThirdAI key here

### Create a tag-to-label map

In [None]:
TAG_MAP = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
    "B-MISC": 7,
    "I-MISC": 8,
}

entries = list(TAG_MAP.keys())
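
The `entries` list above relies on the dictionary's insertion order matching the integer ids. A minimal sketch of a more defensive inverse mapping follows; it sorts by id and checks that the ids are contiguous from 0. The names `build_id_to_tag`, `EXAMPLE_TAG_MAP` and `ID_TO_TAG` are illustrative and not part of ThirdAI's API.

```python
def build_id_to_tag(tag_map):
    """Return a list where position i holds the tag whose id is i."""
    ids = sorted(tag_map.values())
    if ids != list(range(len(tag_map))):
        raise ValueError("tag ids must be unique and contiguous, starting at 0")
    return [tag for tag, _ in sorted(tag_map.items(), key=lambda item: item[1])]


# Same mapping as TAG_MAP above, repeated so the snippet runs on its own.
EXAMPLE_TAG_MAP = {
    "O": 0, "B-PER": 1, "I-PER": 2, "B-ORG": 3, "I-ORG": 4,
    "B-LOC": 5, "I-LOC": 6, "B-MISC": 7, "I-MISC": 8,
}
ID_TO_TAG = build_id_to_tag(EXAMPLE_TAG_MAP)
```

For a map written in id order, as here, the result is the same list as `list(TAG_MAP.keys())`; the difference only shows up if the dictionary is ever reordered or an id is skipped.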

### Helper functions to download and process the dataset

The functions below load the dataset from Hugging Face and save the training data as a JSONL file.
A `.txt` file also works, as long as each line contains a JSON object of the form:

`{"source": [List of Text Tokens], "target": [List of Corresponding Text Tags]}`

Note: every tag in the data must be a key of `TAG_MAP`.

In [None]:
def save_dataset_as_jsonl(filename, loaded_data):
    with open(filename, "w") as file:
        for example in loaded_data:
            data = {
                "source": example["tokens"],
                "target": [entries[tag] for tag in example["ner_tags"]],
            }
            file.write(json.dumps(data) + "\n")


def download_dataset_as_file(subset):
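    # Load the CoNLL-2003 dataset from Hugging Face. The local name avoids
    # shadowing thirdai's `dataset` module imported above.
    hf_dataset = load_dataset("conll2003")
    loaded_data = hf_dataset[subset]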
    filename = f"{subset}_ner_data.jsonl"
    save_dataset_as_jsonl(filename, loaded_data)
    return filename
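
Since every line must carry matching `source` and `target` lists and only tags from the tag map, it can help to check a file before training on it. The following is a minimal sketch of such a check, not a ThirdAI utility; the function name `validate_ner_jsonl` and the wording of the reported problems are assumptions made for illustration.

```python
import json


def validate_ner_jsonl(filename, tag_map):
    """Return a list of (line_number, problem) tuples; empty if the file is clean."""
    problems = []
    with open(filename) as file:
        for line_number, line in enumerate(file, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((line_number, "not valid JSON"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            tokens = record.get("source")
            tags = record.get("target")
            if not isinstance(tokens, list) or not isinstance(tags, list):
                problems.append((line_number, "missing 'source' or 'target' list"))
                continue
            if len(tokens) != len(tags):
                problems.append((line_number, "token/tag length mismatch"))
            # Tags are expected to be strings that appear as keys of tag_map.
            unknown = sorted(set(tags) - set(tag_map))
            if unknown:
                problems.append((line_number, f"unknown tags: {unknown}"))
    return problems
```

A call such as `validate_ner_jsonl("train_ner_data.jsonl", TAG_MAP)` should return an empty list for a file written by `save_dataset_as_jsonl`, assuming the dataset's label ids line up with `TAG_MAP`.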

### Initialize a Bolt NER model

In [None]:
ner_model = bolt.NER(TAG_MAP)

Use thirdai's `dataset` module to load the training file into a `NerDataSource` and pass it to the `train` function.

In [None]:

train_data_source = dataset.NerDataSource(type=ner_model.type(), file_path=download_dataset_as_file("train"), token_column="source", tag_column="target")

ner_model.train(
    train_data=train_data_source,
    epochs=1,
    learning_rate=0.001,
    batch_size=1024,
    train_metrics=["loss"],
)

If you have validation data, you can also pass it to the `train` function:

In [None]:
train_data_source = dataset.NerDataSource(type=ner_model.type(), file_path=download_dataset_as_file("train"), token_column="source", tag_column="target")
val_data_source = dataset.NerDataSource(type=ner_model.type(), file_path=download_dataset_as_file("validation"), token_column="source", tag_column="target")

ner_model.train(
    train_data=train_data_source,
    epochs=2,
    learning_rate=0.001,
    batch_size=1024,
    train_metrics=["loss"],
    val_data=val_data_source,
    val_metrics=["loss"]
)

### Save and load the model

In [None]:
ner_model.save("thirdai_ner_model")
ner_model = bolt.NER.load("thirdai_ner_model")

### Evaluation on Test Data

In [None]:
test_data = load_dataset("conll2003")["test"]
predictions = []
actuals = []

for example in test_data:
    tokens = example["tokens"]
    actual_tags = [entries[tag] for tag in example["ner_tags"]]

    # Predict and evaluate
    predicted_tags = ner_model.get_ner_tags([tokens])[0]

    predictions.extend(predicted_tags)
    actuals.extend(actual_tags)

correct_predictions = sum(p[0][0] == a for p, a in zip(predictions, actuals))
total_predictions = len(predictions)
accuracy = correct_predictions / total_predictions
print(f"Accuracy: {accuracy * 100:.2f}%")
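
Token-level accuracy counts every `O` token as a correct prediction, so it can look high even when few entities are found. A common complement is span-level precision, recall and F1 over BIO tags. The sketch below is one way to compute them in plain Python. It assumes each sentence is a list of plain tag strings, so with the predictions above the top tag would first have to be extracted, as the accuracy code does with `p[0][0]`. The helper names `extract_spans` and `entity_f1` are illustrative.

```python
def extract_spans(tags):
    """Collect (entity_type, start, end) spans from a BIO tag sequence (end exclusive)."""
    spans = []
    start, current = None, None
    for index, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current
        if current is not None and not continues:
            spans.append((current, start, index))
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            # A stray I- tag with no open span is treated as the start of a new span.
            start, current = index, label
    return spans


def entity_f1(gold_sentences, predicted_sentences):
    """Micro-averaged span-level precision, recall and F1 over parallel tag sequences."""
    gold, predicted = set(), set()
    for sentence_id, (gold_tags, predicted_tags) in enumerate(
        zip(gold_sentences, predicted_sentences)
    ):
        gold.update((sentence_id, *span) for span in extract_spans(gold_tags))
        predicted.update((sentence_id, *span) for span in extract_spans(predicted_tags))
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A predicted span only counts as correct when its type and both boundaries match the gold span exactly, which is stricter than the per-token comparison above.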

### Finetune the model further

If you want to fine-tune an already trained model further using a subset of the tags, you can do so with a small retraining dataset. Here we create one and save it to a file.

In [None]:
import json

# Sample sentences with corresponding NER tags
sentences = [
    ("John Doe went to Paris", ["B-PER", "I-PER", "O", "O", "B-LOC"]),
    (
        "Alice and Bob are from New York City",
        ["B-PER", "O", "B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC"],
    ),
    ("The Eiffel Tower is in France", ["O", "B-LOC", "I-LOC", "O", "O", "B-LOC"]),
    ("Microsoft Corporation was founded by Bill Gates", ["B-ORG", "I-ORG", "O", "O", "O", "B-PER", "I-PER"]),
    ("She visited the Louvre Museum in Paris last summer", ["O", "O", "O", "B-LOC", "I-LOC", "O", "B-LOC", "O", "O"]),
    ("Google and IBM are big tech companies", ["B-ORG", "O", "B-ORG", "O", "O", "O", "O"]),
    ("Mount Everest is the highest mountain in the world", ["B-LOC", "I-LOC", "O", "O", "O", "O", "O", "O", "O"]),
    ("Leonardo DiCaprio won an Oscar", ["B-PER", "I-PER", "O", "O", "O"]),
]

# File to write the data
retrain_filename = "retraining_ner_data.json"
with open(retrain_filename, "w") as file:
    for sentence, tags in sentences:
        tokens = sentence.split()
        data = {"source": tokens, "target": tags}
        json_line = json.dumps(data)
        file.write(json_line + "\n")
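
Hand-labelled examples like the ones above are easy to get slightly wrong, for instance an `I-` tag that does not follow a `B-` or `I-` tag of the same type. The sketch below flags such positions. It assumes the strict BIO (IOB2) convention; the notebook does not state whether ThirdAI requires it, and `find_bio_violations` is a hypothetical helper name.

```python
def find_bio_violations(tags):
    """Return indices of I- tags that do not continue a span of the same type."""
    violations = []
    previous = "O"
    for index, tag in enumerate(tags):
        if tag.startswith("I-"):
            label = tag[2:]
            if previous not in (f"B-{label}", f"I-{label}"):
                violations.append(index)
        previous = tag
    return violations
```

Running it over the `tags` of each entry in `sentences` before writing the file gives a quick sanity check; an empty list means the sequence is consistent under this convention.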


To retrain the NER model on this subset of data, just call the `train` function again.

In [None]:
retrain_data_source = dataset.NerDataSource(type=ner_model.type(), file_path=retrain_filename, token_column="source", tag_column="target")

ner_model.train(
    train_data=retrain_data_source,
    epochs=3,
    learning_rate=0.001,
    batch_size=1024,
    train_metrics=["loss"],
)

You can also reuse this trained model for other use cases with the `from_pretrained` function. Calling it replaces the existing tags with a new set of tags while preserving what the model learned during the previous training.

In [19]:
model_save_path = "thirdai_ner_model"
ner_model.save(model_save_path)

NEW_TAG_MAP = {
    "O": 0,          
    "B-FN": 1,       
    "I-FN": 2,       
    "B-LN": 3,       
    "I-LN": 4,      
    "B-CITY": 5,
    "I-CITY": 6,     
    "B-STATE": 7,    
    "I-STATE": 8,    
    "B-COUNTRY": 9,  
    "I-COUNTRY": 10  
}

sentences = [
    ("Samantha Bloom traveled to Paris France last year", ["B-FN", "B-LN", "O", "O", "B-CITY", "B-COUNTRY", "O", "O"]),
    ("John Smith and Alice Johnson are from Austin Texas", ["B-FN", "B-LN", "O", "B-FN", "B-LN", "O", "O", "B-CITY", "B-STATE"]),
    ("Timothy Dalton was born in Cardiff Wales UK", ["B-FN", "B-LN", "O", "O", "O", "B-CITY", "B-STATE", "B-COUNTRY"]),
    ("Hilary Swank visited Toronto Ontario Canada last month", ["B-FN", "B-LN", "O", "B-CITY", "B-STATE", "B-COUNTRY", "O", "O"]),
    ("Nikola Tesla was born in Smiljan Croatia", ["B-FN", "B-LN", "O", "O", "O", "B-CITY", "B-COUNTRY"]),
    ("Michael Jordan, a native of Brooklyn New York, is famous worldwide", ["B-FN", "B-LN", "O", "O", "O", "B-CITY", "B-STATE", "I-STATE", "O", "O", "O"]),
    ("Julia Roberts was seen in Malibu California this week", ["B-FN", "B-LN", "O", "O", "O", "B-CITY", "B-STATE", "O", "O"]),
]

tl_filename = "tl_ner_data.json"
with open(tl_filename, "w") as file:
    for sentence, tags in sentences:
        tokens = sentence.split()
        data = {"source": tokens, "target": tags}
        json_line = json.dumps(data)
        file.write(json_line + "\n")

tl_ner_model = bolt.NER.from_pretrained(model_save_path, NEW_TAG_MAP)

tl_data_source = dataset.NerDataSource(type=tl_ner_model.type(), file_path=tl_filename, token_column="source", tag_column="target")

tl_ner_model.train(
    train_data=tl_data_source,
    epochs=10,
    learning_rate=0.001,
    batch_size=2,
    train_metrics=["loss"],
)

loading data | source 'tl_ner_data.json'


{'val_times': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 'epoch_times': [1.1979999542236328,
  0.017000000923871994,
  0.017000000923871994,
  0.014999999664723873,
  0.014999999664723873,
  0.01600000075995922,
  0.01600000075995922,
  0.017000000923871994,
  0.017000000923871994,
  0.017000000923871994],
 'train_loss': [1.5787938833236694,
  0.8555572032928467,
  0.5551876425743103,
  0.32909536361694336,
  0.1694101244211197,
  0.07679398357868195,
  0.03332304581999779,
  0.015141624957323074,
  0.007578184362500906,
  0.004184901714324951]}

loading data | source 'tl_ner_data.json' | vectors 60 | batches 30 | time 0.009s | complete

train | epoch 0 | train_steps 30 | train_loss=1.57879  | train_batches 30 | time 1.198s

validate | epoch 0 | train_steps 30 |  | val_batches 0 | time 0.000s       

train | epoch 1 | train_steps 60 | train_loss=0.855557  | train_batches 30 | time 0.017s

validate | epoch 1 | train_steps 60 |  | val_batches 0 | time 0.000s       

train | epoch 2 | train_steps 90 | train_loss=0.555188  | train_batches 30 | time 0.017s

validate | epoch 2 | train_steps 90 |  | val_batches 0 | time 0.000s       

train | epoch 3 | train_steps 120 | train_loss=0.329095  | train_batches 30 | time 0.015s

validate | epoch 3 | train_steps 120 |  | val_batches 0 | time 0.000s      

train | epoch 4 | train_steps 150 | train_loss=0.16941  | train_batches 30 | time 0.015s

validate | epoch 4 | train_steps 150 |  | val_batches 0 | time 0.000s      

train | epoch 5 | train_steps 180 | train_loss=0.076794  | train_batches

Delete the files

In [20]:
import os

# Remove the files created by this notebook, skipping any that are already gone.
for path in [
    "thirdai_ner_model",
    retrain_filename,
    tl_filename,  # the file path, not the NerDataSource object
    "train_ner_data.jsonl",
    "validation_ner_data.jsonl",
]:
    if os.path.exists(path):
        os.remove(path)