### PII (Personally Identifiable Information) Detection with a pre-trained BOLT NER model

In this notebook, we show how to use ThirdAI's pre-trained PII detection model on your own dataset. The model was trained on a proprietary synthetic dataset generated with GPT-4. It is multilingual, trained on English, French, Spanish, and Italian data, and detects the following types of PII:

* NAME
* EMAIL
* HOMEADDRESS
* DOB
* UIN
* PHONENUMBER
* DATE
* VEHICLEUIN
* AGE
* GENDER
* HEIGHT
* USERNAME
* PASSWORD
* OCCUPATION
* ACCOUNTNUMBER
* PIN
* CREDITCARDNUMBER
* SEX
* AMOUNT
* ACCOUNTNAME
* IBAN
* ORGANIZATION
* URL
* CREDITCARDCVV

The latter part of the notebook shows how to load the pretrained model and fine-tune it on new entity types.

In [None]:
!pip3 install thirdai --upgrade
!pip3 install datasets

### Activate your ThirdAI License Key

You can apply for a trial license [here](https://www.thirdai.com/try-bolt/).

In [None]:
import os

from thirdai import bolt, licensing

import utils  # local helper module (provides download_conll_dataset_as_file)

if "THIRDAI_KEY" in os.environ:
    licensing.activate(os.environ["THIRDAI_KEY"])
else:
    licensing.activate("")  # Enter your ThirdAI key here

### Download the Model

In [2]:
import os

# Create the models directory if it does not already exist.
os.makedirs("./models/", exist_ok=True)

if not os.path.exists("./models/pretrained_multilingual.model"):
    os.system(
        "wget -nv -O ./models/pretrained_multilingual.model 'https://www.dropbox.com/scl/fi/busx1pb5s8s7j00hvmhql/pretrained_multilingual_v2.bolt?rlkey=axzlwwtyqsnz0wehw7nzcpr51&st=3xq14asc&dl=0'"
    )

### Load the Model

In [3]:
pii_model = bolt.UniversalDeepTransformer.NER.load(
    "./models/pretrained_multilingual.model"
)

### Use Pretrained Model Out of the Box

In [19]:
sample_sentence = "I'm Robert. I work for as an AI Engineer for a startup in Houston. I want to apply for a credit card. My email is robbie@gmail.com."

tokens = sample_sentence.split()

predicted_tags = pii_model.predict(tokens, top_k=1)

# Print every token whose top prediction is not the "outside" tag "O".
for token, candidates in zip(tokens, predicted_tags):
    top_tag = candidates[0][0]
    if top_tag != "O":
        print(token + " : " + top_tag)

Robert. : NAME
Engineer : OCCUPATION
Houston. : HOMEADDRESS
robbie@gmail.com. : EMAIL


In [12]:
sample_sentence = "I'm Siddharth. I work at for a big multinational company in Mountain View. I want to cancel my credit card with number 4147202361663155."

tokens = sample_sentence.split()

predicted_tags = pii_model.predict(tokens, top_k=1)

# Print every token whose top prediction is not the "outside" tag "O".
for token, candidates in zip(tokens, predicted_tags):
    top_tag = candidates[0][0]
    if top_tag != "O":
        print(token + " : " + top_tag)

Siddharth. : NAME
Mountain : HOMEADDRESS
View. : HOMEADDRESS
4147202361663155. : CREDITCARDNUMBER
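
A common next step after detection is redaction. The sketch below is a minimal illustration, not part of ThirdAI's API: `redact` is a hypothetical helper, and it assumes only what the cells above show, namely that `predicted_tags[i][0][0]` is the top tag for token `i`. The `(tag, score)` pairs in the sample data are hand-written stand-ins, not real model output. Because the tokens come from a whitespace split, trailing punctuation such as the period in `Robert.` stays attached to the token, so the helper preserves it after the placeholder.

```python
import string


def redact(tokens, predicted_tags, outside_tag="O"):
    """Replace each token tagged as PII with a [TAG] placeholder.

    Hypothetical helper: assumes predicted_tags[i][0][0] is the top tag
    for tokens[i], as in the prediction cells above.
    """
    redacted = []
    for token, candidates in zip(tokens, predicted_tags):
        tag = candidates[0][0]  # top-1 predicted tag for this token
        if tag == outside_tag:
            redacted.append(token)
            continue
        # Keep trailing punctuation (e.g. the final ".") outside the placeholder.
        stripped = token.rstrip(string.punctuation)
        trailing = token[len(stripped):]
        redacted.append(f"[{tag}]" + trailing)
    return " ".join(redacted)


# Hand-written stand-in for model output (the scores are made up).
tokens = "My email is robbie@gmail.com.".split()
predicted_tags = [[("O", 0.99)], [("O", 0.99)], [("O", 0.99)], [("EMAIL", 0.97)]]

print(redact(tokens, predicted_tags))
```

With real predictions, the same call would be `redact(tokens, pii_model.predict(tokens, top_k=1))`, assuming the output shape matches the cells above.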


## Fine-tune a Pretrained Model on Your Own Data

### Create a Tag-to-Label Map

The tag-to-label map assigns each entity tag an integer label, which the model uses during training and inference.

Note: Every tag that appears in your dataset must be a key in TAG_MAP.

In [13]:
TAG_MAP = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
    "B-MISC": 7,
    "I-MISC": 8,
}

# Tag names ordered by label index, so entries[label] recovers the tag name.
entries = list(TAG_MAP.keys())
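
The note above can be checked before training. The following is a minimal sketch under stated assumptions: `find_unknown_tags` and `LABEL_TO_TAG` are hypothetical names introduced here, and the sample tag sequences are invented for illustration. It scans tag sequences for any tag missing from the map, and also builds an explicit label-to-tag dictionary, which does not depend on the insertion order of `TAG_MAP` the way `list(TAG_MAP.keys())` does.

```python
# Same map as in the cell above, repeated so this sketch is self-contained.
TAG_MAP = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
    "B-MISC": 7,
    "I-MISC": 8,
}


def find_unknown_tags(tag_sequences, tag_map):
    """Return the set of tags that occur in the data but are not keys of tag_map."""
    seen = {tag for sequence in tag_sequences for tag in sequence}
    return seen - set(tag_map)


# Explicit inverse mapping from integer label back to tag name.
LABEL_TO_TAG = {label: tag for tag, label in TAG_MAP.items()}

# Invented sample: "B-DATE" is deliberately absent from TAG_MAP.
sample_sequences = [["B-PER", "O", "B-LOC"], ["O", "B-DATE"]]
unknown = find_unknown_tags(sample_sequences, TAG_MAP)
if unknown:
    print("Tags missing from TAG_MAP:", sorted(unknown))
```

An empty result means every tag in the data is covered by the map.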

### Download and Process the Dataset

In [14]:
train_file = utils.download_conll_dataset_as_file("train")
validation_file = utils.download_conll_dataset_as_file("validation")

### Initialize a NER Model

In [15]:
ner_model = bolt.UniversalDeepTransformer.NER.from_pretrained(
    "./models/pretrained_multilingual.model",
    tokens_column="source",
    tags_column="target",
    tag_to_label=TAG_MAP,
)

Call the train function of the NER model. The training file is the only required argument; all other parameters are optional.

In [None]:
ner_model.train(
    train_file,
    epochs=2,
    learning_rate=0.001,
    batch_size=1024,
    train_metrics=["loss"],
    validation_file=validation_file,
    val_metrics=["loss"],
)

### Evaluate the Fine-tuned Model on the Test Dataset

In [None]:
test_data = utils.load_dataset("conll2003")["test"]

predictions = []
actuals = []

for example in test_data:
    tokens = example["tokens"]
    # Convert the dataset's integer labels back to tag names.
    actual_tags = [entries[tag] for tag in example["ner_tags"]]

    # Predict the top tag for each token in this sentence.
    predicted_tags = ner_model.predict(tokens, top_k=1)

    predictions.extend(predicted_tags)
    actuals.extend(actual_tags)

# Token-level accuracy: fraction of tokens whose top predicted tag matches the gold tag.
correct_predictions = sum(p[0][0] == a for p, a in zip(predictions, actuals))
total_predictions = len(predictions)
accuracy = correct_predictions / total_predictions
print(f"Accuracy: {accuracy * 100:.2f}%")
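
Token-level accuracy counts the "O" tag too, and "O" typically makes up most tokens in NER data, so a per-tag breakdown can be more informative. The sketch below is one simple way to compute token-level precision and recall for each entity tag. `per_tag_scores` is a hypothetical helper, not part of thirdai, and the sample tags are invented for illustration. It scores individual tokens rather than whole entity spans, so it is not the span-level F1 usually reported for CoNLL.

```python
from collections import Counter


def per_tag_scores(predicted, actual, outside_tag="O"):
    """Return {tag: (precision, recall)} at the token level, skipping outside_tag."""
    true_pos, pred_count, actual_count = Counter(), Counter(), Counter()
    for p, a in zip(predicted, actual):
        pred_count[p] += 1
        actual_count[a] += 1
        if p == a:
            true_pos[p] += 1

    scores = {}
    for tag in sorted(set(pred_count) | set(actual_count)):
        if tag == outside_tag:
            continue
        # Guard against division by zero when a tag was never predicted or never present.
        precision = true_pos[tag] / pred_count[tag] if pred_count[tag] else 0.0
        recall = true_pos[tag] / actual_count[tag] if actual_count[tag] else 0.0
        scores[tag] = (precision, recall)
    return scores


# Invented sample tags, one entry per token.
predicted = ["B-PER", "O", "B-LOC", "O", "B-PER"]
actual = ["B-PER", "O", "O", "B-LOC", "O"]

scores = per_tag_scores(predicted, actual)
for tag, (precision, recall) in scores.items():
    print(f"{tag}: precision={precision:.2f} recall={recall:.2f}")
```

With the variables from the evaluation cell, the call would be `per_tag_scores([p[0][0] for p in predictions], actuals)`.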