# Large Language Model for PII detection

This notebook presents the fine-tuning and evaluation of a DeBERTa model, trained with Hugging Face.

Because training LLMs like DeBERTa requires large amounts of compute (40 GB of RAM in this case), we opted to train the model on the HPC in a separate script.

Thus, the training code in this notebook has not been run here; instead, the notebook contains snippets and explanations of what was done, along with learning curves and discussion.

Meanwhile, the curious reader is encouraged to take a look at the *training-* and *utils-scripts* for further information - as larger functions have been placed there to ease readability in the notebook. 

The overall structure of the data and prediction processing has been borrowed from [Eisuke Mizutani](https://www.kaggle.com/code/emiz6413/train-deberta-v3-single-model-lb-0-966).

### Named Entity Recognition (NER) with transformers

The Transformer architecture has long been recognized for revolutionizing Natural Language Processing, particularly in Language Translation, Text Generation and Named Entity Recognition, largely due to its use of the self-attention mechanism.

The self-attention mechanism utilizes three learnt weight matrices - the **$W_{K}$**, **$W_{Q}$** and **$W_{V}$** matrices - in order to calculate what is known as *Self-Attention*.

The Transformer-input, $X$, is projected onto the weight-matrices, in order to produce what is known as the **Queries (Q)**, **Keys (K)** and **Values (V)**.

$\text{\textbf{Q}} = \textbf{XW}_{Q}, \qquad \text{\textbf{K}} = \textbf{XW}_{K}, \qquad \text{\textbf{V}} = \textbf{XW}_{V}$

The Queries and Keys are then used to calculate the Attention scores for each token in a sentence:

$$\text{Attention} = \text{Softmax} \left ( \frac{\textbf{Q} \cdot \textbf{K}^T}{\sqrt{d_k}} \right ),$$

Here, $d_k$ is the dimensionality of the **Keys**, and the Attention scores represent how much each token should attend to all other tokens, i.e. a weighted importance that informs the model which other tokens are important for that specific token. The Softmax function is applied row-wise so that each token's attention weights sum to 1, which makes the scores comparable across different contexts. Meanwhile, the **Values** are understood as a learned embedding per token.

Finally, the token representations, $\textbf{Z}$, are calculated as a linear combination of all the **Values**, weighted by the attention between tokens:
$$\textbf{Z} = \text{Attention} \cdot \textbf{V}$$
I.e. if *token 3* attends highly to *tokens 3, 7 and 10*, its final token representation will be dominated by a linear combination of the **Values** of *tokens 3, 7 and 10*.
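
The projections and the attention formula above can be sketched in a few lines of NumPy. This is only a minimal single-head illustration: the sizes, the random weights and the function names `softmax` and `self_attention` are assumptions made for the example, not part of the trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Project the input onto Queries, Keys and Values
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    # Row i holds the attention weights of token i over all tokens
    attention = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    # Each token representation is a weighted combination of the Values
    Z = attention @ V
    return attention, Z

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 8, 4  # illustrative sizes only
X = rng.normal(size=(n_tokens, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

attention, Z = self_attention(X, W_Q, W_K, W_V)
print(attention.shape, Z.shape)
print(attention.sum(axis=1))  # every row sums to 1
```

Real Transformer layers run several such heads in parallel and add further projections, which this sketch leaves out.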

Self-attention allows the model to weigh and consider all parts of the input sentence when understanding each word. Each token (word) in a sentence is given attention scores relative to all other tokens, enabling the model to capture a more contextual representation of each word.

For NER, this ability is crucial. By understanding the context in which a word appears, transformers can accurately classify tokens as named entities (like names, organizations, or locations) or as non-entity tokens. The model processes the entire sentence and uses the relationships between words to determine if a token is part of a named entity, by classifying on the token-representations, $\textbf{Z}$. This contextual awareness, powered by self-attention, means that transformers can recognize entities even when they appear in complex or ambiguous sentences.

### DeBERTa - a quick introduction

**DeBERTa**, an acronym for **Decoding-enhanced BERT with disentangled attention**, is a model that improves on the previous king of Natural Language Processing, namely the **Bidirectional Encoder Representations from Transformers** - also known as **BERT** for short. 

Developed by Microsoft, DeBERTa enhances BERT through two novel techniques.

BERT adds position information to each word's embedding before applying attention, so content and position are mixed into a single vector per word; the attention mechanism then processes all words in a sentence simultaneously, understanding each word in the context of all other words. DeBERTa refines this approach. It introduces a disentangled attention mechanism that represents *content* and *relative position* with separate vectors and computes attention from both. This means DeBERTa explicitly considers not only what each word means but also where it appears relative to the other words in the sentence. Such dual attention to content and position allows DeBERTa to have a more nuanced understanding of text, making it better at grasping the subtle meanings and relationships in language.
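
As a rough sketch of what "disentangled" means, the DeBERTa paper writes the unnormalized attention score between tokens $i$ and $j$ as a sum of three terms. The notation below follows the paper rather than this notebook: $Q^c$ and $K^c$ are projections of the content vectors, $Q^r$ and $K^r$ are projections of the relative-position embeddings, and $\delta(i,j)$ is the relative distance from token $i$ to token $j$.

$$\tilde{A}_{i,j} = \underbrace{Q^c_i {K^c_j}^{\top}}_{\text{content-to-content}} + \underbrace{Q^c_i {K^r_{\delta(i,j)}}^{\top}}_{\text{content-to-position}} + \underbrace{K^c_j {Q^r_{\delta(j,i)}}^{\top}}_{\text{position-to-content}}$$

The first term corresponds to the standard self-attention score described earlier; the other two terms are what the relative-position vectors add.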

Moreover, DeBERTa includes an enhanced mask decoder, which improves how the model predicts and reconstructs masked words (words hidden during pre-training to test the model’s understanding) by bringing in absolute position information right before the prediction step. This technique boosts DeBERTa’s ability to handle tasks where precise word recognition and contextual awareness are crucial, such as filling in blanks.

# Fine-tuning DeBERTa

Thankfully, fine-tuning through Hugging Face is as easy as:
1) Loading a pre-trained model
2) Loading a pre-trained tokenizer
3) Loading data
4) Defining model and training hyperparameters
5) Running `Trainer.train()`

### 1 & 2) Pre-trained model
The pre-trained model and tokenizer were, in our case, `microsoft/deberta-v3-large` from Hugging Face.
The training arguments were set up and the model was loaded like so:

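```
# Library imports used in this snippet
import os
from datasets import concatenate_datasets
from transformers import (
    DataCollatorForTokenClassification,
    DebertaV2TokenizerFast,
    Trainer,
    TrainingArguments,
)
# load_data, CustomTrainTokenizer, ModelInit and MetricsComputer are project helpers
# from the training and utils scripts; args and run_name are set in the training script.

# Creating trainer arguments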
train_args = TrainingArguments(
    output_dir=args.output_dir,
    logging_dir=f'./logs/{run_name}',
    fp16=args.AMP,
    learning_rate=args.lr,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    gradient_accumulation_steps=args.grad_accumulation_steps,
    report_to="tensorboard",  # Log training metrics to TensorBoard
    evaluation_strategy="steps",
    eval_steps=50,
    eval_delay=100,
    save_strategy="steps",
    save_steps=50,
    save_total_limit=1,
    logging_steps=10,
    metric_for_best_model="f5",
    greater_is_better=True,
    load_best_model_at_end=True,
    overwrite_output_dir=True,
    lr_scheduler_type=args.lr_scheduler,
    warmup_ratio=args.warmup_ratio,
    weight_decay=args.weight_decay,
    seed=args.seed,
)

# Load dataset
data = load_data(args.data_dir, args)

# Initialize tokenizer and encoder
tokenizer = DebertaV2TokenizerFast.from_pretrained(args.model_path)
train_encoder = CustomTrainTokenizer(tokenizer=tokenizer, label2id=data['label2id'], max_length=args.max_length)
eval_encoder = CustomTrainTokenizer(tokenizer=tokenizer, label2id=data['label2id'], max_length=args.max_length)

# Apply encoders to datasets
data['val'] = data['val'].map(eval_encoder, num_proc=os.cpu_count())
data['train'] = concatenate_datasets([data['train']['original'], data['train']['extra']])
data['train'] = data['train'].map(train_encoder, num_proc=os.cpu_count())

# Initialize model
model_init = ModelInit(
    args.model_path,
    id2label=data['id2label'],
    label2id=data['label2id'],
    freeze_embedding=args.freeze_embedding,
    freeze_layers=args.freeze_layers,
)

# Initialize trainer
trainer = Trainer(
    args=train_args,
    model_init=model_init,
    train_dataset=data['train'],
    eval_dataset=data['val'],
    tokenizer=tokenizer,
    compute_metrics=MetricsComputer(eval_ds=data['val'], label2id=data['label2id'], id2label=data['id2label'], conf_thresh=args.conf_thresh),
    data_collator=DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=16),
)
```


### 3) Loading the data
The Kaggle competition provided us with a training-set of 6807 essays, as well as an unlabeled test-set to use for the competition submission.

Meanwhile, a [Kaggle-user](https://www.kaggle.com/datasets/mpware/pii-mixtral8x7b-generated-essays) provided an additional 2692 essays generated with the Mixtral-8x7B-Instruct model from Mistral AI.

We opted for a 70/30 train/validation split of the original data, and added all of the extra data to the training set.
That way we ended up with 7457 training examples (4765 original + 2692 generated) and 2042 validation examples.
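
The counts above can be checked with a short calculation. Rounding the validation share down to a whole number of essays is an assumption made here; it happens to reproduce the reported numbers.

```python
n_original = 6807    # essays in the competition training set
n_extra = 2692       # additional generated essays
val_fraction = 0.30  # share of the original data held out for validation

# Assumption: the validation size is rounded down to a whole number of essays
n_val = int(n_original * val_fraction)
n_train = (n_original - n_val) + n_extra

print(n_train, n_val)
```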

### 4) Training-parameters
The final model was fine-tuned with the following parameters:

**Epochs and Learning Rate:**
- The model was fine-tuned for just 3 epochs.
- A learning rate of 2.5e-5 was used.

**Batch Size:**
- A training batch size of 1 and an evaluation batch size of 8 were used. 
- Batch size affects the gradient estimation and training stability, which is why we opted for such a small training batch size. 

**Learning Rate Scheduler:**
- A linear learning rate scheduler was applied, which reduces the learning rate linearly from its initial value to zero, in order to avoid overshooting minima later in training.

**Warmup Ratio and Weight Decay:**
- The warmup ratio was set to 0.1, meaning 10% of the total training steps are used to gradually ramp up the learning rate from zero.
- A weight decay of 0.01 was applied as regularization to reduce overfitting.
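
To make the interplay between the linear scheduler and the warmup ratio concrete, here is a small sketch of the resulting learning rate as a function of the training step. The function name and `total_steps = 1000` are hypothetical and chosen for illustration only; the actual schedule is produced internally by the Hugging Face `Trainer`, and its rounding of the warmup steps may differ slightly.

```python
def linear_schedule_with_warmup(step, total_steps, peak_lr=2.5e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr during warmup
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule_with_warmup(step, total_steps))
```

With these settings, the learning rate peaks at 2.5e-5 after the first 10% of steps and reaches zero at the final step.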

**Model Initialization:**
- The first 6 layers of the model were frozen, in order to potentially speed up training and focus learning on the upper layers of the network.

### Extra additions

In the detection of PII, recall is incredibly important, as we cannot risk missing any personal information.
Therefore, we opted to include a *confidence threshold* of 0.90 for the 'O' class: if the model wasn't at least 90% confident in its 'O' prediction, it would fall back to the next most probable class.
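
The thresholding rule can be sketched as follows. This is a minimal illustration, not the code from the utils script: the function name `apply_o_threshold`, the three-label subset and the probabilities are made up for the example.

```python
import numpy as np

def apply_o_threshold(probs, o_index, conf_thresh=0.90):
    """Replace low-confidence 'O' predictions with the best non-'O' class."""
    preds = probs.argmax(axis=-1)
    # Best non-'O' class per token: mask out the 'O' column before the argmax
    masked = probs.copy()
    masked[..., o_index] = -np.inf
    best_non_o = masked.argmax(axis=-1)
    # Fall back only where 'O' wins with a probability below the threshold
    low_conf_o = (preds == o_index) & (probs[..., o_index] < conf_thresh)
    return np.where(low_conf_o, best_non_o, preds)

labels = ["B-NAME_STUDENT", "I-NAME_STUDENT", "O"]  # illustrative subset of labels
o_index = labels.index("O")
probs = np.array([
    [0.01, 0.01, 0.98],  # confident 'O': kept
    [0.12, 0.03, 0.85],  # 'O' below 0.90: falls back to the best non-'O' class
    [0.70, 0.10, 0.20],  # already non-'O': unchanged
])

pred_ids = apply_o_threshold(probs, o_index)
print([labels[i] for i in pred_ids])
```

The trade-off is deliberate: some 'O' tokens will be flagged incorrectly, lowering precision, in exchange for fewer missed PII tokens.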

Meanwhile, predictions of the model were created in two steps:
1) A prediction through regular DeBERTa-classification. 
2) A Regex-check for URLs, Phone-numbers and Emails on all tokens classified with 'O' during post-processing.
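
A regex fallback of this kind might look roughly like the sketch below. The patterns, the function name `regex_fallback` and the label names `B-EMAIL`, `B-PHONE_NUM` and `B-URL_PERSONAL` are assumptions for illustration; the actual rules live in the utils script and may differ, for instance in how they handle phone numbers split across several tokens.

```python
import re

# Hypothetical, deliberately simple patterns; each must match a whole token
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
URL_RE = re.compile(r"^(https?://|www\.)\S+$")
PHONE_RE = re.compile(r"^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$")

def regex_fallback(tokens, labels):
    """Re-label tokens predicted as 'O' when they match a PII pattern."""
    out = list(labels)
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if lab != "O":
            continue  # never overwrite a PII label from the model
        if EMAIL_RE.match(tok):
            out[i] = "B-EMAIL"
        elif PHONE_RE.match(tok):
            out[i] = "B-PHONE_NUM"
        elif URL_RE.match(tok):
            out[i] = "B-URL_PERSONAL"
    return out

tokens = ["Contact", "jane.doe@example.com", "or", "555-123-4567", "via", "https://example.com/jane", "."]
labels = ["O"] * len(tokens)
print(regex_fallback(tokens, labels))
```

Because the check only runs on tokens labelled 'O', it can raise recall but never removes a PII label assigned by the model.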

# Model training - Learning Curves

Below, a function is created to plot the .csv files created from the model-training. 

We'll be plotting the:
1) Train- and evaluation loss curves
2) The overall f5, recall and precision on the evaluation data
3) The f5-scores per individual label throughout training (on evaluation)

In [1]:
import pandas as pd
import os
import plotly.graph_objects as go

def plot_learning_curves(folder_name, title, y_limit=None):
    # Define the path to the folder containing CSV files
    folder_path = f'../logs/deberta_v2_E3_Mf5_LRlinear2.5e-05_WR0.1_WD0.01/tensorboard/{folder_name}'
    csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]
    
    # Check if the folder contains CSV files
    if not csv_files:
        print("No CSV files found in the directory.")
        return

    # Create a Plotly graph object figure
    fig = go.Figure()

    # Process each file
    for file in csv_files:
        file_path = os.path.join(folder_path, file)
        data = pd.read_csv(file_path)
        
        # Plotting the data
        fig.add_trace(go.Scatter(
            x=data['Step'],
            y=data['Value'],
            mode='lines',
            name=file[:-4].capitalize()  # Remove '.csv' from label and capitalize
        ))
    
    # Update the layout to add titles and labels
    fig.update_layout(
        title=title,
        xaxis_title="Step",
        yaxis_title="Value",
        legend_title="Metrics",
        font=dict(
            family="Courier New, monospace",
            size=14,
            color="black"
        )
    )
    
    # Setting the y-axis limits if provided
    if y_limit is not None:
        fig.update_yaxes(range=y_limit)
    
    # Show the plot
    fig.show()


In [2]:
plot_learning_curves(folder_name = 'loss', title = 'Loss curves', y_limit=(0, 0.018))

#### Take-aways
The loss curves show that the DeBERTa model is really quick at fine-tuning to the PII labels. The train-curve quickly reduces before plateauing at around step 400. The training and evaluation curves follow each other really nicely, with no signs of overfitting on the training data. 

In [3]:
plot_learning_curves(folder_name = 'metrics_overall', title = 'Overall evaluation f5, recall and precision', y_limit=(0.6, 1.02))

#### Take-aways
It's seen that the model immediately obtains f5 scores at around 0.93, and is able to increase the performance to 0.967 by the end of training.

The f5-score and recall are almost identical throughout training, which is not a coincidence: the $f_{\beta}$-metric, $f_{\beta} = (1+\beta^2)\,\frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$, weights recall $\beta^2$ times as heavily as precision, so with $\beta = 5$ it is dominated by recall. Meanwhile, it's seen that the precision greatly increases throughout training, starting at 0.65 and ending around 0.85-0.9.
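
The effect of $\beta = 5$ is easy to see numerically. In the sketch below, the standard $f_{\beta}$ formula is evaluated for two precision values while the recall is held fixed; the recall of 0.95 is an illustrative value, not a number read off the curves.

```python
def f_beta(precision, recall, beta=5.0):
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

recall = 0.95  # illustrative value
for precision in (0.65, 0.90):
    print(precision, round(f_beta(precision, recall), 3))
```

Even a large jump in precision moves the f5-score only slightly, and the score stays close to the recall in both cases.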

In [4]:
plot_learning_curves(folder_name = 'f5_per_label', title = 'f5 per label', y_limit=(0, 1.1))

#### Take-aways
Comparing the learning curves of the individual labels, it's immediately noticeable that *phone numbers* and *emails* consistently score $\approx 1.0$.

Thankfully, phone numbers and emails are quite easy to catch through simple regex rules, which is most likely why they score as highly as they do.
Meanwhile, URL is not quite as stable, but does end up at 0.99 in the last step. It's important to note here that the labels distinguish between personal URLs and regular URLs (like Wikipedia etc.), which is most likely why regex isn't able to catch all of the URLs.

Usernames seem to be the most difficult class in the beginning; however, the f5 improves greatly through training.

Meanwhile, street addresses are the second-most difficult label to classify at the start, but training barely improves performance in this case. Street addresses usually consist of many consecutive tokens, which could be why they are more difficult to catch.

# Model Prediction

In [15]:
import os, re, sys
import numpy as np
import pandas as pd
import argparse
from pathlib import Path
from transformers import DebertaV2TokenizerFast, DebertaV2ForTokenClassification, DataCollatorForTokenClassification, Trainer, TrainingArguments
import torch

from datasets import Dataset
from spacy.lang.en import English
nlp = English()

from utils import CustomPredTokenizer, get_predictions


In [16]:
def tokenize_text(text):
    """
    Tokenize the given text using the SpaCy English model.
    
    Parameters:
    text (str): The text to tokenize.
    
    Returns:
    tuple: (tokens, trailing_whitespace), two lists of strings of equal length.
    """
    # Process the text through the SpaCy pipeline
    doc = nlp(text)
    
    # Extract tokens
    tokens = [token.text for token in doc]
    # Extract trailing whitespace associated with each token
    trailing_whitespace = [token.whitespace_ for token in doc]

    return tokens, trailing_whitespace

def prepare_text(text):

    tokens, trailing_whitespace = tokenize_text(text)
    
    # Make Dataset with tokens and trailing_whitespace
    dataset = Dataset.from_dict({
        'document': ['example_document'],
        'full_text': [text],
        'tokens': [tokens],
        'trailing_whitespace': [trailing_whitespace]
    })
    
    return dataset

def predict(text, args, tokenizer, model, trainer, nlp):
    # Prepare text
    dataset = prepare_text(text)
    dataset = dataset.map(CustomPredTokenizer(tokenizer=tokenizer, max_length=INFERENCE_MAX_LENGTH), num_proc=os.cpu_count())
    _, preds = get_predictions(dataset, trainer, model, args, nlp, return_all = True)

    return preds.select_columns(['full_text', 'tokens', 'pred_labels', 'trailing_whitespace']).to_pandas()

In [21]:
from types import SimpleNamespace

INFERENCE_MAX_LENGTH = 3072

data = {'conf_thresh': 0.9,
        'url_thresh': 0.1,
        'model_path': '../deberta_v3/model',
        'AMP': False,}
args = SimpleNamespace(**data)


# Instantiate tokenizer
tokenizer = DebertaV2TokenizerFast.from_pretrained(args.model_path)

# Load model and trainer
model = DebertaV2ForTokenClassification.from_pretrained(args.model_path)
collator = DataCollatorForTokenClassification(tokenizer)
train_args = TrainingArguments(".", 
                                per_device_eval_batch_size=1, 
                                report_to="none", 
                                fp16=args.AMP,)

trainer = Trainer(
    model=model, 
    args=train_args, 
    data_collator=collator, 
    tokenizer=tokenizer,
)

In [22]:
print("Predicting....")
preds = predict("Hello, world! My name is Phillip Hoejbjerg. I am an AI engineer. My student number is s184984", args, tokenizer, model, trainer, nlp)


Predicting....



In [29]:
pd.concat([preds['tokens'].explode(), preds['pred_labels'].explode()], axis = 1)

| | tokens | pred_labels |
|---|---|---|
| 0 | Hello | O |
| 0 | , | O |
| 0 | world | O |
| 0 | ! | O |
| 0 | My | O |
| 0 | name | O |
| 0 | is | O |
| 0 | Phillip | B-NAME_STUDENT |
| 0 | Hoejbjerg | I-NAME_STUDENT |
| 0 | . | O |
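
To read entities off such a table, consecutive `B-`/`I-` tags can be grouped into spans. The helper below is a hypothetical sketch, not part of the utils script; it joins tokens with a single space rather than using the stored trailing whitespace, and it drops any `I-` tag that does not continue a matching span.

```python
def group_entities(tokens, labels):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            # A 'B-' tag always starts a new entity
            if current:
                entities.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and current[0] == lab[2:]:
            # An 'I-' tag continues the current entity of the same type
            current[1].append(tok)
        else:
            # 'O' (or a stray 'I-') closes any open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(toks)) for etype, toks in entities]

tokens = ["Hello", ",", "world", "!", "My", "name", "is", "Phillip", "Hoejbjerg", "."]
labels = ["O", "O", "O", "O", "O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"]
print(group_entities(tokens, labels))
```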
