# Welcome to my Anonymization Notebook
The aim of this notebook is to implement a token classifier that performs Named Entity Recognition (NER) on textual data using state-of-the-art models, specifically Transformers.

This can be used, for example, to anonymize data, in other words to remove sensitive information (names, locations, dates) from textual data (documents, emails, etc.).
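
To make the anonymization goal concrete, here is a minimal sketch of how predicted entity tags could be turned into redacted text. The `redact` helper, the `[TYPE]` placeholder format, and the example sentence are illustrative assumptions and are not part of this notebook's pipeline; the tags follow the BIO scheme used by CoNLL-2003.

```python
def redact(tokens, tags):
    """Replace each tagged entity with a single placeholder such as [PER].

    Hypothetical helper for illustration: `tokens` are words and `tags`
    are BIO labels (e.g. "B-PER", "I-PER", "O") of the same length.
    """
    output = []
    previous_type = None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            output.append(token)
            previous_type = None
            continue
        prefix, entity_type = tag.split("-", 1)
        # Emit one placeholder per entity: skip tokens that continue the same entity.
        if prefix == "B" or entity_type != previous_type:
            output.append(f"[{entity_type}]")
        previous_type = entity_type
    return " ".join(output)


# Made-up example sentence with hand-written tags.
tokens = ["John", "Smith", "flew", "to", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(redact(tokens, tags))
```

Any entity the model fails to tag is left in the text unchanged, which is why missed entities matter most for this use case.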

## Packages used in this notebook
We start by installing the required packages.

In [36]:
#! pip install datasets transformers seqeval

We use a GPU in order to speed up the training of our model. Note that a GPU also speeds up the inference (prediction) phase.

In [15]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"We will be using the {device} device")

We will be using the cuda device


## Importing the data
We will use the CoNLL-2003 dataset, an English text dataset used for Named Entity Recognition. It contains four types of named entities: persons, locations, organizations, and miscellaneous entities that do not belong to the previous three groups.

In [16]:
from datasets import load_dataset, load_metric

datasets = load_dataset("conll2003")

Reusing dataset conll2003 (/root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6)


## Importing the tokenizer
The tokenizer is used to preprocess the data, in other words to:
*   tokenize the text, dividing it into tokens
*   add padding so that all inputs have the same length
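
These two steps can be illustrated without downloading a model. The sketch below is a toy: the vocabulary, the `toy_encode` function, and the whitespace splitting are assumptions made up for illustration (the real tokenizer splits words into subword pieces), but the padding and attention-mask logic follows the same idea.

```python
# Toy vocabulary, invented for illustration; id 0 is reserved for padding.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "john": 2, "lives": 3, "in": 4, "paris": 5}


def toy_encode(text, max_length):
    """Lowercase, split on whitespace, map words to ids, then truncate or pad."""
    ids = [toy_vocab.get(word, toy_vocab["[UNK]"]) for word in text.lower().split()]
    ids = ids[:max_length]                      # truncation
    attention_mask = [1] * len(ids)             # 1 marks real tokens
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [toy_vocab["[PAD]"]] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding
    }


encoded = toy_encode("John lives in Paris", max_length=6)
print(encoded)
```

The attention mask tells the model which positions hold real tokens and which are padding that should be ignored.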



In [17]:
import transformers

print(transformers.__version__)

4.9.1


In [18]:
task = "ner" 
model_checkpoint = "distilbert-base-uncased"
batch_size = 32

In [19]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [20]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

## Data preprocessing
We transform the data into the appropriate shape and format for our model by:
*   tokenizing and padding the input examples of the original dataset
*   removing unwanted columns
*   fetching data from the dataset and serving it in batches using a PyTorch `DataLoader`



In [21]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True, padding='max_length')

    labels = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to -100 so that only the first token of
            # each word contributes to the loss.
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
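
The alignment rule used in `tokenize_and_align_labels` can be checked on a small hand-made input. The sketch below isolates that rule in a hypothetical `align_labels` helper; the `word_ids` list is written by hand to mimic what a fast tokenizer might return for a three-word sentence whose second word is split into two pieces, and is not actual tokenizer output.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first token; mark everything else as ignored."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous_word_idx:
            # Special/padding token, or a later piece of an already-labelled word.
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_idx])
        previous_word_idx = word_idx
    return aligned


# Hand-written layout: [CLS], word 0, word 1 (two pieces), word 2, [SEP], [PAD]
word_ids = [None, 0, 1, 1, 2, None, None]
word_labels = [1, 2, 0]  # one label id per word
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

The result has one entry per token, so it can be stored next to `input_ids` as the `labels` column.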

In [22]:
label_list = datasets["train"].features[f"{task}_tags"].feature.names
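
For reference, the tags follow the BIO scheme: `B-` marks the first word of an entity, `I-` a continuation, and `O` a word outside any entity. The list below is written out by hand as an assumption about what `label_list` contains for this dataset; print `label_list` in your own run to confirm the order.

```python
# Assumed contents of label_list for CoNLL-2003 (typed by hand, not loaded from the dataset).
assumed_label_list = [
    "O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]

# Mappings between integer tag ids and label names.
id2label = dict(enumerate(assumed_label_list))
label2id = {label: idx for idx, label in id2label.items()}

# Entity types are what remains after stripping the B-/I- prefix.
entity_types = sorted(
    {label.split("-", 1)[1] for label in assumed_label_list if label != "O"}
)
print(entity_types)
```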

In [23]:
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['chunk_tags', 'id', 'ner_tags', 'pos_tags', 'tokens'])


Loading cached processed dataset at /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6/cache-30f47d5d21e49b09.arrow


  0%|          | 0/4 [00:00<?, ?ba/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6/cache-21fdc2c39d9c0b5c.arrow


In [24]:
import torch 
tokenized_datasets.set_format("torch")
train_dataloader = torch.utils.data.DataLoader(tokenized_datasets['train'], batch_size=batch_size)
val_dataloader = torch.utils.data.DataLoader(tokenized_datasets['validation'], batch_size=200)
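
As a quick sanity check on batching, the number of batches a dataloader yields is the dataset size divided by the batch size, rounded up. The split sizes below are assumptions (the commonly reported sizes of the CoNLL-2003 train and validation splits), not values read from this notebook; compare them with `len(tokenized_datasets['train'])` in your own run.

```python
import math

train_size = 14041        # assumed number of training sentences
validation_size = 3250    # assumed number of validation sentences

train_batches = math.ceil(train_size / 32)          # batch_size used for training
validation_batches = math.ceil(validation_size / 200)  # batch size used for validation
print(train_batches, validation_batches)
```

Under these assumptions the training dataloader yields 439 batches per epoch, which is consistent with the 439 steps shown by the training progress bar.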

## Importing the Token Classification model

In [25]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))
model.to(device)

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForTokenClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN t

DistilBertForTokenClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
          

In [26]:
from transformers import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

from transformers import get_scheduler

num_epochs = 1
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
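
With zero warmup steps, the `"linear"` schedule decays the learning rate from its initial value to zero over the training steps. The function below is a plain-Python sketch of that rule as I understand it, not the library's code; `linear_schedule_lr` is a hypothetical name, and the 439 steps are taken from the progress bar of the training run in this notebook.

```python
def linear_schedule_lr(step, base_lr=5e-5, num_training_steps=439, num_warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


for step in (0, 100, 439):
    print(step, linear_schedule_lr(step))
```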


In [31]:
first_batch = next(iter(train_dataloader))
print(first_batch.keys())
print(len(first_batch['attention_mask']))

dict_keys(['attention_mask', 'input_ids', 'labels'])
32


We fine-tune our model by training it on `train_dataloader`. Fine-tuning means that most of the model's weights are not initialized randomly but come from a previous training of the model (pre-training); only the new classification head starts from random weights.<br>
We use tqdm to follow the training with a progress bar.

In [32]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/439 [00:00<?, ?it/s]
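
The `-100` labels assigned during preprocessing matter for `outputs.loss`: PyTorch's cross-entropy loss skips targets equal to its `ignore_index`, which defaults to -100, and as far as I know the token classification head relies on that default. The following is a plain-Python sketch of the idea with made-up logits, not the library's implementation; `masked_cross_entropy` is a hypothetical name.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    losses = []
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # ignored positions contribute nothing to the loss
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(token_logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in token_logits)
        )
        losses.append(log_sum_exp - token_logits[label])
    return sum(losses) / len(losses)


# Three token positions, two classes; the last position is ignored.
toy_logits = [[2.0, 0.0], [0.0, 0.0], [5.0, -5.0]]
toy_labels = [0, 1, -100]
toy_loss = masked_cross_entropy(toy_logits, toy_labels)
print(toy_loss)
```

Changing the logits at the ignored position leaves the loss unchanged, so padding and non-first subword tokens do not influence training. The sketch assumes at least one position is not ignored.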

Evaluating the trained model on the validation dataloader.

In [33]:
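import numpy as np
from datasets import load_metric

metric = load_metric("seqeval")

# Note: this helper is defined but never called below; the evaluation loop
# computes the predictions directly from the model outputs.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)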

pred = []
labels = []

model.eval()
for batch in val_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    pred += predictions.tolist()
    labels += batch["labels"].tolist()



Downloading:   0%|          | 0.00/2.48k [00:00<?, ?B/s]

In [34]:
# Remove ignored positions (label -100: special tokens, padding, and non-first subword tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(pred, labels)
]
true_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(pred, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
print(results)


{'LOC': {'precision': 0.8921721099015034, 'recall': 0.9368535655960806, 'f1': 0.9139670738183749, 'number': 1837}, 'MISC': {'precision': 0.804583835946924, 'recall': 0.7234273318872018, 'f1': 0.7618503712164477, 'number': 922}, 'ORG': {'precision': 0.8151791988756149, 'recall': 0.8650260999254288, 'f1': 0.8393632416787264, 'number': 1341}, 'PER': {'precision': 0.9694489907255864, 'recall': 0.9647122692725298, 'f1': 0.9670748299319727, 'number': 1842}, 'overall_precision': 0.8854339873628201, 'overall_recall': 0.8961629081117469, 'overall_f1': 0.8907661425225829, 'overall_accuracy': 0.9800630816556988}
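
The per-type scores above are computed at the entity level: a predicted entity counts as correct only if its type and its full span match a true entity. The sketch below is a simplified, single-sentence version of that idea under my own assumptions; it is not seqeval's code, and `extract_entities` and `entity_scores` are hypothetical names.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    entities = set()
    start, entity_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel "O" closes a trailing entity
        prefix, _, label = tag.partition("-")
        continues = start is not None and prefix == "I" and label == entity_type
        if start is not None and not continues:
            entities.add((entity_type, start, i))
            start, entity_type = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, entity_type = i, label
    return entities


def entity_scores(true_tags, predicted_tags):
    """Entity-level precision, recall and F1 for one sentence."""
    true_entities = extract_entities(true_tags)
    predicted_entities = extract_entities(predicted_tags)
    correct = len(true_entities & predicted_entities)
    precision = correct / len(predicted_entities) if predicted_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example: the model finds the person but misses the location.
true_tags = ["B-PER", "I-PER", "O", "B-LOC"]
predicted_tags = ["B-PER", "I-PER", "O", "O"]
precision, recall, f1 = entity_scores(true_tags, predicted_tags)
print(precision, recall, f1)
```

In this example every predicted entity is correct, so precision is perfect, but one of the two true entities is missed, which halves the recall.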


In [35]:
print( {
    "precision": results["overall_precision"],
    "recall": results["overall_recall"],
    "f1": results["overall_f1"],
    "accuracy": results["overall_accuracy"],
} )

{'precision': 0.8854339873628201, 'recall': 0.8961629081117469, 'f1': 0.8907661425225829, 'accuracy': 0.9800630816556988}


## Summary of the Results
In the case of anonymization, recall is the most important measure: the goal is to detect every named entity, so we want as few false negatives as possible.

We reach roughly 90% overall recall (0.896) on the validation set, which means that the model detects about 90% of the named entities and would leave about 10% of them unredacted.
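
The overall recall can be reconstructed from the per-type results printed earlier, which also shows how many entities are missed in absolute terms. This is a small worked check that assumes the overall score is the micro-average (total detected entities divided by total true entities); the recall and `number` values are copied from the results dictionary.

```python
# Per-type (recall, number of true entities), copied from the printed results.
per_type = {
    "LOC": (0.9368535655960806, 1837),
    "MISC": (0.7234273318872018, 922),
    "ORG": (0.8650260999254288, 1341),
    "PER": (0.9647122692725298, 1842),
}

total_entities = sum(number for _, number in per_type.values())
# recall * number recovers the count of correctly detected entities for each type.
detected = sum(round(recall * number) for recall, number in per_type.values())
missed = total_entities - detected
overall_recall = detected / total_entities
print(total_entities, detected, missed, overall_recall)
```

Of the 5,942 entities in the validation set, 617 are missed, most of them in the MISC and ORG categories, which have the lowest recall.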
