# Training an NER Model on Multilingual PII Data

This notebook covers the training and evaluation of a Named Entity Recognition (NER) model on our multilingual PII dataset using the CoNLL-formatted data.

**Steps:**
1. Load and preprocess CoNLL files
2. Set up the tokenizer and model
3. Prepare data for training
4. Train the model
5. Evaluate the results

**Model Details:**
- We fine-tune a multilingual transformer (XLM-RoBERTa, see Section 3) for token classification
- Training and validation data are read from `train.conll` and `validation.conll`
- Target languages: en, de, fr, it

## 1. Setup and Dependencies

First, we'll install and import the required libraries:
- `transformers`: For the NER model and tokenizer
- `torch`: For deep learning
- `seqeval`: For NER evaluation metrics
- `datasets`: For data handling

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install -q transformers datasets torch seqeval




In [None]:
# Import necessary libraries
import torch
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer
)
from datasets import Dataset, load_dataset
import numpy as np
from seqeval.metrics import classification_report
import json
import os

# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

Using device: cuda


## 2. Load and Process CoNLL Data

We'll:
1. Read the CoNLL files
2. Parse the data into a suitable format
3. Extract unique labels
4. Create label mappings
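
As a minimal sketch of the format these steps assume (one `token label` pair per line, a blank line between sentences), the snippet below parses a small in-memory sample and builds a label mapping from it. The sample text and the helper name `parse_conll_text` are invented for illustration; it is a simplified version of the file reader in the next cell, not a replacement for it.

```python
def parse_conll_text(text):
    """Parse CoNLL-style text into parallel lists of tokens and labels.

    Hypothetical helper for illustration only: it works on a string
    instead of a file path and does not handle comment lines.
    """
    sentences, labels = [], []
    tokens, tags = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence
            if tokens:
                sentences.append(tokens)
                labels.append(tags)
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: token
        tags.append(parts[-1])    # last column: label
    if tokens:  # last sentence when there is no trailing blank line
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels


# Invented sample, not taken from the dataset
sample = """Anna B-GIVENNAME
Schmidt B-SURNAME
lives O
in O
Berlin B-CITY

Call O
555 B-TELEPHONENUM
0100 I-TELEPHONENUM
"""

sents, tags = parse_conll_text(sample)
unique = sorted({t for sent_tags in tags for t in sent_tags})
label2id_demo = {label: i for i, label in enumerate(unique)}

print(sents)
print(label2id_demo)
```

Sorting the label set before enumerating keeps the label-to-ID mapping stable across runs, which matters when a saved model is reloaded later.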

In [None]:
def read_conll_file(file_path):
    """
    Read a CoNLL file and return tokens and labels.
    Each sentence is a list of tokens and a list of labels.
    """
    sentences = []
    labels = []
    current_sent = []
    current_labels = []

    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()

            # Skip comment lines
            if line.startswith('#'):
                continue

            # Empty line marks sentence boundary
            if not line:
                if current_sent:
                    sentences.append(current_sent)
                    labels.append(current_labels)
                    current_sent = []
                    current_labels = []
                continue

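            # Parse token and label. Split on any whitespace and take the
            # first and last columns, so tab-separated or multi-column
            # lines do not raise an unpacking error.
            parts = line.split()
            if len(parts) < 2:
                raise ValueError(f'Malformed CoNLL line: {line!r}')
            token, label = parts[0], parts[-1]
            current_sent.append(token)
            current_labels.append(label)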

        # Don't forget last sentence if file doesn't end with empty line
        if current_sent:
            sentences.append(current_sent)
            labels.append(current_labels)

    return sentences, labels

# Load train and validation data
# These paths assume Colab with Google Drive mounted; the relative paths
# 'data/train.conll' and 'data/validation.conll' are the non-Colab alternative
train_file = '/content/drive/MyDrive/arner/train.conll'
val_file = '/content/drive/MyDrive/arner/validation.conll'

train_sents, train_labels = read_conll_file(train_file)
val_sents, val_labels = read_conll_file(val_file)

print(f'Loaded {len(train_sents):,} training sentences')
print(f'Loaded {len(val_sents):,} validation sentences')

# Get unique labels (sorted for a stable ordering) and create label mapping
unique_labels = sorted({
    label
    for sent_labels in train_labels + val_labels
    for label in sent_labels
})

label2id = {label: i for i, label in enumerate(unique_labels)}
id2label = {i: label for label, i in label2id.items()}

print('\nUnique labels:', unique_labels)
print(f'Number of labels: {len(unique_labels)}')

Loaded 331,093 training sentences
Loaded 82,928 validation sentences

Unique labels: ['B-AGE', 'B-BUILDINGNUM', 'B-CITY', 'B-CREDITCARDNUMBER', 'B-DATE', 'B-DRIVERLICENSENUM', 'B-EMAIL', 'B-GENDER', 'B-GIVENNAME', 'B-IDCARDNUM', 'B-PASSPORTNUM', 'B-SEX', 'B-SOCIALNUM', 'B-STREET', 'B-SURNAME', 'B-TAXNUM', 'B-TELEPHONENUM', 'B-TIME', 'B-TITLE', 'B-ZIPCODE', 'I-BUILDINGNUM', 'I-CITY', 'I-DATE', 'I-DRIVERLICENSENUM', 'I-EMAIL', 'I-GIVENNAME', 'I-SOCIALNUM', 'I-STREET', 'I-SURNAME', 'I-TAXNUM', 'I-TELEPHONENUM', 'I-TIME', 'I-TITLE', 'I-ZIPCODE', 'O']
Number of labels: 35
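
One quick sanity check on this label set is to see which entity types have a `B-` tag but no matching `I-` tag. The sketch below does that for the list printed above; the helper name `types_without_inside_tag` is hypothetical. A missing `I-` tag may simply mean those entities are always a single token in this data, but that is an assumption worth confirming against the files.

```python
# Label list copied from the cell output above
labels = [
    'B-AGE', 'B-BUILDINGNUM', 'B-CITY', 'B-CREDITCARDNUMBER', 'B-DATE',
    'B-DRIVERLICENSENUM', 'B-EMAIL', 'B-GENDER', 'B-GIVENNAME',
    'B-IDCARDNUM', 'B-PASSPORTNUM', 'B-SEX', 'B-SOCIALNUM', 'B-STREET',
    'B-SURNAME', 'B-TAXNUM', 'B-TELEPHONENUM', 'B-TIME', 'B-TITLE',
    'B-ZIPCODE', 'I-BUILDINGNUM', 'I-CITY', 'I-DATE', 'I-DRIVERLICENSENUM',
    'I-EMAIL', 'I-GIVENNAME', 'I-SOCIALNUM', 'I-STREET', 'I-SURNAME',
    'I-TAXNUM', 'I-TELEPHONENUM', 'I-TIME', 'I-TITLE', 'I-ZIPCODE', 'O',
]


def types_without_inside_tag(label_list):
    """Return entity types that have a B- tag but no I- tag (hypothetical helper)."""
    begin = {label[2:] for label in label_list if label.startswith('B-')}
    inside = {label[2:] for label in label_list if label.startswith('I-')}
    return sorted(begin - inside)


missing_inside = types_without_inside_tag(labels)
print(missing_inside)
```

For the list above this reports six types: `AGE`, `CREDITCARDNUMBER`, `GENDER`, `IDCARDNUM`, `PASSPORTNUM`, and `SEX`.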


## 3. Set up Model and Tokenizer

We'll use XLM-RoBERTa as our base model because it:
1. Is pre-trained on about 100 languages
2. Performs well on NER tasks
3. Covers our target languages (en, de, fr, it)

In [None]:
# Initialize tokenizer and model
model_name = 'xlm-roberta-base'  # Multilingual model

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(unique_labels),
    id2label=id2label,
    label2id=label2id
)

print(f'Model parameters: {model.num_parameters():,}')

# Function to tokenize and align labels
def tokenize_and_align_labels(examples):
    """
    Tokenize pre-split words and align word-level labels with subword tokens.
    Only the first subword of each word keeps the word's label; special
    tokens and the remaining subwords get -100 so the loss ignores them.
    """
    tokenized_inputs = tokenizer(
        examples['tokens'],
        truncation=True,
        is_split_into_words=True,
        max_length=512
    )

    labels = []
    for i, label in enumerate(examples['ner_tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs['labels'] = labels
    return tokenized_inputs

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model parameters: 277,479,971
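
To make the alignment rule in `tokenize_and_align_labels` concrete, here is a minimal sketch of the same logic on a hand-written `word_ids` list, so it runs without downloading a tokenizer. The function name `align_labels`, the `word_ids` values, and the label IDs are all made up for illustration.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level label IDs onto subword positions.

    word_ids has one entry per subword: None for special tokens,
    otherwise the index of the word the subword came from.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)           # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])   # first subword of a word
        else:
            aligned.append(ignore_index)           # continuation subword
        previous = word_id
    return aligned


# Hypothetical 3-word sentence whose second word was split into 3 subwords:
# <s> w0 w1 w1 w1 w2 </s>
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [8, 14, 34]  # illustrative label IDs, one per word

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 8, 14, -100, -100, 34, -100]
```

Because -100 is the default ignore index of PyTorch's cross-entropy loss, each word contributes exactly once to the loss and to the evaluation metrics, regardless of how many subwords it was split into.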


## 4. Prepare Training Data

Now we'll:
1. Convert our data to the format expected by the model
2. Create datasets with aligned labels
3. Set up the training arguments

In [None]:
# Convert data to datasets
def create_dataset(sentences, labels):
    # Convert labels to IDs
    label_ids = [[label2id[l] for l in sent_labels] for sent_labels in labels]

    return Dataset.from_dict({
        'tokens': sentences,
        'ner_tags': label_ids
    })

# Create train and validation datasets
train_dataset = create_dataset(train_sents, train_labels)
val_dataset = create_dataset(val_sents, val_labels)

# Apply tokenization
train_tokenized = train_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=train_dataset.column_names
)

val_tokenized = val_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=val_dataset.column_names
)

# Set up data collator
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',           # Directory for checkpoints and outputs
    num_train_epochs=2,               # Total number of training epochs
    per_device_train_batch_size=32,   # Training batch size per device
    per_device_eval_batch_size=32,    # Evaluation batch size per device
    warmup_steps=500,                 # Warmup steps for the learning rate scheduler
    weight_decay=0.01,                # Strength of weight decay
    logging_dir='./logs',             # Directory for storing logs
    logging_steps=100,                # Log every 100 update steps
    eval_strategy='epoch',            # Evaluate at the end of every epoch
    save_strategy='epoch',            # Save a checkpoint at the end of every epoch
    load_best_model_at_end=True,      # Reload the best checkpoint when training ends
    metric_for_best_model='f1'        # Pick the best checkpoint by the 'f1' metric
)

print(f'Training samples: {len(train_tokenized):,}')
print(f'Validation samples: {len(val_tokenized):,}')

Training samples: 331,093
Validation samples: 82,928
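
As a rough worked example of what these settings imply, the sketch below estimates the number of optimizer steps from the sample counts printed above. It assumes a single device and no gradient accumulation (neither is stated in the notebook), so treat the numbers as an estimate rather than what the Trainer will necessarily report.

```python
import math

train_samples = 331_093   # from the output above
val_samples = 82_928      # from the output above
batch_size = 32
epochs = 2
warmup_steps = 500

num_devices = 1           # assumption: single GPU
grad_accum_steps = 1      # assumption: no gradient accumulation

effective_batch = batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps
eval_batches = math.ceil(val_samples / (batch_size * num_devices))

print(f'Steps per epoch:  {steps_per_epoch:,}')    # 10,347
print(f'Total steps:      {total_steps:,}')        # 20,694
print(f'Warmup fraction:  {warmup_fraction:.1%}')  # 2.4%
print(f'Eval batches:     {eval_batches:,}')       # 2,592
```

Under these assumptions, the 500 warmup steps cover only about 2.4% of training, and logging every 100 steps produces roughly 200 log entries over the run.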


## 5. Train the Model

Now we'll:
1. Define evaluation metrics
2. Set up the trainer
3. Train the model
4. Save the best checkpoint
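
The metrics used here are entity-level: a prediction counts only if both the entity type and the full span match. The sketch below is a simplified, pure-Python stand-in for that idea, not seqeval itself. The helper names `extract_spans` and `entity_prf` are hypothetical, and the scores are micro-averaged over all entities, whereas the notebook's `compute_metrics` reports seqeval's macro average.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in a BIO tag list."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(tags + ['O']):  # sentinel closes a trailing span
        # A B- tag, or an I- tag of a different type, opens a new span
        starts_new = tag.startswith('B-') or (
            tag.startswith('I-') and tag[2:] != current
        )
        if current is not None and (tag == 'O' or starts_new):
            spans.add((current, start, i))
            start, current = None, None
        if starts_new:
            start, current = i, tag[2:]
    return spans


def entity_prf(true_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = fp = fn = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold = extract_spans(true_tags)
        pred = extract_spans(pred_tags)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented example: the CITY span is only partially predicted
true_seqs = [['B-GIVENNAME', 'B-SURNAME', 'O', 'B-CITY', 'I-CITY']]
pred_seqs = [['B-GIVENNAME', 'O', 'O', 'B-CITY', 'O']]

precision, recall, f1 = entity_prf(true_seqs, pred_seqs)
print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

In this example three of five tokens are tagged correctly, yet only one of three gold entities is matched exactly, which is why entity-level F1 is a stricter measure than token accuracy.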

In [None]:
def compute_metrics(eval_preds):
    """
    Compute entity-level metrics for NER evaluation with seqeval.
    Returns macro-averaged precision, recall, and F1 score.
    """
    predictions, labels = eval_preds
    predictions = np.argmax(predictions, axis=2)

    # Drop positions labeled -100 (special tokens and non-first subwords)
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id2label[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    # Get the classification report
    results = classification_report(
        true_labels,
        true_predictions,
        output_dict=True
    )

    return {
        'precision': results['macro avg']['precision'],
        'recall': results['macro avg']['recall'],
        'f1': results['macro avg']['f1-score']
    }

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train the model
print('Starting training...')
trainer.train()


# Save the model (the best checkpoint, since load_best_model_at_end=True)
output_dir = '/content/drive/MyDrive/arner'
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print(f'\nModel saved to {output_dir}')