# Lab Report: NER using Transformer-Based Models

## Introduction

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that involves identifying and classifying entities within text into predefined categories such as names of persons, organizations, locations, dates, and more. Accurate NER is crucial for various applications, including information extraction, question answering, and machine translation.

This exercise aims to implement an NER system using transformer-based neural architectures, specifically leveraging HuggingFace's `Camembert` model for the French language. By fine-tuning pre-trained transformer models on a specific NER task, we explore the effectiveness of contextualized embeddings in understanding and classifying entities within textual data.


In [None]:
# Installation
!pip install polyglot
!pip install pyicu
!pip install datasets==2.21.0
!pip install transformers
!pip install torch
!pip install seqeval
!pip install pycld2
!pip install morfessor
!pip install evaluate
!pip install accelerate -U


In [None]:
# Download hf models and dataset
!huggingface-cli download --repo-type dataset rmyeid/polyglot_ner --local-dir polyglot_ner --force-download
!huggingface-cli download --resume-download almanach/camembert-base --local-dir camembert-base


In [66]:
import numpy as np
import torch
from datasets import load_dataset, load_metric
from transformers import (
    CamembertTokenizerFast,
    CamembertForTokenClassification,
    TrainingArguments,
    Trainer
)
from collections import Counter
from datasets import Dataset, DatasetDict


# 2. Load and Explore the Dataset


In [73]:
dataset = load_dataset('./polyglot_ner/polyglot_ner.py', 'fr')
dataset


DatasetDict({
    train: Dataset({
        features: ['id', 'lang', 'words', 'ner'],
        num_rows: 418411
    })
})

In [74]:
# total number of sentences in the dataset
total_sentences = len(dataset['train'])
print(f"Total sentences in the dataset: {total_sentences}")

Total sentences in the dataset: 418411


# 3. Data Preparation

### Language Choice

For this exercise, we selected **French** from the Polyglot-NER dataset. The selection criteria included:
- **Non-English Language:** Ensuring the language is not English to explore NER capabilities in other linguistic contexts.
- **Dataset Size:** The French subset contains 418,411 sentences, far more than the 7,000-sentence minimum required for the exercise.
- **Model Availability:** A pre-trained HuggingFace `Camembert` model is available for French, facilitating the fine-tuning process.

### Dataset Details

The Polyglot-NER dataset encompasses 40 languages, each annotated for named entities. For French (`'fr'`), the dataset comprises a diverse range of sentences with various entity types annotated in the IOB format. The IOB tagging scheme labels tokens as:
- **B-** (Beginning): The first token of a named entity.
- **I-** (Inside): Tokens inside a named entity.
- **O** (Outside): Tokens outside any named entity. Unlike `B-` and `I-`, this tag carries no entity-type suffix.
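
As a small illustration of the scheme, the sketch below groups a sequence of IOB tags back into entity spans. The helper name `iob_to_spans` and the example sentence are hypothetical and not part of the notebook; stray `I-` tags are handled leniently by opening a new span.

```python
def iob_to_spans(tags):
    """Group IOB tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # "O", a new "B-" tag, or an "I-" tag of a different type closes an open span.
        closes = (
            tag == "O"
            or tag.startswith("B-")
            or (tag.startswith("I-") and tag[2:] != current)
        )
        if closes and current is not None:
            spans.append((current, start, i))
            start, current = None, None
        if tag != "O" and current is None:
            # Open a span on "B-" (or on a stray "I-" with nothing open).
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


words = ["Emmanuel", "Macron", "visite", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
for entity_type, start, end in iob_to_spans(tags):
    print(entity_type, " ".join(words[start:end]))
```

Two adjacent `B-` tags of the same type yield two separate entities, which is the distinction the `B-`/`I-` prefixes exist to encode.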

### Data Splitting

To evaluate the model's performance under different training scenarios, the dataset was divided as follows:
- **Training Set 1:** 1,000 sentences for initial fine-tuning.
- **Training Set 2:** 3,000 sentences for extended fine-tuning.
- **Evaluation Set:** 2,000 sentences used to assess model performance.

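The dataset is shuffled with a fixed seed (42) and split at random, so the subsets are not explicitly stratified by entity type; with subsets of this size, their entity-type distributions should nonetheless be broadly similar. Training Set 1 is the first 1,000 sentences of Training Set 2, so the two training sets are nested rather than disjoint.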
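
One way to check how representative a subset is would be to compare the share of each entity type across subsets. The sketch below does this on toy data; the helper `entity_type_shares` and the tag lists are hypothetical stand-ins for the `ner` column of each subset.

```python
from collections import Counter


def entity_type_shares(tag_sequences):
    """Return the fraction of entity mentions per type, counting only B- tags."""
    counts = Counter(
        tag[2:]
        for tags in tag_sequences
        for tag in tags
        if tag.startswith("B-")
    )
    total = sum(counts.values())
    return {etype: n / total for etype, n in counts.items()} if total else {}


# Toy stand-ins for the 'ner' columns of two subsets (hypothetical data).
train_tags = [["B-PER", "I-PER", "O"], ["B-LOC", "O"], ["B-LOC", "O", "B-ORG"]]
eval_tags = [["B-LOC", "O"], ["B-PER", "O"]]

print(entity_type_shares(train_tags))
print(entity_type_shares(eval_tags))
```

Large differences between the printed shares would be a sign that a subset is not representative and that a stratified split might be worth considering.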


In [79]:
def convert_to_iob_format(data):
    """Convert one example's raw tags (e.g. 'PER') to IOB tags (e.g. 'B-PER')."""
    iob_ner = []
    prev_tag = "O"

    for tag in data['ner']:
        if tag.startswith("B-") or tag.startswith("I-") or tag == "O":
            # Already in IOB form, or outside any entity: keep as-is
            iob_ner.append(tag)
        elif tag != prev_tag:
            # First token of a new entity
            iob_ner.append(f"B-{tag}")
        else:
            # Same raw tag as the previous token: treat as a continuation
            iob_ner.append(f"I-{tag}")
        prev_tag = tag
    data['ner'] = iob_ner
    return data


def convert_dataset_to_iob(dataset):
    """Apply the IOB conversion to every example and return a list of dicts."""
    return [convert_to_iob_format(data) for data in dataset]

In [80]:
dataset['train'] = convert_dataset_to_iob(dataset['train'])

In [81]:
dataset = DatasetDict({
    "train": Dataset.from_list(dataset['train'])
})

In [82]:
dataset = dataset.shuffle(seed=42)
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'lang', 'words', 'ner'],
        num_rows: 416411
    })
})

In [83]:
# Evaluation set: 2,000 sentences
eval_size = 2000

# Split the dataset into training and evaluation
dataset = dataset['train'].train_test_split(test_size=eval_size, seed=42)
dataset['eval'] = dataset.pop('test')

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'lang', 'words', 'ner'],
        num_rows: 414411
    })
    eval: Dataset({
        features: ['id', 'lang', 'words', 'ner'],
        num_rows: 2000
    })
})

In [84]:
# Training subset

# Training set 1: 1,000 sentences
train_small = dataset['train'].select(range(1000))

# Training set 2: 3,000 sentences
train_medium = dataset['train'].select(range(3000))

# Evaluation set
eval_dataset = dataset['eval']

train_small, train_medium, eval_dataset

(Dataset({
     features: ['id', 'lang', 'words', 'ner'],
     num_rows: 1000
 }),
 Dataset({
     features: ['id', 'lang', 'words', 'ner'],
     num_rows: 3000
 }),
 Dataset({
     features: ['id', 'lang', 'words', 'ner'],
     num_rows: 2000
 }))

# 3. Tokenization and Label Alignment

## Model Selection and Tokenization

### Model Choice

We used **CamemBERT** (`CamembertForTokenClassification`) from HuggingFace, a transformer model based on the RoBERTa architecture and pre-trained on French text. Fine-tuning it for NER reuses its pre-trained contextual embeddings; only the token-classification head on top is newly initialized and learned from scratch.

### Tokenizer Alignment

Tokenization is a critical preprocessing step that converts raw text into tokens compatible with the transformer model. We employed `CamembertTokenizerFast` to ensure consistency with the Camembert model.

**Key Steps in Tokenization and Label Alignment:**
1. **Tokenization:** The tokenizer splits sentences into subword tokens, handling cases where words are broken down into smaller units.
2. **Label Alignment:** Since subword tokenization can split entities into multiple tokens, we align the original IOB labels with the tokenized outputs. This involves:
   - Assigning the original label to the first sub-token of a word.
   - Giving subsequent sub-tokens the same label when `label_all_tokens` is `True` (the setting used here), or `-100` otherwise.
   - Assigning a special label (`-100`) to padding and special tokens to exclude them from loss computation.

Proper alignment ensures that the model accurately learns the association between tokens and their corresponding entity labels, even when words are split into sub-tokens.
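
The alignment rule can be sketched without a tokenizer by working directly on a word-index mapping of the kind returned by `word_ids()`. The function `align_labels` and the example mapping below are hypothetical; the label ids assume the mapping B-PER = 2, I-PER = 5, O = 6.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True, ignore_index=-100):
    """Expand word-level label ids to sub-token level using a word_ids mapping."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special or padding token: excluded from the loss
            aligned.append(ignore_index)
        elif word_idx != previous:
            # First sub-token of a word keeps the word's label
            aligned.append(word_labels[word_idx])
        else:
            # Later sub-token of the same word
            aligned.append(word_labels[word_idx] if label_all_tokens else ignore_index)
        previous = word_idx
    return aligned


# Hypothetical mapping: a 3-word sentence whose second word splits in two,
# wrapped by one special token on each side and followed by one padding slot.
word_ids = [None, 0, 1, 1, 2, None, None]
word_labels = [2, 5, 6]  # assumed ids for B-PER, I-PER, O

print(align_labels(word_ids, word_labels, label_all_tokens=True))
# [-100, 2, 5, 5, 6, -100, -100]
print(align_labels(word_ids, word_labels, label_all_tokens=False))
# [-100, 2, 5, -100, 6, -100, -100]
```

With `label_all_tokens=False`, only the first sub-token of each word contributes to the loss and to the evaluation.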


In [11]:
tokenizer = CamembertTokenizerFast.from_pretrained('./camembert-base')


In [85]:
# Create a label mapping.
# Collect all unique labels from the dataset
label_counter = Counter()
for split in ['train', 'eval']:
    for labels in dataset[split]['ner']:
        label_counter.update(labels)
print(label_counter)

label_list = list(label_counter.keys())
label_list.sort()  # Sort labels for consistency
print(f"Labels: {label_list}")

# Create label to ID and ID to label mappings
label_to_id = {label: i for i, label in enumerate(label_list)}
id_to_label = {i: label for label, i in label_to_id.items()}
label_to_id, id_to_label

Counter({'O': 9383763, 'B-LOC': 181327, 'B-PER': 104383, 'I-PER': 83280, 'I-LOC': 76308, 'B-ORG': 63013, 'I-ORG': 61512})
Labels: ['B-LOC', 'B-ORG', 'B-PER', 'I-LOC', 'I-ORG', 'I-PER', 'O']


({'B-LOC': 0,
  'B-ORG': 1,
  'B-PER': 2,
  'I-LOC': 3,
  'I-ORG': 4,
  'I-PER': 5,
  'O': 6},
 {0: 'B-LOC',
  1: 'B-ORG',
  2: 'B-PER',
  3: 'I-LOC',
  4: 'I-ORG',
  5: 'I-PER',
  6: 'O'})

In [86]:
label_all_tokens = True  # Set to True to label all sub-tokens

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["words"],
        is_split_into_words=True,
        truncation=True,
        padding='max_length',
        max_length=128
    )

    labels = []
    for i, label in enumerate(examples["ner"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to words
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                # Special token
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # Start of a new word
                label_ids.append(label_to_id[label[word_idx]])
            else:
                # Later sub-token of the same word. With label_all_tokens=True
                # the word's label (including a B- label) is copied to it.
                if label_all_tokens:
                    label_ids.append(label_to_id[label[word_idx]])
                else:
                    label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [87]:
# Apply the function to the datasets
train_small_tokenized = train_small.map(tokenize_and_align_labels, batched=True)
train_medium_tokenized = train_medium.map(tokenize_and_align_labels, batched=True)
eval_tokenized = eval_dataset.map(tokenize_and_align_labels, batched=True)

train_small_tokenized, train_medium_tokenized, eval_tokenized

(Dataset({
     features: ['id', 'lang', 'words', 'ner', 'input_ids', 'attention_mask', 'labels'],
     num_rows: 1000
 }),
 Dataset({
     features: ['id', 'lang', 'words', 'ner', 'input_ids', 'attention_mask', 'labels'],
     num_rows: 3000
 }),
 Dataset({
     features: ['id', 'lang', 'words', 'ner', 'input_ids', 'attention_mask', 'labels'],
     num_rows: 2000
 }))

In [88]:
# Set the format for PyTorch

train_small_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
train_medium_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
eval_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])


# 4. Define Evaluation Metrics

## Evaluation Metrics

To assess the performance of our NER models, we employed two key evaluation metrics:

1. **F1-Micro Score:**
   - **Definition:** The micro F1 score calculates metrics globally by counting the total true positives, false negatives, and false positives.
   - **Relevance:** It provides a single performance measure that accounts for both precision and recall across all entity types, offering an overall effectiveness of the model.

2. **F1-Macro Score:**
   - **Definition:** The macro F1 score computes the F1 score independently for each entity type and then takes the average.
   - **Relevance:** It treats all entity types equally, highlighting the model's performance on less frequent or smaller classes, thereby addressing class imbalance issues.

**Implementation:**
We used the `seqeval` library, which scores complete entity spans rather than individual tokens. The evaluation process involved:
- **Prediction Extraction:** Taking the most probable label for each token with `np.argmax`.
- **Label Filtering:** Dropping every position whose reference label is `-100`, i.e. special tokens and padding.
- **Metric Computation:** Reading `f1_micro` from seqeval's overall F1 and computing `f1_macro` as the mean of the per-entity-type F1 scores.

These metrics collectively provide a comprehensive understanding of the model's NER capabilities, balancing overall accuracy with performance across individual entity categories.
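
The difference between the two averages is easiest to see on a small worked example. The counts below are hypothetical, chosen so that one rare entity type is recognized poorly; the helper `f1` is an illustrative name.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical entity-level counts per type: (tp, fp, fn).
counts = {"LOC": (80, 20, 20), "PER": (40, 10, 10), "ORG": (5, 15, 15)}

# Macro: average the per-type F1 scores, each type weighted equally.
per_type = {etype: f1(*c) for etype, c in counts.items()}
f1_macro = sum(per_type.values()) / len(per_type)

# Micro: pool the counts over all types, then compute a single F1.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
f1_micro = f1(tp, fp, fn)

print(per_type)
print(f"micro={f1_micro:.4f} macro={f1_macro:.4f}")
# micro=0.7353 macro=0.6167
```

The weak `ORG` score pulls the macro average down much more than the micro average, because the pooled counts are dominated by the frequent types.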


In [99]:
def process_labels(labels):
    entity_labels = {label.split("-")[-1] for label in labels if label != "O"}

    entity_labels = sorted(entity_labels)

    return entity_labels


In [100]:
metric = load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [id_to_label[pred] for (pred, label_id) in zip(prediction, label) if label_id != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id_to_label[label_id] for (pred, label_id) in zip(prediction, label) if label_id != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)

    # Micro F1 score (overall)
    f1_micro = results.get("overall_f1", 0.0)

    # Entity types without the B-/I- prefix (excludes 'O'), e.g. ['LOC', 'ORG', 'PER']
    entity_labels = process_labels(label_list)
    # Per-label F1 scores
    per_label_f1 = []
    for label in entity_labels:
        if label in results:
            label_f1 = results[label].get('f1', 0.0)
            per_label_f1.append(label_f1)
        else:
            print(f"Label '{label}' not found in results. Assigning 0.")
            per_label_f1.append(0.0)

    # Compute macro F1 by averaging valid F1 scores
    if per_label_f1:
        f1_macro = np.mean(per_label_f1)
    else:
        f1_macro = 0.0
    return {
        "f1_micro": f1_micro,
        "f1_macro": f1_macro,
    }


# 5. Model Training and Evaluation
## Training Process

### Training Configurations

We fine-tuned three distinct models to explore the impact of training data size and model parameters on NER performance:

1. **Model 1:** Fine-tuned with **1,000 sentences**.
2. **Model 2:** Fine-tuned with **3,000 sentences**.
3. **Model 3:** Fine-tuned with **3,000 sentences** and **frozen embeddings**.

**Training Arguments:**
- **Output Directory:** Specifies where to save model checkpoints and logs.
- **Number of Epochs:** Set to 10 to allow sufficient training iterations.
- **Batch Size:** A per-device batch size of 8 balances computational efficiency with memory constraints.
- **Warmup Steps:** 500 steps to gradually increase the learning rate, aiding in stable training.
- **Weight Decay:** 0.01 to prevent overfitting by penalizing large weights.
- **Logging:** Configured to log training progress every 10 steps.
- **Evaluation Strategy:** Evaluates the model at the end of each epoch to monitor performance.
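
To get a feel for how these settings interact, the sketch below computes the number of optimizer steps per run and the learning rate at a given step. It assumes a linear warmup followed by linear decay and a peak learning rate of 5e-5, which are the `Trainer` defaults as far as we know and are not set explicitly in the notebook; `total_steps` and `linear_lr` are hypothetical helpers, and a single device without gradient accumulation is assumed.

```python
import math


def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps for one device with no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_lr(step, warmup_steps, num_training_steps, peak_lr=5e-5):
    """Linear warmup to peak_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = num_training_steps - step
    return peak_lr * max(0.0, remaining / max(1, num_training_steps - warmup_steps))


steps_small = total_steps(1000, 8, 10)    # 1,000-sentence run
steps_medium = total_steps(3000, 8, 10)   # 3,000-sentence runs
print(steps_small, steps_medium)          # 1250 3750
print(f"warmup share: {500 / steps_small:.0%} vs {500 / steps_medium:.1%}")
print(linear_lr(250, 500, steps_small), linear_lr(500, 500, steps_small))
```

Under these assumptions, 500 warmup steps take up 40% of the 1,250 steps in the 1,000-sentence run but only about 13% of the 3,750 steps in the 3,000-sentence runs, so the same warmup setting affects the two data sizes quite differently.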

### Fine-Tuning Strategies

1. **Model 1 (1,000 Sentences):**
   - Utilizes a smaller training dataset to assess the model's ability to learn from limited data.
   
2. **Model 2 (3,000 Sentences):**
   - Expands the training dataset to evaluate the effect of increased data on model performance.
   
3. **Model 3 (3,000 Sentences with Frozen Embeddings):**
   - Freezes the embedding layers (`model.roberta.embeddings.parameters()`), restricting the model from updating these parameters during training.
   - This approach tests whether retaining pre-trained embeddings without further adjustment impacts NER accuracy.

### Resource Management

Given the computational constraints, especially on platforms like Google Colab, we ensured efficient memory usage by:
- Loading and fine-tuning one model at a time.
- Deleting models from memory post-training using `del model` and `del trainer`.
- Clearing CUDA cache with `torch.cuda.empty_cache()` to free GPU memory before loading the next model.

### First Model

In [90]:
# First Model: Fine-tuned with 1,000 sentences
# Initialize the model
num_labels = len(label_list)
model = CamembertForTokenClassification.from_pretrained('./camembert-base',
                                                        num_labels=num_labels,
                                                        id2label=id_to_label,
                                                        label2id=label_to_id)

Some weights of CamembertForTokenClassification were not initialized from the model checkpoint at ./camembert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [101]:
# Training arguments
training_args = TrainingArguments(
    output_dir='./training_show/results_first',          # Output directory
    num_train_epochs=10,                    # Total number of training epochs
    per_device_train_batch_size=8,         # Batch size per device during training
    per_device_eval_batch_size=8,          # Batch size for evaluation
    warmup_steps=500,                      # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,                     # Strength of weight decay
    logging_dir='./training_show/logs_first',            # Directory for storing logs
    logging_steps=10,
    evaluation_strategy="epoch",           # Evaluation at the end of each epoch
    report_to="none",
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_small_tokenized,
    eval_dataset=eval_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)


In [103]:
trainer.train()


| Epoch | Training Loss | Validation Loss | Model Preparation Time | F1 Micro | F1 Macro |
|-------|---------------|-----------------|------------------------|----------|----------|
| 1     | 0.0308        | 0.269762        | 0.0077                 | 0.532626 | 0.493102 |
| 2     | 0.0213        | 0.278466        | 0.0077                 | 0.527909 | 0.48915  |
| 3     | 0.0154        | 0.286154        | 0.0077                 | 0.523112 | 0.485    |
| 4     | 0.0275        | 0.322412        | 0.0077                 | 0.515991 | 0.468413 |
| 5     | 0.0347        | 0.288328        | 0.0077                 | 0.526527 | 0.486187 |
| 6     | 0.0154        | 0.330114        | 0.0077                 | 0.528846 | 0.491635 |
| 7     | 0.0066        | 0.341374        | 0.0077                 | 0.532444 | 0.492962 |
| 8     | 0.0063        | 0.347394        | 0.0077                 | 0.544909 | 0.507149 |
| 9     | 0.0063        | 0.356732        | 0.0077                 | 0.531017 | 0.492943 |
| 10    | 0.0068        | 0.347732        | 0.0077                 | 0.537164 | 0.501436 |


TrainOutput(global_step=1250, training_loss=0.015765763548016547, metrics={'train_runtime': 424.6547, 'train_samples_per_second': 23.549, 'train_steps_per_second': 2.944, 'total_flos': 653271421440000.0, 'train_loss': 0.015765763548016547, 'epoch': 10.0})

In [105]:
# Evaluate the first model
eval_results_1 = trainer.evaluate()

print("Evaluation Results for Model 1:")
print(eval_results_1)

Evaluation Results for Model 1:
{'eval_loss': 0.34773194789886475, 'eval_model_preparation_time': 0.0077, 'eval_f1_micro': 0.5371638550192084, 'eval_f1_macro': 0.5014355290348718, 'eval_runtime': 14.9624, 'eval_samples_per_second': 133.668, 'eval_steps_per_second': 16.709, 'epoch': 10.0}


In [106]:
trainer.save_model('./model_first')

In [107]:
# Clean up to free memory
del model
del trainer
torch.cuda.empty_cache()

### Second Model

In [108]:
model = CamembertForTokenClassification.from_pretrained('camembert-base', num_labels=num_labels, id2label=id_to_label, label2id=label_to_id)


Some weights of CamembertForTokenClassification were not initialized from the model checkpoint at camembert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [110]:
# Training arguments: same settings as the first model, including 10 epochs
training_args = TrainingArguments(
    output_dir='./training_show/results_second',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./training_show/logs_second',
    logging_steps=10,
    evaluation_strategy="epoch",
    report_to="none",
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_medium_tokenized,
    eval_dataset=eval_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)



In [111]:
trainer.train()


| Epoch | Training Loss | Validation Loss | F1 Micro | F1 Macro |
|-------|---------------|-----------------|----------|----------|
| 1     | 0.3887        | 0.395295        | 0.0      | 0.0      |
| 2     | 0.19          | 0.18733         | 0.460534 | 0.29692  |
| 3     | 0.1409        | 0.178217        | 0.567405 | 0.522937 |
| 4     | 0.0856        | 0.171936        | 0.600979 | 0.566708 |
| 5     | 0.0574        | 0.178077        | 0.574909 | 0.543412 |
| 6     | 0.0394        | 0.208409        | 0.58801  | 0.553161 |
| 7     | 0.0366        | 0.223199        | 0.595245 | 0.558091 |
| 8     | 0.0261        | 0.239878        | 0.593924 | 0.561499 |
| 9     | 0.0151        | 0.248595        | 0.597758 | 0.56382  |
| 10    | 0.0179        | 0.255998        | 0.591038 | 0.557692 |



TrainOutput(global_step=3750, training_loss=0.148407321759065, metrics={'train_runtime': 971.8501, 'train_samples_per_second': 30.869, 'train_steps_per_second': 3.859, 'total_flos': 1959814264320000.0, 'train_loss': 0.148407321759065, 'epoch': 10.0})

In [112]:
eval_results_2 = trainer.evaluate()

print("Evaluation Results for Model 2:")
print(eval_results_2)


Evaluation Results for Model 2:
{'eval_loss': 0.2559979259967804, 'eval_f1_micro': 0.5910381823919217, 'eval_f1_macro': 0.5576916605445867, 'eval_runtime': 14.4809, 'eval_samples_per_second': 138.113, 'eval_steps_per_second': 17.264, 'epoch': 10.0}


In [113]:
# Save the model
trainer.save_model('./model_second')

In [114]:
# Clean up to free memory
del model
del trainer
torch.cuda.empty_cache()


### Third Model

In [115]:
model = CamembertForTokenClassification.from_pretrained('camembert-base', num_labels=num_labels, id2label=id_to_label, label2id=label_to_id)


Some weights of CamembertForTokenClassification were not initialized from the model checkpoint at camembert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [118]:
# Freeze the embedding layers so they are not updated during fine-tuning
for param in model.roberta.embeddings.parameters():
    param.requires_grad = False

# Training arguments
training_args = TrainingArguments(
    output_dir='./training_show/results_third',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./training_show/logs_third',
    logging_steps=10,
    evaluation_strategy="epoch",
    report_to="none",
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_medium_tokenized,
    eval_dataset=eval_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)



In [119]:
# Train the model
trainer.train()

| Epoch | Training Loss | Validation Loss | F1 Micro | F1 Macro |
|-------|---------------|-----------------|----------|----------|
| 1     | 0.0891        | 0.181979        | 0.589167 | 0.549049 |
| 2     | 0.0764        | 0.179198        | 0.580596 | 0.537867 |
| 3     | 0.0671        | 0.189488        | 0.579355 | 0.535062 |
| 4     | 0.0413        | 0.207096        | 0.604414 | 0.565524 |
| 5     | 0.0193        | 0.218171        | 0.579955 | 0.542859 |
| 6     | 0.0141        | 0.241499        | 0.606724 | 0.569561 |
| 7     | 0.0227        | 0.260233        | 0.603801 | 0.567224 |
| 8     | 0.0149        | 0.281682        | 0.603698 | 0.568099 |
| 9     | 0.0056        | 0.287587        | 0.602296 | 0.567341 |
| 10    | 0.0057        | 0.29007         | 0.603803 | 0.568603 |


TrainOutput(global_step=3750, training_loss=0.03827596281270186, metrics={'train_runtime': 941.3541, 'train_samples_per_second': 31.869, 'train_steps_per_second': 3.984, 'total_flos': 1959814264320000.0, 'train_loss': 0.03827596281270186, 'epoch': 10.0})

In [120]:
eval_results_3 = trainer.evaluate()
print("Evaluation Results for Model 3:")
print(eval_results_3)


Evaluation Results for Model 3:
{'eval_loss': 0.29006972908973694, 'eval_f1_micro': 0.6038029925187033, 'eval_f1_macro': 0.5686027999110711, 'eval_runtime': 14.2484, 'eval_samples_per_second': 140.367, 'eval_steps_per_second': 17.546, 'epoch': 10.0}


In [121]:
# Save the model
trainer.save_model('./model_third')


In [None]:

# Delete the model to free up memory
del model
del trainer
torch.cuda.empty_cache()

## Results and Analysis

### Evaluation Results

After training and evaluating the three models, we obtained the following results on the 2,000-sentence evaluation set after 10 epochs:

| Model                         | F1-Micro | F1-Macro |
|-------------------------------|----------|----------|
| **Model 1 (1,000 Sentences)** | 0.5372   | 0.5014   |
| **Model 2 (3,000 Sentences)** | 0.5910   | 0.5577   |
| **Model 3 (3,000 Sentences with Frozen Embeddings)** | 0.6038   | 0.5686   |
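
The size of each step between models can be read off the final `trainer.evaluate()` outputs printed for each run. The short sketch below hard-codes those scores, rounded to four decimals; the `results` dictionary and the `delta` helper are illustrative names, not part of the notebook.

```python
# Final evaluation scores as printed by trainer.evaluate() for each model,
# rounded to four decimals.
results = {
    "Model 1": {"f1_micro": 0.5372, "f1_macro": 0.5014},
    "Model 2": {"f1_micro": 0.5910, "f1_macro": 0.5577},
    "Model 3": {"f1_micro": 0.6038, "f1_macro": 0.5686},
}


def delta(a, b, metric):
    """Absolute difference in `metric` going from model a to model b."""
    return results[b][metric] - results[a][metric]


for a, b in [("Model 1", "Model 2"), ("Model 2", "Model 3")]:
    for metric in ("f1_micro", "f1_macro"):
        print(f"{a} -> {b} {metric}: {delta(a, b, metric):+.4f}")
```

Tripling the training data accounts for a gain of roughly 0.05 on both metrics, while the difference between the frozen and unfrozen 3,000-sentence runs is roughly 0.01.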

### Comparative Analysis

1. **Impact of Training Data Size:**
   - **Model 1 vs. Model 2:** Increasing the training data from 1,000 to 3,000 sentences raised `f1-micro` from 0.537 to 0.591 and `f1-macro` from 0.501 to 0.558 on the evaluation set. The model benefits from the additional training data, and the gain is of similar size on both metrics.

2. **Effect of Frozen Embeddings:**
   - **Model 2 vs. Model 3:** Freezing the embedding layers in Model 3 did not reduce performance. Its final scores (`f1-micro` 0.604, `f1-macro` 0.569) are slightly above those of Model 2. The difference is about 0.01, comes from a single run per configuration, and may be within run-to-run variation. This suggests that, at this data size, updating the pre-trained embeddings is not necessary for the NER task.
   - **Caveat:** The training logs for Models 1 and 3 show a much lower training loss in epoch 1 (0.0308 and 0.0891) than Model 2 (0.3887), which may indicate that those two runs did not start from a freshly initialized model. The comparison should be read with this in mind.

### Insights

- **Data Quantity:** A larger training dataset provides the model with more examples to learn from, resulting in better entity recognition performance. This underscores the importance of ample annotated data for supervised learning tasks like NER.

- **Model Flexibility:** Freezing the embeddings left accuracy essentially unchanged while reducing the number of trainable parameters, so the pre-trained embeddings appear to transfer to this task without further adjustment.

- **Balanced Performance Metrics:** Both `f1-micro` and `f1-macro` improve with more data, but `f1-macro` stays about 0.03 to 0.04 below `f1-micro` for every model, which suggests that the less frequent entity types are recognized less reliably than the frequent ones.

- **Overfitting:** In all three runs the validation loss rises in the later epochs while the training loss keeps falling, and the F1 scores plateau after roughly epoch 4, so fewer epochs or early stopping would likely give similar results.

### Challenges Faced

- **Memory Constraints:** Training multiple large transformer models simultaneously led to out-of-memory errors on Google Colab. This was mitigated by sequentially loading and training one model at a time and freeing up GPU memory post-training.

- **Label Alignment:** Ensuring accurate alignment of IOB labels with subword tokens was critical. Misalignment could lead to incorrect loss calculations and degraded model performance. Careful implementation of the `tokenize_and_align_labels` function was essential to address this.

### Future Work

- **Hyperparameter Optimization:** Experimenting with different learning rates, batch sizes, and number of epochs could further enhance model performance.

- **Extended Data Utilization:** Incorporating additional languages or larger datasets could provide insights into the model's adaptability and scalability.

- **Advanced Architectures:** Exploring more recent transformer architectures or leveraging ensemble methods might yield superior NER performance.

