## TP5: Pre-training and Fine-Tuning of a Language Model

#### Part 2: Fine-Tuning with BERT
- Use a pre-trained BERT model for a binary classification task (e.g., positive/negative reviews).

#### 1. Dataset:
- Provide a simple dataset or use the IMDB dataset (available in `torchtext` for review classification).
- Example data:
    - "I love this movie!" → Positive (1)
    - "This movie is horrible." → Negative (0)

In [3]:
! pip install torch pandas transformers scikit-learn numpy


In [29]:
# Import necessary libraries
import pandas as pd

# Create sample data
data = {
    'text': [
        "I love this movie!",
        "This movie is horrible.",
        "An incredible cinematic experience.",
        "I wouldn't recommend this movie.",
        "The actors did a fantastic job.",
        "The plot was boring and predictable.",
        "A modern masterpiece.",
        "I fell asleep during the movie.",
        "This is the best movie I've seen this year.",
        "A complete waste of time.",
        "The cinematography was breathtaking.",
        "The dialogues were poorly written.",
        "A must-watch for everyone.",
        "I regret spending money on this.",
        "The soundtrack added so much depth.",
        "The characters felt one-dimensional.",
        "A perfect blend of action and emotion.",
        "The ending was abrupt and unsatisfying.",
        "This movie exceeded all my expectations.",
        "An absolute disaster of a film.",
        "The visuals were stunning but the story lacked depth.",
        "One of the worst movies I've ever seen.",
        "The humor was spot on and refreshing.",
        "Too slow-paced to keep my attention.",
        "A true classic that will stand the test of time.",
        "The performances were mediocre at best.",
        "I couldn't stop smiling throughout the movie.",
        "The plot twists were too predictable.",
        "An inspiring story with brilliant execution.",
        "Completely overhyped and disappointing.",
        "Une œuvre d'art exceptionnelle.",
        "Je n'ai pas aimé les effets spéciaux.",
        "Les personnages sont très attachants.",
        "Le scénario est confuse et mal structuré.",
        "Une comédie hilarante du début à la fin.",
        "La bande sonore était monotone.",
        "Un film émouvant qui touche le cœur.",
        "Les dialogues manquent de profondeur.",
        "Une aventure palpitante et bien réalisée.",
        "Le rythme du film est irrégulier."
    ],
    'label': [
        1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 
        1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 
        1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
        1, 0, 1, 0, 1, 0, 1, 0, 1, 0
    ] 
}

# Create a pandas DataFrame
df = pd.DataFrame(data)

# Display the dataset
print(df)


                                                 text  label
0                                  I love this movie!      1
1                             This movie is horrible.      0
2                 An incredible cinematic experience.      1
3                    I wouldn't recommend this movie.      0
4                     The actors did a fantastic job.      1
5                The plot was boring and predictable.      0
6                               A modern masterpiece.      1
7                     I fell asleep during the movie.      0
8         This is the best movie I've seen this year.      1
9                           A complete waste of time.      0
10               The cinematography was breathtaking.      1
11                 The dialogues were poorly written.      0
12                         A must-watch for everyone.      1
13                   I regret spending money on this.      0
14                The soundtrack added so much depth.      1
15               The cha

#### 2. Prepare the Dataset
- Split the data into training and validation sets using `train_test_split` from `sklearn`.
- Tokenize using `BertTokenizer.from_pretrained` (from the `transformers` library).

In [30]:
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer

# Split the data into training and validation sets
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

# Display the sizes of the datasets
print(f"Training set size: {len(train_df)} samples")
print(f"Validation set size: {len(val_df)} samples")

# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

# Tokenize the training set texts
train_encodings = tokenizer(
    train_df['text'].tolist(),
    truncation=True,
    padding=True,
    return_tensors='pt'  # Returns PyTorch tensors
)

# Tokenize the validation set texts
val_encodings = tokenizer(
    val_df['text'].tolist(),
    truncation=True,
    padding=True,
    return_tensors='pt'
)

# Display an example of tokenization
print("Tokenization example for the first sample in the training set:")
print(train_encodings['input_ids'][0])


Training set size: 32 samples
Validation set size: 8 samples
Tokenization example for the first sample in the training set:
tensor([   101,  10281, 106952,  10168,  10458,  10176,  10478,  14412,  45837,
         11709,    119,    102,      0,      0,      0,      0])
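
The printed tensor can be read as a padded sequence. The plain-Python sketch below does not call the tokenizer; it assumes the usual BERT special-token IDs (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102), which should be confirmed with `tokenizer.pad_token_id`, `tokenizer.cls_token_id` and `tokenizer.sep_token_id`. It shows how the attention mask relates to the padded IDs.

```python
# IDs copied from the printed example; the special-token values are assumptions
# (check tokenizer.pad_token_id / cls_token_id / sep_token_id to confirm).
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

input_ids = [101, 10281, 106952, 10168, 10458, 10176, 10478, 14412, 45837,
             11709, 119, 102, 0, 0, 0, 0]

# The attention mask is 1 for real tokens and 0 for padding.
attention_mask = [0 if tok == PAD_ID else 1 for tok in input_ids]

# Number of real tokens, including [CLS] and [SEP].
real_length = sum(attention_mask)

print(attention_mask)
print("real tokens:", real_length, "padding:", len(input_ids) - real_length)
```

Under these assumptions the sample has 12 real tokens followed by 4 padding positions, for a padded length of 16.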


#### 3. Load the Pre-trained Model:
- Load a pre-trained BERT model ready for fine-tuning on a classification task.
- `BertForSequenceClassification.from_pretrained` (from `transformers`): loads the pre-trained BERT weights and adds a newly initialized dense classification layer with 2 outputs.
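
Conceptually, the added dense layer maps a sentence representation to one logit per class, and a softmax turns the logits into class probabilities. The sketch below illustrates this in plain Python with a toy 4-dimensional vector; `hidden`, `weights` and `bias` are made-up values for illustration, not BERT's actual parameters, and the hidden size of 768 used for the parameter count is the one reported for this model.

```python
import math

def linear(hidden, weights, bias):
    """Compute one logit per class: logits[c] = weights[c] . hidden + bias[c]."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Convert logits to probabilities that sum to 1 (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy values (hypothetical): a 4-dim "sentence vector" and a 2-class head.
hidden = [0.5, -1.0, 0.25, 2.0]
weights = [[0.1, 0.2, -0.3, 0.0],    # class 0 (Negative)
           [-0.2, 0.1, 0.4, 0.3]]    # class 1 (Positive)
bias = [0.0, 0.1]

logits = linear(hidden, weights, bias)
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)

# Size of a dense head for hidden size 768 and 2 labels: weights plus biases.
head_params = 768 * 2 + 2

print(logits, probs, predicted_class, head_params)
```

Only these head parameters start from random values; the rest of the network starts from the pre-trained checkpoint.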


In [31]:
# Import the BertForSequenceClassification class
from transformers import BertForSequenceClassification

# Load the pre-trained BERT model for classification
model = BertForSequenceClassification.from_pretrained(
    'bert-base-multilingual-cased',  # Multilingual model suitable for French
    num_labels=2  # Number of classes for binary classification (positive or negative)
)

# Display a summary of the model
print(model)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

#### 4. Adapt BERT for Classification:
- Convert the tokenized data into PyTorch-compatible datasets.
- `TensorDataset` (from `torch.utils.data`): wraps the encodings and labels into a single dataset whose items are aligned by index.
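
`TensorDataset` pairs its tensors index by index, so item `i` is the tuple `(input_ids[i], attention_mask[i], labels[i])`, and a `DataLoader` groups such items into mini-batches. The plain-Python sketch below mimics that bookkeeping with lists; `make_batches` is a hypothetical helper, not part of PyTorch, and the toy data only imitates the shapes used in this notebook.

```python
import random

def make_batches(columns, batch_size, shuffle=False, seed=0):
    """Yield batches as tuples of lists, one list per column.

    `columns` is a sequence of equally long lists aligned by index,
    similar in spirit to the tensors passed to TensorDataset.
    """
    n = len(columns[0])
    assert all(len(col) == n for col in columns), "columns must be aligned"
    order = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield tuple([col[i] for i in idx] for col in columns)

# Toy stand-ins for 32 tokenized samples of length 16 with alternating labels.
ids = [[101, 102] + [0] * 14 for _ in range(32)]
mask = [[1, 1] + [0] * 14 for _ in range(32)]
labels = [i % 2 for i in range(32)]

batches = list(make_batches([ids, mask, labels], batch_size=8, shuffle=True))
print(len(batches), len(batches[0][0]), len(batches[0][0][0]))
```

With 32 samples and a batch size of 8 there are 4 batches per epoch. Because batches are drawn from a shuffled order, a single batch is not guaranteed to contain both labels in equal numbers.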

In [32]:
# Import necessary libraries
import torch
from torch.utils.data import TensorDataset, DataLoader

# Prepare labels for the training dataset
train_labels = torch.tensor(train_df['label'].tolist())

# Prepare labels for the validation dataset
val_labels = torch.tensor(val_df['label'].tolist())

# Create the TensorDataset for the training dataset
train_dataset = TensorDataset(
    train_encodings['input_ids'],
    train_encodings['attention_mask'],
    train_labels
)

# Create the TensorDataset for the validation dataset
val_dataset = TensorDataset(
    val_encodings['input_ids'],
    val_encodings['attention_mask'],
    val_labels
)

# Create DataLoaders for training and validation
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8)

# Get a training batch
batch = next(iter(train_loader))

# Extract elements from the batch
input_ids, attention_mask, labels = batch

print("input_ids shape:", input_ids.shape)
print("attention_mask shape:", attention_mask.shape)
print("labels:", labels)


input_ids shape: torch.Size([8, 16])
attention_mask shape: torch.Size([8, 16])
labels: tensor([1, 1, 1, 1, 1, 0, 1, 1])


#### 5. Configure Training
- Define training parameters (number of epochs, batch size, etc.).
- Use `TrainingArguments` (from `transformers`) to define the training parameters.
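
To see how these parameters interact, it helps to count optimizer steps. The sketch below assumes the sizes used in this notebook (32 training samples, batch size 8, 3 epochs), a single device, no gradient accumulation, and the `Trainer` defaults of a 5e-5 peak learning rate with linear warmup followed by linear decay. `total_steps` and `linear_schedule_lr` are hypothetical helpers written for illustration.

```python
import math

def total_steps(num_samples, batch_size, epochs):
    """Optimizer steps for one device without gradient accumulation."""
    return math.ceil(num_samples / batch_size) * epochs

def linear_schedule_lr(step, peak_lr, warmup_steps, total):
    """Linear warmup to peak_lr, then linear decay to 0 at `total` steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total - step) / max(1, total - warmup_steps))

steps = total_steps(num_samples=32, batch_size=8, epochs=3)
print("total steps:", steps)

# Compare a warmup longer than the whole run with a very short one.
for warmup in (500, 1):
    lr_at_10 = linear_schedule_lr(10, 5e-5, warmup, steps)
    max_lr = max(linear_schedule_lr(s, 5e-5, warmup, steps)
                 for s in range(steps + 1))
    print(f"warmup_steps={warmup}: lr at step 10 = {lr_at_10:.2e}, "
          f"highest lr in run = {max_lr:.2e}")
```

Under these assumptions the run has 12 steps, so a warmup of 500 steps keeps the learning rate far below its peak for the entire run. The value computed for step 10 with `warmup_steps=500` is about 1e-6, which agrees with the `learning_rate` reported in the training log of this notebook.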

#### 6. Train the Model:
- Use the training data to fine-tune the weights of BERT and the new classification layer.

#### 7. Test the Model
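
Testing a binary classifier usually means comparing predicted labels with true labels and reporting accuracy, precision, recall and F1 per class. As a minimal sketch in plain Python, the example below computes these for a degenerate case: a classifier that predicts Positive for every one of 8 balanced samples. `binary_report` is a hypothetical helper; in the notebook itself, `classification_report` from `sklearn` does this work.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return precision, recall and F1 for the class `positive`.

    Undefined ratios (division by zero) are reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [0, 1, 0, 1, 0, 1, 0, 1]   # 4 negative, 4 positive (toy labels)
y_pred = [1] * 8                    # constant "Positive" prediction

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
pos = binary_report(y_true, y_pred, positive=1)
neg = binary_report(y_true, y_pred, positive=0)

print(f"accuracy={accuracy:.2f}")
print("Positive: precision=%.2f recall=%.2f f1=%.2f" % pos)
print("Negative: precision=%.2f recall=%.2f f1=%.2f" % neg)
```

A constant predictor on balanced data reaches 0.50 accuracy with perfect recall on one class and zero on the other, so accuracy alone does not show whether the model has learned anything.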

In [33]:
import numpy as np
from transformers import TrainingArguments, Trainer
from sklearn.metrics import classification_report, accuracy_score
from torch.utils.data import Dataset

# Wrap encodings and labels so that each item is a dict, the format the Trainer expects
class DictDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "input_ids": self.encodings["input_ids"][idx],
            "attention_mask": self.encodings["attention_mask"][idx],
            "labels": self.labels[idx]
        }

# Create datasets compatible with the Trainer
train_dataset = DictDataset(train_encodings, train_labels)
val_dataset = DictDataset(val_encodings, val_labels)

# 5. Configure training
training_args = TrainingArguments(
    output_dir='./results',             # Directory for checkpoints and results
    num_train_epochs=3,                 # Number of epochs
    per_device_train_batch_size=8,      # Batch size for training
    per_device_eval_batch_size=8,       # Batch size for evaluation
    warmup_steps=500,                   # LR warmup steps; this exceeds the 12 total steps of this run, so the LR never leaves warmup
    weight_decay=0.01,                  # Weight decay applied by the optimizer
    logging_dir='./logs',               # Directory for logs
    logging_steps=10,                   # Log every 10 steps
    evaluation_strategy="epoch",        # Evaluate at the end of each epoch (named eval_strategy in newer transformers releases)
    save_strategy="epoch",              # Save a checkpoint at each epoch
    load_best_model_at_end=True,        # Reload the best checkpoint when training ends
    metric_for_best_model="accuracy",   # Metric used to pick the best checkpoint
    save_total_limit=2                  # Keep at most 2 checkpoints on disk
)

# Function to compute performance metrics
def compute_metrics(pred):
    predictions, labels = pred
    predictions = np.argmax(predictions, axis=1)
    accuracy = accuracy_score(labels, predictions)
    return {"accuracy": accuracy}

# Create the Trainer object to manage training and evaluation
trainer = Trainer(
    model=model,                         # Model to train
    args=training_args,                  # Training parameters
    train_dataset=train_dataset,         # Training dataset
    eval_dataset=val_dataset,            # Validation dataset
    compute_metrics=compute_metrics      # Function to compute metrics
)

# 6. Train the model
print("Training the model...")
trainer.train()

# Save the trained model
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")

# 7. Test the model
print("Evaluating the model on the validation set...")
predictions = trainer.predict(val_dataset)

# Compute and display performance metrics
y_true = val_df['label'].tolist()
y_pred = np.argmax(predictions.predictions, axis=1)

print("Classification report:")
print(classification_report(y_true, y_pred, target_names=["Negative", "Positive"]))

# Example: Test a new sentence
test_sentence = "This movie is a masterpiece!"
inputs = tokenizer(test_sentence, return_tensors="pt", truncation=True, padding=True)
inputs = {key: value.to(model.device) for key, value in inputs.items()}

# Get the prediction
model.eval()
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = np.argmax(logits.cpu().numpy(), axis=1)[0]
print(f"Text: '{test_sentence}' - Prediction: {'Positive' if predicted_class == 1 else 'Negative'}")




Training the model...

{'eval_loss': 0.692333459854126, 'eval_accuracy': 0.5, 'eval_runtime': 0.0918, 'eval_samples_per_second': 87.149, 'eval_steps_per_second': 10.894, 'epoch': 1.0}
{'eval_loss': 0.6924784183502197, 'eval_accuracy': 0.5, 'eval_runtime': 0.0731, 'eval_samples_per_second': 109.364, 'eval_steps_per_second': 13.671, 'epoch': 2.0}
{'loss': 0.7021, 'grad_norm': 6.165192604064941, 'learning_rate': 1.0000000000000002e-06, 'epoch': 2.5}
{'eval_loss': 0.6930872797966003, 'eval_accuracy': 0.5, 'eval_runtime': 0.0756, 'eval_samples_per_second': 105.841, 'eval_steps_per_second': 13.23, 'epoch': 3.0}
100%|██████████| 12/12 [00:24<00:00,  2.01s/it]
{'train_runtime': 24.1314, 'train_samples_per_second': 3.978, 'train_steps_per_second': 0.497, 'train_loss': 0.7021592855453491, 'epoch': 3.0}
Evaluating the model on the validation set...

Classification report:
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00         4
    Positive       0.50      1.00      0.67         4

    accuracy                           0.50         8
   macro avg       0.25      0.50      0.33         8
weighted avg       0.25      0.50      0.33         8

Text: 'This movie is a masterpiece!' - Prediction: Positive
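
One way to read these numbers: for two classes, a model that assigns probability 0.5 to each class has a cross-entropy loss of ln 2 ≈ 0.6931, which is very close to the reported `eval_loss` values. Together with the constant 0.5 accuracy and a report in which every sample is predicted Positive, this suggests the classifier is still near its untrained state. The short check below reproduces that reference value; `cross_entropy` is a hypothetical helper written for illustration.

```python
import math

def cross_entropy(prob_of_true_class):
    """Cross-entropy loss for one sample, given the probability of its true class."""
    return -math.log(prob_of_true_class)

chance_loss = cross_entropy(0.5)   # uniform guess over 2 classes

# eval_loss values copied from the training log, one per epoch.
reported = [0.692333459854126, 0.6924784183502197, 0.6930872797966003]

print(f"chance-level loss: {chance_loss:.4f}")
for epoch, loss in enumerate(reported, start=1):
    print(f"epoch {epoch}: eval_loss - ln(2) = {loss - chance_loss:+.4f}")
```

A plausible contributor, given the log, is the learning rate of about 1e-6 at step 10: `warmup_steps=500` is much larger than the 12 steps of this run. The very small dataset (32 training samples) is likely another factor. The single test sentence being classified as Positive is therefore not evidence that the model has learned the task.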



