<a href="https://colab.research.google.com/github/201li220/ProjetMLA/blob/main/Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prepare Environment

We will see how to fine-tune a Transformers model on a text classification task from the GLUE benchmark.

**AutoTokenizer** : Tokenizes our text data into the input format BERT expects. The "Auto" prefix means the class selects the appropriate tokenizer for the checkpoint it is given.

**DataCollatorWithPadding** : Batches tokenized examples and pads each batch to the length of its longest sequence, so all tensors in a batch share one shape without padding the whole dataset to a single global length.

**AutoModelForSequenceClassification** : A generic class that can instantiate model architectures tailored for sequence classification tasks. Again, the “Auto” prefix makes it versatile across various pre-trained models.

**TrainingArguments** : A convenient way to define the training configuration, such as the learning rate, batch size, and number of epochs.

**Trainer** : A high-level utility from the Transformers library that abstracts the training and evaluation loop, making fine-tuning straightforward.

**pipeline** : Simplifies applying a model to data. It is a handy tool for post-training evaluation and prediction.

In [None]:
# -*- coding: utf-8 -*-
"""
@author: Yu Jihan
"""
!pip install datasets
!pip install transformers==4.17
!pip install accelerate -U
!pip install evaluate

import numpy as np
import torch
import matplotlib.pyplot as plt
from tqdm.auto import tqdm

from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import AdamW, get_scheduler
from datasets import load_dataset
import evaluate
import accelerate


# check for GPU device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device available:', device)

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0
Device available: cuda


# Loading the GLUE Dataset: CoLA, SST-2, MRPC, STS-B

We fine-tune a BERT model on the GLUE benchmark using the Trainer API. A GPU is recommended for faster training and inference, although the code also runs on a CPU.

The base learning rate is set to 3e-5; it is one of the most important hyperparameters. A small value such as 3e-5 keeps each weight update small, which reduces the risk of overshooting a minimum of the loss, at the cost of potentially longer training.
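The trade-off can be seen on a toy problem. The sketch below minimizes f(x) = x² with plain gradient descent; it is only an illustration, since BERT's loss surface and the AdamW optimizer behave differently, and the step sizes used here (0.1 and 1.1) are hypothetical values chosen for the toy function, not comparable to 3e-5.

```python
def gradient_descent(lr, steps=20, x0=5.0):
    """Minimize f(x) = x**2 (gradient 2*x) from x0 with a fixed step size."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # one gradient step
    return x

# A small step size shrinks x toward the minimum at 0.
small = gradient_descent(lr=0.1)
# A step size that is too large jumps past the minimum on every step,
# so |x| grows instead of shrinking.
large = gradient_descent(lr=1.1)

print(f"lr=0.1 -> x = {small:.4f}")
print(f"lr=1.1 -> x = {large:.4f}")
```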

In [None]:
GLUE_TASKS = ['cola', 'sst2', 'mrpc', 'stsb']
TASK = 'cola'
MODEL = 'bert-base-uncased'
BATCH_SIZE = 32
LEARNING_RATE = 3e-5
EPOCHS = 5

In [None]:
dataset = load_dataset('glue', TASK)

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})


In [None]:
dataset['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['unacceptable', 'acceptable'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [None]:
dataset["train"][0]

{'sentence': "Our friends won't buy this analysis, let alone the next one we propose.",
 'label': 1,
 'idx': 0}

# Tokenizer and Data Collator

The tokenizer API in the Transformers library handles essential preprocessing steps such as tokenization, padding, truncation, and batching.

A tokenizer encodes texts into numbers that a model can understand.
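As a rough illustration of what "encoding text into numbers" involves, the sketch below applies a greedy longest-match sub-word split in the spirit of WordPiece, the scheme used by BERT tokenizers. The vocabulary `TOY_VOCAB`, its IDs, and the helper names are hypothetical; the real `bert-base-uncased` tokenizer also normalizes text and splits punctuation, which this sketch omits.

```python
# Hypothetical toy vocabulary; the real one has roughly 30k entries.
TOY_VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2,
             "play": 3, "##ing": 4, "##ed": 5, "we": 6}

def wordpiece(word, vocab):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Lowercase, split on whitespace, and map sub-word pieces to integer IDs."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return [vocab[t] for t in tokens]

print(encode("We playing", TOY_VOCAB))
```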

In [None]:
# Load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)

In [None]:
# Data collator for dynamic padding as per batch
data_collator = DataCollatorWithPadding(tokenizer)

In [None]:
task_to_keys = {
    "cola": ("sentence", None),
    "mrpc": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
}

In [None]:
sentence1_key, sentence2_key = task_to_keys[TASK]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence: Our friends won't buy this analysis, let alone the next one we propose.


In [None]:
# define a tokenize function
def tokenize_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

In [None]:
tokenize_function(dataset['train'][:5])

{'input_ids': [[101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 2030, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 1996, 2062, 2057, 2817, 16025, 1010, 1996, 13675, 16103, 2121, 2027, 2131, 1012, 102], [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
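In the output above every `token_type_ids` entry is 0 because CoLA has a single sentence per example. For sentence-pair tasks such as MRPC or STS-B, BERT-style inputs mark the second segment with 1. The sketch below shows that layout by hand; `build_pair_inputs` is a hypothetical helper, not part of the Transformers API, and the example token IDs are arbitrary.

```python
CLS_ID, SEP_ID = 101, 102  # the special-token IDs that open and close the sequences above

def build_pair_inputs(ids_a, ids_b=None):
    """Lay out [CLS] A [SEP] (B [SEP]) with matching segment IDs and mask."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)          # first segment -> 0
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)   # second segment -> 1
    attention_mask = [1] * len(input_ids)          # no padding yet, so all 1s
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

single = build_pair_inputs([2256, 2814])
pair = build_pair_inputs([2256, 2814], [2028, 2062, 1012])
print(single)
print(pair)
```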

In [None]:
# tokenize entire data
tokenized_datasets = dataset.map(tokenize_function, batched=True, batch_size=BATCH_SIZE)
if sentence2_key is None:
    tokenized_datasets = tokenized_datasets.remove_columns(["idx", "sentence"])
else:
    tokenized_datasets = tokenized_datasets.remove_columns(["idx", "sentence1", "sentence2"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets = tokenized_datasets.with_format("torch")
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1063
    })
})


In [None]:
tokenized_train = DataLoader(tokenized_datasets["train"],
                             shuffle=True,
                             batch_size=BATCH_SIZE,
                             collate_fn=data_collator)
tokenized_validation = DataLoader(tokenized_datasets["validation"],
                                  batch_size=BATCH_SIZE,
                                  collate_fn=data_collator)
tokenized_test = DataLoader(tokenized_datasets["test"],
                            batch_size=BATCH_SIZE,
                            collate_fn=data_collator)

In [None]:
# sanity-check the preprocessing: print the tensor shapes of one batch
for batch in tokenized_train:
    for k, v in batch.items():
        print('{:>20} : {}'.format(k, v.shape))
    break

              labels : torch.Size([32])
           input_ids : torch.Size([32, 23])
      token_type_ids : torch.Size([32, 23])
      attention_mask : torch.Size([32, 23])
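The second dimension (23 here) is the length of the longest sequence in this particular batch, which is what dynamic padding produces. A minimal pure-Python sketch of the idea follows; `pad_batch` is a hypothetical helper that mimics the behavior rather than the actual `DataCollatorWithPadding` implementation, and it assumes a padding ID of 0, the `[PAD]` ID in `bert-base-uncased`.

```python
PAD_ID = 0  # assumed [PAD] token ID (0 in bert-base-uncased)

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # The mask is 1 for real tokens and 0 for padding positions.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask

batch_ids, batch_mask = pad_batch([[101, 2256, 102],
                                   [101, 2028, 2062, 18404, 102]])
print(batch_ids)
print(batch_mask)
```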


In [None]:
tokenized_sample = tokenize_function(dataset["train"][0])
print(tokenized_sample)
print(f"Length of tokenized IDs: {len(tokenized_sample.input_ids)}")
print(f"Length of attention mask: {len(tokenized_sample.attention_mask)}")

{'input_ids': [101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Length of tokenized IDs: 19
Length of attention mask: 19


# Fine-tuning BERT

In [None]:
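# STS-B is a regression task (one output); the other tasks here are binary classification
num_labels = 1 if TASK == "stsb" else 2

# load a pre-trained BERT model with a freshly initialized sequence-classification head
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=num_labels)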

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

# Method 1: Trainer API


In [None]:
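# GLUE metric per task: Spearman (STS-B), Matthews correlation (CoLA), F1 (MRPC), accuracy otherwise
task_to_metric = {"stsb": "spearmanr", "cola": "matthews_correlation", "mrpc": "f1"}
metric_name = task_to_metric.get(TASK, "accuracy")
metric = evaluate.load(metric_name)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if TASK != "stsb":
        # classification: pick the highest-scoring class
        predictions = np.argmax(predictions, axis=1)
    else:
        # regression: the single output column is the predicted score
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)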
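For CoLA the metric is the Matthews correlation coefficient, which ranges from -1 to 1 and stays informative when the two classes are imbalanced. The sketch below computes it from the confusion-matrix counts for binary labels; `matthews_corrcoef_binary` is a hypothetical helper written for illustration, not the implementation behind `evaluate.load("matthews_correlation")`, and returning 0.0 when the denominator is zero is a convention assumed here.

```python
import math

def matthews_corrcoef_binary(predictions, references):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    pairs = list(zip(predictions, references))
    tp = sum(p == 1 and r == 1 for p, r in pairs)
    tn = sum(p == 0 and r == 0 for p, r in pairs)
    fp = sum(p == 1 and r == 0 for p, r in pairs)
    fn = sum(p == 0 and r == 1 for p, r in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when the score is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0

perfect = matthews_corrcoef_binary([1, 0, 1, 0], [1, 0, 1, 0])
chance = matthews_corrcoef_binary([1, 1, 0, 0], [1, 0, 0, 1])
print(perfect, chance)
```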

In [None]:
model_name = MODEL.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-{TASK}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=LEARNING_RATE,  # base learning rate for the AdamW optimizer
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    report_to='tensorboard'
)
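The configured learning rate is only the starting value. Assuming the Trainer defaults (a linear schedule with no warmup, since `lr_scheduler_type` and the warmup settings are not set above), the rate decays linearly to zero over training. The sketch below reproduces that shape; `linear_lr` is a hypothetical helper, and the step count of 1340 is taken from the training log of this run.

```python
def linear_lr(step, total_steps, base_lr=3e-5, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)

TOTAL_STEPS = 1340  # from the training log of this run
for step in (0, 335, 670, 1005, 1340):
    print(f"step {step:>4}: lr = {linear_lr(step, TOTAL_STEPS):.2e}")
```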

In [None]:
trainer = Trainer(model,
                  args,
                  train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["validation"],
                  tokenizer=tokenizer,
                  data_collator = data_collator,
                  compute_metrics=compute_metrics
                  )

In [None]:
trainer.train()

***** Running training *****
  Num examples = 8551
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1340


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: ignored
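The "Total optimization steps = 1340" line in the log follows from the dataset size, batch size, and epoch count. A small sketch of the arithmetic, assuming a single device and the gradient accumulation of 1 shown in the log (`total_optimization_steps` is a hypothetical helper):

```python
import math

def total_optimization_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Batches per epoch (last partial batch included) times epochs."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(1, batches_per_epoch // grad_accum)
    return updates_per_epoch * epochs

# 8551 examples / 32 per batch -> 268 batches per epoch; 268 * 5 epochs = 1340
steps = total_optimization_steps(num_examples=8551, batch_size=32, epochs=5)
print(steps)
```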

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1043
  Batch size = 32


{'eval_loss': 0.6504604816436768,
 'eval_matthews_correlation': 0.5762564573315502,
 'eval_runtime': 1.4153,
 'eval_samples_per_second': 736.928,
 'eval_steps_per_second': 23.316,
 'epoch': 5.0}