# Prepare Environment

We will fine-tune a BERT model on GLUE tasks using the following components.

**AutoTokenizer** : Tokenizes GLUE text into the input format that BERT expects.

**DataCollatorWithPadding** : Batches tokenized examples and pads them to a common length within each batch, so that tensors are rectangular and little computation is wasted on padding.

**AutoModelForSequenceClassification** : Instantiates the pretrained model with a sequence classification head.

**TrainingArguments** : Defines the training configuration, such as learning rate, batch size, and number of epochs.

**Trainer** : Runs the training and evaluation loop for fine-tuning.

In [1]:
"""
@author: Yu Jihan
"""
!pip install datasets
!pip install transformers==4.17
!pip install accelerate
!pip install evaluate

import numpy as np
import torch

from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset
import evaluate
import accelerate


# check for GPU device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device available:', device)

Defaulting to user installation because normal site-packages is not writeable
Collecting transformers==4.17
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting sacremoses (from transformers==4.17)
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Collecting click (from sacremoses->transformers==4.17)
  Downloading click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Downloading click-8.1.7-py3-none-any.whl (97 kB)
Installing collected p

2023-12-16 02:46:46.589779: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-16 02:46:47.452245: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-16 02:46:47.452377: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory


Device available: cuda


# Loading GLUE Dataset : CoLA, SST, MRPC, STS-B

We fine-tune a BERT model on the GLUE benchmark using the Trainer API. A GPU is recommended for faster training and inference, although the code also runs on a CPU.

The base learning rate is set to 3e-5 and is a key hyperparameter. A small value such as 3e-5 keeps each update step small, which reduces the risk of overshooting a minimum, at the cost of slower convergence and longer training times.
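
To make the overshooting argument concrete, here is a minimal sketch of plain gradient descent on the toy function f(w) = w². It is not BERT and not the AdamW optimizer used by the Trainer; the function name `gradient_descent` and both step sizes are illustrative assumptions.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # gradient descent update
    return w

small = gradient_descent(lr=0.1)  # each step multiplies w by 0.8: steady approach to 0
large = gradient_descent(lr=1.1)  # each step multiplies w by -1.2: overshoots and diverges

print(f"lr=0.1 -> |w| = {abs(small):.4f}")
print(f"lr=1.1 -> |w| = {abs(large):.4f}")
```

With the small step size the iterate shrinks toward the minimum at 0; with the large one every update jumps past the minimum and the distance grows.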

In [2]:
GLUE_TASKS = ['cola', 'sst2', 'mrpc', 'stsb']
TASK = GLUE_TASKS[3]
MODEL = 'bert-base-uncased'
BATCH_SIZE = 32
LEARNING_RATE = 3e-5
EPOCHS = 5

In [3]:
dataset = load_dataset('glue', TASK)

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 5749
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1500
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1379
    })
})


In [4]:
dataset['train'].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': Value(dtype='float32', id=None),
 'idx': Value(dtype='int32', id=None)}

In [5]:
dataset["train"][0]

{'sentence1': 'A plane is taking off.',
 'sentence2': 'An air plane is taking off.',
 'label': 5.0,
 'idx': 0}

# Tokenizer and Data Collator

The tokenizer API in the Transformers library handles the main preprocessing steps: tokenization, padding, truncation, and batching.

A tokenizer encodes texts into numbers that a model can understand.
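
As a rough illustration of what "encoding text into numbers" means for a sentence pair, the sketch below rebuilds the three fields BERT expects by hand. `TOY_VOCAB` and `toy_encode` are hypothetical names; the IDs are copied from the tokenized sample printed later in this notebook, and the real tokenizer uses a much larger WordPiece vocabulary.

```python
# Toy vocabulary: a hand-copied subset of IDs, not the real bert-base-uncased vocab.
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102, "a": 1037, "an": 2019, "air": 2250,
    "plane": 4946, "is": 2003, "taking": 2635, "off": 2125, ".": 1012,
}

def toy_encode(sentence1, sentence2):
    """Lay out a sentence pair as [CLS] s1 [SEP] s2 [SEP] and map tokens to IDs."""
    def split(text):
        # Lowercase and separate the period; sufficient for this toy example only.
        return text.lower().replace(".", " .").split()

    first = ["[CLS]"] + split(sentence1) + ["[SEP]"]
    second = split(sentence2) + ["[SEP]"]
    tokens = first + second
    return {
        "input_ids": [TOY_VOCAB[t] for t in tokens],
        # 0 marks tokens of the first segment, 1 marks tokens of the second
        "token_type_ids": [0] * len(first) + [1] * len(second),
        # 1 marks real tokens; padding (none here) would be marked with 0
        "attention_mask": [1] * len(tokens),
    }

encoded = toy_encode("A plane is taking off.", "An air plane is taking off.")
print(encoded)
```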

In [6]:
# Load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)

In [7]:
# Data collator for dynamic padding as per batch
data_collator = DataCollatorWithPadding(tokenizer)
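
Dynamic padding means each batch is padded only to the length of its own longest sequence rather than to a global maximum. The following is a simplified pure-Python sketch of that idea; `pad_batch` is a hypothetical helper, whereas the real `DataCollatorWithPadding` delegates to the tokenizer and returns tensors.

```python
PAD_ID = 0  # bert-base-uncased uses 0 as the [PAD] token id

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest length found in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # Padding positions get 0 so that attention ignores them
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 2093, 2273, 102], [101, 1037, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```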

In [8]:
task_to_keys = {
    "cola": ("sentence", None),
    "mrpc": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
}

In [9]:
sentence1_key, sentence2_key = task_to_keys[TASK]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence 1: A plane is taking off.
Sentence 2: An air plane is taking off.


In [10]:
# define a tokenize function
def tokenize_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

In [11]:
tokenize_function(dataset['train'][:5])

{'input_ids': [[101, 1037, 4946, 2003, 2635, 2125, 1012, 102, 2019, 2250, 4946, 2003, 2635, 2125, 1012, 102], [101, 1037, 2158, 2003, 2652, 1037, 2312, 8928, 1012, 102, 1037, 2158, 2003, 2652, 1037, 8928, 1012, 102], [101, 1037, 2158, 2003, 9359, 14021, 5596, 2098, 8808, 2006, 1037, 10733, 1012, 102, 1037, 2158, 2003, 9359, 29022, 8808, 2006, 2019, 4895, 3597, 23461, 10733, 1012, 102], [101, 2093, 2273, 2024, 2652, 7433, 1012, 102, 2048, 2273, 2024, 2652, 7433, 1012, 102], [101, 1037, 2158, 2003, 2652, 1996, 10145, 1012, 102, 1037, 2158, 8901, 2003, 2652, 1996, 10145, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [12]:
# tokenize entire data
tokenized_datasets = dataset.map(tokenize_function, batched=True, batch_size=BATCH_SIZE)
if sentence2_key is None:
    tokenized_datasets = tokenized_datasets.remove_columns(["idx", "sentence"])
else:
    tokenized_datasets = tokenized_datasets.remove_columns(["idx", "sentence1", "sentence2"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets = tokenized_datasets.with_format("torch")
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5749
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1500
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1379
    })
})


In [13]:
tokenized_train = DataLoader(tokenized_datasets["train"],
                             shuffle=True,
                             batch_size=BATCH_SIZE,
                             collate_fn=data_collator)
tokenized_validation = DataLoader(tokenized_datasets["validation"],
                                  batch_size=BATCH_SIZE,
                                  collate_fn=data_collator)
tokenized_test = DataLoader(tokenized_datasets["test"],
                            batch_size=BATCH_SIZE,
                            collate_fn=data_collator)

In [14]:
# check that the data was preprocessed correctly
for batch in tokenized_train:
    for k, v in batch.items():
        print('{:>20} : {}'.format(k, v.shape))
    break

              labels : torch.Size([32])
           input_ids : torch.Size([32, 68])
      token_type_ids : torch.Size([32, 68])
      attention_mask : torch.Size([32, 68])


In [15]:
tokenized_sample = tokenize_function(dataset["train"][0])
print(tokenized_sample)
print(f"Length of tokenized IDs: {len(tokenized_sample.input_ids)}")
print(f"Length of attention mask: {len(tokenized_sample.attention_mask)}")

{'input_ids': [101, 1037, 4946, 2003, 2635, 2125, 1012, 102, 2019, 2250, 4946, 2003, 2635, 2125, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Length of tokenized IDs: 16
Length of attention mask: 16


# Fine-tuning BERT
To see the names of the parameters:

In [20]:
num_labels = 1 if TASK=="stsb" else 2

model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=num_labels)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
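
The cell above sets `num_labels = 1` for STS-B because its labels are float similarity scores. With a single output, Transformers treats the task as regression and uses mean squared error; with two or more labels and integer targets it uses cross-entropy. The sketch below computes both losses by hand in plain Python; the helper names and the sample values are illustrative assumptions, not taken from the model.

```python
import math

def mse_loss(predictions, targets):
    """Mean squared error, the loss for a single regression output."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy_loss(logits, label):
    """Cross-entropy for one example given raw class logits."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]

# STS-B style: one predicted score per pair, compared with a float label
regression_loss = mse_loss([4.5, 1.0], [5.0, 1.5])
# CoLA / SST-2 / MRPC style: two logits per example, integer class label
classification_loss = cross_entropy_loss([0.0, 0.0], label=1)

print(regression_loss, classification_loss)
```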

In [21]:
from torch.optim import AdamW

In [27]:
# Optimizer over the trainable parameters (the Trainer below builds its own AdamW)
optimizer = AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=LEARNING_RATE)

# Collect every parameter name once, in model order
layers_name = []
for name, param in model.named_parameters():
    if name not in layers_name:
        layers_name.append(name)
#layers_name = [element.replace("bert.", "") for element in layers_name]
#for i in range(12):
#    layers_name = [element.replace("."+str(i), "") for element in layers_name]
print(layers_name)

['bert.embeddings.word_embeddings.weight', 'bert.embeddings.position_embeddings.weight', 'bert.embeddings.token_type_embeddings.weight', 'bert.embeddings.LayerNorm.weight', 'bert.embeddings.LayerNorm.bias', 'bert.encoder.layer.0.attention.self.query.weight', 'bert.encoder.layer.0.attention.self.query.bias', 'bert.encoder.layer.0.attention.self.key.weight', 'bert.encoder.layer.0.attention.self.key.bias', 'bert.encoder.layer.0.attention.self.value.weight', 'bert.encoder.layer.0.attention.self.value.bias', 'bert.encoder.layer.0.attention.output.dense.weight', 'bert.encoder.layer.0.attention.output.dense.bias', 'bert.encoder.layer.0.attention.output.LayerNorm.weight', 'bert.encoder.layer.0.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.intermediate.dense.weight', 'bert.encoder.layer.0.intermediate.dense.bias', 'bert.encoder.layer.0.output.dense.weight', 'bert.encoder.layer.0.output.dense.bias', 'bert.encoder.layer.0.output.LayerNorm.weight', 'bert.encoder.layer.0.output.LayerNorm.

In [31]:
total_parameters = 0

for name, param in model.named_parameters():
    num_params = param.numel()  # Get the number of elements in the tensor
    total_parameters += num_params
    print(f"Parameter name: {name}, Number of parameters: {num_params}")

print(f"Total number of parameters in the model: {total_parameters}")


print("\n count = "+str(16))


Parameter name: bert.embeddings.word_embeddings.weight, Number of parameters: 23440896
Parameter name: bert.embeddings.position_embeddings.weight, Number of parameters: 393216
Parameter name: bert.embeddings.token_type_embeddings.weight, Number of parameters: 1536
Parameter name: bert.embeddings.LayerNorm.weight, Number of parameters: 768
Parameter name: bert.embeddings.LayerNorm.bias, Number of parameters: 768
Parameter name: bert.encoder.layer.0.attention.self.query.weight, Number of parameters: 589824
Parameter name: bert.encoder.layer.0.attention.self.query.bias, Number of parameters: 768
Parameter name: bert.encoder.layer.0.attention.self.key.weight, Number of parameters: 589824
Parameter name: bert.encoder.layer.0.attention.self.key.bias, Number of parameters: 768
Parameter name: bert.encoder.layer.0.attention.self.value.weight, Number of parameters: 589824
Parameter name: bert.encoder.layer.0.attention.self.value.bias, Number of parameters: 768
Parameter name: bert.encoder.layer
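
The counts printed above follow directly from the model dimensions. As a quick sanity check, the sketch below recomputes a few of them from the published `bert-base-uncased` configuration values, which are hard-coded here as assumptions rather than read from the model object.

```python
# Published bert-base-uncased configuration values (assumed, not read from the model)
VOCAB_SIZE = 30522
HIDDEN_SIZE = 768
MAX_POSITIONS = 512
TYPE_VOCAB_SIZE = 2

expected_counts = {
    "embeddings.word_embeddings.weight": VOCAB_SIZE * HIDDEN_SIZE,
    "embeddings.position_embeddings.weight": MAX_POSITIONS * HIDDEN_SIZE,
    "embeddings.token_type_embeddings.weight": TYPE_VOCAB_SIZE * HIDDEN_SIZE,
    "attention.self.query.weight": HIDDEN_SIZE * HIDDEN_SIZE,
    "attention.self.query.bias": HIDDEN_SIZE,
}

for name, count in expected_counts.items():
    print(f"{name}: {count}")
```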

In [39]:
EveryNames = []
for name, param in model.named_parameters():
    EveryNames.append(name)
k = 3
# Parameter names in reverse order, from the classifier head back to the embeddings
A = list(reversed(EveryNames))
print(A)

['classifier.bias', 'classifier.weight', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'bert.encoder.layer.11.output.LayerNorm.bias', 'bert.encoder.layer.11.output.LayerNorm.weight', 'bert.encoder.layer.11.output.dense.bias', 'bert.encoder.layer.11.output.dense.weight', 'bert.encoder.layer.11.intermediate.dense.bias', 'bert.encoder.layer.11.intermediate.dense.weight', 'bert.encoder.layer.11.attention.output.LayerNorm.bias', 'bert.encoder.layer.11.attention.output.LayerNorm.weight', 'bert.encoder.layer.11.attention.output.dense.bias', 'bert.encoder.layer.11.attention.output.dense.weight', 'bert.encoder.layer.11.attention.self.value.bias', 'bert.encoder.layer.11.attention.self.value.weight', 'bert.encoder.layer.11.attention.self.key.bias', 'bert.encoder.layer.11.attention.self.key.weight', 'bert.encoder.layer.11.attention.self.query.bias', 'bert.encoder.layer.11.attention.self.query.weight', 'bert.encoder.layer.10.output.LayerNorm.bias', 

In [19]:
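# Evaluation metric used for each task (accuracy is the fallback)
task_to_metric = {"stsb": "spearmanr", "cola": "matthews_correlation", "mrpc": "f1", "sst2": "accuracy"}
metric_name = task_to_metric.get(TASK, "accuracy")
metric = evaluate.load(metric_name)
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if TASK != "stsb":
        # classification: pick the class with the highest logit
        predictions = np.argmax(predictions, axis=1)
    else:
        # regression: the model outputs a single score per example
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)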

Downloading builder script:   0%|          | 0.00/5.05k [00:00<?, ?B/s]
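
For STS-B the metric is Spearman's rank correlation, which depends only on how the predictions are ordered, not on their scale. The following is a small pure-Python sketch of the computation (Pearson correlation of average ranks); the helper names and sample values are illustrative, and the `spearmanr` metric loaded above remains the implementation actually used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # average 1-based rank of the tied group
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(predictions, references):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predictions), average_ranks(references))

preds = [0.3, 2.1, 4.0, 4.9]   # hypothetical model scores
labels = [0.0, 2.5, 3.0, 5.0]  # hypothetical similarity labels
rho = spearman(preds, labels)
print(rho)
```

Because the two lists are ordered identically, the correlation is perfect even though the individual values differ.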

In [20]:
model_name = MODEL.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-{TASK}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=LEARNING_RATE, # AdamW optimizer
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

In [21]:
trainer = Trainer(model,
                  args,
                  train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["validation"],
                  tokenizer=tokenizer,
                  data_collator = data_collator,
                  compute_metrics=compute_metrics
                  )

In [22]:
trainer.train()

***** Running training *****
  Num examples = 5749
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 900


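| Epoch | Training Loss | Validation Loss | Spearmanr |
|-------|---------------|-----------------|-----------|
| 1     | No log        | 0.593865        | 0.874148  |
| 2     | No log        | 0.556845        | 0.885398  |
| 3     | 0.664800      | 0.466201        | 0.887414  |
| 4     | 0.664800      | 0.486066        | 0.890285  |
| 5     | 0.664800      | 0.477962        | 0.890405  |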


***** Running Evaluation *****
  Num examples = 1500
  Batch size = 32
Saving model checkpoint to bert-base-uncased-finetuned-stsb/checkpoint-180
Configuration saved in bert-base-uncased-finetuned-stsb/checkpoint-180/config.json
Model weights saved in bert-base-uncased-finetuned-stsb/checkpoint-180/pytorch_model.bin
tokenizer config file saved in bert-base-uncased-finetuned-stsb/checkpoint-180/tokenizer_config.json
Special tokens file saved in bert-base-uncased-finetuned-stsb/checkpoint-180/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1500
  Batch size = 32
Saving model checkpoint to bert-base-uncased-finetuned-stsb/checkpoint-360
Configuration saved in bert-base-uncased-finetuned-stsb/checkpoint-360/config.json
Model weights saved in bert-base-uncased-finetuned-stsb/checkpoint-360/pytorch_model.bin
tokenizer config file saved in bert-base-uncased-finetuned-stsb/checkpoint-360/tokenizer_config.json
Special tokens file saved in bert-base-uncased-finetuned-stsb

TrainOutput(global_step=900, training_loss=0.4434303453233507, metrics={'train_runtime': 373.8725, 'train_samples_per_second': 76.885, 'train_steps_per_second': 2.407, 'total_flos': 1025938390484874.0, 'train_loss': 0.4434303453233507, 'epoch': 5.0})
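
The step counts in the log above can be reproduced with a little arithmetic. The sketch below assumes a single device, no gradient accumulation (both confirmed by the log), and the `TrainingArguments` default of logging the training loss every 500 steps, which this notebook does not override.

```python
import math

num_examples = 5749   # training rows reported by the Trainer
batch_size = 32       # BATCH_SIZE
epochs = 5            # EPOCHS
logging_steps = 500   # TrainingArguments default, assumed unchanged here

steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch is kept
total_steps = steps_per_epoch * epochs
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total optimization steps: {total_steps}")
print(f"first epoch with a logged training loss: {first_logged_epoch}")
```

This matches the checkpoints saved at steps 180 and 360, the 900 total optimization steps, and the "No log" entries for the first two epochs. Under the same assumption, the 0.664800 in the table is the loss averaged over the first 500 steps, while `training_loss` in `TrainOutput` is averaged over all 900.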

In [23]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1500
  Batch size = 32


{'eval_loss': 0.47796162962913513,
 'eval_spearmanr': 0.8904048503412239,
 'eval_runtime': 4.0636,
 'eval_samples_per_second': 369.128,
 'eval_steps_per_second': 11.566,
 'epoch': 5.0}

**CoLA** : 'train_loss' : 0.133300, 'eval_loss': 0.750196, 'eval_matthews_correlation': 0.604103

**SST-2** : 'train_loss' : 0.073600, 'eval_loss': 0.283315, 'eval_accuracy': 0.928899

**MRPC** : 'train_loss' : 0.271300, 'eval_loss': 0.500252, 'eval_f1': 0.902998

**STS-B** : 'train_loss' : 0.664800, 'eval_loss': 0.477962, 'eval_spearmanr': 0.890405

For STS-B, the 'train_loss' listed here is the last training loss logged in the progress table; the run average reported in `TrainOutput` is 0.443430.