# Project: Fine-tune and Evaluate LLMs on a Popular Benchmark Dataset

- In this project, I aim to evaluate the performance of various Large Language Models (LLMs) on benchmark datasets to identify their strengths and weaknesses. My focus is on implementing open-source LLMs and benchmarking them using the General Language Understanding Evaluation (GLUE) dataset, specifically the Microsoft Research Paraphrase Corpus (MRPC).

- Through this project, I aim to deepen my understanding of the evaluation process for LLMs, particularly in assessing the performance of newly created models. Additionally, I plan to implement and fine-tune baseline models to improve their accuracy and precision on the MRPC task. By doing so, I hope to contribute to the development of more robust and effective LLMs in natural language processing tasks.


### Step 1: Choose a benchmark dataset and download it from Hugging Face

- In this project, I use the MRPC task from the GLUE benchmark to evaluate the models.

- MRPC consists of sentence pairs taken from online news sources. Each pair is labeled as either a paraphrase or not. The task is to determine whether the two sentences in a pair express the same meaning.

- In the dataset, the columns `sentence1` and `sentence2` hold the two sentences, and `label` indicates whether they are paraphrases of each other.
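
To make the column layout concrete, the sketch below builds a few MRPC-style records by hand and counts their labels. The sentences are invented for illustration and are not taken from the dataset; the helper name `label_counts` is also hypothetical. In GLUE MRPC, label 1 means the pair is a paraphrase (`equivalent`) and label 0 means it is not (`not_equivalent`).

```python
from collections import Counter

# Hypothetical records that mimic the MRPC schema (sentence1, sentence2, label, idx).
# These sentences are made up for illustration; they are not from the dataset.
examples = [
    {"sentence1": "The company reported higher profits this quarter.",
     "sentence2": "Profits at the company rose this quarter.",
     "label": 1, "idx": 0},
    {"sentence1": "The mayor opened the new bridge on Monday.",
     "sentence2": "The bridge was closed for repairs on Monday.",
     "label": 0, "idx": 1},
    {"sentence1": "Shares fell 3 percent after the announcement.",
     "sentence2": "After the announcement, the stock dropped 3 percent.",
     "label": 1, "idx": 2},
]

# Label ids and their names in GLUE MRPC.
LABEL_NAMES = {0: "not_equivalent", 1: "equivalent"}

def label_counts(records):
    """Count how many records fall under each label name."""
    return Counter(LABEL_NAMES[r["label"]] for r in records)

counts = label_counts(examples)
print(counts)
```

The same counting applied to the real splits would show how balanced the two classes are, which matters when reading accuracy and F1 later on.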

In [1]:
# Choose benchmark dataset: GLUE, SQuAD
!pip install transformers
!pip install torch


In [2]:
!pip install datasets


In [3]:
from datasets import load_dataset

dataset = load_dataset('glue', 'mrpc')
dataset


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

### Step 2: Choosing Open Source Baseline Model

In [4]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [6]:
# Preprocess the data

def tokenize_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], padding='max_length', truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

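
To show what the tokenizer call does with a sentence pair, here is a toy sketch of the layout BERT expects: `[CLS] sentence1 [SEP] sentence2 [SEP]`, followed by padding. It uses whitespace splitting instead of WordPiece and a hypothetical helper `encode_pair`, so it only illustrates the structure of `token_type_ids` and `attention_mask`, not the real vocabulary ids.

```python
def encode_pair(sentence1, sentence2, max_length=16):
    """Toy sentence-pair encoder that mirrors the layout BERT expects.

    Whitespace splitting stands in for WordPiece, so the tokens are only
    illustrative; the real tokenizer also maps tokens to vocabulary ids.
    """
    tokens_a = sentence1.lower().split()
    tokens_b = sentence2.lower().split()

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 for [CLS], sentence 1 and its [SEP]; 1 for sentence 2 and its [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    # truncation=True, simplified: cut anything beyond max_length.
    # (The real tokenizer trims the sentences and keeps the final [SEP].)
    tokens = tokens[:max_length]
    token_type_ids = token_type_ids[:max_length]

    # padding='max_length': pad to max_length; the mask is 1 for real tokens, 0 for padding.
    attention_mask = [1] * len(tokens)
    pad = max_length - len(tokens)
    tokens += ["[PAD]"] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad

    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

encoded = encode_pair("He said hello", "He greeted them", max_length=12)
for key, value in encoded.items():
    print(key, value)
```

Padding every example to a fixed length keeps batches rectangular; the attention mask tells the model which positions to ignore.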

In [7]:
# GPT-2 Model

from transformers import GPT2Tokenizer, GPT2ForSequenceClassification
import torch

# Load pre-trained GPT-2 tokenizer and model with a classification head
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained('gpt2')
model_gpt2 = GPT2ForSequenceClassification.from_pretrained('gpt2', num_labels=2)


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Step 3: Evaluate the baseline model on the benchmark dataset to set a reference point

In [15]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds)
    return {
        'accuracy': accuracy,
        'f1': f1,
    }
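
For reference, the sketch below spells out what `compute_metrics` calculates, without scikit-learn: the predicted class is the index of the largest logit in each row, accuracy is the fraction of correct predictions, and F1 is computed for the positive class (label 1). The logits and labels are hypothetical values chosen for illustration, and the helper names are my own.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row (the predicted class)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy_and_f1(labels, preds, positive=1):
    """Accuracy over all examples and binary F1 for the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)

    accuracy = correct / len(labels)
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no positives at all.
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Hypothetical logits for five sentence pairs (columns: class 0, class 1).
logits = [[0.2, 1.5], [1.1, -0.3], [0.4, 0.9], [2.0, 0.1], [-0.5, 0.7]]
labels = [1, 0, 0, 1, 1]

preds = argmax_rows(logits)
accuracy, f1 = accuracy_and_f1(labels, preds)
print(preds, accuracy, f1)
```

Because F1 only counts the positive class, a model that never predicts "paraphrase" gets an F1 of 0 regardless of its accuracy.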


In [19]:
from transformers import Trainer

# Define a trainer with no training arguments (only evaluation)
trainer_bert = Trainer(
    model=model,
    eval_dataset=tokenized_datasets['validation'],
    compute_metrics=compute_metrics,
)

In [20]:
results_bert = trainer_bert.evaluate()
print(results_bert)

{'eval_loss': 1.1677886247634888, 'eval_accuracy': 0.3161764705882353, 'eval_f1': 0.0, 'eval_runtime': 3.0678, 'eval_samples_per_second': 132.993, 'eval_steps_per_second': 16.624}


In [21]:
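from transformers import Trainer, TrainingArguments

# GPT-2 has no padding token by default, so reuse its end-of-sequence token
tokenizer_gpt2.pad_token = tokenizer_gpt2.eos_token
model_gpt2.config.pad_token_id = tokenizer_gpt2.eos_token_id

# Tokenize with the GPT-2 tokenizer (BERT token ids are not valid inputs for GPT-2)
# max_length=128 is an assumed cap; adjust it to the sentence lengths in the data
def tokenize_function_gpt2(examples):
    return tokenizer_gpt2(examples['sentence1'], examples['sentence2'], padding='max_length', truncation=True, max_length=128)

tokenized_datasets_gpt2 = dataset.map(tokenize_function_gpt2, batched=True)

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Use the Trainer API to evaluate the GPT-2 baseline
trainer_gpt2 = Trainer(
    model=model_gpt2,
    args=training_args,
    train_dataset=tokenized_datasets_gpt2['train'],
    eval_dataset=tokenized_datasets_gpt2['validation'],
    tokenizer=tokenizer_gpt2,
    compute_metrics=compute_metrics,
)
results_gpt2 = trainer_gpt2.evaluate()
print(results_gpt2)

Note: the output below is identical to the BERT baseline because the run that produced it passed the BERT `model` and `tokenizer` to `Trainer`, so it does not measure GPT-2.

{'eval_loss': 1.1677886247634888, 'eval_accuracy': 0.3161764705882353, 'eval_f1': 0.0, 'eval_runtime': 3.0688, 'eval_samples_per_second': 132.95, 'eval_steps_per_second': 16.619}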


### Step 4: Fine-tune the baseline model


- evaluation_strategy = 'epoch': runs evaluation at the end of every epoch so you can monitor the model's performance during training.
- learning_rate = 2e-5: the step size the optimizer uses to update the model weights. A smaller learning rate makes small, controlled updates, which helps convergence when fine-tuning a pre-trained model.
- num_train_epochs: the number of times the model goes through the entire training set.
- per_device_train_batch_size: the batch size per device for training.
- warmup_steps = 500: the number of steps over which the learning rate ramps up from 0 to the initial learning rate.
- weight_decay = 0.01: adds a penalty on large weights to reduce overfitting.
- logging_dir = './logs': the directory where the logs are stored.
- logging_steps = 10: the number of steps between logging updates.

What we fine-tuned:
- Learning rate: choosing a suitable learning rate and using warmup steps can prevent issues such as overshooting and slow convergence.
- Regularization and monitoring: weight decay adds a penalty to reduce overfitting, and per-epoch evaluation tracks progress.
- Batch size and epochs: a proper setting of batch size and number of epochs balances training efficiency and model performance. Larger batch sizes and more epochs require more computational resources.
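
To see how `warmup_steps` interacts with the length of this run, here is a minimal sketch of a linear warmup followed by linear decay to zero. It assumes the Trainer's default linear scheduler, a single device, and no gradient accumulation; the function name `linear_warmup_decay_lr` is hypothetical. With 3668 training examples, a batch size of 16, and 3 epochs, the run has 690 optimizer steps, so a 500-step warmup covers most of training.

```python
import math

def linear_warmup_decay_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr (at the end of warmup) to 0 (at total_steps).
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Step count for this run: 3668 training examples, batch size 16, 3 epochs.
steps_per_epoch = math.ceil(3668 / 16)
total_steps = steps_per_epoch * 3

for step in (0, 250, 500, 595, total_steps):
    lr = linear_warmup_decay_lr(step, base_lr=2e-5, warmup_steps=500, total_steps=total_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

Under these assumptions the learning rate only reaches its peak of 2e-5 at step 500 and then drops to zero over the last 190 steps, so a shorter warmup may be worth trying on a dataset this small.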

In [22]:
!pip install transformers



In [24]:
from transformers import TrainingArguments, Trainer

# Specify output_dir, learning_rate, evaluation_strategy, num_train_epochs, weight_decay
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

fine_tune_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

fine_tune_trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.5849 | 0.522285 | 0.759804 | 0.834459 |
| 2 | 0.4394 | 0.4686 | 0.791667 | 0.863563 |
| 3 | 0.2787 | 0.419673 | 0.833333 | 0.883562 |


TrainOutput(global_step=690, training_loss=0.49827541752138, metrics={'train_runtime': 235.8362, 'train_samples_per_second': 46.66, 'train_steps_per_second': 2.926, 'total_flos': 2895274053181440.0, 'train_loss': 0.49827541752138, 'epoch': 3.0})

In [27]:
from transformers import TrainingArguments, Trainer

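# Specify output_dir, learning_rate, evaluation_strategy, num_train_epochs, weight_decay
# Note: this reuses the `model` object already fine-tuned above, so training
# continues from those weights with a 100x larger learning rate (2e-3 vs. 2e-5)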
training_args_updated_learning_rate = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-3,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

fine_tune_trainer_updated_learning_rate = Trainer(
    model=model,
    args=training_args_updated_learning_rate,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

fine_tune_trainer_updated_learning_rate.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.7129 | 0.671668 | 0.683824 | 0.812227 |
| 2 | 0.6668 | 0.667398 | 0.683824 | 0.812227 |
| 3 | 0.662 | 0.624885 | 0.683824 | 0.812227 |


TrainOutput(global_step=690, training_loss=0.6366893177447112, metrics={'train_runtime': 238.8703, 'train_samples_per_second': 46.067, 'train_steps_per_second': 2.889, 'total_flos': 2895274053181440.0, 'train_loss': 0.6366893177447112, 'epoch': 3.0})

### Step 5: Evaluate the fine-tuned model on the benchmark dataset to determine whether it is better than the baseline
A useful step before this is to understand the fine-tuning process and to research the different methods that can improve the model.

In [25]:
fine_tune_results = fine_tune_trainer.evaluate()
print(fine_tune_results)

{'eval_loss': 0.41967302560806274, 'eval_accuracy': 0.8333333333333334, 'eval_f1': 0.8835616438356164, 'eval_runtime': 2.8844, 'eval_samples_per_second': 141.449, 'eval_steps_per_second': 9.014, 'epoch': 3.0}


In [28]:
fine_tune_trainer_updated_learning_rate_result = fine_tune_trainer_updated_learning_rate.evaluate()

print(fine_tune_trainer_updated_learning_rate_result)

{'eval_loss': 0.6248849034309387, 'eval_accuracy': 0.6838235294117647, 'eval_f1': 0.8122270742358079, 'eval_runtime': 2.8923, 'eval_samples_per_second': 141.063, 'eval_steps_per_second': 8.989, 'epoch': 3.0}
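
The accuracy and F1 of the run with `learning_rate=2e-3` are identical at every epoch, which is the pattern you would see if the model predicted "paraphrase" for every pair. The sketch below checks that reading with plain arithmetic. The class counts (279 paraphrase pairs and 129 non-paraphrase pairs out of 408 validation examples) are inferred from the reported accuracies, not read from the dataset, and the helper name is hypothetical.

```python
def constant_prediction_metrics(n_pos, n_neg, predict):
    """Accuracy and F1 (positive class = 1) when every example gets the same prediction."""
    if predict == 1:
        tp, fp, fn = n_pos, n_neg, 0
        correct = n_pos
    else:
        tp, fp, fn = 0, 0, n_pos
        correct = n_neg
    accuracy = correct / (n_pos + n_neg)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Inferred validation class counts (assumption: 0.6838 * 408 is about 279 positives).
N_POS, N_NEG = 279, 129

all_pos = constant_prediction_metrics(N_POS, N_NEG, predict=1)
all_neg = constant_prediction_metrics(N_POS, N_NEG, predict=0)
print("always predict 1:", all_pos)
print("always predict 0:", all_neg)
```

Always predicting 1 reproduces the accuracy and F1 reported for the `2e-3` run, and always predicting 0 reproduces the accuracy of 0.316 and F1 of 0.0 from the untrained baseline. This suggests that neither of those models has learned to separate the two classes, and that the larger learning rate undid the earlier fine-tuning.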


### Step 6: Save the model

In [None]:
model.save_pretrained('./fine_tuned_model')
tokenizer.save_pretrained('./fine_tuned_model')


fine_tune_trainer.save_model('./fine_tuned_model')


In [None]:
# give access to google drive
from google.colab import drive
drive.mount('/content/drive')

# path
outputPath = '/content/drive/MyDrive/Summer 2024/projects/project1'

model.save_pretrained(outputPath)
tokenizer.save_pretrained(outputPath)
fine_tune_trainer.save_model(outputPath)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Step 7: Report and Summary


- Without fine-tuning:
{'eval_loss': 0.6237205862998962, 'eval_runtime': 5.0478, 'eval_samples_per_second': 80.828, 'eval_steps_per_second': 10.104}


- With fine-tuning:
{'eval_loss': 0.39543601870536804, 'eval_runtime': 2.8832, 'eval_samples_per_second': 141.509, 'eval_steps_per_second': 9.018, 'epoch': 3.0}


- Without fine-tuning:
  - Evaluation loss: 0.6237
  - Evaluation runtime: 5.0478 seconds
  - Samples per second: 80.828
  - Steps per second: 10.104

- With fine-tuning:
  - Evaluation loss: 0.3954 (improved from 0.6237)
  - Evaluation runtime: 2.8832 seconds (faster than 5.0478 seconds)
  - Samples per second: 141.509 (faster than 80.828 samples per second)
  - Steps per second: 9.018 (slightly slower than 10.104 steps per second)
  - Epochs: 3.0 (the number of training epochs)

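- Result:
- Evaluation of the fine-tuned model ran approximately 42.9% faster than the baseline evaluation. Both runs use the same architecture, so this most likely reflects the larger evaluation batch size (about 16 samples per step versus 8) rather than fine-tuning itself.
- The fine-tuned model shows a 36.6% reduction in evaluation loss.
- The model was pushed to the Hugging Face Hub for open-source collaboration.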
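
The percentages above can be reproduced from the reported numbers. The sketch below recomputes them and also divides samples per second by steps per second to estimate the evaluation batch size of each run; the helper name `relative_reduction` is my own.

```python
def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100

# Values copied from the two evaluation reports above.
runtime_reduction = relative_reduction(5.0478, 2.8832)
loss_reduction = relative_reduction(0.6237, 0.3954)

# Samples per step approximates the evaluation batch size used in each run.
baseline_batch = 80.828 / 10.104
finetuned_batch = 141.509 / 9.018

print(f"runtime reduction: {runtime_reduction:.1f}%")
print(f"loss reduction:    {loss_reduction:.1f}%")
print(f"samples per step:  {baseline_batch:.1f} (baseline) vs {finetuned_batch:.1f} (fine-tuned)")
```

The samples-per-step figures come out close to 8 and 16, which suggests the runtime difference comes mostly from the evaluation batch size rather than from the model weights.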


What's next:
- Add another model: T5
- Add another benchmark dataset and evaluate the models more thoroughly on it



In [None]:
!pip install transformers huggingface_hub




In [None]:
# Save to hugging face
# Access token:

from huggingface_hub import login

login()




In [None]:
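# Note: the first positional argument of Trainer.push_to_hub is the commit message;
# the repo name defaults to the output_dir name ('results') unless hub_model_id is set
fine_tune_trainer.push_to_hub('fine_tuned_bert')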


CommitInfo(commit_url='https://huggingface.co/Thang203/results/commit/b31469c1654c5e5b097ca49fe5ee682374009935', commit_message='fine_tuned_bert', commit_description='', oid='b31469c1654c5e5b097ca49fe5ee682374009935', pr_url=None, pr_revision=None, pr_num=None)