

# **Distributed finetuning ```Mistral-7B-Instruct-v0.3``` model using ```LORA``` on ```SAMSum dataset``` (abstractive dialogue summaries)**



*   **Author:** ```Pratik Vyas```
*   **Task:** ```Summarization```
*   **Distributed Training:** [HuggingFace Accelerate](https://huggingface.co/docs/accelerate/index)
*   **Base model used for fine-tuning:** [Mistral-7B-Instruct-v0.3]( https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 )
*   **Dataset:** [SAMSum]( https://paperswithcode.com/dataset/samsum-corpus )
*   **Finetuned model at Huggingface hub:** [Dist_Mistral-7B-Instruct-v0.3_summarizer_v2](https://huggingface.co/Prat/Dist_Mistral-7B-Instruct-v0.3_summarizer_v2)






# **Import Libs**

In [1]:
!pip3 install -q -U accelerate
!pip3 install -q -U bitsandbytes
!pip3 install -q -U peft
!pip3 install -q -U trl
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers
!pip install -q rouge_score
!pip install -q optuna

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
trl 0.12.1 requires datasets>=2.21.0, but you have datasets 2.17.0 which is incompatible.

In [2]:
import torch

print("Is CUDA available? ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device name: ", torch.cuda.get_device_name(0))

Is CUDA available?  True
Device name:  Tesla T4


In [3]:
from peft import LoraConfig
from datasets import load_dataset
from datasets import load_metric
import pandas as pd
import numpy as np

import transformers
from trl import SFTTrainer
from rouge_score import rouge_scorer
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from google.colab import userdata

In [None]:
import os

# Never commit real tokens: set your own values here, or load them from Colab secrets with userdata.get(...)
os.environ["HF_TOKEN"] = "<YOUR_HF_TOKEN>"
os.environ["WEIGHT_BIASES"] = "<YOUR_WANDB_API_KEY>"

# **Load Model and tokenizer**

In [5]:
# load a pre-trained tokenizer from the Hugging Face Model Hub, with authentication for the Hugging Face API token
# google/gemma-2-2b-it
model_id = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ["HF_TOKEN"])

# **LORA-Finetuning**

## **Load Dataset**

In [6]:
from datasets import load_dataset

## List of datasets for summarization. Choose one of them for your task.
# https://paperswithcode.com/dataset/cnn-daily-mail-1
# data = load_dataset("knkarthick/dialogsum") ##Dialogue Summarization Dataset
# data = load_dataset("cnn_dailymail","3.0.0")
# data = load_dataset("GEM/wiki_lingua")

!pip install -q py7zr
data = load_dataset("samsum")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [7]:
data

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [8]:
data["train"][0]

{'id': '13818513',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}

In [9]:
!pip3 install -q -U wandb

In [10]:
# integrate Weights & Biases (W&B) with training process for tracking, monitoring, and collaboration

import wandb

wandb.login(key=os.environ["WEIGHT_BIASES"])
run = wandb.init(
    project="Distributed_Mistral-7B-Instruct-v0.3_FineTuning",
    job_type="training",
    anonymous="allow",
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mpratik_ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
# Preprocessing: build the training text from one dataset record before it is passed to the model
def create_prompt(example):
    text = f"user:\nSummarise dialogue in one sentence: {example['dialogue']} {example['summary']}"
    return [text]
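
To see what the trainer receives, the formatting function can be applied to a single record. The sketch below repeats `create_prompt` so that it runs on its own and uses the first SAMSum training record shown in the output above. Note that the dialogue and the reference summary end up in one string, separated only by a space.

```python
def create_prompt(example):
    # Same formatting function as in the notebook, repeated for a standalone run
    text = f"user:\nSummarise dialogue in one sentence: {example['dialogue']} {example['summary']}"
    return [text]


# First training record, copied from the dataset output above
example = {
    "id": "13818513",
    "dialogue": "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
    "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
}

prompt = create_prompt(example)
print(prompt[0])
```

The function returns a list with a single string, which is the shape `formatting_func` is given here.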

## **LORA distributed hyper-parameters tuning with optuna and accelerate**

In [12]:
!pip install -q accelerate

**Accelerate Param details**

- `accelerator = Accelerator()`

  Initializes an instance of the `Accelerator` class from the `accelerate` library. This instance manages distributed training across multiple devices (e.g., multiple GPUs or TPUs).

- `accelerator.prepare`

  The `accelerator.prepare` method prepares the model, datasets, and other components for distributed training, so that they are set up to work across multiple devices.

    1. Model Preparation:
      - The method wraps the model so that it can be trained across multiple devices. This may involve moving the model to the appropriate device (e.g., GPU) and setting up data parallelism.

    2. Dataset Preparation:
      - The method wraps the data so that it can be used in a distributed setup. This may involve creating distributed data loaders that split the data across multiple devices.

    3. Return Prepared Components:
      - The method returns the prepared components (here the model, training dataset, and validation dataset), in the same order in which they were passed, ready for distributed training.
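
The idea of splitting data across devices can be illustrated without any GPU. The following is a minimal sketch in plain Python, not Accelerate's actual implementation: the helper `shard_indices` is hypothetical and simply gives each process every `num_processes`-th sample, a round-robin scheme similar in spirit to what a distributed sampler does.

```python
def shard_indices(num_samples, num_processes, rank):
    """Return the sample indices handled by process `rank` (hypothetical helper)."""
    return list(range(rank, num_samples, num_processes))


num_samples = 10    # e.g. a 10-row subset of the training data
num_processes = 2   # assumed number of devices

# One list of indices per process; together they cover every sample exactly once
shards = [shard_indices(num_samples, num_processes, rank) for rank in range(num_processes)]

for rank, shard in enumerate(shards):
    print(f"process {rank}: {shard}")
```

With two processes, each one handles half of the samples, so one pass over the data takes half as many steps per device.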

In [None]:
import optuna
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
from datasets import DatasetDict


# Define the objective function
def objective(trial):
    dataset_dict = DatasetDict(data)
    DATA_RECORD_SIZE = 10  # number of rows used from each split during tuning
    # Extract the first DATA_RECORD_SIZE rows from the training dataset
    training_dataset = dataset_dict["train"].select(range(DATA_RECORD_SIZE))
    # Extract the first DATA_RECORD_SIZE rows from the validation dataset
    val_dataset = dataset_dict["validation"].select(range(DATA_RECORD_SIZE))

    print(training_dataset)
    print(val_dataset)

    # Define hyperparameters to tune
    lora_r = trial.suggest_int("lora_r", 2, 4)
    lora_alpha = trial.suggest_int("lora_alpha", 16, 64)

    # learning_rate = trial.suggest_loguniform("learning_rate", 2e-4, 3e-4)
    # lora_dropout = trial.suggest_uniform("lora_dropout", 0.01, 0.03)
    # optim = trial.suggest_categorical("optim", ["paged_adamw_8bit", "paged_adamw_32bit"])
    # gradient_accumulation_steps = trial.suggest_int("gradient_accumulation_steps", 2, 3)
    # target_modules = trial.suggest_categorical(
    #     "target_modules",
    #     [
    #         ["q_proj", "v_proj"],
    #         ["q_proj", "k_proj", "v_proj"],
    #         [
    #             "q_proj",
    #             "o_proj",
    #             "k_proj",
    #             "v_proj",
    #             "gate_proj",
    #             "up_proj",
    #             "down_proj",
    #         ],
    #     ],
    # )

    lora_config = LoraConfig(
        r=lora_r,  # hyperparam tuning
        lora_alpha=lora_alpha,  # hyperparam tuning
        lora_dropout=0.02,
        target_modules=["q_proj", "k_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )

    NUM_OF_ITERATION = 20  # passed as max_steps, which overrides num_train_epochs
    # Define training arguments
    training_arguments = transformers.TrainingArguments(
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=3,
        # num_train_epochs=NUM_OF_EPOCHS,
        warmup_steps=10,
        eval_strategy="steps",
        eval_steps=0.2,
        max_steps=NUM_OF_ITERATION,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_32bit",
        report_to="none",  # set none to disable
    )

    # Initialize the Accelerator for distributed processing
    accelerator = Accelerator()

    # Load the pre-trained model.
    # BitsAndBytesConfig specifies the settings for quantizing the model to 4-bit (NF4) precision,
    # which reduces the model's memory footprint
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    # load a pre-trained causal language model with specific quantization settings
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config
    )

    # Prepare the model and datasets with the Accelerator
    model, training_dataset, val_dataset = accelerator.prepare(
        model, training_dataset, val_dataset
    )

    # Initialize the Trainer
    tokenizer.pad_token = tokenizer.eos_token  # Ensure pad token is set
    tokenizer.padding_side = "left"  # it is a decoder-only model, it is generally recommended to set padding_side to "left".
    trainer = SFTTrainer(
        model=model,
        train_dataset=training_dataset,
        eval_dataset=val_dataset,
        max_seq_length=512,  # max sequence length in tokens (prompt + summary); directly affects GPU memory use
        args=training_arguments,
        peft_config=lora_config,
        formatting_func=create_prompt,  # preprocessing function before input
        processing_class=tokenizer,
    )

    # Train the model
    trainer.train()

    # Evaluate the model
    eval_results = trainer.evaluate()

    # Return the evaluation metric to optimize
    return eval_results["eval_loss"]


# Create an Optuna study and optimize the objective function
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)

# Print the best hyperparameters
best_params = study.best_params
print("Best hyperparameters: ", best_params)

[I 2024-12-03 07:56:42,823] A new study created in memory with name: no-name-7da96f06-fd88-46a6-a5c1-1ab37c45ffc5


Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 10
})
Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 10
})


`low_cpu_mem_usage` was None, now default to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss
4,0.7738,2.386719
8,0.6691,2.190549
12,0.5195,2.069803
16,0.4171,2.06167
20,0.2518,2.17861


[I 2024-12-03 07:58:53,197] Trial 0 finished with value: 2.17861008644104 and parameters: {'lora_r': 2, 'lora_alpha': 35}. Best is trial 0 with value: 2.17861008644104.

Step,Training Loss,Validation Loss
4,0.7738,2.37418
8,0.6719,2.211726
12,0.5531,2.073637
16,0.4828,2.061614
20,0.3253,2.098849


[I 2024-12-03 08:00:55,197] Trial 1 finished with value: 2.098849296569824 and parameters: {'lora_r': 3, 'lora_alpha': 19}. Best is trial 1 with value: 2.098849296569824.

Step,Training Loss,Validation Loss
4,0.7738,2.36674
8,0.6387,2.151016
12,0.4947,2.046266
16,0.3789,2.051684
20,0.2167,2.196161


[I 2024-12-03 08:02:56,735] Trial 2 finished with value: 2.1961607933044434 and parameters: {'lora_r': 3, 'lora_alpha': 32}. Best is trial 1 with value: 2.098849296569824.

Step,Training Loss,Validation Loss
4,0.7738,2.386719
8,0.6815,2.192869
12,0.5005,2.062756
16,0.4529,2.042771
20,0.1824,2.383192


[I 2024-12-03 08:04:57,651] Trial 3 finished with value: 2.3831918239593506 and parameters: {'lora_r': 2, 'lora_alpha': 61}. Best is trial 1 with value: 2.098849296569824.

Step,Training Loss,Validation Loss
4,0.7738,2.386719
8,0.6668,2.182402
12,0.5149,2.070879
16,0.3356,2.144276
20,0.1003,2.444385


[I 2024-12-03 08:06:58,577] Trial 4 finished with value: 2.4443845748901367 and parameters: {'lora_r': 2, 'lora_alpha': 37}. Best is trial 1 with value: 2.098849296569824.

Step,Training Loss,Validation Loss
4,0.7738,2.386719
8,0.667,2.189377
12,0.5168,2.048227
16,0.4718,2.047642
20,0.2414,2.20161


[I 2024-12-03 08:09:07,793] Trial 5 finished with value: 2.2016096115112305 and parameters: {'lora_r': 3, 'lora_alpha': 37}. Best is trial 1 with value: 2.098849296569824.



Step,Training Loss,Validation Loss
4,0.7634,2.353562
8,0.6517,2.171851
12,0.537,2.069089
16,0.4301,2.058832
20,0.3124,2.081241


[I 2024-12-03 08:11:11,768] Trial 6 finished with value: 2.081240653991699 and parameters: {'lora_r': 4, 'lora_alpha': 16}. Best is trial 6 with value: 2.081240653991699.

Step,Training Loss,Validation Loss
4,0.7738,2.373904
8,0.6682,2.199207
12,0.5475,2.077997
16,0.426,2.068957
20,0.2584,2.203246


[I 2024-12-03 08:13:14,583] Trial 7 finished with value: 2.2032458782196045 and parameters: {'lora_r': 2, 'lora_alpha': 20}. Best is trial 6 with value: 2.081240653991699.

Step,Training Loss,Validation Loss
4,0.7738,2.386719
8,0.6745,2.197309
12,0.5297,2.056151
16,0.2918,2.194189
20,0.2003,2.287405


[I 2024-12-03 08:15:17,594] Trial 8 finished with value: 2.287405490875244 and parameters: {'lora_r': 2, 'lora_alpha': 33}. Best is trial 6 with value: 2.081240653991699.

Step,Training Loss,Validation Loss
4,0.7738,2.368093
8,0.6469,2.157817
12,0.5107,2.059878
16,0.3557,2.048344
20,0.2536,2.158393


[I 2024-12-03 08:17:21,787] Trial 9 finished with value: 2.158392906188965 and parameters: {'lora_r': 4, 'lora_alpha': 28}. Best is trial 6 with value: 2.081240653991699.

Step,Training Loss,Validation Loss
4,0.7738,2.386719
8,0.645,2.14766
12,0.4766,2.049273
16,0.1785,2.356506
20,0.0197,2.709504


[I 2024-12-03 08:19:25,921] Trial 10 finished with value: 2.7095043659210205 and parameters: {'lora_r': 4, 'lora_alpha': 51}. Best is trial 6 with value: 2.081240653991699.

Step,Training Loss,Validation Loss
4,0.7634,2.35345
8,0.6517,2.17199
12,0.5371,2.082313
16,0.439,2.05277
20,0.3185,2.093026


[I 2024-12-03 08:21:29,534] Trial 11 finished with value: 2.0930259227752686 and parameters: {'lora_r': 4, 'lora_alpha': 16}. Best is trial 6 with value: 2.081240653991699.

Step,Training Loss,Validation Loss
4,0.7632,2.353942
8,0.6517,2.172124
12,0.5371,2.081579
16,0.4334,2.057569
20,0.3145,2.078552


[I 2024-12-03 08:23:31,182] Trial 12 finished with value: 2.078552007675171 and parameters: {'lora_r': 4, 'lora_alpha': 16}. Best is trial 12 with value: 2.078552007675171.

Step,Training Loss,Validation Loss
4,0.7738,2.369937
8,0.654,2.170605
12,0.5244,2.07208
16,0.3783,2.073867
20,0.2773,2.098116


[I 2024-12-03 08:25:32,337] Trial 13 finished with value: 2.0981156826019287 and parameters: {'lora_r': 4, 'lora_alpha': 25}. Best is trial 12 with value: 2.078552007675171.

Step,Training Loss,Validation Loss
4,0.7738,2.35645
8,0.612,2.116189
12,0.4362,2.049376
16,0.2766,2.207769
20,0.0709,2.527023


[I 2024-12-03 08:27:33,804] Trial 14 finished with value: 2.5270228385925293 and parameters: {'lora_r': 4, 'lora_alpha': 46}. Best is trial 12 with value: 2.078552007675171.

Step,Training Loss,Validation Loss
4,0.7738,2.370576
8,0.6566,2.177552
12,0.5287,2.076849
16,0.4462,2.053386
20,0.2627,2.143404


[I 2024-12-03 08:29:35,123] Trial 15 finished with value: 2.1434037685394287 and parameters: {'lora_r': 4, 'lora_alpha': 24}. Best is trial 12 with value: 2.078552007675171.

Step,Training Loss,Validation Loss
4,0.7738,2.386719
8,0.6552,2.171863
12,0.5011,2.047524
16,0.2167,2.256908
20,0.0369,2.638846


[I 2024-12-03 08:31:36,119] Trial 16 finished with value: 2.638845682144165 and parameters: {'lora_r': 3, 'lora_alpha': 44}. Best is trial 12 with value: 2.078552007675171.

Step,Training Loss,Validation Loss
4,0.7626,2.351871
8,0.6475,2.165455
12,0.5303,2.079188
16,0.4205,2.065434
20,0.2908,2.113096


[I 2024-12-03 08:33:36,795] Trial 17 finished with value: 2.113095998764038 and parameters: {'lora_r': 4, 'lora_alpha': 17}. Best is trial 12 with value: 2.078552007675171.

Step,Training Loss,Validation Loss
4,0.7738,2.370994
8,0.6547,2.17679
12,0.5217,2.058739
16,0.3831,2.06517
20,0.2819,2.154642


[I 2024-12-03 08:35:38,395] Trial 18 finished with value: 2.154641628265381 and parameters: {'lora_r': 3, 'lora_alpha': 25}. Best is trial 12 with value: 2.078552007675171.

Step,Training Loss,Validation Loss
4,0.7738,2.386719
8,0.6388,2.134162
12,0.4609,2.036675
16,0.165,2.405524
20,0.0144,2.793889


[I 2024-12-03 08:37:39,324] Trial 19 finished with value: 2.793888568878174 and parameters: {'lora_r': 4, 'lora_alpha': 56}. Best is trial 12 with value: 2.078552007675171.


Best hyperparameters:  {'lora_r': 4, 'lora_alpha': 16}


**Retrieve the best hyperparameters**

In [14]:
# Retrieve the best hyperparameters
best_params = study.best_params
print("Best hyperparameters: \n", best_params)

Best hyperparameters: 
 {'lora_r': 4, 'lora_alpha': 16}
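
One way to read these values: in standard LoRA, the low-rank update is multiplied by `lora_alpha / r` before it is added to the frozen weight, so the two tuned parameters together set how strongly the adapter influences the model. The short sketch below computes that factor for the best trial and, for contrast, for the trial with the highest final validation loss in the log above. It assumes PEFT's default scaling (rank-stabilized LoRA would divide by the square root of `r` instead), and the helper name is illustrative.

```python
def lora_scaling(lora_alpha, r):
    """Scaling applied to the low-rank update in standard LoRA (assumed default)."""
    return lora_alpha / r


best = {"lora_r": 4, "lora_alpha": 16}   # trial 12, final validation loss 2.0786
worst = {"lora_r": 4, "lora_alpha": 56}  # trial 19, final validation loss 2.7939

best_scale = lora_scaling(best["lora_alpha"], best["lora_r"])
worst_scale = lora_scaling(worst["lora_alpha"], worst["lora_r"])

print("best trial scaling: ", best_scale)
print("worst trial scaling:", worst_scale)
```

In this run, larger `lora_alpha` values tended to coincide with very low training loss and higher validation loss at step 20, which is consistent with overfitting on the 10-example subset. With only 20 trials on so little data, this should be read as a hint rather than a conclusion.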


## **Distributed training for final model with best hyperparameters**

In [15]:
# #Load base/pretrained model for training

# Clear GPU cache before loading the model for the second time
torch.cuda.empty_cache()

# Load model for training with CPU offloading enabled
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Enable CPU offloading for specific layers
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # Let Transformers automatically decide device placement
)


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [16]:
from datasets import DatasetDict

DATA_RECORD_SIZE = 100  # size of training dataset

dataset_dict = DatasetDict(data)
# Extract the first 100 rows from the training dataset
training_dataset = dataset_dict["train"].select(range(DATA_RECORD_SIZE))

# Extract the first 100 rows from the validation dataset
val_dataset = dataset_dict["validation"].select(range(DATA_RECORD_SIZE))

print(training_dataset)
print(val_dataset)

Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 100
})
Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 100
})


In [18]:
# Use the best hyperparameters to train the final model
# best_lora_dropout = best_params["lora_dropout"]
best_lora_r = best_params["lora_r"]
best_lora_alpha = best_params["lora_alpha"]
# best_optim = best_params["optim"]
# best_target_modules = best_params["target_modules"]
# best_learning_rate = best_params["learning_rate"]
# best_gradient_accumulation_steps = best_params["gradient_accumulation_steps"]

In [19]:
from accelerate import Accelerator

# Initialize the Accelerator
accelerator = Accelerator()


## Training Arguments Parameters

1. **`per_device_train_batch_size`**:
   - **Description**: The batch size per device (GPU/TPU/CPU) during training.
   - **Example**: If you have 2 GPUs and set `per_device_train_batch_size=1`, the effective batch size will be 2.

2. **`per_device_eval_batch_size`**:
   - **Description**: The batch size per device (GPU/TPU/CPU) during evaluation.
   - **Example**: If you have 2 GPUs and set `per_device_eval_batch_size=1`, the effective batch size will be 2.

3. **`gradient_accumulation_steps`**:
   - **Description**: The number of forward/backward passes over which gradients are accumulated before the optimizer performs one parameter update.
   - **Example**: If set to 2, the model will accumulate gradients over 2 steps before updating the model parameters, effectively doubling the batch size.

4. **`num_train_epochs`**:
   - **Description**: The total number of training epochs.
   - **Example**: If set to 1, the model will see the entire training dataset once.

5. **`warmup_steps`**:
   - **Description**: The number of steps to perform learning rate warmup.
   - **Example**: If set to 2, the learning rate will gradually increase over the first 2 steps.

6. **`eval_strategy`**:
   - **Description**: The evaluation strategy to use during training.
   - **Options**: `"no"` (no evaluation), `"steps"` (evaluate every `eval_steps`), `"epoch"` (evaluate at the end of each epoch).
   - **Example**: If set to `"steps"`, the model will be evaluated every `eval_steps`.

7. **`eval_steps`**:
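   - **Description**: The number of update steps between evaluations. An integer is used as is; a float in `[0, 1)` is interpreted as a fraction of the total training steps.
   - **Example**: With `eval_steps=0.2` and `max_steps=20`, as used here, the model is evaluated every 4 steps, which matches steps 4, 8, 12, 16, and 20 in the training logs.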

8. **`max_steps`**:
   - **Description**: The total number of training steps to perform.
   - **Example**: If set to 50, the training will stop after 50 steps, regardless of the number of epochs.

9. **`learning_rate`**:
   - **Description**: The initial learning rate for the optimizer.
   - **Example**: This is a hyperparameter that can be tuned. In this notebook it is fixed at `2e-4`; the Optuna suggestion for it is commented out.

10. **`fp16`**:
    - **Description**: Whether to use 16-bit (mixed) precision training instead of 32-bit.
    - **Example**: If set to `True`, the model will use mixed precision training, which can speed up training and reduce memory usage.

11. **`logging_steps`**:
    - **Description**: The number of steps between logging events.
    - **Example**: If set to 1, the model will log training metrics every step.

12. **`output_dir`**:
    - **Description**: The directory where the model checkpoints and other outputs will be saved.
    - **Example**: If set to `"outputs"`, all outputs will be saved in the `outputs` directory.

13. **`optim`**:
    - **Description**: The optimizer to use.
    - **Options**: `"adamw_torch"`, `"paged_adamw_32bit"`, `"adamw_hf"`, etc.
    - **Example**: This is a hyperparameter that can be tuned. In this notebook it is fixed to `"paged_adamw_32bit"`; the Optuna suggestion for it is commented out.

14. **`report_to`**:
    - **Description**: The list of integrations to report the results and logs to.
    - **Options**: `"none"`, `"wandb"`, `"tensorboard"`, etc.
    - **Example**: If set to `"none"`, no reporting will be done to any integration.
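
Two of these settings interact in ways that are easy to misjudge, so a small worked calculation may help. The sketch below uses the values from this notebook and assumes a single GPU; the helper names are illustrative. It computes the effective batch size per optimizer update and resolves a fractional `eval_steps` into a step interval, following the rule that a float below 1 is treated as a fraction of the total training steps.

```python
import math


def effective_batch_size(per_device_batch_size, grad_accum_steps, num_devices):
    """Number of examples contributing to one optimizer update."""
    return per_device_batch_size * grad_accum_steps * num_devices


def resolve_eval_interval(eval_steps, max_steps):
    """Turn eval_steps into an integer step interval (a float < 1 is a ratio of max_steps)."""
    if eval_steps < 1:
        return math.ceil(max_steps * eval_steps)
    return int(eval_steps)


max_steps = 20
batch = effective_batch_size(per_device_batch_size=1, grad_accum_steps=3, num_devices=1)
interval = resolve_eval_interval(eval_steps=0.2, max_steps=max_steps)
eval_points = list(range(interval, max_steps + 1, interval))
examples_seen = batch * max_steps

print("effective batch size:", batch)
print("evaluate every", interval, "steps, at steps", eval_points)
print("examples processed in total:", examples_seen)
```

Under these assumptions, 20 steps process 60 examples, so a 10-row training subset is seen about six times, while a 100-row subset is not seen in full even once.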

In [20]:
# Define LoRA configuration with the best hyperparameters
lora_config = LoraConfig(
    r=best_lora_r,
    lora_alpha=best_lora_alpha,
    lora_dropout=0.02,
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)


NUM_OF_ITERATION = 20

# Define training arguments with the best hyperparameters
training_arguments = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=3,
    # num_train_epochs=NUM_OF_EPOCHS,
    warmup_steps=2,
    eval_strategy="steps",  # "epoch", "steps",
    eval_steps=0.2,
    max_steps=NUM_OF_ITERATION,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    output_dir="final_outputs",
    optim="paged_adamw_32bit",
    report_to="wandb",
)
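As a rough guide to what these arguments imply, the number of examples consumed per optimizer update is the per-device batch size times the gradient accumulation steps times the number of processes. The sketch below works this out for the values used above; `effective_batch_size` is a hypothetical helper, and `num_processes=1` is an assumption (it depends on how many devices Accelerate launches on).

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_processes=1):
    """Examples contributing to one optimizer update (illustrative helper)."""
    return per_device_batch_size * gradient_accumulation_steps * num_processes


# Values from the training arguments above; num_processes=1 is an assumption
per_update = effective_batch_size(
    per_device_batch_size=1, gradient_accumulation_steps=3, num_processes=1
)
max_steps = 20
total_examples = per_update * max_steps
print(per_update, total_examples)
```

Under that single-process assumption, the 20 optimizer steps cover only a few dozen training examples, which is worth keeping in mind when reading the loss values.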

In [21]:
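from accelerate import Accelerator
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation

# Initialize the Accelerator
accelerator = Accelerator()

# Ensure pad token is set
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # as it is a decoder-only model, it is recommended to set padding_side to "left".

# Initialize the optimizer (weight_decay=0.0 matches the default of the deprecated transformers.AdamW)
# Note: SFTTrainer below is not given this optimizer; it builds its own from training_arguments.optim
optimizer = AdamW(model.parameters(), lr=training_arguments.learning_rate, weight_decay=0.0)

# Prepare the model, optimizer, and datasets with the Accelerator
model, optimizer, training_dataset, val_dataset = accelerator.prepare(
    model, optimizer, training_dataset, val_dataset
)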



**Parameter Details**

- ```accelerator.wait_for_everyone()```

 This method synchronizes all processes in a distributed training setup, ensuring that every process reaches the same point before any of them proceeds.
 This is crucial for keeping multiple devices (e.g., several GPUs or TPUs) consistent and coordinated during training.
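The barrier behaviour described above can be illustrated without any GPUs. The sketch below is only an analogy built on the standard library: threads stand in for processes, `threading.Barrier.wait()` plays the role of `accelerator.wait_for_everyone()`, and the `rank == 0` check plays the role of the main-process check. It does not use Accelerate's API, and `NUM_WORKERS` and `worker` are hypothetical names.

```python
import threading
import time

NUM_WORKERS = 3  # hypothetical stand-in for the number of training processes
barrier = threading.Barrier(NUM_WORKERS)
events = []
events_lock = threading.Lock()


def worker(rank):
    time.sleep(0.01 * rank)  # simulate workers finishing training at different times
    with events_lock:
        events.append(("trained", rank))
    barrier.wait()  # nobody continues until every worker has arrived here
    if rank == 0:  # only one worker writes the checkpoint
        with events_lock:
            events.append(("saved", rank))


threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(NUM_WORKERS)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(events)
```

Because the save happens only after the barrier, the "saved" event can never appear before every worker has recorded "trained", which is the guarantee the notebook relies on before saving the model.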

In [22]:
from accelerate import DistributedType

# Initialize Trainer with the best hyperparameters
trainer = SFTTrainer(
    model=model,
    train_dataset=training_dataset,
    eval_dataset=val_dataset,
    peft_config=lora_config,
    max_seq_length=512,  # max length to input/output. It is crucial for GPU memory management
    dataset_text_field="dialogue",
    formatting_func=create_prompt,  # preprocessing function before input
    processing_class=tokenizer,
    args=training_arguments,
    packing=False,  # packing disabled: each example stays its own sequence instead of being packed together with others
)

# Train the final model
model.config.use_cache = False

# Use the Accelerator to manage the training loop
trainer.train()

# Save the final model
# accelerator.wait_for_everyone() synchronizes all processes in a distributed training setup, ensuring that every process reaches this point before the model is saved.
# This keeps multiple devices (e.g., several GPUs or TPUs) consistent and coordinated.
accelerator.wait_for_everyone()
if accelerator.is_local_main_process:
    model.save_pretrained("final_model")
    tokenizer.save_pretrained("final_model")


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
max_steps is given, it will override any value given in num_train_epochs


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 4    | 0.7253        | 2.231347        |
| 8    | 0.5571        | 2.080308        |
| 12   | 0.4704        | 2.067158        |
| 16   | 0.3589        | 2.08595         |
| 20   | 0.2705        | 2.129035        |
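In the log above, the training loss keeps falling while the validation loss stops improving partway through and then rises, which is a common sign that the later steps start to overfit. A minimal sketch of picking the step with the lowest validation loss from these logged values follows; the variable names are illustrative, and the notebook itself keeps the final step rather than selecting a checkpoint this way.

```python
# (step, training_loss, validation_loss) copied from the training log above
history = [
    (4, 0.7253, 2.231347),
    (8, 0.5571, 2.080308),
    (12, 0.4704, 2.067158),
    (16, 0.3589, 2.08595),
    (20, 0.2705, 2.129035),
]

# Select the logged step with the lowest validation loss
best_step, best_train_loss, best_val_loss = min(history, key=lambda row: row[2])

# Gap between validation and training loss at the final step
final_step, final_train_loss, final_val_loss = history[-1]
final_gap = final_val_loss - final_train_loss

print(best_step, best_val_loss, round(final_gap, 4))
```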


## Delta weights

Save only the delta (LoRA adapter) weights. The code is commented out because it is not required right now.

In [23]:
"""
# uncomment code to save only the delta (LoRA adapter) weights

# After training with peft_config, trainer.model is a PeftModel, so
# save_pretrained writes only the adapter weights and adapter config,
# not the full base model
trainer.model.save_pretrained("delta_weights")

"""

Later merge delta weights with original model
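For intuition, merging a LoRA adapter amounts to adding a scaled low-rank product to each targeted weight matrix: W' = W + (lora_alpha / r) · B · A. The sketch below shows that arithmetic on a tiny 2×2 example with plain Python lists. The helper names `matmul` and `merge_lora` and all numbers are made up for illustration; this is not how PEFT implements the merge internally.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(left[i][k] * right[k][j] for k in range(len(right))) for j in range(len(right[0]))]
        for i in range(len(left))
    ]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(weight_row, delta_row)]
        for weight_row, delta_row in zip(weight, delta)
    ]


# Toy example: 2x2 base weight with a rank-1 adapter (B is 2x1, A is 1x2)
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, -1.0]]
merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, because their effect is already contained in the merged weight.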

In [24]:
"""
# uncomment code to merge delta weights with original model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the original model and tokenizer
model_id = "your-model-id"
base_model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the delta (LoRA adapter) weights on top of the original model
model = PeftModel.from_pretrained(base_model, "delta_weights")

# Merge the delta weights into the original weights and drop the adapter wrappers
model = model.merge_and_unload()

# Example usage of the model and tokenizer
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors='pt')
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

"""

In [25]:
# wandb.finish()
# model.config.use_cache = True

# **Model Evaluation using Rouge Score**

More on the ROUGE score: https://arxiv.org/abs/1803.01937
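To make the metric concrete, here is a simplified sketch of ROUGE-1, which counts overlapping unigrams between a reference summary and a predicted summary. Precision divides the overlap by the prediction length, recall divides it by the reference length, and F1 is their harmonic mean. The tokenizer here (lowercasing and keeping alphanumeric runs, no stemming) is an assumption, and `tokenize` and `rouge1` are illustrative names; the cells below rely on the `rouge_score` package instead.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (simplified, no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(reference, prediction):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    ref_counts = Counter(tokenize(reference))
    pred_counts = Counter(tokenize(prediction))
    overlap = sum((ref_counts & pred_counts).values())  # clipped matches
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


reference = "the cat sat on the mat"
prediction = "the cat sat"
precision, recall, f1 = rouge1(reference, prediction)

# Swapping the two arguments swaps precision and recall but leaves F1 unchanged
swapped_precision, swapped_recall, swapped_f1 = rouge1(prediction, reference)
print(precision, recall, f1)
```

The argument order matters when reading precision and recall separately, since the two values trade places if reference and prediction are exchanged.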

In [26]:
text = """user: Generate summary of this dialogue in one line
          dialogue:
          Rachel: <file_other>
          Rachel: Top 50 Best Films of 2018
          Rachel: :)
          Janice: Omg, I've watched almost all 50... xDD
          Spencer: Hahah, Deadpool 2 also??
          Janice: Yep
          Spencer: Really??
          Janice: My bf forced me to watch it xD
          Rachel: Hahah
          Janice: It wasn't that bad
          Janice: I thought it'd be worse
          Rachel: And Avengers? :D
          Janice: 2 times
          Rachel: Omg
          Janice: xP
          Rachel: You are the best gf in the world
          Rachel: Your bf should appreciate that ;-)
          Janice: He does
          Janice: x)
AI Summary:"""

device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

true_summary = "Rachel sends a list of Top 50 films of 2018. Janice watched almost half of them, Deadpool 2 and Avengers included."

outputs = model.generate(**inputs, max_new_tokens=50)
model_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(model_summary)

print("---------------------------------------------------------------------")
end_token = ""

highlight = str.strip(model_summary.split("AI Summary:")[1])
print(f"Generated Summary: {highlight}")
print("---------------------------------------------------------------------")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


user: Generate summary of this dialogue in one line
          dialogue:
          Rachel: <file_other>
          Rachel: Top 50 Best Films of 2018
          Rachel: :)
          Janice: Omg, I've watched almost all 50... xDD
          Spencer: Hahah, Deadpool 2 also??
          Janice: Yep
          Spencer: Really??
          Janice: My bf forced me to watch it xD
          Rachel: Hahah
          Janice: It wasn't that bad
          Janice: I thought it'd be worse
          Rachel: And Avengers? :D
          Janice: 2 times
          Rachel: Omg
          Janice: xP
          Rachel: You are the best gf in the world
          Rachel: Your bf should appreciate that ;-)
          Janice: He does
          Janice: x)
AI Summary: Rachel shared a list of the top 50 films of 2018, Janice mentioned she had watched most of them, Spencer asked if she had watched Deadpool 2, Janice confirmed she had, and Rachel praised Janice
--------------------------------------------------------------------
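Because the decoded output repeats the prompt, the summary has to be cut out after the `AI Summary:` delimiter. The cell above does this with `split(...)[1]`, which raises an `IndexError` if the delimiter is missing. A slightly more defensive variant is sketched below; `extract_summary` is a hypothetical helper, and unlike `split(...)[1]` it keeps everything after the first delimiter.

```python
def extract_summary(generated_text, delimiter="AI Summary:"):
    """Return the text after the first delimiter, or the whole text if it is absent."""
    _before, found, after = generated_text.partition(delimiter)
    if not found:
        # Delimiter missing: fall back to the full decoded text
        return generated_text.strip()
    return after.strip()


decoded = "user: Generate summary of this dialogue in one line\nAI Summary: Rachel shared a list. "
print(extract_summary(decoded))
print(extract_summary("text without the delimiter"))
```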

In [27]:
def calculate_rouge_scores(original_summary, generated_summary):
    rouge = load_metric("rouge")
    scores = rouge.compute(
        predictions=[str.strip(generated_summary)], references=[original_summary]
    )
    return scores

In [28]:
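# Note: this datasets-based result is immediately overwritten by the rouge_score result below
rouge_scores = calculate_rouge_scores(highlight, true_summary)
rouge_scorer_ = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL", "rougeLsum"])
# RougeScorer.score expects (target, prediction). Here the generated summary is passed
# as the target, so the reported precision and recall are swapped relative to the usual
# reference-based convention; the F1 score is symmetric and unaffected.
rouge_scores = rouge_scorer_.score(highlight, true_summary)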

for metric, scores in rouge_scores.items():
    print(f"{metric}:")
    print(f"Precision: {scores.precision}")
    print(f"Recall: {scores.recall}")
    print(f"F1 Score: {scores.fmeasure}")
    print()

  rouge = load_metric("rouge")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


rouge1:
Precision: 0.7619047619047619
Recall: 0.45714285714285713
F1 Score: 0.5714285714285714

rouge2:
Precision: 0.45
Recall: 0.2647058823529412
F1 Score: 0.33333333333333337

rougeL:
Precision: 0.7619047619047619
Recall: 0.45714285714285713
F1 Score: 0.5714285714285714

rougeLsum:
Precision: 0.7619047619047619
Recall: 0.45714285714285713
F1 Score: 0.5714285714285714



In [29]:
rouge_scores

{'rouge1': Score(precision=0.7619047619047619, recall=0.45714285714285713, fmeasure=0.5714285714285714),
 'rouge2': Score(precision=0.45, recall=0.2647058823529412, fmeasure=0.33333333333333337),
 'rougeL': Score(precision=0.7619047619047619, recall=0.45714285714285713, fmeasure=0.5714285714285714),
 'rougeLsum': Score(precision=0.7619047619047619, recall=0.45714285714285713, fmeasure=0.5714285714285714)}

### Calculate Rouge Score on test data

In [30]:
# Use the first 5 examples of the validation split as a small test set
test_dataset = dataset_dict["validation"].select(range(5))

test_dataset = pd.DataFrame(test_dataset)

In [31]:
num_iterations = len(test_dataset)

avg_scores = {
    "rouge1": {"precision": 0, "recall": 0, "f1": 0},
    "rouge2": {"precision": 0, "recall": 0, "f1": 0},
    "rougeL": {"precision": 0, "recall": 0, "f1": 0},
    "rougeLsum": {"precision": 0, "recall": 0, "f1": 0},
}

print("Test dataset metrics...")
for idx, row in test_dataset.iterrows():
    dialogue = row["dialogue"]
    true_summary = row["summary"]

    text = f"""user\n Write the highlight of this dialogue in one sentence:{dialogue}\nAI Summary:"""
    device = "cuda:0"
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    model_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print("---------------------------------------------------------------------")
    print(f"True Summary: {true_summary}")

    delimiter = "AI Summary:"
    end_token = ""

    # Keep only the text generated after the prompt delimiter
    highlight = str.strip(model_summary.split(delimiter)[1])
    print(f"Generated Summary: {highlight}")

    rouge_scores = calculate_rouge_scores(highlight, true_summary)
    rouge_scorer_ = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL", "rougeLsum"]
    )
    # RougeScorer.score expects (target, prediction); with this argument order the
    # reported precision and recall are swapped relative to the reference-based convention
    rouge_scores = rouge_scorer_.score(highlight, true_summary)

    for metric, scores in rouge_scores.items():
        rouge_scores_matrix = {
            metric: {
                "precision": scores.precision,
                "recall": scores.recall,
                "fmeasure": scores.fmeasure,
            }
        }
        # Convert the rouge_scores to a DataFrame
        df = pd.DataFrame(rouge_scores_matrix).transpose()
        print(df)

        print()
        avg_scores[metric]["precision"] += scores.precision
        avg_scores[metric]["recall"] += scores.recall
        avg_scores[metric]["f1"] += scores.fmeasure


for metric, scores in avg_scores.items():
    avg_scores[metric]["precision"] /= num_iterations
    avg_scores[metric]["recall"] /= num_iterations
    avg_scores[metric]["f1"] /= num_iterations


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Test dataset metrics...
---------------------------------------------------------------------
True Summary: A will go to the animal shelter tomorrow to get a puppy for her son. They already visited the shelter last Monday and the son chose the puppy. 
Generated Summary: Tom is asked by his friend if he can go with him to the animal shelter to get a puppy for his son. Tom agrees and they discuss the type of dog his son would like. Tom's son had previously visited the shelter




        precision    recall  fmeasure
rouge1   0.642857  0.409091       0.5

        precision    recall  fmeasure
rouge2   0.333333  0.209302  0.257143

        precision    recall  fmeasure
rougeL   0.535714  0.340909  0.416667

           precision    recall  fmeasure
rougeLsum   0.535714  0.340909  0.416667

---------------------------------------------------------------------
True Summary: Emma and Rob love the advent calendar. Lauren fits inside calendar various items, for instance, small toys and Christmas decorations. Her children are excited whenever they get the calendar.
Generated Summary: Emma and Lauren discuss the advent calendars they have for their children. Lauren mentions that she adds notes asking her children to do something nice for someone else. Emma is impressed and agrees that it makes the advent calendar more about traditions and being kind




        precision    recall  fmeasure
rouge1   0.413793  0.266667  0.324324

        precision    recall  fmeasure
rouge2   0.142857  0.090909  0.111111

        precision  recall  fmeasure
rougeL   0.310345     0.2  0.243243

           precision  recall  fmeasure
rougeLsum   0.310345     0.2  0.243243

---------------------------------------------------------------------
True Summary: Madison is pregnant but she doesn't want to talk about it. Patricia Stevens got married and she thought she was pregnant. 
Generated Summary: Jackie reveals that her friend Madison is pregnant but doesn't want to talk about it. Iggy asks why and Jackie doesn't know. Iggy then shares a personal story about a friend who was pregnant and he couldn't bring




        precision    recall  fmeasure
rouge1   0.636364  0.341463  0.444444

        precision  recall  fmeasure
rouge2    0.47619    0.25  0.327869

        precision    recall  fmeasure
rougeL   0.636364  0.341463  0.444444

           precision    recall  fmeasure
rougeLsum   0.636364  0.341463  0.444444

---------------------------------------------------------------------
True Summary: Marla found a pair of boxers under her bed.
Generated Summary: Marla found a pair of underwear under her bed and her friends helped her investigate who it belonged to. They concluded that her sister's friend might have put it there as a dare.




        precision    recall  fmeasure
rouge1   0.888889  0.235294  0.372093

        precision    recall  fmeasure
rouge2       0.75  0.181818  0.292683

        precision    recall  fmeasure
rougeL   0.888889  0.235294  0.372093

           precision    recall  fmeasure
rougeLsum   0.888889  0.235294  0.372093

---------------------------------------------------------------------
True Summary: Robert wants Fred to send him the address of the music shop as he needs to buy guitar cable.
Generated Summary: Robert asked for the address of a music shop and Fred provided the address on Google Maps. Robert thanked Fred for his help.
        precision    recall  fmeasure
rouge1   0.421053  0.347826  0.380952

        precision    recall  fmeasure
rouge2   0.166667  0.136364      0.15

        precision   recall  fmeasure
rougeL   0.315789  0.26087  0.285714

           precision   recall  fmeasure
rougeLsum   0.315789  0.26087  0.285714





### Average rouge score

In [32]:
# Convert the evaluation results to a DataFrame
df = pd.DataFrame(avg_scores)

# Transpose the DataFrame for better readability
df = df.transpose()

# Print the DataFrame
print("Test dataset average rouge score...")
print(df)

Test dataset average rouge score...
           precision    recall        f1
rouge1      0.600591  0.320068  0.404363
rouge2      0.373810  0.173679  0.227761
rougeL      0.537420  0.275707  0.352432
rougeLsum   0.537420  0.275707  0.352432
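The averages above are plain arithmetic means over the five evaluated examples. As a cross-check, the sketch below recomputes the `rouge1` row from the per-example values printed earlier (rounded to six decimals in that output, so the match is only approximate). The variable names are illustrative.

```python
from statistics import mean

# Per-example rouge1 values copied from the printed output above
rouge1_per_example = [
    {"precision": 0.642857, "recall": 0.409091, "f1": 0.5},
    {"precision": 0.413793, "recall": 0.266667, "f1": 0.324324},
    {"precision": 0.636364, "recall": 0.341463, "f1": 0.444444},
    {"precision": 0.888889, "recall": 0.235294, "f1": 0.372093},
    {"precision": 0.421053, "recall": 0.347826, "f1": 0.380952},
]

# Average each field over all examples
averages = {
    key: mean(row[key] for row in rouge1_per_example)
    for key in ("precision", "recall", "f1")
}
print({key: round(value, 6) for key, value in averages.items()})
```

Collecting per-example scores in a list and averaging at the end is an alternative to the running sums used in the loop, and it keeps the individual scores available for later inspection.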


In [33]:
wandb.finish()

Run history:
eval/loss,█▂▁▂▄
eval/runtime,▄█▇▄▁
eval/samples_per_second,▄▁▂▅█
eval/steps_per_second,▄▁▂▅█
train/epoch,▁▁▂▂▂▂▃▃▄▄▄▄▅▅▅▅▆▆▇▇▇▇████
train/global_step,▁▁▂▂▂▂▃▃▄▄▄▄▅▅▅▅▆▆▇▇▇▇████
train/grad_norm,██▅▂▁▂▁ ▄▂▂▂▃▃▄▅█▇
train/learning_rate,▁▅██▇▇▆▆▆▆▅▅▅▄▄▃▃▃▂▂
train/loss,███▇▆▆▅▅▅▅▄▄▃▃▃▂▂▂▁▁

Run summary:
eval/loss,2.12903
eval/runtime,0.9752
eval/samples_per_second,1.025
eval/steps_per_second,1.025
total_flos,437217184972800.0
train/epoch,20.0
train/global_step,20.0
train/grad_norm,
train/learning_rate,3e-05
train/loss,0.2705


# Push Model to Huggingface hub

In [34]:
new_model = "Dist_Mistral-7B-Instruct-v0.3_summarizer_v2"
trainer.model.save_pretrained(new_model)
trainer.model.push_to_hub(new_model, use_temp_dir=False)

README.md:   0%|          | 0.00/5.42k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/9.46M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Prat/Dist_Mistral-7B-Instruct-v0.3_summarizer_v2/commit/fb815744a4e6561ecbcd2994ec0f178358c9aa89', commit_message='Upload model', commit_description='', oid='fb815744a4e6561ecbcd2994ec0f178358c9aa89', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Prat/Dist_Mistral-7B-Instruct-v0.3_summarizer_v2', endpoint='https://huggingface.co', repo_type='model', repo_id='Prat/Dist_Mistral-7B-Instruct-v0.3_summarizer_v2'), pr_revision=None, pr_num=None)
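The uploaded adapter is small because LoRA stores only two low-rank matrices per targeted projection, which adds `r * (in_features + out_features)` parameters per module. The sketch below estimates that count for the `q_proj`, `k_proj` and `v_proj` targets. The projection shapes and layer count are assumptions based on the published Mistral-7B configuration, `rank=4` is a hypothetical value (the notebook uses the Optuna-selected `best_lora_r`, which is not shown here), and the size estimate assumes 32-bit weights.

```python
def lora_param_count(rank, module_shapes, num_layers):
    """Number of LoRA parameters: rank * (in + out) per module, summed over all layers."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes)
    return per_layer * num_layers


# Assumed (in_features, out_features) for q_proj, k_proj, v_proj in Mistral-7B
shapes = [(4096, 4096), (4096, 1024), (4096, 1024)]
params = lora_param_count(rank=4, module_shapes=shapes, num_layers=32)  # rank is hypothetical
size_mb = params * 4 / 1e6  # 4 bytes per parameter, assuming float32 storage
print(params, round(size_mb, 2))
```

Under these assumptions the estimate lands in the same range as the `adapter_model.safetensors` upload shown above, though the actual rank may differ.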