

# **Finetuning ```google/gemma-2-27b-it``` model using ```LORA``` on ```SAMSum dataset``` (abstractive dialogue summaries)**



*   **Author:** ```Pratik Vyas```
*   **Task:** ```Summarization```
*   **Pretrained model:** [gemma-2-27b-it]( https://huggingface.co/google/gemma-2-27b-it )
*   **Dataset:** [SAMSum]( https://paperswithcode.com/dataset/samsum-corpus )
*   **Evaluation Matrix:** ```Rouge score```
*   **Finetuning Metrics:** [gemma-2-27b-it Finetuning Metrics](https://github.com/Git-PratikVyas/Finetuning-LORA/blob/main/FinetuningMetrics/gemma_2_27b_it_Analyse_finetuning_Metrics.ipynb)
*   **Finetuned model at Huggingface hub:** [Prat/gemma-2-27b-it_ft_summarizer_v3](https://huggingface.co/Prat/gemma-2-27b-it-ft-summarizer-v3)







# **Import Libs**

In [1]:
!pip3 install -q -U accelerate
!pip3 install -q -U bitsandbytes
!pip3 install -q -U peft
!pip3 install -q -U trl
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers
!pip install -q rouge_score
!pip install -q optuna
!pip install -q --upgrade torch
!pip3 install -q -U wandb
!pip install -q accelerate
!pip install -q GPUtil

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m336.3/336.3 kB[0m [31m7.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m69.1/69.1 MB[0m [31m31.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m374.8/374.8 kB[0m [31m6.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m365.7/365.7 kB[0m [31m9.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m24.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m13.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m179.3/179.3 kB[0m [31m18.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m14.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
from peft import LoraConfig
from datasets import load_dataset
from datasets import load_metric
import pandas as pd
import numpy as np

import transformers
from trl import SFTTrainer
from rouge_score import rouge_scorer
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from google.colab import userdata

In [3]:
import os

os.environ["HF_TOKEN"] = "HF_KEY"
os.environ["WB_KEY"] = "WB_KEY"

# **Load tokenizer**

In [4]:
# load a pre-trained tokenizer from the Hugging Face Model Hub, with authentication for the Hugging Face API token

model_id = "google/gemma-2-27b-it"

new_model = "gemma-2-27b-it_ft"

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ["HF_TOKEN"])

tokenizer_config.json:   0%|          | 0.00/47.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

# **Load Dataset**

In [5]:
from datasets import load_dataset

## list of dataset for summarization. Choose one of them for your task
# https://paperswithcode.com/dataset/cnn-daily-mail-1
# data = load_dataset("knkarthick/dialogsum") ##Dialogue Summarization Dataset
# data = load_dataset("cnn_dailymail","3.0.0")
# data = load_dataset("GEM/wiki_lingua")

!pip install -q py7zr
data = load_dataset("samsum")

print(data)

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.9/67.9 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.0/3.0 MB[0m [31m44.9 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m93.1/93.1 kB[0m [31m9.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.7/49.7 kB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.3/2.3 MB[0m [31m90.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m138.9/138.9 kB[0m [31m14.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m413.7/413.7 kB[0m [31m37.5 MB/s[0m eta [36m0:00:00[0m
[?25h

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/3.36k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.04k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.94M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})


In [6]:
# Using list comprehension to count words in each dialogue
word_counts_dialogue = [len(dialogue.split()) for dialogue in data["train"]["dialogue"]]
# Get the maximum number of words
max_words_dialogue = max(word_counts_dialogue)
print(f"Maximum number of tokens in dialogue: {max_words_dialogue}")

word_counts_summary = [len(summary.split()) for summary in data["train"]["summary"]]
max_words_summary = max(word_counts_summary)
print(f"Maximum number of tokens in Summary: {max_words_summary}")


Maximum number of tokens in dialogue: 803
Maximum number of tokens in Summary: 64


In [7]:
# integrate Weights & Biases (W&B) with training process for tracking, monitoring, and collaboration
import os
import wandb

wandb.login(key=os.environ["WB_KEY"])
run = wandb.init(
    project="gemma-2-27b-it_ft_summarizer_v3",
    job_type="training",
    anonymous="allow",
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mpratik_ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [8]:
# preprcessing before passing input
def create_prompt(example):
    prefix_text = "Summarize dialogue in one sentence:"
    text = f"""<start_of_turn>user\n {prefix_text} {example['dialogue']} <end_of_turn>\n<start_of_turn>model {example['summary']} <end_of_turn>"""
    return [text]

# **LORA Finetuning**

**Load pre-trained model**


 **BitsAndBytes**, a library designed to facilitate efficient loading and inference of LLMs with reduced precision. This is particularly useful for deploying models on hardware with limited memory resources.

1. `use_4bit`

- **Definition**: This parameter activates the loading of base models in 4-bit precision.
- **Purpose**: Using 4-bit precision significantly reduces the memory footprint of the model, allowing larger models to fit into GPU memory. This is especially beneficial for inference tasks where high throughput is required but full precision is not necessary.
- **Implications**: When set to `True`, the model weights are quantized to 4 bits, which can lead to a trade-off between model performance (accuracy) and resource efficiency. This setting is particularly useful when deploying large models in production environments where memory constraints are a concern.

2. `bnb_4bit_compute_dtype`

- **Definition**: This parameter specifies the data type used for computations involving 4-bit models.
- **Options**: The common options include:
  - **`bfloat16`**: Half-precision floating-point format, which uses 16 bits per value.
  - **`float32`**: Single-precision floating-point format, using 32 bits per value.
- **Purpose**: By setting this parameter to `float16`, you enable faster computations while still maintaining a reasonable level of numerical stability. Using `float16` can improve performance on compatible hardware (like NVIDIA GPUs with Tensor Cores) by allowing for faster matrix operations and reduced memory bandwidth usage.
- **Implications**: The choice of compute dtype can affect both the speed and accuracy of the model's predictions. While `float16` can speed up computations, it may also introduce some numerical inaccuracies compared to using `float32`.

3. `bnb_4bit_quant_type`

- **Definition**: This parameter specifies the type of quantization used for the 4-bit model weights.
- **Options**:
  - **`fp4`**: A specific quantization format that uses floating-point representations optimized for low precision.
  - **`nf4`**: Another format that stands for "Narrow Float 4," which is designed to provide better accuracy at lower bit widths by utilizing a narrower representation.
- **Purpose**: The choice of quantization type can significantly impact both the model's performance and its memory efficiency. Different quantization schemes can yield varying levels of accuracy when using low-bit representations.
- **Implications**: Selecting `nf4` may provide better performance in terms of maintaining model accuracy compared to `fp4`, depending on the specific characteristics of the model and task.

4. `use_nested_quant`

- **Definition**: This parameter activates nested quantization for 4-bit base models, also known as double quantization.
- **Purpose**: Nested quantization involves applying quantization techniques multiple times (e.g., first quantizing weights down to a lower precision and then further quantizing those results). This can help achieve even lower memory usage while attempting to maintain performance.


In [9]:
import time

start_time = time.time()


# #Load base/pretrained model for training

# Clear GPU cache
torch.cuda.empty_cache()

# Load model for training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,
    # Enable CPU offloading for specific layers
    llm_int8_enable_fp32_cpu_offload=False,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,  # <-- important to set bfloat16
    device_map="auto",  # Let Transformers automatically decide device placement
)

print(model)

end_time = time.time()
print("\n\n--->Execution Time:", (end_time - start_time) / 60, "minutes")

config.json:   0%|          | 0.00/893 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/42.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/12 [00:00<?, ?it/s]

model-00001-of-00012.safetensors:   0%|          | 0.00/4.74G [00:00<?, ?B/s]

model-00002-of-00012.safetensors:   0%|          | 0.00/4.87G [00:00<?, ?B/s]

model-00003-of-00012.safetensors:   0%|          | 0.00/4.87G [00:00<?, ?B/s]

model-00004-of-00012.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00005-of-00012.safetensors:   0%|          | 0.00/4.87G [00:00<?, ?B/s]

model-00006-of-00012.safetensors:   0%|          | 0.00/4.87G [00:00<?, ?B/s]

model-00007-of-00012.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00008-of-00012.safetensors:   0%|          | 0.00/4.87G [00:00<?, ?B/s]

model-00009-of-00012.safetensors:   0%|          | 0.00/4.87G [00:00<?, ?B/s]

model-00010-of-00012.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00011-of-00012.safetensors:   0%|          | 0.00/4.87G [00:00<?, ?B/s]

model-00012-of-00012.safetensors:   0%|          | 0.00/680M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/12 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 4608, padding_idx=0)
    (layers): ModuleList(
      (0-45): 46 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): Linear4bit(in_features=4608, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4608, out_features=2048, bias=False)
          (v_proj): Linear4bit(in_features=4608, out_features=2048, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4608, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding()
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear4bit(in_features=4608, out_features=36864, bias=False)
          (up_proj): Linear4bit(in_features=4608, out_features=36864, bias=False)
          (down_proj): Linear4bit(in_features=36864, out_features=4608, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((4608,), eps=1e-06)
        (post_attention_layernor

In [10]:
text = """user
Summarize dialogue one sentence:
Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)
model
"""
device = "cuda:0"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=50, temperature=0.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.


user
Summarize dialogue one sentence:
Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)
model
Amanda offers Jerry some cookies and promises to bring them to him the next day. 




**Load Dataset ( train and validation )**

In [11]:
from datasets import DatasetDict

# TRAIN_DATA_RECORD_SIZE = 1400  # <-----
# VAL_DATA_RECORD_SIZE = 300

dataset_dict = DatasetDict(data)
# Extract the first 100 rows from the training dataset
# training_dataset = dataset_dict["train"].select(range(TRAIN_DATA_RECORD_SIZE))
training_dataset = dataset_dict["train"]

# Extract the first 100 rows from the training dataset
# val_dataset = dataset_dict["validation"].select(range(VAL_DATA_RECORD_SIZE))
val_dataset = dataset_dict["validation"]

print(training_dataset)
print(val_dataset)

Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 14732
})
Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 818
})


In [12]:
training_dataset[0]

{'id': '13818513',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}

**Set best LORA hyper-parameters**

```Target Modules```

1. `q_proj` (Query Projection):
   - **Definition**: This module is responsible for projecting the input embeddings into the query space.
   - **Functionality**: In the attention mechanism, the query vectors are derived from the input embeddings to determine how much focus should be placed on different parts of the input sequence.
   - **Role in Attention**: The query vectors are compared against key vectors to compute attention scores, which dictate how much attention each token should pay to others.

2. `o_proj` (Output Projection):
   - **Definition**: This module is used to project the output of the attention mechanism back into the original embedding space.
   - **Functionality**: After calculating attention scores and aggregating values, the resulting output needs to be transformed back to match the dimensionality of the input embeddings for further processing.
   - **Role in Attention**: It ensures that the output from the attention layer can be fed into subsequent layers of the model, maintaining consistency in dimensions.

3. `k_proj` (Key Projection):
   - **Definition**: This module projects input embeddings into the key space.
   - **Functionality**: Similar to query projection, key projection transforms input embeddings into key vectors that are used in conjunction with query vectors during the attention calculation.
   - **Role in Attention**: The keys are compared with queries to generate attention scores, which determine how relevant each token is concerning others.

4. `v_proj` (Value Projection):
   - **Definition**: This module projects input embeddings into the value space.
   - **Functionality**: Value vectors represent the actual content that will be aggregated based on attention scores.
   - **Role in Attention**: After computing attention weights from queries and keys, these weights are applied to value vectors to produce a weighted sum that forms the output of the attention mechanism.

5. `gate_proj` (Gate Projection):
   - **Definition**: This module is part of a gating mechanism often used in more complex architectures or specific models like transformers with additional control over information flow.
   - **Functionality**: Gates can modulate how much information passes through certain layers or components based on learned parameters.
   - **Role in Attention/Modeling**: It helps manage which parts of information are retained or discarded during processing, enhancing model flexibility and performance.

6. `up_proj` (Upward Projection):
   - **Definition**: This module typically refers to a projection that increases dimensionality or transforms data into a higher-dimensional space.
   - **Functionality**: In certain architectures, upward projections can be used to expand feature representations before passing them through non-linear transformations or additional layers.
   - **Role in Model Structure**: It can help capture more complex relationships by allowing for richer representations at certain stages of processing.

7. `down_proj` (Downward Projection):
   - **Definition**: This module reduces dimensionality or transforms data into a lower-dimensional space.
   - **Functionality**: Downward projections can be used to condense information after processing through multiple layers or operations, making it more manageable for subsequent computations.
   - **Role in Model Structure**: It helps streamline data flow and reduce computational overhead while retaining essential features.


In [13]:
################################################################################
# LoRA parameters
################################################################################
best_lora_dropout = 0.3
best_lora_r = 2
best_lora_alpha = 4
best_target_modules = [
    "q_proj",
    "o_proj",
    "k_proj",
    "v_proj",
    "gate_proj",
    "up_proj",
    "down_proj",
]

**Method to log CPU/ memory usage matrices during training**



In [14]:
resource_usage_training_df = pd.DataFrame(columns=["cpu_usage", "memory_usage"])


def log_resource_usage(stage):
    # CPU and memory usage
    # stage=trial.number
    cpu_usage = psutil.cpu_percent(interval=1)
    memory_usage = psutil.virtual_memory().percent
    print(f"CPU Usage: {cpu_usage}%")
    print(f"Memory Usage: {memory_usage}%")

    # GPU usage
    gpus = GPUtil.getGPUs()
    for gpu in gpus:
        gpu_memory_used = gpu.memoryUsed
        gpu_memory_total = gpu.memoryTotal
        gpu_utilization = gpu.load
        print(
            f"GPU {gpu.id} - Memory Usage: {gpu.memoryUsed}/{gpu.memoryTotal} MB - Utilization: {gpu.load * 100}%"
        )

    # Initialize a DataFrame to store resource usage metrics
    # Append the metrics to the DataFrame

    # Create a dictionary of the metrics
    metrics = {
        "stage": stage,
        "cpu_usage": cpu_usage,
        "memory_usage": memory_usage,
        "gpu_memory_used": gpu_memory_used,
        "gpu_memory_total": gpu_memory_total,
        "gpu_utilization": gpu_utilization * 100,  # Convert to percentage
    }
    # Append the metrics to the DataFrame
    global resource_usage_training_df
    resource_usage_training_df = pd.concat(
        [resource_usage_training_df, pd.DataFrame([metrics])], ignore_index=True
    )


**LORA config and training Arguments**


`TrainingArguments Parameter`

1. **`per_device_train_batch_size`**:
   - **Definition**: This parameter sets the batch size for training on each device (e.g., GPU).
   - **Details**: A batch size of `1` means that each training step will process one sample at a time. Smaller batch sizes can lead to more frequent updates but may result in noisier gradients and longer training times.

2. **`per_device_eval_batch_size`**:
   - **Definition**: This parameter sets the batch size for evaluation on each device.
   - **Details**: Similar to the training batch size, a batch size of `1` for evaluation means that one sample will be evaluated at a time. This can be useful for memory-constrained environments or when evaluating models on large datasets.

3. **`gradient_accumulation_steps`**:
   - **Definition**: This parameter specifies how many steps to accumulate gradients before performing a backward/update pass.
   - **Details**: Setting this to `2` means that gradients will be accumulated over 2 steps before updating the model weights. This effectively simulates a larger batch size without increasing memory usage, which can be beneficial when working with limited GPU memory.

4. **`num_train_epochs`**:
   - **Definition**: This parameter indicates the total number of epochs for training.
   - **Details**: An epoch is one complete pass through the entire training dataset. The variable `num_epochs` should be defined elsewhere in your code, determining how many times the model will see the entire dataset during training.

5. **`warmup_steps`**:
   - **Definition**: This parameter specifies the number of steps for linear learning rate warmup.
   - **Details**: During warmup, the learning rate increases linearly from `0` to the initial learning rate over the specified number of steps. This helps stabilize training in the early phases and can prevent large gradient updates that might destabilize learning.

6. **`evaluation_strategy`**:
   - **Definition**: This parameter determines when to evaluate the model during training.
   - **Details**: Setting this to `"steps"` means that evaluation will occur at regular intervals defined by `eval_steps`.

7. **`eval_steps`**:
   - **Definition**: This parameter specifies how often to evaluate the model during training.
   - **Details**: The value `0.1` typically indicates that evaluation will occur every 10% of the total number of training steps.

8. **`learning_rate`**:
   - **Definition**: This parameter sets the initial learning rate for the optimizer.
   - **Details**: A learning rate of `1e-4` (0.0001) is balancing between convergence speed and stability.

9. **`weight_decay`**:
   - **Definition**: This parameter applies weight decay (L2 regularization) to prevent overfitting by penalizing large weights.
   - **Details**: A weight decay value of `1e-2` (0.01) helps regularize the model, encouraging smaller weights and potentially improving generalization.

10. **`fp16`**:
    - **Definition**: This parameter enables mixed precision training using 16-bit floating-point (FP16) format.
    - **Details**: Setting this to `False` means that FP16 training is disabled, and full precision (FP32) will be used instead.

11. **`bf16`**:
    - **Definition**: This parameter enables bfloat16 precision, which is particularly useful for training on TPUs or specific GPUs.
    - **Details**: Setting this to `True` allows using bfloat16, which can provide similar benefits as FP16 while maintaining a wider dynamic range, reducing issues with underflow.

12. **`logging_steps`**:
    - **Definition**: This parameter specifies how often to log training metrics.
    - **Details**: A value of `1` means that metrics will be logged after every step, which can provide detailed insights into model performance during training.

13. **`output_dir`**:
    - **Definition**: This parameter specifies where to save model checkpoints and logs.
    - **Details**: The directory `"train_outputs"` will contain all saved models and logs during training.

14. **`optim`**:
    - **Definition**: This parameter specifies which optimizer to use during training.
    - **Details**: Setting this to `"paged_adamw_8bit"` indicates that a specific variant of AdamW optimized for 8-bit precision will be used, which can help reduce memory usage while maintaining efficiency.

15. **`report_to`**:
    - **Definition**: This parameter determines where to report metrics during training.
    - **Details**: Setting this to `"wandb"` indicates that metrics will be reported to Weights & Biases (WandB). other options is `"tensorboard"`.


In [15]:
# Define LoRA configuration with the best hyperparameters
lora_config = LoraConfig(
    r=best_lora_r,
    lora_alpha=best_lora_alpha,
    lora_dropout=best_lora_dropout,
    target_modules=best_target_modules,
    task_type="CAUSAL_LM",
)


# NUM_OF_EPOCHS = 20  # <----
training_arguments = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    # num_train_epochs=NUM_OF_EPOCHS,
    warmup_steps=3,
    evaluation_strategy="steps",
    eval_steps=0.1,
    max_steps=75,
    learning_rate=1e-4,
    weight_decay=1e-2,
    fp16=False,
    bf16=True,
    logging_steps=1,
    output_dir="train_outputs",
    optim="paged_adamw_8bit",
    report_to="wandb",
)



**Model training**

The `Accelerator` is used to facilitate distributed training and mixed precision training. It simplifies the process of scaling up your model training across multiple GPUs or even multiple nodes, and it can also help with optimizing memory usage and computational efficiency.

Benefits :
1. Distributed Training:
   - Benefit: Allows the training process to be distributed across multiple GPUs or nodes, which can significantly speed up training times.
   - Example: If you have multiple GPUs, `Accelerator` will automatically distribute the model and data across these GPUs, enabling parallel processing. Accelerator manages communication between devices, ensuring that gradients are synchronized correctly.

2. Mixed Precision Training:
   - Benefit: Reduces memory usage and can speed up training by using lower precision (e.g., `bfloat16`).
   - Example: By using mixed precision, you can fit larger models or larger batch sizes into GPU memory, which can improve training efficiency.

3. Simplified Device Management:
   - Benefit: Automatically handles the placement of tensors on the correct devices, reducing the complexity of managing device-specific code.
   - Example: You don't need to manually move tensors to the GPU or handle device-specific operations; `Accelerator` takes care of it.

By using `Accelerator`, you can achieve faster training times, better memory utilization, and easier scaling of your model training process.  

In [16]:
from transformers import AdamW
from accelerate import Accelerator


# Initialize the Accelerator
accelerator = Accelerator()

# Ensure pad token is set
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # as it is a decoder-only model, it is recommended to set padding_side to "left".

# Initialize the optimizer
optimizer = AdamW(model.parameters(), lr=training_arguments.learning_rate)

# Prepare the model, tokenizer, datasets, and optimizer with the Accelerator
model, optimizer, training_dataset, val_dataset = accelerator.prepare(
    model, optimizer, training_dataset, val_dataset
)



In [17]:
from accelerate import DistributedType
import time

# Function to log resource usage
import psutil
import GPUtil

start_time = time.time()

# Clear GPU cache
torch.cuda.empty_cache()


SAVE_MODEL = True
# Initialize Trainer with the best hyperparameters
trainer = SFTTrainer(
    model=model,
    train_dataset=training_dataset,
    eval_dataset=val_dataset,
    peft_config=lora_config,
    max_seq_length=900,  # max length to input. It is crucial for GPU memory management
    dataset_text_field="dialogue",  #  field in the dataset that contains the text data to be used for training and evaluation.
    formatting_func=create_prompt,  # preprocessing function before input
    processing_class=tokenizer,
    args=training_arguments,
    packing=False,  # The trainer will attempt to pack multiple sequences into a single batch
)

# Train the final model
model.config.use_cache = False

# Log resource usage before training
print("Resource usage before training:")
log_resource_usage(1)


# Add the custom callback to the trainer
# trainer.add_callback(LoggingCallback)

# Use the Accelerator to manage the training loop
trainer.train()

# Log resource usage before training
print("Resource usage after training:")
log_resource_usage(2)


# Save the final model
# accelerator.wait_for_everyone() method is used to synchronize all processes in a distributed training setup,ensuring that all processes reach the same point before proceeding.
# This is crucial for maintaining consistency and coordination across multiple devices (e.g., multiple GPUs or TPUs) during training.
accelerator.wait_for_everyone()
if accelerator.is_local_main_process:
    if SAVE_MODEL:
        trainer.model.save_pretrained(new_model)
        trainer.tokenizer.save_pretrained(new_model)

end_time = time.time()
print("\n\n--->Execution Time:", (end_time - start_time) / 60, "minutes")


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/14732 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]



Resource usage before training:
CPU Usage: 1.0%
Memory Usage: 3.8%
GPU 0 - Memory Usage: 21699.0/40960.0 MB - Utilization: 0.0%


  resource_usage_training_df = pd.concat(
  return fn(*args, **kwargs)


Step,Training Loss,Validation Loss
8,2.1515,2.076029
16,2.3333,1.959144
24,2.2421,1.890339
32,1.7967,1.847928
40,2.0336,1.834626
48,1.447,1.836888
56,1.9496,1.844111
64,1.5861,1.848862
72,1.8536,1.854472


Resource usage after training:
CPU Usage: 10.9%
Memory Usage: 3.9%
GPU 0 - Memory Usage: 21871.0/40960.0 MB - Utilization: 0.0%


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.




--->Execution Time: 4.10518221060435 minutes


In [18]:
wandb.finish()
model.config.use_cache = True

0,1
eval/loss,█▅▃▁▁▁▁▁▂
eval/runtime,█▂▄▁█▂▂▅▇
eval/samples_per_second,▅███▁██▅▅
eval/steps_per_second,▅███▁██▅▅
train/epoch,▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇█████
train/global_step,▁▁▁▁▁▂▂▂▂▃▃▃▃▃▃▄▄▄▅▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇██
train/grad_norm,▁▂▂▃▄▅▃▁▂▃▄▁▄▄▄▃▃▃▄▄▅▄▅▆▅▆▃▅▆▇▆▇██▃▇▇▇█▅
train/learning_rate,▆████▇▇▇▇▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁▁
train/loss,██▇▇▇▇▂▆▃▇▆▅▂▅▅▅▆▁▆▅▅▅▂▅▅▅▅▅▅▅▅▅▄▄▆▁▆▄▄▅

0,1
eval/loss,1.85447
eval/runtime,0.5033
eval/samples_per_second,1.987
eval/steps_per_second,1.987
total_flos,1.98434186062848e+16
train/epoch,9.4
train/global_step,75.0
train/grad_norm,2.90409
train/learning_rate,0.0
train/loss,3.7336


## Merge finetuned LORA with pre-trained model

In [1]:
model_id = "google/gemma-2-27b-it"
new_model = "gemma-2-27b-it_ft"

import os

os.environ["HF_TOKEN"] = "HF_KEY"
os.environ["WB_KEY"] = "WB_KEY"

In [3]:
# integrate Weights & Biases (W&B) with training process for tracking, monitoring, and collaboration
import os
import wandb

wandb.login(key=os.environ["WB_KEY"])
run = wandb.init(
    project="gemma-2-27b-it_ft_summarizer_v3",
    job_type="training",
    anonymous="allow",
)



In [4]:
from peft import LoraConfig
from datasets import load_dataset
from datasets import load_metric
import pandas as pd
import numpy as np

import transformers
from rouge_score import rouge_scorer
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


In [5]:
import os
import time

start_time = time.time()

# Clear GPU cache
torch.cuda.empty_cache()

# Load model for training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # Let Transformers automatically decide device placement
)

end_time = time.time()
print("\n\n--->Execution Time:", (end_time - start_time) / 60, "minutes")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/12 [00:00<?, ?it/s]



--->Execution Time: 3.8909881194432576 minutes


- The method loads the specified PEFT weights or configuration associated with new_model.
- It integrates these adaptations into the base_model, effectively modifying its architecture or parameters according to the specified PEFT approach.
- The resulting model is now capable of utilizing the fine-tuned parameters while retaining the original capabilities of the base model.

In [6]:
from peft import LoraConfig, PeftModel
import time

start_time = time.time()


tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ["HF_TOKEN"])

# - The method loads the specified PEFT weights associated with new_model.
# - It integrates these adaptations into the base_model, effectively modifying its architecture or parameters according to the specified PEFT approach.
# - The resulting model is now capable of utilizing the fine-tuned parameters while retaining the original capabilities of the base model.
peft_model = PeftModel.from_pretrained(base_model, new_model)


# - it combines the parameters of the LoRA layers with the corresponding layers in the base model. This effectively integrates the adaptations into the model's weights, allowing you to use the adapted model as a standalone entity
# - After merging, the method unloads or removes any temporary components related to the LoRA layers that are no longer needed.
merged_model = peft_model.merge_and_unload()

end_time = time.time()
print("\n\n--->Execution Time:", (end_time - start_time) / 60, "minutes")





--->Execution Time: 1.7694380203882853 minutes


In [7]:
print(merged_model)

Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 4608, padding_idx=0)
    (layers): ModuleList(
      (0-45): 46 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): Linear4bit(in_features=4608, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4608, out_features=2048, bias=False)
          (v_proj): Linear4bit(in_features=4608, out_features=2048, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4608, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding()
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear4bit(in_features=4608, out_features=36864, bias=False)
          (up_proj): Linear4bit(in_features=4608, out_features=36864, bias=False)
          (down_proj): Linear4bit(in_features=36864, out_features=4608, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((4608,), eps=1e-06)
        (post_attention_layernor

# **Model Evaluation using Rouge Score**

More on Roughe score at https://arxiv.org/abs/1803.01937

### Calculate Rouge Score on test data

In [8]:
from datasets import load_metric
from rouge_score import rouge_scorer
from datasets import load_dataset
from datasets import DatasetDict
import pandas as pd

In [9]:
def calculate_rouge_scores(original_summary, generated_summary):
    rouge = load_metric("rouge")
    scores = rouge.compute(
        predictions=[str.strip(generated_summary)], references=[original_summary]
    )
    return scores

In [10]:
def create_input_prompt(dialogue):
    prompt_template = """
  <start_of_turn>user
  Summarise dialogue in one sentence.
  {dialogue}
  <end_of_turn>\n<start_of_turn>model
  """
    prompt = prompt_template.format(dialogue=dialogue)
    return prompt

In [11]:
# Load test data

!pip install -q py7zr
data = load_dataset("samsum", split="test")

print(data)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 819
})


In [12]:
# take 25 test sample from test dataset

test_dataset = data.select(range(25))

print(test_dataset)
test_dataset = pd.DataFrame(test_dataset)

Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 25
})


In [13]:
num_iterations = len(test_dataset)

avg_scores = {
    "rouge1": {"precision": 0, "recall": 0, "f1": 0},
    "rouge2": {"precision": 0, "recall": 0, "f1": 0},
    "rougeL": {"precision": 0, "recall": 0, "f1": 0},
    "rougeLsum": {"precision": 0, "recall": 0, "f1": 0},
}


for idx, row in test_dataset.iterrows():
    device = "cuda:0"
    dialogue = row["dialogue"]
    true_summary = row["summary"]

    text = create_input_prompt(dialogue)  # convert into gemma prompt template

    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = merged_model.generate(**inputs, max_new_tokens=50)
    gemma_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(gemma_summary)

    print("---------------------------------------------------------------------")
    print(f"True Summary: {true_summary}")

    # end_token = ""

    highlight = str.strip(gemma_summary.split("model")[1])
    print(f"Generated Summary: {highlight}")
    print()

    ## rouge score
    rouge_scores = calculate_rouge_scores(highlight, true_summary)
    rouge_scorer_ = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL", "rougeLsum"]
    )
    rouge_scores = rouge_scorer_.score(highlight, true_summary)

    for metric, scores in rouge_scores.items():
        rouge_scores_matrix = {
            metric: {
                "precision": scores.precision,
                "recall": scores.recall,
                "fmeasure": scores.fmeasure,
            }
        }
        # Convert the rouge_scores to a DataFrame
        df = pd.DataFrame(rouge_scores_matrix).transpose()
        # print(df)

        avg_scores[metric]["precision"] += scores.precision
        avg_scores[metric]["recall"] += scores.recall
        avg_scores[metric]["f1"] += scores.fmeasure


for metric, scores in avg_scores.items():
    avg_scores[metric]["precision"] /= num_iterations
    avg_scores[metric]["recall"] /= num_iterations
    avg_scores[metric]["f1"] /= num_iterations


The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.



  user
  Summarise dialogue in one sentence.
  Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye
  
model
  Hannah asks Amanda for Betty's number, but Amanda suggests asking Larry instead, leading to Hannah reluctantly agreeing to text him. 


---------------------------------------------------------------------
True Summary: Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.
Generated Summary: Hannah asks Amanda for Betty's number, but Amanda suggests asking Larry instead, leading to Hannah reluctantly agreeing to text him.



  rouge = load_metric("rouge")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]


  user
  Summarise dialogue in one sentence.
  Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)
  
model
  Eric and Rob discuss a stand-up comedian's routine about a machine, finding it humorous and deciding to watch more of his work. 


---------------------------------------------------------------------
True Summary: Eric and Rob are going to watch a stand-up on youtube.
Generated Summary: Eric and Rob discuss a stand-up comedian's routine about a machine, finding it humorous and deciding to watch more of his work.



You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Lenny: Babe, can you help me with something?
Bob: Sure, what's up?
Lenny: Which one should I pick?
Bob: Send me photos
Lenny:  <file_photo>
Lenny:  <file_photo>
Lenny:  <file_photo>
Bob: I like the first ones best
Lenny: But I already have purple trousers. Does it make sense to have two pairs?
Bob: I have four black pairs :D :D
Lenny: yeah, but shouldn't I pick a different color?
Bob: what matters is what you'll give you the most outfit options
Lenny: So I guess I'll buy the first or the third pair then
Bob: Pick the best quality then
Lenny: ur right, thx
Bob: no prob :)
  
model
  Lenny asks Bob for help choosing between three pairs of trousers, and Bob advises them to pick the pair that offers the most outfit options and is of the best quality. 


---------------------------------------------------------------------
True Summary: Lenny can't decide which trousers to buy. Bob advised Lenny on that topic. Lenny goes with Bob's advice to p

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Will: hey babe, what do you want for dinner tonight?
Emma:  gah, don't even worry about it tonight
Will: what do you mean? everything ok?
Emma: not really, but it's ok, don't worry about cooking though, I'm not hungry
Will: Well what time will you be home?
Emma: soon, hopefully
Will: you sure? Maybe you want me to pick you up?
Emma: no no it's alright. I'll be home soon, i'll tell you when I get home. 
Will: Alright, love you. 
Emma: love you too. 
  
model
  Emma is having a bad day and doesn't feel like eating, but reassures Will she'll be home soon. 


---------------------------------------------------------------------
True Summary: Emma will be home soon and she will let Will know.
Generated Summary: Emma is having a bad day and doesn't feel like eating, but reassures Will she'll be home soon.



You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Ollie: Hi , are you in Warsaw
Jane: yes, just back! Btw are you free for diner the 19th?
Ollie: nope!
Jane: and the  18th?
Ollie: nope, we have this party and you must be there, remember?
Jane: oh right! i lost my calendar..  thanks for reminding me
Ollie: we have lunch this week?
Jane: with pleasure!
Ollie: friday?
Jane: ok
Jane: what do you mean " we don't have any more whisky!" lol..
Ollie: what!!!
Jane: you just call me and the all thing i heard was that sentence about whisky... what's wrong with you?
Ollie: oh oh... very strange! i have to be carefull may be there is some spy in my mobile! lol
Jane: dont' worry, we'll check on friday.
Ollie: don't forget to bring some sun with you
Jane: I can't wait to be in Morocco..
Ollie: enjoy and see you friday
Jane: sorry Ollie, i'm very busy, i won't have time for lunch  tomorrow, but may be at 6pm after my courses?this trip to Morocco was so nice, but time consuming!
Ollie: ok for tea!
Jane: 

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Benjamin: Hey guys, what are we doing with the keys today?
Hilary: I've got them. Whoever wants them can meet me at lunchtime or after
Elliot: I'm ok. We're meeting for the drinks in the evening anyway and I guess we'll be going back to the apartment together?
Hilary: Yeah, I guess so
Daniel: I'm with Hilary atm and won't let go of her for the rest of the day, so any option you guys choose is good for me
Benjamin: Hmm I might actually pass by at lunchtime, take the keys and go take a nap. I'm sooo tired after yesterday
Hilary: Sounds good. We'll be having lunch with some French people (the ones who work on the history of food in colonial Mexico - I already see you yawning your head off)
Benjamin: YAAAAWN 🙊 Where and where are you meeting?
Hilary: So I'm meeting them at the entrance to the conference hall at 2 pm and then we'll head to this place called La Cantina. Italian cuisine, which is quite funny, but that's what they've chosen
Benja

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Max: Know any good sites to buy clothes from?
Payton: Sure :) <file_other> <file_other> <file_other> <file_other> <file_other> <file_other> <file_other>
Max: That's a lot of them!
Payton: Yeah, but they have different things so I usually buy things from 2 or 3 of them.
Max: I'll check them out. Thanks. 
Payton: No problem :)
Max: How about u?
Payton: What about me?
Max: Do u like shopping?
Payton: Yes and no.
Max: How come?
Payton: I like browsing, trying on, looking in the mirror and seeing how I look, but not always buying.
Max: Y not?
Payton: Isn't it obvious? ;)
Max: Sry ;)
Payton: If I bought everything I liked, I'd have nothing left to live on ;)
Max: Same here, but probably different category ;)
Payton: Lol
Max: So what do u usually buy?
Payton: Well, I have 2 things I must struggle to resist!
Max: Which are?
Payton: Clothes, ofc ;)
Max: Right. And the second one?
Payton: Books. I absolutely love reading!
Max: Gr8! What books do u 

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Rita: I'm so bloody tired. Falling asleep at work. :-(
Tina: I know what you mean.
Tina: I keep on nodding off at my keyboard hoping that the boss doesn't notice..
Rita: The time just keeps on dragging on and on and on.... 
Rita: I keep on looking at the clock and there's still 4 hours of this drudgery to go.
Tina: Times like these I really hate my work.
Rita: I'm really not cut out for this level of boredom.
Tina: Neither am I.
  
model
  Rita and Tina are both exhausted and bored at work, wishing the day would end. 


---------------------------------------------------------------------
True Summary: Rita and Tina are bored at work and have still 4 hours left.
Generated Summary: Rita and Tina are both exhausted and bored at work, wishing the day would end.



You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Beatrice: I am in town, shopping. They have nice scarfs in the shop next to the church. Do you want one?
Leo: No, thanks
Beatrice: But you don't have a scarf.
Leo: Because I don't need it.
Beatrice: Last winter you had a cold all the time. A scarf could help.
Leo: I don't like them.
Beatrice: Actually, I don't care. You will get a scarf.
Leo: How understanding of you!
Beatrice: You were complaining the whole winter that you're going to die. I've had enough.
Leo: Eh.
  
model
  Beatrice insists on buying Leo a scarf despite his resistance, reminding him of his past suffering from colds. 


---------------------------------------------------------------------
True Summary: Beatrice wants to buy Leo a scarf, but he doesn't like scarves. She cares about his health and will buy him a scarf no matter his opinion.
Generated Summary: Beatrice insists on buying Leo a scarf despite his resistance, reminding him of his past suffering from colds.



You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Ivan: hey eric
Eric: yeah man
Ivan: so youre coming to the wedding
Eric: your brother's
Ivan: yea
Eric: i dont know mannn
Ivan: YOU DONT KNOW??
Eric: i just have a lot to do at home, plus i dont know if my parents would let me
Ivan: ill take care of your parents
Eric: youre telling me you have the guts to talk to them XD
Ivan: thats my problem
Eric: okay man, if you say so
Ivan: yea just be there 
Eric: alright
  
model
  Ivan tries to convince his hesitant friend Eric to attend his brother's wedding. 


---------------------------------------------------------------------
True Summary: Eric doesn't know if his parents let him go to Ivan's brother's wedding. Ivan will talk to them.
Generated Summary: Ivan tries to convince his hesitant friend Eric to attend his brother's wedding.



You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Wanda: Let's make a party!
Gina: Why?
Wanda: beacuse. I want some fun!
Gina: ok, what do u need?
Wanda: 1st I need too make a list
Gina: noted and then?
Wanda: well, could u take yours father car and go do groceries with me?
Gina: don't know if he'll agree
Wanda: I know, but u can ask :)
Gina: I'll try but theres no promisess
Wanda: I know, u r the best!
Gina: When u wanna go
Wanda: Friday?
Gina: ok, I'll ask
  
model
  Wanda wants to throw a party and asks Gina to help with groceries, potentially using her father's car. 


---------------------------------------------------------------------
True Summary: Wanda wants to throw a party. She asks Gina to borrow her father's car and go do groceries together. They set the date for Friday. 
Generated Summary: Wanda wants to throw a party and asks Gina to help with groceries, potentially using her father's car.



You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Martin: I won two cinema tickets!
Aggie: oh cool, how come?
Martin: online. on fb, the movie mag organized it
Aggie: so what did you do
Martin: just write a short review and that's it
Aggie: well done :) so what and when. and where?
Martin: the new film with Redford
Aggie: i guess i heard sth
Martin: it's pretty cool i heard. till the end of the week
Aggie: sounds good. we'll find time XD
  
model
  Martin won two cinema tickets to a new Redford film by writing a short review online and he and Aggie plan to see it together. 


---------------------------------------------------------------------
True Summary: Martin wrote a short review and won 2 cinema tickets on FB. Martin wants Aggie to go with him this week for the new film with Redford.
Generated Summary: Martin won two cinema tickets to a new Redford film by writing a short review online and he and Aggie plan to see it together.



You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Charlee: I'm in class. Theatre in Portuguese lol
Curtis: Realllly?
Charlee: Yes. One of my subjects at the university that I attend is portuguese theatre. We are preparing a performance
Curtis: What performance is this? Are you devising it?
Charlee: A polish one translated into portuguese
Curtis: Thats quite cool. Who is the writer?
Charlee: Mrożek
  
model
  Charlee tells Curtis about her Portuguese theatre class where they are preparing a performance of a Polish play translated into Portuguese by Mrożek. 


---------------------------------------------------------------------
True Summary: Charlee is attending Portuguese theater as a subject at university. He and other students are preparing a play by Mrożek translated into Portuguese.
Generated Summary: Charlee tells Curtis about her Portuguese theatre class where they are preparing a performance of a Polish play translated into Portuguese by Mrożek.



You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Mary: Are you going by car or train?
Tom: Ella rented a car
Ella: this makes all of this much faster
Mary: good decision
  
model
  Ella rented a car, which will make their trip faster. 


---------------------------------------------------------------------
True Summary: Ella rented a car, this makes things much faster for her and Tom. 
Generated Summary: Ella rented a car, which will make their trip faster.



You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Luke: are you still looking for someone to join netflix family?
Paul: yes, 1 person :)
Luke: i am the one!
Paul: sure, i will send you the login and password on sunday
Luke: ok we can talk tomorrow
Paul: i don't really remember it now
Luke: send me also the bank account details so I can wire you the money every month. Are you paying for this or someone else?
Paul: I do, and I keep track of everyone accessing so you should not expect any bans :D
Luke: easy mate :D you still on holidays with your girl?
Paul: last dinner :( tomorrow we are out
Luke: how long have you been there?
Paul: less than 8 days :/
  
model
  Luke agrees to join Paul's Netflix family plan and will pay him monthly. 


---------------------------------------------------------------------
True Summary: Paul is going to share his Netflix account with Luke. In exchange Luke is going to contribute to the subscription. Paul will send Luke his bank details. Paul is on vacation

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Greg: Hi, honey. I need to stay after hours :-(
Betsy: Again?
Greg: I'm sorry!
Betsy: What about Johnny?
Greg: Well, could you pick him up? 
Betsy: What if I can't?
Greg: Betsy?
Betsy: What if I can't?
Greg: Can't you, really?
Betsy: I can't. Today I need to work long hours as well. Tuesdays are your days in the kindergarten.
Greg: Talk to you later. I'll see what I can do.
Betsy: You'd better think of something.
Greg: Oh. Just stop it now.
  
model
  Greg needs Betsy to pick up their son Johnny from kindergarten because he has to work late, but Betsy is also working late and reminds Greg it's his day for kindergarten pickup. 


---------------------------------------------------------------------
True Summary: Greg and Betsy have a lot of work today, so they cannot pick up Johnny from the kindergarten. However, it's Greg's turn to do it. Greg will try to find a solution.
Generated Summary: Greg needs Betsy to pick up their son Johnny fro

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Ethan: somethin for Scott <file_photo>
Toby: haha, totally
Marshall: pretty much sums it up
Scott: you know you're exactly fuckin the same
Toby: oh we know honey bunny
Marshall: we just enjoy making fun of YOU
Ethan: xD
Scott: oh fuck y'all
Toby: <file_gif>
  
model
  Ethan, Toby, and Marshall are teasing Scott about his personality. 


---------------------------------------------------------------------
True Summary: Ethan, Toby and Marshall are making fun of Scott.
Generated Summary: Ethan, Toby, and Marshall are teasing Scott about his personality.



You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Igor: Shit, I've got so much to do at work and I'm so demotivated. 
John: It's pretty irresponsible to give that much work to someone on their notice period.
Igor: Yeah, exactly! Should I even care?
John: It's up to you, but you know what they say...
Igor: What do you mean?
John: Well, they say how you end things shows how you really are...
Igor: And now how you start, right?
John: Gotcha! 
Igor: So what shall I do then? 
John: It's only two weeks left, so grit your teeth and do what you have to do. 
Igor: Easy to say, hard to perform.
John: Come on, stop thinking, start doing! 
Igor: That's so typical of you!  ;)  
  
model
  Igor is feeling overwhelmed by work during his notice period and John encourages him to finish strong despite his lack of motivation. 


---------------------------------------------------------------------
True Summary: Igor has a lot of work on his notice period and he feels demotivated. John thinks he should do w

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Clara: Hi, what you up to?
Neela: Not much, chilling out.
Clara: Just rewatching Dear White People on Netflix, love it!😍
Neela: Oh yeah, heard of it, but not seen it yet? Any good?
Clara: Well, yes! I just said it was, LOL. It's about a fictional Ivy League University and the students in one House of Residence.
Neela: Why is it called Dear White People?
Clara: That's the name of the radio show the main character, Sam, presents on college radio.
Neela: Yeah, but why is it so good?
Clara: Well, it's mainly stories from the perspective of black students there, which I find very interesting. The characters are strong and likeable too.
Neela: I suppose it's rather different from the UK, then?
Clara: It seems so, as there is a lot more racial awareness and discrimination there than here. It all kicks off when there is a Blackface party held by an elite group of white students, which gets out of hand.
Neela: How's that?
Clara: Well, obviously, t

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Ernest: hey Mike, did you park your car on our street?
Mike: no, took it into garage today
Ernest: ok good
Mike: why?
Ernest: someone just crashed into a red honda looking just like yours
Mike: lol lucky me
  
model
  Ernest warns Mike that a car resembling his was just crashed into, but Mike reveals he parked his car in the garage. 


---------------------------------------------------------------------
True Summary: Mike took his car into garage today. Ernest is relieved as someone had just crashed into a red Honda which looks like Mike's. 
Generated Summary: Ernest warns Mike that a car resembling his was just crashed into, but Mike reveals he parked his car in the garage.



You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Deirdre: Hi Beth, how are you love?
Beth: Hi Auntie Deirdre, I'm been meaning to message you, had a favour to ask.
Deirdre: Wondered if you had any thought about your Mum's 40th, we've got to do something special!
Beth: How about a girls weekend, just mum, me, you and the girls, Kira will have to come back from Uni, of course.
Deirdre: Sounds fab! Get your thinking cap on, it's only in 6 weeks! Bet she's dreading it, I remember doing that!
Beth: Oh yeah, we had a surprise party for you, you nearly had a heart attack! 
Deirdre: Well, it was a lovely surprise! Gosh, thats nearly 4 years ago now, time flies! What was the favour, darling?
Beth: Oh, it was just that I fancied trying a bit of work experience in the salon, auntie.
Deirdre: Well, I am looking for Saturday girls, are you sure about it? you could do well in the exams and go on to college or 6th form.
Beth: I know, but it's not for me, auntie, I am doing all foundation papers and I'

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Gloria: This exam is a bit of a lottery in fact
Gloria: You can't really get prepared, it's all about experience
Emma: But there are some rules and some typical texts right?
Gloria: You can see some texts from previous years
Gloria: <file_other>
Emma: Wow that's very useful
Emma: I have never seen this site
Gloria: Yes it's very good
Gloria: Actually it's good to read all the texts because you will see that some phrases repeat very often
Emma: How much time do you have for all 4 parts?
Gloria: 4 hours
Emma: Is it enough?
Gloria: Well it has to be
Gloria: Would be perfect to have 2 more hours... But on the other hand it would be really exhausting
Emma: 4 hours and no breaks?
Gloria: No breaks :/ So it's really important to be really focused and try to write as fast as you can
Gloria: And read it carefully and correct during the last hour
Emma: I'm going to read everything from that website, it's great
  
model
  Gloria advises Emma on how 

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Adam: Have you talked to May?
Karen: Yes, yesterday, why?
Adam: I just talked to her and I must admit I worry about her
Karen: Me too, I suggested she should see a specialist, but she wasn't very happy about it
Adam: No wonder...
Karen: I know, but I think this is serious. She's saying she's depressed, like everyone around, but in her case it may be true
Adam: She was telling me she doesn't feel like doing anything, she's bored all the time, she never feels happy. It sounds like a real, typical depression
Adam: She also told me that she has trouble sleeping. I asked her to go out for a beer or anything basically, but she doesn't want to leave the flat
Karen: Oh my, it sounds really serious. I don't what to tell you
Adam: I was wondering how I can help her
Karen: Honestly I don't know if we can help her, Adam. I suggested a specialist because these are very sensitive issues and I'm afraid we may unintentionally make it worse
Adam: Yes, but

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Anne: You were right, he was lying to me :/
Irene: Oh no, what happened?
Jane: who? that Mark guy?
Anne: yeah, he told me he's 30, today I saw his passport - he's 40
Irene: You sure it's so important?
Anne: he lied to me Irene
  
model
  Anne discovered her boyfriend Mark lied about his age, causing her distress. 


---------------------------------------------------------------------
True Summary: Mark lied to Anne about his age. Mark is 40.
Generated Summary: Anne discovered her boyfriend Mark lied about his age, causing her distress.



You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



  user
  Summarise dialogue in one sentence.
  Augustine: Guys, remember it's Wharton's bday next week?
Darlene: yay, a party!
Heather: yay! crap we need to buy him a present
Walker: he mentioned paper shredder once
Augustine: wtf?!?
Walker: he did really. for no reason at all.
Heather: whatever that make him happy
Darlene: cool with me. we can shred some papers at the party 
Augustine: so much fun
Heather: srsly guys, you mean we should really get office equipment???
Darlene: Walk, ask him if he really wnts it and if he yes then we get it
Walker: i heard him say that. wasn;t drunk. me neither.
Darlene: but better ask him twice
Walker: will do
Augustine: 2moro ok?
Darlene: and sure ask ab the party!
  
model
  The group decides to ask Wharton if he really wants a paper shredder for his birthday present. 


---------------------------------------------------------------------
True Summary: Next week is Wharton's birthday. Augustine, Darlene, Heather and Walker want to buy him a paper s

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


### Average rouge score on test data

In [14]:
# Convert the evaluation results to a DataFrame
df = pd.DataFrame(avg_scores)

# Transpose the DataFrame for better readability
df = df.transpose()

# Print the DataFrame
print("Test dataset average rouge score...")
print(df)

Test dataset average rouge score...
           precision    recall        f1
rouge1      0.432926  0.460685  0.430851
rouge2      0.190036  0.189437  0.183169
rougeL      0.336717  0.355725  0.334290
rougeLsum   0.336717  0.355725  0.334290


The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics are commonly used to evaluate the quality of summaries generated by models. They compare the overlap of n-grams between the generated summary and reference summaries. Here, you have ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum scores, each with precision, recall, and F1 metrics.


**Interpretation of ROUGE Scores**

1. **ROUGE-1**:
   - **Precision**: 0.432926 (43.29% of the unigrams in the generated summary are also in the reference summary).
   - **Recall**: 0.460685 (46.07% of the unigrams in the reference summary are also in the generated summary).
   - **F1 Score**: 0.430851 (Balanced measure of precision and recall).

2. **ROUGE-2**:
   - **Precision**: 0.190036 (19.00% of the bigrams in the generated summary are also in the reference summary).
   - **Recall**: 0.189437 (18.94% of the bigrams in the reference summary are also in the generated summary).
   - **F1 Score**: 0.183169 (Balanced measure of precision and recall).

3. **ROUGE-L**:
   - **Precision**: 0.336717 (33.67% of the longest common subsequences in the generated summary are also in the reference summary).
   - **Recall**: 0.355725 (35.57% of the longest common subsequences in the reference summary are also in the generated summary).
   - **F1 Score**: 0.334290 (Balanced measure of precision and recall).

4. **ROUGE-Lsum**:
   - **Precision**: 0.336717 (Same as ROUGE-L).
   - **Recall**: 0.355725 (Same as ROUGE-L).
   - **F1 Score**: 0.334290 (Same as ROUGE-L).

### Analysis

1. **ROUGE-1** scores are relatively higher, indicating that the generated summary captures a good portion of the important words from the reference summary.
2. **ROUGE-2** scores are lower, suggesting that the model struggles more with capturing the correct sequence of words (bigrams).
3. **ROUGE-L** and **ROUGE-Lsum** scores are in between, indicating that the model captures some of the longer sequences of words correctly but still has room for improvement.

**Recommendations for Improvement**

1. **Improve Bigram Capture**:
   - Focus on improving the model's ability to capture bigrams, as indicated by the lower ROUGE-2 scores. This can be achieved by fine-tuning the model further or using more sophisticated training techniques.

2. **Fine-Tuning**:
   - Further fine-tuning the model with more data or using techniques like data augmentation might help improve the scores.

3. **Model Architecture**:
   - Experiment with different model architectures or hyperparameters to see if they yield better results.

4. **Regularization**:
   - Implement regularization techniques such as dropout and weight decay to prevent overfitting and improve generalization.

5. **Learning Rate Adjustment**:
   - Experiment with different learning rates and learning rate schedules to find the optimal value that allows the model to learn effectively without overshooting.



**Summary**

- **ROUGE-1**: Indicates good capture of important words.
- **ROUGE-2**: Indicates struggles with capturing correct sequences of words.
- **ROUGE-L** and **ROUGE-Lsum**: Indicates some capture of longer sequences but room for improvement.
- **Suggestions**: Improve bigram capture, fine-tune further, experiment with model architectures, implement regularization, and adjust learning rates.

By making these adjustments, you can improve the model's performance and generalization, leading to better ROUGE scores and overall summarization quality.

In [15]:
wandb.finish()

# Push Model to Huggingface hub

In [16]:
merged_model.save_pretrained("gemma-2-27b-it-ft-summarizer-v3")
merged_model.push_to_hub("gemma-2-27b-it-ft-summarizer-v3", use_temp_dir=False)

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/2.11G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Prat/gemma-2-27b-it-ft-summarizer-v3/commit/4f6107af4d3268fc8a2aaa7267ea3fa7f9ded6eb', commit_message='Upload Gemma2ForCausalLM', commit_description='', oid='4f6107af4d3268fc8a2aaa7267ea3fa7f9ded6eb', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Prat/gemma-2-27b-it-ft-summarizer-v3', endpoint='https://huggingface.co', repo_type='model', repo_id='Prat/gemma-2-27b-it-ft-summarizer-v3'), pr_revision=None, pr_num=None)

# **Thank You!!**