

# **Distributed fine-tuning of the ```gemma-2-9b-it``` model using ```LoRA``` on the ```SAMSum``` dataset (abstractive dialogue summaries)**


The following adjustments were made to fit the large model (9 billion parameters) in memory during fine-tuning:

- ```Use Mixed Precision Training```: Lower-precision weights and arithmetic reduce memory usage and speed up training. Here the base weights are loaded in 4-bit NF4 with a ```bfloat16``` compute dtype.
- ```Gradient Checkpointing```: Trades compute for memory by recomputing activations during the backward pass instead of storing them.
- ```gradient_accumulation_steps```: Accumulating gradients over several small batches gives a larger effective batch size without the memory cost of a large batch, which can improve training stability.
- ```Reduce Batch Size```: Smaller per-device batches reduce activation memory.
- ```Offload to CPU```: Parts of the model can be offloaded to the CPU to save GPU memory.
- ```Use Efficient Optimizers```: Memory-efficient optimizers such as ```paged_adamw_8bit``` keep the optimizer state small.
- ```max_seq_length```: A small ```max_seq_length``` (700 tokens here) bounds activation memory, which grows with sequence length.
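
As a rough illustration of why these measures matter, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes a nominal 9 billion parameters (the exact count for gemma-2-9b-it differs somewhat) and ignores activations, gradients, optimizer state, and quantization metadata, so real usage is higher. The helper name `weight_memory_gib` is made up for this example.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NOMINAL_PARAMS = 9e9  # assumption: nominal size taken from the model name

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gib(NOMINAL_PARAMS, bits):5.1f} GiB")
```

Under these assumptions, only the 4-bit weights (about 4.2 GiB) leave headroom on the 15 GiB Tesla T4 used in this notebook.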

# **Import Libs**

In [1]:
!pip3 install -q -U accelerate
!pip3 install -q -U bitsandbytes
!pip3 install -q -U peft
!pip3 install -q -U trl
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers
!pip install -q rouge_score
!pip install -q optuna

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
trl 0.12.1 requires datasets>=2.21.0, but you have datasets 2.17.0 which is incompatible.

In [2]:
import torch

print("Is CUDA available? ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device name: ", torch.cuda.get_device_name(0))
    !nvidia-smi

Is CUDA available?  True
Device name:  Tesla T4
Wed Dec  4 11:17:54 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8              10W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                    

In [3]:
from peft import LoraConfig
from datasets import load_dataset
import pandas as pd
import numpy as np

import transformers
from trl import SFTTrainer
from rouge_score import rouge_scorer
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from google.colab import userdata

In [4]:
import os

# Do not commit real tokens: paste your own Hugging Face token here or load it from a secret store.
os.environ["HF_TOKEN"] = "<YOUR_HF_TOKEN>"
# os.environ["WEIGHT_BIASES"] = "<YOUR_WANDB_API_KEY>"

# **Load Model and tokenizer**

In [5]:
# load a pre-trained tokenizer from the Hugging Face Model Hub, with authentication for the Hugging Face API token


model_id = "google/gemma-2-9b-it"
new_model = "Dist_gemma-2-9b-it_summarizer_v2"

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ["HF_TOKEN"])

# **LORA-Finetuning**

## **Load Dataset**

In [6]:
from datasets import load_dataset

## list of dataset for summarization. Choose one of them for your task
# https://paperswithcode.com/dataset/cnn-daily-mail-1
# data = load_dataset("knkarthick/dialogsum") ##Dialogue Summarization Dataset
# data = load_dataset("cnn_dailymail","3.0.0")
# data = load_dataset("GEM/wiki_lingua")


!pip install -q py7zr
data = load_dataset("samsum")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [7]:
data

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [8]:
data["train"][0]

{'id': '13818513',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}

In [15]:
from datasets import DatasetDict

DATA_RECORD_SIZE = 100  # size of training dataset

dataset_dict = DatasetDict(data)
# Extract the first 100 rows from the training dataset
training_dataset = dataset_dict["train"].select(range(DATA_RECORD_SIZE))

# Extract the first 100 rows from the validation dataset
val_dataset = dataset_dict["validation"].select(range(DATA_RECORD_SIZE))

print(training_dataset)
print(val_dataset)

Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 100
})
Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 100
})


In [16]:
training_dataset["dialogue"][0]

"Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"

# Load Model

In [12]:
# Load the base (pretrained) model for training

# Clear the GPU cache before loading the model
torch.cuda.empty_cache()

# Load model for training with CPU offloading enabled
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Enable CPU offloading for specific layers
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # Let Transformers automatically decide device placement
)


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [13]:
print(model)

Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 3584, padding_idx=0)
    (layers): ModuleList(
      (0-41): 42 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): Linear4bit(in_features=3584, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=3584, out_features=2048, bias=False)
          (v_proj): Linear4bit(in_features=3584, out_features=2048, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=3584, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding()
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear4bit(in_features=3584, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=3584, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=3584, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((3584,), eps=1e-06)
        (pre_feedforward_layerno

In [14]:
input_text = """user: Generate summary of this dialogue in one line
          dialogue:
          Rachel:
          Rachel: Top 50 Best Films of 2018
          Rachel: :)
          Janice: Omg, I've watched almost all 50... xDD
          Spencer: Hahah, Deadpool 2 also??
          Janice: Yep
          Spencer: Really??
          Janice: My bf forced me to watch it xD
          Rachel: Hahah
          Janice: It wasn't that bad
          Janice: I thought it'd be worse
          Rachel: And Avengers? :D
          Janice: 2 times
          Rachel: Omg
          Janice: xP
          Rachel: You are the best gf in the world
          Rachel: Your bf should appreciate that ;-)
          Janice: He does
          Janice: x)
AI Summary:"""

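# Tokenize the prompt and move the tensors to the device that holds the model's first layers
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=228)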

print(tokenizer.decode(outputs[0]))



<bos>user: Generate summary of this dialogue in one line
          dialogue:
          Rachel: 
          Rachel: Top 50 Best Films of 2018
          Rachel: :)
          Janice: Omg, I've watched almost all 50... xDD
          Spencer: Hahah, Deadpool 2 also??
          Janice: Yep
          Spencer: Really??
          Janice: My bf forced me to watch it xD
          Rachel: Hahah
          Janice: It wasn't that bad
          Janice: I thought it'd be worse
          Rachel: And Avengers? :D
          Janice: 2 times
          Rachel: Omg
          Janice: xP
          Rachel: You are the best gf in the world
          Rachel: Your bf should appreciate that ;-)
          Janice: He does
          Janice: x)
AI Summary: Janice and Rachel discuss Janice's extensive viewing of the Top 50 Best Films of 2018, including Deadpool 2 and Avengers. 


<end_of_turn><eos>


## **Distributed finetuning**

In [11]:
!pip install -q accelerate

In [22]:
from accelerate import Accelerator

# Initialize the Accelerator
accelerator = Accelerator()

In [23]:
# Clear GPU cache before loading the model for the second time
torch.cuda.empty_cache()

# Define LoRA configuration with the best hyperparameters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.02,
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)


NUM_OF_ITERATION = 20

# Define training arguments with the best hyperparameters
training_arguments = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
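    gradient_checkpointing=True,  # recompute activations in the backward pass to save memory
    # num_train_epochs=NUM_OF_EPOCHS,
    warmup_steps=2,
    eval_strategy="steps",  # "epoch" or "steps"
    eval_steps=0.2,  # a float below 1 is a fraction of max_steps: evaluate every 4 steps here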
    max_steps=NUM_OF_ITERATION,
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    logging_steps=1,
    output_dir="final_outputs",
    optim="paged_adamw_8bit",
    report_to="none",
)
# training_arguments
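
The two less obvious settings above are the fractional `eval_steps` and the interplay of batch size with gradient accumulation. The sketch below works through the arithmetic. The helper names are hypothetical and not part of `transformers`; the rounding mirrors my understanding of how the Trainer resolves a fractional `eval_steps`, and a single device is assumed.

```python
import math


def resolve_eval_interval(eval_steps: float, max_steps: int) -> int:
    """Turn eval_steps into a step count; values below 1 are a fraction of max_steps."""
    if eval_steps < 1:
        return math.ceil(max_steps * eval_steps)
    return int(eval_steps)


def effective_batch_size(per_device_batch: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


# Values taken from the training arguments in this notebook
print(resolve_eval_interval(eval_steps=0.2, max_steps=20))
print(effective_batch_size(per_device_batch=1, accumulation_steps=8))
```

With these values the model is evaluated every 4 steps and each optimizer update sees 8 examples, so 20 steps cover 160 examples drawn from the 100-row training subset.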

In [24]:
# Custom optimizer for better memory utilization; no longer used in this code
'''
import bitsandbytes as bnb
from torch import nn
from transformers.trainer_pt_utils import get_parameter_names

training_args = transformers.TrainingArguments(per_device_train_batch_size= transformers.TrainingArguments.per_device_train_batch_size, output_dir="output")

decay_parameters = get_parameter_names(model, [nn.LayerNorm])
decay_parameters = [name for name in decay_parameters if "bias" not in name]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if n in decay_parameters],
        "weight_decay": training_args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if n not in decay_parameters],
        "weight_decay": 0.0,
    },
]

optimizer_kwargs = {
    "betas": (training_args.adam_beta1, training_args.adam_beta2),
    "eps": training_args.adam_epsilon,
}
optimizer_kwargs["lr"] = training_args.learning_rate
adam_bnb_optim = bnb.optim.Adam8bit(
    optimizer_grouped_parameters,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    lr=training_args.learning_rate,
)
'''

In [25]:
from torch.optim import AdamW  # transformers.AdamW is deprecated in favour of the PyTorch implementation

# Initialize the Accelerator
accelerator = Accelerator()

# Ensure pad token is set
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # as it is a decoder-only model, it is recommended to set padding_side to "left".

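# Initialize the optimizer. Note that it is not passed to SFTTrainer below,
# which creates its own optimizer from optim="paged_adamw_8bit".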
optimizer = AdamW(model.parameters(), lr=training_arguments.learning_rate)

# #Prepare the model, tokenizer, datasets, and optimizer with the Accelerator
# model, adam_bnb_optim, training_dataset, val_dataset = accelerator.prepare(
#     model, adam_bnb_optim, training_dataset, val_dataset
# )

model, optimizer, training_dataset, val_dataset = accelerator.prepare(
    model, optimizer, training_dataset, val_dataset
)



In [26]:
from accelerate import DistributedType


import time

start_time = time.time()


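# Preprocessing before passing input; trl 0.12 passes a batch (dict of lists) and expects a list of prompts
def create_prompt(examples):
    dialogues, summaries = examples["dialogue"], examples["summary"]
    if isinstance(dialogues, str):  # a single row was passed instead of a batch
        dialogues, summaries = [dialogues], [summaries]
    template = "user:\nSummarise dialogue in one sentence: {} \nSummary:\n{}"
    return [template.format(d, s) for d, s in zip(dialogues, summaries)]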


# Initialize Trainer with the best hyperparameters
trainer = SFTTrainer(
    model=model,
    train_dataset=training_dataset,
    eval_dataset=val_dataset,
    peft_config=lora_config,
    max_seq_length=700,  # maximum tokenized length; longer examples are truncated, which bounds GPU memory use
    dataset_text_field="dialogue",  # ignored when formatting_func is provided
    formatting_func=create_prompt,  # builds the prompt text before tokenization
    processing_class=tokenizer,
    args=training_arguments,
    packing=False,  # do not pack several short examples into one sequence
)

# Disable the KV cache during training; it is incompatible with gradient checkpointing
model.config.use_cache = False

# Train the final model; the Trainer runs the training loop
trainer.train()


# Save the final model
# accelerator.wait_for_everyone() synchronizes all processes in a distributed training setup, ensuring that all processes reach the same point before proceeding.
# This is crucial for maintaining consistency and coordination across multiple devices (e.g., multiple GPUs or TPUs) during training.
accelerator.wait_for_everyone()
if accelerator.is_local_main_process:
    trainer.model.save_pretrained(new_model)
    tokenizer.save_pretrained(new_model)

end_time = time.time()
print("\n\n--->Execution Time:", end_time - start_time, "seconds")


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
max_steps is given, it will override any value given in num_train_epochs
  return fn(*args, **kwargs)


| Step | Training Loss | Validation Loss |
|---|---|---|
| 4 | 1.9699 | 2.26934 |
| 8 | 1.5937 | 2.213828 |
| 12 | 1.2076 | 2.25994 |
| 16 | 0.881 | 2.379813 |
| 20 | 0.6869 | 2.459884 |
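
In the log above the training loss keeps falling while the validation loss starts rising after step 8, which may indicate overfitting on the small training subset. The sketch below is one minimal way to spot this from the logged numbers; the variable names are illustrative and the values are copied from the log.

```python
# (step, training loss, validation loss) copied from the log above
history = [
    (4, 1.9699, 2.26934),
    (8, 1.5937, 2.213828),
    (12, 1.2076, 2.25994),
    (16, 0.881, 2.379813),
    (20, 0.6869, 2.459884),
]

# Step with the lowest validation loss
best_step, _, best_val = min(history, key=lambda row: row[2])
print(f"lowest validation loss {best_val} at step {best_step}")

# Steps where training loss fell but validation loss rose compared with the previous evaluation
diverging = [
    cur[0]
    for prev, cur in zip(history, history[1:])
    if cur[1] < prev[1] and cur[2] > prev[2]
]
print("diverging at steps:", diverging)
```

If this pattern holds on a larger run, stopping earlier or keeping the checkpoint with the lowest validation loss would be reasonable options to try.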




--->Execution Time: 450.19209814071655 seconds


# Merge LORA finetuned model with base model

In [27]:

torch.cuda.empty_cache()

In [28]:
from peft import  PeftModel

torch.cuda.empty_cache()

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    # device_map="cpu",
    offload_folder="offload",  # Specify offload folder
)


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [29]:
# Clear any cached GPU memory before loading PEFT model
# torch.cuda.empty_cache()

model = PeftModel.from_pretrained(base_model, new_model, offload_folder="offload_peft")
# model = model.merge_and_unload()
# with torch.no_grad():  # Disable gradient calculations to save memory
#     model = model.merge_and_unload()

# # Reload tokenizer to save it
# tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# tokenizer.pad_token = tokenizer.eos_token
# tokenizer.padding_side = "left"

In [30]:
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Gemma2ForCausalLM(
      (model): Gemma2Model(
        (embed_tokens): Embedding(256000, 3584, padding_idx=0)
        (layers): ModuleList(
          (0-41): 42 x Gemma2DecoderLayer(
            (self_attn): Gemma2Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3584, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.02, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3584, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
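
The adapter shapes in this printout make it possible to count the parameters that LoRA actually trains. The sketch below does the arithmetic, using the rank from the `LoraConfig` and the projection shapes from the model printouts in this notebook. The helper name `lora_params` is made up for this example.

```python
LORA_RANK = 16   # r in the LoraConfig used for training
NUM_LAYERS = 42  # decoder layers shown in the printout

# (in_features, out_features) of each targeted projection, from the printouts
TARGET_SHAPES = {
    "q_proj": (3584, 4096),
    "k_proj": (3584, 2048),
    "v_proj": (3584, 2048),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """lora_A holds in_features x rank weights and lora_B holds rank x out_features, both without bias."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
```

That is about 12.7 million trainable parameters, a small fraction of the base model's roughly 9 billion.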
        

# Evaluate

Note: generation in this section fails with a ```CUDA out of memory``` error on the 15 GB Tesla T4; the traceback is shown after the evaluation cell.
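
One possible contributor to this error (an assumption, not verified here) is that the model, trainer, and optimizer from the training section are still referenced when the base model is loaded a second time, so two copies of the quantized weights may sit on the GPU. `torch.cuda.empty_cache()` alone only releases cached memory that nothing references any more. The sketch below drops the references first; `release_objects` is a hypothetical helper, not a library function.

```python
import gc


def release_objects(namespace: dict, *names: str) -> None:
    """Drop the named references from `namespace`, then ask Python and CUDA to free memory."""
    for name in names:
        namespace.pop(name, None)  # missing names are ignored
    gc.collect()
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached, unused blocks to the driver
    except ImportError:
        pass  # torch is not installed, so there is no GPU cache to release


# In the notebook this would run before reloading the base model, for example:
# release_objects(globals(), "trainer", "model", "optimizer")
```

Whether this frees enough memory for generation on a T4 would need to be checked by running `nvidia-smi` before and after the call.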

In [52]:
torch.cuda.empty_cache()

In [47]:

text = """user: Generate summary of this dialogue in one line
          dialogue:
          Rachel:
          Rachel: Top 50 Best Films of 2018
          Rachel: :)
          Janice: Omg, I've watched almost all 50... xDD
          Spencer: Hahah, Deadpool 2 also??
          Janice: Yep
          Spencer: Really??
          Janice: My bf forced me to watch it xD
          Rachel: Hahah
          Janice: It wasn't that bad
          Janice: I thought it'd be worse
          Rachel: And Avengers? :D
          Janice: 2 times
          Rachel: Omg
          Janice: xP
          Rachel: You are the best gf in the world
          Rachel: Your bf should appreciate that ;-)
          Janice: He does
          Janice: x)
AI Summary:"""

# Pick the target device explicitly so that `device` is defined in this cell
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model.to(device)


inputs = tokenizer(text, return_tensors="pt").to(device)
# model.to(device)


# inputs = inputs.to(accelerator.device, dtype=torch.float16)
# # # Move each tensor within the BatchEncoding to the accelerator's device and cast to float16
# for key in inputs:
#     # inputs[key] = inputs[key].to(accelerator.device, dtype=torch.float16)
#     inputs[key] = inputs[key].to(dtype=torch.float16)

true_summary = "Rachel sends a list of Top 50 films of 2018. Janice watched almost half of them, Deadpool 2 and Avengers included."

# with torch.no_grad():
#     outputs = model.generate(**inputs, max_length=288)

outputs = model.generate(**inputs, max_new_tokens=288)
model_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(model_summary)

print("---------------------------------------------------------------------")
end_token = ""

highlight = str.strip(model_summary.split("AI Summary:")[1])
print(f"Generated Summary: {highlight}")
print("---------------------------------------------------------------------")

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 11.06 MiB is free. Process 545434 has 14.73 GiB memory in use. Of the allocated memory 14.32 GiB is allocated by PyTorch, and 286.41 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
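
The notebook imports `rouge_scorer`, but the evaluation never reaches a scoring step. Once generation succeeds, the generated summary could be compared with `true_summary`. The sketch below is a simplified, dependency-free unigram-overlap F1 in the spirit of ROUGE-1; it is not the `rouge_score` implementation (no stemming, plain lowercase word tokenization), and the function name is made up for this example. It scores the base model's summary printed earlier in the notebook against the reference.

```python
import re
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """F1 over overlapping lowercase word tokens (a simplified stand-in for ROUGE-1)."""
    ref_counts = Counter(re.findall(r"\w+", reference.lower()))
    cand_counts = Counter(re.findall(r"\w+", candidate.lower()))
    overlap = sum((ref_counts & cand_counts).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


true_summary = (
    "Rachel sends a list of Top 50 films of 2018. Janice watched almost half "
    "of them, Deadpool 2 and Avengers included."
)
# Output of the base model before fine-tuning, copied from earlier in the notebook
baseline_summary = (
    "Janice and Rachel discuss Janice's extensive viewing of the Top 50 Best "
    "Films of 2018, including Deadpool 2 and Avengers."
)
print(round(unigram_f1(true_summary, baseline_summary), 3))
```

Scoring the fine-tuned model the same way over the validation split would give a like-for-like comparison with this baseline.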