<a href="https://colab.research.google.com/github/Bhabuk10/FineTuning_LLMs/blob/main/Finetuning_LLM_with_PEFT_AND_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning LLMs with PEFT Adapters and LoRA

This notebook walks through memory-efficient fine-tuning of large language models (LLMs). It combines Low-Rank Adaptation (LoRA), one of the parameter-efficient fine-tuning (PEFT) methods implemented in the `peft` library, with 8-bit weight quantization from `bitsandbytes`.
Instead of updating every weight in the model, you train small, task-specific adapters while the base model stays frozen, which cuts the compute and memory needed for fine-tuning, often with little loss in task performance.
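
To make the idea of a low-rank adapter concrete, here is a minimal sketch in plain Python. It is not the `peft` implementation; it assumes the usual LoRA formulation, in which a frozen weight matrix `W` is augmented by a trainable product `B @ A` scaled by `alpha / r`. The helper names (`matvec`, `lora_forward`) and the toy dimensions are illustrative only.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) with W frozen and A, B trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 8, 8, 2, 4  # toy sizes, chosen only for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                   # trainable, d_out x r, zero-initialized
x = [1.0] * d_in

# With B initialized to zero, the adapted layer starts out identical to the base layer.
starts_as_base = lora_forward(W, A, B, x, alpha, r) == matvec(W, x)
print(starts_as_base)  # True

full_params = d_in * d_out        # parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)  # parameters updated by LoRA
print(full_params, lora_params)   # 64 32
```

At these toy sizes the saving is only a factor of two; it grows with the layer width, because the LoRA parameter count scales with `r * (d_in + d_out)` rather than `d_in * d_out`.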

In [None]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-313bfa03-3892-2dad-daff-4e30eaf21e68)


## Install Dependencies

In [None]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git


# Model Loading

This notebook uses `facebook/opt-6.7b`, a 6.7-billion-parameter model developed by Meta AI, for demonstration. You can substitute any other causal language model that fits in your Colab GPU's memory.
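
As a rough guide to what "fits" means, the sketch below estimates the memory taken by the weights alone at different precisions. It is a back-of-the-envelope calculation under stated assumptions: a round 6.7 billion parameters, no allowance for activations, gradients, or optimizer state, and a helper name (`weight_memory_gb`) invented for this example.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 6.7e9  # assumed round figure for facebook/opt-6.7b

for name, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {weight_memory_gb(N_PARAMS, nbytes):.1f} GB")
# fp32: 26.8 GB
# fp16: 13.4 GB
# int8: 6.7 GB
```

On a GPU with roughly 16 GB of memory, which is the nominal capacity of the Tesla T4 listed above, the fp16 weights alone would leave very little headroom for training, which is why the model is loaded in 8-bit here.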












In [None]:
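import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

# Load the weights in 8-bit via bitsandbytes. Passing `load_in_8bit=True` directly
# to `from_pretrained` is deprecated in favor of `quantization_config`.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
tokenizer.pad_token = tokenizer.eos_token  # Reuse the end-of-sequence token for padding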




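# Post-processing the Model

Freeze the original weights and cast selected tensors to FP32: all base-model parameters are frozen, the one-dimensional parameters (such as layer-norm weights) are kept in FP32 for numerical stability, and the output of the language-model head is cast to FP32 as well.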

In [None]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
    """Wrap the LM head so that its logits are returned in fp32."""
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)
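
The fp32 casts above are motivated by the limited range and precision of half-precision floats. The following sketch illustrates those limits using only the standard library's `struct` module, whose `'e'` format is IEEE 754 binary16; the helper `to_fp16` is a name made up for this example and is unrelated to how PyTorch performs the cast.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', value))[0]

# Near 1.0 the spacing between fp16 values is about 0.001, so a small update is lost.
print(to_fp16(1.0001))  # 1.0

# Very small magnitudes underflow to zero.
print(to_fp16(1e-8))  # 0.0

# Values beyond the largest finite fp16 number (65504) cannot be represented.
try:
    struct.pack('e', 70000.0)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)  # True
```

Layer-norm parameters and output logits are sensitive to exactly these effects, which is a plausible reason for keeping them in fp32 while the bulk of the model stays in lower precision.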


# PEFT Model Configuration and Adapter Creation

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
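    r=16,  # Rank of the low-rank update matrices
    lora_alpha=32,  # Scaling factor: the LoRA update is multiplied by lora_alpha / r
    # target_modules=["q_proj", "v_proj"],  # Left unset so PEFT picks its defaults for this architecture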
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 8388608 || all params: 6666862592 || trainable%: 0.12582542214183376


Only 8,388,608 of the model's 6,666,862,592 parameters are trainable, about 0.13%. Because gradients and optimizer state are needed only for these adapter weights, LoRA substantially reduces the memory required for fine-tuning compared with updating every parameter.
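
The trainable-parameter count can be reproduced by hand. The sketch below assumes that OPT-6.7B has 32 decoder layers with a hidden size of 4096 and that, with `target_modules` unset, PEFT adapts the `q_proj` and `v_proj` projections in each layer. Those figures are assumptions not stated in this notebook, although they do reproduce the printed count exactly.

```python
def lora_params_per_module(d_in, d_out, r):
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

NUM_LAYERS = 32        # assumed number of decoder layers in OPT-6.7B
HIDDEN = 4096          # assumed hidden size
TARGETS_PER_LAYER = 2  # assumed default targets: q_proj and v_proj
R = 16                 # rank used in the LoraConfig above

trainable = NUM_LAYERS * TARGETS_PER_LAYER * lora_params_per_module(HIDDEN, HIDDEN, R)
print(trainable)  # 8388608

ALL_PARAMS = 6_666_862_592  # total reported by print_trainable_parameters
print(f"{100 * trainable / ALL_PARAMS:.4f}%")  # 0.1258%
```

Doubling `r` would double the adapter size, and adding further target modules would increase it in proportion to the number of adapted projections.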

# Loading the Dataset: Experiment with Your Own Data (Optional)

In [None]:
import transformers
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")


In [None]:
def merge_columns(example):
    # Build the training string "<quote> ->: <tags>"
    example["prediction"] = example["quote"] + " ->: " + str(example["tags"])
    return example

data['train'] = data['train'].map(merge_columns)
data['train']["prediction"][:10]


["“Be yourself; everyone else is already taken.” ->: ['be-yourself', 'gilbert-perreira', 'honesty', 'inspirational', 'misattributed-oscar-wilde', 'quote-investigator']",
 "“I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.” ->: ['best', 'life', 'love', 'mistakes', 'out-of-control', 'truth', 'worst']",
 "“Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.” ->: ['human-nature', 'humor', 'infinity', 'philosophy', 'science', 'stupidity', 'universe']",
 "“So many books, so little time.” ->: ['books', 'humor']",
 "“A room without books is like a body without a soul.” ->: ['books', 'simile', 'soul']",
 "“Be who you are and say what you feel, because those who mind don't matter, and those who matter don't mind.” ->: ['ataraxy', 'be-yourself', 'confidence', 'fitting-in', 'individuality', 'misattribut

In [None]:
data['train'][0]

{'quote': '“Be yourself; everyone else is already taken.”',
 'author': 'Oscar Wilde',
 'tags': ['be-yourself',
  'gilbert-perreira',
  'honesty',
  'inspirational',
  'misattributed-oscar-wilde',
  'quote-investigator'],
 'prediction': "“Be yourself; everyone else is already taken.” ->: ['be-yourself', 'gilbert-perreira', 'honesty', 'inspirational', 'misattributed-oscar-wilde', 'quote-investigator']"}
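
The `prediction` column follows a simple "quote, separator, tag list" pattern. A standalone version of that formatting, independent of the `datasets` library, might look like the sketch below; `format_example` is a hypothetical helper introduced here for illustration and is not part of the notebook's pipeline.

```python
def format_example(quote, tags, separator=" ->: "):
    """Join a quote and its tag list in the same layout as the `prediction` column."""
    return quote + separator + str(tags)

sample = format_example("“So many books, so little time.”", ["books", "humor"])
print(sample)
# “So many books, so little time.” ->: ['books', 'humor']
```

At inference time the same pattern is used as a prompt by supplying the quote and the separator and letting the model complete the tag list.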

In [None]:
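# Note: only the `quote` column is tokenized here, so training never sees the
# " ->: tags" suffix. Tokenize samples['prediction'] instead to train on the merged format.
data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)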


In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['quote', 'author', 'tags', 'prediction', 'input_ids', 'attention_mask'],
        num_rows: 2508
    })
})

# Training with Flexible Hyperparameters for Optimal Performance

Training involves balancing model quality against available resources. Arguments such as `per_device_train_batch_size` and `gradient_accumulation_steps` can be adjusted to fit your GPU memory: a smaller per-device batch lowers peak memory, and accumulating gradients over several batches recovers a larger effective batch size.
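
The relationship between these arguments can be sketched with a little arithmetic. The helpers below are illustrative names, not `transformers` APIs; they assume a single GPU and use the 2,508 training examples in this dataset.

```python
import math

def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples that contribute to each optimizer step."""
    return per_device_batch * accumulation_steps * num_devices

def optimizer_steps_per_epoch(num_examples, batch_size):
    """Approximate optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)

batch = effective_batch_size(per_device_batch=1, accumulation_steps=2)
print(batch)  # 2

steps = optimizer_steps_per_epoch(num_examples=2508, batch_size=batch)
print(steps)  # 1254
```

With the settings used in this notebook, a full pass over the data would take about 1,254 optimizer steps, so a 20-step run covers only a small fraction of one epoch.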

In [None]:
trainer = transformers.Trainer(
    model=model,
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=2,  # Accumulate gradients over 2 batches before each optimizer step
        warmup_steps=5,
        max_steps=20,  # Short demo run; increase for a longer training duration
        learning_rate=2e-4,
        fp16=True,  # Mixed-precision training; requires a GPU with fp16 support
        logging_steps=1,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

model.config.use_cache = False  # Silence warnings, re-enable for inference
trainer.train()


max_steps is given, it will override any value given in num_train_epochs


| Step | Training Loss |
|------|---------------|
| 1 | 1.0659 |
| 2 | 3.0908 |
| 3 | 2.0813 |
| 4 | 1.7961 |
| 5 | 1.6947 |
| 6 | 2.3511 |
| 7 | 3.0683 |
| 8 | 0.6646 |
| 9 | 2.4936 |
| 10 | 1.2839 |


TrainOutput(global_step=20, training_loss=2.094375690817833, metrics={'train_runtime': 66.6036, 'train_samples_per_second': 0.601, 'train_steps_per_second': 0.3, 'total_flos': 58885986631680.0, 'train_loss': 2.094375690817833, 'epoch': 0.01594896331738437})
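
Several figures in the `TrainOutput` above follow directly from the training arguments. The sketch below recomputes them as a sanity check; `run_metrics` is a hypothetical helper, and the formulas are an assumption about how the reported values relate to one another rather than a copy of the `Trainer` internals.

```python
def run_metrics(max_steps, per_device_batch, accum_steps, num_examples, runtime_s):
    """Recompute throughput and epoch fraction from the run configuration."""
    samples_seen = max_steps * per_device_batch * accum_steps
    return {
        "samples_seen": samples_seen,
        "epoch": samples_seen / num_examples,
        "samples_per_second": round(samples_seen / runtime_s, 3),
        "steps_per_second": round(max_steps / runtime_s, 3),
    }

metrics = run_metrics(
    max_steps=20,
    per_device_batch=1,
    accum_steps=2,
    num_examples=2508,
    runtime_s=66.6036,
)
print(metrics["samples_seen"])        # 40
print(metrics["samples_per_second"])  # 0.601
print(metrics["steps_per_second"])    # 0.3
print(round(metrics["epoch"], 4))     # 0.0159
```

The model therefore saw only 40 of the 2,508 examples, which is worth keeping in mind when judging the generated text later in the notebook.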

# Saving the Model

In [None]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained("outputs")
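
Because only the adapter weights are written by `save_pretrained` on a PEFT model, the saved artifact should be far smaller than the base checkpoint. The estimate below is a rough sketch that assumes the adapter weights are stored in fp32 and ignores file metadata; `adapter_size_mb` is an illustrative helper, and the base size is the sum of the two checkpoint shards downloaded for `facebook/opt-6.7b` (9.96 GB and 3.36 GB).

```python
def adapter_size_mb(n_trainable_params, bytes_per_param=4):
    """Approximate on-disk size of the adapter weights in megabytes (1 MB = 1e6 bytes)."""
    return n_trainable_params * bytes_per_param / 1e6

adapter_mb = adapter_size_mb(8_388_608)  # trainable parameter count reported earlier
print(f"{adapter_mb:.1f} MB")  # 33.6 MB

base_checkpoint_gb = 9.96 + 3.36  # the two downloaded checkpoint shards
ratio = base_checkpoint_gb * 1000 / adapter_mb
print(f"base checkpoint is roughly {ratio:.0f}x larger")
```

This is why adapters are convenient to share: the base model is downloaded once, and each fine-tuned task adds only a few tens of megabytes.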



# Inference

In [None]:
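# `model` still holds the trained LoRA adapters in memory, so it can be used as is.
# Calling get_peft_model() again would attach fresh, untrained adapters. To restore
# the saved adapters in a new session, reload the base model and use
# PeftModel.from_pretrained(base_model, 'outputs').
model.config.use_cache = True  # re-enable the KV cache that was turned off for training
model.eval()  # disable LoRA dropout for generation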


In [None]:
# Tokenize the prompt and move the tensors to the same device as the model
batch = tokenizer("“So many books, so little time.” ->:", return_tensors='pt').to(model.device)

with torch.autocast("cuda"):
    output_tokens = model.generate(**batch, max_new_tokens=50)

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))



 “So many books, so little time.” ->: “So many books, so little time.”

I’ve been reading a lot of books lately. I’ve been reading a lot of books for a while now, but I’ve been reading a lot


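The output above repeats the quote and continues with free text instead of producing a list of tags, so the model has not learned the target format. The run lasted only 20 steps (about 1.6% of an epoch), and the dataset was tokenized from the `quote` column alone. Consider training for more steps, adjusting the other hyperparameters, and rerunning the fine-tuning process.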
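
When comparing runs, it helps to check automatically whether a generation follows the expected "quote ->: tag list" format. The sketch below is one simple way to do that with the standard library; `extract_tags` is a hypothetical helper, and it assumes that tags are rendered as a Python-style list of strings, as in the `prediction` column.

```python
import ast

def extract_tags(generated, separator="->:"):
    """Return the tag list that follows the separator, or None if the format is not met."""
    _, found, tail = generated.partition(separator)
    if not found:
        return None
    tail = tail.strip()
    if not tail.startswith("["):
        return None
    end = tail.find("]")
    if end == -1:
        return None
    try:
        tags = ast.literal_eval(tail[: end + 1])
    except (ValueError, SyntaxError):
        return None
    if isinstance(tags, list) and all(isinstance(t, str) for t in tags):
        return tags
    return None

# A well-formed completion, in the style of the training strings
print(extract_tags("“So many books, so little time.” ->: ['books', 'humor']"))
# ['books', 'humor']

# The beginning of the completion produced in this notebook
print(extract_tags("“So many books, so little time.” ->: “So many books, so little time.”"))
# None
```

Counting how often `extract_tags` returns a list over a set of held-out quotes gives a quick, if crude, signal of whether additional training is helping.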