# Fine-tuning Large Language Models (LLMs) with PEFT and Bitsandbytes

This notebook demonstrates how to load a large language model (LLM) on limited GPU memory using 8-bit quantization with the `bitsandbytes` library, and how to attach a fine-tuned Parameter-Efficient Fine-Tuning (PEFT) adapter to it for inference.
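
To make the memory savings concrete, the weight footprint can be estimated from the parameter count and the bytes stored per parameter. The sketch below assumes roughly 7.24 billion parameters (an approximate figure for a 7B-class model, not a value stated in this notebook) and ignores activations, the KV cache, and any quantization overhead.

```python
# Rough weight-memory estimate; PARAM_COUNT is an assumed, approximate figure.
PARAM_COUNT = 7.24e9

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gb(param_count: float, bytes_per_param: float) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return param_count * bytes_per_param / 1e9


for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype:>8}: {weight_memory_gb(PARAM_COUNT, nbytes):6.2f} GB")
```

Under these assumptions, moving from 16-bit to 8-bit weights halves the storage needed for the weights alone.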

## Key Libraries Used

- **Accelerate**: For optimizing multi-GPU, TPU, and distributed training.
- **PEFT (Parameter-Efficient Fine-Tuning)**: Enables memory-efficient tuning methods like LoRA, QLoRA, and adapters.
- **Bitsandbytes**: Supports 8-bit and 4-bit quantization for reducing VRAM usage.
- **Transformers**: Hugging Face's library for working with pre-trained models.
- **TRL (Transformer Reinforcement Learning)**: For Reinforcement Learning from Human Feedback (RLHF).
- **Py7zr**: Handles extraction of 7z-format compressed files.
- **Auto-GPTQ**: Implements GPTQ-based quantization for faster inference.
- **Optimum**: Hugging Face's library for hardware optimizations.
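
As a rough illustration of why methods like LoRA are parameter-efficient: LoRA trains two small low-rank factors instead of a full weight matrix. The sketch below counts parameters for a single matrix. The dimensions and rank are hypothetical values chosen for illustration; they are not read from the adapter used in this notebook.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical values for illustration; not taken from the adapter's config.
d_in, d_out, rank = 4096, 4096, 8

full = d_in * d_out
lora = lora_param_count(d_in, d_out, rank)
print(f"full matrix:  {full:,} parameters")
print(f"LoRA factors: {lora:,} parameters ({100 * lora / full:.2f}% of full)")
```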

## Setup

1.  **Install Libraries**: Install the necessary libraries using pip:

In [1]:
!pip install accelerate peft bitsandbytes git+https://github.com/huggingface/transformers trl py7zr auto-gptq optimum

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-f56qw_s_
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-f56qw_s_
  Resolved https://github.com/huggingface/transformers to commit cd22550692cabffb037b7e5a956e8da3cbbb2b67
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting bitsandbytes
  Downloading bitsandbytes-0.47.0-py3-none-manylinux_2_24_x86_64.whl.metadata (11 kB)
Collecting trl
  Downloading trl-0.21.0-py3-none-any.whl.metadata (11 kB)
Collecting py7zr
  Downloading py7zr-1.0.0-py3-none-any.whl.metadata (17 kB)
Collecting auto-gptq
  Downloading auto_gptq-0.7.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting optimum
  Downloading optimum-1.27.0-py3-none-any.whl
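
After installation, it can help to confirm which versions actually ended up in the environment. A minimal sketch using only the standard library follows; the `installed_versions` helper is a hypothetical name introduced here, and the package list simply mirrors the names passed to pip above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the install cell.
PACKAGES = ["accelerate", "peft", "bitsandbytes", "transformers",
            "trl", "py7zr", "auto-gptq", "optimum"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```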

In [2]:
from huggingface_hub import notebook_login
notebook_login()


In [3]:
from peft import PeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import os

# Create a temporary directory for offloading
offload_dir = "/tmp/offload_dir"
os.makedirs(offload_dir, exist_ok=True)


# Load the base model in 8-bit; modules that do not fit on the GPU may stay on the CPU in FP32
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # Keep CPU-offloaded modules in FP32
)

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    device_map="auto",  # Let accelerate handle the device mapping
    offload_folder=offload_dir,  # Directory for any weights offloaded to disk
)

# Load the PEFT model (adapter) on top of the base model
model = PeftModelForCausalLM.from_pretrained(
    base_model,
    "mynamerahulkumar/mistral-finetuned-samsum",
    offload_folder=offload_dir  # Add the offload_folder here
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

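
With `device_map="auto"`, accelerate decides where each module is placed, and the loaded model exposes that placement as the `hf_device_map` attribute. The sketch below summarizes such a mapping. The `example_map` dictionary is hypothetical and only mimics the shape of the real attribute, and `summarize_device_map` is a helper name introduced here.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many modules were placed on each device."""
    return Counter(str(device) for device in device_map.values())


# Hypothetical mapping for illustration; on a real run, pass base_model.hf_device_map.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": "cpu",
    "model.norm": "cpu",
    "lm_head": "disk",
}

for device, count in summarize_device_map(example_map).items():
    print(f"{device}: {count} module(s)")
```

If many modules land on `cpu` or `disk`, generation is likely to be noticeably slower than with everything on the GPU.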

In [4]:
test_inputs = tokenizer("""
###Human: Summarize this following dialogue: Sunny: I'm at the railway station in Chennai Karthik: No problems so far? Sunny: no, everything's going smoothly Karthik: good. lets meet there soon!
###Assistant: """, return_tensors="pt").to("cuda")
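
To reuse the same prompt format for other dialogues, the template can be wrapped in a small function. This is a sketch: `build_prompt` is a hypothetical helper, and it copies the `###Human` / `###Assistant` wording of the cell above verbatim on the assumption that the adapter expects exactly that format.

```python
def build_prompt(dialogue: str) -> str:
    """Wrap a dialogue in the ###Human / ###Assistant template used above."""
    return (
        "\n###Human: Summarize this following dialogue: "
        f"{dialogue.strip()}\n###Assistant: "
    )


prompt = build_prompt(
    "Sunny: I'm at the railway station in Chennai Karthik: No problems so far?"
)
print(prompt)
```

The result can then be passed to the tokenizer in place of the literal string, for example `tokenizer(build_prompt(text), return_tensors="pt")`.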


In [6]:
from transformers import GenerationConfig



In [7]:
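generation_config = GenerationConfig(
    do_sample=True,
    top_k=1,  # Only the most likely token is kept, so sampling is effectively greedy
    temperature=0.1,
    max_new_tokens=25,  # Cap the length of the generated summary
    pad_token_id=tokenizer.eos_token_id  # Use EOS as the padding token
)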
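
Because `top_k=1` keeps only the single highest-scoring token, sampling with these settings behaves like greedy decoding regardless of the temperature. The toy re-implementation below illustrates this. It is a simplified sketch with made-up logits, not the library's own sampling code.

```python
import math
import random


def sample_top_k(logits, top_k, temperature, rng):
    """Sample an index from the top_k highest logits after temperature scaling."""
    # Keep only the indices of the top_k largest logits.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in kept]
    # Softmax weights over the kept logits (subtract the max for stability).
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


logits = [1.2, 3.4, 0.5, 2.9]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)
picks = {sample_top_k(logits, top_k=1, temperature=0.1, rng=rng) for _ in range(100)}
print(picks)  # only index 1 (the largest logit) is ever chosen
```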

In [8]:
import time

st_time = time.time()
outputs = model.generate(**test_inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(time.time() - st_time)  # Elapsed wall-clock time in seconds


###Human: Summarize this following dialogue: Sunny: I'm at the railway station in Chennai Karthik: No problems so far? Sunny: no, everything's going smoothly Karthik: good. lets meet there soon!
###Assistant: 
The dialogue between Sunny and Karthik is about Sunny's arrival at the railway station in Chennai
7.114001750946045
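
The decoded output repeats the prompt before the generated text. One way to keep only the summary is to split on the `###Assistant:` marker. This is a minimal sketch: `extract_summary` is a hypothetical helper, and the `decoded` string is a shortened copy of the output above.

```python
def extract_summary(decoded: str, marker: str = "###Assistant:") -> str:
    """Return the text after the last occurrence of marker, stripped of whitespace."""
    _, found, tail = decoded.rpartition(marker)
    # If the marker is absent, fall back to the whole string.
    return tail.strip() if found else decoded.strip()


decoded = (
    "\n###Human: Summarize this following dialogue: Sunny: I'm at the railway station"
    "\n###Assistant: \nThe dialogue between Sunny and Karthik is about Sunny's arrival"
)
print(extract_summary(decoded))
```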
