<a href="https://colab.research.google.com/github/Sounakray2003/Asmadiya-tech/blob/main/SFT_trainer_for_Llama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps trl peft accelerate bitsandbytes xformers datasets huggingface_hub torch --extra-index-url https://download.pytorch.org/whl/cu118

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-0uc_fr15/unsloth_7c335a72a8b84861bedfb3deb6d87750
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-0uc_fr15/unsloth_7c335a72a8b84861bedfb3deb6d87750
  Resolved https://github.com/unslothai/unsloth.git to commit d707bd43b4e883b521761d525be2fae428fe5980
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting unsloth_zoo>=2025.10.13 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2025.10.13-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.

In [2]:
from huggingface_hub import notebook_login
notebook_login()  # Paste your HF token (must accept Llama license)

In [5]:
# =====================================================
# QLoRA Fine-Tuning: Llama-3.2-1B-Instruct
# 4-bit + LoRA | Free Colab T4 | ~10-15 mins | 1k samples
# =====================================================

# --- CELL 1: Install ---
!pip install -q bitsandbytes accelerate peft trl transformers datasets huggingface_hub

# --- CELL 2: HF Login ---
from huggingface_hub import notebook_login
print("Paste your Hugging Face token (required for Llama):")
notebook_login()

# --- CELL 3: Load 4-bit Model + LoRA ---
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-3.2-1B-Instruct"

# 4-bit config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer + add [PAD]
tokenizer = AutoTokenizer.from_pretrained(model_name, token=True)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Load 4-bit model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="cuda",
    token=True,
)

# Resize embeddings for [PAD]
model.resize_token_embeddings(len(tokenizer))

# Prepare for QLoRA
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
print(f"VRAM: {torch.cuda.memory_allocated()/1e9:.2f} GB")

# --- CELL 4: Load & Format Dataset ---
from datasets import load_dataset

dataset = load_dataset("yahma/alpaca-cleaned", split="train")

alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

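def format_examples(examples):
    # Named format_examples to avoid shadowing the built-in format().
    # Only "instruction" and "output" are used; the dataset's "input" column is ignored.
    texts = [
        alpaca_prompt.format(instruction, output) + EOS_TOKEN
        for instruction, output in zip(examples["instruction"], examples["output"])
    ]
    return {"text": texts}

dataset = dataset.map(format_examples, batched=True, remove_columns=dataset.column_names)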
dataset = dataset.shuffle(seed=42).select(range(1000))

# --- CELL 5: Tokenize ---
def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=1024,
        padding=False,
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask"])

# --- CELL 6: Data Collator ---
from torch.utils.data import DataLoader

def collator(features):
    input_ids = [f["input_ids"] for f in features]
    attention_mask = [f["attention_mask"] for f in features]

    input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)
    attention_mask = torch.nn.utils.rnn.pad_sequence(attention_mask, batch_first=True, padding_value=0)

    # Mask padded positions via the attention mask, so real tokens are never masked by id
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100

    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

train_loader = DataLoader(tokenized, batch_size=2, shuffle=True, collate_fn=collator)

# --- CELL 7: Training Loop ---
from torch.optim import AdamW
from transformers import get_scheduler

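accum_steps = 4
num_epochs = 1
total_steps = len(train_loader) * num_epochs  # micro-batches (forward/backward passes)
update_steps = total_steps // accum_steps     # optimizer/scheduler updates

optimizer = AdamW(model.parameters(), lr=2e-4)
# The scheduler advances once per optimizer update, so its horizon is update_steps, not total_steps
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=10, num_training_steps=update_steps)

model.train()
step = 0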

print(f"Starting QLoRA training – {total_steps} steps")

for epoch in range(num_epochs):
    for batch in train_loader:
        step += 1
        batch = {k: v.to("cuda") for k, v in batch.items()}

        outputs = model(**batch)
        loss = outputs.loss / accum_steps
        loss.backward()

        if step % accum_steps == 0:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        if step % 10 == 0:
            print(f"Step {step} | Loss: {loss.item()*accum_steps:.4f} | VRAM: {torch.cuda.memory_allocated()/1e9:.2f} GB")

print("Training complete!")

# --- CELL 8: Save LoRA + Merge ---
lora_dir = "llama32-1b-qlora"
model.save_pretrained(lora_dir)
tokenizer.save_pretrained(lora_dir)
print(f"LoRA adapter saved: ~30 MB → {lora_dir}")

# Merge into full 16-bit
if input("Merge & save full 16-bit model? (y/n): ").lower() == "y":
    from peft import PeftModel

    print("Loading base model with [PAD] token...")
    tokenizer_merged = AutoTokenizer.from_pretrained(model_name, token=True)
    if tokenizer_merged.pad_token is None:
        tokenizer_merged.add_special_tokens({'pad_token': '[PAD]'})

    base_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        token=True,
    )
    base_model.resize_token_embeddings(len(tokenizer_merged))

    print("Merging LoRA...")
    model_peft = PeftModel.from_pretrained(base_model, lora_dir)
    merged_model = model_peft.merge_and_unload()

    merged_dir = "llama32-1b-qlora-merged"
    merged_model.save_pretrained(merged_dir)
    tokenizer_merged.save_pretrained(merged_dir)
    print(f"Merged 16-bit model saved: ~2 GB → {merged_dir}")

# --- CELL 9: Inference (on merged model; requires the merge step in CELL 8) ---
from transformers import pipeline

merged_dir = "llama32-1b-qlora-merged"
gen = pipeline(
    "text-generation",
    model=merged_dir,
    tokenizer=merged_dir,
    device=0,
    torch_dtype=torch.bfloat16,
)

prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is the capital of Japan?

### Response:
"""

print("\nGenerating...")
out = gen(prompt, max_new_tokens=64, do_sample=True, temperature=0.7)
response = out[0]["generated_text"].split("### Response:")[-1].strip()
print(f"Model says:\n{response}")

print("\nAll done! Full QLoRA pipeline complete.")

Paste your Hugging Face token (required for Llama):


trainable params: 11,272,192 || all params: 1,247,088,640 || trainable%: 0.9039
VRAM: 8.37 GB
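
The `trainable params` line above can be reproduced by hand. A rank-`r` LoRA adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` weights. The sketch below assumes the published Llama-3.2-1B shapes (hidden size 2048, MLP size 8192, 16 layers, 8 key/value heads of dimension 64, a 128,256-token vocabulary with tied input/output embeddings). None of these values are stated in this notebook, and the variable names are illustrative.

```python
# Assumed Llama-3.2-1B shapes (from the public model config, not from this notebook)
HIDDEN, MLP, LAYERS = 2048, 8192, 16
KV_DIM = 8 * 64   # 8 key/value heads x head_dim 64
VOCAB = 128_256
R = 16            # LoRA rank used in lora_config

# (in_features, out_features) of each projection listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) for every targeted projection
lora_per_layer = sum(R * (i + o) for i, o in shapes.values())
lora_params = lora_per_layer * LAYERS

# Base weights: projections plus two RMSNorm vectors per layer
base_per_layer = sum(i * o for i, o in shapes.values()) + 2 * HIDDEN
# Embedding gets one extra row for the added [PAD] token; final norm adds HIDDEN
base_params = (VOCAB + 1) * HIDDEN + base_per_layer * LAYERS + HIDDEN

all_params = base_params + lora_params
trainable_pct = 100 * lora_params / all_params
print(f"trainable params: {lora_params:,} || all params: {all_params:,} || trainable%: {trainable_pct:.4f}")
```

Under these assumptions the totals match the printed figures, including the 2,048 extra weights from the added `[PAD]` embedding row.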


Map:   0%|          | 0/51760 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Starting QLoRA training – 500 steps
Step 10 | Loss: 1.7340 | VRAM: 7.44 GB
Step 20 | Loss: 1.7579 | VRAM: 7.23 GB
Step 30 | Loss: 2.3798 | VRAM: 7.03 GB
Step 40 | Loss: 1.7499 | VRAM: 7.29 GB
Step 50 | Loss: 1.7580 | VRAM: 7.18 GB
Step 60 | Loss: 1.2105 | VRAM: 7.01 GB
Step 70 | Loss: 1.2868 | VRAM: 7.01 GB
Step 80 | Loss: 2.0721 | VRAM: 7.09 GB
Step 90 | Loss: 1.1656 | VRAM: 7.12 GB
Step 100 | Loss: 1.1529 | VRAM: 7.14 GB
Step 110 | Loss: 1.3475 | VRAM: 7.22 GB
Step 120 | Loss: 1.3105 | VRAM: 7.17 GB
Step 130 | Loss: 1.2328 | VRAM: 7.24 GB
Step 140 | Loss: 1.2491 | VRAM: 7.18 GB
Step 150 | Loss: 1.0205 | VRAM: 7.19 GB
Step 160 | Loss: 1.2713 | VRAM: 6.94 GB
Step 170 | Loss: 1.5266 | VRAM: 7.29 GB
Step 180 | Loss: 1.4887 | VRAM: 7.16 GB
Step 190 | Loss: 0.8622 | VRAM: 7.49 GB
Step 200 | Loss: 1.2483 | VRAM: 7.31 GB
Step 210 | Loss: 1.0674 | VRAM: 7.15 GB
Step 220 | Loss: 1.2447 | VRAM: 6.95 GB
Step 230 | Loss: 1.0552 | VRAM: 7.31 GB
Step 240 | Loss: 0.8816 | VRAM: 7.05 GB
Step 250 | Lo
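
The loop calls `lr_scheduler.step()` once every `accum_steps` micro-batches, so the 500 steps reported above correspond to 125 scheduler updates. The sketch below re-implements the multiplier of a `linear` schedule (linear warmup, then linear decay to zero) in plain Python, to show how the value passed as `num_training_steps` changes the learning rate reached by the final update. The name `linear_factor` is illustrative; the function mirrors the usual warmup-and-decay formula rather than calling the library.

```python
def linear_factor(step: int, warmup: int, horizon: int) -> float:
    """LR multiplier after `step` scheduler updates: linear warmup, then linear decay to 0 at `horizon`."""
    if step < warmup:
        return step / max(1, warmup)
    return max(0.0, (horizon - step) / max(1, horizon - warmup))

base_lr = 2e-4
micro_batches, accum_steps, warmup = 500, 4, 10
updates = micro_batches // accum_steps  # optimizer/scheduler updates actually performed

# Compare a horizon counted in micro-batches with one counted in optimizer updates
for horizon in (micro_batches, updates):
    final_lr = base_lr * linear_factor(updates, warmup, horizon)
    print(f"horizon={horizon}: lr after {updates} updates = {final_lr:.2e}")
```

With a horizon of 500 the rate has only decayed to roughly three quarters of its peak by the last update; with a horizon of 125 it reaches zero.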

`torch_dtype` is deprecated! Use `dtype` instead!
Device set to use cuda:0



Generating...
Model says:
The capital of Japan is Tokyo.

All done! Full QLoRA pipeline complete.
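
The formatting step in CELL 4 uses only the `instruction` and `output` columns, while `yahma/alpaca-cleaned` also has an `input` column that is non-empty for some rows. Below is a minimal sketch of one way to include it, assuming the common two-template Alpaca convention. `build_prompt`, `format_with_input`, the template constants, and the `"<EOS>"` placeholder are hypothetical names for illustration, not part of the notebook.

```python
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)

def build_prompt(instruction: str, input_text: str = "", output: str = "") -> str:
    """Pick the template based on whether the row has a non-empty input."""
    if input_text.strip():
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text, output=output)
    return PROMPT_NO_INPUT.format(instruction=instruction, output=output)

def format_with_input(examples: dict, eos_token: str) -> dict:
    """Batched formatter; with datasets it could be passed to map(..., batched=True, fn_kwargs={...})."""
    texts = [
        build_prompt(i, x, o) + eos_token
        for i, x, o in zip(examples["instruction"], examples["input"], examples["output"])
    ]
    return {"text": texts}

# Tiny hand-made batch; "<EOS>" stands in for tokenizer.eos_token
batch = {
    "instruction": ["Translate to French.", "Name a prime number."],
    "input": ["Good morning", ""],
    "output": ["Bonjour", "7"],
}
texts = format_with_input(batch, eos_token="<EOS>")["text"]
print(texts[0])
```

If training uses the two-template form, inference prompts for tasks with extra context should follow the same `### Input:` layout, with `output` left empty.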
