# 📖 Notebook Overview & Unsloth Primer

**Purpose:**  
This notebook demonstrates instruction-style fine-tuning of a 7B Mistral model using the Unsloth library’s 4-bit quantization and optimized LoRA adapters, powered by TRL’s `SFTTrainer`. You will learn how to:

1. **Leverage 4-bit Quantization:**  
   - Drastically reduce VRAM usage (∼4× smaller weights than 16-bit) with only a small loss in output quality.  
   - Enable training of large models on a single GPU.

2. **Use Unsloth’s Gradient Checkpointing & Quant LoRA:**  
   - **Gradient checkpointing (“unsloth” mode):** trades a bit of compute for memory, allowing longer contexts or larger batch sizes.  
   - **LoRA adapters + RSLoRA support:** inject small trainable low-rank matrices into the attention projections (q/k/v/o) and the MLP projections (gate/up/down) for parameter efficiency.

3. **Integrate with TRL’s SFTTrainer:**  
   - Simplifies instruction tuning by training directly on a single text column (`dataset_text_field`).  
   - Packs or streams examples, logs training metrics, and handles mixed-precision seamlessly.
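
To put a number on "small", here is a rough parameter count for LoRA adapters on the seven projections named above. It is a sketch that assumes the published Mistral-7B shapes (hidden size 4096, MLP size 14336, 32 layers, key/value projection width 1024 from grouped-query attention) and rank `r=16`. The helper `lora_param_count` is illustrative and not part of Unsloth or PEFT.

```python
# Assumed Mistral-7B projection shapes as (in_features, out_features).
HIDDEN, KV_DIM, MLP_DIM, N_LAYERS = 4096, 1024, 14336, 32

PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_param_count(r: int) -> int:
    """Trainable parameters added by rank-r LoRA on every projection in every layer.

    Each adapted weight W (out x in) gains A (r x in) and B (out x r),
    i.e. r * (in + out) extra parameters.
    """
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values())
    return per_layer * N_LAYERS

trainable = lora_param_count(r=16)
print(f"LoRA parameters at r=16: {trainable:,}")
print(f"Share of a ~7.24B-parameter base model: {trainable / 7.24e9:.2%}")
```

Under these assumptions the adapters add roughly 42 million trainable parameters, well under 1% of the base model.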

---

## 🔑 Key Concepts

- **4-bit Quantization (`load_in_4bit=True`):**  
  - Reduces model weight storage & memory bandwidth.  
  - Works with both GPT-style and LLaMA-style architectures via the bitsandbytes (BNB) backend.

- **Unsloth PEFT (`get_peft_model`):**  
  - Combines LoRA with `use_gradient_checkpointing="unsloth"`, which Unsloth reports uses about 30% less VRAM.  
  - Optional “rslora” for rank-stabilized updates.

- **Instruction Format:**  
  - Uses Alpaca-style prompts (instruction + input → response) or chat templates for downstream tasks.
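
The "∼4× smaller weights" figure follows from simple arithmetic on bits per parameter. The sketch below assumes roughly 7.24 billion parameters for a 7B Mistral model and ignores quantization constants and any layers kept in higher precision, so real checkpoints are somewhat larger. `weight_memory_gib` is an illustrative helper, not a library function.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

N_PARAMS = 7.24e9  # assumed parameter count for a 7B Mistral model

fp16_gib = weight_memory_gib(N_PARAMS, 16)
int4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
print(f"Reduction: {fp16_gib / int4_gib:.0f}x")
```

Activations, optimizer state, and the LoRA adapters come on top of this, so the estimate is a lower bound on training memory.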

---

## 🗂 Notebook Workflow

1. **Install & Verify**  
   Install Unsloth nightly build, Transformers, PEFT, TRL, and Torch. Verify versions.

2. **Imports & Env Setup**  
   All Python imports, environment variables, and CUDA checks in one place.

3. **GPU & CUDA Check**  
   Confirm torch CUDA availability, device count/name, and `nvcc --version`.

4. **UnslothLoader Class**  
   Encapsulate model/tokenizer loading with 4-bit support and sequence length configuration.

5. **PEFT Configuration**  
   Wrap the base model with LoRA adapters, Unsloth checkpointing, and optional RSLoRA.

6. **Dataset Preparation**  
   Load Hugging Face data, rename columns, and format examples in Alpaca/chat style.

7. **SFTTrainer Setup**  
   Build TRL’s `SFTTrainer` with `TrainingArguments` tuned for quantized LoRA training.

8. **Training & Monitoring**  
   Log baseline and peak GPU memory reservation; run `.train()` to fine-tune.

9. **Save Checkpoint**  
   Merge LoRA adapters back into the base model and save both model + tokenizer.

10. **Inference Examples**  
    Reload the merged 4-bit model, enable Unsloth inference optimizations, and generate sample outputs with streaming.

---

## 💡 Deep Tips & Best Practices

- **Batch Size vs. Sequence Length:**  
  Quantization and checkpointing free up VRAM that you can spend on longer sequences or on larger batches; experiment to find the balance that fits your GPU.

- **Mixed Precision:**  
  Prefer BF16 on Ampere or newer GPUs and fall back to FP16 on older hardware. Use `is_bfloat16_supported()` to choose between them at runtime.

- **Logging & Debugging:**  
  - Set `logging_steps=1` for quick feedback in small demos.  
  - If you hit out-of-memory (OOM) errors, reduce `per_device_train_batch_size` or disable packing.

- **Reproducibility:**  
  Set `random_state` in Unsloth and `seed` in `TrainingArguments` so runs are repeatable; some GPU kernels are nondeterministic, so small differences can remain.
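
The mixed-precision tip reduces to a pair of mutually exclusive flags. The helper below is a hypothetical, dependency-free sketch; in this notebook the boolean would come from `is_bfloat16_supported()`.

```python
def precision_flags(bf16_supported: bool) -> dict:
    """Return the fp16/bf16 keyword arguments for TrainingArguments.

    Exactly one of the two flags is True: BF16 when the GPU supports it,
    FP16 otherwise.
    """
    return {"bf16": bf16_supported, "fp16": not bf16_supported}

# A GPU with BF16 support vs. an older GPU without it (for example a T4).
print(precision_flags(True))
print(precision_flags(False))
```

The resulting dictionary can be unpacked into the training arguments with `**precision_flags(is_bfloat16_supported())`.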

---

> **Next:** proceed to **Cell 1: Install Unsloth & Core Dependencies** and get your environment ready!


# 📦 Install Unsloth & Core Dependencies
- Single cell for all pip installs and upgrades.

In [None]:
%%bash
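# Install Unsloth plus Transformers, tokenizers, Datasets, TRL, PEFT, and Torch,
# then replace the released Unsloth with the latest build from GitHub
pip install --upgrade --no-cache-dir unsloth transformers tokenizers datasets trl torch peft \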
    && pip uninstall unsloth -y \
    && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

# Verify key packages
pip show torch transformers peft trl unsloth


# 📥 Imports & Environment Configuration
- Group all imports and any os/env setup here.
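
Package versions can also be checked from Python rather than through `pip show`. The sketch below uses only the standard library; `installed_versions` is an illustrative helper, and the package list is an assumption based on the imports in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

packages = ["torch", "transformers", "peft", "trl", "unsloth"]
for name, ver in installed_versions(packages).items():
    print(f"{name:<14}{ver or 'not installed'}")
```

Because a missing package is reported as `None` instead of raising, the check can run before the heavier imports.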

In [None]:
import os
import sys

import torch
# Import unsloth before transformers/trl so its patches are applied
from unsloth import FastLanguageModel, is_bfloat16_supported
from datasets import load_dataset
from transformers import TextStreamer, TrainingArguments
from trl import SFTTrainer

# (Optional) show Python executable
print("Python executable:", sys.executable)


# 🖥️ GPU Check & CUDA Toolkit Verification
- Confirm GPU availability
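
The workflow also mentions `nvcc --version`. One way to run that check from Python without failing when the toolkit is absent is sketched below; `nvcc_version` is a hypothetical helper that relies only on the standard library.

```python
import shutil
import subprocess

def nvcc_version():
    """Return the output of `nvcc --version`, or None if nvcc is missing or fails."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    try:
        result = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()

print(nvcc_version() or "nvcc not found on PATH")
```

A missing `nvcc` does not necessarily block training, since PyTorch wheels typically bundle their own CUDA runtime.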

In [None]:
def check_gpu():
    """Print GPU count, names, and availability."""
    print("CUDA available:", torch.cuda.is_available())
    print("Device count:", torch.cuda.device_count())
    if torch.cuda.is_available():
        print("Device name:", torch.cuda.get_device_name(0))
    else:
        print("No GPU found.")

check_gpu()



# 🛠️ Unsloth Model Loader Class
- Encapsulate model + tokenizer loading with 4-bit support.

In [None]:
class UnslothLoader:
    """
    Loads an Unsloth-supported LLM in 4-bit or full precision,
    sets max sequence length and dtype automatically.
    """
    def __init__(self, model_name: str, max_seq: int = 2048, dtype=None, load4bit: bool = True):
        self.model_name = model_name
        self.max_seq = max_seq
        self.dtype = dtype
        self.load4bit = load4bit
        self.model, self.tokenizer = None, None

    def load(self):
        """Download and instantiate model + tokenizer."""
        self.model, self.tokenizer = FastLanguageModel.from_pretrained(
            model_name=self.model_name,
            max_seq_length=self.max_seq,
            dtype=self.dtype,
            load_in_4bit=self.load4bit
        )
        print(f"Loaded {self.model_name} with 4-bit={self.load4bit}")

# Example usage:
loader = UnslothLoader("unsloth/mistral-7b-v0.2-bnb-4bit")
loader.load()
model, tokenizer = loader.model, loader.tokenizer


# 🔧 Apply LoRA & Unsloth Optimizations
- Wrap the base model with PEFT adapters + Unsloth checkpointing.
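
The `use_rslora` flag changes only how the adapter update is scaled: standard LoRA multiplies it by `alpha / r`, while rank-stabilized LoRA uses `alpha / sqrt(r)`. The sketch below compares the two factors for a few ranks with `alpha=16`; the function name `lora_scaling` is illustrative.

```python
import math

def lora_scaling(alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

for r in (8, 16, 64, 256):
    standard = lora_scaling(16, r)
    stabilized = lora_scaling(16, r, use_rslora=True)
    print(f"r={r:<4} LoRA: {standard:<8.4f} rsLoRA: {stabilized:.4f}")
```

With standard scaling the factor shrinks quickly as `r` grows, which is one reason RSLoRA tends to be suggested for higher ranks.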

In [None]:
def configure_peft(model, r=16, alpha=16, dropout=0.0, use_rslora=False):
    """
    Inject LoRA adapters and Unsloth-specific settings for VRAM efficiency.
    """
    model = FastLanguageModel.get_peft_model(
        model,
        r=r,
        target_modules=[
            "q_proj","k_proj","v_proj","o_proj",
            "gate_proj","up_proj","down_proj"
        ],
        lora_alpha=alpha,
        lora_dropout=dropout,
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=3407,
        use_rslora=use_rslora,
        loftq_config=None
    )
    print("PEFT + Unsloth configured!")
    return model

# Apply
model = configure_peft(model)


# 🔄 Dataset Prep & Prompt Formatting
- Load raw data, rename columns, and format in Alpaca/chat style.
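
With `batched=True`, `Dataset.map` passes the formatting function a dictionary of column lists rather than a single row. The toy example below mimics that shape without downloading anything. The `"</s>"` string is an assumed stand-in for `tokenizer.eos_token`, and `format_batch` and the sample row are made up for illustration.

```python
ALPACA_TMPL = (
    "Below is an instruction with context. Write an appropriate response.\n\n"
    "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"
)
EOS_TOKEN = "</s>"  # assumed stand-in for tokenizer.eos_token

def format_batch(batch: dict) -> dict:
    """Format a column-oriented batch into one training string per row."""
    rows = zip(batch["instruction"], batch["input"], batch["output"])
    return {"text": [ALPACA_TMPL.format(*row) + EOS_TOKEN for row in rows]}

# One made-up row, laid out the way a batched map call would deliver it.
toy_batch = {
    "instruction": ["Add the numbers."],
    "input": ["2 and 3"],
    "output": ["5"],
}
formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Appending the EOS token marks where the response ends, so the fine-tuned model learns to stop generating.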

In [None]:
# 1) Load and rename
ds = load_dataset("KingNish/reasoning-base-20k", split="train")
ds = ds.rename_columns({"user":"instruction","reasoning":"input","assistant":"output"})

# 2) Alpaca-style formatting
ALPACA_TMPL = (
    "Below is an instruction with context. Write an appropriate response.\n\n"
    "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"
)

def format_alpaca(ex):
    text = [
        ALPACA_TMPL.format(i, inp, out) + tokenizer.eos_token
        for i, inp, out in zip(ex["instruction"], ex["input"], ex["output"])
    ]
    return {"text": text}

ds = ds.map(format_alpaca, batched=True)
print("Sample formatted prompt:\n", ds[0]["text"][:200])


# 🚀 Training Setup with SFTTrainer
- Define and initialize the SFTTrainer for fine-tuning.
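
The batch settings in this section imply an effective batch size and a share of the dataset covered by the demo run. The following sketch works through that arithmetic. The 20,000-row dataset size is an assumption taken from the dataset's name, and `training_budget` is an illustrative helper.

```python
def training_budget(per_device_batch: int, grad_accum: int, max_steps: int,
                    dataset_size: int, n_gpus: int = 1) -> dict:
    """Derive effective batch size and dataset coverage from trainer settings."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    examples_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "examples_seen": examples_seen,
        "epochs": examples_seen / dataset_size,
    }

# Settings used in this notebook; dataset_size is an assumed round figure.
budget = training_budget(per_device_batch=2, grad_accum=4, max_steps=60,
                         dataset_size=20_000)
print(budget)
```

At 60 steps the run touches only a small fraction of the data, which is why `max_steps` is described as a demo placeholder.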

In [None]:
def build_sft_trainer(model, tokenizer, dataset, out_dir="unsloth_outputs"):
    """
    Prepare TRL's SFTTrainer for instruction-style fine-tuning.
    """
    args = TrainingArguments(
        output_dir=out_dir,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,             # placeholder for demo; use epochs in prod
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        report_to="none"
    )
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        packing=False,
        args=args
    )
    print("SFTTrainer ready.")
    return trainer

trainer = build_sft_trainer(model, tokenizer, ds)


# 📊 Monitor GPU Memory & Start Training
- Log baseline GPU memory, run .train(), then compute usage.
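
The memory report is a matter of unit conversion and subtraction. The sketch below isolates that arithmetic with made-up byte counts so it runs without a GPU; in the notebook the inputs would come from `torch.cuda.max_memory_reserved()` and `total_memory`. `summarize_memory` is a hypothetical helper.

```python
GIB = 1024**3

def summarize_memory(start_bytes: int, peak_bytes: int, total_bytes: int) -> dict:
    """Convert raw byte counts into GiB and a percentage of total VRAM."""
    return {
        "start_gib": start_bytes / GIB,
        "peak_gib": peak_bytes / GIB,
        # Growth between the baseline and the peak, attributable to training.
        "training_gib": (peak_bytes - start_bytes) / GIB,
        "peak_pct": 100 * peak_bytes / total_bytes,
    }

# Made-up readings for a 16 GiB card: 5 GiB reserved after loading, 8 GiB at peak.
stats = summarize_memory(5 * GIB, 8 * GIB, 16 * GIB)
print({key: round(value, 2) for key, value in stats.items()})
```

Reporting the difference between peak and baseline separates the cost of training from the cost of simply holding the model.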

In [None]:
# Baseline memory
start_mem = torch.cuda.max_memory_reserved() / (1024**3)
total_mem = torch.cuda.get_device_properties(0).total_memory / (1024**3)
print(f"Start reserved: {start_mem:.2f} GB / {total_mem:.2f} GB")

# Train
train_stats = trainer.train()

# Peak memory
peak_mem = torch.cuda.max_memory_reserved() / (1024**3)
print(f"Peak reserved: {peak_mem:.2f} GB ({peak_mem/total_mem*100:.1f}%)")


# 💾 Save Fine-Tuned Model
- Merge the LoRA adapters into the base model and save the merged model and tokenizer for inference.

In [None]:
def save_model(trainer, path="unsloth_finetuned"):
    """
    Saves LoRA adapters + base model merged into one checkpoint.
    """
    model_merged = trainer.model.merge_and_unload()
    model_merged.save_pretrained(path)
    trainer.tokenizer.save_pretrained(path)
    print("Model saved to:", path)

save_model(trainer)


# 🧪 Inference Examples
- Load the merged checkpoint and generate a sample response.
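
Because the model was fine-tuned on Alpaca-formatted text, prompts in the same layout are likely to match the training distribution better than a bare question. A minimal sketch of building such a prompt follows; `build_inference_prompt` is an illustrative helper, and the template is copied from the dataset preparation step.

```python
ALPACA_TMPL = (
    "Below is an instruction with context. Write an appropriate response.\n\n"
    "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"
)

def build_inference_prompt(instruction: str, context: str = "") -> str:
    """Fill the training template but leave the response empty for the model."""
    return ALPACA_TMPL.format(instruction, context, "")

prompt = build_inference_prompt(
    "Prove that the difference between two consecutive cubes cannot be divisible by 5."
)
print(prompt)
```

No EOS token is appended here, since the model is expected to produce the response and the end-of-sequence marker itself.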

In [None]:
# Reload for inference
inf_model, inf_tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth_finetuned",
    load_in_4bit=True
)
FastLanguageModel.for_inference(inf_model)

def infer(prompt, max_new_tokens=100):
    """Generate text with streaming for real-time output."""
    inputs = inf_tokenizer(prompt, return_tensors="pt").to("cuda")
    streamer = TextStreamer(inf_tokenizer)
    inf_model.eval()
    with torch.no_grad():
        inf_model.generate(
            **inputs,
            streamer=streamer,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.3,
            top_k=10,
            eos_token_id=inf_tokenizer.eos_token_id
        )

# Example
sample_prompt = "Prove that the difference between two consecutive cubes cannot be divisible by 5."
infer(sample_prompt)
