# 📖 Notebook Overview & DeepSpeed Primer

**Purpose:**  
This notebook demonstrates fine-tuning Meta's LLaMA-3.2-1B model with parameter-efficient LoRA adapters, accelerated by DeepSpeed's ZeRO optimizations. You will learn:

1. **Why DeepSpeed?**  Advanced memory and compute optimizations to:
   - **Scale to large models** on limited GPUs by partitioning optimizer states, gradients, and parameters across devices (ZeRO Stages 1–3).  
   - **Speed up training** with kernel fusion, sparse attention, and activation checkpointing.  
   - **Simplify multi-GPU & distributed setups** via a unified API.

2. **Key Concepts:**
   - **ZeRO-Offload & ZeRO Stages**  
     - Stage 1: partitions optimizer state.  
     - Stage 2: also partitions gradients.  
     - Stage 3: additionally partitions model weights.  
   - **LoRA (Low-Rank Adaptation):** injects small, trainable adapter matrices into attention layers rather than updating all model weights.  
   - **DeepSpeed Config:** JSON file (`config.json`) specifying ZeRO stage, offload settings (CPU/NVMe), fp16, gradient accumulation, etc.

3. **Notebook Workflow:**
   1. **Install & Verify Dependencies:** PyTorch, Transformers, DeepSpeed, PEFT, etc.  
   2. **Set Environment & Paths:** ensure CUDA/DeepSpeed pick up correct toolkit location.  
   3. **Authenticate:** log in to the Hugging Face Hub for dataset/model access.  
   4. **Load Model + Tokenizer:** configure padding tokens, dtype, device map.  
   5. **Wrap with LoRA:** inject adapters for parameter-efficient fine-tuning.  
   6. **Preprocess Dataset:** tokenize raw data and mask labels so the loss is computed only on the assistant's tokens.  
   7. **Initialize Trainer with DeepSpeed:** use `TrainingArguments(deepspeed="config.json")` to automatically enable ZeRO.  
   8. **Train & Monitor:** DeepSpeed handles gradient partitioning, offload, and logging.  
   9. **Merge Adapters & Save:** consolidate LoRA weights into base model for inference portability.  
   10. **Sample Inference:** load the merged model and generate a test response.

4. **Deep Explanations & Tips:**
   - **Memory Savings:**  
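     DeepSpeed ZeRO Stage 2 splits optimizer states and gradients across all GPUs, so each of N GPUs holds only 1/N of them. For full fine-tuning with mixed-precision Adam on 4 GPUs this is roughly a 3× reduction in model-state memory compared with plain data parallelism; with LoRA the gain is smaller, because only the adapter weights carry gradients and optimizer state.  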
   - **Speed vs. Precision Trade-offs:**  
     Use `fp16=True` or `bf16=True` to roughly halve the memory used by weights and activations, but watch for numeric instability. For fp16, DeepSpeed applies loss scaling to keep small gradients from underflowing.  
   - **Configuration Highlights (`config.json`):**  
     ```json
     {
       "zero_optimization": {
         "stage": 2,
         "offload_optimizer": { "device": "cpu", "pin_memory": true }
       },
       "fp16": { "enabled": true, "loss_scale": 0 },
       "gradient_accumulation_steps": 8,
       "train_micro_batch_size_per_gpu": 2
     }
     ```  
     - **offload_optimizer**: keeps optimizer states in pinned CPU memory so they do not occupy GPU memory. `offload_param` (offloading the weights themselves) requires Stage 3 and therefore does not appear in this Stage 2 config.  
     - **loss_scale = 0**: enables dynamic loss scaling, which adjusts the scale factor automatically to avoid gradient underflow in fp16.  
   - **Debugging & Logs:**  
     Enable `logging_steps` in `TrainingArguments` to get the training loss every N steps. At startup DeepSpeed also prints the configuration it resolved, which is a quick way to confirm the ZeRO stage and offload settings actually in effect.
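
The stage-by-stage savings described above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes mixed-precision Adam, where each trainable parameter costs about 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state (the accounting used in the ZeRO paper). It ignores activations and buffers, and `zero_model_state_gb` is a hypothetical helper, not a DeepSpeed API.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB) for weights, gradients and Adam states."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter (assumed)
    if stage >= 1:
        optim /= num_gpus    # Stage 1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus    # Stage 2: also shard gradients
    if stage >= 3:
        weights /= num_gpus  # Stage 3: also shard the weights
    return num_params * (weights + grads + optim) / 1e9


# Full fine-tuning of a 1B-parameter model on 4 GPUs (illustrative numbers)
for stage in range(4):
    print(f"stage {stage}: {zero_model_state_gb(1e9, 4, stage):.1f} GB per GPU")
```

These figures describe full fine-tuning. With LoRA the gradient and optimizer terms apply only to the adapter parameters, so the absolute numbers for this notebook are much smaller.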

---

> **Next:** start with **Cell 1: Environment & Dependency Installation** to get your GPU and DeepSpeed ready!


# 📦 Environment & Dependency Installation
- Install all required Python packages in one place.

In [None]:
# Install (or upgrade) core libraries in a single resolver pass:
# PyTorch, Transformers, Datasets, PEFT, DeepSpeed, and mpi4py
!pip install --upgrade torch transformers datasets peft deepspeed mpi4py

# Verify PyTorch installation
!pip show torch
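
`pip show` only covers one package at a time. As a possible complement, the sketch below checks every library from the install step in one pass using the standard-library `importlib.metadata`; `installed_version` is a hypothetical helper introduced here for illustration.

```python
import importlib.metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


# Report each dependency the notebook relies on
for pkg in ["torch", "transformers", "datasets", "peft", "deepspeed"]:
    print(f"{pkg:>12}: {installed_version(pkg) or 'NOT INSTALLED'}")
```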


# 📥 Imports & Path Configuration
- Centralize all imports and basic OS path settings.

In [None]:
# Standard library
import os
import importlib.metadata

# Deep learning frameworks
import torch
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    TrainingArguments, Trainer
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from huggingface_hub import login

# Ensure CUDA toolkit path is set
os.environ["CUDA_HOME"] = "/usr"
os.environ["PATH"] = f"{os.environ['CUDA_HOME']}/bin:" + os.environ.get("PATH", "")
os.environ["LD_LIBRARY_PATH"] = f"{os.environ['CUDA_HOME']}/lib64:" + os.environ.get("LD_LIBRARY_PATH", "")


# 🖥️ CUDA & GPU Verification
- Check GPU availability and print device info.

In [None]:
def check_gpu():
    """Print GPU availability and device details."""
    print("CUDA available:", torch.cuda.is_available())
    print("CUDA device count:", torch.cuda.device_count())
    if torch.cuda.is_available():
        print("Device name:", torch.cuda.get_device_name(0))
    else:
        print("No GPU found.")

# Run the check
check_gpu()

# Confirm nvcc installation and version
!nvcc --version


# 🛠️ DeepSpeed Build Configuration
- Configure environment variables and install DeepSpeed.

In [None]:
# Skip pre-compiling DeepSpeed's C++/CUDA ops at install time (they are JIT-built on first use).
# %env sets variables for this kernel and for later `!` shell commands.
%env DS_BUILD_OPS=0
%env CUDA_HOME=/usr

# Install DeepSpeed cleanly
!DS_BUILD_AIO=0 pip install deepspeed --no-cache-dir --force-reinstall
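
`TrainingArguments(deepspeed="config.json")` expects that file to exist in the working directory, and no cell in this notebook creates it. One way to do so, sketched here, is to write it from a Python dict. The values mirror the example in the overview, except that `offload_param` is left out on the understanding that it only applies to ZeRO Stage 3. `write_ds_config` is a hypothetical helper.

```python
import json


def write_ds_config(path: str = "config.json") -> dict:
    """Write a ZeRO Stage 2 config to `path` and return it as a dict."""
    cfg = {
        "zero_optimization": {
            "stage": 2,
            # Keep optimizer states in pinned CPU memory
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
        },
        # loss_scale 0 selects dynamic loss scaling
        "fp16": {"enabled": True, "loss_scale": 0},
        # These two should match the values passed to TrainingArguments
        "gradient_accumulation_steps": 8,
        "train_micro_batch_size_per_gpu": 2,
    }
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg


ds_cfg = write_ds_config()
print(json.dumps(ds_cfg, indent=2))
```

Keeping the batch size and accumulation steps in one place makes it easier to keep them consistent with the `bs` and `grad_acc` arguments used later in training.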


# 🔐 Hugging Face Authentication
- Log in to Hugging Face to enable dataset/model pulls and pushes.

In [None]:
from huggingface_hub import whoami

def hf_login(token: str):
    """
    Authenticate with the Hugging Face Hub.
    Replace 'YOUR_HF_TOKEN' in the call below with your actual token.
    """
    login(token=token)
    # whoami() asks the Hub which account the active token belongs to
    print("Logged into Hugging Face as:", whoami()["name"])

hf_login(token="YOUR_HF_TOKEN")
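
Pasting a token into a notebook makes it easy to leak when the file is shared. A minimal alternative, assuming the token has been exported as the `HF_TOKEN` environment variable before the kernel starts, is to read it at runtime. `resolve_hf_token` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os


def resolve_hf_token(env=os.environ) -> str:
    """Return the token stored in HF_TOKEN, or raise with a clear message."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable before logging in.")
    return token


# Usage (assumes hf_login from the cell above):
# hf_login(token=resolve_hf_token())
```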


# 🤖 Model & Tokenizer Initialization
- Load model/tokenizer, set up LoRA configuration.

In [None]:
class ModelLoader:
    """Handles loading of model and tokenizer, and wraps with LoRA."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.tokenizer = None
        self.model = None

    def load_tokenizer(self):
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        # Ensure pad token is defined
        if self.tokenizer.pad_token_id is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

    def load_model(self, dtype=torch.float16):
        # Load in half precision for GPU memory savings
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name, torch_dtype=dtype
        )

    def apply_lora(self, r=16, alpha=32, dropout=0.05):
        """Wrap the base model with LoRA adapters."""
        lora_cfg = LoraConfig(
            r=r, lora_alpha=alpha, lora_dropout=dropout,
            bias="none", task_type="CAUSAL_LM",
            target_modules=["q_proj", "v_proj"]
        )
        self.model = get_peft_model(self.model, lora_cfg)
        self.model.print_trainable_parameters()  # sanity check

# Usage
loader = ModelLoader("meta-llama/Llama-3.2-1B")
loader.load_tokenizer()
loader.load_model()
loader.apply_lora()
tokenizer, model = loader.tokenizer, loader.model
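
To see why `print_trainable_parameters()` reports such a small number, it can help to count the adapter parameters by hand. Each adapted linear layer gains two matrices, A of shape (r × in) and B of shape (out × r). The layer shapes below (hidden size 2048, a 512-wide value projection, 16 decoder layers) are assumptions about the Llama-3.2-1B configuration and should be checked against `model.config`; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters in one adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


# Assumed shapes; confirm with model.config on your checkpoint
hidden, kv_dim, n_layers, r = 2048, 512, 16, 16
per_layer = (
    lora_param_count(hidden, hidden, r)    # q_proj adapter
    + lora_param_count(hidden, kv_dim, r)  # v_proj adapter
)
total = per_layer * n_layers
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the adapters add under two million parameters, a small fraction of the roughly one billion in the base model.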


# 📝 Data Preprocessing Function
- Define a function to convert raw examples into training features.

In [None]:
def preprocess_sample(example, tokenizer, max_len=1024):
    """
    Tokenize and mask input so that loss is only computed on the assistant's output.
    """
    user = example["user"].strip()
    assistant = example["assistant"].strip()
    reasoning = example.get("reasoning", "").strip()

    # Combine assistant answer + chain-of-thought if present
    full_ans = f"{assistant}\nReasoning: {reasoning}" if reasoning else assistant
    prompt = f"### User:\n{user}\n\n### Assistant:\n"

    # Tokenize
    enc = tokenizer(prompt + full_ans,
                    max_length=max_len, padding="max_length",
                    truncation=True, return_tensors="pt")
    inp_ids = enc.input_ids[0]
    attn = enc.attention_mask[0]

    # Determine prompt length to mask labels
    prompt_ids = tokenizer(prompt,
                           max_length=max_len, padding="max_length",
                           truncation=True, return_tensors="pt")
    prompt_len = prompt_ids.attention_mask[0].sum().item()

    labels = inp_ids.clone()
    labels[:prompt_len] = -100          # do not compute loss on prompt
    labels[attn == 0] = -100            # ignore padding

    return {
        "input_ids": inp_ids.tolist(),
        "attention_mask": attn.tolist(),
        "labels": labels.tolist(),
    }

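# Example mapping on a small slice, as a quick sanity check.
# The training class below runs the same preprocessing on its own train/test split.
raw_ds = load_dataset("KingNish/reasoning-base-20k", split="train")
processed_ds = raw_ds.select(range(8)).map(
    lambda ex: preprocess_sample(ex, tokenizer),
    remove_columns=raw_ds.column_names,
    batched=False
)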
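
The label masking in `preprocess_sample` is easier to follow on a toy sequence. The sketch below reproduces the same rule in plain Python, with made-up token ids: positions belonging to the prompt or to padding receive `-100`, the value PyTorch's cross-entropy loss ignores by default. `mask_labels` is a hypothetical helper for illustration only.

```python
IGNORE = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_labels(input_ids, attention_mask, prompt_len):
    """Copy input_ids to labels, hiding prompt and padding positions."""
    return [
        tok if (i >= prompt_len and keep) else IGNORE
        for i, (tok, keep) in enumerate(zip(input_ids, attention_mask))
    ]


# Toy sequence: 3 prompt tokens, 2 answer tokens, 2 pad tokens (ids are made up)
ids = [11, 12, 13, 21, 22, 0, 0]
attn = [1, 1, 1, 1, 1, 0, 0]
print(mask_labels(ids, attn, prompt_len=3))
# [-100, -100, -100, 21, 22, -100, -100]
```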


# 🚀 Training Pipeline Class
- Encapsulate dataset split, Trainer setup, and training logic.

In [None]:
class LoraTrainer:
    """Orchestrates the fine-tuning process with Hugging Face Trainer."""

    def __init__(self, model, tokenizer, dataset, output_dir="output", epochs=3):
        self.model = model
        self.tokenizer = tokenizer
        self.raw_ds = dataset
        self.output_dir = output_dir
        self.epochs = epochs

    def prepare_datasets(self, test_size=0.1):
        split = self.raw_ds.train_test_split(test_size=test_size)
        self.train_ds = split["train"].map(
            lambda ex: preprocess_sample(ex, self.tokenizer),
            remove_columns=split["train"].column_names, batched=False
        )
        self.eval_ds = split["test"].map(
            lambda ex: preprocess_sample(ex, self.tokenizer),
            remove_columns=split["test"].column_names, batched=False
        )

    def train(self, bs=2, grad_acc=8):
        args = TrainingArguments(
            output_dir=self.output_dir,
            overwrite_output_dir=True,
            num_train_epochs=self.epochs,
            per_device_train_batch_size=bs,
            gradient_accumulation_steps=grad_acc,
            fp16=True,
            save_strategy="epoch",
            deepspeed="config.json",
            logging_steps=100,
            report_to="none",
            remove_unused_columns=False,
            seed=42
        )
        trainer = Trainer(
            model=self.model,
            tokenizer=self.tokenizer,
            args=args,
            train_dataset=self.train_ds,
            eval_dataset=self.eval_ds
        )
        trainer.train()
        return trainer

# Run training
trainer_obj = LoraTrainer(model, tokenizer, raw_ds)
trainer_obj.prepare_datasets()
trainer = trainer_obj.train()
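
The batch settings above interact: the effective batch size per optimizer step is the micro-batch size times the accumulation steps times the number of GPUs. The sketch below estimates how many optimizer steps a run takes, which helps when choosing `logging_steps`. The example count of 18,000 training rows is a placeholder (about 90% of a 20k-row dataset), `training_steps` is a hypothetical helper, and the step count that `Trainer` reports may differ slightly because of how it rounds partial batches.

```python
import math


def training_steps(n_examples, micro_bs, grad_acc, n_gpus, epochs):
    """Return (effective batch size, approximate optimizer steps for the run)."""
    effective_bs = micro_bs * grad_acc * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_bs)
    return effective_bs, steps_per_epoch * epochs


# Placeholder dataset size; single GPU, matching bs=2 and grad_acc=8 above
eff_bs, total_steps = training_steps(18_000, micro_bs=2, grad_acc=8, n_gpus=1, epochs=3)
print(f"effective batch size: {eff_bs}, optimizer steps: {total_steps}")
```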


# 💾 Merge LoRA Weights & Save Full Model
- After training, merge adapters and export the complete model.

In [None]:
def merge_and_save(trainer, save_dir="lora-merged-model"):
    """
    Merge LoRA weights into the base model and save tokenizer + model.
    """
    print("Merging LoRA adapters into base model...")
    base = trainer.model.merge_and_unload()
    base.eval()
    os.makedirs(save_dir, exist_ok=True)
    base.save_pretrained(save_dir)
    trainer.tokenizer.save_pretrained(save_dir)
    print(f"Model saved to {save_dir}")

merge_and_save(trainer, save_dir="llama-1b-finetuned-full")
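
Conceptually, merging folds each adapter into its base weight as W' = W + (alpha / r) · B · A, after which the adapter matrices are no longer needed at inference time. The pure-Python sketch below checks that identity on a tiny made-up example. It illustrates the arithmetic only and is not how PEFT implements `merge_and_unload()`; `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A)."""
    delta = matmul(B, A)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Tiny made-up example: 2x3 base weight with a rank-1 adapter
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # (out=2, in=3)
A = [[1.0, 2.0, 3.0]]                   # (r=1, in=3)
B = [[0.5], [-1.0]]                     # (out=2, r=1)
x = [[1.0], [1.0], [1.0]]               # input column vector

merged = merge_lora(W, A, B, alpha=2, r=1)
base_out = matmul(W, x)                 # frozen base path
lora_out = matmul(B, matmul(A, x))      # adapter path
separate = [[b[0] + 2.0 * l[0]] for b, l in zip(base_out, lora_out)]

# The merged weight gives the same output as base + scaled adapter
assert matmul(merged, x) == separate
print(matmul(merged, x))
```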


# 🧪 Inference Example
- Load the merged model and run a sample prompt.

In [None]:
def run_inference(model_dir, prompt_text, max_tokens=200):
    """
    Load a saved model and tokenizer, then generate a response.
    """
    tok = AutoTokenizer.from_pretrained(model_dir)
    mdl = AutoModelForCausalLM.from_pretrained(
        model_dir, torch_dtype=torch.float16, device_map="auto"
    )
    mdl.eval()

    system = (
        "You are a helpful assistant. Answer concisely and accurately. "
        "If unsure, say you don't know."
    )
    full_prompt = f"{system}\n\n### User:\n{prompt_text}\n\n### Assistant:\n"
    inputs = tok(full_prompt, return_tensors="pt").to(mdl.device)

    with torch.no_grad():
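        out = mdl.generate(
            **inputs, max_new_tokens=max_tokens,
            do_sample=True,  # needed for temperature/top_p/top_k to take effect
            temperature=0.1, top_p=0.9, top_k=50,
            # a size of 1 would forbid any token from ever repeating
            repetition_penalty=1.1, no_repeat_ngram_size=3
        )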
    text = tok.decode(out[0], skip_special_tokens=True)
    answer = text.split("### Assistant:")[-1].strip()
    print("🧠 Model Reply:\n", answer)

# Example inference
run_inference(
    model_dir="llama-1b-finetuned-full",
    prompt_text="What is the discriminant of the quadratic equation 5x^2 - 2x + 1 = 0?"
)
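
To judge whether the model's reply to the sample prompt is correct, the expected answer can be computed directly. For ax² + bx + c = 0 the discriminant is b² − 4ac; `discriminant` below is a small helper written for this check.

```python
def discriminant(a: float, b: float, c: float) -> float:
    """Discriminant b^2 - 4ac of ax^2 + bx + c = 0."""
    return b * b - 4 * a * c


d = discriminant(5, -2, 1)
print(d)  # -16, so the equation has two complex roots
```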
