<h1>Fine-tuning Generative Models</h1>

## 🧠 Full Notebook Summary – QLoRA + DPO Fine-Tuning Pipeline

- **🔤 Tokenizer & Dataset Preparation**
  - Loaded the `TinyLlama` tokenizer and the `ultrachat_200k` dataset.
  - Formatted prompts using TinyLlama’s chat template (`<|user|>`, `<|assistant|>`).

- **🤖 Model Loading with Quantization (QLoRA)**
  - Loaded the `TinyLlama-1.1B` model in **4-bit precision** using `BitsAndBytesConfig`.
  - Enabled low-memory, efficient training via quantization.

- **🔧 LoRA Adapter Injection**
  - Defined a `LoraConfig` to apply LoRA adapters to key transformer layers (e.g., `q_proj`, `v_proj`, `k_proj`, etc.).
  - Prepared the quantized model for training and injected LoRA layers.

- **📚 Supervised Fine-Tuning (SFT)**
  - Fine-tuned the model on instruction-following data using `SFTTrainer`.
  - Only LoRA adapter weights were updated during training.
  - Saved SFT LoRA weights to `TinyLlama-1.1B-qlora`.

- **📈 DPO Dataset Preparation**
  - Loaded a preference dataset: `distilabel-intel-orca-dpo-pairs`.
  - Filtered for high-quality pairs (e.g., `chosen_score ≥ 8` and no ties).
  - Formatted prompts using the TinyLlama chat structure.

- **⚙️ DPO Training Configuration**
  - Used `DPOConfig` to define training arguments (batch size, learning rate, cosine LR scheduler, etc.).
  - Optimized for quantized and memory-efficient training.

- **🏋️‍♂️ Direct Preference Optimization (DPO)**
  - Further fine-tuned the SFT model using preference data via `DPOTrainer`.
  - The model learned to prefer higher-quality answers over rejected ones.
  - Saved DPO LoRA adapter weights to `TinyLlama-1.1B-dpo-qlora`.

- **🔗 Merging LoRA Adapters**
  - Merged the SFT LoRA adapters into the base model using `merge_and_unload()`.
  - Then merged the DPO LoRA adapters on top of the SFT model.
  - Result: a fully fine-tuned base model with both SFT and DPO updates, without needing LoRA layers anymore.

- **💬 Inference**
  - Created a prompt using TinyLlama's expected chat format.
  - Used Hugging Face's `pipeline()` with the final model to generate a response.

---

✅ The result is a memory-efficient, preference-aligned, instruction-tuned TinyLlama model ready for deployment or further experimentation.


%%capture
!pip install -q accelerate==0.31.0 peft==0.11.1 bitsandbytes==0.43.1 transformers==4.41.2 trl==0.9.4 sentencepiece==0.2.0 triton==3.1.0

# Supervised Fine-Tuning (SFT)

## Data Preprocessing

In [2]:
from transformers import AutoTokenizer
from datasets import load_dataset


# Load a tokenizer to use its chat template
'''This loads the tokenizer associated with the "TinyLlama/TinyLlama-1.1B-Chat-v1.0" model.
This tokenizer contains a chat template, which defines how conversations (e.g., <|user|> and <|assistant|> tokens) should be formatted.'''
template_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

'''This function takes an individual example (a dictionary) from the dataset.
It extracts the messages field, which contains a conversation history (a list of dicts with role and content, e.g. user/assistant).
It then uses the tokenizer’s apply_chat_template() to convert this conversation into a formatted text prompt that the TinyLlama model expects.
tokenize=False means it returns a string, not token IDs. The function returns a dictionary with one key: "text" (the formatted prompt).'''
def format_prompt(example):
    """Format the conversation with TinyLlama's <|user|>/<|assistant|> chat template."""

    # Format answers
    chat = example["messages"]
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)

    return {"text": prompt}

# Load and format the data using the template TinyLLama is using
'''Loads the "test_sft" split of the "HuggingFaceH4/ultrachat_200k" dataset. This dataset contains synthetic chat conversations.
Shuffles it with a fixed seed (for reproducibility).
Selects the first 3,000 examples after shuffling.'''
dataset = (
    load_dataset("HuggingFaceH4/ultrachat_200k",  split="test_sft")
      .shuffle(seed=42)
      .select(range(3_000))
)
'''Applies the format_prompt function to each example in the dataset.
As a result, each example will now include a "text" field that contains the prompt formatted with TinyLlama's chat template.'''
dataset = dataset.map(format_prompt)

Map: 100%|█████████████████████████████████████████████████████████████████| 3000/3000 [00:01<00:00, 1837.92 examples/s]


In [3]:
# Example of formatted prompt
print(dataset["text"][2576])

<|user|>
Given the text: Knock, knock. Who’s there? Hike.
Can you continue the joke based on the given text material "Knock, knock. Who’s there? Hike"?</s>
<|assistant|>
Sure! Knock, knock. Who's there? Hike. Hike who? Hike up your pants, it's cold outside!</s>
<|user|>
Can you tell me another knock-knock joke based on the same text material "Knock, knock. Who's there? Hike"?</s>
<|assistant|>
Of course! Knock, knock. Who's there? Hike. Hike who? Hike your way over here and let's go for a walk!</s>
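
The rendered conversation above follows a simple pattern: a `<|role|>` tag, a newline, the message content, and the end-of-sequence token. The sketch below reproduces that pattern by hand so the structure is easy to inspect. `render_chat` is a hypothetical helper written for illustration only; the notebook itself relies on `apply_chat_template()`, and the real template may differ in details such as system-message handling.

```python
# Hypothetical re-implementation of the rendered chat format, for illustration only.
EOS = "</s>"  # end-of-sequence token as it appears in the printed example


def render_chat(messages, eos_token=EOS):
    """Render a list of {"role", "content"} dicts in the <|role|> format shown above."""
    parts = []
    for message in messages:
        # Each turn: role tag, newline, content, EOS token, newline
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Knock, knock."},
    {"role": "assistant", "content": "Who's there?"},
]
rendered = render_chat(chat)
print(rendered)
```

Printing `rendered` next to `dataset["text"][i]` for the same messages is a quick way to check whether the hand-written pattern and the tokenizer's template agree.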



## Models - Quantization

In [4]:
'''torch: PyTorch library for tensor computation and deep learning.
AutoModelForCausalLM: Automatically loads a causal language model (used for text generation).
AutoTokenizer: Loads the tokenizer that matches the model.
BitsAndBytesConfig: Used for configuring quantization (loading the model in lower precision to save memory).'''
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

'''Specifies the name of the TinyLlama model checkpoint you're loading.
This is likely a checkpoint partway through training (e.g., at 1.431 million steps on 3 trillion tokens).'''
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

# 4-bit quantization configuration - Q in QLoRA
'''load_in_4bit=True: Loads the model weights in 4-bit precision. bnb_4bit_quant_type="nf4": Uses NF4 (4-bit NormalFloat), a data type designed for normally distributed weights.
bnb_4bit_compute_dtype="float16": Dequantizes weights to float16 for computation. bnb_4bit_use_double_quant=True: Applies nested quantization (quantizes the quantization constants), further compressing the model.
This makes it possible to fine-tune or run inference on large models using a GPU with limited memory.'''
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use 4-bit precision model loading
    bnb_4bit_quant_type="nf4",  # Quantization type
    bnb_4bit_compute_dtype="float16",  # Compute dtype
    bnb_4bit_use_double_quant=True,  # Apply nested quantization
)

# Load the model to train on the GPU
'''Loads the model with the name model_name. device_map="auto": Automatically places model parts on the appropriate GPU(s).
quantization_config=bnb_config: Applies the 4-bit quantization settings defined earlier.
This step loads the model in a memory-efficient, quantized format, ready for training or inference.'''
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",

    # Leave this out for regular SFT
    quantization_config=bnb_config,
)
'''use_cache = False: Disables caching of past key-values. The cache only speeds up generation and is incompatible with gradient checkpointing, so it is turned off for training.
pretraining_tp = 1: Sets the pretraining tensor-parallelism factor to 1, so the linear layers are computed in the standard (unsliced) way.'''
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
'''Loads the tokenizer associated with the model.
trust_remote_code=False: Avoids executing any custom code from the model repo. This is a security-safe default.
pad_token = "<PAD>": Sets the tokenizer’s pad token (TinyLlama may not come with a default one).
padding_side = "left": Pads on the left side, so that in batched generation with a decoder-only model the new tokens directly follow the prompt rather than padding tokens.'''
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=False)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"
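
To get a feel for why 4-bit loading matters, the sketch below estimates the storage needed for the weights alone at different precisions. It is a back-of-the-envelope calculation under stated assumptions: the parameter count is the nominal 1.1B from the model name, and the estimate ignores quantization constants, any layers left unquantized, activations, and optimizer state. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 1.1e9  # assumption: nominal size implied by the name "TinyLlama-1.1B"

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision baseline
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"float16 weights: ~{fp16_gb:.2f} GB")
print(f"4-bit weights:   ~{nf4_gb:.2f} GB")
print(f"reduction:       ~{fp16_gb / nf4_gb:.0f}x")
```

The real footprint on the GPU will be somewhat higher than the 4-bit figure, since the estimate leaves out everything except the quantized weights themselves.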

## Configuration

### LoRA Configuration

In [5]:
'''LoRA is a parameter-efficient fine-tuning (PEFT) method. Instead of updating all the weights of a large model, it adds small trainable low-rank adapters to certain layers (e.g., attention layers). This:
greatly reduces memory and compute requirements, and enables fine-tuning even large models on consumer GPUs.'''
'''LoraConfig: Used to configure LoRA training. prepare_model_for_kbit_training: Prepares a quantized (e.g., 4-bit) model for training.
get_peft_model: Wraps the model with LoRA adapters.'''
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# Prepare LoRA Configuration
'''| Parameter               | Meaning                                                                                |
| ----------------------- | -------------------------------------------------------------------------------------- |
| `lora_alpha=32`         | Scaling factor for LoRA weights. Controls how much they affect outputs.                |
| `lora_dropout=0.1`      | Dropout applied to the LoRA layers during training.                                    |
| `r=64`                  | LoRA rank — controls how many parameters are added. Higher = more capacity.            |
| `bias="none"`           | Do not train any bias parameters.                                                      |
| `task_type="CAUSAL_LM"` | Specifies this is a causal language modeling task.                                     |
| `target_modules=[...]`  | Which layers LoRA will modify. Typically includes linear layers in attention and MLPs. |
 '''
peft_config = LoraConfig(
    lora_alpha=32,  # LoRA Scaling
    lora_dropout=0.1,  # Dropout for LoRA Layers
    r=64,  # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=  # Layers to target
     ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

# prepare model for training
'''This function prepares a quantized model (e.g., 4-bit) for training with LoRA.
It: Freezes the base model weights (requires_grad=False). Casts the remaining non-quantized parameters, such as normalization layers, to float32 for stability. Enables gradient checkpointing and makes the input embeddings' outputs require gradients so that checkpointing works with frozen weights.
This step is crucial for compatibility with quantized models like those loaded with BitsAndBytes.'''
model = prepare_model_for_kbit_training(model)
'''Wraps the base model with LoRA adapters using the config defined earlier.
Only the small LoRA layers are now trainable — the original model weights remain frozen.
After this, you can train the model using a regular PyTorch or Hugging Face training loop.'''
model = get_peft_model(model, peft_config)
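
The effect of `r=64` and the list of target modules can be made concrete by counting the parameters the adapters add. A rank-`r` adapter on a linear layer with input size `d_in` and output size `d_out` adds two matrices, `A` (`r × d_in`) and `B` (`d_out × r`). The sketch below applies that formula to the seven targeted projections. The layer shapes are assumptions about TinyLlama-1.1B (hidden size 2048, MLP size 5632, key/value projection size 256, 22 layers) and should be checked against `model.config` before relying on the totals.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A rank-r adapter adds A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Assumed TinyLlama-1.1B shapes; verify against model.config
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 2048, 5632, 256, 22
R, ALPHA = 64, 32  # values from the LoraConfig above

# (d_in, d_out) for each targeted linear layer in one transformer block
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in shapes.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / R  # factor applied to the adapter output

print(f"adapter parameters per block: {per_layer:,}")
print(f"adapter parameters in total:  {total:,}")
print(f"LoRA scaling (alpha / r):     {scaling}")
```

On the wrapped model itself, `model.print_trainable_parameters()` reports the actual trainable and total counts, which is the authoritative check.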

### Training Configuration

In [6]:
'''This code sets up training hyperparameters for a Hugging Face model using the TrainingArguments class from the transformers library.'''
from transformers import TrainingArguments

output_dir = "./results"

# Training arguments
'''| Parameter                       | Description                                                                                                                           |
| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `output_dir="./results"`        | Directory to save model checkpoints and logs.                                                                                         |
| `per_device_train_batch_size=2` | Train with 2 samples per GPU (or CPU if no GPU).                                                                                      |
| `gradient_accumulation_steps=4` | Accumulate gradients over 4 batches before each optimizer update. Effectively simulates a larger batch size of `2 × 4 = 8`.           |
| `optim="paged_adamw_32bit"`     | Optimizer used — this is a **memory-efficient variant of AdamW**, used with quantized models (4-bit/8-bit).                           |
| `learning_rate=2e-4`            | Initial learning rate for the optimizer.                                                                                              |
| `lr_scheduler_type="cosine"`    | Use a **cosine annealing** schedule for learning rate — starts high and decays in a cosine curve.                                     |
| `num_train_epochs=1`            | Train the model for 1 full pass through the dataset.                                                                                  |
| `logging_steps=10`              | Log training metrics (like loss) every 10 steps.                                                                                      |
| `fp16=True`                     | Use **mixed-precision (16-bit float)** training to reduce memory usage and speed up training on supported GPUs.                       |
| `gradient_checkpointing=True`   | Save memory by recomputing activations during backpropagation (trades memory for compute). Useful when training large models.         |
 '''
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True
)
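
The interaction between batch size, gradient accumulation, and dataset size determines how many optimizer updates one epoch contains. The following sketch works through that arithmetic for the settings above. It assumes a single GPU and approximates how the trainer rounds partial batches, so the real step count may differ slightly; `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(num_examples, per_device_batch, grad_accum, num_devices=1):
    """Approximate number of optimizer updates in one pass over the data."""
    batches = math.ceil(num_examples / (per_device_batch * num_devices))
    return max(batches // grad_accum, 1)


PER_DEVICE_BATCH = 2   # per_device_train_batch_size
GRAD_ACCUM = 4         # gradient_accumulation_steps
NUM_DEVICES = 1        # assumption: one GPU
NUM_EXAMPLES = 3_000   # examples selected from ultrachat_200k

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
updates = steps_per_epoch(NUM_EXAMPLES, PER_DEVICE_BATCH, GRAD_ACCUM, NUM_DEVICES)

print(f"effective batch size: {effective_batch}")
print(f"optimizer updates per epoch: {updates}")
```

With `logging_steps=10`, a loss value is logged once every 10 of these updates.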

## Training!

In [7]:
'''Imports SFTTrainer from the trl library — an extension of Hugging Face’s Trainer designed specifically for fine-tuning large language models.
It's ideal for chat models, QLoRA, LoRA, and other efficient training methods.'''
from trl import SFTTrainer

# Set supervised fine-tuning parameters
'''| Parameter                   | Description                                                                                                                                                             |
| --------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `model=model`               | The model to fine-tune — already set up with LoRA/QLoRA adapters.                                                                                                       |
| `train_dataset=dataset`     | The training dataset — should contain a `"text"` field with formatted prompts.                                                                                          |
| `dataset_text_field="text"` | Tells `SFTTrainer` to use the `"text"` field as the input for training.                                                                                                 |
| `tokenizer=tokenizer`       | The tokenizer for tokenizing text prompts.                                                                                                                              |
| `args=training_arguments`   | Training hyperparameters (from `TrainingArguments`).                                                                                                                    |
| `max_seq_length=512`        | Maximum sequence length — inputs longer than this will be truncated.                                                                                                    |
| `peft_config=peft_config`   | PEFT config for QLoRA. This enables adapter-based fine-tuning. <br> 🔹 **If omitted** and the model is not already wrapped with adapters, the trainer performs **full fine-tuning**, which is slower and uses more memory. |
 '''
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    max_seq_length=512,

    # Leave this out for regular SFT
    peft_config=peft_config,
)

# Train model
trainer.train()

# Save QLoRA weights
'''Saves only the LoRA adapter weights (not the full model) to the directory "TinyLlama-1.1B-qlora".
This is much smaller than saving the full model and can later be loaded using PEFT.'''
trainer.model.save_pretrained("TinyLlama-1.1B-qlora")


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
Map: 100%|█████████████████████████████████████████████████████████████████| 3000/3000 [00:01<00:00, 2278.40 examples/s]


| Step | Training Loss |
| ---- | ------------- |
| 10   | 1.6709        |
| 20   | 1.4759        |
| 30   | 1.4509        |
| 40   | 1.488         |
| 50   | 1.4778        |
| 60   | 1.3904        |
| 70   | 1.4949        |
| 80   | 1.45          |
| 90   | 1.4276        |
| 100  | 1.4044        |




### Merge Adapter

In [8]:
'''Imports a PEFT-aware model class that can automatically detect and load:
the base model and any attached LoRA adapters (used for QLoRA fine-tuning).
This class is a wrapper over AutoModelForCausalLM, specialized for PEFT.'''
from peft import AutoPeftModelForCausalLM

'''| Argument                 | Description                                                      |
| ------------------------ | ---------------------------------------------------------------- |
| `"TinyLlama-1.1B-qlora"` | Folder containing the adapter weights (from `save_pretrained`).  |
| `low_cpu_mem_usage=True` | Optimizes memory usage while loading (helpful for large models). |
| `device_map="auto"`      | Automatically places model components on available GPUs or CPU.  |
 '''

model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
)

# Merge LoRA and base model
'''This merges the LoRA adapter weights into the base model weights.
After this, the model no longer depends on PEFT or LoRA:
It becomes a standard Hugging Face model, ready for deployment or inference.
All parameters are stored together — no adapters.'''
merged_model = model.merge_and_unload()
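
Conceptually, merging folds the adapter into the frozen weight: `W' = W + (alpha / r) · B · A`. After that, a single matrix multiply gives the same output as the base layer plus the adapter path. The sketch below checks this on tiny hand-written matrices in plain Python. It illustrates the idea only and is not how PEFT implements `merge_and_unload()`; all matrix values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def transpose(m):
    return [list(col) for col in zip(*m)]


# Toy shapes: d_out=2, d_in=3, rank r=1 (values are arbitrary)
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight (d_out x d_in)
A = [[0.5, -1.0, 0.25]]                  # LoRA A (r x d_in)
B = [[2.0], [-4.0]]                      # LoRA B (d_out x r)
scaling = 32 / 64                        # lora_alpha / r from the config above

x = [[1.0, 2.0, 3.0]]  # a single input row

# Adapter kept separate: y = x W^T + scaling * (x A^T) B^T
base_out = matmul(x, transpose(W))
adapter_out = matmul(matmul(x, transpose(A)), transpose(B))
y_adapter = add(base_out, adapter_out, scaling)

# Adapter merged into the weight: W' = W + scaling * B A
W_merged = add(W, matmul(B, A), scaling)
y_merged = matmul(x, transpose(W_merged))

print("separate adapter:", y_adapter)
print("merged weight:   ", y_merged)
```

Both paths produce the same output, which is why the merged model no longer needs the adapter layers at inference time.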

### Inference

In [9]:
'''Imports Hugging Face’s high-level pipeline API, which simplifies running inference for tasks like:
Text generation, Sentiment analysis, Translation, Summarization, etc.'''
from transformers import pipeline

# Use our predefined prompt template
'''This format matches the chat template that was used to format the SFT training data, which helps the model produce coherent responses.'''
prompt = """<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""

# Run our instruction-tuned model
'''This pipeline internally: Tokenizes the prompt, Runs it through the model, Decodes the output back into readable text'''
pipe = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])

<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
Large Language Models (LLMs) are a type of artificial intelligence (AI) that can generate human-like language. They are trained on large amounts of data, including text, audio, and video, and are capable of generating complex and nuanced language.

LLMs are used in a variety of applications, including natural language processing (NLP), machine translation, and chatbots. They can be used to generate text, speech, or images, and can be trained to understand different languages and dialects.

One of the most significant applications of LLMs is in the field of natural language generation (NLG). LLMs can be used to generate text in a variety of languages, including English, French, and German. They can also be used to generate speech, such as in chatbots or voice assistants.

LLMs have also been used in the field of machine translation (MT). LLMs can be trained to translate between different languages, and can be used
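
As the output above shows, the generated text includes the prompt. When only the assistant's reply is needed, the prompt can be stripped with plain string handling. The sketch below shows one way to do that; `extract_reply` is a hypothetical helper and the strings are made-up examples rather than model output.

```python
def extract_reply(generated_text: str, prompt: str, eos_token: str = "</s>") -> str:
    """Return only the text generated after the prompt, cut at the first EOS token."""
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        reply = generated_text
    # Keep everything before the first end-of-sequence marker, if any
    return reply.split(eos_token)[0].strip()


prompt = "<|user|>\nHi</s>\n<|assistant|>\n"
generated = prompt + "Hello there!</s>\n<|user|>\nignored"

reply = extract_reply(generated, prompt)
print(reply)
```

The text-generation pipeline also accepts `return_full_text=False`, which returns only the newly generated text and makes the prefix-stripping step unnecessary.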

# Preference Tuning (PPO/DPO)

## Data Preprocessing

In [10]:
'''This code loads and formats a dataset for DPO (Direct Preference Optimization) training, using a TinyLlama-compatible prompt style.
Load a preference dataset (with ranked outputs: chosen vs rejected).
Format the data into prompt/response pairs that follow TinyLlama’s chat-style template (e.g., <|user|>, <|assistant|>, <|system|>).
Filter out low-quality or irrelevant data.
Prepare it for DPO fine-tuning, where the model is trained to prefer "better" responses.'''

'''Takes a single example from the dataset. Formats it using TinyLlama's chat structure, with special tags:
<|system|>: Optional system-level instruction (e.g., "You are a helpful assistant.")
<|user|>: User's question or command.
<|assistant|>: Marks where the model's answer should begin.
Adds </s> (end-of-sequence token) at the end of each section.'''
from datasets import load_dataset

def format_prompt(example):
    """Format a preference example with TinyLlama's <|system|>/<|user|>/<|assistant|> chat template."""

    # Format answers
    system = "<|system|>\n" + example['system'] + "</s>\n"
    prompt = "<|user|>\n" + example['input'] + "</s>\n<|assistant|>\n"
    chosen = example['chosen'] + "</s>\n"
    rejected = example['rejected'] + "</s>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

# Apply formatting to the dataset and select relatively short answers
'''Loads the "train" split of the argilla/distilabel-intel-orca-dpo-pairs dataset.
This dataset contains preference pairs: each sample includes a prompt and two completions — one labeled better (chosen) and one worse (rejected).'''
dpo_dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

'''Keeps only high-quality training examples:
status != "tie": Only include samples where a clear winner was chosen.
chosen_score >= 8: The chosen response must be rated 8 or higher (likely on a 1–10 scale).
not r["in_gsm8k_train"]: Excludes examples that overlap with GSM8K’s training data (to avoid data leakage).'''
dpo_dataset = dpo_dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)
'''Applies the format_prompt() function to every row in the dataset.
Removes all original columns and replaces them with:
"prompt" (system + user prompt)
"chosen" (preferred response)
"rejected" (non-preferred response)'''
dpo_dataset = dpo_dataset.map(format_prompt, remove_columns=dpo_dataset.column_names)

'''After these steps, you have a cleaned, formatted dataset ready for DPO training, where each sample has:
A prompt (in TinyLlama format)
A high-quality (chosen) response
A lower-quality (rejected) response
This is the standard setup for training models to prefer better responses using ranking-based methods like DPO.'''
dpo_dataset

Generating train split: 100%|███████████████████████████████████████████| 12859/12859 [00:00<00:00, 29714.81 examples/s]
Filter: 100%|███████████████████████████████████████████████████████████| 12859/12859 [00:00<00:00, 24422.82 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████| 5922/5922 [00:00<00:00, 7520.06 examples/s]


Dataset({
    features: ['chosen', 'rejected', 'prompt'],
    num_rows: 5922
})
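
The filter and formatting steps can be traced on a few hand-written rows without downloading the dataset. The sketch below mirrors the same three filter conditions and the same string layout in plain Python. The rows are invented for illustration and only include the columns the code above reads; `keep` and `format_pair` are hypothetical stand-ins for the lambda and `format_prompt`.

```python
def keep(record):
    """Same three conditions as the dataset filter above."""
    return (
        record["status"] != "tie"
        and record["chosen_score"] >= 8
        and not record["in_gsm8k_train"]
    )


def format_pair(record):
    """Same string layout as format_prompt above."""
    system = "<|system|>\n" + record["system"] + "</s>\n"
    prompt = "<|user|>\n" + record["input"] + "</s>\n<|assistant|>\n"
    return {
        "prompt": system + prompt,
        "chosen": record["chosen"] + "</s>\n",
        "rejected": record["rejected"] + "</s>\n",
    }


# Invented rows for illustration only
base = {"system": "You are helpful.", "input": "What is 2+2?", "chosen": "4", "rejected": "5"}
toy_rows = [
    {**base, "status": "unchanged", "chosen_score": 9, "in_gsm8k_train": False},  # kept
    {**base, "status": "tie", "chosen_score": 9, "in_gsm8k_train": False},        # dropped: tie
    {**base, "status": "unchanged", "chosen_score": 6, "in_gsm8k_train": False},  # dropped: low score
    {**base, "status": "unchanged", "chosen_score": 9, "in_gsm8k_train": True},   # dropped: GSM8K overlap
]

kept = [format_pair(row) for row in toy_rows if keep(row)]
print(len(kept), "of", len(toy_rows), "rows kept")
print(kept[0]["prompt"] + kept[0]["chosen"])
```

Note that the prompt ends with `<|assistant|>\n`, so the chosen and rejected strings each continue the same prefix.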

## Models - Quantization

In [11]:
'''Loads a QLoRA-fine-tuned model with 4-bit quantization. Merges the LoRA adapter weights into the base model.
Loads and configures the corresponding tokenizer.'''
'''AutoPeftModelForCausalLM: A PEFT-aware model loader that can load base models along with LoRA/QLoRA adapters.
BitsAndBytesConfig: Used to configure 4-bit quantization for loading large models efficiently (QLoRA).
AutoTokenizer: Loads the tokenizer associated with the model.'''
from peft import AutoPeftModelForCausalLM
from transformers import BitsAndBytesConfig, AutoTokenizer

# 4-bit quantization configuration - Q in QLoRA
'''| Parameter                          | Description                                                            |
| ---------------------------------- | ---------------------------------------------------------------------- |
| `load_in_4bit=True`                | Load weights in 4-bit precision                                        |
| `bnb_4bit_quant_type="nf4"`        | Use **NF4** (4-bit NormalFloat); better accuracy vs. regular int4      |
| `bnb_4bit_compute_dtype="float16"` | Computations (e.g., matmuls) are done in 16-bit float                  |
| `bnb_4bit_use_double_quant=True`   | Uses nested quantization to reduce memory even further                 |
 '''
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use 4-bit precision model loading
    bnb_4bit_quant_type="nf4",  # Quantization type
    bnb_4bit_compute_dtype="float16",  # Compute dtype
    bnb_4bit_use_double_quant=True,  # Apply nested quantization
)

# Merge LoRA and base model
'''Loads the fine-tuned TinyLlama model with LoRA adapters from the directory "TinyLlama-1.1B-qlora".
Applies the bnb_config for 4-bit quantized loading. Uses device_map="auto" to automatically place layers on available GPU(s)/CPU.
low_cpu_mem_usage=True: Helps avoid memory overload on systems with limited RAM. ⚠️ At this point, the model is still composed of the base + LoRA adapter weights.'''
model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
    quantization_config=bnb_config,
)
'''Combines the base model and the fine-tuned LoRA weights into one model. Unloads any adapter-specific layers.
The result (merged_model) is now a standard model, with LoRA changes baked in. You no longer need PEFT or LoRA configs to use it for inference or export.'''
merged_model = model.merge_and_unload()

# Load LLaMA tokenizer
'''Loads the tokenizer that matches the original base model. trust_remote_code=False means you only trust standard Hugging Face tokenizers (for safety).
Sets: pad_token = "<PAD>": Required for left-padding input sequences. padding_side = "left": Important for generation tasks (used with attention masks when padding sequences).'''
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=False)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"



## Configuration

In [12]:
'''Imports from the PEFT (Parameter-Efficient Fine-Tuning) library. LoraConfig: Configuration for how LoRA adapters are applied.
prepare_model_for_kbit_training: Prepares a quantized model (e.g. 4-bit or 8-bit) to be fine-tuned.
get_peft_model: Injects LoRA layers into the model based on the config.'''
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# Prepare LoRA Configuration
'''LoRA works by freezing the base model and only training small rank-decomposed matrices inserted into key transformer layers.
r=64: Rank of the low-rank adapter matrices (a lower rank means fewer trainable parameters).
target_modules: These are specific linear projection layers in transformer blocks — typically part of the attention and MLP components.
This setup is optimized for models like LLaMA, GPT, etc.'''
peft_config = LoraConfig(
    lora_alpha=32,  # LoRA Scaling
    lora_dropout=0.1,  # Dropout for LoRA Layers
    r=64,  # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=  # Layers to target
     ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

# prepare model for training
'''Prepares a 4-bit (or 8-bit) quantized model for training by: Casting certain layers (like LayerNorm) to float32 for numerical stability.
Ensuring that only LoRA-injected layers will be trainable. Freezing all other base model weights.
🔒 This is essential when using quantized models like QLoRA — because you cannot train 4-bit weights directly.'''
model = prepare_model_for_kbit_training(model)
'''Adds LoRA adapters into the model based on the peft_config.
After this, model: Is ready for training using standard techniques (Trainer, SFTTrainer, etc.)
Will only update the LoRA adapter weights during training (keeping the base model frozen)
Has a small memory footprint, even on consumer GPUs'''
model = get_peft_model(model, peft_config)

In [13]:
'''This code sets up training hyperparameters for Direct Preference Optimization (DPO) using the trl library (Transformer Reinforcement Learning, by Hugging Face),
specifically via the DPOConfig class. DPOConfig is a configuration class for training with DPO, a method used to align LLMs with human preferences
by teaching the model to prefer better responses (without needing a separate reward model, as RLHF does).'''
from trl import DPOConfig

output_dir = "./results"

# Training arguments
'''| Argument                        | Description                                                                                        |
| ------------------------------- | -------------------------------------------------------------------------------------------------- |
| `output_dir="./results"`        | Where the model checkpoints and logs will be saved.                                                |
| `per_device_train_batch_size=2` | Number of examples per GPU per step.                                                               |
| `gradient_accumulation_steps=4` | Accumulates gradients over 4 batches before each optimizer update; simulates a batch size of `2 × 4 = 8`. |
| `optim="paged_adamw_32bit"`     | Uses a memory-efficient version of the AdamW optimizer, suitable for quantized models.             |
| `learning_rate=1e-5`            | The base learning rate for training.                                                               |
| `lr_scheduler_type="cosine"`    | Applies a **cosine decay** to the learning rate over time.                                         |
| `max_steps=200`                 | Train for 200 update steps (not epochs).                                                           |
| `logging_steps=10`              | Log metrics every 10 steps.                                                                        |
| `fp16=True`                     | Use 16-bit floating-point precision (faster, less memory).                                         |
| `gradient_checkpointing=True`   | Saves memory by recomputing some activations during backpropagation.                               |
| `warmup_ratio=0.1`              | 10% of training steps are used to gradually ramp up the learning rate from zero.                   |
 '''
training_arguments = DPOConfig(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    max_steps=200,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True,
    warmup_ratio=0.1
)
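
The combination of `warmup_ratio=0.1`, `lr_scheduler_type="cosine"`, and `max_steps=200` describes a learning rate that ramps up linearly for the first 20 steps and then decays along a half cosine to zero. The sketch below computes that curve directly. It is a hand-written approximation of a cosine schedule with linear warmup, not the trainer's own scheduler code, and `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step, base_lr=1e-5, max_steps=200, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then half-cosine decay to zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 10, 20, 110, 200):
    print(f"step {step:3d}: lr = {lr_at(step):.2e}")
```

The peak learning rate is reached at the end of warmup (step 20 here) and is half of its peak value midway through the decay phase.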

In [14]:
'''This code sets up and runs Direct Preference Optimization (DPO) fine-tuning on a QLoRA-prepared TinyLlama model using Hugging Face's trl library — specifically, the DPOTrainer.'''
from trl import DPOTrainer

# Create DPO trainer
'''| Argument                    | Description                                                                                                                                              |
| --------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `model`                     | The TinyLlama model prepared with **QLoRA adapters**                                                                                                     |
| `args=training_arguments`   | The `DPOConfig` settings defined earlier (batch size, learning rate, etc.)                                                                               |
| `train_dataset=dpo_dataset` | A dataset of preference pairs (`prompt`, `chosen`, `rejected`)                                                                                           |
| `tokenizer=tokenizer`       | Tokenizer that matches the model (set up earlier with TinyLlama)                                                                                         |
| `peft_config=peft_config`   | The LoRA configuration (defines where and how to train adapters)                                                                                         |
| `beta=0.1`                  | Temperature of the DPO loss: controls how far the trained model may drift from the reference model (higher = stays closer to the reference, lower = fits the preference pairs more freely) |
| `max_prompt_length=512`     | Maximum length for the **prompt** part (input before the model response starts)                                                                          |
| `max_length=512`            | Maximum total input length (prompt + response) during training                                                                                           |
 '''
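# Note: beta, max_prompt_length, and max_length are passed to DPOTrainer directly here.
# Recent TRL releases expect them on DPOConfig instead, which is what the
# deprecation warning in this cell's output refers to.
dpo_trainer = DPOTrainer(
    model,
    args=training_arguments,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=512,
    max_length=512,
)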

# Fine-tune model with DPO
'''Starts DPO fine-tuning. For each example, the model sees a prompt, a chosen response, and a rejected response.
It learns to raise the log-likelihood of the chosen response relative to the rejected one, measured against a frozen reference model, using a logistic (contrastive) loss scaled by the beta parameter.
✅ Because no separate reward model or RL loop is needed, DPO is simpler and typically more stable to train than reward-model-based RLHF methods.'''
dpo_trainer.train()

# Save adapter
'''Saves only the LoRA adapter weights (not the full base model) to the directory "TinyLlama-1.1B-dpo-qlora".
You can later:
- Reload it with AutoPeftModelForCausalLM.from_pretrained(...)
- Merge it into the base model for deployment with model.merge_and_unload()'''
dpo_trainer.model.save_pretrained("TinyLlama-1.1B-dpo-qlora")


Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.
Map: 100%|██████████████████████████████████████████████████████████████████| 5922/5922 [00:18<00:00, 325.53 examples/s]
max_steps is given, it will override any value given in num_train_epochs
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss |
| ---- | ------------- |
| 10   | 0.6922        |
| 20   | 0.6781        |
| 30   | 0.6460        |
| 40   | 0.6066        |
| 50   | 0.5949        |
| 60   | 0.6171        |
| 70   | 0.5933        |
| 80   | 0.5316        |
| 90   | 0.5596        |
| 100  | 0.6388        |
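
The first logged loss (0.6922) sits close to ln 2 ≈ 0.6931, which is the value the DPO objective takes when the trained model and the reference model assign identical log-probabilities. The sketch below shows why for a single preference pair. It assumes the standard sigmoid form of the DPO loss; `dpo_loss` is a hypothetical helper, not a TRL function, and the log-probabilities are made-up numbers chosen only for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair.

    Each argument is the summed log-probability of a whole response under
    the policy (trainable) model or the reference (frozen) model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical log-probabilities, for illustration only.
start = dpo_loss(-40.0, -42.0, -40.0, -42.0)  # policy identical to reference
later = dpo_loss(-38.0, -45.0, -40.0, -42.0)  # policy now favours the chosen answer

print(f"loss when policy == reference: {start:.4f}")
print(f"loss after favouring chosen:   {later:.4f}")
```

The loss falls below ln 2 only when the policy widens the gap between chosen and rejected responses beyond what the reference model already shows, and `beta` scales how strongly that gap is rewarded.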


In [15]:
'''This cell merges both LoRA adapters into a single base model, a common final step after sequential fine-tuning:
1. Supervised Fine-Tuning (SFT) with LoRA, saved as "TinyLlama-1.1B-qlora"
2. Direct Preference Optimization (DPO) with LoRA, saved as "TinyLlama-1.1B-dpo-qlora"'''
from peft import AutoPeftModelForCausalLM, PeftModel

# Merge LoRA and base model
'''Loads the SFT checkpoint "TinyLlama-1.1B-qlora" with AutoPeftModelForCausalLM, which automatically loads both the base model and the LoRA weights.
merge_and_unload() then merges the SFT LoRA weights into the base model and returns a standard model (sft_model) without LoRA layers.
✅ sft_model is now a standalone model with the SFT updates folded into its weights.'''
model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
)
sft_model = model.merge_and_unload()

# Merge DPO LoRA and SFT model
'''PeftModel.from_pretrained(...) takes:
- sft_model: the base model already merged with the SFT LoRA weights
- "TinyLlama-1.1B-dpo-qlora": the DPO-specific LoRA adapter weights
This attaches the DPO LoRA adapters on top of sft_model.
merge_and_unload() then merges the DPO LoRA weights into sft_model, giving the final model, which combines:
- Base model weights
- SFT LoRA modifications
- DPO LoRA modifications
✅ dpo_model is now the final merged model, trained through both SFT and DPO, and no longer needs LoRA adapters to function.'''
dpo_model = PeftModel.from_pretrained(
    sft_model,
    "TinyLlama-1.1B-dpo-qlora",
    device_map="auto",
)
dpo_model = dpo_model.merge_and_unload()
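
Conceptually, merging folds each adapter's low-rank update into the weight matrix it wraps, so the two merges above apply two successive additive updates. The sketch below illustrates this on a toy 2x2 layer. It assumes the standard LoRA scaling of `alpha / rank`; `matmul` and `merge_lora` are hypothetical helpers rather than PEFT internals, and all matrix values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with rank-1 adapters (illustrative values only).
# A has shape (rank x in_features), B has shape (out_features x rank).
base = [[1.0, 0.0], [0.0, 1.0]]
sft_a, sft_b = [[1.0, 2.0]], [[0.5], [0.0]]
dpo_a, dpo_b = [[0.0, 1.0]], [[0.0], [0.25]]

# Stage 1: fold the SFT adapter into the base weight.
sft_weight = merge_lora(base, sft_a, sft_b, alpha=2, rank=1)
# Stage 2: fold the DPO adapter into the already-merged SFT weight.
final_weight = merge_lora(sft_weight, dpo_a, dpo_b, alpha=2, rank=1)

print("after SFT merge:", sft_weight)
print("after DPO merge:", final_weight)
```

The order matters in the real pipeline because the DPO adapter was trained on top of the SFT-updated weights, so it should be merged into `sft_model` rather than into the untouched base model.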



In [16]:
'''This cell runs inference with the fine-tuned model, which was instruction-tuned via SFT and then DPO.
The prompt uses the chat format TinyLlama expects, with chat-style tokens such as <|user|> and <|assistant|>.'''
from transformers import pipeline

# Use our predefined prompt template
prompt = """<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""

# Run our instruction-tuned model
'''Sets up a text-generation pipeline using:
- dpo_model: the final model (merged with both SFT and DPO LoRA weights)
- tokenizer: the tokenizer that matches TinyLlama
✅ At this point, you're ready to generate text with the aligned model.'''
pipe = pipeline(task="text-generation", model=dpo_model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])

<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
Large Language Models (LLMs) are a type of artificial intelligence (AI) that can generate human-like language. They are trained on large amounts of data, including text, audio, and video, and are capable of generating complex and nuanced language.

LLMs are used in a variety of applications, including natural language processing (NLP), machine translation, and chatbots. They can be used to generate text, speech, or images, and can be trained to understand different languages and dialects.

One of the most significant applications of LLMs is in the field of natural language generation (NLG). LLMs can be used to generate text in a variety of languages, including English, French, and German. They can also be used to generate speech, such as in chatbots or voice assistants.

LLMs have also been used in the field of machine translation (MT). LLMs can be trained to translate between different languages, and can be used