# Alignment Using Preference Dataset

## Overview
This notebook shows how to fine-tune a language model using **Supervised Fine-Tuning (SFT)**.

**What you'll learn:**
- Create a dataset with multiple outputs per instruction (some good, some bad)
- Train the model using ONLY the good (accepted) examples
- Test the aligned model on new questions

**Key concept:** The model learns "what good responses look like" by training on accepted examples only. Bad examples are stored for future use with DPO (a more advanced technique).

## Step 1: Install Libraries

Install packages needed for model training, LoRA fine-tuning, and quantization.

In [None]:
!pip install transformers datasets torch accelerate peft bitsandbytes trl -q

print("✓ Libraries installed")

✓ Libraries installed


## Step 2: Import Libraries

Import everything we need for model loading, training, and fine-tuning.

In [None]:
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training
)
from trl import SFTTrainer

### Check GPU Availability

Verify that we have access to a GPU for faster training.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
print(f"CUDA Available: {torch.cuda.is_available()}")
print(f"Device: {device}")

GPU: Tesla T4
CUDA Available: True
Device: cuda


## Step 3: Create Preference Dataset

**Format:** Each instruction has 3-4 possible responses. Each response is labeled:
- **"accepted"** = Good, high-quality response
- **"rejected"** = Poor or unhelpful response

**Important:** SFT training uses ONLY accepted examples. Rejected ones are saved for DPO (a different training method that learns from comparisons).

In [None]:
preference_data = [
    {
        "instruction": "Explain photosynthesis to a 10-year-old.",
        "outputs": [
            {"text": "Photosynthesis is how plants make their own food! Plants use sunlight, water from the soil, and air to create energy. The green color in leaves (called chlorophyll) helps catch the sunlight. It's like plants are cooking their own meal using sunshine as the heat source!", "label": "accepted"},
            {"text": "Photosynthesis is a biological process involving light-dependent and light-independent reactions where chloroplasts convert electromagnetic radiation into chemical energy through the Calvin cycle.", "label": "rejected"},
            {"text": "Plants eat sunlight.", "label": "rejected"},
            {"text": "Think of plants as having tiny solar panels in their leaves! They use sunlight, water, and carbon dioxide from air to make sugar (their food) and release oxygen that we breathe. The green chlorophyll is like the solar panel that captures the sun's energy.", "label": "accepted"}
        ]
    },
    {
        "instruction": "Write a professional email declining a job offer.",
        "outputs": [
            {"text": "Dear [Hiring Manager],\n\nThank you so much for offering me the [Position] role at [Company]. I truly appreciate the time you and your team invested in the interview process.\n\nAfter careful consideration, I have decided to decline the offer as I've accepted a position that better aligns with my career goals.\n\nI was impressed by [Company] and hope our paths cross again in the future.\n\nBest regards,\n[Your Name]", "label": "accepted"},
            {"text": "No thanks, I found something better.", "label": "rejected"},
            {"text": "I'm not interested in working for your company anymore. Thanks anyway.", "label": "rejected"},
            {"text": "Dear [Hiring Manager],\n\nI want to express my gratitude for the [Position] offer at [Company]. The opportunity was compelling and the interview process was very positive.\n\nHowever, after thoughtful consideration, I've decided to pursue another opportunity that aligns more closely with my long-term career objectives.\n\nI hold [Company] in high regard and would welcome the chance to connect in the future.\n\nSincerely,\n[Your Name]", "label": "accepted"}
        ]
    },
    {
        "instruction": "Explain the difference between machine learning and deep learning.",
        "outputs": [
            {"text": "Machine Learning (ML) is a broad field where computers learn patterns from data. Deep Learning (DL) is a subset of ML that uses neural networks with many layers (hence 'deep'). Think of ML as a toolbox with many tools, and DL as one powerful tool in that box. DL excels at complex tasks like image recognition and language understanding.", "label": "accepted"},
            {"text": "They're basically the same thing.", "label": "rejected"},
            {"text": "Deep learning uses more GPUs than machine learning.", "label": "rejected"},
            {"text": "Machine Learning encompasses various algorithms that learn from data, including decision trees, SVMs, and neural networks. Deep Learning specifically refers to neural networks with multiple hidden layers (2+), enabling hierarchical feature learning. While ML may require manual feature engineering, DL automatically learns features from raw data.", "label": "accepted"}
        ]
    },
    {
        "instruction": "Give three tips for effective time management.",
        "outputs": [
            {"text": "1. **Prioritize tasks**: Use the Eisenhower Matrix to separate urgent vs. important tasks. Focus on what matters most.\n\n2. **Time blocking**: Dedicate specific time slots to specific activities. This reduces decision fatigue and increases focus.\n\n3. **Eliminate distractions**: Turn off notifications during deep work sessions. Your phone and social media will still be there later.", "label": "accepted"},
            {"text": "Work harder, sleep less, multitask always.", "label": "rejected"},
            {"text": "Just make a to-do list.", "label": "rejected"},
            {"text": "1. **Set clear priorities**: Identify your most important tasks each day using methods like the 80/20 rule. Complete high-impact work first.\n\n2. **Use time blocking**: Schedule specific blocks for focused work, meetings, and breaks. Treat these blocks as non-negotiable appointments.\n\n3. **Minimize context switching**: Group similar tasks together and eliminate interruptions during deep work periods.", "label": "accepted"}
        ]
    },
    {
        "instruction": "What should I do if my laptop won't turn on?",
        "outputs": [
            {"text": "Try these troubleshooting steps:\n\n1. Check if it's charged: Plug in the charger and wait 15 minutes, then try again.\n\n2. Hard reset: Hold the power button for 30 seconds, release, then press it normally.\n\n3. Remove peripherals: Disconnect all USB devices, external monitors, etc., then try starting.\n\n4. If still not working, the battery or power adapter might be faulty. Consider taking it to a technician.", "label": "accepted"},
            {"text": "Buy a new laptop.", "label": "rejected"},
            {"text": "Have you tried turning it off and on again?", "label": "rejected"},
            {"text": "First, ensure the power adapter is connected and the outlet works. Wait 15-20 minutes for charging. Next, perform a hard reset by holding the power button for 30 seconds. Remove all external devices and try booting. If unsuccessful, test with a different charger if available, or consult a repair technician.", "label": "accepted"}
        ]
    },
    {
        "instruction": "How do I start learning Python programming?",
        "outputs": [
            {"text": "Here's a beginner-friendly roadmap:\n\n1. **Install Python**: Download from python.org or use Google Colab (online, free).\n\n2. **Learn basics**: Variables, data types, loops, and functions. Use free resources like Python.org tutorials or freeCodeCamp.\n\n3. **Practice daily**: Solve problems on sites like HackerRank or LeetCode (easy level).\n\n4. **Build projects**: Start small (calculator, to-do list) then progress to larger projects.\n\n5. **Join communities**: Reddit's r/learnpython and Python Discord are great for questions.", "label": "accepted"},
            {"text": "Just watch YouTube videos.", "label": "rejected"},
            {"text": "Python is easy, you don't need to study much.", "label": "rejected"},
            {"text": "Begin by setting up Python (python.org or Anaconda). Master fundamentals: variables, control flow, functions, and data structures. Practice through coding challenges on platforms like Codewars or HackerRank. Build simple projects (web scraper, automation script) to apply knowledge. Engage with learning communities and read documentation regularly.", "label": "accepted"}
        ]
    },
    {
        "instruction": "Summarize the plot of Romeo and Juliet in 2 sentences.",
        "outputs": [
            {"text": "Romeo and Juliet are young lovers from two feuding families in Verona, Italy. Despite their families' hatred, they secretly marry, but a series of misunderstandings leads to both their tragic deaths, which finally reconciles the families.", "label": "accepted"},
            {"text": "It's a love story where everyone dies.", "label": "rejected"},
            {"text": "Romeo and Juliet fall in love and then there's some drama.", "label": "rejected"},
            {"text": "Two teenagers from rival families in Verona fall in love and marry secretly. A tragic chain of events involving miscommunication and impulsive decisions results in both their deaths, ultimately ending the family feud.", "label": "accepted"}
        ]
    },
    {
        "instruction": "What are the health benefits of drinking water?",
        "outputs": [
            {"text": "Drinking adequate water provides numerous health benefits: It regulates body temperature, lubricates joints, helps deliver nutrients to cells, and keeps organs functioning properly. Water also improves skin health, aids digestion, helps with weight management, and prevents dehydration which can cause fatigue and headaches. Aim for 8 glasses (64 oz) per day, more if you're active.", "label": "accepted"},
            {"text": "Water is good for you.", "label": "rejected"},
            {"text": "It makes you not thirsty.", "label": "rejected"},
            {"text": "Water is essential for optimal health: it maintains body temperature, supports joint lubrication, facilitates nutrient transport, and ensures proper organ function. Additionally, adequate hydration improves cognitive function, supports metabolism, enhances skin elasticity, aids digestion, and prevents dehydration-related symptoms like headaches and fatigue. Target 8-10 glasses daily.", "label": "accepted"}
        ]
    },
]

print(f"✓ Created dataset with {len(preference_data)} instructions")
print(f"  Each instruction has multiple outputs (accepted/rejected)")
print(f"\nExample structure:")
print(f"Instruction: {preference_data[0]['instruction']}")
print(f"Number of outputs: {len(preference_data[0]['outputs'])}")
for i, output in enumerate(preference_data[0]['outputs']):
    print(f"  Output {i+1}: {output['label']} - {output['text'][:60]}...")

✓ Created dataset with 8 instructions
  Each instruction has multiple outputs (accepted/rejected)

Example structure:
Instruction: Explain photosynthesis to a 10-year-old.
Number of outputs: 4
  Output 1: accepted - Photosynthesis is how plants make their own food! Plants use...
  Output 2: rejected - Photosynthesis is a biological process involving light-depen...
  Output 3: rejected - Plants eat sunlight....
  Output 4: accepted - Think of plants as having tiny solar panels in their leaves!...


## Step 4: Extract Accepted Examples

Filter the dataset to keep only "accepted" responses for training. This is the key to SFT: we only train on high-quality examples.

In [None]:
# Extract only the accepted (good) examples
# Why? SFT learns by mimicking good examples. Bad examples would confuse it.
training_data = []
for item in preference_data:
    instruction = item["instruction"]
    for output in item["outputs"]:
        if output["label"] == "accepted":
            training_data.append({
                "instruction": instruction,
                "output": output["text"]
            })

# Repeat examples 5x to have more training data
training_data = training_data * 5

dataset = Dataset.from_list(training_data)

print(f"✓ Training dataset: {len(dataset)} examples")

print(f"  Using only 'accepted' outputs for training")
print(f"Output: {dataset[0]['output'][:100]}...")

print(f"\nSample:")
print(f"Instruction: {dataset[0]['instruction']}")

✓ Training dataset: 80 examples
  Using only 'accepted' outputs for training

Sample:
Instruction: Explain photosynthesis to a 10-year-old.
Output: Photosynthesis is how plants make their own food! Plants use sunlight, water from the soil, and air ...


## Step 5: Load Model

Load TinyLlama model with 4-bit quantization to save memory. Quantization reduces memory usage by approximately 75% while maintaining model quality.

In [None]:
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
max_seq_length = 2048

print(f"Loading {MODEL_NAME}...")

# Configure 4-bit quantization (reduces memory by ~75%)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

print("✓ Model loaded successfully")

Loading TinyLlama/TinyLlama-1.1B-Chat-v1.0...


config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

✓ Model loaded successfully


## Step 6: Setup Tokenizer

Load and configure the tokenizer. The tokenizer converts text to numbers that the model can understand.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.padding_side = "right"
tokenizer.truncation_side = "right"

# Set special tokens if missing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

if tokenizer.bos_token is None:
    tokenizer.add_special_tokens({'bos_token': '<s>'})

if tokenizer.eos_token is None:
    tokenizer.add_special_tokens({'eos_token': '</s>'})

print("✓ Tokenizer configured")

✓ Tokenizer configured


## Step 7: Add LoRA Adapters

**LoRA** = Efficient training method that only trains 1-2% of model parameters.

Benefits: Much faster, uses less memory, same quality as full training.

In [None]:
# Prepare model for efficient training
model = prepare_model_for_kbit_training(model)

# Configure LoRA settings
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # Scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters to the model
model = get_peft_model(model, lora_config)

# Show how many parameters we're actually training

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"  Trainable: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")

total_params = sum(p.numel() for p in model.parameters())
print(f"✓ LoRA adapters added")

✓ LoRA adapters added
  Trainable: 12,615,680 (2.01%)


## Step 8: Format Dataset

Convert data to chat format that the model understands. This applies the model's conversation template to each example.

In [13]:
# Format each example as a conversation
def format_prompts(examples):
    instructions = examples["instruction"]
    outputs = examples["output"]
    texts = []

    for instruction, output in zip(instructions, outputs):
        # Create user-assistant conversation
        messages = [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": output},
        ]
        # Apply model's chat template
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False
        )
        texts.append(text)

    return {"text": texts}

# Apply formatting to all examples

dataset = dataset.map(format_prompts, batched=True)print(f"Sample: {dataset[0]['text'][:150]}...")

print("✓ Dataset formatted")

Map:   0%|          | 0/80 [00:00<?, ? examples/s]

✓ Dataset formatted
Sample: <|user|>
Explain photosynthesis to a 10-year-old.</s>
<|assistant|>
Photosynthesis is how plants make their own food! Plants use sunlight, water from ...


## Step 9: Configure Training

Set up training parameters. We'll train for 3 epochs with a small batch size and use mixed precision for faster training.

In [20]:
# Training settings
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,  # Train for 3 passes through data
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,  # Use mixed precision for speed
    save_strategy="epoch",
    logging_steps=10,
    warmup_steps=10,
    optim="paged_adamw_8bit",
    report_to="none"
)

# Create trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer
)

print("✓ Trainer configured")
print(f"  Epochs: {training_args.num_train_epochs}")

print(f"  Batch size: {training_args.per_device_train_batch_size}")print(f"  Learning rate: {training_args.learning_rate}")

Adding EOS to train dataset:   0%|          | 0/80 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/80 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/80 [00:00<?, ? examples/s]

✓ Trainer configured
  Training for 3 epochs
  Batch size: 2
  Learning rate: 0.0002


## Step 10: Train the Model

Start training! The model learns from accepted examples only. This typically takes 10-15 minutes on a GPU.

In [21]:
print("Starting training...")
print(f"Training on {len(dataset)} examples")
print("This takes ~10-15 minutes on GPU\n")

# Run training
trainer.train()

print("\n✓ Training complete!")
print(f"  Model learned from {len(dataset)} accepted examples")
print(f"  Note: Rejected examples were NOT used (SFT only learns from good examples)")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Starting training...
Dataset size: 80 examples
This will take 10-15 minutes on GPU



  return fn(*args, **kwargs)


Step,Training Loss
10,1.5457
20,0.8483
30,0.3556


  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)



✓ Training complete!
  Model learned from 80 accepted examples
  Rejected examples were used to create contrast in dataset


## Step 11: Save Model

Save the trained LoRA adapters. The base model stays unchanged - only the small adapter weights are saved.

In [22]:
# Save LoRA adapters and tokenizer
model.save_pretrained("aligned_model")
tokenizer.save_pretrained("aligned_model")

print("✓ Model saved to 'aligned_model' folder")

✓ Model saved to 'aligned_model'
  Files: adapter_config.json, adapter_model.bin, tokenizer


## Step 12: Test the Model

Test the trained model on new questions to see if it learned to generate high-quality responses.

In [None]:
# Switch model to evaluation mode
model.eval()

# Function to generate responses
def generate_response(instruction, max_tokens=150):
    # Format instruction
    messages = [{"role": "user", "content": instruction}]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Tokenize
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_seq_length
    ).to(model.device)
    
    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    # Decode and extract assistant's response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if "assistant" in response:
        assistant_start = response.find("assistant") + len("assistant")
        return response[assistant_start:].strip()
    return response.strip()

# Test questions
test_instructions = [
    "How do I make a good first impression?",
    "Explain what DNA is to a child.",
    "Give tips for staying motivated."
]


print("Testing the trained model:\n")
print("-" * 80 + "\n")

for instruction in test_instructions:
    print(f"Q: {instruction}")
    response = generate_response(instruction)
    print(f"A: {response}\n")
    print("-" * 80 + "\n")

Testing aligned model:

Q: How do I make a good first impression?
A: |>
Here are 8 effective strategies:

1. Bring a handmade gift or treats.
2. Smile and say hello.
3. Prepare a quick introduction: "Hi, my name is [Name], I'm [Position] at [Company]."
4. Ask open-ended questions to learn more about the role.
5. Demonstrate technical skills through simple tasks.
6. Focus on common interests and common ground.
7. Practice active listening and respond appropriately.
8. Leave a positive impression with follow-up messages or thank-you notes.

--------------------------------------------------------------------------------

Q: Explain what DNA is to a child.
A: |>
Here are 8 effective strategies:

1. Bring a handmade gift or treats.
2. Smile and say hello.
3. Prepare a quick introduction: "Hi, my name is [Name], I'm [Position] at [Company]."
4. Ask open-ended questions to learn more about the role.
5. Demonstrate technical skills through simple tasks.
6. Focus on common interests and common

## Summary

**What We Did:**
1. Created a dataset with multiple responses per question (some accepted, some rejected)
2. Filtered to keep ONLY accepted (good) responses
3. Fine-tuned TinyLlama using LoRA (efficient training)
4. Tested the model on new questions

**Key Takeaways:**
- **SFT** (Supervised Fine-Tuning) learns from good examples only
- Rejected examples are NOT used in SFT training
- This approach teaches the model "what good responses look like"
- For learning from good vs bad comparisons, use DPO instead

**Next Steps:**
- Try DPO (Direct Preference Optimization) to learn from accepted/rejected pairs
- See `DPO_prefs.ipynb` for that approach