# Lab 3.1.4: Dataset Preparation for LLM Fine-Tuning

**Module:** 3.1 - Large Language Model Fine-Tuning  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐☆☆

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Convert raw data to Alpaca instruction format
- [ ] Implement ChatML and Llama chat templates
- [ ] Design effective system prompts
- [ ] Implement data cleaning and quality filtering
- [ ] Create proper train/validation/test splits
- [ ] Handle multi-turn conversations

---

## Prerequisites

- Basic Python data manipulation
- Understanding of JSON format
- Familiarity with pandas (helpful but not required)

---

## Real-World Context

### The Data Quality Principle

**"Garbage in, garbage out"** is especially true for LLM fine-tuning. The quality of your dataset directly determines:
- How well your model follows instructions
- The quality and style of responses
- Whether it learns your domain knowledge correctly

**Companies like OpenAI, Anthropic, and Google** spend enormous resources on data curation. Some estimates suggest data preparation takes 70-80% of the effort in successful fine-tuning projects.

---

## ELI5: What is Dataset Preparation?

> **Imagine you're teaching a new employee.** You wouldn't just dump random documents on their desk and expect them to learn your job.
>
> Instead, you'd:
> 1. **Organize** the training materials in a logical order
> 2. **Format** them consistently so they're easy to follow
> 3. **Remove** outdated or incorrect information
> 4. **Add** examples of exactly how you want tasks done
> 5. **Test** them periodically to ensure they're learning
>
> **Dataset preparation is exactly that** for your LLM - organizing and formatting examples so the model learns exactly what you want it to do.

---

## Part 1: Understanding Dataset Formats

There are several common formats for instruction-tuning datasets. Let's explore each one.

In [None]:
# Setup
import json
import re
import random
from typing import Dict, List, Optional, Union, Tuple
from dataclasses import dataclass, field
from pathlib import Path
import hashlib

# Set random seed for reproducibility
random.seed(42)

# NOTE: Reusable versions of all utilities in this notebook are available in:
# - scripts/dataset_utils.py (DataCleaner, ChatTemplateFormatter, DatasetConverter, etc.)
# You can import them with:
#   import sys; sys.path.insert(0, '..')
#   from scripts.dataset_utils import DataCleaner, ChatTemplateFormatter
# For learning purposes, we implement them from scratch below.

### 1.1 Alpaca Format

The **Alpaca format** is one of the most common formats for instruction-tuning. It was introduced by Stanford's Alpaca project.

In [None]:
# Alpaca format example
alpaca_example = {
    "instruction": "Write a function to calculate the factorial of a number.",
    "input": "",  # Optional: additional context
    "output": """def factorial(n):
    if n < 0:
        raise ValueError("Factorial is not defined for negative numbers")
    if n <= 1:
        return 1
    return n * factorial(n - 1)"""
}

alpaca_with_input = {
    "instruction": "Summarize the following text in one sentence.",
    "input": """The transformer architecture was introduced in 2017 in the paper 
    'Attention is All You Need'. It revolutionized natural language processing 
    by enabling parallel processing of sequences and capturing long-range 
    dependencies through self-attention mechanisms.""",
    "output": "The transformer architecture, introduced in 2017, transformed NLP by enabling parallel sequence processing and capturing long-range dependencies through self-attention."
}

print("Alpaca Format (without input):")
print(json.dumps(alpaca_example, indent=2))
print("\nAlpaca Format (with input):")
print(json.dumps(alpaca_with_input, indent=2))
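
An Alpaca record is only structured data; before training, it is usually rendered into a single prompt string. The sketch below shows one way to do that. The template wording follows the prompts published with the Stanford Alpaca project as best recalled here, so verify it against the original repository before relying on it. `render_alpaca_prompt` is a hypothetical helper name, not part of any library.

```python
# Prompt templates modeled on the Stanford Alpaca project (assumed wording;
# check the original repository for the exact text).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def render_alpaca_prompt(example: dict) -> str:
    """Render one Alpaca record into prompt text (hypothetical helper)."""
    instruction = example["instruction"]
    extra_input = example.get("input", "")
    if extra_input.strip():
        # Use the longer template only when an input is actually present
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=extra_input)
    return PROMPT_NO_INPUT.format(instruction=instruction)


record = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
print(render_alpaca_prompt(record) + record["output"])
```

The training text is the rendered prompt followed by the `output` field; at inference time the same prompt is used without the output.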

### 1.2 ShareGPT/Conversation Format

The **ShareGPT format** supports multi-turn conversations, which is essential for chat models.

In [None]:
# ShareGPT format example
sharegpt_example = {
    "conversations": [
        {"from": "system", "value": "You are a helpful coding assistant."},
        {"from": "human", "value": "How do I read a file in Python?"},
        {"from": "gpt", "value": """You can read a file in Python using the `open()` function:

```python
# Method 1: Using with statement (recommended)
with open('file.txt', 'r') as f:
    content = f.read()

# Method 2: Read line by line
with open('file.txt', 'r') as f:
    for line in f:
        print(line)
```

The `with` statement ensures the file is properly closed after reading."""},
        {"from": "human", "value": "What about reading a JSON file?"},
        {"from": "gpt", "value": """For JSON files, use the `json` module:

```python
import json

with open('data.json', 'r') as f:
    data = json.load(f)

# data is now a Python dictionary or list
print(data)
```

This automatically parses the JSON into Python data structures."""}
    ]
}

print("ShareGPT/Conversation Format:")
print(json.dumps(sharegpt_example, indent=2))
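
Multi-turn data is easy to get subtly wrong, for example two `human` turns in a row or a conversation that ends without a reply. The following is a minimal validation sketch for records shaped like `sharegpt_example`. The function name `validate_sharegpt` and the specific rules (optional leading system turn, strict human/gpt alternation) are assumptions for illustration; real datasets may use other role names.

```python
from typing import Dict, List


def validate_sharegpt(record: Dict) -> List[str]:
    """Return a list of problems found in one ShareGPT record (empty if OK)."""
    turns = record.get("conversations", [])
    if not turns:
        return ["no conversations"]

    problems = []
    # An optional system turn is only allowed in first position
    body = turns[1:] if turns[0].get("from") == "system" else turns
    if not body:
        problems.append("no human/gpt turns")

    for i, turn in enumerate(body):
        expected = "human" if i % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected '{expected}', got '{turn.get('from')}'")
        if not str(turn.get("value", "")).strip():
            problems.append(f"turn {i}: empty value")

    if body and body[-1].get("from") != "gpt":
        problems.append("conversation does not end with a gpt turn")
    return problems


good = {"conversations": [
    {"from": "system", "value": "You are helpful."},
    {"from": "human", "value": "Hi"},
    {"from": "gpt", "value": "Hello!"},
]}
bad = {"conversations": [
    {"from": "human", "value": "Hi"},
    {"from": "human", "value": "Anyone there?"},
]}
print(validate_sharegpt(good))
print(validate_sharegpt(bad))
```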

### 1.3 Chat Template Formats

Different models use different chat templates. Let's implement the most common ones.

In [None]:
@dataclass
class ChatMessage:
    """Represents a single message in a conversation."""
    role: str  # 'system', 'user', or 'assistant'
    content: str

@dataclass
class Conversation:
    """Represents a full conversation."""
    messages: List[ChatMessage] = field(default_factory=list)
    
    def add_message(self, role: str, content: str):
        self.messages.append(ChatMessage(role=role, content=content))


class ChatTemplateFormatter:
    """Format conversations for different model chat templates."""
    
    @staticmethod
    def to_chatml(conversation: Conversation) -> str:
        """
        Format conversation in ChatML format.
        Used by: OpenAI models, some open-source models
        """
        formatted = ""
        for msg in conversation.messages:
            formatted += f"<|im_start|>{msg.role}\n{msg.content}<|im_end|>\n"
        return formatted.strip()
    
    @staticmethod
    def to_llama3(conversation: Conversation) -> str:
        """
        Format conversation in Llama 3.1 format.
        Used by: Llama 3, Llama 3.1 models
        """
        formatted = "<|begin_of_text|>"
        for msg in conversation.messages:
            formatted += f"<|start_header_id|>{msg.role}<|end_header_id|>\n\n"
            formatted += f"{msg.content}<|eot_id|>"
        return formatted
    
    @staticmethod
    def to_llama2(conversation: Conversation) -> str:
        """
        Format conversation in Llama 2 format.
        Used by: Llama 2 models
        """
        formatted = ""
        system_msg = None
        
        for msg in conversation.messages:
            if msg.role == "system":
                system_msg = msg.content
            elif msg.role == "user":
                # Each user/assistant exchange is wrapped in its own <s> ... </s>,
                # so multi-turn conversations open a new <s> on every user turn
                formatted += "<s>[INST] "
                if system_msg:
                    # The system prompt is embedded in the first user turn only
                    formatted += f"<<SYS>>\n{system_msg}\n<</SYS>>\n\n"
                    system_msg = None
                formatted += f"{msg.content} [/INST]"
            elif msg.role == "assistant":
                formatted += f" {msg.content} </s>"
        
        return formatted
    
    @staticmethod
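    def to_mistral(conversation: Conversation) -> str:
        """
        Format conversation in Mistral/Mixtral format.
        Used by: Mistral, Mixtral models
        
        Note: this simple version skips system messages. If you need a
        system prompt, prepend it to the first user message yourself.
        """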
        formatted = "<s>"
        for msg in conversation.messages:
            if msg.role == "user":
                formatted += f"[INST] {msg.content} [/INST]"
            elif msg.role == "assistant":
                formatted += f" {msg.content}</s>"
        return formatted

In [None]:
# Demonstrate chat templates
conv = Conversation()
conv.add_message("system", "You are a helpful AI assistant.")
conv.add_message("user", "What is machine learning?")
conv.add_message("assistant", "Machine learning is a branch of AI that enables computers to learn patterns from data.")

print("="*60)
print("ChatML Format:")
print("="*60)
print(ChatTemplateFormatter.to_chatml(conv))

print("\n" + "="*60)
print("Llama 3.1 Format:")
print("="*60)
print(ChatTemplateFormatter.to_llama3(conv))

print("\n" + "="*60)
print("Llama 2 Format:")
print("="*60)
print(ChatTemplateFormatter.to_llama2(conv))
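
The templates above produce full training examples, including the assistant's reply. At inference time the prompt instead stops right after an opened assistant header so the model writes the reply itself. Here is a small self-contained sketch of that difference for ChatML, using plain dicts rather than the `Conversation` class. The flag name `add_generation_prompt` is borrowed from the parameter of the same name in Hugging Face's `apply_chat_template`; the function `chatml_prompt` itself is hypothetical.

```python
from typing import Dict, List


def chatml_prompt(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render messages as ChatML; optionally open an assistant turn for inference."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Leave the assistant turn open so the model generates the reply
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is machine learning?"},
]
print(chatml_prompt(messages, add_generation_prompt=True))
```

When a model ships with its own tokenizer chat template, that template is generally the safer source of truth than a hand-written formatter.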

---

## Part 2: Data Conversion Utilities

Let's create utilities to convert between different formats.

In [None]:
class DatasetConverter:
    """Convert between different dataset formats."""
    
    @staticmethod
    def alpaca_to_conversation(
        alpaca_data: Dict,
        system_prompt: str = "You are a helpful assistant."
    ) -> Conversation:
        """
        Convert Alpaca format to Conversation format.
        
        Args:
            alpaca_data: Dict with 'instruction', 'input' (optional), 'output'
            system_prompt: System message to prepend
        
        Returns:
            Conversation object
        """
        conv = Conversation()
        conv.add_message("system", system_prompt)
        
        # Combine instruction and input
        user_message = alpaca_data["instruction"]
        if alpaca_data.get("input", "").strip():
            user_message += f"\n\n{alpaca_data['input']}"
        
        conv.add_message("user", user_message)
        conv.add_message("assistant", alpaca_data["output"])
        
        return conv
    
    @staticmethod
    def sharegpt_to_conversation(sharegpt_data: Dict) -> Conversation:
        """
        Convert ShareGPT format to Conversation format.
        
        Args:
            sharegpt_data: Dict with 'conversations' list
        
        Returns:
            Conversation object
        """
        conv = Conversation()
        
        role_mapping = {
            "system": "system",
            "human": "user",
            "user": "user",
            "gpt": "assistant",
            "assistant": "assistant",
        }
        
        for turn in sharegpt_data["conversations"]:
            role = role_mapping.get(turn["from"], turn["from"])
            conv.add_message(role, turn["value"])
        
        return conv
    
    @staticmethod
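    def conversation_to_alpaca(conv: Conversation) -> List[Dict]:
        """
        Convert Conversation to Alpaca format.
        Multi-turn conversations become multiple examples, one per assistant
        reply. Each example keeps only the most recent user message, so
        earlier turns are not carried along as context.
        
        Returns:
            List of Alpaca-format dicts
        """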
        examples = []
        system_msg = ""
        current_instruction = ""
        
        for msg in conv.messages:
            if msg.role == "system":
                system_msg = msg.content
            elif msg.role == "user":
                current_instruction = msg.content
            elif msg.role == "assistant":
                examples.append({
                    "instruction": current_instruction,
                    "input": f"System: {system_msg}" if system_msg else "",
                    "output": msg.content
                })
        
        return examples

In [None]:
# Test conversion
converted_conv = DatasetConverter.alpaca_to_conversation(
    alpaca_example,
    system_prompt="You are an expert Python programmer."
)

print("Alpaca → Conversation → Llama 3.1 Format:")
print("="*60)
print(ChatTemplateFormatter.to_llama3(converted_conv))
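
One limitation worth knowing: `conversation_to_alpaca` pairs each assistant reply with only the latest user message, so a follow-up such as "What about reading a JSON file?" loses the turns that give it meaning. A possible alternative, sketched here with plain dicts, emits one example per assistant turn and keeps every earlier message as context. `expand_turns` and the `context`/`response` keys are hypothetical names chosen for this illustration.

```python
from typing import Dict, List


def expand_turns(messages: List[Dict[str, str]]) -> List[Dict]:
    """Create one training example per assistant turn, with full prior context."""
    examples = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            examples.append({
                "context": messages[:i],      # everything said before this reply
                "response": msg["content"],   # the reply the model should learn
            })
    return examples


chat = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
    {"role": "assistant", "content": "Use open() inside a with statement."},
    {"role": "user", "content": "What about reading a JSON file?"},
    {"role": "assistant", "content": "Use json.load() on the open file."},
]
for ex in expand_turns(chat):
    print(len(ex["context"]), "context messages ->", ex["response"])
```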

---

## Part 3: Data Cleaning and Quality Filtering

Clean data is essential for good fine-tuning results. Let's implement comprehensive cleaning.

In [None]:
class DataCleaner:
    """Clean and filter training data for quality."""
    
    def __init__(
        self,
        min_instruction_length: int = 10,
        max_instruction_length: int = 1000,
        min_output_length: int = 20,
        max_output_length: int = 4000,
        remove_duplicates: bool = True,
    ):
        self.min_instruction_length = min_instruction_length
        self.max_instruction_length = max_instruction_length
        self.min_output_length = min_output_length
        self.max_output_length = max_output_length
        self.remove_duplicates = remove_duplicates
        self.seen_hashes = set()
    
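    def clean_text(self, text: str) -> str:
        """Clean a single text string."""
        # Normalize line endings first so later patterns only see '\n'
        text = text.replace('\r\n', '\n').replace('\r', '\n')
        
        # Drop trailing spaces/tabs on each line
        text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
        
        # Collapse runs of spaces/tabs inside a line; leading indentation is
        # kept so code in instructions and outputs stays valid
        text = re.sub(r'(?<=\S)[ \t]+', ' ', text)
        
        # Collapse runs of blank lines into a single blank line
        text = re.sub(r'\n{3,}', '\n\n', text)
        
        # Remove leading/trailing whitespace
        return text.strip()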
    
    def is_valid_example(self, example: Dict) -> Tuple[bool, str]:
        """
        Check if an example meets quality criteria.
        
        Returns:
            Tuple of (is_valid, reason if invalid)
        """
        instruction = example.get("instruction", "")
        output = example.get("output", "")
        
        # Check instruction length
        if len(instruction) < self.min_instruction_length:
            return False, f"Instruction too short ({len(instruction)} chars)"
        if len(instruction) > self.max_instruction_length:
            return False, f"Instruction too long ({len(instruction)} chars)"
        
        # Check output length
        if len(output) < self.min_output_length:
            return False, f"Output too short ({len(output)} chars)"
        if len(output) > self.max_output_length:
            return False, f"Output too long ({len(output)} chars)"
        
        # Check for duplicate
        if self.remove_duplicates:
            content_hash = hashlib.md5(
                (instruction + output).encode()
            ).hexdigest()
            if content_hash in self.seen_hashes:
                return False, "Duplicate example"
            self.seen_hashes.add(content_hash)
        
        # Check for empty or placeholder content
        placeholder_patterns = [
            r'^TODO',
            r'^TBD',
            r'^\[.*\]$',
            r'^N/A$',
        ]
        for pattern in placeholder_patterns:
            if re.match(pattern, output, re.IGNORECASE):
                return False, "Placeholder content detected"
        
        return True, ""
    
    def clean_example(self, example: Dict) -> Dict:
        """Clean a single example."""
        return {
            "instruction": self.clean_text(example.get("instruction", "")),
            "input": self.clean_text(example.get("input", "")),
            "output": self.clean_text(example.get("output", "")),
        }
    
    def process_dataset(
        self,
        data: List[Dict],
        verbose: bool = True
    ) -> Tuple[List[Dict], Dict]:
        """
        Process and clean a full dataset.
        
        Returns:
            Tuple of (cleaned_data, statistics)
        """
        cleaned = []
        stats = {
            "total": len(data),
            "passed": 0,
            "failed": 0,
            "failure_reasons": {}
        }
        
        for example in data:
            cleaned_example = self.clean_example(example)
            is_valid, reason = self.is_valid_example(cleaned_example)
            
            if is_valid:
                cleaned.append(cleaned_example)
                stats["passed"] += 1
            else:
                stats["failed"] += 1
                stats["failure_reasons"][reason] = stats["failure_reasons"].get(reason, 0) + 1
        
        if verbose:
            total = max(stats["total"], 1)  # guard against division by zero on empty input
            print("Dataset Processing Results:")
            print(f"  Total examples: {stats['total']}")
            print(f"  Passed: {stats['passed']} ({100*stats['passed']/total:.1f}%)")
            print(f"  Failed: {stats['failed']} ({100*stats['failed']/total:.1f}%)")
            if stats["failure_reasons"]:
                print("  Failure reasons:")
                for reason, count in stats["failure_reasons"].items():
                    print(f"    - {reason}: {count}")
        
        return cleaned, stats

In [None]:
# Test with sample data including some bad examples
test_data = [
    # Good example
    {
        "instruction": "Explain the concept of overfitting in machine learning.",
        "input": "",
        "output": """Overfitting occurs when a machine learning model learns the training data too well, 
        including its noise and random fluctuations. The model becomes very accurate on training data 
        but performs poorly on new, unseen data. Common signs include low training error but high 
        validation error. Solutions include regularization, cross-validation, and increasing training data."""
    },
    # Too short instruction
    {
        "instruction": "Explain",
        "input": "",
        "output": "This is a valid response about explaining something."
    },
    # Too short output
    {
        "instruction": "What is deep learning?",
        "input": "",
        "output": "AI stuff"
    },
    # Placeholder
    {
        "instruction": "What is a neural network?",
        "input": "",
        "output": "TODO"
    },
    # Another good example
    {
        "instruction": "What is the difference between a list and a tuple in Python?",
        "input": "",
        "output": """Lists and tuples are both sequence types in Python, but they have key differences:
        1. Mutability: Lists are mutable (can be modified), tuples are immutable (cannot be changed after creation).
        2. Syntax: Lists use square brackets [], tuples use parentheses ().
        3. Performance: Tuples are slightly faster due to immutability.
        4. Use cases: Lists for collections that may change, tuples for fixed data like coordinates."""
    },
    # Duplicate of first example
    {
        "instruction": "Explain the concept of overfitting in machine learning.",
        "input": "",
        "output": """Overfitting occurs when a machine learning model learns the training data too well, 
        including its noise and random fluctuations. The model becomes very accurate on training data 
        but performs poorly on new, unseen data. Common signs include low training error but high 
        validation error. Solutions include regularization, cross-validation, and increasing training data."""
    },
]

cleaner = DataCleaner()
cleaned_data, stats = cleaner.process_dataset(test_data)
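
`DataCleaner` only catches exact duplicates, because it hashes the instruction and output as they are. Examples that differ only in letter case or spacing would still slip through. The sketch below shows one way to catch those by normalizing before hashing. The helper names `normalized_key` and `drop_near_duplicates` are hypothetical, and SHA-256 is used here simply as a stable fingerprint; this is not a fuzzy-matching method and will not catch paraphrases.

```python
import hashlib
import re
from typing import Dict, List


def normalized_key(example: Dict) -> str:
    """Fingerprint an example after lowercasing and collapsing whitespace."""
    fields = (example.get(k, "") for k in ("instruction", "input", "output"))
    # A separator character keeps ("ab", "c") distinct from ("a", "bc")
    normalized = "\x1f".join(re.sub(r"\s+", " ", f).strip().lower() for f in fields)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_near_duplicates(data: List[Dict]) -> List[Dict]:
    """Keep the first occurrence of each normalized example."""
    seen, kept = set(), []
    for example in data:
        key = normalized_key(example)
        if key not in seen:
            seen.add(key)
            kept.append(example)
    return kept


samples = [
    {"instruction": "What is overfitting?", "input": "", "output": "Memorizing noise."},
    {"instruction": "what is   OVERFITTING?", "input": "", "output": "memorizing noise."},
    {"instruction": "What is underfitting?", "input": "", "output": "Too simple a model."},
]
print(len(drop_near_duplicates(samples)))
```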

---

## Part 4: Train/Validation/Test Splits

Proper data splitting is crucial for evaluating your fine-tuned model.

In [None]:
class DatasetSplitter:
    """Split datasets into train/validation/test sets."""
    
    @staticmethod
    def split(
        data: List[Dict],
        train_ratio: float = 0.8,
        val_ratio: float = 0.1,
        test_ratio: float = 0.1,
        shuffle: bool = True,
        seed: int = 42
    ) -> Tuple[List[Dict], List[Dict], List[Dict]]:
        """
        Split data into train/val/test sets.
        
        Args:
            data: List of examples
            train_ratio: Fraction for training
            val_ratio: Fraction for validation
            test_ratio: Fraction for testing
            shuffle: Whether to shuffle before splitting
            seed: Random seed for reproducibility
        
        Returns:
            Tuple of (train, validation, test) lists
        """
        assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-6, \
            "Ratios must sum to 1.0"
        
        data = data.copy()
        if shuffle:
            # Use a local generator so the global random state is left untouched
            random.Random(seed).shuffle(data)
        
        n = len(data)
        train_end = int(n * train_ratio)
        val_end = train_end + int(n * val_ratio)
        
        train = data[:train_end]
        val = data[train_end:val_end]
        test = data[val_end:]
        
        return train, val, test
    
    @staticmethod
    def stratified_split(
        data: List[Dict],
        category_key: str,
        train_ratio: float = 0.8,
        val_ratio: float = 0.1,
        test_ratio: float = 0.1,
        seed: int = 42
    ) -> Tuple[List[Dict], List[Dict], List[Dict]]:
        """
        Stratified split preserving category distribution.
        
        Args:
            data: List of examples with category field
            category_key: Key in dict for category
        """
        # Group by category
        categories = {}
        for example in data:
            cat = example.get(category_key, "unknown")
            if cat not in categories:
                categories[cat] = []
            categories[cat].append(example)
        
        # Split each category and combine
        train, val, test = [], [], []
        
        for cat, examples in categories.items():
            t, v, te = DatasetSplitter.split(
                examples, train_ratio, val_ratio, test_ratio, seed=seed
            )
            train.extend(t)
            val.extend(v)
            test.extend(te)
        
        # Shuffle final sets with a local generator (global random state untouched)
        rng = random.Random(seed)
        rng.shuffle(train)
        rng.shuffle(val)
        rng.shuffle(test)
        
        return train, val, test

In [None]:
# Demo with larger synthetic dataset
synthetic_data = [
    {"instruction": f"Question {i}", "input": "", "output": f"Answer {i}"}
    for i in range(100)
]

train, val, test = DatasetSplitter.split(
    synthetic_data,
    train_ratio=0.8,
    val_ratio=0.1,
    test_ratio=0.1
)

print(f"Dataset Split:")
print(f"  Total: {len(synthetic_data)}")
print(f"  Train: {len(train)} ({100*len(train)/len(synthetic_data):.0f}%)")
print(f"  Validation: {len(val)} ({100*len(val)/len(synthetic_data):.0f}%)")
print(f"  Test: {len(test)} ({100*len(test)/len(synthetic_data):.0f}%)")
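
The split sizes come from truncating `n * ratio` to an integer, which works cleanly for 100 examples but can leave a split empty on small datasets. The sketch below mirrors the index arithmetic of `DatasetSplitter.split` in a standalone helper (`split_sizes` is a hypothetical name) so you can check sizes before committing to a ratio.

```python
from typing import Tuple


def split_sizes(n: int, train_ratio: float = 0.8, val_ratio: float = 0.1) -> Tuple[int, int, int]:
    """Return (train, val, test) sizes using the same truncation as split()."""
    train_end = int(n * train_ratio)
    val_end = train_end + int(n * val_ratio)
    # The test split receives whatever is left over after truncation
    return train_end, val_end - train_end, n - val_end


for n in (5, 10, 100):
    print(n, split_sizes(n))
```

With only 5 examples the validation split is empty, because `int(5 * 0.1)` is 0. For very small datasets, consider larger validation ratios or setting the split sizes explicitly.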

---

## Part 5: System Prompt Design

System prompts shape your model's behavior. Let's explore best practices.

In [None]:
class SystemPromptLibrary:
    """Library of system prompts for different use cases."""
    
    GENERAL_ASSISTANT = """You are a helpful, harmless, and honest AI assistant. \
You provide accurate information and admit when you're uncertain. \
You follow ethical guidelines and refuse harmful requests."""

    CODING_ASSISTANT = """You are an expert software engineer with deep knowledge of multiple programming languages. \
You write clean, efficient, well-documented code. You explain your reasoning and suggest best practices. \
When providing code, include comments and handle edge cases."""

    DATA_SCIENCE = """You are a data science expert with expertise in machine learning, statistics, and data analysis. \
You explain complex concepts clearly and provide practical, production-ready code examples. \
You consider scalability, performance, and best practices in your recommendations."""

    MEDICAL = """You are a medical information assistant. You provide general health information for educational purposes. \
IMPORTANT: You are not a doctor. Always recommend consulting healthcare professionals for medical decisions. \
Never provide specific diagnoses or treatment plans."""

    LEGAL = """You are a legal information assistant providing general legal information for educational purposes. \
IMPORTANT: You are not a lawyer. This is not legal advice. Always recommend consulting qualified attorneys for legal matters. \
Laws vary by jurisdiction - you explain general concepts without specific legal counsel."""

    CREATIVE_WRITING = """You are a creative writing assistant with expertise in storytelling, poetry, and various writing styles. \
You help with brainstorming, character development, plot structure, and prose improvement. \
You adapt your style to match the user's preferences and project requirements."""

    CUSTOMER_SUPPORT = """You are a friendly and professional customer support agent for [COMPANY_NAME]. \
You help customers with questions about products and services. \
You are patient, empathetic, and solution-oriented. \
For issues you cannot resolve, you escalate appropriately."""

    @classmethod
    def get_prompt(cls, category: str, **kwargs) -> str:
        """
        Get a system prompt by category, with optional customization.
        
        Args:
            category: Prompt category
            **kwargs: Placeholders to fill in (e.g., COMPANY_NAME)
        """
        prompt = getattr(cls, category.upper(), cls.GENERAL_ASSISTANT)
        
        # Replace placeholders
        for key, value in kwargs.items():
            prompt = prompt.replace(f"[{key.upper()}]", value)
        
        return prompt

In [None]:
# Demonstrate system prompts
print("System Prompt Examples:")
print("=" * 60)

for category in ["GENERAL_ASSISTANT", "CODING_ASSISTANT", "DATA_SCIENCE"]:
    print(f"\n{category}:")
    print("-" * 40)
    print(SystemPromptLibrary.get_prompt(category))

---

## Part 6: Complete Pipeline

Let's put everything together into a complete data preparation pipeline.

In [None]:
class DatasetPipeline:
    """
    Complete pipeline for preparing fine-tuning datasets.
    """
    
    def __init__(
        self,
        system_prompt: str = SystemPromptLibrary.GENERAL_ASSISTANT,
        chat_format: str = "llama3",
        train_ratio: float = 0.8,
        val_ratio: float = 0.1,
        test_ratio: float = 0.1,
    ):
        self.system_prompt = system_prompt
        self.chat_format = chat_format
        self.train_ratio = train_ratio
        self.val_ratio = val_ratio
        self.test_ratio = test_ratio
        
        self.cleaner = DataCleaner()
        self.converter = DatasetConverter()
        self.formatter = ChatTemplateFormatter()
        self.splitter = DatasetSplitter()
    
    def process(
        self,
        data: List[Dict],
        input_format: str = "alpaca",
        verbose: bool = True
    ) -> Dict[str, Union[List[str], Dict]]:
        """
        Process a dataset through the complete pipeline.
        
        Args:
            data: Raw data in specified format
            input_format: 'alpaca' or 'sharegpt'
            verbose: Print progress
        
        Returns:
            Dict with 'train', 'val', 'test' lists of formatted texts,
            plus 'statistics' from the cleaning step
        """
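        if verbose:
            print("Step 1: Cleaning data...")
        if input_format == "alpaca":
            cleaned, stats = self.cleaner.process_dataset(data, verbose=verbose)
        else:
            # DataCleaner only understands Alpaca fields, so ShareGPT records
            # are passed through unchanged rather than being filtered out
            cleaned = list(data)
            stats = {"total": len(data), "passed": len(data),
                     "failed": 0, "failure_reasons": {}}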
        
        if verbose:
            print(f"\nStep 2: Converting to conversations...")
        conversations = []
        for example in cleaned:
            if input_format == "alpaca":
                conv = self.converter.alpaca_to_conversation(
                    example, self.system_prompt
                )
            else:
                conv = self.converter.sharegpt_to_conversation(example)
            conversations.append(conv)
        
        if verbose:
            print(f"  Converted {len(conversations)} conversations")
        
        if verbose:
            print(f"\nStep 3: Formatting for {self.chat_format}...")
        format_fn = {
            "chatml": self.formatter.to_chatml,
            "llama3": self.formatter.to_llama3,
            "llama2": self.formatter.to_llama2,
            "mistral": self.formatter.to_mistral,
        }.get(self.chat_format, self.formatter.to_llama3)
        
        formatted = [format_fn(conv) for conv in conversations]
        
        if verbose:
            print(f"\nStep 4: Splitting dataset...")
        train, val, test = self.splitter.split(
            formatted,
            self.train_ratio,
            self.val_ratio,
            self.test_ratio
        )
        
        if verbose:
            print(f"  Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")
        
        return {
            "train": train,
            "val": val,
            "test": test,
            "statistics": stats
        }
    
    def save(
        self,
        processed_data: Dict,
        output_dir: str,
        format: str = "jsonl"
    ):
        """
        Save processed data to files.
        
        Args:
            processed_data: Output from process()
            output_dir: Directory to save files
            format: 'jsonl' or 'json'
        """
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)
        
        for split in ["train", "val", "test"]:
            data = processed_data[split]
            
            if format == "jsonl":
                filepath = output_path / f"{split}.jsonl"
                with open(filepath, "w") as f:
                    for text in data:
                        f.write(json.dumps({"text": text}) + "\n")
            else:
                filepath = output_path / f"{split}.json"
                with open(filepath, "w") as f:
                    json.dump([{"text": t} for t in data], f, indent=2)
            
            print(f"Saved {split} to {filepath}")

In [None]:
# Create sample dataset
sample_dataset = [
    {
        "instruction": "What is gradient descent?",
        "input": "",
        "output": "Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent, defined by the negative of the gradient. In machine learning, it's used to update model parameters to reduce the loss function."
    },
    {
        "instruction": "Explain the difference between supervised and unsupervised learning.",
        "input": "",
        "output": "Supervised learning uses labeled data where we know the correct outputs, training models to map inputs to known targets (like classification or regression). Unsupervised learning works with unlabeled data, finding hidden patterns or structures (like clustering or dimensionality reduction). Supervised learning has clear feedback; unsupervised discovers patterns independently."
    },
    {
        "instruction": "What is transfer learning?",
        "input": "",
        "output": "Transfer learning is a technique where a model trained on one task is reused as the starting point for a model on a different task. It leverages knowledge gained from solving one problem to help solve a related problem, reducing the need for large amounts of task-specific training data."
    },
    {
        "instruction": "How does dropout work as a regularization technique?",
        "input": "",
        "output": "Dropout randomly deactivates neurons during training with a specified probability (e.g., 0.5). This prevents neurons from co-adapting and forces the network to learn more robust features. During inference, all neurons are active but weights are scaled by the keep probability to maintain expected outputs."
    },
    {
        "instruction": "What is the purpose of batch normalization?",
        "input": "",
        "output": "Batch normalization normalizes layer inputs by subtracting the batch mean and dividing by batch standard deviation. It stabilizes training, allows higher learning rates, reduces sensitivity to initialization, and provides some regularization. It helps with the internal covariate shift problem in deep networks."
    },
]

# Run pipeline
pipeline = DatasetPipeline(
    system_prompt=SystemPromptLibrary.DATA_SCIENCE,
    chat_format="llama3"
)

processed = pipeline.process(sample_dataset)

In [None]:
# Preview formatted output
print("\nSample Formatted Training Example:")
print("="*60)
print(processed["train"][0])
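
Before handing files to a trainer, it is worth confirming that what you saved reads back unchanged, since the formatted text contains newlines and special tokens. This is a minimal round-trip sketch using the same `{"text": ...}` JSONL layout as `DatasetPipeline.save`. The helpers `write_jsonl` and `read_jsonl` are hypothetical, and a temporary directory stands in for your real output path.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def write_jsonl(texts: List[str], path: Path) -> None:
    """Write one JSON object per line, each with a single 'text' field."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")


def read_jsonl(path: Path) -> List[str]:
    """Read the 'text' field back from each non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]


texts = [
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>",
    "Second example with a \"quoted\" phrase",
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.jsonl"
    write_jsonl(texts, path)
    restored = read_jsonl(path)

print("Round trip OK:", restored == texts)
```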

---

## Try It Yourself: Exercises

### Exercise 1: Create a Domain-Specific Dataset

Create at least 20 examples for a specific domain (e.g., cooking, fitness, finance).

In [None]:
# Exercise 1: Your domain-specific dataset
your_dataset = [
    # Add your examples here
]

### Exercise 2: Implement Additional Quality Filters

Add filters for:
- Detecting and removing examples with toxic language
- Checking for balanced instruction/output ratios
- Verifying code examples are syntactically valid

In [None]:
# Exercise 2: Additional filters
# Your implementation here

---

## Common Mistakes

### Mistake 1: Wrong Chat Template

```python
# ❌ Wrong: Using Llama 2 format for Llama 3
text = "[INST] User message [/INST] Response"

# ✅ Right: Use correct format for model
text = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nUser message<|eot_id|>"
```

### Mistake 2: Inconsistent Data Quality

```python
# ❌ Wrong: Mixing quality levels
data = [
    {"instruction": "...", "output": "Comprehensive detailed response..."},
    {"instruction": "...", "output": "ok"},  # Too short!
]

# ✅ Right: Filter for consistent quality
cleaner = DataCleaner(min_output_length=50)
clean_data, _ = cleaner.process_dataset(data)
```

### Mistake 3: No Validation Set

```python
# ❌ Wrong: Using all data for training
train_data = all_data

# ✅ Right: Always hold out validation data
train, val, test = DatasetSplitter.split(all_data, 0.8, 0.1, 0.1)
```
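
Holding out data only helps if the splits do not overlap. When the same question appears in both training and evaluation data, validation scores look better than they should. Below is a rough check, assuming Alpaca-style dicts with an `instruction` field. `find_leaks` is a hypothetical helper, and it compares instructions only after lowercasing and collapsing whitespace, so paraphrased duplicates will not be detected.

```python
from typing import Dict, List, Set


def instruction_keys(split: List[Dict]) -> Set[str]:
    """Normalized instructions for one split (lowercase, single spaces)."""
    return {" ".join(ex["instruction"].lower().split()) for ex in split}


def find_leaks(train: List[Dict], val: List[Dict], test: List[Dict]) -> Set[str]:
    """Return instructions that appear in more than one split."""
    tr, va, te = (instruction_keys(s) for s in (train, val, test))
    return (tr & va) | (tr & te) | (va & te)


train_split = [{"instruction": "What is dropout?"}, {"instruction": "Define recall."}]
val_split = [{"instruction": "what is  DROPOUT?"}]
test_split = [{"instruction": "Explain precision."}]
print(find_leaks(train_split, val_split, test_split))
```

Running deduplication before splitting, rather than after, avoids most of these overlaps in the first place.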

---

## Checkpoint

You've learned:
- ✅ Different dataset formats (Alpaca, ShareGPT, ChatML, Llama)
- ✅ How to convert between formats
- ✅ Data cleaning and quality filtering
- ✅ Proper train/val/test splitting
- ✅ System prompt design
- ✅ Building a complete data pipeline

---

## Next Steps

**[Lab 3.1.5: DPO Training](05-dpo-training.ipynb)**

In the next notebook, you'll learn how to create preference datasets and train with Direct Preference Optimization for even better model quality!

In [None]:
# Cleanup
import gc
gc.collect()
print("Cleanup complete!")