# Lab 3.1.6: Dataset Preparation for Fine-Tuning

**Module:** 3.1 - Large Language Model Fine-Tuning  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐☆☆

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand different dataset formats (Alpaca, ShareGPT, ChatML)
- [ ] Convert raw data to fine-tuning formats
- [ ] Create preference pairs for DPO/RLHF
- [ ] Implement data quality filtering
- [ ] Build a complete data preparation pipeline

---

## Real-World Context

### "Garbage In, Garbage Out"

The quality of your fine-tuned model depends heavily on your data. A small, carefully curated dataset often beats a much larger noisy one:

| Dataset | Typical result |
|---------|----------------|
| 100 high-quality examples | Often the better model |
| 10,000 low-quality examples | Often much worse |

This notebook teaches you to create **gold-standard training data**.

---

## ELI5: Why Data Format Matters

> **Imagine teaching a new employee.** You could:
>
> 1. **Give them random notes** scattered everywhere → They'll be confused
> 2. **Give them a structured manual** with clear sections → They'll learn fast!
>
> LLMs are the same! They learn best when data is:
> - **Consistently formatted** (same structure every time)
> - **Clearly labeled** (who's speaking? what's the task?)
> - **High quality** (accurate, helpful, well-written)
>
> This notebook teaches you to create that "structured manual" for your AI!

---

In [None]:
# Setup
import json
import re
import random
import hashlib
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass, field
import warnings
warnings.filterwarnings('ignore')

# For working with datasets
from datasets import Dataset, DatasetDict

random.seed(42)
print("Setup complete!")

---

## Part 1: Understanding Dataset Formats

### The Three Most Common Formats

In [None]:
# 1. ALPACA FORMAT
# Used by: Alpaca, Dolly, many instruction datasets
# Simple, single-turn instruction format

alpaca_example = {
    "instruction": "Summarize the following article in 3 bullet points.",
    "input": """Climate change is causing sea levels to rise at an accelerating rate. 
    Scientists predict that by 2100, coastal cities may experience significant flooding. 
    Many governments are now investing in sea walls and other protective measures.""",
    "output": """• Sea levels are rising faster due to climate change
• Coastal cities face major flood risks by 2100
• Governments are building sea walls as protection"""
}

print("ALPACA FORMAT:")
print(json.dumps(alpaca_example, indent=2))
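
Alpaca records are usually rendered into a single prompt string before training. The sketch below is modeled on the prompt template from the Stanford Alpaca repository; the helper name `render_alpaca_prompt` is hypothetical and not part of this lab's code.

```python
from typing import Dict

# Prompt templates modeled on the Stanford Alpaca repository.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)


def render_alpaca_prompt(example: Dict[str, str]) -> str:
    """Render an Alpaca record as prompt text (hypothetical helper)."""
    # Use the longer template only when the optional input field has content.
    if example.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=example["instruction"])


demo_with_input = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
demo_no_input = {"instruction": "Name a primary color.", "input": "", "output": "Red"}

prompt_with_input = render_alpaca_prompt(demo_with_input)
prompt_no_input = render_alpaca_prompt(demo_no_input)
print(prompt_with_input)
```

The `output` field is the training target that follows the rendered prompt.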

In [None]:
# 2. SHAREGPT FORMAT
# Used by: ShareGPT, many chat datasets
# Multi-turn conversation format

sharegpt_example = {
    "conversations": [
        {"from": "system", "value": "You are a helpful coding assistant."},
        {"from": "human", "value": "How do I reverse a string in Python?"},
        {"from": "gpt", "value": "You can reverse a string in Python using slicing:\n\n```python\ntext = 'hello'\nreversed_text = text[::-1]\nprint(reversed_text)  # Output: 'olleh'\n```"},
        {"from": "human", "value": "What about using a loop?"},
        {"from": "gpt", "value": "Here's how to reverse using a loop:\n\n```python\ntext = 'hello'\nreversed_text = ''\nfor char in text:\n    reversed_text = char + reversed_text\nprint(reversed_text)  # Output: 'olleh'\n```"}
    ]
}

print("SHAREGPT FORMAT:")
print(json.dumps(sharegpt_example, indent=2))

In [None]:
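# 3. OPENAI/CHATML FORMAT
# Used by: OpenAI API, many modern models
# Standard chat format with roles. Strictly speaking, this is the OpenAI-style
# "messages" list; ChatML is the <|im_start|>/<|im_end|> text template that
# such messages are rendered into (see Part 2).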

chatml_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "How do I read a JSON file in Python?"},
        {"role": "assistant", "content": """Here's how to read a JSON file:\n\n```python\nimport json\n\nwith open('data.json', 'r') as f:\n    data = json.load(f)\n\nprint(data)\n```"""}
    ]
}

print("CHATML/OPENAI FORMAT:")
print(json.dumps(chatml_example, indent=2))
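
The ShareGPT and messages formats carry the same information under different key names, so converting between them is a matter of renaming. A minimal sketch of the messages-to-ShareGPT direction, assuming only the three roles shown above; `messages_to_sharegpt` and `ROLE_TO_SHAREGPT` are hypothetical names.

```python
from typing import Dict

# Assumed role mapping: OpenAI-style roles -> ShareGPT speaker labels.
ROLE_TO_SHAREGPT = {"system": "system", "user": "human", "assistant": "gpt"}


def messages_to_sharegpt(record: Dict) -> Dict:
    """Convert {"messages": [...]} into {"conversations": [...]}."""
    turns = []
    for msg in record["messages"]:
        if msg["role"] not in ROLE_TO_SHAREGPT:
            # Fail loudly instead of passing unknown roles through.
            raise ValueError(f"Unsupported role: {msg['role']!r}")
        turns.append({"from": ROLE_TO_SHAREGPT[msg["role"]], "value": msg["content"]})
    return {"conversations": turns}


record = {
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "How do I read a JSON file in Python?"},
        {"role": "assistant", "content": "Use json.load on an open file."},
    ]
}
converted = messages_to_sharegpt(record)
print([turn["from"] for turn in converted["conversations"]])  # prints ['system', 'human', 'gpt']
```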

---

## Part 2: Format Converters

Let's build utilities to convert between formats.

In [None]:
@dataclass
class Message:
    """A single message in a conversation."""
    role: str  # 'system', 'user', 'assistant'
    content: str

@dataclass
class Conversation:
    """A full conversation."""
    messages: List[Message] = field(default_factory=list)
    
    def add(self, role: str, content: str):
        self.messages.append(Message(role=role, content=content))
        return self
    
    def to_dict(self) -> List[Dict]:
        return [{"role": m.role, "content": m.content} for m in self.messages]


class FormatConverter:
    """
    Convert between different dataset formats.
    """
    
    @staticmethod
    def alpaca_to_conversation(
        alpaca: Dict,
        system_prompt: str = "You are a helpful assistant."
    ) -> Conversation:
        """Convert Alpaca format to Conversation."""
        conv = Conversation()
        conv.add("system", system_prompt)
        
        user_msg = alpaca["instruction"]
        if alpaca.get("input", "").strip():
            user_msg += f"\n\n{alpaca['input']}"
        
        conv.add("user", user_msg)
        conv.add("assistant", alpaca["output"])
        
        return conv
    
    @staticmethod
    def sharegpt_to_conversation(sharegpt: Dict) -> Conversation:
        """Convert ShareGPT format to Conversation."""
        role_map = {
            "system": "system",
            "human": "user",
            "user": "user",
            "gpt": "assistant",
            "assistant": "assistant",
        }
        
        conv = Conversation()
        for turn in sharegpt["conversations"]:
            role = role_map.get(turn["from"], turn["from"])
            conv.add(role, turn["value"])
        
        return conv
    
    @staticmethod
    def conversation_to_chatml(conv: Conversation) -> str:
        """Format conversation as ChatML string."""
        output = ""
        for msg in conv.messages:
            output += f"<|im_start|>{msg.role}\n{msg.content}<|im_end|>\n"
        return output.strip()
    
    @staticmethod
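    def conversation_to_llama3(conv: Conversation) -> str:
        """Format conversation for Llama 3.1.

        The string already starts with <|begin_of_text|>, so tokenize it with
        add_special_tokens=False to avoid a duplicated BOS token.
        """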
        output = "<|begin_of_text|>"
        for msg in conv.messages:
            output += f"<|start_header_id|>{msg.role}<|end_header_id|>\n\n"
            output += f"{msg.content}<|eot_id|>"
        return output


# Test converters
conv = FormatConverter.alpaca_to_conversation(alpaca_example)
print("Alpaca → ChatML:")
print(FormatConverter.conversation_to_chatml(conv)[:300] + "...")
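
Converters like these pass along whatever role order they receive, so a quick structural check can catch malformed conversations before formatting. A minimal sketch, assuming the convention of an optional leading system message followed by strictly alternating user/assistant turns that end on an assistant turn; `validate_turn_order` is a hypothetical helper that works on plain role/content dicts.

```python
from typing import Dict, List, Tuple


def validate_turn_order(messages: List[Dict[str, str]]) -> Tuple[bool, str]:
    """Return (is_valid, reason_if_invalid) for a list of role/content dicts."""
    # Skip an optional system message at the start.
    has_system = bool(messages) and messages[0]["role"] == "system"
    turns = messages[1:] if has_system else messages
    if not turns:
        return False, "no user/assistant turns"

    expected = "user"
    for i, msg in enumerate(turns):
        if msg["role"] != expected:
            return False, f"turn {i}: expected {expected!r}, got {msg['role']!r}"
        expected = "assistant" if expected == "user" else "user"

    # A training example needs a final assistant response to learn from.
    if turns[-1]["role"] != "assistant":
        return False, "conversation must end with an assistant turn"
    return True, ""


good = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
bad = [
    {"role": "user", "content": "Hi"},
    {"role": "user", "content": "Anyone there?"},
]
good_result = validate_turn_order(good)
bad_result = validate_turn_order(bad)
print(good_result, bad_result)
```

The same check can be applied to the output of `Conversation.to_dict()`.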

---

## Part 3: Data Quality Filtering

Not all data is created equal. Let's build quality filters.

In [None]:
class DataQualityFilter:
    """
    Filter training examples for quality.
    
    Quality criteria:
    - Appropriate length
    - No duplicates
    - No placeholder content
    - Language detection (optional)
    """
    
    def __init__(
        self,
        min_instruction_len: int = 10,
        max_instruction_len: int = 2000,
        min_output_len: int = 20,
        max_output_len: int = 4000,
        remove_duplicates: bool = True,
    ):
        self.min_instruction_len = min_instruction_len
        self.max_instruction_len = max_instruction_len
        self.min_output_len = min_output_len
        self.max_output_len = max_output_len
        self.remove_duplicates = remove_duplicates
        self.seen_hashes = set()
        
        # Placeholder patterns to reject
        self.placeholder_patterns = [
            r'^TODO',
            r'^TBD',
            r'^\[.*\]$',
            r'^N/A$',
            r'^\.\.\.$',
            r'^\?+$',
        ]
    
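    def clean_text(self, text: str) -> str:
        """Clean a text string while preserving line breaks."""
        # Normalize line endings
        text = text.replace('\r\n', '\n').replace('\r', '\n')
        # Collapse runs of spaces and tabs (newlines are kept)
        text = re.sub(r'[ \t]+', ' ', text)
        # Drop spaces around newlines
        text = re.sub(r' ?\n ?', '\n', text)
        # Collapse three or more newlines into one blank line
        text = re.sub(r'\n{3,}', '\n\n', text)
        # Strip
        return text.strip()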
    
    def is_placeholder(self, text: str) -> bool:
        """Check if text is just a placeholder."""
        for pattern in self.placeholder_patterns:
            if re.match(pattern, text.strip(), re.IGNORECASE):
                return True
        return False
    
    def is_duplicate(self, instruction: str, output: str) -> bool:
        """Check if we've seen this example before."""
        if not self.remove_duplicates:
            return False
        
        # Separator keeps ("ab", "c") and ("a", "bc") from hashing identically
        content_hash = hashlib.md5((instruction + "\x1f" + output).encode()).hexdigest()
        if content_hash in self.seen_hashes:
            return True
        self.seen_hashes.add(content_hash)
        return False
    
    def check_example(self, example: Dict) -> Tuple[bool, str]:
        """
        Check if an example passes quality filters.
        
        Returns: (is_valid, reason_if_invalid)
        """
        instruction = example.get("instruction", "")
        output = example.get("output", "")
        
        # Placeholder check. Runs before the length checks because placeholders
        # such as "TODO" are short and would otherwise be reported as too short.
        if self.is_placeholder(output):
            return False, "Output is placeholder"
        
        # Length checks
        if len(instruction) < self.min_instruction_len:
            return False, f"Instruction too short ({len(instruction)} chars)"
        if len(instruction) > self.max_instruction_len:
            return False, f"Instruction too long ({len(instruction)} chars)"
        if len(output) < self.min_output_len:
            return False, f"Output too short ({len(output)} chars)"
        if len(output) > self.max_output_len:
            return False, f"Output too long ({len(output)} chars)"
        
        # Duplicate check
        if self.is_duplicate(instruction, output):
            return False, "Duplicate example"
        
        return True, ""
    
    def filter_dataset(
        self, 
        data: List[Dict], 
        verbose: bool = True
    ) -> Tuple[List[Dict], Dict]:
        """
        Filter a dataset and return cleaned data with statistics.
        """
        cleaned = []
        stats = {
            "total": len(data),
            "passed": 0,
            "failed": 0,
            "reasons": {}
        }
        
        for example in data:
            # Clean text fields
            cleaned_example = {
                "instruction": self.clean_text(example.get("instruction", "")),
                "input": self.clean_text(example.get("input", "")),
                "output": self.clean_text(example.get("output", "")),
            }
            
            is_valid, reason = self.check_example(cleaned_example)
            
            if is_valid:
                cleaned.append(cleaned_example)
                stats["passed"] += 1
            else:
                stats["failed"] += 1
                stats["reasons"][reason] = stats["reasons"].get(reason, 0) + 1
        
        if verbose:
            print(f"\nFiltering Results:")
            print(f"  Total: {stats['total']}")
            print(f"  Passed: {stats['passed']} ({100*stats['passed']/max(1,stats['total']):.1f}%)")
            print(f"  Failed: {stats['failed']}")
            if stats["reasons"]:
                print(f"  Failure reasons:")
                for reason, count in sorted(stats["reasons"].items(), key=lambda x: -x[1]):
                    print(f"    - {reason}: {count}")
        
        return cleaned, stats


# Test the filter
test_data = [
    {"instruction": "Write a poem", "input": "", "output": "Roses are red..."},  # Output too short
    {"instruction": "Hi", "input": "", "output": "Hello! How can I help you today?"},  # Instruction too short
    {"instruction": "Explain machine learning", "input": "", "output": "TODO"},  # Placeholder
    {"instruction": "What is Python?", "input": "", "output": "Python is a high-level programming language known for its readability and versatility."},  # Good!
    {"instruction": "What is Python?", "input": "", "output": "Python is a high-level programming language known for its readability and versatility."},  # Duplicate!
]

quality_filter = DataQualityFilter(min_output_len=30)
cleaned_data, stats = quality_filter.filter_dataset(test_data)
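
Exact hashing only catches byte-identical duplicates. The sketch below normalizes case, punctuation, and whitespace before hashing so that trivially different copies collapse to one key. It assumes such variants should count as duplicates, which may be too aggressive for code or math data; `normalized_key` and `dedupe` are hypothetical helpers.

```python
import hashlib
import re
from typing import Dict, List


def normalized_key(example: Dict[str, str]) -> str:
    """Hash an example after lowercasing and stripping punctuation/whitespace."""
    parts = []
    for field_name in ("instruction", "input", "output"):
        text = example.get(field_name, "").lower()
        text = re.sub(r"[^\w\s]", "", text)       # drop punctuation
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
        parts.append(text)
    # Join with a separator so field boundaries stay unambiguous.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def dedupe(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first occurrence of each normalized example."""
    seen = set()
    unique = []
    for example in examples:
        key = normalized_key(example)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


rows = [
    {"instruction": "What is Python?", "output": "A programming language."},
    {"instruction": "what is  python", "output": "a programming language"},
    {"instruction": "What is Rust?", "output": "A programming language."},
]
unique_rows = dedupe(rows)
print(len(unique_rows))  # prints 2
```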

---

## Part 4: Creating Preference Pairs for DPO

For preference optimization (DPO, SimPO, etc.), we need (prompt, chosen, rejected) triplets.

In [None]:
class PreferencePairGenerator:
    """
    Generate preference pairs for DPO training.
    
    Methods:
    1. from_scores: pair responses to the same prompt using quality scores
    2. synthetic_rejected: generate rejected responses from good examples
    """
    
    @staticmethod
    def from_scores(
        responses: List[Dict],
        min_score_diff: float = 0.5
    ) -> List[Dict]:
        """
        Create pairs from responses with quality scores.
        
        Input: [{"prompt": str, "response": str, "score": float}, ...]
        Output: [{"prompt": str, "chosen": str, "rejected": str}, ...]
        """
        # Group by prompt
        by_prompt = {}
        for r in responses:
            prompt = r['prompt']
            if prompt not in by_prompt:
                by_prompt[prompt] = []
            by_prompt[prompt].append(r)
        
        pairs = []
        for prompt, prompt_responses in by_prompt.items():
            # Sort by score (highest first)
            prompt_responses.sort(key=lambda x: x['score'], reverse=True)
            
            # Create pairs where chosen >> rejected
            for i, high in enumerate(prompt_responses):
                for low in prompt_responses[i + 1:]:
                    if high['score'] - low['score'] >= min_score_diff:
                        pairs.append({
                            'prompt': prompt,
                            'chosen': high['response'],
                            'rejected': low['response'],
                        })
        
        return pairs
    
    @staticmethod
    def synthetic_rejected(
        examples: List[Dict],
        rejection_strategies: Optional[List[str]] = None
    ) -> List[Dict]:
        """
        Create synthetic rejected responses.
        
        Implemented strategies:
        - truncate: Cut the response short (only for responses over 50 chars)
        - generic: Replace with a generic response
        
        Any other strategy name, or a response too short to truncate,
        falls back to a generic response.
        """
        if rejection_strategies is None:
            rejection_strategies = ["truncate", "generic"]
        
        pairs = []
        
        generic_responses = [
            "I cannot help with that.",
            "Please try again later.",
            "I don't know.",
            "That's an interesting question.",
        ]
        
        for example in examples:
            instruction = example.get('instruction', '')
            input_text = example.get('input', '')
            chosen = example.get('output', '')
            
            prompt = instruction
            if input_text:
                prompt += f"\n\nInput: {input_text}"
            
            # Pick a random strategy
            strategy = random.choice(rejection_strategies)
            
            if strategy == "truncate" and len(chosen) > 50:
                # Keep roughly the first third of the response
                cut_point = len(chosen) // 3
                rejected = chosen[:cut_point] + "..."
            else:
                # "generic", unknown strategies, and responses too short to truncate
                rejected = random.choice(generic_responses)
            
            pairs.append({
                'prompt': prompt,
                'chosen': chosen,
                'rejected': rejected,
            })
        
        return pairs


# Demo with scored responses
scored_responses = [
    {"prompt": "What is 2+2?", "response": "2+2 equals 4.", "score": 0.9},
    {"prompt": "What is 2+2?", "response": "The answer is four.", "score": 0.85},
    {"prompt": "What is 2+2?", "response": "4", "score": 0.5},
    {"prompt": "What is 2+2?", "response": "I think it's 5?", "score": 0.1},
]

pairs = PreferencePairGenerator.from_scores(scored_responses, min_score_diff=0.3)
print(f"Generated {len(pairs)} preference pairs:")
for p in pairs[:2]:
    print(f"  Chosen: {p['chosen'][:50]}...")
    print(f"  Rejected: {p['rejected'][:50]}...")
    print()
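
Pairing every response against every lower-scored one can over-represent prompts that have many responses. An alternative is to pair only the top-scored response with each of the others. A minimal sketch under that assumption; `best_vs_rest` is a hypothetical function, and the sample scores below are illustrative copies of the demo data.

```python
from collections import defaultdict
from typing import Dict, List


def best_vs_rest(responses: List[Dict], min_score_diff: float = 0.0) -> List[Dict]:
    """Pair the highest-scored response per prompt with each weaker one."""
    by_prompt = defaultdict(list)
    for r in responses:
        by_prompt[r["prompt"]].append(r)

    pairs = []
    for prompt, group in by_prompt.items():
        best = max(group, key=lambda r: r["score"])
        for other in group:
            if other is best:
                continue
            # Skip pairs whose quality gap is too small to be a clear signal.
            if best["score"] - other["score"] >= min_score_diff:
                pairs.append({
                    "prompt": prompt,
                    "chosen": best["response"],
                    "rejected": other["response"],
                })
    return pairs


sample_scores = [
    {"prompt": "What is 2+2?", "response": "2+2 equals 4.", "score": 0.9},
    {"prompt": "What is 2+2?", "response": "The answer is four.", "score": 0.85},
    {"prompt": "What is 2+2?", "response": "4", "score": 0.5},
    {"prompt": "What is 2+2?", "response": "I think it's 5?", "score": 0.1},
]
best_pairs = best_vs_rest(sample_scores, min_score_diff=0.3)
print(len(best_pairs))  # prints 2
```

With the same threshold, the all-pairs approach yields more pairs for this prompt because lower-ranked responses are also paired against each other.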

---

## Part 5: Complete Data Pipeline

Let's put it all together into a complete pipeline.

In [None]:
class DataPipeline:
    """
    Complete data preparation pipeline for LLM fine-tuning.
    
    Steps:
    1. Load raw data
    2. Clean and filter
    3. Convert to target format
    4. Split into train/val/test
    5. Save
    """
    
    def __init__(
        self,
        system_prompt: str = "You are a helpful assistant.",
        output_format: str = "llama3",  # 'llama3', 'chatml', 'alpaca'
        quality_filter: Optional[DataQualityFilter] = None,
    ):
        self.system_prompt = system_prompt
        self.output_format = output_format
        self.quality_filter = quality_filter or DataQualityFilter()
    
    def load_jsonl(self, filepath: str) -> List[Dict]:
        """Load data from JSONL file."""
        data = []
        with open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                if line.strip():
                    data.append(json.loads(line))
        return data
    
    def load_json(self, filepath: str) -> List[Dict]:
        """Load data from JSON file."""
        with open(filepath, 'r', encoding='utf-8') as f:
            return json.load(f)
    
    def format_example(self, example: Dict) -> str:
        """Format a single example for training."""
        conv = FormatConverter.alpaca_to_conversation(example, self.system_prompt)
        
        if self.output_format == "llama3":
            return FormatConverter.conversation_to_llama3(conv)
        elif self.output_format == "chatml":
            return FormatConverter.conversation_to_chatml(conv)
        elif self.output_format == "alpaca":
            # Keep as Alpaca dict
            return json.dumps(example)
        else:
            raise ValueError(f"Unknown output_format: {self.output_format!r}")
    
    def split_data(
        self,
        data: List[Dict],
        train_ratio: float = 0.8,
        val_ratio: float = 0.1,
        test_ratio: float = 0.1,
        shuffle: bool = True,
    ) -> Tuple[List[Dict], List[Dict], List[Dict]]:
        """Split data into train/val/test sets."""
        assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-6
        
        data = data.copy()
        if shuffle:
            random.shuffle(data)
        
        n = len(data)
        train_end = int(n * train_ratio)
        val_end = train_end + int(n * val_ratio)
        
        return data[:train_end], data[train_end:val_end], data[val_end:]
    
    def process(
        self,
        data: List[Dict],
        output_dir: str = "./processed_data",
    ) -> DatasetDict:
        """
        Run the complete pipeline.
        
        Returns HuggingFace DatasetDict ready for training.
        """
        print("="*50)
        print("DATA PREPARATION PIPELINE")
        print("="*50)
        
        # Step 1: Filter
        print("\n1. Filtering data...")
        filtered_data, stats = self.quality_filter.filter_dataset(data)
        
        # Step 2: Format
        print(f"\n2. Formatting to {self.output_format}...")
        formatted_data = []
        for example in filtered_data:
            formatted_data.append({
                "text": self.format_example(example),
                **example  # Keep original fields too
            })
        
        # Step 3: Split
        print("\n3. Splitting data...")
        train, val, test = self.split_data(formatted_data)
        print(f"   Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")
        
        # Step 4: Create HuggingFace dataset
        print("\n4. Creating HuggingFace dataset...")
        dataset_dict = DatasetDict({
            "train": Dataset.from_list(train),
            "validation": Dataset.from_list(val),
            "test": Dataset.from_list(test),
        })
        
        # Step 5: Save
        print(f"\n5. Saving to {output_dir}...")
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        dataset_dict.save_to_disk(output_dir)
        
        print("\n" + "="*50)
        print("PIPELINE COMPLETE!")
        print("="*50)
        
        return dataset_dict


# Demo the pipeline
sample_data = [
    {"instruction": "What is Python?", "input": "", "output": "Python is a high-level, interpreted programming language known for its clear syntax and readability."},
    {"instruction": "Explain machine learning in simple terms.", "input": "", "output": "Machine learning is a type of AI where computers learn patterns from data instead of being explicitly programmed."},
    {"instruction": "Write a haiku about coding.", "input": "", "output": "Lines of code flow down\nSilent bugs hide in the dark\nDebugger reveals"},
    {"instruction": "What is 2+2?", "input": "", "output": "2+2 equals 4. This is a basic arithmetic operation."},
    {"instruction": "Hi", "input": "", "output": "Hello!"},  # Will be filtered
]

pipeline = DataPipeline(output_format="llama3")
dataset = pipeline.process(sample_data, output_dir="./demo_dataset")
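
A shuffle-based split depends on the random state and on the order of the input, so an example can move between splits when the dataset grows. One alternative is to assign splits from a hash of the example's content. A minimal sketch, assuming the split key is the instruction plus input; `assign_split` is a hypothetical helper, and the resulting split sizes only approximate the requested ratios.

```python
import hashlib
from typing import Dict


def assign_split(example: Dict[str, str], train_ratio: float = 0.8, val_ratio: float = 0.1) -> str:
    """Deterministically map an example to 'train', 'validation', or 'test'."""
    key = example.get("instruction", "") + "\x1f" + example.get("input", "")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the first 4 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    if bucket < train_ratio:
        return "train"
    if bucket < train_ratio + val_ratio:
        return "validation"
    return "test"


synthetic = [{"instruction": f"Question number {i}", "input": ""} for i in range(1000)]
split_counts = {"train": 0, "validation": 0, "test": 0}
for example in synthetic:
    split_counts[assign_split(example)] += 1
print(split_counts)
```

Because the assignment depends only on the content, identical prompts always land in the same split, which also keeps duplicates from leaking across train and test.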

In [None]:
# View the processed data
print("\nProcessed dataset structure:")
print(dataset)

print("\nSample formatted example:")
print(dataset["train"][0]["text"])
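
The pipeline can load JSONL but the demo never writes any. The sketch below shows a round trip through a temporary file using only the standard library; `write_jsonl` and `read_jsonl` are hypothetical helpers, with the reader mirroring the line-by-line approach of `load_jsonl`.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_jsonl(records: List[Dict], filepath: Path) -> None:
    """Write one JSON object per line."""
    with open(filepath, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(filepath: Path) -> List[Dict]:
    """Read one JSON object per non-empty line."""
    with open(filepath, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [
    {"instruction": "Say hi", "input": "", "output": "Hi!\nHow are you?"},
    {"instruction": "Translate 'coffee' to French", "input": "", "output": "café"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.jsonl"
    write_jsonl(records, path)
    line_count = len(path.read_text(encoding="utf-8").splitlines())
    restored = read_jsonl(path)

print(restored == records)  # prints True
```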

---

## Try It Yourself: Exercises

### Exercise 1: Add Language Detection

Extend `DataQualityFilter` to detect and filter by language.

<details>
<summary>Hint</summary>

Use the `langdetect` library: `pip install langdetect`
</details>

In [None]:
# Exercise 1: Your code here
# Extend DataQualityFilter with language detection

### Exercise 2: Multi-Turn Conversation Pairs

Create preference pairs from multi-turn conversations where the last response varies.

<details>
<summary>Hint</summary>

The "prompt" should include every message before the final assistant turn; the competing final responses become "chosen" and "rejected".
</details>

In [None]:
# Exercise 2: Your code here
# Create multi-turn preference pairs

---

## Checkpoint

You've learned:
- ✅ Different dataset formats (Alpaca, ShareGPT, ChatML)
- ✅ How to convert between formats
- ✅ Quality filtering for training data
- ✅ Creating preference pairs for DPO
- ✅ Building a complete data pipeline

---

## Further Reading

- [Alpaca Dataset](https://github.com/tatsu-lab/stanford_alpaca)
- [ShareGPT Vicuna Dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered)
- [Data Quality for LLMs](https://arxiv.org/abs/2305.10429)

---

## Cleanup

In [None]:
# Clean up demo files
import shutil
if Path("./demo_dataset").exists():
    shutil.rmtree("./demo_dataset")
    print("Demo dataset cleaned up!")

---

## Next Steps

Continue to:

**[Lab 3.1.7: DPO Training](lab-3.1.7-dpo-training.ipynb)** - Use your preference pairs to train with Direct Preference Optimization!