# Lab 4.6.8.1: Dataset Preparation

**Capstone Option E:** Browser-Deployed Fine-Tuned LLM (Troscha Matcha Guide)  
**Phase:** 1 of 6  
**Time:** 4-6 hours  
**Difficulty:** ⭐⭐⭐

---

## Phase Objectives

By completing this phase, you will:
- [ ] Understand the messages format with `<preferences>` JSON output
- [ ] Use the 300 pre-built training examples from `data/option-e-browser-llm/`
- [ ] Implement data validation and quality checks
- [ ] Split data into train/validation/test sets
- [ ] Save dataset in Hugging Face format

---

## Phase Checklist

- [ ] Environment setup complete
- [ ] Dataset format understood (including `<preferences>` JSON)
- [ ] Training examples loaded from data files
- [ ] Validation split created (~30 examples)
- [ ] Test split created (~30 examples)
- [ ] Quality validation passed
- [ ] Dataset saved locally
- [ ] (Optional) Dataset pushed to Hub

---

## Why This Matters

**Quality over quantity:** for domain-specific fine-tuning, a few hundred excellent examples often beat 10,000 mediocre ones.

| Dataset Size | Quality Needed | Use Case |
|--------------|----------------|----------|
| 50-100 | Very High | Narrow domain adaptation |
| 100-500 | High | **Our target: Domain expertise** |
| 500-5000 | Medium | Broader capabilities |
| 5000+ | Mixed OK | General instruction tuning |

Think of it like training a specialist vs. a generalist:
- **Specialist** (our goal): Deep knowledge in matcha → fewer, high-quality examples
- **Generalist**: Broad knowledge → many diverse examples

---

## ELI5: What Makes Good Training Data?

> **Imagine you're writing a study guide for a Troscha barista exam.**
>
> Each training example is like a flashcard:
> - **Front** (user question): "What's the difference between Firu and Giru?"
> - **Back** (expert answer): A detailed, accurate response PLUS structured preferences
>
> **Good flashcards:**
> - Cover all products and common questions
> - Have clear, specific questions
> - Have detailed, accurate answers with personality
> - Include `<preferences>` JSON for product recommendations
>
> **Bad flashcards:**
> - All ask the same thing differently
> - Have one-word answers
> - Missing the `<preferences>` JSON structure
> - Don't match Troscha's product catalog

**Unique to this project:** Every response ends with a `<preferences>` JSON block that enables structured product recommendations in the UI!
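
As a rough illustration of how a UI might consume that block, the sketch below splits one assistant response into the visible reply and the parsed preferences. The helper name `split_preferences` and the regular expression are assumptions made for this example; they are not part of the lab's code.

```python
import json
import re
from typing import Any, Dict, Optional, Tuple

# Hypothetical pattern: a <preferences>...</preferences> block at the end of the text.
PREFS_PATTERN = re.compile(r"<preferences>(.*?)</preferences>\s*$", re.DOTALL)

def split_preferences(response: str) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Return (visible_text, preferences) for one assistant response.

    preferences is None when the block is missing or is not valid JSON;
    in that case the response text is returned unchanged (minus outer whitespace).
    """
    match = PREFS_PATTERN.search(response)
    if match is None:
        return response.strip(), None
    try:
        prefs = json.loads(match.group(1))
    except json.JSONDecodeError:
        return response.strip(), None
    return response[:match.start()].strip(), prefs

reply = (
    "Both are excellent for lattes!\n\n"
    '<preferences>{"intent": "compare", "recommended_matcha": ["m-003", "m-004"]}</preferences>'
)
text, prefs = split_preferences(reply)
print(text)
print(prefs["recommended_matcha"])
```

Returning `None` instead of raising keeps a malformed block from breaking the chat view; the UI can simply skip the product cards for that turn.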

---

## Part 1: Environment Setup

In [None]:
# Environment Setup
import os
import sys
import json
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Any, Optional
from dataclasses import dataclass, field, asdict
import random

# Dataset library
from datasets import Dataset, DatasetDict

print("🍵 PHASE 1: DATASET PREPARATION")
print("="*70)
print(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
print(f"Working Directory: {os.getcwd()}")

In [None]:
# Project Configuration
PROJECT_DIR = Path("./troscha-matcha")
DATA_DIR = PROJECT_DIR / "data"
DATA_DIR.mkdir(parents=True, exist_ok=True)

# Path to pre-built training data
# This path is relative to the notebook location (labs/ directory)
# The data lives in domain-4.../module-4.6.../data/option-e-browser-llm/
SOURCE_DATA_DIR = Path("../data/option-e-browser-llm")

# Verify source data exists, try alternative paths if needed
if not SOURCE_DATA_DIR.exists():
    # Try from project root if running from different directory
    alt_paths = [
        Path("data/option-e-browser-llm"),
        Path("./data/option-e-browser-llm"),
        Path(__file__).parent.parent / "data" / "option-e-browser-llm" if "__file__" in dir() else None,
    ]
    for alt in alt_paths:
        if alt and alt.exists():
            SOURCE_DATA_DIR = alt
            print(f"   Using alternative path: {SOURCE_DATA_DIR}")
            break

# Dataset configuration
DATASET_CONFIG = {
    "name": "troscha-matcha-dataset",
    "version": "1.0.0",
    "min_train_examples": 240,  # 80% of 300
    "min_val_examples": 30,     # 10% of 300
    "min_test_examples": 30,    # 10% of 300
    "train_split": 0.8,
    "val_split": 0.1,
    "test_split": 0.1,
}

# System prompt for Troscha matcha guide
SYSTEM_PROMPT = """You are Troscha's matcha guide.

MENU:
- Yura: Latte Rp 27k
- Taku: Straight Rp 25k | Latte Rp 32k | Strawberry Rp 40k
- Firu: Straight Rp 34k | Latte Rp 44k | Miruku Rp 49k | Strawberry Rp 52k
- Giru: Straight Rp 39k | Latte Rp 49k | Miruku Rp 54k | Strawberry Rp 57k
- Zeno: Straight Rp 44k | Latte Rp 54k | Miruku Rp 59k | Strawberry Rp 62k
- Moku: Hojicha Latte Rp 35k
- Hiku: Straight Rp 79k | Latte Rp 89k
- Kiyo: Straight Rp 94k | Latte Rp 104k

ADDON: Oat Milk +Rp 5k

End responses with <preferences> JSON."""

print(f"📁 Project Directory: {PROJECT_DIR}")
print(f"📊 Dataset Name: {DATASET_CONFIG['name']}")
print(f"📝 Source Data: {SOURCE_DATA_DIR}")
if SOURCE_DATA_DIR.exists():
    print("   ✅ Source data directory found")
else:
    print("   ⚠️ Source data directory not found - check path")

---

## Part 2: Understanding the Messages Format

Modern chat models expect data in a specific **messages format**:

In [None]:
# The Messages Format with <preferences> JSON

# Each training example is a conversation with three parts:
example_format = {
    "messages": [
        {
            "role": "system",
            "content": "You are Troscha's matcha guide.\n\nMENU:\n- Yura: Latte Rp 27k\n..."
        },
        {
            "role": "user", 
            "content": "What's the difference between Firu and Giru?"
        },
        {
            "role": "assistant",
            "content": """Great question! Firu and Giru are both premium blends but come from different legendary regions.

Firu is from Uji (Kyoto) - Japan's original matcha heartland. It blends Yabukita with Okumidori cultivars for harmonious umami sweetness.

Giru is from Nishio (Aichi) - Japan's largest matcha producer. It blends Saemidori with Okumidori for mellow, well-rounded character.

Both are excellent for lattes!

<preferences>{"intent": "compare", "sweetness": "medium", "bitterness": "low", "umami": "high", "body": "medium", "serving": null, "experience": null, "recommended_matcha": ["m-003", "m-004"], "origin_preference": ["uji", "nishio"], "notes": "comparing Uji vs Nishio terroir"}</preferences>"""
        }
    ]
}

print("📋 MESSAGES FORMAT WITH <preferences>")
print("="*70)
print(json.dumps(example_format, indent=2)[:800] + "...")
print("\n💡 The <preferences> JSON at the end enables structured recommendations!")

In [None]:
# Data structures (for reference - our pre-built data already uses this format)
#
# The following dataclasses show the expected structure of training examples.
# Since we're loading pre-built data from JSON files, these are provided
# for documentation purposes only.

@dataclass
class Message:
    """A single message in a conversation."""
    role: str  # "system", "user", or "assistant"
    content: str
    
    def to_dict(self) -> Dict[str, str]:
        return {"role": self.role, "content": self.content}

@dataclass
class TrainingExample:
    """A complete training example."""
    messages: List[Message]
    category: str  # For tracking topic coverage
    
    def to_dict(self) -> Dict[str, Any]:
        return {
            "messages": [m.to_dict() for m in self.messages],
            "category": self.category,
        }

print("✅ Data structures defined (for reference)")
print("   Note: Pre-built training data is loaded from ../data/option-e-browser-llm/")

---

## Part 3: Topic Categories

The 300 training examples are organized into 12 topic categories, one JSON file per category:

In [None]:
# Topic Categories based on actual training data files

CATEGORIES = {
    "product_comparisons": {
        "description": "Pairwise and group product comparisons",
        "file": "01_product_comparisons.json",
        "target_count": 30,
    },
    "product_specific": {
        "description": "Questions about specific products",
        "file": "02_product_specific.json",
        "target_count": 24,
    },
    "taste_discovery": {
        "description": "Taste preference discovery",
        "file": "03_taste_discovery.json",
        "target_count": 40,
    },
    "experience_onboarding": {
        "description": "Experience-based recommendations",
        "file": "04_experience_onboarding.json",
        "target_count": 35,
    },
    "serving_context": {
        "description": "Serving style guidance",
        "file": "05_serving_context.json",
        "target_count": 30,
    },
    "matcha_education": {
        "description": "General matcha education",
        "file": "06_matcha_education.json",
        "target_count": 25,
    },
    "price_questions": {
        "description": "Price and value questions",
        "file": "07_price_questions.json",
        "target_count": 15,
    },
    "common_questions": {
        "description": "Storage, prep, caffeine, etc.",
        "file": "08_common_questions.json",
        "target_count": 20,
    },
    "troubleshooting": {
        "description": "Problem-solving conversations",
        "file": "09_troubleshooting.json",
        "target_count": 18,
    },
    "multi_turn": {
        "description": "Multi-turn refinement",
        "file": "10_multi_turn.json",
        "target_count": 20,
    },
    "out_of_scope": {
        "description": "Out-of-scope handling",
        "file": "11_out_of_scope.json",
        "target_count": 31,
    },
    "brand_identity": {
        "description": "Brand and chatbot identity",
        "file": "12_brand_identity.json",
        "target_count": 12,
    },
}

print("📊 TOPIC DISTRIBUTION (from data/option-e-browser-llm/)")
print("="*70)
total = 0
for cat, info in CATEGORIES.items():
    print(f"   {cat:<25} {info['target_count']:>3} examples  - {info['description']}")
    total += info['target_count']
print("-"*70)
print(f"   {'TOTAL':<25} {total:>3} examples")

---

## Part 4: The Preferences JSON Schema

The `<preferences>` JSON enables structured product recommendations:

In [None]:
# Preferences JSON Schema

PREFERENCES_SCHEMA = {
    "intent": "recommend|educate|compare|troubleshoot|out_of_scope",
    "sweetness": "low|medium|high|null",
    "bitterness": "low|medium|high|null",
    "umami": "low|medium|high|null",
    "body": "light|medium|full|null",
    "serving": "straight|latte|miruku|null",
    "experience": "beginner|intermediate|enthusiast|null",
    "recommended_matcha": ["m-001", "m-003", "..."],  # Product IDs
    "origin_preference": ["shiga", "uji", "nishio"],  # or null
    "notes": "string or null"
}

print("📋 PREFERENCES JSON SCHEMA")
print("="*70)
print(json.dumps(PREFERENCES_SCHEMA, indent=2))
print("\n💡 This structured output enables:")
print("   - Product card rendering in the UI")
print("   - Filter/search based on preferences")
print("   - Analytics on customer preferences")
print("   - Personalized recommendation flows")
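
The schema above is descriptive rather than enforced. A minimal sketch of checking a parsed `<preferences>` dict against its enumerated fields might look like the following. The names `ALLOWED_VALUES` and `check_preferences` are hypothetical, the allowed values are copied from `PREFERENCES_SCHEMA`, and treating `recommended_matcha` as a required list of strings is an assumption.

```python
from typing import Any, Dict, List

# Allowed values copied from PREFERENCES_SCHEMA; None stands for JSON null.
# Tuples (not sets) so that an unexpected unhashable value cannot raise.
ALLOWED_VALUES = {
    "intent": ("recommend", "educate", "compare", "troubleshoot", "out_of_scope"),
    "sweetness": ("low", "medium", "high", None),
    "bitterness": ("low", "medium", "high", None),
    "umami": ("low", "medium", "high", None),
    "body": ("light", "medium", "full", None),
    "serving": ("straight", "latte", "miruku", None),
    "experience": ("beginner", "intermediate", "enthusiast", None),
}

def check_preferences(prefs: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed <preferences> dict."""
    problems: List[str] = []
    for key, allowed in ALLOWED_VALUES.items():
        if key not in prefs:
            problems.append(f"missing key: {key}")
        elif prefs[key] not in allowed:
            problems.append(f"unexpected value for {key}: {prefs[key]!r}")
    # Assumption: recommended_matcha is always a list of product ID strings.
    recommended = prefs.get("recommended_matcha")
    if not isinstance(recommended, list) or not all(isinstance(p, str) for p in recommended):
        problems.append("recommended_matcha must be a list of product ID strings")
    return problems

good = {
    "intent": "compare", "sweetness": "medium", "bitterness": "low",
    "umami": "high", "body": "medium", "serving": None, "experience": None,
    "recommended_matcha": ["m-003", "m-004"],
    "origin_preference": ["uji", "nishio"],
    "notes": "comparing Uji vs Nishio terroir",
}
bad = dict(good, intent="buy")
print(check_preferences(good))
print(check_preferences(bad))
```

A check like this could be called from the validation step in Part 5 after the JSON parses, so that typos such as `"med"` are caught before training.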

In [None]:
# Load all training data from the data/option-e-browser-llm/ directory

def load_training_data(source_dir: Path) -> List[Dict[str, Any]]:
    """
    Load all training examples from JSON files in the source directory.
    
    Args:
        source_dir: Path to the option-e-browser-llm data directory
        
    Returns:
        List of training examples with messages format
    """
    all_examples = []
    
    for category, info in CATEGORIES.items():
        file_path = source_dir / info["file"]
        if file_path.exists():
            with open(file_path, encoding="utf-8") as f:
                examples = json.load(f)
                # Add category metadata to each example
                for ex in examples:
                    ex["category"] = category
                all_examples.extend(examples)
                print(f"   ✅ Loaded {len(examples)} from {info['file']}")
        else:
            print(f"   ⚠️ File not found: {info['file']}")
    
    return all_examples

print("📂 LOADING TRAINING DATA")
print("="*70)
ALL_EXAMPLES = load_training_data(SOURCE_DATA_DIR)
print(f"\n   Total: {len(ALL_EXAMPLES)} examples loaded")

In [None]:
# Using Pre-Generated Training Data
#
# The training dataset has been pre-generated externally and contains 300 examples.
# This data is ready for use - no generation needed!
#
# The data was created using the methodology described in:
#   ../scripts/option_e_dataset_generation_prompt.md
#
# If you ever need to regenerate or expand the dataset for your own domain,
# you can reference that prompt template with Claude or GPT-4.

print("📝 PRE-GENERATED TRAINING DATA")
print("="*70)
print(f"   Current dataset: {len(ALL_EXAMPLES)} examples")
print(f"   Target minimum: {DATASET_CONFIG['min_train_examples'] + DATASET_CONFIG['min_val_examples'] + DATASET_CONFIG['min_test_examples']} examples")
print()
if len(ALL_EXAMPLES) >= 300:
    print("   ✅ Dataset meets the target of 300 examples!")
    print("   ✅ No additional data generation required.")
else:
    print(f"   ⚠️ Dataset has {len(ALL_EXAMPLES)} examples, target is 300")
    print("   To generate more examples:")
    print("   1. Open ../scripts/option_e_dataset_generation_prompt.md")
    print("   2. Use the prompt with Claude or GPT-4")
    print("   3. Add to the appropriate JSON file in data/option-e-browser-llm/")

---

## Part 5: Data Validation

In [None]:
# Data Validation Functions for Troscha format

def validate_example(example: Dict[str, Any]) -> Dict[str, Any]:
    """
    Validate a single training example.
    
    Checks:
    - Correct message format (3 messages)
    - Required roles present (system, user, assistant)
    - Content length requirements
    - <preferences> JSON present in assistant response
    - Valid category
    
    Args:
        example: A training example dict with 'messages' key
        
    Returns:
        Dict with is_valid (bool), errors (list), and warnings (list)
    """
    errors: List[str] = []
    warnings: List[str] = []
    
    messages = example.get("messages", [])
    
    # Check message structure
    if len(messages) != 3:
        errors.append(f"Expected 3 messages, got {len(messages)}")
        return {"is_valid": False, "errors": errors, "warnings": warnings}
    
    # Check roles
    roles = [m.get("role") for m in messages]
    if roles != ["system", "user", "assistant"]:
        errors.append(f"Expected roles [system, user, assistant], got {roles}")
    
    # Check content lengths
    for msg in messages:
        if len(msg.get("content", "").strip()) < 10:
            errors.append(f"Message too short: {msg.get('role')}")
    
    # Check for <preferences> JSON in assistant response
    assistant_content = messages[2].get("content", "")
    if "<preferences>" not in assistant_content:
        errors.append("Missing <preferences> JSON in assistant response")
    elif "</preferences>" not in assistant_content:
        errors.append("Missing closing </preferences> tag")
    else:
        # Try to parse the preferences JSON
        try:
            start = assistant_content.index("<preferences>") + len("<preferences>")
            end = assistant_content.index("</preferences>")
            prefs_json = assistant_content[start:end]
            json.loads(prefs_json)
        except (ValueError, json.JSONDecodeError) as e:
            errors.append(f"Invalid <preferences> JSON: {e}")
    
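    # Check assistant response quality
    if len(assistant_content) < 100:
        warnings.append("Assistant response seems short (<100 chars)")
    if len(assistant_content) > 3000:
        warnings.append("Assistant response very long (>3000 chars)")
    
    # Check category (set by load_training_data); unknown values only produce a warning
    category = example.get("category")
    if category is not None and category not in CATEGORIES:
        warnings.append(f"Unknown category: {category}")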
    
    return {
        "is_valid": len(errors) == 0,
        "errors": errors,
        "warnings": warnings,
    }

def validate_dataset(examples: List[Dict[str, Any]]) -> Dict[str, Any]:
    """
    Validate entire dataset.
    
    Checks:
    - Minimum example count
    - Category distribution
    - Individual example validation
    - <preferences> JSON format
    
    Args:
        examples: List of training example dicts
        
    Returns:
        Dict with validation results including total, valid, invalid counts,
        errors list, warnings list, and category_distribution dict
    """
    results: Dict[str, Any] = {
        "total": len(examples),
        "valid": 0,
        "invalid": 0,
        "errors": [],
        "warnings": [],
        "category_distribution": {},
    }
    
    # Validate each example
    for i, ex in enumerate(examples):
        validation = validate_example(ex)
        if validation["is_valid"]:
            results["valid"] += 1
        else:
            results["invalid"] += 1
            for error in validation["errors"]:
                results["errors"].append(f"Example {i}: {error}")
        
        for warning in validation["warnings"]:
            results["warnings"].append(f"Example {i}: {warning}")
        
        # Track category distribution
        cat = ex.get("category", "unknown")
        results["category_distribution"][cat] = results["category_distribution"].get(cat, 0) + 1
    
    return results

print("✅ Validation functions defined")

In [None]:
# Validate loaded examples

validation_results = validate_dataset(ALL_EXAMPLES)

print("📋 DATASET VALIDATION RESULTS")
print("="*70)
print(f"   Total Examples: {validation_results['total']}")
print(f"   Valid: {validation_results['valid']}")
print(f"   Invalid: {validation_results['invalid']}")

print("\n📊 Category Distribution:")
for cat, count in sorted(validation_results['category_distribution'].items()):
    target = CATEGORIES.get(cat, {}).get('target_count', '?')
    pct = count / validation_results['total'] * 100 if validation_results['total'] > 0 else 0
    print(f"   {cat:<25} {count:>3} ({pct:.0f}%)  target: {target}")

if validation_results['errors']:
    print("\n❌ Errors (showing first 5):")
    for error in validation_results['errors'][:5]:
        print(f"   {error}")

if validation_results['warnings']:
    print("\n⚠️ Warnings (showing first 5):")
    for warning in validation_results['warnings'][:5]:
        print(f"   {warning}")

---

## Part 6: Split and Save Dataset

In [None]:
def prepare_final_dataset(examples: List[Dict[str, Any]]) -> DatasetDict:
    """
    Prepare the final dataset with train/val/test splits.
    
    This function:
    1. Shuffles the examples
    2. Splits into train (80%), validation (10%), test (10%)
    3. Converts to Hugging Face Dataset format
    
    Args:
        examples: List of example dicts with 'messages' and 'category' keys
        
    Returns:
        DatasetDict with train, validation, and test splits
    """
    # Shuffle
    shuffled = examples.copy()
    random.seed(42)  # Reproducibility
    random.shuffle(shuffled)
    
    # Calculate split indices
    n = len(shuffled)
    train_end = int(n * DATASET_CONFIG["train_split"])
    val_end = train_end + int(n * DATASET_CONFIG["val_split"])
    
    train_examples = shuffled[:train_end]
    val_examples = shuffled[train_end:val_end]
    test_examples = shuffled[val_end:]
    
    # Convert to dict format for Dataset
    def examples_to_dict(exs: List[Dict[str, Any]]) -> Dict[str, List]:
        return {
            "messages": [ex["messages"] for ex in exs],
            "category": [ex.get("category", "unknown") for ex in exs],
        }
    
    # Create DatasetDict
    dataset_dict = DatasetDict({
        "train": Dataset.from_dict(examples_to_dict(train_examples)),
        "validation": Dataset.from_dict(examples_to_dict(val_examples)),
        "test": Dataset.from_dict(examples_to_dict(test_examples)),
    })
    
    print("📊 Dataset Splits:")
    print(f"   Train: {len(train_examples)} examples")
    print(f"   Validation: {len(val_examples)} examples")
    print(f"   Test: {len(test_examples)} examples")
    
    return dataset_dict

print("✅ Dataset preparation function defined")

In [None]:
# Prepare and save the dataset

# Use the loaded examples from external data files
all_examples = ALL_EXAMPLES

# Prepare dataset with train/val/test splits
dataset = prepare_final_dataset(all_examples)

# Save locally
dataset_path = DATA_DIR / "troscha-dataset"
dataset.save_to_disk(str(dataset_path))

print(f"\n✅ Dataset saved to: {dataset_path}")

# Also save as JSON for inspection
json_path = DATA_DIR / "training_data.json"
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(all_examples, f, indent=2)

print(f"✅ JSON backup saved to: {json_path}")

In [None]:
# Verify saved dataset

from datasets import load_from_disk

loaded_dataset = load_from_disk(str(dataset_path))

print("📊 LOADED DATASET")
print("="*70)
print(loaded_dataset)

print("\n📝 Sample Training Example:")
sample = loaded_dataset["train"][0]
print(f"   Category: {sample['category']}")
print(f"   User: {sample['messages'][1]['content'][:80]}...")
print(f"   Assistant: {sample['messages'][2]['content'][:100]}...")
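
A plain random split does not guarantee that small categories (for example, the 12 `brand_identity` examples) show up in the validation and test sets. One possible alternative, sketched here in pure Python, is to split each category separately. The function `stratified_split` is a hypothetical helper and not part of the lab's pipeline, and taking at least one validation example per category is a design choice made for this sketch.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Example = Dict[str, Any]

def stratified_split(
    examples: List[Example],
    train_frac: float = 0.8,
    val_frac: float = 0.1,
    seed: int = 42,
) -> Tuple[List[Example], List[Example], List[Example]]:
    """Split each category separately to keep category proportions similar."""
    rng = random.Random(seed)  # local RNG; the global random state is untouched
    by_category: Dict[str, List[Example]] = defaultdict(list)
    for ex in examples:
        by_category[ex.get("category", "unknown")].append(ex)

    train: List[Example] = []
    val: List[Example] = []
    test: List[Example] = []
    for category in sorted(by_category):  # sorted for reproducible ordering
        group = by_category[category][:]
        rng.shuffle(group)
        n = len(group)
        train_end = int(n * train_frac)
        # Take at least one validation example per category when possible.
        val_end = train_end + max(1, int(n * val_frac))
        train.extend(group[:train_end])
        val.extend(group[train_end:val_end])
        test.extend(group[val_end:])  # remainder; may be empty for tiny groups

    # Shuffle again so examples are not grouped by category within a split.
    for split in (train, val, test):
        rng.shuffle(split)
    return train, val, test

# Toy data: two categories with ten examples each.
toy = [{"category": c, "messages": []} for c in ("a", "b") for _ in range(10)]
train, val, test = stratified_split(toy)
print(len(train), len(val), len(test))
```

The three lists could then be passed through the same `Dataset.from_dict` conversion used in `prepare_final_dataset`. Very small categories may still end up with an empty test slice, so it is worth printing the per-split category counts after splitting.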

---

## Common Issues

### Issue 1: Duplicate Questions
**Symptom:** Similar questions phrased slightly differently  
**Fix:** Review for semantic duplicates before finalizing
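
One lightweight way to surface candidates for that review is a string-similarity pass over the user questions, sketched below with the standard library's `difflib`. This only catches surface-level rewordings, not true semantic duplicates. The helper names and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Any, Dict, List, Tuple

def first_user_message(example: Dict[str, Any]) -> str:
    """Return the lowercased, whitespace-normalized first user message, or ''."""
    for msg in example.get("messages", []):
        if msg.get("role") == "user":
            return " ".join((msg.get("content") or "").lower().split())
    return ""

def find_near_duplicates(
    examples: List[Dict[str, Any]], threshold: float = 0.85
) -> List[Tuple[int, int, float]]:
    """Return (index_a, index_b, similarity) for question pairs above threshold."""
    questions = [first_user_message(ex) for ex in examples]
    pairs: List[Tuple[int, int, float]] = []
    for (i, a), (j, b) in combinations(enumerate(questions), 2):
        if not a or not b:
            continue  # skip examples without a user message
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 2)))
    return pairs

demo = [
    {"messages": [{"role": "user", "content": "What's the difference between Firu and Giru?"}]},
    {"messages": [{"role": "user", "content": "What is the difference between Firu and Giru?"}]},
    {"messages": [{"role": "user", "content": "How should I store matcha at home?"}]},
]
for i, j, score in find_near_duplicates(demo):
    print(f"Examples {i} and {j} look similar (ratio {score})")
```

The comparison is quadratic in the number of examples, which is fine for a few hundred items. Flagged pairs are only candidates: two questions can be worded alike and still deserve different answers.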

### Issue 2: Inconsistent Formatting
**Symptom:** Some responses use lists, others don't  
**Fix:** Establish a consistent style guide for responses

### Issue 3: Factual Errors
**Symptom:** Incorrect information in responses  
**Fix:** Verify facts against authoritative sources

### Issue 4: Responses Too Short
**Symptom:** One-sentence answers  
**Fix:** Expand with details, examples, and practical tips

---

## Metrics & Outputs

| Metric | Target | Actual |
|--------|--------|--------|
| Total Examples | 300 | [Your value] |
| Training Split | ~240 | [Your value] |
| Validation Split | ~30 | [Your value] |
| Test Split | ~30 | [Your value] |
| Categories Covered | 12 | [Your value] |
| Validation Pass Rate | 100% | [Your value] |

**Expected ranges:**
- Training examples: 230-250
- Validation examples: 25-35
- Test examples: 25-35
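
These sizes follow from the truncating arithmetic in `prepare_final_dataset`: train and validation use `int(n * fraction)`, and the test split receives whatever remains. The small sketch below reproduces that calculation for a few dataset sizes. `split_sizes` is a hypothetical helper written for this illustration.

```python
from typing import Tuple

def split_sizes(n: int, train_frac: float = 0.8, val_frac: float = 0.1) -> Tuple[int, int, int]:
    """Mirror the index arithmetic used for the train/validation/test split."""
    train = int(n * train_frac)   # truncates toward zero
    val = int(n * val_frac)       # truncates toward zero
    test = n - train - val        # test absorbs the rounding remainder
    return train, val, test

for n in (300, 295, 305):
    train, val, test = split_sizes(n)
    print(f"n={n}: train={train}, val={val}, test={test}")
```

Because of the truncation, the test split can end up slightly larger than the validation split when the total is not a multiple of 10.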

---

## Phase Complete!

You've achieved:
- ✅ Understood the messages format for chat training
- ✅ Loaded pre-built training examples from external data files
- ✅ Implemented data validation with `<preferences>` JSON checking
- ✅ Saved dataset in Hugging Face format with train/val/test splits

**Next:** [Lab 4.6.8.2: QLoRA Fine-Tuning](./lab-4.6.8.2-qlora-finetuning.ipynb)

---

In [None]:
# Cleanup
import gc
gc.collect()

print("✅ Phase 1 Complete!")
print("\n🎯 Next Steps:")
print("   1. Review your dataset and add more examples if needed")
print("   2. Ensure balanced category distribution")
print("   3. Proceed to Lab 4.6.8.2 for QLoRA fine-tuning")
print(f"\n   Dataset location: {dataset_path}")