# Lab 4.6.8.1: Dataset Preparation

**Capstone Option E:** Browser-Deployed Fine-Tuned LLM (Matcha Expert)  
**Phase:** 1 of 6  
**Time:** 4-6 hours  
**Difficulty:** ⭐⭐⭐

---

## Phase Objectives

By completing this phase, you will:
- [ ] Understand the messages format for chat model training
- [ ] Create 150-200 high-quality training examples
- [ ] Implement data validation and quality checks
- [ ] Split data into train/validation/test sets
- [ ] Save dataset in Hugging Face format

---

## Phase Checklist

- [ ] Environment setup complete
- [ ] Dataset format understood
- [ ] Training examples created (150+)
- [ ] Validation examples created (20+)
- [ ] Test examples created (20+)
- [ ] Quality validation passed
- [ ] Dataset saved locally
- [ ] (Optional) Dataset pushed to Hub

---

## Why This Matters

**Quality over Quantity** - For narrow, domain-specific fine-tuning, 150 excellent examples can outperform 10,000 mediocre ones.

| Dataset Size | Quality Needed | Use Case |
|--------------|----------------|----------|
| 50-100 | Very High | Narrow domain adaptation |
| 100-500 | High | **Our target: Domain expertise** |
| 500-5000 | Medium | Broader capabilities |
| 5000+ | Mixed OK | General instruction tuning |

Think of it like training a specialist vs. a generalist:
- **Specialist** (our goal): Deep knowledge in matcha → fewer, high-quality examples
- **Generalist**: Broad knowledge → many diverse examples

---

## ELI5: What Makes Good Training Data?

> **Imagine you're writing a study guide for a matcha exam.**
>
> Each training example is like a flashcard:
> - **Front** (user question): "What's the difference between ceremonial and culinary grade?"
> - **Back** (expert answer): A detailed, accurate, helpful response
>
> **Good flashcards:**
> - Cover all important topics (grades, preparation, health, culture)
> - Have clear, specific questions
> - Have detailed, accurate answers
> - Include the reasoning, not just facts
>
> **Bad flashcards:**
> - All ask the same thing differently ("what is matcha?" x50)
> - Have one-word answers ("green")
> - Contain wrong information
> - Are vague or confusing

---

## Part 1: Environment Setup

In [None]:
# Environment Setup
import os
import sys
import json
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Any, Optional
from dataclasses import dataclass, field, asdict
import random

# Dataset library
from datasets import Dataset, DatasetDict

print("🍵 PHASE 1: DATASET PREPARATION")
print("="*70)
print(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
print(f"Working Directory: {os.getcwd()}")

In [None]:
# Project Configuration
PROJECT_DIR = Path("./matcha-expert")
DATA_DIR = PROJECT_DIR / "data"
DATA_DIR.mkdir(parents=True, exist_ok=True)

# Dataset configuration
DATASET_CONFIG = {
    "name": "matcha-expert-dataset",
    "version": "1.0.0",
    "min_train_examples": 150,
    "min_val_examples": 20,
    "min_test_examples": 20,
    "train_split": 0.8,
    "val_split": 0.1,
    "test_split": 0.1,
}

# System prompt for the matcha expert
SYSTEM_PROMPT = """You are a matcha tea expert with deep knowledge of Japanese tea culture, preparation methods, health benefits, and culinary applications. You provide accurate, helpful information about matcha grades, brewing techniques, traditional ceremonies, and modern recipes. You're passionate about quality matcha and help users make informed choices."""

print(f"📁 Project Directory: {PROJECT_DIR}")
print(f"📊 Dataset Name: {DATASET_CONFIG['name']}")
print(f"📝 Target Examples: {DATASET_CONFIG['min_train_examples']}+ training")
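
The split ratios and the per-split minimums in `DATASET_CONFIG` interact: an 80/10/10 split of 150 examples leaves only 15 each for validation and test, short of the 20-example minimums. The sketch below works through that arithmetic. It assumes split sizes are computed by truncating `total * ratio`, with the training split taking the remainder; the helper names `split_sizes` and `min_total_for` are hypothetical and not part of the lab code.

```python
# Ratios and minimums copied from DATASET_CONFIG above.
SPLITS = {"train": 0.8, "val": 0.1, "test": 0.1}
MINIMUMS = {"train": 150, "val": 20, "test": 20}

def split_sizes(total: int) -> dict:
    """Sizes for the configured split; train absorbs any rounding remainder."""
    val = int(total * SPLITS["val"])
    test = int(total * SPLITS["test"])
    return {"train": total - val - test, "val": val, "test": test}

def min_total_for(minimums: dict) -> int:
    """Smallest total whose split meets every per-split minimum."""
    total = sum(minimums.values())
    while any(split_sizes(total)[name] < minimums[name] for name in minimums):
        total += 1
    return total

print(split_sizes(150))
print(min_total_for(MINIMUMS))
```

Under these assumptions, meeting all three minimums with an 80/10/10 split takes 200 examples in total, which is worth keeping in mind when deciding how many examples to write.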

---

## Part 2: Understanding the Messages Format

Modern chat models expect data in a specific **messages format**:

In [None]:
# The Messages Format

# Each training example is a conversation with three parts:
example_format = {
    "messages": [
        {
            "role": "system",
            "content": "You are a matcha tea expert..."  # Sets the persona
        },
        {
            "role": "user", 
            "content": "What is ceremonial grade matcha?"  # The question
        },
        {
            "role": "assistant",
            "content": "Ceremonial grade matcha is..."  # The ideal response
        }
    ]
}

print("📋 MESSAGES FORMAT")
print("="*70)
print(json.dumps(example_format, indent=2))
print("\n💡 Each example teaches the model how to respond as a matcha expert")
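
Chat datasets in this format are commonly stored as JSON Lines, with one conversation per line. The sketch below is an illustration rather than part of the lab pipeline: it serializes one record, reads it back, and confirms that the role order survives the round trip. The record contents are placeholders.

```python
import json

record = {
    "messages": [
        {"role": "system", "content": "You are a matcha tea expert..."},
        {"role": "user", "content": "What is ceremonial grade matcha?"},
        {"role": "assistant", "content": "Ceremonial grade matcha is..."},
    ]
}

# One JSON object per line; ensure_ascii=False keeps characters such as "°C" readable.
line = json.dumps(record, ensure_ascii=False)

# Reading the line back should reproduce the original structure exactly.
restored = json.loads(line)
roles = [m["role"] for m in restored["messages"]]
print(roles)
```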

In [None]:
# Data structures for our dataset

@dataclass
class Message:
    """A single message in a conversation."""
    role: str  # "system", "user", or "assistant"
    content: str
    
    def to_dict(self) -> Dict[str, str]:
        return {"role": self.role, "content": self.content}

@dataclass
class TrainingExample:
    """A complete training example."""
    messages: List[Message]
    category: str  # For tracking topic coverage
    difficulty: str = "medium"  # easy, medium, hard
    
    def to_dict(self) -> Dict[str, Any]:
        return {
            "messages": [m.to_dict() for m in self.messages],
            "category": self.category,
            "difficulty": self.difficulty,
        }

def create_example(
    user_query: str,
    assistant_response: str,
    category: str,
    difficulty: str = "medium",
    system_prompt: str = SYSTEM_PROMPT
) -> TrainingExample:
    """
    Create a training example with the standard format.
    
    Args:
        user_query: The user's question
        assistant_response: The ideal expert response
        category: Topic category (grades, preparation, health, etc.)
        difficulty: Question difficulty (easy, medium, hard)
        system_prompt: The system prompt (defaults to SYSTEM_PROMPT)
    
    Returns:
        TrainingExample ready for dataset
    """
    return TrainingExample(
        messages=[
            Message(role="system", content=system_prompt),
            Message(role="user", content=user_query),
            Message(role="assistant", content=assistant_response),
        ],
        category=category,
        difficulty=difficulty,
    )

print("✅ Data structures defined")
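
Because `Message` and `TrainingExample` are plain dataclasses, the standard library's `dataclasses.asdict` produces the same nested dictionary as the hand-written `to_dict` methods. The sketch below redefines trimmed-down copies of both classes so that it runs on its own; the sample question and answer are placeholders.

```python
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Message:
    role: str
    content: str

@dataclass
class TrainingExample:
    messages: List[Message]
    category: str
    difficulty: str = "medium"

example = TrainingExample(
    messages=[
        Message("system", "You are a matcha tea expert..."),
        Message("user", "What is koicha?"),
        Message("assistant", "Koicha is thick tea made with more powder and less water."),
    ],
    category="preparation",
)

# asdict recurses into nested dataclasses, including those held in lists.
as_dict = asdict(example)
print(as_dict["messages"][1])
print(as_dict["difficulty"])
```

Keeping explicit `to_dict` methods, as the lab does, still makes sense when the serialized form should differ from the in-memory fields.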

---

## Part 3: Topic Categories

We'll create examples across 8 categories to ensure comprehensive coverage:

In [None]:
# Topic Categories and Target Distribution

CATEGORIES = {
    "grades": {
        "description": "Matcha grades and quality levels",
        "target_count": 25,
        "topics": [
            "Ceremonial vs culinary grade",
            "Premium and cooking grades",
            "Grade indicators (color, texture)",
            "Price vs quality relationship",
            "First harvest vs later harvests",
        ],
    },
    "preparation": {
        "description": "How to prepare matcha",
        "target_count": 25,
        "topics": [
            "Traditional usucha (thin tea)",
            "Koicha (thick tea)",
            "Water temperature",
            "Whisking technique (chasen)",
            "Common mistakes and fixes",
            "Equipment (chawan, chasen, chashaku)",
        ],
    },
    "health": {
        "description": "Health benefits and nutrition",
        "target_count": 20,
        "topics": [
            "Antioxidants (EGCG, catechins)",
            "L-theanine and calm focus",
            "Caffeine content comparison",
            "Chlorophyll and detox",
            "Daily consumption recommendations",
        ],
    },
    "culture": {
        "description": "Japanese tea culture and history",
        "target_count": 20,
        "topics": [
            "Tea ceremony (chanoyu)",
            "History of matcha in Japan",
            "Zen Buddhism connection",
            "Seasonal considerations",
            "Etiquette and traditions",
        ],
    },
    "recipes": {
        "description": "Matcha recipes and culinary uses",
        "target_count": 20,
        "topics": [
            "Matcha latte preparation",
            "Baking with matcha",
            "Smoothies and cold drinks",
            "Desserts (ice cream, mochi)",
            "Savory applications",
        ],
    },
    "quality": {
        "description": "Quality assessment and sourcing",
        "target_count": 20,
        "topics": [
            "Color indicators",
            "Aroma characteristics",
            "Texture and fineness",
            "Origin regions (Uji, Nishio)",
            "Authenticity verification",
        ],
    },
    "storage": {
        "description": "Storage and freshness",
        "target_count": 10,
        "topics": [
            "Proper storage methods",
            "Signs of oxidation",
            "Shelf life expectations",
            "Refrigeration guidelines",
        ],
    },
    "buying": {
        "description": "Buying guide and recommendations",
        "target_count": 10,
        "topics": [
            "What to look for when buying",
            "Price expectations by grade",
            "Red flags and scams",
            "Reputable sources",
        ],
    },
}

print("📊 TOPIC DISTRIBUTION")
print("="*70)
total = 0
for cat, info in CATEGORIES.items():
    print(f"   {cat:<15} {info['target_count']:>3} examples  - {info['description']}")
    total += info['target_count']
print("-"*70)
print(f"   {'TOTAL':<15} {total:>3} examples")
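
A quick consistency check can confirm that the per-category targets add up to the training minimum and show each category's share of the dataset. This is a sketch: `TARGET_COUNTS` repeats the `target_count` values from `CATEGORIES` so the block runs on its own, and `MIN_TRAIN_EXAMPLES` mirrors `DATASET_CONFIG["min_train_examples"]`.

```python
TARGET_COUNTS = {
    "grades": 25, "preparation": 25, "health": 20, "culture": 20,
    "recipes": 20, "quality": 20, "storage": 10, "buying": 10,
}
MIN_TRAIN_EXAMPLES = 150  # mirrors DATASET_CONFIG["min_train_examples"]

total_target = sum(TARGET_COUNTS.values())
shares = {cat: n / total_target for cat, n in TARGET_COUNTS.items()}

# Fail loudly if the category plan cannot reach the configured minimum.
assert total_target >= MIN_TRAIN_EXAMPLES, "category targets fall short of the minimum"

for cat, share in shares.items():
    print(f"{cat:<12} {share:.1%}")
```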

---

## Part 4: Example Training Data

Here are seed examples for each category. You'll expand these to reach 150+ examples.

In [None]:
# Seed Examples - Grades Category

GRADES_EXAMPLES = [
    create_example(
        user_query="What's the difference between ceremonial and culinary grade matcha?",
        assistant_response="""Ceremonial and culinary grade matcha differ significantly in quality, flavor, and intended use:

**Ceremonial Grade:**
- Made from the youngest, most tender tea leaves from the first spring harvest (shincha)
- Stems and veins are removed; only the leaf flesh (tencha) is stone-ground
- Vibrant emerald green color
- Smooth, umami-rich flavor with natural sweetness and minimal bitterness
- Designed for drinking straight as usucha or koicha
- Higher price point ($25-50+ per 30g)

**Culinary Grade:**
- Uses leaves from later harvests or includes more stem material
- Stronger, more astringent flavor that holds up in recipes
- Color is often more olive or yellowish-green
- Perfect for lattes, baked goods, and ice cream
- More affordable ($10-20 per 30g)

**Which to choose:** For drinking straight, always use ceremonial grade. For cooking or blending with strong flavors like milk or sugar, culinary grade works perfectly and saves money.""",
        category="grades",
        difficulty="easy"
    ),
    create_example(
        user_query="What does 'first harvest' mean for matcha quality?",
        assistant_response="""First harvest (ichibancha or shincha) refers to the initial tea picking of the year, typically in late April to early May, and it's crucial for premium matcha quality.

**Why First Harvest Matters:**

1. **Nutrient Concentration**: After winter dormancy, tea plants store nutrients in their roots. The first spring leaves contain the highest concentration of L-theanine, amino acids, and antioxidants.

2. **Tenderness**: First harvest leaves are the youngest and most tender, creating a smoother, less astringent powder.

3. **Chlorophyll Content**: Spring shading combined with new growth produces maximum chlorophyll, giving that prized vibrant green color.

4. **Umami Profile**: Higher amino acid content means more of that savory, sweet umami flavor characteristic of premium matcha.

**Later Harvests:**
- Second harvest (nibancha): June-July, more astringent
- Third harvest (sanbancha): August, even more bitter
- Autumn harvest: September-October, typically lowest grade

True ceremonial grade matcha is exclusively made from first harvest leaves. If a product doesn't specify, it's likely from later harvests.""",
        category="grades",
        difficulty="medium"
    ),
    create_example(
        user_query="How can I tell if matcha is high quality just by looking at it?",
        assistant_response="""Visual inspection can reveal a lot about matcha quality. Here's what to look for:

**Color (Most Important):**
- **High quality**: Vibrant, bright emerald or jade green
- **Low quality**: Dull, olive, yellowish, or brownish green
- The brighter and more vivid the green, the better the chlorophyll content and freshness

**Texture:**
- **High quality**: Extremely fine, silky powder (5-10 microns)
- **Low quality**: Coarse, gritty, or clumpy
- Rub between fingers - premium matcha feels like eyeshadow or talc

**Uniformity:**
- **High quality**: Consistent color throughout, no specks or variations
- **Low quality**: Uneven coloring, visible darker or lighter particles

**Luster:**
- **High quality**: Slight sheen or luminosity when light hits it
- **Low quality**: Flat, matte appearance

**Quick Test:** Put a small amount on white paper. Premium matcha should look almost fluorescent green. If it looks flat olive or khaki, it's lower grade.

Note: Color can fade with oxidation, so even good matcha looks duller if stored improperly.""",
        category="grades",
        difficulty="easy"
    ),
]

print(f"✅ Created {len(GRADES_EXAMPLES)} grade examples")

In [None]:
# Seed Examples - Preparation Category

PREPARATION_EXAMPLES = [
    create_example(
        user_query="What's the correct water temperature for making matcha?",
        assistant_response="""Water temperature is crucial for matcha - using the wrong temperature is one of the most common mistakes.

**Optimal Temperature: 70-80°C (158-176°F)**

**Why Not Boiling Water:**
- Boiling water (100°C) destroys delicate amino acids, especially L-theanine
- Creates excessive bitterness by over-extracting catechins
- Damages the vibrant green color
- Loses the subtle, sweet umami notes

**Temperature Guide by Style:**
- **Usucha (thin tea)**: 75-80°C (167-176°F)
- **Koicha (thick tea)**: 70-75°C (158-167°F) - even cooler to prevent bitterness with more powder

**How to Get the Right Temperature:**
1. Boil water, then let it cool for 2-3 minutes
2. Pour into another vessel (loses ~10°C per transfer)
3. Or use a variable temperature kettle set to 80°C

**Pro Tip:** If your matcha tastes bitter despite using good quality powder, the water was probably too hot. This is almost always the issue for beginners.""",
        category="preparation",
        difficulty="easy"
    ),
    create_example(
        user_query="How do I whisk matcha properly? Mine always has clumps.",
        assistant_response="""Clumpy matcha is frustrating but completely fixable! Here's the proper whisking technique:

**Step-by-Step Method:**

1. **Sift First (Critical Step)**
   - Push matcha through a fine mesh sieve into your bowl
   - This breaks up clumps BEFORE they hit water
   - Use about 1-2g (½-1 tsp) for usucha

2. **Add a Splash of Water First**
   - Add just 15-20ml of water initially
   - Create a paste by pressing the matcha against the bowl
   - This pre-dissolves the powder

3. **Add Remaining Water**
   - Add the rest (60-70ml total for usucha)
   - Water at 75-80°C

4. **Whisk Correctly**
   - Use a chasen (bamboo whisk) or electric frother
   - Whisk in a **W or M motion**, NOT circular
   - Keep the whisk moving fast from your wrist
   - Whisk for 15-20 seconds until frothy

**Common Mistakes:**
- Circular stirring (doesn't break clumps)
- Skipping the sifting step
- Not enough water velocity
- Scraping the bottom too hard (damages the whisk)

A properly whisked matcha should have a layer of fine, creamy foam on top with no visible clumps.""",
        category="preparation",
        difficulty="medium"
    ),
    create_example(
        user_query="What equipment do I need to make traditional matcha?",
        assistant_response="""Traditional matcha preparation uses specific Japanese tools, each serving an important purpose:

**Essential Equipment:**

1. **Chasen (茶筅) - Bamboo Whisk** - $15-30
   - 80-120 prongs carved from a single piece of bamboo
   - Creates the signature froth
   - More prongs = finer foam
   - Not replaceable with regular whisks

2. **Chawan (茶碗) - Tea Bowl** - $20-100+
   - Wide, open design for easy whisking
   - Typically 4-5 inches diameter
   - Any wide ceramic bowl works as a substitute

3. **Chashaku (茶杓) - Bamboo Scoop** - $5-15
   - Traditional measuring spoon
   - One scoop ≈ 1g of matcha
   - A small regular spoon works too

4. **Furui (篩) - Sieve** - $10-20
   - Fine mesh strainer
   - Essential for clump-free matcha
   - Any fine tea strainer works

**Nice to Have:**
- Chasen holder (keeps whisk shape)
- Yuzamashi (cooling bowl for water)
- Natsume (tea caddy for storage)

**Budget Alternative:** Start with just a bamboo whisk ($15) and a wide soup bowl. The chasen is the one tool that truly can't be substituted - a fork or regular whisk won't create the proper foam.""",
        category="preparation",
        difficulty="easy"
    ),
]

print(f"✅ Created {len(PREPARATION_EXAMPLES)} preparation examples")

In [None]:
# Seed Examples - Health Category

HEALTH_EXAMPLES = [
    create_example(
        user_query="How much caffeine is in matcha compared to coffee?",
        assistant_response="""Matcha typically contains less caffeine per serving than brewed coffee, and the experience is noticeably different:

**Caffeine Content Comparison:**

| Drink | Caffeine per serving |
|-------|---------------------|
| Matcha (1g serving) | 25-35mg |
| Matcha (2g ceremonial) | 50-70mg |
| Espresso (30ml) | 63mg |
| Brewed coffee (240ml) | 95mg |
| Green tea (steeped) | 25-50mg |

**Why Matcha Feels Different:**

1. **L-Theanine Effect**: Matcha contains high levels of L-theanine, an amino acid that promotes calm focus. It modulates how caffeine affects you, providing alertness without jitters.

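2. **Steadier Energy**: Many drinkers report that matcha's lift feels smoother and longer-lasting than coffee's. This is usually attributed to L-theanine, though the often-quoted 4-6 hour "slow release" is anecdotal rather than firmly established.

3. **Gentler for Some**: Some people find matcha easier on the stomach than coffee, but it still contains caffeine and tannins and can cause discomfort, especially on an empty stomach.

**What to Expect:**
- Calm, focused alertness (not the coffee "buzz")
- Energy that many people describe as lasting several hours
- Fewer jitters for most people
- A milder come-down than coffee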

Many people who can't tolerate coffee switch to matcha for this gentler energy profile.""",
        category="health",
        difficulty="easy"
    ),
    create_example(
        user_query="What are the health benefits of matcha?",
        assistant_response="""Matcha offers significant health benefits because you consume the entire tea leaf, not just a water extract:

**Key Health Benefits:**

1. **Antioxidant Powerhouse**
   - Contains EGCG (epigallocatechin gallate), a potent catechin
   - One widely cited 2003 study measured up to 137x more EGCG than in a particular bagged green tea; actual levels vary by product and preparation
   - Helps neutralize free radicals and reduce oxidative stress

2. **Calm Focus & Mental Clarity**
   - High L-theanine content promotes alpha brain waves
   - Enhances focus without causing anxiety
   - Improves memory and cognitive function

3. **Metabolism Support**
   - Catechins may boost metabolic rate
   - Helps with fat oxidation during exercise
   - Supports healthy weight management

4. **Chlorophyll and "Detox" Claims**
   - High chlorophyll content (from shading)
   - Chlorophyll is often marketed as detoxifying, but evidence in humans is limited
   - The liver and kidneys handle the body's actual detoxification

5. **Heart Health**
   - May help lower LDL cholesterol
   - Supports healthy blood pressure
   - Anti-inflammatory properties

**Important Notes:**
- Benefits come from regular consumption (1-2 cups daily)
- Quality matters - premium matcha has higher nutrient content
- Matcha is a supplement to a healthy lifestyle, not a cure-all
- Consult a doctor if you have health concerns or take medications""",
        category="health",
        difficulty="easy"
    ),
]

print(f"✅ Created {len(HEALTH_EXAMPLES)} health examples")

In [None]:
# Seed Examples - Culture, Recipes, Quality, Storage, Buying

CULTURE_EXAMPLES = [
    create_example(
        user_query="What is the Japanese tea ceremony?",
        assistant_response="""The Japanese tea ceremony (chanoyu or sadō) is a ritualized practice of preparing and serving matcha, embodying core principles of Japanese aesthetics and philosophy.

**The Four Principles (Rikyu's Philosophy):**

1. **Wa (和) - Harmony**: Between guests, host, and nature
2. **Kei (敬) - Respect**: For all participants and objects
3. **Sei (清) - Purity**: Physical and spiritual cleanliness
4. **Jaku (寂) - Tranquility**: Inner peace and calm

**What Happens:**
- Guests enter through a small door (nijiriguchi), symbolizing humility
- The host prepares matcha with precise, graceful movements
- Each utensil has meaning and is treated with reverence
- Guests admire the scroll, flowers, and tea bowl
- Koicha (thick tea) shared from one bowl, or usucha (thin tea) served individually

**Historical Context:**
- Zen monks brought tea from China in the 12th century
- Sen no Rikyu codified the ceremony in the 16th century
- Originally practiced by samurai and nobility
- Today practiced worldwide as a meditative art

A full formal ceremony can last 4 hours, but casual tea gatherings (chakai) are also common. You don't need to master the ceremony to enjoy matcha - it's about mindfulness and appreciation.""",
        category="culture",
        difficulty="medium"
    ),
]

RECIPE_EXAMPLES = [
    create_example(
        user_query="How do I make a matcha latte at home?",
        assistant_response="""A great matcha latte is easy to make at home - here's the proper technique:

**Ingredients:**
- 1-2 tsp (2-4g) culinary or premium grade matcha
- 2 tbsp (30ml) hot water (75-80°C)
- 1 cup (240ml) milk of choice
- Sweetener to taste (optional)

**Instructions:**

1. **Sift the matcha** into a bowl or cup to remove clumps

2. **Add hot water** (not boiling!) and whisk until smooth paste forms
   - Use a bamboo whisk, milk frother, or small regular whisk
   - No lumps should remain

3. **Heat and froth your milk**
   - Steam or microwave to about 65°C (150°F)
   - Froth with a frother for that creamy texture

4. **Combine**
   - Pour frothed milk over the matcha paste
   - Stir gently to combine
   - Add sweetener if desired (honey, maple syrup, or simple syrup)

**Pro Tips:**
- Use oat milk for the creamiest texture
- Culinary grade is fine - milk masks subtle flavor differences
- For iced: use less water, pour over ice, add cold milk
- Adjust matcha amount to taste - start with 1 tsp if new to matcha

**Common Mistakes:**
- Adding matcha directly to milk (won't dissolve properly)
- Using boiling water (makes it bitter)
- Skipping the sifting step (clumps!)""",
        category="recipes",
        difficulty="easy"
    ),
]

QUALITY_EXAMPLES = [
    create_example(
        user_query="What's special about Uji matcha?",
        assistant_response="""Uji, located in Kyoto Prefecture, is considered the birthplace and gold standard of Japanese matcha production.

**Why Uji is Famous:**

1. **Historical Significance**
   - Tea cultivation began here in the 12th century
   - Developed the shading technique that defines matcha
   - Home to the oldest tea farms in Japan
   - Sen no Rikyu sourced his tea from Uji

2. **Ideal Climate**
   - Surrounded by mountains creating temperature variations
   - Morning mists protect plants and enhance amino acids
   - Rich, mineral-dense soil from the Uji River
   - Perfect balance of sunlight and shade

3. **Expertise**
   - Centuries of refined cultivation techniques
   - Master tea farmers pass down knowledge
   - Strictest quality standards
   - Traditional stone grinding methods

**Other Notable Regions:**
- **Nishio (Aichi)**: Produces 60% of Japan's matcha by volume
- **Shizuoka**: Known for sencha but also produces matcha
- **Kagoshima**: Southern region with longer growing season

**Note:** "Uji matcha" on a label should mean it was at least processed in Uji. For true Uji-grown matcha, look for "Uji-shi" or check if the brand specifies the farm location.""",
        category="quality",
        difficulty="medium"
    ),
]

STORAGE_EXAMPLES = [
    create_example(
        user_query="How should I store matcha to keep it fresh?",
        assistant_response="""Proper storage is essential because matcha is highly susceptible to oxidation, which degrades color, flavor, and nutrients.

**The Three Enemies of Matcha:**
1. **Oxygen** - Oxidizes and dulls the color
2. **Light** - Breaks down chlorophyll and catechins
3. **Heat** - Accelerates degradation

**Storage Guidelines:**

**Unopened Matcha:**
- Store in refrigerator or freezer
- Keep in original sealed packaging
- Can last 6-12 months in freezer

**Opened Matcha:**
- Transfer to an airtight, opaque container
- Keep in refrigerator (not freezer - condensation risk)
- Use within 4-6 weeks for best quality
- Always close container immediately after use

**Critical Tips:**
- Let refrigerated matcha come to room temperature before opening (prevents condensation)
- Keep away from strong-smelling foods (matcha absorbs odors)
- Don't use a wet spoon
- Small containers are better than large ones (less air exposure)

**Signs of Stale Matcha:**
- Color changed from vibrant green to olive/brown
- Flat, hay-like smell instead of fresh vegetal
- Bitter taste without the sweet notes

Stale matcha is safe to consume but tastes inferior. Use it for baking where the flavor difference is less noticeable.""",
        category="storage",
        difficulty="easy"
    ),
]

BUYING_EXAMPLES = [
    create_example(
        user_query="How much should I expect to pay for good matcha?",
        assistant_response="""Matcha pricing varies widely, and understanding the price-quality relationship helps you make informed choices:

**Price Guide by Grade (per 30g):**

| Grade | Price Range | What You Get |
|-------|-------------|-------------|
| Culinary/Cooking | $8-15 | Good for lattes, baking |
| Premium Culinary | $15-25 | Better lattes, smoothies |
| Daily Ceremonial | $25-40 | Good for drinking straight |
| Premium Ceremonial | $40-80 | Excellent quality |
| Competition Grade | $80-200+ | Top 1%, special occasions |

**Red Flags (Too Cheap):**
- Under $10/30g labeled "ceremonial" - likely mislabeled
- Extremely cheap "organic" matcha - often from China or low quality
- No origin information

**What Affects Price:**
1. Harvest timing (first harvest costs most)
2. Processing method (stone-ground vs. ball-milled)
3. Origin and farm reputation
4. Organic certification (adds 20-30%)
5. Packaging and import costs

**My Recommendation:**
- **Beginners**: Start with a $15-25 premium culinary grade for lattes
- **Drinking straight**: Budget $30-50 for daily drinking quality
- **Special occasions**: $50+ for truly exceptional matcha

Buy from reputable Japanese tea sellers who can verify origin. A good mid-range matcha is better than cheap "ceremonial" grade.""",
        category="buying",
        difficulty="easy"
    ),
]

print("✅ Created additional category examples")

In [None]:
# Combine all seed examples

ALL_SEED_EXAMPLES = (
    GRADES_EXAMPLES +
    PREPARATION_EXAMPLES +
    HEALTH_EXAMPLES +
    CULTURE_EXAMPLES +
    RECIPE_EXAMPLES +
    QUALITY_EXAMPLES +
    STORAGE_EXAMPLES +
    BUYING_EXAMPLES
)

print("📊 SEED EXAMPLES SUMMARY")
print("="*70)
print(f"   Total seed examples: {len(ALL_SEED_EXAMPLES)}")

# Count by category
category_counts = {}
for ex in ALL_SEED_EXAMPLES:
    cat = ex.category
    category_counts[cat] = category_counts.get(cat, 0) + 1

print(f"\n   By Category:")
for cat, count in sorted(category_counts.items()):
    target = CATEGORIES[cat]["target_count"]
    print(f"   {cat:<15} {count:>3} / {target} target")

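# Use the configured target instead of a hard-coded 150, and never report a negative count
remaining = max(0, DATASET_CONFIG["min_train_examples"] - len(ALL_SEED_EXAMPLES))
print(f"\n⚠️  You need to create {remaining} more examples!")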

---

## Part 5: Your Task - Create More Examples

Now it's your turn! Create more examples to reach the 150+ target.

**Tips for Creating Quality Examples:**

1. **Vary the Question Types:**
   - "What is..." (definitions)
   - "How do I..." (procedures)
   - "Why does..." (explanations)
   - "Which should I..." (recommendations)
   - "Is it true that..." (myth-busting)

2. **Include Different Difficulty Levels:**
   - Easy: Basic facts and definitions
   - Medium: Comparisons and processes
   - Hard: Nuanced decisions and expert knowledge

3. **Make Responses Detailed:**
   - Use formatting (lists, headers)
   - Include specific numbers and facts
   - Add practical tips
   - Anticipate follow-up questions
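
When writing many examples by hand, it is easy to rephrase the same question several times without noticing. The sketch below flags near-duplicate questions using the standard library's `difflib.SequenceMatcher`. The helper names and the 0.85 similarity threshold are assumptions chosen for illustration and should be tuned against your own data.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())

def near_duplicates(questions, threshold=0.85):
    """Return (i, j, ratio) for question pairs whose similarity meets the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(questions), 2):
        ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 2)))
    return pairs

questions = [
    "What is matcha?",
    "What is  Matcha?",
    "How should I store matcha to keep it fresh?",
]
print(near_duplicates(questions))
```

Character-level similarity only catches surface rewordings. Two questions that ask the same thing in different words still need a manual review.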

In [None]:
# Template for adding your own examples
# Copy and modify this template to create more examples

YOUR_EXAMPLES = [
    # Example template - replace with your own
    create_example(
        user_query="YOUR QUESTION HERE",
        assistant_response="""YOUR DETAILED ANSWER HERE.

Include:
- Bullet points for key information
- Specific facts and numbers
- Practical tips
- Clear structure""",
        category="grades",  # One of: grades, preparation, health, culture, recipes, quality, storage, buying
        difficulty="medium"  # One of: easy, medium, hard
    ),
    
    # Add more examples below...
]

print("ℹ️  Add your examples to the YOUR_EXAMPLES list above")
print(f"   Target: Create {max(0, DATASET_CONFIG['min_train_examples'] - len(ALL_SEED_EXAMPLES))} more examples")

---

## Part 6: Data Validation

In [None]:
# Data Validation Functions

def validate_example(example: TrainingExample) -> Dict[str, Any]:
    """
    Validate a single training example.
    
    Checks:
    - Correct message format
    - Required roles present
    - Content length requirements
    - Valid category
    
    Returns:
        Dict with is_valid and any errors
    """
    errors = []
    warnings = []
    
    # Check message structure
    if len(example.messages) != 3:
        errors.append(f"Expected 3 messages, got {len(example.messages)}")
    
    # Check roles
    roles = [m.role for m in example.messages]
    if roles != ["system", "user", "assistant"]:
        errors.append(f"Expected roles [system, user, assistant], got {roles}")
    
    # Check content lengths
    for msg in example.messages:
        if len(msg.content.strip()) < 10:
            errors.append(f"Message too short: {msg.role}")
    
    # Check assistant response quality (skipped when the message count is wrong,
    # otherwise messages[2] would raise IndexError)
    if len(example.messages) == 3:
        assistant_content = example.messages[2].content
        if len(assistant_content) < 100:
            warnings.append("Assistant response seems short (<100 chars)")
        if len(assistant_content) > 3000:
            warnings.append("Assistant response very long (>3000 chars)")
    
    # Check category
    if example.category not in CATEGORIES:
        errors.append(f"Unknown category: {example.category}")
    
    # Check difficulty
    if example.difficulty not in ["easy", "medium", "hard"]:
        errors.append(f"Unknown difficulty: {example.difficulty}")
    
    return {
        "is_valid": len(errors) == 0,
        "errors": errors,
        "warnings": warnings,
    }

def validate_dataset(examples: List[TrainingExample]) -> Dict[str, Any]:
    """
    Validate entire dataset.
    
    Checks:
    - Minimum example count
    - Category distribution
    - Difficulty distribution
    - Individual example validation
    """
    results = {
        "total": len(examples),
        "valid": 0,
        "invalid": 0,
        "errors": [],
        "warnings": [],
        "category_distribution": {},
        "difficulty_distribution": {},
    }
    
    # Validate each example
    for i, ex in enumerate(examples):
        validation = validate_example(ex)
        if validation["is_valid"]:
            results["valid"] += 1
        else:
            results["invalid"] += 1
            for error in validation["errors"]:
                results["errors"].append(f"Example {i}: {error}")
        
        for warning in validation["warnings"]:
            results["warnings"].append(f"Example {i}: {warning}")
        
        # Track distributions
        cat = ex.category
        results["category_distribution"][cat] = results["category_distribution"].get(cat, 0) + 1
        
        diff = ex.difficulty
        results["difficulty_distribution"][diff] = results["difficulty_distribution"].get(diff, 0) + 1
    
    # Check minimums
    if results["total"] < DATASET_CONFIG["min_train_examples"]:
        results["warnings"].append(
            f"Dataset has {results['total']} examples, target is {DATASET_CONFIG['min_train_examples']}+"
        )
    
    return results

print("✅ Validation functions defined")

In [None]:
# Validate seed examples

validation_results = validate_dataset(ALL_SEED_EXAMPLES)

print("📋 DATASET VALIDATION RESULTS")
print("="*70)
print(f"   Total Examples: {validation_results['total']}")
print(f"   Valid: {validation_results['valid']}")
print(f"   Invalid: {validation_results['invalid']}")

print("\n📊 Category Distribution:")
for cat, count in sorted(validation_results['category_distribution'].items()):
    target = CATEGORIES.get(cat, {}).get('target_count', '?')
    pct = count / validation_results['total'] * 100
    print(f"   {cat:<15} {count:>3} ({pct:.0f}%)  target: {target}")

print("\n📊 Difficulty Distribution:")
for diff, count in sorted(validation_results['difficulty_distribution'].items()):
    pct = count / validation_results['total'] * 100
    print(f"   {diff:<15} {count:>3} ({pct:.0f}%)")

if validation_results['errors']:
    print("\n❌ Errors:")
    for error in validation_results['errors'][:5]:
        print(f"   {error}")

if validation_results['warnings']:
    print("\n⚠️ Warnings:")
    for warning in validation_results['warnings'][:5]:
        print(f"   {warning}")

---

## Part 7: Split and Save Dataset

In [None]:
def prepare_final_dataset(examples: List[TrainingExample]) -> DatasetDict:
    """
    Prepare the final dataset with train/val/test splits.
    
    This function:
    1. Shuffles the examples
    2. Splits into train (80%), validation (10%), test (10%)
    3. Converts to Hugging Face Dataset format
    
    Args:
        examples: List of TrainingExample objects
        
    Returns:
        DatasetDict with train, validation, and test splits
    """
    # Shuffle a copy with a dedicated, seeded RNG so the split is reproducible
    # and the global random state is left untouched
    shuffled = examples.copy()
    random.Random(42).shuffle(shuffled)
    
    # Calculate split indices
    n = len(shuffled)
    train_end = int(n * DATASET_CONFIG["train_split"])
    val_end = train_end + int(n * DATASET_CONFIG["val_split"])
    
    train_examples = shuffled[:train_end]
    val_examples = shuffled[train_end:val_end]
    test_examples = shuffled[val_end:]
    
    # Convert to dict format for Dataset
    def examples_to_dict(exs: List[TrainingExample]) -> Dict[str, List]:
        return {
            "messages": [[m.to_dict() for m in ex.messages] for ex in exs],
            "category": [ex.category for ex in exs],
            "difficulty": [ex.difficulty for ex in exs],
        }
    
    # Create DatasetDict
    dataset_dict = DatasetDict({
        "train": Dataset.from_dict(examples_to_dict(train_examples)),
        "validation": Dataset.from_dict(examples_to_dict(val_examples)),
        "test": Dataset.from_dict(examples_to_dict(test_examples)),
    })
    
    print("📊 Dataset Splits:")
    print(f"   Train: {len(train_examples)} examples")
    print(f"   Validation: {len(val_examples)} examples")
    print(f"   Test: {len(test_examples)} examples")
    
    return dataset_dict

print("✅ Dataset preparation function defined")
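
Because `prepare_final_dataset` truncates each split boundary with `int()`, very small datasets can end up with an empty validation split. The sketch below reproduces only the index arithmetic. It assumes the 80/10/10 fractions named in the docstring (the real values come from `DATASET_CONFIG`), and `split_sizes` is a hypothetical helper, not part of the lab code.

```python
from typing import Tuple


def split_sizes(n: int, train_frac: float = 0.8, val_frac: float = 0.1) -> Tuple[int, int, int]:
    """Return (train, validation, test) sizes using int() truncation.

    The default fractions are an assumption taken from the docstring of
    prepare_final_dataset; they are not read from DATASET_CONFIG.
    """
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    # The test split receives whatever is left after train and validation.
    return train_end, val_end - train_end, n - val_end


for n in (5, 12, 150):
    train, val, test = split_sizes(n)
    print(f"n={n:>3}: train={train}, validation={val}, test={test}")
```

With only five examples the validation split is empty (4/0/1), while 150 examples give the 120/15/15 split targeted in this phase. Check the printed split sizes before moving on if you are still working from a small seed set.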

In [None]:
# Prepare and save the dataset

# Combine seed examples (in practice, add YOUR_EXAMPLES here)
all_examples = ALL_SEED_EXAMPLES  # + YOUR_EXAMPLES

# Prepare dataset
dataset = prepare_final_dataset(all_examples)

# Save locally
dataset_path = DATA_DIR / "matcha-dataset"
dataset.save_to_disk(str(dataset_path))

print(f"\n✅ Dataset saved to: {dataset_path}")

# Also save as JSON for inspection (UTF-8, so any non-ASCII text stays readable)
json_path = DATA_DIR / "training_data.json"
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump([ex.to_dict() for ex in all_examples], f, indent=2, ensure_ascii=False)

print(f"✅ JSON backup saved to: {json_path}")

In [None]:
# Verify saved dataset

from datasets import load_from_disk

loaded_dataset = load_from_disk(str(dataset_path))

print("📊 LOADED DATASET")
print("="*70)
print(loaded_dataset)

print("\n📝 Sample Training Example:")
sample = loaded_dataset["train"][0]
print(f"   Category: {sample['category']}")
print(f"   Difficulty: {sample['difficulty']}")
print(f"   User: {sample['messages'][1]['content'][:80]}...")
print(f"   Assistant: {sample['messages'][2]['content'][:80]}...")
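
One more check worth running on the loaded splits is that no user question appears in both train and test, since that would inflate test results. This is a minimal sketch, assuming each row uses the system/user/assistant layout shown above (user message at index 1). `user_questions`, `find_overlap`, and `make_row` are hypothetical helpers, and the rows are illustrative stand-ins rather than lab data.

```python
from typing import Any, Dict, Iterable, Set

Row = Dict[str, Any]


def user_questions(rows: Iterable[Row]) -> Set[str]:
    """Collect user questions, lowercased with whitespace collapsed."""
    return {" ".join(row["messages"][1]["content"].lower().split()) for row in rows}


def find_overlap(train_rows: Iterable[Row], test_rows: Iterable[Row]) -> Set[str]:
    """Return normalized questions that occur in both splits."""
    return user_questions(train_rows) & user_questions(test_rows)


def make_row(question: str) -> Row:
    # Illustrative row in the messages format; contents are placeholders.
    return {
        "messages": [
            {"role": "system", "content": "You are a matcha expert."},
            {"role": "user", "content": question},
            {"role": "assistant", "content": "Placeholder answer."},
        ]
    }


train_rows = [make_row("What is ceremonial grade matcha?"), make_row("How do I whisk matcha?")]
test_rows = [make_row("  how do I whisk   matcha? "), make_row("Does matcha contain caffeine?")]

overlap = find_overlap(train_rows, test_rows)
print(overlap)
```

Iterating a Hugging Face `Dataset` yields one dict per row, so the same call should work as `find_overlap(loaded_dataset["train"], loaded_dataset["test"])`. This only catches exact matches after normalization; rephrased questions need a fuzzier comparison.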

---

## Common Issues

### Issue 1: Duplicate Questions
**Symptom:** Similar questions phrased slightly differently  
**Fix:** Review for semantic duplicates before finalizing
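
A first pass at this review can be automated. The sketch below flags question pairs whose normalized text is nearly identical, using `difflib.SequenceMatcher` from the standard library. Note that this measures character-level similarity only, so it finds lexical near-duplicates and will miss true semantic duplicates that are worded differently. `normalize`, `find_near_duplicates`, and the 0.85 threshold are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def normalize(question: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", question.lower()).split())


def find_near_duplicates(
    questions: List[str], threshold: float = 0.85
) -> List[Tuple[int, int, float]]:
    """Return (i, j, score) for every pair whose similarity meets the threshold."""
    normalized = [normalize(q) for q in questions]
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(normalized), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


questions = [
    "What is matcha?",
    "what is matcha",
    "How do I whisk matcha?",
]
print(find_near_duplicates(questions))
```

The comparison is pairwise, which is fine for a few hundred examples. Treat the flagged pairs as candidates for manual review rather than deleting them automatically.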

### Issue 2: Inconsistent Formatting
**Symptom:** Some responses use lists, others don't  
**Fix:** Establish a consistent style guide for responses
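
To see how mixed the formatting currently is, you can count how many assistant responses contain Markdown list lines. This is a rough sketch under the assumption that responses are plain strings written in Markdown; `LIST_LINE` and `formatting_summary` are hypothetical names, and the sample responses are illustrative.

```python
import re
from typing import Dict, List

# A line that starts with a bullet (-, *, +) or a number like "1." or "1)".
LIST_LINE = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+", re.MULTILINE)


def formatting_summary(responses: List[str]) -> Dict[str, int]:
    """Count responses that contain at least one list line versus none."""
    with_lists = sum(1 for text in responses if LIST_LINE.search(text))
    return {"with_lists": with_lists, "without_lists": len(responses) - with_lists}


responses = [
    "Use water below boiling.\n- Sift the powder\n- Whisk briskly",
    "Ceremonial grade is made from the youngest leaves.",
    "1. Sift\n2. Whisk",
]
print(formatting_summary(responses))
```

A strongly lopsided count suggests the minority style is accidental, which is a good place to start when applying the style guide.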

### Issue 3: Factual Errors
**Symptom:** Incorrect information in responses  
**Fix:** Verify facts against authoritative sources

### Issue 4: Responses Too Short
**Symptom:** One-sentence answers  
**Fix:** Expand with details, examples, and practical tips

---

## Metrics & Outputs

| Metric | Target | Actual |
|--------|--------|--------|
| Total Examples | 150+ | [Fill in] |
| Training Split | ~120 | [Fill in] |
| Validation Split | ~15 | [Fill in] |
| Test Split | ~15 | [Fill in] |
| Categories Covered | 8 | [Fill in] |
| Validation Pass | 100% | [Fill in] |

---

## Phase Complete!

You've achieved:
- ✅ Understood the messages format for chat training
- ✅ Created seed training examples
- ✅ Implemented data validation
- ✅ Saved dataset in Hugging Face format

**Next:** [Lab 4.6.8.2: QLoRA Fine-Tuning](./lab-4.6.8.2-qlora-finetuning.ipynb)

---

In [None]:
# Cleanup
import gc
gc.collect()

print("✅ Phase 1 Complete!")
print("\n🎯 Next Steps:")
print("   1. Review your dataset and add more examples if needed")
print("   2. Ensure balanced category distribution")
print("   3. Proceed to Lab 4.6.8.2 for QLoRA fine-tuning")
print(f"\n   Dataset location: {dataset_path}")