# Lab 14 — Instruction Tuning Schemas

**Focus Area:** Common instruction‑tuning schemas; templating; line‑wise JSONL generation; validation; dataset hygiene (dedupe, filtering, length control)

This notebook implements the complete solution for Lab 14.

## Setup - Create Required Directories

In [1]:
from pathlib import Path
for p in ['artifacts/jsonl','artifacts/samples','artifacts/stats']:
    Path(p).mkdir(parents=True, exist_ok=True)
print("Directories created successfully")

Directories created successfully


## Part A — Schemas & Templates

### Understanding Two Common Record Layouts

**Schema 1: Trio** *(popular in Flan / Alpaca style)*
```json
{"instruction": "...", "input": "...", "output": "...", "metadata": {"source": "..."}}
```
*Pros:* separates task from context; easy to mix with or without `input`  
*Cons:* needs templating at train time; slightly larger records

**Schema 2: Prompt‑Completion** *(legacy OpenAI fine‑tuning style)*
```json
{"prompt": "...", "completion": "...", "metadata": {"source": "..."}}
```
*Pros:* simple; direct fit to APIs expecting prompt/completion  
*Cons:* prompt formatting is baked in; harder to change templates later
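
Both layouts are written as one JSON object per line. The sketch below illustrates that round trip with the standard-library `json` module. The record values and the helper name `to_jsonl_line` are made up for illustration. The point is that newlines inside a field are escaped by the serializer, so each record still occupies exactly one physical line.

```python
import json

# Hypothetical example records, one per schema
trio = {
    "instruction": "Summarize the text.",
    "input": "Line one.\nLine two.",
    "output": "Two short lines.",
    "metadata": {"source": "example"},
}
pc = {
    "prompt": "### Instruction:\nSummarize the text.\n\n### Response:\n",
    "completion": "Two short lines.",
    "metadata": {"source": "example"},
}

def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSONL line (newlines inside values are escaped)."""
    return json.dumps(record, ensure_ascii=False) + "\n"

lines = [to_jsonl_line(trio), to_jsonl_line(pc)]

# Each serialized record contains exactly one real newline: the line terminator
assert all(line.count("\n") == 1 for line in lines)

# Parsing each line restores the original record, embedded newlines included
restored = [json.loads(line) for line in lines]
assert restored == [trio, pc]
```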

## Part B — Build a Trio Dataset from Lab 13 SFT View

### B1. Loader + light cleaner

In [2]:
import json
import orjson
import re
from pathlib import Path

SRC = Path('artifacts/jsonl/corpus_sft.jsonl')
TRIO = Path('artifacts/jsonl/instruct_trio.jsonl')

# Check if source file exists
if not SRC.exists():
    print(f"WARNING: Source file {SRC} not found!")
    print("This lab requires artifacts from Lab 13.")
    print("Please run Lab 13 first to generate the required files.")
else:
    lang_allow = {"en"}
    min_out_chars = 20

    with SRC.open('r', encoding='utf-8') as fin, TRIO.open('w', encoding='utf-8') as fout:
        kept = 0
        seen = set()
        for line in fin:
            obj = json.loads(line)
            # Expected keys in Lab 13: input, output, metadata{doc_id,type,lang}
            lang = (obj.get('metadata') or {}).get('lang', 'en')
            if lang not in lang_allow:
                continue
            instruction = obj['input']
            inp = ""  # Lab 13 'input' already contains the task; keep Trio input empty here
            out = re.sub(r"\s+", " ", obj['output']).strip()
            if len(out) < min_out_chars:
                continue
            # Deduplicate by (instruction, out)
            key = (instruction, out)
            if key in seen:
                continue
            seen.add(key)
            rec = {
                'instruction': instruction,
                'input': inp,
                'output': out,
                'metadata': {
                    'doc_id': (obj.get('metadata') or {}).get('doc_id'),
                    'schema_version': 'trio-v1'
                }
            }
            fout.write(orjson.dumps(rec).decode() + '\n')
            kept += 1

    print(f"Created Trio dataset: {TRIO}")
    print(f"Records kept: {kept}")

Created Trio dataset: artifacts/jsonl/instruct_trio.jsonl
Records kept: 63


### B2. Add a templated prompt‑completion view

In [3]:
PROMPT = Path('artifacts/jsonl/instruct_prompt_completion.jsonl')
TEMPLATE = """### Instruction:\n{instruction}\n\n### Response:\n"""

if TRIO.exists():
    with TRIO.open('r', encoding='utf-8') as fin, PROMPT.open('w', encoding='utf-8') as fout:
        count = 0
        for line in fin:
            t = json.loads(line)
            prompt = TEMPLATE.format(instruction=t['instruction'])
            completion = t['output']
            fout.write(orjson.dumps({
                'prompt': prompt, 
                'completion': completion, 
                'metadata': t.get('metadata')
            }).decode()+'\n')
            count += 1

    print(f"Created Prompt-Completion dataset: {PROMPT}")
    print(f"Records converted: {count}")
else:
    print(f"Trio file not found: {TRIO}")

Created Prompt-Completion dataset: artifacts/jsonl/instruct_prompt_completion.jsonl
Records converted: 63


### Checkpoint: Show sample records from both formats

In [4]:
import json

print("="*60)
print("SAMPLE TRIO RECORDS (first 3)")
print("="*60)
if TRIO.exists():
    with TRIO.open('r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= 3:
                break
            obj = json.loads(line)
            print(f"\nRecord {i+1}:")
            print(f"  Instruction: {obj['instruction'][:80]}...")
            print(f"  Input: {obj['input'] or '(empty)'}")
            print(f"  Output: {obj['output'][:80]}...")
            print(f"  Metadata: {obj.get('metadata')}")

print("\n" + "="*60)
print("SAMPLE PROMPT-COMPLETION RECORDS (first 3)")
print("="*60)
if PROMPT.exists():
    with PROMPT.open('r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= 3:
                break
            obj = json.loads(line)
            print(f"\nRecord {i+1}:")
            print(f"  Prompt: {obj['prompt'][:100]}...")
            print(f"  Completion: {obj['completion'][:80]}...")
            print(f"  Metadata: {obj.get('metadata')}")

SAMPLE TRIO RECORDS (first 3)

Record 1:
  Instruction: Summarize the key steps from: How to configure SSO (v5). (Overview)....
  Input: (empty)
  Output: Key steps: enable SAML; map claims; verify time sync; check audience URI; review...
  Metadata: {'doc_id': 'DOC-0484', 'schema_version': 'trio-v1'}

Record 2:
  Instruction: Summarize the key steps from: Data Retention Policy — Region US (Troubleshooting...
  Input: (empty)
  Output: Key steps: enable SAML; map claims; verify time sync; check audience URI; review...
  Metadata: {'doc_id': 'DOC-0914', 'schema_version': 'trio-v1'}

Record 3:
  Instruction: Summarize the key steps from: Release 202502 — Key fixes (Setup)....
  Input: (empty)
  Output: Key steps: enable SAML; map claims; verify time sync; check audience URI; review...
  Metadata: {'doc_id': 'DOC-0652', 'schema_version': 'trio-v1'}

SAMPLE PROMPT-COMPLETION RECORDS (first 3)

Record 1:
  Prompt: ### Instruction:
Summarize the key steps from: How to configure SSO (v5). (Ov

## Part C — Synthesize Trio from RAG Chunks

### C1. Build summarization tasks from chunks

In [5]:
import json
import orjson
import re
from pathlib import Path

RAG = Path('artifacts/jsonl/rag_chunks_from_csv.jsonl')
TRIO_RAG = Path('artifacts/jsonl/instruct_trio_from_rag.jsonl')

min_text_chars = 200   # avoid trivially short chunks
max_text_chars = 1200  # avoid overly long chunks

if not RAG.exists():
    print(f"WARNING: RAG chunks file {RAG} not found!")
    print("This lab requires artifacts from Lab 13.")
    print("Please run Lab 13 first to generate the required files.")
else:
    with RAG.open('r', encoding='utf-8') as fin, TRIO_RAG.open('w', encoding='utf-8') as fout:
        kept = 0
        seen = set()
        for line in fin:
            obj = json.loads(line)
            text = re.sub(r"\s+", " ", obj['text']).strip()
            if not (min_text_chars <= len(text) <= max_text_chars):
                continue
            instruction = "Summarize the following section in 2–3 sentences focusing on the key steps and caveats."
            inp = text
            # Placeholder target: the same constant string for every chunk;
            # in production, generate targets with a labeler or heuristics.
            target = "Key steps: ensure SSO claims are mapped; verify time sync; review settings as noted."
            key = text  # exact dedupe on the whitespace-normalized text (no hash collisions)
            if key in seen:
                continue
            seen.add(key)
            rec = {
                'instruction': instruction,
                'input': inp,
                'output': target,
                'metadata': {
                    'doc_id': obj.get('doc_id'),
                    'chunk_id': obj.get('chunk_id'),
                    'schema_version': 'trio-from-rag-v1'
                }
            }
            fout.write(orjson.dumps(rec).decode() + '\n')
            kept += 1

    print(f"Created Trio dataset from RAG chunks: {TRIO_RAG}")
    print(f"Records kept: {kept}")

Created Trio dataset from RAG chunks: artifacts/jsonl/instruct_trio_from_rag.jsonl
Records kept: 195
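
When dedupe keys need to be stable across runs (for example, if they are persisted or compared between pipeline stages), a cryptographic digest is a safer choice than Python's built-in `hash`, which is salted per process for strings. The following is a minimal sketch under that assumption. The helper names `dedupe_key` and `dedupe` and the sample chunks are hypothetical and not part of the lab code.

```python
import hashlib
import re

def dedupe_key(text: str) -> str:
    """Return a stable hex digest of the whitespace-normalized text."""
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe(texts):
    """Yield each text whose normalized form has not been seen before, in order."""
    seen = set()
    for text in texts:
        key = dedupe_key(text)
        if key in seen:
            continue
        seen.add(key)
        yield text

# Hypothetical chunks: the first two differ only in whitespace
chunks = ["Enable SAML  and map claims.", "Enable SAML and map claims.", "Verify time sync."]
unique = list(dedupe(chunks))
print(unique)  # ['Enable SAML  and map claims.', 'Verify time sync.']
```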


### C2. Convert to a chat‑style prompt‑completion

In [6]:
CHAT = Path('artifacts/jsonl/instruct_chat_from_rag.jsonl')
CHAT_TEMPLATE = (
    "<system>You are a concise technical assistant.</system>\n"
    "<user>Instruction: {instruction}\n\nContext: {context}</user>\n"
    "<assistant>"
)

if TRIO_RAG.exists():
    with TRIO_RAG.open('r', encoding='utf-8') as fin, CHAT.open('w', encoding='utf-8') as fout:
        count = 0
        for line in fin:
            t = json.loads(line)
            prompt = CHAT_TEMPLATE.format(instruction=t['instruction'], context=t['input'])
            completion = t['output']
            fout.write(orjson.dumps({
                'prompt': prompt, 
                'completion': completion, 
                'metadata': t.get('metadata')
            }).decode()+'\n')
            count += 1
    
    print(f"Created Chat-style Prompt-Completion dataset: {CHAT}")
    print(f"Records converted: {count}")
else:
    print(f"Trio RAG file not found: {TRIO_RAG}")

Created Chat-style Prompt-Completion dataset: artifacts/jsonl/instruct_chat_from_rag.jsonl
Records converted: 195


### Checkpoint: Compare token length distributions

In [7]:
import json
import numpy as np

if CHAT.exists():
    prompt_lens = []
    completion_lens = []
    
    with CHAT.open('r', encoding='utf-8') as f:
        for line in f:
            obj = json.loads(line)
            # Token proxy: count words
            prompt_lens.append(len(obj['prompt'].split()))
            completion_lens.append(len(obj['completion'].split()))
    
    print("Token Length Statistics (word-based proxy):")
    print("="*60)
    print(f"Prompt tokens:")
    print(f"  Mean: {np.mean(prompt_lens):.1f}")
    print(f"  Median: {np.median(prompt_lens):.1f}")
    print(f"  Min: {np.min(prompt_lens)}")
    print(f"  Max: {np.max(prompt_lens)}")
    print(f"  95th percentile: {np.percentile(prompt_lens, 95):.1f}")
    print(f"\nCompletion tokens:")
    print(f"  Mean: {np.mean(completion_lens):.1f}")
    print(f"  Median: {np.median(completion_lens):.1f}")
    print(f"  Min: {np.min(completion_lens)}")
    print(f"  Max: {np.max(completion_lens)}")
    print(f"  95th percentile: {np.percentile(completion_lens, 95):.1f}")

Token Length Statistics (word-based proxy):
Prompt tokens:
  Mean: 90.4
  Median: 91.0
  Min: 60
  Max: 122
  95th percentile: 122.0

Completion tokens:
  Mean: 14.0
  Median: 14.0
  Min: 14
  Max: 14
  95th percentile: 14.0


## Part D — Validation, Length Governance, and Hygiene

### D1. Pydantic validators (optional but recommended)

In [8]:
from pydantic import BaseModel, Field, ValidationError
from typing import Optional, Dict

class Trio(BaseModel):
    instruction: str
    input: str
    output: str
    metadata: Optional[Dict] = Field(default_factory=dict)

class PC(BaseModel):
    prompt: str
    completion: str
    metadata: Optional[Dict] = Field(default_factory=dict)

# Validate a sample
try:
    _ = Trio(instruction='Do X', input='', output='Done', metadata={'doc_id':'DOC-0001'})
    _ = PC(prompt='### Instruction:\nDo X\n\n### Response:\n', completion='Done')
    print('✓ Pydantic validation OK')
    print('  - Trio schema validated')
    print('  - Prompt-Completion schema validated')
except ValidationError as e:
    print('✗ Validation failed:')
    print(e)

✓ Pydantic validation OK
  - Trio schema validated
  - Prompt-Completion schema validated
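
The cell above validates only a hand-written sample. The same idea can be applied to every line of a JSONL file. The sketch below is dependency-free: it uses plain `json` and `isinstance` checks instead of Pydantic so that it runs anywhere, and the function name `validate_trio_lines` and the sample lines are hypothetical. It collects errors by line number instead of stopping at the first bad record.

```python
import json

TRIO_FIELDS = ("instruction", "input", "output")  # required string fields in the Trio layout

def validate_trio_lines(lines):
    """Return (valid_records, errors); errors are (line_number, message) tuples."""
    valid, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(obj, dict):
            errors.append((lineno, "record is not a JSON object"))
            continue
        bad = [k for k in TRIO_FIELDS if not isinstance(obj.get(k), str)]
        if bad:
            errors.append((lineno, f"missing or non-string fields: {bad}"))
            continue
        valid.append(obj)
    return valid, errors

# Hypothetical sample: one good record, one with bad fields, one that is not JSON
sample = [
    '{"instruction": "Do X", "input": "", "output": "Done"}',
    '{"instruction": "Do Y", "output": 42}',
    'not json',
]
valid, errors = validate_trio_lines(sample)
print(len(valid), [lineno for lineno, _ in errors])  # 1 [2, 3]
```

Because an open text file iterates line by line, the function can also be called with a file handle, for example `validate_trio_lines(TRIO.open(encoding='utf-8'))`.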


### D2. Length filters and simple decontamination

In [9]:
import json
import numpy as np
from pathlib import Path

# Token proxy: count words as a rough stand‑in
def token_proxy(s):
    return max(1, len(s.split()))

MAX_PROMPT_TOK = 700
MAX_COMP_TOK = 350

INP = Path('artifacts/jsonl/instruct_prompt_completion.jsonl')
OUT = Path('artifacts/jsonl/instruct_prompt_completion_cleansed.jsonl')

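# Eval-set trigger substrings: any training prompt containing one is dropped.
# Caution: these are generic task prefixes, so they can match many (or all) training prompts.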
EVAL = {"How many orders does customer", "Summarize the key steps from:"}

if INP.exists():
    kept = 0
    filtered_length = 0
    filtered_decontam = 0
    
    with INP.open('r', encoding='utf-8') as fin, OUT.open('w', encoding='utf-8') as fout:
        for line in fin:
            obj = json.loads(line)
            p, c = obj['prompt'], obj['completion']
            
            # Decontamination check
            if any(trigger in p for trigger in EVAL):
                filtered_decontam += 1
                continue  # decontaminate against eval set
            
            # Length check
            if token_proxy(p) > MAX_PROMPT_TOK or token_proxy(c) > MAX_COMP_TOK:
                filtered_length += 1
                continue
            
            fout.write(line)
            kept += 1
    
    print(f"Cleansed dataset created: {OUT}")
    print("="*60)
    print(f"Records kept: {kept}")
    print(f"Filtered by length: {filtered_length}")
    print(f"Filtered by decontamination: {filtered_decontam}")
    print(f"Total filtered: {filtered_length + filtered_decontam}")
else:
    print(f"Input file not found: {INP}")

Cleansed dataset created: artifacts/jsonl/instruct_prompt_completion_cleansed.jsonl
Records kept: 0
Filtered by length: 0
Filtered by decontamination: 63
Total filtered: 63


### D3. Dataset stats

In [10]:
import json
import numpy as np
from pathlib import Path

STATS = Path('artifacts/stats/instruct_stats.json')
counts = {'trio':0,'pc':0,'chat_pc':0,'cleansed_pc':0,'trio_from_rag':0}
lengths = {'prompt':[], 'completion':[]}

file_map = [
    ('artifacts/jsonl/instruct_trio.jsonl', 'trio'),
    ('artifacts/jsonl/instruct_prompt_completion.jsonl', 'pc'),
    ('artifacts/jsonl/instruct_chat_from_rag.jsonl', 'chat_pc'),
    ('artifacts/jsonl/instruct_prompt_completion_cleansed.jsonl', 'cleansed_pc'),
    ('artifacts/jsonl/instruct_trio_from_rag.jsonl', 'trio_from_rag')
]

for filepath, key in file_map:
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                obj = json.loads(line)
                counts[key] += 1
                
                # Collect word-count length stats, pooled across all prompt-completion files
                # (pc, chat_pc, cleansed_pc), so the summary mixes the three datasets
                if 'prompt' in obj and 'completion' in obj:
                    lengths['prompt'].append(len(obj['prompt'].split()))
                    lengths['completion'].append(len(obj['completion'].split()))
    except FileNotFoundError:
        pass

stats_output = {
    'counts': counts,
    'prompt_tokens_mean': float(np.mean(lengths['prompt'])) if lengths['prompt'] else 0.0,
    'prompt_tokens_median': float(np.median(lengths['prompt'])) if lengths['prompt'] else 0.0,
    'prompt_tokens_max': int(np.max(lengths['prompt'])) if lengths['prompt'] else 0,
    'completion_tokens_mean': float(np.mean(lengths['completion'])) if lengths['completion'] else 0.0,
    'completion_tokens_median': float(np.median(lengths['completion'])) if lengths['completion'] else 0.0,
    'completion_tokens_max': int(np.max(lengths['completion'])) if lengths['completion'] else 0
}

STATS.write_text(json.dumps(stats_output, indent=2))

print("Dataset Statistics:")
print("="*60)
print(json.dumps(stats_output, indent=2))
print(f"\nStats saved to: {STATS}")

Dataset Statistics:
{
  "counts": {
    "trio": 63,
    "pc": 63,
    "chat_pc": 195,
    "cleansed_pc": 0,
    "trio_from_rag": 195
  },
  "prompt_tokens_mean": 71.85658914728683,
  "prompt_tokens_median": 60.0,
  "prompt_tokens_max": 122,
  "completion_tokens_mean": 14.0,
  "completion_tokens_median": 14.0,
  "completion_tokens_max": 14
}

Stats saved to: artifacts/stats/instruct_stats.json


## Part E — Wrap‑Up

### Analysis Questions

#### 1. Schema Choice: Trio vs Prompt‑Completion vs Chat

**For this project, I would choose the Trio schema (`{instruction, input, output}`).**

**Reasons:**
1. **Flexibility at Training Time**: The Trio schema separates the task description (`instruction`) from the context (`input`), allowing us to experiment with different prompt templates during training without regenerating the entire dataset. This is critical for A/B testing different instruction formats.

2. **Handles Variable Context**: Some tasks have no context (just an instruction), while others require substantial context. The Trio schema handles both cases elegantly—we can leave `input` empty when not needed, making it more versatile than a baked-in prompt format.

3. **Composability**: The clear separation makes it easier to:
   - Add system messages or prefixes later
   - Support multi-turn conversations by chaining instructions
   - Mix different task types in one dataset without template conflicts

**When to use alternatives:**
- **Prompt-Completion**: If deploying directly to an API that expects this format and you're confident about your prompt template
- **Chat**: For multi-turn conversational fine-tuning where role distinctions (system/user/assistant) are critical

#### 2. Template Example and Modification Strategy

**Training-time template:**
```python
template = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
```

**How to change without rewriting the dataset:**

Since we stored data in Trio format, we can modify the template at training time without touching the JSONL:

```python
# Option A: Minimal template
minimal_template = "{instruction}\n{input}\n\n"

# Option B: Chat-style template
chat_template = (
    "<system>You are a helpful assistant.</system>\n"
    "<user>{instruction}\n{input}</user>\n"
    "<assistant>"
)

# Option C: Few-shot template (add examples before the test case)
fewshot_template = (
    "Here are some examples:\n{examples}\n\n"
    "Now complete this:\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
```

**Key benefit:** The raw data (`instruction`, `input`, `output`) remains clean and reusable. Template changes happen in the training dataloader, not in the dataset itself.
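
As a rough illustration of applying the template in the dataloader, the sketch below renders a Trio record into a prompt/completion pair at load time and drops the `### Input:` section when `input` is empty. The names `render`, `TEMPLATE_WITH_INPUT`, and `TEMPLATE_NO_INPUT` are hypothetical, and the records are made up.

```python
TEMPLATE_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
TEMPLATE_NO_INPUT = "### Instruction:\n{instruction}\n\n### Response:\n"

def render(record: dict) -> dict:
    """Turn a Trio record into a prompt/completion pair, omitting the Input section when empty."""
    context = record.get("input") or ""
    template = TEMPLATE_WITH_INPUT if context.strip() else TEMPLATE_NO_INPUT
    # str.format ignores unused keyword arguments, so passing input= is safe for both templates
    prompt = template.format(instruction=record["instruction"], input=context)
    return {"prompt": prompt, "completion": record["output"]}

# Hypothetical records: one with context, one without
with_ctx = render({"instruction": "Summarize.", "input": "Some text.", "output": "Short."})
without_ctx = render({"instruction": "Say hi.", "input": "", "output": "Hi."})

print(with_ctx["prompt"])
print(without_ctx["prompt"])
```

Swapping templates then means changing the two constants (or passing them as parameters), while the JSONL on disk stays untouched.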

#### 3. Length Filtering Thresholds

**Current thresholds:**
- `MAX_PROMPT_TOK = 700` (word-based proxy)
- `MAX_COMP_TOK = 350` (word-based proxy)

**Rationale:**
- Assumes a model with a ~2,048-token context (e.g., GPT-3 or the original LLaMA models; GPT-2 is limited to 1,024 tokens)
- Leaves headroom for template overhead (~50-100 tokens) and response generation
- 700 prompt + 350 completion = 1,050 words; at a rough 1.3 tokens per English word that is about 1,400 tokens, still under 2K with a safety margin

**Adjustments for different models:**

| Model Context | MAX_PROMPT_TOK | MAX_COMP_TOK | Reasoning |
|---------------|----------------|--------------|------------|
| **1K tokens** (small models) | 400 | 200 | Aggressive filtering; prioritize short, focused examples |
| **4K tokens** (GPT-3.5, LLaMA-2-7B) | 1,500 | 800 | More room for detailed instructions and longer responses |
| **8K+ tokens** (Claude, GPT-4) | 3,000 | 1,500 | Can include long-form documents; still cap to avoid outliers |
| **32K+ tokens** (GPT-4-32K, Claude-2) | 10,000 | 5,000 | Long-document context; useful for RAG with multiple chunks |

**Additional considerations:**
- Use a proper tokenizer (tiktoken, sentencepiece) instead of word-splitting for accurate counts
- Derive caps from the 95th percentile of the length distribution rather than the mean, so rare long examples do not cause out-of-memory errors
- For instruction-tuning, shorter is often better—focus on quality over length
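
The arithmetic behind these thresholds can be sketched with the standard library alone. `TOKENS_PER_WORD = 1.3` is an assumed rough ratio for English text, not a measured value, and the helper names and sample lengths are hypothetical. With a real tokenizer you would count tokens directly instead of estimating.

```python
import statistics

TOKENS_PER_WORD = 1.3  # assumed rough ratio for English text; measure with the real tokenizer

def estimate_tokens(word_count: int) -> int:
    """Convert a word count to an estimated token count using the assumed ratio."""
    return round(word_count * TOKENS_PER_WORD)

def percentile_cap(lengths, pct=95):
    """Return the pct-th percentile of lengths (linear interpolation), usable as a length cap."""
    cut_points = statistics.quantiles(lengths, n=100, method="inclusive")
    return cut_points[pct - 1]

# Estimated token budget for the current word-based thresholds
budget = estimate_tokens(700) + estimate_tokens(350)
print(budget)  # 1365

# Hypothetical prompt lengths in words; drop anything above the 95th percentile
word_lengths = [60, 75, 88, 91, 95, 102, 110, 122]
cap = percentile_cap(word_lengths)
kept = [n for n in word_lengths if n <= cap]
print(cap, kept)
```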

#### 4. Decontamination Pipeline Strategy

**Where to run decontamination:**

1. **Early stage (this lab's approach)**: During JSONL generation, immediately after cleaning
   - ✓ Prevents contaminated data from ever entering downstream pipelines
   - ✓ Easy to audit and version-control the decontamination rules
   - ✗ Requires re-generating JSONL if eval set changes

2. **Split-time decontamination** (recommended for production):
   ```python
   # After generating the full dataset, before the train/val split.
   # load_eval_set() and full_data are placeholders for your own loader and records.
   eval_prompts = load_eval_set()  # held-out eval prompts (list of str)
   train_data = [r for r in full_data if not any(ep in r['prompt'] for ep in eval_prompts)]
   ```
   - ✓ More flexible—can change eval set without regenerating base data
   - ✓ Can use fuzzy matching (edit distance, embeddings) for better detection

3. **Runtime decontamination** (during training):
   - Only if dynamic eval sets are created during training
   - Adds overhead; generally avoided

**Decontamination checklist:**
- [ ] **Exact substring matching**: Remove training examples containing exact eval prompts
- [ ] **Fuzzy matching**: Use edit distance or n-gram overlap (e.g., Jaccard similarity > 0.8)
- [ ] **Semantic deduplication**: Embed prompts and remove training examples within cosine similarity threshold of eval
- [ ] **Source-based filtering**: If eval comes from specific docs, remove all training data from those docs
- [ ] **Date-based splitting**: For time-series data, ensure training data predates eval data
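
The fuzzy-matching item in the checklist can be sketched with word n-gram Jaccard similarity. This is a minimal illustration, not a production detector. The function names, the trigram size, and the sample prompts are assumptions, and the 0.8 threshold simply mirrors the example value in the checklist.

```python
import re

def ngrams(text: str, n: int = 3) -> set:
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def is_contaminated(prompt: str, eval_prompts, threshold: float = 0.8) -> bool:
    """True if the prompt's n-gram set overlaps any eval prompt above the threshold."""
    grams = ngrams(prompt)
    return any(jaccard(grams, ngrams(ep)) > threshold for ep in eval_prompts)

# Hypothetical eval prompt and two training prompts
eval_prompts = ["How many orders does customer 42 have in total?"]
near_copy = "how many orders does customer 42 have in total"
unrelated = "Summarize the key steps for configuring SSO."

print(is_contaminated(near_copy, eval_prompts))  # True
print(is_contaminated(unrelated, eval_prompts))  # False
```

Unlike exact substring matching, this catches copies that differ only in case or punctuation, at the cost of choosing `n` and the threshold per dataset.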

**For this project:**
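We implemented **early-stage exact substring matching** (D2 above). Both triggers in `EVAL` are generic task prefixes rather than full eval prompts, and together they matched all 63 records, leaving the cleansed file empty; real triggers should be complete held-out prompts. For production, I would add: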
- Fuzzy matching with edit distance < 10 or n-gram overlap > 0.7
- Store decontamination logs to track what was filtered and why
- Re-run decontamination before each model training run to catch new eval additions
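
Logging what was filtered and why can be as simple as returning a list of decisions alongside the kept records. The sketch below assumes the same word-count proxy and substring triggers as D2. The function name `filter_with_log`, the log fields, and the sample records are hypothetical.

```python
import json

def filter_with_log(records, eval_triggers, max_prompt_words=700, max_completion_words=350):
    """Split records into kept ones and a log of dropped ones with the reason for each drop."""
    kept, log = [], []
    for idx, rec in enumerate(records):
        prompt, completion = rec["prompt"], rec["completion"]
        # First trigger found in the prompt, or None if the prompt is clean
        hit = next((t for t in eval_triggers if t in prompt), None)
        if hit is not None:
            log.append({"index": idx, "reason": "decontamination", "detail": hit})
        elif len(prompt.split()) > max_prompt_words or len(completion.split()) > max_completion_words:
            log.append({"index": idx, "reason": "length", "detail": None})
        else:
            kept.append(rec)
    return kept, log

# Hypothetical records: one matches a trigger, one is too long, one passes
records = [
    {"prompt": "### Instruction:\nSummarize the key steps from: Doc A\n\n### Response:\n",
     "completion": "Key steps: enable SAML."},
    {"prompt": "### Instruction:\nList the prerequisites for Doc B\n\n### Response:\n",
     "completion": "word " * 400},
    {"prompt": "### Instruction:\nExplain Doc C\n\n### Response:\n",
     "completion": "Short answer."},
]
kept, log = filter_with_log(records, ["Summarize the key steps from:"])
for entry in log:
    print(json.dumps(entry))  # one JSON line per dropped record
```

Writing the log entries to their own JSONL file next to the cleansed dataset keeps each run auditable.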

## Final Outputs Summary

In [11]:
from pathlib import Path
import json

print("="*60)
print("FINAL OUTPUTS SUMMARY")
print("="*60)

outputs = [
    ('artifacts/jsonl/instruct_trio.jsonl', 'Trio schema dataset'),
    ('artifacts/jsonl/instruct_prompt_completion.jsonl', 'Prompt-Completion dataset'),
    ('artifacts/jsonl/instruct_trio_from_rag.jsonl', 'Trio from RAG chunks'),
    ('artifacts/jsonl/instruct_chat_from_rag.jsonl', 'Chat-style from RAG'),
    ('artifacts/jsonl/instruct_prompt_completion_cleansed.jsonl', 'Cleansed dataset'),
    ('artifacts/stats/instruct_stats.json', 'Dataset statistics')
]

for path, desc in outputs:
    p = Path(path)
    if p.exists():
        if p.is_file():
            size = p.stat().st_size
            # Count lines for JSONL files
            if path.endswith('.jsonl'):
                with open(p, 'r', encoding='utf-8') as f:
                    line_count = sum(1 for _ in f)
                print(f"✓ {desc}")
                print(f"  Path: {path}")
                print(f"  Size: {size:,} bytes")
                print(f"  Records: {line_count:,}")
            else:
                print(f"✓ {desc}")
                print(f"  Path: {path}")
                print(f"  Size: {size:,} bytes")
    else:
        print(f"✗ {desc} - NOT FOUND")
        print(f"  Expected: {path}")
    print()

print("="*60)
print("Lab 14 Complete!")
print("="*60)

# Display final stats
stats_file = Path('artifacts/stats/instruct_stats.json')
if stats_file.exists():
    print("\nFinal Statistics:")
    with open(stats_file, 'r', encoding='utf-8') as f:
        stats = json.load(f)
        print(json.dumps(stats, indent=2))

FINAL OUTPUTS SUMMARY
✓ Trio schema dataset
  Path: artifacts/jsonl/instruct_trio.jsonl
  Size: 16,321 bytes
  Records: 63

✓ Prompt-Completion dataset
  Path: artifacts/jsonl/instruct_prompt_completion.jsonl
  Size: 17,896 bytes
  Records: 63

✓ Trio from RAG chunks
  Path: artifacts/jsonl/instruct_trio_from_rag.jsonl
  Size: 152,173 bytes
  Records: 195

✓ Chat-style from RAG
  Path: artifacts/jsonl/instruct_chat_from_rag.jsonl
  Size: 171,088 bytes
  Records: 195

✓ Cleansed dataset
  Path: artifacts/jsonl/instruct_prompt_completion_cleansed.jsonl
  Size: 0 bytes
  Records: 0

✓ Dataset statistics
  Path: artifacts/stats/instruct_stats.json
  Size: 322 bytes

Lab 14 Complete!

Final Statistics:
{
  "counts": {
    "trio": 63,
    "pc": 63,
    "chat_pc": 195,
    "cleansed_pc": 0,
    "trio_from_rag": 195
  },
  "prompt_tokens_mean": 71.85658914728683,
  "prompt_tokens_median": 60.0,
  "prompt_tokens_max": 122,
  "completion_tokens_mean": 14.0,
  "completion_tokens_median": 14.0,
  

## Key Takeaways

### Schema Selection
- **Trio**: Best for flexibility and template experimentation
- **Prompt-Completion**: Best for direct API deployment
- **Chat**: Best for conversational fine-tuning

### Dataset Hygiene
1. **Deduplication**: Content-based hashing prevents redundant training
2. **Length Filtering**: Model context budget compliance
3. **Decontamination**: Prevents eval leakage
4. **Language Filtering**: Ensures target language consistency

### Best Practices
- Store raw data in flexible schemas (Trio)
- Apply templates at training time, not generation time
- Use proper tokenizers for accurate length measurement
- Log all filtering decisions for reproducibility
- Validate JSONL line-by-line before training

### Common Pitfalls Avoided
- ✓ Separated instruction from input (no template mixing)
- ✓ Applied decontamination early in pipeline
- ✓ Enforced length budgets with headroom
- ✓ Maintained provenance metadata throughout
- ✓ Validated schema compliance with Pydantic