# Schema Validation: Pydantic vs JSON Schema

This notebook demonstrates two approaches to validating LLM outputs:
1. **Pydantic** - Python-native validation with type coercion and custom validators
2. **JSON Schema** - Language-agnostic validation standard

Both check that LLM outputs match the expected structure before they are processed downstream.

## Section 1: Pydantic Validation

**Advantages:**
- Automatic type coercion (e.g., `"0.85"` → `0.85`)
- Custom field validators for business logic
- Detailed, per-field error messages
- Native Python integration
- Works seamlessly with type hints
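
The first two advantages can be seen in a small, self-contained sketch. It assumes Pydantic v2 (consistent with the `model_json_schema()` call used later in this notebook); `ScoredComment` and its whitespace rule are hypothetical and exist only for this illustration.

```python
from pydantic import BaseModel, Field, ValidationError, field_validator


class ScoredComment(BaseModel):
    """Hypothetical model used only for this illustration."""
    score: float = Field(ge=0, le=1)
    comments: str = Field(min_length=1)

    @field_validator("comments")
    @classmethod
    def comments_not_blank(cls, value: str) -> str:
        # Example business rule (an assumption): whitespace-only comments are rejected.
        if not value.strip():
            raise ValueError("comments must contain non-whitespace text")
        return value.strip()


# In the default (lax) mode, the numeric string "0.85" is coerced to a float.
coerced = ScoredComment(score="0.85", comments="  Good.  ")
print(type(coerced.score).__name__, coerced.score, repr(coerced.comments))

# A non-numeric string cannot be coerced, and the custom rule rejects blank comments.
try:
    ScoredComment(score="high", comments="   ")
except ValidationError as exc:
    failed_fields = [err["loc"][0] for err in exc.errors()]
    print(failed_fields)
```

Note that coercion only helps when the string is actually numeric: `"0.85"` is accepted, while `"high"` is still a validation error.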

In [None]:
import json
import random
import time
from pydantic import BaseModel, Field, ValidationError

### Define the Pydantic Schema

In [None]:
class EvalModel(BaseModel):
    """Pydantic model for evaluation results."""
    factual_accuracy: float = Field(ge=0, le=1, description="Accuracy score between 0 and 1")
    completeness: float = Field(ge=0, le=1, description="Completeness score between 0 and 1")
    clarity: float = Field(ge=0, le=1, description="Clarity score between 0 and 1")
    comments: str = Field(min_length=1, description="Evaluation comments")

print("✅ Pydantic schema defined")
print("\nSchema structure:")
print(json.dumps(EvalModel.model_json_schema(), indent=2))

### Simulated LLM Responses

These simulate one valid response and three failure modes:
- Valid JSON with correct types
- Valid JSON with a wrong type (a non-numeric string where a float is expected)
- Valid JSON with a missing required field
- Invalid JSON

In [None]:
def simulated_llm():
    """Simulate LLM responses with different failure modes."""
    choices = [
        # Valid response
        '{"factual_accuracy": 0.85, "completeness": 0.8, "clarity": 0.9, "comments": "Good."}',
        
        # Type error: non-numeric string instead of float (cannot be coerced)
        '{"factual_accuracy": "high", "completeness": 0.8, "clarity": 0.9, "comments": "Ok"}',
        
        # Missing required field
        '{"factual_accuracy": 0.85, "completeness": 0.8, "clarity": 0.9}',
        
        # Invalid JSON
        'NOT_JSON: error'
    ]
    return random.choice(choices)

print("✅ Simulator defined")

### Validation with Retry Logic

In [None]:
def validate_with_pydantic(max_retries=3):
    """Validate LLM output with Pydantic, retrying on failure."""
    prompt = "Evaluate the response"  # placeholder: the simulator ignores it, a real LLM call would use it
    
    for i in range(max_retries):
        print(f"\n{'='*60}")
        print(f"Attempt {i+1}/{max_retries}")
        print(f"{'='*60}")
        
        raw = simulated_llm()
        print(f"Raw LLM output: {raw}")
        
        # Try to validate output against schema
        try:
            data = json.loads(raw)
            print("✅ Valid JSON")
            
            # 🧩 Convert to Pydantic validated structure
            validated = EvalModel(**data)
            print(f"✅ Pydantic validation passed!")
            print(f"   Validated object: {validated}")
            print(f"\n   Access fields as attributes:")
            print(f"   - Accuracy: {validated.factual_accuracy}")
            print(f"   - Comments: {validated.comments}")
            return True
            
        except json.JSONDecodeError as e:
            print(f"❌ Invalid JSON: {e}")
            print(f"   → Would re-prompt: 'Return valid JSON only'")
            
        except ValidationError as e:
            print(f"❌ Schema Validation Failed:")
            for error in e.errors():
                field = ".".join(str(part) for part in error["loc"])  # full path; safe if loc is empty
                print(f"   - Field '{field}': {error['msg']}")
                print(f"     Type: {error['type']}")
            print(f"   → Would re-prompt with validation feedback")
        
        time.sleep(0.1)
    
    print(f"\n❌ Failed after {max_retries} attempts")
    return False
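
The function above only prints what it would re-prompt with. A minimal sketch of actually feeding the error back into the next prompt might look like the following. It uses only the standard library; `retry_with_feedback`, `parse_scores`, and `fake_llm` are hypothetical names for this illustration, and the wording of the feedback message is an assumption, not a recommended prompt.

```python
import json
from typing import Callable, Optional


def retry_with_feedback(
    call_llm: Callable[[str], str],
    parse: Callable[[str], dict],
    prompt: str,
    max_retries: int = 3,
) -> Optional[dict]:
    """Call the model; on failure, append the error to the next prompt."""
    current_prompt = prompt
    for _ in range(max_retries):
        raw = call_llm(current_prompt)
        try:
            return parse(raw)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            current_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {exc}\n"
                "Return only a JSON object that fixes this problem."
            )
    return None


def parse_scores(raw: str) -> dict:
    """Hypothetical parser: decode JSON and require a numeric 'clarity' field."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("clarity"), (int, float)):
        raise ValueError("field 'clarity' must be a number")
    return data


# Stand-in for a real model: fails twice, then returns a valid reply.
replies = iter(['NOT_JSON: error', '{"clarity": "high"}', '{"clarity": 0.9}'])
seen_prompts = []


def fake_llm(prompt: str) -> str:
    seen_prompts.append(prompt)
    return next(replies)


result = retry_with_feedback(fake_llm, parse_scores, "Evaluate the response")
print(result)
print(seen_prompts[-1])
```

Each retry restarts from the original prompt plus the latest error, so the prompt does not grow with every failed attempt.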

### Run Pydantic Validation Demo

In [None]:
print("🚀 Running Pydantic Validation Demo\n")
result = validate_with_pydantic(max_retries=5)
print(f"\n{'='*60}")
print(f"Final Result: {'✅ SUCCESS' if result else '❌ FAILED'}")
print(f"{'='*60}")

---

## Section 2: JSON Schema Validation

**Advantages:**
- Language-agnostic (works in any language)
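- Standard specification (published as a series of IETF Internet-Drafts, currently Draft 2020-12; it is not an RFC, and RFC 8927 describes the separate JSON Type Definition format)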
- Portable across systems
- Lighter weight (no heavy dependencies)

**Disadvantages:**
- No automatic type coercion
- Less detailed error messages by default (`validate()` raises only a single error per call)
- No custom business logic validators
- Requires separate validation library
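
Two of these points can be made concrete with the `jsonschema` package: a numeric string is not coerced, and all violations can still be collected by iterating over a validator instead of calling `validate()`. This is a sketch under assumptions: `eval_schema` is a trimmed-down copy of the schema defined below, and `Draft7Validator` is chosen only because it is also available in older `jsonschema` releases.

```python
from jsonschema import Draft7Validator

# Trimmed-down copy of the evaluation schema, repeated so this cell runs on its own.
eval_schema = {
    "type": "object",
    "properties": {
        "factual_accuracy": {"type": "number", "minimum": 0, "maximum": 1},
        "completeness": {"type": "number", "minimum": 0, "maximum": 1},
        "comments": {"type": "string", "minLength": 1},
    },
    "required": ["factual_accuracy", "completeness", "comments"],
}

# "0.85" is a string (no coercion), 1.5 is out of range, and "comments" is missing.
candidate = {"factual_accuracy": "0.85", "completeness": 1.5}

validator = Draft7Validator(eval_schema)
errors = list(validator.iter_errors(candidate))
for err in errors:
    location = "/".join(str(part) for part in err.path) or "<root>"
    print(f"{location}: {err.message} (keyword: {err.validator})")

failed_keywords = sorted(err.validator for err in errors)
```

Collecting every error this way gives the re-prompt step more to work with than the single error raised by `validate()`.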

In [None]:
from jsonschema import validate, ValidationError as JSONSchemaValidationError

### Define the JSON Schema

In [None]:
schema = {
    "type": "object",
    "properties": {
        "factual_accuracy": {
            "type": "number",
            "minimum": 0,
            "maximum": 1,
            "description": "Accuracy score between 0 and 1"
        },
        "completeness": {
            "type": "number",
            "minimum": 0,
            "maximum": 1,
            "description": "Completeness score between 0 and 1"
        },
        "clarity": {
            "type": "number",
            "minimum": 0,
            "maximum": 1,
            "description": "Clarity score between 0 and 1"
        },
        "comments": {
            "type": "string",
            "minLength": 1,
            "description": "Evaluation comments"
        },
    },
    "required": ["factual_accuracy", "completeness", "clarity", "comments"],
    "additionalProperties": False  # stricter than EvalModel, which ignores extra fields by default
}

print("✅ JSON Schema defined")
print("\nSchema structure:")
print(json.dumps(schema, indent=2))

### Simulated LLM Responses (Same Failure Modes as Before)

In [None]:
def simulated_llm_json_schema():
    """Simulate LLM responses for JSON Schema validation."""
    choices = [
        # Valid response
        json.dumps({
            "factual_accuracy": 0.9,
            "completeness": 0.81,
            "clarity": 0.95,
            "comments": "ok"
        }),
        
        # Type error: string instead of number
        json.dumps({
            "factual_accuracy": "high",
            "completeness": 0.8,
            "clarity": 0.9,
            "comments": "good"
        }),
        
        # Missing required field
        json.dumps({
            "factual_accuracy": 0.85,
            "completeness": 0.8,
            "clarity": 0.9
        }),
        
        # Invalid JSON
        'NOT_JSON: error'
    ]
    return random.choice(choices)

print("✅ Simulator defined")

### Validation with Retry Logic

In [None]:
def validate_with_json_schema(max_retries=3):
    """Validate LLM output with JSON Schema, retrying on failure."""
    
    for i in range(max_retries):
        print(f"\n{'='*60}")
        print(f"Attempt {i+1}/{max_retries}")
        print(f"{'='*60}")
        
        raw = simulated_llm_json_schema()
        print(f"Raw LLM output: {raw}")
        
        try:
            data = json.loads(raw)
            print("✅ Valid JSON")
            
            # Validate against JSON Schema
            validate(instance=data, schema=schema)
            print(f"✅ JSON Schema validation passed!")
            print(f"   Validated data: {json.dumps(data, indent=2)}")
            print(f"\n   Access fields as dict keys:")
            print(f"   - Accuracy: {data['factual_accuracy']}")
            print(f"   - Comments: {data['comments']}")
            return True
            
        except json.JSONDecodeError as e:
            print(f"❌ Invalid JSON: {e}")
            print(f"   → Would re-prompt: 'Return valid JSON only'")
            
        except JSONSchemaValidationError as e:
            print(f"❌ Schema Validation Failed:")
            print(f"   Message: {e.message}")
            print(f"   Failed at path: {list(e.path)}")
            print(f"   Schema path: {list(e.schema_path)}")
            print(f"   → Would re-prompt with validation feedback")
        
        time.sleep(0.1)
    
    print(f"\n❌ Failed after {max_retries} attempts")
    return False

### Run JSON Schema Validation Demo

In [None]:
print("🚀 Running JSON Schema Validation Demo\n")
result = validate_with_json_schema(max_retries=5)
print(f"\n{'='*60}")
print(f"Final Result: {'✅ SUCCESS' if result else '❌ FAILED'}")
print(f"{'='*60}")

---

## Comparison Summary

| Feature | Pydantic | JSON Schema |
|---------|----------|-------------|
| **Type Coercion** | ✅ Yes (automatic) | ❌ No |
| **Error Messages** | ✅ Detailed | ⚠️ Basic |
| **Custom Validators** | ✅ Yes | ❌ No |
| **Language Support** | 🐍 Python only | 🌍 Any language |
| **Performance** | Fast | Fast |
| **Learning Curve** | Moderate | Low |
| **Dependencies** | Pydantic package | jsonschema package |
| **API Integration** | Python native | Universal |

### When to Use What?

**Use Pydantic when:**
- Building Python-native agents
- Need custom business logic validation
- Want better error messages for debugging
- Working with LangChain or similar frameworks

**Use JSON Schema when:**
- Need language-agnostic validation
- Building polyglot systems
- Simple structural validation is sufficient
- Want portable validation rules

**Pro Tip:** A common pattern is to define models in Pydantic for a better developer experience (DX), then generate JSON Schema from those models for API documentation and cross-language compatibility.

## Bonus: Generate JSON Schema from Pydantic

You can get the best of both worlds by exporting a Pydantic model as JSON Schema:

In [None]:
# Convert Pydantic model to JSON Schema
pydantic_as_json_schema = EvalModel.model_json_schema()

print("🔄 Pydantic model as JSON Schema:")
print(json.dumps(pydantic_as_json_schema, indent=2))

print("\n💡 This allows you to:")
print("   - Develop with Pydantic's rich features")
print("   - Export to JSON Schema for API docs")
print("   - Share schemas with non-Python services")
print("   - Maintain a single source of truth")
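
To check that an exported schema is usable outside Pydantic, it can be passed straight to the `jsonschema` package. The sketch below assumes Pydantic v2 and a hypothetical `StrictEvalModel`; setting `extra="forbid"` is an illustrative choice that makes the exported schema reject unknown fields, matching the `additionalProperties: False` setting in the handwritten schema from Section 2.

```python
from jsonschema import ValidationError as JSONSchemaValidationError
from jsonschema import validate
from pydantic import BaseModel, ConfigDict, Field


class StrictEvalModel(BaseModel):
    """Hypothetical variant of EvalModel that rejects unknown fields."""
    model_config = ConfigDict(extra="forbid")

    factual_accuracy: float = Field(ge=0, le=1)
    comments: str = Field(min_length=1)


exported = StrictEvalModel.model_json_schema()

good = {"factual_accuracy": 0.85, "comments": "Good."}
bad = {"factual_accuracy": 0.85, "comments": "Good.", "extra_field": 1}

# A conforming payload passes silently.
validate(instance=good, schema=exported)

# A payload with an unknown field is rejected by the exported schema.
try:
    validate(instance=bad, schema=exported)
    rejected = False
except JSONSchemaValidationError as exc:
    rejected = True
    print(exc.message)
```

The exported schema carries structural rules only. Type coercion and custom validators written in Python are not part of it, so a non-Python consumer validates more strictly on types and not at all on custom business rules.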