# Notebook 3: Structured Outputs (Deep Dive)

This is the most important notebook for building reliable LLM systems.

Topics:
- Why free-text responses are dangerous
- JSON mode and schema-based prompting
- Pydantic models for validation
- Handling invalid outputs (retries, fallbacks)
- End-to-end structured extraction

## Setup

In [1]:
from dotenv import load_dotenv
from openai import OpenAI
from pydantic import BaseModel, Field, ValidationError
from typing import Optional, List
import json

load_dotenv()
client = OpenAI()
MODEL = "gpt-4o-mini"

## Part 1: Why Free-Text Responses Are Dangerous

### The Problem

<img src="img/n3_img_1.png" alt="Why Free-Text Responses Are Dangerous" width="560" style="max-width: 100%; height: auto;">

In [2]:
# ❌ Free-text response
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "Extract the product name, price, and category from: 'The Alpine Pro Tent costs $299 and is in the Camping category.'"}
    ],
    temperature=0
)

result = response.choices[0].message.content
print("❌ Free-text response:")
print(result)
print("\nProblem: How do we parse this reliably?")

❌ Free-text response:
Product Name: Alpine Pro Tent  
Price: $299  
Category: Camping

Problem: How do we parse this reliably?


**Problems with free-text:**
- Format varies between responses
- Hard to parse programmatically
- No validation
- Breaks downstream systems
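To make the parsing problem concrete, here is a minimal sketch (not part of the original notebook) of a regex parser written against the labelled format shown in the output above. The helper name `parse_free_text` and the second sample reply are illustrative assumptions.

```python
import re
from typing import Optional

def parse_free_text(reply: str) -> Optional[dict]:
    """Parse a 'Label: value' style reply; return None if any field is missing."""
    patterns = {
        "name": r"Product Name:\s*(.+)",
        "price": r"Price:\s*\$?([\d.]+)",
        "category": r"Category:\s*(.+)",
    }
    parsed = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, reply)
        if match is None:
            return None  # one missing label invalidates the whole reply
        parsed[field] = match.group(1).strip()
    return parsed

# Same wording as the free-text output above
labelled = "Product Name: Alpine Pro Tent  \nPrice: $299  \nCategory: Camping"
# Hypothetical alternative phrasing carrying the same information
sentence = "The product is the Alpine Pro Tent, priced at $299, in Camping."

print(parse_free_text(labelled))
print(parse_free_text(sentence))
```

The first call recovers all three fields, with the price still a string. The second returns `None` even though the same information is present, which is the kind of silent failure that breaks downstream code.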

## Part 2: JSON Mode (Basic)

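OpenAI's `response_format={"type": "json_object"}` setting constrains the model to return a JSON object. The prompt itself must still ask for JSON, which the system message below does.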

In [3]:
# ✅ JSON mode
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "You extract product information. Always respond with valid JSON."
        },
        {
            "role": "user",
            "content": "Extract the product name, price, and category from: 'The Alpine Pro Tent costs $299 and is in the Camping category.'"
        }
    ],
    response_format={"type": "json_object"},
    temperature=0
)

result = response.choices[0].message.content
print("✅ JSON mode response:")
print(result)

# Parse as JSON
data = json.loads(result)
print("\n✅ Parsed data:")
print(f"Product: {data.get('name')}")
print(f"Price: ${data.get('price')}")
print(f"Category: {data.get('category')}")

✅ JSON mode response:
{
  "product_name": "Alpine Pro Tent",
  "price": 299,
  "category": "Camping"
}

✅ Parsed data:
Product: None
Price: $299
Category: Camping


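**Key Insight:** JSON mode ensures the output parses as JSON (as long as it isn't cut off by a token limit), but it doesn't enforce a specific schema. Above, the model chose the key `product_name`, so `data.get('name')` returned `None`.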
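A cheap guard is to compare the keys you expect against the keys that actually came back before reading any values. The sketch below is illustrative only: `missing_keys` and `EXPECTED_KEYS` are hypothetical names, and the sample payload is copied from the JSON mode output above.

```python
import json

EXPECTED_KEYS = {"name", "price", "category"}

def missing_keys(raw: str, expected: set) -> set:
    """Return the expected keys that are absent from a JSON object string."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        # Valid JSON, but not an object (e.g. a list), so every key is missing
        return set(expected)
    return set(expected) - set(payload)

raw = '{"product_name": "Alpine Pro Tent", "price": 299, "category": "Camping"}'
print(missing_keys(raw, EXPECTED_KEYS))  # {'name'}
```

Failing loudly on a non-empty result is usually safer than letting `dict.get` return `None` and carrying on.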

## Part 3: Schema-Based Prompting

Define the exact structure you want in the prompt.

In [4]:
# Define schema in prompt
schema_prompt = """Extract product information and return JSON with this exact structure:
{
  "name": string,
  "price": number (without $ symbol),
  "category": string,
  "in_stock": boolean
}
"""

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": schema_prompt},
        {
            "role": "user",
            "content": "The Alpine Pro Tent costs $299 and is in the Camping category. Currently available."
        }
    ],
    response_format={"type": "json_object"},
    temperature=0
)

result = json.loads(response.choices[0].message.content)
print("✅ Schema-based response:")
print(json.dumps(result, indent=2))

✅ Schema-based response:
{
  "name": "Alpine Pro Tent",
  "price": 299,
  "category": "Camping",
  "in_stock": true
}


**Better, but still problems:**
- No type validation
- No required field enforcement
- No value constraints (e.g., price > 0)
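The sketch below shows what closing these three gaps by hand looks like, using only the standard library. `check_product` is a hypothetical helper written for this illustration, and the `bad` payload is made up; it is not something the model returned in this notebook.

```python
import json

def check_product(payload: dict) -> list:
    """Return a list of problems; an empty list means the payload passed."""
    problems = []
    # Required field enforcement
    for key in ("name", "price", "category"):
        if key not in payload:
            problems.append(f"missing field: {key}")
    price = payload.get("price")
    # Type validation (bool is a subclass of int in Python, so exclude it)
    if "price" in payload and (isinstance(price, bool) or not isinstance(price, (int, float))):
        problems.append("price is not a number")
    # Value constraint
    elif isinstance(price, (int, float)) and price <= 0:
        problems.append("price must be > 0")
    return problems

good = json.loads('{"name": "Alpine Pro Tent", "price": 299, "category": "Camping", "in_stock": true}')
bad = json.loads('{"name": "Alpine Pro Tent", "price": "-299"}')

print(check_product(good))  # []
print(check_product(bad))   # ['missing field: category', 'price is not a number']
```

Even for three fields the checks are already fiddly, and they have to be kept in sync with the prompt by hand.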

**Solution: Pydantic**

## Part 4: Pydantic Models for Validation

Pydantic provides runtime type validation and schema generation.

In [5]:
# Define Pydantic model
class Product(BaseModel):
    """A product with validated fields."""
    name: str = Field(..., min_length=1, description="Product name")
    price: float = Field(..., gt=0, description="Price in dollars")
    category: str = Field(..., description="Product category")
    in_stock: bool = Field(default=True, description="Availability status")
    features: Optional[List[str]] = Field(default=None, description="Product features")

# Generate JSON schema from Pydantic model
schema = Product.model_json_schema()
print("üìã Generated schema:")
print(json.dumps(schema, indent=2))

📋 Generated schema:
{
  "description": "A product with validated fields.",
  "properties": {
    "name": {
      "description": "Product name",
      "minLength": 1,
      "title": "Name",
      "type": "string"
    },
    "price": {
      "description": "Price in dollars",
      "exclusiveMinimum": 0,
      "title": "Price",
      "type": "number"
    },
    "category": {
      "description": "Product category",
      "title": "Category",
      "type": "string"
    },
    "in_stock": {
      "default": true,
      "description": "Availability status",
      "title": "In Stock",
      "type": "boolean"
    },
    "features": {
      "anyOf": [
        {
          "items": {
            "type": "string"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Product features",
      "title": "Features"
    }
  },
  "required": [
    "name",
    "price",
    "category"
  ],
  "title": "Product",
  "type": "object"
}

### Using Pydantic with OpenAI

In [6]:
def extract_product(text: str) -> Product:
    """Extract product information with Pydantic validation."""
    
    # Build prompt with schema
    schema_json = json.dumps(Product.model_json_schema(), indent=2)
    
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": f"""Extract product information from text.
Return JSON matching this schema:
{schema_json}

Rules:
- price must be a positive number (no $ symbol)
- name must not be empty
- in_stock defaults to true if not mentioned"""
            },
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"},
        temperature=0
    )
    
    # Parse and validate with Pydantic
    result = json.loads(response.choices[0].message.content)
    return Product(**result)

# Test extraction
text = "The Alpine Pro Tent costs $299 and is in the Camping category. Features: waterproof, 4-person capacity."
product = extract_product(text)

print("✅ Validated product:")
print(f"Name: {product.name}")
print(f"Price: ${product.price}")
print(f"Category: {product.category}")
print(f"In Stock: {product.in_stock}")
print(f"Features: {product.features}")

✅ Validated product:
Name: Alpine Pro Tent
Price: $299.0
Category: Camping
In Stock: True
Features: ['waterproof', '4-person capacity']


**Key Benefits:**
- Type validation (price is float, not string)
- Constraint validation (price > 0)
- Required field enforcement
- IDE autocomplete and type hints
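These checks can be exercised without calling the API by feeding hand-written payloads to the model class. In the sketch below `Product` is redeclared (without the field descriptions) so the block runs on its own, `failed_fields` is a hypothetical helper, and the payloads are made up for illustration. Note that Pydantic's default (non-strict) mode coerces a numeric string such as `"299"` to a float, while `"$299"` is rejected.

```python
from typing import List, Optional
from pydantic import BaseModel, Field, ValidationError

class Product(BaseModel):
    """Same fields and constraints as the notebook's Product model."""
    name: str = Field(..., min_length=1)
    price: float = Field(..., gt=0)
    category: str
    in_stock: bool = True
    features: Optional[List[str]] = None

def failed_fields(payload: dict) -> list:
    """Return the sorted top-level field names that fail validation."""
    try:
        Product(**payload)
    except ValidationError as exc:
        # Each error dict carries a 'loc' tuple whose first item is the field name
        return sorted({str(err["loc"][0]) for err in exc.errors()})
    return []

# Empty name, negative price, missing category
print(failed_fields({"name": "", "price": -5}))  # ['category', 'name', 'price']
# A currency symbol cannot be parsed as a float
print(failed_fields({"name": "Tent", "price": "$299", "category": "Camping"}))  # ['price']
# A plain numeric string is coerced in the default mode
coerced = Product(name="Tent", price="299", category="Camping")
print(coerced.price)  # 299.0
```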

## Part 5: Handling Invalid Outputs

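Even with JSON mode, the model can return JSON that parses but fails validation: a wrong type, a missing field, or an out-of-range value. Catch those failures and retry or fall back instead of letting them crash the pipeline.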

In [7]:
def extract_product_with_retry(
    text: str,
    max_retries: int = 3
) -> Optional[Product]:
    """Extract product with retry logic for validation failures."""
    
    schema_json = json.dumps(Product.model_json_schema(), indent=2)
    
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=MODEL,
                messages=[
                    {
                        "role": "system",
                        "content": f"""Extract product information from text.
Return JSON matching this schema:
{schema_json}

STRICT RULES:
- price must be a positive number (no $ symbol, no text)
- name must not be empty
- category must not be empty
- in_stock must be true or false"""
                    },
                    {"role": "user", "content": text}
                ],
                response_format={"type": "json_object"},
                temperature=0
            )
            
            result = json.loads(response.choices[0].message.content)
            product = Product(**result)
            return product
            
        except ValidationError as e:
            print(f"❌ Validation failed (attempt {attempt + 1}/{max_retries}):")
            print(f"   {e}")
            if attempt == max_retries - 1:
                print("   Max retries reached. Returning None.")
                return None
        
        except json.JSONDecodeError as e:
            print(f"❌ JSON parsing failed (attempt {attempt + 1}/{max_retries}):")
            print(f"   {e}")
            if attempt == max_retries - 1:
                return None
    
    return None

# Test with valid input
text = "Summit Backpack - $149 - Hiking gear"
product = extract_product_with_retry(text)

if product:
    print("\n✅ Successfully extracted:")
    print(product.model_dump_json(indent=2))
else:
    print("\n❌ Extraction failed after retries")


✅ Successfully extracted:
{
  "name": "Summit Backpack",
  "price": 149.0,
  "category": "Hiking gear",
  "in_stock": true,
  "features": null
}
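The retry loop above resends the identical request at `temperature=0`, so a failed attempt is likely to fail the same way again. One common variation, sketched here under assumptions, appends the invalid output and the error message to the conversation so the next attempt has something new to work with. `extract_with_feedback` and `fake_model` are hypothetical names, `call_model` stands in for the API call so the block runs offline, and `Product` is redeclared in reduced form.

```python
import json
from typing import Callable, List, Optional
from pydantic import BaseModel, Field, ValidationError

class Product(BaseModel):
    """Reduced version of the notebook's Product model."""
    name: str = Field(..., min_length=1)
    price: float = Field(..., gt=0)
    category: str

def extract_with_feedback(
    text: str,
    call_model: Callable[[List[dict]], str],
    max_retries: int = 3,
) -> Optional[Product]:
    """Retry extraction, feeding each failure back into the conversation."""
    messages = [
        {"role": "system", "content": "Extract product info as JSON with name, price, category."},
        {"role": "user", "content": text},
    ]
    for _ in range(max_retries):
        raw = call_model(messages)
        try:
            return Product(**json.loads(raw))
        except (ValidationError, json.JSONDecodeError, TypeError) as exc:
            # TypeError covers valid JSON that is not an object (e.g. a list)
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"That output was invalid: {exc}. Return corrected JSON only.",
            })
    return None

def fake_model(messages: List[dict]) -> str:
    """Stand-in for the API: wrong on the first call, right once it sees feedback."""
    if len(messages) == 2:
        return '{"name": "Summit Backpack", "price": "$149", "category": "Hiking gear"}'
    return '{"name": "Summit Backpack", "price": 149, "category": "Hiking gear"}'

product = extract_with_feedback("Summit Backpack - $149 - Hiking gear", fake_model)
print(product)
```

With the real client, `call_model` would wrap `client.chat.completions.create(...)` and return `response.choices[0].message.content`.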


### Fallback Strategies

In [8]:
def extract_product_with_fallback(text: str) -> Product:
    """Extract product with fallback to default values."""
    
    try:
        return extract_product(text)
    except (ValidationError, json.JSONDecodeError) as e:
        print(f"⚠️ Extraction failed: {e}")
        print("   Using fallback values...")
        
        # Return a safe default (Product schema requires price > 0)
        return Product(
            name="Unknown Product",
            price=0.01,
            category="Uncategorized",
            in_stock=False
        )

# Test fallback
product = extract_product_with_fallback("Some ambiguous text")
print("\nüì¶ Product (with fallback):")
print(product.model_dump_json(indent=2))


📦 Product (with fallback):
{
  "name": "Ambiguous Product",
  "price": 19.99,
  "category": "Miscellaneous",
  "in_stock": true,
  "features": null
}
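In the run above the fallback branch never executed: the model invented a product (`Ambiguous Product`, 19.99) that passed validation, so no exception was raised. To exercise the fallback path deterministically, one option is to make the extractor injectable and pass in a stub that fails. This is a sketch under assumptions: `with_fallback`, `FALLBACK`, and `broken_extractor` are hypothetical names, and `Product` is redeclared in reduced form so the block runs offline.

```python
import json
from typing import Callable
from pydantic import BaseModel, Field, ValidationError

class Product(BaseModel):
    """Reduced version of the notebook's Product model."""
    name: str = Field(..., min_length=1)
    price: float = Field(..., gt=0)
    category: str
    in_stock: bool = True

# Same default values as the notebook's fallback branch
FALLBACK = Product(name="Unknown Product", price=0.01, category="Uncategorized", in_stock=False)

def with_fallback(extract: Callable[[str], Product], text: str) -> Product:
    """Run an extractor and return a copy of FALLBACK if its output is invalid."""
    try:
        return extract(text)
    except (ValidationError, json.JSONDecodeError):
        return FALLBACK.model_copy()

def broken_extractor(text: str) -> Product:
    """Simulates a model reply whose price violates the gt=0 constraint."""
    return Product(**json.loads('{"name": "Tent", "price": 0, "category": "Camping"}'))

result = with_fallback(broken_extractor, "Some ambiguous text")
print(result.model_dump_json(indent=2))
```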


## Part 6: Complex Nested Structures

Pydantic handles nested models for complex data.

In [9]:
class Review(BaseModel):
    """A product review."""
    rating: int = Field(..., ge=1, le=5, description="Rating from 1-5")
    comment: str = Field(..., description="Review comment")
    reviewer: str = Field(..., description="Reviewer name")

class ProductWithReviews(BaseModel):
    """A product with reviews."""
    name: str
    price: float = Field(..., gt=0)
    category: str
    reviews: List[Review] = Field(default_factory=list)
    average_rating: Optional[float] = Field(None, ge=1, le=5)

# Extract complex nested data
def extract_product_with_reviews(text: str) -> ProductWithReviews:
    schema_json = json.dumps(ProductWithReviews.model_json_schema(), indent=2)
    
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": f"""Extract product information including reviews.
Return JSON matching this schema:
{schema_json}

Rules:
- Extract all reviews mentioned
- Rating must be 1-5
- Calculate average_rating from reviews"""
            },
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"},
        temperature=0
    )
    
    result = json.loads(response.choices[0].message.content)
    return ProductWithReviews(**result)

# Test with reviews
text = """Alpine Pro Tent - $299 - Camping
Reviews:
- John: 5 stars - "Excellent tent, very spacious!"
- Sarah: 4 stars - "Good quality but a bit heavy"
- Mike: 5 stars - "Perfect for family camping"
"""

product = extract_product_with_reviews(text)
print("✅ Product with reviews:")
print(product.model_dump_json(indent=2))

✅ Product with reviews:
{
  "name": "Alpine Pro Tent",
  "price": 299.0,
  "category": "Camping",
  "reviews": [
    {
      "rating": 5,
      "comment": "Excellent tent, very spacious!",
      "reviewer": "John"
    },
    {
      "rating": 4,
      "comment": "Good quality but a bit heavy",
      "reviewer": "Sarah"
    },
    {
      "rating": 5,
      "comment": "Perfect for family camping",
      "reviewer": "Mike"
    }
  ],
  "average_rating": 4.67
}
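The prompt asks the model to calculate `average_rating`, so it is worth cross-checking that value in code rather than trusting the model's arithmetic. A minimal sketch, using the ratings from the output above; the `average_rating` helper and the two-decimal rounding are assumptions made for this illustration.

```python
from statistics import mean
from typing import List, Optional

def average_rating(reviews: List[dict]) -> Optional[float]:
    """Mean rating rounded to two decimals, or None when there are no reviews."""
    ratings = [review["rating"] for review in reviews]
    if not ratings:
        return None
    return round(mean(ratings), 2)

# Ratings copied from the extracted reviews above
reviews = [
    {"reviewer": "John", "rating": 5},
    {"reviewer": "Sarah", "rating": 4},
    {"reviewer": "Mike", "rating": 5},
]

print(average_rating(reviews))  # 4.67
```

Here the recomputed value matches the model's 4.67. A mismatch would be a reason to overwrite the field or reject the extraction.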


## Part 7: End-to-End Example (Production Pattern)

A complete extraction pattern that reports a status, an error message, and an attempt count instead of raising or returning `None`.

<img src="img/n3_img_2.png" alt="End-to-End Example" width="560" style="max-width: 100%; height: auto;">

In [10]:
from enum import Enum

class ExtractionStatus(str, Enum):
    SUCCESS = "success"
    VALIDATION_ERROR = "validation_error"
    PARSING_ERROR = "parsing_error"
    API_ERROR = "api_error"

class ExtractionResult(BaseModel):
    """Result of an extraction attempt."""
    status: ExtractionStatus
    data: Optional[Product] = None
    error: Optional[str] = None
    attempts: int = 1

def extract_product_production(
    text: str,
    max_retries: int = 3
) -> ExtractionResult:
    """Production-grade extraction with detailed error handling."""
    
    schema_json = json.dumps(Product.model_json_schema(), indent=2)
    
    for attempt in range(1, max_retries + 1):
        try:
            response = client.chat.completions.create(
                model=MODEL,
                messages=[
                    {
                        "role": "system",
                        "content": f"""Extract product information from text.
Return JSON matching this schema:
{schema_json}

STRICT RULES:
- price must be a positive number
- name and category must not be empty
- If information is missing, use reasonable defaults"""
                    },
                    {"role": "user", "content": text}
                ],
                response_format={"type": "json_object"},
                temperature=0
            )
            
            result = json.loads(response.choices[0].message.content)
            product = Product(**result)
            
            return ExtractionResult(
                status=ExtractionStatus.SUCCESS,
                data=product,
                attempts=attempt
            )
            
        except ValidationError as e:
            if attempt == max_retries:
                return ExtractionResult(
                    status=ExtractionStatus.VALIDATION_ERROR,
                    error=str(e),
                    attempts=attempt
                )
        
        except json.JSONDecodeError as e:
            if attempt == max_retries:
                return ExtractionResult(
                    status=ExtractionStatus.PARSING_ERROR,
                    error=str(e),
                    attempts=attempt
                )
        
        except Exception as e:
            if attempt == max_retries:
                return ExtractionResult(
                    status=ExtractionStatus.API_ERROR,
                    error=str(e),
                    attempts=attempt
                )
    
    return ExtractionResult(
        status=ExtractionStatus.API_ERROR,
        error="Max retries exceeded",
        attempts=max_retries
    )

# Test production extraction
texts = [
    "Alpine Pro Tent - $299 - Camping gear",
    "Summit Backpack costs 149 dollars, hiking category",
    "Some random text without product info"
]

for text in texts:
    print(f"\nüìù Input: {text}")
    result = extract_product_production(text)
    
    if result.status == ExtractionStatus.SUCCESS:
        print(f"✅ Status: {result.status.value}")
        print(f"   Product: {result.data.name} - ${result.data.price}")
    else:
        print(f"❌ Status: {result.status.value}")
        print(f"   Error: {result.error}")
    
    print(f"   Attempts: {result.attempts}")


📝 Input: Alpine Pro Tent - $299 - Camping gear
✅ Status: success
   Product: Alpine Pro Tent - $299.0
   Attempts: 1

📝 Input: Summit Backpack costs 149 dollars, hiking category
✅ Status: success
   Product: Summit Backpack - $149.0
   Attempts: 1

📝 Input: Some random text without product info
✅ Status: success
   Product: Default Product - $1.0
   Attempts: 1
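In the last test case the input contains no product, yet the result is `success` with `Default Product - $1.0`, most likely because the prompt tells the model to use reasonable defaults. Schema validation cannot catch this, since the invented values are well-formed. One possible extra guard, sketched here as a rough heuristic rather than a complete solution, is to check that the extracted name actually appears in the source text. `is_grounded` is a hypothetical helper.

```python
def is_grounded(name: str, source_text: str) -> bool:
    """Heuristic: every word of the extracted name appears in the source text."""
    source_words = set(source_text.lower().split())
    name_words = name.lower().split()
    # An empty name is never considered grounded
    return bool(name_words) and all(word in source_words for word in name_words)

# Inputs and extracted names taken from the test run above
print(is_grounded("Alpine Pro Tent", "Alpine Pro Tent - $299 - Camping gear"))                # True
print(is_grounded("Summit Backpack", "Summit Backpack costs 149 dollars, hiking category"))   # True
print(is_grounded("Default Product", "Some random text without product info"))                # False
```

A whitespace split is deliberately simple and will miss names attached to punctuation. A result that fails the check could be given its own status value instead of being counted as a success.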
