# 📋 Enforcing Structured Output with Pydantic

**Structured output** ensures that language models return data in a consistent, validated format. This notebook explores how to use Pydantic schemas to enforce structured JSON output from LLMs.

## 🎯 What You'll Learn

1. Setting up Google GenAI client for structured output
2. Defining Pydantic schemas for data validation
3. Creating complex nested data structures
4. Implementing custom field validators
5. Extracting structured invoice data from text
6. Comparing extracted data with ground truth
7. Calculating field-level accuracy metrics
8. Batch processing multiple documents

---

## 🚀 Section 1: Setup and Installation

In [None]:
!pip install -U -q deepdiff


In [None]:
from google.colab import userdata
api_key=userdata.get('GOOGLE_API_KEY_1')
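
Outside Colab, `google.colab.userdata` is not available. A minimal sketch of a fallback that reads the key from an environment variable instead (for example one populated from a `.env` file by `load_dotenv()`). The helper name `resolve_api_key` and the reuse of `GOOGLE_API_KEY_1` as the variable name are assumptions for illustration, not part of the notebook.

```python
import os


def resolve_api_key(env_var: str = "GOOGLE_API_KEY_1") -> str:
    """Return the API key from the environment, failing loudly if it is missing.

    Hypothetical helper: the variable name mirrors the Colab secret used above.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

With this in place, `api_key = resolve_api_key()` could stand in for the `userdata.get(...)` call when running locally.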

In [None]:
# Get Data
!git clone https://github.com/AI360-Labs/GenAI_Fundamentals

Cloning into 'GenAI_Fundamentals'...
remote: Enumerating objects: 86, done.
remote: Counting objects: 100% (86/86), done.
remote: Compressing objects: 100% (48/48), done.
remote: Total 86 (delta 39), reused 81 (delta 38), pack-reused 0 (from 0)
Receiving objects: 100% (86/86), 2.94 MiB | 18.37 MiB/s, done.
Resolving deltas: 100% (39/39), done.


---

## 🔌 Section 2: Connecting to the Model

### 📡 Google GenAI Client Setup

We'll use Google's Gemini model for structured data extraction.

In [None]:

import os
from google import genai
from google.genai import types
from tqdm import tqdm
from time import sleep
from dotenv import load_dotenv

load_dotenv()

MODEL_NAME = "gemini-2.5-flash-lite"
client = genai.Client(
    api_key=api_key
)

response = client.models.generate_content(
    model=MODEL_NAME,
    contents="Explain how AI works in 5 words"
)

print(response.text)

Learns from data, makes predictions.


---

## 📊 Section 3: Project Overview

### 🎯 Project Objectives

**Goal:** Extract structured information from unstructured invoice text (after OCR) for auditing purposes.

### 📋 Target Fields

**Invoice Information:**
- `invoice_id` - Unique invoice number
- `invoice_date` - Date when invoice was issued
- `due_date` - Payment due date

**Supplier Information:**
- `supplier_name` - Company/person providing services
- `supplier_address` - Physical address
- `supplier_tax_id` - Tax identification number

**Receiver Information:**
- `receiver_name` - Entity being billed
- `receiver_address` - Client address
- `receiver_tax_id` - Client tax ID

**Payment Details:**
- `total_amount` - Total amount due
- `currency` - Currency type
- `payment_terms` - Payment conditions

### 📄 Expected JSON Output Format

```json
{
    "invoice_id": "str",
    "invoice_date": "YYYY-MM-DD",
    "due_date": "YYYY-MM-DD",
    "supplier_name": "str",
    "supplier_address": "str",
    "supplier_tax_id": "str",
    "receiver_name": "str",
    "receiver_address": "str",
    "receiver_tax_id": "str",
    "total_amount": 0.0,
    "currency": "USD",
    "payment_terms": 30
}
```

---

## 🏗️ Section 4: Defining Pydantic Schemas

### 📐 Creating the Schema

We'll define Pydantic models with validation rules to ensure data quality.


In [None]:
from pydantic import BaseModel, Field, field_validator
from datetime import date, datetime
from enum import Enum


class CurrencyEnum(str, Enum):
    """
    Enumeration of the top 10 most popular currencies worldwide.
    """
    USD = "USD"  # United States Dollar
    EUR = "EUR"  # Euro
    JPY = "JPY"  # Japanese Yen
    GBP = "GBP"  # British Pound Sterling
    AUD = "AUD"  # Australian Dollar
    CAD = "CAD"  # Canadian Dollar
    CHF = "CHF"  # Swiss Franc
    CNY = "CNY"  # Chinese Yuan
    HKD = "HKD"  # Hong Kong Dollar
    NZD = "NZD"  # New Zealand Dollar


class Invoice(BaseModel):
    """
    Represents an invoice with supplier, receiver, and payment information.
    """
    invoice_id: str | None = Field(
        default=None,
        description="Unique invoice number or identifier"
    )
    invoice_date: date | None = Field(
        default=None,
        description="Date when the invoice was issued (format: YYYY-MM-DD)"
    )
    due_date: date | None = Field(
        default=None,
        description="Date by which payment is due (format: YYYY-MM-DD)"
    )

    supplier_name: str | None = Field(
        default=None,
        description="Name of the company or person providing services"
    )
    supplier_address: str | None = Field(
        default=None,
        description="Physical address of the supplier"
    )
    supplier_tax_id: str | None = Field(
        default=None,
        description="Tax identification number (VAT, EIN, etc.) of the supplier"
    )

    receiver_name: str | None = Field(
        default=None,
        description="Name of the entity or person being billed"
    )
    receiver_address: str | None = Field(
        default=None,
        description="Physical address of the receiver/client"
    )
    receiver_tax_id: str | None = Field(
        default=None,
        description="Tax identification number of the receiver/client"
    )

    total_amount: float | None = Field(
        default=None,
        ge=0,
        description="Total amount due for the entire invoice"
    )
    currency: CurrencyEnum | None = Field(
        default=None,
        description="Currency in which the invoice is issued"
    )
    payment_terms: int | None = Field(
        default=None,
        description="Payment terms and conditions in days"
    )

    @field_validator('invoice_date', 'due_date', mode='before')
    @classmethod
    def validate_date_format(cls, v):
        """
        Validates dates and converts them from various formats to date objects.
        Accepts date objects, datetime objects, and strings in multiple formats.
        Returns None for empty strings.
        """
        if v is None:
            return v

        # datetime is a subclass of date, so it must be checked first
        if isinstance(v, datetime):
            return v.date()

        if isinstance(v, date):
            return v

        if isinstance(v, str):
            v = v.strip()

            if v == '':
                return None

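            # Order matters: an ambiguous string such as 01/02/2025 is
            # parsed by the first matching format (day-first here)
            date_formats = [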
                '%Y-%m-%d', '%Y/%m/%d','%d-%m-%Y',
                '%d/%m/%Y', '%m-%d-%Y','%m/%d/%Y',
                '%d.%m.%Y', '%Y.%m.%d','%B %d, %Y',
                '%b %d, %Y','%d %B %Y','%d %b %Y',
                '%Y%m%d',
            ]

            for fmt in date_formats:
                try:
                    return datetime.strptime(v, fmt).date()
                except ValueError:
                    continue

            raise ValueError(f'Date format not recognized: {v}')

        raise ValueError(f'Invalid date type: {type(v)}')


In [None]:
test_data_complete = {
    "invoice_id": "INV-2025-001",
    "invoice_date": "2025-01-15",
    "due_date": "2025-02-15",
    "supplier_name": "Tech Solutions Inc.",
    "supplier_address": "123 Main St, San Francisco, CA 94105",
    "supplier_tax_id": "12-3456789",
    "receiver_name": "Acme Corporation",
    "receiver_address": "456 Market St, New York, NY 10001",
    "receiver_tax_id": "98-7654321",
    "total_amount": 6500.0,
    "currency": "USD",
    "payment_terms": 30
}

test_data_partial = {
    "invoice_id": "INV-2025-002",
    "invoice_date": "2025-01-20",
    "supplier_name": "Design Studio LLC",
    "receiver_name": "Beta Corp",
    "total_amount": 2500.0,
    "currency": "EUR"
}

invoice_partial = Invoice.model_validate(test_data_partial)
invoice_complete = Invoice.model_validate(test_data_complete)
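
Both dictionaries above validate cleanly. To see what happens when data breaks a rule, the sketch below uses a cut-down stand-in model and inspects the `ValidationError` that Pydantic raises. `MiniCurrency`, `MiniInvoice`, and `failing_fields` are hypothetical names that keep only two of the notebook's fields with the same constraints.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class MiniCurrency(str, Enum):
    # Hypothetical two-value stand-in for CurrencyEnum
    USD = "USD"
    EUR = "EUR"


class MiniInvoice(BaseModel):
    # Hypothetical subset of the Invoice schema, same constraints
    total_amount: Optional[float] = Field(default=None, ge=0)
    currency: Optional[MiniCurrency] = None


def failing_fields(data: dict) -> list[str]:
    """Return the sorted names of fields that fail validation (empty if valid)."""
    try:
        MiniInvoice.model_validate(data)
    except ValidationError as exc:
        # Each error carries a 'loc' tuple whose first item is the field name
        return sorted({str(err["loc"][0]) for err in exc.errors()})
    return []


print(failing_fields({"total_amount": -5, "currency": "XYZ"}))
print(failing_fields({"total_amount": 10.0, "currency": "EUR"}))
```

The first call reports both fields (a negative amount violates `ge=0`, and `XYZ` is not in the enum); the second returns an empty list.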

---

## ✅ Section 5: Testing Schema Validation

### 📅 Testing Date Format Validation

Our schema accepts multiple date formats and converts each to a `date` object, which prints as YYYY-MM-DD. Empty or whitespace-only strings become `None`.


In [None]:
test_dates = [
    {"invoice_date": "2025-01-15"},
    {"invoice_date": "2025/01/15"},
    {"invoice_date": "15-01-2025"},
    {"invoice_date": "15/01/2025"},
    {"invoice_date": "01/15/2025"},
    {"invoice_date": "15.01.2025"},
    {"invoice_date": "January 15, 2025"},
    {"invoice_date": "Jan 15, 2025"},
    {"invoice_date": "15 January 2025"},
    {"invoice_date": "20250115"},
    {"invoice_date": ""},
    {"invoice_date": "   "},
    {"invoice_date": None},
]

for test_data in test_dates:
    invoice = Invoice.model_validate(test_data)
    date_str = str(test_data['invoice_date']) if test_data['invoice_date'] is not None else "None"
    print(f"{date_str:20} -> {invoice.invoice_date}")


2025-01-15           -> 2025-01-15
2025/01/15           -> 2025-01-15
15-01-2025           -> 2025-01-15
15/01/2025           -> 2025-01-15
01/15/2025           -> 2025-01-15
15.01.2025           -> 2025-01-15
January 15, 2025     -> 2025-01-15
Jan 15, 2025         -> 2025-01-15
15 January 2025      -> 2025-01-15
20250115             -> 2025-01-15
                     -> None
                     -> None
None                 -> None
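
One caveat of trying formats in order: a string such as `01/02/2025` matches both `%d/%m/%Y` and `%m/%d/%Y`, and whichever format comes first in the list wins. The sketch below reproduces the idea with a shortened, assumed format list. `parse_first_match` is a hypothetical helper, not the validator itself.

```python
from datetime import date, datetime

# Shortened, assumed format list: day-first is tried before month-first
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]


def parse_first_match(value: str, formats: list[str] = FORMATS) -> date:
    """Return the date from the first format that parses the string."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Date format not recognized: {value}")


print(parse_first_match("01/02/2025"))  # day-first wins: 2025-02-01
print(parse_first_match("01/15/2025"))  # only month-first fits: 2025-01-15
```

If the invoices are known to come from one locale, listing only that locale's formats avoids silently swapping day and month.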


---

## 📂 Section 6: Reading Invoice Data

### 📄 Helper Functions for File I/O

Functions to read invoice files and extract text content.

In [None]:
import json
from pathlib import Path


def read_invoice_text(file_path: str | Path) -> str:
    """
    Read invoice JSON file and extract text field.

    Args:
        file_path: Path to the invoice JSON file

    Returns:
        Text content from the invoice
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data.get('text', '')


def get_all_invoices(invoices_dir: str | Path = 'data/invoces') -> list[dict[str, str]]:
    """
    Read all invoice files from directory and extract their text.

    Args:
        invoices_dir: Path to directory containing invoice JSON files

    Returns:
        List of dictionaries with filename and text content
    """
    invoices_path = Path(invoices_dir)
    invoices = []

    for file_path in invoices_path.glob('*.json'):
        text = read_invoice_text(file_path)
        invoices.append({
            'filename': file_path.name,
            'text': text
        })

    return invoices


---

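## 🤖 Section 7: Extraction with Structured Output

### 🎯 Extraction Function

Passing the Pydantic model as `response_schema`, together with `response_mime_type='application/json'`, asks Gemini to return JSON that conforms to the schema. The response is then validated again with `Invoice.model_validate_json`.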


In [None]:
def extract_invoice_data(invoice_text: str, client: genai.Client) -> Invoice:
    """
    Extract structured invoice data from text using Gemini model.

    Args:
        invoice_text: Raw text extracted from invoice
        client: Google GenAI client instance

    Returns:
        Invoice object with extracted data
    """
    system_prompt = """You are an expert invoice data extraction assistant.
    Your task is to extract structured information from invoice text.

    Instructions:
    - Extract all available fields from the invoice text
    - If a field is not found or unclear, leave it as None
    - For dates, use YYYY-MM-DD format
    - Be precise and accurate in your extraction"""

    response = client.models.generate_content(
        model=MODEL_NAME,
        contents=invoice_text,
        config=types.GenerateContentConfig(
            system_instruction=system_prompt,
            response_mime_type='application/json',
            temperature=0.01,
            response_schema=Invoice
        )
    )

    return Invoice.model_validate_json(response.text)
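
API calls can fail transiently (rate limits, network errors). A minimal sketch of a generic retry wrapper with exponential backoff that could wrap a call such as `extract_invoice_data`. `call_with_retries` and its parameters are hypothetical, and catching bare `Exception` is a simplification, since the exception types raised by the client are not shown in this notebook.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on any exception with delays of base_delay * 2**n."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

A possible use would be `call_with_retries(lambda: extract_invoice_data(invoice_text, client))`. The `sleep` parameter is injectable so the wrapper can be tested without waiting.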

---

## 🧪 Section 8: Testing Extraction

### 📝 Single Invoice Test

Let's test the extraction on a sample invoice.


In [None]:
sample_invoice_path = '/content/GenAI_Fundamentals/data/invoces/ocr/invoice_1.json'
invoice_text = read_invoice_text(sample_invoice_path)

extracted_invoice = extract_invoice_data(invoice_text, client)

In [None]:
print(extracted_invoice.model_dump_json(indent=2))

{
  "invoice_id": "030455",
  "invoice_date": "1990-10-26",
  "due_date": null,
  "supplier_name": "Kubin-Nicholson Corporation",
  "supplier_address": "P.O. Box 18674\n5880 North 60th Street\nMilwaukee, WI 53218",
  "supplier_tax_id": null,
  "receiver_name": "AMERICAN TOBACCO CO",
  "receiver_address": "GENERAL ACCOUNTING\nP.O. BOX 1100\nCHESTER\nVA\n23831",
  "receiver_tax_id": null,
  "total_amount": 24216.5,
  "currency": null,
  "payment_terms": 30
}


---

## 📊 Section 9: Batch Processing

### 🔄 Processing Multiple Invoices

Functions to process all invoices and compare with ground truth data.


In [None]:
def load_ground_truth(file_path: str | Path) -> Invoice:
    """
    Load ground truth invoice data from JSON file.

    Args:
        file_path: Path to the ground truth JSON file

    Returns:
        Invoice object with ground truth data
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return Invoice.model_validate(data)


def get_ocr_invoices(ocr_dir: str | Path = 'data/invoces/ocr') -> list[dict[str, str]]:
    """
    Read all OCR invoice files from directory and extract their text.

    Args:
        ocr_dir: Path to directory containing OCR invoice JSON files

    Returns:
        List of dictionaries with filename and text content
    """
    ocr_path = Path(ocr_dir)
    invoices = []

    for file_path in sorted(ocr_path.glob('invoice_*.json')):
        text = read_invoice_text(file_path)
        invoices.append({
            'filename': file_path.name,
            'text': text
        })

    return invoices


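def compare_invoices_field_by_field(extracted: Invoice, ground_truth: Invoice) -> dict[str, bool]:
    """
    Compare two invoices field by field.

    Two None values count as a match. Strings are compared after light
    normalization (newlines in the extracted value become ', ').

    Returns:
        Dictionary mapping each field name to True if the values match
    """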
    field_comparison = {}
    extracted_dict = extracted.model_dump()
    ground_truth_dict = ground_truth.model_dump()

    for field_name in Invoice.model_fields.keys():
        extracted_val = extracted_dict.get(field_name)
        ground_truth_val = ground_truth_dict.get(field_name)

        if extracted_val is None and ground_truth_val is None:
            field_comparison[field_name] = True
        elif extracted_val is None or ground_truth_val is None:
            field_comparison[field_name] = False
        else:
            if isinstance(extracted_val, str) and isinstance(ground_truth_val, str):
                extracted_normalized = extracted_val.strip().replace('\n', ', ').replace('  ', ' ')
                ground_truth_normalized = ground_truth_val.strip()
                field_comparison[field_name] = extracted_normalized == ground_truth_normalized
            else:
                field_comparison[field_name] = extracted_val == ground_truth_val

    return field_comparison
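
Note that the function above normalizes only the extracted string (newlines become `, `), while the ground truth is just stripped. A sketch of a symmetric alternative that applies one normalization to both sides follows. `normalize_text` and `texts_match` are hypothetical helpers, and case-insensitive matching is an assumption that may be too lenient for an audit use case.

```python
import re


def normalize_text(value: str) -> str:
    """Turn newlines into ', ', collapse repeated whitespace, and casefold."""
    value = value.strip().replace("\n", ", ")
    value = re.sub(r"\s+", " ", value)
    return value.casefold()


def texts_match(extracted: str, ground_truth: str) -> bool:
    """Compare two strings after applying the same normalization to both."""
    return normalize_text(extracted) == normalize_text(ground_truth)


print(texts_match("P.O. BOX 1100\nCHESTER", "P.O. Box 1100,  Chester"))  # True
```

Stricter or looser rules change the reported accuracy, so whichever normalization is chosen should be stated alongside the metrics.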


In [None]:
def process_all_invoices(
    client: genai.Client,
    ocr_dir: str | Path = 'data/invoces/ocr',
    ground_truth_dir: str | Path = 'data/invoces/ground_truth'
) -> dict[str, dict]:
    """
    Process all invoices and compare with ground truth.

    Args:
        client: Google GenAI client instance
        ocr_dir: Path to OCR invoices directory
        ground_truth_dir: Path to ground truth directory

    Returns:
        Dictionary with results for each invoice
    """
    ocr_path = Path(ocr_dir)
    gt_path = Path(ground_truth_dir)

    results = {}

    for ocr_file in tqdm(sorted(ocr_path.glob('invoice_*.json'))):
        invoice_name = ocr_file.stem
        gt_file = gt_path / ocr_file.name

        if not gt_file.exists():
            continue

        invoice_text = read_invoice_text(ocr_file)
        extracted = extract_invoice_data(invoice_text, client)
        ground_truth = load_ground_truth(gt_file)

        field_comparison = compare_invoices_field_by_field(extracted, ground_truth)

        results[invoice_name] = {
            'extracted': extracted,
            'ground_truth': ground_truth,
            'field_comparison': field_comparison,
        }
        sleep(1)

    return results

---

## 📈 Section 10: Calculating Accuracy

### 🎯 Running Full Extraction Pipeline

Process all invoices, then calculate exact-match accuracy for each field and across all fields.


In [None]:
results = process_all_invoices(
    client,
    ocr_dir="/content/GenAI_Fundamentals/data/invoces/ocr",
    ground_truth_dir="/content/GenAI_Fundamentals/data/invoces/ground_truth")


100%|██████████| 20/20 [00:40<00:00,  2.04s/it]


In [None]:
all_fields = list(results[list(results.keys())[0]]['field_comparison'].keys())
total_invoices = len(results)

field_stats = {}
for field_name in all_fields:
    matches = sum(1 for r in results.values() if r['field_comparison'][field_name])
    field_stats[field_name] = {
        'matches': matches,
        'total': total_invoices,
        'accuracy': matches / total_invoices
    }

print(f"Total invoices: {total_invoices}")
print("\nField-level accuracy:")
for field_name, stats in field_stats.items():
    # .1f keeps one decimal and avoids float artifacts such as 55.00000000000001
    print(f"{field_name:25}: {stats['accuracy']*100:15.1f} ({stats['matches']}/{stats['total']})")

# Overall accuracy
total_matches = sum(s['matches'] for s in field_stats.values())
total_possible = len(field_stats) * total_invoices
overall_accuracy = total_matches / total_possible if total_possible > 0 else 0
print(f"\nOverall field accuracy: {overall_accuracy:.2%}")

Total invoices: 20

Field-level accuracy:
invoice_id               :            80.0 (16/20)
invoice_date             :            75.0 (15/20)
due_date                 :            75.0 (15/20)
supplier_name            :            45.0 (9/20)
supplier_address         :            40.0 (8/20)
supplier_tax_id          :             5.0 (1/20)
receiver_name            :            90.0 (18/20)
receiver_address         :            40.0 (8/20)
receiver_tax_id          :             0.0 (0/20)
total_amount             :            45.0 (9/20)
currency                 :            75.0 (15/20)
payment_terms            :            75.0 (15/20)

Overall field accuracy: 53.75%
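
The figures above are field-level. An invoice-level exact-match rate (every field correct) is a stricter complement. The sketch below assumes the `results` structure returned by `process_all_invoices`, where each entry holds a `field_comparison` dict of booleans. `exact_match_rate` is a hypothetical helper and the sample data is made up for illustration.

```python
def exact_match_rate(results: dict[str, dict]) -> float:
    """Fraction of invoices whose fields all match the ground truth."""
    if not results:
        return 0.0
    exact = sum(
        1 for r in results.values() if all(r["field_comparison"].values())
    )
    return exact / len(results)


# Made-up sample in the same shape as `results`
sample = {
    "invoice_1": {"field_comparison": {"invoice_id": True, "currency": True}},
    "invoice_2": {"field_comparison": {"invoice_id": True, "currency": False}},
}
print(f"Exact-match rate: {exact_match_rate(sample):.2%}")  # 50.00%
```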


### 🔍 Analyzing Mismatches

Identifying which invoices didn't match exactly for further investigation.
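
A minimal sketch of this step, assuming the `results` shape produced by `process_all_invoices` (each entry has a `field_comparison` dict of booleans). `find_mismatches` is a hypothetical helper and the sample data is invented.

```python
def find_mismatches(results: dict[str, dict]) -> dict[str, list[str]]:
    """Map each invoice with at least one wrong field to its mismatched field names."""
    mismatches = {}
    for name, result in results.items():
        wrong = [field for field, ok in result["field_comparison"].items() if not ok]
        if wrong:
            mismatches[name] = wrong
    return mismatches


# Invented sample in the same shape as `results`
sample_results = {
    "invoice_1": {"field_comparison": {"invoice_id": True, "total_amount": True}},
    "invoice_2": {"field_comparison": {"invoice_id": False, "total_amount": False}},
    "invoice_3": {"field_comparison": {"invoice_id": True, "total_amount": False}},
}

for name, fields in find_mismatches(sample_results).items():
    print(f"{name}: {', '.join(fields)}")
```

With the real `results`, `getattr(results[name]['extracted'], field)` and the same lookup on `'ground_truth'` give the two values to inspect side by side.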

---
## 📚 Summary

### ✨ Key Concepts Covered

1. **Pydantic Schemas**: Defining structured invoice data models with comprehensive validation
2. **Field Validators**: Custom date validation supporting multiple formats (YYYY-MM-DD, DD/MM/YYYY, etc.)
3. **Enum Types**: CurrencyEnum restricting values to top 10 global currencies
4. **Date Handling**: Flexible date parsing from various formats to standardized YYYY-MM-DD
5. **Structured Output**: Enforcing JSON schema compliance with Gemini API
6. **Batch Processing**: Processing 20 invoices with systematic comparison
7. **Field-Level Accuracy**: Detailed validation showing per-field performance metrics
8. **OCR Integration**: Extracting structured data from unstructured invoice text
9. **Ground Truth Comparison**: Systematic validation against known correct data
10. **Error Handling**: Missing fields default to `None`; unrecognized date formats are rejected with a validation error

### 💡 Best Practices Demonstrated

- ✅ **Comprehensive field definitions** with clear descriptions and validation rules
- ✅ **Flexible date validation** supporting 13 date formats
- ✅ **Currency standardization** using enum for consistent data quality
- ✅ **Systematic field comparison** with normalization for text fields
- ✅ **Batch processing with progress tracking** using tqdm
- ✅ **Rate limiting** with sleep delays to respect API limits
- ✅ **Detailed accuracy reporting** at both field and overall levels
- ✅ **Null handling** for missing or unclear data fields


### 🎯 Next Steps

- 🔹 Implement field-level accuracy metrics
- 🔹 Add more custom validators for data quality
- 🔹 Explore schema versioning strategies
- 🔹 Build error correction mechanisms
- 🔹 Extend schemas for more complex documents

---

### 🎓 Congratulations!

You now understand how to enforce structured output from language models using Pydantic schemas and how to validate extraction quality systematically.
