# Tutorial 5: Custom Schema Building

This notebook demonstrates how to build custom schemas for specific extraction needs.

**What you'll learn:**
- Building schemas from natural language descriptions
- Creating schemas programmatically with SchemaBuilder
- Defining field types: string, integer, float, boolean, categorical
- Modifying and refining detected schemas
- Using the LLMExtractor directly

## 1. Setup

In [None]:
import os
import pandas as pd

# Replace the placeholder with your own API key; avoid committing a real key to version control
os.environ["GEMINI_API_KEY"] = "your-api-key-here"

from structify import (
    Pipeline,
    Schema,
    Field,
    FieldType,
    SchemaBuilder,
    LLMExtractor,
)

## 2. Schema from Natural Language Description

The easiest way to create a custom schema is to describe what you want in plain English:

In [None]:
# Create pipeline from description
pipeline = Pipeline.from_description("""
    Extract research findings from academic papers:
    - Author names and publication year
    - The country or region being studied
    - Main numerical finding (coefficient or percentage)
    - Statistical significance (p-value)
    - Methodology used (regression, RCT, difference-in-differences, etc.)
    - Sample size
    - Time period covered
""")

results = pipeline.fit_transform("research_papers/")
results.head()

## 3. Building Schemas with SchemaBuilder

For more control, use SchemaBuilder to define each field's name, type, and description explicitly:

In [None]:
# Create a custom schema
schema = SchemaBuilder.create(
    name="financial_metrics",
    fields=[
        # Required string field
        {"name": "company", "type": "string", "required": True,
         "description": "Company name"},
        
        # Required integer field
        {"name": "year", "type": "integer", "required": True,
         "description": "Fiscal year"},
        
        # Float field for revenue
        {"name": "revenue", "type": "float",
         "description": "Total revenue in millions"},
        
        # Float field for margin
        {"name": "profit_margin", "type": "float",
         "description": "Profit margin percentage"},
        
        # Categorical field with predefined options
        {"name": "sector", "type": "categorical",
         "options": ["Tech", "Finance", "Healthcare", "Energy", "Retail"],
         "description": "Business sector"},
        
        # Boolean field
        {"name": "is_profitable", "type": "boolean",
         "description": "Whether the company was profitable"},
    ],
    # Optional: tell the LLM where to focus
    focus_on=["financial statements", "annual reports", "earnings tables"],
    # Optional: tell the LLM what to skip
    skip=["legal disclaimers", "boilerplate text", "risk factors"],
)

print(f"Created schema: {schema.name}")
print(f"Fields: {len(schema.fields)}")

In [None]:
# Use the schema with a pipeline
pipeline = Pipeline.from_schema(schema)
results = pipeline.fit_transform("annual_reports/")
results.head()
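
Because `SchemaBuilder.create` takes plain dictionaries, a mistyped `type` string or a missing `options` list is easy to overlook. The sketch below checks a list of field dictionaries before they are passed in. It is a standalone helper, not part of structify: `check_field_specs` and `ALLOWED_TYPES` are hypothetical names, and the rules only mirror what this tutorial states (the five type names, and that categorical fields need options).

```python
# Hypothetical pre-flight check for the field dictionaries used above.
# Not part of structify; the rules only mirror what this tutorial describes.
ALLOWED_TYPES = {"string", "integer", "float", "boolean", "categorical"}


def check_field_specs(fields):
    """Return a list of human-readable problems found in `fields`."""
    problems = []
    seen = set()
    for i, spec in enumerate(fields):
        name = spec.get("name")
        if not name:
            problems.append(f"field #{i} has no name")
            continue
        if name in seen:
            problems.append(f"duplicate field name '{name}'")
        seen.add(name)
        ftype = spec.get("type")
        if ftype not in ALLOWED_TYPES:
            problems.append(f"'{name}' has unknown type {ftype!r}")
        if ftype == "categorical" and not spec.get("options"):
            problems.append(f"categorical field '{name}' has no options")
    return problems


# Made-up specs with two deliberate mistakes.
specs = [
    {"name": "company", "type": "string", "required": True},
    {"name": "sector", "type": "categorical"},   # missing options
    {"name": "revenue", "type": "number"},       # not one of the five types
]
for problem in check_field_specs(specs):
    print(problem)
```

An empty result means the dictionaries pass these basic checks. It says nothing about whether the library itself will accept them.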

## 4. Field Types Explained

| Type | Python Type | Example | Use For |
|------|-------------|---------|--------|
| `string` | str | "Apple Inc." | Names, text, descriptions |
| `integer` | int | 2023 | Counts, years, IDs |
| `float` | float | 15.7 | Percentages, rates, amounts |
| `boolean` | bool | True/False | Yes/no questions |
| `categorical` | str | "Tech" | Predefined options |
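
To make the mapping in the table concrete, the sketch below converts raw text values into the Python type for each field type. It is an illustration only: `coerce_value` is a hypothetical helper, and structify's own conversion logic is not shown in this tutorial and may differ.

```python
def coerce_value(raw, field_type, options=None):
    """Convert a raw string to the Python type for `field_type`.

    Returns None when the value cannot be converted.
    """
    text = raw.strip()
    try:
        if field_type == "string":
            return text
        if field_type == "integer":
            return int(text.replace(",", ""))
        if field_type == "float":
            return float(text.replace(",", "").rstrip("%"))
        if field_type == "boolean":
            lowered = text.lower()
            if lowered in {"true", "yes"}:
                return True
            if lowered in {"false", "no"}:
                return False
            return None
        if field_type == "categorical":
            # Case-insensitive match against the predefined options.
            for option in options or []:
                if option.lower() == text.lower():
                    return option
            return None
    except ValueError:
        return None
    raise ValueError(f"unknown field type: {field_type}")


print(coerce_value("2,023", "integer"))                         # 2023
print(coerce_value("15.7%", "float"))                           # 15.7
print(coerce_value("tech", "categorical", ["Tech", "Finance"]))  # Tech
print(coerce_value("n/a", "float"))                             # None
```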

In [None]:
# Creating fields directly
from structify import Field, FieldType

# String field
name_field = Field(
    name="author_name",
    type=FieldType.STRING,
    description="Name of the author",
    required=True,
)

# Integer field
year_field = Field(
    name="publication_year",
    type=FieldType.INTEGER,
    description="Year of publication",
    required=True,
)

# Float field
estimate_field = Field(
    name="estimate_value",
    type=FieldType.FLOAT,
    description="Main effect size or coefficient",
    required=False,
)

# Boolean field
causal_field = Field(
    name="is_causal",
    type=FieldType.BOOLEAN,
    description="Whether causal identification was attempted",
    required=False,
)

# Categorical field (MUST include options)
method_field = Field(
    name="methodology",
    type=FieldType.CATEGORICAL,
    description="Research methodology",
    options=["DID", "IV", "RDD", "RCT", "OLS", "FE", "PSM"],
    required=True,
)

print("Fields created successfully!")

## 5. Creating Schema from Fields

In [None]:
# Create a Schema object directly
custom_schema = Schema(
    name="research_findings",
    description="Schema for extracting research findings from academic papers",
    fields=[
        Field(name="author", type=FieldType.STRING, required=True,
              description="Lead author name"),
        Field(name="year", type=FieldType.INTEGER, required=True,
              description="Publication year"),
        Field(name="country", type=FieldType.STRING, required=True,
              description="Country studied"),
        Field(name="estimate", type=FieldType.FLOAT, required=True,
              description="Main coefficient or effect size"),
        Field(name="std_error", type=FieldType.FLOAT,
              description="Standard error"),
        Field(name="methodology", type=FieldType.CATEGORICAL,
              options=["DID", "IV", "RDD", "RCT", "OLS", "FE"],
              description="Research methodology"),
        Field(name="significant", type=FieldType.BOOLEAN,
              description="Whether statistically significant at p<0.05"),
        Field(name="notes", type=FieldType.STRING, required=True,
              description="Brief context about the finding"),
    ],
    extraction_rules={
        "focus_on": ["results tables", "regression output"],
        "skip": ["bibliography", "acknowledgments"],
    },
)

print(f"Schema: {custom_schema.name}")
for f in custom_schema.fields:
    print(f"  - {f.name}: {f.type.value}")

In [None]:
# Use it with Pipeline
pipeline = Pipeline(schema=custom_schema)
pipeline.fit("papers/")  # Skips schema detection because a schema was provided
results = pipeline.transform("papers/")

## 6. Modifying a Detected Schema

You can detect a schema, then modify it:

In [None]:
# Step 1: Detect schema
pipeline = Pipeline(purpose="findings")
pipeline.fit("documents/")

# Step 2: Get the detected schema
detected_schema = pipeline.schema
print("Detected fields:")
for f in detected_schema.fields:
    print(f"  - {f.name}: {f.type.value}")

In [None]:
# Step 3: Modify the schema
# Add a new field
new_field = Field(
    name="custom_field",
    type=FieldType.STRING,
    description="My custom field",
    required=False,
)
detected_schema.fields.append(new_field)

print(f"\nSchema now has {len(detected_schema.fields)} fields")

In [None]:
# Step 4: Select only specific fields
pipeline.select_fields(["author", "year", "estimate", "notes"])

print("Selected fields:")
for f in pipeline.schema.fields:
    print(f"  - {f.name}")

In [None]:
# Step 5: Save modified schema
pipeline.save_schema("modified_schema.json")
print("Modified schema saved!")

## 7. Using LLMExtractor Directly

To control deduplication and sampling yourself, use LLMExtractor directly with your schema:

In [None]:
from structify import LLMExtractor, Schema

# Load or create your schema
schema = Schema.load("my_schema.json")

# Create extractor with custom options
extractor = LLMExtractor(
    schema=schema,
    deduplicate=True,
    sample_ratio=0.5,  # Extract from 50% of files
    max_samples=100,   # Max 100 files
    seed=42,
)

# Fit and transform
extractor.fit("documents/")
results = extractor.transform("documents/")

print(f"Extracted {len(results)} records")
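
The sampling options above can be pictured as a seeded random draw over the input files. The sketch below is one plausible reading, not LLMExtractor's actual logic: it assumes the ratio is applied first and `max_samples` then acts as a cap. `sample_files` is a hypothetical helper.

```python
import random


def sample_files(paths, sample_ratio=1.0, max_samples=None, seed=None):
    """Pick a reproducible random subset of `paths`."""
    ordered = sorted(paths)                      # stable starting order
    count = max(1, round(len(ordered) * sample_ratio)) if ordered else 0
    if max_samples is not None:
        count = min(count, max_samples)          # assumed: cap applied after ratio
    rng = random.Random(seed)                    # local RNG, global state untouched
    return sorted(rng.sample(ordered, count))


files = [f"doc_{i:03d}.pdf" for i in range(300)]  # made-up file names
subset = sample_files(files, sample_ratio=0.5, max_samples=100, seed=42)
print(len(subset))  # 100: 50% of 300 is 150, capped at 100
```

Fixing the seed means the same files are chosen on every run, which keeps a trial extraction comparable before and after a schema change.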

## 8. Schema Validation Tips

### Best Practices:

1. **Keep field names simple**: Use snake_case (e.g., `author_name`, not `AuthorName`)
2. **Write clear descriptions**: Help the LLM understand what to extract
3. **Use categorical wisely**: Only when you have known, finite options
4. **Include a `notes` field**: It captures context that the structured fields miss
5. **Test on a sample first**: Validate before full extraction
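
The first practice is easy to check mechanically. The following is a minimal sketch using regular expressions; `is_snake_case` and `to_snake_case` are hypothetical helpers, not structify functions.

```python
import re

SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(name):
    """Return True if `name` is lowercase words joined by single underscores."""
    return SNAKE_CASE.fullmatch(name) is not None


def to_snake_case(name):
    """Convert names such as 'AuthorName' or 'Sample Size' to snake_case."""
    # Insert an underscore at each lower-to-upper boundary, then normalise.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return re.sub(r"[^a-z0-9]+", "_", spaced.lower()).strip("_")


print(is_snake_case("author_name"))  # True
print(is_snake_case("AuthorName"))   # False
print(to_snake_case("AuthorName"))   # author_name
```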

In [None]:
# Validate schema structure
def validate_schema(schema):
    """Print any issues found in `schema` and return True if there are none."""
    issues = []
    
    for field in schema.fields:
        # Check categorical fields have options
        if field.type == FieldType.CATEGORICAL and not field.options:
            issues.append(f"Categorical field '{field.name}' has no options")
        
        # Check descriptions exist
        if not field.description:
            issues.append(f"Field '{field.name}' has no description")
    
    # Check for notes field
    if not any(f.name == 'notes' for f in schema.fields):
        issues.append("Missing recommended 'notes' field")
    
    if issues:
        print("Schema issues:")
        for issue in issues:
            print(f"  - {issue}")
    else:
        print("Schema looks good!")
    
    return len(issues) == 0

# Test it
validate_schema(custom_schema)

## 9. Complete Custom Schema Example

In [None]:
# Complete example: Medical research extraction

medical_schema = Schema(
    name="clinical_trials",
    description="Extract clinical trial results from medical papers",
    fields=[
        # Study identification
        Field(name="study_id", type=FieldType.STRING, required=True,
              description="Trial ID or study identifier"),
        Field(name="publication_year", type=FieldType.INTEGER, required=True,
              description="Year of publication"),
        
        # Treatment details
        Field(name="intervention", type=FieldType.STRING, required=True,
              description="Drug or treatment being tested"),
        Field(name="comparator", type=FieldType.STRING,
              description="Control/placebo or comparison treatment"),
        
        # Results
        Field(name="effect_size", type=FieldType.FLOAT, required=True,
              description="Primary outcome effect size"),
        Field(name="confidence_interval", type=FieldType.STRING,
              description="95% confidence interval"),
        Field(name="p_value", type=FieldType.FLOAT,
              description="Statistical significance p-value"),
        
        # Categorical fields
        Field(name="study_design", type=FieldType.CATEGORICAL,
              options=["RCT", "Cohort", "Case-control", "Cross-sectional", "Meta-analysis"],
              description="Study design type"),
        Field(name="outcome_type", type=FieldType.CATEGORICAL,
              options=["Efficacy", "Safety", "Both"],
              description="Type of outcome measured"),
        Field(name="significance", type=FieldType.CATEGORICAL,
              options=["p<0.001", "p<0.01", "p<0.05", "p<0.10", "NS"],
              description="Statistical significance level"),
        
        # Sample info
        Field(name="sample_size", type=FieldType.INTEGER,
              description="Number of participants"),
        Field(name="follow_up_months", type=FieldType.INTEGER,
              description="Follow-up duration in months"),
        
        # Boolean
        Field(name="is_blinded", type=FieldType.BOOLEAN,
              description="Whether the study was blinded"),
        
        # Notes (always recommended)
        Field(name="notes", type=FieldType.STRING, required=True,
              description="Brief summary of key finding and context"),
    ],
)

# Save for reuse
medical_schema.save("clinical_trials_schema.json")

print(f"Created schema with {len(medical_schema.fields)} fields")
print("\nFields:")
for f in medical_schema.fields:
    opts = f" (options: {len(f.options)})" if f.options else ""
    req = " [required]" if f.required else ""
    print(f"  {f.name}: {f.type.value}{opts}{req}")

In [None]:
# Use the custom schema
pipeline = Pipeline(
    schema=medical_schema,
    extraction_model="gemini-2.5-pro",
)

pipeline.fit("clinical_papers/")  # No detection step needed: the schema is already defined
results = pipeline.transform("clinical_papers/")

# Analyze results
print(f"\nExtracted {len(results)} clinical trial results")
if 'study_design' in results.columns:
    print("\nStudy Design Distribution:")
    print(results['study_design'].value_counts())
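
Because the schema stores both a numeric `p_value` and a categorical `significance`, one cheap sanity check on the results is whether the two agree. The sketch below maps a p-value to the option labels defined in `medical_schema` and flags rows where the label differs. `significance_label` is a hypothetical helper, the rows are made-up sample data, and treating each threshold as a strict upper bound is an assumption about how the labels should be read.

```python
def significance_label(p_value):
    """Map a p-value to the labels used by the `significance` field."""
    if p_value is None:
        return None
    for threshold, label in [(0.001, "p<0.001"), (0.01, "p<0.01"),
                             (0.05, "p<0.05"), (0.10, "p<0.10")]:
        if p_value < threshold:
            return label
    return "NS"


# Made-up rows standing in for extracted records.
rows = [
    {"study_id": "A", "p_value": 0.0004, "significance": "p<0.001"},
    {"study_id": "B", "p_value": 0.03, "significance": "p<0.01"},   # mismatch
    {"study_id": "C", "p_value": 0.2, "significance": "NS"},
]
mismatches = [r["study_id"] for r in rows
              if significance_label(r["p_value"]) != r["significance"]]
print(mismatches)  # ['B']
```

Rows flagged this way are worth checking against the source paper, since either field may have been extracted incorrectly.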

## Summary

In this tutorial, you learned:
- ✅ Creating schemas from natural language descriptions
- ✅ Building schemas with SchemaBuilder
- ✅ All field types: string, integer, float, boolean, categorical
- ✅ Creating Schema and Field objects directly
- ✅ Modifying and refining detected schemas
- ✅ Using LLMExtractor directly
- ✅ Schema validation best practices

**You've completed all 5 tutorials!**

You're now ready to:
- Extract data from any PDF document
- Use automatic or custom schemas
- Configure models, sampling, and checkpoints
- Build production-ready extraction pipelines