# Tutorial 4: Advanced Configuration

This notebook covers advanced features for power users.

**What you'll learn:**
- Schema save/load for resume capability
- Using different models for detection vs extraction
- Extraction sampling for cost control and testing
- Checkpoint and resume functionality
- API retry and error handling

## 1. Setup

In [None]:
import os
import pandas as pd

# Placeholder: replace with your own key, and avoid committing real keys to version control
os.environ["GEMINI_API_KEY"] = "your-api-key-here"

from structify import Pipeline, Schema

## 2. Schema Save/Load (Resume Capability)

Save your detected schema to skip detection in future runs. This is useful when:
- You want to extract from new documents using the same schema
- You want to resume extraction without re-running detection
- You want to share schemas between team members

### 2.1 Saving a Schema

In [None]:
# Run detection and save the schema
pipeline = Pipeline(
    purpose="findings",
    detection_mode="moderate",
)

# Fit detects the schema
pipeline.fit("documents/")

# Save schema to JSON or YAML
pipeline.save_schema("my_schema.json")   # JSON format
# pipeline.save_schema("my_schema.yaml")  # YAML format also supported

print("Schema saved!")

### 2.2 Loading a Schema

In [None]:
# Option 1: Load schema in pipeline constructor
pipeline = Pipeline(schema="my_schema.json")

# fit() returns right away: the schema is already loaded, so no detection runs
pipeline.fit("documents/")

# Extract data
results = pipeline.transform("documents/")
print(f"Extracted {len(results)} records using saved schema")

In [None]:
# Option 2: Load and inspect schema before using
schema = Schema.load("my_schema.json")

print(f"Schema: {schema.name}")
print(f"Fields: {len(schema.fields)}")
for field in schema.fields:
    print(f"  - {field.name}: {field.type.value}")

# Use with pipeline
pipeline = Pipeline(schema=schema)

### 2.3 Schema File Format

The saved schema file looks like this (JSON):

```json
{
  "name": "detected_schema",
  "description": "Automatically detected schema for research paper (findings)",
  "fields": [
    {
      "name": "estimate_value",
      "type": "float",
      "description": "The coefficient or effect size",
      "required": true
    },
    {
      "name": "methodology",
      "type": "categorical",
      "description": "Research methodology used",
      "options": ["DID", "IV", "RDD", "OLS", "FE"]
    }
  ]
}
```
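
Because the schema is plain JSON, it can be checked with the standard library before it is shared or loaded. The sketch below assumes the file layout shown above. `summarize_schema` is a hypothetical helper, not part of pdf-structify, and treating a field without a `required` key as optional is an assumption.

```python
import json

# Same layout as the example file above, inlined so the sketch is self-contained.
SCHEMA_JSON = """
{
  "name": "detected_schema",
  "description": "Automatically detected schema for research paper (findings)",
  "fields": [
    {"name": "estimate_value", "type": "float",
     "description": "The coefficient or effect size", "required": true},
    {"name": "methodology", "type": "categorical",
     "description": "Research methodology used",
     "options": ["DID", "IV", "RDD", "OLS", "FE"]}
  ]
}
"""


def summarize_schema(raw):
    """Return one summary line per field (hypothetical helper)."""
    schema = json.loads(raw)
    lines = []
    for field in schema["fields"]:
        # Assumption: a field without a "required" key is optional.
        flag = "required" if field.get("required", False) else "optional"
        line = f"{field['name']}: {field['type']} ({flag})"
        if "options" in field:
            line += " options=" + "/".join(field["options"])
        lines.append(line)
    return lines


for line in summarize_schema(SCHEMA_JSON):
    print(line)
```

A check like this is a cheap way to review a teammate's schema file before passing it to `Pipeline(schema=...)`.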

## 3. Model Selection

Use different models for schema detection vs data extraction:

- **Detection**: Use a fast model (cheaper, quicker)
- **Extraction**: Use a powerful model (more accurate)

In [None]:
# Use fast model for detection, powerful model for extraction
pipeline = Pipeline(
    purpose="findings",
    detection_model="gemini-2.0-flash",   # Fast and cheap for detection
    extraction_model="gemini-2.5-pro",    # Powerful and accurate for extraction
)

results = pipeline.fit_transform("documents/")
print(f"Extracted {len(results)} records")

### Available Models

| Model | Speed | Cost | Accuracy | Best For |
|-------|-------|------|----------|----------|
| `gemini-2.0-flash` | Fast | Low | Good | Detection, quick tests |
| `gemini-2.5-flash` | Fast | Low | Better | Detection |
| `gemini-2.5-pro` | Slower | Higher | Best | Extraction |

In [None]:
# Example: Same fast model for both (cost-effective)
pipeline_fast = Pipeline(
    purpose="findings",
    detection_model="gemini-2.0-flash",
    extraction_model="gemini-2.0-flash",
)

# Example: Powerful model for both (maximum accuracy)
pipeline_accurate = Pipeline(
    purpose="findings",
    detection_model="gemini-2.5-pro",
    extraction_model="gemini-2.5-pro",
)

## 4. Extraction Sampling

Process only a subset of documents for:
- Quick testing and validation
- Cost control
- Initial exploration

### 4.1 Sample by Ratio

In [None]:
# Extract from only 20% of documents
pipeline = Pipeline(
    purpose="findings",
    extraction_sample_ratio=0.2,  # 20% of files
    seed=42,                      # Reproducible sampling
)

results = pipeline.fit_transform("documents/")
print(f"Extracted from ~20% of documents: {len(results)} records")

### 4.2 Sample with Maximum Limit

In [None]:
# Extract from at most 50 documents
pipeline = Pipeline(
    purpose="findings",
    extraction_max_samples=50,  # No more than 50 files
    seed=42,
)

results = pipeline.fit_transform("documents/")
print(f"Extracted from max 50 documents: {len(results)} records")

### 4.3 Combined Sampling

In [None]:
# Extract from 30% of documents, but no more than 100
pipeline = Pipeline(
    purpose="findings",
    extraction_sample_ratio=0.3,   # 30%
    extraction_max_samples=100,    # Cap at 100
    seed=42,                       # Reproducible
)

results = pipeline.fit_transform("documents/")
print(f"Extracted records: {len(results)}")

### 4.4 Reproducible Sampling

Use `seed` to get the same sample every time:

In [None]:
# Same seed = same documents selected
pipeline1 = Pipeline(extraction_sample_ratio=0.2, seed=42)
pipeline2 = Pipeline(extraction_sample_ratio=0.2, seed=42)

# Both will process the exact same subset of documents
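
How the ratio, the cap, and the seed interact can be sketched in plain Python. This illustrates the idea only and is not pdf-structify's actual implementation: `select_sample` is a hypothetical helper, and rounding to the nearest file (never below one) and sorting the file list before sampling are assumptions.

```python
import random


def select_sample(files, ratio=None, max_samples=None, seed=None):
    """Pick a reproducible subset of files (hypothetical helper)."""
    ordered = sorted(files)  # stable order, so the seed alone fixes the result
    total = len(ordered)
    count = total
    if ratio is not None and total > 0:
        # Assumption: round to the nearest file, but never below one.
        count = min(total, max(1, round(total * ratio)))
    if max_samples is not None:
        count = min(count, max_samples)  # the cap wins when it is smaller
    rng = random.Random(seed)  # local generator; global random state is untouched
    return sorted(rng.sample(ordered, count))


files = [f"paper_{i:03d}.pdf" for i in range(500)]

# 30% of 500 files is 150, so the cap of 100 is the binding limit here.
first = select_sample(files, ratio=0.3, max_samples=100, seed=42)
second = select_sample(files, ratio=0.3, max_samples=100, seed=42)
print(len(first), first == second)
```

Using a dedicated `random.Random(seed)` instance rather than `random.seed()` keeps the selection reproducible without affecting other code that draws random numbers.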

## 5. Checkpoint and Resume

pdf-structify saves checkpoints automatically, so an interrupted run can resume instead of starting over.

In [None]:
# Checkpoints are enabled by default; the flag is set explicitly here for clarity
pipeline = Pipeline(
    purpose="findings",
    enable_checkpoints=True,
    state_dir=".structify_state",  # Where to save state
)

# Start extraction
results = pipeline.fit_transform("documents/")

### 5.1 Resume After Interruption

If a run is interrupted (Ctrl+C, crash, power outage), resume it from the saved state:

In [None]:
# Resume from the saved checkpoint instead of starting over
pipeline = Pipeline.resume("documents/")
results = pipeline.transform("documents/")

print(f"Resumed and completed: {len(results)} records")
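
Conceptually, resuming means recording which documents are finished and skipping them on the next run. The sketch below shows that idea with the standard library. The `progress.json` file name, its layout, and the `process_all` helper are assumptions for illustration and do not describe pdf-structify's actual state format.

```python
import json
import tempfile
from pathlib import Path


def process_all(files, state_dir, handle):
    """Process files, skipping any already recorded as done (hypothetical helper)."""
    state_path = Path(state_dir) / "progress.json"  # assumed file name
    done = set()
    if state_path.exists():
        done = set(json.loads(state_path.read_text())["done"])
    processed_now = []
    for name in files:
        if name in done:
            continue  # finished in an earlier run
        handle(name)
        done.add(name)
        processed_now.append(name)
        # Persist after every file so an interruption only costs the document in progress.
        state_path.write_text(json.dumps({"done": sorted(done)}))
    return processed_now


def flaky(name):
    """Stand-in for extraction that crashes on one document."""
    if name == "c.pdf":
        raise RuntimeError("simulated crash")


with tempfile.TemporaryDirectory() as state_dir:
    files = ["a.pdf", "b.pdf", "c.pdf", "d.pdf"]
    try:
        process_all(files, state_dir, flaky)  # stops at c.pdf
    except RuntimeError:
        pass
    # Second run: a.pdf and b.pdf are skipped, only the remainder is processed.
    resumed = process_all(files, state_dir, lambda name: None)
    print(resumed)
```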

### 5.2 Force Restart

In [None]:
# Start fresh, ignoring existing checkpoints
pipeline = Pipeline(purpose="findings")
results = pipeline.fit_transform("documents/", force_restart=True)

## 6. API Retry and Error Handling

pdf-structify handles API errors automatically:

- **API errors**: One retry after a 2-second delay
- **Rate limits**: Automatic backoff and retry
- **Timeouts**: Retry with increasing delays

No configuration needed!

In [None]:
# Retry is automatic - just use the pipeline normally
pipeline = Pipeline(purpose="findings")
results = pipeline.fit_transform("documents/")

# If API errors occur, you'll see:
# WARNING: API error: [error]. Retrying in 2 seconds...
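
For readers who want to see what this kind of handling involves, here is a generic retry-with-backoff sketch. It is not pdf-structify's internal code: `call_with_retry`, its parameters, and the doubling delay schedule are assumptions chosen for illustration.

```python
import time


def call_with_retry(func, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call func(), retrying on exceptions with doubling delays (hypothetical helper)."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception as error:
            if attempt == retries:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt)  # 2s, 4s, 8s, ...
            print(f"WARNING: API error: {error}. Retrying in {delay:g} seconds...")
            sleep(delay)


# Demo with a fake call that fails twice, then succeeds.
attempts = {"count": 0}


def fake_api_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("request timed out")
    return "ok"


delays = []
# Record the delays instead of sleeping so the demo finishes immediately.
result = call_with_retry(fake_api_call, sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter keeps the helper easy to test, since the delays can be recorded rather than waited out.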

## 7. Complete Advanced Example

Putting it all together:

In [None]:
from structify import Pipeline, Schema

# Full-featured pipeline configuration
pipeline = Pipeline(
    # Purpose and detection
    purpose="findings",
    detection_mode="moderate",
    
    # Model selection
    detection_model="gemini-2.0-flash",    # Fast for detection
    extraction_model="gemini-2.5-pro",     # Accurate for extraction
    
    # Sampling
    extraction_sample_ratio=0.5,           # 50% of documents
    extraction_max_samples=100,            # Max 100 documents
    
    # Reproducibility
    seed=42,
    
    # Checkpointing
    enable_checkpoints=True,
    state_dir=".my_extraction_state",
    
    # Output processing
    deduplicate=True,
    pages_per_chunk=10,
)

# Run the pipeline
pipeline.fit("documents/")

# Save schema for future use
pipeline.save_schema("production_schema.json")

# Extract data
results = pipeline.transform("documents/")

# Export
results.to_csv("extracted_data.csv", index=False)

print("\nExtraction complete!")
print(f"Records: {len(results)}")
print("Schema saved to: production_schema.json")

## 8. Workflow for Production

Recommended workflow for large-scale extractions:

1. **Test on a sample**: Use sampling to validate the schema on a small subset
2. **Save the schema**: Keep the validated schema so later runs stay consistent
3. **Full extraction**: Run on all documents with the saved schema
4. **Use checkpoints**: Keep them enabled so a long run can resume after an interruption

In [None]:
# Step 1: Test on 10% sample
test_pipeline = Pipeline(
    purpose="findings",
    extraction_sample_ratio=0.1,
    seed=42,
)
test_results = test_pipeline.fit_transform("documents/")

# Inspect results...
print(f"Test extraction: {len(test_results)} records")
print(test_results.head())

# Step 2: Save schema if satisfied
test_pipeline.save_schema("validated_schema.json")

In [None]:
# Step 3: Full extraction with saved schema
production_pipeline = Pipeline(
    schema="validated_schema.json",  # Skip detection
    extraction_model="gemini-2.5-pro",
    enable_checkpoints=True,
)

production_pipeline.fit("documents/")  # Instant - uses saved schema
final_results = production_pipeline.transform("documents/")

final_results.to_csv("final_output.csv", index=False)
print(f"Full extraction: {len(final_results)} records")

## Summary

In this tutorial, you learned:
- ✅ Schema save/load for resume capability
- ✅ Model selection (detection vs extraction)
- ✅ Extraction sampling for cost control
- ✅ Checkpoint and resume functionality
- ✅ Automatic API retry handling
- ✅ Production workflow best practices

**Next:** Tutorial 5 - Custom Schema Building