# Tutorial 3: Policy Document Extraction

This notebook demonstrates how to extract structured data from policy documents, regulations, and official reports.

**What you'll learn:**
- Using `purpose="policies"` for policy documents
- Extracting incentives, regulations, and interventions
- Understanding automatic category discovery
- Working with value units and notes fields

## 1. Setup

In [None]:
import os
import pandas as pd

# Set your API key
os.environ["GEMINI_API_KEY"] = "your-api-key-here"

from structify import Pipeline
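
The setup cell writes a placeholder string into `GEMINI_API_KEY`, so it is easy to run the rest of the notebook without a real key. As a minimal sketch, the check below fails early in that case. The helper name `require_api_key` is hypothetical and is not part of structify.

```python
import os

# The placeholder string used in the setup cell above.
PLACEHOLDER = "your-api-key-here"

def require_api_key(name: str = "GEMINI_API_KEY") -> str:
    """Return the key stored in environment variable `name`, or raise if it is unusable.

    Hypothetical helper for this notebook, not a structify function.
    """
    value = os.environ.get(name, "").strip()
    if not value or value == PLACEHOLDER:
        raise RuntimeError(f"{name} is not set to a real key; set it before building a Pipeline.")
    return value
```

Calling `require_api_key()` once before constructing a `Pipeline` surfaces a missing key immediately instead of partway through extraction.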

## 2. Policy Document Extraction

The `purpose="policies"` mode is optimized for extracting:

- **Policy Types**: Tax incentives, subsidies, regulations
- **Implementing Agencies**: Government levels, ministries
- **Beneficiaries**: Target sectors, firm types
- **Values**: Amounts, rates, durations
- **Geographic Scope**: Zones, regions, countries

In [None]:
# Create a pipeline optimized for policy documents
pipeline = Pipeline(
    purpose="policies",
    detection_mode="moderate",
    seed=42,
)

# Process policy documents
results = pipeline.fit_transform("path/to/policy_documents/")

print(f"Extracted {len(results)} policy records")

## 3. Example: Special Economic Zone Policies

In [None]:
# Extract policies from SEZ documents
pipeline = Pipeline(
    purpose="policies",
    detection_mode="moderate",
    deduplicate=True,
    enable_checkpoints=True,
    seed=42,
)

results = pipeline.fit_transform("sez_documents/")
results.head()

## 4. Understanding Automatic Category Discovery

pdf-structify automatically discovers categories from your documents. For example, if your documents describe several policy types, those types are extracted and listed as options on the corresponding schema field.

In [None]:
# View the detected schema with discovered categories
print("Detected Schema for Policy Documents:")
print("=" * 60)

for field in pipeline.schema.fields:
    req = "(required)" if field.required else ""
    print(f"\n{field.name} [{field.type.value}] {req}")
    print(f"  Description: {field.description}")
    
    # Show discovered categories
    if field.options:
        print(f"  Discovered Categories ({len(field.options)}):")
        for opt in field.options:
            print(f"    - {opt}")
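
For later analysis it can be handy to have the discovered categories as a plain dictionary rather than printed text. The sketch below relies only on the `name` and `options` attributes used in the cell above. The demo objects are hypothetical stand-ins built with `SimpleNamespace`, not real structify schema fields.

```python
from types import SimpleNamespace

def categories_by_field(fields):
    """Map each field name to its list of options, skipping fields that have none."""
    return {f.name: list(f.options) for f in fields if f.options}

# Hypothetical stand-ins that only mimic the `name` and `options` attributes.
demo_fields = [
    SimpleNamespace(name="policy_type", options=["Tax holiday", "Land subsidy", "Admin support"]),
    SimpleNamespace(name="notes", options=None),
]

print(categories_by_field(demo_fields))
```

With a fitted pipeline, the same function could be called as `categories_by_field(pipeline.schema.fields)`, assuming the fields expose those two attributes as in the loop above.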

## 5. Concise Category Names

pdf-structify uses short, standardized category names:

| Instead of | Uses |
|------------|------|
| "Corporate income tax exemption for 10 years" | "Tax holiday" |
| "Subsidized land lease at 50% market rate" | "Land subsidy" |
| "One-stop service center for permits" | "Admin support" |
| "National government ministry" | "National govt" |

The longer description is kept in the **notes** field.

In [None]:
# Example of how categories are used
if 'policy_type' in results.columns:
    print("Policy Type Distribution:")
    print(results['policy_type'].value_counts())
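
One way to check that extracted labels stay concise is to flag any label with more than a few words. This is a sketch with plain pandas. The threshold of three words is an assumption chosen for illustration, not a rule documented by pdf-structify, and `long_labels` is a hypothetical helper.

```python
import pandas as pd

def long_labels(labels: pd.Series, max_words: int = 3) -> list:
    """Return the distinct labels that contain more than `max_words` words."""
    distinct = labels.dropna().astype(str).drop_duplicates()
    word_counts = distinct.str.split().str.len()
    return distinct[word_counts > max_words].tolist()

# Demo labels taken from the table above, plus a missing value.
demo = pd.Series([
    "Tax holiday",
    "Land subsidy",
    "Corporate income tax exemption for 10 years",
    "Tax holiday",
    None,
])

print(long_labels(demo))
```

On real output this could be run as `long_labels(results['policy_type'])`, provided that column was extracted.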

In [None]:
# The notes field contains the details
if 'notes' in results.columns:
    print("\nSample policy details (from notes):")
    sample = results.head(3)
    has_type = 'policy_type' in sample.columns
    for i, (_, row) in enumerate(sample.iterrows(), start=1):
        ptype = row['policy_type'] if has_type else 'N/A'
        print(f"\n{i}. [{ptype}]")
        print(f"   {row['notes']}")

## 6. Mandatory Fields for Policies

For policy documents, these fields are always included:

- **value_unit**: Unit for numeric values (%, years, USD, etc.)
- **notes**: One sentence explaining the policy context

In [None]:
# View value units
if 'value_unit' in results.columns:
    print("Value Units Used:")
    print(results['value_unit'].value_counts())
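
Since `value_unit` and `notes` are expected in every policy extraction, a short report on whether they are present and how many values are missing can catch problems early. The sketch below uses only pandas. `mandatory_field_report` is a hypothetical helper and the demo rows are made up for illustration.

```python
import pandas as pd

def mandatory_field_report(df: pd.DataFrame, fields=("value_unit", "notes")) -> pd.DataFrame:
    """Report, per field, whether the column exists and how many values are missing."""
    rows = []
    for name in fields:
        present = name in df.columns
        # A column that is absent counts as missing for every row.
        missing = int(df[name].isna().sum()) if present else len(df)
        rows.append({"field": name, "present": present, "missing_values": missing})
    return pd.DataFrame(rows)

# Made-up demo rows, not real extraction output.
demo = pd.DataFrame({
    "value_unit": ["%", "years", None],
    "notes": ["Reduced lease rate.", "Income tax exemption.", "Permit support desk."],
})

print(mandatory_field_report(demo))
```

On real output, `mandatory_field_report(results)` gives the same table for the extracted records.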

## 7. Analyzing Policy Data

In [None]:
# Summary statistics
print(f"Total policy records: {len(results)}")
print(f"Columns extracted: {len(results.columns)}")
print(f"\nColumns: {list(results.columns)}")

In [None]:
# Analyze by beneficiary sector (if present)
sector_cols = [c for c in results.columns if 'sector' in c.lower()]
if sector_cols:
    print(f"\nBeneficiary Sectors ({sector_cols[0]}):")
    print(results[sector_cols[0]].value_counts())

In [None]:
# Analyze by implementing agency (if present)
agency_cols = [c for c in results.columns if 'agency' in c.lower() or 'authority' in c.lower()]
if agency_cols:
    print(f"\nImplementing Agencies ({agency_cols[0]}):")
    print(results[agency_cols[0]].value_counts())
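
The per-column counts above can be combined into a cross-tabulation showing which agencies are associated with which policy types. This is a sketch using `pd.crosstab`. The column name `implementing_agency` and the demo rows (including the "Zone authority" label) are hypothetical, since the real column names depend on the discovered schema.

```python
import pandas as pd

# Made-up demo rows; real column names depend on the discovered schema.
demo = pd.DataFrame({
    "policy_type": ["Tax holiday", "Tax holiday", "Land subsidy", "Admin support"],
    "implementing_agency": ["National govt", "Zone authority", "National govt", "Zone authority"],
})

# Rows are policy types, columns are agencies, cells are record counts.
table = pd.crosstab(demo["policy_type"], demo["implementing_agency"])
print(table)
```

With real results, substitute `results['policy_type']` and `results[agency_cols[0]]` once both columns are confirmed to exist.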

## 8. Filtering Policies

In [None]:
# Filter by policy type
if 'policy_type' in results.columns:
    tax_policies = results[results['policy_type'].str.contains('Tax', case=False, na=False)]
    print(f"Tax-related policies: {len(tax_policies)}")
    # A bare expression inside an `if` block is not rendered by Jupyter, so print it
    print(tax_policies.head())

In [None]:
# Filter by geographic scope (if present)
geo_cols = [c for c in results.columns if 'country' in c.lower() or 'zone' in c.lower() or 'region' in c.lower()]
if geo_cols:
    print(f"\nGeographic Distribution ({geo_cols[0]}):")
    print(results[geo_cols[0]].value_counts().head(10))
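
The filters above can also be combined, for example tax-related policies in one country. The sketch below builds a single boolean mask from several case-insensitive substring criteria. `filter_contains` is a hypothetical helper, and the `country` column and demo rows are made up for illustration.

```python
import pandas as pd

def filter_contains(df: pd.DataFrame, **patterns) -> pd.DataFrame:
    """Keep rows where every named column contains its pattern (case-insensitive).

    Criteria for columns that were not extracted are skipped, mirroring the
    "if present" checks used in the cells above.
    """
    mask = pd.Series(True, index=df.index)
    for column, pattern in patterns.items():
        if column not in df.columns:
            continue
        mask = mask & df[column].str.contains(pattern, case=False, na=False, regex=False)
    return df[mask]

# Made-up demo rows, not real extraction output.
demo = pd.DataFrame({
    "policy_type": ["Tax holiday", "Tax holiday", "Land subsidy"],
    "country": ["Vietnam", "Kenya", "Vietnam"],
})

print(filter_contains(demo, policy_type="tax", country="viet"))
```

Passing `regex=False` treats each pattern as a literal substring, which avoids surprises when a search term contains characters such as `%` or `(`.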

## 9. Working with Numeric Values

In [None]:
# Find numeric columns ('number' matches every numeric dtype, not only float64/int64)
numeric_cols = results.select_dtypes(include='number').columns.tolist()
print(f"Numeric columns: {numeric_cols}")

for col in numeric_cols:
    print(f"\n{col}:")
    print(results[col].describe())
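
A plain `describe()` mixes values measured in different units, such as percentages and years, so the resulting mean or maximum can be misleading. As a sketch, the summary below groups values by `value_unit` first. The numeric column name `value` is an assumption and the demo rows are made up; substitute whichever numeric column your schema produced.

```python
import pandas as pd

def summarize_by_unit(df: pd.DataFrame, value_col: str = "value",
                      unit_col: str = "value_unit") -> pd.DataFrame:
    """Summarize numeric values separately for each unit.

    Non-numeric entries are coerced to NaN and excluded from the statistics.
    """
    values = pd.to_numeric(df[value_col], errors="coerce")
    return values.groupby(df[unit_col]).agg(["count", "min", "median", "max"])

# Made-up demo rows; "value" is a hypothetical column name.
demo = pd.DataFrame({
    "value": [10, 50, 5, "n/a", 15],
    "value_unit": ["years", "%", "years", "%", "%"],
})

summary = summarize_by_unit(demo)
print(summary)
```

Each row of the summary then describes values that share a unit, so minimums and maximums are comparable within a row.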

## 10. Save Schema and Export

In [None]:
# Save schema for reuse
pipeline.save_schema("policy_schema.json")
print("Schema saved to policy_schema.json")

In [None]:
# Export results
results.to_csv("policy_data.csv", index=False)
# Writing .xlsx requires an Excel engine such as openpyxl to be installed
results.to_excel("policy_data.xlsx", index=False)
results.to_json("policy_data.json", orient="records", indent=2)

print("Policy data exported successfully!")
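
After exporting, it can be worth reading a file back to confirm that nothing was lost. The sketch below checks that a CSV round trip preserves the shape and column names. `csv_round_trip_ok` is a hypothetical helper and the demo frame is made up; the check does not compare cell values or dtypes.

```python
import os
import tempfile

import pandas as pd

def csv_round_trip_ok(df: pd.DataFrame, path: str) -> bool:
    """Write `df` to `path` as CSV, read it back, and compare shape and column names."""
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    return reloaded.shape == df.shape and list(reloaded.columns) == list(df.columns)

# Made-up demo rows, not real extraction output.
demo = pd.DataFrame({
    "policy_type": ["Tax holiday", "Land subsidy"],
    "value": [10, 50],
    "value_unit": ["years", "%"],
})

# Use a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    ok = csv_round_trip_ok(demo, os.path.join(tmp, "policy_data.csv"))

print(ok)
```

For the real export, `csv_round_trip_ok(results, "policy_data.csv")` performs the same check on the file written above.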

## 11. Comparing the `findings` and `policies` Purposes

| Aspect | `findings` | `policies` |
|--------|-----------|------------|
| **Best for** | Academic papers, research | Policy docs, regulations |
| **Focus** | Estimates, coefficients | Incentives, interventions |
| **Mandatory fields** | unit, notes | value_unit, notes |
| **Categories** | Methodologies (DID, IV) | Policy types, sectors |

## Summary

In this tutorial, you learned:
- ✅ Using `purpose="policies"` for policy documents
- ✅ Automatic category discovery
- ✅ Concise category naming conventions
- ✅ Mandatory fields (value_unit, notes)
- ✅ Analyzing and filtering policy data
- ✅ Exporting results

**Next:** Tutorial 4 - Advanced Configuration (save/load, models, sampling)