# Tutorial 1: Quick Start - Basic PDF Extraction

This notebook demonstrates the simplest way to extract structured data from PDFs using pdf-structify.

**What you'll learn:**
- Installing and configuring pdf-structify
- 3-line extraction with automatic schema detection
- Viewing and saving results
- Understanding the extraction pipeline

## 1. Installation

Install pdf-structify from PyPI:

In [None]:
# Uncomment to install
# !pip install pdf-structify

## 2. Configuration

You need a Google Gemini API key. You can create one for free at https://makersuite.google.com/app/apikey

In [None]:
import os

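# Option 1: Set the environment variable for this session
# (replace the placeholder, and avoid committing a real key to version control)
os.environ["GEMINI_API_KEY"] = "your-api-key-here"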

# Option 2: Use a .env file (recommended for security)
# Create a .env file with: GEMINI_API_KEY=your-api-key-here
# from structify import Config
# Config.from_env()
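
A key that is missing or still set to the placeholder usually only surfaces as an error once the first API call is made. As a minimal sketch, the check below reads the key from the environment and fails early with a clear message. The helper name `require_api_key` is hypothetical and not part of pdf-structify; only the `GEMINI_API_KEY` variable name and the placeholder string come from this tutorial.

```python
import os


def require_api_key(var_name: str = "GEMINI_API_KEY") -> str:
    """Return the API key stored in the environment, or raise if it is unset.

    Hypothetical helper for illustration; not part of pdf-structify.
    """
    key = os.environ.get(var_name, "").strip()
    # Treat the tutorial's placeholder value as "not configured".
    if not key or key == "your-api-key-here":
        raise RuntimeError(
            f"{var_name} is not set. Export it in your shell or add it to a .env file."
        )
    return key
```

Calling `require_api_key()` at the top of the notebook catches a missing key before any extraction work starts.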

## 3. The 3-Line Extraction

The simplest way to extract data from PDFs:

In [None]:
from structify import Pipeline

# Create a pipeline with sensible defaults
pipeline = Pipeline.quick_start()

# Fit (detect schema) and transform (extract data) in one step
results = pipeline.fit_transform("path/to/your/pdfs/")

# Save results (index=False omits the row-index column)
results.to_csv("output.csv", index=False)

## 4. Understanding What Happened

The pipeline automatically:
1. **Split** large PDFs into manageable chunks (10 pages each)
2. **Detected** the schema by analyzing sample documents
3. **Extracted** structured data from all documents
4. **Deduplicated** the results
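
To make the split and deduplicate steps concrete, here is a rough, library-independent sketch. It is not pdf-structify's implementation: the function names are hypothetical, the chunking only computes page ranges rather than writing PDF files, and the deduplication assumes that two records are duplicates only when every field matches exactly.

```python
from typing import Dict, List, Tuple


def chunk_page_ranges(total_pages: int, pages_per_chunk: int = 10) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) ranges covering the document."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [
        (start, min(start + pages_per_chunk - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_chunk)
    ]


def drop_exact_duplicates(records: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep the first occurrence of each record, comparing all field values.

    Assumes field values are hashable (strings, numbers, None).
    """
    seen = set()
    unique = []
    for record in records:
        # Sort by field name so that key order does not affect the comparison.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# A 25-page PDF with the default chunk size yields three chunks.
print(chunk_page_ranges(25))  # [(1, 10), (11, 20), (21, 25)]
```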

Let's explore the results:

In [None]:
# View the first few rows
print(f"Extracted {len(results)} records")
results.head(10)

In [None]:
# View all columns (fields) that were detected
print("Detected fields:")
for col in results.columns:
    print(f"  - {col}")

In [None]:
# View the detected schema
print("\nSchema details:")
for field in pipeline.schema.fields:
    print(f"  {field.name}: {field.type.value} {'(required)' if field.required else ''}")
    if field.options:
        print(f"    Options: {field.options}")

## 5. Saving Results in Different Formats

In [None]:
# CSV (default)
results.to_csv("output.csv", index=False)

# JSON
results.to_json("output.json", orient="records", indent=2)

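# Excel (with pandas, writing .xlsx requires the openpyxl package)
results.to_excel("output.xlsx", index=False)

# Parquet (efficient for large datasets; with pandas, requires pyarrow or fastparquet)
results.to_parquet("output.parquet", index=False)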

print("Results saved in multiple formats!")
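
If `results` is a pandas DataFrame (the methods used above suggest so, but this tutorial does not state it), `orient="records"` writes a JSON array with one object per row. The sketch below shows that layout and reads it back with the standard library only. The field names and values are made-up sample data, not real extraction output.

```python
import json
import tempfile
from pathlib import Path

# Made-up sample in the records layout: a list with one object per row.
sample = [
    {"title": "Example A", "year": 2020},
    {"title": "Example B", "year": 2021},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "output.json"
    path.write_text(json.dumps(sample, indent=2), encoding="utf-8")

    # Load the file back: each element is a dict keyed by column name.
    with path.open(encoding="utf-8") as handle:
        records = json.load(handle)

# Column names can be recovered from the keys of any record.
field_names = sorted(records[0])
```

This layout is convenient when another tool consumes the output without pandas, since each row is a self-describing object.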

## 6. Basic Customization

You can customize the pipeline with a few parameters:

In [None]:
from structify import Pipeline

# Create a customized pipeline
pipeline = Pipeline(
    # How many pages per PDF chunk
    pages_per_chunk=10,
    
    # Remove duplicate records
    deduplicate=True,
    
    # Enable checkpoint/resume (never lose progress!)
    enable_checkpoints=True,
    
    # Random seed for reproducibility
    seed=42,
)

results = pipeline.fit_transform("path/to/your/pdfs/")

## 7. Resume After Interruption

If an extraction is interrupted (Ctrl+C, power outage, etc.), create the pipeline with `Pipeline.resume` and call `transform` again; it continues from where it left off:

In [None]:
from structify import Pipeline

# This automatically resumes from where it left off!
pipeline = Pipeline.resume("path/to/your/pdfs/")
results = pipeline.transform("path/to/your/pdfs/")

print(f"Resumed and completed! Total records: {len(results)}")
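
For intuition, checkpoint-and-resume can be sketched as "record each finished item on disk, and skip recorded items on the next run". The code below is a generic illustration under that assumption, not pdf-structify's internals; the function name, the JSON checkpoint format, and the file names are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def process_with_checkpoint(
    items: Iterable[str],
    work: Callable[[str], None],
    checkpoint_path: Path,
) -> List[str]:
    """Run work(item) for each item not already recorded in the checkpoint file.

    Returns the items processed during this call.
    """
    done = set()
    if checkpoint_path.exists():
        done = set(json.loads(checkpoint_path.read_text(encoding="utf-8")))

    processed_now = []
    for item in items:
        if item in done:
            continue  # finished in an earlier run
        work(item)
        done.add(item)
        processed_now.append(item)
        # Persist after every item so an interruption loses at most one item of work.
        checkpoint_path.write_text(json.dumps(sorted(done)), encoding="utf-8")
    return processed_now
```

Running the function a second time with the same checkpoint file processes only the items that were not completed before the interruption.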

## 8. Progress Tracking

pdf-structify displays progress bars during extraction, showing the current stage, the file being processed, and the number of records found so far:

```
╭─────────────────── Structify Pipeline ───────────────────╮
│ Stage 2/3: Data Extraction                               │
╰──────────────────────────────────────────────────────────╯
Processing papers ━━━━━━━━━━━━━━━━━ 45% 12/25 papers
  Current: "Economic_Study.pdf" part 3/8
  → Found 24 records
```

## Summary

In this tutorial, you learned:
- ✅ How to install and configure pdf-structify
- ✅ The 3-line extraction pattern
- ✅ How to view and save results
- ✅ Basic customization options
- ✅ How to resume interrupted extractions

**Next Steps:**
- Tutorial 2: Research Findings Extraction (optimized for academic papers)
- Tutorial 3: Policy Document Extraction (optimized for policy documents)
- Tutorial 4: Advanced Configuration (models, sampling, schema save/load)
- Tutorial 5: Custom Schema Building