# tatForge Quick Start Guide

This notebook shows how to use **tatForge**, an AI-powered document extraction library that uses vision models to turn unstructured documents into structured data.

## Features
- **Vision-First**: No OCR step; documents are read directly as images
- **Schema-Governed**: JSON schemas define what to extract
- **Type-Safe**: BAML enforces the declared types on extracted values
- **Dual Output**: Canonical data + business transformations

## 1. Installation

Install tatForge using `uv` (recommended) or `pip`:

```bash
# Using uv (usually much faster than pip)
uv pip install tatforge

# Or using pip
pip install tatforge
```

In [None]:
# Import tatForge
import tatforge

print(f"tatForge version: {tatforge.__version__}")
print(f"Package: {tatforge.__package_name__}")

## 2. Quick Start - Extract Data from a Document

The simplest way to use tatForge is with the `extract_document()` function.

### Allowed Schema Types

| Type | Description | Example |
|------|-------------|---------|
| `string` | Text values | `"type": "string"` |
| `int` | Integer numbers | `"type": "int"` |
| `float` | Decimal numbers | `"type": "float"` |
| `bool` | Boolean true/false | `"type": "bool"` |
| `array` | List of items | `"type": "array", "items": {...}` |
| `object` | Nested structure | `"type": "object", "properties": {...}` |

**Note**: Use `int` and `float` (not `integer` or `number`) for BAML compatibility.
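
Schemas written for standard JSON Schema tooling use `integer`, `number`, and `boolean`, so they need renaming before use here. The sketch below shows one way to do that with a hypothetical `normalize_types` helper, which is not part of tatForge. The `boolean` to `bool` mapping is an assumption based on the table above.

```python
# Hypothetical helper for illustration; not part of the tatForge API.
JSON_SCHEMA_TO_TATFORGE = {"integer": "int", "number": "float", "boolean": "bool"}


def normalize_types(schema):
    """Return a copy of `schema` with standard JSON Schema type names renamed."""
    if isinstance(schema, dict):
        converted = {}
        for key, value in schema.items():
            if key == "type" and isinstance(value, str):
                # Rename known type names; leave everything else untouched.
                converted[key] = JSON_SCHEMA_TO_TATFORGE.get(value, value)
            else:
                converted[key] = normalize_types(value)
        return converted
    if isinstance(schema, list):
        return [normalize_types(item) for item in schema]
    return schema


standard = {
    "type": "object",
    "properties": {
        "count": {"type": "integer"},
        "price": {"type": "number"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}
converted = normalize_types(standard)
print(converted["properties"]["count"]["type"])
```

The helper builds a new dictionary and leaves the input unchanged, so the original schema stays usable with other tools.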

In [None]:
from tatforge import extract_document

# Define what you want to extract using a JSON schema
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "The invoice number or ID"
        },
        "date": {
            "type": "string",
            "description": "Invoice date"
        },
        "total_amount": {
            "type": "float",
            "description": "Total amount due"
        },
        "vendor_name": {
            "type": "string",
            "description": "Name of the vendor/seller"
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "int"},
                    "unit_price": {"type": "float"}
                }
            }
        }
    }
}

print("Schema defined for invoice extraction")
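
To make the schema concrete, here is a minimal sketch of a record that would match a trimmed version of it, plus a hypothetical `conforms` checker. The sample values are made up, the shape of what `extract_document()` returns is an assumption, and tatForge's own validation may differ.

```python
# Hypothetical checker for illustration; not part of the tatForge API.
PYTHON_TYPES = {"string": str, "int": int, "float": (int, float), "bool": bool}


def conforms(value, schema):
    """Check that `value` matches `schema`; missing properties are allowed."""
    kind = schema["type"]
    if kind == "object":
        if not isinstance(value, dict):
            return False
        properties = schema.get("properties", {})
        return all(
            conforms(value[name], sub)
            for name, sub in properties.items()
            if name in value
        )
    if kind == "array":
        return isinstance(value, list) and all(
            conforms(item, schema["items"]) for item in value
        )
    if kind in ("int", "float") and isinstance(value, bool):
        return False  # bool is a subclass of int in Python, so reject it here
    return isinstance(value, PYTHON_TYPES[kind])


sample_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "float"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "int"},
                    "unit_price": {"type": "float"},
                },
            },
        },
    },
}

# Made-up values, shown only to illustrate the expected shape.
sample_record = {
    "invoice_number": "INV-001",
    "total_amount": 120.5,
    "line_items": [{"description": "Widget", "quantity": 2, "unit_price": 60.25}],
}
print(conforms(sample_record, sample_schema))
```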

In [None]:
# Extract data from a document (uncomment once you have a PDF to test with).
# Top-level `await` works in a notebook cell.
# result = await extract_document("path/to/invoice.pdf", invoice_schema)
# print(result)
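
Outside a notebook, top-level `await` is not available, so a plain script would need to drive the coroutine with `asyncio.run()`. The sketch below shows that pattern. It assumes `extract_document` is a coroutine function, as the `await` above implies, and uses a stand-in coroutine (`stand_in_extract`, hypothetical) so that it runs without tatForge or a PDF.

```python
import asyncio


async def stand_in_extract(path, schema):
    """Placeholder with the same call shape as the cell above (not tatForge)."""
    await asyncio.sleep(0)  # yield to the event loop, as a real I/O call would
    return {"source": path, "fields": sorted(schema.get("properties", {}))}


async def main():
    schema = {
        "type": "object",
        "properties": {"invoice_number": {"type": "string"}},
    }
    # In real use, swap the stand-in for tatforge.extract_document.
    return await stand_in_extract("path/to/invoice.pdf", schema)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

`asyncio.run()` raises a `RuntimeError` when an event loop is already running, which is the case inside Jupyter, so keep the bare `await` form in notebook cells.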

## 3. Using the Full Pipeline

For more control, use the `VisionExtractionPipeline` directly.

In [None]:
from tatforge import PipelineConfig, VisionExtractionPipeline

# Configure the pipeline
config = PipelineConfig(
    memory_limit_gb=8,           # Memory limit for processing
    batch_size="auto",           # Automatically determine batch size
    enable_shaped_output=True,   # Enable business transformations
    enforce_1nf=True             # Enforce first normal form (atomic values, no nested lists)
)

print(f"Pipeline config: {config}")

In [None]:
# Initialize the pipeline
pipeline = VisionExtractionPipeline(config=config)
print("Pipeline initialized!")
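
The `enforce_1nf` option refers to first normal form, where every field holds a single atomic value. How tatForge applies it is not shown in this notebook. The sketch below only illustrates the general idea on a nested record, using a hypothetical `to_1nf` helper and made-up values.

```python
# Hypothetical helper for illustration; tatForge's own 1NF handling may differ.
def to_1nf(record, list_field, key_field):
    """Split a nested record into one parent row and a list of child rows."""
    # The parent row keeps every field except the nested list.
    parent = {k: v for k, v in record.items() if k != list_field}
    # Each child row carries the parent's key so the rows can be joined again.
    children = [
        {key_field: record[key_field], "position": index, **item}
        for index, item in enumerate(record.get(list_field, []))
    ]
    return parent, children


nested = {
    "invoice_number": "INV-001",
    "vendor_name": "Example Vendor",
    "line_items": [
        {"description": "Widget", "quantity": 2},
        {"description": "Gadget", "quantity": 1},
    ],
}
parent_row, child_rows = to_1nf(nested, "line_items", "invoice_number")
print(parent_row)
for row in child_rows:
    print(row)
```

Each resulting row holds only scalar values, which is what makes it straightforward to write out as a flat table.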

## 4. Working with Schemas

The `SchemaManager` helps you manage and validate extraction schemas.

In [None]:
from tatforge import SchemaManager, SchemaValidator

# Create a schema manager
schema_manager = SchemaManager()

# Define a schema, then register it under a name
shipping_schema = {
    "type": "object",
    "properties": {
        "vessel_name": {"type": "string", "description": "Name of the shipping vessel"},
        "port_of_loading": {"type": "string", "description": "Port where cargo is loaded"},
        "port_of_discharge": {"type": "string", "description": "Destination port"},
        "eta": {"type": "string", "description": "Estimated time of arrival"},
        "cargo": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "commodity": {"type": "string"},
                    "quantity_mt": {"type": "float", "description": "Quantity in metric tons"}
                }
            }
        }
    }
}

schema_manager.register_schema("shipping_manifest", shipping_schema)
print("Shipping manifest schema registered!")

In [None]:
# Validate a schema
validator = SchemaValidator()

# Check if schema is valid
is_valid = validator.validate_schema(shipping_schema)
print(f"Schema is valid: {is_valid}")
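
What `validate_schema()` checks internally is not documented here. As a rough sketch of the kind of structural checks such a validator could perform, the hypothetical `schema_problems` function below flags type names outside the allowed list from section 2 and arrays that lack `items`. Both rules are assumptions drawn from the table in that section.

```python
# Hypothetical checks for illustration; not the SchemaValidator implementation.
ALLOWED_TYPES = {"string", "int", "float", "bool", "array", "object"}


def schema_problems(schema, path="$"):
    """Return a list of human-readable problems found in `schema`."""
    problems = []
    kind = schema.get("type")
    if kind not in ALLOWED_TYPES:
        problems.append(f"{path}: unsupported type {kind!r}")
    elif kind == "array":
        if "items" not in schema:
            problems.append(f"{path}: array without 'items'")
        else:
            problems.extend(schema_problems(schema["items"], f"{path}[]"))
    elif kind == "object":
        # Recurse into each property, extending the path for error messages.
        for name, sub in schema.get("properties", {}).items():
            problems.extend(schema_problems(sub, f"{path}.{name}"))
    return problems


draft_schema = {
    "type": "object",
    "properties": {
        "eta": {"type": "string"},
        "quantity_mt": {"type": "number"},  # should be "float"
        "cargo": {"type": "array"},         # missing "items"
    },
}
for problem in schema_problems(draft_schema):
    print(problem)
```

Returning a list of messages rather than a single boolean makes it easier to see which part of a large schema needs fixing.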

## 5. PDF Processing

Use the `PDFAdapter` to convert PDFs to image frames for vision processing.

In [None]:
from tatforge import PDFAdapter, ConversionConfig

# Create PDF adapter with memory limit
pdf_adapter = PDFAdapter(max_memory_mb=500)

# Configure conversion settings
conversion_config = ConversionConfig(
    dpi=200,           # Resolution for rendering
    quality=85,        # JPEG quality (if applicable)
    max_pages=None     # Process all pages (or set limit)
)

print(f"PDF adapter ready with config: DPI={conversion_config.dpi}")
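
The DPI setting determines how large each rendered page is, which in turn affects memory use. The arithmetic below is a back-of-the-envelope sketch under stated assumptions: a US Letter page (8.5 × 11 inches), uncompressed 8-bit RGB frames at 3 bytes per pixel, and a 500 MiB budget. Whether `max_memory_mb` counts raw frames this way is an assumption, not something this notebook confirms.

```python
def frame_size_px(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_frame_bytes(width_px, height_px, bytes_per_pixel=3):
    """Size of one uncompressed frame (3 bytes per pixel for 8-bit RGB)."""
    return width_px * height_px * bytes_per_pixel


width_px, height_px = frame_size_px(8.5, 11, 200)   # US Letter at 200 DPI
frame_bytes = raw_frame_bytes(width_px, height_px)
budget_bytes = 500 * 1024 * 1024                    # 500 MiB
frames_in_budget = budget_bytes // frame_bytes

print(f"{width_px} x {height_px} px, {frame_bytes / 1e6:.1f} MB per frame")
print(f"About {frames_in_budget} raw frames fit in the budget")
```

Pixel count grows with the square of the DPI, so doubling the resolution to 400 DPI quadruples the size of each frame.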

In [None]:
# Example: Process a PDF file
from pathlib import Path

# List available PDFs in the pdfs directory
pdf_dir = Path("../pdfs")
if pdf_dir.exists():
    pdf_files = list(pdf_dir.glob("*.pdf"))
    print(f"Found {len(pdf_files)} PDF files:")
    for pdf in pdf_files[:5]:
        print(f"  - {pdf.name}")
else:
    print(f"No directory found at {pdf_dir}. Create it and add test PDFs.")

## 6. Output Formatting

tatForge provides two output formats:
- **Canonical**: Raw extraction data (truth layer)
- **Shaped**: Business-transformed data

In [None]:
from tatforge import CanonicalFormatter, ShapedFormatter, DataExporter

# Initialize formatters
canonical_formatter = CanonicalFormatter()
shaped_formatter = ShapedFormatter()

# Create an exporter; the supported formats are listed below
exporter = DataExporter()

print("Available export formats:")
print("  - JSON")
print("  - CSV")
print("  - Parquet")
print("  - DataFrame")
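
The difference between the two layers can be illustrated with the standard library alone. In the sketch below, the canonical rows are kept as extracted, while a hypothetical `shape` function applies a business transformation (renamed fields and a computed line total). The values are made up, and none of this uses the tatForge formatter or exporter classes, whose methods are not shown in this notebook.

```python
import csv
import io
import json

# Canonical layer: values as extracted (made-up sample data).
canonical = [
    {"description": "Widget", "quantity": 2, "unit_price": 60.25},
    {"description": "Gadget", "quantity": 1, "unit_price": 9.99},
]


def shape(row):
    """Hypothetical business transformation: rename fields, add a line total."""
    return {
        "item": row["description"],
        "qty": row["quantity"],
        "line_total": round(row["quantity"] * row["unit_price"], 2),
    }


shaped = [shape(row) for row in canonical]

# Export the canonical layer as JSON.
as_json = json.dumps(canonical, indent=2)

# Export the shaped layer as CSV, written to an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(shaped[0]))
writer.writeheader()
writer.writerows(shaped)
as_csv = buffer.getvalue()

print(as_csv)
```

Keeping the canonical rows untouched means the shaped output can always be regenerated if the business rules change.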

## 7. CLI Usage

tatForge also provides a command-line interface.

```bash
# Get package info
tatforge info

# Validate a schema file
tatforge validate schema.json

# Extract data from a document
tatforge extract invoice.pdf --schema schema.json --output result.json
```
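
The `validate` and `extract` commands read the schema from a file. A minimal sketch for producing that file from a Python dictionary follows, assuming the CLI accepts the same schema structure as the Python API. The `write_schema` helper is hypothetical and uses only the standard library.

```python
import json
from pathlib import Path


def write_schema(schema, path):
    """Write `schema` as indented JSON and return the resulting path."""
    path = Path(path)
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return path


cli_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "The invoice number or ID"},
        "total_amount": {"type": "float", "description": "Total amount due"},
    },
}

if __name__ == "__main__":
    # Creates schema.json in the current working directory.
    print(write_schema(cli_schema, "schema.json"))
```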

In [None]:
# Run the CLI from a notebook cell (a leading `!` passes the line to the shell)
!tatforge info --models

## 8. Next Steps

- Check the [API Documentation](https://github.com/Frosselet/COCOINDEX_LEARNING/tree/main/docs)
- Explore the [Architecture Playbook](../docs/architecture-playbook.md)
- Review [Example Schemas](../baml_src/)

## Support

For issues or questions, visit: https://github.com/Frosselet/COCOINDEX_LEARNING/issues