# tatForge Quick Start Guide

This notebook demonstrates how to use **tatForge** - an AI-powered document extraction library that uses vision AI to forge structured data from unstructured documents.

## Prerequisites

Before using tatforge, you need to set up three components:

| Component | Purpose | Setup |
|-----------|---------|-------|
| **ColPali** | Vision embeddings (3B model) | `pip install colpali-engine` |
| **Qdrant** | Vector database | Docker container |
| **BAML** | LLM extraction | API key (Anthropic/OpenAI) |

This notebook will guide you through setting up each prerequisite.

## Step 1: Check Prerequisites

Let's verify which components are available and which need to be installed.

In [None]:
import os
import sys

def check_prerequisites():
    """Check all tatforge prerequisites and report status."""
    status = {}
    
    # 1. Check ColPali engine
    try:
        from colpali_engine.models import ColPali, ColPaliProcessor
        status["colpali"] = {"installed": True, "message": "colpali-engine is installed"}
    except ImportError:
        status["colpali"] = {"installed": False, "message": "Missing: pip install colpali-engine"}
    
    # 2. Check Qdrant client and connection
    try:
        from qdrant_client import QdrantClient
        status["qdrant_client"] = {"installed": True, "message": "qdrant-client is installed"}
        
        # Try to connect to Qdrant
        qdrant_url = os.getenv("QDRANT_URL", "http://localhost:6333")
        try:
            client = QdrantClient(url=qdrant_url, timeout=5)
            client.get_collections()  # raises if the server is unreachable
            status["qdrant_server"] = {"running": True, "message": f"Qdrant running at {qdrant_url}"}
        except Exception:
            status["qdrant_server"] = {"running": False, "message": f"Qdrant not running at {qdrant_url}"}
    except ImportError:
        status["qdrant_client"] = {"installed": False, "message": "Missing: pip install qdrant-client"}
        status["qdrant_server"] = {"running": False, "message": "Install qdrant-client first"}
    
    # 3. Check BAML and API keys
    try:
        import baml_py
        status["baml"] = {"installed": True, "message": "baml-py is installed"}
    except ImportError:
        status["baml"] = {"installed": False, "message": "Missing: pip install baml-py"}
    
    # Check for LLM API keys
    anthropic_key = os.getenv("ANTHROPIC_API_KEY")
    openai_key = os.getenv("OPENAI_API_KEY")
    
    if anthropic_key:
        status["llm_api"] = {"configured": True, "message": "ANTHROPIC_API_KEY is set"}
    elif openai_key:
        status["llm_api"] = {"configured": True, "message": "OPENAI_API_KEY is set"}
    else:
        status["llm_api"] = {"configured": False, "message": "No LLM API key found (ANTHROPIC_API_KEY or OPENAI_API_KEY)"}
    
    # 4. Check tatforge
    try:
        import tatforge
        # Fall back to "unknown" if the package does not expose __version__
        status["tatforge"] = {"installed": True, "message": f"tatforge {getattr(tatforge, '__version__', 'unknown')} is installed"}
    except ImportError:
        status["tatforge"] = {"installed": False, "message": "Missing: pip install -e ."}
    
    return status

# Run checks
print("=" * 60)
print("TATFORGE PREREQUISITES CHECK")
print("=" * 60)

status = check_prerequisites()

all_ready = True
for component, info in status.items():
    ready = info.get("installed", False) or info.get("running", False) or info.get("configured", False)
    icon = "✅" if ready else "❌"
    print(f"{icon} {component}: {info['message']}")
    if not ready:
        all_ready = False

print("=" * 60)
if all_ready:
    print("All prerequisites met! You can proceed with extraction.")
else:
    print("Some prerequisites missing. Follow the setup steps below.")
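
The check above imports each package in full, which can be slow for heavy packages. A lighter sketch, assuming only that the top-level module names match the imports used above, asks `importlib.util.find_spec` whether each module can be located without importing it. The `find_missing` helper is hypothetical and not part of tatforge.

```python
from importlib.util import find_spec

# Top-level module names, as imported in the check above.
REQUIRED_MODULES = ["colpali_engine", "qdrant_client", "baml_py", "tatforge"]

def find_missing(modules):
    """Return the modules that cannot be located, without importing them."""
    return [name for name in modules if find_spec(name) is None]

missing = find_missing(REQUIRED_MODULES)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

This only confirms that a package is installed; it does not verify that Qdrant is reachable or that an API key is set, so it complements the full check rather than replacing it.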

## Step 2: Install ColPali Engine (if missing)

ColPali is a vision-language model that generates patch-level embeddings from document images. The model has roughly 3B parameters and is downloaded from Hugging Face.
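
As a rough sanity check on the download size, the arithmetic below assumes the weights are stored in a 16-bit format (2 bytes per parameter). That storage format is an assumption, not something this notebook states.

```python
# Back-of-the-envelope estimate of checkpoint size.
# Assumption: weights are stored in a 16-bit format (2 bytes per parameter).
params = 3e9            # ~3B parameters, as stated above
bytes_per_param = 2     # float16 / bfloat16

size_gb = params * bytes_per_param / 1e9
print(f"Approximate download size: {size_gb:.0f} GB")
```

Under that assumption the estimate lines up with the ~6GB figure quoted in the cells of this notebook.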

In [None]:
# Install ColPali engine (uncomment and run if not installed)
# Note: the model weights (~6GB) are downloaded from Hugging Face later, when the model is first loaded

# !pip install colpali-engine

# Verify installation
try:
    from colpali_engine.models import ColPali, ColPaliProcessor
    print("✅ ColPali engine installed successfully")
except ImportError:
    print("❌ ColPali not installed. Uncomment the pip install line above and run.")

## Step 3: Start Qdrant Vector Database

Qdrant stores the ColPali embeddings for semantic search. You can run it via Docker:

```bash
# In terminal, run:
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```

Alternatively, use Qdrant Cloud (a free tier is available at https://cloud.qdrant.io).
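
Both options can share one code path if the connection settings come from the environment. The sketch below uses a hypothetical `qdrant_settings` helper: `QDRANT_URL` is the variable this notebook already reads, while `QDRANT_API_KEY` is an assumed name chosen for illustration. The helper only builds keyword arguments, so it runs without a server.

```python
import os

def qdrant_settings(env=None):
    """Build keyword arguments for QdrantClient from environment variables."""
    env = os.environ if env is None else env
    settings = {
        "url": env.get("QDRANT_URL", "http://localhost:6333"),
        "timeout": 5,
    }
    # QDRANT_API_KEY is an assumed variable name; a local Docker instance
    # started as shown above needs no key.
    api_key = env.get("QDRANT_API_KEY")
    if api_key:
        settings["api_key"] = api_key
    return settings

print(qdrant_settings({"QDRANT_URL": "http://localhost:6333"}))
```

The result can then be passed on as `QdrantClient(**qdrant_settings())`, since the client accepts `url`, `api_key`, and `timeout` as keyword arguments.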

In [ ]:
# Check Qdrant connection
from qdrant_client import QdrantClient
import os

qdrant_url = os.getenv("QDRANT_URL", "http://localhost:6333")

try:
    client = QdrantClient(url=qdrant_url, timeout=5)
    collections = client.get_collections()
    print(f"✅ Qdrant is running at {qdrant_url}")
    print(f"   Collections: {len(collections.collections)}")
except Exception as e:
    print(f"❌ Cannot connect to Qdrant at {qdrant_url}")
    print(f"   Error: {e}")
    print("\n   Start Qdrant with: docker run -p 6333:6333 qdrant/qdrant")

## Step 4: Configure LLM API Key

BAML uses an LLM (Claude or GPT-4) for structured extraction. Set your API key:

**Option A**: Set in environment (recommended)
```bash
export ANTHROPIC_API_KEY="sk-ant-..."
# or
export OPENAI_API_KEY="sk-..."
```

**Option B**: Set in this notebook (temporary)

In [None]:
# Set API key (uncomment and fill in your key)
import os

# Option B: Set key directly (will only persist for this session)
# os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # Your Anthropic key
# os.environ["OPENAI_API_KEY"] = "sk-..."  # Or your OpenAI key

# Check if API key is configured
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
openai_key = os.getenv("OPENAI_API_KEY")

if anthropic_key:
    print(f"✅ ANTHROPIC_API_KEY is set ({anthropic_key[:12]}...)")
elif openai_key:
    print(f"✅ OPENAI_API_KEY is set ({openai_key[:12]}...)")
else:
    print("❌ No LLM API key configured")
    print("   Set ANTHROPIC_API_KEY or OPENAI_API_KEY above")
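
The check above prints the first 12 characters of the key, which is handy for debugging but leaves part of the secret in the notebook output, and that output is often saved and shared with the file. A more conservative sketch, using a hypothetical `mask_secret` helper, reveals only the last few characters:

```python
def mask_secret(value, visible=4):
    """Return a masked form of a secret, showing only its last characters."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)  # too short to reveal anything safely
    return "*" * 8 + value[-visible:]

# The string below is a placeholder, not a real key.
print(mask_secret("sk-example-not-a-real-key"))
```

Clearing cell outputs before committing the notebook is a sensible extra precaution either way.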

## Step 5: Initialize tatforge Components

Once all prerequisites are met, initialize the tatforge pipeline components.

In [None]:
# Initialize tatforge components
import os
from tatforge.vision.colpali_client import ColPaliClient
from tatforge.storage.qdrant_client import QdrantManager
from tatforge.extraction.baml_interface import BAMLExecutionInterface
from tatforge.outputs.canonical import CanonicalFormatter

# Initialize ColPali client (lazy loading - model loads on first use)
colpali_client = ColPaliClient(
    model_name="vidore/colqwen2-v0.1",
    device="auto",  # Will use CUDA if available, else CPU
    memory_limit_gb=8
)
print("✅ ColPali client initialized (model will load on first use)")

# Initialize Qdrant manager
qdrant_manager = QdrantManager(
    url=os.getenv("QDRANT_URL", "http://localhost:6333"),
    collection_name="tatforge_embeddings"
)
print("✅ Qdrant manager initialized")

# Initialize BAML execution interface
baml_interface = BAMLExecutionInterface(
    colpali_client=colpali_client,
    qdrant_manager=qdrant_manager
)
print("✅ BAML interface initialized")

# Initialize canonical formatter
canonical_formatter = CanonicalFormatter()
print("✅ Canonical formatter initialized")
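
The `device="auto"` setting is described above as using CUDA when available and falling back to CPU. How tatforge implements this is not shown in this notebook; a minimal sketch of that kind of fallback, using a hypothetical `resolve_device` helper and assuming PyTorch as the backend, might look like this:

```python
def resolve_device(requested="auto"):
    """Pick a device string, preferring CUDA when PyTorch reports one."""
    if requested != "auto":
        return requested  # honor an explicit choice such as "cpu" or "cuda"
    try:
        import torch
    except ImportError:
        return "cpu"  # without PyTorch there is no GPU backend to query
    return "cuda" if torch.cuda.is_available() else "cpu"

print(f"Selected device: {resolve_device()}")
```

Running it before loading the model gives an early hint about whether inference will run on a GPU or on the much slower CPU path.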

In [None]:
# Connect to services and load models
# This may take a few minutes on first run as it downloads the ColPali model

import asyncio

async def initialize_services():
    """Initialize all tatforge services."""
    # Connect to Qdrant
    print("Connecting to Qdrant...")
    await qdrant_manager.connect()
    await qdrant_manager.ensure_collection()
    print("✅ Qdrant connected and collection ready")
    
    # Load ColPali model (this downloads ~6GB on first run)
    print("Loading ColPali model (this may take a few minutes on first run)...")
    await colpali_client.load_model()
    print("✅ ColPali model loaded")
    
    print("\n🎉 All services initialized! Ready for extraction.")

# Run initialization
await initialize_services()
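
The bare `await` above works because Jupyter runs cells inside an event loop. In a plain Python script, top-level `await` is a syntax error, so the coroutine has to be handed to `asyncio.run` instead. The sketch below uses a stand-in coroutine in place of `initialize_services` so that it runs on its own:

```python
import asyncio

async def initialize_services_stub():
    """Stand-in for initialize_services(); replace with the real coroutine."""
    await asyncio.sleep(0)  # placeholder for connecting and loading the model
    return "ready"

def main():
    # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
    return asyncio.run(initialize_services_stub())

if __name__ == "__main__":
    print(main())
```

Note that `asyncio.run` raises a `RuntimeError` when called from inside a notebook cell, because an event loop is already running there; use plain `await` in notebooks and `asyncio.run` in scripts.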

## Step 6: Define Extraction Schema

Define a JSON schema for the data you want to extract.

### Allowed Schema Types

| Type | Description | Example |
|------|-------------|---------|
| `string` | Text values | `"type": "string"` |
| `int` | Integer numbers | `"type": "int"` |
| `float` | Decimal numbers | `"type": "float"` |
| `bool` | Boolean true/false | `"type": "bool"` |
| `array` | List of items | `"type": "array", "items": {...}` |
| `object` | Nested structure | `"type": "object", "properties": {...}` |

**Note**: Use `int` and `float` (not `integer` or `number`) for BAML compatibility.
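
Because a stray `integer` or `number` is easy to miss in a large schema, a small check can catch it before the pipeline runs. The sketch below uses a hypothetical `find_invalid_types` helper (not part of tatforge) that walks a schema and reports every type outside the table above:

```python
ALLOWED_TYPES = {"string", "int", "float", "bool", "array", "object"}

def find_invalid_types(schema, path="$"):
    """Return (path, type) pairs for every type outside ALLOWED_TYPES."""
    problems = []
    schema_type = schema.get("type")
    if schema_type not in ALLOWED_TYPES:
        problems.append((path, schema_type))
    # Recurse into object properties and array items.
    for name, child in schema.get("properties", {}).items():
        problems.extend(find_invalid_types(child, f"{path}.{name}"))
    if isinstance(schema.get("items"), dict):
        problems.extend(find_invalid_types(schema["items"], f"{path}[]"))
    return problems

example = {
    "type": "object",
    "properties": {
        "quantity_tonnes": {"type": "integer"},   # should be "int"
        "exporter": {"type": "string"},
    },
}
print(find_invalid_types(example))
```

An empty list means every `type` in the schema is one of the six allowed values.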

In [None]:
# Define extraction schema for Bunge Loading Statement
loading_statement_schema = {
    "type": "object",
    "properties": {
        "document_title": {
            "type": "string",
            "description": "Title or type of document"
        },
        "last_updated": {
            "type": "string",
            "description": "Last updated date and author"
        },
        "shipments": {
            "type": "array",
            "description": "List of shipping entries",
            "items": {
                "type": "object",
                "properties": {
                    "slot_reference": {"type": "string", "description": "Unique slot reference number (e.g., BG20250025)"},
                    "vessel_name": {"type": "string", "description": "Name of ship"},
                    "nomination_received_date": {"type": "string", "description": "Date nomination was received"},
                    "nomination_accepted_date": {"type": "string", "description": "Date nomination was accepted"},
                    "port": {"type": "string", "description": "Port name"},
                    "eta_from_date": {"type": "string", "description": "ETA of ship from date"},
                    "eta_to_date": {"type": "string", "description": "ETA of ship to date"},
                    "loading_commencement_date": {"type": "string", "description": "ETA of grain loading commencement date"},
                    "etd_date": {"type": "string", "description": "ETD of ship date"},
                    "exporter": {"type": "string", "description": "Exporter name"},
                    "quantity_tonnes": {"type": "int", "description": "Quantity in tonnes"},
                    "commodity": {"type": "string", "description": "Commodity type (WHEAT, BARLEY, CANOLA, MALT)"},
                    "loading_status": {"type": "string", "description": "Loading status: COMMENCED, COMPLETED, or empty"},
                    "loading_completed_date": {"type": "string", "description": "Date loading completed"},
                    "notes": {"type": "string", "description": "Additional notes"}
                }
            }
        }
    }
}

print(f"Schema defined with {len(loading_statement_schema['properties'])} top-level fields")
print(f"Shipment record has {len(loading_statement_schema['properties']['shipments']['items']['properties'])} fields")

## Step 7: Extract Data from Document

Now run the full extraction pipeline on a PDF document.
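
Before starting a long-running pipeline, it can help to confirm that the input really is a PDF. The sketch below uses a hypothetical `looks_like_pdf` helper that checks for the `%PDF-` signature at the start of the file. It is a cheap sanity check rather than a full validation, and it is not part of tatforge.

```python
from pathlib import Path

def looks_like_pdf(path):
    """Return True if the file exists and begins with the PDF signature."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(5) == b"%PDF-"

# A path that does not exist is reported as not being a PDF.
print(looks_like_pdf("no_such_file.pdf"))
```

A failed check usually points to a wrong relative path, which is easy to hit when the notebook is launched from a different working directory.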

In [ ]:
# Extract data from the Bunge loading statement PDF
from pathlib import Path
from tatforge import PDFAdapter
from tatforge.core.pipeline import VisionExtractionPipeline, PipelineConfig

# Path to PDF
pdf_path = Path("../pdfs/Bunge_loadingstatement_2025-09-25.pdf")
print(f"Processing: {pdf_path.name}")

# Convert PDF to image frames
pdf_adapter = PDFAdapter(max_memory_mb=500)
frames = await pdf_adapter.convert_to_frames(pdf_path)
print(f"Converted PDF to {len(frames)} image frame(s)")

# Create and run the pipeline
config = PipelineConfig(
    memory_limit_gb=8,
    batch_size="auto",
    enable_shaped_output=True
)

pipeline = VisionExtractionPipeline(
    colpali_client=colpali_client,
    qdrant_manager=qdrant_manager,
    baml_interface=baml_interface,
    canonical_formatter=canonical_formatter,
    config=config
)

# Process the document
print("Running extraction pipeline...")
with open(pdf_path, "rb") as f:
    document_blob = f.read()

result = await pipeline.process_document(
    document_blob=document_blob,
    schema_json=loading_statement_schema
)

print("\n✅ Extraction completed!")
print(f"   Processing ID: {result.metadata.processing_id}")
print(f"   Processing time: {result.metadata.processing_time_seconds:.2f}s")
print(f"   Status: {result.metadata.status}")

In [None]:
# Display extracted data
import json

if result.canonical and result.canonical.extraction_data:
    print("Extracted Data:")
    print("-" * 60)
    print(json.dumps(result.canonical.extraction_data, indent=2))
else:
    print("No data extracted. Check the pipeline logs above for details.")
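
LLM output does not always match the requested schema, so a quick type check on the extracted data can catch problems before export. The sketch below assumes the extraction data is a plain dictionary shaped like the schema, which is consistent with how it is passed to `json.dumps` above. `check_types` is a hypothetical helper, the sample record is made up for illustration, and tolerating missing fields is a simplifying choice.

```python
PYTHON_TYPES = {
    "string": str, "int": int, "float": (int, float),
    "bool": bool, "array": list, "object": dict,
}

def check_types(value, schema, path="$"):
    """Return a list of paths whose values do not match the schema type."""
    expected = PYTHON_TYPES[schema["type"]]
    # bool is a subclass of int in Python, so reject it for numeric fields.
    if isinstance(value, bool) and schema["type"] in ("int", "float"):
        return [path]
    if not isinstance(value, expected):
        return [path]
    errors = []
    if schema["type"] == "object":
        for name, child in schema.get("properties", {}).items():
            if name in value:  # missing fields are tolerated in this sketch
                errors += check_types(value[name], child, f"{path}.{name}")
    elif schema["type"] == "array" and "items" in schema:
        for i, item in enumerate(value):
            errors += check_types(item, schema["items"], f"{path}[{i}]")
    return errors

# Made-up schema fragment and record, for illustration only.
schema = {"type": "object", "properties": {
    "shipments": {"type": "array", "items": {"type": "object", "properties": {
        "vessel_name": {"type": "string"},
        "quantity_tonnes": {"type": "int"},
    }}}}}
data = {"shipments": [{"vessel_name": "EXAMPLE", "quantity_tonnes": "30000"}]}
print(check_types(data, schema))
```

Here the quantity arrives as a string rather than an integer, so its path is reported; an empty list would mean every present field has the expected type.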

## Step 8: Export Results

Export extracted data to various formats.

In [None]:
# Export to different formats
from tatforge import DataExporter
from pathlib import Path

exporter = DataExporter()

if result.canonical and result.canonical.extraction_data:
    output_dir = Path("../outputs")
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Export to JSON
    json_path = output_dir / "bunge_extraction.json"
    exporter.export_json(result.canonical.extraction_data, json_path)
    print(f"✅ Exported to JSON: {json_path}")
    
    # Export to CSV (if shipments array exists)
    if "shipments" in result.canonical.extraction_data:
        csv_path = output_dir / "bunge_shipments.csv"
        exporter.export_csv(result.canonical.extraction_data["shipments"], csv_path)
        print(f"✅ Exported to CSV: {csv_path}")
else:
    print("No data to export")
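
To make clear what a flat CSV export of the `shipments` list involves, here is a standard-library sketch. It is not the `DataExporter` implementation; `write_rows_csv` is a hypothetical helper that assumes each shipment is a flat dictionary.

```python
import csv
from pathlib import Path

def write_rows_csv(rows, path):
    """Write a list of flat dicts to CSV, using the union of keys as columns."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # missing keys are written as empty cells
    return path
```

Taking the union of keys matters for extracted data, because an LLM may omit optional fields such as `notes` from some records.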

## Summary

You've completed the tatforge quickstart! You learned how to:

1. **Check prerequisites** - ColPali, Qdrant, BAML, API keys
2. **Install dependencies** - colpali-engine package
3. **Start services** - Qdrant vector database
4. **Configure API keys** - For LLM extraction
5. **Initialize components** - ColPali client, Qdrant manager, BAML interface
6. **Define schemas** - JSON schema with BAML-compatible types
7. **Extract data** - Run the full extraction pipeline
8. **Export results** - JSON and CSV formats

## Next Steps

- Try different PDFs from the `pdfs/` directory
- Create custom schemas for your documents
- Explore the CLI: `tatforge --help`
- Read the [Architecture Playbook](../docs/architecture-playbook.md)

## Support

For issues or questions: https://github.com/Frosselet/COCOINDEX_LEARNING/issues