# tatForge Quick Start Guide

Extract structured data from PDF documents using **vision AI**.

## What You'll Do

1. **Load a PDF** and convert it to high-resolution images
2. **Extract data** using GPT-4o vision via BAML
3. **View results** as structured JSON

## Prerequisites

Before running this notebook, ensure you have:

- `tatforge` installed (run `pip install -e .` from the project root)
- `OPENAI_API_KEY` environment variable set
- A PDF file in the `pdfs/` directory

---

## Step 1: Setup Environment

Load environment variables and verify all dependencies are available.

In [None]:
import os
import sys
from pathlib import Path

# Add project root to path
sys.path.insert(0, '..')

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Check prerequisites
checks = []

# Check OpenAI API key
api_key = os.getenv("OPENAI_API_KEY")
if api_key:
    checks.append(("OPENAI_API_KEY", f"{api_key[:8]}...", True))
else:
    checks.append(("OPENAI_API_KEY", "NOT SET", False))

# Check BAML
try:
    import baml_py
    from baml_client import b
    # Verify the function we need exists
    assert hasattr(b, 'ExtractDocumentFieldsFromImage'), "BAML function missing"
    checks.append(("baml-py + client", "OK", True))
except Exception as e:
    checks.append(("baml-py + client", str(e), False))

# Check tatforge.flows
try:
    from tatforge.flows import file_to_pages
    checks.append(("tatforge.flows", "OK", True))
except ImportError as e:
    checks.append(("tatforge.flows", str(e), False))

# Check for PDFs
pdf_dir = Path("../pdfs")
pdfs = list(pdf_dir.glob("*.pdf")) if pdf_dir.exists() else []
checks.append(("PDF files", f"{len(pdfs)} found", len(pdfs) > 0))

# Display results
print("Environment Check")
print("=" * 50)
all_ok = True
for name, status, ok in checks:
    icon = "OK" if ok else "FAIL"
    print(f"[{icon}] {name}: {status}")
    if not ok:
        all_ok = False
print("=" * 50)

if all_ok:
    print("All checks passed! Ready to proceed.")
else:
    print("Some checks failed. Fix issues before continuing.")
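
When the notebook is run non-interactively, it can help to stop at the first failed check instead of continuing into later cells. The following is a minimal sketch, assuming the same `(name, status, ok)` tuples built in the cell above; `require_all` is a hypothetical helper and not part of tatforge.

```python
def require_all(checks):
    """Raise RuntimeError listing every failed check; do nothing if all passed."""
    failed = [f"{name}: {status}" for name, status, ok in checks if not ok]
    if failed:
        raise RuntimeError("Environment check failed -> " + "; ".join(failed))


# Example with hand-written tuples in the same shape as `checks`
require_all([("OPENAI_API_KEY", "set", True), ("PDF files", "1 found", True)])
```

Calling `require_all(checks)` at the end of the setup cell would turn the printed warning into a hard stop.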

## Step 2: Select and Load PDF

Choose a PDF from the `pdfs/` directory and convert it to page images.

In [None]:
from pathlib import Path
from tatforge.flows import file_to_pages

# Find available PDFs
pdf_dir = Path("../pdfs")
available_pdfs = sorted(pdf_dir.glob("*.pdf"))

if not available_pdfs:
    raise FileNotFoundError(f"No PDF files found in {pdf_dir.absolute()}")

print("Available PDFs:")
for i, pdf in enumerate(available_pdfs, 1):
    print(f"  {i}. {pdf.name}")

# Use the first PDF (change the index to select a different file)
selected_pdf = available_pdfs[0]
print(f"\nSelected: {selected_pdf.name}")

# Read and convert to images
pdf_bytes = selected_pdf.read_bytes()
print(f"PDF size: {len(pdf_bytes):,} bytes")

pages = file_to_pages(selected_pdf.name, pdf_bytes)
print(f"Pages: {len(pages)}")

# Store first page for extraction
page_image = pages[0].image
print(f"First page image: {len(page_image):,} bytes (PNG @ 300 DPI)")
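
To sanity-check the conversion, the pixel dimensions can be read straight from the image header. This sketch assumes `page.image` holds raw PNG bytes, as the message printed above suggests; `png_dimensions` is a hypothetical helper that uses only the standard library.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) in pixels read from a PNG's IHDR chunk."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # Width and height are big-endian unsigned 32-bit integers at bytes 16..23
    return struct.unpack(">II", data[16:24])
```

In the notebook this could be called as `width, height = png_dimensions(page_image)`. For reference, a US Letter page rendered at 300 DPI would be 2550 x 3300 pixels.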

## Step 3: Define Extraction Schema

Specify what data you want to extract from the document.

In [None]:
extraction_prompt = """
Extract all shipping information from this loading statement document.

Return as JSON with this structure:
{
    "document_title": "title of the document",
    "last_updated": "date and author of last update",
    "shipments": [
        {
            "slot_reference": "unique reference number",
            "vessel_name": "name of the ship",
            "port": "port name",
            "eta_from": "ETA from date",
            "eta_to": "ETA to date",
            "commodity": "type of cargo",
            "quantity_tonnes": "quantity in tonnes",
            "exporter": "exporter name",
            "loading_status": "status of loading"
        }
    ]
}
"""

print("Extraction prompt defined.")
print(f"Prompt length: {len(extraction_prompt)} characters")
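
For other document types, the JSON template in the prompt can be generated from a Python dictionary rather than typed by hand. This is a sketch only: `build_prompt` and the invoice fields are hypothetical, and the wording simply mirrors the prompt above.

```python
import json


def build_prompt(instruction: str, schema: dict) -> str:
    """Combine a one-line instruction with a JSON template describing the output."""
    template = json.dumps(schema, indent=4)
    return f"{instruction}\n\nReturn as JSON with this structure:\n{template}\n"


invoice_schema = {  # hypothetical fields, for illustration only
    "invoice_number": "invoice identifier",
    "line_items": [{"description": "item description", "amount": "line total"}],
}
invoice_prompt = build_prompt(
    "Extract the invoice details from this document.", invoice_schema
)
```

Keeping the schema as data makes it easier to reuse the same field names when post-processing the results.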

## Step 4: Run BAML Extraction

Send the page image to GPT-4o vision for structured data extraction.

In [None]:
import base64
import baml_py
from baml_client import b

print(f"Extracting from: {selected_pdf.name}")
print("Sending to GPT-4o vision...")
print("-" * 40)

# Convert image to BAML format
image_b64 = base64.b64encode(page_image).decode("utf-8")
baml_image = baml_py.Image.from_base64("image/png", image_b64)

# Run extraction (using sync client)
result = b.ExtractDocumentFieldsFromImage(
    document_image=baml_image,
    extraction_prompt=extraction_prompt
)

print("Extraction complete!")
print("-" * 40)
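
Network calls to a hosted model can fail transiently, so a small retry wrapper may be worth adding around the extraction call. The sketch below is generic and makes no assumption about which exceptions the BAML client raises; it retries on any exception, which you may want to narrow. `call_with_retries` is a hypothetical helper.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn() until it succeeds, sleeping base_delay * 2**n between failures."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # assumption: any exception is worth retrying
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            time.sleep(base_delay * 2 ** attempt)
```

In this notebook it could wrap the call above as `call_with_retries(lambda: b.ExtractDocumentFieldsFromImage(document_image=baml_image, extraction_prompt=extraction_prompt))`.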

## Step 5: View Extracted Data

Display the structured JSON output from the extraction.

In [None]:
import json

print("Extracted Data:")
print("=" * 60)

# Parse and pretty-print the JSON result
try:
    parsed = json.loads(result)
    print(json.dumps(parsed, indent=2))
except json.JSONDecodeError:
    # If not valid JSON, print as-is
    print(result)

print("=" * 60)
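
Models sometimes wrap JSON output in a Markdown code fence, which makes `json.loads` fail on otherwise valid data. Whether that happens here depends on how the BAML function is defined, so treat the following as a defensive sketch; `parse_model_json` is a hypothetical helper.

```python
import json

FENCE_MARK = "`" * 3  # three backticks, the Markdown code-fence marker


def parse_model_json(text: str):
    """Parse JSON from model output, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE_MARK) and cleaned.endswith(FENCE_MARK):
        # Drop the opening fence line (it may carry a language tag) and the closing fence
        first_newline = cleaned.find("\n")
        cleaned = cleaned[first_newline + 1 : -len(FENCE_MARK)]
    return json.loads(cleaned)
```

The function still raises `json.JSONDecodeError` when the remaining text is not valid JSON, so it can replace `json.loads(result)` inside the existing `try` block.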

## Step 6: Process Multiple Pages (Optional)

If your PDF has multiple pages, run the same extraction on each one. The loop below processes the first page again, so it makes one API call per page.

In [None]:
if len(pages) > 1:
    print(f"Processing all {len(pages)} pages...")
    print("=" * 60)
    
    all_results = []
    for i, page in enumerate(pages, 1):
        print(f"\nPage {i}/{len(pages)}...")
        
        # Convert and extract
        img_b64 = base64.b64encode(page.image).decode("utf-8")
        img = baml_py.Image.from_base64("image/png", img_b64)
        
        page_result = b.ExtractDocumentFieldsFromImage(
            document_image=img,
            extraction_prompt=extraction_prompt
        )
        
        # Fall back to the raw text if the model did not return valid JSON
        try:
            page_data = json.loads(page_result)
        except json.JSONDecodeError:
            page_data = page_result

        all_results.append({"page": i, "data": page_data})
        print("  Done.")
    
    print("\n" + "=" * 60)
    print("All Pages Extracted:")
    print(json.dumps(all_results, indent=2))
else:
    print("Single-page PDF: already processed above.")
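
Once every page has been extracted, the per-page results can be flattened into a single list of shipments. This sketch assumes each parsed page follows the `shipments` structure requested in the extraction prompt and that `all_results` has the shape built above; `collect_shipments` is a hypothetical helper.

```python
def collect_shipments(all_results):
    """Flatten per-page results into one list, tagging each shipment with its page."""
    merged = []
    for entry in all_results:
        data = entry["data"]
        if not isinstance(data, dict):
            continue  # page returned unparsed text, so there is nothing to merge
        for shipment in data.get("shipments") or []:
            merged.append({**shipment, "page": entry["page"]})
    return merged
```

Pages whose output could not be parsed are skipped rather than raising, so it is worth comparing `len(all_results)` against the pages represented in the merged list.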

---

## Summary

You've successfully:

1. Loaded a PDF and converted it to 300 DPI images
2. Extracted structured data using GPT-4o vision
3. Parsed the results as JSON

### Next Steps

- **Customize the extraction prompt** for your document type
- **Add more PDFs** to the `pdfs/` directory
- **Use the CLI** for batch processing: `python main.py`
- **Index with ColPali** for semantic search: `cocoindex setup && cocoindex update`

### Resources

- [BAML Documentation](https://docs.boundaryml.com)
- [CocoIndex Documentation](https://cocoindex.io/docs)
- [Project Issues](https://github.com/Frosselet/COCOINDEX_LEARNING/issues)