# tatForge Quick Start Guide with CocoIndex

This notebook demonstrates how to use **tatForge** with **CocoIndex** - the central orchestration framework that wires together:
- **ColPali** for vision-based multi-vector embeddings (spatial awareness)
- **Qdrant** for vector storage and semantic search
- **BAML** for structured LLM extraction from document images

## Architecture

```
INDEXING (cocoindex setup/update):
  pdfs/ → LocalFile → file_to_pages → ColPaliEmbedImage → Qdrant

SEARCH + EXTRACT:
  Query → ColPaliEmbedQuery → Qdrant Search → Page Images
        → extract_with_baml (cached) → Structured Output
```
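
The `(cached)` annotation on `extract_with_baml` matters because every extraction is an LLM call. The sketch below illustrates the general idea of keying a cache on the inputs plus a behavior version, so that an identical page and prompt pair is computed once and a version bump forces recomputation. It is a conceptual stand-in, not CocoIndex's implementation: `cache_key`, `cached_extract`, and the in-memory `_cache` are hypothetical names, and the hashing scheme is an assumption.

```python
import hashlib

_cache = {}   # hypothetical in-memory store; a real system would persist this
calls = 0     # counts how often the "expensive" work actually runs


def cache_key(page_image: bytes, prompt: str, behavior_version: int) -> str:
    """Hash the inputs together with the behavior version."""
    h = hashlib.sha256()
    h.update(str(behavior_version).encode())
    h.update(b"\x00")
    h.update(prompt.encode("utf-8"))
    h.update(b"\x00")
    h.update(page_image)
    return h.hexdigest()


def cached_extract(page_image: bytes, prompt: str, behavior_version: int = 1) -> dict:
    """Hypothetical stand-in for an expensive extraction call."""
    global calls
    key = cache_key(page_image, prompt, behavior_version)
    if key not in _cache:
        calls += 1  # only uncached requests do real work
        _cache[key] = {"status": "success", "extracted_text": f"{len(page_image)} bytes"}
    return _cache[key]


cached_extract(b"page-1", "Extract shipments")
cached_extract(b"page-1", "Extract shipments")                      # served from cache
cached_extract(b"page-1", "Extract shipments", behavior_version=2)  # recomputed
print(calls)  # 2
```

The practical consequence is that re-running an extraction on an unchanged page with an unchanged prompt should not trigger a second LLM call, while changing the extraction logic calls for a new `behavior_version`.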

## Prerequisites

| Component | Purpose | Setup |
|-----------|---------|-------|
| **CocoIndex** | Flow orchestration | `pip install cocoindex[colpali]` |
| **Qdrant** | Vector database | Docker: `docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant` |
| **BAML** | LLM extraction | API key (OpenAI for GPT-4o vision) |
| **ColPali** | Vision embeddings | Included with `cocoindex[colpali]` |

## Step 1: Check Prerequisites

Let's verify CocoIndex, Qdrant, and BAML are available.
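
For a quick first pass that does not import anything heavy, the standard library can report whether each module is locatable. This is a minimal sketch that reuses the import names from the full check in this step; the helper name `missing_modules` is hypothetical. It only tells you that a module can be found, not that a server is running or a key is set.

```python
from importlib.util import find_spec

# Import names taken from the prerequisites check in this notebook.
REQUIRED_MODULES = ["cocoindex", "colpali_engine", "qdrant_client", "baml_py"]


def missing_modules(names):
    """Return the top-level module names that cannot be located on sys.path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```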

In [None]:
import os
import sys

def check_prerequisites():
    """Check all CocoIndex + ColPali prerequisites and report status."""
    status = {}
    
    # 1. Check CocoIndex
    try:
        import cocoindex
        status["cocoindex"] = {"installed": True, "message": "cocoindex is installed"}
    except ImportError:
        status["cocoindex"] = {"installed": False, "message": "Missing: pip install cocoindex[colpali]"}
    
    # 2. Check ColPali engine
    try:
        import colpali_engine
        status["colpali"] = {"installed": True, "message": "colpali-engine is installed"}
    except ImportError:
        status["colpali"] = {"installed": False, "message": "Missing: pip install cocoindex[colpali]"}
    
    # 3. Check Qdrant client and connection
    try:
        from qdrant_client import QdrantClient
        status["qdrant_client"] = {"installed": True, "message": "qdrant-client is installed"}
        
        qdrant_url = os.getenv("QDRANT_URL", "http://localhost:6333")
        try:
            client = QdrantClient(url=qdrant_url, timeout=5)
            collections = client.get_collections()
            status["qdrant_server"] = {"running": True, "message": f"Qdrant running at {qdrant_url}"}
        except Exception as e:
            status["qdrant_server"] = {"running": False, "message": f"Qdrant not reachable at {qdrant_url}: {e}"}
    except ImportError:
        status["qdrant_client"] = {"installed": False, "message": "Missing: pip install qdrant-client"}
        status["qdrant_server"] = {"running": False, "message": "Install qdrant-client first"}
    
    # 4. Check BAML
    try:
        import baml_py
        status["baml"] = {"installed": True, "message": "baml-py is installed"}
    except ImportError:
        status["baml"] = {"installed": False, "message": "Missing: pip install baml-py"}
    
    # 5. Check BAML client (generated)
    try:
        from baml_client import b
        status["baml_client"] = {"installed": True, "message": "BAML client generated"}
    except ImportError:
        status["baml_client"] = {"installed": False, "message": "Run: baml-cli generate"}
    
    # 6. Check for OpenAI API key (required for GPT-4o vision)
    openai_key = os.getenv("OPENAI_API_KEY")
    if openai_key:
        status["openai_api"] = {"configured": True, "message": f"OPENAI_API_KEY is set ({openai_key[:8]}...)"}
    else:
        status["openai_api"] = {"configured": False, "message": "OPENAI_API_KEY not set (required for GPT-4o vision)"}
    
    # 7. Check tatforge flows module
    try:
        sys.path.insert(0, '..')
        from tatforge.flows import document_indexing_flow, query_to_colpali_embedding
        status["tatforge_flows"] = {"installed": True, "message": "tatforge.flows module available"}
    except ImportError as e:
        status["tatforge_flows"] = {"installed": False, "message": f"tatforge.flows not found: {e}"}
    
    return status

# Run checks
print("=" * 60)
print("COCOINDEX + COLPALI PREREQUISITES CHECK")
print("=" * 60)

status = check_prerequisites()

all_ready = True
for component, info in status.items():
    ready = info.get("installed", False) or info.get("running", False) or info.get("configured", False)
    icon = "✅" if ready else "❌"
    print(f"{icon} {component}: {info['message']}")
    if not ready:
        all_ready = False

print("=" * 60)
if all_ready:
    print("All prerequisites met! Ready for CocoIndex + ColPali extraction.")
else:
    print("Some prerequisites missing. Follow the setup steps below.")

## Step 2: Install CocoIndex (if missing)

CocoIndex is the orchestration framework that manages the data flows between ColPali, Qdrant, and BAML.
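
When reporting an issue it helps to know which versions are installed, not just whether the imports succeed. The sketch below reads versions from package metadata using the distribution names that appear in this guide's install hints. The helper `installed_version` is a hypothetical name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names as used in this guide's pip commands and messages.
for dist in ["cocoindex", "colpali-engine", "qdrant-client", "baml-py"]:
    print(f"{dist}: {installed_version(dist) or 'not installed'}")
```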

In [None]:
# Install CocoIndex with ColPali support (uncomment and run if not installed)
# !pip install cocoindex[colpali]

# Verify installation
try:
    import cocoindex
    import colpali_engine
    print("✅ CocoIndex with ColPali support installed successfully")
except ImportError as e:
    print(f"❌ Missing dependencies: {e}")
    print("   Uncomment the pip install line above and run.")

## Step 3: Start Qdrant Vector Database

Qdrant stores the ColPali embeddings for semantic search. You can run it via Docker:

```bash
# In terminal, run:
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```

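Or use Qdrant Cloud (free tier available at https://cloud.qdrant.io). The cells in this notebook read the server address from the `QDRANT_URL` environment variable and fall back to `http://localhost:6333`.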
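
If the connection check in this step fails, it can be useful to separate "nothing is listening on that port" from client-side problems. The following sketch uses only the standard library to test TCP reachability of the address in `QDRANT_URL`. The helper names `host_and_port` and `port_is_open` are hypothetical, and an open port only shows that something is listening there, not that it is Qdrant.

```python
import os
import socket
from urllib.parse import urlparse


def host_and_port(url, default_port=6333):
    """Split a URL such as http://localhost:6333 into (host, port)."""
    parsed = urlparse(url)
    return parsed.hostname or "localhost", parsed.port or default_port


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


host, port = host_and_port(os.getenv("QDRANT_URL", "http://localhost:6333"))
print(f"{host}:{port} reachable: {port_is_open(host, port)}")
```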

In [None]:
# Check Qdrant connection
from qdrant_client import QdrantClient
import os

qdrant_url = os.getenv("QDRANT_URL", "http://localhost:6333")

try:
    client = QdrantClient(url=qdrant_url, timeout=5)
    collections = client.get_collections()
    print(f"✅ Qdrant is running at {qdrant_url}")
    print(f"   Collections: {len(collections.collections)}")
except Exception as e:
    print(f"❌ Cannot connect to Qdrant at {qdrant_url}")
    print(f"   Error: {e}")
    print("\n   Start Qdrant with: docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant")

## Step 4: Configure LLM API Key

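BAML calls an LLM for structured extraction. The prerequisites check in Step 1 expects `OPENAI_API_KEY` (GPT-4o vision); set `ANTHROPIC_API_KEY` only if your BAML client is configured to use Claude. Set your API key: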

**Option A**: Set in environment (recommended)
```bash
export ANTHROPIC_API_KEY="sk-ant-..."
# or
export OPENAI_API_KEY="sk-..."
```

**Option B**: Set in this notebook (temporary)

In [None]:
# Set API key (uncomment and fill in your key)
import os

# Option B: Set key directly (will only persist for this session)
# os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # Your Anthropic key
# os.environ["OPENAI_API_KEY"] = "sk-..."  # Or your OpenAI key

# Check if API key is configured
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
openai_key = os.getenv("OPENAI_API_KEY")

if anthropic_key:
    print(f"✅ ANTHROPIC_API_KEY is set ({anthropic_key[:12]}...)")
elif openai_key:
    print(f"✅ OPENAI_API_KEY is set ({openai_key[:12]}...)")
else:
    print("❌ No LLM API key configured")
    print("   Set ANTHROPIC_API_KEY or OPENAI_API_KEY above")

## Step 5: Initialize CocoIndex Flow

Import and initialize the CocoIndex flow that orchestrates BAML extraction and Qdrant storage.
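
The flows imported in this step are defined at module level, which this guide treats as a requirement. One common reason decorator-based frameworks want this is that registration happens when the decorator runs: at import time for module-level definitions, but on every call for definitions nested in a function. The sketch below demonstrates that general Python behavior with a hypothetical `flow_def` registry; it is not CocoIndex's implementation, and CocoIndex's actual reasons may differ.

```python
REGISTRY = {}


def flow_def(name):
    """Hypothetical registering decorator (not CocoIndex's implementation)."""
    def decorator(fn):
        if name in REGISTRY:
            raise ValueError(f"flow {name!r} is already registered")
        REGISTRY[name] = fn
        return fn
    return decorator


@flow_def(name="ModuleLevelFlow")
def module_level_flow():
    return "indexed"


def build_flow_lazily():
    # The decorator runs each time this function is called, not at import.
    @flow_def(name="NestedFlow")
    def nested_flow():
        return "indexed"
    return nested_flow


print(sorted(REGISTRY))   # only the module-level flow is registered so far
build_flow_lazily()
try:
    build_flow_lazily()   # a second call tries to register the same name again
except ValueError as exc:
    print("error:", exc)
```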

In [None]:
# Initialize CocoIndex and import tatforge flows
import os
import sys
import cocoindex
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Add parent directory to path for tatforge imports
sys.path.insert(0, '..')

# Initialize CocoIndex (REQUIRED before using flows)
cocoindex.init()
print("✅ CocoIndex initialized")

# Import the flow functions from tatforge.flows module
from tatforge.flows import (
    # Indexing flow (module-level @cocoindex.flow_def)
    document_indexing_flow,
    query_to_colpali_embedding,
    qdrant_connection,
    # Operations (module-level @cocoindex.op.function)
    file_to_pages,
    Page,
    # Extraction (module-level @cocoindex.op.function with cache)
    extract_with_baml,
    extract_with_schema,
    # Configuration
    QDRANT_GRPC_URL,
    QDRANT_COLLECTION,
    PDF_PATH,
    COLPALI_MODEL,
)

print("✅ tatforge.flows imported")
print(f"   PDF Path: {PDF_PATH}")
print(f"   Qdrant gRPC URL: {QDRANT_GRPC_URL}")
print(f"   Collection: {QDRANT_COLLECTION}")
print(f"   ColPali Model: {COLPALI_MODEL}")

In [None]:
# The tatforge.flows module defines CocoIndex flows at MODULE LEVEL
# This is critical - decorators must be at module level, not inside functions

print("tatforge.flows Architecture (from tatforge/flows/):")
print("-" * 60)
print("""
# tatforge/flows/ops.py - Operations
@cocoindex.op.function()
def file_to_pages(filename: str, content: bytes) -> list[Page]:
    '''Convert PDFs to 300 DPI page images'''
    if mime_type == "application/pdf":
        images = convert_from_bytes(content, dpi=300)
        return [Page(page_number=i+1, image=buffer.getvalue()) for i, img in enumerate(images)]

# tatforge/flows/indexing.py - ColPali + Qdrant indexing
qdrant_connection = cocoindex.add_auth_entry(
    "qdrant_connection",
    cocoindex.targets.QdrantConnection(grpc_url=QDRANT_GRPC_URL),
)

@cocoindex.flow_def(name="DocumentIndexingFlow")
def document_indexing_flow(flow_builder, data_scope) -> None:
    # Load PDFs as binary
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path=PDF_PATH, binary=True)
    )
    
    with data_scope["documents"].row() as doc:
        # Convert to page images
        doc["pages"] = flow_builder.transform(file_to_pages, ...)
        
        with doc["pages"].row() as page:
            # Generate ColPali embedding (multi-vector for spatial awareness)
            page["embedding"] = page["image"].transform(
                cocoindex.functions.ColPaliEmbedImage(model=COLPALI_MODEL)
            )
            output_embeddings.collect(...)
    
    # Export to Qdrant
    output_embeddings.export("document_embeddings", cocoindex.targets.Qdrant(...))

@cocoindex.transform_flow()
def query_to_colpali_embedding(text):
    '''Convert query to ColPali multi-vector embedding'''
    return text.transform(cocoindex.functions.ColPaliEmbedQuery(model=COLPALI_MODEL))

# tatforge/flows/extraction.py - BAML extraction with caching
@cocoindex.op.function(cache=True, behavior_version=1)
async def extract_with_baml(page_image: bytes, extraction_prompt: str) -> dict:
    '''Extract structured data from page image using BAML (cached)'''
    image = baml_py.Image.from_base64("image/png", base64.b64encode(page_image))
    return await b.ExtractDocumentFieldsFromImage(document_image=image, ...)
""")
print("-" * 60)
print("✅ Flows defined at module level. Use CLI to run:")
print("   cocoindex setup  - Create Qdrant collection with schema")
print("   cocoindex update - Index all PDFs with ColPali embeddings")

## Step 6: Direct BAML Extraction (from page images)

ColPali works with **page images**, not raw PDFs. The `file_to_pages` operation converts PDFs to 300 DPI PNG images. You can test BAML extraction directly on these images.
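
To sanity-check a converted page before sending it to the LLM, you can read the pixel size straight from the PNG header and compare it with what 300 DPI implies for the paper size. This is a small standard-library sketch; `png_dimensions` and `expected_pixels` are hypothetical helper names, and the US Letter dimensions are just an example.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data):
    """Return (width, height) in pixels from a PNG's IHDR chunk, or None if not a PNG."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        return None
    # Width and height are big-endian 32-bit integers right after the chunk type.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def expected_pixels(width_in, height_in, dpi=300):
    """Pixel size of a page of the given size in inches rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter (8.5 x 11 in) at 300 DPI
print(expected_pixels(8.5, 11))  # (2550, 3300)
```

In this notebook, calling `png_dimensions(first_page.image)` after the conversion cell would report the rendered size, assuming `image` holds PNG bytes as described above.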

In [None]:
# Convert PDF to page images using tatforge.flows.file_to_pages
from pathlib import Path

# Select a PDF to process
pdf_path = Path("../pdfs/Bunge_loadingstatement_2025-09-25.pdf")

if not pdf_path.exists():
    # Try alternative PDFs
    pdf_dir = Path("../pdfs")
    pdfs = list(pdf_dir.glob("*.pdf"))
    if pdfs:
        pdf_path = pdfs[0]
        print(f"Using available PDF: {pdf_path.name}")
    else:
        print("❌ No PDFs found in ../pdfs/ directory")
        pdf_path = None

if pdf_path:
    print(f"Processing: {pdf_path.name}")
    
    # Read PDF bytes
    pdf_bytes = pdf_path.read_bytes()
    print(f"PDF size: {len(pdf_bytes):,} bytes")
    
    # Convert to page images using tatforge operation
    pages = file_to_pages(pdf_path.name, pdf_bytes)
    print(f"Converted to {len(pages)} page image(s)")
    
    if pages:
        # Get first page for extraction demo
        first_page = pages[0]
        page_image = first_page.image
        print(f"First page image size: {len(page_image):,} bytes (PNG @ 300 DPI)")

## Step 7: Run BAML Extraction

Call the BAML extraction function on the page image produced in Step 6.
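
The extraction cell in this step uses top-level `await`, which works in Jupyter because a notebook already runs an event loop. In a plain Python script there is no running loop, so the coroutine has to be driven explicitly, for example with `asyncio.run`. The sketch below shows that pattern with `fake_extract`, a hypothetical stand-in that makes no LLM call and only mimics the `status` / `extracted_text` result shape used in this notebook.

```python
import asyncio


async def fake_extract(page_image: bytes, extraction_prompt: str) -> dict:
    """Hypothetical stand-in for extract_with_baml; it makes no LLM call."""
    await asyncio.sleep(0)  # yield to the event loop once, as a real awaitable would
    return {"status": "success", "extracted_text": f"{len(page_image)} bytes received"}


def run_extraction(page_image: bytes, extraction_prompt: str) -> dict:
    """Drive the coroutine from synchronous code, e.g. a plain .py script."""
    return asyncio.run(fake_extract(page_image, extraction_prompt))


result = run_extraction(b"\x89PNG", "Extract all shipping information.")
print(result["status"])
```

Inside the notebook itself, keep using `await`: `asyncio.run` raises an error when called from an already running event loop.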

In [None]:
# Run BAML extraction on the page image
extraction_prompt = """
Extract all shipping information from this loading statement document.
Return as JSON with the following structure:
{
    "document_title": "title of the document",
    "last_updated": "date and author of last update",
    "shipments": [
        {
            "slot_reference": "unique reference number",
            "vessel_name": "name of the ship",
            "port": "port name",
            "eta_from": "ETA from date",
            "eta_to": "ETA to date",
            "commodity": "type of cargo",
            "quantity_tonnes": "quantity in tonnes",
            "exporter": "exporter name",
            "loading_status": "status of loading"
        }
    ]
}
"""

if 'page_image' in dir():
    print("Running BAML extraction with GPT-4o vision on page image...")
    print("-" * 60)
    
    # Use the cached extract_with_baml function from tatforge.flows
    result = await extract_with_baml(page_image, extraction_prompt)
    
    print("✅ Extraction complete!")
    print("-" * 60)
else:
    print("❌ No page_image available. Run the previous cell first.")

In [None]:
# Display extracted data
import json

print("Extracted Data:")
print("=" * 60)

if 'result' in dir():
    # extract_with_baml returns a dict with status and extracted_text or error
    if result.get("status") == "success":
        print(f"Status: {result['status']}")
        print("-" * 60)
        extracted = result.get("extracted_text", "")
        # Try to pretty-print if it's JSON
        try:
            if isinstance(extracted, str):
                parsed = json.loads(extracted)
                print(json.dumps(parsed, indent=2))
            else:
                print(json.dumps(extracted, indent=2))
        except (json.JSONDecodeError, TypeError):
            print(extracted)
    else:
        print(f"Status: {result.get('status', 'unknown')}")
        if "error" in result:
            print(f"Error: {result['error']}")
        if "message" in result:
            print(f"Message: {result['message']}")
else:
    print("❌ No result available. Run the extraction cell first.")

## Step 8: Semantic Search with ColPali

After running `cocoindex setup` and `cocoindex update`, you can search documents using ColPali's multi-vector embeddings. This provides spatial awareness - ColPali understands document layout and visual structure.
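
Multi-vector retrieval in the ColPali style is usually scored with late interaction (often called MaxSim): each query vector is matched to its most similar page vector, and those maxima are summed. The sketch below shows the arithmetic in plain Python on toy 2-dimensional vectors invented for illustration. Whether the Qdrant collection used here applies exactly this comparator depends on how `cocoindex setup` configures it, which this guide does not show.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, page_vectors):
    """Late-interaction score: for each query vector take its best-matching
    page vector (max dot product), then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


# Toy vectors, invented for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]   # has a strong match for each query vector
page_b = [[0.5, 0.5], [0.4, 0.4]]   # only moderate matches

print(maxsim_score(query, page_a))
print(maxsim_score(query, page_b))
```

Here `page_a` scores higher than `page_b` because each query vector finds a close match somewhere on the page, which is how a page can rank well when different parts of it answer different parts of the query.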

In [None]:
# Search documents using ColPali embeddings
# (Requires running 'cocoindex setup' and 'cocoindex update' first)

import os

from qdrant_client import QdrantClient

query = "shipping vessel wheat loading"
print(f"Searching for: '{query}'")
print("-" * 60)

try:
    # Generate ColPali query embedding (multi-vector)
    query_embedding = query_to_colpali_embedding.eval(query)
    print(f"Query embedding shape: {len(query_embedding)} vectors")
    
    # Search Qdrant with multi-vector similarity
    client = QdrantClient(url=os.getenv("QDRANT_URL", "http://localhost:6333"))
    
    # Check if collection exists
    collections = [c.name for c in client.get_collections().collections]
    if QDRANT_COLLECTION not in collections:
        print(f"\n❌ Collection '{QDRANT_COLLECTION}' not found.")
        print("   Run these commands first:")
        print("   1. cocoindex setup   (creates collection)")
        print("   2. cocoindex update  (indexes documents)")
    else:
        # Perform search with ColPali multi-vector
        results = client.query_points(
            collection_name=QDRANT_COLLECTION,
            query=query_embedding,
            limit=5,
            with_payload=True,
        )
        
        print(f"\nFound {len(results.points)} results:")
        for i, point in enumerate(results.points, 1):
            payload = point.payload or {}
            filename = payload.get("filename", "unknown")
            page = payload.get("page", "?")
            score = point.score
            print(f"\n{i}. [{score:.3f}] {filename} (page {page})")
            
except Exception as e:
    print(f"Search error: {e}")
    print("\nMake sure you've run:")
    print("  1. cocoindex setup   - Create Qdrant collection")
    print("  2. cocoindex update  - Index documents with ColPali")

## Summary

You've completed the tatForge + CocoIndex + ColPali quickstart! You learned how to:

1. **Check prerequisites** - CocoIndex, ColPali, Qdrant, BAML, OpenAI API key
2. **Initialize CocoIndex** - Central orchestration framework with `cocoindex.init()`
3. **Import tatforge.flows** - Module-level flow definitions for proper CocoIndex integration
4. **Convert PDFs to images** - Using `file_to_pages` operation (300 DPI PNG)
5. **BAML extraction** - Using `extract_with_baml` with caching for efficient LLM calls
6. **ColPali search** - Multi-vector embeddings for spatial-aware document retrieval

## Key Architecture Points

| Component | Role | CocoIndex Integration |
|-----------|------|----------------------|
| **ColPali** | Vision embeddings | `cocoindex.functions.ColPaliEmbedImage/Query()` |
| **Qdrant** | Vector storage | `cocoindex.targets.Qdrant()` with gRPC |
| **BAML** | Structured extraction | `@cocoindex.op.function(cache=True)` |
| **pdf2image** | PDF → PNG conversion | `@cocoindex.op.function()` |

## Commands

```bash
# Setup Qdrant collection with proper schema
cocoindex setup

# Index all PDFs with ColPali embeddings
cocoindex update

# Or use tatforge CLI
tatforge cocoindex setup
tatforge cocoindex update

# Or run the main entry point
python main.py
```

## File Structure

```
tatforge/flows/
├── __init__.py      # Exports all flows and operations
├── ops.py           # file_to_pages operation
├── indexing.py      # document_indexing_flow, query_to_colpali_embedding
└── extraction.py    # extract_with_baml, extract_with_schema (cached)
```

## Next Steps

- Add more PDFs to the `pdfs/` directory
- Run `cocoindex update` to index new documents
- Customize extraction prompts in your code
- Create custom BAML schemas in `baml_src/`
- Read the [CocoIndex docs](https://cocoindex.io/docs)

## Support

For issues: https://github.com/Frosselet/COCOINDEX_LEARNING/issues