# tatForge Quick Start Guide with CocoIndex

This notebook demonstrates how to use **tatForge** with **CocoIndex** - the central orchestration framework that wires together:
- **BAML** for structured LLM extraction with native PDF support
- **Qdrant** for vector storage and semantic search
- **SentenceTransformers** for text embeddings

## Architecture

```
PDF Documents → CocoIndex Flow → BAML Extraction → Qdrant Storage
                     ↓
              Query Handler → Semantic Search
```

## Prerequisites

| Component | Purpose | Setup |
|-----------|---------|-------|
| **CocoIndex** | Flow orchestration | `pip install cocoindex` |
| **Qdrant** | Vector database | Docker: `docker run -p 6333:6333 qdrant/qdrant` |
| **BAML** | LLM extraction | `pip install baml-py`, `baml-cli generate`, and an LLM API key (OpenAI for GPT-4o) |
| **SentenceTransformers** | Text embeddings | Included with CocoIndex |

## Step 1: Check Prerequisites

Verify that CocoIndex, Qdrant (client and server), BAML (package and generated client), and the OpenAI API key are all available.

In [None]:
import os
import sys

def check_prerequisites():
    """Check all CocoIndex prerequisites and report status."""
    status = {}
    
    # 1. Check CocoIndex
    try:
        import cocoindex
        status["cocoindex"] = {"installed": True, "message": f"cocoindex {getattr(cocoindex, '__version__', '')} is installed"}
    except ImportError:
        status["cocoindex"] = {"installed": False, "message": "Missing: pip install cocoindex"}
    
    # 2. Check Qdrant client and connection
    try:
        from qdrant_client import QdrantClient
        status["qdrant_client"] = {"installed": True, "message": "qdrant-client is installed"}
        
        qdrant_url = os.getenv("QDRANT_URL", "http://localhost:6333")
        try:
            client = QdrantClient(url=qdrant_url, timeout=5)
            client.get_collections()  # raises if the server is unreachable
            status["qdrant_server"] = {"running": True, "message": f"Qdrant running at {qdrant_url}"}
        except Exception as e:
            # Include the underlying error so connection problems are easier to diagnose
            status["qdrant_server"] = {"running": False, "message": f"Qdrant not running at {qdrant_url} ({e})"}
    except ImportError:
        status["qdrant_client"] = {"installed": False, "message": "Missing: pip install qdrant-client"}
        status["qdrant_server"] = {"running": False, "message": "Install qdrant-client first"}
    
    # 3. Check BAML
    try:
        import baml_py
        status["baml"] = {"installed": True, "message": "baml-py is installed"}
    except ImportError:
        status["baml"] = {"installed": False, "message": "Missing: pip install baml-py"}
    
    # 4. Check BAML client (generated)
    try:
        from baml_client import b
        status["baml_client"] = {"installed": True, "message": "BAML client generated"}
    except ImportError:
        status["baml_client"] = {"installed": False, "message": "Run: baml-cli generate"}
    
    # 5. Check for OpenAI API key (required for GPT-4o vision)
    openai_key = os.getenv("OPENAI_API_KEY")
    if openai_key:
        status["openai_api"] = {"configured": True, "message": f"OPENAI_API_KEY is set ({openai_key[:12]}...)"}
    else:
        status["openai_api"] = {"configured": False, "message": "OPENAI_API_KEY not set (required for GPT-4o)"}
    
    return status

# Run checks
print("=" * 60)
print("COCOINDEX PREREQUISITES CHECK")
print("=" * 60)

status = check_prerequisites()

all_ready = True
for component, info in status.items():
    ready = info.get("installed", False) or info.get("running", False) or info.get("configured", False)
    icon = "✅" if ready else "❌"
    print(f"{icon} {component}: {info['message']}")
    if not ready:
        all_ready = False

print("=" * 60)
if all_ready:
    print("All prerequisites met! Ready for CocoIndex extraction.")
else:
    print("Some prerequisites missing. Follow the setup steps below.")
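
The check above imports each package to see whether it is present. A lighter variant, sketched below, only looks the modules up on the import path without executing them. The helper name `missing_packages` and the `REQUIRED` list are illustrative and not part of tatForge; the module names are the ones imported elsewhere in this notebook.

```python
import importlib.util


def missing_packages(modules):
    """Return the names in `modules` that cannot be found on the import path."""
    missing = []
    for name in modules:
        # find_spec locates a top-level module without running its code
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Top-level module names used by the cells in this notebook
REQUIRED = ["cocoindex", "qdrant_client", "baml_py", "baml_client"]

print("Missing:", missing_packages(REQUIRED) or "nothing")
```

This only tells you whether a module can be located; it does not confirm that the Qdrant server is running or that an API key is set, so it complements the full check rather than replacing it.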

## Step 2: Install CocoIndex (if missing)

CocoIndex is the orchestration framework that manages data flows between BAML and Qdrant.

In [None]:
# Install CocoIndex (uncomment and run if not installed)
# !pip install cocoindex

# Verify installation
try:
    import cocoindex
    print("✅ CocoIndex installed successfully")
except ImportError:
    print("❌ CocoIndex not installed. Uncomment the pip install line above and run.")

## Step 3: Start Qdrant Vector Database

Qdrant stores the SentenceTransformers text embeddings used for semantic search. You can run it via Docker:

```bash
# In terminal, run:
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```

Alternatively, use Qdrant Cloud (free tier available at https://cloud.qdrant.io).

In [None]:
# Check Qdrant connection
from qdrant_client import QdrantClient
import os

qdrant_url = os.getenv("QDRANT_URL", "http://localhost:6333")

try:
    client = QdrantClient(url=qdrant_url, timeout=5)
    collections = client.get_collections()
    print(f"✅ Qdrant is running at {qdrant_url}")
    print(f"   Collections: {len(collections.collections)}")
except Exception as e:
    print(f"❌ Cannot connect to Qdrant at {qdrant_url}")
    print(f"   Error: {e}")
    print("\n   Start Qdrant with: docker run -p 6333:6333 qdrant/qdrant")
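
If `qdrant-client` is not installed yet, a plain TCP probe can still tell you whether anything is listening on the Qdrant port. The sketch below uses only the standard library; the helper names `qdrant_host_port` and `port_is_open` are hypothetical, and falling back to port 6333 when the URL has no explicit port is an assumption based on the Docker command above.

```python
import socket
from urllib.parse import urlparse


def qdrant_host_port(url, default_port=6333):
    """Split a URL such as http://localhost:6333 into (host, port)."""
    parsed = urlparse(url)
    return parsed.hostname, parsed.port or default_port


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False


host, port = qdrant_host_port("http://localhost:6333")
print(f"{host}:{port} reachable: {port_is_open(host, port)}")
```

An open port only shows that some service is listening; the `get_collections()` call above remains the real test that the service is Qdrant.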

## Step 4: Configure LLM API Key

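BAML calls an LLM for structured extraction. The prerequisite check in Step 1 and the extraction in Step 7 use GPT-4o, which needs `OPENAI_API_KEY`; set `ANTHROPIC_API_KEY` only if your BAML client is configured for Claude. Set your API key: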

**Option A**: Set in environment (recommended)
```bash
export ANTHROPIC_API_KEY="sk-ant-..."
# or
export OPENAI_API_KEY="sk-..."
```

**Option B**: Set in this notebook (temporary)

In [None]:
# Set API key (uncomment and fill in your key)
import os

# Option B: Set key directly (will only persist for this session)
# os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # Your Anthropic key
# os.environ["OPENAI_API_KEY"] = "sk-..."  # Or your OpenAI key

# Check if API key is configured
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
openai_key = os.getenv("OPENAI_API_KEY")

if anthropic_key:
    print(f"✅ ANTHROPIC_API_KEY is set ({anthropic_key[:12]}...)")
elif openai_key:
    print(f"✅ OPENAI_API_KEY is set ({openai_key[:12]}...)")
else:
    print("❌ No LLM API key configured")
    print("   Set ANTHROPIC_API_KEY or OPENAI_API_KEY above")
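
Printing part of a key is handy for confirming which key is loaded, but notebook outputs are often committed or shared. A minimal sketch of a masking helper follows; `mask_secret` is a hypothetical name, and showing only the last four characters is an arbitrary choice.

```python
def mask_secret(value, visible=4):
    """Return `value` with all but the last `visible` characters replaced by '*'."""
    if not value:
        return "(not set)"
    if len(value) <= visible:
        # Too short to reveal anything safely: mask every character
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Placeholder value, not a real key
print(mask_secret("sk-example-0000"))
```

You could then print `mask_secret(os.getenv("OPENAI_API_KEY"))` instead of a raw prefix.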

## Step 5: Initialize CocoIndex Flow

Import and initialize the CocoIndex flow that orchestrates BAML extraction and Qdrant storage.

In [None]:
# Set up CocoIndex and import the flow
import os
import cocoindex
from dotenv import load_dotenv

# Load environment variables (API keys, QDRANT_URL, ...) from a local .env file
load_dotenv()

# Initialize CocoIndex
cocoindex.init()
print("✅ CocoIndex initialized")

# Import the flow functions from our cocoindex_flow module
import sys
sys.path.insert(0, '..')  # Add parent directory to path

from cocoindex_flow import (
    extract_document_fields,
    extract_with_schema,
    text_to_embedding,
    document_extraction_flow,
    search_documents,
    QDRANT_URL,
    QDRANT_COLLECTION,
    PDF_PATH
)

print("✅ CocoIndex flow imported")
print(f"   PDF Path: {PDF_PATH}")
print(f"   Qdrant URL: {QDRANT_URL}")
print(f"   Collection: {QDRANT_COLLECTION}")

In [None]:
# The CocoIndex flow will automatically:
# 1. Load PDFs from the configured path (binary mode)
# 2. Extract text using BAML with native PDF support
# 3. Generate embeddings using SentenceTransformer
# 4. Store results in Qdrant

# To run the flow and index all PDFs, use CocoIndex CLI:
# cocoindex run

# Or programmatically trigger the flow (for development/testing):
print("CocoIndex Flow Architecture:")
print("-" * 40)
print("""
@cocoindex.flow_def("DocumentExtractionWithQdrant")
def document_extraction_flow(flow_builder, data_scope):
    
    # 1. Load PDFs as binary
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="pdfs", binary=True)
    )
    
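    # Collector that gathers one row per document for export
    doc_embeddings = data_scope.add_collector()
    
    # 2. For each document...
    with data_scope["documents"].row() as doc: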
        # Extract using BAML (native PDF support!)
        doc["extracted_text"] = doc["content"].transform(
            extract_document_fields,
            extraction_prompt="..."
        )
        
        # Generate embedding
        doc["embedding"] = text_to_embedding(doc["extracted_text"])
        
        # Collect for storage
        doc_embeddings.collect(...)
    
    # 3. Export to Qdrant
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.targets.Qdrant(collection_name="documents"),
        primary_key_fields=["id"]
    )
""")
print("-" * 40)
print("✅ Flow ready. Run 'cocoindex run' to process documents.")
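
To make the data movement in the flow concrete, here is a toy, pure-Python imitation of its four stages. It is not CocoIndex code: `toy_extract`, `toy_embed`, and `run_toy_flow` are hypothetical stand-ins, the "extraction" simply decodes bytes, and the "embedding" is a word-hashing trick rather than a SentenceTransformer model.

```python
import hashlib
import math


def toy_extract(pdf_bytes):
    """Stand-in for the BAML extraction step: decode the bytes to text."""
    return pdf_bytes.decode("utf-8", errors="ignore")


def toy_embed(text, dim=8):
    """Stand-in for the embedding step: hash words into a fixed-size, L2-normalised vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def run_toy_flow(documents):
    """For each document: extract, embed, and collect one row keyed by `id`."""
    collected = []  # plays the role of the collector that is exported to the vector store
    for filename, content in documents.items():
        text = toy_extract(content)
        collected.append({
            "id": filename,
            "filename": filename,
            "extracted_text": text,
            "embedding": toy_embed(text),
        })
    return collected


rows = run_toy_flow({"a.pdf": b"wheat vessel loading", "b.pdf": b"invoice total due"})
for row in rows:
    print(row["id"], len(row["embedding"]))
```

The real flow differs in the parts that matter most: CocoIndex tracks which source files changed and keeps the Qdrant collection in sync, which this loop does not attempt.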

## Step 6: Direct BAML Extraction (without full flow)

You can also call the BAML extraction directly for testing. This shows how CocoIndex wraps BAML with native PDF support.

In [None]:
# Direct BAML extraction example using native PDF support
import base64
import baml_py
from pathlib import Path
from baml_client import b

# Select a PDF to extract from
pdf_path = Path("../pdfs/Bunge_loadingstatement_2025-09-25.pdf")
print(f"Extracting from: {pdf_path.name}")

# Read PDF as bytes and convert to BAML PDF type
pdf_bytes = pdf_path.read_bytes()
pdf_base64 = base64.b64encode(pdf_bytes).decode("utf-8")
pdf = baml_py.Pdf.from_base64(pdf_base64)

print(f"PDF size: {len(pdf_bytes):,} bytes")
print("Converted to BAML PDF type (native support, no image conversion needed)")
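
Before sending a file to the LLM, it can be worth confirming that it really is a PDF and that the base64 encoding round-trips. The sketch below uses only the standard library; `looks_like_pdf` and `to_base64` are hypothetical helper names, and the sample bytes are a placeholder rather than a complete PDF.

```python
import base64


def looks_like_pdf(data):
    """Well-formed PDF files begin with the ASCII marker '%PDF-'."""
    return data.startswith(b"%PDF-")


def to_base64(data):
    """Encode raw bytes as a base64 text string."""
    return base64.b64encode(data).decode("utf-8")


sample = b"%PDF-1.7\n% placeholder bytes, not a complete PDF\n"
encoded = to_base64(sample)

# Decoding must give back exactly the bytes we started with
assert base64.b64decode(encoded) == sample
print(looks_like_pdf(sample), len(sample), len(encoded))
```

Base64 output is roughly one third larger than the raw bytes, which is worth keeping in mind for large documents and request size limits.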

## Step 7: Run BAML Extraction

Call the BAML extraction function with the PDF document.

In [None]:
# Run BAML extraction with custom prompt
extraction_prompt = """
Extract all shipping information from this loading statement document.
Return as JSON with the following structure:
{
    "document_title": "title of the document",
    "last_updated": "date and author of last update",
    "shipments": [
        {
            "slot_reference": "unique reference number",
            "vessel_name": "name of the ship",
            "port": "port name",
            "eta_from": "ETA from date",
            "eta_to": "ETA to date",
            "commodity": "type of cargo",
            "quantity_tonnes": "quantity in tonnes",
            "exporter": "exporter name",
            "loading_status": "status of loading"
        }
    ]
}
"""

print("Running BAML extraction with GPT-4o (native PDF support)...")
print("-" * 60)

# Call the BAML function (async; top-level await works in Jupyter/IPython)
result = await b.ExtractDocumentFieldsFromPDF(
    document=pdf,
    extraction_prompt=extraction_prompt
)

print("✅ Extraction complete!")
print("-" * 60)
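
The `await` in the cell above works because Jupyter runs cells inside an event loop. In a standalone `.py` script there is no running loop, so the call is usually wrapped in `asyncio.run`. The sketch below shows that pattern with `fake_extract`, a hypothetical stand-in coroutine that takes the same two keyword arguments but makes no LLM call.

```python
import asyncio


async def fake_extract(document, extraction_prompt):
    """Hypothetical stand-in for an async extraction call; returns a canned string."""
    await asyncio.sleep(0)  # yield control once, as a real network call would
    return f"extracted {len(document)} bytes using a {len(extraction_prompt)}-character prompt"


async def main():
    return await fake_extract(
        document=b"%PDF-1.7",
        extraction_prompt="Extract all shipping information",
    )


if __name__ == "__main__":
    # Script entry point; inside a notebook, use `await main()` instead
    print(asyncio.run(main()))
```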

In [None]:
# Display extracted data (the return type depends on which BAML function was called)
print("Extracted Data:")
print("=" * 60)

# For ExtractDocumentFieldsFromPDF (returns str), print directly
print(result)

# If using ExtractFromPDF (returns DocumentExtractionResult Pydantic model):
# print(result.model_dump_json(indent=2))  # To JSON
# print(result.model_dump())               # To dict
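
Because the prompt asks for JSON but the function returns a plain string, a natural next step is to parse that string. LLMs sometimes wrap JSON in a Markdown code fence, so the sketch below strips one if present. `parse_extraction` is a hypothetical helper, and the sample string is made-up placeholder data, not real output.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker


def parse_extraction(raw):
    """Parse a JSON string, tolerating an optional Markdown code fence around it."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag)
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if there is one
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    return json.loads(text)


sample = FENCE + 'json\n{"document_title": "EXAMPLE", "shipments": [{"vessel_name": "EXAMPLE"}]}\n' + FENCE
data = parse_extraction(sample)
print(data["document_title"], len(data["shipments"]))
```

`json.loads` raises `json.JSONDecodeError` on malformed output, so in practice you may want to catch that and inspect the raw string.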

## Step 8: Semantic Search with CocoIndex Query Handler

After running the CocoIndex flow, you can search documents using the query handler.

In [None]:
# Search documents using the CocoIndex query handler
# (Requires running 'cocoindex run' first to index documents)

try:
    query = "shipping vessel wheat loading"
    print(f"Searching for: '{query}'")
    print("-" * 60)
    
    results = search_documents(query)
    
    print(f"Found {len(results.results)} results:")
    for i, r in enumerate(results.results, 1):
        print(f"\n{i}. [{r['score']:.3f}] {r['filename']}")
        text = r.get('extracted_text', '')[:200]
        print(f"   {text}...")
        
except Exception as e:
    print(f"Search error: {e}")
    print("\nMake sure you've run 'cocoindex run' to index documents first.")
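
Semantic search of this kind comes down to embedding the query and ranking stored vectors by similarity. The sketch below shows that ranking step in plain Python with tiny hand-written vectors. It assumes cosine similarity, which is common for sentence embeddings; the metric actually used by your Qdrant collection depends on how it was created. The function names and vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec, docs, top_k=3):
    """Score every document vector against the query and return the best `top_k`."""
    scored = [
        {"filename": name, "score": cosine_similarity(query_vec, vec)}
        for name, vec in docs.items()
    ]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_k]


docs = {
    "a.pdf": [1.0, 0.0, 0.0],
    "b.pdf": [0.7, 0.7, 0.0],
    "c.pdf": [0.0, 0.0, 1.0],
}
for r in rank([1.0, 0.1, 0.0], docs):
    print(f"[{r['score']:.3f}] {r['filename']}")
```

A vector database performs the same ranking with approximate nearest-neighbour indexes, so it stays fast when the collection grows beyond what a linear scan can handle.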

## Summary

You've completed the CocoIndex + BAML + Qdrant quickstart! You learned how to:

1. **Check prerequisites** - CocoIndex, Qdrant, BAML, OpenAI API key
2. **Initialize CocoIndex** - Central orchestration framework
3. **Understand the flow** - How CocoIndex wires BAML and Qdrant together
4. **Direct BAML extraction** - Using native PDF support (`baml_py.Pdf.from_base64()`)
5. **Semantic search** - Query documents via CocoIndex query handler

## Key Architecture Points

| Component | Role | CocoIndex Integration |
|-----------|------|----------------------|
| **BAML** | Structured extraction | `@cocoindex.op.function()` decorator |
| **Qdrant** | Vector storage | `cocoindex.targets.Qdrant()` |
| **SentenceTransformer** | Embeddings | `cocoindex.functions.SentenceTransformerEmbed()` |

## Commands

```bash
# Run the CocoIndex flow to process all PDFs
cocoindex run

# Or run the flow script directly
python cocoindex_flow.py
```

## Next Steps

- Add more PDFs to the `pdfs/` directory
- Customize extraction prompts in `cocoindex_flow.py`
- Create custom BAML schemas in `baml_src/`
- Read the [CocoIndex docs](https://cocoindex.io/docs)

## Support

For issues: https://github.com/Frosselet/COCOINDEX_LEARNING/issues