[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextualAI/examples/blob/main/18-contextualai-chroma/01-contextual-ai-parser-chroma.ipynb)

# Build Multi-Modal RAG with Chroma and Contextual AI Parser

**Last updated:** October 2025

**Versions used:**
- Chroma version `latest`
- Contextual AI client `latest`
- OpenAI API (for embeddings and generation)

This is a code recipe that uses [Chroma](https://docs.trychroma.com/) to perform multi-modal RAG over documents parsed by [Contextual AI Parser](https://docs.contextual.ai/api-reference/parse/parse-file).

In this notebook, we accomplish the following:
* Parse two distinct document types using Contextual AI Parser: research papers and table-rich documents
* Extract structured markdown with document hierarchy preservation and advanced table extraction
* Generate text embeddings with OpenAI
* Perform multi-modal RAG using [Chroma](https://docs.trychroma.com/)

To run this notebook, you'll need:
* A [Contextual AI API key](https://docs.contextual.ai/user-guides/beginner-guide#get-your-api-key) - for document parsing and content extraction
* An [OpenAI API key](https://platform.openai.com/docs/quickstart) - for text embeddings and generative responses



### Install Contextual AI client and Chroma

Note: If Colab prompts you to restart the session after running the cell below, click "restart" and proceed with running the rest of the notebook.


In [None]:
%%capture
%pip install --upgrade chromadb contextual-client openai requests rich

import warnings
warnings.filterwarnings("ignore")

import logging
# Suppress Chroma client logs
logging.getLogger("chromadb").setLevel(logging.ERROR)


## 🔍 Part 1: Contextual AI Parser

Contextual AI Parser is a cloud-based document parsing service that excels at extracting structured information from PDFs, DOC/DOCX, and PPT/PPTX files. It provides high-quality markdown extraction with document hierarchy preservation, making it ideal for RAG applications.

The parser handles complex documents with images, tables, and hierarchical structures, providing multiple output formats including:
- `markdown-document`: Single concatenated markdown output
- `markdown-per-page`: Page-by-page markdown output
- `blocks-per-page`: Structured JSON with document hierarchy


In [None]:
# Documents to parse with Contextual AI
documents = [
    {
        "url": "https://arxiv.org/pdf/1706.03762",
        "title": "Attention Is All You Need",
        "type": "research_paper",
        "description": "Seminal transformer architecture paper that introduced self-attention mechanisms"
    },
    {
        "url": "https://raw.githubusercontent.com/ContextualAI/examples/refs/heads/main/03-standalone-api/04-parse/data/omnidocbench-text.pdf",
        "title": "OmniDocBench Dataset Documentation", 
        "type": "table_rich_document",
        "description": "Dataset documentation with large tables demonstrating table extraction capabilities"
    }
]


### API Keys Setup 🔑

We'll be using the Contextual AI API for parsing documents and OpenAI API for both generating text embeddings and for the generative model in our RAG pipeline. The code below dynamically fetches your API keys based on whether you're running this notebook in Google Colab or as a regular Jupyter notebook.

If you're running this notebook in Google Colab, make sure you [add](https://medium.com/@parthdasawant/how-to-use-secrets-in-google-colab-450c38e3ec75) your API keys as secrets.


In [None]:
# API key variable names
contextual_api_key_var = "CONTEXTUAL_API_KEY"  # Replace with the name of your secret/env var
openai_api_key_var = "OPENAI_API_KEY"  # Replace with the name of your secret/env var

# Fetch API keys
try:
    # If running in Colab, fetch API keys from Secrets
    import google.colab
    from google.colab import userdata
    contextual_api_key = userdata.get(contextual_api_key_var)
    openai_api_key = userdata.get(openai_api_key_var)
    
    if not contextual_api_key:
        raise ValueError(f"Secret '{contextual_api_key_var}' not found in Colab secrets.")
    if not openai_api_key:
        raise ValueError(f"Secret '{openai_api_key_var}' not found in Colab secrets.")
except ImportError:
    # If not running in Colab, fetch API keys from environment variables
    import os
    contextual_api_key = os.getenv(contextual_api_key_var)
    openai_api_key = os.getenv(openai_api_key_var)
    
    if not contextual_api_key:
        raise EnvironmentError(
            f"Environment variable '{contextual_api_key_var}' is not set. "
            "Please define it before running this script."
        )
    if not openai_api_key:
        raise EnvironmentError(
            f"Environment variable '{openai_api_key_var}' is not set. "
            "Please define it before running this script."
        )

print("API keys configured successfully!")


### Download and parse PDFs using Contextual AI Parser

Here we use Contextual AI's Python SDK to parse a batch of PDFs. The result is structured markdown content with document hierarchy that we can use for text extraction and chunking.


In [None]:
import requests
from contextual import ContextualAI
from time import sleep
import os

# Setup Contextual AI client
client = ContextualAI(api_key=contextual_api_key)

# Create directory for downloaded PDFs
os.makedirs("pdfs", exist_ok=True)

# Download PDFs and submit parse jobs
job_data = []

for i, doc in enumerate(documents):
    print(f"Downloading and submitting parse job for: {doc['title']}")
    print(f"Type: {doc['type']} - {doc['description']}")
    
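    # Download PDF (fail fast on HTTP errors instead of saving an error page as a PDF)
    file_path = f"pdfs/{doc['type']}_{i}.pdf"
    download = requests.get(doc['url'], timeout=60)
    download.raise_for_status()
    with open(file_path, "wb") as f:
        f.write(download.content)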
    
    # Configure parsing parameters based on document type
    if doc['type'] == "research_paper":
        # For research papers, focus on hierarchy and figures
        parse_config = {
            "parse_mode": "standard",
            "figure_caption_mode": "concise",
            "enable_document_hierarchy": True,
            "page_range": "0-5"  # Parse first 6 pages
        }
    else:  # table_rich_document
        # For table-rich documents, enable table splitting
        parse_config = {
            "parse_mode": "standard",
            "enable_split_tables": True,
            "max_split_table_cells": 100,
        }
    
    # Submit parse job
    with open(file_path, "rb") as fp:
        response = client.parse.create(
            raw_file=fp,
            **parse_config
        )
    
    job_data.append({
        "job_id": response.job_id,
        "file_path": file_path,
        "document": doc
    })
    print(f"Submitted job {response.job_id} for {doc['title']}")

print(f"\nSubmitted {len(job_data)} parse jobs")


### Monitor parse job status and retrieve results

We'll monitor all parse jobs and retrieve the results once they're completed. Contextual AI provides structured markdown with document hierarchy information.


In [None]:
# Monitor all parse jobs
completed_jobs = set()

while len(completed_jobs) < len(job_data):
    for i, job_info in enumerate(job_data):
        job_id = job_info["job_id"]
        if job_id not in completed_jobs:
            status = client.parse.job_status(job_id)
            doc_title = job_info["document"]["title"]
            doc_type = job_info["document"]["type"]
            print(f"Job {i+1}/{len(job_data)} ({doc_title} - {doc_type}): {status.status}")
            
            if status.status == "completed":
                completed_jobs.add(job_id)
            elif status.status == "failed":
                print(f"Job failed for {doc_title}")
                completed_jobs.add(job_id)  # Add to completed to avoid infinite loop
    
    if len(completed_jobs) < len(job_data):
        print("\nWaiting for remaining jobs to complete...")
        sleep(30)

print("\nAll parse jobs completed!")
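
The loop above polls every 30 seconds with no upper bound, so a job that never reaches a terminal state would block the notebook. Below is a minimal sketch of a bounded variant. `wait_for_jobs`, the `get_status` callable, the `"timed_out"` label, and the default timeout are all assumptions made for illustration; in this notebook `get_status` would presumably wrap `client.parse.job_status(job_id).status`.

```python
import time
from typing import Callable, Dict, Iterable

# Terminal states taken from the polling loop above
TERMINAL_STATES = {"completed", "failed"}


def wait_for_jobs(
    job_ids: Iterable[str],
    get_status: Callable[[str], str],  # hypothetical: returns the status string for a job id
    poll_interval: float = 30.0,
    timeout: float = 600.0,  # hypothetical upper bound, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll until every job is terminal or the timeout elapses."""
    final: Dict[str, str] = {}
    pending = list(job_ids)
    deadline = time.monotonic() + timeout

    while pending:
        still_pending = []
        for job_id in pending:
            status = get_status(job_id)
            if status in TERMINAL_STATES:
                final[job_id] = status
            else:
                still_pending.append(job_id)
        pending = still_pending

        if not pending:
            break
        if time.monotonic() >= deadline:
            # Record unfinished jobs explicitly instead of looping forever
            for job_id in pending:
                final[job_id] = "timed_out"
            break
        sleep(poll_interval)

    return final
```

Returning a status per job makes it straightforward to skip failed or timed-out jobs later, for example with `[j for j, s in final.items() if s == "completed"]`.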


## 💚 Part 2: Chroma
### Create and configure a Chroma collection

[Chroma](https://docs.trychroma.com/) is an open-source embedding database that makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs. It provides efficient vector storage and similarity search capabilities.


In [None]:
import chromadb
from chromadb.utils import embedding_functions

# Initialize Chroma client
chroma_client = chromadb.Client()

# Use OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=openai_api_key,
    model_name="text-embedding-3-small"
)

# Create collection (get_or_create_collection keeps this cell safe to re-run)
collection_name = "contextual_ai_rag_collection"
collection = chroma_client.get_or_create_collection(
    name=collection_name,
    embedding_function=openai_ef
)

print(f"Created collection '{collection_name}' with OpenAI embeddings")


### Retrieve and process parsed content

We'll retrieve the parsed results and process them into chunks suitable for vector search. Contextual AI provides excellent document structure preservation, which we'll leverage for better RAG performance.

**Key Feature**: Contextual AI preserves document hierarchy through `parent_ids`, allowing us to maintain section relationships and provide richer context to our RAG system.


In [None]:
# Retrieve results and process into chunks
texts, titles, sources, doc_types = [], [], [], []

for job_info in job_data:
    job_id = job_info["job_id"]
    document = job_info["document"]
    
    if job_id in completed_jobs:
        try:
            print(f"Processing {document['title']} ({document['type']})")
            
            # Get results with blocks-per-page for hierarchical information
            results = client.parse.job_results(
                job_id, 
                output_types=['blocks-per-page']
            )
            
            print(f"  - {len(results.pages)} pages parsed")
            
            # Create hash table for parent content lookup
            hash_table = {}
            for page in results.pages:
                for block in page.blocks:
                    hash_table[block.id] = block.markdown
            
            # Process blocks with hierarchy context
            for page in results.pages:
                for block in page.blocks:
                    # Filter blocks based on document type and content quality
                    if (block.type in ['text', 'heading', 'table'] and 
                        len(block.markdown.strip()) > 30):
                        
                        # Add hierarchy context if available
                        context_text = block.markdown
                        
                        if hasattr(block, 'parent_ids') and block.parent_ids:
                            parent_content = "\n".join([
                                hash_table.get(parent_id, "") 
                                for parent_id in block.parent_ids
                            ])
                            if parent_content.strip():
                                context_text = f"{parent_content}\n\n{block.markdown}"
                        
                        # Add document metadata as context
                        full_text = f"Document: {document['title']}\nType: {document['type']}\n\n{context_text}"
                        
                        texts.append(full_text)
                        titles.append(document['title'])
                        sources.append(f"Page {page.index + 1}")
                        doc_types.append(document['type'])
                        
        except Exception as e:
            print(f"Error processing {document['title']}: {e}")

print(f"\nProcessed {len(texts)} chunks from {len(set(titles))} documents")
print(f"Document types: {', '.join(set(doc_types))}")
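
To make the parent lookup concrete, here is a small sketch that uses plain dataclasses in place of parsed blocks. `Block` and `with_parent_context` are hypothetical names; `Block` mirrors only the fields the cell above reads (`id`, `type`, `markdown`, `parent_ids`) and is not the SDK's class. The sample blocks are made up. Unlike the cell above, this version skips parents with empty markdown instead of joining them as blank lines.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Block:
    """Hypothetical stand-in for a parsed block (not the SDK's type)."""
    id: str
    type: str
    markdown: str
    parent_ids: List[str] = field(default_factory=list)


def with_parent_context(block: Block, by_id: Dict[str, str]) -> str:
    """Prefix a block's markdown with the markdown of its parents, in order."""
    parents = [by_id[pid] for pid in block.parent_ids if by_id.get(pid, "").strip()]
    if not parents:
        return block.markdown
    return "\n".join(parents) + "\n\n" + block.markdown


# Made-up sample blocks: a heading, a sub-heading, and a paragraph under both
blocks = [
    Block("h1", "heading", "# Section"),
    Block("h2", "heading", "## Subsection", parent_ids=["h1"]),
    Block("t1", "text", "Body text.", parent_ids=["h1", "h2"]),
]
by_id = {b.id: b.markdown for b in blocks}

chunk = with_parent_context(blocks[2], by_id)
print(chunk)
```

The paragraph chunk now carries its section headings, so a query that matches the heading wording can still retrieve the body text.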


### Wrangle data into an acceptable format for Chroma

Transform our data from lists to a list of dictionaries for insertion into our Chroma collection.


In [None]:
# Initialize the data object
data = []

# Create a dictionary for each row by iterating through the corresponding lists
for text, title, source, doc_type in zip(texts, titles, sources, doc_types):
    data_point = {
        "text": text,
        "title": title,
        "source": source,
        "document_type": doc_type,
    }
    data.append(data_point)

print(f"Prepared {len(data)} chunks for insertion into Chroma")
print("Chunks by document type:")
for doc_type in set(doc_types):
    count = doc_types.count(doc_type)
    print(f"  - {doc_type}: {count} chunks")


### Insert data into Chroma and generate embeddings

Embeddings will be generated upon insertion to our Chroma collection.


In [None]:
# Insert text chunks and metadata into Chroma collection
collection.add(
    documents=[item["text"] for item in data],
    metadatas=[{
        "title": item["title"],
        "source": item["source"],
        "document_type": item["document_type"]
    } for item in data],
    ids=[f"chunk_{i}" for i in range(len(data))]
)

print("Insert complete.")
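
The cell above sends every chunk in a single `collection.add` call, which is fine for two documents. For larger corpora, one option is to insert in fixed-size batches. The sketch below shows only the batching logic in plain Python; `batched` is a hypothetical helper, the batch size of 100 is an arbitrary assumption, and `sample` is made-up data standing in for the `data` list.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Made-up stand-in for the `data` list built earlier
sample = [{"text": f"chunk text {i}", "title": "Doc"} for i in range(250)]
BATCH_SIZE = 100  # arbitrary value chosen for illustration

for batch_index, batch in enumerate(batched(sample, BATCH_SIZE)):
    offset = batch_index * BATCH_SIZE
    # Keep ids globally unique across batches
    ids = [f"chunk_{offset + i}" for i in range(len(batch))]
    # In the notebook, each batch would be inserted here, for example:
    # collection.add(documents=[b["text"] for b in batch], metadatas=[...], ids=ids)
    print(f"batch {batch_index}: {len(batch)} items, ids {ids[0]}..{ids[-1]}")
```

Computing ids from the global offset keeps them identical to the `chunk_{i}` scheme used in the single-call version.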


### Query the data

Here, we run a simple similarity search that returns the embedded chunks closest to our search query.


In [None]:
# Example 1: Search for transformer-related content
print("=== Searching for Transformer Architecture ===")
results = collection.query(
    query_texts=["transformer architecture attention mechanism"],
    n_results=3,
    include=["documents", "metadatas", "distances"]
)

for i, (doc, metadata, distance) in enumerate(zip(results['documents'][0], results['metadatas'][0], results['distances'][0])):
    print(f"\n--- Result {i+1} ---")
    print(f"Title: {metadata['title']}")
    print(f"Type: {metadata['document_type']}")
    print(f"Source: {metadata['source']}")
    print(f"Distance: {distance:.3f} (lower is closer)")
    print(f"Text preview: {doc[:200]}...")

print("\n" + "="*50)

# Example 2: Search for table-related content
print("\n=== Searching for Table/Data Content ===")
results = collection.query(
    query_texts=["dataset table benchmark performance metrics"],
    n_results=3,
    include=["documents", "metadatas", "distances"]
)

for i, (doc, metadata, distance) in enumerate(zip(results['documents'][0], results['metadatas'][0], results['distances'][0])):
    print(f"\n--- Result {i+1} ---")
    print(f"Title: {metadata['title']}")
    print(f"Type: {metadata['document_type']}")
    print(f"Source: {metadata['source']}")
    print(f"Distance: {distance:.3f} (lower is closer)")
    print(f"Text preview: {doc[:200]}...")


### Perform RAG on parsed articles

We'll use OpenAI's GPT model to generate responses based on the retrieved context from Chroma.


In [None]:
from openai import OpenAI
from rich.console import Console
from rich.panel import Panel

# Initialize OpenAI client
openai_client = OpenAI(api_key=openai_api_key)

# Example 1: RAG on Transformer Architecture
print("=== RAG Query: Transformer Architecture ===")
query = "transformer attention mechanism"
prompt = f"Explain how {query} works, using only the retrieved context."

# Retrieve relevant documents
results = collection.query(
    query_texts=[query],
    n_results=4,
    include=["documents", "metadatas"]
)

# Prepare context
context = "\n\n".join(results['documents'][0])

# Generate response
response = openai_client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context. Use only the information from the context."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {prompt}"}
    ],
    temperature=1
)

# Prettify the output using Rich
console = Console()
console.print(Panel(prompt, title="Prompt", border_style="bold red"))
console.print(Panel(response.choices[0].message.content, title="Generated Content", border_style="bold green"))


In [None]:
# Example 2: RAG on Dataset/Benchmark Information
print("\n=== RAG Query: Dataset and Benchmark Information ===")
query = "dataset benchmark performance evaluation"
prompt = f"What information does the retrieved context provide about {query}?"

# Retrieve relevant documents
results = collection.query(
    query_texts=[query],
    n_results=4,
    include=["documents", "metadatas"]
)

# Prepare context
context = "\n\n".join(results['documents'][0])

# Generate response
response = openai_client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context. Use only the information from the context."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {prompt}"}
    ],
    temperature=1
)

# Prettify the output using Rich
console = Console()
console.print(Panel(prompt, title="Prompt", border_style="bold red"))
console.print(Panel(response.choices[0].message.content, title="Generated Content", border_style="bold green"))
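
The two RAG cells above repeat the same retrieve, join, and prompt steps. One way to factor this out is a helper that turns a query result into chat messages and labels each chunk with its title and page, so the model can point back to a source. `build_messages` is a hypothetical helper and `fake_results` is made-up data; the sketch assumes only the result shape used above (`results['documents'][0]` and `results['metadatas'][0]`).

```python
from typing import Any, Dict, List

# Same system prompt as in the RAG cells above
SYSTEM_PROMPT = (
    "You are a helpful assistant that answers questions based on the provided context. "
    "Use only the information from the context."
)


def build_messages(results: Dict[str, Any], question: str) -> List[Dict[str, str]]:
    """Build chat messages from a query result, labeling each chunk with its source."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    labeled = [
        f"[{i}] {meta['title']} ({meta['source']})\n{doc}"
        for i, (doc, meta) in enumerate(zip(docs, metas), start=1)
    ]
    context = "\n\n".join(labeled)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    ]


# Made-up result in the same shape as the query output used above
fake_results = {
    "documents": [["First chunk.", "Second chunk."]],
    "metadatas": [[
        {"title": "Doc A", "source": "Page 1"},
        {"title": "Doc B", "source": "Page 3"},
    ]],
}

messages = build_messages(fake_results, "What do the chunks say?")
print(messages[1]["content"])
```

The returned list can be passed as the `messages` argument of `openai_client.chat.completions.create(...)`, exactly as in the cells above.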


## Summary

This notebook demonstrates a RAG pipeline built with Contextual AI Parser and Chroma, applied to two distinct document types.

### What We Demonstrated:
1. **Research Paper Parsing**: "Attention Is All You Need" with document hierarchy preservation
2. **Table-Rich Document Parsing**: OmniDocBench dataset with advanced table extraction
3. **Multi-modal RAG**: Semantic search across different document types
4. **Contextual Intelligence**: Leveraging document structure for better retrieval

### Contextual AI Parser Advantages:
- **Cloud-based processing**: No local GPU/compute requirements
- **Document hierarchy preservation**: Maintains section relationships and structure
- **Advanced table handling**: Smart table splitting with header propagation
- **Multiple output formats**: `markdown-document`, `markdown-per-page`, and `blocks-per-page` (structured JSON)
- **Production-ready**: Scalable cloud service with enterprise features

### Key Differentiators from Other Parsers:
- **Hierarchical context**: Parent-child relationships preserved in chunks
- **Table intelligence**: Large tables automatically split with context preservation
- **Document type awareness**: Different parsing strategies for different content types
- **Rich metadata**: Document structure information enhances RAG quality

### Chroma Integration Benefits:
- **Multi-modal search**: Query across different document types simultaneously
- **Metadata filtering**: Filter by document type, source, and other attributes
- **Efficient storage**: Optimized vector database for embeddings
- **Scalability**: From local development to cloud production
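
The metadata filtering mentioned above is not exercised in this notebook. As a rough sketch, a filter on the stored metadata keys (`document_type`, `title`, `source`) can be written as a dictionary and passed to the query through Chroma's `where` argument. `build_where` is a hypothetical helper; it assumes Chroma's filter syntax, in which a single condition is a one-key dictionary and multiple conditions are combined under `$and`.

```python
from typing import Any, Dict, Optional


def build_where(**conditions: Optional[str]) -> Optional[Dict[str, Any]]:
    """Build a metadata filter from keyword arguments, ignoring None values."""
    clauses = [{key: value} for key, value in conditions.items() if value is not None]
    if not clauses:
        return None  # no filtering
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


# Restrict results to chunks from the research paper
where = build_where(document_type="research_paper", title=None)
print(where)

# In the notebook, the filter could then be applied to a query, for example:
# collection.query(query_texts=["attention mechanism"], n_results=3, where=where)
```

Filtering by `document_type` would keep, for instance, table chunks from the dataset documentation out of answers about the research paper.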

### Next Steps for Enhancement:
* Implement document-level metadata for better source attribution
* Add hybrid search combining keyword and semantic search
* Experiment with different chunking strategies for each document type
* End-to-end RAG agents via [Contextual AI](https://docs.contextual.ai/user-guides/beginner-guide)
* Get more information about integrating [Chroma](https://docs.trychroma.com/docs/overview/introduction)

---

**Ready to get started?** This notebook provides a complete, end-to-end example of integrating Contextual AI Parser with Chroma for RAG applications. Contextual AI's structure-aware parsing combined with Chroma's vector search gives a solid foundation for document-based AI applications.
