# ChromaDB Tutorial: Storing & Retrieving Vectorized Data

This notebook demonstrates how to use ChromaDB for vector storage and retrieval with support for both OpenAI API and local Ollama models.

## Overview
- Load and process PDF documents
- Create embeddings using OpenAI or local models (Ollama)
- Store embeddings in ChromaDB
- Perform semantic search
- Implement hybrid retrieval (BM25 + semantic)
- Re-rank results using an LLM

## 1. Installation and Imports

First, install required packages:
```bash
pip install chromadb langchain langchain-community langchain-openai pymupdf rank-bm25 sentence-transformers tenacity openai ollama
```

Import all necessary libraries for document processing, embeddings, and vector storage.

In [None]:
import os
import shutil
import random
import warnings
from typing import List

# Document processing
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
from langchain_community.retrievers import BM25Retriever

# Embeddings - OpenAI
from langchain_openai import OpenAIEmbeddings

# Embeddings - Local (sentence-transformers)
from langchain_community.embeddings import HuggingFaceEmbeddings

# LLM clients
from openai import OpenAI
import ollama

# Utilities
from tenacity import retry, stop_after_attempt, wait_random_exponential

warnings.filterwarnings("ignore")

## 2. Configuration

Set up configuration for either OpenAI or Ollama:

**For OpenAI:**
- Set the `OPENAI_API_KEY` environment variable
- Uses OpenAI's `text-embedding-3-small` for embeddings
- Uses GPT-4 for re-ranking and answer generation

**For Ollama (Local):**
- Requires Ollama to be installed and running locally
- Uses sentence-transformers for embeddings (e.g., `all-MiniLM-L6-v2`)
- Uses any Ollama model for chat (e.g., `llama2`, `mistral`)

Choose your preferred model provider by setting `USE_OLLAMA`.

In [None]:
# ============================================
# CONFIGURATION: Choose your model provider
# ============================================

# Set to True for local Ollama, False for OpenAI API
USE_OLLAMA = False  # Change to True to use Ollama

if USE_OLLAMA:
    # Ollama Configuration (Local)
    OLLAMA_MODEL = "llama2"  # or "mistral", "llama3", etc.
    EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # Local sentence-transformer model
    print(f"Using Ollama with model: {OLLAMA_MODEL}")
    print(f"Using local embeddings: {EMBEDDING_MODEL}")
else:
    # OpenAI Configuration
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
    if not OPENAI_API_KEY:
        raise ValueError("OPENAI_API_KEY environment variable not set!")
    
    OPENAI_CHAT_MODEL = "gpt-4"  # or "gpt-3.5-turbo"
    OPENAI_EMBEDDING_MODEL = "text-embedding-3-small"
    print(f"Using OpenAI with chat model: {OPENAI_CHAT_MODEL}")
    print(f"Using OpenAI embeddings: {OPENAI_EMBEDDING_MODEL}")

## 3. Initialize Clients and Embeddings

Create the appropriate clients based on the selected configuration:
- **Embeddings client**: Either OpenAI or local HuggingFace model
- **Chat client**: Either OpenAI or Ollama

In [None]:
# Initialize embeddings
if USE_OLLAMA:
    # Use local sentence-transformers model
    embeddings = HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL,
        model_kwargs={'device': 'cpu'},  # Use 'cuda' if GPU available
        encode_kwargs={'normalize_embeddings': True}
    )
    chat_client = None  # Ollama doesn't need a client object
else:
    # Use OpenAI embeddings
    embeddings = OpenAIEmbeddings(
        model=OPENAI_EMBEDDING_MODEL,
        openai_api_key=OPENAI_API_KEY
    )
    chat_client = OpenAI(api_key=OPENAI_API_KEY)

print("✓ Embeddings and chat clients initialized")

## 4. Helper Function: Get LLM Response

This function sends a prompt to the LLM and returns the response. It automatically uses either OpenAI or Ollama based on the configuration.

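The `@retry` decorator makes the call more robust: a failed API call is retried for up to six attempts, with a randomized, exponentially growing wait of 45 to 120 seconds between attempts.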
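
To make the retry behaviour concrete, here is a minimal standard-library sketch of random exponential backoff. It is a simplified stand-in, not tenacity's implementation: `retry_with_backoff`, `base_wait`, and `flaky_call` are hypothetical names, and the waits are shortened to milliseconds so the example finishes instantly.

```python
import random
import time
from functools import wraps


def retry_with_backoff(max_attempts=6, base_wait=0.001, max_wait=0.01):
    """Hypothetical stand-in for a retry decorator with random exponential backoff."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # attempts exhausted: surface the last error
                    # The ceiling doubles after every failure, capped at max_wait
                    ceiling = min(max_wait, base_wait * 2 ** (attempt - 1))
                    time.sleep(random.uniform(0, ceiling))
        return wrapper
    return decorator


calls = {"count": 0}


@retry_with_backoff(max_attempts=6)
def flaky_call():
    """Simulated API call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = flaky_call()
print(result, "after", calls["count"], "attempts")
```

The randomized wait spreads retries out so that several clients hitting the same rate limit do not all retry at the same moment.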

In [None]:
@retry(wait=wait_random_exponential(min=45, max=120), stop=stop_after_attempt(6))
def get_response(prompt: str) -> str:
    """
    Send a prompt to the LLM and get a response.
    
    Args:
        prompt: The user's question or instruction
        
    Returns:
        The model's response as a string
    """
    if USE_OLLAMA:
        # Use Ollama
        response = ollama.chat(
            model=OLLAMA_MODEL,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )
        return response['message']['content']
    else:
        # Use OpenAI
        response = chat_client.chat.completions.create(
            model=OPENAI_CHAT_MODEL,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )
        return response.choices[0].message.content

## 5. Load PDF Document

Use LangChain's `PyMuPDFLoader` to load a PDF file and extract its text. The loader returns one `Document` per page and handles:
- Text extraction from each page
- Metadata preservation (source path, page numbers, etc.)

**Note:** Update the `pdf_path` variable with your actual PDF file location.

In [None]:
def load_pdf_with_langchain(pdf_path: str) -> List[Document]:
    """
    Load a PDF file and return LangChain Document objects.
    
    Args:
        pdf_path: Path to the PDF file
        
    Returns:
        List of Document objects with page content and metadata
    """
    loader = PyMuPDFLoader(pdf_path)
    documents = loader.load()
    
    print(f"Successfully loaded {len(documents)} pages from the PDF.")
    return documents

In [None]:
# Path to your PDF file (update this with your actual file path)
pdf_path = "./Data/Healthcare doc for RAG.pdf"

# Load the document
docs = load_pdf_with_langchain(pdf_path)

## 6. Split Text into Chunks

For effective vector search, we split the document into smaller chunks:

- **chunk_size=600**: Each chunk contains at most 600 characters
- **chunk_overlap=100**: Adjacent chunks share up to 100 characters to maintain context

The `RecursiveCharacterTextSplitter` tries to split at natural boundaries first (paragraph breaks, then line breaks, then spaces) and falls back to arbitrary character positions only when a piece is still too long.
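
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window. This is a deliberately simplified sketch: unlike `RecursiveCharacterTextSplitter`, it cuts at fixed character positions and ignores natural boundaries. `chunk_with_overlap` is a hypothetical helper, not part of LangChain.

```python
def chunk_with_overlap(text, chunk_size=600, chunk_overlap=100):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 1,500 characters of dummy text: the digits 0-9 repeated
sample = "".join(str(i % 10) for i in range(1500))
chunks = chunk_with_overlap(sample)

print([len(c) for c in chunks])
print(chunks[0][-100:] == chunks[1][:100])  # the shared overlap region
```

Each window advances by `chunk_size - chunk_overlap` characters, which is why a larger overlap produces more chunks for the same document.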

In [None]:
# Create text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=100
)

# Split documents into chunks
texts = text_splitter.split_documents(docs)

print(f"Split into {len(texts)} text chunks")
print(f"\nExample chunk:\n{texts[0].page_content[:200]}...")

## 7. Create and Persist ChromaDB Vector Store

Now we'll create a ChromaDB vector store to store our document embeddings:

1. **Embedding Generation**: Convert each text chunk into a vector embedding
2. **Storage**: Store embeddings and metadata in ChromaDB
3. **Persistence**: Save to disk for future use

ChromaDB is used because:
- Lightweight and runs locally
- Fast similarity search
- Easy to persist and reload
- No separate database server needed when run in this embedded mode

**Note:** We disable telemetry for privacy reasons.

In [None]:
# Set up tiktoken cache directory (for OpenAI)
tiktoken_cache_dir = os.path.abspath("./.setup/tiktoken_cache/")
os.environ["TIKTOKEN_CACHE_DIR"] = tiktoken_cache_dir

# Disable ChromaDB telemetry for privacy
os.environ["ANONYMIZED_TELEMETRY"] = "False"

# Define persist directory
persist_directory = './vector_embeddings_OPENAI' if not USE_OLLAMA else './vector_embeddings_LOCAL'

# Create ChromaDB vector store
try:
    # Clean up existing database if needed
    if os.path.exists(persist_directory):
        shutil.rmtree(persist_directory)
    
    # Create new vector store
    vectordb = Chroma.from_documents(
        documents=texts,
        embedding=embeddings,
        persist_directory=persist_directory
    )
    
    # Persist to disk
    vectordb.persist()
    
    print(f"✓ Embeddings stored in ChromaDB at: {persist_directory}")
    print(f"✓ Vector store contains {vectordb._collection.count()} embeddings")
    
except Exception as e:
    print(f"Error: {e}")
    print("Creating a new ChromaDB database.")
    
    # Generate random directory name
    persist_directory = f"{persist_directory}_{(''.join(str(random.randint(0, 9)) for _ in range(4)))}"
    
    if os.path.exists(persist_directory):
        shutil.rmtree(persist_directory)
    
    vectordb = Chroma.from_documents(
        documents=texts,
        embedding=embeddings,
        persist_directory=persist_directory
    )
    vectordb.persist()
    
    print(f"✓ New database created at: {persist_directory}")

## 8. Semantic Retrieval Function

This function performs semantic search to find the most relevant document chunks:

1. **Query Embedding**: Convert the search query into a vector
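2. **Similarity Search**: Find the chunks whose embeddings are closest to the query vector (Chroma's default metric is squared L2 distance, which ranks unit-normalized embeddings in the same order as cosine similarity)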
3. **Deduplication**: Remove duplicate results
4. **Top-K Selection**: Return only the most relevant results

The function fetches more results than needed (`k=top_k*2`) to ensure we have enough unique results after deduplication.
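
The four steps above can be sketched without any vector database. The example below is an illustration under stated assumptions: the three-dimensional vectors and chunk texts are made-up toy data, and `cosine_similarity` and `top_k_unique` are hypothetical helpers rather than ChromaDB calls.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_unique(query_vec, items, top_k=3):
    """Rank (text, vector) pairs by similarity, then drop duplicate texts."""
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    seen, results = set(), []
    for text, _ in ranked[: top_k * 2]:  # over-fetch, as the function above does
        if text not in seen:
            seen.add(text)
            results.append(text)
        if len(results) >= top_k:
            break
    return results


# Toy store: the first chunk appears twice, as can happen after re-ingestion
toy_store = [
    ("ICE connects medical devices", [0.9, 0.1, 0.0]),
    ("ICE connects medical devices", [0.9, 0.1, 0.0]),
    ("Billing codes overview", [0.0, 0.2, 0.9]),
    ("MIoT sensors stream vitals", [0.7, 0.6, 0.1]),
]

top = top_k_unique([1.0, 0.2, 0.0], toy_store, top_k=2)
print(top)
```

Without the over-fetch, the duplicated chunk would occupy both result slots and the second distinct chunk would be lost.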

In [None]:
def semantic_retrieval(query: str, top_k: int = 3) -> List[Document]:
    """
    Retrieve top_k semantically relevant documents from ChromaDB using vector search.
    
    Args:
        query: The search query
        top_k: Number of documents to return
        
    Returns:
        List of the most relevant Document objects
    """
    # Fetch more results to ensure enough unique results after deduplication
    results = vectordb.similarity_search(query, k=top_k*2)
    
    # Deduplicate based on page content
    unique_results = []
    seen_contents = set()
    
    for doc in results:
        if doc.page_content not in seen_contents:
            unique_results.append(doc)
            seen_contents.add(doc.page_content)
        
        if len(unique_results) >= top_k:
            break
    
    return unique_results

In [None]:
# Test semantic retrieval
query = "How does the Integrated Clinical Environment (ICE) platform support MIoT implementation in healthcare settings?"
semantic_results = semantic_retrieval(query)

print(f"Found {len(semantic_results)} relevant chunks\n")
for i, doc in enumerate(semantic_results, 1):
    print(f"Semantic Result {i}:")
    print(f"{doc.page_content[:200]}...\n")

## 9. Hybrid Retrieval: BM25 + Semantic Search

Hybrid retrieval combines two complementary approaches:

**1. Semantic Search (Vector-based)**
   - Understands meaning and context
   - Good for conceptual queries
   - Uses embeddings and vector similarity

**2. BM25 (Keyword-based)**
   - Exact keyword matching
   - Good for specific terms
   - Uses term frequency and document frequency

The function:
1. Gets semantic search results (first half of top_k)
2. Gets BM25 keyword results
3. Combines unique results from both
4. Fills remaining spots with semantic results if needed
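
To show what "term frequency and document frequency" means in practice, here is a small sketch of a textbook Okapi BM25 scorer. It is an assumption-laden illustration: it uses the non-negative IDF variant `ln(1 + (N - n + 0.5) / (n + 0.5))` and whitespace tokenization, so its scores will not match the `rank-bm25` package exactly. `bm25_scores` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in the corpus against the query tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a higher weight than common ones
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


corpus = [
    "ice platform connects medical devices".split(),
    "hospital billing and insurance codes".split(),
    "miot sensors stream patient vitals".split(),
]
scores = bm25_scores("ice platform".split(), corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

A document that shares no terms with the query scores exactly zero, which is why BM25 complements semantic search: it rewards exact terms such as acronyms but cannot match paraphrases.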

In [None]:
def hybrid_retrieval_simple(query: str, top_k: int = 3) -> List[Document]:
    """
    Combine semantic and keyword search results for diverse retrieval.
    
    This approach ensures we get results from both semantic similarity
    and exact keyword matching, providing more comprehensive coverage.
    
    Args:
        query: The search query
        top_k: Total number of documents to return
        
    Returns:
        Combined list of unique, relevant documents
    """
    # Get semantic search results
    semantic_results = vectordb.similarity_search(query, k=top_k*2)
    semantic_contents = [doc.page_content for doc in semantic_results]
    
    # Get keyword search results using BM25 over every chunk stored in ChromaDB
    stored = vectordb.get()
    documents = [Document(page_content=text, metadata=meta or {})
                 for text, meta in zip(stored["documents"], stored["metadatas"])]
    # k is a retriever setting (default 4), so set it when building the retriever
    bm25_retriever = BM25Retriever.from_documents(documents, k=top_k*2)
    keyword_results = bm25_retriever.invoke(query)
    
    # Take half from semantic results
    final_results = semantic_results[:top_k//2]
    
    # Add unique keyword results
    for doc in keyword_results:
        if len(final_results) >= top_k:
            break
        if doc.page_content not in semantic_contents:
            final_results.append(doc)
    
    # Fill remaining spots with the next-best semantic results not already selected
    selected = {doc.page_content for doc in final_results}
    for doc in semantic_results[top_k//2:]:
        if len(final_results) >= top_k:
            break
        if doc.page_content not in selected:
            final_results.append(doc)
            selected.add(doc.page_content)
    
    return final_results

In [None]:
# Test hybrid retrieval
hybrid_results = hybrid_retrieval_simple(
    "How does the Integrated Clinical Environment (ICE) platform support MIoT implementation in healthcare settings?"
)

print(f"Found {len(hybrid_results)} relevant chunks using hybrid retrieval\n")
for i, doc in enumerate(hybrid_results, 1):
    print(f"Hybrid Result {i}:")
    print(f"{doc.page_content[:200]}...\n")

## 10. LLM-Based Re-ranking

After initial retrieval, we use the LLM to re-rank the results by how well each chunk actually answers the query:

**Why Re-rank?**
- Initial retrieval uses only similarity/keywords
- LLM can understand deeper semantic relevance
- Can handle complex multi-aspect queries

**How it works:**
1. Send all retrieved chunks to the LLM
2. Ask LLM to rank them by relevance to the query
3. Parse the rankings from LLM output
4. Return chunks in the new order

This is particularly useful for complex queries where simple similarity isn't enough.
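
The prompt-and-parse round trip can be tried offline with a stubbed model. In this sketch, `fake_llm` is a hypothetical stand-in that returns a deliberately messy reply, and `build_rank_prompt` and `parse_ranking` are illustrative helpers. The parser assumes the model mentions chunks as "Chunk N" and ignores numbers that are out of range or repeated.

```python
import re


def build_rank_prompt(query, chunks, top_k):
    """Number the chunks and ask for a comma-separated ranking."""
    lines = [f"Question: {query}", "", "Here are the chunks:", ""]
    for i, chunk in enumerate(chunks, 1):
        lines.append(f"Chunk {i}:\n{chunk}\n")
    lines.append(
        f"Please rank the top {top_k} chunks in order of relevance. "
        "Respond only like this:\nChunk 3, Chunk 1, Chunk 5"
    )
    return "\n".join(lines)


def parse_ranking(llm_output, n_chunks, top_k):
    """Extract 0-based chunk indices, skipping invalid or repeated numbers."""
    order = []
    for number in re.findall(r"Chunk\s*(\d+)", llm_output):
        idx = int(number) - 1
        if 0 <= idx < n_chunks and idx not in order:
            order.append(idx)
    return order[:top_k]


def fake_llm(prompt):
    """Stub reply with extra text, a repeat, and a chunk number that does not exist."""
    return "Sure! Chunk 2, Chunk 3, Chunk 9, Chunk 2."


chunks = ["billing codes", "ICE platform overview", "MIoT device safety"]
prompt = build_rank_prompt("How does ICE support MIoT?", chunks, top_k=2)
order = parse_ranking(fake_llm(prompt), len(chunks), top_k=2)
reranked = [chunks[i] for i in order]
print(reranked)
```

Validating the indices matters because models do not always follow the requested format, and a single stray number should not crash the pipeline.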

In [None]:
def llm_rerank_with_openai(query: str, retrieved_docs: List[Document], top_k: int = 3) -> List[Document]:
    """
    Use the LLM to re-rank document chunks based on relevance to the query.
    
    Args:
        query: The user's input question
        retrieved_docs: List of LangChain Document objects retrieved from ChromaDB
        top_k: Number of top chunks to return after re-ranking
        
    Returns:
        Sorted list of the most relevant chunks, based on LLM scoring
    """
    # Step 1: Prepare the ranking prompt
    prompt = f"""You are helping rank document chunks based on how well they answer this question:\n\nQuestion: {query}\n\n"""
    prompt += "Here are the chunks:\n\n"
    
    for i, doc in enumerate(retrieved_docs):
        prompt += f"Chunk {i+1}:\n{doc.page_content.strip()}\n\n"
    
    prompt += f"Please rank the top {top_k} chunks in order of relevance. Respond only like this:\nChunk 3, Chunk 1, Chunk 5"
    
    # Step 2: Call LLM for re-ranking
    llm_output = get_response(prompt)
    print(f"LLM Rerank Output:\n{llm_output}")
    
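    # Step 3: Extract chunk numbers from the output (tolerates extra text and punctuation)
    import re  # local import keeps this cell self-contained
    chunk_order = []
    for number in re.findall(r"Chunk\s*(\d+)", llm_output):
        idx = int(number) - 1  # Convert to 0-based index
        if 0 <= idx < len(retrieved_docs) and idx not in chunk_order:
            chunk_order.append(idx)

    # Step 4: Return the re-ranked chunks, falling back to the original order if parsing failed
    if not chunk_order:
        return retrieved_docs[:top_k]

    return [retrieved_docs[i] for i in chunk_order[:top_k]]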

## 11. Run Complete Retrieval Pipeline

Now we'll execute the full retrieval pipeline:

1. **Hybrid Retrieval**: Get initial candidates using both semantic and keyword search
2. **LLM Re-ranking**: Re-order results based on deep semantic understanding
3. **Display Results**: Show the final ranked results

This combines keyword matching, semantic similarity, and LLM judgment in a single pipeline.

In [None]:
# Define the query
query = "How does the Integrated Clinical Environment (ICE) platform support MIoT implementation in healthcare settings?"

# Step 1: Hybrid retrieval
print("Step 1: Performing hybrid retrieval...\n")
hybrid_results = hybrid_retrieval_simple(query, top_k=5)

print(f"Retrieved {len(hybrid_results)} initial results\n")
print("="*80)

# Step 2: LLM re-ranking
print("\nStep 2: Re-ranking with LLM...\n")
reranked_results = llm_rerank_with_openai(query, hybrid_results, top_k=3)

print("\n" + "="*80)
print("\nFinal Reranked Results:\n")

# Display the reranked chunks
for i, doc in enumerate(reranked_results, 1):
    print(f"\nReranked Chunk {i}:")
    print(f"{doc.page_content[:300]}...")  # Show first 300 characters
    print("-"*80)

## 12. Generate Final Answer with Context

Finally, we can use the retrieved and re-ranked chunks to generate a comprehensive answer to the user's question.

The LLM receives:
- The original query
- The most relevant document chunks as context
- Instructions to answer based on the provided context
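
One possible way to assemble such a prompt is sketched below, with each chunk numbered so the model can point back to its sources. Everything here is illustrative: `build_rag_prompt` is a hypothetical helper, the `(text, page)` pairs stand in for `Document` objects, and the sample sentences are placeholder text rather than content from the PDF.

```python
def build_rag_prompt(query, chunks):
    """Assemble a grounded prompt; `chunks` is a list of (text, page) pairs."""
    context = "\n\n".join(
        f"[{i}] (page {page}) {text.strip()}"
        for i, (text, page) in enumerate(chunks, 1)
    )
    return (
        "Based on the following context, answer the question. "
        "Cite the bracketed chunk numbers you relied on.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


# Placeholder chunks, ordered as the re-ranker returned them
sample_chunks = [
    ("ICE provides a common platform for device interoperability.", 4),
    ("MIoT devices stream patient vitals to clinical apps.", 7),
]
prompt = build_rag_prompt("How does ICE support MIoT?", sample_chunks)
print(prompt)
```

Keeping the chunks in re-ranked order places the most relevant passage first in the context.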

In [None]:
# Prepare context from reranked results
context = "\n\n".join([doc.page_content for doc in reranked_results])

# Create final prompt
final_prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {query}

Answer:"""

# Get final answer
print("\n" + "="*80)
print("\nGenerating final answer...\n")
final_answer = get_response(final_prompt)

print("Final Answer:")
print("="*80)
print(final_answer)

## Summary

This notebook demonstrated a complete RAG (Retrieval-Augmented Generation) pipeline:

1. ✅ **Document Processing**: Loaded and chunked PDF documents
2. ✅ **Vector Storage**: Created embeddings and stored in ChromaDB
3. ✅ **Semantic Search**: Retrieved relevant chunks using vector similarity
4. ✅ **Hybrid Retrieval**: Combined semantic and keyword-based search
5. ✅ **LLM Re-ranking**: Used LLM to improve result relevance
6. ✅ **Answer Generation**: Generated final answer using retrieved context

### Key Advantages

- **Flexible**: Supports both cloud (OpenAI) and local (Ollama) models
- **No Vendor Lock-in**: No Azure dependencies
- **Privacy-Friendly**: Can run completely offline with Ollama once the models have been downloaded
- **Cost-Effective**: Choose between paid API or free local models

### Next Steps

- Try different chunk sizes and overlap settings
- Experiment with different embedding models
- Add metadata filtering to ChromaDB queries
- Implement conversation history for multi-turn QA
- Add source citations to generated answers