# Refactored RAG with Chroma (ready-to-run notebook)

This notebook is a refactored, modular version of the original notebook:
https://github.com/RobelDawit/RAG-applications/blob/a32a108d87327a56e778e76adcee8dfa1efaedcd/RAG_Vector_store_chroma.ipynb

It provides reusable functions to:
- set the OPENAI_API_KEY interactively,
- load and split a PDF,
- create OpenAI embeddings,
- build and persist a Chroma vector store,
- load the persisted store and run retrieval QA queries.

Run the cells below in order. If you're using Google Colab, the optional drive mount cell will help persist the Chroma DB to your Google Drive.

## Install dependencies

If running in Colab, run the cell below. If running locally, install the packages in your environment.

In [None]:
!pip install -qU langchain langchain-community langchain-openai openai pypdf tiktoken chromadb
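
Because LangChain package layouts change between releases, it can help to record which versions were actually installed before running the rest of the notebook. The following is a minimal sketch using only the standard library; `PACKAGES` and `report_versions` are illustrative names, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the install cell above
PACKAGES = ["langchain", "langchain-community", "langchain-openai",
            "chromadb", "pypdf", "tiktoken"]

def report_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in report_versions(PACKAGES).items():
    print(f"{name:22s} {ver or 'NOT INSTALLED'}")
```

If an import in the next cell fails, the printed versions are a useful starting point for checking which module path your release expects.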

## Imports and logging

In [None]:
import os
import textwrap
import logging
from typing import List, Optional

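# Loaders and vector stores live in langchain-community (installed above);
# the old langchain.document_loaders / langchain.vectorstores paths are deprecated.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.vectorstores import Chroma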
from langchain.chains import RetrievalQA

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


## Helper functions

The functions below build, load, and query the vector store.

In [None]:
def set_env_from_prompt(var_name: str, force: bool = False) -> None:
    """
    Prompt user to enter a value for an environment variable if it's not already set.
    Useful in interactive runs (like Colab).
    """
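    if force or not os.environ.get(var_name):
        import getpass
        try:
            value = getpass.getpass(f"{var_name}: ")
        except Exception:
            # Fall back to a visible prompt if hidden input is unavailable
            value = input(f"{var_name}: ")
        # Do not overwrite the variable with an empty string
        if value.strip():
            os.environ[var_name] = value.strip()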

def load_pdf_documents(pdf_path: str) -> List:
    """
    Load a PDF into LangChain Document objects using PyPDFLoader.
    """
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    logger.info("Loading PDF: %s", pdf_path)
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()
    logger.info("Loaded %d raw documents/pages", len(docs))
    return docs

def split_documents(documents: List, chunk_size: int = 500, chunk_overlap: int = 50) -> List:
    """
    Split documents into smaller chunks for embedding.
    """
    logger.info("Splitting documents with chunk_size=%d overlap=%d", chunk_size, chunk_overlap)
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    split_docs = splitter.split_documents(documents)
    logger.info("Split into %d chunks", len(split_docs))
    return split_docs

def create_embeddings(openai_api_key: Optional[str] = None):
    """
    Create an embeddings object. If openai_api_key provided, set it into env.
    """
    if openai_api_key:
        os.environ["OPENAI_API_KEY"] = openai_api_key
    if not os.environ.get("OPENAI_API_KEY"):
        raise EnvironmentError("OPENAI_API_KEY is not set. Call set_env_from_prompt or set the env var.")
    logger.info("Creating OpenAI embeddings client")
    return OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))

def build_chroma_vector_store(documents, embeddings, persist_directory: str):
    """
    Build and persist a Chroma vector store from documents + embeddings.
    """
    logger.info("Creating Chroma vector store at %s", persist_directory)
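    # With chromadb >= 0.4, data is written to persist_directory automatically;
    # no explicit persist() call is needed.
    vectordb = Chroma.from_documents(documents, embedding=embeddings, persist_directory=persist_directory)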
    logger.info("Persisted vector store to %s", persist_directory)
    return vectordb

def load_chroma_vector_store(embeddings, persist_directory: str):
    """
    Load an existing Chroma vector store (persisted).
    """
    if not os.path.exists(persist_directory):
        raise FileNotFoundError(f"Persist directory not found: {persist_directory}")
    logger.info("Loading Chroma vector store from %s", persist_directory)
    return Chroma(persist_directory=persist_directory, embedding_function=embeddings)

def run_query_on_store(vector_store, query: str, k: int = 10) -> str:
    """
    Run a RetrievalQA chain using OpenAI LLM and the provided vector store retriever.
    Returns the answer string.
    """
    retriever = vector_store.as_retriever(search_kwargs={"k": k})
    llm = OpenAI()  # customize model args here if desired
    qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
    logger.info("Running query against vector store (k=%d)", k)
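    # Chain.run() is deprecated; RetrievalQA reads its input from "query"
    # and returns the answer under "result".
    response = qa_chain.invoke({"query": query})
    return response["result"]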
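
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical helper used only for illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list:
    """Cut text into fixed-size windows; each shares chunk_overlap characters with the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each printed chunk begins with the last three characters of the previous one. The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk.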


## (Optional) Mount Google Drive (Colab)

If you are running in Google Colab and want to persist the Chroma DB to Drive, run the cell below and set the paths accordingly.

In [None]:
# Uncomment and run in Colab if needed
# from google.colab import drive
# drive.mount('/content/drive')
#
# # Example paths (adjust to your Drive layout)
# pdf_path = '/content/drive/MyDrive/Books/Aircraft Structures By Megson ( PDFDrive ).pdf'
# persist_dir = '/content/drive/MyDrive/chroma_db'
 
# Or for local runs, set paths directly:
pdf_path = '/path/to/your/Aircraft Structures By Megson ( PDFDrive ).pdf'  # <- change this
persist_dir = './chroma_db'  # <- change this if you want a different local path
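
Since the paths differ between Colab and local runs, a quick check before building can catch a wrong path early. This is a rough sketch: `in_colab` and `check_paths` are hypothetical helpers, and detecting Colab by looking for `google.colab` in `sys.modules` is an assumption based on a common idiom, not something the original notebook does.

```python
import sys
import tempfile
from pathlib import Path

def in_colab() -> bool:
    """Heuristic: the google.colab module is already loaded inside a Colab runtime."""
    return "google.colab" in sys.modules

def check_paths(pdf_path: str, persist_dir: str) -> Path:
    """Verify the PDF exists and create the persist directory if needed."""
    pdf = Path(pdf_path).expanduser()
    if not pdf.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf}")
    out = Path(persist_dir).expanduser()
    out.mkdir(parents=True, exist_ok=True)
    return out

# Demo with a throwaway placeholder file (not a real PDF)
with tempfile.TemporaryDirectory() as tmp:
    demo_pdf = Path(tmp) / "demo.pdf"
    demo_pdf.write_bytes(b"%PDF-1.4\n")
    print("Colab runtime:", in_colab())
    print("Persist dir ready:", check_paths(str(demo_pdf), str(Path(tmp) / "chroma_db")).is_dir())
```

In the notebook you would call `check_paths(pdf_path, persist_dir)` with the values set in the cell above.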


## Build / Create vector store

Run the cell below to build and persist the Chroma vector store. This may take some time depending on the PDF size and network latency to the embeddings API.

In [None]:
# Ensure OPENAI_API_KEY exists; will prompt if not set
try:
    set_env_from_prompt('OPENAI_API_KEY')  # no-op if the variable is already set
    emb = create_embeddings()
    # Load and split
    docs = load_pdf_documents(pdf_path)
    split_docs = split_documents(docs, chunk_size=500, chunk_overlap=50)
    # Build and persist
    vectordb = build_chroma_vector_store(split_docs, emb, persist_directory=persist_dir)
    logger.info('Vector store built and persisted at %s', persist_dir)
except Exception as e:
    logger.exception('Error while building vector store: %s', e)
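
Re-running the build cell embeds the whole PDF again, which costs API calls and may add duplicate entries to an existing store. One way to avoid that is to build only when the persist directory is missing or empty. The sketch below is generic and takes the build and load steps as callables; `get_or_build` is a hypothetical name, not part of the original notebook.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")

def get_or_build(persist_dir: str, build: Callable[[], T], load: Callable[[], T]) -> T:
    """Call load() if persist_dir already holds files, otherwise call build()."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load()
    return build()

# Demo with stand-in callables; the temporary directory is empty, so build() runs
with tempfile.TemporaryDirectory() as tmp:
    print(get_or_build(tmp, build=lambda: "built", load=lambda: "loaded"))

# Usage with this notebook's helpers (assumes emb, pdf_path, persist_dir are defined):
# vectordb = get_or_build(
#     persist_dir,
#     build=lambda: build_chroma_vector_store(
#         split_documents(load_pdf_documents(pdf_path)), emb, persist_dir),
#     load=lambda: load_chroma_vector_store(emb, persist_dir),
# )
```

A non-empty directory is only a rough signal that a valid store exists; delete the directory if you change the PDF or the chunking parameters.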


## Query the persisted vector store

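Run this cell to load the persisted Chroma DB and ask a question. Adjust `query` and `k` (the number of retrieved chunks) as needed. Because the chain type is `stuff`, all `k` chunks are placed into a single prompt, so a larger `k` means a longer prompt.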
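
To illustrate what "retrieving `k` chunks" means, here is a toy top-k search over hand-written two-dimensional vectors. It is only a conceptual sketch: Chroma uses its own index and distance metric, and real embeddings have many more dimensions. The names `cosine` and `top_k` are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to query_vec."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Four made-up "chunk embeddings" and one made-up "query embedding"
chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.05], chunk_vecs, k=2))  # prints [0, 1]
```

Raising `k` returns more chunks, but the extra ones are progressively less similar to the query.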

In [None]:
try:
    # Ensure embeddings client is available (it uses OPENAI_API_KEY)
    emb = create_embeddings()
    vectordb = load_chroma_vector_store(emb, persist_directory=persist_dir)
    query = "Give me a high-level overview of the most important structures (spars, stringers, frames, bulkheads) on a loaded aircraft in flight and a brief overview of what they do."
    answer = run_query_on_store(vectordb, query, k=10)
    print('\nAnswer:\n')
    print(textwrap.fill(answer, width=100))
except Exception as e:
    logger.exception('Error while querying vector store: %s', e)
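
When an answer looks wrong, it often helps to inspect which chunks were retrieved before blaming the LLM. The sketch below relies on the `similarity_search(query, k=...)` method that LangChain vector stores expose; `preview_sources` is a hypothetical helper, and it assumes the `page` metadata key that `PyPDFLoader` attaches to each document.

```python
def preview_sources(vector_store, query: str, k: int = 3, width: int = 80) -> list:
    """Return (page, snippet) pairs for the k chunks most similar to query."""
    results = vector_store.similarity_search(query, k=k)
    previews = []
    for doc in results:
        page = doc.metadata.get("page")  # zero-based page index from PyPDFLoader, if present
        snippet = " ".join(doc.page_content.split())[:width]  # collapse whitespace, then truncate
        previews.append((page, snippet))
    return previews

# Usage in this notebook (assumes vectordb and query from the cell above):
# for page, snippet in preview_sources(vectordb, query, k=3):
#     print(f"page {page}: {snippet}")
```

If the snippets are unrelated to the question, adjusting the chunk size or rephrasing the query is usually more productive than raising `k`.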


## Notes & next steps
- To run from the command line, copy the helper functions above into a `.py` file and call them from a small script.
- For large PDFs, consider batching embeddings, adding progress reporting, or using a remote vector DB (Pinecone, Weaviate) for scale.
- If your LangChain or provider package names differ (versions change often), update the imports accordingly (e.g., `langchain.embeddings` vs `langchain_openai`).
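
The batching and progress-reporting idea from the notes above could look roughly like this. It is a generic sketch: `batched` and `add_in_batches` are hypothetical helpers, and `add_batch` stands in for whatever call writes a batch to your store (with a LangChain vector store, `vectordb.add_documents` is one candidate).

```python
from typing import Callable, Iterator, Sequence

def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def add_in_batches(items: Sequence, add_batch: Callable[[Sequence], object],
                   batch_size: int = 100) -> int:
    """Pass items to add_batch one slice at a time, printing progress; return the batch count."""
    done = 0
    n_batches = 0
    for batch in batched(items, batch_size):
        add_batch(batch)
        done += len(batch)
        n_batches += 1
        print(f"indexed {done}/{len(items)} chunks")
    return n_batches

# Demo with a stand-in that does nothing; 250 items in batches of 100 gives 3 batches
add_in_batches(list(range(250)), add_batch=lambda batch: None, batch_size=100)
```

Smaller batches also make it easier to resume after a failed API call, since only the current batch needs to be retried.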

Possible extensions:
- split by headings using a different text splitter,
- convert this notebook into a GitHub Actions workflow that indexes PDFs automatically.