## Load a PDF into ChromaDB with Page-Level Source Metadata

This cell defines a utility that ingests a PDF document (e.g., the Alphabet 10-K), splits each page into **overlapping text chunks** of up to 1,000 characters, and stores them in **ChromaDB** for retrieval-augmented generation (RAG).

### What this does

* Reads the PDF **page by page**
* Splits each page into overlapping text chunks using `RecursiveCharacterTextSplitter`
* Stores each chunk as a separate document in ChromaDB
* Attaches **rich metadata** to every chunk so responses can be precisely cited

### Metadata captured per chunk

Each stored chunk includes:

* **`source`** — original PDF filename
* **`page_number`** — page in the PDF where the text originated
* **`chunk_id`** — chunk index within the page
* **`char_start` / `char_end`** — character offsets within the page text

This enables downstream features such as:

* Citations like *“Source: Alphabet_10K.pdf, page 42”*
* Debugging exactly where retrieved text came from
* More transparent and trustworthy RAG answers
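
As a rough illustration of the citation use case, the sketch below turns one chunk's metadata into a citation string. The helper name `format_citation` and the example values are hypothetical, and the sketch assumes `page_number` holds the page number a reader would see.

```python
def format_citation(metadata: dict) -> str:
    """Build a short citation from one chunk's metadata (hypothetical helper)."""
    source = metadata.get("source", "unknown source")
    page = metadata.get("page_number")
    # Fall back to the filename alone when no page was recorded
    if page is None:
        return f"Source: {source}"
    return f"Source: {source}, page {page}"


# Example metadata in the shape described above (values are made up)
example = {
    "source": "Alphabet_10K.pdf",
    "page_number": 42,
    "chunk_id": 3,
    "char_start": 1800,
    "char_end": 2790,
}
print(format_citation(example))  # prints: Source: Alphabet_10K.pdf, page 42
```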

### Why page-level chunking?

Chunking per page (instead of across the full document) preserves:

* Accurate page references
* Cleaner citations
* Better alignment with how humans reference documents
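
To make the idea concrete, here is a minimal sketch of per-page chunking on toy strings. It uses a fixed sliding window as a simplified stand-in for `RecursiveCharacterTextSplitter`, which also looks for paragraph and sentence breaks. The function name `chunk_pages` and the small sizes are assumptions for illustration only.

```python
def chunk_pages(pages: list[str], chunk_size: int = 40, overlap: int = 10) -> list[dict]:
    """Split each page independently with a fixed sliding window.

    Simplified stand-in for a real text splitter: it cuts at fixed character
    positions and does not look for paragraph or sentence boundaries.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    records = []
    for page_number, page_text in enumerate(pages, start=1):
        # Offsets restart at 0 on every page, so no chunk spans two pages
        for chunk_id, start in enumerate(range(0, len(page_text), step), start=1):
            end = min(start + chunk_size, len(page_text))
            records.append({
                "text": page_text[start:end],
                "page_number": page_number,
                "chunk_id": chunk_id,
                "char_start": start,
                "char_end": end,
            })
            if end == len(page_text):
                break  # this window already reached the end of the page
    return records


pages = ["A" * 50, "", "B" * 30]  # toy pages; the empty one yields no chunks
for record in chunk_pages(pages):
    print(record["page_number"], record["chunk_id"],
          record["char_start"], record["char_end"])
# prints:
# 1 1 0 40
# 1 2 30 50
# 3 1 0 30
```

Because every chunk belongs to exactly one page, `page_text[char_start:char_end]` always reproduces the chunk. The trade-off is that a sentence running across a page break ends up split between two chunks.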

After running this cell and calling `load_pdf_into_chromadb` on a PDF, the document is ready to be queried by the RAG agent through ChromaDB.


In [None]:
import os
import chromadb
from pypdf import PdfReader
from chromadb.utils import embedding_functions
# Import the specific text splitter class
from langchain_text_splitters import RecursiveCharacterTextSplitter

def load_pdf_into_chromadb(
    file_path: str,
    collection_name: str = "alphabet_10k_collection_chunks",
    db_path: str = "../chroma_db_chunks"
):
    """
    Reads a PDF file, chunks its content per page using RecursiveCharacterTextSplitter,
    and loads it into ChromaDB with rich source metadata.
    """
    reader = PdfReader(file_path)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        length_function=len,
    )

    documents = []
    metadata_list = []
    ids = []

    for page_index, page in enumerate(reader.pages):
        page_text = page.extract_text()

        if not page_text:
            continue

        # Split THIS PAGE only
        page_chunks = text_splitter.split_text(page_text)

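        char_cursor = 0
        for chunk_index, chunk in enumerate(page_chunks):
            # Consecutive chunks overlap, so resume the search just past the
            # previous chunk's start, not at its end. -1 marks "not found".
            char_start = page_text.find(chunk, char_cursor)
            char_end = char_start + len(chunk) if char_start != -1 else -1
            if char_start != -1:
                char_cursor = char_start + 1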

            documents.append(chunk)
            metadata_list.append({
                "source": os.path.basename(file_path),
                "page_number": page_index + 1,  # 1-based, matching the page in the chunk ID
                "chunk_id": chunk_index + 1,
                "char_start": char_start,
                "char_end": char_end,
            })

            ids.append(
                f"{os.path.basename(file_path)}_p{page_index+1}_c{chunk_index+1}"
            )

    embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    )

    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(
        name=collection_name,
        embedding_function=embedding_function
    )

    # upsert instead of add, so re-running the cell updates existing chunk IDs
    collection.upsert(
        documents=documents,
        metadatas=metadata_list,
        ids=ids
    )

    print(
        f"Successfully loaded {len(documents)} chunks into ChromaDB "
        f"collection '{collection_name}' with page-level metadata."
    )




In [3]:
# Run the utility
pdf_file_path = "../data/alphabet-form-10-K-2024.pdf"
load_pdf_into_chromadb(pdf_file_path)

Successfully loaded 449 chunks into ChromaDB collection 'alphabet_10k_collection_chunks' with page-level metadata.
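
Once the collection is populated, retrieval might look like the sketch below. `collection.query` returns one inner list per query text, so the hypothetical helper `top_hits` flattens the first one into simple records. The `StubCollection` class only mimics the shape of that result so the sketch runs without a database. Its contents and the question string are placeholders, not real output.

```python
def top_hits(collection, question: str, n_results: int = 3) -> list[dict]:
    """Query a Chroma-style collection and flatten the hits for one question."""
    results = collection.query(query_texts=[question], n_results=n_results)
    # Each field is a list of lists: one inner list per entry in query_texts
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(documents, metadatas, distances)
    ]


class StubCollection:
    """Stand-in with the same result shape as a Chroma query (illustration only)."""

    def query(self, query_texts, n_results):
        return {
            "documents": [["placeholder chunk A", "placeholder chunk B"]],
            "metadatas": [[
                {"source": "example.pdf", "page_number": 1},
                {"source": "example.pdf", "page_number": 2},
            ]],
            "distances": [[0.1, 0.2]],
        }


for hit in top_hits(StubCollection(), "placeholder question"):
    meta = hit["metadata"]
    print(f"{meta['source']} p.{meta['page_number']}: {hit['text']}")
```

With the real database, the stub would be replaced by the collection opened via `chromadb.PersistentClient(path=db_path)`, using the same embedding function as at load time so that query and chunk embeddings are comparable.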
