## Let's go PRO!
Advanced RAG Techniques!

Let's start by digging into ingest:

1. No LangChain! Just native for maximum flexibility
2. Let's use an LLM to divide up chunks in a sensible way
3. Let's use the best chunk size and encoder from yesterday
4. Let's also have the LLM rewrite chunks in a way that's most useful ("document pre-processing")

In [1]:
# ! ollama pull llama3.2:1b

In [110]:
import os
from pathlib import Path
from openai import OpenAI
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from chromadb import PersistentClient
from tqdm import tqdm
import litellm
from litellm import completion
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go


load_dotenv(override=True)

# MODEL = "gpt-4.1-nano"
# MODEL = "openai/gpt-oss-20b:free"
MODEL = "ollama/llama3.2:1b" # need more memory
# MODEL = "groq/meta-llama/llama-guard-4-12b" # does not support structured output
# litellm.api_key = os.getenv('GROQ_API_KEY')


# litellm.api_base = "https://openrouter.ai/api/v1"
# litellm.api_key = os.getenv('OPENROUTER_API_KEY')

DB_NAME = "preprocessed_db"
collection_name = "docs"
embedding_model = "text-embedding-3-small"
KNOWLEDGE_BASE_PATH = Path("knowledge-base")
AVERAGE_CHUNK_SIZE = 500

In [111]:
# Inspired by LangChain's Document - let's have something similar

class Result(BaseModel):
    page_content: str
    metadata: dict

In [112]:
# A class to perfectly represent a chunk

class Chunk(BaseModel):
    headline: str = Field(description="A brief heading for this chunk, typically a few words, that is most likely to be surfaced in a query")
    summary: str = Field(description="A few sentences summarizing the content of this chunk to answer common questions")
    original_text: str = Field(description="The original text of this chunk from the provided document, exactly as is, not changed in any way")

    def as_result(self, document):
        # Carry over the source path and document type so each result can be traced back
        metadata = {"source": document["source"], "type": document["type"]}
        page_content = "\n\n".join([self.headline, self.summary, self.original_text])
        return Result(page_content=page_content, metadata=metadata)


class Chunks(BaseModel):
    chunks: list[Chunk]

### Three steps:
1. Fetch documents from the knowledge base, like LangChain did
2. Call an LLM to turn documents into Chunks
3. Store the Chunks in Chroma

That's it!

### Let's start with Step 1

In [113]:
def fetch_documents():
    """A homemade version of the LangChain DirectoryLoader"""

    documents = []

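    for folder in KNOWLEDGE_BASE_PATH.iterdir():
        if not folder.is_dir():
            continue  # skip stray files (e.g. .DS_Store) at the top level
        doc_type = folder.name  # the subfolder name doubles as the document type
        for file in folder.rglob("*.md"):
            with open(file, "r", encoding="utf-8") as f:
                documents.append({"type": doc_type, "source": file.as_posix(), "text": f.read()})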

    print(f"Loaded {len(documents)} documents")
    return documents

In [114]:
documents = fetch_documents()

Loaded 32 documents


### Donezo! On to Step 2 - make the chunks

In [115]:
def make_prompt(document):
    how_many = (len(document["text"]) // AVERAGE_CHUNK_SIZE) + 1
    return f"""
You take a document and you split the document into overlapping chunks for a KnowledgeBase.

The document is from the shared drive of a company called Insurellm.
The document is of type: {document["type"]}
The document has been retrieved from: {document["source"]}

A chatbot will use these chunks to answer questions about the company.
You should divide up the document as you see fit, being sure that the entire document is returned in the chunks - don't leave anything out.
This document should probably be split into {how_many} chunks, but you can have more or less as appropriate.
There should be overlap between the chunks as appropriate; typically about 25% overlap or about 50 words, so you have the same text in multiple chunks for best retrieval results.

For each chunk, you should provide a headline, a summary, and the original text of the chunk.
Together your chunks should represent the entire document with overlap.

Here is the document:

{document["text"]}

Respond with the chunks.
"""
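
The `how_many` hint in the prompt is plain integer arithmetic on the character count. Here is a small worked example of that formula on its own; the function name `suggested_chunk_count` is made up for illustration and is not part of the notebook.

```python
AVERAGE_CHUNK_SIZE = 500  # characters, the same value the notebook sets


def suggested_chunk_count(text: str, average_chunk_size: int = AVERAGE_CHUNK_SIZE) -> int:
    """Floor-divide the character count by the target size, then add one."""
    return (len(text) // average_chunk_size) + 1


# A few document lengths (in characters) and the hint each would produce
for length in (120, 499, 500, 1750):
    print(length, "characters ->", suggested_chunk_count("x" * length), "chunk(s)")
```

Note that `len` counts characters, not words or tokens, and the `+ 1` means even a very short document gets a hint of one chunk.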

In [116]:
print(make_prompt(documents[0]))


You take a document and you split the document into overlapping chunks for a KnowledgeBase.

The document is from the shared drive of a company called Insurellm.
The document is of type: company
The document has been retrieved from: knowledge-base/company/about.md

A chatbot will use these chunks to answer questions about the company.
You should divide up the document as you see fit, being sure that the entire document is returned in the chunks - don't leave anything out.
This document should probably be split into 1 chunks, but you can have more or less as appropriate.
There should be overlap between the chunks as appropriate; typically about 25% overlap or about 50 words, so you have the same text in multiple chunks for best retrieval results.

For each chunk, you should provide a headline, a summary, and the original text of the chunk.
Together your chunks should represent the entire document with overlap.

Here is the document:

# About Insurellm

Insurellm was founded by Avery La

In [117]:
def make_messages(document):
    return [
        {
            "role": "system",
            "content": """You are a document chunking assistant. Return ONLY valid JSON with no other text.
            Format: {"chunks": [{"headline": "...", "summary": "...", "original_text": "..."}]}"""
        },
        {
            "role": "user",
            "content": f"Process this document:\n\n{document}"
        }
    ]

In [118]:
make_messages(documents[0])

[{'role': 'system',
  'content': 'You are a document chunking assistant. Return ONLY valid JSON with no other text.\n            Format: {"chunks": [{"headline": "...", "summary": "...", "original_text": "..."}]}'},
 {'role': 'user',
  'content': 'Process this document:\n\n{\'type\': \'company\', \'source\': \'knowledge-base/company/about.md\', \'text\': "# About Insurellm\\n\\nInsurellm was founded by Avery Lancaster in 2015 as an insurance tech startup designed to disrupt an industry in need of innovative products. It\'s first product was Markellm, the marketplace connecting consumers with insurance providers.\\nIt rapidly expanded, adding new products and clients, reaching 200 emmployees by 2024 with 12 offices across the US."}'}]
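
Worth noticing in the output above: the user message interpolates the whole `document` dict, so the model sees a Python repr with escaped newlines, and `make_prompt` is never called. Below is a minimal sketch of one way to wire the prompt in, assuming that is the intent. `build_messages` is a hypothetical name, and the `make_prompt` here is a shortened stand-in for the notebook's version so the block runs on its own.

```python
SYSTEM_PROMPT = (
    "You are a document chunking assistant. Return ONLY valid JSON with no other text.\n"
    'Format: {"chunks": [{"headline": "...", "summary": "...", "original_text": "..."}]}'
)


def make_prompt(document: dict) -> str:
    """Shortened stand-in for the notebook's make_prompt (assumption: same input dict)."""
    return (
        f"The document is of type: {document['type']}\n"
        f"The document has been retrieved from: {document['source']}\n\n"
        f"Here is the document:\n\n{document['text']}\n\n"
        "Respond with the chunks."
    )


def build_messages(document: dict) -> list[dict]:
    """Hypothetical variant of make_messages that sends the prompt text, not the dict repr."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": make_prompt(document)},
    ]


example = {
    "type": "company",
    "source": "knowledge-base/company/about.md",
    "text": "# About\n\nLine two.",
}
print(build_messages(example)[1]["content"])
```

With this shape the document text reaches the model with its real line breaks, which should make it easier for the model to copy `original_text` back exactly.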

In [119]:
def process_document(document):
    messages = make_messages(document)
    response = completion(model=MODEL, messages=messages, response_format=Chunks)
    reply = response.choices[0].message.content
    doc_as_chunks = Chunks.model_validate_json(reply).chunks
    return [chunk.as_result(document) for chunk in doc_as_chunks]

In [120]:
# def process_document(document):
#     messages = make_messages(document)
#     response = completion(
#         model=MODEL,
#         messages=messages
#     )
#     reply = response.choices[0].message.content
    
#     # Clean and parse the response
#     reply = reply.strip()
#     if reply.startswith("```json"):
#         reply = reply[7:]
#     if reply.endswith("```"):
#         reply = reply[:-3]
#     reply = reply.strip()
    
#     doc_as_chunks = Chunks.model_validate_json(reply)
#     return doc_as_chunks.chunks
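
The commented-out fallback above only handles a reply that starts with a JSON code fence and ends with a closing fence. Small models often add a sentence before or after the JSON as well. A slightly more forgiving sketch, using only the standard library: it takes everything between the first `{` and the last `}`. The helper name `extract_json_object` is an assumption for illustration, not something from the notebook or from LiteLLM.

```python
import json


def extract_json_object(reply: str) -> dict:
    """Parse the JSON object spanning the first '{' to the last '}' in an LLM reply."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in reply")
    return json.loads(reply[start : end + 1])


fence = "`" * 3  # built this way to avoid a literal fence inside this example
payload = '{"chunks": [{"headline": "About", "summary": "Intro", "original_text": "# About"}]}'

fenced_reply = f"{fence}json\n{payload}\n{fence}"
chatty_reply = f"Sure! Here are the chunks:\n{payload}\nLet me know if you need more."

print(extract_json_object(fenced_reply)["chunks"][0]["headline"])
print(extract_json_object(chatty_reply)["chunks"][0]["headline"])
```

The resulting dict could then go through `Chunks.model_validate(...)` instead of `model_validate_json`. This still fails if the model emits invalid JSON, so it is a fallback and not a substitute for structured output.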

In [121]:
process_document(documents[0])

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Provider List: https://docs.litellm.ai/docs/providers

APIConnectionError: litellm.APIConnectionError: OllamaException - {"error":"model requires more system memory than is currently available unable to load full model on GPU"}

In [None]:
def create_chunks(documents):
    chunks = []
    for doc in tqdm(documents):
        chunks.extend(process_document(doc))
    return chunks

In [73]:
chunks = create_chunks(documents)

 19%|█▉        | 6/32 [00:10<00:43,  1.69s/it]

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

RateLimitError: litellm.RateLimitError: RateLimitError: GroqException - {"error":{"message":"Rate limit reached for model `openai/gpt-oss-20b` in organization `org_01ka0pb77tfg1v2n0nzc74g4xp` service tier `on_demand` on tokens per minute (TPM): Limit 8000, Used 7979, Requested 884. Please try again in 6.4725s. Need more tokens? Upgrade to Dev Tier today at https://console.groq.com/settings/billing","type":"tokens","code":"rate_limit_exceeded"}}
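
The run above stopped at document 6 of 32 because of a tokens-per-minute limit, and the message suggests waiting a few seconds. One common way to get past that is to retry with exponential backoff. This is a generic sketch, not something the notebook does: `call_with_backoff` and its parameters are hypothetical names, and the `sleep` argument exists only so the demo can record the waits without actually pausing.

```python
import random
import time


def call_with_backoff(fn, *, retry_on=(Exception,), max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)


# Demo with a stand-in function that fails twice before succeeding
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limit reached")
    return "ok"


waits = []  # record the delays instead of sleeping
print(call_with_backoff(flaky, retry_on=(RuntimeError,), sleep=waits.append))
print("retries:", len(waits))
```

In the notebook this would wrap the per-document call, along the lines of `call_with_backoff(lambda: process_document(doc), retry_on=(litellm.RateLimitError,))`, assuming the exception is importable under the name shown in the traceback.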


In [None]:
print(len(chunks))

#### Well that was easy! If a bit slow.
In the Python module version, I sneakily use a multiprocessing Pool to run this in parallel; if you hit a rate limit error, you can turn that off in the code.
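
The module version itself isn't shown here, so the following is only a sketch of the idea with an on/off switch. It swaps in `multiprocessing.pool.ThreadPool` (threads, same `map` interface as `Pool`) because the work is mostly waiting on API calls. `create_chunks_parallel`, `workers`, and the stand-in `fake_process` are hypothetical names.

```python
from multiprocessing.pool import ThreadPool


def create_chunks_parallel(documents, process_document, workers=3):
    """Process documents concurrently; workers=1 falls back to a plain sequential loop."""
    if workers <= 1:
        batches = [process_document(doc) for doc in documents]
    else:
        with ThreadPool(processes=workers) as pool:
            batches = pool.map(process_document, documents)  # results keep input order
    # Each call returns a list of chunks, so flatten the list of lists
    return [chunk for batch in batches for chunk in batch]


def fake_process(doc):
    """Stand-in for process_document: splits a document's text into two 'chunks'."""
    half = len(doc["text"]) // 2
    return [doc["text"][:half], doc["text"][half:]]


docs = [{"text": "aabb"}, {"text": "ccdd"}, {"text": "eeff"}]
print(create_chunks_parallel(docs, fake_process, workers=3))
print(create_chunks_parallel(docs, fake_process, workers=1))
```

Setting `workers=1` is the "turn it off" case: requests go out one at a time, which is slower but much less likely to trip a tokens-per-minute limit.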