# Scenario

You are given a set of product documentation files (PDFs and Markdown). Your goal is
to create a prototype that answers user questions based on these docs.

# Tasks

Ingest and chunk the documentation.
Store embeddings in a vector database of your choice (e.g., Pinecone, Weaviate,
Chroma).
Implement a retrieval function.
Use an LLM to answer user questions grounded in the retrieved context.

# Constraints

Use LangChain, LlamaIndex, or another modern RAG framework.
Include code and a simple architecture diagram showing the flow.
Write a short note on how you would scale this for millions of documents.

# Deliverable

Code file(s) or notebook link.
Architecture diagram (image or PDF).
Scaling notes.

RAG Architecture Diagram

![Retrieval Augmented Generation](rag-architecture-diagram.png)

RAG pipeline
- Load the files
- Chunk the documents
- Create embeddings and store them in the vector database
- Retrieve the chunks most relevant to the query
- Create a prompt that injects the retrieved context
- Send the prompt to the LLM

Install libraries

In [None]:
!pip install langchain
!pip install langchain_chroma
!pip install langchain_community
!pip install langchain-together
!pip install pypdf

Set up values for later.

In [None]:
import os

# Read the API key from the environment; "Key" is only a placeholder fallback.
TOGETHER_API_KEY = os.environ.get("TOGETHER_API_KEY", "Key")
DOC_PATH = "docs"
CHROMA_PATH = "chroma_vectors"
EMBEDDING_MODEL = "BAAI/bge-base-en-v1.5"

Load the documents from a target directory.

In [None]:
import os

from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_core.documents import Document


def load_documents_from_directory(directory: str = DOC_PATH) -> list[Document]:
    """
    Load all PDF and Markdown files from a directory into LangChain Document objects.
    """
    documents = []
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        
        if filename.lower().endswith(".pdf"):
            loader = PyPDFLoader(filepath)
            documents.extend(loader.load())
        
        elif filename.lower().endswith(".md"):
            loader = TextLoader(filepath, encoding="utf-8")
            documents.extend(loader.load())
    
    return documents

Break the documents into smaller chunks. The chunks are what gets embedded, and keeping them small narrows the context passed to the LLM so that answers are more precise.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

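def split_documents(documents: list[Document]) -> list[Document]:
    """Split documents into overlapping chunks of at most 500 characters."""
    # Adjacent chunks may share up to 200 characters, so text cut at a boundary keeps some context.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=200)
    return text_splitter.split_documents(documents)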
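
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding window. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs, lines, and spaces, so its real output will differ. The helper name `sliding_window_chunks` and the sample text are assumptions made for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one. A larger overlap preserves more context across boundaries at the cost of more chunks to embed and store.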

Create vector embeddings based on the document chunks. The embeddings will be saved inside a local Chroma database.

In [None]:
from langchain_together import ChatTogether, TogetherEmbeddings
from langchain_chroma import Chroma

def save_vectors(documents: list[Document]):
    embedding = TogetherEmbeddings(model=EMBEDDING_MODEL, api_key=TOGETHER_API_KEY)
    Chroma.from_documents(
        documents=documents,
        embedding=embedding,
        persist_directory=CHROMA_PATH,
    )
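
Because the vectors are written to a persistent directory, calling `save_vectors` again with the same documents can store the same chunks a second time. One possible mitigation, sketched below under stated assumptions, is to derive a deterministic ID for each chunk from its source and content. The helpers `chunk_id` and `dedupe_chunks` and the sample paths are hypothetical, and the sketch uses plain `(source, text)` tuples instead of LangChain `Document` objects so it runs without any dependencies.

```python
import hashlib


def chunk_id(source: str, text: str) -> str:
    """Return a deterministic ID derived from the chunk's source path and content."""
    digest = hashlib.sha256()
    digest.update(source.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") hash differently
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


def dedupe_chunks(chunks: list[tuple[str, str]]) -> dict[str, tuple[str, str]]:
    """Map chunk ID -> (source, text), keeping the first occurrence of each ID."""
    unique: dict[str, tuple[str, str]] = {}
    for source, text in chunks:
        unique.setdefault(chunk_id(source, text), (source, text))
    return unique


sample_chunks = [
    ("docs/install.md", "Run the installer."),
    ("docs/install.md", "Run the installer."),  # exact duplicate
    ("docs/faq.md", "Run the installer."),      # same text, different source
]
unique = dedupe_chunks(sample_chunks)
print(len(unique))  # 2
```

With LangChain documents, the same IDs could be computed from `doc.metadata.get("source", "")` and `doc.page_content` and passed through the `ids` argument of `Chroma.from_documents`. Whether a repeated ID overwrites the existing entry depends on the installed `langchain_chroma` version, so that behavior is worth verifying before relying on it.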

Create a function that retrieves the document chunks most similar to a query from the local Chroma DB.

In [None]:
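def get_embeddings(query: str) -> list[Document]:
    """Return the three chunks most similar to the query, or an empty list for an empty query."""
    if not query:
        # Return an empty list (not a string) so format_context can always iterate the result.
        return []
    embedding = TogetherEmbeddings(model=EMBEDDING_MODEL, api_key=TOGETHER_API_KEY)
    vector_store = Chroma(
        persist_directory=CHROMA_PATH,
        embedding_function=embedding
    )
    return vector_store.similarity_search(query, k=3)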

def format_context(docs: list[Document]) -> str:
    return "\n\n".join([doc.page_content for doc in docs])
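
Conceptually, `similarity_search(query, k=3)` embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that idea with a brute-force cosine-similarity scan over made-up three-dimensional vectors. It is not how Chroma works internally: Chroma uses its own index and configured distance metric rather than an exhaustive scan. The names `cosine_similarity`, `top_k`, and `toy_index` are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings"; a real model such as the one in EMBEDDING_MODEL produces far larger vectors.
toy_index = {
    "install guide": [0.9, 0.1, 0.0],
    "attention layers": [0.0, 0.2, 0.9],
    "pricing": [0.1, 0.9, 0.1],
}
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(results)  # ['install guide', 'pricing']
```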

Set up the LLM and the prompt for your RAG pipeline.

In [None]:
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate

system_message = (
    "You are a helpful assistant that ONLY answers questions based on the "
    "provided context. If no relevant context is provided, do NOT answer the "
    "question and politely inform the user that you don't have the necessary "
    "information to answer their question accurately."
)
prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_message),
    HumanMessagePromptTemplate.from_template(
        "Context:\n{context}\n\nQuestion: {query}"
    )
])

llm = ChatTogether(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
    api_key=TOGETHER_API_KEY,
)

Now put everything together and see how each query pulls the relevant context from the documents before the LLM answers.

In [None]:
if __name__ == "__main__":
    # Build the index. Re-running this step adds the same chunks to the persisted store again.
    docs = load_documents_from_directory(DOC_PATH)
    doc_chunks = split_documents(docs)
    save_vectors(doc_chunks)

    queries = [
        "How do you install unsloth for LLM fine tuning?",
        "What is a transformation attention layer?",
    ]
    for query in queries:
        # Retrieve the closest chunks, inject them into the prompt, and ask the LLM.
        context = get_embeddings(query)
        messages = prompt.format_messages(
            context=format_context(context),
            query=query
        )
        response = llm.invoke(messages)
        print(response.content)
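
Each query follows the same sequence: retrieve, format the context, generate. A minimal sketch of that sequence as one function is shown below, with the retriever and the model passed in as plain callables so it can be exercised without an API key. The names `answer_question`, `REFUSAL`, `fake_retrieve`, and `fake_generate` are hypothetical, and skipping the model call when nothing is retrieved is a design choice that is not part of the notebook above.

```python
from typing import Callable

REFUSAL = "I don't have the necessary information to answer that question."


def answer_question(
    query: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context for the query and generate an answer, or refuse if there is no context."""
    chunks = retrieve(query) if query.strip() else []
    if not chunks:
        return REFUSAL  # no context retrieved, so do not call the model at all
    context = "\n\n".join(chunks)
    prompt_text = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt_text)


# Stand-ins for the vector store and the LLM, used only to demonstrate the control flow.
def fake_retrieve(query: str) -> list[str]:
    return ["Install with pip."] if "install" in query.lower() else []


def fake_generate(prompt_text: str) -> str:
    # Echo the first context line so the demo shows what the "model" received.
    return "ANSWER based on: " + prompt_text.splitlines()[1]


print(answer_question("How do I install it?", fake_retrieve, fake_generate))
print(answer_question("What is the pricing?", fake_retrieve, fake_generate))
```

In the notebook, `retrieve` could wrap `get_embeddings` and return each document's `page_content`, and `generate` could wrap `llm.invoke`.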

# Scaling RAG

To scale a RAG system to millions of documents, I would focus on a robust, distributed architecture. First, documents would be ingested and normalized in a pipeline that handles PDFs, Markdown, and other formats, with deduplication and semantic chunking for high-quality context. Chunks would be embedded in a distributed vector database that supports sharding, replication, and approximate nearest neighbor search, while a parallel keyword index handles lexical filtering. The retrieval pipeline would combine vector and keyword search, optionally reranking candidates with an LLM or cross-encoder, and always include citations. Caching, batching, and asynchronous processing would optimize throughput, while observability and metrics track recall, latency, and system health. Finally, incremental updates, versioning, and careful resource planning ensure the system can grow efficiently without sacrificing reliability or accuracy.
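
Combining vector and keyword search requires merging two ranked lists whose scores are not directly comparable. One common option, shown here as a sketch and not as the only approach, is reciprocal rank fusion (RRF), which scores each document by summing `1 / (k + rank)` over the lists it appears in. The function name, the constant `k = 60` (a frequently used default), and the document IDs are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one list ordered by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical results from the vector index
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical results from the keyword index
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists rise to the top, and the fused list can then be truncated and handed to a reranker.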

