
# 🐍 RAG Hands-On (Python): Company Files Q&A Chatbot

**Goal:** Build a **minimal but real** RAG pipeline in Python for Q&A over company files  
(**PDF, DOCX, CSV**) with:

- Multiple **LLM/chat models** (switchable),
- Multiple **embedding models** (cloud + local),
- Multiple **vector stores** (FAISS + Chroma, both offline-capable),
- **Chunking** strategies,
- **Conversation memory** (simple but effective),
- Clean structure and comments so this becomes your **“Entrance to the RAG Universe”**.

> ⚠️ No Tavily, no web search, no UI: just a clean backend-style pipeline you can test in a notebook.



## 0. High-Level Architecture

1. **Ingest**: Load PDF / DOCX / CSV from the `./data` directory.  
2. **Chunk**: Split into overlapping text chunks.  
3. **Embed**: Turn chunks into vectors using one of several embedding models.  
4. **Index**: Store embeddings in a vector store (FAISS or Chroma).  
5. **Retrieve**: Given a user question, pull top-k relevant chunks.  
6. **RAG Generate**: Combine question + chunks into a prompt and call an LLM.  
7. **Memory**: Keep track of previous turns and inject them into the prompt.  

You can think of it as two main phases:

- **Offline / Preprocessing**: ingest + chunk + embed + index.  
- **Online / Query-time**: retrieve + generate + memory.


In [None]:

# 1. Install dependencies (run once per environment)
# In Colab you can uncomment these lines.
# !pip install -q langchain langchain-core langchain-community langchain-openai langchain-text-splitters
# !pip install -q chromadb faiss-cpu sentence-transformers
# !pip install -q pypdf python-docx pandas



In [None]:

# 2. Imports & basic config

import os
from typing import List, Literal, Dict, Any

# LangChain core
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Vector stores
from langchain_community.vectorstores import FAISS, Chroma

# Embeddings
from langchain_openai import OpenAIEmbeddings  # cloud
from langchain_community.embeddings import SentenceTransformerEmbeddings  # local

# LLMs (chat models)
from langchain_openai import ChatOpenAI

# For loading files
from pypdf import PdfReader
from docx import Document as DocxDocument
import pandas as pd

# ---- API keys / env vars ----
# Set these in your environment before running:
# os.environ["OPENAI_API_KEY"] = "sk-..."

DATA_DIR = "./data"       # put your PDFs, DOCX, CSVs here
CHROMA_DIR = "./chroma_db"  # for Chroma persistence




## 3. File Loaders (PDF, DOCX, CSV)

We keep this simple and transparent.

- **PDF** → `pypdf` (per page text).  
- **DOCX** → `python-docx` (paragraph join).  
- **CSV** → `pandas` (join columns per row).  
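
To make the CSV strategy concrete, here is a small sketch of the "join columns per row" idea. It uses the standard library's `csv` module rather than `pandas` so it runs anywhere; the sample data and the name `rows_to_text` are made up for illustration.

```python
import csv
import io
from typing import List

# Made-up sample data; the second row has an empty "notes" cell.
SAMPLE_CSV = """product,price,notes
Widget,9.99,Best seller
Gadget,19.50,
"""

def rows_to_text(csv_text: str) -> List[str]:
    """Turn each CSV row into 'column: value' lines, skipping empty cells."""
    texts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        pieces = [f"{col}: {val}" for col, val in row.items() if val]
        texts.append("\n".join(pieces))
    return texts

row_texts = rows_to_text(SAMPLE_CSV)
print(row_texts[1])
```

Each row becomes one small document, so a retrieved hit maps back to a single record. Empty cells are dropped so they do not add noise to the text that gets embedded.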


In [None]:

def load_pdf(path: str) -> List[Document]:
    reader = PdfReader(path)
    docs = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if text.strip():
            docs.append(Document(
                page_content=text,
                metadata={"source": path, "page": i}
            ))
    return docs

def load_docx(path: str) -> List[Document]:
    d = DocxDocument(path)
    paragraphs = [p.text for p in d.paragraphs if p.text.strip()]
    text = "\n".join(paragraphs)
    return [Document(page_content=text, metadata={"source": path})]

def load_csv(path: str, text_cols: List[str] = None) -> List[Document]:
    df = pd.read_csv(path)
    if text_cols is None:
        # naive: use all columns
        text_cols = list(df.columns)
    docs = []
    for idx, row in df.iterrows():
        pieces = [f"{col}: {row[col]}" for col in text_cols if pd.notnull(row[col])]
        text = "\n".join(pieces)
        if text.strip():
            docs.append(Document(
                page_content=text,
                metadata={"source": path, "row": int(idx)}
            ))
    return docs

def load_all_documents(data_dir: str = DATA_DIR) -> List[Document]:
    all_docs: List[Document] = []
    for root, _, files in os.walk(data_dir):
        for fname in files:
            path = os.path.join(root, fname)
            if fname.lower().endswith(".pdf"):
                all_docs.extend(load_pdf(path))
            elif fname.lower().endswith(".docx"):
                all_docs.extend(load_docx(path))
            elif fname.lower().endswith(".csv"):
                all_docs.extend(load_csv(path))
    return all_docs

docs = load_all_documents()
print(f"Loaded {len(docs)} raw docs/chunks before splitting.")



## 4. Chunking / Splitting Strategies

We’ll use `RecursiveCharacterTextSplitter` with:

- **Chunk size**: 800–1200 characters (tweak based on your domain),
- **Overlap**: 150–250 characters to preserve context across boundaries,
- **Separators**: paragraph breaks first, then line breaks, sentences, and words, so chunks end at natural boundaries where possible.

You can experiment with different configurations in one place.
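
To see what chunk size and overlap mean in practice, here is a simplified sliding-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators). The function name `sliding_window_chunks` and the toy alphabet input are made up for illustration.

```python
from typing import List

def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows; each window re-reads the last
    `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

demo_chunks = sliding_window_chunks(
    "abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3
)
print(demo_chunks)
```

With `chunk_size=10` and `chunk_overlap=3`, each chunk starts 7 characters after the previous one, so the last 3 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from either side.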


In [None]:

def chunk_documents(
    docs: List[Document],
    chunk_size: int = 1000,
    chunk_overlap: int = 200
) -> List[Document]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_documents(docs)

chunked_docs = chunk_documents(docs)
print(f"After chunking: {len(chunked_docs)} chunks.")



## 5. Embedding Models (Multiple Options)

We define an **embedding registry** so you can switch between:

- `openai_small` → small/cheap OpenAI embeddings (cloud).  
- `openai_large` → larger OpenAI embeddings (if you want).  
- `mpnet_local` → `all-mpnet-base-v2` (local SentenceTransformer).  

You can easily add more.
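
One thing to keep in mind when switching: an index built with one embedding model has to be queried with that same model, because different models produce vectors of different sizes in unrelated vector spaces. The guard below is a hypothetical helper (the names `EMBEDDING_DIMS` and `check_index_compatible` are not part of the notebook); the numbers are the default output dimensions of these three models.

```python
# Default output dimensions of the three registry entries.
EMBEDDING_DIMS = {
    "openai_small": 1536,  # text-embedding-3-small
    "openai_large": 3072,  # text-embedding-3-large
    "mpnet_local": 768,    # all-mpnet-base-v2
}

def check_index_compatible(index_model: str, query_model: str) -> None:
    """Raise if an index built with one model is queried with another."""
    if index_model != query_model:
        raise ValueError(
            f"Index was built with {index_model!r} "
            f"({EMBEDDING_DIMS[index_model]} dims) but queries use "
            f"{query_model!r} ({EMBEDDING_DIMS[query_model]} dims); "
            "rebuild the index with the new embedding model."
        )

check_index_compatible("openai_small", "openai_small")  # same model: no error
```

In practice this means changing the embedding name should always be followed by rebuilding the vector store from the chunks.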


In [None]:

EmbeddingName = Literal["openai_small", "openai_large", "mpnet_local"]

def get_embedding_model(name: EmbeddingName):
    if name == "openai_small":
        return OpenAIEmbeddings(model="text-embedding-3-small")
    if name == "openai_large":
        return OpenAIEmbeddings(model="text-embedding-3-large")
    if name == "mpnet_local":
        return SentenceTransformerEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
    raise ValueError(f"Unknown embedding model: {name}")

current_embedding_name: EmbeddingName = "openai_small"
embeddings = get_embedding_model(current_embedding_name)



## 6. Vector Stores (FAISS + Chroma)

We’ll support two popular, offline-friendly vector stores:

- **FAISS** (in-memory, can be saved to disk, fast similarity search):
  - Great for local experiments and many production cases.
- **Chroma** (persistent local DB):
  - Nice developer experience, simple persistence.

You can toggle which backend to use.
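
Conceptually, both backends answer the same question: which stored vectors are closest to the query vector? The sketch below does that by brute force in plain Python with cosine similarity. It is an illustration only, not how FAISS or Chroma are implemented, and real stores may use a different distance metric by default. The names `cosine_similarity` and `top_k` and the toy 2-D vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "index": three 2-D vectors standing in for chunk embeddings.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], toy_vectors, k=2)
print([index for index, _ in hits])
```

The query points almost along the first axis, so vector 0 ranks first and the diagonal vector 2 second. A vector store does the same ranking, just with data structures that stay fast when there are many more vectors.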


In [None]:

VectorStoreName = Literal["faiss", "chroma"]

def build_vectorstore(
    chunks: List[Document],
    store_name: VectorStoreName,
    embeddings_model
):
    if store_name == "faiss":
        vs = FAISS.from_documents(chunks, embeddings_model)
        return vs
    if store_name == "chroma":
        vs = Chroma.from_documents(
            chunks,
            embeddings_model,
            persist_directory=CHROMA_DIR
        )
        return vs
    raise ValueError(f"Unknown vector store: {store_name}")


current_vs_name: VectorStoreName = "faiss"
vectorstore = build_vectorstore(chunked_docs, current_vs_name, embeddings)
print(f"Built vector store with backend = {current_vs_name}")



## 7. Chat Models (Multiple LLM Options)

We define a small registry of chat models (you can pick whatever your account supports):

- `gpt_4_small` → e.g. a `gpt-4o-mini`-class model (cheaper, fast).  
- `gpt_4_full` → e.g. a `gpt-4.1`-class model (stronger).  
- You can also plug in **local models** (LM Studio / Ollama) via a custom LLM class,  
  but here we focus on OpenAI-style models for simplicity.

> Replace model names with whatever is available to you.
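
As one possible alternative to writing a custom LLM class, a local server that exposes an OpenAI-compatible endpoint can often be reached with `ChatOpenAI` itself by overriding `base_url`. This is a sketch under assumptions: the helper name `get_local_chat_model`, the model name `llama3`, and the URL (Ollama's default local address) are placeholders to adjust for your setup.

```python
def get_local_chat_model(
    model: str = "llama3",                        # assumption: a model you have pulled locally
    base_url: str = "http://localhost:11434/v1",  # assumption: Ollama's default local endpoint
):
    """Build a ChatOpenAI client that talks to a local OpenAI-compatible server."""
    # Imported lazily so defining this helper does not require langchain-openai.
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(
        model=model,
        base_url=base_url,
        api_key="not-needed",  # the client requires a value; local servers typically ignore it
        temperature=0,
    )
```

If this works for your server, the returned object can be used in place of `llm`, so the rest of the pipeline stays unchanged.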


In [None]:

ChatModelName = Literal["gpt_4_small", "gpt_4_full"]

def get_chat_model(name: ChatModelName):
    if name == "gpt_4_small":
        # adjust to your actual small/cheap model name
        return ChatOpenAI(model="gpt-4o-mini", temperature=0)
    if name == "gpt_4_full":
        # adjust to your actual strong model name
        return ChatOpenAI(model="gpt-4.1", temperature=0)
    raise ValueError(f"Unknown chat model: {name}")

current_llm_name: ChatModelName = "gpt_4_small"
llm = get_chat_model(current_llm_name)



## 8. Simple Conversation Memory

We’ll implement a **minimal memory**:

- Keep the last N turns of (user, assistant) pairs.
- Inject them into the prompt before the current question.

This is **manual but explicit**: you see exactly what context the model sees.
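
The "last N turns" behaviour comes for free from `collections.deque` with `maxlen`: once the deque is full, appending a new item silently drops the oldest one. A tiny demonstration with made-up turns:

```python
from collections import deque

history = deque(maxlen=2)  # keep only the last 2 turns for this demo
for n in range(1, 4):
    history.append({"user": f"question {n}", "assistant": f"answer {n}"})

remaining = [turn["user"] for turn in history]
print(remaining)  # ['question 2', 'question 3']
```

The first turn is gone after the third append, which bounds how much history is added to each prompt.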


In [None]:

from collections import deque

ConversationTurn = Dict[str, str]  # {"user": "...", "assistant": "..."}

class SimpleConversationMemory:
    def __init__(self, max_turns: int = 5):
        self.max_turns = max_turns
        self.history: deque[ConversationTurn] = deque(maxlen=max_turns)

    def add_turn(self, user: str, assistant: str):
        self.history.append({"user": user, "assistant": assistant})

    def format_history(self) -> str:
        lines = []
        for turn in self.history:
            lines.append(f"User: {turn['user']}")
            lines.append(f"Assistant: {turn['assistant']}")
        return "\n".join(lines)

memory = SimpleConversationMemory(max_turns=5)



## 9. Retrieval + RAG Answer Function

Now we wire everything together into a single function:

1. Embed & retrieve top-k chunks from our vector store.  
2. Format a prompt that includes:
   - System instructions,
   - Conversation history,
   - Retrieved context,
   - Latest user question.
3. Call the chat model and return the answer.  
4. Update memory with this turn.


In [None]:

from langchain_core.messages import HumanMessage, SystemMessage

def build_context_from_docs(docs: List[Document]) -> str:
    parts = []
    for i, d in enumerate(docs):
        src = d.metadata.get("source", "unknown")
        ref = f"[{i+1} | {os.path.basename(src)}]"
        parts.append(f"{ref}\n{d.page_content}\n")
    return "\n---\n".join(parts)

def rag_answer(
    question: str,
    k: int = 5,
    system_prompt: str = (
        "You are a helpful assistant answering questions strictly based on the provided context. "
        "If the answer is not in the context, say you don't know."
    )
) -> str:
    # 1) Retrieve
    retriever = vectorstore.as_retriever(search_kwargs={"k": k})
    retrieved_docs = retriever.invoke(question)  # returns the top-k Documents
    context_text = build_context_from_docs(retrieved_docs)

    # 2) Build messages (system prompt, then context + history + new question)
    history_text = memory.format_history()
    history_block = f"Conversation so far:\n{history_text}\n\n" if history_text else ""

    full_system_prompt = (
        system_prompt
        + "\n\nYou will be given context from company documents.\n"
        + "Use it to answer the question and cite references like [1], [2] where relevant."
    )

    final_user_content = (
        f"Context:\n{context_text}\n\n"
        f"{history_block}"
        f"User question: {question}\n\n"
        "Answer:"
    )

    messages = [
        SystemMessage(content=full_system_prompt),
        HumanMessage(content=final_user_content),
    ]

    # 3) Call LLM
    response = llm.invoke(messages)  # returns an AIMessage
    answer = response.content

    # 4) Update memory
    memory.add_turn(question, answer)
    return answer




## 10. Test the Pipeline

Now you can ask questions about your **company files** (PDF, DOCX, CSV) in `./data`.

Try a few:

- “What is our refund policy?”  
- “Summarize the 2023 Q4 metrics.”  
- “What are the responsibilities of the data engineer role?”  


In [None]:

# Example interactive loop (run and then type questions)
# Stop by typing 'exit', entering an empty line, or interrupting the cell.

while True:
    try:
        q = input("\nAsk a question (or 'exit'): ").strip()
        if not q or q.lower() == "exit":
            print("Goodbye.")
            break
        ans = rag_answer(q, k=5)
        print("\n--- Answer ---")
        print(ans)
    except (KeyboardInterrupt, EOFError):
        print("\nStopped.")
        break



---
### ✅ Summary (Python Hands-On)

This notebook gave you a **minimal but real** RAG stack in Python with:

- PDF / DOCX / CSV ingestion,  
- Chunking with `RecursiveCharacterTextSplitter`,  
- Multiple embeddings (OpenAI + SentenceTransformers),  
- Multiple vector stores (FAISS + Chroma),  
- Switchable chat models,  
- Simple but explicit conversation memory,  
- A clean `rag_answer()` function you can integrate into an API later.

Use this as your **reference template** and customize per project/domain.
