# RAG Pipeline for Q&A over a Text File

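This notebook implements a Retrieval-Augmented Generation (RAG) pipeline that answers questions about a single text file. It uses OpenAI for generation when an API key is available and falls back to a local Transformers model otherwise.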

1.  **Install** required libraries.
2.  **Load** an `OPENAI_API_KEY` (if available).
3.  **Load** a source `.txt` file.
4.  **Chunk, Embed, & Store** the text in a Chroma vector database.
5.  **Build** a LangChain RAG chain to answer questions.
6.  **Run** an interactive chat loop.
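
As a rough picture of what "chunk with overlap" means in step 4, here is a simplified fixed-width sliding window. This is only an illustration: the notebook itself uses `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraph and sentence boundaries, and the helper name `sliding_window_chunks` is made up for this sketch.

```python
def sliding_window_chunks(text: str, chunk_size: int = 300, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-width windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters of toy text
chunks = sliding_window_chunks(sample, chunk_size=12, chunk_overlap=4)
for c in chunks:
    print(repr(c))
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.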


In [1]:
## 1) Install dependencies
import sys
print(sys.version)

# Core libs
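!pip -q install langchain langchain-community langchain-text-splitters chromadb sentence-transformers

# For loading the API key from a .env file (imported as `dotenv` below)
!pip -q install python-dotenv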

# For optional local LLM fallback
!pip -q install transformers accelerate

# For OpenAI
!pip -q install langchain-openai


3.10.19 (main, Oct 21 2025, 16:37:10) [Clang 20.1.8 ]


In [2]:
## 2) Load API Key
from pathlib import Path
from dotenv import load_dotenv

# *** UPDATE THIS PATH to your .env file ***
env_path = Path("/Volumes/Untitled/Lessons_By_Week/Project_Rag/Final_Codes/ATT81022.env")
load_dotenv(dotenv_path=env_path)


True

In [3]:
## 3) Set Constants & Check Key
import os
from pathlib import Path

# Path where Chroma (vector DB) will be persisted
CHROMA_DIR = "./chroma"
COLLECTION = "uploaded_text"

# --- Optional: OpenAI ---
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "").strip()
USE_OPENAI = bool(OPENAI_API_KEY)

if USE_OPENAI:
    print("✅ Using OpenAI for generation.")
else:
    print("ℹ️ OPENAI_API_KEY not set — will use local Transformers fallback.")

Path(CHROMA_DIR).mkdir(parents=True, exist_ok=True)
print("CHROMA_DIR =", Path(CHROMA_DIR).resolve())
print("COLLECTION  =", COLLECTION)


✅ Using OpenAI for generation.
CHROMA_DIR = /Volumes/Untitled/Lessons_By_Week/Project_Rag/Final_Codes/chroma
COLLECTION  = uploaded_text


In [4]:
## 4) Load Text Document

from pathlib import Path

# *** UPDATE THIS PATH to your .txt file ***
uploaded_path = "/Volumes/Untitled/Youtube_QA_Rag_System/Working_Pipelines/text/RAG_TEXT.txt"

p = Path(uploaded_path).expanduser()
assert p.exists(), f"File not found: {p}"

text = p.read_text(encoding="utf-8", errors="ignore")
print(f"Loaded {len(text):,} characters from:", p.resolve())


Loaded 10,135 characters from: /Volumes/Untitled/Youtube_QA_Rag_System/Working_Pipelines/text/RAG_TEXT.txt


In [5]:
## 5) Define LLM (Generator)

generator = None

if USE_OPENAI:
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    generator = llm
    print("Using ChatOpenAI: gpt-4o-mini")
else:
    # Local Transformers text2text generation via HF pipeline
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
    print("Loading local model: google/flan-t5-base...")
    model_id = "google/flan-t5-base"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    hf_pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

    class HFText2TextLLM:
        def __call__(self, prompt_text: str) -> str:
            out = hf_pipe(prompt_text, max_new_tokens=256, truncation=True)
            return out[0]["generated_text"]
    
    generator = HFText2TextLLM()
    print("Using local Transformers: flan-t5-base")



Using ChatOpenAI: gpt-4o-mini


In [6]:
## 6) Chunk, Embed, and Store in Vector DB

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings

# 1) Chunk the text
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)
docs = [Document(page_content=c, metadata={"source": str(p.name)}) 
        for c in splitter.split_text(text)]
print(f"Chunks created: {len(docs)}")

# 2) Embedding function
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    encode_kwargs={"normalize_embeddings": True},
)

# 3) Create (or re-open) the Chroma collection
vs = Chroma(
    collection_name=COLLECTION,
    persist_directory=CHROMA_DIR,
    embedding_function=embeddings,
)

# 4) Add docs
vs.add_documents(docs)
print("✅ Stored in Chroma at:", Path(CHROMA_DIR).resolve())

# 5) Create the retriever
retriever = vs.as_retriever(search_kwargs={"k": 5})
print("\n✅ Created 'retriever' variable.")


Chunks created: 55


✅ Stored in Chroma at: /Volumes/Untitled/Lessons_By_Week/Project_Rag/Final_Codes/chroma

✅ Created 'retriever' variable.
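
One caveat with the cell above: `CHROMA_DIR` is persisted and `add_documents` is called without IDs, so re-running the cell will most likely store a second copy of every chunk. A possible remedy, sketched below, is to derive a stable ID for each chunk and pass the list via `ids=`. The helper `chunk_ids` is hypothetical, and the sketch assumes that the LangChain Chroma wrapper upserts when explicit IDs are supplied; check this against your installed version.

```python
import hashlib


def chunk_ids(source: str, chunks: list[str]) -> list[str]:
    """Derive a stable ID per chunk from the source name, chunk position, and chunk text."""
    ids = []
    for index, chunk in enumerate(chunks):
        # Including the index keeps IDs unique even if two chunks have identical text.
        key = f"{source}\x00{index}\x00{chunk}".encode("utf-8")
        ids.append(hashlib.sha256(key).hexdigest()[:32])
    return ids


# Toy chunks for illustration only (not taken from the source file).
example_chunks = ["Amsterdam has canals.", "Tulip fields bloom in spring.", "Amsterdam has canals."]
ids = chunk_ids("RAG_TEXT.txt", example_chunks)
print(ids)

# Hypothetical use with the store built above (not executed here):
# vs.add_documents(docs, ids=chunk_ids(p.name, [d.page_content for d in docs]))
```

Because the IDs depend only on the file name, position, and text, running the ingest twice on an unchanged file yields the same IDs both times.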


In [7]:
## 6.5) Upgrade to Multi-Query Retriever
from langchain.retrievers.multi_query import MultiQueryRetriever
import logging

# Optional: Turn on logging so you can see the different questions the AI generates
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

if USE_OPENAI:
    # This uses the LLM (gpt-4o-mini) to generate variations of the question
    # and retrieve documents for all variations.
    retriever = MultiQueryRetriever.from_llm(
        retriever=vs.as_retriever(search_kwargs={"k": 5}),
        llm=llm
    )
    print("✅ Multi-Query Retriever (OpenAI) is active.")
    print("   (The system will now generate variations of your question for better search results.)")

else:
    # Fallback: Multi-query requires a strong instruction-following LLM.
    # Smaller local models (like flan-t5) often fail the strict formatting requirements.
    retriever = vs.as_retriever(search_kwargs={"k": 5})
    print("ℹ️ Using standard retriever (Local Model).")

✅ Multi-Query Retriever (OpenAI) is active.
   (The system will now generate variations of your question for better search results.)
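
To make the multi-query idea concrete, the sketch below mimics it without any LLM or vector store: several rewritten queries are each run through a search function, and the hits are merged into an order-preserving unique union. This is a toy approximation of what `MultiQueryRetriever` does, not its implementation; `multi_query_retrieve`, `toy_rewrite`, `toy_search`, and `CORPUS` are all invented for the example.

```python
from typing import Callable


def multi_query_retrieve(
    question: str,
    rewrite: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
) -> list[str]:
    """Run every rewritten query and return the order-preserving unique union of hits."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in rewrite(question):
        for hit in search(query):
            if hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged


# Toy corpus standing in for the stored chunks.
CORPUS = [
    "The Netherlands is known for canals and tulip fields.",
    "Amsterdam hosts the Rijksmuseum.",
    "Windmills dot the Dutch countryside.",
]


def toy_rewrite(question: str) -> list[str]:
    """Stand-in for the LLM: returns fixed query variants regardless of the question."""
    return ["Netherlands canals", "Dutch landmarks", "Amsterdam canals"]


def toy_search(query: str) -> list[str]:
    """Stand-in for vector search: return chunks sharing at least one word with the query."""
    words = set(query.lower().split())
    return [c for c in CORPUS if words & set(c.lower().rstrip(".").split())]


merged = multi_query_retrieve("Tell me about the Netherlands", toy_rewrite, toy_search)
for chunk in merged:
    print(chunk)
```

The third variant matches the first chunk again, but it is only returned once. The merged list can therefore be longer than `k` for a single query, which is worth keeping in mind when sizing the prompt.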


In [8]:
## 7) Build RAG Chain

from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):
    out = []
    for i, d in enumerate(docs):
        src = d.metadata.get("source", "")
        out.append(f"[{i}] {d.page_content}\n(source: {src})")
    return "\n\n".join(out)

SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the question only from the provided context. "
    "Be friendly and respond briefly to small talk. "
    "If the answer isn't present, say: 'I don't see that in the file.' "
    "You may engage in friendly conversation, but never fabricate facts outside the context when answering file-based questions. "
    "Give enough context so the user can understand."
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "{system}"),
    ("human", "Question: {question}\n\nContext:\n{context}\n\nAnswer succinctly:"),
])

# Build chain
if USE_OPENAI:
    chain = (
        RunnableParallel({
            "context": (retriever | format_docs),
            "question": RunnablePassthrough(),
            "system": (lambda _: SYSTEM_PROMPT),
        })
        | prompt
        | generator
        | StrOutputParser()
    )
    print("✅ RAG chain (OpenAI) is ready.")
else:
    # Emulate the same behavior for the local model in a function
    def answer_local(question: str) -> str:
        # invoke() is the current retriever entry point; get_relevant_documents() is deprecated
        ctx = format_docs(retriever.invoke(question))
        full_prompt = (
            f"{SYSTEM_PROMPT}\n\n"
            f"Question: {question}\n\n"
            f"Context:\n{ctx}\n\n"
            "Answer succinctly:"
        )
        return generator(full_prompt)

    chain = answer_local
    print("✅ RAG function (Local Transformers) is ready.")


✅ RAG chain (OpenAI) is ready.
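
To see what the model actually receives as `{context}`, the sketch below runs the same `format_docs` logic on two hand-written chunks. The `Doc` dataclass is only a minimal stand-in for LangChain's `Document` so the example runs without any dependencies, and the chunk texts are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for langchain_core.documents.Document (assumption for this demo)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Number each chunk and append its source, separated by blank lines."""
    out = []
    for i, d in enumerate(docs):
        src = d.metadata.get("source", "")
        out.append(f"[{i}] {d.page_content}\n(source: {src})")
    return "\n\n".join(out)


demo_docs = [
    Doc("Amsterdam is known for its canals.", {"source": "RAG_TEXT.txt"}),
    Doc("Tulip fields bloom in spring.", {"source": "RAG_TEXT.txt"}),
]
context = format_docs(demo_docs)
print(context)
```

The bracketed indices give the model a handle for each chunk, which could also be used to ask for cited answers.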


## 8) Ask Questions

Run the cells below to interact with your RAG pipeline.


In [9]:
# Define the 'ask' function
def ask(question: str):
    if not question.strip():
        return "Please enter a non-empty question."
    if callable(chain) and not hasattr(chain, "invoke"):
        # Local HF function path
        return chain(question)
    # OpenAI path via LangChain
    return chain.invoke(question)


In [None]:
# Run this cell to chat in the console.
# Stop with Ctrl+C (or by interrupting the kernel).
print("RAG chatbot is ready. Ask questions based on your file.")
try:
    while True:
        q = input("\nAsk a question (or press Enter to exit): ").strip()
        if not q:
            break
        print("\n--- Answer ---\n", ask(q))
except KeyboardInterrupt:
    print("\nChat session ended.")


RAG chatbot is ready. Ask questions based on your file.


INFO:langchain.retrievers.multi_query:Generated queries: ['1. Hello, how can I assist you today?  ', '2. What information or help are you looking for?  ', '3. Is there something specific you would like to discuss or inquire about?']



--- Answer ---
 Hi there! How can I assist you today? If you have any questions about travel or specific topics, feel free to ask!


INFO:langchain.retrievers.multi_query:Generated queries: ['What is your current state or condition?  ', 'How are you feeling today?  ', 'Can you describe your well-being at the moment?']



--- Answer ---
 I don't see that in the file. But I'm here to help! How about you?


INFO:langchain.retrievers.multi_query:Generated queries: ['What is the main topic or subject being discussed here?  ', 'Can you explain the purpose or focus of this content?  ', 'What information or themes does this material cover?']



--- Answer ---
 This is about the diverse aspects of Europe, including its art, culture, festivals, and natural wonders, highlighting the interplay between human creativity and the beauty of nature. It seems to explore the richness of experiences one can find while traveling through Europe.


INFO:langchain.retrievers.multi_query:Generated queries: ['What can you tell me about the Netherlands?  ', 'Can you provide information on the Netherlands?  ', 'What are some key facts about the Netherlands?']



--- Answer ---
 The Netherlands is known for its beautiful canals, tulip fields, and a strong cycling culture. Key attractions include Amsterdam’s Rijksmuseum and the Anne Frank House, which highlight Dutch creativity and resilience. The countryside features picturesque windmills and coastal dunes, creating an idyllic landscape.
