
# RAG Pipeline using LangChain + FAISS + OpenAI

This notebook walks through an end-to-end **Retrieval-Augmented Generation (RAG)** pipeline using:
- **LangChain** (loaders, chunking, retriever, chain)
- **FAISS** (vector index)
- **OpenAI** (embeddings + chat)
  
Process flow: **load → chunk → embed → index → retrieve → generate**


## 1) Imports and setup

In [22]:

import os
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader, TextLoader, UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

DATA_DIR = Path("data")
STORE_DIR = Path("store")
DATA_DIR.mkdir(exist_ok=True)
STORE_DIR.mkdir(exist_ok=True)

print("DATA_DIR:", DATA_DIR.resolve())
print("STORE_DIR:", STORE_DIR.resolve())


DATA_DIR: D:\RAG_with_LangChain\data
STORE_DIR: D:\RAG_with_LangChain\store


## 2) API key and model configuration

In [23]:

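# Set the OpenAI API key as an environment variable named OPENAI_API_KEY (for example, in the Windows environment settings).
# Fail early with a clear message if it is missing; assigning None to os.environ would raise a TypeError.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set. Add it as an environment variable and restart the kernel.")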

EMBED_MODEL = "text-embedding-3-small"
CHAT_MODEL  = "gpt-4o-mini"

print("Using embedding model:", EMBED_MODEL)
print("Using chat model:", CHAT_MODEL)


Using embedding model: text-embedding-3-small
Using chat model: gpt-4o-mini


## 3) Load document(s) into memory

In [24]:

# Fallback: if ./data is empty, create a small sample file, define_rag.txt
sample_path = DATA_DIR / "define_rag.txt"
if not any(DATA_DIR.iterdir()):
    # Adjacent string literals are concatenated, so each one ends with a space.
    sample_path.write_text(
        "Retrieval-Augmented Generation (RAG) is an AI framework that improves the performance of large language models (LLMs) by providing them with additional, external information before they generate a response. Instead of relying solely on the static data they were trained on, RAG allows LLMs to access an authoritative, up-to-date knowledge base. "
        "The basic RAG process involves these steps: "
        "Retrieval: When a user submits a query, the system first retrieves relevant documents from an external knowledge base, often using a vector database. "
        "Augmentation: The retrieved information is then used to augment the user's original query, essentially providing the LLM with additional context. "
        "Generation: The LLM receives this enriched prompt and generates a more accurate, relevant, and contextually aware response.",
        encoding="utf-8",
    )
    print("Created sample text file:", sample_path)

# Choose a loader based on the file extension; raise an error if the type is unsupported.
# Only the first entry found in ./data is loaded.
file_to_load = next(iter(DATA_DIR.iterdir()))
print("Loading:", file_to_load)

docs = []
ext = file_to_load.suffix.lower()
if ext == ".pdf":
    docs.extend(PyPDFLoader(str(file_to_load)).load())
elif ext in [".md", ".markdown"]:
    docs.extend(UnstructuredMarkdownLoader(str(file_to_load)).load())
elif ext == ".txt":
    docs.extend(TextLoader(str(file_to_load), autodetect_encoding=True).load())
else:
    raise ValueError(f"Unsupported file type: {ext}")

print(f"Loaded {len(docs)} document(s). Preview:")
print(docs[0].page_content[:300])


Loading: data\The Graduate School - Purdue University Northwest - Modern Campus Catalog™.pdf
Loaded 6 document(s). Preview:
2023-2024 Academic Catalog Purdue University Northwest
[ARCHIVED CATALOG]
The Graduate School
Office of Graduate Studies
Admission to the Graduate School
Academic Regulations
Undergraduate and Transfer Credit
RegistrationAdvisory Committees
Plan of Study
Admission to Candidacy
Oral and Written Exami


## 4) Chunking: split long text into overlapping passages

In [25]:
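# Chunking keeps each passage well within the models' token limits and makes retrieval more targeted.
# text-embedding-3-small → 8191 tokens per input (context limit)
# gpt-4o-mini → 128,000 tokens context window (input + output combined)
# Note: chunk_size and chunk_overlap below are measured in characters, not tokens.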

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunk(s). Showing first 1–2:")
for i, ch in enumerate(chunks[:2]):  # slice matches the "first 1–2" message above
    print(f"--- Chunk {i+1} ---\n{ch.page_content}\n")


Created 34 chunk(s). Showing first 1–2:
--- Chunk 1 ---
2023-2024 Academic Catalog Purdue University Northwest
[ARCHIVED CATALOG]
The Graduate School
Office of Graduate Studies
Admission to the Graduate School
Academic Regulations
Undergraduate and Transfer Credit
RegistrationAdvisory Committees
Plan of Study
Admission to Candidacy
Oral and Written Examinations
Graduation Deadlines
The Graduate School (Graduate Studies Office) oversees all aspects of graduate education at Purdue University Northwest.
This includes admissions and records (graduation), new courses, graduate staff employment, and program development. As
a unit of the system-wide Graduate School, Purdue University Northwest Graduate School coordinates all activities with
Purdue University Graduate School.
Click here for the policies and procedures of the Purdue Graduate School.

--- Chunk 2 ---
Purdue University Graduate School.
Click here for the policies and procedures of the Purdue Graduate School.
Office of Graduate St
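
Chunk 2 above starts with the last two lines of Chunk 1, which is the effect of `chunk_overlap`. The following is a minimal sketch of that overlap idea using a plain sliding window over characters. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraph and line boundaries). The function name `sliding_chunks` and the demo string are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


demo = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for c in demo:
    print(c)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each window repeats the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.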

## 5) Embeddings + FAISS index

In [26]:

embeddings = OpenAIEmbeddings(model=EMBED_MODEL)
vectorstore = FAISS.from_documents(chunks, embeddings)
print("FAISS index built in-memory.")


FAISS index built in-memory.
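
Conceptually, an exact vector index answers a query by comparing the query embedding against every stored embedding and returning the closest ones. The sketch below shows that idea in plain Python with Euclidean (L2) distance. The three-dimensional vectors, the names in `toy_index`, and the helper functions are made up for illustration; real embeddings have far more dimensions, and FAISS uses optimized native code rather than a Python loop.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, index, k=2):
    """Return the names of the k stored vectors closest to query_vec."""
    ranked = sorted(index.items(), key=lambda item: l2_distance(query_vec, item[1]))
    return [name for name, _ in ranked[:k]]


# Hypothetical "embeddings" for three chunks
toy_index = {
    "admissions": [0.9, 0.1, 0.0],
    "plan_of_study": [0.1, 0.9, 0.0],
    "graduation": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

print(nearest(toy_query, toy_index, k=2))
# ['admissions', 'plan_of_study']
```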


## 6) Retrieval - get relevant chunks for a question

In [30]:

retriever = vectorstore.as_retriever(
    search_type="mmr", 
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.35}
)

#question = "What does Degree-Seeking Applicants generally must submit?"
question = "What are the English Proficiency Requirement?"
#question = "Explain the Plan of Study"
results = retriever.invoke(question)
print("Top retrieved chunks (preview):")
for i, d in enumerate(results):
    print(f"\n[{i+1}] {d.metadata.get('source', 'doc')} | page={d.metadata.get('page')}")
    print(d.page_content[:400])


Top retrieved chunks (preview):

[1] data\The Graduate School - Purdue University Northwest - Modern Campus Catalog™.pdf | page=3
below). The application for certificate study can be found at https://gradapply. purdue.edu/appl y/
English Proficiency Requirement
International applicants whose native language is not English are required to provide proof of English proficiency at the
time of recommendation for admission to degree, certificate, non-degree, and teacher license graduate programs in one of
the following ways:
a. A 

[2] data\The Graduate School - Purdue University Northwest - Modern Campus Catalog™.pdf | page=3
lege may require grades higher than C- in certain courses or in all courses on the plan of study. Pass-fail
or satisfactory/unsatisfactory grades are not acceptable for inclusion on the graduate plan of study, although those courses
may be a requirement for the degree. (Required courses which do not qualify for inclusion in the plan of study because they
do not receive
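
The retriever above uses `search_type="mmr"` (maximal marginal relevance): it first fetches `fetch_k` candidates by similarity, then greedily picks `k` of them, trading relevance to the query against similarity to chunks already chosen. A `lambda_mult` near 1 favors relevance, and a value near 0 favors diversity. The following is a minimal pure-Python sketch of that greedy selection, not LangChain's implementation; the function names, the two-dimensional vectors, and the candidate names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, candidates, k, lambda_mult):
    """Greedy MMR over {name: vector}; returns names in selection order."""
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < k:
        def score(name):
            relevance = cosine(query_vec, remaining[name])
            # Highest similarity to anything already picked (0 if nothing picked yet)
            redundancy = max(
                (cosine(remaining[name], candidates[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical candidate pool: "a" and "a_dup" are near-duplicates, "b" is different
toy_query = [1.0, 0.0]
toy_pool = {"a": [1.0, 0.1], "a_dup": [1.0, 0.12], "b": [0.6, 0.8]}

print(mmr_select(toy_query, toy_pool, k=2, lambda_mult=0.35))  # ['a', 'b']
print(mmr_select(toy_query, toy_pool, k=2, lambda_mult=1.0))   # ['a', 'a_dup']
```

With `lambda_mult=0.35`, the near-duplicate is skipped in favor of a less similar but more novel chunk. With `lambda_mult=1.0`, the selection reduces to plain similarity ranking.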

## 7) Generation - answer using retrieved context (RAG)

In [28]:

llm = ChatOpenAI(model=CHAT_MODEL, temperature=0.3)

prompt = PromptTemplate(
    input_variables=["question", "context"],
    template=(
        "You are a precise assistant. Use ONLY the provided context. "
        "If the answer is not in the context, say you don't know.\n\n"
        "Question: {question}\n\nContext:\n{context}\n\nAnswer:"
    )
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt, "document_variable_name": "context"}
)

# invoke() returns a dict with the keys "query" and "result"
answer = qa_chain.invoke(question)
print("Q:", question)
print("\nAnswer:\n", answer)


Q: What does Degree-Seeking Applicants generally must submit?

Answer:
 {'query': 'What does Degree-Seeking Applicants generally must submit?', 'result': 'Degree-Seeking Applicants generally must submit:\n\n1. A completed online application.\n2. Three letters of recommendation, or as directed by the department or program.\n3. Other requirements, as detailed by individual departments and colleges, which typically include a goal statement or statement of purpose, and/or a copy of any relevant professional license, a resume, or other documents as required by the application.\n4. Test scores or other demonstration of academic ability for graduate work, which may include GRE scores and English proficiency scores for international applicants.'}
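
The `"stuff"` chain type places all retrieved chunks into a single prompt. The sketch below imitates that step with plain string formatting, reusing the template text from the cell above. It assumes the chunks are joined with a blank line between them; the helper name `stuff_prompt` and the sample passages are hypothetical, and no model is called.

```python
TEMPLATE = (
    "You are a precise assistant. Use ONLY the provided context. "
    "If the answer is not in the context, say you don't know.\n\n"
    "Question: {question}\n\nContext:\n{context}\n\nAnswer:"
)


def stuff_prompt(question: str, chunk_texts: list[str]) -> str:
    """Join all chunk texts into one context block and fill the template."""
    context = "\n\n".join(chunk_texts)  # assumed separator between chunks
    return TEMPLATE.format(question=question, context=context)


# Hypothetical retrieved passages
toy_chunks = ["First retrieved passage.", "Second retrieved passage."]
filled = stuff_prompt("What is in the passages?", toy_chunks)
print(filled)
```

Because every chunk is placed in one prompt, the total size of the retrieved chunks plus the question has to fit within the chat model's context window, which is one reason to keep `k` and `chunk_size` modest.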


In [31]:
# Show both the answer and the source chunks passed to the model as context
qa_chain_sources = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt, "document_variable_name": "context"},
    return_source_documents=True
)

result = qa_chain_sources.invoke(question)

print("=" * 80)
print("QUESTION:", question)
print("=" * 80)
print("\nANSWER:\n")
print(result["result"])
print("\n" + "=" * 80)
print("SOURCE CHUNKS (used for context):\n")

for i, doc in enumerate(result["source_documents"], start=1):
    source = doc.metadata.get("source", "unknown source")
    page = doc.metadata.get("page", "N/A")
    print(f"[{i}] {source} | Page: {page}")
    print(doc.page_content[:500], "\n" + "-" * 60)


QUESTION: What are the English Proficiency Requirement?

ANSWER:

International applicants whose native language is not English are required to provide proof of English proficiency in one of the following ways: 

a. A TOEFL (Test of English as a Foreign Language) score of 80, including minimum scores of Writing 18, Speaking 18, Listening 14, and Reading 19. 

Note that in addition to required minimum scores for each category, the Graduate School also requires a minimum overall score that is higher than the minimums for the four area tests combined.

SOURCE CHUNKS (used for context):

[1] data\The Graduate School - Purdue University Northwest - Modern Campus Catalog™.pdf | Page: 3
below). The application for certificate study can be found at https://gradapply. purdue.edu/appl y/
English Proficiency Requirement
International applicants whose native language is not English are required to provide proof of English proficiency at the
time of recommendation for admission to degree, certifica