<a href="https://colab.research.google.com/github/Joseph89155/retrieval-augmented-qa-pipeline/blob/main/Retrieval_Augmented_Generation_Pipeline_using_LangChain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠🔍 Generative AI and RAG (Retrieval-Augmented Generation) Project

## 📌 Overview
In this project, we combine **Generative AI** with **Retrieval-Augmented Generation (RAG)** to build a question-answering pipeline. The system retrieves relevant chunks from a document and passes them to a Large Language Model (LLM), which generates a context-aware answer. Unlike a QA system that relies only on what the model memorized during training, this approach grounds responses in the source document, which reduces (but does not eliminate) hallucinated information.

## 🎯 Objectives
- To extract and store key information from a PDF document using **chunking** and **semantic embeddings**
- To use **FAISS**, a vector similarity search library, for fast and efficient retrieval
- To integrate **generative models** using Hugging Face Transformers and LangChain for creating detailed answers
- To compare **document-grounded answers** with generic ones and highlight the benefits of RAG
- To apply **prompt engineering** techniques for guiding LLMs toward clearer and more relevant responses

## 🧩 Key Technologies Used
- **Python** (Google Colab environment)
- **LangChain** for chaining LLMs and retrieval components
- **Sentence-Transformers** for creating vector embeddings
- **FAISS** for similarity search
- **Hugging Face Transformers** for using pre-trained LLMs
- **PyMuPDF (fitz)** for PDF parsing (**pdfplumber** is an alternative that this notebook does not use)

## 🌍 Real-World Relevance
RAG pipelines are increasingly important in real-world NLP applications:
- **Customer support chatbots** grounded in documentation
- **Legal and compliance assistants** referencing internal policies
- **Medical assistants** referring to health records and guidelines
- **Enterprise knowledge bases** for internal document search and summarization

By completing this project, we simulate how modern AI solutions use internal knowledge sources to produce answers that are grounded in, and traceable to, source documents. This is a growing requirement in regulated and high-stakes environments.

---


## 📄 Step 1: Uploading and Reading the PDF

We begin by uploading our document, *"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"* (Lewis et al., arXiv:2005.11401), which introduces the RAG architecture. This foundational paper acts as our knowledge source for querying. We load the PDF, extract its text, and inspect the content before chunking it into retrievable segments.


In [8]:
# Install required library
!pip install -q PyMuPDF

# Import necessary modules
import fitz  # PyMuPDF

# Load the PDF file
pdf_path = "2005.11401v4.pdf"  # Ensure this matches your uploaded filename

# Extract the text of every page, closing the file when done
with fitz.open(pdf_path) as doc:
    all_text = "".join(page.get_text() for page in doc)

# Display a sample of the text
print(all_text[:3000])  # Show first 3000 characters

Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks
Patrick Lewis†‡, Ethan Perez⋆,
Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,
Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†
†Facebook AI Research; ‡University College London; ⋆New York University;
plewis@fb.com
Abstract
Large pre-trained language models have been shown to store factual knowledge
in their parameters, and achieve state-of-the-art results when ﬁne-tuned on down-
stream NLP tasks. However, their ability to access and precisely manipulate knowl-
edge is still limited, and hence on knowledge-intensive tasks, their performance
lags behind task-speciﬁc architectures. Additionally, providing provenance for their
decisions and updating their world knowledge remain open research problems. Pre-
trained models with a differentiable access mechanism to explicit non-parametric
memory have so far been only investigated for extractive downstream t

## Step 2: Splitting the Document into Chunks

To enable semantic search and retrieval, we split the full document into smaller text chunks. These chunks provide the retriever with granular segments of the document, allowing the system to identify and return the most relevant context for any query.

We use a **sliding window approach** with slight overlaps to maintain semantic continuity across chunks. Each chunk will be embedded and stored in the FAISS vector store during the next step.
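
To make the window arithmetic concrete, here is a minimal sketch that computes only the word-index boundaries of each window. The helper name `window_spans` and the toy sizes are assumptions for illustration; the stepping rule (advance by `chunk_size - overlap`) is the one used by the chunking function in this step.

```python
from typing import List, Tuple


def window_spans(n_words: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return the (start, end) word indices covered by each sliding window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    spans = []
    start = 0
    while start < n_words:
        spans.append((start, min(start + chunk_size, n_words)))
        start += step
    return spans


# Toy example: 12 words, windows of 5 words, 2 words shared between neighbours
print(window_spans(12, 5, 2))  # [(0, 5), (3, 8), (6, 11), (9, 12)]
```

Each window starts 3 words after the previous one, so neighbouring windows share 2 words. One edge case is worth knowing: with 11 words instead of 12, the final window is `(9, 11)`, which is already fully covered by the previous window `(6, 11)`, so a very short last chunk can consist of overlap only.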


In [9]:
from typing import List
import re

def clean_text(text: str) -> str:
    # Basic cleaning to remove excessive whitespace and line breaks
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """
    Splits text into overlapping chunks using a sliding window approach.
    :param text: The full document text
    :param chunk_size: Max number of words per chunk
    :param overlap: Number of overlapping words between chunks (must be smaller than chunk_size)
    :return: List of chunked text segments
    """
    # Guard against a window that never advances (the loop below would not terminate)
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    chunks = []
    start = 0
    step = chunk_size - overlap  # Distance the window slides each iteration

    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        start += step

    return chunks

# Clean and split the document
cleaned_text = clean_text(all_text)
chunks = split_into_chunks(cleaned_text, chunk_size=500, overlap=50)

# Display some chunks
print(f"Total Chunks Created: {len(chunks)}\n")
print(f"Sample Chunk [0]:\n{chunks[0][:1000]}")

Total Chunks Created: 22

Sample Chunk [0]:
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks Patrick Lewis†‡, Ethan Perez⋆, Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†, Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela† †Facebook AI Research; ‡University College London; ⋆New York University; plewis@fb.com Abstract Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when ﬁne-tuned on down- stream NLP tasks. However, their ability to access and precisely manipulate knowl- edge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-speciﬁc architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre- trained models with a differentiable access mechanism to explicit non-parametric memory have so far been o
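
The sample chunk above still carries PDF extraction artifacts: single-glyph ligatures (as in "ﬁne-tuned") and words split by end-of-line hyphenation ("down- stream", "knowl- edge"). These can hurt both embedding quality and readability. The following is a hedged sketch of an optional pre-cleaning step; the function name `normalize_pdf_text` is hypothetical and the step is not part of the original pipeline.

```python
import re
import unicodedata


def normalize_pdf_text(raw: str) -> str:
    """Repair common PDF extraction artifacts before whitespace is collapsed."""
    # NFKC folds compatibility characters, e.g. the ligature U+FB01 becomes "fi"
    text = unicodedata.normalize("NFKC", raw)
    # Re-join words split by a hyphen at the end of a line: "down-\nstream" -> "downstream"
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)
    return text


sample = "when \ufb01ne-tuned on down-\nstream NLP tasks"
print(normalize_pdf_text(sample))  # when fine-tuned on downstream NLP tasks
```

If used, this would have to run on the raw text before `clean_text`, because the hyphen repair relies on the newline that `clean_text` replaces with a space. Two caveats apply: a genuine compound broken across lines (for example "state-of-the-" followed by "art") would lose its hyphen, and NFKC also rewrites other compatibility characters such as superscript digits.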

## Step 3: Creating Embeddings and FAISS Vector Store

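We use the `sentence-transformers` model `all-MiniLM-L6-v2`, loaded through LangChain's `HuggingFaceEmbeddings` wrapper, to convert each text chunk into a dense vector (embedding) that captures its semantic meaning. These embeddings are stored in a FAISS index, which supports fast nearest-neighbor search and forms the backbone of our document retriever. With only 22 chunks, an exact (brute-force) search is more than fast enough; approximate indexes matter mainly for much larger collections.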
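
As a rough picture of what the retriever does with these vectors, the sketch below ranks a few made-up 2-D vectors by squared L2 distance to a query, which is the computation a flat, brute-force index performs. The vectors, the helper names `squared_l2` and `top_k`, and the choice of L2 distance are assumptions for illustration; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and FAISS does this work in optimized native code.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, distance) for the k vectors closest to the query."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy "embeddings": indices 0 and 2 point in nearly the same direction as the query
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
toy_query = [1.0, 0.05]
print([i for i, _ in top_k(toy_query, toy_vectors, k=2)])  # [0, 2]
```

The retriever in this notebook follows the same pattern: embed the question, find the `k` closest chunk vectors, and return the corresponding chunks.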


In [10]:
!pip install -q langchain_community

In [14]:
# Embedding & Vector Store (LangChain FAISS)

# Install the necessary packages (if not already done)
!pip install -q faiss-cpu sentence-transformers langchain

# Community integrations are imported from langchain_community (installed above)
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.documents import Document

# Wrap each chunk into a LangChain Document
documents = [Document(page_content=chunk) for chunk in chunks]

# Embed with all-MiniLM-L6-v2, a compact sentence-transformers model
embedding_function = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Build the FAISS index using LangChain's wrapper (handles docstore + IDs)
vectorstore = FAISS.from_documents(documents, embedding_function)

# Confirm successful build
print(f"✅ LangChain-compatible FAISS index created with {len(documents)} documents.")




✅ LangChain-compatible FAISS index created with 22 documents.


## Step 4: Integrating a Language Model with the Vector Store (RAG Pipeline)

In this step, we build a complete Retrieval-Augmented Generation (RAG) pipeline using LangChain. This includes:
- Connecting a language model (LLM) from Hugging Face
- Wrapping our FAISS index in a retriever interface
- Creating a RAG chain that retrieves relevant chunks and passes them to the LLM for contextual, grounded answers
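
The three parts listed above can be sketched in plain Python to show how data flows through the chain. This is a simplified stand-in, not LangChain's implementation: `answer_with_retrieval`, `toy_retrieve`, `toy_generate`, and the three-sentence corpus are all hypothetical, and the real retriever and LLM are replaced by trivial functions so the example runs without any model download.

```python
import re
from typing import Callable, Dict, List

TOY_CORPUS = [
    "RAG-Token can use a different document for each generated token.",
    "RAG-Sequence uses the same document for the whole output sequence.",
    "FAISS is a library for similarity search.",
]


def tokens(text: str) -> set:
    """Lowercased word set; hyphens are kept so 'rag-token' stays one token."""
    return set(re.findall(r"[\w-]+", text.lower()))


def toy_retrieve(question: str, k: int) -> List[str]:
    """Stand-in retriever: rank chunks by the number of words shared with the question."""
    q = tokens(question)
    ranked = sorted(TOY_CORPUS, key=lambda c: len(q & tokens(c)), reverse=True)
    return ranked[:k]


def toy_generate(prompt: str) -> str:
    """Stand-in LLM: reports the prompt size instead of generating text."""
    return f"(model output for a prompt of {len(prompt.split())} words)"


def answer_with_retrieval(question: str,
                          retrieve: Callable[[str, int], List[str]],
                          generate: Callable[[str], str],
                          k: int = 3) -> Dict[str, object]:
    """Retrieve k chunks, place them in one prompt, and generate an answer."""
    chunks = retrieve(question, k)
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return {"result": generate(prompt), "source_documents": chunks}


out = answer_with_retrieval("How does RAG-Token differ from RAG-Sequence?",
                            toy_retrieve, toy_generate, k=2)
print(out["source_documents"])
```

The returned dictionary mirrors the `result` and `source_documents` keys that the chain in this step exposes. Note that every retrieved chunk is pasted into a single prompt, so the prompt grows with both `k` and the chunk size.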


In [15]:
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline  # community integration (installed in Step 3)
from langchain.chains import RetrievalQA

# Load a lightweight text2text model
qa_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    tokenizer="google/flan-t5-base",
    max_length=512
)

# Wrap the pipeline for LangChain
llm = HuggingFacePipeline(pipeline=qa_pipeline)

# Build the RAG-style RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Run a test query
query = "What is RAG-Token and how does it differ from RAG-Sequence?"
result = qa_chain.invoke(query)  # ✅ .invoke instead of deprecated __call__

# Display the output
print("🔍 Question:", query)
print("🧠 RAG Answer:\n", result['result'])


Device set to use cpu
Token indices sequence length is longer than the specified maximum sequence length for this model (2730 > 512). Running this sequence through the model will result in indexing errors


🔍 Question: What is RAG-Token and how does it differ from RAG-Sequence?
🧠 RAG Answer:
 The RAG-Token model can be seen as a standard, autoregressive seq2seq genera- tor with transition probability: p′ (yi|x, y1:i1) = P ztop-k(p(|x)) p(zi|x) p(yi|x, zi, y1:i1) To decode, we can plug p′ (yi|x, y1:i1) into a standard beam decoder. For RAG-Sequence models, we report test results using 50 retrieved documents, and we use the Thorough Decoding approach since answers are generally short. We use greedy decoding for QA as we did not find beam search improved results. For Open-Domain QA, multiple answer annotations are often available for a given question. These answer annotations are exploited by extractive models during training as typically all the answer annotations are used to find matches within documents when preparing training data. For RAG, we also make use of multiple annotation examples for Natural Questions and WebQuestions by training the model with each (q, a) pair separately, leadi
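
The warning in the output above is significant: the assembled prompt was 2730 tokens long, while the model accepts 512. Three retrieved chunks of up to 500 words each cannot fit, which likely explains why the answer reads like a copied excerpt rather than a synthesized explanation. One possible remedy, sketched below under stated assumptions, is to keep retrieved chunks in ranked order only while they fit a token budget. The function name `fit_chunks_to_budget` is hypothetical, and the default whitespace word count is only a stand-in for a real tokenizer.

```python
from typing import Callable, List


def fit_chunks_to_budget(chunks: List[str],
                         max_tokens: int,
                         count_tokens: Callable[[str], int] = lambda s: len(s.split())) -> List[str]:
    """Keep the highest-ranked chunks whose combined token count fits the budget."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:  # chunks are assumed to be sorted best-first
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that would overflow the budget
        kept.append(chunk)
        used += cost
    return kept


ranked = ["a b c d", "e f g", "h i j k l"]
print(fit_chunks_to_budget(ranked, max_tokens=8))  # ['a b c d', 'e f g']
```

In the notebook, `count_tokens` could be replaced with something like `lambda s: len(qa_pipeline.tokenizer.encode(s))`, and the budget should leave room for the question and the prompt wording. Simpler alternatives are a smaller `chunk_size` in Step 2 or a lower `k` in the retriever.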

## Step 5: Prompt Engineering and Comparative Evaluation

In this step, we compare answers produced with and without retrieval. We ask the same question of two systems built on the same model (`google/flan-t5-base`):
- ❌ A generic LLM with no access to source documents (pure parametric knowledge)
- ✅ A RAG system that retrieves context from a document index before generating an answer

This helps us observe the impact of retrieval grounding on the quality, specificity, and accuracy of answers.
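
Prompt wording is the other lever for answer quality. The following is a minimal sketch of a grounded prompt template that tells the model to stay within the retrieved context and to admit when the context is insufficient. The template text, `GROUNDED_TEMPLATE`, and `build_grounded_prompt` are illustrative assumptions and are not used by the chain in this notebook.

```python
from typing import List

GROUNDED_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say \"I don't know\".\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer in two or three sentences:"
)


def build_grounded_prompt(question: str, chunks: List[str]) -> str:
    """Fill the template with the retrieved chunks and the user question."""
    # A visible separator helps the model tell the retrieved passages apart
    context = "\n---\n".join(chunk.strip() for chunk in chunks)
    return GROUNDED_TEMPLATE.format(context=context, question=question)


prompt = build_grounded_prompt(
    "What is RAG-Token and how does it differ from RAG-Sequence?",
    ["  first retrieved passage  ", "second retrieved passage"],
)
print(prompt)
```

A template with `{context}` and `{question}` placeholders can, to our understanding, be wrapped in LangChain's `PromptTemplate` and supplied to `RetrievalQA.from_chain_type` through `chain_type_kwargs={"prompt": ...}`; treat that wiring as an assumption to check against the installed version.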


In [16]:
from transformers import pipeline

# Non-RAG: Plain model with no access to documents
generic_llm = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    tokenizer="google/flan-t5-base",
    max_length=512
)

# Same test question
question = "What is RAG-Token and how does it differ from RAG-Sequence?"

# Generate generic (non-grounded) response
generic_response = generic_llm(question)[0]['generated_text']

# Generate RAG-based (document-grounded) response
rag_response = qa_chain.invoke(question)['result']

# Compare outputs
print("❌ Generic LLM Response:\n", generic_response)
print("\n" + "="*80 + "\n")
print("✅ RAG-Grounded Response:\n", rag_response)


Device set to use cpu


❌ Generic LLM Response:
 RAG-Token


✅ RAG-Grounded Response:
 The RAG-Token model can be seen as a standard, autoregressive seq2seq genera- tor with transition probability: p′ (yi|x, y1:i1) = P ztop-k(p(|x)) p(zi|x) p(yi|x, zi, y1:i1) To decode, we can plug p′ (yi|x, y1:i1) into a standard beam decoder. For RAG-Sequence models, we report test results using 50 retrieved documents, and we use the Thorough Decoding approach since answers are generally short. We use greedy decoding for QA as we did not find beam search improved results. For Open-Domain QA, multiple answer annotations are often available for a given question. These answer annotations are exploited by extractive models during training as typically all the answer annotations are used to find matches within documents when preparing training data. For RAG, we also make use of multiple annotation examples for Natural Questions and WebQuestions by training the model with each (q, a) pair separately, leading to a small increase i