Building a PDF chat bot — Retrieval Augmented Generation (RAG)

Reference: https://medium.com/@prithiviraj7r/building-a-pdf-chat-bot-retrieval-augmented-generation-rag-0bcf6060bbd6

Date: Monday - 6th October 2025

List of components:
• Extract text from PDF docs
• Chunking/Segmentation of text
• Embedding text & Ingestion in Vectorstore
• Conversation using LLMs (OpenAI)

In [1]:
from PyPDF2 import PdfReader

def get_pdf_content(pdf_path):
    """Return the text of every page in the PDF at pdf_path, joined by newlines."""
    reader = PdfReader(pdf_path)
    content = []
    for page in reader.pages:
        # Image-only pages may yield no text, so fall back to an empty string
        content.append(page.extract_text() or "")

    return "\n".join(content)

In [2]:
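# Chunking

def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    # A step of zero or less would make the loop below run forever
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunks.append(text[start:end])
        # Advance by less than a full chunk so neighbours share context
        start += chunk_size - overlap

    return chunks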
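
To make the sliding-window arithmetic concrete, here is a small worked example on toy input. The helper name `window_starts` and the 26-character sample string are illustrative assumptions, not part of the notebook; the sketch only mirrors the idea that each window starts `chunk_size - overlap` characters after the previous one.

```python
def window_starts(text_length, chunk_size, overlap):
    """Start offsets of a window that advances by chunk_size - overlap (hypothetical helper)."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return list(range(0, text_length, step))


text = "abcdefghijklmnopqrstuvwxyz"  # 26 characters of toy input
chunk_size, overlap = 10, 4

starts = window_starts(len(text), chunk_size, overlap)
chunks = [text[s:s + chunk_size] for s in starts]

for s, c in zip(starts, chunks):
    print(s, repr(c))
```

With these numbers the step is 6, so consecutive full-size chunks share 4 characters. The final window can be shorter than `chunk_size` and may repeat text already covered by the one before it, which is usually harmless for retrieval.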

In [3]:
# Embeddings and Vector Store

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

def get_embeddings(chunks):
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2") # Fast and Free
    vector_store = FAISS.from_texts(chunks, embeddings)
    return vector_store
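
As a rough illustration of what retrieval from a vector store amounts to, the sketch below ranks chunks by cosine similarity to a query. Everything in it is a stand-in: `embed` is a toy word-count embedding rather than all-MiniLM-L6-v2, `top_k` is a hypothetical helper, and the sample chunks are made up. A real FAISS index performs an analogous nearest-neighbour search over dense vectors.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real sentence-embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


sample_chunks = [
    "invoices are due within thirty days",
    "the warranty covers parts and labour",
    "refunds are issued within ten days",
]
print(top_k("when are invoices due", sample_chunks, k=1))
```

The chunk that shares the most words with the query ranks first. With a learned embedding model the same ranking idea applies, but similarity reflects meaning rather than exact word overlap.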

In [4]:
# Conversation using LLMs (OpenAI)

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_openai import ChatOpenAI  # reads OPENAI_API_KEY from the environment

def get_conversation_chain(vector_store):
    # temperature=0 keeps answers as repeatable as the model allows
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    # Store the running dialogue so follow-up questions keep their context
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm,
        vector_store.as_retriever(),
        memory=memory
    )
    return conversation_chain
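
A minimal sketch of how the finished chain might be queried. It assumes the usual `ConversationalRetrievalChain` convention of a `question` input key and an `answer` output key. Both `ask` and `EchoChain` are hypothetical names introduced here; `EchoChain` is only a stand-in so the helper can be tried without a PDF or an API key.

```python
def ask(chain, question):
    """Send one question to a conversational chain and return the answer text.

    Assumes the chain takes {"question": ...} and returns a dict with "answer".
    """
    result = chain.invoke({"question": question})
    return result["answer"]


class EchoChain:
    """Hypothetical stand-in that mimics the chain's invoke() interface."""

    def invoke(self, inputs):
        return {"answer": f"You asked: {inputs['question']}"}


print(ask(EchoChain(), "What is the notice period?"))
```

With the notebook's own functions, the real chain would presumably be built by passing the output of get_pdf_content through chunk_text, get_embeddings, and get_conversation_chain in turn, then handed to `ask` in place of `EchoChain()`.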
