# LangChain Zero to Mastery: Part 3 - Document Processing with RAG


## 1. Introduction to RAG

Retrieval-Augmented Generation (RAG) combines retrieval systems and LLMs to generate answers based on external knowledge sources.
In this part, we'll build a system that retrieves relevant documents and generates accurate answers.
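
Before bringing in LangChain, the retrieve-then-generate flow can be sketched in plain Python. This is only a toy illustration: the helper names `tokenize`, `retrieve`, and `build_prompt` are hypothetical, retrieval is done by counting shared words rather than by embeddings, and no LLM is called. It shows the two halves of RAG: pick the most relevant passage, then place it in the prompt ahead of the question.

```python
# Toy retrieve-then-generate flow; no LLM or vector database is involved.
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, corpus, k=1):
    """Rank passages by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(corpus, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"


corpus = [
    "LangChain enables seamless integration with LLMs.",
    "RAG systems use vector databases to retrieve relevant information.",
    "FAISS is a library for efficient similarity search.",
]
question = "What is FAISS used for?"
top_passages = retrieve(question, corpus, k=1)
prompt = build_prompt(question, top_passages)
print(prompt)
```

In a real system the word-overlap ranking is replaced by an embedding-based similarity search, and the prompt is sent to an LLM.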

---


## 2. Installing Dependencies

Make sure LangChain, the OpenAI client, FAISS, and pypdf are installed.


In [22]:
!pip install langchain openai faiss-cpu
!pip install pypdf

Collecting pypdf
  Downloading pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.1.0-py3-none-any.whl (297 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.1.0
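
A quick way to confirm the environment is ready is to check that each package can be imported and that an API key is set. The sketch below uses only the standard library. The helper name `missing_modules` is hypothetical, the module list assumes the usual import names (`faiss-cpu` is imported as `faiss`), and it assumes the OpenAI classes read the key from the `OPENAI_API_KEY` environment variable.

```python
# Check that the tutorial's packages are importable and an API key is set.
import importlib.util
import os

# Import names for the packages installed above (faiss-cpu imports as "faiss").
REQUIRED_MODULES = ["langchain", "openai", "faiss", "pypdf"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print(f"Missing packages: {missing}")
else:
    print("All required packages are importable.")

# Assumption: the OpenAI classes pick up the key from this environment variable.
has_api_key = bool(os.environ.get("OPENAI_API_KEY"))
print(f"OPENAI_API_KEY set: {has_api_key}")
```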


In [7]:
# Import necessary modules
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI


---


## 3. Loading and Splitting Documents

The first example uses a small in-memory sample. The second example below shows how to work with multiple PDF documents.


In [3]:
# Load sample documents (you can replace this part with your own data)
documents = [
    {"content": "LangChain enables seamless integration with LLMs.", "metadata": {"title": "LangChain Overview"}},
    {"content": "RAG systems use vector databases to retrieve relevant information.", "metadata": {"title": "RAG Systems"}},
    {"content": "FAISS is a library for efficient similarity search.", "metadata": {"title": "FAISS Library"}},
]

# Split documents into smaller chunks
text_splitter = CharacterTextSplitter(chunk_size=150, chunk_overlap=20)
split_docs = []
for doc in documents:
    split_docs.extend(
        [{"content": chunk, "metadata": doc["metadata"]} for chunk in text_splitter.split_text(doc["content"])]
    )

print("Split Documents:", split_docs)

Split Documents: [{'content': 'LangChain enables seamless integration with LLMs.', 'metadata': {'title': 'LangChain Overview'}}, {'content': 'RAG systems use vector databases to retrieve relevant information.', 'metadata': {'title': 'RAG Systems'}}, {'content': 'FAISS is a library for efficient similarity search.', 'metadata': {'title': 'FAISS Library'}}]
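
The output above is unchanged because every sample text is shorter than `chunk_size=150`, so nothing needs splitting. To see what `chunk_size` and `chunk_overlap` mean on longer text, here is a simplified sliding-window sketch. It is not LangChain's algorithm: as far as I know, `CharacterTextSplitter` first splits on a separator and then merges the pieces, whereas the hypothetical `chunk_text` below cuts purely by character position.

```python
# Simplified fixed-size chunking with overlap (an illustration only).
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)

# A text shorter than chunk_size comes back as a single chunk.
short_chunks = chunk_text("FAISS is a library.", chunk_size=150, chunk_overlap=20)
print(short_chunks)
```

Each chunk repeats the last three characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.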


### Working with multiple PDF files


In [23]:
import os
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document

# Define the directory containing your PDF files
pdf_directory = "./data"

# List all PDF files in the directory
pdf_files = [os.path.join(pdf_directory, f) for f in os.listdir(pdf_directory) if f.endswith(".pdf")]

# Initialize the text splitter
text_splitter = CharacterTextSplitter(chunk_size=150, chunk_overlap=20)

# Load and process each PDF file
split_docs = []
for pdf_file in pdf_files:
    # Load the PDF
    pdf_loader = PyPDFLoader(pdf_file)
    pdf_documents = pdf_loader.load()
    
    # Split the text into chunks and add to the list
    for doc in pdf_documents:
        chunks = text_splitter.split_text(doc.page_content)
        for chunk in chunks:
            split_docs.append(Document(page_content=chunk, metadata=doc.metadata))

print(f"Processed {len(pdf_files)} PDF files.")
print(f"Total split documents: {len(split_docs)}")


Processed 1 PDF files.
Total split documents: 8


---


## 4. Creating a Vector Store

The first cell builds the store from the sample dictionaries. The cell under "For multiple documents" builds it from the PDF chunks instead.


In [10]:
from langchain.schema import Document

# Convert dictionaries to Document objects
documents = [
    Document(page_content=doc["content"], metadata=doc["metadata"]) for doc in split_docs
]

# Initialize embeddings
embeddings = OpenAIEmbeddings()

# Build a FAISS vector store
vector_store = FAISS.from_documents(documents, embeddings)

# Save the vector store for reuse
vector_store.save_local("vector_store")


### For multiple documents


In [24]:
# Initialize embeddings
embeddings = OpenAIEmbeddings()

# Build a FAISS vector store
vector_store = FAISS.from_documents(split_docs, embeddings)

# Save the vector store for reuse
vector_store.save_local("vector_store_multiple_pdfs")

print("FAISS vector store saved as 'vector_store_multiple_pdfs'.")

FAISS vector store saved as 'vector_store_multiple_pdfs'.
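
Conceptually, a vector store keeps one vector per chunk and returns the chunks whose vectors are closest to the query vector. The sketch below mimics that with the standard library only. Everything in it is an assumption for illustration: `embed` is a word-count stand-in for a real embedding model, `ToyVectorStore` is a hypothetical class, and cosine similarity is my choice of metric, which may differ from what the FAISS index actually uses.

```python
# Toy vector store: word-count "embeddings" ranked by cosine similarity.
import math
import re
from collections import Counter


def embed(text):
    """Hypothetical stand-in for an embedding model: a word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Stores (text, vector) pairs and answers nearest-neighbour queries."""

    def __init__(self, texts):
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=1):
        """Return the k stored texts most similar to the query."""
        query_vector = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "LangChain enables seamless integration with LLMs.",
    "RAG systems use vector databases to retrieve relevant information.",
    "FAISS is a library for efficient similarity search.",
])
results = store.similarity_search("efficient similarity search library", k=1)
print(results)
```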


---


## 5. Building the Retrieval QA System


In [19]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI

# Define the OpenAI LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)  # Replace "gpt-4o-mini" with your desired model

# Define the retriever from your vector store
retriever = vector_store.as_retriever()

# Define the prompt template for combining documents
combine_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are an assistant with access to the following context. "
        "Use it to answer the question as accurately as possible.\n\n"
        "Context:\n{context}\n\n"
        "Question:\n{question}\n\n"
        "Answer:"
    ),
)

# Build the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Specifies how documents are combined
    retriever=retriever,
    chain_type_kwargs={"prompt": combine_prompt},
)

# Query the QA chain
query = "What is FAISS used for?"
response = qa_chain.run(query)
print(f"Question: {query}\nAnswer: {response}")


Question: What is FAISS used for?
Answer: FAISS is used for efficient similarity search.
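
The `"stuff"` chain type places all retrieved documents into the `{context}` slot of a single prompt. The sketch below reproduces that step by hand using the same template text as the cell above. The helper `stuff_prompt` is hypothetical, and joining the documents with a blank line is an assumption about the default separator.

```python
# Manual version of the "stuff" step: join retrieved texts and fill the template.
TEMPLATE = (
    "You are an assistant with access to the following context. "
    "Use it to answer the question as accurately as possible.\n\n"
    "Context:\n{context}\n\n"
    "Question:\n{question}\n\n"
    "Answer:"
)


def stuff_prompt(docs, question, separator="\n\n"):
    """Concatenate all documents into one context block and format the prompt."""
    context = separator.join(docs)
    return TEMPLATE.format(context=context, question=question)


retrieved = [
    "FAISS is a library for efficient similarity search.",
    "RAG systems use vector databases to retrieve relevant information.",
]
final_prompt = stuff_prompt(retrieved, "What is FAISS used for?")
print(final_prompt)
```

Because every retrieved chunk ends up in one prompt, this approach only works while the combined text fits in the model's context window.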


### For multiple documents


In [25]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI

# Define the OpenAI LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)  # Replace "gpt-4o-mini" with your desired model

# Define the retriever from your vector store
retriever = vector_store.as_retriever()

# Define the prompt template for combining documents
combine_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are an assistant with access to the following context. "
        "Use it to answer the question as accurately as possible.\n\n"
        "Context:\n{context}\n\n"
        "Question:\n{question}\n\n"
        "Answer:"
    ),
)

# Build the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Specifies how documents are combined
    retriever=retriever,
    chain_type_kwargs={"prompt": combine_prompt},
)

# Query the QA chain
query = "What is the method of TECOMINER?"
response = qa_chain.run(query)
print(f"Question: {query}\nAnswer: {response}")


Question: What is the method of TECOMINER?
Answer: TeCoMiner employs a method based on topological considerations regarding co-occurrence networks of terms, rather than relying on generative probabilistic models like many traditional topic modeling tools. It identifies topics as communities within these term networks. This approach allows users to explore topics interactively and provides advantages in terms of topic interpretation and control over topic granularity. The tool facilitates the discovery of semantically similar terms by mapping topic terms into a 300-dimensional vector space using a pre-trained fastText word embedding and applying agglomerative clustering based on Euclidean distance.


---


## 6. Querying the System


In [20]:
# First query example
query = "What is RAG?"
response = qa_chain.run(query)
print(f"Question: {query}\nAnswer: {response}")

# Another query
query = "What is FAISS used for?"
response = qa_chain.run(query)
print(f"Question: {query}\nAnswer: {response}")

Question: What is RAG?
Answer: RAG, or Retrieval-Augmented Generation, is a system that enhances the capabilities of language models by incorporating a retrieval mechanism to access relevant information from external sources, often using vector databases. This approach allows RAG systems to generate more informed and accurate responses by retrieving contextually relevant data during the generation process.
Question: What is FAISS used for?
Answer: FAISS is used for efficient similarity search.


---


## 7. Seamless ChatBot

In this part, we use `ConversationalRetrievalChain` instead of `RetrievalQA` because of the following differences:

### Comparison

| **Aspect**                | **RetrievalQA**                      | **ConversationalRetrievalChain**           |
| ------------------------- | ------------------------------------ | ------------------------------------------ |
| **Context Handling**      | Does not handle context (stateless). | Handles context (stateful, multi-turn).    |
| **Complexity**            | Simpler and faster.                  | More complex due to history handling.      |
| **Use Case**              | Single-turn QA.                      | Conversational, multi-turn dialogue.       |
| **Example Question Type** | "What is the capital of France?"     | "Who is he?" (after discussing Napoleon).  |
| **Conversation Memory**   | Not supported.                       | Maintains and uses conversational history. |

---

### Which One to Choose?

- **Use `RetrievalQA`** for:

  - Quick, one-off queries.
  - Tasks where conversational memory is unnecessary.
  - Simple document search or knowledge base QA.

- **Use `ConversationalRetrievalChain`** for:
  - Chatbots or virtual assistants.
  - Applications where user queries depend on previous interactions.
  - Enhanced natural language understanding in dialogue.
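
The stateless versus stateful difference can be illustrated without any LLM. In the sketch below, `ToyConversation` and `fake_answer` are hypothetical stand-ins: the class keeps a list of past turns and folds them into each follow-up before answering. A real conversational chain would instead ask the LLM to rewrite the follow-up as a standalone question, so treat this as a simplified picture of the idea only.

```python
# Stateless vs. stateful question answering with a stubbed "model".
class ToyConversation:
    """Keeps past (question, answer) turns and adds them to follow-up questions."""

    def __init__(self):
        self.chat_history = []

    def standalone_question(self, question):
        """Prefix the follow-up with earlier turns so references can be resolved."""
        if not self.chat_history:
            return question
        turns = " ".join(f"Q: {q} A: {a}" for q, a in self.chat_history)
        return f"{turns} Follow-up: {question}"

    def ask(self, question, answer_fn):
        """Answer using the history-aware question, then record the turn."""
        answer = answer_fn(self.standalone_question(question))
        self.chat_history.append((question, answer))
        return answer


def fake_answer(text):
    """Stub model: it only 'knows' the subject if Napoleon is mentioned in the input."""
    if "Napoleon" in text:
        return "Napoleon was a French emperor."
    return "I don't know who you mean."


# Stateless: the follow-up alone carries no subject.
stateless_reply = fake_answer("Who is he?")

# Stateful: the earlier turn supplies the missing subject.
conversation = ToyConversation()
first_reply = conversation.ask("Tell me about Napoleon.", fake_answer)
stateful_reply = conversation.ask("Who is he?", fake_answer)

print(stateless_reply)
print(stateful_reply)
```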


In [15]:
import os
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader
from langchain.schema import Document
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory


# Path to data and vector store
DATA_DIR = "../data"
VECTOR_STORE_PATH = "vector_store"

# Debug: Check DATA_DIR
print(f"Absolute DATA_DIR path: {os.path.abspath(DATA_DIR)}")

# Function to load documents
def load_documents(data_dir):
    print("load_documents function is called.")
    if not os.path.exists(data_dir):
        print(f"Error: The directory {data_dir} does not exist.")
        return []
    
    file_names = os.listdir(data_dir)
    print(f"Files in directory '{data_dir}': {file_names}")

    documents = []
    for file_name in file_names:
        if file_name.endswith(".pdf"):
            file_path = os.path.join(data_dir, file_name)
            print(f"Processing file: {file_path}")
            try:
                loader = PyPDFLoader(file_path)
                docs = loader.load()
                documents.extend(docs)
            except Exception as e:
                print(f"Error loading {file_path}: {e}")

    print(f"Loaded {len(documents)} documents from {data_dir}")
    return documents



# Function to create vector store if it doesn't exist
def create_vector_store(documents, vector_store_path):
    # Split documents into smaller chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    split_docs = text_splitter.split_documents(documents)

    # Create embeddings and vector store
    embeddings = OpenAIEmbeddings()
    vector_store = FAISS.from_documents(split_docs, embeddings)
    vector_store.save_local(vector_store_path)
    print(f"Vector store created and saved at {vector_store_path}.")
    return vector_store

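# Debug: Force vector store creation (no check for an existing store is made)
print("Vector store not found. Creating a new one...")
documents = load_documents(DATA_DIR)
if not documents:
    # Stop here: the retriever below cannot be built without a vector store
    raise RuntimeError("No documents were loaded; cannot build the vector store.")
vector_store = create_vector_store(documents, VECTOR_STORE_PATH)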

# Initialize retriever
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# Initialize memory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Initialize the chat model
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

# Create the ConversationalRetrievalChain
conversational_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory
)

# Conversational loop
print("PDF-based ChatBot is ready! Type your questions below. Type '/exit' to end the conversation.\n")
while True:
    user_input = input("You: ")
    if user_input.lower() == "/exit":
        print("ChatBot: Goodbye! Have a great day!")
        break

    query = {"question": user_input}
    response = conversational_chain.run(query)
    print(f"ChatBot: {response}")


Absolute DATA_DIR path: h:\Documents\Work\langchain-zero-to-mastery\data
Vector store not found. Creating a new one...
load_documents function is called.
Files in directory '../data': ['nke-10k-2023.pdf', 'TECOMINER Topic Discovery Through Term Community Detection - Mar 2021.pdf']
Processing file: ../data\nke-10k-2023.pdf
Processing file: ../data\TECOMINER Topic Discovery Through Term Community Detection - Mar 2021.pdf
Loaded 115 documents from ../data
Vector store created and saved at vector_store.
PDF-based ChatBot is ready! Type your questions below. Type '/exit' to end the conversation.

ChatBot: TeCoMiner is an interactive tool designed for exploring the topic content of text collections. Unlike traditional topic modeling tools that rely on generative probabilistic models, TeCoMiner is based on topological considerations of co-occurrence networks of terms. It allows users to identify topics, visualize them, and analyze their interrelations within large document datasets. The tool 

## 8. Summary

In this part, we:

- Split documents into manageable chunks.
- Created a FAISS-based vector store for similarity search.
- Built a Retrieval QA system using LangChain.
- Built a chatbot with LangChain's `ConversationalRetrievalChain`.

Thanks for following along this far, and keep up the good work! In the next part, we will explore custom agents and tool integrations for more complex applications. Stay tuned!
