# Building a Multi-PDF RAG Chatbot Using LangChain and Streamlit

Daniel Godden

## Project Overview

This project walks through the steps required to build a Multi-PDF Retrieval-Augmented Generation (RAG) Streamlit-based web application that enables users to read, process, and interact with PDF data through a conversational AI chatbot. The project was initially developed by Paras Madan and the original guide can be found [here](https://medium.com/gopenai/building-a-multi-pdf-rag-chatbot-langchain-streamlit-with-code-d21d0a1cf9e5).

### Streamlit: A Framework for Data Science Web Applications

Streamlit is an open-source Python framework specifically designed for creating custom web applications tailored to data science and machine learning projects. Its user-friendly interface allows data scientists and machine learning engineers to rapidly develop interactive, visually appealing, and shareable web apps directly from Python scripts. With Streamlit, developers can build complex applications without requiring extensive web development expertise, making it a powerful tool for bridging the gap between data analysis and user interaction.

### LangChain: Advanced Language Model Integration

LangChain is a versatile framework that empowers developers to build sophisticated applications leveraging large language models (LLMs). It excels in creating structured workflows that integrate external data sources and maintain context across interactions, making it ideal for developing conversational agents and other advanced AI-driven applications. LangChain allows developers to chain together multiple steps or tasks, facilitating complex use cases such as conversational AI, data-driven content generation, and more. With support for custom tools and flexible pipelines, LangChain enhances the functionality of language models, enabling the development of powerful, context-aware applications.

### Project Goals

The main objective of this project is to create a web application that allows users to upload multiple PDF documents, ask questions related to the content of these documents, and receive contextually relevant answers generated by a conversational AI chatbot. This is achieved through the following steps:

1. **PDF Text Extraction**: Using PyPDF2, the application extracts text from the uploaded PDF files.

2. **Text Chunking**: The extracted text is split into smaller chunks using LangChain's text-splitting tools, making it manageable for processing by the language model.

3. **Vector Store Creation**: The text chunks are converted into vector embeddings using spaCy, and these vectors are stored in a FAISS index, which allows for efficient similarity search and information retrieval.

4. **Conversational Chain Setup**: A conversational chain is established using LangChain, where the language model (OpenAI's GPT) interacts with the user, processing their queries and retrieving relevant information from the vector store.

5. **Streamlit Integration**: The entire process is wrapped into a user-friendly web application using Streamlit, providing an intuitive interface for users to upload PDFs, ask questions, and view the AI-generated responses.

### Why This Project Matters

With the increasing reliance on unstructured data, particularly in the form of PDFs, this project demonstrates how modern AI and web development tools can be harnessed to unlock the potential of such data. By integrating advanced language models with efficient retrieval mechanisms, the project offers a practical solution for interacting with and extracting value from complex documents.


## Pip Installs

This code cell uses the `%pip install` command to install a set of Python libraries necessary for the notebook's operations. Below is a detailed explanation of each library being installed:

1. **streamlit**: 
   - A framework for creating interactive web applications directly from Python scripts. It is particularly useful for data science and machine learning projects, allowing you to quickly build and deploy data apps.

2. **PyPDF2**: 
   - A library used for working with PDF files in Python. It supports tasks such as extracting text, merging PDFs, and more.

3. **python-dotenv**: 
   - This library is used to load environment variables from a `.env` file into the environment. It's useful for managing secret keys and configuration settings without hardcoding them in the script.

4. **Langchain**: 
   - A framework designed to simplify the development of applications powered by large language models. It provides tools for chaining together different components like prompts, memory, and agents.

5. **langchain_community**: 
   - A collection of community-driven tools, extensions, and integrations for the Langchain framework.

6. **langchain_anthropic**: 
   - A specific extension of Langchain designed to work with models provided by Anthropic, a company known for developing advanced AI models.

7. **langchain_openai**: 
   - Another extension of Langchain tailored for integrating with OpenAI's language models, making it easier to build applications that utilize OpenAI's capabilities.

8. **spaCy**:
   - An advanced Natural Language Processing (NLP) library in Python, designed for production use. It offers pre-trained models, tokenization, part-of-speech tagging, named entity recognition, and more.

9. **faiss-cpu**:
   - The CPU-only build of FAISS (Facebook AI Similarity Search), a library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors, commonly used in tasks involving large datasets and machine learning.

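This command downloads and installs the latest available version of each library along with its dependencies. Because no versions are pinned, a later install may pull in newer releases than the ones this notebook was written against, so some import paths below may need adjusting.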


In [None]:
%pip install streamlit PyPDF2 python-dotenv Langchain langchain_community langchain_anthropic langchain_openai Spacy faiss-cpu 
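
Since the install command does not pin versions, it can help to record which versions actually ended up in the environment. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the package list simply mirrors the install command above.

```python
from importlib import metadata

# Distribution names as passed to pip in the install command above.
PACKAGES = [
    "streamlit", "PyPDF2", "python-dotenv", "langchain",
    "langchain_community", "langchain_anthropic", "langchain_openai",
    "spacy", "faiss-cpu",
]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Saving this output next to the notebook makes it easier to reproduce the environment later if a newer release changes behavior.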

## Imports and Setup

This code cell performs several critical setup tasks:

1. **Library Imports**:
   - **streamlit as st**: Imports the Streamlit library, which is used to create web applications directly from Python.
   - **PyPDF2's PdfReader**: A class from the PyPDF2 library used to read PDF files.
   - **langchain.text_splitter's RecursiveCharacterTextSplitter**: This is used to split large texts into smaller, manageable chunks, typically for processing with language models.
   - **langchain_core.prompts' ChatPromptTemplate**: A tool for defining and managing chat prompts for language models.
   - **langchain_community.embeddings' SpacyEmbeddings**: Provides a method to create embeddings (numerical representations) of text using Spacy, an NLP library.
   - **langchain_community.vectorstores' FAISS**: FAISS is a library for efficient similarity search and clustering of dense vectors, integrated here for use with Langchain.
   - **langchain.tools.retriever's create_retriever_tool**: A utility for creating a retriever tool, which is typically used for fetching relevant pieces of information based on a query.
   - **dotenv's load_dotenv**: This function loads environment variables from a `.env` file, which is often used to manage sensitive information like API keys securely.
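   - **langchain_anthropic's ChatAnthropic**: A class to interact with Anthropic's language models. It is imported as an alternative model provider but is not used in the code shown here.
   - **langchain_openai's ChatOpenAI and OpenAIEmbeddings**: `ChatOpenAI` is the chat model used by the agent. `OpenAIEmbeddings` generates embeddings through OpenAI's API; it is imported but not used here, because the embeddings come from spaCy.
   - **langchain.agents' AgentExecutor and create_tool_calling_agent**: Used to build an agent that can call tools (here, the PDF retriever) and to run it.
   - **os**: Python's built-in library to interact with the operating system. It is used here for setting environment variables and accessing the system's environment.

   Note: The line importing the `pipeline` function from `transformers` is commented out. It belongs to an alternative Hugging Face text-generation approach that is also left commented out in `get_conversational_chain`.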

2. **Environment Configuration**:
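   - The line `os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"` works around a crash that occurs when more than one copy of the OpenMP runtime is loaded into the same process. This can happen when several installed packages (for example FAISS and libraries built against Intel's Math Kernel Library, MKL) each bundle their own copy. The setting tells the runtime to continue instead of aborting. It is a workaround rather than a fix: the runtime's own error message warns that it may cause crashes or silently produce incorrect results, so it is best removed if the app runs without it.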

3. **Loading Environment Variables**:
   - `load_dotenv()` is called to load any environment variables defined in a `.env` file. This is typically where sensitive information like API keys is stored.
   - The `api_key` variable retrieves the OpenAI API key from the environment variables, which will be used to authenticate requests to the OpenAI API.

In [None]:
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.tools.retriever import create_retriever_tool
from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
#from transformers import pipeline
from langchain.agents import AgentExecutor, create_tool_calling_agent

import os

# KMP_DUPLICATE_LIB_OK=TRUE works around the "duplicate OpenMP runtime" error that
# can occur when several packages (e.g. FAISS and MKL-linked libraries) each load
# their own copy of the OpenMP runtime. It lets the program continue instead of
# aborting. It is a workaround, not a fix, so remove it if the app runs without it.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load the environment variables from .env file
load_dotenv()

# Retrieve the API key from environment variables
api_key = os.getenv("OPENAI_API_KEY")
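
`os.getenv` returns `None` when a variable is missing, so a forgotten `.env` entry would only surface later as an authentication error from the API. One way to fail earlier with a clearer message is sketched below. The helper name `require_env` is hypothetical and not part of the original project; it takes the environment as a parameter so it can be tried without touching real settings.

```python
import os

def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or raise a clear error."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add a line like {name}=<your key> to your .env file."
        )
    return value

# Example with an explicit mapping instead of the real environment:
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"}))
```

In the notebook this could replace the plain `os.getenv` call, as in `api_key = require_env("OPENAI_API_KEY")`, placed after `load_dotenv()`.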

## Reading and Processing PDF Files

This code cell defines two important functions: `pdf_read` and `get_chunks`. These functions are used to extract text from PDF documents and split that text into smaller chunks for further processing.

#### `pdf_read(pdf_doc)`
- **Purpose**: 
  - This function extracts and concatenates text from a list of PDF files.
- **Arguments**:
  - `pdf_doc (list)`: A list of PDF file paths or file-like objects from which text will be extracted.
- **Returns**:
  - `str`: A single string containing the concatenated text extracted from all pages of the provided PDFs.
- **How it works**:
  - It initializes a `PdfReader` object for each PDF file in the list.
  - It then loops through each page of the PDF, extracting the text and appending it to a cumulative `text` variable.
  - The function finally returns the complete text extracted from all the PDFs.

#### `get_chunks(text)`
- **Purpose**:
  - This function splits a large block of text into smaller, manageable chunks that are ideal for processing with language models or other text analysis tools.
- **Arguments**:
  - `text (str)`: The full text string that needs to be split into chunks.
- **Returns**:
  - `list`: A list of text chunks, each at most the configured chunk size, with some overlap between consecutive chunks.
- **How it works**:
  - It creates an instance of `RecursiveCharacterTextSplitter` with a maximum chunk size of 1000 characters and an overlap of up to 200 characters.
  - The `split_text` method is called to divide the text into these chunks. The overlap means that text near a chunk boundary appears in both neighboring chunks, so a sentence that straddles the boundary is not lost to retrieval.

In [None]:
def pdf_read(pdf_doc):
    """
    Extracts text from a list of PDF documents.

    Args:
        pdf_doc (list): A list of PDF file paths or file-like objects.

    Returns:
        str: The concatenated text extracted from all the pages of the provided PDFs.
    """
    text = ""
    for pdf in pdf_doc:
        pdf_reader = PdfReader(pdf)  # Initialize a PdfReader object for each PDF
        for page in pdf_reader.pages:  # Loop through all the pages of the PDF
            text += page.extract_text() or ""  # Append page text; guard against pages with no extractable text
    return text

def get_chunks(text):
    """
    Splits a large block of text into smaller chunks.

    Args:
        text (str): The text to be split into chunks.

    Returns:
        list: A list of text chunks, each of a specified size with some overlap.
    """
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = text_splitter.split_text(text)  # Split the text into chunks with overlap
    return chunks
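
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that uses plain fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works internally: the LangChain splitter prefers to break at separators such as paragraph breaks, newlines, and spaces, so its chunks are usually shorter and its boundaries differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])             # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True
```

The last 200 characters of each window are repeated at the start of the next one, which is the property the overlap setting is meant to provide.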

## Creating a Searchable Text Database and Making Embeddings

This code cell uses a shell command to download a specific pre-trained language model for Spacy:

- **Command**: `!python -m spacy download en_core_web_sm`
  - This command runs Spacy's model downloader to fetch the **`en_core_web_sm`** model, which is a small, efficient English language model.
  - **`en_core_web_sm`**:
    - This is one of Spacy's pre-trained models that includes a vocabulary, syntax, and entities suitable for various natural language processing tasks like tokenization, part-of-speech tagging, and named entity recognition.
    - It's called "small" because it is designed to be lightweight and fast, with a smaller file size, making it suitable for applications where resources are limited or where quick processing is required.
  
This model is necessary for tasks that involve text processing, such as creating embeddings, parsing text, or recognizing named entities in the text.

In [None]:
!python -m spacy download en_core_web_sm


This code cell performs two main tasks: initializing the Spacy embeddings model and defining a function to create and save a FAISS vector store from text chunks.

#### Initializing Spacy Embeddings
- **SpacyEmbeddings Initialization**:
  - `embeddings = SpacyEmbeddings(model_name="en_core_web_sm")`
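  - This line initializes an instance of `SpacyEmbeddings` using the `en_core_web_sm` model, which will be used to generate vector embeddings for text data. Note that spaCy's small (`sm`) pipelines do not ship with static word vectors, so similarity results are typically weaker than with the larger `en_core_web_md` or `en_core_web_lg` models. Swapping in one of those is a reasonable experiment if retrieval quality is poor.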

#### `vector_store(text_chunks)`
- **Purpose**:
  - This function creates a FAISS vector store from a list of text chunks and saves it locally. A FAISS vector store allows for efficient similarity search and clustering of these text embeddings, which is useful for tasks like document retrieval or semantic search.

- **Arguments**:
  - `text_chunks (list of str)`: A list of text strings, typically the output from the `get_chunks` function, that will be converted into vector embeddings.

- **Function Process**:
  1. **Convert Text Chunks to Vectors**:
     - The function uses the initialized `SpacyEmbeddings` model to convert the text chunks into vector embeddings.
     - `FAISS.from_texts(text_chunks, embedding=embeddings)` creates a FAISS index from these embeddings, allowing for efficient similarity search later on.
  
  2. **Save FAISS Vector Store**:
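     - The created FAISS index is saved to a local folder named `"faiss_db"` using the `save_local` method. This allows the vector store to be loaded later for retrieval tasks without needing to recompute the embeddings. Because the function builds a fresh index on every call and saves it to the same folder, only the most recently processed set of PDFs is searchable.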

- **Returns**:
  - This function does not return any values; it simply creates the vector store and saves it to disk.


In [None]:
# Initialize a SpacyEmbeddings instance with the specified model
embeddings = SpacyEmbeddings(model_name="en_core_web_sm")

def vector_store(text_chunks):
    """
    Creates and saves a FAISS vector store from a list of text chunks.

    This function converts a list of text chunks into vector embeddings using the
    spaCy embeddings model, then creates a FAISS index to store these vectors
    and saves the index to a local folder named "faiss_db".

    Args:
        text_chunks (list of str): A list of text strings to be converted into vectors.

    Returns:
        None
    """
    # Create a FAISS vector store from the text chunks using the specified embeddings
    # (named `db` so it does not shadow the enclosing function's name)
    db = FAISS.from_texts(text_chunks, embedding=embeddings)

    # Save the FAISS index to the local folder "faiss_db"
    db.save_local("faiss_db")
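
Conceptually, a vector store pairs each chunk with its embedding and, at query time, returns the chunks whose vectors lie closest to the query vector. The toy sketch below illustrates that idea in pure Python. It makes several assumptions: the three-dimensional vectors are hand-made stand-ins for real embeddings, the search is brute force rather than a FAISS index, Euclidean distance is chosen for illustration, and the names `euclidean` and `nearest_chunks` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_chunks(query_vec, store, k=2):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, euclidean(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hand-made toy vectors standing in for real embeddings.
store = [
    ("Invoices are due within 30 days.", [0.9, 0.1, 0.0]),
    ("The warranty covers two years.", [0.1, 0.9, 0.0]),
    ("Late payments incur a 2% fee.", [0.8, 0.2, 0.1]),
]

for text, dist in nearest_chunks([1.0, 0.0, 0.0], store):
    print(f"{dist:.3f}  {text}")
```

A brute-force scan like this compares the query against every stored vector. Libraries such as FAISS exist to make the same kind of lookup fast when there are many vectors.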


## Setting Up the Conversational AI

This code cell defines two important functions: `get_conversational_chain` and `user_input`. These functions are responsible for setting up a conversational agent, retrieving responses to user queries, and managing the interaction between the user and the AI model.

#### `get_conversational_chain(tools, ques)`
- **Purpose**:
  - This function initializes a conversational chain using a language model and specified tools, processes a user query, and generates a response.
- **Arguments**:
  - `tools`: The tool that the conversational agent can use to retrieve information. Despite the plural name, the caller passes a single retriever tool, which the function then wraps in a list.
  - `ques (str)`: The user's question or input that needs to be answered.
- **Function Process**:
  1. **Language Model Initialization**:
     - The function initializes a language model (`ChatOpenAI`) with specific settings, including the model name (`gpt-3.5-turbo`), a temperature setting (which controls the randomness of the model's output), and the API key for authentication.
  
  2. **Prompt Template Creation**:
     - A prompt template is defined using `ChatPromptTemplate.from_messages`. This template instructs the language model to answer questions based on the provided context and ensures that if the answer is not available, the model will state so without providing incorrect information.
  
  3. **Tool and Agent Creation**:
     - The single tool is wrapped in a list, and an agent is created using `create_tool_calling_agent`, which lets the language model decide when to call the retriever tool while generating a response.
  
  4. **Agent Execution**:
     - The `AgentExecutor` is initialized with the agent, tools, and verbosity settings. It processes the input question and generates a response.
  
  5. **Response Handling**:
     - The result returned by `agent_executor.invoke` is a dictionary whose `output` key holds the answer text. The result is printed to the console and displayed on the Streamlit interface using `st.write`.

- **Returns**:
  - This function does not return a value but instead prints and displays the response.

#### `user_input(user_question)`
- **Purpose**:
  - This function manages user input by loading a local FAISS database, creating a retriever tool, and passing the user's question to the conversational chain for processing.
- **Arguments**:
  - `user_question (str)`: The question asked by the user that needs to be answered by the system.
- **Function Process**:
  1. **Loading FAISS Index**:
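     - The function loads a previously saved FAISS index (a vector store) from the local `faiss_db` folder using `FAISS.load_local`. This index contains the vector embeddings generated from the text chunks. The `allow_dangerous_deserialization=True` flag is required because part of the saved index is a pickle file; unpickling can execute arbitrary code, so only load an index that the app created itself.
  
  2. **Creating a Retriever Tool**:
     - A retriever is created from the loaded FAISS index using `new_db.as_retriever()`, and `create_retriever_tool` then wraps it as a tool with a name (`pdf_extractor`) and a description that tell the agent what it is for.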
  
  3. **Processing User Input**:
     - The user's question is passed to the `get_conversational_chain` function, which uses the retriever tool and the language model to generate and display a response.
  
- **Returns**:
  - This function does not return a value but processes the user question and displays the result.


In [None]:
def get_conversational_chain(tools, ques):
    """
    Initializes a conversational chain using a language model and tools, and retrieves a response for a given question.

    This function sets up a conversational agent using the specified language model and tools, then 
    uses the agent to process a question and print the response. It also writes the response to the Streamlit interface.

    Args:
        tools: The retriever tool to be used by the conversational agent (a single tool, wrapped in a list below).
        ques (str): The question or input for which the response is to be generated.

    Returns:
        None
    """
    # Initialize the language model with specified settings
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, api_key=api_key)
    # Use Hugging Face's transformers pipeline for text generation
    #generator = pipeline('text-generation', model='gpt2')
    
    # Generate a response using the Hugging Face model
    #response = generator(ques, max_length=100, num_return_sequences=1)[0]['generated_text']
    

    
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                """You are a helpful assistant. Answer the question as detailed as possible from the provided context, make sure to provide all the details, if the answer is not in
                provided context just say, "answer is not available in the context", don't provide the wrong answer""",
            ),
            ("placeholder", "{chat_history}"),
            ("human", "{input}"),
            ("placeholder", "{agent_scratchpad}"),
        ]
    )
    
    # Wrap the single retriever tool in a list, since the agent APIs expect a list of tools
    tool = [tools]
    
    # Create an agent that uses the language model and tools to process inputs
    agent = create_tool_calling_agent(llm, tool, prompt)
    
    # Initialize the agent executor with the agent, tools, and verbosity setting
    agent_executor = AgentExecutor(agent=agent, tools=tool, verbose=True)
    
    # Invoke the agent executor with the given question and get the response
    response = agent_executor.invoke({"input": ques})
    
    # Print the response to the console
    print(response)
    
    # Display only the answer text in the Streamlit interface
    st.write("Reply: ", response["output"])

def user_input(user_question):
    """
    Handles user input by loading a local FAISS database, creating a retriever tool, 
    and using it to answer the user's question through a conversational chain.

    This function loads a previously saved FAISS index, sets up a retriever tool for extracting information 
    from the database, and then passes the user's question to the conversational chain for a response.

    Args:
        user_question (str): The question asked by the user for which an answer is needed.

    Returns:
        None
    """
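    # Load the FAISS index from the local "faiss_db" folder.
    # allow_dangerous_deserialization=True is needed because part of the index is
    # stored with pickle; only load index folders that this app created itself.
    new_db = FAISS.load_local("faiss_db", embeddings, allow_dangerous_deserialization=True)
    
    # Create a retriever from the loaded FAISS index
    retriever = new_db.as_retriever()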
    
    # Set up the retriever tool with a specific name and description
    retrieval_chain = create_retriever_tool(retriever, "pdf_extractor", "This tool is to give answer to queries from the pdf")
    
    # Use the conversational chain to handle the user question
    get_conversational_chain(retrieval_chain, user_question)
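
The agent executor returns a dictionary rather than a plain string, with the answer text under the `output` key. A small defensive helper for pulling out that text might look like the sketch below. The name `extract_answer` and the fallback message are hypothetical, and the example dictionary is hand-written to mimic the shape of a real result.

```python
def extract_answer(response):
    """Return the answer text from an agent result, or a fallback message."""
    if isinstance(response, dict):
        answer = response.get("output")
        if isinstance(answer, str) and answer.strip():
            return answer.strip()
    return "No answer was returned."

# Hand-written example shaped like the dictionary the agent executor returns.
example = {
    "input": "What is the payment term?",
    "output": "Payment is due in 30 days.",
}
print(extract_answer(example))
```

Passing the helper's return value to `st.write` would show only the answer in the app, while the full dictionary can still be printed to the console for debugging.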


## User Interaction

This code cell defines the `main` function, which sets up and runs a Streamlit application for interacting with PDF files through a question-answering interface. The app allows users to upload PDF files, ask questions based on the content of these files, and receive answers generated by a language model.

#### `main()`
- **Purpose**:
  - The `main` function initializes and configures the Streamlit app, handling user inputs, file uploads, and PDF processing.

- **Function Process**:
  1. **Streamlit Page Configuration**:
     - `st.set_page_config(page_title="Chat PDF")`: This sets the configuration for the Streamlit app, specifically the title of the web page.
  
  2. **Header Setup**:
     - `st.header("RAG based Chat with PDF")`: This creates the main header displayed at the top of the app, indicating that the app is for Retrieval-Augmented Generation (RAG) based chat with PDFs.
  
  3. **User Input for Questions**:
     - `user_question = st.text_input("Ask a Question from the PDF Files")`: This creates a text input field where users can type in their questions related to the uploaded PDF files.
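     - If the user inputs a question, the `user_input` function is called to process the question and generate a response based on the content of the PDFs. This requires that a set of PDFs has already been processed, because `user_input` loads the saved `faiss_db` index.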

  4. **PDF Upload and Processing**:
     - The app includes a sidebar (`st.sidebar`) where users can upload their PDF files.
     - `pdf_doc = st.file_uploader("Upload your PDF Files and Click on the Submit & Process Button", accept_multiple_files=True)`: This file uploader widget allows users to upload multiple PDF files.
     - If the user clicks the "Submit & Process" button, the app:
       - Displays a spinner (`st.spinner("Processing...")`) to indicate that the PDFs are being processed.
       - Calls the `pdf_read` function to extract text from the uploaded PDFs.
       - Uses the `get_chunks` function to split the extracted text into manageable chunks.
       - Stores these chunks in a FAISS vector store via the `vector_store` function.
       - Finally, notifies the user that the processing is complete with `st.success("Done")`.

- **Returns**:
  - This function does not return any values but manages the user interface and interaction flow within the Streamlit app.

In [None]:
def main():
    """
    Main function to run the Streamlit app for interacting with PDF-based question answering.

    This function sets up the Streamlit page configuration and layout, including headers, 
    user input fields, and file upload widgets. It handles user questions and processes
    uploaded PDF files to enable question answering based on the content of the PDFs.

    Returns:
        None
    """
    # Configure the Streamlit page
    st.set_page_config(page_title="Chat PDF")
    
    # Set up the main header of the app
    st.header("RAG based Chat with PDF")
    
    # Create a text input field for users to ask questions related to the PDF files
    user_question = st.text_input("Ask a Question from the PDF Files")
    
    # If a question is provided, handle it by passing it to the `user_input` function
    if user_question:
        user_input(user_question)
    
    # Sidebar for uploading PDF files and processing them
    with st.sidebar:
        # File uploader widget for PDF files
        pdf_doc = st.file_uploader("Upload your PDF Files and Click on the Submit & Process Button", accept_multiple_files=True)
        
        # Button to trigger PDF processing
        if st.button("Submit & Process"):
            with st.spinner("Processing..."):
                # Read and extract text from the uploaded PDF files
                raw_text = pdf_read(pdf_doc)
                
                # Split the raw text into chunks
                text_chunks = get_chunks(raw_text)
                
                # Store the chunks in a vector store for later retrieval
                vector_store(text_chunks)
                
                # Notify the user that the processing is complete
                st.success("Done")
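
Because a question can be typed before any PDF has been processed, it may be worth checking that a saved index exists before calling `user_input`. The sketch below shows one way to do that with the standard library. It assumes the saved folder contains files named `index.faiss` and `index.pkl`, which should be verified against the installed LangChain version. The helper name `index_is_ready` is hypothetical.

```python
import os
import tempfile

def index_is_ready(folder="faiss_db"):
    """Return True if the folder holds the two files a saved index is assumed to contain."""
    expected = ("index.faiss", "index.pkl")  # assumed default file names
    return all(os.path.isfile(os.path.join(folder, name)) for name in expected)

# Demonstration in a temporary directory rather than the real "faiss_db" folder.
with tempfile.TemporaryDirectory() as tmp:
    print(index_is_ready(tmp))  # False: nothing has been saved yet
    for name in ("index.faiss", "index.pkl"):
        open(os.path.join(tmp, name), "wb").close()
    print(index_is_ready(tmp))  # True: both files now exist
```

In `main`, the question handler could call `user_input` only when `index_is_ready()` is true, and otherwise show a hint such as `st.warning("Please upload and process a PDF first.")`.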

## Creating and Running the Streamlit App

To create and run the Streamlit app from the terminal, follow these steps:

1. **Create a Python File**:
   - Open your preferred code editor (e.g., VSCode, PyCharm, or a simple text editor).
   - Copy all the code you have written in your Jupyter notebook, including the imports, function definitions (`pdf_read`, `get_chunks`, `vector_store`, `get_conversational_chain`, `user_input`, `main`), and any additional code.
   - Paste the code into a new file and save it with the name `app.py`.

2. **Add the Streamlit Run Command**:
   - At the bottom of your `app.py` file, ensure you include the following check to run the `main` function when the script is executed:

   ```python
   if __name__ == "__main__":
       main()
