# Project Overview: Chat with Multiple PDFs
![Application Workflow](https://hackernoon.imgix.net/images/7Ug0PkGHNSN2Kd1AXJvurvkWCxF2-g983p9j.jpeg)



## Introduction



This project builds an interactive web application for querying the content of multiple PDF documents. Using **Streamlit**, the application lets users upload PDFs and ask questions about them in natural language. The system extracts and indexes the text of the uploaded documents, then answers questions in a chat-style interaction backed by a language model and an embedding model.

The project integrates **LangChain**, **Ollama Embeddings**, **Chroma**, and a Llama-based conversational model into a retrieval-based question-answering pipeline.  
<img src="https://www.astera.com/wp-content/uploads/2024/10/Basic-RAG-Pipeline.jpg" alt="Application Workflow" width="550"/>


---

## Objectives

- Develop a **Streamlit-based application** to allow conversational interactions with multiple PDFs.
- Utilize AI and vector-based techniques to extract, process, and retrieve relevant information from document content.
- Enhance user experience by enabling natural language querying and maintaining a chat history.

---



## Requirements

### Key Tools and Libraries:
1. **Streamlit**:
   - Front-end interface for uploading PDF files and handling user interactions.
2. **PyPDF2**:
   - Extracts text from PDF documents.
3. **LangChain**:
   - Handles text processing, embeddings, conversational memory, and retrieval-based question-answering.
4. **Chroma**:
   - Stores text embeddings for fast and efficient vector-based search and retrieval.
5. **OllamaEmbeddings**:
   - Generates vector representations of the text for semantic understanding.
6. **ChatOllama**:
   - Enables advanced conversational capabilities using the Llama3.1 model.
7. **Custom HTML Templates**:
   - Enhances the visual presentation of the chat interface for better user experience.


# Setting Up the Project Environment

To successfully run the application, you need to set up the environment by installing the necessary tools, libraries, and models. Below is a detailed guide to setting up **Ollama**, downloading models, configuring **Chroma**, and running the **Streamlit** application.



---

### 1. Setting Up Ollama
**Ollama** serves the two models this project relies on: **llama3.1** for conversational querying and **nomic-embed-text** for embedding generation.

1. **Install Ollama**:
   - Visit the [Ollama website](https://ollama.com/) and download the installation package for your operating system.
   - For macOS, you can use Homebrew:
     ```bash
     brew install ollama
     ```

2. **Download Required Models**:
   After installing Ollama, download the models used in the project:
   - **nomic-embed-text** for generating vector embeddings.
   - **llama3.1** for conversational AI.

   Run the following commands:
   ```bash
   ollama pull nomic-embed-text
   ollama pull llama3.1
   ```

3. **Start Ollama**:
   To make the models available, start the Ollama service by running:
   ```bash
   ollama start
   ```

### 2. Setting Up Other Dependencies

1. **Install Chroma**:
   Chroma stores the vector embeddings for efficient search and retrieval. Install it with pip:
   ```bash
   pip install chromadb
   ```
   No additional setup is needed: the project initializes Chroma automatically and stores embeddings in the `./chroma_db` directory.

2. **Install Streamlit**:
   Streamlit provides the user interface for uploading PDFs and interacting with their content. Install it with pip:
   ```bash
   pip install streamlit
   ```

3. **Install Other Dependencies**:
   Install the remaining Python libraries required for the project. The `langchain-community` package provides the `langchain_community` modules imported by the application:
   ```bash
   pip install langchain langchain-community PyPDF2
   ```

### 3. Running the Application
Once everything is set up, follow these steps to run the application:

1. **Prepare the Environment**:
   Ensure the required models are downloaded and Ollama is running. If it is not, start the service with `ollama start`.
2. **Run the Streamlit Application**:
   Navigate to the directory containing the `ui.py` file and start the app:
   ```bash
   streamlit run ui.py
   ```
3. **Interact with the Application**:
   Open the Streamlit app in your browser (usually at http://localhost:8501), upload your PDF files, click **Process**, and start asking questions about the document content.
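
Before launching the app, it can help to confirm that the Ollama server is reachable and that both models have been pulled. The following is a minimal sketch, assuming Ollama listens on its default address `http://localhost:11434` and lists local models at the `/api/tags` endpoint. The helper names `list_ollama_models` and `missing_models` are hypothetical and not part of the project.

```python
import json
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection failures and timeouts;
        # ValueError covers malformed URLs and invalid JSON.
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


def missing_models(available, required=("nomic-embed-text", "llama3.1")):
    """Return the required models that do not appear in the list of available names."""
    # Names usually carry a tag such as "llama3.1:latest"; compare on the part before ":".
    base_names = {name.split(":")[0] for name in available}
    return [name for name in required if name not in base_names]
```

Calling `list_ollama_models()` and passing a non-`None` result to `missing_models` gives the models that still need an `ollama pull`.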


# Code Explanation


# 1. Importing Necessary Libraries:

```python
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.chat_models import ChatOllama
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from htmlTemplates import css, bot_template, user_template
```

**Streamlit**: Used to create the web application interface.  
**PyPDF2**: Used for reading and extracting text from PDF documents.  
**LangChain**: Provides tools for text splitting, embeddings, and conversational AI.  
**Chroma**: Vector storage system used for storing embeddings to enable fast retrieval.  
**OllamaEmbeddings**: Used to convert text into vector embeddings.  
**ChatOllama**: Implements the conversational model based on Llama3.1.  
**ConversationBufferMemory**: Maintains chat history during the conversation.  
**htmlTemplates**: Custom templates to style the chat UI.  

# 2. Function to Extract Text from PDFs:

```python
def get_pdf_text(pdf_docs):
    """Extract text from PDF files."""
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # Pages without extractable text (e.g. scanned images) contribute nothing
            text += page.extract_text() or ""
    return text
```

This function takes a list of PDF documents and extracts the text from each page, concatenating everything into one string. The `PdfReader` class from PyPDF2 performs the extraction; pages that yield no text are skipped rather than causing an error.

# 3. Function to Split Text into Chunks:


```python
def get_text_chunks(text):
    """Split extracted text into smaller chunks."""
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks
```

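After extracting text from the PDFs, this function splits it into smaller chunks using LangChain’s `CharacterTextSplitter`. The splitter breaks the text at newline characters (`separator="\n"`) and merges the pieces into chunks of up to roughly 1,000 characters, with about 200 characters of overlap between consecutive chunks so that content near a chunk boundary is less likely to be cut off from its context.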
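
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking in plain Python. It is not LangChain's algorithm (which splits on the separator first and then merges pieces); the function name `chunk_with_overlap` is hypothetical and it cuts purely by character position.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks
```

With the defaults, a 2,500-character text yields three chunks starting at positions 0, 800, and 1,600, so each chunk shares its first 200 characters with the end of the previous one.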

# 4. Function to Create a Vectorstore:

```python
def get_vectorstore(text_chunks):
    """Convert the text chunks into a vector store using Ollama embeddings."""
    embeddings = OllamaEmbeddings(model="nomic-embed-text", show_progress=True)
    vectorstore = Chroma.from_texts(texts=text_chunks, embedding=embeddings, persist_directory="./chroma_db")
    return vectorstore
```

Here, the text chunks are passed to OllamaEmbeddings to create vector embeddings. These embeddings are then stored in Chroma for efficient similarity search, enabling quick retrieval of relevant chunks based on user queries.
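
Conceptually, similarity search compares the embedding of the user's question with the embedding of every stored chunk and returns the closest ones. The sketch below illustrates the idea with cosine similarity over toy two-dimensional vectors. The function names are hypothetical, real embeddings have hundreds of dimensions, and Chroma's actual index and distance metric may differ from this brute-force version.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query vector."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy example: chunk 0 points the same way as the query, chunk 2 is partly aligned.
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
best = top_k_chunks([1.0, 0.0], toy_chunks, k=2)
```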

# 5. Setting up the Conversational Chain:

```python
def get_conversation_chain(vectorstore):
    """Setup conversational chain with Llama3.1 model."""
    llm = ChatOllama(model="llama3.1")

    memory = ConversationBufferMemory(
        memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain
```

In this function, a ChatOllama conversational model is initialized. The model is paired with ConversationBufferMemory to maintain a history of the conversation. The ConversationalRetrievalChain is created, linking the model, vector store, and memory together for efficient query handling.
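
The flow inside such a chain can be sketched in plain Python: a follow-up question is first rewritten into a standalone question using the chat history, relevant chunks are retrieved for it, and the model then answers from those chunks. This is an illustrative sketch only. The function `answer_with_history`, the prompt wording, and the `retrieve` and `llm` callables are assumptions for demonstration, not LangChain's actual prompts or internals.

```python
def answer_with_history(question, history, retrieve, llm):
    """Answer a question using retrieved context, taking earlier turns into account.

    history  -- list of (question, answer) tuples, updated in place
    retrieve -- callable mapping a query string to a list of text chunks
    llm      -- callable mapping a prompt string to a reply string
    """
    if history:
        # Step 1: fold the chat history into a standalone question.
        transcript = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in history)
        standalone = llm(
            f"Conversation so far:\n{transcript}\n\n"
            f"Rephrase this follow-up as a standalone question: {question}"
        )
    else:
        standalone = question

    # Step 2: fetch the chunks most relevant to the standalone question.
    context = "\n\n".join(retrieve(standalone))

    # Step 3: ask the model to answer from the retrieved context.
    answer = llm(f"Answer using only this context:\n{context}\n\nQuestion: {standalone}")

    # Step 4: remember the turn so later follow-ups can refer to it.
    history.append((question, answer))
    return answer
```

Passing the retriever and model in as callables keeps the sketch independent of any particular library and makes it easy to try with stand-in functions.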

# 6. Handling User Input:

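```python
def handle_userinput(user_question):
    """Handle user input and display chat history."""
    # The chain only exists after PDFs have been processed
    if st.session_state.conversation is None:
        st.warning("Please upload and process your PDFs first.")
        return

    response = st.session_state.conversation({'question': user_question})
    st.session_state.chat_history = response['chat_history']

    # Messages alternate: even indices are the user, odd indices are the bot
    for i, message in enumerate(st.session_state.chat_history):
        if i % 2 == 0:
            st.write(user_template.replace(
                "{{MSG}}", message.content), unsafe_allow_html=True)
        else:
            st.write(bot_template.replace(
                "{{MSG}}", message.content), unsafe_allow_html=True)
```

This function handles user queries. If no documents have been processed yet, it shows a warning and returns. Otherwise, it sends the question to the conversational chain and updates the chat history. The conversation is displayed in the app interface using custom HTML templates: messages at even positions are rendered with the user template and messages at odd positions with the bot template.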
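
Because the templates are rendered with `unsafe_allow_html=True`, any HTML contained in a message is interpreted by the browser. One possible precaution, shown here as a sketch, is to escape message text before substituting it into the template. The two template strings below are hypothetical stand-ins, since the contents of `htmlTemplates` are not shown in this document.

```python
import html

# Hypothetical stand-ins for user_template and bot_template from htmlTemplates.
USER_TEMPLATE = '<div class="chat-message user"><div class="message">{{MSG}}</div></div>'
BOT_TEMPLATE = '<div class="chat-message bot"><div class="message">{{MSG}}</div></div>'


def render_history(messages):
    """Return one HTML snippet per message, alternating user and bot templates."""
    rendered = []
    for i, content in enumerate(messages):
        template = USER_TEMPLATE if i % 2 == 0 else BOT_TEMPLATE
        # html.escape turns <, >, &, and quotes into entities so they display as text.
        rendered.append(template.replace("{{MSG}}", html.escape(content)))
    return rendered
```

Each returned snippet could then be passed to `st.write(..., unsafe_allow_html=True)` in place of the unescaped message content.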

# 7. Main Function to Run the Streamlit App:

```python
def main():
    """Main function to run the Streamlit app."""
    st.set_page_config(page_title="Chat with multiple PDFs",
                       page_icon=":books:")
    st.write(css, unsafe_allow_html=True)

    # Initialize session state so the chain and history survive Streamlit reruns
    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

    st.header("Chat with multiple PDFs :books:")
    user_question = st.text_input("Ask a question about your documents:")
    if user_question:
        handle_userinput(user_question)

    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here and click on 'Process'",
            type="pdf", accept_multiple_files=True)
        if st.button("Process"):
            if not pdf_docs:
                st.warning("Please upload at least one PDF first.")
            else:
                with st.spinner("Processing"):
                    # get pdf text
                    raw_text = get_pdf_text(pdf_docs)

                    # get the text chunks
                    text_chunks = get_text_chunks(raw_text)

                    # create vector store with Ollama embeddings
                    vectorstore = get_vectorstore(text_chunks)

                    # create conversation chain with Llama3.1
                    st.session_state.conversation = get_conversation_chain(vectorstore)


if __name__ == '__main__':
    main()
```

The main() function is the entry point of the Streamlit app. It configures the page, handles user input, and manages PDF file uploads. Once PDFs are uploaded, the text is processed, and the conversation chain is set up, enabling users to ask questions and receive responses.

# Results

## 1. Home Page
<img src="https://i.imgur.com/8HrUImt.png" alt="Home page" width="500"/>


## 2. Uploading Files
<img src="https://i.imgur.com/1JR8mAN.png" alt="Uploading files" width="300"/>  <img src="https://i.imgur.com/emSowx5.png" alt="Uploading files" width="300"/>



## 3. Querying the database and getting the answer
<img src="https://i.imgur.com/lwFLxMl.png" alt="Querying the documents" width="700"/>

<img src="https://i.imgur.com/BDCYpq8.png" alt="Query answer" width="500"/> <img src="https://i.imgur.com/gaT40Db.png" alt="Query answer" width="500"/>




# Conclusion

This application combines Streamlit, LangChain, Chroma, and Ollama into an interactive tool for querying multiple PDF documents. Users upload their documents through the web interface, and the app answers natural language questions about the content while keeping the chat history.  


