# QnA Conversational Agent

# 📚 Chat with Your Data Bot - Jupyter Notebook 🤖

Welcome to the **Chat with Your Data Bot** Jupyter Notebook! This notebook demonstrates how to set up and use a conversational AI model to interact with your PDF or TXT documents using Azure OpenAI and the LangChain library. Follow along to understand the process of initializing the model, uploading documents, and asking questions to get insightful answers.

## Overview

In this notebook, we will cover the following steps:

1. **Setting Up the Environment**: Installing the necessary libraries and configuring the Azure OpenAI credentials.
2. **Loading and Preprocessing Documents**: Uploading PDF or TXT files and splitting them into smaller chunks for efficient processing.
3. **Creating Embeddings and Vector Store**: Using Hugging Face embeddings to convert document chunks into vectors and storing them in an in-memory vector store.
4. **Initializing the Conversational Retrieval Chain**: Setting up the retrieval chain to interact with the documents based on user queries.
5. **Interacting with the Bot**: Asking questions and receiving answers along with references to the source documents.

Let's get started! 🚀

---

## Prerequisites

Before running this notebook, ensure you have the following:

- **Azure OpenAI credentials**: Your Azure OpenAI API key, endpoint, deployment name, and API version.
- **Python Environment**: A Python environment with the necessary libraries installed (as listed in the `requirements.txt` file).
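
Before launching the app, it can help to confirm that the libraries are importable. The sketch below is one possible check, not part of the original project. The module list is an assumption inferred from the imports used in this notebook, since the contents of `requirements.txt` are not shown here, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Assumed import names, inferred from the imports used in this notebook;
# the actual requirements.txt is not shown here and may differ.
REQUIRED_MODULES = [
    "streamlit",
    "langchain",
    "pypdf",                  # used by PyPDFLoader
    "sentence_transformers",  # used by HuggingFaceEmbeddings
    "docarray",               # used by DocArrayInMemorySearch
]


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```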

---

## Note

### **`main.py` and `utils.py` need to be saved as separate `.py` files.**
### **Once that is done, open a terminal and run:**

```
streamlit run main.py
```

---

### Table of Contents

1. [Setting Up the Environment](#setting-up-the-environment)
2. [Loading and Preprocessing Documents](#loading-and-preprocessing-documents)
3. [Creating Embeddings and Vector Store](#creating-embeddings-and-vector-store)
4. [Initializing the Conversational Retrieval Chain](#initializing-the-conversational-retrieval-chain)
5. [Interacting with the Bot](#interacting-with-the-bot)


## Setting Up the Environment

In [None]:
import tempfile

import streamlit as st
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import AzureChatOpenAI
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import DocArrayInMemorySearch

from utils import load_db


# I have used an Azure OpenAI API key for this bot; if you are using OpenAI, please follow the syntax from the OpenAI website.
# Initialize Azure OpenAI client
openai_api_key = ""
azure_endpoint = ""
azure_deployment_name = ""
azure_api_version = ""

llm = AzureChatOpenAI(
    api_key=openai_api_key,
    azure_endpoint=azure_endpoint,
    deployment_name=azure_deployment_name,
    api_version=azure_api_version,
    temperature=0,
    max_tokens=500
)
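
The cell above leaves the four credential strings empty for you to fill in. As an alternative to hard-coding them, here is a minimal sketch that reads them from environment variables and fails early if one is missing. The variable names in `ENV_VARS` and the helper `load_azure_settings` are hypothetical choices for illustration, not something the original project defines.

```python
import os

# Hypothetical environment variable names; adjust them to match your own setup.
ENV_VARS = {
    "api_key": "AZURE_OPENAI_API_KEY",
    "azure_endpoint": "AZURE_OPENAI_ENDPOINT",
    "deployment_name": "AZURE_OPENAI_DEPLOYMENT_NAME",
    "api_version": "AZURE_OPENAI_API_VERSION",
}


def load_azure_settings(env=None):
    """Collect the four Azure OpenAI settings, failing early if any is unset or empty."""
    env = os.environ if env is None else env
    settings = {field: env.get(var, "") for field, var in ENV_VARS.items()}
    missing = [ENV_VARS[field] for field, value in settings.items() if not value]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return settings
```

The dictionary keys mirror the keyword arguments used in the cell above, so the result could be passed on as `AzureChatOpenAI(**load_azure_settings(), temperature=0, max_tokens=500)`.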

## Loading and Preprocessing Documents

### Loading is handled inside the Streamlit app (`main.py`) shown below; splitting into chunks happens in `load_db` (`utils.py`).

In [None]:
# Streamlit app
# main.py

import tempfile

import streamlit as st
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import AzureChatOpenAI
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import DocArrayInMemorySearch

from utils import load_db


# I have used an Azure OpenAI API key for this bot; if you are using OpenAI, please follow the syntax from the OpenAI website.
# Initialize Azure OpenAI client
openai_api_key = ""
azure_endpoint = ""
azure_deployment_name = ""
azure_api_version = ""

llm = AzureChatOpenAI(
    api_key=openai_api_key,
    azure_endpoint=azure_endpoint,
    deployment_name=azure_deployment_name,
    api_version=azure_api_version,
    temperature=0,
    max_tokens=500
)

st.title('Chat with Your Data Bot')
st.sidebar.header('Upload a PDF or TXT file')

uploaded_file = st.sidebar.file_uploader("Choose a file", type=["pdf", "txt"])



# When a new file is uploaded, build an in-memory vector store (DocArrayInMemorySearch) and a retrieval chain from it.
# Streamlit reruns this script on every interaction, so the file is processed only when it changes;
# otherwise the chat history would be reset on each rerun.
if uploaded_file is not None and st.session_state.get("loaded_file") != uploaded_file.name:
    file_extension = uploaded_file.name.split('.')[-1].lower()

    # Create a temporary file to save the uploaded file
    with tempfile.NamedTemporaryFile(delete=False, suffix=f".{file_extension}") as tmp_file:
        tmp_file.write(uploaded_file.read())
        tmp_file_path = tmp_file.name

    if file_extension == 'pdf':
        loader = PyPDFLoader(tmp_file_path)
    elif file_extension == 'txt':
        loader = TextLoader(tmp_file_path)
    else:
        st.error('Unsupported file type!')
        st.stop()  # Halt this run; no loader exists for this file type

    documents = loader.load()
    qa = load_db(documents, llm)

    st.session_state.qa = qa
    st.session_state.chat_history = []
    st.session_state.loaded_file = uploaded_file.name

    st.success(f'Loaded file: {uploaded_file.name}')


if 'qa' in st.session_state:
    user_query = st.text_input("Enter your question:")

    if st.button('Submit'):
        result = st.session_state.qa({"question": user_query, "chat_history": st.session_state.chat_history})
        st.session_state.chat_history.append((user_query, result["answer"]))

        # Show the answer, the rewritten standalone query, and the source passages retrieved from the file.
        st.write(f"**User:** {user_query}")
        st.write(f"**ChatBot:** {result['answer']}")

        st.write("**DB Query:**", result["generated_question"])
        st.write("**Source Documents:**")
        for doc in result["source_documents"]:
            st.write(doc)

    if st.button('Clear History'):
        st.session_state.chat_history = []
        st.write("Chat history cleared!")
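
The app sends the running `chat_history` (a list of `(question, answer)` tuples) along with each new question and reads `answer`, `generated_question`, and `source_documents` from the result. The sketch below illustrates only that call contract. `EchoChain` and `ask` are hypothetical names, and the stub stands in for the real chain so the example runs without an LLM or any documents.

```python
class EchoChain:
    """Stand-in for the retrieval chain: same input/output keys, no LLM involved."""

    def __call__(self, inputs):
        question = inputs["question"]
        turns = len(inputs["chat_history"])
        return {
            "answer": f"(turn {turns + 1}) you asked: {question}",
            "generated_question": question,
            "source_documents": [],
        }


def ask(chain, chat_history, question):
    """Send one question with the history so far, then record the new turn."""
    result = chain({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result


history = []
ask(EchoChain(), history, "What is the document about?")
second = ask(EchoChain(), history, "Who wrote it?")
print(second["answer"])  # (turn 2) you asked: Who wrote it?
print(len(history))      # 2
```

Because the history is passed in on every call, the real chain can rewrite a follow-up such as "Who wrote it?" into a standalone query, which is what the app displays under **DB Query**.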



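## Creating Embeddings and Vector Store

`load_db` in `utils.py` splits the documents into chunks, embeds them with a Hugging Face model, and stores the vectors in an in-memory `DocArrayInMemorySearch` store (not Chroma).

### Initializing the Conversational Retrieval Chain

The same function then wraps the store's retriever in a `ConversationalRetrievalChain`.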

In [None]:
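# utils.py

from langchain.chains import ConversationalRetrievalChain
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import DocArrayInMemorySearch


def load_db(documents, llm, k=4, chain_type="stuff"):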
    # Initialize a text splitter with specific chunk size and overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    
    # Split the input documents into smaller chunks
    docs = text_splitter.split_documents(documents)

    # Ensure each document chunk has text and metadata
    for doc in docs:
        if not hasattr(doc, 'metadata'):  # If the document does not have metadata, add an empty metadata dictionary
            doc.metadata = {}
        doc.metadata['text'] = doc.page_content  # Add the page content as a metadata field

    # Define embeddings using a pre-trained Hugging Face model
    embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

    # Create an in-memory vector database from the document chunks using the embeddings
    db = DocArrayInMemorySearch.from_documents(docs, embedding)

    # Create a retriever from the vector database that retrieves documents based on similarity
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": k})

    # Initialize a conversational retrieval chain with the given LLM and retriever
    qa = ConversationalRetrievalChain.from_llm(
        llm=llm,  # Language model to use for generating responses
        chain_type=chain_type,  # Type of chain to use (default is "stuff")
        retriever=retriever,  # Retriever for fetching relevant documents
        return_source_documents=True,  # Whether to return the source documents
        return_generated_question=True,  # Whether to return the generated question
    )

    return qa  # Return the initialized conversational retrieval chain
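
To make the roles of `chunk_size`, `chunk_overlap`, and `k` more concrete, here is a deliberately simplified, standard-library sketch of the same two ideas: overlapping chunks and top-`k` similarity search. It is not how LangChain implements them. `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, and `HuggingFaceEmbeddings` produces dense neural vectors, whereas this toy version cuts at fixed character offsets and uses word counts. All function names below are hypothetical.

```python
import math
from collections import Counter


def split_with_overlap(text, chunk_size=1000, chunk_overlap=150):
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text):
    """Toy embedding: lowercase word counts (the notebook uses a neural model instead)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, chunks, k=4):
    """Return the `k` chunks most similar to `query`, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)[:k]


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means neighbouring chunks share a little text (the `d` and `g` above), so a sentence that straddles a boundary is still seen whole in at least one chunk. A larger `k` hands the LLM more context at the cost of a longer prompt.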


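## Interacting with the Bot

It's time to interact with the chatbot! Run the app, upload a PDF or TXT file from the sidebar, type a question, and click **Submit**. The app shows the answer, the rewritten query, and the source passages.

### Hope you liked the bot
### Please visit [@hiteshhhh007](https://github.com/hiteshhhh007) for more projects and code like this!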