# Documentation for Research Paper Chatbot 

### SE research paper chatbot

Group Name: [csusb_fall2024_cse6550_team4](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team4)

Instructor: Dr. Alzahrani, Nabeel

Course: CSE 6550: Software Engineering Concepts, Fall 2024

Source: [GitHub](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team4)

# 1. Introduction

Purpose:

The purpose of this project is to create an AI-powered research paper chatbot that helps users extract, summarize, and understand content from academic papers. It will offer an interactive Q&A experience, providing accurate, contextually relevant answers to questions about specific sections, making complex research information more accessible and comprehensible.

Objective:

The Paper Chatbot enhances interaction with academic papers by allowing users to upload documents, ask questions, and receive summaries or clarifications. It simplifies extracting key information, aiding students, researchers, and professionals in efficiently understanding complex content.

Prerequisites:
GitHub, Docker, Jupyter Notebook, Python

# 2. Setup

Purpose: Load environment variables and set up libraries for the chatbot.

Input: Necessary libraries and API key setup.

Output: Environment variables and libraries loaded successfully.

Processing: Import necessary packages and initialize configurations.

### Importing libraries

This cell sets up the environment for processing and querying documents, particularly academic papers. It imports libraries for document loading, text splitting, and embedding generation, and brings in `ChatMistralAI`, `HuggingFaceEmbeddings`, and the Milvus vector database for retrieval. It also loads environment variables so the API key stays out of the source code, and sets the paths for data storage and the embedding model name.

In [None]:
import os
from dotenv import load_dotenv
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.schema import Document
from langchain_core.prompts import PromptTemplate
#from langchain_mistralai import MistralAIEmbeddings
from langchain_mistralai.chat_models import ChatMistralAI
#from langchain_cohere import ChatCohere
from langchain_milvus import Milvus
from langchain_community.document_loaders import WebBaseLoader, RecursiveUrlLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain_huggingface import HuggingFaceEmbeddings
from pymilvus import connections, utility
from requests.exceptions import HTTPError
from httpx import HTTPStatusError
from langchain_community.document_loaders import PyPDFDirectoryLoader
from roman import toRoman
load_dotenv()
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

MILVUS_URI = "./milvus/milvus_vector.db"
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
data_dir = "./volumes"
print("imported libraries successfully.")

imported libraries successfully.
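
Before running the import cell, it can help to confirm that the third-party packages are installed. The sketch below is an illustration, not part of the project code: `find_missing_packages` is a hypothetical helper, and the module list is an assumption taken from the top-level names in the imports above.

```python
from importlib.util import find_spec

# Top-level module names taken from the import cell above (assumed list).
REQUIRED_MODULES = [
    "dotenv", "langchain", "langchain_core", "langchain_mistralai",
    "langchain_milvus", "langchain_community", "langchain_text_splitters",
    "langchain_huggingface", "pymilvus", "httpx", "roman",
]

def find_missing_packages(modules):
    """Return the top-level module names that cannot be located."""
    return [name for name in modules if find_spec(name) is None]

missing = find_missing_packages(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Checking with `find_spec` does not import the packages, so the check stays fast even when heavy libraries are installed.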


### Configuring environment variables

Initialize the required environment variables for document processing. `CORPUS_SOURCE` can be updated to change the document source, `MISTRAL_API_KEY` holds the API key for MistralAI, `MILVUS_URI` defines the path to the Milvus Lite database file, and `MODEL_NAME` sets the embedding model used for analyzing the corpus.

In [6]:
from dotenv import load_dotenv
import os 

# Load environment variables from .env file
load_dotenv()
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")


USER_AGENT = os.getenv("USER_AGENT", "my_custom_user_agent")
os.environ["USER_AGENT"] = USER_AGENT

# Configuration settings  
MILVUS_URI = "./milvus/milvus_vector.db"  
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  
CORPUS_SOURCE = 'https://dl.acm.org/doi/proceedings/10.1145/3597503'  

print("Environment initialized, documents loaded, embeddings configured.")

Environment initialized, documents loaded, embeddings configured.
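
`os.getenv` returns `None` when a variable is missing, so a missing `MISTRAL_API_KEY` would only surface later, when the model is called. One way to fail early is sketched below. `require_env` is a hypothetical helper, not part of the project, and the sample mapping stands in for a real `.env` file.

```python
import os

def require_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error."""
    environ = os.environ if environ is None else environ
    value = (environ.get(name) or "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file.")
    return value

# Explicit mapping so the sketch runs without a real API key.
sample_env = {"MISTRAL_API_KEY": "example-key"}
print(require_env("MISTRAL_API_KEY", sample_env))  # prints "example-key"
```

In the notebook, calling `require_env("MISTRAL_API_KEY")` right after `load_dotenv()` would apply the same check to the real environment.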


# 3. Building the Chatbot

Purpose: Create the basic structure and functions for the chatbot logic.

Input: User queries for processing.

Output: Chatbot-generated responses. 

Processing: Set up the RAG model and vector store retrieval.

### Creating Hugging Face Embedding Function

- The `get_embedding_function` function initializes a `HuggingFaceEmbeddings` object with the model named in `MODEL_NAME` and returns it.

- Calling `get_embedding_function()` provides the embedding function that the vector store uses to convert text into vectors.

In [7]:
def get_embedding_function():
    """
    returns embedding function for the model

    Returns:
        embedding function
    """
    embedding_function = HuggingFaceEmbeddings(model_name=MODEL_NAME)
    return embedding_function
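
An embedding function maps text to a vector, and retrieval then compares vectors. The sketch below shows one common comparison, cosine similarity, on hand-written toy vectors. The vectors are not real model output, and `cosine_similarity` is a hypothetical helper introduced only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0, 1.0]      # toy "query" embedding
similar_vec = [2.0, 0.0, 2.0]    # same direction, different length
unrelated_vec = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query_vec, similar_vec))    # close to 1.0
print(cosine_similarity(query_vec, unrelated_vec))  # 0.0
```

Vectors that point the same way score near 1 regardless of length, which is why this measure is often used to rank text chunks against a query.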

### Prompt Generator

- The `create_prompt` function serves as a simple template generator for formulating queries related to research topics. When called, it returns a predefined string that instructs the model to provide a detailed summary of the latest research on a specified subject, indicated by the placeholder `{input}`. 

- This allows the function to be flexible, enabling users to input various research topics and receive contextually relevant summaries.

In [8]:
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  
# Specifies the URI for the Milvus vector database
MILVUS_URI = "./milvus/milvus_vector.db"  

# Placeholder function for creating a prompt
def create_prompt():
    return "Provide a detailed summary of the latest research on: {input}"  
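
To see how the `{input}` placeholder is filled, the sketch below formats the template with a topic using plain `str.format`. `fill_prompt` is a hypothetical helper added for illustration, and the topic string is an arbitrary example.

```python
def create_prompt():
    """Same placeholder template as in the cell above."""
    return "Provide a detailed summary of the latest research on: {input}"

def fill_prompt(topic):
    """Substitute the user's topic into the {input} placeholder."""
    return create_prompt().format(input=topic)

print(fill_prompt("code review automation"))
# prints "Provide a detailed summary of the latest research on: code review automation"
```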

### Vector store loader

- The `load_exisiting_db` function (spelled this way in the code) is a placeholder that simulates loading a vector store from a given URI. The `uri` argument is currently unused.

- It defines a local class, `VectorStore`, whose `as_retriever` method returns the instance itself, so the object can stand in for a retriever. The function returns a `VectorStore` instance, which gives later cells a structure to integrate with a real retrieval system.

In [9]:
# function for loading the vector store
def load_exisiting_db(uri):
    class VectorStore:  
        def as_retriever(self): 
            return self 
    return VectorStore()  
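
To make the retriever idea more concrete, here is a minimal in-memory sketch of what a vector store with an `as_retriever` method might do. It is a toy stand-in, not the Milvus or LangChain API: `InMemoryVectorStore`, its `add` method, and the dot-product scoring are all assumptions chosen for illustration.

```python
class InMemoryVectorStore:
    """Toy vector store for illustration; not the Milvus API."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        """Store a text together with its (precomputed) vector."""
        self._items.append((list(vector), text))

    def as_retriever(self, k=2):
        """Return a callable that yields the k best-matching texts."""
        def retrieve(query_vector):
            def score(item):
                vector, _ = item
                # Dot product as a simple similarity score
                return sum(x * y for x, y in zip(vector, query_vector))
            ranked = sorted(self._items, key=score, reverse=True)
            return [text for _, text in ranked[:k]]
        return retrieve

store = InMemoryVectorStore()
store.add([1.0, 0.0], "Paper on testing")
store.add([0.0, 1.0], "Paper on refactoring")
store.add([0.9, 0.1], "Paper on test generation")

retriever = store.as_retriever(k=2)
print(retriever([1.0, 0.0]))
# prints ['Paper on testing', 'Paper on test generation']
```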

### Document Chain Placeholder

- The `create_stuff_documents_chain` function is a placeholder for a document processing chain that combines a model and a prompt. Because it is defined after the import cell, it replaces the LangChain function of the same name for the rest of the notebook.
- For now it returns a string marking where the actual document chain implementation will go.

In [10]:
# Placeholder function for creating a document chain
def create_stuff_documents_chain(model, prompt):
    return "Document chain here." 

### Create Retrieval Chain

- The `create_retrieval_chain` function is a placeholder for a retrieval chain, which combines a retriever and a document chain to generate a response. Like the previous placeholder, it replaces the imported LangChain function of the same name.

- It returns a lambda that simulates retrieval by returning two example source entries and a fixed sample answer.

In [11]:
# Placeholder function for creating a retrieval chain
def create_retrieval_chain(retriever, document_chain):
    return lambda x: {  # Returns a lambda function to generate a response
        "context": [  # Contextual information with metadata for sources
            {"metadata": {"source": "https://example.com/research_paper1"}}, 
            {"metadata": {"source": "https://example.com/research_paper2"}}
        ], 
        "answer": "Generated response"  # Sample generated answer
    }
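
The placeholder above returns fixed data. As a rough sketch of how the two parts could be wired together, the code below composes a retriever and a document chain into one callable that returns the same `context` and `answer` keys. Everything here is hypothetical (`make_retrieval_chain`, `fake_retriever`, `fake_document_chain`); it mirrors the shape of the placeholder rather than LangChain's implementation.

```python
def make_retrieval_chain(retriever, document_chain):
    """Compose a retriever and a document chain into one callable."""
    def chain(inputs):
        question = inputs["input"]
        context = retriever(question)  # step 1: fetch documents
        answer = document_chain({"input": question, "context": context})  # step 2: answer
        return {"input": question, "context": context, "answer": answer}
    return chain

def fake_retriever(question):
    """Stand-in retriever that always returns one example document."""
    return [{
        "page_content": "RAG combines retrieval with generation.",
        "metadata": {"source": "https://example.com/research_paper1"},
    }]

def fake_document_chain(inputs):
    """Stand-in document chain that echoes the question and context."""
    joined = " ".join(doc["page_content"] for doc in inputs["context"])
    return f"Q: {inputs['input']} | Context: {joined}"

chain = make_retrieval_chain(fake_retriever, fake_document_chain)
result = chain({"input": "What is RAG?"})
print(result["answer"])
# prints "Q: What is RAG? | Context: RAG combines retrieval with generation."
```

Keeping the retrieved documents in the result is what later allows the sources to be listed next to the answer.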

# 4. Improving the Chatbot

Purpose: Enhance the chatbot with NLP capabilities and optimize document retrieval. 

Input: Documents to split and user queries. 

Output: Improved and factually accurate responses. 

Processing: Use Milvus for vector storage and NLP for document processing.

### Improving Chatbot with RAG and Embeddings

The code sets up a retrieval-augmented generation (RAG) system using a Milvus vector store to manage document embeddings. It includes functions to load documents, create or load a vector store, and manage embedding schemas for storage. Progress is logged with print statements to inform the user of each operation's status.

In [12]:
import os
from tqdm import tqdm
from pymilvus import (
    connections,
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
    utility
)

# Suppress tqdm warnings by setting environment variable
os.environ['TQDM_DISABLE'] = '1'

# Placeholder function to simulate loading documents from the web
def load_documents_from_web():
    return ["Document 1 content", "Document 2 content", "Document 3 content"]

# Placeholder splitter: returns the documents unchanged (no real chunking yet)
def split_documents(documents):
    return list(documents)

# Placeholder embedding function; note that it replaces the HuggingFace-based
# get_embedding_function defined earlier in the notebook
def get_embedding_function():
    def embedding_function(doc):
        return [0.1] * 512  # Example fixed-size embedding
    return embedding_function

def create_vector_store(docs, embeddings, uri):
    # Create the directory if it does not exist
    head = os.path.split(uri)
    os.makedirs(head[0], exist_ok=True)
    print("Directory created for vector store if it did not exist")

    # Connect to the Milvus database
    connections.connect("default", uri=uri)
    print("Connected to the Milvus database")

    # Define collection name
    collection_name = "research_paper_chatbot"

    # Check if the collection already exists
    if utility.has_collection(collection_name):
        print("Collection already exists. Loading existing Vector Store.")
        # Bind to the same name that is returned below
        collection = Collection(name=collection_name)
        print("Existing Vector Store Loaded")
    else:
        print("Creating new Vector Store...")
        fields = [
            FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=512),
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        ]
        schema = CollectionSchema(fields=fields, description="Collection for research paper embeddings")
        collection = Collection(name=collection_name, schema=schema)

        # Insert documents into the vector store
        for doc in docs:
            embedding = embeddings(doc)
            collection.insert([[embedding]])
        print("New Vector Store Created with provided documents")

    return collection

def load_exisiting_db(uri):
    collection_name = "research_paper_chatbot"
    vector_store = Collection(name=collection_name)
    print("Loaded existing Vector Store from Milvus database")
    return vector_store

if __name__ == '__main__':
    try:
        print("Loading documents from the web...")
        documents = load_documents_from_web()
        print(f"Loaded {len(documents)} documents from the web.")

        print("Splitting documents into chunks...")
        docs = split_documents(documents)
        print(f"Split into {len(docs)} chunks.")

        print("Getting embedding function...")
        embeddings = get_embedding_function()

        uri = './milvus/milvus_vector.db'

        print("Creating vector store...")
        vector_store = create_vector_store(docs, embeddings, uri)
        print("Vector store created successfully.")

        print("Loading existing vector store...")
        loaded_vector_store = load_exisiting_db(uri)
        print("Loaded existing vector store successfully.")

    except Exception as e:
        print(f"An error occurred: {e}")

    print("Finished operations.")

Loading documents from the web...
Loaded 3 documents from the web.
Splitting documents into chunks...
Split into 3 chunks.
Getting embedding function...
Creating vector store...
Directory created for vector store if it did not exist
Connected to the Milvus database
Creating new Vector Store...
New Vector Store Created with provided documents
Vector store created successfully.
Loading existing vector store...
Loaded existing Vector Store from Milvus database
Loaded existing vector store successfully.
Finished operations.
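
The `split_documents` placeholder above returns its input unchanged. To illustrate what chunking with overlap looks like, here is a minimal fixed-size character splitter. It is a simplified sketch, not the algorithm used by `RecursiveCharacterTextSplitter`, and `split_text` with its parameter names is hypothetical.

```python
def split_text(text, chunk_size=20, chunk_overlap=5):
    """Split text into fixed-size character chunks that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive.")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size).")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

print(split_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a sentence cut at a boundary still appears with some of its neighboring text.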


### Query Response Generator
- The `query_rag` function serves as the main entry point for generating responses using a Retrieval-Augmented Generation (RAG) model.

- It initializes the model and sets up the necessary components, including the prompt and retrieval mechanisms, to process a user query. After generating a response, it compiles a list of relevant sources used in the response, ensuring that users receive both the answer and the origins of the information provided.

In [13]:
def create_prompt():
    """
    Creates a prompt for the model.
    
    Returns:
        PromptTemplate: An instance of PromptTemplate for the model.
    """
    
    # Include the user's question ({input}) so the model sees what is being asked
    return PromptTemplate(
        input_variables=["context", "input"],
        template="Given the following context: {context}, provide a comprehensive response to: {input}"
    )

def query_rag(query):
    """
    Entry point for the RAG model to generate an answer to a given query.

    Args:
        query (str): The query string for which an answer is to be generated.
    
    Returns:
        tuple[str, list[str]]: The answer with a "Sources" line appended, and the list of source URLs
    """
    model = ChatMistralAI(model='open-mistral-7b')  
    print("Model Loaded") 

    prompt = create_prompt()  

    # Load the vector store and create the retriever
    # (the helper is defined above as load_exisiting_db, so call it by that name)
    vector_store = load_exisiting_db(uri=MILVUS_URI)
    retriever = vector_store.as_retriever()  
    
    try:
        document_chain = create_stuff_documents_chain(model, prompt)  
        print("Document Chain Created") 

        retrieval_chain = create_retrieval_chain(retriever, document_chain)  
        print("Retrieval Chain Created")  
    
        # Generate a response to the query
        response = retrieval_chain({"input": f"{query}"})
    except HTTPStatusError as e:  
        print(f"HTTPStatusError: {e}") 
        if e.response.status_code == 429:  
            return "I am currently experiencing high traffic. Please try again later.", []  
        return f"HTTPStatusError: {e}", []  
    
    # Logic to add sources to the response
    max_relevant_sources = 4  
    all_sources = ""  
    sources = []  
    count = 1  

    for i in range(max_relevant_sources):  
        try:
            source = response["context"][i]["metadata"]["source"] 
            if source not in sources:  
                sources.append(source)  
                all_sources += f"[Source {count}]({source}), "  
                count += 1  
        except IndexError:  
            break  
            
    all_sources = all_sources[:-2]  
    response["answer"] += f"\n\nSources: {all_sources}"  
    print("Response Generated")  

    return response["answer"], sources  
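
The source-listing logic at the end of `query_rag` can be read on its own. The sketch below pulls it into a standalone function with the same behavior: look at the first few context entries, drop duplicates, and number the remaining links. `format_sources` is a hypothetical name, and the example URLs are made up.

```python
def format_sources(context, max_sources=4):
    """Return (markdown_links, unique_sources) for the first few context docs."""
    sources = []
    for doc in context[:max_sources]:
        source = doc["metadata"]["source"]
        if source not in sources:  # skip duplicates, keep first-seen order
            sources.append(source)
    links = ", ".join(
        f"[Source {number}]({source})"
        for number, source in enumerate(sources, start=1)
    )
    return links, sources

example_context = [
    {"metadata": {"source": "https://example.com/a"}},
    {"metadata": {"source": "https://example.com/a"}},  # duplicate
    {"metadata": {"source": "https://example.com/b"}},
]
links, sources = format_sources(example_context)
print(links)
# prints "[Source 1](https://example.com/a), [Source 2](https://example.com/b)"
```

Using `", ".join(...)` avoids the need to trim a trailing separator afterwards.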

### Adding Evaluation Metrics with Confusion Matrix

Purpose: To manage and store performance metrics related to a chatbot's classification performance (e.g., True Positives, False Positives, Accuracy, etc.).

Inputs:
- `increment_value (integer)`: Value to increment or update specific metrics.
- `metric (string)`: The name of the metric to update.
- `columns (string)`: The columns to fetch from the database.

Processing:
- Interacts with a database to execute queries.
- Performs necessary calculations for metrics like sensitivity, specificity, precision, recall, F1 score. 
- Safely handles division to avoid division by zero errors.

Outputs:
Updates or retrieves performance metrics stored in the database.

In [14]:
def update_performance_metrics(self):
    """
    Recalculate and update the performance metrics in the database.
    """
    metrics = self.get_performance_metrics('true_positive, true_negative, false_positive, false_negative')
    accuracy = self.safe_division(metrics['true_positive'] + metrics['true_negative'], 
                                  metrics['true_positive'] + metrics['true_negative'] + metrics['false_positive'] + metrics['false_negative'])
    precision = self.safe_division(metrics['true_positive'], metrics['true_positive'] + metrics['false_positive'])
    sensitivity = self.safe_division(metrics['true_positive'], metrics['true_positive'] + metrics['false_negative'])
    specificity = self.safe_division(metrics['true_negative'], metrics['true_negative'] + metrics['false_positive'])
    recall = self.safe_division(metrics['true_positive'], metrics['true_positive'] + metrics['false_negative'])

    if precision and sensitivity:
        f1_score = self.safe_division(2 * precision * sensitivity, precision + sensitivity)
    else:
        f1_score = None

    with self.connection:
        self.connection.execute('''
            UPDATE performance_metrics
            SET accuracy = ?, precision = ?, sensitivity = ?, specificity = ?, f1_score = ?, recall = ?
            WHERE id = 1
        ''', (accuracy, precision, sensitivity, specificity, f1_score, recall))
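
The same formulas can be tried without a database. The sketch below computes the metrics from the four confusion-matrix counts. The `safe_division` shown here returns `None` on a zero denominator, which is an assumption since the project's own `safe_division` is not shown above, and `compute_metrics` is a hypothetical helper.

```python
def safe_division(numerator, denominator):
    """Divide, returning None instead of raising on a zero denominator."""
    return numerator / denominator if denominator else None

def compute_metrics(tp, tn, fp, fn):
    """Derive classification metrics from confusion-matrix counts."""
    precision = safe_division(tp, tp + fp)
    recall = safe_division(tp, tp + fn)  # same formula as sensitivity
    if precision and recall:
        f1_score = safe_division(2 * precision * recall, precision + recall)
    else:
        f1_score = None
    return {
        "accuracy": safe_division(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": recall,
        "specificity": safe_division(tn, tn + fp),
        "recall": recall,
        "f1_score": f1_score,
    }

metrics = compute_metrics(tp=8, tn=5, fp=2, fn=1)
print(metrics["accuracy"])   # prints 0.8125 (13 correct out of 16)
print(metrics["precision"])  # prints 0.8 (8 of 10 positive predictions)
```

With all counts at zero, every metric comes back as `None`, which keeps an empty table from raising `ZeroDivisionError`.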

# 5. Testing the Chatbot

Purpose: Test the chatbot with a sample query.

Input: Sample queries and documents. 

Output: Responses and any errors encountered. 

Processing: Test the RAG model’s response to ensure correctness.

### Retrieve an answer from RAG

Submit a question to the RAG system, get the response, and display it.

In [None]:
response, sources = query_rag("How do LLMs work?")
print(response)
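
A single query checks one path only. As a sketch of how several queries could be exercised at once, the code below runs a list of questions through any function that returns `(answer, sources)` and records failures instead of stopping. `run_smoke_tests` and `stub_query_rag` are hypothetical; the stub stands in for `query_rag` so the example runs without an API key.

```python
def run_smoke_tests(ask, questions):
    """Call ask(question) for each question and record the outcome."""
    results = []
    for question in questions:
        try:
            answer, sources = ask(question)
            results.append({
                "question": question,
                "ok": bool(answer.strip()),  # pass if the answer is non-empty
                "sources": len(sources),
                "error": None,
            })
        except Exception as exc:  # record the failure and keep going
            results.append({
                "question": question,
                "ok": False,
                "sources": 0,
                "error": str(exc),
            })
    return results

def stub_query_rag(question):
    """Stand-in for query_rag; rejects empty questions."""
    if not question:
        raise ValueError("Empty question")
    return f"Stub answer to: {question}", ["https://example.com/research_paper1"]

for row in run_smoke_tests(stub_query_rag, ["How do LLMs work?", ""]):
    print(row["ok"], row["error"])
```

Passing `query_rag` in place of the stub would run the same checks against the real chain.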

# 6. Conclusion

Recap: We set up a retrieval-augmented generation (RAG) chatbot using LangChain and Milvus, integrated PDF loading, and improved its functionality.

Next Steps: Consider enhancing the chatbot with real-time data retrieval or using more advanced NLP techniques.

Resources: For more details, visit the [GitHub repository](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team4) and the [project wiki](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team4/wiki).