# Documentation for Research Paper Chatbot 

### SE research paper chatbot

Group Name: [csusb_fall2024_cse6550_team4](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team4)

Instructor: Dr. Alzahrani, Nabeel

Course: CSE 6550: Software Engineering Concepts, Fall 2024

Source: [GitHub](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team4)

# 1. Introduction

Purpose:

The purpose of this project is to create an AI-powered research paper chatbot that helps users extract, summarize, and understand content from academic papers. It will offer an interactive Q&A experience, providing accurate, contextually relevant answers to questions about specific sections, making complex research information more accessible and comprehensible.

Objective:

The Paper Chatbot enhances interaction with academic papers by allowing users to upload documents, ask questions, and receive summaries or clarifications. It simplifies extracting key information, aiding students, researchers, and professionals in efficiently understanding complex content.

Prerequisites:
GitHub, Docker, Jupyter Notebook, Python

# 2. Setup

Purpose:
The code sets up an environment for building a document processing and retrieval system that uses LangChain, MistralAI, and Milvus for vector storage. It loads, splits, stores, and analyzes documents using machine learning models to support NLP applications such as document retrieval and question answering.

Input:
- Environment variables loaded from a `.env` file (e.g., `MISTRAL_API_KEY`)
- Documents from `CORPUS_SOURCE`, potentially PDFs from a directory (`PyPDFDirectoryLoader`), or web content via `WebBaseLoader` and `RecursiveUrlLoader`
- Model information (`MODEL_NAME`) and storage directory paths (`data_dir`, `MILVUS_URI`)

Output:
- A status message confirming that the libraries were imported successfully
- Documents stored in Milvus as vector embeddings, plus configured tools for document retrieval and response-generation chains

Processing:
- Library imports: Loads libraries for document processing, NLP, and vector storage
- Environment setup: Secures API keys using `dotenv`
- Document preparation: Loads documents from sources (web/PDFs) and splits them using `RecursiveCharacterTextSplitter`
- Model initialization: Uses `HuggingFaceEmbeddings` for text embeddings, with optional MistralAI/Cohere integration
- Vector database: Connects to Milvus for storing and managing document embeddings using `pymilvus`

In [5]:
import os
from dotenv import load_dotenv
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.schema import Document
from langchain_core.prompts import PromptTemplate
#from langchain_mistralai import MistralAIEmbeddings
from langchain_mistralai.chat_models import ChatMistralAI
#from langchain_cohere import ChatCohere
from langchain_milvus import Milvus
from langchain_community.document_loaders import WebBaseLoader, RecursiveUrlLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain_huggingface import HuggingFaceEmbeddings
from pymilvus import connections, utility
from requests.exceptions import HTTPError
from httpx import HTTPStatusError
from data import CORPUS_SOURCE
from langchain_community.document_loaders import PyPDFDirectoryLoader
from roman import toRoman
load_dotenv()
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

MILVUS_URI = "./milvus/milvus_vector.db"
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
data_dir = "./volumes"
print("imported libraries successfully.")

imported libraries successfully.


### Configuring environment variables

Purpose: The code initializes an environment for NLP applications, specifically for document retrieval and embedding generation using machine learning models.

Input:

- Environment variables loaded from a `.env` file (`MISTRAL_API_KEY` and `USER_AGENT`)
- Configuration values such as `MILVUS_URI`, `MODEL_NAME`, and `CORPUS_SOURCE`

Output:
Confirms successful setup of the environment, document source readiness, and embedding model configuration for future processing.

Processing:
- Environment setup: Loads environment variables to set up API access and a user agent for HTTP requests.
- Configuration: Prepares paths and settings for embedding models and vector storage.
- Embedding preparation: Sets up a model (`sentence-transformers/all-MiniLM-L6-v2`) for processing documents and generating embeddings.
- Document retrieval: References a document source URL (`CORPUS_SOURCE`) as a placeholder for further processing or document loading.

In [6]:
from dotenv import load_dotenv
import os 

# Load environment variables from .env file
load_dotenv()
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")


USER_AGENT = os.getenv("USER_AGENT", "my_custom_user_agent")
os.environ["USER_AGENT"] = USER_AGENT

# Configuration settings  
MILVUS_URI = "./milvus/milvus_vector.db"  
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  
CORPUS_SOURCE = 'https://dl.acm.org/doi/proceedings/10.1145/3597503'  

print("Environment initialized, documents loaded, embeddings configured.")

Environment initialized, documents loaded, embeddings configured.
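Because `os.getenv` returns `None` when a variable is missing, a missing `MISTRAL_API_KEY` would only surface later, when the model is first called. The sketch below shows one way to fail early instead. The helper name `require_env` and the `DEMO_API_KEY` variable are hypothetical and not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; the notebook itself calls os.getenv directly.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value


# Demo with a throwaway variable so the example runs without a real API key.
os.environ["DEMO_API_KEY"] = "example-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same check could be applied to `MISTRAL_API_KEY` right after `load_dotenv()`.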


# 3. Building the Chatbot

### Creating Hugging Face Embedding Function
Purpose: To create and return an embedding function configured with a specified NLP model for generating text embeddings.

Input:
- Model Name (`MODEL_NAME`): A pre-defined string representing the name of the model (e.g., `"sentence-transformers/all-MiniLM-L6-v2"`). This value is set globally and used to initialize the embedding function.

Output:
Returns an embedding function configured with the given model. This function can then be used to generate embeddings for text data.

Processing:
- The function creates an instance of `HuggingFaceEmbeddings` using `MODEL_NAME`, setting up an embedding generator based on the specified model.
- It returns this configured embedding generator for use in NLP tasks like document analysis or semantic search.

In [7]:
def get_embedding_function():
    """
    returns embedding function for the model

    Returns:
        embedding function
    """
    embedding_function = HuggingFaceEmbeddings(model_name=MODEL_NAME)
    return embedding_function
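Semantic search compares embeddings by similarity rather than by exact text match. The sketch below illustrates that idea with cosine similarity on small hand-written vectors. The three-dimensional vectors are made up for readability (the `all-MiniLM-L6-v2` model produces 384-dimensional vectors), and `cosine_similarity` is a hypothetical helper, not a function from the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumed values).
query_vec = [1.0, 0.0, 1.0]
doc_close = [0.9, 0.1, 1.1]   # points in nearly the same direction as the query
doc_far = [0.0, 1.0, 0.0]     # orthogonal to the query

print(cosine_similarity(query_vec, doc_close) > cosine_similarity(query_vec, doc_far))
```

A retriever built on embeddings ranks stored chunks by a score of this kind and returns the highest-scoring ones.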

### Prompt Generator

Purpose: To set up configurations for an NLP system and provide a placeholder function for generating prompts.

Input:
- Model Name: A string (`"sentence-transformers/all-MiniLM-L6-v2"`) representing the embedding model.
- Milvus URI: A string (`"./milvus/milvus_vector.db"`) specifying the location of the Milvus vector database.

Output:
- A configured environment ready for NLP tasks.
- `create_prompt()` function returns a string template to be used for generating detailed research summaries.

Processing:
- Configuration: Sets the model name for embedding generation and the URI for the vector database connection.
- Prompt creation: Defines a simple placeholder function `create_prompt()` that returns a formatted prompt template for use in text generation.

In [1]:
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  
# Specifies the URI for the Milvus vector database
MILVUS_URI = "./milvus/milvus_vector.db"  

# Placeholder function for creating a prompt
def create_prompt():
    return "Provide a detailed summary of the latest research on: {input}"  
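As a quick illustration of how the `{input}` placeholder is meant to be filled, the sketch below repeats the placeholder function and substitutes a topic with `str.format`. The topic string is an arbitrary example, not taken from the project.

```python
def create_prompt():
    # Same placeholder template as in the cell above.
    return "Provide a detailed summary of the latest research on: {input}"


# Fill the {input} slot with an example topic (assumed value).
filled = create_prompt().format(input="code review automation")
print(filled)
```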

### Vector store loader

Purpose: To simulate the loading of a vector store from a specified URI and provide an interface for data retrieval.

Input:
URI: A string representing the location of the vector database (e.g., `MILVUS_URI`).

Output:
`VectorStore` object with an `as_retriever` method that returns the object itself. This acts as a placeholder for actual vector store functionality.

Processing:
- The function defines a placeholder `VectorStore` class whose `as_retriever` method returns the `VectorStore` instance itself, simulating a vector store capable of retrieval operations.
- The function ignores the `uri` argument and returns a new `VectorStore` instance each time it is called.

In [9]:
# function for loading the vector store
def load_exisiting_db(uri):
    class VectorStore:  
        def as_retriever(self): 
            return self 
    return VectorStore()  
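To make the role of a retriever more concrete, here is a minimal in-memory sketch that stores `(text, vector)` pairs and returns the top-`k` texts by dot product. It is a toy stand-in under stated assumptions: the class name `InMemoryVectorStore`, its `add` method, and the two-dimensional vectors are all hypothetical, and `as_retriever` here returns a plain function rather than a LangChain retriever object.

```python
class InMemoryVectorStore:
    """Hypothetical toy vector store; not the Milvus-backed store used by the project."""

    def __init__(self):
        self._items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self._items.append((text, vector))

    def as_retriever(self, k=2):
        """Return a function that maps a query vector to the k best-matching texts."""
        def retrieve(query_vector):
            scored = sorted(
                self._items,
                key=lambda item: sum(q * v for q, v in zip(query_vector, item[1])),
                reverse=True,
            )
            return [text for text, _ in scored[:k]]
        return retrieve


store = InMemoryVectorStore()
store.add("paper on testing", [1.0, 0.0])
store.add("paper on refactoring", [0.0, 1.0])
store.add("paper on test generation", [0.9, 0.1])

retriever = store.as_retriever(k=2)
print(retriever([1.0, 0.0]))
```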

### Document Chain Placeholder

Purpose: To serve as a placeholder function for creating a document processing chain using a model and a prompt.

Input:
- Model: An object representing an NLP model that will be used in the document chain.
- Prompt: A string or object representing the prompt template to guide document processing.

Output:

Returns a string: "Document chain here." to signify that a document processing chain has been set up (used for demonstration or placeholder purposes).

Processing:
- The function takes the model and prompt as arguments but currently does not perform any operations on them.
- It returns a placeholder string indicating that a document chain has been created.

In [2]:
# Placeholder function for creating a document chain
def create_stuff_documents_chain(model, prompt):
    return "Document chain here." 

### Create Retrieval Chain

Purpose: To act as a placeholder function for setting up a retrieval chain that uses a retriever and a document chain to produce a response with contextual information.

Input:
- Retriever: An object or method used to retrieve relevant data.
- Document Chain: A chain that processes documents to generate responses or insights.

Output:
- Returns a lambda function that, when called, returns a dictionary with contextual metadata and a sample generated answer. 
- This simulates the behavior of a retrieval chain that would provide an answer based on retrieved data.

Processing:
- The function returns a lambda function that simulates generating a response. This lambda takes an input (`x`) and returns a dictionary containing:
    - Context: List of dictionaries with metadata (e.g., source URLs) to simulate source references for the response.
    - Answer: Placeholder string `"Generated response"` representing the result produced by the retrieval chain.

In [11]:
# Placeholder function for creating a retrieval chain
def create_retrieval_chain(retriever, document_chain):
    return lambda x: {  # Returns a lambda function to generate a response
        "context": [  # Contextual information with metadata for sources
            {"metadata": {"source": "https://example.com/research_paper1"}}, 
            {"metadata": {"source": "https://example.com/research_paper2"}}
        ], 
        "answer": "Generated response"  # Sample generated answer
    }
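The sketch below shows how the placeholder chain can be called and how the answer and source URLs can be read from the returned dictionary. It repeats the placeholder definition so that it runs on its own; passing `None` for the retriever and document chain is only possible because the placeholder ignores both arguments.

```python
def create_retrieval_chain(retriever, document_chain):
    # Same placeholder as in the cell above: ignores its arguments and returns canned data.
    return lambda x: {
        "context": [
            {"metadata": {"source": "https://example.com/research_paper1"}},
            {"metadata": {"source": "https://example.com/research_paper2"}},
        ],
        "answer": "Generated response",
    }


chain = create_retrieval_chain(retriever=None, document_chain=None)
response = chain({"input": "What is RAG?"})

# Pull the source URL out of each context entry's metadata.
sources = [doc["metadata"]["source"] for doc in response["context"]]
print(response["answer"])
print(sources)
```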

# 4. Improving the Chatbot

### Improving Chatbot with RAG and Embeddings
Purpose: To create and manage a vector store using Milvus, which stores embeddings generated from web documents for NLP applications.

Input:
- Documents: Text content loaded from the web through the `load_documents_from_web()` function.
- Embedding Function: A function that converts document text into numerical embeddings.
- URI: A file path (`'./milvus/milvus_vector.db'`) specifying the location of the Milvus vector database.

Output:
- Prints console messages at each step, confirming the successful execution of loading, processing, and vector store management.
- If errors occur, an exception message is printed.

Processing:
- Document loading: Retrieves documents from a web source.
- Document splitting: Splits the loaded documents into manageable chunks.
- Embedding generation: Uses an embedding function to generate fixed-size numerical embeddings for each document chunk.
- Vector store creation:
   - Connects to Milvus.
   - Checks if a collection ("research_paper_chatbot") exists. If not, creates a new collection schema.
   - Inserts document embeddings into the collection.
- Vector store loading: Loads an existing vector store collection from Milvus for retrieval operations

In [12]:
import os
from tqdm import tqdm
from pymilvus import (
    connections,
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
    utility
)

# Suppress tqdm warnings by setting environment variable
os.environ['TQDM_DISABLE'] = '1'

# Placeholder function to simulate loading documents from the web
def load_documents_from_web():
    return ["Document 1 content", "Document 2 content", "Document 3 content"]

# Placeholder for document splitting: currently returns the documents unchanged
def split_documents(documents):
    return [doc for doc in documents]

# Placeholder function to simulate getting an embedding function
def get_embedding_function():
    def embedding_function(doc):
        return [0.1] * 512  # Example fixed-size embedding
    return embedding_function

def create_vector_store(docs, embeddings, uri):
    # Create the directory if it does not exist
    head = os.path.split(uri)
    os.makedirs(head[0], exist_ok=True)
    print("Directory created for vector store if it did not exist")

    # Connect to the Milvus database
    connections.connect("default", uri=uri)
    print("Connected to the Milvus database")

    # Define collection name
    collection_name = "research_paper_chatbot"

    # Check if the collection already exists
    if utility.has_collection(collection_name):
        print("Collection already exists. Loading existing Vector Store.")
        collection = Collection(name=collection_name)  # same variable as the else branch, so the final return works
        print("Existing Vector Store Loaded")
    else:
        print("Creating new Vector Store...")
        fields = [
            FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=512),
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        ]
        schema = CollectionSchema(fields=fields, description="Collection for research paper embeddings")
        collection = Collection(name=collection_name, schema=schema)
        print("New Vector Store Created with provided documents")

        # Insert documents into the vector store
        for doc in docs:
            embedding = embeddings(doc)
            collection.insert([[embedding]])

    return collection

def load_exisiting_db(uri):
    collection_name = "research_paper_chatbot"
    vector_store = Collection(name=collection_name)
    print("Loaded existing Vector Store from Milvus database")
    return vector_store

if __name__ == '__main__':
    try:
        print("Loading documents from the web...")
        documents = load_documents_from_web()
        print(f"Loaded {len(documents)} documents from the web.")

        print("Splitting documents into chunks...")
        docs = split_documents(documents)
        print(f"Split into {len(docs)} chunks.")

        print("Getting embedding function...")
        embeddings = get_embedding_function()

        uri = './milvus/milvus_vector.db'

        print("Creating vector store...")
        vector_store = create_vector_store(docs, embeddings, uri)
        print("Vector store created successfully.")

        print("Loading existing vector store...")
        loaded_vector_store = load_exisiting_db(uri)
        print("Loaded existing vector store successfully.")

    except Exception as e:
        print(f"An error occurred: {e}")

    print("Finished operations.")

Loading documents from the web...
Loaded 3 documents from the web.
Splitting documents into chunks...
Split into 3 chunks.
Getting embedding function...
Creating vector store...
Directory created for vector store if it did not exist
Connected to the Milvus database
Creating new Vector Store...
New Vector Store Created with provided documents
Vector store created successfully.
Loading existing vector store...
Loaded existing Vector Store from Milvus database
Loaded existing vector store successfully.
Finished operations.
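In the cell above, `split_documents` returns its input unchanged, which is why three documents yield three chunks. As a rough illustration of what chunking with overlap looks like, here is a fixed-window sketch in plain Python. It is a simplified stand-in, not the behavior of `RecursiveCharacterTextSplitter` (which also tries to split on separators); the function name `split_text` and the parameter values are assumptions.

```python
def split_text(text, chunk_size=20, chunk_overlap=5):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # prints ['abcd', 'defg', 'ghij']
```

The overlap means that a sentence cut at a chunk boundary still appears, at least partly, in the neighboring chunk.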


### Query Response Generator
Purpose: To provide an entry point for querying a Retrieval-Augmented Generation (RAG) system, generate a response using NLP models, and include source references.


Input:
Query: A string containing the user's question or request for information

Output:

- Response: A string containing the generated answer with source references.
- Sources: A list of URLs or source identifiers referenced in the response.

Processing:
1. Model and Prompt Initialization:
    - Loads a `ChatMistralAI` model and creates a prompt template with `create_prompt()` for structured input.
2. Vector Store Retrieval:
    - Loads an existing vector store from Milvus and creates a retriever to fetch relevant documents.
3. Document Chain Creation:
    - Uses `create_stuff_documents_chain()` to set up a document processing chain.
4. Retrieval Chain Setup:
    - Constructs a retrieval chain that integrates the retriever and document chain to process the query.
5. Query Handling and Response Generation:
    - Generates a response using the retrieval chain.
    - Catches `HTTPStatusError` to handle service load issues (e.g., status 429 for high traffic).
6. Source Attribution:
    - Extracts and formats up to four unique sources from the response context to include in the output.
    - Appends source references to the generated response.

In [13]:
def create_prompt():
    """
    Creates a prompt for the model.
    
    Returns:
        PromptTemplate: An instance of PromptTemplate for the model.
    """
    
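    # Include the user's question ({input}) alongside the retrieved context,
    # otherwise the model never sees what was asked.
    return PromptTemplate(
        input_variables=["context", "input"],
        template=(
            "Given the following context: {context}\n\n"
            "Provide a comprehensive response to this question: {input}"
        ),
    )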

def query_rag(query):
    """
    Entry point for the RAG model to generate an answer to a given query.

    Args:
        query (str): The query string for which an answer is to be generated.
    
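    Returns:
        tuple[str, list[str]]: The answer text (with a "Sources" line appended)
        and the list of unique source URLs. On an HTTP error, an error message
        and an empty list are returned instead.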
    """
    model = ChatMistralAI(model='open-mistral-7b')  
    print("Model Loaded") 

    prompt = create_prompt()  

    # Load the vector store and create the retriever
    vector_store = load_exisiting_db(uri=MILVUS_URI)  # spelling matches the loader defined earlier in this notebook
    retriever = vector_store.as_retriever()  
    
    try:
        document_chain = create_stuff_documents_chain(model, prompt)  
        print("Document Chain Created") 

        retrieval_chain = create_retrieval_chain(retriever, document_chain)  
        print("Retrieval Chain Created")  
    
        # Generate a response to the query
        response = retrieval_chain({"input": f"{query}"})
    except HTTPStatusError as e:  
        print(f"HTTPStatusError: {e}") 
        if e.response.status_code == 429:  
            return "I am currently experiencing high traffic. Please try again later.", []  
        return f"HTTPStatusError: {e}", []  
    
    # Logic to add sources to the response
    max_relevant_sources = 4  
    all_sources = ""  
    sources = []  
    count = 1  

    for i in range(max_relevant_sources):  
        try:
            source = response["context"][i]["metadata"]["source"] 
            if source not in sources:  
                sources.append(source)  
                all_sources += f"[Source {count}]({source}), "  
                count += 1  
        except IndexError:  
            break  
            
    all_sources = all_sources[:-2]  
    response["answer"] += f"\n\nSources: {all_sources}"  
    print("Response Generated")  

    return response["answer"], sources  
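The source-attribution step at the end of `query_rag` can be read in isolation as a small function. The sketch below restates that logic: it looks at the first `max_relevant_sources` context entries, keeps unique source URLs in order, and formats them as Markdown links. The function name `format_sources` and the sample context are hypothetical.

```python
def format_sources(context, max_relevant_sources=4):
    """Return (markdown_links, unique_sources) for the leading context entries."""
    sources = []
    for doc in context[:max_relevant_sources]:
        source = doc["metadata"]["source"]
        if source not in sources:  # skip duplicates, keep first-seen order
            sources.append(source)
    links = ", ".join(
        f"[Source {i}]({source})" for i, source in enumerate(sources, start=1)
    )
    return links, sources


# Sample context with one duplicated source (assumed data).
context = [
    {"metadata": {"source": "https://example.com/research_paper1"}},
    {"metadata": {"source": "https://example.com/research_paper1"}},
    {"metadata": {"source": "https://example.com/research_paper2"}},
]
links, sources = format_sources(context)
print(links)
```

Using `", ".join(...)` avoids the need to trim a trailing separator afterwards.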

### Adding Evaluation Metrics with Confusion Matrix

Purpose: To manage and store performance metrics related to a chatbot's classification performance (e.g., True Positives, False Positives, Accuracy, etc.).

Inputs:
- `increment_value (integer)`: Value to increment or update specific metrics.
- `metric (string)`: The name of the metric to update.
- `columns (string)`: The columns to fetch from the database.

Processing:
- Interacts with a database to execute queries.
- Performs necessary calculations for metrics like sensitivity, specificity, precision, recall, F1 score. 
- Safely handles division to avoid division by zero errors.

Outputs:
Updates or retrieves performance metrics stored in the database.

In [14]:
def update_performance_metrics(self):
    """
    Recalculate and update the performance metrics in the database.
    """
    metrics = self.get_performance_metrics('true_positive, true_negative, false_positive, false_negative')
    accuracy = self.safe_division(metrics['true_positive'] + metrics['true_negative'], 
                                  metrics['true_positive'] + metrics['true_negative'] + metrics['false_positive'] + metrics['false_negative'])
    precision = self.safe_division(metrics['true_positive'], metrics['true_positive'] + metrics['false_positive'])
    sensitivity = self.safe_division(metrics['true_positive'], metrics['true_positive'] + metrics['false_negative'])
    specificity = self.safe_division(metrics['true_negative'], metrics['true_negative'] + metrics['false_positive'])
    recall = self.safe_division(metrics['true_positive'], metrics['true_positive'] + metrics['false_negative'])

    if precision and sensitivity:
        f1_score = self.safe_division(2 * precision * sensitivity, precision + sensitivity)
    else:
        f1_score = None

    with self.connection:
        self.connection.execute('''
            UPDATE performance_metrics
            SET accuracy = ?, precision = ?, sensitivity = ?, specificity = ?, f1_score = ?, recall = ?
            WHERE id = 1
        ''', (accuracy, precision, sensitivity, specificity, f1_score, recall))
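The method above relies on `self.safe_division`, which is not shown in this document. The standalone sketch below assumes it returns `None` when the denominator is zero and works through the same formulas on sample confusion-matrix counts. The function `compute_metrics` and the counts are hypothetical and exist only for illustration.

```python
def safe_division(numerator, denominator):
    """Assumed behavior: return None instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else None


def compute_metrics(tp, tn, fp, fn):
    """Compute the same metrics as update_performance_metrics, without a database."""
    precision = safe_division(tp, tp + fp)
    recall = safe_division(tp, tp + fn)  # sensitivity and recall are the same quantity
    if precision and recall:
        f1_score = safe_division(2 * precision * recall, precision + recall)
    else:
        f1_score = None
    return {
        "accuracy": safe_division(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": recall,
        "specificity": safe_division(tn, tn + fp),
        "recall": recall,
        "f1_score": f1_score,
    }


# Sample counts (assumed): 8 true positives, 5 true negatives, 2 false positives, 1 false negative.
metrics = compute_metrics(tp=8, tn=5, fp=2, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 13/16 and precision is 8/10; an all-zero matrix yields `None` for every metric rather than a division error.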

# 5. Testing the Chatbot

Purpose: Test the chatbot with a sample query

Input: Sample queries and documents. 

Output: Responses and any errors encountered. 

Processing: Test the RAG model’s response to ensure correctness.

### Retrieve an answer from RAG.

Submit a question to the RAG system, get the response, and display it.

In [None]:
response, sources = query_rag("How do LLMs work?")
print(response)

# 6. Conclusion

Recap: We set up a retrieval-augmented generation (RAG) chatbot using LangChain and Milvus, integrated PDF loading, and improved its functionality.

Next Steps: Consider enhancing the chatbot with real-time data retrieval or using more advanced NLP techniques.

Resources: For more details, visit the [GitHub repository](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team4) and the [project wiki](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team4/wiki).