# Documentation for Research Paper Chatbot (Team-4)

The Paper Chatbot is a system designed to enhance user interaction with academic papers through a conversational interface. It lets users upload research documents, ask questions, receive summaries, and clarify complex topics within the text, so they can extract critical information without reading the entire paper. The system is aimed at students, researchers, and professionals who need to understand and use detailed content efficiently.

### Importing libraries
This cell sets up the environment for processing and querying documents, particularly academic papers. It imports libraries for document loading, text splitting, and embedding generation, along with ChatMistralAI, HuggingFaceEmbeddings, and a Milvus vector database for retrieval. The MistralAI API key is read from environment variables (loaded from a `.env` file) rather than hard-coded, and the Milvus database path, embedding model name, and data directory are defined as constants.

In [None]:
import os
from dotenv import load_dotenv
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.schema import Document
from langchain_core.prompts import PromptTemplate
#from langchain_mistralai import MistralAIEmbeddings
from langchain_mistralai.chat_models import ChatMistralAI
#from langchain_cohere import ChatCohere
from langchain_milvus import Milvus
from langchain_community.document_loaders import WebBaseLoader, RecursiveUrlLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain_huggingface import HuggingFaceEmbeddings
from pymilvus import connections, utility
from requests.exceptions import HTTPError
from httpx import HTTPStatusError
from langchain_community.document_loaders import PyPDFDirectoryLoader
from roman import toRoman
load_dotenv()
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

MILVUS_URI = "./milvus/milvus_vector.db"
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
data_dir = "./volumes"
print("imported libraries successfully.")

imported libraries successfully.


### Configuring environment variables

Initialize the required environment variables for document processing. `CORPUS_SOURCE` can be updated to change the document source, `MISTRAL_API_KEY` holds the API key for MistralAI, `MILVUS_URI` defines the path to the Milvus Lite database file, and `MODEL_NAME` sets the embedding model used for analyzing the corpus.

In [16]:
from dotenv import load_dotenv
import os 

# Load environment variables from .env file
load_dotenv()
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

# Ensure USER_AGENT environment variable is set; assign a default if missing
USER_AGENT = os.getenv("USER_AGENT", "my_custom_user_agent")
os.environ["USER_AGENT"] = USER_AGENT

# Configuration settings  
MILVUS_URI = "./milvus/milvus_vector.db"  # URI for Milvus vector database
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # Model for sentence embeddings
CORPUS_SOURCE = 'https://dl.acm.org/doi/proceedings/10.1145/3597503'  # Document corpus URL

print("Environment initialized, documents loaded, embeddings configured.")

Environment initialized, documents loaded, embeddings configured.
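
The configuration cell above reads `MISTRAL_API_KEY` with `os.getenv`, which silently returns `None` when the variable is missing. As a minimal sketch, assuming you would rather fail early than hit an authentication error later, a small helper could validate required variables. The name `require_env` is hypothetical and not part of the notebook.

```python
import os


def require_env(name, default=None):
    """Return the value of an environment variable, or raise if it is unset or empty.

    `require_env` is a hypothetical helper used only for illustration.
    """
    value = os.getenv(name, default)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example usage: USER_AGENT gets the same fallback the notebook uses
user_agent = require_env("USER_AGENT", default="my_custom_user_agent")
print(user_agent)
```

Calling `require_env("MISTRAL_API_KEY")` right after `load_dotenv()` would surface a missing key before any model call is attempted.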


### Creating Hugging Face Embedding Function

- The `get_embedding_function` function initializes a `HuggingFaceEmbeddings` object with the model named in `MODEL_NAME` and returns it.

- The cell only defines the function; it does not call it and prints nothing. A caller can store the result in a variable such as `embedding_function` and pass it to the vector store.

In [17]:
def get_embedding_function():
    """
    returns embedding function for the model

    Returns:
        embedding function
    """
    embedding_function = HuggingFaceEmbeddings(model_name=MODEL_NAME)
    return embedding_function

### Prompt Generator

- The `create_prompt` function serves as a simple template generator for formulating queries related to research topics. When called, it returns a predefined string that instructs the model to provide a detailed summary of the latest research on a specified subject, indicated by the placeholder `{input}`. 

- This allows the function to be flexible, enabling users to input various research topics and receive contextually relevant summaries.

In [18]:
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  
# Specifies the URI for the Milvus vector database
MILVUS_URI = "./milvus/milvus_vector.db"  

# Placeholder function for creating a prompt
def create_prompt():
    return "Provide a detailed summary of the latest research on: {input}"  # Returns a prompt template for querying
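
To make the role of the `{input}` placeholder concrete, here is a minimal sketch that fills the template with Python's built-in `str.format`. The helper `fill_prompt` is hypothetical and only illustrates the substitution; it is not how LangChain itself renders prompts.

```python
def create_prompt():
    # Same template string as the placeholder in the notebook
    return "Provide a detailed summary of the latest research on: {input}"


def fill_prompt(topic):
    """Substitute a research topic into the {input} placeholder (hypothetical helper)."""
    return create_prompt().format(input=topic)


print(fill_prompt("retrieval-augmented generation"))
```

Any topic string can be passed in, which is what makes the single template reusable across queries.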

### Vector store loader

- The `load_exisiting_db` function is a placeholder that simulates loading a vector store from a given URI; the `uri` argument is accepted but not used.

- It defines a local class, `VectorStore`, which includes a method called `as_retriever` that allows the instance of `VectorStore` to be treated as a retriever. When called, this function returns an instance of the `VectorStore` class, providing a structure for further integration with a retrieval system.

In [19]:
# Placeholder function for loading the vector store
def load_exisiting_db(uri):
    class VectorStore:  # Defines a local class for the vector store
        def as_retriever(self):  # Method to return itself as a retriever
            return self  # Returns the instance of VectorStore
    return VectorStore()  # Returns an instance of VectorStore
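
The placeholder above returns an object with no search behavior. As a rough illustration of what a retriever does conceptually, the sketch below ranks stored texts by cosine similarity to a query. Everything in it (`ToyVectorStore`, `toy_embed`, the `invoke` method, the keyword-count embedding) is a hypothetical stand-in for illustration, not the Milvus or LangChain API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(text):
    """Toy embedding (assumption): counts of three keywords instead of a real model."""
    words = text.lower().split()
    return [words.count("milvus"), words.count("prompt"), words.count("embedding")]


class ToyVectorStore:
    """Hypothetical in-memory vector store used only to illustrate retrieval."""

    def __init__(self, embed):
        self.embed = embed
        self.items = []  # list of (text, vector) pairs
        self.k = 2       # number of results returned by invoke()

    def add_texts(self, texts):
        for text in texts:
            self.items.append((text, self.embed(text)))

    def as_retriever(self, k=2):
        # Like the notebook placeholder, the store acts as its own retriever
        self.k = k
        return self

    def invoke(self, query):
        query_vector = self.embed(query)
        ranked = sorted(
            self.items,
            key=lambda item: cosine_similarity(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[: self.k]]


store = ToyVectorStore(toy_embed)
store.add_texts(["milvus stores each embedding", "a prompt guides the model"])
retriever = store.as_retriever(k=1)
print(retriever.invoke("embedding"))
```

A real vector store performs the same ranking step, but with learned embeddings and an index that avoids comparing the query against every stored vector.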

### Document Chain Placeholder

- The `create_stuff_documents_chain` function is a placeholder for a document processing chain that combines a model with a prompt. Because it is defined in the notebook, it shadows the `create_stuff_documents_chain` imported from `langchain` in the first cell.
- For now it ignores both arguments and returns a fixed string marking where the real document chain would go.

In [20]:
# Placeholder function for creating a document chain
def create_stuff_documents_chain(model, prompt):
    return "Document chain here."  # Returns a string indicating where the document chain would be

### Create Retrieval Chain

- The `create_retrieval_chain` function is a placeholder for a retrieval chain, which combines a retriever and a document chain to generate a response. It shadows the `langchain` import of the same name from the first cell.

- It returns a lambda that ignores its input and returns a fixed dictionary: two example source entries under `context` and a sample string under `answer`.

In [21]:
# Placeholder function for creating a retrieval chain
def create_retrieval_chain(retriever, document_chain):
    return lambda x: {  # Returns a lambda function to generate a response
        "context": [  # Contextual information with metadata for sources
            {"metadata": {"source": "https://example.com/research_paper1"}}, 
            {"metadata": {"source": "https://example.com/research_paper2"}}
        ], 
        "answer": "Generated response"  # Sample generated answer
    }

### Vector store management (Milvus database)

The code sets up a retrieval-augmented generation (RAG) system using a Milvus vector store to manage document embeddings. It includes functions to load documents, create or load a vector store, and manage embedding schemas for storage. Progress is logged with print statements to inform the user of each operation's status.

In [24]:
import os
from tqdm import tqdm
from pymilvus import (
    connections,
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
    utility
)

# Ask tqdm to disable its progress bars via the TQDM_DISABLE environment variable
os.environ['TQDM_DISABLE'] = '1'

# Placeholder function to simulate loading documents from the web
def load_documents_from_web():
    return ["Document 1 content", "Document 2 content", "Document 3 content"]

# Placeholder for chunking: returns a copy of the input list; no splitting is performed yet
def split_documents(documents):
    return list(documents)

# Placeholder function to simulate getting an embedding function
def get_embedding_function():
    def embedding_function(doc):
        return [0.1] * 512  # Example fixed-size embedding
    return embedding_function

def create_vector_store(docs, embeddings, uri):
    # Create the directory if it does not exist
    head = os.path.split(uri)
    os.makedirs(head[0], exist_ok=True)
    print("Directory created for vector store if it did not exist")

    # Connect to the Milvus database
    connections.connect("default", uri=uri)
    print("Connected to the Milvus database")

    # Define collection name
    collection_name = "research_paper_chatbot"

    # Check if the collection already exists
    if utility.has_collection(collection_name):
        print("Collection already exists. Loading existing Vector Store.")
        collection = Collection(name=collection_name)
        print("Existing Vector Store Loaded")
    else:
        print("Creating new Vector Store...")
        fields = [
            FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=512),
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        ]
        schema = CollectionSchema(fields=fields, description="Collection for research paper embeddings")
        collection = Collection(name=collection_name, schema=schema)
        print("New Vector Store Created with provided documents")

        # Insert one embedding per document into the new collection
        for doc in docs:
            embedding = embeddings(doc)
            collection.insert([[embedding]])

    # Both branches bind `collection`, so the return value is always defined
    return collection

def load_exisiting_db(uri):
    collection_name = "research_paper_chatbot"
    vector_store = Collection(name=collection_name)
    print("Loaded existing Vector Store from Milvus database")
    return vector_store

if __name__ == '__main__':
    try:
        print("Loading documents from the web...")
        documents = load_documents_from_web()
        print(f"Loaded {len(documents)} documents from the web.")

        print("Splitting documents into chunks...")
        docs = split_documents(documents)
        print(f"Split into {len(docs)} chunks.")

        print("Getting embedding function...")
        embeddings = get_embedding_function()

        uri = './milvus/milvus_vector.db'

        print("Creating vector store...")
        vector_store = create_vector_store(docs, embeddings, uri)
        print("Vector store created successfully.")

        print("Loading existing vector store...")
        loaded_vector_store = load_exisiting_db(uri)
        print("Loaded existing vector store successfully.")

    except Exception as e:
        print(f"An error occurred: {e}")

    print("Finished operations.")

Loading documents from the web...
Loaded 3 documents from the web.
Splitting documents into chunks...
Split into 3 chunks.
Getting embedding function...
Creating vector store...
Directory created for vector store if it did not exist
Connected to the Milvus database
Creating new Vector Store...
New Vector Store Created with provided documents
Vector store created successfully.
Loading existing vector store...
Loaded existing Vector Store from Milvus database
Loaded existing vector store successfully.
Finished operations.
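
The `split_documents` placeholder in the cell above returns its input unchanged, which is why three documents yield three "chunks". As a minimal sketch of what chunking could look like, the code below cuts text into fixed-size character windows with overlap. The names `split_text` and `split_documents_fixed` and the default sizes are assumptions for illustration; this is a simpler technique than the `RecursiveCharacterTextSplitter` imported in the first cell, not a reimplementation of it.

```python
def split_text(text, chunk_size=200, chunk_overlap=50):
    """Cut text into windows of chunk_size characters, each overlapping the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


def split_documents_fixed(documents, chunk_size=200, chunk_overlap=50):
    """Apply split_text to every document and flatten the result (hypothetical helper)."""
    return [
        chunk
        for doc in documents
        for chunk in split_text(doc, chunk_size, chunk_overlap)
    ]


print(split_documents_fixed(["abcdef", "xy"], chunk_size=4, chunk_overlap=2))
```

The overlap keeps sentences that straddle a boundary present in both neighboring chunks, at the cost of storing some text twice.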


### Query Response Generator
- The `query_rag` function serves as the main entry point for generating responses using a Retrieval-Augmented Generation (RAG) model.

- It initializes the model and sets up the necessary components, including the prompt and retrieval mechanisms, to process a user query. After generating a response, it compiles a list of relevant sources used in the response, ensuring that users receive both the answer and the origins of the information provided.

In [25]:
def create_prompt():
    """
    Creates a prompt for the model.
    
    Returns:
        PromptTemplate: An instance of PromptTemplate for the model.
    """
    # Create a prompt template with both the retrieved context and the user's question,
    # so the query passed as "input" actually reaches the model
    return PromptTemplate(
        input_variables=["context", "input"],
        template="Given the following context: {context}, provide a comprehensive response to: {input}"
    )

def query_rag(query):
    """
    Entry point for the RAG model to generate an answer to a given query.

    Args:
        query (str): The query string for which an answer is to be generated.
    
    Returns:
        tuple[str, list[str]]: The answer with a "Sources" line appended, and the list of unique source URLs
    """
    model = ChatMistralAI(model='open-mistral-7b')  # Initializes the ChatMistralAI model
    print("Model Loaded")  # Print statement confirming the model has loaded

    prompt = create_prompt()  # Calls the function to create the prompt

    # Load the vector store and create the retriever
    vector_store = load_exisiting_db(uri=MILVUS_URI)  # Name must match the function as defined earlier (spelled "exisiting")
    retriever = vector_store.as_retriever()  
    
    try:
        document_chain = create_stuff_documents_chain(model, prompt)  # Creates a document chain
        print("Document Chain Created")  # Print confirmation for document chain creation

        retrieval_chain = create_retrieval_chain(retriever, document_chain)  # Creates a retrieval chain
        print("Retrieval Chain Created")  # Print confirmation for retrieval chain creation
    
        # Generate a response to the query
        response = retrieval_chain({"input": f"{query}"})  # Calls the retrieval chain with the query input
    except HTTPStatusError as e:  # Catches HTTP errors
        print(f"HTTPStatusError: {e}")  # Print the error message
        if e.response.status_code == 429:  # Checks for rate limit error
            return "I am currently experiencing high traffic. Please try again later.", []  # Returns message for high traffic
        return f"HTTPStatusError: {e}", []  # Returns error message for other HTTP errors
    
    # Logic to add sources to the response
    max_relevant_sources = 4  # Defines maximum number of sources to include
    all_sources = ""  # Initialize string to hold all source links
    sources = []  # Initialize list to hold unique sources
    count = 1  # Initialize counter for source numbering

    for i in range(max_relevant_sources):  # Loop over the number of maximum relevant sources
        try:
            source = response["context"][i]["metadata"]["source"]  # Retrieves source from response context
            # Check if the source is already added to the list
            if source not in sources:  # If source is unique
                sources.append(source)  # Add source to the list
                all_sources += f"[Source {count}]({source}), "  # Append formatted source link to all_sources
                count += 1  # Increment the source counter
        except IndexError:  # Handle case where there are no more sources
            break  # Exit the loop if no more sources are available
            
    all_sources = all_sources[:-2]  # Remove the last comma and space from all_sources
    response["answer"] += f"\n\nSources: {all_sources}"  # Append all sources to the answer
    print("Response Generated")  # Print confirmation of response generation

    return response["answer"], sources  # Return the generated answer and list of sources
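
The source-collection loop inside `query_rag` can also be read as a standalone step: look at the first few context entries, keep each source once, and format the result as numbered Markdown links. The sketch below isolates that logic; the helper name `collect_sources` is hypothetical, and the sample context mirrors the shape returned by the placeholder retrieval chain.

```python
def collect_sources(context, max_sources=4):
    """Return unique sources from the first max_sources context entries, plus a links string."""
    sources = []
    for doc in context[:max_sources]:
        source = doc["metadata"]["source"]
        if source not in sources:  # keep each source only once, in first-seen order
            sources.append(source)

    links = ", ".join(
        f"[Source {number}]({source})"
        for number, source in enumerate(sources, start=1)
    )
    return sources, links


sample_context = [
    {"metadata": {"source": "https://example.com/research_paper1"}},
    {"metadata": {"source": "https://example.com/research_paper1"}},
    {"metadata": {"source": "https://example.com/research_paper2"}},
]
sources, links = collect_sources(sample_context)
print(links)
```

Using `", ".join(...)` avoids having to trim a trailing separator, and slicing the context removes the need to catch `IndexError` when fewer entries are available.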

### Retrieve an answer from RAG

Submit a question to the RAG system, get the response, and display it.

In [None]:
response, sources = query_rag("How do LLMs work?")
print(response)