# Documentation for bot.py

### Overview of bot.py

- Environment and model setup: Initializes environment variables, loads the Mistral API key, and sets up an embedding model with Hugging Face for document embeddings.

- Retrieval chain and prompt creation: Builds a retrieval-augmented chain using Milvus and defines a prompt template to generate concise, source-based responses to queries.

- Document processing and vector store management: Loads, splits, and stores documents in a Milvus vector store for efficient retrieval, enabling the chatbot to find and return relevant answers with cited sources.

### Import libraries and set up environment variables

- `import os` : Import the os module for operating system functionality

- `from dotenv import load_dotenv` : Import load_dotenv to load environment variables from a `.env` file

- `from langchain.chains.combine_documents import create_stuff_documents_chain` : Import function for combining documents

- `from langchain.schema import Document` : Import Document schema from langchain

- `from langchain_core.prompts import PromptTemplate` : Import prompt template for generating prompts

- `from langchain_mistralai.chat_models import ChatMistralAI` : Import ChatMistralAI model for conversation

- `from langchain_milvus import Milvus` : Import Milvus for vector storage

- `from langchain_community.document_loaders import WebBaseLoader, RecursiveUrlLoader` : Import loaders for web documents

- `from langchain_text_splitters import RecursiveCharacterTextSplitter` : Import splitter for text documents

- `from langchain.chains import create_retrieval_chain` : Import function to create a retrieval chain

- `from langchain_huggingface import HuggingFaceEmbeddings` : Import HuggingFace embeddings

- `from pymilvus import connections, utility` : Import Milvus connection and utility functions

- `from requests.exceptions import HTTPError` : Import HTTPError for handling HTTP exceptions

- `from httpx import HTTPStatusError` : Import HTTPStatusError for HTTP status exceptions

- This code sets up a chatbot that uses document retrieval and embeddings. It imports the necessary modules, loads environment variables such as `MISTRAL_API_KEY`, sets a default `USER_AGENT` if none is defined, and configures the Milvus database URI (`MILVUS_URI`) for storing embeddings.
- It also specifies the embedding model (`MODEL_NAME`) and the document source URL (`CORPUS_SOURCE`), then confirms the setup with a print statement.

In [2]:
import os
from dotenv import load_dotenv
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.schema import Document
from langchain_core.prompts import PromptTemplate
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_milvus import Milvus
from langchain_community.document_loaders import WebBaseLoader, RecursiveUrlLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain_huggingface import HuggingFaceEmbeddings
from pymilvus import connections, utility
from requests.exceptions import HTTPError
from httpx import HTTPStatusError

# Load environment variables from .env file
load_dotenv()
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

# Ensure USER_AGENT environment variable is set; assign a default if missing
USER_AGENT = os.getenv("USER_AGENT", "my_custom_user_agent")
os.environ["USER_AGENT"] = USER_AGENT

# Configuration settings  
MILVUS_URI = "./milvus/milvus_vector.db"  # URI for Milvus vector database
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # Model for sentence embeddings
CORPUS_SOURCE = 'https://dl.acm.org/doi/proceedings/10.1145/3597503'  # Document corpus URL

# Print confirmation message
print("Environment initialized, documents loaded, embeddings configured.")

Environment initialized, documents loaded, embeddings configured.
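
The setup cell reads `MISTRAL_API_KEY` with `os.getenv`, which returns `None` when the variable is missing, so a misconfigured `.env` file would only surface later as an API error. A minimal sketch of a fail-fast check is shown here; `require_env` and `DEMO_USER_AGENT` are hypothetical names used for illustration and are not part of bot.py.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty.

    Hypothetical helper for illustration; bot.py itself calls os.getenv directly.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Same "assign a default if missing" idea as the USER_AGENT lines in the setup cell,
# applied to a demo variable so the real configuration is left untouched.
os.environ.setdefault("DEMO_USER_AGENT", "my_custom_user_agent")
print(require_env("DEMO_USER_AGENT"))
```

Calling `require_env("MISTRAL_API_KEY")` right after `load_dotenv()` would be one way to stop early with a clear message, assuming that behavior is wanted.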


### Creating Hugging Face Embedding Function

This code sets up an embedding function using the Hugging Face model `sentence-transformers/all-MiniLM-L6-v2`. It first sets the `TQDM_DISABLE` environment variable to suppress tqdm output. The `get_embedding_function` function then initializes `HuggingFaceEmbeddings` with the specified model, prints a confirmation message, and returns the result, which the cell stores in `embedding_function`. A final print statement confirms which model was used.

In [17]:
# Suppress tqdm warnings by setting environment variable
os.environ['TQDM_DISABLE'] = '1'  
# Defines the model name for embeddings
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2" 

# Creates embedding function using specified model
def get_embedding_function():
    """
    Returns embedding function for the model.

    Returns:
        embedding function
    """
    embedding_function = HuggingFaceEmbeddings(model_name=MODEL_NAME)  
    
    print("Returns Hugging Face model embedding function.")  
    
    return embedding_function  # Returns the created embedding function

# Calls the function and stores the resulting embedding function 
embedding_function = get_embedding_function()  

# Prints confirmation of the embedding function creation along with the model name
print(f"Embedding function created with model: {MODEL_NAME}")  

Returns Hugging Face model embedding function.
Embedding function created with model: sentence-transformers/all-MiniLM-L6-v2
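
To make the role of the embedding function more concrete, the sketch below mimics the two methods that LangChain embedding objects expose, `embed_query` and `embed_documents`, with a toy hashing scheme. `ToyEmbeddings` and `cosine` are hypothetical names, and the 8-dimensional vectors are an arbitrary choice for readability; this is not how `all-MiniLM-L6-v2` computes its embeddings.

```python
import hashlib
import math


class ToyEmbeddings:
    """Toy stand-in mirroring embed_query/embed_documents; NOT a real language model."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def embed_query(self, text: str) -> list[float]:
        vec = [0.0] * self.dim
        for token in text.lower().split():
            # Hash each token into a bucket so the result is deterministic across runs.
            digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
            vec[int(digest, 16) % self.dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        # Normalize to unit length; an empty text stays the zero vector.
        return [v / norm for v in vec] if norm else vec

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self.embed_query(t) for t in texts]


def cosine(a: list[float], b: list[float]) -> float:
    """For unit-length vectors, cosine similarity reduces to the dot product."""
    return sum(x * y for x, y in zip(a, b))


toy = ToyEmbeddings()
vectors = toy.embed_documents(["milvus vector store", "vector store milvus"])
print(len(vectors[0]), round(cosine(vectors[0], vectors[1]), 6))
```

Because the toy scheme ignores word order, the two sample texts map to the same vector and their similarity is 1. A real sentence-embedding model is sensitive to wording and context, which is what makes retrieval by meaning possible.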


### Query response generation

- This code implements a retrieval-augmented generation (RAG) chatbot that answers questions from retrieved documents. It initializes the `ChatMistralAI` language model, builds a prompt, and loads a Milvus vector store as the retriever. In this cell the prompt, vector store, and chain builders are placeholder functions, so the printed answer and sources are fixed sample values.

- The `query_rag()` function takes a query, assembles these components, and returns the answer with up to four deduplicated source links appended, together with the list of sources. An example query demonstrates the flow.

In [7]:
# Defines the model name for embeddings
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  
# Specifies the URI for the Milvus vector database
MILVUS_URI = "./milvus/milvus_vector.db"  

# Placeholder function for creating a prompt
def create_prompt():
    return "Provide a detailed summary of the latest research on: {input}"  # Returns a prompt template for querying

# Placeholder function for loading the vector store
def load_exisiting_db(uri):
    class VectorStore:  # Defines a local class for the vector store
        def as_retriever(self):  # Method to return itself as a retriever
            return self  # Returns the instance of VectorStore
    return VectorStore()  # Returns an instance of VectorStore

# Placeholder function for creating a document chain
def create_stuff_documents_chain(model, prompt):
    return "Document chain here."  # Returns a string indicating where the document chain would be

# Placeholder function for creating a retrieval chain
def create_retrieval_chain(retriever, document_chain):
    return lambda x: {  # Returns a lambda function to generate a response
        "context": [  # Contextual information with metadata for sources
            {"metadata": {"source": "https://example.com/research_paper1"}}, 
            {"metadata": {"source": "https://example.com/research_paper2"}}
        ], 
        "answer": "Generated response"  # Sample generated answer
    }

def query_rag(query):
    """
    Entry point for the RAG model to generate an answer to a given query.

    This function initializes the RAG model, sets up the necessary components such as the prompt template, vector store, 
    retriever, document chain, and retrieval chain, and then generates a response to the provided query.

    Args:
        query (str): The query string for which an answer is to be generated.
    
    Returns:
        tuple[str, list[str]]: The answer with source links appended, and the list of unique source URLs
    """
    model = ChatMistralAI(model='open-mistral-7b')  # Initializes the ChatMistralAI model
    print("Model Loaded")  # Print statement confirming the model has loaded

    prompt = create_prompt()  # Calls the function to create the prompt

    # Load the vector store and create the retriever
    vector_store = load_exisiting_db(uri=MILVUS_URI)  
    retriever = vector_store.as_retriever()  
    
    try:
        document_chain = create_stuff_documents_chain(model, prompt)  # Creates a document chain
        print("Document Chain Created")  # Print confirmation for document chain creation

        retrieval_chain = create_retrieval_chain(retriever, document_chain)  # Creates a retrieval chain
        print("Retrieval Chain Created")  # Print confirmation for retrieval chain creation
    
        # Generate a response to the query
        response = retrieval_chain({"input": f"{query}"})  # Calls the retrieval chain with the query input
    except HTTPStatusError as e:  # Catches HTTP errors
        print(f"HTTPStatusError: {e}")  # Print the error message
        if e.response.status_code == 429:  # Checks for rate limit error
            return "I am currently experiencing high traffic. Please try again later.", []  # Returns message for high traffic
        return f"HTTPStatusError: {e}", []  # Returns error message for other HTTP errors
    
    # Logic to add sources to the response
    max_relevant_sources = 4  # Defines maximum number of sources to include
    all_sources = ""  # Initialize string to hold all source links
    sources = []  # Initialize list to hold unique sources
    count = 1  # Initialize counter for source numbering

    for i in range(max_relevant_sources):  # Loop over the number of maximum relevant sources
        try:
            source = response["context"][i]["metadata"]["source"]  # Retrieves source from response context
            # Check if the source is already added to the list
            if source not in sources:  # If source is unique
                sources.append(source)  # Add source to the list
                all_sources += f"[Source {count}]({source}), "  # Append formatted source link to all_sources
                count += 1  # Increment the source counter
        except IndexError:  # Handle case where there are no more sources
            break  # Exit the loop if no more sources are available
            
    all_sources = all_sources[:-2]  # Remove the last comma and space from all_sources
    response["answer"] += f"\n\nSources: {all_sources}"  # Append all sources to the answer
    print("Response Generated")  # Print confirmation of response generation

    print("Initializes components, retrieves documents, generates response.")  # Summary of what the function does

    return response["answer"], sources  # Return the generated answer and list of sources

# Example usage
query = "Latest research on machine learning in healthcare"  # Defines an example query
answer, sources = query_rag(query)  # Calls the query_rag function with the example query
print(answer)  # Prints the generated answer

Model Loaded
Document Chain Created
Retrieval Chain Created
Response Generated
Initializes components, retrieves documents, generates response.
Generated response

Sources: [Source 1](https://example.com/research_paper1), [Source 2](https://example.com/research_paper2)
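
The source-collection loop in `query_rag` can also be read as a small standalone helper. The sketch below is one possible refactoring, not the author's code: `format_sources` is a hypothetical name, and it assumes each context entry is a dict with a `"metadata"` → `"source"` key, as in the placeholder response above (with real LangChain `Document` objects the metadata is usually read through the `.metadata` attribute instead).

```python
def format_sources(context: list[dict], max_sources: int = 4) -> tuple[str, list[str]]:
    """Collect unique source URLs from the first max_sources context entries.

    Returns the Markdown link string and the list of unique URLs.
    """
    sources: list[str] = []
    for entry in context[:max_sources]:  # Slicing avoids the IndexError handling
        source = entry["metadata"]["source"]
        if source not in sources:  # Keep only the first occurrence of each URL
            sources.append(source)
    links = ", ".join(f"[Source {i}]({url})" for i, url in enumerate(sources, start=1))
    return links, sources


context = [
    {"metadata": {"source": "https://example.com/research_paper1"}},
    {"metadata": {"source": "https://example.com/research_paper1"}},
    {"metadata": {"source": "https://example.com/research_paper2"}},
]
links, sources = format_sources(context)
print(links)
# [Source 1](https://example.com/research_paper1), [Source 2](https://example.com/research_paper2)
```

Using `", ".join(...)` removes the need to trim a trailing comma, and the duplicate first entry shows that repeated sources are numbered only once.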


### Vector store management

The code sets up the storage side of the retrieval-augmented generation (RAG) system, using a Milvus vector store to hold document embeddings. It includes functions to load documents, split them, create or load the vector store, and define the collection schema. In this cell, document loading, splitting, and embedding are placeholders that return fixed sample data. Print statements log the status of each operation.
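
The `split_documents` placeholder in this section returns its input unchanged. As a rough illustration of what chunking does, the sketch below cuts a text into fixed-size character windows with overlap. `split_text` and its parameter values are hypothetical; the imported `RecursiveCharacterTextSplitter` is more sophisticated, since it also prefers to break on separators such as paragraph and line boundaries.

```python
def split_text(text: str, chunk_size: int = 40, chunk_overlap: int = 10) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap.

    Simplified stand-in for a real text splitter; it ignores word boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text
    return chunks


sample = "Milvus stores one embedding per chunk, so long documents are split first."
for chunk in split_text(sample):
    print(repr(chunk))
```

The overlap means the end of one chunk is repeated at the start of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.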

In [16]:
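from pymilvus import Collection, CollectionSchema, DataType, FieldSchema  # Schema classes used by create_vector_store

def load_documents_from_web():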
    """
    Placeholder function to simulate loading documents from the web.
    
    Returns:
        list: A list of loaded document contents.
    """
    return ["Document 1 content", "Document 2 content", "Document 3 content"]  # Example document contents

def split_documents(documents):
    """
    Splits documents into smaller chunks.
    
    Args:
        documents (list): List of documents to split.
    
    Returns:
        list: List of split document chunks.
    """
    return [doc for doc in documents]  # Example: returns the documents as-is

def get_embedding_function():
    """
    Placeholder function to simulate getting an embedding function.
    
    Returns:
        function: A simulated embedding function.
    """
    def embedding_function(doc):
        return [0.1] * 512  # Example fixed-size embedding
    return embedding_function

def create_vector_store(docs, embeddings, uri):
    """
    This function initializes a vector store using the provided documents and embeddings.
    It connects to a local Milvus database specified by the URI. If a collection named "research_paper_chatbot" already exists,
    it loads the existing collection; otherwise, it creates a new collection and inserts one embedding per document.

    Args:
        docs (list): A list of documents to be stored in the vector store.
        embeddings: A function or model that generates embeddings for the documents.
        uri (str): Path to the local milvus db

    Returns:
        vector_store: The vector store created
    """
    # Create the directory if it does not exist
    head = os.path.split(uri)  # Split the URI into directory and file name
    os.makedirs(head[0], exist_ok=True)  # Create directory if it doesn't exist
    print("Directory created for vector store if it did not exist")  # Log directory creation

    # Connect to the Milvus database
    connections.connect("default", uri=uri) 
    print("Connected to the Milvus database")  

    # Define collection name
    collection_name = "research_paper_chatbot"

    # Check if the collection already exists
    if utility.has_collection(collection_name):  
        print("Collection already exists. Loading existing Vector Store.") 
        vector_store = Collection(name=collection_name)  # Load existing vector store
        print("Existing Vector Store Loaded")  # Log successful load of existing store
    else:
        print("Creating new Vector Store...")
        fields = [
            FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=512),  # Matches the 512-dim placeholder embedding; all-MiniLM-L6-v2 produces 384-dim vectors
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        ]
        schema = CollectionSchema(fields=fields, description="Collection for research paper embeddings")
        vector_store = Collection(name=collection_name, schema=schema)  # Create new collection
        print("New Vector Store Created with provided documents")  # Log new store creation

        # Insert documents into the vector store
        for doc in docs:
            embedding = embeddings(doc)  # Generate embedding
            vector_store.insert([[embedding]])  # Insert the embedding into the collection

    return vector_store  # Return the created or loaded vector store

def load_exisiting_db(uri):
    """
    Load an existing vector store from the local Milvus database specified by the URI.

    Args:
        uri (str): Path to the local milvus db.

    Returns:
        vector_store: The vector store loaded.
    """
    collection_name = "research_paper_chatbot"
    vector_store = Collection(name=collection_name)  # Load existing vector store
    print("Loaded existing Vector Store from Milvus database")  # Log successful load of existing store
    return vector_store  # Return the loaded vector store

if __name__ == '__main__':  # Check if this script is being executed directly
    try:
        # Load documents from the web
        print("Loading documents from the web...")  
        documents = load_documents_from_web()  
        print(f"Loaded {len(documents)} documents from the web.")  # Log number of loaded documents

        # Split the documents into chunks
        print("Splitting documents into chunks...")  # Log document splitting start
        docs = split_documents(documents)  # Split documents into smaller chunks
        print(f"Split into {len(docs)} chunks.")  # Log number of chunks created

        # Get the embedding function
        print("Getting embedding function...")  # Log embedding function retrieval
        embeddings = get_embedding_function()  # Retrieve the embedding function

        # Define the URI for the Milvus database
        uri = './milvus/milvus_vector.db'  # Set the URI variable for the database

        # Create the vector store
        print("Creating vector store...")
        vector_store = create_vector_store(docs, embeddings, uri)  
        print("Vector store created successfully.")

        # Load the existing vector store
        print("Loading existing vector store...")
        loaded_vector_store = load_exisiting_db(uri)  
        print("Loaded existing vector store successfully.")

    except Exception as e:
        print(f"An error occurred: {e}")  # Print any error that occurs

    print("Finished operations.")  # Log completion of operations

Loading documents from the web...
Loaded 3 documents from the web.
Splitting documents into chunks...
Split into 3 chunks.
Getting embedding function...
Creating vector store...
Directory created for vector store if it did not exist
Connected to the Milvus database
Collection already exists. Loading existing Vector Store.
Existing Vector Store Loaded
Vector store created successfully.
Loading existing vector store...
Loaded existing Vector Store from Milvus database
Loaded existing vector store successfully.
Finished operations.
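
The cell above stores embeddings but never queries them, and its placeholder embedding returns the same vector for every document, so a similarity search over that data would not be meaningful. To illustrate what the retriever relies on, here is a toy in-memory store that ranks stored vectors by cosine similarity. `InMemoryVectorStore` is a hypothetical class, and brute-force cosine ranking is an assumption made for clarity; Milvus uses indexed search with a configurable metric.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Toy stand-in for a vector collection: keeps (vector, text) pairs in a list."""

    def __init__(self):
        self._rows: list[tuple[list[float], str]] = []

    def insert(self, vector: list[float], text: str) -> None:
        self._rows.append((vector, text))

    def search(self, query: list[float], k: int = 2) -> list[str]:
        """Return the texts of the k stored vectors most similar to the query."""
        ranked = sorted(self._rows, key=lambda row: cosine(query, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = InMemoryVectorStore()
store.insert([1.0, 0.0], "Document 1 content")
store.insert([0.0, 1.0], "Document 2 content")
store.insert([0.7, 0.7], "Document 3 content")
print(store.search([0.9, 0.1], k=2))
# ['Document 1 content', 'Document 3 content']
```

The query vector points mostly along the first axis, so the first document ranks highest and the diagonal third document comes second. A retriever built on a real vector store performs the same kind of nearest-neighbor lookup, using the embedding of the user's query.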
