# ITS Support Chatbot

This chatbot is an educational tool built to answer questions about CSUSB's [Information Technology Services](https://www.csusb.edu/its). It was built by Team 1 for [CSE 6550: Software Engineering Concepts](https://catalog.csusb.edu/coursesaz/cse/).

In this notebook, we demonstrate how the chatbot uses retrieval-augmented generation (RAG) to answer questions, with the ITS website as the primary data source.

[![GitHub](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team1) 
[![Wiki](https://img.shields.io/badge/Wiki-blue?style=flat&logo=wikipedia&logoColor=white)](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team1/wiki)

## Table of Contents
1. [Setup](#1.-Setup)  
    - 1.1. [Import Required Libraries](#1.1-Importing-Required-Libraries)  
    - 1.2. [Set Up Environment Variables](#1.2-Set-Up-Environment-Variables)  
2. [Building the Chatbot](#2.-Building-the-Chatbot)  
    - 2.1. [Vector Store and Embeddings](#2.1-Vector-Store-and-Embeddings)
        - 2.1.1. [Check Vector Store](#2.1.1-Function-to-Check-Vector-Store-(Milvus-database))  
        - 2.1.2. [Fetch Embedding Model](#2.1.2-Function-to-fetch-the-embedding-model)  
    - 2.2. [Document Handling](#2.2.-Document-Handling)  
        - 2.2.1. [Text Cleaning](#2.2.1-Function-to-Clean-Text)  
        - 2.2.2. [Clean HTML Content](#2.2.2-Function-to-Clean-and-Extract-Text-from-HTML-Content)  
        - 2.2.3. [Load Documents from the Web](#2.2.3-Function-for-loading-documents-from-the-web) 
    - 2.3. [Milvus Vector Store Management](#2.3-Milvus-Vector-Store-Management) 
        - 2.3.1. [Load Existing Vector Store](#2.3.1-Function-to-load-existing-vector-store-(Milvus-database))
        - 2.3.2. [Split Documents into Chunks](#2.3.2-Function-to-split-documents)   
        - 2.3.3. [Create New Vector Store](#2.3.3-Function-to-Create-New-Vector-Store-(Milvus-database))  
        - 2.3.4. [Initialize Milvus](#2.3.4-Core-function-for-initializing-Milvus) 
        - 2.3.5. [Initializing Vector Store](#2.3.5-Initializing-vector-store-(Milvus-database))
3. [Testing the Chatbot](#3.-Testing-the-Chatbot)  
    - 3.1. [Create RAG Prompt](#3.1-Function-to-create-RAG-prompt)  
    - 3.2. [Query RAG](#3.2-Function-to-query-RAG-model)  
    - 3.3. [Retrieve RAG Response](#3.3-Get-response-from-RAG)  
4. [Contributors](#4-Contributors)


## 1. Setup

### 1.1 Importing Required Libraries

Import the core libraries: `os` and `logging` for environment and log configuration, `httpx` for HTTP error handling, `pymilvus` for vector storage, `BeautifulSoup` for HTML parsing, and LangChain. The LangChain extension packages used are core, mistralai, milvus, community, text-splitters, and huggingface.

In [None]:
import os
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,  # Set the logging level to INFO
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

# Suppress Hugging Face tokenizers parallelism warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Installing dependencies if not already installed, suppressing "Requirement already satisfied" warnings
!pip install -q httpx pymilvus --root-user-action=ignore
!pip install -q langchain langchain-core langchain-mistralai langchain-cohere langchain-milvus langchain-community beautifulsoup4 langchain-text-splitters langchain-huggingface --root-user-action=ignore

import warnings
warnings.filterwarnings('ignore')

from pymilvus import connections, utility

from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.schema import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_milvus import Milvus
from langchain_community.document_loaders import RecursiveUrlLoader
from bs4 import BeautifulSoup
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain_huggingface import HuggingFaceEmbeddings

from httpx import HTTPStatusError

print("Dependencies imported successfully.")

### 1.2 Set Up Environment Variables


Define the configuration values needed for RAG operation. `CORPUS_SOURCE` can be changed to load a different corpus, `MISTRAL_API_KEY` holds the MistralAI API key read from the environment, `MILVUS_URI` is the path to the Milvus Lite database file, and `MODEL_NAME` specifies the embedding model used for the corpus.

In [None]:
CORPUS_SOURCE = 'https://www.csusb.edu/its'
MISTRAL_API_KEY = os.environ.get("MISTRAL_API_KEY")
MILVUS_URI = "milvus/jupyter_milvus_vector.db"
MODEL_NAME = "sentence-transformers/all-MiniLM-L12-v2"

print("ENV variables defined.")
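
Because `os.environ.get` returns `None` when a variable is missing, a missing API key would only surface later, when the model is first called. A minimal sketch of an early check follows; the helper name `require_env` and the `DEMO_API_KEY` variable are hypothetical and not part of the notebook.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value

# Demonstration with a placeholder variable (hypothetical name and value).
os.environ["DEMO_API_KEY"] = "placeholder-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same check could be applied to `MISTRAL_API_KEY` before any chain is built.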

## 2. Building the Chatbot

### 2.1 Vector Store and Embeddings

#### 2.1.1 Function to Check Vector Store (Milvus database)

Purpose: 
To check if the Milvus vector store already exists at the specified URI.

Input: Path to the Milvus database `uri` (str).

Output: 
Returns a boolean (True if the vector store exists, False otherwise).

Processing:
- Creates the parent directory of `uri` (`milvus/` for the default path) if it doesn't exist.
- Connects to the Milvus database at the specified `uri`.
- Checks whether the collection `IT_support` exists in the Milvus database.

In [None]:
def vector_store_check(uri):
    """
    Check whether the vector store already exists.

    Args:
        uri (str): Path to the local Milvus database file.

    Returns:
        bool: True if the `IT_support` collection exists, False otherwise.
    """
    # Create the parent directory if the path has one and it does not exist
    parent_dir = os.path.dirname(uri)
    if parent_dir:
        os.makedirs(parent_dir, exist_ok=True)

    # Connect to the Milvus database
    connections.connect("default", uri=uri)

    # Return True if the collection exists, False otherwise
    return utility.has_collection("IT_support")

print("Function `vector_store_check` defined.")

#### 2.1.2 Function to fetch the embedding model

Purpose:
To load and initialize the embedding model for vectorizing documents.

Input:
None.

Output: Returns the embedding function loaded from the Hugging Face model specified in `MODEL_NAME`.

Processing:
- Loads the embedding model using `HuggingFaceEmbeddings`.
- Returns the initialized embedding function.

In [None]:
def get_embedding_function():
    """
    Returns embedding function for the model

    Returns:
        embedding function
    """
    embedding_function = HuggingFaceEmbeddings(model_name=MODEL_NAME)
    
    return embedding_function

print("Function `get_embedding_function` defined.")
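
To give a feel for how embedding vectors are compared at retrieval time, here is a minimal sketch of cosine similarity on toy vectors. The three-dimensional vectors are made up for illustration; the real model produces much longer vectors, and the metric Milvus actually uses depends on the index configuration, which this notebook leaves at its defaults.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values)
query = [1.0, 0.0, 1.0]
close_doc = [2.0, 0.0, 2.0]   # Same direction as the query
far_doc = [0.0, 1.0, 0.0]     # Orthogonal to the query

print(cosine_similarity(query, close_doc))
print(cosine_similarity(query, far_doc))
```

Vectors pointing in the same direction score close to 1, while orthogonal vectors score 0, which is why semantically related text tends to be retrieved first.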

### 2.2. Document Handling

#### 2.2.1 Function to Clean Text
Purpose:
To clean a given text by removing extra whitespace and blank lines.

Input:
- `text` (str): The input text to be cleaned.

Output:
Returns the cleaned text with unnecessary whitespace and blank lines removed.

Processing:
- Splits the text into lines and trims leading/trailing spaces from each line.
- Removes empty lines from the text.
- Joins the cleaned lines into a single string.

In [None]:
def clean_text(text):
    """Further clean the text by removing extra whitespace and new lines."""
    lines = (line.strip() for line in text.splitlines())
    cleaned_lines = [line for line in lines if line]
    return '\n'.join(cleaned_lines)

print("Function `clean_text` defined.")
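
A quick usage example shows the effect on a small input. The function is repeated here so the snippet runs on its own, and the sample string is made up for illustration.

```python
def clean_text(text):
    """Same logic as the cell above, repeated so this example is self-contained."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

# Sample input with padding, blank lines, and a trailing tab
sample = "  Reset your password  \n\n\n   Visit the ITS portal.\t\n"
cleaned = clean_text(sample)
print(repr(cleaned))  # 'Reset your password\nVisit the ITS portal.'
```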

#### 2.2.2 Function to Clean and Extract Text from HTML Content
Purpose:
To extract and clean the main content from an HTML document.

Input:
- `html_content` (str): The HTML content to be cleaned.

Output:
Returns the cleaned plain text content extracted from the HTML.

Processing:

- Parses the HTML using `BeautifulSoup`.
- Removes unnecessary elements such as `<script>`, `<style>`, `<header>`, `<footer>`, and `<nav>`.
- Extracts text from the `<main>` tag if it exists, or the entire document otherwise.
- Cleans the extracted text using the `clean_text` function.

In [None]:
def clean_text_from_html(html_content):
    """Clean HTML content to extract main text."""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove unnecessary elements
    for script_or_style in soup(['script', 'style', 'header', 'footer', 'nav']):
        script_or_style.decompose()

    main_content = soup.find('main')
    if main_content:
        content = main_content.get_text(separator='\n')
    else:
        content = soup.get_text(separator='\n')

    return clean_text(content)

print("Function `clean_text_from_html` defined.")
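
The same filtering idea can be sketched with only the standard library, which may help when BeautifulSoup is unavailable. This is not the notebook's implementation: it uses `html.parser` instead, the names `TextExtractor` and `extract_text` are hypothetical, and it may behave differently on malformed HTML.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "header", "footer", "nav"}

class TextExtractor(HTMLParser):
    """Collect text outside skipped tags, tracking text inside <main> separately."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped tag
        self.main_depth = 0   # > 0 while inside <main>
        self.saw_main = False
        self.all_text = []
        self.main_text = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "main":
            self.main_depth += 1
            self.saw_main = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "main" and self.main_depth > 0:
            self.main_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        self.all_text.append(data)
        if self.main_depth:
            self.main_text.append(data)

def extract_text(html):
    """Return cleaned text from <main> if present, else from the whole page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    chunks = parser.main_text if parser.saw_main else parser.all_text
    lines = (line.strip() for chunk in chunks for line in chunk.splitlines())
    return "\n".join(line for line in lines if line)

# Made-up sample page
page = (
    "<html><head><style>p{}</style></head><body><nav>Menu</nav>"
    "<main><h1>Wi-Fi</h1><p> Connect to eduroam. </p></main>"
    "<footer>Contact</footer></body></html>"
)
print(extract_text(page))
```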

#### 2.2.3 Function for loading documents from the web

Purpose:
To recursively load and clean documents from a web source specified in `CORPUS_SOURCE`.

Input:
None.

Output:
Returns a list of cleaned documents as `Document` objects.

Processing:

- Uses `RecursiveUrlLoader` to crawl pages under the base URL (`CORPUS_SOURCE`), up to the loader's default crawl depth.
- Iterates through the loaded documents:
    - Cleans the text using `clean_text_from_html`.
    - Creates `Document` objects with the cleaned text and metadata.
- Returns the list of cleaned `Document` objects.

In [None]:
def load_documents_from_web():
    """
    Load the documents from the web and store the page contents

    Returns:
        list: The documents loaded from the web
    """
    loader = RecursiveUrlLoader(
        url=CORPUS_SOURCE,
        prevent_outside=True,
        base_url=CORPUS_SOURCE
        )
    raw_documents = loader.load()
    
    # Ensure documents are cleaned
    cleaned_documents = []
    for doc in raw_documents:
        cleaned_text = clean_text_from_html(doc.page_content)
        cleaned_documents.append(Document(page_content=cleaned_text, metadata=doc.metadata))

    return cleaned_documents

print("Function `load_documents_from_web` defined.")
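
The `prevent_outside=True` and `base_url` arguments keep the crawl within the ITS site. As a rough illustration of that idea (the loader's actual check may differ), the sketch below keeps only links that begin with the base URL. The helper name `filter_inside` and the sample links are hypothetical.

```python
def filter_inside(links, base_url):
    """Keep only links that start with base_url (a plain prefix check)."""
    return [link for link in links if link.startswith(base_url)]

BASE_URL = "https://www.csusb.edu/its"

# Made-up links of the kind a crawler might discover on a page
links = [
    "https://www.csusb.edu/its/support",
    "https://www.csusb.edu/admissions",
    "https://example.com/its",
]
print(filter_inside(links, BASE_URL))  # ['https://www.csusb.edu/its/support']
```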

### 2.3 Milvus Vector Store Management

#### 2.3.1 Function to load existing vector store (Milvus database)

Accepts the path to the database, obtains the embedding function from `get_embedding_function`, and connects to the existing `IT_support` collection. Returns the connected vector store.

In [None]:
def load_existing_db(uri=MILVUS_URI):
    """
    Load an existing vector store from the local Milvus database specified by the URI.

    Args:
        uri (str, optional): Path to the local milvus db. Defaults to MILVUS_URI.

    Returns:
        vector_store: The vector store created
    """
    # Load an existing vector store
    vector_store = Milvus(
        collection_name="IT_support",
        embedding_function=get_embedding_function(),
        connection_args={"uri": uri},
    )
    
    logger.info("Vector store loaded")
    return vector_store

print("Function `load_existing_db` defined.")

#### 2.3.2 Function to split documents

Takes the documents loaded by `load_documents_from_web` and splits them into chunks of at most 1000 characters. Consecutive chunks overlap by up to 300 characters so that context is preserved across chunk boundaries. Returns the list of chunks.

In [None]:
def split_documents(documents):
    """
    Split the documents into chunks

    Args:
        documents (list): The documents to split

    Returns:
        list: list of chunks of documents
    """
    # Create a text splitter to split the documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=300,
        is_separator_regex=False,
    )
    
    # Split the documents into chunks
    docs = text_splitter.split_documents(documents)
    
    logger.info("Documents successfully split")
    return docs

print("Function `split_documents` defined.")
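
To make the `chunk_size` and `chunk_overlap` arithmetic concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and lines); the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=300):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # Each window starts 700 characters after the last
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The current window already reaches the end of the text
        start += step
    return chunks

# 2400 characters of synthetic text
text = "".join(str(i % 10) for i in range(2400))
chunks = chunk_with_overlap(text)
print([len(chunk) for chunk in chunks])  # [1000, 1000, 1000]
```

With a step of 700 characters, the windows start at offsets 0, 700, and 1400, and the last 300 characters of each chunk reappear at the start of the next.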

#### 2.3.3 Function to Create New Vector Store (Milvus database)

Uses the chunks produced by `split_documents`, the embedding function from `get_embedding_function`, and the database path to build the vector store. Because `drop_old=True` is set, any existing `IT_support` collection is replaced. Returns the vector store once it has been created.

In [None]:
def create_vector_store(docs, embeddings, uri):
    """
    This function initializes a vector store using the provided documents and embeddings.

    Args:
        docs (list): A list of documents to be stored in the vector store.
        embeddings : A function or model that generates embeddings for the documents.
        uri (str): Path to the local milvus db

    Returns:
        vector_store: The vector store created
    """
    # Create a new vector store and drop any existing one
    vector_store = Milvus.from_documents(
        documents=docs,
        embedding=embeddings,
        collection_name="IT_support",
        connection_args={"uri": uri},
        drop_old=True,
    )
    
    logger.info("Vector store created")
    return vector_store

print("Function `create_vector_store` defined.")

#### 2.3.4 Core function for initializing Milvus

This is the primary function for initializing Milvus. If the `IT_support` collection already exists, `initialize_milvus` loads the existing vector store. Otherwise it loads the embedding model, fetches and cleans the documents, splits them into chunks, and creates a new vector store.

In [None]:
def initialize_milvus(uri: str=MILVUS_URI):
    """
    Initialize the vector store for the RAG model

    Args:
        uri (str, optional): Path to the local vector storage. Defaults to MILVUS_URI.

    Returns:
        vector_store: The vector store created
    """
    if vector_store_check(uri):
        vector_store = load_existing_db(uri)
        logger.info("Embeddings loaded from existing storage")
    else:
        embeddings = get_embedding_function()
        logger.info("Embeddings Loaded")
        documents = load_documents_from_web()
        logger.info("Documents Loaded")
    
        # Split the documents into chunks
        docs = split_documents(documents=documents)
        logger.info("Documents Splitting completed")
    
        vector_store = create_vector_store(docs, embeddings, uri)
    logger.info("Milvus successfully initialized")
    return vector_store

print("Function `initialize_milvus` defined.")

#### 2.3.5 Initializing vector store (Milvus database)

This step may take a considerable amount of time on the first run, because every chunk must be embedded. Later runs reuse the stored collection. Please be patient while it completes.

In [None]:
print("Starting Milvus initialization.")
initialize_milvus()

## 3. Testing the Chatbot

### 3.1 Function to create RAG prompt

Purpose:
To create a prompt template for the RAG model with predefined system instructions.

Input:
None.

Output:
Returns a `ChatPromptTemplate` object.

Processing:
- Defines a template with system instructions for generating accurate and context-based responses.
- Initializes a `ChatPromptTemplate` using the system template and a human prompt structure.

In [None]:
def create_prompt():
    """
    Create a prompt template for the RAG model

    Returns:
        ChatPromptTemplate: The prompt template for the RAG model
    """
    # Define the prompt template
    PROMPT_TEMPLATE = """\
    You are an AI assistant that provides answers strictly based on the provided context. Adhere to these guidelines:
     - Only answer questions based on the content within the <context> tags.
     - If the <context> does not contain information related to the question, respond only with: "I don't have enough information to answer this question."
     - For unclear questions or questions that lack specific context, request clarification from the user.
     - Provide specific, concise answers. Where relevant information includes statistics or numbers, include them in the response.
     - Avoid adding any information, assumption, or external knowledge. Answer accurately within the scope of the given context and do not guess.
     - If information is missing, respond only with: "I don't have enough information to answer this question."
    """

    prompt = ChatPromptTemplate.from_messages([
        ("system", PROMPT_TEMPLATE),
        ("human", "<question>{input}</question>\n\n<context>{context}</context>"),
    ])

    logger.info("Prompt template defined")
    return prompt

print("Function `create_prompt` defined.")
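
To see how the `{input}` and `{context}` placeholders end up in the human message, the sketch below fills the same template with plain `str.format`. This is a stand-in for what the prompt template and document chain do, not their actual code: the helper `render_human_message` is hypothetical, the sample question and chunks are made up, and joining chunks with a blank line is an assumption about the chain's default separator.

```python
HUMAN_TEMPLATE = "<question>{input}</question>\n\n<context>{context}</context>"

def render_human_message(question, context_chunks):
    """Fill the human message template with a question and retrieved chunks."""
    context = "\n\n".join(context_chunks)  # Assumed separator between chunks
    return HUMAN_TEMPLATE.format(input=question, context=context)

message = render_human_message(
    "How do I reset my password?",
    ["Chunk one.", "Chunk two."],
)
print(message)
```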

### 3.2 Function to query RAG model

Loads the MistralAI model, the prompt template, and the vector store (Milvus database). The vector store is converted into a retriever that fetches the chunks most relevant to the question. A document chain passes those chunks to the model as context, and a retrieval chain ties the retriever and the document chain together to answer the user's question. Finally, the function collects source URLs from the metadata of the retrieved chunks and appends them to the answer.

In [None]:
def query_rag(query):
    """
    Entry point for the RAG model to generate an answer to a given query

    This function initializes the RAG model, sets up the necessary components such as the prompt template, vector store, 
    retriever, document chain, and retrieval chain, and then generates a response to the provided query.

    Args:
        query (str): The query string for which an answer is to be generated.
    
    Returns:
        str: The answer to the query
    """
    # Define the model
    model = ChatMistralAI(model='open-mistral-7b')
    logger.info("Model Loaded")

    prompt = create_prompt()

    # Load the vector store and create the retriever
    vector_store = load_existing_db(uri=MILVUS_URI)
    retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"score_threshold": 0.7, "k":5})
    try:
        document_chain = create_stuff_documents_chain(model, prompt)
        logger.info("Document Chain Created")

        retrieval_chain = create_retrieval_chain(retriever, document_chain)
        logger.info("Retrieval Chain Created")
    
        # Generate a response to the query
        response = retrieval_chain.invoke({"input": f"{query}"})
    except HTTPStatusError as e:
        logger.error(f"HTTPStatusError: {e}")
        if e.response.status_code == 429:
            error_message = "I am currently experiencing high traffic. Please try again later."
        else:
            error_message = "I am unable to answer this question at the moment. Please try again later."
        logger.error(error_message)
        # Return a plain string so the return type matches the success path
        return error_message

    # Collect up to `max_relevant_sources` unique source URLs from the retrieved context
    max_relevant_sources = 4
    sources = []
    for doc in response["context"][:max_relevant_sources]:
        source = doc.metadata.get("source")
        if source and source not in sources:
            sources.append(source)

    # Format the sources as numbered markdown links and append them to the answer
    all_sources = ", ".join(
        f"[Source {count}]({source})" for count, source in enumerate(sources, start=1)
    )
    response["answer"] += f"\n\nSources: {all_sources}"
    print("------------------------------------------------------------------------")
    print("Response Generated:\n")

    return response["answer"]

print("Function `query_rag` defined.")
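
When the API responds with status 429, `query_rag` asks the user to try again later. One possible alternative, not part of this notebook, is to retry automatically with exponential backoff. The sketch below assumes a hypothetical `RateLimitError` exception and a hypothetical `call_with_backoff` helper; the attempt count and delays are arbitrary.

```python
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 error."""

def call_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func, retrying on RateLimitError with delays of 1x, 2x, 4x base_delay."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: let the caller handle the error
            sleep(base_delay * 2 ** attempt)

# Demonstration: a function that is rate limited twice, then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError()
    return "ok"

delays = []
result = call_with_backoff(flaky, sleep=delays.append)  # Record delays instead of sleeping
print(result, delays)  # ok [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the helper easy to test, since the demonstration records the delays rather than actually waiting.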

### 3.3 Get response from RAG

Send a question to RAG, retrieve the response, and print it.

In [None]:
response = query_rag(input("Enter your query: "))

print(response)