# CSUSB's CSE Academic Chatbot 

### Introduction

This chatbot is designed to assist users with academic inquiries, specifically related to California State University, San Bernardino (CSUSB). By leveraging data from the official CSUSB website, the chatbot provides accurate and relevant information about academic programs, admission processes, faculty, campus resources, and much more.
The chatbot's purpose is to offer a virtual assistant that can help current and prospective students navigate CSUSB's academic landscape, including answering frequently asked questions, providing resource links, and delivering personalized responses based on specific queries.

### Objective

In this Jupyter notebook, we will demonstrate how to set up and use the CSUSB academic chatbot. This will involve:

- Loading data from CSUSB's official website and possibly other trusted sources.
- Setting up the chatbot model using simple rule-based logic or more advanced natural language processing (NLP) techniques.
- Handling user queries by interpreting the input and providing helpful responses.

### Prerequisites

Before you start, ensure you have the following:

- Python knowledge: Basic Python skills will be helpful for understanding the code.
- Jupyter notebook setup: If you haven't already, install Jupyter Notebook and launch it.
- Libraries: We will use Python libraries like `requests`, `nltk`, `pandas`, and `sklearn`. If they are not already installed, you can install them with `pip`.

## Table of contents
1. [Setup](#1.-Setup)

2. [Model Setup](#2.-Model-Setup)
    - 2.1. [Environment Variables](#2.1-Environment-Variables)
    - 2.2. [Embedding Models](#2.2-Embedding-Models)
3. [Inference](#3.-Inference)
    - 3.1. [Helper Functions](#3.1-Helper-Functions)
    - 3.2. [User Query Handling](#3.2-User-Query-Handling)
4. [User Input](#User-Input)

5. [Conclusion](#Conclusion)

## 1. Setup

### 1.1 Pre-requirements and Environment setup

First, verify the Python version installed on your system. This project requires Python 3.10 or higher.

Steps:

- Run `!python --version` to display the current Python version.
- Confirm that the reported version is 3.10 or higher.

Dependencies:

- Python must already be installed on the system.
- Python version >= 3.10 is mandatory.

Download the latest version of Python from: https://www.python.org/downloads/

In [None]:
!python --version
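
The command above only prints the version. As a minimal sketch of the confirmation-or-warning step, the check can also be done from Python itself. The helper name `check_python_version` is hypothetical and not part of the original notebook.

```python
import sys

# Minimum version required by this project (3.10, as stated above)
REQUIRED = (3, 10)

def check_python_version(version_info=sys.version_info, required=REQUIRED):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= required

current = ".".join(map(str, sys.version_info[:3]))
if check_python_version():
    print(f"Python {current} detected: OK.")
else:
    print(f"Warning: Python {current} detected; 3.10 or higher is required.")
```

Checking `sys.version_info` inspects the interpreter that runs the notebook kernel, which may differ from the `python` found on the shell path.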

Environment setup: install the tools `ipykernel` and `virtualenv`, then set up a new virtual environment for the project.

Steps:

- Install `ipykernel`, which manages Jupyter kernel connections in the virtual environment.
- Install `virtualenv`, which creates isolated Python environments.
- Create a new virtual environment named `chatbot`, then activate it from your shell (`chatbot\Scripts\activate` on Windows, `source chatbot/bin/activate` on macOS/Linux).

Dependencies:

- `Python >= 3.10` must already be installed.
- Administrative privileges may be required for installation.

In [None]:
import subprocess
import sys

# Install the tools quietly; DEVNULL suppresses pip output on every platform
# (redirecting to NUL only works on Windows).
for package in ("ipykernel", "virtualenv"):
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--quiet", package],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )

# Create the virtual environment with the standard-library venv module
result = subprocess.run([sys.executable, "-m", "venv", "chatbot"])

# Activation itself happens in the shell; this only confirms creation
if result.returncode == 0:
    print("Virtual Environment Created!")
else:
    print("Failed to create the virtual environment.")


### Install Required Packages

This cell installs essential packages for the chatbot and data processing. Key packages include `pymilvus` for database management, `langchain` for LLM chaining, and `beautifulsoup4` for web scraping from CSUSB's academic pages.

In [None]:
%pip install "pymilvus[model]" langchain langchain_community langchain_huggingface langchain_milvus beautifulsoup4 requests nltk langchain_mistralai sentence-transformers scipy streamlit python-dotenv tabulate


### Define Corpus Source
This cell defines the source URLs for data extraction. The primary source is the CSUSB CSE department website; the course catalog URL is commented out to reduce scraping time.

In [None]:
corpus_source = ["https://www.csusb.edu/cse"]
# The catalog site is commented out to reduce the load time of the data scraping cell
# "https://catalog.csusb.edu/"
print('Source data for web crawling defined.')

In [None]:
import os
os.makedirs("milvus_lite", exist_ok=True)
MILVUS_URI = "./milvus_lite/milvus_vector.db"


### Web Scraping

This section loads data from the defined CSUSB website, follows the links found in list items, and extracts paragraph text so it can be split into chunks for embedding.

In [None]:
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from langchain.text_splitter import CharacterTextSplitter
import time
import threading

# Directory to store unreachable URLs
unreachable_dir = "unreachable_urls"
os.makedirs(unreachable_dir, exist_ok=True)

# Save unreachable URL to file
def save_unreachable_url(url):
    with open(os.path.join(unreachable_dir, "unreachable_urls.txt"), "a") as f:
        f.write(url + "\n")

# Function to load webpages and extract content
def load_webpages(url):
    try:
        response = requests.get(url, timeout=10)  # Setting a timeout
        response.raise_for_status()  # Check if the request was successful
    except (requests.exceptions.ConnectTimeout, requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        print(f"Failed to connect to {url}")
        save_unreachable_url(url)
        return {"content": "", "source": url}
    except requests.exceptions.HTTPError as err:
        print(f"HTTP error for {url}: {err}")
        save_unreachable_url(url)
        return {"content": "", "source": url}

    soup = BeautifulSoup(response.text, 'html.parser')
    content_list = []

    # Process <li> items and follow links to extract linked page content
    li_items = soup.find_all('li')
    for li in li_items:
        li_text = li.get_text()
        links = [a['href'] for a in li.find_all('a', href=True)]

        # Fetch content from each link in the <li>
        for link in links:
            linked_url = urljoin(url, link)
            linked_content_data = load_linked_content(linked_url)
            if linked_content_data:
                content_list.append(f"{li_text}: {linked_content_data}")

    # Extract text and links from paragraphs in the main page
    paragraphs = soup.find_all('p')
    for p in paragraphs:
        paragraph_text = p.get_text()
        links = [a['href'] for a in p.find_all('a', href=True)]
        combined_text = f"{paragraph_text} (Links: {', '.join(links)})" if links else paragraph_text
        content_list.append(combined_text)

    # Combine content into a single text
    content = " ".join(content_list)
    return {"content": content, "source": url}

# Function to load content from linked pages
def load_linked_content(link_url):
    try:
        response = requests.get(link_url, timeout=10)  # Setting a timeout
        response.raise_for_status()
    except (requests.exceptions.ConnectTimeout, requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        save_unreachable_url(link_url)
        return ""
    except requests.exceptions.HTTPError as err:
        save_unreachable_url(link_url)
        return ""

    soup = BeautifulSoup(response.text, 'html.parser')
    content_list = []

    # Extract paragraphs and links from the linked page
    paragraphs = soup.find_all('p')
    for p in paragraphs:
        paragraph_text = p.get_text()
        links = [a['href'] for a in p.find_all('a', href=True)]
        combined_text = f"{paragraph_text} (Links: {', '.join(links)})" if links else paragraph_text
        content_list.append(combined_text)

    # Combine the extracted content from the linked page
    content = " ".join(content_list)
    return content

# Split text into chunks for further processing
def split_text(text, chunk_size=40000):
    text_splitter = CharacterTextSplitter(separator=",", chunk_size=chunk_size, chunk_overlap=0)
    return text_splitter.split_text(text)

# Function to track time and display message after 3 seconds
def time_monitor_thread():
    time.sleep(3)  # Wait for 3 seconds
    print("Loading is taking longer than expected. ETA: ~3 minutes...")

# Get texts data from the URLs
def get_texts_data(base_url):
    start_time = time.time()  # Track the start time
    texts = []
    
    # Start time monitoring in a separate thread
    threading.Thread(target=time_monitor_thread, daemon=True).start()
    
    for url in base_url:
        page_data = load_webpages(url)
        content = page_data["content"]
        source = page_data["source"]

        if content:
            cleaned_content = content.replace("\n", " ")
            split_contents = split_text(cleaned_content)

            # Create a document entry with each chunk and the source
            for split_content in split_contents:
                texts.append({
                    "page_content": split_content,
                    "source": source
                })

      
    # Report the total time; every URL is processed (no early exit after 3 seconds)
    print(f"Scraping finished in {time.time() - start_time:.1f} seconds.")

    if texts:  # If texts were collected successfully
        print("Successfully processed the following entries:")
        for i in range(min(2, len(texts))):  # Print up to 2 entries
            print(f"Text {i+1}: {texts[i]['page_content'][:100]}...")  # Print first 100 characters
        print(f"Successfully processed {len(texts)} text chunks.")
    else:
        print("No content found or extracted.")
    
    return texts

# Extract only the page content from the texts data
def extract_text_content(texts):
    return [text["page_content"] for text in texts if "page_content" in text]

# Start processing the URLs
corpus_source = ["https://www.csusb.edu/cse"] 

texts = get_texts_data(corpus_source)
text_contents = extract_text_content(texts)
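
The scraper above follows every link it finds in a list item, so the same page can be fetched repeatedly and off-site or `mailto:` links are attempted too. As a sketch under that observation, the links could be resolved, de-duplicated, and restricted to the same host before fetching. The helper `collect_same_site_links` and the sample hrefs are hypothetical and use only the standard library.

```python
from urllib.parse import urldefrag, urljoin, urlparse

def collect_same_site_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique http(s) links on the same host."""
    base_host = urlparse(base_url).netloc
    seen = set()
    links = []
    for href in hrefs:
        # Make the link absolute and drop any #fragment
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript: and similar
        if parsed.netloc != base_host:
            continue  # skip links that leave the site
        if absolute in seen:
            continue  # skip pages already collected
        seen.add(absolute)
        links.append(absolute)
    return links

# Illustrative hrefs only; they are not taken from the real page
sample_hrefs = [
    "/cse/programs",
    "/cse/programs#top",
    "mailto:someone@example.edu",
    "https://example.com/other",
    "faculty",
]
print(collect_same_site_links("https://www.csusb.edu/cse", sample_hrefs))
```

A filter like this could be applied to the `links` list inside `load_webpages` before calling `load_linked_content`, which would cut the number of requests without changing what is extracted from each page.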


### Define Display Function for Tabular Data
Defines a function to display extracted data in a table format using the `tabulate` library, aiding in data readability.

In [None]:
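from tabulate import tabulate

# Function to display data in table format
def display_texts_data(texts, preview_chars=80):
    if not texts:  # Check if texts are available
        print("No data available to display.")
        return

    # Print a message indicating the number of entries processed
    print(f"Successfully processed {len(texts)} entries.")

    # One row per chunk: index, source URL, and a short preview of the text
    rows = [
        [i + 1, text["source"], text["page_content"][:preview_chars]]
        for i, text in enumerate(texts)
    ]
    print(tabulate(rows, headers=["#", "Source", "Preview"], tablefmt="grid"))

# Call the function to display data
display_texts_data(texts)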


## 2. Model Setup

### 2.1. Environment Variables

This cell defines the configuration values used throughout the project: paths (`MILVUS_URI`, `output_folder`), the Milvus collection name (`collection_name`), and the embedding model names (`MODEL_NAME`, `MODEL_NAME_2`).

#### Import Dependencies and Set Milvus URI
This cell imports `nltk` for text processing and sets the URI for Milvus, a vector database where embeddings will be stored.

In [None]:
import nltk
import os
nltk.download('punkt')
MILVUS_URI = "./milvus_lite/milvus_vector.db"
# Switch between models to get optimized information retrieval on QA tasks
MODEL_NAME = "sentence-transformers/all-MiniLM-L12-v2"
MODEL_NAME_2 = "sentence-transformers/msmarco-distilbert-base-v3"
collection_name = "Academic_Webpages"
output_folder = "csusb_cse_content"

# Ensure directories exist
os.makedirs("milvus_lite", exist_ok=True)
os.makedirs(output_folder, exist_ok=True)
print('Libraries and configurations set up completed.')

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_milvus.utils.sparse import BM25SparseEmbedding
dense_embedding_func = HuggingFaceEmbeddings(model_name=MODEL_NAME_2)
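# Embed the first chunk once to discover the dense vector dimension
# (index 0 also works when the corpus yields a single chunk)
dense_dim = len(dense_embedding_func.embed_query(text_contents[0]))
# print(dense_dim)
sparse_embedding_func = BM25SparseEmbedding(corpus=text_contents)
sparse_embedding_func.embed_query(text_contents[0])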

### Import Milvus and Data Processing Libraries
Here, the cell imports `pymilvus` and other necessary libraries for vector storage and retrieval.

In [None]:
import os
from urllib.parse import urljoin, urlparse

import nltk
import numpy as np
import requests
from bs4 import BeautifulSoup
from langchain.text_splitter import CharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_milvus.retrievers import MilvusCollectionHybridSearchRetriever
from langchain_milvus.utils.sparse import BM25SparseEmbedding
from langchain_mistralai.chat_models import ChatMistralAI
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    WeightedRanker,
    connections,
    utility,
)
from scipy.sparse import csr_matrix
print('Milvus and vector operations libraries imported.')

### Initialize Milvus Connection
Defines a function that connects to Milvus, creates the collection and its dense and sparse indexes if they do not exist, and inserts the embedded text chunks when the collection is empty.

In [None]:
# Initialize Milvus
def initialize_milvus():
    global collection
    # Connect to Milvus
    connections.connect("default", uri=MILVUS_URI)
    print(f"Connected to Milvus at {MILVUS_URI}")

    # Check if the collection exists
    if utility.has_collection(collection_name):
        # Load the existing collection
        collection = Collection(name=collection_name)
        print(f"Collection '{collection_name}' already exists.")
        return collection

    else:
        print(f"Collection '{collection_name}' does not exist. Creating new collection.")

        # Define schema fields
        pk_field = "doc_id"
        dense_field = "dense_vector"
        sparse_field = "sparse_vector"
        text_field = "text"
        source_field = "source"

        # Define fields for the collection schema
        fields = [
            FieldSchema(
                name=pk_field,
                dtype=DataType.VARCHAR,
                is_primary=True,
                auto_id=True,
                max_length=100,
            ),
            FieldSchema(name=dense_field, dtype=DataType.FLOAT_VECTOR, dim=dense_dim),
            FieldSchema(name=sparse_field, dtype=DataType.SPARSE_FLOAT_VECTOR),
            FieldSchema(name=text_field, dtype=DataType.VARCHAR, max_length=65_535),
            FieldSchema(name=source_field, dtype=DataType.VARCHAR, max_length=500)
        ]

        # Create the schema and the collection
        schema = CollectionSchema(fields=fields, enable_dynamic_field=False)
        collection = Collection(name=collection_name, schema=schema, consistency_level="Strong")
        print(f"Created collection '{collection_name}'.")

        # Create indexes for dense and sparse vectors
        dense_index = {"index_type": "FLAT", "metric_type": "IP"}
        sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}

        collection.create_index(dense_field, dense_index)
        collection.create_index(sparse_field, sparse_index)

        print(f"Created indexes on '{dense_field}' and '{sparse_field}'.")

        # Flush to persist changes
        collection.flush()
        print(f"Flushed collection '{collection_name}' to persist changes.")

    # Insert vectors into the collection
    entities = []

    for text in texts:
        text_content= str(text["page_content"])
        source = text["source"]
        entity = {
            "dense_vector": dense_embedding_func.embed_documents([text_content])[0],
            "sparse_vector": sparse_embedding_func.embed_documents([text_content])[0],
            "text": text_content,
            "source": source

        }
        entities.append(entity)

    # Check if the collection already contains data
    if collection.num_entities == 0:
        collection.insert(entities)
        print(f"Inserted {len(entities)} entities into the collection '{collection_name}'.")
    else:
        print(f"Collection '{collection_name}' already contains data. Skipping insertion.")
    return collection

### Execute Milvus Initialization
Executes the `initialize_milvus()` function to establish the Milvus connection.

In [None]:
initialize_milvus()

### Set the API Key for Authentication

This cell sets the `API_KEY` environment variable, which `get_api_key()` reads later to authenticate with the Mistral API. Avoid hard-coding a real key in a shared notebook; read it from the environment or enter it at the prompt.

In [None]:
# Setting the API key
import os
from getpass import getpass

# Use the key from the environment if present; otherwise prompt for it.
# The key is never written into the notebook itself.
if not os.environ.get("API_KEY"):
    os.environ["API_KEY"] = getpass("Enter your Mistral API key: ")

### 2.2. Embedding Models
The code here initializes and applies dense and sparse embeddings to the extracted text content, setting up embedding functions and verifying their outputs, which is directly related to model configuration and embedding preparation.

### Prepare Text Data for Embedding
Extracts and cleans text content for embedding, preparing it for storage in the Milvus database.

In [None]:
import time
import threading

# Function to track time and display message after 3 seconds
def time_monitor_thread():
    time.sleep(3)  # Wait for 3 seconds
    print("Loading is taking longer than expected. ETA: ~2 minutes...")

# Extract the cleaned text content from the dictionaries for embedding
def process_embeddings(corpus_source):
    start_time = time.time()  # Track the start time
    # Start time monitoring in a separate thread
    threading.Thread(target=time_monitor_thread, daemon=True).start()
    
    # Load data and extract text content
    texts = get_texts_data(corpus_source)
    text_contents = extract_text_content(texts)

    # Check if there is any text content to process
    if text_contents:
        # Initialize the dense and sparse embeddings
        dense_embedding_func = HuggingFaceEmbeddings(model_name=MODEL_NAME_2)
        
        # Calculate the dimension of the dense embedding from the first chunk
        dense_dim = len(dense_embedding_func.embed_query(text_contents[0]))
        
        # Initialize the sparse embedding using BM25
        sparse_embedding_func = BM25SparseEmbedding(corpus=text_contents)
        
        # Run a sample sparse embedding query on the first chunk
        sparse_embedding_func.embed_query(text_contents[0])
        
        # Output success messages and embedding dimensions
        print(f"Successfully processed {len(text_contents)} text contents.")
        print(f"Dense embedding dimension: {dense_dim}")
        print("Sparse embedding query executed successfully.")
    else:
        print("No text content available to process.")

# Call the function to process embeddings
process_embeddings(corpus_source)
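
To make the dense/sparse distinction concrete: a dense embedding is a fixed-length list of floats, while a sparse BM25 embedding stores a weight only for the terms that occur in a text. The sketch below computes BM25 weights in plain Python. It is an illustration only: whitespace tokenization, the parameters `k1` and `b`, and the smoothed IDF variant are assumptions, and `BM25SparseEmbedding` may tokenize and weight differently. The sample corpus is made up.

```python
import math
from collections import Counter

def bm25_sparse_vectors(corpus, k1=1.5, b=0.75):
    """Return (vocab, vectors); each vector maps a term index to its BM25 weight.

    Assumes a non-empty corpus of non-empty strings.
    """
    tokenized = [doc.lower().split() for doc in corpus]
    terms = sorted({term for doc in tokenized for term in doc})
    vocab = {term: index for index, term in enumerate(terms)}
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    vectors = []
    for doc in tokenized:
        vector = {}
        for term, tf in Counter(doc).items():
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            vector[vocab[term]] = idf * tf * (k1 + 1) / norm
        vectors.append(vector)
    return vocab, vectors

corpus = [
    "computer science degree requirements",
    "computer science faculty directory",
    "graduate admission requirements",
]
vocab, vectors = bm25_sparse_vectors(corpus)
print(len(vocab), "terms in vocabulary")
print(vectors[0])  # only the terms present in the first text have entries
```

Terms that appear in fewer texts receive larger weights, which is why sparse vectors help with exact keyword matches that a dense model may blur.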


## 3. Inference

### 3.1. Helper Functions
The `format_docs` function concatenates the text of the retrieved documents and collects their source URLs. It prepares the context passed to the language model and the source list shown to the user.

### Format Retrieved Documents and Sources
This cell defines a function that joins the retrieved document texts into a single context string and returns the associated sources.

In [None]:
# Function to format retrieved documents and collect their sources
def format_docs(docs):
    formatted_content = ""
    sources = set()

    # Loop through each document to retrieve text and source
    for doc in docs:
        # LangChain documents expose their text as `page_content`
        content = getattr(doc, "page_content", "")
        source = doc.metadata.get("source", "Unknown source")

        formatted_content += f"{content}\n\n"
        sources.add(source)

    # Combine sources into a formatted string (sorted for stable output)
    formatted_sources = "\n".join(sorted(sources))

    return formatted_content, formatted_sources

### 3.2. User Query Handling

This cell handles user queries through the RAG (Retrieval-Augmented Generation) chain. It takes a user query, retrieves relevant context from Milvus, formats it, and invokes the language model to generate a response. The function `invoke_llm_for_response` ties the components together: retrieving documents, querying the model, and formatting the results.

In [None]:
# Invoke RAG chain to generate LLM Response
# Function to invoke the language model for generating a response
def get_api_key():
    """Retrieve the API key from the environment."""
    api_key = os.getenv("API_KEY")
    if not api_key:
        raise ValueError("API key not found. Set the API_KEY environment variable before proceeding.")
    return api_key
def invoke_llm_for_response(query: str):
    api_key = get_api_key()
    if not isinstance(query, str):
        raise ValueError("The input query must be a string.")

    if len(query.split()) < 2:
        return "Please ask a more specific question.", ""  # Same two-item shape as the normal return

    # Initialize the language model
    llm = ChatMistralAI(model='open-mistral-7b', api_key=api_key)

    # Define the prompt template
    PROMPT_TEMPLATE = """
    Human: You are an AI assistant, and provide answers to questions by using fact-based and statistical information when possible.
    Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.

    <context>
    {context}
    </context>

    <question>
    {question}
    </question>

    Assistant:"""

    prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["context", "question"])

    # Reuse the chunks scraped earlier (global `texts`) instead of re-scraping on every query
    if not texts:
        return "No content found in the specified URLs. Please check your data source.", ""

    # Define the fields and search parameters for the Milvus retriever
    dense_field = "dense_vector"
    sparse_field = "sparse_vector"
    text_field = "text"
    sparse_search_params = {"metric_type": "IP"}
    dense_search_params = {"metric_type": "IP", "params": {}}
    collection = Collection(collection_name)

    # Initialize the Milvus retriever
    retriever = MilvusCollectionHybridSearchRetriever(
        collection=collection,
        rerank=WeightedRanker(0.7, 0.3),
        anns_fields=[dense_field, sparse_field],
        field_embeddings=[dense_embedding_func, sparse_embedding_func],
        field_search_params=[dense_search_params, sparse_search_params],
        top_k=5,
        text_field=text_field,
    )

    hybrid_results = retriever.invoke(query)
    # Have to implement re-ranking function for the hybrid retriever for exact query matching
    formatted_content, formatted_sources = format_docs(hybrid_results)

    context_callable = lambda x: formatted_content

    # Define the RAG chain manually with the specified format
    rag_chain = (
        {"context": context_callable, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    # Invoke the RAG chain; the query string is passed through as the question
    response = rag_chain.invoke(query)

    final_response = f"{response}\n\nSources:\n{formatted_sources}"
    print(final_response)

    return final_response, formatted_sources
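
The retriever above combines dense and sparse results with `WeightedRanker(0.7, 0.3)`, that is, 70% weight on the dense score and 30% on the sparse score. The sketch below shows the general idea of weighted score fusion in plain Python. It is a simplified illustration: the min-max normalization, the function name `weighted_fusion`, and the sample scores are assumptions, and Milvus applies its own score normalization internally.

```python
def weighted_fusion(dense_scores, sparse_scores, w_dense=0.7, w_sparse=0.3, top_k=5):
    """Combine two {doc_id: score} maps into a ranked list of (doc_id, score)."""

    def min_max(scores):
        # Rescale scores to [0, 1] so the two score ranges are comparable
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        if hi == lo:
            return {doc_id: 1.0 for doc_id in scores}
        return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

    dense_n = min_max(dense_scores)
    sparse_n = min_max(sparse_scores)
    # A document missing from one result list contributes 0 for that list
    combined = {
        doc_id: w_dense * dense_n.get(doc_id, 0.0) + w_sparse * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Made-up scores for four documents
dense = {"a": 0.9, "b": 0.5, "c": 0.1}
sparse = {"b": 12.0, "c": 3.0, "d": 8.0}
for doc_id, score in weighted_fusion(dense, sparse):
    print(doc_id, round(score, 3))
```

Raising the sparse weight favors exact keyword matches, while raising the dense weight favors semantic similarity; the 0.7/0.3 split in the notebook leans toward the latter.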


## 4. User Input

Before running a query with `invoke_llm_for_response`, make sure the Milvus collection has been initialized and the API key is set. The query retrieves relevant documents from the CSUSB academic webpages, passes them through the RAG (Retrieval-Augmented Generation) chain, and returns a generated response from the model.

In [None]:
# Function to execute the query and display the response
def query_rag(query: str):
    return invoke_llm_for_response(query)

# Get user input for query
response = query_rag(input("Enter your query: "))

# Print the response
print("Response:", response[0])
print("Sources:", response[1])
