# CSUSB's CSE Academic Chatbot 

- [GitHub](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team2)  
- [Wiki](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team2/wiki)

## Table of contents
1. [Introduction](#1.-Introduction)
2. [Setup](#2.-Setup)
    - 2.1. [Pre-requirements and Environment setup](#2.1-Pre-requirements-and-Environment-setup)
    - 2.2. [Environment Variables](#2.2.-Environment-Variables)
3. [Building the Chatbot](#3.-Building-the-Chatbot)
    - 3.1. [Initialize Milvus Connection](#3.1-Initialize-Milvus-Connection)
    - 3.2. [Execute Milvus Initialization](#3.2-Execute-Milvus-Initialization)
    - 3.3. [Set the API Key for Authentication](#3.3-Set-the-API-Key-for-Authentication)
4. [Improving the Chatbot with Inference](#4.-Improving-the-Chatbot-with-Inference)
    - 4.1. [Helper Functions](#4.1-Helper-Functions)
    - 4.2. [User Query Handling](#4.2.-User-Query-Handling)
5. [Testing the Chatbot](#5.-Testing-the-Chatbot)
6. [Conclusion](#Conclusion)

## 1. Introduction

This chatbot is designed to assist users with academic inquiries, specifically related to California State University, San Bernardino (CSUSB). By leveraging data from the official CSUSB website, the chatbot provides accurate and relevant information about academic programs, admission processes, faculty, campus resources, and much more.
The chatbot's purpose is to offer a virtual assistant that can help current and prospective students navigate CSUSB's academic landscape, including answering frequently asked questions, providing resource links, and delivering personalized responses based on specific queries.

#### Objective

In this Jupyter notebook, we will demonstrate how to set up and use the CSUSB academic chatbot. This will involve:

- Loading data from CSUSB's official website and possibly other trusted sources.
- Setting up the chatbot model using simple rule-based logic or more advanced natural language processing (NLP) techniques.
- Handling user queries by interpreting the input and providing helpful responses.

#### Prerequisites

Before you start, ensure you have the following:

- Python knowledge: Basic Python skills will be helpful for understanding the code.
- Jupyter notebook setup: If you haven't already, install Jupyter Notebook and launch it.
- Libraries: We will use Python libraries such as `requests`, `nltk`, `pandas`, and `sklearn`. If they are not already installed, you can install them with `pip`.

## 2. Setup

### 2.1 Pre-requirements and Environment setup

- First, verify the Python version installed on your system. This project requires Python 3.10 or higher, so the check confirms compatibility before anything else is installed.

#### Steps:

- Run the command `!python --version` to display the current Python version.
- Check the output and confirm that the reported version is 3.10 or higher.
- Dependencies:

    - Python must already be installed on the system.
    - Python version >= 3.10 is mandatory.

Download the latest version of Python from: https://www.python.org/downloads/

In [None]:
!python --version
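
The same requirement can also be checked from Python itself, which makes the confirmation or warning explicit. The following is a minimal sketch; the helper name `meets_minimum_version` is illustrative and not part of the project.

```python
import sys

def meets_minimum_version(version_info=None, minimum=(3, 10)):
    """Return True when the interpreter version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor), e.g. (3, 11) >= (3, 10)
    return tuple(version_info[:2]) >= tuple(minimum)

if meets_minimum_version():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} meets the 3.10 requirement.")
else:
    print("Warning: Python 3.10 or higher is required for this project.")
```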

#### Environment setup

- Install the necessary tools, `ipykernel` and `virtualenv`, and set up a new virtual environment for the project.

#### Steps:

- Install ipykernel:
    - Used to manage Jupyter kernel connections in the virtual environment.

- Install virtualenv:
    - Creates isolated Python environments.

- Create and Activate Virtual Environment:

    - A new virtual environment named `chatbot` is created.
    - Activation happens in a shell, outside the notebook; the cell only confirms that the environment was created.
- Dependencies:

    - `Python >= 3.10` must already be installed.
    - Administrative privileges may be required for installation.

In [None]:
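import subprocess
import sys

def run_quietly(args):
    """Run a command with its output suppressed on any operating system."""
    return subprocess.run(args, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

# Install the tools with the same interpreter that runs this notebook
run_quietly([sys.executable, "-m", "pip", "install", "ipykernel", "--root-user-action=ignore"])
run_quietly([
    sys.executable, "-m", "pip", "install", "--user", "virtualenv",
    "--root-user-action=ignore", "--no-warn-script-location",
])

# Create the virtual environment named "chatbot"
result = run_quietly([sys.executable, "-m", "venv", "chatbot"])

# Activation is done in the shell; this cell only reports whether creation succeeded
if result.returncode == 0:
    print("Virtual Environment Created!")
else:
    print("Virtual environment creation failed. Rerun without output suppression to see the error.")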
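
Because activation is a shell step, it can help to print the command that applies to the current operating system. The sketch below assumes the standard `venv` layout (`Scripts` on Windows, `bin` elsewhere); the helper name `activation_command` is hypothetical.

```python
import os

def activation_command(env_name="chatbot", os_name=None):
    """Return the shell command that activates a venv-style environment."""
    if os_name is None:
        os_name = os.name
    if os_name == "nt":  # Windows (cmd.exe)
        return f"{env_name}\\Scripts\\activate"
    return f"source {env_name}/bin/activate"  # Linux and macOS shells

print("To activate the environment, run this in a terminal:")
print(activation_command())
```

After activating the environment, running `python -m ipykernel install --user --name chatbot` in the same terminal should make it selectable as a Jupyter kernel.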


#### Install Required Packages

- This cell installs essential packages for the chatbot and data processing. Key packages include `pymilvus` for database management, `langchain` for LLM chaining, and `beautifulsoup4` for web scraping from CSUSB's academic pages.

In [None]:
%pip install "pymilvus[model]" langchain langchain_community langchain_huggingface langchain_milvus beautifulsoup4 requests nltk langchain_mistralai sentence-transformers scipy streamlit python-dotenv tabulate


#### Define Corpus Source
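- This cell defines the source URLs for data extraction: the CSUSB CSE department website and the CSE section of the university catalog. It also lists the navigation sections the scraper is allowed to follow and the texts and URLs to exclude, which keeps the crawl small and reduces execution time.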

In [None]:
CORPUS_SOURCES = ["https://www.csusb.edu/cse","https://catalog.csusb.edu/colleges-schools-departments/natural-sciences/computer-science-engineering/"]

ALLOWED_CSE_NAVIGATION_SECTIONS = [
    "Welcome",
    "Programs",
    "Faculty and Staff",
    "Advising",
    "Resources",
    "Internships & Careers",
    "Computer Labs & Support",
    "Faculty in the News",
    "Contact Us",
]

# Define exclusions
EXCLUDED_TEXTS = ["Give to CNS"]  # Keywords to exclude
EXCLUDED_URLS = ["https://www.csusb.edu/give-cns"]  # Specific URLs to exclude

ALLOWED_CATALOG_NAVIGATION_SECTIONS = ["Overview", "Faculty", "Undergraduate Degrees", "Graduate Degree", "Minor", "Certificates", "Courses"]

print("Defined corpus sources, allowed navigation sections, and exclusions for web scraping!")
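
To illustrate how an allow-list and exclusion lists like these can gate a crawl, here is a minimal sketch. The helper `should_follow` and the shortened `DEMO_*` lists are assumptions made for this example; the scraper in the next section applies similar checks inline rather than through a single function.

```python
from urllib.parse import urljoin

# Shortened demo copies of the lists above, so the example runs on its own
DEMO_ALLOWED_SECTIONS = ["Programs", "Advising", "Resources"]
DEMO_EXCLUDED_TEXTS = ["Give to CNS"]
DEMO_EXCLUDED_URLS = ["https://www.csusb.edu/give-cns"]

def should_follow(link_text, href, base_url):
    """Decide whether a navigation link should be crawled (hypothetical helper)."""
    full_url = urljoin(base_url, href)  # Resolve relative links against the page URL
    if link_text in DEMO_EXCLUDED_TEXTS or full_url in DEMO_EXCLUDED_URLS:
        return False
    # Follow only allow-listed sections that stay under the base URL
    return link_text in DEMO_ALLOWED_SECTIONS and full_url.startswith(base_url)

base = "https://www.csusb.edu/cse"
print(should_follow("Programs", "/cse/programs", base))
print(should_follow("Give to CNS", "/give-cns", base))
print(should_follow("Athletics", "/cse/athletics", base))
```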

#### Setting Up Local Directory and Milvus URI Path

-   This code snippet creates a directory named `milvus_lite` if it doesn't already exist and defines the file path `MILVUS_URI` for storing Milvus vector data locally. It ensures the directory structure is in place for managing vector database files.

In [None]:
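import os

# Create the directory that holds the Milvus Lite database file
os.makedirs("milvus_lite", exist_ok=True)
# Store the database file inside that directory
MILVUS_URI = "milvus_lite/milvus_vector.db"
print("Directory 'milvus_lite' has been created or already exists.")
print(f"MILVUS_URI is set to: {MILVUS_URI}")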

#### Web Scraping

- This section loads data from the CSUSB sources defined above, parses the HTML, and prepares the extracted text for embedding.

In [None]:
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

# Website 1 - "https://www.csusb.edu/cse"
def scrape_source_1(base_url):
    """
    Scrape all internal links and their content from the given base URL.
    Detect and parse tables within div elements. Handle accordion cases.
    Only scrape allowed navigation items and exclude specific content.
    Ensure all links are properly formatted as full URLs and duplicates are removed.
    :param base_url: The base URL of the website to scrape.
    :return: A list of scraped data from all internal links.
    """
    visited_links = set()
    scraped_data = []

    def parse_table(table):
        """Parse a table and return structured data as rows with headers."""
        headers = [header.get_text(strip=True) for header in table.find_all("th")]
        rows = []
        for row in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
            if len(cells) == len(headers):  # Match cells to headers
                row_data = dict(zip(headers, cells))
                rows.append(row_data)
        return rows

    def format_url(href):
        """Ensure the href is a full URL."""
        if href.startswith("http"):
            return href
        return urljoin("https://www.csusb.edu", href.lstrip("/"))

    def scrape_page(url, visited):
        """Scrape a single page and extract its content, including tables within divs."""
        if url in visited:
            return None
        visited.add(url)

        # print(f"Scraping page: {url}")
        try:
            response = requests.get(url, timeout=30)  # Time out instead of hanging on a stalled server
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "html.parser")

            page_data = {"url": url, "content": []}
            unique_links = set()  # To track unique links on this page

            # Extract navigation links
            nav_links = soup.find_all("a", href=True)
            for link in nav_links:
                link_text = link.get_text(strip=True)
                href = link["href"]

                # Ensure the href is a full URL
                href = format_url(href)

                # Skip duplicates
                if href in unique_links:
                    continue
                unique_links.add(href)

                # Store links
                # print(f"Storing link: {href}")
                page_data["content"].append({"type": "link", "url": href, "text": link_text})

                if link_text in ALLOWED_CSE_NAVIGATION_SECTIONS:  # Only process allowed navigation
                    # print(f"Allowed navigation link found: {link_text} -> {href}")

                    # Recursively scrape allowed navigation links
                    if href.startswith(base_url) and href not in visited:
                        nested_page_data = scrape_page(href, visited)
                        if nested_page_data:
                            page_data.setdefault("internal_links", []).append(nested_page_data)

                # Include PDF links explicitly
                if href.endswith(".pdf"):
                    # print(f"PDF link found: {href}")
                    page_data["content"].append({"type": "pdf", "url": href, "text": link_text})

            # Extract content from <div> and check for tables
            for div in soup.find_all("div"):
                table = div.find("table")  # Check if there's a table inside the div
                if table:
                    # print(f"Table found inside a div on {url}")
                    structured_table = parse_table(table)
                    if structured_table:
                        page_data["content"].append({"type": "table", "data": structured_table})

                # Check if the div contains an accordion
                if "accordion" in div.get("class", []) or "accordion" in div.get("id", ""):
                    for p in div.find_all("p"):
                        a_tag = p.find("a", href=True)
                        if a_tag:
                            href = a_tag["href"]
                            href = format_url(href)

                            # Skip duplicates
                            if href in unique_links:
                                continue
                            unique_links.add(href)

                            # Skip excluded links
                            if href in EXCLUDED_URLS or p.get_text(strip=True) in EXCLUDED_TEXTS:
                                # print(f"Skipping excluded link: {href}")
                                continue

                            accordion_content = {
                                "type": "link",
                                "url": href,
                                "text": p.get_text(strip=True),
                            }
                            # print(f"Accordion link found: URL = {href}, Text = {p.get_text(strip=True)}")
                            page_data["content"].append(accordion_content)

            # Extract other HTML elements
            for tag in ["h1", "h2", "h3", "p", "li"]:
                for element in soup.find_all(tag):
                    text = element.get_text(strip=True)
                    if text and text not in EXCLUDED_TEXTS:  # Skip excluded texts
                        page_data["content"].append({"type": tag, "text": text})

            return page_data

        except Exception as e:
            print(f"Error scraping {url}: {e}")
            return None

    # Start scraping from the base URL
    main_page_data = scrape_page(base_url, visited_links)
    if main_page_data:
        scraped_data.append(main_page_data)

    return scraped_data

# Website 2 - "https://catalog.csusb.edu/colleges-schools-departments/natural-sciences/computer-science-engineering/"
def scrape_navigation_section(url, section_name, visited=None):
    """
    Scrape a specific navigation section and follow internal links.
    :param url: The base URL of the section.
    :param section_name: The name of the navigation section.
    :param visited: Set to track visited links (a new set is created when omitted).
    :return: Scraped data from the section without the `section` field.
    """
    # Avoid a shared mutable default: each call without `visited` gets a fresh set
    if visited is None:
        visited = set()
    try:
        if url in visited:
            return []
        visited.add(url)

        # print(f"Scraping navigation section: {url}")  # Print the navigation section URL
        response = requests.get(url, timeout=30)  # Time out instead of hanging on a stalled server
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Create section data, excluding the `section` field
        section_data = {"url": url, "content": []}

        # Scrape main content of the section
        for tag in ["h1", "h2", "h3", "p", "li", "div"]:
            for element in soup.find_all(tag):
                text = element.get_text(strip=True)
                if text:
                    section_data["content"].append({"type": tag, "text": text})

        # Scrape and process internal links within the section
        internal_data = scrape_internal_links(soup, url, visited)
        if internal_data:
            section_data["internal_links"] = internal_data

        return section_data

    except Exception as e:
        print(f"Error scraping section {section_name} at {url}: {e}")
        return None

def scrape_internal_links(soup, base_url, visited):
    """
    Find and scrape data from all internal links within a section.
    :param soup: Parsed HTML content of the current page.
    :param base_url: Base URL of the current page.
    :param visited: Set to track visited links.
    :return: List of scraped content from internal links.
    """
    internal_content = []
    for link in soup.find_all("a", href=True):
        href = link["href"]
        full_url = urljoin(base_url, href)
        # Check if it's an internal link and not visited
        if full_url not in visited and full_url.startswith(base_url):
            # print(f"Scraping internal link: {full_url}")  # Print the link being scraped
            visited.add(full_url)
            try:
                response = requests.get(full_url, timeout=30)  # Time out instead of hanging on a stalled server
                response.raise_for_status()
                sub_soup = BeautifulSoup(response.text, 'html.parser')

                # Scrape content from the internal page
                page_content = []
                for tag in ["h1", "h2", "h3", "p", "li", "div"]:
                    for element in sub_soup.find_all(tag):
                        text = element.get_text(strip=True)
                        if text:
                            page_content.append({"type": tag, "text": text})

                # Check for further internal links within the page
                deeper_links = scrape_internal_links(sub_soup, base_url, visited)
                if deeper_links:
                    page_content.extend(deeper_links)

                internal_content.append({
                    "url": full_url,
                    "content": page_content
                })

            except Exception as e:
                print(f"Error scraping internal link {full_url}: {e}")

    return internal_content


def merge_data_sources(data_source_1, data_source_2):
    """
    Merge two data sources into one unified knowledge base.
    :param data_source_1: Data scraped from the main page.
    :param data_source_2: Data scraped from navigation sections.
    :return: Merged data source.
    """
    merged_data = []

    # Add data from data_source_1
    if isinstance(data_source_1, list):
        merged_data.extend(data_source_1)
    elif isinstance(data_source_1, dict):
        merged_data.append(data_source_1)

    # Add data from data_source_2
    if isinstance(data_source_2, list):
        merged_data.extend(data_source_2)
    elif isinstance(data_source_2, dict):
        merged_data.append(data_source_2)

    return merged_data


data_source_1 = scrape_source_1(CORPUS_SOURCES[0])
data_source_2 = scrape_navigation_section(CORPUS_SOURCES[1],ALLOWED_CATALOG_NAVIGATION_SECTIONS)

def merged_data():
    return merge_data_sources(data_source_1, data_source_2)
print("Web scraping initialized: Data from CORPUS_SOURCES collected and merged successfully.")
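
The scrapers return nested dictionaries: each page has a `url`, a `content` list, and optionally an `internal_links` list of further pages. The sketch below shows one way to walk that structure. The generator `iter_pages` is a hypothetical helper, and the sample data is hand-written for illustration rather than real scraper output.

```python
def iter_pages(pages):
    """Yield every page dict, descending into nested 'internal_links'."""
    for page in pages:
        if not isinstance(page, dict):
            continue  # Skip anything that is not a page record
        yield page
        yield from iter_pages(page.get("internal_links", []))

# A tiny hand-written sample in the same shape as the scraper output
sample = [
    {
        "url": "https://www.csusb.edu/cse",
        "content": [{"type": "h1", "text": "Sample heading"}],
        "internal_links": [
            {
                "url": "https://www.csusb.edu/cse/programs",
                "content": [{"type": "p", "text": "Sample paragraph"}],
            },
        ],
    }
]

for page in iter_pages(sample):
    print(page["url"], len(page.get("content", [])))
```

A walk like this is a quick way to check how many pages were collected before the data is embedded.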

### 2.2. Environment Variables

This section defines the configuration values used by the rest of the notebook: the embedding model name (`MODEL_NAME`), the Milvus collection name (`collection_name`), and the output folder (`output_folder`). `MILVUS_URI` was already defined in section 2.1.

#### Import Dependencies and Set Configuration
This cell imports `nltk` for text processing, downloads its `punkt` tokenizer data, and sets the model name, collection name, and output folder. Note that later cells load `sentence-transformers/all-MiniLM-L6-v2` directly rather than reading `MODEL_NAME`.

In [None]:
import nltk
import os
nltk.download('punkt')
# Switch between models to get optimized information retrieval on QA tasks
MODEL_NAME = "sentence-transformers/all-MiniLM-L12-v2"
collection_name = "CSUSB_CSE_Data"
output_folder = "csusb_cse_content"

# Ensure directories exist
os.makedirs(output_folder, exist_ok=True)
print('Libraries and configurations set up completed.')

#### Import Milvus and Data Processing Libraries
- Here, the cell imports `pymilvus` and other necessary libraries for vector storage and retrieval.

In [None]:
from pymilvus import Collection
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_mistralai.chat_models import ChatMistralAI
import numpy as np
import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
import httpx
import requests
from bs4 import BeautifulSoup
from langchain_huggingface import HuggingFaceEmbeddings
from urllib.parse import urljoin,urlparse
from scipy.sparse import csr_matrix
from langchain.text_splitter import CharacterTextSplitter
import re
print('Milvus and vector operations libraries imported.')

## 3. Building the Chatbot

### 3.1 Initialize Milvus Connection
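Defines `initialize_milvus()`, which connects to Milvus, creates the `CSUSB_CSE_Data` collection with a 384-dimensional embedding field, embeds the scraped text, and inserts the results so they can be queried later. A wrapper, `initialize_milvus_insert_data()`, feeds it the merged scraping output.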

In [None]:
from pymilvus import connections, utility, Collection, CollectionSchema, FieldSchema, DataType
from sentence_transformers import SentenceTransformer
import json

def initialize_milvus(data, milvus_uri=MILVUS_URI):
    """Initialize Milvus, create collection, and insert data from content and internal links."""
    print("Initializing Milvus and creating a collection...")

    # Connect to Milvus
    connections.connect(alias="default", uri=milvus_uri)

    # Define schema
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
        FieldSchema(name="text_content", dtype=DataType.VARCHAR, max_length=50000),
        FieldSchema(name="url", dtype=DataType.VARCHAR, max_length=1000)
    ]
    schema = CollectionSchema(fields, "CSUSB_CSE_Collection")

    # Drop existing collection if present
    collection_name = "CSUSB_CSE_Data"
    if utility.has_collection(collection_name):
        Collection(name=collection_name).drop()

    # Create and load collection
    collection = Collection(name=collection_name, schema=schema)
    collection.create_index(field_name="embedding", index_params={"index_type": "FLAT", "metric_type": "L2"})
    collection.load()

    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    def process_entry(entry, results):
        """Process 'content' and 'internal_links' recursively."""
        current_url = entry.get("url", "")

        # Process 'content' fields
        for content in entry.get("content", []):
            if content.get("type") in ["h1", "h2", "h3", "p"]:
                text = content.get("text")
                if text:
                    embedding = model.encode(text).tolist()
                    results.append((text, current_url, embedding))

        # Process 'data' and 'url' fields in `internal_links`
        for link in entry.get("internal_links", []):
            link_url = link.get("url")
            for data_entry in link.get("data", []):
                text = json.dumps(data_entry)  # Convert structured data to string
                embedding = model.encode(text).tolist()
                results.append((text, link_url, embedding))

            # Recurse into nested `internal_links`
            process_entry(link, results)

    # Process all data
    results = []
    for item in data:
        process_entry(item, results)

    # Debugging: Check results
    print(f"Results Count: {len(results)}")
    if not results:
        print("No data to insert. Please check the input data.")
        return

    # Prepare data for insertion
    embeddings, text_contents, urls = [], [], []
    for idx, (text, url, embedding) in enumerate(results):
        # Validate embedding dimensions
        if len(embedding) != 384:
            print(f"Skipping entry at index {idx} with invalid embedding dimension: {len(embedding)}")
            continue
        # Validate text and URL
        if not isinstance(text, str) or not text.strip():
            print(f"Skipping entry at index {idx} with invalid text content: {text}")
            continue
        if not isinstance(url, str) or not url.strip():
            print(f"Skipping entry at index {idx} with invalid URL: {url}")
            continue

        embeddings.append(embedding)
        text_contents.append(text)
        urls.append(url)

    # Debugging: Check final lengths
    print(f"Embeddings: {len(embeddings)}, Texts: {len(text_contents)}, URLs: {len(urls)}")
    if len(embeddings) != len(text_contents) or len(embeddings) != len(urls):
        print("Error: Mismatch in data lengths.")
        return

    # Insert into Milvus
    try:
        collection.insert([embeddings, text_contents, urls])
        print(f"Number of Documents in Collection: {collection.num_entities}")
        print("Data insertion completed.")
    except Exception as e:
        print(f"Error inserting data into Milvus: {e}")


def initialize_milvus_insert_data():
    KB = merged_data()
    #print("Input Data Sample:", KB[:3])
    return initialize_milvus(KB)

print("Milvus initialization functions defined. Run the next cell to create the collection and insert data.")

### 3.2 Execute Milvus Initialization
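Runs `initialize_milvus_insert_data()`, which merges the scraped data and passes it to `initialize_milvus()` to connect to Milvus, create the collection, and insert the embeddings.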

In [None]:
initialize_milvus_insert_data()
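
The collection is indexed with `FLAT` and the `L2` metric, meaning a query vector is compared against every stored vector and the closest ones are returned. The sketch below illustrates that idea in plain Python with toy 3-dimensional vectors. It is not Milvus code, and the function names are made up for this example.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flat_search(query, vectors, limit=2):
    """Exhaustively rank stored vectors by distance to the query."""
    scored = [(l2_distance(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort()  # Smallest distance first
    return [idx for _, idx in scored[:limit]]

stored = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
print(flat_search([1.0, 0.0, 0.0], stored))  # prints [1, 2]
```

An exhaustive scan like this is exact but grows linearly with the number of stored vectors, which is acceptable for a small corpus such as the one scraped here.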

### 3.3 Set the API Key for Authentication

This cell loads the Mistral API key from the `MISTRAL_API_KEY` environment variable, reading a `.env` file if one is present. If the key is missing, the cell displays an input widget; the value you submit is stored in `MISTRAL_API_KEY` so later cells can use it to authenticate with the Mistral service.

To obtain the API key, visit the [Team2 Discussions](https://csusb.instructure.com/courses/43192/discussion_topics/419700) chat section.
Copy the key from there and paste it into the widget.

In [None]:
from dotenv import load_dotenv
import os
from ipywidgets import Text, Button, VBox, Output
from IPython.display import display

# Load environment variables
load_dotenv(override=True)

# Output widget for feedback
output = Output()

# API key variable
api_key = os.getenv("MISTRAL_API_KEY")

# Function to handle API key input through the widget
def create_api_key_widget():
    global api_key
    api_input = Text(
        description="API Key:",
        placeholder="Enter your MISTRAL API key",
        layout={"width": "400px"}
    )
    submit_button = Button(description="Submit", button_style="success")

    def on_submit_clicked(_):
        global api_key
        if api_input.value:
            api_key = api_input.value
            os.environ["MISTRAL_API_KEY"] = api_key
            with output:
                output.clear_output()
                print("API key successfully set.")
        else:
            with output:
                output.clear_output()
                print("Error: Please enter a valid API key.")

    submit_button.on_click(on_submit_clicked)
    return VBox([api_input, submit_button, output])

# Display the widget only if the API key is missing
if not api_key:
    print("MISTRAL_API_KEY not found. Please provide it using the widget below.")
    api_key_widget = create_api_key_widget()
    display(api_key_widget)
else:
    print("Environment variables successfully set up.")

## 4. Improving the Chatbot with Inference

### 4.2. User Query Handling

- This cell defines how user queries are handled through a Retrieval-Augmented Generation (RAG) chain. It takes a user query, retrieves relevant context from Milvus, checks that the context is relevant to the query, and then invokes the language model to generate a response.
- The function `invoke_llm_for_response` ties the components together: retrieving context, querying the model, and formatting the result with its source.
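
The relevance check in the cell below scores a query by the fraction of its keywords that appear in the retrieved context, and falls back to a refusal message when the score is 0.33 or lower. The following sketch illustrates that scoring idea only. It swaps the notebook's `TfidfVectorizer` for a regular expression and a small hand-picked stop-word list, both of which are assumptions made for this example.

```python
import re

# Small illustrative stop-word list (an assumption; the notebook uses scikit-learn's English list)
STOP_WORDS = {"what", "is", "the", "a", "an", "of", "for", "in", "are", "to", "do", "i"}

def keyword_relevance(query, context):
    """Return (keywords, matched, score), where score is the matched fraction."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    keywords = sorted({w for w in words if w not in STOP_WORDS})
    matched = [w for w in keywords if w in context.lower()]
    score = len(matched) / len(keywords) if keywords else 0
    return keywords, matched, score

context = "The CSE department offers advising for graduate courses."
keywords, matched, score = keyword_relevance("What graduate courses are offered?", context)
print(keywords, matched, round(score, 2))
```

Here two of the three keywords appear in the context, so the score is above the 0.33 threshold and the query would proceed to the language model.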

In [None]:
# Initialize the embedding model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def get_api_key():
    """Retrieve the API key from the environment."""
    # The key is stored under MISTRAL_API_KEY by the API key cell in section 3.3
    api_key = os.getenv("MISTRAL_API_KEY")
    if not api_key:
        raise ValueError("API key not found. Ensure MISTRAL_API_KEY is set (see section 3.3) before proceeding.")
    return api_key

def search_milvus(query):
    """Search Milvus collection for a query and return the top results."""
    # model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    # Load the collection
    collection_name="CSUSB_CSE_Data"
    collection = Collection(name=collection_name)
    collection.load()

    # Convert the query into an embedding
    query_embedding = np.array(model.encode(query), dtype=np.float32).tolist()

    # Define search parameters
    search_params = {"metric_type": "L2", "params": {"nprobe": 10}}

    # Perform the search
    # print("Searching for:", query)

    results = collection.search(
        data=[query_embedding],          # Query embedding
        anns_field="embedding",          # Field to search
        param=search_params,
        limit=50,                         # Number of results
        expr=None,                       # Optional filter
        output_fields=["text_content", "url"]  # Specify fields to retrieve
    )
    # Collect the context
    context_chunks = []

    # Display results
    for i, result in enumerate(results[0]):
        # print(f"Result {i+1}:")
        text_content = result.entity.get("text_content")
        url = result.entity.get("url")
        if text_content:  # Ensure the content is valid
            context_chunks.append(f"{text_content.strip()}\n(Source: {url})")
        # print(f"Text: {text_content}")
        # print(f"URL: {url}")
        # print(f"Score: {result.distance}")
        # print("-" * 40)

    # Create the context by concatenating the top results
    context = " ".join(context_chunks[:30])
    #print(context,"Context")
    return context

def extract_keywords_from_query(query, max_keywords=5):
    """Extract keywords dynamically from the query."""
    vectorizer = TfidfVectorizer(stop_words="english", max_features=max_keywords)
    vectorizer.fit([query])  # Fit the vectorizer only on the query
    keywords = vectorizer.get_feature_names_out()
    return list(keywords)  # Ensure keywords are returned as a Python list

def compare_keywords_with_context(query, context, max_keywords=5):
    """Extract keywords from query and compare them with the context."""
    # Extract keywords from the query
    keywords = extract_keywords_from_query(query, max_keywords=max_keywords)

    # Compare keywords with the context
    matched_keywords = [keyword for keyword in keywords if keyword.lower() in context.lower()]

    # Calculate the relevance score
    relevance_score = len(matched_keywords) / len(keywords) if keywords else 0
    return keywords, matched_keywords, relevance_score

def handle_stopword_prompts(query):
    """
    Handle conversational prompts or queries with stop words
    and return a predefined guidance response.
    """
    conversational_prompts = [
        "hi", "hello", "what is your name", "who are you", 
        "how are you", "what do you do", "what's your name"
    ]
    # Normalize the query for comparison
    normalized_query = query.strip().lower()

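    # Match whole words only, so "hi" does not fire on words such as "machine" or "this"
    if any(re.search(rf"\b{re.escape(prompt)}\b", normalized_query) for prompt in conversational_prompts):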
        return (
            "I am an academic advisor chatbot, designed to assist with CSE-related questions. "
            "I am equipped with data from:\n"
            "- CSE Website: https://www.csusb.edu/cse\n"
            "- CSE Catalog: https://catalog.csusb.edu/colleges-schools-departments/natural-sciences/computer-science-engineering/"
        )
    return None

def get_relevant_context(query):
    """
    Retrieve relevant context and handle low relevance scores or unexpected errors gracefully.
    """
    try:
        # Handle conversational prompts
        guidance_response = handle_stopword_prompts(query)
        if guidance_response:
            return guidance_response, None  # Return guidance directly for stopword prompts

        # Proceed with Milvus search if not a conversational prompt
        context = search_milvus(query)

        # Extract and compare keywords
        keywords, matched_keywords, relevance_score = compare_keywords_with_context(query, context)

        #print(f"Keywords: {keywords}")
        #print(f"Matched Keywords: {matched_keywords}")
        #print(f"Relevance Score: {relevance_score:.2f}")

        # Handle relevance score
        if relevance_score <= 0.33:
            context = (
                "Sorry, I can’t help with that. I’m here to assist with CSE academic advising—"
                "try asking about courses, schedules, or resources!"
            )
            sources = None  # No sources for low relevance
        else:
            # Extract all sources from the context
            sources = []
            for line in context.split("\n"):
                if "(Source:" in line:
                    source = line.split("(Source:")[1].strip().rstrip(")")
                    sources.append(source)

            # Join sources into a single string for further processing
            sources = "\n".join(sources) if sources else None

        return context, sources

    except ValueError as e:
        # Handle the empty vocabulary error gracefully
        if "empty vocabulary" in str(e):
            print(f"Encountered ValueError: {e}")
            context = (
                "I am an academic advisor chatbot, designed to assist with CSE-related questions. "
                "I am equipped with data from:\n"
                "- CSE Website: https://www.csusb.edu/cse\n"
                "- CSE Catalog: https://catalog.csusb.edu/colleges-schools-departments/natural-sciences/computer-science-engineering/"
            )
            return context, None

        # Re-raise other unexpected errors with the original traceback
        raise

def generate_response_with_source(rag_chain, context_chunks, sources, query):
    """
    Generate the final response with the very first source or fallback response,
    ensuring the response text does not include URLs and the source is shown separately.
    """
    # Handle guidance response directly
    guidance_response = handle_stopword_prompts(query)
    if guidance_response:
        return guidance_response  # Return guidance for stopword prompts

    # Initialize variables
    normalized_sources = []

    # Parse sources and extract URLs
    if sources:
        for line in sources.split("\n"):
            if "http" in line:
                # Extract URL and clean it
                url = line.split()[0].rstrip(")")
                parsed_url = urlparse(url)
                normalized_url = f"{parsed_url.scheme}://{parsed_url.netloc}{parsed_url.path}"
                normalized_sources.append(normalized_url)

        # Debugging: Check normalized sources
        #print("Normalized Sources:", normalized_sources)

        # Get the very first source
        first_source = normalized_sources[0] if normalized_sources else None
    else:
        first_source = None

    # Handle the response based on the source and RAG chain output
    if first_source is None:
        # Low relevance or no sources available
        response = context_chunks  # Fallback response
    else:
        # Generate the response using the RAG chain
        response = rag_chain.invoke({"context": context_chunks, "question": query})

        # Remove any URLs from the response text so the source is shown separately
        response = re.sub(r"https?://\S+", "", response).strip()

        # If the response indicates insufficient information, remove the source.
        # The prompt's fallback sentence uses a curly apostrophe, so match both forms.
        if response.strip().lower().startswith(("i don't", "i don’t")):
            response = response.strip()
            first_source = None  # Set source to None
        else:
            # Append the first source to the response
            response = (
                f"{response.strip()}\n\nSource:\n{first_source.strip()}"
            )

    return response

def invoke_llm_for_response(query):
    """Generate a response for the query and omit the source when the model reports insufficient information."""
    try:
        llm = ChatMistralAI(model='open-mistral-7b', api_key=get_api_key())
        # Define the prompt template
        PROMPT_TEMPLATE = """
        You are a helpful assistant tasked with answering questions based strictly on the provided context. Use only the information, facts, and statistics explicitly given in the context to formulate your response. Do not include any additional information or assumptions outside the context.
        
        Context:
        {context}
        
        Question:
        {question}
        
        Instructions:
        - Provide a concise and accurate answer based solely on the context above.
        - If the context does not contain enough information to answer the question, respond with:
        "I don’t have enough information to answer this question."
        - Do not generate, assume, or make up any details beyond the given context.
        """
        prompt = PromptTemplate(
            input_variables=["context", "question"],
            template=PROMPT_TEMPLATE
        )
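        # The chain is invoked with a dict holding "context" and "question",
        # which the prompt template consumes directly. Wrapping both keys in
        # RunnablePassthrough() would pass the whole input dict into each slot.
        rag_chain = prompt | llm | StrOutputParser()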

        context_chunks, source = get_relevant_context(query)

        # Generate the response, attaching the first retrieved source when available
        response = generate_response_with_source(rag_chain, context_chunks, source, query)

        print("Response:", response)
        return response

    except httpx.HTTPStatusError as e:
        if e.response.status_code == 429:
            return "Rate limit exceeded. Please wait a moment before trying again."
        # Re-raise any other HTTP error unchanged
        raise
        
print("Chatbot initialized: Milvus search, context retrieval, and LLM response generation ready.")
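
The source handling in `generate_response_with_source` can be tried in isolation. The following is a minimal sketch, not the notebook's exact code: the helper names `normalize_source`, `strip_urls`, and `attach_source` are hypothetical, and the query string in the example URL is made up for illustration. It uses only the standard library, so it runs without Milvus or an API key.

```python
import re
from urllib.parse import urlparse


def normalize_source(url):
    """Keep only scheme, host, and path; drop query strings and fragments."""
    parsed = urlparse(url.rstrip(")"))
    return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"


def strip_urls(text):
    """Remove http(s) URLs from the answer text and tidy leftover spaces."""
    without_urls = re.sub(r"https?://\S+", "", text)
    return re.sub(r"[ \t]{2,}", " ", without_urls).strip()


def attach_source(answer, source):
    """Append the source on its own lines unless the answer is a refusal."""
    cleaned = strip_urls(answer)
    # Match both straight and curly apostrophes in "I don't ..."
    is_refusal = cleaned.lower().startswith(("i don't", "i don’t"))
    if source is None or is_refusal:
        return cleaned
    return f"{cleaned}\n\nSource:\n{normalize_source(source)}"


print(attach_source(
    "The CSE department offers several degree programs.",
    "https://www.csusb.edu/cse?ref=nav",
))
```

Keeping the refusal check tolerant of both apostrophe styles matters because the prompt template's fallback sentence is written with a curly apostrophe.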

### 5. Testing the Chatbot

- To run a query with the `invoke_llm_for_response` function, make sure the environment is set up with the required data source (the Milvus collection) and the API key.
- The function retrieves relevant documents from the CSUSB academic webpages, passes them through the Retrieval-Augmented Generation (RAG) chain, and returns the model's generated response.

In [None]:
# Function to execute the query and display the response
def query_rag(query: str):
    return invoke_llm_for_response(query)

# Get user input for query
response = query_rag(input("Enter your query: "))
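
Beyond a single interactive query, a small batch check can confirm that in-scope questions come back with a source and out-of-scope ones do not. The sketch below is one possible approach under stated assumptions: `has_source`, `run_smoke_tests`, and `fake_responder` are hypothetical names, and `fake_responder` is a stand-in for `invoke_llm_for_response` so the example runs without Milvus or an API key.

```python
def has_source(response):
    """True when the response carries a separate 'Source:' section."""
    return "\n\nSource:\n" in response


def run_smoke_tests(respond, cases):
    """Run each (query, expects_source) pair and return the failing queries."""
    failures = []
    for query, expects_source in cases:
        response = respond(query)
        if has_source(response) != expects_source:
            failures.append(query)
    return failures


def fake_responder(query):
    """Hypothetical stand-in for invoke_llm_for_response (no network calls)."""
    if "cse" in query.lower():
        return (
            "The department offers several programs."
            "\n\nSource:\nhttps://www.csusb.edu/cse"
        )
    return "Sorry, I can't help with that."


cases = [
    ("What programs does CSE offer?", True),   # in scope: source expected
    ("What is the weather today?", False),     # out of scope: no source
]
print(run_smoke_tests(fake_responder, cases))
```

Once the Milvus collection and API key are configured, passing `invoke_llm_for_response` in place of `fake_responder` would run the same checks against the real chatbot.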


### 6. Conclusion

- Developed a chatbot using the RAG system for retrieving academic documents and generating context-based responses.
- Integrated Milvus for vector-based document retrieval and Mistral AI for natural language processing.
- Configured the chatbot within Jupyter Notebook for interactive query handling.

### Next Steps:

- Scale the chatbot to include more diverse datasets and enhance query handling capabilities.

Built as part of [CSE 6550: Software Engineering Concepts](https://catalog.csusb.edu/coursesaz/cse/)

Resources:
- [GitHub](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team2)  
- [Wiki](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team2/wiki)
