## Installation of Required Libraries

This cell installs the libraries the chatbot depends on, including `pymilvus`, `langchain`, `langchain_community`, `langchain_huggingface`, `langchain_milvus`, `langchain_mistralai`, and `tabulate`.

In [None]:
%pip install "pymilvus[model]" langchain langchain_community langchain_huggingface langchain_milvus beautifulsoup4 requests nltk langchain_mistralai sentence-transformers scipy streamlit python-dotenv tabulate
print('Libraries installation completed.')

Collecting langchain_community
  Downloading langchain_community-0.3.5-py3-none-any.whl.metadata (2.9 kB)
Collecting langchain_huggingface
  Downloading langchain_huggingface-0.1.2-py3-none-any.whl.metadata (1.3 kB)
Collecting langchain_milvus
  Downloading langchain_milvus-0.1.6-py3-none-any.whl.metadata (1.9 kB)
Collecting langchain_mistralai
  Downloading langchain_mistralai-0.2.1-py3-none-any.whl.metadata (2.4 kB)
Collecting streamlit
  Downloading streamlit-1.39.0-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting pymilvus[model]
  Downloading pymilvus-2.4.9-py3-none-any.whl.metadata (5.6 kB)
Collecting environs<=9.5.0 (from pymilvus[model])
  Downloading environs-9.5.0-py2.py3-none-any.whl.metadata (14 kB)
Collecting ujson>=2.0.0 (from pymilvus[model])
  Downloading ujson-5.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.3 kB)
Collecting milvus-lite<2.5.0,>=2.4.

## Define Source Data for Web Crawling

This cell defines the source corpus from which data will be scraped. This source list is used by the web crawler to collect text for the chatbot knowledge base.

In [7]:
corpus_source = ["https://www.csusb.edu/cse"]
# The catalog site is commented out to reduce the load time of the data-scraping cell
# "https://catalog.csusb.edu/"
print('Source data for web crawling defined.')
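
If more sources are added to the list later, it may help to validate them before crawling. The sketch below is one possible approach, not part of the original notebook: `clean_sources` and `candidate_sources` are hypothetical names, and the rule applied (keep only absolute `http`/`https` URLs, treat a trailing slash as insignificant) is an assumption.

```python
from urllib.parse import urlparse

def clean_sources(urls):
    """Return unique http(s) URLs in their original order (hypothetical helper)."""
    seen = set()
    cleaned = []
    for url in urls:
        url = url.strip()
        parts = urlparse(url)
        # Keep only absolute http/https URLs that name a host
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        key = url.rstrip("/")  # assumption: a trailing slash does not change the page
        if key not in seen:
            seen.add(key)
            cleaned.append(url)
    return cleaned

candidate_sources = [
    "https://www.csusb.edu/cse",
    "https://www.csusb.edu/cse/",   # duplicate apart from the trailing slash
    "catalog.csusb.edu",            # missing scheme, so it is skipped
]
print(clean_sources(candidate_sources))
```

Entries without a scheme are dropped rather than guessed at, because `requests.get` would reject them anyway.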

## Setup and Configuration

This section initializes necessary libraries and configurations, including the download of `nltk` resources and setting up the Milvus vector database path.

In [8]:
import os  # needed for os.makedirs below

import nltk

nltk.download('punkt')
MILVUS_URI = "./milvus_lite/milvus_vector.db"
# Switch between models to get optimized information retrieval on QA tasks
MODEL_NAME = "sentence-transformers/all-MiniLM-L12-v2"
MODEL_NAME_2 = "sentence-transformers/msmarco-distilbert-base-v3"
collection_name = "Academic_Webpages"
output_folder = "csusb_cse_content"

# Ensure directories exist
os.makedirs("milvus_lite", exist_ok=True)
os.makedirs(output_folder, exist_ok=True)

print('Libraries and configurations set up completed.')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Import Core Libraries for Milvus and Vector Operations

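Imports the modules used in the rest of the notebook: `pymilvus` for connections and collection schemas, LangChain components for retrieval and prompting, and the scraping and text-splitting utilities.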

In [9]:
import os

import nltk
import numpy as np
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from scipy.sparse import csr_matrix
from langchain.text_splitter import CharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_milvus.retrievers import MilvusCollectionHybridSearchRetriever
from langchain_milvus.utils.sparse import BM25SparseEmbedding
from langchain_mistralai.chat_models import ChatMistralAI
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    WeightedRanker,
    connections,
    utility,
)
print('Milvus and vector operations libraries imported.')

## Data Collection with Web Scraping

Implements the web scraping process that extracts academic content from the CSUSB website to build the chatbot's knowledge base.

In [10]:
# Load data from the CSUSB academic websites (data scraping)
# Implementation of the web crawler

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from langchain.text_splitter import CharacterTextSplitter

# Directory to store unreachable URLs
unreachable_dir = "unreachable_urls"
os.makedirs(unreachable_dir, exist_ok=True)

# Save unreachable URL to file
def save_unreachable_url(url):
    with open(os.path.join(unreachable_dir, "unreachable_urls.txt"), "a") as f:
        f.write(url + "\n")
    # print(f"Saved unreachable URL: {url}")

# Function to load webpages and extract content
def load_webpages(url):
    try:
        response = requests.get(url, timeout=10)  # Setting a timeout
        response.raise_for_status()  # Check if the request was successful
    except (requests.exceptions.ConnectTimeout, requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        print(f"Failed to connect to {url}")
        save_unreachable_url(url)
        return {"content": "", "source": url}
    except requests.exceptions.HTTPError as err:
        print(f"HTTP error for {url}: {err}")
        save_unreachable_url(url)
        return {"content": "", "source": url}

    soup = BeautifulSoup(response.text, 'html.parser')
    content_list = []

    # Process <li> items and follow links to extract linked page content
    li_items = soup.find_all('li')
    for li in li_items:
        li_text = li.get_text()
        links = [a['href'] for a in li.find_all('a', href=True)]

        # Fetch content from each link in the <li>
        for link in links:
            linked_url = urljoin(url, link)
            linked_content_data = load_linked_content(linked_url)
            if linked_content_data:
                content_list.append(f"{li_text}: {linked_content_data}")

    # Extract text and links from paragraphs in the main page
    paragraphs = soup.find_all('p')
    for p in paragraphs:
        paragraph_text = p.get_text()
        links = [a['href'] for a in p.find_all('a', href=True)]
        combined_text = f"{paragraph_text} (Links: {', '.join(links)})" if links else paragraph_text
        content_list.append(combined_text)

    # Combine content into a single text
    content = " ".join(content_list)
    return {"content": content, "source": url}

# Function to load content from linked pages
def load_linked_content(link_url):
    try:
        response = requests.get(link_url, timeout=10)  # Setting a timeout
        response.raise_for_status()
    except (requests.exceptions.ConnectTimeout, requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        # print(f"Failed to connect to {link_url}")
        save_unreachable_url(link_url)
        return ""
    except requests.exceptions.HTTPError as err:
        # print(f"HTTP error for {link_url}: {err}")
        save_unreachable_url(link_url)
        return ""

    soup = BeautifulSoup(response.text, 'html.parser')
    content_list = []

    # Extract paragraphs and links from the linked page
    paragraphs = soup.find_all('p')
    for p in paragraphs:
        paragraph_text = p.get_text()
        links = [a['href'] for a in p.find_all('a', href=True)]
        combined_text = f"{paragraph_text} (Links: {', '.join(links)})" if links else paragraph_text
        content_list.append(combined_text)

    # Combine the extracted content from the linked page
    content = " ".join(content_list)
    return content

# Split text into chunks for further processing
def split_text(text, chunk_size=40000):
    text_splitter = CharacterTextSplitter(separator=",", chunk_size=chunk_size, chunk_overlap=0)
    return text_splitter.split_text(text)

# Get texts data from the URLs
def get_texts_data(base_url):
    texts = []
    for url in base_url:
        page_data = load_webpages(url)
        content = page_data["content"]
        source = page_data["source"]

        if content:
            cleaned_content = content.replace("\n", " ")
            split_contents = split_text(cleaned_content)

            # Create a document entry with each chunk and the source
            for split_content in split_contents:
                texts.append({
                    "page_content": split_content,
                    "source": source
                })

    print(texts)
    return texts

# Extract only the page content from the texts data
def extract_text_content(texts):
    return [text["page_content"] for text in texts if "page_content" in text]



texts = get_texts_data(corpus_source)
text_contents = extract_text_content(texts)


[{'page_content': 'Request Info  : We\'re excited that you are interested in learning more about California State University, San Bernardino (CSUSB). Please complete and submit the form below. We look forward to keeping in touch with you! \xa0 Are you a prospective international student?\xa0Complete and submit the\xa0International Student Inquiry Form\xa0instead to receive updates directly from our\xa0International Admissions Office. (Links: https://csusbernardino.radiusbycampusmgmt.com/ssc/iform/Kx671wkA700kx6700tI70n.ssc, https://www.csusb.edu/international-education/international-admissions) \xa0 Or if you a prospective graduate student, use our\xa0Graduate Program Request Form\xa0to be connected with the\xa0Office of Graduate Studies.   \xa0 (Links: https://csusbernardino.radiusbycampusmgmt.com/ssc/iform/I7SB8GMc003m0x671waB3.ssc, https://www.csusb.edu/graduate-studies)    Apply  : California State University, San Bernardino is a preeminent center of intellectual and cultural activ
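
The `split_text` helper above splits on commas and merges the pieces back together up to `chunk_size` characters. The sketch below illustrates that idea in plain Python. It is a simplified stand-in written for this explanation, not LangChain's implementation, and `split_on_separator` and the sample string are made up.

```python
def split_on_separator(text, separator=",", chunk_size=40):
    """Greedy, simplified separator-based chunking (illustrative only)."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        # Keep growing the current chunk while it fits; a single oversized
        # piece is kept whole rather than cut in the middle.
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current.strip())
            current = piece
    if current.strip():
        chunks.append(current.strip())
    return chunks

sample = "first clause, second clause, third clause, fourth clause, fifth clause"
for chunk in split_on_separator(sample, chunk_size=30):
    print(len(chunk), repr(chunk))
```

Because the split points are commas only, text with few commas can produce chunks longer than the requested size, which is worth keeping in mind when choosing `chunk_size` for scraped pages.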

## Define Data Display Function

Defines a function to display extracted data in a table format using `tabulate` for easier readability.

In [11]:
from tabulate import tabulate

# Function to display data in a table format
def display_texts_data(texts):
    # Convert data to table format and print
    print(tabulate(texts, headers="keys", tablefmt="grid"))
display_texts_data(texts)

Output hidden; open in https://colab.research.google.com to view.

## Retrieve and Process Text Data

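Re-runs the scraper on the specified source, extracts the chunk text, and initializes the dense (HuggingFace) and sparse (BM25) embedding functions used to populate the vector database. The last line displays the sparse embedding of one chunk as a mapping from term index to weight.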

In [12]:
texts = get_texts_data(corpus_source)
# Extract the cleaned text content from the dictionaries for embedding
text_contents = extract_text_content(texts)

# Initialize the dense and sparse embeddings
dense_embedding_func = HuggingFaceEmbeddings(model_name=MODEL_NAME_2)
dense_dim = len(dense_embedding_func.embed_query(text_contents[1]))
# print(dense_dim)
sparse_embedding_func = BM25SparseEmbedding(corpus=text_contents)
sparse_embedding_func.embed_query(text_contents[1])

[{'page_content': 'Request Info  : We\'re excited that you are interested in learning more about California State University, San Bernardino (CSUSB). Please complete and submit the form below. We look forward to keeping in touch with you! \xa0 Are you a prospective international student?\xa0Complete and submit the\xa0International Student Inquiry Form\xa0instead to receive updates directly from our\xa0International Admissions Office. (Links: https://csusbernardino.radiusbycampusmgmt.com/ssc/iform/Kx671wkA700kx6700tI70n.ssc, https://www.csusb.edu/international-education/international-admissions) \xa0 Or if you a prospective graduate student, use our\xa0Graduate Program Request Form\xa0to be connected with the\xa0Office of Graduate Studies.   \xa0 (Links: https://csusbernardino.radiusbycampusmgmt.com/ssc/iform/I7SB8GMc003m0x671waB3.ssc, https://www.csusb.edu/graduate-studies)    Apply  : California State University, San Bernardino is a preeminent center of intellectual and cultural activ

  from tqdm.autonotebook import tqdm, trange
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


{0: 1.096582,
 4: 0.36552736,
 5: 1.8276368,
 6: 7.3105474,
 7: 10.234766,
 8: 14.255568,
 9: 12.427931,
 10: 12.427931,
 11: 27.414553,
 12: 1.096582,
 13: 3.2897463,
 14: 1.8276368,
 15: 1.096582,
 21: 15.352149,
 22: 35.090626,
 25: 1.8276368,
 26: 0.70279574,
 27: 0.36552736,
 28: 15.352149,
 29: 1.8276368,
 30: 53.366997,
 31: 37.64932,
 32: 2.8273866,
 33: 4.24108,
 34: 6.5794926,
 35: 1.096582,
 36: 17.910841,
 37: 1.8276368,
 38: 2.193164,
 40: 5.5893493,
 41: 7.3105474,
 42: 1.4136933,
 43: 7.3105474,
 44: 0.35139787,
 45: 0.36552736,
 46: 0.36552736,
 47: 0.70279574,
 48: 0.17435339,
 49: 0.36552736,
 50: 0.725937,
 51: 1.8276368,
 52: 0.725937,
 53: 1.4136933,
 54: 0.725937,
 55: 0.36552736,
 56: 0.725937,
 57: 3.2897463,
 58: 0.725937,
 59: 1.096582,
 60: 0.36552736,
 62: 0.0,
 63: 0.34870678,
 64: 0.36552736,
 65: 0.36552736,
 66: 0.35139787,
 67: 0.5340825,
 68: 0.36552736,
 69: 0.725937,
 71: 0.35139787,
 72: 0.0,
 73: 3.2897463,
 74: 0.7310547,
 75: 1.4136933,
 76: 0.17
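
The output above is a sparse vector: a mapping from term index to weight, with absent indices treated as zero. Under the inner-product (`IP`) metric configured for the sparse field later in the notebook, two such vectors are compared by summing the products of the weights they share. A minimal sketch with made-up weights (`sparse_inner_product` is a hypothetical helper, not a Milvus or LangChain function):

```python
def sparse_inner_product(a, b):
    """Inner product of two sparse vectors stored as {dimension_index: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller mapping
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)

query_vec = {0: 1.0, 5: 2.0, 9: 0.5}   # made-up weights
doc_vec = {5: 1.5, 9: 4.0, 30: 7.0}    # made-up weights

# Only indices 5 and 9 appear in both vectors: 2.0 * 1.5 + 0.5 * 4.0
print(sparse_inner_product(query_vec, doc_vec))
```

Only terms present in both the query and the document contribute, which is why sparse retrieval rewards exact keyword overlap.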

## Initialize Milvus Vector Database

Initializes Milvus by establishing a connection and defining the necessary collection schema for storing vectors of processed data.

In [13]:
# Initialize Milvus
def initialize_milvus():
    global collection
    # Connect to Milvus
    connections.connect("default", uri=MILVUS_URI)
    print(f"Connected to Milvus at {MILVUS_URI}")

    # Check if the collection exists
    if utility.has_collection(collection_name):
        # Load the existing collection
        collection = Collection(name=collection_name)
        print(f"Collection '{collection_name}' already exists.")
        return collection

    else:
        print(f"Collection '{collection_name}' does not exist. Creating new collection.")

        # Define schema fields
        pk_field = "doc_id"
        dense_field = "dense_vector"
        sparse_field = "sparse_vector"
        text_field = "text"
        source_field = "source"

        # Define fields for the collection schema
        fields = [
            FieldSchema(
                name=pk_field,
                dtype=DataType.VARCHAR,
                is_primary=True,
                auto_id=True,
                max_length=100,
            ),
            FieldSchema(name=dense_field, dtype=DataType.FLOAT_VECTOR, dim=dense_dim),
            FieldSchema(name=sparse_field, dtype=DataType.SPARSE_FLOAT_VECTOR),
            FieldSchema(name=text_field, dtype=DataType.VARCHAR, max_length=65_535),
            FieldSchema(name=source_field, dtype=DataType.VARCHAR, max_length=500)
        ]

        # Create the schema and the collection
        schema = CollectionSchema(fields=fields, enable_dynamic_field=False)
        collection = Collection(name=collection_name, schema=schema, consistency_level="Strong")
        print(f"Created collection '{collection_name}'.")

        # Create indexes for dense and sparse vectors
        dense_index = {"index_type": "FLAT", "metric_type": "IP"}
        sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}

        collection.create_index(dense_field, dense_index)
        collection.create_index(sparse_field, sparse_index)

        print(f"Created sparse vector index on '{dense_field} {sparse_field}'.")

        # Flush to persist changes
        collection.flush()
        print(f"Flushed collection '{collection_name}' to persist changes.")

    # Insert vectors into the collection
    entities = []

    for text in texts:
        text_content= str(text["page_content"])
        source = text["source"]
        entity = {
            "dense_vector": dense_embedding_func.embed_documents([text_content])[0],
            "sparse_vector": sparse_embedding_func.embed_documents([text_content])[0],
            "text": text_content,
            "source": source

        }
        entities.append(entity)

    # Check if the collection already contains data
    if collection.num_entities == 0:
        collection.insert(entities)
        print(f"Inserted {len(entities)} entities into the collection '{collection_name}'.")
    else:
        print(f"Collection '{collection_name}' already contains data. Skipping insertion.")
    return collection
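
The `text` field is declared with `max_length=65_535`, while chunks are sized in characters (`chunk_size=40000`). Assuming the Milvus limit is counted in UTF-8 bytes rather than characters, a chunk with many non-ASCII characters (such as the `\xa0` non-breaking spaces visible in the scraped output) could exceed it. The sketch below shows one way to check before inserting; `oversized_chunks` is a hypothetical helper.

```python
MAX_TEXT_BYTES = 65_535  # mirrors max_length of the `text` field

def oversized_chunks(chunks, limit=MAX_TEXT_BYTES):
    """Return (index, byte_length) for chunks whose UTF-8 encoding exceeds limit."""
    return [
        (i, len(chunk.encode("utf-8")))
        for i, chunk in enumerate(chunks)
        if len(chunk.encode("utf-8")) > limit
    ]

# 40,000 non-breaking spaces are 40,000 characters but 80,000 UTF-8 bytes
sample_chunks = ["short text", "\xa0" * 40_000]
print(oversized_chunks(sample_chunks))
```

An empty result means every chunk fits; otherwise the reported chunks could be split further or truncated before insertion.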

## Execute Milvus Initialization

Executes the initialization function for Milvus, preparing it for data storage and retrieval operations.

In [14]:
initialize_milvus()

Connected to Milvus at ./milvus_lite/milvus_vector.db
Collection 'Academic_Webpages' does not exist. Creating new collection.
Created collection 'Academic_Webpages'.
Created sparse vector index on 'dense_vector sparse_vector'.
Flushed collection 'Academic_Webpages' to persist changes.
Inserted 22 entities into the collection 'Academic_Webpages'.


<Collection>:
-------------
<name>: Academic_Webpages
<description>: 
<schema>: {'auto_id': True, 'description': '', 'fields': [{'name': 'doc_id', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 100}, 'is_primary': True, 'auto_id': True}, {'name': 'dense_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'sparse_vector', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>}, {'name': 'text', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535}}, {'name': 'source', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 500}}], 'enable_dynamic_field': False}

## Format Documents for Language Model Compatibility

Concatenates the text of the retrieved documents into a single context string for the language model and collects their source URLs for citation.

In [15]:
# Function to format retrieved documents and collect their sources
def format_docs(docs):
    formatted_content = ""
    sources = set()

    # Loop through each document to retrieve text and source
    for doc in docs:
        # LangChain Document objects expose their text as `page_content`
        content = getattr(doc, "page_content", "")
        source = doc.metadata.get("source", "Unknown source")

        formatted_content += f"{content}\n\n"
        sources.add(source)

    # Combine sources into a formatted string
    formatted_sources = "\n".join(sources)

    return formatted_content, formatted_sources
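
To see what this kind of formatting produces without running a retrieval, the sketch below applies the same idea to stand-in objects. `FakeDoc` and `format_docs_sketch` are hypothetical names, and the sketch assumes that retrieved documents carry their text in `page_content` and their origin in `metadata["source"]`, as LangChain `Document` objects do. Unlike a `set`, the list used here keeps sources in first-seen order, so the output is stable between runs.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs_sketch(docs):
    """Join document texts and collect unique sources in first-seen order."""
    texts, sources = [], []
    for doc in docs:
        texts.append(doc.page_content)
        source = doc.metadata.get("source", "Unknown source")
        if source not in sources:
            sources.append(source)
    return "\n\n".join(texts), "\n".join(sources)

docs = [
    FakeDoc("First chunk.", {"source": "https://www.csusb.edu/cse"}),
    FakeDoc("Second chunk.", {"source": "https://www.csusb.edu/cse"}),
    FakeDoc("Third chunk."),  # no source metadata
]
context, sources = format_docs_sketch(docs)
print(context)
print(sources)
```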

## Configure API Key for External Services

Sets up the environment variable for API access, enabling connections to external language model services.

In [21]:
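# Set the API key without hardcoding it in the notebook
import os
from getpass import getpass

# Prompt for the key only if it is not already set in the environment
if not os.environ.get("API_KEY"):
    os.environ["API_KEY"] = getpass("Enter your Mistral API key: ")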

## Generate Response using RAG Chain

Defines the retrieval-augmented generation (RAG) function to invoke the language model and generate responses based on user queries.

In [24]:
# Invoke RAG chain to generate LLM Response
# Function to invoke the language model for generating a response
def get_api_key():
    """Retrieve the API key from the environment."""
    api_key = os.getenv("API_KEY")
    if not api_key:
        raise ValueError("API key not found. Set the API_KEY environment variable before proceeding.")
    return api_key
def invoke_llm_for_response(query: str):
    api_key = get_api_key()
    if not isinstance(query, str):
        raise ValueError("The input query must be a string.")

    if len(query.split()) < 2:
        # Return two items to match the (response, sources) shape of the normal return
        return "Please ask a more specific question.", ""

    # Initialize the language model
    llm = ChatMistralAI(model='open-mistral-7b', api_key=api_key)

    # Define the prompt template
    PROMPT_TEMPLATE = """
    Human: You are an AI assistant, and provide answers to questions by using fact-based and statistical information when possible.
    Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.

    <context>
    {context}
    </context>

    <question>
    {question}
    </question>

    Assistant:"""

    prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["context", "question"])

    # Ensure `texts` are strings
    texts = get_texts_data(corpus_source)
    if not texts:
        # Return two items to match the (response, sources) shape of the normal return
        return "No content found in the specified URLs. Please check your data source.", ""
    texts = [text['page_content'] if isinstance(text, dict) and 'page_content' in text else text for text in texts if isinstance(text, str) or isinstance(text, dict)]

    # Define the fields and search parameters for the Milvus retriever
    dense_field = "dense_vector"
    sparse_field = "sparse_vector"
    text_field = "text"
    sparse_search_params = {"metric_type": "IP"}
    dense_search_params = {"metric_type": "IP", "params": {}}
    collection = initialize_milvus()

    # Initialize the Milvus retriever
    retriever = MilvusCollectionHybridSearchRetriever(
        collection=collection,
        rerank=WeightedRanker(0.7, 0.3),
        anns_fields=[dense_field, sparse_field],
        field_embeddings=[dense_embedding_func, sparse_embedding_func],
        field_search_params=[dense_search_params, sparse_search_params],
        top_k=5,
        text_field=text_field,
    )

    hybrid_results = retriever.invoke(query)
    # Have to implement re-ranking function for the hybrid retriever for exact query matching
    formatted_content, formatted_sources = format_docs(hybrid_results)

    context_callable = lambda x: formatted_content

    # Define the RAG chain manually with the specified format
    rag_chain = (
        {"context": context_callable, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    # Invoke the RAG chain with the question string; RunnablePassthrough forwards it as `question`
    response = rag_chain.invoke(query)

    final_response = f"{response}\n\nSources:\n{formatted_sources}"
    print(final_response,formatted_sources)

    return final_response,formatted_sources
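
The retriever above combines the dense and sparse searches with `WeightedRanker(0.7, 0.3)`, which weights dense scores more heavily than sparse ones. The sketch below illustrates weighted score fusion in plain Python. It is not Milvus's implementation: it assumes both score sets are already on a comparable scale and skips any score normalization the ranker may apply internally, and `weighted_fusion` and the scores are made up.

```python
def weighted_fusion(dense_hits, sparse_hits, w_dense=0.7, w_sparse=0.3, top_k=5):
    """Combine per-document scores from two searches into one ranked list."""
    combined = {}
    for doc_id, score in dense_hits.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + w_dense * score
    for doc_id, score in sparse_hits.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + w_sparse * score
    # Highest combined score first, truncated to top_k results
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

dense_hits = {"a": 0.9, "b": 0.4}    # made-up scores, assumed to lie in [0, 1]
sparse_hits = {"b": 1.0, "c": 0.5}   # made-up scores, assumed to lie in [0, 1]
for doc_id, score in weighted_fusion(dense_hits, sparse_hits):
    print(doc_id, round(score, 2))
```

A document found by both searches (`b` here) accumulates both contributions, so it can outrank one that scores well in only the sparse search.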


## Test LLM Response Generation

A test query to evaluate the chatbot's ability to generate responses using the RAG chain.

In [25]:
query = "What programs are offered by the School of Computer Science and Engineering?"
invoke_llm_for_response(query)

[{'page_content': 'Request Info  : We\'re excited that you are interested in learning more about California State University, San Bernardino (CSUSB). Please complete and submit the form below. We look forward to keeping in touch with you! \xa0 Are you a prospective international student?\xa0Complete and submit the\xa0International Student Inquiry Form\xa0instead to receive updates directly from our\xa0International Admissions Office. (Links: https://csusbernardino.radiusbycampusmgmt.com/ssc/iform/Kx671wkA700kx6700tI70n.ssc, https://www.csusb.edu/international-education/international-admissions) \xa0 Or if you a prospective graduate student, use our\xa0Graduate Program Request Form\xa0to be connected with the\xa0Office of Graduate Studies.   \xa0 (Links: https://csusbernardino.radiusbycampusmgmt.com/ssc/iform/I7SB8GMc003m0x671waB3.ssc, https://www.csusb.edu/graduate-studies)    Apply  : California State University, San Bernardino is a preeminent center of intellectual and cultural activ

('The School of Computer Science and Engineering offers a variety of programs. These include undergraduate degrees such as Bachelor of Science in Computer Science, Bachelor of Science in Information Technology, and Bachelor of Science in Software Engineering. For graduate studies, they offer Master of Science in Computer Science, Master of Science in Information Assurance and Cybersecurity, and Doctor of Philosophy in Computer Science. They also provide several minor programs and certificates. Please visit the official university website for the most up-to-date and detailed information.\n\nSources:\nhttps://www.csusb.edu/cse',
 'https://www.csusb.edu/cse')