# Textbook Chatbot (Team 3)

This chatbot serves as an educational resource designed to respond to questions about the textbook.

In this notebook, we will demonstrate how the chatbot uses retrieval-augmented generation (RAG) to answer questions using the SWEBOK textbook as the primary data source.

[![GitHub](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team3) 
[![Wiki](https://img.shields.io/badge/Wiki-blue?style=flat&logo=wikipedia&logoColor=white)](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team3/wiki)

# Table of Contents
1. [Introduction](#1.-Introduction)
2. [Setup](#2.-Setup)
3. [Building the Chatbot](#3.-Building-the-Chatbot)
   - [Document loading](#3.1-Document-loading)
   - [Embeddings](#3.2-Embeddings)
   - [LLM setup](#3.3-LLM-setup)
   - [Mistral loader](#3.4-Mistral-loader)
4. [Improving the Chatbot with inference](#4.-Improving-the-Chatbot-with-inference)
   - [Helpful functions](#4.1-Helpful-functions)
   - [Prompt engineering](#4.2-Prompt-engineering)
5. [Testing the Chatbot](#5.-Testing-the-Chatbot)
6. [Conclusion](#6.-Conclusion)

# 1. Introduction

Purpose:
This chatbot is an educational tool built to answer questions about the textbook [Software Engineering Body of Knowledge (SWEBOK)](https://www.computer.org/education/bodies-of-knowledge/software-engineering). It was built by Team 3 for [CSE 6550: Software Engineering Concepts](https://catalog.csusb.edu/coursesaz/cse/).

Objective: 
In this notebook, we will demonstrate how the chatbot uses retrieval-augmented generation (RAG) to answer questions using the SWEBOK textbook as the primary data source.

Prerequisites:
GitHub, Docker, Mamba, Python, Jupyter Notebook

Resources:
[![GitHub](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team3) 
[![Wiki](https://img.shields.io/badge/Wiki-blue?style=flat&logo=wikipedia&logoColor=white)](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team3/wiki)

# 2. Setup

- To ensure compatibility, it is necessary to verify the Python version installed on your system. This project requires Python 3.10 or higher. Follow these steps to check and prepare your environment:

Steps to Verify Python Version:

- Check Installed Version:
Open your terminal or command prompt and execute the following command:

- On Windows or Linux, use:
```python --version```

- On macOS, use:
```python3 --version```

Dependency Requirements:
- Python must already be installed on your system.
- Python version of 3.10 or higher is mandatory for this project to function correctly.

- If Python is not installed, download and install the latest version of Python from the official Python website.
https://www.python.org/downloads/

In [17]:
!python3 --version

Python 3.12.6
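
The same check can also be done from inside Python, which sidesteps the `python` versus `python3` naming difference between platforms. The snippet below is a minimal sketch; `MIN_VERSION` and `python_is_supported` are illustrative names, not part of the project.

```python
import sys

MIN_VERSION = (3, 10)  # minimum version stated in this notebook


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if python_is_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is supported.")
else:
    print("Python 3.10 or higher is required.")
```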


### Creating Virtual Environment
This code is setting up a virtual environment for Python. Here's what it does in simple terms:

- Install Required Tools:
    - It makes sure necessary Python packages (`ipykernel` and `virtualenv`) are installed.
    - These are tools needed for creating and managing the virtual environment.

- Create a Virtual Environment:
    - It creates a virtual environment named `chatbot`.
    - A virtual environment is like a separate workspace where you can install Python packages without affecting the global system settings.

In [19]:
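import subprocess
import sys

# Discard command output; DEVNULL works on every OS (redirecting to NUL is Windows-only)
quiet = {"stdout": subprocess.DEVNULL, "stderr": subprocess.DEVNULL}

# Install the required tools with the interpreter that runs this notebook
subprocess.run(
    [sys.executable, "-m", "pip", "install", "ipykernel", "--root-user-action=ignore"],
    **quiet,
)
subprocess.run(
    [sys.executable, "-m", "pip", "install", "--user", "virtualenv",
     "--root-user-action=ignore", "--no-warn-script-location"],
    **quiet,
)

# Create the virtual environment named "chatbot"
subprocess.run([sys.executable, "-m", "venv", "chatbot"], **quiet)

# Activation happens in the shell, not in this cell; this only confirms the step ran
print("Virtual Environment Created.")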

Virtual Environment Created.
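
As an alternative to shelling out, the standard library's `venv` module can create an environment directly. The following is a minimal sketch rather than the project's method: `create_env` is a hypothetical helper, and the environment is built in a temporary directory so the example leaves nothing behind.

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir):
    """Create a virtual environment and return the path to its config file."""
    venv.create(env_dir, with_pip=False)  # with_pip=False keeps the sketch fast
    return Path(env_dir) / "pyvenv.cfg"


# Build the environment in a temporary directory and confirm it exists
with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / "demo-env")
    created = cfg.exists()
    print("Environment created:", created)
```

A real environment would normally be created with `with_pip=True` so that packages can be installed into it.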


### Installing necessary dependencies

- This cell installs essential packages for the chatbot and data processing.

In [13]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Install all dependencies
print("Installing dependencies. This can take up to 3 minutes.")
!pip install faiss-cpu huggingface_hub ipykernel jupyter langchain langchain-community \
langchain-huggingface langchain-mistralai pandas pypdf python-dotenv roman streamlit \
sentence-transformers sqlalchemy tiktoken yake -q
print("Dependencies installed.")

Installing dependencies. This can take up to 3 minutes.
Dependencies installed.


### Update `pip` if needed

In [14]:
print("Updating pip")
%pip install --upgrade pip -q

Updating pip
Note: you may need to restart the kernel to use updated packages.


### Adding requirements.txt
- This Python script generates a `requirements.txt` file that lists all of the project's dependencies in the standard format, so they can be installed in one step with `pip`.

In [15]:
# Write dependencies to requirements.txt
dependencies = """
faiss-cpu
huggingface_hub
ipykernel
jupyter
langchain
langchain-community
langchain-huggingface
langchain-mistralai
pandas
pypdf
python-dotenv
roman
streamlit
sentence-transformers
sqlalchemy
tiktoken
yake
"""

# Save to a file
with open("requirements.txt", "w") as file:
    file.write(dependencies.strip())
    
print("requirements.txt created successfully!")

requirements.txt created successfully!


### Installing dependencies from the requirements.txt file

In [16]:
# Install all dependencies from the requirements.txt file
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.
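
After installation, it can be useful to confirm that the packages are actually visible to the current kernel. The snippet below is a small sketch using the standard library's `importlib.metadata`; `missing_packages` is a hypothetical helper, and the package names passed to it are only examples.

```python
from importlib import metadata


def missing_packages(names):
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Example: check a few of the dependencies listed in requirements.txt
print(missing_packages(["faiss-cpu", "langchain", "pypdf"]))
```

An empty list means every named package was found; any names that remain may require restarting the kernel or reinstalling.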


# 3. Building the Chatbot

### 3.1 Document loading

Purpose: 
The code loads documents from a specified directory to build the data corpus for the chatbot.

Input:
The input is the directory specified by `document_path`, which contains the documents that form the chatbot's knowledge base.

Output:
The output is the collection of documents loaded from the directory into the `documents` variable, which will be used for further processing in the chatbot.

Processing:
- The code loads documents from the specified directory into the `documents` variable, creating a corpus for the chatbot to use in responding to queries.
- The primary data source used in this project is [Software Engineering Body of Knowledge (SWEBOK)](https://www.computer.org/education/bodies-of-knowledge/software-engineering).

In [1]:
import os
import sys
sys.path.append(os.path.dirname(os.getcwd())) # Add the parent directory (project root) to the import path

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

corpus_source = "swebok" # Set corpus source

# Build the absolute path to the textbook directory
document_path = os.path.abspath(os.path.join("../data", corpus_source))
persist_directory = os.path.join(document_path, "faiss_indexes")

# Process textbook PDF
from backend.document_loading import load_documents_from_directory
documents = load_documents_from_directory(document_path)

Loading documents from /Users/loksaivajja/Downloads/csusb_fall2024_cse6550_team3-main 2/data/swebok...
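
The internals of `load_documents_from_directory` are not shown in this notebook. As a rough illustration of the first step such a loader needs, the sketch below lists the PDF files in a directory using only the standard library. `find_pdfs` is a hypothetical helper, and the files are placeholders created in a temporary directory.

```python
import tempfile
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by name."""
    return sorted(p for p in Path(directory).iterdir() if p.suffix.lower() == ".pdf")


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "swebok.pdf").write_bytes(b"%PDF-1.4")  # placeholder, not a real textbook
    (Path(tmp) / "notes.txt").write_text("not a PDF")
    found = [p.name for p in find_pdfs(tmp)]
    print(found)
```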


### 3.2 Embeddings

Purpose:
Download and initialize the embedding model from HuggingFace to generate vector embeddings for text.

Input:
The input is the model name (`"Alibaba-NLP/gte-large-en-v1.5"`) which is used to fetch the embedding model from HuggingFace.

Output:
The output is the `EMBEDDING_FUNCTION`, which is an instance of the `HuggingFaceEmbeddings` class, ready to generate embeddings for text using the specified model.

Processing:
- Now that we have retrieved the textbook, we need to create vector embeddings for it.
- We will use [FAISS](https://python.langchain.com/docs/integrations/vectorstores/faiss/) as our vector database and [Alibaba-NLP/gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) as our embedding model.

In [21]:
# Download the embedding model from HuggingFace
from langchain_huggingface import HuggingFaceEmbeddings
EMBEDDING_MODEL_NAME = "Alibaba-NLP/gte-large-en-v1.5"
EMBEDDING_FUNCTION = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME, model_kwargs={'trust_remote_code': True})

Creating/loading the embeddings (This will take a couple of minutes):

### Loading Vector store
Purpose:
To create or load FAISS vector embeddings for the documents, allowing for efficient retrieval during chatbot interactions.

Input:
The input is the `documents` (loaded documents to be embedded) and `persist_directory` (the directory to store or retrieve the FAISS index).

Output:
The output is `faiss_store`, which is a FAISS vector store containing the document embeddings for efficient search and retrieval.

Processing:
The `load_or_create_faiss_vector_store` function processes the documents by either creating new FAISS embeddings or loading existing ones from the specified directory (`persist_directory`), enabling fast document retrieval based on vector similarity.

In [22]:
# Using pre-built load_or_create_faiss_vector_store function to create or load FAISS embeddings
from backend.document_loading import load_or_create_faiss_vector_store
faiss_store = load_or_create_faiss_vector_store(documents, persist_directory)

Loading existing FAISS vector store from /Users/loksaivajja/Downloads/csusb_fall2024_cse6550_team3-main 2/data/swebok/faiss_indexes/collection...
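
To make "retrieval based on vector similarity" concrete, the toy example below ranks a few hand-written two-dimensional vectors by squared Euclidean (L2) distance to a query vector. It is a sketch under stated assumptions: real embeddings have many more dimensions, the vectors and names here are invented, and the notebook does not show which distance metric the project's index uses.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, vectors, k):
    """Return the k (name, distance) pairs closest to the query, smallest first."""
    scored = [(name, squared_l2(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Invented toy "embeddings" for three topics
toy_vectors = {
    "requirements": [1.0, 0.0],
    "testing": [0.0, 1.0],
    "design": [1.0, 1.0],
}

top = nearest([0.9, 0.1], toy_vectors, k=2)
print(top)
```

A smaller distance means a closer match, which is why the closest vector comes first in the result.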



### 3.3 LLM setup

### Environment Variables

Purpose:
To load environment variables from a `.env` file, retrieve the Mistral API key, and ensure the key is available for further usage in the application.

Input:
- `.env` file containing environment variables (like `MISTRAL_API_KEY`).
- `api_key`: A string variable that may hold the Mistral API key (if manually provided).

Output:
Prints "Environment variables successfully setup" if successful, or raises an error if the `MISTRAL_API_KEY` is not found.

Processing:
We have to set up environment variables that will contain our API keys.
- If you have already created a `.env` file and added the `MISTRAL_API_KEY` you do not have to do anything. 
- If not, then you can add your API key below. Get an API key [here](https://console.mistral.ai/api-keys/).

In [4]:
from dotenv import load_dotenv
load_dotenv(override=True)

api_key = "" # add your Mistral API key here if needed
if not api_key:
    # Fall back to the key loaded from the .env file
    api_key = os.getenv("MISTRAL_API_KEY")
if not api_key:
    # Neither source provided a key
    raise ValueError("MISTRAL_API_KEY not found")
print("Environment variables successfully setup")

Environment variables successfully setup
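
One way to make the lookup order explicit and easy to test is to wrap it in a small function. This is a sketch only: `resolve_api_key` is a hypothetical helper, and the `env` parameter exists so the example can run without a real `.env` file or key.

```python
import os


def resolve_api_key(explicit_key="", env=os.environ):
    """Prefer an explicitly provided key, then fall back to the environment."""
    key = explicit_key or env.get("MISTRAL_API_KEY", "")
    if not key:
        raise ValueError("MISTRAL_API_KEY not found")
    return key


# Example with a fake environment mapping; "demo-key" is a placeholder value
print(resolve_api_key(env={"MISTRAL_API_KEY": "demo-key"}))
```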


### 3.4 Mistral loader

Purpose:
To load the Mistral AI model (in this case, "open-mistral-7b") using the `ChatMistralAI` class from `langchain_mistralai`, and configure it with necessary parameters (such as temperature, max tokens, and top-p) for generating responses.

Input:
- `model_name`: The name of the pre-trained model to be used, here set as "open-mistral-7b".
- `api_key`: The Mistral API key used for authenticating the model access.

Output:
- The model is loaded and ready to be used for generating responses.
- Prints "Successfully loaded Mistral 7B" upon successful loading of the model.

Processing:
We will be using [Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/) as our primary large language model. It will be combined with our retriever to create our RAG application.

In [5]:
from langchain_mistralai import ChatMistralAI

# Load and configure the Mistral AI LLM.
model_name = "open-mistral-7b"
def load_llm(model_name):
    return ChatMistralAI(
        model=model_name, # Model name
        mistral_api_key=api_key, # Mistral API key
        temperature=0.2, # Low temperature keeps answers focused
        max_tokens=256, # Upper bound on the response length
        top_p=0.4, # Nucleus sampling cutoff
    )

llm = load_llm(model_name)
print("Successfully loaded Mistral 7B")

Successfully loaded Mistral 7B
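
To give some intuition for the `top_p` setting, the sketch below shows the general idea of nucleus (top-p) filtering: keep the most likely tokens until their combined probability reaches the cutoff. This is a generic illustration with invented probabilities; how the hosted Mistral API applies the parameter internally is not described in this notebook.

```python
def nucleus(probabilities, top_p):
    """Return the smallest set of most-likely tokens whose total probability reaches top_p."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    kept, total = [], 0.0
    for token, probability in ranked:
        kept.append(token)
        total += probability
        if total >= top_p:
            break
    return kept


# Invented next-token distribution
toy_distribution = {"design": 0.5, "testing": 0.3, "banana": 0.2}
print(nucleus(toy_distribution, top_p=0.4))
```

With a low cutoff such as 0.4, only the most probable candidates survive, which tends to make responses more predictable.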


# 4. Improving the Chatbot with inference

### 4.1 Helpful functions

Purpose:
Retrieve and filter the top k most similar documents from the FAISS vector store based on a question.

Input:
- `question`: The user's query.
- `vector_store`: FAISS vector store.
- `k`: Number of similar documents to return.
- `distance_threshold`: Score threshold for filtering.

Output:
Filtered list of documents that are most similar to the question.

Processing:
Perform a similarity search in the vector store, keep only the documents whose score is at or below the threshold, and return those documents.

In [6]:
# Get top k most similar documents using FAISS vector store.
def similarity_search(question, vector_store, k, distance_threshold=420.0):
    # Each result is a (document, score) pair
    retrieved_docs = vector_store.similarity_search_with_score(question, k=k)
    # Keep only documents whose score is within the threshold
    filtered_docs = [doc for doc, score in retrieved_docs if score <= distance_threshold]
    return filtered_docs
print("Performs document similarity search.")
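
The effect of the threshold is easiest to see with a stand-in vector store. In the sketch below, `StubStore` is a hypothetical class that returns fixed `(document, score)` pairs, and the scores are invented; the `similarity_search` function is repeated so the example runs on its own.

```python
class StubStore:
    """Hypothetical stand-in that returns fixed (document, score) pairs."""

    def __init__(self, scored_docs):
        self.scored_docs = scored_docs

    def similarity_search_with_score(self, question, k):
        return self.scored_docs[:k]


def similarity_search(question, vector_store, k, distance_threshold=420.0):
    retrieved_docs = vector_store.similarity_search_with_score(question, k=k)
    return [doc for doc, score in retrieved_docs if score <= distance_threshold]


# Invented scores: the last one exceeds the default threshold of 420.0
store = StubStore([("chapter 1", 120.0), ("chapter 7", 410.5), ("appendix", 515.2)])
kept = similarity_search("What is a requirement?", store, k=3)
print(kept)
```

Because `k` caps the number of candidates before filtering, the function can return fewer than `k` documents, or none at all if every score is above the threshold.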

### RAG-based response generation
Purpose:
Generate responses to a user's question using the RAG system by combining relevant documents and the LLM

Input:
- `question`: The user's query
- `prompt`: The system's prompt format
- `llm`: The language model to generate the response

Output:
Streamed response chunks as an answer, enriched with context from the relevant documents

Processing:
Retrieve the top k relevant documents using `similarity_search`, format the context from the documents and the user query, use the LLM to generate a response, streaming chunks of the answer

In [23]:
# Uses the RAG system to answer the user's questions
def chat_completion(question, prompt, llm):
    top_k = 10 # The maximum number of documents that similarity search will return
    
    relevant_docs = similarity_search(question, faiss_store, top_k) # Get relevant documents
    
    context = "\n\n".join([doc.page_content for doc in relevant_docs]) # Format retrieved documents
    messages = prompt.format_messages(input=question, context=context) 
    
    # Stream response
    full_response = {"answer": "", "context": relevant_docs}
    for chunk in llm.stream(messages):
        full_response["answer"] += chunk.content
        yield (chunk.content)
print("Performs chat completion function.")

Performs chat completion function.
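
Since `chat_completion` uses `yield`, it is a generator: nothing runs until the caller iterates over it. The sketch below shows that streaming pattern in isolation. `Chunk`, `StubLLM`, and `stream_answer` are hypothetical stand-ins so the example runs without an API key; they are not the project's classes.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    content: str


class StubLLM:
    """Hypothetical model that streams a fixed answer piece by piece."""

    def stream(self, messages):
        for piece in ["RAG ", "combines ", "retrieval ", "and ", "generation."]:
            yield Chunk(piece)


def stream_answer(llm, messages):
    """Yield each chunk's text as soon as it arrives."""
    for chunk in llm.stream(messages):
        yield chunk.content


# Joining the streamed pieces reconstructs the full answer
answer = "".join(stream_answer(StubLLM(), messages=["What is RAG?"]))
print(answer)
```

A caller can either print each piece as it arrives for a live typing effect or join the pieces when the complete answer is needed.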


### Interactive chatbot interface
Purpose:
Provide an interactive interface where the user can input a question (or prompt) and get a response from the chatbot by invoking the RAG system

Input:
User-provided prompt (query) through the `prompt_input` widget

Output:
Display the chatbot's response, streamed in chunks, in the output widget

Processing:
- Create Input/Output Widgets: `prompt_input` for user input, `submit_button` for triggering the action, and `output` to display the response
- Button Click Action: When the button is clicked, it triggers the `on_submit` function
- Response Generation: The function uses the `chat_completion` to generate a response based on the user query. It then streams and displays the response in the `output` widget

In [24]:
import ipywidgets as widgets
from IPython.display import display

# Prompt widget
prompt_input = widgets.Text(
    placeholder='Enter your prompt here...',
    description='Prompt:',
    layout=widgets.Layout(width='500px')
)
# Submit button
submit_button = widgets.Button(
    description='Submit',
    button_style='primary'
)
output = widgets.Output()
def on_submit(b):
    with output:
        output.clear_output()
        user_prompt = prompt_input.value
        if not user_prompt:
            user_prompt = "Who is Hironori Washizaki?"
        print(f"\nPrompt: {user_prompt}\n")
        # Stream the response
        for response_chunk in chat_completion(user_prompt, prompt, llm):
            print(response_chunk, end='', flush=True)

submit_button.on_click(on_submit)
print(" Interactive chatbot response system.")

 Interactive chatbot response system.


### 4.2 Prompt engineering
Purpose:
Define the system behavior for the chatbot and format the prompt template for interacting with the user and providing responses based on the context.

Input:
The `system_prompt` specifies the chatbot’s instructions for how to respond.
The prompt template defines how the question and context are formatted for the chatbot.

Output:
A formatted prompt template (`prompt`) that combines the system instructions and user input (question and context) for processing by the LLM

Processing:
- System Behavior Definition: The `system_prompt` sets rules for how the chatbot should answer questions, handle uncertainty, and identify itself
- Prompt Template Creation: The `ChatPromptTemplate` is created with a combination of the system prompt and the user’s question/context format, ready to be used for generating the response.

In [9]:
from langchain_core.prompts import ChatPromptTemplate

# The system prompt will be used as a framework to drive the LLM responses
system_prompt = """
You are a chatbot that answers the question in the <question> tags.
- Answer based only on provided context in <context> tags only if relevant.
- If unsure, say "I don't have enough information to answer."
- For unclear questions, ask for clarification.
- Always identify yourself as a chatbot, not the textbook.
- To questions about your purpose, say: "I'm a chatbot designed to answer questions about the provided textbook."
"""

# Setting up a prompt template
prompt = ChatPromptTemplate.from_messages([
  ("system", system_prompt),
  # The question and the retrieved context are wrapped in matching open/close tags
  ("human", "<question>{input}</question>\n\n<context>{context}</context>"),
])
print("Generate system prompt and template.")
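
To see what the model receives, it helps to render the human message by hand. The sketch below uses plain `str.format` as a stand-in for the template, assuming the context is wrapped in a matching `<context>...</context>` pair. `HUMAN_TEMPLATE` and `render_human_message` are illustrative names, and the passages are invented.

```python
HUMAN_TEMPLATE = "<question>{input}</question>\n\n<context>{context}</context>"


def render_human_message(question, passages):
    """Join the retrieved passages and fill them into the tagged template."""
    context = "\n\n".join(passages)
    return HUMAN_TEMPLATE.format(input=question, context=context)


message = render_human_message(
    "What is software design?",
    ["Passage one.", "Passage two."],
)
print(message)
```

Wrapping the question and context in distinct tags lets the system prompt refer to each part unambiguously.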

# 5. Testing the Chatbot

Purpose:
Render interactive widgets for user input, a submit button, and output display in the notebook

Input:
- `prompt_input`: User's query
- `submit_button`: Button to trigger response generation
- `output`: Area to display the response

Output:
Displays input field, submit button, and output area for chatbot interaction.

Processing:
- User enters a query and clicks the button
- The chatbot processes the query and displays the response in the output area

In [10]:
display(prompt_input, submit_button, output)

Text(value='', description='Prompt:', layout=Layout(width='500px'), placeholder='Enter your prompt here...')

Button(button_style='primary', description='Submit', style=ButtonStyle())

Output()

## Contributors

The Textbook Chatbot project was built by Team 3 for [CSE 6550: Software Engineering Concepts](https://catalog.csusb.edu/coursesaz/cse/), offered at CSUSB.

[![GitHub](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team3) 
[![Wiki](https://img.shields.io/badge/Wiki-blue?style=flat&logo=wikipedia&logoColor=white)](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team3/wiki)