# Textbook Chatbot (Team 3)

This chatbot is an educational tool built to answer questions about the textbook [Software Engineering Body of Knowledge (SWEBOK)](https://www.computer.org/education/bodies-of-knowledge/software-engineering). It was built by Team 3 for [CSE 6550: Software Engineering Concepts](https://catalog.csusb.edu/coursesaz/cse/).

In this notebook, we demonstrate how the chatbot uses retrieval-augmented generation (RAG) to answer questions, with the SWEBOK textbook as the primary data source.

[![GitHub](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team3) 
[![Wiki](https://img.shields.io/badge/Wiki-blue?style=flat&logo=wikipedia&logoColor=white)](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team3/wiki)

## Table of contents
1. [Setup](#1.-Setup)
    - 1.1. [Document loading](#1.1-Document-loading)
    - 1.2. [Embeddings](#1.2-Embeddings)
2. [LLM Setup](#2.-LLM-Setup)
    - 2.1. [Environment Variables](#2.1-Environment-Variables)
    - 2.2. [Mistral Loader](#2.2-Mistral-Loader)
3. [Inference](#3.-Inference)
    - 3.1. [Helpful Functions](#3.1-Helpful-Functions)
    - 3.2. [Prompt Engineering](#3.2-Prompt-Engineering)
    - 3.3. [User Input](#3.2-User-Input) 
4. [Contributors](#Contributors)

## 1. Setup

### Imports

- This code imports essential libraries for document retrieval, storage, and processing, enabling efficient querying and management of textual data. It uses FAISS and BM25 for high-performance document search, Hugging Face models for embedding text, and tools to split documents and load multiple PDFs from directories.

- Environment variables are managed via `dotenv`, which makes it simple to load configuration settings such as API keys from a `.env` file instead of hard-coding them.

In [None]:
import os
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from dotenv import load_dotenv
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFDirectoryLoader

### 1.1 Document loading
The primary data source used in this project is [Software Engineering Body of Knowledge (SWEBOK)](https://www.computer.org/education/bodies-of-knowledge/software-engineering).

We will begin by setting `corpus_source` to point to the textbook and then processing the textbook PDF.

This code adds the project root to the module search path, suppresses warnings, and sets "swebok" as the source for the textbook files. It defines paths for the SWEBOK document directory and the FAISS indexes, then loads and processes the textbook PDFs using a backend function.

In [5]:
import os
import sys
sys.path.append(os.path.dirname(os.getcwd())) # Add the parent directory (project root) to the import path

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

corpus_source = "swebok" # Set corpus source

# Build an absolute path to the textbook directory
document_path = os.path.abspath(os.path.join("../data", corpus_source))
persist_directory = os.path.join(document_path, "faiss_indexes")

# Process textbook PDF
from backend.document_loading import load_documents_from_directory
documents = load_documents_from_directory(document_path)

Loading documents from /app/data/swebok...


### 1.2 Embeddings
Now that we have retrieved the textbook, we need to create vector embeddings for it.

`Vector embeddings` are numerical vectors that capture semantic meaning of text. Each chunk of text from our textbook will be converted into a high-dimensional vector that represents its semantic meaning and context. These vectors enable efficient similarity searches and help maintain relationships between related concepts across the text.
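
To make the idea of similarity between embeddings concrete, here is a minimal sketch using made-up three-dimensional vectors. Real embeddings from the model have far more dimensions, and the vectors, chunk names, and the `cosine_similarity` helper below are invented for illustration. Cosine similarity is only one common measure; the FAISS store used later in this notebook reports a distance score instead, where lower means closer.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration; not produced by the embedding model
toy_chunks = {
    "software testing": [0.9, 0.1, 0.0],
    "project budgeting": [0.1, 0.8, 0.3],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of a question about testing

# Rank chunk names from most to least similar to the query
ranked = sorted(
    toy_chunks,
    key=lambda name: cosine_similarity(query_vector, toy_chunks[name]),
    reverse=True,
)
print(ranked)
```

The chunk whose vector points in nearly the same direction as the query vector is ranked first, which is the intuition behind retrieving relevant passages for a question.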

We will use [FAISS](https://python.langchain.com/docs/integrations/vectorstores/faiss/) as our vector database and [Alibaba-NLP/gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) as our embedding model.

In [6]:
# Download the embedding model from HuggingFace
from langchain_huggingface import HuggingFaceEmbeddings
EMBEDDING_MODEL_NAME = "Alibaba-NLP/gte-large-en-v1.5"
EMBEDDING_FUNCTION = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME, model_kwargs={'trust_remote_code': True})

This code uses the `load_or_create_faiss_vector_store` function to either create or load FAISS embeddings for the processed documents. It initializes faiss_store with these embeddings, storing them in the specified `persist_directory` for efficient text search and retrieval.

Creating/loading the embeddings (This will take a couple of minutes):

In [7]:
# Using pre-built load_or_create_faiss_vector_store function to create or load FAISS embeddings
from backend.document_loading import load_or_create_faiss_vector_store
faiss_store = load_or_create_faiss_vector_store(documents, persist_directory)

Loading existing FAISS vector store from /app/data/swebok/faiss_indexes/collection...
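
The load-or-create behavior described above is a general caching pattern: reuse a saved artifact if it exists, otherwise build it and save it. The sketch below shows that pattern with the standard library only. It is not the project's `load_or_create_faiss_vector_store` implementation; the `load_or_create` and `expensive_build` helpers, the JSON file format, and the file name are all assumptions made for illustration.

```python
import json
import os
import tempfile

def load_or_create(path, build):
    """Return cached data from `path` if present; otherwise build and save it."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f), "loaded"
    data = build()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data, "created"

def expensive_build():
    # Hypothetical stand-in for embedding every chunk of the textbook
    return {"chunk-0": [0.1, 0.2], "chunk-1": [0.3, 0.4]}

with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "faiss_indexes", "toy_index.json")
    first, first_status = load_or_create(index_path, expensive_build)
    second, second_status = load_or_create(index_path, expensive_build)

print(first_status, second_status)
```

The first call pays the build cost and the second call reads from disk, which is why re-running the embedding cell is much faster once the index has been persisted.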



## 2. LLM Setup

### 2.1 Environment Variables

Now we have to set up the environment variables that hold our API key.

- If you have already created a `.env` file and added `MISTRAL_API_KEY` to it, you do not have to do anything.
- If not, you can add your API key below. Get an API key [here](https://console.mistral.ai/api-keys/).

This code loads environment variables from a `.env` file with `override=True` to ensure any existing environment variables are overwritten. It checks if an API key is provided in the `api_key` variable; if not, it attempts to retrieve it from the environment variable `MISTRAL_API_KEY`. If neither is available, it raises an error. Finally, it confirms the environment setup with a success message.

In [8]:
import os
from dotenv import load_dotenv
load_dotenv(override=True) # Values from .env overwrite existing environment variables

api_key = "" # add your Mistral API key here if needed
if not api_key:
    api_key = os.getenv("MISTRAL_API_KEY") # Fall back to the environment variable
if not api_key:
    raise ValueError("MISTRAL_API_KEY not found")
print("Environment variables successfully set up")

Environment variables successfully set up


### 2.2 Mistral Loader
We will be using [Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/) as our primary large language model. It will be combined with our retriever to create our RAG application.

Let's load the LLM using `langchain`.

This code configures and loads the Mistral AI language model (`open-mistral-7b`) through the `ChatMistralAI` class. It defines a `load_llm` function that sets the model, API key, and other parameters like `temperature`, `max_tokens`, and `top_p` to control response diversity and length. After calling `load_llm` with the specified model name, it initializes `llm` with the configured model and confirms successful loading.

In [9]:
from langchain_mistralai import ChatMistralAI

# Load and configure the Mistral AI LLM.
model_name = "open-mistral-7b"
def load_llm(model_name):
	return ChatMistralAI(
		model=model_name, # Model name
		mistral_api_key=api_key, # Mistral API key
		temperature=0.2,
		max_tokens=256,
		top_p=0.4,
	)
    
llm = load_llm(model_name)
print("Successfully loaded Mistral 7B")

Successfully loaded Mistral 7B
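
The effect of `temperature` and `top_p` can be illustrated with a toy next-token distribution. The sketch below follows the commonly described definitions: temperature divides the logits before the softmax, and top-p keeps the smallest set of most likely tokens whose cumulative probability reaches the threshold. The logits and the helper names `apply_temperature` and `top_p_filter` are invented for illustration, and this is not a description of how the Mistral API implements sampling internally.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, 0.1]          # invented scores for four tokens
sharp = apply_temperature(toy_logits, 0.2)  # low temperature, as in this notebook
flat = apply_temperature(toy_logits, 1.0)   # unscaled softmax for comparison
candidates = top_p_filter(sharp, 0.4)       # token indices that survive top-p

print(round(sharp[0], 3), round(flat[0], 3), sorted(candidates))
```

A low temperature concentrates probability on the top token, and a low `top_p` then discards most of the remaining candidates. Together they push the model toward focused, repeatable answers, which suits a question-answering chatbot.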


## 3. Inference

### 3.1 Helpful Functions
Now we will define some helper functions for the RAG system.

This code defines a `similarity_search` function that retrieves the top k most similar documents for a given question from a FAISS vector store. It first fetches the k documents together with their scores, then filters out any document whose score exceeds the specified `distance_threshold` (defaulting to 420.0). Because the score is treated as a distance, a lower value means a closer match. The function returns only the documents that meet the threshold for further processing or analysis.

In [10]:
# Get top k most similar documents using FAISS vector store.
def similarity_search(question, vector_store, k, distance_threshold = 420.0):
	retrieved_docs = vector_store.similarity_search_with_score(question, k=k)
	filtered_docs = [doc for doc, score in retrieved_docs if score <= distance_threshold]
	return filtered_docs
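
To see the threshold filter in isolation, the sketch below runs the same logic against a tiny stand-in store. `ToyVectorStore`, `filter_by_distance`, and the distance values are hypothetical and exist only for this example; the real FAISS store returns document objects rather than plain strings.

```python
class ToyVectorStore:
    """Hypothetical stand-in that mimics `similarity_search_with_score`."""

    def __init__(self, scored_docs):
        self.scored_docs = scored_docs  # list of (text, distance) pairs

    def similarity_search_with_score(self, question, k):
        # Ignores the question and returns the k entries with the smallest distance
        return sorted(self.scored_docs, key=lambda pair: pair[1])[:k]

def filter_by_distance(question, vector_store, k, distance_threshold=420.0):
    """Same filtering logic as `similarity_search`, applied to the toy store."""
    retrieved = vector_store.similarity_search_with_score(question, k=k)
    return [doc for doc, score in retrieved if score <= distance_threshold]

store = ToyVectorStore([
    ("chunk about requirements", 150.0),
    ("chunk about testing", 390.0),
    ("unrelated chunk", 610.0),
])
kept = filter_by_distance("What is a requirement?", store, k=3)
print(kept)
```

Note that `k` is an upper bound: when some of the k nearest documents fall outside the threshold, fewer than k documents are returned, and the list may even be empty.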

This code defines a `chat_completion` function that uses a retrieval-augmented generation (RAG) approach to answer user questions. It first retrieves up to 10 of the most relevant documents from the FAISS vector store using `similarity_search`. The documents are joined into a single context string, which is inserted into the prompt. The function then streams the response from the language model (`llm`) in chunks, yielding each part of the answer as it is generated. It also accumulates the full answer and the retrieved documents in `full_response`, although only the answer chunks are yielded to the caller.

In [11]:
# Uses the RAG system to answer the user's questions
def chat_completion(question, prompt, llm):
    top_k = 10 # The maximum number of documents that similarity search will return
    
    relevant_docs = similarity_search(question, faiss_store, top_k) # Get relevant documents
    
    context = "\n\n".join([doc.page_content for doc in relevant_docs]) # Format retrieved documents
    messages = prompt.format_messages(input=question, context=context) 
    
    # Stream response
    full_response = {"answer": "", "context": relevant_docs}
    for chunk in llm.stream(messages):
        full_response["answer"] += chunk.content
        yield chunk.content
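
Because `chat_completion` is a generator, nothing runs until the caller iterates over it. The sketch below shows that streaming pattern with a fake model so it can run without an API key. `ToyChunk`, `ToyLLM`, and `stream_answer` are hypothetical stand-ins, assuming only that the model exposes a `stream` method whose chunks carry a `content` attribute, as the code above does.

```python
class ToyChunk:
    """Hypothetical chunk object with a `content` attribute."""

    def __init__(self, content):
        self.content = content

class ToyLLM:
    """Hypothetical stand-in for a chat model with a `stream` method."""

    def stream(self, messages):
        # Emit a fixed answer piece by piece, ignoring the messages
        for piece in ["SWEBOK ", "is ", "a ", "guide."]:
            yield ToyChunk(piece)

def stream_answer(messages, llm):
    """Yield each piece of the answer as soon as the model produces it."""
    for chunk in llm.stream(messages):
        yield chunk.content

pieces = list(stream_answer("ignored prompt", ToyLLM()))
answer = "".join(pieces)
print(answer)
```

In an interactive setting the caller would print each piece as it arrives instead of collecting them into a list, which is what makes the response appear word by word.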

Next, let's create widgets so the user can ask the RAG system a question.

This code sets up a simple interactive interface in Jupyter using `ipywidgets`. It includes a text box (`prompt_input`) for entering a prompt, a submit button (`submit_button`), and an output area (`output`) for displaying responses.

- When the submit button is clicked, the `on_submit` function retrieves the user's prompt from `prompt_input`.
- If the input is empty, it defaults to asking "Who is Hironori Washizaki?".
- The prompt is then passed to the `chat_completion` function, which streams and displays the chatbot's response in real-time within the `output` area.

In [12]:
import ipywidgets as widgets
from IPython.display import display

# Prompt widget
prompt_input = widgets.Text(
    placeholder='Enter your prompt here...',
    description='Prompt:',
    layout=widgets.Layout(width='500px')
)
# Submit button
submit_button = widgets.Button(
    description='Submit',
    button_style='primary'
)
output = widgets.Output()
def on_submit(b):
    with output:
        output.clear_output()
        user_prompt = prompt_input.value
        if not user_prompt:
            user_prompt = "Who is Hironori Washizaki?"
        print(f"\nPrompt: {user_prompt}\n")
        # Stream the response
        for response_chunk in chat_completion(user_prompt, prompt, llm):
            print(response_chunk, end='', flush=True)

submit_button.on_click(on_submit)

### 3.2 Prompt Engineering
For the LLM to answer our questions effectively, we have to do some prompt engineering. This keeps the model on track and makes it answer with the textbook as the primary context.

This code creates a prompt template for guiding the chatbot's responses using `ChatPromptTemplate` from LangChain. The `system_prompt` sets clear guidelines: the chatbot should answer only from the relevant context provided, say so when it lacks enough information, ask for clarification on unclear questions, identify itself as a chatbot, and explain its purpose if asked. The `prompt` template uses the placeholders `{input}` for the user question and `{context}` for the retrieved context, which are filled in when a response is generated.

In [13]:
from langchain_core.prompts import ChatPromptTemplate

# The system prompt will be used as a framework to drive the LLM responses
system_prompt = """
You are a chatbot that answers the question in the <question> tags.
- Answer based only on provided context in <context> tags only if relevant.
- If unsure, say "I don't have enough information to answer."
- For unclear questions, ask for clarification.
- Always identify yourself as a chatbot, not the textbook.
- To questions about your purpose, say: "I'm a chatbot designed to answer questions about the provided textbook."
"""

# Setting up a prompt template
prompt = ChatPromptTemplate.from_messages([
  ("system", system_prompt),
  ("human", "<question>{input}</question>\n\n<context>{context}</context>"),
])
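
To see what the human message looks like once the placeholders are filled in, here is a minimal sketch that uses plain string formatting instead of `ChatPromptTemplate`. The `HUMAN_TEMPLATE` constant, the `render_human_message` helper, and the sample chunks are assumptions made for illustration; the sketch wraps the context in an opening `<context>` tag and a closing `</context>` tag.

```python
HUMAN_TEMPLATE = "<question>{input}</question>\n\n<context>{context}</context>"

def render_human_message(question, chunks):
    """Join retrieved chunks and fill the question and context placeholders."""
    context = "\n\n".join(chunks)
    return HUMAN_TEMPLATE.format(input=question, context=context)

message = render_human_message(
    "What is software design?",
    ["Toy chunk one.", "Toy chunk two."],  # invented stand-ins for retrieved text
)
print(message)
```

Wrapping the question and the context in separate tags gives the model an unambiguous boundary between what the user asked and what was retrieved from the textbook.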

### 3.2 User Input
We're done with creating our RAG system!

Let's test out the chatbot by asking it a question

In [15]:
display(prompt_input, submit_button, output)

Text(value='', description='Prompt:', layout=Layout(width='500px'), placeholder='Enter your prompt here...')

Button(button_style='primary', description='Submit', style=ButtonStyle())

Output(outputs=({'name': 'stdout', 'text': '\nPrompt: Who is Hironori Washizaki?\n\nHironori Washizaki is an i…

## Contributors

The Textbook Chatbot project was built by Team 3 for [CSE 6550: Software Engineering Concepts](https://catalog.csusb.edu/coursesaz/cse/), offered at CSUSB.

[![GitHub](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team3) 
[![Wiki](https://img.shields.io/badge/Wiki-blue?style=flat&logo=wikipedia&logoColor=white)](https://github.com/DrAlzahraniProjects/csusb_fall2024_cse6550_team3/wiki)