# Textbook Chatbot 

The Textbook Chatbot project for CSE 6550 answers queries about the textbook "Software Engineering: A Practitioner's Approach." The chatbot serves as an educational tool: it provides information, answers questions, and can retrieve relevant content from the textbook.

## Table of contents

- Setup
  1. Imports
  2. Environment variables
  3. Document loading
  4. LLM load function

- Inference
  1. chat_completion function
  2. Prompt template
  3. User enters a prompt
  4. Display result

## Setup

### Imports

- This code imports essential libraries for document retrieval, storage, and processing, enabling efficient querying and management of textual data. It uses FAISS and BM25 for high-performance document search, Hugging Face models for embedding text, and tools to split documents and load multiple PDFs from directories.

- Environment variables are managed via `dotenv`, which loads configuration settings such as API keys from a `.env` file instead of hard-coding them in the notebook.
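
To make the `.env` mechanism concrete, here is a minimal sketch of what loading such a file amounts to. It is a simplified stand-in for `load_dotenv`, not the library's implementation; the helper name `load_env_lines` and the key `EXAMPLE_API_KEY` are hypothetical.

```python
import os

def load_env_lines(lines, environ=None):
    """Copy KEY=VALUE pairs into `environ` without overwriting existing keys.

    Simplified illustration only: real .env parsers also handle quoting
    rules, `export` prefixes, and variable expansion.
    """
    environ = os.environ if environ is None else environ
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault keeps any value that is already present
        environ.setdefault(key.strip(), value.strip().strip('"\''))
    return environ

# Hypothetical file contents; a plain dict stands in for os.environ here
env = load_env_lines(["# secrets", "EXAMPLE_API_KEY = 'abc123'", ""], environ={})
print(env["EXAMPLE_API_KEY"])
```

Keeping values that are already set mirrors the common convention that the real environment takes precedence over the file.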

#### Libraries imported

- `os`: For interacting with the operating system and accessing environment variables

- `create_retrieval_chain` and `create_stuff_documents_chain` from `langchain.chains`: For building retrieval chains that enable querying and combining relevant documents.

- `load_dotenv` from `dotenv`: For loading environment variables from a `.env` file, especially useful for API keys and configuration settings

- `FAISS` from `langchain_community.vectorstores`: For creating a FAISS (Facebook AI Similarity Search) vector store, an efficient tool for handling large-scale similarity search

- `HuggingFaceEmbeddings` from `langchain_huggingface`: For creating text embeddings using Hugging Face models, which help in encoding textual data into vector form

- `BM25Retriever` from `langchain_community.retrievers`: For retrieving documents using BM25, a ranking function commonly used in search engines for information retrieval

- `EnsembleRetriever` from `langchain.retrievers`: For combining multiple retrieval methods to improve the accuracy of retrieved results

- `RecursiveCharacterTextSplitter` from `langchain_text_splitters`: For splitting large documents into smaller, more manageable chunks

- `PyPDFDirectoryLoader` from `langchain_community.document_loaders`: For loading and reading multiple PDF documents within a directory, converting them into text for analysis and retrieval

In [15]:
import os
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from dotenv import load_dotenv
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFDirectoryLoader
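
Of the tools imported above, BM25 is the keyword-based ranking function. The sketch below scores documents with the Okapi BM25 formula in plain Python to show what such a ranking computes. Whitespace tokenization, the sample sentences, and the parameter values `k1=1.5` and `b=0.75` are assumptions made for illustration; `BM25Retriever` may preprocess and parameterize differently.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF variant that stays non-negative
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalized through `b`
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "software process models and agile development",
    "requirements engineering and modeling",
    "agile teams deliver software in short iterations",
]
scores = bm25_scores("agile software", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Documents that share no term with the query score zero, and among matching documents the shorter one ranks higher when term counts are equal.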

### Document loading

- This code defines functions that load PDF files, split them into manageable chunks, and create or load a FAISS vector store for similarity search. The `load_documents_from_directory` function uses a text splitter to break documents down, while `load_or_create_faiss_vector_store` indexes the resulting chunks.

- The `similarity_search` function returns the top matches from the vector store that fall within a distance threshold, and `get_hybrid_retriever` combines BM25 keyword search with vector search in a weighted ensemble to improve the relevance of the results.
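
The effect of splitting with a chunk size and an overlap can be illustrated with a simplified fixed-window sketch. This is not how `RecursiveCharacterTextSplitter` works internally (it splits recursively on separators); the helper `chunk_with_overlap` is hypothetical and only shows why neighboring chunks share some content.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk
        if start + chunk_size >= len(tokens):
            break
    return chunks

words = "a b c d e f g h i j".split()
for chunk in chunk_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts with the last token of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.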

In [8]:
# document loading 
EMBEDDING_MODEL_NAME = "Alibaba-NLP/gte-large-en-v1.5"  # Embedding model (https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5)
model_kwargs = {'trust_remote_code': True}
EMBEDDING_FUNCTION = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME, model_kwargs=model_kwargs)

def load_documents_from_directory(
	document_path: str, 
	chunk_size: int = 2048, 
	chunk_overlap: int = 200
):
	"""
	Load PDF documents from a directory and split them into chunks.
	Args:
		document_path (str): Path to the directory containing PDF files.
		chunk_size (int): Size of each text chunk (default: 2048).
		chunk_overlap (int): Overlap between chunks (default: 200).
	Returns:
		List of document chunks.
	"""
	print(f"Loading documents from {document_path}...")
	# Load PDF documents from the specified directory without pre-splitting,
	# so that chunking happens once, in the token-based splitter below
	documents = PyPDFDirectoryLoader(document_path).load()
	# Create a text splitter using tiktoken encoder
	text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
	# Split the documents into chunks
	return text_splitter.split_documents(documents)


def load_or_create_faiss_vector_store(
	documents,
	persist_directory,
	collection_name="collection"
):
	"""
	Load an existing FAISS vector store or create a new one if it doesn't exist.
	Args:
		documents: List of documents to index (used only when a new index is created).
		persist_directory (str): Directory to save/load the FAISS index.
		collection_name (str): Name of the collection.
	Returns:
		FAISS vector store object.
	"""
	index_path = os.path.join(persist_directory, f'{collection_name}')
	if os.path.exists(index_path):
		# Load existing FAISS index
		print(f"Loading existing FAISS vector store from {index_path}...\n")
		faiss_store = FAISS.load_local(
			index_path, 
			embeddings=EMBEDDING_FUNCTION, 
			allow_dangerous_deserialization=True
		)
	else:
		# Create new FAISS index
		print(f"Creating new FAISS vector store in {index_path}...\n")
		faiss_store = FAISS.from_documents(
			documents, 
			embedding=EMBEDDING_FUNCTION
		)
		faiss_store.save_local(index_path)
	return faiss_store

def similarity_search(
	question,
	vector_store,
	k,
	distance_threshold = 420.0
):
	"""
	Get top k most similar documents using FAISS vector store.
	Args:
		question: The user question
		vector_store: FAISS vector store
		k: Number of documents to return
		distance_threshold: Maximum distance score to include document
	Returns:
		list[Document]: Top k most similar documents
	"""
	retrieved_docs = vector_store.similarity_search_with_score(question, k=k)
	filtered_docs = [doc for doc, score in retrieved_docs if score <= distance_threshold]
	return filtered_docs

def get_hybrid_retriever(documents, vector_store, k):
	"""
	Create a hybrid retriever combining BM25 and vector search.
	Args:
		documents: List of documents for BM25 retriever.
		vector_store: FAISS vector store for vector retriever.
		k (int): Number of documents to retrieve.
	Returns:
		EnsembleRetriever object combining BM25 and vector search.
	"""
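	# Create BM25 retriever that returns the same number of documents (k)
	# as the vector retriever; k = 0 would return no keyword matches
	bm25_retriever = BM25Retriever.from_documents(
		documents, 
		k=k
	)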
	# Create vector retriever
	vector_retriever = vector_store.as_retriever(
		search_type="similarity",
		search_kwargs={
			'k': k,
		}
	)
	# Combine retrievers with specified weights
	fusion_retriever = EnsembleRetriever(
		retrievers=[bm25_retriever, vector_retriever],
		weights=[0.2, 0.8]
	)
	return fusion_retriever
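
The weights `[0.2, 0.8]` control how much each retriever's ranking contributes to the combined result. One common way to merge ranked lists is weighted Reciprocal Rank Fusion, which, to my understanding, is the approach `EnsembleRetriever` takes. The sketch below shows the general technique only; the function name, the document IDs, and the constant `c=60` are assumptions for illustration.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked ID lists: each list adds weight / (rank + c) per item."""
    if len(rankings) != len(weights):
        raise ValueError("Provide exactly one weight per ranking")
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(fused, key=fused.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical keyword results
vector_ranking = ["doc_c", "doc_b", "doc_d"]  # hypothetical embedding results
print(weighted_rrf([bm25_ranking, vector_ranking], weights=[0.2, 0.8]))
```

With these weights, a document ranked first by the vector retriever outranks one ranked first only by BM25, while a document found by both lists accumulates score from each.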

## Prompt template

- This code sets up the prompt for a LangChain chatbot that answers questions about the textbook "Software Engineering: A Practitioner's Approach." The `system_prompt` defines guidelines for the chatbot: identify as a chatbot, restrict responses to the provided context, avoid fabricating information, and keep answers brief.

- The `ChatPromptTemplate` is built from the system instructions and a human message format with `{input}` and `{context}` placeholders, giving each interaction a fixed structure. The `get_chatbot_prompt_description` function returns a one-line description of the prompt's purpose.

In [9]:
from langchain_core.prompts import ChatPromptTemplate 

# Prompts
# System instructions for the chatbot. Comments stay outside the triple-quoted
# string, because everything inside it is sent to the model verbatim.
system_prompt = """
You are a chatbot answering questions about the "Software Engineering: A Practitioner's Approach" textbook.

1. Always identify yourself as a chatbot, not the textbook.
2. Answer based only on provided context.
3. If unsure, say "I don't have enough information to answer."
4. For unclear questions, ask for clarification.
5. Keep responses under 256 tokens.
6. Don't invent information.
7. Use context only if relevant.
8. To questions about your purpose, say: "I'm a chatbot designed to answer questions about the 'Software Engineering: A Practitioner's Approach' textbook."

Be accurate and concise. Answer only what's asked.
"""

# Create the chat prompt template
prompt = ChatPromptTemplate.from_messages([  # Creates a chat prompt template from the defined messages.
    ("system", system_prompt),  # Sets the system prompt as the first message.
    ("human", "Question: {input}\n\nRelevant Context:\n{context}"),  # Defines the human user input format.
])

def get_chatbot_prompt_description():  # Defines a function that returns a description of the chatbot prompt.
    return "Chatbot prompt for answering textbook-related questions."  # Returns a brief description of the chatbot's purpose.

# Calls the function to get the prompt description.
output = get_chatbot_prompt_description()  
print(output)  # Prints the description of the chatbot prompt.

Chatbot prompt for answering textbook-related questions.
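
To see what the human message looks like once its placeholders are filled, the sketch below renders the same template string with plain `str.format`. It does not use `ChatPromptTemplate`; the helper name, the sample question, and the choice to join chunks with blank lines are assumptions for illustration.

```python
HUMAN_TEMPLATE = "Question: {input}\n\nRelevant Context:\n{context}"

def render_human_message(question, context_chunks):
    """Fill the human-message template with a question and retrieved text."""
    # Joining chunks with blank lines is an assumption for this sketch
    context = "\n\n".join(context_chunks)
    return HUMAN_TEMPLATE.format(input=question, context=context)

message = render_human_message(
    "What is a process model?",
    ["(retrieved chunk 1)", "(retrieved chunk 2)"],  # placeholder text
)
print(message)
```

The model therefore sees the question first, followed by the retrieved passages under a "Relevant Context" label.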


### Text Document Loader

- The code defines a function `load_documents_from_directory` that loads text documents from a specified directory. It iterates over the files in the directory and, for each one with a `.txt` extension, uses LangChain's `TextLoader` to load the content and add it to a list. Because it reuses the name of the PDF loader defined under Document loading, running this cell replaces that earlier function in the notebook's namespace.

- Finally, the function returns the list of all loaded documents for further processing or analysis.

In [13]:
import os
# TextLoader lives in langchain_community, matching the other loaders imported above
from langchain_community.document_loaders import TextLoader

def load_documents_from_directory(directory_path):
    """
    Loads text documents from a specified directory.

    Args:
        directory_path (str): Path to the directory containing text documents.

    Returns:
        list: A list of loaded documents.
    """
    documents = []
    
    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        
        if os.path.isfile(file_path) and filename.endswith('.txt'):
            loader = TextLoader(file_path)
            document = loader.load()
            documents.extend(document)
    
    return documents

### Load Document Embeddings

- The `load_embeddings` function builds the retriever from the documents in the directory named by the `CORPUS_SOURCE` environment variable. It checks that the variable is set, loads the documents, and raises an error if none were loaded.

- It then calls `load_or_create_faiss_vector_store` to manage the FAISS vector store and `get_hybrid_retriever` to create a combined retriever. Finally, it confirms the loading process and returns the retriever for use.

In [14]:
def load_embeddings():
    document_path = os.getenv("CORPUS_SOURCE")
    if not document_path:
        raise ValueError("CORPUS_SOURCE not found in environment variables.")
    
    persist_directory = os.path.join(document_path, "faiss_indexes")
    top_k = 15
    
    # Load documents
    documents = load_documents_from_directory(document_path)
    if not documents:
        raise ValueError("No documents loaded. Please check the document path.")

    # Build or load the FAISS index, then combine it with BM25
    # (both helpers are defined in the document loading cell)
    faiss_store = load_or_create_faiss_vector_store(documents, persist_directory)
    retriever = get_hybrid_retriever(documents, faiss_store, top_k)
    
    print("Embeddings and retriever loaded.")
    return retriever

### Mistral API Loader

- The code defines a function `load_llm_api` that loads and configures the Mistral AI language model using an API key read from the `MISTRAL_API_KEY` environment variable. If the key is not set, it raises a `ValueError`. The function returns a `ChatMistralAI` instance configured with the model name, temperature, maximum number of tokens, and top-p (nucleus sampling) value.

- The code then initializes the model named "open-mistral-7b". The `chat_completion_as_dict` function sends a question to the model and returns the response as a dictionary containing the complete answer and the model name.

- The API key can be supplied through a `.env` file, loaded with `load_dotenv` before `load_llm_api` is called.

In [9]:
import os
from langchain_core.messages import HumanMessage  # used by chat_completion_as_dict
from langchain_mistralai.chat_models import ChatMistralAI

def load_llm_api(model_name):
    """
    Load and configure the Mistral AI LLM.
    Returns:
        ChatMistralAI: Configured LLM instance.
    """
    api_key = os.getenv("MISTRAL_API_KEY")  # Retrieve API key from environment variable
    if not api_key:
        raise ValueError("API key is missing. Set the MISTRAL_API_KEY environment variable.")
    
    return ChatMistralAI(
        model=model_name,
        mistral_api_key=api_key,
        temperature=0.2,
        max_tokens=256,
        top_p=0.4,
    )

MODEL_NAME = "open-mistral-7b"
llm = load_llm_api(MODEL_NAME)

# Example function to test the LLM interaction
def chat_completion_as_dict(question):
    print(f"Running prompt: {question}")  # For debugging
    response = llm.invoke([HumanMessage(content=question)])
    return {
        "complete_answer": response.content,
        "model": MODEL_NAME
    }
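
The `temperature` and `top_p` settings shape how the next token is sampled. The sketch below illustrates the general idea of temperature scaling followed by top-p (nucleus) truncation in plain Python. It is not Mistral's implementation, and the logits are hypothetical values chosen for the example.

```python
import math

def top_p_distribution(logits, temperature=0.2, top_p=0.4):
    """Return the renormalized distribution left after temperature and top-p.

    `temperature` must be greater than zero in this sketch.
    """
    # Temperature below 1 sharpens the softmax; subtract the max for stability
    scaled = [value / temperature for value in logits.values()]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = {tok: e / total for tok, e in zip(logits, exps)}

    # Keep the most likely tokens until their cumulative mass reaches top_p
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1
    return {tok: p / cumulative for tok, p in kept.items()}

# Hypothetical next-token logits
logits = {"process": 2.0, "model": 1.5, "agile": 0.5}
print(top_p_distribution(logits))
print(top_p_distribution(logits, temperature=1.0, top_p=0.8))
```

With the low temperature and small `top_p` used in this notebook, only the most likely token survives for these sample logits, which is consistent with the goal of focused, repeatable answers. Raising both values lets more candidates through.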