<h1 align=center >Chatbot Project </h1> 

### Final Project - Large Language Model Chatbot

**Group 1** - Cassandra Phung, Kumar Kartikeyn Arora, Husandeep Kaur<br>
**Course** - COMP-3704 (254805) Neural Networks and Deep Learning<br>
**Date** - December 5, 2024
<br><br>

---
**Project Information**<br>
This chatbot was developed to operate entirely locally and offline, ensuring user privacy and data security. It enables users to load PDF files as input and interact with a language model (LLM) directly within this Jupyter Notebook.<br>

**Use Case**<br>
For this demonstration, the chatbot processes information from three of Red River College's Information Technology Programs. The data has been sourced directly from the official RRC website.<br>

**Other Potential Applications**<br>
This chatbot is particularly useful for scenarios involving sensitive data that users prefer not to upload to online platforms, as many companies use user-uploaded data to improve their generative AI.

**Examples of sensitive files include:**
- Patient Information: Medical records or other health-related documents.
- Financial Information: Banking or investment details.
- Personal Information: Documents users wish to keep private.

By keeping the process offline, this tool provides a secure alternative for handling confidential files.

In [1]:
import os
import requests
from langchain.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain_ollama.chat_models import ChatOllama
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.retrievers.multi_query import MultiQueryRetriever

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Setting environment variable for compatibility
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

**Instructions**<br>
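Ensure Ollama is running, since the notebook sends both embedding and chat requests to the local Ollama server.
- Open a command prompt.
- Check whether Ollama is running: `ollama --version`
- If it reports that it cannot connect to a running Ollama instance, start the server: `ollama start`
- Open a new command prompt.
- List the installed models: `ollama list`
- Ensure the required models are installed: `nomic-embed-text:latest` for embeddings and the chat model configured in section 4 (`llama2:7b` in the final version, `llama3.2` if you switch to the alternative).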
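
The same checks can be scripted. The following is a minimal sketch, not part of the original notebook: it assumes Ollama's default address `http://localhost:11434` and its `/api/tags` endpoint, which lists the locally installed models. The helper names `missing_models` and `installed_models` are hypothetical.

```python
import json
import urllib.request


def missing_models(installed, required):
    """Return the required model names that are not in the installed list.

    A required name without a tag (e.g. "nomic-embed-text") matches any
    installed tag of that model (e.g. "nomic-embed-text:latest").
    """
    missing = []
    for name in required:
        if ":" in name:
            found = name in installed
        else:
            found = any(item.split(":")[0] == name for item in installed)
        if not found:
            missing.append(name)
    return missing


def installed_models(base_url="http://localhost:11434"):
    """Ask the local Ollama server for its models (assumes the /api/tags endpoint)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return [model["name"] for model in payload.get("models", [])]
```

Calling `missing_models(installed_models(), ["llama2:7b", "nomic-embed-text"])` should return an empty list when everything is in place. A connection error from `installed_models()` suggests the server is not running or is listening on a different port.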

---
### 1. Loading PDF
### LangChain document loader: UnstructuredPDFLoader
- Extracts text from PDF documents making it ready for further processing (like chunking & embedding)

In [3]:
# Define paths for the PDFs
# Loading multiple PDFs
# Each PDF contains information about one RRC Information Technology program
pdf_files = [
    "Information_Security.pdf",
    "Data_Science_Machine_Learning.pdf",
    "Application_Development_Delivery.pdf"
]

In [4]:
# Map file names to user-friendly program names
def program_name_mapper(file_name):
    program_map = {
        "Information_Security": "Information Security Program",
        "Data_Science_Machine_Learning": "Data Science & Machine Learning Program",
        "Application_Development_Delivery": "Application Development & Delivery Program"
    }
    return program_map.get(file_name.split(".")[0], "Unknown Program")

In [5]:
# Load documents and add metadata
# Function to load and process PDFs

def process_pdfs(file_paths):
    document_collection = []
    
    for path in file_paths:
        try:
            loader = UnstructuredPDFLoader(file_path=path)
            documents = loader.load()
            print(f"Loaded PDF: {path}")

            # Add metadata to each document
            # Metadata identifying which program a document belongs to
            # Easier to filter results by program when retrieving or answering queries
            for doc in documents:
                
                # Create metadata dictionary for each document
                doc.metadata = {
                    "program": path.split(".")[0],
                    "program_name": program_name_mapper(path),
                    "source": "Program PDF"
                }
                document_collection.append(doc)
        except Exception as e:
            print(f"Error processing {path}: {e}")
    return document_collection

In [6]:
# Load the documents
documents = process_pdfs(pdf_files)

Loaded PDF: Information_Security.pdf
Loaded PDF: Data_Science_Machine_Learning.pdf
Loaded PDF: Application_Development_Delivery.pdf


In [7]:
# Preview the first 3 documents
for doc in documents[:3]:
    print(doc.metadata)

{'program': 'Information_Security', 'program_name': 'Information Security Program', 'source': 'Program PDF'}
{'program': 'Data_Science_Machine_Learning', 'program_name': 'Data Science & Machine Learning Program', 'source': 'Program PDF'}
{'program': 'Application_Development_Delivery', 'program_name': 'Application Development & Delivery Program', 'source': 'Program PDF'}


In [8]:
# Preview the text content of the first document (first 400 characters)
print(documents[0].page_content[:400])

FULL-TIME | WINNIPEG LOCATIONS

1: Information Security

Overview



Two-year advanced diploma

Fall entry date

Classes take place between 8 am and 6 pm

Classes are held on campus on Mondays, Tuesdays, and Wednesday mornings and online

on Wednesday afternoons, Thursdays, and Fridays

Exchange District Campus (formerly Princess Street Campus), Winnipeg

Co-op work experience



International app


---
**Using multiple PDFs instead of one**<br>
Keeping a separate PDF for each program helps the chatbot avoid mixing up information between programs.

- Allows for clearer segmentation of the content
- Less risk of confusing similar content across programs
- Easier to query specific content without possible overlap (similar terms used across programs, fees, dates, courses, year, etc.)

---
### 2. Splitting and Chunking Text

**RecursiveCharacterTextSplitter**
- Splits long documents into smaller, coherent pieces and tries not to break chunks at awkward places (e.g., in the middle of a sentence).
- Splitting is needed because a full document may be too large for the LLM to process at once.
- chunk_size is the maximum number of characters in each chunk.
- chunk_overlap is the number of characters each chunk shares with the previous chunk.
- This overlap helps ensure that information falling on a chunk boundary appears in both chunks.
- Larger chunks improve context retention, because the LLM receives more surrounding text with each retrieved chunk.
- Larger chunks also mean fewer chunks to embed, which reduces embedding computation and can make retrieval more efficient.
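
To make chunk_size and chunk_overlap concrete, here is a deliberately simplified sketch that cuts a string into fixed-width windows. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and sentences); the function name `sliding_chunks` and the sample string are made up for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, which is the role the 200-character overlap plays in the notebook's splitter.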

In [9]:
# Split the documents into manageable chunks

# Use a larger chunk size so that each chunk carries more content
# Overlap of 200 characters to avoid losing key details at chunk boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=200
)

doc_chunks = splitter.split_documents(documents)
print(f"Total document chunks created: {len(doc_chunks)}")

Total document chunks created: 44


#### NOTES
**Attempt 4**<br>
Reduce redundancy in the embeddings
- Ensure each embedded chunk has enough unique content. If the overlap is too large, the model can get confused because adjacent chunks are very similar. If the overlap is too small, chunks share too little context and a chunk may stop in the middle of a sentence.
- Chunk size: larger chunks include more content, while smaller chunks allow for more precise embeddings.
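
The trade-off can be put into rough numbers. The sketch below assumes fixed-width windows, so it only approximates what the real splitter produces, and the 10,000-character document length is an arbitrary example value. Both helper names are hypothetical.

```python
import math


def overlap_share(chunk_size, chunk_overlap):
    """Fraction of a full chunk that repeats text from the previous chunk."""
    return chunk_overlap / chunk_size


def estimated_chunks(total_chars, chunk_size, chunk_overlap):
    """Approximate number of fixed-width chunks needed to cover total_chars."""
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((total_chars - chunk_size) / step)


print(overlap_share(1000, 200))            # 0.2
print(estimated_chunks(10000, 1000, 200))  # 13
print(estimated_chunks(10000, 1000, 500))  # 19
```

Under these assumptions, raising the overlap from 200 to 500 characters increases the chunk count from 13 to 19 for the same text, and half of every chunk then duplicates its neighbour.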

---
### 3. Embeddings and Storing in Vector Database

- Ollama is called from this Jupyter Notebook.
- Each chunk is sent to the locally running Ollama service, where the embedding model nomic-embed-text converts it into a vector.
- The vector database (vector_storage in the code below) is created with Chroma and stores the embeddings.
- This allows an efficient search for relevant information when a user submits a query.
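
Conceptually, searching a vector database means comparing the query's vector with every stored vector and returning the closest matches. The toy sketch below uses cosine similarity, which is one common measure; the metric Chroma actually applies depends on how the collection is configured. The three-dimensional vectors and labels are invented for illustration, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """store is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up vectors standing in for embedded chunks
toy_store = [
    ("tuition and fees", [0.9, 0.1, 0.0]),
    ("course list", [0.1, 0.9, 0.1]),
    ("co-op work experience", [0.0, 0.2, 0.9]),
]

print(top_k([0.8, 0.2, 0.1], toy_store, k=1))  # ['tuition and fees']
```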

In [11]:
# Create vector database
# Embed and store chunks into vector database where it will be later retrieved
vector_storage = Chroma.from_documents(
    documents=doc_chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="program-advisor-db"
)

print("Vector database created successfully.")

Vector database created successfully.


---
### 4. Set up LLM and Retrieval

In [12]:
# Initialize the chat model (llama2:7b, served by the local Ollama instance)
chat_model = ChatOllama(
    model="llama2:7b", 
    base_url="http://localhost:12345"
)

The code above connects to port 12345. Ollama's default port is 11434, so if a connection error occurs, change the base_url to "http://localhost:11434".

**LLM Selection and Impact on Performance**
- Original model: llama2:7b
- Alternative model: llama3.2:1b
- The original model has 7 billion parameters, while the alternative has 1 billion. The smaller model responds faster, at the cost of response quality and accuracy.
- Final chatbot LLM selection: llama2:7b, since its responses were more contextually accurate.

The **QUERY_PROMPT template** is the prompt that sets up the LLM to generate multiple versions of the user's query.
- This step does not embed the user query yet. It only generates alternative phrasings of the query.

In [13]:
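# Ask the LLM for alternative phrasings of the question, one per line
template = PromptTemplate(
    input_variables=["question"],
    template="Write three alternative versions of the question below, one per line, "
             "to help retrieve relevant documents.\nQuestion: {question}"
)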

Instead of just using the original query as is, the **MultiQueryRetriever** generates multiple different versions of the query. All versions of the query are sent to the vector database to retrieve matches to any of the versions of the query.

MultiQueryRetriever is configured with:
- vector_storage.as_retriever() - retrieves from the vector database
- chat_model - the large language model that generates the different versions of the user query
- prompt=template - the prompt template used to generate those versions

The generated versions are not embedded when they are created. Once the LLM has produced them, each one is embedded with the same OllamaEmbeddings(model="nomic-embed-text") model that was used earlier for the PDF chunks. This happens automatically inside MultiQueryRetriever, so no separate code is needed to embed the user query.
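
The idea behind multi-query retrieval can be sketched without LangChain: rephrase the question, search once per phrasing, and merge the hits without duplicates. This is an illustration of the general technique, not LangChain's implementation. `fake_variants` and `fake_search` are stand-ins for the LLM and the vector database, and their contents are invented.

```python
def multi_query_retrieve(question, generate_variants, search):
    """Search with each phrasing of the question and return unique hits in order."""
    seen = set()
    merged = []
    for variant in generate_variants(question):
        for chunk in search(variant):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


def fake_variants(question):
    """Stand-in for the LLM: rephrase by swapping in related words."""
    return [
        question,
        question.replace("cost", "tuition"),
        question.replace("cost", "fees"),
    ]


def fake_search(query):
    """Stand-in for the vector database: keyword lookup over a tiny index."""
    index = {
        "cost": ["chunk A"],
        "tuition": ["chunk A", "chunk B"],
        "fees": ["chunk C"],
    }
    return [chunk for key, chunks in index.items() if key in query for chunk in chunks]


print(multi_query_retrieve("What does the program cost?", fake_variants, fake_search))
# ['chunk A', 'chunk B', 'chunk C']
```

The original phrasing alone finds only chunk A. The rephrasings surface chunks B and C, which is the benefit the retriever is meant to provide.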

In [15]:
# Create MultiQueryRetriever
retriever = MultiQueryRetriever.from_llm(
    vector_storage.as_retriever(), 
    chat_model,
    prompt=template
)

In [17]:
# Improved prompt template with placeholders for program
response_prompt = PromptTemplate(
    input_variables=["program_name", "context", "user_query"],
    template="""
    You are an academic advisor. Answer the user's query concisely based only on the provided context.

    Program: {program_name}
    Context: {context}
    Query: {user_query}

    Response:
    """
)

---
### 5. Processing User Queries & Prompts

In [19]:
def fetch_context_by_program(query, program_filter):
    # MultiQueryRetriever expects the user's question as a plain string, not a dict
    results = retriever.get_relevant_documents(query)

    # Keep only chunks whose "program" metadata matches the filter, then combine them
    context = "\n".join([
        doc.page_content for doc in results if program_filter in doc.metadata.get("program", "")
    ])
    return context if context else "No specific information available."

In [25]:
from langchain.schema import SystemMessage, HumanMessage

def generate_response(query, program_filter):
    # Fetch context and program name
    context = fetch_context_by_program(query, program_filter)
    if not context.strip():
        return "No specific information is available in the provided documents for this query."

    program_name = program_name_mapper(program_filter)

    # Prepare structured messages
    messages = [
        SystemMessage(content="You are an academic advisor."),
        HumanMessage(content=f"Program: {program_name}\nQuery: {query}\nContext: {context}")
    ]

    try:
        # Generate response
        response = chat_model(messages, max_tokens=150)
        return response.content.strip()
    except Exception as e:
        return "An error occurred while generating the response. Please try again."


### 6. Chatbot interactive function

In [None]:
# Interactive chatbot loop for dynamic queries
def chatbot_interactive():
    print("Welcome to the Academic Advisor Chatbot!")
    print("Type 'exit' to quit the chatbot.")
    while True:
        try:
            user_query = input("Enter your question: ")
            if user_query.lower() in ["exit", "quit"]:
                print("Goodbye!")
                break
            program_filter = input("Specify the program (e.g., Data_Science_Machine_Learning): ").strip()
            if not program_filter:
                print("Please specify a valid program.")
                continue
            response = generate_response(user_query, program_filter)
            print(f"\nChatbot Response:\n{response}\n")
        except Exception as e:
            print(f"An error occurred: {e}")

# Run the chatbot
if __name__ == "__main__":
    chatbot_interactive()

Welcome to the Academic Advisor Chatbot!
Type 'exit' to quit the chatbot.


Enter your question:  tell me about the data science and machine learning program
Specify the program (e.g., Data_Science_Machine_Learning):  data science and machine learning



Chatbot Response:
Hello there! I'm happy to help you with any questions you may have about our data science and machine learning program. Unfortunately, I don't have access to specific information about the program as it is not provided in the context of our conversation. Could you please provide more details or context about the program you are interested in? This will help me better understand your question and provide a more accurate answer.



Enter your question:  give me an overview on data science and machine learning program
Specify the program (e.g., Data_Science_Machine_Learning):  data science and machine learning



Chatbot Response:
As an academic advisor, I must provide you with a comprehensive overview of the Data Science and Machine Learning (DSML) program, highlighting its key components, application areas, and potential career paths. However, since no specific information is available, I will outline a general framework for the program. Please note that the details may vary depending on the institution offering the program.

Data Science and Machine Learning Program:

1. Coursework: The DSML program typically includes courses in data mining, machine learning, statistical modeling, computer programming, and data visualization. Students will learn to collect, analyze, and interpret large datasets, as well as develop algorithms and models to solve complex problems.
2. Core Courses:
a. Data Mining and Machine Learning Fundamentals: Introduction to data mining and machine learning techniques, including supervised and unsupervised learning, deep learning, and neural networks.
b. Statistical Model
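
In the session above, the program was entered as "data science and machine learning", while the metadata key is "Data_Science_Machine_Learning". Because the filter in fetch_context_by_program is a case-sensitive substring check, no chunk matches, and the model answers without any context. One possible remedy, sketched here as an assumption rather than as part of the original notebook, is to map free-form input onto the known keys before filtering. `resolve_program` and `PROGRAM_KEYS` are hypothetical names.

```python
import re

# Keys as they appear in the "program" metadata field
PROGRAM_KEYS = [
    "Information_Security",
    "Data_Science_Machine_Learning",
    "Application_Development_Delivery",
]

# Filler words ignored when comparing user input with a key
IGNORED_WORDS = {"and", "program", "the"}


def _tokens(text):
    """Lowercase words in text, ignoring punctuation, underscores and filler words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {word for word in words if word not in IGNORED_WORDS}


def resolve_program(user_input, keys=PROGRAM_KEYS):
    """Return the first key whose words all appear in the user's input, else None."""
    typed = _tokens(user_input)
    for key in keys:
        key_words = _tokens(key)
        if key_words and key_words <= typed:
            return key
    return None


print(resolve_program("data science and machine learning"))
# Data_Science_Machine_Learning
```

The chatbot loop could call `resolve_program` on the typed program and ask again when it returns None, instead of passing unmatched text to the filter.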

<h1 align=center >End </h1>
