
# 🎓 MSBA Program Chatbot with RAG Pipeline

This project builds a domain-specific AI chatbot using **Retrieval-Augmented Generation (RAG)** to answer questions about the Master of Science in Business Analytics (MSBA) program at UMass Lowell.

It was developed as a submission for the **Google GenAI Intensive Capstone Project**.


## 📦 Step 1: Install Required Libraries

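We begin by installing the required libraries:
- `PyMuPDF` for PDF text extraction
- `sentence-transformers` for text embeddings
- `faiss-cpu` for fast similarity search
- `transformers`, `peft`, `accelerate`, and `bitsandbytes` for loading (and optionally adapting) the generation model
- `ipywidgets` for the chat interface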


In [None]:
!pip install -q PyMuPDF faiss-cpu sentence-transformers transformers peft accelerate bitsandbytes ipywidgets

## 📚 Step 2: Import Libraries

Next, we import standard libraries like Pandas and NumPy, NLP tools like NLTK, embedding tools from Sentence Transformers, and the Hugging Face Transformers classes for sequence-to-sequence generation, plus PEFT for the optional LoRA adapter.


In [None]:
# Imports
import os
import fitz  # PyMuPDF
import faiss
import torch
import json
import numpy as np
import pandas as pd
import re
import nltk
import ipywidgets as widgets
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

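nltk.download('punkt')
# Newer NLTK releases load the sentence tokenizer from 'punkt_tab' instead
nltk.download('punkt_tab')

import warnings
warnings.filterwarnings('ignore')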

In [None]:
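# Load the flan-t5-base tokenizer and model (used by the direct model.generate()
# variant of generate_response_local; the pipeline in Step 5 loads its own model)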
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()

# Load encoder
encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

## 🔍 Step 3: Chunk, Embed, and Index Program Data

We load two sources:
- MSBA Course Handbook (PDF)
- Scraped MSBA content from the university website

Then we:
1. Clean and chunk the content into readable units
2. Generate semantic embeddings using `all-mpnet-base-v2`
3. Store them in a FAISS vector index for fast nearest-neighbor search
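
To make step 3 concrete, here is a minimal sketch of what an exact (flat) L2 index computes, written in plain NumPy rather than FAISS. The function name `flat_l2_search` and the toy 2-D vectors are illustrative assumptions, not part of the notebook; real embeddings from `all-mpnet-base-v2` have 768 dimensions.

```python
import numpy as np

def flat_l2_search(vectors, query, top_k=3):
    """Exact nearest-neighbor search by squared L2 distance (brute force)."""
    vectors = np.asarray(vectors, dtype="float32")
    query = np.asarray(query, dtype="float32")
    # Squared Euclidean distance from the query to every stored vector
    distances = ((vectors - query) ** 2).sum(axis=1)
    # Indices of the top_k smallest distances, closest first
    order = np.argsort(distances, kind="stable")[:top_k]
    return distances[order], order

# Toy 2-D vectors standing in for real chunk embeddings (hypothetical data)
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_distances, toy_ids = flat_l2_search(toy_vectors, [0.9, 0.1], top_k=2)
print(toy_ids, toy_distances)
```

A flat index compares the query against every stored vector, which is simple and exact. For a corpus of a few thousand chunks this is usually fast enough.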


In [None]:

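def clean_text(text):
    """Strip HTML tags and collapse whitespace; non-string input yields ""."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r'<[^>]*>', '', text)       # drop HTML tags
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    # Punctuation is kept on purpose: sent_tokenize in chunk_text needs sentence
    # boundaries, and course codes such as MIST.6030 rely on the period.
    return text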

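def chunk_text(text, max_chunk_length=500, overlap=50):
    """Greedily pack whole sentences into chunks of up to max_chunk_length characters.

    Note: `overlap` is accepted but not applied; consecutive chunks share no text.
    A single sentence longer than max_chunk_length becomes its own oversized chunk.
    """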
    if not text:
        return []
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1 <= max_chunk_length:
            current_chunk += sentence + " "
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
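
Chunk overlap helps when an answer straddles a chunk boundary. The sketch below is one possible way to add it, carrying whole sentences over instead of a character count. The function name `chunk_sentences_with_overlap`, the `overlap_sentences` parameter, and the regex sentence splitter (used here so the example runs without NLTK data) are all assumptions for illustration, not the notebook's method.

```python
import re

def chunk_sentences_with_overlap(text, max_chunk_length=500, overlap_sentences=1):
    """Pack sentences into chunks of roughly max_chunk_length characters,
    repeating the last overlap_sentences sentences at the start of the next chunk."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chunk_length:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

demo_chunks = chunk_sentences_with_overlap(
    "Alpha one. Beta two. Gamma three. Delta four.",
    max_chunk_length=22,
    overlap_sentences=1,
)
for chunk in demo_chunks:
    print(repr(chunk))
```

Because the carried-over sentence counts toward the next chunk, a chunk can slightly exceed `max_chunk_length`. A stricter variant would need to drop the overlap when it does not fit.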


In [None]:

def extract_text_from_pdf(pdf_path):
    """Return the concatenated plain text of every page in the PDF."""
    with fitz.open(pdf_path) as doc:  # closes the file handle when done
        return "".join(page.get_text() for page in doc)

pdf_path = "/kaggle/input/msba-handbook-2024/MSBA Handbook 2024-2024.pdf"  
pdf_text = extract_text_from_pdf(pdf_path)
cleaned_pdf_text = clean_text(pdf_text)
pdf_chunks = chunk_text(cleaned_pdf_text)
pdf_df = pd.DataFrame({'text_chunk': pdf_chunks})
pdf_df['source'] = 'course_catalog'


In [None]:
# Load only the text from the scraped data (ignore the stored embeddings)
scraped_raw = pd.read_csv("/kaggle/input/scraped-data-embeddings/raw_chunks_with_embeddings_scraped_data.csv")
scraped_df = scraped_raw[['text_chunk']].copy()
# Fill missing values before casting; astype(str) first would turn NaN into the string 'nan'
scraped_df['text_chunk'] = scraped_df['text_chunk'].fillna('').astype(str)
scraped_df['source'] = 'scraped'

In [None]:
# Combine all text chunks for embedding
combined_df = pd.concat([scraped_df, pdf_df], ignore_index=True)


In [None]:
# Generate embeddings and FAISS index
embeddings = encoder.encode(combined_df['text_chunk'].tolist(), show_progress_bar=True)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(np.array(embeddings).astype("float32"))
metadata_df = combined_df.copy()
metadata_df['embedding'] = embeddings.tolist()


## 🤖 Step 4: Retrieval and Context Builder

When a user asks a question, we:
1. Embed the query into a vector
2. Use FAISS to retrieve semantically similar chunks
3. Combine the top matches into a context window for the answer model


In [None]:
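# Embedding & Search
def get_embedding(text):
    """Embed a single string with the sentence encoder."""
    return encoder.encode([text])[0]


def search_faiss(query, top_k=5):
    """Return the row indices of the top_k chunks nearest to the query."""
    query_vector = get_embedding(query).astype("float32")
    _, indices = index.search(np.array([query_vector]), top_k)
    # FAISS pads with -1 when the index holds fewer than top_k vectors
    return [int(i) for i in indices[0] if i != -1]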




"""# cosine similarity HELPER function
from sklearn.metrics.pairwise import cosine_similarity

def rerank_by_cosine(query_embedding, chunk_embeddings, top_k=5):
    similarities = cosine_similarity([query_embedding], chunk_embeddings)[0]
    top_indices = similarities.argsort()[-top_k:][::-1]
    return top_indices

"""


In [None]:

# Truncated Context Retrieval

def get_context_from_indices(indices, max_chars=2000):
    texts = metadata_df.iloc[indices]['text_chunk']
    top_clean = []
    for t in texts:
        if 30 < len(t) < 555:  # keep short-medium useful chunks
            top_clean.append(t.strip())
        if len(" ".join(top_clean)) > max_chars:
            break
    return "\n".join(top_clean)[:max_chars]


"""
def get_context_from_indices(indices, max_chars=1500):
    texts = metadata_df.iloc[indices]['text_chunk']
    combined = "\n---\n".join(texts)
    return combined[:max_chars]
"""



"""
# Load fine-tuned LoRA adapter on top of flan-t5-base
base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = PeftModel.from_pretrained(base_model, "./peft-msba-flant5")
model.eval()


"""


EXAMPLES = """\
Q: How many credits are required to complete the MSBA program?
A: The MSBA program requires 30 credits to complete.

Q: What are the core courses in the MSBA program?
A: The MSBA core courses include MIST.6030, MIST.6060, MIST.6150, POMS.6120, POMS.6220, POMS.6240, and either the Capstone (MIST.6490) or Internship (MIST.6890).

Q: Can I specialize in something within the MSBA?
A: Yes, the MSBA program offers six specialization tracks, including Accounting Analytics, Big Data Analytics, Finance Analytics, Healthcare Business Analytics, Managerial Decision Making, and Marketing Analytics.

Q: How long does it take to finish the MSBA program?
A: Students typically complete the MSBA program in 18 months to 3 years, depending on course load.

Q: Is there a capstone requirement in the MSBA program?
A: Yes, students must complete either a capstone project (MIST.6490) or an internship (MIST.6890) as part of the MSBA core.

Q: What is MIST.6890?
A: MIST.6890 is a 3-credit internship course where students apply analytics skills in a real-world job setting and complete a reflective paper.

Q: Are there prerequisites for the MSBA program?
A: Yes, students must complete introductory courses in Statistics and Management Information Systems, either before or during their first semester.

Q: Is the MSBA program flexible for working students?
A: Yes, the program is flexible and offers part-time study, semester-based entry, and both online and on-campus formats.

Q: Is the MSBA program STEM-designated?
A: Yes, the MSBA program is STEM-designated, making international students eligible for a 24-month OPT extension.

Q: What is the cost of online courses in the MSBA program?
A: Each 3-credit online course in the MSBA program costs $1,965, plus a $30 semester fee.
"""


## 🧠 Step 5: Natural Language Answer Generator

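We generate a full-sentence answer with a FLAN-T5 model. Note that the cell for this step loads `google/flan-t5-large` through the `text2text-generation` pipeline; the `flan-t5-base` model loaded in Step 2 is used only by the alternative implementation kept as a string at the end of the cell. The prompt includes:
- Ten few-shot Q&A examples (the `EXAMPLES` string defined above)
- Retrieved program context
- The user’s current question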
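
As a minimal sketch of how these three parts can be assembled, the helper below labels each section and caps the context length so the question always stays at the end of the prompt. The name `build_prompt`, the section labels, and the `max_context_chars` default are assumptions for illustration; the notebook's own prompt is built inside `generate_response_local`.

```python
def build_prompt(question, context, examples, max_context_chars=1000):
    """Assemble instruction, few-shot examples, retrieved context, and the question."""
    # Trim the context (not the question) when the retrieved text is long
    trimmed_context = context.strip()[:max_context_chars]
    parts = [
        "You are an MSBA program assistant. Answer in one or two full sentences.",
        "Examples:\n" + examples.strip(),
        "Context:\n" + trimmed_context,
        f"Q: {question.strip()}\nA:",
    ]
    return "\n\n".join(parts)

demo_prompt = build_prompt(
    question="Is the MSBA program STEM-designated?",
    context="(retrieved chunks would go here)",
    examples=(
        "Q: How many credits are required to complete the MSBA program?\n"
        "A: The MSBA program requires 30 credits to complete."
    ),
)
print(demo_prompt)
```

Ending the prompt with `A:` mirrors the format of the few-shot examples, which nudges the model to continue with an answer in the same style.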


In [None]:
from transformers import pipeline
generator = pipeline("text2text-generation", model="google/flan-t5-large")


def generate_response_local(query, context):
    prompt = (
        "You are an MSBA program assistant. Based on the context and examples below, "
        "answer clearly in one or two full sentences.\n\n"
        "Here are some example questions and answers:\n"
        f"{EXAMPLES}\n"
        "Here is some relevant program information:\n"
        f"{context}\n\n"
        f"Q: {query}\n"
        "A:"
    )
    # Greedy decoding keeps factual answers reproducible; one or two sentences
    # need far fewer than 800 new tokens.
    output = generator(prompt, max_new_tokens=128, do_sample=False)
    answer = output[0]['generated_text'].strip()

    # Postprocess: add a fallback note if the answer is too short
    if len(answer.split()) < 4:
        answer += " (Sorry, the context might be too limited. Try rephrasing your question.)"

    return answer
"""
def generate_response_local(query, context):
    prompt = (
        f"You are an MSBA program assistant. Based on the context below, answer clearly in one or two full sentences.\n\n"
        f"{context}\n\n"
        f"Question: {query}\n"
        f"Answer:"
    )
    input_ids = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).input_ids

    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)

    return tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
"""


## 💬 Step 6: Chat Interface (Interactive Bot)

We build a persistent chatbot UI using `ipywidgets`. 
This interface:
- Keeps the full chat history visible
- Lets the user ask further questions (each one is answered independently; earlier turns are not passed to the model)
- Refreshes continuously without clearing previous messages
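
One way to think about a persistent history is as an append-only list of turns that is re-rendered after every question. The sketch below is plain Python so it runs without a notebook front end. The `ChatHistory` class is a hypothetical helper, not part of `ipywidgets` or of this notebook.

```python
class ChatHistory:
    """Hypothetical helper: stores (question, answer) turns and renders a transcript."""

    def __init__(self):
        self.turns = []

    def add(self, question, answer):
        # Append only; earlier turns are never removed
        self.turns.append((question, answer))

    def render(self):
        lines = []
        for question, answer in self.turns:
            lines.append(f"You: {question}")
            lines.append(f"Bot: {answer}")
        return "\n".join(lines)

history = ChatHistory()
history.add("How many credits are required?", "(answer from the generator)")
history.add("Is there a capstone?", "(answer from the generator)")
transcript = history.render()
print(transcript)
```

In a widget-based UI, the rendered transcript could be printed into the output area after each submission, so every earlier exchange stays visible.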


In [None]:
import ipywidgets as widgets
from IPython.display import display

# Text input
input_box = widgets.Text(
    value='',
    placeholder='Ask something about the MSBA program...',
    description='You:',
    layout=widgets.Layout(width='100%')
)

# Submit button
submit_button = widgets.Button(description="Ask Bot 🤖", button_style='primary')
output_box = widgets.Output()

# Chatbot logic
def on_submit(_):
    # Output is appended rather than cleared, so earlier turns stay visible
    with output_box:
        user_query = input_box.value.strip()
        if not user_query:
            print("⚠️ Please enter a question.")
            return

        print("You asked:", user_query)
        indices = search_faiss(user_query)
        context = get_context_from_indices(indices)
        answer = generate_response_local(user_query, context)
        print("Bot says:", answer)

# Hook up the button to the function
submit_button.on_click(on_submit)

# Display UI
display(input_box, submit_button, output_box)



## ✅ Submission Summary

This notebook showcases a working RAG chatbot built entirely with open-source tools; no model training is required. The solution uses:
- Semantically indexed program data (from PDFs and scraped pages)
- Prompt-level few-shot learning (via handcrafted examples)
- A minimal and persistent notebook UI

This project was submitted as part of the **Google GenAI Intensive Capstone Project**.


In [None]:
metadata_df[['text_chunk']].to_csv("metadata.csv", index=False)
np.save("embeddings.npy", np.array(metadata_df['embedding'].tolist()))
