# AI-Enhanced Self-Service Portal for Student Affairs: A Proof of Concept

**Course:** CSCN 8010 – Foundations of Machine Learning  
**Project:** AI-Enhanced Self-Service Portal for Student Affairs  
**Team:**

| Name | ID |
|---|---|
| Aiswarya Thekkuveettil Thazhath | 8993970 |
| Vishnu Sivaraj | 9025320 |
| Rohit Iyer | 8993045 |
| Cemil Caglar Yapici | 9081058 |
 

---


## 1. Introduction

Student Success Advisors (SSAs) at Conestoga College currently experience a high volume of repetitive, low-complexity inquiries from students.
Due to recent staffing reductions, SSAs face increasing workload pressure, leading to slower response times and inefficiencies.

This project presents a Proof of Concept (PoC) for an AI-enhanced self-service chatbot capable of:

- Answering FAQs across multiple Student Affairs domains
- Routing complex or serious queries to human advisors ("off-ramp")
- Supporting multilingual responses
- Reducing SSA workload by automating high-volume inquiries

This notebook details the full methodology, data pipeline, model development, evaluation, and final system design.

In [None]:
# 2. Environment Setup & Imports

import os
import sys
from pathlib import Path

import pandas as pd
import numpy as np

# Add project root to Python path so that `src.*` imports work
PROJECT_ROOT = Path.cwd()  # adjust if running notebook from a different directory
print("Project root:", PROJECT_ROOT)

src_path = PROJECT_ROOT / "src"
if src_path.exists():
    sys.path.append(str(PROJECT_ROOT))  # so `from src.x import y` works
else:
    print("⚠️ WARNING: src/ folder not found. Adjust PROJECT_ROOT as needed.")

# Core project modules
from src import retrieval_service
from src import intent_classifier


## 2. Problem Statement & Objectives

### 2.1 Problem Statement

Students frequently ask questions related to:

- Fees and due dates
- Orientation
- Student rights and responsibilities
- Registrar processes
- Career services

Many of these questions are repetitive and answerable via existing web pages, but students often bypass the documentation and message SSAs directly.

### 2.2 Objectives

- Build a centralized Knowledge Base (KB) from official college sources.
- Develop an information retrieval system that returns accurate answers from the KB.
- Train a machine learning intent classifier to categorize queries.
- Implement a crisis detection mechanism for high-risk messages (e.g., mental health concerns).
- Build a Streamlit-based chatbot UI for live interaction.
- Provide an academic evaluation and discuss limitations.

In [None]:
# 3. Raw Data Overview

data_dir = PROJECT_ROOT / "data"

raw_files = [
    "career_centre_faq.csv",
    "orientation_faq.csv",
    "student_fees_faq_winter_2024.csv",
    "student_rights_faq.csv",
    "winter_2024_registrar_faq.csv",
    "success_portal_resources.csv",
]

for fname in raw_files:
    fpath = data_dir / fname
    if fpath.exists():
        df = pd.read_csv(fpath)
        print(f"\n=== {fname} ===")
        print(df.head(3))
        print("Rows:", len(df))
    else:
        print(f"⚠️ {fname} not found in {data_dir}")


## 3. Data Sources

The following official college data sources were scraped:

| Source | Description |
|---|---|
| Orientation FAQ | New student orientation questions |
| Student Rights FAQ | Policies, complaints, appeals |
| Student Fees FAQ (Winter 2024) | Tuition deadlines, refunds, OSAP |
| Registrar FAQ | Academic records, transcripts, scheduling |
| Career Centre FAQ | Employment and career services |
| Success Portal | Academic support and resources |

Each dataset is stored as a CSV with the following fields:

- `question`
- `answer`
- `source_url`
- `source_type` (added during cleaning)

## 4. Data Collection (Web Scraping)

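Python scripts using `requests` and `BeautifulSoup` collected structured Q&A data. On the scraped pages, each FAQ entry is a `<details>` element: its `<summary>` holds the question and its `<p>` children hold the answer.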

In [None]:
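import requests
from bs4 import BeautifulSoup

def scrape_faq(url, headers=None):
    """Return (question, answer) pairs from <details>/<summary> FAQ blocks."""
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    pairs = []
    for entry in soup.find_all("details"):
        question = entry.find("summary").get_text(strip=True)
        answer = " ".join(p.get_text(strip=True) for p in entry.find_all("p"))
        pairs.append((question, answer))
    return pairs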


### Ethical Considerations

- Only publicly available educational FAQ pages were scraped.
- No authentication-protected data was accessed.
- No personal identifiers were collected.

## 5. Data Cleaning & Knowledge Base Construction

All scraped CSVs were combined into a unified knowledge base using `cleanProcess.ipynb`.

### 5.1 Cleaning Steps

- Remove duplicate questions
- Normalize whitespace
- Standardize formatting
- Add `source_type` labels
- Verify each row has a valid question and answer
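
The cleaning steps above could be sketched in pandas roughly as follows. This is an illustration under stated assumptions, not the contents of `cleanProcess.ipynb`: the `clean_faq` helper, the case-insensitive duplicate check, and the sample rows (including the `example.org` URL) are hypothetical.

```python
import pandas as pd

def clean_faq(df: pd.DataFrame, source_type: str) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning steps listed above."""
    out = df.copy()
    for col in ("question", "answer"):
        # Normalize whitespace: collapse runs of spaces/newlines and trim the ends.
        out[col] = (
            out[col].fillna("").astype(str)
            .str.replace(r"\s+", " ", regex=True)
            .str.strip()
        )
    # Keep only rows with a non-empty question and answer.
    out = out[(out["question"] != "") & (out["answer"] != "")]
    # Remove duplicate questions, ignoring case (an assumption of this sketch).
    out = out[~out["question"].str.lower().duplicated()]
    # Label every row with the dataset it came from.
    return out.assign(source_type=source_type).reset_index(drop=True)

# Made-up rows: one duplicate, one blank question, one missing answer.
sample = pd.DataFrame({
    "question": ["How do I  pay fees?", "how do i pay fees?", "  ", "When is orientation?"],
    "answer": ["Use the student portal.", "Use the portal.", "n/a", None],
    "source_url": ["https://example.org/fees"] * 4,
})
cleaned = clean_faq(sample, source_type="fees")
print(cleaned)
```

Only the first row survives here: the second is a duplicate question, the third has no question, and the fourth has no answer.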

### 5.2 Final Knowledge Base

The cleaned data is saved as `student_affairs_knowledge_base.csv` with the columns `question`, `answer`, `source_url`, and `source_type`.

This knowledge base serves as the foundation for the retrieval model.

In [None]:
# 5. Load Unified Knowledge Base

kb_path = data_dir / "student_affairs_knowledge_base.csv"
kb = pd.read_csv(kb_path)

print("Knowledge Base shape:", kb.shape)
kb.head(5)


## 5. TF-IDF Retrieval Index

To support fast lookup, we built a **TF-IDF (Term Frequency–Inverse Document Frequency)** index over the questions in the knowledge base.

**Why TF-IDF?**

- Simple and efficient  
- Works well for FAQ-style queries  
- Provides interpretable similarity scores  
- Serves as a strong baseline for information retrieval  

We use:

- `TfidfVectorizer` from scikit-learn  
- Unigrams + bigrams (`ngram_range=(1, 2)`)  
- English stopword removal  
- Cosine similarity for matching
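
As a minimal sketch of how such an index answers a query, the example below fits a vectorizer with the settings listed above on three made-up questions and ranks them by cosine similarity. The toy questions and the `top_match` helper are assumptions for illustration; `max_df` is left at its default because the toy corpus is too small for it to be meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for KB questions (not taken from the real knowledge base).
questions = [
    "How do I pay my tuition fees?",
    "When does orientation start?",
    "How can I request a transcript?",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
kb_matrix = vectorizer.fit_transform(questions)

def top_match(query: str):
    """Return (index, score) of the KB question most similar to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, kb_matrix).ravel()
    best = int(scores.argmax())
    return best, float(scores[best])

idx, score = top_match("pay fees")
print(questions[idx], round(score, 3))
```

The query shares the terms "pay" and "fees" only with the first question, so that question ranks highest; the other two score zero.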


In [None]:
# 6. Building the TF-IDF Index (Offline Step)

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import sparse
import joblib

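def build_tfidf_index(kb_df, save_dir=PROJECT_ROOT / "models"):
    """Fit TF-IDF on the KB questions and save the vectorizer and matrix."""
    save_dir = Path(save_dir)  # also accept a plain string path
    save_dir.mkdir(parents=True, exist_ok=True)
    questions = kb_df["question"].fillna("").astype(str).tolist()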

    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),
        stop_words="english",
        max_df=0.8,
        min_df=1,
    )
    X = vectorizer.fit_transform(questions)

    vec_path = save_dir / "tfidf_vectorizer_v2.pkl"
    mat_path = save_dir / "kb_tfidf_matrix_v2.npz"

    joblib.dump(vectorizer, vec_path)
    sparse.save_npz(mat_path, X)

    print("Saved vectorizer →", vec_path)
    print("Saved matrix     →", mat_path)
    print("Shape:", X.shape)

# Uncomment to rebuild if needed:
# build_tfidf_index(kb)


## 6. Semantic Embeddings (HuggingFace)

To capture deeper semantic similarity, we used a pre-trained sentence embedding model:

- **Model:** `sentence-transformers/all-MiniLM-L6-v2`

This model encodes a sentence into a dense vector.  
We can then compute cosine similarity between queries and FAQ questions, which helps when wording differs significantly.

We combined TF-IDF similarity with embedding similarity to improve matching robustness.
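
One simple way to combine the two signals is a weighted average of the cosine similarities. The sketch below assumes this scheme; the `hybrid_scores` helper, the `alpha` weight, and the sample scores are hypothetical and may differ from what `retrieval_service` actually does.

```python
import numpy as np

def hybrid_scores(tfidf_sims, embed_sims, alpha=0.5):
    """Blend lexical and semantic similarity; alpha weights the TF-IDF part."""
    tfidf_sims = np.asarray(tfidf_sims, dtype=float)
    embed_sims = np.asarray(embed_sims, dtype=float)
    if tfidf_sims.shape != embed_sims.shape:
        raise ValueError("score arrays must have the same shape")
    return alpha * tfidf_sims + (1.0 - alpha) * embed_sims

# Made-up similarities of one query against three KB questions.
tfidf_sims = [0.10, 0.60, 0.00]
embed_sims = [0.80, 0.50, 0.20]

combined = hybrid_scores(tfidf_sims, embed_sims, alpha=0.5)
best_idx = int(np.argmax(combined))
print(combined, best_idx)
```

With equal weights, the second question wins because it scores reasonably on both signals, while the first is strong only on the embedding side.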


In [None]:
# 7. Sentence Embedding Example

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def encode_texts(texts):
    return embed_model.encode(texts, convert_to_numpy=True, show_progress_bar=False)

# Embed a small sample
sample_questions = kb["question"].head(5).tolist()
embeddings = encode_texts(sample_questions)

print("Embeddings shape:", embeddings.shape)


## 7. Intent Classification Model

Not every user message should be treated as a simple FAQ lookup.  
We trained an intent classifier to distinguish between:

- `small_talk` – greetings or casual conversation  
- `student_affairs` – in-scope questions (fees, services, policies, etc.)  
- `out_of_scope` – unrelated topics (e.g., weather, math problems)  
- `serious_issue` – potential crisis, self-harm, or mental health risks  

### 7.1 Training Data

We used `training_data.csv`, which contains text examples and their intent labels.  
The classifier is a neural network (defined in `trainingmodel_intent.ipynb`) and is exported as:

- `intent_classifier_best.pt` – PyTorch model weights  
- `intent_label_encoder.pkl` – maps integer indices back to label strings.
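
To show how the two exported artifacts fit together at inference time, the sketch below turns raw class scores (logits) into a label and a confidence. The logits are made up, the real model and encoder are not loaded, and the alphabetical class order is an assumption (it matches how a scikit-learn `LabelEncoder` would order these labels, if that is what the pickle contains).

```python
import numpy as np

# Assumed class order (alphabetical); the real order comes from the label encoder.
CLASSES = ["out_of_scope", "serious_issue", "small_talk", "student_affairs"]

def decode_prediction(logits):
    """Convert raw class scores into {'label': str, 'score': float}."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable softmax: subtract the max before exponentiating.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    idx = int(probs.argmax())
    return {"label": CLASSES[idx], "score": float(probs[idx])}

result = decode_prediction([0.2, -1.0, 0.1, 2.5])
print(result)
```

The returned dictionary uses the same `label` and `score` keys that the test cell reads from `classify_intent`.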


In [None]:
# 8. Testing the Intent Classifier

from src.intent_classifier import classify_intent

test_queries = [
    "hi",
    "how do I pay my fees?",
    "I want to book an appointment with a success advisor",
    "I feel like hurting myself",
    "what's the weather today?"
]

for q in test_queries:
    intent = classify_intent(q)
    print(f"Query: {q!r}")
    print(" → Intent:", intent.get("label"), "| Score:", intent.get("score"))
    print()


## 8. Retrieval & Answer Generation

The **retrieval_service** module implements the core logic for:

1. Loading the TF-IDF index and knowledge base  
2. Classifying intent  
3. Detecting crisis-related language (using emotion detection and keyword rules)  
4. Retrieving relevant FAQ entries  
5. Optionally calling an LLM for answer synthesis (OpenAI; disabled in our environment due to quota)
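
The keyword-rule part of crisis detection (step 3) can be sketched with a small set of regular expressions. This is an illustration only: the patterns and the `is_crisis` helper are hypothetical, the emotion-detection component is not shown, and a real phrase list would need review by qualified staff.

```python
import re

# Illustrative phrases only; not the project's actual rule set.
CRISIS_PATTERNS = [
    r"\bhurt(ing)? myself\b",
    r"\bsuicid(e|al)\b",
    r"\bself[- ]harm\b",
]
CRISIS_RE = re.compile("|".join(CRISIS_PATTERNS), flags=re.IGNORECASE)

def is_crisis(text: str) -> bool:
    """Return True if the message matches any crisis keyword pattern."""
    return CRISIS_RE.search(text) is not None

print(is_crisis("I feel like hurting myself"))
print(is_crisis("How do I pay my fees?"))
```

A rule like this is cheap to run on every message and can act as a safety net when the classifier's prediction is uncertain.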

At inference, the pipeline is:

1. **Intent** ← `classify_intent(query)`  
2. If `serious_issue` → return crisis response  
3. Else if `small_talk` → return greeting  
4. Else if `out_of_scope` → return “out of scope”  
5. Else → run TF-IDF + embeddings → retrieve best FAQ → return answer
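
The branching above can be sketched as a small dispatcher. This is a simplified illustration, not the code in `retrieval_service`: `route_query`, the stub classifier and retriever, the `mode` values, and the canned responses are all hypothetical.

```python
from typing import Callable, Dict

# Placeholder responses; the real wording lives in the project code.
CANNED = {
    "serious_issue": "crisis response",
    "small_talk": "greeting",
    "out_of_scope": "out of scope",
}

def route_query(
    query: str,
    classify: Callable[[str], Dict],
    retrieve: Callable[[str], str],
) -> Dict[str, str]:
    """Pick a response path from the predicted intent."""
    label = classify(query).get("label")
    if label in CANNED:
        return {"mode": label, "answer": CANNED[label]}
    # Everything else is treated as an in-scope FAQ lookup.
    return {"mode": "faq", "answer": retrieve(query)}

# Stubs standing in for the trained classifier and the retrieval step.
def fake_classify(q: str) -> Dict:
    if "hurt" in q.lower():
        return {"label": "serious_issue", "score": 0.99}
    return {"label": "student_affairs", "score": 0.90}

def fake_retrieve(q: str) -> str:
    return "best matching FAQ answer"

print(route_query("I feel like hurting myself", fake_classify, fake_retrieve))
print(route_query("How do I pay my fees?", fake_classify, fake_retrieve))
```

Passing the classifier and retriever in as functions keeps the routing logic testable without loading any models.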


In [None]:
# 9. End-to-End Answer Pipeline

from src.retrieval_service import load_resources, answer_query

# Ensure resources are loaded (TF-IDF + KB)
load_resources()

demo_queries = [
    "How do I pay my fees?",
    "Where can I find information about student rights?",
    "How do I contact the career centre?",
    "I feel depressed and don't know what to do."
]

for q in demo_queries:
    print("="*80)
    print("User:", q)
    result = answer_query(q)
    print("Mode:", result.get("mode", "N/A"))
    print("Answer:\n", result.get("answer"))
    if "matched_question" in result:
        print("\nMatched KB question:", result["matched_question"])
    print()


## 9. User Interface (Streamlit Chatbot)

The final user-facing interface is implemented with **Streamlit** in `app.py`.

Key features:

- Chat-style interface with separate user and bot bubbles  
- Language selectors (English, French, Hindi) using `deep-translator`  
- A right-hand sidebar for announcements and student news  
- A “typing…” placeholder while responses are computed  
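
The language selector could hand replies to `deep-translator` along the lines of the sketch below. The `translate_reply` helper, the `LANG_CODES` mapping, and the choice of `GoogleTranslator` as the backend are assumptions for illustration; the project's `translation.py` may be organized differently. A stand-in translator is used in the demonstration so the example runs offline.

```python
from typing import Callable, Optional

# Assumed mapping from UI language names to translation codes.
LANG_CODES = {"English": "en", "French": "fr", "Hindi": "hi"}

def translate_reply(
    text: str,
    language: str,
    translator: Optional[Callable[[str, str], str]] = None,
) -> str:
    """Translate an English reply into the selected UI language."""
    target = LANG_CODES[language]
    if target == "en":
        return text  # replies are already in English
    if translator is None:
        # Imported lazily so the function can be exercised without network access.
        from deep_translator import GoogleTranslator
        return GoogleTranslator(source="en", target=target).translate(text)
    return translator(text, target)

def fake_translator(text: str, target: str) -> str:
    """Stand-in that tags the text instead of translating it."""
    return f"[{target}] {text}"

print(translate_reply("Fees are due soon.", "French", translator=fake_translator))
print(translate_reply("Fees are due soon.", "English"))
```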

To run the app:

```bash
streamlit run app.py
```

## 10. System Architecture

The overall system architecture is:

1. **Data Layer**
   - Scraped FAQ CSVs
   - Unified KB: `student_affairs_knowledge_base.csv`

2. **Model Layer**
   - TF-IDF vectorizer + matrix
   - Sentence embedding model (HuggingFace)
   - Intent classifier (PyTorch)
   - Optional LLM (OpenAI) for synthesized answers

3. **Application Layer**
   - `retrieval_service.py`
   - `intent_classifier.py`
   - `llm_service.py`
   - `translation.py`

4. **Presentation Layer**
   - Streamlit app (`app.py`)

This layered design supports modularity and future extension (e.g., replacing the LLM, changing embeddings, or adding new FAQ sources).


## 11. Evaluation

### 11.1 Retrieval Quality

We manually evaluated system responses across multiple domains:

- Student fees
- Orientation
- Student rights
- Career centre
- Registrar

*[Table of example queries with expected and retrieved answers to be added.]*

### 11.2 Intent Classification

From the validation set of `training_data.csv`, we measured:

- Accuracy: *[fill in]*
- Precision / Recall / F1 for each class: *[fill in]*
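
The metrics listed above can be computed with scikit-learn once validation predictions are available. The sketch below uses toy labels purely to show the calls; the values it prints are NOT results from this project, and the variable names are hypothetical.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

LABELS = ["out_of_scope", "serious_issue", "small_talk", "student_affairs"]

# Toy labels for illustration only; these are not project results.
y_true = ["small_talk", "student_affairs", "student_affairs",
          "serious_issue", "out_of_scope", "student_affairs"]
y_pred = ["small_talk", "student_affairs", "out_of_scope",
          "serious_issue", "out_of_scope", "student_affairs"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=LABELS, zero_division=0
)

# One row per class, then the overall accuracy.
for name, p, r, f, n in zip(LABELS, precision, recall, f1, support):
    print(f"{name:16s} P={p:.2f} R={r:.2f} F1={f:.2f} n={n}")
print(f"accuracy={accuracy:.2f}")
```

For a safety-critical class such as `serious_issue`, recall is usually the number to watch, since a missed crisis message is costlier than a false alarm.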

### 11.3 Limitations

- The OpenAI API is limited by quota, so generative answers may fall back to defaults.
- Some complex questions still require human interpretation.
- The current system does not integrate with secure student records or personalized data.
- Multilingual support relies on machine translation and has not been fully evaluated for accuracy.
