### Import Libraries

In [2]:
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

from sentence_transformers import SentenceTransformer
import pickle

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device




device(type='cpu')

### Load SentenceTransformer

In this cell, we load a pre-trained SentenceTransformer model (all-MiniLM-L6-v2).
This model converts each text query (a sentence) into a 384-dimensional numerical vector called an embedding.

Why do we need embeddings?

Machine learning models cannot understand raw text.
So we convert text → numbers.

Example:
"How do I pay my fees?" → [0.12, -0.04, 0.88, ...] (384 numbers)

These vectors capture meaning, so similar sentences have similar embeddings.
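
To make "similar sentences have similar embeddings" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embedding vectors. The 4-dimensional vectors and their names are made up for illustration (real embeddings from this model have 384 values), so only the relative ordering of the scores is meaningful.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional "embeddings"; real ones have 384 values.
pay_fees = [0.9, 0.1, 0.0, 0.2]
tuition_payment = [0.8, 0.2, 0.1, 0.3]
good_morning = [0.0, 0.9, 0.8, 0.1]

sim_related = cosine_similarity(pay_fees, tuition_payment)
sim_unrelated = cosine_similarity(pay_fees, good_morning)
print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

With real embeddings the same function could be applied to two rows of the array returned by the encoder.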

In [3]:
# Load a sentence embedding model (384-dim)
# This is similar to what your prof described (embeddings with HF backbone)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def encode_texts(texts, batch_size=32):
    """
    Encode a list of texts to a 2D numpy array of embeddings.
    shape = (n_samples, 384)
    """
    embeddings = embedder.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=False
    )
    return embeddings




### Load your CSV and prepare data

In this step, we load our labeled intent dataset from a CSV file containing two columns: text and label.
We extract the text questions and their corresponding intent labels into Python lists so they can be used for training.
The text list will later be converted into numerical embeddings (vectors) using the SentenceTransformer model.
The label list will be encoded into numbers so the neural network can learn from them.
This step simply prepares our raw dataset so it can be processed in later training cells.
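
Before training, it can help to confirm that the file really has the two expected columns and to look at how many examples each intent has. The sketch below does this on a tiny in-memory CSV; the three rows and the names `csv_text`, `sample_df`, and `label_counts` are illustrative stand-ins, not the real dataset.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the real CSV file.
csv_text = """text,label
hi,small_talk
hello,small_talk
How do I register for courses?,student_affairs
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

# Fail early if an expected column is missing.
missing = {"text", "label"} - set(sample_df.columns)
if missing:
    raise ValueError(f"CSV is missing columns: {sorted(missing)}")

# Drop rows with an empty text or label, then count examples per class.
sample_df = sample_df.dropna(subset=["text", "label"])
label_counts = sample_df["label"].value_counts()
print(label_counts)
```

A strongly imbalanced count here would be a hint that plain accuracy may overstate how well rare intents are handled.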



In [7]:
# Cell 3 - Load CSV and prepare data

import pandas as pd

# Load your intent dataset
df = pd.read_csv('../data/raw/training_data.csv')   

# Extract the text and labels as lists
texts = df["text"].tolist()
labels = df["label"].tolist()

print("Total samples:", len(df))
df.head()


Total samples: 125


Unnamed: 0,text,label
0,hi,small_talk
1,hello,small_talk
2,hey there,small_talk
3,good morning,small_talk
4,good evening,small_talk


### Generate Text Embeddings

In this step, we convert all text queries into numerical embeddings using a SentenceTransformer model.
Embeddings are dense vector representations (384-dimensional) that capture the meaning of each sentence.
These vectors become the input features for our deep learning classifier.
Without embeddings, the neural network cannot understand text, so this step converts language into numbers.


In [8]:
# Cell 4 - Convert texts into embeddings using SentenceTransformer

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a lightweight embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

def encode_texts(text_list):
    """
    Converts a list of sentences into numerical embeddings.
    Returns a NumPy array of shape (num_samples, 384)
    """
    embeddings = embedding_model.encode(text_list, show_progress_bar=True)
    return np.array(embeddings)

# Convert all text queries into embeddings
X = encode_texts(texts)

print("Embedding shape:", X.shape)   # Example output: (100, 384)


Batches: 100%|██████████| 4/4 [00:00<00:00, 18.54it/s]

Embedding shape: (125, 384)
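
The progress bar reports 4 batches for 125 texts. A small sketch of that arithmetic, assuming the encoder processes texts in batches of 32 (the value used as `batch_size` in the first `encode_texts` definition, and to my knowledge the library default):

```python
import math

num_texts = 125   # "Total samples" printed when the CSV was loaded
batch_size = 32   # assumed batch size used by encode()

# The last batch is smaller when the sample count is not a multiple of 32.
num_batches = math.ceil(num_texts / batch_size)
last_batch = num_texts - (num_batches - 1) * batch_size
print(num_batches, last_batch)
```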





### Encode Labels

In this step, we convert the text labels (such as "out_of_scope", "serious_issue", "small_talk", and "student_affairs") into numbers because neural networks can only work with numeric data.
We use LabelEncoder to map each label to a unique integer (for example: out_of_scope → 0, serious_issue → 1).
Then we convert these numeric labels into PyTorch tensors so they can be used during training.
This ensures that both the input embeddings and the labels are in a format the deep learning model can understand.
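
As a rough illustration of what the label encoding amounts to, the sketch below builds the same kind of mapping in plain Python: the unique labels are sorted, and each label's position in that sorted list becomes its integer ID. The list `labels_demo` is a made-up sample that reuses this notebook's four intent names; this is not the LabelEncoder implementation itself.

```python
labels_demo = ["small_talk", "student_affairs", "out_of_scope",
               "small_talk", "serious_issue"]

# Sorted unique labels define the class order; position = integer ID.
classes = sorted(set(labels_demo))
label_to_id = {label: idx for idx, label in enumerate(classes)}
id_to_label = {idx: label for label, idx in label_to_id.items()}

encoded = [label_to_id[label] for label in labels_demo]
decoded = [id_to_label[idx] for idx in encoded]
print(label_to_id)
print(encoded)
```

The reverse mapping is what turns a predicted class index back into a readable intent name at inference time.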

In [9]:
# Cell 5 - Encode labels into integer IDs

from sklearn.preprocessing import LabelEncoder
import torch

# Initialize label encoder
le = LabelEncoder()

# Convert labels (strings) into numbers
y_encoded = le.fit_transform(labels)

# Convert to torch tensor
y = torch.tensor(y_encoded, dtype=torch.long)

# Show mapping
label_mapping = dict(zip(le.classes_, range(len(le.classes_))))
print("Label Mapping:", label_mapping)

print("Example encoded labels:", y[:10])


Label Mapping: {np.str_('out_of_scope'): 0, np.str_('serious_issue'): 1, np.str_('small_talk'): 2, np.str_('student_affairs'): 3}
Example encoded labels: tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])


### Train/Val/Test Split + PyTorch Datasets
In this step, we split our dataset into three parts: training, validation, and test sets.
The training set is used by the model to learn patterns, the validation set helps tune the model and avoid overfitting, and the test set is used at the end to evaluate real performance.
We first create a temporary split between training+validation and test, and then split the training part again into train and validation.
This ensures the model never sees the test data while training.
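
The two-stage split can be checked with simple arithmetic. The sketch below assumes the test share is rounded up to a whole number of samples and the remainder goes to the other side; under that assumption it reproduces the sizes reported by the split cell for 125 samples (10% test, then 20% of the rest for validation). `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_rest, n_held_out) for a fractional held-out share."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

# Stage 1: hold out 10% of all samples for testing.
n_temp, n_test = split_sizes(125, 0.10)
# Stage 2: hold out 20% of the remainder for validation.
n_train, n_val = split_sizes(n_temp, 0.20)
print(n_train, n_val, n_test)
```

Note that the validation set ends up at roughly 18% of the full dataset, because the 20% is taken from the 90% that remains after the test split.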

In [10]:
# Cell 6 - Create Train/Val/Test split and PyTorch datasets

from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader

# ------------------------------
# 1. Train / Validation / Test Split
# ------------------------------

# First: split into 90% training+validation and 10% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

# Second: split training data into 80% train and 20% validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.20, random_state=42
)

print("Train size:", len(X_train))
print("Validation size:", len(X_val))
print("Test size:", len(X_test))

# ------------------------------
# 2. PyTorch Dataset Wrapper
# ------------------------------

class QueryDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Create dataset objects
train_dataset = QueryDataset(X_train, y_train)
val_dataset = QueryDataset(X_val, y_val)
test_dataset = QueryDataset(X_test, y_test)

# ------------------------------
# 3. DataLoaders
# ------------------------------

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8)
test_loader = DataLoader(test_dataset, batch_size=8)

print("DataLoaders ready!")


Train size: 89
Validation size: 23
Test size: 13
DataLoaders ready!


### Build the Deep Learning Model (MLP Classifier)

In [11]:
# Cell 7 - Define the Deep Learning Model (MLP Classifier)

import torch.nn as nn
import torch.nn.functional as F

# Number of features per embedding (for all-MiniLM-L6-v2 it's 384)
input_dim = X.shape[1]

# Number of output classes (intents)
num_classes = len(le.classes_)
print("Input dim:", input_dim)
print("Number of classes:", num_classes)
print("Classes:", le.classes_)


class IntentClassifier(nn.Module):
    def __init__(self, input_dim=384, hidden_dim=128, output_dim=4, dropout_p=0.3):
        super().__init__()
        # First fully connected layer
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Dropout to reduce overfitting
        self.dropout = nn.Dropout(dropout_p)
        # Output layer: one neuron per class
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x shape: (batch_size, input_dim)
        x = F.relu(self.fc1(x))   # non-linear activation
        x = self.dropout(x)       # apply dropout during training
        logits = self.fc2(x)      # raw scores for each class
        return logits


# Instantiate the model with the correct sizes
model = IntentClassifier(
    input_dim=input_dim,
    hidden_dim=128,
    output_dim=num_classes,
    dropout_p=0.3
)

# Move model to device (CPU or GPU)
model = model.to(device)

model


Input dim: 384
Number of classes: 4
Classes: ['out_of_scope' 'serious_issue' 'small_talk' 'student_affairs']


IntentClassifier(
  (fc1): Linear(in_features=384, out_features=128, bias=True)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc2): Linear(in_features=128, out_features=4, bias=True)
)
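
The layer sizes printed above determine how many trainable parameters the classifier has. A short sketch of the count, where `linear_params` is a hypothetical helper that adds the weight matrix and the bias vector of one fully connected layer (dropout has no parameters):

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

fc1_params = linear_params(384, 128)   # 384 -> 128 hidden layer
fc2_params = linear_params(128, 4)     # 128 -> 4 output classes
total_params = fc1_params + fc2_params
print(fc1_params, fc2_params, total_params)
```

Roughly fifty thousand parameters trained on fewer than a hundred examples is one reason dropout and early stopping are used here.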

### Training the Deep Learning Model

In this step, we train the intent classifier using the training data and monitor its performance on the validation set.
For each epoch, the model updates its weights to minimize the cross-entropy loss between predicted and true labels.
We also calculate validation loss and accuracy to see if the model is improving or starting to overfit.
Early stopping is used: if the validation loss does not improve for a few epochs, training stops to prevent overfitting.
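
The early-stopping rule can be isolated from the training loop. The sketch below replays the validation losses from this notebook's training log with a patience of 3; `early_stopping_epoch` is a hypothetical helper written for illustration, mirroring the counter logic used in the training cell.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_loss), with epochs counted from 1."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_loss
    # Ran out of epochs without triggering the stop condition.
    return len(val_losses), best_loss

logged_val_losses = [0.1997, 0.1814, 0.1754, 0.1843, 0.1830, 0.1760]
stop_epoch, best_loss = early_stopping_epoch(logged_val_losses, patience=3)
print(stop_epoch, best_loss)
```

The comparison is strict, so an epoch that comes close to the best loss without beating it still counts against the patience budget.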

In [13]:
# Cell 8 - Train the model with early stopping and save the best version

import os
import pickle
import torch.optim as optim

# Directory OUTSIDE the notebook
MODEL_DIR = "../models"   
os.makedirs(MODEL_DIR, exist_ok=True)

MODEL_PATH = os.path.join(MODEL_DIR, "intent_classifier_best.pt")
ENCODER_PATH = os.path.join(MODEL_DIR, "intent_label_encoder.pkl")

# Optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training settings
num_epochs = 20
best_val_loss = float("inf")
patience = 3
patience_counter = 0

for epoch in range(num_epochs):

    # -----------------------
    # Training
    # -----------------------
    model.train()
    train_loss = 0.0

    for batch_X, batch_y in train_loader:
        batch_X = batch_X.to(device)
        batch_y = batch_y.to(device)

        optimizer.zero_grad()
        logits = model(batch_X)
        loss = criterion(logits, batch_y)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    avg_train_loss = train_loss / len(train_loader)

    # -----------------------
    # Validation
    # -----------------------
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for batch_X, batch_y in val_loader:
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)

            logits = model(batch_X)
            loss = criterion(logits, batch_y)
            val_loss += loss.item()

            preds = torch.argmax(logits, dim=1)
            correct += (preds == batch_y).sum().item()
            total += batch_y.size(0)

    avg_val_loss = val_loss / len(val_loader)
    val_accuracy = correct / total if total > 0 else 0.0

    print(
        f"Epoch {epoch+1}/{num_epochs} "
        f"- Train Loss: {avg_train_loss:.4f} "
        f"| Val Loss: {avg_val_loss:.4f} "
        f"| Val Acc: {val_accuracy:.2%}"
    )

    # -----------------------
    # Early stopping
    # -----------------------
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        patience_counter = 0

        # Save best model
        torch.save(model.state_dict(), MODEL_PATH)

        # Save encoder
        with open(ENCODER_PATH, "wb") as f:
            pickle.dump(le, f)

        print("  ✅ New best model saved.")

    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"⏹️ Early stopping at epoch {epoch+1}. Best Val Loss: {best_val_loss:.4f}")
            break

print("\nTraining finished.")
print("Best model saved at:", MODEL_PATH)
print("Label encoder saved at:", ENCODER_PATH)


Epoch 1/20 - Train Loss: 0.0429 | Val Loss: 0.1997 | Val Acc: 91.30%
  ✅ New best model saved.
Epoch 2/20 - Train Loss: 0.0285 | Val Loss: 0.1814 | Val Acc: 91.30%
  ✅ New best model saved.
Epoch 3/20 - Train Loss: 0.0182 | Val Loss: 0.1754 | Val Acc: 91.30%
  ✅ New best model saved.
Epoch 4/20 - Train Loss: 0.0165 | Val Loss: 0.1843 | Val Acc: 91.30%
Epoch 5/20 - Train Loss: 0.0140 | Val Loss: 0.1830 | Val Acc: 91.30%
Epoch 6/20 - Train Loss: 0.0092 | Val Loss: 0.1760 | Val Acc: 91.30%
⏹️ Early stopping at epoch 6. Best Val Loss: 0.1754

Training finished.
Best model saved at: ../models\intent_classifier_best.pt
Label encoder saved at: ../models\intent_label_encoder.pkl


### Test model

In [14]:
# Load best trained model
MODEL_PATH = "../models/intent_classifier_best.pt"
ENCODER_PATH = "../models/intent_label_encoder.pkl"

# Recreate model
num_classes = len(le.classes_)
model = IntentClassifier(
    input_dim=X_train.shape[1],
    hidden_dim=128,
    output_dim=num_classes
).to(device)

# Load best weights
model.load_state_dict(torch.load(MODEL_PATH, map_location=device))
model.eval()

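# Evaluate on test set
# Note: test_loss accumulates the per-batch losses and is reported as a sum,
# not as the per-batch average used for the train and validation losses.
test_loss = 0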
correct = 0
total = 0

with torch.no_grad():
    for batch_X, batch_y in test_loader:
        batch_X = batch_X.to(device)
        batch_y = batch_y.to(device)

        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        test_loss += loss.item()

        preds = torch.argmax(outputs, dim=1)
        correct += (preds == batch_y).sum().item()
        total += batch_y.size(0)

test_accuracy = correct / total
print(f"\n📊 Test Accuracy: {test_accuracy:.2%} | Test Loss: {test_loss:.4f}")



📊 Test Accuracy: 100.00% | Test Loss: 0.0711
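
With sets this small, accuracy moves in coarse steps. The sketch below works from the split sizes reported earlier (23 validation, 13 test samples). The count of 21 correct validation predictions is inferred from the logged 91.30% and is not printed by the notebook; the `_demo` names are illustrative.

```python
val_correct, val_total = 21, 23     # 21/23 matches the logged 91.30%
test_correct, test_total = 13, 13   # 13/13 matches the logged 100.00%

val_accuracy_demo = val_correct / val_total
test_accuracy_demo = test_correct / test_total
# Accuracy change caused by a single misclassified test example.
test_step = 1 / test_total

print(f"{val_accuracy_demo:.2%} {test_accuracy_demo:.2%} {test_step:.2%}")
```

One wrong prediction on the test set would lower the reported accuracy by almost eight percentage points, so the 100% figure should be read as a rough signal rather than a precise estimate.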


In [15]:
def classify_query(text):
    # 1. Convert text → embedding
    embedding = encode_texts([text])  # shape (1, 384)
    tensor = torch.tensor(embedding, dtype=torch.float32).to(device)

    # 2. Forward pass
    model.eval()
    with torch.no_grad():
        logits = model(tensor)
        pred_idx = torch.argmax(logits, dim=1).item()

    # 3. Convert index → label
    return le.inverse_transform([pred_idx])[0]


examples = [
    "I'm feeling anxious about my exams.",
    "Where do I submit my OSAP documents?",
    "Hey! how are you?",
    "I want to book an appointment with a student advisor.",
    "Nothing is working on the portal, I feel stressed.",
    "How do I register for courses?",
    "Can you guide me through career services?",
]

for text in examples:
    print(f"{text} → {classify_query(text)}")


I'm feeling anxious about my exams. → student_affairs
Where do I submit my OSAP documents? → student_affairs
Hey! how are you? → small_talk
I want to book an appointment with a student advisor. → student_affairs
Nothing is working on the portal, I feel stressed. → serious_issue
How do I register for courses? → student_affairs
Can you guide me through career services? → student_affairs
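
The classifier returns only the top label. If a confidence score is also wanted, the raw logits can be passed through a softmax. The sketch below shows the idea with NumPy on made-up logits; `example_logits` is hypothetical, and in the notebook the values would come from the model's forward pass.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

class_names = ["out_of_scope", "serious_issue", "small_talk", "student_affairs"]
# Hypothetical logits for one query; real values come from the model.
example_logits = np.array([[0.2, 2.5, -1.0, 1.1]])

probs = softmax(example_logits)[0]
pred_idx = int(np.argmax(probs))
print(class_names[pred_idx], round(float(probs[pred_idx]), 3))
```

One possible use, stated here as an assumption rather than something the notebook does, is to treat predictions whose top probability falls below a chosen threshold as uncertain and route them to a fallback.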



