## Model Inference

This notebook demonstrates the pipeline predicting categories for *new, unseen* transactions (`inference.csv`). The dataset is drawn from the original `bank_transaction.csv`: it consists of the 257 instances with missing class labels. These instances were dropped during preprocessing and were therefore **not seen** during training.

These instances have no ground truth, so let's see how the trained model predicts their categories :)

Pipeline includes:
1. Data preprocessing (numerical + text)
2. Loading trained model weights
3. Model Inference

#### Load and Preprocess the New Inference Dataset

In [275]:

import pandas as pd
import numpy as np
import torch
import pickle
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

# Import the text cleaning function from the util folder
from util.text_cleaning import clean_normalize_text

# Load the preprocessed training dataset (used later to recover the one-hot category column names)
df = pd.read_csv("../dataset/preprocessed_bank_transaction.csv")

# Load the new dataset for inference
inference_df = pd.read_csv("../dataset/inference.csv")

# Load the FastText model
from gensim.models import FastText
fasttext_model = FastText.load("../models/fasttext/fasttext_model.bin")

# Load the trained scaler
SCALER_PATH = "../models/scaler/scaler.pkl"
with open(SCALER_PATH, "rb") as f:
    scaler = pickle.load(f)

#### Defining Data Preprocessing Function

In [276]:
def preprocess_new_data(df, fasttext_model, scaler, structured_features):
    """Preprocess the new dataset without requiring a category column."""

    # Drop unnecessary columns (if exist)
    df = df.drop(columns=['client_id', 'bank_id', 'account_id', 'txn_id'], errors='ignore')

    # Convert txn_date to datetime format (if applicable)
    if 'txn_date' in df.columns:
        df['txn_date'] = pd.to_datetime(df['txn_date'], errors='coerce')

        # Extract time-based features
        df['day_of_week'] = df['txn_date'].dt.dayofweek  # Monday=0, Sunday=6
        df['day_of_month'] = df['txn_date'].dt.day  # 1-31
        df['hour'] = df['txn_date'].dt.hour  # Extract hour from transaction time (0-23)
        df['is_weekend'] = df['day_of_week'].apply(lambda x: 1 if x >= 5 else 0)  # 1=Weekend, 0=Weekday

        # Drop original txn_date column
        df = df.drop(columns=['txn_date'], errors='ignore')

    # Ensure transaction descriptions are cleaned first
    if 'description' in df.columns:
        df['processed_description'] = df['description'].fillna('').apply(clean_normalize_text)
    else:
        raise KeyError("Column 'description' is missing in the dataset!")

    # Generate FastText embeddings from cleaned transaction descriptions
    def get_embedding(text):
        words = str(text).split()  # Convert text to words
        word_vectors = [fasttext_model.wv[word] for word in words if word in fasttext_model.wv]
        if len(word_vectors) == 0:
            return np.zeros(100, dtype=np.float32)  # Ensure float32 dtype
        return np.mean(word_vectors, axis=0).astype(np.float32)  # Ensure float32 dtype

    df['fasttext_embedding'] = df['processed_description'].apply(get_embedding)
    fasttext_embeddings = np.vstack(df['fasttext_embedding'].values).astype(np.float32)  # Ensure float32

    # Drop raw text columns
    df = df.drop(columns=['description', 'processed_description', 'fasttext_embedding'], errors='ignore')

    # Apply the SAME scaler to inference data
    numerical_features = ['amount', 'day_of_week', 'day_of_month', 'hour']
    df[numerical_features] = scaler.transform(df[numerical_features])

    # Ensure correct data type
    df[structured_features] = df[structured_features].astype(np.float32)

    # Combine structured features and FastText embeddings into final feature matrix
    X_new = np.hstack((df[structured_features].values, fasttext_embeddings)).astype(np.float32)

    return X_new
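To make the time-based features concrete, here is a minimal standalone sketch that derives the same four values for a single timestamp using only the standard library. The helper name `time_features` and the sample timestamps are made up for illustration; the notebook itself relies on the vectorised pandas `.dt` accessors.

```python
from datetime import datetime

def time_features(timestamp: str) -> dict:
    """Derive the four time-based features for one ISO-formatted timestamp."""
    dt = datetime.fromisoformat(timestamp)
    day_of_week = dt.weekday()  # Monday=0, Sunday=6 (same convention as pandas dayofweek)
    return {
        "day_of_week": day_of_week,
        "day_of_month": dt.day,
        "hour": dt.hour,
        "is_weekend": 1 if day_of_week >= 5 else 0,
    }

# Hypothetical sample: 15 July 2023 was a Saturday
features = time_features("2023-07-15 14:30:00")
print(features)
```

Note that a timestamp pandas cannot parse becomes `NaT` under `errors='coerce'`, so the derived columns would hold missing values for that row; this sketch does not cover that case.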

In [277]:
# Extract structured numerical features (same as training)
structured_features = ['amount', 'is_interested_investment', 'is_interested_build_credit', 'is_interested_increase_income', 'is_interested_pay_off_debt', 'is_interested_manage_spending', 'is_interested_grow_savings', 'day_of_week', 'day_of_month', 'hour', 'is_weekend']

# Extract category column names from training dataset (ensuring consistent one-hot encoding)
category_columns = [col for col in df.columns if col.startswith("category_")]

# Preprocess new dataset
X_new = preprocess_new_data(inference_df, fasttext_model, scaler, structured_features)

# Convert to PyTorch tensor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X_new_tensor = torch.tensor(X_new, dtype=torch.float32).to(device)

# Print shape
print("New dataset shape after preprocessing:", X_new_tensor.shape)

New dataset shape after preprocessing: torch.Size([257, 111])
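The second dimension of 111 follows from the feature layout: the 11 structured features listed above plus the averaged FastText vector. A quick sanity check of that arithmetic is sketched below; the embedding size is written as a constant and is an assumption inferred from the `np.zeros(100)` fallback in `get_embedding`, since the real value lives in the saved FastText model.

```python
structured_features = [
    'amount', 'is_interested_investment', 'is_interested_build_credit',
    'is_interested_increase_income', 'is_interested_pay_off_debt',
    'is_interested_manage_spending', 'is_interested_grow_savings',
    'day_of_week', 'day_of_month', 'hour', 'is_weekend',
]
EMBEDDING_DIM = 100  # assumed from the np.zeros(100) fallback in get_embedding

# Structured columns come first, followed by the embedding columns
expected_width = len(structured_features) + EMBEDDING_DIM
print(expected_width)  # 111
```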


#### Initialise and Load the Trained Model

For this inference, I'll be using the model trained with:

- 50 epochs
- 0.001 learning rate
- BCELoss() as loss function
- 82% average accuracy
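
For reference, with its default mean reduction `BCELoss` averages the binary cross-entropy over every sample and every output unit. The notation below is introduced here and is not taken from the training code: $N$ is the batch size, $C$ the number of category outputs, $y_{ic}$ the target (assumed one-hot, as the `category_` columns suggest) and $\hat{y}_{ic}$ the sigmoid output.

$$
\mathcal{L} = -\frac{1}{NC}\sum_{i=1}^{N}\sum_{c=1}^{C}\Big[\,y_{ic}\log\hat{y}_{ic} + (1-y_{ic})\log\big(1-\hat{y}_{ic}\big)\Big]
$$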

In [278]:
import torch
import torch.nn as nn
from model import TransactionClassifier

# Load the trained model 
model_path = "../models/ann/ANN_50e_1e-3lr_bce_classifier.pth" 

# Initialize the model
model = TransactionClassifier(X_new_tensor.shape[1], len(category_columns))

# map_location lets weights saved on a GPU load on a CPU-only machine
model.load_state_dict(torch.load(model_path, map_location=device))
model.to(device)
model.eval()  # Set model to evaluation mode

TransactionClassifier(
  (fc1): Linear(in_features=111, out_features=256, bias=True)
  (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu1): ReLU()
  (dropout1): Dropout(p=0.3, inplace=False)
  (fc2): Linear(in_features=256, out_features=128, bias=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu2): ReLU()
  (dropout2): Dropout(p=0.3, inplace=False)
  (fc3): Linear(in_features=128, out_features=33, bias=True)
  (sigmoid): Sigmoid()
)

#### Predicting Transaction Categories (Inference)

Since these instances have no ground truth, accuracy cannot be computed. I'll just print each transaction description alongside the model's prediction to get a rough idea of how the model performs.
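
When reading the predictions, it helps to keep in mind how sigmoid outputs are reduced to a single label. The toy sketch below uses made-up probabilities and placeholder category names (none of them come from the trained model). It shows that taking the argmax of the raw probabilities always picks the most likely category, whereas taking the argmax of a 0/1 thresholded vector falls back to index 0 whenever no output exceeds 0.5.

```python
categories = ["category_A", "category_B", "category_C"]  # placeholder names

def argmax(values):
    """Index of the first maximum, mirroring numpy's argmax tie-breaking."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up sigmoid outputs for two transactions
confident = [0.05, 0.91, 0.10]   # one output clearly above 0.5
uncertain = [0.20, 0.30, 0.45]   # no output above 0.5

for probs in (confident, uncertain):
    thresholded = [1.0 if p > 0.5 else 0.0 for p in probs]
    by_prob = categories[argmax(probs)]
    by_threshold = categories[argmax(thresholded)]
    print(f"probs={probs} -> by probability: {by_prob}, by threshold: {by_threshold}")
```

For the confident row both approaches agree; for the uncertain row they differ, which is worth remembering when a prediction looks out of place.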

In [279]:
# Perform inference
with torch.no_grad():
    outputs = model(X_new_tensor)  # Sigmoid probabilities, one per category
    predicted_categories = (outputs > 0.5).float()  # Binarise probabilities (threshold = 0.5)

# Map each transaction to the category with the highest probability.
# argmax is taken over the raw probabilities, not the thresholded 0/1 vector,
# which would return index 0 whenever no probability exceeds 0.5.
predicted_labels = []

for probs in outputs.cpu().numpy():
    predicted_index = probs.argmax()  # Get index of the highest probability category
    predicted_labels.append(category_columns[predicted_index])  # Map index to category column

# Print the results
print("\n===== Model Predictions =====")

for i, (original, predicted) in enumerate(zip(inference_df['description'], predicted_labels)):
    print(f"Transaction {i+1}:")
    print(f"  Description: {original}")
    print(f"  Predicted Category: {predicted}")
    print("-" * 50)


===== Model Predictions =====
Transaction 1:
  Description: Transfer from Chime Savings Account
  Predicted Category: category_Transfer Credit
--------------------------------------------------
Transaction 2:
  Description: Transfer from Chime Savings Account
  Predicted Category: category_Transfer Credit
--------------------------------------------------
Transaction 3:
  Description: Transfer from Chime Savings Account
  Predicted Category: category_Transfer Credit
--------------------------------------------------
Transaction 4:
  Description: Cash app*cash out      visa direct  caus
  Predicted Category: category_Third Party
--------------------------------------------------
Transaction 5:
  Description: Transfer from Chime Savings Account
  Predicted Category: category_Transfer Credit
--------------------------------------------------
Transaction 6:
  Description: Transfer from CB
  Predicted Category: category_Transfer Credit
--------------------------------------------------
Tra