# Phishing URL Neural Network Experiments

This notebook explores various neural network architectures using the Kaggle phishing URL dataset.

In increasing order of complexity, we will experiment with:

1. Simple Feedforward Neural Networks (MLP)
2. Convolutional Neural Networks (CNN)
3. Recurrent Neural Networks (RNN)
4. Hybrid Models

## Setup and Imports

In [None]:
use_drive = False

# uncomment if running on colab
from google.colab import drive
drive.mount('/content/drive')
use_drive = True
drive_root = '/content/drive/MyDrive/fraud-grp-proj/'

# check path exists
import os
print(os.path.exists(drive_root))

Mounted at /content/drive
True


In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix,
                             classification_report, roc_curve)
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, TensorDataset
import torch.optim as optim

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

Using device: cuda


In [7]:
# Load train and test datasets
train_df = pd.read_csv('dataset/train.csv')
test_df = pd.read_csv('dataset/test.csv')

train_w_features_df = pd.read_csv('dataset/df_train_feature_engineered.csv')
test_w_features_df = pd.read_csv('dataset/df_test_feature_engineered.csv')

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")

print(f"Train with features shape: {train_w_features_df.shape}")
print(f"Test with features shape: {test_w_features_df.shape}")

Train shape: (9143, 2)
Test shape: (2286, 2)
Train with features shape: (9143, 78)
Test with features shape: (2286, 78)


In [8]:
train_w_features_df.columns

Index(['url', 'target', 'is_http', 'has_subdomain', 'has_tld', 'num_subdomain',
       'is_domain_ip', 'num_hyphens_domain', 'is_punycode', 'has_path',
       'path_depth', 'has_filename', 'has_file_extension', 'has_query',
       'length_url', 'length_hostname', 'length_tld', 'length_sld',
       'length_subdomains', 'length_path', 'length_query', 'num_dots',
       'num_hyphens', 'num_at', 'num_question_marks', 'num_and', 'num_equal',
       'num_percent', 'tld_in_path', 'tld_in_subdomain',
       'subdomain_longer_sld', 'ratio_digits_url', 'ratio_digits_hostname',
       'ratio_letter_url', 'ratio_path_url', 'ratio_hostname_url',
       'length_words_url', 'avg_word_hostname', 'avg_word_path',
       'num_unique_chars_hostname', 'has_shortened_hostname',
       'entropy_hostname', 'has_www_subdomain', 'has_com_tld',
       'is_http_and_many_subdomains', 'ip_and_short_tld',
       'http_and_missing_domain_info', 'subdomain_depth_x_http', 'ip_x_http',
       'domain_complexity_score',

Following the EDA, we use the same feature set as log regression since MLP models require normalized and scaled inputs.

In [9]:
# Drop original versions of log transformed features
train_w_features_df.drop(columns=['length_url', 'length_path',  'ratio_hostname_url', 'length_words_url', 'avg_word_hostname', 'num_unique_chars_hostname'], inplace=True)

# Drop original versions of squared transformed features
train_w_features_df.drop(columns=['ratio_letter_url', 'entropy_hostname'], inplace=True)

# Drop original versions of is_zero transformed features
train_w_features_df.drop(columns=['num_hyphens_domain', 'length_subdomains', 'num_hyphens',  'num_at', 'num_question_marks', 'num_and', 'num_equal', 'num_percent', 'ratio_digits_url', 'ratio_digits_hostname', 'avg_word_path', 'length_query'], inplace=True)

# Drop original versions of bucketed transformed features
train_w_features_df.drop(columns=['num_subdomain', 'length_tld', 'path_depth'], inplace=True)

# Check final columns
train_w_features_df.columns

Index(['url', 'target', 'is_http', 'has_subdomain', 'has_tld', 'is_domain_ip',
       'is_punycode', 'has_path', 'has_filename', 'has_file_extension',
       'has_query', 'length_hostname', 'length_sld', 'num_dots', 'tld_in_path',
       'tld_in_subdomain', 'subdomain_longer_sld', 'ratio_path_url',
       'has_shortened_hostname', 'has_www_subdomain', 'has_com_tld',
       'is_http_and_many_subdomains', 'ip_and_short_tld',
       'http_and_missing_domain_info', 'subdomain_depth_x_http', 'ip_x_http',
       'domain_complexity_score', 'suspicion_score', 'contains_brand_misspell',
       'is_homoglyph_attack', 'homoglyph_type', 'risk_score',
       'is_zero_num_hyphens_domain', 'is_zero_length_subdomains',
       'is_zero_num_hyphens', 'is_zero_num_at', 'is_zero_num_question_marks',
       'is_zero_num_and', 'is_zero_num_equal', 'is_zero_num_percent',
       'is_zero_ratio_digits_url', 'is_zero_ratio_digits_hostname',
       'is_zero_avg_word_path', 'is_zero_length_query',
       'num_sub

## Training Models

Now lets move on to training the models. We use the `ModelSaver` utility to help us standardize the storing of metrics and models for evaluation later on.

In [None]:
# Configuration
SAVE_MODELS = True  # Switch to turn on/off model saving
BATCH_SIZE = 64
EPOCHS = 20
LEARNING_RATE = 0.001
MAX_URL_LEN = 200  # Truncate/pad URLs to this length
EMBEDDING_DIM = 64
DROPOUT = 0.3

# Import ModelSaver
import sys
import os
sys.path.append(os.path.abspath('.'))
from save_model import ModelSaver

# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


In [12]:
# --- Data Preprocessing for Neural Networks ---

# 1. Character Tokenizer for URLs
class CharTokenizer:
    def __init__(self):
        self.char2idx = {}
        self.idx2char = {}
        self.vocab_size = 0

    def fit(self, texts):
        unique_chars = set()
        for text in texts:
            unique_chars.update(str(text))

        self.char2idx = {char: idx + 2 for idx, char in enumerate(sorted(unique_chars))}
        self.char2idx['<PAD>'] = 0
        self.char2idx['<UNK>'] = 1
        self.idx2char = {idx: char for char, idx in self.char2idx.items()}
        self.vocab_size = len(self.char2idx)

    def transform(self, texts, max_len):
        sequences = []
        for text in texts:
            text = str(text)
            seq = [self.char2idx.get(c, 1) for c in text]
            if len(seq) < max_len:
                seq = seq + [0] * (max_len - len(seq))
            else:
                seq = seq[:max_len]
            sequences.append(seq)
        return np.array(sequences)

tokenizer = CharTokenizer()
tokenizer.fit(train_df['url'])
print(f"Vocabulary size: {tokenizer.vocab_size}")

# 2. Prepare Numeric Features
numeric_cols = train_w_features_df.select_dtypes(include=[np.number]).columns.tolist()
if 'target' in numeric_cols:
    numeric_cols.remove('target')

X_numeric_train = train_w_features_df[numeric_cols].values
X_numeric_test = test_w_features_df[numeric_cols].values

scaler = StandardScaler()
X_numeric_train_scaled = scaler.fit_transform(X_numeric_train)
X_numeric_test_scaled = scaler.transform(X_numeric_test)

# 3. Prepare Text Features
X_text_train = tokenizer.transform(train_df['url'], MAX_URL_LEN)
X_text_test = tokenizer.transform(test_df['url'], MAX_URL_LEN)

# 4. Prepare Targets
y_train = train_df['target'].values
if 'target' in test_df.columns:
    y_test = test_df['target'].values
else:
    y_test = np.zeros(len(test_df))

# 5. Create PyTorch Datasets
class PhishingDataset(Dataset):
    def __init__(self, X_numeric, X_text, y, X_images=None):
        self.X_numeric = torch.FloatTensor(X_numeric)
        self.X_text = torch.LongTensor(X_text)
        self.y = torch.FloatTensor(y)
        self.X_images = torch.FloatTensor(X_images) if X_images is not None else None

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        if self.X_images is not None:
            return self.X_numeric[idx], self.X_text[idx], self.X_images[idx], self.y[idx]
        return self.X_numeric[idx], self.X_text[idx], self.y[idx]

full_train_dataset = PhishingDataset(X_numeric_train_scaled, X_text_train, y_train)
test_dataset = PhishingDataset(X_numeric_test_scaled, X_text_test, y_test)

test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
print("Data preparation complete.")


Vocabulary size: 101
Data preparation complete.


In [None]:
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss = 0.0
    all_preds = []
    all_targets = []

    for batch in loader:
        x_img = None
        if len(batch) == 3:
            x_num, x_txt, y = batch
        else:
            x_num, x_txt, x_img, y = batch
            x_img = x_img.to(device)

        x_num, x_txt, y = x_num.to(device), x_txt.to(device), y.to(device)

        optimizer.zero_grad()
        outputs = model(x_num, x_txt, x_img).squeeze()
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * x_num.size(0)
        all_preds.extend(outputs.detach().cpu().numpy())
        all_targets.extend(y.detach().cpu().numpy())

    epoch_loss = running_loss / len(loader.dataset)
    return epoch_loss

def evaluate(model, loader, criterion, device):
    model.eval()
    running_loss = 0.0
    all_preds = []
    all_targets = []

    with torch.no_grad():
        for batch in loader:
            x_img = None
            if len(batch) == 3:
                x_num, x_txt, y = batch
            else:
                x_num, x_txt, x_img, y = batch
                x_img = x_img.to(device)

            x_num, x_txt, y = x_num.to(device), x_txt.to(device), y.to(device)

            outputs = model(x_num, x_txt, x_img).squeeze()
            loss = criterion(outputs, y)

            running_loss += loss.item() * x_num.size(0)
            all_preds.extend(outputs.cpu().numpy())
            all_targets.extend(y.cpu().numpy())

    epoch_loss = running_loss / len(loader.dataset)
    all_preds = np.array(all_preds)
    all_targets = np.array(all_targets)

    # Convert probabilities to binary predictions
    binary_preds = (all_preds > 0.5).astype(int)

    # Calculate confusion matrix components
    tn, fp, fn, tp = confusion_matrix(all_targets, binary_preds).ravel()

    metrics = {
        'loss': epoch_loss,
        'accuracy': accuracy_score(all_targets, binary_preds),
        'precision': precision_score(all_targets, binary_preds, zero_division=0),
        'recall': recall_score(all_targets, binary_preds, zero_division=0),
        'f1': f1_score(all_targets, binary_preds, zero_division=0),
        'roc_auc': roc_auc_score(all_targets, all_preds),
        'TP': tp,
        'FP': fp,
        'TN': tn,
        'FN': fn
    }

    return metrics, all_preds

def run_experiment(model_class, model_name, model_params, experiment_name, save_model=True):
    print(f"\n=== Running Experiment: {experiment_name} ({model_name}) ===")
    print(f"Saving Model: {save_model}")

    saver = None
    if save_model:
        if use_drive:
            base_path = drive_root + "experiments"
        else:
            base_path = "experiments"
        saver = ModelSaver(base_path=base_path)
        saver.start_experiment(
            experiment_name=experiment_name,
            model_type=model_name,
            vectorizer="CharTokenizer/Visualizer",
            vectorizer_params={'vocab_size': tokenizer.vocab_size, 'max_len': MAX_URL_LEN},
            model_params=model_params,
            n_folds=5,
            save_format="pickle"
        )

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    fold_test_preds = []

    for fold, (train_idx, val_idx) in enumerate(skf.split(X_numeric_train_scaled, y_train), start=1):
        print(f"\n--- Fold {fold}/5 ---")

        train_subsampler = torch.utils.data.SubsetRandomSampler(train_idx)
        val_subsampler = torch.utils.data.SubsetRandomSampler(val_idx)

        train_loader = DataLoader(full_train_dataset, batch_size=BATCH_SIZE, sampler=train_subsampler)
        val_loader = DataLoader(full_train_dataset, batch_size=BATCH_SIZE, sampler=val_subsampler)

        model = model_class(**model_params).to(device)
        criterion = nn.BCELoss()
        optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

        best_val_auc = 0.0
        best_model_state = None

        for epoch in range(EPOCHS):
            train_loss = train_epoch(model, train_loader, criterion, optimizer, device)
            val_metrics, _ = evaluate(model, val_loader, criterion, device)

            if (epoch + 1) % 5 == 0:
                print(f"Epoch {epoch+1}/{EPOCHS} - Train Loss: {train_loss:.4f} - Val AUC: {val_metrics['roc_auc']:.4f}")

            if val_metrics['roc_auc'] > best_val_auc:
                best_val_auc = val_metrics['roc_auc']
                best_model_state = model.state_dict()

        if best_model_state:
            model.load_state_dict(best_model_state)

        val_metrics, val_preds = evaluate(model, val_loader, criterion, device)
        print(f"Fold {fold} Best Val AUC: {val_metrics['roc_auc']:.4f}")

        test_metrics, test_preds = evaluate(model, test_loader, criterion, device)
        fold_test_preds.append(test_preds)

        if save_model and saver:
            saver.add_fold(
                fold_model=model,
                fold_metric={"fold": fold, **val_metrics, "train_size": len(train_idx), "val_size": len(val_idx)},
                test_predictions=test_preds,
                feature_names=["numeric_features"]
            )

    if save_model and saver:
        saver.finalize_experiment()
        print(f"Experiment saved to {saver._exp_dir}")

    return model, val_metrics, test_metrics

### 1. Baseline MLP (Numeric Features Only)

In [14]:
# Define Baseline MLP Architecture
class BaselineMLP(nn.Module):
    def __init__(self, input_dim, hidden_dims=[128, 64, 32], dropout=0.3):
        super(BaselineMLP, self).__init__()
        layers = []
        prev_dim = input_dim

        for dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, dim))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))
            prev_dim = dim

        layers.append(nn.Linear(prev_dim, 1))
        self.network = nn.Sequential(*layers)

    def forward(self, x_numeric, x_text=None, x_img=None):
        # Ignores x_text, x_img
        return torch.sigmoid(self.network(x_numeric))

In [15]:
# Run Baseline MLP Experiment
SAVE_MLP = True # Set to False to skip saving

mlp_params = {
    'input_dim': X_numeric_train_scaled.shape[1],
    'hidden_dims': [128, 64, 32],
    'dropout': DROPOUT
}

run_experiment(BaselineMLP, "BaselineMLP", mlp_params, "exp_3_mlp_baseline", save_model=SAVE_MLP)


=== Running Experiment: exp_3_mlp_baseline (BaselineMLP) ===
Saving Model: True
Experiment 'exp_3_mlp_baseline' initialized at: /content/drive/MyDrive/fraud-grp-proj/experiments/exp_3_mlp_baseline
Mode: Incremental saving (5 folds)

--- Fold 1/5 ---
Epoch 5/20 - Train Loss: 0.2794 - Val AUC: 0.9429
Epoch 10/20 - Train Loss: 0.2553 - Val AUC: 0.9509
Epoch 15/20 - Train Loss: 0.2392 - Val AUC: 0.9541
Epoch 20/20 - Train Loss: 0.2319 - Val AUC: 0.9564
Fold 1 Best Val AUC: 0.9564
  Fold 1/5 saved | ROC AUC: 0.9564

--- Fold 2/5 ---
Epoch 5/20 - Train Loss: 0.2755 - Val AUC: 0.9318
Epoch 10/20 - Train Loss: 0.2555 - Val AUC: 0.9417
Epoch 15/20 - Train Loss: 0.2385 - Val AUC: 0.9454
Epoch 20/20 - Train Loss: 0.2302 - Val AUC: 0.9479
Fold 2 Best Val AUC: 0.9479
  Fold 2/5 saved | ROC AUC: 0.9479

--- Fold 3/5 ---
Epoch 5/20 - Train Loss: 0.2794 - Val AUC: 0.9403
Epoch 10/20 - Train Loss: 0.2527 - Val AUC: 0.9473
Epoch 15/20 - Train Loss: 0.2386 - Val AUC: 0.9516
Epoch 20/20 - Train Loss: 0.2

(BaselineMLP(
   (network): Sequential(
     (0): Linear(in_features=16, out_features=128, bias=True)
     (1): ReLU()
     (2): Dropout(p=0.3, inplace=False)
     (3): Linear(in_features=128, out_features=64, bias=True)
     (4): ReLU()
     (5): Dropout(p=0.3, inplace=False)
     (6): Linear(in_features=64, out_features=32, bias=True)
     (7): ReLU()
     (8): Dropout(p=0.3, inplace=False)
     (9): Linear(in_features=32, out_features=1, bias=True)
   )
 ),
 {'loss': 0.0554269311022057,
  'accuracy': 0.8862144420131292,
  'precision': 0.9002267573696145,
  'recall': 0.8687089715536105,
  'f1': 0.8841870824053452,
  'roc_auc': np.float64(0.9525003710814992),
  'TP': np.int64(794),
  'FP': np.int64(88),
  'TN': np.int64(826),
  'FN': np.int64(120)},
 {'loss': 0.2800612984810184,
  'accuracy': 0.8858267716535433,
  'precision': 0.8972972972972973,
  'recall': 0.8713910761154856,
  'f1': 0.8841544607190412,
  'roc_auc': np.float64(0.9518262863686222),
  'TP': np.int64(996),
  'FP': np.i

### 2. CharCNN (Text Features Only)

In [16]:
# Define CharCNN Architecture
class CharCNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim=64, num_filters=128, filter_sizes=[3, 4, 5], dropout=0.3, max_len=200):
        super(CharCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=embedding_dim,
                      out_channels=num_filters,
                      kernel_size=fs)
            for fs in filter_sizes
        ])

        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(len(filter_sizes) * num_filters, 1)

    def forward(self, x_numeric, x_text, x_img=None):
        # Ignores x_numeric, x_img
        # x_text shape: [batch_size, max_len]
        embedded = self.embedding(x_text) # [batch_size, max_len, emb_dim]

        # Permute for Conv1d: [batch_size, emb_dim, max_len]
        embedded = embedded.permute(0, 2, 1)

        # Apply Convs + ReLU + MaxPool
        conved = [F.relu(conv(embedded)) for conv in self.convs]
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]

        # Concatenate
        cat = torch.cat(pooled, dim=1)
        dropped = self.dropout(cat)

        return torch.sigmoid(self.fc(dropped))

In [17]:
# Run CharCNN Experiment
SAVE_CNN = True # Set to False to skip saving

cnn_params = {
    'vocab_size': tokenizer.vocab_size,
    'embedding_dim': EMBEDDING_DIM,
    'num_filters': 128,
    'filter_sizes': [3, 4, 5],
    'dropout': DROPOUT,
    'max_len': MAX_URL_LEN
}

run_experiment(CharCNN, "CharCNN", cnn_params, "exp_3_charcnn", save_model=SAVE_CNN)


=== Running Experiment: exp_3_charcnn (CharCNN) ===
Saving Model: True
Experiment 'exp_3_charcnn' initialized at: /content/drive/MyDrive/fraud-grp-proj/experiments/exp_3_charcnn
Mode: Incremental saving (5 folds)

--- Fold 1/5 ---
Epoch 5/20 - Train Loss: 0.1562 - Val AUC: 0.9780
Epoch 10/20 - Train Loss: 0.0796 - Val AUC: 0.9818
Epoch 15/20 - Train Loss: 0.0476 - Val AUC: 0.9830
Epoch 20/20 - Train Loss: 0.0313 - Val AUC: 0.9833
Fold 1 Best Val AUC: 0.9833
  Fold 1/5 saved | ROC AUC: 0.9833

--- Fold 2/5 ---
Epoch 5/20 - Train Loss: 0.1501 - Val AUC: 0.9695
Epoch 10/20 - Train Loss: 0.0843 - Val AUC: 0.9750
Epoch 15/20 - Train Loss: 0.0496 - Val AUC: 0.9770
Epoch 20/20 - Train Loss: 0.0323 - Val AUC: 0.9767
Fold 2 Best Val AUC: 0.9767
  Fold 2/5 saved | ROC AUC: 0.9767

--- Fold 3/5 ---
Epoch 5/20 - Train Loss: 0.1583 - Val AUC: 0.9685
Epoch 10/20 - Train Loss: 0.0874 - Val AUC: 0.9761
Epoch 15/20 - Train Loss: 0.0506 - Val AUC: 0.9767
Epoch 20/20 - Train Loss: 0.0288 - Val AUC: 0.97

(CharCNN(
   (embedding): Embedding(101, 64, padding_idx=0)
   (convs): ModuleList(
     (0): Conv1d(64, 128, kernel_size=(3,), stride=(1,))
     (1): Conv1d(64, 128, kernel_size=(4,), stride=(1,))
     (2): Conv1d(64, 128, kernel_size=(5,), stride=(1,))
   )
   (dropout): Dropout(p=0.3, inplace=False)
   (fc): Linear(in_features=384, out_features=1, bias=True)
 ),
 {'loss': 0.03929209689610012,
  'accuracy': 0.9321663019693655,
  'precision': 0.9312227074235808,
  'recall': 0.9332603938730853,
  'f1': 0.9322404371584699,
  'roc_auc': np.float64(0.9803171190668857),
  'TP': np.int64(853),
  'FP': np.int64(63),
  'TN': np.int64(851),
  'FN': np.int64(61)},
 {'loss': 0.19567685759416925,
  'accuracy': 0.9278215223097113,
  'precision': 0.9237435008665511,
  'recall': 0.9326334208223972,
  'f1': 0.9281671745755333,
  'roc_auc': np.float64(0.9811481351357765),
  'TP': np.int64(1066),
  'FP': np.int64(88),
  'TN': np.int64(1055),
  'FN': np.int64(77)})

### 3. BiLSTM (Text Features Only)

In [19]:
# Define BiLSTM Architecture
class BiLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim=64, hidden_dim=128, num_layers=2, dropout=0.3):
        super(BiLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers=num_layers,
                            bidirectional=True,
                            batch_first=True,
                            dropout=dropout if num_layers > 1 else 0)

        self.dropout = nn.Dropout(dropout)
        # Bidirectional = 2 * hidden_dim
        self.fc = nn.Linear(hidden_dim * 2, 1)

    def forward(self, x_numeric, x_text, x_img=None):
        # Ignores x_numeric, x_img
        embedded = self.embedding(x_text)

        # LSTM output: output, (hidden, cell)
        # We use the final hidden state or max pooling.
        # Here we'll use the final hidden state of the last layer
        output, (hidden, cell) = self.lstm(embedded)

        # Concat the final forward and backward hidden states
        # hidden shape: [num_layers * num_directions, batch, hidden_dim]
        hidden_cat = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)

        dropped = self.dropout(hidden_cat)
        return torch.sigmoid(self.fc(dropped))

In [20]:
# Run BiLSTM Experiment
SAVE_LSTM = True # Set to False to skip saving

lstm_params = {
    'vocab_size': tokenizer.vocab_size,
    'embedding_dim': EMBEDDING_DIM,
    'hidden_dim': 128,
    'num_layers': 2,
    'dropout': DROPOUT
}

run_experiment(BiLSTM, "BiLSTM", lstm_params, "exp_3_bilstm", save_model=SAVE_LSTM)


=== Running Experiment: exp_3_bilstm (BiLSTM) ===
Saving Model: True
Experiment 'exp_3_bilstm' initialized at: /content/drive/MyDrive/fraud-grp-proj/experiments/exp_3_bilstm
Mode: Incremental saving (5 folds)

--- Fold 1/5 ---
Epoch 5/20 - Train Loss: 0.5033 - Val AUC: 0.7891
Epoch 10/20 - Train Loss: 0.2809 - Val AUC: 0.9397
Epoch 15/20 - Train Loss: 0.1689 - Val AUC: 0.9689
Epoch 20/20 - Train Loss: 0.0983 - Val AUC: 0.9758
Fold 1 Best Val AUC: 0.9758
  Fold 1/5 saved | ROC AUC: 0.9758

--- Fold 2/5 ---
Epoch 5/20 - Train Loss: 0.2925 - Val AUC: 0.9262
Epoch 10/20 - Train Loss: 0.2346 - Val AUC: 0.9487
Epoch 15/20 - Train Loss: 0.1639 - Val AUC: 0.9620
Epoch 20/20 - Train Loss: 0.1162 - Val AUC: 0.9679
Fold 2 Best Val AUC: 0.9679
  Fold 2/5 saved | ROC AUC: 0.9679

--- Fold 3/5 ---
Epoch 5/20 - Train Loss: 0.4129 - Val AUC: 0.8466
Epoch 10/20 - Train Loss: 0.2513 - Val AUC: 0.9404
Epoch 15/20 - Train Loss: 0.1703 - Val AUC: 0.9570
Epoch 20/20 - Train Loss: 0.0880 - Val AUC: 0.9672
F

(BiLSTM(
   (embedding): Embedding(101, 64, padding_idx=0)
   (lstm): LSTM(64, 128, num_layers=2, batch_first=True, dropout=0.3, bidirectional=True)
   (dropout): Dropout(p=0.3, inplace=False)
   (fc): Linear(in_features=256, out_features=1, bias=True)
 ),
 {'loss': 0.0766215846716647,
  'accuracy': 0.9048140043763676,
  'precision': 0.890295358649789,
  'recall': 0.9234135667396062,
  'f1': 0.9065520945220193,
  'roc_auc': np.float64(0.9579205550421596),
  'TP': np.int64(844),
  'FP': np.int64(104),
  'TN': np.int64(810),
  'FN': np.int64(70)},
 {'loss': 0.31688410463504174,
  'accuracy': 0.9151356080489939,
  'precision': 0.8990748528174937,
  'recall': 0.9352580927384077,
  'f1': 0.9168096054888508,
  'roc_auc': np.float64(0.968101701635502),
  'TP': np.int64(1069),
  'FP': np.int64(120),
  'TN': np.int64(1023),
  'FN': np.int64(74)})

### 4. Hybrid Model (Numeric + Text)

From the above results, we see that CharCNN outperforms BiLSTM, therefore, we will choose that for the hybrid model, along with a parallel MLP to capture the numeric features that we engineered.

In [21]:
# Define Hybrid Model Architecture
class HybridModel(nn.Module):
    def __init__(self, vocab_size, numeric_input_dim, embedding_dim=64, num_filters=128, filter_sizes=[3, 4, 5], dropout=0.3):
        super(HybridModel, self).__init__()

        # Text Branch (CNN)
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=embedding_dim,
                      out_channels=num_filters,
                      kernel_size=fs)
            for fs in filter_sizes
        ])
        self.text_out_dim = len(filter_sizes) * num_filters

        # Numeric Branch (MLP)
        self.numeric_fc = nn.Sequential(
            nn.Linear(numeric_input_dim, 128),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 64),
            nn.ReLU()
        )
        self.numeric_out_dim = 64

        # Combined
        self.fc_final = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(self.text_out_dim + self.numeric_out_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, x_numeric, x_text, x_img=None):
        # Text Path
        embedded = self.embedding(x_text).permute(0, 2, 1)
        conved = [F.relu(conv(embedded)) for conv in self.convs]
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        text_features = torch.cat(pooled, dim=1)

        # Numeric Path
        numeric_features = self.numeric_fc(x_numeric)

        # Combine
        combined = torch.cat((text_features, numeric_features), dim=1)
        return torch.sigmoid(self.fc_final(combined))

In [22]:
# Run Hybrid Model Experiment
SAVE_HYBRID = True # Set to False to skip saving

hybrid_params = {
    'vocab_size': tokenizer.vocab_size,
    'numeric_input_dim': X_numeric_train_scaled.shape[1],
    'embedding_dim': EMBEDDING_DIM,
    'num_filters': 128,
    'filter_sizes': [3, 4, 5],
    'dropout': DROPOUT
}

run_experiment(HybridModel, "HybridModel", hybrid_params, "exp_3_hybrid", save_model=SAVE_HYBRID)


=== Running Experiment: exp_3_hybrid (HybridModel) ===
Saving Model: True
Experiment 'exp_3_hybrid' initialized at: /content/drive/MyDrive/fraud-grp-proj/experiments/exp_3_hybrid
Mode: Incremental saving (5 folds)

--- Fold 1/5 ---
Epoch 5/20 - Train Loss: 0.1205 - Val AUC: 0.9872
Epoch 10/20 - Train Loss: 0.0648 - Val AUC: 0.9871
Epoch 15/20 - Train Loss: 0.0353 - Val AUC: 0.9883
Epoch 20/20 - Train Loss: 0.0351 - Val AUC: 0.9882
Fold 1 Best Val AUC: 0.9882
  Fold 1/5 saved | ROC AUC: 0.9882

--- Fold 2/5 ---
Epoch 5/20 - Train Loss: 0.1125 - Val AUC: 0.9811
Epoch 10/20 - Train Loss: 0.0678 - Val AUC: 0.9816
Epoch 15/20 - Train Loss: 0.0383 - Val AUC: 0.9829
Epoch 20/20 - Train Loss: 0.0248 - Val AUC: 0.9830
Fold 2 Best Val AUC: 0.9830
  Fold 2/5 saved | ROC AUC: 0.9830

--- Fold 3/5 ---
Epoch 5/20 - Train Loss: 0.1264 - Val AUC: 0.9819
Epoch 10/20 - Train Loss: 0.0778 - Val AUC: 0.9847
Epoch 15/20 - Train Loss: 0.0464 - Val AUC: 0.9851
Epoch 20/20 - Train Loss: 0.0311 - Val AUC: 0.9

(HybridModel(
   (embedding): Embedding(101, 64, padding_idx=0)
   (convs): ModuleList(
     (0): Conv1d(64, 128, kernel_size=(3,), stride=(1,))
     (1): Conv1d(64, 128, kernel_size=(4,), stride=(1,))
     (2): Conv1d(64, 128, kernel_size=(5,), stride=(1,))
   )
   (numeric_fc): Sequential(
     (0): Linear(in_features=16, out_features=128, bias=True)
     (1): ReLU()
     (2): Dropout(p=0.3, inplace=False)
     (3): Linear(in_features=128, out_features=64, bias=True)
     (4): ReLU()
   )
   (fc_final): Sequential(
     (0): Dropout(p=0.3, inplace=False)
     (1): Linear(in_features=448, out_features=64, bias=True)
     (2): ReLU()
     (3): Linear(in_features=64, out_features=1, bias=True)
   )
 ),
 {'loss': 0.045974995705826624,
  'accuracy': 0.9442013129102844,
  'precision': 0.953125,
  'recall': 0.9343544857768052,
  'f1': 0.943646408839779,
  'roc_auc': np.float64(0.9830924495688272),
  'TP': np.int64(854),
  'FP': np.int64(42),
  'TN': np.int64(872),
  'FN': np.int64(60)},
 {'

The hybrid model has the best fold performance so far, which tells us that adding in the numeric features does help improve performance.

Although the hybrid model is the best performing neural network so far, we are unlikely able to Optuna tune it due to the complexity of the model and training time required. For the neural network models, only MLP would be feasible for Optuna tuning, but since it performed worst, we will not pursue that further.