# Tabular Classification with PyTorch & Optuna

This notebook demonstrates advanced tabular classification using PyTorch, Optuna for hyperparameter tuning, and feature engineering.

**Data Split Strategy:**
- 80% of training data for model training and validation (cross-validation)
- 20% of training data held out for final evaluation metrics
- Final predictions made on test.csv

## 1. Import Libraries and Set Seed
Import all required libraries and set the random seed for reproducibility. Print the device being used (CPU/GPU).

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random
import os
import optuna
from optuna.samplers import TPESampler
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import f1_score, classification_report, confusion_matrix

# Set seed for reproducibility
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

SEED = 42
seed_everything(SEED)

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"‚úÖ Processing on: {DEVICE}")

BATCH_SIZE = 64
N_TRIALS = 500  # How many Optuna trials to run

‚úÖ Processing on: cpu


## 2. Load Data and Feature Engineering
Load train and test data, and apply advanced feature engineering.

In [2]:
def create_advanced_features(df):
    df = df.copy()
    activity_cols = ['hobby_engagement_level', 'physical_activity_index', 
                     'creative_expression_index', 'altruism_score']
    df['total_activity'] = df[activity_cols].sum(axis=1)
    df['support_guidance_combo'] = df['support_environment_score'] * (df['external_guidance_usage'] + 1)
    df['focus_efficiency'] = df['focus_intensity'] / (df['consistency_score'] + 1)
    df['consistency_gap'] = 30 - df['consistency_score']
    df['focus_sq'] = df['focus_intensity'] ** 2
    df['focus_X_consistency'] = df['focus_intensity'] * df['consistency_score']
    df['low_focus_high_consist'] = ((df['focus_intensity'] < 5) & (df['consistency_score'] > 24)).astype(int)
    return df

try:
    train_df = pd.read_csv('../dataset/train_20percent.csv')
    test_df = pd.read_csv('../dataset/test.csv')
except FileNotFoundError:
    raise FileNotFoundError("‚ùå Upload train_20percent.csv and test.csv!")

print(f"Total Training Data: {len(train_df)} samples")
print(f"Test Data: {len(test_df)} samples")

train_df = create_advanced_features(train_df)
test_df = create_advanced_features(test_df)

# Split training data into 80% train and 20% evaluation
X_full = train_df.drop(['participant_id', 'personality_cluster'], axis=1)
y_full = train_df['personality_cluster']

X_train, X_eval, y_train, y_eval = train_test_split(
    X_full, y_full, test_size=0.2, random_state=SEED, stratify=y_full
)

print(f"\nüìä Data Split:")
print(f"  Training Set (for model training & CV): {len(X_train)} samples (80%)")
print(f"  Evaluation Set (held out): {len(X_eval)} samples (20%)")
print(f"  Test Set (final predictions): {len(test_df)} samples")

test_ids = test_df['participant_id']
X_test = test_df.drop(['participant_id'], axis=1)

Total Training Data: 383 samples
Test Data: 479 samples

üìä Data Split:
  Training Set (for model training & CV): 306 samples (80%)
  Evaluation Set (held out): 77 samples (20%)
  Test Set (final predictions): 479 samples


## 3. Preprocess Data (Encoding & Scaling)
Encode categorical features and scale numerical features for train, eval, and test sets.

In [3]:
cat_cols = [
    'identity_code', 'cultural_background', 'age_group', 
    'upbringing_influence', 'support_environment_score', 
    'hobby_engagement_level', 'physical_activity_index',
    'creative_expression_index', 'altruism_score',
    'low_focus_high_consist'
]

cat_dims = []
for col in cat_cols:
    le = LabelEncoder()
    # Fit on all data to ensure consistency
    full_data = pd.concat([X_train[col], X_eval[col], X_test[col]], axis=0).astype(str)
    le.fit(full_data)
    X_train[col] = le.transform(X_train[col].astype(str))
    X_eval[col] = le.transform(X_eval[col].astype(str))
    X_test[col] = le.transform(X_test[col].astype(str))
    num_classes = len(le.classes_)
    emb_dim = min(50, (num_classes + 1) // 2)
    cat_dims.append((num_classes, emb_dim))

num_cols = [c for c in X_train.columns if c not in cat_cols]
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_eval[num_cols] = scaler.transform(X_eval[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# Convert to numpy arrays
X_train_cat = X_train[cat_cols].values.astype(np.int64)
X_train_num = X_train[num_cols].values.astype(np.float32)
X_eval_cat = X_eval[cat_cols].values.astype(np.int64)
X_eval_num = X_eval[num_cols].values.astype(np.float32)
X_test_cat = X_test[cat_cols].values.astype(np.int64)
X_test_num = X_test[num_cols].values.astype(np.float32)

## 4. Prepare Datasets and Class Weights
Encode target labels, compute class weights, and prepare PyTorch datasets for training and validation.

In [4]:
target_le = LabelEncoder()
y_train_encoded = target_le.fit_transform(y_train)
y_eval_encoded = target_le.transform(y_eval)
num_classes_target = len(target_le.classes_)

print(f"Number of classes: {num_classes_target}")
print(f"Classes: {target_le.classes_}")

class_weights = compute_class_weight('balanced', classes=np.unique(y_train_encoded), y=y_train_encoded)
class_weights_tensor = torch.tensor(class_weights, dtype=torch.float).to(DEVICE)
print(f"Class weights: {class_weights}")

Number of classes: 5
Classes: ['Cluster_A' 'Cluster_B' 'Cluster_C' 'Cluster_D' 'Cluster_E']
Class weights: [4.70769231 1.74857143 1.24897959 1.15471698 0.39230769]


## 5. Define Dynamic Model Architecture
Implement dataset and model classes for flexible tabular neural network modeling.

In [5]:
class AdvancedTabularDataset(Dataset):
    def __init__(self, x_cat, x_num, y=None):
        self.x_cat = torch.tensor(x_cat, dtype=torch.long)
        self.x_num = torch.tensor(x_num, dtype=torch.float)
        self.y = torch.tensor(y, dtype=torch.long) if y is not None else None
    def __len__(self): return len(self.x_cat)
    def __getitem__(self, idx):
        if self.y is not None: return self.x_cat[idx], self.x_num[idx], self.y[idx]
        return self.x_cat[idx], self.x_num[idx]

class DynamicTabularModel(nn.Module):
    def __init__(self, cat_dims, num_dim, output_dim, 
                 l1_size, l2_size, l3_size, 
                 emb_dropout, hidden_dropout):
        super(DynamicTabularModel, self).__init__()
        self.embeddings = nn.ModuleList([
            nn.Embedding(n, d) for n, d in cat_dims
        ])
        self.emb_dropout = nn.Dropout(emb_dropout)
        total_emb_dim = sum([d for _, d in cat_dims])
        input_dim = total_emb_dim + num_dim
        self.bn0 = nn.BatchNorm1d(num_dim)
        self.fc1 = nn.Linear(input_dim, l1_size)
        self.bn1 = nn.BatchNorm1d(l1_size)
        self.act1 = nn.GELU()
        self.drop1 = nn.Dropout(hidden_dropout)
        self.fc2 = nn.Linear(l1_size, l2_size)
        self.bn2 = nn.BatchNorm1d(l2_size)
        self.act2 = nn.GELU()
        self.drop2 = nn.Dropout(hidden_dropout)
        self.fc3 = nn.Linear(l2_size, l3_size)
        self.bn3 = nn.BatchNorm1d(l3_size)
        self.act3 = nn.GELU()
        self.drop3 = nn.Dropout(hidden_dropout / 2)
        self.output = nn.Linear(l3_size, output_dim)
    def forward(self, x_cat, x_num):
        emb_list = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        x_emb = self.emb_dropout(torch.cat(emb_list, dim=1))
        x_num = self.bn0(x_num)
        x = torch.cat([x_emb, x_num], dim=1)
        x = self.drop1(self.act1(self.bn1(self.fc1(x))))
        x = self.drop2(self.act2(self.bn2(self.fc2(x))))
        x = self.drop3(self.act3(self.bn3(self.fc3(x))))
        return self.output(x)

## 6. Optuna Hyperparameter Tuning
Define the Optuna objective function and run the study to find the best hyperparameters using cross-validation on the 80% training set.

In [6]:
print(f"--- Starting Optuna Tuning ({N_TRIALS} Trials) ---")

def objective(trial):
    l1_size = trial.suggest_int('l1_size', 128, 512)
    l2_size = trial.suggest_int('l2_size', 64, 256)
    l3_size = trial.suggest_int('l3_size', 32, 128)
    emb_dropout = trial.suggest_float('emb_dropout', 0.1, 0.5)
    hidden_dropout = trial.suggest_float('hidden_dropout', 0.1, 0.5)
    lr = trial.suggest_float('lr', 1e-4, 1e-2, log=True)
    weight_decay = trial.suggest_float('weight_decay', 1e-6, 1e-3, log=True)
    
    # Use 3-fold CV on the 80% training set
    skf_tune = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
    scores = []
    
    for train_idx, val_idx in skf_tune.split(X_train_num, y_train_encoded):
        train_ds = AdvancedTabularDataset(X_train_cat[train_idx], X_train_num[train_idx], y_train_encoded[train_idx])
        val_ds = AdvancedTabularDataset(X_train_cat[val_idx], X_train_num[val_idx], y_train_encoded[val_idx])
        train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
        val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE*2, shuffle=False, num_workers=0)
        
        model = DynamicTabularModel(cat_dims, X_train_num.shape[1], num_classes_target,
                                    l1_size, l2_size, l3_size, emb_dropout, hidden_dropout).to(DEVICE)
        criterion = nn.CrossEntropyLoss(weight=class_weights_tensor)
        optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
        
        for epoch in range(25):
            model.train()
            for c, n, y_batch in train_loader:
                c, n, y_batch = c.to(DEVICE), n.to(DEVICE), y_batch.to(DEVICE)
                optimizer.zero_grad()
                loss = criterion(model(c, n), y_batch)
                loss.backward()
                optimizer.step()
        
        model.eval()
        preds, labels = [], []
        with torch.no_grad():
            for c, n, y_batch in val_loader:
                c, n = c.to(DEVICE), n.to(DEVICE)
                p = torch.argmax(model(c, n), dim=1)
                preds.extend(p.cpu().numpy())
                labels.extend(y_batch.numpy())
        scores.append(f1_score(labels, preds, average='macro'))
        
        trial.report(scores[-1], 0)
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()
    
    return np.mean(scores)

sampler = TPESampler(seed=SEED)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=N_TRIALS)

print("\n‚úÖ BEST PARAMS FOUND:")
print(study.best_params)
best_p = study.best_params

[I 2025-12-02 22:19:33,987] A new study created in memory with name: no-name-2e88a9ce-1c25-48a0-a619-b78e85197a56


--- Starting Optuna Tuning (500 Trials) ---


[I 2025-12-02 22:19:36,542] Trial 0 finished with value: 0.4223883740325754 and parameters: {'l1_size': 272, 'l2_size': 247, 'l3_size': 103, 'emb_dropout': 0.3394633936788146, 'hidden_dropout': 0.1624074561769746, 'lr': 0.00020511104188433984, 'weight_decay': 1.493656855461763e-06}. Best is trial 0 with value: 0.4223883740325754.
[I 2025-12-02 22:19:37,822] Trial 1 finished with value: 0.4791580703858734 and parameters: {'l1_size': 461, 'l2_size': 180, 'l3_size': 100, 'emb_dropout': 0.10823379771832098, 'hidden_dropout': 0.4879639408647978, 'lr': 0.004622589001020831, 'weight_decay': 4.335281794951567e-06}. Best is trial 1 with value: 0.4791580703858734.
[I 2025-12-02 22:19:38,846] Trial 2 finished with value: 0.41215589025954374 and parameters: {'l1_size': 198, 'l2_size': 99, 'l3_size': 61, 'emb_dropout': 0.3099025726528951, 'hidden_dropout': 0.2727780074568463, 'lr': 0.0003823475224675188, 'weight_decay': 6.847920095574779e-05}. Best is trial 1 with value: 0.4791580703858734.
[I 2025


‚úÖ BEST PARAMS FOUND:
{'l1_size': 151, 'l2_size': 204, 'l3_size': 93, 'emb_dropout': 0.34364460300893857, 'hidden_dropout': 0.14242178538631708, 'lr': 0.004705820212650299, 'weight_decay': 8.53319582166021e-05}


## 7. Final Model Training with Best Hyperparameters
Retrain the model using the best hyperparameters found by Optuna on full 5-fold cross-validation using the 80% training set. Aggregate test predictions across folds.

In [7]:
print("\n--- Retraining Final Model with Best Params (5 Folds on 80% Training Data) ---")

skf_final = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
test_probs_sum = np.zeros((len(X_test), num_classes_target))
final_f1_scores = []
FINAL_EPOCHS = 50

for fold, (train_idx, val_idx) in enumerate(skf_final.split(X_train_num, y_train_encoded)):
    train_ds = AdvancedTabularDataset(X_train_cat[train_idx], X_train_num[train_idx], y_train_encoded[train_idx])
    val_ds = AdvancedTabularDataset(X_train_cat[val_idx], X_train_num[val_idx], y_train_encoded[val_idx])
    train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
    val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE*2, shuffle=False, num_workers=0)
    
    model = DynamicTabularModel(
        cat_dims, X_train_num.shape[1], num_classes_target,
        best_p['l1_size'], best_p['l2_size'], best_p['l3_size'],
        best_p['emb_dropout'], best_p['hidden_dropout']
    ).to(DEVICE)
    
    criterion = nn.CrossEntropyLoss(weight=class_weights_tensor, label_smoothing=0.1)
    optimizer = optim.AdamW(model.parameters(), lr=best_p['lr'], weight_decay=best_p['weight_decay'])
    scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=best_p['lr']*10, 
                                              steps_per_epoch=len(train_loader), epochs=FINAL_EPOCHS)
    
    best_val_f1 = 0
    best_state = None
    
    for epoch in range(FINAL_EPOCHS):
        model.train()
        for c, n, y_batch in train_loader:
            c, n, y_batch = c.to(DEVICE), n.to(DEVICE), y_batch.to(DEVICE)
            optimizer.zero_grad()
            loss = criterion(model(c, n), y_batch)
            loss.backward()
            optimizer.step()
            scheduler.step()
        
        model.eval()
        preds, labels = [], []
        with torch.no_grad():
            for c, n, y_batch in val_loader:
                c, n = c.to(DEVICE), n.to(DEVICE)
                p = torch.argmax(model(c, n), dim=1)
                preds.extend(p.cpu().numpy())
                labels.extend(y_batch.numpy())
        
        f1 = f1_score(labels, preds, average='macro')
        if f1 > best_val_f1:
            best_val_f1 = f1
            best_state = model.state_dict()
    
    print(f"Fold {fold+1} | Best Validation F1: {best_val_f1:.4f}")
    final_f1_scores.append(best_val_f1)
    
    # Load best model state
    model.load_state_dict(best_state)
    model.eval()
    
    # Generate predictions on test set
    test_ds = AdvancedTabularDataset(X_test_cat, X_test_num)
    test_loader = DataLoader(test_ds, batch_size=BATCH_SIZE*2, shuffle=False, num_workers=0)
    fold_probs = []
    with torch.no_grad():
        for c, n in test_loader:
            c, n = c.to(DEVICE), n.to(DEVICE)
            probs = torch.softmax(model(c, n), dim=1)
            fold_probs.append(probs.cpu().numpy())
    test_probs_sum += np.concatenate(fold_probs)

print(f"\nüèÜ Average Cross-Validation F1 (on 80% training): {np.mean(final_f1_scores):.4f} ¬± {np.std(final_f1_scores):.4f}")


--- Retraining Final Model with Best Params (5 Folds on 80% Training Data) ---
Fold 1 | Best Validation F1: 0.5911
Fold 2 | Best Validation F1: 0.7183
Fold 3 | Best Validation F1: 0.5231
Fold 4 | Best Validation F1: 0.5539
Fold 5 | Best Validation F1: 0.7447

üèÜ Average Cross-Validation F1 (on 80% training): 0.6262 ¬± 0.0890


## 8. Evaluate on Held-Out 20% Evaluation Set
Evaluate the final model performance on the 20% held-out evaluation set to get unbiased metrics.

In [8]:
print("\n--- Evaluating on Held-Out 20% Evaluation Set ---")

# Train final model on entire 80% training set for evaluation
full_train_ds = AdvancedTabularDataset(X_train_cat, X_train_num, y_train_encoded)
full_train_loader = DataLoader(full_train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)

eval_model = DynamicTabularModel(
    cat_dims, X_train_num.shape[1], num_classes_target,
    best_p['l1_size'], best_p['l2_size'], best_p['l3_size'],
    best_p['emb_dropout'], best_p['hidden_dropout']
).to(DEVICE)

criterion = nn.CrossEntropyLoss(weight=class_weights_tensor, label_smoothing=0.1)
optimizer = optim.AdamW(eval_model.parameters(), lr=best_p['lr'], weight_decay=best_p['weight_decay'])
scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=best_p['lr']*10, 
                                          steps_per_epoch=len(full_train_loader), epochs=FINAL_EPOCHS)

print(f"Training on full 80% training set for {FINAL_EPOCHS} epochs...")
for epoch in range(FINAL_EPOCHS):
    eval_model.train()
    for c, n, y_batch in full_train_loader:
        c, n, y_batch = c.to(DEVICE), n.to(DEVICE), y_batch.to(DEVICE)
        optimizer.zero_grad()
        loss = criterion(eval_model(c, n), y_batch)
        loss.backward()
        optimizer.step()
        scheduler.step()
    
    if (epoch + 1) % 10 == 0:
        print(f"  Epoch {epoch+1}/{FINAL_EPOCHS} completed")

# Evaluate on held-out 20%
eval_ds = AdvancedTabularDataset(X_eval_cat, X_eval_num, y_eval_encoded)
eval_loader = DataLoader(eval_ds, batch_size=BATCH_SIZE*2, shuffle=False, num_workers=0)

eval_model.eval()
eval_preds, eval_labels = [], []
with torch.no_grad():
    for c, n, y_batch in eval_loader:
        c, n = c.to(DEVICE), n.to(DEVICE)
        p = torch.argmax(eval_model(c, n), dim=1)
        eval_preds.extend(p.cpu().numpy())
        eval_labels.extend(y_batch.numpy())

eval_f1 = f1_score(eval_labels, eval_preds, average='macro')
print(f"\nüìä EVALUATION METRICS (20% Held-Out Set):")
print(f"Macro F1 Score: {eval_f1:.4f}")
print("\nClassification Report:")
print(classification_report(eval_labels, eval_preds, target_names=target_le.classes_))
print("\nConfusion Matrix:")
print(confusion_matrix(eval_labels, eval_preds))


--- Evaluating on Held-Out 20% Evaluation Set ---
Training on full 80% training set for 50 epochs...
  Epoch 10/50 completed
  Epoch 20/50 completed
  Epoch 30/50 completed
  Epoch 40/50 completed
  Epoch 50/50 completed

üìä EVALUATION METRICS (20% Held-Out Set):
Macro F1 Score: 0.4542

Classification Report:
              precision    recall  f1-score   support

   Cluster_A       0.00      0.00      0.00         4
   Cluster_B       0.38      0.33      0.35         9
   Cluster_C       0.35      0.58      0.44        12
   Cluster_D       0.75      0.46      0.57        13
   Cluster_E       0.92      0.90      0.91        39

    accuracy                           0.66        77
   macro avg       0.48      0.46      0.45        77
weighted avg       0.69      0.66      0.67        77


Confusion Matrix:
[[ 0  2  1  0  1]
 [ 3  3  3  0  0]
 [ 0  3  7  2  0]
 [ 0  0  5  6  2]
 [ 0  0  4  0 35]]


## 9. Save Predictions and Submission Files
Save the test set prediction probabilities and final submission file as CSVs.

In [10]:
# Average test predictions from 5-fold CV
avg_test_probs = test_probs_sum / 5

# Save Probs
prob_df = pd.DataFrame(avg_test_probs, columns=[f'prob_{i}' for i in range(num_classes_target)])
prob_df['participant_id'] = test_ids.values
prob_df.to_csv('nn_optuna_probs.csv', index=False)
print("‚úÖ Saved 'nn_optuna_probs.csv'")

# Save Submission
final_indices = np.argmax(avg_test_probs, axis=1)
final_labels = target_le.inverse_transform(final_indices)
submission_df = pd.DataFrame({
    'participant_id': test_ids,
    'personality_cluster': final_labels
})
submission_df.to_csv('submission_nn_optuna_20pca.csv', index=False)
print("‚úÖ Saved 'submission_nn_optuna_20pca.csv'")

print("\n" + "="*60)
print("SUMMARY:")
print(f"  Training Set CV F1: {np.mean(final_f1_scores):.4f} ¬± {np.std(final_f1_scores):.4f}")
print(f"  Evaluation Set F1 (20% held-out): {eval_f1:.4f}")
print(f"  Test Predictions: {len(submission_df)} samples")
print("="*60)

‚úÖ Saved 'nn_optuna_probs.csv'
‚úÖ Saved 'submission_nn_optuna_20pca.csv'

SUMMARY:
  Training Set CV F1: 0.6262 ¬± 0.0890
  Evaluation Set F1 (20% held-out): 0.4542
  Test Predictions: 479 samples
