# Neural Networks with PyTorch
Note: This notebook applies a neural network (nn) to the problem as an educational extension and evaluates its performance for comparison. Because this dataset is small and tabular, a tree-based model may be more appropriate.  


### Model Selection
Since our dataset is small, our research suggests that a small feed-forward MLP is a good candidate.  

A shallow MLP gives a flexible nonlinear decision boundary (via ReLU) while keeping variance controlled through small width, dropout/L2 regularization, and early stopping.  

Moreover, MLPs learn smooth functions, which may be beneficial with standardized tabular inputs.  

A compact MLP introduces enough nonlinearity to model interactions among the 12 clinical features while keeping capacity, variance, and training instability in check.
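
To make "compact" concrete, here is a minimal sketch of what such a network could look like. The class name `CompactMLP`, the hidden widths (16 and 8), the dropout rate, the learning rate, and the weight decay are all illustrative assumptions, not the configuration used later in this notebook. L2 regularization is expressed through the optimizer's `weight_decay` argument.

```python
import torch
import torch.nn as nn


class CompactMLP(nn.Module):
    """Hypothetical two-hidden-layer MLP for 12 tabular features."""

    def __init__(self, n_features=12, hidden1=16, hidden2=8, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden1),
            nn.ReLU(),            # nonlinearity
            nn.Dropout(p_drop),   # regularization (assumed rate)
            nn.Linear(hidden1, hidden2),
            nn.ReLU(),
            nn.Linear(hidden2, 1),  # single raw logit for binary classification
        )

    def forward(self, x):
        return self.net(x)


sketch_model = CompactMLP()

# weight_decay adds an L2 penalty on the weights (assumed value)
sketch_optimizer = torch.optim.Adam(
    sketch_model.parameters(), lr=1e-3, weight_decay=1e-4
)

# Count trainable parameters to see how small the model is
n_params = sum(p.numel() for p in sketch_model.parameters())

# Forward pass on a dummy batch of 4 samples with 12 features each
sketch_logits = sketch_model(torch.randn(4, 12))
print(n_params, sketch_logits.shape)
```

With these assumed widths the network has only a few hundred parameters, which is in line with a training set of fewer than 200 samples.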

In [7]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

## Data Preparation and Loading 

In [8]:
# Load data - splits were created in previous notebook 
X_train = pd.read_csv("../data/X_train.csv")
y_train = pd.read_csv("../data/y_train.csv").values.ravel()  # Convert to 1D array

X_val = pd.read_csv("../data/X_val.csv")
y_val = pd.read_csv("../data/y_val.csv").values.ravel()  # Convert to 1D array

print(f"Training set: {X_train.shape[0]} samples")
print(f"Validation set: {X_val.shape[0]} samples")
print(f"Class distribution in training: {np.bincount(y_train)}")
print(f"Class distribution in validation: {np.bincount(y_val)}")

Training set: 179 samples
Validation set: 45 samples
Class distribution in training: [130  49]
Class distribution in validation: [33 12]


In [9]:
continuous_features = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']
categorical_features = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking']

In [10]:
# ColumnTransformer: standardize continuous features, pass binary features through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), continuous_features),
        ('cat', 'passthrough', categorical_features)
    ]
)

In [11]:
X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
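
The scaler is fit on the training split only and then reused on the validation split, so no validation statistics leak into preprocessing. The sketch below illustrates this on synthetic data and checks that the scaled training columns have roughly zero mean and unit standard deviation. The `toy_*` names and the randomly generated values are assumptions for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for two continuous features and one binary feature
toy_train = pd.DataFrame({
    "age": rng.normal(60, 12, size=100),
    "serum_sodium": rng.normal(137, 4, size=100),
    "anaemia": rng.integers(0, 2, size=100),
})
toy_val = pd.DataFrame({
    "age": rng.normal(60, 12, size=25),
    "serum_sodium": rng.normal(137, 4, size=25),
    "anaemia": rng.integers(0, 2, size=25),
})

toy_pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "serum_sodium"]),
        ("cat", "passthrough", ["anaemia"]),
    ]
)

# Fit on training data only, then apply the same statistics to validation data
toy_train_proc = toy_pre.fit_transform(toy_train)
toy_val_proc = toy_pre.transform(toy_val)

# Output columns follow transformer order: scaled columns first, passthrough last
train_means = toy_train_proc[:, :2].mean(axis=0)
train_stds = toy_train_proc[:, :2].std(axis=0)
print("train means:", train_means)
print("train stds: ", train_stds)
print("val means:  ", toy_val_proc[:, :2].mean(axis=0))
```

The validation means will generally not be exactly zero, which is expected: they are scaled with the training statistics.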


## PyTorch Tensors

In [12]:
# Convert to PyTorch tensors
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Train Tensors
X_train_tensor = torch.tensor(X_train_processed, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)

# Validation Tensors
X_val_tensor = torch.tensor(X_val_processed, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32).unsqueeze(1)

# Create DataLoader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)

# shuffle=True prevents the model from seeing the data in the same order every epoch,
# reducing correlation between successive batches and improving stochastic optimization.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
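
A quick sanity check on a loader is to look at the batch shapes it yields. The sketch below uses random tensors with the same dimensions as the training set (179 samples, 12 features); the `toy_*` names and values are assumptions for illustration. With `batch_size=32`, 179 samples give five full batches and a final batch of 19.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Random stand-ins with the training set's dimensions
toy_X = torch.randn(179, 12)
toy_y = torch.randint(0, 2, (179,)).float().unsqueeze(1)  # shape (179, 1)

toy_loader = DataLoader(TensorDataset(toy_X, toy_y), batch_size=32, shuffle=True)

# Collect the size of every batch in one pass over the data
batch_sizes = [xb.shape[0] for xb, yb in toy_loader]

# Inspect the shapes of a single batch
first_X, first_y = next(iter(toy_loader))
print(batch_sizes)
print(first_X.shape, first_y.shape)
```

The target tensor keeps its trailing dimension of 1, so it lines up with a model that outputs one value per sample.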
        

## Create Torch model 