### Exercise - NN for Regression

This dataset encompasses details about various workers and their corresponding employment levels, featuring a diverse set of attributes ranging from categorical to continuous. Initialize the data loading process using the suitable Pandas function, and meticulously inspect for any instances of null or duplicated data. Specifically focusing on the **“salary in usd”** feature, identify and eliminate outliers while devising a strategy to address any missing values.

Then, use any method you like to encode the categorical features, namely **“work year”, “experience level”, “employment type”, “job title”, “employee residence”, “remote ratio”, “company location”, and “company size”**. You may consider to employ the sklearn LabelEncoder class<sup>1</sup>.

Following the preprocessing steps, normalize the dataset utilizing the z-score technique to ensure consistent scaling across features. Subsequently, construct a neural network using **PyTorch**, incorporating **2 hidden layers with 5 and 3 neurons**, respectively. Carefully select an appropriate learning rate and normalization value for optimal model training.

Furthermore, assess the model’s performance using a relevant evaluation metric, ensuring a comprehensive understanding of its effectiveness in handling the given employment dataset. Finally, find the best hyperparameter combination (namely **lr** and **weight decay**) using both the **Grid Search** and the **k-fold cross validation** methods.

<sup>1</sup>https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler

from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
import itertools

from sklearn.metrics import mean_absolute_error

import torch
torch.__version__

'2.5.1+cu118'

In [2]:
from torch import nn
from torch import optim

device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [3]:
# Load the dataset
data_url = "./datasets/ds_salaries.csv"
df = pd.read_csv(data_url)

# Inspect the dataset
display(df)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M
...,...,...,...,...,...,...,...,...,...,...,...
3750,2020,SE,FT,Data Scientist,412000,USD,412000,US,100,US,L
3751,2021,MI,FT,Principal Data Scientist,151000,USD,151000,US,100,US,L
3752,2020,EN,FT,Data Scientist,105000,USD,105000,US,100,US,S
3753,2020,EN,CT,Business Data Analyst,100000,USD,100000,US,100,US,L


In [4]:
# Drop not useful features
df = df.drop(columns=['salary', 'salary_currency'])
print(df.columns)

Index(['work_year', 'experience_level', 'employment_type', 'job_title',
       'salary_in_usd', 'employee_residence', 'remote_ratio',
       'company_location', 'company_size'],
      dtype='object')


In [5]:
# Check for missing values
print('Missing values:')
print(df.isnull().sum())

# Check for duplicates
print('Duplicates: ', df.duplicated().sum())


Missing values:
work_year             0
experience_level      0
employment_type       0
job_title             0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64
Duplicates:  1171


In [6]:
# So we found that we do not have missing values, but we have 1171 duplicates. Let's drop them.
df = df.drop_duplicates()
print(df.size)

23256


In [7]:
# Handle missing values in "salary in usd" (even if in this case we don't have any)
# For the sake of simplicity, let's fill missing values with the median salary.
df['salary_in_usd'] = df['salary_in_usd'].fillna(df['salary_in_usd'].median())

# Identify and remove outliers in "salary in usd" using IQR
alpha = 1.5
Q1 = df['salary_in_usd'].quantile(0.25)
Q3 = df['salary_in_usd'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - alpha * IQR
upper_bound = Q3 + alpha * IQR

# we take just entries within the range [lower_bound, upper_bound]
df = df[(df['salary_in_usd'] >= lower_bound) & (df['salary_in_usd'] <= upper_bound)]

print(df.size)

In [9]:
# Define the categorical features
categorical_features = ['work_year', 'experience_level', 'employment_type', 
                        'job_title', 'employee_residence', 'remote_ratio', 
                        'company_location', 'company_size']

# Initialize LabelEncoder
le = LabelEncoder()

# Apply LabelEncoder to each categorical column
for col in categorical_features:
    df[col] = le.fit_transform(df[col])

In [10]:
display(df)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,3,3,2,83,85847,26,2,25,0
1,3,2,0,65,30000,74,2,70,2
2,3,2,0,65,25500,74,2,70,2
3,3,3,2,46,175000,11,2,12,1
4,3,3,2,46,120000,11,2,12,1
...,...,...,...,...,...,...,...,...,...
3749,1,3,2,48,165000,74,2,70,0
3751,1,2,2,83,151000,74,2,70,0
3752,0,0,2,46,105000,74,2,70,2
3753,0,0,0,17,100000,74,2,70,0


In [11]:
# Split the dataset into features and target variable
X = df.drop('salary_in_usd', axis=1).values
y = df['salary_in_usd'].values

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Features normalization
# Normalize the features using Z-score (so, excluding target variable 'salary_in_usd')
features_scaler = StandardScaler() # or RobustScaler()
X_train_std = features_scaler.fit_transform(X_train)
X_test_std = features_scaler.transform(X_test)

In [12]:
display(X_train_std)
display(y_train)

array([[-3.08053138, -2.43217262,  0.03952792, ...,  1.0219463 ,
         0.53774805,  2.41820964],
       [-1.74054284, -2.43217262,  0.03952792, ..., -0.01676988,
        -2.16777496,  0.21748813],
       [-0.4005543 ,  0.67169471,  0.03952792, ...,  1.0219463 ,
         0.53774805,  0.21748813],
       ...,
       [-1.74054284,  0.67169471,  0.03952792, ...,  1.0219463 ,
        -1.09577528, -1.98323338],
       [-0.4005543 ,  0.67169471,  0.03952792, ..., -1.05548606,
         0.53774805,  0.21748813],
       [ 0.93943424,  0.67169471,  0.03952792, ...,  1.0219463 ,
         0.53774805,  0.21748813]])

array([105000,  21844, 186000, ...,  54094, 249500, 213580])

In [13]:
# Let's transform y_train and y_test to be column vectors (so they will match output layer of our NN)
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)

In [14]:
display(y_train)

array([[105000],
       [ 21844],
       [186000],
       ...,
       [ 54094],
       [249500],
       [213580]])

In [15]:
# Convert data to torch tensors
X_train_std = torch.from_numpy(X_train_std).float().to(device)
X_test_std = torch.from_numpy(X_test_std).float().to(device)
y_train = torch.from_numpy(y_train).float().to(device)
y_test= torch.from_numpy(y_test).float().to(device)

In [16]:
print(X_train_std)

tensor([[-3.0805, -2.4322,  0.0395,  ...,  1.0219,  0.5377,  2.4182],
        [-1.7405, -2.4322,  0.0395,  ..., -0.0168, -2.1678,  0.2175],
        [-0.4006,  0.6717,  0.0395,  ...,  1.0219,  0.5377,  0.2175],
        ...,
        [-1.7405,  0.6717,  0.0395,  ...,  1.0219, -1.0958, -1.9832],
        [-0.4006,  0.6717,  0.0395,  ..., -1.0555,  0.5377,  0.2175],
        [ 0.9394,  0.6717,  0.0395,  ...,  1.0219,  0.5377,  0.2175]],
       device='cuda:0')


In [17]:
# Let's define our model
class NN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(NN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size[0]) # Input -> First Hidden Layer
        self.fc2 = nn.Linear(hidden_size[0], hidden_size[1]) # First Hidden Layer layer -> Second Hidden Layer
        self.fc3 = nn.Linear(hidden_size[1], output_size)  # Second Hidden Layer -> Output Layer
        self.sigmoid = nn.Sigmoid() # in regression task the sigmoid is not applied at the output layer

    def forward(self, x):
        x = self.fc1(x)
        x = self.sigmoid(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        x = self.fc3(x)
        return x

In [18]:
torch.manual_seed(42)

# Initialize the model
input_size = X_train.shape[1]
hidden_size = [5, 3]
output_size = 1  # Regression task for predicting salary_in_usd

model = NN(input_size, hidden_size, output_size).to(device)

# Loss function and optimizer
criterion = nn.MSELoss()  # Mean Squared Error for regression
optimizer = optim.SGD(model.parameters(), lr=0.001)  # SGD optimizer

In [19]:
# Training loop
num_epochs = 1000
for epoch in range(num_epochs):
    model.train()
    
    # Forward pass
    preds_train = model(X_train_std)
    loss = criterion(preds_train, y_train)
    
    # Backward pass and optimization
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    
    if (epoch+1) % 100 == 0:
        # Evaluate the model on the test data
        model.eval()
        with torch.no_grad():
            # Compute MSE and MAE for test data (just for logging)
            preds_test = model(X_test_std)
            test_mse = criterion(preds_test, y_test)
            test_mae = mean_absolute_error(preds_test.cpu(), y_test.cpu())

        print(f"Epoch: {epoch+1} | Training loss (MSE): {loss.item():.5f} | Test MSE: {test_mse.item():.5f}, Test MAE: {test_mae:.5f}")


Epoch: 100 | Training loss (MSE): 10139148288.00000 | Test MSE: 10078524416.00000, Test MAE: 82511.83594
Epoch: 200 | Training loss (MSE): 5151995392.00000 | Test MSE: 5215097856.00000, Test MAE: 57558.57422
Epoch: 300 | Training loss (MSE): 4151564544.00000 | Test MSE: 4250427648.00000, Test MAE: 52156.82031
Epoch: 400 | Training loss (MSE): 3950876928.00000 | Test MSE: 4061814528.00000, Test MAE: 51232.86328
Epoch: 500 | Training loss (MSE): 3910618368.00000 | Test MSE: 4026173184.00000, Test MAE: 51065.75781
Epoch: 600 | Training loss (MSE): 3902542336.00000 | Test MSE: 4020006656.00000, Test MAE: 51039.69922
Epoch: 700 | Training loss (MSE): 3900922368.00000 | Test MSE: 4019209728.00000, Test MAE: 51046.93359
Epoch: 800 | Training loss (MSE): 3900597248.00000 | Test MSE: 4019247104.00000, Test MAE: 51061.28906
Epoch: 900 | Training loss (MSE): 3900531968.00000 | Test MSE: 4019343104.00000, Test MAE: 51067.71484
Epoch: 1000 | Training loss (MSE): 3900518656.00000 | Test MSE: 4019401

Since we want to perform **hyperparameter tuning** using **Grid Search** with **k-fold cross validation** it's efficient to encapsulate the entire process in a single method. This approach allows us to easily repeat the training and evaluation process as needed. Let's create a method to handle this.

In [20]:
def train_and_evaluate(model, criterion, optimizer, X_train_fold, y_train_fold, X_val_fold, y_val_fold, fold, num_epochs=1000):
    """
    Train and evaluate a PyTorch model.

    Parameters:
        model (torch.nn.Module): The neural network model.
        criterion (torch.nn.Module): The loss function (e.g., nn.MSELoss).
        optimizer (torch.optim.Optimizer): Optimizer for training (e.g., Adam, SGD).
        X_train (torch.Tensor): Training features.
        y_train (torch.Tensor): Training target values.
        X_test (torch.Tensor): Test features.
        y_test (torch.Tensor): Test target values.
        num_epochs (int): Number of training epochs.

    Returns:
        float: Final test loss (MSE).
    """

    for epoch in range(num_epochs):   
        model.train()
        # Forward pass
        preds_train = model(X_train_fold)
        loss = criterion(preds_train, y_train_fold)
        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    
        if (epoch+1) % num_epochs == 0: # we print just the last step results
            # Evaluate the model on the test data
            model.eval()
            with torch.no_grad():
                # Compute MSE and MAE for validation data (just for logging)
                preds_val = model(X_val_fold)      
                val_mse = criterion(preds_val, y_val_fold)
                val_mae = mean_absolute_error(preds_val.cpu(), y_val_fold.cpu())
            print(f"Fold: {fold} | Val loss (MSE): {val_mse.item():.5f}, Val MAE: {val_mae:.5f}")

    return val_mse.item()

Okay, now we are ready to actually perform hyperparameter tuning ...

In [21]:
# Let's convert back into numpy arrays (so we can perform the k fold splitting)
X_train_std = X_train_std.cpu().numpy()
X_test_std = X_test_std.cpu().numpy()
y_train = y_train.cpu().numpy()
y_test = y_test.cpu().numpy()

display(X_train_std)
display(y_train)

array([[-3.0805314 , -2.4321725 ,  0.03952792, ...,  1.0219463 ,
         0.53774804,  2.4182096 ],
       [-1.7405429 , -2.4321725 ,  0.03952792, ..., -0.01676988,
        -2.167775  ,  0.21748814],
       [-0.4005543 ,  0.6716947 ,  0.03952792, ...,  1.0219463 ,
         0.53774804,  0.21748814],
       ...,
       [-1.7405429 ,  0.6716947 ,  0.03952792, ...,  1.0219463 ,
        -1.0957752 , -1.9832333 ],
       [-0.4005543 ,  0.6716947 ,  0.03952792, ..., -1.0554861 ,
         0.53774804,  0.21748814],
       [ 0.93943423,  0.6716947 ,  0.03952792, ...,  1.0219463 ,
         0.53774804,  0.21748814]], dtype=float32)

array([[105000.],
       [ 21844.],
       [186000.],
       ...,
       [ 54094.],
       [249500.],
       [213580.]], dtype=float32)

In [22]:
# Define 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

best_params = None
best_loss = float('inf')

# Define hyperparameter grid
param_grid = {
    'lr': [0.001, 0.005, 0.01],
    'weight_decay': [0.0, 0.01, 0.05]
}

# Convert param_grid to all combinations of hyperparameters
param_combinations = list(itertools.product(*param_grid.values()))

# Iterate over all parameter combinations
for params in param_combinations:
    lr, weight_decay = params

    print(f"Testing parameters: lr={lr}, weight_decay={weight_decay}")

    fold_losses = []
    fold=0

    # Iteratively consider all fold configurations
    for train_index, val_index in kf.split(X_train_std):
        fold=fold +1
        # Split data into train and validation sets for the current fold
        X_train_fold = X_train_std[train_index]
        y_train_fold = y_train[train_index]
        X_val_fold = X_train_std[val_index]
        y_val_fold = y_train[val_index]

        # Convert data to torch tensors
        X_train_fold = torch.from_numpy(X_train_fold).float().to(device)
        y_train_fold = torch.from_numpy(y_train_fold).float().to(device)
        X_val_fold = torch.from_numpy(X_val_fold).float().to(device)
        y_val_fold= torch.from_numpy(y_val_fold).float().to(device)

        # Initialize model, optimizer (with the lr and weight decay values to test) and criterion
        model = NN(input_size=X_train_fold.shape[1], hidden_size=[5, 3], output_size=1).to(device)
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
        criterion = nn.MSELoss()

        # Train and evaluate the model on the current fold
        fold_loss = train_and_evaluate(
            model=model,
            criterion=criterion,
            optimizer=optimizer,
            X_train_fold=X_train_fold,
            y_train_fold=y_train_fold,
            X_val_fold=X_val_fold,
            y_val_fold=y_val_fold,
            fold=fold
        )
    fold_losses.append(fold_loss)

    # Calculate average loss across all folds
    avg_loss = sum(fold_losses) / len(fold_losses)
    print(f"Avg Loss for params {params}: {avg_loss:.5f}")

    # Update the best parameters if the current configuration is better
    if avg_loss < best_loss:
        best_loss = avg_loss
        best_params = params

print('\n')
print(f"Best Params: lr={best_params[0]}, weight_decay={best_params[1]}")
print(f"Best Loss: {best_loss:.5f}")

Testing parameters: lr=0.001, weight_decay=0.0
Fold: 1 | Val loss (MSE): 3728678656.00000, Val MAE: 49153.28906
Fold: 2 | Val loss (MSE): 3990404864.00000, Val MAE: 50996.93750
Fold: 3 | Val loss (MSE): 3900164608.00000, Val MAE: 51443.54688
Fold: 4 | Val loss (MSE): 4009527808.00000, Val MAE: 50209.69141
Fold: 5 | Val loss (MSE): 4353251840.00000, Val MAE: 53986.99609
Avg Loss for params (0.001, 0.0): 4353251840.00000
Testing parameters: lr=0.001, weight_decay=0.01
Fold: 1 | Val loss (MSE): 4033252352.00000, Val MAE: 50070.53125
Fold: 2 | Val loss (MSE): 3989474816.00000, Val MAE: 50982.12109
Fold: 3 | Val loss (MSE): 3899711488.00000, Val MAE: 51443.12891
Fold: 4 | Val loss (MSE): 3545121536.00000, Val MAE: 48051.44922
Fold: 5 | Val loss (MSE): 4347741696.00000, Val MAE: 53976.51953
Avg Loss for params (0.001, 0.01): 4347741696.00000
Testing parameters: lr=0.001, weight_decay=0.05
Fold: 1 | Val loss (MSE): 3729170944.00000, Val MAE: 49100.88672
Fold: 2 | Val loss (MSE): 3984356608.00

In [23]:
# Once found the best params we can re-train our model with them and test it on the test set
torch.manual_seed(42)
best_model = NN(input_size, hidden_size, output_size).to(device)
best_optimizer = optim.SGD(best_model.parameters(), lr=best_params[0], weight_decay=best_params[1])  # SGD optimizer

In [24]:
# Convert (again) data to torch tensors
X_train_std = torch.from_numpy(X_train_std).float().to(device)
X_test_std = torch.from_numpy(X_test_std).float().to(device)
y_train = torch.from_numpy(y_train).float().to(device)
y_test= torch.from_numpy(y_test).float().to(device)

In [25]:
# Training loop
num_epochs = 1000
for epoch in range(num_epochs):
    best_model.train()
    
    # Forward pass
    preds_train = best_model(X_train_std)
    loss = criterion(preds_train, y_train)
    
    # Backward pass and optimization
    loss.backward()
    best_optimizer.step()
    best_optimizer.zero_grad()
    
    if (epoch+1) % 100 == 0:
        # Evaluate the model on the test data
        best_model.eval()
        with torch.no_grad():
            # Compute MSE and MAE for test data (just for logging)
            preds_test = best_model(X_test_std)      
            test_mse = criterion(preds_test, y_test)
            test_mae = mean_absolute_error(preds_test.cpu(), y_test.cpu())

        print(f"Epoch: {epoch+1} | Training loss (MSE): {loss.item():.5f} | Test MSE: {test_mse.item():.5f}, Test MAE: {test_mae:.5f}")

Epoch: 100 | Training loss (MSE): 4218264320.00000 | Test MSE: 4306754048.00000, Test MAE: 52463.19531
Epoch: 200 | Training loss (MSE): 3906104064.00000 | Test MSE: 4022454016.00000, Test MAE: 51047.90625
Epoch: 300 | Training loss (MSE): 3900613888.00000 | Test MSE: 4019236864.00000, Test MAE: 51060.30469
Epoch: 400 | Training loss (MSE): 3900517120.00000 | Test MSE: 4019416832.00000, Test MAE: 51071.25781
Epoch: 500 | Training loss (MSE): 3900515584.00000 | Test MSE: 4019451136.00000, Test MAE: 51072.70312
Epoch: 600 | Training loss (MSE): 3900515328.00000 | Test MSE: 4019456000.00000, Test MAE: 51072.89844
Epoch: 700 | Training loss (MSE): 3900515328.00000 | Test MSE: 4019456512.00000, Test MAE: 51072.91797
Epoch: 800 | Training loss (MSE): 3900515328.00000 | Test MSE: 4019456512.00000, Test MAE: 51072.91797
Epoch: 900 | Training loss (MSE): 3900515328.00000 | Test MSE: 4019456512.00000, Test MAE: 51072.91797
Epoch: 1000 | Training loss (MSE): 3900515328.00000 | Test MSE: 401945651

---

### Extra
Using SGD as optimizer we got nice result ...

... however we could obtain better results by normalizing also the target variable.

This becuase as we saw in Neural Network for Regression exercise (the one from scratch) when target variable has high magnitude values can have a more difficult training (e.g getting stuck in a local minima).

So, now let's try to repeat the same we did above but with normalized target variable ... we will see this will lead to better result (as consequence of a more efficient and stable training)

In [26]:
X_train_std = X_train_std.cpu().numpy()
X_test_std = X_test_std.cpu().numpy()
y_train = y_train.cpu().numpy()
y_test = y_test.cpu().numpy()

In [27]:
# Z-normalize also the target variable
target_scaler = StandardScaler()
y_train_std = target_scaler.fit_transform(y_train)
# we leave y-test as it is

In [28]:
display(y_train_std)

array([[-0.4144498],
       [-1.7459235],
       [ 0.8825025],
       ...,
       [-1.2295443],
       [ 1.8992491],
       [ 1.3241068]], dtype=float32)

In [29]:
# Convert data to torch tensors
X_train_std = torch.from_numpy(X_train_std).float().to(device)
X_test_std = torch.from_numpy(X_test_std).float().to(device)
y_train_std = torch.from_numpy(y_train_std).float().to(device)
y_test= torch.from_numpy(y_test).float().to(device)

In [30]:
print(X_train_std)

tensor([[-3.0805, -2.4322,  0.0395,  ...,  1.0219,  0.5377,  2.4182],
        [-1.7405, -2.4322,  0.0395,  ..., -0.0168, -2.1678,  0.2175],
        [-0.4006,  0.6717,  0.0395,  ...,  1.0219,  0.5377,  0.2175],
        ...,
        [-1.7405,  0.6717,  0.0395,  ...,  1.0219, -1.0958, -1.9832],
        [-0.4006,  0.6717,  0.0395,  ..., -1.0555,  0.5377,  0.2175],
        [ 0.9394,  0.6717,  0.0395,  ...,  1.0219,  0.5377,  0.2175]],
       device='cuda:0')


In [31]:
torch.manual_seed(42)

# Initialize the model
model = NN(input_size, hidden_size, output_size).to(device)

# Loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001) # !!! Note in this case we use Adam

In [32]:
# Training loop
num_epochs = 5000 # we also increased the number of epochs
for epoch in range(num_epochs):
    model.train()
    
    # Forward pass
    preds_train = model(X_train_std)
    loss = criterion(preds_train, y_train_std)
    
    # Backward pass and optimization
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if (epoch+1) % 100 == 0:
        # De-normalized training loss for logging
        # To de-normalize the data we need to detach the data and then transform them back to numpy array. So then as consequence to compute the criterion loss we need to
        # reconvert them in tensors
        preds_train_denorm = target_scaler.inverse_transform(preds_train.detach().cpu().numpy()) # De-normalized predictions (detach and transform them in numpy arrays)
        y_train_denorm = target_scaler.inverse_transform(y_train_std.cpu().numpy())  # De-normalized y_train
        preds_train_denorm_tensor = torch.tensor(preds_train_denorm, dtype=torch.float32, device='cpu') # Convert back to tensors from NumPy arrays
        y_train_denorm_tensor = torch.tensor(y_train_denorm, dtype=torch.float32, device='cpu')
        train_loss = criterion(preds_train_denorm_tensor, y_train_denorm_tensor)
        
        # Evaluate the model on the test data
        model.eval()
        with torch.no_grad():
            # Compute MSE and MAE for test data (just for logging)
            preds_test_norm = model(X_test_std)  # Normalized predictions
            preds_test_denorm = target_scaler.inverse_transform(preds_test_norm.cpu().numpy())  # De-normalize predictions
            preds_test_denorm_tensor = torch.tensor(preds_test_denorm, dtype=torch.float32, device='cpu')
            
            y_test_tensor = y_test.clone().detach().cpu().float()  # Detach y_test and move to CPU
            
            test_loss = criterion(preds_test_denorm_tensor, y_test_tensor)
            test_mae = mean_absolute_error(preds_test_denorm_tensor, y_test_tensor)

        
        print(f"Epoch: {epoch+1} | Training loss (MSE): {train_loss.item():.5f} | Test loss (MSE): {test_loss.item():.5f}, Test MAE: {test_mae:.5f}")


Epoch: 100 | Training loss (MSE): 3922163456.00000 | Test loss (MSE): 4041557760.00000, Test MAE: 51478.83203
Epoch: 200 | Training loss (MSE): 3791259392.00000 | Test loss (MSE): 3897159168.00000, Test MAE: 50284.60938
Epoch: 300 | Training loss (MSE): 3673942016.00000 | Test loss (MSE): 3764498432.00000, Test MAE: 49373.90625
Epoch: 400 | Training loss (MSE): 3494677760.00000 | Test loss (MSE): 3558182144.00000, Test MAE: 47951.54688
Epoch: 500 | Training loss (MSE): 3253070848.00000 | Test loss (MSE): 3270491392.00000, Test MAE: 45890.89453
Epoch: 600 | Training loss (MSE): 3020109056.00000 | Test loss (MSE): 2981674752.00000, Test MAE: 43787.42188
Epoch: 700 | Training loss (MSE): 2872146432.00000 | Test loss (MSE): 2786582272.00000, Test MAE: 42279.50000
Epoch: 800 | Training loss (MSE): 2803412480.00000 | Test loss (MSE): 2690657792.00000, Test MAE: 41456.58594
Epoch: 900 | Training loss (MSE): 2771198720.00000 | Test loss (MSE): 2653285888.00000, Test MAE: 41058.54688
Epoch: 100

We can notice that using Adam with the normalized target variable leads to even better results.

Now let's encapsulate this entire process in a single method in similar way as we did above, so we can simply call it when we will perform Grid Search with K-fold Cross-validation.

In [37]:
def train_and_evaluate(model, criterion, optimizer, target_scaler, X_train_fold, y_train_fold, X_val_fold, y_val_fold, fold, num_epochs=1000):
    """
    Train and evaluate a PyTorch model.

    Parameters:
        model (torch.nn.Module): The neural network model.
        criterion (torch.nn.Module): The loss function (e.g., nn.MSELoss).
        optimizer (torch.optim.Optimizer): Optimizer for training (e.g., Adam, SGD).
        X_train (torch.Tensor): Training features.
        y_train (torch.Tensor): Training target values.
        X_test (torch.Tensor): Test features.
        y_test (torch.Tensor): Test target values.
        num_epochs (int): Number of training epochs.

    Returns:
        float: Final test loss (MSE).
    """

    X_train_fold, y_train_fold = X_train_fold.to(device), y_train_fold.to(device)
    X_val_fold, y_val_fold = X_val_fold.to(device), y_val_fold.to(device)

    for epoch in range(num_epochs):   
        model.train()
        # Forward pass
        preds_train = model(X_train_fold)
        loss = criterion(preds_train, y_train_fold)

        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    
        if (epoch+1) % num_epochs == 0:
            # Evaluate the model on the test data
            model.eval()
            with torch.no_grad():
                preds_val_norm = model(X_val_fold)
                preds_val_denorm = target_scaler.inverse_transform(preds_val_norm.cpu().numpy())
                preds_val_denorm_tensor = torch.tensor(preds_val_denorm, dtype=torch.float32, device='cpu')

                y_val_denorm = target_scaler.inverse_transform(y_val_fold.cpu().numpy())
                y_val_tensor = torch.tensor(y_val_denorm, dtype=torch.float32, device='cpu')
                
                test_loss = criterion(preds_val_denorm_tensor, y_val_tensor)
                test_mae = mean_absolute_error(preds_val_denorm_tensor, y_val_tensor)

            print(f"Fold: {fold} | Test loss (MSE): {test_loss.item():.5f}, Test MAE: {test_mae:.5f}")

    return test_loss.item()

In [38]:
# Define 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

best_params = None
best_loss = float('inf')

# Define hyperparameter grid
param_grid = {
    'lr': [0.001, 0.005, 0.01],
    'weight_decay': [0.0, 0.01, 0.05]
}

# Convert param_grid to all combinations of hyperparameters
param_combinations = list(itertools.product(*param_grid.values()))

# Iterate over all parameter combinations
for params in param_combinations:
    lr, weight_decay = params

    print(f"Testing parameters: lr={lr}, weight_decay={weight_decay}")

    fold_losses = []
    fold=0

    # Iteratively consider all fold configurations
    for train_index, val_index in kf.split(X_train_std):
        fold=fold +1
        # Split data into train and validation sets for the current fold
        X_train_fold = X_train_std[train_index]
        y_train_fold = y_train_std[train_index]
        X_val_fold = X_train_std[val_index]
        y_val_fold = y_train_std[val_index]

        # Initialize model, optimizer (with the lr and weight decay to test) and criterion
        model = NN(input_size=X_train_fold.shape[1], hidden_size=[5, 3], output_size=1).to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
        criterion = nn.MSELoss()

        # Train and evaluate the model on the current fold
        fold_loss = train_and_evaluate(
            model=model,
            criterion=criterion,
            optimizer=optimizer,
            target_scaler=target_scaler,
            X_train_fold=X_train_fold,
            y_train_fold=y_train_fold,
            X_val_fold=X_val_fold,
            y_val_fold=y_val_fold,
            fold=fold,
            num_epochs=5000
        )
    fold_losses.append(fold_loss)

    # Calculate average loss across all folds
    avg_loss = sum(fold_losses) / len(fold_losses)
    print(f"Avg Loss for params {params}: {avg_loss:.5f}")

    # Update the best parameters if the current configuration is better
    if avg_loss < best_loss:
        best_loss = avg_loss
        best_params = params

print('\n')
print(f"Best Params: lr={best_params[0]}, weight_decay={best_params[1]}")
print(f"Best Loss: {best_loss:.5f}")

Testing parameters: lr=0.001, weight_decay=0.0
Fold: 1 | Test loss (MSE): 2751747072.00000, Test MAE: 41649.92578
Fold: 2 | Test loss (MSE): 2535083008.00000, Test MAE: 39307.72266
Fold: 3 | Test loss (MSE): 2341580544.00000, Test MAE: 37815.57031
Fold: 4 | Test loss (MSE): 2436936448.00000, Test MAE: 39123.32812
Fold: 5 | Test loss (MSE): 2949472000.00000, Test MAE: 41890.58203
Avg Loss for params (0.001, 0.0): 2949472000.00000
Testing parameters: lr=0.001, weight_decay=0.01
Fold: 1 | Test loss (MSE): 2632020736.00000, Test MAE: 40982.26953
Fold: 2 | Test loss (MSE): 2669598976.00000, Test MAE: 41286.21094
Fold: 3 | Test loss (MSE): 2444236032.00000, Test MAE: 38751.43750
Fold: 4 | Test loss (MSE): 2642719232.00000, Test MAE: 40531.98828
Fold: 5 | Test loss (MSE): 3183097600.00000, Test MAE: 43934.01562
Avg Loss for params (0.001, 0.01): 3183097600.00000
Testing parameters: lr=0.001, weight_decay=0.05
Fold: 1 | Test loss (MSE): 3728975616.00000, Test MAE: 49172.72266
Fold: 2 | Test lo

In [39]:
# Once found the best params we can re-train our model with them and test it on the test set
torch.manual_seed(42)
best_model = NN(input_size, hidden_size, output_size).to(device)
best_optimizer = optim.Adam(best_model.parameters(), lr=best_params[0], weight_decay=best_params[1])

In [41]:
# Training loop
num_epochs = 5000
for epoch in range(num_epochs):
    best_model.train()
    
    # Forward pass
    preds_train = best_model(X_train_std)
    loss = criterion(preds_train, y_train_std)
    
    # Backward pass and optimization
    loss.backward()
    best_optimizer.step()
    best_optimizer.zero_grad()

    if (epoch+1) % 100 == 0:
        # De-normalized training loss for logging
        # To de-normalize the data we need to detach the data and then transform them back to numpy array. So then as consequence to compute the criterion loss we need to
        # reconvert them in tensors
        preds_train_denorm = target_scaler.inverse_transform(preds_train.detach().cpu().numpy()) # De-normalized predictions (detach and transform them in numpy arrays)
        y_train_denorm = target_scaler.inverse_transform(y_train_std.cpu().numpy())  # De-normalized y_train
        preds_train_denorm_tensor = torch.tensor(preds_train_denorm, dtype=torch.float32, device='cpu') # Convert back to tensors from NumPy arrays
        y_train_denorm_tensor = torch.tensor(y_train_denorm, dtype=torch.float32, device='cpu')
        train_loss = criterion(preds_train_denorm_tensor, y_train_denorm_tensor)
        
        # Evaluate the model on the test data
        best_model.eval()
        with torch.no_grad():
            # Compute MSE and MAE for test data (just for logging)
            preds_test_norm = best_model(X_test_std)  # Normalized predictions
            preds_test_denorm = target_scaler.inverse_transform(preds_test_norm.cpu().numpy())  # De-normalize predictions
            preds_test_denorm_tensor = torch.tensor(preds_test_denorm, dtype=torch.float32, device='cpu')
            
            y_test_tensor = y_test.clone().detach().cpu().float()  # Detach y_test and move to CPU
            
            test_loss = criterion(preds_test_denorm_tensor, y_test_tensor)
            test_mae = mean_absolute_error(preds_test_denorm_tensor, y_test_tensor)

        
        print(f"Epoch: {epoch+1} | Training loss (MSE): {train_loss.item():.5f} | Test loss (MSE): {test_loss.item():.5f}, Test MAE: {test_mae:.5f}")


Epoch: 100 | Training loss (MSE): 2785271040.00000 | Test loss (MSE): 2639118848.00000, Test MAE: 40896.19531
Epoch: 200 | Training loss (MSE): 2654635520.00000 | Test loss (MSE): 2600004096.00000, Test MAE: 40363.85547
Epoch: 300 | Training loss (MSE): 2512472576.00000 | Test loss (MSE): 2543557632.00000, Test MAE: 39801.81641
Epoch: 400 | Training loss (MSE): 2473465856.00000 | Test loss (MSE): 2512959744.00000, Test MAE: 39462.26953
Epoch: 500 | Training loss (MSE): 2456041728.00000 | Test loss (MSE): 2496257280.00000, Test MAE: 39236.79297
Epoch: 600 | Training loss (MSE): 2441771008.00000 | Test loss (MSE): 2481133824.00000, Test MAE: 39072.59375
Epoch: 700 | Training loss (MSE): 2427616256.00000 | Test loss (MSE): 2467489024.00000, Test MAE: 38920.14062
Epoch: 800 | Training loss (MSE): 2413174784.00000 | Test loss (MSE): 2455692544.00000, Test MAE: 38775.10547
Epoch: 900 | Training loss (MSE): 2356956672.00000 | Test loss (MSE): 2384444928.00000, Test MAE: 38339.56641
Epoch: 100

Without normalizing target variable (and using SGD) we got:
- Best Params: lr=0.005, weight_decay=0.0
- MAE on test set = 51072.91797

While normalizing target variable (and using Adam) we got:
- Best Params: lr=0.01, weight_decay=0.0
- MAE on test set = 38124.28125

So, in second case we got much better!