### Neural Network for Regression (Exercise)

This dataset encompasses details about various workers and their corresponding employment levels, featuring a diverse set of attributes ranging from categorical to continuous. Initialize the data loading process using the suitable Pandas function, and meticulously inspect for any instances of null or duplicated data. Specifically focusing on the **“salary in usd”** feature, identify and eliminate outliers while devising a strategy to address any missing values.

Then, use any method you like to encode the categorical features, namely **“work year”, “experience level”, “employment type”, “job title”, “employee residence”, “remote ratio”, “company location”, and “company size”**. You may consider employing the sklearn LabelEncoder class<sup>1</sup>.

Following the preprocessing steps, normalize the dataset utilizing the z-score technique to ensure consistent scaling across features. Subsequently, construct a neural network using **PyTorch**, incorporating **2 hidden layers with 5 and 3 neurons**, respectively. Carefully select an appropriate learning rate and normalization value for optimal model training.

Furthermore, assess the model’s performance using a relevant evaluation metric, ensuring a comprehensive understanding of its effectiveness in handling the given employment dataset. Finally, find the best hyperparameter combination (namely **lr** and **weight decay**) using both the **Grid Search** and the **k-fold cross validation** methods.

<sup>1</sup>https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.LabelEncoder.html

---

##### **Encoding of Categorical Features**
To apply **standardization**, all features must be numeric. While continuous features are already numerical, categorical features must first be **transformed into numerical values**. This transformation process is called **encoding**.

We could use Label Encoding for all categorical variables, as the exercise text recommends, but that is not a good choice. `LabelEncoder` is designed to encode target labels with values between 0 and n_classes-1, so it makes little sense to use it for encoding features. Moreover, when encoding features it is important to understand the type of each feature, since different types call for [different encodings](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features).

Specifically, categorical features fall into two main categories:
1. **Ordinal Features**: Have a meaningful order (e.g., education level: *High School < Bachelor's < Master's < PhD*).
2. **Nominal Features**: Have no intrinsic order (e.g., company location).

**1️⃣ Ordinal Features → Ordinal Encoding**

Assigns numerical values based on category order. For example, education levels may be assigned values such as:
- High School → 0
- Bachelor's → 1
- Master's → 2
- PhD → 3

Since ordinal variables have a meaningful ranking, converting them into ordered numbers preserves that relationship.
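
To make the role of the category order concrete, here is a minimal pure-Python sketch (not the scikit-learn implementation). The helper names `fit_ordinal` and `transform_ordinal`, the `unknown_value` fallback, and the sample values are assumptions made for illustration. It compares an explicit order with an alphabetical one, which is what we get when an encoder infers the categories by sorting them.

```python
# Hypothetical category order for illustration (lowest to highest education level).
EDUCATION_ORDER = ["High School", "Bachelor's", "Master's", "PhD"]


def fit_ordinal(categories_in_order):
    """Map each category to its rank in the given order."""
    return {category: rank for rank, category in enumerate(categories_in_order)}


def transform_ordinal(values, mapping, unknown_value=-1):
    """Encode values; categories missing from the mapping get unknown_value."""
    return [mapping.get(value, unknown_value) for value in values]


# Mapping that respects the real ranking of the categories.
explicit = fit_ordinal(EDUCATION_ORDER)
# Mapping obtained by sorting the category names alphabetically.
alphabetical = fit_ordinal(sorted(EDUCATION_ORDER))

sample = ["PhD", "High School", "Master's", "Diploma"]  # "Diploma" is unseen
encoded_explicit = transform_ordinal(sample, explicit)
encoded_alphabetical = transform_ordinal(sample, alphabetical)

print(encoded_explicit)
print(encoded_alphabetical)
```

With the alphabetical mapping, "Bachelor's" ranks below "High School", so the numeric order no longer reflects the real ranking. This is why it is worth checking which order an encoder actually uses.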

**2️⃣ Nominal Features → One-Hot Encoding (OHE)**

Creates binary columns for each category. For instance, if a feature represents employment type with values **FT, PT, CT**, one-hot encoding transforms it into three separate columns:
- `FT` → [1, 0, 0]
- `PT` → [0, 1, 0]
- `CT` → [0, 0, 1]

One-hot encoding ensures that categories are represented equally without implying any order.
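
As a rough illustration of the mechanics, the sketch below one-hot encodes a small list in plain Python. The helpers `fit_one_hot` and `transform_one_hot` and the sample values are hypothetical. Mapping an unseen category to an all-zero row is an assumption that mirrors the "ignore unknown categories" behaviour many encoders offer.

```python
def fit_one_hot(values):
    """Learn the column order from the training values (sorted for determinism)."""
    return sorted(set(values))


def transform_one_hot(values, categories):
    """Return one row per value; an unseen category becomes an all-zero row."""
    return [
        [1.0 if value == category else 0.0 for category in categories]
        for value in values
    ]


# Illustrative training values for a nominal feature.
train_types = ["FT", "PT", "CT", "FT"]
categories = fit_one_hot(train_types)

# "FL" never appeared in the training values, so it gets no active column.
rows = transform_one_hot(["FT", "FL"], categories)

print(categories)
print(rows)
```

The columns are learned once from the training values and then reused, so training and test rows always have the same width.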

⚠️ **Issue: High Cardinality in Nominal Features**
For categorical features with many unique values (e.g., thousands of job titles), one-hot encoding creates too many columns, which could lead to the curse of dimensionality.
A better alternative in such a case is **Target Encoding**.

**3️⃣ Alternative for High-Cardinality Nominal Features → Target Encoding**

Replaces each category with the **mean target value** (e.g., average salary for each job title).

Example: if the average salary for “Software Engineer” is 80,000 USD and for “Data Scientist” is 90,000 USD, we replace:
- Software Engineer → 80,000
- Data Scientist → 90,000

It solves the one-hot encoding **dimensionality explosion** problem (i.e., when dealing with features that have a large number of unique values) while keeping useful statistical information about the relationship between a category and the target variable.
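
The idea can be sketched in a few lines of plain Python. This is a simplified version that uses the raw per-category mean; library implementations typically also blend each mean with the global mean (smoothing), which is omitted here. The helper names, the toy salaries, and the choice to fall back to the global mean for unseen categories are assumptions for illustration.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Compute the mean target per category and the global mean (training data only)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def transform_target_encoding(categories, means, global_mean):
    """Replace each category with its learned mean; unseen ones get the global mean."""
    return [means.get(category, global_mean) for category in categories]


# Toy training data (hypothetical values chosen to match the example above).
train_titles = ["Software Engineer", "Software Engineer", "Data Scientist", "Data Scientist"]
train_salaries = [75_000, 85_000, 88_000, 92_000]

means, global_mean = fit_target_encoding(train_titles, train_salaries)
encoded_test = transform_target_encoding(["Data Scientist", "ML Engineer"], means, global_mean)

print(means)
print(encoded_test)
```

Because the encoding is built from the target itself, the means must come from the training split only; otherwise information about the test targets leaks into the features.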

##### **When to Apply Encoding?**
Just like standardization, encoding should be computed only on the training set:
- Fit and transform on the training data (fit_transform).
- Transform only on the test set (transform).

⚠️ In general, always remember that the test set should remain **untouched**, mimicking real-world data the model has never seen before.

Once all features are numeric (including encoded categorical ones), we can apply Z-score standardization to all features (both the transformed categorical ones and the original continuous numerical features).

This ensures that all features are on the same scale, having a mean of 0 and a standard deviation of 1, helping models like neural networks learn efficiently.
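
The "fit on train, transform on test" rule can be sketched for z-score standardization as follows. This is a stdlib-only illustration with hypothetical helper names and toy values. It uses the population standard deviation (`pstdev`), which is the convention `StandardScaler` follows.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Learn mean and (population) standard deviation from the training column."""
    return fmean(column), pstdev(column)


def transform_standardizer(column, mean, std):
    """Apply z = (x - mean) / std using the statistics learned on the training data."""
    return [(x - mean) / std for x in column]


# Toy feature values (hypothetical).
train_hours = [20.0, 30.0, 40.0, 50.0]
test_hours = [35.0, 60.0]

mean, std = fit_standardizer(train_hours)                     # fit on train only
train_scaled = transform_standardizer(train_hours, mean, std)
test_scaled = transform_standardizer(test_hours, mean, std)   # reuse train statistics

print(train_scaled)
print(test_scaled)
```

Only the training column ends up with exactly zero mean and unit standard deviation. The test column is merely expressed on the same scale, which is the intended behaviour.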

&nbsp;

**Note**: I modified a little the dataset: I deleted some redundant features such as salary and salary_currency since thay are to similar to our target variable and moreover since most are categorical features I added three new continuous features company_revenue_million, avg_working_hours_weekly, industry_demand_score.

In [1]:
import numpy as np
import pandas as pd
import itertools

from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from category_encoders import TargetEncoder
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_absolute_error

import torch
torch.__version__

'2.5.1+cu118'

In [2]:
from torch import nn
from torch import optim

device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [3]:
# Load your dataset
df = pd.read_csv("./datasets/ds_salaries.csv")

# Drop redundant salary columns
df = df.drop(columns=["salary", "salary_currency"], errors="ignore")

# Generate `company_revenue_million` based on `company_size`
company_size_to_revenue = {"S": 50, "M": 200, "L": 1000}  # Example revenue in million USD
df["company_revenue_million"] = df["company_size"].map(company_size_to_revenue)

# Generate `avg_working_hours_weekly` based on `employment_type`
employment_to_hours = {"FT": 40, "PT": 20, "CT": 30, "FL": 45}  # Example weekly hours
df["avg_working_hours_weekly"] = df["employment_type"].map(employment_to_hours)

# Generate `industry_demand_score` (synthetic) based on job title
np.random.seed(42)  # For reproducibility
job_titles = df["job_title"].unique()
job_demand_scores = {job: np.random.randint(50, 100) for job in job_titles}  # Random demand score between 50-100
df["industry_demand_score"] = df["job_title"].map(job_demand_scores)

display(df)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,company_revenue_million,avg_working_hours_weekly,industry_demand_score
0,2023,SE,FT,Principal Data Scientist,85847,ES,100,ES,L,1000,40,88
1,2023,MI,CT,ML Engineer,30000,US,100,US,S,50,30,78
2,2023,MI,CT,ML Engineer,25500,US,100,US,S,50,30,78
3,2023,SE,FT,Data Scientist,175000,CA,100,CA,M,200,40,64
4,2023,SE,FT,Data Scientist,120000,CA,100,CA,M,200,40,64
...,...,...,...,...,...,...,...,...,...,...,...,...
3750,2020,SE,FT,Data Scientist,412000,US,100,US,L,1000,40,64
3751,2021,MI,FT,Principal Data Scientist,151000,US,100,US,L,1000,40,88
3752,2020,EN,FT,Data Scientist,105000,US,100,US,S,50,40,64
3753,2020,EN,CT,Business Data Analyst,100000,US,100,US,L,1000,30,79


In [4]:
# Check for missing values
print('Missing values:')
print(df.isnull().sum())

# Check for duplicates
print('Duplicates: ', df.duplicated().sum())

Missing values:
work_year                   0
experience_level            0
employment_type             0
job_title                   0
salary_in_usd               0
employee_residence          0
remote_ratio                0
company_location            0
company_size                0
company_revenue_million     0
avg_working_hours_weekly    0
industry_demand_score       0
dtype: int64
Duplicates:  1171


In [6]:
# So we found that we do not have missing values, but we have 1171 duplicates. Let's drop them.
df = df.drop_duplicates()
# Note: df.size counts cells (rows x columns), not rows; use len(df) for the number of rows
print(df.size)

31008


In [7]:
# Handle missing values in "salary in usd" (even if in this case we don't have any)
# For the sake of simplicity, let's fill missing values with the median salary.
df['salary_in_usd'] = df['salary_in_usd'].fillna(df['salary_in_usd'].median())

# Identify and remove outliers in "salary in usd" using IQR
alpha = 1.5
Q1 = df['salary_in_usd'].quantile(0.25)
Q3 = df['salary_in_usd'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - alpha * IQR
upper_bound = Q3 + alpha * IQR

# we take just entries within the range [lower_bound, upper_bound]
df = df[(df['salary_in_usd'] >= lower_bound) & (df['salary_in_usd'] <= upper_bound)]

print(df.size)

30660
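
To see what the IQR rule does on a small scale, here is a stdlib-only sketch. The toy salaries (in thousands) and the helper `iqr_bounds` are hypothetical. `statistics.quantiles` with `method="inclusive"` interpolates linearly between data points, which should correspond to the default behaviour of the pandas `quantile` call used above.

```python
from statistics import quantiles


def iqr_bounds(values, alpha=1.5):
    """Return (lower, upper) bounds: Q1 - alpha * IQR and Q3 + alpha * IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - alpha * iqr, q3 + alpha * iqr


# Toy salaries in thousands of USD (hypothetical); 500 is an obvious outlier.
salaries = [40, 50, 60, 70, 80, 90, 100, 110, 500]

lower, upper = iqr_bounds(salaries)
kept = [s for s in salaries if lower <= s <= upper]

print(lower, upper)
print(kept)
```

A larger `alpha` widens the accepted range and removes fewer points; 1.5 is the conventional default.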


In [8]:
X = df.drop('salary_in_usd', axis=1)
y = df['salary_in_usd']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Handling Categorical Features
To handle categorical features we proceed as follows: we first distinguish between ordinal and nominal features.

For the **ordinal features** we use an Ordinal Encoder. For the **nominal features**, if the number of unique values is at most 10 we apply **One-Hot Encoding**; otherwise we apply **Target Encoding** (to avoid dimensionality explosion).

In [9]:
ordinal_features = ['work_year', 'experience_level', 'remote_ratio', 'company_size']
nominal_features = ['employment_type', 'job_title', 'employee_residence', 'company_location']

one_hot_features = []  # To be filled dynamically
target_features = []   # To be filled dynamically 

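# For ordinal features we use OrdinalEncoder.
# Note: with the default categories='auto', OrdinalEncoder sorts the categories found in the data
# (alphabetically for strings), so e.g. company_size is encoded as L=0, M=1, S=2.
# Pass an explicit `categories=` list per feature if the true order must be preserved.
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
X_train[ordinal_features] = ordinal_encoder.fit_transform(X_train[ordinal_features])
X_test[ordinal_features] = ordinal_encoder.transform(X_test[ordinal_features])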


threshold = 10 # max unique values
# Let's handle nominal features
for col in nominal_features:
    if X_train[col].nunique() <= threshold:
        one_hot_features.append(col)  # If number of unique values is below threshold, use one-hot encoding
    else:
        target_features.append(col)  # Otherwise, use target encoding
# Note: here we fill one_hot_features and target_features based on the number of unique values

print("One-hot encoding features: ", one_hot_features)
print("Target encoding features: ", target_features)

# Apply One-Hot Encoding for the `one_hot_features`
onehot_encoder = OneHotEncoder(sparse_output=False, dtype=float, handle_unknown='ignore')
X_train_onehot = onehot_encoder.fit_transform(X_train[one_hot_features])
X_test_onehot = onehot_encoder.transform(X_test[one_hot_features])
X_train_onehot_df = pd.DataFrame(X_train_onehot, 
                                 columns=onehot_encoder.get_feature_names_out(one_hot_features), 
                                 index=X_train.index)
X_test_onehot_df = pd.DataFrame(X_test_onehot, 
                                columns=onehot_encoder.get_feature_names_out(one_hot_features), 
                                index=X_test.index)
X_train = pd.concat([X_train, X_train_onehot_df], axis=1).drop(columns=one_hot_features)
X_test = pd.concat([X_test, X_test_onehot_df], axis=1).drop(columns=one_hot_features)

# Apply Target Encoding for the `target_features`
target_encoder = TargetEncoder()
X_train[target_features] = target_encoder.fit_transform(X_train[target_features], y_train)
X_test[target_features] = target_encoder.transform(X_test[target_features])

One-hot encoding features:  ['employment_type']
Target encoding features:  ['job_title', 'employee_residence', 'company_location']


In [9]:
display(X_train)

Unnamed: 0,work_year,experience_level,job_title,employee_residence,remote_ratio,company_location,company_size,company_revenue_million,avg_working_hours_weekly,industry_demand_score,employment_type_CT,employment_type_FL,employment_type_FT,employment_type_PT
3752,0.0,0.0,133210.021277,150683.102547,2.0,149179.641026,2.0,50,40,64,0.0,0.0,1.0,0.0
3656,1.0,0.0,142463.895493,122733.034718,1.0,122733.034718,1.0,200,40,60,0.0,0.0,1.0,0.0
2964,2.0,3.0,129539.722002,150683.102547,2.0,149179.641026,1.0,200,40,77,0.0,0.0,1.0,0.0
2951,2.0,3.0,137034.845214,150683.102547,2.0,149179.641026,1.0,200,40,65,0.0,0.0,1.0,0.0
3532,1.0,2.0,138457.543524,150683.102547,2.0,149179.641026,0.0,1000,40,73,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2513,2.0,2.0,158168.601781,150683.102547,2.0,149179.641026,1.0,200,40,78,0.0,0.0,1.0,0.0
1676,3.0,3.0,174758.058342,150683.102547,2.0,149179.641026,1.0,200,40,71,0.0,0.0,1.0,0.0
1738,1.0,3.0,133210.021277,40859.436374,2.0,39812.130853,0.0,1000,40,64,0.0,0.0,1.0,0.0
1960,2.0,3.0,154249.369672,150683.102547,0.0,149179.641026,1.0,200,40,88,0.0,0.0,1.0,0.0


In [10]:
# transform in numpy array
X_train = X_train.values
y_train = y_train.values
X_test = X_test.values
y_test = y_test.values

In [11]:
# Okay, now all features present numeric values, so as next step we can apply standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [12]:
display(X_train)

array([[-3.08053138, -2.43217262,  0.04602072, ..., -0.06650266,
         0.11996691, -0.07684733],
       [-1.74054284, -2.43217262,  0.60287022, ..., -0.06650266,
         0.11996691, -0.07684733],
       [-0.4005543 ,  0.67169471, -0.1748386 , ..., -0.06650266,
         0.11996691, -0.07684733],
       ...,
       [-1.74054284,  0.67169471,  0.04602072, ..., -0.06650266,
         0.11996691, -0.07684733],
       [-0.4005543 ,  0.67169471,  1.31205809, ..., -0.06650266,
         0.11996691, -0.07684733],
       [ 0.93943424,  0.67169471,  0.36178902, ..., -0.06650266,
         0.11996691, -0.07684733]])

In [13]:
print(y_train.shape)

(2044,)


In [14]:
# Let's transform y_train and y_test to be column vectors (so they will match output layer of our NN)
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)

print(y_train.shape)

(2044, 1)


In [15]:
display(y_train)

array([[105000],
       [ 21844],
       [186000],
       ...,
       [ 54094],
       [249500],
       [213580]], dtype=int64)

In [16]:
# Convert data to torch tensors
X_train_tensor = torch.from_numpy(X_train).float().to(device)
X_test_tensor = torch.from_numpy(X_test).float().to(device)
y_train_tensor = torch.from_numpy(y_train).float().to(device)
y_test_tensor= torch.from_numpy(y_test).float().to(device)

In [17]:
print(X_train_tensor)

tensor([[-3.0805, -2.4322,  0.0460,  ..., -0.0665,  0.1200, -0.0768],
        [-1.7405, -2.4322,  0.6029,  ..., -0.0665,  0.1200, -0.0768],
        [-0.4006,  0.6717, -0.1748,  ..., -0.0665,  0.1200, -0.0768],
        ...,
        [-1.7405,  0.6717,  0.0460,  ..., -0.0665,  0.1200, -0.0768],
        [-0.4006,  0.6717,  1.3121,  ..., -0.0665,  0.1200, -0.0768],
        [ 0.9394,  0.6717,  0.3618,  ..., -0.0665,  0.1200, -0.0768]],
       device='cuda:0')


In [18]:
# Let's define our model
class NN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(NN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size[0]) # Input -> First Hidden Layer
        self.fc2 = nn.Linear(hidden_size[0], hidden_size[1]) # First Hidden Layer layer -> Second Hidden Layer
        self.fc3 = nn.Linear(hidden_size[1], output_size)  # Second Hidden Layer -> Output Layer
        self.sigmoid = nn.Sigmoid() # Used for the hidden layers only: in a regression task no activation is applied at the output layer

    def forward(self, x):
        x = self.fc1(x)
        x = self.sigmoid(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        x = self.fc3(x)
        return x

In [19]:
torch.manual_seed(42)

# Initialize the model
input_size = X_train.shape[1]
hidden_size = [5, 3]
output_size = 1
model = NN(input_size, hidden_size, output_size).to(device)

# Loss function and optimizer
criterion = nn.MSELoss()  # Mean Squared Error for regression
optimizer = optim.SGD(model.parameters(), lr=0.001)  # SGD optimizer


# Training loop
num_epochs = 1000
for epoch in range(num_epochs):
    model.train()
    
    # Forward pass
    preds_train_tensor = model(X_train_tensor)
    loss = criterion(preds_train_tensor, y_train_tensor)
    
    # Backward pass and optimization
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    
    if (epoch+1) % 100 == 0:
        # Evaluate the model on the test data
        model.eval()
        with torch.no_grad():
            # Compute MSE and MAE for test data (just for logging)
            preds_test_tensor = model(X_test_tensor)
            test_mse = criterion(preds_test_tensor, y_test_tensor)
            test_mae = mean_absolute_error(preds_test_tensor.cpu(), y_test_tensor.cpu()) # here I switch to CPU just because MAE of sklearn is not able to work on GPU

        print(f"Epoch: {epoch+1} | Training loss (MSE): {loss.item():.2f} | Test MSE: {test_mse.item():.2f}, Test MAE: {test_mae:.2f}")


Epoch: 100 | Training loss (MSE): 11662346240.00 | Test MSE: 11629561856.00, Test MAE: 90166.82
Epoch: 200 | Training loss (MSE): 7382539264.00 | Test MSE: 7413563392.00, Test MAE: 69121.98
Epoch: 300 | Training loss (MSE): 5462580736.00 | Test MSE: 5528785408.00, Test MAE: 59309.98
Epoch: 400 | Training loss (MSE): 4601272320.00 | Test MSE: 4687649792.00, Test MAE: 54604.25
Epoch: 500 | Training loss (MSE): 4214881024.00 | Test MSE: 4313249280.00, Test MAE: 52499.35
Epoch: 600 | Training loss (MSE): 4041543168.00 | Test MSE: 4147260160.00, Test MAE: 51652.74
Epoch: 700 | Training loss (MSE): 3963781888.00 | Test MSE: 4074114816.00, Test MAE: 51289.85
Epoch: 800 | Training loss (MSE): 3928897024.00 | Test MSE: 4042183936.00, Test MAE: 51140.65
Epoch: 900 | Training loss (MSE): 3913248256.00 | Test MSE: 4028452352.00, Test MAE: 51077.52
Epoch: 1000 | Training loss (MSE): 3906227712.00 | Test MSE: 4022688512.00, Test MAE: 51048.78


Since we want to perform **hyperparameter tuning** using **Grid Search** with **k-fold cross validation** it's efficient to encapsulate the entire process in a single method. This approach allows us to easily repeat the training and evaluation process as needed. Let's create a method to handle this.
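
Before the full PyTorch version, the overall structure of grid search with k-fold cross validation can be sketched with the standard library alone. Everything here is illustrative: `k_fold_indices` is a hypothetical helper, and `toy_validation_loss` is a made-up stand-in for "train on the training indices, evaluate on the validation indices", so the "best" parameters it finds say nothing about the real model.

```python
import itertools
import random


def k_fold_indices(n_samples, k, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


def toy_validation_loss(lr, weight_decay, train_idx, val_idx):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    return (lr - 0.005) ** 2 + weight_decay


param_grid = {"lr": [0.001, 0.005, 0.01], "weight_decay": [0.0, 0.01, 0.05]}

best_params, best_loss = None, float("inf")
for lr, weight_decay in itertools.product(*param_grid.values()):
    # Average the validation loss over all folds for this combination.
    fold_losses = [
        toy_validation_loss(lr, weight_decay, train_idx, val_idx)
        for train_idx, val_idx in k_fold_indices(n_samples=20, k=5)
    ]
    avg_loss = sum(fold_losses) / len(fold_losses)
    if avg_loss < best_loss:
        best_params, best_loss = (lr, weight_decay), avg_loss

print(best_params, best_loss)
```

Each hyperparameter combination is scored by its average validation loss over the folds, and the combination with the lowest average wins.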

In [20]:
def train_and_evaluate(model, criterion, optimizer, X_train_fold_tensor, y_train_fold_tensor, X_val_fold_tensor, y_val_fold_tensor, fold, num_epochs=1000):
    """
    Train and evaluate a PyTorch model.

    Parameters:
        model (torch.nn.Module): The neural network model.
        criterion (torch.nn.Module): The loss function (e.g., nn.MSELoss).
        optimizer (torch.optim.Optimizer): Optimizer for training (e.g., Adam, SGD).
        X_train_fold_tensor (torch.Tensor): Training features.
        y_train_fold_tensor (torch.Tensor): Training target values.
        X_val_fold_tensor (torch.Tensor): Validation features.
        y_val_fold_tensor (torch.Tensor): Validation target values.
        fold: Fold number considered (used just for logging)
        num_epochs (int): Number of training epochs.

    Returns:
        float: Final val loss (MSE).
    """

    for epoch in range(num_epochs):   
        model.train()
        # Forward pass
        preds_train = model(X_train_fold_tensor)
        loss = criterion(preds_train, y_train_fold_tensor)
        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    
        if (epoch+1) % num_epochs == 0: # we print just the last step results
            # Evaluate the model on the validation data
            model.eval()
            with torch.no_grad():
                # Compute MSE and MAE for validation data (just for logging)
                preds_val = model(X_val_fold_tensor)      
                val_mse = criterion(preds_val, y_val_fold_tensor)
                val_mae = mean_absolute_error(preds_val.cpu(), y_val_fold_tensor.cpu())
            print(f"  Fold: {fold} | Val loss (MSE): {val_mse.item():.2f}, Val MAE: {val_mae:.2f}")

    return val_mse.item()

Okay, now we are ready to actually perform hyperparameter tuning ...

In [21]:
torch.manual_seed(42)

cv_folds = 5 # Number of cross-validation folds

# Define 5-fold cross-validation
kf = KFold(n_splits=cv_folds, shuffle=True, random_state=42)

best_params = None
best_loss = float('inf')
best_model_state = None

# Define hyperparameter grid
param_grid = {
    'lr': [0.001, 0.005, 0.01],
    'weight_decay': [0.0, 0.01, 0.05]
}

# Convert param_grid to all combinations of hyperparameters
param_combinations = list(itertools.product(*param_grid.values()))
num_candidates = len(param_combinations)  # Total hyperparameter combinations
total_fits = num_candidates * cv_folds  # Total model fits

print(f"Fitting {cv_folds} folds for each of {num_candidates} candidates, totalling {total_fits} fits.\n")
# Iterate over all parameter combinations
for idx, params in enumerate(param_combinations):
    lr, weight_decay = params
    print(f"\n[{idx+1}/{num_candidates}] Testing Params: lr={lr}, weight_decay={weight_decay}")

    fold_losses = []
    fold=0

    # Iteratively consider all fold configurations
    for train_index, val_index in kf.split(X_train):
        fold=fold +1
        # Split data into train and validation sets for the current fold
        X_train_fold = X_train[train_index]
        y_train_fold = y_train[train_index]
        X_val_fold = X_train[val_index]
        y_val_fold = y_train[val_index]

        # Convert data to torch tensors
        X_train_fold_tensor = torch.from_numpy(X_train_fold).float().to(device)
        y_train_fold_tensor = torch.from_numpy(y_train_fold).float().to(device)
        X_val_fold_tensor = torch.from_numpy(X_val_fold).float().to(device)
        y_val_fold_tensor = torch.from_numpy(y_val_fold).float().to(device)

        # Initialize model, optimizer (with the lr and weight decay values to test) and criterion
        model = NN(input_size=X_train_fold_tensor.shape[1], hidden_size=[5, 3], output_size=1).to(device)
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
        criterion = nn.MSELoss()

        # Train and evaluate the model on the current fold
        fold_loss = train_and_evaluate(
            model=model,
            criterion=criterion,
            optimizer=optimizer,
            X_train_fold_tensor=X_train_fold_tensor,
            y_train_fold_tensor=y_train_fold_tensor,
            X_val_fold_tensor=X_val_fold_tensor,
            y_val_fold_tensor=y_val_fold_tensor,
            fold=fold
        )
        fold_losses.append(fold_loss)

    # Calculate average loss across all folds
    avg_loss = sum(fold_losses) / len(fold_losses)
    print(f"Avg Loss for params {params}: {avg_loss:.2f}")

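    # Update the best parameters if the current configuration is better
    # Note: `model` here is the model trained on the last fold of this configuration,
    # so the saved weights come from that single fold (not from a retraining on the full training set)
    if avg_loss < best_loss:
        best_loss = avg_loss
        best_params = params
        best_model_state = model.state_dict() 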

print('\n')
print(f"Best Params: lr={best_params[0]}, weight_decay={best_params[1]}")
print(f"Best Loss: {best_loss:.2f}")

Fitting 5 folds for each of 9 candidates, totalling 45 fits.


[1/9] Testing Params: lr=0.001, weight_decay=0.0
  Fold: 1 | Val loss (MSE): 3731662080.00, Val MAE: 49076.13
  Fold: 2 | Val loss (MSE): 3990384384.00, Val MAE: 50996.62
  Fold: 3 | Val loss (MSE): 3898186496.00, Val MAE: 51442.36
  Fold: 4 | Val loss (MSE): 3545713664.00, Val MAE: 48045.62
  Fold: 5 | Val loss (MSE): 4353252864.00, Val MAE: 53987.00
Avg Loss for params (0.001, 0.0): 3903839897.60

[2/9] Testing Params: lr=0.001, weight_decay=0.01
  Fold: 1 | Val loss (MSE): 3732792576.00, Val MAE: 49069.60
  Fold: 2 | Val loss (MSE): 3987496192.00, Val MAE: 50947.20
  Fold: 3 | Val loss (MSE): 3912232704.00, Val MAE: 51492.78
  Fold: 4 | Val loss (MSE): 3546896640.00, Val MAE: 48038.67
  Fold: 5 | Val loss (MSE): 4348322816.00, Val MAE: 53961.72
Avg Loss for params (0.001, 0.01): 3905548185.60

[3/9] Testing Params: lr=0.001, weight_decay=0.05
  Fold: 1 | Val loss (MSE): 3729155328.00, Val MAE: 49101.16
  Fold: 2 | Val lo

In [22]:
print(best_model_state)

OrderedDict({'fc1.weight': tensor([[ 3.2630,  4.2425,  3.2691,  5.9602, -1.1488,  5.6662,  1.0755, -2.5103,
          1.5990,  0.4973, -0.9156, -1.2397,  1.8618, -1.3080],
        [-0.0234,  0.2987,  0.3932,  0.5120, -0.2359,  0.2355, -0.1422,  0.0236,
         -0.1347, -0.2210, -0.2303,  0.1349,  0.2842,  0.0750],
        [-1.3700, -2.3170, -2.0855, -3.0629,  0.7789, -2.7167, -0.3200,  0.8138,
         -0.7747, -0.4725,  0.4107,  0.5042, -1.0545,  0.5950],
        [-0.9803, -1.9238, -1.3971, -2.3644,  0.6210, -2.2792,  0.0145,  0.7754,
         -0.4749,  0.0255,  0.4254,  0.4524, -0.8065,  0.5522],
        [-0.9045, -1.4662, -0.9566, -1.5383,  0.3917, -1.5630, -0.2190,  0.4672,
         -0.4319, -0.2208,  0.2516,  0.4395, -0.3424,  0.3803]],
       device='cuda:0'), 'fc1.bias': tensor([ 20.6913,   1.0102, -11.2419,  -9.1654,  -6.3112], device='cuda:0'), 'fc2.weight': tensor([[ 16.9218,  15.2643,  16.8741,  17.4984,  16.0235],
        [153.8076, 138.1495, 154.0072, 158.1416, 144.5004],

In [23]:
best_model = NN(input_size=X_test.shape[1], hidden_size=[5, 3], output_size=1).to(device)
best_model.load_state_dict(best_model_state)
best_model.eval()  # Set to evaluation mode

# Convert test data to tensor
X_test_tensor = torch.tensor(X_test, dtype=torch.float32, device=device)

# Make predictions
with torch.no_grad():
    y_pred_tensor = best_model(X_test_tensor)
    test_mse = criterion(y_pred_tensor, y_test_tensor)
    test_mae = mean_absolute_error(y_pred_tensor.cpu(), y_test_tensor.cpu())
print(f"Test loss (MSE): {test_mse.item():.2f}, Test MAE: {test_mae:.2f}")    

Test loss (MSE): 4019319808.00, Test MAE: 51066.43


In [30]:
print("Predictions on test set:", y_pred_tensor[:15])

Predictions on test set: tensor([[130725.9453],
        [130725.9453],
        [130725.9453],
        [130725.9453],
        [130725.9453],
        [130725.9453],
        [130725.9453],
        [130725.9453],
        [130725.9453],
        [130725.9453],
        [130725.9453],
        [130725.9453],
        [130725.9453],
        [130725.9453],
        [130725.9453]], device='cuda:0')


---

Looking at the predictions, the model returns the same value for every test sample, which suggests that training got stuck in a poor local minimum.

As we already saw in the from-scratch implementation of a neural network for regression, a target variable with large-magnitude values can make training harder and lead to this kind of problem.

So let's repeat what we did above, this time with a normalized target variable.
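
Before applying this to the real data, here is a small stdlib-only sketch of the round trip: standardize the target, then map predictions back to USD. The values and the "predictions" are hypothetical. It also checks that the MSE in USD equals the MSE in standardized units multiplied by the squared standard deviation, which is why the two losses live on very different scales.

```python
import math
from statistics import fmean, pstdev


def mse(predictions, targets):
    """Mean squared error between two equally long lists."""
    return fmean([(p - t) ** 2 for p, t in zip(predictions, targets)])


# Toy training targets in USD (hypothetical).
y_train = [60_000.0, 90_000.0, 120_000.0, 150_000.0]
mu, sigma = fmean(y_train), pstdev(y_train)

# Standardize the target: z = (y - mu) / sigma
y_train_std = [(y - mu) / sigma for y in y_train]

# Hypothetical model outputs in standardized units, mapped back to USD.
preds_std = [-1.0, 0.0, 0.5]
preds_usd = [p * sigma + mu for p in preds_std]

# Compare the error in both units against some hypothetical true values.
y_true_usd = [70_000.0, 100_000.0, 130_000.0]
y_true_std = [(y - mu) / sigma for y in y_true_usd]
mse_usd = mse(preds_usd, y_true_usd)
mse_std = mse(preds_std, y_true_std)

print(preds_usd)
print(mse_usd, sigma ** 2 * mse_std)
```

A standardized prediction of 0 maps back to the training mean, and the scaler must be fitted on the training targets only, just like the feature scaler.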

In [25]:
# Z-normalize also the target variable
target_scaler = StandardScaler()
y_train_std = target_scaler.fit_transform(y_train)
y_train_std = torch.from_numpy(y_train_std).float().to(device) # convert it into tensor
# we leave y-test as it is

In [26]:
display(y_train_std)

tensor([[-0.4144],
        [-1.7459],
        [ 0.8825],
        ...,
        [-1.2295],
        [ 1.8992],
        [ 1.3241]], device='cuda:0')

In [27]:
torch.manual_seed(42)

# Initialize the model
model = NN(input_size, hidden_size, output_size).to(device)

# Loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.005) # !!! Note in this case we use Adam

# Training loop
num_epochs = 1000
for epoch in range(num_epochs):
    model.train()
    
    # Forward pass
    preds_train_std = model(X_train_tensor)
    loss = criterion(preds_train_std, y_train_std)
    
    # Backward pass and optimization
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Prediction 
model.eval()
with torch.no_grad():
    preds_test_std = model(X_test_tensor)
    # De-normalized predictions (to do it we need first to detach and transform them in numpy arrays)
    preds_test_numpy = target_scaler.inverse_transform(preds_test_std.detach().cpu().numpy())
    preds_test = torch.from_numpy(preds_test_numpy).float().to(device) # convert back to tensor
    test_mse = criterion(preds_test, y_test_tensor)
    test_mae = mean_absolute_error(preds_test.cpu(), y_test_tensor.cpu())
print(f"Test loss (MSE): {test_mse.item():.2f}, Test MAE: {test_mae:.2f}")

Test loss (MSE): 2191775232.00, Test MAE: 37095.02


In [28]:
print(preds_test[:10])

tensor([[ 57849.0000],
        [173994.0781],
        [154895.3594],
        [159599.9688],
        [165134.2031],
        [143111.3125],
        [ 69813.7266],
        [ 96433.1094],
        [146585.8750],
        [159540.8438]], device='cuda:0')


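By normalizing the target, the model now shows a clear improvement in **prediction diversity**: it produces varied predictions instead of a single constant value. The performance metrics improve as well. Keep in mind that this second run also switches the optimizer from SGD to Adam, so the gain cannot be attributed to target normalization alone: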

- Without normalizing target variable we got:
    - Test loss (MSE): 4019319808.00, Test MAE: 51066.43

- While normalizing target variable we got:
    - Test loss (MSE): 2191775232.00, Test MAE: 37095.02