In this notebook, we will use perform hyper parameter tuning of a neural network with grid search.

In deep learning, hyperparameters are the parameters that are set before the learning process begins and are not updated during training. They differ from model parameters, which are learned from the data during the training process. Hyperparameters play a critical role in determining the performance and efficiency of the model.

Learning Rate:

Definition: The step size used by the optimization algorithm (like gradient descent) to update the model parameters.
Impact: A small learning rate can make the training process very slow and might get stuck in a local minimum, while a large learning rate can cause the model to converge quickly but possibly to a suboptimal solution or even diverge.


Batch Size:

Definition: The number of training examples used in one forward/backward pass.
Impact: A smaller batch size leads to noisier updates but can generalize better, while a larger batch size makes training more stable but may require more memory and could potentially overfit.


Number of Epochs:

Definition: One epoch is when the entire dataset has been passed forward and backward through the neural network once.
Impact: More epochs can improve the model’s performance up to a point, but too many epochs can lead to overfitting.


Optimization Algorithm:

Examples: Stochastic Gradient Descent (SGD), Adam, RMSprop, etc.
Impact: The choice of optimizer can significantly affect how quickly and how well the model converges.

Dropout Rate:

Definition: The fraction of neurons to drop during training to prevent overfitting.
Impact: Helps in regularization by randomly setting a fraction of neurons to zero during each forward pass, which forces the model to generalize better.

Weight Initialization:

Examples: Xavier Initialization, He Initialization.
Impact: Proper weight initialization can help avoid issues like vanishing or exploding gradients, which can affect how quickly and effectively the model trains.

Activation Functions:

Examples: ReLU, Sigmoid, Tanh, Leaky ReLU.
Impact: The choice of activation function affects the non-linearity introduced to the model and thus influences the learning ability.

In [None]:
Regularization Parameter:

Examples: L2 regularization (weight decay), L1 regularization.
Impact: Regularization helps prevent overfitting by adding a penalty to the loss function for large weights.

Model Architecture:

Examples: Number of layers, number of units/neurons in each layer, types of layers (Convolutional, Fully Connected, Recurrent, etc.).
Impact: The complexity and depth of the model determine its capacity to learn patterns from the data.

Momentum:

Definition: A hyperparameter that helps accelerate the gradient descent algorithm by adding a fraction of the previous update to the current update.
Impact: Helps in speeding up convergence and helps escape local minima.

Learning Rate Scheduler:

Definition: A strategy for adjusting the learning rate during training, such as reducing it when the model stops improving.
Impact: Helps in fine-tuning the learning process, often leading to better performance.


Gradient Clipping:

Definition: A technique to cap the gradients during backpropagation to prevent the problem of exploding gradients.
Impact: Ensures stable training by avoiding excessively large updates

https://skorch.readthedocs.io/en/stable/user/installation.html

In [1]:
!pip install skorch

Collecting skorch
  Downloading skorch-1.0.0-py3-none-any.whl.metadata (11 kB)
Collecting tabulate>=0.7.7 (from skorch)
  Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
Downloading skorch-1.0.0-py3-none-any.whl (239 kB)
Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Installing collected packages: tabulate, skorch
Successfully installed skorch-1.0.0 tabulate-0.9.0


In [2]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from skorch import NeuralNetRegressor

In [3]:
# Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

In [4]:
X.shape, y.shape

((20640, 8), (20640,))

In [5]:
print(housing.feature_names)

['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']


In [6]:
# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [7]:
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
# Convert data to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.FloatTensor(y_train)
X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.FloatTensor(y_test)

In [9]:
class RegressionModel(nn.Module):
    def __init__(self):
        super(RegressionModel, self).__init__()
        self.fc1 = nn.Linear(X.shape[1], 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

In [10]:
# create model with skorch
model = NeuralNetRegressor(
    RegressionModel,
    criterion=nn.MSELoss,
    optimizer=optim.Adam,
    verbose=False
)

In [11]:
model.set_params(train_split=False, verbose=0)
param_grid = {
    'max_epochs': [1, 2],
    'lr': [0.01, 0.1], #[0.001, 0.005, 0.01, 0.05, 0.1],
    'batch_size': [8, 32] #[8, 32, 64, 128, 256]
}

grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid.fit(X_train_tensor, y_train_tensor)

  return F.mse_loss(input, target, reduction=self.reduction)


In [12]:
print(grid.best_score_, grid.best_params_)

0.1275056799252828 {'batch_size': 8, 'lr': 0.01, 'max_epochs': 2}


In [13]:
grid.cv_results_

{'mean_fit_time': array([16.35880502, 22.47258409, 16.10154239, 22.95530629,  8.62492951,
         7.74335686,  3.00848564,  3.76468706]),
 'std_fit_time': array([0.26988472, 0.17064134, 0.17015183, 0.07099262, 0.04493201,
        2.80012232, 0.05756284, 0.15345353]),
 'mean_score_time': array([1.3281099 , 0.64212076, 1.35126233, 0.62544163, 0.730383  ,
        0.70889624, 0.6539944 , 0.3216393 ]),
 'std_score_time': array([0.03549379, 0.01184367, 0.0543674 , 0.0071634 , 0.02604853,
        0.04548442, 0.06308667, 0.00837135]),
 'param_batch_size': masked_array(data=[8, 8, 8, 8, 32, 32, 32, 32],
              mask=[False, False, False, False, False, False, False, False],
        fill_value=999999),
 'param_lr': masked_array(data=[0.01, 0.01, 0.1, 0.1, 0.01, 0.01, 0.1, 0.1],
              mask=[False, False, False, False, False, False, False, False],
        fill_value=1e+20),
 'param_max_epochs': masked_array(data=[1, 2, 1, 2, 1, 2, 1, 2],
              mask=[False, False, False, False

## Verdict

We can change the `param_grid` to include other hyper parameters like activation functions, dropout rate and more.

## Resources

1. Scorch documentation: https://skorch.readthedocs.io/en/stable/user/quickstart.html#grid-search
2. Sklearn GridSearchCV documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
3. Link to housing dataset documentation: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html
