### Deep Learning with PyTorch

#### Tensors

Tensors are similar to NumPy arrays but add features such as GPU support and automatic differentiation. Convert a Python list into a Torch tensor with `torch.tensor`.

In [65]:
import torch

temperatures = [[10, 20], [30, 40], [50, 60]]

temperature_tensor = torch.tensor(temperatures)
print(temperature_tensor)

tensor([[10, 20],
        [30, 40],
        [50, 60]])


Tensors have a `shape` and `dtype`, and can be added elementwise.

In [66]:
addend_tensor = torch.tensor([[10, 20], [10, 20], [10, 20]])
sums_tensor = temperature_tensor + addend_tensor
print("Sums:", sums_tensor)
print("Sums shape:", sums_tensor.shape)
print("Sums dtype:", sums_tensor.dtype)

Sums: tensor([[20, 40],
        [40, 60],
        [60, 80]])
Sums shape: torch.Size([3, 2])
Sums dtype: torch.int64
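
As a small illustrative sketch (values chosen arbitrarily), the dtype of a tensor is inferred from the Python values it is built from: integers give `torch.int64` and floats give `torch.float32`. Layers such as `nn.Linear` expect floating-point input, so integer tensors usually need a `.float()` conversion first.

```python
import torch

int_tensor = torch.tensor([[10, 20], [30, 40]])         # inferred as int64
float_tensor = torch.tensor([[1.5, 2.5], [3.5, 4.5]])   # inferred as float32

print(int_tensor.dtype, float_tensor.dtype)

# Convert the integer tensor to float32 before feeding it to float-only layers
converted = int_tensor.float()
print(converted.dtype, converted.shape)
```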


#### Linear layer
`Linear` from `torch.nn` takes a tensor whose last dimension has size `in_features` and outputs a tensor whose last dimension has size `out_features`. The weights and biases used in the calculation are initialized randomly.

In [67]:
import torch.nn as nn

input_tensor = torch.tensor([0.1, -0.1, 0.8])

linear_layer = nn.Linear(
    in_features=3,
    out_features=2
)
output = linear_layer(input_tensor)
print(output)
print(linear_layer.weight)
print(linear_layer.bias)

tensor([ 0.4454, -0.5690], grad_fn=<ViewBackward0>)
Parameter containing:
tensor([[-0.3867, -0.0119, -0.0820],
        [ 0.3435, -0.5435, -0.3046]], requires_grad=True)
Parameter containing:
tensor([ 0.5484, -0.4140], requires_grad=True)
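
The calculation inside a linear layer can be written out by hand. The sketch below (with an arbitrary seed, so the values differ from the output above) checks that the layer's output equals the weight matrix times the input plus the bias.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed so the random weights are reproducible

linear_layer = nn.Linear(in_features=3, out_features=2)
input_tensor = torch.tensor([0.1, -0.1, 0.8])

# The layer's forward pass
layer_output = linear_layer(input_tensor)

# The same calculation written out: weight has shape (2, 3), bias has shape (2,)
manual_output = linear_layer.weight @ input_tensor + linear_layer.bias

print(layer_output)
print(manual_output)
```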


#### Sequential layer
`Sequential` from `torch.nn` can stack layers such as `Linear` to pass data through the layers in sequence. Layers bookended by the input and output are called hidden layers.

A neuron in a linear layer has $n+1$ parameters, with $n$ counting the weight for each input from the previous layer and $1$ accounting for the neuron's bias.

More hidden layers = more parameters = higher model capacity.

In [68]:
sequential_model = nn.Sequential(
    nn.Linear(3, 2),
    nn.Linear(2, 8),
    nn.Linear(8, 3)
)

Acquire the model's parameters using `parameters()`, which returns an iterator over the tensors holding each layer's weights and each layer's biases.

`numel()` outputs the number of elements in a tensor.

In [69]:
count = 0
for parameter in sequential_model.parameters():
    print(parameter)
    count += parameter.numel()
print(count)

Parameter containing:
tensor([[ 0.0043, -0.3726, -0.2042],
        [ 0.1532,  0.0616, -0.4301]], requires_grad=True)
Parameter containing:
tensor([-0.2986, -0.3719], requires_grad=True)
Parameter containing:
tensor([[ 0.0132,  0.3602],
        [ 0.1083,  0.2950],
        [-0.3910,  0.2208],
        [-0.5053,  0.1646],
        [ 0.5056,  0.3109],
        [ 0.4946, -0.4006],
        [ 0.0100, -0.0050],
        [ 0.6591, -0.0147]], requires_grad=True)
Parameter containing:
tensor([-0.2311, -0.0944,  0.3947, -0.0409,  0.6380,  0.4827,  0.2862,  0.4078],
       requires_grad=True)
Parameter containing:
tensor([[ 0.3282,  0.1979, -0.1349,  0.1230, -0.0341,  0.2138, -0.1414,  0.0091],
        [-0.3316, -0.1466,  0.0412,  0.3316,  0.2403, -0.0737, -0.2166,  0.2909],
        [-0.1337, -0.0096,  0.2199,  0.2130,  0.0700, -0.1754,  0.1948, -0.2480]],
       requires_grad=True)
Parameter containing:
tensor([0.1423, 0.0299, 0.2928], requires_grad=True)
59
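
The total of 59 can be checked against the $n+1$ rule. This sketch rebuilds the same layer sizes and compares the formula with the `numel()` count.

```python
import torch.nn as nn

layer_sizes = [(3, 2), (2, 8), (8, 3)]  # (in_features, out_features) per layer

# Each of the out_features neurons has in_features weights plus 1 bias
expected = sum((n_in + 1) * n_out for n_in, n_out in layer_sizes)

model = nn.Sequential(*(nn.Linear(n_in, n_out) for n_in, n_out in layer_sizes))
counted = sum(p.numel() for p in model.parameters())

print(expected, counted)  # 59 59
```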


#### Sigmoid function
The sigmoid function takes a real-valued input and outputs a value between 0 and 1. It is used for binary classification: placed as the final activation after a stack of linear layers, its output is read as the probability of the positive class, and a threshold (for instance 0.5) turns that probability into a yes-or-no prediction.

A network of only linear layers followed by a sigmoid is equivalent to traditional logistic regression, since stacked linear layers collapse into a single linear function and the output is a probability for the category of interest.

In [70]:
input = torch.tensor([10.0, 12.0, 13.0])
sigmoid_model = nn.Sequential(
    nn.Linear(3, 2),
    nn.Linear(2, 1),
    nn.Sigmoid()
)
sigmoid_model(input)

tensor([0.8874], grad_fn=<SigmoidBackward0>)
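
A minimal sketch of the sigmoid itself, assuming illustrative input values: it writes the function out from its definition $1 / (1 + e^{-x})$, compares it with `torch.sigmoid`, and applies a 0.5 threshold.

```python
import torch

logits = torch.tensor([-2.0, 0.5, 3.0])  # illustrative raw model outputs

# Sigmoid written out from its definition
manual = 1 / (1 + torch.exp(-logits))
builtin = torch.sigmoid(logits)

# Classify as the positive class when the probability exceeds 0.5
predicted_positive = builtin > 0.5

print(builtin)
print(predicted_positive)
```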

#### Softmax function
The softmax function takes a one-dimensional input of floats and outputs a one-dimensional distribution of probabilities that sum to 1. It is used for multi-class classification: placed as the final activation after a stack of linear layers, it produces one probability per class, and the predicted class is the one with the highest probability.

In [71]:
input = torch.tensor([10.0, 12.0, 13.0])
softmax_model = nn.Sequential(
    nn.Linear(3, 2),
    nn.Linear(2, 5),
    nn.Softmax(dim=-1) # normalize over the last (class) dimension
)
softmax_model(input)

tensor([3.0023e-01, 1.6290e-02, 4.6548e-05, 8.8521e-04, 6.8255e-01],
       grad_fn=<SoftmaxBackward0>)
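
Softmax can also be written out by hand. The sketch below (illustrative scores) exponentiates, normalizes, and then picks the most likely class with `argmax`.

```python
import torch

scores = torch.tensor([1.0, 2.0, 3.0])  # illustrative raw scores for 3 classes

# Softmax written out: exponentiate, then normalize so the values sum to 1
manual = torch.exp(scores) / torch.exp(scores).sum()
builtin = torch.softmax(scores, dim=-1)

predicted_class = builtin.argmax(dim=-1)  # index of the largest probability

print(builtin, builtin.sum())
print(predicted_class)
```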

#### Loss function

A loss function quantifies how far a machine learning model's predictions are from the actual target values, be it during training or in practice. It takes a model prediction $\hat{y}$ (a single regression or sigmoid output, or a softmax tensor of probabilities) and the ground truth $y$ (the actual value or class) as inputs, and outputs a single float, the loss.

The goal of training is to minimize the loss, which should be low or zero for an accurate prediction and high for an incorrect one.

For cross-entropy loss, the ground truth may be the class itself (a number). To convert it into a tensor that can be compared against the model prediction (a softmax probability distribution), use `nn.functional.one_hot()`, which takes a tensor of class indices and the number of classes (`num_classes`) and outputs a tensor containing the corresponding one-hot vector(s).

In [72]:
import torch.nn.functional as F

print(F.one_hot(torch.tensor(0), 3))
print(F.one_hot(torch.tensor(1), 3))
print(F.one_hot(torch.tensor([0,2]), 3))

tensor([1, 0, 0])
tensor([0, 1, 0])
tensor([[1, 0, 0],
        [0, 0, 1]])


#### Cross-entropy loss

Cross-entropy loss is a common loss function for classification. With a scores tensor (model predictions before the final softmax function) and a one-hot encoded ground truth label as input (both must be converted to floats), the cross-entropy loss function applies an internal softmax to the scores (producing a probability distribution of the same size), then outputs the negative natural log of the ground truth's corresponding probability.

Cross-entropy loss function is supplied in PyTorch by instantiating `nn.CrossEntropyLoss()`.

In [73]:
scores = torch.tensor([-5.2, 4.6, 0.8])
one_hot_target = F.one_hot(torch.tensor(0), 3)

softmax = nn.Softmax(dim=-1) # normalize over the last (class) dimension
print(softmax(scores))
criterion = nn.CrossEntropyLoss()
print(criterion(scores.double(), one_hot_target.double()))

tensor([5.4235e-05, 9.7807e-01, 2.1880e-02])
tensor(9.8222, dtype=torch.float64)
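
The same loss can be reproduced step by step. This sketch uses a batch of one sample and compares the manual calculation with `nn.CrossEntropyLoss`, once with a class-index target and once with a one-hot float target; all three should agree at about 9.8222.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

scores = torch.tensor([[-5.2, 4.6, 0.8]])  # one sample, three classes
target_class = torch.tensor([0])           # ground truth class index

# Manual version: softmax, then negative log of the true class's probability
probabilities = torch.softmax(scores, dim=-1)
manual_loss = -torch.log(probabilities[0, target_class[0]])

criterion = nn.CrossEntropyLoss()
loss_from_index = criterion(scores, target_class)
loss_from_one_hot = criterion(scores, F.one_hot(target_class, 3).float())

print(manual_loss, loss_from_index, loss_from_one_hot)
```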


#### Loss gradients and Backpropagation

The loss produced by a forward pass depends on the values of the model's parameters (weights and biases); the rate at which the loss changes with respect to each parameter is the loss gradient. With the output (i.e. `loss = criterion(prediction, target)`) of an instantiated loss function (such as `criterion = CrossEntropyLoss()`), we can calculate the gradients of this loss using `loss.backward()`.

Backpropagation is the process that computes these gradients, working backward from the loss through each layer. Gradient descent then uses them to move toward a minimum of the loss function: after each forward and backward pass, we incrementally update the model's weights and biases in the direction opposite to the gradient, so that they follow the gradient downhill.

The existing gradients of layer `i` can be accessed using `model[i].weight.grad` and `model[i].bias.grad`. To update model parameters manually, access each layer gradient and multiply it by the learning rate, then subtract the result from the weight or bias.

In [74]:
model = nn.Sequential(
    nn.Linear(5, 8),
    nn.Linear(8, 2)
)

# Need to compute loss and backpropagate in order to define gradients

x = torch.randn(5) # dummy input
target = torch.tensor([0.0, 1.0]) # dummy target (one probability per class)
output = model(x) # dummy prediction
criterion = nn.CrossEntropyLoss()
loss = criterion(output, target)
loss.backward()

weight0 = model[0].weight
weight1 = model[1].weight
bias1 = model[1].bias
print("Weight of the first layer:", weight0)
print("Bias of the second layer:", bias1)

grads0 = weight0.grad
grads1 = weight1.grad
print("Gradient of the first layer:", grads0)
print("Gradient of the second layer:", grads1)

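lr = 0.001
# Update in place under no_grad(): outside it, `weight0 -= ...` raises an error for a leaf tensor
# that requires grad, and `weight0 = weight0 - ...` only rebinds the name without changing the model
with torch.no_grad():
    weight0 -= lr * grads0
    weight1 -= lr * grads1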

Weight of the first layer: Parameter containing:
tensor([[-0.3165,  0.1104, -0.1071,  0.4049, -0.3143],
        [-0.0234, -0.3527, -0.0361,  0.0084, -0.4175],
        [-0.1854,  0.3420, -0.3655,  0.2279, -0.1146],
        [ 0.1683,  0.1802,  0.1531,  0.2364,  0.0681],
        [-0.3879, -0.1752, -0.2330,  0.2276, -0.3617],
        [-0.2367, -0.4007, -0.3831,  0.2490,  0.2312],
        [-0.1730, -0.0632, -0.3578,  0.3751,  0.2336],
        [-0.4034, -0.0928,  0.2454, -0.3865,  0.3348]], requires_grad=True)
Bias of the second layer: Parameter containing:
tensor([-0.3145, -0.1655], requires_grad=True)
Gradient of the first layer: tensor([[-0.0101, -0.0120,  0.0985, -0.0688,  0.0205],
        [ 0.0016,  0.0020, -0.0161,  0.0112, -0.0033],
        [ 0.0040,  0.0048, -0.0396,  0.0277, -0.0083],
        [-0.0045, -0.0054,  0.0443, -0.0310,  0.0092],
        [ 0.0115,  0.0137, -0.1123,  0.0785, -0.0234],
        [-0.0082, -0.0098,  0.0804, -0.0562,  0.0167],
        [-0.0119, -0.0143,  0.1167, 
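
To make the idea of a loss gradient concrete, here is a minimal sketch with a single weight and illustrative values. It compares the gradient from `loss.backward()` with the analytic derivative $2(wx - y)x$, then applies one manual gradient descent step.

```python
import torch

# A single weight w, one input x and one target y (illustrative values)
w = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(3.0)
y = torch.tensor(9.0)

loss = (w * x - y) ** 2   # squared error of the prediction w * x
loss.backward()           # fills w.grad with d(loss)/dw

analytic_grad = 2 * (w.item() * x.item() - y.item()) * x.item()  # 2(wx - y)x
print(w.grad, analytic_grad)

# One manual gradient descent step, done in place without tracking gradients
lr = 0.01
with torch.no_grad():
    w -= lr * w.grad
print(w)
```

The gradient is negative here, so subtracting `lr * w.grad` increases `w`, moving the prediction `w * x` closer to `y`.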

#### Optimized gradient descent

PyTorch provides a ready-made gradient descent optimizer, `torch.optim.SGD()`, which implements Stochastic Gradient Descent (gradients are calculated from one or a small subset of training examples, rather than the computationally expensive entire dataset). It takes a model's parameters and a learning rate as input when instantiated; use `optimizer.step()` to perform an update on all of the model's parameters.

In [75]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.001)

x = torch.randn(5) # dummy input
target = torch.tensor([0.0, 1.0]) # dummy target (one probability per class)
output = model(x) # dummy prediction
criterion = nn.CrossEntropyLoss()
loss = criterion(output, target)
optimizer.zero_grad() # clear gradients left over from earlier backward passes
loss.backward()

optimizer.step()

#### DataFrames

`pd.read_csv()` from `pandas` reads into a `pandas.DataFrame` from a .csv file.

The `DataFrame.iloc` property is used to select slices of the DataFrame through integer-location based indexing. `iloc[:,1:-1]` for instance selects all but the first and last columns (as columns are dimension 1 of the DataFrame).

`to_numpy()` of a DataFrame outputs an equivalent numpy array for easier handling with PyTorch.

In [76]:
import numpy as np
import pandas as pd

# dataset = pd.read_csv("dataset.csv")
# target = dataset.iloc[:, -1]
# y = target.to_numpy()
# print(y)
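
Since `dataset.csv` is not available here, the sketch below substitutes a small in-memory CSV with hypothetical column names to show `read_csv`, `iloc`, and `to_numpy()` working together.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """name,height,weight,label
a,1.0,2.0,0
b,3.0,4.0,1
c,5.0,6.0,0
"""
dataset = pd.read_csv(io.StringIO(csv_text))

features = dataset.iloc[:, 1:-1].to_numpy()  # all but the first and last columns
target = dataset.iloc[:, -1].to_numpy()      # last column only

print(features.shape, target.shape)
print(target)
```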

#### Data loading and batching

`TensorDataset` and `DataLoader` from `torch.utils.data` help manage data loading and batching during training. 

Instantiate a `DataLoader` with the following parameters:
- the `TensorDataset` itself
- `batch_size` determines the number of samples included in each iteration
- `shuffle` is whether the data order at each epoch (a full pass through the training data) is randomized, which helps improve model generalization (level of performance on unseen data).

The `DataLoader` behaves like a PEZ dispenser; each element it dispenses is a tuple unpacked as batch_inputs (features) and batch_labels (targets). Iterating through the `DataLoader` yields feature and label tensors whose first dimension is `batch_size` (i.e. `batch_size` samples, or fewer in the last batch if not enough remain), corresponding to `batch_size` rows in the `TensorDataset`. It starts over each time it is used as an iterator, reshuffling if `shuffle=True`.

In real-world deep learning, batch sizes are often 32 or greater to accommodate larger datasets with better computational efficiency.

In [77]:
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [-1.0, 0.0],
              [-1.0, 1.0],
              [-1.0, -1.0]], dtype=np.float32)
y = np.array([[2.0],
              [3.0],
              [4.0],
              [0.0],
              [1.0]], dtype=np.float32)
dataset = TensorDataset(torch.tensor(X), torch.tensor(y))

input_sample, label_sample = dataset[0]
print("Input sample:", input_sample)
print("Label sample:", label_sample)

dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
for batch_inputs, batch_labels in dataloader:
    print("Batch inputs:", batch_inputs)
    print("Batch labels:", batch_labels)

Input sample: tensor([0., 1.])
Label sample: tensor([2.])
Batch inputs: tensor([[-1.,  0.],
        [ 0.,  1.]])
Batch labels: tensor([[4.],
        [2.]])
Batch inputs: tensor([[-1., -1.],
        [-1.,  1.]])
Batch labels: tensor([[1.],
        [0.]])
Batch inputs: tensor([[1., 0.]])
Batch labels: tensor([[3.]])
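
The number of batches follows from the dataset size and `batch_size`. As a sketch with dummy data, five samples in batches of two give three batches, the last one holding a single sample; `drop_last=True` discards that incomplete batch.

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.arange(10, dtype=torch.float32).reshape(5, 2)  # 5 samples
labels = torch.arange(5, dtype=torch.float32).reshape(5, 1)
dataset = TensorDataset(features, labels)

dataloader = DataLoader(dataset, batch_size=2, shuffle=False)
batch_sizes = [len(batch_inputs) for batch_inputs, _ in dataloader]

print(len(dataloader), math.ceil(len(dataset) / 2))  # number of batches
print(batch_sizes)                                   # last batch is smaller

# drop_last=True discards the incomplete final batch instead
trimmed = DataLoader(dataset, batch_size=2, drop_last=True)
print(len(trimmed))
```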


#### Mean Squared Error loss

Mean Squared Error (MSE) loss is a common loss function for regression. With a predictions tensor and ground truth tensor as input (both must be converted to floats), MSE loss is the average of the squared difference between each prediction and its corresponding ground truth.

The MSE loss function is implemented in NumPy as the function below, and is supplied in PyTorch by instantiating `nn.MSELoss()`.

In [78]:
def mean_squared_loss(prediction, target):
    return np.mean((prediction - target) ** 2)

criterion = nn.MSELoss()
# loss = criterion(torch.tensor(prediction), torch.tensor(target))
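
With concrete (illustrative) values, the NumPy formula and `nn.MSELoss` can be compared directly. The squared differences are 0.25, 0.25 and 0, so both should give a mean of about 0.1667.

```python
import numpy as np
import torch
import torch.nn as nn

prediction = np.array([2.5, 0.0, 2.0], dtype=np.float32)  # illustrative values
target = np.array([3.0, -0.5, 2.0], dtype=np.float32)

numpy_loss = np.mean((prediction - target) ** 2)

criterion = nn.MSELoss()
torch_loss = criterion(torch.tensor(prediction), torch.tensor(target))

print(numpy_loss, torch_loss.item())
```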

#### Training loop

Training a neural network involves:
- creating a model
- choosing a loss function
- defining a dataset
- setting an optimizer (optimized gradient descent function)
- and running the training loop (calculating loss via a forward pass, computing gradients via backpropagation, and updating model parameters).

In [79]:
model = nn.Sequential(
    nn.Linear(2, 8),
    nn.Linear(8, 1)
)

num_epochs = 5
# dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
optimizer = optim.SGD(model.parameters(), lr=0.001)

# Loop over the number of epochs and then the dataloader
for epoch in range(num_epochs):
    for data in dataloader: # data is a tuple with features and target
        # Set the gradients to zero
        optimizer.zero_grad()

        # Get feature and target from the dataloader
        feature, target = data

        # Run a forward pass
        prediction = model(feature)

        # Compute loss and gradients
        loss = criterion(prediction, target)
        loss.backward() # Compute gradients via backpropagation

        # Descend gradients / update model parameters
        optimizer.step()
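
To see the loop actually reduce the loss, here is a self-contained sketch on toy data. The data (points on the line $y = 2x + 1$), the single-layer model, and the learning rate are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)  # arbitrary seed for reproducibility

# Toy regression data following y = 2x + 1 (an assumption for illustration)
x = torch.linspace(-1, 1, 16).unsqueeze(1)
y = 2 * x + 1
dataloader = DataLoader(TensorDataset(x, y), batch_size=4, shuffle=True)

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

initial_loss = criterion(model(x), y).item()
for epoch in range(100):
    for features, target in dataloader:
        optimizer.zero_grad()                      # reset gradients
        loss = criterion(model(features), target)  # forward pass and loss
        loss.backward()                            # compute gradients
        optimizer.step()                           # update parameters
final_loss = criterion(model(x), y).item()

print(initial_loss, final_loss)
```

Because the toy data is noise-free and linear, the final loss should end up far below the initial one.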

#### ReLU activation functions

Sometimes activation functions can shrink gradients too much, reducing training efficiency:
- Sigmoid and softmax functions bound their outputs between 0 and 1, so their gradients become very small for large positive and large negative values of $x$ (namely, as the sigmoid output approaches either 0 or 1).
- This phenomenon is called saturation, and it leads to the vanishing gradients problem: during backpropagation each layer's gradient is multiplied by the gradients of the layers after it, so extremely small gradients can fail to update the weights effectively. This is also what renders sigmoid and softmax suboptimal for hidden layers (use them in the last layer only).

Common activation functions designed for use in hidden layers include:
- The Rectified Linear Unit (ReLU), defined by the function $f(x) = \max(x, 0)$, converts all negative inputs to 0. Its gradient stays at 1 for every positive $x$ instead of approaching 0 for large values, circumventing the vanishing gradients problem.
    - ReLU is supplied in PyTorch by instantiating `nn.ReLU()`.
- Leaky ReLU is a variation that scales negative inputs by a small coefficient (default 0.01 in PyTorch) instead of zeroing them. This keeps the gradient non-zero for negative inputs, so neurons do not stop learning, which can occur with standard ReLU.
    - Leaky ReLU is supplied in PyTorch by instantiating `nn.LeakyReLU()` with a `negative_slope` argument.

In [80]:
x_pos = torch.tensor(2.0)
x_neg = torch.tensor(-3.0)

relu_torch = nn.ReLU()
print("ReLU applied to positive x:", relu_torch(x_pos))
print("ReLU applied to negative x:", relu_torch(x_neg))

leakyrelu_torch = nn.LeakyReLU(0.05)
print("Leaky ReLU applied to positive x:", leakyrelu_torch(x_pos))
print("Leaky ReLU applied to negative x:", leakyrelu_torch(x_neg))

ReLU applied to positive x: tensor(2.)
ReLU applied to negative x: tensor(0.)
Leaky ReLU applied to positive x: tensor(2.)
Leaky ReLU applied to negative x: tensor(-0.1500)
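
The saturation argument can be checked with autograd. In this sketch (input values are illustrative), the sigmoid gradient at a large input is close to 0, the ReLU gradient is exactly 1, and the leaky ReLU gradient for a negative input equals its `negative_slope`.

```python
import torch
import torch.nn.functional as F

x = torch.tensor(10.0, requires_grad=True)  # a large input value

torch.sigmoid(x).backward()
sigmoid_grad = x.grad.item()   # sigmoid(x) * (1 - sigmoid(x)), close to 0 here

x.grad = None                  # clear the stored gradient before reusing x
torch.relu(x).backward()
relu_grad = x.grad.item()      # 1 for any positive input

x_neg = torch.tensor(-3.0, requires_grad=True)
F.leaky_relu(x_neg, negative_slope=0.05).backward()
leaky_grad = x_neg.grad.item() # equals negative_slope for negative inputs

print(sigmoid_grad, relu_grad, leaky_grad)
```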


#### Learning rate and momentum

Recall that training a neural network is optimizing its parameters to achieve minimum loss, and the Stochastic Gradient Descent (SGD) optimizer is what descends the model's gradients to this end.

`optim.SGD()` actually takes two key arguments after the model's parameters:
- `lr` (the learning rate) controls the step size; a larger value speeds up training but can overshoot and bounce back and forth around a minimum, while a smaller value takes more precise steps at the cost of slower convergence.
    - Typical range: `0.01` to `0.0001`. Rule of thumb: start with `0.001`.
- `momentum` adds inertia to help the optimizer avoid getting stuck in local minima; with larger momentum, steep drops in loss push the gradient descent further.
    - Typical range: `0.85` to `0.99`. Rule of thumb: start with `0.95`.

Momentum is crucial when the global minimum to be found is of a non-convex (i.e. local minima-riddled) function.

In [81]:
sgd = optim.SGD(model.parameters(), lr=0.001, momentum=0.95)
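
A minimal sketch of what momentum does to the update, assuming the default `optim.SGD` settings (no dampening, no Nesterov) and illustrative hyperparameters: a velocity accumulates past gradients, and the parameter moves by `lr` times that velocity. The manual copy below tracks the optimizer over three steps.

```python
import torch
import torch.optim as optim

lr, momentum = 0.1, 0.9  # illustrative values

w = torch.tensor([1.0], requires_grad=True)
optimizer = optim.SGD([w], lr=lr, momentum=momentum)

# Manual copy of the same update: velocity accumulates past gradients
w_manual, velocity = 1.0, 0.0

for _ in range(3):
    optimizer.zero_grad()
    loss = (w ** 2).sum()          # simple convex loss with gradient 2w
    loss.backward()
    optimizer.step()

    grad = 2 * w_manual
    velocity = momentum * velocity + grad
    w_manual = w_manual - lr * velocity

print(w.item(), w_manual)
```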

#### Layer initialization

The weights of a layer are often initialized to small values; PyTorch's `nn.Linear`, for instance, draws its weights uniformly from roughly $\pm 1/\sqrt{n}$ where $n$ is `in_features`, which is `-0.125` to `0.125` for a layer with 64 input features. Keeping both the input data and layer weights small ensures stable training and prevents extreme values that could slow training.

`nn.init.uniform_()` takes a tensor (like the `weight` property of a layer such as `nn.Linear`) as input and initializes it with a uniform distribution between 0 and 1.
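
As a sketch of both points, the code below checks the default weight range of a 64-input `nn.Linear` layer and then re-initializes the weights in place with custom limits. The limits `a=-0.05` and `b=0.05` are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for reproducibility

layer = nn.Linear(64, 128)
bound = 1 / 64 ** 0.5  # 0.125 for 64 input features
default_max = layer.weight.abs().max().item()
print(default_max, bound)

# Re-initialize in place; a and b are the lower and upper limits
nn.init.uniform_(layer.weight, a=-0.05, b=0.05)
print(layer.weight.min().item(), layer.weight.max().item())
```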

#### Transfer learning

Transfer learning is the reuse of a model trained on a first task for a second similar task. For instance, a model trained on US data scientist salaries can be repurposed as a model to train on European salaries, leveraging the latent information within the original model's weights as a starting point.

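Save and load weights to and from local files using `torch.save(layer, "layer.pth")` and `torch.load("layer.pth")`, which work on any type of PyTorch object (layers, whole models, or tensors). In recent PyTorch releases, `torch.load` defaults to loading only tensors and similar plain data, so loading a whole layer object may require passing `weights_only=False`.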

In [82]:
layer = nn.Linear(64, 128)

# torch.save(layer, "layer.pth")
# new_layer = torch.load("layer.pth")
nn.init.uniform_(layer.weight)
print(layer.weight.min(), layer.weight.max())

tensor(2.6822e-06, grad_fn=<MinBackward1>) tensor(0.9999, grad_fn=<MaxBackward1>)
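
A common alternative to saving the whole layer object is to save only its `state_dict()` (the weight and bias tensors) and load it into a freshly created layer of the same shape. The sketch below does this in a temporary directory; the file name is hypothetical.

```python
import os
import tempfile
import torch
import torch.nn as nn

layer = nn.Linear(64, 128)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "layer_state.pth")  # hypothetical file name
    torch.save(layer.state_dict(), path)             # save only the tensors

    new_layer = nn.Linear(64, 128)                   # same architecture
    new_layer.load_state_dict(torch.load(path))      # copy the saved weights in

print(torch.equal(layer.weight, new_layer.weight))
```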


#### Fine-tuning

A type of transfer learning where we load weights from a previously trained model and fine-tune them with a small learning rate and/or freeze certain parts of the network (often the early layers, whereas the layers closer to the output get fine-tuned).

Fine-tuning process:
- Find a model trained on a similar task
- Load pre-trained weights
- Optionally freeze some of the layers in the model
- Train with a smaller learning rate
- Inspect the loss to see whether the learning rate needs to be adjusted

Fine-tuning existing models is useful enough that deep learning engineers rarely train their models from scratch.

In [83]:
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.Linear(128, 64)
)

# named_parameters returns the name and parameter itself
for name, param in model.named_parameters():
    if name == "0.weight": # names follow "<layer index>.<parameter name>"
        param.requires_grad = False
        # setting this to False stops gradient tracking, so the
        # parameter gets no gradient and the optimizer skips it,
        # preventing further updates to the layer's weights
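
To confirm that freezing works, this sketch (dummy model, data, and loss) freezes the whole first layer, runs one optimizer step, and checks that only the second layer's weights changed.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # arbitrary seed for reproducibility

model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

# Freeze the whole first layer (weight and bias)
for name, param in model.named_parameters():
    if name.startswith("0."):
        param.requires_grad = False

frozen_before = model[0].weight.detach().clone()
trainable_before = model[1].weight.detach().clone()

# Hand only the trainable parameters to the optimizer
optimizer = optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)

loss = model(torch.randn(3, 4)).pow(2).mean()  # dummy loss on dummy data
loss.backward()
optimizer.step()

frozen_unchanged = torch.equal(model[0].weight, frozen_before)
trainable_changed = not torch.equal(model[1].weight, trainable_before)
print(frozen_unchanged, trainable_changed)
```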

#### Evaluating model performance

A dataset is typically split into three subsets:
- training (80-90%) to adjust model parameters
- validation (10-20%) to tune hyperparameters such as k in k-nearest neighbors, learning rate, and momentum
- test (5-10%) to evaluate final model performance on unseen data

Key metrics to track include loss and accuracy during training and validation. For each epoch:
- Training/validation loss is the sum of losses across all batches in the dataloader; can compute the mean training/validation loss by dividing total loss by the number of batches

When a model overfits, training loss keeps decreasing but validation loss starts to rise, which reflects a loss of generality in model performance.

In [84]:
model = nn.Sequential( # dummy model
    nn.Linear(2, 8),
    nn.Linear(8, 1)
)
X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [-1.0, 0.0],
              [-1.0, 1.0],
              [-1.0, -1.0]], dtype=np.float32)
y = np.array([[2.0],
              [3.0],
              [4.0],
              [0.0],
              [1.0]], dtype=np.float32)
dataset = TensorDataset(torch.tensor(X), torch.tensor(y))
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# @@@ Calculating training loss @@@

optimizer = optim.SGD(model.parameters(), lr=0.001) # new optimizer bound to this model
training_loss = 0
for inputs, labels in dataloader: # containing training data
    # Run the forward pass
    outputs = model(inputs)
    # Compute loss from prediction and ground truth
    loss = criterion(outputs, labels)
    # Backpropagation:
    loss.backward() # Compute gradients
    optimizer.step() # Update weights
    optimizer.zero_grad() # Reset gradients

    # Calc and sum the loss
    training_loss += loss.item()
epoch_loss = training_loss / len(dataloader) # mean loss

# @@@ Calculating validation loss @@@

validation_loss = 0
model.eval() # put model in evaluation mode,
# as some layers behave differently during training and validation

with torch.no_grad(): # disable gradients for efficiency
    for inputs, labels in dataloader: # containing validation data
        # Run the forward pass
        outputs = model(inputs)
        # Compute loss from prediction and ground truth
        loss = criterion(outputs, labels)

        # Calc and sum the loss
        validation_loss += loss.item() # .item() turns the scalar loss tensor into a Python number
epoch_loss = validation_loss / len(dataloader) # mean loss
model.train() # switch back to training mode

Sequential(
  (0): Linear(in_features=2, out_features=8, bias=True)
  (1): Linear(in_features=8, out_features=1, bias=True)
)
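
The cell above reuses one dataloader for both loops. In practice the training and validation loaders come from separate subsets; a minimal sketch of one way to produce them is `random_split` from `torch.utils.data`, shown here with dummy data and an assumed 80/10/10 split.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

features = torch.randn(100, 2)   # dummy data: 100 samples, 2 features
labels = torch.randn(100, 1)
dataset = TensorDataset(features, labels)

# 80 / 10 / 10 split; the seeded generator makes the split reproducible
generator = torch.Generator().manual_seed(0)
train_set, val_set, test_set = random_split(dataset, [80, 10, 10], generator=generator)

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16)  # no need to shuffle for evaluation

print(len(train_set), len(val_set), len(test_set))
print(len(train_loader), len(val_loader))
```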

#### Accuracy with torchmetrics

Loss describes how well a model is learning, but not necessarily how accurately it makes predictions.

For multi-class classification tasks, create an accuracy metric with `torchmetrics.Accuracy`. As the model processes each batch, the metric updates by taking predictions and ground truth labels as input (similarly to a loss function).

The model outputs a probability distribution over multiple classes, so use `argmax(dim=-1)` to select the class with the highest probability.

In [85]:
import torchmetrics

# Create accuracy metric
metric = torchmetrics.Accuracy(task="multiclass", num_classes=2)

for features, labels in dataloader:
    outputs = model(features) # Do a forward pass
    # Compute batch accuracy (selecting argmax for one-hot labels)
    # metric.update(outputs, labels.argmax(dim=-1))

# Compute accuracy over entire epoch
accuracy = metric.compute()

# Reset metric for next epoch
metric.reset()
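
For intuition, accuracy can also be computed by hand in plain PyTorch. This sketch is not the `torchmetrics` implementation; it uses illustrative outputs and class-index labels to show the `argmax` step and the fraction of correct predictions.

```python
import torch

# Illustrative model outputs (one row per sample, one column per class)
outputs = torch.tensor([[2.0, 0.1, 0.3],
                        [0.2, 0.1, 1.5],
                        [0.9, 1.1, 0.4],
                        [1.2, 0.3, 0.8]])
labels = torch.tensor([0, 2, 0, 0])  # ground truth class indices

predicted = outputs.argmax(dim=-1)               # most likely class per sample
accuracy = (predicted == labels).float().mean()  # fraction of correct predictions

print(predicted)  # tensor([0, 2, 1, 0])
print(accuracy)   # tensor(0.7500)
```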

#### Fighting overfitting 

Overfitting is when the model falters in generalizing to unseen data (performs well on training data but poorly on validation data). Possible causes:
- If the dataset is not large enough, solutions are to get more data or synthesize more through data augmentation
- If the model has too much capacity, reduce the model size or add a "dropout" layer
- If the weights are too large, use weight decay to force parameters to remain small

#### Dropout layer

Dropout is a regularization technique: a dropout layer randomly zeroes out elements of its input tensor during training, preventing the model from becoming too dependent on specific features from the training dataset. Dropout layers are typically added after activation functions and behave differently under `model.train()` vs. `model.eval()` (specifically, the zeroing is disabled during evaluation).

This layer is supplied in PyTorch by `nn.Dropout()`, where the `p` argument is the probability that a neuron (element in the input tensor) is zeroed.
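
The difference between the two modes can be seen on a tensor of ones. In training mode, PyTorch zeroes some elements and scales the survivors by $1 / (1 - p)$ so the expected sum stays the same; in evaluation mode the layer passes its input through unchanged. The seed below is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for reproducibility

dropout = nn.Dropout(p=0.5)
features = torch.ones(1, 8)

dropout.train()                   # training mode: dropout is active
train_output = dropout(features)

dropout.eval()                    # evaluation mode: dropout is a no-op
eval_output = dropout(features)

print(train_output)  # each element is either 0 or 1 / (1 - p) = 2
print(eval_output)   # identical to the input
```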

#### Weight decay

Weight decay is another form of regularization, supplied in PyTorch as an optimizer parameter (`weight_decay`) typically set to a small value like `0.0001`. It acts like a penalty on large weights added to the loss, encouraging smaller weights and improving generalization. In `optim.SGD`, a term proportional to each weight (`weight_decay * weight`) is added to that weight's gradient, so every update also shrinks the weight slightly toward zero; higher weight decay entails stronger regularization against overfitting.

In [86]:
model = nn.Sequential( # dummy model
    nn.Linear(8, 4),
    nn.ReLU(),
    nn.Dropout(p=0.5)
)
features = torch.randn((1, 8))
print(model(features))

optimizer = optim.SGD(model.parameters(), lr=0.001, weight_decay=0.0001)

tensor([[0.8662, 0.0000, 0.0000, 0.7566]], grad_fn=<MulBackward0>)
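
The effect of `weight_decay` on a single update can be verified by hand. In this sketch (illustrative values, plain `optim.SGD` without momentum), the optimizer's result matches an update whose gradient has `weight_decay * w` added to it.

```python
import torch
import torch.optim as optim

lr, weight_decay = 0.1, 0.01  # illustrative values

w = torch.tensor([2.0], requires_grad=True)
optimizer = optim.SGD([w], lr=lr, weight_decay=weight_decay)

loss = (3 * w).sum()   # gradient of this loss with respect to w is 3
loss.backward()
optimizer.step()

# Same update by hand: the decay term weight_decay * w joins the gradient
expected = 2.0 - lr * (3.0 + weight_decay * 2.0)
print(w.item(), expected)
```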


#### Maximizing model performance

Steps to maximize performance:

#### 1. Create a model that overfits the training set (be it a single data point or a small subset)
- This ensures that the problem is solvable and catches potential roadblocks early on.
- Rig the training loop to repeatedly train on a single example rather than the entire dataloader; a proper model and an applicable problem should quickly descend to near-zero loss and 100% accuracy on that data point.
- Then, scale up to the entire training set; at this stage we use an existing model architecture large enough to overfit while keeping hyperparameters (e.g. learning rate, momentum) at their defaults.

In [87]:
model = nn.Sequential(
    nn.Linear(2, 8),
    nn.Linear(8, 1)
)

optimizer = optim.SGD(model.parameters(), lr=0.001) # new optimizer bound to this model
features, labels = next(iter(dataloader))
for i in range(1000):
    outputs = model(features)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

#### 2. Maximize the validation accuracy
- Set a performance baseline to aim for with the validation set, then use strategies to reduce overfitting and help the model generalize well to unseen data: dropout, data augmentation, weight decay, or downsizing model capacity.
- Regularization comes with its costs: too little can fail to tackle overfitting and to help the model generalize to unseen data, while too much can reduce training / validation accuracy and limit the model's ability to learn effectively.

#### 3. Fine-tune hyperparameters
- Often done on optimizer settings such as learning rate and momentum.
- Methods of testing different hyperparameter values:
    - Grid search tests parameters at fixed intervals
    - Random search randomly selects values within a given range, and is often more efficient as it avoids unnecessary tests and increases the chance of finding optimal settings.

In [88]:
# Grid search from 10^-2 to 10^-6
for factor in range(2, 7): # range excludes its end, so 7 is needed to reach 10^-6
    lr = 10 ** -factor

# Random search from 10^-2 to 10^-6
factor = np.random.uniform(2, 6) # a single number
lr = 10 ** -factor
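
As a sketch of how the candidate values might be collected before training with each one, the code below builds a list of learning rates for both strategies. The seed and the number of random samples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Grid search: one learning rate per power of ten, 10^-2 down to 10^-6
grid_lrs = [10.0 ** -factor for factor in range(2, 7)]

# Random search: sample the exponent uniformly, so values are spread evenly
# on a log scale rather than clustering near the upper end
random_lrs = [10.0 ** -rng.uniform(2, 6) for _ in range(5)]

print(grid_lrs)
print(random_lrs)
```

Each candidate would then be used to build a fresh model and optimizer, with the validation loss deciding which value to keep.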