<a href="https://www.kaggle.com/code/aisuko/training-models-with-hyperparameters?scriptVersionId=164331017" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Training a model is an iterative process. In each iteration (called an *epoch*), the model makes a guess about the output, calculates the error in its guess (the *loss*), collects the derivatives of the error with respect to its parameters (as we saw in the previous module), and **optimizes** these parameters using gradient descent.
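
As a minimal sketch of that cycle, the plain-Python loop below fits a single parameter by gradient descent. The target value `3.0`, the squared-error loss, and the names `w`, `lr`, and `target` are illustrative assumptions; they are not part of the model trained later in this notebook.

```python
# Gradient descent on one parameter (illustrative only, not the notebook's model).
target = 3.0   # assumed "correct answer" the parameter should reach
w = 0.0        # initial guess
lr = 0.1       # learning rate (step size)

for step in range(100):
    loss = (w - target) ** 2   # error of the current guess
    grad = 2 * (w - target)    # derivative of the loss with respect to w
    w -= lr * grad             # step against the gradient

print(f"w = {w:.4f}, loss = {(w - target) ** 2:.2e}")
```

Each pass shrinks the distance between `w` and `target` by a constant factor, so the loss falls steadily, which is the same behavior we hope to see from the full network.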

In [1]:
import os
import torch
import warnings

if torch.cuda.is_available():
    torch_device = 'cuda'
else:
    torch_device = 'cpu'

warnings.filterwarnings('ignore')

print(torch_device)

cuda


# Loading Dataset

In [2]:
%matplotlib inline
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor(),
)

train_dataloader = DataLoader(training_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

print(train_dataloader)
print(test_dataloader)

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to data/FashionMNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 26421880/26421880 [00:15<00:00, 1680565.00it/s]


Extracting data/FashionMNIST/raw/train-images-idx3-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 29515/29515 [00:00<00:00, 111689.73it/s]


Extracting data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 4422102/4422102 [00:02<00:00, 2091611.39it/s]


Extracting data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 5148/5148 [00:00<00:00, 11240123.37it/s]

Extracting data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw

<torch.utils.data.dataloader.DataLoader object at 0x787f169653f0>
<torch.utils.data.dataloader.DataLoader object at 0x787f16965840>





# Define the Model

In [3]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
            nn.ReLU()
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(torch_device)
print(model)

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
    (5): ReLU()
  )
)


# Setting Hyperparameters

Hyperparameters are adjustable parameters that let you control the model optimization process. Different hyperparameter values can affect both how training proceeds and the accuracy the model reaches.

We define the following hyperparameters for training:
* **Number of Epochs** - the number of times the entire training dataset is passed through the network.
* **Batch Size** - the number of data samples the model sees in each iteration, before its parameters are updated. The number of iterations is the number of batches needed to complete an epoch.
* **Learning Rate** - the size of the steps the model takes as it searches for the weights that produce the highest accuracy. Smaller values mean the model takes longer to find good weights, while larger values may cause the model to step over and miss them, which yields unpredictable behavior during training.
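
To see how these values interact, the arithmetic below works out how many iterations (parameter updates) one epoch takes. It is a sketch that assumes the 60,000-image FashionMNIST training set and the default `DataLoader` behavior of keeping the final, smaller batch; the variable names are illustrative.

```python
import math

dataset_size = 60_000   # FashionMNIST training images
batch_size = 64
epochs = 5

# One iteration processes one batch; the last batch may be smaller.
iterations_per_epoch = math.ceil(dataset_size / batch_size)
last_batch_size = dataset_size - (iterations_per_epoch - 1) * batch_size
total_updates = iterations_per_epoch * epochs

print(iterations_per_epoch, last_batch_size, total_updates)
```

Halving the batch size doubles the number of updates per epoch, so batch size and number of epochs together determine how many times the parameters are adjusted.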


In [4]:
learning_rate = 1e-3
batch_size = 64
epochs = 5

# Adding an Optimization Loop

Once we set our hyperparameters, we can then train and optimize our model with an optimization loop. Each iteration of the optimization loop is called an **epoch**.

Each epoch consists of two main parts:
* **The Train Loop** - iterate over the training dataset and try to converge to optimal parameters.
* **The Validation/Test Loop** - iterate over the test dataset to check if model performance is improving.

Let's briefly familiarize ourselves with some of the concepts used in the training loop.


# Adding a Loss Function

When presented with some training data, our untrained network is unlikely to give the correct answer. A **loss function** measures how dissimilar the obtained result is from the target value, and it is the loss function that we want to minimize during training. To calculate the loss, we make a prediction using the inputs of a given data sample and compare it against the true label.

Common loss functions include:
* `nn.MSELoss` (Mean Square Error) used for regression tasks
* `nn.NLLLoss` (Negative Log Likelihood) used for classification
* `nn.CrossEntropyLoss` combines `nn.LogSoftmax` and `nn.NLLLoss`

We pass our model's output logits to `nn.CrossEntropyLoss`, which will normalize the logits and compute the prediction error.

In [5]:
loss_fn = nn.CrossEntropyLoss()
print(loss_fn)

CrossEntropyLoss()
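
To make the normalization step concrete, here is a plain-Python sketch of cross-entropy for a single sample: a log-softmax over the logits followed by the negative log likelihood of the target class. The helper `cross_entropy` and the example logits are illustrative assumptions, not PyTorch's implementation.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one sample: log-softmax, then negative log likelihood."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]  # log-softmax
    return -log_probs[target]                      # negative log likelihood

logits = [2.0, 1.0, 0.1]                        # made-up scores for three classes
loss_correct = cross_entropy(logits, target=0)  # label is the top-scoring class
loss_wrong = cross_entropy(logits, target=2)    # label is the lowest-scoring class

print(f"{loss_correct:.4f} {loss_wrong:.4f}")
```

The loss is small when the highest logit belongs to the true class and grows as the model assigns that class a lower score.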


# Optimization Pass

Optimization is the process of adjusting model parameters to reduce model error in each training step. **Optimization algorithms** define how this process is performed (in this example we use Stochastic Gradient Descent). All optimization logic is encapsulated in the `optimizer` object. Here, we use the SGD optimizer; PyTorch also provides many other optimizers, such as Adam and RMSprop, that work better for different kinds of models and data.

We initialize the optimizer by registering the model's parameters that need to be trained, and passing in the learning rate hyperparameter.

In [6]:
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
print(optimizer)

SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.001
    maximize: False
    momentum: 0
    nesterov: False
    weight_decay: 0
)
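
With `momentum` and `weight_decay` both at 0, as in the printout above, an SGD step amounts to subtracting `lr * gradient` from each parameter. The sketch below checks this on a toy two-element tensor; the tensor `w` and the quadratic loss are illustrative assumptions.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)  # toy parameter, not the model's
lr = 0.1
optimizer = torch.optim.SGD([w], lr=lr)

loss = (w ** 2).sum()
loss.backward()                        # gradient of sum(w^2) is 2 * w

expected = w.detach() - lr * w.grad    # manual update, computed before stepping
optimizer.step()                       # update performed by the optimizer

matches = torch.allclose(w.detach(), expected)
print(matches)
```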


Inside the training loop, optimization happens in three steps:
* Call `optimizer.zero_grad()` to reset the gradients of model parameters. Gradients by default add up; **to prevent double-counting**, we explicitly zero them at each iteration.
* Back-propagate the prediction loss with a call to `loss.backward()`. PyTorch deposits the gradients of the loss w.r.t. each parameter.
* Once we have our gradients, we call `optimizer.step()` to adjust the parameters by the gradients collected in the backward pass.
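
The accumulation behavior behind the first step can be observed directly. In this sketch a toy scalar parameter is back-propagated twice without resetting, then once more after `optimizer.zero_grad()`; the parameter `w`, the loss, and the helper `backward_once` are illustrative assumptions.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)  # toy parameter, not the model's
optimizer = torch.optim.SGD([w], lr=0.1)

def backward_once():
    """Run one backward pass and return the gradient now stored on w."""
    loss = (3.0 * w) ** 2   # d(loss)/dw = 18 * w
    loss.backward()
    return w.grad.item()

first = backward_once()        # gradient from a single backward pass
accumulated = backward_once()  # second pass is added to the stored gradient
optimizer.zero_grad()          # reset before the next pass
after_reset = backward_once()

print(first, accumulated, after_reset)
```

The second value is twice the first because the gradients were summed; after the reset the gradient returns to its single-pass value.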


# Full Implementation

We define `train_loop` that loops over our optimization code, and `test_loop` that evaluates the model's performance against our test data.

In [7]:
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        # X is the input tensor
        # y is a tensor of target labels
        X, y = X.to(torch_device), y.to(torch_device)
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()  # reset gradients to prevent double-counting
        loss.backward()  # back-propagate the prediction loss w.r.t. each parameter
        optimizer.step()  # adjust the parameters using the gradients from the backward pass

        # Report progress every 100 batches
        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")


def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
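    test_loss, correct = 0, 0

    # Gradients are not needed for evaluation
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(torch_device), y.to(torch_device)
            pred = model(X)
            # loss_fn returns the mean loss of the batch
            test_loss += loss_fn(pred, y).item()
            # Count predictions whose highest-scoring class matches the label
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    # Note: this divides a sum of per-batch means by the number of samples
    test_loss /= size
    correct /= size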
    
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
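
One detail of `test_loop` is worth noting. `nn.CrossEntropyLoss` returns the mean loss of each batch by default, so summing those means and dividing by the dataset size gives a value roughly `1/batch_size` of the per-sample loss. This is why the recorded `Avg loss` values (around 0.035) are far smaller than the per-batch training losses (around 2.2). The sketch below contrasts the two ways of averaging; the batch losses are made-up numbers.

```python
batch_size = 64
batch_mean_losses = [2.2, 2.1, 2.3]   # made-up per-batch mean losses
num_batches = len(batch_mean_losses)
dataset_size = batch_size * num_batches

total = sum(batch_mean_losses)
loss_per_dataset_size = total / dataset_size  # what test_loop computes
loss_per_batch = total / num_batches          # mean of the batch means

print(f"{loss_per_dataset_size:.4f} {loss_per_batch:.4f}")
```

In the notebook's code, `len(dataloader)` gives the number of batches, so dividing `test_loss` by it instead of by `size` would put the reported value on the same scale as the training loss.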

We initialize the loss function and optimizer, and pass them to `train_loop` and `test_loop`. Feel free to increase the number of epochs to track the model's improving performance.

In [8]:
epochs = 3

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)

print("Done!")

Epoch 1
-------------------------------
loss: 2.298036 [    0/60000]
loss: 2.296766 [ 6400/60000]
loss: 2.291211 [12800/60000]
loss: 2.283522 [19200/60000]
loss: 2.274467 [25600/60000]
loss: 2.259772 [32000/60000]
loss: 2.255943 [38400/60000]
loss: 2.247060 [44800/60000]
loss: 2.236033 [51200/60000]
loss: 2.204096 [57600/60000]
Test Error: 
 Accuracy: 37.9%, Avg loss: 0.035007 

Epoch 2
-------------------------------
loss: 2.243741 [    0/60000]
loss: 2.261688 [ 6400/60000]
loss: 2.237645 [12800/60000]
loss: 2.218422 [19200/60000]
loss: 2.211165 [25600/60000]
loss: 2.163101 [32000/60000]
loss: 2.167909 [38400/60000]
loss: 2.142121 [44800/60000]
loss: 2.113598 [51200/60000]
loss: 2.078067 [57600/60000]
Test Error: 
 Accuracy: 40.3%, Avg loss: 0.033314 

Epoch 3
-------------------------------
loss: 2.134739 [    0/60000]
loss: 2.177348 [ 6400/60000]
loss: 2.125244 [12800/60000]
loss: 2.107274 [19200/60000]
loss: 2.100496 [25600/60000]
loss: 1.995999 [32000/60000]
loss: 2.009327 [38400/

Try running the loop for more `epochs` or increasing the `learning_rate`. It may also be that the model configuration we chose is not the optimal one for this kind of problem (it isn't).
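
The effect of the learning rate can be previewed on a toy problem before spending time on full training runs. The sketch below runs gradient descent on a one-parameter squared-error loss with three step sizes; the function `run`, the target value, and the specific rates are illustrative assumptions, and the thresholds at which a real network slows down or diverges will differ.

```python
def run(lr, steps=50, target=3.0):
    """Return the distance from the target after a fixed number of steps."""
    w = 0.0
    for _ in range(steps):
        w -= lr * 2 * (w - target)   # gradient of (w - target)^2 is 2 * (w - target)
    return abs(w - target)

errors = {lr: run(lr) for lr in (0.001, 0.1, 1.1)}
for lr, err in errors.items():
    print(f"lr={lr}: distance from target = {err:.3g}")
```

The smallest rate has barely moved after 50 steps, the middle rate gets very close, and the largest rate overshoots further on every step and ends up farther away than where it started.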


# Saving models

When you are satisfied with the model's performance, you can use `torch.save` to save it. PyTorch models store the learned parameters in an internal state dictionary, called `state_dict`. These can be persisted with the `torch.save` method:

In [9]:
torch.save(model.state_dict(), "model.pth")

print("Saved PyTorch Model State to model.pth")

Saved PyTorch Model State to model.pth
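
To use the saved parameters later, a common pattern is to re-create a model with the same architecture and load the state dictionary into it. The sketch below uses a small `nn.Linear` as a stand-in for `NeuralNetwork` and a temporary directory instead of `model.pth`, both assumptions made to keep the example self-contained.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 2)   # small stand-in for NeuralNetwork
path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(model.state_dict(), path)

restored = nn.Linear(4, 2)   # must match the saved architecture
restored.load_state_dict(torch.load(path))
restored.eval()              # switch to inference mode

# Every restored tensor should equal the one that was saved.
same = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same)
```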
