# Optimizing Model Parameters
- Now that the model is ready, it is time to train, validate, and test it by optimizing its parameters on our data.

## Prerequisite Code
- We load the code from the previous sections on Datasets & DataLoaders and Build Model.

In [8]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

train_dataloader = DataLoader(training_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork()

## Hyperparameters
- Hyperparameters are adjustable parameters that let you control the model optimization process.
- Different hyperparameter values can impact model training and convergence rates.

We define the following hyperparameters for training:
 - **Number of Epochs** - the number of times to iterate over the dataset
 - **Batch Size** - the number of data samples propagated through the network before the parameters are updated
 - **Learning Rate** - how much to update the model's parameters at each batch/epoch. Smaller values yield slow learning, while large values may result in unpredictable behavior during training.

In [9]:
learning_rate = 1e-3
batch_size = 64
epochs = 5
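
As a rough illustration of how these values interact, the sketch below works out how many parameter updates they imply. It assumes the 60,000-image FashionMNIST training split and a loader that keeps the final partial batch (the `DataLoader` default). The variable names are illustrative only.

```python
import math

# Assumed values: FashionMNIST training split size and the hyperparameters above.
dataset_size = 60000
batch_size = 64
epochs = 5

# One parameter update happens per batch, so count the batches in one epoch.
batches_per_epoch = math.ceil(dataset_size / batch_size)

# The final batch holds whatever is left after the full batches.
last_batch_size = dataset_size - (batches_per_epoch - 1) * batch_size

# Total number of optimizer updates over the whole run.
total_updates = batches_per_epoch * epochs

print(batches_per_epoch, last_batch_size, total_updates)
```

Under these assumptions each epoch performs 938 updates, the last of them on a batch of only 32 samples.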

## Optimization Loop

- Each iteration of the optimization loop is called an **epoch**.

Each epoch consists of two main parts:
 - **The Train Loop** - iterate over the training dataset and try to converge to optimal parameters.
 - **The Validation/Test Loop** - iterate over the test dataset to check if model performance is improving.

### Loss Function

- The **loss function** measures the degree of dissimilarity between the obtained result and the target value, and it is the loss function that we want to minimize during training.
- To calculate the loss, we make a prediction using the inputs of a given data sample and compare it against the true label.

Common loss functions include [nn.MSELoss](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss) (Mean Square Error) for regression tasks, and
[nn.NLLLoss](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html#torch.nn.NLLLoss) (Negative Log Likelihood) for classification.
[nn.CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) combines ``nn.LogSoftmax`` and ``nn.NLLLoss``.

We pass our model's output logits to ``nn.CrossEntropyLoss``, which will normalize the logits and compute the prediction error.


In [10]:
# Initialize the loss function
loss_fn = nn.CrossEntropyLoss()
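
To make the "normalize the logits and compute the prediction error" step concrete, here is a minimal plain-Python sketch of the per-sample calculation: a log-softmax followed by the negative log-likelihood of the true class. The helper name `cross_entropy` and the example logits are hypothetical. This is not the library's implementation, which also handles batching and, by default, averages over the batch.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    # LogSoftmax step: normalized log-probabilities for every class.
    log_probs = [z - log_sum_exp for z in logits]
    # NLLLoss step: negative log-probability of the true class.
    return -log_probs[target]

logits = [2.0, 1.0, 0.1]                  # hypothetical raw scores for 3 classes
loss_correct = cross_entropy(logits, 0)   # true class has the highest logit
loss_wrong = cross_entropy(logits, 2)     # true class has the lowest logit
loss_uniform = cross_entropy([0.0, 0.0, 0.0], 1)  # no preference: log(3)

print(f"{loss_correct:.3f} {loss_wrong:.3f} {loss_uniform:.3f}")
```

The loss is small when the logit of the true class dominates and grows as the model assigns that class less probability.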

## Optimizer
- Optimization is the process of adjusting model parameters to reduce model error in each training step. **Optimization algorithms** define how this process is performed; in this example we use Stochastic Gradient Descent (SGD).
- All optimization logic is encapsulated in the ``optimizer`` object.
- We initialize the optimizer by registering the model's parameters that need to be trained, and passing in the learning rate hyperparameter.

In [11]:
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

Inside the training loop, optimization happens in three steps:
 * Call ``optimizer.zero_grad()`` to reset the gradients of model parameters. Gradients by default add up; to prevent double-counting, we explicitly zero them at each iteration.
 * Backpropagate the prediction loss with a call to ``loss.backward()``. PyTorch deposits the gradients of the loss w.r.t. each parameter.
 * Once we have our gradients, we call ``optimizer.step()`` to adjust the parameters by the gradients collected in the backward pass.
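
The three steps above can be mimicked in plain Python on a one-parameter model, which may help show why the gradients need to be reset. Everything here is a hypothetical stand-in (`Param`, `backward`, `step`, `zero_grad`), not PyTorch's API. The only behavior borrowed from PyTorch is that each backward pass adds to the stored gradient instead of overwriting it.

```python
class Param:
    """Hypothetical stand-in for a trainable tensor: a value plus a stored gradient."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

def backward(w, x, y):
    # Loss is (w*x - y)**2, so d(loss)/dw = 2*x*(w*x - y).
    # The new gradient is added to whatever is already stored.
    w.grad += 2 * x * (w.value * x - y)

def step(w, lr):
    # Plain gradient descent update.
    w.value -= lr * w.grad

def zero_grad(w):
    w.grad = 0.0

# Without zeroing, two backward passes double the stored gradient.
w = Param(0.0)
backward(w, x=1.0, y=3.0)
grad_once = w.grad
backward(w, x=1.0, y=3.0)
grad_twice = w.grad

# With zeroing at each iteration, the parameter converges toward y / x = 3.
w = Param(0.0)
for _ in range(50):
    zero_grad(w)
    backward(w, x=1.0, y=3.0)
    step(w, lr=0.1)

print(grad_once, grad_twice, round(w.value, 3))
```

Skipping `zero_grad` in the loop would make every update use the sum of all previous gradients, which is the double-counting the first step guards against.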

# Full Implementation
We define `train_loop`, which loops over our optimization code, and `test_loop`, which evaluates the model's performance against our test data.

In [12]:
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")

- `size = len(dataloader.dataset)`: computes the total number of training samples by taking the length of the dataset associated with the given dataloader.
- `model.train()`: sets the model to training mode. Some layers, like dropout and batch normalization, behave differently during training and evaluation, so it is important to set the model to training mode before training.
- `for batch, (X, y) in enumerate(dataloader)`: loops over the batches in the dataloader. Each iteration unpacks a batch into `X` (input features) and `y` (ground truth labels).
- `pred = model(X)`: passes the input features `X` through the model to obtain its predictions `pred`.
- `loss = loss_fn(pred, y)`: calculates the loss between the predictions `pred` and the ground truth labels `y` using the loss function `loss_fn`.
- `loss.backward()`: performs backpropagation to compute the gradients of the loss with respect to the model's parameters.
- `optimizer.step()`: updates the model's parameters using the computed gradients. The optimizer takes a step in the direction that reduces the loss.
- `optimizer.zero_grad()`: zeros out the gradients of the model's parameters after the optimization step. It is essential to clear the gradients before computing new gradients for the next batch, because PyTorch accumulates gradients by default.

In [15]:
def test_loop(dataloader, model, loss_fn):
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

- `model.eval()`: sets the model to evaluation mode. Some layers, like dropout and batch normalization, behave differently during training and evaluation. In evaluation mode, dropout is disabled and batch normalization uses its stored running statistics instead of per-batch statistics.
- `with torch.no_grad()`: opens a context in which gradient tracking is disabled for the enclosed operations. This saves computation and memory during evaluation.
- `test_loss += loss_fn(pred, y).item()`: calculates the loss between the predictions `pred` and the ground truth labels `y` using the loss function `loss_fn`, and adds it to `test_loss` to accumulate the loss over all batches.
- `correct += (pred.argmax(1) == y).type(torch.float).sum().item()`: counts the correct predictions by comparing the predicted class index (the index of the highest value in each row of `pred`) with the ground truth labels `y`. The resulting Boolean tensor is converted to float and summed over the batch.
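
The bookkeeping in `test_loop` can be traced by hand on a tiny example. The sketch below uses plain Python lists in place of tensors. The scores, labels, and per-batch losses are made up for illustration, and `argmax` is a hypothetical helper standing in for `pred.argmax(1)` on a single row.

```python
def argmax(row):
    """Index of the largest score in one row of class scores."""
    return max(range(len(row)), key=lambda i: row[i])

# Made-up model outputs and labels, already split into two batches of two samples.
batches = [
    ([[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]], [1, 0]),
    ([[0.0, 0.1, 3.0], [0.9, 0.8, 0.7]], [2, 1]),
]
batch_losses = [0.5, 1.5]  # made-up mean loss of each batch

size = sum(len(labels) for _, labels in batches)   # total number of samples
num_batches = len(batches)

test_loss, correct = 0.0, 0
for (scores, labels), loss in zip(batches, batch_losses):
    test_loss += loss
    # Count the samples whose highest-scoring class matches the label.
    correct += sum(argmax(row) == label for row, label in zip(scores, labels))

test_loss /= num_batches   # average of the per-batch losses
accuracy = correct / size  # fraction of correctly classified samples

print(f"Accuracy: {100 * accuracy:.1f}%, Avg loss: {test_loss:.6f}")
```

Note that the loss is averaged over batches while the accuracy is averaged over samples, so a smaller final batch carries slightly more weight per sample in the reported loss.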

In [16]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

epochs = 10
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)
print("Done!")

Epoch 1
-------------------------------
loss: 2.167865 [   64/60000]
loss: 2.164467 [ 6464/60000]
loss: 2.096689 [12864/60000]
loss: 2.119823 [19264/60000]
loss: 2.071922 [25664/60000]
loss: 2.011602 [32064/60000]
loss: 2.037684 [38464/60000]
loss: 1.959946 [44864/60000]
loss: 1.982416 [51264/60000]
loss: 1.894180 [57664/60000]
Test Error: 
 Accuracy: 53.0%, Avg loss: 1.895716 

Epoch 2
-------------------------------
loss: 1.927295 [   64/60000]
loss: 1.907401 [ 6464/60000]
loss: 1.776551 [12864/60000]
loss: 1.825603 [19264/60000]
loss: 1.722878 [25664/60000]
loss: 1.666149 [32064/60000]
loss: 1.689700 [38464/60000]
loss: 1.585172 [44864/60000]
loss: 1.624613 [51264/60000]
loss: 1.507373 [57664/60000]
Test Error: 
 Accuracy: 58.2%, Avg loss: 1.529580 

Epoch 3
-------------------------------
loss: 1.590795 [   64/60000]
loss: 1.567051 [ 6464/60000]
loss: 1.400884 [12864/60000]
loss: 1.484929 [19264/60000]
loss: 1.372178 [25664/60000]
loss: 1.358258 [32064/60000]
loss: 1.373359 [38464/

# Save and Load the Model

In [18]:
import torch
import torchvision.models as models

## Saving and Loading Model Weights
PyTorch models store the learned parameters in an internal
state dictionary, called ``state_dict``. These can be persisted via the ``torch.save``
method:



In [19]:
model = models.vgg16(weights='IMAGENET1K_V1')
torch.save(model.state_dict(), 'model_weights.pth')

Downloading: "https://download.pytorch.org/models/vgg16-397923af.pth" to C:\Users\Msc 2/.cache\torch\hub\checkpoints\vgg16-397923af.pth
100%|██████████| 528M/528M [01:10<00:00, 7.82MB/s] 


To load model weights, you need to create an instance of the same model first, and then load the parameters using the ``load_state_dict()`` method.

In [20]:
model = models.vgg16() # we do not specify ``weights``, i.e. create untrained model
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1

**Note:** Be sure to call the ``model.eval()`` method before running inference to set the dropout and batch normalization layers to evaluation mode. Failing to do this will yield inconsistent inference results.
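
The same save and load round trip can be tried on a much smaller model without downloading any weights. This is a sketch under a few assumptions: a single `nn.Linear` layer stands in for the real network, an in-memory `io.BytesIO` buffer stands in for the `.pth` file, and `weights_only=True` is passed to `torch.load`, which requires a PyTorch version that supports that argument.

```python
import io

import torch
from torch import nn

# A small stand-in model; its state_dict holds a weight and a bias tensor.
source = nn.Linear(4, 2)
state_keys = list(source.state_dict().keys())

# Save only the learned parameters, here to an in-memory buffer instead of a file.
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)
buffer.seek(0)

# Loading requires a fresh instance with the same architecture.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer, weights_only=True))
restored.eval()

# Every restored tensor should be identical to the one that was saved.
weights_match = all(
    torch.equal(saved, loaded)
    for saved, loaded in zip(source.state_dict().values(),
                             restored.state_dict().values())
)
print(state_keys, weights_match)
```

Because only tensors are stored, the file says nothing about the architecture, which is why the model class has to be instantiated before `load_state_dict()` is called.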

## Saving and Loading Models with Shapes
When loading model weights, we needed to instantiate the model class first, because the class
defines the structure of a network. We might want to save the structure of this class together with
the model, in which case we can pass ``model`` (and not ``model.state_dict()``) to the saving function:



In [21]:
torch.save(model, 'model.pth')

We can then load the model like this:



In [22]:
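# Loading a fully pickled model needs ``weights_only=False`` on PyTorch 2.6 and later,
# where ``torch.load`` defaults to ``weights_only=True``. Only load files you trust.
model = torch.load('model.pth', weights_only=False)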

**Note:** This approach uses the Python [pickle](https://docs.python.org/3/library/pickle.html) module when serializing the model, so it relies on the actual class definition being available when loading the model.