## Optimizing Model Parameters
- Model training is an iterative process; each iteration performs the following steps.
> 1. Estimate the output (make a prediction).
> 2. Calculate the loss between the estimate and the real answer.
> 3. Collect the derivatives of the error with respect to the parameters.
> 4. Optimize all parameters using *Gradient Descent*.

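The four steps above can be sketched on a toy problem before moving to the real model. This is only an illustration under assumptions: the data (`y = 2x`), the single weight `w`, and the learning rate `lr = 0.1` are hypothetical and do not come from this notebook.

```python
import torch

# Hypothetical toy data following y = 2x; w is the only trainable parameter.
x = torch.tensor([1.0, 2.0, 3.0])
y_true = torch.tensor([2.0, 4.0, 6.0])
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1  # assumed learning rate for this toy problem

losses = []
for step in range(20):
    y_pred = w * x                           # 1. estimate the output
    loss = ((y_pred - y_true) ** 2).mean()   # 2. loss between estimate and answer
    loss.backward()                          # 3. derivative of the loss w.r.t. w
    with torch.no_grad():
        w -= lr * w.grad                     # 4. gradient descent update
    w.grad.zero_()                           # clear the gradient for the next iteration
    losses.append(loss.item())

print(f"learned w: {w.item():.4f}, first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```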
In [1]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

In [2]:
# Before training, check which device is available
device = ("cuda" if torch.cuda.is_available()
                else "mps" if torch.backends.mps.is_available()
                else "cpu")

print(f"Current Device: {device}")

Current Device: cuda


### Pre-requisite Code

In [3]:
train_data = datasets.FashionMNIST(
    root='data',
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root='data',
    train=False,
    download=True,
    transform=ToTensor()
)

train_dataloader = DataLoader(train_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()

        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 18),  # FashionMNIST has 10 classes, so 10 outputs would suffice; the 8 extra logits never match a label
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)

        return logits
    
model = NeuralNetwork().to(device)

### Hyperparameter
- `Hyperparameters` are adjustable parameters that control the model optimization process.
- Different hyperparameter values can affect model training and convergence rate.

<br>

- For training, the following hyperparameters must be defined.
> - `epochs`: The number of times to iterate over the dataset.
> - `batch_size`: The number of data samples propagated through the neural network before the parameters are updated.
> - `learning_rate`: How much the model's parameters are updated at each batch/epoch.  
> Smaller values lead to slower learning, and larger values can lead to unpredictable behavior during training.

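As a rough worked example of how `epochs` and `batch_size` interact, the arithmetic below counts parameter updates. It assumes the FashionMNIST training set size of 60,000 samples (the figure shown in the training logs) together with the values used in this notebook; the variable names are illustrative only.

```python
import math

dataset_size = 60_000  # FashionMNIST training samples, as shown in the logs
batch_size = 64
epochs = 5

# One parameter update per batch; the last batch of an epoch may be smaller.
updates_per_epoch = math.ceil(dataset_size / batch_size)
last_batch_size = dataset_size - (updates_per_epoch - 1) * batch_size
total_updates = updates_per_epoch * epochs

print(updates_per_epoch)  # 938
print(last_batch_size)    # 32
print(total_updates)      # 4690
```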
In [4]:
learning_rate = 1e-3
batch_size = 64
epochs = 5

### Optimization Loop
- After setting the hyperparameters, you can train and optimize your model with an *Optimization Loop*.
- Each *iteration* of the Optimization Loop is called an `Epoch`.

<br>

- Each Epoch consists of two main parts.
> - `Train Loop`: Iterates over the training dataset and tries to converge to the optimal parameters.
> - `Test/Validation Loop`: Iterates over the test dataset to check whether the model's performance is improving.

### Loss Function
- When given training data, an untrained neural network will most likely not provide the correct answer.
- The *Loss Function* measures the degree of dissimilarity between the obtained result and the actual value; training attempts to minimize it.
- The *loss* is calculated by comparing the prediction (computed from the input of a given data sample) with the real answer (*label*).

<br>

- Common loss functions include:
> - `nn.MSELoss`: MSE (Mean Squared Error) for *Regression* tasks.
> - `nn.NLLLoss`: NLL (Negative Log Likelihood) for *Classification* tasks.
> - `nn.CrossEntropyLoss`: The combination of `nn.LogSoftmax` and `nn.NLLLoss`.

<br>

- Pass the model's output *logits* to `nn.CrossEntropyLoss`, which normalizes them and calculates the prediction error.

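The relationship between `nn.CrossEntropyLoss`, `nn.LogSoftmax`, and `nn.NLLLoss` can be checked numerically. The sketch below uses a made-up batch of random logits and labels (not FashionMNIST data) and compares the two ways of computing the loss.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10)          # hypothetical batch: 4 samples, 10 classes
labels = torch.tensor([3, 0, 9, 1])  # hypothetical class indices

# Cross entropy applied directly to the raw logits.
ce = nn.CrossEntropyLoss()(logits, labels)

# The same quantity built from its two parts: log-softmax, then NLL.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, labels)

print(torch.allclose(ce, nll))  # True
```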
In [5]:
# Initialize the Loss Function
loss_fn = nn.CrossEntropyLoss()

### Optimizer
- *Optimization* is the process of adjusting the model's parameters at each training step to reduce the model's error.
- The *Optimization Algorithm* defines how this process is performed. In this example, the algorithm is `SGD` (Stochastic Gradient Descent).
- All optimization logic is encapsulated in the `optimizer` object.
- PyTorch provides many other optimizers, such as `Adam` and `RMSprop`, that work better for certain types of models or data.

<br>

- To initialize the optimizer, register the model parameters that need to be trained and pass in the learning rate.

In [6]:
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

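Other optimizers from `torch.optim` are constructed in the same way, so trying one out is usually a one-line change. The sketch below uses a small stand-in `nn.Linear` model (an assumption, so that the snippet runs on its own) rather than the `NeuralNetwork` defined earlier.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)  # hypothetical stand-in for the real model
learning_rate = 1e-3

# Each optimizer receives the parameters to train plus a learning rate.
sgd = torch.optim.SGD(model.parameters(), lr=learning_rate)
adam = torch.optim.Adam(model.parameters(), lr=learning_rate)
rmsprop = torch.optim.RMSprop(model.parameters(), lr=learning_rate)

for opt in (sgd, adam, rmsprop):
    print(type(opt).__name__, opt.param_groups[0]["lr"])
```

Whether one of these works better than `SGD` depends on the model and the data, so the learning rate may need to be re-tuned after switching.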
- In the training loop, optimization consists of 3 steps.
> 1. Reset the gradients of the model's parameters by calling `optimizer.zero_grad()`.  
Gradients accumulate by default, so they must be set to 0 at each iteration to prevent double-counting.
> 2. Backpropagate the *prediction loss* by calling `loss.backward()`.  
PyTorch stores the gradient of the loss with respect to each parameter.
> 3. After calculating the gradients, call `optimizer.step()` to adjust the parameters using the gradients collected in the backpropagation step.

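Why the gradients need to be reset can be seen on a single scalar parameter. This is a minimal sketch with a hypothetical parameter `w` and loss `3 * w` (whose gradient is always 3), not part of the FashionMNIST model.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)  # hypothetical single parameter
optimizer = torch.optim.SGD([w], lr=0.1)

(3 * w).backward()
first = w.grad.item()         # 3.0

(3 * w).backward()            # no reset in between: gradients add up
accumulated = w.grad.item()   # 6.0

optimizer.zero_grad()         # step 1: reset the gradients
(3 * w).backward()            # step 2: backpropagate
fresh = w.grad.item()         # 3.0 again
optimizer.step()              # step 3: w <- w - lr * grad = 1.0 - 0.1 * 3.0

print(first, accumulated, fresh, round(w.item(), 4))  # 3.0 6.0 3.0 0.7
```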
### Full Implementation
- Define `train_loop` (runs the optimization code over the training data) and `test_loop` (evaluates the model's performance on the test data).

In [8]:
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)

    # Set the model to training mode; this matters for Batch Normalization and Dropout layers.
    # It is not necessary in this example, but it is included as best practice.
    model.train()

    for batch, (X, y) in enumerate(dataloader):
        # Move the batch to the selected device (GPU if available)
        X, y = X.to(device), y.to(device)

        # Calculation of prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * batch_size + len(X)
            print(f"loss: {loss:>7f}  [{current:>5d} / {size:>5d}]")


def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Set the model to evaluation mode; this matters for Batch Normalization and Dropout layers.
    # It is not necessary in this example, but it is included as best practice.
    model.eval()

    with torch.no_grad():
        for X, y in dataloader:
            # Move the batch to the selected device (GPU if available)
            X, y = X.to(device), y.to(device)

            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100 * correct):>0.1f}%, Average Loss: {test_loss:>8f} \n")

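The accuracy line in `test_loop` packs several operations into one expression. As an illustration, the sketch below unpacks it on a made-up batch of 4 predictions over 3 classes; the numbers are hypothetical and chosen so that exactly one prediction is wrong.

```python
import torch

# Hypothetical logits for 4 samples and 3 classes, with their true labels.
pred = torch.tensor([[2.0, 0.1, 0.3],
                     [0.2, 1.5, 0.1],
                     [0.9, 0.8, 0.7],
                     [0.1, 0.2, 3.0]])
y = torch.tensor([0, 1, 2, 2])

predicted_class = pred.argmax(1)            # index of the largest logit per row
matches = (predicted_class == y)            # True where the prediction is right
correct = matches.type(torch.float).sum().item()
accuracy = correct / len(y)

print(predicted_class.tolist())  # [0, 1, 0, 2]
print(correct, accuracy)         # 3.0 0.75
```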
- Initialize the loss function and optimizer, and pass them to `train_loop` and `test_loop`.
- Feel free to increase the number of `epochs` to observe the model's performance improving.

In [9]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

epochs = 50

for t in range(epochs):
    print(f"Epoch {t + 1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)

print("Done!")

Epoch 1
-------------------------------
loss: 2.895954  [   64 / 60000]
loss: 2.859394  [ 6464 / 60000]
loss: 2.811296  [12864 / 60000]
loss: 2.779952  [19264 / 60000]
loss: 2.726380  [25664 / 60000]
loss: 2.648319  [32064 / 60000]
loss: 2.634180  [38464 / 60000]
loss: 2.533622  [44864 / 60000]
loss: 2.502630  [51264 / 60000]
loss: 2.420824  [57664 / 60000]
Test Error: 
 Accuracy: 27.1%, Average Loss: 2.390184 

Epoch 2
-------------------------------
loss: 2.406029  [   64 / 60000]
loss: 2.383222  [ 6464 / 60000]
loss: 2.261412  [12864 / 60000]
loss: 2.278776  [19264 / 60000]
loss: 2.210935  [25664 / 60000]
loss: 2.082571  [32064 / 60000]
loss: 2.157885  [38464 / 60000]
loss: 2.019848  [44864 / 60000]
loss: 2.041665  [51264 / 60000]
loss: 1.944015  [57664 / 60000]
Test Error: 
 Accuracy: 54.1%, Average Loss: 1.931624 

Epoch 3
-------------------------------
loss: 1.973522  [   64 / 60000]
loss: 1.938063  [ 6464 / 60000]
loss: 1.785159  [12864 / 60000]
loss: 1.829759  [19264 / 60000]
