## TLDR:
- Hyperparameters affect the convergence of the training:
  - Learning rate is the amount the optimizer affects the parameters at each optimization step
  - Batch size is the number of samples processed before updating parameters
  - Epochs is the number of times the training loops over the entire dataset
- The training loop calculates for each batch the loss and optimizes the model:
  - A loss function defines how the loss is determined
  - The optimizer provides logic for how the model is optimized as a result
- The testing loop evaluates the model after each epoch

## Preparing for Training

In [42]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# Refer to "2. Datasets & DataLoaders"
training_data = datasets.FashionMNIST(
  root="Fashion-MNIST",
  train=True,
  download=True,
  transform=ToTensor()
)

test_data = datasets.FashionMNIST(
  root="Fashion-MNIST",
  train=False,
  download=True,
  transform=ToTensor()
)

train_dataloader = DataLoader(training_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

# Refer to "4. Build the Neural Network"
class MyNeuralNetwork(nn.Module):
  def __init__(self):
    super().__init__()
    self.flatten = nn.Flatten()
    self.linear_relu_stack = nn.Sequential(
      nn.Linear(28 * 28, 512),
      nn.ReLU(),
      nn.Linear(512, 512),
      nn.ReLU(),
      nn.Linear(512, 10),
    )

  def forward(self, x):
    x = self.flatten(x)
    logits = self.linear_relu_stack(x)
    return logits

model = MyNeuralNetwork()

Combining the steps in the previous notebooks, we can initialize the model by:
- Getting the datasets
- Creating DataLoaders from the datasets
- Defining the structure of the neural network

# Training Hyperparameters

In [43]:
# Hyperparameters, as shown below, impact the model
# training and convergent rates. Note that there *are*
# optimal hyperparameters, but when prototyping, they
# don't matter so much
learning_rate = 1e-3
batch_size = 64
epochs = 10

The hyperparameters affect the convergence of the model:
- Learning rate corresponds to slower learning speed and more unpredictable training behavior at lower and higher values respectively
- Batch size is the number of data samples propogated through the network before the parameters are updated
- Number of epochs is the number of times the training loops over the entire dataset

# The Training and Testing Loops

In [44]:
# To train the model, we must define a training loop.
# Training is done in a few steps:
def train_loop(dataloader, model, loss_fn, optimizer):
  batch_count = len(dataloader)
  for batch, (x, y) in enumerate(dataloader):
    # Calculate predictions
    y_pred = model(x)

    # Calculate loss and gradients with respect to
    # the parameters:
    # - Loss functions measure how far off the
    #   predictions are from the actual data
    optimizer.zero_grad()
    loss = loss_fn(y_pred, y)
    loss.backward()

    # Optimize the parameters using the gradients using
    # an optimizer:
    # - Optimizers are optimization algorithms
    #   containing logic for how to optimize parameters
    optimizer.step()

    # Log the progress of the training
    if batch % 100 == 0:
      print(f"Loss: {loss:>0.8f}, [{batch:>3d}/{batch_count:>3d}]")

# After training, the model must be evaluated. For this,
# we can define a testing loop, which uses a different
# dataloader than the training loop to test the model.
def test_loop(dataloader, model, loss_fn):
  batch_count = len(dataloader)
  pred_count = len(dataloader.dataset)
  total_loss, total_correct = 0, 0

  for (x, y) in dataloader:
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    correct = y_pred.argmax(1) == y
    
    total_loss += loss
    total_correct += correct.sum()

  # What metric is used when testing the model is up
  # to the creator, but usually it's average accuracy
  # and/or loss
  avg_loss = total_loss / batch_count
  avg_correct = total_correct / pred_count

  print(f"Accuracy: {avg_correct:>.2f}%, average loss: {avg_loss:>0.8f}")

# `CrossEntropyLoss()` calculates the loss when the
# model is used to classify between more than two classes
loss_fn = nn.CrossEntropyLoss()

# `SGD()` is short for "Stochastic Gradient Descent"
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# The training loop and testing loop are repeatedly
# executed for the total number of epochs
for t in range(epochs):
  print(f"*** Epoch #{t + 1} ***")
  print("Training...")
  train_loop(train_dataloader, model, loss_fn, optimizer)
  print("Testing...")
  test_loop(train_dataloader, model, loss_fn)
  print()

*** Epoch #1 ***
Training...
Loss: 2.29855251, [  0/938]
Loss: 2.29045606, [100/938]
Loss: 2.27337646, [200/938]
Loss: 2.26828003, [300/938]
Loss: 2.24884081, [400/938]
Loss: 2.21866322, [500/938]
Loss: 2.21745205, [600/938]
Loss: 2.18242002, [700/938]
Loss: 2.17854881, [800/938]
Loss: 2.14892769, [900/938]
Testing...
Accuracy: 0.51%, average loss: 2.14140081

*** Epoch #2 ***
Training...
Loss: 2.14595199, [  0/938]
Loss: 2.14124417, [100/938]
Loss: 2.07956338, [200/938]
Loss: 2.10073686, [300/938]
Loss: 2.04285312, [400/938]
Loss: 1.98154294, [500/938]
Loss: 2.00525117, [600/938]
Loss: 1.92100573, [700/938]
Loss: 1.93031168, [800/938]
Loss: 1.85450912, [900/938]
Testing...
Accuracy: 0.61%, average loss: 1.84844100

*** Epoch #3 ***
Training...
Loss: 1.88395309, [  0/938]
Loss: 1.85617280, [100/938]
Loss: 1.73068535, [200/938]
Loss: 1.77646601, [300/938]
Loss: 1.66236997, [400/938]
Loss: 1.62110662, [500/938]
Loss: 1.64127755, [600/938]
Loss: 1.54362488, [700/938]
Loss: 1.57724297, [80

The training and testing loops define how the model is trained and evaluated, respectively. Typically, the following are required:
- A DataLoader, containing all the data in batches
- A loss function, to determine how far off the model's predictions are
- An optimizer, to guide the model towards better predictions
- The hyperparameters, which affect convergence of the model