### Deep Learning with PyTorch 2

#### Object-Oriented Programming
PyTorch leans on object-oriented programming: datasets are defined by subclassing `Dataset`, and models by subclassing `nn.Module` (PyTorch's base class for neural networks).

For a `Dataset`, common methods of note include:
- `def __init__(self, ...)` which is called when the object is created.
    - Note that a subclass that overrides `__init__` should call `super().__init__()` so that the parent class's constructor still runs. `Dataset` does little in its own constructor, so here this is mostly good practice, but it keeps the subclass safe if the parent ever needs initialization.
- `def __len__(self)` which returns the size of the `Dataset` (number of entries).
- `def __getitem__(self, idx)` which returns the features and label for the single sample at index `idx`.

In [1]:
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import pandas as pd

class MyDataset(Dataset):
    def __init__(self, csv_path):
        super().__init__()
        df = pd.read_csv(csv_path)
        self.data = df.to_numpy()
    
    def __len__(self):
        return self.data.shape[0]
    
    def __getitem__(self, idx):
        features = self.data[idx, :-1] # gets the row's features at index idx (all columns except the last)
        label = self.data[idx, -1] # gets the row's label at index idx (the last column)
        return features, label

# dataset_train = MyDataset("dataset.csv")
# dataloader_train = DataLoader(dataset_train, batch_size=2, shuffle=True)
# features, labels = next(iter(dataloader_train))
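
The usage lines above are commented out because they need a CSV file on disk. As a minimal sketch, the same pattern can be tried with an in-memory CSV. The class name `CsvDataset`, the column names, and the values are made up for illustration; the cast to `float32` is an assumption added so the samples match the default dtype of `nn.Linear` weights.

```python
import io

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class CsvDataset(Dataset):
    """Hypothetical variant of MyDataset that stores the data as a float32 tensor."""

    def __init__(self, csv_source):
        super().__init__()
        df = pd.read_csv(csv_source)  # accepts a path or a file-like object
        self.data = torch.tensor(df.to_numpy(), dtype=torch.float32)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # All columns except the last are features; the last column is the label
        return self.data[idx, :-1], self.data[idx, -1]


# Made-up data: two feature columns and one binary label column
csv_text = "f1,f2,label\n0.1,0.2,0\n0.3,0.4,1\n0.5,0.6,0\n0.7,0.8,1\n"
dataset = CsvDataset(io.StringIO(csv_text))

# shuffle=False keeps the first batch predictable for inspection
loader = DataLoader(dataset, batch_size=2, shuffle=False)
features, labels = next(iter(loader))
print(len(dataset), features.shape, labels.shape)
```

The `DataLoader` calls `__len__` to know how many samples exist and `__getitem__` once per sample, then stacks the results into batches.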

For a Model, common methods of note include:
- `def __init__(self, ...)` which is called when the object is created.
    - Note that for a subclass of PyTorch's `nn.Module`, `super().__init__()` must be called before any layers are assigned, because the constructor of `nn.Module` sets up the internal bookkeeping that tracks parameters and submodules.
- `def forward(self, x)` which runs a forward pass from input to output. In particular, each layer's output is wrapped in the activation function that succeeds it.

Note that `nn.functional.relu()` and `nn.functional.sigmoid()` are the functional counterparts of the `nn.ReLU()` and `nn.Sigmoid()` modules and compute the same results; use the functional form for inline, stateless activations in `forward()`, and the module form when building an `nn.Sequential`.
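
A quick way to check this equivalence is to apply both forms to the same tensor. This is a small sketch with a made-up input; the variable names are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])  # made-up input values

# Module form: create the object once, then call it like a function
relu_module = nn.ReLU()
sigmoid_module = nn.Sigmoid()

# Functional form: call directly, no object needed
same_relu = torch.equal(relu_module(x), F.relu(x))
same_sigmoid = torch.allclose(sigmoid_module(x), F.sigmoid(x))
print(same_relu, same_sigmoid)
```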

In [2]:
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(9, 16),
    nn.ReLU(),
    nn.Linear(16, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid(),
)

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16) # fc = shorthand for fully-connected layer
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

    def forward(self, x):
        x = nn.functional.relu(self.fc1(x))
        x = nn.functional.relu(self.fc2(x))
        x = nn.functional.sigmoid(self.fc3(x))
        return x

mymodel = MyModel()

#### Optimizers, training, and evaluation

Recall the steps of training a model in PyTorch:
- creating a model
- choosing a loss function
- defining a dataset
- setting an optimizer (the algorithm that applies gradient descent updates to the parameters)
- running the training loop (calculating loss via a forward pass, computing gradients via backpropagation, and updating model parameters).

In [3]:
import torch.optim as optim
from torch.utils.data import TensorDataset

# binary cross-entropy loss: cross-entropy specialized for binary classification
criterion = nn.BCELoss()
optimizer = optim.SGD(mymodel.parameters(), lr=0.01)

X = torch.randn((10, 9)) # dummy input
y = torch.randint(0, 2, (10, 1)).float() # dummy binary target (0.0 or 1.0)
dataset = TensorDataset(X, y)
dataloader_train = DataLoader(dataset, batch_size=2, shuffle=True)
dataloader_test = DataLoader(dataset, batch_size=2, shuffle=False) # reuses the dummy data; real code would use a held-out set

num_epochs = 1000

# Loop over the number of epochs and then the dataloader
for epoch in range(num_epochs):
    for features, labels in dataloader_train:
        optimizer.zero_grad() # Set the gradients to zero
        outputs = mymodel(features) # Run a forward pass
        loss = criterion(outputs, labels.view(-1,1)) # Compute loss
        loss.backward() # Compute gradients via backpropagation
        optimizer.step() # Update parameters by descending gradients

#### Other optimizers

Recall that the Stochastic Gradient Descent (SGD) optimizer sets the magnitude of its parameter updates with a single learning rate shared by all parameters.
- This is simple and computationally efficient for basic models, but in practice it is often passed over in favor of more sophisticated methods.

Using the same learning rate for every parameter is rarely optimal, so the Adagrad optimizer adapts the learning rate per parameter: it divides each update by the square root of the sum of that parameter's past squared gradients (parameters with large or frequent gradients get a smaller effective learning rate; parameters with small or infrequent gradients get a larger one).
- However, because the accumulated sum only grows, Adagrad tends to decrease the learning rate too steeply.

Root Mean Square Propagation (RMSprop) addresses Adagrad's steep learning rate decay by adapting the learning rate per parameter based on a decaying average of its recent squared gradients, so old gradients are gradually forgotten instead of accumulating forever.

The most versatile and widely used optimizer, and a common default choice, is Adaptive Moment Estimation (Adam), which combines RMSprop with the concept of momentum (a weighted average of past gradients that is weighted toward the most recent ones). Basing the update on both gradient size and momentum accelerates training.

In [4]:
optimizer_sgd = optim.SGD(mymodel.parameters(), lr=0.01)
optimizer_adagrad = optim.Adagrad(mymodel.parameters(), lr=0.01)
optimizer_rmsprop = optim.RMSprop(mymodel.parameters(), lr=0.01)
optimizer_adam = optim.Adam(mymodel.parameters(), lr=0.01)
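
To make the difference between these optimizers concrete, the sketch below hand-writes simplified update rules for a single scalar parameter in plain Python. The function names, the `state` dictionaries, and the hyperparameter defaults are assumptions chosen for illustration; they follow the textbook formulas and are not PyTorch's actual implementation. Each rule is fed a constant gradient of 1.0 so that only the effective step size changes.

```python
import math


def adagrad_step(p, g, lr, state, eps=1e-10):
    # Sum of ALL past squared gradients; it only grows, so steps keep shrinking
    state["sum_sq"] = state.get("sum_sq", 0.0) + g * g
    return p - lr * g / (math.sqrt(state["sum_sq"]) + eps)


def rmsprop_step(p, g, lr, state, alpha=0.99, eps=1e-8):
    # Decaying average of squared gradients; old gradients fade out
    state["avg_sq"] = alpha * state.get("avg_sq", 0.0) + (1 - alpha) * g * g
    return p - lr * g / (math.sqrt(state["avg_sq"]) + eps)


def adam_step(p, g, lr, state, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum (m) plus a decaying average of squared gradients (v)
    t = state.get("t", 0) + 1
    m = beta1 * state.get("m", 0.0) + (1 - beta1) * g
    v = beta2 * state.get("v", 0.0) + (1 - beta2) * g * g
    state.update(t=t, m=m, v=v)
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps)


lr = 0.01
updates = {"adagrad": adagrad_step, "rmsprop": rmsprop_step, "adam": adam_step}
states = {name: {} for name in updates}
step_sizes = {name: [] for name in updates}

for _ in range(100):
    for name, step_fn in updates.items():
        # Start from p = 0 every time so |new_p| equals the size of the step taken
        new_p = step_fn(0.0, 1.0, lr, states[name])
        step_sizes[name].append(abs(new_p))

for name, sizes in step_sizes.items():
    print(f"{name:8s} first step {sizes[0]:.4f}  100th step {sizes[-1]:.4f}")
```

Under this constant gradient, the Adagrad step shrinks with every iteration, the RMSprop step levels off instead of decaying toward zero, and the Adam step stays close to the base learning rate.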

#### Accuracy with torchmetrics

In [5]:
from torchmetrics import Accuracy

# acc = Accuracy(task="binary")

mymodel.eval()
with torch.no_grad():
    for features, labels in dataloader_test:
        outputs = mymodel(features)
        predictions = (outputs >= 0.5).float() # for binary prediction (0, 1)
        # acc(predictions, labels.view(-1, 1))

# accuracy = acc.compute()
# print("Accuracy:", accuracy)
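
If `torchmetrics` is not available, binary accuracy can also be accumulated by hand. The sketch below uses made-up output and label tensors in place of a real model and dataloader, so the names `output_batches` and `label_batches` are purely illustrative.

```python
import torch

# Made-up sigmoid outputs and binary labels, two batches of two samples each
output_batches = [torch.tensor([[0.9], [0.2]]), torch.tensor([[0.6], [0.4]])]
label_batches = [torch.tensor([[1.0], [0.0]]), torch.tensor([[0.0], [0.0]])]

correct = 0
total = 0
with torch.no_grad():
    for outputs, labels in zip(output_batches, label_batches):
        predictions = (outputs >= 0.5).float()  # threshold probabilities at 0.5
        correct += (predictions == labels).sum().item()
        total += labels.numel()

accuracy = correct / total
print("Accuracy:", accuracy)
```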

#### Vanishing and exploding gradients

Neural networks may suffer from gradient instability during training, where the size of the gradients changes drastically as they are propagated backward from the output layer to the input layer:
- Vanishing gradients: gradients shrink as they pass backward through the layers, so the earlier layers' parameters receive tiny updates and barely learn.
- Exploding gradients: gradients grow as they pass backward, producing huge parameter updates and divergent training.
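
Vanishing gradients can be observed directly by comparing gradient sizes at the two ends of a deep network. The sketch below is an assumption-laden toy: a made-up stack of ten sigmoid layers with PyTorch's default initialization and random input, chosen only because sigmoid activations make the effect easy to see.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

depth, width = 10, 16  # made-up sizes for illustration
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.Sigmoid()]
net = nn.Sequential(*layers)

x = torch.randn(32, width)  # dummy batch
net(x).sum().backward()     # any scalar works as a stand-in for a loss

# net[0] is the first Linear layer; net[-2] is the last Linear layer
first_grad = net[0].weight.grad.norm().item()
last_grad = net[-2].weight.grad.norm().item()
print(f"first layer grad norm: {first_grad:.2e}")
print(f"last layer grad norm:  {last_grad:.2e}")
```

The first layer's gradient norm comes out far smaller than the last layer's, which is the vanishing pattern described above.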

A three-part approach to addressing gradient instability: proper weight initialization, good activation functions, and batch normalization.

#### Weights initialization
- Good weight initialization should keep the variance of a layer's outputs similar to the variance of its inputs, and keep the variance of the gradients similar before and after they pass through the layer. How to achieve this depends on the activation function:
    - For Rectified Linear Unit (ReLU) and similar activations, use He/Kaiming initialization. Call `kaiming_uniform_()` from `torch.nn.init` and pass the layer's `weight` attribute to ensure the desired variance properties.

In [6]:
import torch.nn.init as init

layer = nn.Linear(8, 2)

init.kaiming_uniform_(layer.weight)
print(layer.weight)

# @@@ Implementation of He/Kaiming initialization @@@
class KaimingModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16) # fc = shorthand for fully-connected layer
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

        init.kaiming_uniform_(self.fc1.weight)
        init.kaiming_uniform_(self.fc2.weight)
        init.kaiming_uniform_(
            self.fc3.weight,
            nonlinearity="sigmoid"
        )
    
    def forward(self, x):
        x = nn.functional.relu(self.fc1(x))
        x = nn.functional.relu(self.fc2(x))
        x = nn.functional.sigmoid(self.fc3(x))
        return x

Parameter containing:
tensor([[-0.0943, -0.7111, -0.2765, -0.3434,  0.3502,  0.5823, -0.7252,  0.6005],
        [ 0.7573,  0.5972,  0.1416, -0.2971, -0.2139,  0.1173, -0.5487, -0.7062]],
       requires_grad=True)
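
As a rough check of what He/Kaiming initialization does, the sketch below compares the empirical standard deviation of the initialized weights with the value expected for ReLU, `sqrt(2 / fan_in)`. The layer size of 512 is made up so that there are enough weights for a stable estimate, and the closed-form values assume the default `mode="fan_in"`.

```python
import math

import torch
import torch.nn as nn
import torch.nn.init as init

torch.manual_seed(0)

fan_in = 512  # number of inputs to the layer (made-up size)
layer = nn.Linear(fan_in, 512)
init.kaiming_uniform_(layer.weight, nonlinearity="relu")

# For ReLU the gain is sqrt(2), so the target std is sqrt(2 / fan_in).
# A uniform distribution on [-bound, bound] has std bound / sqrt(3).
expected_std = math.sqrt(2.0 / fan_in)
bound = math.sqrt(6.0 / fan_in)

actual_std = layer.weight.std().item()
max_abs = layer.weight.abs().max().item()
print(f"expected std {expected_std:.4f}, actual std {actual_std:.4f}")
print(f"bound {bound:.4f}, largest |weight| {max_abs:.4f}")
```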


#### Activation functions

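ReLU, available as `nn.functional.relu()`, is a common default choice, but because it outputs zero for every negative input it can suffer from the dying neuron problem: a neuron whose inputs stay negative receives zero gradient and stops learning.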

An improvement is the Exponential Linear Unit (ELU) function (supplied in PyTorch by `nn.functional.elu()`), which stays smooth for negative inputs and levels off at a small negative value (-1 with the default `alpha=1.0`) instead of cutting off at 0; because its gradients are nonzero for negative inputs it avoids the dying neuron problem, and its average output being near zero makes it less prone to vanishing gradients.
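
The difference in gradients can be seen by backpropagating through both activations on the same input. This is a small sketch with made-up input values.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, 2.0], requires_grad=True)  # made-up inputs

# Gradient of ReLU: zero wherever the input is negative
F.relu(x).sum().backward()
relu_grad = x.grad.clone()

# Reset the accumulated gradient before the second backward pass
x.grad.zero_()

# Gradient of ELU: small but nonzero for negative inputs
F.elu(x).sum().backward()
elu_grad = x.grad.clone()

print("ReLU gradients:", relu_grad)
print("ELU gradients: ", elu_grad)
print("ELU outputs:   ", F.elu(x.detach()))
```

The negative inputs get zero gradient under ReLU but a positive gradient under ELU, so a neuron stuck in the negative region can still recover.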

#### Batch normalization

Batch normalization is an operation applied after a layer. It first normalizes the layer's outputs across the batch (subtracting the mean, then dividing by the standard deviation, so that each feature has roughly zero mean and unit variance), then scales and shifts the normalized outputs using learned parameters. This lets the model learn a suitable distribution for the inputs to each following layer, which tends to speed up the loss decrease and makes training more resilient to unstable gradients.

In [7]:
# @@@ Implementation of batch normalization @@@
class BatchNormModel(nn.Module): # renamed so it does not overwrite KaimingModel from the earlier cell
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16) # fc = shorthand for fully-connected layer
        self.bn1 = nn.BatchNorm1d(16) # bn = batch normalization
        # ...
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)
        x = nn.functional.elu(x)
        return x
    
    # Whether batch normalization goes before or after the activation function may depend on the model and dataset.
    # It was originally prescribed to apply BN before the activation (as done here), but some more recent work
    # suggests that applying it after the activation can give the next layer better-normalized inputs.
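
The two stages of batch normalization (normalize, then scale and shift) can be reproduced by hand and compared with `nn.BatchNorm1d`. The sketch below assumes training mode, where the statistics come from the current batch, and uses a made-up input; the names `manual` and `matches` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(8, 4) * 3.0 + 5.0  # dummy batch: 8 samples, 4 features, shifted and scaled

bn = nn.BatchNorm1d(4)
bn.train()  # use batch statistics rather than running averages
out = bn(x)

# Step 1: normalize each feature across the batch
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)  # population variance over the batch
manual = (x - mean) / torch.sqrt(var + bn.eps)

# Step 2: scale and shift with the learned parameters (initially 1 and 0)
manual = manual * bn.weight + bn.bias

matches = torch.allclose(out, manual, atol=1e-5)
print("matches nn.BatchNorm1d:", matches)
print("per-feature mean after BN:", out.mean(dim=0))
```

In evaluation mode (`bn.eval()`), the layer switches to running averages collected during training, so the output no longer depends on the other samples in the batch.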