### Deep Learning with PyTorch

#### Tensors

Tensors are similar to NumPy arrays but additionally support GPU acceleration and automatic differentiation. Convert a Python list into a PyTorch tensor with `torch.tensor`.

In [45]:
import torch

temperatures = [[10, 20], [30, 40], [50, 60]]

temperature_tensor = torch.tensor(temperatures)
print(temperature_tensor)

tensor([[10, 20],
        [30, 40],
        [50, 60]])


Tensors have a `shape` and `dtype`, and can be added elementwise.

In [46]:
addend_tensor = torch.tensor([[10, 20], [10, 20], [10, 20]])
sums_tensor = temperature_tensor + addend_tensor
print("Sums:", sums_tensor)
print("Sums shape:", sums_tensor.shape)
print("Sums dtype:", sums_tensor.dtype)

Sums: tensor([[20, 40],
        [40, 60],
        [60, 80]])
Sums shape: torch.Size([3, 2])
Sums dtype: torch.int64
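
As a small sketch of how `shape` and `dtype` behave under common operations, the example below builds a tensor from a NumPy array, converts it to floats, and compares an elementwise product with a matrix product. The `readings` values are made up for illustration.

```python
import numpy as np
import torch

# Hypothetical readings, chosen only for illustration
readings = np.array([[10, 20], [30, 40], [50, 60]])

# torch.from_numpy shares memory with the array; torch.tensor would copy it
readings_tensor = torch.from_numpy(readings)
print(readings_tensor.shape)

# Convert integers to 32-bit floats, the default dtype of nn.Linear parameters
readings_float = readings_tensor.float()
print(readings_float.dtype)

# The elementwise product keeps the shape; the matrix product contracts the inner dimension
elementwise = readings_float * readings_float
matrix_product = readings_float @ readings_float.T
print(elementwise.shape, matrix_product.shape)
```

Converting to floats early is usually convenient, since layers such as `nn.Linear` expect floating-point input.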


#### Linear layer
`Linear` from `torch.nn` takes an input tensor whose last dimension has size `in_features` and outputs a tensor whose last dimension has size `out_features`. The weights and biases used in the calculation are initialized randomly.

In [47]:
import torch.nn as nn

input_tensor = torch.tensor([0.1, -0.1, 0.8])

linear_layer = nn.Linear(
    in_features=3,
    out_features=2
)
output = linear_layer(input_tensor)
print(output)
print(linear_layer.weight)
print(linear_layer.bias)

tensor([ 0.7825, -0.9039], grad_fn=<ViewBackward0>)
Parameter containing:
tensor([[-0.1909, -0.4258,  0.5062],
        [-0.5135, -0.2065, -0.4223]], requires_grad=True)
Parameter containing:
tensor([ 0.3540, -0.5354], requires_grad=True)
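
To make the calculation inside the layer concrete, the sketch below recomputes a linear layer's output by hand as weight times input plus bias. The seed is arbitrary and only makes the random initialization repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed so the random initialization is repeatable

layer = nn.Linear(in_features=3, out_features=2)
x = torch.tensor([0.1, -0.1, 0.8])

# nn.Linear stores its weight with shape (out_features, in_features)
manual_output = layer.weight @ x + layer.bias
layer_output = layer(x)

print(layer.weight.shape)
print(torch.allclose(manual_output, layer_output))
```

The same check can be done by hand on the printed values above: each output element is the dot product of one weight row with the input, plus the matching bias.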


#### Sequential layer
`Sequential` from `torch.nn` stacks layers such as `Linear` so that data passes through them in order. Layers between the input layer and the output layer are called hidden layers.

A neuron in a linear layer has $n+1$ parameters: $n$ weights, one for each input from the previous layer, plus $1$ for the neuron's bias.

More hidden layers = more parameters = higher model capacity.

In [48]:
sequential_model = nn.Sequential(
    nn.Linear(3, 2),
    nn.Linear(2, 8),
    nn.Linear(8, 3)
)

Access the model's parameters with `parameters()`, which returns an iterator over the parameter tensors: each layer's weights followed by that layer's biases.

`numel()` outputs the number of elements in a tensor.

In [49]:
count = 0
for parameter in sequential_model.parameters():
    print(parameter)
    count += parameter.numel()
print(count)

Parameter containing:
tensor([[ 0.4372, -0.3712, -0.4777],
        [-0.4745, -0.3590,  0.3441]], requires_grad=True)
Parameter containing:
tensor([0.5434, 0.4214], requires_grad=True)
Parameter containing:
tensor([[-0.6535, -0.6952],
        [ 0.3978, -0.6286],
        [ 0.0502,  0.2019],
        [ 0.2138,  0.1033],
        [ 0.2212,  0.0905],
        [ 0.4450, -0.1702],
        [-0.3050,  0.5096],
        [ 0.1094,  0.5195]], requires_grad=True)
Parameter containing:
tensor([-0.2077,  0.0894, -0.6579,  0.4189,  0.4407,  0.0286, -0.2390,  0.1748],
       requires_grad=True)
Parameter containing:
tensor([[-0.0654,  0.0603,  0.1492,  0.0654,  0.0913, -0.1673,  0.2176,  0.1341],
        [ 0.2890, -0.1768, -0.0975,  0.0136,  0.2267, -0.1219,  0.0688, -0.2123],
        [ 0.2221, -0.2042, -0.2308,  0.1172,  0.0818, -0.2777,  0.2426,  0.1920]],
       requires_grad=True)
Parameter containing:
tensor([ 0.1055,  0.3396, -0.1689], requires_grad=True)
59
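
The count of 59 can be cross-checked against the $n+1$ rule. The sketch below rebuilds the same architecture and compares the formula with the `numel()` total; it relies on `in_features` and `out_features`, which are attributes of `nn.Linear`.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3, 2),
    nn.Linear(2, 8),
    nn.Linear(8, 3),
)

# Each layer has out_features neurons, each with in_features weights plus one bias
expected = sum(
    layer.out_features * (layer.in_features + 1)
    for layer in model
)
actual = sum(p.numel() for p in model.parameters())
print(expected, actual)  # 59 59
```

Layer by layer this is $2 \cdot 4 + 8 \cdot 3 + 3 \cdot 9 = 8 + 24 + 27 = 59$.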


#### Sigmoid function
A function that maps a real-valued input (a float) to a single value between 0 and 1. It is used for binary classification: placed as the final activation after a stack of linear layers, its output from a forward pass is compared against a threshold (for instance 0.5) to decide whether the input belongs to the positive class.

A network made only of linear layers followed by a sigmoid is equivalent to traditional logistic regression, in that the output is a probability for the category of interest.

In [50]:
input_tensor = torch.tensor([10.0, 12.0, 13.0])  # avoid shadowing the built-in `input`
sigmoid_model = nn.Sequential(
    nn.Linear(3, 2),
    nn.Linear(2, 1),
    nn.Sigmoid()
)
sigmoid_model(input_tensor)

tensor([0.5200], grad_fn=<SigmoidBackward0>)
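
As a minimal sketch of what the sigmoid computes, the example below evaluates $\sigma(x) = 1 / (1 + e^{-x})$ by hand, compares it with `torch.sigmoid`, and applies the 0.5 threshold. The `logits` values are hypothetical.

```python
import torch

logits = torch.tensor([-2.0, 0.0, 3.0])  # hypothetical pre-activation values

manual = 1 / (1 + torch.exp(-logits))
builtin = torch.sigmoid(logits)

# Classify as positive when the probability reaches the 0.5 threshold
predicted_positive = builtin >= 0.5
print(builtin)
print(predicted_positive)  # tensor([False,  True,  True])
```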

#### Softmax function
A function that takes a one-dimensional tensor of floats and outputs a probability distribution of the same size whose elements sum to 1. It is used for multi-class classification: placed as the final activation after a stack of linear layers, the predicted class from a forward pass is the one with the highest probability.

In [51]:
input_tensor = torch.tensor([10.0, 12.0, 13.0])
softmax_model = nn.Sequential(
    nn.Linear(3, 2),
    nn.Linear(2, 5),
    nn.Softmax(dim=-1)  # normalize over the last (class) dimension; omitting dim triggers a warning
)
softmax_model(input_tensor)

tensor([0.8374, 0.0024, 0.0745, 0.0015, 0.0842], grad_fn=<SoftmaxBackward0>)
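
The softmax itself is $e^{x_i} / \sum_j e^{x_j}$. The sketch below, using made-up scores, computes it by hand, compares it with `torch.softmax`, and picks the most probable class with `torch.argmax`.

```python
import torch

scores = torch.tensor([1.0, 2.0, 3.0])  # hypothetical class scores

manual = torch.exp(scores) / torch.exp(scores).sum()
builtin = torch.softmax(scores, dim=-1)

# The predicted class is the index of the highest probability
predicted_class = torch.argmax(builtin).item()
print(builtin.sum())
print(predicted_class)  # 2
```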

#### Loss function

A loss function quantifies how far a machine learning model's predictions are from the actual target values, be it during training or in practice. It takes a model prediction $\hat{y}$ (a single regression or sigmoid output, or a softmax tensor of probabilities) and the ground truth $y$ (the actual value or class itself) as inputs and outputs a single float, the loss.

The goal of training is to minimize the loss, which should be low or zero for an accurate prediction and high for an incorrect one.

For cross-entropy loss, the ground truth may be given as the class itself (an integer index). To convert it into a tensor that can be compared against the model prediction (a softmax probability distribution), use `nn.functional.one_hot()`, which takes a tensor of class indices and `num_classes` and outputs a tensor containing the corresponding one-hot vector(s).

In [52]:
import torch.nn.functional as F

print(F.one_hot(torch.tensor(0), 3))
print(F.one_hot(torch.tensor(1), 3))
print(F.one_hot(torch.tensor([0,2]), 3))

tensor([1, 0, 0])
tensor([0, 1, 0])
tensor([[1, 0, 0],
        [0, 0, 1]])


#### Cross-entropy loss

Cross-entropy loss is a common loss function for classification. With a scores tensor (model predictions before the final softmax function) and a one-hot encoded ground truth label as input (both must be converted to floats), the cross-entropy loss function applies an internal softmax to the scores (producing a probability distribution of the same size), then outputs the negative natural log of the ground truth's corresponding probability.

The cross-entropy loss function is provided in PyTorch by instantiating `nn.CrossEntropyLoss()`.

In [53]:
scores = torch.tensor([-5.2, 4.6, 0.8])
one_hot_target = F.one_hot(torch.tensor(0), 3)

softmax = nn.Softmax(dim=-1)  # normalize over the class dimension
print(softmax(scores))
criterion = nn.CrossEntropyLoss()
print(criterion(scores.double(), one_hot_target.double()))

tensor([5.4235e-05, 9.7807e-01, 2.1880e-02])
tensor(9.8222, dtype=torch.float64)
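
To connect the description to the number above, the sketch below recomputes the loss by hand as the negative log of the target class's softmax probability. It also passes the target as a class index instead of a one-hot vector, which `nn.CrossEntropyLoss` accepts as well. A batch dimension of size 1 is added here as a conservative choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

scores = torch.tensor([[-5.2, 4.6, 0.8]])  # shape (1, 3): one sample, three classes
target_class = torch.tensor([0])           # shape (1,): class index of that sample

# Manual computation: softmax, then negative log of the target probability
probabilities = torch.softmax(scores, dim=-1)
manual_loss = -torch.log(probabilities[0, target_class[0]])

criterion = nn.CrossEntropyLoss()
index_loss = criterion(scores, target_class)
one_hot_loss = criterion(scores, F.one_hot(target_class, 3).float())

print(manual_loss.item(), index_loss.item(), one_hot_loss.item())
```

All three values should agree at roughly 9.8222, matching the output above.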


#### Loss gradients and Backpropagation

The loss produced by a forward pass depends on the values of the model's parameters (weights and biases); the rate of change of the loss with respect to each parameter is the loss gradient. Given the output of an instantiated loss function (i.e. `loss = criterion(prediction, target)` with, for instance, `criterion = nn.CrossEntropyLoss()`), we can calculate the gradients of this loss using `loss.backward()`.

Backpropagation computes these gradients by applying the chain rule backward from the loss through each layer. Gradient descent then uses them to seek a minimum of the loss function and tune the model's parameters to better fit: after each backward pass, we incrementally update the model's weights and biases in the direction opposite to the loss gradient, so that they move downhill on the loss surface.

The existing gradients of layer `i` can be accessed using `model[i].weight.grad` and `model[i].bias.grad`. To update model parameters manually, access each layer gradient and multiply it by the learning rate, then subtract the result from the weight or bias.
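
A one-parameter example may make the mechanics easier to follow. The sketch below uses a hypothetical model `prediction = w * x` with a squared-error loss, checks the gradient from `loss.backward()` against the analytic derivative, and takes one gradient descent step. All values are invented for illustration.

```python
import torch

# One-parameter model: prediction = w * x, loss = (prediction - y) ** 2
w = torch.tensor(3.0, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(4.0)

loss = (w * x - y) ** 2
loss.backward()

# Analytic gradient: d(loss)/dw = 2 * (w * x - y) * x = 2 * 2 * 2 = 8
print(w.grad)  # tensor(8.)

lr = 0.1
with torch.no_grad():      # the update itself should not be tracked by autograd
    w -= lr * w.grad       # step against the gradient: 3.0 - 0.1 * 8 = 2.2

new_loss = (w * x - y) ** 2
print(loss.item(), new_loss.item())
```

The loss drops from 4.0 to about 0.16 after a single step, because the parameter moved against its gradient.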

In [54]:
model = nn.Sequential(
    nn.Linear(5, 8),
    nn.Linear(8, 2)
)

# Need to compute loss and backpropagate in order to define gradients

x = torch.randn(5) # dummy input
target = torch.randn(2) # dummy target
output = model(x) # dummy prediction
criterion = nn.CrossEntropyLoss()
loss = criterion(output, target)
loss.backward()

weight0 = model[0].weight
weight1 = model[1].weight
bias1 = model[1].bias
print("Weight of the first layer:", weight0)
print("Bias of the second layer:", bias1)

grads0 = weight0.grad
grads1 = weight1.grad
print("Gradient of the first layer:", grads0)
print("Gradient of the second layer:", grads1)

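lr = 0.001
# Update in place under no_grad so the model's own parameters change and the step is not tracked by autograd.
# A bare `weight0 -= ...` outside no_grad raises a RuntimeError, because weight0 is a leaf tensor that requires grad.
with torch.no_grad():
    weight0 -= lr * grads0
    weight1 -= lr * grads1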

Weight of the first layer: Parameter containing:
tensor([[ 0.3206, -0.3639,  0.3761, -0.4230, -0.1317],
        [ 0.3998, -0.2565,  0.4259,  0.0580, -0.2682],
        [ 0.1348,  0.2798,  0.2490,  0.1747, -0.3247],
        [-0.2314, -0.3041,  0.4000,  0.3989, -0.1950],
        [-0.2652,  0.2406,  0.3434, -0.1151,  0.1719],
        [-0.0882,  0.0454, -0.3510,  0.0415, -0.0767],
        [ 0.2033,  0.3179, -0.2007,  0.0304, -0.1262],
        [-0.2840, -0.2974, -0.2339,  0.1518,  0.0749]], requires_grad=True)
Bias of the second layer: Parameter containing:
tensor([ 0.1527, -0.3415], requires_grad=True)
Gradient of the first layer: tensor([[ 0.0045,  0.0348, -0.0262,  0.0470,  0.0377],
        [ 0.0942,  0.7219, -0.5435,  0.9740,  0.7812],
        [ 0.0820,  0.6286, -0.4732,  0.8481,  0.6802],
        [ 0.0729,  0.5587, -0.4206,  0.7538,  0.6045],
        [-0.0388, -0.2971,  0.2237, -0.4008, -0.3215],
        [ 0.0782,  0.5995, -0.4513,  0.8088,  0.6487],
        [-0.0241, -0.1848,  0.1391, 

#### Optimized gradient descent

PyTorch provides an optimizer class, `torch.optim.SGD`, that implements Stochastic Gradient Descent (gradients are calculated from one or a small subset of training examples, rather than the computationally expensive entire dataset). Instantiate it with a model's parameters and a learning rate; call `optimizer.step()` to perform an update on all of the model's parameters.

In [55]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.001)

x = torch.randn(5) # dummy input
target = torch.randn(2) # dummy target
output = model(x) # dummy prediction
criterion = nn.CrossEntropyLoss()
loss = criterion(output, target)
optimizer.zero_grad() # clear gradients accumulated by earlier backward() calls
loss.backward()

optimizer.step()
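
As a sanity check on what `optimizer.step()` does, the sketch below compares it with the manual rule of subtracting the learning rate times the gradient. It assumes plain SGD (no momentum or weight decay, the defaults) and uses random dummy data with `nn.MSELoss` purely for convenience.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # arbitrary seed for repeatability
layer = nn.Linear(5, 2)
optimizer = optim.SGD(layer.parameters(), lr=0.01)

x = torch.randn(4, 5)       # dummy batch of 4 inputs
target = torch.randn(4, 2)  # dummy targets

optimizer.zero_grad()
loss = nn.MSELoss()(layer(x), target)
loss.backward()

# Manual rule: new_weight = weight - lr * gradient
expected_weight = layer.weight.detach() - 0.01 * layer.weight.grad
optimizer.step()
print(torch.allclose(layer.weight, expected_weight))
```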

#### DataFrames

`pd.read_csv()` from `pandas` reads into a `pandas.DataFrame` from a .csv file.

The `DataFrame.iloc` property is used to select slices of the DataFrame through integer-location based indexing. `iloc[:,1:-1]` for instance selects all but the first and last columns (as columns are dimension 1 of the DataFrame).

`to_numpy()` of a DataFrame outputs an equivalent NumPy array for easier handling with PyTorch.
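
The slicing described above can be tried without a file on disk. The sketch below uses a small in-memory DataFrame as a hypothetical stand-in for the CSV (the column names and values are invented), separates features from labels with `iloc`, and converts both to tensors.

```python
import pandas as pd
import torch

# Hypothetical stand-in for the contents of a CSV file
dataset = pd.DataFrame({
    "id": [1, 2, 3],
    "height": [1.5, 1.7, 1.6],
    "weight": [50.0, 70.0, 60.0],
    "label": [0, 1, 0],
})

# All rows; every column except the first (id) and the last (label)
features = dataset.iloc[:, 1:-1].to_numpy()
# All rows; only the last column
labels = dataset.iloc[:, -1].to_numpy()

X = torch.tensor(features, dtype=torch.float32)
y = torch.tensor(labels)
print(X.shape, y.shape)
```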

In [None]:
import numpy as np
import pandas as pd

# dataset = pd.read_csv("dataset.csv")
# target = dataset.iloc[:, -1]
# y = target.to_numpy()
# print(y)


#### Data loading and batching

`TensorDataset` and `DataLoader` from `torch.utils.data` help manage data loading and batching during training. 

Instantiate a `DataLoader` with the following parameters:
- the `TensorDataset` itself
- `batch_size` determines the number of samples included in each iteration
- `shuffle` controls whether the data order is randomized at each epoch (a full pass through the training data), which helps improve model generalization (level of performance on unseen data).

The `DataLoader` behaves like a PEZ dispenser. Each element of the `TensorDataset` is a tuple of features and label representing one row of the dataset. Iterating through the `DataLoader` dispenses tuples that unpack as `batch_inputs` (features) and `batch_labels` (targets), each stacking `batch_size` rows of the `TensorDataset` (or fewer in the final batch if not enough samples remain).

In real-world deep learning, batch sizes are often 32 or greater to accommodate larger datasets with better computational efficiency.

In [None]:
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [-1.0, 0.0],
              [-1.0, 1.0],
              [-1.0, -1.0]])
y = np.array([[2.0],
              [3.0],
              [4.0],
              [0.0],
              [1.0]])
dataset = TensorDataset(torch.tensor(X), torch.tensor(y))

input_sample, label_sample = dataset[0]
print("Input sample:", input_sample)
print("Label sample:", label_sample)

dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
for batch_inputs, batch_labels in dataloader:
    print("Batch inputs:", batch_inputs)
    print("Batch labels:", batch_labels)

Input sample: tensor([0., 1.], dtype=torch.float64)
Label sample: tensor([2.], dtype=torch.float64)
Batch inputs: tensor([[ 0.,  1.],
        [-1.,  0.]], dtype=torch.float64)
Batch labels: tensor([[2.],
        [4.]], dtype=torch.float64)
Batch inputs: tensor([[-1., -1.],
        [-1.,  1.]], dtype=torch.float64)
Batch labels: tensor([[1.],
        [0.]], dtype=torch.float64)
Batch inputs: tensor([[1., 0.]], dtype=torch.float64)
Batch labels: tensor([[3.]], dtype=torch.float64)
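
The smaller final batch follows from the arithmetic: 5 samples with `batch_size=2` give $\lceil 5/2 \rceil = 3$ batches. The sketch below checks this on synthetic data and also shows `drop_last=True`, a `DataLoader` argument that discards the incomplete final batch.

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic dataset with 5 rows, 2 features each, and one label per row
features = torch.arange(10, dtype=torch.float32).reshape(5, 2)
labels = torch.arange(5, dtype=torch.float32).reshape(5, 1)
dataset = TensorDataset(features, labels)

loader = DataLoader(dataset, batch_size=2, shuffle=False)
batch_sizes = [batch_inputs.shape[0] for batch_inputs, _ in loader]
print(batch_sizes)  # [2, 2, 1]
print(len(loader) == math.ceil(len(dataset) / 2))

# drop_last=True skips the final batch when it is smaller than batch_size
drop_loader = DataLoader(dataset, batch_size=2, drop_last=True)
print(len(drop_loader))  # 2
```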


#### Mean Squared Error loss

Mean Squared Error (MSE) loss is a common loss function for regression. With a predictions tensor and ground truth tensor as input (both must be converted to floats), MSE loss is the average of the squared difference between each prediction and its corresponding ground truth.

The MSE loss function is implemented in NumPy as the function below, and is supplied in PyTorch by instantiating `nn.MSELoss()`.

In [64]:
def mean_squared_loss(prediction, target):
    return np.mean((prediction - target) ** 2)

criterion = nn.MSELoss()
# loss = criterion(torch.tensor(prediction), torch.tensor(target))
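
To confirm that the two implementations agree, the sketch below evaluates both on made-up predictions and targets. The squared differences are 0.25, 0.25 and 0, so the mean is about 0.1667.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical predictions and targets, chosen only for illustration
prediction = np.array([2.5, 0.0, 2.0])
target = np.array([3.0, -0.5, 2.0])

numpy_loss = np.mean((prediction - target) ** 2)

criterion = nn.MSELoss()
torch_loss = criterion(torch.tensor(prediction), torch.tensor(target))

print(numpy_loss, torch_loss.item())
```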

#### Training loop

Training a neural network involves creating a model, choosing a loss function, defining a dataset, setting an optimizer, and running the training loop (calculating loss via a forward pass, computing gradients via backpropagation, and updating model parameters).
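
The steps above can be sketched end to end as follows. This is one possible arrangement rather than a definitive recipe: the synthetic data, layer sizes, learning rate and epoch count are all assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)  # arbitrary seed for repeatability

# Synthetic regression data (assumption): target = 2 * x0 - x1
features = torch.randn(64, 2)
targets = (2 * features[:, 0] - features[:, 1]).unsqueeze(1)
dataloader = DataLoader(TensorDataset(features, targets), batch_size=8, shuffle=True)

model = nn.Sequential(nn.Linear(2, 4), nn.Linear(4, 1))
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

num_epochs = 20
epoch_losses = []
for epoch in range(num_epochs):
    total = 0.0
    for batch_inputs, batch_targets in dataloader:
        optimizer.zero_grad()                         # reset gradients from the previous step
        predictions = model(batch_inputs)             # forward pass
        loss = criterion(predictions, batch_targets)  # compute loss
        loss.backward()                               # backpropagation
        optimizer.step()                              # update parameters
        total += loss.item()
    epoch_losses.append(total / len(dataloader))

print(epoch_losses[0], epoch_losses[-1])
```

The average loss of the last epoch should be lower than that of the first, which is the basic sign that training is working.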

In [None]:
num_epochs = 5

# Loop over the number of epochs and then the dataloader
for i in range(num_epochs):
  for data in ____:
    # Set the gradients to zero
    ____