# Chapter 10. Building Neural Networks with PyTorch

PyTorch is an open-source deep learning library for building and training machine learning models. It was originally developed by Meta with the goal of giving researchers a Pythonic interface, and it is now governed by the PyTorch Foundation.

## Fundamentals

The core data structure in PyTorch is the *tensor*, a multidimensional array with a shape and a data type, not unlike a NumPy array. Tensors will be the inputs and outputs of our neural networks, just as NumPy arrays were for Scikit-Learn models.

### PyTorch Tensors

Import the library and create a 2 x 3 tensor:

In [1]:
import torch

In [2]:
X = torch.tensor([[1.0, 4.0, 7.0], [2.0, 3.0, 6.0]])
X

tensor([[1., 4., 7.],
        [2., 3., 6.]])

Getting the shape and data type works much like it does in NumPy:

In [3]:
X.shape

torch.Size([2, 3])

In [4]:
X.dtype

torch.float32

So is indexing, along with element-wise arithmetic, reductions, and matrix multiplication:

In [5]:
X[0, 1]

tensor(4.)

In [6]:
X[:, 1]

tensor([4., 3.])

In [7]:
10 * (X + 1.0)

tensor([[20., 50., 80.],
        [30., 40., 70.]])

In [8]:
X.exp()

tensor([[   2.7183,   54.5982, 1096.6332],
        [   7.3891,   20.0855,  403.4288]])

In [9]:
X.mean()

tensor(3.8333)

In [10]:
X.max(dim=0)

torch.return_types.max(
values=tensor([2., 4., 7.]),
indices=tensor([1, 0, 0]))

In [11]:
X.max(dim=1)

torch.return_types.max(
values=tensor([7., 6.]),
indices=tensor([2, 2]))

In [12]:
X @ X.T

tensor([[66., 56.],
        [56., 49.]])

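The `numpy()` method converts a tensor to a NumPy array. To go the other way, pass the array to `torch.tensor()`. Note that NumPy defaults to 64-bit floats, so the resulting tensor has `dtype=torch.float64`: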

In [13]:
import numpy as np

X.numpy()

torch.tensor(np.array([[1., 4., 7.], [2. ,3., 6.]]))

tensor([[1., 4., 7.],
        [2., 3., 6.]], dtype=torch.float64)
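A small sketch of the round trip, to make the copy-versus-share behavior concrete. The array values mirror the cell above; the variable names (`arr`, `copied`, `shared`, `back`, `as_float32`) are my own. As far as I know, `torch.tensor()` copies the data while `torch.from_numpy()` and `numpy()` share memory with the source (for CPU tensors).

```python
import numpy as np
import torch

arr = np.array([[1.0, 4.0, 7.0], [2.0, 3.0, 6.0]])

copied = torch.tensor(arr)      # copies the data into a new tensor
shared = torch.from_numpy(arr)  # shares memory with `arr`

# Modify the NumPy array and see which tensor notices.
arr[0, 0] = 100.0
print(copied[0, 0].item())  # 1.0   (independent copy)
print(shared[0, 0].item())  # 100.0 (same underlying memory)

# Going back: numpy() on a CPU tensor also shares memory.
back = shared.numpy()

# Request float32 explicitly to match PyTorch's usual default dtype.
as_float32 = torch.tensor(arr, dtype=torch.float32)
print(copied.dtype, as_float32.dtype)
```

Sharing memory avoids a copy, but it also means a change on one side shows up on the other, so I would use `torch.tensor()` whenever I want an independent copy.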

In [14]:
X[:, 1] = -99
X

tensor([[  1., -99.,   7.],
        [  2., -99.,   6.]])

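QUESTION: What does it mean that API methods ending in an underscore apply their operation in place? Does that mean no copy is stored? It does: a method such as `relu_()` overwrites the tensor's own data rather than returning a new tensor.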

In [15]:
X.relu_()
X

tensor([[1., 0., 7.],
        [2., 0., 6.]])
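To check the in-place behavior for myself, here is a minimal sketch that compares `relu()` with `relu_()` using `data_ptr()`, which returns the memory address of a tensor's first element. The tensor `t` and its values are made up for illustration.

```python
import torch

t = torch.tensor([-1.0, 2.0, -3.0])
ptr_before = t.data_ptr()

# Out-of-place: returns a new tensor and leaves `t` untouched.
out = t.relu()
print(out.data_ptr() == ptr_before)  # False, new memory was allocated
print(t.tolist())                    # [-1.0, 2.0, -3.0]

# In-place: overwrites the values stored in `t` itself.
ret = t.relu_()
print(t.data_ptr() == ptr_before)    # True, same memory as before
print(t.tolist())                    # [0.0, 2.0, 0.0]
```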

### Hardware Acceleration

One benefit of PyTorch is hardware acceleration, which greatly speeds up computations. Unlike Scikit-Learn, PyTorch lets us choose among several accelerator backends: NVIDIA GPUs with CUDA, Apple's *Metal Performance Shaders* (MPS), AMD ROCm, Intel's oneAPI, and others.

In [16]:
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

In [17]:
device

'mps'

Here we create a tensor on the CPU, then copy it to the accelerator `device` with the `to()` method.
The tensor's `device` attribute shows the device it lives on.

In [18]:
M = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
M = M.to(device)

In [19]:
M.device

device(type='mps', index=0)

The tensor can also be created directly on the GPU using the `device` argument:

In [20]:
M = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device=device)

QUESTION: Why isn't this the default? What is the use case for each approach? Maybe we want to do light manipulation and data verification on the CPU so the GPU stays free for the more intensive training computations?

In [21]:
R = M @ M.T
R

tensor([[14., 32.],
        [32., 77.]], device='mps:0')

Crucially, the resulting tensor `R` also lives on the accelerator, so we avoid the bottleneck of transferring data between devices.

In [22]:
M = torch.rand((1000, 1000))
%timeit M @ M.T

2.34 ms ± 74.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [23]:
M = torch.rand((1000, 1000), device=device)
%timeit M @ M.T

195 μs ± 3.25 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


Interestingly, the `mps`-accelerated operation was significantly faster, by about an order of magnitude (195 μs versus 2.34 ms). Honestly, the CPU operation was no slouch either, probably thanks to the Apple M4's ARM architecture and unified memory.

It's curious to me that the accelerated cell took noticeably longer to finish overall. Part of that is simply that `%timeit` chose 10,000 loops per run for the accelerated version versus 100 for the CPU version. Beyond that, maybe it was a temporary Jupyter kernel quirk, or extra overhead from coordinating with the Python kernel running on the CPU?
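One caveat I have not verified on this machine: accelerator operations are generally launched asynchronously, so a timing loop can return before the work has actually finished. Below is a hedged sketch of a timing helper that synchronizes before stopping the clock. The helpers `pick_device`, `synchronize`, and `time_matmul` are my own names, and I am assuming `torch.cuda.synchronize()` and `torch.mps.synchronize()` are available in the installed PyTorch version.

```python
import time
import torch


def pick_device():
    """Same selection logic as the cell above."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def synchronize(device):
    """Block until queued accelerator work is done (no-op on CPU)."""
    if device == "cuda":
        torch.cuda.synchronize()
    elif device == "mps":
        torch.mps.synchronize()


def time_matmul(device, size=1000, repeats=10):
    """Average seconds per `M @ M.T`, including time to finish the work."""
    M = torch.rand((size, size), device=device)
    _ = M @ M.T  # warm-up so one-time setup cost is not measured
    synchronize(device)
    start = time.perf_counter()
    for _ in range(repeats):
        _ = M @ M.T
    synchronize(device)
    return (time.perf_counter() - start) / repeats


device = pick_device()
cpu_seconds = time_matmul("cpu")
device_seconds = time_matmul(device)
print(f"cpu: {cpu_seconds * 1e3:.3f} ms, {device}: {device_seconds * 1e3:.3f} ms")
```

If the synchronized numbers differ a lot from the `%timeit` numbers, that would suggest the earlier measurement was mostly capturing launch overhead rather than the full computation.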

### Autograd

PyTorch's implementation of reverse-mode automatic differentiation is called *autograd*, short for automatic gradients.

In [24]:
x = torch.tensor(5.0, requires_grad=True)
f = x ** 2
f

tensor(25., grad_fn=<PowBackward0>)

In [25]:
f.backward()
x.grad

tensor(10.)

A breakdown of this code is as follows. 

First, a tensor `x` is created with the value `5.0`. So that it is not treated as a constant, we pass `requires_grad=True`, which tells PyTorch to track the operations performed on it (necessary for autograd).

The definition of f(x) is straightforward: we raise the tensor `x` to the power of 2. This gives us a new tensor holding the value `25.` along with `grad_fn=<PowBackward0>`, which records the operation we just performed and links this tensor to the one it was computed from, `x`. The `grad_fn` attribute is how PyTorch keeps track of the *computation graph*.

Given this, when we call `f.backward()`, PyTorch traverses the graph backwards starting at `f`, computing gradients all the way back to the leaf nodes, which in this case is just `x`.

This then lets us read the gradient of `x` with `x.grad`. Had we read it before calling `backward()`, it would have been `None`. The gradient was computed during backpropagation and gives us the derivative of `f` with respect to `x`: since f(x) = x², the derivative is 2x, which is 10 at x = 5.
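As a sanity check on the walkthrough above, here is a small sketch using a slightly different function that I picked for illustration, f(x) = 3x² + 2x, whose derivative is 6x + 2. The names `grad_before` and `analytic` are mine.

```python
import torch

x = torch.tensor(5.0, requires_grad=True)
grad_before = x.grad  # None: nothing has been backpropagated yet

f = 3 * x ** 2 + 2 * x  # forward pass builds the computation graph
f.backward()            # backward pass fills in x.grad

# Compare against the hand-derived derivative 6x + 2.
analytic = 6 * x.detach() + 2
print(f.item())         # 85.0
print(x.grad.item())    # 32.0
print(analytic.item())  # 32.0
```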

---


For gradient descent, the update step that subtracts a fraction of the gradients from the model variables should not be tracked. In fact, updating a leaf tensor that requires gradients in place raises an exception in PyTorch. To exclude the gradient descent step from the computation graph, we can place it inside a `torch.no_grad()` context.

In [26]:
learning_rate = 0.1
print(f"x: {x}, x.grad {x.grad}")
with torch.no_grad():
    x -= learning_rate * x.grad
print(f"x: {x}, x.grad {x.grad}")


x: 5.0, x.grad 10.0
x: 4.0, x.grad 10.0
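To see the exception mentioned above, this sketch attempts the same update outside of `torch.no_grad()` first, catches the error, and then repeats it inside the context. The `raised` flag is just a name I added to record what happened.

```python
import torch

learning_rate = 0.1
x = torch.tensor(5.0, requires_grad=True)
f = x ** 2
f.backward()

# Outside no_grad(): an in-place update on a leaf tensor that requires grad fails.
try:
    x -= learning_rate * x.grad
    raised = False
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {err}")

# Inside no_grad(): the same update is allowed and is not tracked.
with torch.no_grad():
    x -= learning_rate * x.grad

print(x)  # x moved from 5.0 to 4.0 and still requires grad
```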


You can also avoid gradient tracking with the `detach()` method, which creates a new tensor that is detached from the computation graph and has `requires_grad=False`, but points to the same data in memory.

In [27]:
x_detached = x.detach()
x_detached -= learning_rate * x.grad
print(f"x: {x}, x.grad {x.grad}")
print(f"x_detached: {x_detached}, x_detached.grad {x_detached.grad}")

x: 3.0, x.grad 10.0
x_detached: 3.0, x_detached.grad None


So `detach()` is handy for operations that you don't want affecting the gradient, such as logging. Since `x_detached` and `x` share the same memory, modifying `x_detached` also modifies `x`. In general, `no_grad()` is preferred when performing inference or doing a gradient descent step.

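Before repeating the process, the gradients of every model parameter need to be zeroed out, because `backward()` adds to any gradient already stored rather than replacing it. The gradient tensor has `requires_grad=False`, so a `no_grad()` context is not necessary. (Note to self: `x.grad` is itself a tensor, whereas `grad_fn` is a graph node object, not a tensor.)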

In [28]:
x.grad.zero_()

tensor(0.)
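A quick sketch of why the zeroing matters: calling `backward()` twice without resetting adds the second gradient on top of the first. The variable names `first`, `accumulated`, and `after_reset` are mine.

```python
import torch

x = torch.tensor(5.0, requires_grad=True)

(x ** 2).backward()
first = x.grad.item()        # 10.0

(x ** 2).backward()          # no zeroing in between
accumulated = x.grad.item()  # 20.0: the new gradient was added to the old one

x.grad.zero_()               # reset in place
(x ** 2).backward()
after_reset = x.grad.item()  # 10.0 again

print(first, accumulated, after_reset)
```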

In [29]:
learning_rate = 0.1
x = torch.tensor(5.0, requires_grad=True)

for i in range(100):
    f = x ** 2  # forward pass
    f.backward()    # backward pass
    
    with torch.no_grad():
        x -= learning_rate * x.grad

    x.grad.zero_()  # zero out the gradients at the end of each iteration

> In-place operations save memory by avoiding intermediate copies, but they don't always work well with autograd, which may need the original values to perform the backward pass. When in doubt, use `z = z + 1` instead of `z += 1`.
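Here is a minimal sketch of the problem, using `exp()` as my example because (as I understand it) its backward pass reuses the output value. Modifying that output in place makes the backward pass fail, while the out-of-place version works. The names `t`, `t2`, and `failed` are mine.

```python
import math
import torch

# In-place version: exp() keeps its output for the backward pass,
# so overwriting that output breaks backpropagation.
t = torch.tensor(2.0, requires_grad=True)
z = t.exp()
z += 1
try:
    z.backward()
    failed = False
except RuntimeError as err:
    failed = True
    print(f"RuntimeError: {err}")

# Out-of-place version: `z + 1` creates a new tensor and leaves exp's output intact.
t2 = torch.tensor(2.0, requires_grad=True)
z2 = t2.exp()
z2 = z2 + 1
z2.backward()
print(t2.grad.item(), math.exp(2.0))  # both are e^2
```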

## Implementing Linear Regression

We'll use the same California housing dataset to train a linear regression model.

In [30]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

housing = fetch_california_housing()

After downloading the dataset we split into a training and test set. Then we split the training set into a training and validation set.

In [31]:
x, x_test, y, y_test = train_test_split(
    housing.data, housing.target, random_state=42)

x_train, x_valid, y_train, y_valid = train_test_split(x, y, random_state=42)

With our data downloaded and split into training, validation, and test sets, we can standardize it. But this time, with tensors!

In [32]:
x_train = torch.FloatTensor(x_train)
x_valid = torch.FloatTensor(x_valid)
x_test = torch.FloatTensor(x_test)
means = x_train.mean(dim=0, keepdims=True)
stds = x_train.std(dim=0, keepdims=True)
x_train = (x_train - means) / stds
x_valid = (x_valid - means) / stds
x_test = (x_test - means) / stds

The targets should also be converted to tensors and need to be reshaped by adding a second dimension of size 1:

In [33]:
y_train = torch.FloatTensor(y_train).reshape(-1, 1)
y_valid = torch.FloatTensor(y_valid).reshape(-1, 1)
y_test = torch.FloatTensor(y_test).reshape(-1, 1)

QUESTION: How exactly did we know to do that? I should build better intuition here. My first guess was that the reshape is needed for some kind of multiplication or dot product.
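A sketch that may help with the intuition: the reshape is about making the targets' shape match the predictions' shape, not about a multiplication. Predictions from `x @ w + b` have shape `(n, 1)`. If the targets stay flat with shape `(n,)`, the subtraction broadcasts to an `(n, n)` matrix and the loss is silently wrong. The toy values below are made up.

```python
import torch

y_pred = torch.tensor([[1.0], [2.0], [3.0]])  # shape (3, 1), like x @ w + b
y_flat = torch.tensor([1.0, 2.0, 3.0])        # shape (3,), targets as loaded
y_col = y_flat.reshape(-1, 1)                 # shape (3, 1), after the reshape

wrong = y_pred - y_flat  # broadcasts to (3, 3): every prediction minus every target
right = y_pred - y_col   # stays (3, 1): each prediction minus its own target

wrong_mse = (wrong ** 2).mean()
right_mse = (right ** 2).mean()

print(wrong.shape, right.shape)
print(wrong_mse.item())  # about 1.3333, even though the predictions are perfect
print(right_mse.item())  # 0.0
```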

In [34]:
torch.manual_seed(42)
n_features = x_train.shape[1]
w = torch.randn((n_features, 1), requires_grad=True)
b = torch.tensor(0., requires_grad=True)

This created the weights `w` and bias `b` parameters: the weights are randomly initialized and the bias starts at zero. Notice we used `requires_grad=True` so that backprop works. We get the number of features from `x_train.shape[1]` so that `w` is a column vector with one weight per input dimension.

With all the groundwork laid, we can begin training our model! We'll use batch gradient descent (BGD), meaning the full training set is used at each training step, with autograd computing the gradients.

We'll compute the loss as the mean squared error (MSE).

#### Linear Regression Using PyTorch's Low-Level API

In [35]:
learning_rate = 0.4
n_epochs = 20

for epoch in range(n_epochs):
    y_pred = x_train @ w + b
    loss = ((y_pred - y_train) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        b -= learning_rate * b.grad
        w -= learning_rate * w.grad
        b.grad.zero_()
        w.grad.zero_()
    print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {loss.item()}")

Epoch 1/20, Loss: 16.158458709716797
Epoch 2/20, Loss: 4.879374027252197
Epoch 3/20, Loss: 2.255225896835327
Epoch 4/20, Loss: 1.3307636976242065
Epoch 5/20, Loss: 0.9680691957473755
Epoch 6/20, Loss: 0.8142675757408142
Epoch 7/20, Loss: 0.741704523563385
Epoch 8/20, Loss: 0.7020700573921204
Epoch 9/20, Loss: 0.6765917539596558
Epoch 10/20, Loss: 0.6577964425086975
Epoch 11/20, Loss: 0.6426150798797607
Epoch 12/20, Loss: 0.6297222971916199
Epoch 13/20, Loss: 0.6184941530227661
Epoch 14/20, Loss: 0.6085968017578125
Epoch 15/20, Loss: 0.5998216271400452
Epoch 16/20, Loss: 0.5920186638832092
Epoch 17/20, Loss: 0.5850691199302673
Epoch 18/20, Loss: 0.578873336315155
Epoch 19/20, Loss: 0.5733453631401062
Epoch 20/20, Loss: 0.5684100389480591


The forward pass consists of computing the predictions `y_pred` and the mean squared error `loss`.

The `loss.backward()` call runs backpropagation starting from `loss`, which is itself a scalar tensor carrying a `grad_fn`. **This is autograd.**

Gradient descent is then performed on the bias and weights inside `torch.no_grad()`, as described above, so that the update step itself is not tracked. The gradients are then zeroed out in place with `.zero_()` before the next epoch.
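To convince myself that autograd is computing what I expect, here is a sketch comparing its gradients with the hand-derived ones for MSE. With residual r = Xw + b − y and n samples, the gradient with respect to `w` is (2/n)·Xᵀr and the gradient with respect to `b` is (2/n)·Σr. The data below is random synthetic data, not the housing set, and the names `grad_w` and `grad_b` are mine.

```python
import torch

torch.manual_seed(0)
n, n_features = 16, 3
X = torch.randn(n, n_features)   # synthetic inputs
y = torch.randn(n, 1)            # synthetic targets
w = torch.randn((n_features, 1), requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Autograd version, same structure as the training loop above.
loss = ((X @ w + b - y) ** 2).mean()
loss.backward()

# Hand-derived gradients of the mean squared error.
with torch.no_grad():
    residual = X @ w + b - y
    grad_w = (2.0 / n) * (X.T @ residual)
    grad_b = (2.0 / n) * residual.sum()

w_matches = torch.allclose(w.grad, grad_w, atol=1e-5)
b_matches = torch.allclose(b.grad, grad_b, atol=1e-5)
print(w_matches, b_matches)
```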

Making predictions with our trained model:

In [36]:
x_new = x_test[:3]
with torch.no_grad():
    y_pred = x_new @ w + b

y_pred

tensor([[0.8916],
        [1.6480],
        [2.6577]])

> Best practice is to use the `torch.no_grad()` context during inference. There is no need to keep track of the graph, and skipping it is cheaper in both compute and memory.

#### Linear Regression Using PyTorch's High-Level API

The above process is verbose: we have a lot of control but must also do a lot of work. We can use the higher-level abstractions available in PyTorch, such as the `torch.nn.Linear` class, to achieve the same goal.

In [37]:
import torch.nn as nn

torch.manual_seed(42)
model = nn.Linear(in_features=n_features, out_features=1)

`nn.Linear` is a subclass of the `nn.Module` class. Modules can be used as building blocks for more complex models.

The `nn.Linear` module contains a `bias` vector with one bias term per neuron, and a `weight` matrix with one row per neuron and one column per input dimension. That is the transpose of the weight matrix used in the equation for computing the outputs of a fully connected layer (rows and columns flipped).

In this example the model has a single neuron (`out_features=1`), so the `bias` vector contains a single term and the `weight` matrix has a single row.

These parameters are accessible as attributes of the `nn.Linear` module `model` we created.

In [38]:
model.bias

Parameter containing:
tensor([0.3117], requires_grad=True)

In [39]:
model.weight

Parameter containing:
tensor([[ 0.2703,  0.2935, -0.0828,  0.3248, -0.0775,  0.0713, -0.1721,  0.2076]],
       requires_grad=True)

So this seems to have done the `w` and `b` setup we did earlier by hand. The weights look random as before, but the bias is not zero like the one we used earlier. I wonder if that can be adjusted.

In [40]:
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[ 0.2703,  0.2935, -0.0828,  0.3248, -0.0775,  0.0713, -0.1721,  0.2076]],
       requires_grad=True)
Parameter containing:
tensor([0.3117], requires_grad=True)


In [41]:
for param in model.named_parameters():
    print(param)

('weight', Parameter containing:
tensor([[ 0.2703,  0.2935, -0.0828,  0.3248, -0.0775,  0.0713, -0.1721,  0.2076]],
       requires_grad=True))
('bias', Parameter containing:
tensor([0.3117], requires_grad=True))


In [42]:
model(x_train[:2])

tensor([[-0.4718],
        [ 0.1131]], grad_fn=<AddmmBackward0>)

The call above invokes the model's `forward()` method. For `nn.Linear` that computes `x @ self.weight.T + self.bias`, which is exactly linear regression.

With the model in hand we have a way to perform inference, i.e. to make predictions: using the model as a function, as in `model(x_train[:2])`, calls its `forward()` method.
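A short sketch to confirm that calling the module matches the formula. It uses a fresh `nn.Linear` and a random batch rather than the housing data; `layer`, `batch`, `via_call`, and `via_formula` are names I made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
layer = nn.Linear(in_features=8, out_features=1)
batch = torch.randn(4, 8)  # 4 made-up samples with 8 features each

with torch.no_grad():
    via_call = layer(batch)                             # calls forward()
    via_formula = batch @ layer.weight.T + layer.bias   # the same math by hand

same = torch.allclose(via_call, via_formula, atol=1e-6)
print(layer.weight.shape, layer.bias.shape)  # torch.Size([1, 8]) torch.Size([1])
print(same)
```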

What's left now is a way to iteratively adjust the parameters with respect to the loss function in order to minimize the loss over time. Optimization. 

PyTorch includes a Stochastic Gradient Descent (SGD) optimizer that we can use:

In [43]:
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

Great, so the optimizer has been initialized with the model's parameters (weights and bias) and the learning rate.

For the loss we can use mean squared error again, but instead of calculating it ourselves we can use another `nn` built-in, the `nn.MSELoss` module.

In PyTorch the loss function is usually referred to as the *criterion*, to distinguish the function from its output, the `loss`.

In [44]:
mse = nn.MSELoss()

With everything in place (our model, optimizer, and loss function), we can build our training loop.

In [45]:
def train_bgd(model, optimizer, criterion, x_train, y_train, n_epochs):
    for epoch in range(n_epochs):
        y_pred = model(x_train)
        loss = criterion(y_pred, y_train)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"Epoch {epoch + 1}/{n_epochs}, Loss {loss.item()}")

- The `optimizer.step()` call is equivalent to our previous example, where we updated `b` and `w` inside the `no_grad()` context.
- Likewise, `zero_grad()` clears the gradients of every parameter we initialized the optimizer with.
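To check the equivalence claimed in the bullets above, this sketch computes the expected parameters by hand and compares them with what `optimizer.step()` produces. It assumes plain SGD with no momentum or weight decay (the defaults), and it uses random synthetic data. The names `expected`, `matches`, and `grads_cleared` are mine.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(32, 8)  # synthetic inputs
y = torch.randn(32, 1)  # synthetic targets
lr = 0.1

model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

loss = nn.MSELoss()(model(X), y)
loss.backward()

# What the manual update `p -= lr * p.grad` would give for each parameter.
expected = [p.detach().clone() - lr * p.grad for p in model.parameters()]

optimizer.step()
matches = all(torch.allclose(p, e) for p, e in zip(model.parameters(), expected))

optimizer.zero_grad()
# Depending on the PyTorch version, gradients are either set to None or zeroed.
grads_cleared = all(p.grad is None or not p.grad.any() for p in model.parameters())

print(matches, grads_cleared)
```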

In [46]:
train_bgd(model, optimizer, mse, x_train, y_train, n_epochs)

Epoch 1/20, Loss 4.337850093841553
Epoch 2/20, Loss 0.7802939414978027
Epoch 3/20, Loss 0.6253842115402222
Epoch 4/20, Loss 0.6060435175895691
Epoch 5/20, Loss 0.5956299304962158
Epoch 6/20, Loss 0.587356686592102
Epoch 7/20, Loss 0.5802990198135376
Epoch 8/20, Loss 0.5741382837295532
Epoch 9/20, Loss 0.5687101483345032
Epoch 10/20, Loss 0.5639079213142395
Epoch 11/20, Loss 0.5596511363983154
Epoch 12/20, Loss 0.5558737516403198
Epoch 13/20, Loss 0.5525194406509399
Epoch 14/20, Loss 0.5495392084121704
Epoch 15/20, Loss 0.5468899607658386
Epoch 16/20, Loss 0.5445339679718018
Epoch 17/20, Loss 0.5424376726150513
Epoch 18/20, Loss 0.5405715703964233
Epoch 19/20, Loss 0.5389097332954407
Epoch 20/20, Loss 0.5374288558959961


And the model is trained! Making predictions can be done by calling the model as we saw earlier, but inside a `torch.no_grad()` context.

In [47]:
x_new = x_test[:3]
with torch.no_grad():
    y_pred = model(x_new)

In [48]:
y_pred

tensor([[0.8061],
        [1.7116],
        [2.6973]])

I got my question above about model initialization answered. The `nn.Linear` module draws both the weights and the bias term from a uniform distribution between -1/sqrt(in_features) and +1/sqrt(in_features). With our 8 input features that is -sqrt(2)/4 to +sqrt(2)/4, or roughly ±0.354.
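A small sketch that checks the bound and shows one way to adjust the bias afterwards. I am using `nn.init.zeros_` to reset the bias to zero, matching our hand-written version. Other initializers in `torch.nn.init` would work the same way. The names `bound` and `within_bound` are mine.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(42)
layer = nn.Linear(in_features=8, out_features=1)

# Default bound for 8 input features: 1/sqrt(8) == sqrt(2)/4.
bound = 1 / math.sqrt(layer.in_features)
tolerance = 1e-6  # allow for float32 rounding
weights_ok = bool((layer.weight.abs() <= bound + tolerance).all())
bias_ok = bool((layer.bias.abs() <= bound + tolerance).all())
within_bound = weights_ok and bias_ok
print(f"bound = {bound:.4f}, within bound: {within_bound}")

# Override the default: start the bias at zero, as in the low-level example.
nn.init.zeros_(layer.bias)
print(layer.bias)
```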

## Implementing a Regression MLP

Many neural networks are just stacks of modules. One very useful module for building such stacks is `nn.Sequential`, which we can use to build our MLP with two hidden layers and one output layer.

In [49]:
torch.manual_seed(42)
model = nn.Sequential(
    nn.Linear(n_features, 50),
    nn.ReLU(),
    nn.Linear(50, 40),
    nn.ReLU(),
    nn.Linear(40, 1)
)

- ReLU is the activation function we've seen before. It acts element-wise and has no parameters, so its output has the same dimensions as its input.
- The hidden layers are `nn.Linear` instances. The first layer's input size must match the number of features in our data. Its output size is a tunable hyperparameter, in this case 50.
- The second hidden layer shows how each layer's input size must match the output size of the previous layer. It's common to use the same output size for each hidden layer.
- The final output layer's input size must match the output size of the previous layer *and* its output size must match the dimension of our target. Since our target is one-dimensional (a housing price prediction), we have a single output neuron.
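To make the size-matching rules in the list above concrete, this sketch rebuilds the same architecture and prints each parameter's shape and the total parameter count. `mlp` and `n_params` are names I chose, and `n_features = 8` is hard-coded to match the housing data.

```python
import torch.nn as nn

n_features = 8
mlp = nn.Sequential(
    nn.Linear(n_features, 50),
    nn.ReLU(),
    nn.Linear(50, 40),
    nn.ReLU(),
    nn.Linear(40, 1),
)

# ReLU layers contribute nothing here because they have no parameters.
for name, param in mlp.named_parameters():
    print(f"{name:10s} {tuple(param.shape)}")

# (8*50 + 50) + (50*40 + 40) + (40*1 + 1) = 450 + 2040 + 41
n_params = sum(p.numel() for p in mlp.parameters())
print(n_params)  # 2531
```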

In [50]:
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
mse = nn.MSELoss()
train_bgd(model, optimizer, mse, x_train, y_train, n_epochs)

Epoch 1/20, Loss 5.045480251312256
Epoch 2/20, Loss 2.0523128509521484
Epoch 3/20, Loss 1.0039883852005005
Epoch 4/20, Loss 0.8570139408111572
Epoch 5/20, Loss 0.7740675210952759
Epoch 6/20, Loss 0.7225848436355591
Epoch 7/20, Loss 0.6893726587295532
Epoch 8/20, Loss 0.6669033169746399
Epoch 9/20, Loss 0.6507739424705505
Epoch 10/20, Loss 0.6383934020996094
Epoch 11/20, Loss 0.6281994581222534
Epoch 12/20, Loss 0.6193398833274841
Epoch 13/20, Loss 0.6113173365592957
Epoch 14/20, Loss 0.6038705110549927
Epoch 15/20, Loss 0.5968307852745056
Epoch 16/20, Loss 0.5901119112968445
Epoch 17/20, Loss 0.5836467742919922
Epoch 18/20, Loss 0.5774063467979431
Epoch 19/20, Loss 0.5713555216789246
Epoch 20/20, Loss 0.565444827079773


Woohoo! My first neural network trained with PyTorch! I am going to tell all of my friends.

## Implementing Mini-Batch Gradient Descent Using DataLoaders

Thus far we've been training against the full training set `x_train` and `y_train` without breaking it up into more manageable batches. This changes now: using the `DataLoader` class from `torch.utils.data`, we can split the training set into batches and even shuffle it with `shuffle=True`.

Because our data `x_train` and `y_train` are tensors, we'll use the `TensorDataset` class to wrap them in the appropriate API.

In [51]:
from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(x_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=21, shuffle=True)
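To see what the loader actually yields, here is a sketch on a small synthetic dataset of 100 samples (not the housing data). With `batch_size=21` that gives four full batches and a final smaller one, since the last partial batch is kept by default.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(42)
X = torch.randn(100, 8)  # 100 made-up samples with 8 features
y = torch.randn(100, 1)

loader = DataLoader(TensorDataset(X, y), batch_size=21, shuffle=True)

# Each iteration yields one (inputs, targets) pair of tensors.
batch_sizes = [len(x_batch) for x_batch, y_batch in loader]
n_batches = len(loader)

print(n_batches)    # 5
print(batch_sizes)  # [21, 21, 21, 21, 16]
```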

Moving a model to the GPU loads all of its parameters to the GPU/Accelerator RAM. At the start of each iteration during training we copy each batch to the device. We use the `to()` method as we did with tensors:

In [57]:
torch.manual_seed(42)
model = nn.Sequential(
    nn.Linear(8, 50),
    nn.ReLU(),
    nn.Linear(50, 40),
    nn.ReLU(),
    nn.Linear(40, 1)
)
model = model.to(device)

In [58]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.2)

> Best practice is to initialize optimizers **after** you have moved the model to the GPU.

In [60]:
def train(model, optimizer, criterion, train_loader, n_epochs):
    model.train()
    for epoch in range(n_epochs):
        total_loss = 0
        for x_batch, y_batch in train_loader:
            x_batch, y_batch = x_batch.to(device), y_batch.to(device)
            y_pred = model(x_batch)
            loss = criterion(y_pred, y_batch)
            total_loss += loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        
        mean_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {mean_loss:.4f}")

Very similar to our previous loop, with some extra logic for the batches. Notice the `model.train()` call at the start: it puts the model into **training mode**, as opposed to the evaluation mode set by `model.eval()`. Calling these methods essentially just flips a boolean (the `training` attribute) on the model and its submodules, which some layers consult to decide how to behave.
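A sketch of that boolean in action. Our MLP has no layers that behave differently between modes, so for illustration I am assuming a toy network with an `nn.Dropout` layer, which is one of the layers that does check the flag. The names `net`, `flags_eval`, `flags_train`, and `deterministic` are mine.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
# Toy network with dropout, purely to show the mode switch.
net = nn.Sequential(nn.Linear(8, 4), nn.Dropout(p=0.5), nn.Linear(4, 1))
inputs = torch.randn(3, 8)

net.eval()  # sets training=False on the model and every submodule
flags_eval = [module.training for module in net.modules()]
with torch.no_grad():
    first = net(inputs)
    second = net(inputs)
# In evaluation mode dropout is disabled, so repeated calls agree.
deterministic = torch.equal(first, second)

net.train()  # sets training=True everywhere again
flags_train = [module.training for module in net.modules()]

print(flags_eval, deterministic)
print(flags_train)
```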

In [61]:
train(model, optimizer, mse, train_loader, n_epochs)

Epoch 1/20, Loss: 0.7774
Epoch 2/20, Loss: 0.4419
Epoch 3/20, Loss: 0.4120
Epoch 4/20, Loss: 0.3883
Epoch 5/20, Loss: 0.3787
Epoch 6/20, Loss: 0.3810
Epoch 7/20, Loss: 0.3799
Epoch 8/20, Loss: 0.3641
Epoch 9/20, Loss: 0.3537
Epoch 10/20, Loss: 0.3539
Epoch 11/20, Loss: 0.3512
Epoch 12/20, Loss: 0.3586
Epoch 13/20, Loss: 0.3448
Epoch 14/20, Loss: 0.3337
Epoch 15/20, Loss: 0.3544
Epoch 16/20, Loss: 0.3500
Epoch 17/20, Loss: 0.3612
Epoch 18/20, Loss: 0.3519
Epoch 19/20, Loss: 0.3432
Epoch 20/20, Loss: 0.3921


##### Some minor performance tweaks to the DataLoader:
- use `pin_memory=True` when creating the data loader, which allocates the batches in page-locked memory
- allow the CPU to prefetch batches by adjusting `num_workers`, `prefetch_factor`, and `persistent_workers=True`
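For reference, a sketch of what a loader with those options might look like. The specific values (`num_workers=2`, `prefetch_factor=2`) are placeholders I picked, not tuned recommendations, and the data is synthetic. As far as I understand, pinned memory mainly helps host-to-GPU copies on CUDA, so the benefit on other backends may be small. The loader is only constructed here, not iterated.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(100, 8)  # made-up data standing in for x_train
y = torch.randn(100, 1)  # made-up data standing in for y_train

fast_loader = DataLoader(
    TensorDataset(X, y),
    batch_size=21,
    shuffle=True,
    pin_memory=True,          # put batches in page-locked host memory
    num_workers=2,            # worker processes that load batches in the background
    prefetch_factor=2,        # batches each worker prepares ahead of time
    persistent_workers=True,  # keep workers alive between epochs
)

print(fast_loader.num_workers, fast_loader.prefetch_factor)
```

Note that `prefetch_factor` and `persistent_workers` only apply when `num_workers` is greater than zero.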