# Chapter 10. Building Neural Networks with PyTorch

PyTorch is an open-source deep learning library for building and training machine learning models. It was originally developed by Meta with the goal of giving researchers a Pythonic interface, and it is now governed by the PyTorch Foundation.

## Fundamentals

The core data structure in PyTorch is the *tensor*, a multidimensional array with a shape and a data type, much like a NumPy array. Tensors will be the inputs and outputs of our neural networks, just as NumPy arrays were for Scikit-Learn models.

### PyTorch Tensors

Import the library and create a 2 x 3 tensor:

In [1]:
import torch

In [2]:
X = torch.tensor([[1.0, 4.0, 7.0], [2.0, 3.0, 6.0]])
X

tensor([[1., 4., 7.],
        [2., 3., 6.]])

Getting the shape and data type works much as it does in NumPy:

In [3]:
X.shape

torch.Size([2, 3])

In [4]:
X.dtype

torch.float32

So do indexing, element-wise arithmetic, aggregations, and matrix multiplication:

In [5]:
X[0, 1]

tensor(4.)

In [7]:
X[:, 1]

tensor([4., 3.])

In [8]:
10 * (X + 1.0)

tensor([[20., 50., 80.],
        [30., 40., 70.]])

In [9]:
X.exp()

tensor([[   2.7183,   54.5982, 1096.6332],
        [   7.3891,   20.0855,  403.4288]])

In [10]:
X.mean()

tensor(3.8333)

In [11]:
X.max(dim=0)

torch.return_types.max(
values=tensor([2., 4., 7.]),
indices=tensor([1, 0, 0]))

In [12]:
X.max(dim=1)

torch.return_types.max(
values=tensor([7., 6.]),
indices=tensor([2, 2]))

In [13]:
X @ X.T

tensor([[66., 56.],
        [56., 49.]])

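The `numpy()` method converts a tensor to a NumPy array, and `torch.tensor()` goes the other way, creating a tensor from a NumPy array. NumPy defaults to 64-bit floats, so the resulting tensor has `dtype=torch.float64`: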

In [15]:
import numpy as np

X.numpy()

torch.tensor(np.array([[1., 4., 7.], [2. ,3., 6.]]))

tensor([[1., 4., 7.],
        [2., 3., 6.]], dtype=torch.float64)
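One detail worth checking when converting between the two libraries is whether the data is copied or shared. The sketch below compares `torch.tensor()`, which copies, with `torch.from_numpy()`, which shares memory with the source array. The variable names (`arr`, `copied`, `shared`, `as_f32`) are my own for illustration.

```python
import numpy as np
import torch

arr = np.array([[1., 4., 7.], [2., 3., 6.]])  # NumPy uses float64 by default

copied = torch.tensor(arr)      # copies the data into a new tensor
shared = torch.from_numpy(arr)  # shares memory with `arr`

arr[0, 0] = 100.0               # modify the NumPy array in place

print(copied[0, 0].item())  # 1.0, the copy is unaffected
print(shared[0, 0].item())  # 100.0, the shared tensor sees the change

# Request float32 explicitly to match PyTorch's default float type
as_f32 = torch.tensor(arr, dtype=torch.float32)
print(as_f32.dtype)  # torch.float32
```

Sharing avoids a copy for large arrays, but it also means a change on one side silently shows up on the other.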

In [16]:
X[:, 1] = -99
X

tensor([[  1., -99.,   7.],
        [  2., -99.,   6.]])

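QUESTION: What does it mean that API methods ending in an underscore apply their operation in-place? Does that mean without storing a copy? Essentially yes: the method overwrites the tensor's existing data instead of allocating a new tensor for the result.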

In [17]:
X.relu_()
X

tensor([[1., 0., 7.],
        [2., 0., 6.]])
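To convince myself of the in-place behavior, here is a small sketch that compares `relu()` with `relu_()` using `data_ptr()`, which returns the memory address of a tensor's first element. The name `address_before` is just an illustrative variable.

```python
import torch

X = torch.tensor([[1.0, -99.0, 7.0], [2.0, -99.0, 6.0]])
address_before = X.data_ptr()   # memory address of X's first element

Y = X.relu()    # out-of-place: allocates a new tensor, X is unchanged
print(Y.data_ptr() == address_before)  # False
print(X[0, 1].item())                  # -99.0

X.relu_()       # in-place: overwrites X's existing storage
print(X.data_ptr() == address_before)  # True
print(X[0, 1].item())                  # 0.0
```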

### Hardware Acceleration

One benefit of PyTorch is hardware acceleration, which can greatly speed up computations. Unlike Scikit-Learn, PyTorch lets us choose among several accelerators: NVIDIA GPUs with CUDA, Apple's *Metal Performance Shaders* (MPS), AMD ROCm, Intel's oneAPI, etc.

In [19]:
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

In [20]:
device

'mps'

Create a tensor on the CPU, then copy it to the accelerator `device` with the `to()` method.
The tensor's `device` attribute shows which device it lives on.

In [22]:
M = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
M = M.to(device)

In [23]:
M.device

device(type='mps', index=0)

The tensor can also be created directly on the GPU using the `device` argument:

In [24]:
M = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device=device)

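QUESTION: Why isn't this the default, and what is the scenario for each approach? Maybe we want to do some easy manipulation and data verification on the CPU to keep the GPU free for the more intense training computations? A likely part of the answer: the CPU is the only device guaranteed to exist, and data usually starts there anyway (loaded from disk or converted from NumPy), so `to()` suits tensors that already live on the CPU, while the `device` argument avoids an extra copy for tensors created from scratch.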

In [25]:
R = M @ M.T
R

tensor([[14., 32.],
        [32., 77.]], device='mps:0')

Crucially, the resulting tensor `R` also lives on the accelerator, so we avoid the bottleneck of transferring data between devices.
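Eventually a result has to come back to the CPU, for example to convert it to a NumPy array, since `numpy()` only works on CPU tensors. The sketch below keeps the computation on whichever device is available and transfers only the final result. The name `R_np` is illustrative.

```python
import torch

# Pick the best available device, falling back to the CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

M = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device=device)
R = M @ M.T                # computed on `device`, result stays there

R_np = R.cpu().numpy()     # copy back to the CPU only when needed
print(R_np)
```

On a CPU-only machine `cpu()` simply returns the same tensor, so the code runs unchanged on any device.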

In [26]:
M = torch.rand((1000, 1000))
%timeit M @ M.T

2.46 ms ± 160 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [28]:
M = torch.rand((1000, 1000), device=device)
%timeit M @ M.T

195 μs ± 2.68 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


Interestingly, the `mps`-accelerated operation was significantly faster, by roughly an order of magnitude. Honestly, the CPU operation was no slouch either, probably thanks to the Apple M4's ARM architecture and unified memory.

It's curious to me that the cell for the accelerated version took a lot longer to finish overall. Maybe this was a temporary Jupyter kernel quirk, or overhead from communicating with the Python kernel running on the CPU. Another plausible cause is visible in the output itself: `%timeit` chose 10,000 loops per run for the accelerated version versus 100 for the CPU version, so it performed far more iterations in total.
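One caveat with these timings: accelerator operations are typically queued asynchronously, so a timer may stop before the device has finished the work. Below is a rough sketch of a more careful measurement that waits for the device before reading the clock. The helpers `synchronize` and `time_matmul` are hypothetical names of my own, and I am assuming a PyTorch version recent enough to provide `torch.mps.synchronize()`.

```python
import time
import torch

def synchronize(device: str) -> None:
    """Block until all queued work on the accelerator has finished."""
    if device == "cuda":
        torch.cuda.synchronize()
    elif device == "mps":
        torch.mps.synchronize()
    # CPU operations run synchronously, so there is nothing to wait for

def time_matmul(device: str, size: int = 1000, repeats: int = 10) -> float:
    """Return the mean number of seconds per `M @ M.T` on `device`."""
    M = torch.rand((size, size), device=device)
    M @ M.T                  # warm-up run, excluded from the timing
    synchronize(device)
    start = time.perf_counter()
    for _ in range(repeats):
        M @ M.T
    synchronize(device)      # wait for queued work before stopping the clock
    return (time.perf_counter() - start) / repeats

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"cpu: {time_matmul('cpu') * 1e3:.3f} ms per loop")
if device != "cpu":
    print(f"{device}: {time_matmul(device) * 1e3:.3f} ms per loop")
```

The numbers will differ from the `%timeit` results above, and I have not verified how large the difference is on any particular machine.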

### Autograd

PyTorch's implementation of reverse-mode automatic differentiation is called *autograd*, short for automatic gradients.

In [42]:
x = torch.tensor(5.0, requires_grad=True)
f = x ** 2
f

tensor(25., grad_fn=<PowBackward0>)

In [43]:
f.backward()
x.grad

tensor(10.)

A breakdown of this code is as follows.

First, a tensor `x` is created with value `5.0`. So that it is not treated as a constant, we pass `requires_grad=True`, which tells PyTorch to track the operations performed on it (necessary for autograd).

The definition of f(x) is fairly straightforward: we square the tensor `x` (a power operation, not an exponential). This gives us a new tensor with the resulting value `25.` and `grad_fn=<PowBackward0>`, which records the operation we just performed and links this tensor to the one it was computed from, `x`. This `grad_fn` attribute is how PyTorch keeps track of the *computation graph*.

Given this, when we call `f.backward()`, PyTorch traverses the graph backwards starting at `f`, calculating gradients all the way back to the leaf nodes, which in this case is just `x`.

This then allows us to read `x`'s gradient with `x.grad`. Had we read it before calling `backward()`, its gradient would have been `None`. The gradient was computed during backprop and gives us the derivative of `f` with respect to `x`: since f(x) = x², f'(x) = 2x, which is 10 at x = 5.
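As a quick check of the same mechanics on a slightly longer graph, here is a sketch using f(x) = 3x² + 2x, a function I picked only for illustration. Its derivative is 6x + 2, so at x = 5 the gradient should be 32.

```python
import torch

x = torch.tensor(5.0, requires_grad=True)
grad_before = x.grad     # nothing has been backpropagated yet
print(grad_before)       # None
print(x.is_leaf)         # True, x was created directly by us

f = 3 * x ** 2 + 2 * x   # each operation adds a node to the graph
print(f.is_leaf)         # False, f is the result of tracked operations

f.backward()             # traverse the graph from f back to x
print(x.grad)            # tensor(32.)
```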

---


For gradient descent, the update step that subtracts a fraction of the gradient from each model variable should not be tracked. In fact, PyTorch raises an exception if you try to update a tracked leaf tensor in place. To keep the update out of the computation graph, place each step inside a `torch.no_grad()` context.

In [38]:
learning_rate = 0.1
print(f"x: {x}, x.grad {x.grad}")
with torch.no_grad():
    x -= learning_rate * x.grad
print(f"x: {x}, x.grad {x.grad}")


x: 3.0, x.grad 10.0
x: 2.0, x.grad 10.0
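To see the exception for myself, here is a minimal sketch that attempts the update without `torch.no_grad()` first and then repeats it inside the context. It starts from a fresh `x = 5.0`, and the `raised` flag is only there to record what happened.

```python
import torch

x = torch.tensor(5.0, requires_grad=True)
f = x ** 2
f.backward()             # x.grad is now 10

try:
    x -= 0.1 * x.grad    # in-place update of a tracked leaf tensor
    raised = False
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {err}")

with torch.no_grad():    # operations in this block are not tracked
    x -= 0.1 * x.grad

print(x)  # tensor(4., requires_grad=True)
```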


You can also avoid gradient tracking with the `detach()` method, which creates a new tensor that is detached from the computation graph, has `requires_grad=False`, and points to the same data in memory.

In [45]:
x_detached = x.detach()
x_detached -= learning_rate * x.grad
print(f"x: {x}, x.grad {x.grad}")
print(f"x_detached: {x_detached}, x_detached.grad {x_detached.grad}")

x: 3.0, x.grad 10.0
x_detached: 3.0, x_detached.grad None


So `detach()` is handy for operations that should not affect the gradient, such as logging. Because `x_detached` and `x` share the same memory, modifying `x_detached` also modifies `x`. In general, `no_grad()` is preferred when performing inference or doing a gradient descent step.

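Before repeating the process, the gradients of every model parameter need to be zeroed out, because `backward()` accumulates into `.grad` rather than overwriting it. The gradient tensor has `requires_grad=False`, so a `no_grad()` context is not necessary. (Note that `x.grad` is itself a tensor, whereas `grad_fn` is a graph node object rather than a tensor.)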

In [46]:
x.grad.zero_()

tensor(0.)
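The reason zeroing matters is easiest to see by skipping it. In the sketch below the gradient of x² at x = 5 is 10, but a second `backward()` call without zeroing adds another 10 on top. The names `first`, `second`, and `third` are illustrative.

```python
import torch

x = torch.tensor(5.0, requires_grad=True)

(x ** 2).backward()
first = x.grad.item()
print(first)      # 10.0

(x ** 2).backward()          # no zeroing: the new gradient is added
second = x.grad.item()
print(second)     # 20.0

x.grad.zero_()               # reset the accumulated gradient in place
(x ** 2).backward()
third = x.grad.item()
print(third)      # 10.0
```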

In [49]:
learning_rate = 0.1
x = torch.tensor(5.0, requires_grad=True)

for i in range(100):
    f = x ** 2  # forward pass
    f.backward()    # backward pass
    
    with torch.no_grad():
        x -= learning_rate * x.grad

    x.grad.zero_()  # zero out the gradients at the end of each iteration
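A sanity check on the loop above: for f(x) = x² the update is x ← x − 0.1 · 2x = 0.8x, so after 100 steps x should be about 5 · 0.8¹⁰⁰, which is very close to the minimum at 0. The sketch reruns the same loop and compares against that closed form. The name `expected` is my own.

```python
import torch

learning_rate = 0.1
x = torch.tensor(5.0, requires_grad=True)

for i in range(100):
    f = x ** 2                       # forward pass
    f.backward()                     # backward pass
    with torch.no_grad():
        x -= learning_rate * x.grad  # gradient descent step
    x.grad.zero_()                   # reset for the next iteration

# Each step multiplies x by (1 - 2 * learning_rate) = 0.8
expected = 5.0 * (1 - 2 * learning_rate) ** 100
print(x.item(), expected)            # both are tiny, on the order of 1e-9
```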

> In-place operations save memory by avoiding intermediate copies, but they don't always work well with autograd, which may need those intermediate values to perform the backward pass. Instead of `z += 1`, use `z = z + 1`.
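Here is a sketch of one case where this bites, assuming an `exp()` followed by an in-place add. The backward pass of `exp()` reuses its own output, so overwriting that output in place makes `backward()` fail. The names `t`, `t2`, and `failed` are illustrative.

```python
import torch

t = torch.tensor(2.0, requires_grad=True)
z = t.exp()       # autograd saves this output for exp's backward pass
z += 1            # in-place: overwrites the saved output

try:
    z.backward()
    failed = False
except RuntimeError as err:
    failed = True
    print(f"RuntimeError: {err}")

# Out-of-place version: the output of exp() is left intact
t2 = torch.tensor(2.0, requires_grad=True)
z2 = t2.exp()
z2 = z2 + 1       # creates a new tensor instead of modifying z2
z2.backward()
print(t2.grad)    # the derivative of exp(t) + 1 at t = 2, i.e. exp(2)
```

Not every in-place operation fails this way. It depends on whether the overwritten value is needed during the backward pass.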