# Introduction to PyTorch

## The three core components of PyTorch

Firstly, PyTorch is a tensor library that extends the array-oriented programming model of
NumPy with accelerated computation on GPUs, making it straightforward to switch
between CPUs and GPUs.

Secondly, PyTorch is an automatic differentiation engine, also known as autograd, which
enables the automatic computation of gradients for tensor operations, simplifying
backpropagation and model optimization.

Finally, PyTorch is a deep learning library, meaning that it offers modular, flexible, and
efficient building blocks (including pre-trained models, loss functions, and optimizers) for
designing and training a wide range of deep learning models, catering to both researchers
and developers.

AI is fundamentally about creating computer systems capable of performing tasks that
usually require human intelligence. These tasks include understanding natural language,
recognizing patterns, and making decisions. (Despite significant progress, AI is still far from
achieving this level of general intelligence.)

Machine learning is a subfield of AI that focuses on developing and improving learning algorithms. The key idea behind machine learning is to enable computers to learn from data and make predictions or decisions without being explicitly programmed to perform the task. This involves developing algorithms that can identify patterns, learn from historical data, and improve their performance over time with more data and feedback.

Machine learning is also behind technologies like recommendation systems used by online retailers and streaming services, email spam filtering, voice recognition in virtual assistants, and even self-driving cars. The introduction and advancement of machine learning have significantly enhanced AI's capabilities, enabling it to move beyond strict rule-based systems and adapt to new inputs or changing environments.

Deep learning is a subcategory of machine learning that focuses on the training and
application of deep neural networks. These deep neural networks were originally inspired by
how the human brain works, particularly the interconnection between many neurons. The
"deep" in deep learning refers to the multiple hidden layers of artificial neurons or nodes
that allow them to model complex, nonlinear relationships in the data.

Unlike traditional machine learning techniques, which excel at simple pattern recognition,
deep learning is particularly good at handling unstructured data like images, audio, or text,
which makes it well suited for large language models (LLMs).

Using a learning algorithm, a model is trained on a training dataset consisting of examples
and corresponding labels. In the case of an email spam classifier, for example, the training
dataset consists of emails and their spam and not-spam labels that a human identified.
Then, the trained model can be used on new observations (new emails) to predict their
unknown label (spam or not spam).

Of course, we also want to add a model evaluation between the training and inference
stages to ensure that the model satisfies our performance criteria before using it in a
real-world application.


In [2]:
import torch
import numpy as np

In [3]:
torch.__version__

'2.0.1'

In [4]:
print(torch.backends.mps.is_available())

True


## Understanding tensors

Tensors represent a mathematical concept that generalizes vectors and matrices to potentially higher dimensions. In other words, tensors are mathematical objects that can be characterized by their order (or rank), which provides the number of dimensions. For example, a scalar (just a number) is a tensor of rank 0, a vector is a tensor of rank 1, and a matrix is a tensor of rank 2.

From a computational perspective, tensors serve as data containers. For instance, they hold multi-dimensional data, where each dimension represents a different feature. Tensor libraries, such as PyTorch, can create, manipulate, and compute with these multi-dimensional arrays efficiently. In this context, a tensor library functions as an array library.


PyTorch tensors are similar to NumPy arrays but have several additional features important for deep learning. For example, PyTorch adds an automatic differentiation engine, simplifying computing gradients. PyTorch tensors also support GPU computations to speed up deep neural network training.
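
As a minimal sketch of these points, the snippet below converts a NumPy array into a tensor and moves it to an accelerator if one is present. The variable names (`arr`, `device`, `result`) are illustrative, and the order in which devices are tried is an assumption, not a recommendation.

```python
import numpy as np
import torch

# torch.from_numpy creates a tensor that shares memory with the NumPy array
arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)
t = torch.from_numpy(arr)
arr[0] = 10.0  # the change is visible through the tensor as well

# Pick an accelerator if one is available; otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# The same computation runs on any device; .cpu() brings the result back
result = (t.to(device) * 2).cpu()
print(result)
```

Because the device is chosen at runtime, the rest of the code does not need to change between machines with and without a GPU.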

## Scalars, vectors, matrices, and tensors

As mentioned earlier, PyTorch tensors are data containers for array-like structures. A scalar
is a `0-dimensional` tensor (for instance, just a number), a vector is a `1-dimensional` tensor,
and a matrix is a `2-dimensional` tensor. There is no specific term for higher-dimensional
tensors, so we typically refer to a `3-dimensional` tensor as just a 3D tensor, and so forth.

We can create objects of PyTorch's Tensor class using the torch.tensor function as
follows:

In [5]:
np.__version__

'1.25.0'

In [6]:
import torch

tensor0d = torch.tensor(1)
print(tensor0d, tensor0d.dim())

tensor(1) 0


In [7]:
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d, tensor1d.dim())

tensor([1, 2, 3]) 1


In [8]:
tensor2d = torch.tensor([[3, 4], [6, 7]])
print(tensor2d, tensor2d.dim())

tensor([[3, 4],
        [6, 7]]) 2


In [9]:
tensor3d = torch.tensor([[[2, 6], [3, 7], [4, 8]]])
print(tensor3d, tensor3d.dim())

tensor([[[2, 6],
         [3, 7],
         [4, 8]]]) 3


## Tensor data types

PyTorch adopts the default 64-bit integer data type from Python. We can access the data type of a
tensor via the `.dtype` attribute of a tensor:

In [10]:
print(tensor1d.dtype)

torch.int64


If we create tensors from Python floats, PyTorch creates tensors with a 32-bit precision by
default, as we can see below:

In [11]:
floatvec = torch.tensor([1.0, 2.3, 4.6])
print(floatvec.dtype)

torch.float32


This choice is primarily due to the balance between precision and computational efficiency.
A 32-bit floating point number offers sufficient precision for most deep learning tasks, while
consuming less memory and computational resources than a 64-bit floating point number.
Moreover, GPU architectures are optimized for 32-bit computations, and using this data
type can significantly speed up model training and inference.
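
The memory side of this trade-off can be checked directly. The following sketch compares the storage needed for 1,000 elements in both precisions; the tensor names are illustrative.

```python
import torch

f32 = torch.zeros(1000, dtype=torch.float32)
f64 = torch.zeros(1000, dtype=torch.float64)

# element_size() returns the number of bytes used by a single element
bytes_f32 = f32.element_size() * f32.numel()
bytes_f64 = f64.element_size() * f64.numel()
print(bytes_f32, bytes_f64)  # 4000 8000
```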

Moreover, it is possible to readily change the precision using a tensor's `.to` method. The following code demonstrates this by changing a `64-bit` integer tensor into a `32-bit` float tensor:

In [12]:
vectofloat = tensor1d.to(torch.float32)
print(vectofloat, vectofloat.dtype)

tensor([1., 2., 3.]) torch.float32


## Common PyTorch tensor operations

This section covers some of the most essential PyTorch tensor operations.

In [13]:
# We already introduced the torch.tensor() function to create new tensors.

new_tensor = torch.tensor([[3, 6], [4, 8], [5, 10]])
print(new_tensor, new_tensor.dtype)

tensor([[ 3,  6],
        [ 4,  8],
        [ 5, 10]]) torch.int64


In [14]:
# In addition, the .shape attribute allows us to access the shape of a tensor:

new_tensor.shape

torch.Size([3, 2])

In [15]:
# As you can see above, .shape returns [3, 2], which means that the tensor has 3 rows and 2 columns. 
# To reshape the tensor into a 2 by 3 tensor, we can use the .reshape method:

print(new_tensor.reshape(2, 3))

tensor([[ 3,  6,  4],
        [ 8,  5, 10]])


In [16]:
# However, note that the more common command for reshaping tensors in PyTorch is .view():

print(new_tensor.view(2, 3))

tensor([[ 3,  6,  4],
        [ 8,  5, 10]])


Similar to `.reshape` and `.view`, there are several cases where PyTorch offers multiple syntax options for executing the same computation.
This is because PyTorch initially followed the original `Lua Torch` syntax convention but then also added syntax to make it more similar to NumPy upon popular request.

Next, we can use `.T` to transpose a tensor, which means flipping it across its diagonal. Note that this is not the same as reshaping a tensor, as you can see based on the result below:

In [17]:
print(new_tensor.T)

tensor([[ 3,  4,  5],
        [ 6,  8, 10]])
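
The difference between transposing and reshaping, and one practical difference between `.view` and `.reshape`, can be seen in a short sketch. It relies on the fact that a transposed tensor is generally not contiguous in memory; the variable names are illustrative.

```python
import torch

t = torch.tensor([[3, 6], [4, 8], [5, 10]])

# Both results have shape (2, 3), but the elements are arranged differently
same = torch.equal(t.T, t.reshape(2, 3))
print(same)  # False

# A transposed tensor is not contiguous in memory, so .view cannot flatten it
try:
    t.T.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True

# .reshape falls back to copying the data when a view is not possible
flat = t.T.reshape(6)
print(view_failed, flat)
```

In short, `.view` never copies data and therefore requires a compatible memory layout, whereas `.reshape` returns a view when it can and a copy otherwise.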


Lastly, the common way to multiply two matrices in PyTorch is the `.matmul` method:

In [18]:
print(tensor2d.matmul(tensor2d))

tensor([[33, 40],
        [60, 73]])


However, we can also adopt the `@` operator, which accomplishes the same thing more compactly:

In [19]:
print(tensor2d@tensor2d)

tensor([[33, 40],
        [60, 73]])


## Seeing models as computation graphs

PyTorch's autograd system provides functions to compute gradients in dynamic computational graphs automatically. 

A computational graph (or computation graph for short) is a directed graph that allows us to express and visualize mathematical expressions. In the context of deep learning, a computation graph lays out the sequence of calculations needed to compute the output of a neural network.

Let's look at a concrete example to illustrate the concept of a computation graph. The following code implements the forward pass (prediction step) of a simple logistic regression classifier, which can be seen as a single-layer neural network, returning a score between `0` and `1` that is compared to the true class label `(0 or 1)` when computing the loss:

In [20]:
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2])
b = torch.tensor([0.0])
z = x1 * w1 + b

y_pred = torch.sigmoid(z)
print(y_pred)

tensor([0.9183])


In [21]:
# Get the loss
loss = F.binary_cross_entropy(y_pred, y)

In [22]:
loss

tensor(0.0852)

![Alt text](../assests/computational-graph.png)

The image above illustrates a logistic regression forward pass as a computation graph. The input feature `x1` is multiplied by a model weight `w1` and passed through an activation function `σ` after adding the bias. The loss is computed by comparing the model output `a` with a given label `y`.
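
To make the graph concrete, the forward pass can be traced by hand with the values from the code. The symbols $z$, $a$, and $L$ denote the net input, the sigmoid output, and the binary cross-entropy loss; the numbers are rounded to four decimals.

$$
\begin{aligned}
z &= x_1 w_1 + b = 1.1 \cdot 2.2 + 0 = 2.42 \\
a &= \sigma(z) = \frac{1}{1 + e^{-2.42}} \approx 0.9183 \\
L &= -\left[\, y \log a + (1 - y) \log (1 - a) \,\right] = -\log(0.9183) \approx 0.0852
\end{aligned}
$$

These values agree with the printed `y_pred` and `loss` tensors.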

In fact, PyTorch builds such a computation graph in the background, and we can use this to
calculate gradients of a loss function with respect to the model parameters (here w1 and b)
to train the model.

## Automatic differentiation made easy

If we carry out computations in PyTorch, it will build such a graph internally by default if one of its terminal nodes has the `requires_grad` attribute set to True. This is useful if we want to compute `gradients`. Gradients are required when training neural networks via the popular backpropagation algorithm, which can be thought of as an implementation of the `chain rule` from calculus for neural networks.

![Alt text](../assests/auto-diff.png)

**PARTIAL DERIVATIVES AND GRADIENTS**

`Figure A.8` shows partial derivatives, which measure the rate at which a function changes with respect to one of its variables. A gradient is a vector containing all of the partial derivatives of a multivariate function, a function with more than one variable as input.

If you are not familiar or don't remember the `partial derivatives`, `gradients`, or the `chain rule` from calculus, don't worry. On a high level, all you need to know for this book is that the chain rule is a way to compute gradients of a loss function with respect to the model's parameters in a computation graph. This provides the information needed to update each parameter in a way that minimizes the loss function, which serves as a proxy for measuring the model's performance, using a method such as gradient descent.

How does this relate to the automatic differentiation (autograd) engine? By tracking every operation performed on tensors, PyTorch's `autograd` engine constructs a computational graph in the background. Then, calling the `grad` function, we can compute the gradients of the loss with respect to the model parameters `w1` and `b` as follows:

In [23]:
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
z = x1 * w1 + b

y_pred = torch.sigmoid(z)
loss = F.binary_cross_entropy(y_pred, y)

grad_l_w1 = grad(loss, w1, retain_graph=True)
grad_l_b = grad(loss, b, retain_graph=True)

By default, PyTorch destroys the computation graph after calculating the gradients to free memory. However, since we are going to reuse this computation graph shortly, we set retain_graph=True so that it stays in memory.

In [24]:
print(grad_l_b, grad_l_w1)

(tensor([-0.0817]),) (tensor([-0.0898]),)
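
These numbers can be cross-checked against the closed-form result. For a sigmoid output `a` combined with binary cross-entropy, the chain rule gives ∂L/∂z = a − y, so ∂L/∂w1 = (a − y)·x1 and ∂L/∂b = a − y. The sketch below compares this with autograd; the names `manual_w1` and `manual_b` are illustrative.

```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

a = torch.sigmoid(x1 * w1 + b)
loss = F.binary_cross_entropy(a, y)

# grad accepts several inputs at once and returns one gradient per input
grad_w1, grad_b = grad(loss, [w1, b])

# Closed-form gradients for sigmoid followed by binary cross-entropy
with torch.no_grad():
    manual_w1 = (a - y) * x1
    manual_b = a - y

print(grad_w1, manual_w1)
print(grad_b, manual_b)
```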


Above, we have been using the `grad` function "manually," which can be useful for experimentation, debugging, and demonstrating concepts. But in practice, PyTorch provides even more high-level tools to automate this process. For instance, we can call `.backward` on the loss, and PyTorch will compute the gradients of all the leaf nodes in the graph, which will be stored via the tensors' `.grad` attributes:

In [25]:
loss.backward()
print(w1.grad)
print(b.grad)

tensor([-0.0898])
tensor([-0.0817])
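
To show how these gradients are typically used, here is a minimal sketch of a few manual gradient descent steps on the same example. The learning rate `lr = 0.1` and the number of steps are arbitrary choices for illustration. Since `.backward` accumulates into `.grad`, the gradients are reset after each update.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
lr = 0.1  # hypothetical learning rate

losses = []
for step in range(3):
    loss = F.binary_cross_entropy(torch.sigmoid(x1 * w1 + b), y)
    loss.backward()  # fills w1.grad and b.grad

    # Update the parameters without recording the update in the graph
    with torch.no_grad():
        w1 -= lr * w1.grad
        b -= lr * b.grad

    # Reset the gradients so they do not accumulate across steps
    w1.grad.zero_()
    b.grad.zero_()
    losses.append(loss.item())

print(losses)
```

The loss should shrink slightly with every step, since each update moves the parameters against the gradient.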


## Implementing multilayer neural networks

In the previous sections, we covered PyTorch's tensor and autograd components. This section focuses on PyTorch as a library for implementing `deep neural networks`. To provide a concrete example, we focus on a `multilayer perceptron`, which is a fully
connected neural network.

![Alt text](../assests/mlp.png)

When implementing a neural network in PyTorch, we typically subclass the `torch.nn.Module` class to define our own custom network architecture. This Module base class provides a lot of functionality, making it easier to build and train models. For instance, it allows us to encapsulate layers and operations and keep track of the model's parameters.

Within this subclass, we define the network layers in the `__init__` constructor and specify how they interact in the `forward` method. The `forward` method describes how the input data passes through the network and comes together as a computation graph.

In contrast, the `backward` method, which we typically do not need to implement ourselves, is used during training to compute `gradients` of the `loss function` with respect to the `model parameters`. 

The following code implements a classic `multilayer perceptron` with two hidden layers to illustrate a typical usage of the Module class:

In [26]:
class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        
        self.layers = torch.nn.Sequential(
            
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),
            torch.nn.ReLU(),
            
            # 2nd hidden layer
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),
            
            # Output layer
            torch.nn.Linear(20, num_outputs),
            
        )
        
    def forward(self, x):
        logits = self.layers(x)
        return logits

- It's useful to code the number of inputs and outputs as variables to reuse the same code for datasets with different numbers of features and classes.
  
- The Linear layer takes the number of input and output nodes as arguments.
  
- Nonlinear activation functions are placed between the hidden layers.
  
- The number of output nodes of one hidden layer has to match the number of inputs of the next layer.
  
- The outputs of the last layer are called logits.

In [27]:
# instantiate a new neural network object as follows:
model = NeuralNetwork(50, 3)

But before using this new model object, it is often useful to call print on the model to see a summary of its structure:

In [28]:
print(model)

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)


Note that we used the `Sequential` class when we implemented the `NeuralNetwork` class. Using `Sequential` is not required, but it can make our life easier if we have a series of layers that we want to execute in a specific order, as is the case here. This way, after instantiating `self.layers = Sequential(...)` in the `__init__` constructor, we just have to call `self.layers` instead of calling each layer individually in the `forward` method of `NeuralNetwork`.
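
For comparison, the following sketch shows what the same architecture could look like without `Sequential`, calling each layer individually. The class name `NeuralNetworkExplicit` and the attribute names are hypothetical and chosen only for this illustration.

```python
import torch


class NeuralNetworkExplicit(torch.nn.Module):  # hypothetical name
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        # Each layer is stored as its own attribute
        self.hidden1 = torch.nn.Linear(num_inputs, 30)
        self.hidden2 = torch.nn.Linear(30, 20)
        self.out = torch.nn.Linear(20, num_outputs)

    def forward(self, x):
        # The layers are applied one by one, with ReLU in between
        x = torch.relu(self.hidden1(x))
        x = torch.relu(self.hidden2(x))
        return self.out(x)  # logits


explicit_model = NeuralNetworkExplicit(50, 3)
logits = explicit_model(torch.rand(1, 50))
print(logits.shape)
```

This form is more verbose, but it is convenient when the data flow is not a simple chain, for example when a layer's output is used in more than one place.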

In [29]:
# Next, let's check the total number of trainable parameters of this model:

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total number of trainable model parameters:", num_params)

Total number of trainable model parameters: 2213


Note that each parameter for which `requires_grad=True` counts as a trainable parameter and will be updated during training.

In the case of our neural network model with the two hidden layers above, these trainable parameters are contained in the `torch.nn.Linear` layers. A linear layer multiplies the inputs with a weight matrix and adds a bias vector. This is sometimes also referred to
as a feedforward or fully connected layer.
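
The total of 2,213 can be broken down per layer: a linear layer with `n_in` inputs and `n_out` outputs holds `n_in * n_out` weights plus `n_out` biases. The sketch below rebuilds the same stack of layers directly (without the wrapper class) and lists the count for each parameter tensor.

```python
import torch

layers = torch.nn.Sequential(
    torch.nn.Linear(50, 30), torch.nn.ReLU(),
    torch.nn.Linear(30, 20), torch.nn.ReLU(),
    torch.nn.Linear(20, 3),
)

# named_parameters() yields (name, tensor) pairs for all parameters
counts = {name: p.numel() for name, p in layers.named_parameters()}
for name, n in counts.items():
    print(name, n)

total = sum(counts.values())
print("total", total)  # 1500 + 30 + 600 + 20 + 60 + 3 = 2213
```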

Based on the `print(model)` call we executed above, we can see that the first Linear layer is at index position `0` in the layers attribute. We can access the corresponding weight parameter matrix as follows:

In [30]:
print(model.layers[0].weight)

Parameter containing:
tensor([[ 0.0242, -0.1279, -0.0271,  ..., -0.0631, -0.0629, -0.1057],
        [-0.0779, -0.1378,  0.0358,  ..., -0.0503,  0.1083,  0.0015],
        [ 0.1010, -0.0752, -0.1025,  ...,  0.0584,  0.1044,  0.1369],
        ...,
        [-0.1133,  0.0278,  0.1364,  ..., -0.0332, -0.0087, -0.0196],
        [ 0.1011,  0.0802, -0.0845,  ...,  0.0500,  0.0372, -0.1414],
        [-0.1276, -0.1105, -0.1404,  ...,  0.0672,  0.1302, -0.0142]],
       requires_grad=True)


In [31]:
print(model.layers[2].weight)

Parameter containing:
tensor([[-8.2704e-02,  1.6676e-01,  2.3790e-02, -1.6425e-01, -9.1062e-03,
         -7.9301e-02,  1.2550e-01,  9.2919e-02, -1.6414e-01, -1.1609e-01,
         -1.2934e-01,  1.0597e-01,  1.0918e-01,  1.1679e-01, -1.1571e-01,
          1.6734e-01, -1.4785e-01, -7.9849e-02, -2.1936e-03, -9.1362e-02,
          7.5577e-03,  9.7152e-02,  5.4985e-02,  1.6783e-01, -8.5155e-02,
          1.1405e-01, -1.0813e-01,  4.1274e-02, -8.3538e-02,  6.1219e-02],
        [ 1.2612e-01,  1.3987e-01,  9.1484e-02, -8.8441e-02, -3.9006e-02,
         -1.2066e-01, -1.0001e-01,  1.3610e-01, -2.5158e-02,  1.1552e-01,
          1.5893e-01,  1.4023e-02, -1.4964e-01, -3.4823e-02, -1.4583e-01,
         -1.3668e-01,  7.7482e-02, -6.3777e-02,  7.9075e-02,  1.5279e-01,
         -1.5895e-02,  1.1040e-01,  6.4484e-02, -1.3324e-02, -1.1895e-01,
         -1.5732e-01, -8.4179e-02,  1.6404e-01, -3.5424e-02, -8.3850e-02],
        [ 1.2384e-01,  1.6902e-01,  3.6369e-02,  8.4238e-02,  1.2355e-01,
         -1.65

Since this is a large matrix that is not shown in its entirety, let's use the .shape attribute to show its dimensions:

In [32]:
first_layer = model.layers[0].weight.shape
second_layer = model.layers[2].weight.shape

print("Shape of the weight of the first layer:", first_layer)
print("Shape of the weight of the second layer:", second_layer)

Shape of the weight of the first layer: torch.Size([30, 50])
Shape of the weight of the second layer: torch.Size([20, 30])


(Similarly, you could access the bias vector via `model.layers[0].bias`.)

In [33]:
first_layer_bias = model.layers[0].bias.shape
second_layer_bias = model.layers[2].bias.shape

print("Shape of the bias of the first layer:", first_layer_bias)
print("Shape of the bias of the second layer:", second_layer_bias)

Shape of the bias of the first layer: torch.Size([30])
Shape of the bias of the second layer: torch.Size([20])


The weight matrix of the first layer above is a `30x50` matrix, and we can see that the `requires_grad` is set to True, which means its entries are trainable -- this is the default setting for `weights` and `biases` in `torch.nn.Linear`.

Note that if you execute the code above on your computer, the numbers in the weight matrix will likely differ from those shown above. This is because the model weights are initialized with small random numbers, which are different each time we instantiate the network. In deep learning, initializing model weights with small random numbers is desired to break symmetry during training -- otherwise, the nodes would be just performing the same operations and updates during backpropagation.

However, while we want to keep using small random numbers as initial values for our layer weights, we can make the random number initialization reproducible by `seeding` PyTorch's random number generator via `manual_seed`:

In [34]:
torch.manual_seed(123)
model = NeuralNetwork(50, 3)
print(model.layers[0].weight)

Parameter containing:
tensor([[-0.0577,  0.0047, -0.0702,  ...,  0.0222,  0.1260,  0.0865],
        [ 0.0502,  0.0307,  0.0333,  ...,  0.0951,  0.1134, -0.0297],
        [ 0.1077, -0.1108,  0.0122,  ...,  0.0108, -0.1049, -0.1063],
        ...,
        [-0.0787,  0.1259,  0.0803,  ...,  0.1218,  0.1303, -0.1351],
        [ 0.1359,  0.0175, -0.0673,  ...,  0.0674,  0.0676,  0.1058],
        [ 0.0790,  0.1343, -0.0293,  ...,  0.0344, -0.0971, -0.0509]],
       requires_grad=True)
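
A quick way to convince yourself that seeding works is to create the same layer twice with the same seed and compare the weights. This is a small sketch using a single `Linear` layer; the names `layer_a`, `layer_b`, and `layer_c` are illustrative.

```python
import torch

torch.manual_seed(123)
layer_a = torch.nn.Linear(50, 30)

torch.manual_seed(123)
layer_b = torch.nn.Linear(50, 30)

# Created without re-seeding, so it draws the next random numbers instead
layer_c = torch.nn.Linear(50, 30)

same_ab = torch.equal(layer_a.weight, layer_b.weight)
same_bc = torch.equal(layer_b.weight, layer_c.weight)
print(same_ab, same_bc)
```

The first comparison should be `True`, while the second is expected to be `False` because the random number generator has advanced in the meantime.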


Now that we have spent some time inspecting the `NeuralNetwork` instance, let's briefly see how it's used via the forward pass:

In [35]:
torch.manual_seed(123)
X = torch.rand((1, 50))

In [36]:
X

tensor([[0.2961, 0.5166, 0.2517, 0.6886, 0.0740, 0.8665, 0.1366, 0.1025, 0.1841,
         0.7264, 0.3153, 0.6871, 0.0756, 0.1966, 0.3164, 0.4017, 0.1186, 0.8274,
         0.3821, 0.6605, 0.8536, 0.5932, 0.6367, 0.9826, 0.2745, 0.6584, 0.2775,
         0.8573, 0.8993, 0.0390, 0.9268, 0.7388, 0.7179, 0.7058, 0.9156, 0.4340,
         0.0772, 0.3565, 0.1479, 0.5331, 0.4066, 0.2318, 0.4545, 0.9737, 0.4606,
         0.5159, 0.4220, 0.5786, 0.9455, 0.8057]])

In [37]:
X.shape, X.dim()

(torch.Size([1, 50]), 2)

In [38]:
out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]], grad_fn=<AddmmBackward0>)


In the code above, we generated a single random training example `X` as a toy input (note that our network expects `50-dimensional` feature vectors) and fed it to the model, returning three scores. When we call `model(X)`, it will automatically execute the forward pass of the
model.

The forward pass refers to calculating output tensors from input tensors. This involves passing the input data through all the neural network layers, starting from the input layer, through hidden layers, and finally to the output layer.

These three numbers returned above correspond to a score assigned to each of the three output nodes. Notice that the output tensor also includes a `grad_fn` value.

Here, `grad_fn=<AddmmBackward0>` represents the last-used function to compute a variable in the computational graph. In particular, `grad_fn=<AddmmBackward0>` means that the tensor we are inspecting was created via a matrix multiplication and addition operation.
PyTorch will use this information when it computes gradients during backpropagation. The `<AddmmBackward0>` part of `grad_fn=<AddmmBackward0>` specifies the operation that was performed. In this case, it is an `Addmm` operation. `Addmm` stands for `matrix multiplication (mm)` followed by an `addition (Add)`.

If we just want to use a network without training or backpropagation, for example, if we use it for prediction after training, constructing this computational graph for backpropagation can be wasteful as it performs unnecessary computations and consumes additional memory. So, when we use a model for inference (for instance, making predictions) rather than training, it is a best practice to use the `torch.no_grad()` context manager, as shown below. This tells PyTorch that it doesn't need to keep track of the gradients, which can result in significant savings in memory and computation.

In [39]:
with torch.no_grad():
    out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]])
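
The effect of the context manager can also be inspected programmatically. The following sketch uses a single `Linear` layer as a stand-in for the model and compares the output attributes with and without gradient tracking.

```python
import torch

layer = torch.nn.Linear(50, 3)  # stand-in for the full model
X = torch.rand(1, 50)

out_train = layer(X)
with torch.no_grad():
    out_infer = layer(X)

# With tracking, the output is attached to the computation graph
print(out_train.requires_grad, out_train.grad_fn is not None)  # True True

# Under no_grad, no graph is recorded for the output
print(out_infer.requires_grad, out_infer.grad_fn)  # False None
```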


In PyTorch, it's common practice to code models such that they return the outputs of the last layer (logits) without passing them to a nonlinear activation function. That's because PyTorch's commonly used loss functions combine the softmax (or sigmoid for binary
classification) operation with the negative log-likelihood loss in a single class. The reason for this is numerical efficiency and stability. So, if we want to compute class-membership probabilities for our predictions, we have to call the `softmax` function explicitly:

In [40]:
with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
print(out)
out.sum()

tensor([[0.3113, 0.3934, 0.2952]])


tensor(1.)

The values can now be interpreted as `class-membership probabilities` that sum up to `1`. The values are roughly equal for this random input, which is expected for a randomly initialized model without training.
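
The claim that the loss functions fold the softmax into the loss can be checked with the functional API. The sketch below reuses the logits printed above and assumes, purely for illustration, that the true class is `1`. It compares `F.cross_entropy` on raw logits with the two-step version (log-softmax followed by negative log-likelihood).

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-0.1262, 0.1080, -0.1792]])
target = torch.tensor([1])  # hypothetical true class label

# Fused version: takes raw logits directly
loss_fused = F.cross_entropy(logits, target)

# Two-step version: explicit log-softmax, then negative log-likelihood
loss_two_step = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Probabilities are only needed for interpretation, not for the loss
probas = torch.softmax(logits, dim=1)
print(loss_fused, loss_two_step, probas)
```

Both losses should agree, which is why the model itself does not need to apply softmax before the loss is computed.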

## Setting up efficient data loaders

In PyTorch, a `DataLoader` is a `utility class` within the `torch.utils.data` module that facilitates efficient data loading for training and evaluating machine learning models. It acts as an interface between your dataset and the model, abstracting away complexities like batching, shuffling, and parallel data loading.


**How it works:**

**Wraps a Dataset:** The `DataLoader` takes a `Dataset` object as input. This `Dataset` object can be a built-in PyTorch dataset `(e.g., MNIST, CIFAR-10)` or a custom dataset you define by inheriting from `torch.utils.data.Dataset`.


**Batching:** It groups individual data samples from the dataset into mini-batches, which are then fed to the model for training or inference. This is crucial for efficient training, especially with large datasets, as it allows for `parallel processing` and better utilization of hardware resources.


**Shuffling:** It can shuffle the data samples at the beginning of each epoch (or before training) to ensure that the model does not learn patterns based on the order of the data.


**Parallel Loading (Optional):** It can use multiple worker processes (`num_workers` parameter) to load data in parallel, which can significantly speed up the data loading process, preventing the GPU from waiting for data.


**Iteration:** The `DataLoader` makes the dataset iterable, allowing you to easily loop through the data in batches during your training or evaluation loop.

![Alt text](../assests/dataloader.png)

Following the illustration in figure A.10, in this section, we will implement a custom Dataset
class that we will use to create a training and a test dataset that we'll then use to create
the data loaders.

Let's start by creating a simple toy dataset of five training examples with two features each. Accompanying the training examples, we also create a tensor containing the corresponding class labels: three examples belong to `class 0`, and two examples belong to `class 1`. In addition, we also make a `test set` consisting of two entries. 

The code to create this dataset is shown below.

In [41]:
X_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])

y_train = torch.tensor([0, 0, 0, 1, 1])

X_test = torch.tensor([
    [-0.8, 2.8],
    [2.6, -1.6],
])
y_test = torch.tensor([0, 1])

In [42]:
X_train, X_test

(tensor([[-1.2000,  3.1000],
         [-0.9000,  2.9000],
         [-0.5000,  2.6000],
         [ 2.3000, -1.1000],
         [ 2.7000, -1.5000]]),
 tensor([[-0.8000,  2.8000],
         [ 2.6000, -1.6000]]))

In [43]:
X_train.shape, X_test.shape

(torch.Size([5, 2]), torch.Size([2, 2]))

In [44]:
len(X_train), X_train.shape[0]

(5, 5)

**CLASS LABEL NUMBERING** PyTorch requires that class labels start with label `0`, and the largest class label value should not exceed the number of output nodes minus 1 (since Python index counting
starts at 0). So, if we have class labels `0, 1, 2, 3, and 4`, the neural network output layer should consist of 5 nodes.
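
If a dataset comes with labels that do not follow this convention, they can be remapped first. The following sketch uses made-up raw labels (`3`, `7`, `9`) and `torch.unique` with `return_inverse=True` to obtain consecutive labels starting at 0.

```python
import torch

raw_labels = torch.tensor([3, 7, 3, 9, 7])  # hypothetical raw class labels

# classes holds the sorted unique labels; remapped holds, for each entry,
# the index of its label within classes
classes, remapped = torch.unique(raw_labels, return_inverse=True)
num_outputs = len(classes)

print(classes)      # tensor([3, 7, 9])
print(remapped)     # tensor([0, 1, 0, 2, 1])
print(num_outputs)  # 3
```

The `classes` tensor can be kept around to translate predictions back to the original label values.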

Next, we create a custom dataset class, `ToyDataset`, by subclassing from PyTorch's Dataset parent class, as shown below.

In [45]:
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y
        
    def __getitem__(self, index):
        one_x = self.features[index]
        one_y = self.labels[index]
        return one_x, one_y
    
    def __len__(self):
        return self.labels.shape[0]
    

train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)

The purpose of this custom `ToyDataset` class is to instantiate a PyTorch `DataLoader`.
But before we get to this step, let's briefly go over the general structure of the `ToyDataset`
code.

In PyTorch, the three main components of a custom Dataset class are the `__init__`
constructor, the `__getitem__` method, and the `__len__` method, as shown in the code listing above.

In the `__init__` method, we set up attributes that we can access later in the `__getitem__` and `__len__` methods. This could be file paths, file objects, database connectors, and so on. Since we created a tensor dataset that sits in memory, we are simply assigning `X` and `y` to these attributes, which are placeholders for our tensor objects.

In the `__getitem__` method, we define instructions for returning exactly one item from the dataset via an index. This means the features and the class label corresponding to a single training example or test instance.

Finally, the `__len__` method contains instructions for retrieving the length of the dataset. Here, we use the `.shape` attribute of the label tensor to return the number of examples, which equals the number of rows in the feature array. In the case of the training dataset, we have five rows, which we can double-check as follows:

In [46]:
print(len(train_ds))
print(len(test_ds))

5
2
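
For in-memory tensors like these, PyTorch also ships a ready-made `TensorDataset` class in `torch.utils.data`. The following sketch (with the variable name `tensor_ds` chosen for illustration) should behave like the custom class with respect to indexing and length:

```python
import torch
from torch.utils.data import TensorDataset

X_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5],
])
y_train = torch.tensor([0, 0, 0, 1, 1])

# TensorDataset indexes all given tensors along their first dimension
tensor_ds = TensorDataset(X_train, y_train)

features, label = tensor_ds[3]
print(len(tensor_ds), features, label)
```

A custom `Dataset` subclass remains the more flexible option when examples have to be loaded from files or transformed on the fly.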

Now that we defined a PyTorch Dataset class we can use for our toy dataset, we can use PyTorch's `DataLoader` class to sample from it, as shown in the code listing below:

In [47]:
from torch.utils.data import DataLoader

torch.manual_seed(235)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0
)

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,
    num_workers=0
)

- **dataset**: The `ToyDataset` instance created earlier serves as input to the data loader
- **batch_size**: The number of examples in each batch
- **shuffle=True**: Whether or not to shuffle the data; here, the training data is reshuffled at every epoch
- **shuffle=False**: It is not necessary to shuffle the test data
- **num_workers**: The number of background processes used for data loading

After instantiating the training data loader, we can iterate over it as shown below. (The
iteration over the `test_loader` works similarly but is omitted for brevity.)

In [48]:
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)

Batch 1: tensor([[-0.9000,  2.9000],
        [ 2.3000, -1.1000]]) tensor([0, 1])
Batch 2: tensor([[-0.5000,  2.6000],
        [ 2.7000, -1.5000]]) tensor([0, 1])
Batch 3: tensor([[-1.2000,  3.1000]]) tensor([0])


In [49]:
for idx, enu in enumerate(train_loader):
    print(idx, enu)

0 [tensor([[-1.2000,  3.1000],
        [ 2.7000, -1.5000]]), tensor([0, 1])]
1 [tensor([[-0.5000,  2.6000],
        [-0.9000,  2.9000]]), tensor([0, 0])]
2 [tensor([[ 2.3000, -1.1000]]), tensor([1])]


As we can see based on the output above, the `train_loader` iterates over the training dataset, visiting each training example exactly once. This is known as a `training epoch`. Since we seeded the random number generator using `torch.manual_seed(235)` above, you should get the same shuffling order of training examples as shown in the first loop. However, if you iterate over the dataset a second time, as in the second loop, you will see that the shuffling order changes. This is desired to prevent deep neural networks from getting caught in repetitive update cycles during training.
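
If the shuffling order should not depend on the global seed, a dedicated `torch.Generator` can be passed to the `DataLoader` via its `generator` argument. The sketch below uses a small stand-in dataset (`torch.arange(5)`) and a hypothetical helper, `epoch_order`, to record the order in which examples are visited.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(5))  # stand-in dataset with entries 0..4


def epoch_order(loader):
    """Hypothetical helper: collect the batches of one full epoch."""
    return [batch[0].tolist() for batch in loader]


# Two loaders whose generators start from the same seed
loader_a = DataLoader(ds, batch_size=2, shuffle=True,
                      generator=torch.Generator().manual_seed(123))
loader_b = DataLoader(ds, batch_size=2, shuffle=True,
                      generator=torch.Generator().manual_seed(123))

first_a = epoch_order(loader_a)
first_b = epoch_order(loader_b)
second_a = epoch_order(loader_a)  # the generator has advanced by now

print(first_a, first_b, second_a)
```

The first epochs of both loaders should match, while the second epoch of `loader_a` will usually come out in a different order.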


Note that we specified a batch size of `2` above, but the third batch only contains a single example. That's because we have five training examples, which is not evenly divisible by 2. In practice, having a substantially smaller batch as the last batch in a training epoch can disturb the convergence during training. To prevent this, it's recommended to set `drop_last=True`, which drops the last batch of each epoch if it is incomplete, as shown below:

In [50]:
train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0,
    drop_last=True,
)

Now, iterating over the training loader, we can see that the last batch is omitted:

In [51]:
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)

Batch 1: tensor([[-0.5000,  2.6000],
        [-1.2000,  3.1000]]) tensor([0, 0])
Batch 2: tensor([[ 2.3000, -1.1000],
        [ 2.7000, -1.5000]]) tensor([1, 1])
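To make the effect of `drop_last` on the number of batches explicit, here is a minimal sketch that compares both settings. It assumes a stand-in dataset of five random examples rather than the `train_ds` object used above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset with five examples (assumption: values are arbitrary)
features = torch.randn(5, 2)
labels = torch.tensor([0, 0, 0, 1, 1])
dataset = TensorDataset(features, labels)

keep_last = DataLoader(dataset, batch_size=2, shuffle=True, drop_last=False)
drop_last = DataLoader(dataset, batch_size=2, shuffle=True, drop_last=True)

print(len(keep_last))  # ceil(5 / 2) batches, the last one holds a single example
print(len(drop_last))  # floor(5 / 2) batches, all of them full

# With drop_last=True, every batch has exactly batch_size examples
batch_sizes = [len(y) for _, y in drop_last]
print(batch_sizes)
```

Because the data is reshuffled each epoch, the example that gets dropped generally differs from epoch to epoch, so no training example is excluded permanently.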


Lastly, let's discuss the setting `num_workers=0` in the `DataLoader`. This parameter of PyTorch's `DataLoader` class is crucial for `parallelizing` data loading and preprocessing. When `num_workers` is set to `0`, the data loading is done in the main process and not in separate worker processes. This might seem unproblematic, but it can lead to significant slowdowns during model training when we train larger networks on a GPU. This is because instead of focusing solely on the processing of the deep learning model, the CPU must also take time to load and preprocess the data. As a result, the GPU can sit idle while waiting for the CPU to finish these tasks. In contrast, when `num_workers` is set to a number greater than zero, multiple worker processes are launched to load data in parallel, freeing the main process to focus on training your model and better utilizing your system's resources, as illustrated in the figure below:

![Alt text](../assests/num_workers.png)

However, if we are working with very small datasets, setting `num_workers` to `1` or larger may not be necessary since the total training time takes only fractions of a second anyway. Likewise, in interactive environments such as Jupyter notebooks, increasing `num_workers` may not provide any noticeable speedup. It might, in fact, lead to some issues. One potential issue is the overhead of spinning up multiple worker processes, which could take longer than the actual data loading when your dataset is small.


Furthermore, for Jupyter notebooks, setting `num_workers` to greater than `0` can sometimes lead to issues related to the sharing of resources between different processes, resulting in errors or notebook crashes. Therefore, it's essential to understand the `trade-off` and make a calculated decision on setting the `num_workers` parameter. When used correctly, it can be a beneficial tool but should be adapted to your specific dataset size and computational environment for optimal results.

In practice, setting `num_workers=4` usually leads to good performance on many
real-world datasets, but the optimal setting depends on your hardware and the code used for
loading a training example defined in the `Dataset` class.

## A typical training loop

Let's now combine everything we have discussed so far and train a neural network on the `toy dataset` from the previous section.

The training code is below:

In [52]:
import torch.nn.functional as F

learning_rate = 0.5
torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

num_epochs = 3

for epoch in range(num_epochs):
    model.train()
    for batch_index, (features, labels) in enumerate(train_loader):
        logits = model(features)
        loss = F.cross_entropy(logits, labels)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_index:03d}/{len(train_loader):03d}"
              f" | Train Loss: {loss:.2f}"         
        )

    model.eval()

Epoch: 001/003 | Batch 000/002 | Train Loss: 0.75
Epoch: 001/003 | Batch 001/002 | Train Loss: 0.65
Epoch: 002/003 | Batch 000/002 | Train Loss: 0.44
Epoch: 002/003 | Batch 001/002 | Train Loss: 0.13
Epoch: 003/003 | Batch 000/002 | Train Loss: 0.03
Epoch: 003/003 | Batch 001/002 | Train Loss: 0.00


- The dataset from the previous section has `2 features` and `2 classes`
- The optimizer needs to know which parameters to optimize
- Set the gradients from the previous round to zero to prevent unintended gradient accumulation
- Compute the gradients of the loss with respect to the model parameters
- The optimizer uses the gradients to update the model parameters

As we can see, the loss reaches zero after `3 epochs`, a sign that the model converged on the training set. However, before we evaluate the model's predictions, let's go over some of the details of the preceding code listing.

First, note that we initialized a model with `two inputs` and `two outputs`. That's because the `toy dataset` from the previous section has `two input` features and `two class` labels to predict. We used a `stochastic gradient descent (SGD)` optimizer with a `learning rate (lr)` of `0.5`. The learning rate is a `hyperparameter`, meaning it's a tunable setting that we have to experiment with based on observing the loss. Ideally, we want to choose a learning rate such that the loss converges after a certain number of epochs -- the number of epochs is another hyperparameter to choose.

In practice, we often use a `third` dataset, a so-called `validation dataset`, to find the optimal
`hyperparameter` settings. A `validation dataset` is similar to a `test set`. However, while we only want to use a test set precisely once to avoid biasing the evaluation, we usually use the `validation set` multiple times to tweak the model settings.

We also introduced new settings called `model.train()` and `model.eval()`. As these names imply, these settings are used to put the model into a training and an evaluation mode. This is necessary for components that behave differently during training and inference, such as `dropout` or `batch normalization` layers. Since we don't have `dropout` or other components in our `NeuralNetwork` class that are affected by these settings, using `model.train()` and `model.eval()` is redundant in our code above. However, it's best practice to include them anyway to avoid unexpected behaviors when we change the `model architecture` or reuse the code to train a different model.
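To illustrate why the mode matters for layers such as `dropout`, the following sketch applies a standalone `torch.nn.Dropout` layer in both modes. The layer and the input of ones are assumptions chosen for illustration; they are not part of the `NeuralNetwork` class above.

```python
import torch

torch.manual_seed(123)
dropout = torch.nn.Dropout(p=0.5)
x = torch.ones(8)

dropout.train()            # training mode: randomly zero out elements
train_out = dropout(x)     # surviving elements are scaled by 1 / (1 - p) = 2

dropout.eval()             # evaluation mode: dropout is a no-op
eval_out = dropout(x)

print(train_out)
print(eval_out)
print(dropout.training)    # flag toggled by .train() / .eval()
```

In training mode the output contains only zeros and twos, while in evaluation mode the input passes through unchanged. Calling `model.train()` or `model.eval()` on a full model sets this flag on all of its submodules.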


As discussed earlier, we pass the `logits` directly into the `cross_entropy` loss function, which will apply the `softmax function` internally for efficiency and numerical stability reasons. Then, calling `loss.backward()` will calculate the gradients in the computation graph that PyTorch constructed in the background. The `optimizer.step()` method will use the gradients to update the model parameters to minimize the loss. In the case of the `SGD` optimizer, this means multiplying the gradients with the learning rate and adding the scaled negative gradient to the parameters.
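The SGD update described above can be reproduced by hand. The sketch below assumes a single parameter tensor `w` and an arbitrary quadratic loss as stand-ins for a real model, and compares a manual update with the result of `optimizer.step()`.

```python
import torch

torch.manual_seed(123)
w = torch.randn(3, requires_grad=True)   # stand-in for the model parameters
lr = 0.1
optimizer = torch.optim.SGD([w], lr=lr)

loss = ((w - 1.0) ** 2).sum()            # arbitrary differentiable loss
optimizer.zero_grad()
loss.backward()

# Plain SGD: new_param = param - lr * grad, computed before the optimizer runs
expected = w.detach() - lr * w.grad

optimizer.step()                         # updates w in place
print(torch.allclose(w.detach(), expected))
```

This equivalence holds for plain SGD as configured here; options such as momentum or weight decay would change the update rule.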

![Alt text](../assests/softmax.png)

**PREVENTING UNDESIRED GRADIENT ACCUMULATION** It is important to include an `optimizer.zero_grad()` call in each update round to reset the gradients to zero. Otherwise, the gradients will accumulate, which may be undesired.
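A minimal sketch of what accumulation looks like, assuming a single stand-in parameter tensor `w` instead of a full model: calling `backward()` twice without a reset adds the second gradient to the first.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first_grad = w.grad.clone()          # gradient of sum(w^2) is 2 * w

(w ** 2).sum().backward()            # no reset in between
accumulated_grad = w.grad.clone()    # second gradient was added to the first

w.grad.zero_()                       # reset, similar in effect to optimizer.zero_grad()
(w ** 2).sum().backward()
fresh_grad = w.grad.clone()

print(first_grad)        # tensor([2., 4.])
print(accumulated_grad)  # tensor([4., 8.])
print(fresh_grad)        # tensor([2., 4.])
```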

In [53]:
# After we trained the model, we can use it to make predictions, as shown below:

model.eval()
with torch.no_grad():
    outputs = model(X_train)
print(outputs)

tensor([[ 2.8569, -4.1618],
        [ 2.5382, -3.7548],
        [ 2.0944, -3.1820],
        [-1.4814,  1.4816],
        [-1.7176,  1.7342]])


To obtain the class membership probabilities, we can then use PyTorch's softmax function, as follows:

In [54]:
torch.set_printoptions(sci_mode=False)
probas = torch.softmax(outputs, dim=1)
print(probas)

tensor([[    0.9991,     0.0009],
        [    0.9982,     0.0018],
        [    0.9949,     0.0051],
        [    0.0491,     0.9509],
        [    0.0307,     0.9693]])


Let's consider the first row in the code output above. Here, the first value (column) means that the training example has a `99.91%` probability of belonging to `class 0` and a `0.09%` probability of belonging to `class 1`. (The `set_printoptions` call is used here to make the outputs more legible.)


We can convert these values into class label predictions using PyTorch's `argmax` function, which returns the `index position` of the highest value in each row if we set `dim=1` (setting `dim=0` would return the index position of the highest value in each column instead):

In [55]:
predictions = torch.argmax(probas, dim=1)
print(predictions)

tensor([0, 0, 0, 1, 1])


In [56]:
# preds = torch.argmax(probas, dim=0)
# print(preds)
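The difference between the two `dim` settings can be seen in this small sketch, which hard-codes the probability values printed above so that it runs on its own:

```python
import torch

# Probabilities copied from the softmax output above
probas = torch.tensor([
    [0.9991, 0.0009],
    [0.9982, 0.0018],
    [0.9949, 0.0051],
    [0.0491, 0.9509],
    [0.0307, 0.9693],
])

per_row = torch.argmax(probas, dim=1)      # one class index per training example
per_column = torch.argmax(probas, dim=0)   # one row index per class column

print(per_row)     # tensor([0, 0, 0, 1, 1])
print(per_column)  # tensor([0, 4])
```

With `dim=0`, the result says which training example has the highest probability for each class, which is not a class prediction.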

Note that it is unnecessary to compute softmax probabilities to obtain the class labels. We could also apply the `argmax` function to the logits (outputs) directly:

In [57]:
predictions = torch.argmax(outputs, dim=1)
print(predictions)

tensor([0, 0, 0, 1, 1])


Above, we computed the predicted labels for the training dataset. Since the training dataset is relatively small, we could compare it to the true training labels by eye and see that the model is 100% correct. We can double-check this using the == comparison operator:

In [58]:
predictions == y_train

tensor([True, True, True, True, True])

Using `torch.sum`, we can count the number of correct predictions as follows:

In [59]:
torch.sum(predictions == y_train)

tensor(5)

Since the dataset consists of 5 training examples, we have 5 out of 5 predictions that are correct, which equals `5/5 × 100% = 100%` prediction accuracy.
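As a compact alternative for a dataset that fits in memory, the same accuracy can be obtained by averaging the boolean comparison. This sketch hard-codes the predictions and labels shown above; the `wrong_predictions` tensor is a made-up example with one mistake.

```python
import torch

predictions = torch.tensor([0, 0, 0, 1, 1])
y_train = torch.tensor([0, 0, 0, 1, 1])

# True/False values become 1.0/0.0, so their mean is the fraction correct
accuracy = (predictions == y_train).float().mean().item()
print(accuracy)

# Hypothetical predictions with one mistake: 4 of 5 correct
wrong_predictions = torch.tensor([0, 1, 0, 1, 1])
accuracy_with_error = (wrong_predictions == y_train).float().mean().item()
print(round(accuracy_with_error, 2))
```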

However, to generalize the computation of the prediction accuracy, let's implement a `compute_accuracy` function as shown in the following code listing.

In [60]:
def compute_accuracy(model, dataloader):
    model = model.eval()
    correct = 0.0
    total_examples = 0
    
    for idx, (features, labels) in enumerate(dataloader):
        with torch.no_grad():
            logits = model(features)
        
        predictions = torch.argmax(logits, dim=1)
        compare = labels == predictions
        correct += torch.sum(compare)
        total_examples += len(compare)
        
    return (correct / total_examples).item()

- **compare**: This returns a tensor of True/False values depending on whether the labels match
- **correct**: The sum operation counts the number of True values
- **return value**: This is the fraction of correct predictions, a value between 0 and 1. The `.item()` call returns the value of the tensor as a Python float.

Note that the preceding code listing iterates over a data loader to compute the number and fraction of the correct predictions. This is because when we work with large datasets, we typically can only call the model on a small part of the dataset due to memory limitations. The `compute_accuracy` function above is a general method that scales to datasets of arbitrary size since, in each iteration, the dataset chunk that the model receives is the same size as the batch size seen during training.

Notice that the internals of the `compute_accuracy` function are similar to what we used before when we converted the logits to the class labels.

We can then apply the function to the training set as follows:

In [61]:
print(compute_accuracy(model, train_loader))

1.0


Similarly, we can apply the function to the test set as follows:

In [62]:
print(compute_accuracy(model, test_loader))

1.0


In this section, we learned how we can train a neural network using PyTorch. Next, let's see
how we can save and restore models after training.

## Saving and loading models

In the previous section, we successfully trained a model. Let's now see how we can save a `trained model` to reuse it later.

Here's the recommended way to `save` and `load` models in PyTorch:

In [63]:
torch.save(model.state_dict(), "model.pth")

The model's `state_dict` is a Python dictionary object that maps each layer in the model to its trainable parameters `(weights and biases)`. Note that `"model.pth"` is an arbitrary filename for the model file saved to disk. We can give it any name and file ending we like; however, `.pth` and `.pt` are the most common conventions.

Once we have saved the model, we can restore it from disk as follows:

In [64]:
model = NeuralNetwork(2, 2)
model.load_state_dict(torch.load("model.pth"))

<All keys matched successfully>

The `torch.load("model.pth")` function reads the file `"model.pth"` and reconstructs the Python dictionary object containing the model's parameters while `model.load_state_dict()` applies these parameters to the model, effectively restoring its learned state from when we saved it.

Note that the line `model = NeuralNetwork(2, 2)` above is not strictly necessary if you execute this code in the same session where you saved a model. However, I included it here to illustrate that we need an instance of the model in memory to apply the saved parameters. Here, the `NeuralNetwork(2, 2)` architecture needs to match the original saved model exactly.
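The round trip can be verified end to end with a short sketch. It assumes a hypothetical `make_model` helper that builds a small `torch.nn.Sequential` network as a stand-in for the `NeuralNetwork` class, and it writes to a temporary directory instead of the working directory.

```python
import os
import tempfile

import torch


def make_model():
    """Hypothetical stand-in for the NeuralNetwork class defined earlier."""
    return torch.nn.Sequential(
        torch.nn.Linear(2, 4),
        torch.nn.ReLU(),
        torch.nn.Linear(4, 2),
    )


torch.manual_seed(123)
trained = make_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pth")
    torch.save(trained.state_dict(), path)

    restored = make_model()   # fresh instance with different random weights
    restored.load_state_dict(torch.load(path, map_location="cpu"))

# The state_dict keys name each layer's weights and biases
print(list(trained.state_dict().keys()))

trained.eval()
restored.eval()
x = torch.tensor([[-1.2, 3.1]])
with torch.no_grad():
    same_output = torch.allclose(trained(x), restored(x))
print(same_output)
```

If the architecture returned by `make_model` differed between saving and loading, `load_state_dict` would raise an error about missing or unexpected keys or mismatched shapes.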

Now, we are well equipped to use PyTorch to implement large language models.

## Optimizing training performance with GPUs

In this section, we'll see how we can utilize `GPUs`, which will accelerate deep neural network training compared to regular `CPUs`. First, we will introduce the main concepts behind GPU computing in PyTorch. Then, we will train a model on a single `GPU`. Finally, we'll look at distributed training using multiple `GPUs`.


### PyTorch computations on GPU devices

Before we make the modifications, it's crucial to understand the main concept behind
GPU computations within PyTorch. First, we need to introduce the notion of devices. In
PyTorch, a device is where computations occur and data resides. The CPU and the GPU are
examples of devices. A PyTorch tensor resides on a device, and its operations are executed
on the same device.

Assuming that you installed a GPU-compatible version of PyTorch, you can double-check that your
runtime indeed supports GPU computing via the following code:
```python
print(torch.cuda.is_available())
```

**Since I am on an Apple Mac with an Apple Silicon chip instead of an Nvidia GPU, I'll be using MPS (Metal Performance Shaders).**

In [65]:
# check for MPS (Metal Performance Shaders) availability on macOS 
print(torch.backends.mps.is_available())

True


In [66]:
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(device)

mps


Now, suppose we have two tensors that we can add as follows -- this computation will be
carried out on the CPU by default:

In [None]:
tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])
print(tensor_1 + tensor_2)

We can now use the `.to()` method to transfer these tensors onto the MPS device and perform the
addition there:

In [None]:
tensor_1 = tensor_1.to("mps")
tensor_2 = tensor_2.to("mps")
print(tensor_1 + tensor_2)

Notice that the resulting tensor now includes the device information, `device='mps:0'`, which means that the tensors reside on the first MPS device.

If your machine hosts multiple `GPUs`, you have the option to specify which GPU you'd like to transfer the tensors to. You can do this by indicating the device ID in the transfer command. For instance, you can use `.to("cuda:0")`, `.to("cuda:1")`, and so on.

However, it is important to note that all tensors involved in an operation must be on the same device. Otherwise, the computation will fail.
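The following sketch shows one way to write device-agnostic code that runs on CUDA, MPS, or the CPU, depending on what is available. The `pick_device` helper is hypothetical (it is not a PyTorch function), and the mixed-device check only runs when an accelerator is present.

```python
import torch


def pick_device():
    """Hypothetical helper: prefer CUDA, then MPS, then fall back to the CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)   # absent in older PyTorch versions
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
tensor_1 = torch.tensor([1., 2., 3.]).to(device)
tensor_2 = torch.tensor([4., 5., 6.]).to(device)

result = tensor_1 + tensor_2          # both operands live on the same device
print(result, result.device)

# Mixing devices only makes sense to demonstrate when an accelerator exists
if device.type != "cpu":
    try:
        tensor_1.cpu() + tensor_2     # one operand on the CPU, one on the accelerator
    except RuntimeError as err:
        print("Mixed-device addition failed:", err)
```

Selecting the device once and passing it to every `.to()` call keeps the rest of the code unchanged across machines.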

### Single-GPU training

Now that we are familiar with transferring tensors to the GPU, we can modify the training loop from the section "A typical training loop" to run on a GPU. This requires changing only three lines of code, as shown in the code below:

In [None]:
import torch

torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)

device = torch.device("mps")
model = model.to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
num_epochs = 3

for epoch in range(num_epochs):
    model.train()
    for batch_index, (features, labels) in enumerate(train_loader):
        features, labels = features.to(device), labels.to(device)
        logits = model(features)
        loss = F.cross_entropy(logits, labels)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        
        # LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
            f" | Batch {batch_index:03d}/{len(train_loader):03d}"
            f" | Train Loss: {loss:.2f}"         
        )
    
    model.eval()