# Introduction to PyTorch

## The three core components of PyTorch

First, PyTorch is a tensor library that extends the concept of the array-oriented programming
library NumPy with accelerated computation on GPUs, providing a seamless switch between
CPUs and GPUs.

Secondly, PyTorch is an automatic differentiation engine, also known as autograd, which
enables the automatic computation of gradients for tensor operations, simplifying
backpropagation and model optimization.

Finally, PyTorch is a deep learning library, meaning that it offers modular, flexible, and
efficient building blocks (including pre-trained models, loss functions, and optimizers) for
designing and training a wide range of deep learning models, catering to both researchers
and developers.

AI is fundamentally about creating computer systems capable of performing tasks that
usually require human intelligence. These tasks include understanding natural language,
recognizing patterns, and making decisions. (Despite significant progress, AI is still far from
achieving this level of general intelligence.)

Machine learning represents a subfield of AI that focuses on developing and improving learning algorithms. The key idea behind machine learning is to enable computers to learn from data and make predictions or decisions without being explicitly programmed to perform the task. This involves developing algorithms that can identify patterns and learn from historical data and improve their performance over time with more data and feedback.

Machine learning is also behind technologies like recommendation systems used by online retailers and streaming services, email spam filtering, voice recognition in virtual assistants, and even self-driving cars. The introduction and advancement of machine learning have significantly enhanced AI's capabilities, enabling it to move beyond strict rule-based systems and adapt to new inputs or changing environments.

Deep learning is a subcategory of machine learning that focuses on the training and
application of deep neural networks. These deep neural networks were originally inspired by
how the human brain works, particularly the interconnection between many neurons. The
"deep" in deep learning refers to the multiple hidden layers of artificial neurons or nodes
that allow them to model complex, nonlinear relationships in the data.

Unlike traditional machine learning techniques that excel at simple pattern recognition,
deep learning is particularly good at handling unstructured data like images, audio, or text,
which makes it well suited for LLMs.

Using a learning algorithm, a model is trained on a training dataset consisting of examples
and corresponding labels. In the case of an email spam classifier, for example, the training
dataset consists of emails and their spam and not-spam labels that a human identified.
Then, the trained model can be used on new observations (new emails) to predict their
unknown label (spam or not spam).

Of course, we also want to add a model evaluation between the training and inference
stages to ensure that the model satisfies our performance criteria before using it in a
real-world application.


In [8]:
import torch
import numpy as np

In [None]:
torch.__version__

In [6]:
print(torch.backends.mps.is_available())

True


## Understanding tensors

Tensors represent a mathematical concept that generalizes vectors and matrices to potentially higher dimensions. In other words, tensors are mathematical objects that can be characterized by their order (or rank), which provides the number of dimensions. For example, a scalar (just a number) is a tensor of rank 0, a vector is a tensor of rank 1, and a matrix is a tensor of rank 2.

From a computational perspective, tensors serve as data containers. For instance, they hold multi-dimensional data, where each dimension represents a different feature. Tensor libraries, such as PyTorch, can create, manipulate, and compute with these multi-dimensional arrays efficiently. In this context, a tensor library functions as an array library.


PyTorch tensors are similar to NumPy arrays but have several additional features important for deep learning. For example, PyTorch adds an automatic differentiation engine, simplifying computing gradients. PyTorch tensors also support GPU computations to speed up deep neural network training.
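As a minimal sketch of the CPU/GPU switch mentioned above, the following snippet picks an accelerator if one is available and otherwise stays on the CPU. The variable names are illustrative, and the fallback order (CUDA, then MPS, then CPU) is an assumption rather than something PyTorch prescribes.

```python
import torch

# Pick an accelerator if one is available; otherwise stay on the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Move both operands to the chosen device before computing with them.
a = torch.tensor([1.0, 2.0, 3.0]).to(device)
b = torch.tensor([4.0, 5.0, 6.0]).to(device)

total = a + b  # runs on whichever device holds a and b
print(total.device, total.cpu().tolist())
```

The same code runs unchanged on a machine without a GPU, which is the practical meaning of a "seamless switch".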

## Scalars, vectors, matrices, and tensors

As mentioned earlier, PyTorch tensors are data containers for array-like structures. A scalar
is a `0-dimensional` tensor (for instance, just a number), a vector is a `1-dimensional` tensor,
and a matrix is a `2-dimensional` tensor. There is no specific term for higher-dimensional
tensors, so we typically refer to a `3-dimensional` tensor as just a 3D tensor, and so forth.

We can create objects of PyTorch's `Tensor` class using the `torch.tensor` function as
follows:

In [9]:
np.__version__

'2.3.2'

In [14]:
import torch

tensor0d = torch.tensor(1)
print(tensor0d, tensor0d.dim())

tensor(1) 0


In [13]:
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d, tensor1d.dim())

tensor([1, 2, 3]) 1


In [16]:
tensor2d = torch.tensor([[3, 4], [6, 7]])
print(tensor2d, tensor2d.dim())

tensor([[3, 4],
        [6, 7]]) 2


In [17]:
tensor3d = torch.tensor([[[2, 6], [3, 7], [4, 8]]])
print(tensor3d, tensor3d.dim())

tensor([[[2, 6],
         [3, 7],
         [4, 8]]]) 3


## Tensor data types

PyTorch adopts the default 64-bit integer data type from Python. We can access the data type of a
tensor via the `.dtype` attribute of a tensor:

In [18]:
print(tensor1d.dtype)

torch.int64


If we create tensors from Python floats, PyTorch creates tensors with a 32-bit precision by
default, as we can see below:

In [19]:
floatvec = torch.tensor([1.0, 2.3, 4.6])
print(floatvec.dtype)

torch.float32


This choice is primarily due to the balance between precision and computational efficiency.
A 32-bit floating point number offers sufficient precision for most deep learning tasks, while
consuming less memory and computational resources than a 64-bit floating point number.
Moreover, GPU architectures are optimized for 32-bit computations, and using this data
type can significantly speed up model training and inference.
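To make the memory argument concrete, the sketch below compares the storage needed for the same three values in 32-bit and 64-bit precision. The names `vec32` and `vec64` are illustrative.

```python
import torch

values = [1.0, 2.3, 4.6]
vec32 = torch.tensor(values)                       # float32 by default
vec64 = torch.tensor(values, dtype=torch.float64)  # explicitly request 64-bit

# element_size() is the number of bytes per element; numel() is the element count.
bytes32 = vec32.element_size() * vec32.numel()
bytes64 = vec64.element_size() * vec64.numel()

print(vec32.dtype, bytes32, "bytes")
print(vec64.dtype, bytes64, "bytes")
```

The 64-bit tensor needs twice the memory for the same data, a difference that adds up quickly for models with millions of parameters.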

We can also change the precision using a tensor's `.to` method. The following code demonstrates this by changing a 64-bit integer tensor into a 32-bit float tensor:

In [21]:
vectofloat = tensor1d.to(torch.float32)
print(vectofloat, vectofloat.dtype)

tensor([1., 2., 3.]) torch.float32
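Two details of dtype handling are easy to overlook, and the short sketch below illustrates both: `.to` returns a new tensor rather than modifying the original, and mixing integer and floating-point tensors in one operation promotes the result to a float type. The name `mixed` is illustrative.

```python
import torch

tensor1d = torch.tensor([1, 2, 3])
vectofloat = tensor1d.to(torch.float32)

# The original tensor keeps its dtype; .to produced a converted copy.
print(tensor1d.dtype, vectofloat.dtype)

# Adding an int64 tensor and a float32 tensor yields a float32 result.
mixed = tensor1d + torch.tensor([0.5, 0.5, 0.5])
print(mixed, mixed.dtype)
```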


## Common PyTorch tensor operations

This section covers some of the most essential PyTorch tensor operations.

In [23]:
# We already introduced the torch.tensor() function to create new tensors.

new_tensor = torch.tensor([[3, 6], [4, 8], [5, 10]])
print(new_tensor, new_tensor.dtype)

tensor([[ 3,  6],
        [ 4,  8],
        [ 5, 10]]) torch.int64


In [24]:
# In addition, the .shape attribute allows us to access the shape of a tensor:

new_tensor.shape

torch.Size([3, 2])

In [25]:
# As you can see above, .shape returns [3, 2], which means that the tensor has 3 rows and 2 columns. 
# To reshape the tensor into a 2 by 3 tensor, we can use the .reshape method:

print(new_tensor.reshape(2, 3))

tensor([[ 3,  6,  4],
        [ 8,  5, 10]])


In [26]:
# However, note that the more common command for reshaping tensors in PyTorch is .view():

print(new_tensor.view(2, 3))

tensor([[ 3,  6,  4],
        [ 8,  5, 10]])


Similar to `.reshape` and `.view`, there are several cases where PyTorch offers multiple syntax options for executing the same computation.
This is because PyTorch initially followed the original `Lua Torch` syntax convention but then also added syntax to make it more similar to NumPy upon popular request.

Next, we can use `.T` to transpose a tensor, which means flipping it across its diagonal. Note that this is different from reshaping a tensor, as you can see based on the result below:

In [27]:
print(new_tensor.T)

tensor([[ 3,  4,  5],
        [ 6,  8, 10]])
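The sketch below compares a reshape with a transpose side by side, and also shows one situation where `.view` and `.reshape` behave differently: `.view` needs a memory layout compatible with the requested shape, which a transposed tensor does not have here, while `.reshape` falls back to copying the data. The variable names are illustrative.

```python
import torch

new_tensor = torch.tensor([[3, 6], [4, 8], [5, 10]])

reshaped = new_tensor.reshape(2, 3)  # keeps the row-by-row element order
transposed = new_tensor.T            # swaps rows and columns

# Same shape, different contents.
print(torch.equal(reshaped, transposed))

# The transpose only changes how memory is indexed, so it is not contiguous.
print(transposed.is_contiguous())

try:
    transposed.view(6)
    view_worked = True
except RuntimeError:
    view_worked = False

# .reshape succeeds because it copies the data when a view is impossible.
flat = transposed.reshape(6)
print(view_worked, flat)
```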


Lastly, the common way to multiply two matrices in PyTorch is the `.matmul` method:

In [28]:
print(tensor2d.matmul(tensor2d))

tensor([[33, 40],
        [60, 73]])


However, we can also adopt the `@` operator, which accomplishes the same thing more compactly:

In [29]:
print(tensor2d@tensor2d)

tensor([[33, 40],
        [60, 73]])
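Matrix multiplication is easy to confuse with element-wise multiplication, which uses the `*` operator. The following sketch contrasts the two on the same tensor; the variable names are illustrative.

```python
import torch

tensor2d = torch.tensor([[3, 4], [6, 7]])

matrix_product = tensor2d @ tensor2d  # rows times columns
elementwise = tensor2d * tensor2d     # each entry squared

# @ and torch.matmul compute the same result.
same_as_matmul = torch.equal(matrix_product, torch.matmul(tensor2d, tensor2d))

print(matrix_product)
print(elementwise)
print(same_as_matmul)
```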


## Seeing models as computation graphs

PyTorch's autograd system provides functions to compute gradients in dynamic computational graphs automatically. 

A computational graph (or computation graph for short) is a directed graph that allows us to express and visualize mathematical expressions. In the context of deep learning, a computation graph lays out the sequence of calculations needed to compute the output of a neural network.

Let's look at a concrete example to illustrate the concept of a computation graph. The following code implements the forward pass (prediction step) of a simple logistic regression classifier, which can be seen as a single-layer neural network, returning a score between `0` and `1` that is compared to the true class label `(0 or 1)` when computing the loss:

In [35]:
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2])
b = torch.tensor([0.0])
z = x1 * w1 + b

y_pred = torch.sigmoid(z)
print(y_pred)

tensor([0.9183])


In [36]:
# Get the loss
loss = F.binary_cross_entropy(y_pred, y)

In [37]:
loss

tensor(0.0852)

![Alt text](../assests/computational-graph.png)

The image above illustrates a logistic regression forward pass as a computation graph. The input feature `x1` is multiplied by a model weight `w1` and passed through an activation function `σ` after adding the bias. The loss is computed by comparing the model output `a` (`y_pred` in the code above) with a given label `y`.

In fact, PyTorch builds such a computation graph in the background, and we can use this to
calculate gradients of a loss function with respect to the model parameters (here w1 and b)
to train the model.
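To see that each node in the graph is an ordinary arithmetic step, the forward pass can be retraced in plain Python without PyTorch. This is only a sketch for checking the numbers by hand; it uses the same input values as the code above, and `a` denotes the sigmoid output.

```python
import math

x1, w1, b, y = 1.1, 2.2, 0.0, 1.0

z = x1 * w1 + b                 # net input
a = 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

# Binary cross-entropy for a single example
loss = -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

print(round(a, 4), round(loss, 4))
```

The rounded values should agree with the `y_pred` and `loss` tensors printed earlier.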

## Automatic differentiation made easy

If we carry out computations in PyTorch, it will build such a graph internally by default if one of its terminal nodes has the `requires_grad` attribute set to True. This is useful if we want to compute `gradients`. Gradients are required when training neural networks via the popular backpropagation algorithm, which can be thought of as an implementation of the `chain rule` from calculus for neural networks.

![Alt text](../assests/auto-diff.png)

### Partial derivatives and gradients

`Figure A.8` shows partial derivatives, which measure the rate at which a function changes with respect to one of its variables. A gradient is a vector containing all of the partial derivatives of a multivariate function, a function with more than one variable as input.

If you are not familiar with or don't remember partial derivatives, gradients, or the chain rule from calculus, don't worry. On a high level, all you need to know for this book is that the chain rule is a way to compute gradients of a loss function with respect to the model's parameters in a computation graph. This provides the information needed to update each parameter in a way that minimizes the loss function, which serves as a proxy for measuring the model's performance, using a method such as gradient descent.

How does this relate to the automatic differentiation (autograd) engine? By tracking every operation performed on tensors, PyTorch's autograd engine constructs a computational graph in the background. Then, by calling the `grad` function, we can compute the gradient of the loss with respect to the model parameters `w1` and `b` as follows:

In [38]:
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
z = x1 * w1 + b

y_pred = torch.sigmoid(z)
loss = F.binary_cross_entropy(y_pred, y)

grad_l_w1 = grad(loss, w1, retain_graph=True)
grad_l_b = grad(loss, b, retain_graph=True)

By default, PyTorch destroys the computation graph after calculating the gradients to free memory. However, since we are going to reuse this computation graph shortly, we set retain_graph=True so that it stays in memory.

In [41]:
print(grad_l_b, grad_l_w1)

(tensor([-0.0817]),) (tensor([-0.0898]),)
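The same gradients can be derived by hand with the chain rule, multiplying the local derivatives along the path from the loss back to each parameter. The plain-Python sketch below does this for the example above; the intermediate names (`dloss_da`, `da_dz`, and so on) are illustrative.

```python
import math

x1, w1, b, y = 1.1, 2.2, 0.0, 1.0

# Forward pass
z = x1 * w1 + b
a = 1.0 / (1.0 + math.exp(-z))

# Local derivatives at each node of the graph
dloss_da = -(y / a) + (1.0 - y) / (1.0 - a)  # binary cross-entropy w.r.t. a
da_dz = a * (1.0 - a)                        # sigmoid w.r.t. z
dz_dw1 = x1                                  # z = x1 * w1 + b w.r.t. w1
dz_db = 1.0                                  # z = x1 * w1 + b w.r.t. b

# Chain rule: multiply the local derivatives along each path
dloss_dw1 = dloss_da * da_dz * dz_dw1
dloss_db = dloss_da * da_dz * dz_db

print(round(dloss_dw1, 4), round(dloss_db, 4))
```

Autograd performs this bookkeeping automatically for every operation recorded in the graph.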


Above, we have been using the `grad` function manually, which can be useful for experimentation, debugging, and demonstrating concepts. But in practice, PyTorch provides even more high-level tools to automate this process. For instance, we can call `.backward` on the loss, and PyTorch will compute the gradients with respect to all the leaf nodes in the graph, which will be stored via the tensors' `.grad` attributes:

In [42]:
loss.backward()
print(w1.grad)
print(b.grad)

tensor([-0.0898])
tensor([-0.0817])
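To show what these gradients are used for, here is a minimal sketch of a single gradient descent step on the same example. The learning rate `lr` and the helper `forward_loss` are assumptions made for illustration; in practice, an optimizer from `torch.optim` usually handles the update.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

def forward_loss():
    # Hypothetical helper: recomputes the loss with the current parameters.
    z = x1 * w1 + b
    return F.binary_cross_entropy(torch.sigmoid(z), y)

loss_before = forward_loss()
loss_before.backward()  # fills w1.grad and b.grad

lr = 0.1  # hypothetical learning rate
with torch.no_grad():   # the update itself should not be tracked by autograd
    w1 -= lr * w1.grad
    b -= lr * b.grad

# Gradients accumulate across backward calls, so reset them after each step.
w1.grad.zero_()
b.grad.zero_()

loss_after = forward_loss()
print(loss_before.item(), loss_after.item())
```

Because both gradients are negative, the step increases `w1` and `b`, which pushes the prediction closer to the label and lowers the loss.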


## Implementing multilayer neural networks

In the previous sections, we covered PyTorch's tensor and autograd components. This section focuses on PyTorch as a library for implementing deep neural networks. To provide a concrete example, we focus on a multilayer perceptron, which is a fully connected neural network.

![Alt text](../assests/mlp.png)

When implementing a neural network in PyTorch, we typically subclass the `torch.nn.Module` class to define our own custom network architecture. This Module base class provides a lot of functionality, making it easier to build and train models. For instance, it allows us to encapsulate layers and operations and keep track of the model's parameters.

Within this subclass, we define the network layers in the `__init__` constructor and specify how they interact in the `forward` method. The `forward` method describes how the input data passes through the network and comes together as a computation graph.

In contrast, the `backward` method, which we typically do not need to implement ourselves, is used during training to compute `gradients` of the `loss function` with respect to the `model parameters`. 

The following code implements a classic `multilayer perceptron` with two hidden layers to illustrate a typical usage of the Module class:

In [43]:
class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        
        self.layers = torch.nn.Sequential(
            
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),
            torch.nn.ReLU(),
            
            # 2nd hidden layer
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),
            
            # Output layer
            torch.nn.Linear(20, num_outputs),
            
        )
        
    def forward(self, x):
        logits = self.layers(x)
        return logits

- It's useful to code the number of inputs and outputs as variables to reuse the same code for datasets with different numbers of features and classes.
  
- The Linear layer takes the number of input and output nodes as arguments.
  
- Nonlinear activation functions are placed between the hidden layers.
  
- The number of output nodes of one hidden layer has to match the number of inputs of the next layer.
  
- The outputs of the last layer are called logits.

In [44]:
# instantiate a new neural network object as follows:
model = NeuralNetwork(50, 3)

But before using this new model object, it is often useful to call print on the model to see a summary of its structure:

In [45]:
print(model)

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)


Note that we used the `Sequential` class when we implemented the `NeuralNetwork` class. Using `Sequential` is not required, but it can make our life easier if we have a series of layers that we want to execute in a specific order, as is the case here. This way, after instantiating `self.layers = Sequential(...)` in the `__init__` constructor, we just have to call `self.layers` instead of calling each layer individually in the `forward` method of `NeuralNetwork`.
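For comparison, here is a sketch of the same architecture written without `Sequential`, where each layer is stored as a separate attribute and called explicitly. The class name `NeuralNetworkExplicit` and the attribute names are hypothetical and chosen only for this illustration.

```python
import torch

class NeuralNetworkExplicit(torch.nn.Module):  # hypothetical name
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.hidden1 = torch.nn.Linear(num_inputs, 30)
        self.hidden2 = torch.nn.Linear(30, 20)
        self.output = torch.nn.Linear(20, num_outputs)

    def forward(self, x):
        # Each layer has to be called by hand, in the right order.
        x = torch.relu(self.hidden1(x))
        x = torch.relu(self.hidden2(x))
        return self.output(x)  # logits

explicit_model = NeuralNetworkExplicit(50, 3)
logits = explicit_model(torch.rand(1, 50))
num_explicit_params = sum(p.numel() for p in explicit_model.parameters())
print(logits.shape, num_explicit_params)
```

This form is more verbose, but it is the usual choice when the data flow is not a simple chain, for example when a layer's output is reused later in the network.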

In [47]:
# Next, let's check the total number of trainable parameters of this model:

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total number of trainable model parameters:", num_params)

Total number of trainable model parameters: 2213


Note that each parameter for which `requires_grad=True` counts as a trainable parameter and will be updated during training.

In the case of our neural network model with the two hidden layers above, these trainable parameters are contained in the `torch.nn.Linear` layers. A linear layer multiplies the inputs with a weight matrix and adds a bias vector. This is sometimes also referred to as a feedforward or fully connected layer.
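The total of 2,213 trainable parameters reported above can be verified by hand: each linear layer contributes one weight per input-output pair plus one bias per output. The plain-Python sketch below does this arithmetic for the layer sizes used in this model; the variable names are illustrative.

```python
# Layer widths: 50 inputs -> 30 -> 20 -> 3 outputs
layer_sizes = [50, 30, 20, 3]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out  # one weight per input-output pair
    biases = n_out          # one bias per output node
    print(f"{n_in} -> {n_out}: {weights} weights + {biases} biases")
    total += weights + biases

print("Total:", total)
```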

Based on the `print(model)` call we executed above, we can see that the first Linear layer is at index position `0` in the layers attribute. We can access the corresponding weight parameter matrix as follows:

In [54]:
print(model.layers[0].weight)

Parameter containing:
tensor([[ 0.1002,  0.0516,  0.0463,  ..., -0.1280, -0.1195,  0.0278],
        [ 0.1319,  0.0549,  0.1327,  ...,  0.0467, -0.0404,  0.1108],
        [-0.0768,  0.0059,  0.0921,  ..., -0.0913, -0.1009,  0.0777],
        ...,
        [-0.1242,  0.0902,  0.0313,  ...,  0.0348, -0.0284,  0.0580],
        [-0.0845,  0.1084, -0.0515,  ...,  0.0223,  0.0502,  0.0350],
        [ 0.0059, -0.0862, -0.0601,  ...,  0.0730, -0.1253,  0.0554]],
       requires_grad=True)


In [56]:
print(model.layers[2].weight)

Parameter containing:
tensor([[-0.0706, -0.1613, -0.0382, -0.0338,  0.0408,  0.0283, -0.1800,  0.0613,
          0.0337, -0.1748,  0.0415, -0.1285,  0.0826,  0.1203,  0.1686, -0.0867,
         -0.1456, -0.1658, -0.0388,  0.1806, -0.1766,  0.1511, -0.0600,  0.1367,
         -0.0169,  0.1419,  0.1603,  0.0325,  0.1771,  0.1390],
        [-0.0345, -0.0200, -0.1431,  0.1416, -0.1668, -0.1217,  0.0388,  0.0628,
         -0.1759,  0.0356, -0.1225, -0.1818,  0.0592,  0.0677,  0.0149,  0.0779,
          0.0678, -0.1519, -0.0082, -0.1400,  0.1643,  0.0040,  0.0731, -0.0616,
         -0.0788,  0.1778, -0.0299, -0.0703, -0.1403,  0.0580],
        [-0.1791, -0.0134, -0.1733,  0.1272, -0.0072,  0.1291, -0.0544, -0.0058,
          0.1186,  0.1017, -0.1676,  0.1193,  0.0153,  0.0970,  0.0720,  0.1543,
         -0.0394,  0.0698, -0.0334, -0.0784, -0.0734,  0.0118,  0.0591, -0.0523,
          0.0371,  0.0506,  0.1024,  0.1455, -0.0425, -0.0215],
        [ 0.1729,  0.0291, -0.0773,  0.0555, -0.0129,  ... (output truncated)

Since this is a large matrix that is not shown in its entirety, let's use the .shape attribute to show its dimensions:

In [61]:
first_layer = model.layers[0].weight.shape
second_layer = model.layers[2].weight.shape

print("Shape of the weight of the first layer:", first_layer)
print("Shape of the weight of the second layer:", second_layer)

Shape of the weight of the first layer: torch.Size([30, 50])
Shape of the weight of the second layer: torch.Size([20, 30])


(Similarly, you could access the bias vector via `model.layers[0].bias`.)

In [63]:
first_layer_bias = model.layers[0].bias.shape
second_layer_bias = model.layers[2].bias.shape

print("Shape of the bias of the first layer:", first_layer_bias)
print("Shape of the bias of the second layer:", second_layer_bias)

Shape of the bias of the first layer: torch.Size([30])
Shape of the bias of the second layer: torch.Size([20])


The weight matrix of the first layer above is a `30x50` matrix, and we can see that the `requires_grad` is set to True, which means its entries are trainable -- this is the default setting for `weights` and `biases` in `torch.nn.Linear`.
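The `30x50` shape may look reversed for a layer that maps 50 inputs to 30 outputs. That is because `torch.nn.Linear` stores its weights as (outputs, inputs) and multiplies the input by the transposed weight matrix. The sketch below checks this on a standalone layer; the names `layer`, `manual`, and `builtin` are illustrative.

```python
import torch

torch.manual_seed(123)
layer = torch.nn.Linear(50, 30)
x = torch.rand(1, 50)

# What the layer computes: input times transposed weights, plus bias.
manual = x @ layer.weight.T + layer.bias
builtin = layer(x)

matches = torch.allclose(manual, builtin, atol=1e-6)
print(layer.weight.shape, manual.shape, matches)
```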

Note that if you execute the code above on your computer, the numbers in the weight matrix will likely differ from those shown above. This is because the model weights are initialized with small random numbers, which are different each time we instantiate the network. In deep learning, initializing model weights with small random numbers is desired to break symmetry during training -- otherwise, the nodes would be just performing the same operations and updates during backpropagation.

However, while we want to keep using small random numbers as initial values for our layer weights, we can make the random number initialization reproducible by `seeding` PyTorch's random number generator via `manual_seed`:

In [66]:
torch.manual_seed(123)
model = NeuralNetwork(50, 3)
print(model.layers[0].weight)

Parameter containing:
tensor([[-0.0577,  0.0047, -0.0702,  ...,  0.0222,  0.1260,  0.0865],
        [ 0.0502,  0.0307,  0.0333,  ...,  0.0951,  0.1134, -0.0297],
        [ 0.1077, -0.1108,  0.0122,  ...,  0.0108, -0.1049, -0.1063],
        ...,
        [-0.0787,  0.1259,  0.0803,  ...,  0.1218,  0.1303, -0.1351],
        [ 0.1359,  0.0175, -0.0673,  ...,  0.0674,  0.0676,  0.1058],
        [ 0.0790,  0.1343, -0.0293,  ...,  0.0344, -0.0971, -0.0509]],
       requires_grad=True)
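A quick way to confirm the effect of seeding is to create layers with and without resetting the seed and compare their weights. This is a small sketch on standalone `Linear` layers; the names `layer_a`, `layer_b`, and `layer_c` are illustrative.

```python
import torch

torch.manual_seed(123)
layer_a = torch.nn.Linear(50, 30)

torch.manual_seed(123)             # reset the generator to the same state
layer_b = torch.nn.Linear(50, 30)

layer_c = torch.nn.Linear(50, 30)  # created without reseeding

same_seed_equal = torch.equal(layer_a.weight, layer_b.weight)
no_reseed_equal = torch.equal(layer_b.weight, layer_c.weight)
print(same_seed_equal, no_reseed_equal)
```

Seeding fixes the sequence of random numbers, so the order in which layers are created after the seed call also matters.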


Now, after we spent some time inspecting the `NeuralNetwork` instance, let's briefly see how it's used via the forward pass:

In [69]:
torch.manual_seed(123)
X = torch.rand((1, 50))

In [70]:
X

tensor([[0.2961, 0.5166, 0.2517, 0.6886, 0.0740, 0.8665, 0.1366, 0.1025, 0.1841,
         0.7264, 0.3153, 0.6871, 0.0756, 0.1966, 0.3164, 0.4017, 0.1186, 0.8274,
         0.3821, 0.6605, 0.8536, 0.5932, 0.6367, 0.9826, 0.2745, 0.6584, 0.2775,
         0.8573, 0.8993, 0.0390, 0.9268, 0.7388, 0.7179, 0.7058, 0.9156, 0.4340,
         0.0772, 0.3565, 0.1479, 0.5331, 0.4066, 0.2318, 0.4545, 0.9737, 0.4606,
         0.5159, 0.4220, 0.5786, 0.9455, 0.8057]])

In [72]:
X.shape, X.dim()

(torch.Size([1, 50]), 2)

In [73]:
out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]], grad_fn=<AddmmBackward0>)


In the code above, we generated a single random training example `X` as a toy input (note that our network expects 50-dimensional feature vectors) and fed it to the model, returning three scores. When we call `model(X)`, it will automatically execute the forward pass of the model.

The forward pass refers to calculating output tensors from input tensors. This involves passing the input data through all the neural network layers, starting from the input layer, through hidden layers, and finally to the output layer.

These three numbers returned above correspond to a score assigned to each of the three output nodes. Notice that the output tensor also includes a `grad_fn` value.

Here, `grad_fn=<AddmmBackward0>` represents the last-used function to compute a variable in the computational graph. In particular, it means that the tensor we are inspecting was created via a matrix multiplication and addition operation: `Addmm` stands for matrix multiplication (`mm`) followed by an addition (`Add`). PyTorch will use this information when it computes gradients during backpropagation.

If we just want to use a network without training or backpropagation, for example, if we use it for prediction after training, constructing this computational graph for backpropagation can be wasteful as it performs unnecessary computations and consumes additional memory. So, when we use a model for inference (for instance, making predictions) rather than training, it is a best practice to use the `torch.no_grad()` context manager, as shown below. This tells PyTorch that it doesn't need to keep track of the gradients, which can result in significant savings in memory and computation.

In [74]:
with torch.no_grad():
    out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]])
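The effect of the context manager can be inspected directly through the output tensor's `requires_grad` and `grad_fn` attributes. The sketch below uses a single `Linear` layer as a stand-in for the full network, which is an assumption made to keep the example self-contained.

```python
import torch

model = torch.nn.Linear(50, 3)  # stand-in for the network above
X = torch.rand(1, 50)

out_tracked = model(X)          # graph is recorded
with torch.no_grad():
    out_untracked = model(X)    # no graph is recorded

print(out_tracked.requires_grad, out_tracked.grad_fn is not None)
print(out_untracked.requires_grad, out_untracked.grad_fn)
```

The numerical outputs are the same in both cases; only the bookkeeping for backpropagation is skipped.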


In PyTorch, it's common practice to code models such that they return the outputs of the last layer (logits) without passing them to a nonlinear activation function. That's because PyTorch's commonly used loss functions combine the softmax (or sigmoid for binary classification) operation with the negative log-likelihood loss in a single class. The reason for this is numerical efficiency and stability. So, if we want to compute class-membership probabilities for our predictions, we have to call the `softmax` function explicitly:

In [79]:
with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
print(out)
out.sum()

tensor([[0.3113, 0.3934, 0.2952]])


tensor(1.)

The values can now be interpreted as `class-membership probabilities` that sum up to `1`. The values are roughly equal for this random input, which is expected for a randomly initialized model without training.
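The relationship between logits, probabilities, and the loss can be sketched as follows. The logits are copied from the output printed earlier, and the target label is a hypothetical value chosen only for illustration. The check shows that `F.cross_entropy` applied to raw logits matches the negative log of the softmax probability of the true class.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-0.1262, 0.1080, -0.1792]])  # copied from the output above
target = torch.tensor([1])                           # hypothetical true class label

probas = torch.softmax(logits, dim=1)
predicted_class = torch.argmax(probas, dim=1)

# cross_entropy expects raw logits and applies log-softmax internally.
loss_from_logits = F.cross_entropy(logits, target)

# Manual equivalent: negative log-probability of the true class.
loss_manual = -torch.log(probas[0, 1])

print(probas, predicted_class)
print(loss_from_logits, loss_manual)
```

Passing probabilities instead of logits to `F.cross_entropy` would apply softmax twice, which is why models are written to return logits.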

## Setting up efficient data loaders

