# [Understanding PyTorch with an example: a step-by-step tutorial](https://towardsdatascience.com/understanding-pytorch-with-an-example-a-step-by-step-tutorial-81fc5f8c4e8e)

## A Simple Regression Problem

![](https://miro.medium.com/max/188/1*a7_GUQQT5BjvAhh3qq0JwA.png)

In [1]:
import numpy as np

### Data Generation

In [2]:
# Data Generation
np.random.seed(42)

x = np.random.rand(100, 1)
y = 1 + 2 * x + .1 * np.random.randn(100, 1)

# Shuffle the indices
idx = np.arange(100)
np.random.shuffle(idx)

train_idx = idx[:80]
val_idx = idx[80:]

# Generate training and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]

In [3]:
(x_train.shape, y_train.shape), (x_val.shape, y_val.shape)

(((80, 1), (80, 1)), ((20, 1), (20, 1)))

Gradient Descent
===
Gradient descent boils down to **four basic steps**:

Step 1: Compute the Loss
---
For a regression problem, the loss is given by the **Mean Squared Error (MSE)**, that is, the average of all squared differences between labels (y) and predictions (a + bx).

It is worth mentioning that, if we use **all points** in the training set (N) to compute the loss, we are performing a **batch gradient descent**. If we were to use a **single point** each time, it would be a **stochastic gradient descent**. Anything else (n) **in-between 1 and N** characterizes a **mini-batch gradient descent**.

![](https://miro.medium.com/max/345/1*7fmJUcQT578OBfX7Q8hluQ.png)
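
The three variants differ only in which rows feed the loss. Here is a minimal NumPy sketch of that idea, assuming toy data shaped like the tutorial's; the helper name `mse_loss`, the seed, the parameter values and the mini-batch size of 16 are illustrative assumptions, not part of the original.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen for this sketch only
x = rng.random((80, 1))
y = 1 + 2 * x + 0.1 * rng.standard_normal((80, 1))


def mse_loss(a, b, x, y):
    """Average squared difference between labels y and predictions a + b*x."""
    return ((y - (a + b * x)) ** 2).mean()


a, b = 0.5, -0.1  # arbitrary illustrative parameters

batch_loss = mse_loss(a, b, x, y)                 # all N = 80 points
stochastic_loss = mse_loss(a, b, x[:1], y[:1])    # a single point
mini_batch_loss = mse_loss(a, b, x[:16], y[:16])  # n = 16 points, 1 < n < N
```

All three numbers estimate the same quantity; the fewer points used, the noisier the estimate.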

Step 2: Compute the Gradients
---
A gradient is a **partial derivative** — why partial? Because one computes it with respect to **(w.r.t.) a single parameter**. We have two parameters, a and b, so we must compute two partial derivatives.

**_A derivative tells you how much a given quantity changes when you slightly vary some other quantity. In our case, how much does our MSE loss change when we vary each one of our two parameters?_**

![](https://miro.medium.com/max/850/1*YvTj1B-h1gzSI5F24OgrrA.png)
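
One way to convince yourself that the two partial derivatives are right is to compare them with a finite-difference estimate, which literally "varies each parameter slightly". The sketch below assumes the analytic forms `-2 * mean(error)` for a and `-2 * mean(x * error)` for b, with `error = y - (a + b*x)`; the data, seed and step size `eps` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
x = rng.random((80, 1))
y = 1 + 2 * x + 0.1 * rng.standard_normal((80, 1))


def mse_loss(a, b):
    return ((y - (a + b * x)) ** 2).mean()


a, b = 0.5, -0.1  # arbitrary point at which to evaluate the gradients

# Analytic partial derivatives of the MSE w.r.t. a and b
error = y - (a + b * x)
a_grad = -2 * error.mean()
b_grad = -2 * (x * error).mean()

# Numerical estimates: nudge one parameter at a time and watch the loss
eps = 1e-6
a_grad_numeric = (mse_loss(a + eps, b) - mse_loss(a - eps, b)) / (2 * eps)
b_grad_numeric = (mse_loss(a, b + eps) - mse_loss(a, b - eps)) / (2 * eps)
```

If the formulas are correct, each analytic value should agree with its numerical counterpart up to rounding error.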

Step 3: Update the Parameters
---
In this step, we _**use the gradients to update the parameters**_. Since we are trying to _**minimize our losses**_, we _**reverse the sign of the gradient**_ for the update.

There is still another parameter to consider: the _**learning rate**_, denoted by the Greek letter _**eta**_ (it looks like the letter _**n**_). It is the _**multiplicative factor**_ that we apply to the gradient for the parameter update.

![](https://miro.medium.com/max/209/1*eWnUloBYcSNPRBzVcaIr1g.png)
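
A single update can be sketched as follows. The data, seed and starting values are assumptions made for illustration; only the update rule itself (parameter minus learning rate times gradient) comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
x = rng.random((80, 1))
y = 1 + 2 * x + 0.1 * rng.standard_normal((80, 1))


def mse_loss(a, b):
    return ((y - (a + b * x)) ** 2).mean()


lr = 0.1          # learning rate (eta)
a, b = 0.5, -0.1  # arbitrary starting parameters

error = y - (a + b * x)
a_grad = -2 * error.mean()
b_grad = -2 * (x * error).mean()

loss_before = mse_loss(a, b)
a_new = a - lr * a_grad  # minus sign: step against the gradient
b_new = b - lr * b_grad
loss_after = mse_loss(a_new, b_new)
```

With a learning rate this small relative to the curvature of the loss, the step should lower the loss; a much larger value could overshoot the minimum instead.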

Step 4: Rinse and Repeat!
---
Now we use the **updated parameters** to go back to **Step 1** and restart the process.

> An **_epoch is complete whenever every point has already been used for computing the loss_**. For **batch** gradient descent, this is trivial, as it uses all points for computing the loss, so **one epoch is the same as one update**. For **stochastic gradient** descent, **one epoch means N updates**, while for **mini-batch (of size n), one epoch has N/n updates**.

Repeating this process over and over, for many epochs, is, in a nutshell, **training a model**.
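
As a quick arithmetic check of the update counts per epoch, here is a small sketch using the 80 training points from the split above. The mini-batch size of 16 is a hypothetical choice, and rounding up assumes that a final, smaller batch is still used when n does not divide N evenly.

```python
import math

N = 80  # number of training points
n = 16  # hypothetical mini-batch size

updates_per_epoch = {
    "batch": 1,                      # all N points, one update
    "stochastic": N,                 # one update per point
    "mini-batch": math.ceil(N / n),  # one update per group of n points
}
print(updates_per_epoch)
```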

Linear Regression in Numpy
===
For training a model, there are **two initialization steps**:

1. **_Random initialization of parameters/weights_** (we have only two, a and b);
2. **_Initialization of hyper-parameters_** (in our case, only learning rate and number of epochs);

Make sure to always initialize your random seed to ensure **reproducibility** of your results. As usual, the random seed is [42](https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy#Answer_to_the_Ultimate_Question_of_Life,_the_Universe,_and_Everything_(42)), the least random of all random seeds one could possibly choose :-)

**For each epoch, there are four training steps:**

- **Compute model’s predictions** — this is the **_forward pass_**;
- **Compute the loss**, using predictions and labels and the appropriate loss function for the task at hand;
- **Compute the gradients for every parameter**;
- **Update the parameters**;

Just keep in mind that, if you don’t use batch gradient descent (our example does), you’ll have to write an **inner loop** to perform the **four training steps** for either each **individual point (stochastic)** or **n points (mini-batch)**.
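
For reference, such an inner loop might look like the sketch below. It is not the tutorial's code: the mini-batch size, the number of epochs and the per-epoch reshuffling are assumptions chosen for illustration, and the data is regenerated so the snippet runs on its own.

```python
import numpy as np

np.random.seed(42)
x_train = np.random.rand(80, 1)
y_train = 1 + 2 * x_train + 0.1 * np.random.randn(80, 1)

a = np.random.randn(1)
b = np.random.randn(1)

lr = 1e-1
n_epochs = 200
batch_size = 16  # hypothetical mini-batch size; 1 would give stochastic GD

for epoch in range(n_epochs):
    # Reshuffle every epoch so the mini-batches differ between epochs
    idx = np.random.permutation(len(x_train))

    # Inner loop: the four training steps, once per mini-batch
    for start in range(0, len(x_train), batch_size):
        batch_idx = idx[start:start + batch_size]
        x_batch, y_batch = x_train[batch_idx], y_train[batch_idx]

        yhat = a + b * x_batch                   # forward pass
        error = y_batch - yhat
        loss = (error ** 2).mean()               # loss on this mini-batch only
        a_grad = -2 * error.mean()               # gradients
        b_grad = -2 * (x_batch * error).mean()
        a -= lr * a_grad                         # parameter update
        b -= lr * b_grad

print(a, b)
```

The parameters should end up near the values used to generate the data (1 and 2), with a little jitter because each update only sees part of the training set.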

In [4]:
# Initialize parameters "a" and "b" randomly
np.random.seed(42)
a = np.random.randn(1)
b = np.random.randn(1)

print(a, b)

# Set the learning rate & No. of Epochs
lr = 1e-1
n_epochs = 1000

for epoch in range(n_epochs):
    # compute model's predictions
    yhat = a + b * x_train
    
    # compute loss
    error = (y_train - yhat)
    loss = (error ** 2).mean()
    
    # compute the gradients for both "a" and "b" parameters
    a_grad = -2 * error.mean()
    b_grad = -2 * (x_train * error).mean()
    
    # update the parameters using the gradient and learning rate
    a -= lr * a_grad
    b -= lr * b_grad
    
    
print(a, b)
    
    
# SANITY CHECK: do we get the same results as our gradient descent?
from sklearn.linear_model import LinearRegression
linr = LinearRegression()
linr.fit(x_train, y_train)
print(linr.intercept_, linr.coef_[0])

[0.49671415] [-0.1382643]
[1.02354094] [1.96896411]
[1.02354075] [1.96896447]


PyTorch
===

![](https://miro.medium.com/max/1200/1*GbwKkmA0NdndXRhOOwNclA.jpeg)

## Loading Data, Devices and CUDA

- `from_numpy()` returns a **CPU tensor**.
- `to()` sends your tensor to whatever **device** you specify, including your **GPU** (referred to as `cuda` or `cuda:0`).
- `torch.cuda.is_available()` tells you whether a GPU is at your disposal, so you can set your device accordingly.
- `float()` casts a tensor to a lower precision (32-bit float).
- `numpy()` turns tensors back into Numpy arrays, provided they are CPU tensors.
- `cpu()` converts GPU tensors to CPU tensors.
- `requires_grad=True` to enable **gradient computation of tensors**
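
The calls in the list above can be chained into a round trip from Numpy to the device and back. This is a minimal sketch with made-up data; the variable names are illustrative.

```python
import numpy as np
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

x_np = np.random.rand(5, 1)        # Numpy array, float64 by default
x_cpu = torch.from_numpy(x_np)     # CPU tensor, still float64
x_dev = x_cpu.float().to(device)   # cast to float32, then send to the device
x_back = x_dev.cpu().numpy()       # bring it to the CPU before calling numpy()

print(x_cpu.dtype, x_dev.dtype, x_back.dtype)
```

Calling `cpu()` first is harmless on a CPU-only machine and keeps the same line working when a GPU is present.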

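**Calling `to(device)` on a tensor that already requires gradients "shadows" the gradient:** when the tensor actually moves to another device, the result is a new tensor computed from the original one, not the leaf tensor whose `grad` attribute autograd fills in. On a CPU-only machine the call changes nothing, which is why the outputs below all show `requires_grad=True`.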

> **In PyTorch, every method that ends with an underscore (_) makes changes in-place**, meaning, they will modify the underlying variable.
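
A small illustration of the underscore convention, using `add` and `add_` as an example pair:

```python
import torch

t = torch.ones(3)

t_plus = t.add(1)  # no underscore: returns a new tensor, t is left untouched
untouched = t.clone()

t.add_(1)          # underscore: t itself is modified in place

print(untouched, t_plus, t)
```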

In [6]:
import torch
import torch.optim as optim
import torch.nn as nn
from torchviz import make_dot

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Our data was in Numpy arrays, but we need to transform them into PyTorch's Tensors
# and then we send them to the chosen device
x_train_tensor = torch.from_numpy(x_train).float().to(device)
y_train_tensor = torch.from_numpy(y_train).float().to(device)

# Here we can see the difference - notice that .type() is more useful
# since it also tells us WHERE the tensor is (device)
(type(x_train), type(x_train_tensor), x_train_tensor.type())

(numpy.ndarray, torch.Tensor, 'torch.FloatTensor')

In [7]:
# FIRST
# Initializes parameters "a" and "b" randomly, ALMOST as we did in Numpy
# since we want to apply gradient descent on these parameters, we need
# to set REQUIRES_GRAD = TRUE
a = torch.randn(1, requires_grad=True, dtype=torch.float)
b = torch.randn(1, requires_grad=True, dtype=torch.float)
print(a, b)

# SECOND
# But what if we want to run it on a GPU? We could just send them to device, right?
a = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)
b = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)
print(a, b)
# Sorry, but NO! The to(device) "shadows" the gradient...

# THIRD
# We can either create regular tensors and send them to the device (as we did with our data)
a = torch.randn(1, dtype=torch.float).to(device)
b = torch.randn(1, dtype=torch.float).to(device)
# and THEN set them as requiring gradients...
a.requires_grad_()
b.requires_grad_()
print(a, b)

tensor([0.4018], requires_grad=True) tensor([-0.9345], requires_grad=True)
tensor([-0.4809], requires_grad=True) tensor([0.4354], requires_grad=True)
tensor([-1.0193], requires_grad=True) tensor([0.1215], requires_grad=True)
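
On a machine without a GPU the three attempts above look identical, because `to('cpu')` on a CPU tensor has nothing to do. The sketch below makes the difference visible by using a dtype conversion as a stand-in for a device move (an assumption made for illustration: both go through `to()` and both return a new tensor when real work is needed).

```python
import torch

# Leaf tensor created by the user: autograd stores its gradient in .grad
a = torch.randn(1, requires_grad=True)

# Nothing to move or convert: to() hands back the very same tensor
same = a.to('cpu')

# A conversion that does real work returns a NEW tensor derived from a
converted = a.to(torch.float64)

# Convert first, THEN ask for gradients: the result is a proper leaf
c = torch.randn(1).to(torch.float64).requires_grad_()

print(same is a, converted.is_leaf, c.is_leaf)
```

`converted` carries a `grad_fn` pointing back to `a`, so it is the output of an operation rather than a parameter we can train directly.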


In [8]:
# We can specify the device at the moment of creation - RECOMMENDED!
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
print(a, b)

tensor([0.3367], requires_grad=True) tensor([0.1288], requires_grad=True)


Autograd
===
Autograd is PyTorch’s **_automatic differentiation package_**. Thanks to it, we don’t need to worry about partial derivatives, chain rule or anything like it.

- `backward()` - for computing the gradients
- `grad` - attribute of a tensor which gives the actual values of the gradients
- `zero_()` - **_gradients are accumulated in PyTorch_**. So every time we use the gradients to update the parameters, we need to **zero the gradients afterwards**.
- `torch.no_grad()` - allows us to **perform regular Python operations on tensors, independent of PyTorch's computation graph**.

**_PyTorch builds a dynamic computation graph from every Python operation that involves any gradient-computing tensor or its dependencies._**
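
The accumulation behaviour is easy to see on a one-parameter toy problem. In this sketch the tensor `w` and the function `3 * w` are made up for illustration; its derivative with respect to `w` is always 3.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

loss = (3 * w).sum()
loss.backward()
first = w.grad.clone()        # gradient after one backward pass

loss = (3 * w).sum()
loss.backward()
accumulated = w.grad.clone()  # second pass ADDED its gradient to the first

w.grad.zero_()                # reset, in place
loss = (3 * w).sum()
loss.backward()
after_zero = w.grad.clone()   # back to the gradient of a single pass

print(first, accumulated, after_zero)
```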


In [10]:
lr = 1e-1
n_epochs = 1000

torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

for epoch in range(n_epochs):
    yhat = a + b * x_train_tensor
    error = y_train_tensor - yhat
    loss = (error ** 2).mean()
    
    # We just tell PyTorch to work its way BACKWARDS from the specified loss!
    loss.backward()
    
    # let's check the computed gradients
    #print(a.grad)
    #print(b.grad)
    
    # What about UPDATING the parameters? Not so fast...
    
    # FIRST ATTEMPT
    # AttributeError: 'NoneType' object has no attribute 'zero_' (we lost the gradient while reassigning the updated results to our parameters)
    # a = a - lr * a.grad
    # b = b - lr * b.grad
    # print(a)

    # SECOND ATTEMPT
    # RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.
    # a -= lr * a.grad
    # b -= lr * b.grad        
    
    # THIRD ATTEMPT
    # We need to use NO_GRAD to keep the update out of the gradient computation
    # Why is that? It boils down to the DYNAMIC GRAPH that PyTorch uses...
    
    with torch.no_grad():
        a -= lr * a.grad
        b -= lr * b.grad
        
        
    # PyTorch is "clingy" to its computed gradients, we need to tell it to let it go...
    a.grad.zero_()
    b.grad.zero_()
    
print(a)
print(b)

tensor([1.0235], requires_grad=True)
tensor([1.9690], requires_grad=True)
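
To see why the first attempt loses the gradient while the `no_grad` version does not, it may help to compare the two on a single made-up parameter. The loss `a ** 2` and the step size are assumptions chosen so the numbers are easy to follow.

```python
import torch

a = torch.tensor([1.0], requires_grad=True)
loss = (a ** 2).sum()
loss.backward()  # a.grad is now 2 * a = 2

# Reassignment builds a NEW tensor out of a: it is part of the graph, not a leaf,
# so autograd will not fill in its .grad on later backward passes
reassigned = a - 0.1 * a.grad

# Inside no_grad the in-place update is not recorded: a stays the same leaf
with torch.no_grad():
    a -= 0.1 * a.grad

print(reassigned.is_leaf, a.is_leaf, a)
```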


Dynamic Computation Graph
===

> **“Unfortunately, no one can be told what the dynamic computation graph is. You have to see it for yourself.” Morpheus**

_The [PyTorchViz](https://github.com/szagoruyko/pytorchviz) package and its `make_dot(variable)` function allow us to easily visualize the graph associated with a given Python variable._


In [11]:
torch.manual_seed(42)

a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

yhat = a + b * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()
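
Even without a plotting package, the graph can be inspected through each tensor's `grad_fn` attribute. The sketch below rebuilds the same computation on a small stand-in for `x_train_tensor` (the toy data is an assumption) and walks one step back from the loss.

```python
import torch

torch.manual_seed(42)
x = torch.rand(8, 1)  # toy data, a stand-in for x_train_tensor
y = 1 + 2 * x

a = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

yhat = a + b * x
error = y - yhat
loss = (error ** 2).mean()

# Leaf tensors (data and parameters) have no grad_fn; results of operations do
leaf_has_grad_fn = a.grad_fn is not None

# Name of the operation that produced the loss (the mean)
last_op = type(loss.grad_fn).__name__

# next_functions links each node to the nodes that fed into it (here, the power)
previous_op = type(loss.grad_fn.next_functions[0][0]).__name__

print(leaf_has_grad_fn, last_op, previous_op)
```

Following `next_functions` all the way down eventually reaches the nodes that accumulate gradients into `a` and `b`.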

In [19]:
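import os
# make_dot needs the Graphviz "dot" executable on PATH; point this at the folder that contains it
# (the Python package folder used here does not seem to be enough, judging by the error below)
os.environ["PATH"] += os.pathsep + "C:/Users/the name oofn/Anaconda3/Lib/site-packages/graphviz/"
make_dot(yhat)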

ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH

<graphviz.dot.Digraph at 0x11f82655fd0>