Doubt Page 126

**Index**

- [Problem Statement](#A-hot-problem)
    - [Data Collection](#Gathering-Data)
- [Gradient Descent in a nutshell](#Gradient-Descent-in-a-nutshell)
- [How to Autograd?](#Using-Autograd)
    - [Always zero your GRAD](#Grad-value-accumulates)
- [Optimization models](#Optimization-models)
- [Training and Validation sets](#Training-and-Validation-sets)

Let's work through the problem below to learn how to train a model in PyTorch.

### A hot problem
    We just got back from a trip to some obscure location, and we brought back a fancy, wall-mounted analog thermometer. 
    It looks great, and it’s a perfect fit for our living room. Its only flaw is that it doesn’t show units. Not to worry,
    we’ve got a plan: we’ll build a dataset of readings and corresponding temperature values in our favorite units, choose 
    a model, adjust its weights iteratively until a measure of the error is low enough, and finally be able to interpret 
    the new readings in units we understand.

In [None]:
import torch
import matplotlib.pyplot as plt

#### Gathering Data

    Here, the t_c values are temperatures in Celsius, and the t_u values are our unknown units.

In [None]:
t_c = [0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0]
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
t_c = torch.tensor(t_c)
t_u = torch.tensor(t_u)

plt.scatter(t_c, t_u)

In [None]:
# Writing a simple linear model and calculating mean squared error loss 

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()
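
As a quick sanity check, the model and loss can be evaluated once with a starting guess before any training happens. This is a minimal sketch that assumes the starting values `w = 1` and `b = 0`; the name `initial_loss` is introduced here only for illustration.

```python
import torch

# Same readings as in the data cell above.
t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

# Assumed starting guess: with w = 1 and b = 0 the prediction is the raw reading.
w = torch.ones(())
b = torch.zeros(())

initial_loss = loss_fn(model(t_u, w, b), t_c)
print(initial_loss)  # roughly 1763.88, the value that training has to bring down
```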

### Gradient Descent in a nutshell
    
    Gradient descent is not that different from the scenario we just described. The idea is to compute the rate of change of
    the loss with respect to each parameter, and modify each parameter in the direction of decreasing loss.

In other words, in the neighborhood of the current values of *w* and *b*, a unit increase in *w* leads to some change in the loss. If the change is **negative**, we need to **increase** *w* to minimize the loss, whereas if the change is **positive**, we need to **decrease** *w*. By how much?
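
One common answer, written out as a sketch: estimate the slope of the loss with a central difference over a small step, then move the parameter against that slope by an amount scaled by a learning rate. The symbols $L$ (the loss), $\delta$ (the step, `delta` in the code) and $\eta$ (the `learning_rate`) are notation introduced here for illustration.

$$
\frac{\partial L}{\partial w} \approx \frac{L(w + \delta,\, b) - L(w - \delta,\, b)}{2\delta},
\qquad
w \leftarrow w - \eta \, \frac{\partial L}{\partial w}
$$

$$
\frac{\partial L}{\partial b} \approx \frac{L(w,\, b + \delta) - L(w,\, b - \delta)}{2\delta},
\qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
$$

The sign of the estimated slope decides the direction of the step, and $\eta$ decides its size.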

In [None]:
# Initialize the weight and bias.

w = torch.ones(())
b = torch.zeros(())

In [None]:
delta = 0.1
learning_rate = 1e-2

In [None]:
# Calculating the rate of change and updating weights

loss_rate_of_change_w = \
(loss_fn(model(t_u, w + delta, b), t_c) -
loss_fn(model(t_u, w - delta, b), t_c)) / (2.0 * delta)

w = w - learning_rate * loss_rate_of_change_w
w

In [None]:
# Updating bias

loss_rate_of_change_b = \
(loss_fn(model(t_u, w, b + delta), t_c) -
loss_fn(model(t_u, w, b - delta), t_c)) / (2.0 * delta)

b = b - learning_rate * loss_rate_of_change_b
b

In [None]:
def dloss_fn(t_p, t_c):
    dsq_diffs = 2 * (t_p - t_c) / t_p.size(0)               # The division is from the derivative of mean.
    return dsq_diffs

def dmodel_dw(t_u, w, b):
    return t_u

def dmodel_db(t_u, w, b):
    return 1.0

def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = dloss_fn(t_p, t_c)
    dloss_dw = dloss_dtp * dmodel_dw(t_u, w, b)
    dloss_db = dloss_dtp * dmodel_db(t_u, w, b)
    return torch.stack([dloss_dw.sum(), dloss_db.sum()])    # The summation is the reverse of the
                                                            # broadcasting we implicitly do when
                                                            # applying the parameters to an entire
                                                            # vector of inputs in the model.
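
A hand-derived gradient is easy to get wrong, so it can help to compare it against the central-difference estimate used earlier. The sketch below does that in `float64` to keep rounding error small; the names `analytic` and `numeric` and the step `delta = 1e-3` are assumptions made for this example, and `grad_fn` is a condensed version of the functions above.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0],
                   dtype=torch.float64)
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4],
                   dtype=torch.float64)

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

def grad_fn(t_u, t_c, t_p):
    # Chain rule: d(loss)/d(t_p) times d(t_p)/dw (= t_u) and d(t_p)/db (= 1).
    dloss_dtp = 2 * (t_p - t_c) / t_p.size(0)
    return torch.stack([(dloss_dtp * t_u).sum(), dloss_dtp.sum()])

w = torch.tensor(1.0, dtype=torch.float64)
b = torch.tensor(0.0, dtype=torch.float64)
delta = 1e-3  # assumed step size for the numerical estimate

analytic = grad_fn(t_u, t_c, model(t_u, w, b))
numeric = torch.stack([
    (loss_fn(model(t_u, w + delta, b), t_c)
     - loss_fn(model(t_u, w - delta, b), t_c)) / (2 * delta),
    (loss_fn(model(t_u, w, b + delta), t_c)
     - loss_fn(model(t_u, w, b - delta), t_c)) / (2 * delta),
])

print(analytic)
print(numeric)
```

Because this loss is quadratic in `w` and `b`, the central difference should agree with the analytic gradient up to floating-point error.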


In [None]:
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        w, b = params
        
        # Forward pass
        t_p = model(t_u, w, b)
        loss = loss_fn(t_p, t_c)

        # Backward pass: gradient of the loss with respect to w and b
        grad = grad_fn(t_u, t_c, t_p, w, b)

        # Parameter update
        params = params - learning_rate * grad
        
        print('Epoch %d, Loss %f' % (epoch, float(loss)))
    return params

In [None]:
t_un = 0.1 * t_u # Normalizing input

training_loop(
n_epochs = 5,
learning_rate = 1e-2,
params = torch.tensor([1.0, 0.0]),
t_u = t_un,
t_c = t_c)
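
To see why the input is scaled before training, the same loop can be run on the raw readings and on the scaled ones. This is a rough sketch: the helper `run` and the list names are hypothetical, and `grad_fn` is condensed from the version above.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

def grad_fn(t_u, t_c, t_p):
    dloss_dtp = 2 * (t_p - t_c) / t_p.size(0)
    return torch.stack([(dloss_dtp * t_u).sum(), dloss_dtp.sum()])

def run(t_in, n_epochs=5, learning_rate=1e-2):
    """Hypothetical helper: return the loss recorded at each epoch."""
    params = torch.tensor([1.0, 0.0])
    losses = []
    for _ in range(n_epochs):
        w, b = params
        t_p = model(t_in, w, b)
        losses.append(loss_fn(t_p, t_c).item())
        params = params - learning_rate * grad_fn(t_in, t_c, t_p)
    return losses

raw_losses = run(t_u)           # unscaled input: the updates overshoot
scaled_losses = run(0.1 * t_u)  # scaled input: the loss goes down

print(raw_losses)
print(scaled_losses)
```

With the raw readings the gradient for `w` is much larger than the one for `b`, so a single learning rate that suits one parameter is far too large for the other. Scaling the input brings the two gradients closer in magnitude.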

In [None]:
from matplotlib import pyplot as plt
params = [1, 0]  # initial, untrained values; use the tensor returned by training_loop to plot the fitted line
t_p = model(t_un, *params)
fig = plt.figure(dpi=600)
plt.xlabel("Temperature (°Fahrenheit)")
plt.ylabel("Temperature (°Celsius)")
plt.plot(t_u.numpy(), t_p.detach().numpy())
plt.plot(t_u.numpy(), t_c.numpy(), 'o')

### Using Autograd

    PyTorch tensors can remember where they come from, in terms of the operations and parent tensors that originated them, 
    and they can automatically provide the chain of derivatives of such operations with respect to their inputs. This means 
    we won’t need to derive our model by hand; given a forward expression, no matter how nested, PyTorch will 
    automatically provide the gradient of that expression with respect to its input parameters.
    
Notice the ```requires_grad=True``` argument in the initialization below.

That argument tells PyTorch to track the entire family tree of tensors resulting from operations on ```params```. After a backward pass, the value of the derivative is automatically populated as the ```grad``` attribute of the ```params``` tensor.
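
As an illustration, the gradient that autograd stores in `grad` can be compared with the hand-derived one for the same linear model. This is a sketch under the assumption that the starting parameters are `[1.0, 0.0]`; the name `manual` is introduced only for this example.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

params = torch.tensor([1.0, 0.0], requires_grad=True)
print(params.grad)  # None: no backward pass has run yet

loss = loss_fn(model(t_u, *params), t_c)
loss.backward()     # fills params.grad

# Hand-derived gradient, computed on detached copies so nothing is tracked.
w, b = params.detach()
dloss_dtp = 2 * (model(t_u, w, b) - t_c) / t_u.size(0)
manual = torch.stack([(dloss_dtp * t_u).sum(), dloss_dtp.sum()])

print(params.grad)
print(manual)
```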

In [None]:
params = torch.tensor([1.0, 0.0], requires_grad=True)

In [None]:
loss = loss_fn(model(t_u, *params), t_c)
loss.backward()

In [None]:
params.grad

#### Grad value accumulates
PyTorch computes the derivatives of the loss throughout the chain of functions (the computation graph) and **accumulates** their values in the ```grad``` attribute of those tensors (the leaf nodes of the graph).

So if ```backward``` was called earlier and then the loss is evaluated again and ```backward``` is called again (as in any training loop), the gradient at each leaf is accumulated (that is, summed) on top of the one computed at the previous iteration, which leads to an incorrect value for the gradient. To prevent this, zero the gradient explicitly at each iteration.
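
The accumulation is easy to observe on a toy tensor. The following sketch uses a made-up two-element tensor `x` rather than the thermometer data, and the names `first`, `second` and `third` are only for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

(x ** 2).sum().backward()
first = x.grad.clone()    # gradient of sum(x^2) is 2x

(x ** 2).sum().backward()
second = x.grad.clone()   # the second backward call adds to the stored value

x.grad.zero_()            # reset in place before the next backward call
(x ** 2).sum().backward()
third = x.grad.clone()    # back to the correct gradient

print(first, second, third)
```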

In [None]:
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        if params.grad is not None:
            params.grad.zero_()
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
        loss.backward()
        with torch.no_grad():  # autograd must not track this in-place update
            params -= learning_rate * params.grad
        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
    return params
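
The `torch.no_grad()` block in the loop is not optional. As a small sketch with a toy loss, updating a leaf tensor that requires gradients in place while tracking is enabled raises a `RuntimeError`, whereas the same update inside `torch.no_grad()` goes through. The flag `raised` is a name introduced here for illustration.

```python
import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)
loss = (params ** 2).sum()   # toy loss, gradient is 2 * params
loss.backward()

try:
    params -= 0.1 * params.grad   # tracked in-place update on a leaf tensor
    raised = False
except RuntimeError as err:
    raised = True
    print(err)

with torch.no_grad():
    params -= 0.1 * params.grad   # same update, hidden from autograd

print(params)
```

The updated tensor is still a leaf with `requires_grad=True`, so the next forward pass builds a fresh graph from it.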

In [None]:
training_loop(
n_epochs = 5000,
learning_rate = 1e-2,
params = torch.tensor([1.0, 0.0], requires_grad=True),
t_u = t_un,
t_c = t_c)

In [None]:
torch.no_grad()

### Optimization models

Each optimizer exposes two methods: ```zero_grad``` and ```step```. **zero_grad** zeroes the ```grad``` attribute of all the parameters passed to the optimizer upon construction. **step** updates the value of those parameters according to the optimization strategy implemented by the specific optimizer.
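
For plain SGD without momentum or weight decay, one call to `step` should match the manual update `params - learning_rate * params.grad`. The sketch below checks this on a toy loss; the target values and the name `expected` are assumptions made for the example.

```python
import torch
import torch.optim as optim

learning_rate = 1e-2
params = torch.tensor([1.0, 0.0], requires_grad=True)
optimizer = optim.SGD([params], lr=learning_rate)

# Toy quadratic loss with an arbitrary target, used only for illustration.
loss = ((params - torch.tensor([3.0, -1.0])) ** 2).sum()

optimizer.zero_grad()
loss.backward()

# What the hand-written update would produce from the same gradient.
expected = (params - learning_rate * params.grad).detach().clone()

optimizer.step()   # in-place update performed by the optimizer

print(params)
print(expected)
```

Other optimizers such as Adam keep extra internal state, so their `step` does not reduce to this one-line formula.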

In [None]:
import torch.optim as optim
dir(optim)

In [None]:
# SGD stands for stochastic gradient descent.

params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)

In [None]:
def training_loop(n_epochs, optimizer, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
    return params

In [None]:
# Using Adam Optimizer.

params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-1
optimizer = optim.Adam([params], lr=learning_rate)
training_loop(n_epochs = 5000, optimizer = optimizer, params = params, 
              t_u = t_u, t_c = t_c) # Not using scaled values, as Adam is less sensitive to the scaling

### Training and Validation sets

In [None]:
n_samples = t_u.shape[0]
n_val = int(0.2 * n_samples)

shuffled_indices = torch.randperm(n_samples)

train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]

train_indices, val_indices
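
A split like this is worth checking once: the two index sets should be disjoint and together cover every sample. The sketch below also passes a seeded `torch.Generator` so the shuffle is repeatable; the seed value `0` and the name `overlap` are arbitrary choices for this example.

```python
import torch

n_samples = 11                  # number of readings in t_u
n_val = int(0.2 * n_samples)    # 2 validation samples

generator = torch.Generator().manual_seed(0)   # assumed seed, for repeatability
shuffled_indices = torch.randperm(n_samples, generator=generator)

train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]

overlap = set(train_indices.tolist()) & set(val_indices.tolist())
print(train_indices, val_indices, overlap)
```

Note that the negative slicing relies on `n_val` being at least 1. For very small datasets where `int(0.2 * n_samples)` is 0, `shuffled_indices[:-0]` would be empty, so that case would need separate handling.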

In [None]:
train_t_u = t_u[train_indices]
train_t_c = t_c[train_indices]
val_t_u = t_u[val_indices]
val_t_c = t_c[val_indices]
train_t_un = 0.1 * train_t_u
val_t_un = 0.1 * val_t_u

In [None]:
def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u, train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        train_t_p = model(train_t_u, *params)
        train_loss = loss_fn(train_t_p, train_t_c)
        
        # Make sure we don't create computation graph for validation set
        with torch.no_grad():
            val_t_p = model(val_t_u, *params)
            val_loss = loss_fn(val_t_p, val_t_c)
            assert not val_loss.requires_grad
        
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()
        if epoch <= 3 or epoch % 500 == 0:
            print(f"Epoch {epoch}, Training loss {train_loss.item():.4f}, Validation loss {val_loss.item():.4f}")
    return params

In [None]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)
training_loop(
n_epochs = 3000,
optimizer = optimizer,
params = params,
train_t_u = train_t_un,
val_t_u = val_t_un,
train_t_c = train_t_c,
val_t_c = val_t_c)

In [None]:
# Forward pass that builds the graph only when is_train is True (uses the global params)
def calc_forward(t_u, t_c, is_train):
    with torch.set_grad_enabled(is_train):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
    return loss
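
The effect of `torch.set_grad_enabled` can be seen by calling the same forward function with both settings. This sketch uses small made-up tensors `x` and `target`, and the helper `calc_loss` is a hypothetical stand-in for `calc_forward`.

```python
import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])        # made-up inputs
target = torch.tensor([2.0, 4.0, 6.0])   # made-up targets

def calc_loss(is_train):
    # Gradient tracking is on only when is_train is True.
    with torch.set_grad_enabled(is_train):
        w, b = params
        pred = w * x + b
        return ((pred - target) ** 2).mean()

train_loss = calc_loss(True)
val_loss = calc_loss(False)

print(train_loss.requires_grad)  # the graph was recorded
print(val_loss.requires_grad)    # no graph was recorded
```

Both calls compute the same value; only the first one can be used with `backward()`.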