###### PyTorch’s autograd:
 You computed the gradient of a composition of functions (the model and the loss) with respect to their innermost parameters, w and b, by propagating derivatives backward via the chain rule. The basic requirement is that all the functions you’re dealing with are analytically differentiable. In this case, you can compute the gradient (which we called “the rate of change of the loss”) with respect to the parameters in one sweep. If you have a complicated model with millions of parameters, as long as the model is differentiable, computing the gradient of the loss with respect to the parameters amounts to writing the analytical expression for the derivatives and evaluating them once. Granted, writing the analytical expression for the derivatives of a deep composition of linear and nonlinear functions is neither easy nor quick.
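
As a concrete instance of the chain-rule computation described above, here is a minimal sketch of the analytical gradient for the linear model w * t_u + b under a mean squared error loss. It uses plain Python lists so that every step is visible. The helper name analytic_grad is an assumption made for illustration and is not part of the original notebook.

```python
# Thermometer data used throughout this section.
t_c = [0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0]
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]

def analytic_grad(t_u, t_c, w, b):
    """Gradient of mean((w * t_u + b - t_c) ** 2) with respect to w and b.

    Hypothetical helper written for this sketch.
    """
    n = len(t_u)
    # Outer function (the loss): d loss / d t_p for each sample.
    dloss_dtp = [2.0 * (w * u + b - c) / n for u, c in zip(t_u, t_c)]
    # Inner function (the model): d t_p / d w = t_u and d t_p / d b = 1.
    grad_w = sum(d * u for d, u in zip(dloss_dtp, t_u))
    grad_b = sum(dloss_dtp)
    return grad_w, grad_b

grad_w, grad_b = analytic_grad(t_u, t_c, w=1.0, b=0.0)
print(grad_w, grad_b)
```

At w = 1.0 and b = 0.0 this evaluates to roughly 4517.30 and 82.6, which agrees with the params.grad output that autograd produces later in this section.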


 This situation is where PyTorch tensors come to the rescue, with a PyTorch component called autograd. PyTorch tensors can remember where they come from in terms
 of the operations and parent tensors that originated them, and they can provide the
 chain of derivatives of such operations with respect to their inputs automatically. You
 won’t need to differentiate your model by hand; given a forward expression, no matter how
 nested, PyTorch provides the gradient of that expression with respect to its input
 parameters automatically. 

Rewrite the thermometer calibration code, this time using autograd:

In [22]:
import torch

In [23]:
t_c = [0.5,  14.0, 15.0, 28.0, 11.0,  8.0,  3.0, -4.0,  6.0, 13.0, 21.0]
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4] 
t_c = torch.tensor(t_c)
t_u = torch.tensor(t_u)

In [24]:
def model(t_u, w, b): 
    return w * t_u + b

In [25]:
def loss_fn(t_p, t_c):  
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()

Initialize a parameters tensor:


In [26]:
params = torch.tensor([1.0, 0.0], requires_grad=True) 

The requires_grad=True argument to the tensor constructor tells PyTorch to track the entire family tree of tensors resulting from operations on params.

 In other words, any tensor that has params as an ancestor has access to the
 chain of functions that were called to get from params to that tensor.

 If these
 functions are differentiable (and most PyTorch tensor operations are), the value of
 the derivative is automatically populated as a grad attribute of the params tensor. In general, all PyTorch tensors have an attribute named grad, normally None:


In [27]:
params.grad is None

True

All you have to do to populate it is start with a tensor with requires_grad set to True, call the model, compute the loss, and then call backward on the loss tensor:

In [28]:
loss = loss_fn(model(t_u, *params), t_c) 
loss.backward()

In [29]:
params.grad

tensor([4517.2969,   82.6000])

At this point, the grad attribute of params contains the derivatives of the loss with
 respect to each element of params.

 You could have any number of tensors with requires_grad set to True and any
 composition of functions. In this case, PyTorch would compute derivatives of the loss
 throughout the chain of functions (the computation graph) and accumulate their values in the grad attribute of those tensors (the leaf nodes of the graph).
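
The accumulation behavior described above can be seen on a tiny example. The following is a minimal sketch with two scalar leaf tensors; the names w, b, x and the toy loss are assumptions chosen for illustration only.

```python
import torch

# Two independent leaf tensors that autograd should track.
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(3.0)  # plain data, not tracked

y = w * x + b              # intermediate node: 2 * 3 + 1 = 7
loss = (y - 10.0) ** 2     # (7 - 10) ** 2 = 9
loss.backward()

# d loss / d w = 2 * (y - 10) * x = -18 and d loss / d b = 2 * (y - 10) = -6
first_grads = (w.grad.item(), b.grad.item())

# A second backward pass adds to the stored gradients instead of replacing them.
loss = (w * x + b - 10.0) ** 2
loss.backward()
second_grads = (w.grad.item(), b.grad.item())

print(w.is_leaf, y.is_leaf)   # leaf status of a parameter vs. an intermediate result
print(first_grads, second_grads)
```

Because the second call to backward doubles the stored values, the gradient has to be zeroed explicitly before each new backward pass in a training loop.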

In [30]:
def training_loop(n_epochs, learning_rate, params, t_u, t_c): 
    for epoch in range(1, n_epochs + 1): 
        # Zero the accumulated gradient. This could be done at any point
        # in the loop prior to calling loss.backward().
        if params.grad is not None:
            params.grad.zero_()
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
        loss.backward()
        # It’s somewhat cumbersome, but as you’ll see in
        # “Optimizers a la carte,” it’s not an issue in practice.
        params = (params - learning_rate * params.grad).detach().requires_grad_()
        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
    return params 

Notice that when you updated params, you also called an odd-looking .detach().requires_grad_(). Calling .detach() detaches the new params tensor from the computation graph associated
 with its update expression. This way, params effectively loses the
 memory of the operations that generated it. Then you can reenable tracking by calling .requires_grad_(), an in-place operation (see the trailing _) that reactivates autograd for the tensor.

 Now you can release the memory held by old versions of
 params and need to backpropagate through only your current weights. 
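
To make the effect of .detach().requires_grad_() visible, the following minimal sketch inspects a tensor before and after the call. The toy loss (params ** 2).sum() and the step size 0.1 are assumptions standing in for the real loss and learning rate.

```python
import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)
loss = (params ** 2).sum()   # toy loss standing in for loss_fn(model(...), t_c)
loss.backward()              # params.grad is now 2 * params = [2., 0.]

# The raw update is itself part of the graph: it remembers how it was computed.
updated = params - 0.1 * params.grad
print(updated.requires_grad, updated.is_leaf, updated.grad_fn is not None)

# detach() drops that history; requires_grad_() switches tracking back on in place.
new_params = updated.detach().requires_grad_()
print(new_params.requires_grad, new_params.is_leaf, new_params.grad_fn is None)
```

After the call, new_params is a fresh leaf tensor with no grad_fn, so the next backward pass stops at it instead of walking back through earlier iterations.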

In [31]:
t_un = t_u * 0.1

In [32]:
training_loop(n_epochs = 5000, learning_rate = 1e-2, 
              params = torch.tensor([1.0, 0.0], requires_grad=True),
              t_u = t_un,  
              t_c = t_c)

Epoch 500, Loss 7.860116
Epoch 1000, Loss 3.828538
Epoch 1500, Loss 3.092191
Epoch 2000, Loss 2.957697
Epoch 2500, Loss 2.933134
Epoch 3000, Loss 2.928648
Epoch 3500, Loss 2.927830
Epoch 4000, Loss 2.927679
Epoch 4500, Loss 2.927652
Epoch 5000, Loss 2.927647


tensor([  5.3671, -17.3012], requires_grad=True)

You get the same result that you got previously. Good for you! Although you’re capable of computing derivatives by hand, you no longer need to.

###### Optimizers a la carte :

You don’t have to update every parameter in your model yourself. The
 torch module has an optim submodule where you can find classes that implement different optimization algorithms. Here’s an abridged listing:


In [33]:
import torch.optim as optim

In [34]:
dir(optim)

['ASGD',
 'Adadelta',
 'Adagrad',
 'Adam',
 'AdamW',
 'Adamax',
 'LBFGS',
 'Optimizer',
 'RMSprop',
 'Rprop',
 'SGD',
 'SparseAdam',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'lr_scheduler']

Every optimizer constructor takes a list of parameters (that is, PyTorch tensors, typically
 with requires_grad set to True) as the first input. All parameters passed to the optimizer are retained inside the optimizer object so that the optimizer can update their
 values and access their grad attribute.

 Each optimizer exposes two methods: zero_grad and step. The former zeros the
 grad attribute of all the parameters passed to the optimizer upon construction. The
 latter updates the value of those parameters according to the optimization strategy
 implemented by the specific optimizer.
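
The division of labor between zero_grad and step can be sketched with a toy class. TinySGD below is a hypothetical stand-in written for illustration; it is not the torch.optim implementation, and the real optimizers handle many more cases. In particular, recent PyTorch versions may reset gradients to None rather than to zeros.

```python
import torch

class TinySGD:
    """Hypothetical stand-in for an optimizer; not the torch.optim implementation."""

    def __init__(self, params, lr):
        self.params = list(params)   # keep references to the very same tensors
        self.lr = lr

    def zero_grad(self):
        # Reset accumulated gradients before the next backward pass.
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()

    def step(self):
        # Update values in place without recording the update in the graph.
        with torch.no_grad():
            for p in self.params:
                if p.grad is not None:
                    p -= self.lr * p.grad

params = torch.tensor([1.0, 0.0], requires_grad=True)
optimizer = TinySGD([params], lr=0.1)

loss = (params ** 2).sum()   # toy loss; its gradient is 2 * params
loss.backward()
optimizer.step()             # params becomes [1.0 - 0.1 * 2.0, 0.0]
optimizer.zero_grad()        # params.grad is reset to zeros
print(params, params.grad)
```

Because the class stores references rather than copies, the tensor named params outside the optimizer changes when step is called.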

Now create params and instantiate a gradient descent optimizer:


In [35]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-5
optimizer = optim.SGD([params], lr=learning_rate) 

Here, SGD stands for Stochastic Gradient Descent. The optimizer itself performs vanilla gradient
 descent (as long as the momentum argument is set to 0.0, which is the default).

 The
 term stochastic comes from the fact that the gradient is typically obtained by averaging
 over a random subset of all input samples, called a minibatch.

The optimizer itself,
 however, doesn’t know whether the loss was evaluated on all the samples (vanilla) or a
 random subset thereof (stochastic), so the algorithm is the same in the two cases. 
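
A minimal sketch of how a minibatch gradient could be obtained on the thermometer data follows. The batch size of 4 is an assumed value chosen for illustration; the notebook itself always uses all samples.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

params = torch.tensor([1.0, 0.0], requires_grad=True)
batch_size = 4  # assumed value for illustration

# Draw a random subset of sample indices: this is the minibatch.
batch_indices = torch.randperm(t_u.shape[0])[:batch_size]

# The loss, and therefore the gradient, is averaged over the minibatch only.
batch_loss = loss_fn(model(t_u[batch_indices], *params), t_c[batch_indices])
batch_loss.backward()
print(batch_indices, params.grad)
```

The gradient changes from draw to draw because a different subset is selected each time, which is where the word stochastic comes from.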

In [36]:
# take your new optimizer for a spin:

t_p = model(t_u, *params) 
loss = loss_fn(t_p, t_c) 
loss.backward()
optimizer.step()
params

tensor([ 9.5483e-01, -8.2600e-04], requires_grad=True)

###### Note:
The value of params was updated when step was called, and you didn’t have to touch
 it yourself! What happened was that the optimizer looked into params.grad and
 updated params by subtracting learning_rate times grad from it, exactly as in your
 former hand-rolled code. 

 Here’s the loop-ready code, with the extra zero_grad in the right spot (before the
 call to backward):


In [37]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)
t_p = model(t_un, *params) 
loss = loss_fn(t_p, t_c)
optimizer.zero_grad()
loss.backward()
optimizer.step()
params

tensor([1.7761, 0.1064], requires_grad=True)

 All you have to do is provide a list of params to the optimizer (that list can be extremely
 long, as needed for deep neural network models) and then forget about the details. Update your training loop accordingly:


In [38]:
def training_loop(n_epochs, optimizer, params, t_u, t_c): 
    for epoch in range(1, n_epochs + 1):  
        t_p = model(t_u, *params)  
        loss = loss_fn(t_p, t_c)
        optimizer.zero_grad() 
        loss.backward()     
        optimizer.step()
        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
    return params

In [39]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
# It’s important that both params here are the same object; otherwise,
# the optimizer won’t know what parameters the model used.
optimizer = optim.SGD([params], lr=learning_rate)

training_loop(n_epochs = 5000, 
              optimizer = optimizer, 
              params = params, 
              t_u = t_un,
              t_c = t_c)

Epoch 500, Loss 7.860116
Epoch 1000, Loss 3.828538
Epoch 1500, Loss 3.092191
Epoch 2000, Loss 2.957697
Epoch 2500, Loss 2.933134
Epoch 3000, Loss 2.928648
Epoch 3500, Loss 2.927830
Epoch 4000, Loss 2.927679
Epoch 4500, Loss 2.927652
Epoch 5000, Loss 2.927647


tensor([  5.3671, -17.3012], requires_grad=True)

###### Analysis:
Again, you get the same result as before. Great. You have further confirmation that
 you know how to descend a gradient by hand! To test more optimizers, all you have to
 do is instantiate a different optimizer, such as Adam, instead of SGD. The rest of the
 code stays as is. This stuff is pretty handy.

We won’t go into much detail on Adam, but it suffices to say that it’s a more sophisticated optimizer in which the learning rate is set adaptively. In addition, it’s a lot less
 sensitive to the scaling of the parameters, so insensitive that you can go back to using the original (non-normalized) input t_u and even increase the learning rate to 1e-1. Adam won’t even blink:

 

In [40]:
params = torch.tensor([1.0, 0.0], requires_grad=True) 
learning_rate = 1e-1 
optimizer = optim.Adam([params], lr=learning_rate) 

In [41]:
training_loop(n_epochs = 2000,
              optimizer = optimizer,
              params = params,  
              t_u = t_u,  # back to original input
              t_c = t_c)


Epoch 500, Loss 7.612903
Epoch 1000, Loss 3.086700
Epoch 1500, Loss 2.928578
Epoch 2000, Loss 2.927646


tensor([  0.5367, -17.3021], requires_grad=True)

###### Analysis:
The optimizer isn’t the only flexible part of your training loop. Turn your attention to
 the model. To train a neural network on the same data and the same loss, all you’d
 need to change is the model function. Doing this wouldn’t make sense in this case,
 because you know that converting Celsius to Fahrenheit amounts to a linear transformation. Neural networks allow you to remove your arbitrary assumptions about the
 shape of the function you should be approximating. Even so, neural networks manage
 to be trained even when the underlying processes are highly nonlinear (such as in the
 case of describing an image with a sentence).
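
To illustrate that only the model function needs to change, here is a minimal sketch that swaps in a different model while keeping the loss and the optimizer loop untouched. The quadratic_model function, its three parameters, and the learning rate of 1e-4 are assumptions made for this sketch; a quadratic is not a neural network, it merely stands in for "some other model function".

```python
import torch
import torch.optim as optim

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])
t_un = 0.1 * t_u

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

# Hypothetical replacement model: only this function and the parameter count change.
def quadratic_model(t_u, w2, w1, b):
    return w2 * t_u ** 2 + w1 * t_u + b

params = torch.tensor([0.0, 1.0, 0.0], requires_grad=True)
optimizer = optim.SGD([params], lr=1e-4)  # small rate chosen for this sketch

initial_loss = float(loss_fn(quadratic_model(t_un, *params), t_c))
for epoch in range(200):
    loss = loss_fn(quadratic_model(t_un, *params), t_c)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
final_loss = float(loss_fn(quadratic_model(t_un, *params), t_c))
print(initial_loss, final_loss)
```

The loop body is the same zero_grad, backward, step sequence used for the linear model; autograd works out the new derivatives on its own.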

 We’ve touched on a lot of the essential concepts that will enable you to train complicated deep learning models while knowing what’s going on under the hood: backpropagation to estimate gradients, autograd, and optimizing weights of models by
 using gradient descent or other optimizers. We don’t have a whole lot more to cover.
 The rest is mostly filling in the blanks, however extensive they are.

Next, we discuss how to split up samples, which sets up a perfect use case for learning to control autograd better.


<<<<<<<<<<<<<<<<<<<<<<<<<<< End >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>.

###### Training, validation, and overfitting:

 A deep neural network can potentially approximate complicated functions, provided that the number of neurons—and, therefore, parameters—is high enough. The
 fewer the parameters, the simpler the shape of the function your network will be able
 to approximate. So here’s rule one:
 * If the training loss isn’t decreasing, chances are that the model is too simple for the data.
 * The other possibility is that your data doesn’t contain meaningful information that explains the output.
 

If the nice folks at the
 shop sold you a barometer instead of a thermometer, you’d have little chance to predict temperature in Celsius from pressure alone, even if you used the latest neural network architecture from Quebec:
  https://www.umontreal.ca/en/artificialintelligence

What about the validation set? Well, if the loss evaluated in the validation set
 doesn’t decrease along with the training loss, your model is improving its fit of the samples it’s seeing during training, but it isn’t generalizing to samples outside this precise
 set. As soon as you evaluate the model at new, previously unseen points, the values of
 the loss function are poor. Here’s rule two:
 * if the training loss and the validation loss diverge, you’re overfitting. 

What’s the cure, though? Good question. Overfitting looks like a problem of making
 sure that the behavior of the model in between data points is sensible for the process
 you’re trying to approximate. First, you should make sure that you get enough data for
 the process. If you collected data from a sinusoidal process by sampling it regularly at
 a low frequency, you’d have a hard time fitting a model to it. 
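
The sinusoid case can be made concrete with a few lines of plain Python. This is a minimal sketch; the function name sample_sine and the chosen frequencies are assumptions for illustration.

```python
import math

def sample_sine(frequency_hz, sampling_rate_hz, n_samples):
    """Sample sin(2 * pi * f * t) at regular intervals of 1 / sampling_rate_hz seconds."""
    return [
        math.sin(2.0 * math.pi * frequency_hz * k / sampling_rate_hz)
        for k in range(n_samples)
    ]

# One sample per period: every sample lands on the same phase of the wave.
sparse = sample_sine(frequency_hz=1.0, sampling_rate_hz=1.0, n_samples=8)

# Twenty samples per period: the oscillation is clearly visible in the data.
dense = sample_sine(frequency_hz=1.0, sampling_rate_hz=20.0, n_samples=20)

print(max(abs(v) for v in sparse))  # close to zero: the data looks like a flat line
print(max(abs(v) for v in dense))   # close to one: the full amplitude is captured
```

With the sparse samples, a model has no evidence that the process oscillates at all, so no amount of training would recover the behavior between the points.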

> Avoid Overfitting:

Assuming that you have enough data points, you should make sure that the model
 that’s capable of fitting the training data is as regular as possible between the data
 points. You have several ways to achieve this goal. One way is to add so-called penalization terms to the loss function to make it cheaper for the model to behave more
 smoothly and change more slowly (up to a point). Another way is to add noise to the
 input samples, to artificially create new data points between training data samples and
 force the model to try to fit them too.
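
As one possible form of a penalization term, the sketch below adds an L2 penalty on the parameters to the mean squared error. The text does not say which penalty is meant, so the L2 form, the function name penalized_loss, and the weight 0.1 are all assumptions.

```python
import torch

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

def penalized_loss(t_p, t_c, params, penalty_weight):
    """Mean squared error plus an L2 penalty on the parameters (one common choice)."""
    return loss_fn(t_p, t_c) + penalty_weight * (params ** 2).sum()

# Made-up values, used only to check the arithmetic.
params = torch.tensor([2.0, -1.0], requires_grad=True)
t_p = torch.tensor([1.0, 3.0])
t_c = torch.tensor([0.0, 1.0])

plain = loss_fn(t_p, t_c)                                          # (1 + 4) / 2 = 2.5
penalized = penalized_loss(t_p, t_c, params, penalty_weight=0.1)   # 2.5 + 0.1 * 5 = 3.0
print(float(plain), float(penalized))
```

Large parameter values now cost extra, so the optimizer is nudged toward solutions with smaller weights whenever they fit the data about as well.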

 Several other ways are somewhat related to the
 preceding ones. But the best favor you can do for yourself, at least as a first move, is to
 make your model simpler. From an intuitive standpoint, a simpler model may not fit
 the training data as perfectly as a more complicated model would do, but it will likely
 behave more regularly between data points. 

###### Analyze:
You’ve got some nice tradeoffs here. On one hand, you need the model to have
 enough capacity for it to fit the training set. On the other hand, you need the model
 to avoid overfitting. Therefore, the process for choosing the right size of a neural network model, in terms of parameters, is based on two steps: increase the size until it fits
 and then scale it down until it stops overfitting. 

 You can split the
 data into a training set and a validation set by shuffling t_u and t_c in the same way
 and then splitting the resulting shuffled tensors into two parts. 
 
 Shuffling the elements of a tensor amounts to finding a permutation of its indices.
 The randperm function does this:


In [42]:
n_samples = t_u.shape[0] 
n_val = int(0.2 * n_samples)

In [43]:
n_samples, n_val

(11, 2)

In [44]:
shuffled_indices = torch.randperm(n_samples)

In [45]:
# splitting data into training and validation
train_indices = shuffled_indices[:-n_val] 
val_indices = shuffled_indices[-n_val:]


In [46]:
train_indices, val_indices 

(tensor([ 9,  0, 10,  3,  4,  2,  7,  8,  5]), tensor([1, 6]))

You get index tensors that you can use to build training and validation sets starting
 from the data tensors:


In [49]:
# training input and desired output 
train_t_u = t_u[train_indices] 
train_t_c = t_c[train_indices]
#  validation input and desired output
val_t_u = t_u[val_indices]
val_t_c = t_c[val_indices]

In [50]:
# normalize training and validation
train_t_un = 0.1 * train_t_u 
val_t_un = 0.1 * val_t_u

Your training loop doesn’t really change. You just want to additionally evaluate the validation loss at every
 epoch to have a chance to recognize whether you’re overfitting:


In [51]:
def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u, train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):   
        train_t_p = model(train_t_u, *params) 
        train_loss = loss_fn(train_t_p, train_t_c)
        val_t_p = model(val_t_u, *params)
        val_loss = loss_fn(val_t_p, val_t_c)
        optimizer.zero_grad()  
        train_loss.backward() 
        optimizer.step()
        if epoch <= 3 or epoch % 500 == 0:   
            print('Epoch {}, Training loss {}, Validation loss {}'.format(epoch, float(train_loss), float(val_loss)))
    return params 

In [52]:
params = torch.tensor([1.0, 0.0], requires_grad=True) 
learning_rate = 1e-2 
optimizer = optim.SGD([params], lr=learning_rate)


In [53]:
training_loop(n_epochs = 3000,
              optimizer = optimizer, 
              params = params,
              train_t_u = train_t_un, 
              val_t_u = val_t_un,  
              train_t_c = train_t_c,
              val_t_c = val_t_c)


Epoch 1, Training loss 90.34750366210938, Validation loss 35.44009780883789
Epoch 2, Training loss 40.74310302734375, Validation loss 12.03737735748291
Epoch 3, Training loss 34.29985046386719, Validation loss 11.497885704040527
Epoch 500, Training loss 8.272359848022461, Validation loss 1.0062336921691895
Epoch 1000, Training loss 3.6953022480010986, Validation loss 2.0580081939697266
Epoch 1500, Training loss 2.8709726333618164, Validation loss 3.6276702880859375
Epoch 2000, Training loss 2.72251296043396, Validation loss 4.496127128601074
Epoch 2500, Training loss 2.695772886276245, Validation loss 4.9011125564575195
Epoch 3000, Training loss 2.690957546234131, Validation loss 5.079550743103027


tensor([  5.5010, -18.3851], requires_grad=True)

Here, we’re not being entirely fair to the model. The validation set is small, so the validation loss will be meaningful only up to a point. In any case, note that by the end of training the validation
 loss is higher than your training loss, although not by an order of magnitude. The fact
 that a model performs better on the training set is expected, since the model parameters are being shaped by the training set. Your main goal is to see both the training loss and the validation loss decreasing. In this particular run, the validation loss rises after epoch 500 while the training loss keeps falling, but with only two validation samples that trend is hard to interpret.

##### Note:
Although ideally, both losses would be
 roughly the same value, as long as validation loss stays reasonably close to the training
 loss, you know that your model is continuing to learn generalized things about your
 data.
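
To compare the two loss curves in the training log more systematically, a simple check is to flag checkpoints where the training loss fell while the validation loss rose. The following is a minimal sketch in plain Python; the helper name diverging_epochs is hypothetical, the numbers are rounded from the log printed earlier, and with only two validation samples the result should be read with caution.

```python
# Checkpoints copied (rounded) from the training log printed above.
epochs = [500, 1000, 1500, 2000, 2500, 3000]
train_losses = [8.2724, 3.6953, 2.8710, 2.7225, 2.6958, 2.6910]
val_losses = [1.0062, 2.0580, 3.6277, 4.4961, 4.9011, 5.0796]

def diverging_epochs(epochs, train_losses, val_losses):
    """Return checkpoints where training loss fell while validation loss rose."""
    flagged = []
    for i in range(1, len(epochs)):
        train_improved = train_losses[i] < train_losses[i - 1]
        val_worsened = val_losses[i] > val_losses[i - 1]
        if train_improved and val_worsened:
            flagged.append(epochs[i])
    return flagged

print(diverging_epochs(epochs, train_losses, val_losses))
```

A rule this crude reacts to noise as readily as to real overfitting, so in practice it would be paired with a larger validation set.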

 #####  Nits in autograd and switching it off :
 PyTorch allows you to switch off autograd when you don’t
 need it by using the torch.no_grad context manager. You won’t see any meaningful
 advantage in terms of speed or memory consumption on your small problem. But for
 larger models, the differences can add up. You can make sure that this context manager
 works by checking the value of the requires_grad attribute on the val_loss tensor:

In [58]:
def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u, train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):  
        train_t_p = model(train_t_u, *params)   
        train_loss = loss_fn(train_t_p, train_t_c)
        with torch.no_grad(): # Context manager here.
            val_t_p = model(val_t_u, *params)
            val_loss = loss_fn(val_t_p, val_t_c)
            # Tensors computed inside this block have
            # requires_grad set to False.
            assert not val_loss.requires_grad
        optimizer.zero_grad()  
        train_loss.backward() 
        optimizer.step() 

Using the related set_grad_enabled context, you can also condition code to run with
 autograd enabled or disabled, according to a Boolean expression, typically indicating
 whether you’re running in training or inference. You could define a calc_forward function that takes data as input and runs model and loss_fn with or without autograd, according to a Boolean is_train argument:

In [56]:
def calc_forward(t_u, t_c, is_train):
    with torch.set_grad_enabled(is_train):   
        t_p = model(t_u, *params)       
        loss = loss_fn(t_p, t_c) 
        return loss 
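
A self-contained usage sketch of calc_forward follows. It repeats the definitions so that it can run on its own; the two demo samples are made-up values used only to exercise the function.

```python
import torch

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

params = torch.tensor([1.0, 0.0], requires_grad=True)

def calc_forward(t_u, t_c, is_train):
    # Autograd records operations only when is_train is True.
    with torch.set_grad_enabled(is_train):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
    return loss

# Two made-up samples, used only to exercise the function.
t_u_demo = torch.tensor([3.0, 5.0])
t_c_demo = torch.tensor([1.0, 2.0])

train_loss = calc_forward(t_u_demo, t_c_demo, is_train=True)
val_loss = calc_forward(t_u_demo, t_c_demo, is_train=False)
print(train_loss.requires_grad, val_loss.requires_grad)
```

Both calls compute the same loss value; only the training call keeps the graph needed for backward.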

<---------------------- End Chapter # 4 --------------------------------->

###### Summary:

 Linear models are the simplest reasonable model to use to fit data. 

 Convex optimization techniques can be used for linear models, but they don’t generalize to neural networks.

 Deep learning can be used for generic models that aren’t engineered to solve a specific task but can be adapted automatically to specialize in the problem at hand. 

 Learning algorithms amount to optimizing parameters of models based on
 observations. A loss function is a measure of the error in carrying out a task, such as the error between predicted outputs and measured values. The goal is to get the loss function as low as possible.
 
 The rate of change of the loss function with respect to model parameters can be used to update the same parameters in the direction of decreasing loss.

 The optim module in PyTorch provides a collection of ready-to-use optimizers for updating parameters and minimizing loss functions. 

 Optimizers use the gradients that PyTorch’s autograd feature computes for
 each parameter, depending on how that parameter contributed to the final output. This feature allows users to rely on the dynamic computation graph during complex forward passes.

 Context managers such as with torch.no_grad(): can be used to control autograd behavior. 

 Data is often split into separate sets of training samples and validation samples, allowing a model to be evaluated on data it wasn’t trained on. 

 Overfitting a model happens when the model’s performance continues to
 improve on the training set but degrades on the validation set. This situation usually occurs when the model doesn’t generalize and instead memorizes the desired outputs for the training set.