## Learning Algorithm

 - Optimizing the parameters of a simple linear model by "hand" 

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import torch
torch.set_printoptions(edgeitems=2, linewidth=75)

In [2]:
# Creating input data - tensors from lists

# Celsius
t_c = [0.5,  14.0, 15.0, 28.0, 11.0,  8.0,  3.0, -4.0,  6.0, 13.0, 21.0]

# Unknown
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]

t_c = torch.tensor(t_c)
t_u = torch.tensor(t_u)

print(t_u)

tensor([35.7000, 55.9000, 58.2000, 81.9000, 56.3000, 48.9000, 33.9000,
        21.8000, 48.4000, 60.4000, 68.4000])


In [None]:
# Quick inspection - linear assumption
plt.scatter(t_u, t_c)

### Next Steps

Linear model: y = wx + b
 - w = weight
 - b = bias 
 
The two may be linearly related: we can derive t_c by multiplying t_u by some coefficient and adding a constant
 - t_c = w * t_u + b
 
### Parameter Estimation - w & b

 - Overview: We have a model with some unknown parameters, and we need to estimate those parameters so that the error between the predicted output and the actual output is as small as possible 
 - Error metric:
     - Squared error per sample --> averaged over all samples: Mean Squared Error (MSE)
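
Written out for the linear model, the mean squared error over the temperature pairs looks as follows. The symbols $N$ (number of samples) and $t_{p,i}$, $t_{u,i}$, $t_{c,i}$ (the i-th prediction, unknown-unit reading, and Celsius reading) are notation introduced here for illustration only:

$$
L(w, b) = \frac{1}{N} \sum_{i=1}^{N} \left( t_{p,i} - t_{c,i} \right)^2, \qquad t_{p,i} = w \, t_{u,i} + b
$$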

In [None]:
# Defining our model 
def model(t_u, w, b):
    return w * t_u + b

# Defining our loss function - mean squared error (symmetric in its two arguments)
def loss_func(t_p, t_c):
    squared_diff = (t_p - t_c)**2
    return squared_diff.mean()

# Instantiating our model parameters - w & b
w = torch.ones(())
b = torch.ones(())

# Calling our model 
t_p = model(t_u, w, b)

# Loss function
loss = loss_func(t_p, t_c)
print(loss)

### But... our loss is not at a minimum

How do we estimate the parameters w & b such that our loss reaches a global minimum?
 - **Gradient Descent**: Optimizing the loss function with respect to the parameters
     - Compute the partial derivative of the loss with respect to each parameter
     - Each derivative is a rate of change: how much the loss changes when that parameter changes by a small amount

If the rate of change is **negative**, we need to **increase** w to reduce the loss; if the rate of
change is **positive**, we need to **decrease** w

But by how much? The scaling factor applied to the rate of change is known as the **learning rate**:
 - Apply a change to w that is proportional to the rate of change
 - It is also wise to change the parameters slowly 
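
The sign rule and the learning-rate scaling can be tried on a toy problem first. The sketch below is dependency-free Python and uses a hypothetical one-parameter loss, `toy_loss`, whose minimum sits at w = 3; the names `toy_loss` and `rate_of_change` are illustrative and not part of the notebook.

```python
# Toy loss with a single parameter; its minimum is at w = 3 (illustrative choice).
def toy_loss(w):
    return (w - 3.0) ** 2


def rate_of_change(f, w, delta=1e-3):
    # Central difference: estimates the slope of f around w.
    return (f(w + delta) - f(w - delta)) / (2.0 * delta)


learning_rate = 0.1
w = 0.0

slope_before = rate_of_change(toy_loss, w)   # negative here, so w must increase
w_new = w - learning_rate * slope_before     # step against the slope

print(slope_before, w_new, toy_loss(w), toy_loss(w_new))
```

Because the slope is negative at w = 0, subtracting `learning_rate * slope_before` moves w upward, toward the minimum, and the loss drops.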

In [5]:
delta = 0.1

# Rate of change for parameter w
loss_rate_change_w = \
    (loss_func(model(t_u, w + delta, b), t_c) - 
     loss_func(model(t_u, w - delta, b), t_c)) / (2.0 * delta)

print(loss_rate_change_w)

learning_rate = 1e-2

# Updating the w parameter
w = w - learning_rate * loss_rate_change_w

# Rate of change for parameter b
loss_rate_change_b = \
    (loss_func(model(t_u, w, b + delta), t_c) - 
     loss_func(model(t_u, w, b - delta), t_c)) / (2.0 * delta)

print(loss_rate_change_b)

learning_rate = 1e-2

# Updating the b parameter
b = b - learning_rate * loss_rate_change_b

# This explains our parameter updating step of a learning algorithm 
# We could sit here and continually type in new parameters for w & b until our loss is minimized

tensor(4620.8970)
tensor(-4707.5000)


### Derivatives

Computing the derivative of the loss with respect to each parameter
 - Using the chain rule: d loss / d w = (d loss / d t_p) * (d t_p / d w), and likewise for b

Model:
- t_p = w * t_u + b
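
For reference, the chain rule for the mean squared error of the linear model can be written out as below. Here $L = \frac{1}{N}\sum_{i}(t_{p,i} - t_{c,i})^2$ with prediction $t_{p,i} = w \, t_{u,i} + b$; the index $i$ and the sample count $N$ are notation introduced for illustration.

$$
\frac{\partial L}{\partial w} = \sum_{i=1}^{N} \frac{\partial L}{\partial t_{p,i}} \, \frac{\partial t_{p,i}}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} \left( t_{p,i} - t_{c,i} \right) t_{u,i}
$$

$$
\frac{\partial L}{\partial b} = \sum_{i=1}^{N} \frac{\partial L}{\partial t_{p,i}} \, \frac{\partial t_{p,i}}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} \left( t_{p,i} - t_{c,i} \right)
$$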

In [6]:
def dloss_func(t_p, t_c):
    dsq_loss = 2 * (t_p - t_c) / t_p.size(0)
    return dsq_loss

def dmodel_w(t_u, w, b):
    return t_u

def dmodel_b(t_u, w, b):
    return 1.0

def gradient_func(t_u, t_c, t_p, w, b):
    dloss_dp = dloss_func(t_p, t_c)
    # Chain rule
    dloss_dw = dloss_dp * dmodel_w(t_u, w, b)
    dloss_db = dloss_dp * dmodel_b(t_u, w, b)
    
    return torch.stack([dloss_dw.sum(), dloss_db.sum()])

### Iterating to fit the model - the training loop

One training iteration over the full set of training samples = one epoch

In [7]:
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    
    for epoch in range(1, n_epochs + 1):
        
        w, b = params
        
        ## FORWARD PASS ##
        t_p = model(t_u, w, b)
        loss = loss_func(t_p, t_c)
        
        ## BACKWARD PASS ## - grad holds one scalar per parameter: the partial derivative of the loss with respect to w and to b
        grad = gradient_func(t_u, t_c, t_p, w, b)
        # Updating the parameters of the model after a backward pass 
        params = params - learning_rate * grad
        
        print('Epoch {}, Loss {}, w {}, b {}, grad {}'.format(epoch, float(loss), params[0], params[1], grad))
        
    return params

### Overtraining

With a learning rate of 1e-2, our training process blows up. Params receives updates that are too large, so its values start oscillating back and forth: each update overshoots the minimum, and the next update overcorrects even more

Fix?
 - Limit the magnitude of learning_rate * grad with a smaller learning rate
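
The overshoot-and-overcorrect pattern is easy to reproduce on a toy problem. This is a minimal, dependency-free sketch that minimizes the hypothetical loss w**2 (gradient 2*w); the function name `gradient_descent` and both learning rates are chosen only for illustration.

```python
def gradient_descent(learning_rate, n_steps=10, w=1.0):
    """Minimize the toy loss w**2 (gradient 2*w) and return every visited w."""
    history = [w]
    for _ in range(n_steps):
        grad = 2.0 * w
        w = w - learning_rate * grad
        history.append(w)
    return history


too_large = gradient_descent(learning_rate=1.1)   # each step multiplies w by -1.2
small = gradient_descent(learning_rate=0.1)       # each step multiplies w by 0.8

print([round(w, 3) for w in too_large[:5]])
print([round(w, 3) for w in small[:5]])
```

With the larger rate, w flips sign on every step and grows in magnitude; with the smaller rate, it shrinks steadily toward the minimum at 0.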
 

In [None]:
training_loop(100, 1e-2, torch.tensor([1.0, 0.0]), t_u, t_c)

In [None]:
# Smaller learning rate
training_loop(100, 1e-4, torch.tensor([1.0, 0.0]), t_u, t_c)

### Normalizing Inputs

Notice that the loss is decreasing very slowly, meaning the updates to the parameters are too small
 - We could make the learning rate adaptive

But the gradient term in our update is itself problematic:
 - The first-epoch gradient for the weight is about 50 times larger than the gradient for the bias
     - Thus, the weight and the bias live in differently scaled spaces, and a single learning rate cannot suit both
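
To see the scale mismatch in numbers, the sketch below recomputes the first-epoch gradient at w = 1, b = 0 in plain Python, once for the raw inputs and once for inputs rescaled by 0.1. The helper `mse_gradient` is a hypothetical restatement of the analytic gradient, written without PyTorch so it can run on its own.

```python
t_c = [0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0]
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]


def mse_gradient(xs, ys, w, b):
    """Analytic gradient of the mean squared error of w * x + b."""
    n = len(xs)
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    grad_b = 2.0 / n * sum(residuals)
    return grad_w, grad_b


grad_w_raw, grad_b_raw = mse_gradient(t_u, t_c, w=1.0, b=0.0)

t_un = [0.1 * x for x in t_u]  # rescale the input by a factor of 0.1
grad_w_scaled, grad_b_scaled = mse_gradient(t_un, t_c, w=1.0, b=0.0)

print("raw ratio   ", abs(grad_w_raw / grad_b_raw))
print("scaled ratio", abs(grad_w_scaled / grad_b_scaled))
```

After rescaling, the two gradient components are much closer in magnitude, so one learning rate can serve both parameters.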

In [None]:
# Normalization - here a simple rescaling of the input by a factor of 0.1
t_un = 0.1 * t_u
print(t_u)
print(t_un)

# Notice how my gradients have changed
training_loop(100, 1e-2, torch.tensor([1.0, 0.0]), t_un, t_c)

In [None]:
params = training_loop(5000, 1e-2, torch.tensor([1.0, 0.0]), t_un, t_c)

In [None]:
# Plotting
t_p = model(t_un, *params)

plt.plot(t_u.numpy(), t_p.detach().numpy())
plt.plot(t_u.numpy(), t_c.numpy(), 'o')

### Torch autograd

PyTorch tensors can remember where they come from, in terms of the operations and parent tensors that originated them. They can then automatically provide the chain of derivatives of such operations with respect to their inputs 
 - requires_grad=True tells PyTorch to track the entire family tree of tensors resulting from operations on params
     - Any tensor that has params as an ancestor will have access to the chain of functions that were called to get from params to that tensor 
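
A small sketch of this tracking, using hypothetical toy values (w = 2, b = 1, x = 3) chosen only to keep the arithmetic easy to check by hand. It relies on the standard tensor attributes `is_leaf`, `grad_fn`, and `grad`.

```python
import torch

# Hypothetical toy values, chosen only to keep the arithmetic easy to follow.
params = torch.tensor([2.0, 1.0], requires_grad=True)  # w = 2, b = 1
x = torch.tensor(3.0)

w, b = params            # unpacking yields tensors that still trace back to params
y = (w * x + b) ** 2     # (2 * 3 + 1) ** 2 = 49

print(params.is_leaf, params.grad)  # leaf tensor, no gradient stored yet
print(y.grad_fn)                    # the operation that produced y

y.backward()
# dy/dw = 2 * (w * x + b) * x = 42 and dy/db = 2 * (w * x + b) = 14
print(params.grad)
```

The gradient lands on `params`, the leaf tensor, even though the computation went through the intermediate tensors `w`, `b`, and `y`.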

In [None]:
# The model and loss function remain the same, but our parameter initialization IS DIFFERENT
params = torch.tensor([1.0, 0.0], requires_grad=True)
# params.grad is None until backward() is called

# Forward Pass
loss = loss_func(model(t_u, *params), t_c)
# Backward Pass
loss.backward()

params.grad

### Accumulating grad functions

Calling backward makes derivatives **accumulate** at leaf nodes. We need to **zero** the gradient explicitly after using it for parameter updates
 - If backward was called earlier, and the loss is then evaluated again and backward is called again, the new gradient is added to the one already stored at each leaf, which gives an incorrect value
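
The accumulation is easy to observe directly. The following sketch uses a hypothetical one-sample loss, `toy_loss`, and calls `backward()` twice without zeroing in between, then once more after zeroing.

```python
import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)
x = torch.tensor(2.0)


def toy_loss(p):
    w, b = p
    return (w * x + b) ** 2   # hypothetical loss, used only for this demo


toy_loss(params).backward()
first = params.grad.clone()        # gradient after one backward call

toy_loss(params).backward()        # no zeroing in between
accumulated = params.grad.clone()  # the second gradient was added to the first

params.grad.zero_()
toy_loss(params).backward()
after_zero = params.grad.clone()   # matches the first gradient again

print(first, accumulated, after_zero)
```

Since the parameters did not change between calls, the accumulated gradient is exactly twice the correct one, which is why the training loop zeroes it on every iteration.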


In [None]:
if params.grad is not None:
    params.grad.zero_()

In [None]:
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    
    for epoch in range(1, n_epochs + 1):
        
        # Zeroing the gradient
        if params.grad is not None:
            params.grad.zero_()
        
        ## FORWARD PASS ##
        t_p = model(t_u, *params)
        loss = loss_func(t_p, t_c)
        
        ## BACKWARD PASS ## 
        loss.backward()
        
        # Updating params in place - subtracting learning_rate * grad, with autograd tracking switched off
        with torch.no_grad():
            params -= learning_rate * params.grad
        
        
        if epoch % 500 == 0:
            print('Epoch {}, Loss {}'.format(epoch, float(loss)))
        
    return params

In [None]:
training_loop(5000, 
              1e-2,
              torch.tensor([1.0, 0.0], requires_grad=True),
              t_un, 
              t_c)

### Optimizers

PyTorch optimizers abstract the optimization strategy away from user code:
 - This saves us from having to update every parameter of our model ourselves
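
As a sanity check on what the optimizer takes over, the sketch below performs one hand-written update and one `optim.SGD` step from the same starting point. The toy data and the `toy_loss` helper are hypothetical; plain SGD is assumed, with no momentum or weight decay.

```python
import torch
import torch.optim as optim

learning_rate = 0.1
x = torch.tensor([1.0, 2.0, 3.0])   # hypothetical toy data
y = torch.tensor([2.0, 4.0, 6.0])


def toy_loss(p):
    w, b = p
    return ((w * x + b - y) ** 2).mean()


# Manual update, as in the hand-written training loop.
manual = torch.tensor([1.0, 0.0], requires_grad=True)
toy_loss(manual).backward()
with torch.no_grad():
    manual -= learning_rate * manual.grad

# The same single step delegated to optim.SGD.
auto = torch.tensor([1.0, 0.0], requires_grad=True)
optimizer = optim.SGD([auto], lr=learning_rate)
optimizer.zero_grad()
toy_loss(auto).backward()
optimizer.step()

print(manual, auto)
```

Both tensors end up with the same values; the optimizer simply packages the zeroing and the update so that they work for any number of parameters.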

In [None]:
import torch.optim as optim
dir(optim)

In [None]:
# SGD optimizer
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-5
optimizer = optim.SGD([params], lr=learning_rate)

t_p = model(t_u, *params)
loss = loss_func(t_p, t_c)

optimizer.zero_grad()
loss.backward()

# The value of params is updated by calling .step()
optimizer.step()

params

In [None]:
def training_loop(n_epochs, optimizer, params, t_u, t_c):
    
    for epoch in range(1, n_epochs + 1):
        
        ## FORWARD PASS ##
        t_p = model(t_u, *params)
        loss = loss_func(t_p, t_c)
        
        ## BACKWARD PASS ## 
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if epoch % 500 == 0:
            print('Epoch {}, Loss {}'.format(epoch, float(loss)))
        
    return params

In [None]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)

training_loop(5000,
              optimizer,
              params, 
              t_un, 
              t_c)

## Linear Model using .nn module

Revisiting the code and linear model from the Jupyter notebook "Learning_Algorithm"

In [None]:
import torch.nn as nn

In [None]:
t_c = [0.5,  14.0, 15.0, 28.0, 11.0,  8.0,  3.0, -4.0,  6.0, 13.0, 21.0]
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
t_c = torch.tensor(t_c).unsqueeze(1) # add a feature dimension: shape (11,) -> (11, 1)
t_u = torch.tensor(t_u).unsqueeze(1) # add a feature dimension: shape (11,) -> (11, 1)

t_u.shape

In [None]:
n_samples = t_u.shape[0]
n_val = int(0.2 * n_samples)

shuffled_indices = torch.randperm(n_samples)

train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]

train_indices, val_indices

In [None]:
t_u_train = t_u[train_indices]
t_c_train = t_c[train_indices]

t_u_val = t_u[val_indices]
t_c_val = t_c[val_indices]

t_un_train = 0.1 * t_u_train
t_un_val = 0.1 * t_u_val

In [None]:
# Three arguments to nn.Linear - 1.) number of input features 2.) number of output features 3.) whether to include a bias (default=True)
linear_model = nn.Linear(1, 1)
linear_model(t_un_val)

In [None]:
print(linear_model.weight, '\n')
print(linear_model.bias)

### Batching Inputs

We have a model that takes one input and produces one output, but PyTorch nn.Module and its subclasses are designed to do so on multiple samples at the same time. To accommodate multiple samples, modules expect the zeroth dimension of the input to be the number of samples in the batch

Any module in nn is written to produce outputs for a batch of multiple inputs at the same time. Thus, assuming we need to run nn.Linear on 10 samples, we can create an input tensor of size B × Nin, where B is the size of the batch and Nin is the number of input features, and run it once through the model

Our data is a 1D tensor of 11 values, so to pass it as a batch we need to add an extra dimension, turning it into a matrix with samples in the rows and features in the columns
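
A short sketch of the shapes involved, using three of the temperature readings as a hypothetical mini-batch. The layer here is freshly created with random weights, so only the shapes and the row-by-row behaviour matter, not the values.

```python
import torch
import torch.nn as nn

values = torch.tensor([35.7, 55.9, 58.2])   # three samples, shape (3,)
batch = values.unsqueeze(1)                 # shape (3, 1): B = 3 samples, Nin = 1 feature

linear = nn.Linear(1, 1)                    # randomly initialized weight and bias
out = linear(batch)                         # one call handles the whole batch

print(values.shape, batch.shape, out.shape)

# Each output row depends only on the matching input row.
single = linear(batch[0:1])
print(torch.allclose(single, out[0:1]))
```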

In [None]:
t_c = [0.5,  14.0, 15.0, 28.0, 11.0,  8.0,  3.0, -4.0,  6.0, 13.0, 21.0]
print(t_c)
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]

t_c = torch.tensor(t_c).unsqueeze(1) 
print(t_c)
t_u = torch.tensor(t_u).unsqueeze(1) 

In [None]:
optimizer = optim.SGD(linear_model.parameters(),
                     lr=1e-2)

# Using parameters() method to ask any nn.Module for a list of parameters owned by it 
linear_model.parameters()
list(linear_model.parameters())

In [None]:
def training_loop(n_epochs, optimizer, model, loss_fn, t_u_train, t_u_val, t_c_train, 
                  t_c_val):
    
    for epoch in range(1, n_epochs + 1):
        
        ## FORWARD PASS ##
        t_p_train = model(t_u_train)
        loss_train = loss_fn(t_p_train, t_c_train)
        t_p_val = model(t_u_val)
        loss_val = loss_fn(t_p_val, t_c_val)
        
        ## BACKWARD PASS ## 
        optimizer.zero_grad()
        loss_train.backward()
        optimizer.step()
        
        if epoch % 500 == 0:
            print("Epoch {}".format(epoch), "Training Loss {}".format(loss_train.item()),
                  "Validation Loss {}".format(loss_val.item()))

In [None]:
linear_model = nn.Linear(1,1)
optimizer = optim.SGD(linear_model.parameters(), lr=1e-2)

training_loop(n_epochs=3000, 
              optimizer = optimizer,
              model = linear_model,
              loss_fn = nn.MSELoss(),
              t_u_train = t_un_train,
              t_u_val = t_un_val,
              t_c_train = t_c_train,
              t_c_val = t_c_val)

### Basic Neural Network 

Replacing our linear model with a neural network as our approximating function

Build the simplest possible neural network: a linear module, followed by an activation function, feeding into another linear module
 - Input (1d) --> Linear (30d) --> Activation Function (tanh) --> Linear (1d) --> Output

The model fans out from 1 input feature to 30 hidden features, passes them through a tanh activation, and linearly combines the resulting 30 numbers into 1 output feature
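
To make the fan-out and recombination concrete, the sketch below spells out the forward pass of such a network by hand and compares it with the module's own output. The hidden width `n_hidden` is an assumption, set to 30 to match the `nn.Sequential` cell in this notebook, and the two input values are hypothetical.

```python
import torch
import torch.nn as nn

n_hidden = 30  # assumed to match the nn.Sequential cell in this notebook
net = nn.Sequential(nn.Linear(1, n_hidden), nn.Tanh(), nn.Linear(n_hidden, 1))

x = torch.tensor([[2.0], [5.0]])   # hypothetical batch of two inputs, shape (2, 1)

# The same computation spelled out with the layers' own weights and biases.
hidden = torch.tanh(x @ net[0].weight.T + net[0].bias)   # shape (2, n_hidden)
manual_out = hidden @ net[2].weight.T + net[2].bias      # shape (2, 1)

n_params = sum(p.numel() for p in net.parameters())
print(manual_out.shape, n_params)
print(torch.allclose(manual_out, net(x), atol=1e-6))
```

The parameter count is `3 * n_hidden + 1`: one weight and one bias per hidden unit in the first layer, plus one weight per hidden unit and a single bias in the second.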

In [None]:
seq_model = nn.Sequential(nn.Linear(1, 30),
                          nn.Tanh(),
                          nn.Linear(30, 1))

seq_model

In [None]:
[param.shape for param in seq_model.parameters()]

In [None]:
optimizer = optim.SGD(seq_model.parameters(), lr=1e-3)

training_loop(
        n_epochs = 3000,
        optimizer = optimizer,
        model = seq_model,
        loss_fn = nn.MSELoss(),
        t_u_train = t_un_train,
        t_u_val = t_un_val,
        t_c_train = t_c_train,
        t_c_val = t_c_val)

In [None]:
t_range = torch.arange(20., 90.).unsqueeze(1)
fig = plt.figure(figsize=(10, 7), dpi=400)
plt.xlabel("Fahrenheit")
plt.ylabel("Celsius")
plt.plot(t_u.numpy(), t_c.numpy(), 'o')
plt.plot(t_range.numpy(), seq_model(0.1 * t_range).detach().numpy(), 'c-')
plt.plot(t_u.numpy(), seq_model(0.1 * t_u).detach().numpy(), 'kx')