##  Backprop training

In this notebook I do the backprop excercise manually to comaper performance of with- and without bias neural networks with manual and AutoGrad calucluation

Here I set stable weights and training set for each model so that the results would be comparable

In [86]:
batch_size = 64
input_size = 3
hidden_size = 2
output_size = 1

# Create random input and output data
x = torch.randn(batch_size, input_size)
y = torch.randn(batch_size, output_size)

# Randomly initialize weights
w1 = torch.randn(input_size, hidden_size)
w2 = torch.randn(hidden_size, output_size)

# Randomly initialize bias
b1 = torch.randn(hidden_size)
b2 = torch.randn(output_size) 

w1_2 = torch.clone(w1)
w2_2 = torch.clone(w2)
b1_2 = torch.clone(b1)
b2_2 = torch.clone(b2)

w1_3 = torch.clone(w1)
w2_3 = torch.clone(w2)
b1_3 = torch.clone(b1)
b2_3 = torch.clone(b2)
w1_3.requires_grad=True
w2_3.requires_grad=True
b1_3.requires_grad=True
b2_3.requires_grad=True

w1_4 = torch.clone(w1)
w2_4 = torch.clone(w2)
b1_4 = torch.clone(b1)
b2_4 = torch.clone(b2)
w1_4.requires_grad=True
w2_4.requires_grad=True
b1_4.requires_grad=True
b2_4.requires_grad=True


Firstly, a model without bias

In [87]:
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    #TODO
    h_1 = x.mm(w1)
    h_relu = h_1.clamp(min=0)
    out = h_relu.mm(w2)
    
    # Compute and print loss
    loss = (out - y).pow(2).sum().item()
    
    # Backward pass: 
    dloss_dout = 2 * (out - y)
    
    grad_w2 = h_relu.t().mm(dloss_dout) 
    
    grad_h_relu = dloss_dout.mm(w2.t())
    
    grad_h_relu[h_1 < 0] = 0
    
    grad_w1 = x.t().mm(grad_h_relu)
    
    
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    if t % 100 == 99:
        print('Loss on iteration {} = {}'.format(t, loss))
    

Loss on iteration 99 = 65.89350891113281
Loss on iteration 199 = 65.64342498779297
Loss on iteration 299 = 65.40166473388672
Loss on iteration 399 = 65.16791534423828
Loss on iteration 499 = 64.94190216064453


Bias included:

In [88]:
learning_rate = 1e-6

for t in range(500):
    # Forward pass: compute predicted y
    #TODO
    h_1 = x.mm(w1_2)
    h_1_b = h_1.add(b1_2)
    h_relu = h_1_b.clamp(min=0)
    h_2 = h_relu.mm(w2_2)
    out = h_2.add(b2_2)
    
    # Compute and print loss
    loss = (out - y).pow(2).sum().item()
    
    # Backward pass: 
    dloss_dout = 2 * (out - y)
    
    dout_dh2 =  1
    
    dout_db2 =  1

    grad_b2 = sum(dloss_dout*dout_db2)/64 #since bias is a vector, not a matrix, I use average to compute the gradient

    dh2_dw2 = h_relu

    grad_w2 = dh2_dw2.t().mm((dloss_dout * dout_dh2))

    dh2_dhrelu = w2_2 

    dh1b_db1 = 1 

    dh1b_dh1 = 1 
    
    dh1_dw1 = x

    grad_h_relu = (dloss_dout * dout_dh2).mm(dh2_dhrelu.t())
    grad_h_relu[h_1 < 0] = 0

    grad_b1 = sum(grad_h_relu * dh1b_db1)/64

    grad_w1 = dh1_dw1.t().mm(grad_h_relu)
    
    w1_2 -= learning_rate * grad_w1
    w2_2 -= learning_rate * grad_w2
    b1_2 -= learning_rate * grad_b1
    b2_2 -= learning_rate * grad_b2
    
    if t % 100 == 99:
        print('Loss on iteration {} = {}'.format(t, loss))

 

Loss on iteration 99 = 68.4677734375
Loss on iteration 199 = 68.021728515625
Loss on iteration 299 = 67.59562683105469
Loss on iteration 399 = 67.18849182128906
Loss on iteration 499 = 66.79944610595703


### PyTorch AutoGrad

In [89]:
learning_rate = 1e-6

for t in range(500):
    y_pred = (x.mm(w1_3)+b1_3).clamp(min=0).mm(w2_3)+b2_3

    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())
    
    loss.backward()
   
    with torch.no_grad():
        w1_3 -= learning_rate * w1_3.grad
        w2_3 -= learning_rate * w2_3.grad
        b1_3 -= learning_rate * b1_3.grad
        b2_3 -= learning_rate * b2_3.grad        
        
        w1_3.grad.zero_()
        w2_3.grad.zero_()
        b1_3.grad.zero_()
        b2_3.grad.zero_()        

99 68.45321655273438
199 67.99429321289062
299 67.55693817138672
399 67.14004516601562
499 66.74253845214844


In [90]:
#automize learning with adaptive Adam
import torch.optim as optim

loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-6
optimizer = torch.optim.Adam([w1_4, w2_4, b1_4, b2_4], lr=learning_rate)

for t in range(500):
    optimizer.zero_grad()
    
    y_pred = (x.mm(w1_4)+b1_4).clamp(min=0).mm(w2_4)+b2_4
    
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
    
    loss.backward()
   
    optimizer.step()

99 68.91714477539062
199 68.90420532226562
299 68.89129638671875
399 68.8783950805664
499 68.86551666259766


So, we see, that Autograd uses simple gradient desscent as default, since it is nearly the same as I have dome manulally and slightly different from the last case wher we used Adam. The trainig set is random, so the losses do not basically show anything: for another random dataset bias could woren performance and Adam would make it better