# Gradient Descent

We will construct a simple linear model and see how the gradient descent algorithm helps us find its best parameters.


In [77]:
import torch

torch.manual_seed(42) # fix the seed; easy to debug

<torch._C.Generator at 0x10d376c30>

## Vectorized Operations

In [78]:
u = torch.randint(10, (5, 1))
v = torch.randint(10, (5, 1))
print(u)
print(v)

tensor([[ 2.],
        [ 7.],
        [ 6.],
        [ 4.],
        [ 6.]])
tensor([[ 5.],
        [ 0.],
        [ 4.],
        [ 0.],
        [ 3.]])


In [79]:
u.size()

torch.Size([5, 1])

In [80]:
u * v

tensor([[ 10.],
        [  0.],
        [ 24.],
        [  0.],
        [ 18.]])

In [81]:
u.pow(v)

tensor([[   32.],
        [    1.],
        [ 1296.],
        [    1.],
        [  216.]])

## Ground Truth Model


In [82]:
n = 5
a = 2.0
b = 3.0
epsilon = 0.02

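# Specify our ground truth model
X = torch.randint(10, (n, 1)).float()  # cast to float: newer PyTorch versions return int64 here
ps = torch.ones(n) / 2.0  # each exponent is perturbed up or down with probability 0.5
# The exponent is 1 - epsilon/2 when the Bernoulli draw is 0, and 1 + epsilon/2 when it is 1.
exponents = torch.bernoulli(ps) * epsilon + (torch.ones(n) - epsilon / 2)
e = exponents.reshape(n, 1)
T = a * (X.pow(e)) + b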


In [83]:
print(X)
print(e)
print(T)

tensor([[ 8.],
        [ 4.],
        [ 0.],
        [ 4.],
        [ 1.]])
tensor([[ 1.0100],
        [ 0.9900],
        [ 1.0100],
        [ 0.9900],
        [ 0.9900]])
tensor([[ 19.3362],
        [ 10.8899],
        [  3.0000],
        [ 10.8899],
        [  5.0000]])


In [84]:
# Verify the first entry of T against a scalar computation.
print(a * X[0][0]**e[0][0] + b)

tensor(19.3362)


The scalar result matches the first entry of `T`, which confirms that the vectorized expression applies the model to every element of the tensor at once.
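To make the equivalence concrete, here is a minimal sketch that compares a vectorized `pow` with an explicit element-by-element loop. The values of `u` and `v` are copied from the printed output earlier in this notebook; the name `looped` is introduced only for this illustration.

```python
import torch

# Values copied from the printed u and v above, written as float tensors.
u = torch.tensor([[2.0], [7.0], [6.0], [4.0], [6.0]])
v = torch.tensor([[5.0], [0.0], [4.0], [0.0], [3.0]])

# Vectorized: a single call operates on every element.
vectorized = u.pow(v)

# Explicit loop: the same computation, one element at a time.
looped = torch.empty_like(u)
for i in range(u.size(0)):
    looped[i, 0] = u[i, 0] ** v[i, 0]

# Compare with a tolerance rather than exact equality, since these are floats.
print(torch.allclose(vectorized, looped))
```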

As an exercise, try to write down the ground truth model mathematically; the answer is given at the end of this notebook.

## Fitting a Linear Model

$$\hat{\mathbf{y}} = \mathbf{x}\mathbf{w} + \mathbf{b}$$
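The `gradient_w` and `gradient_b` functions in the next cell can be derived from the sum-of-squared-errors loss. A short derivation, assuming a single bias value $b$ shared by all $n$ samples (the code stores it as an $n \times 1$ tensor whose entries stay equal) and writing $t_i$ for the targets:

$$
L(w, b) = \frac{1}{2}\sum_{i=1}^{n} (\hat{y}_i - t_i)^2, \qquad \hat{y}_i = x_i w + b
$$

$$
\frac{\partial L}{\partial w} = \sum_{i=1}^{n} (\hat{y}_i - t_i)\, x_i = (\hat{\mathbf{y}} - \mathbf{t})^\top \mathbf{x}, \qquad
\frac{\partial L}{\partial b} = \sum_{i=1}^{n} (\hat{y}_i - t_i)
$$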

In [85]:
# Note that all operations here are vectorized.

def forward(x):  # caution: relies on the global variables w and b
    return x.mm(w) + b

# Per-sample squared error (not used by the training loop below)
def SSEloss(y_pred, y):
    return (y_pred - y) * (y_pred - y) / 2.0

# Total loss: half the sum of squared errors over all samples
def loss(x, t):
    y_pred = forward(x)
    return (y_pred - t).pow(2).sum() / 2.0

# Gradient of the loss with respect to w: (y_pred - t)^T x
def gradient_w(x, t):
    return torch.t(forward(x) - t).mm(x)

# Gradient of the loss with respect to b. The sum() is required because b is
# shared across samples; without it each b_i would receive a different gradient.
def gradient_b(x, t):
    return (forward(x) - t).sum()
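
One way to gain confidence in hand-written gradients like these is to compare them with PyTorch's autograd. The following is a sketch under stated assumptions, not part of the original notebook: the data `x`, `t` and the names `w_chk`, `b_chk` are hypothetical, and the bias is stored as a single `(1, 1)` tensor that broadcasts over samples.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, used only for this sketch

# Hypothetical data (not the notebook's X and T).
x = torch.rand(5, 1) * 10
t = 2.0 * x + 3.0

# Parameters tracked by autograd; the (1, 1) bias broadcasts over all samples.
w_chk = torch.randn(1, 1, requires_grad=True)
b_chk = torch.ones(1, 1, requires_grad=True)

# Same loss as loss(x, t): half the sum of squared errors.
y_pred = x.mm(w_chk) + b_chk
l = (y_pred - t).pow(2).sum() / 2.0
l.backward()

# Hand-derived gradients, mirroring gradient_w and gradient_b.
with torch.no_grad():
    residual = x.mm(w_chk) + b_chk - t
    manual_grad_w = residual.t().mm(x)
    manual_grad_b = residual.sum()

# Both comparisons should print True if the manual formulas are right.
print(torch.allclose(w_chk.grad, manual_grad_w, rtol=1e-4, atol=1e-4))
print(torch.allclose(b_chk.grad.sum(), manual_grad_b, rtol=1e-4, atol=1e-4))
```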

In [103]:
# Training loop
MAX_EPOCHS = 500
learning_rate = 0.01  # finding a good learning rate takes some trial and error


w = torch.randn(1, 1)
b = torch.ones(n, 1)  # note: this rebinds b, which previously held the ground-truth intercept

for epoch in range(MAX_EPOCHS):
    grad_w = gradient_w(X, T)
    w = w - learning_rate * grad_w
    grad_b = gradient_b(X, T)
    b = b - learning_rate * grad_b

    print(".. grad: ", w.item(), b[0].item())  # prints the updated w and b, not the gradients
    l = loss(X, T)
    print("epoch = ", epoch, ", loss=", l.item(), "learning rate = ", learning_rate)

    # Adaptive learning rate on:  500 epochs => loss = 0.04446
    #                        off: 500 epochs => loss = 0.04179
    #learning_rate = 1.0 / (100 + epoch)

# After training
print(forward(X))
print(T)


.. grad:  2.3410696983337402 1.0431773662567139
epoch =  0 , loss= 3.5474693775177 learning rate =  0.01
.. grad:  2.360976457595825 1.080811619758606
epoch =  1 , loss= 3.3889691829681396 learning rate =  0.01
.. grad:  2.355175733566284 1.1175503730773926
epoch =  2 , loss= 3.255636215209961 learning rate =  0.01
.. grad:  2.3487560749053955 1.153543472290039
epoch =  3 , loss= 3.1272027492523193 learning rate =  0.01
.. grad:  2.342444658279419 1.188809871673584
epoch =  4 , loss= 3.003889322280884 learning rate =  0.01
.. grad:  2.3362600803375244 1.2233643531799316
epoch =  5 , loss= 2.8855020999908447 learning rate =  0.01
.. grad:  2.3302001953125 1.2572213411331177
epoch =  6 , loss= 2.771848201751709 learning rate =  0.01
.. grad:  2.324262857437134 1.2903947830200195
epoch =  7 , loss= 2.662736654281616 learning rate =  0.01
.. grad:  2.3184452056884766 1.322898507118225
epoch =  8 , loss= 2.5579850673675537 learning rate =  0.01
.. grad:  2.3127448558807373 1.354746103286743

epoch =  200 , loss= 0.042784929275512695 learning rate =  0.01
.. grad:  2.041593313217163 2.8697073459625244
epoch =  201 , loss= 0.04274478927254677 learning rate =  0.01
.. grad:  2.0414819717407227 2.8703291416168213
epoch =  202 , loss= 0.04270647093653679 learning rate =  0.01
.. grad:  2.0413730144500732 2.870938539505005
epoch =  203 , loss= 0.04266950488090515 learning rate =  0.01
.. grad:  2.0412659645080566 2.871535539627075
epoch =  204 , loss= 0.04263414442539215 learning rate =  0.01
.. grad:  2.041161298751831 2.8721206188201904
epoch =  205 , loss= 0.042600564658641815 learning rate =  0.01
.. grad:  2.0410587787628174 2.8726937770843506
epoch =  206 , loss= 0.042567864060401917 learning rate =  0.01
.. grad:  2.0409581661224365 2.8732552528381348
epoch =  207 , loss= 0.04253678768873215 learning rate =  0.01
.. grad:  2.0408596992492676 2.873805522918701
epoch =  208 , loss= 0.042506616562604904 learning rate =  0.01
.. grad:  2.0407631397247314 2.87434458732605
...

epoch =  364 , loss= 0.04178687930107117 learning rate =  0.01
.. grad:  2.0362741947174072 2.899425745010376
epoch =  365 , loss= 0.041786596179008484 learning rate =  0.01
.. grad:  2.0362703800201416 2.8994476795196533
epoch =  366 , loss= 0.04178640991449356 learning rate =  0.01
.. grad:  2.036266326904297 2.8994691371917725
epoch =  367 , loss= 0.041786566376686096 learning rate =  0.01
.. grad:  2.0362625122070312 2.8994901180267334
epoch =  368 , loss= 0.041786521673202515 learning rate =  0.01
.. grad:  2.0362589359283447 2.8995108604431152
epoch =  369 , loss= 0.04178622364997864 learning rate =  0.01
.. grad:  2.036255359649658 2.899531126022339
epoch =  370 , loss= 0.04178660735487938 learning rate =  0.01
.. grad:  2.0362517833709717 2.8995509147644043
epoch =  371 , loss= 0.04178617149591446 learning rate =  0.01
.. grad:  2.036248207092285 2.8995702266693115
epoch =  372 , loss= 0.04178636148571968 learning rate =  0.01
.. grad:  2.0362448692321777 2.8995893001556396
...

In [87]:
print(w)
print(b)

tensor([[ 2.0361]])
tensor([[ 2.9004],
        [ 2.9004],
        [ 2.9004],
        [ 2.9004],
        [ 2.9004]])


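The learned parameters ($w \approx 2.036$, $b \approx 2.900$) are fairly close to the ground truth: $a = 2.0, b = 3.0$. They do not match exactly because the targets were generated with a slightly perturbed exponent, so the best linear fit to this data differs a little from the ground truth parameters.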
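As a sanity check, the result of gradient descent can be compared with the closed-form least-squares solution for a one-dimensional linear model. This is a sketch, not part of the original notebook: `X_printed` and `T_printed` are copied from the printed tensors above (rounded to four decimals), and `w_star` and `b_star` are names introduced here.

```python
import torch

# Data copied from the printed X and T above (rounded to 4 decimals).
X_printed = torch.tensor([[8.0], [4.0], [0.0], [4.0], [1.0]])
T_printed = torch.tensor([[19.3362], [10.8899], [3.0000], [10.8899], [5.0000]])

# Closed-form simple linear regression:
#   w* = sum((x - mean_x) * (t - mean_t)) / sum((x - mean_x)^2)
#   b* = mean_t - w* * mean_x
x_mean, t_mean = X_printed.mean(), T_printed.mean()
w_star = ((X_printed - x_mean) * (T_printed - t_mean)).sum() / (X_printed - x_mean).pow(2).sum()
b_star = t_mean - w_star * x_mean

print(w_star.item(), b_star.item())
```

On this data the closed-form values come out at roughly 2.036 and 2.90, which agrees with the parameters that gradient descent reached.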

### Ground Truth Model

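The ground truth model is:

$$
y = a x^{1 + s\,\epsilon/2} + b,
$$
where $\epsilon$ is the constant `epsilon` (0.02 in the code) and $s$ is a random sign that takes the value $+1$ with probability 0.5 and $-1$ with probability 0.5. Each sample's exponent is therefore either $1 + \epsilon/2 = 1.01$ or $1 - \epsilon/2 = 0.99$.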