##### License
Licensed under the BSD 3-Clause License (the "License");

In the paper ['Fast Sparse Regression and Classification' (2008)](http://statweb.stanford.edu/~jhf/ftp/GPSpaper.pdf), [Jerome Friedman](https://statweb.stanford.edu/~jhf/) introduces the generalized path seeking (GPS) algorithm, which directly and sequentially constructs a path in parameter space that approximates the regularization path for a given penalty $P(a)$ on the coefficients $a$ of an associated regression model. This Jupyter notebook implements GPS using the 'autograd' facility that many machine learning frameworks provide for calculating gradients. [PyTorch (1.0.1)](https://pytorch.org/) is used here.

In [1]:
import torch
from torch.autograd import Variable
import math
import numpy as np
import pandas as pd

Define a simple model and loss function to demonstrate how useful autograd is for implementing GPS. $a$ is the vector of coefficients of the regression model and is defined at the top level.

In [2]:
N = 9
a = torch.zeros(N,requires_grad=True, dtype=torch.float64)

def Fmodel(a, x):
    """ a is the coefficient vector of a linear regression model
        a[0] is the constant term
        x is a matrix of data where each row is an observation """
    return (x @ a[1:]) + a[0]

def mse(t1,t2):
    """ mean square error """
    diff = t1-t2
    return torch.sum(diff*diff)/diff.numel()

For a concrete example, we will use data from [Stamey et al (1989)](https://www.ncbi.nlm.nih.gov/pubmed/2468795) that can be accessed online at [Robert Tibshirani's](https://statweb.stanford.edu/~tibs) website.

In [3]:
pdata = pd.read_csv("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data",
                    index_col=0,sep='\t')
# lpsa values are the y[i]'s that we want to predict
lpsa = torch.from_numpy(pdata.lpsa.values)
print(pdata.columns)
xvals = torch.from_numpy(pdata[['lcavol','lweight','age','lbph','svi','lcp','gleason','pgg45']].values)
print(xvals[:10])

Index(['lcavol', 'lweight', 'age', 'lbph', 'svi', 'lcp', 'gleason', 'pgg45',
       'lpsa', 'train'],
      dtype='object')
tensor([[-0.5798,  2.7695, 50.0000, -1.3863,  0.0000, -1.3863,  6.0000,  0.0000],
        [-0.9943,  3.3196, 58.0000, -1.3863,  0.0000, -1.3863,  6.0000,  0.0000],
        [-0.5108,  2.6912, 74.0000, -1.3863,  0.0000, -1.3863,  7.0000, 20.0000],
        [-1.2040,  3.2828, 58.0000, -1.3863,  0.0000, -1.3863,  6.0000,  0.0000],
        [ 0.7514,  3.4324, 62.0000, -1.3863,  0.0000, -1.3863,  6.0000,  0.0000],
        [-1.0498,  3.2288, 50.0000, -1.3863,  0.0000, -1.3863,  6.0000,  0.0000],
        [ 0.7372,  3.4735, 64.0000,  0.6152,  0.0000, -1.3863,  6.0000,  0.0000],
        [ 0.6931,  3.5395, 58.0000,  1.5369,  0.0000, -1.3863,  6.0000,  0.0000],
        [-0.7765,  3.5395, 47.0000, -1.3863,  0.0000, -1.3863,  6.0000,  0.0000],
        [ 0.2231,  3.2445, 63.0000, -1.3863,  0.0000, -1.3863,  6.0000,  0.0000]],
       dtype=torch.float64)


In [4]:
preds = Fmodel(a, xvals)

In [5]:
preds

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.], dtype=torch.float64, grad_fn=<AddBackward0>)

In [6]:
lpsa[:10]

tensor([-0.4308, -0.1625, -0.1625, -0.1625,  0.3716,  0.7655,  0.7655,  0.8544,
         1.0473,  1.0473], dtype=torch.float64)

In [7]:
loss = mse(preds,lpsa)
print(loss)

tensor(7.4611, dtype=torch.float64, grad_fn=<DivBackward0>)


PyTorch keeps track of the mathematical operations performed on $a$ and automatically calculates the gradient using reverse-mode differentiation, the technique commonly used for back-propagation in neural nets. Invoking the `backward()` method on a scalar tensor computes the gradient and stores it in `a.grad`. Note that gradients accumulate across calls and need to be zeroed out where appropriate for the calculation at hand.

In [8]:
loss.backward()
print(a)
print(a.grad)

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=torch.float64,
       requires_grad=True)
tensor([  -4.9568,   -8.6696,  -18.4120, -319.4542,   -1.0935,   -1.6087,
          -0.8643,  -34.0798, -148.0683], dtype=torch.float64)
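
As a framework-free sanity check on what `backward()` produces, the gradient of the mean squared error for this linear model can be written in closed form and compared against central finite differences. The sketch below uses made-up toy data and hypothetical helper names (`predict`, `mse_loss`, `analytic_grad`, `numeric_grad`); it does not touch PyTorch or the prostate data.

```python
def predict(a, row):
    """Linear model: a[0] is the constant term, a[1:] are the slopes."""
    return a[0] + sum(c * x for c, x in zip(a[1:], row))

def mse_loss(a, X, y):
    """Mean squared error of the linear model over all observations."""
    return sum((predict(a, row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)

def analytic_grad(a, X, y):
    """Closed-form gradient: (2/n) * sum_i residual_i * d(prediction_i)/d(a_j)."""
    n = len(y)
    resid = [predict(a, row) - yi for row, yi in zip(X, y)]
    grad = [2.0 / n * sum(resid)]  # derivative w.r.t. the constant term
    for j in range(len(X[0])):
        grad.append(2.0 / n * sum(r * row[j] for r, row in zip(resid, X)))
    return grad

def numeric_grad(a, X, y, h=1e-6):
    """Central finite differences, one coefficient at a time."""
    grad = []
    for j in range(len(a)):
        up = list(a)
        up[j] += h
        down = list(a)
        down[j] -= h
        grad.append((mse_loss(up, X, y) - mse_loss(down, X, y)) / (2 * h))
    return grad

# Toy data (made up for illustration): 4 observations, 2 features.
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]
y = [1.0, 0.0, 2.0, -1.0]
a0 = [0.0, 0.0, 0.0]

g_exact = analytic_grad(a0, X, y)
g_approx = numeric_grad(a0, X, y)
print(g_exact)
print(g_approx)
```

At $a = 0$ the component for the constant term reduces to $-2$ times the mean of the targets, which is one way to read the first entry of `a.grad` printed above.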


One approach to updating the coefficients to minimize the loss is to move iteratively against the direction of the gradient (gradient descent). We will use this idea later to address a minor issue with the GPS algorithm as stated in the paper.

In [9]:
for i in range(10000):
    preds = Fmodel(a, xvals)
    loss = mse(preds, lpsa)
    loss.backward()
    # no_grad() context because we want to update a in place, not track the update for gradient calculation
    with torch.no_grad():
        a -= a.grad * 1e-5
        a.grad.zero_()
a_ref = a

In [10]:
print(loss)

tensor(0.8563, dtype=torch.float64, grad_fn=<DivBackward0>)


In [11]:
print(preds)

tensor([1.4125, 1.5940, 2.2870, 1.5660, 1.9234, 1.3732, 2.0223, 1.8837, 1.3424,
        1.8751, 1.9471, 1.7544, 2.4407, 2.2096, 2.0001, 2.1117, 2.4514, 2.3558,
        1.2002, 2.1494, 1.8940, 2.5385, 1.6929, 2.9630, 2.1405, 2.1942, 2.7896,
        2.2424, 3.0851, 2.4215, 2.0914, 2.0184, 2.2699, 1.6132, 1.8863, 2.2407,
        2.7582, 2.0804, 3.0699, 1.9769, 2.7702, 2.3134, 2.0606, 2.2945, 2.5000,
        2.2440, 4.1743, 2.7475, 1.5530, 2.3040, 2.7018, 2.2052, 2.8866, 3.0069,
        2.2331, 2.3985, 1.7394, 1.6258, 2.4212, 2.4656, 2.3084, 2.9030, 3.7387,
        3.2277, 2.1504, 2.3739, 3.2891, 2.6829, 1.9986, 2.5101, 2.9001, 2.7291,
        2.5677, 3.2882, 2.9128, 3.2606, 3.2614, 2.8288, 3.4287, 2.9003, 2.7113,
        2.9801, 3.1047, 3.3660, 2.5567, 3.2542, 2.0618, 2.5395, 3.2797, 3.4611,
        2.4237, 2.4141, 3.2438, 2.5904, 2.3703, 3.5405, 3.0644],
       dtype=torch.float64, grad_fn=<AddBackward0>)


In [12]:
print(lpsa)

tensor([-0.4308, -0.1625, -0.1625, -0.1625,  0.3716,  0.7655,  0.7655,  0.8544,
         1.0473,  1.0473,  1.2669,  1.2669,  1.2669,  1.3481,  1.3987,  1.4469,
         1.4702,  1.4929,  1.5581,  1.5994,  1.6390,  1.6582,  1.6956,  1.7138,
         1.7317,  1.7664,  1.8001,  1.8165,  1.8485,  1.8946,  1.9242,  2.0082,
         2.0082,  2.0215,  2.0477,  2.0857,  2.1576,  2.1917,  2.2138,  2.2773,
         2.2976,  2.3076,  2.3273,  2.3749,  2.5217,  2.5533,  2.5688,  2.5688,
         2.5915,  2.5915,  2.6568,  2.6776,  2.6844,  2.6912,  2.7047,  2.7180,
         2.7881,  2.7942,  2.8064,  2.8124,  2.8420,  2.8536,  2.8536,  2.8820,
         2.8820,  2.8876,  2.9205,  2.9627,  2.9627,  2.9730,  3.0131,  3.0374,
         3.0564,  3.0750,  3.2753,  3.3375,  3.3928,  3.4356,  3.4579,  3.5130,
         3.5160,  3.5308,  3.5653,  3.5709,  3.5877,  3.6310,  3.6801,  3.7124,
         3.9843,  3.9936,  4.0298,  4.1296,  4.3851,  4.6844,  5.1431,  5.4775,
         5.5829], dtype=torch.float64)


Central to the implementation of GPS are equations (24), (25), and (26), defined on page 6 of the article and reproduced below. Equations (24) and (25) capture the (negative) gradient of the empirical 'risk' $\hat{R}(a)$ and the gradient of the penalty $P(a)$ used to regularize the model, respectively. Both are direction vectors in the parameter space of a regression problem. While (24) depends directly on the model and data, (25) has a more 'universal' nature: it is the gradient of a penalty with respect to the parameters, which typically has a form applicable across models. Note that $\nu$ parameterizes the path through parameter space, with $\Delta \nu$ setting the step size, and the 'hat' symbol (e.g. $\hat{R}$ and $\hat{a}$) signifies empirical quantities that depend explicitly on the observed data.

$$
\begin{eqnarray}
g_{j}(\nu) & = & - & 
\left[\frac{\partial \hat{R}(a)}{\partial a_{j}}\right]_{a=\hat{a}(\nu)} & \hspace{1in} (24) \\
p_{j}(\nu) & = & & 
\left[\frac{\partial P(a)}{\partial \left| a_{j} \right|} \right]_{a=\hat{a}(\nu)} & \hspace{1in} (25) \\
\lambda_{j}(\nu) & = & & 
\frac{g_{j}(\nu)}{p_{j}(\nu)} & \hspace{1in} (26)
\end{eqnarray}
$$
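
As a worked instance of (25) and (26), consider a power family of penalties, which covers the exponents tried in the code cells of this notebook (`pow(1/2)`, `pow(1)`, `pow(2)`). This is a sketch in the notation above, with the exponent $\gamma$ introduced here for illustration:

$$
\begin{eqnarray}
P_{\gamma}(a) & = & \sum_{j=1}^{n} \left| a_{j} \right|^{\gamma} \\
p_{j}(\nu) & = & \gamma \left| \hat{a}_{j}(\nu) \right|^{\gamma - 1} \\
\lambda_{j}(\nu) & = & \frac{g_{j}(\nu)}{\gamma \left| \hat{a}_{j}(\nu) \right|^{\gamma - 1}}
\end{eqnarray}
$$

For $\gamma = 1$ (lasso) this gives $p_{j} = 1$ and $\lambda_{j} = g_{j}$ wherever $\hat{a}_{j} \ne 0$; for $\gamma = 2$ (ridge) it gives $p_{j} = 2 \left| \hat{a}_{j} \right|$, which vanishes at $\hat{a}_{j} = 0$ and leaves $\lambda_{j}$ undefined there.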

Before putting the whole algorithm together, each piece is demonstrated separately. Without going into details at this point (see Section 2.3 of the paper), $a$ is reset to zero and pushed slightly along the negative gradient to provide a starting point for this demonstration. The difficulty with starting exactly at zero is not surprising, given that gradients of absolute values would be evaluated at zero.

In [13]:
N = 9
a = torch.zeros(N,requires_grad=True, dtype=torch.float64)

In [14]:
loss = mse(Fmodel(a, xvals), lpsa)
loss.backward()
with torch.no_grad():
    a -= a.grad * 1e-5
    a.grad.zero_()

$$ Compute \{\lambda_{j}(\nu)\}^{n}_{1} $$

In [15]:
print(a)
R = mse(Fmodel(a, xvals), lpsa);R.backward(); g = -a.grad; a.grad.zero_() #(24)
print(R)
#P = abs(a).pow(1/2).sum(); P.backward(); p = abs(a.grad); a.grad.zero_() #(25)
P = abs(a).pow(2).sum(); P.backward(); p = abs(a.grad); a.grad.zero_() #(25)
print(P)
l = g/p #(26)
print(p)
print(g)

tensor([4.9568e-05, 8.6696e-05, 1.8412e-04, 3.1945e-03, 1.0935e-05, 1.6087e-05,
        8.6427e-06, 3.4080e-04, 1.4807e-03], dtype=torch.float64,
       requires_grad=True)
tensor(6.2674, dtype=torch.float64, grad_fn=<DivBackward0>)
tensor(1.2558e-05, dtype=torch.float64, grad_fn=<SumBackward0>)
tensor([9.9135e-05, 1.7339e-04, 3.6824e-04, 6.3891e-03, 2.1870e-05, 3.2174e-05,
        1.7285e-05, 6.8160e-04, 2.9614e-03], dtype=torch.float64)
tensor([  4.4702,   7.9575,  16.6355, 287.8595,   1.0111,   1.4853,   0.8695,
         30.7401, 133.4941], dtype=torch.float64)


$$ S = \{ j \, | \, \lambda_{j}(\nu) * \hat{a}_{j}(\nu) < 0 \} $$

In [16]:
# use element wise multiplication and less than 0 predicate
# to find elements with corresponding opposite sign
S = l*a < 0
print(l)
print(a)
print (S)

tensor([45092.3164, 45893.0302, 45175.6898, 45054.8953, 46234.0950, 46164.1531,
        50303.5029, 45100.0898, 45078.5758], dtype=torch.float64)
tensor([4.9568e-05, 8.6696e-05, 1.8412e-04, 3.1945e-03, 1.0935e-05, 1.6087e-05,
        8.6427e-06, 3.4080e-04, 1.4807e-03], dtype=torch.float64,
       requires_grad=True)
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.uint8)


$$
\begin{eqnarray}
if \; (S = empty) \hspace{5pt} & j^{*} & = arg\,max_{j} & | \lambda_{j}(\nu) | \\
else \hspace{5pt} & j^{*} & = arg\,max_{j \in S} & | \lambda_{j}(\nu) | 
\end{eqnarray}
$$

In [17]:
# need to maintain idx location, i.e. need to keep tensor shape the same throughout
# torch.max() finds max element and respective index
if S.sum() > 0: # check for elements in set
    # non empty case (order different than stated algorithm)
    # S.double() for matching element types needed by PyTorch
    # element wise mult to zero out elements not meeting predicate condition
    # done this way to make sure idx refer to corresponding element
    val,idx = torch.max(S.double() * l.abs(), 0)
else:
    # empty case
    val,idx = torch.max(l.abs(), 0)
print(val,idx)

tensor(50303.5029, dtype=torch.float64) tensor(6)
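
The selection rule can also be written without tensors. The following is a minimal plain-Python sketch, with `select_index` as a hypothetical helper name; it filters the candidate indices directly instead of masking by multiplication. The first call reuses the values of `l` and `a` printed above, the second uses made-up values to exercise the non-empty case.

```python
def select_index(lam, a):
    """Return j* for equal-length sequences of floats lam and a."""
    # S: indices where lambda_j and a_j have opposite signs
    S = [j for j in range(len(lam)) if lam[j] * a[j] < 0]
    # restrict the search to S when it is non-empty, else use all indices
    candidates = S if S else range(len(lam))
    return max(candidates, key=lambda j: abs(lam[j]))

# Values printed by the cells above: every product is positive, so S is empty.
lam = [45092.3164, 45893.0302, 45175.6898, 45054.8953, 46234.0950,
       46164.1531, 50303.5029, 45100.0898, 45078.5758]
a_vals = [4.9568e-05, 8.6696e-05, 1.8412e-04, 3.1945e-03, 1.0935e-05,
          1.6087e-05, 8.6427e-06, 3.4080e-04, 1.4807e-03]
j_empty = select_index(lam, a_vals)

# Made-up values where indices 1 and 2 have opposite signs, so S = {1, 2}.
j_nonempty = select_index([3.0, -5.0, 1.0], [1.0, 2.0, -1.0])
print(j_empty, j_nonempty)
```

Filtering the index list keeps the returned index aligned with the original vector, at the cost of a Python-level loop rather than a vectorized operation.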


$$ \hat{a}_{j^{*}}(\nu + \Delta \nu) = \hat{a}_{j^{*}}(\nu) + \Delta \nu * sign(\lambda_{j^{*}}(\nu)) \\
 \{ \hat{a}_{j}(\nu + \Delta \nu) = \hat{a}_{j}(\nu) \}_{j \ne j^{*}} $$
$$ \nu \leftarrow \nu + \Delta \nu $$

In [18]:
# update single component, a[idx], with new value
# here are the values at play for this iteration
print(idx, a[idx], l[idx])

tensor(6) tensor(8.6427e-06, dtype=torch.float64, grad_fn=<SelectBackward>) tensor(50303.5029, dtype=torch.float64)


Only the `a[idx]` component is updated with a new value; the rest remain unchanged. The code would then be something along the lines of:
```python
with torch.no_grad():
    a[idx] += del_nu * torch.sign(l[idx])
    a.grad.zero_()
```
Choosing $\Delta \nu$ is an implementation decision.
Note that $\Delta \nu$ is an implied change in the parameterized path of $a(\nu)$ that would bring about a change of $\Delta a$. Since $sign(\lambda_{j^{*}}(\nu))$ contributes only the sign of the change, $\Delta \nu$ is effectively the magnitude of $\Delta a$. 

Section 9.4 of the paper suggests one approach to setting the step size: choose $\Delta \nu$ to reduce the empirical risk $\hat{R}(\hat{a})$ by a fixed fraction $\epsilon$.

$$ \frac{\left [ \hat{R}(\hat{a}(\nu)) - \hat{R}(\hat{a}(\nu + \Delta \nu)) \right]}
{\hat{R}(\hat{a}(\nu))} = \epsilon $$

The algorithm updates one component $a_{j^{*}}$ at a time. An approximation for $\epsilon$ is then

$$ \left | \frac{g_{j^{*}}(\nu) * \Delta a_{j^{*}}}{\hat{R}(\hat{a}(\nu))} \right | \approx \epsilon $$
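
The approximation follows from a first-order Taylor expansion of the risk along the single updated coordinate. A sketch of the steps, using (24) and assuming $p_{j^{*}}(\nu) > 0$ so that $sign(\lambda_{j^{*}}(\nu)) = sign(g_{j^{*}}(\nu))$:

$$
\begin{eqnarray}
\Delta a_{j^{*}} & = & \Delta \nu * sign(g_{j^{*}}(\nu)) \\
\hat{R}(\hat{a}(\nu + \Delta \nu)) & \approx & \hat{R}(\hat{a}(\nu)) - g_{j^{*}}(\nu) * \Delta a_{j^{*}} \; = \; \hat{R}(\hat{a}(\nu)) - \left| g_{j^{*}}(\nu) \right| \Delta \nu \\
\Delta \nu & \approx & \epsilon \, \frac{\hat{R}(\hat{a}(\nu))}{\left| g_{j^{*}}(\nu) \right|}
\end{eqnarray}
$$

The expansion drops the second-order term, which is non-negative for a squared-error risk, so the realized reduction can be expected to fall somewhat short of $\epsilon$.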

With a choice of $\epsilon$ = 0.01,

In [19]:
with torch.no_grad():
    del_nu = 0.01 * (R / g[idx]).abs()
    a[idx] += del_nu * torch.sign(l[idx])
    R_post = mse(Fmodel(a, xvals), lpsa)
    a.grad.zero_()
    
# should be 'close' to 0.01
print(1-(R_post/R))

tensor(0.0084, dtype=torch.float64, grad_fn=<RsubBackward1>)


The GPS algorithm with all the pieces composed together.

The line numbering differs from the paper because the breakdown above was organized into units that better match this description.
$$
\begin{array}{ll}
1 & Initialize: \nu = 0; \{\hat{a}_{j}(0) = 0\}_{1}^{n} \\
2 & Loop \; \{ \\
3 & \hspace{10pt} Compute \{\lambda_{j}(\nu)\}^{n}_{1} \\
4 & \hspace{10pt} S = \{ j \, | \, \lambda_{j}(\nu) * \hat{a}_{j}(\nu) < 0 \} \\
5 & \hspace{10pt} \begin{eqnarray}
if \; (S = empty) \hspace{5pt} & j^{*} & = arg\,max_{j} & | \lambda_{j}(\nu) | \\
else \hspace{10pt} & j^{*} & = arg\,max_{j \in S} & | \lambda_{j}(\nu) |
\end{eqnarray} \\
6 & \hspace{10pt} \hat{a}_{j^{*}}(\nu + \Delta \nu) = \hat{a}_{j^{*}}(\nu) + \Delta \nu * sign(\lambda_{j^{*}}(\nu)); \{ \hat{a}_{j}(\nu + \Delta \nu) = \hat{a}_{j}(\nu) \}_{j \ne j^{*}} \\
7 & \hspace{10pt} \nu \leftarrow \nu + \Delta \nu \\
8 & \} \; Until \; \lambda(\nu) = 0
\end{array}
$$
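
The numbered steps above can be traced in a plain-Python sketch. It makes several simplifying assumptions that are not taken from the paper or from the implementation in this notebook: the lasso penalty with $p_{j} = 1$ for every $j$ (so $\lambda_{j} = g_{j}$ and no nudge away from zero is needed), a small fixed $\Delta \nu$, a tolerance in place of the exact test $\lambda(\nu) = 0$, a linear model without a constant term, and made-up toy data. The names `risk`, `neg_risk_grad`, and `gps_lasso` are hypothetical.

```python
def risk(a, X, y):
    """Mean squared error of a linear model without a constant term."""
    n = len(y)
    return sum((sum(c * x for c, x in zip(a, row)) - yi) ** 2
               for row, yi in zip(X, y)) / n

def neg_risk_grad(a, X, y):
    """g_j of equation (24): the negative gradient of the risk."""
    n = len(y)
    resid = [sum(c * x for c, x in zip(a, row)) - yi for row, yi in zip(X, y)]
    return [-2.0 / n * sum(r * row[j] for r, row in zip(resid, X))
            for j in range(len(a))]

def gps_lasso(X, y, del_nu=0.01, max_iter=1000, tol=1e-6):
    """GPS sketch for the lasso penalty (assumes p_j = 1, so lambda_j = g_j)."""
    a = [0.0] * len(X[0])                                    # step 1
    nu = 0.0
    for _ in range(max_iter):                                # step 2
        lam = neg_risk_grad(a, X, y)                         # step 3
        if max(abs(v) for v in lam) < tol:                   # step 8, with a tolerance
            break
        S = [j for j in range(len(a)) if lam[j] * a[j] < 0]  # step 4
        candidates = S if S else range(len(a))               # step 5
        j_star = max(candidates, key=lambda j: abs(lam[j]))
        a[j_star] += del_nu if lam[j_star] > 0 else -del_nu  # step 6
        nu += del_nu                                         # step 7
    return a, nu

# Toy data (made up): the response is exactly twice the first feature.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, 0.0, 2.0, 4.0]
a_hat, nu_end = gps_lasso(X, y)
print(a_hat, nu_end, risk(a_hat, X, y))
```

In this toy problem the second feature is correlated with the response, yet its coefficient is never selected, because $|\lambda_{1}|$ stays larger than $|\lambda_{2}|$ along the whole path.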

In [40]:
## Line 1
N = 9
a = torch.zeros(N,requires_grad=True, dtype=torch.float64)
loss = mse(Fmodel(a, xvals), lpsa)
loss.backward()
with torch.no_grad():
    a -= a.grad * 1e-5
    a.grad.zero_()
nu = 0
ncount = 0
NMAX = 10000

## Line 2
while True:
    ## Line 3
    R = mse(Fmodel(a, xvals), lpsa);R.backward(); g = -a.grad; a.grad.zero_() #(24)
    P = abs(a).pow(1).sum(); P.backward(); p = abs(a.grad); a.grad.zero_() #(25)
    l = g/p #(26)
    ## Line 4
    # use element wise multiplication and less than 0 predicate
    # to find elements with corresponding opposite sign
    S = l*a < 0
    ## Line 5
    # need to maintain idx location, i.e. need to keep tensor shape the same throughout
    # torch.max() finds max element and respective index
    if S.sum() > 0: # check for elements in set
        # non empty case (order different than stated algorithm)
        # S.double() for matching element types needed by PyTorch
        # element wise mult to zero out elements not meeting predicate condition
        # done this way to make sure idx refer to corresponding element
        val,idx = torch.max(S.double() * l.abs(), 0)
    else:
        # empty case
        val,idx = torch.max(l.abs(), 0)
    ## Line 6
    with torch.no_grad():
        # recall that no gradients should be calculated here while a is being updated
        del_nu = 0.01 * (R / g[idx]).abs()
        a[idx] += del_nu * torch.sign(l[idx])
        # should be 'close' to 0.01
        # R_post = mse(Fmodel(a, xvals), lpsa); print(1-(R_post/R))
        a.grad.zero_()
    ## Line 7
    nu += del_nu
    ## Line 8
    #print(l.sum())
    if (ncount % 10 == 0): 
        print(ncount,R.data,P.data)
    if (ncount > NMAX) or (l.abs().sum() <= 0):
        break
    ncount += 1

0 tensor(6.2674, dtype=torch.float64) tensor(0.0054, dtype=torch.float64)
10 tensor(5.6699, dtype=torch.float64) tensor(0.0075, dtype=torch.float64)
20 tensor(5.1295, dtype=torch.float64) tensor(0.0096, dtype=torch.float64)
30 tensor(4.6406, dtype=torch.float64) tensor(0.0116, dtype=torch.float64)
40 tensor(4.1983, dtype=torch.float64) tensor(0.0135, dtype=torch.float64)
50 tensor(3.7983, dtype=torch.float64) tensor(0.0154, dtype=torch.float64)
60 tensor(3.4364, dtype=torch.float64) tensor(0.0172, dtype=torch.float64)
70 tensor(3.1091, dtype=torch.float64) tensor(0.0190, dtype=torch.float64)
80 tensor(2.8131, dtype=torch.float64) tensor(0.0208, dtype=torch.float64)
90 tensor(2.5453, dtype=torch.float64) tensor(0.0225, dtype=torch.float64)
100 tensor(2.3031, dtype=torch.float64) tensor(0.0243, dtype=torch.float64)
110 tensor(2.0841, dtype=torch.float64) tensor(0.0260, dtype=torch.float64)
120 tensor(1.8861, dtype=torch.float64) tensor(0.0278, dtype=torch.float64)
130 tensor(1.7072, dtyp

1290 tensor(1.4669, dtype=torch.float64) tensor(0.0677, dtype=torch.float64)
1300 tensor(1.3280, dtype=torch.float64) tensor(0.0696, dtype=torch.float64)
1310 tensor(1.2030, dtype=torch.float64) tensor(0.0718, dtype=torch.float64)
1320 tensor(1.0922, dtype=torch.float64) tensor(0.0751, dtype=torch.float64)
1330 tensor(1.1604, dtype=torch.float64) tensor(0.0728, dtype=torch.float64)
1340 tensor(1.0687, dtype=torch.float64) tensor(0.0785, dtype=torch.float64)
1350 tensor(1.0803, dtype=torch.float64) tensor(0.0795, dtype=torch.float64)
1360 tensor(1.5849, dtype=torch.float64) tensor(0.0664, dtype=torch.float64)
1370 tensor(1.4346, dtype=torch.float64) tensor(0.0681, dtype=torch.float64)
1380 tensor(1.2989, dtype=torch.float64) tensor(0.0701, dtype=torch.float64)
1390 tensor(1.1769, dtype=torch.float64) tensor(0.0724, dtype=torch.float64)
1400 tensor(1.0714, dtype=torch.float64) tensor(0.0764, dtype=torch.float64)
1410 tensor(2.3652, dtype=torch.float64) tensor(0.0672, dtype=torch.float64)

2560 tensor(1.3806, dtype=torch.float64) tensor(0.0975, dtype=torch.float64)
2570 tensor(1.2500, dtype=torch.float64) tensor(0.0994, dtype=torch.float64)
2580 tensor(1.1325, dtype=torch.float64) tensor(0.1017, dtype=torch.float64)
2590 tensor(1.0309, dtype=torch.float64) tensor(0.1055, dtype=torch.float64)
2600 tensor(1.0615, dtype=torch.float64) tensor(0.1098, dtype=torch.float64)
2610 tensor(1.4963, dtype=torch.float64) tensor(0.0961, dtype=torch.float64)
2620 tensor(1.3545, dtype=torch.float64) tensor(0.0978, dtype=torch.float64)
2630 tensor(1.2265, dtype=torch.float64) tensor(0.0998, dtype=torch.float64)
2640 tensor(1.1115, dtype=torch.float64) tensor(0.1022, dtype=torch.float64)
2650 tensor(1.0732, dtype=torch.float64) tensor(0.1102, dtype=torch.float64)
2660 tensor(1.0247, dtype=torch.float64) tensor(0.1066, dtype=torch.float64)
2670 tensor(18.4143, dtype=torch.float64) tensor(0.1159, dtype=torch.float64)
2680 tensor(16.6580, dtype=torch.float64) tensor(0.1125, dtype=torch.float6

3860 tensor(1.0904, dtype=torch.float64) tensor(0.1277, dtype=torch.float64)
3870 tensor(0.9935, dtype=torch.float64) tensor(0.1347, dtype=torch.float64)
3880 tensor(1.0220, dtype=torch.float64) tensor(0.1318, dtype=torch.float64)
3890 tensor(0.9852, dtype=torch.float64) tensor(0.1366, dtype=torch.float64)
3900 tensor(5.7698, dtype=torch.float64) tensor(0.1668, dtype=torch.float64)
3910 tensor(5.2197, dtype=torch.float64) tensor(0.1634, dtype=torch.float64)
3920 tensor(4.7221, dtype=torch.float64) tensor(0.1600, dtype=torch.float64)
3930 tensor(4.2720, dtype=torch.float64) tensor(0.1568, dtype=torch.float64)
3940 tensor(3.8648, dtype=torch.float64) tensor(0.1537, dtype=torch.float64)
3950 tensor(3.4965, dtype=torch.float64) tensor(0.1506, dtype=torch.float64)
3960 tensor(3.1633, dtype=torch.float64) tensor(0.1477, dtype=torch.float64)
3970 tensor(2.8619, dtype=torch.float64) tensor(0.1449, dtype=torch.float64)
3980 tensor(2.5893, dtype=torch.float64) tensor(0.1421, dtype=torch.float64)

5200 tensor(1.7280, dtype=torch.float64) tensor(0.1604, dtype=torch.float64)
5210 tensor(1.5637, dtype=torch.float64) tensor(0.1619, dtype=torch.float64)
5220 tensor(1.4151, dtype=torch.float64) tensor(0.1634, dtype=torch.float64)
5230 tensor(1.2809, dtype=torch.float64) tensor(0.1651, dtype=torch.float64)
5240 tensor(1.1597, dtype=torch.float64) tensor(0.1669, dtype=torch.float64)
5250 tensor(1.0505, dtype=torch.float64) tensor(0.1690, dtype=torch.float64)
5260 tensor(0.9544, dtype=torch.float64) tensor(0.1722, dtype=torch.float64)
5270 tensor(1.2946, dtype=torch.float64) tensor(0.1649, dtype=torch.float64)
5280 tensor(1.1720, dtype=torch.float64) tensor(0.1667, dtype=torch.float64)
5290 tensor(1.0616, dtype=torch.float64) tensor(0.1687, dtype=torch.float64)
5300 tensor(0.9637, dtype=torch.float64) tensor(0.1717, dtype=torch.float64)
5310 tensor(0.9382, dtype=torch.float64) tensor(0.1738, dtype=torch.float64)
5320 tensor(0.9424, dtype=torch.float64) tensor(0.1753, dtype=torch.float64)

6330 tensor(14.6864, dtype=torch.float64) tensor(0.1781, dtype=torch.float64)
6340 tensor(13.2857, dtype=torch.float64) tensor(0.1751, dtype=torch.float64)
6350 tensor(12.0186, dtype=torch.float64) tensor(0.1722, dtype=torch.float64)
6360 tensor(10.8724, dtype=torch.float64) tensor(0.1695, dtype=torch.float64)
6370 tensor(9.8356, dtype=torch.float64) tensor(0.1669, dtype=torch.float64)
6380 tensor(8.8976, dtype=torch.float64) tensor(0.1644, dtype=torch.float64)
6390 tensor(8.0491, dtype=torch.float64) tensor(0.1620, dtype=torch.float64)
6400 tensor(7.2816, dtype=torch.float64) tensor(0.1597, dtype=torch.float64)
6410 tensor(6.5872, dtype=torch.float64) tensor(0.1574, dtype=torch.float64)
6420 tensor(5.9591, dtype=torch.float64) tensor(0.1553, dtype=torch.float64)
6430 tensor(5.3910, dtype=torch.float64) tensor(0.1533, dtype=torch.float64)
6440 tensor(4.8770, dtype=torch.float64) tensor(0.1514, dtype=torch.float64)
6450 tensor(4.4121, dtype=torch.float64) tensor(0.1526, dtype=torch.floa

7410 tensor(31.7122, dtype=torch.float64) tensor(0.2308, dtype=torch.float64)
7420 tensor(28.6874, dtype=torch.float64) tensor(0.2265, dtype=torch.float64)
7430 tensor(25.9511, dtype=torch.float64) tensor(0.2223, dtype=torch.float64)
7440 tensor(23.4759, dtype=torch.float64) tensor(0.2184, dtype=torch.float64)
7450 tensor(21.2367, dtype=torch.float64) tensor(0.2146, dtype=torch.float64)
7460 tensor(19.2112, dtype=torch.float64) tensor(0.2110, dtype=torch.float64)
7470 tensor(17.3789, dtype=torch.float64) tensor(0.2076, dtype=torch.float64)
7480 tensor(15.7214, dtype=torch.float64) tensor(0.2044, dtype=torch.float64)
7490 tensor(14.2219, dtype=torch.float64) tensor(0.2013, dtype=torch.float64)
7500 tensor(12.8655, dtype=torch.float64) tensor(0.1983, dtype=torch.float64)
7510 tensor(11.6385, dtype=torch.float64) tensor(0.1955, dtype=torch.float64)
7520 tensor(10.5286, dtype=torch.float64) tensor(0.1927, dtype=torch.float64)
7530 tensor(9.5245, dtype=torch.float64) tensor(0.1902, dtype=to

8670 tensor(6.1272, dtype=torch.float64) tensor(0.1800, dtype=torch.float64)
8680 tensor(5.5430, dtype=torch.float64) tensor(0.1780, dtype=torch.float64)
8690 tensor(5.0145, dtype=torch.float64) tensor(0.1760, dtype=torch.float64)
8700 tensor(4.5364, dtype=torch.float64) tensor(0.1752, dtype=torch.float64)
8710 tensor(4.1040, dtype=torch.float64) tensor(0.1770, dtype=torch.float64)
8720 tensor(3.7128, dtype=torch.float64) tensor(0.1788, dtype=torch.float64)
8730 tensor(3.3589, dtype=torch.float64) tensor(0.1805, dtype=torch.float64)
8740 tensor(3.0388, dtype=torch.float64) tensor(0.1821, dtype=torch.float64)
8750 tensor(2.7492, dtype=torch.float64) tensor(0.1837, dtype=torch.float64)
8760 tensor(2.4873, dtype=torch.float64) tensor(0.1853, dtype=torch.float64)
8770 tensor(2.2504, dtype=torch.float64) tensor(0.1868, dtype=torch.float64)
8780 tensor(2.0361, dtype=torch.float64) tensor(0.1883, dtype=torch.float64)
8790 tensor(1.8423, dtype=torch.float64) tensor(0.1898, dtype=torch.float64)

9850 tensor(1.0537, dtype=torch.float64) tensor(0.2018, dtype=torch.float64)
9860 tensor(0.9553, dtype=torch.float64) tensor(0.2059, dtype=torch.float64)
9870 tensor(0.9571, dtype=torch.float64) tensor(0.2074, dtype=torch.float64)
9880 tensor(0.9420, dtype=torch.float64) tensor(0.2187, dtype=torch.float64)
9890 tensor(0.8945, dtype=torch.float64) tensor(0.2126, dtype=torch.float64)
9900 tensor(0.9927, dtype=torch.float64) tensor(0.2072, dtype=torch.float64)
9910 tensor(0.9028, dtype=torch.float64) tensor(0.2106, dtype=torch.float64)
9920 tensor(0.9197, dtype=torch.float64) tensor(0.2083, dtype=torch.float64)
9930 tensor(0.8999, dtype=torch.float64) tensor(0.2112, dtype=torch.float64)
9940 tensor(1.0287, dtype=torch.float64) tensor(0.2068, dtype=torch.float64)
9950 tensor(0.9331, dtype=torch.float64) tensor(0.2094, dtype=torch.float64)
9960 tensor(0.9801, dtype=torch.float64) tensor(0.2079, dtype=torch.float64)
9970 tensor(0.8940, dtype=torch.float64) tensor(0.2119, dtype=torch.float64)

In [33]:
a

tensor([0.0138, 0.0154, 0.0039, 0.0320, 0.0612, 0.0417, 0.0721, 0.0094, 0.0124],
       dtype=torch.float64, requires_grad=True)

In [34]:
a_ref

tensor([0.0041, 0.1258, 0.0433, 0.0262, 0.0232, 0.0315, 0.0709, 0.0301, 0.0105],
       dtype=torch.float64, requires_grad=True)

In [36]:
l

tensor([-1.9936e+00,  3.6785e+01,  8.0053e+00, -1.0236e+02,  5.3276e-02,
         3.3120e+00,  4.3077e+00, -2.1189e+01, -1.8927e+02],
       dtype=torch.float64)