## Health Outcomes with Linear Regression

In [1]:
# importing basic lib
from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
import math, scipy, numpy as np   
from scipy import linalg

#### Diabetes Dataset
We will use a dataset from patients with diabetes. The data consists of 442 samples and 10 variables, so the data matrix is tall and skinny. The dependent variable is a quantitative measure of disease progression one year after baseline.

This is a classical dataset, famously used by Efron, Hastie, Johnstone, and Tibshirani in their Least Angle Regression paper.

In [2]:
data = datasets.load_diabetes()

In [3]:
feature_names = ['age','sex','bmi','bp','s1','s2','s3','s4','s5','s6']

In [4]:
trn,test, y_trn, y_test = train_test_split(data.data, data.target, test_size=0.2)

In [5]:
trn.shape, test.shape

((353, 10), (89, 10))

### Linear Regression in scikit-learn

Consider a system $X\beta = y$, where $X$ has more rows than columns. This occurs when you have more data samples than variables. We want to find the $\beta$ that minimizes

$$ || X\beta - y ||_2 $$
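As a rough sketch of what minimizing this norm means numerically, the snippet below solves a small overdetermined system in two ways: with NumPy's least-squares solver and with the normal equations $X^T X \beta = X^T y$. The synthetic data and the names `X_demo`, `y_demo`, and `beta_true` are assumptions made for illustration; they are not part of the diabetes dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tall-and-skinny system: 50 samples, 3 variables
X_demo = rng.standard_normal((50, 3))
beta_true = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ beta_true + 0.01 * rng.standard_normal(50)

# Least-squares solution from NumPy's solver
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Same minimizer from the normal equations X^T X beta = X^T y
beta_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

print(beta_lstsq)
print(beta_normal)
```

Both routes should agree closely on a well-conditioned problem like this one. The normal equations are shown only because they are easy to read; they can be numerically fragile when $X$ is ill-conditioned.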

We start with the simple scikit-learn implementation:

In [7]:
regr = linear_model.LinearRegression()
%timeit regr.fit(trn, y_trn)

The slowest run took 37.80 times longer than the fastest. This could mean that an intermediate result is being cached.
12.5 ms ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
pred = regr.predict(test)

We need some metrics for how good our predictions are. We will look at the root mean squared error (an L2 measure) and the mean absolute error (an L1 measure).

In [9]:
def regr_metric(act, pred):
    return (math.sqrt(metrics.mean_squared_error(act, pred)), metrics.mean_absolute_error(act,pred))

In [10]:
regr_metric(y_test, regr.predict(test))

(49.876877547419426, 39.34822094080999)
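To make the two numbers easier to interpret, here is a small hand-checkable sketch of the same two metrics written directly in NumPy. The helper name `rmse_mae` and the toy values are assumptions for illustration only.

```python
import numpy as np

def rmse_mae(act, pred):
    """Return (root mean squared error, mean absolute error)."""
    err = np.asarray(act, dtype=float) - np.asarray(pred, dtype=float)
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    return rmse, mae

# Toy example: the errors are (-1, 0, -2)
toy_act = [1.0, 2.0, 3.0]
toy_pred = [2.0, 2.0, 5.0]
toy_rmse, toy_mae = rmse_mae(toy_act, toy_pred)

# RMSE = sqrt((1 + 0 + 4) / 3), MAE = (1 + 0 + 2) / 3
print(toy_rmse, toy_mae)
```

Because the errors are squared before averaging, the RMSE is pulled up more strongly by the single error of 2 than the MAE is.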

## Polynomial Features

Linear regression finds the best coefficients $\beta_i$ for:
$$ x_0 \beta_0 + x_1 \beta_1 + x_2 \beta_2 = y $$

Adding polynomial features is still a linear regression problem, just with more terms:

$$ x_0 \beta_0 + x_1 \beta_1 + x_2 \beta_2 + x_0^2 \beta_3 + x_0 x_1 \beta_4 + \dots = y $$

We need to use our original data X to calculate the additional polynomial features:

In [11]:
trn.shape

(353, 10)

The performance of the model was poor. To improve it, we can add some features. Currently our model is linear in each variable, but we can add polynomial features to change this.

In [12]:
poly = PolynomialFeatures(include_bias= False)

In [15]:
trn_feat = poly.fit_transform(trn)

In [17]:
', '.join(poly.get_feature_names_out(feature_names))

'age, sex, bmi, bp, s1, s2, s3, s4, s5, s6, age^2, age sex, age bmi, age bp, age s1, age s2, age s3, age s4, age s5, age s6, sex^2, sex bmi, sex bp, sex s1, sex s2, sex s3, sex s4, sex s5, sex s6, bmi^2, bmi bp, bmi s1, bmi s2, bmi s3, bmi s4, bmi s5, bmi s6, bp^2, bp s1, bp s2, bp s3, bp s4, bp s5, bp s6, s1^2, s1 s2, s1 s3, s1 s4, s1 s5, s1 s6, s2^2, s2 s3, s2 s4, s2 s5, s2 s6, s3^2, s3 s4, s3 s5, s3 s6, s4^2, s4 s5, s4 s6, s5^2, s5 s6, s6^2'

In [18]:
trn_feat.shape

(353, 65)
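The 65 columns can be checked by counting monomials. The sketch below assumes the degree-2 expansion without a bias column used here; the helper name `n_poly_features` is hypothetical.

```python
from math import comb

def n_poly_features(n, degree=2):
    """Number of monomials of total degree 1..degree in n variables."""
    # comb(n + degree, degree) counts degrees 0..degree; subtract the constant term
    return comb(n + degree, degree) - 1

count_10 = n_poly_features(10)

# Equivalent count for degree 2: n linear terms plus n*(n+1)/2 products
count_10_by_hand = 10 + 10 * 11 // 2

print(count_10, count_10_by_hand)
```

The number of generated columns grows quadratically with the number of input variables, which is why feature generation becomes the expensive step as the data gets wider.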

In [19]:
# Fit the linear model on the expanded polynomial features
regr.fit(trn_feat, y_trn)

LinearRegression()

In [20]:
# Reuse the transformer that was already fitted on the training data
regr_metric(y_test, regr.predict(poly.transform(test)))

(56.08366396156073, 42.37826260086632)

Feature generation takes time that is quadratic in the number of features and linear in the number of data points, so the code below gets slow as the data grows.

In [21]:
%timeit poly.fit_transform(trn)

5.03 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Speeding up feature generation

We would like to speed this up. We will use Numba, a Python library that compiles Python functions to native machine code using LLVM.

Numba is a just-in-time compiler.

Experiments with vectorization and native code

Let's get acquainted with Numba.

In [22]:
%matplotlib inline

In [25]:
import math, numpy as np, matplotlib.pyplot as plt
from pandas_summary import DataFrameSummary
from scipy import ndimage

In [28]:
from numba import jit, vectorize, guvectorize, cuda, float32, void, float64

We will show the impact of:
1. Avoiding memory allocations and copies
2. Better locality
3. Vectorization

If we use NumPy on whole arrays at a time, it creates lots of temporaries and cannot make good use of the cache. If we use Numba to loop through an array one item at a time, we don't have to allocate large temporary arrays, and we can reuse cached data since we're doing multiple calculations on each array item.
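To illustrate the point about temporaries, here is a sketch of the same arithmetic that is benchmarked in this section, written with NumPy's `out=` argument so that every step writes into one of two buffers allocated once by the caller. The function name `proc_numpy_inplace` and the demo arrays are assumptions for illustration; this shows the idea of avoiding allocations and is not a claim about how much faster it runs.

```python
import numpy as np

def proc_numpy_inplace(x, y, out, tmp):
    """Compute z*(z - 0.88) with z = x' + y' + 99, using only two buffers."""
    np.multiply(x, 2, out=out)        # out = x*2
    np.multiply(y, 55, out=tmp)       # tmp = y*55
    np.subtract(out, tmp, out=out)    # out = x*2 - y*55      (updated x)
    np.multiply(y, 2, out=tmp)        # tmp = y*2
    np.add(out, tmp, out=tmp)         # tmp = updated x + y*2 (updated y)
    np.add(out, tmp, out=out)         # out = updated x + updated y
    np.add(out, 99, out=out)          # out = z
    np.subtract(out, 0.88, out=tmp)   # tmp = z - 0.88
    np.multiply(out, tmp, out=out)    # out = z * (z - 0.88)
    return out

rng = np.random.default_rng(0)
x_demo = rng.standard_normal(10_000).astype('float32')
y_demo = rng.standard_normal(10_000).astype('float32')

# Buffers are allocated once and can be reused across calls
out_buf = np.empty_like(x_demo)
tmp_buf = np.empty_like(x_demo)
z_inplace = proc_numpy_inplace(x_demo, y_demo, out_buf, tmp_buf)
```

The code is harder to read than the plain expression form, which is part of the motivation for letting a compiler such as Numba fuse the steps into a single loop instead.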

In [32]:
# Untyped and unvectorized
def proc_python(xx,yy):
    zz = np.zeros(nobs, dtype='float32')
    for j in range(nobs):
        x,y = xx[j], yy[j]
        x = x*2 - (y * 55)
        y = x + y*2
        z = x + y + 99
        z = z * (z - .88)
        zz[j] = z
    return zz

In [33]:
nobs = 10000
x = np.random.randn(nobs).astype('float32')
y = np.random.randn(nobs).astype('float32')

In [34]:
%timeit proc_python(x,y)

# this is the untyped and unvectorized version

1.35 s ± 574 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### NumPy

We will vectorize it now.

In [35]:
# Typed and Vectorized
def proc_numpy(x,y):
    z = np.zeros(nobs, dtype='float32')
    x = x*2 - ( y * 55 )
    y = x + y*2         
    z = x + y + 99      
    z = z * ( z - .88 ) 
    return z

In [36]:
np.allclose( proc_numpy(x,y), proc_python(x,y), atol=1e-4 )

True

In [37]:
%timeit proc_numpy(x,y)    # Typed and vectorized

641 µs ± 125 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


### Numba

Numba offers several different decorators. We will try two different ones:
1. @jit: very general
2. @vectorize: don't need to write a for loop

In [38]:
@jit()
def proc_numba(xx,yy,zz):
    for j in range(nobs):   
        x, y = xx[j], yy[j] 
        x = x*2 - ( y * 55 )
        y = x + y*2         
        z = x + y + 99      
        z = z * ( z - .88 ) 
        zz[j] = z           
    return zz

In [39]:
z = np.zeros(nobs).astype('float32')
np.allclose( proc_numpy(x,y), proc_numba(x,y,z), atol=1e-4 )


True

In [40]:
%timeit proc_numba(x,y,z)

95.8 µs ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Now we'll use the vectorize decorator. We write the computation for a single element, and Numba builds the loop over the arrays for us.

In [41]:
@vectorize
def vec_numba(x,y):
    x = x*2 - ( y * 55 )
    y = x + y*2         
    z = x + y + 99      
    return z * ( z - .88 ) 

In [42]:
np.allclose(vec_numba(x,y), proc_numba(x,y,z), atol=1e-4 )

True

In [43]:
%timeit vec_numba(x,y)

104 µs ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Numba is fast: both compiled versions run in roughly 100 µs, about six times faster than the NumPy version (641 µs).

### Numba Polynomial features

In [44]:
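@jit(nopython=True)
def vec_poly(x, res):
    # Fill res with degree-2 features: for each column i, first x_i itself,
    # then the products x_i * x_j for every j >= i.
    m,n=x.shape
    feat_idx=0
    for i in range(n):
        v1=x[:,i]
        # copy the original column
        for k in range(m): res[k,feat_idx] = v1[k]
        feat_idx+=1
        # products with column i and every later column (includes the square)
        for j in range(i,n):
            for k in range(m): res[k,feat_idx] = v1[k]*x[k,j]
            feat_idx+=1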

#### Row Major vs Column Major Storage

        "The row-major layout of a matrix puts the first row in contiguous memory, then the second row right after it, then the third, and so on. Column-major layout puts the first column in contiguous memory, then the second, etc.... While knowing which layout a particular data set is using is critical for good performance, there's no single answer to the question which layout 'is better' in general.

        "It turns out that matching the way your algorithm works with the data layout can make or break the performance of an application.

        "The short takeaway is: always traverse the data in the order it was laid out."

        Column-major layout: Fortran, Matlab, R, and Julia

        Row-major layout: C, C++, Python, Pascal, Mathematica
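The difference between the two layouts can be inspected directly in NumPy through an array's strides, the number of bytes to step in memory to move one position along each axis. The small arrays below are assumptions for illustration.

```python
import numpy as np

# NumPy arrays are row-major (C order) by default
a_c = np.arange(6, dtype=np.float64).reshape(2, 3)

# Same values, stored column-major (Fortran order)
a_f = np.asfortranarray(a_c)

c_strides = a_c.strides   # (24, 8): next row is 3 floats away, next column is 1 float away
f_strides = a_f.strides   # (8, 16): next row is 1 float away, next column is 2 floats away

print(a_c.flags['C_CONTIGUOUS'], a_f.flags['F_CONTIGUOUS'])
print(c_strides, f_strides)
```

With column-major storage, walking down a single column touches adjacent memory, which matches the inner loops of `vec_poly` that run over all rows of one column at a time.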


In [45]:
trn = np.asfortranarray(trn)
test = np.asfortranarray(test)

In [46]:
m,n = trn.shape
n_feat = n*(n+1)//2 + n
trn_feat = np.zeros((m,n_feat), order='F')

test_feat = np.zeros((len(y_test), n_feat), order='F')

In [47]:
vec_poly(trn,trn_feat)
vec_poly(test,test_feat)

In [48]:
regr.fit(trn_feat, y_trn)

LinearRegression()

In [49]:
regr_metric(y_test, regr.predict(test_feat))

(56.08366396156081, 42.37826260086712)

In [50]:
%timeit vec_poly(trn, trn_feat)

405 µs ± 125 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


For comparison, this is the timing of the scikit-learn PolynomialFeatures implementation:

In [51]:
%timeit poly.fit_transform(trn)

The slowest run took 5.93 times longer than the fastest. This could mean that an intermediate result is being cached.
2.12 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Regularization and Noise

Regularization is a way to reduce over-fitting and create models that better generalize to new data. Lasso regression uses an L1 penalty, which pushes toward sparse coefficients.
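A small sketch of the sparsity effect on synthetic data follows. scikit-learn's `Lasso` minimizes $\frac{1}{2n} \| y - X\beta \|_2^2 + \alpha \| \beta \|_1$; the data, the choice `alpha=0.1`, and all variable names here are assumptions for illustration and are unrelated to the diabetes results.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: 20 variables, only the first 3 actually influence y
X_syn = rng.standard_normal((200, 20))
coef_true = np.zeros(20)
coef_true[:3] = [3.0, -2.0, 1.5]
y_syn = X_syn @ coef_true + 0.1 * rng.standard_normal(200)

ols = LinearRegression().fit(X_syn, y_syn)
lasso = Lasso(alpha=0.1).fit(X_syn, y_syn)

# Ordinary least squares gives every variable a (small) nonzero weight,
# while the L1 penalty sets many coefficients exactly to zero
n_nonzero_ols = int(np.count_nonzero(ols.coef_))
n_nonzero_lasso = int(np.count_nonzero(lasso.coef_))
print(n_nonzero_ols, n_nonzero_lasso)
```

The L1 penalty also shrinks the coefficients it keeps, so the surviving weights are somewhat smaller in magnitude than the true ones.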

In [52]:
reg_regr = linear_model.LassoCV(n_alphas= 10)

In [53]:
reg_regr.fit(trn_feat, y_trn)

LassoCV(n_alphas=10)

In [54]:
reg_regr.alpha_

0.011006106192913239

In [57]:
regr_metric(y_test, reg_regr.predict(test_feat))

(49.58213957493595, 39.277438762354656)

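### Noise

Now we add some label noise: we pick 10 random training indices (drawn with replacement) and multiply their targets by 10.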

In [58]:
idxs = np.random.randint(0, len(trn), 10)

In [60]:
y_trn2 = np.copy(y_trn)
y_trn2[idxs] *= 10 # label noise

In [61]:
regr = linear_model.LinearRegression()
regr.fit(trn, y_trn)
regr_metric(y_test, regr.predict(test))

(49.87687754741944, 39.34822094080999)

In [63]:
regr.fit(trn, y_trn2)
regr_metric(y_test, regr.predict(test))

(69.14132406238272, 57.59514125171805)

Huber loss is a loss function that is less sensitive to outliers than squared error loss. It is quadratic for small error values and linear for large values.
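The textbook form of this loss can be sketched in a few lines. The function name `huber_loss` and the threshold `delta` are assumptions for illustration; scikit-learn's `HuberRegressor` uses its own parameterization (an `epsilon` threshold together with a learned scale), so this is not meant as its exact objective.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond that."""
    r = np.abs(np.asarray(residual, dtype=float))
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

# A small residual is penalized quadratically, a large one only linearly
small, large = huber_loss([0.5, 3.0], delta=1.0)
squared_large = 0.5 * 3.0 ** 2

print(small, large, squared_large)
```

For the residual of 3, the Huber penalty is 2.5 while half the squared error would be 4.5, and the gap widens as the residual grows. This is why a few labels inflated by a factor of 10 distort the fit much less.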

In [64]:
hregr = linear_model.HuberRegressor()
hregr.fit(trn, y_trn2)
regr_metric(y_test, hregr.predict(test))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


(51.24180446481409, 40.38448625427875)