## 3.1.1 Generalized Linear Models

The following methods are intended for regression in which the target value is expected to be a linear combination of the features. In mathematical notation, if $\hat{y}$ is the predicted value:

>$\hat{y}(w,x) = w_0 + w_1 x_1 + ... + w_p x_p$

Throughout, the vector $w = (w_1,...,w_p)$ is stored as coef_ and $w_0$ as intercept_.
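As a quick check of the notation, the sketch below fits a model on a few made-up points (the data values are illustrative assumptions) and rebuilds the prediction by hand from intercept_ and coef_.

```python
import numpy as np
from sklearn import linear_model

# Tiny illustrative dataset (values are arbitrary, not from the text).
X = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 2.0], [3.0, 5.0]])
y = np.array([1.0, 2.0, 4.0, 5.0])

reg = linear_model.LinearRegression().fit(X, y)

# y_hat = w_0 + w_1 * x_1 + ... + w_p * x_p, written with the fitted attributes.
manual = reg.intercept_ + X @ reg.coef_

print(manual)
print(reg.predict(X))
```

Both printed arrays should agree up to floating-point rounding.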

To perform classification with generalized linear models, see Logistic regression.

### 3.1.1.1 Ordinary Least Squares

**Complexity**: $O(n_{samples}n^2_{features})$

*LinearRegression* fits a linear model with coefficients $w = (w_1,...,w_p)$ to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.
>$\min_{w}||X w - y||^2_2$

LinearRegression takes arrays X and y in its fit method and stores the coefficients $w$ of the linear model in its coef_ member.
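To make the minimization concrete, here is a minimal sketch that solves the same least-squares problem directly with `np.linalg.lstsq` and compares the result with LinearRegression. The synthetic data and the names `true_w`, `X1` and `w_full` are assumptions made for illustration.

```python
import numpy as np
from sklearn import linear_model

# Synthetic, noise-free data (illustrative assumption) so the fit is exact.
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
true_w = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ true_w

reg = linear_model.LinearRegression().fit(X, y)

# Same problem solved directly: a leading column of ones models the intercept.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])
w_full, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(reg.intercept_, reg.coef_)   # fitted w_0 and (w_1, ..., w_p)
print(w_full)                      # [w_0, w_1, ..., w_p] from lstsq
```

The two solutions should match up to rounding, since both minimize the same residual sum of squares.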

**Assumption**: the independence of the features. 

When features are correlated and the columns of the design matrix $X$ have an approximate linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed target, producing a large variance. This situation of multicollinearity can arise, for example, when data are collected without an experimental design.
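The effect can be seen in a small simulation. The sketch below (synthetic data and the helper `coef_spread` are illustrative assumptions) compares an independent pair of features with a nearly duplicated pair, reporting the condition number of the design matrix and how much the fitted coefficients move when only the noise in the target changes.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
n = 100
x1 = rng.randn(n)
x2_indep = rng.randn(n)                # unrelated to x1
x2_collin = x1 + 1e-3 * rng.randn(n)   # almost a copy of x1

def coef_spread(X, n_repeats=200):
    """Std. dev. of the OLS coefficients over repeated noisy targets."""
    coefs = []
    for _ in range(n_repeats):
        y = X[:, 0] + X[:, 1] + 0.1 * rng.randn(n)
        coefs.append(linear_model.LinearRegression().fit(X, y).coef_)
    return np.std(coefs, axis=0)

X_indep = np.column_stack([x1, x2_indep])
X_collin = np.column_stack([x1, x2_collin])

cond_indep = np.linalg.cond(X_indep)
cond_collin = np.linalg.cond(X_collin)
spread_indep = coef_spread(X_indep)
spread_collin = coef_spread(X_collin)

print(cond_indep, cond_collin)        # condition numbers of the design matrices
print(spread_indep, spread_collin)    # coefficient variability under noise
```

The near-collinear design should show a much larger condition number and a much larger spread in the coefficients, even though the noise level is identical in both cases.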


In [3]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
print(reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2]))
print(reg.coef_)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
[0.5 0.5]


### 3.1.1.2 Ridge Regression

**Complexity**: $O(n_{samples}n^2_{features})$

Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients.
> $\min_{w}{||X w - y||^2_2 + \alpha||w||^2_2}$

The complexity parameter $\alpha \geq 0$ controls the amount of shrinkage: the larger the value of $\alpha$, the greater the amount of shrinkage, and thus the more robust the coefficients become to collinearity.
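A minimal sketch of the shrinkage effect, assuming a synthetic near-collinear design and an arbitrary grid of alpha values chosen for illustration: the norm of the fitted coefficient vector is recorded for each alpha.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
x1 = rng.randn(100)
X = np.column_stack([x1, x1 + 1e-3 * rng.randn(100)])  # near-collinear columns
y = X[:, 0] + X[:, 1] + 0.1 * rng.randn(100)

alphas = [1e-6, 1e-2, 1.0, 100.0]   # illustrative grid, not a recommendation
norms = []
for alpha in alphas:
    reg = linear_model.Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(reg.coef_))
    print(alpha, reg.coef_)

print(norms)  # the coefficient norm should not grow as alpha increases
```

With a very small alpha the two coefficients can be far apart because the data barely distinguish the two columns; as alpha grows they are pulled toward similar, smaller values.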

**Assumption**: the features may be multicollinear; the penalty keeps the estimate stable in that case.

In [2]:
from sklearn import linear_model
reg = linear_model.Ridge(alpha=.5)
print(reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1]))
print(reg.coef_)
print(reg.intercept_)

Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
[0.34545455 0.34545455]
0.1363636363636364


RidgeCV implements ridge regression with built-in cross-validation of the alpha parameter. The object works in the same way as GridSearchCV except that it defaults to Generalized Cross-Validation (GCV), an efficient form of leave-one-out cross-validation.

In [4]:
import numpy as np
from sklearn import linear_model
reg = linear_model.RidgeCV(alphas=np.logspace(-6,6,13))
print(reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1]))
print(reg.alpha_)

RidgeCV(alphas=array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01,
       1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06]),
    cv=None, fit_intercept=True, gcv_mode=None, normalize=False,
    scoring=None, store_cv_values=False)
0.01


### 3.1.1.3 Lasso

The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent. For this reason Lasso and its variants are fundamental to the field of compressed sensing.

Under certain conditions, it can recover the exact set of non-zero coefficients.

>$\min_{w}{\frac{1}{2n_{samples}}||X w - y||^2_2 + \alpha||w||_1}$

The implementation in the class Lasso uses coordinate descent as the algorithm to fit the coefficients.
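To illustrate the idea of coordinate descent, here is a minimal sketch that updates one coefficient at a time with a soft-thresholding step. It is not scikit-learn's actual implementation: the helpers `soft_threshold` and `lasso_cd` are hypothetical, the intercept is omitted for simplicity, and a fixed number of sweeps replaces a proper stopping rule. The result is compared against Lasso with `fit_intercept=False` on synthetic data.

```python
import numpy as np
from sklearn import linear_model

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=500):
    """Cyclic coordinate descent for (1/2n)||Xw - y||^2 + alpha * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n          # per-feature curvature
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

# Synthetic sparse problem (illustrative assumption).
rng = np.random.RandomState(0)
X = rng.randn(60, 5)
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.01 * rng.randn(60)

w_cd = lasso_cd(X, y, alpha=0.1)
reg = linear_model.Lasso(alpha=0.1, fit_intercept=False,
                         tol=1e-10, max_iter=10000).fit(X, y)

print(w_cd)
print(reg.coef_)
```

The soft-thresholding step is what produces exact zeros: whenever a feature's correlation with the partial residual falls below alpha in absolute value, its coefficient is set to zero.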

In [5]:
from sklearn import linear_model
reg = linear_model.Lasso(alpha=.1)
print(reg.fit([[0, 0], [1, 1]], [0, 1]))
print(reg.coef_)
print(reg.predict([[1,1]]))

Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
[0.6 0. ]
[0.8]


#### Using cross-validation

scikit-learn exposes objects that set the Lasso alpha parameter by cross-validation: LassoCV and LassoLarsCV. LassoLarsCV is based on the Least Angle Regression algorithm.

- LassoCV: preferable for high-dimensional datasets with many collinear features
- LassoLarsCV:
    - explores more relevant values of the alpha parameter
    - is faster if the number of samples is very small compared to the number of features
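Both estimators can be tried side by side. The sketch below assumes a synthetic dataset from `make_regression` and 5-fold cross-validation; these choices are illustrative and not taken from the text.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LassoLarsCV

# Synthetic data: 20 features, only 5 of which carry signal.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

lasso_cv = LassoCV(cv=5, random_state=0).fit(X, y)   # coordinate descent path
lars_cv = LassoLarsCV(cv=5).fit(X, y)                # least angle regression path

# Selected alpha and number of non-zero coefficients for each estimator.
print(lasso_cv.alpha_, np.count_nonzero(lasso_cv.coef_))
print(lars_cv.alpha_, np.count_nonzero(lars_cv.coef_))
```

The two selected alpha values need not coincide, because each estimator evaluates a different grid of candidate values along its own path.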
    
#### Information-criteria based model selection

Alternatively, the estimator LassoLarsIC proposes to use the Akaike information criterion (AIC) and the Bayes information criterion (BIC). It is a computationally cheaper way to find the optimal value of alpha, as the regularization path is computed only once instead of k+1 times when using k-fold cross-validation. However, such criteria need a proper estimate of the degrees of freedom of the solution, are derived for large samples (asymptotic results), and assume the model is correct, i.e. that the data are actually generated by this model. They also tend to break when the problem is badly conditioned (more features than samples).
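A minimal usage sketch, assuming synthetic data with many more samples than features so that the caveats above do not apply; the dataset parameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

# Well-conditioned synthetic problem: 200 samples, 10 features.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

aic = LassoLarsIC(criterion="aic").fit(X, y)
bic = LassoLarsIC(criterion="bic").fit(X, y)

# Selected alpha and number of non-zero coefficients under each criterion.
print(aic.alpha_, np.count_nonzero(aic.coef_))
print(bic.alpha_, np.count_nonzero(bic.coef_))
```

Each fit computes a single regularization path and picks the alpha that minimizes the chosen criterion, so no cross-validation folds are involved.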

#### Multi-task Lasso

The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly: y is a 2D array of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks.

>$\min_{W}{\frac{1}{2n_{samples}}||X W - y||^2_{Fro} + \alpha||W||_{21}}$

where Fro indicates the Frobenius norm
>$||A||_{Fro} = \sqrt{\sum_{ij}{a^2_{ij}}}$

and the mixed $\ell_1 \ell_2$ norm reads
>$||A||_{21} = \sum_{i}{\sqrt{\sum_j{a^2_{ij}}}}$
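The sketch below fits MultiTaskLasso on synthetic data (the shapes, the value of alpha and the name `W_true` are illustrative assumptions) and checks the shared sparsity pattern. Note that coef_ has shape (n_tasks, n_features), so it is transposed to match the orientation of $W$ before evaluating the two norms with NumPy.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.RandomState(0)
n_samples, n_features, n_tasks = 100, 10, 3
X = rng.randn(n_samples, n_features)

# Only the first three features are relevant, and they are relevant for every task.
W_true = np.zeros((n_features, n_tasks))
W_true[:3] = 3.0 + rng.rand(3, n_tasks)
Y = X @ W_true + 0.1 * rng.randn(n_samples, n_tasks)

reg = MultiTaskLasso(alpha=0.5).fit(X, Y)

# A feature is selected if its coefficient is non-zero for at least one task.
nonzero = reg.coef_ != 0                 # shape (n_tasks, n_features)
support = nonzero.any(axis=0)
print(support)

# The two norms from the formulas, evaluated on W = coef_.T.
A = reg.coef_.T                          # shape (n_features, n_tasks)
fro = np.linalg.norm(A, "fro")           # sqrt of the sum of squared entries
l21 = np.sqrt((A ** 2).sum(axis=1)).sum()  # sum over rows of the row l2 norms
print(fro, l21)
```

Because the penalty acts on whole rows of $W$, a feature is either dropped for all tasks or kept for all tasks, which is the joint selection described above.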