## 3.1.1 Generalized Linear Models

The following are a set of methods intended for regression in which the target value is expected to be a linear combination of the features.

>$\hat{y}(w,x) = w_0 + w_1 x_1 + ... + w_p x_p$

Throughout this section, the vector $w = (w_1,...,w_p)$ is referred to as coef_ and $w_0$ as intercept_.
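
As a quick check of this notation, the sketch below fits a LinearRegression on a small toy dataset (invented here for illustration) and rebuilds its predictions by hand from intercept_ and coef_.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 1 + 2*x1 + 3*x2 with no noise (illustrative values).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

reg = LinearRegression().fit(X, y)

# Rebuild the prediction by hand: y_hat = w_0 + X w
manual = reg.intercept_ + X @ reg.coef_

print("intercept_ (w_0):", reg.intercept_)
print("coef_ (w):", reg.coef_)
print("manual prediction matches predict():", np.allclose(manual, reg.predict(X)))
```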

To perform classification with generalized linear models, see Logistic regression.

### 1) Ordinary Least Squares

**Complexity**: $O(n_{samples}n^2_{features})$

*LinearRegression* fits a linear model with coefficients $w = (w_1,...,w_p)$ to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.
>$\min_{w}||X w - y||^2_2$

LinearRegression takes the arrays X and y in its fit method and stores the coefficients w of the linear model in its coef_ member.

**Assumption**: the independence of the features. 

When features are correlated and the columns of the design matrix $X$ have an approximate linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed target, producing a large variance. This situation of multicollinearity can arise, for example, when data are collected without an experimental design.
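
The effect described above can be observed numerically. The following is a minimal sketch on synthetic data (the sample size, noise level and the helper coef_spread are assumptions made for illustration): it compares the condition number of a design matrix with independent columns to one with nearly identical columns, and measures how much the fitted coefficients vary when the target noise is redrawn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples = 50
x1 = rng.normal(size=n_samples)

# Second column independent of x1 versus nearly a copy of x1.
X_indep = np.column_stack([x1, rng.normal(size=n_samples)])
X_collin = np.column_stack([x1, x1 + 1e-4 * rng.normal(size=n_samples)])

cond_indep = np.linalg.cond(X_indep)
cond_collin = np.linalg.cond(X_collin)


def coef_spread(X, n_repeats=200, noise=0.1):
    """Standard deviation of the OLS coefficients over repeated noisy targets."""
    coefs = []
    for _ in range(n_repeats):
        y = X[:, 0] + X[:, 1] + noise * rng.normal(size=X.shape[0])
        coefs.append(LinearRegression().fit(X, y).coef_)
    return np.std(coefs, axis=0)


spread_indep = coef_spread(X_indep)
spread_collin = coef_spread(X_collin)

print("condition number (independent):", cond_indep)
print("condition number (collinear):  ", cond_collin)
print("coefficient std (independent): ", spread_indep)
print("coefficient std (collinear):   ", spread_collin)
```

The collinear design has a far larger condition number, and its coefficient estimates fluctuate far more from one noise draw to the next.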


In [None]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
print(reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2]))
print(reg.coef_)

### 2) Ridge Regression

**Complexity**: $O(n_{samples}n^2_{features})$

Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefﬁcients. 
> $\min_{w}{||X w - y||^2_2 + \alpha||w||^2_2}$

The complexity parameter $\alpha \geq 0$ controls the amount of shrinkage: the larger the value of $\alpha$, the greater the amount of shrinkage and thus the coefficients become more robust to collinearity.

**Assumption**: the features may be multicollinear.

In [None]:
from sklearn import linear_model
reg = linear_model.Ridge(alpha=.5)
print(reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1]))
print(reg.coef_)
print(reg.intercept_)
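
To make the role of $\alpha$ concrete, here is a small sketch on synthetic, nearly collinear data (the data and the grid of alpha values are arbitrary choices for illustration). It records the Euclidean norm of coef_ for increasing alpha.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
# Two nearly collinear features.
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=100)])
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=100)

alphas = [0.01, 1.0, 100.0, 10000.0]
norms = []
for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(np.linalg.norm(coef))
    print(f"alpha={alpha:>8}: coef_={coef}, ||coef_||_2={norms[-1]:.4f}")
```

The norm of the coefficient vector shrinks as alpha grows, which is the shrinkage the penalty term is meant to produce.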

RidgeCV implements ridge regression with built-in cross-validation of the alpha parameter. The object works in the same way as GridSearchCV except that it defaults to Generalized Cross-Validation (GCV), an efficient form of leave-one-out cross-validation.

In [None]:
import numpy as np
from sklearn import linear_model
reg = linear_model.RidgeCV(alphas=np.logspace(-6,6,13))
print(reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1]))
print(reg.alpha_)

### 3) Lasso

The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent. For this reason Lasso and its variants are fundamental to the field of compressed sensing.

Under certain conditions, it can recover the exact set of non-zero coefficients.

>$\min_{w}{\frac{1}{2n_{samples}}||X w - y||^2_2 + \alpha||w||_1}$

The implementation in the class Lasso uses coordinate descent as the algorithm to fit the coefficients.

In [None]:
from sklearn import linear_model
reg = linear_model.Lasso(alpha=.1)
print(reg.fit([[0, 0], [1, 1]], [0, 1]))
print(reg.coef_)
print(reg.predict([[1,1]]))
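
The toy example above is too small to show sparsity. The sketch below uses synthetic data in which only the first three of twenty features influence the target (the sizes, coefficients and alpha are assumptions chosen for illustration) and counts the non-zero entries of coef_.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_features = 100, 20
X = rng.normal(size=(n_samples, n_features))

# Only the first three features carry signal.
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + 0.1 * rng.normal(size=n_samples)

reg = Lasso(alpha=0.1).fit(X, y)
n_nonzero = np.count_nonzero(reg.coef_)

print("non-zero coefficients:", n_nonzero, "of", n_features)
print("indices kept:", np.flatnonzero(reg.coef_))
```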

#### Using cross-validation

scikit-learn exposes objects that set the Lasso alpha parameter by cross-validation: LassoCV and LassoLarsCV. LassoLarsCV is based on the Least Angle Regression algorithm.

- LassoCV: for high-dimensional datasets with many collinear features
- LassoLarsCV:
    - exploring more relevant values of alpha parameter
    - faster if the number of samples is very small compared to the number of features
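
Both estimators can be used in the same way. The sketch below fits them on one synthetic dataset (the data and cv=5 are arbitrary choices for illustration) and reads the selected regularization strength from the alpha_ attribute.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LassoLarsCV

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
true_coef = np.array([2.0, -1.0] + [0.0] * 8)
y = X @ true_coef + 0.1 * rng.normal(size=80)

# Coordinate-descent based selection over an automatically chosen alpha grid.
cd = LassoCV(cv=5).fit(X, y)
# LARS based selection, which evaluates the breakpoints of the lasso path.
lars = LassoLarsCV(cv=5).fit(X, y)

print("LassoCV alpha_:    ", cd.alpha_)
print("LassoLarsCV alpha_:", lars.alpha_)
```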
    
#### Information-criteria based model selection

Alternatively, the estimator LassoLarsIC proposes to use the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). It is a computationally cheaper alternative for finding the optimal value of alpha, as the regularization path is computed only once instead of k+1 times when using k-fold cross-validation. However, such criteria need a proper estimation of the degrees of freedom of the solution, are derived for large samples (asymptotic results) and assume the model is correct, i.e. that the data are actually generated by this model. They also tend to break when the problem is badly conditioned (more features than samples).
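
A minimal usage sketch, assuming synthetic data with more samples than features (so that the criteria are well defined), might look like this. The criterion argument selects between AIC and BIC.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_coef = np.array([2.0, -1.0, 0.5] + [0.0] * 7)
y = X @ true_coef + 0.1 * rng.normal(size=100)

aic = LassoLarsIC(criterion="aic").fit(X, y)
bic = LassoLarsIC(criterion="bic").fit(X, y)

print("alpha chosen by AIC:", aic.alpha_)
print("alpha chosen by BIC:", bic.alpha_)
print("non-zero coefficients (AIC, BIC):",
      np.count_nonzero(aic.coef_), np.count_nonzero(bic.coef_))
```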

#### Multi-task Lasso

The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly: Y is a 2D array, of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks.

>$\min_{W}{\frac{1}{2n_{samples}}||X W - Y||^2_{Fro} + \alpha||W||_{21}}$

where Fro indicates the Frobenius norm:
>$||A||_{Fro} = \sqrt{\sum_{ij}{a^2_{ij}}}$

and the mixed $l_1 l_2$ norm reads:
>$||A||_{21} = \sum_{i}{\sqrt{\sum_j{a^2_{ij}}}}$
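
The two norms and the shared-support constraint can be checked with a short sketch. The data below is synthetic, and the helpers frobenius_norm and l21_norm are written here for illustration only. Note that scikit-learn stores coef_ with shape (n_tasks, n_features), so it is transposed to obtain a matrix whose rows correspond to features.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso


def frobenius_norm(A):
    """Square root of the sum of all squared entries."""
    return np.sqrt(np.sum(A ** 2))


def l21_norm(A):
    """Sum over rows of the Euclidean norm of each row."""
    return np.sum(np.sqrt(np.sum(A ** 2, axis=1)))


rng = np.random.default_rng(0)
n_samples, n_features, n_tasks = 60, 8, 3
X = rng.normal(size=(n_samples, n_features))

# Only the first two features are relevant, and they are relevant for every task.
W_true = np.zeros((n_features, n_tasks))
W_true[0] = [3.0, -2.0, 1.5]
W_true[1] = [-1.0, 2.0, 2.5]
Y = X @ W_true + 0.1 * rng.normal(size=(n_samples, n_tasks))

reg = MultiTaskLasso(alpha=0.1).fit(X, Y)
W = reg.coef_.T  # rows: features, columns: tasks

# A feature is selected if its row of W is non-zero.
selected = np.any(W != 0, axis=1)
print("selected features:", np.flatnonzero(selected))
print("Frobenius norm of W:", frobenius_norm(W))
print("l21 norm of W:", l21_norm(W))
```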

### 4) Elastic-Net

ElasticNet is a linear regression model trained with both $l_1$ and $l_2$-norm regularization of the coefficients. This combination allows for learning a sparse model where few of the weights are non-zero like Lasso, while still maintaining the regularization properties of Ridge. We control the convex combination of $l_1$ and $l_2$ using the l1_ratio parameter.

Elastic-net is useful when there are multiple features which are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both. 

A practical advantage of trading-off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge’s stability under rotation.

>$\min_{w}{\frac{1}{2n_{samples}}||X w - y||^2_2 + \alpha\rho||w||_{1} + \frac{\alpha(1-\rho)}{2}||w||^2_2}$
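
The behaviour with correlated features can be illustrated with an extreme case: two identical columns. This is a sketch on synthetic data, and the values of alpha, l1_ratio, tol and max_iter are assumptions chosen so that coordinate descent converges tightly. Here l1_ratio plays the role of $\rho$ in the objective above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x])  # two identical, perfectly correlated features
y = 2.0 * x + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-8, max_iter=10000).fit(X, y)

print("Lasso coef_:     ", lasso.coef_)
print("ElasticNet coef_:", enet.coef_)
```

In this setup the Lasso puts nearly all the weight on one of the two copies, while the $l_2$ term of Elastic-Net spreads the weight evenly across both.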



### 5) Least Angle Regression

Least-angle regression (LARS) is a regression algorithm for high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani. LARS is similar to forward stepwise regression. At each step, it finds the feature most correlated with the target. When there are multiple features having equal correlation, instead of continuing along the same feature, it proceeds in a direction equiangular between the features.

The advantages of LARS are: 
- It is numerically efficient in contexts where the number of features is significantly greater than the number of samples.
- It is computationally just as fast as forward selection and has the same order of complexity as ordinary least squares.
- It produces a full piecewise linear solution path, which is useful in cross-validation or similar attempts to tune the model.
- If two features are almost equally correlated with the target, then their coefficients should increase at approximately the same rate. The algorithm thus behaves as intuition would expect, and is also more stable.
- It is easily modified to produce solutions for other estimators, like the Lasso.

The disadvantages of the LARS method include:
- Because LARS is based upon an iterative refitting of the residuals, it would appear to be especially sensitive to the effects of noise. This problem is discussed in detail by Weisberg in the discussion section of the Efron et al. (2004) Annals of Statistics article.

The algorithm is similar to forward stepwise regression, but instead of including features at each step, the estimated coefficients are increased in a direction equiangular to each one’s correlations with the residual.
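
The solution path can be inspected with the lars_path function. The following sketch runs it on synthetic data (chosen for illustration) and checks that the first feature to enter is the one most correlated with the target. The Lars estimator with n_nonzero_coefs stops the same procedure early.

```python
import numpy as np
from sklearn.linear_model import Lars, lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# alphas: maximum absolute covariance at each breakpoint of the path
# active: indices of the active features at the end of the path, in entry order
# coefs:  coefficients along the path, one column per breakpoint
alphas, active, coefs = lars_path(X, y, method="lar")

most_correlated = int(np.argmax(np.abs(X.T @ y)))
print("order in which features enter:", active)
print("feature most correlated with y:", most_correlated)
print("path shape (n_features, n_breakpoints):", coefs.shape)

# Stop after two features have entered the model.
reg = Lars(n_nonzero_coefs=2).fit(X, y)
print("Lars coef_ with at most two non-zero entries:", reg.coef_)
```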

#### LARS Lasso
LassoLars is a lasso model implemented using the LARS algorithm, and unlike the implementation based on coordinate descent, this yields the exact solution, which is piecewise linear as a function of the norm of its coefficients.

In [None]:
from sklearn import linear_model 
reg = linear_model.LassoLars(alpha=.1) 
print(reg.fit([[0, 0], [1, 1]], [0, 1]))
print(reg.coef_)

### 6) Orthogonal Matching Pursuit (OMP)
OrthogonalMatchingPursuit and orthogonal_mp implement the OMP algorithm for approximating the fit of a linear model with constraints imposed on the number of non-zero coefficients (i.e. the $l_0$ pseudo-norm).

Being a forward feature selection method like Least Angle Regression, orthogonal matching pursuit can approximate the optimum solution vector with a fixed number of non-zero elements:

> $\arg\min_\gamma ||y - X\gamma||_2^2$ subject to $||\gamma||_0 \leq n_{nonzero\_coefs}$

OMP is based on a greedy algorithm that includes at each step the atom most highly correlated with the current residual. It is similar to the simpler matching pursuit (MP) method, but better in that at each iteration, the residual is recomputed using an orthogonal projection on the space of the previously chosen dictionary elements.
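
As a usage sketch, the estimator below is asked for at most three non-zero coefficients on synthetic data whose true signal has exactly three (the data is an assumption made for illustration).

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_samples, n_features = 100, 20
X = rng.normal(size=(n_samples, n_features))

# Sparse ground truth: only the first three features are used.
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + 0.05 * rng.normal(size=n_samples)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X, y)
support = np.flatnonzero(omp.coef_)

print("selected features:", support)
print("their coefficients:", omp.coef_[support])
```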

### 7) Bayesian Regression

Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the regularization parameter is not set in a hard sense but tuned to the data at hand. 

This can be done by introducing uninformative priors over the hyperparameters of the model. The $l_2$ regularization used in Ridge Regression is equivalent to finding a maximum a posteriori estimate under a Gaussian prior over the coefficients w with precision $\lambda$ (that is, variance $\lambda^{-1}$). Instead of setting $\lambda$ manually, it is possible to treat it as a random variable to be estimated from the data.

To obtain a fully probabilistic model, the output $y$ is assumed to be Gaussian distributed around $Xw$:
>$p(y|X,w,\alpha) = N(y|Xw,\alpha^{-1})$

where $\alpha$, the precision of the noise, is again treated as a random variable that is to be estimated from the data.

The advantages of Bayesian Regression are: 
- It adapts to the data at hand. 
- It can be used to include regularization parameters in the estimation procedure. 

The disadvantages of Bayesian regression include:
- Inference of the model can be time consuming.

#### Bayesian Ridge Regression
BayesianRidge estimates a probabilistic model of the regression problem as described above. The prior for the coefficient w is given by a spherical Gaussian:
>$p(w|\lambda) = N(w|0,\lambda^{-1}I_p)$

The priors over α and λ are chosen to be gamma distributions, the conjugate prior for the precision of the Gaussian. The resulting model is called Bayesian Ridge Regression, and is similar to the classical Ridge. 

The parameters w, α and λ are estimated jointly during the fit of the model, the regularization parameters α and λ being estimated by maximizing the log marginal likelihood.

In [None]:
from sklearn import linear_model
# Two identical features; the target equals either one of them.
X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
Y = [0., 1., 2., 3.]
reg = linear_model.BayesianRidge()
print(reg.fit(X, Y))
print(reg.predict([[1, 0.]]))
print(reg.coef_)
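
After fitting, the estimated precisions are available as alpha_ and lambda_, and predict can return a standard deviation alongside the mean. The sketch below uses synthetic data with a noise standard deviation of 0.5 (an assumption chosen for illustration), so the estimated noise precision should be in the vicinity of $1/0.5^2 = 4$.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=200)

reg = BayesianRidge().fit(X, y)

print("estimated noise precision alpha_: ", reg.alpha_)
print("estimated weight precision lambda_:", reg.lambda_)

# Predictive mean and standard deviation for the first five samples.
y_mean, y_std = reg.predict(X[:5], return_std=True)
print("predictive mean:", y_mean)
print("predictive std: ", y_std)
```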

### 8) Automatic Relevance Determination (ARD)

ARDRegression is very similar to Bayesian Ridge Regression, but can lead to sparser coefficients $w$. ARDRegression poses a different prior over $w$, by dropping the assumption of the Gaussian being spherical. Instead, the distribution over $w$ is assumed to be an axis-parallel, elliptical Gaussian distribution.

This means each coefficient $w_i$ is drawn from a Gaussian distribution, centered on zero and with a precision $\lambda_i$:

> $p(w|\lambda) = N(w|0,A^{-1})$
with $diag(A)=\lambda = \{\lambda_1,...,\lambda_p\}$

In contrast to Bayesian Ridge Regression, each coefficient $w_i$ has its own precision $\lambda_i$. The prior over all $\lambda_i$ is chosen to be the same gamma distribution, given by the hyperparameters lambda_1 and lambda_2.

ARD is also known in the literature as Sparse Bayesian Learning and Relevance Vector Machine.
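
The per-coefficient precisions can be inspected after fitting. In the sketch below (synthetic data in which only the first three of ten features matter, an assumption made for illustration), ARDRegression exposes one lambda_ value per feature, whereas BayesianRidge exposes a single shared value.

```python
import numpy as np
from sklearn.linear_model import ARDRegression, BayesianRidge

rng = np.random.default_rng(0)
n_samples, n_features = 100, 10
X = rng.normal(size=(n_samples, n_features))

true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + 0.1 * rng.normal(size=n_samples)

ard = ARDRegression().fit(X, y)
ridge = BayesianRidge().fit(X, y)

# A large precision pins the corresponding coefficient close to zero.
print("ARD lambda_ per feature:", ard.lambda_)
print("ARD coef_:", ard.coef_)
print("BayesianRidge shared lambda_:", ridge.lambda_)
print("BayesianRidge coef_:", ridge.coef_)
```

The irrelevant features receive much larger precisions than the relevant ones, which is how ARD drives their coefficients toward zero.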

### 9) Logistic regression

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

Logistic regression is implemented in LogisticRegression. This implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional $l_1$, $l_2$ or Elastic-Net regularization.
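
For the binary case, the link between the linear model and the predicted probabilities can be verified directly. The sketch below, on a synthetic dataset generated with make_classification, applies the logistic function to $w_0 + Xw$ by hand and compares the result with predict_proba.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

# Linear score followed by the logistic function 1 / (1 + exp(-z)).
z = X @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class as reported by scikit-learn.
p_sklearn = clf.predict_proba(X)[:, 1]

print("manual and predict_proba agree:", np.allclose(p_manual, p_sklearn))
```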

The solvers implemented in the class LogisticRegression are “liblinear”, “newton-cg”, “lbfgs”, “sag” and “saga”: 
- The solver “liblinear” uses a coordinate descent (CD) algorithm, and relies on the excellent C++ LIBLINEAR library, which is shipped with scikit-learn. However, the CD algorithm implemented in liblinear cannot learn a true multinomial (multiclass) model; instead, the optimization problem is decomposed in a “one-vs-rest” fashion so separate binary classifiers are trained for all classes. This happens under the hood, so LogisticRegression instances using this solver behave as multiclass classifiers. For $l_1$ regularization, sklearn.svm.l1_min_c can be used to calculate the lower bound for C in order to get a non-“null” model (one in which not all feature weights are zero).
- The “lbfgs”, “sag” and “newton-cg” solvers only support $l_2$ regularization or no regularization, and are found to converge faster for some high-dimensional data. Setting multi_class to “multinomial” with these solvers learns a true multinomial logistic regression model, which means that its probability estimates should be better calibrated than the default “one-vs-rest” setting. 
- The “sag” solver uses Stochastic Average Gradient descent. It is faster than other solvers for large datasets, when both the number of samples and the number of features are large. 
- The “saga” solver is a variant of “sag” that also supports the non-smooth penalty="l1". This is therefore the solver of choice for sparse multinomial logistic regression. It is also the only solver that supports penalty="elasticnet". 
- The “lbfgs” solver is an optimization algorithm that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm, which belongs to the quasi-Newton methods. It is recommended for small datasets; for larger datasets its performance suffers.

The “lbfgs” solver is used by default for its robustness. For large datasets the “saga” solver is usually faster. For large datasets, you may also consider using SGDClassifier with ‘log’ loss, which might be even faster but requires more tuning.
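
The solver and penalty combinations listed above can be tried side by side. This is a sketch on a synthetic binary problem; the values of C, l1_ratio and max_iter are arbitrary choices for illustration, and the penalty and solver arguments follow the names used in the text, which may emit deprecation warnings or differ in other scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

models = {
    "lbfgs + l2": LogisticRegression(solver="lbfgs", max_iter=1000),
    "liblinear + l1": LogisticRegression(solver="liblinear", penalty="l1", C=0.1),
    "saga + l1": LogisticRegression(solver="saga", penalty="l1", C=0.1,
                                    max_iter=5000),
    "saga + elasticnet": LogisticRegression(solver="saga", penalty="elasticnet",
                                            l1_ratio=0.5, C=0.1, max_iter=5000),
}

scores = {}
nonzeros = {}
for name, model in models.items():
    model.fit(X, y)
    scores[name] = model.score(X, y)              # training accuracy
    nonzeros[name] = np.count_nonzero(model.coef_)  # sparsity of the solution
    print(f"{name:>18}: accuracy={scores[name]:.3f}, "
          f"non-zero coefficients={nonzeros[name]}")
```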
