Across the module, we designate the vector $w = (w_1, \dots, w_p)$ as `coef_` and $w_0$ as `intercept_`.

# Ordinary Least Squares

In [1]:
import warnings

from sklearn import linear_model

warnings.filterwarnings('ignore')

reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [2]:
reg.coef_

array([0.5, 0.5])

However, coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. When terms are **correlated** and the columns of the design matrix $X$ have an approximate linear dependence, the design matrix becomes close to **singular**. As a result, the least-squares estimate becomes highly sensitive to random errors in the observed response, which produces a large variance in the estimated coefficients.
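The sensitivity described above can be seen on a tiny hand-built dataset. The sketch below is illustrative only: the arrays `x1`, `d` and the offset `eps` are made-up values, chosen so that the two columns of the design matrix are nearly identical and the effect of a small perturbation of the response can be worked out exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two almost identical columns: the second differs from the first by eps * d.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
d = np.array([1.0, -1.0, -1.0, 1.0])   # orthogonal to x1 and to a constant column
eps = 1e-4
X = np.column_stack([x1, x1 + eps * d])

y = X.sum(axis=1)          # exact response for w = (1, 1)
y_noisy = y + 0.01 * d     # perturb each response value by only 0.01

w_clean = LinearRegression().fit(X, y).coef_
w_noisy = LinearRegression().fit(X, y_noisy).coef_

print(np.linalg.cond(X))   # large condition number: X is close to singular
print(w_clean)             # approximately [1, 1]
print(w_noisy)             # approximately [-99, 101]
```

Because `d` equals the difference of the two columns divided by `eps`, a perturbation of size 0.01 in the response is absorbed as a shift of 0.01 / `eps` = 100 in each coefficient, with opposite signs.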

# Ridge Regression

Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of coefficients.

$$
\min_w \|Xw - y\|_2^2 + \alpha \|w\|_2^2
$$

In [3]:
reg = linear_model.Ridge(alpha=.5)
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])

Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [4]:
reg.coef_

array([0.34545455, 0.34545455])

In [5]:
reg.intercept_

0.1363636363636364
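The coefficients and intercept printed above can be reproduced with plain NumPy from the closed-form minimizer of the ridge objective. This is a minimal sketch, assuming the intercept is left unpenalized and handled by centering `X` and `y`; that assumption is consistent with the printed values, but it is not a description of the solver scikit-learn actually runs.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.1, 1.0])
alpha = 0.5

# Centering X and y lets the intercept be fitted without being penalized.
X_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - X_mean, y - y_mean

# Closed-form ridge solution: w = (Xc^T Xc + alpha * I)^{-1} Xc^T yc
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
w0 = y_mean - X_mean @ w

print(w)    # approximately [0.34545455 0.34545455]
print(w0)   # approximately 0.13636364
```

Note that the two columns of `X` are identical, so `Xc.T @ Xc` alone is singular; adding `alpha * I` is what makes the system solvable.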

RidgeCV implements ridge regression with built-in cross-validation of the alpha parameter. 

In [6]:
reg = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])

RidgeCV(alphas=[0.1, 1.0, 10.0], cv=None, fit_intercept=True, gcv_mode=None,
    normalize=False, scoring=None, store_cv_values=False)

In [7]:
reg.alpha_

0.1
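To see where the selected value comes from, the search can be written out by hand. With `cv=None`, RidgeCV uses an efficient form of leave-one-out cross-validation; the sketch below instead runs a brute-force leave-one-out loop with `Ridge` and `cross_val_score`, which is slower but makes the selection rule explicit. The names `loo_mse` and `best_alpha` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.1, 1.0])
alphas = [0.1, 1.0, 10.0]

# Mean squared leave-one-out error for each candidate alpha.
loo_mse = {}
for alpha in alphas:
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    loo_mse[alpha] = -scores.mean()

# The candidate with the smallest held-out error is selected.
best_alpha = min(loo_mse, key=loo_mse.get)
print(loo_mse)
print(best_alpha)   # 0.1, the same value as reg.alpha_ above
```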

# Lasso

The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution depends.

$$
\min_w \frac{1}{2 n_{\text{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1
$$

In [8]:
reg = linear_model.Lasso(alpha=0.1)
reg.fit([[0, 0], [1, 1]], [0, 1])

Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [9]:
reg.predict([[1, 1]])

array([0.8])
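The prediction above, and the sparsity the Lasso is known for, can be traced with a small hand-written cyclic coordinate descent on the objective shown earlier. This is a minimal sketch, not scikit-learn's implementation: the helpers `soft_threshold` and `lasso_cd` are hypothetical names, the intercept is assumed to be handled by centering, and every column is assumed to be non-constant.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; the result is 0 whenever |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=100):
    """Cyclic coordinate descent for (1/(2n)) * ||Xw - y||^2 + alpha * ||w||_1."""
    n, p = X.shape
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean          # centering handles the intercept
    w = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution removed.
            r = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ r
            # Assumes a non-constant column, so the denominator is non-zero.
            w[j] = soft_threshold(rho, n * alpha) / (Xc[:, j] @ Xc[:, j])
    return w, y_mean - X_mean @ w

X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0])
w, w0 = lasso_cd(X, y, alpha=0.1)
pred = w0 + np.array([1.0, 1.0]) @ w

print(w)      # approximately [0.6, 0.0]: the duplicate feature is dropped
print(pred)   # approximately 0.8
```

The two features are identical, yet only the first one receives weight. Once it has absorbed the signal, the remaining correlation of the second feature with the residual does not exceed the threshold `n * alpha`, so its coefficient is shrunk to zero.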

# Logistic Regression

The solvers implemented in the class LogisticRegression are "liblinear", "newton-cg", "lbfgs", "sag" and "saga".

The solver "liblinear" uses a coordinate descent (CD) algorithm and relies on the excellent C++ LIBLINEAR library, which is shipped with scikit-learn. However, the CD algorithm implemented in liblinear cannot learn a true multinomial (multiclass) model; instead, the optimization problem is decomposed in a "one-vs-rest" fashion, so a separate binary classifier is trained for each class.

The "lbfgs", "sag" and "newton-cg" solvers only support L2 penalization and are found to converge faster for some high-dimensional data. Setting multi_class to "multinomial" with these solvers learns a true multinomial logistic regression model, which means that its probability estimates should be better calibrated than those of the default "one-vs-rest" setting.

The "sag" solver uses Stochastic Average Gradient descent. It is faster than other solvers for large datasets, when both the number of samples and the number of features are large.

The "saga" solver is a variant of "sag" that also supports the non-smooth penalty="l1" option. This is therefore the solver of choice for sparse multinomial logistic regression.

In a nutshell, one may choose the solver with the following rules:

Case | Solver 
--- | --- 
L1 penalty | "liblinear" or "saga"
multinomial loss | "lbfgs", "sag", "saga" or "newton-cg"
very large dataset | "sag" or "saga"

The "saga" solver is often the **best** choice. In the scikit-learn release used here, the "liblinear" solver is the default for historical reasons; from release 0.22 onward the default is "lbfgs".
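As an illustration of the L1 row of the table, the sketch below fits a three-class model with `solver="saga"` and `penalty="l1"` on synthetic data. The dataset parameters and `C=0.05` are arbitrary choices made for this example, not recommendations. The `multi_class` argument is deliberately left out: recent scikit-learn releases fit a multinomial model by default with this solver, while older releases may need `multi_class="multinomial"` to get one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem: 20 features, of which only 3 carry signal.
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
X = StandardScaler().fit_transform(X)   # "sag"/"saga" converge faster on scaled features

clf = LogisticRegression(penalty="l1", solver="saga", C=0.05,
                         max_iter=5000, random_state=0)
clf.fit(X, y)

n_zero = int(np.sum(clf.coef_ == 0))    # coefficients removed by the L1 penalty
proba = clf.predict_proba(X[:5])        # one row of class probabilities per sample

print(clf.coef_.shape)                  # (3, 20): one coefficient row per class
print(n_zero)
print(proba.sum(axis=1))                # each row sums to 1
```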

For large datasets, you may also consider using SGDClassifier with 'log' loss.