<h1 id="Contents">Contents<a href="#Contents"></a></h1>
        <ol>
        <li><a class="" href="#Stochastic-Gradient-Descent">Stochastic Gradient Descent</a></li>
<ol><li><a class="" href="#SGDClassifier">SGDClassifier</a></li>
<ol><li><a class="" href="#Loss-functions">Loss functions</a></li>
<li><a class="" href="#Penalties">Penalties</a></li>
<li><a class="" href="#Multi-class-classification">Multi-class classification</a></li>
<li><a class="" href="#Parameters-of-SGDClassifier">Parameters of SGDClassifier</a></li>
<li><a class="" href="#Attributes-of-SGDClassifier">Attributes of SGDClassifier</a></li>
</ol><li><a class="" href="#SGDRegressor">SGDRegressor</a></li>
<ol><li><a class="" href="#Parameters-of-SGDRegressor">Parameters of SGDRegressor</a></li>
<li><a class="" href="#Attributes-of-SGDRegressor">Attributes of SGDRegressor</a></li>
</ol><li><a class="" href="#More-Informations-on-SGD">More Informations on SGD</a></li>
<ol><li><a class="" href="#Time-Complexity">Time Complexity</a></li>
<li><a class="" href="#Stopping-Criterion">Stopping Criterion</a></li>
<li><a class="" href="#Mathematical-Details">Mathematical Details</a></li>
</ol>

# Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to fitting linear classifiers and regressors under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. 

SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, the classifiers in this module easily scale to problems with more than $10^5$ training examples and more than $10^5$ features.

Strictly speaking, SGD is merely an optimization technique and does not correspond to a specific family of machine learning models. It is only a way to train a model. Often, an instance of `SGDClassifier` or `SGDRegressor` will have an equivalent estimator in the scikit-learn API, potentially using a different optimization technique. For example, using `SGDClassifier(loss='log_loss')` results in logistic regression, i.e. a model equivalent to `LogisticRegression` which is fitted via SGD instead of being fitted by one of the other solvers in `LogisticRegression`. Similarly, `SGDRegressor(loss='squared_error', penalty='l2')` and `Ridge` solve the same optimization problem, via different means.

The advantages of Stochastic Gradient Descent are:
* Efficiency.

* Ease of implementation (lots of opportunities for code tuning).

The disadvantages of Stochastic Gradient Descent include:

* SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.

* SGD is sensitive to feature scaling.

## SGDClassifier

The class SGDClassifier implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties for classification.

In [2]:
from sklearn.linear_model import SGDClassifier
X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=10)
clf.fit(X, y)
print(clf.predict([[2., 2.]]))

[1]


SGD fits a linear model to the training data.

In [3]:
clf.coef_, clf.intercept_

(array([[9.85221675, 9.85221675]]), array([-9.99002993]))

The signed distance to the hyperplane (computed as the dot product between the coefficients and the input sample, plus the intercept) is given by `SGDClassifier.decision_function`:

In [6]:
clf.decision_function([[2., 2.]])

array([29.41883706])
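As a minimal sketch of the relationship described above, the decision value can be recomputed by hand from `coef_` and `intercept_`. The `random_state` value and the variable names are assumptions chosen for reproducibility; they are not part of the original cells.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[0., 0.], [1., 1.]])
y = np.array([0, 1])

# random_state is fixed only so that the sketch is reproducible (assumption).
clf = SGDClassifier(loss="hinge", penalty="l2", random_state=0)
clf.fit(X, y)

X_new = np.array([[2., 2.]])

# Dot product between the coefficients and the sample, plus the intercept.
manual_scores = X_new @ clf.coef_.T + clf.intercept_  # shape (1, 1)

# Value reported by the estimator itself.
scores = clf.decision_function(X_new)                 # shape (1,)

print(manual_scores.ravel(), scores)
```

Both printed arrays should agree up to floating-point rounding.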

### Loss functions

The concrete loss function can be set via the loss parameter. `SGDClassifier` supports the following loss functions:

* `loss="hinge"`: (soft-margin) linear Support Vector Machine,

* `loss="modified_huber"`: smoothed hinge loss,

* `loss="log_loss"`: logistic regression,

In addition, all regression losses can be used. In that case the target is encoded as -1 or 1 and the problem is treated as a regression problem; the predicted class then corresponds to the sign of the predicted target.

The first two loss functions are lazy: they only update the model parameters if an example violates the margin constraint. This makes training very efficient and may result in sparser models (i.e. with more zero coefficients), even when the L2 penalty is used.
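The laziness mentioned above can be illustrated with a small NumPy sketch of the three classification losses, written as functions of the margin $y f(x)$ with $y \in \{-1, 1\}$. The function names are hypothetical helpers and the formulas are the textbook definitions, assumed here rather than copied from scikit-learn's implementation.

```python
import numpy as np

def hinge(margin):
    # Zero as soon as the margin constraint y * f(x) >= 1 is satisfied.
    return np.maximum(0.0, 1.0 - margin)

def modified_huber(margin):
    # Squared hinge for margin >= -1, linear for badly misclassified points.
    return np.where(margin >= -1.0,
                    np.maximum(0.0, 1.0 - margin) ** 2,
                    -4.0 * margin)

def log_loss(margin):
    # log(1 + exp(-margin)), computed in a numerically stable way.
    return np.logaddexp(0.0, -margin)

margins = np.array([-2.0, 0.0, 1.0, 3.0])
print(hinge(margins))           # [3. 1. 0. 0.]
print(modified_huber(margins))  # [8. 1. 0. 0.]
print(log_loss(margins))        # strictly positive everywhere
```

For margins of at least 1 the first two losses (and hence their gradients) are exactly zero, so no update is needed, whereas the log loss always contributes a small gradient.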


Using `loss="log_loss"` or `loss="modified_huber"` enables the `predict_proba` method, which gives a vector of probability estimates $P(y|x)$ per sample $x$:

In [9]:
clf = SGDClassifier(loss="log_loss", max_iter=10).fit(X, y)
clf.predict_proba([[1., 1.]])

array([[0.00416343, 0.99583657]])

### Penalties

The concrete penalty can be set via the penalty parameter. SGD supports the following penalties:

* **penalty="l2"**: L2 norm penalty on `coef_`.

* **penalty="l1"**: L1 norm penalty on `coef_`.

* **penalty="elasticnet"**: Convex combination of L2 and L1: `(1 - l1_ratio) * L2 + l1_ratio * L1`.
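The three penalties listed above can be sketched in NumPy as follows. The helper names are hypothetical, and the factor of 1/2 in the L2 term is an assumption (a common convention); only the convex combination itself is taken from the text.

```python
import numpy as np

def l2_penalty(w):
    # Assumption: squared L2 norm with the conventional 1/2 factor.
    return 0.5 * np.sum(w ** 2)

def l1_penalty(w):
    return np.sum(np.abs(w))

def elasticnet_penalty(w, l1_ratio=0.15):
    # Convex combination: (1 - l1_ratio) * L2 + l1_ratio * L1.
    return (1.0 - l1_ratio) * l2_penalty(w) + l1_ratio * l1_penalty(w)

w = np.array([1.0, -2.0, 0.0])
print(l2_penalty(w))          # 2.5
print(l1_penalty(w))          # 3.0
print(elasticnet_penalty(w))  # 0.85 * 2.5 + 0.15 * 3.0
```

Setting `l1_ratio=0` recovers the L2 penalty and `l1_ratio=1` recovers the L1 penalty.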

### Multi-class classification

SGDClassifier supports multi-class classification by combining multiple binary classifiers in a “one versus all” (OVA) scheme. For each of the $K$ classes, a binary classifier is learned that discriminates between that and all other $K-1$ classes. At testing time, we compute the confidence score (i.e. the signed distances to the hyperplane) for each classifier and choose the class with the highest confidence. 

The Figure below illustrates the OVA approach on the iris dataset. The dashed lines represent the three OVA classifiers; the background colors show the decision surface induced by the three classifiers.

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_sgd_iris_001.png)

In the case of multi-class classification, `coef_` is a two-dimensional array of shape (n_classes, n_features) and `intercept_` is a one-dimensional array of shape (n_classes,). The i-th row of `coef_` holds the weight vector of the OVA classifier for the i-th class; classes are indexed in ascending order. Note that, in principle, `loss="log_loss"` and `loss="modified_huber"` are more suitable for one-vs-all classification, since they allow a probability model to be built.
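The shapes and the "highest confidence wins" rule described above can be checked with a short sketch on the iris dataset. The pipeline layout, the `random_state` value and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # 3 classes, 4 features

# random_state is fixed only for reproducibility (assumption).
model = make_pipeline(StandardScaler(),
                      SGDClassifier(loss="hinge", random_state=0))
model.fit(X, y)

sgd = model[-1]              # the fitted SGDClassifier step
print(sgd.coef_.shape)       # (3, 4): one weight vector per class
print(sgd.intercept_.shape)  # (3,): one intercept per class

# One confidence score per OVA classifier; pick the largest.
scores = model.decision_function(X)  # shape (n_samples, 3)
predicted = sgd.classes_[np.argmax(scores, axis=1)]
print(np.array_equal(predicted, model.predict(X)))
```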

### Parameters of SGDClassifier

Here are some of the parameters of SGDClassifier:
* **loss** : {‘hinge’, ‘log_loss’, ‘modified_huber’, ‘squared_hinge’, ‘perceptron’, ‘squared_error’, ‘huber’, ‘epsilon_insensitive’, ‘squared_epsilon_insensitive’}, default=‘hinge’
 
    The loss function to be used.

    * ‘hinge’ gives a linear SVM.

    * ‘log_loss’ gives logistic regression, a probabilistic classifier.

    * ‘modified_huber’ is another smooth loss that brings tolerance to
    outliers as well as probability estimates.

    * ‘squared_hinge’ is like hinge but is quadratically penalized.

    * ‘perceptron’ is the linear loss used by the perceptron algorithm.

    * The other losses, ‘squared_error’, ‘huber’, ‘epsilon_insensitive’ and ‘squared_epsilon_insensitive’ are designed for regression but can be useful in classification as well.
* **penalty** : {‘l2’, ‘l1’, ‘elasticnet’}, default=’l2’
 
    The penalty (aka regularization term) to be used.

    * ‘l2’ is the standard regularizer used in ridge regression.

    * ‘l1’ is the standard regularizer used in lasso regression.

    * ‘elasticnet’ is a combination of L1 and L2 regularization that is
    also known as the elastic net.
* **l1_ratio** : float, optional, default=0.15
 
    The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1.
* **fit_intercept** : bool, optional, default=True
    
    Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be centered).
* **max_iter** : int, optional, default=1000

    The maximum number of passes over the training data (aka epochs). The solver iterates until convergence (determined by ‘tol’) or until this number of epochs is reached.
* **tol** : float, optional, default=1e-3
    
    Tolerance for stopping criteria.
* **shuffle** : bool, optional, default=True
  
    Whether or not the training data should be shuffled before each epoch.
* **epsilon** : float, optional, default=0.1
    
    Epsilon in the epsilon-insensitive loss functions; only if loss is ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’. For ‘huber’, determines the threshold at which it becomes less important to get the prediction exactly right. For epsilon-insensitive, any differences between the current prediction and the correct label are ignored if they are less than this threshold. Values must be in the range [0.0, inf).

* **alpha** : float, optional, default=0.0001
    
    Constant that multiplies the regularization term. The higher the value, the stronger the regularization. Also used to compute the learning rate when learning_rate is set to ‘optimal’. Values must be in the range [0.0, inf).

* **learning_rate** : {‘constant’, ‘optimal’, ‘invscaling’, ‘adaptive’}, optional, default=‘optimal’
 
    The learning rate schedule.
    * ‘constant’: eta = eta0

    * ‘optimal’: eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed by Leon Bottou.

    * ‘invscaling’: eta = eta0 / pow(t, power_t)

    * ‘adaptive’: eta = eta0, as long as the training loss keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol, or fail to increase the validation score by tol if early_stopping is True, the current learning rate is divided by 5.
* **eta0** : float, optional, default=0.0
    
    The initial learning rate for the ‘constant’, ‘invscaling’ or ‘adaptive’ schedules. The default value of 0.0 is acceptable because eta0 is not used by the default schedule, ‘optimal’.
* **power_t** : float, optional, default=0.5
  
    The exponent for inverse scaling learning rate. It is used only when learning_rate is set to ‘invscaling’.

* **early_stopping** : bool, optional, default=False
    
    Whether to use early stopping to terminate training when validation score is not improving. If set to true, it will automatically set aside 10% of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.

* **n_iter_no_change** : int, optional, default=5
     
    Number of iterations with no improvement after which training will be terminated.

* **class_weight** : dict {class_label: weight} or ‘balanced’, optional, default=None
    
    Weights associated with classes. If not given, all classes are supposed to have weight one.
    The ‘balanced’ mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data, as `n_samples / (n_classes * np.bincount(y))`.
* **warm_start** : bool, optional, default=False
    
    When set to True, reuse the solution of the previous call to fit as initialization; otherwise, erase the previous solution.
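The learning rate schedules listed under `learning_rate` can be written out as plain Python functions. This is only a sketch of the formulas quoted above: the function names are hypothetical, and `t0` is passed in explicitly because the heuristic used to choose it is not described here.

```python
def constant(t, eta0):
    # eta = eta0
    return eta0

def optimal(t, alpha, t0):
    # eta = 1.0 / (alpha * (t + t0)); t0 is supplied by the caller (assumption).
    return 1.0 / (alpha * (t + t0))

def invscaling(t, eta0, power_t=0.5):
    # eta = eta0 / pow(t, power_t)
    return eta0 / pow(t, power_t)

# Compare the schedules at a few step counts t (illustrative values only).
for t in (1, 4, 100):
    print(t,
          constant(t, eta0=0.1),
          optimal(t, alpha=1e-4, t0=1e4),
          invscaling(t, eta0=0.1))
```

The `constant` schedule never decays, while `optimal` and `invscaling` shrink the step size as the step counter `t` grows.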

### Attributes of SGDClassifier

* **coef_** : array, shape (1, n_features) or (n_classes, n_features)
    
    Weights assigned to the features.
* **intercept_** : array, shape (1,) or (n_classes,)
  
    Constants in the decision function (the intercept, or bias, terms).
* **n_iter_** : int
    
    The actual number of iterations (epochs) run before the stopping criterion was reached.
* **classes_** : array of shape (n_classes,)
    
    The class labels.

In [10]:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
clf = make_pipeline(StandardScaler(),
                    SGDClassifier(max_iter=1000, tol=1e-3))
clf.fit(X, Y)


print(clf.predict([[-0.8, -1]]))

[1]


## SGDRegressor

The class `SGDRegressor` implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. `SGDRegressor` is well suited for regression problems with a large number of training samples.

The concrete loss function can be set via the loss parameter. `SGDRegressor` supports the following loss functions:

* **loss="squared_error"**: Ordinary least squares,

* **loss="huber"**: Huber loss for robust regression,

* **loss="epsilon_insensitive"**: linear Support Vector Regression.
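The three regression losses listed above can be sketched in NumPy as functions of the target `y` and the prediction `p`. The helper names are hypothetical, and the formulas (including the 1/2 factor in the squared error) are the usual textbook definitions, assumed here rather than taken from scikit-learn's source.

```python
import numpy as np

def squared_error(y, p):
    # Assumption: conventional 1/2 factor.
    return 0.5 * (y - p) ** 2

def huber(y, p, epsilon=0.1):
    # Quadratic for small residuals, linear beyond the threshold epsilon.
    r = np.abs(y - p)
    return np.where(r <= epsilon,
                    0.5 * r ** 2,
                    epsilon * r - 0.5 * epsilon ** 2)

def epsilon_insensitive(y, p, epsilon=0.1):
    # Residuals smaller than epsilon are ignored entirely.
    return np.maximum(0.0, np.abs(y - p) - epsilon)

y_true = np.zeros(2)
y_pred = np.array([0.05, 1.0])  # one small and one large residual
print(squared_error(y_true, y_pred))
print(huber(y_true, y_pred))
print(epsilon_insensitive(y_true, y_pred))
```

The large residual is penalized quadratically by the squared error but only linearly by the other two losses, which is what makes them more robust to outliers.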

### Parameters of SGDRegressor

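The `SGDRegressor` class shares most of its parameters with the `SGDClassifier` class. The main differences are the `loss` parameter, which accepts only the regression losses (default ‘squared_error’), and the absence of `class_weight`. Some defaults also differ; for example, `learning_rate` defaults to ‘invscaling’.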

### Attributes of SGDRegressor

The `SGDRegressor` class has the same attributes as the `SGDClassifier` class except for the `classes_` attribute.

In [11]:
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
# Always scale the input. The most convenient way is to use a pipeline.
reg = make_pipeline(StandardScaler(),
                    SGDRegressor(max_iter=1000, tol=1e-3))
reg.fit(X, y)


Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdregressor', SGDRegressor())])

## More Informations on SGD

### Time Complexity

The major advantage of SGD is its efficiency, which is basically linear in the number of training examples. If X is a matrix of size (n, p), training has a cost of $O(k n \bar{p})$, where k is the number of iterations (epochs) and $\bar{p}$ is the average number of non-zero attributes per sample.
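As a small illustration of the quantities in this cost estimate, $\bar{p}$ can be read off a sparse matrix directly. The matrix and the number of epochs `k` below are made-up values used only for the sketch.

```python
import numpy as np
from scipy import sparse

# Toy sparse design matrix: 3 samples, 4 features, 3 non-zero entries.
X = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0, 2.0],
                                [0.0, 0.0, 3.0, 0.0],
                                [0.0, 0.0, 0.0, 0.0]]))

n, p = X.shape
p_bar = X.nnz / n           # average number of non-zero attributes per sample
k = 5                       # hypothetical number of epochs
cost_units = k * n * p_bar  # grows as O(k * n * p_bar)

print(n, p, p_bar, cost_units)  # 3 4 1.0 15.0
```

For sparse data `p_bar` can be far smaller than `p`, which is why the cost depends on the number of non-zero entries rather than on the full matrix size.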

### Stopping Criterion

The classes `SGDClassifier` and `SGDRegressor` provide two criteria to stop the algorithm when a given level of convergence is reached:

* With `early_stopping=True`, the input data is split into a training set and a validation set. The model is then fitted on the training set, and the stopping criterion is based on the prediction score (using the score method) computed on the validation set. The size of the validation set can be changed with the parameter `validation_fraction`.

* With `early_stopping=False`, the model is fitted on the entire input data and the stopping criterion is based on the objective function computed on the training data.

In both cases, the criterion is evaluated once per epoch, and the algorithm stops when the criterion does not improve `n_iter_no_change` times in a row. The improvement is evaluated with absolute tolerance `tol`, and the algorithm stops in any case after a maximum number of iterations `max_iter`.
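The validation-based criterion can be tried out with a short sketch. The synthetic dataset, the parameter values and the variable names are assumptions chosen for illustration; only the parameter names come from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for this sketch.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = make_pipeline(
    StandardScaler(),
    SGDClassifier(
        early_stopping=True,      # hold out part of the data for validation
        validation_fraction=0.2,  # size of the held-out validation set
        n_iter_no_change=3,       # patience, in epochs
        tol=1e-3,                 # minimum improvement that counts
        max_iter=200,             # hard upper bound on the number of epochs
        random_state=0,
    ),
)
clf.fit(X, y)

n_epochs = clf[-1].n_iter_
print(n_epochs)  # number of epochs actually run, at most max_iter
```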

### Mathematical Details

Given a set of training examples $(x_1, y_1), \ldots, (x_n, y_n)$, our goal is to find a linear model $f(x) = w^T x + b$ which minimizes the following loss function:
$$
E(w,b) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w)
$$
where $L$ is a loss function that measures model (mis)fit and $R$ is a regularization term (aka penalty) that penalizes model complexity; $\alpha$ is a non-negative hyperparameter that controls the regularization strength.

Different choices of $L$ yield different models: for example, the hinge loss gives a linear SVM and the log loss gives logistic regression. $R$ is one of the penalties described above (L2, L1, or elastic net).

Stochastic gradient descent is an optimization method for unconstrained optimization problems. In contrast to (batch) gradient descent, SGD approximates the true gradient of $E(w,b)$ by considering a single training example at a time.

The class `SGDClassifier` implements a first-order SGD learning routine. The algorithm iterates over the training examples and for each example updates the model parameters according to the update rule given by:
$$w \leftarrow w - \eta \left[\alpha \frac{\partial R(w)}{\partial w}
+ \frac{\partial L(y_i, w^T x_i + b)}{\partial w}\right]$$

where $\eta$ is the learning rate, which controls the step size in the parameter space. The intercept $b$ is updated similarly but without regularization. The learning rate can be scheduled over the training epochs.
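The update rule can be sketched in plain NumPy for one concrete choice: hinge loss with an L2 penalty $R(w) = \frac{1}{2}\lVert w \rVert^2$ and a constant learning rate. This is a simplified illustration under those assumptions, not scikit-learn's implementation; the function names, the toy data and all hyperparameter values are hypothetical.

```python
import numpy as np

def objective(w, b, X, y, alpha):
    # E(w, b) = mean hinge loss + alpha * R(w), with R(w) = 0.5 * ||w||^2.
    margins = y * (X @ w + b)
    return np.mean(np.maximum(0.0, 1.0 - margins)) + alpha * 0.5 * (w @ w)

def sgd_hinge(X, y, alpha=1e-4, eta=0.01, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(n_epochs):
        # Visit the examples one at a time, in a freshly shuffled order.
        for i in rng.permutation(n_samples):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1.0:
                # Subgradient of the hinge loss when the margin is violated.
                grad_w, grad_b = -y[i] * X[i], -y[i]
            else:
                grad_w, grad_b = np.zeros(n_features), 0.0
            # w <- w - eta * (alpha * dR/dw + dL/dw); b is not regularized.
            w -= eta * (alpha * w + grad_w)
            b -= eta * grad_b
    return w, b

# Toy data: two Gaussian blobs with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])

w, b = sgd_hinge(X, y)
before = objective(np.zeros(2), 0.0, X, y, alpha=1e-4)  # 1.0 at w = 0, b = 0
after = objective(w, b, X, y, alpha=1e-4)
accuracy = np.mean(np.sign(X @ w + b) == y)
print(before, after, accuracy)
```

Each inner-loop step uses a single example, which is what distinguishes SGD from batch gradient descent, where the gradient would be averaged over all examples before every update.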