In [1]:
import numpy as np

## Normal Equation

$$\hat{\theta} = (X^\top X)^{-1} X^\top y$$

- $\hat{\theta}$ is the value of $\theta$ that minimizes the cost function.
- $y$ is the vector of target values containing $y^{(1)}$ to $y^{(m)}$.

In [2]:
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

In [3]:
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
theta_best

array([[3.78892358],
       [3.12805954]])

The pseudoinverse of $X$ (the Moore-Penrose inverse, written $X^+$) is computed using a standard matrix factorization technique called Singular Value Decomposition (SVD), which decomposes the training set matrix $X$ into the product of three matrices $U \Sigma V^\top$ (see numpy.linalg.svd()). The pseudoinverse is computed as $X^+ = V \Sigma^+ U^\top$. To compute the matrix $\Sigma^+$, the algorithm takes $\Sigma$ and sets to zero all values smaller than a tiny threshold value, then replaces all the nonzero values with their inverse, and finally transposes the resulting matrix. This approach is more efficient than computing the Normal Equation, and it handles edge cases nicely: the Normal Equation may not work if the matrix $X^\top X$ is not invertible (i.e., singular), such as if m < n or if some features are redundant, but the pseudoinverse is always defined.
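
The steps described above can be sketched with NumPy as follows. This is a minimal illustration, not necessarily how NumPy implements it internally: the data is regenerated with the same recipe as cell In [2] but with a seeded generator (an assumption, so the numbers differ from the output above), and the cutoff `threshold` is an illustrative choice. The result is compared against `np.linalg.pinv()` and `np.linalg.lstsq()`.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is an arbitrary choice for reproducibility
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance

# Thin SVD: X_b = U @ diag(s) @ Vt, where s holds the singular values
U, s, Vt = np.linalg.svd(X_b, full_matrices=False)

# Build Sigma+: zero out tiny singular values, invert the remaining ones.
# With the thin SVD, Sigma is square and diagonal, so the transpose is a no-op.
threshold = np.finfo(s.dtype).eps * max(X_b.shape) * s.max()  # assumed cutoff
s_inv = np.zeros_like(s)
s_inv[s > threshold] = 1.0 / s[s > threshold]

X_pinv = Vt.T @ np.diag(s_inv) @ U.T  # X+ = V Sigma+ U^T
theta_svd = X_pinv @ y

# Reference results from NumPy's built-in routines
theta_pinv = np.linalg.pinv(X_b) @ y
theta_lstsq, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None)

print(theta_svd.ravel(), theta_pinv.ravel(), theta_lstsq.ravel())
```

All three estimates should agree up to floating-point error, since they solve the same least-squares problem.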

## Gradient Descent

Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.
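
As a toy illustration of the general idea, the sketch below minimizes a one-dimensional function by repeatedly stepping against its derivative. The function `f`, the starting point, the learning rate `eta`, and the step count are all hypothetical choices made for this example.

```python
def f(x):
    """Toy cost function with its minimum at x = 3 (chosen for illustration)."""
    return (x - 3.0) ** 2

def f_prime(x):
    """Derivative of f with respect to x."""
    return 2.0 * (x - 3.0)

eta = 0.1   # learning rate (assumed value)
x = 0.0     # arbitrary starting point
for step in range(100):
    x = x - eta * f_prime(x)  # move a small step downhill

print(x, f(x))
```

Each step shrinks the distance to the minimum by a constant factor here, so `x` ends up very close to 3.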

### Batch Gradient Descent


To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter $\theta_j$. In other words, you need to calculate how much the cost function will change if you change $\theta_j$ just a little bit. This is called a partial derivative.
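
As a worked example, assume the cost function is the mean squared error (MSE) of a Linear Regression model over $m$ training instances (an assumption here; these notes do not name the cost function). The partial derivative with regard to $\theta_j$ is then:

$$\frac{\partial}{\partial \theta_j} \mathrm{MSE}(\theta) = \frac{2}{m} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right) x_j^{(i)}$$

Stacking all the partial derivatives gives the gradient vector, which can be computed in one go over the full training set matrix $X$:

$$\nabla_\theta \mathrm{MSE}(\theta) = \frac{2}{m} X^\top (X \theta - y)$$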

Batch Gradient Descent uses the whole batch of training data at every step (Full Gradient Descent would probably be a better name). As a result, it is terribly slow on very large training sets (but we will see much faster Gradient Descent algorithms shortly). However, Gradient Descent scales well with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation or SVD.
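
A minimal sketch of Batch Gradient Descent for Linear Regression follows, assuming an MSE cost function. The learning rate `eta`, the iteration count, and the seeded data (same recipe as cell In [2]) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is an arbitrary choice
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance

eta = 0.1            # learning rate (assumed value)
n_iterations = 1000  # number of steps (assumed value)

theta = rng.standard_normal((2, 1))  # random initialization
for iteration in range(n_iterations):
    # Gradient of the MSE computed over the whole training set
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

# Least-squares solution for comparison
theta_ref, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta.ravel(), theta_ref.ravel())
```

With these settings the loop should converge to essentially the same parameters as the least-squares solution. Note that every iteration touches all `m` rows of `X_b`, which is what makes the method slow on large training sets.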

### Stochastic Gradient Descent

The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. At the opposite extreme, Stochastic Gradient Descent picks a random instance in the training set at every step and computes the gradients based only on that single instance. Working on a single instance at a time makes the algorithm much faster because it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (Stochastic GD can be implemented as an out-of-core algorithm).

On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down (see Figure 4-9). So once the algorithm stops, the final parameter values are good, but not optimal.
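
The single-instance update described above can be sketched as follows, assuming an MSE cost function. The gradually shrinking learning rate (`learning_schedule`, with hypothetical hyperparameters `t0` and `t1`) is one common way to dampen the bouncing near the minimum; it is an addition for illustration and is not described in these notes.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is an arbitrary choice
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance

n_epochs = 50
t0, t1 = 5, 50  # learning schedule hyperparameters (assumed values)

def learning_schedule(t):
    """Learning rate that decreases as the step counter t grows."""
    return t0 / (t + t1)

theta = rng.standard_normal((2, 1))  # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)      # pick one random instance
        xi = X_b[idx:idx + 1]      # shape (1, 2)
        yi = y[idx:idx + 1]        # shape (1, 1)
        # Gradient of the squared error on that single instance
        gradients = 2 * xi.T @ (xi @ theta - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients

# Least-squares solution for comparison
theta_ref, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta.ravel(), theta_ref.ravel())
```

The final `theta` should land near the least-squares solution but will generally not match it exactly, which reflects the "good, but not optimal" behavior described above.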

#### If your cost function is very irregular (i.e., has lots of ups and downs), Stochastic Gradient Descent is useful because its randomness can help the algorithm jump out of local minima.

In [4]:
from sklearn.linear_model import SGDRegressor

In [5]:
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)

In [6]:
sgd_reg.fit(X, y.ravel())

In [7]:
sgd_reg.intercept_, sgd_reg.coef_

(array([3.84095955]), array([3.20738491]))

### Mini-batch Gradient Descent

The last Gradient Descent algorithm we will look at is called Mini-batch Gradient Descent. It is simple to understand once you know Batch and Stochastic Gradient Descent: at each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
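
A minimal sketch of Mini-batch Gradient Descent follows, again assuming an MSE cost function. The batch size, learning rate, and epoch count are hypothetical values chosen for this example, and shuffling the training set once per epoch is just one way to draw random mini-batches.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is an arbitrary choice
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance

n_epochs = 200   # assumed value
batch_size = 20  # assumed value
eta = 0.05       # learning rate (assumed value)

theta = rng.standard_normal((2, 1))  # random initialization
for epoch in range(n_epochs):
    shuffled = rng.permutation(m)  # new random order every epoch
    for start in range(0, m, batch_size):
        batch = shuffled[start:start + batch_size]
        X_batch, y_batch = X_b[batch], y[batch]
        # Gradient of the MSE computed on the mini-batch only
        gradients = 2 / len(batch) * X_batch.T @ (X_batch @ theta - y_batch)
        theta = theta - eta * gradients

# Least-squares solution for comparison
theta_ref, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta.ravel(), theta_ref.ravel())
```

Each update is a small matrix operation over `batch_size` rows, which is the part that hardware-optimized linear algebra can accelerate.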