### Least squares

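Least squares is a method used in regression analysis to approximate the solution of an overdetermined system. It minimizes the sum of the squared residuals, where a residual is the difference between an observed value and the value predicted by the model. The result is the best-fitting model in the squared-error sense.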

In this notebook I focus on the linear model and use the Boston housing dataset. The task is to predict the price of a house from its features.

The dataset has $n$ data points $\boldsymbol x_i$, stacked as the rows of a data matrix $\boldsymbol X$, and we look for a parameter vector $\boldsymbol \theta$ such that:

$$
\begin{bmatrix} 
\boldsymbol {x}_1^T \\
\vdots \\ 
\boldsymbol {x}_n^T 
\end{bmatrix} \boldsymbol {\theta} = \begin{bmatrix} 
y_1 \\
\vdots \\ 
y_n 
\end{bmatrix}.
$$

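We assume a linear prediction model $\hat{y}_i =  f(\boldsymbol {x}_i) = \boldsymbol \theta^T\boldsymbol{x}_i$.

Requiring $\hat{y}_i = y_i$ for every data point gives, in matrix form, $\boldsymbol X\boldsymbol{\theta} = \boldsymbol {y}.$ With more data points than features this system generally has no exact solution, which is why we look for the $\boldsymbol \theta$ with the smallest squared error instead.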

$\boldsymbol y$ collects all house prices $y_1,\dotsc, y_n$ of the training set.

The goal is to find the best $\boldsymbol \theta$ that minimizes the following (least squares) objective:

$$
\begin{eqnarray} 
&\sum^n_{i=1}{\lVert \boldsymbol \theta^T\boldsymbol {x}_i - y_i \rVert^2} \\
&= (\boldsymbol X\boldsymbol {\theta} - \boldsymbol y)^T(\boldsymbol X\boldsymbol {\theta} - \boldsymbol y).
\end{eqnarray}
$$

Note that we aim to minimize the squared error between the prediction $\boldsymbol \theta^T\boldsymbol {x}_i$  of the model and the observed data point $y_i$ in the training set. 
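As a quick sanity check, the sum form and the matrix form of the objective can be compared numerically. The sketch below uses small random arrays as a stand-in for the real data; the names `X`, `theta`, `y`, `loss_sum` and `loss_matrix` are illustrative and not part of the notebook.

```python
import numpy as np

# Synthetic stand-in data (assumption: 5 data points, 3 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
theta = rng.normal(size=3)
y = rng.normal(size=5)

# Sum form: add up the squared residual of every data point
loss_sum = sum((theta @ x_i - y_i) ** 2 for x_i, y_i in zip(X, y))

# Matrix form: (X theta - y)^T (X theta - y)
residual = X @ theta - y
loss_matrix = residual @ residual

print(np.isclose(loss_sum, loss_matrix))
```

Both expressions compute the same scalar, so the matrix form can be used in the derivation that follows.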

To find the optimal parameters, we set the gradient of the least-squares objective to $\boldsymbol 0$:
$$
\begin{eqnarray} 
\nabla_{\boldsymbol\theta}(\boldsymbol X{\boldsymbol \theta} - \boldsymbol y)^T(\boldsymbol X{\boldsymbol \theta} - \boldsymbol y) &=& \boldsymbol 0 \\
\iff \nabla_{\boldsymbol\theta}(\boldsymbol {\theta}^T\boldsymbol X^T - \boldsymbol y^T)(\boldsymbol X\boldsymbol {\theta} - \boldsymbol y) &=& \boldsymbol 0 \\
\iff \nabla_{\boldsymbol\theta}(\boldsymbol {\theta}^T\boldsymbol X^T\boldsymbol X\boldsymbol {\theta} - \boldsymbol y^T\boldsymbol X\boldsymbol \theta - \boldsymbol \theta^T\boldsymbol X^T\boldsymbol y + \boldsymbol y^T\boldsymbol y ) &=& \boldsymbol 0 \\
\iff 2\boldsymbol X^T\boldsymbol X\boldsymbol \theta - 2\boldsymbol X^T\boldsymbol y &=& \boldsymbol 0 \\
\iff \boldsymbol X^T\boldsymbol X\boldsymbol \theta        &=& \boldsymbol X^T\boldsymbol y.
\end{eqnarray}
$$

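The solution, which gives zero gradient, solves the __normal equation__ $$\boldsymbol X^T\boldsymbol X\boldsymbol \theta = \boldsymbol X^T\boldsymbol y.$$ If $\boldsymbol X^T\boldsymbol X$ is invertible, this solution is unique: $\boldsymbol \theta = (\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol y$.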
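The derivation can be checked numerically: at the solution of the normal equation, the gradient $2\boldsymbol X^T\boldsymbol X\boldsymbol \theta - 2\boldsymbol X^T\boldsymbol y$ should vanish up to floating-point error. This is a minimal sketch on synthetic data; `true_theta` and the noise level are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic data generated from a known parameter vector (assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
true_theta = np.array([1.5, -2.0, 0.5, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=50)

# Solve the normal equation X^T X theta = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient of the least-squares objective evaluated at theta_hat
gradient = 2 * X.T @ X @ theta_hat - 2 * X.T @ y

print(np.allclose(gradient, 0.0, atol=1e-8))
print(theta_hat)
```

Because the noise is small, `theta_hat` should land close to `true_theta`.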

In [1]:
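# imports
import numpy as np
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version.
from sklearn.datasets import load_boston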

In [2]:
boston = load_boston()
boston_X, boston_y = boston.data, boston.target

# calculate theta_hat by solving the normal equation X^T X theta = X^T y
boston_theta_hat = np.linalg.solve(boston_X.T @ boston_X, boston_X.T @ boston_y)
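Forming $\boldsymbol X^T\boldsymbol X$ explicitly squares the condition number of the problem, so a common alternative is `np.linalg.lstsq`, which solves the same least-squares problem without building that product. The sketch below compares both approaches on synthetic data, not on the Boston dataset, and then uses the fitted parameters to predict and to measure the mean squared error; all variable names are illustrative.

```python
import numpy as np

# Synthetic stand-in data (assumption: 100 data points, 5 features)
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

# Approach used in the cell above: solve the normal equation directly
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Alternative: let NumPy solve the least-squares problem itself
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictions and mean squared error on the training data
y_hat = X @ theta_lstsq
mse = np.mean((y_hat - y) ** 2)

print(np.allclose(theta_normal, theta_lstsq))
print(mse)
```

On well-conditioned data the two solutions agree; the difference only matters when the columns of $\boldsymbol X$ are nearly linearly dependent.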