# Geometrical Intuition of Least Squares

## Prerequisites


- Linear Algebra (Subspaces, Basis, Projection, Orthogonality)
- Vectors and Matrices, and some of their basic operations (multiplication, transposes)
- Vectors as points and lines in n-dimensional space

## Learning Objectives

After reading this notebook, students should be able to:

- Understand projection onto a line and generalize it to subspaces.
- Derive the normal equation of least squares geometrically.
- Visualize Least Squares geometrically.


## Orthogonality and Projection onto a line

Orthogonality generalizes the notion of perpendicularity to $n$-dimensional space. Suppose we have two vectors $x$ and $y$. The standard way to check whether they are orthogonal is to compute their dot product (also called the scalar product or inner product):

$$x^Ty = \begin{bmatrix}
x_1 &x_2  &\dots  &x_n
\end{bmatrix}\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{bmatrix} = x_1y_1 + x_2y_2+\dots +x_ny_n$$

The inner product $x^Ty$ of two nonzero vectors is $0$ if and only if $x$ and $y$ are perpendicular, i.e., the angle between them is $90$ degrees.
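
As a quick numerical illustration of this test, here is a minimal sketch using NumPy. The vectors `x` and `y` are example values chosen for this illustration, not data from the notebook.

```python
import numpy as np

# Two example vectors chosen for illustration (assumed values).
x = np.array([1.0, 2.0, -1.0])
y = np.array([3.0, 0.0, 3.0])

# Inner product x^T y = x_1*y_1 + x_2*y_2 + ... + x_n*y_n
inner = float(x @ y)  # 1*3 + 2*0 + (-1)*3 = 0

# Compare against zero with a tolerance, since floating-point
# sums are rarely exactly zero for general inputs.
is_orthogonal = bool(np.isclose(inner, 0.0))
print(inner, is_orthogonal)
```

The tolerance-based comparison is a design choice: with real-valued data, an exact `== 0` check would often fail because of rounding.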

With the idea of orthogonality in hand, we can look at the projection of a point onto a line. Suppose we have two vectors, $a$ and $b$. A vector can be viewed as a point in coordinate space together with a direction, drawn as an arrow from the origin to that point. We want to find the point $p$ on the line through $a$ that is closest to the point $b$.

<figure align="center">
       <!-- <img src="https://drive.google.com/uc?export=view&id=1yHLiffAXoEf_sVKpC0nIJSWvGZ0sDCAP" height="200" width="400"> -->
       <img src="https://i.postimg.cc/WpK36cpW/image.png" height="200" width="400">
       <figcaption>Figure 1: Projection of a point on a line </figcaption>
   </figure>

So, how do we locate the point $p$ on the line through $a$?

Here, _orthogonality_ plays the main role. To locate $p$, we rely on one key fact: _the line connecting $b$ and $p$ is perpendicular to $a$._ Figure 1 illustrates this with the perpendicular symbol at $p$, which is the projection of $b$ onto the line through $a$.
Now we locate the projection point $p$. Since $p$ lies on the line through $a$, it is some multiple of $a$. Let that multiplying factor be $\hat{c}$. So, we have:

$$p = \hat{c}a$$

Finding $\hat{c}$ solves our problem of locating $p$. Recall the geometrical fact that the line from $b$ to the closest point $p = \hat{c}a$ is perpendicular to $a$:

$$a \perp (b-p)$$

From the test of orthogonality,

$$a^T(b-p) = 0$$

Expanding the bracket,

$$a^Tb- a^Tp = 0$$

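Substituting $p = \hat{c}a$, we get:

$$a^Tb- a^T(\hat{c}a) = 0$$

Since $\hat{c}$ is a scalar, $a^T(\hat{c}a) = \hat{c}\,a^Ta$. Moving this term to the other side of the equation and dividing by $a^Ta$ (nonzero as long as $a \neq 0$), we get:

$$\hat{c} = \frac{a^Tb}{a^Ta}$$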


In this case of projection onto a line, $\hat{c}$ is a single number.
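
The formula for $\hat{c}$ can be checked numerically. The following is a minimal sketch with NumPy; the vectors `a` and `b` are assumed example values, and `e` is a name introduced here for the error vector $b - p$.

```python
import numpy as np

# Example vectors chosen for illustration (assumed values).
a = np.array([2.0, 1.0])  # direction of the line through the origin
b = np.array([1.0, 3.0])  # point to be projected

# Multiplying factor c_hat = (a^T b) / (a^T a)
c_hat = float((a @ b) / (a @ a))  # (2 + 3) / (4 + 1)

# Projection point p = c_hat * a, and the error vector e = b - p
p = c_hat * a
e = b - p

# a^T (b - p) should be zero: the error is perpendicular to a.
print(c_hat, p, float(a @ e))
```

Any other multiple of `a` would lie farther from `b` than `p` does, which is what makes `p` the closest point on the line.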

This is a simple example of projecting a point onto a line, but the key idea is the same for all projections. The projection could equally be onto a plane (or any subspace). Projecting a point $b$ onto a subspace (for example, a plane) is all about finding the point $p$ in the subspace that is closest to $b$.

## Projection and Least Squares

By now, we are familiar with Least Squares and the central principle that guides us in estimating the parameters. In this notebook, we aim to interpret least squares geometrically. The geometrical interpretation of least squares deals with the projection of a point onto a subspace. We will see it in detail.

Since this notebook extends the discussion of least squares with linear regression, we will start with the linear equation:

$$\mathbf{y} = \mathbf{X}\beta$$

This linear system of equations is consistent, i.e., it has a solution, exactly when $\mathbf{y}$ lies in the column space of $\mathbf{X}$. But this can fail when we have more equations than unknowns ($\beta$'s). A simple example is illustrated below:

$$
\begin{matrix}
y_1 = x_1\beta\\
y_2 = x_2\beta\\
y_3 = x_3\beta
\end{matrix}
$$

In matrix form,

$$\begin{bmatrix}
y_1 \\
y_2 \\
y_3
\end{bmatrix} = \begin{bmatrix}
x_1 \\
x_2 \\
x_3
\end{bmatrix}\begin{bmatrix}
\beta
\end{bmatrix}$$


This system of linear equations is consistent and solvable if and only if $(y_1, y_2, y_3)$ lies in the column space of $\mathbf{X}$, i.e., on the line through the origin that contains $(x_1, x_2, x_3)$. In other words, $(y_1, y_2, y_3)$ must be a constant multiple of $(x_1, x_2, x_3)$ for a solution to exist.
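
One way to test this condition numerically is a rank comparison: $\mathbf{y}$ lies in the column space of $\mathbf{X}$ exactly when appending $\mathbf{y}$ as an extra column does not increase the rank. The sketch below assumes NumPy; the helper `is_consistent` and the example data are hypothetical, introduced only for this illustration.

```python
import numpy as np

def is_consistent(X, y):
    """Return True if y lies in the column space of X (rank test)."""
    augmented = np.column_stack([X, y])
    return bool(np.linalg.matrix_rank(augmented) == np.linalg.matrix_rank(X))

# Example 3x1 matrix X (assumed values): one unknown, three equations.
X = np.array([[1.0], [2.0], [3.0]])

y_on_line = np.array([2.0, 4.0, 6.0])   # exactly 2 * (1, 2, 3)
y_off_line = np.array([2.0, 4.0, 7.0])  # not a multiple of (1, 2, 3)

print(is_consistent(X, y_on_line))
print(is_consistent(X, y_off_line))
```

Note that `np.linalg.matrix_rank` uses a numerical tolerance, so for noisy real-world data the test will almost always report an inconsistent system.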

But in practice, inconsistent equations arise, and we still have to solve them. The output $\mathbf{y}$, which is the observation, lies outside the column space of $\mathbf{X}$. In such cases, we look for the $\beta$'s that make the equations collectively as close as possible to the actual observations. We measure this overall error with the _Sum of Squared Errors (SSE)_ and minimize it.

Let's revise a bit. We will first see the least-squares solution from the calculus viewpoint (which we have done earlier). After that, we will see it geometrically.


__From Calculus:__

The _SSE_ for the above set of equations is:

$$\text{SSE} = (y_1-x_1\beta)^2 + (y_2-x_2\beta)^2 + (y_3-x_3\beta)^2$$


As explained in the previous notebooks, we always try to minimize _SSE_. As a function of $\beta$, the _SSE_ here is a parabola opening upwards (a convex graph), provided the $x_i$ are not all zero. The minimum is at the lowest point of the graph, where the derivative is $0$:

$$\frac{\partial\ \text{SSE} }{\partial \beta} = \frac{\partial }{\partial \beta}\left[(y_1-x_1\beta)^2 + (y_2-x_2\beta)^2 + (y_3-x_3\beta)^2\right] = 0$$

Differentiating with respect to $\beta$, we get:

$$-2x_1(y_1-x_1\beta) - 2x_2(y_2-x_2\beta)- 2x_3(y_3-x_3\beta) = 0 $$

Dividing all terms by $-2$ and expanding product terms, we get:

$$x_1y_1-x_1^2\beta+x_2y_2-x_2^2\beta+x_3y_3-x_3^2\beta = 0$$

Isolating $\beta$, we get:

$$\hat{\boldsymbol{\beta}} = \frac{x_1y_1+x_2y_2+x_3y_3}{x_1^2+x_2^2+x_3^2} = \frac{\mathbf{X}^T\mathbf{y}}{\mathbf{X}^T\mathbf{X}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$


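This is the normal equation for estimating $\beta$ (the fraction is valid here because $\mathbf{X}^T\mathbf{X}$ is a single number when $\mathbf{X}$ has one column). Here, we worked through a simple example with one unknown and three equations, but the normal equation generalizes to several unknowns.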
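
The one-unknown result can be checked numerically. This minimal sketch assumes NumPy and uses made-up data; the helper `sse` is a hypothetical name introduced here. It compares the closed-form value with `np.linalg.lstsq` and confirms that the _SSE_ increases on either side of the estimate.

```python
import numpy as np

# Illustrative data (assumed values): y is not a multiple of x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 7.0])

def sse(beta):
    """Sum of squared errors for the one-unknown model y_i = x_i * beta."""
    return float(np.sum((y - x * beta) ** 2))

# Closed form obtained by setting the derivative of SSE to zero.
beta_hat = float((x @ y) / (x @ x))  # 31 / 14

# Cross-check with NumPy's least-squares solver on the 3x1 matrix X.
X = x.reshape(-1, 1)
beta_lstsq = float(np.linalg.lstsq(X, y, rcond=None)[0][0])

# SSE at beta_hat compared with SSE slightly to the left and right.
print(beta_hat, beta_lstsq)
print(sse(beta_hat - 0.1), sse(beta_hat), sse(beta_hat + 0.1))
```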


__From Geometrical Interpretation:__


Now, we will derive this normal equation from a geometrical viewpoint for the general case.

Geometrically, in the general case with several unknowns, we are projecting $\mathbf{y}$ onto a subspace. We then have a matrix $\mathbf{X}$ with dimension $\text{n}\times (\text{d}+1)$, i.e., $\text{n}$ observations and $\text{d}+1$ unknowns, where $\text{d}$ is the dimension of the input variable. The number of observations $\text{n}$ is typically larger than the number of unknowns $\text{d}+1$, so the system of linear equations is generally inconsistent: there is no $\boldsymbol{\beta}$ that perfectly satisfies all the equations. Instead, we choose $\hat{\boldsymbol{\beta}}$ to minimize the sum of squared error distances. Estimating $\hat{\boldsymbol{\beta}}$ this way is the same as locating the point $p$ on the line through $a$ that is closest to $b$.

To minimize the error, we project $\mathbf{y}$ onto the column space of $\mathbf{X}$. The error vector $(\mathbf{y}-\mathbf{X} \hat{\boldsymbol{\beta}})$ must be perpendicular to that column space, i.e., to every column of $\mathbf{X}$, for the error distance to be minimal. So from orthogonality:


$$\mathbf{X}^T(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}) = 0$$

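Expanding gives $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$. Assuming the columns of $\mathbf{X}$ are linearly independent, $\mathbf{X}^T\mathbf{X}$ is invertible, and solving for $\hat{\boldsymbol{\beta}}$ we get: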

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
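
The general case can be sketched in NumPy as follows. The data are hypothetical, and the design matrix is assumed to contain a column of ones (for an intercept) next to a single input variable, giving the $\text{n}\times(\text{d}+1)$ shape with $\text{d} = 1$. The sketch solves $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$ and checks that the error vector is orthogonal to the columns of $\mathbf{X}$.

```python
import numpy as np

# Hypothetical data: four observations of one input variable (d = 1).
inputs = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 4.0])

# Design matrix of shape n x (d + 1): a column of ones plus the input.
# The intercept column is an assumption made for this illustration.
X = np.column_stack([np.ones_like(inputs), inputs])

# Solve X^T X beta = X^T y directly instead of forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The error vector should be orthogonal to every column of X.
residual = y - X @ beta_hat
print(beta_hat)        # intercept and slope
print(X.T @ residual)  # numerically close to [0, 0]
```

Using `np.linalg.solve` rather than computing $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly is a common numerical choice; both give the same $\hat{\boldsymbol{\beta}}$ when $\mathbf{X}^T\mathbf{X}$ is invertible.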

With the geometrical interpretation, we derived the normal equation. Now, we will visualize it.

Consider $\mathbf{X}$ with dimension $3 \times 2$, so we have two unknowns. The error vector should be perpendicular to each column of $\mathbf{X}$. We estimate the parameters so as to project $\mathbf{y}$ onto the column space of $\mathbf{X}$, which is a plane in three-dimensional space when the two columns are independent. We can see the projection in Figure 2 below:

<figure align="center">
       <!-- <img src="https://drive.google.com/uc?export=view&id=1QGv4xHzl6bBxXGQtCG47D9y_fFd1w8Ef" height="350" width="600"> -->
       <img src="https://i.postimg.cc/WbnzxWMg/image.png" height="350" width="600">
       <figcaption>Figure 2: Projection onto the column space of a 3 by 2 matrix </figcaption>
   </figure>
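
The picture in Figure 2 can be reproduced with numbers. The sketch below assumes NumPy and uses a made-up $3 \times 2$ matrix and observation vector (these values are not taken from the figure). It computes the projection $p = \mathbf{X}\hat{\boldsymbol{\beta}}$ and verifies that the error vector is perpendicular to each column of $\mathbf{X}$.

```python
import numpy as np

# Assumed example: a 3x2 matrix with independent columns and an
# observation vector y that lies outside its column space.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([6.0, 0.0, 0.0])

# Least-squares estimate from X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Projection of y onto the column space (the plane spanned by the columns)
p = X @ beta_hat

# Error vector from the projection back to y
e = y - p

# Dot product of e with each column of X; both should be zero.
print(beta_hat, p, e)
print(float(X[:, 0] @ e), float(X[:, 1] @ e))
```

Here `p` plays the role of the point in the plane closest to `y`, exactly as $p$ was the closest point to $b$ on the line through $a$.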


