# Derivations related to Least Square Error (LSE) 

### What is Least Square Error (LSE)?

LSE is defined as the sum of squared residuals over a dataset. To put this in perspective, let $x^{(i)}$ and $y^{(i)}$ be the input vector and label of the $i$-th example, and let $y_{pred}^{(i)}$ be our prediction after running our model on the input $x^{(i)}$.

Then the residual (or error) of a specific example is $(y^{(i)} - y_{pred}^{(i)})$, the difference between the actual output and our predicted output.

Now we need to square the residuals: $(y^{(i)} - y_{pred}^{(i)})^{2}$
There are several reasons we do this, but I will mention two here:
1. If the difference between actual and predicted is negative, i.e. $(y^{(i)} - y_{pred}^{(i)}) < 0$, then squaring the term always gives a non-negative outcome, so the differences are treated by their magnitude alone rather than their magnitude and sign.

2. Squaring is non-linear, so larger differences are weighted *MORE* than smaller ones: doubling a residual quadruples its contribution.

Finally, we need to sum over each data point (example) to obtain the total error over all data points:
$$ LSE = \sum_{i}(y^{(i)} - y_{pred}^{(i)})^{2} $$
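
As a quick numeric check of the formula, here is a minimal sketch with made-up values. The arrays `y` and `y_pred` below are illustrative only and are not output from any model in this notebook:

```python
import numpy as np

# Made-up labels and predictions, chosen only for illustration
y = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])

residuals = y - y_pred        # one residual per example: 0.5, -0.5, 1.0
squared = residuals ** 2      # 0.25, 0.25, 1.0 (the sign is gone)
lse = float(np.sum(squared))  # 0.25 + 0.25 + 1.0 = 1.5

print(lse)
```

The residual of 1.0 is twice the size of the other two but contributes four times as much to the total, which is the non-linear weighting described in the second reason above.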

### Coding the LSE Formula

In [10]:
import numpy as np
import sys
from pathlib import Path

# Add the directory containing your code file to sys.path
sys.path.append(str(Path().resolve().parent.parent))

from models.LinearRegression import LinearRegression

# CREATE SOME DUMMY DATA
x = np.linspace(0, 10, 10)
y = x + 2 + np.random.rand(x.shape[0]) - 0.5 # y is a linear function of x with independent noise on each point

# reshape the data to place into the model
X = np.reshape(x, (x.shape[0], 1))

# initialize our model (I will use my Linear Regression here)
model = LinearRegression(1, 1)

Coding this formula is not very tricky. We first need to calculate $y_{pred}^{(i)}$, which is produced by our model (here, Linear Regression). We run the model on all inputs $x^{(i)}$ at once by stacking them into a matrix $X$, where each row is an example and each column is a feature of the input:

In [13]:
y_pred = model(X)
y_pred

array([0.7789378 , 1.05777379, 1.33660978, 1.61544576, 1.89428175,
       2.17311774, 2.45195373, 2.73078972, 3.00962571, 3.28846169])

The output of this function is a vector $y_{pred}$ where each entry is the $y_{pred}^{(i)}$ for the corresponding $x^{(i)}$.
<br> We now need to take our actual outputs $y$ and subtract the $y_{pred}$ vector element-wise, which gives a vector of residuals.
<br> This is easy when both `y` and `y_pred` are numpy arrays of the same shape.
<br>Then we square each element and sum all of the elements together.
<br>Thus the entire function is defined as:

In [14]:
def LSELoss(y_pred, y):
    # sum of squared residuals between the labels y and the predictions y_pred
    return np.sum((y - y_pred) ** 2)

LSELoss(y_pred, y)
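
For readers without access to the `models.LinearRegression` module, the same computation can be sketched in a self-contained way. The `predict` function with its fixed `slope` and `intercept` is a hypothetical stand-in for the model, and `lse_loss` is an illustrative name. None of this is the notebook's actual model:

```python
import numpy as np


def predict(X, slope=1.0, intercept=2.0):
    """Hypothetical stand-in for a fitted model: one prediction per row of X."""
    return slope * X[:, 0] + intercept


def lse_loss(y_pred, y):
    """Sum of squared residuals between labels y and predictions y_pred."""
    y_pred = np.asarray(y_pred, dtype=float)
    y = np.asarray(y, dtype=float)
    if y_pred.shape != y.shape:
        # Guard against silent broadcasting, e.g. (m, 1) against (m,)
        raise ValueError(f"shape mismatch: {y_pred.shape} vs {y.shape}")
    return float(np.sum((y - y_pred) ** 2))


x = np.linspace(0, 10, 10)
X = x.reshape(-1, 1)   # rows are examples, columns are features
y = x + 2              # noiseless labels, so the stand-in model is exact

perfect = lse_loss(predict(X), y)        # every residual is 0
shifted = lse_loss(predict(X) + 0.5, y)  # every residual is about -0.5
print(perfect, shifted)
```

The shape check is a design choice worth keeping in mind: subtracting an array of shape `(m, 1)` from one of shape `(m,)` broadcasts to an `(m, m)` matrix in numpy, which would produce a wrong loss without raising any error.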

### Why Least Squares? (Digging into the theory)
##### *(Skip to the end for a high-level summary without all of the math)*
Assume we have the same circumstances as above with $y_{pred}^{(i)}, y^{(i)}, x^{(i)}$

Now we can relate our predicted output and actual output as follows:
$$ y^{(i)} = y_{pred}^{(i)} + \epsilon^{(i)}, \quad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^{2}) $$
This means that, for each example, our predicted output plus some error (unmodelled effects, noise, etc.) is equal to our actual output. We are assuming that the unmodelled error is normally distributed, and that the $\epsilon^{(i)}$ are *i.i.d. (Independent and Identically Distributed)*.
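
To make the noise assumption concrete, the sketch below simulates it. The value of `sigma`, the line `x + 2` used as the prediction, and the sample size are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

sigma = 0.5                      # assumed noise standard deviation
x = np.linspace(0, 10, 100_000)
y_pred = x + 2                   # assumed predictions
eps = rng.normal(loc=0.0, scale=sigma, size=x.shape)  # i.i.d. N(0, sigma^2)
y = y_pred + eps                 # actual output = prediction + error

residuals = y - y_pred
print(residuals.mean(), residuals.std())
```

With this many samples, the residual mean should land close to 0 and the residual standard deviation close to `sigma`, which is what the normal error model predicts.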

By assuming this, we now have a probability density for our unmodelled error:
$$ P(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(\epsilon^{(i)})^{2}}{2\sigma^{2}}\right]$$

Since $\epsilon^{(i)} = y^{(i)} - y_{pred}^{(i)}$ and $y_{pred}^{(i)}$ is fixed once $x^{(i)}$ is given, the probability density of this specific unmodelled error occurring is the same as the density of our output occurring given our input. Using the formula relating our output to our predicted output, we can substitute for the unmodelled error:

$$ P(y^{(i)} | x^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(y^{(i)} - y_{pred}^{(i)})^{2}}{2\sigma^{2}}\right] \implies  y^{(i)} | x^{(i)} \sim \mathcal{N}(y_{pred}^{(i)}, \sigma^{2})$$

So now we know that our output follows a normal distribution whose mean is our predicted output, which means we can write down a likelihood function over all $m$ examples. Since $y_{pred}^{(i)}$ is a function of our parameters, we define the likelihood as a function of the parameters rather than of $y_{pred}^{(i)}$. Let
$$g^{(i)}: \mathbb{R}^{n} \rightarrow \mathbb{R}, \qquad \theta \mapsto y_{pred}^{(i)}$$

This is just so that our predicted output is defined as a function of our parameter vector $\theta$.
<br> Then we have the Likelihood as:

$$ \mathcal{L}(\theta) = \prod_{i = 1}^{m} P(y^{(i)} | x^{(i)}) $$

By taking the log of the likelihood, we then get a summation:

$$ \begin{aligned}
\log{\mathcal{L}(\theta)} = l(\theta) &= \sum_{i=1}^{m}\left[\log{\frac{1}{\sqrt{2\pi}\sigma}} + \log{\exp\left[-\frac{(y^{(i)} - g^{(i)}(\theta))^{2}}{2\sigma^{2}}\right]}\right]
\\ &= m\log{\frac{1}{\sqrt{2\pi}\sigma}} - \sum_{i=1}^{m}\frac{(y^{(i)} - g^{(i)}(\theta))^{2}}{2\sigma^{2}}
\end{aligned} $$
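
The log-likelihood can be checked numerically by computing it two ways: summing the log of each example's density, and using the closed form with the normalizing constant $\frac{1}{\sqrt{2\pi}\sigma}$ from the density defined earlier. This is a minimal sketch; `gaussian_pdf`, the value of `sigma`, and the synthetic data are assumptions made for illustration:

```python
import numpy as np


def gaussian_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at y."""
    return np.exp(-((y - mean) ** 2) / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)


rng = np.random.default_rng(1)
sigma = 0.5                                   # assumed noise standard deviation
y_pred = np.linspace(2, 12, 20)               # assumed predictions
y = y_pred + rng.normal(0.0, sigma, size=y_pred.shape)
m = y.size

# Route 1: sum the log of each example's density
loglik_direct = float(np.sum(np.log(gaussian_pdf(y, y_pred, sigma))))

# Route 2: closed form, constant term minus LSE / (2 sigma^2)
lse = float(np.sum((y - y_pred) ** 2))
loglik_closed = m * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - lse / (2 * sigma**2)

print(loglik_direct, loglik_closed)
```

The two values should agree up to floating-point rounding, which shows that the only part of the log-likelihood that depends on the predictions is the sum of squared residuals.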

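Now by using Maximum Likelihood Estimation, we need to choose the parameters $\theta$ of our model to maximize $l(\theta)$. The first term does not depend on $\theta$, and $\frac{1}{2\sigma^{2}}$ is a positive constant, so dropping both leaves the maximizer unchanged. Maximizing the negative of a sum is the same as minimizing the sum:

$$ \begin{aligned}
\underset{\theta}{\mathrm{argmax}} \left[m\log{\frac{1}{\sqrt{2\pi}\sigma}} - \sum_{i=1}^{m}\frac{(y^{(i)} - g^{(i)}(\theta))^{2}}{2\sigma^{2}}\right]
&= \underset{\theta}{\mathrm{argmax}} \left[-\sum_{i=1}^{m}\frac{(y^{(i)} - g^{(i)}(\theta))^{2}}{2\sigma^{2}}\right]
\\ &= \underset{\theta}{\mathrm{argmin}} \left[\sum_{i=1}^{m}(y^{(i)} - g^{(i)}(\theta))^{2}\right]
\\ &= \underset{\theta}{\mathrm{argmin}} \left[LSE\right]
\end{aligned} $$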
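
The equivalence between maximizing the log-likelihood and minimizing the LSE can also be seen with a small grid search. This is only a sketch under stated assumptions: the model is taken to be $g^{(i)}(\theta) = \theta x^{(i)}$ with a single slope parameter, and the true slope of 3, the value of `sigma`, and the grid of candidates are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5                                          # assumed noise standard deviation
x = np.linspace(0, 10, 50)
y = 3.0 * x + rng.normal(0.0, sigma, size=x.shape)   # assumed true slope of 3
m = x.size

thetas = np.linspace(2.0, 4.0, 201)                  # candidate slopes
preds = thetas[:, None] * x[None, :]                 # predictions for every candidate
lse = np.sum((y[None, :] - preds) ** 2, axis=1)      # LSE per candidate
loglik = m * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - lse / (2 * sigma**2)

best_by_loglik = thetas[np.argmax(loglik)]           # maximum likelihood choice
best_by_lse = thetas[np.argmin(lse)]                 # least squares choice
print(best_by_loglik, best_by_lse)
```

Both criteria should pick the same candidate, and that candidate should sit near the slope used to generate the data.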

Ok so, at a high level: if we assume that our outputs follow a normal distribution centered on our predicted output (that is, our predicted output should be the mean of the real outputs for a given input), with i.i.d. errors of constant variance, then maximum likelihood estimation ends up minimizing the least squares error to find our optimal parameters.

