# Least Squares Solution & Model Evaluation

Sections:
* Maximum likelihood estimation for simple linear regression
* Model evaluation methods

This lecture draws from Chapter 3 of James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). "An Introduction to Statistical Learning: with Applications in R."

---
# 1. Maximum likelihood estimate for simple linear regression

Remember in the last lecture I showed how the least squares solution for finding the best fitting linear regression coefficients is

$$ \hat{\beta} = (X'X)^{-1}X'Y $$

If we set p = 1 (i.e., $X$ has only one predictor plus an intercept term), this can also be written as

 $$ \hat{\beta}_1 = \frac{ \sum_{i=1}^{n} (x_i-\bar{x})(y_i - \bar{y})} {\sum_{i=1}^{n} (x_i - \bar{x})^2} $$

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$
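As a quick sanity check, here is a minimal sketch showing that the matrix form and the p = 1 form give the same coefficients. The toy data are made up for illustration, and the 2x2 inverse of $X'X$ is written out by hand so that plain Python is enough.

```python
# Toy data (made up for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Closed-form solution for p = 1
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar

# Matrix form. X has a column of ones and a column of x, so
# X'X = [[n, sum(x)], [sum(x), sum(x^2)]] and X'Y = [sum(y), sum(x*y)].
sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
det = n * sum_xx - sum_x * sum_x  # determinant of X'X

# (X'X)^{-1} X'Y using the explicit inverse of a 2x2 matrix
beta0_mat = (sum_xx * sum_y - sum_x * sum_xy) / det
beta1_mat = (n * sum_xy - sum_x * sum_y) / det

print(beta0, beta1)
print(beta0_mat, beta1_mat)
```

Both lines should print approximately 0.05 for the intercept and 1.99 for the slope, up to floating point rounding.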

One thing I neglected to show was _why_ this is the best solution to a simple linear regression problem. So today, we'll work through the math. Before we go further let's refresh the assumptions of OLS regression.

<br>
**(Get ready, it's about to get pretty mathy up in here)**

<br>

---
## Assumptions of univariate linear models

1. $Y$ is *normally distributed*.
2. $f(X)$ describes a *linear* relationship between $X$ & $Y$.
3. There is *no collinearity* in the different variables of $X$.
4. $f(X)$ is *stationary*, such that the observation of one data pair, $<x_n, y_n>$ does not affect the relationship of another data pair, $<x_{n+1}, y_{n+1}>$.

---

<br>
Remember, as long as the assumptions listed above are not violated, you can use the equations shown above to find the best fitting regression coefficients (or more accurately, the best fitting line that describes $Y=f(X)$). To understand why, you need to understand the concept of a [_likelihood function_](https://en.wikipedia.org/wiki/Likelihood_function).

* **Likelihood function**: The probability that your observed data arise from a specific probability distribution defined by a specific set of parameters.

More specifically, it is the likelihood of the data ($Y$) given the specific predictor variables ($X$) and a mapping function ($f()$), including the parameters that describe the distribution of the data.

Now that last part of the description of the likelihood is the important part. This is why we have the first assumption that $Y$ is normally distributed. Remember from our first lab that the probability density function for a normal distribution is

$$ f(x | \mu, \sigma) = \frac{1} {{\sigma \sqrt {2\pi } }} e^{{\frac{ - ( {x - \mu })^2 }{2\sigma^2} }} $$

Now if we assume that $Y$ is normally distributed around $f(X)$, it follows that its _residuals_ are normally distributed. In other words, each residual $y_i - \hat{\beta}_0 - \hat{\beta}_1x_i$ is a random variable from a normal distribution with a mean of zero (the RSS is just the sum of the squares of these residuals). Thus we can write the likelihood as the product ($\prod$) of the normal probability densities of all the observations. Let's consider this in the case where p = 1.


$$ \prod_{i=1}^{n} p(y_i | x_i; \beta_0, \beta_1, \sigma) =  \prod_{i=1}^{n} \frac{1} {{\sigma \sqrt {2\pi } }} e^{{\frac{ - ( {y_i - (\beta_0 + \beta_1x_i) })^2 }{2\sigma^2} }} $$

In plain English, this says that the likelihood is the aggregated probability of observing a particular value of $y$, given the residual errors between the observations and the predicted values from the model. In this case we want to _maximize_ this function, such that the data has the highest probability of arising from a model with a specific set of values for $\beta_0, \beta_1,$ and $\sigma$.

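Now in practice this is hard to work with because $\prod_{i=1}^{n} p(y_i | x_i; \beta_0, \beta_1, \sigma)$ is a product of many small numbers, which is awkward to differentiate and quickly becomes numerically tiny. Therefore it is easier to take the log of this function, called the _log likelihood function_ ($logL$), which turns the product into a sum and makes the problem boil down to simpler algebra. Because the log is a monotonically increasing function, the parameters that maximize $logL$ also maximize the likelihood itself.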

$$  logL(\beta_0, \beta_1, \sigma)= \log \prod_{i=1}^{n} p(y_i | x_i; \beta_0, \beta_1, \sigma) \\
= \sum_{i=1}^{n} \log  p(y_i | x_i; \beta_0, \beta_1, \sigma) \\ 
= \frac{-n}{2} \log(2\pi) - n \log(\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1x_i))^2 
$$


Presented this way the math is much easier. You're just summing the squared residuals, assuming a specific set of regression coefficients ($\beta_0, \beta_1$) and a specific variance on the distribution of $Y$ ($\sigma^2$). Notice that $\beta_0$ and $\beta_1$ only appear in the last term, which enters with a negative sign, so finding the coefficients that maximize $logL(\beta_0, \beta_1, \sigma)$ is the same as finding the coefficients that minimize the sum of squared residuals.
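To make the log likelihood concrete, here is a minimal sketch that evaluates it two ways: with the simplified formula, and by summing the log of each normal density directly. The toy data, the fixed value of $\sigma$, and the function names are assumptions made for illustration.

```python
import math

# Toy data (made up for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def log_likelihood(beta0, beta1, sigma):
    """Simplified form: constant terms minus the scaled sum of squared residuals."""
    n = len(x)
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    return -n / 2 * math.log(2 * math.pi) - n * math.log(sigma) - rss / (2 * sigma ** 2)

def log_likelihood_direct(beta0, beta1, sigma):
    """Direct form: sum of the log of each observation's normal density."""
    total = 0.0
    for xi, yi in zip(x, y):
        mu = beta0 + beta1 * xi
        density = math.exp(-((yi - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        total += math.log(density)
    return total

sigma = 0.5  # held fixed here so only the coefficients vary

# The OLS coefficients for this toy data (about 0.05 and 1.99) versus a nearby guess
ll_ols = log_likelihood(0.05, 1.99, sigma)
ll_other = log_likelihood(0.5, 1.8, sigma)
print(ll_ols, ll_other)
```

With $\sigma$ held fixed, any pair of coefficients with a larger sum of squared residuals has a lower log likelihood, so `ll_ols` comes out larger than `ll_other`.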

<br>

---

Now what we'll want to do is show that the OLS regression solution maximizes the likelihood function, using a method called _maximum likelihood estimation (MLE)_.

Before we get into MLE, let's refresh some basic points on derivatives and introductory statistics, as these will help to see the solution.

* $\frac{\partial(x^2 + y)}{\partial{x}} = 2x$
* $E[X] = \bar{X}$
* $E[X^2] = Var[X] + E[X]^2 $
* $E[XY] = Cov[XY] + E[X]E[Y]$
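The two moment identities in the list above can be checked numerically. This is a small sketch on made-up data, assuming the population versions of the variance and covariance (dividing by $n$ rather than $n-1$), which is what makes the identities hold exactly.

```python
# Toy data (made up for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def mean(values):
    return sum(values) / len(values)

e_x, e_y = mean(x), mean(y)
e_x2 = mean([xi * xi for xi in x])                 # E[X^2]
e_xy = mean([xi * yi for xi, yi in zip(x, y)])     # E[XY]

# Population variance and covariance (divide by n, not n - 1)
var_x = mean([(xi - e_x) ** 2 for xi in x])
cov_xy = mean([(xi - e_x) * (yi - e_y) for xi, yi in zip(x, y)])

print(e_x2, var_x + e_x ** 2)       # E[X^2] = Var[X] + E[X]^2
print(e_xy, cov_xy + e_x * e_y)     # E[XY] = Cov[XY] + E[X]E[Y]
```

Each printed pair should agree up to floating point rounding.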

Okay, now that this is out of the way, let us see the maximum likelihood solution to the OLS problem.

The _deviance_ ($D$) of a model is the measure of how far the data deviates away from the expectations generated by your model. In other words, it is a _cost function_. In the OLS case, this is 

$$ D = (Y - (\beta_0 + \beta_1X))^2 $$

Now we want to know the expected, or most likely, deviance value as a summary of how well the model fits the data. In this context, this is the _mean squared error (MSE)_. We sometimes refer to this as the _objective function_ of a model, which is a function to be minimized or maximized.

We can write out the full MSE for the _estimated_ parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ as

$$ MSE(\hat{\beta}_0, \hat{\beta}_1) = E[(Y - (\hat{\beta}_0 + \hat{\beta}_1X))^2] \\
= E[Y^2] - 2\hat{\beta}_0E[Y] - 2\hat{\beta}_1E[XY] + E[(\hat{\beta}_0 + \hat{\beta}_1X)^2] \\
= E[Y^2] - 2\hat{\beta}_0 E[Y] - 2\hat{\beta}_1(Cov[XY]+E[X]E[Y]) + \hat{\beta}_0^2 + 2\hat{\beta}_0\hat{\beta}_1E[X] + \hat{\beta}_1^2E[X^2] \\ 
= E[Y^2] - 2\hat{\beta}_0 E[Y] - 2\hat{\beta}_1Cov[XY] - 2\hat{\beta}_1E[X]E[Y] + \hat{\beta}_0^2 + 2\hat{\beta}_0\hat{\beta}_1E[X] + \hat{\beta}_1^2Var[X] + \hat{\beta}_1^2(E[X])^2
$$

Now we have a good place to start. With p=1, we want to minimize this function with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$, by finding where the derivative of the function (remember it is a cost function) is zero.

$$ \frac{\partial E[(Y - (\hat{\beta}_0 + \hat{\beta}_1X))^2]}{\partial \hat{\beta}_0} = -2E[Y] + 2\hat{\beta}_0 + 2\hat{\beta}_1E[X]$$ 

$$ \frac{\partial E[(Y - (\hat{\beta}_0 + \hat{\beta}_1X))^2]}{\partial \hat{\beta}_1} = -2Cov[XY] - 2E[X]E[Y] + 2\hat{\beta}_0E[X] + 2\hat{\beta}_1Var[X] + 2\hat{\beta}_1(E[X])^2 $$

Remember, we want to find the point on the error space where the _derivative_ equals zero.

![Cost Landscape](imgs/L7RSSLandscape.png)

If you set the left hand side of each equation to zero, and solve for $\hat{\beta}_0$ and $\hat{\beta}_1$ respectively (a task you will be asked to complete in your homework), then you'll find that the solutions are the OLS equations shown in the last lecture.

 $$ \hat{\beta}_1 = \frac{ \sum_{i=1}^{n} (x_i-\bar{x})(y_i - \bar{y})} {\sum_{i=1}^{n} (x_i - \bar{x})^2} $$

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$
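Without giving away the algebra, you can check the claim numerically. The sketch below computes the two partial derivatives of the MSE as sample averages (differentiating each squared residual directly) and evaluates them at the OLS solution and at an arbitrary other point. The toy data and the function name `mse_gradient` are assumptions made for illustration.

```python
# Toy data (made up for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

def mse_gradient(beta0, beta1):
    """Partial derivatives of mean((y - (beta0 + beta1 * x))^2)."""
    residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
    d_beta0 = -2 * sum(residuals) / n
    d_beta1 = -2 * sum(xi * ri for xi, ri in zip(x, residuals)) / n
    return d_beta0, d_beta1

# OLS estimates from the closed-form equations
x_bar, y_bar = sum(x) / n, sum(y) / n
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
beta0 = y_bar - beta1 * x_bar

grad_at_ols = mse_gradient(beta0, beta1)
grad_elsewhere = mse_gradient(0.5, 1.8)
print(grad_at_ols)      # both entries are zero up to rounding
print(grad_elsewhere)   # clearly non-zero away from the minimum
```

The gradient vanishes only at the OLS estimates, which is the bottom of the cost landscape.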

_Voila!_ You can now see why you need each of the 4 assumptions for the OLS regression model. 

* $Y$ needs to be normal because that gives us a likelihood function we can work with. 
* $f(X)$ needs to be linear, and all variables in $X$ need to be statistically independent, in order to make the algebra work.
* The MLE assumes only one set of parameters $\beta_0, \beta_1,$ and $\sigma$, thus if the system changes over time, we can't use simple MLE.




---
# 2. Model evaluation approaches


The first part of this lecture proved the OLS solution for simple linear regression. This process of learning the best possible $f(X)$ given your data is an example of the _model fit_.

* **Model Fit:** What are the best parameters that explain the most variance in $Y$ from variability in $X$? 

But it is important to keep in mind that even though this may be the best solution to the problem $Y=f(X)$, it does not necessarily mean that you've fully explained $Y$ by $f(X)$.  So you need to be able to _evaluate_ how well your model explained the data. 

* **Model evaluation:** How well does the best fit model $f(X)$ explain $Y$?

For model evaluation, we generally use measures of a model's _goodness of fit_. We will go over 3 different goodness of fit measures.

---
## Residual Standard Error (RSE)

We went over RSE in the last lecture. As a reminder, it is an estimate of the standard deviation of the residual error, defined as

$$ RSE = \sigma_{model} = \sqrt{ \frac{RSS}{n-2} } = \sqrt{ \frac{ \sum_{i=1}^{n} (y_i-\hat{y}_i)^2 }{n-2} }$$

RSE is the simplest form of goodness of fit measures, as it tells you how far, on average, the observed values fall from the model's predictions $\hat{y}$. The smaller the RSS, the smaller the RSE. Ideally, your RSE should be much, much smaller than the standard deviation of $Y$ itself ($\sigma_{y}$).
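Here is a minimal sketch of the RSE calculation for a simple linear regression, compared against the sample standard deviation of $Y$. The toy data are made up for illustration.

```python
import math

# Toy data (made up for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Fit the model with the closed-form OLS equations
x_bar, y_bar = sum(x) / n, sum(y) / n
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
beta0 = y_bar - beta1 * x_bar
y_hat = [beta0 + beta1 * xi for xi in x]

# Residual sum of squares and residual standard error (n - 2 degrees of freedom)
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
rse = math.sqrt(rss / (n - 2))

# Sample standard deviation of Y for comparison
sd_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

print(rse, sd_y)
```

For this nearly linear toy data the RSE is far smaller than the standard deviation of $Y$, which is what a good fit looks like.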

---
## Coefficient of Determination ($r^2$)

Unlike the RSE, which measures error in your model, the $r^2$ statistic measures the proportion of variance in $Y$ that is explained by $X$. For this we compare a measure of the residuals from the model (RSS) to a measure of variance of $Y$ (TSS).

$$ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
$$ TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 $$

The idea is to express the difference between TSS & RSS (the variability that the model accounts for) as a fraction of TSS.

$$ r^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} {\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

Being a ratio, $r^2$ is constrained between 0 and 1 (although you'll find that if you have a _very poor_ model fit, it can break this boundary), with 0 indicating that $X$ explains 0% of the variance in $Y$, and 1 indicating that $X$ explains 100% of the variance in $Y$.

The nice thing about the $r^2$ statistic is that it provides a goodness of fit measure on a meaningful and interpretable scale.
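The sketch below computes $r^2$ for the OLS fit on made-up toy data, and then for a deliberately bad line to show how a very poor model can push $r^2$ below 0. The helper name `r_squared` and the bad coefficients are assumptions chosen for illustration.

```python
# Toy data (made up for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

def r_squared(y_obs, y_pred):
    """1 - RSS / TSS for a set of observations and predictions."""
    y_mean = sum(y_obs) / len(y_obs)
    rss = sum((yo - yp) ** 2 for yo, yp in zip(y_obs, y_pred))
    tss = sum((yo - y_mean) ** 2 for yo in y_obs)
    return 1 - rss / tss

# OLS fit from the closed-form equations
x_bar, y_bar = sum(x) / n, sum(y) / n
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
beta0 = y_bar - beta1 * x_bar
r2_ols = r_squared(y, [beta0 + beta1 * xi for xi in x])

# A deliberately bad line (not fit to the data): slope has the wrong sign
r2_bad = r_squared(y, [10.0 - 1.0 * xi for xi in x])

print(r2_ols, r2_bad)
```

The OLS fit gives an $r^2$ close to 1, while the bad line has an RSS larger than the TSS, so its $r^2$ is negative.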

---
## F-statistic

Finally, another typical measure of goodness of fit is the F-statistic, discussed in the last lecture. 

$$ F = \frac{ \frac{(TSS-RSS)}{p}} { \frac{RSS}{n - p - 1}} $$

Notice how the F-statistic is a hybrid of the RSE and $r^2$. It is an omnibus measure of the divergence of RSS & TSS. However, rather than being expressed as a strict ratio that is constrained between 0 & 1 (like $r^2$), the F-statistic can be any positive number since it is using RSS as the denominator, rather than TSS. Thus, what the F-statistic is telling you is whether the variance explained by your model (TSS - RSS, scaled by the number of predictors $p$) is greater than the variance not explained by your model (RSS, scaled by $n - p - 1$). If so, then the ratio is high.
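As a small sketch on made-up toy data, the F-statistic can be computed directly from RSS and TSS. The second calculation rewrites the same quantity in terms of $r^2$, which follows from dividing the numerator and denominator by TSS.

```python
# Toy data (made up for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
p = 1  # one predictor

# OLS fit from the closed-form equations
x_bar, y_bar = sum(x) / n, sum(y) / n
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
beta0 = y_bar - beta1 * x_bar
y_hat = [beta0 + beta1 * xi for xi in x]

rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
tss = sum((yi - y_bar) ** 2 for yi in y)

# F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))

# The same quantity expressed through r^2
r2 = 1 - rss / tss
f_from_r2 = (r2 / p) / ((1 - r2) / (n - p - 1))

print(f_stat, f_from_r2)
```

The two values agree up to rounding, and both are large here because the model explains almost all of the variability in $Y$.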

---

As we go along we will encounter many more goodness of fit statistics, but they generally all follow similar forms as shown in the RSE, $r^2$, and F-statistic. We will return to the issue of model evaluation when we get to model selection routines (Chapter 6).

