# Estimating the Simple Regression Model
As a brief review, we are now in a position where we have specified the form of the model we wish to use on our data. For some continuous outcome variable and some continuous predictor, we start with the simplest relationship we can imagine: a straight line. We then formalise this by placing it within the context of a statistical model, giving:

$$
\begin{align*}
    y_{i}   &= \beta_{0} + \beta_{1}x_{i} + \epsilon_{i} \\
    \epsilon_{i} &\sim \mathcal{N}\left(0,\sigma^{2}\right).
\end{align*}
$$

This is all fine, except that we have not actually done anything yet! All we have done is written down the model equation that we would like to use. However, we cannot do anything with this because it contains *unknown values*. At the point of analysing the data, we have measurements of both $y$ and $x$, but we do *not* have values for:

- The intercept $\beta_{0}$
- The slope $\beta_{1}$
- The errors $\epsilon_{i}$ &ndash; because they depend upon $\beta_{0}$ and $\beta_{1}$
- The variance $\sigma^{2}$ &ndash; because it depends upon the errors

So, we are currently a bit stuck. Earlier in this lesson, we saw a way of *estimating* the slope and intercept using the method of *least-squares*. In this section, we will examine a more generic way of arriving at these values using the *Method of Maximum Likelihood*. In effect, this will provide us with values for all the unknowns above, allowing us to actually perform calculations and reach conclusions using our model. 

## Notation for Estimates
Before getting to the details of estimation, it is important that we establish some new notation for the estimates themselves. So far, our notation has implicitly assumed that we are referring to the whole population under study. As such, the parameters indicated above represent *population-level constants*. In other words, $\beta_{0}$ is the *true* intercept and $\beta_{1}$ is the *true* slope. In reality, however, we will usually only have a *sample* from a population. As such, the parameters we calculate will only be *estimates* of the true values.

Traditionally, we denote a parameter estimate by placing a "hat" on top of the corresponding Greek letter. For instance, we would denote an estimate of $\beta_{1}$ from a given sample as $\hat{\beta}_{1}$ (pronounced "beta 1 hat"). This is important because whenever we see $\hat{\beta}_{1}$, this tells us that the value is effectively a *guess* based on a certain sample of data. In comparison, whenever we see $\beta_{1}$, this indicates the *true* population value, even though this is a largely theoretical and unknowable quantity. We will see more later about why this distinction is important. For the moment, you just need to keep an eye on those little hats because they make quite a big difference to the meaning.

`````{admonition} Alternative Notation for Estimates
:class: tip
Some authors like to use the Latin alternatives to the Greek letters to denote estimates. We will not be doing this, but it is worth knowing about to make sense of notation you may see in textbooks or the literature. In this scheme, the estimate of $\beta_{1}$ would be $b_{1}$. So, the true population model would be

$$
y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i},
$$

whereas an estimated model based on data would be

$$
y_{i} = b_{0} + b_{1}x_{i} + e_{i}.
$$

You can decide which of these you prefer, but we will be wearing "hats" in all our notation going forward.
`````

## Errors vs Residuals
Another subtle but important difference is that, much like the parameters $\beta_{0}$ and $\beta_{1}$, the *errors* in the model above are defined at the level of the population. This is important, because their value depends upon $\beta_{0}$ and $\beta_{1}$. In other words, to get the *true* errors, we would need to know the *true* population values of the parameters. Because this is almost never possible, the errors we actually use are based on the *estimates* $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$. As such, the errors we calculate from our estimated model are going to be different from the true errors. We therefore make a distinction between *errors* and *residuals*. The *errors* are the differences from the *true* regression line, whereas the *residuals* are the differences from the *estimated* regression line. To denote this, residuals are often written with a Latin $e_{i}$, meaning we can write our *estimated* model as:

$$
y_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}x_{i} + e_{i}.
$$

This is helpful, because it makes it clearer that the residuals are not really a parameter; rather, they are a *derived* quantity. So we keep Greek letters wearing hats for our estimated parameters, and lower-case Latin letters for everything else.
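The distinction can be made concrete with a small simulation, where (unlike with real data) we get to choose the true parameters ourselves. The sketch below is written in Python purely for illustration. The parameter values, the sample size and the helper names `simulate` and `ols_fit` are all assumptions made for this example, and the estimates come from the standard closed-form least-squares expressions.

```python
import random

# Hypothetical "true" population values, chosen only for this illustration.
TRUE_BETA0, TRUE_BETA1, TRUE_SIGMA = 2.0, 0.5, 1.0

def simulate(n, seed=1):
    """Draw n observations from y = beta0 + beta1 * x + error."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    errors = [rng.gauss(0, TRUE_SIGMA) for _ in range(n)]
    y = [TRUE_BETA0 + TRUE_BETA1 * xi + ei for xi, ei in zip(x, errors)]
    return x, y, errors

def ols_fit(x, y):
    """Closed-form least-squares estimates of the intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sxy / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

# The errors are only known here because we simulated the data ourselves.
x, y, errors = simulate(n=50)
beta0_hat, beta1_hat = ols_fit(x, y)

# Residuals: deviations from the *estimated* line, not the true line.
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]

print(f"estimates: intercept = {beta0_hat:.3f}, slope = {beta1_hat:.3f}")
print(f"first error    = {errors[0]:.3f}")
print(f"first residual = {residuals[0]:.3f}")
```

With real data only the residuals are available, because calculating the errors requires the true $\beta_{0}$ and $\beta_{1}$.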

## Methods of Estimation
Now that we have established the core problem and some new notation, it is time to actually talk about the estimation process. As with many aspects of statistics, there are multiple ways to go about estimating the parameter values using the data. On this course, we will be focussing mostly on the method of *maximum likelihood estimation* (MLE). The reason for this is that MLE is a very general method that can be applied to lots of different models of increasing complexity. Thinking in terms of the likelihood is also useful because it naturally fits with the probabilistic framework we have built so far. The only downside is that linear models are almost always presented in terms of *ordinary least squares* (OLS) estimation, with MLE reserved for more complex applications. As such, this perspective is a little unusual for a model as basic as simple linear regression. Nevertheless, it is useful to start understanding MLE at this stage, as it provides a complete and holistic perspective on all the models we will be covering on this course.

### Ordinary Least Squares (OLS)
At the beginning of this lesson, we demonstrated how the simple regression line could be fit by minimising the sum of squared errors. Effectively, we were finding the line that *minimised* the variance of the data around the regression line or, more simply, the error variance. This was the method of OLS. OLS is useful because it is very simple to conceptualise and results in a simple set of equations for finding estimates. It is also a nice fit for our concept of *variance* as OLS will result in the line that explains the *largest* chunk of the total variance, leaving the error variance as small as possible. Also, OLS makes no particular assumption about the distribution of the data, meaning that under cases of non-normality, the estimates remain *unbiased* (i.e. we can trust them as good estimates of the population values), though some of their other properties break down[^gaussmarkovfoot]. Because of this, OLS is generally introduced as *the* method used for normal linear models. The main problem with OLS is that it is largely only applicable to these types of models, and nothing else. For instance, models that do not assume a normal distribution (i.e. Generalised Linear Models), those that assume more complex correlational and variance structures (i.e. Linear Mixed-effects Models) and those that combine the two (i.e. Generalised Linear Mixed-effects Models) cannot be estimated with OLS. As such, OLS is actually a very niche approach. Because of this, it is much more helpful in terms of thinking *generally* to consider a *likelihood* approach to estimation. This is widely applicable across all the models indicated above, as well as being fundamental to Bayesian approaches to statistics.
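For reference, the "simple set of equations" for the simple regression case takes the following standard closed form, stated here without derivation. The symbols $\bar{x}$ and $\bar{y}$ (the sample means) and $n$ (the sample size) are notation introduced for this summary.

$$
\begin{align*}
    \hat{\beta}_{1} &= \frac{\sum_{i=1}^{n}\left(x_{i} - \bar{x}\right)\left(y_{i} - \bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i} - \bar{x}\right)^{2}} \\
    \hat{\beta}_{0} &= \bar{y} - \hat{\beta}_{1}\bar{x}.
\end{align*}
$$

The second equation implies that the fitted line always passes through the point $\left(\bar{x}, \bar{y}\right)$.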

### Maximum Likelihood Estimation (MLE)
The method of maximum likelihood is quite an intuitive approach to estimation, provided you understood the statistical modelling framework presented last week. In effect, this method finds parameter estimates using *probability*. This involves taking the assumed population distribution of the data and using it to calculate probabilities that allow us to determine the best parameters. This is nice because it takes the full probabilistic framework of our model into account, in a way that OLS does not. It is also useful because MLE can estimate parameters from non-normal distributions and can estimate much more complex variance and correlation structures, neither of which OLS can do. Although this additional flexibility is not needed by simple linear models, it is required by anything more complex. As such, OLS gets abandoned very quickly. Of particular utility is that MLE will produce *identical* estimates to OLS for parameters such as the slope and the intercept. As such, the likelihood approach is just as applicable to linear models as it is to other, more complex models. MLE is therefore very applicable *generically*, providing a way of thinking about estimation that is consistent with the probabilistic framework we have established thus far[^searlefoot].
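The agreement between MLE and OLS for the slope and intercept can be checked numerically. The following Python sketch is an illustration under stated assumptions rather than a full estimation routine: the data are made up, the helper names `log_likelihood` and `ols_fit` are hypothetical, and the variance is simply held fixed at an arbitrary value. It evaluates the normal log-likelihood at the least-squares estimates and then at slightly nudged values.

```python
import math

# Small made-up data set, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

def log_likelihood(beta0, beta1, sigma2, x, y):
    """Normal log-likelihood of y_i = beta0 + beta1 * x_i + error_i."""
    n = len(x)
    sse = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)

def ols_fit(x, y):
    """Closed-form least-squares estimates of the intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sxy / sxx
    return y_bar - beta1_hat * x_bar, beta1_hat

beta0_hat, beta1_hat = ols_fit(x, y)
sigma2 = 0.5  # arbitrary fixed variance (an assumption for this sketch)

best = log_likelihood(beta0_hat, beta1_hat, sigma2, x, y)
print(f"log-likelihood at the OLS estimates: {best:.4f}")

# Nudging the intercept or slope away from the OLS values lowers the log-likelihood.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
nudged_values = []
for d0, d1 in nudges:
    value = log_likelihood(beta0_hat + d0, beta1_hat + d1, sigma2, x, y)
    nudged_values.append(value)
    print(f"intercept {d0:+.2f}, slope {d1:+.2f}: {value:.4f}")
```

The intercept and slope enter the log-likelihood only through the sum of squared deviations, so for any fixed variance the values that maximise the likelihood are exactly those that minimise that sum.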

### Restricted Maximum Likelihood (REML)
A closely related approach to MLE is *restricted* or *residual* maximum likelihood (REML). We will detail the logic behind this in the next section, but in brief, this approach solves one particular issue with MLE in terms of estimating *variances*. The basic application of MLE will result in variance estimates that are *biased*. You can think of this in terms of MLE not taking Bessel's correction into account (as discussed last week). REML corrects the variance estimate by effectively removing the effects of the predictors from the data and then running MLE on the resultant errors. This removal adjusts the errors in such a way that the correction is automatically taken into account.
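For simple regression, the practical difference comes down to the divisor used for the sum of squared residuals: the ML estimate of $\sigma^{2}$ divides by the sample size $n$, whereas the REML estimate divides by $n - 2$, because two regression parameters have been estimated. The Python sketch below illustrates this on made-up data. It uses these known closed forms directly rather than implementing REML in general, and the helper name `ols_fit` is hypothetical.

```python
# Small made-up data set, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

def ols_fit(x, y):
    """Closed-form least-squares estimates of the intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sxy / sxx
    return y_bar - beta1_hat * x_bar, beta1_hat

beta0_hat, beta1_hat = ols_fit(x, y)
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

n = len(x)
p = 2  # number of estimated regression parameters (intercept and slope)

sigma2_ml = sse / n          # ML estimate: biased downwards
sigma2_reml = sse / (n - p)  # REML estimate for this model: bias-corrected

print(f"ML variance estimate:   {sigma2_ml:.4f}")
print(f"REML variance estimate: {sigma2_reml:.4f}")
```

The ML estimate is always the smaller of the two, and the gap shrinks as $n$ grows relative to the number of estimated parameters.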

`````{admonition} Estimation of Linear Models in Software
:class: tip
It is important to understand that the implementation of basic linear models in software almost always uses OLS instead of MLE. So for us, it is important to remember that `R` does *not* use MLE within the `lm()` function, even though it *could*. In this context, OLS is easier and simpler to implement, whereas MLE would largely be considered *overkill* for such a simple problem. However, because the results are *the same*, there is no harm in viewing linear models through the lens of ML, because this allows for a very general perspective that will be useful as models get more complex in future.
`````

[^gaussmarkovfoot]: Proof of this is given by the [Gauss-Markov Theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem).

[^searlefoot]: This is the perspective taken by [McCulloch, Searle & Neuhaus (2008)](https://www.librarysearch.manchester.ac.uk/permalink/44MAN_INST/1r887gn/alma9930787964401631), who are leading experts on the use of linear models and their derivatives within statistics. This book covers everything you need to understand the mathematical theory behind this framework, though it is not for the faint of heart!