# Estimating the Simple Regression Model
As a brief review, we are now in the position where we have specified the form of model we wish to use on our data. For some continuous outcome variable and some continuous predictor, we start with the simplest relationship we can imagine: a straight line. We then formalise this by placing it within the context of a statistical model, giving:

$$
\begin{align*}
    y_{i}   &= \beta_{0} + \beta_{1}x_{i} + \epsilon_{i} \\
    \epsilon_{i} &\sim \mathcal{N}\left(0,\sigma^{2}\right).
\end{align*}
$$

This is all fine, except that we have not actually done anything yet! All we have done is written down the model equation that we would like to use. However, we cannot do anything with this because it contains *unknown values*. At the point of analysing the data, we have measurements of both $y$ and $x$, but we do *not* have values for:

- The intercept $\beta_{0}$
- The slope $\beta_{1}$
- The errors $\epsilon_{i}$ &ndash; because they depend upon $\beta_{0}$ and $\beta_{1}$
- The variance $\sigma^{2}$ &ndash; because it depends upon the errors

So, we are currently a bit stuck. Earlier in this lesson, we saw a way of *estimating* the slope and intercept using the method of *least-squares*. In this section, we will examine a more generic way of arriving at these values using the *Method of Maximum Likelihood*. In effect, this will provide us with values for all the unknowns above, allowing us to actually perform calculations and reach conclusions using our model. 
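
To make the chain of dependence between the unknowns concrete, here is a minimal sketch in Python (the course itself works in `R`; the language is used here purely for illustration). The data values and the two candidate pairs of parameters are invented, and the function names `errors_given` and `variance_given` are hypothetical. The point is only that the errors, and then the variance, can be computed as soon as we commit to values for the intercept and slope.

```python
# Hypothetical data: five (x, y) measurements, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def errors_given(beta0, beta1, x, y):
    """Errors implied by a candidate intercept and slope."""
    return [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]


def variance_given(errors):
    """Average squared error (the model assumes the errors have mean zero)."""
    return sum(e ** 2 for e in errors) / len(errors)


# Two different guesses for the parameters imply different errors,
# and therefore different variances.
for beta0, beta1 in [(0.0, 2.0), (1.0, 1.5)]:
    errs = errors_given(beta0, beta1, x, y)
    print(beta0, beta1, [round(e, 2) for e in errs], round(variance_given(errs), 3))
```

Each guess for $\beta_{0}$ and $\beta_{1}$ produces its own set of errors and its own variance, which is why everything hinges on finding sensible values for the intercept and slope first.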

## Notation for Estimates
Before getting to the details of estimation, it is important that we establish some new notation for the estimates themselves. So far, our notation has implicitly assumed that we are referring to the whole population under study. As such, the parameters indicated above represent *population-level constants*. In other words, $\beta_{0}$ is the *true* intercept and $\beta_{1}$ is the *true* slope. In reality, however, we will usually only have a *sample* from a population. As such, the parameters we calculate will only be *estimates* of the true values.

Traditionally, we denote a parameter estimate by placing a "hat" on top of the corresponding Greek letter. For instance, we would denote an estimate of $\beta_{1}$ from a given sample as $\hat{\beta}_{1}$ (pronounced "beta 1 hat"). This is important because whenever we see $\hat{\beta}_{1}$, this tells us that the value is effectively a *guess* based on a certain sample of data. In comparison, whenever we see $\beta_{1}$, this indicates the *true* population value, even though this is a largely theoretical and unknowable quantity. We will see more later about why this distinction is important. For the moment, you just need to keep an eye on those little hats because they make quite a big difference to the meaning.
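
A small simulation can illustrate the difference between $\beta_{1}$ and $\hat{\beta}_{1}$. The sketch below is in Python (the course itself works in `R`) and rests on several assumptions: the "true" values `BETA0`, `BETA1` and `SIGMA` are invented, the predictor is drawn uniformly between 0 and 10, and the slope is estimated with the usual least-squares formula. The helper names `draw_sample` and `estimate_slope` are hypothetical.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Assumed "true" population values, chosen only for this illustration.
BETA0, BETA1, SIGMA = 1.0, 2.0, 1.0


def draw_sample(n):
    """Draw n observations from the assumed population model."""
    x = [random.uniform(0, 10) for _ in range(n)]
    y = [BETA0 + BETA1 * xi + random.gauss(0, SIGMA) for xi in x]
    return x, y


def estimate_slope(x, y):
    """Least-squares estimate of the slope from one sample."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Five separate samples of 30 observations give five different slope estimates.
slope_hats = [estimate_slope(*draw_sample(30)) for _ in range(5)]
print("true slope:", BETA1)
print("estimates: ", [round(b, 3) for b in slope_hats])
```

The true slope never changes, but each sample produces a slightly different estimate. This is exactly the distinction the hat is there to flag.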

`````{admonition} Alternative Notation for Estimates
:class: tip
Some authors like to use the Latin alternatives to the Greek letters to denote estimates. We will not be doing this, but it is worth knowing about to make sense of notation you may see in textbooks or the literature. In this scheme, the estimate of $\beta_{1}$ would be $b_{1}$. So, the true population model would be

$$
y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i},
$$

whereas an estimated model based on data would be

$$
y_{i} = b_{0} + b_{1}x_{i} + e_{i}.
$$

You can decide which of these you prefer, but we will be wearing "hats" in all our notation going forward.
`````

## Errors vs Residuals
Another subtle but important difference is that, much like the parameters $\beta_{0}$ and $\beta_{1}$, the *errors* in the model above are defined at the level of the population. This is important, because their value depends upon $\beta_{0}$ and $\beta_{1}$. In other words, to get the *true* errors, we would need to know the *true* population values of the parameters. Because this is almost never possible, the errors we actually use are based on the *estimates* $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$. As such, the errors we calculate from our estimated model are going to be different from the true errors. We therefore make a distinction between *errors* and *residuals*. The *errors* are the differences from the *true* regression line, whereas the *residuals* are the differences from the *estimated* regression line. To denote this, residuals are often written with a Latin $e_{i}$, meaning we can write our *estimated* model as:

$$
y_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}x_{i} + e_{i}.
$$

This is helpful, because it makes it clearer that the residuals are not really a parameter; rather, they are a *derived* quantity. So we keep Greek letters wearing hats for our estimated parameters, and lower-case Latin letters for everything else.
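
The difference between errors and residuals is easiest to see in a simulation, because only there do we get to know the true parameter values. The following is a minimal sketch in Python (the course itself works in `R`), with invented population values and a least-squares fit standing in for the estimated model. All variable names are illustrative.

```python
import random

random.seed(2)  # fixed seed so the illustration is reproducible

# Assumed population values (never known in practice; used here for illustration).
beta0, beta1, sigma = 1.0, 2.0, 1.0

x = [float(i) for i in range(1, 11)]
errors = [random.gauss(0, sigma) for _ in x]  # true errors epsilon_i
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, errors)]

# Least-squares estimates of the intercept and slope from this one sample.
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
beta1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals e_i: differences from the *estimated* line.
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]

for eps, e in zip(errors, residuals):
    print(round(eps, 3), round(e, 3))
```

The two columns are similar but not identical: the errors measure distance from the true line, and the residuals measure distance from the line we estimated.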

## Methods of Estimation
Now that we have established the core problem and some new notation, it is time to actually talk about the estimation process. As with many aspects of statistics, there are multiple ways to go about estimating the parameter values using the data. Indeed, one of the main differences between Frequentist and Bayesian statistics is the method of estimation. On this course, we will be focussing mostly on the methods of *maximum likelihood* (ML) and *restricted maximum likelihood* (REML), because these are the most common approaches used in practice. Before getting to the details of these, we will spend a little time talking about the methods very generally and how they compare to the more familiar (but ultimately, very restricted) method of *ordinary least squares* (OLS).

### Ordinary Least Squares (OLS)
At the beginning of this lesson, we demonstrated how the simple regression line could be fit by minimising the sum of squared errors. Effectively, this was finding the line that *minimised* the variance of the data around the regression line or, more simply, the error variance. This was the method of OLS.
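
As a reminder of what this looks like in practice, here is a minimal sketch in Python (the course itself works in `R`) using the standard closed-form least-squares solution for a single predictor. The data are invented and the function names `ols_fit` and `sum_squared_residuals` are hypothetical.

```python
# Hypothetical data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def ols_fit(x, y):
    """Closed-form least-squares estimates of the intercept and slope."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_residuals(intercept, slope, x, y):
    """Sum of squared vertical distances between the data and a given line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


beta0_hat, beta1_hat = ols_fit(x, y)
best = sum_squared_residuals(beta0_hat, beta1_hat, x, y)

# Any other line does worse: nudging the slope increases the sum of squares.
nudged = sum_squared_residuals(beta0_hat, beta1_hat + 0.1, x, y)
print(beta0_hat, beta1_hat, best, nudged)
```

No iteration is needed: the estimates come straight from sums over the data, and any other choice of line gives a larger sum of squares.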

`````{admonition} The Gauss-Markov Theorem
:class: tip
...
`````



### Maximum Likelihood (ML)

An important distinction here is that ML depends upon the assumed distribution of the data in order to work. Indeed, ML will only agree with OLS under the assumption of a normal distribution. However, this does make ML more generic because we can use it for estimation with any distribution for the data. This is partly how *generalised linear models* (GLMs) are able to accommodate non-normal outcome variables. 
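
The agreement between ML and OLS under normality can be checked numerically. For the simple regression model with normal errors, the log-likelihood of a sample of size $n$ is

$$
\ell\left(\beta_{0},\beta_{1},\sigma^{2}\right) = -\frac{n}{2}\log\left(2\pi\sigma^{2}\right) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left(y_{i} - \beta_{0} - \beta_{1}x_{i}\right)^{2}.
$$

The sketch below, in Python (the course itself works in `R`), evaluates this function on invented data. It plugs in the least-squares intercept and slope together with the ML variance estimate (the sum of squared residuals divided by $n$), and compares the result against a small grid of nearby values. The data, the grid spacing and the function name `log_likelihood` are assumptions made for illustration, and a grid check is not how ML is carried out in real software.

```python
import math

# Hypothetical data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(y)


def log_likelihood(beta0, beta1, sigma2, x, y):
    """Normal log-likelihood of the simple regression model."""
    sse = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    return -0.5 * len(y) * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)


# Least-squares estimates of the intercept and slope.
x_bar = sum(x) / n
y_bar = sum(y) / n
beta1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
beta0_hat = y_bar - beta1_hat * x_bar

# ML estimate of the variance: sum of squared residuals divided by n.
sse_hat = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))
sigma2_ml = sse_hat / n

ll_at_estimates = log_likelihood(beta0_hat, beta1_hat, sigma2_ml, x, y)

# Evaluate the log-likelihood on a small grid around those estimates.
nearby = [
    (beta0_hat + d0, beta1_hat + d1, sigma2_ml * scale)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.1, 0.0, 0.1)
    for scale in (0.8, 1.0, 1.2)
]
ll_nearby = [log_likelihood(b0, b1, s2, x, y) for b0, b1, s2 in nearby]

print(ll_at_estimates, max(ll_nearby))
```

None of the nearby values beats the log-likelihood at the least-squares estimates, which is consistent with the two methods giving the same intercept and slope when the errors are assumed normal.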

### Restricted Maximum Likelihood (REML)



[^searlefoot]: This is the perspective taken by [McCulloch, Searle & Neuhaus (2008)](https://www.librarysearch.manchester.ac.uk/permalink/44MAN_INST/1r887gn/alma9930787964401631), who are leading experts on the use of linear models and their derivatives within statistics. This book covers everything you need to understand the mathematical theory behind this framework, though it is not for the faint of heart!

`````{admonition} Estimation of Linear Models in Software
:class: tip
It is important to understand that the implementation of basic linear models in software almost always uses OLS instead of ML or REML. So for us, it is important to remember that `R` does *not* use ML or REML estimation for the `lm()` function, even though it *could*. In this context, OLS is simpler to compute and more computationally efficient, because it does not require iteration. However, because the estimates of the intercept and slope are *the same*, there is no harm in viewing linear models through the lens of ML, because this allows for a very general perspective that will be useful as models get more complex in future[^searlefoot].
`````