# Maximum Likelihood Estimation
In the previous section, we discussed estimation of the simple regression model very generally, establishing that we could use *either* MLE or OLS to arrive at the same results. As argued in that section, although OLS is conceptually simple, it is limited in its application to the more complex models that we will be encountering later on the course. As such, it is useful to focus primarily on Maximum Likelihood Estimation (MLE) as a general method that can be used to estimate values across a variety of models. In this section, we will dig into MLE in a bit more detail to give you a sense of how it works and, also, to show how the results are identical to OLS.

## MLE vs OLS

## How Does MLE Work?
Before getting started on understanding MLE, it is worth highlighting that you will *never* need to actually implement this yourself. Software will always take care of all of this for you. As such, this section is just trying to help you gain intuition about how MLE works. Even if you do not understand every nuance, you can still happily use MLE in software as the estimation technique. So just try to gain some intuition here and do not worry about understanding every detail.

Fundamentally, MLE is based on finding parameter values that make a certain value, known as the *likelihood*, as *big* as possible. The value of likelihood is calculated using the *likelihood function*, denoted 

$$
L\left(\boldsymbol{\theta}|\mathcal{D}\right),
$$

where $\mathcal{D}$ represents the data we have collected and $\boldsymbol{\theta}$ is a generic representation of any set of parameters. For instance, in simple regression, $\boldsymbol{\theta} = \{\beta_{0},\beta_{1},\sigma^{2}\}$. In order to understand this, we can make an equivalence between the likelihood and probability. When placed in probabilistic terms, we have

$$
L\left(\boldsymbol{\theta}|\mathcal{D}\right) = P\left(\mathcal{D}|\boldsymbol{\theta}\right).
$$

In words, this is saying that the likelihood of a set of parameter values, given some data, is the *same* as evaluating the probability of the data, assuming those parameter values are true. The key point here is that MLE is based on evaluating the *probability of the data*, given some values of the parameters[^bayesfoot]. We can think about it as taking a guess for the parameter values, then calculating how probable those values make the data we have collected. It is like asking the question: "how likely would it have been to collect the data that we have collected, if these were the parameter values?". By searching through lots of different possible combinations of parameter values, the aim is to find the specific combination that leads to the *highest probability* of the data. In other words, the values that *maximise the likelihood*. Within the context of simple regression, we could then write

$$
\left(\hat{\beta}_{0}, \hat{\beta}_{1}, \hat{\sigma}^{2}\right) = \text{arg max}\hspace{0.5em}  L\left(\beta_{0}, \beta_{1}, \sigma^{2}|\mathbf{y}\right),
$$

which is just saying that our parameter estimates are those values that make the output of the likelihood function as *big* as possible.

### The Likelihood Function
So how do we evaluate the likelihood function? In brief, if we want the likelihood of our entire dataset, we need to *multiply* the probability of each data value, using some guesses for the parameter values. We can then repeat this for some other guesses. If the likelihood gets *larger* then our new guesses make the data more probable than our old guesses. We will see shortly how to do this in a more principled way than just guessing, but this should give you enough of an idea at this point.

So how do we calculate the probabilities of each data point? Remember that our core assumption for simple regression is that

$$
y_{i} \sim \mathcal{N}\left(\beta_{0} + \beta_{1}x_{i}, \sigma^{2}\right).
$$

The probability of any value $x$ from a normal distribution with mean $\mu$ and variance $\sigma^{2}$ is given by its *probability density function*, which is

$$
f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}.
$$

Combining these two tells us that the probability of any particular datapoint can be calculated using

$$
P(y_{i}) = \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(y_{i} - (\beta_{0} + \beta_{1}x_{i}))^{2}}{2\sigma^{2}}}.
$$

So, if we insert some guesses for $\beta_{0}$, $\beta_{1}$ and $\sigma^{2}$, we can calculate the probability of any of our data values. This effectively tells us the probability of that datapoint, assuming our guesses are correct. The likelihood function then involves multiplying all these values together to give

$$
L(\beta_{0},\beta_{1},\sigma^{2}|\mathbf{y}) = P\left(y_{1}\right) \times P(y_{2}) \times \dots \times P(y_{n})
$$

which we can more compactly express using Big Pi notation (see the box below)

$$
\begin{align*}
    L(\beta_{0},\beta_{1},\sigma^{2}|\mathbf{y}) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(y_{i} - (\beta_{0} + \beta_{1}x_{i}))^{2}}{2\sigma^{2}}} \\
    &= \prod_{i=1}^{n} P(y_{i})
\end{align*}
$$
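As a rough sketch of how these formulas translate into code, the density can be written out by hand and then multiplied across datapoints. The three `x` and `y` values and the parameter guesses below are made up purely for illustration, and `normal.density` is a hypothetical helper name rather than anything built into `R`.

```R
# Toy data and parameter guesses (hypothetical values, for illustration only)
x      <- c(1, 2, 3)
y      <- c(2.1, 3.9, 6.2)
beta.0 <- 0
beta.1 <- 2
sigma2 <- 1

# Normal density written out term-by-term, as in the formula above
normal.density <- function(y, mu, sigma2) {
  1 / sqrt(2 * pi * sigma2) * exp(-(y - mu)^2 / (2 * sigma2))
}

# Density of each datapoint, given the guesses
P.y <- normal.density(y, mu = beta.0 + beta.1 * x, sigma2 = sigma2)

# Likelihood: the product over all datapoints
L <- prod(P.y)
print(L)

# The built-in dnorm() should agree with the hand-written version
print(all.equal(P.y, dnorm(y, mean = beta.0 + beta.1 * x, sd = sqrt(sigma2))))
```

In practice there is no reason to write the density by hand, as `dnorm()` does the same job. The point is only that nothing more than the formula above is involved.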

`````{admonition} Big Pi Notation
:class: tip
Big Pi notation, denoted by the capital Greek letter $\Pi$, is used as a shorthand for *multiplication*. Below the big Pi, we indicate the notation for our index across these multiplications, as well as our starting index value. Above the big Pi, we indicate the value where we stop. So the notation

$$
P = \prod_{i=1}^{3} y_{i}
$$

is equivalent to

$$
P = y_{1} \times y_{2} \times y_{3}.
$$

So, the notation says that our index is called $i$ and that we start it at $1$. We then keep going until $i = 3$, at which point we stop. 

This has a direct connection to a for-loop. So, in code, this is the same as shortening

```R
P <- y[1] * y[2] * y[3]
```

to

```R
P <- 1
for (i in 1:3){
    P <- P * y[i]
}
```
So, you can think of the Big Pi notation as a *multiplication loop* over a certain set of indices.
`````

### A Concrete Example in `R`
To get a more concrete sense of calculating the likelihood, let us use the `mtcars` data again. Furthermore, let us say that we have guessed that $\beta_{0} = 30$, $\beta_{1} = -5$ and $\sigma^{2} = 1$. If we are assuming that the data have come from a normal distribution, we can calculate the probability of the first value of `mpg` using

```R
beta.0 <- 30
beta.1 <- -5
sigma2 <- 1
mu     <- beta.0 + beta.1*mtcars$wt[1]
P.y1   <- dnorm(mtcars$mpg[1], mean=mu, sd=sqrt(sigma2))
print(P.y1)
```

```
[1] 8.926166e-05
```


where the function `dnorm` returns the *density* of the normal distribution for the given data. The probability of the second value of `mpg` would then be

```R
mu   <- beta.0 + beta.1*mtcars$wt[2]
P.y2 <- dnorm(mtcars$mpg[2], mean=mu, sd=sqrt(sigma2))
print(P.y2)
```

```
[1] 2.125155e-07
```


and so on. To find out the overall likelihood for the *whole* dataset, we just need to *multiply* these probabilities. However, this can cause computational problems[^likprobsfoot], so it is more usual to sum the *log* of these probabilities. This gives the *log-likelihood*.

```R
mu     <- beta.0 + beta.1*mtcars$wt
loglik <- sum(dnorm(mtcars$mpg, mean=mu, sd=sqrt(sigma2), log=TRUE))
print(loglik)
```

```
[1] -780.7884
```


This value is not particularly interpretable, but this does not matter. All we want to do is make it as *big* as possible. Because the value is negative, this means our aim is to make the log-likelihood as *positive* as possible (i.e. less negative).
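The computational problem with multiplying the raw probabilities can be seen directly. The sketch below reuses the first set of guesses ($\beta_{0} = 30$, $\beta_{1} = -5$, $\sigma^{2} = 1$) and compares the raw product with the sum of the logs. The variable names `P.y` and `lik` are just illustrative choices.

```R
# First set of guesses from the example above
beta.0 <- 30
beta.1 <- -5
sigma2 <- 1
mu     <- beta.0 + beta.1 * mtcars$wt

# Density of every datapoint under these guesses
P.y <- dnorm(mtcars$mpg, mean = mu, sd = sqrt(sigma2))

# Multiplying many tiny numbers can underflow to exactly 0 ...
lik <- prod(P.y)
print(lik)

# ... whereas summing the logs stays on a workable scale
loglik <- sum(dnorm(mtcars$mpg, mean = mu, sd = sqrt(sigma2), log = TRUE))
print(loglik)
```

Once the product has collapsed to 0, different sets of guesses can no longer be compared, whereas their log-likelihoods remain distinguishable.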

Let us inch closer to the results we got from least-squares earlier by setting $\beta_{0} = 35$, $\beta_{1} = -5$ and $\sigma^{2} = 1$. This gives

```R
beta.0 <- 35
beta.1 <- -5
sigma2 <- 1
mu     <- beta.0 + beta.1*mtcars$wt
loglik <- sum(dnorm(mtcars$mpg, mean=mu, sd=sqrt(sigma2), log=TRUE))
print(loglik)
```

```
[1] -192.4884
```


So, the log-likelihood is now *more positive*, meaning we are moving in the right direction. These parameter values make our data *more probable* than the values we had previously. As such, these estimates are, in some sense, *closer* to the true values. This is not necessarily the largest we can make the log-likelihood, but hopefully it is clear how this provides a metric indicating which combination of parameters makes our data the *most probable*. In principle, we could just continue with some guesses for the parameters until we cannot make the log-likelihood any bigger.
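As a rough sketch of what "continuing with some guesses" could look like, the code below evaluates the log-likelihood over a coarse grid of candidate intercepts and slopes, holding $\sigma^{2} = 1$ as in the examples above. The grid ranges and step sizes are arbitrary choices for illustration, and `loglik.fun` is a hypothetical helper name.

```R
x <- mtcars$wt
y <- mtcars$mpg

# Log-likelihood for one pair of guesses, holding sigma^2 fixed at 1
loglik.fun <- function(beta.0, beta.1, sigma2 = 1) {
  mu <- beta.0 + beta.1 * x
  sum(dnorm(y, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

# A coarse grid of candidate guesses (ranges chosen arbitrarily)
grid <- expand.grid(beta.0 = seq(30, 45, by = 0.5),
                    beta.1 = seq(-8, -2, by = 0.25))

# Evaluate the log-likelihood at every combination of guesses
grid$loglik <- mapply(loglik.fun, grid$beta.0, grid$beta.1)

# Keep the combination with the largest log-likelihood
best <- grid[which.max(grid$loglik), ]
print(best)
```

Even this coarse grid requires several hundred evaluations, and the answer can only ever be as precise as the grid spacing allows.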

### Exact Solutions
An obvious issue with the scheme above is that searching through many combinations of guesses for the parameters is not particularly efficient or practical. In fact, there is an infinite number of values we could choose, as well as an infinite number of combinations. So how can we possibly find the values we need? There are two options. For some simple problems, the equations that maximise the likelihood have already been worked out using the tools of calculus. As such, the equation for the likelihood can be solved to find those values that guarantee a maximum. Normal linear models are one such example. By assuming a Gaussian distribution for the outcome variable, the ML estimates for a simple regression model are

$$
\begin{align*}
    \hat{\beta}_{1}^{(\text{ML})} &= \frac{\sum{\left(x_{i} - \bar{x}\right)\left(y_{i} - \bar{y}\right)}}{\sum{\left(x_{i} - \bar{x}\right)^{2}}}\\
    \hat{\beta}_{0}^{(\text{ML})} &= \bar{y} - \hat{\beta}_{1}^{(\text{ML})}\bar{x},
\end{align*}
$$

which agree with what we saw earlier for OLS. As such, both MLE and OLS agree on the values of the intercept and slope. Because of this, we can conceptualise estimation of linear models either in terms of least-squares *or* in terms of the likelihood as, practically, the outcome is the same. Unfortunately, for more complex models, there are no exact solutions for maximising the likelihood. In these cases, we have to turn to computational methods in the form of *iterative* MLE.
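The closed-form expressions above can be evaluated directly in `R`. The following is a minimal sketch using the `mtcars` data, with `beta.0.hat` and `beta.1.hat` as illustrative variable names.

```R
x <- mtcars$wt
y <- mtcars$mpg

# Slope: sum of cross-products divided by the sum of squared deviations of x
beta.1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: mean of y minus the slope times the mean of x
beta.0.hat <- mean(y) - beta.1.hat * mean(x)

print(c(beta.0 = beta.0.hat, beta.1 = beta.1.hat))

# For comparison, the coefficients reported by lm()
print(coef(lm(mpg ~ wt, data = mtcars)))
```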

### Iterative MLE
For situations where it is not possible to find an exact solution, we rely on computer algorithms to search through many possible combinations of values to find which one *maximises* the log-likelihood. These are known more generally as *optimisation* algorithms and are a complex topic in numerical computing. For our purpose, we do not really need to understand how these algorithms work. All we really need to know is that they use rules and heuristics to explore the space of all possible parameter values in order to find values that the algorithm thinks make the log-likelihood the largest.

Within `R` the generic functions `optim()`, `nlm()` and `nlminb()` can all be used to do this. In the example below, we choose `nlm()` (*nonlinear minimisation*) for its general robustness for MLE problems. This function needs some starting guesses for the parameters, so in the example below we set $\hat{\beta}_{0} = \bar{y}$, $\hat{\beta}_{1} = 0$ and $\sigma = \text{SD}(y)$. As the name implies, this function *minimises*, so we return the *negative* log-likelihood instead[^minmaxfoot]. After running the iterative ML estimation, we get 

```R
set.seed(123)
x <- mtcars$wt
y <- mtcars$mpg

# Define negative log-likelihood
neg_loglik <- function(params) {
  beta.0 <- params[1]
  beta.1 <- params[2]
  sigma  <- params[3]

  if (sigma <= 0) return(1e10)

  mu     <- beta.0 + beta.1*x
  loglik <- sum(dnorm(y, mean=mu, sd=sigma, log=TRUE)) # log-likelihood
  return(-loglik)                                      # -ve log-likelihood
}

# Starting values (guesses for intercept, slope and SD)
init_params <- c(mean(y), 0, sd(y))

# Run optimisation
mle <- nlm(f=neg_loglik, p=init_params)

# Print results
mle_pars <- mle$estimate[1:2]
names(mle_pars) <- c("beta.0", "beta.1")
print(mle_pars)
```

```
   beta.0    beta.1 
37.285127 -5.344472 
```


which we can compare to the OLS results `R` gives us when using the `lm()` function

```R
print(coef(lm(mpg ~ wt, data=mtcars)))
```

```
(Intercept)          wt 
  37.285126   -5.344472 
```


As we can see, these are very close, showing how iterative MLE can be applied in many cases, even those where exact solutions exist. Although we do not need this for simple linear models, we will see later on the course how iterative MLE is necessary for Generalised Linear Models, Linear Mixed-effects Models and Generalised Linear Mixed-effects Models.
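For comparison, the same problem can be handed to `optim()`, one of the other generic optimisers mentioned above. This is only a sketch under some assumptions: `neg_loglik_logsd` is a hypothetical name, the `"BFGS"` method is one of several that `optim()` offers, and estimating $\log(\sigma)$ rather than $\sigma$ is a choice made here to keep the standard deviation positive without needing a guard.

```R
x <- mtcars$wt
y <- mtcars$mpg

# Negative log-likelihood, with the third parameter being log(sigma)
# (an assumption of this sketch, so that sigma = exp(...) is always positive)
neg_loglik_logsd <- function(params) {
  mu    <- params[1] + params[2] * x
  sigma <- exp(params[3])
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

# Starting guesses: mean of y, zero slope, and the SD of y on the log scale
init_params <- c(mean(y), 0, log(sd(y)))

# optim() minimises by default, so the negative log-likelihood is used again
fit <- optim(par = init_params, fn = neg_loglik_logsd, method = "BFGS",
             control = list(maxit = 500))

est <- fit$par[1:2]
names(est) <- c("beta.0", "beta.1")
print(est)
print(fit$convergence)  # 0 indicates that optim() reported convergence
```

Different optimisers take different routes through the parameter space, so small discrepancies in the later decimal places are to be expected.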

### Restricted Maximum Likelihood
In the example above, you may have noticed that we neglected to show the estimates for $\sigma^{2}$. This was not an accident...
 
Sometimes denoted REML or ReML, ...

How REML works is a bit complicated and beyond the scope of this lesson. In effect, REML estimates the residual variance by first *removing* the effects of the predictors from the data and then running ML on what is left. This adjustment allows the loss of degrees of freedom to be taken into account automatically. Although it might be tempting to think we can just use the residuals for this purpose, this will not work because the residuals are technically *not* i.i.d. Instead, the predictor effects have to be removed by other means that leave behind values that *are* i.i.d. This gets quite complicated, but should give you something of a flavour of how REML works.
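For the simple regression used here, the practical consequence of this adjustment can be illustrated without implementing REML itself. The sketch below assumes the standard result for normal linear models that the ML variance estimate divides the residual sum-of-squares by $n$, whereas the degrees-of-freedom-adjusted estimate divides by $n - 2$ (one degree of freedom lost per regression coefficient). It is a comparison of the two formulas, not an implementation of REML, and the variable names are illustrative.

```R
fit <- lm(mpg ~ wt, data = mtcars)
n   <- nrow(mtcars)
ssr <- sum(residuals(fit)^2)   # residual sum-of-squares

sigma2.ml  <- ssr / n          # ML estimate: ignores the estimated coefficients
sigma2.adj <- ssr / (n - 2)    # adjusted for the two estimated coefficients

print(c(ML = sigma2.ml, adjusted = sigma2.adj))

# The adjusted value should match the residual variance reported by lm()
print(summary(fit)$sigma^2)
```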



`````{admonition} Degrees of Freedom
:class: tip
...
`````

## The Estimated Model
So, at this point, we can think of our parameter estimates either as those that minimise the sum-of-squared errors or, more generically, as those values that maximise the likelihood (i.e. make the data most probable). As argued previously, the likelihood perspective is more generally applicable and so it is useful to think of the estimates in probabilistic terms. As such, if our data are truly drawn from a normal distribution then these parameter values make the data more likely than any other possible combination of values.


### The Final Estimates
So, using the `mtcars` example again, after using either OLS or MLE (with REML for the variance) we have:

- $\hat{\beta}_{0} = 37.2851$
- $\hat{\beta}_{1} = -5.3445$
- $\hat{\sigma}^{2} = 9.2774$

Meaning that our final regression model can be expressed as

$$
\text{MPG}_{i} = 37.2851 - 5.3445 \times \text{Weight}_{i} + e_{i}.
$$

The residuals can be calculated once values for $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ are available through a simple subtraction. If we take the *fitted* or *predicted* values of MPG to be

$$
\widehat{\text{MPG}}_{i} = 37.2851 - 5.3445 \times \text{Weight}_{i}
$$

then the residuals are simply

$$
e_{i} = \text{MPG}_{i} - \widehat{\text{MPG}}_{i}.
$$
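In `R`, this subtraction might look like the sketch below, which pulls the estimates from `lm()` rather than typing in the rounded values. The names `mpg.hat` and `e` are illustrative.

```R
fit <- lm(mpg ~ wt, data = mtcars)
b   <- unname(coef(fit))       # b[1] = intercept, b[2] = slope

# Fitted (predicted) values from the estimated intercept and slope
mpg.hat <- b[1] + b[2] * mtcars$wt

# Residuals: observed minus fitted
e <- mtcars$mpg - mpg.hat
print(head(e))

# These should match what fitted() and residuals() return
print(all.equal(mpg.hat, unname(fitted(fit))))
print(all.equal(e, unname(residuals(fit))))
```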

We can also express this in terms of our probabilistic model like so

$$
\text{MPG}_{i} \sim \mathcal{N}\left(37.2851 - 5.3445 \times \text{Weight}_{i}, 9.2774\right),
$$

though this starts to get a little difficult to parse. Nevertheless, the point is that these estimates now complete every unknown element of the model.


### Interpreting the Estimates
Although we have focussed mainly on *how* to get estimates using our data, in practical terms our main interest is in what the parameters actually mean.

However, there are still a lot of things we do not know ... Most of these questions will be answered via the process of *statistical inference*, which we will be covering next week.

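[^densityfoot]: This is the *height* of the normal curve at a given value. Strictly speaking, for a continuous variable a density is not itself a probability (probabilities correspond to *areas* under the curve), but it plays the same role when comparing how well different parameter values fit the data.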

[^likprobsfoot]: This is often a problem of just getting values of 0 due to issues with computational precision when working with many small probabilities. Taking logs not only changes the scale so that this does not happen, but it also turns *multiplication* into *summation*. Historically, this made calculating the likelihood much easier by hand. If you ever want to get back to the likelihood value, you can just undo the logs by using `exp(loglik)`.

[^bayesfoot]: This may seem like the wrong quantity. Surely, we are interested in $P(\boldsymbol{\theta}|\mathcal{D})$? In other words, finding the parameters that are most probable, given the data. Unfortunately, evaluating the probability $P(\boldsymbol{\theta}|\mathcal{D})$ requires Bayesian methods and so cannot be evaluated from a purely Frequentist perspective. This is why the likelihood has such an awkward definition.

[^minmaxfoot]: Making the *negative* log-likelihood *smaller* is the same as making the *positive* log-likelihood *bigger*. All we need to do to turn a maximisation problem into a minimisation problem is to multiply the value returned by the function by $-1$.