# Maximum Likelihood Estimation
In the previous section, we discussed estimation of the simple regression model very generally, establishing that we could use *either* MLE or OLS to arrive at the same results. As argued in that section, although OLS is conceptually simple, it is limited in its application to the more complex models that we will be encountering later in the course. As such, it is useful to focus primarily on Maximum Likelihood Estimation (MLE) as a general method that can be used to estimate parameters across a variety of models. In this section, we will dig into MLE in a bit more detail to give you a sense of how it works and to show how the results are identical to OLS.

## How Does MLE Work?
Before getting started on understanding MLE, it is worth highlighting that you will *never* need to actually implement this yourself. Software will always take care of all of this for you. As such, this section is just trying to help you gain intuition about how MLE works. Even if you do not understand every nuance, you can still happily use MLE in software as the estimation technique. So just try to gain some intuition here and do not worry about understanding every detail.

Fundamentally, MLE is based on finding parameter values that make a certain value, known as the *likelihood*, as *big* as possible. The value of likelihood is calculated using the *likelihood function*, denoted 

$$
L\left(\boldsymbol{\theta}|\mathcal{D}\right),
$$

where $\mathcal{D}$ represents the data we have collected and $\boldsymbol{\theta}$ is a generic representation of any set of parameters. For instance, in simple regression, $\boldsymbol{\theta} = \{\beta_{0},\beta_{1},\sigma^{2}\}$. In order to understand this, we can draw an equivalence between the likelihood and probability. When placed in probabilistic terms, we have

$$
L\left(\boldsymbol{\theta}|\mathcal{D}\right) = P\left(\mathcal{D}|\boldsymbol{\theta}\right).
$$

In words, this is saying that the likelihood of a set of parameter values, given some data, is the *same* as evaluating the probability of the data, assuming those parameter values are true. The key point here is that MLE is based on evaluating the *probability of the data*, given some values of the parameters[^bayesfoot]. We can think about it as taking a guess for the parameter values, then calculating how probable those values make the data we have collected. It is like asking the question: "how likely would it have been to collect the data that we have collected, if these were the parameter values?". By searching through lots of different possible combinations of parameter values, the aim is to find the specific combination that leads to the *highest probability* of the data. In other words, the values that *maximise the likelihood*. Within the context of simple regression, we could then write

$$
\left(\hat{\beta}_{0}, \hat{\beta}_{1}, \hat{\sigma}^{2}\right) = \text{arg max}\hspace{0.5em}  L\left(\beta_{0}, \beta_{1}, \sigma^{2}|\mathbf{y}\right),
$$

which is just saying that our parameter estimates are those values that make the output of the likelihood function $L\left(\beta_{0}, \beta_{1}, \sigma^{2}|\mathbf{y}\right)$ as *big* as possible.

### The Likelihood Function
So how do we evaluate the likelihood function? In brief, if we want the likelihood of our entire dataset, we need to *multiply* the probability of each data value, using some guesses for the parameter values. We can then repeat this for some other guesses. If the likelihood gets *larger* then our new guesses make the data more probable than our old guesses. We will see shortly how to do this in a more principled way than just guessing, but this should give you enough of an idea at this point.

So how do we calculate the probabilities of each data point? Remember that our core assumption for simple regression is that

$$
y_{i} \sim \mathcal{N}\left(\beta_{0} + \beta_{1}x_{i}, \sigma^{2}\right).
$$

The probability of any value from a normal distribution is given by its *probability density function*, which is

$$
f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}.
$$

Combining these two tells us that the probability of any particular datapoint can be calculated using

$$
P(y_{i}) = \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(y_{i} - (\beta_{0} + \beta_{1}x_{i}))^{2}}{2\sigma^{2}}}.
$$

So, if we insert some guesses for $\beta_{0}$, $\beta_{1}$ and $\sigma^{2}$, we can calculate the probability of any of our data values. This effectively tells us the probability of that datapoint, assuming our guesses are correct. The likelihood function then involves multiplying all these values together to give

$$
L(\beta_{0},\beta_{1},\sigma^{2}|\mathbf{y}) = \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(y_{1} - (\beta_{0} + \beta_{1}x_{1}))^{2}}{2\sigma^{2}}} \times \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(y_{2} - (\beta_{0} + \beta_{1}x_{2}))^{2}}{2\sigma^{2}}} \times \dots \times \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(y_{n} - (\beta_{0} + \beta_{1}x_{n}))^{2}}{2\sigma^{2}}},
$$

which we can more compactly express using Big Pi notation (see the box below)

$$
L(\beta_{0},\beta_{1},\sigma^{2}|\mathbf{y}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(y_{i} - (\beta_{0} + \beta_{1}x_{i}))^{2}}{2\sigma^{2}}}.
$$

`````{admonition} Big Pi Notation
:class: tip

So the notation

$$
P = \prod_{i=1}^{n} y_{i}
$$

is equivalent to

$$
P = y_{1} \times y_{2} \times y_{3} ,
$$

if we take $n=3$. In code, this is the same as shortening

```R
P <- y[1] * y[2] * y[3]
```

to

```R
n <- 3
P <- 1

for (i in 1:n){
    P <- P * y[i]
}
```
So, you can think of the Big Pi notation as a *multiplication loop* over a certain set of indices.

`````
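
The product form of the likelihood translates almost directly into `R`. The following is a minimal sketch rather than anything you would need in practice. The function name `likelihood` and its argument names are our own choices for illustration, the value $\sigma^{2} = 9$ is an arbitrary guess, and we assume the `mtcars` data with `mpg` as the outcome and `wt` as the predictor.

```R
# Likelihood of a simple regression model: the product of the normal
# densities of every observation, evaluated at some guessed parameter values.
# (The function and argument names here are illustrative choices.)
likelihood <- function(beta.0, beta.1, sigma2, x, y) {
  mu <- beta.0 + beta.1 * x                      # mean of each observation
  prod(dnorm(y, mean = mu, sd = sqrt(sigma2)))   # multiply the densities
}

# Compare two sets of guesses; sigma2 = 9 is an arbitrary illustrative value
lik.a <- likelihood(35, -5, 9, x = mtcars$wt, y = mtcars$mpg)
lik.b <- likelihood(37, -5, 9, x = mtcars$wt, y = mtcars$mpg)

print(c(lik.a, lik.b))
print(lik.b > lik.a)  # TRUE if the second set of guesses makes the data more probable
```

Whichever set of guesses returns the larger value is the one that makes the observed data more probable.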

### An Example in `R`
As an example, let us use the `mtcars` data again. Furthermore, let us say that we have guessed that $\beta_{0} = 30$, $\beta_{1} = -5$ and $\sigma^{2} = 1$. Do not worry too much about where these guesses have come from; they are just an example. If we are assuming that the data have come from a normal distribution, we can calculate the probability of the first value of `mpg` using

```R
beta.0 <- 30
beta.1 <- -5
mu     <- beta.0 + beta.1*mtcars$wt[1]
sigma2 <- 1
lik    <- dnorm(mtcars$mpg[1], mean=mu, sd=sqrt(sigma2))

print(lik)
```

```
[1] 8.926166e-05
```


where the function `dnorm` returns the *density* of the normal distribution for the given data. We can see how this agrees with manually calculating the density using the formula given above

```R
lik <- 1/sqrt(2*pi*sigma2) * exp(-((mtcars$mpg[1] - mu)^2/(2*sigma2)))

print(lik)
```

```
[1] 8.926166e-05
```




The probability of the second value of `mpg` would then be

```R
mu  <- beta.0 + beta.1*mtcars$wt[2]
lik <- dnorm(mtcars$mpg[2], mean=mu, sd=sqrt(sigma2))

print(lik)
```

```
[1] 2.125155e-07
```


and so on. To find out the overall likelihood for the *whole* dataset, we just need to *multiply* these probabilities. However, this can cause computational problems[^likprobsfoot], so it is more usual to sum the *log* of these probabilities to give the *log likelihood*.

```R
mu     <- beta.0 + beta.1*mtcars$wt
loglik <- sum(dnorm(mtcars$mpg, mean=mu, sd=sqrt(sigma2), log=TRUE))
print(loglik)
```

```
[1] -780.7884
```
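
To see the computational problem with multiplying the probabilities directly, the sketch below compares the raw product of the densities with the sum of their logs, using the same guesses. It is an illustration only, and the variable names `dens`, `lik.prod` and `loglik.sum` are our own.

```R
# Same guesses as above
beta.0 <- 30
beta.1 <- -5
sigma2 <- 1
mu     <- beta.0 + beta.1 * mtcars$wt

dens <- dnorm(mtcars$mpg, mean = mu, sd = sqrt(sigma2))  # density of each observation

lik.prod   <- prod(dens)      # raw likelihood: product of many tiny numbers
loglik.sum <- sum(log(dens))  # log-likelihood: sum of the logs

print(lik.prod)    # underflows to 0 in double precision
print(loglik.sum)  # remains a perfectly usable number
```

The product is too small to be represented and is rounded to 0, whereas the sum of the logs recovers the log-likelihood reported above. Using `log=TRUE` inside `dnorm` computes the log-density directly, which is numerically safer than taking `log()` of the density afterwards.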


The log-likelihood value of -780.7884 is not particularly interpretable, but this does not matter. All we want to do is make it as *big* as possible. Because this has returned a *negative* value, what we need to do is make it as *positive* as possible. So maybe we should try something else?

Let us inch closer to the results we got from least-squares earlier by setting $\beta_{0} = 35$, $\beta_{1} = -5$ and $\sigma^{2} = 1$. This gives

```R
beta.0 <- 35
beta.1 <- -5
mu     <- beta.0 + beta.1*mtcars$wt
sigma2 <- 1

loglik <- sum(dnorm(mtcars$mpg, mean=mu, sd=sqrt(sigma2), log=TRUE))
print(loglik)
```

```
[1] -192.4884
```


The log likelihood is now *more positive*, meaning we are moving in the right direction. Hopefully it is clear how calculating the log-likelihood gives us a metric of which combination of parameters makes our data the *most probable*. In principle, we could just continue with some guesses for the parameters until we cannot make the log-likelihood any bigger.
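
The idea of continuing with guesses can be sketched as a brute-force search over a grid of candidate values. This is only an illustration of the principle: the grid ranges and spacing are arbitrary choices of ours, the helper `loglik.fun` is a hypothetical name, and $\sigma^{2}$ is simply held at the guess of 1 used above.

```R
# Brute-force search over a grid of guesses (sigma2 held fixed at 1, as above)
grid <- expand.grid(
  beta.0 = seq(30, 45, by = 0.5),
  beta.1 = seq(-8, -2, by = 0.25)
)

# Log-likelihood for a single pair of guesses
loglik.fun <- function(beta.0, beta.1, sigma2 = 1) {
  mu <- beta.0 + beta.1 * mtcars$wt
  sum(dnorm(mtcars$mpg, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

# Evaluate every combination in the grid
grid$loglik <- mapply(loglik.fun, grid$beta.0, grid$beta.1)

# The combination with the largest log-likelihood
best <- grid[which.max(grid$loglik), ]
print(best)
```

The answer can only ever be as precise as the grid spacing, and the number of combinations grows very quickly as more parameters are added, so this approach does not scale well.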

### Optimisation Algorithms
An obvious issue with the scheme above is that searching through many combinations of guesses for the parameters is not particularly efficient or principled. In fact, there is an infinite number of values we could choose, as well as an infinite number of combinations. So how can we possibly find the values we need? 

In order to do so, we rely on computer algorithms to search through many possible combinations of values to find which one *maximises* the log-likelihood. These are known as *optimisation* algorithms and are a complex topic in numerical computing. For us, we do not really need to understand how these work. ...

```R
set.seed(123)
x <- mtcars$wt
y <- mtcars$mpg

# Define negative log-likelihood
neg_loglik <- function(params) {
  beta.0 <- params[1]
  beta.1 <- params[2]
  sigma  <- params[3]
  mu     <- beta.0 + beta.1*x

  if (sigma <= 0) return(Inf)  # The standard deviation cannot be negative

  loglik <- sum(dnorm(y, mean=mu, sd=sigma, log=TRUE)) # log-likelihood

  return(-loglik) # return negative log-likelihood
}

# Starting values (a guess based on plotting the data)
init_params <- c(37,-5,3)

# Run optimisation
mle <- optim(
  par     = init_params,
  fn      = neg_loglik,
  method  = "BFGS",
  control = list(reltol = 1e-12)
)

# Print results
mle_est <- mle$par[1:2]
names(mle_est) <- c("beta.0", "beta.1")
print(mle_est)
```

```
   beta.0    beta.1 
37.285126 -5.344472 
```


We can compare this to the results `R` gives us when using the `lm()` function

```R
print(coef(lm(mpg ~ wt, data=mtcars)))
```

```
(Intercept)          wt 
  37.285126   -5.344472 
```
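
As a further check, we can compare the maximised log-likelihood itself. The sketch below repeats the optimisation in a condensed form (same function, starting values and settings) so that it can be run on its own, then compares the result with the value returned by the `logLik()` function for the equivalent `lm()` fit. The names `loglik.max` and `loglik.lm` are our own.

```R
x <- mtcars$wt
y <- mtcars$mpg

# Same negative log-likelihood and optimiser settings as above
neg_loglik <- function(params) {
  mu    <- params[1] + params[2] * x
  sigma <- params[3]
  if (sigma <= 0) return(Inf)
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

mle <- optim(par = c(37, -5, 3), fn = neg_loglik, method = "BFGS",
             control = list(reltol = 1e-12))

# optim() minimised the NEGATIVE log-likelihood, so flip the sign back
loglik.max <- -mle$value

# The maximised log-likelihood that R reports for the equivalent lm() fit
loglik.lm <- as.numeric(logLik(lm(mpg ~ wt, data = mtcars)))

print(c(optim = loglik.max, lm = loglik.lm))
```

The two values should agree closely, and both should be much larger than the log-likelihoods of -780.7884 and -192.4884 that we obtained from our earlier guesses.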


## Exact Solutions
The application of ML given above is sometimes known as *iterative* ML. The qualifier highlights the fact that, in certain applications of ML, an iterative algorithm is unnecessary. This is because a single solution that maximises the likelihood function can be derived, meaning we can determine the equation that will give us the solution without the need to search around in the parameter space to find it. One example of this is linear models. If we assume a normal distribution for the outcome variable, we can work out an *exact* equation for those estimates that maximise the likelihood. Although the derivation of this is beyond the scope of this lesson, it turns out that the exact solution is *identical* to the OLS equations for the slope and intercept. As such, ML agrees with OLS and we can think of the estimating equations as either an application of OLS *or* as an application of ML.
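
As a quick illustration, the exact solution can be computed in a couple of lines without any searching. The sketch below assumes the usual textbook form of the OLS equations for simple regression (which may be written slightly differently elsewhere in these materials), and the names `beta.0.hat` and `beta.1.hat` are our own.

```R
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form estimates for simple regression (standard OLS results)
beta.1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta.0.hat <- mean(y) - beta.1.hat * mean(x)

print(c(beta.0 = beta.0.hat, beta.1 = beta.1.hat))
print(coef(lm(mpg ~ wt, data = mtcars)))
```

Both print statements should show the same intercept and slope, with no optimisation algorithm involved.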

## Restricted Maximum Likelihood
In the example above, you may have noticed that we neglected to show the estimates for $\sigma^{2}$. This was not an accident...
 
Sometimes denoted REML or ReML, ...

How REML works is a bit complicated and beyond the scope of this lesson. In effect, REML estimates the residual variance by first *removing* the effects of the predictors from the data and then running ML on what is left. This adjustment allows the loss of degrees of freedom to be automatically taken into account. Although it might be tempting to think we can just use the residuals for this purpose, because the residuals are technically *not* i.i.d., this will not work. Instead, the predictor effects have to be removed in a way that leaves behind values that *are* i.i.d. This gets quite complicated, but should give you something of a flavour of how REML works.

```R
set.seed(123)
y <- mtcars$mpg
X <- model.matrix(~ wt, data=mtcars)

# Define REML negative log-likelihood
reml_neg_loglik <- function(sigma, X, y) {
  
  # Remove the effects of the predictors from the data
  # THIS IS COMPLICATED, DO NOT WORRY ABOUT IT!!!!
  qr_X <- qr(X)
  Q    <- qr.Q(qr_X, complete = TRUE)
  n    <- nrow(X)
  p    <- ncol(X)
  K    <- Q[, (p + 1):n]  # The n - p columns of Q that are orthogonal to every column of X
  err  <- t(K) %*% y      # This is y after the effects of the predictors are removed

  if (sigma <= 0) return(Inf)  # The standard deviation cannot be negative

  loglik <- sum(dnorm(err, mean=0, sd=sigma, log=TRUE)) # log-likelihood

  return(-loglik)  # return negative log-likelihood
}

# Starting values (a guess based on the data)
init_sigma <- sd(y)

# Run optimisation
reml <- optim(
  par     = init_sigma,
  fn      = reml_neg_loglik,
  X       = X,
  y       = y,
  method  = "BFGS",
  control = list(reltol = 1e-12),
  hessian = TRUE
)

# Print results (top is REML, bottom is OLS)
print(reml$par)
print(summary(lm(mpg ~ wt, data=mtcars))$sigma)
```

```
[1] 3.045882
[1] 3.045882
```
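
For the normal linear model, both variance estimates also have well-known closed forms, which gives a simple cross-check. The sketch below relies on standard results rather than anything derived in this lesson: the ML estimate divides the residual sum-of-squares by $n$, whereas the REML estimate divides by $n - p$, where $p$ is the number of regression coefficients. The variable names are our own.

```R
fit <- lm(mpg ~ wt, data = mtcars)
rss <- sum(resid(fit)^2)      # residual sum of squares
n   <- nrow(mtcars)           # number of observations
p   <- length(coef(fit))      # number of regression coefficients (intercept + slope)

sigma.ml   <- sqrt(rss / n)        # ML: divides by n
sigma.reml <- sqrt(rss / (n - p))  # REML: divides by n - p

print(c(ML = sigma.ml, REML = sigma.reml))
```

The REML value should match the estimate reported by `lm()`, while the ML value is slightly smaller because it ignores the degrees of freedom used up by estimating the intercept and slope.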




`````{admonition} Degrees of Freedom
:class: tip
...
`````

## The Estimated Model
So, at this point, we can think of our parameter estimates either as those that minimise the sum-of-squared errors or, more generically, as those values that make the data most probable. As argued previously, the likelihood perspective is more generally applicable and so it is useful to think of the estimates in probabilistic terms. As such, if our data are truly drawn from a normal distribution then these parameter values make the data more likely than any other possible combination of values.



### The Final Estimates
So, using the `mtcars` example again, after running either OLS or MLE (with REML for the variance) we have:

- $\hat{\beta}_{0} =$
- $\hat{\beta}_{1} =$
- $\hat{\sigma}^{2} =$

Meaning that our final regression model can be expressed as

$$
\text{MPG}_{i} = ... + (... \times \text{Weight}_{i}) + e_{i}.
$$

The errors can be calculated once values for $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ are available through a simple subtraction

$$
e_{i} = \text{MPG}_{i} - [... + (... \times \text{Weight}_{i})].
$$

This then completes every unknown element of the model.
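
As a brief sketch of this subtraction in `R`, the errors can be computed from the fitted coefficients and checked against the residuals that `lm()` stores. Here the coefficients are simply pulled from the `lm()` fit, and the variable name `e` is our own.

```R
fit    <- lm(mpg ~ wt, data = mtcars)
beta.0 <- coef(fit)[1]
beta.1 <- coef(fit)[2]

# Errors: observed MPG minus the value predicted by the fitted line
e <- mtcars$mpg - (beta.0 + beta.1 * mtcars$wt)

print(head(e))
print(all.equal(unname(e), unname(resid(fit))))
```

The final line should print `TRUE`, confirming that the manual subtraction reproduces the residuals from the fitted model.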


### Interpreting the Estimates
Although we have focussed mainly on *how* to get estimates using our data, in practical terms our main interest is in what the parameters actually mean.

However, there are still a lot of things we do not know ... Most of these questions will be answered via the process of *statistical inference*, which we will be covering next week.

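[^densityfoot]: This is the *height* of the normal curve at a specific value. Strictly speaking, probabilities for a continuous variable correspond to *areas* under the curve, but for our purposes the density can be treated as the probability of a specific value.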

[^likprobsfoot]: This is often a problem of just getting values of 0 due to issues with computational precision when working with many small probabilities. Taking logs not only changes the scale so that this does not happen, but it also turns *multiplication* into *summation*. Historically, this made calculating the likelihood much easier by hand. If you ever want to get back to the likelihood value, you can just undo the logs by using `exp(loglik)`, although for a very negative log-likelihood the result will again be rounded to 0.

[^bayesfoot]: This may seem like the wrong quantity. Surely, we are interested in $P(\boldsymbol{\theta}|\mathcal{D})$? In other words, finding the parameters that are most probable, given the data. Unfortunately, evaluating the probability $P(\boldsymbol{\theta}|\mathcal{D})$ requires Bayesian methods and so cannot be evaluated from a purely Frequentist perspective. This is why the likelihood has such an awkward definition.