# Maximum Likelihood Estimation
Now that we have established some new notation and terminology, we can turn to the main topic of this section.

### How Does MLE Work?
...

$$
L\left(\boldsymbol{\theta}|\mathcal{D}\right) = P\left(\mathcal{D}|\boldsymbol{\theta}\right),
$$

where $\mathcal{D}$ represents the data we have collected and $\boldsymbol{\theta}$ is a generic representation of any set of parameters. For instance, in simple regression, $\boldsymbol{\theta} = \{\beta_{0},\beta_{1},\sigma^{2}\}$. So the quantity of interest is the probability of the data given some values of the parameters[^bayesfoot].

The key point to understand about ML is that it is based on evaluating *the probability of the data*, given some values of the parameters. So we can think about it as taking a guess for the parameter values, then calculating how probable those values make the data we have collected. It is like asking the question: "how likely would it have been to collect the data that we have collected, if these were the parameter values?". By searching through lots of different possible combinations of parameter values, the aim is to find the specific combination that leads to the *highest probability* of the data. Formally, we can write

$$
\left(\hat{\beta}_{0}, \hat{\beta}_{1}, \hat{\sigma}^{2}\right) = \text{arg max}\hspace{0.5em}  L\left(\beta_{0}, \beta_{1}, \sigma^{2}|\mathbf{y}\right),
$$

which is just saying that our parameter estimates are those values that make the output of the likelihood function $L\left(\beta_{0}, \beta_{1}, \sigma^{2}|\mathbf{y}\right)$ as *big* as possible.

As an example, let us use the `mtcars` data again. Furthermore, let us say that we have guessed that $\beta_{0} = 30$, $\beta_{1} = -5$ and $\sigma^{2} = 1$. Do not worry too much about where these guesses have come from; they are just an example. If we assume that the data have come from a normal distribution, we can calculate the probability of the first value of `mpg` using

In [1]:
beta.0 <- 30
beta.1 <- -5
mu     <- beta.0 + beta.1*mtcars$wt[1]
sigma2 <- 1
lik    <- dnorm(mtcars$mpg[1], mean=mu, sd=sqrt(sigma2))

print(lik)


[1] 8.926166e-05


where the function `dnorm` returns the *density* of the normal distribution for the given data[^densityfoot]. The probability of the second value of `mpg` would then be

In [2]:
mu  <- beta.0 + beta.1*mtcars$wt[2]
lik <- dnorm(mtcars$mpg[2], mean=mu, sd=sqrt(sigma2))

print(lik)

[1] 2.125155e-07


and so on. To find out the overall likelihood for the *whole* dataset, we just need to *multiply* these probabilities. However, this can cause computational problems[^likprobsfoot], so it is more usual to sum the *log* of these probabilities to give the *log-likelihood*.
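As a rough illustration of the computational problem, the sketch below multiplies the densities directly, using the same guesses as above. The names `dens` and `lik_product` are illustrative only and do not appear elsewhere in this lesson.

```r
# Same guesses as in the example above
beta.0 <- 30
beta.1 <- -5
sigma2 <- 1
mu     <- beta.0 + beta.1*mtcars$wt

# Density of every mpg value (each one is small, but still above 0)
dens <- dnorm(mtcars$mpg, mean=mu, sd=sqrt(sigma2))

# Multiply all 32 densities together to get the likelihood
lik_product <- prod(dens)

print(lik_product)       # the product is too small to be stored
print(log(lik_product))  # so taking the log afterwards does not help
```

With these guesses the product is smaller than the smallest number a computer can store in double precision, so it is returned as exactly 0 and its log is `-Inf`. Summing the logs of the individual densities avoids this.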

In [3]:
mu     <- beta.0 + beta.1*mtcars$wt
loglik <- sum(dnorm(mtcars$mpg, mean=mu, sd=sqrt(sigma2), log=TRUE))
print(loglik)

[1] -780.7884


This value is not particularly interpretable, but this does not matter. All we want to do is make it as *big* as possible. Because this has returned a *negative* value, what we need to do is make it as *positive* as possible (that is, less negative). So maybe we should try something else?

Let us inch closer to the results we got from least-squares earlier by setting $\beta_{0} = 35$, $\beta_{1} = -5$ and $\sigma^{2} = 1$. This gives

In [4]:
beta.0 <- 35
beta.1 <- -5
mu     <- beta.0 + beta.1*mtcars$wt
sigma2 <- 1

loglik <- sum(dnorm(mtcars$mpg, mean=mu, sd=sqrt(sigma2), log=TRUE))
print(loglik)

[1] -192.4884


So, the log-likelihood is now *more positive*, meaning we are moving in the right direction. Hopefully it is clear how calculating the log-likelihood gives us a metric of which combination of parameters makes our data the *most probable*. Importantly, this depends upon the assumed distribution of the data, so we can start to see how the assumptions we made when defining our model are being used. It also depends upon the data we have collected, so the assumption here is that our sample is *representative*, which is what allows the estimates to be close to the population values.
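To make the role of the distributional assumption explicit, the log-likelihood calculated above can be written out for simple regression with normally distributed errors. This is a sketch in which $x_{i}$ and $y_{i}$ denote the $i$th values of `wt` and `mpg`, and $n$ is the sample size (this notation is assumed here for illustration). The likelihood is the product of the individual normal densities

$$
L\left(\beta_{0}, \beta_{1}, \sigma^{2}|\mathbf{y}\right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left(-\frac{\left(y_{i} - \beta_{0} - \beta_{1}x_{i}\right)^{2}}{2\sigma^{2}}\right),
$$

and taking logs turns the product into a sum

$$
\log L\left(\beta_{0}, \beta_{1}, \sigma^{2}|\mathbf{y}\right) = -\frac{n}{2}\log\left(2\pi\sigma^{2}\right) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left(y_{i} - \beta_{0} - \beta_{1}x_{i}\right)^{2}.
$$

For a fixed value of $\sigma^{2}$, the only part that depends on $\beta_{0}$ and $\beta_{1}$ is the sum of squared errors, so making the log-likelihood *bigger* amounts to making the sum of squared errors *smaller*.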

### Optimisation Algorithms
Of course, searching through many combinations of guesses for the parameters is not particularly efficient or principled. In order to do this sensibly, we rely on computer algorithms to search through many possible combinations of values to find which one *maximises* the log-likelihood. These are known as *optimisation* algorithms and are a complex topic in numerical computing. For us, we do not really need to understand how these work. ...

In [None]:
set.seed(123)
x <- mtcars$wt
y <- mtcars$mpg

# Define negative log-likelihood
neg_loglik <- function(params) {
  beta.0 <- params[1]
  beta.1 <- params[2]
  sigma  <- params[3]
  mu     <- beta.0 + beta.1*x

  if (sigma <= 0) return(Inf)  # The standard deviation must be positive

  loglik <- sum(dnorm(y, mean=mu, sd=sigma, log=TRUE)) # log-likelihood

  return(-loglik) # return negative log-likelihood
}

# Starting values (a guess based on plotting the data)
init_params <- c(37,-5,3)

# Run optimisation
mle <- optim(
  par     = init_params,
  fn      = neg_loglik,
  method  = "BFGS",
  control = list(reltol = 1e-12),
  hessian = TRUE
)

# Print results
mle_est <- mle$par[1:2]
names(mle_est) <- c("beta.0", "beta.1")
print(mle_est)


   beta.0    beta.1 
37.285126 -5.344472 


We can compare these to the results `R` gives us when using the `lm()` function

In [10]:
print(coef(lm(mpg ~ wt, data=mtcars)))

(Intercept)          wt 
  37.285126   -5.344472 
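For contrast with an optimisation algorithm, the idea of simply trying many combinations of guesses can be sketched as a coarse grid search. This is purely illustrative: the grid ranges, the step size of 0.5 and the fixed value `sigma = 3` are arbitrary assumptions, and the names `grid` and `best` are hypothetical.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Every combination of candidate intercepts and slopes (assumed ranges)
grid <- expand.grid(
  beta.0 = seq(30, 40, by=0.5),
  beta.1 = seq(-8, -2, by=0.5)
)

# Hold the standard deviation fixed to keep the search two-dimensional
sigma <- 3

# Log-likelihood of the data at each combination of guesses
grid$loglik <- mapply(
  function(b0, b1) sum(dnorm(y, mean=b0 + b1*x, sd=sigma, log=TRUE)),
  grid$beta.0, grid$beta.1
)

# The combination with the largest log-likelihood
best <- grid[which.max(grid$loglik), ]
print(best)
```

The best grid point can only ever be as accurate as the spacing of the grid allows, and the number of combinations grows quickly as parameters are added, which is why a proper optimisation algorithm is preferred in practice.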


### Exact Solutions
The application of ML given above is sometimes known as *iterative* ML, highlighting the fact that in certain applications of ML an iterative algorithm is unnecessary. This is because a single solution that maximises the likelihood function can be derived, meaning we can determine the equation that gives us the solution without needing to search around the parameter space to find it. One example of this is linear models. If we assume a normal distribution for the outcome variable, we can work out an *exact* equation for the estimates that maximise the likelihood. Although the derivation of this is beyond the scope of this lesson, it turns out that the exact solution is *identical* to the OLS equations for the slope and intercept. As such, ML agrees with OLS and we can think of the estimating equations as either an application of OLS *or* as an application of ML.
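As a quick check of this agreement, the sketch below evaluates the standard closed-form equations for the slope and intercept in simple regression and compares them with `lm()`. The variable names `beta.1.hat` and `beta.0.hat` are illustrative, and the equations are written in one common form that may differ in appearance from how they were presented earlier.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Slope: sum of cross-products divided by the sum of squares of x
beta.1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the fitted line through the point (mean(x), mean(y))
beta.0.hat <- mean(y) - beta.1.hat * mean(x)

# Closed-form estimates alongside those from lm()
print(c(beta.0=beta.0.hat, beta.1=beta.1.hat))
print(coef(lm(mpg ~ wt, data=mtcars)))
```

No starting values or searching are needed here; the estimates come directly from the data in a single calculation.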

### Restricted Maximum Likelihood
In the example above, you may have noticed that we neglected to show the estimates for $\sigma^{2}$. This was not an accident...
 
Sometimes denoted REML or ReML, ...

How REML works is a bit complicated and beyond the scope of this lesson. In effect, REML estimates the residual variance by first *adjusting* the data for the effects of the predictors and then running ML on what is left. This adjustment allows the loss of degrees of freedom to be taken into account automatically. Although it might be tempting to think we can just use the residuals for this purpose, this will not work because the residuals are technically *not* i.i.d. Instead, the predictor effects have to be removed in a way that leaves quantities that *are* i.i.d. This gets quite complicated, but should give you something of a flavour of how REML works.
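For those who are curious, the claim about the residuals can be checked numerically. The sketch below is an illustration under the usual linear model assumptions, not part of the REML procedure itself, and the names `H`, `M` and `K` are chosen here for convenience. It uses the fact that the covariance of the residuals is proportional to the matrix `M`, and compares this with the transformation built from the QR decomposition.

```r
X <- model.matrix(~ wt, data=mtcars)
n <- nrow(X)
p <- ncol(X)

# Covariance of the residuals is sigma^2 * M, where M = I - H
H <- X %*% solve(crossprod(X)) %*% t(X)  # hat matrix
M <- diag(n) - H

# Non-zero off-diagonals mean the residuals are correlated,
# and an unequal diagonal means their variances differ
print(max(abs(M[upper.tri(M)])))
print(range(diag(M)))

# Columns of Q that are orthogonal to the predictors
K <- qr.Q(qr(X), complete=TRUE)[, (p + 1):n]

# t(K) %*% X is (numerically) zero: the predictor effects are removed
print(max(abs(crossprod(K, X))))

# t(K) %*% K is the identity: the transformed data stay uncorrelated
# with equal variance
print(max(abs(crossprod(K) - diag(n - p))))
```

Note that the transformed data contain only `n - p` values rather than `n`, which is how the loss of degrees of freedom enters the calculation.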

In [None]:
set.seed(123)
y <- mtcars$mpg
X <- model.matrix(~ wt, data=mtcars)

# Define REML negative log-likelihood
reml_neg_loglik <- function(sigma, X, y) {
  
  # Remove the effects of the predictors from the data
  # THIS IS COMPLICATED, DO NOT WORRY ABOUT IT!!!!
  qr_X <- qr(X)
  Q    <- qr.Q(qr_X, complete = TRUE)
  n    <- nrow(X)
  p    <- ncol(X)
  K    <- Q[, (p + 1):n]
  err  <- t(K) %*% y  # This is y after the effects of the predictors are removed

  if (sigma <= 0) return(Inf)  # The standard deviation cannot be -ve

  loglik <- sum(dnorm(err, mean=0, sd=sigma, log=TRUE)) # log-likelihood

  return(-loglik)  # return negative log-likelihood
}

# Starting values (a guess based on the data)
init_sigma <- sd(y)

# Run optimisation
reml <- optim(
  par     = init_sigma,
  fn      = reml_neg_loglik,
  X       = X,
  y       = y,
  method  = "BFGS",
  control = list(reltol = 1e-12),
  hessian = TRUE
)

# Print results (top is REML, bottom is OLS)
print(reml$par)
print(summary(lm(mpg ~ wt, data=mtcars))$sigma)


[1] 3.045882
[1] 3.045882
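For this simple model, both variance estimates are also available in closed form, which gives a way to check the result without an optimiser. The sketch below relies on the standard result that ML divides the residual sum of squares by $n$, whereas REML (like OLS) divides by $n - p$. The names `rss`, `sigma_ml` and `sigma_reml` are illustrative.

```r
fit <- lm(mpg ~ wt, data=mtcars)

rss <- sum(residuals(fit)^2)  # residual sum of squares
n   <- nrow(mtcars)           # number of observations
p   <- length(coef(fit))      # number of regression coefficients

sigma_ml   <- sqrt(rss / n)        # ML: ignores the estimated coefficients
sigma_reml <- sqrt(rss / (n - p))  # REML: accounts for the lost degrees of freedom

print(c(ML=sigma_ml, REML=sigma_reml))
```

The REML value should match the two numbers printed above, while the ML value is slightly smaller because it divides by a larger number.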




`````{admonition} Degrees of Freedom
:class: tip
...
`````

## The Final Estimated Model

### Interpreting the Estimates


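[^densityfoot]: Strictly speaking, the density is the *height* of the normal curve at a given value, not a probability. For a continuous variable, the probability of any single exact value is 0, and probabilities come from the *area* under the curve across a range of values. For the purpose of calculating the likelihood, however, the density plays the role of the *probability* of a specific value.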

[^likprobsfoot]: This is often a problem of just getting values of 0 due to issues with computational precision when working with many small probabilities. Taking logs not only changes the scale so that this does not happen, but it also turns *multiplication* into *summation*. Historically, this made calculating the likelihood much easier by hand. If you ever want to get back to the likelihood value, you can undo the logs by using `exp(loglik)`, although for a very negative log-likelihood the result will itself be rounded to 0.

[^bayesfoot]: This may seem like the wrong quantity. Surely, we are interested in $P(\boldsymbol{\theta}|\mathcal{D})$? In other words, finding the parameters that are most probable, given the data. Unfortunately, evaluating the probability $P(\boldsymbol{\theta}|\mathcal{D})$ requires Bayesian methods. Because Fisher hated Bayesian statistics, he was determined to find methods of estimation that did not require Bayes' theorem. Hence, the likelihood was adopted as a method that could be used from a purely Frequentist perspective. We will see more about this later on in the course.