# Maximum Likelihood Estimation

This notebook illustrates maximum likelihood estimation and how to calculate different standard errors (from the information matrix, the gradients and the "sandwich" approach).

The application is (on purpose) very simple: estimating the mean and variance of a random variable. Some of the subsequent notebooks estimate more complicated models, for instance, GARCH models.

## Load Packages and Extra Functions

In [1]:
using Printf, LinearAlgebra, DelimitedFiles, Statistics, Optim, ForwardDiff

include("jlFiles/printmat.jl")

printyellow (generic function with 1 method)

## Loading Data

In [2]:
xx  = readdlm("Data/FFdSizePs.csv",',',skipstart=1)
x   = xx[:,2]                 #returns for the smallest size portfolio
xx  = nothing

## Traditional Estimates

This section computes the traditional (sample moment) estimates of the mean $\mu$ and the variance $\sigma^2$.

To compare with the MLE, we use $1/T$ in the variance estimation, not $1/(T-1)$, by using the `corrected=false` option.

In [3]:
T = length(x)

(μ_trad,σ²_trad) = (mean(x),var(x,corrected=false))

std_trad = sqrt.([σ²_trad,2*σ²_trad^2]/T)   #standard errors, textbook formulas

printblue("Traditional estimates and their std:\n")
xx = [[μ_trad,σ²_trad] std_trad]
printmat(xx,colNames=["estimate","std"],rowNames=["μ","σ²"])

[34m[1mTraditional estimates and their std:[22m[39m

    estimate       std
μ      0.042     0.010
σ²     0.840     0.013

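For reference, the textbook formulas behind `std_trad` are the standard results for an iid $N(\mu,\sigma^2)$ sample. They are stated here as a reminder, not derived in this notebook:

$
\text{Std}(\hat{\mu})=\sqrt{\sigma^{2}/T} \: \text{ and } \: \text{Std}(\hat{\sigma}^{2})=\sqrt{2\sigma^{4}/T},
$

where $\sigma^2$ is replaced by its estimate.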


# Point Estimates from ML

The next few cells define a log likelihood function and estimate the coefficients by maximizing it.

## The (log) Likelihood Function for Estimating the Parameters of a $N(\mu,\sigma^2)$

In [4]:
function NormalLL(par,x)      #par are the parameters, x is the data
    (μ,σ²) = par
    LLt    = -(1/2)*log(2*pi) - (1/2)*log(σ²) .- (1/2)*(x.-μ).^2/σ²  #vector, all x[t]
    LL     = sum(LLt)
    return LL, LLt
end

NormalLL (generic function with 1 method)

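Written out, the function above evaluates the log of the $N(\mu,\sigma^2)$ density at each observation $x_t$. This only restates the code line for `LLt`; the $t$ subscript follows the $\ln L_t$ notation used later in the notebook:

$
\ln L_t = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^{2} - \frac{1}{2}\frac{(x_t-\mu)^{2}}{\sigma^{2}},
$

and `LL` is the sum $\sum_{t=1}^{T}\ln L_t$.
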
## Try the Likelihood Function

In [5]:
par0 = [0.0,1.0]                #initial parameter guess

(LL,LLt) = NormalLL(par0,x)     #just trying the log likelihood fn

printlnPs("log likelihood value at par0: ",LL)

log likelihood value at par0: -11155.385


## Optimize the Likelihood Function

In [6]:
Sol = optimize(par->-NormalLL(par,x)[1],par0)  #minimize -LL

parHat = Optim.minimizer(Sol)                 #the optimal solution 

printlnPs("log-likelihood at point estimate: ",-Optim.minimum(Sol))

printblue("\nParameter estimates:\n")
xx = [[μ_trad,σ²_trad] parHat]
printmat(xx,colNames=["traditional","MLE"],rowNames=["μ","σ²"],width=13)

log-likelihood at point estimate: -11088.409

[34m[1mParameter estimates:[22m[39m

    traditional          MLE
μ         0.042        0.042
σ²        0.840        0.840

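The two columns coincide because, for this model, the first-order conditions of the log likelihood can be solved by hand. A short derivation, using the expression for $\ln L_t$ coded in `NormalLL`:

$
\sum_{t=1}^{T}\frac{\partial\ln L_t}{\partial\mu}=\sum_{t=1}^{T}\frac{x_t-\mu}{\sigma^{2}}=0
\: \text{ gives } \: \hat{\mu}=\frac{1}{T}\sum_{t=1}^{T}x_t,
$

$
\sum_{t=1}^{T}\frac{\partial\ln L_t}{\partial\sigma^{2}}=\sum_{t=1}^{T}\left[-\frac{1}{2\sigma^{2}}+\frac{(x_t-\mu)^{2}}{2\sigma^{4}}\right]=0
\: \text{ gives } \: \hat{\sigma}^{2}=\frac{1}{T}\sum_{t=1}^{T}(x_t-\hat{\mu})^{2}.
$

The numerical optimization is therefore not strictly needed in this application; it mainly illustrates the approach for models without closed-form solutions.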


# Standard Errors I: Information Matrix 

If the likelihood function is correctly specified, then the MLE is typically asymptotically normally distributed as

$
\sqrt{T}(\hat{\theta}-\theta)  \rightarrow^{d}N(0,V) \: \text{, where } \: V=I(\theta)^{-1}\text{ with }
$

$
I(\theta) =-\text{E}\frac{\partial^{2}\ln L_t}{\partial\theta\partial\theta^{\prime}}
$

where $I(\theta)$ is the information matrix and $\ln L_t$  is the contribution of period $t$ to the likelihood function.

The code below calculates the derivatives by automatic differentiation (`ForwardDiff`). It relies on the fact that $
\text{E}\frac{\partial^{2}\ln L_t}{\partial\theta\partial\theta^{\prime}} = 
\frac{\partial^{2}\text{E}\ln L_t}{\partial\theta\partial\theta^{\prime}} 
$,
so we can differentiate the mean (across data points) log likelihood.

In [7]:
Ia = -ForwardDiff.hessian(par->mean(NormalLL(par,x)[2]),parHat)  #derivative of mean(LLt)

Ia       = (Ia+Ia')/2         #to guarantee symmetry, fixes rounding errors
vcv      = inv(Ia)/T
std_hess = sqrt.(diag(vcv))

printblue("standard errors:\n")
xx = [std_trad std_hess]
printmat(xx,colNames=["traditional","MLE (InfoMat)"],rowNames=["μ","σ²"],width=18)

[34m[1mstandard errors:[22m[39m

         traditional     MLE (InfoMat)
μ              0.010             0.010
σ²             0.013             0.013

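For the $N(\mu,\sigma^2)$ model, the information matrix can also be derived analytically, which suggests why the two columns agree. A sketch, taking second derivatives of $\ln L_t$ and using $\text{E}(x_t-\mu)=0$ and $\text{E}(x_t-\mu)^2=\sigma^2$:

$
I(\theta)=\begin{bmatrix} 1/\sigma^{2} & 0\\ 0 & 1/(2\sigma^{4}) \end{bmatrix}
\: \text{, so } \:
V=I(\theta)^{-1}=\begin{bmatrix} \sigma^{2} & 0\\ 0 & 2\sigma^{4} \end{bmatrix}.
$

Dividing by $T$ and taking square roots of the diagonal elements gives the textbook formulas used in `std_trad`.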


# Standard Errors II: Gradients and Sandwich

Alternatively, we can use the outer product of the gradients to calculate the
information matrix as

$
J(\theta)=\text{E}\left[  \frac{\partial\ln L_t}{\partial\theta
}\frac{\partial\ln L_t}{\partial\theta^{\prime}}\right]
$

The code below fills row $t$ of a $T\times 2$ matrix (called `δL`) with 
$
\frac{\partial\ln L_t}{\partial\theta}.
$
For each $t$, the outer product is a $2\times2$ matrix, and then we average (each element) across $t$. This is done by calculating 
`J = δL'δL/T`,
which is the same as 
```
J = zeros(2,2)
for t = 1:T
    J = J + δL[t,:]*δL[t,:]'/T
end
```
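The equivalence can be checked on a small example. The sketch below uses a random matrix as a stand-in for `δL` (the names `J1` and `J2` are illustrative, not from the notebook) and updates the matrix in place with `.+=` so that it also runs outside a notebook:

```julia
using LinearAlgebra, Random

Random.seed!(1)
T  = 5
δL = randn(T,2)                    #stand-in for the Tx2 matrix of gradients

J1 = δL'δL/T                       #matrix version

J2 = zeros(2,2)                    #loop version
for t = 1:T
    J2 .+= δL[t,:]*δL[t,:]'/T      #in-place update of the 2x2 matrix
end

println(J1 ≈ J2)                   #compares the two versions
```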

We could also use the "sandwich" estimator

$
V=I(\theta)^{-1}J(\theta)I(\theta)^{-1}.
$

When data is *not* iid $N(\mu,\sigma^2)$, the three variance-covariance matrices ($I(\theta)^{-1}$, $J(\theta)^{-1}$ and the sandwich) may differ, and the sandwich approach is often the most robust.

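To see why the three estimates can differ, consider the $\sigma^2$ element only. The sketch below is not part of the original notebook: it assumes a symmetric distribution (so the off-diagonal elements of $J(\theta)$ are zero) and introduces $\kappa=\text{E}(x_t-\mu)^4/\sigma^4$ as notation for the kurtosis. Squaring the gradient with respect to $\sigma^2$ and taking expectations gives

$
J_{22}=\text{E}\left[\frac{(x_t-\mu)^{2}-\sigma^{2}}{2\sigma^{4}}\right]^{2}=\frac{\kappa-1}{4\sigma^{4}},
$

so the variances of $\sqrt{T}(\hat{\sigma}^2-\sigma^2)$ implied by the three approaches are

$
I_{22}^{-1}=2\sigma^{4},\quad J_{22}^{-1}=\frac{4\sigma^{4}}{\kappa-1},\quad I_{22}^{-1}J_{22}I_{22}^{-1}=(\kappa-1)\sigma^{4}.
$

They coincide under normality ($\kappa=3$). With fat tails ($\kappa>3$), the gradient-based variance is smaller, and the sandwich variance larger, than the one from the information matrix.
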
### Std from Gradients

In [8]:
δL = ForwardDiff.jacobian(par->NormalLL(par,x)[2],parHat)   #Tx2
J         = δL'δL/T               #2xT * Tx2

vcv       = inv(J)/T
std_grad  = sqrt.(diag(vcv))                          #std from gradients

printblue("standard errors:\n")
xx = [std_trad std_hess std_grad]
printmat(xx,colNames=["traditional","MLE (InfoMat)","MLE (gradients)"],rowNames=["μ","σ²"],width=18)

[34m[1mstandard errors:[22m[39m

         traditional     MLE (InfoMat)   MLE (gradients)
μ              0.010             0.010             0.010
σ²             0.013             0.013             0.005



### Std from Sandwich

In [9]:
vcv       = inv(Ia) * J * inv(Ia)/T
std_sandw = sqrt.(diag(vcv))                          #std from sandwich

printblue("standard errors:\n")
xx = [std_trad std_hess std_grad std_sandw]
printmat(xx,colNames=["traditional","MLE (InfoMat)","MLE (gradients)","MLE (sandwich)"],rowNames=["μ","σ²"],width=18)

[34m[1mstandard errors:[22m[39m

         traditional     MLE (InfoMat)   MLE (gradients)    MLE (sandwich)
μ              0.010             0.010             0.010             0.010
σ²             0.013             0.013             0.005             0.036



Try this: replace the data series `x` with simulated data from a $N(\mu,\sigma^2)$ distribution. Then, do the different standard errors get closer to each other?
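
One possible starting point for this exercise is sketched below. It is not the notebook's own code: the helper `NormalStdErrs` is hypothetical, it uses analytical derivatives of the normal log likelihood (evaluated at the closed-form MLE) instead of `Optim` and `ForwardDiff` so that it needs only the standard library, and the simulated sample is $N(0,1)$ with an arbitrarily chosen seed and sample size.

```julia
using Random, Statistics, LinearAlgebra

#Hypothetical helper, not from the notebook: the three standard errors
#for the N(μ,σ²) model, using analytical derivatives at the MLE
function NormalStdErrs(x)
    T  = length(x)
    μ  = mean(x)                               #MLE of μ
    σ² = var(x,corrected=false)                #MLE of σ², 1/T version
    e  = x .- μ

    δL = hcat(e/σ², (e.^2 .- σ²)/(2*σ²^2))     #Tx2, gradient of LLt for each t
    J  = δL'δL/T                               #outer product of gradients
    Ia = [1/σ² 0; 0 1/(2*σ²^2)]                #-mean(Hessian of LLt) at the MLE

    std_hess  = sqrt.(diag(inv(Ia)/T))
    std_grad  = sqrt.(diag(inv(J)/T))
    std_sandw = sqrt.(diag(inv(Ia)*J*inv(Ia)/T))
    return std_hess, std_grad, std_sandw
end

Random.seed!(123)
xSim = randn(100_000)                          #simulated iid N(0,1) data

(std_hess,std_grad,std_sandw) = NormalStdErrs(xSim)
display([std_hess std_grad std_sandw])         #columns: InfoMat, gradients, sandwich
```

To apply the same comparison to the notebook's data, call `NormalStdErrs(x)` instead of `NormalStdErrs(xSim)`.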