# Maximum Likelihood Estimation

This notebook illustrates how maximum likelihood estimation works and how to compute different standard errors (from the information matrix, from the gradients and from the "sandwich" approach).

The application is deliberately basic: estimating the mean and variance of a series.

## Loading Packages

In [1]:
using Dates, LinearAlgebra, DelimitedFiles, Statistics, Optim, ForwardDiff

include("jlFiles/printmat.jl")

printlnPs (generic function with 2 methods)

## Loading Data

In [2]:
xx  = readdlm("Data/FFdSizePs.csv",',',skipstart=1)
ymd = round.(Int,xx[:,1])     #YearMonthDay, like 20121231
x   = xx[:,2]                 #returns for the smallest size portfolio
xx  = nothing

## Traditional Estimates

We first compute the traditional estimates of the mean $\mu$ and the variance $\sigma^2$.

To make the results comparable with the MLE, we use $1/T$ in the variance estimate, not $1/(T-1)$.
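
For reference, the "textbook formulas" used in the next cell can be written out as follows. They are the standard asymptotic results under the assumption that $x_t$ is iid $N(\mu,\sigma^2)$; that assumption is stated here for clarity and matters mostly for the second formula.

$
\text{Std}(\hat{\mu}) = \sqrt{\sigma^2/T} \: \text{ and } \: \text{Std}(\hat{\sigma}^2) = \sqrt{2\sigma^4/T}
$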

In [3]:
T = length(x)

(μ_trad,σ²_trad) = (mean(x),var(x,corrected=false))

std_parTrad = sqrt.([σ²_trad,2*σ²_trad^2]/T)          #textbook formulas

println("Traditional estimates and std")
println("  estimate      std")
printmat([[μ_trad,σ²_trad] std_parTrad])

Traditional estimates and std
  estimate      std
     0.042     0.010
     0.840     0.013



## The Likelihood Function for Estimating the Parameters of a $N(\mu,\sigma^2)$ Distribution

In [4]:
function NormalLL(par::Vector,x)
  (μ,σ²) = par
  LL    = -(1/2)*log(2*pi) .- (1/2)*log.(σ²) .- (1/2)*(x.-μ).^2/σ²
  loss  = -sum(LL)    #to minimize this
  return loss, LL
end

NormalLL (generic function with 1 method)
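
As a reading aid, the expression coded in `NormalLL` is the log of the $N(\mu,\sigma^2)$ density evaluated at each observation. Written out, the contribution of period $t$ is

$
\ln L_t = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{(x_t-\mu)^2}{2\sigma^2}
$

The vector `LL` holds $\ln L_t$ for $t=1,\ldots,T$, and `loss` is $-\sum_{t=1}^{T}\ln L_t$, which is what the optimizer minimizes.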

## Try the Likelihood Function

In [5]:
par0 = [mean(x)+0.1,var(x)+0.05]         #initial parameter guess

(loss,LL) = NormalLL(par0,x)             #testing the log likelihood fn

println("LL value at par0: ",-loss)

LL value at par0: -11142.004572304257


## Optimize the Likelihood Function

In [6]:
Sol = optimize(par->NormalLL(par,x)[1],par0)  #minimize -sum(LL)

parHat = Optim.minimizer(Sol)                 #the optimal solution 

printlnPs("Value of log-likelihood fn at estimate: ",-Optim.minimum(Sol))

println("\nParameter estimates: ")
println("      MLE      Traditional")
printmat([parHat [μ_trad,σ²_trad]])

Value of log-likelihood fn at estimate: -11088.409

Parameter estimates: 
      MLE      Traditional
     0.042     0.042
     0.840     0.840
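
The agreement between the two columns is what we should expect: for this model the first-order conditions can be solved in closed form. A short derivation, using the $\ln L_t$ coded in `NormalLL`:

$
\sum_{t=1}^{T}\frac{\partial\ln L_t}{\partial\mu} = \sum_{t=1}^{T}\frac{x_t-\mu}{\sigma^2} = 0 \: \Rightarrow \: \hat{\mu} = \frac{1}{T}\sum_{t=1}^{T}x_t
$

$
\sum_{t=1}^{T}\frac{\partial\ln L_t}{\partial\sigma^2} = \sum_{t=1}^{T}\left[  -\frac{1}{2\sigma^2}+\frac{(x_t-\mu)^2}{2\sigma^4}\right] = 0 \: \Rightarrow \: \hat{\sigma}^2 = \frac{1}{T}\sum_{t=1}^{T}(x_t-\hat{\mu})^2
$

So the MLE of the variance uses $1/T$, which is why the traditional estimate was computed with `corrected=false`.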



## Standard Errors

If the likelihood function is correctly specified, then the MLE is typically asymptotically normally distributed as

$
\sqrt{T}(\hat{\theta}-\theta)  \rightarrow^{d}N(0,V) \: \text{, where } \: V=I(\theta)^{-1}\text{ with }
$

$
I(\theta) =-\text{E}\frac{\partial^{2}\ln L_t}{\partial\theta\partial\theta^{\prime}}
$

where $I(\theta)$ is the information matrix and $\ln L_t$  is the contribution of period $t$ to the likelihood function.

The code below calculates the derivatives by automatic differentiation (with `ForwardDiff`), so no analytical derivatives need to be coded.


Alternatively, we can use the outer product of the gradients to calculate the
information matrix as

$
J(\theta)=\text{E}\left[  \frac{\partial\ln L_t}{\partial\theta
}\frac{\partial\ln L_t}{\partial\theta^{\prime}}\right]
$

We could also use the "sandwich" estimator

$
V=I(\theta)^{-1}J(\theta)I(\theta)^{-1}.
$
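
For the model in this notebook, with $\theta=(\mu,\sigma^2)$, the information matrix can also be worked out by hand. The following is a sketch that assumes the data really are iid $N(\mu,\sigma^2)$, so that $\text{E}(x_t-\mu)=0$ and $\text{E}(x_t-\mu)^2=\sigma^2$:

$
I(\theta) =
\begin{bmatrix}
1/\sigma^2 & 0\\
0 & 1/(2\sigma^4)
\end{bmatrix}
\: \text{, so } \:
V=I(\theta)^{-1} =
\begin{bmatrix}
\sigma^2 & 0\\
0 & 2\sigma^4
\end{bmatrix}
$

Dividing the diagonal of $V$ by $T$ and taking square roots gives $\sqrt{\sigma^2/T}$ and $\sqrt{2\sigma^4/T}$, which are the textbook standard errors computed earlier as `std_parTrad`.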

### Std from Hessian (Information Matrix)

In [7]:
Ia         = -ForwardDiff.hessian(par->mean(NormalLL(par,x)[2]),parHat)
Ia         = (Ia+Ia')/2              #to guarantee symmetry
vcv        = inv(Ia)/T
std_parHat = sqrt.(diag(vcv))

println("std from Hessian and traditional")
printmat([std_parHat std_parTrad])

std from Hessian and traditional
     0.010     0.010
     0.013     0.013



### Std from Gradient and Sandwich

In [8]:
LLgrad = ForwardDiff.jacobian(par->NormalLL(par,x)[2],parHat)   #T x 2 matrix
J           = LLgrad'LLgrad/T
vcv         = inv(J)/T
stdb_parHat = sqrt.(diag(vcv))                          #std from gradients

vcv         = inv(Ia) * J * inv(Ia)/T
stdc_parHat = sqrt.(diag(vcv))                          #std from sandwich

println("\n4 different standard errors")
println("    hessian  gradient  sandwich  traditional")
printmat([std_parHat stdb_parHat stdc_parHat std_parTrad])


4 different standard errors
    hessian  gradient  sandwich  traditional
     0.010     0.010     0.010     0.010
     0.013     0.005     0.036     0.013



When the data are not iid $N(\mu,\sigma^2)$, the sandwich estimate may differ from the information-based estimate.
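
To illustrate this point, here is a minimal simulation sketch. It is not part of the original analysis: the function `NormalMLEStd`, the seed and the mixture-of-normals data are all hypothetical choices made for the illustration. To keep the block self-contained, it uses the analytical scores and information matrix of the $N(\mu,\sigma^2)$ model instead of `ForwardDiff`.

```julia
using Random, Statistics, LinearAlgebra

#hypothetical helper (not from the notebook): three std estimates for the
#N(μ,σ²) MLE, based on analytical derivatives instead of ForwardDiff
function NormalMLEStd(x)
  T  = length(x)
  μ  = mean(x)
  σ² = var(x,corrected=false)                    #closed-form MLE, uses 1/T
  e  = x .- μ
  g  = hcat(e/σ², -1/(2*σ²) .+ e.^2/(2*σ²^2))    #T x 2 matrix of scores
  J  = g'g/T                                     #outer product of gradients
  Ia = [1/σ² 0; 0 1/(2*σ²^2)]                    #-mean(Hessian), evaluated at the MLE
  stdHessian  = sqrt.(diag(inv(Ia)/T))
  stdGradient = sqrt.(diag(inv(J)/T))
  stdSandwich = sqrt.(diag(inv(Ia)*J*inv(Ia)/T))
  return stdHessian, stdGradient, stdSandwich
end

rng = MersenneTwister(123)                       #arbitrary seed (assumption)
T   = 10_000
s   = ifelse.(rand(rng,T) .< 0.1, 3.0, 1.0)      #10% of the obs have 3 times the std
x   = s.*randn(rng,T)                            #fat-tailed data (mixture of normals)

(stdHessian,stdGradient,stdSandwich) = NormalMLEStd(x)

println("std of the σ² estimate: hessian, gradient, sandwich")
println(round.([stdHessian[2],stdGradient[2],stdSandwich[2]],digits=3))
```

With fat-tailed data like these, the fourth moment exceeds the $3\sigma^4$ implied by normality. The sandwich standard error of $\hat{\sigma}^2$ should then come out larger than the information-based one, and the gradient-based one smaller, while the three standard errors of $\hat{\mu}$ stay close to each other. This is the same ordering as in the table above, which suggests (but does not prove) that fat tails in the return series drive the differences there.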