<a href="https://colab.research.google.com/github/DepartmentOfStatisticsPUE/bi-2021/blob/main/materials/3_mle_optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Maximum likelihood method



## Theory

If $f_{i}\left(k_{i} ; \mathbf{\theta}\right)$ is the probability density (or mass) function of a random variable, where $\mathbf{\theta}$ is a vector of parameters (e.g. $\lambda$ in the Poisson distribution), then for a collection of $N$ independent samples from this distribution, the joint distribution of the random vector $\mathbf{k}$ is

\begin{equation}
        f(\mathbf{k} ; \mathbf{\theta})=\prod_{i=1}^{N} f_{i}\left(k_{i} ; \mathbf{\theta}\right)
\end{equation}


The maximum likelihood estimate of the parameters $\mathbf{\theta}$ is the value that maximizes this function with $\mathbf{k}$ fixed and given by the data:

\begin{equation}
    \hat{\mathbf{\theta}} =\arg \max _{\mathbf{\theta}} f(\mathbf{k} ; \mathbf{\theta}) =\arg \min _{\theta} l_{\mathbf{k}}(\mathbf{\theta}),
\end{equation}

where 


\begin{equation}
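    l_{\mathbf{k}}(\mathbf{\theta}) =-\sum_{i=1}^{N} \log f_{i}\left(k_{i} ; \mathbf{\theta}\right) =-N \, \overline{\log f_{i}\left(k_{i} ; \mathbf{\theta}\right)}
\end{equation}

Here the overline denotes the average over the $N$ observations, so $l_{\mathbf{k}}$ is the negative log-likelihood.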
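As a small illustration of the negative log-likelihood, the sketch below evaluates it for a simulated Poisson sample. The sample size, the seed, the candidate values of $\lambda$ and the helper name `negloglik` are arbitrary choices made for this example.

```r
# Simulated data: an assumption for illustration only
set.seed(1)
k <- rpois(200, lambda = 3)

# Negative log-likelihood of an i.i.d. Poisson sample
negloglik <- function(lambda, k) -sum(dpois(k, lambda = lambda, log = TRUE))

# The sum of log-densities equals N times their average
nll_sum  <- negloglik(3, k)
nll_mean <- -length(k) * mean(dpois(k, lambda = 3, log = TRUE))
all.equal(nll_sum, nll_mean)

# The function is smaller near the value used for simulation than far from it
negloglik(3, k) < negloglik(6, k)
```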

## Maximum likelihood -- example for the Poisson distribution


The likelihood function for the Poisson distribution is

\begin{equation}
        L\left(\lambda ; x_{1}, \ldots, x_{n}\right)=\prod_{j=1}^{n} \exp (-\lambda) \frac{1}{x_{j} !} \lambda^{x_{j}}
\end{equation}

The log-likelihood function is


\begin{equation}
        l\left(\lambda ; x_{1}, \ldots, x_{n}\right)=-n \lambda-\sum_{j=1}^{n} \ln \left(x_{j} !\right)+\ln (\lambda) \sum_{j=1}^{n} x_{j}
\end{equation}

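Setting the derivative $\partial l / \partial \lambda = -n + \frac{1}{\lambda} \sum_{j=1}^{n} x_{j}$ to zero gives the maximum likelihood estimator of $\lambda$, which is the sample mean: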

\begin{equation}
        \hat{\lambda}=\frac{1}{n} \sum_{j=1}^{n} x_{j}
\end{equation}
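A quick numerical check of this closed form: the sketch below maximizes the Poisson log-likelihood with base R's `optimize` and compares the result with the sample mean. The simulated data, the search interval and the name `loglik` are assumptions made for the example.

```r
# Simulated data: an assumption for illustration only
set.seed(42)
x <- rpois(500, lambda = 4)

# Poisson log-likelihood, written term by term as in the formula above
loglik <- function(lambda, x) {
  -length(x) * lambda - sum(lfactorial(x)) + log(lambda) * sum(x)
}

# One-dimensional maximization over an interval assumed to contain the optimum
fit <- optimize(loglik, interval = c(0.01, 20), x = x, maximum = TRUE)

c(numerical = fit$maximum, closed_form = mean(x))
```

The two values should agree up to the tolerance of `optimize`.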

## Maximum likelihood -- practice

+ Depending on the distribution, it may be possible to derive a closed-form expression for the estimator of $\mathbf{\theta}$.
+ In other cases, numerical methods that require the gradient (first derivatives) and the Hessian (second derivatives) are used, for example the Newton-Raphson method.
+ Statistical packages also implement derivative-free optimization methods that require neither the gradient nor the Hessian.


## Newton's method -- single parameter function

Let $f$: $\mathbb{R}^1 \to \mathbb{R}^1 $ be a differentiable function. We seek a solution of $f(x)=0$, starting from an initial estimate $x_0$.
    
At the $n$-th step, given $x_n$, compute the next approximation $x_{n+1}$ by
    
\begin{equation}
        x_{n+1} = x_{n} - \frac{f(x_n)}{f'(x_n)}
\end{equation} 
    
and repeat until convergence (i.e. until $|x_{n+1}-x_n| < \epsilon$ for a chosen tolerance $\epsilon$).
    
For step by step examples see: http://amsi.org.au/ESA_Senior_Years/SeniorTopic3/3j/3j_2content_2.html.
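
The iteration above can be sketched in a few lines of R. The function name `newton`, its arguments and the test problem ($x^2 - 2 = 0$) are illustrative choices, not part of the original material.

```r
# Newton's method for f(x) = 0; f_prime is the derivative of f
newton <- function(f, f_prime, x0, eps = 1e-10, max_iter = 100) {
  x <- x0
  for (n in seq_len(max_iter)) {
    x_new <- x - f(x) / f_prime(x)
    # Stop when two successive approximations are closer than eps
    if (abs(x_new - x) < eps) {
      return(list(root = x_new, iterations = n))
    }
    x <- x_new
  }
  stop("Newton's method did not converge within max_iter iterations")
}

# Example: the positive root of x^2 - 2, i.e. sqrt(2)
res_newton <- newton(function(x) x^2 - 2, function(x) 2 * x, x0 = 1)
res_newton$root
```

Note that the method can fail when $f'(x_n)$ is close to zero or the starting point is far from the root, which is why the sketch caps the number of iterations.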


## Newton-Raphson algorithm -- multivariate case

 Newton's method for optimization consists of applying Newton's method for solving systems of equations, where the equations are the first order conditions, saying that the gradient should equal the zero vector.
    
\begin{equation}
        \nabla \mathbf{f}(\mathbf{x}) = \mathbf{0}
\end{equation}
    
A first-order Taylor expansion of the left-hand side around $\mathbf{x}_n$ (equivalently, a second-order expansion of $\mathbf{f}$) leads to the iterative scheme

\begin{equation}
    \mathbf{x}_{n+1} = \mathbf{x}_{n} - \mathbf{H}(\mathbf{x}_{n})^{-1}\nabla \mathbf{f}(\mathbf{x}_{n}),
\end{equation}

where $\nabla \mathbf{f}(\mathbf{x}_{n})$ is the gradient and $\mathbf{H}(\mathbf{x}_{n})$ is the Hessian (the matrix of second-order derivatives) of $\mathbf{f}$, both evaluated at $\mathbf{x}_{n}$.
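
A minimal sketch of this scheme in R, assuming an illustrative two-parameter objective $f(x, y) = e^{x+y} + x^2 + y^2$ chosen only because its gradient and Hessian are easy to write down. The names `grad_f`, `hess_f` and `newton_multi` are hypothetical.

```r
# Gradient of f(x, y) = exp(x + y) + x^2 + y^2
grad_f <- function(p) {
  e <- exp(p[1] + p[2])
  c(e + 2 * p[1], e + 2 * p[2])
}

# Hessian of the same function
hess_f <- function(p) {
  e <- exp(p[1] + p[2])
  matrix(c(e + 2, e, e, e + 2), nrow = 2)
}

newton_multi <- function(grad, hess, x0, eps = 1e-10, max_iter = 50) {
  x <- x0
  for (n in seq_len(max_iter)) {
    # solve(H, g) computes H^{-1} g without forming the inverse explicitly
    step <- solve(hess(x), grad(x))
    x <- x - step
    if (sqrt(sum(step^2)) < eps) break
  }
  x
}

x_hat <- newton_multi(grad_f, hess_f, x0 = c(0, 0))
x_hat
grad_f(x_hat)  # should be numerically close to the zero vector
```

Solving the linear system with `solve(H, g)` rather than inverting the Hessian is a common design choice: it is cheaper and numerically more stable.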

## Optimization procedures

There are plenty of different optimisation procedures that may be used for estimating parameters:

+ Gradient free
    + Nelder-Mead
    + Simulated Annealing
    + Particle Swarm

+ Gradient required
    + Conjugate Gradient Descent
    + Gradient Descent
    + (L-)BFGS

+ Hessian required
    + Newton's Method
    + Newton's Method With a Trust Region

For more, see **Optim.jl** documentation https://julianlsolvers.github.io/Optim.jl or **scipy.optimize** https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html.
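
To make the distinction between the first two groups concrete, the sketch below runs base R's `optim` on the same illustrative objective with a gradient-free method (Nelder-Mead) and a gradient-based one (BFGS). The objective, the starting point and the names `fn` and `gr` are assumptions made for this example.

```r
# Illustrative objective: f(x, y) = exp(x + y) + x^2 + y^2
fn <- function(p) exp(p[1] + p[2]) + p[1]^2 + p[2]^2
gr <- function(p) {
  e <- exp(p[1] + p[2])
  c(e + 2 * p[1], e + 2 * p[2])
}

# Gradient free: Nelder-Mead uses function values only
fit_nm <- optim(par = c(1, 1), fn = fn, method = "Nelder-Mead")

# Gradient required: BFGS with the analytic gradient
fit_bfgs <- optim(par = c(1, 1), fn = fn, gr = gr, method = "BFGS")

rbind(NelderMead = fit_nm$par, BFGS = fit_bfgs$par)
# Number of function (and gradient) evaluations used by each method
rbind(NelderMead = fit_nm$counts, BFGS = fit_bfgs$counts)
```

Both methods should arrive at roughly the same point; the evaluation counts show what each approach pays for it.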

# Practice -- exercise

## Task 

Assume that $X$ follows a zero-truncated Poisson distribution with probability mass function

$$
P(X=x \mid X>0; \lambda) = \frac{f(x; \lambda)}{1 - f(0;\lambda)} = \frac{\lambda^x e^{-\lambda}}{x!(1-e^{-\lambda})} =  \frac{\lambda^x}{(e^\lambda-1)x!}, \quad x = 1, 2, \ldots
$$

Complete the following task:

+ Estimate $\lambda$ using the maximum likelihood method.
+ Note that obtaining $\hat{\lambda}$ requires deriving the log-likelihood function (and its gradient with respect to $\lambda$) and applying an optimization procedure (e.g. the `optim` function).


## Solution

We start with the likelihood function

$$
L = \prod_i \frac{\lambda^{x_i}}{(e^\lambda-1)x_i!},
$$

then we compute the log-likelihood

$$
    \log L = \log
    \left(
    \prod_i \frac{\lambda^{x_i}}{(e^\lambda-1)x_i!}
    \right) = 
    \sum_i \log 
    \left( 
    \frac{\lambda^{x_i}}{(e^\lambda-1)x_i!}
    \right)
$$ 

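After simplification, with $n$ denoting the number of observations, we get

$$
\log L = \sum_i x_i \log \lambda - \sum_i \log(e^\lambda-1) - \sum_i \log(x_i!) = \log \lambda \sum_i x_i - n \log(e^\lambda-1) - \sum_i \log(x_i!)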
$$ 

To estimate $\lambda$, we need the derivatives of the log-likelihood with respect to this parameter. The gradient is given by

$$
\frac{\partial \log L}{\partial \lambda} = \frac{\sum_i x_i}{\lambda} - \frac{n e^\lambda}{e^\lambda - 1} = \frac{\sum_i x_i}{\lambda} - n \frac{e^\lambda}{e^\lambda - 1}. 
$$

We can also calculate the second derivative (the Hessian):

$$
\frac{\partial^2 \log L}{\partial \lambda^2} =  - \frac{\sum_i x_i}{\lambda^2} + n \frac{e^\lambda}{(e^\lambda-1)^2}.
$$
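
Before using these derivatives in an optimizer, it is worth checking them against finite differences. The sketch below does this in base R; the sample `x_demo`, the evaluation point `lambda0`, the step `h` and the function names are hypothetical choices made for the check.

```r
# Log-likelihood (up to the constant -sum(log(x!))), gradient and second derivative
ll_ztp   <- function(lambda, x) sum(x) * log(lambda) - length(x) * log(exp(lambda) - 1)
grad_ztp <- function(lambda, x) sum(x) / lambda - length(x) * exp(lambda) / (exp(lambda) - 1)
hess_ztp <- function(lambda, x) -sum(x) / lambda^2 + length(x) * exp(lambda) / (exp(lambda) - 1)^2

# Small hypothetical sample of positive counts
x_demo  <- c(1, 2, 2, 3, 5, 1, 4)
lambda0 <- 1.7
h       <- 1e-5

# Central finite differences of the log-likelihood and of the gradient
grad_num <- (ll_ztp(lambda0 + h, x_demo) - ll_ztp(lambda0 - h, x_demo)) / (2 * h)
hess_num <- (grad_ztp(lambda0 + h, x_demo) - grad_ztp(lambda0 - h, x_demo)) / (2 * h)

c(analytic = grad_ztp(lambda0, x_demo), numerical = grad_num)
c(analytic = hess_ztp(lambda0, x_demo), numerical = hess_num)
```

If the analytic and numerical values disagree beyond rounding error, the derivation (or its implementation) contains a mistake.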

# Implementation in R


In [3]:
install.packages(c("rootSolve", "maxLik", "extraDistr"))

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [4]:
library(maxLik)
library(rootSolve)
library(extraDistr) ## rtpois


Attaching package: ‘rootSolve’


The following objects are masked from ‘package:maxLik’:

    gradient, hessian



Attaching package: ‘extraDistr’


The following object is masked from ‘package:miscTools’:

    ddnorm




In [5]:
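# Log-likelihood of the zero-truncated Poisson distribution;
# the term -sum(lfactorial(x)) does not depend on par and is omitted
ll <- function(par, x) {
  m <- sum(x)*log(par)-length(x)*log(exp(par)-1)
  m
}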

In [6]:
# Gradient (first derivative) of ll with respect to par
grad <- function(par, x)  {
  g <- sum(x) / par - length(x)*exp(par)/(exp(par)-1)
  g
}

In [7]:
# Hessian (second derivative) of ll with respect to par
hess <- function(par, x) {
  h <- -sum(x)/par^2 + length(x)*exp(par)/(exp(par)-1)^2 
  h
}

In [8]:
set.seed(123)
# Simulate 10000 draws from a Poisson(2.5) distribution truncated below at a = 0 (zeros excluded)
x <- rtpois(10000, lambda = 2.5, a = 0)

In [9]:
# Newton-Raphson without analytic derivatives: maxLik computes them numerically
res <- maxLik(logLik = ll, start = 1, x = x, method = "NR")
summary(res)

--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 6 iterations
Return code 2: successive function values within tolerance limit (tol)
Log-Likelihood: 637.5431 
1  free parameters
Estimates:
     Estimate Std. error t value Pr(> t)    
[1,]  2.47732    0.01714   144.6  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------

In [10]:
# Newton-Raphson with the analytic gradient and Hessian supplied
res2 <- maxLik(logLik = ll,  grad = grad, hess = hess, start = 1, x = x, method = "NR")
summary(res2)

--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 6 iterations
Return code 1: gradient close to zero (gradtol)
Log-Likelihood: 637.5431 
1  free parameters
Estimates:
     Estimate Std. error t value Pr(> t)    
[1,]  2.47732    0.01713   144.6  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------
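
To see what the Newton-Raphson routine does internally, the estimate can also be obtained with a hand-written loop. This is a minimal sketch, not the `maxLik` implementation: the data are simulated in base R by discarding zeros from `rpois` (an assumption made to avoid extra packages), so the numbers will differ slightly from the output above, and the names `x_ztp`, `grad_nr` and `hess_nr` are hypothetical.

```r
set.seed(123)
# Zero-truncated Poisson sample: draw Poisson values and keep the positive ones
draws <- rpois(20000, lambda = 2.5)
x_ztp <- draws[draws > 0][1:10000]

# First and second derivative of the log-likelihood derived in the Solution section
grad_nr <- function(par, x) sum(x) / par - length(x) * exp(par) / (exp(par) - 1)
hess_nr <- function(par, x) -sum(x) / par^2 + length(x) * exp(par) / (exp(par) - 1)^2

lambda <- 1  # same starting value as in the maxLik calls
for (iter in 1:50) {
  # Newton-Raphson update: lambda_new = lambda - gradient / Hessian
  step <- grad_nr(lambda, x_ztp) / hess_nr(lambda, x_ztp)
  lambda <- lambda - step
  if (abs(step) < 1e-8) break
}

# Standard error from the inverse of the negative second derivative
se <- sqrt(-1 / hess_nr(lambda, x_ztp))
c(estimate = lambda, std_error = se, iterations = iter)
```

The estimate should be close to the value of $\lambda = 2.5$ used in the simulation, with a standard error of the same order as the one reported by `maxLik`.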