<a href="https://colab.research.google.com/github/DepartmentOfStatisticsPUE/bi-2021/blob/main/materials/3_mle_optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Maximum likelihood method



## Theory

If $f_{i}\left(k_{i} ; \mathbf{\theta}\right)$ is the PDF (or, for a discrete variable, the PMF) of a random variable, where $\mathbf{\theta}$ is a vector of parameters (e.g. $\lambda$ in the Poisson distribution), then for a collection of $N$ independent samples from this distribution, the joint distribution of the random vector $\mathbf{k} = (k_1, \ldots, k_N)$ is

\begin{equation}
        f(\mathbf{k} ; \mathbf{\theta})=\prod_{i=1}^{N} f_{i}\left(k_{i} ; \mathbf{\theta}\right)
\end{equation}


The maximum likelihood estimate of the parameters $\mathbf{\theta}$ is the value that maximizes this function with $\mathbf{k}$ fixed and given by the data:

\begin{equation}
    \hat{\mathbf{\theta}} =\arg \max _{\mathbf{\theta}} f(\mathbf{k} ; \mathbf{\theta}) =\arg \min _{\theta} l_{\mathbf{k}}(\mathbf{\theta}),
\end{equation}

where $l_{\mathbf{k}}(\mathbf{\theta})$ is the negative log-likelihood

\begin{equation}
    l_{\mathbf{k}}(\mathbf{\theta}) =-\sum_{i=1}^{N} \log f_{i}\left(k_{i} ; \mathbf{\theta}\right)
\end{equation}

## Maximum likelihood -- example for the Poisson distribution


The likelihood function for the Poisson distribution is

\begin{equation}
        L\left(\lambda ; x_{1}, \ldots, x_{n}\right)=\prod_{j=1}^{n} \exp (-\lambda) \frac{1}{x_{j} !} \lambda^{x_{j}}
\end{equation}

The log-likelihood function is


\begin{equation}
        l\left(\lambda ; x_{1}, \ldots, x_{n}\right)=-n \lambda-\sum_{j=1}^{n} \ln \left(x_{j} !\right)+\ln (\lambda) \sum_{j=1}^{n} x_{j}
\end{equation}

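Setting the derivative $\partial l / \partial \lambda = -n + \frac{1}{\lambda} \sum_{j=1}^{n} x_{j}$ to zero gives the maximum likelihood estimator of $\lambda$: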

\begin{equation}
        \hat{\lambda}=\frac{1}{n} \sum_{j=1}^{n} x_{j}
\end{equation}
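
As a quick numerical check of the closed-form result, the sketch below simulates a Poisson sample, minimizes the negative log-likelihood with base R's `optimize`, and compares the minimizer with the sample mean. The sample size, the true rate of 3, the search interval and the name `negll` are arbitrary assumptions made for illustration.

```r
set.seed(1)
x <- rpois(500, lambda = 3)  # simulated sample; lambda = 3 is an arbitrary choice

# negative Poisson log-likelihood (the log-likelihood above with the sign flipped)
negll <- function(lambda, x) {
  length(x) * lambda + sum(lfactorial(x)) - log(lambda) * sum(x)
}

# one-dimensional minimization over an assumed search interval
fit <- optimize(negll, interval = c(0.01, 20), x = x)

lambda_numeric <- fit$minimum   # numerical minimizer
lambda_closed <- mean(x)        # closed-form MLE
c(numeric = lambda_numeric, closed_form = lambda_closed)
```

The two values should agree up to the tolerance of `optimize`.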

## Maximum likelihood -- practice

+ Depending on the distribution, it may be possible to derive a closed-form expression for the estimator of $\mathbf{\theta}$.
+ In other cases, numerical methods that require the gradient (first derivatives) and the Hessian (second derivatives) are used, for example the Newton-Raphson method.
+ Statistical packages also implement derivative-free optimization methods that do not require calculating the gradient or the Hessian.


## Newton's method -- single parameter function

Let $f$: $\mathbb{R}^1 \to \mathbb{R}^1 $ be a differentiable function. We seek a solution of $f(x)=0$, starting from an initial estimate $x_0$.

At the $n$-th step, given $x_n$, compute the next approximation $x_{n+1}$ by
    
\begin{equation}
        x_{n+1} = x_{n} - \frac{f(x_n)}{f'(x_n)}
\end{equation} 
    
and repeat until convergence (i.e. until $|x_{n+1}-x_n| < \epsilon$).
    
For step by step examples see: http://amsi.org.au/ESA_Senior_Years/SeniorTopic3/3j/3j_2content_2.html.
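
The iteration above can be sketched in a few lines of R. The function name `newton_root`, the default tolerance and the test problem $x^2 - 2 = 0$ are illustrative assumptions, not part of the original material.

```r
# Newton's method for f(x) = 0; f and its derivative fprime are supplied by the caller
newton_root <- function(f, fprime, x0, eps = 1e-10, max_iter = 100) {
  x <- x0
  for (n in seq_len(max_iter)) {
    x_new <- x - f(x) / fprime(x)   # the update rule from the equation above
    if (abs(x_new - x) < eps) {     # stopping rule |x_{n+1} - x_n| < eps
      return(list(root = x_new, iterations = n))
    }
    x <- x_new
  }
  stop("Newton's method did not converge within max_iter iterations")
}

# illustrative problem: solve x^2 - 2 = 0 starting from x0 = 1
res_newton <- newton_root(function(x) x^2 - 2, function(x) 2 * x, x0 = 1)
res_newton$root
```

The returned root can be compared with `sqrt(2)`; the method may fail when $f'(x_n)$ is close to zero, which this sketch does not guard against.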


## Newton-Raphson algorithm -- multivariate case

 Newton's method for optimization consists of applying Newton's method for solving systems of equations, where the equations are the first order conditions, saying that the gradient should equal the zero vector.
    
\begin{equation}
        \nabla \mathbf{f}(\mathbf{x}) = \mathbf{0}
\end{equation}
    
A first-order Taylor expansion of the left-hand side (equivalently, a second-order expansion of $\mathbf{f}$ itself) leads to the iterative scheme

\begin{equation}
    \mathbf{x}_{n+1} = \mathbf{x}_{n} - \mathbf{H}(\mathbf{x}_{n})^{-1}\nabla \mathbf{f}(\mathbf{x}_{n}),
\end{equation}

where $\nabla \mathbf{f}(\mathbf{x}_{n})$ is the gradient and $\mathbf{H}(\mathbf{x}_{n})$ is the Hessian (matrix of second-order derivatives) of $\mathbf{f}$ evaluated at $\mathbf{x}_{n}$.
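
A minimal sketch of this scheme in R is given below. The objective $e^{x_1+x_2} + x_1^2 + x_2^2$, the starting point and the names `f_obj`, `f_grad`, `f_hess` and `newton_opt` are assumptions chosen only for illustration; the linear system is solved with `solve` rather than by inverting the Hessian explicitly.

```r
# illustrative smooth convex objective (an assumption, not taken from the text)
f_obj <- function(x) exp(x[1] + x[2]) + x[1]^2 + x[2]^2
f_grad <- function(x) {
  e <- exp(x[1] + x[2])
  c(e + 2 * x[1], e + 2 * x[2])
}
f_hess <- function(x) {
  e <- exp(x[1] + x[2])
  matrix(c(e + 2, e, e, e + 2), nrow = 2)
}

newton_opt <- function(grad, hess, x0, eps = 1e-10, max_iter = 100) {
  x <- x0
  for (n in seq_len(max_iter)) {
    step <- solve(hess(x), grad(x))   # solves H(x) step = gradient(x)
    x <- x - step                     # x_{n+1} = x_n - H^{-1} gradient
    if (sqrt(sum(step^2)) < eps) return(list(par = x, iterations = n))
  }
  stop("Newton's method did not converge within max_iter iterations")
}

opt <- newton_opt(f_grad, f_hess, x0 = c(0, 0))
opt$par           # stationary point
f_grad(opt$par)   # gradient should be numerically zero here
```

Because the scheme only looks for a point where the gradient vanishes, the Hessian at the solution still has to be inspected to decide whether it is a minimum or a maximum.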

## Optimization procedures

There are plenty of different optimisation procedures that may be used for estimating parameters:

+ Gradient free
    + Nelder-Mead
    + Simulated Annealing
    + Particle Swarm

+ Gradient required
    + Conjugate Gradient Descent
    + Gradient Descent
    + (L-)BFGS

+ Hessian required
    + Newton's Method
    + Newton's Method With a Trust Region

For more, see **Optim.jl** documentation https://julianlsolvers.github.io/Optim.jl or **scipy.optimize** https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html.
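
In R, several of these procedures are available through the base function `optim`. The sketch below contrasts a gradient-free method (Nelder-Mead) with a gradient-based one (BFGS) on a simple quadratic; the objective `fn`, its gradient `gr` and the starting point are assumptions made for illustration.

```r
# illustrative objective with its minimum at (1, 2); not taken from the text
fn <- function(p) (p[1] - 1)^2 + 3 * (p[2] - 2)^2
gr <- function(p) c(2 * (p[1] - 1), 6 * (p[2] - 2))

# gradient-free: Nelder-Mead uses function values only
fit_nm <- optim(par = c(0, 0), fn = fn, method = "Nelder-Mead")

# gradient-based: BFGS with the analytical gradient
fit_bfgs <- optim(par = c(0, 0), fn = fn, gr = gr, method = "BFGS")

# estimated minimizers
rbind(nelder_mead = fit_nm$par, bfgs = fit_bfgs$par)
# number of function and gradient evaluations used by each method
rbind(nelder_mead = fit_nm$counts, bfgs = fit_bfgs$counts)
```

Note that `optim` minimizes by default, so a log-likelihood has to be passed with its sign flipped (or with `control = list(fnscale = -1)`).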

# Example #1

## Task

Assuming that $X$ follows a zero-truncated Poisson distribution given by

$$
P(X=x \mid X>0; \lambda) = \frac{f(x; \lambda)}{1 - f(0;\lambda)} = \frac{\lambda^x e^{-\lambda}}{x!(1-e^{-\lambda})} =  \frac{\lambda^x}{(e^\lambda-1)x!},
$$

complete the following task:

+ estimate $\lambda$ using the maximum likelihood method.
+ Note that obtaining $\hat{\lambda}$ requires deriving the log-likelihood function (and its gradient with respect to $\lambda$) and applying an optimization procedure (e.g. the `optim` function).


## Solution

We start with the likelihood function

$$
L = \prod_i \frac{\lambda^{x_i}}{(e^\lambda-1)x_i!},
$$

then we compute log-likelihood

$$
    \log L = \log
    \left(
    \prod_i \frac{\lambda^{x_i}}{(e^\lambda-1)x_i!}
    \right) = 
    \sum_i \log 
    \left( 
    \frac{\lambda^{x_i}}{(e^\lambda-1)x_i!}
    \right)
$$ 

after simplification we get

$$
\log L = \sum_i x_i \log \lambda - \sum_i \log(e^\lambda-1) - \sum_i \log(x_i!) 
$$ 

To estimate $\lambda$ we need the derivatives of the log-likelihood with respect to this parameter. The gradient is given by

$$
\frac{\partial \log L}{\partial \lambda} = \frac{\sum_i x_i}{\lambda} - \frac{n e^\lambda}{e^\lambda - 1} = \frac{\sum_i x_i}{\lambda} - n \frac{e^\lambda}{e^\lambda - 1}. 
$$

We can also calculate the second derivative (Hessian)

$$
\frac{\partial^2 \log L}{\partial \lambda^2} =  - \frac{\sum_i x_i}{\lambda^2} + n \frac{e^\lambda}{(e^\lambda-1)^2}.
$$
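
To show how these two derivatives are used, the following sketch runs the Newton-Raphson update $\lambda_{n+1} = \lambda_n - \frac{\partial \log L / \partial \lambda}{\partial^2 \log L / \partial \lambda^2}$ by hand in base R. The data are simulated by drawing Poisson values and discarding zeros; the names `x_zt`, `score` and `curvature`, the starting value and the tolerance are assumptions made for illustration.

```r
set.seed(123)
# zero-truncated Poisson sample in base R: draw Poisson values and drop the zeros
x_raw <- rpois(10000, lambda = 2.5)
x_zt <- x_raw[x_raw > 0]
n_zt <- length(x_zt)

# first and second derivative of log L, as derived above
score <- function(lambda) {
  sum(x_zt) / lambda - n_zt * exp(lambda) / (exp(lambda) - 1)
}
curvature <- function(lambda) {
  -sum(x_zt) / lambda^2 + n_zt * exp(lambda) / (exp(lambda) - 1)^2
}

lambda_hat <- 1                      # starting value (an assumption)
for (iter in 1:50) {
  step <- score(lambda_hat) / curvature(lambda_hat)
  lambda_hat <- lambda_hat - step    # Newton-Raphson update
  if (abs(step) < 1e-10) break       # stop once the update is negligible
}
lambda_hat
score(lambda_hat)  # should be numerically zero at the maximum
```

This plain iteration has no step-size control, so a poor starting value could push $\lambda$ outside the admissible range $\lambda > 0$.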

# Implementation in R


In [3]:
install.packages(c("rootSolve", "maxLik", "extraDistr", "numDeriv", "rbenchmark"))

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [55]:
library(maxLik)
library(rootSolve)
library(extraDistr) ## rtpois
library(numDeriv) ## numerical gradient
library(rbenchmark)

In [5]:
## log-likelihood of the zero-truncated Poisson; the constant -sum(lfactorial(x)) is omitted
ll <- function(par, x) {
  m <- sum(x)*log(par)-length(x)*log(exp(par)-1)
  m
}

In [6]:
## analytical gradient of ll with respect to lambda
grad <- function(par, x)  {
  g <- sum(x) / par - length(x)*exp(par)/(exp(par)-1)
  g
}

In [7]:
## analytical second derivative (hessian) of ll
hess <- function(par, x) {
  h <- -sum(x)/par^2 + length(x)*exp(par)/(exp(par)-1)^2 
  h
}

In [8]:
set.seed(123)
x <- rtpois(10000, lambda = 2.5, a = 0)

In [9]:
res <- maxLik(logLik = ll, start = 1, x = x, method = "NR")
summary(res)

--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 6 iterations
Return code 2: successive function values within tolerance limit (tol)
Log-Likelihood: 637.5431 
1  free parameters
Estimates:
     Estimate Std. error t value Pr(> t)    
[1,]  2.47732    0.01714   144.6  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------

In [10]:
res2 <- maxLik(logLik = ll,  grad = grad, hess = hess, start = 1, x = x, method = "NR")
summary(res2)

--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 6 iterations
Return code 1: gradient close to zero (gradtol)
Log-Likelihood: 637.5431 
1  free parameters
Estimates:
     Estimate Std. error t value Pr(> t)    
[1,]  2.47732    0.01713   144.6  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------

## Example 2

Assume that $X_i$ follows a Poisson distribution given by 

$$
P(X_i=x; \lambda_i) = \frac{\lambda_i^x e^{-\lambda_i}}{x!},
$$

where $\lambda_i = \theta_0 + \theta_1 \times z_i$, $\theta_0=1$, $\theta_1=1$, $z_i \sim \text{Bern}(0.7)$, and the number of observations is $n=10{,}000$.

Tasks:

+ generate $z_i$,
+ generate $\lambda_i$ according to $\theta_0 + \theta_1 \times z_i$,
+ generate $X_i \sim \text{Poisson}(\lambda_i)$,
+ derive log-likelihood, gradient and hessian,
+ obtain MLE of $\boldsymbol{\theta} = (\theta_0, \theta_1)$ using Newton-Raphson method. 

In [57]:
set.seed(123)
n <- 10000
z <- rbinom(n = n, prob = 0.7, size = 1)
theta_true <- c(1, 1)
lambda_true <- theta_true[1] + theta_true[2]*z
X <- rpois(n = n, lambda = lambda_true)
table(X)

X
   0    1    2    3    4    5    6    7    8    9 
2069 2982 2441 1494  629  277   78   25    3    2 

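Log-likelihood function (the term $-\sum_i \log(x_i!)$ does not depend on $\mathbf{\theta}$ and is omitted)

$$
\log L(\theta_0, \theta_1; x_i, z_i) = \sum_i \left[-\lambda_i + x_i \log(\lambda_i)\right] = \sum_i \left[-(\theta_0 + \theta_1z_i) + x_i \log(\theta_0 + \theta_1z_i)\right]
$$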

In [68]:
ll <- function(theta, z, X) {
  
  lam <- theta[1]+theta[2]*z
  l <- X*log(lam) - lam
  return(sum(l))
}

Gradient

$$
\frac{\partial \log L}{\partial \mathbf{\theta}} = 
\begin{bmatrix}
\frac{\partial \log L}{\partial \theta_0} \\
\frac{\partial \log L}{\partial \theta_1}
\end{bmatrix} = 
\begin{bmatrix}
\sum_i \left( \frac{x_i}{\theta_0 + \theta_1 z_i} - 1 \right) \\
\sum_i \left( \frac{x_i z_i}{\theta_0 + \theta_1 z_i} - z_i \right)
\end{bmatrix}
$$

In [82]:
ll_grad <- function(theta, z, X) {
  
  lam <- theta[1]+theta[2]*z
  l_g <- matrix(0, nrow = NROW(lam), ncol = 2)
  l_g[,1] <- X/lam - 1
  l_g[,2] <- X*z/lam - z
  return(colSums(l_g))
}

Hessian

$$
\frac{\partial^2 \log L}{\partial \mathbf{\theta}^2} = 
\begin{bmatrix}
\frac{\partial^2 \log L}{\partial \theta_0^2} & \frac{\partial^2 \log L}{\partial \theta_0 \partial\theta_1} \\
\frac{\partial^2 \log L}{\partial \theta_0 \partial\theta_1} & \frac{\partial^2 \log L}{\partial \theta_1^2} \\
\end{bmatrix} = 
\begin{bmatrix}
\sum_i \frac{-x_i}{(\theta_0+ \theta_1 z_i)^2} & \sum_i \frac{-x_i z_i}{(\theta_0+ \theta_1 z_i)^2} \\
\sum_i \frac{-x_i z_i}{(\theta_0+ \theta_1 z_i)^2} & \sum_i \frac{-x_i z_i^2}{(\theta_0+ \theta_1 z_i)^2} \\
\end{bmatrix}
$$

In [97]:
ll_hess <- function(theta, z, X) {
  
  lam <- theta[1]+theta[2]*z
  l_h <- matrix(0, nrow = 2, ncol = 2)
  l_h[1,1] <- sum(-X / lam^2)
  l_h[2,2] <- sum(-X * z^2 / lam^2)
  l_h[1,2] <- l_h[2,1] <- sum(-X * z/ lam^2)
  return(l_h)
}

In [85]:
solution <- maxLik(logLik = ll, start = c(1,1), z = z, X = X, method = "NR")
summary(solution)

--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 3 iterations
Return code 2: successive function values within tolerance limit (tol)
Log-Likelihood: -7291.117 
2  free parameters
Estimates:
     Estimate Std. error t value Pr(> t)    
[1,]  0.96816    0.01811   53.46  <2e-16 ***
[2,]  1.02872    0.02472   41.61  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------

In [107]:
solution <- maxLik(logLik = ll, grad =  ll_grad, hess = ll_hess, start = c(1,1), z = z, X = X, method = "NR")
summary(solution)

--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 3 iterations
Return code 1: gradient close to zero (gradtol)
Log-Likelihood: -7291.117 
2  free parameters
Estimates:
     Estimate Std. error t value Pr(> t)    
[1,]  0.96816    0.01811   53.46  <2e-16 ***
[2,]  1.02872    0.02472   41.61  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------

Let us compare the values of the gradient and Hessian derived analytically with those computed numerically.

In [104]:
ll_grad(c(1,1), z=z, X=X)
numDeriv::grad(ll, c(1,1), z=z, X= X)

In [105]:
ll_hess(c(1,1), z=z, X=X)
numDeriv::hessian(ll, c(1,1), z=z, X=X)

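Analytical Hessian (`ll_hess`):

| | $\theta_0$ | $\theta_1$ |
|---|---|---|
| $\theta_0$ | -6376.5 | -3518.5 |
| $\theta_1$ | -3518.5 | -3518.5 |

Numerical Hessian (`numDeriv::hessian`):

| | $\theta_0$ | $\theta_1$ |
|---|---|---|
| $\theta_0$ | -6376.5 | -3518.5 |
| $\theta_1$ | -3518.5 | -3518.5 |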

Let us compare the running time

In [None]:
## define the numerical-derivative wrappers beforehand
ll_grad_numeric <- function(theta, z, X) {
  numDeriv::grad(ll, theta, z=z, X=X)
}

ll_hess_numeric <- function(theta, z, X) {
  numDeriv::hessian(ll, theta, z=z, X=X)
}


In [118]:
## maxlik_bez ("bez" = "without"): no derivatives supplied, so maxLik differentiates numerically
benchmark(maxlik_bez = maxLik(logLik = ll, start = c(1,1), z = z, X = X, method = "NR"), 
          maxlik_analytic = maxLik(logLik = ll, grad =  ll_grad, hess = ll_hess, start = c(1,1), z = z, X = X, method = "NR"),
          maxlik_numerical = maxLik(logLik = ll, grad =  ll_grad_numeric, hess = ll_hess_numeric, start = c(1,1), z = z, X = X, method = "NR"))

| | test | replications | elapsed | relative | user.self | sys.self | user.child | sys.child |
|---|---|---|---|---|---|---|---|---|
| 2 | maxlik_analytic | 100 | 0.998 | 1.0 | 0.978 | 0.017 | 0 | 0 |
| 1 | maxlik_bez | 100 | 4.496 | 4.505 | 4.403 | 0.07 | 0 | 0 |
| 3 | maxlik_numerical | 100 | 9.165 | 9.183 | 9.023 | 0.118 | 0 | 0 |
