Introduction

This project described a Monte Carlo EM (MCEM) method to derive Maximum Likelihood Estimates (MLE) of the log-likelihood function.
In the E-step, perform K = 500 Gibbs sampling incorporated with a Metropolis-Hastings step, and drop the first 100 as a burn-in procedure.
Read article: Maximum Likelihood Algorithms for Generalized Linear Mixed Models (McCulloch 1997)
See Project Summary.pdf here

Model Notation

In this project, we consider a clustering problem. Suppose we have observed n observations, each observation is a binary process, i.e. the response $Y_{ij}=0~or~1,i=1,\cdots,n,j=1,\cdots,T$ . Here n is the number of subjects and T is the length of observation. In general, T might vary across subjects, time points may also be different. In this project, however, we simply assume that all subjects have common time length and time points. We also assume that these subjects belong to two clusters. For each cluster, the conditional expectation of response variable is

$P_{ij}=\mathbb{E}(Y_{ij}|U_i=1,X_{1,ij},Z_{1,i})=g^{-1}(\beta_1X_{1,ij}+Z_{1,i})$

$P_{ij}=E(Y_{ij}|U_i=2,X_{2,ij},Z_{2,i})=g^{-1}(\beta_2X_{2,ij}+Z_{2,i})$

where is cluster membership, $X_{c,ij}$ and $Z_{c,i}~(c = 1,2)$ are fixed and random effects, respectively. The link function $g^{-1}(x)=\frac{exp(x)}{1+exp(x)}$ is given. In a typical clustering problem, is usually unknown, and hence we treat as another random effect.

For random effects, we assume that $Z_{c,i}\sim N(0,\sigma_c^2)$ and $P(U = 1)=\pi$ (then $\pi_2=1-\pi_1$ ). Then the parameter to be estimated is $\Omega=(\beta_1,\beta_2,\sigma_1,\sigma_2,\pi_1)$ . Treating random effects as missing data, one can write the complete data likelihood function as

$L(\Omega|Y_{ij},U_i,Z_{U_i,i})=\prod_{i=1}^n\prod_{c=1}^2(\pi_cf_c(Z_{c,i})\[\prod_{j=1}^Tf_c(Y_{ij}|Z_{c,i})\])^{w_{i,c}}$

where $f_c(Z_{c,i})$ is the density function of Normal distribution, $f_c(Y_ij|Z_{c,i})=P_{ij}^{Y_ij}(1−P_{ij})^{1−Y_{ij}}$ . $ω_{ic}$ is the dummy variable of

$w_{ic}=\begin{cases}&1~~,if~subject~i~belongs~to~cluster~c\\ &0~~,otherwise \end{cases}$

the random effects and are called the latent varaibles and is called complete data. The distribution of depends on $\pi_1$ and the distribution of depends on , $\sigma_1$ and $\sigma_2$ .

Simluation Setup

Generate 100 simulations. In each simulation, set and . The true values of parameter are: $\beta_1 = 1,\beta_2 = 5,\pi_1 = 0.6,\sigma_1 = 2$ and $\sigma_2 = 10.$

Procedures

Flow Chart

Complete Observed Data

We estimated the augmented posterior liklihood and computed the expected log-augmented posterior at each iteration.

$\begin{aligned} &Q(\Omega,\Omega^*)=\int \int_{(U,Z)} \log(\Omega|Y,U,Z)f(U,Z|Y,\Omega^*)dUdZ\\ &= \int \int_{(U,Z)} \log \prod_{i=1}^{100} (\pi_1 \frac{e^{-Z_{1,i}^2/2\sigma_1^{2}}}{\sqrt{2\pi\sigma_1^2}}[\prod_{j=1}^{10}\frac{\exp\{Y_{ij}(\beta_1X_{ij}+Z_{1,i})\}}{1+\exp\{(\beta_1X_{ij}+Z_{1,i})\}}])^{U_i} \\ &((1-\pi_1) \frac{e^{-Z_{2,i}^2/2\sigma_2^2}}{\sqrt{2\pi\sigma_2^2}}[\prod_{j=1}^{10}\frac{\exp\{Y_{ij}(\beta_2X_{ij}+Z_{2,i})\}}{1+\exp\{(\beta_2X_{ij}+Z_{2,i})\}}])^{1-U_i}f(U,Z|Y,\Omega^*)dUdZ\\ &=\int \int_{(U,Z)} [\sum_{i=1}^{100}(U_i \log\pi_1+(1-U_i)\log(1-\pi_1)) \\ &+ \sum_{i=1}^{100}(U_i(\frac{-1}{2}\log(2\pi\sigma_1^2)-\frac{Z_{1,i}^2}{2\sigma_1^2})+(1-U_i)(\frac{-1}{2}\log(2\pi\sigma_2^2)-\frac{Z_{2,i}^2}{2\sigma_2^2}))\\ &+\sum_{i=1}^{100}\sum_{j=1}^{10}U_i(Y_{ij}(\beta_1X_{ij}+Z_{1,i})-\log(1+\exp\{\beta_1X_{ij}+Z_{1,i}\}))+ (1-U_i)(Y_{ij}(\beta_2X_{ij}+Z_{2,i})-\log(1+\exp\{\beta_2X_{ij}+Z_{2,i}\})) ]f(U,Z|Y,\Omega^*)dUdZ \\ \end{aligned}$

Given the current guess to the posterior mode $\theta^*$ , we supply the method of Monte Carlo to calculate the $Q(\Omega,\Omega^*)$ function. In particular, the Monte Carlo E-step is given as

$\begin{aligned} &a.~Draw~(Z_1,U_1),(Z_2,U_2),\cdots,(Z_m,U_m) \overset{i.i.d}{\sim}f(Z,U|Y,\Omega^*)\\ &b.~Let~\hat{Q}_{i+1}(\Omega,\Omega^*)=\frac{1}{m}\sum_{j=1}^m\log~p(\Omega|Z,U,Y) \end{aligned}$

Sampling (Markov chain Monte Carlo)

Since it is difficult to sample directly from the multivariate posterior distribution $f(U_i,Z_i|Y_{ij},\Omega)$ , We can use Gibbs Sampling, a Markov chain Monte Carlo (MCMC) algorithm to obtain a sequence of observations which are approximated from the multivariate distribution.

$\begin{aligned}f(U_i|Z_{U_i},\mathbf{Y}_i) &=\frac{f(U_i,Z_i,\mathbf{Y}_i|\Omega)}{f(Z_{i},\mathbf{Y}_i|\Omega)}\\ &=\frac{\pi_{U_i}f_{U_i}(Z_{i}|\sigma_1,\sigma_2)\prod\limits_{j=1}^Tf_{U_i}(Y_{ij}|Z_{i},\Omega)}{\sum\limits_{c=1}^2\pi_cf_c(Z_{i}|\sigma_1,\sigma_2)\prod\limits_{j=1}^Tf_{c}(Y_{ij}|Z_{i},\Omega)} \end{aligned}$

$\begin{aligned}f(Z_i|U_{i},\mathbf{Y}_i)&=\frac{f(U_i,Z_i,\mathbf{Y}_i|\Omega)}{f(Z_{i},\mathbf{Y}_i|\Omega)}\\&=\frac{\pi_{U_i}f_{U_i}(Z_{i}|\sigma_1,\sigma_2)\prod\limits_{j=1}^Tf_{U_i}(Y_{ij}|Z_{i},\Omega)}{\int_{\mathbb{R}}\pi_{U_i}f_{U_i}(z|\sigma_1,\sigma_2)\prod\limits_{j=1}^Tf_{U_i}(Y_{ij}|z,\Omega)dz}\end{aligned}$

Then, we suppose that $(U_{(k),i},Z_{(k),U_{(k)},i})$ is the th component of the th sample, we want to draw the th component of the th sample from these two conditional distributions. We draw

$\begin{aligned} &U_{(k+1),i}\sim f_{U_i|Z_{U_i},i,\mathbf{Y}_i}(u|Z_{U_{(k)},i},\mathbf{Y}_i,\Omega)\\ &Z_{(k+1),U_{(k+1)},i}\sim f_{Z_{U_i}|U_i,\mathbf{Y}_i}(z|U_{(k+1)},\mathbf{Y}_i,\Omega) \end{aligned}$

Since still obey Bernoulli distribution in prior distribution, it is obtainable for normal sampling. However, for variable , it is still hard to directly sample from the posterior distribution because its intractable intergal denominator. So, we proposed Metropolis-Hastings methods to generate random numbers. More details in the following algorithm charts.

where the accept probability

$A_k(Z,Z^*) = \min \{1,\frac{f_{Z|Y}(Z^*|Y,U,\Omega)h_Z(Z)}{f_{Z|Y}(Z|Y,U,\Omega)h_Z(Z^*)}\}$

we choose marginal distribution of to substitude the proposed density , we could simplify the accept probability as follow

$\begin{aligned} &\frac{f_{Z|Y}(Z^*|Y,U,\Omega)h_Z(Z)}{f_{Z|Y}(Z|Y,U,\Omega)h_Z(Z^*)}\\ &=\frac{\prod_{j=1}^{10}f_{Y_{ij}|Z}(Y_{ij}|Z^*,U,\Omega)f_Z(Z^*|U,\Omega)f_Z(Z|U,\Omega)}{\prod_{j=1}^{10}f_{Y_{ij}|Z}(Y_{ij}|Z,U,\Omega)f_Z(Z|U,\Omega)f_Z(Z^*|U,\Omega)}\\ &=\frac{\prod_{j=1}^{10}f_{Y_{ij}|Z}(Y_{ij}|Z^*,U,\Omega)}{\prod_{j=1}^{10}f_{Y_{ij}|Z}(Y_{ij}|Z,U,\Omega)}\\ &=\frac{\prod_{j=1}^{10} \frac{\exp\{Y_{ij}(\beta_1^{U_i}\beta_2^{1-U_i}X_{ij}+Z^*)\}}{1+\exp\{\beta_1^{U_i}\beta_2^{1-U_i}X_{ij}+Z^*\}}}{\prod_{j=1}^{10} \frac{\exp\{Y_{ij}(\beta_1^{U_i}\beta_2^{1-U_i}X_{ij}+Z)\}}{1+\exp\{\beta_1^{U_i}\beta_2^{1-U_i}X_{ij}+Z\}}} \end{aligned}$

Then we use the Monte Carlo Integrating to approximate the expectation of the log-likelihood.

Estimation (Netwon-Raphson algorithm)

Partial differentiate the complete log-likelihood of the parameters and set derivatives to 0, we get the maximum likelihood estimators

$\hat{\pi}_{MLE,1}=\frac{1}{n}\sum_{i=1}^n\mathbb{I}_{\{U_i=1\}}$

$\hat{\sigma}_{MLE,c}=\sqrt{\frac{\sum_{i=1}^n\mathbb{I}_{\{U_i=c\}}Z_{c,i}^2}{\sum_{i=1}^n\mathbb{I}_{\{U_i=c\}}}}$

However the MLE of $\beta_c$ is hard to obtain as a close form. We can use Netwon-Raphson algorithm to iteratively approximate it.

$\beta_c^{(t+1)} =\beta^{(t)}_c+\frac{\sum_{i=1}^n\mathbb{I}_{\{U_i=c\}}\sum_{j=1}^T(Y_{ij}X_{ij}-\frac{X_{c,ij}e^{(\beta_c^tX_{c,ij}+Z_{c,i})}}{1+e^{\beta_c^tX_{c,ij}+Z_{c,i}}})}{\sum_{i=1}^n\mathbb{I}_{\{U_i=c\}}\sum_{j=1}^T\frac{X_{c,ij}^2e^{\beta_c^tX_{c,ij}+Z_{c,i}}}{(1+e^{\beta_c^tX_{c,ij}+Z_{c,i}})^2}}$

Louis Turbo EM

Since the posterior predictions are hard to be estimated, we can apply some acceleration methods to better estimate them. Therefore, it helps to calibrate the estimating process and save time.

The convergence of EM algorithm is governed by the fraction of missing information. Thus, when the proportion of missing data is high, convergence can be quite slow. In this regard, Louis (1982) has proposed a device for accelerating the convergence of the EM algorithm.

The former iterative maximization of posterior predictive function could be written as $\begin{aligned} \theta^{(t+1))} &= \theta^{(t)}+\bigg[-\frac{\partial^2Q(\theta,\theta^{(t)})}{\partial\theta^2}\bigg|_{\theta^*} \bigg]^{-1} \frac{\partial Q(\theta,\theta^{(m)})}{\partial\theta}\\ &=\theta^{(t)}+\bigg[-\frac{\partial^2 \log p(\theta,|Y)}{\partial\theta^2}\bigg|_{\theta^*} \bigg]^{-1} \frac{\partial \log p(\theta,|Y)}{\partial\theta} \end{aligned}$

where $\theta^*$ is the mode of the augmented posterior predictive function.

Expanding $\partial Q(\theta,\theta^{(t)})/\partial \theta$ about $\theta^{(t)}$ , we have

$0=\frac{\partial Q(\theta,\theta^{(t)})}{\partial\theta}\bigg|_{\theta^*}\approx \frac{\partial Q(\theta,\theta^{(t)})}{\theta}\bigg|_{\theta^{(t)}}+\bigg[\frac{\partial^2 Q(\theta,\theta^{(t)})}{\theta^2}\bigg|_{\theta^{(t)}} \bigg] (\theta^*-\theta)$

Louis also notes that this approximation is the most useful in a neighborhood of the mode $\theta^*$ .

So, we have

$\frac{\partial Q(\theta,\theta^{(t)})}{\partial \theta}\bigg|_{\theta^{t}}= -\frac{\partial Q(\theta,\theta^{(t)})}{\partial\theta^2}\bigg|_{\theta^{(t)}}(\theta^*-\theta^{(t)})$

Combine the Newton-Raphson with the acceleration results, we obtain

$\begin{aligned} \theta^{(t+1)}&=\theta^{(t)}+\bigg[-\frac{\partial^2 Q(\theta,\theta^{(t)})}{\partial \theta^2}\bigg|_{\theta^{(t)}} \bigg]\times\bigg[-\frac{\partial^2 Q(\theta,\theta^{(t)})}{\partial \theta^2}\bigg|_{\theta^{(t)}} (\theta^{*}-\theta^{(t)}) \bigg]\\ &=\theta^{(t)}+\bigg[\frac{-\partial^2 \log p(\theta|Y)}{\theta^2}\bigg|_{\theta^{(t)}} \bigg]\times \bigg[-\int\frac{\partial^2 \log p(\theta|Y,Z)}{\partial\theta^2}p(Z|Y,\theta^{(t)})dZ\bigg|_{\theta^{(t)}}\bigg] (\theta^{*}-\theta^{(t)})\\ &=\theta^{(t)}+\bigg[-\int \frac{\partial^2\log p(\theta|Y,Z)}{\partial\theta^2}p(Z|\theta^{(t)},Y)dZ\bigg|_{\theta^{(t)}}-var\bigg[\frac{\partial \log p(\theta|Y,Z)}{\partial\theta}\bigg|_{\theta^{(t)}}\\ &\times \bigg[-\int\frac{\partial^2 \log p(\theta|Y,Z)}{\partial\theta^2}p(Z|Y,\theta^{(t)})dZ\bigg|_{\theta^{(t)}}\bigg] (\theta^{*}-\theta^{(t)}) \bigg] \bigg] \end{aligned}$

Results

Data Set Up

MCEM

EM Algorithm is sensitive to the initial values of parameters.
For each parameter, we calculate the changing rate of it and let $\epsilon^t$ be as following

$\epsilon^{t}=\max\{\frac{\hat{\beta}_1^{t+1}-\hat{\beta}_1^{t}}{\hat{\beta}_1^{t}+\delta},\frac{\hat{\beta}_2^{t+1}-\hat{\beta}_2^{t}}{\hat{\beta}_2^{t}+\delta},\frac{\hat{\sigma}_1^{t+1}-\hat{\sigma}_1^{t}}{\hat{\sigma}_1^{t}+\delta},\frac{\hat{\sigma}_2^{t+1}-\hat{\sigma}_2^{t}}{\hat{\sigma}_2^{t}+\delta},\frac{\hat{\pi}_1^{t+1}-\hat{\pi}_1^{t}}{\hat{\pi}_1^{t}+\delta}\}$

where $\delta>0$ is to assure that the denominator is positive. Setting the threshold $\epsilon_0$ , if $\epsilon^t<\epsilon_0$ then we will consider the simulation converges.

In this project, we choose $\delta=10^{-12},\epsilon_0=2.5*10^{-2}$

Values

Our convergences are pretty good, all parameters are converged in less than 50 steps, which cost about 1 minute.
We monitor the convergence of the alogorithm by plotting $\Omega^*$ vs. iteration number i and the plot reveals random flucuation about the line $\Omega = \Omega^*$ . So, we may continue with a large value of m to decrease the system variability.

Variables	True Value	Initial Value	Converged Value
$\beta_1$	1	0	0.9953680
$\beta_2$	1	0	1.4076125
$\sigma_1$	2	1	1.387342
$\sigma_2$	10	5	9.132040
$\pi_1$	0.6	0.8	0.480500

Evaluation

We conducted different number of simulations: $100,200,\cdots,1000$ and evaluate the corresponding MSE. From the result, we concluded that MCEM could obtain a fair results based on the intialization mentioned before.
We demonstrate the first 100 experiments of parameters and its estimation results below

The MSE of $\beta_2$ and $\sigma_2$ by MCEM are much bigger than other parameters. This may be the result of the difference of the magnitudes.

N	$\beta_1$	$\beta_2$	$\sigma_1$	$\sigma_2$	$\pi_1$
100	0.01931882	0.04455292	0.06768088	0.5722849	0.02300298
200	0.01833736	0.04397230	0.06758311	0.5673952	0.02303765
300	0.01820673	0.04591263	0.06918503	0.6156710	0.02475426
400	0.01826401	0.04716113	0.06819809	0.6093218	0.02446871
500	0.01859774	0.04598016	0.06920732	0.6096665	0.02488787
600	0.01844890	0.04572807	0.06838667	0.6032845	0.02462379
700	0.01873417	0.04545865	0.06950417	0.6325660	0.02525724
800	0.01852523	0.04516546	0.06837866	0.6270182	0.02496853
900	0.01822611	0.04475583	0.06697709	0.6238291	0.02465581
1000	0.01824120	0.04517658	0.06683521	0.6233873	0.02449014

LTEM vs. MCEM

Louis Turbo EM accelerates the EM algorithm as we can see that LTEM attains better result with good fixed initialization

Contact

gaochy5@mail2.sysu.edu.cn
huangyy73@mail2.sysu.edu.cn

Name		Name	Last commit message	Last commit date
Latest commit History 103 Commits
Rcode		Rcode
Summary Project		Summary Project
Algorithm.png		Algorithm.png
LICENSE		LICENSE
LTEM_beta1.png		LTEM_beta1.png
LTEM_beta2.png		LTEM_beta2.png
LTEM_pi1.png		LTEM_pi1.png
LTEM_sigma1.png		LTEM_sigma1.png
LTEM_sigma2.png		LTEM_sigma2.png
MSE.png		MSE.png
README.md		README.md
_config.yml		_config.yml
all_data.png		all_data.png
beta.png		beta.png
flow_chart.png		flow_chart.png
pi.png		pi.png
sigma.png		sigma.png

License

Gaochenyin/Monte-Carlo-Expectation-Maximization

Folders and files

Latest commit

History

Repository files navigation