**University of Edinburgh**\
**School of Mathematics**\
**Bayesian Data Analysis, 2020/2021, Semester 2**\
**Daniel Paulin & Nicolò Margaritella**

**Solutions for Workshop 1: Introduction to Bayesian inference in `R`**


1.  **Analysis of binomial data: drug**. Consider the example from
lecture 1 where a new drug is being considered for relief of chronic
pain, with the success rate $\theta$ being the proportion of
patients experiencing pain relief. In the past, drugs of this type
have shown variable pain relief rates, with a mean of $40\%$ and a
standard deviation of $10\%$. We have seen that these could be
translated into a $\text{Beta}(9.2,13.8)$ distribution. This drug
had 15 successes out of 20 patients.

**(i) Calculate the posterior distribution of the success rate $\theta$.**

    

The posterior distribution of the success rate is
        $$\begin{aligned}
        p(\theta\mid y)&\propto f(y\mid\theta)\pi(\theta)\\
        &=\binom{n}{y}\theta^{y}(1-\theta)^{n-y}\frac{1}{B(a,b)}\theta^{a-1}(1-\theta)^{b-1}\\
        &\propto \theta^{a+y-1}(1-\theta)^{b+n-y-1},\end{aligned}$$
        which we recognise as the kernel of a beta distribution with
        parameters $a+y$ and $b+n-y$. Therefore,
        $$\theta\mid y\sim\text{Beta}(a+y,b+n-y).$$ Taking $a=9.2$,
        $b=13.8$, $n=20$, and $y=15$, results in a
        $\text{Beta}(24.2,18.8)$ distribution.
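
As a quick numerical sanity check of the conjugate result, one can normalise the product of likelihood and prior on a grid and compare it with the $\text{Beta}(a+y,b+n-y)$ density. The sketch below is illustrative only; the grid (midpoints of 1000 equal bins) and the variable names are choices made here, not part of the workshop script.

```r
# Prior parameters and data from the question
a <- 9.2; b <- 13.8; n <- 20; y <- 15

# Midpoints of 1000 equal-width bins on (0, 1); bin width h
h <- 0.001
theta.grid <- seq(h / 2, 1 - h / 2, by = h)

# Unnormalised posterior = likelihood x prior, then normalise by the midpoint rule
unnorm <- dbinom(y, size = n, prob = theta.grid) * dbeta(theta.grid, a, b)
grid.post <- unnorm / sum(unnorm * h)

# Largest discrepancy from the closed-form Beta(a + y, b + n - y) density
max.diff <- max(abs(grid.post - dbeta(theta.grid, a + y, b + n - y)))
cat("Max abs difference:", max.diff, "\n")
```

A discrepancy close to zero indicates that the grid approximation and the closed-form posterior agree.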


In [None]:
alpha.prior <- 9.2
beta.prior  <- 13.8
n           <- 20
num.succ    <- 15
p.hat       <- num.succ/n

# Q i. Conjugate update: alpha' = alpha.prior + num.succ
#                    and beta'  = beta.prior + n - num.succ
alpha.post  <- alpha.prior+num.succ
beta.post   <- beta.prior +n-num.succ
cat("Posterior alpha=",alpha.post,"and beta=",beta.post,"\n")

**(ii) What are the posterior mean and $95\%$ highest posterior density (HPD) interval for the response rate?\
   *Hint*: For computing the HPD interval you can use, for instance, the function `hpd` from the `R` package `TeachingDemos`.**


The posterior mean is $24.2/(24.2+18.8)=0.563$. Using the
         function `hpd` from the package `TeachingDemos` (see `R`
         code below), we obtain the HPD interval $(0.416,0.708)$.
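
For a unimodal posterior, the HPD interval is the shortest interval with the required coverage, so it can also be found without an extra package by minimising the interval width over the probability left in the lower tail. The following is a minimal base-`R` sketch under that unimodality assumption; the names `width`, `hpd.int` and `eq.int` are illustrative.

```r
a.post <- 24.2; b.post <- 18.8; level <- 0.95

# Width of the interval that leaves probability p in the lower tail
# and 1 - level - p in the upper tail
width <- function(p) {
  qbeta(p + level, a.post, b.post) - qbeta(p, a.post, b.post)
}

# Search over p in (0, 1 - level) for the shortest interval
opt <- optimize(width, interval = c(0, 1 - level))
hpd.int <- qbeta(c(opt$minimum, opt$minimum + level), a.post, b.post)

# Equal-tailed interval for comparison
eq.int <- qbeta(c(0.025, 0.975), a.post, b.post)

cat("HPD interval:         ", round(hpd.int, 3), "\n")
cat("Equal-tailed interval:", round(eq.int, 3), "\n")
```

By construction both intervals have posterior probability $0.95$, and the HPD interval is never wider than the equal-tailed one.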

In [None]:
# Q ii. posterior mean=a'/(a'+b') 
cat("Posterior mean=",alpha.post/(alpha.post+beta.post),"\n")

require(TeachingDemos)
cat("The 95% HPD for probability of success is",
    hpd(qbeta, shape1=alpha.post, shape2=beta.post),"\n")

**(iii) Compute a symmetric $95\%$ credible interval. Compare this to the $95\%$ HPD interval.**


In [None]:
cat("0.025 and 0.975=",qbeta(p=c(0.025,0.975),shape1=alpha.post,
                             shape2=beta.post))

By computing the $2.5\%$ and $97.5\%$ percentiles of the
posterior distribution, we obtain the symmetric (equal-tailed) credible
interval $(0.414,0.706)$. The HPD and equal-tailed intervals
differ only slightly because in this case the posterior
distribution is unimodal and almost symmetric around the mean (see plot below).

In [None]:
x <- seq(0.1,0.99,by=0.01)
plot(x,dbeta(x,shape1=alpha.post,shape2=beta.post),type="l")

**(iv) What is the probability that the true success rate is greater than $0.6$?**



The probability that the true success rate is greater than
$0.6$ is $0.316$.

In [None]:
#Q iv. Pr(p >0.6)
cat("Pr(p>0.6)=",1-pbeta(0.6,shape1=alpha.post,shape2=beta.post),"\n")

**(v) How is this value affected if a uniform prior is adopted? And how is it affected in the case that Jeffreys' prior is adopted?**


Under a uniform prior, i.e., with a $\text{Beta}(1,1)$ prior
distribution, the above probability changes to $0.904$. With a
Jeffreys' prior, it is $0.918$.

In [None]:
#Q v. with Uniform prior
alpha.post2 <- 1+num.succ
beta.post2  <- 1+n-num.succ
cat("Pr(p>0.6)=",1-pbeta(0.6,shape1=alpha.post2,shape2=beta.post2),"\n")

#Q v. with Jeffreys prior
alpha.post2 <- 0.5+num.succ
beta.post2  <- 0.5+n-num.succ
cat("Pr(p>0.6)=",1-pbeta(0.6,shape1=alpha.post2,shape2=beta.post2),"\n")

**(vi) Using the original $\text{Beta}(9.2,13.8)$ prior, suppose $40$ more patients were entered into the study.
What is the chance that at least $25$ of them experience pain relief? *Hint*:
You might want to use the `beta` and `gamma` functions implemented in `R`.**

Let $z$ denote the number of positive responses in a further
 $m=40$ patients. We must first calculate the posterior
 predictive distribution

 $$\begin{aligned}
 f(z\mid y)&=\int_{\Theta}f(z\mid\theta)p(\theta\mid y)\text{d}\theta\\
 &=\int_{0}^{1}\binom{m}{z}\theta^{z}(1-\theta)^{m-z}\frac{1}{B(a+y,b+n-y)}\theta^{a+y-1}(1-\theta)^{b+n-y-1}\text{d}\theta\\
 &=\binom{m}{z} \frac{1}{B(a+y,b+n-y)}\int_{0}^{1}\theta^{a+y+z-1}(1-\theta)^{b+n-y+m-z-1}\text{d}\theta\\
 &=\binom{m}{z} \frac{B(a+y+z,b+n-y+m-z)}{B(a+y,b+n-y)}\int_{0}^{1}\frac{1}{B(a+y+z,b+n-y+m-z)}\theta^{a+y+z-1}(1-\theta)^{b+n-y+m-z-1}\text{d}\theta\\
 &=\binom{m}{z} \frac{B(a+y+z,b+n-y+m-z)}{B(a+y,b+n-y)}\end{aligned}$$

 This is the beta-binomial distribution. It is now
 straightforward to find that $\Pr(z\geq 25)=0.329$ (see `R`
 script, using the CDF function `pbbinom` from the package
 `extraDistr`).
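
Following the hint, the same probability can be evaluated directly from the formula above with base-`R` functions only. The sketch below works on the log scale with `lchoose` and `lbeta` (a choice made here for numerical stability, not taken from the workshop script) and checks that the predictive probabilities sum to one.

```r
a.post <- 24.2; b.post <- 18.8; m <- 40
z <- 0:m

# log f(z | y) = log choose(m, z) + log B(a' + z, b' + m - z) - log B(a', b')
log.pmf <- lchoose(m, z) + lbeta(a.post + z, b.post + m - z) - lbeta(a.post, b.post)
pmf <- exp(log.pmf)

cat("Sum of predictive probabilities:", sum(pmf), "\n")
p.at.least.25 <- sum(pmf[z >= 25])
cat("Pr(z >= 25) =", round(p.at.least.25, 3), "\n")
```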


In [None]:
#Q vi. Posterior marginal for event (posterior predictive dist'n):
# Pr(x>=25|n=40); need the beta-binomial distribution
# This is installed by default in Kaggle.
#In case it is not installed for you, it can be loaded using the package extraDistr
#install.packages("extraDistr")
library(extraDistr)
cat("Posterior Pr(X>=25|n=40) based on Beta-Binom(alpha.post, beta.post, n) \n")

cat(pbbinom(q = 24, size = 40, alpha = alpha.post, beta = beta.post, lower.tail = FALSE),"\n")
#lower.tail = FALSE means that we compute P(X>x) and not P(X<=x) (which is the default lower.tail = TRUE case)

**(vii) We might ask whether the observed data is 'compatible' with
the expressed prior distribution. One method is to calculate
the predictive probability of observing such an extreme number
of successes under this prior: this is a standard $p$-value
but where the null hypothesis is a distribution. Use the
predictive distribution for 20 future patients to find the
probability of getting at least $15$ successes (i.e., at least
$15$ patients experiencing pain relief). Do you think this
suggests the data are incompatible with the prior?**



We start by calculating the prior predictive distribution
  $$\begin{aligned}
  f(y)&=\int_{\Theta}f(y\mid\theta)p(\theta)\text{d}\theta\\
  &=\int_{0}^{1}\binom{n}{y}\theta^{y}(1-\theta)^{n-y}\frac{1}{B(a,b)}\theta^{a-1}(1-\theta)^{b-1}\text{d}\theta\\
  &=\binom{n}{y}\frac{1}{B(a,b)}\int_{0}^{1}\theta^{a+y-1}(1-\theta)^{b+n-y-1}\text{d}\theta\\
  &=\binom{n}{y}\frac{B(a+y,b+n-y)}{B(a,b)}\end{aligned}$$ The
  prior predictive probability of observing at least 15 positive
  responses can then be computed from the last expression and it
  is 0.01526 (see `R` script for further details). This suggests
  some evidence that the data and the prior are incompatible.
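
The prior predictive probability can also be approximated by simulation, which makes the two-stage structure of the calculation explicit: first draw $\theta$ from the prior, then draw $y$ from the binomial given $\theta$. This is a Monte Carlo sketch; the seed and the number of simulations are arbitrary choices made here.

```r
set.seed(2021)
n.sim <- 1e5

# Stage 1: draw success rates from the Beta(9.2, 13.8) prior
theta.sim <- rbeta(n.sim, shape1 = 9.2, shape2 = 13.8)

# Stage 2: draw the number of successes out of 20 for each simulated rate
y.sim <- rbinom(n.sim, size = 20, prob = theta.sim)

# Monte Carlo estimate of the prior predictive Pr(y >= 15)
p.mc <- mean(y.sim >= 15)
cat("Monte Carlo estimate of Pr(y >= 15):", p.mc, "\n")
```

The estimate should be close to the exact beta-binomial value quoted above, up to Monte Carlo error.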

In [None]:
#Q vii. Bayesian P-value based on the Prior
#    Probability of at least 15 successes in n=20 trials. 
cat("Prior Pr(X>=15|n=20) based on Beta-Binom(alpha, beta, n) \n")
cat(pbbinom(q = 14, size = 20, alpha = alpha.prior, beta = beta.prior, lower.tail = FALSE),"\n")

**(viii) Check for prior/data conflict by making the
prior/likelihood/posterior plot.**

The prior/likelihood/posterior plot is shown in the figure
below. There is little overlap between the regions where the prior
and the likelihood place most of their mass, and the prior has a considerable effect on
the posterior. Re-doing the same plot with Jeffreys'
prior, we see that the prior then has essentially no
effect on the posterior, with most of the information coming from
the likelihood.


In [None]:
#Q viii.  pictures of densities and obs'n, and likelihood- 10-15% overlap?
theta <- seq(0.0,0.99,by=0.01)
binomial.likelihood.norm.constant <- 
  integrate(f=function(theta) { dbinom(x=num.succ,size=n,prob=theta)},
            lower=0,upper=1)$value
cat("Normalizing constant:", binomial.likelihood.norm.constant,"\n")

prior.dens <- dbeta(x=theta,shape1=alpha.prior,shape2=beta.prior)
post.dens  <- dbeta(x=theta,shape1=alpha.post, shape2=beta.post)
likelihood <- dbinom(x=num.succ,size=n,prob=theta)/
  binomial.likelihood.norm.constant
my.ylim <- range(c(prior.dens,post.dens,likelihood))
plot(theta,prior.dens,type='l',col='blue',ylim=my.ylim,xlab='Pr(Success)',
     ylab='',main='Prior, Likelihood, Posterior for Pr(Success)')
lines(theta,likelihood,col='green',lty=2)
lines(theta,post.dens,col='red',lty=3)
legend('topleft',legend=c('Prior','Likelihood','Posterior'),
       col=c('blue','green','red'),lty=1:3,bty="n" )

2. **Analysis of drug data with mixture priors**. In the previous 
example, suppose that most drugs $(95\%)$ are assumed to come from
the stated $\text{Beta}(9.2,13.8)$ prior, but there is a small
chance that the drug might be a 'winner'. 'Winners' are assumed to
have a prior distribution with mean $0.8$ and standard deviation
$0.1$.

**(i) What Beta distribution might represent the 'winners' prior?
Remember that a $\text{Beta}(a,b)$ distribution has mean
$\mu=a/(a+b)$ and variance $\sigma^2=ab/\{(a+b)^2(a+b+1)\}$.**


Note that by rearrangement, we have $a+b=a/\mu$, and so $b=a(1-\mu)/\mu$. Substituting this into the equation for $\sigma^2$, we obtain $\sigma^2=a^2((1-\mu)/\mu)/\{(a/\mu)^2+(a/\mu)^3\}$.
Solving this equation for $a$, and then computing $b=a(1-\mu)/\mu$, gives a $\text{Beta}(12,3)$ prior.
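
The algebra can be shortened by noting that $ab/(a+b)^2=\mu(1-\mu)$, so the variance formula reduces to a single equation in $a+b$. A worked version of the calculation for $\mu=0.8$ and $\sigma=0.1$:

$$\begin{aligned}
\sigma^2&=\frac{\mu(1-\mu)}{a+b+1}\quad\Longrightarrow\quad a+b=\frac{\mu(1-\mu)}{\sigma^2}-1=\frac{0.8\times 0.2}{0.1^2}-1=15,\\
a&=\mu(a+b)=0.8\times 15=12,\qquad b=(1-\mu)(a+b)=0.2\times 15=3.
\end{aligned}$$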

In [None]:
#Q i. parameters of Beta such that mu=0.8 and sigma=0.1
# solving for Beta shape parameters given mu and sigma
# showing this requires a few lines of calculations on a paper 
beta.param.calc <- function(mu,sigma) {
  alpha <- (mu^2*(1-mu)-mu*sigma^2)/sigma^2
  beta  <- alpha/mu-alpha
  return(list(alpha=alpha,beta=beta))
}

winner.beta.par <- beta.param.calc(mu=0.8,sigma=0.1)
alpha.winner <- winner.beta.par$alpha
beta.winner  <- winner.beta.par$beta
cat("Beta dist'n for winner, shape1=",alpha.winner,"shape2=",
    beta.winner ,"\n")

**(ii) Plot the mixture prior.**


The mixture prior
 $\theta\sim\pi\text{Beta}(a_1,b_1)+(1-\pi)\text{Beta}(a_2,b_2)$
 is plotted in the figure below.

In [None]:
#Q ii. Draw picture of the mixture prior- seems sensible
pi.1  <- 0.95
theta <- seq(0.00,0.99,by=0.01)

prior.dens.winner <- dbeta(x=theta,shape1=alpha.winner,shape2=beta.winner)
prior.mix.density <- pi.1*prior.dens+(1-pi.1)*prior.dens.winner
plot(theta,prior.mix.density,xlab="Prob. of Success",ylab="",
     main="Mixture Prior for Binomial p",type="l")

**(iii) What is now the chance that the response rate is greater than
$0.6$?\
*Hint*: You might start by showing that if
$$\theta\sim\pi \text{Beta}(a_1,b_1)+(1-\pi)\text{Beta}(a_2,b_2),$$
then
$$\theta\mid y\sim \omega_1\text{Beta}(a_1+y,b_1+n-y)+(1-\omega_1)\text{Beta}(a_2+y,b_2+n-y),$$
where
$$\omega_1= \pi\frac{B(a_1+y,b_1+n-y)}{B(a_1,b_1)}\left(\pi\frac{B(a_1+y,b_1+n-y)}{B(a_1,b_1)}+(1-\pi)\frac{B(a_2+y,b_2+n-y)}{B(a_2,b_2)}\right)^{-1}.$$
Here $y$ denotes the number of successes.**



We will start by finding the posterior distribution of
  $\theta$.

  $$\begin{aligned}
  p(\theta\mid y)&\propto \binom{n}{y}\theta^{y}(1-\theta)^{n-y}\left\{\pi\frac{1}{B(a_1,b_1)}\theta^{a_1-1}(1-\theta)^{b_1-1}
  +(1-\pi)\frac{1}{B(a_2,b_2)}\theta^{a_2-1}(1-\theta)^{b_2-1}\right\}\\
  &\propto \pi\frac{1}{B(a_1,b_1)}\theta^{a_1+y-1}(1-\theta)^{b_1+n-y-1}+(1-\pi)\frac{1}{B(a_2,b_2)}\theta^{a_2+y-1}
  (1- \theta)^{b_2+n-y-1}\\
  &= \pi\frac{B(a_1+y,b_1+n-y)}{B(a_1,b_1)}\frac{1}{B(a_1+y,b_1+n-y)}\theta^{a_1+y-1}(1-\theta)^{b_1+n-y-1}\\
  &~~~+(1-\pi)\frac{B(a_2+y,b_2+n-y)}{B(a_2,b_2)}\frac{1}{B(a_2+y,b_2+n-y)}\theta^{a_2+y-1}(1-\theta)^{b_2+n-y-1}\\
  &= \pi\frac{B(a_1+y,b_1+n-y)}{B(a_1,b_1)}\text{Beta}(\theta\mid a_1+y,b_1+n-y)\\
  &~~~+(1-\pi)\frac{B(a_2+y,b_2+n-y)}{B(a_2,b_2)}\text{Beta}(\theta\mid a_2+y,b_2+n-y).
  \end{aligned}$$

  We are almost there, but note that the 'weights'
  $\pi\frac{B(a_1+y,b_1+n-y)}{B(a_1,b_1)}$ and
  $(1-\pi)\frac{B(a_2+y,b_2+n-y)}{B(a_2,b_2)}$ do not sum up to
  one. Renormalising, we finally obtain that
  $$\theta\mid y\sim\omega_1\text{Beta}(\theta\mid a_1+y,b_1+n-y)+(1-\omega_1)\text{Beta}(\theta\mid a_2+y,b_2+n-y)$$
  with
  $$\omega_1=\pi\frac{B(a_1+y,b_1+n-y)}{B(a_1,b_1)}\left(\pi\frac{B(a_1+y,b_1+n-y)}{B(a_1,b_1)}+(1-\pi)\frac{B(a_2+y,b_2+n-y)}{B(a_2,b_2)}\right)^{-1}$$
  We are now ready to compute the required probability (see `R`
  script), which turns out to be $0.58062$.
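
One way to check the mixture result is to compare the closed-form answer with direct numerical integration of the unnormalised posterior. The sketch below does this with `integrate`; the weights are computed on the log scale with `lbeta`, and all variable names are illustrative rather than taken from the workshop script.

```r
a1 <- 9.2; b1 <- 13.8   # standard prior component
a2 <- 12;  b2 <- 3      # 'winners' prior component
pi1 <- 0.95; y <- 15; n <- 20

# Unnormalised posterior: binomial likelihood times mixture prior
unpost.mix <- function(theta) {
  dbinom(y, size = n, prob = theta) *
    (pi1 * dbeta(theta, a1, b1) + (1 - pi1) * dbeta(theta, a2, b2))
}
p.numeric <- integrate(unpost.mix, 0.6, 1)$value / integrate(unpost.mix, 0, 1)$value

# Closed form: posterior weight omega_1, computed via log beta functions
lw1 <- log(pi1)     + lbeta(a1 + y, b1 + n - y) - lbeta(a1, b1)
lw2 <- log(1 - pi1) + lbeta(a2 + y, b2 + n - y) - lbeta(a2, b2)
omega1 <- 1 / (1 + exp(lw2 - lw1))
p.closed <- omega1 * pbeta(0.6, a1 + y, b1 + n - y, lower.tail = FALSE) +
  (1 - omega1) * pbeta(0.6, a2 + y, b2 + n - y, lower.tail = FALSE)

cat("omega_1 =", omega1, "\n")
cat("Pr(theta > 0.6): numeric =", p.numeric, " closed form =", p.closed, "\n")
```

Agreement between the two numbers supports both the posterior weights and the mixture form of the posterior.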

In [None]:
#Q iii. Pr(p>0.6)? Need posterior distribution of the mixture
y <- 15; n<- 20
marginal.y <- pi.1*dbbinom(x=y,size=n,alpha=alpha.prior,beta=beta.prior) +
  (1-pi.1)*dbbinom(x=y,size=n,alpha=alpha.winner,beta=beta.winner)

w1 <- pi.1*choose(n=n,k=y)*beta(alpha.prior+y,beta.prior+n-y)/
  (beta(alpha.prior,beta.prior))
w2 <- (1-pi.1)*choose(n=n,k=y)*beta(alpha.winner+y,beta.winner+n-y)/
  (beta(alpha.winner,beta.winner))
w1.scale <- w1/marginal.y
w2.scale <- w2/marginal.y
cat("wt1=",w1.scale,"wt2=",w2.scale,"sum=",w1.scale+w2.scale,"\n")

alpha.winner.post=alpha.winner+y
beta.winner.post=beta.winner+n-y

cat("Pr(p> 0.6)=",w1.scale*(1-pbeta(0.6,shape1=alpha.post,shape2=beta.post)) +
      w2.scale*(1-pbeta(0.6,shape1=alpha.winner.post,shape2=beta.winner.post)),
    "\n")

**(iv) For this mixture prior, repeat the prior/data compatibility
test performed previously. Are the data more compatible with
this mixture prior?**



The procedure is similar to the one in 1. (vii); the only
 difference is the computation of the prior predictive
 distribution. In this case,

 $$\begin{aligned}
 f(y)&=\int_{\Theta}f(y\mid\theta)p(\theta)\text{d}\theta\\
 &=\int_{0}^{1}\binom{n}{y}\theta^{y}(1-\theta)^{n-y}\left\{\pi\frac{1}{B(a_1,b_1)}\theta^{a_1-1}(1-\theta)^{b_1-1}+(1-\pi)\frac{1}{B(a_2,b_2)}\theta^{a_2-1}(1-\theta)^{b_2-1}\right\}\text{d}\theta\\
 &=\pi\binom{n}{y}\frac{1}{B(a_1,b_1)}\int_{0}^{1}\theta^{a_1+y-1}(1-\theta)^{b_1+n-y-1}\text{d}\theta+(1-\pi)\binom{n}{y}\frac{1}{B(a_2,b_2)}\int_{0}^{1}\theta^{a_2+y-1}(1-\theta)^{b_2+n-y-1}\text{d}\theta\\
 &=\pi\binom{n}{y}\frac{B(a_1+y,b_1+n-y)}{B(a_1,b_1)}+(1-\pi)\binom{n}{y}\frac{B(a_2+y,b_2+n-y)}{B(a_2,b_2)}\end{aligned}$$

 The prior predictive probability of observing at least 15
 positive responses is now $0.0514$ (see `R` script for further
 details), which does not provide strong evidence of
 incompatibility.

In [None]:
#Q iv. Bayesian P-value calculation, based on the Mixture Prior
# marginal dist'n is a mixture of Beta-Binom
#    Probability of at least 15 successes in n=20 trials. 
cat("Prior Pr(X>=15|n=20) based on mixture of Beta-Binom \n")
piece1 <- pbbinom(q=14,size=20,alpha=alpha.prior,beta=beta.prior,lower.tail = FALSE)
piece2 <- pbbinom(q=14,size=20,alpha=alpha.winner,
                          beta=beta.winner,lower.tail = FALSE)
cat(pi.1*piece1+(1-pi.1)*piece2,"\n")

**(v) Check for prior/data conflict by making the
prior/likelihood/posterior plot.**

 The prior/likelihood/posterior plot is shown in the figure below.

In [None]:
#Q v. prior/likelihood/posterior plot
post.mix.density  <- w1.scale*dbeta(x=theta,shape1=alpha.post, shape2=beta.post) +
  w2.scale*dbeta(x=theta,shape1=alpha.winner.post,shape2=beta.winner.post)
binomial.likelihood.norm.constant <- 
  integrate(f=function(theta) { dbinom(x=num.succ,size=n,prob=theta)},
            lower=0,upper=1)$value
likelihood <- dbinom(x=num.succ,size=n,prob=theta)/
  binomial.likelihood.norm.constant
my.ylim <- range(c(prior.mix.density,post.mix.density,likelihood))
plot(theta,prior.mix.density,type='l',col='blue',ylim=my.ylim,xlab='Pr(Success)',
     ylab='',main='Mixture Prior, Likelihood, Posterior for Pr(Success)')
lines(theta,likelihood,col='green',lty=2)
lines(theta,post.mix.density,col='red',lty=3)
legend('topleft',legend=c('Prior','Likelihood','Posterior'),
       col=c('blue','green','red'),lty=1:3,bty="n")

3.  **Analysis of normal data: systolic blood pressure**. Suppose we are
interested in the long-term systolic blood pressure (SBP), in mmHg,
of a particular 60-year old female. We take two independent readings
of her SBP 6 weeks apart, giving values of 127 and 133. Each
measurement is assumed to be normally distributed around her
underlying long-term SBP $\theta$ with standard deviation
$\sigma=5$.\
We have additional information: a population survey revealed females
aged 60 had a mean long-term SBP of 120 with standard deviation 10.

**(i) Use the information from the population survey to specify a
normal prior for the woman's mean SBP.**



The survey information is equivalent to a normal prior for the
mean SBP with mean 120 and variance 100.

**(ii) What are the posterior mean and $95\%$ symmetric credible
interval for the woman's SBP? Compare this with the maximum
likelihood estimate and $95\%$ confidence interval.**



 We have seen in class that
 $$\theta\mid\mathbf{y},\sigma^2\sim\text{N}\left(\frac{\frac{\mu_0}{\sigma_0^2}+n\frac{\bar{\mathbf{y}}}{\sigma^2}}{\frac{1}{\sigma_0^2}+\frac{n}{\sigma^2}},\frac{1}{\frac{1}{\sigma_0^2}+\frac{n}{\sigma^2}}\right).$$
 In this case we have $\bar{\mathbf{y}}=130$, $n=2$,
 $\sigma^2=25$, $\mu_0=120$, $\sigma_0^2=100$, leading to a
 posterior mean of $128.89$ and a 95$\%$ credible interval of
 $(122.356,135.422)$. The MLE is 130 and a $95\%$ confidence
 interval is
 $\bar{y}\pm1.96\frac{\sigma}{\sqrt{n}}=(123.070,136.930)$.
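
For reference, the arithmetic behind these numbers can be written out as follows, using the values stated above:

$$\begin{aligned}
\frac{1}{\sigma_0^2}+\frac{n}{\sigma^2}&=\frac{1}{100}+\frac{2}{25}=0.09,\\
E(\theta\mid\mathbf{y})&=\frac{120/100+2\times 130/25}{0.09}=\frac{1.2+10.4}{0.09}\approx 128.89,\\
\sqrt{\text{var}(\theta\mid\mathbf{y})}&=\sqrt{1/0.09}\approx 3.333,\\
128.89\pm 1.96\times 3.333&\approx(122.36,135.42).
\end{aligned}$$

Equivalently, the posterior mean is a weighted average of the prior mean and the sample mean, with prior weight $\sigma^2/(\sigma^2+n\sigma_0^2)=25/225=1/9$.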


In [None]:
# Q ii: posterior given 2 measurements of 127 and 133
obs <- c(127,133)

normal.mu.posterior.mean <- function(mu.prior,mu.sd,sigma,ybar,n) {
  wt.prior <- sigma^2/(sigma^2+n*mu.sd^2)
  out <- wt.prior*mu.prior + (1-wt.prior)*ybar
  return(out)
}

normal.mu.posterior.sd <- function(mu.sd,sigma,n) {
  out <- sqrt(mu.sd^2*sigma^2/(sigma^2+n*mu.sd^2))
  return(out)
}

post.mean <- normal.mu.posterior.mean(mu.prior=120,
                                      mu.sd=10,sigma=5,ybar=mean(obs),n=2)
post.sd   <- normal.mu.posterior.sd(mu.sd=10,sigma=5,n=2)
cat("posterior mean for mu=",post.mean ,"\n")
# posterior mean for mu= 128.8889 

cat("posterior sd for mu=",post.sd ,"\n")
# posterior sd for mu= 3.33 

# 95% credible interval
cat("95% CI for mu=",qnorm(p=c(0.025,0.975),mean=post.mean,sd=post.sd),"\n")
# 95% CI for mu= 122.3557 135.4221  

**(iii) Suppose 2 additional readings were taken, both of 130. What would
be the $95\%$ credible interval now?**

With the 2 new observations the sample mean is unchanged,
$\bar{\mathbf{y}}=130$, but now $n=4$. The 95% credible
interval is now $(124.658,134.165)$.

In [None]:
#Q iii: effect of 2 more readings of 130 on the posterior
obs <- c(127,133,130,130)
post.mean <- normal.mu.posterior.mean(mu.prior=120,
                                      mu.sd=10,sigma=5,ybar=mean(obs),n=4)
post.sd   <- normal.mu.posterior.sd(mu.sd=10,sigma=5,n=4)
cat("posterior mean for mu=",post.mean ,"\n")
# posterior mean for mu= 129.4118 

cat("posterior sd for mu=",post.sd ,"\n")
# posterior sd for mu= 2.425356  

# 95% credible interval
cat("95% CI for mu=",qnorm(p=c(0.025,0.975),mean=post.mean,sd=post.sd),"\n")
# 95% CI for mu= 124.6582 134.1654  

4.  **Caries study: Describing caries experience in Flanders (adapted
    from Lesaffre and Lawson, 2012, p. 37)**\
    The Signal-Tandmobiel study is a longitudinal oral health
    intervention study involving a sample of 4468 children. A random
    sample was taken by selecting primary schools at random and therein
    all children from the first class. The children were examined in
    1996 by 16 trained dentists (examiners) and annually thereafter for
    6 years. Here, we look at the caries experience on primary teeth of
    the first year of the study; hence, the data of 7-year old children
    are evaluated here. Caries experience on primary teeth is
    classically measured by the dmft-index. This score represents the
    number of primary teeth that are decayed (d), missing due to
    extraction for caries reasons (m) or filled (f) because of caries.
    It varies from 0 (no caries experience) to 20 (all primary teeth
    affected). We will analyse a subsample of the data formed by the
    dmft-index of 100 children (dataset `dmft.Rdata` is available on
    Learn). As a natural candidate for modelling the dmft-index we will
    use a Poisson distribution, i.e.,
    $$f(y;\theta)=\frac{e^{-\theta}\theta^y}{y!},\quad y=0,1,2\ldots\quad \theta>0.$$
    Additionally, the following prior information is available:

    -   The review paper of Vanobbergen et al. (2001) reported an
      average dmft-index of 4.1 obtained in a study based on 109
      seven-year-old children and conducted in Liège in 1983, while an
      average of 1.39 was obtained around Ghent on 200 five-year-old
      children examined in 1994.

    -   It is known that oral hygiene had improved considerably in
      Flanders in the recent years.

    The authors state that, taking advantage of conjugacy, a
    $\text{Gamma}(a,b)$ prior distribution for $\theta$ with shape $a=3$
    and rate $b=1$ seems to adequately represent the aforementioned
    knowledge.

**(i) Conduct some exploratory data analysis. What do you think about
the suitability of the Poisson model for this dataset?**



Here we let $\texttt{y=dmft}$. A sensible first check is
  to look at the histogram of the data:

In [None]:
system("wget --no-check-certificate -r 'https://docs.google.com/uc?export=download&id=10gM_e7ujVTvLMZhxn9gpqambkaN8ED1K' -O /kaggle/working/dmft.RData")
# You need to enable the Internet in Settings in Kaggle (right hand side menu) before running this

load("/kaggle/working/dmft.RData")

y=dmft; n=length(y)

hist(y,freq=F,xlab="dmft-index",col="gray80",ylab="Density",main="Caries study: histogram of dmft index")

  As can be observed the value 0 has a high frequency when
  compared to the remaining values. A further check is to compute
  the mean and the variance of the data (for the Poisson
  distribution we know that the mean should be equal to the
  variance). We obtain the values listed below, clearly showing
  that the data is overdispersed.

In [None]:
 cat("Mean:",mean(y),"\n")
 cat("Variance:",var(y),"\n")

  As the authors state in the book: "While the Poisson
  distribution is usually the first choice to describe the
  distribution of counts, in medical applications it is often not
  the best choice. For the Poisson distribution, the counts
  represent the sum of independent events that happen with a
  constant average. The dmft-index is the sum of binary responses
  expressing the caries experience in each of the 20 primary
  teeth. However, cavities in the same mouth are correlated. This
  leads to (Poisson-) overdispersion, which means that the
  variance is larger than the mean." Therefore, the Poisson
  distribution is possibly not the best option to model this
  dataset. Possibly, a zero inflated Poisson distribution or a
  negative binomial distribution would be better options. For the
  sake of simplicity, we proceed with the Poisson distribution.
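
The two checks used above (variance against mean, and the frequency of zeros) can be wrapped into a small helper. The sketch below runs on simulated counts, not on the dmft data: the negative binomial parameters and the helper name `dispersion.index` are purely illustrative assumptions.

```r
set.seed(42)

# Hypothetical overdispersed counts (NOT the dmft data):
# negative binomial with mean 2 and size 0.8
y.sim <- rnbinom(100, size = 0.8, mu = 2)

# Ratio of sample variance to sample mean; about 1 under a Poisson model
dispersion.index <- function(y) var(y) / mean(y)
cat("Dispersion index:", round(dispersion.index(y.sim), 2), "\n")

# Observed proportion of zeros versus the proportion implied by a
# Poisson distribution with the same mean
zero.check <- c(observed = mean(y.sim == 0),
                poisson  = dpois(0, lambda = mean(y.sim)))
print(round(zero.check, 3))
```

Applied to the real data, one would call `dispersion.index(y)` after loading `dmft`.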

**(ii) Using exact calculations, determine the posterior mean,
standard deviation and $95\%$ credible interval for $\theta$.**



The likelihood is Poisson and $\theta$ is assigned a
 $\text{Gamma}(a,b)$ prior distribution. We have
 $$\begin{aligned}
 p(\theta\mid\mathbf{y})&\propto f(\mathbf{y};\theta)p(\theta)\\
 &=\left\{ \prod_{i=1}^{n}\frac{e^{-\theta}\theta^{y_i}}{y_i!}\right\}\frac{b^{a}}{\Gamma(a)}\theta^{a-1}e^{-b\theta}\\
 &\propto e^{-n\theta}\theta^{\sum_{i=1}^{n}y_i}\theta^{a-1}e^{-b\theta}\\
 &=\theta^{a+\sum_{i=1}^{n}y_i-1}e^{-\theta(b+n)},\end{aligned}$$
 i.e.,
 $\theta\mid\mathbf{y}\sim\text{Gamma}(a+\sum_{i=1}^{n}y_i,b+n)$.
 For this particular dataset we have $a=3$, $b=1$, $n=100$, and
 $\sum_{i=1}^{n}y_i=217$, thus leading to
 $\theta\mid\mathbf{y}\sim\text{Gamma}(220,101)$. For a
 $\text{Gamma}(a_1,b_1)$ distribution, we know that its mean is
 $\frac{a_1}{b_1}$ and its variance is $\frac{a_1}{b_1^2}$, and
 thus
 $$E(\theta\mid\mathbf{y})=\frac{220}{101}\approx 2.178,\qquad \sqrt{\text{var}(\theta\mid\mathbf{y})}=\sqrt{\frac{220}{101^2}}\approx 0.147.$$
 Using the function `qgamma` in `R`, to compute the $2.5\%$ and
 $97.5\%$ quantiles, we obtain that a $95\%$ credible interval
 for $\theta$ is $(1.900,2.475)$.


In [None]:
#exact
a=3; b=1
apost=a+sum(y); bpost=b+n
meanpost=apost/bpost
sdpost=sqrt(apost)/bpost
ci=qgamma(c(0.025,0.975),shape=apost,rate=bpost)
cat("Mean=",round(meanpost,3),"SD=",round(sdpost,3),
    "LB=",round(ci[1],3),"UB=",round(ci[2],3),"\n")


**(iii) Repeat part (ii), but using rejection sampling. You might want
to consider as proposal distribution an exponential
distribution with mean equal to the mean of the data. Comment
about the efficiency of the algorithm.**



In order to use the rejection sampling algorithm to simulate
  from the (unnormalised) posterior distribution, we need to
  choose a proposal distribution. The suggestion is to use an
  exponential distribution with mean equal to the mean of the
  data, i.e.,
  $$g(\theta)=\lambda e^{-\lambda \theta}, \quad \lambda=\frac{1}{\bar{\mathbf{y}}}.$$
  The next task is to find a constant $M$ such that
  $$f(\mathbf{y};\theta)p(\theta)\leq M g(\theta).$$ The optimal
  $M$ is such that
  $$M=\max_{\theta}\left\{\frac{f(\mathbf{y};\theta)p(\theta)}{g(\theta)}\right\}.$$
  We use the command `optimize` to find the optimal value of $M$
  (for further details see the `R` script).
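
For this particular model the maximiser is also available in closed form, which gives a useful check on the numerical optimisation. Writing $S=\sum_{i=1}^{n}y_i$ and dropping factors that do not depend on $\theta$:

$$\begin{aligned}
\frac{f(\mathbf{y};\theta)p(\theta)}{g(\theta)}&\propto\theta^{a+S-1}e^{-(b+n-\lambda)\theta},\\
\frac{\text{d}}{\text{d}\theta}\left\{(a+S-1)\log\theta-(b+n-\lambda)\theta\right\}&=0
\quad\Longrightarrow\quad
\theta^{*}=\frac{a+S-1}{b+n-\lambda}=\frac{219}{101-100/217}\approx 2.178.
\end{aligned}$$

The optimal $M$ is the ratio evaluated at $\theta^{*}$.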

In [None]:
#rejection sampling
#unnormalised posterior
unposterior=function(theta,data){
  likelihood=prod(dpois(data,theta))    
  prior=dgamma(theta,3,1)  
  unpost=likelihood*prior
  return(unpost)
}

#proposal distribution
g=function(theta,m){
  dexp(theta,rate=1/m)  #exponential density with mean m
}

#auxiliary function to determine the optimal value of M; need to find maximum of unnormalised posterior/g 
aux=function(theta,data){
  likelihood=prod(dpois(x=data,lambda=theta))    
  prior=dgamma(theta,3,1)  
  unpost=likelihood*prior
  g=dexp(theta,rate=1/mean(data))      
  aux=unpost/g
  return(aux)
}

M=optimize(aux,interval=c(0,20),data=y,maximum=TRUE)$objective
cat("Optimal M=",signif(M,4),"\n")

In [None]:
# Plot M times the proposal density (red) against the unnormalised posterior (black)
m=1000
thetagrid=seq(0,20,len=m)
unnormal.post.ord=numeric(m)
for(i in 1:m){
  unnormal.post.ord[i]=unposterior(theta=thetagrid[i],data=y) 
}

plot(thetagrid,M*g(thetagrid,m=mean(y)),type="l",col="red",
     xlab=expression(theta),ylab="Density")
lines(thetagrid,unnormal.post.ord)

From the above figure we can appreciate that there is a 'large
  rejection area'. This is confirmed by the acceptance rate of
  the algorithm, about $6\%$ (see `R` script for details on the
  implementation). The histogram of the sampled $\theta$'s is
  plotted below.
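
A vectorised variant of the same algorithm is sketched here. It is not the workshop implementation: it uses only the sufficient statistics quoted in part (ii) ($n=100$, $\sum y_i=217$), works on the log scale, and uses the closed-form location of the maximum of the ratio instead of `optimize`. The number of proposals `n.prop` and the seed are arbitrary.

```r
set.seed(123)
n.obs <- 100; sum.y <- 217; a <- 3; b <- 1
lambda <- n.obs / sum.y                 # proposal rate = 1 / mean(y)
shape.post <- a + sum.y                 # posterior kernel: theta^(shape-1) * exp(-rate * theta)
rate.post  <- b + n.obs

# log of (posterior kernel / proposal density), up to an additive constant
log.ratio <- function(theta) {
  (shape.post - 1) * log(theta) - (rate.post - lambda) * theta
}

# The ratio is maximised at theta.star, so log.M bounds log.ratio from above
theta.star <- (shape.post - 1) / (rate.post - lambda)
log.M <- log.ratio(theta.star)

# Propose from the exponential, accept with probability ratio / M
n.prop <- 2e5
cand <- rexp(n.prop, rate = lambda)
accept <- log(runif(n.prop)) <= log.ratio(cand) - log.M
draws <- cand[accept]
acc.rate <- mean(accept)

cat("Acceptance rate:", round(acc.rate, 3), "\n")
cat("Mean:", round(mean(draws), 3), "SD:", round(sd(draws), 3), "\n")
```

Working with log densities avoids forming the product of 100 Poisson probabilities, and the acceptance rate it reports can be compared with the figure quoted above.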


In [None]:
#rejection sampling algorithm
n.samples=10000
count=0; attempts=0; thetapost=rep(0,n.samples)
t1=proc.time()["elapsed"]
while(count<n.samples){
  attempts=attempts+1  
  theta.c=rexp(1,1/mean(y))
  u=runif(1,0,1)
  alpha=unposterior(theta=theta.c,data=y)/(M*g(theta.c,mean(y)))
  if(u<=alpha){
    count=count+1
    thetapost[count]=theta.c
  }
}
t2=proc.time()["elapsed"]
cat("Elapsed time (s)=",t2-t1,"\n")
cat("Acceptance rate=",round(n.samples/attempts,3),"\n")

cat("E=",round(mean(thetapost),3), "SD=",round(sd(thetapost),3),
    "LB and UB=",round(unlist(quantile(thetapost,c(0.025,0.975))),3),"\n")

In [None]:
hist(thetapost,freq=F,breaks=20,xlab=expression(theta),ylab="Density",main="")

**(iv) Repeat part (ii), but using the Metropolis-Hastings algorithm.
Plot the ACF (autocorrelation function) and the Gelman-Rubin
diagnostics (see the `gelman.diag` and `gelman.plot` functions in `R`).**

In [None]:
library(coda)
#The coda library contains useful diagnostics for MCMC runs

#In order to avoid overflow due to numerical precision issues, 
#we work with the logarithm of the unnormalized posterior
logunposterior=function(theta,data){
  loglikelihood=sum(dpois(data,theta,log=TRUE))    
  logprior=dgamma(theta,3,1,log=TRUE)  
  logunpost=loglikelihood+logprior
  return(logunpost)
}

runMH=function(data,burnin=1000,n.samples=10000){
    theta=1; #Initializing theta
    thetapost=rep(0,n.samples)

    #Parentheses matter: 1:n.samples+burnin means (1:n.samples)+burnin, which skips the burn-in
    for (it in 1:(n.samples+burnin))
    {
        theta.prop=theta+rnorm(1)
        if(theta.prop<0){
            logacc=-Inf;
        }
        else{
            #Log acceptance ratio: both terms on the log scale, using the function argument data
            logacc=logunposterior(theta=theta.prop,data=data)-logunposterior(theta=theta,data=data)
        }

        if(log(runif(1,0,1))<=logacc){
            theta=theta.prop
        }

        if(it>burnin){
            thetapost[it-burnin]=theta
        }
    }
    return(mcmc(thetapost))
}

#Run the MCMC sampler
t1=proc.time()["elapsed"]
samples1=runMH(data=y)
t2=proc.time()["elapsed"]
cat("Elapsed time (s)=",t2-t1,"\n")


In [None]:
#Trace plot shows good mixing
plot(samples1)

In [None]:
#acf also shows good mixing
acf(samples1)

In [None]:
samples2=runMH(data=y)
samples3=runMH(data=y)

combined.chains=mcmc.list(samples1,samples2,samples3)

gelman.plot(combined.chains)

In [None]:
gelman.diag(combined.chains)

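As we can see, the Gelman-Rubin diagnostics give no indication of a lack of convergence. Note, however, that all three chains are started from the same value ($\theta=1$), whereas the diagnostic is most informative when the chains are started from overdispersed values.
The summary statistics can be seen below. They are similar to what we obtained previously.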

In [None]:
summary(combined.chains)
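
To illustrate what the Gelman-Rubin diagnostic measures, the sketch below runs three random-walk Metropolis chains from deliberately spread-out starting values and computes a basic potential scale reduction factor by hand in base `R`. It is a simplified illustration, not the `coda` implementation, so the value may differ slightly from what `gelman.diag` reports. It uses the sufficient statistics from part (ii); the starting values, proposal standard deviation `prop.sd` and chain lengths are assumptions made here.

```r
set.seed(1)
n.obs <- 100; sum.y <- 217; a <- 3; b <- 1

# Log of the unnormalised Gamma(a + sum.y, b + n.obs) posterior kernel
log.kernel <- function(theta) {
  if (theta <= 0) return(-Inf)
  (a + sum.y - 1) * log(theta) - (b + n.obs) * theta
}

# Random-walk Metropolis sampler for a single chain
run.chain <- function(start, n.iter = 6000, burnin = 1000, prop.sd = 0.3) {
  theta <- start
  draws <- numeric(n.iter)
  for (it in seq_len(n.iter + burnin)) {
    prop <- theta + rnorm(1, sd = prop.sd)
    if (log(runif(1)) <= log.kernel(prop) - log.kernel(theta)) theta <- prop
    if (it > burnin) draws[it - burnin] <- theta
  }
  draws
}

# Three chains from overdispersed starting values (one column per chain)
starts <- c(0.5, 2, 6)
chains <- sapply(starts, run.chain)

# Basic potential scale reduction factor: within- versus between-chain variance
n.iter <- nrow(chains)
W <- mean(apply(chains, 2, var))        # average within-chain variance
B <- n.iter * var(colMeans(chains))     # between-chain variance
rhat <- sqrt(((n.iter - 1) / n.iter * W + B / n.iter) / W)

cat("R-hat:", round(rhat, 4), "\n")
cat("Pooled posterior mean:", round(mean(chains), 3), "\n")
```

Values of the factor close to 1 indicate that the chains are sampling from similar distributions despite their different starting points.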