# <font color="darkblue"> Binomial Beta Model

A Bernoulli trial is a random experiment with exactly two possible outcomes: typically "success" (with probability $\theta$) and "failure" 
(with probability $1-\theta$). Each trial is independent of others.

**Examples**

1. Quality Control in Manufacturing: Each light bulb tested is either **defective (failure) or non-defective (success)**.

1. Customer Purchase Decision: A customer either makes a **purchase (success) or does not (failure)** during a store visit.

1. Medical Test Result: A medical test result is either **positive (success) or negative (failure)**, indicating the presence of a condition.

---

Consider $n$ independent and identically distributed (**i.i.d.**) Bernoulli random variables $Y_1, Y_2, \dots, Y_n$

$$
Y_i \sim \text{Bernoulli}(\theta), \quad \theta \in (0,1) \quad  Y_i \in \{0,1\}$$ 

Hence the range of $Y$ is $\mathscr{A}_Y = \{0,1\}$ and that of the parameter $\theta$ is $(0,1)$ or equivalently $0 < \theta < 1$

Each $Y_i$ takes values in $\{0,1\}$ with probability mass function:

$$P(Y_i = y_i \mid \theta) = \theta^{y_i} (1 - \theta)^{1 - y_i}$$

When you perform $n$ independent Bernoulli trials, the **sum of successes** across these trials follows a Binomial distribution. 
Specifically, if $X= \sum_{i=1}^{n} Y_i$ represents the number of successes, then $X \mid \theta \sim \text{Binomial}(n, \theta)$.

The Probability Mass Function (PMF) of $X|\theta$ is 

$$P[X = x \mid \theta]=\binom{n}{x} \theta^x (1-\theta)^{n-x}$$ where $\theta \in (0,1)$ and $x \in \{0,1,2,\dots,n\}$, that is, $\mathscr{A}_X = \{0,1,2,\dots,n\}$
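
As a quick check of the link between Bernoulli trials and the Binomial PMF, the sketch below simulates sums of Bernoulli draws and compares the empirical frequencies with the PMF. The values of `n`, `theta`, the seed and the number of replications are illustrative assumptions, not values from these notes.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)   # fixed seed (assumed) for reproducibility
n, theta = 10, 0.3               # illustrative values
reps = 100_000                   # number of simulated experiments

# Each row holds n Bernoulli(theta) trials; the row sum is the number of successes X
trials = rng.random((reps, n)) < theta
x_sim = trials.sum(axis=1)

# Empirical frequencies of X = 0, 1, ..., n versus the Binomial(n, theta) PMF
empirical = np.bincount(x_sim, minlength=n + 1) / reps
theoretical = st.binom.pmf(np.arange(n + 1), n, theta)
max_abs_diff = np.max(np.abs(empirical - theoretical))
print(max_abs_diff)
```

With this many replications the largest discrepancy should be small, shrinking further as `reps` grows.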


The aim of statistical inference is to make conclusions about a population based on a sample of data. Given a sample, it aims to estimate unknown parameters (such as the population mean or proportion) and assess the uncertainty around these estimates. Statistical inference uses methods like point estimation, confidence intervals, and hypothesis testing to draw valid conclusions and make decisions based on the data.

---

In the context of a Binomial proportion, Bayesian statistical inference aims to estimate the proportion of successes in a population, given a sample. Using Bayes' theorem, prior beliefs about the proportion are updated with observed data to produce a posterior distribution for the proportion, capturing uncertainty and incorporating prior knowledge into the estimate.

The **Beta distribution** is a natural choice for modeling the **Binomial proportion** in a Bayesian context because it is a **conjugate prior** for the Binomial likelihood. This means that when you update a Beta prior with Binomial data, the resulting posterior distribution is also a Beta distribution 

Additionally, the Beta distribution is flexible and defined on the interval [0, 1], which aligns perfectly with the possible values of a proportion. The shape of the Beta distribution can be easily adjusted using its two shape parameters ($a$ and $b$ in what follows) to reflect different prior beliefs about the proportion.

---

## <font color="darkgreen"> **Likelihood Function**

Given $n$ observations, the likelihood function of $\theta$  is $$\mathscr{L}(\theta)=\binom{n}{x} \theta^x (1-\theta)^{n-x}$$

## <font color="darkblue"> **Beta Prior for $\theta$**

Assume a **Beta prior** distribution for the parameter $\theta$:

$$
\theta \sim \text{Beta}(a, b)
$$

The Probability Density Function (PDF) of the Beta distribution is:

$$
P(\theta) = \frac{\theta^{a - 1} (1 - \theta)^{b - 1}}{B(a, b)}
$$

where $B(a, b)$ is the **Beta function**:

$$
B(a, b) = \int_0^1 t^{a - 1} (1 - t)^{b - 1} \, dt
$$
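
The definitions above can be checked numerically. The following sketch, with illustrative values of `a`, `b` and `theta` that are assumptions rather than values from the notes, integrates the Beta function directly and rebuilds the Beta PDF from the formula.

```python
from scipy import integrate, special, stats

a, b = 2.0, 3.0   # illustrative shape parameters (assumed)

# Beta function by direct numerical integration of its definition
B_quad, _ = integrate.quad(lambda t: t**(a - 1) * (1 - t)**(b - 1), 0.0, 1.0)
# The same quantity from SciPy's special-function implementation
B_special = special.beta(a, b)

# Beta PDF assembled from the formula, compared with scipy.stats.beta.pdf
theta = 0.4
pdf_formula = theta**(a - 1) * (1 - theta)**(b - 1) / B_special
pdf_scipy = stats.beta.pdf(theta, a, b)
print(B_quad, B_special, pdf_formula, pdf_scipy)
```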

---


## <font color="darkblue"> **Posterior Distribution via Bayes' Theorem**

Applying Bayes' theorem:

$$
\pi(\theta \mid \mathbf{X}) = \frac{\mathscr{L}(\theta) \, P(\theta)}{m(\mathbf{X})}
$$

where $m(\mathbf{X})$ is called the marginal likelihood, given by

$$
m(\mathbf{X}) = \int_0^1 \mathscr{L}(\theta) \, P(\theta) \, d\theta
$$

Now, the **numerator** is:

$$
\mathscr{L}(\theta) \, P(\theta) = \left[ \binom{n}{x} \theta^x (1 - \theta)^{n - x} \right] \left[ \frac{\theta^{a - 1} (1 - \theta)^{b - 1}}{B(a, b)} \right]
$$

Simplifying:

$$
\mathscr{L}(\theta) \, P(\theta) = \binom{n}{x} \frac{\theta^{x + a - 1} (1 - \theta)^{n - x + b - 1}}{B(a, b)}
$$

The integral in the **denominator** is

$$
m(\mathbf{X}) = \binom{n}{x} \frac{1}{B(a, b)} \int_0^1 \theta^{x + a - 1} (1 - \theta)^{n - x + b - 1} \, d\theta
$$

where the remaining integral can be recognized as the Beta function $B(x + a, n - x + b)$, so that

$$
m(\mathbf{X}) = \binom{n}{x} \frac{B(x + a, n - x + b)}{B(a, b)}
$$

The constant factor $\binom{n}{x} / B(a, b)$ appears in both the numerator and the denominator and cancels. Thus, the **posterior distribution** is:

$$
\pi(\theta \mid \mathbf{X}) = \frac{\mathscr{L}(\theta) \, P(\theta)}{m(\mathbf{X})} = \frac{\theta^{x + a - 1} (1 - \theta)^{n - x + b - 1}}{B(x + a, n - x + b)}
$$

This is the PDF of a **Beta distribution**:

$$
\theta \mid \mathbf{X} \sim \text{Beta}(x + a, n - x + b)
$$
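
The conjugate update can be verified numerically without using the closed form: multiply likelihood and prior on a grid, normalise, and compare with the Beta posterior density. This is a minimal sketch; the data `n`, `x`, the prior parameters `a`, `b` and the grid size `N` are illustrative assumptions.

```python
import numpy as np
from scipy import stats as st

n, x = 20, 7        # illustrative data: 7 successes in 20 trials (assumed)
a, b = 2.0, 3.0     # illustrative prior parameters (assumed)

N = 2000
h = 1.0 / N
theta = (np.arange(N) + 0.5) * h   # midpoints of N equal cells on (0, 1)

# Unnormalised posterior: likelihood times prior at each grid point
unnorm = st.binom.pmf(x, n, theta) * st.beta.pdf(theta, a, b)
marginal = unnorm.sum() * h        # midpoint-rule approximation of m(X)
post_grid = unnorm / marginal

# Closed-form posterior from the conjugate update
post_exact = st.beta.pdf(theta, x + a, n - x + b)
max_abs_diff = np.max(np.abs(post_grid - post_exact))
print(max_abs_diff)
```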

### <font color="darkblue"> **Final Notes**

- The **posterior distribution** is a Beta distribution.
  
- The **updated parameters** are:
  - **Posterior first shape parameter**: $a' = x + a$
  - **Posterior second shape parameter**: $b' = n - x + b$
    
- This demonstrates that the **Beta distribution is a conjugate prior** for the proportion parameter in the Binomial likelihood.

## <font color="darkblue">**More about the prior**

The Beta distribution with two shape parameters $a$ and $b$ is a conjugate prior for the proportion parameter $\theta$ of the Binomial distribution: the posterior is again a Beta distribution, with parameters $x+a$ and $n-x+b$. Because of this, summaries of $\theta$ (mean, mode, variance, quantiles) can be read directly from the posterior once the sample information $X$ is available.

---

## <font color="darkred">**Point Estimate from the Posterior Beta distribution**

### MAP 

The **MAP estimate** corresponds to the mode of the posterior distribution. For the $\text{Beta}(x + a, n - x + b)$ posterior, the mode is given by:

$$\hat{\theta}_{MAP} = \frac{x + a - 1}{n + a + b - 2}$$

Where:
- $ x $ is the number of successes.
- $ n $ is the total number of trials.
- $ a $ and $ b $ are the parameters of the Beta prior distribution.

#### Conditions for the Mode:

- This formula gives a mode in the interior of $ (0, 1) $ if $ x + a - 1 > 0 $ and $ n - x + b - 1 > 0 $. Otherwise, the mode may be at the boundaries $ \theta = 0 $ or $ \theta = 1 $.


### Mean:

$$\text{E}[\theta \mid \mathbf{X}] = \frac{x + a}{n + a + b}$$

### Variance:

$$\text{Var}[\theta \mid \mathbf{X}] = \frac{(x + a)(n - x + b)}{(n + a + b)^2 (n + a + b + 1)}$$
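
The three summaries can be computed directly from the posterior parameters. The helper `posterior_summaries` below is a hypothetical name introduced for illustration, and the data and prior values are assumed; the SciPy calls serve only as a cross-check.

```python
from scipy import stats as st

def posterior_summaries(x, n, a, b):
    """Summaries of the Beta(x + a, n - x + b) posterior (hypothetical helper)."""
    a_post, b_post = x + a, n - x + b
    total = a_post + b_post
    mean = a_post / total
    var = a_post * b_post / (total**2 * (total + 1))
    # An interior mode exists only when both posterior parameters exceed 1
    mode = (a_post - 1) / (total - 2) if (a_post > 1 and b_post > 1) else None
    return {"mean": mean, "var": var, "map": mode}

# Illustrative data and prior (assumed): 7 successes in 20 trials, Beta(2, 3) prior
summ = posterior_summaries(x=7, n=20, a=2.0, b=3.0)
print(summ)
# Cross-check mean and variance against scipy.stats.beta
print(st.beta.mean(9.0, 16.0), st.beta.var(9.0, 16.0))
```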

---

Note that the choice of the prior parameters $a$ and $b$ influences these summaries. The prior parameters are also the means of incorporating any additional, reasonable information about the parameter beyond the available sample data. 

---
# Non-informative priors for the binomial proportion parameter

## 1. Uniform Prior 

When no prior information about the parameter $\theta$ is available, we can treat all values of $\theta$ in its space $[0,1]$ as equally likely.

For this no-information situation, we can use the Uniform distribution on $[0,1]$, which is the Beta$(1,1)$ distribution; that is, $a = 1, b = 1$


## 2. Jeffreys Prior 

Alternatively, we can use the Jeffreys prior, which is proportional to the square root of the Fisher information: $p_J(\theta) \propto \sqrt{I(\theta)}$ with $I(\theta)=E[-H(\theta)]$, where $H(\theta)$ is the Hessian of the log-likelihood $l(\theta)=\text{ln}(\mathscr{L}(\theta))$ (for a single parameter, simply its second derivative); here, ln refers to the natural logarithm (base $e$)

In the case of Binomial likelihood, $$\mathscr{L}(\theta)=\binom{n}{x} \theta^x (1-\theta)^{n-x}$$ and hence,

$l(\theta) = \text{ln}(\mathscr{L}(\theta))= \text{ln}\left(\binom{n}{x} \theta^x (1-\theta)^{n-x}\right)$

Using the properties of the logarithm,

$l(\theta) = \text{ln}\binom{n}{x} +  \text{ln}(\theta^x) + \text{ln}\left((1-\theta)^{n-x}\right)$


$l(\theta) = k_0 +  x ~ \text{ln}(\theta) + (n-x) ~ \text{ln}(1-\theta)$ where $k_0=\text{ln}\binom{n}{x}$ is a constant, independent of $\theta$

Now, to get $H(\theta)$, we differentiate $l(\theta)$ twice with respect to $\theta$

$$\frac{dl}{d\theta}=\frac{x}{\theta}-\frac{n-x}{1-\theta}$$ (the first term in $l(\theta)$ is constant so that its derivative with respect to $\theta$ is zero)

$$ H(\theta) = \frac{d^2l}{d\theta^2}=-\frac{x}{\theta^2}-\frac{n-x}{(1-\theta)^2}$$

Taking the expectation of $H(\theta)$ with respect to $X \mid \theta$, and using $\text{E}[X]=n\theta$,

$$ \text{E}[H(\theta)] = -\frac{n\theta}{\theta^2}-\frac{n-n\theta}{(1-\theta)^2}$$

$$ \text{E}[H(\theta)] = -\frac{n}{\theta}-\frac{n}{1-\theta}$$

$$ \text{E}[H(\theta)] = -n\left[\frac{1-\theta+\theta}{\theta(1-\theta)}\right]$$

$$ \text{E}[H(\theta)] = -n\left[\frac{1}{\theta(1-\theta)}\right]$$

Hence the Fisher information is $I(\theta)=E[-H(\theta)]=\frac{n}{\theta(1-\theta)}$, and the Jeffreys prior is

$$ p_J(\theta) \propto \sqrt{I(\theta)} = \left[\frac{n}{\theta(1-\theta)}\right]^{\frac{1}{2}}$$ which, up to a constant, is


$$ p_J(\theta)  \propto \theta^{-\frac{1}{2}} (1-\theta)^{-\frac{1}{2}}$$ 

$$ p_J(\theta)  \propto \theta^{\frac{1}{2}-1} (1-\theta)^{\frac{1}{2}-1}$$ 

The expression on the right-hand side of the above equation is the kernel of the Beta distribution with parameters $a = b = \frac{1}{2}$

$\implies$ the Jeffreys prior for the Binomial proportion parameter is the Beta distribution with parameters $a = b = \frac{1}{2}$
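
The derivation can be checked numerically. The sketch below averages the negative second derivative over $X \sim \text{Binomial}(n, \theta)$, compares it with $n/(\theta(1-\theta))$, and confirms that its square root is a constant multiple of the Beta$(\frac{1}{2},\frac{1}{2})$ density. The value of `n`, the grid of `thetas` and the helper name `expected_neg_hessian` are illustrative assumptions.

```python
import numpy as np
from scipy import stats as st

n = 12                                      # illustrative number of trials (assumed)
thetas = np.array([0.1, 0.3, 0.5, 0.8])     # illustrative values of theta (assumed)

def expected_neg_hessian(theta, n):
    """E[-H(theta)] computed by summing over all outcomes x = 0, ..., n."""
    xs = np.arange(n + 1)
    neg_h = xs / theta**2 + (n - xs) / (1 - theta)**2
    return np.sum(st.binom.pmf(xs, n, theta) * neg_h)

numeric = np.array([expected_neg_hessian(t, n) for t in thetas])
closed_form = n / (thetas * (1 - thetas))

# sqrt(E[-H]) divided by the Beta(1/2, 1/2) density should not depend on theta
ratio = np.sqrt(closed_form) / st.beta.pdf(thetas, 0.5, 0.5)
print(numeric, closed_form, ratio)
```

The constant ratio equals $\sqrt{n}\,B(\frac{1}{2},\frac{1}{2}) = \sqrt{n}\,\pi$, which is the normalising factor absorbed by the proportionality sign.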


# <font color="darkblue"> A note on Beta Distribution (Prior)

**<font color="darkred"> Prior** $p(\theta)$

$$\theta \sim \mathrm{Beta}(a,b)$$ $0 < \theta <1~~ a,b > 0$


## Originally, we had $a = b = 1$

- ## <font color="red">What is the rationale?

## <font color="darkviolet"> Versatility of Beta Distributions

- Range is $(0, 1)$

- Both $a, b$ are shape parameters

- If $a = b = 1$, it is the Uniform random variable in $(0,1)$

- Symmetric when $a = b$

- When $a>b$, $\theta$ "near $1$" is more probable

- When $a<b$, $\theta$ "near $0$" is more probable

# <font color="darkred"> A Visual Representation

We can observe the useful shapes of the Beta distribution, which reflect many forms of belief about the parameter in the interval $(0,1)$

In [None]:
import numpy as np
import pandas as pd
import scipy as sc
from scipy import stats as st
import math
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
k=10000
#Uniform Prior - Vague / Flat: Beta(1, 1)
a=1
b=1
symm=0.5
rand_gen=st.beta.rvs(a,b,size=k)
fig = plt.figure(figsize = (20, 8))
plt.hist(rand_gen,alpha=0.8)
plt.axvline(x=symm, color='r',linewidth=4)
plt.show()

In [None]:
k=10000
#Jeffreys Prior
a=0.5
b=0.5
symm=0.5
rand_gen=st.beta.rvs(a,b,size=k)
fig = plt.figure(figsize = (20, 8))
plt.hist(rand_gen,alpha=0.8)
plt.axvline(x=symm, color='r',linewidth=4)
plt.show()

In [None]:
k=10000
#Symmetry (a, b >1)
a=5
b=5
symm=0.5
rand_gen=st.beta.rvs(a,b,size=k)
fig = plt.figure(figsize = (20, 8))
plt.hist(rand_gen,alpha=0.8)
plt.axvline(x=symm, color='r',linewidth=4)
plt.show()

In [None]:
#Mirror Image
k=10000
a=6
b=0.3
symm=0.5
fig = plt.figure(figsize = (20, 4))

plt.subplot(1, 2, 1)

rand_gen1=st.beta.rvs(a,b,size=k)
plt.hist(rand_gen1,alpha=0.8)
plt.axvline(x=symm, color='r',linewidth=4)
plt.text(0.3, 2000, [a,b], horizontalalignment='left', fontsize=30, color='green', weight='bold')

plt.subplot(1, 2, 2)

rand_gen2=st.beta.rvs(b,a,size=k)
plt.hist(rand_gen2,alpha=0.8)
plt.axvline(x=symm, color='r',linewidth=4)
plt.text(0.3, 2000, [b,a], horizontalalignment='left', fontsize=30, color='green', weight='bold')

plt.show()

## <font color="darkblue">**Inference from Posterior**

We have already observed that the mean and variance from the Posterior Beta distribution are 

$$\text{E}[\theta \mid \mathbf{X}] = \frac{x + a}{n + a + b}$$

$$\text{Var}[\theta \mid \mathbf{X}] = \frac{(x + a)(n - x + b)}{(n + a + b)^2 (n + a + b + 1)}$$

---

Now, to obtain an **interval estimate** for the parameter $\theta$, we use the following idea.

A $100(1-\alpha)\%$ **Credible Interval** for $\theta$ is an interval $[a_1, a_2]$ such that the posterior probability that $\theta$ **lies** in the interval is $1-\alpha$; that is, 

$$Pr(\theta \in [a_1, a_2] \mid X)=\int_{a_1}^{a_2} \pi(\theta|X) d\theta = 1-\alpha$$

Once the conjugate prior parameters $a, b$ are chosen, the limits $a_1$ and $a_2$ can be calculated as quantiles (inverse CDF values) of the Beta posterior distribution.

These quantiles have no closed form in general, so we rely on a computational environment with a built-in inverse CDF for the Beta distribution.

For example, in the Python library *scipy*, the following code gives the necessary step:

```python
from scipy.stats import beta
beta.ppf(p, x+a, n-x+b)
```
- where $p$ is the required value $(0 < p < 1)$ for finding the quantile (inverse CDF)

- $x+a, n-x+b$  are the parameters of the Beta (posterior) distribution

$p$ is chosen based on the required size $1-\alpha$ of the Credible interval; that is

1. p = $\frac{\alpha}{2}$ for the lower limit $a_1$
2. p =  $1-\frac{\alpha}{2}$ for the upper limit $a_2$
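
As a worked illustration of these two steps, the sketch below computes an equal-tailed 95% credible interval. The data `x`, `n`, the uniform prior `a = b = 1` and the level `alpha` are assumed values chosen only for the example.

```python
from scipy.stats import beta

x, n = 7, 20        # illustrative data (assumed)
a, b = 1.0, 1.0     # uniform prior, chosen here only for illustration
alpha = 0.05        # gives a 95% credible interval

a_post, b_post = x + a, n - x + b

# Lower and upper limits are the alpha/2 and 1 - alpha/2 posterior quantiles
lower = beta.ppf(alpha / 2, a_post, b_post)
upper = beta.ppf(1 - alpha / 2, a_post, b_post)

# Posterior probability contained in [lower, upper]; should equal 1 - alpha
coverage = beta.cdf(upper, a_post, b_post) - beta.cdf(lower, a_post, b_post)
print(lower, upper, coverage)
```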

# <font color="darkblue"> Predictive Distribution

A Bayesian model also provides a distribution for unseen data: the likelihood of the new data is averaged over the posterior distribution of the parameter $\theta$, so the prediction reflects what the observed sample says about $\theta$


## <font color="darkred"> Predictive information about "New Data"

- Observe data $X$ whose distribution is parameterised by $\theta$

- Construct prior for $\theta$

- Obtain posterior $\theta|X$

- **Interested to know the probability distribution about "unseen data $Y$"**

$$p(Y|X) = \int_{\theta}p(Y|\theta)\pi(\theta|X)~d\theta$$

- $p$ refers to the pdf of $Y$ parameterised by $\theta$ (same as $X$)

- $\pi$ refers to Posterior of $\theta$
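
The integral above can be approximated by simulation: draw $\theta$ from the posterior, then draw new data given each $\theta$. The sketch below assumes, for illustration only, Binomial data with a Beta posterior as derived earlier; the data, prior, number of new trials `m`, seed and number of draws are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)     # fixed seed (assumed) for reproducibility
x, n = 7, 20                       # illustrative observed data (assumed)
a, b = 2.0, 3.0                    # illustrative prior parameters (assumed)
m = 10                             # number of new trials (assumed)
draws = 200_000

# Step 1: sample theta from the posterior Beta(x + a, n - x + b)
theta_post = rng.beta(x + a, n - x + b, size=draws)
# Step 2: sample new data Y | theta ~ Binomial(m, theta), one per posterior draw
y_new = rng.binomial(m, theta_post)

# Monte Carlo estimate of p(Y = y | X) for y = 0, ..., m
pred_pmf_mc = np.bincount(y_new, minlength=m + 1) / draws
pred_mean = y_new.mean()
print(pred_pmf_mc, pred_mean)
```

The simulated mean should be close to $m$ times the posterior mean of $\theta$.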

## <font color="darkviolet"> For Binomial Case

1. Originally the data was a Binomial with $x$ and $n$

1. The Prior was Beta with parameters $a$ and $b$

1. The Posterior was Beta with parameters $x+a$ and $n-x+b$

Hence the posterior predictive question is: if the experiment is repeated with $m$ new trials, what is the distribution of the number of successes $y$?

The PMF $f(Y|\theta)$ of the new data again comes from

$$Y|\theta \sim \mathrm{Binomial}(m, \theta)$$

## <font color="red"> But information for $\theta$ is no longer prior but the posterior

$$\therefore p(Y|X)=\int_{\theta}p(y|\theta)\pi(\theta|x)~d\theta$$ leads to


$$p(Y|X)=\int_0^1{m \choose y} \theta^y (1-\theta)^{m-y} \frac{1}{B(a_1,a_2)}\theta^{a_1-1} (1-\theta)^{a_2-1} ~d\theta$$

where $a_1=x+a$ and $a_2=n-x+b$ are the posterior parameters and $B(\cdot,\cdot)$ is the Beta function

$$\implies p(Y|X)=\frac{{m \choose y}}{B(a_1,a_2)}\int_0^1 \theta^{y+a_1-1} (1-\theta)^{m-y+a_2-1} ~d\theta$$

$$\implies p(Y=y \mid X=x)=\frac{{m \choose y}}{B(a_1,a_2)}B(y+a_1, m-y+a_2)$$

where $y=0,1,2,\cdots,m$

## <font color="maroon">**This is called Beta-Binomial Distribution**
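
The predictive PMF can be evaluated directly from the formula and compared with SciPy's built-in Beta-Binomial distribution. This is a minimal sketch: the function name `posterior_predictive_pmf` is hypothetical, and the data, prior and `m` are assumed values. The Beta functions are evaluated on the log scale for numerical stability.

```python
import numpy as np
from scipy.special import comb, betaln
from scipy.stats import betabinom

def posterior_predictive_pmf(y, m, x, n, a, b):
    """p(Y = y | X = x) for m new trials (hypothetical helper)."""
    a1, a2 = x + a, n - x + b                     # posterior parameters
    log_pmf = (np.log(comb(m, y))
               + betaln(y + a1, m - y + a2)       # log B(y + a1, m - y + a2)
               - betaln(a1, a2))                  # log B(a1, a2)
    return np.exp(log_pmf)

m = 10                                            # number of new trials (assumed)
y = np.arange(m + 1)
pmf = posterior_predictive_pmf(y, m, x=7, n=20, a=2.0, b=3.0)

# Same distribution from scipy.stats.betabinom with parameters (m, a1, a2)
pmf_scipy = betabinom.pmf(y, m, 9.0, 16.0)
print(pmf.sum(), np.max(np.abs(pmf - pmf_scipy)))
```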