## Beta-Binomial Regression Model 

The beta-binomial regression model, which accounts for overdispersion in binomial data, is one of the simplest Bayesian models. This package fits a beta-binomial regression model using the beta-binomial distribution with a logistic link.

Suppose we toss a coin for $N$ trials and observe the number of heads as $y$. The probability of heads is inferred based on the observed data $D$. Let $\theta \in [0,1]$ represent the rate parameter (probability of getting a head).

There are several ways to estimate the parameter $\theta$ from the observed data $D$. However, approaches that return only a single point estimate do not account for the uncertainty of the estimate, which may lead to overfitting.

Hence, if you have proportion data and do not need to consider overdispersion in clustered binomial data, a binomial regression model can be adopted. However, if the data are overdispersed and you want to account for the uncertainty of parameter estimation, a beta-binomial regression model can be considered. One example is selecting informative clonal SNPs in single-cell studies, which is also used to demonstrate how the Betabin package works; see [documentation.ipynb](https://github.com/StatBiomed/BetabinGLM/blob/main/docs/documentation.ipynb).

### 1. Bayesian Statistics

Uncertainty about the parameters can be modeled with a probability distribution; in Bayesian statistics, this uncertainty is represented by the posterior distribution.

To estimate the parameter $\theta$ conditioned on the observed data $D$, Bayes' rule gives:

$$p(\theta|D) = \frac{p(\theta)p(D|\theta)}{p(D)}\ = \frac{p(\theta)p(D|\theta)}{\int\,p(\theta')p(D|\theta')d\theta'}$$

where, 

$p(\theta|D)$ is the posterior distribution

$p(\theta)$ is the prior distribution 

$p(D|\theta)$ is the likelihood function 

Therefore, the posterior $p(\theta|D)$ is computed by conditioning the prior on the observed data $D$.
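
As a rough numerical illustration of Bayes' rule, the sketch below approximates the posterior for the coin example on a grid of $\theta$ values. The uniform prior, the grid size, the data (7 heads in 10 tosses), and the helper names `trapezoid_integral` and `grid_posterior` are assumptions made for this example; they are not part of the package.

```python
import numpy as np


def trapezoid_integral(values, grid):
    """Trapezoidal-rule approximation of the integral of `values` over `grid`."""
    widths = np.diff(grid)
    return float(np.sum(widths * (values[1:] + values[:-1]) / 2.0))


def grid_posterior(y, N, n_grid=1001):
    """Approximate p(theta | D) for y heads in N tosses under a uniform prior."""
    theta = np.linspace(0.0, 1.0, n_grid)
    prior = np.ones_like(theta)                     # p(theta): assumed uniform
    likelihood = theta**y * (1.0 - theta)**(N - y)  # p(D | theta) without the binomial coefficient
    unnormalized = prior * likelihood
    # The denominator of Bayes' rule (up to the dropped constant, which cancels).
    evidence = trapezoid_integral(unnormalized, theta)
    return theta, unnormalized / evidence


theta, posterior = grid_posterior(y=7, N=10)
posterior_mean = trapezoid_integral(theta * posterior, theta)
print(f"grid posterior mean: {posterior_mean:.3f}")
```

The grid approach only works for low-dimensional parameters, which is one reason closed-form posteriors such as the one derived in the next section are convenient.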

### 1.1 Binomial model

#### 1.1.1 Prior

For the binomial model, a convenient choice of prior is the beta distribution, whose support is the interval $[0,1]$. It is defined as follows:

$$\mathrm{Beta}(\theta|\alpha,\beta) = \frac{1}{B(\alpha,\beta)}\ \theta^{\alpha-1} (1-\theta)^{\beta-1}$$

where $B(\alpha,\beta)$ is the beta function, defined by

$$B(\alpha,\beta) \triangleq \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$$

or, equivalently, in integral form:

$$B(\alpha,\beta) = \int_0^1 \theta^{\alpha-1} (1-\theta)^{\beta-1}d\theta$$

where $\Gamma(\alpha)$ is the Gamma function defined by

$$\Gamma(\alpha) \triangleq \int_0^\infty\!\theta^{\alpha-1}e^{-\theta}d\theta$$

For any positive integer $n$,

$$\Gamma(n) = (n-1)!$$

Hence, the prior for binomial distribution takes the following form:

$$p(\theta) \propto \theta^{\overset\smile{\alpha}-1}(1-\theta)^{\overset\smile{\beta}-1} = \mathrm{Beta}(\theta|\overset\smile{\alpha}, \overset\smile{\beta})$$
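
The beta and gamma function definitions above can be checked numerically. The following is a small sketch using SciPy; the helper names `beta_via_gamma` and `beta_via_integral` and the test values $a=2.5$, $b=3$ are chosen only for illustration.

```python
import math

from scipy import integrate, special


def beta_via_gamma(a, b):
    """B(a, b) from the gamma-function identity."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def beta_via_integral(a, b):
    """B(a, b) from its integral definition, using numerical quadrature."""
    value, _abs_error = integrate.quad(
        lambda t: t ** (a - 1.0) * (1.0 - t) ** (b - 1.0), 0.0, 1.0
    )
    return value


a, b = 2.5, 3.0
# All three values should agree up to numerical error.
print(beta_via_gamma(a, b), beta_via_integral(a, b), special.beta(a, b))
# Gamma(n) = (n - 1)! for a positive integer n.
print(math.gamma(5), math.factorial(4))
```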


#### 1.1.2 Binomial likelihood

Consider the coin-tossing example from the start of this article. The likelihood for the binomial model takes the following form:

$$p(D|\theta) = \mathrm{Bin}(y|N,\theta) = \binom {N}{y}\theta^y(1-\theta)^{N-y}$$

As a function of $\theta$, the likelihood has the same form as the prior: a power of $\theta$ times a power of $(1-\theta)$.

#### 1.1.3 Posterior 
Then, based on Bayes' rule, we multiply the binomial likelihood (1.1.2) by the beta prior (1.1.1):

$$\begin{equation}
\begin{split}
p(\theta|D)& \propto \theta^y(1-\theta)^{N-y}\theta^{\overset{\smile}{\alpha}-1}(1-\theta)^{\overset{\smile}{\beta}-1}\\
&=\theta^{y+\overset{\smile}{\alpha}-1}(1-\theta)^{N-y+\overset{\smile}{\beta}-1}\\
&\propto \mathrm{Beta}(\theta|y+\overset{\smile}{\alpha}, N-y+\overset{\smile}{\beta})\\
&= \mathrm{Beta}(\theta|\overset{\frown}{\alpha}, \overset{\frown}{\beta})
\end{split}
\end{equation}$$

where $\overset{\frown}{\alpha} \triangleq y+\overset{\smile}{\alpha}$, $\overset{\frown}{\beta} \triangleq N-y+\overset{\smile}{\beta}$. The prior parameters $\overset{\smile}{\alpha}$ and $\overset{\smile}{\beta}$ are called hyper-parameters.

We can see that the posterior distribution, which is proportional to the product of the prior and the likelihood function, has the same functional form as the prior.

This property is called conjugacy: the beta distribution is a conjugate prior for the binomial (and Bernoulli) likelihood.
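
The conjugate update amounts to adding counts to the prior parameters. The sketch below applies it and checks the result against a brute-force normalization of prior times likelihood. The $\mathrm{Beta}(2,2)$ prior, the data (7 heads in 10 tosses), and the function name `conjugate_update` are assumptions for illustration only.

```python
import numpy as np
from scipy import integrate, stats


def conjugate_update(alpha_prior, beta_prior, y, N):
    """Return the posterior Beta parameters after observing y heads in N tosses."""
    return alpha_prior + y, beta_prior + (N - y)


alpha_prior, beta_prior, y, N = 2.0, 2.0, 7, 10
alpha_post, beta_post = conjugate_update(alpha_prior, beta_prior, y, N)

# Numerical check: normalize prior * likelihood on a grid and compare it
# with the analytic Beta posterior.
theta = np.linspace(0.0, 1.0, 2001)
unnormalized = stats.beta.pdf(theta, alpha_prior, beta_prior) * stats.binom.pmf(y, N, theta)
numeric_posterior = unnormalized / integrate.trapezoid(unnormalized, theta)
analytic_posterior = stats.beta.pdf(theta, alpha_post, beta_post)
max_abs_diff = float(np.max(np.abs(numeric_posterior - analytic_posterior)))
print(alpha_post, beta_post, max_abs_diff)
```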

#### 1.1.3.1 Posterior mean 
The posterior mean is a more robust point estimate of the uncertain parameter because it averages over the whole parameter space, weighted by the posterior.

If $p(\theta|D) = \mathrm{Beta}(\theta|\overset\frown{\alpha},\overset\frown{\beta})$, then the posterior mean is given by


$$\bar{\theta} \triangleq \mathbb{E}[\theta|D] = \frac{\overset\frown{\alpha}}{\overset\frown{\beta} + \overset\frown{\alpha}}$$
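
To see why the posterior mean can be more robust, compare it with the plain frequency $y/N$ on a very small sample. This is a minimal sketch; the uniform $\mathrm{Beta}(1,1)$ prior, the data (0 heads in 3 tosses), and the function names are assumptions for illustration.

```python
def posterior_mean(y, N, alpha_prior=1.0, beta_prior=1.0):
    """Posterior mean of theta under a Beta prior and a binomial likelihood."""
    alpha_post = y + alpha_prior
    beta_post = N - y + beta_prior
    return alpha_post / (alpha_post + beta_post)


def observed_frequency(y, N):
    """Point estimate that ignores uncertainty: the fraction of heads."""
    return y / N


y, N = 0, 3
freq_estimate = observed_frequency(y, N)  # claims heads never occur
mean_estimate = posterior_mean(y, N)      # pulled toward the prior mean of 0.5
print(freq_estimate, mean_estimate)
```

With only three tosses, the frequency estimate rules out heads entirely, while the posterior mean stays strictly between 0 and 1.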


#### 1.1.4 Posterior predictive distribution 
Now, suppose we are interested in predicting the number of heads in $M>1$ future coin tossing trials and we are using the binomial model. The posterior over $\theta$ is the same as before (1.1.3), and the posterior predictive distribution becomes:

$$\begin{equation}
\begin{split}
p(y|D,M)& = \int_0^1 \mathrm{Bin}(y|M,\theta)\mathrm{Beta}(\theta|\overset\frown{\alpha}, \overset\frown{\beta}) d\theta \\
&= \binom {M}{y}\frac{1}{B(\overset\frown{\alpha}, \overset\frown{\beta})}\int_0^1 \theta^y(1-\theta)^{M-y}\theta^{\overset\frown{\alpha}-1}(1-\theta)^{\overset\frown{\beta}-1}d\theta \\
&= \binom {M}{y}\frac{1}{B(\overset\frown{\alpha}, \overset\frown{\beta})}\int_0^1 \theta^{y+\overset\frown{\alpha}-1}(1-\theta)^{M-y+\overset\frown{\beta}-1}d\theta \\
&= \binom {M}{y}\frac{B(y+\overset\frown{\alpha},M-y+\overset\frown{\beta})}{B(\overset\frown{\alpha}, \overset\frown{\beta})}
\end{split}
\end{equation}$$

### 1.2 Beta-binomial distribution 
The posterior predictive distribution derived above is known as the beta-binomial distribution:

$$\mathrm{Bb}(y|M,\overset\frown{\alpha},\overset\frown{\beta}) \triangleq \binom {M}{y} \frac{B(y+\overset\frown{\alpha},M-y+\overset\frown{\beta})}{B(\overset\frown{\alpha}, \overset\frown{\beta})}$$
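
The beta-binomial probability mass function can be evaluated stably in log space. The sketch below implements the formula above with `scipy.special.betaln` and `scipy.special.gammaln` and compares it with `scipy.stats.betabinom`. The helper names and the parameter values ($M=10$, $\alpha=9$, $\beta=5$) are chosen for illustration and are not taken from the package.

```python
import numpy as np
from scipy.special import betaln, gammaln
from scipy.stats import betabinom


def log_binom_coeff(M, y):
    """Log of the binomial coefficient C(M, y), via log-gamma functions."""
    return gammaln(M + 1) - gammaln(y + 1) - gammaln(M - y + 1)


def betabinom_logpmf(y, M, alpha, beta):
    """Log of Bb(y | M, alpha, beta), following the formula term by term."""
    return log_binom_coeff(M, y) + betaln(y + alpha, M - y + beta) - betaln(alpha, beta)


M, alpha, beta = 10, 9.0, 5.0
y_values = np.arange(M + 1)
pmf = np.exp(betabinom_logpmf(y_values, M, alpha, beta))
reference = betabinom.pmf(y_values, M, alpha, beta)
print(pmf.sum(), np.max(np.abs(pmf - reference)))
```

Working with `betaln` rather than the beta function itself avoids overflow and underflow when $M$, $\alpha$, or $\beta$ is large.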

### 2. Beta-binomial regression model

For the beta-binomial regression model, the endogenous variable is proportion data, i.e., $y$ successes out of $M$ trials, and $x$ is the exogenous variable (predictor). A link function relates the mean of the response to a linear predictor. Here, the logit function is used as the link function, and its inverse, the sigmoid function (i.e., $\sigma(w^\mathrm{T}x)$), denotes the mapping from the linear predictor to the mean of the output. $w$ is the weight vector (the bias $b$ is absorbed into $w$ for notational convenience).

- Logistic link

$$\sigma(a) = \frac{1}{1 + e^{-a}}, \mathrm{where}\, a = {w^\mathrm{T}}x$$

The model $p(y=1|x,w) = \sigma(w^\mathrm{T}x)$ is called logistic regression.

We can identify $p = \sigma(w^\mathrm{T}x)$ with the posterior mean $\mathbb{E}[\theta|D]$ (1.1.3.1) and define $\phi = \frac{1}{\alpha+\beta+1}$, where $\phi$ is the overdispersion parameter, bounded between 0 and 1.

Then, 

$$p = \mathbb{E}[\theta|D] = \frac{\alpha}{\alpha+\beta}$$

After rearranging the above formulas (note that $\alpha+\beta = \frac{1}{\phi}-1$), we get

$$\alpha = \left(\frac{1}{\phi}-1\right)p \qquad \mathrm{and} \qquad \beta = \left(\frac{1}{\phi}-1\right)(1-p) = \frac{1}{\phi}(1-p)+p-1$$

- Beta-binomial distribution

Substituting back into the beta-binomial distribution (1.2), we obtain the final version of the beta-binomial distribution with $w$ and $\phi$ as the parameters.

$$p(y|w,\phi) = \binom {M}{y} \frac{B(y+(\frac{1}{\phi}-1)\sigma(w^\mathrm{T}x),M-y+\frac{1}{\phi}(1-\sigma(w^\mathrm{T}x))+\sigma(w^\mathrm{T}x)-1)}{B((\frac{1}{\phi}-1)\sigma(w^\mathrm{T}x), \frac{1}{\phi}(1-\sigma(w^\mathrm{T}x))+\sigma(w^\mathrm{T}x)-1)}$$
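
The mapping from the mean and overdispersion to the beta parameters can be sketched as follows. Solving $p = \frac{\alpha}{\alpha+\beta}$ and $\phi = \frac{1}{\alpha+\beta+1}$ gives $\alpha+\beta = \frac{1}{\phi}-1$, which the code uses directly. The function names, weights, and input values are hypothetical and only serve the example.

```python
import numpy as np


def sigmoid(a):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + np.exp(-a))


def to_alpha_beta(p, phi):
    """Map mean p and overdispersion phi (both in (0, 1)) to (alpha, beta)."""
    concentration = 1.0 / phi - 1.0  # equals alpha + beta
    return p * concentration, (1.0 - p) * concentration


w = np.array([-0.5, 1.0])  # hypothetical weights, bias first
x = np.array([1.0, 0.3])   # hypothetical input with a leading 1 for the bias
p = sigmoid(w @ x)
alpha, beta = to_alpha_beta(p, phi=0.2)

# Round trip back to the mean and the overdispersion parameter.
p_back = alpha / (alpha + beta)
phi_back = 1.0 / (alpha + beta + 1.0)
print(alpha, beta, p_back, phi_back)
```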

- Joint probability for beta-binomial distribution (likelihood)

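$$p(y^{(1)},\dots,y^{(n)}|\Theta) = \prod \limits _{i=1}^{n} \binom {M^{(i)}} {y^{(i)}} \frac{B(y^{(i)}+\alpha^{(i)},M^{(i)}-y^{(i)}+\beta^{(i)})}{B(\alpha^{(i)}, \beta^{(i)})}$$

where $\alpha^{(i)}$ and $\beta^{(i)}$ are computed from $\sigma(w^\mathrm{T}x^{(i)})$ and $\phi$ as above.

- Log-likelihood (LL)

$$L(\Theta) = \sum \limits _{i=1}^{n} \left[ \mathrm{log} \binom {M^{(i)}} {y^{(i)}} + \mathrm{Betaln}(y^{(i)}+\alpha^{(i)}, M^{(i)}-y^{(i)}+\beta^{(i)}) - \mathrm{Betaln}(\alpha^{(i)}, \beta^{(i)}) \right]$$

where $\mathrm{Betaln}$ is the logarithm of the beta function.

- Objective function / cost function

$$\mathrm{Cost} = -\sum \limits _{i=1}^{n} \left[\mathrm{Betaln}(y^{(i)}+\alpha^{(i)}, M^{(i)}-y^{(i)}+\beta^{(i)}) - \mathrm{Betaln}(\alpha^{(i)}, \beta^{(i)})\right]$$

The binomial coefficient term does not depend on $\Theta$, so it is dropped from the cost.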


In the above functions, $\Theta$ represents all the parameters of interest (i.e., $w, \phi$).

Hence, we find the parameters that maximize the likelihood function with the optimizer (i.e., `scipy.optimize.minimize`). Equivalently, maximizing the log-likelihood (LL or LLH) minimizes the cost.
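
The fitting step can be sketched end to end on simulated data. This is not the package's implementation: the simulated data set, the logit transform used to keep $\phi$ inside $(0,1)$, the clipping of $p$, the choice of `L-BFGS-B`, and all function and variable names are assumptions made for this example.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln, expit


def cost(params, X, y, M):
    """Negative log-likelihood without the binomial coefficient term."""
    w = params[:-1]
    phi = expit(params[-1])  # assumed transform: keeps phi in (0, 1)
    p = np.clip(expit(X @ w), 1e-6, 1.0 - 1e-6)  # avoid alpha or beta of exactly 0
    concentration = 1.0 / phi - 1.0              # equals alpha + beta
    alpha = p * concentration
    beta = (1.0 - p) * concentration
    return -np.sum(betaln(y + alpha, M - y + beta) - betaln(alpha, beta))


# Simulate overdispersed binomial data (hypothetical true parameters).
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # first column is the bias
w_true, phi_true = np.array([-0.5, 1.0]), 0.1
p_true = expit(X @ w_true)
conc_true = 1.0 / phi_true - 1.0
theta = rng.beta(p_true * conc_true, (1.0 - p_true) * conc_true)
M = np.full(n, 30)
y = rng.binomial(M, theta)

# Minimize the cost, starting from w = 0 and phi = 0.5.
x0 = np.zeros(X.shape[1] + 1)
result = minimize(cost, x0=x0, args=(X, y, M), method="L-BFGS-B")
w_hat, phi_hat = result.x[:-1], expit(result.x[-1])
print("w_hat:", w_hat, "phi_hat:", phi_hat)
```

Optimizing an unconstrained value and mapping it through the sigmoid is one common way to respect the bounds on $\phi$; passing explicit `bounds` to `scipy.optimize.minimize` would be an alternative.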