## Beta-Binomial Regression Model 

For the full derivation and further background, see the [detailed version](https://github.com/StatBiomed/GLM-hackathon/blob/main/Betabin/Mathematical_interpretation.ipynb) in the [GLM-hackathon](https://github.com/StatBiomed/GLM-hackathon) repository.

The beta-binomial regression model, which accounts for overdispersion in binomial data, is one of the simplest Bayesian models. This package fits it using the beta-binomial distribution with a logistic link.

Suppose we toss a coin for $N$ trials and observe the number of heads as $y$. The probability of heads is inferred based on the observed data $D$. Let $\theta \in [0,1]$ represent the rate parameter (probability of getting a head).

There are several ways to obtain a point estimate of the parameter $\theta$ from the observed data $D$. However, a point estimate does not account for the uncertainty of the estimate, which may lead to overfitting.

Hence, if you have proportion data and do not need to consider overdispersion in clustered binomial data, a binomial regression model can be adopted. If the data are overdispersed and you want to account for the uncertainty of the parameter estimates, consider a beta-binomial regression model instead. One example is selecting informative clonal SNPs in single-cell studies, which is also used to demonstrate how the Betabin package works; see [documentation.ipynb](https://github.com/StatBiomed/BetabinGLM/blob/main/docs/documentation.ipynb).

### 1. Beta-binomial distribution

$$Bb(y|M,\alpha,\beta) \triangleq \binom {M}{y} \frac{B(y+\alpha,M-y+\beta)}{B(\alpha, \beta)}$$
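The formula above can be evaluated numerically on the log scale. The following is a minimal sketch, not the package's own implementation; the helper name `betabin_logpmf` and the example values of `M`, `alpha`, and `beta` are chosen here for illustration only. It uses `scipy.special.betaln` for the log-Beta terms, and the result can be compared against `scipy.stats.betabinom`.

```python
import numpy as np
from scipy.special import betaln, gammaln
from scipy.stats import betabinom


def betabin_logpmf(y, M, alpha, beta):
    """Log of Bb(y | M, alpha, beta), built from log-Beta functions."""
    # log of the binomial coefficient C(M, y), via log-gamma for stability
    log_comb = gammaln(M + 1) - gammaln(y + 1) - gammaln(M - y + 1)
    return log_comb + betaln(y + alpha, M - y + beta) - betaln(alpha, beta)


# Illustrative values (assumptions, not taken from the package)
M, alpha, beta = 10, 2.0, 3.0
y = np.arange(M + 1)                      # every possible count 0..M
pmf = np.exp(betabin_logpmf(y, M, alpha, beta))

# Reference values from SciPy's built-in beta-binomial distribution
reference = betabinom.pmf(y, M, alpha, beta)
```

Working with `betaln` rather than the Beta function itself avoids underflow when $M$ is large.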

### 2. Beta-binomial regression model

For the beta-binomial regression model, the endogenous variable is the proportion data ($y$ successes out of $M$ trials) and $x$ is the exogenous variable (predictor). A link function relates the mean of the response to a linear model. Here the logit function is the link function, and its inverse, the sigmoid function $\sigma(w^\mathrm{T}x)$, maps the linear predictor to the mean of the output. $w$ is the weight vector (the bias $b$ is absorbed into $w$ for convenience).

- Logistic link

$$\sigma(a) = \frac{1}{1 + e^{-a}}, \mathrm{where}\, a = {w^\mathrm{T}}x$$

Modelling $p(y=1|x,w) = \sigma(w^\mathrm{T}x)$ is known as logistic regression.
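As a small illustration of the logistic link, the sketch below maps a linear predictor to a mean in $(0, 1)$ with `scipy.special.expit` (the sigmoid) and back with `scipy.special.logit`. The weights and design matrix are hypothetical values chosen for the example; the column of ones shows how the bias is absorbed into $w$.

```python
import numpy as np
from scipy.special import expit, logit

# Hypothetical weights: [bias, slope]
w = np.array([-0.5, 1.2])

# Design matrix; the leading column of ones absorbs the bias into w
X = np.array([[1.0, -2.0],
              [1.0, 0.0],
              [1.0, 2.0]])

a = X @ w          # linear predictor a = w^T x, one value per row
p = expit(a)       # sigmoid: 1 / (1 + exp(-a))
a_back = logit(p)  # the logit link recovers the linear predictor
```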

As derived in [Mathematical_interpretation.ipynb](https://github.com/StatBiomed/GLM-hackathon), we define the mean $p$ and the overdispersion $\phi$ as

$$p = \frac{\alpha}{\alpha+\beta}, \qquad \phi = \frac{1}{\alpha+\beta+1}$$

Rearranging these definitions gives

$$\alpha = \left(\frac{1}{\phi}-1\right)p, \qquad \beta = \frac{1}{\phi}(1-p)+p-1$$

- Beta-binomial distribution

Substituting these expressions, with $p = \sigma(w^\mathrm{T}x)$, back into the beta-binomial distribution of Section 1 gives the final form of the distribution with $w$ and $\phi$ as the parameters.

$$p(y|w,\phi) = \binom {M}{y} \frac{B\left(y+\left(\frac{1}{\phi}-1\right)\sigma(w^\mathrm{T}x),\,M-y+\frac{1}{\phi}(1-\sigma(w^\mathrm{T}x))+\sigma(w^\mathrm{T}x)-1\right)}{B\left(\left(\frac{1}{\phi}-1\right)\sigma(w^\mathrm{T}x),\,\frac{1}{\phi}(1-\sigma(w^\mathrm{T}x))+\sigma(w^\mathrm{T}x)-1\right)}$$
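The mapping from the mean $p$ and the overdispersion $\phi$ to the Beta shape parameters can be checked numerically. The sketch below assumes the common convention $\phi = 1/(\alpha+\beta+1)$, under which $\alpha+\beta = 1/\phi - 1$ and $p = \alpha/(\alpha+\beta)$ holds exactly. The function name `mean_dispersion_to_shape` and the input values are hypothetical.

```python
import numpy as np


def mean_dispersion_to_shape(p, phi):
    """Convert mean p and overdispersion phi in (0, 1) to Beta shapes.

    Assumes phi = 1 / (alpha + beta + 1), so alpha + beta = 1/phi - 1.
    """
    total = 1.0 / phi - 1.0          # alpha + beta
    alpha = p * total
    beta = (1.0 - p) * total
    return alpha, beta


# Illustrative inputs (assumptions for this example)
p = np.array([0.2, 0.5, 0.9])
phi = 0.1
alpha, beta = mean_dispersion_to_shape(p, phi)

# Round trip: recover the mean and the overdispersion from the shapes
p_back = alpha / (alpha + beta)
phi_back = 1.0 / (alpha + beta + 1.0)
```

A small $\phi$ gives a large $\alpha+\beta$, so the model approaches a plain binomial; as $\phi$ grows towards 1 the overdispersion increases.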

- Joint probability for beta-binomial distribution (likelihood)

$$p(y^{(1)},\dots,y^{(n)}|\Theta) = \prod \limits _{i=1}^{n} \binom {M^{(i)}} {y^{(i)}} \frac{B(y^{(i)}+\alpha,M^{(i)}-y^{(i)}+\beta)}{B(\alpha, \beta)}$$

- Log-likelihood (LL)

$$L(\Theta) = \sum \limits _{i=1}^{n} \left[ \log \binom {M^{(i)}} {y^{(i)}} + \mathrm{Betaln}(y^{(i)}+\alpha, M^{(i)}-y^{(i)}+\beta) - \mathrm{Betaln}(\alpha, \beta) \right]$$

- Objective function / cost function

$$\mathrm{Cost} = -\sum \limits _{i=1}^{n} \left[\mathrm{Betaln}(y^{(i)}+\alpha, M^{(i)}-y^{(i)}+\beta) - \mathrm{Betaln}(\alpha, \beta)\right]$$


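In the functions above, $\Theta$ denotes the full set of parameters of interest (i.e., $w$ and $\phi$); $\alpha$ and $\beta$ are computed from them for each observation $i$. The cost omits the $\log \binom {M^{(i)}} {y^{(i)}}$ term because it does not depend on $\Theta$.

We therefore find the parameters that maximize the likelihood with an optimizer (i.e., `scipy.optimize.minimize`). In other words, maximizing the log-likelihood (LL) is equivalent to minimizing the cost.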
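The optimization step can be sketched end to end on simulated data. This is an illustrative sketch under stated assumptions, not the Betabin package's implementation: it assumes $\phi = 1/(\alpha+\beta+1)$, keeps $\phi$ inside $(0, 1)$ by optimizing an unconstrained value passed through the sigmoid, and uses the `L-BFGS-B` method. The function name `negative_log_likelihood` and all simulated values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln, expit


def negative_log_likelihood(params, X, y, M):
    """Cost: minus the parameter-dependent part of the log-likelihood."""
    w = params[:-1]
    phi = expit(params[-1])              # maps any real value into (0, 1)
    p = expit(X @ w)                     # mean via the logistic link
    total = 1.0 / phi - 1.0              # alpha + beta, assuming phi = 1/(alpha+beta+1)
    alpha, beta = p * total, (1.0 - p) * total
    return -np.sum(betaln(y + alpha, M - y + beta) - betaln(alpha, beta))


# Simulate overdispersed binomial data (all values are illustrative)
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # bias column + one feature
M = np.full(n, 30)                                       # trials per observation
true_w, true_phi = np.array([-0.5, 1.0]), 0.1
p_true = expit(X @ true_w)
total_true = 1.0 / true_phi - 1.0
theta = rng.beta(p_true * total_true, (1.0 - p_true) * total_true)
y = rng.binomial(M, theta)                               # observed success counts

# Minimize the cost, starting from w = 0 and phi = 0.5
x0 = np.zeros(X.shape[1] + 1)
result = minimize(negative_log_likelihood, x0, args=(X, y, M), method="L-BFGS-B")
w_hat, phi_hat = result.x[:-1], expit(result.x[-1])
```

Optimizing a transformed $\phi$ is one way to avoid explicit bounds; passing `bounds` to `scipy.optimize.minimize` is an alternative.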