# Binary Variables

## Bernoulli distribution
The Bernoulli distribution of $x$ is
$$Bern(x|\mu)=\mu^x(1-\mu)^{1-x}$$
where $x$ is a discrete binary random variable, $x\in\{0,1\}$.  
$\mu$ is the probability of the event $x=1$, and $(1-\mu)$ is the probability of the event $x=0$, so the Bernoulli distribution above can be written case by case as 
$$Bern(x|\mu)=\left\{\begin{aligned}p(x=1|\mu) &= \mu\\ p(x=0|\mu)&=1-\mu \end{aligned}\right .$$
The mean and variance are given by
$$\begin{align*}\mathbb{E}[x]&=\mu\\ var[x]&=\mu(1-\mu)\end{align*}$$
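
The formulas above can be checked numerically. The following is a minimal sketch, not part of the original notes; the function names `bernoulli_pmf` and `bernoulli_moments` are made up for illustration, and the moments are computed by summing over the two outcomes rather than by using the closed forms.

```python
def bernoulli_pmf(x: int, mu: float) -> float:
    """Bern(x | mu) = mu**x * (1 - mu)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return mu**x * (1.0 - mu) ** (1 - x)


def bernoulli_moments(mu: float) -> tuple[float, float]:
    """Mean and variance obtained by summing over both outcomes."""
    mean = sum(x * bernoulli_pmf(x, mu) for x in (0, 1))
    var = sum((x - mean) ** 2 * bernoulli_pmf(x, mu) for x in (0, 1))
    return mean, var


mean, var = bernoulli_moments(0.7)
print(mean, var)  # approximately mu = 0.7 and mu * (1 - mu) = 0.21
```
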

---------------
Now suppose we have a data set $\mathcal{D}=\{x_1,\cdots,x_N\}$ of observed values of $x$. We want to estimate the unknown mean $\mu$ from these data.
## Frequentist approach
Assuming the observations are drawn independently from $p(x|\mu)$, the likelihood function is given by
$$p(\mathcal{D}|\mu)=\prod_{n=1}^Np(x_n|\mu)=\prod_{n=1}^N \mu^{x_n}(1-\mu)^{1-x_n}$$
We maximize the logarithm of the likelihood:
$$\mu_{ML} = \arg \max_{\mu}\ln p(\mathcal{D}|\mu)=\arg \max_{\mu}\sum_{n=1}^N \ln p(x_n|\mu)=\arg \max_{\mu}\sum_{n=1}^N \{x_n\ln\mu+(1-x_n)\ln(1-\mu)\}$$
Setting the derivative of $\ln p(\mathcal{D}|\mu)$ with respect to $\mu$ equal to zero gives the maximum likelihood estimator
$$\mu_{ML}=\frac{1}{N}\sum_{n=1}^N x_n = \frac{m}{N}$$
where $m$ denotes the number of observations of $x=1$ in the data set.
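
As a small illustration of the estimator $m/N$, here is a sketch in Python. The function name `bernoulli_mle` and the example data set are hypothetical and chosen only for demonstration.

```python
def bernoulli_mle(data: list[int]) -> float:
    """Return mu_ML = m / N, the fraction of observations equal to 1."""
    if not data:
        raise ValueError("data set must not be empty")
    if any(x not in (0, 1) for x in data):
        raise ValueError("observations must be 0 or 1")
    m = sum(data)  # number of observations with x = 1
    return m / len(data)


# Hypothetical data set with m = 7 ones out of N = 10 observations.
data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
print(bernoulli_mle(data))  # 0.7
```
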

Now suppose the data set is $\mathcal{D}=\{1,1,1\}$, so $N=3$, while the true probability is, say, $\mu=0.7$. The *frequentist* estimate is then $\mu_{ML}=1$, which predicts that every future observation will be $x=1$. This is the weakness of the *frequentist* approach: with limited data, maximum likelihood easily gives poor results. This is a kind of over-fitting.
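
To see how likely such a degenerate estimate is, the sketch below computes the probability that all $N$ draws agree, so that $\mu_{ML}$ is exactly $0$ or $1$. The function name `prob_degenerate_mle` is an assumption made for this illustration, and $\mu=0.7$ is the value used in the example above.

```python
def prob_degenerate_mle(n: int, mu: float) -> float:
    """Probability that n independent Bernoulli(mu) draws are all 1 or all 0."""
    return mu**n + (1.0 - mu) ** n


for n in (3, 10, 50):
    # For n = 3 and mu = 0.7 this is 0.7**3 + 0.3**3, about 0.37.
    print(n, prob_degenerate_mle(n, 0.7))
```

With only three observations the estimate is degenerate more than a third of the time, and the probability shrinks quickly as $N$ grows.
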

## Bayesian approach
### Binomial distribution
Consider the distribution of $m$, the number of observations of $x=1$ in a data set of size $N$, which is also the sum of $N$ independent Bernoulli variables $x_1,\cdots,x_N$.
$$Bin(m|N,\mu)=\binom{N}{m}\mu^m(1-\mu)^{N-m}$$
where $\binom{N}{m}\equiv \frac{N!}{(N-m)!\,m!}$ is the number of ways of choosing $m$ objects out of a total of $N$ identical objects, also denoted by $C_N^m$.  
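**Mean**
$$\mathbb{E}[m]=\mathbb{E}\left[\sum_{n=1}^N x_n\right]=\sum_{n=1}^N\mathbb{E}[x_n]=N\mu$$
**Variance**
$$var[m]=var\left[\sum_{n=1}^N x_n\right]=\sum_{n=1}^N var[x_n]=N\mu(1-\mu)$$
where the variance of the sum equals the sum of the variances because the $x_n$ are independent.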
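
The binomial mean and variance can be verified by brute force. The following sketch uses only the standard library (`math.comb`); the function name `binomial_pmf` and the values $N=10$, $\mu=0.7$ are assumptions chosen for illustration.

```python
from math import comb


def binomial_pmf(m: int, n: int, mu: float) -> float:
    """Bin(m | N, mu) = C(N, m) * mu**m * (1 - mu)**(N - m)."""
    return comb(n, m) * mu**m * (1.0 - mu) ** (n - m)


n, mu = 10, 0.7
pmf = [binomial_pmf(m, n, mu) for m in range(n + 1)]

total = sum(pmf)                                      # should be close to 1
mean = sum(m * p for m, p in enumerate(pmf))          # should be close to N * mu
var = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))  # close to N * mu * (1 - mu)
print(total, mean, var)
```
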

### Beta distribution
In order to develop a Bayesian treatment, we need to introduce a prior distribution $p(\mu)$ over the parameter $\mu$. By Bayes' theorem, the posterior distribution has the form 
$$p(\mu|m,N)\propto Bin(m|N,\mu)p(\mu)$$
Now our task is to choose the prior distribution $p(\mu)$. The likelihood function is a product of factors of the form $\mu^x(1-\mu)^{1-x}$. If we choose a prior proportional to powers of $\mu$ and $(1-\mu)$, then the posterior, which is proportional to the product of the prior and the likelihood, has the same functional form as the prior. This property is called *conjugacy*. We therefore choose the prior
$$Beta(\mu|a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}$$
which is called the *beta* distribution, where
$$\Gamma(x)=\int_0^{\infty}u^{x-1}e^{-u}du$$
The fraction of gamma functions, controlled by the parameters $a$ and $b$, is the coefficient that ensures the beta distribution is normalized.  
**Mean**
$$\mathbb{E}[\mu]=\frac{a}{a+b}$$
**Variance**
$$var[\mu]=\frac{ab}{(a+b)^2(a+b+1)}$$
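
The normalization, mean, and variance can be checked with a simple numerical integration. This is a rough sketch under stated assumptions: the helper names `beta_pdf` and `integrate_unit_interval` are hypothetical, the integration uses a plain midpoint rule, and $a=2$, $b=3$ are arbitrary example values.

```python
from math import gamma


def beta_pdf(mu: float, a: float, b: float) -> float:
    """Beta(mu | a, b) with the gamma-function normalization coefficient."""
    coeff = gamma(a + b) / (gamma(a) * gamma(b))
    return coeff * mu ** (a - 1) * (1.0 - mu) ** (b - 1)


def integrate_unit_interval(f, steps: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / steps
    return h * sum(f((i + 0.5) * h) for i in range(steps))


a, b = 2.0, 3.0
area = integrate_unit_interval(lambda mu: beta_pdf(mu, a, b))
mean = integrate_unit_interval(lambda mu: mu * beta_pdf(mu, a, b))
var = integrate_unit_interval(lambda mu: (mu - mean) ** 2 * beta_pdf(mu, a, b))

print(area)                                          # close to 1
print(mean, a / (a + b))                             # both close to 0.4
print(var, a * b / ((a + b) ** 2 * (a + b + 1)))     # both close to 0.04
```
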
The parameters $a$ and $b$ are often called *hyperparameters* because they control the distribution of the parameter $\mu$. With this prior distribution, the posterior distribution has the form
$$p(\mu|m,l,a,b)\propto \mu^{m+a-1}(1-\mu)^{l+b-1}$$
where $l = N-m$ denotes the number of observations of $x=0$.
This again has the form of a *Beta* distribution. Normalizing it gives
$$p(\mu|m,l,a,b)=\frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)}\mu^{m+a-1}(1-\mu)^{l+b-1}$$
where $a$ and $b$ in the prior can be interpreted as the effective numbers of prior observations of $x=1$ and $x=0$. Thus the effective total count of $x=1$ is $m+a$, whereas the effective total count of $x=0$ is $l+b$.  
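
In code, the posterior update amounts to adding the observed counts to the hyperparameters. The sketch below assumes a hypothetical helper `beta_posterior` and an example prior $a=b=2$; neither comes from the original notes.

```python
def beta_posterior(a: float, b: float, data: list[int]) -> tuple[float, float]:
    """Return the posterior hyperparameters (a + m, b + l) for binary data."""
    m = sum(data)        # number of observations with x = 1
    l = len(data) - m    # number of observations with x = 0
    return a + m, b + l


# Hypothetical prior Beta(2, 2) and a data set with m = 3, l = 2.
a_post, b_post = beta_posterior(2.0, 2.0, [1, 0, 1, 1, 0])
print(a_post, b_post)  # 5.0 4.0
```
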
If our goal is to predict a new $x$ given the observed data set $\mathcal{D}$, we need the predictive distribution
$$\begin{align*}
p(x=1|\mathcal{D})
&=\int_0^1 p(x=1|\mu)p(\mu|\mathcal{D})d\mu \qquad Sum\ and\ Product\ Rules\\
&=\int_0^1\mu p(\mu|\mathcal{D})d\mu\\
&=\mathbb{E}[\mu|\mathcal{D}]
\end{align*}$$
Using the mean of the Beta distribution, we obtain
$$p(x=1|\mathcal{D})=\frac{m+a}{m+a+l+b}$$
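
Applied to the data set $\mathcal{D}=\{1,1,1\}$ from the frequentist section, the predictive probability no longer collapses to $1$. The following sketch assumes a prior with $a=b=2$, which is an arbitrary choice for illustration, and uses integer hyperparameters only so that `fractions.Fraction` can show the exact result; the function name is hypothetical.

```python
from fractions import Fraction


def predictive_prob_one(m: int, l: int, a: int, b: int) -> Fraction:
    """p(x = 1 | D) = (m + a) / (m + a + l + b)."""
    return Fraction(m + a, m + a + l + b)


m, l = 3, 0                                  # D = {1, 1, 1}
print(m / (m + l))                           # maximum likelihood: 1.0
print(predictive_prob_one(m, l, a=2, b=2))   # Bayesian prediction: 5/7
```
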
The posterior variance shows that as the data set $\mathcal{D}$ grows, the uncertainty decreases and finally tends to $0$. At the same time, the predictive probability $p(x=1|\mathcal{D})$ approaches the maximum likelihood result $m/N$, which itself converges to the true mean. 

### Sequential approach to learning
From the Bayesian treatment above, we can see that learning can proceed sequentially: the posterior obtained from the initial prior and the trials seen so far serves as the prior for the next trial. Each new independent observation therefore updates our posterior distribution.
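
The sequential view can be sketched as a generator that consumes one observation at a time. The function name `sequential_beta_updates`, the prior $(a,b)=(2,2)$, and the data stream are hypothetical; the point is only that the final hyperparameters agree with the batch result $(a+m,\ b+l)$.

```python
from typing import Iterable, Iterator


def sequential_beta_updates(
    a: float, b: float, stream: Iterable[int]
) -> Iterator[tuple[float, float]]:
    """Yield the Beta hyperparameters after each new binary observation."""
    for x in stream:
        if x == 1:
            a += 1      # one more observation of x = 1
        elif x == 0:
            b += 1      # one more observation of x = 0
        else:
            raise ValueError("observations must be 0 or 1")
        yield a, b


data = [1, 0, 1, 1]
for step, (a_n, b_n) in enumerate(sequential_beta_updates(2, 2, data), start=1):
    # The posterior after step n acts as the prior for step n + 1.
    print(step, a_n, b_n, a_n / (a_n + b_n))
```
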


-----------------------------
--------------------------
# Multinomial Variables
### Multiple outcomes Bernoulli distribution
Consider picking one of $K$ mutually exclusive outcomes, represented by $\mathbf{x}=(x_1,\cdots, x_K)^T$, with probabilities $\mathbf{\mu}=(\mu_1,\cdots,\mu_K)^T$. In the case $K=6$, when the third outcome is picked, $x_3$ is set to $1$ and all other elements are set to $0$, which is denoted by
$$\mathbf{x}=(0,0,1,0,0,0)^T$$
By analogy with the Bernoulli distribution, the distribution is given by
$$p(\mathbf{x}|\mathbf{\mu})=\prod_{k=1}^{K}\mu_k^{x_k}=\left\{\begin{matrix}
p(x_1=1) &=\mu_1 \\ 
p(x_2=1) &=\mu_2 \\ 
\vdots\\
p(x_K=1) &=\mu_K 
\end{matrix}\right.\qquad \sum_{k=1}^K \mu_k=1$$
which can be regarded as a generalization of the Bernoulli distribution to more than two outcomes. 
$$\mathbb{E}[\mathbf{x}|\mathbf{\mu}]=\sum_{\mathbf{x}}p(\mathbf{x}|\mathbf{\mu})\mathbf{x}=(\mu_1,\mu_2,\cdots,\mu_K)^T=\mathbf{\mu}$$
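
The 1-of-$K$ representation and its expectation can be sketched as follows. The helper names `one_hot`, `categorical_pmf`, and `expectation` are hypothetical, as is the example probability vector; outcomes are counted from $1$ to match the text.

```python
from math import prod


def one_hot(k: int, K: int) -> list[int]:
    """1-of-K vector with x_k = 1 (k counted from 1) and all other entries 0."""
    if not 1 <= k <= K:
        raise ValueError("k must lie in 1..K")
    return [1 if j == k else 0 for j in range(1, K + 1)]


def categorical_pmf(x: list[int], mu: list[float]) -> float:
    """p(x | mu) = prod_k mu_k ** x_k for a 1-of-K vector x."""
    return prod(p**x_k for p, x_k in zip(mu, x))


def expectation(mu: list[float]) -> list[float]:
    """E[x | mu], summing p(x | mu) * x over all K possible 1-of-K vectors."""
    K = len(mu)
    vectors = [one_hot(k, K) for k in range(1, K + 1)]
    return [sum(categorical_pmf(x, mu) * x[j] for x in vectors) for j in range(K)]


mu = [0.1, 0.1, 0.4, 0.2, 0.1, 0.1]   # hypothetical probabilities, K = 6
x = one_hot(3, 6)
print(x)                       # [0, 0, 1, 0, 0, 0]
print(categorical_pmf(x, mu))  # 0.4
print(expectation(mu))         # equals mu
```
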

### Multinomial distribution
Now consider a data set $\mathcal{D}$ of $N$ independent observations $\mathbf{x}_1,\cdots,\mathbf{x}_N$. The corresponding likelihood function takes the form
$$p(\mathcal{D}|\mathbf{\mu})=\prod_{n=1}^N\prod_{k=1}^K\mu_k^{x_{nk}}=\prod_{k=1}^K\mu_k^{\left(\sum_nx_{nk}\right)}=\prod_{k=1}^K\mu_k^{m_k}$$
where $m_k=\sum_n x_{nk}$ represents the number of observations of $x_k=1$.  
We need to maximize the log likelihood subject to the constraint $\sum_{k=1}^K \mu_k=1$. This can be achieved using a Lagrange multiplier $\lambda$ and maximizing
$$\sum_{k=1}^K m_k\ln \mu_k+\lambda\left(\sum_{k=1}^K\mu_k - 1\right)$$
Setting the derivative with respect to $\mu_k$ to zero gives $m_k/\mu_k+\lambda=0$, that is,
$$\mu_k=-m_k/\lambda$$
Substituting this into the constraint $\sum_{k=1}^K \mu_k=1$ gives $\lambda = -N$, and therefore
$$\mu_k^{ML}=\frac{m_k}{N}$$
which is the fraction of the $N$ observations for which $x_k=1$.
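
A minimal sketch of this estimator, assuming the observations are stored as 1-of-$K$ lists. The function name `multinomial_mle` and the four example observations are hypothetical.

```python
def multinomial_mle(observations: list[list[int]]) -> list[float]:
    """Return mu_k = m_k / N, where m_k counts the observations with x_k = 1."""
    n = len(observations)
    if n == 0:
        raise ValueError("data set must not be empty")
    counts = [sum(column) for column in zip(*observations)]  # m_1, ..., m_K
    return [m_k / n for m_k in counts]


# Hypothetical data set: N = 4 observations with K = 3 outcomes.
observations = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
print(multinomial_mle(observations))  # [0.5, 0.25, 0.25]
```
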
We can also consider the joint distribution of the quantities $m_1,\cdots,m_K$, conditioned on the parameters $\mathbf{\mu}$ and $N$, which takes the form 
$$Mult(m_1,\cdots,m_K|\mathbf{\mu},N)=\binom{N}{m_1\cdots m_K}\prod_{k=1}^K\mu_k^{m_k}$$
where the coefficient is the number of ways of partitioning $N$ objects into $K$ groups of sizes $m_1,\cdots,m_K$, which is given by
$$\binom{N}{m_1\cdots m_K}=\frac{N!}{m_1!\cdots m_K!}\qquad \sum_{k=1}^Km_k=N$$
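
The coefficient and the multinomial probability can be computed directly with the standard library. This is a sketch with hypothetical function names and example values; the loop at the end checks that the probabilities over all count vectors with $N=3$, $K=3$ sum to approximately $1$.

```python
from math import factorial, prod


def multinomial_coefficient(counts: list[int]) -> int:
    """N! / (m_1! * ... * m_K!) with N = sum of the counts."""
    result = factorial(sum(counts))
    for m_k in counts:
        result //= factorial(m_k)   # exact: every partial quotient is an integer
    return result


def multinomial_pmf(counts: list[int], mu: list[float]) -> float:
    """Mult(m_1, ..., m_K | mu, N) for the given counts."""
    return multinomial_coefficient(counts) * prod(p**m for p, m in zip(mu, counts))


print(multinomial_coefficient([2, 1, 1]))  # 4! / (2! 1! 1!) = 12

mu = [0.5, 0.25, 0.25]   # hypothetical probabilities
total = sum(
    multinomial_pmf([i, j, 3 - i - j], mu)
    for i in range(4)
    for j in range(4 - i)
)
print(total)  # close to 1
```
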

### Dirichlet distribution
Likewise, the multivariate generalization of the Beta distribution is the Dirichlet distribution, which serves as the conjugate prior for the parameters $\mathbf{\mu}$ of the multinomial distribution.
$$Dir(\mathbf{\mu}|\mathbf{a})=\frac{\Gamma(a_0)}{\Gamma(a_1)\cdots\Gamma(a_K)}\prod^{K}_{k=1}\mu_k^{a_k-1}\qquad a_0=\sum_{k=1}^Ka_k$$
The posterior distribution again takes the form of a Dirichlet distribution, which is given by
$$p(\mathbf{\mu}|\mathcal{D},\mathbf{a})=Dir(\mathbf{\mu}|\mathbf{a}+\mathbf{m})=\frac{\Gamma(a_0+N)}{\Gamma(a_1+m_1)\cdots\Gamma(a_K+m_K)}\prod_{k=1}^K\mu_k^{a_k+m_k-1}$$
where $\mathbf{m}=(m_1,\cdots,m_K)^T$ collects the counts of $x_k=1$ observed in the data set $\mathcal{D}$.
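
As in the Beta case, the update reduces to adding counts to hyperparameters. The sketch below uses hypothetical function names and a uniform prior $\mathbf{a}=(1,1,1)$ chosen for illustration. It also relies on the standard result, not stated above, that the mean of $Dir(\mathbf{\mu}|\mathbf{a})$ is $a_k/a_0$.

```python
def dirichlet_posterior(a: list[float], counts: list[int]) -> list[float]:
    """Posterior hyperparameters a + m for a Dirichlet prior and observed counts."""
    if len(a) != len(counts):
        raise ValueError("a and counts must have the same length K")
    return [a_k + m_k for a_k, m_k in zip(a, counts)]


def dirichlet_mean(a: list[float]) -> list[float]:
    """Mean of Dir(mu | a): a_k / a_0 (standard result, assumed here)."""
    a0 = sum(a)
    return [a_k / a0 for a_k in a]


prior = [1.0, 1.0, 1.0]     # hypothetical uniform prior, K = 3
counts = [2, 1, 1]          # m_1, m_2, m_3 from N = 4 observations
posterior = dirichlet_posterior(prior, counts)
print(posterior)            # [3.0, 2.0, 2.0]
print(dirichlet_mean(posterior))  # approximately [3/7, 2/7, 2/7]
```
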