# Bernoulli Distribution

The Bernoulli distribution is a discrete probability distribution that models a random experiment with exactly two possible outcomes: success (usually denoted as 1) and failure (usually denoted as 0). It is characterized by a single parameter \( p \in [0, 1] \), the probability of success.

## Probability Mass Function (PMF)

If \( X \) is a random variable following a Bernoulli distribution, its PMF is given by:

$$
P(X = x) =
\begin{cases}
p, & \text{if } x = 1 \ (\text{success}), \\
1 - p, & \text{if } x = 0 \ (\text{failure}).
\end{cases}
$$

This can also be written compactly as:

$$
P(X = x) = p^x (1 - p)^{1 - x}, \quad \text{for } x \in \{0, 1\}.
$$
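
As a minimal sketch, the compact form of the PMF can be written directly in Python. The function name `bernoulli_pmf` and the value \( p = 0.3 \) are illustrative choices, not part of the definition above.

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """Return P(X = x) for X ~ Bernoulli(p) using p^x * (1 - p)^(1 - x)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if x not in (0, 1):
        return 0.0  # outside the support {0, 1}
    return p**x * (1.0 - p) ** (1 - x)


p = 0.3  # illustrative success probability
print(bernoulli_pmf(1, p))  # probability of success, p
print(bernoulli_pmf(0, p))  # probability of failure, 1 - p
```

The two probabilities sum to 1 for any valid \( p \), which is a quick sanity check on the formula.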

---

## Derivation of the Mean $ E[X] $

The mean (or expected value) of a random variable \( X \) is defined as:

$$
E[X] = \sum_x x \cdot P(X = x).
$$

For the Bernoulli distribution, \( x \) can take only two values: 0 and 1. Substituting these values into the formula:

$$
E[X] = \sum_{x=0}^1 x \cdot P(X = x).
$$

Expanding the summation:

$$
E[X] = (0 \cdot P(X = 0)) + (1 \cdot P(X = 1)).
$$

Substitute \( P(X = 0) = 1 - p \) and \( P(X = 1) = p \):

$$
E[X] = (0 \cdot (1 - p)) + (1 \cdot p).
$$

Simplify:

$$
E[X] = 0 + p = p.
$$

Thus, the mean of a Bernoulli random variable is:

$$
E[X] = p.
$$
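
The summation above can be mirrored in a few lines of Python. This is only a sketch: the helper `expectation` and the value \( p = 0.3 \) are assumptions made for illustration.

```python
def expectation(pmf, support, g=lambda x: x):
    """Return E[g(X)] = sum of g(x) * P(X = x) over a finite support."""
    return sum(g(x) * pmf(x) for x in support)


p = 0.3                    # illustrative success probability
pmf = {0: 1.0 - p, 1: p}   # Bernoulli(p) probabilities

mean = expectation(pmf.get, support=(0, 1))
print(mean)  # matches p up to floating-point rounding
```

Passing a different `g`, for example `lambda x: x**2`, gives other moments with the same helper.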

---

## Derivation of the Variance $ \text{Var}(X) $

The variance of a random variable \( X \) is defined as:

$$
\text{Var}(X) = E[X^2] - (E[X])^2.
$$

### Step 1: Compute $ E[X^2] $

The second moment $ E[X^2] $ is given by:

$$
E[X^2] = \sum_x x^2 \cdot P(X = x).
$$

For the Bernoulli distribution, \( x \) can take only two values: 0 and 1. Substituting these values:

$$
E[X^2] = (0^2 \cdot P(X = 0)) + (1^2 \cdot P(X = 1)).
$$

Simplify:

$$
E[X^2] = (0 \cdot (1 - p)) + (1 \cdot p).
$$

Thus:

$$
E[X^2] = p.
$$

### Step 2: Compute $ (E[X])^2 $

From the earlier derivation, we know that:

$$
E[X] = p.
$$

Thus:

$$
(E[X])^2 = p^2.
$$

### Step 3: Substitute into the Variance Formula

Using the variance formula:

$$
\text{Var}(X) = E[X^2] - (E[X])^2.
$$

Substitute $ E[X^2] = p $ and $ (E[X])^2 = p^2 $:

$$
\text{Var}(X) = p - p^2.
$$

Factorize:

$$
\text{Var}(X) = p(1 - p).
$$

Thus, the variance of a Bernoulli random variable is:

$$
\text{Var}(X) = p(1 - p).
$$
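
A small simulation can serve as an empirical check of both results. The sketch below uses only Python's standard library; the seed, sample size, and \( p = 0.3 \) are arbitrary illustrative choices.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

p = 0.3        # illustrative success probability
n = 200_000    # number of simulated Bernoulli trials

# Draw 1 with probability p and 0 otherwise.
samples = [1 if random.random() < p else 0 for _ in range(n)]

sample_mean = sum(samples) / n
sample_var = sum((x - sample_mean) ** 2 for x in samples) / n

print(f"sample mean     = {sample_mean:.4f}  (theory: {p:.4f})")
print(f"sample variance = {sample_var:.4f}  (theory: {p * (1 - p):.4f})")
```

With this many trials the sample statistics should land close to \( p \) and \( p(1 - p) \), though the exact digits depend on the random draws.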

---

# Beta Distribution

The Beta distribution is a continuous probability distribution defined on the interval \([0, 1]\). It is widely used in Bayesian statistics and in modeling probabilities or proportions. The Beta distribution is characterized by two shape parameters, typically denoted as \( \alpha > 0 \) and \( \beta > 0 \). These parameters control the shape of the distribution.

## Probability Density Function (PDF)

The PDF of the Beta distribution is given by:

$$
f(x; \alpha, \beta) =
\begin{cases}
\frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}, & \text{if } 0 \leq x \leq 1, \\
0, & \text{otherwise}.
\end{cases}
$$

where \( B(\alpha, \beta) \) is the Beta function, which ensures that the total probability integrates to 1. The Beta function is defined as:

$$
B(\alpha, \beta) = \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} dt.
$$

The Beta function can also be expressed in terms of the Gamma function \( \Gamma \):

$$
B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}.
$$
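
The following sketch evaluates the Beta function through the Gamma identity and checks numerically that the density integrates to 1. The function names and the parameters \( \alpha = 2 \), \( \beta = 5 \) are illustrative assumptions, and the density helper assumes \( \alpha, \beta \geq 1 \) so that the endpoints are finite.

```python
import math


def beta_function(a: float, b: float) -> float:
    """B(a, b) from the Gamma identity, computed via log-gamma for stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of Beta(a, b) at x; zero outside [0, 1]. Assumes a, b >= 1."""
    if x < 0.0 or x > 1.0:
        return 0.0
    return x ** (a - 1) * (1.0 - x) ** (b - 1) / beta_function(a, b)


def midpoint_integral(f, n: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))


a, b = 2.0, 5.0  # illustrative shape parameters
print(beta_function(a, b))                             # close to 1/30
print(midpoint_integral(lambda x: beta_pdf(x, a, b)))  # close to 1
```

For these parameters, \( B(2, 5) = \Gamma(2)\Gamma(5)/\Gamma(7) = 24/720 = 1/30 \), which the first printed value should reproduce up to rounding.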

---

## Derivation of the Mean \( E[X] \)

The mean of a random variable \( X \) following a Beta distribution is given by:

$$
E[X] = \int_0^1 x f(x; \alpha, \beta) dx.
$$

Substitute the PDF \( f(x; \alpha, \beta) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)} \):

$$
E[X] = \int_0^1 x \cdot \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)} dx.
$$

Simplify:

$$
E[X] = \frac{1}{B(\alpha, \beta)} \int_0^1 x^\alpha (1 - x)^{\beta - 1} dx.
$$

Since the integrand is \( x^{(\alpha + 1) - 1} (1 - x)^{\beta - 1} \), the integral is the Beta function \( B(\alpha + 1, \beta) \):

$$
E[X] = \frac{B(\alpha + 1, \beta)}{B(\alpha, \beta)}.
$$

Using the relationship between Beta and Gamma functions:

$$
E[X] = \frac{\frac{\Gamma(\alpha + 1) \Gamma(\beta)}{\Gamma(\alpha + \beta + 1)}}{\frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}}.
$$

Simplify:

$$
E[X] = \frac{\Gamma(\alpha + 1)}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha + \beta + 1)}.
$$

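Using the property \( \Gamma(z + 1) = z \Gamma(z) \) twice, so that \( \Gamma(\alpha + 1) = \alpha \Gamma(\alpha) \) and \( \Gamma(\alpha + \beta + 1) = (\alpha + \beta) \Gamma(\alpha + \beta) \):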

$$
E[X] = \frac{\alpha}{\alpha + \beta}.
$$
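
The identity \( B(\alpha + 1, \beta) / B(\alpha, \beta) = \alpha / (\alpha + \beta) \) can be spot-checked numerically. This is a sketch using `math.gamma` from the standard library; the parameter pairs are arbitrary test values.

```python
import math


def beta_function(a: float, b: float) -> float:
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


results = []
for a, b in [(2.0, 5.0), (0.5, 0.5), (3.0, 1.5)]:  # illustrative parameters
    ratio = beta_function(a + 1, b) / beta_function(a, b)  # B(a+1, b) / B(a, b)
    closed_form = a / (a + b)                              # derived mean
    results.append((ratio, closed_form))
    print(f"alpha={a}, beta={b}: ratio={ratio:.6f}, alpha/(alpha+beta)={closed_form:.6f}")
```

The two columns should agree up to floating-point rounding for every pair.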

---

## Derivation of the Variance $ \text{Var}(X) $

The variance of a Beta random variable \( X \) is given by:

$$
\text{Var}(X) = E[X^2] - (E[X])^2.
$$

### Step 1: Compute \( E[X^2] \)

Following the same steps as for the mean, with \( x^2 \) in place of \( x \), the second moment is:

$$
E[X^2] = \frac{B(\alpha + 2, \beta)}{B(\alpha, \beta)}.
$$

Using properties of Beta and Gamma functions, this simplifies to:

$$
E[X^2] = \frac{\alpha (\alpha + 1)}{(\alpha + \beta)(\alpha + \beta + 1)}.
$$
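
For completeness, the intermediate steps can be written out with the same Gamma-function identities used for the mean:

$$
\begin{aligned}
E[X^2] &= \frac{1}{B(\alpha, \beta)} \int_0^1 x^{\alpha + 1} (1 - x)^{\beta - 1} dx = \frac{B(\alpha + 2, \beta)}{B(\alpha, \beta)} \\
&= \frac{\Gamma(\alpha + 2)}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha + \beta + 2)} \\
&= \frac{(\alpha + 1)\,\alpha\,\Gamma(\alpha)}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha + \beta)}{(\alpha + \beta + 1)(\alpha + \beta)\,\Gamma(\alpha + \beta)}.
\end{aligned}
$$

Here \( \Gamma(z + 1) = z \Gamma(z) \) has been applied twice to each of \( \Gamma(\alpha + 2) \) and \( \Gamma(\alpha + \beta + 2) \); cancelling the common Gamma factors leaves the expression above.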

### Step 2: Substitute and Simplify

Using \( E[X] = \frac{\alpha}{\alpha + \beta} \):

$$
\text{Var}(X) = \frac{\alpha (\alpha + 1)}{(\alpha + \beta)(\alpha + \beta + 1)} - \left( \frac{\alpha}{\alpha + \beta} \right)^2.
$$

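Place both terms over the common denominator \( (\alpha + \beta)^2 (\alpha + \beta + 1) \). The numerator becomes \( \alpha(\alpha + 1)(\alpha + \beta) - \alpha^2(\alpha + \beta + 1) = \alpha\beta \), so: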

$$
\text{Var}(X) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.
$$
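
As an empirical check of the mean and variance formulas, the sketch below draws samples with `random.betavariate` from Python's standard library. The seed, sample size, and shape parameters are arbitrary illustrative choices.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

alpha, beta = 2.0, 5.0  # illustrative shape parameters
n = 200_000             # number of simulated draws

samples = [random.betavariate(alpha, beta) for _ in range(n)]

sample_mean = sum(samples) / n
sample_var = sum((x - sample_mean) ** 2 for x in samples) / n

theory_mean = alpha / (alpha + beta)
theory_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

print(f"mean:     sample {sample_mean:.4f}, theory {theory_mean:.4f}")
print(f"variance: sample {sample_var:.4f}, theory {theory_var:.4f}")
```

The sample statistics should sit close to the closed-form values, with small differences due to sampling noise.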

# Conjugacy of the Bernoulli and Beta Distributions

In Bayesian statistics, **conjugacy** refers to a situation where the prior distribution and the posterior distribution belong to the same family of distributions. The Bernoulli distribution (likelihood) and the Beta distribution (prior) form a conjugate pair. This makes Bayesian inference particularly elegant because the posterior distribution retains the same functional form as the prior.

---

## Setup: Likelihood and Prior

### Bernoulli Likelihood

A Bernoulli random variable $ X \sim \text{Bernoulli}(p) $ models binary outcomes (success/failure), where $ p $ is the probability of success.

The likelihood function for observing $ n $ independent Bernoulli trials with outcomes $ x_1, x_2, \ldots, x_n $, where $ x_i \in \{0, 1\} $, is proportional to:

$$
P(x_1, x_2, \ldots, x_n \mid p) \propto p^{\text{number of successes}} (1 - p)^{\text{number of failures}}.
$$

Let:
- $ s = \sum_{i=1}^n x_i $ represent the total number of successes.
- $ f = n - s $ represent the total number of failures.

Thus, the likelihood function becomes:

$$
P(x_1, x_2, \ldots, x_n \mid p) = p^s (1 - p)^f.
$$

The likelihood depends on the data only through the counts $ s $ and $ f $, so the order of the outcomes does not matter.
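
A minimal sketch of this likelihood in Python follows. The function name, the data, and the value $ p = 0.6 $ are illustrative assumptions; the loop confirms that the product of the individual Bernoulli probabilities equals $ p^s (1 - p)^f $.

```python
def bernoulli_likelihood(data, p):
    """Return (likelihood, s, f) for i.i.d. Bernoulli(p) outcomes in `data`."""
    s = sum(data)        # number of successes
    f = len(data) - s    # number of failures
    return p**s * (1.0 - p) ** f, s, f


data = [1, 0, 1, 1, 0]  # illustrative outcomes, not taken from the text
p = 0.6                 # illustrative success probability

likelihood, s, f = bernoulli_likelihood(data, p)

# Same quantity as a product over the individual observations.
product = 1.0
for x in data:
    product *= p if x == 1 else 1.0 - p

print(s, f, likelihood, product)
```

For this data, $ s = 3 $ and $ f = 2 $, so the likelihood is $ 0.6^3 \cdot 0.4^2 $.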

---

### Beta Prior

The Beta distribution is defined on the interval $ [0, 1] $, making it a natural choice for modeling probabilities (like $ p $). 

A Beta prior $ p \sim \text{Beta}(\alpha, \beta) $ encodes prior beliefs about $ p $ through its shape parameters $ \alpha > 0 $ and $ \beta > 0 $. 

The PDF of the Beta distribution is:

$$
P(p) = \frac{1}{B(\alpha, \beta)} p^{\alpha - 1} (1 - p)^{\beta - 1},
$$

where $ B(\alpha, \beta) $ is the Beta function, defined as:

$$
B(\alpha, \beta) = \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} dt.
$$

It can also be expressed in terms of the Gamma function $ \Gamma(\cdot) $:

$$
B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}.
$$

---

## Derivation: Posterior Distribution

### Step 1: Write the Prior

The prior distribution for $ p $ is:

$$
P(p) = \frac{1}{B(\alpha, \beta)} p^{\alpha - 1} (1 - p)^{\beta - 1}.
$$

---

### Step 2: Write the Likelihood

Suppose we observe $ n $ independent Bernoulli trials with outcomes $ x_1, x_2, \ldots, x_n $, where $ x_i \in \{0, 1\} $. Let $ s = \sum_{i=1}^n x_i $ represent the total number of successes, and $ f = n - s $ represent the total number of failures.

The likelihood function for $ p $ given the data is:

$$
P(x_1, x_2, \ldots, x_n \mid p) = p^s (1 - p)^f.
$$

---

### Step 3: Combine Prior and Likelihood

Using **Bayes' theorem**, the posterior distribution is proportional to the product of the prior and the likelihood:

$$
P(p \mid x_1, x_2, \ldots, x_n) \propto P(p) \cdot P(x_1, x_2, \ldots, x_n \mid p).
$$

Substitute the expressions for the prior and likelihood:

$$
P(p \mid x_1, x_2, \ldots, x_n) \propto \left[ \frac{1}{B(\alpha, \beta)} p^{\alpha - 1} (1 - p)^{\beta - 1} \right] \cdot \left[ p^s (1 - p)^f \right].
$$

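The factor $ 1 / B(\alpha, \beta) $ does not depend on $ p $, so it is absorbed into the proportionality constant. Combining the exponents gives: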

$$
P(p \mid x_1, x_2, \ldots, x_n) \propto p^{\alpha - 1 + s} (1 - p)^{\beta - 1 + f}.
$$

---

### Step 4: Identify the Posterior Distribution

The resulting expression:

$$
P(p \mid x_1, x_2, \ldots, x_n) \propto p^{\alpha + s - 1} (1 - p)^{\beta + f - 1}
$$

is proportional to the PDF of a Beta distribution with updated parameters:

$$
p \mid x_1, x_2, \ldots, x_n \sim \text{Beta}(\alpha + s, \beta + f).
$$

Thus, the posterior distribution is also a Beta distribution, with updated shape parameters:

$$
\alpha_{\text{posterior}} = \alpha + s, \quad \beta_{\text{posterior}} = \beta + f.
$$
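
One way to see that the normalizing constant works out is to integrate the unnormalized posterior numerically and compare it with $ B(\alpha + s, \beta + f) $. The sketch below does this with a midpoint rule; the prior parameters and counts are illustrative assumptions.

```python
import math

alpha, beta = 2.0, 3.0  # illustrative prior parameters
s, f = 4, 1             # illustrative counts of successes and failures


def kernel(p: float) -> float:
    """Unnormalized posterior p^(alpha + s - 1) * (1 - p)^(beta + f - 1)."""
    return p ** (alpha + s - 1) * (1.0 - p) ** (beta + f - 1)


# Midpoint-rule approximation of the kernel's integral over [0, 1].
n = 100_000
h = 1.0 / n
numeric_constant = h * sum(kernel((i + 0.5) * h) for i in range(n))

# Beta function B(alpha + s, beta + f) from the Gamma identity.
a_post, b_post = alpha + s, beta + f
exact_constant = math.gamma(a_post) * math.gamma(b_post) / math.gamma(a_post + b_post)

print(numeric_constant, exact_constant)  # both close to 1/504
```

Here the posterior is $ \text{Beta}(6, 4) $, and $ B(6, 4) = 5!\,3!/9! = 1/504 $.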

---

## Intuition Behind Conjugacy

The property of retaining the same distributional form after updating the parameters is called **conjugacy**. This simplifies Bayesian inference significantly because it allows us to update our prior beliefs in a straightforward way after observing data.

In this case, the Beta distribution is the **conjugate prior** of the Bernoulli distribution:

- The prior starts as $ \text{Beta}(\alpha, \beta) $.
- After observing $ s $ successes and $ f $ failures, the posterior becomes $ \text{Beta}(\alpha + s, \beta + f) $.
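
The update rule amounts to adding counts to the prior parameters. As a sketch, with a hypothetical helper `update_beta` and made-up data:

```python
def update_beta(alpha, beta, data):
    """Return posterior (alpha + s, beta + f) after observing 0/1 outcomes in `data`."""
    s = sum(data)        # number of successes
    f = len(data) - s    # number of failures
    return alpha + s, beta + f


# Beta(1, 1) is the uniform prior on [0, 1]; the data are illustrative.
alpha_post, beta_post = update_beta(1.0, 1.0, [1, 1, 0, 1])

# Mean of Beta(a, b) is a / (a + b).
posterior_mean = alpha_post / (alpha_post + beta_post)

print(alpha_post, beta_post, posterior_mean)
```

With three successes and one failure, the uniform prior becomes $ \text{Beta}(4, 2) $, whose mean is $ 4/6 $.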

---

## Summary

- **Prior**: $ p \sim \text{Beta}(\alpha, \beta) $
- **Likelihood**: $ P(x_1, x_2, \ldots, x_n \mid p) = p^s (1 - p)^f $
- **Posterior**: $ p \mid x_1, x_2, \ldots, x_n \sim \text{Beta}(\alpha + s, \beta + f) $

This conjugacy relationship makes Bayesian updates efficient and elegant, especially when dealing with Bernoulli or binomial data.
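
One practical consequence is that data can be processed one observation at a time, with each posterior serving as the next prior, and the result matches a single batch update. The sketch below illustrates this with a hypothetical helper `update_beta` and made-up data.

```python
def update_beta(alpha, beta, data):
    """Return posterior (alpha + s, beta + f) after observing 0/1 outcomes in `data`."""
    s = sum(data)
    f = len(data) - s
    return alpha + s, beta + f


data = [1, 0, 0, 1, 1, 1]  # illustrative outcomes
prior = (2.0, 2.0)         # illustrative Beta(2, 2) prior

# Batch update: use all observations at once.
batch = update_beta(*prior, data)

# Sequential update: yesterday's posterior is today's prior.
sequential = prior
for x in data:
    sequential = update_beta(*sequential, [x])

print(batch, sequential)  # the two pairs of parameters are identical
```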
