# 2.0

The uniform distribution represents a variable that is known to lie in an interval and is equally likely to be found anywhere in that interval. A noninformative distribution is obtained in the limit as $a \to -\infty, b \to \infty$. If $u$ is drawn from a standard uniform distribution $U(0, 1)$, then $\theta = a + (b - a)u$ is a draw from $U(a, b)$.
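The rescaling rule $\theta = a + (b - a)u$ can be sketched in Python. The function name `draw_uniform`, the endpoints, and the seed are illustrative assumptions, not part of the notes.

```python
import random

def draw_uniform(a, b, rng):
    """Draw from U(a, b) by rescaling a standard uniform draw u ~ U(0, 1)."""
    u = rng.random()           # standard uniform draw
    return a + (b - a) * u     # theta = a + (b - a) u

rng = random.Random(0)         # fixed seed, chosen arbitrarily for reproducibility
draws = [draw_uniform(2.0, 5.0, rng) for _ in range(10_000)]
sample_mean = sum(draws) / len(draws)   # should be close to (a + b) / 2 = 3.5
```

With enough draws the sample mean should settle near the midpoint of the interval.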

Univariate Normal Distribution: two properties of the normal distribution that play a large role in model building and Bayesian computation are the addition and mixture properties.
- The sum of two independent normal random variables is normally distributed.
    - If $\theta_1$ and $\theta_2$ are independent with $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ distributions respectively, then $\theta_1 + \theta_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
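
The addition property can be checked by simulation. This is a minimal sketch; the particular means, standard deviations, sample size, and seed are assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(1)          # arbitrary seed for reproducibility
mu1, sigma1 = 1.0, 2.0          # hypothetical parameters of theta_1
mu2, sigma2 = -3.0, 1.5         # hypothetical parameters of theta_2

# Sum of two independent normal draws, repeated many times
sums = [rng.gauss(mu1, sigma1) + rng.gauss(mu2, sigma2) for _ in range(50_000)]

expected_mean = mu1 + mu2                  # means add
expected_var = sigma1**2 + sigma2**2       # variances (not standard deviations) add
sample_mean = statistics.fmean(sums)
sample_var = statistics.variance(sums)
```

The sample mean and variance of the sums should land close to `expected_mean` and `expected_var`.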

# 2.1

The binomial sampling model is:

$$p(y | \theta) = Bin(y | n, \theta) = {n \choose y} \theta^y(1 - \theta)^{n - y}$$

where $n$ is the number of exchangeable trials and $y$ is the number of successes in those $n$ trials. Here $\theta$ represents the proportion of successes in the population (i.e. the probability of success in each trial; the population parameter), so $p(y | \theta)$ is the distribution of the number of successes given the probability of success. Conditioning on $\theta$ can be read as "based on what we know".

The binomial coefficient is:

$${n \choose y} = \frac{n!}{y!(n - y)!}$$

for example:

$${4\choose 2} = \frac{4!}{2!(4 - 2)!} = \frac{4 \cdot 3}{2 \cdot 1} = 6$$

The $(4 - 2)!$ in the denominator cancels the trailing $2 \cdot 1$ of $4!$ in the numerator, leaving $4 \cdot 3$ over $2 \cdot 1$.
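
The binomial coefficient and sampling model can be evaluated directly with the standard library. The helper name `binomial_pmf` is an assumption introduced for this sketch.

```python
import math

def binomial_pmf(y, n, theta):
    """Bin(y | n, theta) = C(n, y) * theta^y * (1 - theta)^(n - y)."""
    return math.comb(n, y) * theta**y * (1 - theta) ** (n - y)

coef = math.comb(4, 2)                                        # 4! / (2! 2!) = 6
pmf_values = [binomial_pmf(y, 4, 0.485) for y in range(5)]    # y = 0, ..., 4
total = sum(pmf_values)                                       # probabilities sum to 1
```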

#### Example

The currently accepted value of the proportion of female births in large European-race populations is $0.485$. For this example we define the parameter $\theta$ to be the proportion of female births, but an alternative way of reporting this parameter is as a ratio of male (numerator) to female (denominator) birth rates, $\phi = \frac{(1 - \theta)}{\theta}$. Let $y$ be the number of girls in $n$ births. Using the formula above, we assume the $n$ births are conditionally independent given $\theta$, with the probability of female birth equal to $\theta$ for all cases.

First we need a prior distribution for $\theta$ before we can carry out Bayesian inference in the binomial model. We will assume it is uniform on the interval $[0, 1]$. The posterior density for $\theta$ is then:

$$p(\theta | y) \propto \theta^y(1 - \theta)^{n - y}$$

This updates our beliefs about $\theta$. Rather than relying only on the prior distribution (our prior beliefs), we account for the new information in the observed value $y$ (our posterior beliefs).

With fixed $n$ and $y$, the factor ${n \choose y}$ does not depend on $\theta$ and can be treated as a constant when calculating the posterior distribution of $\theta$.
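
One way to see that the binomial coefficient drops out is to normalize the posterior on a grid with and without it. This is a rough sketch under a uniform prior; the grid size, the data ($y = 3$, $n = 10$), and the function name `grid_posterior` are assumptions for illustration.

```python
import math

def grid_posterior(y, n, grid_size=1001, include_constant=False):
    """Normalized posterior weights for theta on an evenly spaced grid (uniform prior)."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    const = math.comb(n, y) if include_constant else 1
    unnorm = [const * t**y * (1 - t) ** (n - y) for t in thetas]
    total = sum(unnorm)
    return thetas, [u / total for u in unnorm]

thetas, post_a = grid_posterior(y=3, n=10)
_, post_b = grid_posterior(y=3, n=10, include_constant=True)

# The constant cancels on normalization, so the two posteriors agree
max_diff = max(abs(a - b) for a, b in zip(post_a, post_b))
post_mode = thetas[post_a.index(max(post_a))]   # grid point with highest posterior weight
```

The grid mode sits at the sample proportion $y/n$, as expected under a uniform prior.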

##### Analogy
The probability space is similar to a rectangular table (such as a billiard table):

1.  (Prior distribution) A ball W is randomly thrown (according to a uniform distribution on the table). The horizontal position of the ball on the table is $\theta$, expressed as a fraction of the table width.

2.  (Likelihood) A ball O is randomly thrown $n$ times. The value of $y$ is the number of times O lands to the right of W.

##### Sub-Example
In analyzing the binomial model, Laplace also used the uniform prior distribution. His first serious application was to estimate the proportion of girl births in a population. A total of $241,945$ girls and $251,527$ boys were born in Paris from 1745 to 1770. Letting $\theta$ be the probability that any birth is female, Laplace showed that:

$$Pr(\theta \ge 0.5 | y = 241,945, n = 251,527 + 241,945) \approx 1.15 \times 10^{-42}$$

Let $\tilde y$ be the outcome of a single new trial. Under the uniform prior, the posterior predictive probability of a success is:

$$Pr(\tilde y = 1 | y) = \int_0^1 \theta p(\theta | y)d \theta = \frac{y + 1}{n + 2}$$
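
The predictive probability $(y + 1)/(n + 2)$ can be evaluated for Laplace's data and compared with a numerical integral on a small case. The function names and the small case ($y = 3$, $n = 10$) are assumptions made for this sketch.

```python
from fractions import Fraction

def prob_next_success(y, n):
    """Closed form (y + 1) / (n + 2) under a uniform prior."""
    return Fraction(y + 1, n + 2)

def prob_next_success_numeric(y, n, steps=20_000):
    """Midpoint-rule approximation of the integral of theta * p(theta | y)."""
    mids = [(i + 0.5) / steps for i in range(steps)]
    kernel = [t**y * (1 - t) ** (n - y) for t in mids]   # unnormalized posterior
    return sum(t * k for t, k in zip(mids, kernel)) / sum(kernel)

# Laplace's Paris data from the text
girls, boys = 241_945, 251_527
p_next_girl = prob_next_success(girls, girls + boys)
sample_proportion = Fraction(girls, girls + boys)

# Small hypothetical case comparing the closed form with the integral
closed_form = float(prob_next_success(3, 10))
numeric = prob_next_success_numeric(3, 10)
```

With this much data the predictive probability is nearly identical to the sample proportion, pulled very slightly toward $1/2$.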

# 2.2

The prior distribution is $p(\theta)$ and the posterior distribution is $p(\theta | y)$. Given:

$$E(\theta) = E(E(\theta | y))$$

the prior mean of $\theta$ is the average of all possible posterior means over the distribution of possible data. The variance formula:

$$var(u) = E(var(u | v)) + var(E(u | v))$$

says that the posterior variance is on average smaller than the prior variance by an amount that depends on the variation in posterior mean over the distribution of possible data.
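
Both identities can be verified exactly for the binomial model with a uniform prior. The sketch below assumes two facts stated elsewhere in these notes: the posterior is $Beta(y + 1, n - y + 1)$, and the prior predictive distribution of $y$ is uniform over $0, \dots, n$. The choice $n = 10$ is arbitrary.

```python
from fractions import Fraction

n = 10                              # hypothetical number of trials
p_y = Fraction(1, n + 1)            # prior predictive probability of each y in 0..n

def post_mean(y):
    """Mean of Beta(y + 1, n - y + 1)."""
    return Fraction(y + 1, n + 2)

def post_var(y):
    """Variance of Beta(a, b) with a = y + 1, b = n - y + 1."""
    a, b = y + 1, n - y + 1
    return Fraction(a * b, (a + b) ** 2 * (a + b + 1))

ys = range(n + 1)
mean_of_post_means = sum(p_y * post_mean(y) for y in ys)           # E(E(theta | y))
mean_of_post_vars = sum(p_y * post_var(y) for y in ys)             # E(var(theta | y))
var_of_post_means = sum(p_y * (post_mean(y) - mean_of_post_means) ** 2 for y in ys)

prior_mean, prior_var = Fraction(1, 2), Fraction(1, 12)            # U(0, 1) moments
```

Using exact fractions avoids rounding error, so the equalities hold exactly rather than approximately.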

Under the uniform prior, the posterior distribution in the binomial model is a beta distribution:

$$\theta | y \sim \text{Beta}(y + 1, n - y + 1)$$

The prior mean will have less importance as the size of the data sample increases.

# 2.3

The mean, median, and mode are used to describe the location of a distribution, while variation is described by the standard deviation, interquartile range, and other quantiles. The mean is the posterior expectation of the parameter, and the mode may be interpreted as the single most likely value given the data and the model. Much practical inference relies on the use of normal approximations, often improved by applying a symmetrizing transformation to $\theta$, and here the mean and standard deviation play key roles.

The mean of the $\text{Beta}(y + 1, n - y + 1)$ posterior distribution is:

$$\frac{y + 1}{n + 2}$$

and the mode is:

$$\frac{y}{n}$$

#### Posterior Quantiles and Intervals

It is important to report posterior uncertainty. If a symmetric interval is desired, one option is a central interval of posterior probability: a $100(1 - \alpha)\%$ central interval is the range of values above and below which lies exactly $100(\frac{\alpha}{2})\%$ of the posterior probability. Such ranges are called posterior intervals.
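
A central posterior interval can be approximated from simulated posterior draws. This is a sketch under a uniform prior with hypothetical data ($y = 7$, $n = 20$); the seed and number of draws are arbitrary choices.

```python
import random

y, n = 7, 20                       # hypothetical data
a, b = y + 1, n - y + 1            # Beta(y + 1, n - y + 1) posterior under a uniform prior

rng = random.Random(42)            # arbitrary seed for reproducibility
draws = sorted(rng.betavariate(a, b) for _ in range(20_000))

alpha = 0.05                       # 95% central interval
lower = draws[int(len(draws) * alpha / 2)]              # empirical alpha/2 quantile
upper = draws[int(len(draws) * (1 - alpha / 2)) - 1]    # empirical 1 - alpha/2 quantile

post_mean = a / (a + b)            # (y + 1) / (n + 2)
post_mode = y / n
```

Roughly 2.5% of the simulated draws fall below `lower` and 2.5% above `upper`, which is what makes the interval central.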

# 2.4

In the population interpretation, the prior distribution represents a population of possible parameter values, from which the $\theta$ of current interest has been drawn. In the more subjective state of knowledge interpretation, we must express our knowledge/uncertainty about $\theta$ as if its value could be thought of as a random realization from the prior distribution. Typically the prior distribution should include all plausible values of $\theta$, but the distribution need not be realistically concentrated around the true value, because the information about $\theta$ contained in the data will often far outweigh any reasonable prior probability specification.

Under a uniform prior distribution for $\theta$, the prior predictive distribution for $y$ (given $n$) is uniform, which gives equal probability to each of the $n + 1$ possible values. This is usually sufficient when nothing about the data is known. There are weaknesses to this assumption, however.
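
The uniform prior predictive distribution can be confirmed exactly. The sketch uses the standard beta integral $\int_0^1 \theta^y(1 - \theta)^{n - y} d\theta = \frac{y!(n - y)!}{(n + 1)!}$; the function name and the choice $n = 8$ are assumptions for illustration.

```python
from fractions import Fraction
from math import comb, factorial

def prior_predictive(y, n):
    """p(y) = C(n, y) * integral of theta^y (1 - theta)^(n - y) over [0, 1]."""
    beta_integral = Fraction(factorial(y) * factorial(n - y), factorial(n + 1))
    return comb(n, y) * beta_integral

n = 8                                                   # hypothetical number of trials
probs = [prior_predictive(y, n) for y in range(n + 1)]  # one value per y = 0, ..., n
```

Every entry of `probs` equals $1/(n + 1)$, matching the statement above.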

#### Prior Distributions

A prior distribution of a parameter expresses your uncertainty about the parameter before the current data are examined. Multiplying the prior distribution and the likelihood function together leads (up to a normalizing constant) to the posterior distribution of the parameter. You use the posterior distribution to carry out all inferences. Think of the likelihood as how probable the observed data are for each possible value of the parameter. If we only have a prior, then all we know is that prior information, but once we also have data, we update what we already know with the new information observed to obtain the posterior.

Bayesian probability measures the degree of belief you have in a random event. In this sense all priors are subjective. Objective (noninformative) prior distributions are those intended to have minimal impact on the posterior distribution; a prior is noninformative when it is flat relative to the likelihood function. Some noninformative priors, such as Jeffreys' prior, are also invariant under transformation (unchanged in form after a reparameterization).

A prior is improper if it does not integrate to a finite value. A common example is the flat prior

$$\pi(\theta) \propto 1$$

for $\theta \in (-\infty, \infty)$. Improper priors are often used as noninformative priors, and they can still yield proper posterior distributions. To determine whether a posterior distribution is proper, check that its normalizing constant $\int p(y | \theta) \pi(\theta) d\theta$ is finite for all $y$.

A prior is said to be conjugate for a family of distributions if the prior and the posterior distributions are from the same family (posterior and prior have the same distributional form).

#### Binomial with Different Prior Distributions

As a function of $\theta$, the binomial likelihood has the form:

$$p(y | \theta) \propto \theta^a(1 - \theta)^b$$

or in other words, writing $p$ for $\theta$ and taking $a = y$ and $b = n - y$:

$$p^y(1 -p)^{n - y}$$

Thus if the prior is of the same form, with its own values of $a$ and $b$, then the posterior density will also be of this form. We parameterize such a prior with two shape parameters, $\alpha$ and $\beta$:

$$p(\theta) \propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$$

or in other words:

$$p^{\alpha - 1}(1 - p)^{\beta - 1}$$

which is a beta distribution with parameters $\alpha$ and $\beta$: $\theta \sim \text{Beta}(\alpha, \beta)$. Comparing $p(\theta)$ and $p(y | \theta)$ suggests that this prior density is equivalent to $\alpha - 1$ prior successes and $\beta - 1$ prior failures. Parameters of the prior distribution are referred to as hyperparameters. The prior distribution is indexed by two hyperparameters, which means we can specify a particular prior by fixing two features of the distribution, such as the mean and the variance.

The posterior density for $\theta$ is:

$$p(\theta | y) = \text{Beta}(\theta | \alpha + y, \beta + n - y)$$

which, writing $p$ for $\theta$, is proportional to:

$$p^{\alpha + y - 1}(1 - p)^{\beta + n - y - 1}$$

Conjugacy is the property that the posterior distribution follows the same parametric form as the prior distribution (the beta prior distribution is a conjugate family for the binomial likelihood). The posterior mean of $\theta$, which may be interpreted as the posterior probability of success for a future draw from the population, is now:

$$E(\theta | y) = \frac{\alpha + y}{\alpha + \beta + n}$$

which lies between the sample proportion, $y/n$, and the prior mean, $\alpha/(\alpha + \beta)$. The posterior variance is:

$$var(\theta | y) = \frac{(\alpha + y)(\beta + n - y)}{(\alpha + \beta + n)^2(\alpha + \beta + n + 1)} = \frac{E(\theta | y)[1 - E(\theta | y)]}{\alpha + \beta + n + 1}$$
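
The conjugate update amounts to adding counts to the prior parameters. A minimal sketch follows, assuming integer hyperparameters so that exact fractions can be used; the function name `beta_binomial_update`, the prior $Beta(2, 2)$, and the data are hypothetical.

```python
from fractions import Fraction

def beta_binomial_update(alpha, beta, y, n):
    """Posterior parameters, mean, and variance for a Beta(alpha, beta) prior
    and y successes in n binomial trials."""
    a_post, b_post = alpha + y, beta + n - y
    total = a_post + b_post                                  # alpha + beta + n
    mean = Fraction(a_post, total)
    var = Fraction(a_post * b_post, total**2 * (total + 1))
    return a_post, b_post, mean, var

alpha, beta, y, n = 2, 2, 7, 20                              # hypothetical prior and data
a_post, b_post, post_mean, post_var = beta_binomial_update(alpha, beta, y, n)

prior_mean = Fraction(alpha, alpha + beta)                   # alpha / (alpha + beta)
sample_prop = Fraction(y, n)                                 # y / n
```

Here the posterior mean falls between the sample proportion and the prior mean, and both expressions for the posterior variance agree.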

As $y$ and $n - y$ become large with fixed $\alpha$ and $\beta$, $E(\theta | y) \approx y/n$ and $var(\theta | y) \approx \frac{1}{n} \frac{y}{n}(1 - \frac{y}{n})$, which approaches zero at the rate $1/n$. The central limit theorem of probability can be put in a Bayesian context to show:

$$\left(\left.\frac{\theta - E(\theta | y)}{\sqrt{var(\theta | y)}}\ \right|\ y\right) \to N(0, 1)$$

This limit is used to justify approximating the posterior distribution with a normal distribution. In practice the normal approximation is often more accurate if we transform $\theta$ to the logit scale, that is, $\log(\theta / (1 - \theta))$, which expands the probability space from $[0, 1]$ to $(-\infty, \infty)$.
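
The benefit of the logit scale shows up when $\theta$ is near a boundary. The sketch below compares a normal-approximation 95% interval on the original scale with one built on the logit scale and transformed back. The data ($y = 1$, $n = 20$), the uniform prior, the seed, and the use of simulated draws to estimate the logit-scale mean and standard deviation are all assumptions made for illustration.

```python
import math
import random
import statistics

def logit(t):
    return math.log(t / (1 - t))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

y, n = 1, 20                           # hypothetical data with few successes
a, b = y + 1, n - y + 1                # Beta posterior under a uniform prior

# Normal approximation on the original scale, from the exact Beta mean and variance
mean = a / (a + b)
sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
naive_lower, naive_upper = mean - 1.96 * sd, mean + 1.96 * sd

# Normal approximation on the logit scale, estimated from simulated posterior draws
rng = random.Random(7)                 # arbitrary seed
z = [logit(rng.betavariate(a, b)) for _ in range(20_000)]
z_mean, z_sd = statistics.fmean(z), statistics.stdev(z)
logit_lower = inv_logit(z_mean - 1.96 * z_sd)
logit_upper = inv_logit(z_mean + 1.96 * z_sd)
```

On the original scale the lower endpoint falls below zero, which is impossible for a probability, while the back-transformed logit interval always stays inside $(0, 1)$.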

#### Conjugate Prior Distribution

Conjugacy: if $F$ is a class of sampling distributions $p(y | \theta)$, and $P$ is a class of prior distributions for $\theta$, then the class $P$ is conjugate for $F$ if:

$$p(\theta | y) \in P$$

for all

$$p(. | \theta) \in F\ \&\ p(.) \in P$$

We are interested in natural conjugate prior families, which arise by taking $P$ to be the set of all densities having the same functional form as the likelihood.

# 2.5

#### Likelihood of One Data Point

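For a single observation $y$ from a normal distribution with mean $\theta$ and known variance $\sigma^2$, the sampling distribution is:

$$p(y | \theta) = \frac{1}{\sqrt{2 \pi} \sigma}\ e^{-\frac{1}{2 \sigma^2}(y - \theta)^2}$$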
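
As a sanity check, the normal density with normalizing constant $1/(\sqrt{2\pi}\sigma)$ can be compared against the standard library's `statistics.NormalDist`. The function name `normal_likelihood` and the numeric values are assumptions for this sketch.

```python
import math
from statistics import NormalDist

def normal_likelihood(y, theta, sigma):
    """p(y | theta) for one observation with known standard deviation sigma."""
    return math.exp(-((y - theta) ** 2) / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

y, theta, sigma = 1.3, 0.5, 2.0                         # hypothetical values
by_formula = normal_likelihood(y, theta, sigma)
by_library = NormalDist(mu=theta, sigma=sigma).pdf(y)   # reference value
```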

#### Conjugate Prior and Posterior Distributions

The family of conjugate prior densities has the form:

$$p(\theta) = e^{A \theta^2 + B \theta + C}$$

We parameterize this family as:

$$p(\theta) \propto exp(-\frac{1}{2 \tau^2_0}(\theta - \mu_0)^2)$$

that is, $\theta \sim N(\mu_0, \tau_0^2)$, with hyperparameters $\mu_0$ and $\tau_0^2$. The conjugate prior density implies that the posterior distribution for $\theta$ is the exponential of a quadratic form and thus normal. In the posterior density all variables except $\theta$ are regarded as constants, giving the conditional density:

$$p(\theta | y) \propto exp(-\frac{1}{2}(\frac{(y - \theta)^2}{\sigma^2} + \frac{(\theta - \mu_0)^2}{\tau_0^2}))$$

Expanding the exponents, collecting terms and then completing the square in $\theta$ gives:

$$p(\theta | y) \propto exp(-\frac{1}{2 \tau ^2_1}(\theta - \mu_1)^2)$$

that is, $\theta | y \sim N(\mu_1, \tau_1^2)$ where:

$$\mu_1 = \frac{\frac{1}{\tau_0^2} \mu_0 + \frac{1}{\sigma^2}y}{\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}}$$
$$\frac{1}{\tau^2_1} = \frac{1}{\tau^2_0} + \frac{1}{\sigma^2}$$
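
The completing-the-square step can be spelled out as follows, where $c$ and $c'$ (symbols introduced here) collect every term that does not involve $\theta$:

$$\frac{(y - \theta)^2}{\sigma^2} + \frac{(\theta - \mu_0)^2}{\tau_0^2} = \left(\frac{1}{\sigma^2} + \frac{1}{\tau_0^2}\right)\theta^2 - 2\left(\frac{y}{\sigma^2} + \frac{\mu_0}{\tau_0^2}\right)\theta + c = \frac{1}{\tau_1^2}(\theta - \mu_1)^2 + c'$$

Matching the coefficient of $\theta^2$ gives $\frac{1}{\tau_1^2} = \frac{1}{\tau_0^2} + \frac{1}{\sigma^2}$, and matching the coefficient of $\theta$ gives $\frac{\mu_1}{\tau_1^2} = \frac{\mu_0}{\tau_0^2} + \frac{y}{\sigma^2}$.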

In manipulating normal distributions, the inverse of the variance is called the precision. For normal data and a normal prior distribution, each with known precision, the posterior precision equals the prior precision plus the data precision.
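
The precision-based update is short to write down in code. This is a sketch for a single observation with known variance; the function name `normal_posterior` and the numeric inputs are hypothetical.

```python
def normal_posterior(mu0, tau0_sq, y, sigma_sq):
    """Posterior mean and variance of theta for one observation y ~ N(theta, sigma_sq)
    with prior theta ~ N(mu0, tau0_sq), where sigma_sq is known."""
    prior_precision = 1 / tau0_sq
    data_precision = 1 / sigma_sq
    post_precision = prior_precision + data_precision        # precisions add
    mu1 = (prior_precision * mu0 + data_precision * y) / post_precision
    return mu1, 1 / post_precision                           # (mu_1, tau_1^2)

mu1, tau1_sq = normal_posterior(mu0=0.0, tau0_sq=4.0, y=3.0, sigma_sq=1.0)
```

In this hypothetical case the data precision (1) is four times the prior precision (0.25), so the posterior mean sits much closer to $y$ than to $\mu_0$.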

The $\mu_1$ above is expressed as the weighted average of the prior mean and the observed value $y$ with weights proportional to precision. We can alternatively express $\mu_1$ as the prior mean adjusted toward the observed $y$:

$$\mu_1 = \mu_0 + (y - \mu_0) \frac{\tau_0^2}{\sigma^2 + \tau_0^2}$$

or as the data shrunk toward the prior mean:

$$\mu_1 = y - (y - \mu_0) \frac{\sigma^2}{\sigma^2 + \tau_0^2}$$

At the extremes the posterior mean equals the prior mean or the observed data:
- $\mu_1 = \mu_0$ if $y = \mu_0$ or $\tau^2_0 = 0$
- $\mu_1 = y$ if $y = \mu_0$ or $\sigma^2 = 0$
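
The three expressions for $\mu_1$ and the two limiting cases can be checked numerically. The function names and input values below are assumptions for illustration; the limiting cases use the shrinkage forms because the precision-weighted form divides by the variances.

```python
def mu1_weighted(mu0, tau0_sq, y, sigma_sq):
    """Precision-weighted average of the prior mean and the observation."""
    return (mu0 / tau0_sq + y / sigma_sq) / (1 / tau0_sq + 1 / sigma_sq)

def mu1_from_prior(mu0, tau0_sq, y, sigma_sq):
    """Prior mean adjusted toward the observed y."""
    return mu0 + (y - mu0) * tau0_sq / (sigma_sq + tau0_sq)

def mu1_from_data(mu0, tau0_sq, y, sigma_sq):
    """Observed y shrunk toward the prior mean."""
    return y - (y - mu0) * sigma_sq / (sigma_sq + tau0_sq)

args = dict(mu0=1.0, tau0_sq=4.0, y=3.0, sigma_sq=1.0)       # hypothetical values
forms = [f(**args) for f in (mu1_weighted, mu1_from_prior, mu1_from_data)]

# Limiting cases: a prior with zero variance, and data with zero variance
at_prior = mu1_from_prior(mu0=1.0, tau0_sq=0.0, y=3.0, sigma_sq=1.0)
at_data = mu1_from_data(mu0=1.0, tau0_sq=4.0, y=3.0, sigma_sq=0.0)
```

All three forms return the same value, and the limiting cases return the prior mean and the observation respectively.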