# 1.1

#### Three Steps

- Setting up probability model
    - Joint probability distribution for all observable and unobservable quantities in a problem
    - Model consistent with knowledge about underlying scientific problem and the data collection process

- Conditioning on observed data
    - Calculating and interpreting the appropriate posterior distribution

- Evaluating fit of model and the implications of the resulting posterior distribution
    - How well does the model fit the data
    - Are conclusions reasonable
    - How sensitive are the results to the modeling assumptions in the probability model

# 1.2

#### Two Estimands

- Potentially observable quantities
    - Future observations
    - Outcome under treatment not received
- Quantities that are not directly observable
    - Parameters governing the hypothetical process that leads to the observed data (like regression coefficients)

#### Notation

- $\theta$ are population parameters
- $y$ denotes the observed data
- $\tilde y$ represents unknown but potentially observable quantities
- Exchangeable data are usually modeled as iid (independent and identically distributed) given an unknown parameter vector $\theta$ with distribution $p(\theta)$
    - Exchangeability means the joint density $p(y_1, \ldots, y_n)$ is unchanged by permutations of the indexes (as when the indexes are randomly assigned)
    - If two units have the same value of $x$, then their distributions of $y$ are equal
- $X$ are features, $X_k$ is the $k$th feature of $X$, and the feature matrix is an $n \times k$ matrix

# 1.3

#### Bayesian Inference

Bayesian conclusions are probability statements conditional on the observed value of $y$: $p(\theta | y)$ for a parameter or $p(\tilde y | y)$ for unobserved data.

#### Probability Notation

- $p(.)$ denotes a marginal distribution (distribution and density are used interchangeably)
    - $Pr(.)$ may be used for the probability of an event in some instances to avoid confusion
- $\theta \sim N(\mu, \sigma^2)$ means $\theta$ has a normal distribution with mean $\mu$ and variance $\sigma^2$
    - $N(\mu, \sigma^2)$ for random variables
    - $N(\theta | \mu, \sigma^2)$ for density functions
- $\frac{sd(\theta)}{E(\theta)}$ is the coefficient of variation
- $exp(E[log(\theta)])$ is the geometric mean
- $exp(sd[log(\theta)])$ is the geometric standard deviation
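The three summaries at the end of this list can be approximated from a set of draws of a positive quantity. The sketch below is only an illustration: the function name `summaries`, the sample values, and the use of the population standard deviation (`pstdev`) are assumptions, not part of the notes.

```python
import math
import statistics

def summaries(draws):
    """Return (coefficient of variation, geometric mean, geometric sd).

    `draws` must be positive numbers; expectations are replaced by
    sample averages, so the results are approximations.
    """
    logs = [math.log(x) for x in draws]
    cv = statistics.pstdev(draws) / statistics.fmean(draws)
    geo_mean = math.exp(statistics.fmean(logs))
    geo_sd = math.exp(statistics.pstdev(logs))
    return cv, geo_mean, geo_sd

# Hypothetical draws, chosen so the geometric mean is easy to check by hand.
cv, geo_mean, geo_sd = summaries([1.0, 2.0, 4.0])
print(cv, geo_mean, geo_sd)
```

For the draws 1, 2, 4 the logs are evenly spaced around $log(2)$, so the geometric mean is 2.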

#### Bayes Rule

In order to make probability statements about $\theta$ given $y$, we need a joint probability distribution for $\theta$ and $y$. The joint Probability Mass/Density Function can be written as a product of two densities, referred to as the Prior Distribution $p(\theta)$ and the Sampling Distribution $p(y | \theta)$:

$$p(\theta, y) = p(\theta)p(y | \theta)$$

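Conditioning on the known value of $y$ gives the Posterior Density:

$$p(\theta | y) = \frac{p(\theta, y)}{p(y)} = \frac{p(\theta)p(y | \theta)}{p(y)}$$

Because $p(y)$ does not depend on $\theta$, for fixed $y$ this yields the unnormalized posterior density:

$$p(\theta | y) \propto p(\theta)p(y | \theta)$$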

#### Prediction

Before the data $y$ are considered, the Marginal Distribution (Prior Predictive Distribution) of the unknown but observable $y$ is:

$$p(y) = \int p(y, \theta)d \theta = \int p(\theta)p(y | \theta)d \theta$$

It is prior because it is not conditional on a previous observation of the process, and predictive because it is the distribution for a quantity that is observable. After $y$ data has been observed, we can predict an unknown observable $\tilde y$ from the same process. The distribution of $\tilde y$ is called the Posterior Predictive Distribution: posterior because it is conditioned on an observed $y$ and predictive because it is a prediction for an observable $\tilde y$:

$$p(\tilde y | y) = \int p(\tilde y | \theta)p(\theta | y)d \theta$$

This displays the posterior predictive distribution as an average of the conditional predictions over the posterior distribution of $\theta$. 
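When $\theta$ takes only a few discrete values, the integral becomes a sum, and the averaging can be sketched in a few lines. The function name, the labels `theta_a` and `theta_b`, and all probabilities below are hypothetical and serve only to illustrate the formula.

```python
def posterior_predictive(posterior, predictive):
    """Average p(y_new | theta) over p(theta | y) for discrete theta.

    posterior:  {theta: p(theta | y)}
    predictive: {theta: {y_new: p(y_new | theta)}}
    """
    outcomes = set()
    for dist in predictive.values():
        outcomes.update(dist)
    return {
        y_new: sum(predictive[t].get(y_new, 0.0) * posterior[t] for t in posterior)
        for y_new in outcomes
    }

# Hypothetical posterior over two parameter values and their predictions.
posterior = {"theta_a": 0.25, "theta_b": 0.75}
predictive = {
    "theta_a": {0: 0.5, 1: 0.5},
    "theta_b": {0: 1.0, 1: 0.0},
}
pred = posterior_predictive(posterior, predictive)
print(pred)
```

Here `pred[0]` is $0.5 \cdot 0.25 + 1.0 \cdot 0.75 = 0.875$: each conditional prediction is weighted by how plausible its $\theta$ is after seeing $y$.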

#### Likelihood

The data $y$ affect the posterior inference only through $p(y | \theta)$, which, when regarded as a function of $\theta$ for fixed $y$, is called the Likelihood Function. The Likelihood Principle states that for a given sample of data, any two probability models $p(y | \theta)$ that have the same likelihood function yield the same inference for $\theta$. The principle is reasonable only within the framework of the model adopted for a particular analysis, so be willing to apply Bayes' Rule under a variety of possible models.

# 1.4

#### Example

##### Prior Distribution
Consider a woman who has an affected brother, which implies that her mother must be a carrier of the hemophilia gene with one ‘good’ and one ‘bad’ hemophilia gene. We are also told that her father is not affected; thus the woman herself has a fifty-fifty chance of having the gene. The unknown quantity of interest, the state of the woman, has just two values: the woman is either a carrier of the gene $(\theta = 1)$ or not $(\theta = 0)$. Based on the information provided thus far, the prior distribution for the unknown $\theta$ can be expressed simply as $Pr(\theta = 1) = Pr(\theta = 0) = 0.5$.

##### Data Model and Likelihood
The data used to update the prior information consist of the affection status of the woman’s sons. Suppose she has two sons, neither of whom is affected. Let $y_i = 1$ or $0$ denote an affected or unaffected son, respectively. The outcomes of the two sons are exchangeable and, conditional on the unknown $\theta$, are independent; we assume the sons are not identical twins. The two items of independent data generate the following likelihood function:

$$Pr(y_1 = 0, y_2 = 0 | \theta = 1) = (0.5)(0.5) = 0.25$$
$$Pr(y_1 = 0, y_2 = 0 | \theta = 0) = (1)(1) = 1$$

These expressions follow from the fact that if the woman is a carrier, then each of her sons will have a $50\%$ chance of inheriting the gene and so being affected, whereas if she is not a carrier then there is a probability close to 1 that a son of hers will be unaffected. (In fact, there is a nonzero probability of being affected even if the mother is not a carrier, but this risk—the mutation rate—is small and can be ignored for this example.)

##### Posterior Distribution
$$Pr(\theta = 1 | y) = \frac{p(y | \theta = 1)Pr(\theta = 1)}{p(y | \theta = 1)Pr(\theta = 1) + p(y | \theta = 0)Pr(\theta = 0)} = \frac{(0.25)(0.5)}{(0.25)(0.5) + (1.0)(0.5)} = \frac{0.125}{0.625} = 0.20$$

The $0.25$ in the numerator is the likelihood of the data given that the woman is a carrier: the product of the probabilities that each son is unaffected given $\theta = 1$ (i.e. $0.5 * 0.5$). For $n$ conditionally independent observations, this part of the equation can be understood as:

$$p(y | \theta) = \prod_{i = 1}^n\ p(y_i | \theta)$$

The denominator generalizes to the following sum when $\theta$ has $m$ possible values:

$$p(y) = \sum_{j = 1}^m\ p(\theta_j)p(y | \theta_j)$$

where the $\theta_j$ range over all possible values of $\theta$.
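The calculation in this subsection can be reproduced with a short script. This is a minimal sketch using the numbers from the example; the function names `p_son`, `likelihood`, and `posterior` are made up for illustration, and the mutation rate is ignored as in the text.

```python
import math

# theta = 1: the woman is a carrier; theta = 0: she is not.
prior = {1: 0.5, 0: 0.5}

def p_son(y_i, theta):
    """Pr(y_i | theta) for one son; y_i = 1 is affected, 0 is unaffected."""
    p_affected = 0.5 if theta == 1 else 0.0  # mutation rate ignored
    return p_affected if y_i == 1 else 1.0 - p_affected

def likelihood(y, theta):
    """Product over sons, valid because sons are independent given theta."""
    return math.prod(p_son(y_i, theta) for y_i in y)

def posterior(y, prior):
    """Bayes' rule for a discrete theta: normalize prior * likelihood."""
    unnormalized = {theta: prior[theta] * likelihood(y, theta) for theta in prior}
    p_y = sum(unnormalized.values())  # the denominator: a sum over theta values
    return {theta: w / p_y for theta, w in unnormalized.items()}

post = posterior([0, 0], prior)  # two unaffected sons
print(post[1])  # approximately 0.20
```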

##### Adding More Data
A key aspect of Bayesian analysis is the ease with which sequential analyses can be performed. For example, suppose that the woman has a third son, who is also unaffected. The entire calculation does not need to be redone; rather we use the previous posterior distribution as the new prior distribution, to obtain:

$$Pr(\theta = 1 | y_1, y_2, y_3) = \frac{(0.5)(0.2)}{(0.5)(0.2) + (1.0)(0.8)} = \frac{0.1}{0.9} = 0.111$$
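A quick check that updating one son at a time gives the same answer as conditioning on all three sons at once. This is a sketch under the example's assumptions (each son is unaffected with probability 0.5 if the mother is a carrier and 1 otherwise); the function name is hypothetical.

```python
def update_carrier_probability(prior_carrier,
                               p_unaffected_if_carrier=0.5,
                               p_unaffected_if_not=1.0):
    """One Bayes update of Pr(theta = 1) after observing an unaffected son."""
    numerator = p_unaffected_if_carrier * prior_carrier
    denominator = numerator + p_unaffected_if_not * (1.0 - prior_carrier)
    return numerator / denominator

# Sequential: start from the 50/50 prior and update once per unaffected son,
# using each posterior as the next prior.
sequential = 0.5
for _ in range(3):
    sequential = update_carrier_probability(sequential)

# Batch: condition on all three unaffected sons in a single step.
batch = (0.5 ** 3 * 0.5) / (0.5 ** 3 * 0.5 + 1.0 * 0.5)

print(sequential, batch)  # both approximately 0.111
```

The last pass through the loop starts from the two-son posterior of 0.2, which is exactly the step shown in the equation above.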

When no other information is available, initial prior probabilities can all default to $\frac{1}{n}$ for $n$ possible values, or default to values pre-calculated from a larger database. If only a fraction of the possible values from a database are used, then the default for each possible outcome can be a normalized value from the outcomes in question:

$$1(p(y)) = \frac{p_{max}(y | \theta)}{\sum_{j = 1}^n\ p(\theta_j)p(y | \theta_j)}$$

# 1.5

Probability: numerical quantities, defined on a set of ‘outcomes,’ that are nonnegative, additive over mutually exclusive outcomes, and sum to 1 over all possible mutually exclusive outcomes.

# 1.7

#### Mixture Models

The distribution of previously obtained scores for the candidate matches is modeled as a mixture of a distribution of scores for true matches and a distribution of scores for non-matches. The parameters of the mixture are estimated from the data. These parameters allow us to estimate the probability of a false match for any given decision threshold on the scores.
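To make the last step concrete, the sketch below computes a false-match probability from already-estimated mixture parameters. Everything specific here is an assumption for illustration: the two components are taken to be normal, a candidate is declared a match when its score exceeds the threshold, and the weights, means, and standard deviations are invented numbers.

```python
from statistics import NormalDist

def false_match_rate(threshold, weight_true, true_dist, false_dist):
    """Pr(non-match | score > threshold) under a two-component mixture.

    weight_true is the mixture proportion of true matches. The result is
    undefined (division by zero) if no mass lies above the threshold.
    """
    tail_true = 1.0 - true_dist.cdf(threshold)
    tail_false = 1.0 - false_dist.cdf(threshold)
    declared = weight_true * tail_true + (1.0 - weight_true) * tail_false
    return (1.0 - weight_true) * tail_false / declared

# Hypothetical fitted components: true matches score higher on average.
true_scores = NormalDist(mu=3.0, sigma=1.0)
false_scores = NormalDist(mu=0.0, sigma=1.0)

rate_low = false_match_rate(1.0, 0.5, true_scores, false_scores)
rate_high = false_match_rate(2.0, 0.5, true_scores, false_scores)
print(rate_low, rate_high)
```

With these made-up parameters, raising the threshold lowers the false-match rate, at the cost of declaring fewer matches.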

# 1.8

Joint Density: given two quantities $(u, v)$, we write the joint density as $p(u, v)$. The Conditional Distribution (Density Function) is $p(u | v)$ and the Marginal Density is $p(u) = \int p(u, v)dv$. The $u$ and $v$ can be vectors. The integral refers to the entire range of the variable being integrated out. It is useful to factor a joint density as a product of marginal and conditional densities:

$$p(u, v, w) = p(u | v, w)p(v | w)p(w)$$

To be more explicit about the conditioning, the following notation is helpful:

$$p(\theta, y | H) = p(\theta | H)p(y | \theta, H)$$

where $H$ refers to the set of hypotheses or assumptions used to define the model. 

$E(.)$ and $var(.)$ denote mean and variance respectively:

$$E(u) = \int up(u)du$$
$$var(u) = \int (u - E(u))^2p(u)du$$

For a vector $u$, the covariance matrix (variance matrix and covariance matrix are used interchangeably) is defined as:

$$var(u) = \int (u - E(u))(u - E(u))^Tp(u)du$$

In expressions involving expectations, any variable that does not appear explicitly as a conditioning variable is assumed to be integrated out in the expectation.

#### Means and Variances of Conditional Models

The mean of $u$ can be obtained by averaging the conditional mean over the marginal distribution of $v$, given $u$ is a random variable and $v$ is some related quantity.

$$E(u) = E(E(u | v))$$

where the inner expectation averages over $u$, conditional on $v$, and the outer expectation averages over $v$. The identity is derived by writing the expectation in terms of the joint distribution of $u$ and $v$ and then factoring the joint distribution:

$$E(u) = \int \int u\,p(u, v)\,du\,dv = \int \int u\,p(u | v)\,du\,p(v)\,dv = \int E(u | v)\,p(v)\,dv$$

The corresponding result for the variance includes two terms:
- Mean of the conditional variance
- Variance of the conditional mean

$$var(u) = E(var(u | v)) + var(E(u | v))$$

If $u$ is a vector, then $E(u)$ is a vector and $var(u)$ is a matrix.
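Both identities can be verified exactly on a small discrete example. The distributions below are invented for illustration; any valid $p(v)$ and $p(u | v)$ would do.

```python
# Hypothetical two-stage model: first draw v, then draw u given v.
p_v = {0: 0.4, 1: 0.6}
p_u_given_v = {
    0: {0: 0.5, 2: 0.5},  # E(u | v=0) = 1, var(u | v=0) = 1
    1: {1: 0.5, 5: 0.5},  # E(u | v=1) = 3, var(u | v=1) = 4
}

def mean(dist):
    return sum(x * p for x, p in dist.items())

def var(dist):
    m = mean(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())

# Marginal distribution of u: sum the joint p(u | v) p(v) over v.
p_u = {}
for v, pv in p_v.items():
    for u, pu in p_u_given_v[v].items():
        p_u[u] = p_u.get(u, 0.0) + pu * pv

cond_mean = {v: mean(p_u_given_v[v]) for v in p_v}
cond_var = {v: var(p_u_given_v[v]) for v in p_v}

mean_of_cond_mean = sum(cond_mean[v] * p_v[v] for v in p_v)
mean_of_cond_var = sum(cond_var[v] * p_v[v] for v in p_v)
var_of_cond_mean = sum((cond_mean[v] - mean_of_cond_mean) ** 2 * p_v[v] for v in p_v)

print(mean(p_u), mean_of_cond_mean)                   # both sides of E(u) = E(E(u | v))
print(var(p_u), mean_of_cond_var + var_of_cond_mean)  # both sides of the variance identity
```

By hand: $E(u) = 0.4 \cdot 1 + 0.6 \cdot 3 = 2.2$, and $var(u) = 2.8 + 0.96 = 3.76$.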

#### Transformation of Variables

Suppose $p_u(u)$ is the density of the vector $u$, and we transform to $v = f(u)$ where $v$ has the same number of components as $u$. If $p_u$ is a discrete distribution, and $f$ is a one-to-one function, then the density of $v$ is given by:

$$p_v(v) = p_u(f^{-1}(v))$$

If $f$ is a many-to-one function, then a sum of terms appears on the right side of this expression for $p_v(v)$, with one term corresponding to each of the branches of the inverse function. If $p_u$ is a continuous distribution, and $v = f(u)$ is a one-to-one transformation, then the joint density of the transformed vector is:

$$p_v(v) = |J| p_u(f^{-1}(v))$$

where $|J|$ is the absolute value of the determinant of the Jacobian of the transformation $u = f^{-1}(v)$ as a function of $v$. The Jacobian $J$ is a square matrix of partial derivatives with dimension given by the number of components of $u$, with the $(i, j)$th entry equal to $\frac{\partial u_i}{\partial v_j}$. If $f$ is many-to-one, then $p_v(v)$ is a sum or integral of terms.
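As a small worked example of the continuous one-to-one case (the distribution and transformation are chosen here purely for illustration), let the scalar $u$ be uniform on $(0, 1)$, so $p_u(u) = 1$, and let $v = f(u) = -log(u)$. The inverse is $u = f^{-1}(v) = e^{-v}$ for $v > 0$, so:

$$|J| = \left|\frac{du}{dv}\right| = e^{-v}, \qquad p_v(v) = |J|\, p_u(f^{-1}(v)) = e^{-v}, \quad v > 0$$

which is the exponential density with rate 1.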

When working with parameters defined on the open interval $(0, 1)$, we often use the logistic transformation:

$$logit(u) = log(\frac{u}{1 - u})$$

whose inverse is:

$$logit^{-1}(v) = \frac{e^v}{1 + e^v}$$

Another common choice is the probit transformation $\Phi^{-1}(u)$, where $\Phi$ is the standard normal cumulative distribution function, to transform from $(0, 1)$ to $(-\infty, \infty)$.
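Both transformations and their inverses are easy to write down with the standard library. This is a minimal sketch; the function names are chosen here for illustration, and `inv_logit` uses the algebraically equivalent form $\frac{1}{1 + e^{-v}}$, which can overflow for extremely negative $v$.

```python
import math
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def logit(u):
    """Map u in (0, 1) to the real line."""
    return math.log(u / (1.0 - u))

def inv_logit(v):
    """Inverse of logit: e^v / (1 + e^v), written as 1 / (1 + e^-v)."""
    return 1.0 / (1.0 + math.exp(-v))

def probit(u):
    """Inverse standard normal cdf, mapping (0, 1) to the real line."""
    return _STANDARD_NORMAL.inv_cdf(u)

def inv_probit(v):
    """Standard normal cdf, mapping the real line back to (0, 1)."""
    return _STANDARD_NORMAL.cdf(v)

u = 0.3
print(logit(u), inv_logit(logit(u)))
print(probit(u), inv_probit(probit(u)))
```

Both maps send $u = 0.5$ to $0$ and are symmetric about that point, so values below 0.5 become negative.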

# 1.9

#### Sampling using Inverse Cumulative Distribution Function

The Cumulative Distribution Function (cdf F) of a one-dimensional distribution $p(v)$ is defined by:

- $F(v_*) = Pr(v \le v_*) =$
    - $\sum_{v \le v_*} p(v)$ if $p$ is discrete
    - $\int_{-\infty}^{v_*}p(v)dv$ if $p$ is continuous
    
The inverse cdf can be used to obtain random samples from the distribution $p$, as follows:
- First, draw a random value $U$ from the uniform distribution on $[0, 1]$ using a table of random numbers or, more likely, a random number function on the computer.
- Second, let $v = F^{-1}(U)$. The function $F$ is not necessarily one-to-one, but $F^{-1}(U)$ is unique with probability 1.
- The value $v$ will be a random draw from $p$, and is easy to compute as long as $F^{-1}(U)$ is simple.

For a continuous example, suppose $v$ has an exponential distribution with parameter $\lambda$; then its cdf is $F(v) = 1 - e^{-\lambda v}$, and the value of $v$ for which $U = F(v)$ is:

$$v = -\frac{log(1 - U)}{\lambda}$$

Recognizing that $1 - U$ also has the uniform distribution on $[0, 1]$, we see we can obtain random draws from the exponential distribution as $-\frac{log U}{\lambda}$.
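The exponential example translates directly into code. This is a sketch, with the function name, rate, and seed chosen only for illustration. It keeps the $1 - U$ form because Python's `random()` returns values in $[0, 1)$, so $1 - U$ is never zero and the logarithm is always defined.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw from an exponential distribution via the inverse cdf.

    Solves U = 1 - exp(-rate * v) for v, with U uniform on [0, 1).
    """
    u = rng.random()
    return -math.log(1.0 - u) / rate

rng = random.Random(0)  # fixed seed so the run is reproducible
draws = [sample_exponential(2.0, rng) for _ in range(100_000)]

# The sample mean should be close to the theoretical mean 1 / rate = 0.5.
print(sum(draws) / len(draws))
```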

# 1.10

#### Bayesian Inference in Applied Statistics

A pragmatic rationale for the use of Bayesian methods is the inherent flexibility introduced by their incorporation of multiple levels of randomness and the resultant ability to combine information from different sources, while incorporating all reasonable sources of uncertainty in inferential summaries.

The strength of the Bayesian approach lies in:
- Its ability to combine information from multiple sources (thereby in fact allowing greater ‘objectivity’ in final conclusions)
- Its more encompassing accounting of uncertainty about the unknowns in a statistical problem.