# Chapter 18 

# Using what you know; Bayesian models

## Introduction

So far, we have mostly worked with frequentist statistical methods. Frequentist models make inferences using only data and the assumptions of a likelihood model. There is another class of statistical models with a long and successful history, Bayesian models. In contrast to the frequentist approach, Bayesian models use **prior information** as well as the data and a likelihood model to perform inference.   

Despite their long history, Bayesian models were not used extensively until recently. This limited use was the result of several difficulties. The need to specify a **prior distribution** of known information has proved a formidable intellectual obstacle and is often cited as a reason for not using Bayesian methods. Further, modern Bayesian methods are often computationally intensive and have become practical only in the past few decades. The recent emergence of improved software and algorithms has now made Bayesian methods widely and practically accessible.         

<img src="../images/Sun.png" alt="Drawing" style="width:350px; height:450px"/>
<center>A Bayesian would win this bet. Source nist.gov.</center>


## Brief History of Bayesian Models

A restricted version of Bayes' Theorem was proposed by Rev. Thomas Bayes (1702-1761). The theorem was published posthumously by his friend Richard Price. Bayes' interest was in probabilities of gambling games. He was also a supporter of Isaac Newton's new theory of calculus, which he defended in his publication, *An Introduction to the Doctrine of Fluxions, and a Defence of the Mathematicians Against the Objections of the Author of The Analyst*.

![](../images/ThomasBayes.gif)

Pierre-Simon Laplace published a far more general version of Bayes' Theorem, similar to its modern form, in his *Essai philosophique sur les probabilités* in 1814. Laplace applied Bayesian methods to problems in celestial mechanics. These problems had great practical importance in the late 18th and early 19th centuries for the safe navigation of ships, the cutting-edge technology problem of the age.  
 
![](../images/Laplace.jpg)

### Early 20th Century History

The geophysicist and mathematician Harold Jeffreys used Bayesian methods extensively. His 1939 book, *Theory of Probability*, was in deliberate opposition to Fisher's maximum likelihood methods. The publication of this book set off a feud between Jeffreys and Fisher which lasted until Fisher's death in 1962. The feud resulted in Bayesian methods rarely being taught or used in scientific publications for many decades. Thus, the development and use of Bayesian models was limited in the mid 20th Century.   

<img src="../images/JeffreysProbability.jpg" alt="Drawing" style="width:225px; height:300px"/>
<center>Jeffreys' seminal 1939 book</center>

The battle between Fisher, Jeffreys, and their protégés continued for most of the middle 20th century. This battle was bitter and often personal. The core arguments were:

- Fisher argued that the selection of a Bayesian prior distribution was purely subjective, allowing one to achieve any answer desired.
- Jeffreys argued that all knowledge is in fact subjective, and that expressing a prior documented the investigator's biases. Further, he argued that choosing a cut-off value or confidence interval was subjective in any event.

Despite the philosophical squabbles, Bayesian methods endured and showed an increasing number of success stories. Pragmatists continued to use both approaches. For example, there were some notable wartime successes, including:     
- Bayesian models were used to improve artillery accuracy in both world wars. In particular, the eminent Soviet statistician Andrey Kolmogorov used Bayesian methods to greatly improve artillery accuracy. 
- Bayesian models were used by Alan Turing to break German naval codes.
- Bernard Koopman, working for the British Royal Navy, improved the ability to locate U-boats using crude directional data from intercepted radio transmissions. 

### Late 20th Century History  

Starting in the second half of the 20th century, the convergence of greater computing power and general acceptance led to notable advances in computational Bayesian methods. The following publications are a few milestones in this advancement:   
- Statistical sampling using Monte Carlo methods; Stanislaw Ulam, John von Neumann; 1946, 1947.   
- MCMC, or Markov Chain Monte Carlo; Metropolis et al. (1953) Journal of Chemical Physics.   
- Monte Carlo sampling methods using Markov chains and their application, Hastings (1970).   
- Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, Geman and Geman (1984).   
- Hamiltonian MCMC, Duane, Kennedy, Pendleton, and Roweth (1987).    
- Sampling-based approaches to calculating marginal densities, Gelfand and Smith (1990).  

### 21st Century  

In the 21st Century Bayesian models are in daily routine use. These applications range across the scope of statistical model applications. A few examples include medical research, marketing analytics, natural language understanding, and web search. Bayesian models have found uses in [legal judgements](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4934658/) when faced with diverse and uncertain evidence.  

One of the more interesting 21st Century applications of Bayesian modeling is in the management of [search and rescue operations](https://kcts9.pbslearningmedia.org/resource/nvpn-sci-sarops/using-probability-in-search-and-rescue-operations-prediction-by-the-numbers/). For example, the United States Coast Guard routinely uses a Bayesian model to find the most likely search locations for vessels in distress, given incomplete and vague reports of location. These methods made global headlines with the successful location of the missing Air France Flight 447. This aircraft had disappeared in a little-traveled area of the South Atlantic Ocean. Conventional location methods had failed to locate the wreckage, since the potential search area was quite large and the aircraft's emergency locator, or pinger, could not be detected. Bayesian methods were able to rapidly narrow the prospective search area, and the wreckage was successfully located. The Bayesian modeling methods used are discussed in a [paper by Stone et al., 2014](https://arxiv.org/pdf/1405.4720.pdf) and summarized [here](https://sinews.siam.org/Details-Page/bayesian-search-for-missing-aircraft-ships-and-people). A map, showing the posterior probability of the model, is displayed below.  

<img src="../images/AirFrance447_posterior_PDF.png" alt="Drawing" style="width:550px; height:450px"/>
<center>Posterior distribution of locations for Air France 447, assuming no pinger</center>    

A significant recent advance in Hamiltonian MCMC sampling is the No-U-Turn Sampler (NUTS), published by Hoffman and Gelman in 2011. The NUTS sampler eliminates the need for tedious hand tuning. This method represents the state of the art in sampling. NUTS makes MCMC methods routinely available for high-dimensional problems.   

## Bayesian vs. Frequentist Views

With greater computational power and general acceptance, Bayes methods are now widely used. Among pragmatists, the common belief today is that some problems are better handled by frequentist methods and some with Bayesian methods. Models that fall between these extremes are also in common use. These methods include the so-called **empirical Bayes** methods.  

Let's summarize the differences between the Bayesian and frequentist views. 

- The goal of Bayesian methods is the computation of a **posterior distribution**. Bayesian methods use **prior distributions** combined with **evidence** to compute the posterior distribution. 
- Bayesian posterior distributions are updated when more evidence is collected. 
- Frequentist methods do not assign probability distributions to the parameters. They start with only a likelihood model and use p-values and confidence intervals to express the uncertainty of parameter estimates given the data.
- Frequentist methods are based on specific random samples, and models must be recomputed when additional evidence is collected. 

Recalling that both views are useful, we can contrast these methods with a chart.

<img src="../images/FrequentistBayes.jpg" alt="Drawing" style="width:600px; height:400px"/>
<center>Comparison of Bayesian and frequentist models</center>


## Bayes Theorem and Conditional Probability

As you have likely guessed from the name, Bayesian models are built upon **Bayes' Theorem** or **Bayes' Rule**, a fundamental relationship for **conditional probability**.

<img src="../images/BayesDeNeon.jpg" alt="Drawing" style="width:400px; height:300px"/>
<center>Credit: Wikipedia commons</center>

### Conditional probability

Let's briefly review **conditional probability**, the probability that one event occurs given that another event has occurred. Consider the example of events in a space S with subspaces A and B, shown below. 

<img src="../images/Prob1.png" alt="Drawing" style="width:300px; height:200px"/>

We can write a conditional probability relationship between the subsets as follows. We write the conditional probability of A given B:

$$P(A|B)$$

Let's try to find the relationship between conditional probability and the probability of the intersection of the sets, $P(A \cap B)$. For an outcome to fall in the intersection, two things must happen, so this probability is the product of two probabilities:  
1. $P(B)$, since the outcome must be in B. 
2. $P(A|B)$, since A must also occur given that B occurs.

We can now write:

$$P(A \cap B) = P(A|B) P(B)$$

Rearranging terms we get the following for our example: 

\begin{align}
P(A|B) &= \frac{P(A \cap B)}{P(B)} \\
& = \frac{\frac{2}{10}}{\frac{4}{10}} = \frac{2}{4} = \frac{1}{2}
\end{align}
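
The same calculation can be checked by counting outcomes. The sketch below assumes, consistent with the fractions above, a space of 10 equally likely outcomes with 4 outcomes in B and 2 outcomes in both A and B. The outcome labels and the size of A are made up for illustration.

```python
from fractions import Fraction

# Hypothetical labelling of 10 equally likely outcomes (an assumption for
# illustration; only |B| = 4 and |A and B| = 2 come from the example above).
S = set(range(10))
A = {0, 1, 2}        # assumed: 3 outcomes in A
B = {1, 2, 3, 4}     # 4 outcomes in B, 2 of them shared with A

def prob(event, space=S):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & space), len(space))

p_B = prob(B)                      # P(B)
p_A_and_B = prob(A & B)            # P(A and B)
p_A_given_B = p_A_and_B / p_B      # P(A|B) = P(A and B) / P(B)

print('P(B) =', p_B)
print('P(A and B) =', p_A_and_B)
print('P(A|B) =', p_A_given_B)
```

Using exact fractions rather than floats keeps the arithmetic identical to the hand calculation.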


### Bayes Theorem

As has already been stated, Bayes' Theorem is fundamental to Bayesian data analysis. To get started, let's go through a simple derivation of Bayes theorem. From the previous section we have:

$$P(A|B) P(B) = P(A \cap B)$$

Using the same approach, we can also write: 

$$P(B|A) P(A) = P(A \cap B)$$

Eliminating $P(A \cap B):$

$$ P(B)P(A|B) = P(A)P(B|A)$$

Or, 

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

Which is Bayes' Theorem!
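
A quick numerical check can make the derivation concrete. The sketch below uses a made-up space of 10 equally likely outcomes (the sets `A` and `B` are assumptions for illustration) and compares $P(A|B)$ computed by direct counting with the value given by Bayes' Theorem.

```python
from fractions import Fraction

# Hypothetical outcome sets, chosen only for illustration
S = set(range(10))
A = {0, 1, 2}
B = {1, 2, 3, 4}

def prob(event):
    """P(event) for equally likely outcomes in S."""
    return Fraction(len(event), len(S))

def cond_prob(event, given):
    """P(event | given), computed directly by counting."""
    return Fraction(len(event & given), len(given))

# Left side: P(A|B) by direct counting
lhs = cond_prob(A, B)
# Right side: Bayes' Theorem, P(B|A) P(A) / P(B)
rhs = cond_prob(B, A) * prob(A) / prob(B)

print('P(A|B) by counting      =', lhs)
print('P(A|B) by Bayes Theorem =', rhs)
print('Equal:', lhs == rhs)
```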

### Interpreting Bayes Theorem

How can you interpret Bayes' Theorem in a way that is useful for data analysis? The following interpretation is a foundation of Bayesian analysis.  

$$Posterior\ Distribution = \frac{Likelihood \bullet Prior\ Distribution}{Evidence} $$

In general, we are interested in estimating the posterior distribution of model parameters. In this case, we can think of Bayes' theorem in terms of model parameters:   

\begin{align}
posterior\ distribution(𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠│𝑑𝑎𝑡𝑎)
&= \frac{Likelihood(𝑑𝑎𝑡𝑎|𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠)\ 𝑃rior(𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠)}{P(data)}  \\
𝑃(𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠│𝑑𝑎𝑡𝑎) &= \frac{P(𝑑𝑎𝑡𝑎|𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠)\ 𝑃(𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠)}{P(data)} 
\end{align}

So, what does the above actually mean? Let's decompose the expression term by term, focusing on the typical case of the probability distribution of model parameters:    
1. The conditional **posterior distribution** of the model parameters given the evidence or data, $P(𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠│𝑑𝑎𝑡𝑎)$. Computing this posterior distribution is the goal of Bayesian analysis. 
2. The **prior distribution**, $P(parameters)$, is chosen to express information available about the model parameters a priori. The specification of the prior distribution will be discussed further in this lesson.      
3. The **likelihood** is the conditional distribution of the data given the model parameters, $p(𝑑𝑎𝑡𝑎|𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠)$.   
4. The **evidence** is the distribution of the data, $p(data)$. This distribution acts to normalize the product of the likelihood and prior distribution so that the result is a proper posterior distribution.     

## Marginal Distributions   

In many cases of Bayesian analysis we are interested in the marginal distribution. For example, it is often the case that only one or a few parameters will be of interest. The behavior of these parameters is characterized by the marginal distribution. Further, the denominator of Bayes theorem, $P(data)$, can sometimes be computed as a marginal distribution. For these reasons, computing marginal distributions is an important, yet difficult, aspect of Bayesian analysis. Key principles of marginal distributions have already been discussed in chapter 8. Here, we will just do a brief review.   

Consider a multivariate probability density function with $n$ variables, $p(\theta_1, \theta_2, \ldots, \theta_n)$. A **marginal distribution** is the distribution of one variable with the others integrated out. In other words, if we integrate over all other variables $\{ \theta_2, \ldots, \theta_n \}$ the result is the marginal distribution with respect to the remaining variable, $p(\theta_1)$. We can express this idea mathematically as follows:       

$$p(\theta_1) = \int_{\theta_2, \ldots, \theta_n} p(\theta_1, \theta_2, \ldots, \theta_n)\ d\theta_2 \ldots d\theta_n$$
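
As a small numerical illustration of integrating out a variable, the sketch below uses an assumed toy joint density on the unit square, $p(\theta_1, \theta_2) = \theta_1 + \theta_2$, which is not taken from this chapter. The integral over $\theta_2$ is approximated with a midpoint-rule sum on a regular grid and compared with the analytical marginal, $p(\theta_1) = \theta_1 + 1/2$.

```python
import numpy as np

# Assumed toy joint density on the unit square (not from the text):
# p(theta1, theta2) = theta1 + theta2, which integrates to 1.
def joint_density(t1, t2):
    return t1 + t2

n = 200
d = 1.0 / n
mid = (np.arange(n) + 0.5) * d            # midpoints of n equal-width cells on [0, 1]
T1, T2 = np.meshgrid(mid, mid, indexing='ij')

joint = joint_density(T1, T2)             # joint[i, j] = p(theta1_i, theta2_j)
marginal_1 = joint.sum(axis=1) * d        # integrate out theta2 (midpoint rule)

exact = mid + 0.5                         # analytical marginal of theta1
print('Max abs error:', np.max(np.abs(marginal_1 - exact)))
print('Integral of marginal:', marginal_1.sum() * d)
```

The same pattern, summing a gridded joint density along one axis, extends to more variables, although the number of grid points grows quickly with dimension.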


> **Exercise 18-1:** A simple example will make the concept of marginal distributions less abstract. You will plot a bi-variate distribution along with the marginal distributions of the two variables.   
> In this case, you will work with the bivariate Normal distribution, $N( \mathbf{\mu}, \Sigma^2)$, where $\mu$ is the 2-vector of means and $\Sigma^2$ is the 2x2 covariance matrix. For the 2-vector, $\mu = [\mu_x,\mu_y]$, the marginal distributions are:   
> $$ p(X) = \int_Y N( [\mu_x,\mu_y], \Sigma^2)\ dY \\ p(Y) = \int_X N( [\mu_x,\mu_y], \Sigma^2)\ dX $$
> As a first step, execute the code in the cell below to import the packages you will need for the rest of this notebook.     

In [None]:
import pandas as pd
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats  # loads the stats submodule used below; also binds the name scipy
import itertools

%matplotlib inline

> The code in the cell below draws 5,000 samples from a bivariate Normal distribution with:   
> $$ \mu =  \begin{bmatrix} 0.0 \\ 0.0 \\ \end{bmatrix} \\ \Sigma^2 = 
\begin{bmatrix} 1.5 & 0.5 \\ 0.5 & 1.0 \\ \end{bmatrix} $$
> Complete and execute the code in the cell below to display a bivariate contour plot of the joint distribution and examine the results. To display the plot use the [seaborn.kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) function for the two variables, $X$ and $Y$.

In [None]:
bi_variate_normal = nr.multivariate_normal([0.0, 0.0], np.array([[1.5, 0.5], [0.5, 1.0]]), size=5000)
bi_variate_normal = pd.DataFrame(bi_variate_normal, columns=['X','Y'])

### Complete the code below




> Now, you are ready to plot the marginal densities of the variables, X and Y. You will not need to actually compute any integrals. Instead, you can use nonparametric density estimates for each variable. Follow these steps:  
> 1. Use the set of Matplotlib axes to place two density plots, one over the other.   
> 2. Use ax[].set_xlim() to set the x-axis limits to the range $[-6.0,6.0]$, so that you can compare the two plots.    
> 3. Display the marginal density estimates of X and Y using the [seaborn.kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) function. Make sure you include a title, so you know which plot is which.   

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(8, 6)) 

## Complete the code below to display the marginal densities of the X and Y variables.




> Notice any differences in the density plots of these marginal distributions. Are these differences expected in terms of dispersion and shape, and why? 

> **Answer:** 


## Applying Bayes Theorem

We need a formulation of Bayes Theorem which is tractable for computational problems. Specifically, we don't want to be stuck summing all of the possibilities to compute the denominator, $P(B)$. In fact, in many cases, computing this denominator directly is intractable.  

We can start by examining some interesting facts about conditional probabilities: 

$$
𝑃(𝐵 \cap A) = 𝑃(𝐵|𝐴)𝑃(𝐴) \\
And \\
𝑃(𝐵)=𝑃(𝐵 \cap 𝐴)+𝑃(𝐵 \cap \bar{𝐴}) 
$$

Where, $\bar{A} = not\ A$. So, the marginal distribution, $P(B)$, can be written:   

$$
𝑃(𝐵)=𝑃(𝐵|𝐴)𝑃(𝐴)+𝑃(𝐵│ \bar{𝐴})𝑃(\bar{𝐴})
$$

Using the above relations we can rewrite Bayes Theorem as:

$$ P(A|B) = \frac{P(A)P(B|A)}{𝑃(𝐵│𝐴)𝑃(𝐴)+𝑃(𝐵│ \bar{𝐴})𝑃(\bar{𝐴})} \\ $$

In summary, to compute the denominator we need to sum all the cases in the subset $A$ and all the cases in the subset $not\  A$. This is a bit of a mess! But fortunately, we can often avoid computing this denominator by force. We can rewrite Bayes Theorem as:

$$𝑃(𝐴│𝐵)=𝑘∙𝑃(𝐵|𝐴)𝑃(𝐴)$$

Here, $k = 1/P(B)$ is a normalization constant that does not depend on $A$. Ignoring $k$, we get:

$$𝑃(𝐴│𝐵) \propto 𝑃(𝐵|𝐴)𝑃(𝐴)$$
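
To see how this proportionality is used, consider a minimal sketch with two hypotheses, $A$ and $not\ A$. The prior and likelihood values are hypothetical numbers chosen only for illustration. The un-normalized products are computed first, and the denominator $P(B)$ is then recovered as their sum.

```python
# Hypothetical numbers chosen only for illustration (not from the text)
p_A = 0.1               # prior P(A)
p_B_given_A = 0.8       # likelihood P(B|A)
p_B_given_notA = 0.2    # likelihood P(B|not A)

# Un-normalized posterior for each hypothesis: likelihood x prior
unnorm_A = p_B_given_A * p_A
unnorm_notA = p_B_given_notA * (1.0 - p_A)

# The denominator P(B) is the sum of the un-normalized values, and k = 1 / P(B)
p_B = unnorm_A + unnorm_notA
k = 1.0 / p_B

post_A = k * unnorm_A          # P(A|B)
post_notA = k * unnorm_notA    # P(not A|B)
print('P(B) = %.3f' % p_B)
print('P(A|B) = %.3f, P(not A|B) = %.3f' % (post_A, post_notA))
```

Because the same constant $k$ multiplies every hypothesis, the ranking of hypotheses is already determined by the un-normalized products.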

### Simplified relationship for Bayes theorem

How do we interpret the relationship shown above? We can consider the following relationship:

$$Posterior\ Distribution \propto Likelihood \bullet Prior\ Distribution \\
Or\\
𝑃( 𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠 │ 𝑑𝑎𝑡𝑎 ) \propto 𝑃( 𝑑𝑎𝑡𝑎 | 𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠 )𝑃( 𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠 ) $$

The above formulation means that we can find a function proportional to the posterior distribution without the denominator. Once we have these values, we can sum over them to find the marginal distribution $P(B)$. In practice this approach can transform an intractable computation into a simple summation. 

These relationships can apply to the parameters in a model, including partial slopes, intercept, error distributions, lasso constants, etc. 

## Creating Bayes models

The goal of a Bayesian model is to find the posterior distribution of the model parameters and perform inference on this distribution. The general steps are as follows:

1. Identify data relevant to the research question. 
2. Define a sampling plan for the data. In contrast to frequentist methods, the data need not be collected in a single batch. 
3. Define a descriptive model for the data. For example, a linear regression model might be used for a problem.
4. Specify a prior distribution of the model parameters. For example, you might believe that the parameters of the linear model should be Normally distributed as $N(\mu,\sigma^2)$.
5. Use the Bayesian inference formula (above) to compute the posterior distribution of the model parameters. If there is no data as yet, the posterior distribution is the same as the prior distribution. This latter possibility is in contrast to frequentist methods, where data must be collected first. 
6. Update the posterior if more data are observed. This is key! The posterior of a Bayesian model naturally updates as more data are added, a form of learning. In this case, the **previous posterior serves as a prior** for the model update. 
7. Perform inference on the posterior. 
8. Optionally, simulate data values from realizations of the posterior distribution. These values are predictions from the model. 

### How can you choose a prior?

The choice of the prior is a serious, and potentially vexing, problem when performing Bayesian analysis. In fact, the need to choose a prior has often been cited as a reason why Bayesian models are impractical. General guidance is that a prior must be convincing to a **skeptical audience**. But this guidance is vague in practice.   

There are a number of ways one can create a good prior distribution. Some possible approaches include:

- Using prior empirical information about the problem. This might include information from related work by others. Deriving a prior distribution in this manner is sometimes called **empirical Bayes** modeling. Typically, a less informative prior distribution is used than the actual empirical distribution so the model is not overly constrained. For example, if a Normally distributed prior is chosen, the variance might be increased by some factor (perhaps 2x). The empirical Bayes approach is often applied in practice, while some Bayes theoreticians do not consider this a Bayesian approach at all.   
- Apply domain knowledge to determine a reasonable distribution. For example, a viable range of parameter values could be computed from physical principles.      
- If there is poor prior knowledge for the problem, a non-informative prior can be used. One possibility is a Uniform distribution. But **be careful!** A uniform prior is informative, since you must set the limits on the range of parameter values! There can be other options for uninformative distributions, such as the [Jeffreys prior](https://en.wikipedia.org/wiki/Jeffreys_prior).      
 
 

### Conjugate Priors   

An analytically simple choice for a prior distribution family is a **conjugate prior**. When a likelihood function is multiplied by its conjugate distribution the posterior distribution will be in the same family as the prior. This idea is attractive, since for cases where the conjugate distribution exists, analytical results can be computed.     

Most commonly used named distributions have conjugates. A few commonly used examples are shown in the table below:

Likelihood | Conjugate
---|---
Binomial|Beta
Bernoulli|Beta
Poisson|Gamma
Categorical|Dirichlet
Normal - mean| Normal
Normal - variance | Inverse Gamma
Normal - inverse variance, $\tau$ | Gamma

However, there are many practical cases where a conjugate prior is not used. With modern computational methods, a conjugate prior distribution is not required. And, often in these cases, no analytical solution is likely to exist.  

## First Extended Example

With a bit of theory in mind, let's apply these concepts with an example. Let's say we are interested in analyzing the incidence of distracted drivers. We randomly sample the behavior of 10 drivers at an intersection and determine if they exhibit distracted driving or not. Each driver is either distracted or not, so the number of distracted drivers in the sample is Binomially distributed. In the example we will:

1. Select the Binomial distribution for the likelihood.
2. Choose an uninformative prior distribution. Specifically, use the conjugate prior, the Beta distribution with parameters $\alpha=1$ and $\beta=1$.   
3. Using the data sample, compute the likelihood.
4. Compute the posterior distribution of distracted driving. 
5. Try another prior distribution and repeat step 4.
6. Add more evidence (data) and update the posterior distribution.

The Binomial likelihood has one parameter we need to estimate, $\theta$ (named `p` in the code), the probability of success. The concept of success should be interpreted broadly as a positive case. In this case success is a distracted driver. We can write the probability of $k$ successes in $n$ trials, given the success probability $\theta$, as:

$$ P(k | n, \theta) = \binom{n}{k} \cdot \theta^k(1-\theta)^{n-k}$$

Where:     
- $n =$ number of trials    
- $k =$ number of successes    
- $z = n-k =$ number of failures  
- $\theta =$ probability of success   
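
The formula above can be evaluated directly and compared with `scipy.stats.binom.pmf`. In the sketch below, the helper name `binomial_pmf` and the values of `n`, `k` and `theta` are assumptions chosen for illustration.

```python
from math import comb
from scipy.stats import binom

def binomial_pmf(k, n, theta):
    """Probability of k successes in n trials, computed from the formula above."""
    return comb(n, k) * theta**k * (1.0 - theta)**(n - k)

n, k = 10, 4        # assumed illustrative counts
theta = 0.4         # assumed probability of success

by_formula = binomial_pmf(k, n, theta)
by_scipy = binom.pmf(k, n, theta)
print('Formula: %.6f, scipy: %.6f' % (by_formula, by_scipy))
```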

The code in the cells below creates a synthetic data set of distracted and not-distracted drivers and computes some simple summary statistics. Execute this code. 

In [None]:
drivers = ['yes','no','yes','no','no','yes','no','no','no','yes']
distracted = [1 if x=='yes' else 0 for x in drivers]
distracted

In [None]:
N = len(distracted)  # sample size
n_distracted = sum(distracted)  # number of distracted drivers
n_not = N - n_distracted # number not distracted
print('Distracted drivers = %d \nAttentive drivers = %d'
    '\nEmpirical probability of distracted driving = %.1f' 
      % (n_distracted, n_not, n_distracted / (n_distracted + n_not)))

*Note* that the empirical estimate of $p$ is the frequentist **maximum likelihood estimate**.   

### The Beta prior

Let's explore the properties of the conjugate prior, the Beta distribution. The Beta distribution has two parameters, $a$ and $b$. The Beta distribution is defined for a variable on the interval $0 \le x \le 1$, which makes it a natural prior for a probability. Formally, we can write the density function of the Beta distribution:

$$Beta(x |a, b) = \kappa x^{a-1}(1 - x)^{b-1} \\
where,\ \kappa = normalization\ constant$$

The normalization constant, $\kappa$, is computed using a [Beta function](https://en.wikipedia.org/wiki/Beta_function_) (not to be confused with the Beta distribution). The normalization ensures that the Beta distribution is proper; i.e. $\int_0^1 Beta(x |a, b)\ dx = 1.0$. We will not explore the detailed mathematical properties of the Beta function in this lesson. 
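
As a brief check of the normalization, the sketch below computes $\kappa = 1/B(a,b)$ with `scipy.special.beta` and integrates the resulting density numerically. The shape parameters `a` and `b` are assumed example values.

```python
from scipy.special import beta as beta_function   # the Beta function B(a, b)
from scipy.stats import beta as beta_dist         # the Beta distribution
from scipy.integrate import quad

a, b = 2.0, 3.0   # assumed example shape parameters

kappa = 1.0 / beta_function(a, b)   # normalization constant

def beta_density(x):
    """Beta density written out as kappa * x^(a-1) * (1-x)^(b-1)."""
    return kappa * x**(a - 1) * (1.0 - x)**(b - 1)

area, _ = quad(beta_density, 0.0, 1.0)            # numerical integral over [0, 1]
diff = abs(beta_density(0.3) - beta_dist.pdf(0.3, a, b))

print('kappa = %.4f' % kappa)
print('Integral over [0, 1] = %.6f' % area)
print('Difference from scipy.stats.beta.pdf at x = 0.3: %.2e' % diff)
```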

The two parameters of the Beta distribution, $a$ and $b$, determine its shape. Both parameters must be positive, $a, b > 0$. To get a feel for the Beta distribution, execute the code in the cell below which computes 25 examples of the density function on a 5x5 grid. 

In [None]:
plt.figure(figsize=(12, 8))

alpha = [.5, 1, 2, 3, 4]
beta = alpha[:]
x = np.linspace(.001, .999, num=100)

for i, (a, b) in enumerate(itertools.product(alpha, beta)):
    plt.subplot(len(alpha), len(beta), i+1)
    plt.plot(x, scipy.stats.beta.pdf(x, a, b))
    plt.title('(a,b) = (%g,%g)' % (a,b))  # %g keeps the fractional value 0.5 in the title
plt.tight_layout()    

> **Exercise 18-2:** You can see from the plots above, that the Beta distribution can take on quite a range of shapes, depending on the parameters. Examine these plots and answer the following questions:   
> 1. What is the relationship between $a$ and $b$ when the density function of the Beta distribution is symmetric?   
> 2. As $a$ and $b$ increase how does the shape of the distribution change? 
> 3. How does the relationship between the values of $a$ and $b$ change the skew of the distribution? 

> **Answers:**      
> 1.                  
> 2.                   
> 3.            

### Analytical solution with conjugate prior  

It helps understanding to consider the distribution of the product of a Binomial likelihood and a Beta prior. Let's do a bit of algebra and find the result. Define the evidence as $n$ trials with $k$ successes. The prior is a Beta distribution with parameters $a$ and $b$, and we denote its normalizing Beta function as $B(a,b)$. Starting from Bayes' Theorem, we find the posterior distribution of the success probability, $\theta$:    

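\begin{align}
posterior(\theta | k, n) &= \frac{likelihood(k,n | \theta)\ prior(\theta)}{data\ distribution (k,n)} \\
p(\theta | k, n) &= \frac{p(k,n | \theta)\ p(\theta)}{p(k,n)} \\
&= \frac{Binomial(k,n | \theta)\ Beta(\theta | a, b)}{p(k,n)} \\
&= \frac{\binom{n}{k}\theta^k(1-\theta)^{n-k}}{p(k,n)} \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a,b)} \\
&= \frac{\theta^{k + a - 1}(1-\theta)^{n-k+b-1}}{B(k+a,\ n-k+b)} \\
&= Beta(\theta | k + a,\ n-k+b)
\end{align}

In the next-to-last line, the factors that do not depend on $\theta$, namely $\binom{n}{k}$, $p(k,n)$ and $B(a,b)$, have been collected into a single constant. Since the posterior must integrate to 1, this constant must be $1/B(k+a,\ n-k+b)$.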

There are some useful insights you can gain by examining the last line of the derivation above.    
- As expected, the posterior distribution is in the Beta family. The parameters $a$ and $b$ are determined by the prior.  
- The parameters of the prior can be interpreted as **pseudo counts** of successes, $a = pseudo\ success + 1$ and failures, $b = pseudo\ failure + 1$. Be careful when creating a prior to **add 1** to the successes and failures. The larger the total pseudo counts, $a + b$, the **stronger the prior information**.  
- The evidence is also in the form of (actual) counts of successes, $k$, and failures, $n-k$. The more evidence is available, the greater its influence on the posterior distribution. A large amount of evidence will overwhelm the prior.    
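
The conjugate update amounts to adding counts to the prior parameters. The sketch below illustrates this under stated assumptions: the helper `update_beta` and the evidence counts are hypothetical. It also shows that updating in two batches, with the first posterior serving as the prior for the second batch, gives the same result as a single update with all the evidence.

```python
from scipy.stats import beta

def update_beta(a, b, k, n):
    """Posterior Beta parameters after observing k successes in n trials."""
    return a + k, b + (n - k)

a0, b0 = 1.0, 1.0                            # prior with zero pseudo counts (uniform)

# Assumed illustrative evidence, split into two batches
a1, b1 = update_beta(a0, b0, k=4, n=10)      # first batch
a2, b2 = update_beta(a1, b1, k=1, n=10)      # second batch; previous posterior is the prior

# Updating once with all the evidence gives the same posterior
a_all, b_all = update_beta(a0, b0, k=5, n=20)

print('After batch 1: Beta(%g, %g)' % (a1, b1))
print('After batch 2: Beta(%g, %g)' % (a2, b2))
print('All at once:   Beta(%g, %g)' % (a_all, b_all))
print('Posterior mean = %.3f' % beta.mean(a2, b2))
```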

### Uninformative prior distribution

Let's pick an initial prior distribution. Combined with the likelihood we can then compute a posterior distribution, $P(\theta | X)$, for the one model parameter $\theta$.    

In a case where we have no useful information in advance, an **uninformative prior distribution** can be an appropriate choice. For this example, a uniform distribution can be used. The Beta distribution with $(a,b) = (1,1)$ is the uniform distribution. This is the same as saying the pseudo counts of successes and failures are both 0.   

The code in the cell below computes and plots the Beta prior distribution with pseudo counts $[0, 0]$ and $[a,b]=[1,1]$. Execute this code and examine the results. 

In [None]:
def beta_prior(p, a, b):
    pp = scipy.stats.beta.pdf(p, a, b)
    return pp

N = 100
p = np.linspace(.001, .999, num=N)
p_prior = beta_prior(p, 1.0, 1.0 )

fig,ax = plt.subplots(figsize=(8,4))
ax.plot(p, p_prior, linewidth=2, color='blue')
ax.set_xlabel('p');
ax.set_ylabel('Probability density');
ax.set_title('Uniform prior distribution');

### The likelihood function

Next, we need to compute the likelihood. The likelihood is the conditional probability of the data given the parameter, $P(X|\theta)$. The code in the cell below computes and plots the Binomial likelihood for the distracted driver data. The **probability mass function**, [scipy.stats.binom.pmf](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html), is used to compute the likelihood for each value of the probability parameter $\theta$, given the data. Notice that we are using a probability mass function since the Binomial distribution is discrete.   

Execute this code and examine the results.

In [None]:
def binomial_likelihood(p, data):
    k = sum(data)
    n = len(data)
    return scipy.stats.binom.pmf(k, n, p)

likelihood = binomial_likelihood(p, distracted)

fig,ax = plt.subplots(figsize=(8,4))
ax.plot(p, likelihood)
ax.set_title('Likelihood function');
ax.set_xlabel('Parameter');
ax.set_ylabel('Likelihood');

> **Exercise 18-3:** Examine this result and answer these questions:
> 1. Does the maximum of this likelihood curve appear to be at $p=0.4$, the estimated parameter value?
> 2. Does it make sense that the values of the likelihood are nearly 0 at the extremes of the parameter value range, and why?  

> **Answers:**    
> 1.              
> 2.          

### Grid sampling the posterior distribution   

A simple, even naive, method for computing the posterior distribution is to use **grid sampling**. The likelihood and prior are computed on a regular grid as is illustrated below. The grid spans the range of each parameter value of interest.   

<img src="../images/SamplingGrid.png" alt="Drawing" style="width:350px; height:350px"/>
<center>A 2-dimensional sampling grid for computing a posterior distribution</center>   

The algorithm to compute samples of the posterior distribution is shown below. The result is a normalized sampled estimate of posterior distribution.   

<img src="../images/GridSamplePseudoCOde.png" alt="Drawing" style="width:600px; height:800px"/>


> **Warning!** In this lesson the probability calculations are computed using a **grid approximation**. The grid approximation performs calculations on a regular grid of parameter values. While simple to implement and understand, grid approximations are generally too inefficient for real-world problems.

### Computing the posterior distribution

Now that you have a prior and a likelihood, you are in a position to compute the posterior distribution of the parameter $\theta$, given the evidence, $X$, or $P(\theta|X)$. 

To perform this calculation, the product of the prior density and the likelihood is computed. This product is the numerator of Bayes' Theorem. The result must then be normalized. The normalization constant is the marginal distribution of the data, computed by summing the numerator over the parameter, $\theta$:  

$$p(X) = \sum_{\theta} p(X | \theta)\ p(\theta)$$

> **Exercise 18-4:** Use the likelihood and prior distribution samples just created to compute and plot the posterior distribution of the Binomial parameter, $\theta$. For this analysis, you will use **1-dimensional grid sampling** to compute an approximate solution. The functions in the cell below plot the prior density, the likelihood and the posterior distribution. The value of the parameter, $\theta$, at the maximum is printed for the prior, likelihood and posterior. The latter estimate is known as the **maximum a posteriori** or **MAP**. You must complete the `posterior` function below, which requires two missing lines of code, and execute the code in the cell below:    
> 1. Compute the un-normalized posterior distribution from the arguments to the function.     
> 2. Return the normalized posterior distribution.
>         
> *Note*: You can perform these operations in two short lines of code. 

In [None]:
def posterior(prior, like):
    ## Complete this function so that it returns the normalized posterior distribution
    
    

def plot_distribution(dist, theta, ax, title):
    ax.plot(theta, dist)
    ax.set_title(title)
    ax.set_xlim(0, 1)

def print_maximums(pp, l, post, bias=0.00001):
    print('\n\nMaximum of the prior density = {:5.3f}'.format(float(np.argmax(pp) + 1)/100.0)) 
    print('Maximum likelihood {:5.3f}'.format(float(np.argmax(l) + 1)/100.0))
    print('MAP = {:5.3f}'.format(float(np.argmax(post) + 1)/100.0))

def plot_post(prior, like, post, theta):
    fig, ax = plt.subplots(3, 1, figsize=(10, 10))
    plot_distribution(prior, theta, ax[0], 'Density of prior distribution')
    plot_distribution(like, theta, ax[1], 'Likelihood function')
    plot_distribution(post, theta, ax[2], 'Posterior probability distribution')
    ax[2].set_xlabel('Theta')
    print_maximums(prior, like, post)
    
post = posterior(p_prior, likelihood)
plot_post(p_prior, likelihood, post, p)

> Answer the following questions:  
> 1. Given the prior, are the identical shapes of the likelihood and posterior expected, and why?    
> 2. Frequentist statistics is largely based on likelihood functions and the maximum likelihood method. When the prior is a uniform distribution, as in the preceding example, is there any substantive difference between Bayesian and frequentist analysis? What does this tell you about the similarities and differences of these methods?   

> **Answers:**     
> 1.               
> 2.          

### Analysis with another prior

How will changing the prior influence the posterior? Imagine we acquired some prior information from a national study of distracted drivers. The finding of the study is that one in 10 drivers is distracted on average.  

> **Exercise 18-5:** You will now incorporate the prior information from the national study in the Bayesian analysis. Since the national study collected data on a great number of drivers, you will use a stronger prior with total pseudo counts $= 10$. For this analysis, you will use 1-dimensional grid sampling to compute an approximate solution. To complete this analysis, do the following:       
> 1. Compute the density function of the new prior distribution. Make sure you add 1 to the pseudo counts, $[a,b]$. 
> 2. Compute the posterior distribution using the previously computed likelihood and the new prior.   
> 3. Plot and print the results using the plot_post function.    

In [None]:
## Put your code below  




> Examine the results and answer these questions:    
> 1. Does the prior represent the results of the national study and why?    
> 2. Notice the difference between the likelihood and the prior. Pay particular attention to the mode and right tail of the posterior distribution. Is this expected given the new prior and why?   

> **Answers:**    
> 1.             
> 2.           

### Analytical solution

As we discussed earlier in this lesson, the Beta distributed posterior can be computed analytically. Recall, the parameters of the posterior Beta distribution incorporate both counts and pseudo counts for success and failures. These parameters are computed as: 
\begin{align}
a &= success + pseudo\ success + 1 \\
b &= total\ counts - success + pseudo\ failures + 1
\end{align}
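
The formulas above can be sketched as a small helper. The function name `beta_posterior_params` and the toy counts are assumptions made for illustration only; the exercise below uses its own data and function.

```python
def beta_posterior_params(successes, total, pseudo_success, pseudo_failures):
    """Posterior Beta parameters, following the two formulas in the text."""
    a = successes + pseudo_success + 1
    b = total - successes + pseudo_failures + 1
    return a, b

# Assumed toy values: 2 successes in 10 trials, prior pseudo counts [1, 9]
a, b = beta_posterior_params(2, 10, 1, 9)
print('a = {}  b = {}'.format(a, b))

# The mode of a Beta(a, b) density, valid when a > 1 and b > 1
mode = (a - 1) / (a + b - 2)
print('Mode = {:.3f}'.format(mode))
```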

**Adjusting density:** With standard statistical software tools, it is easy to compute a probability density function (PDF). But if you sample the PDF at multiple discrete points, there is a scaling problem. Recall that to find a probability for a continuous distribution, we need to integrate over some nonzero interval, say from $\theta_1$ to $\theta_2$:     
$$P \big( \theta_1 \le \theta \le \theta_2 \big) = \int^{\theta_2}_{\theta_1} {pdf}(\theta)\ d\theta$$    
To plot the posterior probability we need to scale the PDF at the desired values of $\theta$. These values of $\theta$ are discrete, on the sampling grid, and are said to form a **comb** or **comb function**, $\vec{\theta}$. The sampled values must add to 1.0, so that the result represents a proper distribution. A simple transformation achieves the required normalization. Simply divide by the number, $n$, of sample points in the comb, $\vec{\theta}$. This works because the $n$ grid points are evenly spaced over the unit interval, so each point stands for an interval of width approximately $1/n$:   
$$p(\vec{\theta}) = \frac{1}{n}\ {pdf}(\vec{\theta})$$
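
A quick numerical check of this scaling is sketched below. The grid of 100 points and the Beta parameters $a = 4$, $b = 18$ are assumed values chosen only for illustration.

```python
import numpy as np
import scipy.stats

n = 100
theta = np.arange(1, n + 1) / n             # comb of n evenly spaced grid points
pdf = scipy.stats.beta.pdf(theta, 4, 18)    # density values, which do not sum to 1
p_theta = pdf / n                           # scale by 1/n, the grid spacing

print('Sum of raw pdf values    = {:.3f}'.format(pdf.sum()))
print('Sum of scaled pdf values = {:.3f}'.format(p_theta.sum()))
```

The scaled values sum to approximately 1, not exactly 1, since the sum is a discrete approximation of the integral. If an exact sum is needed, dividing by `pdf.sum()` instead of `n` is one option.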

> **Exercise 18-6:** Now that you have computed the posterior using grid sampling, you will compute and plot the **analytically derived posterior** Beta distribution for the distracted driving problem. An analytic solution uses values of $a$ and $b$ computed using the prior pseudo counts plus the evidence, from which the posterior Beta distribution can be directly computed. The pseudo counts are $[a,b]=[1,9]$, adding 1, as before. Perform the following steps:
> 1. From the list `distracted` and the pseudo counts of the prior distribution, compute the Beta parameters, $a$ and $b$, of the posterior distribution.  
> 2. Compute the probabilities of the posterior for each value of the parameter, $\theta$. You can use the [scipy.stats.beta.pdf](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.beta.html) function for this calculation, and then normalize the result. Note that in the code we use the vector of $\theta$ values in the variable `p`. 
> 3. Plot the density of the posterior vs. the values of the Binomial parameter $\theta$.   
> 4. Print the MAP value of $\theta$.   
> 5. Make sure you name the variables returned by the `analytic_posterior` function `a`, `b`, `beta_post`.

In [None]:
def analytic_posterior(distracted, p, pseudo_success, pseudo_fail): 
    success = sum(distracted)
    n = len(distracted)
    
    ## Add code to find a and b
   
    
    print('a = ' + str(a) + '  b= ' + str(b))
    
    ## Add code to compute normalized posterior
    
    
    
    ## Plot density and print MAP
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(p, beta_post)
    ax.set_title('Analytical posterior distribution of distracted driving')
    ax.set_xlabel('Theta')
    print('MAP = %.3f' % float(float(np.argmax(beta_post))/100.0 + 0.01))
    return(a, b, beta_post)
    
a, b, beta_post = analytic_posterior(distracted, p, 2, 10)    

> Compare the results from the analytical solution to the grid-sampled solution. Are there any significant differences and why is this the expected behavior?   

> **Answer:**            

## Adding data to the Bayesian model

We will now examine one of the most interesting properties of Bayesian models: the ability to update a model using new evidence. This is in contrast to frequentist analysis, where the model computation must be done from scratch.     

Let's say that we observe some more drivers to gather more data on distracted driving. We can summarize the effects of adding additional evidence to a Bayesian model as:     
- Additional data will narrow the spread or dispersion of the posterior distribution. 
- As data are added to a Bayesian model, the posterior moves toward the likelihood. 
- The prior matters less as more data are added to a Bayesian model.
- The inferences from Bayesian and frequentist models tend to converge as data set size grows and the posterior approaches the likelihood.
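
These effects can be seen in a minimal grid-sampling sketch. The two small data sets, the flat prior and the helper names are assumptions for illustration and are not the exercise data. The sketch also checks that updating in two steps, using the first posterior as the new prior, gives the same result as a single update with all of the data.

```python
import numpy as np

theta = np.arange(1, 100) / 100.0    # grid on (0, 1), end points excluded

def bernoulli_likelihood(data, theta):
    """Likelihood of 0/1 observations at each grid value of theta."""
    k, n = sum(data), len(data)
    return theta**k * (1.0 - theta)**(n - k)

def normalize(x):
    return x / x.sum()

def grid_sd(post):
    """Standard deviation of a normalized grid posterior."""
    mean = (theta * post).sum()
    return np.sqrt((((theta - mean)**2) * post).sum())

first = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]           # assumed toy data
second = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]    # assumed additional data

prior = normalize(np.ones_like(theta))
post_1 = normalize(prior * bernoulli_likelihood(first, theta))
# The first posterior serves as the prior for the second update
post_2 = normalize(post_1 * bernoulli_likelihood(second, theta))
# A single update using all of the data at once
post_batch = normalize(prior * bernoulli_likelihood(first + second, theta))

print('Sequential and batch posteriors agree:', np.allclose(post_2, post_batch))
print('Posterior SD after first data set  = {:.4f}'.format(grid_sd(post_1)))
print('Posterior SD after both data sets  = {:.4f}'.format(grid_sd(post_2)))
```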

> **Note:** By using a prior distribution, Bayesian methods can provide useful inferences with minimal data. Adding additional evidence moves the posterior distribution toward the likelihood. **But, be careful!** For complex models with large numbers of parameters, you may need enormous numbers of observations to see the convergence in behavior to the equivalent frequentist model. 

> **Exercise 18-7:** To get a feel for how adding evidence changes a Bayesian model we will update our running example. This process is often referred to as **updating belief** in Bayesian analysis. A Bayesian belief is represented by the posterior distribution. Additional evidence changes belief, leading to a new posterior distribution. In this case we consider the result of having observed 20 additional drivers and noted whether they are distracted or not.      
> For this analysis, you will use 1-dimensional **grid sampling** to compute an approximate solution. To find and analyze the new posterior distribution, perform the following steps:    
> 1. Compute the Binomial likelihood for the new observations.   
> 2. Compute an updated posterior density using the *posterior* function. The prior distribution is the posterior distribution from the preceding grid-sampled analysis with the original 10 observations. In other words, the belief represented by the preceding posterior distribution is updated by computing a new posterior distribution.    
> 3. Plot the updated model using the *plot_post* function.   

In [None]:
new_drivers = ['no','no','no','yes','no',
               'yes','no','yes','no','no',
               'no', 'yes', 'no', 'no', 'no',
               'yes', 'yes', 'no', 'no', 'no']  # Some new data
new_distracted = [1 if x == 'yes' else 0 for x in new_drivers]

## Put your code below




> Compare the updated results with the original results and answer the following questions.   
> 1. How has the MAP point of the posterior changed compared to the first posterior? Is this change expected? 
> 2. Is the dispersion of the posterior distribution reduced compared to the first posterior, and why is this behavior expected? 
> 3. Are the MAP and the maximum of the likelihood function closer together and why?  

> **Answers:**  
> 1.      
> 2.       
> 3.            

> **Exercise 18-8:** Using the **analytic_posterior** function, you will now compute and plot the analytical Beta posterior density. An analytic solution uses values of $a$ and $b$ computed using the prior pseudo counts and the evidence, from which the posterior Beta distribution can be directly computed. Follow these steps:   
> 1. Compute $a$ and $b$ using the pseudo counts of the prior distribution (based on the national study of 1 of 10 distracted drivers) and the successes and failures from both sets of observations. 
> 2. Using the function you created for Exercise 18-6, compute and plot the posterior Beta density vs. the Binomial parameter, $\theta$, and compute and print the MAP.  
> 3. Make sure you name the objects returned by the `analytic_posterior` function `a_updated`, `b_updated`, `beta_post_updated`.

In [None]:
## Put your code below 



> Compare the analytical result to the previous result. Are the shape and MAP as expected, and why?  

> **Answer:**             

## Credible Intervals

How can we specify the uncertainty for a Bayesian parameter estimate? For frequentist analysis we use confidence intervals, but these are not entirely appropriate for Bayesian analysis. Confidence intervals are based on a **sampling distribution** of some statistic. Given this statistic, the upper and lower confidence bounds are the specified quantiles of the resulting parameter distribution. In Bayesian analysis there is no sampling distribution. Instead, we use a prior distribution along with a likelihood, computed from evidence, to compute a posterior distribution.    

What then can we do to specify a range of credible parameter values for Bayesian analysis? We use a concept known as the **credible interval** of the posterior distribution. A credible interval is an interval on the Bayesian posterior distribution with the highest $\alpha$ proportion of posterior probability. As an example, the $\alpha = 0.90$ credible interval encompasses the 90% of the posterior distribution with the highest density. The credible interval defined this way is often referred to by other names: the **highest density interval (HDI)**, the **highest posterior density interval (HPDI)**, or the **highest probability density (HPD)** interval. These names make sense, since we seek the densest posterior interval containing $\alpha$ probability.        

For symmetric distributions the credible interval can be numerically the same as the confidence interval. However, in the general case, these two quantities can be quite different.  
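
One simple way to approximate a highest density interval on a sampling grid is sketched below. This is not the method used in the code cell that follows. The function name `grid_hdi` and the stand-in posterior, proportional to a Beta density with $a = 4$ and $b = 18$, are assumptions for illustration. The sketch keeps the grid points with the highest posterior probability until the requested mass is reached, which yields a single interval when the posterior is unimodal.

```python
import numpy as np

def grid_hdi(theta, post, mass=0.90):
    """Bounds of the smallest set of grid points holding at least `mass`
    posterior probability. Assumes `post` sums to 1 and is unimodal."""
    order = np.argsort(post)[::-1]             # indices, highest density first
    cum = np.cumsum(post[order])               # accumulated posterior mass
    n_keep = np.searchsorted(cum, mass) + 1    # number of points needed
    kept = theta[order[:n_keep]]
    return kept.min(), kept.max()

theta = np.arange(1, 100) / 100.0
post = theta**3 * (1.0 - theta)**17            # assumed stand-in posterior shape
post = post / post.sum()

lower, upper = grid_hdi(theta, post, mass=0.90)
inside = post[(theta >= lower) & (theta <= upper)].sum()
print('HDI = [{:.2f}, {:.2f}] with posterior mass {:.3f}'.format(lower, upper, inside))
```

The resolution of the interval is limited by the grid spacing, so the mass inside the interval is generally somewhat larger than the requested value.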

An example may help your intuition for the concept of credible interval. The code in the cell below uses the analytical solution for the posterior distribution. Based on this distribution, the credible interval is computed, printed and plotted. For reference the confidence interval is printed as well. Execute this code and examine the results.   

> **Note** This code uses a simple second order approximation to find the credible interval of the unimodal Beta posterior distribution. The approximation relies on the fact that the computed value of the PDF must be the same at each boundary of the credible interval. Starting with an initial guess, the confidence interval, the credible interval boundaries are incrementally updated. This approach likely introduces a small amount of bias in the solution, since the actual interval is unconstrained.    

In [None]:
def correct_credible_interval(a, b, lower_ci, upper_ci, p, nits=20):
    n = float(len(p))
    for _ in range(nits): # Iterate to find improved solutions
        ## first find the pdf value for the candidate credible
        ## interval bounds. 
        lower_pdf = scipy.stats.beta.pdf(lower_ci, a,b)/n
        upper_pdf = scipy.stats.beta.pdf(upper_ci, a,b)/n
        ## Shift both bounds toward the bound with the larger pdf value.
        ## The step size is the difference between the two pdf values.  
        if(lower_pdf > upper_pdf): 
            diff_pdf = lower_pdf - upper_pdf
            lower_ci = lower_ci - diff_pdf
            upper_ci = upper_ci - diff_pdf
        else: 
            diff_pdf = upper_pdf - lower_pdf
            lower_ci = lower_ci + diff_pdf
            upper_ci = upper_ci + diff_pdf
    return(lower_ci, upper_ci)

def plot_credible_interval(alpha, a, b, posterior, theta):
    ## Compute the initial values, which are the confidence intervals
    lower_ci, upper_ci = scipy.stats.beta.interval(alpha, a, b)
    print('Lower confidence interval = {:6.4f}   Upper confidence interval = {:6.4f}'.format(lower_ci, upper_ci))
    ## Apply the approximate correction to find the credible intervals
    lower_credible, upper_credible = correct_credible_interval(a, b, lower_ci, upper_ci, theta)
    print('Lower credible interval = {:6.4f}   Upper credible interval = {:6.4f}'.format(lower_credible, upper_credible))
    ## Plot the posterior distribution with credible intervals
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(theta, posterior)
    ax.axvline(lower_credible, color='red', linewidth=2, label='Credible Interval')
    ax.axvline(upper_credible, color='red', linewidth=2)
    ax.set_title('Analytical posterior distribution with {:4.2f} credible interval'.format(alpha))
    ax.set_xlabel('Theta')
    return lower_ci, upper_ci, lower_credible, upper_credible, ax
    
    
_=plot_credible_interval(0.95, a, b, beta_post, p)    

Notice the following:   
- The confidence interval is nearly, but not quite, the same as the credible interval.  
- The densities at the upper and lower bounds of the credible interval are identical. This is not the case for the confidence interval.  

### Credible Intervals are not Confidence Intervals   

You may well wonder how credible intervals are different from the more familiar confidence intervals. Confidence intervals and credible intervals are conceptually quite different.     

A confidence interval is a purely frequentist concept. The **confidence interval** defines an interval on the **sampling distribution** where repeated samples of a statistic are expected with probability $= \alpha$, for example $\alpha = 0.95$. You **cannot interpret** a confidence interval as an interval on a probability distribution of the values of the parameter!      

In contrast, a credible interval is an interval on the posterior distribution of the parameter. Therefore, the credible interval is exactly what the misinterpretation of the confidence interval tries to be. The credible interval is the interval with the highest $\alpha$ probability for the parameter being estimated. In other words, there is an $\alpha$ probability that the parameter lies in the credible interval.   

For a symmetric posterior distribution, the credible interval will be numerically the same as the confidence interval. However, this need not be the case in general. Further, as just explained, the interpretation is quite different in any case.  

The code in the cell below shows the difference between the confidence interval and the credible interval for the posterior distribution computed from 10 observations. This posterior distribution is asymmetric, so the values are different. Notice that while the credible interval bounds cross the density function at exactly the same density, this is not the case for the confidence interval bounds.    

In [None]:
lower_ci, upper_ci, lower_credible, upper_credible, ax = plot_credible_interval(0.95, a, b, beta_post, p) 
ax.axvline(lower_ci, linestyle='dotted', linewidth=2, color='black', label='Confidence Interval');
ax.axvline(upper_ci, linestyle='dotted', linewidth=2, color='black');
ax.legend();

> **Exercise 18-9:** Using the functions defined above, analyze and plot the $\alpha = 0.95$ credible interval for the updated model (30 observations), incorporating the additional observations. Use the **analytically computed** updated parameter values, `a_updated` and `b_updated`, and the corresponding posterior density estimates to compute, plot and print the credible interval for the posterior distribution. Execute your code and examine the results.          

In [None]:
## Put your code below  





> Compare the credible intervals for the two posterior distributions above and answer these questions:   
> 1. How does the HDI change as more evidence is added and why?   
> 2. Do you expect the HDI to converge to the frequentist confidence interval as more evidence is added to this model and why?   

> **Answers:**       
> 1.          
> 2.          

## Simulating from the posterior distribution: predictions

So far, we have computed the posterior distribution of the probability parameter $\theta$ for two priors and with added observations. What else can we do with a Bayesian model? For one thing, you can perform simulations and make predictions. Predictions are computed by simulating from the posterior distribution. The results of these simulations are useful for several purposes, including:    
1. Forecasting future values.   
2. Model checking by comparing simulation results for agreement (or not) with observations.   
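
As a hedged sketch of simulating from a posterior distribution, the code below first draws values of $\theta$ from a grid posterior and then draws a Binomial count for each value. The stand-in posterior, the seed and the variable names are assumptions for illustration. Drawing Binomial counts is a common alternative to the scale-and-round method used in the exercise that follows, and the two methods generally do not give identical distributions.

```python
import numpy as np

rng = np.random.default_rng(42)          # seeded only for reproducibility

theta = np.arange(1, 100) / 100.0
post = theta**3 * (1.0 - theta)**17      # assumed stand-in posterior shape
post = post / post.sum()

num_cars, num_samples = 10, 10000
# Step 1: draw theta values from the grid posterior, with replacement
theta_draws = rng.choice(theta, size=num_samples, replace=True, p=post)
# Step 2: for each theta, simulate the number of distracted drivers
counts = rng.binomial(num_cars, theta_draws)
# Estimated probability of each possible count, 0 through num_cars
pmf = np.bincount(counts, minlength=num_cars + 1) / num_samples

for k, prob in enumerate(pmf):
    print('{:2d} distracted drivers: {:.3f}'.format(k, prob))
```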

> **Exercise 18-10:** The code in the cell below runs a simulation to forecast, and plot the distribution of, the number of distracted drivers among the next 10 drivers. You will complete the code for the simulation using a resampling method with these key steps:    
> 1. Probabilities are resampled **with replacement**. The sampling probabilities are drawn from the updated (using 30 observations) posterior density. You can compute the realizations using [numpy.random.choice](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html).    
> 2. The probabilities are scaled by the number of cars observed and then rounded to an integer.  

In [None]:
num_cars = 10
num_samples = 10000

## Add the missing line of code below



plt.hist(counts, bins=2*int(max(counts))+1, density=True)
plt.title('Expected number of distracted drivers in next %d' % num_cars)
plt.xlabel('Number of distracted drivers')
plt.xlim(0,10)
_=plt.ylabel('Expected ')

> Answer the following questions:  
> 1. Which two numbers of distracted drivers in the next 10 have the highest probability?   
> 2. Why is it expected that there is essentially 0 probability for high numbers of distracted drivers out of the 10?  

> **Answers:**    
> 1.         
> 2.      


> **Exercise 18-11:** The foregoing example is just one possible series of realizations drawn from the posterior distribution. Keeping in mind that the Bayesian model is probabilistic, it is worth considering how uncertain these predictions are. To understand the uncertainty we can resample multiple times from the posterior distribution and analyze the dispersion of the samples. This is not bootstrap resampling, since the model is not updated by resampling the observations. Still, resampling will give us an idea of the uncertainty for the current model.        
> The code in the cell below does the following:    
> 1. Iterates over the number of trials to be sampled, with the missing code doing the following: 
>   - At each iteration the probabilities are sampled. 
>   - The counts are grouped by the number of distracted drivers and summed. 
>   - The aggregated counts are normalized by the total number of counts.   
>   - The counts are added to the `counts.index` rows of the ith column.    
> 2. The transpose of the data frame is taken to place the probabilities in columns by trial.    
> 3. A box plot is displayed. The [pandas.melt](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html) method transforms the data frame into two columns: `variable`, with the number of distracted drivers, and `value`, with the corresponding probabilities.   
> 4. Prints a summary of the data frame using the [pandas.DataFrame.describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) function. 

In [None]:
num_cars = 10
num_samples = 10000
num_trials = 1000

distracted_count = pd.DataFrame(np.zeros((num_cars+1,num_trials)))
for i in range(num_trials):
    ## Put your code below 
    counts = pd.Series(((num_cars * np.random.choice(p, size=num_samples, replace=True, p=beta_post_updated)).round().astype(np.int64)))
    counts = counts.groupby(counts).agg('count')
    counts = np.divide(counts, counts.sum())
    distracted_count.iloc[counts.index,i] = counts
distracted_count = distracted_count.transpose() 
    

fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(x="variable", y="value", data=pd.melt(distracted_count), ax=ax)
ax.set_ylabel('Probability')
ax.set_xlabel('Number of distracted drivers')
_=ax.set_title('Probability of number of distracted drivers in next ' + str(num_cars))

## Print a summary of the data frame
distracted_count.describe()

> Examine these results and answer these questions:     
> 1. Are the uncertainties large enough to change the interpretation of which numbers of distracted drivers are most likely?       
> 2. Is there any chance of 0 distracted drivers, and for which cases?            

> **Answers:**   
> 1.                  
> 2.                 

## Summary

In this lesson, you have explored the following concepts:

1. Application of Bayes Theorem.
2. Computation of marginal distributions.
3. Selection and computation of prior distributions.
4. Selection and computation of likelihoods.
5. Computation of posterior distributions.
6. Computation and comparison of credible intervals. 
7. Simulation of data values from posterior distribution of model parameters.

## Further reading

There are numerous books and articles on Bayesian data analysis. Further, there are a growing number of powerful software packages that can be used for Bayesian data analysis.

### Some introductory texts

These two books provide a broad and readable introduction to Bayesian data analysis. Both books contain extensive examples using R and specialized Bayes packages.

<img src="../images/StatisticalRethinking.jpg" alt="Drawing" style="width:200px; height:275px"/>

<img src="../images/DoingBaysianDataAnalysis.jpg" alt="Drawing" style="width:200px; height:275px"/>

### Bayesian modeling with Python

An introduction to Bayesian modeling using Python packages, PyMC3 and [Tensorflow Probability](https://www.tensorflow.org/probability). This book contains a practical introduction to Bayesian modeling.     

<img src="../images/BayesianModelingPython.jpg" alt="Drawing" style="width:200px; height:275px"/>

### Modeling reference

This book contains a comprehensive treatment of applying Bayesian models. The level of treatment is intermediate. The examples are from the social sciences, but the methods can be applied more widely. The examples use R and specialized Bayes packages. 

<img src="../images/BayesRegression.jpg" alt="Drawing" style="width:200px; height:275px"/>

### Theory 

This book contains a comprehensive overview of the modern theory of Bayesian models. The book is at an advanced level. Primarily theory is addressed, with only very limited R code examples.  

<img src="../images/BaysianDataAnalysis.jpg" alt="Drawing" style="width:200px; height:275px"/>

#### Copyright 2017, 2018, 2019, 2020, 2021, 2022, 2023 Stephen F Elston. All rights reserved.