# Exercises

## Exercise 5.1.
[Purpose: Iterative application of Bayes’ rule, and seeing how posterior probabilities change with inclusion of more data.] This exercise extends the ideas of Table 5.4, so at this time, please review Table 5.4 and its discussion in the text. Suppose that the same randomly selected person as in Table 5.4 gets re-tested after the first test result was positive, and on the re-test, the result is negative. When taking into account the results of both tests, what is the probability that the person has the disease? Hint: For the prior probability of the re-test, use the posterior computed from Table 5.4. Retain as many decimal places as possible, as rounding can have a surprisingly big effect on the results. One way to avoid unnecessary rounding is to do the calculations in R.

In [49]:
p_disease = 0.001 # P(:()
p_clean = (1 - p_disease) # P(:))
p_true_pos = 0.99 # P(+|:()
p_false_neg = (1 - p_true_pos) # P(-|:()
p_false_pos = 0.05 # P(+|:))
p_true_neg = (1 - p_false_pos) # P(-|:))

In [51]:
# P(:(|+) = P(+|:()*P(:() / P(+)
likelihood = p_true_pos
prior = p_disease
marginal_likelihood = (p_true_pos * p_disease) + (p_false_pos * p_clean)
p_disease_when_pos = likelihood * prior / marginal_likelihood
print('P(:(|+):', p_disease_when_pos)

P(:(|+): 0.019434628975265017


In [12]:
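# P(:(|+, -): the posterior from the first test is the prior for the re-test
new_likelihood = p_false_neg
new_prior = p_disease_when_pos
# The marginal likelihood must be weighted by the updated prior, not the original base rate
new_marginal_likelihood = (p_false_neg * new_prior) + (p_true_neg * (1 - new_prior))
p_disease_when_pos_then_neg = new_likelihood * new_prior / new_marginal_likelihood
print('P(:(|+, -): {:.6g}'.format(p_disease_when_pos_then_neg))

P(:(|+, -): 0.000208586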
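The hint about reusing the posterior as the next prior can be wrapped in a small helper. The sketch below is illustrative only: the function name `update` and its keyword arguments are assumptions, and the rates are the ones used for Table 5.4 above. The key point is that the marginal likelihood is recomputed from the current prior at every step.

```python
def update(prior, positive, hit_rate=0.99, false_alarm_rate=0.05):
    """Return P(disease | test result), given the current P(disease).

    `update` is a hypothetical helper, not part of the book's code.
    """
    if positive:
        like_disease, like_clean = hit_rate, false_alarm_rate
    else:
        like_disease, like_clean = 1 - hit_rate, 1 - false_alarm_rate
    # Marginal likelihood uses the *current* prior, not the original base rate
    evidence = like_disease * prior + like_clean * (1 - prior)
    return like_disease * prior / evidence


posterior = 0.001                 # base rate of the disease
for result in (True, False):      # first test positive, re-test negative
    posterior = update(posterior, result)

print('P(:(|+, -):', posterior)
```

Because each call renormalizes with the prior it was given, the loop can be extended to any number of test results without further changes.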


## Exercise 5.2.
[Purpose: Getting an intuition for the previous results by using “natural frequency” and “Markov” representations]

(A) Suppose that the population consists of 100,000 people. Compute how many people would be expected to fall into each cell of Table 5.4. To compute the expected frequency of people in a cell, just multiply the cell probability by the size of the population. To get you started, a few of the cells of the frequency table are filled in here:

In [17]:
import numpy as np
import pandas as pd

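# Start from floats so the expected counts assigned later keep the column dtype
empty_array = [[0.0, 0.0], [0.0, 0.0]]
df = pd.DataFrame(empty_array)
df.columns = ['theta = :(', 'theta = :)']
df.index = ['D = +', 'D = -']
df

| | theta = :( | theta = :) |
|---|---|---|
| D = + | 0.0 | 0.0 |
| D = - | 0.0 | 0.0 |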


In [24]:
N = 100000
df.loc['D = +','theta = :('] = p_true_pos * p_disease * N
df.loc['D = +','theta = :)'] = p_false_pos * p_clean * N
df.loc['D = -','theta = :('] = p_false_neg * p_disease * N
df.loc['D = -','theta = :)'] = p_true_neg * p_clean * N
df

| | theta = :( | theta = :) |
|---|---|---|
| D = + | 99.0 | 4995.0 |
| D = - | 1.0 | 94905.0 |


(B) Take a good look at the frequencies in the table you just computed for the previous part. These are the so-called “natural frequencies” of the events, as opposed to the somewhat unintuitive expression in terms of conditional probabilities (Gigerenzer & Hoffrage, 1995). From the cell frequencies alone, determine **the proportion of people who have the disease, given that their test result is positive**. Before computing the exact answer arithmetically, first give a rough intuitive answer merely by looking at the relative frequencies in the row D = +. Does your intuitive answer match the intuitive answer you provided when originally reading about Table 5.4? Probably not. Your intuitive answer here is probably much closer to the correct answer. Now compute the exact answer arithmetically. It should match the result from applying Bayes’ rule in Table 5.4.

In [29]:
# P(:(|+)
p_disease_when_pos = df.loc['D = +', 'theta = :(']/df.loc['D = +'].sum()
print('P(:(|+):',p_disease_when_pos)

P(:(|+): 0.019434628975265017
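Since natural frequencies are just counts, the same proportion can also be computed without any rounding error. The following is a minimal sketch using Python's `fractions` module; the variable names are illustrative assumptions, and the rates are those of Table 5.4.

```python
from fractions import Fraction

N = 100_000
p_disease = Fraction(1, 1000)
hit_rate = Fraction(99, 100)          # P(+ | disease)
false_alarm_rate = Fraction(5, 100)   # P(+ | no disease)

# Expected counts in the D = + row of the frequency table
n_pos_disease = N * p_disease * hit_rate                 # 99 people
n_pos_clean = N * (1 - p_disease) * false_alarm_rate     # 4995 people

# Proportion of positive testers who actually have the disease
exact = n_pos_disease / (n_pos_disease + n_pos_clean)
print(exact, '=', float(exact))
```

The exact ratio is 99 out of 5094 positive testers, which makes it easy to see at a glance that only about 2% of positives have the disease.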


(C) Now we’ll consider a related representation of the probabilities in terms of natural frequencies, which is especially useful when we accumulate more data. This type of representation is called a “Markov” representation by Krauss, Martignon, and Hoffrage (1999). Suppose now we start with a population of N = 10,000,000 people. We expect 99.9% of them (i.e., 9,990,000) not to have the disease, and just 0.1% (i.e., 10,000) to have the disease. Now consider how many people we expect to test positive. Of the 10,000 people who have the disease, 99% (i.e., 9,900) will be expected to test positive. Of the 9,990,000 people who do not have the disease, 5% (i.e., 499,500) will be expected to test positive. **Now consider re-testing everyone who has tested positive on the first test. How many of them are expected to show a negative result on the retest?** Use this diagram to compute your answer:

In [2]:
N = 10000000
p_disease = 0.001 # P(:()
p_clean = (1 - p_disease) # P(:))
n_disease = N * p_disease
n_clean = N * p_clean
p_true_pos = 0.99 # P(+|:()
p_false_neg = (1 - p_true_pos) # P(-|:()
p_false_pos = 0.05 # P(+|:))
p_true_neg = (1 - p_false_pos) # P(-|:))

In [3]:
print('Expected counts for people who have the disease (last line: positive, then negative on re-test).')
print('exp(+ | :() is', n_disease * p_true_pos)
print('exp(- | :() is', n_disease * p_false_neg)
n_diff_results_when_disease = n_disease * p_true_pos * p_false_neg
print('exp(- | +, :() is', n_diff_results_when_disease)
print()
print('Expected counts for people who do not have the disease (last line: positive, then negative on re-test).')
print('exp(+ | :)) is', n_clean * p_false_pos)
print('exp(- | :)) is', n_clean * p_true_neg)
n_diff_results_when_clean = n_clean * p_false_pos * p_true_neg
print('exp(- | +, :)) is', n_diff_results_when_clean)

Expected counts for people who have the disease (last line: positive, then negative on re-test).
exp(+ | :() is 9900.0
exp(- | :() is 100.00000000000009
exp(- | +, :() is 99.00000000000009

Expected counts for people who do not have the disease (last line: positive, then negative on re-test).
exp(+ | :)) is 499500.0
exp(- | :)) is 9490500.0
exp(- | +, :)) is 474525.0


(D) Use the diagram in the previous part to answer this: **What proportion of people, who test positive at first and then negative on retest, actually have the disease?** In other words, of the total number of people at the bottom of the diagram in the previous part (those are the people who tested positive then negative), what proportion of them are in the left branch of the tree? How does the result compare with your answer to Exercise 5.1?

In [4]:
# This is a proportion, not a count: diseased share of everyone who tested + then -
p_disease_when_pos_then_neg = n_diff_results_when_disease / (
    n_diff_results_when_clean + n_diff_results_when_disease)
print(p_disease_when_pos_then_neg)

0.00020858616504854387
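The branching in parts (C) and (D) can be generalized to any sequence of test results by shrinking the two branches of the tree one test at a time. This is only a sketch: the function name `surviving_counts` and the `'+'`/`'-'` string encoding are assumptions made for illustration.

```python
def surviving_counts(n_people, results, p_disease=0.001,
                     hit_rate=0.99, false_alarm_rate=0.05):
    """Expected (diseased, healthy) counts of people whose successive
    test results match `results`, e.g. '+-' for positive then negative."""
    n_sick = n_people * p_disease
    n_healthy = n_people * (1 - p_disease)
    for result in results:
        if result == '+':
            n_sick *= hit_rate                  # diseased who test positive
            n_healthy *= false_alarm_rate       # healthy who test positive
        else:
            n_sick *= 1 - hit_rate              # diseased who test negative
            n_healthy *= 1 - false_alarm_rate   # healthy who test negative
    return n_sick, n_healthy


n_sick, n_healthy = surviving_counts(10_000_000, '+-')
proportion_sick = n_sick / (n_sick + n_healthy)
print(n_sick, n_healthy, proportion_sick)
```

Calling `surviving_counts(10_000_000, '+')` reproduces the first level of the tree (about 9,900 and 499,500 people), and the `'+-'` call reproduces the bottom level used in part (D).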


## Exercise 5.3.
[Purpose: To see a hands-on example of data-order invariance.]

Consider again the disease and diagnostic test of the previous two exercises.

(A) Suppose that a person selected at random from the population gets the test and it comes back negative. Compute the probability that the person has the disease.

In [6]:
# P(:(|-)
likelihood = p_false_neg
prior = p_disease
marginal_likelihood = (p_false_neg * p_disease) + (p_true_neg * p_clean)
p_disease_when_neg = likelihood * prior / marginal_likelihood
print(p_disease_when_neg)

1.0536741618022054e-05


(B) The person then gets re-tested, and on the second test the result is positive. Compute the probability that the person has the disease. How does the result compare with your answer to Exercise 5.1?


In [8]:
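new_likelihood = p_true_pos
new_prior = p_disease_when_neg
# The marginal likelihood must be weighted by the updated prior, not the original base rate
new_marginal_likelihood = (p_true_pos * new_prior) + (p_false_pos * (1 - new_prior))
p_disease_when_neg_and_pos = new_likelihood * new_prior / new_marginal_likelihood
print("P(:(|-,+): {:.6g}".format(p_disease_when_neg_and_pos))
print("Same as the positive-then-negative order (compare Exercise 5.2(D)): the order of the tests does not matter")

P(:(|-,+): 0.000208586
Same as the positive-then-negative order (compare Exercise 5.2(D)): the order of the tests does not matter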
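Data-order invariance can also be checked mechanically by running the sequential update over every ordering of the same results. The sketch below assumes the rates used throughout these exercises; `LIKELIHOOD` and `sequential_posterior` are hypothetical names introduced for illustration.

```python
from itertools import permutations
from math import isclose

P_DISEASE = 0.001
# result -> (P(result | disease), P(result | no disease))
LIKELIHOOD = {'+': (0.99, 0.05), '-': (0.01, 0.95)}


def sequential_posterior(results, prior=P_DISEASE):
    """Apply Bayes' rule once per result, feeding each posterior
    back in as the prior for the next result."""
    for result in results:
        like_sick, like_healthy = LIKELIHOOD[result]
        evidence = like_sick * prior + like_healthy * (1 - prior)
        prior = like_sick * prior / evidence
    return prior


# Every distinct ordering of one positive and one negative result
orders = sorted(set(permutations('+-')))
posteriors = {''.join(order): sequential_posterior(order) for order in orders}

for order, value in posteriors.items():
    print(order, value)
print('order invariant:', isclose(posteriors['+-'], posteriors['-+']))
```

Replacing `'+-'` with a longer string such as `'++-'` checks the same property for more than two tests; the comparison uses `isclose` because floating-point products can differ in the last digits depending on the order of multiplication.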


## Exercise 5.4.
[Purpose: To gain intuition about Bayesian updating by using BernGrid.]

Open the program BernGridExample.R. You will notice there are several examples of using the function BernGrid. Run the script. For each example, include the R code and the resulting graphic and explain what idea the example illustrates.

Hints: Look back at Figures 5.2 and 5.3, and look ahead to Figure 6.5. Two of the examples involve a single flip, with the only difference between the examples being whether the prior is uniform or contains only two extreme options. The point of those two examples is to show that a single datum implies little when the prior is vague, but a single datum can have strong implications when the prior allows only two very different possibilities.
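The contrast described in the hint can be sketched in Python with a plain grid approximation. This is not the BernGrid function or the BernGridExample.R script: `grid_posterior` is a hypothetical helper, and the grid size and the two candidate values 0.05 and 0.95 are assumptions chosen only to illustrate a vague prior versus a prior with two extreme options.

```python
import numpy as np


def grid_posterior(theta, prior, heads, tails):
    """Posterior mass over a grid of candidate biases `theta` after
    observing `heads` and `tails` flips of a Bernoulli coin."""
    likelihood = theta**heads * (1 - theta)**tails
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()


# Vague prior: equal mass on a fine grid of theta values (grid size is an assumption)
theta_grid = np.linspace(0, 1, 1001)
prior_uniform = np.full(theta_grid.shape, 1 / theta_grid.size)
post_uniform = grid_posterior(theta_grid, prior_uniform, heads=1, tails=0)

# Two-option prior: theta is either 0.05 or 0.95 (assumed values), equally likely
theta_two = np.array([0.05, 0.95])
prior_two = np.array([0.5, 0.5])
post_two = grid_posterior(theta_two, prior_two, heads=1, tails=0)

# Probability that the coin is head-biased after a single observed head
p_bias_uniform = post_uniform[theta_grid > 0.5].sum()
p_bias_two = post_two[theta_two > 0.5].sum()
print('uniform prior:   ', p_bias_uniform)
print('two-option prior:', p_bias_two)
```

Both priors start with roughly even odds that the coin favors heads. After one head, the vague prior moves only moderately, while the two-option prior shifts almost all of its mass to the head-biased candidate.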