# Think Bayes

Second Edition

Copyright 2020 Allen B. Downey

License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

In [3]:
# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install empiricaldist

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from empiricaldist import Pmf
from utils import decorate, savefig

## Odds

The following function converts from probabilities to odds.

In [5]:
def odds(p):
    return p / (1-p)

And this function converts from odds to probabilities.

In [6]:
def prob(o):
    return o / (o+1)

If 20% of bettors think my horse will win, that corresponds to odds of 1:4, or 0.25.

In [7]:
p = 0.2
odds(p)

If the odds in favor of my horse are 1:5, that corresponds to a probability of 1/6.

In [8]:
o = 1/5
prob(o)
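
As a quick sanity check, the two conversions should be inverses of each other. The sketch below redefines `odds` and `prob` so it runs on its own, and compares with `math.isclose` to allow for floating-point error; the test values are arbitrary.

```python
import math

def odds(p):
    """Convert a probability to odds in favor."""
    return p / (1 - p)

def prob(o):
    """Convert odds in favor to a probability."""
    return o / (o + 1)

# Round trip: probability -> odds -> probability
for p in [0.1, 0.2, 0.5, 0.75]:
    assert math.isclose(prob(odds(p)), p)

# Round trip: odds -> probability -> odds
for o in [0.2, 1, 4]:
    assert math.isclose(odds(prob(o)), o)
```

Note that `odds` is undefined at `p = 1`, where certainty corresponds to infinite odds.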

We can use the odds form of Bayes's theorem to solve the cookie problem:

In [9]:
prior_odds = 1
likelihood_ratio = 0.75 / 0.5
post_odds = prior_odds * likelihood_ratio
post_odds

And then we can compute the posterior probability, if desired.

In [10]:
post_prob = prob(post_odds)
post_prob

If we draw another cookie and it's chocolate, we can do another update:

In [11]:
likelihood_ratio = 0.25 / 0.5
post_odds *= likelihood_ratio
post_odds

And convert back to probability.

In [12]:
post_prob = prob(post_odds)
post_prob
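
The two updates above can also be folded into a single loop. This is a minimal sketch, not part of the original notebook: `update_odds` is a hypothetical helper name, and it assumes the observations are independent given each hypothesis, so their likelihood ratios simply multiply.

```python
def prob(o):
    """Convert odds in favor to a probability."""
    return o / (o + 1)

def update_odds(prior_odds, likelihood_ratios):
    """Hypothetical helper: apply a sequence of likelihood ratios to prior odds."""
    post_odds = prior_odds
    for ratio in likelihood_ratios:
        post_odds *= ratio
    return post_odds

# Vanilla cookie first, then a chocolate cookie, starting from even odds
ratios = [0.75 / 0.5, 0.25 / 0.5]
post_odds = update_odds(1, ratios)
post_prob = prob(post_odds)
```

Because multiplication commutes, the order in which the cookies are drawn does not affect the posterior odds.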

## Oliver's blood

The likelihood ratio is also useful for talking about the strength of evidence without getting bogged down talking about priors.

As an example, we'll solve this problem from MacKay's *Information Theory, Inference, and Learning Algorithms*:

> Two people have left traces of their own blood at the scene of a crime.  A suspect, Oliver, is tested and found to have type 'O' blood.  The blood groups of the two traces are found to be of type 'O' (a common type in the local population, having frequency 60%) and of type 'AB' (a rare type, with frequency 1%). Do these data [the traces found at the scene] give evidence in favor of the proposition that Oliver was one of the people [who left blood at the scene]?

If Oliver is
one of the people who left blood at the crime scene, then he
accounts for the 'O' sample, so the probability of the data
is just the probability that a random member of the population
has type 'AB' blood, which is 1%.

If Oliver did not leave blood at the scene, then we have two
samples to account for.  If we choose two random people from
the population, what is the chance of finding one with type 'O'
and one with type 'AB'?  Well, there are two ways it might happen:
the first person we choose might have type 'O' and the second
'AB', or the other way around.  So the total probability is
$2 (0.6) (0.01) = 0.012$, or 1.2%.
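
The "two ways it might happen" argument can be checked by enumerating every ordered pair of blood types. This is a sketch under two assumptions that are not stated in the problem: the two people are drawn independently from the population, and all remaining blood types are lumped into a catch-all category called `'other'`.

```python
from itertools import product

# Frequencies from the problem statement; 'other' is an assumed
# catch-all for every remaining blood type.
freq = {'O': 0.6, 'AB': 0.01, 'other': 0.39}

# Add up the probability of each ordered pair that contains
# exactly one 'O' and one 'AB'.
total = sum(freq[a] * freq[b]
            for a, b in product(freq, repeat=2)
            if {a, b} == {'O', 'AB'})
```

Only the pairs `('O', 'AB')` and `('AB', 'O')` pass the filter, which is where the factor of 2 comes from.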

So the likelihood ratio is:

In [13]:
like1 = 0.01
like2 = 2 * 0.6 * 0.01

likelihood_ratio = like1 / like2
likelihood_ratio

Since the ratio is less than 1, it is evidence *against* the hypothesis that Oliver left blood at the scene.

But it is weak evidence.  For example, if the prior odds were 1 (that is, 50% probability), the posterior odds would be 0.83, which corresponds to a probability of:

In [14]:
post_odds = 1 * like1 / like2
prob(post_odds)

So this evidence doesn't "move the needle" very much.

**Exercise:** Suppose other evidence had made you 90% confident of Oliver's guilt.  How much would this exculpatory evidence change your beliefs?  What if you initially thought there was only a 10% chance of his guilt?

Notice that evidence with the same strength has a different effect on probability, depending on where you started.

In [15]:
# Solution goes here

In [16]:
# Solution goes here

## Addends

In [215]:
def make_die(sides):
    """Pmf that represents a die with the given number of sides.
    
    sides: int
    
    returns: Pmf
    """
    outcomes = np.arange(1, sides+1)
    die = Pmf(1/sides, outcomes)
    return die

In [216]:
die = make_die(6)
die

In [115]:
die.bar(alpha=0.6)
decorate(xlabel='Outcome',
         ylabel='PMF')

In [116]:
for q, p in die.items():
    print(q, p)

In [19]:
def add_dist(pmf1, pmf2):
    """Compute the distribution of a sum.
    
    pmf1: Pmf
    pmf2: Pmf
    
    returns: Pmf of sums from pmf1 and pmf2
    """
    res = Pmf()
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            q = q1 + q2
            p = p1 * p2
            res[q] = res(q) + p
    return res
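
To see the same idea without `empiricaldist`, here is a minimal sketch that represents each distribution as a plain dictionary from outcome to probability. The name `add_dist_dict` is hypothetical, and `Fraction` is used only so that the probabilities stay exact. Like `add_dist`, it assumes the two quantities are independent.

```python
from collections import defaultdict
from fractions import Fraction

def add_dist_dict(pmf1, pmf2):
    """Distribution of the sum of two independent quantities.

    pmf1, pmf2: dicts that map from outcome to probability
    """
    res = defaultdict(int)
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            # Each pair of outcomes contributes its joint
            # probability to the sum q1 + q2.
            res[q1 + q2] += p1 * p2
    return dict(res)

# A fair six-sided die, and the sum of two rolls
die = {q: Fraction(1, 6) for q in range(1, 7)}
twice = add_dist_dict(die, die)
```

With two six-sided dice, the sums range from 2 to 12, and the most likely sum is 7, which can be reached in six of the 36 equally likely ways.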

In [133]:
def decorate_dice(title=''):
    decorate(xlabel='Outcome',
             ylabel='PMF',
             title=title)

In [134]:
twice = add_dist(die, die)
twice.bar(color='C1', alpha=0.6)
decorate_dice()

In [135]:
twice = die.add_dist(die)
twice.bar(color='C1', alpha=0.6)
decorate_dice()

In [136]:
twice = Pmf.add_dist(die, die)
twice.bar(color='C1', alpha=0.6)
decorate_dice()

In [137]:
def add_dist_seq(seq):
    """Distribution of sum of values from PMFs.
    
    seq: sequence of Pmf objects
    
    returns: Pmf
    """
    total = seq[0]
    for other in seq[1:]:
        total = total.add_dist(other)
    return total

In [138]:
dice = [die] * 3

In [139]:
thrice = add_dist_seq(dice)
die.plot(label='once')
twice.plot(label='twice')
thrice.plot(label='thrice')

decorate_dice(title='Distributions of sums')
plt.xticks([0,3,6,9,12,15,18])
savefig('fig05-01')

## Gluten sensitivity

In [109]:
from scipy.stats import binom

def make_binomial(n, p):
    """Make a binomial distribution.
    
    n: number of trials
    p: probability of success
    
    returns: Pmf representing the distribution of k
    """
    ks = np.arange(n+1)
    ps = binom.pmf(ks, n, p)
    return Pmf(ps, ks)

In [110]:
n = 35
n_sensitive = 10
n_insensitive = n - n_sensitive

dist_sensitive = make_binomial(n_sensitive, 0.95)
dist_insensitive = make_binomial(n_insensitive, 0.4)

In [112]:
dist_total = Pmf.add_dist(dist_sensitive, dist_insensitive)
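
One way to check the combined distribution is to rebuild it with the standard library and confirm that its mean is the sum of the two binomial means, $10 \times 0.95 + 25 \times 0.4 = 19.5$. The sketch below does not use `Pmf` or SciPy; `binomial_pmf` and `convolve` are hypothetical helper names, and lists indexed by `k` stand in for the distributions.

```python
from math import comb

def binomial_pmf(n, p):
    """List of binomial probabilities for k = 0..n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def convolve(ps1, ps2):
    """Distribution of the sum of two independent counts."""
    res = [0.0] * (len(ps1) + len(ps2) - 1)
    for i, p1 in enumerate(ps1):
        for j, p2 in enumerate(ps2):
            res[i + j] += p1 * p2
    return res

sensitive = binomial_pmf(10, 0.95)
insensitive = binomial_pmf(25, 0.4)
total = convolve(sensitive, insensitive)

# Mean of the total number of correct identifications
mean = sum(k * p for k, p in enumerate(total))
```

The total is not itself binomial, because the two groups have different probabilities of success; that is why the notebook computes it by adding distributions.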

In [140]:
dist_sensitive.plot(label='sensitive')
dist_insensitive.plot(label='insensitive')
dist_total.plot(label='total')

decorate(xlabel='Number of correct identifications',
         ylabel='PMF',
         title='Gluten sensitivity')

savefig('fig05-02')

In [69]:
table = pd.DataFrame()
table[0] = make_binomial(n, 0.4)

for n_sensitive in range(1, n):
    n_insensitive = n - n_sensitive

    dist_sensitive = make_binomial(n_sensitive, 0.95)
    dist_insensitive = make_binomial(n_insensitive, 0.4)
    dist_total = Pmf.add_dist(dist_sensitive, dist_insensitive)    
    table[n_sensitive] = dist_total
    
table[n] = make_binomial(n, 0.95)

In [71]:
table.head()

In [72]:
table.tail()

In [142]:
for n_sensitive in [0, 10, 20, 30]:
    table[n_sensitive].plot(label=f'n_sensitive = {n_sensitive}')
    
decorate(xlabel='Number of correct identifications',
         ylabel='PMF',
         title='Gluten sensitivity')

savefig('fig05-03')

In [143]:
likelihood1 = table.loc[12]
likelihood2 = table.loc[20]

In [144]:
hypos = np.arange(n+1)
prior = Pmf(1, hypos)

In [145]:
posterior1 = prior * likelihood1
posterior1.normalize()

posterior2 = prior * likelihood2
posterior2.normalize()
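
The update above takes one row of the table as the likelihood of the data under each hypothesis about the number of sensitive subjects. The following standard-library sketch reproduces that logic under the same assumptions (35 subjects, success probabilities 0.95 and 0.4, uniform prior). The function names are hypothetical, and lists indexed by `n_sensitive` stand in for `Pmf` objects.

```python
from math import comb

def binomial_pmf(n, p):
    """List of binomial probabilities for k = 0..n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def convolve(ps1, ps2):
    """Distribution of the sum of two independent counts."""
    res = [0.0] * (len(ps1) + len(ps2) - 1)
    for i, p1 in enumerate(ps1):
        for j, p2 in enumerate(ps2):
            res[i + j] += p1 * p2
    return res

def likelihood_of(data, n=35, p_sensitive=0.95, p_insensitive=0.4):
    """Probability of `data` correct answers for each n_sensitive in 0..n."""
    likes = []
    for n_sensitive in range(n + 1):
        total = convolve(binomial_pmf(n_sensitive, p_sensitive),
                         binomial_pmf(n - n_sensitive, p_insensitive))
        likes.append(total[data])
    return likes

def posterior_uniform(likes):
    """Posterior under a uniform prior: normalize the likelihoods."""
    norm = sum(likes)
    return [x / norm for x in likes]

post12 = posterior_uniform(likelihood_of(12))
post20 = posterior_uniform(likelihood_of(20))

# Most probable number of sensitive subjects in each scenario
map12 = max(range(len(post12)), key=post12.__getitem__)
map20 = max(range(len(post20)), key=post20.__getitem__)
```

With a uniform prior, multiplying by the prior changes nothing, so normalizing the likelihoods is enough. More correct identifications shift the posterior toward a larger number of sensitive subjects.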

In [147]:
posterior1.plot(label='posterior with 12 correct')
posterior2.plot(label='posterior with 20 correct')

decorate(xlabel='Number of sensitive subjects',
         ylabel='PMF',
         title='Posterior distributions')

savefig('fig05-04')

In [104]:
posterior1.max_prob()

In [105]:
posterior2.max_prob()

## Exercises

**Exercise:** Let's use Bayes's Rule to solve the Elvis problem from Chapter 2:

> Elvis Presley had a twin brother who died at birth. What is the probability that Elvis was an identical twin?

In 1935, about 2/3 of twins were fraternal and 1/3 were identical.

The question contains two pieces of information we can use to update this prior.

* First, Elvis's twin was also male, which is more likely if they were identical twins, with a likelihood ratio of 2.

* Also, Elvis's twin died at birth, which is more likely if they were identical twins, with a likelihood ratio of 1.25.

If you are curious about where those numbers come from, I wrote [a blog post about it](https://www.allendowney.com/blog/2020/01/28/the-elvis-problem-revisited).

In [149]:
# Solution goes here

In [150]:
# Solution goes here

In [152]:
# Solution goes here

**Exercise:** The following is an [interview question that appeared on glassdoor.com](https://www.glassdoor.com/Interview/You-re-about-to-get-on-a-plane-to-Seattle-You-want-to-know-if-you-should-bring-an-umbrella-You-call-3-random-friends-of-y-QTN_519262.htm), attributed to Facebook:

> You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle?

Use Bayes's Rule to solve this problem.  As a prior you can assume that it rains in Seattle about 10% of the time.

In [153]:
# Solution goes here

In [154]:
# Solution goes here

In [155]:
# Solution goes here

**Exercise:** [According to the CDC](https://www.cdc.gov/tobacco/data_statistics/fact_sheets/health_effects/effects_cig_smoking), people who smoke are about 25 times more likely to develop lung cancer than nonsmokers.

[Also according to the CDC](https://www.cdc.gov/tobacco/data_statistics/fact_sheets/adult_data/cig_smoking/index.htm), about 14% of adults in the U.S. are smokers.

If you learn that someone has lung cancer, what is the probability they are a smoker?

In [156]:
# Solution goes here

In [157]:
# Solution goes here

In [158]:
# Solution goes here

**Exercise:** Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it twice, multiply the outcomes, and report that the product is 12.
What is the probability that I chose the 8-sided die?

Hint: `Pmf` provides a function called `mul_dist` that takes two `Pmf` objects and returns a `Pmf` that represents the distribution of the product.

You might find the following function helpful:

In [171]:
def make_die(sides):
    """Pmf that represents a die with the given number of sides.
    
    sides: int
    
    returns: Pmf
    """
    outcomes = np.arange(1, sides+1)
    die = Pmf(1/sides, outcomes)
    return die

In [172]:
# Solution goes here

In [173]:
# Solution goes here

In [187]:
# Solution goes here

In [188]:
# Solution goes here

In [189]:
# Solution goes here

**Exercise:** There are 538 members of the United States Congress.  
Suppose we audit their investment portfolios and find that 312 of them out-perform the market.
Let's assume that an honest member of Congress has only a 50% chance of out-performing the market, but a dishonest member who trades on inside information has a 90% chance.  How many members of Congress are honest?

In [207]:
n = 538

table = pd.DataFrame()
table[0] = make_binomial(n, 0.9)

for n_honest in range(1, n):
    n_dishonest = n - n_honest

    dist_honest = make_binomial(n_honest, 0.5)
    dist_dishonest = make_binomial(n_dishonest, 0.9)
    dist_total = Pmf.add_dist(dist_honest, dist_dishonest)    
    table[n_honest] = dist_total
    
table[n] = make_binomial(n, 0.5)
table.shape

In [208]:
data = 312
likelihood = table.loc[data]
len(likelihood)

In [209]:
hypos = np.arange(n+1)
prior = Pmf(1, hypos)
len(prior)

In [212]:
posterior = prior * likelihood
posterior.normalize()
posterior.mean()

In [213]:
posterior.plot(label='posterior')
decorate(xlabel='Number of honest members of Congress',
         ylabel='PMF')

In [214]:
posterior.credible_interval(0.9)