# Bite Size Bayes

This notebook presents example code and exercise solutions for Think Bayes.

Copyright 2020 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT

In [1]:
import pandas as pd
import numpy as np

The dataset includes variables I selected from the General Social Survey, available from this project on the GSS site: https://gssdataexplorer.norc.org/projects/54786

I also store the data in the GitHub repository for this book; the following cell downloads it, if necessary.

In [2]:
# Load the data file

import os

if not os.path.exists('gss_bayes.tar.gz'):
    !wget https://github.com/AllenDowney/BiteSizeBayes/raw/master/gss_bayes.tar.gz
    !tar -xzf gss_bayes.tar.gz

`utils.py` provides `read_stata`, which reads the data from the Stata format.

In [3]:
from utils import read_stata

gss = read_stata('GSS.dct', 'GSS.dat')
gss.rename(columns={'id_': 'caseid'}, inplace=True)
# gss.index = gss['caseid']
gss.head()

Unnamed: 0,year,relig,srcbelt,region,adults,wtssall,ballot,cohort,feminist,polviews,partyid,race,sex,educ,age,indus10,occ10,caseid,realinc
0,1972,3,3,3,1,0.4446,0,1949,0,0,2,1,2,16,23,5170,520,1,18951.0
1,1972,2,3,3,2,0.8893,0,1902,0,0,1,1,1,10,70,6470,7700,2,24366.0
2,1972,1,3,3,2,0.8893,0,1924,0,0,3,1,2,12,48,7070,4920,3,24366.0
3,1972,5,3,3,2,0.8893,0,1945,0,0,1,1,2,17,27,5170,800,4,30458.0
4,1972,1,3,3,2,0.8893,0,1911,0,0,0,1,2,12,61,6680,5020,5,50763.0


In [4]:
def replace_invalid(series, bad_vals, replacement=np.nan):
    """Replace invalid values with NaN

    Modifies series in place.

    series: Pandas Series
    bad_vals: list of values to replace
    replacement: value to substitute for each bad value
    """
    series.replace(bad_vals, replacement, inplace=True)
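
Depending on the pandas version, an in-place `replace` on a column selected with brackets may act on a temporary copy instead of the DataFrame itself. The following is a minimal sketch of an alternative that assigns the result back to the column. The function name `replace_invalid_column` and the toy DataFrame are hypothetical and used only for illustration; they are not part of `utils.py` or the GSS data.

```python
import numpy as np
import pandas as pd

def replace_invalid_column(df, column, bad_vals, replacement=np.nan):
    """Return a copy of df with bad_vals in one column replaced.

    Hypothetical helper: it assigns the result back instead of
    relying on inplace=True.
    """
    df = df.copy()
    df[column] = df[column].replace(bad_vals, replacement)
    return df

# Toy data standing in for a GSS column (not real responses)
toy = pd.DataFrame({'polviews': [0, 4, 8, 2, 9]})
cleaned = replace_invalid_column(toy, 'polviews', [0, 8, 9])

# Count how many values were marked as missing
n_missing = int(cleaned['polviews'].isna().sum())
print(n_missing)
```

Because the helper works on a copy, the original DataFrame is left unchanged, which makes it easy to compare the data before and after cleaning.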

The following cell replaces invalid responses for the variables we'll use.

In [5]:
replace_invalid(gss['feminist'], [0, 8, 9])
replace_invalid(gss['polviews'], [0, 8, 9])
replace_invalid(gss['partyid'], [8, 9])
replace_invalid(gss['indus10'], [0, 9997, 9999])
replace_invalid(gss['age'], [0, 98, 99])

In [6]:
def values(series):
    """Make a series of values and the number of times they appear.
    
    series: Pandas Series
    
    returns: Pandas Series
    """
    return series.value_counts(dropna=False).sort_index()

### feminist

https://gssdataexplorer.norc.org/variables/1698/vshow

This question was only asked during one year, so we're limited to a small number of responses.

In [7]:
values(gss['feminist'])

1.0      298
2.0     1083
NaN    61085
Name: feminist, dtype: int64

### polviews

https://gssdataexplorer.norc.org/variables/178/vshow


In [8]:
values(gss['polviews'])

1.0     1560
2.0     6236
3.0     6754
4.0    20515
5.0     8407
6.0     7876
7.0     1733
NaN     9385
Name: polviews, dtype: int64

### partyid

https://gssdataexplorer.norc.org/variables/141/vshow

In [9]:
values(gss['partyid'])

0.0     9999
1.0    12942
2.0     7485
3.0     9474
4.0     5462
5.0     9661
6.0     6063
7.0      995
NaN      385
Name: partyid, dtype: int64

### race

https://gssdataexplorer.norc.org/variables/82/vshow

In [10]:
values(gss['race'])

1    50340
2     8802
3     3324
Name: race, dtype: int64

### sex

https://gssdataexplorer.norc.org/variables/81/vshow

In [11]:
values(gss['sex'])

1    27562
2    34904
Name: sex, dtype: int64

### age



In [12]:
values(gss['age'])

18.0     219
19.0     835
20.0     870
21.0     987
22.0    1042
        ... 
86.0     172
87.0     143
88.0     113
89.0     335
NaN      221
Name: age, Length: 73, dtype: int64

### indus10

https://gssdataexplorer.norc.org/variables/17/vshow

In [13]:
values(gss['indus10'])

170.0      458
180.0      444
190.0       37
270.0       69
280.0       36
          ... 
9770.0      13
9780.0       8
9790.0      53
9870.0      22
NaN       4704
Name: indus10, Length: 271, dtype: int64

## Select subset

Here's the subset of the data with valid responses for the variables we'll use.

In [14]:
varnames = ['year', 'age', 'sex', 'polviews', 'partyid', 'indus10']

valid = gss.dropna(subset=varnames)
valid.shape

(49290, 19)

In [15]:
subset = valid[varnames]
subset.head()

Unnamed: 0,year,age,sex,polviews,partyid,indus10
3117,1974,21.0,1,4.0,2.0,4970.0
3118,1974,41.0,1,5.0,0.0,9160.0
3121,1974,58.0,2,6.0,1.0,2670.0
3122,1974,30.0,1,5.0,4.0,6870.0
3123,1974,48.0,1,5.0,4.0,7860.0


## Save the data

In [16]:
# if the file already exists, remove it

import os

if os.path.isfile('gss_bayes.hdf5'):
    !rm gss_bayes.hdf5

In [17]:
subset.to_hdf('gss_bayes.hdf5', 'gss', complevel=3)

In [18]:
!ls -l gss_bayes.hdf5

-rw-rw-r-- 1 downey downey 322056 Jan 20 11:41 gss_bayes.hdf5


In [19]:
# Load the data file

import os

datafile = 'gss_bayes.hdf5'
if not os.path.exists(datafile):
    !wget https://github.com/AllenDowney/PoliticalAlignmentCaseStudy/raw/master/gss_bayes.hdf5

## The conjunction fallacy

As part of a psychological experiment, Tversky and Kahneman posed the following question:

> Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.  Which is more probable?

* Linda is a bank teller.
* Linda is a bank teller and is active in the feminist movement.

Many people choose the second answer, presumably because it seems more consistent with the description.  It seems unlikely that Linda would be "just" a bank teller; if she is a bank teller, it seems likely that she would also be a feminist.

But the second answer cannot be correct.  Suppose we find 1000 people who fit Linda's description and 10 of them work as bank tellers.  How many of them are also feminists?  At most, all 10 of them are; in that case, the two options are equally likely.  Or some of them are; in that case the second option is less likely.  But there can't be more than 10 out of 10, so the second option cannot be more likely.

The error people make if they choose the second option is called the [conjunction fallacy](https://en.wikipedia.org/wiki/Conjunction_fallacy).   It's called a [fallacy](https://en.wikipedia.org/wiki/Fallacy) because it's a logical error and "conjunction" because "bank teller AND feminist" is a [logical conjunction](https://en.wikipedia.org/wiki/Logical_conjunction).

If this example makes you uncomfortable, you are in good company.  The biologist [Stephen Jay Gould wrote](https://sci-hub.tw/https://doi.org/10.1080/09332480.1989.10554932):

> I am particularly fond of this example because I know that the [second] statement is least probable, yet a little [homunculus](https://en.wikipedia.org/wiki/Homunculus_argument) in my head continues to jump up and down, shouting at me, "but she can't just be a bank teller; read the description."

In this notebook I'll use this example to demonstrate probability, conditional probability, and Bayes's theorem.

## Probability

The definition of [probability is more controversial than you might expect](https://en.wikipedia.org/wiki/Probability_interpretations).  To avoid getting bogged down before we get started, I will start with a simple definition: a **probability** is a **fraction** of a dataset.

For example, if we survey 1000 people, and 20 of them are bank tellers, the fraction that work as bank tellers is 0.02 or 2%.

If we choose a person from this population at random, the probability that they are a bank teller is 2%.  (By "at random" I mean that every person in the dataset has the same chance of being chosen, and by "they" I mean the [singular, gender-neutral pronoun](https://en.wikipedia.org/wiki/Singular_they), which is a correct and useful feature of English.)

With this definition of probability we can always compute probabilities by counting.  
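
As a minimal sketch of this counting definition, the following builds a hypothetical survey of 1000 people, 20 of whom are bank tellers (the numbers from the example above, not real data), and computes the fraction. The name `is_teller` is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical survey: 20 bank tellers among 1000 people
is_teller = pd.Series([True] * 20 + [False] * 980)

# Summing a Boolean Series counts the True values
count = is_teller.sum()

# The probability is the fraction of the dataset
fraction = count / len(is_teller)
print(count, fraction)
```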

To demonstrate, I'll use a dataset from the [General Social Survey](http://gss.norc.org/) or GSS.

In [20]:
gss = pd.read_hdf('gss_bayes.hdf5', 'gss')

The result is a Pandas DataFrame with one row for each person surveyed and one column for each variable I selected.

Here are the number of rows and columns:

In [21]:
gss.shape

(49290, 6)

And here are the first few rows:

In [22]:
gss.head()

Unnamed: 0,year,age,sex,polviews,partyid,indus10
3117,1974,21.0,1,4.0,2.0,4970.0
3118,1974,41.0,1,5.0,0.0,9160.0
3121,1974,58.0,2,6.0,1.0,2670.0
3122,1974,30.0,1,5.0,4.0,6870.0
3123,1974,48.0,1,5.0,4.0,7860.0


The columns are:

* `year`: Year when the respondent was surveyed.

* `age`: Respondent's age when surveyed.

* `sex`: Male or female.

* `polviews`: Political views on a range from liberal to conservative.

* `partyid`: Political party affiliation, Democrat, Independent, or Republican.

* `indus10`: [Code](https://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart=2007) for the industry the respondent works in.

We'll look at these variables in more detail, starting with `indus10`.

## Banking

The code for "Banking and related activities" is 6870, so we can select bankers like this:

In [23]:
banker = (gss['indus10'] == 6870)

The result is a Boolean series, which is a Pandas Series that contains the values `True` and `False`.  Here are the first few entries:

In [24]:
banker.head()

3117    False
3118    False
3121    False
3122     True
3123    False
Name: indus10, dtype: bool

We can use `values` to see how many times each value appears.

In [25]:
values(banker)

False    48562
True       728
Name: indus10, dtype: int64

In this dataset, there are 728 bankers.

If we call the `sum` method on this Series, it treats `True` as 1 and `False` as 0, so the total is the number of bankers.

In [26]:
banker.sum()

728

To compute the fraction of bankers, we can divide by the number of people in the dataset:

In [27]:
banker.sum() / banker.size

0.014769730168391155

But we can also use the `mean` function, which computes the fraction of `True` values in the Series:

In [28]:
banker.mean()

0.014769730168391155

About 1.5% of the respondents work in banking.

That means if we choose a random person from the dataset, the probability they are a banker is about 1.5%.

**Exercise**: The values of the column `sex` are encoded like this:

```
1	Male
2	Female
```

The following cell creates a series that is `True` for female respondents and `False` otherwise.

In [39]:
female = (gss['sex'] == 2)

* Use `values` to display the number of `True` and `False` values in `female`.

* Use `sum` to count the number of female respondents.

* Use `mean` to compute the fraction of female respondents.

In [40]:
# Solution

values(female)

False    22779
True     26511
Name: sex, dtype: int64

In [41]:
# Solution

female.sum()

26511

In [42]:
# Solution

female.mean()

0.5378575776019476

The fraction of women in this dataset is higher than in the adult U.S. population because [the GSS does not include people living in institutions](https://gss.norc.org/faq), including prisons and military housing, and those populations are more likely to be male.

**Exercise:** The designers of the General Social Survey chose to represent sex as a binary variable.  What alternatives might they have considered?  What are the advantages and disadvantages of their choice?

For more on this topic, you might be interested in this article: Westbrook and Saperstein, [New categories are not enough: rethinking the measurement of sex and gender in social surveys](https://sci-hub.tw/10.1177/0891243215584758)

## Political views

The values of `polviews` are on a seven-point scale:

```
1	Extremely liberal
2	Liberal
3	Slightly liberal
4	Moderate
5	Slightly conservative
6	Conservative
7	Extremely conservative
```

Here is the number of people who gave each response:

In [30]:
values(gss['polviews'])

1.0     1442
2.0     5808
3.0     6243
4.0    18943
5.0     7940
6.0     7319
7.0     1595
Name: polviews, dtype: int64

I'll define `liberal` to be `True` for anyone whose response is "Extremely liberal", "Liberal", or "Slightly liberal".

In [43]:
liberal = (gss['polviews'] < 4)

Here are the numbers of `True` and `False` values:

In [44]:
values(liberal)

False    35797
True     13493
Name: polviews, dtype: int64

And here is the fraction of respondents who are "liberal":

In [45]:
liberal.mean()

0.27374721038750255

If we choose a random person in this dataset, the probability they are liberal is about 27%.

To make this interpretation of the data explicit, I'll define a function that takes a Boolean series and returns a probability:

In [52]:
def prob(A):
    """Computes the probability of a proposition, A.
    
    A: Boolean series
    
    returns: probability
    """
    assert isinstance(A, pd.Series)
    assert A.dtype == 'bool'
    
    return A.mean()

The parameter `A` is a Boolean series that represents a logical proposition, that is, a claim that is either true or false.

The `assert` statements check whether `A` is a Series with data type `bool`.  If not, they raise an `AssertionError`.
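
To see what the checks in `prob` do, here is a small sketch that passes a Series of integers instead of Booleans. The Series `not_boolean` is a made-up example, and `prob` is repeated so the block runs on its own.

```python
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean Series (same as above)."""
    assert isinstance(A, pd.Series)
    assert A.dtype == 'bool'
    return A.mean()

# Integer dtype, so the second assert should fail
not_boolean = pd.Series([1, 0, 1])

try:
    prob(not_boolean)
    raised = False
except AssertionError:
    raised = True

# Converting to bool first satisfies both checks
ok = prob(not_boolean.astype(bool))
print(raised, ok)
```

The point of the checks is to fail early: without them, `mean` would happily average any numeric Series, and the result would not be a probability.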

Using this function to compute probabilities makes the code more readable.  Here are the probabilities for the propositions we have computed so far.

In [54]:
prob(banker)

0.014769730168391155

In [55]:
prob(female)

0.5378575776019476

In [56]:
prob(liberal)

0.27374721038750255

**Exercise**: The values of `partyid` are encoded like this:

```
0	Strong democrat
1	Not str democrat
2	Ind,near dem
3	Independent
4	Ind,near rep
5	Not str republican
6	Strong republican
7	Other party
```

I'll define `democrat` to include respondents who chose "Strong democrat" or "Not str democrat":

In [58]:
democrat = (gss['partyid'] <= 1)

* Use `mean` to compute the fraction of Democrats in this dataset.

* Use `prob` to compute the same fraction, which we will think of as a probability.

In [59]:
# Solution

democrat.mean()

0.3662609048488537

In [60]:
# Solution

prob(democrat)

0.3662609048488537

## Conjunction

"Conjunction" is another name for the logical `and` operation.  If you have two propositions, `A` and `B`, the conjunction `A and B` is `True` if both `A` and `B` are `True`, and `False` otherwise.

I'll demonstrate using Boolean series:

In [68]:
A = pd.Series((True, True, False, False))
A

0     True
1     True
2    False
3    False
dtype: bool

In [69]:
B = pd.Series((True, False, True, False))
B

0     True
1    False
2     True
3    False
dtype: bool

With Boolean series, `&` is the element-wise logical `and` operator, so we can compute the conjunction of `A` and `B` like this:

In [70]:
A & B

0     True
1    False
2    False
3    False
dtype: bool

To show this operation more clearly, I'll put the operands and the result in a DataFrame:

In [72]:
table = pd.DataFrame()
table['A'] = A
table['B'] = B
table['A & B'] = A & B
table

Unnamed: 0,A,B,A & B
0,True,True,True
1,True,False,False
2,False,True,False
3,False,False,False


In a previous section, we computed the probability that a random respondent is a banker:

In [73]:
prob(banker)

0.014769730168391155

Now we can compute probabilities for propositions that involve conjunction.

For example, here is the probability that a random respondent is a banker and a Democrat:

In [79]:
prob(banker & democrat)

0.004686548995739501

As we should expect, `prob(banker & democrat)` is less than `prob(banker)`, because not all bankers are Democrats.
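
The same relationship holds for any two propositions: the conjunction can never be more probable than either part. The following sketch checks this on synthetic data. The Series `A` and `B` are randomly generated stand-ins, not GSS variables, and the probabilities 0.3 and 0.6 are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Synthetic Boolean series (assumed for illustration only)
rng = np.random.default_rng(0)
n = 1000
A = pd.Series(rng.random(n) < 0.3)
B = pd.Series(rng.random(n) < 0.6)

p_A = A.mean()
p_B = B.mean()
p_AB = (A & B).mean()

# Every row where (A & B) is True is also a row where A is True
# and a row where B is True, so the count cannot be larger.
print(p_AB <= min(p_A, p_B))
```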

**Exercise:** Use `prob` and the `&` operator to compute the following probabilities.

* What is the probability that a random respondent is a banker and liberal?

* What is the probability that a random respondent is female, a banker, and liberal?

* What is the probability that a random respondent is female, a banker, and a liberal Democrat?

Notice that as we add more conjunctions, the probabilities get smaller.

In [83]:
# Solution

prob(banker & liberal)

0.003306958815175492

In [84]:
# Solution

prob(female & banker & liberal)

0.002556299452221546

In [None]:
# Solution

prob(female & banker & liberal & democrat)

**Exercise:** We expect conjunction to be commutative; that is, `A & B` should be the same as `B & A`.

To check, compute these two probabilities:

* What is the probability that a random respondent is a banker and liberal?
* What is the probability that a random respondent is liberal and a banker?

In [101]:
prob(banker & liberal)

0.003306958815175492

In [102]:
prob(liberal & banker)

0.003306958815175492

If they are not the same, something has gone very wrong!

## Conditional probability

Conditional probability is a probability that depends on a condition, but that might not be the most helpful definition.  Here are some examples:

* What is the probability that a respondent is a Democrat, given that they are liberal?

* What is the probability that a respondent is female, given that they are a banker?

* What is the probability that a respondent is liberal, given that they are female?

Let's start with the first one, which we can interpret like this: of all the respondents who are liberal, what fraction are Democrats?

We can compute this probability in two steps:

1. Select all respondents who are liberal.

2. Compute the fraction of the selected respondents who are Democrats.

To select liberal respondents, we can use the bracket operator, `[]`, like this:

In [87]:
selected = democrat[liberal]

The result is a Boolean series that contains a subset of the values in `democrat`.  Specifically, it contains only the values where `liberal` is `True`.
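
Here is a small sketch of how the bracket operator selects elements, using two short made-up Series in place of `democrat` and `liberal`. The names `democrat_toy`, `liberal_toy`, and `selected_toy` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real series (five made-up respondents)
democrat_toy = pd.Series([True, False, True, False, True])
liberal_toy = pd.Series([True, True, False, False, True])

# Keep the values of democrat_toy only where liberal_toy is True
selected_toy = democrat_toy[liberal_toy]

print(selected_toy)
print(len(selected_toy), liberal_toy.sum())
```

The selected Series keeps the original index labels (here 0, 1, and 4), and its length equals the number of `True` values in the Series used for selection.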

To see that, let's check the length of the result:

In [88]:
len(selected)

13493

If things have gone according to plan, that should be the same as the number of `True` values in `liberal`:

In [89]:
liberal.sum()

13493

Good.  

`selected` contains the values of `democrat` for liberal respondents, so the mean of `selected` is the fraction of liberals who are Democrats:

In [91]:
selected.mean()

0.5206403320240125

Slightly more than half of liberals are Democrats.  If the result is lower than you expected, keep in mind:

1. We used a somewhat strict definition of "Democrat", excluding Independents who "lean" Democratic.

2. The dataset includes respondents as far back as 1974; in the early part of this interval, there was less alignment between political views and party affiliation, compared to the present.

Let's try the second example, "What is the probability that a respondent is female, given that they are a banker?"

We can interpret that to mean, "Of all respondents who are bankers, what fraction are female?"

Again, we'll use the bracket operator to select only the bankers:

In [93]:
selected = female[banker]
len(selected)

728

As we've seen, there are 728 bankers in the dataset.

Now we can use `mean` to compute the conditional probability that a respondent is female, given that they are a banker:

In [94]:
selected.mean()

0.7706043956043956

About 77% of the bankers in this dataset are female.

We can get the same result using `prob`:

In [96]:
prob(selected)

0.7706043956043956

Remember that we defined `prob` to make the code more explicit.  We can do the same thing with conditional probability.  I'll define `conditional` to take two Boolean series, `A` and `B`, and compute the conditional probability of `A` given `B`:

In [97]:
def conditional(A, B):
    """Conditional probability of A given B.
    
    A: Boolean series
    B: Boolean series
    
    returns: probability
    """
    return prob(A[B])

Now we can use it to compute the probability that a liberal is a Democrat:

In [98]:
conditional(democrat, liberal)

0.5206403320240125

And the probability that a banker is female:

In [100]:
conditional(female, banker)

0.7706043956043956

**Exercise:** Compute the third conditional probability from above: "What is the probability that a respondent is liberal, given that they are female?"

Hint: The answer should be less than 30%.  If your answer is about 54%, you have made a mistake (see the next exercise).


In [105]:
# Solution

conditional(liberal, female)

0.27581004111500884

In [106]:
conditional(female, liberal)

0.5419106203216483

**Exercise:**  In a previous exercise, we saw that conjunction is commutative; that is, `prob(A & B)` is always equal to `prob(B & A)`.

But conditional probability is NOT commutative; that is, `conditional(A, B)` is not the same as `conditional(B, A)`.

That should be clear if we look at an example.  Previously, we computed the probability that a respondent is female, given that they are a banker.

In [110]:
conditional(female, banker)

0.7706043956043956

The result shows that the majority of bankers are female.  That is not the same as the probability that a respondent is a banker, given that they are female:

In [112]:
conditional(banker, female)

0.02116102749801969

Only about 2% of female respondents are bankers.
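
One way to see why the order matters: both conditional probabilities share the same numerator, the number of cases where both propositions are true, but they divide by different counts. The sketch below shows this with ten made-up respondents. The Series `A` and `B` are toy data, and `conditional` is repeated in a shortened form so the block runs on its own.

```python
import pandas as pd

def conditional(A, B):
    """Fraction of True values in A among the rows where B is True."""
    return A[B].mean()

# Ten made-up respondents
A = pd.Series([True] * 3 + [False] * 7)
B = pd.Series([True, True, False] + [True] * 4 + [False] * 3)

n_both = (A & B).sum()          # shared numerator
a_given_b = conditional(A, B)   # n_both / B.sum()
b_given_a = conditional(B, A)   # n_both / A.sum()

print(n_both, a_given_b, b_given_a)
```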

**Exercise:** Use `conditional` to compute the following probabilities:

* What is the probability that a respondent is liberal, given that they are a Democrat?

* What is the probability that a respondent is a Democrat, given that they are liberal?

Think carefully about the order of the series you pass to `conditional`.

In [113]:
conditional(liberal, democrat)

0.3891320002215698

In [114]:
conditional(democrat, liberal)

0.5206403320240125

**Exercise:** Of all 

As you will see in the Jupyter notebook for this chapter, I downloaded a dataset with a population of 50,287 people who responded to the survey.  For each respondent I extract the following variables:

\begin{description}

\item[female]: \py{True} if the respondent is female, \py{False} otherwise.

\item[liberal]: \py{True} if the respondent self-identifies as liberal.

\item[democrat]: \py{True} if the respondent self-identifies as a Democrat, that is, a member of the Democratic Party in the United States.

\item[banker]: \py{True} if the respondent works in banking.

\end{description}

Each of these variables is a Series object, which is defined by Pandas.  If you are not familiar with Pandas, I will explain what you need to know as we go along.

To compute the number of respondents who are bankers, we could use a for loop:

\begin{code}
total = 0
for x in banker:
    if x:
        total += 1
\end{code}

But it is easier to use methods provided by Series.  For example, \py{sum} computes the total of the elements in the Series, treating \py{True} as 1 and \py{False} as 0.  So we can replace the for loop with:

\begin{code}
total = banker.sum()
\end{code}

To get a probability, we could divide the total by the length of the Series:

\begin{code}
prob = total / len(banker)
\end{code}

Or we could use the Series method \py{mean}, which computes the mean of the elements, again treating \py{True} as 1 and \py{False} as 0:

\begin{code}
prob = banker.mean()
\end{code}

To make the code more readable, I'll define a function that computes the fraction of \py{True} values in a Series:

\begin{code}
def prob(A):
    return A.mean()
\end{code}

Then we can compute a probability for each variable:

\begin{code}
prob(female)
0.5385487302881461

prob(liberal)
0.14741384453238413

prob(democrat)
0.36639688189790603

prob(banker)
0.014536560144769025
\end{code}

In this dataset, about 54\% of respondents are female, 15\% say they are liberal, 37\% say they are Democrats, and 1.5\% work in banking.

The mathematical notation for probability is $\p{A}$, where $A$ represents a fact which might be true or false, or a prediction that might come true or not.  In the example, $\p{banker}$ is the probability that a random respondent is a banker.


\section{Conditional probability}

I'll use this data to solve a simplified version of the Linda problem:

\begin{quote}
Linda is female and liberal:

\begin{itemize}
\item What is the probability that she is a banker?
\item What is the probability that she is a banker and a Democrat?
\end{itemize}

\end{quote}

The answers to these questions are conditional probabilities.
A {\bf conditional probability} is a probability based on a specified subset of a population.  For example, given that Linda is female, what is the probability she is a banker?

\index{conditional probability}
\index{probability!conditional}

One way to answer this question is to select the subset of the population that is female and compute the fraction that are bankers.

To select a subset, we can use one Series as an index into another:

\begin{code}
prob(banker[female])
0.020788715752160108
\end{code}

The Series \py{banker[female]} contains the elements of \py{banker} for female respondents only.  The result indicates that about 2.1\% of female respondents are bankers.

The mathematical notation for conditional probability is $\p{A|B}$, which is the probability of $A$ given that $B$ is true.  In this example, $A$ is the event that Linda is a banker and $B$ is the condition that she is female, so we could write $\p{banker|female}$.

I define the following function to compute conditional probabilities.

\begin{code}
def conditional(A, B):
    return prob(A[B])
\end{code}

For this example, we would call it like this:

\begin{code}
conditional(banker, female)
0.020788715752160108
\end{code}



\section{Conjoint probability}
\label{conjoint}

{\bf Conjoint probability} is a fancy way to say the probability that
two things are true.  For example, we might want to know the probability that a random person is a female banker.

\index{conjoint probability}
\index{probability!conjoint}

We can compute this probability using the \py{&} operator, which computes the logical AND of the elements in a series; for example, \py{female&banker} is \py{True} only where \py{female} and \py{banker} are \py{True}:

\begin{code}
prob(female & banker)
0.011195736472647006
\end{code}

About 1.1\% of the respondents are female bankers.

The mathematical notation for conjoint probability is $\p{A \AND B}$, so we could write this example as $\p{female \AND banker}$.

As this example shows, the conjoint probability, $\p{A \AND B}$, is not generally the same as the conditional probability, $\p{A|B}$.  However, they are related by the following equation:

\[ \p{A | B} = \p{A \AND B} / \p{B} \]

For example:

\[ \p{banker | female} = \p{banker \AND female} / \p{female} \]

In other words, we can compute the conditional probability that Linda is a banker, given that she is female, like this:

\begin{code}
prob(banker & female) / prob(female)
0.0207887
\end{code}

And we can confirm that we get the same result if we select the subset of female respondents and compute the fraction of bankers:

\begin{code}
prob(banker[female])
0.0207887
\end{code}



\section{Bayes's theorem}

We can also write the relationship between conditional and conjoint probability the other way around:
%
\[ \p{A \AND B} = \p{B} \p{A | B} \]
%
In other words, the probability that $A$ and $B$ are true is the product of two probabilities: the probability of $B$ and the conditional probability of
$A$ given $B$.

The AND operator is commutative, so:
%
\[ \p{A \AND B} = \p{B \AND A} \]
%
That implies that we can compute a conjoint probability either way:
%
\[ \p{B} \p{A | B} = \p{A} \p{B | A}  \]
%
If you think about what that means, it is not surprising: you can check $B$ first, and then $A$ given $B$, or $A$ first, and then $B$ given $A$.  You get the same thing either way.

Now, if we divide through by $\p{B}$, we get Bayes's theorem:
%
\[ \p{A | B} = \frac{\p{A} ~ \p{B | A}}{\p{B}} \]
%
We can confirm that Bayes's theorem works with the example data.  If $A$ is $banker$ and $B$ is $female$, we can compute the conditional probability $\p{banker|female}$ directly:

\begin{code}
conditional(banker, female)
0.02078871575216
\end{code}

Or we can compute it using Bayes's theorem:

\begin{code}
prob(banker) * conditional(female, banker) / prob(female)
0.02078871575216
\end{code}

The results are the same.

In one sense, there is nothing special about Bayes's theorem.  It's just an equation that relates conditional probabilities.

And in this example, it is not particularly useful, because it is easier to compute the probability we want directly.  So let's see an example where it is more useful.

In [None]:
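female = (gss['sex'] == 2)
values(female)

In [None]:
# Stricter definition than earlier in this notebook:
# only "Extremely liberal" and "Liberal"
liberal = (gss['polviews'] <= 2)
values(liberal)

In [None]:
democrat = (gss['partyid'] <= 1)
values(democrat)

In [None]:
banker = (gss['indus10'] == 6870)
values(banker)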

In [None]:
total = 0
for x in banker:
    if x:
        total += 1
        
total

In [None]:
total / len(banker)

In [None]:
def prob(A):
    """Probability of A"""
    return A.mean()

In [None]:
def count(A):
    """Number of instances of A"""
    return A.sum()

In [None]:
prob(female)

In [None]:
prob(liberal)

In [None]:
prob(democrat)

In [None]:
prob(banker)

In [None]:
prob(democrat & liberal)

In [None]:
count(banker[female])

In [None]:
prob(banker[female])

In [None]:
prob(female & banker)

In [None]:
prob(banker & female) / prob(female)

In [None]:
def conditional(A, B):
    """Conditional probability of A given B"""
    return prob(A[B])

In [None]:
conditional(banker, female)

In [None]:
conditional(liberal, democrat)

In [None]:
conditional(democrat, liberal)

In [None]:
conditional(democrat, female)

In [None]:
def conjunction(A, B):
    """Probability of both A and B"""
    return prob(A) * conditional(B, A)

In [None]:
prob(liberal & democrat)

In [None]:
conjunction(liberal, democrat)

In [None]:
prob(liberal) * prob(democrat)

In [None]:
conjunction(democrat, liberal)
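
The cells above compare three quantities. As a sketch of what to expect, the following uses ten made-up respondents to show that `prob(A) * conditional(B, A)` reproduces `prob(A & B)`, while the plain product `prob(A) * prob(B)` generally does not, unless the two propositions happen to be independent. The Series `A` and `B` are toy data, and the helper functions are repeated so the block runs on its own.

```python
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean Series."""
    return A.mean()

def conditional(A, B):
    """Conditional probability of A given B."""
    return prob(A[B])

# Ten made-up respondents; A and B overlap more than chance would suggest
A = pd.Series([True, True, True, True, False, False, False, False, False, False])
B = pd.Series([True, True, True, False, True, False, False, False, False, False])

joint = prob(A & B)                   # direct count: 3 of 10
chained = prob(A) * conditional(B, A) # 0.4 * 0.75
product = prob(A) * prob(B)           # 0.4 * 0.4, ignores the dependence

print(joint, chained, product)
```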

In [None]:
prob(banker) * conditional(female, banker) / prob(female)

In [None]:
def bayes_theorem(A, B):
    """Conditional probability of A given B, using Bayes's theorem"""
    return prob(A) * conditional(B, A) / prob(B)

In [None]:
bayes_theorem(democrat, liberal)
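
As a check on `bayes_theorem`, the following sketch compares the direct conditional probability with the value computed through Bayes's theorem on synthetic data. The Series `A` and `B` are randomly generated and deliberately dependent; the probabilities 0.3, 0.7, and 0.2 are arbitrary assumptions, not estimates from the GSS.

```python
import numpy as np
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean Series."""
    return A.mean()

def conditional(A, B):
    """Conditional probability of A given B."""
    return prob(A[B])

# Synthetic data: B is more likely when A is True
rng = np.random.default_rng(1)
n = 1000
a = rng.random(n) < 0.3
b = np.where(a, rng.random(n) < 0.7, rng.random(n) < 0.2)
A = pd.Series(a)
B = pd.Series(b)

# Direct computation: fraction of A among rows where B is True
direct = conditional(A, B)

# Bayes's theorem: P(A|B) = P(A) P(B|A) / P(B)
via_bayes = prob(A) * conditional(B, A) / prob(B)

print(direct, via_bayes)
```

The two values agree up to floating-point rounding, because both reduce to the count of rows where `A` and `B` are true divided by the count of rows where `B` is true.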

In [None]:
conditional(banker, female)

In [None]:
conditional(banker, female & liberal)

In [None]:
conditional(banker & democrat, female & liberal)