# Bayes Rule Exercises

## Problem 1:

Joan says she is itchy. There is a test for Allergy to Cats, but this test is not always
right:

- For people that really do have the allergy, the test says "Yes" 80% of the time
- For people that do not have the allergy, the test says "Yes" 10% of the time

If 1% of the population has the allergy, and Joan’s test says "Yes", what are the chances that Joan really has the allergy?


#### A way to think about the problem

Assuming 10,000 people:
- have allergy: 1% => 100 ppl
  - have positive test: 80% => 80 ppl
- do not have allergy: 99% => 9,900 ppl
  - have positive test: 10% => 990 ppl
- have positive test: 80 + 990 = 1,070 => 1,070/10,000 = 10.70%

Applying Bayes' rule, P(A|B) = P(B|A) * P(A) / P(B):
- A = has allergy
- B = has "Yes" test
- P(B|A) = 0.8
- P(A) = 0.01
- P(B) = 0.1070
- P(A|B) = ( 0.8 * 0.01 ) / 0.107 ≈ 7.48%

Counting people directly gives the same answer:
- how many tested positive: 1,070
- how many tested positive and also have the allergy: 80
- 80/1,070 ≈ 7.48%
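The same arithmetic can be wrapped in a small helper. This is a minimal sketch: the function name `posterior` and its parameter names are illustrative and not part of the original exercise.

```python
def posterior(prior, p_pos_given_true, p_pos_given_false):
    """Return P(A | B) via Bayes' rule for a binary hypothesis A and evidence B."""
    # Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    p_pos = p_pos_given_true * prior + p_pos_given_false * (1 - prior)
    return p_pos_given_true * prior / p_pos


p_allergy_given_yes = posterior(prior=0.01, p_pos_given_true=0.80, p_pos_given_false=0.10)
print(f"{p_allergy_given_yes:.4f}")  # 0.0748
```

Writing the denominator as a sum over both groups means P(B) does not have to be worked out by hand first.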





Creating synthetic data.

In [1]:
import pandas as pd
import numpy as np

Create a blank data frame.

In [2]:
df = pd.DataFrame()
df

Add a column with 10,000 default values for having an allergy.

In [3]:
df['has_allergy']  = [False] * 10_000
df


| | has_allergy |
|---|---|
| 0 | False |
| 1 | False |
| 2 | False |
| 3 | False |
| 4 | False |
| ... | ... |
| 9995 | False |
| 9996 | False |
| 9997 | False |
| 9998 | False |


Verify the values in the data frame.

In [4]:
df.value_counts()

| has_allergy | count |
|---|---|
| False | 10000 |


Sample 1% of the total data set and capture their index position.

In [5]:
df1 = df.sample( frac = 0.01, random_state = 11 ).index
len(df1), df1

(100,
 Index([3104, 6353, 8689, 5857, 6011, 9821, 5129,  415, 3538, 1703, 3527, 7899,
        3026,  746, 2817, 9847, 6984, 3327, 7953, 1769, 7753, 7805, 2291, 8222,
        7864, 2532, 7025, 3484,  100, 6612, 8564, 8838, 9935, 4387, 4101, 5796,
        5257, 5838, 9405, 6827, 1932, 5265,  719, 7044, 1778, 9117, 1957,  353,
        2734, 7763, 3743, 4793, 4119, 2327, 8042, 5854, 7359, 5553, 6949, 6828,
        1578, 4169, 1383, 3139, 8943, 4940, 4782, 3615, 9356, 9340,  167, 8765,
        4659, 4749, 9155, 6502, 8292, 6580, 9838, 1448, 1702, 2382, 4402,  403,
        3010, 7893,  541, 1198, 1315, 1059, 2147,   47, 8516, 7021, 3608, 8935,
        7756, 5220,  649, 5388],
       dtype='int64'))

Use the index positions to modify those records, changing the values to True.

In [6]:
df.loc[df1, "has_allergy"] = True
df.value_counts()

| has_allergy | count |
|---|---|
| False | 9900 |
| True | 100 |


Create new column for positive tests and fill with a default value of False.

In [7]:
df['test_pos'] = False
df.value_counts()


| has_allergy | test_pos | count |
|---|---|---|
| False | False | 9900 |
| True | False | 100 |


Find those rows where has_allergy is True, i.e. observations that have allergies.

One way is to use `np.where`.

In [8]:
np.where( df['has_allergy'] == True )


(array([  47,  100,  167,  353,  403,  415,  541,  649,  719,  746, 1059,
        1198, 1315, 1383, 1448, 1578, 1702, 1703, 1769, 1778, 1932, 1957,
        2147, 2291, 2327, 2382, 2532, 2734, 2817, 3010, 3026, 3104, 3139,
        3327, 3484, 3527, 3538, 3608, 3615, 3743, 4101, 4119, 4169, 4387,
        4402, 4659, 4749, 4782, 4793, 4940, 5129, 5220, 5257, 5265, 5388,
        5553, 5796, 5838, 5854, 5857, 6011, 6353, 6502, 6580, 6612, 6827,
        6828, 6949, 6984, 7021, 7025, 7044, 7359, 7753, 7756, 7763, 7805,
        7864, 7893, 7899, 7953, 8042, 8222, 8292, 8516, 8564, 8689, 8765,
        8838, 8935, 8943, 9117, 9155, 9340, 9356, 9405, 9821, 9838, 9847,
        9935]),)

Another way is to create a filter, apply it to the data frame, and record the row indices of the matching rows.

In [9]:
filter = ( df['has_allergy'] == True )
has_allergy_index = df[ filter ].index
has_allergy_index


Index([  47,  100,  167,  353,  403,  415,  541,  649,  719,  746, 1059, 1198,
       1315, 1383, 1448, 1578, 1702, 1703, 1769, 1778, 1932, 1957, 2147, 2291,
       2327, 2382, 2532, 2734, 2817, 3010, 3026, 3104, 3139, 3327, 3484, 3527,
       3538, 3608, 3615, 3743, 4101, 4119, 4169, 4387, 4402, 4659, 4749, 4782,
       4793, 4940, 5129, 5220, 5257, 5265, 5388, 5553, 5796, 5838, 5854, 5857,
       6011, 6353, 6502, 6580, 6612, 6827, 6828, 6949, 6984, 7021, 7025, 7044,
       7359, 7753, 7756, 7763, 7805, 7864, 7893, 7899, 7953, 8042, 8222, 8292,
       8516, 8564, 8689, 8765, 8838, 8935, 8943, 9117, 9155, 9340, 9356, 9405,
       9821, 9838, 9847, 9935],
      dtype='int64')
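The two approaches return the same numbers here only because the data frame uses the default 0..n-1 index. In general `np.where` returns integer positions, while `.index` returns row labels. A minimal sketch on a hypothetical toy frame (the name `toy` and its labels are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with non-default row labels
toy = pd.DataFrame({"has_allergy": [False, True, False, True]},
                   index=[10, 20, 30, 40])

positions = np.where(toy["has_allergy"])[0]   # integer positions of the True rows
labels = toy[toy["has_allergy"]].index        # row labels of the True rows

print(positions.tolist())  # [1, 3]
print(labels.tolist())     # [20, 40]

# Positions pair with .iloc, labels pair with .loc; both select the same rows
same_rows = toy.iloc[positions].equals(toy.loc[labels])
```

Since the notebook later passes the captured values to `df.loc`, the label-based `.index` form is the one that stays correct if the index is ever changed.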

Yet another way is to create a new data frame containing only the rows where has_allergy is True. Although this is a new data frame, notice that the original row index labels are preserved.

In [10]:
filter = ( df['has_allergy'] == True )
df_has_allergy = df[ filter ]
df_has_allergy


| | has_allergy | test_pos |
|---|---|---|
| 47 | True | False |
| 100 | True | False |
| 167 | True | False |
| 353 | True | False |
| 403 | True | False |
| ... | ... | ... |
| 9405 | True | False |
| 9821 | True | False |
| 9838 | True | False |
| 9847 | True | False |


In [11]:
df_has_allergy.value_counts()


| has_allergy | test_pos | count |
|---|---|---|
| True | False | 100 |


Of those who have the allergy, 80% have a positive test. So, let's sample 80% of those who have the allergy and capture their index positions.

In [12]:
has_allergy_index = df_has_allergy.sample( frac = 0.8, random_state = 42 ).index
has_allergy_index

Index([8292, 5265, 7025, 4659, 4402, 3743, 2147, 7953, 1059,   47, 1769, 3026,
       7753, 3327, 8943,  403, 7805, 7864, 1315, 3104, 5553, 8838, 2532, 4169,
       7021, 1578, 4101, 9821,  746, 7359, 1198, 4782, 8564, 2817, 9340,  415,
       6828, 6827, 3527, 1702, 4940, 3484,  649, 9405, 2734, 1778, 8042, 2382,
       6502, 1383, 2327,  353, 1703, 3615,  719, 7893,  541, 6612, 3538, 8935,
       5796, 9935, 5388, 4387, 5129, 6949, 4749, 6984, 6353, 9838, 7899, 4119,
       5854, 4793, 9847, 5838, 7763, 3139, 9356, 5857],
      dtype='int64')

In [13]:
df.loc[ has_allergy_index, 'test_pos'] = True
df.value_counts()

| has_allergy | test_pos | count |
|---|---|---|
| False | False | 9900 |
| True | True | 80 |
| True | False | 20 |


Repeat the process for those who do not have the allergy but do get a positive test.

In [14]:
filter = df['has_allergy'] == False
idx = df[ filter ].sample( frac = 0.1, random_state = 42 ).index
idx


Index([8517, 5736, 4816, 9311,  628, 3463, 9446, 6021, 8268, 9941,
       ...
       7276, 7821, 8465, 4092, 5432, 8727, 9248,  601, 5619, 3468],
      dtype='int64', length=990)

In [15]:
df.loc[ idx, 'test_pos'] = True
df.value_counts()

| has_allergy | test_pos | count |
|---|---|---|
| False | False | 8910 |
| False | True | 990 |
| True | True | 80 |
| True | False | 20 |
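The steps above assign exactly 1%, 80% and 10% of the rows, so the counts match the hand calculation. A randomised variant is sketched below, in which each person's allergy status and test result are drawn independently. The seed and the population size `n` are arbitrary choices, not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily
n = 1_000_000                    # hypothetical population size

# Each person has the allergy with probability 1%
has_allergy = rng.random(n) < 0.01

# Probability of a "Yes" depends on true status: 80% if allergic, 10% otherwise
p_yes = np.where(has_allergy, 0.80, 0.10)
test_pos = rng.random(n) < p_yes

# Fraction of positive testers who really have the allergy
estimate = has_allergy[test_pos].mean()
print(f"{estimate:.3f}")
```

With a population this large the estimate should land close to the exact value of about 0.075, though it will vary slightly with the seed.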


Calculate the conditional probabilities.

In [18]:
categorical_cols = df

# Define the columns
A = ['has_allergy']
B = ['test_pos']  # B is a list so it could hold several columns; here it holds one

# Count joint occurrences of (A, B) and occurrences of B alone
counts_A_B = categorical_cols[A + B].value_counts(dropna=False).reset_index()
counts_B = categorical_cols[B].value_counts(dropna=False).reset_index()

# Merge on the column(s) in B so each row carries both counts
merged_df = pd.merge(
    counts_A_B,
    counts_B,
    on=B,
    suffixes=('_A_B', '_B')
)

# Conditional probability P(A | B) in percent: count(A and B) / count(B).
# The column is named 'marg_prob', but these are conditional, not marginal, probabilities.
merged_df['marg_prob'] = merged_df['count_A_B'] / merged_df['count_B'] * 100

# Display the resulting dataframe
merged_df


| | has_allergy | test_pos | count_A_B | count_B | marg_prob |
|---|---|---|---|---|---|
| 0 | False | False | 8910 | 8930 | 99.776036 |
| 1 | False | True | 990 | 1070 | 92.523364 |
| 2 | True | True | 80 | 1070 | 7.476636 |
| 3 | True | False | 20 | 8930 | 0.223964 |
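As a cross-check, the same conditional probabilities can be obtained with `pd.crosstab`. The sketch below rebuilds the four groups from the counts above instead of reusing the notebook's `df`, so it runs on its own. The names `demo` and `cond` are illustrative.

```python
import pandas as pd

# Rebuild the four groups from the counts shown above
counts = {(False, False): 8910, (False, True): 990,
          (True, True): 80, (True, False): 20}
rows = [
    {"has_allergy": a, "test_pos": b}
    for (a, b), n in counts.items()
    for _ in range(n)
]
demo = pd.DataFrame(rows)

# normalize="columns" divides each cell by its column total,
# so each column holds P(has_allergy | test_pos)
cond = pd.crosstab(demo["has_allergy"], demo["test_pos"], normalize="columns")

# Labels sort as False, True, so position [1, 1] is has_allergy=True, test_pos=True
p_allergy_given_pos = cond.iloc[1, 1]
print(round(p_allergy_given_pos * 100, 2))  # 7.48
```

Each column of `cond` sums to 1, which is a quick sanity check that the table is conditioned on the test result and not on the allergy status.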


## Problem 2:

Beth is planning a picnic today, but the morning is cloudy.

- 50% of all rainy days start off cloudy
- Cloudy mornings are common (40% of all days start cloudy)
- This is usually a dry month (10% of days tend to be rainy)

What is the chance of rain during the day when it starts off cloudy?
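One way to set this up, reading the three bullets as P(Cloud|Rain), P(Cloud) and P(Rain) respectively. The variable names are illustrative.

```python
p_cloud_given_rain = 0.50  # 50% of rainy days start off cloudy
p_cloud = 0.40             # 40% of all days start cloudy
p_rain = 0.10              # 10% of days are rainy

# Bayes' rule: P(Rain | Cloud) = P(Cloud | Rain) * P(Rain) / P(Cloud)
p_rain_given_cloud = p_cloud_given_rain * p_rain / p_cloud
print(f"{p_rain_given_cloud:.3f}")  # 0.125
```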

## Problem 3:

An art competition has entries from three painters: Pam, Paul and Peter.

- Pam put in 15 paintings, 4% of her works have won First Prize.
- Paul put in 5 paintings, 5% of his works have won First Prize.
- Peter put in 10 paintings, 3% of his works have won First Prize.

What is the probability that Pam will win First Prize?
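A possible setup is sketched below under one reading of the problem, which is an assumption here: the chance that a randomly chosen entry belongs to a painter is proportional to the number of paintings submitted, the percentages are each painter's chance of winning with a given entry, and the quantity asked for is the probability that the winning painting is Pam's. The dictionary names are illustrative.

```python
entries = {"Pam": 15, "Paul": 5, "Peter": 10}
win_rate = {"Pam": 0.04, "Paul": 0.05, "Peter": 0.03}

total = sum(entries.values())

# Unnormalised weight per painter: P(painter) * P(First Prize | painter)
weights = {name: entries[name] / total * win_rate[name] for name in entries}

# Normalise over all three painters to get P(Pam | First Prize)
p_pam = weights["Pam"] / sum(weights.values())
print(f"{p_pam:.3f}")  # 0.522
```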

## Problem 4: (Challenge)

A man speaks the truth 2 out of 3 times. He throws one die and says that the number
obtained is a three. Find the probability that the number obtained is actually a three.
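A sketch of one common way to model this. It assumes a fair die and assumes that whenever the man lies about a roll that is not a three, he reports a three. Neither assumption is stated in the problem. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

p_three = Fraction(1, 6)   # fair die (assumption)
p_truth = Fraction(2, 3)   # he tells the truth 2 out of 3 times

# He says "three" either truthfully (it is a three) or as a lie (it is not)
says_three_and_is_three = p_three * p_truth
says_three_and_not_three = (1 - p_three) * (1 - p_truth)

p_actual_three = says_three_and_is_three / (
    says_three_and_is_three + says_three_and_not_three
)
print(p_actual_three)  # 2/7
```

If a lie were instead spread evenly over the five other faces, the second term would shrink and the answer would change, so the modelling assumption matters.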