<span style="color:#1871d6; font-size:24px; font-weight:700"> Bayes' Theorem </span>

The cigar example in the lab applies Bayes' theorem by substituting values into the formula. 
That calculation is complicated, and it is easy to confuse the probabilities involved or substitute them incorrectly. 
Fortunately, there is another approach that is much more intuitive:

Assume some convenient value for the total of all items involved, 
then construct a table of rows and columns with the individual cell frequencies based on the known probabilities.

For example, let's assume that the adult population in Boone County, Missouri is 100,000. 
Now we can use the given information to create a table.

*Number of males who smoke cigars:* 
51% of adults are males, so there are 51,000 males. 
If 9.5% of them smoke cigars, that makes 0.095 x 51,000 = 4845 male cigar smokers. 
Males who do not smoke cigars then number 51,000 - 4845 = 46,155.
These values fill the male row of the table below.


*Number of females who smoke cigars:* 49% of the adults are females, which makes 49,000. 
1.7% of them smoke cigars, so 0.017 x 49,000 = 833. 
The number of females who do not smoke cigars is 49,000 - 833 = 48,167. 
These values fill the female row of the table below.

In [1]:
cigar <- matrix(c(4845, 833, 46155, 48167), ncol = 2)
colnames(cigar) <- c('smoker', 'nonsmoker')
rownames(cigar) <- c('male', 'female')
cigar.table <- as.table(cigar)

addmargins(cigar.table)

       smoker nonsmoker    Sum
male     4845     46155  51000
female    833     48167  49000
Sum      5678     94322 100000


The above table involves simple arithmetic. 
Simply partition the assumed population into the different cell categories by finding suitable percentages.
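
The partitioning step can be sketched in code so that the cell counts are derived from the percentages instead of typed in by hand. This is only an illustration: `make_cigar_table` and `cigar_sketch` are hypothetical names that do not appear in the lab, and the rounding to whole people is an assumption.

```r
# Hypothetical helper (not part of the lab): build the 2x2 frequency table
# from an assumed population size and the percentages given in the example.
make_cigar_table <- function(pop, p_male = 0.51,
                             p_smoke_male = 0.095, p_smoke_female = 0.017) {
  males   <- round(pop * p_male)     # number of males
  females <- pop - males             # everyone else is female
  male_smokers   <- round(males * p_smoke_male)
  female_smokers <- round(females * p_smoke_female)
  counts <- matrix(c(male_smokers, female_smokers,
                     males - male_smokers, females - female_smokers),
                   ncol = 2,
                   dimnames = list(c("male", "female"),
                                   c("smoker", "nonsmoker")))
  as.table(counts)
}

cigar_sketch <- make_cigar_table(100000)
addmargins(cigar_sketch)
```

Any convenient population size can be passed as `pop`; only the rounding of the cell counts changes.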

Now we can easily address the key question: what is the probability that a subject is male, 
given that the subject smokes cigars? 
This is the same conditional probability described before. 

To find it, restrict the table to the column of cigar smokers, 
then find the proportion of males in that column.
Among the 5678 cigar smokers, there are 4845 males, so the probability we seek is 4845/5678 = 0.85329341. 
That is, $P(M | C)$ = 4845/5678 ≈ 0.853 (rounded).
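
As a cross-check, the same number should come out of Bayes' formula applied directly to the given percentages. The sketch below is one way to confirm that; the variable names are illustrative and not taken from the lab.

```r
# Probabilities given in the cigar example
p_male          <- 0.51    # P(M)
p_female        <- 0.49    # P(F)
p_cigar_given_m <- 0.095   # P(C | M)
p_cigar_given_f <- 0.017   # P(C | F)

# Law of total probability: P(C) = P(M) P(C|M) + P(F) P(C|F)
p_cigar <- p_male * p_cigar_given_m + p_female * p_cigar_given_f

# Bayes' theorem: P(M | C) = P(M) P(C|M) / P(C)
p_m_given_cigar <- p_male * p_cigar_given_m / p_cigar

# The same ratio read off the table of counts
from_table <- 4845 / 5678

round(c(formula = p_m_given_cigar, table = from_table), 3)
```

Both routes give about 0.853, because the assumed population of 100,000 cancels out of the ratio.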

**Activity 1:** 
Now, your turn: 
The actual population of Boone County, Missouri is 170,733 (as of 2013).
Create the above table with actual population values for the given percentages and find the actual $P(M | C)$.

In [2]:
# Add your code here
# --------------------

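pop <- 170733               # actual population of Boone County (2013)
m   <- round(0.51 * pop)    # males
f   <- round(0.49 * pop)    # females
ms  <- round(m * 0.095)     # male cigar smokers
mns <- m - ms               # male nonsmokers
fs  <- round(f * 0.017)     # female cigar smokers
fns <- f - fs               # female nonsmokers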
cigar2 <- matrix(c(ms, fs, mns, fns), ncol=2)
colnames(cigar2) <- c('smoker','nonsmoker')
rownames(cigar2) <- c('male','female')
cigar2.table <- as.table(cigar2)

addmargins(cigar2.table)

       smoker nonsmoker    Sum
male     8272     78802  87074
female   1422     82237  83659
Sum      9694    161039 170733


In [3]:
print(paste("P(M|C) = P(M and C) / P(C) = ", 8272/9694))

[1] "P(M|C) = P(M and C) / P(C) =  0.853311326593769"


a) Using the same table, suppose an individual is selected at random. What is the prior probability that the selected person is female?

b) You later learn that the randomly selected person was smoking a cigar. 
Use this additional information to find the posterior probability that the selected person is a female.

In [4]:
addmargins(prop.table(cigar2))

            smoker nonsmoker      Sum
male   0.048449919 0.4615511 0.510001
female 0.008328794 0.4816702 0.489999
Sum    0.056778713 0.9432213      1.0


In [5]:
# Add your code here
# --------------------
# a) prior probability of a person being female is simply 0.49. That is the percentage of the females in the population.
# If we don't know any extra information, we use that.
print(paste("P(F) = ", 83659/170733))


# b) posterior probability is computed after we learn some extra information; here it is the fact that the person is a smoker.
# We compute P(F|C); the probability that the person is female given that the person is a cigar smoker. 
print(paste("P(F|C) = P(F and C) / P(C) = ", 1422/9694))


[1] "P(F) =  0.489999004293253"
[1] "P(F|C) = P(F and C) / P(C) =  0.146688673406231"
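
The prior and posterior can also be read off without typing individual counts. A minimal sketch, assuming the counts from the table above are rebuilt into a matrix (here called `cigar2_sketch`, a name introduced only for this illustration): `prop.table` with `margin = 2` divides each cell by its column total, which is exactly the "restrict to the smoker column" step.

```r
# Counts copied from the Activity 1 table (population 170,733)
cigar2_sketch <- matrix(c(8272, 1422, 78802, 82237), ncol = 2,
                        dimnames = list(c("male", "female"),
                                        c("smoker", "nonsmoker")))

# Prior: row totals divided by the grand total, i.e. P(M) and P(F)
prior <- rowSums(cigar2_sketch) / sum(cigar2_sketch)

# Posterior: column-wise proportions; the "smoker" column holds
# P(M | C) and P(F | C)
posterior <- prop.table(cigar2_sketch, margin = 2)[, "smoker"]

round(rbind(prior, posterior), 4)
```

Learning that the person smokes cigars moves the probability of "female" from the prior of about 0.49 down to a posterior of about 0.147.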


Load the framingham data from the directory '/datasets/framingham'.

In [6]:
framingham_data <- read.csv("/dsa/data/all_datasets/framingham/framingham.csv")
head(framingham_data)

   male   age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP   BMI heartRate glucose TenYearCHD
  <int> <int>     <int>         <int>      <int>  <int>           <int>        <int>    <int>   <int> <dbl> <dbl> <dbl>     <int>   <int>      <int>
1     1    39         4             0          0      0               0            0        0     195 106.0    70 26.97        80      77          0
2     0    46         2             0          0      0               0            0        0     250 121.0    81 28.73        95      76          0
3     1    48         1             1         20      0               0            0        0     245 127.5    80 25.34        75      70          0
4     0    61         3             1         30      0               0            1        0     225 150.0    95 28.58        65     103          1
5     0    46         3             1         23      0               0            0        0     285 130.0    84  23.1        85      85          0
6     0    43         2             0          0      0               0            1        0     228 180.0   110  30.3        77      99          0


**Activity 2:** Create a two-way table from this data set with diabetes condition in the columns and gender in the rows. Use addmargins to add totals.


In [7]:
#dia <- with(framingham_data,table(diabetes,male)) # <-- this was wrong.
dia <- with(framingham_data,table(male,diabetes))
colnames(dia) <- c('nondiabetes','diabetes')
rownames(dia) <- c('female','male')
addmargins(dia)

       nondiabetes diabetes  Sum
female        2363       57 2420
male          1768       52 1820
Sum           4131      109 4240


**Activity 3:** What is the probability that an individual has diabetes, given that the individual is female? Let $d$ be the event that an individual has diabetes and $d'$ the event that the individual does not.
Similarly, let $f$ be the event that an individual is female and $f'$ the event that the individual is male. 
Find $P(d | f)$ using Bayes' formula.

            
$$P(d \mid f) = \frac{P(d)\,P(f \mid d)}{P(d)\,P(f \mid d) + P(d')\,P(f \mid d')}$$

**The denominator is simply $P(f)$**, partitioned by $d$ (remember the picture in the lab notebook). Writing it in this expanded form is more useful when we have multiple pieces of information that together cover the space of events (as in the painting competition example). 

Let's compute it using both the multiplication rule and Bayes' rule. 

In [8]:
# Add your code here
# --------------------
# Multiplication rule: 
print(paste("p(d|f) = p(d and f)/p(f) = ", 57/2420))

[1] "p(d|f) = p(d and f)/p(f) =  0.0235537190082645"


In [9]:
# Bayes rule: 
print(paste("p(d|f) = p(d and f)/p(f) = p(d) * p(f|d) / p(f) = ", (109/4240)*(57/109) / (2420/4240)))

[1] "p(d|f) = p(d and f)/p(f) = p(d) * p(f|d) / p(f) =  0.0235537190082645"


In [10]:
# easier with prop table 
addmargins(prop.table(dia))


       nondiabetes   diabetes       Sum
female   0.5573113  0.0134434 0.5707547
male     0.4169811 0.01226415 0.4292453
Sum      0.9742925 0.02570755       1.0


In [11]:
print(paste("p(d|f) = p(d) * p(f|d) / p(f) = ", 0.02570755 * (0.01344340/0.02570755) /0.5707547  ))

# You can also see that p(f) and [p(d) * p(f|d)] + [ p(d') * p(f|d')] are the same probabilities. 


[1] "p(d|f) = p(d) * p(f|d) / p(f) =  0.0235537263206067"
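
The claim that the expanded denominator equals p(f) can be checked numerically. The sketch below rebuilds the Activity 2 counts into a matrix (named `dia_sketch` here purely for illustration) and compares the two quantities; it is a verification aid under that assumption, not part of the required answer.

```r
# Counts copied from the Activity 2 table
dia_sketch <- matrix(c(2363, 1768, 57, 52), ncol = 2,
                     dimnames = list(c("female", "male"),
                                     c("nondiabetes", "diabetes")))
n <- sum(dia_sketch)
col_totals <- colSums(dia_sketch)

p_d       <- col_totals[["diabetes"]] / n       # P(d)
p_not_d   <- col_totals[["nondiabetes"]] / n    # P(d')
p_f_d     <- dia_sketch["female", "diabetes"] / col_totals[["diabetes"]]        # P(f | d)
p_f_not_d <- dia_sketch["female", "nondiabetes"] / col_totals[["nondiabetes"]]  # P(f | d')

# Denominator of Bayes' formula, built from the partition {d, d'}
denominator <- p_d * p_f_d + p_not_d * p_f_not_d

# P(f) read directly from the female row total
p_f <- sum(dia_sketch["female", ]) / n
c(denominator = denominator, p_f = p_f)

# Bayes' formula with the expanded denominator
p_d_given_f <- p_d * p_f_d / denominator
p_d_given_f
```

The two entries of the printed vector agree, and `p_d_given_f` matches the 57/2420 obtained from the multiplication rule.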


**Activity 4:** Dangerous fires are rare (1%), but smoke is fairly common (10%) due to barbecues, and 90% of dangerous fires make smoke. What is the probability of dangerous Fire when there is Smoke?

Let's set up the probabilities: P(Fire), P(Smoke), P(Smoke|Fire), P(Fire|Smoke).

Which one are we looking for? Write the formula and compute the probability of dangerous Fire given there is Smoke.

In [12]:
# Add your code here 
# --------------------

P_Fire = 0.01
P_Smoke = 0.1
P_Smoke_Fire = 0.9

P_Fire_Smoke = P_Smoke_Fire * P_Fire / P_Smoke

print(paste("P(Fire|Smoke) = ", P_Fire_Smoke))


[1] "P(Fire|Smoke) =  0.09"
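
The table approach from the cigar example works here too. As a sketch, assume a convenient total of 1000 days (an arbitrary choice made only so the counts come out whole; `n_days` and `fire_table` are illustrative names) and fill in the cells from the three given percentages.

```r
# Assumed round number of days, chosen only to make the counts whole
n_days <- 1000

fire_days      <- n_days * 0.01     # days with a dangerous fire
smoke_days     <- n_days * 0.10     # days with smoke
fire_and_smoke <- fire_days * 0.90  # fire days that also have smoke

fire_table <- matrix(c(fire_and_smoke,
                       smoke_days - fire_and_smoke,
                       fire_days - fire_and_smoke,
                       n_days - smoke_days - (fire_days - fire_and_smoke)),
                     ncol = 2,
                     dimnames = list(c("fire", "no fire"),
                                     c("smoke", "no smoke")))
addmargins(as.table(fire_table))

# Restrict to the smoke column, as in the cigar example
fire_table["fire", "smoke"] / sum(fire_table[, "smoke"])
```

Of the 100 smoky days, 9 involve a dangerous fire, which gives the same 0.09 as the formula.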


**Activity 5:** Imagine 100 people at a party. You tally how many wear pink or not and how many are men or not, and get these numbers:

In [13]:
party <- matrix(c(5,20,35,40), ncol = 2)
colnames(party) <- c('Pink', 'notPink')
rownames(party) <- c('Man', 'Woman')
party.table <- as.table(party)

party.table
addmargins(party.table)

      Pink notPink
Man      5      35
Woman   20      40

      Pink notPink Sum
Man      5      35  40
Woman   20      40  60
Sum     25      75 100


And then you calculate some probabilities:

In [14]:
P_Man = 40/100
P_Pink = 25/100
P_Pink_givenMan = 5/40

Then you lose your data and retain only the probabilities. You see a pink-wearing guest leaving the party. What is the probability that the guest was a man?

In [15]:
# Add your code here 
# --------------------

print(paste("P(Man|Pink) = P(Pink|Man)*P(Man)/P(Pink) = ", P_Pink_givenMan*P_Man/P_Pink))


[1] "P(Man|Pink) = P(Pink|Man)*P(Man)/P(Pink) =  0.2"
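
Because the party size (100) is known, the three retained probabilities are enough to rebuild the whole table, which gives a way to double-check the Bayes result. The sketch below is one possible reconstruction; `rebuilt` and the intermediate names are introduced here for illustration only.

```r
# The three retained probabilities from the activity
P_Man           <- 40 / 100
P_Pink          <- 25 / 100
P_Pink_givenMan <- 5 / 40

n <- 100  # party size stated in the activity

man_pink      <- P_Pink_givenMan * P_Man * n   # P(Pink and Man) * n
man_notpink   <- P_Man * n - man_pink
woman_pink    <- P_Pink * n - man_pink
woman_notpink <- n - man_pink - man_notpink - woman_pink

rebuilt <- matrix(c(man_pink, woman_pink, man_notpink, woman_notpink),
                  ncol = 2,
                  dimnames = list(c("Man", "Woman"), c("Pink", "notPink")))
rebuilt

# Proportion of men within the Pink column
rebuilt["Man", "Pink"] / sum(rebuilt[, "Pink"])
```

The rebuilt counts match the original tally, and the proportion of men in the Pink column is 5/25 = 0.2.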


**Activity 6:** You are planning a picnic today, but the morning is cloudy. 50% of all rainy days start off cloudy.
But cloudy mornings are common (about 40% of days start cloudy) and this is usually a dry month (only 3 of 30 days tend to be rainy, or 10%).

What is the chance of rain during the day?

In [16]:
P_rain = 0.1
P_cloud = 0.4
P_cloud_givenrain = 0.5

# today started cloudy

P_rain_givencloud = P_rain * P_cloud_givenrain / P_cloud

P_rain_givencloud

# Save your notebook!
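
For Activity 6, the denominator P(cloud) can also be written out over the partition {rain, dry}, as in the Activity 3 formula. The activity does not state P(cloud | dry), so the sketch below derives it from the law of total probability; `P_dry` and `P_cloud_givendry` are names introduced only for this illustration.

```r
P_rain            <- 0.1   # P(rain)
P_cloud           <- 0.4   # P(cloud)
P_cloud_givenrain <- 0.5   # P(cloud | rain)

# Solve the law of total probability for P(cloud | dry):
# P(cloud) = P(cloud | rain) P(rain) + P(cloud | dry) P(dry)
P_dry            <- 1 - P_rain
P_cloud_givendry <- (P_cloud - P_cloud_givenrain * P_rain) / P_dry

# Bayes' formula with the denominator written out over {rain, dry}
P_rain_givencloud <- P_rain * P_cloud_givenrain /
  (P_rain * P_cloud_givenrain + P_dry * P_cloud_givendry)

round(c(P_cloud_givendry = P_cloud_givendry,
        P_rain_givencloud = P_rain_givencloud), 4)
```

The expanded form gives the same 0.125 chance of rain as dividing by `P_cloud` directly.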