### Module 4 - Probability Exercises 

Most of the concepts discussed in this module were theoretical, with examples showing how to apply conditional probability to dataset columns. The exercises in this notebook are similar to those in the lab and practice notebooks.

Refer to your labs: [Bayes](../labs/Bayes.ipynb), [Conditional_Probability](../labs/Conditional_Probability.ipynb) and [Distributions](../labs/Distributions.ipynb).  

Practice notebooks: [Conditional_Probability](../practices/Conditional_Probability.ipynb) and [Bayes](../labs/Bayes.ipynb). 

Let's recall what independent and dependent events are through the activity below. Consider the experiment of rolling two dice. The sample space $S=\{(1,1), (1,2), (1,3), \ldots, (6,6)\}$ contains 36 possible ordered outcomes for the two dice.

<span style="color:#1871d6; font-weight:400">Activity 1.a: </span> What is the probability of getting a 2 on at least one of the two dice?

In [69]:
# Your code for activity 1.a goes here...


library(prob)
S <- rolldie(2, makespace = TRUE)

# Define the event A as every outcome that shows a 2 on at least one die.

# A = {(2,1),(2,2),(2,3),(2,4),(2,5),(2,6),(1,2),(3,2),(4,2),(5,2),(6,2)}
# P(2 on either die) = |A| / |S| = 11 / 36

A = subset(S, X1==2 | X2==2)
#A
t = 11/36
#t
paste("P(A) =",  round(Prob(A), 4))
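
The count behind Activity 1.a can also be checked by hand with the inclusion-exclusion rule. A short worked version, writing $X_1$ and $X_2$ for the two dice (to match the `X1` and `X2` columns used in the `subset` call):

$$P(X_1 = 2 \cup X_2 = 2) = P(X_1 = 2) + P(X_2 = 2) - P(X_1 = 2 \cap X_2 = 2) = \frac{6}{36} + \frac{6}{36} - \frac{1}{36} = \frac{11}{36} \approx 0.3056$$

The subtracted term removes the outcome $(2,2)$, which would otherwise be counted twice.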

<span style="color:#1871d6; font-weight:400">Activity 1.b: </span> What is the probability of getting a 2 on at least one of the two dice, given that the sum of the two outcomes is less than 6?

In [76]:
# Your code for activity 1.b goes here...

# Event A is every outcome that shows a 2 on at least one die (the same event as in Activity 1.a).
# A = {(2,1),(2,2),(2,3),(2,4),(2,5),(2,6),(1,2),(3,2),(4,2),(5,2),(6,2)}

# Define event B such that the sum of the outcomes is less than 6.
# B = {(1,1),(1,2),(1,3),(1,4),(2,1),(2,2),(2,3),(3,1),(3,2),(4,1)}
B = subset(S, X1 + X2 < 6)
#print(paste("P(B) =",  round(Prob(B, given = S), 4)))

# Event A is conditional on Event B 

paste("P(A|B) =", round(Prob(A, given = B), 4))
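
As a hand check of the conditional probability, and to tie it back to the question of independence: the outcomes that are in both $A$ and $B$ are $A \cap B = \{(1,2),(2,1),(2,2),(2,3),(3,2)\}$, which is 5 of the 10 outcomes in $B$.

$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{5/36}{10/36} = 0.5$$

Since $P(A|B) = 0.5$ differs from $P(A) = 11/36 \approx 0.3056$, knowing that the sum is less than 6 changes the probability of seeing a 2, so $A$ and $B$ are dependent events.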

<span style="color:#1871d6; font-weight:400">Activity 2: </span> Toss a coin twice. The sample space is given by $S = \{HH, HT, TH, TT\}$. Let $A$ = {at least one head occurs} and $B$ = {one head and one tail occur}. What are the probabilities $P(A|B)$ and $P(B|A)$?

````
        
     Double click here. Enter your answer for activity 2 here.

````
$P(A|B) = 1$ because, given that a head and a tail occurred, a head has certainly occurred.

$P(B|A) = 2/3$ because, of the three outcomes that contain a head, two (HT and TH) contain a head and a tail and one (HH) does not.
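
The same answers follow from the definition of conditional probability. This worked version assumes $A$ means "at least one head", so that $A = \{HH, HT, TH\}$, $B = \{HT, TH\}$ and $A \cap B = B$:

$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{2/4}{2/4} = 1, \qquad P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{2/4}{3/4} = \frac{2}{3}$$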

In [None]:
# Space <- tosscoin(2, makespace = TRUE)
# Space
# A <- subset(Space, isrep(Space, vals = "H", nrep = 1) | isrep(Space, vals = "H", nrep = 2))
# A
# B <- subset(Space, isrep(Space, vals = "H", nrep = 1) & isrep(Space, vals = "T", nrep = 1))
# B
# Prob(A, given = B)
# Prob(B, given = A)

<span style="color:#1871d6; font-weight:400">Activity 3: </span> We have data about the smoking status versus the gender of people working in a company.

<table>
    <tr>
        <th></th>
        <th></th>
        <th colspan="2">gender</th>
        <th></th>
    </tr>
    <tr>
        <th></th>
        <th></th>
        <th>female</th>
        <th>male</th>
        <th>sum</th>
    </tr>
    <tr>
        <td rowspan="2">smoke</td>
        <td>nonsmoker</td>
        <td>80</td>
        <td>54</td>
        <td>134</td>
    </tr>
    <tr>
        <td>smoker</td>
        <td>15</td>
        <td>19</td>
        <td>34</td>
    </tr>
    <tr>
        <td colspan="2">sum</td>
        <td>95</td>
        <td>73</td>
        <td>168</td>
    </tr>
</table>
    
    
3.a) If one person is selected at random from the data set, what is the probability that the selected person is female?

3.b) What is the probability that the selected person is a smoker?

In [64]:
# Code for Activity 3 goes here -----
# Generate the matrix required to find probabilities

smokers <- matrix(c(80, 15, 54, 19), ncol = 2)
colnames(smokers) <- c('female', 'male')
rownames(smokers) <- c('nonsmoker', 'smoker')
smokers.table <- as.table(smokers)

addmargins(smokers.table)


          female male Sum
nonsmoker     80   54 134
smoker        15   19  34
Sum           95   73 168


In [67]:
# Answer for Activity 3.a goes here -----
# P(Female) =
paste("P(Female) =", round(95 / 168, 4))

In [68]:
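# Answer for Activity 3.b goes here -----
# P(Smoker) = 34 / 168 (marginal probability over both genders, as the question asks)
paste("P(Smoker) =", round(34 / 168, 4))
# For comparison, the probability of being a smoker conditional on being female:
paste("P(Smoker|Female) =", round(15 / 95, 4))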

<span style="color:#1871d6; font-weight:400">Activity 4: </span> Load the framingham data from the directory '/dsa/data/all_datasets/framingham/'. Find the probability that a randomly selected subject is at risk of coronary heart disease, given that the subject is male.

In [72]:
# Code for Activity 4 goes here -----
framingham_data <- read.csv("/dsa/data/all_datasets/framingham/framingham.csv")

chd <- with(framingham_data,table(male,TenYearCHD))
colnames(chd) <- c('no CHD risk','CHD risk')
rownames(chd) <- c('female','male')

addmargins(chd)

       no CHD risk CHD risk  Sum
female        2119      301 2420
male          1477      343 1820
Sum           3596      644 4240


In [75]:
# Enter your answer for Activity 4 -----
print(paste("P(Male) =", round(1820 / 4240, 4)))
print(paste("P(CHD|Male) =", round(((644/4240) * (343 / 644)) / (1820 / 4240), 4)))

[1] "P(Male) = 0.4292"
[1] "P(CHD|Male) = 0.1885"
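
The expression in the cell above is Bayes' rule written out with counts from the table. Spelled out, it reduces to the male row of the table:

$$P(\text{CHD}|\text{Male}) = \frac{P(\text{CHD})\,P(\text{Male}|\text{CHD})}{P(\text{Male})} = \frac{\frac{644}{4240}\cdot\frac{343}{644}}{\frac{1820}{4240}} = \frac{343}{1820} \approx 0.1885$$

In other words, 343 of the 1820 male subjects are at risk, which is the same number obtained by reading the conditional probability directly from the table.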


<span style="color:#1871d6; font-weight:400">Activity 5: </span> Find the probability that a randomly selected subject is at risk of coronary heart disease, given that the subject is 40 years of age or younger.

In [102]:
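# P(CHD | age <= 40) = P(CHD and age <= 40) / P(age <= 40)
# Code for activity 5 goes here...

# Approach: tabulate TenYearCHD by age, then keep only the rows for ages <= 40.
age_CHD <- with(framingham_data, table(age, TenYearCHD))
colnames(age_CHD) <- c('no CHD risk', 'CHD risk')
addmargins(age_CHD)

# The row names of the table are the ages, so convert them to numbers to filter.
less41_CHD <- age_CHD[as.numeric(rownames(age_CHD)) <= 40, ]
less41_CHD

# P(CHD | age <= 40) = (subjects aged <= 40 at risk) / (all subjects aged <= 40)
paste("P(CHD|Age<=40) =", round(sum(less41_CHD[, 'CHD risk']) / sum(less41_CHD), 4))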

#S = framingham_data
#A = subset(S, TenYearCHD==1)
#A
#B = subset(S, age<=40)
#B
#t = isin(A, B)
#t
#paste("P(CHD|Age) =",  round(Prob(A, given = B), 4))

age  no CHD risk  CHD risk  Sum
32             1         0    1
33             5         0    5
34            18         0   18
35            40         2   42
36            81         3   84
37            88         4   92
38           136         8  144
39           164         6  170
40           177        15  192
41           163        11  174
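
The rows for ages 32 through 40 in the output above are enough to work the answer by hand. This assumes, as the output suggests, that 32 is the youngest age in the table. Adding the `CHD risk` column and the `Sum` column over those nine rows:

$$P(\text{CHD} \mid \text{age} \le 40) = \frac{0+0+0+2+3+4+8+6+15}{1+5+18+42+84+92+144+170+192} = \frac{38}{748} \approx 0.0508$$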


<span style="color:#1871d6; font-weight:400">Activity 6: </span> Find the probability that a randomly selected subject is at risk of coronary heart disease, given that the subject smokes fewer than 10 cigarettes per day.

**Hint:** Use a 3-way table to include the variables cigsPerDay, TenYearCHD and currentSmoker. Use currentSmoker as the 3rd dimension/input.
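
The hint can be sketched on a small made-up data frame before trying it on the real data. The `toy` values below are invented purely for illustration and are not taken from the Framingham file. The sketch also assumes that "smokes fewer than 10 cigarettes" refers to current smokers only (the `currentSmoker == 1` layer); if non-smokers should count as well, the layers would need to be summed instead.

```r
# Toy data, invented for illustration only (NOT the Framingham data)
toy <- data.frame(
  cigsPerDay    = c(0, 0, 5, 5, 9, 20, 20, 30, 0, 3),
  currentSmoker = c(0, 0, 1, 1, 1, 1, 1, 1, 0, 1),
  TenYearCHD    = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 0)
)

# 3-way table: rows = cigsPerDay, columns = TenYearCHD, layers = currentSmoker
tab3 <- with(toy, table(cigsPerDay, TenYearCHD, currentSmoker))

# Assumption: keep only the layer for current smokers (currentSmoker == 1)
smoker_layer <- tab3[, , "1"]

# Row names are the cigsPerDay values; keep the rows below 10
under10 <- smoker_layer[as.numeric(rownames(smoker_layer)) < 10, , drop = FALSE]

# P(CHD | smokes < 10 cigs) = (at-risk subjects in those rows) / (all subjects in those rows)
p_chd_given_under10 <- sum(under10[, "1"]) / sum(under10)
p_chd_given_under10
```

For the exercise itself, the same steps would apply with `framingham_data` in place of `toy`.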

In [None]:
# Code for activity 6 goes here....



In [None]:
# P( risk of CHD | subject smokes <10 cigs ) = P( risk of CHD and subject smokes <10 cigs)/ P(subject smokes <10 cigs)
# Enter your answer for Activity 6 -----




<span style="color:#1871d6; font-weight:400">Activity 7.a: </span> Find the probability that a randomly selected subject is at risk of coronary heart disease, given that the subject has totChol > 300 and BMI > 30.

<span style="color:#1871d6; font-weight:400">Activity 7.b: </span> Find the probability that a randomly selected subject is at risk of coronary heart disease, given that the subject has totChol > 300 and BMI < 30.

**Hint:** Use a 3-way table. Use BMI as the 3rd dimension/input.
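
Because totChol and BMI are continuous, one way to follow the hint is to tabulate logical indicators rather than the raw values. The sketch below does this on a made-up data frame; the `toy` values and the dimension names `highChol`, `CHD` and `obese` are hypothetical and chosen only for illustration.

```r
# Toy data, invented for illustration only (NOT the Framingham data)
toy <- data.frame(
  totChol    = c(310, 320, 305, 250, 330, 340, 290, 315),
  BMI        = c(32,  31,  28,  33,  27,  35,  29,  26),
  TenYearCHD = c(1,   0,   0,   1,   1,   1,   0,   0)
)

# 3-way table of indicators, with the BMI indicator as the third dimension
tab3 <- with(toy, table(highChol = totChol > 300,
                        CHD      = TenYearCHD,
                        obese    = BMI > 30))

# CHD counts (CHD = 0, 1) for totChol > 300 within each BMI layer.
# Note: the "FALSE" layer is BMI <= 30, so a BMI of exactly 30 would land there.
high_bmi <- tab3["TRUE", , "TRUE"]
low_bmi  <- tab3["TRUE", , "FALSE"]

# Conditional probabilities: at-risk count divided by the size of the conditioning group
p_chd_high_bmi <- high_bmi["1"] / sum(high_bmi)
p_chd_low_bmi  <- low_bmi["1"] / sum(low_bmi)
p_chd_high_bmi
p_chd_low_bmi
```

With the default settings, `table` leaves out rows that have a missing value in any of the three variables, which is worth keeping in mind when the denominators are compared against the full data set.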

In [None]:
# Enter your answer for Activity 7.a (BMI > 30) -----  
# P( risk of CHD | totChol > 300 and BMI > 30 ) = P( risk of CHD and totChol > 300 and BMI > 30 ) / P( totChol > 300 and BMI > 30 )




In [None]:
# Enter your answer for Activity 7.b (BMI < 30) -----  
# P( risk of CHD | totChol > 300 and BMI < 30 ) = P( risk of CHD and totChol > 300 and BMI < 30 ) / P( totChol > 300 and BMI < 30 )


