<span style="color:#04c921; font-size:24px; font-weight:700"> Conditional Probability</span>

Conditional probability, as you have seen in the labs and the video, 
is the probability of an event X occurring given that another event Y has occurred. 
Mathematically, it is represented as $P(X | Y)$ which is read as “probability of X given Y”.
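
For reference, the standard definition behind this notation, assuming $P(Y) > 0$, is:

$$P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)}$$

In terms of a frequency table, this is the number of cases where both X and Y occur divided by the number of cases where Y occurs.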

We will continue working with the motor vehicle thefts dataset to apply conditional probability concepts. 
You will be introduced to a new function `tally()` that works much like `prop.table()` and makes conditional probabilities easy to compute.

#### Read the data

Load the motor vehicle thefts dataset into a variable called vehicle_thefts. The dataset is in the directory '/dsa/data/all_datasets/motor_vehicle_thefts/'.

In [1]:
vehicle_thefts <- read.csv("/dsa/data/all_datasets/motor_vehicle_thefts/mvt.csv", header = TRUE)

head(vehicle_thefts)

Unnamed: 0_level_0,ID,Date,LocationDescription,Arrest,Domestic,Beat,District,CommunityArea,Year,Latitude,Longitude
Unnamed: 0_level_1,<int>,<chr>,<chr>,<lgl>,<lgl>,<int>,<int>,<int>,<int>,<dbl>,<dbl>
1,8951354,12/31/2012 23:15,STREET,False,False,623,6,69,2012,41.75628,-87.62164
2,8951141,12/31/2012 22:00,STREET,False,False,1213,12,24,2012,41.89879,-87.6613
3,8952745,12/31/2012 22:00,RESIDENTIAL YARD (FRONT/BACK),False,False,1622,16,11,2012,41.96919,-87.76767
4,8952223,12/31/2012 22:00,STREET,False,False,724,7,67,2012,41.76933,-87.65773
5,8951608,12/31/2012 21:30,STREET,False,False,211,2,35,2012,41.83757,-87.62176
6,8950793,12/31/2012 20:30,STREET,True,False,2521,25,19,2012,41.92856,-87.754


In [2]:
# Parse the Date strings into date-times; the month, weekday, hour, and minutes are extracted below
# and added to the data frame vehicle_thefts
DateConvert <- strptime(vehicle_thefts$Date, "%m/%d/%Y %H:%M")


library(lubridate)
library(dplyr)

# ymd_hms() transforms dates stored as character or numeric vectors to POSIXct objects
# Remember from the labs that there are two internal implementations of date/time: POSIXct, 
# which stores seconds since UNIX epoch (+ some other data), and POSIXlt, which stores a 
# list of day, month, year, hour, minute, second, etc.
expand_date <- ymd_hms(DateConvert) # Converts the parsed date, e.g. "12/31/2012 23:15", to "2012-12-31 23:15:00 UTC"

# Create new columns: Month, Weekday, Hour, and Minutes
vehicle_thefts$Month <- months(DateConvert)  #Extract month from formatted date. 
vehicle_thefts$Weekday <- weekdays(DateConvert)   #Extract weekday from formatted date. 
vehicle_thefts$Hour <- as.numeric(format(expand_date, "%H")) #Extract hour from formatted date. 
vehicle_thefts$Minutes <- as.numeric(format(expand_date, "%M"))  #Extract minutes from formatted date. 

head(vehicle_thefts)


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union



Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




Unnamed: 0_level_0,ID,Date,LocationDescription,Arrest,Domestic,Beat,District,CommunityArea,Year,Latitude,Longitude,Month,Weekday,Hour,Minutes
Unnamed: 0_level_1,<int>,<chr>,<chr>,<lgl>,<lgl>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>
1,8951354,12/31/2012 23:15,STREET,False,False,623,6,69,2012,41.75628,-87.62164,December,Monday,23,15
2,8951141,12/31/2012 22:00,STREET,False,False,1213,12,24,2012,41.89879,-87.6613,December,Monday,22,0
3,8952745,12/31/2012 22:00,RESIDENTIAL YARD (FRONT/BACK),False,False,1622,16,11,2012,41.96919,-87.76767,December,Monday,22,0
4,8952223,12/31/2012 22:00,STREET,False,False,724,7,67,2012,41.76933,-87.65773,December,Monday,22,0
5,8951608,12/31/2012 21:30,STREET,False,False,211,2,35,2012,41.83757,-87.62176,December,Monday,21,30
6,8950793,12/31/2012 20:30,STREET,True,False,2521,25,19,2012,41.92856,-87.754,December,Monday,20,30


**Reference:** [format()](https://stat.ethz.ch/R-manual/R-devel/library/base/html/format.html)
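
The extraction steps above can be tried on a single timestamp. This is a minimal sketch, not the notebook's own code: it uses base R's `as.POSIXct()` with an explicit format string instead of `strptime()` plus `ymd_hms()`, and the variable names are made up for illustration.

```r
# One timestamp in the same "%m/%d/%Y %H:%M" layout as the Date column
raw_date <- "12/31/2012 23:15"

# as.POSIXct() with an explicit format and time zone returns a POSIXct value
parsed <- as.POSIXct(raw_date, format = "%m/%d/%Y %H:%M", tz = "UTC")

# format() returns character strings, so wrap it in as.numeric() for numeric columns
hour_of_day <- as.numeric(format(parsed, "%H"))
minute_of_hour <- as.numeric(format(parsed, "%M"))

# months() and weekdays() return names that depend on the session locale
month_name <- months(parsed)
weekday_name <- weekdays(parsed)

print(c(hour = hour_of_day, minute = minute_of_hour))
print(c(month = month_name, weekday = weekday_name))
```

Because `format()` always returns text, forgetting `as.numeric()` would leave Hour and Minutes as character columns.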

**Activity 1:** What is the probability of an arrest being made in the month with the largest number of motor vehicle thefts? First, let's find the month with the most thefts.


In [3]:
# Find the month with the most thefts: which.max() returns the name and position
# of the largest count in the table() output
# Alternative: sort the counts in decreasing order, so the first month listed has the largest count
#sort(table(vehicle_thefts$Month), decreasing = TRUE)
which.max(table(vehicle_thefts$Month))

# A table is usually the best summary for categorical data. Once we have a table, we should be able to 
# look at it and say something sensible.  Let's take a look at the relationship between the two categorical 
# variables Month and Arrest.
addmargins(table(vehicle_thefts$Month, vehicle_thefts$Arrest))

Unnamed: 0,FALSE,TRUE,Sum
April,14028,1252,15280
August,15243,1329,16572
December,15029,1397,16426
February,12273,1238,13511
January,14612,1435,16047
July,15477,1324,16801
June,14772,1230,16002
March,14460,1298,15758
May,14848,1187,16035
November,14807,1256,16063


In [25]:
# What is the probability that an arrest is made, given that the theft occurs in October?  
# P(Arrest|October)

# table() generates a frequency table
# Numerator: logical subsetting of the Month column keeps rows where Arrest is TRUE and Month is "October"
# Denominator: logical subsetting of the Month column keeps rows where Month is "October"
# round() rounds the answer to 4 decimal places


#table(vehicle_thefts$Month[<what goes in here> & vehicle_thefts$Month=="October"])/<what goes in here>(table(vehicle_thefts$Month[vehicle_thefts$Month=="October"]))
prob = table(vehicle_thefts$Month[vehicle_thefts$Arrest==TRUE & vehicle_thefts$Month == "October"]) / 
    table(vehicle_thefts$Month[vehicle_thefts$Month == "October"])

print(paste("P(Arrest|October) =", round(prob, 4)))

[1] "P(Arrest|October) = 0.0785"
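
The same conditional probability can be written more compactly, because the mean of a logical vector is the proportion of TRUE values. The sketch below uses a hypothetical six-row data frame (`toy_thefts`), not the real mvt.csv data, so the result is only illustrative.

```r
# Hypothetical toy data: six thefts, NOT taken from mvt.csv
toy_thefts <- data.frame(
  Month  = c("October", "October", "October", "October", "May", "May"),
  Arrest = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Condition: keep only the October rows
in_october <- toy_thefts$Month == "October"

# P(Arrest | October) as the share of TRUE values among October thefts
p_arrest_given_october <- mean(toy_thefts$Arrest[in_october])

# The same value from counts: (arrests in October) / (thefts in October)
p_from_counts <- sum(toy_thefts$Arrest & in_october) / sum(in_october)

print(round(p_arrest_given_october, 4))
```

Both forms restrict attention to the October rows first, which is exactly what conditioning on October means.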


**Activity 2:** Which month has the largest number of motor vehicle thefts for which an arrest was made?

In [26]:
# which.max() returns the index of the (first) maximum of a numeric or logical vector.

which.max(table(vehicle_thefts$Month[vehicle_thefts$Arrest==TRUE]))
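
Since `which.max()` returns a named position rather than the count itself, it can help to pull out the name and the count separately. A small sketch on a hypothetical vector of month names (`toy_months` is made up for illustration):

```r
# Hypothetical toy vector of month names, NOT taken from mvt.csv
toy_months <- c("May", "June", "May", "July", "May", "June")

# table() counts each month; entries are ordered alphabetically by name
month_counts <- table(toy_months)

# which.max() gives the position of the first maximum, labelled with its name
busiest <- which.max(month_counts)

busiest_name <- names(busiest)                      # the month name
busiest_count <- as.integer(month_counts[busiest])  # the count for that month

print(busiest_name)
print(busiest_count)
```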

**Activity 3:** 

a) Read the smoke.csv dataset from '/dsa/data/all_datasets/smoke/smoke.csv' (header = TRUE) into a variable called smoke_data. 

b) Create a two-way table called smoker_outcome for the variables 'smoker' and 'outcome'. 
Add marginal totals to the table using the addmargins() function.

In [27]:
smoke_data <- read.csv("/dsa/data/all_datasets/smoke/smoke.csv", header = TRUE)

smoker_outcome = table(smoke_data$smoker, smoke_data$outcome)
addmargins(smoker_outcome)

Unnamed: 0,Alive,Dead,Sum
No,502,230,732
Yes,443,139,582
Sum,945,369,1314


This table is only useful if we can interpret it. The main question is whether smoking is associated with death in this group. Of the 945 people still alive, 443 are smokers; of the 369 who died, 139 were smokers. Raw counts like these are hard to compare until they share a common denominator, so we express them as proportions: 443 out of 945, or about 47% of those alive, smoke, while 139 out of 369, or about 38% of those who died, smoked. Smokers are therefore a smaller share of the dead than of the living, so on the basis of this table alone, smoking does not appear to be a factor in deaths for this group of people. 

In [30]:
# Example: The dataset above records smoking status and whether each subject was alive at the end 
# of 20 years. prop.table() is similar to table(), except that it returns proportions 
# where table() returns frequency counts. Called without a margin argument, as here, it divides 
# every cell by the grand total, giving joint proportions such as P(smoker & dead). 
# Conditional proportions need a margin argument (1 = rows, 2 = columns), as used further below.

# tally() from the mosaic package gives comparable results through a formula interface, 
# e.g. tally(~smoker + outcome, data = smoke_data). That call fails in this notebook with
# "Error in UseMethod("tally"): no applicable method for 'tally' applied to an object of class "formula""
# because mosaic is not loaded here and tally() resolves to dplyr::tally(), which expects a data frame.

smoking <- prop.table(smoker_outcome)
addmargins(smoking)

Unnamed: 0,Alive,Dead,Sum
No,0.3820396,0.1750381,0.5570776
Yes,0.3371385,0.1057839,0.4429224
Sum,0.7191781,0.2808219,1.0


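**Reference:** [tally()](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/tally) (this link documents the dplyr version of `tally()`; the formula interface `tally(~smoker + outcome)` belongs to the mosaic package)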
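
The sketch below shows one way the formula-based `tally()` could be called next to `prop.table()`. It rests on several assumptions: the mosaic package may not be installed in this environment (so the call is guarded), the data frame `smoke_toy` is rebuilt from the counts in the two-way table rather than read from smoke.csv, and the call is namespace-qualified so that it does not resolve to `dplyr::tally()`.

```r
# Rebuild one row per person from the counts in the two-way table
# (No/Alive 502, Yes/Alive 443, No/Dead 230, Yes/Dead 139)
cell_sizes <- c(502, 443, 230, 139)
smoke_toy <- data.frame(
  smoker  = rep(c("No", "Yes", "No", "Yes"), times = cell_sizes),
  outcome = rep(c("Alive", "Alive", "Dead", "Dead"), times = cell_sizes)
)

# Base R: joint proportions, each cell divided by the grand total
joint_base <- prop.table(table(smoke_toy$smoker, smoke_toy$outcome))
print(round(joint_base, 4))

# mosaic (optional): formula interface, only run if the package is available
if (requireNamespace("mosaic", quietly = TRUE)) {
  joint_mosaic <- mosaic::tally(~ smoker + outcome, data = smoke_toy,
                                format = "proportion")
  print(round(joint_mosaic, 4))
}
```

Writing `mosaic::tally()` explicitly avoids the method-dispatch error that appears when only dplyr is attached.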

**Activity 4:** 

a) What is the probability that a person is dead, given that the person was a smoker?

b) What is the probability that a person is dead, given that the person was a non-smoker?   

Here, smoker status is the condition.

In [41]:
#P(dead|smoker) = P(dead & smoker)/P(smoker)
# --- Add code below ---------------------------------

prob_dead_smoke <- table(smoke_data$outcome[smoke_data$outcome == "Dead" & smoke_data$smoker == "Yes"]) / 
    sum(table(smoke_data$outcome[smoke_data$smoker == "Yes"]))

print(paste("P(Dead|Smoker) =", round(prob_dead_smoke, 4)))

[1] "P(Dead|Smoker) = 0.2388"


In [42]:
#P(dead|nonsmoker) = P(dead & nonsmoker)/P(nonsmoker)
# --- Add code below ---------------------------------
prob_dead_nosmoke <- table(smoke_data$outcome[smoke_data$outcome == "Dead" & smoke_data$smoker == "No"]) / 
    sum(table(smoke_data$outcome[smoke_data$smoker == "No"]))

print(paste("P(Dead|Nonsmoker) =", round(prob_dead_nosmoke, 4)))

[1] "P(Dead|Nonsmoker) = 0.3142"
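
As a check, both results follow directly from the counts in the two-way table, where 139 of the 582 smokers and 230 of the 732 nonsmokers died:

$$P(\text{Dead} \mid \text{Smoker}) = \frac{139}{582} \approx 0.2388, \qquad P(\text{Dead} \mid \text{Nonsmoker}) = \frac{230}{732} \approx 0.3142$$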


In [None]:
# Use prop.table() if you don't want to compute percentages from the table() results by hand.
# A margin of 2 divides each cell by its column total, so each outcome column (Alive, Dead) 
# sums to 1 and shows the split between nonsmokers and smokers. A margin of 1 divides each cell 
# by its row total, so each smoker-status row (No, Yes) sums to 1 and shows the split between alive and dead.
# --- Add code below ---------------------------------


In [43]:
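# Column-wise proportions: within each outcome (Alive or Dead), the share of nonsmokers and smokers,
# i.e., P(smoker status | outcome). Each column sums to 1; the Sum column added by addmargins() 
# mixes different denominators and has no probability interpretation.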

addmargins(prop.table(smoker_outcome, 2))

Unnamed: 0,Alive,Dead,Sum
No,0.5312169,0.6233062,1.1545232
Yes,0.4687831,0.3766938,0.8454768
Sum,1.0,1.0,2.0


In [44]:
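# Row-wise proportions: within each smoking status (No or Yes), the share alive and dead,
# i.e., P(outcome | smoker status). Each row sums to 1; the Sum row added by addmargins() 
# mixes different denominators and has no probability interpretation.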

addmargins(prop.table(smoker_outcome, 1))

Unnamed: 0,Alive,Dead,Sum
No,0.6857923,0.3142077,1
Yes,0.7611684,0.2388316,1
Sum,1.4469607,0.5530393,2
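
To see what the margin argument does, the sketch below rebuilds the two-way table from its printed counts and reproduces `prop.table()` by hand. The object names (`counts`, `by_row`, `by_col`) are made up for illustration.

```r
# Counts from the two-way table; matrix() fills column by column
counts <- as.table(matrix(
  c(502, 443, 230, 139), nrow = 2,
  dimnames = list(smoker = c("No", "Yes"), outcome = c("Alive", "Dead"))
))

# margin = 1: divide each cell by its row total, giving P(outcome | smoker status)
by_row <- prop.table(counts, 1)

# margin = 2: divide each cell by its column total, giving P(smoker status | outcome)
by_col <- prop.table(counts, 2)

# The same results computed by hand
manual_by_row <- counts / rowSums(counts)
manual_by_col <- sweep(counts, 2, colSums(counts), "/")

print(round(by_row, 4))
print(round(by_col, 4))
```

Each row of `by_row` and each column of `by_col` sums to 1, which is a quick way to confirm which variable is the condition.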


The meaning of conditional probabilities is much clearer in these tables than it is in language or mathematical notation.
The idea of a conditional probability is that you are looking at a subset of the data. 
For example, in an election poll we might be interested in the subset of voters who prefer Candidate A, 
and also be interested in knowing the proportions of those voters  with respect to gender, race, ethnicity, etc. 

For the smoke data, we saw that about 44% of the 1314 people smoked. 
However, for the subset of alive, 443 out of 945, or about 47% are smokers. 
Often we want to compare one subset to another. 
Here, 139 of the 369 dead, or about 38% were smokers. 
We noted this earlier and found those numbers in the table. 
The notation for these conditional probabilities might look something like 
P(smoke | alive) and P(smoke | dead) respectively.
These can be found by using "2" in prop.table() because the subsets (conditions) are dead or alive.

In [45]:
# Comparing proportions of smokers and non-smokers for subsets of alive and dead.

addmargins(prop.table(smoker_outcome, 2))


Unnamed: 0,Alive,Dead,Sum
No,0.5312169,0.6233062,1.1545232
Yes,0.4687831,0.3766938,0.8454768
Sum,1.0,1.0,2.0


Similarly, we can answer Activity 4 by looking at the subsets (conditions) of smoking status.

In [46]:
# Comparing proportions of alive and dead for subsets of nonsmokers and smokers.

addmargins(prop.table(smoker_outcome, 1))


Unnamed: 0,Alive,Dead,Sum
No,0.6857923,0.3142077,1
Yes,0.7611684,0.2388316,1
Sum,1.4469607,0.5530393,2
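
The column-wise and row-wise tables condition on different variables, so their entries answer different questions even when they share a numerator. As a worked comparison using the counts from the two-way table:

$$P(\text{Smoker} \mid \text{Dead}) = \frac{139}{369} \approx 0.3767, \qquad P(\text{Dead} \mid \text{Smoker}) = \frac{139}{582} \approx 0.2388$$

Both use the 139 smokers who died; only the denominator changes, from all who died (369) to all smokers (582).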


# SAVE YOUR NOTEBOOK