# Conditional Probability

Conditional probability is a measure of the probability of an event occurring given that another event has occurred.

`P(A|B)` is the probability of A given that B has already occurred.
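As a small illustration of the definition, $P(A|B)=\frac{P(A \cap B)}{P(B)}$ can be checked by enumerating a toy sample space. The two-dice setup and the events `A` and `B` below are assumptions chosen for illustration only; they are not part of the student dataset used in the rest of this notebook.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy sample space: the 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a predicate over outcomes) under equal likelihood."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def A(o):
    return o[0] + o[1] == 8   # event A: the two dice sum to 8

def B(o):
    return o[0] % 2 == 0      # event B: the first die is even

p_b = prob(B)
p_a_and_b = prob(lambda o: A(o) and B(o))
p_a_given_b = p_a_and_b / p_b  # P(A|B) = P(A and B) / P(B)

print(p_a_given_b)  # 1/6
```

Conditioning on `B` restricts attention to the 18 outcomes where the first die is even; 3 of those sum to 8.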

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

In [2]:
data = pd.read_csv('../data/student-mat.csv',sep=';')
print(data.shape)
data.head()

(395, 33)


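|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |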


The dataset has 33 variables. We will work with a subset of them:

- **school**: student's school (***`binary`***: 'GP' or 'MS')
- **sex**: student's sex (***`binary`***: 'F' - female or 'M' - male)
- **address**: student's home address type (***`binary`***: 'U' - urban or 'R' - rural)
- **studytime**: weekly study time (***`numeric`***: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- **schoolsup**: extra educational support (***`binary`***: yes or no)
- **internet**: Internet access at home (***`binary`***: yes or no)
- **G1**: first period grade (***`numeric`***: from 0 to 20)
- **G2**: second period grade (***`numeric`***: from 0 to 20)
- **G3**: final grade (***`numeric`***: from 0 to 20, output target)

In [3]:
# Take an explicit copy so that adding columns later does not raise SettingWithCopyWarning
data = data[['school','sex','address','studytime','schoolsup','internet','G1','G2','G3']].copy()
print(data.shape)
data.head()

(395, 9)


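|   | school | sex | address | studytime | schoolsup | internet | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | U | 2 | yes | no | 5 | 6 | 6 |
| 1 | GP | F | U | 2 | no | yes | 5 | 5 | 6 |
| 2 | GP | F | U | 2 | yes | yes | 7 | 8 | 10 |
| 3 | GP | F | U | 3 | no | yes | 15 | 14 | 15 |
| 4 | GP | F | U | 2 | no | no | 6 | 10 | 10 |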


## Exercise 1. Conditional Probability  
 
Determine the probability that a student gets a final grade G3 greater than or equal to 60%, given that the student studies more than 5 hours a week.

Let's create the binary indicator variable `G3pass`: it is 1 if G3 >= 60%, and 0 otherwise. The original G3 values are on a 0–20 scale, so we multiply them by 5 to convert them to percentages.

In [4]:
data['G3pass'] = np.where(data.G3 * 5 >= 60, 1, 0)
data.head()

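|   | school | sex | address | studytime | schoolsup | internet | G1 | G2 | G3 | G3pass |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | U | 2 | yes | no | 5 | 6 | 6 | 0 |
| 1 | GP | F | U | 2 | no | yes | 5 | 5 | 6 | 0 |
| 2 | GP | F | U | 2 | yes | yes | 7 | 8 | 10 | 0 |
| 3 | GP | F | U | 3 | no | yes | 15 | 14 | 15 | 1 |
| 4 | GP | F | U | 2 | no | no | 6 | 10 | 10 | 0 |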


Let's create a second binary indicator variable, `StudyHard`: it is 1 if studytime >= 3 (more than 5 hours a week), and 0 otherwise.

In [5]:
data['StudyHard'] = np.where(data.studytime >= 3, 1, 0)
data.head()

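|   | school | sex | address | studytime | schoolsup | internet | G1 | G2 | G3 | G3pass | StudyHard |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | U | 2 | yes | no | 5 | 6 | 6 | 0 | 0 |
| 1 | GP | F | U | 2 | no | yes | 5 | 5 | 6 | 0 | 0 |
| 2 | GP | F | U | 2 | yes | yes | 7 | 8 | 10 | 0 | 0 |
| 3 | GP | F | U | 3 | no | yes | 15 | 14 | 15 | 1 | 1 |
| 4 | GP | F | U | 2 | no | no | 6 | 10 | 10 | 0 | 0 |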


Let's calculate a crosstab:

In [6]:
pd.crosstab(data.G3pass, data.StudyHard, margins=True)

| G3pass (rows) / StudyHard (columns) | 0 | 1 | All |
|---|---|---|---|
| 0 | 188 | 45 | 233 |
| 1 | 115 | 47 | 162 |
| All | 303 | 92 | 395 |


Remember the formula for conditional probability:

$P(G3pass|StudyHard)=\frac{P(G3pass \cap StudyHard)}{P(StudyHard)}$

$P(G3pass \cap StudyHard) = \frac{47}{395}$

$P(StudyHard) = \frac{92}{395}$

The common denominator 395 cancels, so the conditional probability reduces to $\frac{47}{92}$:

In [7]:
47/92

0.5108695652173914

The probability that a student gets a final grade G3>=60%, given that the student studies more than 5 hours a week, is 51.1%.
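The hand calculation can also be reproduced programmatically. The sketch below is one possible way, not the notebook's own method: since the CSV file is not available here, it rebuilds a stand-in frame `df` from the four cell counts of the crosstab above (an assumption made only to keep the example self-contained), then lets `pd.crosstab` normalize each column by its total.

```python
import pandas as pd

# Stand-in data rebuilt from the crosstab counts above:
# (G3pass, StudyHard) -> number of students.
counts = {(0, 0): 188, (0, 1): 45, (1, 0): 115, (1, 1): 47}
rows = [(g, s) for (g, s), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["G3pass", "StudyHard"])

# normalize="columns" divides each cell by its column total,
# so the StudyHard = 1 column holds P(G3pass | StudyHard = 1).
cond = pd.crosstab(df.G3pass, df.StudyHard, normalize="columns")
p_pass_given_hard = cond.loc[1, 1]

print(round(p_pass_given_hard, 4))  # 0.5109
```

With the real data, the same `normalize="columns"` call on `data.G3pass` and `data.StudyHard` should give the conditional probabilities for both `StudyHard` groups at once.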

## Exercise 2. Conditional Probability

Determine the probability that a student gets a final grade G3>=60%, given that the student has extra educational support.

In [8]:
pd.crosstab(data.G3pass, data.schoolsup, margins=True)

| G3pass (rows) / schoolsup (columns) | no | yes | All |
|---|---|---|---|
| 0 | 191 | 42 | 233 |
| 1 | 153 | 9 | 162 |
| All | 344 | 51 | 395 |


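Applying the formula for conditional probability, where the conditioning event is `schoolsup = yes`:

$P(G3pass|schoolsup=yes)=\frac{P(G3pass \cap schoolsup=yes)}{P(schoolsup=yes)}$

$P(G3pass \cap schoolsup=yes) = \frac{9}{395}$

$P(schoolsup=yes) = \frac{51}{395}$

Again the denominator 395 cancels, leaving $\frac{9}{51}$: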

In [9]:
9/51

0.17647058823529413

The probability that a student gets a final grade G3>=60%, given that the student has extra educational support, is 17.6%.
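Because `G3pass` is a 0/1 indicator, its mean within a group is the conditional probability of passing for that group. The following sketch uses that fact with `groupby`; the frame `df` is a stand-in rebuilt from the crosstab counts above (an assumption for self-containment), and the variable names are illustrative.

```python
import pandas as pd

# Stand-in data rebuilt from the crosstab counts above:
# (G3pass, schoolsup) -> number of students.
counts = {(0, "no"): 191, (0, "yes"): 42, (1, "no"): 153, (1, "yes"): 9}
rows = [(g, s) for (g, s), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["G3pass", "schoolsup"])

# The mean of a 0/1 indicator within each group equals P(G3pass | schoolsup = group).
p_pass_by_support = df.groupby("schoolsup")["G3pass"].mean()

print(p_pass_by_support.round(3))
```

This gives both conditional probabilities in one step: 9/51 (about 0.176) for students with extra support and 153/344 (about 0.445) for students without it.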

## Exercise 3. Total Probability Law

The variable `studytime` represents the weekly study time, with the values:

- 1: <2 hours
- 2: 2 to 5 hours
- 3: 5 to 10 hours
- 4: >10 hours

In [10]:
data.studytime.value_counts(sort=False)

2    198
3     65
1    105
4     27
Name: studytime, dtype: int64

We can get the same information as probabilities (estimated by relative frequencies):

In [11]:
data.studytime.value_counts(sort=False, normalize=True)

2    0.501266
3    0.164557
1    0.265823
4    0.068354
Name: studytime, dtype: float64

Let `st` denote `studytime`. Rounded to two decimals, the probabilities are:

$P(st=1)=0.27$

$P(st=2)=0.50$

$P(st=3)=0.16$

$P(st=4)=0.07$

Using the variable `G3pass`, we can estimate the probability of passing within each `studytime` (st) category:

In [12]:
pd.crosstab(data.G3pass, data.studytime, margins=True)

| G3pass (rows) / studytime (columns) | 1 | 2 | 3 | 4 | All |
|---|---|---|---|---|---|
| 0 | 61 | 127 | 32 | 13 | 233 |
| 1 | 44 | 71 | 33 | 14 | 162 |
| All | 105 | 198 | 65 | 27 | 395 |


$P(G3pass|st=1)=\frac{44}{105}=0.42$

$P(G3pass|st=2)=\frac{71}{198}=0.36$

$P(G3pass|st=3)=\frac{33}{65}=0.51$

$P(G3pass|st=4)=\frac{14}{27}=0.52$

**Problem**: Using all the information provided, what is the probability that a randomly selected student gets G3>=60% (G3pass==1)?

The four `studytime` categories are mutually exclusive and cover every student, so we can apply the law of total probability.

$P(G3pass)=P(G3pass|st=1)P(st=1)+P(G3pass|st=2)P(st=2)+P(G3pass|st=3)P(st=3)+P(G3pass|st=4)P(st=4)$

In [13]:
(0.42)*(0.27)+(0.36)*(0.5)+(0.51)*(0.16)+(0.52)*(0.07)

0.4114

$P(G3pass)=(0.42)(0.27)+(0.36)(0.5)+(0.51)(0.16)+(0.52)(0.07)=0.411$

The probability that a randomly selected student gets G3>=60% (G3pass==1) is 41.1%.
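The calculation above uses inputs rounded to two decimals. As a cross-check, the same law can be evaluated with exact fractions built from the crosstab counts. This is a minimal sketch; the dictionary layout and variable names are illustrative assumptions, not part of the original notebook.

```python
from fractions import Fraction

# Counts from the G3pass x studytime crosstab above: {studytime: (passed, total)}.
counts = {1: (44, 105), 2: (71, 198), 3: (33, 65), 4: (14, 27)}
n_students = sum(total for _, total in counts.values())  # 395

# P(st = k) and P(G3pass | st = k), kept as exact fractions to avoid rounding error.
p_st = {k: Fraction(total, n_students) for k, (_, total) in counts.items()}
p_pass_given_st = {k: Fraction(passed, total) for k, (passed, total) in counts.items()}

# Law of total probability: sum over the partition st = 1, ..., 4.
p_pass = sum(p_pass_given_st[k] * p_st[k] for k in counts)

print(p_pass, round(float(p_pass), 3))  # 162/395 0.41
```

The exact result, 162/395 (about 0.410), matches the `All` column of the crosstab (162 of 395 students passed). The 0.411 obtained above differs only because its inputs were rounded.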

## Exercise 4. Bayes' Rule

We choose a student at random and learn that the student passed. Calculate the probability that the student studied 2 to 5 hours a week.

$P(st=2|G3pass)=\frac{P(G3pass|st=2)P(st=2)}{P(G3pass|st=1)P(st=1)+P(G3pass|st=2)P(st=2)+P(G3pass|st=3)P(st=3)+P(G3pass|st=4)P(st=4)}$

$P(st=2|G3pass)=\frac{(0.36)(0.5)}{(0.42)(0.27)+(0.36)(0.5)+(0.51)(0.16)+(0.52)(0.07)}$

In [14]:
(0.36)*(0.5)/(0.411)

0.43795620437956206

$P(st=2|G3pass)=0.438$

Given that the student passed, the probability that the student studied 2 to 5 hours a week is 43.8%.
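Bayes' rule can be applied to every `studytime` category at once, which also gives a sanity check: the posterior probabilities should sum to 1. The sketch below works with exact fractions taken from the crosstab counts; the names `prior`, `likelihood`, `evidence`, and `posterior` are illustrative choices, not identifiers from the original notebook.

```python
from fractions import Fraction

# Counts from the G3pass x studytime crosstab: {studytime: (passed, total)}.
counts = {1: (44, 105), 2: (71, 198), 3: (33, 65), 4: (14, 27)}
n_students = sum(total for _, total in counts.values())

prior = {k: Fraction(total, n_students) for k, (_, total) in counts.items()}       # P(st = k)
likelihood = {k: Fraction(passed, total) for k, (passed, total) in counts.items()}  # P(G3pass | st = k)

# Denominator of Bayes' rule: P(G3pass) by the law of total probability.
evidence = sum(likelihood[k] * prior[k] for k in counts)

# Posterior P(st = k | G3pass) for every study-time category.
posterior = {k: likelihood[k] * prior[k] / evidence for k in counts}

for k, p in posterior.items():
    print(f"P(st={k} | G3pass) = {p} ~ {float(p):.3f}")
```

For `st = 2` the exact posterior is 71/162 (about 0.438), in agreement with the rounded calculation above: 71 of the 162 students who passed studied 2 to 5 hours a week.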