# Homework 1 Problem 6

We are given training and test datasets of patient records, which we use to classify whether each patient has heart disease.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# read data
df_train = pd.read_csv('hw2_trainPatient.csv', names=['age', 'chestPain', 'target'])
df_test = pd.read_csv('hw2_testPatient.csv', names=['age', 'chestPain', 'target'])

display(df_train.head(5))

Unnamed: 0,age,chestPain,target
0,45,1,0
1,57,0,0
2,46,0,1
3,71,0,0
4,65,0,0


To define a probabilistic model of this data, we let $Y_{i} = 1$ if patient $i$ has heart disease,
and $Y_{i} = 0$ if patient $i$ does not.  To construct a simple Bayesian classifier, we will compute the posterior probability $P(Y_{i}|X_{i})$ of the class label given some feature $X_i$. If $P(Y_{i} = 1|X_{i}) > P(Y_{i} = 0|X_{i})$, we classify patient $i$ as probably having heart disease. Otherwise, we classify them as probably not having heart disease.
By Bayes' rule, the posterior probability $P(Y_{i}|X_{i})$ is calculated as:

$$
P(Y_{i}|X_{i}) = \dfrac{P(X_{i}|Y_{i})P(Y_{i})}{P(X_{i})}.
$$
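As a minimal sketch of how this formula is evaluated for a binary label, the helper below multiplies each class prior by the likelihood of the observed feature value and normalizes by their sum, which is $P(X_{i})$. The function name `posterior` and the numbers passed to it are hypothetical and chosen only for illustration; they do not come from the patient data.

```python
def posterior(prior, likelihood):
    """Return P(Y=y | X=x) for each class y via Bayes' rule.

    prior:      dict mapping class label y -> P(Y=y)
    likelihood: dict mapping class label y -> P(X=x | Y=y) for one observed x
    """
    # Numerator of Bayes' rule for each class: P(X=x | Y=y) * P(Y=y)
    joint = {y: likelihood[y] * prior[y] for y in prior}
    # Denominator P(X=x), obtained by summing the numerators over the classes
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}


# Hypothetical numbers, for illustration only (not estimated from the data)
post = posterior(prior={0: 0.7, 1: 0.3}, likelihood={0: 0.2, 1: 0.6})
print(post)
```

Because the denominator is shared by both classes, comparing the two posteriors gives the same decision as comparing the two numerators.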

Consider the feature $A_{i}$, where $A_{i} = 1$ if patient $i$ has $age > 55$, and $A_{i} = 0$ if patient $i$ has $age \le 55$.

In [3]:
# Binarize age into feature A (1 if age > 55, else 0), then rename the columns
df_train['A'] = np.where(df_train['age']>55, 1, 0)
df_test['A'] = np.where(df_test['age']>55, 1, 0)

df_train = df_train.rename(columns={'chestPain': 'E', 'target': 'Y'})[['A', 'E', 'Y']]
df_test = df_test.rename(columns={'chestPain': 'E', 'target': 'Y'})[['A', 'E', 'Y']]

display(df_train.head(5))

Unnamed: 0,A,E,Y
0,0,1,0
1,1,0,0
2,0,0,1
3,1,0,0
4,1,0,0


![a](hw2_6a.png)

In [9]:
# (a): Estimate these probabilities from the data in trainPatient, and report their values

n = df_train.shape[0]
n0 = df_train[df_train['Y']==0].shape[0]
n1 = df_train[df_train['Y']==1].shape[0]
n0_A = df_train[(df_train['Y']==0)&(df_train['A']==1)].shape[0]
n1_A = df_train[(df_train['Y']==1)&(df_train['A']==1)].shape[0]

p_y_equal_1 = n1/n
p_y_equal_0 = n0/n
p_a_given_y1 = n1_A/n1
p_a_given_y0 = n0_A/n0

print('The estimated values are:')
print('P(Yi=1) =', p_y_equal_1)
print('P(Yi=0) =', p_y_equal_0)
print('P(Ai=1|Yi=1) =', p_a_given_y1)
print('P(Ai=1|Yi=0) =', p_a_given_y0)

The estimated values are:
P(Yi=1) = 0.46
P(Yi=0) = 0.54
P(Ai=1|Yi=1) = 0.4782608695652174
P(Ai=1|Yi=0) = 0.3333333333333333


(b) Test the Bayesian classifier based on feature Ai using the data in testPatient. What is the accuracy (percentage of correctly classified patients) of this classifier on that data?

**Answer:** With $X_{i} = A_{i}$, Bayes' rule gives
$$
P(Y_{i}|A_{i}) = \dfrac{P(A_{i}|Y_{i})P(Y_{i})}{P(A_{i})}.
$$
If $P(Y_{i} = 1|A_{i}) > P(Y_{i} = 0|A_{i})$, we classify patient $i$ as probably having heart disease. Because the denominator $P(A_{i})$ is the same for both classes, this is equivalent to checking whether $P(A_{i}|Y_{i}=1)P(Y_{i}=1) > P(A_{i}|Y_{i}=0)P(Y_{i}=0)$.

In [11]:
print('If Ai = 1, whether to classify patient i as probably having heart disease:')
print(p_a_given_y1 * p_y_equal_1 > p_a_given_y0 * p_y_equal_0)
print('If Ai = 0, whether to classify patient i as probably having heart disease:')
print((1-p_a_given_y1) * p_y_equal_1 > (1-p_a_given_y0) * p_y_equal_0)

If Ai = 1, whether to classify patient i as probably having heart disease:
True
If Ai = 0, whether to classify patient i as probably having heart disease:
False


Therefore, we classify patient $i$ as probably having heart disease if and only if $A_i = 1$.
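As a hand check of this decision, the products $P(A_{i}|Y_{i})P(Y_{i})$ can be worked out from the estimates reported in part (a), rounded here to two or three decimals:

$$
\begin{aligned}
A_{i} = 1:&\quad P(A_{i}=1|Y_{i}=1)P(Y_{i}=1) \approx 0.478 \times 0.46 \approx 0.22, &\quad P(A_{i}=1|Y_{i}=0)P(Y_{i}=0) \approx 0.333 \times 0.54 \approx 0.18, \\
A_{i} = 0:&\quad P(A_{i}=0|Y_{i}=1)P(Y_{i}=1) \approx 0.522 \times 0.46 \approx 0.24, &\quad P(A_{i}=0|Y_{i}=0)P(Y_{i}=0) \approx 0.667 \times 0.54 \approx 0.36.
\end{aligned}
$$

The class-1 product is the larger one only when $A_{i} = 1$.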

In [12]:
from sklearn.metrics import accuracy_score

df_test['predY_A'] = np.where(df_test['A']==1, 1, 0)
print('The test accuracy is:', accuracy_score(df_test['Y'], df_test['predY_A']))

display(df_test.head(5))

The test accuracy is: 0.6756756756756757


Unnamed: 0,A,E,Y,predY_A
0,1,0,0,1
1,0,0,0,0
2,0,1,0,0
3,1,0,1,1
4,1,1,1,1


(c) Consider now the feature $E_{i}$, where $E_{i} = 1$ if exercise causes patient $i$ to experience chest pain, and $E_{i} = 0$ if it does not. Estimate and report conditional probabilities for this new feature as in part (a). What is the accuracy of a Bayesian classifier based on feature $E_{i}$ on the data in testPatient?

In [13]:
# n0, n1 and the class priors p_y_equal_1, p_y_equal_0 are reused from part (a)
n0_E = df_train[(df_train['Y']==0)&(df_train['E']==1)].shape[0]
n1_E = df_train[(df_train['Y']==1)&(df_train['E']==1)].shape[0]

p_e_given_y1 = n1_E/n1
p_e_given_y0 = n0_E/n0

print('The estimated values are:')
print('P(Yi=1) =', p_y_equal_1)
print('P(Yi=0) =', p_y_equal_0)
print('P(Ei=1|Yi=1) =', p_e_given_y1)
print('P(Ei=1|Yi=0) =', p_e_given_y0)

The estimated values are:
P(Yi=1) = 0.46
P(Yi=0) = 0.54
P(Ei=1|Yi=1) = 0.43478260869565216
P(Ei=1|Yi=0) = 0.18518518518518517


In [14]:
print('If Ei = 1, whether to classify patient i as probably having heart disease:')
print(p_e_given_y1 * p_y_equal_1 > p_e_given_y0 * p_y_equal_0)
print('If Ei = 0, whether to classify patient i as probably having heart disease:')
print((1-p_e_given_y1) * p_y_equal_1 > (1-p_e_given_y0) * p_y_equal_0)

If Ei = 1, whether to classify patient i as probably having heart disease:
True
If Ei = 0, whether to classify patient i as probably having heart disease:
False


Therefore, we classify patient $i$ as probably having heart disease if and only if $E_i = 1$.

In [15]:
df_test['predY_E'] = np.where(df_test['E']==1, 1, 0)
print('The new test accuracy is:', accuracy_score(df_test['Y'], df_test['predY_E']))

display(df_test.head(5))

The new test accuracy is: 0.7567567567567568


Unnamed: 0,A,E,Y,predY_A,predY_E
0,1,0,0,1,0
1,0,0,0,0,0
2,0,1,0,0,1
3,1,0,1,1,0
4,1,1,1,1,1


(d) Consider the pair of features $\{A_{i}, E_{i}\}$. This pair of features can take on 4 values:
$\{0, 0\}, \{0, 1\}, \{1, 0\}$ or $\{1, 1\}$ (we do *not* assume that $A_{i}$ and $E_{i}$ are independent). By counting as in part (a), compute and report the probabilities of these four events given $Y_{i} = 1$ and given $Y_{i} = 0$. What is the accuracy of a Bayesian classifier based on feature $\{A_{i}, E_{i}\}$ on the data in testPatient? Compare the accuracy of this classifier to those from parts (b) and (c), and give an intuitive explanation for what you observe.
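The counting procedure for the joint feature can be sketched in plain Python, without assuming independence of $A_{i}$ and $E_{i}$: each of the four $(A_{i}, E_{i})$ cells gets its own conditional probability per class. The helper name `fit_joint_rule` and the toy rows below are hypothetical and serve only to illustrate the technique; they are not the homework data.

```python
from collections import Counter


def fit_joint_rule(rows):
    """Estimate a decision table {(a, e): predicted y} by counting.

    rows: iterable of (a, e, y) tuples, each entry 0 or 1.
    """
    rows = list(rows)
    n = len(rows)
    class_counts = Counter(y for _, _, y in rows)
    cell_counts = Counter(((a, e), y) for a, e, y in rows)

    rule = {}
    for a in (0, 1):
        for e in (0, 1):
            score = {}
            for y in (0, 1):
                n_y = class_counts[y]
                # P(A=a, E=e | Y=y), estimated as a fraction of class-y rows
                cond = cell_counts[((a, e), y)] / n_y if n_y else 0.0
                # Numerator of Bayes' rule: P(A=a, E=e | Y=y) * P(Y=y)
                score[y] = cond * n_y / n
            # Predict 1 only if its score is strictly larger (ties go to 0)
            rule[(a, e)] = 1 if score[1] > score[0] else 0
    return rule


# Hypothetical toy rows (a, e, y), for illustration only
toy_rows = [(0, 0, 0), (0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1),
            (0, 1, 1), (0, 1, 1), (1, 0, 1), (0, 0, 0), (1, 0, 0)]
rule = fit_joint_rule(toy_rows)
print(rule)
```

The returned table has one entry per feature pair, so prediction is a dictionary lookup on $(A_{i}, E_{i})$.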

In [18]:
# n0, n1 and the class priors are reused from part (a); count each (A, E) cell per class
n0_A0E1 = df_train[(df_train['Y']==0)&(df_train['A']==0)&(df_train['E']==1)].shape[0]
n0_A1E1 = df_train[(df_train['Y']==0)&(df_train['A']==1)&(df_train['E']==1)].shape[0]
n0_A0E0 = df_train[(df_train['Y']==0)&(df_train['A']==0)&(df_train['E']==0)].shape[0]
n0_A1E0 = df_train[(df_train['Y']==0)&(df_train['A']==1)&(df_train['E']==0)].shape[0]

n1_A0E1 = df_train[(df_train['Y']==1)&(df_train['A']==0)&(df_train['E']==1)].shape[0]
n1_A1E1 = df_train[(df_train['Y']==1)&(df_train['A']==1)&(df_train['E']==1)].shape[0]
n1_A0E0 = df_train[(df_train['Y']==1)&(df_train['A']==0)&(df_train['E']==0)].shape[0]
n1_A1E0 = df_train[(df_train['Y']==1)&(df_train['A']==1)&(df_train['E']==0)].shape[0]

# Joint conditional probabilities P(A, E | Y=1)
p_A0E1_given_y1 = n1_A0E1/n1
p_A1E1_given_y1 = n1_A1E1/n1
p_A0E0_given_y1 = n1_A0E0/n1
p_A1E0_given_y1 = n1_A1E0/n1

# Joint conditional probabilities P(A, E | Y=0)
p_A0E1_given_y0 = n0_A0E1/n0
p_A1E1_given_y0 = n0_A1E1/n0
p_A0E0_given_y0 = n0_A0E0/n0
p_A1E0_given_y0 = n0_A1E0/n0

print('The estimated values are:')
print('P(Yi=1) =', p_y_equal_1)
print('P(Yi=0) =', p_y_equal_0, '\n')

print('P(Ai=0, Ei=1|Yi=1) =', p_A0E1_given_y1)
print('P(Ai=1, Ei=1|Yi=1) =', p_A1E1_given_y1)
print('P(Ai=0, Ei=0|Yi=1) =', p_A0E0_given_y1)
print('P(Ai=1, Ei=0|Yi=1) =', p_A1E0_given_y1, '\n')

print('P(Ai=0, Ei=1|Yi=0) =', p_A0E1_given_y0)
print('P(Ai=1, Ei=1|Yi=0) =', p_A1E1_given_y0)
print('P(Ai=0, Ei=0|Yi=0) =', p_A0E0_given_y0)
print('P(Ai=1, Ei=0|Yi=0) =', p_A1E0_given_y0)

The estimated values are:
P(Yi=1) = 0.46
P(Yi=0) = 0.54 

P(Ai=0, Ei=1|Yi=1) = 0.21739130434782608
P(Ai=1, Ei=1|Yi=1) = 0.21739130434782608
P(Ai=0, Ei=0|Yi=1) = 0.30434782608695654
P(Ai=1, Ei=0|Yi=1) = 0.2608695652173913 

P(Ai=0, Ei=1|Yi=0) = 0.14814814814814814
P(Ai=1, Ei=1|Yi=0) = 0.037037037037037035
P(Ai=0, Ei=0|Yi=0) = 0.5185185185185185
P(Ai=1, Ei=0|Yi=0) = 0.2962962962962963


In [19]:
print('If Ai=0, Ei = 1, whether to classify patient i as probably having heart disease:')
print(p_A0E1_given_y1 * p_y_equal_1 > p_A0E1_given_y0 * p_y_equal_0)
print('If Ai=1, Ei = 1, whether to classify patient i as probably having heart disease:')
print(p_A1E1_given_y1 * p_y_equal_1 > p_A1E1_given_y0 * p_y_equal_0)
print('If Ai=0, Ei = 0, whether to classify patient i as probably having heart disease:')
print(p_A0E0_given_y1 * p_y_equal_1 > p_A0E0_given_y0 * p_y_equal_0)
print('If Ai=1, Ei = 0, whether to classify patient i as probably having heart disease:')
print(p_A1E0_given_y1 * p_y_equal_1 > p_A1E0_given_y0 * p_y_equal_0)

If Ai=0, Ei = 1, whether to classify patient i as probably having heart disease:
True
If Ai=1, Ei = 1, whether to classify patient i as probably having heart disease:
True
If Ai=0, Ei = 0, whether to classify patient i as probably having heart disease:
False
If Ai=1, Ei = 0, whether to classify patient i as probably having heart disease:
False
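These four decisions can be checked by hand. Multiplying the joint conditional probabilities reported above by the priors $P(Y_{i}=1) = 0.46$ and $P(Y_{i}=0) = 0.54$ gives approximately:

$$
\begin{aligned}
A_{i}=0, E_{i}=1:&\quad 0.217 \times 0.46 \approx 0.10 \quad\text{vs.}\quad 0.148 \times 0.54 \approx 0.08, \\
A_{i}=1, E_{i}=1:&\quad 0.217 \times 0.46 \approx 0.10 \quad\text{vs.}\quad 0.037 \times 0.54 \approx 0.02, \\
A_{i}=0, E_{i}=0:&\quad 0.304 \times 0.46 \approx 0.14 \quad\text{vs.}\quad 0.519 \times 0.54 \approx 0.28, \\
A_{i}=1, E_{i}=0:&\quad 0.261 \times 0.46 \approx 0.12 \quad\text{vs.}\quad 0.296 \times 0.54 \approx 0.16.
\end{aligned}
$$

In each row the left product is for $Y_{i}=1$ and the right product is for $Y_{i}=0$. The class-1 product is larger in both rows with $E_{i}=1$ and smaller in both rows with $E_{i}=0$, whatever the value of $A_{i}$.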


In [20]:
# We can see from the above that the value of A won't affect the prediction
df_test['predY_AE'] = np.where((df_test['E']==1), 1, 0)

print('The new test accuracy is:', accuracy_score(df_test['Y'], df_test['predY_AE']))

display(df_test.head(5))

The new test accuracy is: 0.7567567567567568


Unnamed: 0,A,E,Y,predY_A,predY_E,predY_AE
0,1,0,0,1,0,0
1,0,0,0,0,0,0
2,0,1,0,0,1,1
3,1,0,1,1,0,0
4,1,1,1,1,1,1


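**Finding:** The accuracy in (d) (about 0.757) is the same as that in (c), and is larger than that in (b) (about 0.676).

**Intuitive Explanation:** In theory, a classifier that uses both features should be at least as accurate as the best single-feature classifier, because any rule based on $A_i$ or $E_i$ alone is also a rule on the pair $\{A_i, E_i\}$. This holds when the probabilities are estimated well; it is not guaranteed on a small test set. Here the joint classifier's decision depends only on $E_i$: for both values of $A_i$, it predicts heart disease if and only if $E_i = 1$. Its predictions, and therefore its accuracy, are identical to those in (c), which suggests that feature $A$ adds little information beyond $E$ in our setting.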
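The "at least as accurate" argument can be illustrated on training data, where it holds exactly: choosing the majority label within each feature cell is the same decision as comparing the counted products $P(x|y)P(y)$, and the cells of the pair refine the cells of each single feature. The sketch below uses a hypothetical helper `best_train_accuracy` and hypothetical toy rows, not the homework data, and it says nothing about test accuracy.

```python
from collections import Counter


def best_train_accuracy(rows, key):
    """Training accuracy of the majority-label rule on the feature key(a, e).

    rows: list of (a, e, y) tuples; key maps (a, e) to the feature value used.
    """
    cells = {}
    for a, e, y in rows:
        cells.setdefault(key(a, e), Counter())[y] += 1
    # Within each cell, the rule classifies the majority-label rows correctly
    correct = sum(max(counts.values()) for counts in cells.values())
    return correct / len(rows)


# Hypothetical toy rows (a, e, y), for illustration only
toy_rows = [(0, 0, 0), (0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1),
            (0, 1, 1), (0, 1, 1), (1, 0, 1), (0, 0, 0), (1, 0, 0)]

acc_a = best_train_accuracy(toy_rows, key=lambda a, e: a)
acc_e = best_train_accuracy(toy_rows, key=lambda a, e: e)
acc_ae = best_train_accuracy(toy_rows, key=lambda a, e: (a, e))
print(acc_a, acc_e, acc_ae)
```

On these toy rows the joint rule matches the better single feature and beats the weaker one, the same pattern as the finding above.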