In [1]:
# import libraries
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp, binom_test

In [2]:
# load data
heart = pd.read_csv('Heart Diseases UCL - Sheet1.csv')
yes_hd = heart[heart.heart_disease=='presence']
no_hd = heart[heart.heart_disease == 'absence']

# investigating cholesterol levels in patients with heart disease
chol_hd = yes_hd.chol
print('The mean cholesterol level for patients with heart disease is ' + str((chol_hd.mean())))
# 240 mg/dl is considered high (and therefore unhealthy)

chol_no_hd = no_hd.chol
print('The mean cholesterol level for patients without heart disease is ' + str(chol_no_hd.mean()))

The mean cholesterol level for patients with heart disease is 251.4748201438849
The mean cholesterol level for patients without heart disease is 242.640243902439


## 1. Cholesterol Analysis


Question 1: Do people with heart disease have high cholesterol levels (above 240 mg/dl) on average?

To investigate, we're going to conduct some hypothesis testing!
- Null: People with heart disease have an average cholesterol level equal to 240 mg/dl
- Alternative: People with heart disease have an average cholesterol level that is **greater** than 240 mg/dl

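Running the code in the cell below gives a p-value of 0.0035 (0.35%). This means that if the true average cholesterol level of people with heart disease were 240 mg/dl, there would be only about a 0.35% chance of observing a sample mean at least as high as ours. Because the p-value is below a 5% significance threshold, we reject the null and conclude that the average cholesterol level is significantly higher than 240 mg/dl.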

In [3]:
# alternative='greater' makes this a one-sided test; the argument requires SciPy 1.6 or later
tstat, pval = ttest_1samp(chol_hd, 240, alternative='greater')
print('The p-value for patients with heart disease is ' + str(pval))


The p-value for patients with heart disease is 0.0035411033905155707
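
To make the p-value concrete, the one-sided test can be reproduced by hand. The sketch below uses synthetic cholesterol values (an assumption: `sample`, `mu0` and the random seed are illustrative and are not taken from the heart dataset) and compares the manual calculation with `ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a cholesterol column (illustrative only, not the real data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=250, scale=50, size=100)
mu0 = 240  # hypothesized mean under the null

n = len(sample)
# t statistic: distance of the sample mean from mu0 in standard-error units
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
# One-sided p-value: probability of a t at least this large if the null is true
p_manual = stats.t.sf(t_manual, df=n - 1)

# Same test through SciPy (alternative= needs SciPy 1.6 or later)
t_scipy, p_scipy = stats.ttest_1samp(sample, mu0, alternative='greater')
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The two pairs of numbers should agree, which shows that the p-value is a tail probability computed under the null, not the probability that the alternative is true.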


#### Question 2: A repetition of 'Question 1' but for patients **NOT** diagnosed with a heart disease.

Do patients without heart disease have average cholesterol levels significantly above 240 mg/dl?
- Null: People without heart disease have an average cholesterol level equal to 240 mg/dl
- Alternative: People without heart disease have an average cholesterol level that is **greater** than 240 mg/dl

The p-value for this test is 0.264, which is above the 5% significance threshold (0.05), so we fail to reject the null: the average cholesterol level of patients not diagnosed with heart disease is **NOT** significantly higher than 240 mg/dl. This is consistent with the sample mean of chol_no_hd (about 242.6 mg/dl), which is only slightly above 240 mg/dl.

In [4]:
tstat2,pval2 = ttest_1samp(chol_no_hd, 240, alternative='greater')
print('The p-value for patients without HD is ' +str(pval2))

The p-value for patients without HD is 0.26397120232220506


## 2. Fasting Blood Sugar Analysis

The fbs column indicates whether a patient's fasting blood sugar level is greater than 120 mg/dl: 1 means fbs is greater than 120 mg/dl; 0 means it is less than or equal to 120 mg/dl.

We can find the total number of patients by counting the rows of the dataset, and the number of patients with high fasting blood sugar by counting the rows where fbs equals 1.

In [5]:
# count all patients, then those with high fasting blood sugar

num_patients = len(heart)
print('The total number of patients is ' + str(num_patients))

highfbs_patients = heart[heart.fbs == 1]
num_highfbs_patients = len(highfbs_patients)
print('The number of patients with high fasting blood sugar is ' + str(num_highfbs_patients))

The total number of patients is 303
The number of patients with high fasting blood sugar is 45
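
Because fbs is coded as 0/1, the same counts can also be read off with a sum and a mean. The sketch below uses a small made-up table (`toy` and its values are hypothetical, not rows from the heart dataset) to show the idea.

```python
import pandas as pd

# Tiny made-up table with the same 0/1 encoding as the fbs column
toy = pd.DataFrame({'fbs': [0, 1, 0, 0, 1, 0]})

total_rows = len(toy)          # number of patients = number of rows
high_fbs = int(toy.fbs.sum())  # each 1 marks a patient above 120 mg/dl
share_high = toy.fbs.mean()    # mean of a 0/1 column is the proportion of 1s

print(total_rows, high_fbs, share_high)
```

Applied to the real data, `heart.fbs.sum()` would be expected to match the filtered row count, provided the column contains only 0s and 1s.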


- Sometimes, part of an analysis will involve comparing a sample to known population values to see if the sample appears to be representative of the general population.

By some estimates, about 8% of the U.S. population had diabetes (diagnosed or undiagnosed) in 1988 when this data was collected. While there are multiple tests that contribute to a diabetes diagnosis, a fasting blood sugar level greater than 120 mg/dl can be indicative of diabetes (or at least pre-diabetes). If this sample were representative of the population, approximately how many people would we expect to have diabetes?

We'll calculate and print this number.

Is this value similar to the number of patients with a fasting blood sugar above 120 mg/dl, or different?

In [6]:
expected_num_fbs = 0.08 * num_patients
print('The expected number of patients with diabetes, if the sample is representative of the population is ' + str(round(expected_num_fbs)))

# The expected number is 24, which is about half the number of people with high fbs (45).

The expected number of patients with diabetes, if the sample is representative of the population is 24


Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%?

We are going to test this using the following hypotheses:
- Null: This sample was drawn from a population where 8% of people have a fbs > 120 mg/dl
- Alternative: This sample was drawn from a population where more than 8% of the people have fbs > 120 mg/dl


The p-value from the test is 4.69e-05. This means that if the true rate of fbs > 120 mg/dl in the population were 8%, the chance of seeing 45 or more such patients out of 303 would be about 0.005%. At a 5% significance threshold we reject the null and conclude that this sample was drawn from a population where more than 8% of people have high fasting blood sugar.

In [7]:
p_value = binom_test(45, n=303, p=0.08,alternative='greater')
print('the p_value is '+ str(p_value))

the p_value is 4.689471951448875e-05
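
In newer SciPy releases `binom_test` is deprecated in favour of `binomtest`, which returns a result object rather than a bare p-value. The sketch below, assuming SciPy 1.7 or later, repeats the test with the counts reported above and cross-checks it against the binomial tail probability computed directly (`result`, `p_new` and `p_tail` are illustrative names).

```python
from scipy.stats import binom, binomtest

# Counts reported above: 45 of 303 patients with fbs > 120 mg/dl, null rate 8%
result = binomtest(45, n=303, p=0.08, alternative='greater')
p_new = result.pvalue

# Same tail probability computed directly: P(X >= 45) for X ~ Binomial(303, 0.08)
p_tail = binom.sf(44, 303, 0.08)

print(p_new, p_tail)
```

Both values should match the p-value printed by the original cell, since a one-sided "greater" binomial test is exactly this upper-tail probability.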


