In this project, you’ll investigate data from a sample of patients who were evaluated for heart disease at the **Cleveland Clinic Foundation**. The data was downloaded from the UCI Machine Learning Repository and then cleaned for analysis. The principal investigators responsible for data collection were:

1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

* **Cholesterol Analysis**

The full dataset has been loaded as heart, then split into two subsets:

* yes_hd, which contains data for patients **with** heart disease
* no_hd, which contains data for patients **without** heart disease
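
As a rough illustration of how such a split works, the sketch below filters a tiny made-up DataFrame with a boolean mask. The rows and the names `toy`, `toy_yes` and `toy_no` are hypothetical; only the `heart_disease` column and its `'presence'`/`'absence'` labels mirror the notebook's data.

```python
import pandas as pd

# Illustrative toy data; these rows are made up for the sketch
toy = pd.DataFrame({
    "chol": [286.0, 233.0, 229.0, 250.0],
    "heart_disease": ["presence", "absence", "presence", "absence"],
})

# A boolean mask keeps only the rows where the condition is True
toy_yes = toy[toy["heart_disease"] == "presence"]
toy_no = toy[toy["heart_disease"] == "absence"]

print(len(toy_yes), len(toy_no))
```

The original row labels are preserved in each subset, which is why the index values of yes_hd and no_hd are not consecutive.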

For this project, we’ll investigate the following variables:

* chol: serum cholesterol in mg/dl
* fbs: An indicator for whether fasting blood sugar is greater than 120 mg/dl (1 = true; 0 = false)

To start, we’ll investigate cholesterol levels for patients with heart disease. Use the dataset yes_hd to save cholesterol levels for patients with heart disease as a variable named chol_hd.

In [3]:
# import libraries
import pandas as pd
import numpy as np
from scipy.stats import ttest_1samp
from scipy.stats import binom_test

In [4]:
# load data
heart = pd.read_csv('/content/heart_disease.csv')
yes_hd = heart[heart.heart_disease == 'presence']
no_hd = heart[heart.heart_disease == 'absence']

In [5]:
heart

Unnamed: 0,age,sex,trestbps,chol,cp,exang,fbs,thalach,heart_disease
0,63.0,male,145.0,233.0,typical angina,0.0,1.0,150.0,absence
1,67.0,male,160.0,286.0,asymptomatic,1.0,0.0,108.0,presence
2,67.0,male,120.0,229.0,asymptomatic,1.0,0.0,129.0,presence
3,37.0,male,130.0,250.0,non-anginal pain,0.0,0.0,187.0,absence
4,41.0,female,130.0,204.0,atypical angina,0.0,0.0,172.0,absence
...,...,...,...,...,...,...,...,...,...
298,45.0,male,110.0,264.0,typical angina,0.0,0.0,132.0,presence
299,68.0,male,144.0,193.0,asymptomatic,0.0,1.0,141.0,presence
300,57.0,male,130.0,131.0,asymptomatic,1.0,0.0,115.0,presence
301,57.0,female,130.0,236.0,atypical angina,0.0,0.0,174.0,presence


In [6]:
yes_hd

Unnamed: 0,age,sex,trestbps,chol,cp,exang,fbs,thalach,heart_disease
1,67.0,male,160.0,286.0,asymptomatic,1.0,0.0,108.0,presence
2,67.0,male,120.0,229.0,asymptomatic,1.0,0.0,129.0,presence
6,62.0,female,140.0,268.0,asymptomatic,0.0,0.0,160.0,presence
8,63.0,male,130.0,254.0,asymptomatic,0.0,0.0,147.0,presence
9,53.0,male,140.0,203.0,asymptomatic,1.0,1.0,155.0,presence
...,...,...,...,...,...,...,...,...,...
297,57.0,female,140.0,241.0,asymptomatic,1.0,0.0,123.0,presence
298,45.0,male,110.0,264.0,typical angina,0.0,0.0,132.0,presence
299,68.0,male,144.0,193.0,asymptomatic,0.0,1.0,141.0,presence
300,57.0,male,130.0,131.0,asymptomatic,1.0,0.0,115.0,presence


In [7]:
no_hd

Unnamed: 0,age,sex,trestbps,chol,cp,exang,fbs,thalach,heart_disease
0,63.0,male,145.0,233.0,typical angina,0.0,1.0,150.0,absence
3,37.0,male,130.0,250.0,non-anginal pain,0.0,0.0,187.0,absence
4,41.0,female,130.0,204.0,atypical angina,0.0,0.0,172.0,absence
5,56.0,male,120.0,236.0,atypical angina,0.0,0.0,178.0,absence
7,57.0,female,120.0,354.0,asymptomatic,1.0,0.0,163.0,absence
...,...,...,...,...,...,...,...,...,...
288,56.0,male,130.0,221.0,atypical angina,0.0,0.0,163.0,absence
289,56.0,male,120.0,240.0,atypical angina,0.0,0.0,169.0,absence
291,55.0,female,132.0,342.0,atypical angina,0.0,0.0,166.0,absence
295,41.0,male,120.0,157.0,atypical angina,0.0,0.0,182.0,absence


In [8]:
# cholesterol levels for patients with heart disease
chol_hd = yes_hd.chol
chol_hd

1      286.0
2      229.0
6      268.0
8      254.0
9      203.0
       ...  
297    241.0
298    264.0
299    193.0
300    131.0
301    236.0
Name: chol, Length: 139, dtype: float64

In general, total cholesterol over 240 mg/dl is considered “high” (and therefore unhealthy). Calculate the mean cholesterol level for patients who were diagnosed with heart disease.

In [9]:
# mean cholesterol levels
mean_hd = np.mean(chol_hd)
print(f'The mean cholesterol level for patients who were diagnosed with heart disease is {mean_hd}')

The mean cholesterol level for patients who were diagnosed with heart disease is 251.4748201438849


This value (about 251 mg/dl) is above the 240 mg/dl threshold for 'high' cholesterol.

Do people with heart disease have high cholesterol levels (greater than or equal to 240 mg/dl) on average? We will test the following null and alternative hypotheses:

* Null: People with heart disease have an average cholesterol level equal to 240 mg/dl
* Alternative: People with heart disease have an average cholesterol level that is greater than 240 mg/dl



By default, ttest_1samp runs a two-sided test. However, since we calculated earlier that the average cholesterol level for heart disease patients is greater than 240 mg/dl, we can obtain the p-value for the one-sided test indicated above simply by dividing the two-sided p-value in half.
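
The halving shortcut can be checked against a directly requested one-sided test. The sketch below uses a synthetic sample (not the notebook's data) and assumes a SciPy version in which `ttest_1samp` accepts the `alternative` argument (1.6 or later).

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic sample (not the notebook's data) whose mean, 250, lies above the null value
sample = np.linspace(200.0, 300.0, 50)

# Two-sided test, then halve the p-value; this is only valid when tstat > 0
tstat, p_two_sided = ttest_1samp(sample, 240)
p_halved = p_two_sided / 2

# One-sided test requested directly
_, p_greater = ttest_1samp(sample, 240, alternative="greater")

print(tstat > 0, np.isclose(p_halved, p_greater))
```

If the sample mean were below the null value, the halved p-value would no longer match the 'greater' test, so it is worth checking the sign of the test statistic before halving.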

In [10]:
# p-value for one-sided test - patients with heart disease
tstat, pval = ttest_1samp(chol_hd, 240) # chol_hd: cholesterol levels for patients with heart disease, 240: null value

In [11]:
# significance threshold 0.05
print(pval/2)

0.0035411033905155707


This is less than 0.05, suggesting that heart disease patients have an average cholesterol level significantly higher than 240 mg/dl.
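
For reference, the statistic behind this kind of test is $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$, with $s$ the sample standard deviation. The sketch below computes it by hand on a synthetic sample (not the notebook's data) and compares the result with `ttest_1samp`; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Synthetic sample (not the notebook's data)
sample = np.linspace(200.0, 300.0, 50)
null_value = 240
n = len(sample)

# t = (sample mean - null value) / (sample std / sqrt(n)), using ddof=1
t_manual = (sample.mean() - null_value) / (sample.std(ddof=1) / np.sqrt(n))

# One-sided p-value: upper-tail area of the t distribution with n - 1 degrees of freedom
p_one_sided = t.sf(t_manual, df=n - 1)

t_scipy, p_two_sided = ttest_1samp(sample, null_value)
print(np.isclose(t_manual, t_scipy), np.isclose(p_one_sided, p_two_sided / 2))
```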

We will run the same hypothesis test, but for patients in the sample who were **not** diagnosed with heart disease.

In [12]:
# cholesterol levels for patients without heart disease
chol_no_hd = no_hd.chol
# mean cholesterol levels
mean_no_hd = np.mean(chol_no_hd)
print(f'The mean cholesterol level for patients who were not diagnosed with heart disease is {mean_no_hd}')

# p-value for one-sided test - patients without heart disease
tstat, pval = ttest_1samp(chol_no_hd, 240)

# significance threshold 0.05
print(pval/2)

The mean cholesterol level for patients who were not diagnosed with heart disease is 242.640243902439
0.26397120232220506


The p-value here (0.264) is greater than 0.05, so we cannot conclude that patients without heart disease have an average cholesterol level significantly above 240 mg/dl.

In [13]:
# percentage of patients diagnosed with heart disease whose cholesterol level is at or above 240 mg/dl
per_chol_hd = np.sum(chol_hd >= 240)/len(chol_hd)
print(per_chol_hd*100,'%')

# percentage of patients not diagnosed with heart disease whose cholesterol level is at or above 240 mg/dl
per_chol_no_hd = np.sum(chol_no_hd >= 240)/len(chol_no_hd)
print(per_chol_no_hd*100,'%')


57.55395683453237 %
46.34146341463415 %
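
A share like this can also be written as the mean of a boolean array, since True counts as 1 and False as 0. The sketch below uses only the first five chol_hd values displayed earlier, so its result is not the full-sample percentage; the names `levels`, `share_sum` and `share_mean` are illustrative.

```python
import numpy as np

# First five chol_hd values shown earlier (not the full sample)
levels = np.array([286.0, 229.0, 268.0, 254.0, 203.0])

# Count of values at or above the threshold divided by the sample size
share_sum = np.sum(levels >= 240) / len(levels)

# Equivalent: the mean of the boolean mask
share_mean = np.mean(levels >= 240)

print(f"{share_mean * 100:.1f} %")
```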


Cholesterol, when present in the blood in excess (hypercholesterolemia), is one of the main risk factors for cardiovascular disease, which can lead to heart attack or stroke, two of the leading causes of death in Europe.

Because it is not clear how cholesterol is deposited under the arterial epithelium, there is disagreement in the scientific community as to the relationship between cholesterol and the development of cardiovascular disease. **Malcolm Kendrick** argues that cholesterol is not a risk factor, but only an indicator. According to this hypothesis, when arteries are injured, cholesterol acts as a healing agent, so elevated levels reveal the poor condition of the arteries.

For example, the dataset includes a 57-year-old female patient with a cholesterol level of 354 mg/dl who is asymptomatic and was not diagnosed with heart disease. Perhaps this patient's extremely high cholesterol level reflects the poor condition of her arteries.

* **Fasting Blood Sugar Analysis**

In [14]:
# number of total patients
num_patients = len(heart)
print(f'The total number of patients is {num_patients}')

The total number of patients is 303


The fbs column of this dataset indicates whether or not a patient’s fasting blood sugar was greater than 120 mg/dl (1 means that their fasting blood sugar was greater than 120 mg/dl; 0 means it was less than or equal to 120 mg/dl).

Calculate the number of patients with fasting blood sugar greater than 120.

In [15]:
# number of patients with fasting blood sugar greater than 120 mg/dl
num_highfbs_patients = np.sum(heart.fbs == 1)
print(f'The number of patients with fasting blood sugar greater than 120 mg/dl is {num_highfbs_patients}')

The number of patients with fasting blood sugar greater than 120 mg/dl is 45


Sometimes, part of an analysis will involve comparing a sample to known population values to see if the sample appears to be representative of the general population.

By some estimates, about 8% of the U.S. population had diabetes (diagnosed or undiagnosed) in 1988 when this data was collected. While there are multiple tests that contribute to a diabetes diagnosis, fasting blood sugar levels greater than 120 mg/dl can be indicative of diabetes (or at least, pre-diabetes). If this sample were representative of the population, approximately how many people would you expect to have diabetes?

In [16]:
# expected number of patients with diabetes if the sample matched the 8% population rate
num_diabetes = int(num_patients*.08)
print(f'If the {num_patients} patients were representative of the US population, we would expect approximately {num_diabetes} of them to have diabetes')

If the 303 patients were representative of the US population, we would expect approximately 24 of them to have diabetes


This expected count (24) is roughly half the number of patients with fbs > 120 mg/dl actually observed in the sample (45).

Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%? Test the following null and alternative hypotheses:

* Null: This sample was drawn from a population where 8% of people have fasting blood sugar > 120 mg/dl
* Alternative: This sample was drawn from a population where more than 8% of people have fasting blood sugar > 120 mg/dl

In [17]:
# test whether the population rate of fasting blood sugar > 120 mg/dl is 8% (one-sided: greater)
# significance threshold 0.05
p_val = binom_test(num_highfbs_patients,num_patients,.08, alternative = 'greater')
print(p_val)

4.689471951448875e-05



Because p_val ≈ 0.0000469 is far below the 0.05 significance threshold, we reject the null hypothesis and conclude that this sample was drawn from a population where the rate of fasting blood sugar > 120 mg/dl is greater than 8%.
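
In recent SciPy releases, `binom_test` has been deprecated and then removed in favor of `binomtest` (available since SciPy 1.7). The sketch below is one way the same one-sided test could be written with the newer function, using the counts reported above (45 of 303); it also cross-checks the result against the binomial upper-tail probability. The names `k`, `n`, `p0` and `p_direct` are illustrative.

```python
from scipy.stats import binom, binomtest

# Counts reported above: 45 of 303 patients with fbs > 120 mg/dl; null rate 8%
k, n, p0 = 45, 303, 0.08

# binomtest returns a result object; the p-value is an attribute
result = binomtest(k, n, p0, alternative="greater")

# Cross-check: P(X >= k) under the null, computed from the survival function
p_direct = binom.sf(k - 1, n, p0)

print(result.pvalue, p_direct)
```

Unlike `binom_test`, which returned the p-value directly, `binomtest` returns an object, so the p-value has to be read from `result.pvalue`.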