# Heart Disease Research Part I

In this project, you’ll investigate some data from a sample of patients who were evaluated for heart disease at the Cleveland Clinic Foundation.

In [26]:
from scipy.stats import ttest_1samp, binom_test
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

### Cholesterol Analysis 

In [27]:
heart = pd.read_csv('heart_disease.csv')
heart.head()

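|   | age | sex | trestbps | chol | cp | exang | fbs | thalach | heart_disease |
|---|-----|-----|----------|------|----|-------|-----|---------|---------------|
| 0 | 63.0 | male | 145.0 | 233.0 | typical angina | 0.0 | 1.0 | 150.0 | absence |
| 1 | 67.0 | male | 160.0 | 286.0 | asymptomatic | 1.0 | 0.0 | 108.0 | presence |
| 2 | 67.0 | male | 120.0 | 229.0 | asymptomatic | 1.0 | 0.0 | 129.0 | presence |
| 3 | 37.0 | male | 130.0 | 250.0 | non-anginal pain | 0.0 | 0.0 | 187.0 | absence |
| 4 | 41.0 | female | 130.0 | 204.0 | atypical angina | 0.0 | 0.0 | 172.0 | absence |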


- chol: serum cholesterol in mg/dl
- fbs: An indicator for whether fasting blood sugar is greater than 120 mg/dl (1 = true; 0 = false)

In [28]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            303 non-null    object 
 2   trestbps       303 non-null    float64
 3   chol           303 non-null    float64
 4   cp             303 non-null    object 
 5   exang          303 non-null    float64
 6   fbs            303 non-null    float64
 7   thalach        303 non-null    float64
 8   heart_disease  303 non-null    object 
dtypes: float64(6), object(3)
memory usage: 21.4+ KB


In [29]:
heart['heart_disease'].unique()

array(['absence', 'presence'], dtype=object)

In [37]:
yes_hd = heart[heart['heart_disease'] == 'presence']
no_hd = heart[heart['heart_disease'] == 'absence']
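
As a quick cross-check on the split, the same two groups can be summarized with a `groupby`. The sketch below is illustrative only: it rebuilds a tiny DataFrame from the five rows printed by `heart.head()` above instead of reading `heart_disease.csv`, so the numbers describe those five rows, not the full sample. The names `sample`, `yes_hd_sample`, `no_hd_sample`, and `summary` are hypothetical.

```python
import pandas as pd

# Only the five rows shown by heart.head() above (not the full 303-row sample)
sample = pd.DataFrame({
    'chol': [233.0, 286.0, 229.0, 250.0, 204.0],
    'heart_disease': ['absence', 'presence', 'presence', 'absence', 'absence'],
})

# Boolean-mask split, as in the notebook
yes_hd_sample = sample[sample['heart_disease'] == 'presence']
no_hd_sample = sample[sample['heart_disease'] == 'absence']

# Equivalent one-line summary: group sizes and mean cholesterol per group
summary = sample.groupby('heart_disease')['chol'].agg(['count', 'mean'])
print(summary)
```

The group sizes in `summary` should match `len(yes_hd_sample)` and `len(no_hd_sample)`, which is a cheap way to confirm that the two masks cover every row.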

- To start, we’ll investigate cholesterol levels for patients with heart disease.

In [38]:
chol_hd = yes_hd['chol']
chol_hd.head()

1    286.0
2    229.0
6    268.0
8    254.0
9    203.0
Name: chol, dtype: float64

In [39]:
chol_hd_mean = np.mean(chol_hd)
chol_hd_mean

251.4748201438849

The mean is higher than 240 mg/dl. Total cholesterol over 240 mg/dl is considered “high” (and therefore unhealthy).

#### Run a one-sided, one-sample t-test

Do people with heart disease have high cholesterol levels (greater than or equal to 240 mg/dl) on average? Import the function from scipy.stats that you can use to test the following null and alternative hypotheses:

- Null: People with heart disease have an average cholesterol level equal to 240 mg/dl
- Alternative: People with heart disease have an average cholesterol level that is greater than 240 mg/dl

In [40]:
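# ttest_1samp returns a two-sided p-value; halving it gives the one-sided
# p-value for "mean > 240" because the sample mean is above 240
tstat, pval = ttest_1samp(chol_hd, 240)
pval/2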

0.0035411033905155707

The p-value (0.0035) is less than the significance threshold (0.05), suggesting that heart disease patients have an average cholesterol level significantly higher than 240 mg/dl.
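
Halving the two-sided p-value only gives the one-sided result when the t statistic points in the direction of the alternative (here, positive). A minimal sketch of that equivalence follows, assuming a SciPy version (1.6 or later) where `ttest_1samp` accepts an `alternative` argument. It uses only the first five `chol_hd` values printed above, so its p-values are illustrative and will not match the full-sample result.

```python
import numpy as np
from scipy.stats import ttest_1samp

# First five chol_hd values printed above; illustrative only, not the full group
chol_sample = np.array([286.0, 229.0, 268.0, 254.0, 203.0])

# Two-sided test, then halve the p-value (valid only when the t statistic is positive)
tstat, p_two_sided = ttest_1samp(chol_sample, 240)
p_halved = p_two_sided / 2

# One-sided test requested directly (assumes SciPy >= 1.6, which added `alternative`)
_, p_greater = ttest_1samp(chol_sample, 240, alternative='greater')

print(tstat > 0, np.isclose(p_halved, p_greater))
```

If the t statistic were negative, the one-sided p-value for "greater" would instead be `1 - p_two_sided / 2`, which is why passing `alternative='greater'` is the safer habit.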

- Run the same hypothesis test, but for patients in the sample who were not diagnosed with heart disease.

In [41]:
chol_no_hd = no_hd['chol']
chol_no_hd.head()

0    233.0
3    250.0
4    204.0
5    236.0
7    354.0
Name: chol, dtype: float64

In [42]:
chol_no_hd_mean = np.mean(chol_no_hd)
chol_no_hd_mean

242.640243902439

In [43]:
# Halve the two-sided p-value for the one-sided alternative "mean > 240"
tstat, pval = ttest_1samp(chol_no_hd, 240)
pval/2

0.26397120232220506

- The mean cholesterol level (242.64 mg/dl) is slightly higher than 240 mg/dl (total cholesterol over 240 mg/dl is considered “high” and therefore unhealthy).
- The p-value (0.264) is higher than the significance threshold (0.05), so there is no significant evidence that patients without heart disease have an average cholesterol level above 240 mg/dl.
- Because the null hypothesis is not rejected here, the possible mistake is a Type II error (false negative): failing to detect a difference that really exists.
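
A Type II error is a false negative: the test fails to reject the null even though the true mean really is above 240 mg/dl. The simulation below is a rough sketch of how often that could happen for a difference this small. The population standard deviation (50 mg/dl), the group size (164), and the number of simulations are assumptions chosen for illustration, not values reported in the notebook; only the mean of 242.64 comes from the output above.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)

# Hypothetical population: true mean slightly above 240, as observed in this group.
# The standard deviation (50) and group size (164) are assumptions for illustration.
true_mean, assumed_sd, n, n_sims = 242.64, 50.0, 164, 2000

# Each row is one simulated sample; run the one-sided test on every row at once
samples = rng.normal(true_mean, assumed_sd, size=(n_sims, n))
_, pvals = ttest_1samp(samples, 240, axis=1, alternative='greater')

# Power = share of simulations that reject the null; Type II rate = the rest
power = np.mean(pvals < 0.05)
type_ii_rate = 1 - power
print(f"estimated power: {power:.2f}, estimated Type II error rate: {type_ii_rate:.2f}")
```

Under these assumptions the test rejects the null only in a minority of simulated samples, so a non-significant result like the one above does not show that the true mean is at or below 240 mg/dl.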

### Fasting Blood Sugar Analysis 

In [45]:
num_patients = len(heart)
num_patients

303

Calculate the number of patients with fasting blood sugar greater than 120 mg/dl.

In [46]:
num_highfbs_patients = len(heart[heart['fbs'] == 1])
num_highfbs_patients

45

About 8% of the U.S. population had diabetes (diagnosed or undiagnosed) in 1988 when this data was collected. While there are multiple tests that contribute to a diabetes diagnosis, fasting blood sugar levels greater than 120 mg/dl can be indicative of diabetes (or at least, pre-diabetes). If this sample were representative of the population, approximately how many people would you expect to have diabetes? Calculate and print out this number.

In [48]:
0.08 * num_patients

24.240000000000002

This comes out to approximately 24 patients, roughly half the number with fbs > 120 in the sample (45).
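
To get a feel for how unusual 45 is before running a formal test, the gap between the observed and expected counts can be expressed in binomial standard deviations. This is only a back-of-the-envelope sketch using the normal approximation, with the counts taken from the cells above; the variable names are hypothetical.

```python
import math

n, p, observed = 303, 0.08, 45  # values from the notebook above

expected = n * p                  # expected count under the 8% rate
sd = math.sqrt(n * p * (1 - p))   # binomial standard deviation
z = (observed - expected) / sd    # rough distance in standard deviations

print(f"expected={expected:.2f}, sd={sd:.2f}, z={z:.2f}")
```

A gap of more than four standard deviations hints that the sample rate is well above 8%, which the exact binomial test can confirm.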

#### Run a binomial test

Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%? Import the function from scipy.stats that you can use to test the following null and alternative hypotheses:

- Null: This sample was drawn from a population where 8% of people have fasting blood sugar > 120 mg/dl
- Alternative: This sample was drawn from a population where more than 8% of people have fasting blood sugar > 120 mg/dl

In [50]:
p_value = binom_test(num_highfbs_patients, num_patients, 0.08, alternative = 'greater')
p_value

4.689471951449078e-05

The p-value (0.0000469) is less than the significance threshold (0.05), indicating that this sample likely comes from a population where more than 8% of people have fbs > 120 mg/dl.
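
For reference, the one-sided binomial p-value is the probability of seeing 45 or more patients with high fasting blood sugar out of 303 when the true rate is 8%. The sketch below computes that tail directly from the binomial survival function and, where available, with `binomtest`, which newer SciPy releases provide in place of `binom_test`. The `try`/`except` fallback is an assumption about which SciPy version is installed.

```python
from scipy.stats import binom

n, p_null, observed = 303, 0.08, 45  # values from the notebook above

# P(X >= 45) under Binomial(303, 0.08); sf(k) is P(X > k), hence observed - 1
p_value_sf = binom.sf(observed - 1, n, p_null)

try:
    # Newer interface (assumes a SciPy version that provides binomtest)
    from scipy.stats import binomtest
    p_value_bt = binomtest(observed, n, p_null, alternative='greater').pvalue
except ImportError:
    p_value_bt = None

print(p_value_sf, p_value_bt)
```

Both values should agree with the p-value reported above, since the "greater" alternative is exactly this upper-tail probability.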