## Heart Disease Analysis

This project analyses a CSV file taken from Codecademy. It is described as:

"data from a sample patients who were evaluated for heart disease at the Cleveland Clinic Foundation. The data was downloaded from the UCI Machine Learning Repository and then cleaned for analysis. The principal investigators responsible for data collection were:

Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D."

My goal is to use the CSV file to practise one-sample t-tests, binomial tests, hypothesis formulation and significance thresholds.

In [25]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_1samp
from scipy.stats import binomtest

In [2]:
heart = pd.read_csv('heart_disease_practice.csv')
yes_hd = heart[heart.heart_disease == 'presence']
no_hd = heart[heart.heart_disease == 'absence']

Within this data set I will be looking specifically at *chol* (serum cholesterol in mg/dl), *fbs* (an indicator for whether fasting blood sugar is greater than 120 mg/dl; 1 = true, 0 = false) and *heart_disease*. The data frame *heart* is split into two subsets: *yes_hd* contains all the individuals for whom *heart_disease* is present, and *no_hd* contains those for whom it is absent.

In [18]:
heart

| Unnamed: 0 | age | sex | trestbps | chol | cp | exang | fbs | thalach | heart_disease |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | male | 145.0 | 233.0 | typical angina | 0.0 | 1.0 | 150.0 | absence |
| 1 | 67.0 | male | 160.0 | 286.0 | asymptomatic | 1.0 | 0.0 | 108.0 | presence |
| 2 | 67.0 | male | 120.0 | 229.0 | asymptomatic | 1.0 | 0.0 | 129.0 | presence |
| 3 | 37.0 | male | 130.0 | 250.0 | non-anginal pain | 0.0 | 0.0 | 187.0 | absence |
| 4 | 41.0 | female | 130.0 | 204.0 | atypical angina | 0.0 | 0.0 | 172.0 | absence |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 45.0 | male | 110.0 | 264.0 | typical angina | 0.0 | 0.0 | 132.0 | presence |
| 299 | 68.0 | male | 144.0 | 193.0 | asymptomatic | 0.0 | 1.0 | 141.0 | presence |
| 300 | 57.0 | male | 130.0 | 131.0 | asymptomatic | 1.0 | 0.0 | 115.0 | presence |
| 301 | 57.0 | female | 130.0 | 236.0 | atypical angina | 0.0 | 0.0 | 174.0 | presence |


In [19]:
num_patients = len(heart)
num_patients

303

#### Taking a dive into the cholesterol column, chol

First I want to isolate *chol* from both dataframes (creating two series, chol_hd and chol_no_hd) to see how the cholesterol levels of those with heart disease differ from those of people who don't have *heart_disease*.

In general, total cholesterol over 240 mg/dl is considered “high” (and therefore unhealthy). Let's calculate the mean cholesterol level for patients who were diagnosed with heart disease and print it out. Is it higher than 240 mg/dl?

In [4]:
chol_hd = yes_hd['chol']
print(chol_hd)
chol_hd_mean = chol_hd.mean()
chol_hd_mean

1      286.0
2      229.0
6      268.0
8      254.0
9      203.0
       ...  
297    241.0
298    264.0
299    193.0
300    131.0
301    236.0
Name: chol, Length: 139, dtype: float64


251.4748201438849

Indeed, the mean cholesterol is above 240 mg/dl, but we should run a one-sided t-test and check the result against a significance threshold to see whether this difference is statistically significant.

First we must ask a question

Do people with heart disease have high cholesterol levels (greater than or equal to 240 mg/dl) on average?

**Null**: People with heart disease have an average cholesterol level equal to 240 mg/dl

**Alternative**: People with heart disease have an average cholesterol level that is greater than 240 mg/dl
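The statistic that a one-sample t-test computes for hypotheses like these can be sketched by hand. The block below is a minimal illustration using only the standard library. The five cholesterol values are made up for the example and are not rows from the dataset, and `one_sample_t` is a hypothetical helper name.

```python
import math
import statistics


def one_sample_t(sample, popmean):
    """Return the one-sample t statistic: (mean - popmean) / (s / sqrt(n))."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    # statistics.stdev uses the n - 1 denominator (sample standard deviation)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (sample_mean - popmean) / standard_error


# Hypothetical values in mg/dl, NOT rows from heart_disease_practice.csv
toy_chol = [250, 260, 230, 270, 240]
t_stat = one_sample_t(toy_chol, 240)
print(t_stat)
```

A positive statistic means the sample mean sits above the hypothesised value, which is the direction the alternative hypothesis cares about.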

In [13]:
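# ttest_1samp returns a two-sided p-value by default
tstat, pval = ttest_1samp(chol_hd, 240)
# Halving gives the one-sided ("greater than") p-value because the
# sample mean is above 240, so the t statistic is positive
one_sided_pval = pval / 2
one_sided_pval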

0.0035411033905155707

one_sided_pval is around 0.0035, which is less than the significance threshold of 0.05. We therefore reject the null hypothesis: heart disease patients have an average cholesterol level significantly greater than 240 mg/dl.
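Halving the two-sided p-value gives the "greater than" p-value only when the t statistic is positive, as it is here because the sample mean (about 251.5) is above 240. The helper below is a sketch of the general conversion. `one_sided_greater_pval` is a hypothetical name, and the inputs in the usage lines are illustrative numbers rather than outputs from this notebook.

```python
def one_sided_greater_pval(tstat, two_sided_pval):
    """Convert a two-sided t-test p-value to a one-sided ('greater') p-value.

    Assumes a symmetric null distribution, which holds for the t distribution.
    """
    if tstat > 0:
        # Sample mean is above the hypothesised mean: keep the upper tail
        return two_sided_pval / 2
    # Sample mean is at or below the hypothesised mean: the upper tail is the rest
    return 1 - two_sided_pval / 2


# Illustrative inputs (not notebook outputs)
upper_case = one_sided_greater_pval(2.0, 0.05)
lower_case = one_sided_greater_pval(-2.0, 0.05)
print(upper_case, lower_case)
```

Recent SciPy versions also accept `alternative='greater'` in `ttest_1samp`, which should avoid the manual step altogether.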

We also want to run the same test on chol for those without heart_disease in our sample.

In [14]:
chol_no_hd = no_hd['chol']
print(chol_no_hd)
chol_no_hd_mean = chol_no_hd.mean()
print(chol_no_hd_mean)

tstat, no_hd_pval = ttest_1samp(chol_no_hd, 240)
one_sided_no_hd_pval = no_hd_pval / 2
one_sided_no_hd_pval

0      233.0
3      250.0
4      204.0
5      236.0
7      354.0
       ...  
288    221.0
289    240.0
291    342.0
295    157.0
302    175.0
Name: chol, Length: 164, dtype: float64
242.640243902439


0.26397120232220506

The one_sided_no_hd_pval is 0.26, which is higher than our significance level of 0.05. This means we fail to reject the null hypothesis for those who do not have heart disease: there is no significant evidence that their average cholesterol level is greater than 240 mg/dl.


#### Analysing fbs (Fasting Blood Sugar)

The fbs column of this dataset indicates whether or not a patient’s fasting blood sugar was greater than 120 mg/dl (1 means that their fasting blood sugar was greater than 120 mg/dl; 0 means it was less than or equal to 120 mg/dl).

By taking the sum of the column we can see how many individuals have fasting blood sugar higher than 120 mg/dl.

In [22]:
num_highfbs_patients = np.sum(heart.fbs == 1)
num_highfbs_patients

45

By some estimates, about 8% of the U.S. population had diabetes (diagnosed or undiagnosed) in 1988 when this data was collected. While there are multiple tests that contribute to a diabetes diagnosis, fasting blood sugar levels greater than 120 mg/dl can be indicative of diabetes (or at least, pre-diabetes).

In [23]:
expected_diabetics_in_sample = 0.08 * len(heart)
expected_diabetics_in_sample

24.240000000000002

We should expect about 24 people in the sample to have diabetes, yet almost double that number have fbs > 120 mg/dl. We should check whether our sample could have been drawn from a population where 8% of people have fasting blood sugar > 120 mg/dl.
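The gap between the observed and expected counts can be put on a common scale with a little arithmetic. This is a small sketch that reuses the counts printed earlier in the notebook; `reference_rate`, `observed_rate` and `ratio` are names introduced here for illustration.

```python
num_patients = 303          # len(heart), from the notebook output
num_highfbs_patients = 45   # count of fbs == 1, from the notebook output
reference_rate = 0.08       # estimated 1988 U.S. diabetes rate quoted above

observed_rate = num_highfbs_patients / num_patients
expected_count = reference_rate * num_patients
ratio = num_highfbs_patients / expected_count

print(f"observed rate: {observed_rate:.4f}")
print(f"expected count: {expected_count:.2f}")
print(f"observed / expected: {ratio:.2f}")
```

The arithmetic alone does not say whether a gap of this size could arise by chance, which is what the hypothesis test is for.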

As each fbs value is either yes or no (1 or 0), we should use a binomial test to determine whether the observed count is consistent with the expected rate.

First we ask the question:

Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%?

Then our hypotheses:

**Null**: This sample was drawn from a population where 8% of people have fasting blood sugar > 120 mg/dl

**Alternative**: This sample was drawn from a population where more than 8% of people have fasting blood sugar > 120 mg/dl

In [28]:
result = binomtest(45, n=303, p=0.08, alternative='greater')
result.pvalue

4.689471951448875e-05

The p-value is about 4.69 × 10^-5, or 0.0000469, which is far below 0.05. We therefore reject the null hypothesis: the data suggest that our sample comes from a population where more than 8% of people have an fbs > 120 mg/dl.
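For a one-sided "greater" binomial test, the p-value is the probability of seeing at least the observed count if the null rate were true. The sketch below sums that upper tail directly with the standard library as a cross-check. `binom_upper_tail` is a hypothetical helper, not part of SciPy.

```python
import math


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p) by summing the exact pmf."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )


# 45 high-fbs patients out of 303, tested against a null rate of 8%
p_value = binom_upper_tail(45, 303, 0.08)
print(p_value)
```

The printed value should agree with the `binomtest` result of roughly 4.69e-05 reported above, up to floating-point rounding.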