# Heart Disease Research Part I

In this project, you’ll investigate data from a sample of patients who were evaluated for heart disease at the Cleveland Clinic Foundation. The data was downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/45/heart+disease) and then cleaned for analysis. The principal investigators responsible for data collection were:
1. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

In [1]:
import pandas as pd
import numpy as np

# load data
heart = pd.read_csv('heart_disease.csv')
yes_hd = heart[heart.heart_disease == 'presence']
no_hd = heart[heart.heart_disease == 'absence']

#### 1.

The full dataset has been loaded for you as `heart`, then split into two subsets:
- `yes_hd`, which contains data for patients **with** heart disease
- `no_hd`, which contains data for patients **without** heart disease


For this project, we’ll investigate the following variables:
- `chol`: serum cholesterol in mg/dl
- `fbs`: An indicator for whether fasting blood sugar is greater than 120 mg/dl (`1` = true; `0` = false)

To start, we’ll investigate cholesterol levels for patients with heart disease. Use the dataset `yes_hd` to save cholesterol levels for patients with heart disease as a variable named `chol_hd`.

In [3]:
chol_hd = yes_hd.chol

#### 2.

- In general, total cholesterol over 240 mg/dl is considered “high” (and therefore unhealthy). 
- Calculate the mean cholesterol level for patients who were diagnosed with heart disease and print it out. 
- Is it higher than 240 mg/dl?

In [7]:
chol_hd.mean()

np.float64(251.4748201438849)

#### 3.

- Do people with heart disease have high cholesterol levels (greater than 240 mg/dl) on average? 
- Import the function from `scipy.stats` that you can use to test the following null and alternative hypotheses:
    - Null: People with heart disease have an average cholesterol level equal to 240 mg/dl
    - Alternative: People with heart disease have an average cholesterol level that is greater than 240 mg/dl

In [None]:
from scipy.stats import ttest_1samp

# tstat, pval = ttest_1samp(Sample Distribution, Expected Population Mean, alternative='greater')
tstat, pval = ttest_1samp(chol_hd, 240, alternative='greater')

#### 4.

- Run the hypothesis test indicated in task 3 and print out the p-value. 
- Can you conclude that heart disease patients have an average cholesterol level significantly greater than 240 mg/dl? 
- Use a significance threshold of 0.05.

In [19]:
pval

np.float64(0.0035411033905155707)

- `ttest_1samp` takes two required inputs: 
    - the sample of values (in this case, the cholesterol levels for patients with heart disease) and 
    - the null value (in this case, 240). 
- The optional `alternative='greater'` argument makes the test one-sided.
- It has two outputs: the t-statistic and a p-value.
- Because the test was run with `alternative='greater'`, the returned p-value is already one-sided, so there is no need to divide it by two. It should be approximately `0.0035`. 
- This is less than 0.05, suggesting that heart disease patients have an average cholesterol level significantly higher than 240 mg/dl.
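To make the relationship between the t-statistic, the one-sided p-value, and the two-sided p-value concrete, here is a minimal sketch that computes the test by hand and compares it with `ttest_1samp`. The `sample` array holds made-up cholesterol readings chosen for illustration; it is not taken from `heart_disease.csv`.

```python
import numpy as np
from scipy import stats

# Hypothetical cholesterol readings (mg/dl); NOT values from heart_disease.csv
sample = np.array([262.0, 231.0, 289.0, 254.0, 247.0, 301.0, 238.0, 266.0])
null_mean = 240

# t = (sample mean - null value) / (sample standard deviation / sqrt(n))
n = len(sample)
t_manual = (sample.mean() - null_mean) / (sample.std(ddof=1) / np.sqrt(n))

# One-sided p-value: upper-tail area of a t distribution with n - 1 degrees of freedom
p_manual = stats.t.sf(t_manual, df=n - 1)

# The same test through scipy, one-sided and two-sided
t_one, p_one = stats.ttest_1samp(sample, null_mean, alternative='greater')
t_two, p_two = stats.ttest_1samp(sample, null_mean)

print(t_manual, t_one)   # the two t-statistics agree
print(p_manual, p_one)   # so do the one-sided p-values
print(p_two / 2)         # half the two-sided p-value matches, because t > 0
```

The halving shortcut only matches the `'greater'` test when the t-statistic is positive, which is why passing `alternative` explicitly is the less error-prone choice.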

#### 5.

- Repeat steps 1-4 in order to run the same hypothesis test, but for patients in the sample who were **not** diagnosed with heart disease. 
- Do patients without heart disease have average cholesterol levels significantly above 240 mg/dl?

In [12]:
chol_no_hd = no_hd.chol
chol_no_hd.mean()

np.float64(242.640243902439)

In [20]:
# tstat, pval = ttest_1samp(Sample Distribution, Expected Population Mean, alternative='greater')
tstat, pval = ttest_1samp(chol_no_hd, 240, alternative='greater')

pval

np.float64(0.26397120232220506)

- We got a p-value of `0.264`, which is greater than 0.05. Therefore, we cannot conclude that patients without heart disease have average cholesterol levels significantly above 240 mg/dl.
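The decision rule used in tasks 4 and 5 can be written down as a small helper. This is only a sketch: `reject_null` and `ALPHA` are hypothetical names introduced here, and the two p-values are copied from the cell outputs above.

```python
ALPHA = 0.05  # significance threshold used throughout this project

def reject_null(pval, alpha=ALPHA):
    """Return True when the p-value falls below the significance threshold."""
    return pval < alpha

# p-values reported in the cells above
pval_hd = 0.0035411033905155707     # patients with heart disease
pval_no_hd = 0.26397120232220506    # patients without heart disease

print(reject_null(pval_hd))      # True: reject the null for the heart disease group
print(reject_null(pval_no_hd))   # False: fail to reject for the other group
```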

#### 6.

- Let’s now return to the full dataset (saved as `heart`). 
- How many patients are there in this dataset? 
- Save the number of patients as `num_patients` and print it out.

In [16]:
num_patients = len(heart)
num_patients

303

#### 7.

- Remember that the `fbs` column of this dataset indicates whether or not a patient’s fasting blood sugar was greater than 120 mg/dl (`1` means that their fasting blood sugar was greater than 120 mg/dl; `0` means it was less than or equal to 120 mg/dl).
- Calculate the number of patients with fasting blood sugar greater than 120. 
- Save this number as `num_highfbs_patients` and print it out.

In [22]:
heart.head()

Unnamed: 0,age,sex,trestbps,chol,cp,exang,fbs,thalach,heart_disease
0,63.0,male,145.0,233.0,typical angina,0.0,1.0,150.0,absence
1,67.0,male,160.0,286.0,asymptomatic,1.0,0.0,108.0,presence
2,67.0,male,120.0,229.0,asymptomatic,1.0,0.0,129.0,presence
3,37.0,male,130.0,250.0,non-anginal pain,0.0,0.0,187.0,absence
4,41.0,female,130.0,204.0,atypical angina,0.0,0.0,172.0,absence


In [23]:
num_highfbs_patients = len(heart[heart.fbs == 1])
num_highfbs_patients

45
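Filtering and then calling `len` is one way to count rows. As an alternative sketch, a boolean mask can be summed directly, since `True` counts as 1. The `toy` DataFrame below is a hypothetical stand-in for `heart`, used only so the example runs on its own.

```python
import pandas as pd

# Hypothetical miniature dataset; only the fbs column matters here
toy = pd.DataFrame({'fbs': [1.0, 0.0, 0.0, 1.0, 0.0, 1.0]})

count_by_filter = len(toy[toy.fbs == 1])   # filter the rows, then count them
count_by_sum = int((toy.fbs == 1).sum())   # True counts as 1, so the sum is the count
share_high = (toy.fbs == 1).mean()         # the mean of the mask is the proportion

print(count_by_filter, count_by_sum, share_high)
```

Both counts agree; the `mean()` form is handy when the proportion, rather than the count, is what you want to compare against a population rate.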

#### 8.

- Sometimes, part of an analysis will involve comparing a sample to known population values to see if the sample appears to be representative of the general population.
- By some estimates, about 8% of the U.S. population had diabetes (diagnosed or undiagnosed) in 1988 when this data was collected. 
- While there are multiple tests that contribute to a diabetes diagnosis, fasting blood sugar levels greater than 120 mg/dl can be indicative of diabetes (or at least, pre-diabetes). 
- If this sample were representative of the population, approximately how many people would you expect to have diabetes? 
- Calculate and print out this number.
- Is this value similar to, or different from, the number of patients with a fasting blood sugar above 120 mg/dl?

In [24]:
# About 8% of the U.S. population had diabetes in 1988
# fbs > 120 mg/dl can be indicative of diabetes (or pre-diabetes)

len(heart) * 0.08

24.240000000000002
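To answer the comparison question, it may help to put the expected count next to the observed count and the observed rate next to the 8% estimate. The sketch below hard-codes the numbers printed in tasks 6 and 7; `diabetes_rate`, `expected_count`, and `observed_rate` are names introduced here for illustration.

```python
num_patients = 303          # printed in task 6
num_highfbs_patients = 45   # printed in task 7
diabetes_rate = 0.08        # estimated U.S. rate quoted in the task

expected_count = num_patients * diabetes_rate
observed_rate = num_highfbs_patients / num_patients

print(round(expected_count, 2))   # 24.24 patients expected
print(round(observed_rate, 4))    # 0.1485, i.e. about 14.9% observed
```

The observed count (45) is nearly double the expected count, which is what motivates the formal test in the next task.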

#### 9.

- Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%? 
- Import the function from `scipy.stats` that you can use to test the following null and alternative hypotheses:
    - **Null**: This sample was drawn from a population where 8% of people have fasting blood sugar > 120 mg/dl
    - **Alternative**: This sample was drawn from a population where more than 8% of people have fasting blood sugar > 120 mg/dl

In [34]:
from scipy.stats import binomtest

# Parameters:
# k: number of observed successes
# n: number of total trials
# p: expected probability of success. Default is 0.5
# alternative: 'two-sided', 'greater', or 'less'. Default is 'two-sided'

p_value = binomtest(num_highfbs_patients, len(heart), 0.08, alternative='greater')
p_value

BinomTestResult(k=45, n=303, alternative='greater', statistic=0.1485148514851485, pvalue=4.689471951448875e-05)

#### 10.

- Run the hypothesis test indicated in task 9 and print out the p-value. 
- Using a significance threshold of 0.05, can you conclude that this sample was drawn from a population where the rate of fasting blood sugar > 120 mg/dl is significantly greater than 8%?

In [32]:
p_value.pvalue

np.float64(4.689471951448875e-05)

- The `binomtest()` function takes four parameters (in order):
    - The observed number of “successes” (in this case, the number of people in the sample who had fasting blood sugar greater than 120 mg/dl)
    - The number of “trials” (in this case, the number of patients)
    - The null probability of “success” (in this case, 0.08)
    - The `alternative` parameter, which indicates the alternative hypothesis for the test (e.g., `'two-sided'`, `'greater'`, or `'less'`)
- The output is a `BinomTestResult` object, which contains:
    - `k`: the number of successes used
    - `n`: the number of trials used
    - `alternative`: the alternative hypothesis used
    - `statistic`: the observed proportion of successes (`k / n`)
    - `pvalue`: the p-value of the test
- If you run the test correctly, you should get a p-value of `4.689471951448875e-05`, which is approximately `0.0000469` (the `e-05` at the end is scientific notation for "times 10 to the power of -5"). This is less than 0.05, indicating that this sample likely comes from a population where more than 8% of people have fbs > 120 mg/dl.
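For intuition about what the one-sided binomial test computes, the p-value is the probability of seeing 45 or more "successes" in 303 trials if the true rate were 8%. The sketch below reproduces that tail probability with `scipy.stats.binom` and compares it with `binomtest`; the variable names are illustrative, and the inputs are the values shown in the cells above.

```python
from scipy.stats import binom, binomtest

k, n, p_null = 45, 303, 0.08   # values from the cells above

# P(X >= k) under the null; sf(k - 1) is P(X > k - 1), which equals P(X >= k)
p_manual = binom.sf(k - 1, n, p_null)

result = binomtest(k, n, p_null, alternative='greater')

print(p_manual)        # tail probability computed by hand
print(result.pvalue)   # p-value reported by binomtest
print(k / n)           # observed proportion of successes
```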