
### Objective
### HYPOTHESIS TESTING: SIGNIFICANCE THRESHOLDS
### Heart Disease Research Part I
In this project, you’ll investigate some data from a sample of patients who were evaluated for heart disease at the Cleveland Clinic Foundation. The data was downloaded from the UCI Machine Learning Repository and then cleaned for analysis. The principal investigators responsible for data collection were:
1.	Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
2.	University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3.	University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4.	V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Note that a solution.py file is loaded for you in the workspace, which contains solution code for this project. We highly recommend that you complete the project on your own without checking the solution, but feel free to take a look if you get stuck or want to check your answers when you’re done!


### Cholesterol Analysis
<br>1. The full dataset has been loaded for you as `heart`, then split into two subsets:  
o	`yes_hd`, which contains data for patients with heart disease  
o	`no_hd`, which contains data for patients without heart disease  <br>
<br>For this project, we’ll investigate the following variables:  
o	`chol`: serum cholesterol in mg/dl  
o	`fbs`: An indicator for whether fasting blood sugar is greater than 120 mg/dl (1 = true; 0 = false)  

To start, we’ll investigate cholesterol levels for patients with heart disease. Use the dataset `yes_hd` to save cholesterol levels for patients with heart disease as a variable named `chol_hd`.

<br>2. In general, total cholesterol over 240 mg/dl is considered “high” (and therefore unhealthy). Calculate the mean cholesterol level for patients who were diagnosed with heart disease and print it out. Is it higher than 240 mg/dl?

<br>3. Do people with heart disease have high cholesterol levels (greater than or equal to 240 mg/dl) on average? Import the function from `scipy.stats` that you can use to test the following null and alternative hypotheses:

o	Null: People with heart disease have an average cholesterol level equal to 240 mg/dl  
o	Alternative: People with heart disease have an average cholesterol level that is greater than 240 mg/dl

Note: Unfortunately, the `scipy.stats` function we’ve been using does not (at the time of writing) have an `alternative` parameter to change the alternative hypothesis for this test. Therefore, you’ll have to run a two-sided test. However, since you calculated earlier that the average cholesterol level for heart disease patients is greater than 240 mg/dl, you can calculate the p-value for the one-sided test indicated above simply by dividing the two-sided p-value in half.

<br>4. Run the hypothesis test indicated in task 3 and print out the p-value. Can you conclude that heart disease patients have an average cholesterol level significantly greater than 240 mg/dl? Use a significance threshold of 0.05.

<br>5. Repeat steps 1-4 in order to run the same hypothesis test, but for patients in the sample who were not diagnosed with heart disease. Do patients without heart disease have average cholesterol levels significantly above 240 mg/dl?

### Fasting Blood Sugar Analysis
<br>6. Let’s now return to the full dataset (saved as `heart`). How many patients are there in this dataset? Save the number of patients as `num_patients` and print it out.

<br>7. Remember that the `fbs` column of this dataset indicates whether or not a patient’s fasting blood sugar was greater than 120 mg/dl (1 means that their fasting blood sugar was greater than 120 mg/dl; 0 means it was less than or equal to 120 mg/dl).  
Calculate the number of patients with fasting blood sugar greater than 120. Save this number as `num_highfbs_patients` and print it out.

<br>8. Sometimes, part of an analysis will involve comparing a sample to known population values to see if the sample appears to be representative of the general population.  
By some estimates, about 8% of the U.S. population had diabetes (diagnosed or undiagnosed) in 1988 when this data was collected. While there are multiple tests that contribute to a diabetes diagnosis, fasting blood sugar levels greater than 120 mg/dl can be indicative of diabetes (or at least, pre-diabetes). If this sample were representative of the population, approximately how many people would you expect to have diabetes? Calculate and print out this number.  
Is this value similar to the number of patients with a fasting blood sugar above 120 mg/dl, or different?

<br>9. Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%? Import the function from `scipy.stats` that you can use to test the following null and alternative hypotheses:  
o	Null: This sample was drawn from a population where 8% of people have fasting blood sugar > 120 mg/dl  
o	Alternative: This sample was drawn from a population where more than 8% of people have fasting blood sugar > 120 mg/dl


<br>10. Run the hypothesis test indicated in task 9 and print out the p-value. Using a significance threshold of 0.05, can you conclude that this sample was drawn from a population where the rate of fasting blood sugar > 120 mg/dl is significantly greater than 8%?




In [3]:
import pandas as pd
import numpy as np

In [4]:
# Load data
heart = pd.read_csv('heart_disease.csv')
# Creating two separate dataframes
yes_hd = heart[heart.heart_disease == 'presence']
no_hd = heart[heart.heart_disease == 'absence']

In [5]:
# Check the df "heart"
print(heart.head(3))

    age   sex  trestbps   chol              cp  exang  fbs  thalach   
0  63.0  male     145.0  233.0  typical angina    0.0  1.0    150.0  \
1  67.0  male     160.0  286.0    asymptomatic    1.0  0.0    108.0   
2  67.0  male     120.0  229.0    asymptomatic    1.0  0.0    129.0   

  heart_disease  
0       absence  
1      presence  
2      presence  


In [6]:
# Removing the limit on the number of displayed columns with the following lines does not work in my editor. 
# There, "None" needs to be replaced by "0".
# pd.get_option("display.max_columns")
# pd.set_option("display.max_columns", None)

In [7]:
# This code temporarily sets display.max_columns to 0, which lets pandas fit the columns to the
# display width (columns that do not fit are replaced by "...").
with pd.option_context('display.max_columns', 0):
    print(yes_hd.head(3))

    age     sex  trestbps   chol  ... exang  fbs  thalach  heart_disease
1  67.0    male     160.0  286.0  ...   1.0  0.0    108.0       presence
2  67.0    male     120.0  229.0  ...   1.0  0.0    129.0       presence
6  62.0  female     140.0  268.0  ...   0.0  0.0    160.0       presence

[3 rows x 9 columns]


In [8]:
with pd.option_context('display.max_columns', 0):
    print(no_hd.head(3))

    age     sex  trestbps   chol  ... exang  fbs  thalach  heart_disease
0  63.0    male     145.0  233.0  ...   0.0  1.0    150.0        absence
3  37.0    male     130.0  250.0  ...   0.0  0.0    187.0        absence
4  41.0  female     130.0  204.0  ...   0.0  0.0    172.0        absence

[3 rows x 9 columns]


In [22]:
# Task 1
with pd.option_context('display.max_columns', 0): print(yes_hd.head())
with pd.option_context('display.width', 1000): print(yes_hd.columns)

    age     sex  trestbps   chol  ... exang  fbs  thalach  heart_disease
1  67.0    male     160.0  286.0  ...   1.0  0.0    108.0       presence
2  67.0    male     120.0  229.0  ...   1.0  0.0    129.0       presence
6  62.0  female     140.0  268.0  ...   0.0  0.0    160.0       presence
8  63.0    male     130.0  254.0  ...   0.0  0.0    147.0       presence
9  53.0    male     140.0  203.0  ...   1.0  1.0    155.0       presence

[5 rows x 9 columns]
Index(['age', 'sex', 'trestbps', 'chol', 'cp', 'exang', 'fbs', 'thalach', 'heart_disease'], dtype='object')


In [29]:
# Task 2
chol_hd = yes_hd.chol
print(round(chol_hd.mean(), 2))
# Passing several arguments to print() separated by commas inserts an unwanted space between them. 
# print("The average cholesterol level for the chol_hd column is ", round(chol_hd.mean(), 2), ".")
# Output: The average cholesterol level for the chol_hd column is  251.47 .
# Use string concatenation instead.
print("The average cholesterol level for the chol_hd column is " + str(round(chol_hd.mean(), 2)) + ".")

251.47
The average cholesterol level for the chol_hd column is 251.47.
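
The spacing issue noted in the Task 2 cell can also be avoided with an f-string or with the `sep` argument of `print()`. A minimal sketch; the value 251.47 is hard-coded from the output above rather than computed from `chol_hd`, and the names `mean_chol` and `message` are illustrative:

```python
# Hard-coded stand-in for round(chol_hd.mean(), 2) from the cell above.
mean_chol = 251.47

# f-string: the value is interpolated without surrounding spaces.
message = f"The average cholesterol level for the chol_hd column is {mean_chol:.2f}."
print(message)

# Alternative: keep the commas but tell print() not to insert a separator.
print("The average cholesterol level for the chol_hd column is ", mean_chol, ".", sep="")
```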


In [26]:
# Task 3
# Task 2 showed that the sample mean of chol_hd (251.47) is greater than 240 mg/dl.
# That describes this sample only. The test below checks whether the difference is large
# enough to conclude that the population mean is greater than 240 mg/dl as well.


# Codecademy's Comment: 
# Note: Unfortunately, the `scipy.stats` function we’ve been using does not (at the time of writing) have 
# an `alternative` parameter to change the alternative hypothesis for this test. Therefore, you’ll have to run a two-sided test. 
# However, since you calculated earlier that the average cholesterol level for heart disease patients is greater than 240 mg/dl, 
# you can calculate the p-value for the one-sided test indicated above simply by dividing the two-sided p-value in half.

# Mark:
# My notes on Codecademy's comment.
# The 2-sided t-test ttest_1samp checks whether the sample mean differs from the expected mean 
# (assumed population mean) and returns a two-tailed p-value. The one-sided p-value is either "p-val/2" or 
# "1 - p-val/2", depending on the alternative hypothesis and on which side of the expected mean the sample mean lies.
# Null hypothesis: population mean is equal to expected mean.
# Alternative hypothesis: population mean is greater than expected mean. 
# If p-val<=0.05, reject the null hypothesis of the two-sided test, i.e., the mean differs from the expected value. However, we still 
# would not know whether it is greater or smaller. 

# tstat, pval = ttest_1samp(sample_distribution, expected_mean)
from scipy.stats import ttest_1samp
tstat, pval = ttest_1samp(chol_hd, 240)
print("p-value = " + str(pval))    # Output: p-value = 0.007082206781031141
# If the population mean were 240, a sample mean at least this far from 240 would occur with probability 0.7%; less than 5%; so, reject null hypothesis. 

# How do we know whether sample mean is greater or smaller than 240?
# Look at the test statistic (t-value); more precisely, its sign. If mean(chol_hd) == 240 then tstat=0; 
# if mean(chol_hd) > 240 then tstat>0; if mean(chol_hd) < 240 then tstat<0 . 
print("tstat = " + str(tstat))    # Output: tstat = 2.7337803003099808
# T-values are not in the units of the original data: 2.73 is the distance between the sample mean and 240 in standard errors.
# Since tstat is positive, the sample mean lies above 240 and the one-sided p-value for the "greater" alternative is 
# "p-val/2": 0.00708/2 = 0.00354. (For a negative tstat it would be "1 - p-val/2".)

print("People with heart disease do have high cholesterol levels (greater than or equal to 240 mg/dl) on average.")

p-value = 0.007082206781031141
tstat = 2.7337803003099808
People with heart disease do have high cholesterol levels (greater than or equal to 240 mg/dl) on average.
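
The relationship between the two-sided and the one-sided p-value used in the Task 3 cell can be checked directly. The sketch below is an illustration under assumptions: the data are synthetic (a seeded normal sample, not the Cleveland data), and `one_sided_greater` is a hypothetical helper name. It also relies on the `alternative` argument of `ttest_1samp`, which SciPy added in version 1.6.0, so it will not run on the older SciPy that the Codecademy note refers to.

```python
import numpy as np
from scipy.stats import ttest_1samp

def one_sided_greater(tstat, two_sided_pval):
    """Convert a two-sided t-test p-value to the one-sided 'greater' p-value."""
    # Positive t: the sample mean is above the hypothesized mean, so the
    # upper tail is half of the two-sided p-value. Otherwise the upper tail
    # is everything except the lower half.
    return two_sided_pval / 2 if tstat > 0 else 1 - two_sided_pval / 2

# Synthetic cholesterol-like values (assumption: not the real dataset).
rng = np.random.default_rng(0)
sample = rng.normal(loc=250, scale=50, size=140)

tstat, pval_two_sided = ttest_1samp(sample, 240)
pval_manual = one_sided_greater(tstat, pval_two_sided)

# SciPy >= 1.6.0 can run the one-sided test directly.
pval_direct = ttest_1samp(sample, 240, alternative='greater').pvalue

print(pval_manual, pval_direct)
```

Both printed values should agree up to floating-point rounding, which is why halving the two-sided p-value is valid whenever the t-statistic is positive.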


In [36]:
# Task 4

# Significance threshold is 0.05. 
# Two-tailed p-value was calculated as 0.0070822. 
# Since the t-statistic is positive, dividing it by two gives the one-sided p-value (alternative: mean greater than 240 mg/dl).
# p-val_one-sided = pval/2 = 0.0035. If the population mean were 240 mg/dl, a sample mean this high or higher would occur with probability 0.35%.
# 0.0035 < 0.05, so reject the null hypothesis.
print("Heart disease patients have average cholesterol levels significantly higher than 240 mg/dl.")

Heart disease patients have average cholesterol levels significantly higher than 240 mg/dl.


In [47]:
# Task 5
# People who were not diagnosed with heart disease.

chol_no_hd = no_hd.chol
print("The average cholesterol level for the chol_no_hd column is " + str(round(chol_no_hd.mean(),2)) + ".")

# tstat, pval = ttest_1samp(sample_distribution, expected_mean)
from scipy.stats import ttest_1samp
tstat, pval = ttest_1samp(chol_no_hd, 240)
print("p-value = " + str(pval))    # Output: p-value = 0.527942
# The two-sided p-value is 52.8%; more than 5%; so, DO NOT reject null hypothesis. 
# The sample mean (242.64) is above 240, so divide p-val by two to obtain the one-sided test result.
# p-val_one-sided = pval/2 = 0.26. If the population mean were 240 mg/dl, a sample mean this high or higher would occur with probability 26%.
print("""Patients without heart disease (no_hd = heart[heart.heart_disease == 'absence'])
      do not have average cholesterol levels significantly higher than 240 mg/dl.""")

The average cholesterol level for the chol_no_hd column is 242.64.
p-value = 0.5279424046444101
Patients without heart disease (no_hd = heart[heart.heart_disease == 'absence'])
      do not have average cholesterol levels significantly higher than 240 mg/dl.


In [50]:
# Task 6
num_patients = len(heart)
print("There are " + str(num_patients) + " patients in this dataset.")

There are 303 patients in this dataset.


In [54]:
# Task 7
# Ordinarily, one would count the values in that column. Let's try.
# num_highfbs_patients = heart.fbs.value_counts()
# print(num_highfbs_patients)
# This returns a Series with the counts for both 0 and 1 (45 patients have the value 1). 
# What we need is only the count for 1. Since the column consists of 
# 0s and 1s, use sum() to get the count of 1s.
num_highfbs_patients = np.sum(heart.fbs)
print("The number of patients with high fasting blood sugar is " + str(int(num_highfbs_patients)) + ".")

The number of patients with high fasting blood sugar is 45.
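
The difference between counting values and summing an indicator column, as discussed in the Task 7 cell, can be seen on a small made-up Series. The values below are illustrative and not taken from `heart`; the `count_from_*` names are hypothetical.

```python
import pandas as pd

# Made-up indicator column: 1.0 = fasting blood sugar > 120 mg/dl, 0.0 otherwise.
fbs = pd.Series([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0])

# value_counts() returns a Series indexed by the distinct values.
counts = fbs.value_counts()
count_from_value_counts = int(counts.loc[1.0])

# Summing a 0/1 column gives the number of 1s directly.
count_from_sum = int(fbs.sum())

# A boolean comparison also works when the column is not strictly 0/1.
count_from_mask = int((fbs == 1).sum())

print(count_from_value_counts, count_from_sum, count_from_mask)  # 3 3 3
```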


In [55]:
# Task 8
# Sample has 303 patients. 8% of 303 is 24.24, so about 24 patients would be expected to have diabetes.
# Task 7 revealed 45 patients with fasting blood sugar levels > 120.
# (45/303)=14.85% of the sample, nearly twice the expected share.
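
The expected count quoted in the Task 8 cell follows from multiplying the sample size by the assumed population rate. A small sketch using the numbers reported above (303 patients, 45 with `fbs` equal to 1); the names `population_rate`, `expected_diabetic` and `observed_rate` are illustrative:

```python
num_patients = 303          # from Task 6
num_highfbs_patients = 45   # from Task 7
population_rate = 0.08      # assumed diabetes rate from the task description

# Expected number of patients with diabetes if the sample were representative.
expected_diabetic = population_rate * num_patients
# Observed share of patients with fasting blood sugar > 120 mg/dl.
observed_rate = num_highfbs_patients / num_patients

print(round(expected_diabetic, 2))     # 24.24
print(round(observed_rate * 100, 2))   # 14.85
```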

In [62]:
# Task 9 and Task 10
# If this sample (303) came from a population with 8% diabetes,
# about 24 patients should show fasting blood sugar levels of >120.
# THAT IS, if fbs levels >120 can be understood as having diabetes.
# This sample has 45 people or 14.85% that are >120.

# from scipy.stats import binom_test # fbs column has 0,1.
# pval = binom_test(num_highfbs_patients, num_patients, .08, alternative='greater')

# DeprecationWarning: 'binom_test'. Rewriting with 'binomtest'.
from scipy.stats import binomtest # fbs column has 0,1.
# binomtest returns a result object; the p-value is its pvalue attribute.
result = binomtest(int(num_highfbs_patients), num_patients, .08, alternative='greater')

print(num_highfbs_patients) # Just checking numbers
print(num_patients) # Just checking numbers
print(result) # pvalue = 4.689 e-05
# If the true rate were 8%, the probability of seeing 45 or more 
# patients with fbs > 120 in a sample of 303 (num_patients) 
# would be close to zero. 
# 4.689 e-5 < 0.05 --> reject the null hypothesis.
# Significantly greater than 8%.
print("This sample of 303 patients was drawn from a population where the rate of fasting blood sugar > 120 mg/dl is significantly higher than 8%.")

45.0
303
BinomTestResult(k=45, n=303, alternative='greater', statistic=0.1485148514851485, pvalue=4.689471951448875e-05)
This sample of 303 patients was drawn from a population where the rate of fasting blood sugar > 120 mg/dl is significantly higher than 8%.
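
For the "greater" alternative, the p-value of the exact binomial test is the probability of seeing at least 45 successes in 303 trials when the success probability is 0.08. The sketch below recomputes that upper tail with the standard library only, as a cross-check on the `binomtest` output above; `binomial_upper_tail` is a hypothetical helper, not a SciPy function.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 45 of 303 patients have fbs > 120 mg/dl; the null hypothesis rate is 8%.
pval = binomial_upper_tail(45, 303, 0.08)
print(pval)  # should agree with the pvalue reported by binomtest (about 4.689e-05)
```

Summing the tail by hand is only practical for small samples like this one; it is shown here to make explicit what the reported p-value measures.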
