## Heart Disease Research
In this project, I reviewed and analyzed data from a sample of patients evaluated for heart disease at the Cleveland Clinic Foundation. The data was downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Heart+Disease). 

We will study a number of **quantitative, binary, and categorical variables**, produce **boxplots**, and utilize **hypothesis testing** to better understand the relationships between the variables. This project was completed in the *Master Statistics with Python* skill path on Codecademy.

#### Focus Areas for Analysis
- Determine whether heart disease is associated with other health indicators, such as high cholesterol, high fasting blood sugar, and maximum heart rate.
- Utilize hypothesis testing - namely one-sample t-test, binomial test, two-sample t-test, ANOVA test, Tukey's range test, and chi-square test - to draw inferences about a population from the sample of data.

#### Variables in Dataset
- age: age in years
- sex: sex assigned at birth; 'male' or 'female'
- trestbps: resting blood pressure in mm Hg
- chol: serum cholesterol in mg/dl
- cp: chest pain type ('typical angina', 'atypical angina', 'non-anginal pain', or 'asymptomatic')
- exang: whether the patient experiences exercise-induced angina (1: yes; 0: no)
- fbs: whether the patient’s fasting blood sugar is >120 mg/dl (1: yes; 0: no)
- thalach: maximum heart rate achieved in exercise test
- heart_disease: whether the patient is found to have heart disease ('presence': diagnosed with heart disease; 'absence': no heart disease)

In [4]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind
from scipy.stats import f_oneway
from scipy.stats import binom_test
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import chi2_contingency

In [5]:
# Load data
heart = pd.read_csv(r"C:\Users\shimtek\Projetos\Codecademy\Data Science\heart_disease.csv")
heart.head()

Unnamed: 0,age,sex,trestbps,chol,cp,exang,fbs,thalach,heart_disease
0,63.0,male,145.0,233.0,typical angina,0.0,1.0,150.0,absence
1,67.0,male,160.0,286.0,asymptomatic,1.0,0.0,108.0,presence
2,67.0,male,120.0,229.0,asymptomatic,1.0,0.0,129.0,presence
3,37.0,male,130.0,250.0,non-anginal pain,0.0,0.0,187.0,absence
4,41.0,female,130.0,204.0,atypical angina,0.0,0.0,172.0,absence


In [3]:
heart.shape

(303, 9)

In [6]:
# Split dataset into subsets, patients with/without heart disease (HD)
yes_hd = heart[heart.heart_disease == 'presence']
no_hd = heart[heart.heart_disease == 'absence']

In [5]:
# Separate cholesterol variable for patients with HD, calculate average
chol_hd = yes_hd.chol
avg_chol = np.mean(chol_hd)
print("Average cholesterol for patients with heart disease:", round(avg_chol, 2))

Average cholesterol for patients with heart disease: 251.47


Typically, a total cholesterol reading above 240 mg/dl is considered high. The average cholesterol level for patients with heart disease in this sample is 251.47 mg/dl. We'll now use a one-sample t-test to check whether the mean cholesterol level of people with heart disease in the general population is greater than 240 mg/dl. The null hypothesis is that the mean equals 240 mg/dl; the alternative hypothesis is that it is greater.

In [6]:
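# One-sample t-test; null hypothesis: mean cholesterol = 240 mg/dl
tstat, pval = ttest_1samp(chol_hd, 240)
# ttest_1samp returns a two-sided p-value by default; halving it gives the one-sided
# p-value for the alternative (mean > 240), valid here because the sample mean is above 240
print("P-value for people with heart disease who have high cholesterol:", pval/2)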

P-value for people with heart disease who have high cholesterol: 0.0035411033905155707


Because the p-value is below the significance threshold of 0.05, we reject the null hypothesis and conclude that people with heart disease have an average cholesterol level significantly greater than 240 mg/dl.
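
To make the test statistic behind this result concrete, here is a minimal sketch of the one-sample t-statistic computed by hand. The helper name `one_sample_t_statistic` and the five-value `toy_chol` sample are hypothetical, chosen only so the arithmetic is easy to follow; they are not taken from the heart disease data.

```python
import math
import statistics

def one_sample_t_statistic(sample, popmean):
    """Return t = (sample mean - popmean) / (sample sd / sqrt(n))."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)   # sample standard deviation (n - 1 in the denominator)
    standard_error = sd / math.sqrt(n)
    return (mean - popmean) / standard_error

# Hypothetical cholesterol readings (mg/dl), NOT values from the dataset
toy_chol = [250, 260, 230, 270, 240]
t_stat = one_sample_t_statistic(toy_chol, 240)
print(round(t_stat, 4))  # 1.4142
```

A positive t-statistic means the sample mean lies above the hypothesized value, which is the condition under which halving the two-sided p-value is appropriate. In SciPy 1.6 and later, `ttest_1samp(chol_hd, 240, alternative='greater')` should return the one-sided p-value directly, without the manual division.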

In [7]:
# Calculate number of total patients and number of patients with FBS > 120
num_patients = len(heart)
print(num_patients)
# fbs is 1 when fasting blood sugar is > 120 mg/dl and 0 otherwise, so count the rows where fbs == 1
num_highfbs_patients = len(heart[heart.fbs == 1])
print(num_highfbs_patients)

303
45


In [9]:
# Expected number of high-FBS patients in the sample if the true rate were 8% (the null-hypothesis proportion)
sample_highfbs = 0.08 * num_patients
print("Expected number of patients with high blood sugar in sample:", round(sample_highfbs, 2))

Expected number of patients with high blood sugar in sample: 24.24


In [10]:
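# Run one-sided binomial test: is the proportion of patients with FBS > 120 greater than 8%?
p_value = binom_test(num_highfbs_patients, n=num_patients, p=0.08, alternative='greater')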
print(p_value)

4.689471951448875e-05
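
The one-sided p-value above is the probability of seeing 45 or more high-FBS patients out of 303 if the true rate were 8%. As a rough cross-check, the sketch below sums that upper tail of the binomial distribution using only the standard library. The helper name `binom_upper_tail` is hypothetical and is not part of SciPy.

```python
import math

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )

# Counts taken from the cells above: 45 of 303 patients have FBS > 120 mg/dl
tail_p_value = binom_upper_tail(45, 303, 0.08)
print(tail_p_value)
```

The result should agree with the `binom_test` output to within floating-point error. Because the value is far below 0.05, the sample proportion of patients with high fasting blood sugar is significantly greater than 8%. Note that recent SciPy releases have removed `binom_test` in favor of `binomtest`; the equivalent call there would be `binomtest(45, n=303, p=0.08, alternative='greater').pvalue`.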
