# Hypothesis Testing on Heart Disease Data Project

In [64]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_1samp, binom

# Load the dataset:

In [65]:
df = pd.read_csv('heart_disease.csv')

# first few rows of the dataset
df.head()

,age,sex,trestbps,chol,cp,exang,fbs,thalach,heart_disease
0,63,male,145,233,typical angina,0,1,150,absence
1,67,male,160,286,asymptomatic,1,0,108,presence
2,67,male,120,229,asymptomatic,1,0,129,presence
3,37,male,130,250,non-anginal pain,0,0,187,absence
4,41,female,130,204,atypical angina,0,0,172,absence


# Data Assessment:

### a) Types of Data:

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   age            303 non-null    int64 
 1   sex            303 non-null    object
 2   trestbps       303 non-null    int64 
 3   chol           303 non-null    int64 
 4   cp             303 non-null    object
 5   exang          303 non-null    int64 
 6   fbs            303 non-null    int64 
 7   thalach        303 non-null    int64 
 8   heart_disease  303 non-null    object
dtypes: int64(6), object(3)
memory usage: 21.4+ KB


### b) Column Names:

In [49]:
list(df)

['age',
 'sex',
 'trestbps',
 'chol',
 'cp',
 'exang',
 'fbs',
 'thalach',
 'heart_disease']

### c) Total Number of Rows:

In [50]:
len(df)

303

### d) Checking for missing values in each column:

In [51]:
df.isnull().sum()

age              0
sex              0
trestbps         0
chol             0
cp               0
exang            0
fbs              0
thalach          0
heart_disease    0
dtype: int64

### e) Checking for total number of duplicate rows:

In [56]:
df.duplicated().sum()

0
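
The missing-value and duplicate checks above can also be gathered into one small helper. This is only a sketch: `quality_report` is a hypothetical name, and the toy frame is made up for illustration rather than taken from `heart_disease.csv`.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Return the row count, total missing cells, and duplicate-row count."""
    return {
        "rows": len(frame),
        "missing": int(frame.isnull().sum().sum()),
        "duplicates": int(frame.duplicated().sum()),
    }

# Toy data (not the real dataset): one missing value and one repeated row.
toy = pd.DataFrame({"age": [63, 67, 67, None], "chol": [233, 286, 286, 250]})
report = quality_report(toy)
print(report)
```

Calling `quality_report(df)` on the real frame should agree with the separate checks above, assuming `df` is loaded as in the first cell.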

# Analysis

### Splitting the dataset based on heart disease `presence` or `absence`:

### i) Presence of heart disease:

In [57]:
yes_hd = df[df.heart_disease == 'presence']
yes_hd.head()

,age,sex,trestbps,chol,cp,exang,fbs,thalach,heart_disease
1,67,male,160,286,asymptomatic,1,0,108,presence
2,67,male,120,229,asymptomatic,1,0,129,presence
6,62,female,140,268,asymptomatic,0,0,160,presence
8,63,male,130,254,asymptomatic,0,0,147,presence
9,53,male,140,203,asymptomatic,1,1,155,presence


### ii) Absence of heart disease:

In [58]:
no_hd = df[df.heart_disease == 'absence']
no_hd.head()

,age,sex,trestbps,chol,cp,exang,fbs,thalach,heart_disease
0,63,male,145,233,typical angina,0,1,150,absence
3,37,male,130,250,non-anginal pain,0,0,187,absence
4,41,female,130,204,atypical angina,0,0,172,absence
5,56,male,120,236,atypical angina,0,0,178,absence
7,57,female,120,354,asymptomatic,1,0,163,absence


# Cholesterol Level Analysis

# Question 1: Calculate the mean cholesterol level for patients `with heart disease` (cholesterol above 240 mg/dL is considered high).

In [68]:
cholestrol_level_yes = yes_hd.chol

# Calculate the mean:
print(np.mean(cholestrol_level_yes))

251.4748201438849


### Therefore: the mean of 251.47 mg/dL for patients with heart disease is above the 240 mg/dL threshold at which total cholesterol is considered “high” (and therefore unhealthy).

# Question 2: Hypothesis test - Is the average cholesterol level > 240 mg/dl?
#### Null: Average cholesterol level = 240 mg/dl
#### Alternative: Average cholesterol level > 240 mg/dl

In [69]:
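# ttest_1samp returns a two-sided p-value; halving it gives the one-sided
# p-value here because the sample mean (251.47) lies above 240.
tstat, pval = ttest_1samp(cholestrol_level_yes, 240)
print(pval/2)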

0.0035411033905155707


#### Conclusion: Since the p-value (0.0035) is much lower than the common significance level of 0.05 (5%), we `reject the null hypothesis`. There is statistically significant evidence that patients with heart disease have an average cholesterol level greater than 240 mg/dL.
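
To make the one-sided p-value less of a black box, the sketch below recomputes the t-statistic by hand and compares it with `ttest_1samp`. The sample is synthetic (drawn with an arbitrary seed, mean, and spread), so the printed numbers will not match the notebook's results; only the relationship between the two calculations is the point.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Synthetic stand-in for a cholesterol column (not the real data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=260, scale=50, size=100)

mu0 = 240
n = len(sample)

# t = (sample mean - mu0) / (sample standard deviation / sqrt(n))
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
# Upper-tail probability of the t distribution with n - 1 degrees of freedom
p_manual = t.sf(t_manual, df=n - 1)

tstat, p_two_sided = ttest_1samp(sample, mu0)
print(t_manual, tstat)             # the two statistics agree
print(p_manual, p_two_sided / 2)   # they agree when tstat is positive
```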

# Question 3: Calculate the mean cholesterol level for patients `without heart disease` (cholesterol above 240 mg/dL is considered high).

In [70]:
cholestrol_level_no = no_hd.chol
print(np.mean(cholestrol_level_no))

242.640243902439


### Therefore: the mean of 242.64 mg/dL for patients without heart disease is only slightly above the 240 mg/dL threshold at which total cholesterol is considered “high”.
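
Both group means can also be obtained in one step with `groupby`, instead of filtering each group separately. The sketch below uses six `chol` values copied from the row previews shown earlier, so its means describe only those six rows and not the full dataset.

```python
import pandas as pd

# Six rows from the previews above (three per group), same column names.
toy = pd.DataFrame({
    "chol": [286, 229, 268, 233, 250, 204],
    "heart_disease": ["presence", "presence", "presence",
                      "absence", "absence", "absence"],
})

# One mean per group, indexed by the heart_disease label
group_means = toy.groupby("heart_disease")["chol"].mean()
print(group_means)
```

On the full frame, `df.groupby("heart_disease")["chol"].mean()` should reproduce the two means computed in Questions 1 and 3.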

# Question 4: Hypothesis test - Is the average cholesterol level > 240 mg/dl (for no heart disease)?
#### Null: Average cholesterol level = 240 mg/dl
#### Alternative: Average cholesterol level > 240 mg/dl

In [62]:
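# Halving the two-sided p-value is valid here because the sample mean
# (242.64) lies above 240, i.e. on the side of the alternative.
tstat, pval = ttest_1samp(cholestrol_level_no, 240)
print(pval/2)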

0.26397120232220506


#### Conclusion: Since the p-value (0.264) is greater than the common significance level of 0.05 (5%), we `fail to reject the null hypothesis`. In other words, the data do not provide enough evidence to conclude that patients without heart disease have an average cholesterol level higher than 240 mg/dL.
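
Halving the two-sided p-value is only correct when the sample mean falls on the side of the alternative, which holds for both tests above. A minimal sketch of a helper that also handles the other case follows; `p_value_greater` is a hypothetical name and the two arrays are made-up values. Recent SciPy releases also accept `alternative='greater'` in `ttest_1samp`, which avoids the manual step.

```python
import numpy as np
from scipy.stats import ttest_1samp

def p_value_greater(sample, mu0):
    """One-sided p-value for H1: mean > mu0, derived from the two-sided test."""
    tstat, p_two_sided = ttest_1samp(sample, mu0)
    # Halve when the sample mean is above mu0; otherwise take the complement.
    return p_two_sided / 2 if tstat > 0 else 1 - p_two_sided / 2

above = np.array([250.0, 262.0, 255.0, 248.0, 259.0])  # mean above 240
below = np.array([230.0, 222.0, 235.0, 228.0, 219.0])  # mean below 240
p_above = p_value_greater(above, 240)
p_below = p_value_greater(below, 240)
print(p_above)  # small: supports mean > 240
print(p_below)  # close to 1: no support for mean > 240
```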

# Fasting Blood Sugar (fbs) Analysis

# Question 5: Number of patients in the dataset

In [73]:
num_patients = len(df)
print(f"Total number of patients: {num_patients}")

Total number of patients: 303


# Question 6: Number of patients with fasting blood sugar > 120mg/dl.
#### fbs (1 = fbs > 120 mg/dl, 0 = fbs <= 120 mg/dl)

In [74]:
num_highfbs_patients = np.sum(df.fbs)
print(f"Number of patients with high fasting blood sugar > 120mg/dl: {num_highfbs_patients}")

Number of patients with high fasting blood sugar > 120mg/dl: 45


# Question 7: Expected number of diabetic patients based on 8% prevalence
#### Given that approximately 8% of the U.S. population had diabetes in 1988, calculate the expected number of patients with diabetes in your sample based on this percentage.

In [76]:
num_patient = 303  # number of patients in the dataset (len(df))
prevalence_rate = 0.08  # 8%

# Expected count = prevalence rate x total sample size
expected_diabetic_patients = prevalence_rate * num_patient
print(f"Expected number of diabetic patients in the sample: {expected_diabetic_patients:.2f}")

Expected number of diabetic patients in the sample: 24.24


#### Conclusion: With a sample size of 303 patients and an 8% prevalence rate, we would expect approximately 24 patients to have diabetes.
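
Under a binomial model with sample size n and prevalence p, the expected count is n times p and the typical spread around it is the square root of n times p times (1 - p). The sketch below evaluates both for the figures used in this question, as a rough guide to how far an observed count can plausibly sit from the expectation.

```python
import math

n = 303   # patients in the sample, from len(df)
p = 0.08  # prevalence given in the question

expected = n * p                      # mean of Binomial(n, p)
spread = math.sqrt(n * p * (1 - p))   # standard deviation of Binomial(n, p)
print(f"Expected: {expected:.2f}, standard deviation: {spread:.2f}")
```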

# Question 8: Hypothesis test - Is the sample rate of fbs > 120 mg/dl greater than 8%?
#### Null: Sample rate = 8%
#### Alternative: Sample rate > 8%

In [77]:
# One-sided binomial test: P(X >= 45) when X ~ Binomial(303, 0.08)
num_patient = 303
num_highfbs_patients = 45


pval = 1 - binom.cdf(num_highfbs_patients - 1, num_patient, .08)
print(f"P-value: {pval:.7f}")

P-value: 0.0000469


#### In summary, a p-value of 0.0000469 is strong statistical evidence that the proportion of patients with high fasting blood sugar in this sample is higher than the expected 8%.

#### This result may have practical implications in a clinical or public health context: a prevalence of high fasting blood sugar well above the expected rate could warrant further investigation or intervention.
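
The upper-tail probability used in the test can be written in several equivalent ways. The sketch below computes P(X >= 45) for X ~ Binomial(303, 0.08) from the CDF (as in the cell above), from SciPy's survival function `binom.sf`, and by summing the probability mass directly; the three should agree up to floating-point error.

```python
from scipy.stats import binom

n, k, p0 = 303, 45, 0.08  # sample size, observed count, null proportion

# P(X >= k) computed three ways under X ~ Binomial(n, p0)
p_from_cdf = 1 - binom.cdf(k - 1, n, p0)
p_from_sf = binom.sf(k - 1, n, p0)
p_from_pmf = sum(binom.pmf(i, n, p0) for i in range(k, n + 1))
print(p_from_cdf, p_from_sf, p_from_pmf)
```

The survival function is usually the safer choice for very small tail probabilities, since it avoids subtracting a number close to 1 from 1.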


# Summary of Findings:

#### Cholesterol Levels:
#### The mean cholesterol level for people with heart disease is 251.47 mg/dL, significantly higher than 240 mg/dL (p-value: 0.0035). This suggests that people with heart disease tend to have high cholesterol levels.
#### For people without heart disease, the mean cholesterol level is 242.64 mg/dL, but there is no significant evidence that it's higher than 240 mg/dL (p-value: 0.263).

#### Fasting Blood Sugar (fbs):
#### Out of 303 patients, 45 have fasting blood sugar levels above 120 mg/dL.
#### Based on an 8% diabetes prevalence, approximately 24 patients (0.08 × 303 ≈ 24.24) are expected to have diabetes.
#### The binomial test shows strong evidence that the actual prevalence of high fasting blood sugar in this sample is significantly higher than 8% (p-value: 0.0000469).