<a href="https://clarusway.com/contact-us/"><img align="center" src="https://i.ibb.co/B43qn24/officially-licensed-logo.png" alt="Open in Clarusway LMS" width="200" height="200" title="This notebook is licensed by Clarusway IT training school. Please contact the authorized persons about the conditions under which you can use or share."></a>

# Cardiovascular Disease Dataset

We will work with a dataset on cardiovascular disease.

We'll try to understand concepts such as:

- true means,
- confidence intervals,
- one sample t test,
- independent samples t test,
- homogeneity of variance check (Levene's test),
- One-way ANOVA,
- Chi-square test.

Dataset from: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset

# Data Preparation

⭐ Import the pandas, scipy.stats, seaborn, and matplotlib.pyplot libraries.

In [29]:
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import ttest_1samp
import seaborn as sns
import matplotlib.pyplot as plt

⭐ Run the following code to read in the "cardio.csv" file.

In [30]:
df = pd.read_csv("cardio.csv", sep=";")

In [31]:
df=df.sample(500, random_state=42)

In [32]:
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
46730,66728,21770,1,156,64.0,140,80,2,1,0,0,1,1
48393,69098,21876,1,170,85.0,160,90,1,1,0,0,1,1
41416,59185,23270,1,151,90.0,130,80,1,1,0,0,1,1
34506,49288,19741,1,159,97.0,120,80,1,1,0,0,1,1
43725,62481,18395,1,164,68.0,120,80,1,1,0,0,1,0


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, 46730 to 42173
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           500 non-null    int64  
 1   age          500 non-null    int64  
 2   gender       500 non-null    int64  
 3   height       500 non-null    int64  
 4   weight       500 non-null    float64
 5   ap_hi        500 non-null    int64  
 6   ap_lo        500 non-null    int64  
 7   cholesterol  500 non-null    int64  
 8   gluc         500 non-null    int64  
 9   smoke        500 non-null    int64  
 10  alco         500 non-null    int64  
 11  active       500 non-null    int64  
 12  cardio       500 non-null    int64  
dtypes: float64(1), int64(12)
memory usage: 54.7 KB


In [34]:
df.shape

(500, 13)

In [35]:
df.describe()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,49656.324,19460.94,1.368,164.746,74.0934,127.912,98.908,1.338,1.192,0.078,0.046,0.762,0.492
std,27694.652229,2444.264657,0.482744,8.017609,14.340822,40.82349,130.985839,0.651617,0.540111,0.26844,0.209695,0.426286,0.500437
min,172.0,14319.0,1.0,144.0,43.0,-120.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,26990.5,17804.0,1.0,159.0,65.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
50%,49225.5,19669.0,1.0,165.0,72.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
75%,72126.5,21326.25,2.0,170.0,80.0,140.0,90.0,1.0,1.0,0.0,0.0,1.0,1.0
max,99934.0,23670.0,2.0,198.0,160.0,907.0,1200.0,3.0,3.0,1.0,1.0,1.0,1.0


⭐ Let's get rid of the outliers. Moreover, blood pressure cannot be negative!

In [36]:
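# Blood pressure cannot be negative: replace negative readings with their absolute value
df.loc[df['ap_hi'] < 0, 'ap_hi'] = df['ap_hi'].abs()
df.loc[df['ap_lo'] < 0, 'ap_lo'] = df['ap_lo'].abs()

# Interquartile range (IQR) of every column
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# Flag a row if any of its values lies more than 1.5 * IQR outside the quartiles
outliers = ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)

# Blank out the flagged rows entirely, then fill every column with its median
df[outliers] = np.nan
df.fillna(df.median(), inplace=True)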
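
The rule above is applied to every column, including `id` and binary flags such as `smoke` and `alco`, whose IQR is 0 in this sample, so any row with a 1 in those columns is flagged and replaced as a whole. A more targeted variant is sketched below, assuming only continuous measurements should be screened and only the offending cells replaced. The helper `iqr_outlier_mask` and the toy values are hypothetical and are not part of the notebook.

```python
import pandas as pd

def iqr_outlier_mask(frame, columns, k=1.5):
    """Return a boolean frame marking cells outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    return (frame[columns] < q1 - k * iqr) | (frame[columns] > q3 + k * iqr)

# Hypothetical toy data, not taken from cardio.csv
toy = pd.DataFrame({
    "ap_hi": [120, 130, 110, 125, 907, 120, 115, 140],
    "ap_lo": [80, 85, 70, 80, 90, 1200, 75, 90],
    "smoke": [0, 0, 1, 0, 0, 0, 0, 1],
})

cols = ["ap_hi", "ap_lo"]                  # screen continuous columns only
mask = iqr_outlier_mask(toy, cols)

cleaned = toy.copy()
cleaned[cols] = cleaned[cols].mask(mask)   # NaN only in the flagged cells
cleaned[cols] = cleaned[cols].fillna(cleaned[cols].median())
print(cleaned)
```

Screening cell by cell keeps the categorical columns untouched, so group sizes for later tests are not altered by the cleaning step.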

## Task-1. Is the Population Mean of Systolic Blood Pressure 122 mmHg?

ap_hi is the systolic blood pressure, i.e. the pressure exerted when blood is ejected into the arteries. Normal value: 122 mmHg for all adults aged 18 and over.

⭐ What is the mean for systolic blood pressure?

In [37]:
ap_hi_mean = df['ap_hi'].mean()
ap_hi_mean

122.59

⭐ What is the standard deviation for systolic blood pressure?

In [38]:
df['ap_hi'].std()

9.59301578773893

⭐ What is the standard error of the mean for systolic blood pressure?

In [39]:
sem = df.ap_hi.sem()
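
The standard error of the mean is the sample standard deviation divided by the square root of the sample size. The short check below illustrates this, first on a few hypothetical readings and then with the standard deviation and sample size reported above (9.593 and 500).

```python
import numpy as np
import pandas as pd

# Hypothetical readings, used only to show that the two computations agree
values = pd.Series([118.0, 120.0, 125.0, 130.0, 122.0])
manual_sem = values.std(ddof=1) / np.sqrt(len(values))
assert np.isclose(manual_sem, values.sem())

# Same formula with the summary statistics reported in this notebook
sem_from_summary = 9.59301578773893 / np.sqrt(500)
print(f"SEM from summary statistics: {sem_from_summary:.4f}")
```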

⭐ What are the descriptive statistics for systolic blood pressure?

In [40]:
df.ap_hi.describe()

count    500.000000
mean     122.590000
std        9.593016
min       90.000000
25%      120.000000
50%      120.000000
75%      120.000000
max      170.000000
Name: ap_hi, dtype: float64

## Confidence Interval using the t Distribution

Key Notes about Confidence Intervals

💡 A point estimate is a single number.

💡 A confidence interval, naturally, is an interval.

💡 Confidence intervals are the typical way to present estimates as an interval range.

💡 The point estimate is located exactly in the middle of the confidence interval.

💡 However, confidence intervals provide much more information and are preferred when making inferences.

💡 The more data you have, the less variable a sample estimate will be.

💡 The lower the level of confidence you can tolerate, the narrower the confidence interval will be.

⭐ Investigate the given task by calculating the confidence interval. (Use 90%, 95% and 99% CIs)

In [41]:
moe = 1.645 * sem
upper_limit = ap_hi_mean + moe
lower_limit = ap_hi_mean - moe
print('90% Confidence Interval:',(lower_limit, upper_limit))

90% Confidence Interval: (121.88427409499084, 123.29572590500916)


In [42]:
moe = 1.96 * sem
upper_limit = ap_hi_mean + moe
lower_limit = ap_hi_mean - moe
print('95% Confidence Interval:',(lower_limit, upper_limit))

95% Confidence Interval: (121.74913509190398, 123.43086490809603)


In [43]:
moe = 2.58 * sem
upper_limit = ap_hi_mean + moe
lower_limit = ap_hi_mean - moe
print('99% Confidence Interval:',(lower_limit, upper_limit))

99% Confidence Interval: (121.48314721281238, 123.69685278718762)
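
The three cells above use critical values from the standard normal distribution (1.645, 1.96, 2.58). With 500 observations the t distribution gives almost the same numbers. A minimal sketch of the t-based intervals is shown below, assuming the mean, standard deviation and sample size reported earlier in this notebook; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Summary statistics reported earlier in the notebook
n, mean, sd = 500, 122.59, 9.59301578773893
sem = sd / np.sqrt(n)

intervals = {}
for level in (0.90, 0.95, 0.99):
    # Two-sided critical value of the t distribution with n - 1 degrees of freedom
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=n - 1)
    intervals[level] = (mean - t_crit * sem, mean + t_crit * sem)
    print(f"{level:.0%} Confidence Interval:", intervals[level])
```

The t critical values are slightly larger than their normal counterparts, so each interval is marginally wider than the one printed above.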


## One Sample t Test

⭐ Investigate the given task by using a one sample t test.

Key Notes about Hypothesis Testing (Significance Testing)

💡 Assumptions

💡 Null and Alternative Hypothesis

💡 Test Statistic

💡 P-value

💡 Conclusion
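
The five steps can be traced by hand for this task. The sketch below assumes the summary statistics reported earlier (n = 500, mean 122.59, standard deviation 9.593) and tests H0: µ = 122 against H1: µ ≠ 122 at a 5% significance level; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Assumption: a random sample, large enough for the t test to be reasonable
n, sample_mean, sample_sd = 500, 122.59, 9.59301578773893

# Null and alternative hypothesis: H0: mu = 122, H1: mu != 122
mu0 = 122

# Test statistic: distance of the sample mean from mu0 in standard errors
standard_error = sample_sd / np.sqrt(n)
t_stat = (sample_mean - mu0) / standard_error

# P-value: two-sided tail probability under t with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Conclusion: compare the p-value with the significance level
alpha = 0.05
reject_h0 = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```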

Conduct the significance test. Use scipy.stats.ttest_1samp

In [45]:
# Test against the hypothesized population mean (122), not the sample mean
t_statistic, p_value = ttest_1samp(df['ap_hi'], 122)

In [47]:

print("One-sample t-test results:")
print(f"t-statistic: {t_statistic:.3f}")
print(f"p-value: {p_value:.3f}")

# Interpret the results
alpha = 0.05  # significance level
if p_value < alpha:
    print("Null hypothesis rejected: the population mean differs from 122 mmHg.")
else:
    print("Null hypothesis cannot be rejected: the data are consistent with a population mean of 122 mmHg.")


One-sample t-test results:
t-statistic: 1.375
p-value: 0.170
Null hypothesis cannot be rejected: the data are consistent with a population mean of 122 mmHg.


## Task-2. Is There a Significant Difference Between Males and Females in Systolic Blood Pressure?

H0: µ1 = µ2 ("the two population means are equal")

H1: µ1 ≠ µ2 ("the two population means are not equal")

⭐ Show descriptives for 2 groups

In [50]:
df.groupby('gender')['gender'].describe()

gender,count,mean,std,min,25%,50%,75%,max
1.0,434.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
2.0,66.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0
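
The table above summarises the grouping column itself, which is why every statistic equals the group label. To compare systolic blood pressure between the two groups, `ap_hi` would be selected after grouping. A minimal sketch on a small hypothetical frame (toy values, not taken from cardio.csv):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    "gender": [1, 1, 1, 2, 2, 2],
    "ap_hi": [110, 120, 130, 120, 130, 140],
})

# Describe the measurement of interest within each gender group
summary = toy.groupby("gender")["ap_hi"].describe()
print(summary[["count", "mean", "std"]])
```

On the notebook's data the equivalent call would be `df.groupby('gender')['ap_hi'].describe()`.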


___üöÄTest the assumption of homogeneity of variance Hint: Levene‚Äôs Test

The hypotheses for Levene‚Äôs test are:

H0: "the population variances of group 1 and 2 are equal"

H1: "the population variances of group 1 and 2 are not equal"
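
A minimal sketch of Levene's test with `scipy.stats.levene` is given below. The two groups are synthetic stand-ins generated with NumPy, not the notebook's data, and the 0.05 threshold is an assumed significance level.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the systolic pressure of the two gender groups
rng = np.random.default_rng(0)
group_1 = rng.normal(loc=120, scale=10, size=200)
group_2 = rng.normal(loc=125, scale=10, size=150)

# Levene's test; SciPy centres on the median by default
levene_stat, levene_p = stats.levene(group_1, group_2)

# Failing to reject H0 means equal variances remain a reasonable assumption
equal_variances = levene_p >= 0.05
print(f"Levene statistic = {levene_stat:.3f}, p = {levene_p:.3f}")
print("Assume equal variances:", equal_variances)
```

On the notebook's data the two inputs would be the `ap_hi` values of each gender group, for example `df.loc[df['gender'] == 1, 'ap_hi']` and `df.loc[df['gender'] == 2, 'ap_hi']`.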

___üöÄConduct the significance test. Use scipy.stats.ttest_ind

H0: ¬µ1 = ¬µ2 ("the two population means are equal")

H1: ¬µ1 ‚â† ¬µ2 ("the two population means are not equal")
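
A minimal sketch of the independent samples t test with `scipy.stats.ttest_ind` follows. The groups are synthetic, generated with a built-in mean difference of 10 mmHg, so the numbers say nothing about the cardio data. The `equal_var` argument selects Student's test (`True`) or Welch's test (`False`), which is the usual choice when Levene's test rejects equal variances.

```python
import numpy as np
from scipy import stats

# Synthetic groups with a built-in mean difference (not the notebook's data)
rng = np.random.default_rng(42)
group_1 = rng.normal(loc=120, scale=10, size=100)
group_2 = rng.normal(loc=130, scale=10, size=100)

# Student's t test: assumes equal population variances
t_student, p_student = stats.ttest_ind(group_1, group_2, equal_var=True)

# Welch's t test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(group_1, group_2, equal_var=False)

alpha = 0.05
print(f"Student: t = {t_student:.3f}, p = {p_student:.4g}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.4g}")
print("Reject H0 (Student):", p_student < alpha)
```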

## Task-3. Is There a Relationship Between Glucose and Systolic Blood Pressure?

⭐ Draw a boxplot to see the relationship.

⭐ Show the descriptive statistics of the 3 groups.

⭐ Conduct the relevant statistical test to see if there is a significant difference between the means of the groups.
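
With three glucose levels, the relevant test is a one-way ANOVA. The sketch below uses `scipy.stats.f_oneway` on a synthetic frame that mimics the `gluc` and `ap_hi` columns; the group means are made up for illustration and are not results from cardio.csv.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: three glucose levels with different mean systolic pressure
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "gluc": np.repeat([1, 2, 3], 80),
    "ap_hi": np.concatenate([
        rng.normal(120, 10, 80),
        rng.normal(122, 10, 80),
        rng.normal(135, 10, 80),
    ]),
})

# Descriptive statistics per group
print(toy.groupby("gluc")["ap_hi"].describe())

# One-way ANOVA: H0 states that all group means are equal
groups = [g["ap_hi"].to_numpy() for _, g in toy.groupby("gluc")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.4g}")
```

A boxplot of the same frame could be drawn with `sns.boxplot(x="gluc", y="ap_hi", data=toy)`.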

## Task-4. Is There a Relationship Between Physical Activity and the Presence or Absence of Cardiovascular Disease?

### Physical activity vs. presence or absence of cardiovascular disease

⭐ Create a crosstab using pandas.

⭐ Conduct a chi-square test to see if there is a relationship between the 2 categorical variables.
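
A minimal sketch of both steps, using `pd.crosstab` and `scipy.stats.chi2_contingency`, is shown below. The counts are hypothetical and only mimic the `active` and `cardio` columns; they are not taken from cardio.csv.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical counts: (active, cardio) pairs repeated 30, 50, 120 and 100 times
toy = pd.DataFrame({
    "active": np.repeat([0, 0, 1, 1], [30, 50, 120, 100]),
    "cardio": np.repeat([0, 1, 0, 1], [30, 50, 120, 100]),
})

# Contingency table of observed counts
table = pd.crosstab(toy["active"], toy["cardio"])
print(table)

# Chi-square test of independence; for a 2x2 table SciPy applies
# Yates' continuity correction by default
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
print("Expected counts under independence:")
print(expected)
```

On the notebook's data the table would be built with `pd.crosstab(df['active'], df['cardio'])`.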