# Can Data Predict Cardiovascular Disease?


Cardiovascular disease (CVD) is the number one cause of death globally, which has prompted extensive research into how to prevent it. Although up to 90% of CVD is said to be preventable, the disease is difficult to predict and prevent because it involves many risk factors, such as sex, family history, and smoking. Data analysis and machine learning methods offer a promising way to explore patient data, identify risk factors, and predict whether a person is likely to have CVD. I chose this subject because of its importance and the integral role that data analysis plays in addressing it. I am going to explore the following questions:

1.	How does each variable (risk factor) affect the likelihood of having CVD? For instance, can we say that cholesterol level contributes to CVD? What about smoking?
2.	How do the distributions of different risk factors compare between people with and without CVD? What kind of distributions are they?
3.	What is the correlation between variables? Can we expect any relationship? For example, can we estimate a person's blood pressure from their age?
4.	Based on a person's health profile, can we predict whether they are likely to have CVD?




In [None]:
# Filter all warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import sklearn.naive_bayes as sknb
%matplotlib inline

In [None]:
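def multiplePlots(series):
    """Draw four diagnostic plots: histogram, lag plot, QQ plot, run sequence."""
    values = np.asarray(series)
    fig, axs = plt.subplots(2, 2)
    plt.tight_layout(pad=0.4, w_pad=4, h_pad=1.0)

    # Histogram with a kernel density estimate
    # (histplot replaces the deprecated distplot)
    sns.histplot(values, kde=True, stat="density", ax=axs[0, 0])

    # Lag plot: each value y_i against the previous value y_(i-1)
    # (x and y are passed by keyword, as newer seaborn versions require)
    ax = sns.regplot(x=values[1:], y=values[:-1], fit_reg=False, ax=axs[0, 1])
    ax.set_ylabel("y_i-1")
    ax.set_xlabel("y_i")

    # QQ plot: probplot returns the theoretical quantiles, then the ordered sample
    theoretical, ordered = stats.probplot(values, fit=False)
    sns.regplot(x=ordered, y=theoretical, ax=axs[1, 0])

    # Run sequence: value against its position in the series
    ax = sns.regplot(x=np.arange(len(values)), y=values, ax=axs[1, 1])
    ax.set_ylabel("val")
    ax.set_xlabel("i")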

# Load Data

In [None]:
data = pd.read_csv("../input/cardiovascular-disease-dataset/cardio_train.csv",sep=";")
data.head()

In [None]:
df = pd.read_csv("../input/cardiovascular-disease-dataset/cardio_train.csv",sep=";")
df.head()

In [None]:
print(f"The shape of the dataset is: {df.shape}")

In [None]:
df.info()

👩🏻‍💻 
 - There are no null values in the dataset.
 - There are 70,000 observations.
 - There are 12 columns.

# Data Preparation

My plan for data preparation:
 - Drop duplicated rows, if any.
 - Drop any implausible observations.
 - Convert age from days to years.
 - Create a column for age ranges.
 - Create a column for body mass index (BMI).
 - Create a column for blood pressure stage.
 - Convert variable types.


#### drop duplicates

In [None]:
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True)

#### drop errors and outliers

In [None]:
df.describe()

❗️ The minimum and maximum values of weight and height don't make sense.

- The minimum weight is 10 kg, which is an error.
- The minimum height is 55 cm, which is an error.

In [None]:
# remove weight outliers
weight_min_outlier_mask = df['weight'] > df['weight'].quantile(0.005)
weight_max_outlier_mask = df['weight'] < df['weight'].quantile(0.999)
df = df[(weight_min_outlier_mask) & (weight_max_outlier_mask)]

In [None]:
height_min_outlier_mask = df['height'] > df['height'].quantile(0.005)
height_max_outlier_mask = df['height'] < df['height'].quantile(0.999)
df = df[(height_min_outlier_mask) & (height_max_outlier_mask)]

📚 from [webmd.com](https://www.webmd.com/hypertension-high-blood-pressure/guide/diastolic-and-systolic-blood-pressure-know-your-numbers)

Blood flows through our body because of a difference in pressure.

Our blood pressure is highest at the start of its journey from our heart – when it enters the aorta – and it is lowest at the end of its journey along progressively smaller branches of arteries. 

The systolic pressure (ap_hi) is the higher figure caused by the heart’s contraction, while the diastolic number (ap_lo) is the lower pressure in the arteries.

❗️ Therefore, ap_hi should be higher than ap_lo, and neither can be negative.

In [None]:
print(f"In {df[df['ap_hi'] < df['ap_lo']].shape[0]} observations ap_hi is lower than ap_lo, which is incorrect.")
print('_'*80)
print()
print("Let's remove them:")

# Keep only rows where systolic exceeds diastolic (rows where they are equal are dropped too)
df = df[df['ap_hi'] > df['ap_lo']].reset_index(drop=True)
df.head()

#### create blood pressure stage variable

In [None]:
def bload_pressure_stage(data):
    # Check the most severe stage first: every crisis reading also satisfies
    # the stage-2 condition, so 'Crisis' was unreachable when tested last.
    # Cascading checks also ensure every row gets a label (no gaps between ranges).
    if data['ap_hi'] >= 180 or data['ap_lo'] >= 120:
        return 'Crisis'
    if data['ap_hi'] >= 140 or data['ap_lo'] >= 90:
        return 'High_Bload_Pressur_2'
    if data['ap_hi'] >= 130 or data['ap_lo'] >= 80:
        return 'High_Bload_Pressur_1'
    if data['ap_hi'] >= 120:
        return 'Elevated'
    return 'Normal'

df['bload_pressure_stage'] = df.apply(bload_pressure_stage, axis=1)

In [None]:
df.head(2)

#### convert age from days to years and create age range variable:

In [None]:
df['age'] = round(df['age']/365).apply(lambda x: int(x))
df.head(2)

In [None]:
print('Min Age: ', df['age'].min())
print('Max Age: ', df['age'].max())
print('Mean Age: ', df['age'].mean())

df['age_range'] = df['age'].apply(lambda x: 'young' if (x >=30 and x <= 40) else ('middle_age' if (x>40 and x<=45) else 'Elderly'))
df.head(2)

# Calculating Body Mass Index

Height alone does not seem to be a relevant variable, but calculating BMI from height and weight should give better insight:

In [None]:
def BMI(data):
    return data['weight'] / (data['height']/100)**2
 
df['bmi'] = df.apply(BMI, axis=1)

In [None]:
df.head(2)

📚 from: [cdc.gov](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html)

## How is BMI interpreted for adults?
For adults 20 years old and older, BMI is interpreted using standard weight status categories. These categories are the same for men and women of all body types and ages.

The standard weight status categories associated with BMI ranges for adults are shown in the following table.

BMI | Weight Status
---|---
Below 18.5 | Underweight
18.5 – 24.9 | Normal or Healthy Weight
25.0 – 29.9 | Overweight
30.0 and Above | Obese

In [None]:
# Boundaries follow the CDC table: below 18.5 underweight, below 25 normal, below 30 overweight, otherwise obese
df['weight_status'] = df['bmi'].apply(lambda x: 'Underweight' if x < 18.5 else ('Normal' if x < 25 else ('Overweight' if x < 30 else 'obese')))
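
The CDC table can also be expressed with `pd.cut`, which makes the bin edges explicit. This is only a sketch on a few made-up BMI values (the names `sample_bmi`, `bmi_bins`, and `bmi_labels` are illustrative and not part of the notebook). It assumes left-closed bins, so a BMI of exactly 25.0 falls into the overweight category:

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values chosen to sit on or near the category boundaries
sample_bmi = pd.Series([17.0, 18.5, 24.95, 25.0, 29.99, 30.0])

# Left-closed bins: [-inf, 18.5), [18.5, 25), [25, 30), [30, inf)
bmi_bins = [-np.inf, 18.5, 25.0, 30.0, np.inf]
bmi_labels = ['Underweight', 'Normal', 'Overweight', 'obese']

sample_status = pd.cut(sample_bmi, bins=bmi_bins, labels=bmi_labels, right=False)
print(sample_status.tolist())
```

With `right=False` there is no gap between 24.9 and 25.0, which a chain of hand-written comparisons can easily introduce.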

In [None]:
df.head()

In [None]:
df.describe()

Some columns, such as gender, could be stored as categories. The conversion is left commented out below:

In [None]:
# df['gender'] = df['gender'].astype('category')
# df['cholesterol'] = df['cholesterol'].astype('category')
# df['gluc'] = df['gluc'].astype('category')
# df['smoke'] = df['smoke'].astype('category')
# df['alco'] = df['alco'].astype('category')
# df['active'] = df['active'].astype('category')
# df['cardio'] = df['cardio'].astype('category')

In [None]:
df.head(3)

In [None]:
df.info()

# Q1: What are the risk factors of cardiovascular disease?

In [None]:
pd.crosstab(df.gender, df.cardio, normalize=False).plot(kind="bar", figsize=(10, 6))
plt.title('Cardiovascular Disease Frequency for genders')
plt.xlabel('Gender (1: female, 2: male)')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
plt.savefig('1.png')
plt.show()

In [None]:
g = pd.crosstab(df.gender, df.cardio, margins=True)
g

In [None]:
sns.heatmap(pd.crosstab(df.gender, df.cardio, normalize=True), annot=True,cmap=sns.cubehelix_palette())

In [None]:
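from scipy.stats import chi2_contingency
# Test the table without margins: the 'All' totals are not observations
# and would inflate the degrees of freedom.
chi2, p, dof, ex = chi2_contingency(pd.crosstab(df.gender, df.cardio))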
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ",dof)


👩🏻‍💻 

From the bar chart, the chance of having CVD appears slightly higher for males.

Chi-squared test: the p-value is larger than 0.05. Therefore, we do not have enough evidence to reject the null hypothesis that gender has no effect on having CVD.
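
One detail worth checking in any chi-squared test on a crosstab is whether the table includes totals. A table built with `margins=True`, like `g` above, carries an extra `All` row and column. The sketch below uses made-up counts (not the CVD data) to illustrate the effect. The Pearson statistic is unchanged, but the degrees of freedom grow and the p-value with them, so p-values close to 0.05 are worth rechecking on the table without totals:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts for a 2x2 table (rows: group, columns: outcome)
observed = pd.DataFrame([[300, 200], [250, 250]],
                        index=['group_a', 'group_b'], columns=[0, 1])

# The same table with an 'All' row and column, as produced by margins=True
with_totals = observed.copy()
with_totals['All'] = with_totals.sum(axis=1)
with_totals.loc['All'] = with_totals.sum(axis=0)

# correction=False so that both calls compute the plain Pearson statistic
chi2_obs, p_obs, dof_obs, _ = chi2_contingency(observed, correction=False)
chi2_tot, p_tot, dof_tot, _ = chi2_contingency(with_totals, correction=False)

print("without totals: chi2 =", chi2_obs, "dof =", dof_obs, "p =", p_obs)
print("with totals:    chi2 =", chi2_tot, "dof =", dof_tot, "p =", p_tot)
```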


In [None]:
s = pd.crosstab(df.smoke, df.cardio)
s.plot(kind="bar", figsize=(10, 6))
plt.title('Cardiovascular Disease Frequency for smoke status')
plt.xlabel('Smoke')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
plt.savefig('2.png')
plt.show()

In [None]:
sns.heatmap(s, annot=True,cmap=sns.cubehelix_palette())

In [None]:
# Margins omitted: the 'All' totals are not observations
chi2, p, dof, ex = chi2_contingency(pd.crosstab(df.smoke, df.cardio))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ",dof)



👩🏻‍💻 

From the bar chart, smoking does not seem to have much impact on CVD. However, the chi-squared test gives a p-value below 0.05, so we can reject the null hypothesis that smoking has no effect on having CVD.

**The weak visual effect might be because we don't know how often people smoke. If participants had been asked to choose a range for how often they smoke, we might see clearer results.**

In [None]:
p = pd.crosstab(df.active, df.cardio)
p.plot(kind="bar", figsize=(10, 6))
plt.title('Cardiovascular Disease Frequency for physically active status')
plt.xlabel('Physically Active')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
plt.savefig('3.png')
plt.show()

In [None]:
# Margins omitted: the 'All' totals are not observations
chi2, p, dof, ex = chi2_contingency(pd.crosstab(df.active, df.cardio))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ",dof)

👩🏻‍💻 

Based on the bar chart, among physically active people fewer have CVD than not, whereas among inactive people more have CVD than not. The chi-squared test supports this interpretation: the p-value is far below 0.05, so we can reject the null hypothesis that activity status has no effect on having CVD. This suggests that physical activity is associated with a lower chance of cardiovascular disease.

In [None]:
a= pd.crosstab(df.alco, df.cardio)
a.plot(kind="bar", figsize=(10, 6))
plt.title('Cardiovascular Disease Frequency for alcohol status')
plt.xlabel('Alcohol Status')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
plt.savefig('4.png')
plt.show()

In [None]:
# Margins omitted: the 'All' totals are not observations
chi2, p, dof, ex = chi2_contingency(pd.crosstab(df.alco, df.cardio))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ",dof)

👩🏻‍💻 

As with smoking status, the bar chart is not very informative, because we do not know how often a person with alcohol status 1 drinks. Consistent with this, the p-value of the chi-squared test is not below 0.05, so we do not have enough evidence to reject the null hypothesis that alcohol status has no effect on having CVD.

In [None]:
ch = pd.crosstab(df.cholesterol, df.cardio)
ch.plot(kind="bar", figsize=(10, 6))
plt.title('Cardiovascular Disease Frequency for cholesterol status')
plt.xlabel('Cholesterol')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
plt.savefig('5.png')
plt.show()

In [None]:
# Margins omitted: the 'All' totals are not observations
chi2, p, dof, ex = chi2_contingency(pd.crosstab(df.cholesterol, df.cardio))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ",dof)

👩🏻‍💻 

The chart above shows that higher cholesterol levels are associated with CVD. Moreover, the chi-squared p-value is below 0.05, so we can reject the null hypothesis that cholesterol level has no effect on having CVD.

In [None]:
gl = pd.crosstab(df.gluc, df.cardio)
gl.plot(kind="bar", figsize=(10, 6))
plt.title('Cardiovascular Disease Frequency for glucose status')
plt.xlabel('Glucose Status')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
plt.savefig('6.png')
plt.show()

In [None]:
# Margins omitted: the 'All' totals are not observations
chi2, p, dof, ex = chi2_contingency(pd.crosstab(df.gluc, df.cardio))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ",dof)

👩🏻‍💻 

The chart above shows that higher glucose levels are associated with CVD, and the chi-squared test supports this interpretation.

In [None]:
bp = pd.crosstab(df.bload_pressure_stage, df.cardio)
bp.plot(kind="bar", figsize=(10, 6))
plt.title('Cardiovascular Disease Frequency for blood pressure status')
plt.xlabel('Blood Pressure Status')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
plt.savefig('7.png')
plt.show()

In [None]:
# Margins omitted: the 'All' totals are not observations
chi2, p, dof, ex = chi2_contingency(pd.crosstab(df.bload_pressure_stage, df.cardio))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ",dof)

👩🏻‍💻 

- The chart above shows a clear association between high blood pressure and CVD. The chi-squared test supports this interpretation.

In [None]:
age = pd.crosstab(df.age_range, df.cardio)
age.plot(kind="bar", figsize=(10, 6))
plt.title('Cardiovascular Disease Frequency for age status')
plt.xlabel('Age Status')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
plt.savefig('8.png')
plt.show()

In [None]:
# Margins omitted: the 'All' totals are not observations
chi2, p, dof, ex = chi2_contingency(pd.crosstab(df.age_range, df.cardio))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ",dof)

👩🏻‍💻 

- The chart above shows a clear association between age range and CVD. The chi-squared test supports this result.

In [None]:
bmi = pd.crosstab(df.weight_status, df.cardio)
bmi.plot(kind="bar", figsize=(10, 6))
plt.title('Cardiovascular Disease Frequency for weight status')
plt.xlabel('Weight Status')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
plt.savefig('9.png')
plt.show()

In [None]:
pd.crosstab(df.weight_status, df.cardio, margins=True)

In [None]:
# Margins omitted: the 'All' totals are not observations
chi2, p, dof, ex = chi2_contingency(pd.crosstab(df.weight_status, df.cardio))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ",dof)

👩🏻‍💻 

- The chart above shows a clear association between weight status and CVD. The chi-squared test supports this result.

# Findings

I found that weight status (normal, overweight, underweight, obese), age range (young, middle age, elderly), blood pressure status (normal, elevated, high 1, and high 2), glucose level, cholesterol level, and physical activity status are associated with having CVD. Their frequency bar charts show this. I also performed a chi-squared test for each of these variables, and the p-value was less than 0.05, so we can reject the null hypothesis that the variable has no effect on having CVD.

For gender and alcohol intake, the visualizations showed little difference between groups, so I could not see any effect of these variables on having CVD. The chi-squared p-values were also not statistically significant, so we cannot reject the null hypothesis that these variables have no effect on having CVD. Because we do not know how often a person with alcohol status 1 drinks, the alcohol variable carries little useful information here.

For smoking status, the visualization also showed little difference, so I could not see any effect of smoking on having CVD from the chart alone. The chi-squared p-value, however, was statistically significant, so we can reject the null hypothesis that smoking has no effect on having CVD. The weak visual effect might be because we don't know how often people smoke. If participants had been asked to choose a range for how often they smoke, we might see a clearer result. Moreover, it is possible that some variables, such as smoking, do not affect CVD directly but act through another variable that does.


# Q2. How can we compare the distributions of different risk factors among people with and without CVD? What kind of distributions are they?

In [None]:
df.head(2)

In [None]:
fig, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
sns.distplot(df[df['cardio'] == 1]['weight'], hist = True, kde=True, ax=ax1).set(title = 'CVD')
sns.distplot(df[df['cardio'] == 0]['weight'], hist = True, kde=True, ax=ax2).set(title = 'healthy')

In [None]:
multiplePlots(df[df['cardio'] == 0]['weight'])

In [None]:
df[df['cardio'] == 1]['weight'].describe()

In [None]:
df[df['cardio'] == 0]['weight'].describe()

In [None]:
import scipy.stats as stats

In [None]:
stats.ttest_ind(df[df['cardio'] == 0]['weight'],df[df['cardio'] == 1]['weight'],equal_var=False)

Weight has an approximately normal distribution in both groups, with a slight right skew. The mean and median weight among people with CVD are about 5 kg higher. The two distributions look similar, but the p-value from the t-test is statistically significant, so the mean weight differs between the groups with and without CVD.
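
With tens of thousands of observations per group, a t-test can flag even a modest difference in means as significant while the histograms still look alike. The following sketch illustrates this on synthetic data (the means, spread, and sample sizes are assumptions chosen for illustration, not values from the dataset) and reports a standardized effect size next to the p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two synthetic groups with the same spread and a small shift in the mean
group_a = rng.normal(loc=72.0, scale=14.0, size=30000)
group_b = rng.normal(loc=74.0, scale=14.0, size=30000)

# Welch's t-test (equal_var=False), the same variant used in this notebook
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Standardized mean difference (Cohen's d) as a rough effect size
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, d = {cohens_d:.2f}")
```

Reading the effect size together with the p-value helps separate "statistically detectable" from "large".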

In [None]:
fig, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
sns.distplot(df[df['cardio'] == 1]['ap_lo'], hist = True, kde=False, ax=ax1).set(title = 'CVD')
sns.distplot(df[df['cardio'] == 0]['ap_lo'], hist = True, kde=False, ax=ax2).set(title = 'healthy')

In [None]:
multiplePlots(df[df['cardio'] == 0]['ap_lo'])

In [None]:
print('2.5th percentile: ', np.percentile(df['ap_lo'],2.5))
print('97.5th percentile: ', np.percentile(df['ap_lo'],97.5))
# This is the width of the central 95% range of the data, not a confidence interval
print('width of central 95% range: ', np.percentile(df['ap_lo'],97.5) - np.percentile(df['ap_lo'],2.5))
print('variance: ', np.var(df['ap_lo']))
print('std: ', np.std(df['ap_lo']))

In [None]:
# Width of the central 95% range divided by the standard deviation (values from the output above)
40/9.7

In [None]:
stats.ttest_ind(df[df['cardio'] == 0]['ap_lo'],df[df['cardio'] == 1]['ap_lo'],equal_var=False)

Diastolic (low) blood pressure is roughly normally distributed in both groups, with tails on both sides. The p-value from the t-test is less than 0.05, so the group means are significantly different.


In [None]:
fig, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
sns.distplot(df[df['cardio'] == 1]['ap_hi'], hist = True, kde=False, ax=ax1).set(title = 'CVD')
sns.distplot(df[df['cardio'] == 0]['ap_hi'], hist = True, kde=False, ax=ax2).set(title = 'healthy')

In [None]:
fig, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
sns.distplot(np.log(df[df['cardio'] == 1]['ap_hi']), hist = True, kde=True, ax=ax1).set(title = 'CVD')
sns.distplot(np.log(df[df['cardio'] == 0]['ap_hi']), hist = True, kde=True, ax=ax2).set(title = 'healthy')

In [None]:
multiplePlots( np.log(df[df['cardio'] == 0]['ap_hi']))

In [None]:
stats.ttest_ind(df[df['cardio'] == 0]['ap_hi'],df[df['cardio'] == 1]['ap_hi'],equal_var=False)

In [None]:
sns.distplot(df[df['cardio'] == 1]['ap_lo'], hist = True, kde=True).set(title = 'CVD')

Systolic (high) blood pressure has a long tail, and based on a visual check its distribution resembles a Pareto (power-law) distribution. Plotting the log of the values lets us see the distribution over a reasonable range; in effect, the log scale highlights the range where the majority of the data occur. Based on the t-test, the group means are significantly different.

In [None]:
fig, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
sns.distplot(df[df['cardio'] == 1]['height'], hist = True, kde=False, ax=ax1).set(title = 'CVD')
sns.distplot(df[df['cardio'] == 0]['height'], hist = True, kde=False, ax=ax2).set(title = 'healthy')

In [None]:
multiplePlots( np.log(df[df['cardio'] == 0]['height']))

In [None]:
stats.ttest_ind(df[df['cardio'] == 0]['height'],df[df['cardio'] == 1]['height'],equal_var=False)

Interestingly, the mean height is also significantly different between the two groups.

In [None]:
fig, (ax1,ax2, ax3) = plt.subplots(nrows=1, ncols=3, sharey=True)
sns.distplot(df[df['cardio'] == 1]['bmi'], hist = True, kde=False, ax=ax1).set(title = 'CVD')
sns.distplot(df[df['cardio'] == 0]['bmi'], hist = True, kde=False, ax=ax2).set(title = 'healthy')
sns.distplot(df['bmi'], hist = True, kde=False, ax=ax3).set(title = 'whole data')

In [None]:
multiplePlots(df[df['cardio'] == 0]['bmi'])

In [None]:
df[df['cardio'] == 1]['bmi'].describe()

In [None]:
df[df['cardio'] == 0]['bmi'].describe()

In [None]:
stats.ttest_ind(df[df['cardio'] == 0]['bmi'],df[df['cardio'] == 1]['bmi'],equal_var=False)

BMI has an approximately normal distribution in both groups, skewed to the right, with a mean around 27 and a median around 26. The group means are significantly different.

In [None]:
print('2.5th percentile: ', np.percentile(df['bmi'],2.5))
print('97.5th percentile: ', np.percentile(df['bmi'],97.5))
# This is the width of the central 95% range of the data, not a confidence interval
print('width of central 95% range: ', np.percentile(df['bmi'],97.5) - np.percentile(df['bmi'],2.5))
print('variance: ', np.var(df['bmi']))
print('std: ', np.std(df['bmi']))

In [None]:
# Width of the central 95% range divided by the standard deviation (values from the output above)
19.59 / 5.06

About 95% of the data spans roughly 4 standard deviations (the mean plus and minus two standard deviations). Because the empirical rule holds for our data, we can treat the distribution as approximately normal.
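
As a sanity check on this reasoning, the ratio of the central 95% width to the standard deviation can be computed for samples whose shape is known. The sketch below uses synthetic data only (the sample sizes and parameters are arbitrary assumptions). For a normal sample the ratio should land near 3.92, but a skewed exponential sample also gives a value fairly close to 4, so the check is suggestive rather than conclusive:

```python
import numpy as np

def central_width_over_std(sample):
    """Width of the central 95% range divided by the standard deviation."""
    low, high = np.percentile(sample, [2.5, 97.5])
    return (high - low) / np.std(sample)

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=27.0, scale=5.0, size=100_000)
skewed_sample = rng.exponential(scale=1.0, size=100_000)

ratio_normal = central_width_over_std(normal_sample)
ratio_skewed = central_width_over_std(skewed_sample)
print(f"normal: {ratio_normal:.2f}, exponential: {ratio_skewed:.2f}")
```

A QQ plot, such as the one drawn by `multiplePlots`, is a useful complement because it shows where in the range the data depart from normality.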

# Findings

I found that weight has an approximately normal distribution in both groups, with a slight right skew. The mean and median weight among people with CVD are about 5 kg higher. Diastolic (low) blood pressure is also roughly normally distributed. Systolic (high) blood pressure has a long tail, and based on a visual check its distribution resembles a Pareto (power-law) distribution. Plotting the log of the values lets us see the distribution over a reasonable range; in effect, the log scale highlights the range where the majority of the data occur. Body mass index (BMI) has an approximately normal distribution skewed to the right, and it is roughly the same for both groups.

Interestingly, the distributions for the two groups looked roughly similar visually. However, the t-test p-values were statistically significant, so we can conclude that the mean values of weight, height, BMI, and low and high blood pressure differ significantly between people who have CVD and those who do not.


# Q3. What is the correlation between variables? Can we expect a linear relationship?

In [None]:
# Correlate numeric columns only; the derived label columns hold strings
corr = df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(corr, annot=True, linewidths=0.5, square=True, vmax=0.3, center=0, cmap=sns.cubehelix_palette())
plt.savefig('heat.png')

👩🏻‍💻

From the heatmap above, I can see:
 - Age, diastolic blood pressure (ap_lo), cholesterol, and BMI are slightly correlated with cardio (the target class). Their correlation values are low because each of these risk factors contributes to CVD only in part, and its effect also depends on the other variables.

Pairs with the highest correlation:

Variable 1 | Variable 2 | Correlation
---|---|---
cardio | ap_lo | 0.33
cardio | cholesterol | 0.22
cardio | bmi | 0.19
cardio | age | 0.24
cholesterol | gluc | 0.45
weight | ap_lo | 0.24



In [None]:
df.columns

In [None]:
vars = ['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol',
       'gluc', 'smoke', 'alco', 'active','bmi', 'cardio']

# OLS and ANOVA

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
model1 = smf.ols('bmi ~ C(cardio)', data=df).fit()
model1.summary()

In [None]:
aov_table = sm.stats.anova_lm(model1, typ=2)
print(aov_table)

Prob(F-statistic) is reported as 0.
So, do we have evidence that the mean BMI differs across the two groups (cardio 0 and 1)? Yes, we have evidence that they are different.

In [None]:
model1 = smf.ols('age ~ C(cardio)', data=df).fit()
model1.summary()

In [None]:
aov_table = sm.stats.anova_lm(model1, typ=2)
print(aov_table)

In [None]:
model1 = smf.ols('cholesterol ~ C(cardio)', data=df).fit()
model1.summary()

In [None]:
aov_table = sm.stats.anova_lm(model1, typ=2)
print(aov_table)

In [None]:
model3 = smf.ols('ap_lo ~ weight', data=df).fit()
model3.summary()

In the regression model above, the dependent variable is diastolic (low) blood pressure, and I would like to see how it changes with weight.
The p-value is less than 0.05, so the relationship is statistically significant.
We can expect the linear model:

y(ap_lo) = 68.6471 + 0.1705*X(weight)

In [None]:
# Predicted ap_lo for a weight of 50 kg
68.6471 + 0.1705 * 50
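
The fitted line can be wrapped in a small helper to read off predictions for several weights. This is a minimal sketch that hard-codes the two coefficients reported in the text. The function name `predict_ap_lo` and the constant names are illustrative, not part of the notebook:

```python
# Coefficients copied from the OLS result reported in the text
INTERCEPT = 68.6471
SLOPE_PER_KG = 0.1705

def predict_ap_lo(weight_kg):
    """Return the fitted diastolic pressure (ap_lo) for a weight in kg."""
    return INTERCEPT + SLOPE_PER_KG * weight_kg

for weight in (50, 70, 90):
    print(weight, "kg ->", round(predict_ap_lo(weight), 2))
```

Under this model each additional 10 kg raises the expected ap_lo by about 1.7 units. The line describes an average trend and is not a precise prediction for an individual.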

# Findings

After creating a heatmap and pair plot, I found that the variables do not appear strongly correlated visually. However, I chose the pairs with the highest correlation and performed OLS or ANOVA on them. The analysis shows that we can expect some correlation between age, weight, and high/low blood pressure. For instance, we can fit the following model:

Y (high blood pressure) = 103.96 + 0.47*X(age)

We also have evidence that the means of BMI, age, and cholesterol differ between people who have CVD and those who do not.


# Classification

**Objective**: predict whether a person is likely to have cardiovascular disease.


**Possible classes**: "cardio: 0", "cardio: 1"


**Features**:age, ap_hi, ap_lo, bmi, bload_pressure_stage_High_Bload_Pressur_1, bload_pressure_stage_High_Bload_Pressur_2, bload_pressure_stage_Normal, age_range_middle_age, age_range_young, weight_status_Overweight, weight_status_Underweight, weight_status_obese, gender_2, cholesterol_2, cholesterol_3, gluc_2, gluc_3, smoke_1, alco_1, active_1, cardio_1

In [None]:
import pandas as pd
import numpy as np
import sklearn as sk 
from sklearn.model_selection import train_test_split
import sklearn.ensemble as skens

In [None]:
data.duplicated().sum()

In [None]:
data.drop_duplicates(inplace=True)

In [None]:
# remove weight outliers
weight_min_outlier_mask = data['weight'] > data['weight'].quantile(0.005)
weight_max_outlier_mask = data['weight'] < data['weight'].quantile(0.999)
data = data[(weight_min_outlier_mask) & (weight_max_outlier_mask)]

In [None]:
height_min_outlier_mask = data['height'] > data['height'].quantile(0.005)
height_max_outlier_mask = data['height'] < data['height'].quantile(0.999)
data = data[(height_min_outlier_mask) & (height_max_outlier_mask)]

In [None]:
print(f"In {data[data['ap_hi'] < data['ap_lo']].shape[0]} observations ap_hi is lower than ap_lo, which is incorrect.")
print('_'*80)
print()
print("Let's remove them:")

data = data[data['ap_hi'] > data['ap_lo']].reset_index(drop=True)
data.head()

In [None]:
data['age'] = round(data['age']/365).apply(lambda x: int(x))
data.head(2)

In [None]:
def BMI(data):
    return data['weight'] / (data['height']/100)**2
 
data['bmi'] = data.apply(BMI, axis=1)

In [None]:
cvd=data.copy()

In [None]:
cvd['gender']= cvd['gender'].apply(lambda x: 0 if x==2 else 1)

In [None]:
cvd['cardio'].value_counts()

In [None]:
cvd.shape

In [None]:
sns.countplot(x='cardio',data=cvd)

In [None]:

cvd = pd.get_dummies(cvd,drop_first=True)
cvd.head(2)

In [None]:
cvd.columns

In [None]:
cvd = cvd[['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol',
       'gluc', 'smoke', 'alco', 'active', 'bmi', 'cardio']]

In [None]:
X = cvd.loc[:, cvd.columns != 'cardio']
y = cvd.loc[:, cvd.columns == 'cardio']

In [None]:
# split data into test and train set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
columns = X_train.columns

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

## Decision Tree

In [None]:
import sklearn.tree as sktree

## 1.1 Train a decision tree classifier

We will train a decision tree classifier to classify the target class (CVD). Here are the specifications:
- __Objective__: predict who has cardiovascular disease.
- __Possible classes__: "cardio: 0", "cardio: 1"
- __Features__:

## Random Forest

In [None]:
# build 10 random trees
rf_model = skens.RandomForestClassifier(n_estimators = 10, oob_score=True, criterion = 'entropy')
rf_model.fit(X_train, y_train)

In [None]:
#predict
predicted_labels = rf_model.predict(X_test)
y_test['predicted_rf'] = predicted_labels

In [None]:
# accuracy
accuracy = {}
from sklearn.metrics import accuracy_score
ac = accuracy_score(y_test['cardio'], y_test['predicted_rf'])
accuracy['Random Forest 10 estimator'] = ac
print("Accuracy with Random Forest: {0:.2%}".format(ac))

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test['cardio'], y_test['predicted_rf']))



In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test['cardio'], y_test['predicted_rf'])
cm

In [None]:
feat_importance = rf_model.feature_importances_
feat_importance

# Important Features

In [None]:
import matplotlib
matplotlib.rc('xtick', labelsize = 12)

pd.DataFrame({'Feature Importance':feat_importance},
             index=X_train.columns).sort_values(by='Feature Importance',ascending=True).plot(kind='barh', figsize = (8,8))

plt.savefig('features.png')

## K-fold cross validation

In [None]:
param_grid = {
                 'n_estimators': [5, 10, 15, 20, 25],
                 'max_depth': [2, 5, 7, 9],
             }

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid_clf = GridSearchCV(rf_model, param_grid, cv=10)
grid_clf.fit(X_train, y_train)

In [None]:
grid_clf.best_estimator_

In [None]:
grid_clf.best_params_

In [None]:
y_pred = grid_clf.predict(X_test)


In [None]:
y_pred

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test['cardio'], y_pred))

In [None]:
rf_model = skens.RandomForestClassifier(n_estimators=25, max_depth=9, oob_score=True, criterion='entropy')
rf_model.fit(X_train, y_train)

In [None]:
#predict
predicted_labels = rf_model.predict(X_test)
y_test['predicted_rf_25_estimator'] = predicted_labels

In [None]:
from sklearn.metrics import accuracy_score
ac = accuracy_score(y_test['cardio'], y_test['predicted_rf_25_estimator'])
accuracy['Random Forest 25 estimator'] = ac
print("Accuracy with Random Forest: {0:.2%}".format(ac))

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test['cardio'], y_test['predicted_rf_25_estimator']))



In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test['cardio'], y_test['predicted_rf_25_estimator'])
cm

## Tuning and changing hyperparameters

In [None]:
param_grid = {
                 'n_estimators': [2, 4, 8, 12, 16],
                 'max_depth': [2, 4, 6, 8],
             }

In [None]:
grid_clf = GridSearchCV(rf_model, param_grid, cv=5)
grid_clf.fit(X_train, y_train)

In [None]:
grid_clf.best_params_
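
By default, `GridSearchCV` ranks candidates for a classifier by accuracy. Because the findings of this notebook favor recall, one option is to pass `scoring='recall'` so that the search optimizes the preferred metric directly. The sketch below runs on a small synthetic dataset from `make_classification` as a stand-in for `X_train` and `y_train`. The grid values and the names `X_demo`, `y_demo`, `demo_grid`, and `recall_search` are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (not the CVD data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_grid = {'n_estimators': [4, 8], 'max_depth': [2, 4]}
recall_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    demo_grid,
    cv=3,
    scoring='recall',  # rank candidates by recall on the positive class
)
recall_search.fit(X_demo, y_demo)
print(recall_search.best_params_, round(recall_search.best_score_, 3))
```

Setting `random_state` on the forest also makes repeated searches comparable, since otherwise each fit draws different bootstrap samples.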

In [None]:
rf_model = skens.RandomForestClassifier(n_estimators=16, max_depth=8, oob_score=True, criterion='entropy')
rf_model.fit(X_train, y_train)

In [None]:
#predict
predicted_labels = rf_model.predict(X_test)
y_test['predicted_rf_16_estimator'] = predicted_labels

In [None]:
from sklearn.metrics import accuracy_score
ac = accuracy_score(y_test['cardio'], y_test['predicted_rf_16_estimator'])
accuracy['Random Forest 16 estimator'] = ac
print("Accuracy with Random Forest: {0:.2%}".format(ac))

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test['cardio'], y_test['predicted_rf_16_estimator']))


In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test['cardio'], y_test['predicted_rf_16_estimator'])
cm

# Naive Bayes

In [None]:
gnb_model = sknb.GaussianNB()
gnb_model.fit(X_train, y_train)

In [None]:
y_pred = gnb_model.predict(X_test)
y_test['predict'] = y_pred

In [None]:
from sklearn.metrics import accuracy_score
ac = accuracy_score(y_test['cardio'], y_test['predict'])
accuracy['Naive Bayes without prior'] = ac
print("Accuracy: {0:.2%}".format(ac))

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test['cardio'], y_test['predict']))

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test['cardio'], y_test['predict'])
cm

## Changing Prior

In [None]:
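# priors must be passed by keyword; the values are the assumed class probabilities for cardio 0 and 1
gnb_model = sknb.GaussianNB(priors=[0.3, 0.7])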
gnb_model.fit(X_train, y_train)

In [None]:
y_pred = gnb_model.predict(X_test)
y_test['predict'] = y_pred

In [None]:
from sklearn.metrics import accuracy_score
ac = accuracy_score(y_test['cardio'], y_test['predict'])
accuracy['Naive Bayes prior 30-70'] = ac
print("Accuracy: {0:.2%}".format(ac))

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x=list(accuracy.keys()), y=list(accuracy.values()))
plt.ylabel("Accuracy")
plt.xlabel("Algorithms")
plt.xticks(rotation=45)
plt.savefig('mls.png')

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test['cardio'], y_test['predict']))



In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test['cardio'], y_test['predict'])
cm

# Findings

I used Random Forest and Naive Bayes to predict whether a person is likely to have cardiovascular disease. The dataset had approximately 70,000 observations, and about half of them had CVD. I found that the most important feature in predicting cardio is body mass index. I changed hyperparameters and used cross-validation to tune the models.
In selecting the model, since we are dealing with a disease, it is important to minimize prediction errors, and above all to minimize false-negative predictions. If we predict that a person does not have CVD when they actually do, we jeopardize that person's health. Since the dataset is balanced, I could consider the F1 score as well; however, because minimizing false negatives matters most, I prefer the recall score. Among all models, the tuned random forest, with a recall of 67%, is the preferred model.
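
To make the link between false negatives and recall concrete, the sketch below computes recall and precision from a 2x2 confusion matrix laid out like the output of `confusion_matrix` for labels 0 and 1. The counts are made up for illustration and are not results from this notebook, and the helper name `recall_and_precision` is hypothetical:

```python
def recall_and_precision(confusion):
    """Compute recall and precision for the positive class.

    `confusion` follows sklearn's layout for labels [0, 1]:
    [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = confusion
    recall = tp / (tp + fn)      # share of actual positives that were caught
    precision = tp / (tp + fp)   # share of predicted positives that were right
    return recall, precision

# Made-up counts, not results from this notebook
toy_confusion = [[70, 30], [20, 80]]
recall, precision = recall_and_precision(toy_confusion)
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

Every false negative lowers recall directly, which is why it is the natural metric when a missed diagnosis is the costliest error.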

