# Predicting Cardiovascular Disease
---


The purpose of this notebook is to find the model that best predicts whether a patient has cardiovascular disease.

This notebook follows the OSEMN data science process (Obtain, Scrub, Explore, Model, iNterpret).

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-libraries-that-will-be-used:" data-toc-modified-id="Import-libraries-that-will-be-used:-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import libraries that will be used:</a></span></li><li><span><a href="#Obtain-data" data-toc-modified-id="Obtain-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Obtain data</a></span><ul class="toc-item"><li><span><a href="#Data-contents:" data-toc-modified-id="Data-contents:-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Data contents:</a></span></li><li><span><a href="#Data-feature-descriptions:" data-toc-modified-id="Data-feature-descriptions:-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Data feature descriptions:</a></span></li></ul></li><li><span><a href="#Scrub" data-toc-modified-id="Scrub-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Scrub</a></span></li><li><span><a href="#Explore" data-toc-modified-id="Explore-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Explore</a></span><ul class="toc-item"><li><span><a href="#BMI-analysis" data-toc-modified-id="BMI-analysis-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>BMI analysis</a></span></li><li><span><a href="#Blood-Pressure-Category-analysis" data-toc-modified-id="Blood-Pressure-Category-analysis-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Blood Pressure Category analysis</a></span></li><li><span><a href="#Age-analysis" data-toc-modified-id="Age-analysis-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Age analysis</a></span></li><li><span><a href="#Gender-analysis" data-toc-modified-id="Gender-analysis-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Gender analysis</a></span></li><li><span><a href="#Other-quick-checks" data-toc-modified-id="Other-quick-checks-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Other quick checks</a></span><ul class="toc-item"><li><span><a href="#Cholesterol" data-toc-modified-id="Cholesterol-4.5.1"><span 
class="toc-item-num">4.5.1&nbsp;&nbsp;</span>Cholesterol</a></span></li><li><span><a href="#Activity" data-toc-modified-id="Activity-4.5.2"><span class="toc-item-num">4.5.2&nbsp;&nbsp;</span>Activity</a></span></li><li><span><a href="#Glucose" data-toc-modified-id="Glucose-4.5.3"><span class="toc-item-num">4.5.3&nbsp;&nbsp;</span>Glucose</a></span></li><li><span><a href="#Drink-alcohol-and-smoke" data-toc-modified-id="Drink-alcohol-and-smoke-4.5.4"><span class="toc-item-num">4.5.4&nbsp;&nbsp;</span>Drink alcohol and smoke</a></span></li></ul></li><li><span><a href="#Heatmap" data-toc-modified-id="Heatmap-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>Heatmap</a></span></li></ul></li><li><span><a href="#Models" data-toc-modified-id="Models-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Models</a></span><ul class="toc-item"><li><span><a href="#One-hot-encode,-Split,-and-Standardize" data-toc-modified-id="One-hot-encode,-Split,-and-Standardize-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>One hot encode, Split, and Standardize</a></span></li><li><span><a href="#Building-Models" data-toc-modified-id="Building-Models-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Building Models</a></span></li><li><span><a href="#Hyperparameter-Tuning" data-toc-modified-id="Hyperparameter-Tuning-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Hyperparameter Tuning</a></span><ul class="toc-item"><li><span><a href="#XGBoost" data-toc-modified-id="XGBoost-5.3.1"><span class="toc-item-num">5.3.1&nbsp;&nbsp;</span>XGBoost</a></span></li><li><span><a href="#Random-Forest" data-toc-modified-id="Random-Forest-5.3.2"><span class="toc-item-num">5.3.2&nbsp;&nbsp;</span>Random Forest</a></span></li><li><span><a href="#K-Nearest-Neighbor" data-toc-modified-id="K-Nearest-Neighbor-5.3.3"><span class="toc-item-num">5.3.3&nbsp;&nbsp;</span>K-Nearest Neighbor</a></span></li><li><span><a href="#Support-Vector-Machines" 
data-toc-modified-id="Support-Vector-Machines-5.3.4"><span class="toc-item-num">5.3.4&nbsp;&nbsp;</span>Support Vector Machines</a></span></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-5.3.5"><span class="toc-item-num">5.3.5&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Decision-Tree" data-toc-modified-id="Decision-Tree-5.3.6"><span class="toc-item-num">5.3.6&nbsp;&nbsp;</span>Decision Tree</a></span></li></ul></li></ul></li><li><span><a href="#Interpret" data-toc-modified-id="Interpret-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Interpret</a></span><ul class="toc-item"><li><span><a href="#Results" data-toc-modified-id="Results-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Results</a></span></li><li><span><a href="#Test-Accuracy" data-toc-modified-id="Test-Accuracy-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Test Accuracy</a></span></li><li><span><a href="#ROC-Curve" data-toc-modified-id="ROC-Curve-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>ROC Curve</a></span></li><li><span><a href="#Feature-Importance" data-toc-modified-id="Feature-Importance-6.4"><span class="toc-item-num">6.4&nbsp;&nbsp;</span>Feature Importance</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-6.5"><span class="toc-item-num">6.5&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></li></ul></div>

### Import libraries that will be used:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, GridSearchCV, \
RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from xgboost import XGBClassifier
from sklearn.metrics import f1_score, accuracy_score, classification_report,\
confusion_matrix, roc_auc_score, plot_confusion_matrix, plot_roc_curve

## Obtain data

---
The data was obtained from [Kaggle](https://www.kaggle.com/sulianova/cardiovascular-disease-dataset).

In [None]:
df = pd.read_csv("../input/cardiovascular-disease-dataset/cardio_train.csv", sep= ';')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe().T

In [None]:
df.cardio.value_counts(normalize= True)

### Data contents:

* 70,000 data points
* Almost equal counts of patients with and without cardiovascular disease
* Six continuous features, six categorical features

### Data feature descriptions:
| Feature | Feature type | Column | Values |
|---|---|---|---|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

In [None]:
# scatter matrix to visualize data
pd.plotting.scatter_matrix(df, figsize = [15,15]);

## Scrub
 
Data cleaning

In [None]:
# Check for missing values
df.isna().sum()

In [None]:
# check for duplicate values
df.duplicated().sum()

In [None]:
# drop id column
df.drop('id', axis= 1, inplace= True)

# recode gender from {1, 2} to {1, 0}: 2 becomes 0, 1 stays 1
# (assign back instead of using inplace on a column selection)
df['gender'] = df['gender'].replace(2, 0)

# change age from days to years
df['age'] = round(df['age'] / 365, 1)

In [None]:
print(df.age.min())
df.age.max()

The patients in our data are adults between roughly 29 and 65 years old.

Some of the heights seem suspicious, so we will check them out.

In [None]:
# check heights 
df[df['height'] < 125]

In [None]:
# heights under 125cm (about 4'1") seem too short for the weights
# associated with them, so we'll drop them
df = df[df['height'] >= 125]

In [None]:
df[df['height'] > 200]

In [None]:
# a height of 250cm (8'2") and a weight of 86kg (190lbs) seems suspicious
df.drop(index=6486, inplace= True)

Now we will check out the odd numbers in ap_hi and ap_lo.

A quick note about ap_hi and ap_lo:
they are the systolic and diastolic blood pressure readings, respectively. The systolic reading should always be the higher of the two.
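
The filters in the next few cells can be summarized as a single plausibility check. The sketch below is only an illustration in plain Python, not the notebook's own code: the function name `is_plausible_bp` is hypothetical, the sample readings are made up, and the cut-offs (80, 50, 250) are the ones used in the cells that follow.

```python
def is_plausible_bp(systolic, diastolic):
    """Return True if a reading would survive the filters used in this section."""
    return (
        systolic > diastolic       # systolic must exceed diastolic
        and 80 < systolic < 250    # drop negative, very low, and very high systolic values
        and diastolic > 50         # drop negative and very low diastolic values
    )

# Hypothetical readings, chosen only to exercise each rule
print(is_plausible_bp(120, 80))     # typical reading
print(is_plausible_bp(80, 120))     # the two columns look swapped
print(is_plausible_bp(14900, 90))   # likely a data-entry error
```

Writing the rules as one function makes it easier to see that a row is dropped as soon as any single rule fails.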

In [None]:
# keep only rows where ap_hi (systolic) is higher than ap_lo (diastolic)
df = df[df['ap_hi'] > df['ap_lo']]

In [None]:
# remove ap_hi and ap_lo with negative and extremely low numbers
# readings at or below 80 systolic or 50 diastolic are treated as abnormally low
df = df[df['ap_hi'] > 80]
df = df[df['ap_lo'] > 50]

In [None]:
# remove any ap_hi readings that are abnormally high
df = df[df['ap_hi'] < 250]

In [None]:
# while these diastolic readings are very high, 
# they are still lower than the systolic and match other features
df[df['ap_lo'] > 150]

Time to remove the incredibly low values for "weight"

In [None]:
# 40kg = 88lbs; anything lower suggests a severely underweight person or a typo in the data
df = df[df['weight'] >= 40]

In [None]:
# reset index
df.reset_index(inplace= True, drop= True)

In [None]:
# change feature names
new_names = {'ap_hi' : 'systolic', 
             'ap_lo' : 'diastolic', 
             'gluc' : 'glucose', 
             'alco': 'alcohol', 
             'cardio': 'disease'
            }

In [None]:
df = df.rename(columns= new_names)

In [None]:
df.shape

In [None]:
df.disease.value_counts(normalize= True)

Data cleaning removed 1,488 data points from our original 70,000, and we still have a roughly even split between patients with and without cardiovascular disease (CVD).
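
As a quick sanity check on those numbers, the arithmetic can be worked out directly. This is a small sketch using only the two counts quoted above; the variable names are illustrative.

```python
original_rows = 70_000
removed_rows = 1_488

remaining_rows = original_rows - removed_rows
removed_share = removed_rows / original_rows

print(remaining_rows)            # rows left after cleaning
print(f"{removed_share:.1%}")    # share of the data that was dropped
```

Only about 2% of the rows were removed, so the cleaning is unlikely to have changed the overall shape of the data much.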

## Explore

### BMI analysis

Body mass index (BMI) is a measure of body fat based on height and weight that applies to adult men and women. BMI is defined as the body mass divided by the square of the body height, and is universally expressed in units of kg/m², resulting from mass in kilograms and height in meters.
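
The definition can be checked on a single made-up patient. The sketch below is a standalone illustration (the function name `bmi` and the sample values are assumptions, not part of the dataset); heights in this data are recorded in centimeters, so they are converted to meters first.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2 from weight in kg and height in cm."""
    height_m = height_cm / 100
    return round(weight_kg / height_m ** 2, 2)

# Hypothetical patient: 70 kg and 175 cm
print(bmi(70, 175))  # 22.86
```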

In [None]:
df_eda = df.copy()

In [None]:
sns.scatterplot(x= 'height', y= 'weight', hue= 'disease', data= df_eda)

Looking at height and weight in this manner isn't very helpful.

In [None]:
# create BMI feature: weight (kg) divided by height (m) squared
def BMI (data):
    return (data['weight'] / (data['height'] / 100) ** 2).round(2)

# the calculation is vectorized, so pass the whole DataFrame at once
df_eda['BMI'] = BMI(df_eda)

In [None]:
df_eda.describe().T

Guidelines for BMI Categories

![image.png](attachment:image.png)
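
Because the chart above is an image, the cut-offs in the sketch below are an assumption: they are the commonly cited adult BMI categories (under 18.5, 18.5 to under 25, 25 to under 30, and 30 or more), and the function name `bmi_category` is hypothetical. The notebook itself does not bin BMI this way; this only shows how the guideline could be applied.

```python
def bmi_category(bmi_value):
    """Map a BMI value to a category using commonly cited adult cut-offs (assumed)."""
    if bmi_value < 18.5:
        return 'Underweight'
    elif bmi_value < 25:
        return 'Normal'
    elif bmi_value < 30:
        return 'Overweight'
    return 'Obese'

# Hypothetical BMI values
for value in (17.0, 22.86, 27.5, 33.1):
    print(value, bmi_category(value))
```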

In [None]:
pal = ['#008ae6' , '#ec1313']

plt.figure(figsize= (10,15))
sns.boxplot(x = 'disease', y= 'BMI', data= df_eda, palette= pal)
plt.title('Body Mass Index and Cardiovascular Disease', fontsize= 20)
plt.xlabel('Disease Presence')
plt.ylabel('BMI')

A quick boxplot shows that individuals with CVD tend to have a higher body mass index (higher median and quartiles) than those who do not have CVD.

### Blood Pressure Category analysis

Blood pressure is the pressure of circulating blood against the walls of blood vessels. Most of this pressure results from the heart pumping blood through the circulatory system.

Here are the most recent guidelines established by the American Heart Association (as of Feb. 2021):

![image.png](attachment:image.png)

In [None]:
# function to categorize blood pressure (checked from most to least severe)
def bp_categories(systolic, diastolic):
    if systolic > 180 or diastolic > 120:
        return 'Crisis'
    # no upper bounds needed: readings of exactly 180 or 120 belong here
    elif systolic >= 140 or diastolic >= 90:
        return 'HBP_stage2'
    elif systolic >= 130 or diastolic >= 80:
        return 'HBP_stage1'
    elif systolic >= 120:
        # diastolic is already known to be below 80 at this point
        return 'Elevated'
    else:
        return 'Normal'

In [None]:
# HTN is abbreviation for hypertension
df_eda['HTN_stage'] = df_eda[['systolic', 'diastolic']].apply\
(lambda x: bp_categories(*x), axis= 1)

In [None]:
df_eda.HTN_stage.value_counts()

In [None]:
pal = ['#008ae6' , '#ec1313']

plt.figure(figsize= (15,15))
sns.countplot(x= 'HTN_stage', hue= 'disease', data= df_eda, palette= pal)
plt.title('Blood Pressure Categories and Presence of Cardiovascular Disease', 
          fontweight= 'bold', fontsize= 20)
plt.xlabel('Blood Pressure Category')
plt.legend( ['No disease', 'Disease present'])
plt.ylabel('# of Patients')

Groups that have a higher prevalence of cardiovascular disease have blood pressure that could be classified as Hypertension Stage 2 or Hypertensive Crisis.

The other categories, Normal, Elevated, and Hypertension Stage 1 have a higher prevalence of patients without cardiovascular disease, though it is still present.

### Age analysis

In [None]:
print(df_eda.age.min())
print(df_eda.age.max())
df_eda[df_eda['age'] < 30]

In [None]:
# bin ages into categories
df_eda.loc[(df_eda['age'] < 40), 'age_range'] = 30
df_eda.loc[(df_eda['age'] >= 40) & (df_eda['age'] < 50), 'age_range'] = 40
df_eda.loc[(df_eda['age'] >= 50) & (df_eda['age'] < 60), 'age_range'] = 50
df_eda.loc[(df_eda['age'] >= 60) & (df_eda['age'] < 70), 'age_range'] = 60


In [None]:
df_eda.age_range.value_counts(normalize= True)

In [None]:
sns.countplot(x= 'age_range', hue= 'disease', data= df_eda)
plt.title('Age Ranges and Cardiovascular Disease')
plt.xlabel('Age Range')
plt.legend( ['No disease', 'Disease present'])
plt.ylabel('Patients')

In [None]:
plt.figure(figsize= (10,10))
pal = ['#1ac6ff', '#e65c00']

sns.scatterplot(x= 'age_range', y= 'weight', hue= 'disease', data= df_eda, palette= pal)
plt.title('Age Ranges, Weights, and Presence of Cardiovascular Disease', 
          fontweight= 'bold', fontsize= 15)
plt.xlabel('Age Range')
plt.ylabel('Weight')
plt.legend()

Here we can see that the share of patients diagnosed with cardiovascular disease increases with age. Weight, by comparison, does not appear to have as much of an impact as age does.
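
The binning and the "share with disease per age range" idea can be sketched without pandas. Everything below is illustrative: the helper `age_range` mirrors the decade bins used above (anyone under 40 goes into the 30 bin), and the records are invented, not taken from the dataset.

```python
from collections import defaultdict

def age_range(age_years):
    """Decade bin as used above: under 40 -> 30, then 40, 50, 60."""
    return max(30, int(age_years // 10) * 10)

# Hypothetical (age, disease) records, for illustration only
records = [(29.6, 0), (41.2, 0), (47.9, 1), (52.3, 0),
           (55.0, 1), (61.7, 1), (64.9, 1)]

counts = defaultdict(lambda: [0, 0])   # bin -> [patients, patients with disease]
for age, disease in records:
    age_bin = age_range(age)
    counts[age_bin][0] += 1
    counts[age_bin][1] += disease

# Share of patients with disease in each age range
for age_bin in sorted(counts):
    total, sick = counts[age_bin]
    print(age_bin, f"{sick / total:.2f}")
```

On the real data, `df_eda.groupby('age_range')['disease'].mean()` would give the same kind of per-bin share.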

### Gender analysis

In [None]:
# determine which is male/female
df_eda.groupby('gender')['height'].mean()

In [None]:
df_eda.groupby('gender')['weight'].mean()

In [None]:
df_eda.gender.value_counts(normalize= True)

Group "0" has the higher mean in both analyses. On average, men are taller than women, so we can assume that "0" refers to males and "1" refers to females. However, it is important to note that the data is imbalanced (almost $1/3$ to $2/3$).

In [None]:
gender_labels = ['male', 'female']   # 0 = male, 1 = female (see above)

plt.figure(figsize= (10, 8))
plt.title('Counts of Males and Females With & Without CV Disease', fontsize= 20)
sns.countplot(x= 'gender', hue= 'disease', data= df_eda, palette= 'cubehelix')
# the x-axis is gender; the colors (hue) show disease status
plt.xticks([0, 1], gender_labels)
plt.xlabel('Gender')
plt.legend(['No disease', 'Disease present'])
plt.ylabel('Patients')

This graph shows that even though there are almost twice as many women as men in the data, disease is present in about half of the patients within each gender.

In [None]:
df_eda.groupby('gender')['disease'].mean()

### Other quick checks

#### Cholesterol

In [None]:
df_eda.cholesterol.value_counts(normalize= True)

In [None]:
plt.figure(figsize= (8,6))
sns.countplot(x= 'cholesterol', hue= 'disease', data= df_eda)
plt.legend( ['No disease', 'Disease present'])
plt.title('Cholesterol and Disease')
plt.xlabel('Cholesterol Rank')
plt.ylabel('# of Patients')

In [None]:
pal = ['#1a75ff', '#cc6699', '#ff9900']
sns.catplot(x= 'cholesterol', y= 'disease', data= df_eda, kind= 'bar', 
            palette= pal)
plt.title('Average Risk of Having Disease vs Rank of Cholesterol')
plt.xlabel('Cholesterol Rank')
plt.ylabel('Has Disease')

Having cholesterol levels "above normal" or "well above normal" increases an individual's chances of being diagnosed with cardiovascular disease. However, having "normal" cholesterol levels does not rule it out: there are many individuals with normal cholesterol levels and CVD.

76% of patients with cholesterol "well above normal" also have cardiovascular disease.

In [None]:
df_eda.groupby('cholesterol')['disease'].mean()

#### Activity

In [None]:
df_eda.active.value_counts(normalize= True)

In [None]:
# catplot creates its own figure, so a separate plt.figure() call is not needed
sns.catplot(x='active', y='BMI', col='disease', data=df_eda, kind='boxen', 
            palette='Set1')

This visualization shows that patients without cardiovascular disease have similar body mass indexes, regardless of whether they classify themselves as active or not.

BMI is slightly higher for individuals who do have cardiovascular disease, but again, it does not really differ between active and inactive patients.

In [None]:
df_eda.groupby(['disease', 'active'])['BMI'].mean()

#### Glucose

In [None]:
pal = ['#008ae6' , '#ec1313']
sns.countplot(x= 'glucose', hue= 'disease', data= df_eda, palette= pal)
plt.legend( ['No disease', 'Disease present'])

Glucose levels that are "normal" do not seem to have any relationship to having the disease or not. In fact, a patient has almost a 50% chance of having cardiovascular disease even with a normal glucose measurement. The risk of being diagnosed with cardiovascular disease increases with increasing levels of glucose.

In [None]:
df_eda.groupby('glucose')['disease'].mean()

#### Drink alcohol and smoke

In [None]:
df_eda.groupby(['alcohol', 'smoke'])['disease'].mean()

In [None]:
df_eda.groupby(['alcohol', 'smoke'])['disease'].count()

In [None]:
fig= plt.figure(figsize= (6,6))
al_smo = df_eda.groupby(['alcohol', 'smoke'])['disease'].mean().plot()

This graph shows that individuals who report both drinking alcohol and smoking have the lowest risk of being diagnosed with cardiovascular disease. However, because these features are self-reported and vague (drink how much, smoke what?), it would be unwise to conclude that drinking and smoking are better for your heart.

It is also interesting that the risk of having cardiovascular disease decreases for patients who report drinking OR smoking. Again, this suggests that subjective information may be less accurate.

### Heatmap

In [None]:
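corr = df.corr()

plt.figure(figsize= (10,8))
# mask the upper triangle so each pair of features is shown only once
# (np.bool was removed from recent NumPy versions; the built-in bool works everywhere)
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, cmap= 'coolwarm', mask= mask, linewidths= 1, annot= True)
plt.title('Correlation between Features', fontsize= 15)
plt.show()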

The strongest correlation (positive or negative) is between systolic and diastolic. The second strongest is between height and gender.

The features most strongly correlated with disease are systolic, diastolic, age, and cholesterol.
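
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. The sketch below spells out the formula in plain Python; the function name `pearson_r` and the paired readings are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical paired blood pressure readings
systolic = [110, 120, 130, 140, 160]
diastolic = [70, 80, 80, 90, 100]
print(round(pearson_r(systolic, diastolic), 3))
```

A value near 1 or -1 means the two features move together almost linearly, while a value near 0 means there is little linear relationship.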

## Models

### One hot encode, Split, and Standardize

In [None]:
df['cholesterol'] = df['cholesterol'].astype('category')
df['glucose'] = df['glucose'].astype('category')

In [None]:
df = pd.get_dummies(df, prefix=['chol', 'gluc'], drop_first=True)

In [None]:
y = df['disease']
X = df.drop('disease', axis= 1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state= 42)

In [None]:
y_test.shape

In [None]:
scaler = StandardScaler()

In [None]:
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
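
The scaler above is fitted on the training split only and then reused on the test split, so no information from the test data leaks into preprocessing. The sketch below shows the same idea in plain Python on made-up heights; `StandardScaler` uses the population standard deviation, which is why `pstdev` is used here.

```python
from statistics import fmean, pstdev

train = [150.0, 160.0, 170.0, 180.0]   # hypothetical training heights (cm)
test = [165.0, 190.0]                  # hypothetical test heights (cm)

mu = fmean(train)       # mean learned from the training split only
sigma = pstdev(train)   # population standard deviation of the training split

train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]   # reuses the training mu and sigma

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction; the test values generally will not, and that is expected.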

### Building Models

In [None]:
classifiers = {
    'Logistic Regression' : LogisticRegression(), 
    'Random Forest' : RandomForestClassifier(), 
    'Support Vector Machine' : SVC(), 
    'K-Nearest Neighbors' : KNeighborsClassifier(), 
    'Decision Tree' : DecisionTreeClassifier(), 
    'XGBoost' : XGBClassifier()
}

In [None]:
# takes approx 2 mins to run
results = pd.DataFrame(columns= ['Train_accuracy', 'Test_accuracy', 'F1_score', 
                                'False_Negative', 'True_Positive'])

for key, value in classifiers.items():
    #fit models
    value.fit(X_train, y_train)
    train_pred = value.predict(X_train)
    y_pred = value.predict(X_test)
    
    # get accuracy, f1 score
    train_acc = accuracy_score(y_train, train_pred) * 100
    test_acc = accuracy_score(y_test, y_pred) * 100
    f1 = f1_score(y_test, y_pred)
    
    #get false neg and true positive
    cm = confusion_matrix(y_test, y_pred)
    FN = cm[1][0]
    TP = cm[1][1]
    
    # add measurements to the results DataFrame
    results.loc[key] = [round(train_acc, 2), round(test_acc, 2), 
                        round(f1, 2), round(FN, 0), round(TP, 0)]

In [None]:
# highest F1 score first; among ties, fewest false negatives first
results.sort_values(by= ['F1_score', 'False_Negative'], ascending= [False, True])

We got decent results without any hyperparameter tuning. We can do that next to see if we can improve the accuracy and decrease the overfitting.
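
The `False_Negative` and `True_Positive` columns come from the confusion matrix, read as `cm[1][0]` and `cm[1][1]`. The sketch below rebuilds a 2x2 matrix in plain Python with the same layout scikit-learn uses (rows are the actual class, columns are the predicted class). The labels are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """2x2 confusion matrix: rows = actual class, columns = predicted class."""
    cm = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        cm[actual][predicted] += 1
    return cm

# Hypothetical labels: 1 = disease, 0 = no disease
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

cm = confusion_counts(y_true, y_pred)
TN, FP = cm[0]   # actual 0: predicted 0, predicted 1
FN, TP = cm[1]   # actual 1: predicted 0, predicted 1
print(cm)        # [[2, 1], [2, 3]]
```

A false negative is a patient who has the disease (row 1) but was predicted healthy (column 0).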

### Hyperparameter Tuning
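
A grid search tries every combination in a parameter grid, while a randomized search samples at most `n_iter` of them (10 by default in scikit-learn). The sketch below enumerates a small grid with `itertools.product` to show how quickly the combinations add up. The values are borrowed from the XGBoost grid in this section, and the enumeration is an illustration, not scikit-learn's internal code.

```python
from itertools import product

# Small grid with values borrowed from the XGBoost search in this section
param_grid = {
    'learning_rate': [0.08],
    'max_depth': [4],
    'min_child_weight': [2, 3],
    'n_estimators': [125, 150],
    'scale_pos_weight': [1.5, 1.7],
}

# One dict per combination, in the order product() yields them
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]

print(len(combinations))   # 1 * 1 * 2 * 2 * 2 = 8
print(combinations[0])
```

With only 8 combinations, a randomized search with the default `n_iter` ends up evaluating all of them, so it behaves like a grid search here. Each combination is also fitted once per cross-validation fold, which is where most of the run time goes.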

In [None]:
# function to get results after each model

def get_results(model, model_name):
    train_pred = model.predict(X_train)
    y_pred = model.predict(X_test)
    
    # get classification report
    print('{} Classification Report'.format(model_name))
    print(classification_report(y_test, y_pred))
    
    # get confusion matrix
    plot_confusion_matrix(model, X_test, y_test, cmap= "Blues", values_format= '.5g')
    plt.grid(False)
    plt.show()
    
    # get accuracy and F1 scores
    train_acc = accuracy_score(y_train, train_pred) * 100
    test_acc = accuracy_score(y_test, y_pred) * 100
    f1 = f1_score(y_test, y_pred)
    
    #get false neg and true positive
    cm = confusion_matrix(y_test, y_pred)
    FN = cm[1][0]
    TP = cm[1][1]
    
    # save measurements into results df
    results.loc[model_name] = [round(train_acc, 2), round(test_acc, 2), 
                               round(f1, 2), round(FN, 0), round(TP, 0)]

#### XGBoost 

In [None]:
# Instantiate classifier
xgb = XGBClassifier()

In [None]:
# create hyperparameter grid
param_grid_xgb = {
    'learning_rate': [0.08],
    'max_depth': [4],
    'min_child_weight': [2, 3],
    'n_estimators' : [125, 150],
    'scale_pos_weight' : [1.5, 1.7]
}

# Instantiate Randomized Search
# n_jobs : Number of jobs to run in parallel. -1 means using all processors.
RS_xgb = RandomizedSearchCV(xgb, param_grid_xgb, n_jobs= 3, scoring= 'recall', 
                            random_state=42)

In [None]:
# fit model
# approx 1 min to run
RS_xgb.fit(X_train, y_train)

In [None]:
RS_xgb.best_params_

In [None]:
# see model results and add to results df
get_results(RS_xgb, 'RS XGBoost')

#### Random Forest

In [None]:
# Instantiate classifier
RF = RandomForestClassifier()

In [None]:
param_grid_RF = {
    'n_estimators' : [200],
    'max_depth' : [100], 
    'min_samples_split' : [5, 8], 
    'min_samples_leaf' : [3],
    'class_weight' : [{1 : 1.5}, {1 : 1.7}]
}

# Instantiate Randomized Search
RS_RF = RandomizedSearchCV(RF, param_grid_RF, n_iter= 20, scoring= 'recall', 
                           random_state= 42)

In [None]:
# fit model
# approx 2 mins to run
RS_RF.fit(X_train, y_train)

In [None]:
RS_RF.best_params_

In [None]:
# see model results and add to results df
get_results(RS_RF, 'RS Random Forest')

#### K-Nearest Neighbor

In [None]:
# Instantiate classifier
KNN = KNeighborsClassifier()

In [None]:
# set parameter guidelines
param_grid_KNN = {
    'n_neighbors' : [23, 25, 27], 
    'weights' : ['uniform', 'distance']
}

# Instantiate Grid Search
GS_KNN = GridSearchCV(KNN, param_grid_KNN, n_jobs= 3, scoring= 'recall')

In [None]:
# fit model
# approx 1 min to run
GS_KNN.fit(X_train, y_train)

In [None]:
GS_KNN.best_params_

In [None]:
# see model results and add to results df
get_results(GS_KNN, 'GS KNN')

#### Support Vector Machines

In [None]:
# Instantiate classifier
svc = SVC(class_weight = {1: 1.5}, random_state= 42)

In [None]:
# fit model
# approx 2 mins to run
svc.fit(X_train, y_train)

In [None]:
# see model results and add to results df
get_results(svc, 'Tuned SVM')

#### Logistic Regression

In [None]:
# Instantiate classifier
LG = LogisticRegression()

In [None]:
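# set parameter criteria
# the 'liblinear' solver supports both l1 and l2 penalties
# (the default 'lbfgs' solver does not support l1)
param_grid_LG = {
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
    'C':[0.05, 0.1, 1], 
    'class_weight' : [None, {1 : 1.5}], 
    'random_state' : [42]
}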

# Instantiate Grid Search
GS_LG = GridSearchCV(LG, param_grid_LG, scoring= 'recall')

In [None]:
# fit model
# approx 1 sec to run
GS_LG.fit(X_train, y_train)

In [None]:
GS_LG.best_params_

In [None]:
# see model results and add to results df
get_results(GS_LG, 'GS Logistic Regression')

#### Decision Tree

In [None]:
# Instantiate classifier
DT = DecisionTreeClassifier()

In [None]:
# set parameter criteria
param_grid_DT = {
    'max_depth' : [3, 4, 5], 
    'min_samples_split' : [0.01, 0.05, 0.1], 
    'min_samples_leaf' : [5, 7], 
    'class_weight' : [{1 : 1.4}, {1 : 1.6}]
}

# Instantiate Randomized Search
RS_DT = RandomizedSearchCV(DT, param_grid_DT, n_jobs= 3, random_state=42)

In [None]:
# approx 1 sec to run
RS_DT.fit(X_train, y_train)

In [None]:
RS_DT.best_params_

In [None]:
plt.figure(figsize= (30,15))
# pass the feature names as a plain list, which every scikit-learn version accepts
_ = plot_tree(RS_DT.best_estimator_ , feature_names = list(X.columns), filled= True) 

In [None]:
get_results(RS_DT, 'RS Decision Tree')

## Interpret

### Results
Now, let's take the time to look in depth at some of the best results we achieved.

In [None]:
# view all results
# highest F1 score first; among ties, fewest false negatives first
results = results.sort_values(by= ['F1_score', 'False_Negative'], 
                              ascending= [False, True])
results

Because we are looking at medical data, missing a diagnosis of cardiovascular disease (a false negative) could be deadly. We had to weigh that against flagging too many people who don't have the disease (false positives) and sending them for expensive tests they don't need. We also didn't want to hurt the overall accuracy of the prediction, so we walked a fine line.

That is why the F1 score (the harmonic mean of precision and recall) and the number of false negatives are so important to this prediction.
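
To make the trade-off concrete, precision, recall, and F1 can be computed by hand from confusion-matrix counts. The counts below are hypothetical and are not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)   # of the patients flagged, how many are truly sick
    recall = tp / (tp + fn)      # of the sick patients, how many were flagged
    f1_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f1_value

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives
p, r, f1_value = precision_recall_f1(tp=80, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f1_value, 3))   # 0.8 0.667 0.727
```

Lowering the number of false negatives raises recall, but it usually costs some precision, which is why F1 is a useful single number to compare models.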

We can see that the tuned XGBoost predicted the most true positives and the fewest false negatives. It has the highest F1 score, and its training and testing accuracies are very similar, which suggests little overfitting.

The tuned decision tree improved the most and its train/test accuracies are almost equal, but it is one of the worst tuned models with regard to false negatives (it misclassified 700 more than the tuned XGBoost).

### Test Accuracy

In [None]:
# see results in bar graph
fig, ax = plt.subplots(figsize=(10,10))
sns.barplot(x= results['Test_accuracy'], y= results.index, palette = 'twilight')
plt.vlines(x = 73.22, ymin = -.5, ymax = 11.5, linestyle= 'dashed', 
           color = 'r', label= 'Test Accuracy 73.22')
plt.vlines(x = 71.32, ymin = -.5, ymax = 11.5, linestyle= 'dashed', 
           color = 'black', label= 'Test Accuracy 71.32')
plt.title('Test Accuracy of all Models', fontsize= 15)
plt.ylabel('Model')
plt.xlabel('Accuracy in %')
plt.xlim(60, 76)
ax.legend(loc = 'lower right')

The highest accuracy score is at 73.22% by the untuned support vector machine model. The tuned XGBoost model's test accuracy is 71.32% - which happens to be the same as the tuned logistic regression.

### ROC Curve 

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. 
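
The area under the ROC curve (AUC) has a handy interpretation: it is the probability that a randomly chosen patient with the disease receives a higher score than a randomly chosen patient without it. The sketch below computes AUC that way in plain Python. It is a simple pairwise illustration on made-up scores, not the routine scikit-learn uses.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))   # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, regardless of which threshold is later chosen.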

In [None]:
# ROC curves of the 6 tuned models
fig = plot_roc_curve(RS_xgb, X_test, y_test, name= 'RS XGBoost')
plot_roc_curve(RS_RF,X_test, y_test, ax = fig.ax_, name= 'RS Random For')
plot_roc_curve(svc,X_test, y_test, ax = fig.ax_, name= 'Tuned SVM')
plot_roc_curve(GS_LG,X_test, y_test, ax = fig.ax_, name= 'GS Logistic Reg')
plot_roc_curve(GS_KNN,X_test, y_test, ax = fig.ax_, name= 'GS K-NearestN')
plot_roc_curve(RS_DT,X_test, y_test, ax = fig.ax_, name= 'RS Decision Tree')

# fig.figure.suptitle('ROC Curve Comparison')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

While most of the tuned models perform similarly, the tuned XGBoost again slightly outperforms the other models.

### Feature Importance

Let's take a look at the tuned XGBoost and see which features were the most important to our model.

In [None]:
# create DataFrame with feature importance to create nice looking graph
#  using tuned XGBoost model
FeatImp={'feature_names' : X.columns.values, 
         'feature_importance' : RS_xgb.best_estimator_.feature_importances_}
FI_df = pd.DataFrame(FeatImp)
FI_df.sort_values(by= ['feature_importance'], ascending= False, inplace= True)

In [None]:
# plot new DataFrame
plt.figure(figsize= (10,6))
sns.barplot(x= FI_df['feature_importance'], y= FI_df['feature_names'], 
            palette= 'Wistia_r')
plt.title('Feature Importance for Tuned XGBoost', fontsize= '15')
plt.xlabel('Feature Importance')
plt.ylabel('Feature')

In [None]:
FI_df

The systolic reading is by far the most important feature in predicting cardiovascular disease. Having a cholesterol well above normal, and a person's age are the second and third most important features, respectively.

Activity is the most important subjective feature, followed closely by smoking.

### Conclusion

The model that performed the best overall was XGBoost with tuned hyperparameters.

It had the fewest false negatives and the most true positives, as well as the highest F1 score. It was not the most accurate model, but it was fairly close (within 2 percentage points). It also had the highest AUC (area under the ROC curve), meaning it was the best at ranking patients with the disease above those without it.

From the model, we can conclude that lower blood pressure, lower cholesterol, younger age, and being active are associated with a lower predicted risk of cardiovascular disease. These are associations in this dataset rather than proof of cause and effect.