## Modeling Heart Disease

Import libraries

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn_pandas import DataFrameMapper
from sklearn import (preprocessing, metrics)
from sklearn.model_selection import (train_test_split, GridSearchCV)
from sklearn.neighbors import KNeighborsClassifier
# from xgboost import XGBClassifier
# from sklearn.decomposition import PCA
from jupyterthemes import jtplot
jtplot.style('grade3')

Read in the cleaned data from EDA (one duplicate and one possible outlier were dropped).

In [None]:
data = pd.read_csv('heart_clean.csv')

Normalize continuous features (z-score) so that their values are on similar scales. (Categorical feature values are already on similar scales, so they do not require transformation at this time.)
Split the data into training and test sets first, so that the test data does not influence the z-score normalization.
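
To illustrate why the split comes before scaling, here is a minimal sketch in plain Python. The helper names (`fit_zscore`, `apply_zscore`) and the numbers are hypothetical and not taken from `heart_clean.csv`; the point is only that the mean and standard deviation are estimated from the training values and then reused unchanged on the test values.

```python
from statistics import mean, pstdev

def fit_zscore(values):
    """Estimate the mean and (population) standard deviation from training values only."""
    return mean(values), pstdev(values)

def apply_zscore(values, mu, sigma):
    """Scale values using statistics that were fitted earlier."""
    return [(v - mu) / sigma for v in values]

# Illustrative values only (hypothetical, not from the dataset)
train_age = [40, 50, 60, 70]
test_age = [55, 80]

mu, sigma = fit_zscore(train_age)                     # fitted on the training set
train_scaled = apply_zscore(train_age, mu, sigma)     # mean 0, std 1 by construction
test_scaled = apply_zscore(test_age, mu, sigma)       # reuses the training statistics

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

The scaled test values are not guaranteed to have mean 0 or standard deviation 1, which is expected: they are measured against the training distribution.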

In [None]:
random_seed = 20
sns.set_style('whitegrid') # Style option for later graphs

In [None]:
x_train, x_test, y_train, y_test = train_test_split(data.drop(columns='target'), data.target,
                                                    test_size=0.3, stratify=data.target, random_state=random_seed)

In [None]:
#categorical = ['cp', 'restecg', 'slope', 'ca', 'thal']
#binary_cat = ['sex', 'fbs', 'exang'] ## 'target' is omitted
numerical = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

In [None]:
#OH = preprocessing.OneHotEncoder(categories = 'auto', sparse=False)
mapper = DataFrameMapper([([n], preprocessing.StandardScaler()) for n in numerical], default=None, df_out=True)

In [None]:
X_train = mapper.fit_transform(x_train)
X_test = mapper.transform(x_test)

## Model 1: K-Nearest Neighbors
Predicts disease state based on votes from a defined number of nearest neighbors. <br> 
Parameters to be optimized by grid search: 1. n_neighbors - how many neighbors to count, 2. weights - whether or not to weight votes by distance, 3. metric - distance measure <br>
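
As a rough sketch of how the three parameters interact, the toy implementation below predicts a single label by neighbor vote. It is a simplification written for illustration, not scikit-learn's implementation: the function names and the toy points are hypothetical, and the small constant added to the distance is an assumption made here to avoid division by zero.

```python
from collections import defaultdict

def distance(a, b, metric="euclidean"):
    """Distance between two equal-length feature vectors."""
    if metric == "euclidean":
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    if metric == "manhattan":
        return sum(abs(x - y) for x, y in zip(a, b))
    raise ValueError(f"unknown metric: {metric}")

def knn_predict(train_X, train_y, query, n_neighbors=3,
                weights="uniform", metric="euclidean"):
    """Predict one label by a (optionally distance-weighted) neighbor vote."""
    # Keep the n_neighbors training points closest to the query
    nearest = sorted(
        (distance(x, query, metric), label) for x, label in zip(train_X, train_y)
    )[:n_neighbors]
    votes = defaultdict(float)
    for dist, label in nearest:
        if weights == "uniform":
            votes[label] += 1.0                    # every neighbor counts equally
        else:
            votes[label] += 1.0 / (dist + 1e-12)   # closer neighbors count more
    return max(votes, key=votes.get)

# Toy points, purely illustrative
train_X = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
train_y = [1, 0, 0, 1]
query = (0.1, 0.1)

print(knn_predict(train_X, train_y, query, weights="uniform"))
print(knn_predict(train_X, train_y, query, weights="distance"))
```

In this toy case the two weighting schemes disagree: the uniform vote follows the two more distant class-0 points, while the distance-weighted vote follows the single very close class-1 point.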


In [None]:
knn = KNeighborsClassifier()

In [None]:
params = {'n_neighbors': list(range(1, 20, 2)), # odd k only, to avoid tied votes between the two classes
          'weights': ['uniform', 'distance'],
          'metric': ['euclidean', 'manhattan']}

In [None]:
# f1 score is selected to balance precision and recall
# cv=5 is set explicitly; the `iid` argument has been removed from newer scikit-learn releases, so it is omitted
model = GridSearchCV(knn, params, scoring='f1', return_train_score=True, cv=5, verbose=1)

In [None]:
model.fit(X_train, y_train)

In [None]:
print('Best parameters:', model.best_params_)
print('Best score:', '{:.3f}'.format(model.best_score_))

### Evaluate K-neighbors parameter optimization and model performance

Plot the train/test results from grid search to evaluate how each of the parameters affected training, and check that the 'best parameters' are reasonable.

In [None]:
# Save selected results in molten form for graphing
res = pd.DataFrame.from_dict(model.cv_results_).melt(
    id_vars = ['param_n_neighbors', 'param_metric', 'param_weights'],
    value_vars = ['mean_test_score', 'mean_train_score'])

In [None]:
sns.lineplot(x = res.param_n_neighbors, y = res.value, size = res.param_metric, hue = res.variable,
             style = res.param_weights)
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left") # move legend outside of grid
plt.xticks(range(1,20,2))
plt.ylabel('f1 score')
plt.xlabel('n_neighbors');

1. n_neighbors: The 'elbow' where the train and test scores converge appears around k=5-7. <br>
2. weights: Test scores were hardly affected, but weighting by distance inflated the training score (each training point is its own nearest neighbor at distance zero, so it dominates the vote), making 'uniform' the better choice. <br>
3. metric: Manhattan scores are consistently (slightly) higher than Euclidean. However, the Euclidean curves are smooth while the Manhattan curves are somewhat erratic, which may indicate overfitting. <br>
Overall, the parameters selected during training (k=5, weights=uniform, and metric=manhattan) appear reasonable.
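
The effect described in point 2 can be reproduced on a toy example. The sketch below is a hypothetical one-dimensional k-NN, not the notebook's model, and the data are invented for illustration. It assumes that a zero-distance neighbor decides the vote under distance weighting, which is why scoring on the training set itself looks perfect.

```python
def predict(train, query, k, weighted):
    """1-D k-NN vote; `train` is a list of (x, label) pairs."""
    nearest = sorted((abs(x - query), label) for x, label in train)[:k]
    if weighted:
        exact = [label for d, label in nearest if d == 0]
        if exact:
            # A zero-distance neighbor has unbounded weight, so it decides the vote
            return max(set(exact), key=exact.count)
    scores = {}
    for d, label in nearest:
        scores[label] = scores.get(label, 0.0) + (1.0 / d if weighted else 1.0)
    return max(scores, key=scores.get)

def training_accuracy(train, k, weighted):
    """Fraction of training points whose own label is predicted correctly."""
    hits = sum(predict(train, x, k, weighted) == label for x, label in train)
    return hits / len(train)

# Toy data with one "noisy" label at x=2.0 (illustrative only)
train = [(1.0, 0), (2.0, 1), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]

print(training_accuracy(train, k=3, weighted=False))
print(training_accuracy(train, k=3, weighted=True))
```

With uniform weights the noisy point is outvoted by its neighbors, so training accuracy falls below 1. With distance weights every point recovers its own label, so the training score says little about generalization.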

The model was refit with the best parameters found during grid search, so it can be used to predict labels for the test data and evaluate performance.

In [None]:
y_pred = model.predict(X_test)

In [None]:
print(metrics.classification_report(y_test, y_pred))

In [None]:
sns.heatmap(metrics.confusion_matrix(y_test, y_pred), cmap=sns.color_palette('Paired', 2),
            annot=True, annot_kws={'size':14}, cbar=False, square=True)
plt.xlabel('Predicted label\n(0=Healthy, 1=Disease)')
plt.ylabel('Actual label')
plt.title('K-Neighbors Confusion Matrix');

Precision and recall are fairly balanced. The model tends to over-predict the presence of heart disease (false positives).
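
For reference, the quantities in the classification report can be derived directly from the confusion-matrix counts. The sketch below uses hypothetical labels rather than the notebook's actual predictions, and `binary_scores` is an illustrative helper, not a scikit-learn function.

```python
def binary_scores(y_true, y_pred):
    """Compute confusion counts, precision, recall, and F1 for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted disease, how many are real
    recall = tp / (tp + fn) if tp + fn else 0.0      # of real disease, how many are found
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels (0=Healthy, 1=Disease), not the model's output
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

scores = binary_scores(y_true, y_pred)
print(scores)
```

In this made-up example there are more false positives than false negatives, so precision is lower than recall, which is the pattern one would expect from a model that over-predicts disease.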

## Model 2:
