Heart disease diagnosis
---

Exercise - Evaluate "most-frequent" baseline
---

> **Exercise**: Load and split the `heart-disease.csv` data into 70-30 train/test sets - make sure to keep the same proportion of classes by setting `stratify`. Evaluate the accuracy of the "most-frequent" baseline.

In [35]:
import pandas as pd
import numpy as np

# Load data
data_df = pd.read_csv('heart-disease.csv')

# First five rows
data_df.head()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | male | typical angina | 145 | 233 | yes | ventricular hypertrophy | 150 | no | 2.3 | downsloping | 0 | fixed defect | absence |
| 1 | 67 | male | asymptomatic | 160 | 286 | no | ventricular hypertrophy | 108 | yes | 1.5 | flat | 3 | normal | likely |
| 2 | 67 | male | asymptomatic | 120 | 229 | no | ventricular hypertrophy | 129 | yes | 2.6 | flat | 2 | reversable defect | likely |
| 3 | 37 | male | non-anginal pain | 130 | 250 | no | normal | 187 | no | 3.5 | downsloping | 0 | normal | absence |
| 4 | 41 | female | atypical angina | 130 | 204 | no | ventricular hypertrophy | 172 | no | 1.4 | upsloping | 0 | normal | absence |


In [36]:
# Create X/y arrays
X = data_df.drop('disease', axis=1)
y = data_df['disease']

In [37]:
# Split into train/test sets
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)


print('Train set:', X_tr.shape, y_tr.shape)
# Prints: (212, 13) (212,)

print('Test set:', X_te.shape, y_te.shape)

Train set: (212, 13) (212,)
Test set: (91, 13) (91,)
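
To confirm that `stratify` kept the class proportions, one can compare the normalized label counts of the two splits. The following is a minimal sketch on made-up labels; `toy_y`, `toy_X` and the class counts are assumptions for illustration, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels (assumed counts, for illustration only)
toy_y = pd.Series(['absence'] * 54 + ['likely'] * 30 + ['very likely'] * 16)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder feature column

_, _, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, stratify=toy_y, random_state=0)

# Class proportions side by side: full data, train split, test split
proportions = pd.DataFrame({
    'all': toy_y.value_counts(normalize=True),
    'train': toy_y_tr.value_counts(normalize=True),
    'test': toy_y_te.value_counts(normalize=True),
})
print(proportions.round(3))
```

The three columns should agree up to rounding, since each class can only be split into whole samples.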


In [38]:
np.unique(y_tr)

array(['absence', 'likely', 'very likely'], dtype=object)

In [39]:
# Estimate the probability of each class as its relative frequency in the train set
p_absence = np.sum(y_tr == 'absence') / len(y_tr)
p_likely = np.sum(y_tr == 'likely') / len(y_tr)
p_vlikely = np.sum(y_tr == 'very likely') / len(y_tr)

print('Probability absence:', p_absence)
print('Probability likely:', p_likely)
print('Probability very likely:', p_vlikely)
print('Check total:', p_absence+p_likely+p_vlikely)

Probability absence: 0.5424528301886793
Probability likely: 0.3018867924528302
Probability very likely: 0.15566037735849056
Check total: 1.0


In [40]:
# Or using Pandas
pd.Series(y_tr).value_counts(normalize=True)

absence        0.542453
likely         0.301887
very likely    0.155660
Name: disease, dtype: float64

In [41]:
from sklearn.dummy import DummyClassifier
# Create the dummy classifier
dummy = DummyClassifier(strategy='most_frequent')

# Fit it
dummy.fit(None, y_tr)

# Compute test accuracy
accuracy = dummy.score(None, y_te)
print('Accuracy: {:.2f}'.format(accuracy))

Accuracy: 0.54


The "most-frequent" baseline always predicts "absence" and reaches a test accuracy of 54%.
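
The baseline can also be reproduced by hand: find the majority label in the train set, then count how often it matches the test labels. This is a small sketch with made-up labels; the helper `most_frequent_baseline` and the toy lists are hypothetical names introduced for illustration.

```python
from collections import Counter

def most_frequent_baseline(train_labels, test_labels):
    """Return the majority training label and its accuracy on the test labels."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in test_labels)
    return majority_label, hits / len(test_labels)

# Toy labels (assumed, for illustration only)
toy_train = ['absence'] * 6 + ['likely'] * 3 + ['very likely'] * 1
toy_test = ['absence', 'likely', 'absence', 'very likely', 'absence']

label, acc = most_frequent_baseline(toy_train, toy_test)
print(label, acc)  # absence 0.6
```

The accuracy is simply the share of the majority class in the test labels, which is why it matches the class proportion when the split is stratified.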

Exercise - Evaluate k-NN baseline
---

> **Exercise**: Tune a k-NN classifier using grid search with **stratified 10-fold** cross-validation
> * Number of neighbors k
> * Distance metric - $L_{1}$ or $L_{2}$
> * Weighting strategy - uniform or by distance
>
> Refit the best estimator on the whole train set and report the test accuracy.

Data set documentation: http://archive.ics.uci.edu/ml/datasets/heart+Disease
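
The three hyperparameters in this exercise map onto simple operations: `p` selects the Minkowski distance ($L_{1}$ for `p=1`, $L_{2}$ for `p=2`), and the weighting strategy decides whether each of the k neighbors gets one vote or a vote proportional to the inverse of its distance. Below is a rough pure-Python sketch of that logic. The helpers `minkowski` and `knn_predict` and the toy points are hypothetical, and the handling of zero distances is a simplification, not scikit-learn's implementation.

```python
from collections import defaultdict

def minkowski(a, b, p):
    """L_p distance: p=1 is Manhattan (L1), p=2 is Euclidean (L2)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train_points, train_labels, query, k, p=2, weights='uniform'):
    """Predict the label of `query` from its k nearest training points."""
    dists = sorted(
        (minkowski(pt, query, p), lab)
        for pt, lab in zip(train_points, train_labels))
    votes = defaultdict(float)
    for dist, lab in dists[:k]:
        if weights == 'uniform':
            votes[lab] += 1.0  # every neighbor counts the same
        else:
            # Inverse-distance vote; the floor avoids division by zero (simplification)
            votes[lab] += 1.0 / max(dist, 1e-12)
    return max(votes, key=votes.get)

# Toy data (assumed): one close 'absence' point, two distant 'likely' points
toy_points = [(0.9, 0.0), (3.0, 0.0), (3.2, 0.0)]
toy_labels = ['absence', 'likely', 'likely']
query = (1.0, 0.0)

print(minkowski((0, 0), (3, 4), p=1), minkowski((0, 0), (3, 4), p=2))
print(knn_predict(toy_points, toy_labels, query, k=3, weights='uniform'))
print(knn_predict(toy_points, toy_labels, query, k=3, weights='distance'))
```

With uniform weights the two distant points outvote the close one, while distance weighting lets the close point dominate. This is the kind of difference the grid search explores.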

In [42]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# One-hot encoding
onehot_columns = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']

# Numerical features
other_columns = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'ca']

# Preprocessor
preprocessor = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), onehot_columns),
    ('other', 'passthrough', other_columns)
])

In [48]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


# k-NN estimator
knn_estimator = Pipeline([
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler()), # Standardize features before k-NN
    ('knn', KNeighborsClassifier())
])

# Grid search with cross-validation
grid = {
    'knn__n_neighbors': [1, 5, 10, 15, 20],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2]
}



In [49]:
# Fit the estimator with its default hyperparameters, before the grid search
knn_estimator.fit(X_tr, y_tr)

# Evaluate on test set
accuracy = knn_estimator.score(X_te, y_te)
print('Accuracy: {:.3f}'.format(accuracy))


Accuracy: 0.626


In [50]:
knn_gscv = GridSearchCV(knn_estimator, grid, cv=10, refit=True, return_train_score=True)
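
With an integer `cv` and a classifier, `GridSearchCV` uses `StratifiedKFold` when the target is binary or multiclass, so `cv=10` already gives stratified folds. To make this explicit, or to shuffle the samples before splitting, a `StratifiedKFold` object can be built directly. The sketch below uses made-up labels (`toy_y` and its class counts are assumptions) to show that every validation fold keeps the class mix.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (assumed counts, for illustration only)
toy_y = np.array(['absence'] * 50 + ['likely'] * 30 + ['very likely'] * 20)
toy_X = np.zeros((len(toy_y), 1))  # feature values do not affect the folds

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Count the labels in each validation fold
fold_counts = []
for _, val_idx in skf.split(toy_X, toy_y):
    labels, counts = np.unique(toy_y[val_idx], return_counts=True)
    fold_counts.append(dict(zip(labels.tolist(), counts.tolist())))

print(fold_counts[0])
```

Such an object can be passed as the `cv` argument of `GridSearchCV` in place of the integer.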

In [51]:
# Fit/evaluate estimator
knn_gscv.fit(X_tr, y_tr)

# Collect results in a DataFrame
knn_results = pd.DataFrame({
    'k': knn_gscv.cv_results_['param_knn__n_neighbors'],
    'p': knn_gscv.cv_results_['param_knn__p'],
    'weights': knn_gscv.cv_results_['param_knn__weights'],
    'mean_tr': knn_gscv.cv_results_['mean_train_score'],
    'mean_te': knn_gscv.cv_results_['mean_test_score'],
    'std_te': knn_gscv.cv_results_['std_test_score']
})

# Ten best combinations according to the mean "test" score
# i.e. the mean score on the 10 validation folds
knn_results.sort_values(by='mean_te', ascending=False).head(10)



| | k | p | weights | mean_tr | mean_te | std_te |
|---|---|---|---|---|---|---|
| 16 | 20 | 1 | uniform | 0.689223 | 0.660377 | 0.083232 |
| 8 | 10 | 1 | uniform | 0.70339 | 0.660377 | 0.075691 |
| 15 | 15 | 2 | distance | 1.0 | 0.65566 | 0.079192 |
| 14 | 15 | 2 | uniform | 0.696586 | 0.65566 | 0.088311 |
| 11 | 10 | 2 | distance | 1.0 | 0.650943 | 0.074498 |
| 10 | 10 | 2 | uniform | 0.705967 | 0.636792 | 0.068658 |
| 18 | 20 | 2 | uniform | 0.689237 | 0.632075 | 0.090665 |
| 17 | 20 | 1 | distance | 1.0 | 0.632075 | 0.078631 |
| 9 | 10 | 1 | distance | 1.0 | 0.632075 | 0.077286 |
| 12 | 15 | 1 | uniform | 0.686076 | 0.632075 | 0.088146 |


In [52]:
# Report test score
print('Test accuracy: {:.2f}%'.format(100*knn_gscv.score(X_te, y_te)))

Test accuracy: 64.84%


Exercise - Logistic regression
---

> **Exercise**: Repeat the tuning with a logistic regression
> * Try both OvR and softmax
> * Tune C
>
> Which estimator would you use in practice? k-NN or logistic regression?
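
The two multiclass strategies differ in how class scores become probabilities. One-vs-rest (OvR) fits one binary model per class, squashes each score with a sigmoid and renormalizes, whereas softmax fits a single model and normalizes the exponentiated scores jointly. The sketch below applies both normalizations to the same made-up scores to isolate that difference. In practice the scores would come from differently fitted models, and the function names and `toy_scores` are hypothetical.

```python
import math

def ovr_probabilities(scores):
    """One-vs-rest: squash each class score with a sigmoid, then renormalize."""
    sigmoids = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sigmoids)
    return [s / total for s in sigmoids]

def softmax_probabilities(scores):
    """Softmax: exponentiate all scores jointly and normalize."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up decision scores for ('absence', 'likely', 'very likely')
toy_scores = [2.0, 0.5, -1.0]
ovr = ovr_probabilities(toy_scores)
soft = softmax_probabilities(toy_scores)

print([round(p, 3) for p in ovr])
print([round(p, 3) for p in soft])
```

Both outputs sum to one and rank the classes identically here, but softmax concentrates more probability on the top class for these scores.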

In [53]:
from sklearn.linear_model import LogisticRegression
import numpy as np

# Logistic regression estimator
logreg_estimator = Pipeline([
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler()), # Standardize: regularization and the solvers are sensitive to feature scale
    ('logreg', LogisticRegression())
])

# Grid search with cross-validation
Cs = np.logspace(-4, 4, num=20)
grids = [{
    'logreg__multi_class': ['ovr'],
    'logreg__solver': ['liblinear'],
    'logreg__C': Cs
}, {
    'logreg__multi_class': ['multinomial'],
    'logreg__solver': ['saga'],
    'logreg__C': Cs
}]

In [54]:
logreg_gscv = GridSearchCV(logreg_estimator, grids, cv=10, refit=True, return_train_score=True)

In [55]:
import warnings
from sklearn.exceptions import ConvergenceWarning

# Filter convergence warnings
warnings.simplefilter('ignore', ConvergenceWarning)

# Fit/evaluate estimator
logreg_gscv.fit(X_tr, y_tr)

# Collect results in a DataFrame
logreg_results = pd.DataFrame({
    'strategy': logreg_gscv.cv_results_['param_logreg__multi_class'],
    'C': logreg_gscv.cv_results_['param_logreg__C'],
    'mean_tr': logreg_gscv.cv_results_['mean_train_score'],
    'mean_te': logreg_gscv.cv_results_['mean_test_score'],
    'std_te': logreg_gscv.cv_results_['std_test_score']
})

# Ten best combinations according to the mean test score
logreg_results.sort_values(by='mean_te', ascending=False).head(10)



| | strategy | C | mean_tr | mean_te | std_te |
|---|---|---|---|---|---|
| 24 | multinomial | 0.00483293 | 0.688198 | 0.669811 | 0.071303 |
| 39 | multinomial | 10000.0 | 0.760943 | 0.665094 | 0.085097 |
| 38 | multinomial | 3792.69 | 0.760943 | 0.665094 | 0.085097 |
| 37 | multinomial | 1438.45 | 0.760943 | 0.665094 | 0.085097 |
| 34 | multinomial | 78.476 | 0.760943 | 0.665094 | 0.085097 |
| 33 | multinomial | 29.7635 | 0.760943 | 0.665094 | 0.085097 |
| 32 | multinomial | 11.2884 | 0.760422 | 0.665094 | 0.085097 |
| 31 | multinomial | 4.28133 | 0.760946 | 0.665094 | 0.085097 |
| 30 | multinomial | 1.62378 | 0.759901 | 0.665094 | 0.085097 |
| 29 | multinomial | 0.615848 | 0.758363 | 0.665094 | 0.085097 |


In [56]:
# Report test score
print('Test accuracy: {:.2f}%'.format(100*logreg_gscv.score(X_te, y_te)))

Test accuracy: 69.23%


The k-NN and logistic regression estimators both beat the "most-frequent" baseline. However, after trying different `random_state` seeds in `train_test_split()`, it is difficult to say that one is better than the other.
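
One way to make the seed comparison systematic is to repeat the split for several seeds and report the mean and spread of the test accuracies. The sketch below does this on synthetic data from `make_classification`, which stands in for the real data set. The data, the fixed `n_neighbors=10` and the variable names are assumptions for illustration, not the tuned estimators above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption, for illustration only)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0)

estimators = {
    'k-NN': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10)),
    'logreg': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Refit and score each estimator on ten different stratified splits
scores = {name: [] for name in estimators}
for seed in range(10):
    X_a, X_b, y_a, y_b = train_test_split(
        X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=seed)
    for name, est in estimators.items():
        scores[name].append(est.fit(X_a, y_a).score(X_b, y_b))

for name, vals in scores.items():
    print('{}: mean {:.3f}, std {:.3f}'.format(name, np.mean(vals), np.std(vals)))
```

If the gap between the two means is small compared with the standard deviations, the data does not support ranking one estimator above the other.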

It would also be useful to track other metrics such as precision, recall, and F1. For a reference of the metrics implemented in scikit-learn, see the [Model evaluation guide](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics).
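
To illustrate what these metrics add over accuracy, the sketch below computes per-class precision, recall and F1 by hand on made-up labels. The helper `per_class_metrics` and the toy lists are hypothetical. The `classification_report` function in `sklearn.metrics` reports the same per-class quantities.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed one class at a time."""
    results = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Define a metric as 0.0 when its denominator is zero
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results[label] = (precision, recall, f1)
    return results

# Made-up labels and predictions (for illustration only)
toy_true = ['absence', 'absence', 'absence', 'likely', 'likely', 'very likely']
toy_pred = ['absence', 'absence', 'likely', 'likely', 'absence', 'likely']

metrics = per_class_metrics(toy_true, toy_pred)
for label, (prec, rec, f1) in metrics.items():
    print('{:12s} precision={:.2f} recall={:.2f} f1={:.2f}'.format(label, prec, rec, f1))
```

In this toy case half of the predictions are correct, yet the "very likely" class is never predicted. That shows up as zero recall for the class but is invisible in the accuracy alone.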