Heart disease diagnosis
---

Exercise - Evaluate "most-frequent" baseline
---

> **Exercise**: Load and split the `heart-disease.csv` data into 70-30 train/test sets - make sure to keep the same proportion of classes by setting `stratify`. Evaluate the accuracy of the "most-frequent" baseline.

In [1]:
import pandas as pd

# Load data
data_df = pd.read_csv("c4_heart-disease.csv")

# First five rows
data_df.head()

|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | male | typical angina | 145 | 233 | yes | ventricular hypertrophy | 150 | no | 2.3 | downsloping | 0 | fixed defect | absence |
| 1 | 67 | male | asymptomatic | 160 | 286 | no | ventricular hypertrophy | 108 | yes | 1.5 | flat | 3 | normal | likely |
| 2 | 67 | male | asymptomatic | 120 | 229 | no | ventricular hypertrophy | 129 | yes | 2.6 | flat | 2 | reversable defect | likely |
| 3 | 37 | male | non-anginal pain | 130 | 250 | no | normal | 187 | no | 3.5 | downsloping | 0 | normal | absence |
| 4 | 41 | female | atypical angina | 130 | 204 | no | ventricular hypertrophy | 172 | no | 1.4 | upsloping | 0 | normal | absence |


In [2]:
print("Classes:", data_df.disease.unique())

Classes: ['absence' 'likely' 'very likely']


In [3]:
from sklearn.model_selection import train_test_split

# Create X/y variables
X = data_df.drop("disease", axis=1)
y = data_df.disease

# Split into train/test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)
print("Train:", X_tr.shape, y_tr.shape)
print("Test:", X_te.shape, y_te.shape)

Train: (212, 13) (212,)
Test: (91, 13) (91,)
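
To see what `stratify` does, one can compare the class proportions in the two parts of a split. The sketch below uses hypothetical toy labels (`X_toy`, `y_toy`), not the heart disease data; the same `value_counts(normalize=True)` check could be applied to `y_tr` and `y_te`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical data, not the heart disease dataset)
y_toy = pd.Series(["absence"] * 60 + ["likely"] * 30 + ["very likely"] * 10)
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.3, random_state=0
)

# With stratification, the class proportions should (nearly) match in both parts
proportions = pd.DataFrame(
    {
        "train": y_a.value_counts(normalize=True),
        "test": y_b.value_counts(normalize=True),
    }
)
print(proportions.round(2))
```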


In [4]:
from sklearn.dummy import DummyClassifier

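# This strategy ignores the features and always predicts the most common
# training label, which is why X can be passed as None below
dummy = DummyClassifier(strategy="most_frequent")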
dummy.fit(None, y_tr)
print("Test accuracy: {:.2f}%".format(100 * dummy.score(None, y_te)))

Test accuracy: 53.85%


The "most-frequent" baseline reaches about 54% test accuracy, which is simply the share of the majority class in the test set.
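
The reported 53.85% is consistent with 49 of the 91 test samples belonging to the majority class (49/91 ≈ 0.5385). As a rough illustration of what the dummy classifier computes, here is a minimal pure-Python sketch; the helper `most_frequent_accuracy` and the toy labels are hypothetical and not part of the notebook.

```python
from collections import Counter


def most_frequent_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)


# Hypothetical toy labels: "absence" is the majority class in the training part
y_train_toy = ["absence"] * 6 + ["likely"] * 3 + ["very likely"] * 1
y_test_toy = ["absence"] * 2 + ["likely"] * 2

print(most_frequent_accuracy(y_train_toy, y_test_toy))  # 2 of 4 correct -> 0.5
```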

Exercise - Evaluate k-NN baseline
---

> **Exercise**: Tune a k-NN classifier using grid search with **stratified 10-fold** cross-validation
> * Number of neighbors k
> * Distance metric - $L_{1}$ or $L_{2}$
> * Weighting strategy - uniform or by distance
>
> Refit the best estimator on the whole train set and report the test accuracy.

Dataset documentation: http://archive.ics.uci.edu/ml/datasets/heart+Disease

In [5]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# One-hot encoding
onehot_columns = ["sex", "cp", "fbs", "restecg", "exang", "slope", "thal"]

# Numerical features
other_columns = ["age", "trestbps", "chol", "thalach", "oldpeak", "ca"]

# Preprocessor
preprocessor = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), onehot_columns),
        ("other", "passthrough", other_columns),
    ]
)

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# k-NN estimator
knn_estimator = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("scaler", StandardScaler()),  # Standardize features before k-NN
        ("knn", KNeighborsClassifier()),
    ]
)

# Grid search with cross-validation
grid = {
    "knn__n_neighbors": [1, 5, 10, 15, 20],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}
# With an integer cv and a classifier, the folds are stratified
knn_gscv = GridSearchCV(knn_estimator, grid, cv=10, refit=True, return_train_score=True)
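
To make the stratified 10-fold requirement explicit rather than relying on the integer shortcut, a splitter object can be passed as `cv`. This is a minimal sketch on synthetic data (`X_demo`, `y_demo` are stand-ins, not the heart disease features); shuffling the folds is an extra choice that the notebook does not make.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Explicit stratified 10-fold splitter; shuffle/random_state are illustrative choices
cv_splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

demo_gscv = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 5, 10]},
    cv=cv_splitter,
)
demo_gscv.fit(X_demo, y_demo)
print(demo_gscv.best_params_, round(demo_gscv.best_score_, 3))
```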

In [7]:
# Fit/evaluate estimator
knn_gscv.fit(X_tr, y_tr)

# Collect results in a DataFrame
knn_results = pd.DataFrame(
    {
        "k": knn_gscv.cv_results_["param_knn__n_neighbors"],
        "p": knn_gscv.cv_results_["param_knn__p"],
        "weights": knn_gscv.cv_results_["param_knn__weights"],
        "mean_tr": knn_gscv.cv_results_["mean_train_score"],
        "mean_te": knn_gscv.cv_results_["mean_test_score"],
        "std_te": knn_gscv.cv_results_["std_test_score"],
    }
)

# Ten best combinations according to the mean "test" score
# i.e. the mean score on the 10 validation folds
knn_results.sort_values(by="mean_te", ascending=False).head(10)

|   | k | p | weights | mean_tr | mean_te | std_te |
|---|---|---|---|---|---|---|
| 16 | 20 | 1 | uniform | 0.683952 | 0.669264 | 0.058728 |
| 17 | 20 | 1 | distance | 1.0 | 0.650649 | 0.065454 |
| 8 | 10 | 1 | uniform | 0.70388 | 0.650649 | 0.054071 |
| 14 | 15 | 2 | uniform | 0.694436 | 0.650649 | 0.058114 |
| 18 | 20 | 2 | uniform | 0.692874 | 0.646104 | 0.05354 |
| 11 | 10 | 2 | distance | 1.0 | 0.641558 | 0.067094 |
| 13 | 15 | 1 | distance | 1.0 | 0.641342 | 0.067884 |
| 19 | 20 | 2 | distance | 1.0 | 0.641126 | 0.062357 |
| 15 | 15 | 2 | distance | 1.0 | 0.640909 | 0.066998 |
| 6 | 5 | 2 | uniform | 0.767801 | 0.637013 | 0.049264 |


The k-NN baseline accuracy is around 67% ± 6% (std) according to the cross-validation scores of the best combination (`mean_te` and `std_te` in the first row above).

In [8]:
# Report test score
print("Test accuracy: {:.2f}%".format(100 * knn_gscv.score(X_te, y_te)))

Test accuracy: 65.93%


Exercise - Logistic regression
---

> **Exercise**: Same with a logistic regression
> * Try both OvR and softmax
> * tune C
>
> Which estimator would you use in practice? k-NN or logistic regression?

In [9]:
from sklearn.linear_model import LogisticRegression
import numpy as np

# Logistic regression estimator
logreg_estimator = Pipeline(
    [
        ("preprocessor", preprocessor),
        # Standardize features: the regularization penalty and the solvers
        # are sensitive to feature scale
        ("scaler", StandardScaler()),
        ("logreg", LogisticRegression()),
    ]
)

# Grid search with cross-validation
Cs = np.logspace(-4, 4, num=20)
grids = [
    {"logreg__multi_class": ["ovr"], "logreg__solver": ["liblinear"], "logreg__C": Cs},
    {
        "logreg__multi_class": ["multinomial"],
        "logreg__solver": ["saga"],
        "logreg__C": Cs,
    },
]
logreg_gscv = GridSearchCV(
    logreg_estimator, grids, cv=10, refit=True, return_train_score=True
)
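
Note that recent scikit-learn releases deprecate the `multi_class` parameter used above. A possible alternative, sketched here under the assumption of a recent version, is to wrap the estimator in `OneVsRestClassifier` for OvR and to rely on the default `lbfgs` solver, which fits the multinomial (softmax) model on multi-class targets. The data (`X_demo`, `y_demo`), the helper `make_pipeline_for` and the reduced grid `Cs_demo` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the preprocessed features (assumption)
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=0
)


def make_pipeline_for(classifier):
    """Hypothetical helper: scaler followed by the given classifier."""
    return Pipeline([("scaler", StandardScaler()), ("logreg", classifier)])


Cs_demo = np.logspace(-2, 2, num=5)

# OvR: one binary logistic regression per class
ovr_gscv = GridSearchCV(
    make_pipeline_for(OneVsRestClassifier(LogisticRegression(max_iter=1000))),
    {"logreg__estimator__C": Cs_demo},
    cv=5,
)

# Softmax: a single multinomial model (default lbfgs solver)
softmax_gscv = GridSearchCV(
    make_pipeline_for(LogisticRegression(max_iter=1000)),
    {"logreg__C": Cs_demo},
    cv=5,
)

for name, gscv in [("ovr", ovr_gscv), ("softmax", softmax_gscv)]:
    gscv.fit(X_demo, y_demo)
    print(name, gscv.best_params_, round(gscv.best_score_, 3))
```

With this layout the two strategies are tuned by separate searches, so their best validation scores have to be compared by hand.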

In [10]:
import warnings
from sklearn.exceptions import ConvergenceWarning

# Filter convergence warnings
warnings.simplefilter("ignore", ConvergenceWarning)

# Fit/evaluate estimator
logreg_gscv.fit(X_tr, y_tr)

# Collect results in a DataFrame
logreg_results = pd.DataFrame(
    {
        "strategy": logreg_gscv.cv_results_["param_logreg__multi_class"],
        "C": logreg_gscv.cv_results_["param_logreg__C"],
        "mean_tr": logreg_gscv.cv_results_["mean_train_score"],
        "mean_te": logreg_gscv.cv_results_["mean_test_score"],
        "std_te": logreg_gscv.cv_results_["std_test_score"],
    }
)

# Ten best combinations according to the mean test score
logreg_results.sort_values(by="mean_te", ascending=False).head(10)

|   | strategy | C | mean_tr | mean_te | std_te |
|---|---|---|---|---|---|
| 24 | multinomial | 0.004833 | 0.687104 | 0.669264 | 0.058384 |
| 39 | multinomial | 10000.0 | 0.75784 | 0.664935 | 0.078263 |
| 38 | multinomial | 3792.690191 | 0.75784 | 0.664935 | 0.078263 |
| 37 | multinomial | 1438.449888 | 0.75784 | 0.664935 | 0.078263 |
| 36 | multinomial | 545.559478 | 0.75784 | 0.664935 | 0.078263 |
| 33 | multinomial | 29.763514 | 0.75784 | 0.66039 | 0.08106 |
| 32 | multinomial | 11.288379 | 0.75784 | 0.66039 | 0.08106 |
| 34 | multinomial | 78.475997 | 0.75784 | 0.66039 | 0.08106 |
| 35 | multinomial | 206.913808 | 0.75784 | 0.66039 | 0.08106 |
| 5 | ovr | 0.012743 | 0.732684 | 0.660173 | 0.030179 |


The logistic regression accuracy is around 67% ± 6% (std) according to the cross-validation scores of the best combination (`mean_te` and `std_te` in the first row above).

In [11]:
# Report test score
print("Test accuracy: {:.2f}%".format(100 * logreg_gscv.score(X_te, y_te)))

Test accuracy: 69.23%


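Both the k-NN and the logistic regression estimators are better than the "most-frequent" baseline. However, after repeating the experiment with different `random_state` seeds for the `train_test_split()` function, it is difficult to say that one is better than the other: the test scores move with the split, and the two best cross-validation scores are almost identical.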
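
The seed experiment mentioned above could look roughly like the following sketch. It runs on synthetic data (`X_demo`, `y_demo`) and, to stay short, fixes the hyperparameters instead of rerunning the grid search for every split, which is a simplification of the procedure used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Fixed hyperparameters for brevity (illustrative values)
candidates = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=20, p=1)),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Repeat the stratified 70-30 split with ten different seeds
scores = {name: [] for name in candidates}
for seed in range(10):
    X_a, X_b, y_a, y_b = train_test_split(
        X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=seed
    )
    for name, model in candidates.items():
        scores[name].append(model.fit(X_a, y_a).score(X_b, y_b))

for name, values in scores.items():
    print(
        "{}: {:.1f}% ± {:.1f}% (std over {} splits)".format(
            name, 100 * np.mean(values), 100 * np.std(values), len(values)
        )
    )
```

If the two intervals overlap heavily, the difference seen on a single split says little about which estimator is better.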

It would be a good idea to track other metrics such as precision, recall and the F1 score. For a reference of the metrics implemented in Scikit-learn, see the [Model evaluation guide](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics).
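
As a starting point, the per-class precision, recall and F1 score can be printed with `classification_report`. The sketch below fits a plain logistic regression on synthetic data (`X_demo`, `y_demo` are assumptions); in this notebook the predictions would instead come from something like `knn_gscv.predict(X_te)` or `logreg_gscv.predict(X_te)`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_a, y_a)
y_pred = model.predict(X_b)

# Per-class precision, recall and F1, plus macro and weighted averages
print(classification_report(y_b, y_pred, zero_division=0))

# A single summary number: unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_b, y_pred, average="macro")
print("Macro F1: {:.3f}".format(macro_f1))
```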