Heart disease diagnosis
---

Exercise - Evaluate "most-frequent" baseline
---

> **Exercise**: Load and split the `heart-disease.csv` data into 70-30 train/test sets - make sure to keep the same proportion of classes by setting `stratify`. Evaluate the accuracy of the "most-frequent" baseline.

In [None]:
import pandas as pd

# Load data
data_df = pd.read_csv("c4_heart-disease.csv")

# First five rows
data_df.head()

In [None]:
print("Classes:", data_df.disease.unique())

In [None]:
from sklearn.model_selection import train_test_split

# Create X/y variables
X = data_df.drop("disease", axis=1)
y = data_df.disease

# Split into train/test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)
print("Train:", X_tr.shape, y_tr.shape)
print("Test:", X_te.shape, y_te.shape)

In [None]:
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(None, y_tr)
print("Test accuracy: {:.2f}%".format(100 * dummy.score(None, y_te)))

The "most-frequent" baseline accuracy is around 54%.
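
The accuracy of a "most-frequent" classifier is simply the share of the majority class among the labels it is scored on. With a stratified split, that share is nearly identical in the train and test sets. The sketch below checks this on a small set of hypothetical labels (`y_demo` and `X_demo` are made up for illustration, not taken from the heart disease data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 6 samples of class 0 and 4 samples of class 1
y_demo = np.array([0] * 6 + [1] * 4)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features, ignored by the dummy model

dummy_demo = DummyClassifier(strategy="most_frequent")
dummy_demo.fit(X_demo, y_demo)

# Accuracy obtained by always predicting the majority class
dummy_acc = dummy_demo.score(X_demo, y_demo)

# Share of the most frequent class, computed directly from the labels
majority_share = np.bincount(y_demo).max() / len(y_demo)

print("Dummy accuracy: {:.2f}".format(dummy_acc))
print("Majority share: {:.2f}".format(majority_share))
```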

Exercise - Evaluate k-NN baseline
---

> **Exercise**: Tune a k-NN classifier using grid search with **stratified 10-fold** cross-validation
> * Number of neighbors k
> * Distance metric - $L_{1}$ or $L_{2}$
> * Weighting strategy - uniform or by distance
>
> Refit the best estimator on the whole train set and report the test accuracy.

Dataset documentation: http://archive.ics.uci.edu/ml/datasets/heart+Disease
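
When `cv` is an integer and the estimator is a classifier, `GridSearchCV` already uses stratified folds. To make the stratified 10-fold requirement explicit, a `StratifiedKFold` splitter can be passed instead. The following is a minimal sketch on synthetic stand-in data; `X_demo`, `y_demo`, the reduced grid and the `shuffle`/`random_state` settings are illustrative assumptions, not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Explicit stratified 10-fold splitter; shuffling and the seed are illustrative choices
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Small grid over the three hyperparameters named in the exercise
demo_grid = {
    "n_neighbors": [1, 5, 10],
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # 1 = L1 (Manhattan) distance, 2 = L2 (Euclidean) distance
}
demo_gscv = GridSearchCV(KNeighborsClassifier(), demo_grid, cv=skf)
demo_gscv.fit(X_demo, y_demo)

print("Best parameters:", demo_gscv.best_params_)
print("Number of folds:", demo_gscv.n_splits_)
```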

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# One-hot encoding
onehot_columns = ["sex", "cp", "fbs", "restecg", "exang", "slope", "thal"]

# Numerical features
other_columns = ["age", "trestbps", "chol", "thalach", "oldpeak", "ca"]

# Preprocessor
preprocessor = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), onehot_columns),
        ("other", "passthrough", other_columns),
    ]
)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# k-NN estimator
knn_estimator = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("scaler", StandardScaler()),  # Standardize features before k-NN
        ("knn", KNeighborsClassifier()),
    ]
)

# Grid search with cross-validation
grid = {
    "knn__n_neighbors": [1, 5, 10, 15, 20],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}
knn_gscv = GridSearchCV(knn_estimator, grid, cv=10, refit=True, return_train_score=True)

In [None]:
# Fit/evaluate estimator
knn_gscv.fit(X_tr, y_tr)

# Collect results in a DataFrame
knn_results = pd.DataFrame(
    {
        "k": knn_gscv.cv_results_["param_knn__n_neighbors"],
        "p": knn_gscv.cv_results_["param_knn__p"],
        "weights": knn_gscv.cv_results_["param_knn__weights"],
        "mean_tr": knn_gscv.cv_results_["mean_train_score"],
        "mean_te": knn_gscv.cv_results_["mean_test_score"],
        "std_te": knn_gscv.cv_results_["std_test_score"],
    }
)

# Ten best combinations according to the mean "test" score
# i.e. the mean score on the 10 validation folds
knn_results.sort_values(by="mean_te", ascending=False).head(10)

The k-NN validation accuracy is around 66% ± 8% (std) across the 10 folds.
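
The mean and standard deviation quoted above can also be read directly from `cv_results_` at position `best_index_`, without sorting a DataFrame. This is a sketch on synthetic stand-in data with a reduced grid (both are assumptions for illustration), so the printed numbers will differ from the ones reported for the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_gscv = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [1, 5, 10, 15]}, cv=10
)
demo_gscv.fit(X_demo, y_demo)

# Index of the best candidate in the cv_results_ arrays
best = demo_gscv.best_index_
best_mean = demo_gscv.cv_results_["mean_test_score"][best]
best_std = demo_gscv.cv_results_["std_test_score"][best]

print(
    "Validation accuracy: {:.1f}% ± {:.1f}% (std)".format(
        100 * best_mean, 100 * best_std
    )
)
```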

In [None]:
# Report test score
print("Test accuracy: {:.2f}%".format(100 * knn_gscv.score(X_te, y_te)))

Exercise - Logistic regression
---

> **Exercise**: Repeat the same procedure with a logistic regression
> * Try both OvR and softmax
> * Tune `C`
>
> Which estimator would you use in practice? k-NN or logistic regression?

In [None]:
from sklearn.linear_model import LogisticRegression
import numpy as np

# Logistic regression estimator
logreg_estimator = Pipeline(
    [
        ("preprocessor", preprocessor),
        # Standardize features: regularization and the solvers are sensitive to feature scale
        ("scaler", StandardScaler()),
        ("logreg", LogisticRegression()),
    ]
)

# Grid search with cross-validation
Cs = np.logspace(-4, 4, num=20)
grids = [
    {"logreg__multi_class": ["ovr"], "logreg__solver": ["liblinear"], "logreg__C": Cs},
    {
        "logreg__multi_class": ["multinomial"],
        "logreg__solver": ["saga"],
        "logreg__C": Cs,
    },
]
logreg_gscv = GridSearchCV(
    logreg_estimator, grids, cv=10, refit=True, return_train_score=True
)

In [None]:
import warnings
from sklearn.exceptions import ConvergenceWarning

# Filter convergence warnings
warnings.simplefilter("ignore", ConvergenceWarning)

# Fit/evaluate estimator
logreg_gscv.fit(X_tr, y_tr)

# Collect results in a DataFrame
logreg_results = pd.DataFrame(
    {
        "strategy": logreg_gscv.cv_results_["param_logreg__multi_class"],
        "C": logreg_gscv.cv_results_["param_logreg__C"],
        "mean_tr": logreg_gscv.cv_results_["mean_train_score"],
        "mean_te": logreg_gscv.cv_results_["mean_test_score"],
        "std_te": logreg_gscv.cv_results_["std_test_score"],
    }
)

# Ten best combinations according to the mean test score
logreg_results.sort_values(by="mean_te", ascending=False).head(10)

The logistic regression validation accuracy is around 67% ± 7% (std) across the 10 folds.

In [None]:
# Report test score
print("Test accuracy: {:.2f}%".format(100 * logreg_gscv.score(X_te, y_te)))

The k-NN and logistic regression estimators both outperform the "most-frequent" baseline. However, after trying different `random_state` seeds for the `train_test_split()` function, it's difficult to say that one is better than the other.
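
One way to run this kind of seed comparison is to repeat the split for several `random_state` values and look at the spread of the test accuracies. The sketch below does so on synthetic stand-in data with fixed hyperparameters instead of a full grid search (both are simplifications for illustration); names such as `X_demo` and `knn_scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

knn_scores, logreg_scores = [], []
for seed in range(10):
    # New stratified 70-30 split for each seed
    X_tr_d, X_te_d, y_tr_d, y_te_d = train_test_split(
        X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=seed
    )

    # Fixed hyperparameters to keep the sketch short (no grid search here)
    knn_demo = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
    logreg_demo = make_pipeline(StandardScaler(), LogisticRegression())

    knn_scores.append(knn_demo.fit(X_tr_d, y_tr_d).score(X_te_d, y_te_d))
    logreg_scores.append(logreg_demo.fit(X_tr_d, y_tr_d).score(X_te_d, y_te_d))

# Mean and spread of the test accuracy over the 10 splits
print("k-NN:   {:.3f} ± {:.3f}".format(np.mean(knn_scores), np.std(knn_scores)))
print("logreg: {:.3f} ± {:.3f}".format(np.mean(logreg_scores), np.std(logreg_scores)))
```

If the two intervals overlap heavily, the difference between the estimators is within the noise of the split.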

It would be a good idea to track other metrics such as precision, recall and the F1 score. For a reference of the different metrics implemented in scikit-learn, see the [Model evaluation guide](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics).
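
As a minimal sketch of how these metrics could be computed, the example below uses `classification_report` and `precision_recall_fscore_support` from `sklearn.metrics` on a synthetic three-class problem. The data, the plain `LogisticRegression` model and the choice of macro averaging are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic three-class stand-in data (not the heart disease dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr_d, X_te_d, y_tr_d, y_te_d = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=0
)

clf_demo = LogisticRegression(max_iter=1000).fit(X_tr_d, y_tr_d)
y_pred = clf_demo.predict(X_te_d)

# Per-class precision, recall and F1, plus averaged values
print(classification_report(y_te_d, y_pred, zero_division=0))

# Macro average: unweighted mean of the per-class scores
precision, recall, f1, _ = precision_recall_fscore_support(
    y_te_d, y_pred, average="macro", zero_division=0
)
print("Macro F1: {:.3f}".format(f1))
```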