<a href="https://colab.research.google.com/github/dkapitan/jads-nhs-proms/blob/master/notebooks/4.0-modeling-clustering-classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background to osteoarthritis case study

This is day 4 from the [5-day JADS NHS PROMs data science case study](https://github.com/dkapitan/jads-nhs-proms/blob/master/README.md). To recap the previous lectures:

- In **lecture 1** we focused on data understanding and explored various possible outcome parameters.
- In **lecture 2** we constructed a combined outcome parameter using cut-off points for pain and physical functioning.
- In **lecture 3** we performed regression and linear modeling on the outcome parameter `t1_eq_vas`.

Today we are going to focus on classification.

# Modeling: clustering & classification

## Learning objectives
### Modeling: classification

Recall that in lecture 2 we defined different outcomes Y suitable for classification:

- `y_mcid`: outcome is good (True/1) if `t1_oks_score` is above the threshold or `delta_oks_score` is larger than MCID
- `y_t1_pain_good` and `y_t1_functioning_good`: combination of two binary outcomes, yielding a total of 4 classes.

We will use the second outcome since it is more challenging and versatile (see [this presentation (in Dutch)](https://kapitan.net/wp-content/uploads/2018/11/181108_data_driven_healthcare_congres.pdf) for more details). With it we can:
- perform a binary classification, defining a good outcome as one where both pain and functioning are good at `t1`
- perform multiclass classification.
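
As a minimal sketch of how the two binary outcomes could be turned into these targets, the toy example below builds a binary and a 4-class label. The data frame `toy` and the column `y_multiclass`, including its integer encoding, are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the two binary outcomes at t1 (values are made up).
toy = pd.DataFrame(
    {
        "y_t1_pain_good": [False, False, True, True],
        "y_t1_functioning_good": [False, True, False, True],
    }
)

# Binary target: good outcome only when both pain and functioning are good.
toy["y_binary"] = toy["y_t1_pain_good"] & toy["y_t1_functioning_good"]

# Multiclass target: encode the 2x2 combination as an integer 0..3.
# The encoding (2 * pain + functioning) is an arbitrary illustrative choice.
toy["y_multiclass"] = (
    2 * toy["y_t1_pain_good"].astype(int)
    + toy["y_t1_functioning_good"].astype(int)
)
print(toy)
```

Any one-to-one encoding of the four combinations would do; the classifiers used later treat the labels as unordered classes.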

We will compare the performance of single estimators (SVM, Decision tree) with ensemble learning (Random Forest, Gradient Boosting). Grid search and cross-validation are used for model assessment. Some examples are given to visualize scoring functions of models to aid in model selection.
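
To make the combination of grid search, cross-validation and several scoring functions concrete, here is a small sketch on synthetic data. The dataset from `make_classification`, the names `demo_search` and `demo_results`, and the tiny parameter grid are assumptions for illustration only; they are not the settings used on the PROMs data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the PROMs dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=42
)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 8]},
    scoring={"accuracy": "accuracy", "roc_auc": "roc_auc"},
    refit="roc_auc",  # metric used to select best_estimator_
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# One row per parameter setting, one column per cross-validated metric.
demo_results = pd.DataFrame(demo_search.cv_results_)[
    ["param_max_depth", "mean_test_accuracy", "mean_test_roc_auc"]
]
print(demo_results)
print("best max_depth by ROC AUC:", demo_search.best_params_["max_depth"])
```

With more than one metric, `refit` has to name the metric that decides which model is refitted on the full training data.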


### Python: Hands-on Machine Learning (2nd edition)

- [Classification (chapter 3)](https://github.com/ageron/handson-ml2/blob/master/03_classification.ipynb)
- [Support-vector machines (chapter 5)](https://github.com/ageron/handson-ml2/blob/master/05_support_vector_machines.ipynb)
- [Decision trees (chapter 6)](https://github.com/ageron/handson-ml2/blob/master/06_decision_trees.ipynb)
- [Ensemble learning and random forests (chapter 7)](https://github.com/ageron/handson-ml2/blob/master/07_ensemble_learning_and_random_forests.ipynb)

### scikit-learn
- [Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV](https://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html#sphx-glr-auto-examples-model-selection-plot-multi-metric-evaluation-py)

So let's start by re-running the relevant code from previous lectures.

In [1]:
import warnings
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedShuffleSplit

# suppress warnings for readability
warnings.filterwarnings("ignore")

# To plot pretty figures directly within Jupyter
%matplotlib inline

# choose your own style: https://matplotlib.org/3.1.0/gallery/style_sheets/style_sheets_reference.html
# note: from matplotlib 3.6 onwards this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')

# Go to town with https://matplotlib.org/tutorials/introductory/customizing.html
# plt.rcParams.keys()
mpl.rc('axes', labelsize=14, titlesize=14)
mpl.rc('figure', titlesize=20)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# constants for figsize
S = (8,8)
M = (12,12)
L = (14,14)

# pandas options
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.precision", 2)

# read data
df = pd.read_parquet('https://github.com/dkapitan/jads-nhs-proms/blob/master/data/interim/knee-provider.parquet?raw=true')

In [2]:
# handy function to select oks columns
def oks_questions(t='t0'):
    return [
        col for col in df.columns
        if col.startswith(f"oks_{t}") and not col.endswith("_score")
    ]

# replace sentinel values in oks columns
# note we are doing imputation on original dataframe (rather than in pipeline later on)
# so we can perform it prior to StratifiedShuffleSplit
oks_no9 = oks_questions('t0') + oks_questions('t1')
impute_oks = SimpleImputer(missing_values=9, strategy="most_frequent")
df.loc[:, oks_no9] = impute_oks.fit_transform(df[oks_no9])

# group columns t0
age_band = ["age_band"]
gender = ["gender"]
age_band_categories = sorted([x for x in df.age_band.unique() if isinstance(x, str)])
comorb = [
    "heart_disease",
    "high_bp",
    "stroke",
    "circulation",
    "lung_disease",
    "diabetes",
    "kidney_disease",
    "nervous_system",
    "liver_disease",
    "cancer",
    "depression",
    "arthritis",
]
boolean = ["t0_assisted", "t0_previous_surgery", "t0_disability"]
eq5d = ["t0_mobility", "t0_self_care", "t0_activity", "t0_discomfort", "t0_anxiety"]
eq_vas = ["t0_eq_vas"]
categorical = ["t0_symptom_period", "t0_previous_surgery", "t0_living_arrangements"]
oks_score = ["oks_t0_score"]

# add number of comorbidities as extra feature
impute_comorb = SimpleImputer(missing_values=9, strategy="constant", fill_value=0)
df.loc[:, comorb] = impute_comorb.fit_transform(df[comorb])
# sum across the comorbidity columns for each patient (row-wise)
df["n_comorb"] = df.loc[:, comorb].sum(axis=1)


# define outcome Y
CUT_OFF_PAIN = 4
CUT_OFF_FUNCTIONING = 26

for t in ("t0", "t1"):
    df[f"oks_{t}_pain_total"] = df[f"oks_{t}_pain"] + df[f"oks_{t}_night_pain"]
    df[f"oks_{t}_functioning_total"] = (
        df.loc[:, [col for col in oks_questions(t) if "pain" not in col]]
        .sum(axis=1)
    )
    df[f"y_{t}_pain_good"] = df[f"oks_{t}_pain_total"] >= CUT_OFF_PAIN
    df[f"y_{t}_functioning_good"] = (
        df[f"oks_{t}_functioning_total"] >= CUT_OFF_FUNCTIONING
    )

# define binary outcome parameter
df["y_binary"] = np.logical_and(df.y_t1_pain_good, df.y_t1_functioning_good)

# Only using 1 split for stratified sampling, more folds are used later on in cross-validation
split = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
# split() yields positional indices, so select rows with iloc
for train_index, test_index in split.split(df, df["y_binary"]):
    df_train = df.iloc[train_index]
    df_test = df.iloc[test_index]

y_train_pain_good = df_train.y_t1_pain_good
y_train_functioning_good = df_train.y_t1_functioning_good
y_train_binary = df_train.y_binary

In [3]:
pd.crosstab(df_train.y_t0_pain_good, df_train.y_t0_functioning_good, normalize=True)

| y_t0_pain_good | y_t0_functioning_good = False | y_t0_functioning_good = True |
|---|---|---|
| False | 0.81 | 0.07 |
| True | 0.08 | 0.05 |


In [4]:
pd.crosstab(df_train.y_t1_pain_good, df_train.y_t1_functioning_good, normalize=True)

| y_t1_pain_good | y_t1_functioning_good = False | y_t1_functioning_good = True |
|---|---|---|
| False | 0.15 | 0.06 |
| True | 0.08 | 0.71 |
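
The crosstab above shows that roughly 71% of the training cases have a good outcome on both pain and functioning at `t1`. A classifier that always predicts the majority class would therefore already reach an accuracy of about 0.71, which is a useful reference point for the models below. The sketch uses scikit-learn's `DummyClassifier` on made-up labels; `y_toy`, `X_toy` and `baseline` are hypothetical names and the labels are randomly generated, not taken from the PROMs data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up binary labels with roughly 71% positives, mimicking the class balance.
rng = np.random.default_rng(42)
y_toy = rng.random(1000) < 0.71
X_toy = np.zeros((1000, 1))  # DummyClassifier ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
```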


In [5]:
# same pipeline as lecture 3
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer


# preprocessing pipelines for specific columns
age_band_pipe = Pipeline(
    steps=[
        ("impute", SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
        ("ordinal", OrdinalEncoder(categories=[age_band_categories])),
    ]
)
gender_pipe = Pipeline(
    steps=[
        ("impute", SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
        ("onehot", OneHotEncoder()),
    ]
)

# ColumnTransformer on all included columns.
# Note columns that are not specified are dropped by default
transformers = {
    "age": ("age", age_band_pipe, age_band),
    "gender": ("gender", gender_pipe, gender),
    "comorb": (
        "comorb",
        'passthrough',
        comorb,
    ),
    "categorical": (
        "categorical",
        SimpleImputer(missing_values=9, strategy="most_frequent"),
        boolean + eq5d + categorical,
    ),
    "oks": (
        "oks",
        'passthrough',
        oks_questions('t0'),
    ),
    "eq_vas": ("eqvas", SimpleImputer(missing_values=999, strategy="median"), eq_vas),
}
prep = ColumnTransformer(
    transformers=[v for _, v in transformers.items()])

# X_train = prep.fit_transform(df_train)
# X_test = prep.transform(df_test)  # only transform the test set, never refit on it

# list of columns for convenience
# https://stackoverflow.com/questions/54646709/sklearn-pipeline-get-feature-name-after-onehotencode-in-columntransformer
# X_columns = pd.Series(
#     age_band
#     + prep.named_transformers_["gender"]["onehot"].get_feature_names().tolist()
#     + comorb
#     + boolean
#     + eq5d
#     + categorical
#     + oks_questions('t0')
#     + oks_score
#     + eq_vas
# )

## Classification

### Single estimators
#### Stochastic Gradient Descent

In [7]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV


sgd = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
sgd_pipe = Pipeline(steps=[("prep", prep), ("sgd", sgd)])

# https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
sgd_parameters = {
    "sgd__loss": ["hinge"],
    "sgd__penalty": ["l2"],
    "sgd__max_iter": [2, 5, 10],
}

sgd_search = GridSearchCV(sgd_pipe, sgd_parameters, cv=5)
sgd_search.fit(df_train, y_train_binary)




In [10]:
from sklearn.metrics import confusion_matrix, roc_auc_score
confusion_matrix(y_train_binary, sgd_search.best_estimator_.predict(df_train), normalize='all')

array([[0.11814498, 0.17451393],
       [0.10172883, 0.60561227]])
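
The normalized confusion matrix above can be turned into the usual summary metrics with a few lines of arithmetic. The sketch copies the four reported numbers and follows scikit-learn's convention that rows are true labels and columns are predicted labels, both ordered as (False, True). Note that these are training-set figures, and that the resulting accuracy of about 0.72 is only slightly above the share of truly good outcomes (about 0.71).

```python
import numpy as np

# Normalized confusion matrix reported above (rows: true, columns: predicted).
cm = np.array([[0.11814498, 0.17451393],
               [0.10172883, 0.60561227]])
tn, fp, fn, tp = cm.ravel()

accuracy = tn + tp              # share of correct predictions
precision = tp / (tp + fp)      # of predicted good outcomes, share truly good
recall = tp / (tp + fn)         # of truly good outcomes, share found
majority_share = fn + tp        # share of truly good outcomes in the data

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, majority share={majority_share:.3f}")
```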

#### Decision tree

In [12]:
from sklearn.tree import DecisionTreeClassifier

cart = DecisionTreeClassifier(random_state=42)
cart_pipe = Pipeline(
    steps=[
        ("prep", prep),
        ("cart", cart),
    ]
)

cart_parameters = {
    "cart__max_depth": [10, 30],
    "cart__min_samples_leaf": [0.02, 0.05, 0.1, 0.2],
}

cart_search = GridSearchCV(cart_pipe, cart_parameters, cv=5)
cart_search.fit(df_train, y_train_binary)


In [13]:
confusion_matrix(y_train_binary, cart_search.best_estimator_.predict(df_train), normalize='all')

array([[0.05869799, 0.23396091],
       [0.03967578, 0.66766532]])
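
One thing to keep in mind when reading the parameter grid for the decision tree: a float value for `min_samples_leaf` is interpreted as a fraction of the training samples. With `min_samples_leaf=0.2` every leaf must hold at least 20% of the samples, so the tree can have at most 5 leaves and the choice between `max_depth` 10 and 30 can no longer make a difference. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo` and `leaf_counts` are hypothetical names introduced for the example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the PROMs dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

leaf_counts = {}
for max_depth in (10, 30):
    tree = DecisionTreeClassifier(
        max_depth=max_depth,
        min_samples_leaf=0.2,  # each leaf needs at least 20% of the samples
        random_state=42,
    ).fit(X_demo, y_demo)
    # Store (number of leaves, realized depth) per max_depth setting.
    leaf_counts[max_depth] = (tree.get_n_leaves(), tree.get_depth())

print(leaf_counts)
```

If the grid search is meant to explore deeper trees, smaller fractions or integer values for `min_samples_leaf` leave more room for `max_depth` to matter.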

#### LinearSVC  

In [None]:
from sklearn.svm import LinearSVC

svm = Pipeline(steps=[("prep", prep), ("clf", LinearSVC(random_state=42))])

### Ensemble methods: Random Forest & Gradient Boosted Trees


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier


rf_clf = Pipeline(steps=[('prep', prep),
                          ('rf', RandomForestClassifier(n_estimators=100,
                                                        random_state=42))])
gb_clf = Pipeline(steps=[('prep', prep),
                          ('gb', GradientBoostingClassifier(n_estimators=100, learning_rate=1.0))])

In [None]:
# the pipeline includes the preprocessing step, so it takes the raw dataframes
rf_clf.fit(df_train, y_train_binary)
print(f'Random Forest model score: {rf_clf.score(df_test, df_test.y_binary):.2f}')

In [None]:
# `svm` is the LinearSVC pipeline defined in the LinearSVC section
svm.fit(df_train, y_train_binary)
print(f'SVM model score: {svm.score(df_test, df_test.y_binary):.2f}')
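
A single train/test score per model is a fairly noisy basis for comparing single estimators with ensembles. A sketch of a more robust comparison is to run the same cross-validation for each candidate and look at the mean and spread of the scores. The example below does this on synthetic data; `X_demo`, `y_demo`, `candidates` and `cv_scores` are hypothetical names, and the estimator settings are chosen for speed rather than taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the PROMs dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# 5-fold cross-validated ROC AUC for every candidate model.
cv_scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="roc_auc")
    for name, est in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On the case study data the same loop could be run with the full pipelines (`rf_clf`, `gb_clf` and the LinearSVC pipeline) on `df_train` and `y_train_binary`.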

# Conclusion and reflection

## Discussion of results

* ...
* ...

## Checklist for results from data preparation process
* ...
* ...