<a href="https://colab.research.google.com/github/dkapitan/jads-nhs-proms/blob/master/notebooks/4.0-modeling-clustering-classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background to osteoarthritis case study

_taken from [narrative seminar Osteoarthritis by Hunter & Bierma-Zeinstra (2019) in the Lancet](https://github.com/dkapitan/jads-nhs-proms/blob/master/references/hunter2019osteaoarthritis.pdf)._

Outcomes from total joint replacement can be optimised if patient selection identifies marked joint space narrowing. Most improvement will be made in patients with complete joint space loss and evident bone attrition. Up to 25% of patients presenting for total joint replacement continue to complain of pain and disability 1 year after well performed surgery. Careful preoperative patient selection (including consideration of the poor outcomes that are more common in people who are depressed, have minimal radiographic disease, have minimal pain, and who are morbidly obese), shared decision making about surgery, and informing patients about realistic outcomes of surgery are needed to minimise the likelihood of dissatisfaction.

# Modeling: clustering & classification

This is day 4 from the [5-day JADS NHS PROMs data science case study](https://github.com/dkapitan/jads-nhs-proms/blob/master/references/outline.md).



## Learning objectives: modeling
- ...


## Learning objectives Python: Hands-on Machine Learning (2nd edition)

- [End-to-end Machine Learning project (chapter 2)](https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb)
- [Unsupervised learning (chapter 9)](https://github.com/ageron/handson-ml2/blob/master/09_unsupervised_learning.ipynb)
- [Classification (chapter 3)](https://github.com/ageron/handson-ml2/blob/master/03_classification.ipynb)
- [Support-vector machines (chapter 5)](https://github.com/ageron/handson-ml2/blob/master/05_support_vector_machines.ipynb)
- [Decision trees (chapter 6)](https://github.com/ageron/handson-ml2/blob/master/06_decision_trees.ipynb)
- [Ensemble learning and random forests (chapter 7)](https://github.com/ageron/handson-ml2/blob/master/07_ensemble_learning_and_random_forests.ipynb)

## Recap from previous lecture
- A good outcome Y for knee replacement is measured using the difference in Oxford Knee Score (OKS).
- Research has shown that an improvement in OKS of approx. 30% is relevant ([van der Wees 2017](https://github.com/dkapitan/jads-nhs-proms/blob/master/references/vanderwees2017patient-reported.pdf)). Hence an increase of +14 points is considered a 'good' outcome.
- To account for the ceiling effect, a high final `oks_t1_score` is also considered a good outcome (even if `delta_oks_score` is smaller than 14).

    

In [0]:
import warnings
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.feature_selection import chi2, VarianceThreshold
import sklearn.linear_model

# suppress warnings for readability
warnings.filterwarnings("ignore")

# To plot pretty figures directly within Jupyter
%matplotlib inline

# choose your own style: https://matplotlib.org/3.1.0/gallery/style_sheets/style_sheets_reference.html
plt.style.use('ggplot')

# Go to town with https://matplotlib.org/tutorials/introductory/customizing.html
# plt.rcParams.keys()
mpl.rc('axes', labelsize=14, titlesize=14)
mpl.rc('figure', titlesize=20)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# constants for figsize
S = (8,8)
M = (12,12)
L = (14,14)

# pandas options
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.precision", 2)

# import data
df = pd.read_parquet('https://github.com/dkapitan/jads-nhs-proms/blob/master/data/interim/knee-provider.parquet?raw=true')

# Data preparation in a scikit-learn Pipeline
We have already discussed the various steps in data preparation using [pandas](https://pandas.pydata.org/). As explained in the [documentation of scikit-learn](https://scikit-learn.org/stable/modules/compose.html#column-transformer), doing this preparation outside the modeling pipeline may be problematic for one of the following reasons:

* Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.

* You may want to include the parameters of the preprocessors in a [parameter search](https://scikit-learn.org/stable/modules/grid_search.html#grid-search).

For this purpose, the [`ColumnTransformer` class](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=columntransformer#sklearn.compose.ColumnTransformer) has been added to scikit-learn. The documentation gives an example of how to use it for [pre-processing mixed types](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py). Historically, `sklearn` transformers were designed to work with numpy arrays, not with pandas DataFrames. You can use [`sklearn-pandas`](https://github.com/scikit-learn-contrib/sklearn-pandas) to bridge this gap or use `ColumnTransformer` directly on pandas DataFrames. We will use the latter.
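Both reasons can be illustrated with a minimal sketch on synthetic data (the arrays, step names and parameter grid below are assumptions for illustration, not part of the case study). Because the imputer and scaler live inside the `Pipeline`, they are re-fitted on the training folds only, and the imputation strategy can be tuned like any other hyperparameter.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data (assumption): 200 rows, 3 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

model = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('clf', LogisticRegression()),
])

# preprocessing statistics are computed per training fold, so no leakage
scores = cross_val_score(model, X, y, cv=5)
print(f'accuracy per fold: {scores.round(2)}')

# preprocessing parameters can be part of the parameter search
search = GridSearchCV(model, param_grid={'impute__strategy': ['mean', 'median']}, cv=5)
search.fit(X, y)
print(search.best_params_)
```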



## Using ColumnTransformers and Pipelines

Recalling from the second lecture, we want to perform the following preprocessing per (group of) columns:
  * Gender: replace `np.nan` and transform to boolean with `OneHotEncoder`
  * Age_band: ordinal values
  * Comorbidities: replace sentinel 9 with 0 and transform to boolean
  * boolean, EQ5D, categorical: replace sentinel 9 with most frequent
  * eq_vas: replace sentinel 999 with median

If a feature requires more than one preprocessing step, the use of `Pipeline` is recommended.

### Passing 1D or 2D arrays in your `Pipeline`
It is important to remember that `scikit-learn` can be quite fussy about the difference between passing 1D arrays/series and 2D arrays/dataframes.

For example, the following code will result in an error because `categories` needs to be a list of lists:
```
enc = OrdinalEncoder(categories=age_band_categories)
enc.fit(df[age_band])
```

The correct code is (brackets!):
```
enc = OrdinalEncoder(categories=[age_band_categories])
enc.fit(df[age_band])
```


### Beware: difference between `OrdinalEncoder` and `OneHotEncoder`
The integer representation of a categorical variable produced by `OrdinalEncoder` cannot be used directly with all scikit-learn estimators: these expect continuous input and would interpret the categories as being ordered, which is often not desired.

Another way to convert categorical features into features that can be used with scikit-learn estimators is one-of-K encoding, also known as one-hot or dummy encoding. This type of encoding can be obtained with the `OneHotEncoder`, which transforms each categorical feature with `n_categories` possible values into `n_categories` binary features, with one of them 1 and all others 0.
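The difference between the two encoders can be seen in a small sketch. The toy dataframe and its age band labels are hypothetical; note the double brackets that select a 2D dataframe and the list of lists passed to `categories`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# hypothetical toy data, labels chosen for illustration only
toy = pd.DataFrame({'age_band': ['60 to 69', '80 to 89', '70 to 79', '60 to 69']})
bands = ['60 to 69', '70 to 79', '80 to 89']

# one integer column; the order of `bands` defines the integer codes
ordinal = OrdinalEncoder(categories=[bands]).fit_transform(toy[['age_band']])

# one binary column per category; the default output is sparse, hence toarray()
onehot = OneHotEncoder().fit_transform(toy[['age_band']]).toarray()

print(ordinal.ravel())
print(onehot)
```

For an ordered variable such as age band the integer codes carry meaning; for an unordered variable such as gender the one-hot columns avoid suggesting an order.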

In [108]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer


# group columns
age_band = ['age_band']
gender = ['gender']
age_band_categories = sorted([x for x in df.age_band.unique() if x])
comorb = ['heart_disease', 'high_bp', 'stroke', 'circulation', 'lung_disease',
          'diabetes', 'kidney_disease', 'nervous_system', 'liver_disease',
          'cancer', 'depression', 'arthritis']
boolean = ['t0_assisted', 't0_previous_surgery', 't0_disability']
eq5d = ['t0_mobility', 't0_self_care', 't0_activity', 't0_discomfort',
        't0_anxiety']
eq_vas = ['t0_eq_vas']
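# note: 't0_previous_surgery' is also listed in `boolean` above,
# so it is passed to the 'most_frequent' imputer twice
categorical = ['t0_symptom_period', 't0_previous_surgery',
               't0_living_arrangements']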
oks = [col for col in df.columns if col.startswith('oks_t0')]

# preprocessing pipelines for specific columns
age_band_pipe = Pipeline(
    steps=[('impute',
            SimpleImputer(missing_values=None,
            strategy='most_frequent')),
           ('ordinal',
            OrdinalEncoder(categories=[age_band_categories])),
           ])

gender_pipe = Pipeline(
    steps=[('impute',
           SimpleImputer(missing_values=np.nan,
                         strategy='most_frequent')),
           ('onehot', OneHotEncoder()),
           ])

oks_pipe = Pipeline(
    steps=[('impute9',
            SimpleImputer(missing_values=9,
                          strategy='most_frequent')),
           ('imputeNA',
            SimpleImputer(missing_values=np.nan,
                          strategy='most_frequent')),
           ])

# ColumnTransformer on all included columns.
# Note columns that are not specified are dropped by default
prep = ColumnTransformer(
    transformers=[('age',
                   age_band_pipe,
                   age_band),
                  ('gender',
                   gender_pipe,
                   gender),
                  ('constant',
                   SimpleImputer(missing_values=9,
                                 strategy='constant',
                                 fill_value=0),
                   comorb),
                  ('most_frequent',
                   SimpleImputer(missing_values=9,
                                 strategy='most_frequent'),
                   boolean + eq5d + categorical),
                  ('eqvas',
                   SimpleImputer(missing_values=999,
                                 strategy='median'),
                   eq_vas),
                  ('oks',
                   oks_pipe,
                   oks),
                  ])

prep.fit(df)

ValueError: ignored

In [107]:
df[oks[-1]].value_counts(dropna=False)

17.0    6564
16.0    6563
20.0    6562
19.0    6518
15.0    6387
21.0    6362
18.0    6301
22.0    6160
14.0    5994
23.0    5799
13.0    5745
12.0    5311
24.0    5296
25.0    4885
11.0    4883
26.0    4475
10.0    4349
27.0    4041
9.0     3848
28.0    3428
8.0     3169
29.0    3077
7.0     2654
30.0    2527
31.0    2246
6.0     2046
32.0    1892
NaN     1669
5.0     1568
33.0    1543
34.0    1260
4.0     1184
35.0     966
36.0     750
3.0      647
37.0     619
38.0     430
39.0     352
2.0      312
40.0     245
41.0     164
42.0     108
1.0      104
43.0      69
0.0       56
44.0      52
45.0      31
46.0      11
47.0       8
48.0       6
Name: oks_t0_score, dtype: int64
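The value counts above point to a likely cause of the `ValueError`: `oks_t0_score` contains `NaN`, and `SimpleImputer` rejects `NaN` input when `missing_values` is set to something else, as in the first step of `oks_pipe`. They also show that 9 is a legitimate value of the total score, so the sentinel replacement should probably not be applied to that column. The sketch below reproduces the error on a hypothetical toy column and shows one possible fix, imputing `NaN` before the sentinel; whether this resolves every issue in the real data is an assumption.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# hypothetical item column: answers 0-4, one sentinel 9 and one missing value
X = np.array([[0.0], [1.0], [1.0], [9.0], [np.nan]])

# imputing the sentinel first fails because the input still contains NaN
try:
    SimpleImputer(missing_values=9, strategy='most_frequent').fit(X)
    raised = False
except ValueError as err:
    raised = True
    print(f'ValueError: {err}')

# possible fix: handle NaN first, then the sentinel
nan_first = Pipeline(steps=[
    ('imputeNA', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('impute9', SimpleImputer(missing_values=9, strategy='most_frequent')),
])
cleaned = nan_first.fit_transform(X)
print(cleaned.ravel())
```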

## Writing custom transformers (advanced, see Géron chapter 2)
Although scikit-learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. You will want your transformer to work seamlessly with scikit-learn functionality (such as pipelines). Since scikit-learn relies on duck typing (not inheritance), all you need to do is create a class and implement three methods: `fit()` (returning `self`), `transform()`, and `fit_transform()`.

If you add `TransformerMixin` as a base class, you get `fit_transform()` for free, and for stateless data preparation steps `fit()` only has to return `self`, so in practice you only need to write `transform()`. `ColumnTransformer` passes only the selected subset of columns from the original dataframe to the transformer. So when writing your own transformer you don't need to do any subsetting: you can assume that `transform()` should be applied to everything it receives.



In [0]:
# just as an example, not used in Pipeline
class ReplaceSentinels(BaseEstimator, TransformerMixin):
    """Replace sentinel values in dataframe.
    
    Attributes:
        sentinel: sentinel value, default 9
        replace_with: value to replace sentinel with, default np.nan
    """
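    def __init__(self, sentinel=9, replace_with=np.nan):
        self.sentinel = sentinel
        self.replace_with = replace_with

    def fit(self, X, y=None):
        # stateless transformer: nothing to learn from the data
        return self

    def transform(self, X):
        # use the configured sentinel instead of a hard-coded 9
        return X.replace(self.sentinel, self.replace_with)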
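For stateless clean-up steps like this one, scikit-learn's `FunctionTransformer` is a shorter alternative to writing a class. The sketch below is an illustration, not the approach used in this notebook; the helper `replace_sentinels` and the toy dataframe are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def replace_sentinels(X, sentinel=9, replace_with=np.nan):
    """Hypothetical helper: replace a sentinel value in a dataframe."""
    return X.replace(sentinel, replace_with)


# toy data (assumption): 9 marks a missing answer
toy = pd.DataFrame({'t0_mobility': [1, 9, 3], 't0_anxiety': [9, 2, 2]})

replacer = FunctionTransformer(
    replace_sentinels,
    kw_args={'sentinel': 9, 'replace_with': np.nan},
)
cleaned = replacer.fit_transform(toy)
print(cleaned)
```

A class remains the better choice once the transformer has to learn something in `fit()`, for example a median computed on the training data.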


## Preprocess Y and train-test split
After preprocessing X, we now need to define Y and perform a train-test split.


In [103]:
df[oks[0]].value_counts(dropna=False)

0    70493
1    60701
2     6171
3     1340
4      333
9      198
Name: oks_t0_pain, dtype: int64

In [0]:
from sklearn.model_selection import train_test_split

def good_outcome(oks_t1, delta_oks, abs_threshold=43, mcid=13):
    """Good outcome: high final score (ceiling effect) or improvement above the MCID."""
    return oks_t1 > abs_threshold or delta_oks > mcid

# all columns measured after surgery (t1) are candidate outcomes
t1 = [col for col in df.columns if col.startswith(('oks_t1', 't1'))]
Y = df[t1].copy()
df['t1_delta_oks_score'] = df.oks_t1_score - df.oks_t0_score
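The cell above imports `train_test_split` but does not call it yet. A minimal sketch of a stratified split on hypothetical toy data is shown below; the 80/20 ratio, the `random_state` and the variable names are assumptions, not choices made in this case study.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data: 100 patients, 70% with a good outcome
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({'feature': rng.normal(size=100)})
y_toy = pd.Series([True] * 70 + [False] * 30)

# stratify keeps the share of good outcomes equal in train and test set
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```

Splitting before fitting any preprocessing keeps the test set untouched until the final evaluation.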



In [98]:
t1 = [col for col in df.columns if col.startswith(('oks_t1', 't1'))]
t1

['t1_assisted',
 't1_assisted_by',
 't1_living_arrangements',
 't1_disability',
 't1_mobility',
 't1_self_care',
 't1_activity',
 't1_discomfort',
 't1_anxiety',
 't1_satisfaction',
 't1_sucess',
 't1_allergy',
 't1_bleeding',
 't1_wound',
 't1_urine',
 't1_further_surgery',
 't1_readmitted',
 't1_eq5d_index_profile',
 't1_eq5d_index',
 't1_eq_vas',
 'oks_t1_pain',
 'oks_t1_night_pain',
 'oks_t1_washing',
 'oks_t1_transport',
 'oks_t1_walking',
 'oks_t1_standing',
 'oks_t1_limping',
 'oks_t1_kneeling',
 'oks_t1_work',
 'oks_t1_confidence',
 'oks_t1_shopping',
 'oks_t1_stairs',
 'oks_t1_score']

## Discussion

### **Question:** ...
- ...


## Selecting input features X

In [0]:
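# input features
# TODO: decide what to do with years?!
# NOTE: `dfc`, `eq5d_questions` and `oks_questions` are not defined in this
# notebook; they are assumed to come from the earlier data preparation lectures

# feature engineering
dfc['oks_t0_pain_total'] = dfc['oks_t0_pain'] + dfc['oks_t0_night_pain']
# sum across columns (axis=1) to count the comorbidities per patient
dfc['n_comorb'] = dfc.loc[:, comorb].sum(axis=1)

X_features = (['provider_code', 'female', 'age_band'] + comorb + boolean
              + eq5d_questions('t0') + oks_questions('t0')
              + ['t0_eq_vas', 'oks_t0_pain_total', 'n_comorb'])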
X_features

## Selecting outcome Y

In [0]:
# add delta_oks_score and Y
def good_outcome(oks_t1, delta_oks, abs_threshold=43, mcid=13):
    """Good outcome: high final score (ceiling effect) or improvement above the MCID."""
    return oks_t1 > abs_threshold or delta_oks > mcid

dfc['delta_oks_score'] = dfc.oks_t1_score - dfc.oks_t0_score
# note: comparisons with NaN are False, so rows with a missing score get Y = False
dfc['Y'] = dfc.apply(lambda row: good_outcome(row['oks_t1_score'], row['delta_oks_score']), axis=1)
dfc.Y.value_counts(normalize=True)
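The row-wise `apply` is easy to read but slow on large dataframes. As a sketch, the same rule can be written with vectorized comparisons; the toy dataframe is hypothetical and the thresholds mirror the defaults of `good_outcome`.

```python
import numpy as np
import pandas as pd

# hypothetical toy scores, including one patient without a t1 measurement
toy = pd.DataFrame({
    'oks_t0_score': [20, 40, 30, 25],
    'oks_t1_score': [40, 45, 35, np.nan],
})

delta = toy.oks_t1_score - toy.oks_t0_score
# good outcome: final score above 43 or improvement of more than 13 points
toy['Y'] = (toy.oks_t1_score > 43) | (delta > 13)
print(toy)
```

As with the row-wise version, a missing score yields `False`, so it may be worth deciding explicitly whether such rows should be dropped instead.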

# Getting into scikit-learn: regression

## Simple linear regression
To illustrate linear regression, we use `t1_eq_vas` as a numeric outcome. First we will assess whether there is a correlation between the `t0` and `t1` values of `eq_vas`.

In [0]:
dfc.plot(kind='scatter', x='t0_eq_vas', y='t1_eq_vas', figsize=M, alpha=0.2);

In [0]:
from sklearn.linear_model import LinearRegression

# x needs to be a 2D array: a column vector or a matrix
x = dfc.t0_eq_vas.values.reshape(-1, 1)

# y is a 1D array of target values
y = dfc.t1_eq_vas.values
print(f'x: {x[:5]}\n y: {y[:5]}')

In [0]:
# linear regression with t0_eq_vas
lr = LinearRegression().fit(x, y)
r2 = lr.score(x, y)
print(f'r2: {r2:.2f}')
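For a simple linear regression with one input feature and an intercept, the `r2` returned by `score()` equals the square of the Pearson correlation between `x` and `y`. The sketch below checks this on synthetic data; the numbers are made up and do not reflect the PROMs data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data (assumption): a noisy linear relation on a 0-100 scale
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)
y = 30 + 0.5 * x + rng.normal(scale=15, size=200)

pearson_r = np.corrcoef(x, y)[0, 1]
X = x.reshape(-1, 1)  # scikit-learn expects a 2D array of features
r2 = LinearRegression().fit(X, y).score(X, y)

print(f'pearson r squared: {pearson_r ** 2:.3f}, r2: {r2:.3f}')
```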

## Regression with decision trees
Next, let's use more input features to perform regression on `t1_eq_vas`.

In [0]:
from sklearn.tree import DecisionTreeRegressor

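# regression can only take numeric input features
# n_comorb is left out; errors='ignore' keeps the cell running if it is not a float64 column
X = dfc.loc[:, X_features].select_dtypes(include='float64').drop(columns=['n_comorb'], errors='ignore')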
y = dfc.t1_eq_vas.values
dtr = DecisionTreeRegressor().fit(X,y)
dt_r2 = dtr.score(X, y)
print(f'DecisionTreeRegressor r2: {dt_r2:.2f}')

In [0]:
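# the score of the unrestricted tree above is measured on its own training data
# and probably reflects overfitting, so limit the depth of the tree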
dtr = DecisionTreeRegressor(max_depth=5).fit(X,y)
dt_r2 = dtr.score(X, y)
print(f'DecisionTreeRegressor r2: {dt_r2:.2f}')
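Scoring a model on the data it was trained on says little about how it generalizes. The sketch below, on synthetic data that is an assumption for illustration only, compares training and test `r2` for an unrestricted tree and a tree with `max_depth=5`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic data (assumption): only the first feature carries signal
rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(500, 5))
y = X[:, 0] + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for depth in (None, 5):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f'max_depth={depth}: train r2 {scores[depth][0]:.2f}, '
          f'test r2 {scores[depth][1]:.2f}')
```

The unrestricted tree memorizes the training set, so its training score is perfect while its test score is much lower; comparing models on the test score (or with cross-validation) gives a fairer picture.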

# Conclusion and reflection

## Discussion of results

* ...
* ...

## Checklist for results from data preparation process
* Input regarding the moment of prediction
* Input for data cleaning (handling missing data; removing variables not known at time of prediction, near-zero variance variables, etc)
* Input for feature engineering (adjusting variables based on tree-analyses, based on correlations, based on domain-analysis)
* Input for defining the outcome variable Y
* Input for defining the business objective in terms of generalizability (in case of missing Y values)
* Input for choosing the business objective in case there are still multiple options on the table
* Input for defining the scope of the business objective (e.g. limiting to a subgroup to get a better balanced outcome variable)
* A potential revision of the goal of your business objective
* Input for which variables and combination of variables seem particularly relevant within the to-be-developed algorithms 