<a href="https://colab.research.google.com/github/dkapitan/jads-nhs-proms/blob/master/notebooks/3.0-data-preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background to osteoarthritis case study

_taken from [narrative seminar Osteoarthritis by Hunter & Bierma-Zeinstra (2019) in the Lancet](https://github.com/dkapitan/jads-nhs-proms/blob/master/references/hunter2019osteaoarthritis.pdf)._

Outcomes from total joint replacement can be optimised if patient selection identifies marked joint space narrowing. Most improvement will be made in patients with complete joint space loss and evident bone attrition. Up to 25% of patients presenting for total joint replacement continue to complain of pain and disability 1 year after well performed surgery. Careful preoperative patient selection (including consideration of the poor outcomes that are more common in people who are depressed, have minimal radiographic disease, have minimal pain, and who are morbidly obese), shared decision making about surgery, and informing patients about realistic outcomes of surgery are needed to minimise the likelihood of dissatisfaction.

# Data Preparation

This is day 2 from the [5-day JADS NHS PROMs data science case study](https://github.com/dkapitan/jads-nhs-proms/blob/master/references/outline.md).



## Learning objectives: select and clean data
- Impute missing values
- Select main input variables X (feature engineering)
- Define target Y (clustered classes, categories)
- Decide how to handle correlated input features

## Learning objectives: Python

- [Pythonic data cleaning](https://realpython.com/python-data-cleaning-numpy-pandas/)
- [When to use list comprehensions](https://realpython.com/list-comprehension-python/)
- [Pandas GroupBy](https://realpython.com/pandas-groupby/)
- [Correlations in with numpy, scipy and pandas](https://realpython.com/numpy-scipy-pandas-correlation-python/)

## Recap from previous lecture
- Good outcome for knee replacement Y is measured using difference in Oxford Knee Score (OKS)
- Research has shown that an improvement in OKS score of approx. 30% is relevant ([van der Wees 2017](https://github.com/dkapitan/jads-nhs-proms/blob/master/references/vanderwees2017patient-reported.pdf)). Hence an increase of +14 points is considered a 'good' outcome.
- to account for the ceiling effect, a high final `t1_oks_score` is also considered as a good outcome (even if `delta_oks_score` is smaller than 14).

    

# Select and clean data


In [0]:
import warnings
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.feature_selection import chi2, VarianceThreshold
import sklearn.linear_model

#supressing warnings for readability
warnings.filterwarnings("ignore")

# To plot pretty figures directly within Jupyter
%matplotlib inline

# choose your own style: https://matplotlib.org/3.1.0/gallery/style_sheets/style_sheets_reference.html
plt.style.use('ggplot')

# Go to town with https://matplotlib.org/tutorials/introductory/customizing.html
# plt.rcParams.keys()
mpl.rc('axes', labelsize=14, titlesize=14)
mpl.rc('figure', titlesize=20)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# contants for figsize
S = (8,8)
M = (12,12)
L = (14,14)

# pandas options
pd.set_option("display.max.columns", None)
pd.set_option("display.max.rows", None)
pd.set_option("display.precision", 2)

# import data
df = pd.read_parquet('https://github.com/dkapitan/jads-nhs-proms/blob/master/data/interim/knee-provider.parquet?raw=true')

In [0]:
# re-run relevant code from previous lecture
# replace sentinel values with NA
dfc = df.copy()
dfc.loc[:,['t0_eq_vas', 't1_eq_vas']] = dfc.loc[:,['t0_eq_vas', 't1_eq_vas']].replace(999, np.nan).astype('Int64')
_no9 = [col for col in df.columns if col.startswith('oks_t') and not col.endswith('score')]
dfc.loc[:,_no9] = df.loc[:,_no9].replace(9, np.nan).astype('Int64')

# add delta_oks_score and Y
def good_outcome(oks_t1, delta_oks, abs_threshold=43, mcid=13):
  if oks_t1 > abs_threshold or delta_oks > mcid:
    return True
  else:
    return False


dfc['delta_oks_score'] = dfc.oks_t1_score - dfc.oks_t0_score
dfc['Y'] = dfc.apply(lambda row: good_outcome(row['oks_t1_score'], row['delta_oks_score']), axis=1)

## Clean X

TO DO: `9` used as a sentinel value is almost all questions!

In [0]:
# helper function to count all unique values for categorical data

def count_values(df):
  """ Count unique values in dataframe"""
  
  return pd.concat([df[col].value_counts().sort_values(ascending=False).head(10) for col in df.columns], axis=1).transpose()

### Encode boolean features

In [12]:
# encode casemix to boolean
casemix = ['heart_disease', 'high_bp', 'stroke', 'circulation', 'lung_disease', 'diabetes',
           'kidney_disease', 'nervous_system', 'liver_disease', 'cancer', 'depression', 'arthritis']
dfc.loc[:, casemix] = dfc.loc[:, casemix].replace({9: False, 1: True})
dfc.loc[:, casemix].describe()

Unnamed: 0,heart_disease,high_bp,stroke,circulation,lung_disease,diabetes,kidney_disease,nervous_system,liver_disease,cancer,depression,arthritis
count,139236,139236,139236,139236,139236,139236,139236,139236,139236,139236,139236,139236
unique,2,2,2,2,2,2,2,2,2,2,2,2
top,False,False,False,False,False,False,False,False,False,False,False,True
freq,126185,77666,136933,131354,126457,121857,136380,137814,138424,131693,126205,107274


In [26]:
# encode gender to boolean, males: False, females: True
dfc.loc[:,'gender'] = dfc.loc[:,'gender'].replace({1: False, 2: True})
dfc.loc[:,'gender'].value_counts()

False    129834
Name: gender, dtype: int64

In [49]:
# booleans, with 9 missing
boolean = ['t0_assisted', 't0_previous_surgery']
count_values(df[['gender'] + casemix + boolean])

Unnamed: 0,1.0,2.0,9.0
gender,55749.0,74085.0,
heart_disease,13051.0,,126185.0
high_bp,61570.0,,77666.0
stroke,2303.0,,136933.0
circulation,7882.0,,131354.0
lung_disease,12779.0,,126457.0
diabetes,17379.0,,121857.0
kidney_disease,2856.0,,136380.0
nervous_system,1422.0,,137814.0
liver_disease,812.0,,138424.0


In [0]:
# useless variables
useless = ['t0_assisted_by']


In [35]:

dfc.t0_previous_surgery.value_counts()

2    127410
1     10784
9      1042
Name: t0_previous_surgery, dtype: int64

In [27]:
dfc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 139236 entries, 0 to 47603
Data columns (total 83 columns):
provider_code                   139236 non-null object
procedure                       139236 non-null object
revision_flag                   139236 non-null uint8
year                            139236 non-null object
age_band                        129834 non-null object
gender                          129834 non-null object
t0_assisted                     139236 non-null uint8
t0_assisted_by                  139236 non-null uint8
t0_symptom_period               139236 non-null uint8
t0_previous_surgery             139236 non-null uint8
t0_living_arrangements          139236 non-null uint8
t0_disability                   139236 non-null uint8
heart_disease                   139236 non-null bool
high_bp                         139236 non-null bool
stroke                          139236 non-null bool
circulation                     139236 non-null bool
lung_disease             

In [0]:
# helper function for counting boolean attribues
def count_all(series):
    '''
    Returns absolute and normalized value counts of pd.series as a dataframe with 
    index = series.name
    columns with absolute and normalized counts of each value
    '''
    try:
        count = series.value_counts().to_frame().transpose()
        norm = series.value_counts(normalize=True).to_frame().transpose()
        return count.join(norm, lsuffix='_count', rsuffix='_normalized') 
    except:
        print('Error: expecting a pandas.Series object as input. \n' + count_all.__doc__)
        return None
        

### Python skill: using list comprehensions

There are three ways to work with lists in Python:
1. with `for` loops
2. using `map` (map-reduce pattern)
3. using list comprehensions

The third option finds its roots in functional programming and is considered the most Pythonic. It is particularly useful for data science.

In [18]:
tbl_casemix = pd.concat([count_all(dfc.loc[:, feature]) for feature in casemix], sort=True)
tbl_casemix.round(2).sort_values('True_normalized', ascending=False)

Unnamed: 0,False_count,False_normalized,True_count,True_normalized
arthritis,31962,0.23,107274,0.77
high_bp,77666,0.56,61570,0.44
diabetes,121857,0.88,17379,0.12
heart_disease,126185,0.91,13051,0.09
lung_disease,126457,0.91,12779,0.09
depression,126205,0.91,13031,0.09
circulation,131354,0.94,7882,0.06
cancer,131693,0.95,7543,0.05
stroke,136933,0.98,2303,0.02
kidney_disease,136380,0.98,2856,0.02


### Encode categorical features

In [30]:
# 9 as missing: symptom_period
categorical = ['provider_code', 'age_band', 't0_symptom_period', 't0_previous_surgery', 't0_living_arrangements']
dfc[categorical].head()

Unnamed: 0,provider_code,age_band,t0_assisted,t0_symptom_period,t0_previous_surgery,t0_living_arrangements
0,ADP02,,2,2,2,2
1,ADP02,,2,2,2,2
2,ADP02,,2,2,2,1
3,ADP02,,2,2,2,2
4,ADP02,,1,3,2,2


## Impute missing values

In [0]:
from sklearn.impute import SimpleImputer



imputer = SimpleImputer(strategy="median")

## Discussion

### **Question:** what are relevant considerations to handle NAs?
- imputation with mean/median?
- just drop all?

### **Question:** what would you choose as the primary outcome Y?

# Conclusion and reflection

## Discussion of results

* ...
* ...


## Checklist for results from data preparation process
* ...
* ...
