<a href="https://colab.research.google.com/github/DavideScassola/xai-labs/blob/main/./SHAP/exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NHANES I Survival Model

This notebook fits a Cox proportional hazards model to data from <a href="https://wwwn.cdc.gov/nchs/nhanes/nhanes1">NHANES I</a>, with follow-up mortality data from the <a href="https://wwwn.cdc.gov/nchs/nhanes/nhefs">NHANES I Epidemiologic Followup Study</a>. It is designed to illustrate how SHAP values enable the interpretation of XGBoost models with a clarity traditionally provided only by linear models. The data show interesting non-linear patterns, which suggest the potential of this approach. Keep in mind that the data have not yet been checked for calibration against current lab tests, so the results should be read as a proof of concept rather than as actionable medical insights.

In [None]:
import matplotlib.pylab as pl
import numpy as np
import xgboost
from sklearn.model_selection import train_test_split
import pandas as pd
import shap


## Load the data

This uses a pre-processed subset of NHANES I data available in the SHAP datasets module.

In [None]:
import shap
X, y = shap.datasets.nhanesi()
X

In [None]:
pl.hist(y, bins=40)
pl.xlabel('y')

In [None]:
pl.scatter(X['age'], y, alpha=0.1)
pl.xlabel("age")
pl.ylabel("y")

## Train XGBoost model

In [None]:
# Train the final model on the full data set
xgb_full = xgboost.DMatrix(X, label=y)
params = {"eta": 0.002, "max_depth": 3, "objective": "survival:cox", "subsample": 0.5}
# The evaluation set is the training data itself, so the logged metric is a training metric
model = xgboost.train(
    params, xgb_full, 7000, evals=[(xgb_full, "train")], verbose_eval=1000
)
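
For intuition about the `survival:cox` objective, the following minimal sketch evaluates a Cox negative log partial likelihood in plain Python. It assumes the XGBoost label convention for this objective, where a negative label marks a right-censored follow-up time, and it ignores tied event times. The function name and the toy data are hypothetical; this illustrates the idea and is not XGBoost's implementation.

```python
import math


def cox_neg_log_partial_likelihood(labels, margins):
    """Negative log partial likelihood, without special handling of tied times.

    labels  : signed follow-up times; a negative value marks a censored subject
    margins : model outputs on the log-hazard scale, one per subject
    """
    times = [abs(t) for t in labels]
    observed = [t > 0 for t in labels]
    loss = 0.0
    for i, (t_i, event) in enumerate(zip(times, observed)):
        if not event:
            continue  # censored subjects only contribute through risk sets
        # Risk set: every subject still under observation at time t_i
        risk = [math.exp(m) for t_j, m in zip(times, margins) if t_j >= t_i]
        loss -= margins[i] - math.log(sum(risk))
    return loss


# Hypothetical toy data: three subjects, the second one censored at t = 5
toy_labels = [2.0, -5.0, 8.0]
flat_loss = cox_neg_log_partial_likelihood(toy_labels, [0.0, 0.0, 0.0])
# Giving the earliest death a higher predicted risk should lower the loss
better_loss = cox_neg_log_partial_likelihood(toy_labels, [1.0, 0.0, 0.0])
```

Because the loss depends only on how the margins order subjects within each risk set, a higher model output means a higher relative risk, which is the direction to keep in mind when reading the SHAP values of this model.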

#### 1. What are the most important features?
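
One common way to approach this question is to rank features by their mean absolute SHAP value. The sketch below does this in plain Python for a small hypothetical matrix; in the notebook, `shap_values` would be assumed to be the `(n_samples, n_features)` array returned by `shap.TreeExplainer(model).shap_values(X)`. The helper name, the toy numbers, and the feature names other than `age` are illustrative assumptions.

```python
def rank_features(shap_values, feature_names):
    """Sort features by mean absolute SHAP value, largest first."""
    n_samples = len(shap_values)
    mean_abs = [
        sum(abs(row[j]) for row in shap_values) / n_samples
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, mean_abs), key=lambda p: p[1], reverse=True)


# Hypothetical SHAP values: 3 samples x 3 features
toy_shap = [
    [0.5, -0.1, 0.0],
    [-0.7, 0.2, 0.1],
    [0.3, -0.3, 0.0],
]
toy_names = ["age", "feature_b", "feature_c"]
ranking = rank_features(toy_shap, toy_names)
```

`shap.summary_plot(shap_values, X)` gives a graphical view of the same kind of ranking, together with the direction of each effect.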

#### 2. What are the most important pairs of features?
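
A possible starting point is to rank feature pairs by their mean absolute SHAP interaction value. The sketch below assumes an input shaped like the output of `shap.TreeExplainer(model).shap_interaction_values(X)`, that is, one `n_features x n_features` matrix per sample, with each interaction effect split equally between entries `[i][j]` and `[j][i]`. The helper name and the toy matrices are hypothetical.

```python
from itertools import combinations


def rank_pairs(interaction_values, feature_names):
    """Sort feature pairs by mean absolute interaction effect, largest first."""
    n_samples = len(interaction_values)
    scores = []
    for i, j in combinations(range(len(feature_names)), 2):
        # Add both symmetric halves to recover the full interaction effect
        mean_abs = sum(abs(m[i][j]) + abs(m[j][i]) for m in interaction_values) / n_samples
        scores.append(((feature_names[i], feature_names[j]), mean_abs))
    return sorted(scores, key=lambda p: p[1], reverse=True)


# Hypothetical interaction values: 2 samples, 3 features
toy_interactions = [
    [[0.5, 0.1, 0.0], [0.1, 0.2, -0.3], [0.0, -0.3, 0.1]],
    [[0.4, -0.1, 0.0], [-0.1, 0.1, 0.2], [0.0, 0.2, 0.3]],
]
pair_ranking = rank_pairs(toy_interactions, ["age", "feature_b", "feature_c"])
```

The diagonal entries hold the main effects and are deliberately ignored here, since the question is about pairs.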

#### 3. Which features have a highly non-linear contribution? Are there features that are only weakly correlated with the output but generally have a high SHAP value?
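
To see how a feature can matter a lot while showing no linear correlation, consider the hypothetical U-shaped contribution below. As a simplification, the sketch correlates the feature with its own SHAP values rather than with the model output; the data and the helper name are illustrative assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical feature values and a U-shaped contribution centered on zero
feature = [-2.0, -1.0, 0.0, 1.0, 2.0]
shap_column = [v * v - 2.0 for v in feature]

linear_corr = pearson(feature, shap_column)
mean_abs_shap = sum(abs(s) for s in shap_column) / len(shap_column)
```

A plot such as `shap.dependence_plot` of the feature against its SHAP values makes this kind of shape visible directly.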

#### 4. Compare the SHAP values obtained in the following ways:
####    - Tree model explained with TreeExplainer vs. with KernelExplainer
####    - Tree model trained on normalized features vs. on unnormalized features
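
Both explainers estimate Shapley values, so a tiny exact computation can serve as a reference point for the comparison. The sketch below enumerates all feature subsets for a hypothetical three-feature model, replacing absent features with a single baseline value. That single-baseline value function is a simplifying assumption and is not how either explainer handles missing features internally.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset of indices)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight of a coalition of this size
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy model with one additive term and one interaction term
def toy_model(v):
    return 2 * v[0] + v[1] * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]


def value_fn(present):
    # Features outside the coalition are replaced by their baseline value
    mixed = [x[j] if j in present else baseline[j] for j in range(len(x))]
    return toy_model(mixed)


phi = exact_shapley(value_fn, len(x))
```

The values sum to `toy_model(x) - toy_model(baseline)`, and the interaction term is shared equally between the two features involved. The enumeration grows exponentially with the number of features, which is why practical explainers rely on model structure or on sampling.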

#### 6. Assume that SHAP values have a causal meaning. Find the person who would benefit the most from:
####    - doing more physical activity
####    - acting on weight
####    - losing weight
####    - both acting on physical activity and weight
####    - acting on all characteristics that can be changed reasonably
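
Under that causal assumption, one crude reading is that the person with the most to gain is the one whose SHAP values for the chosen features add up to the largest positive number, since a higher output of a `survival:cox` model means a higher risk. The sketch below implements that reading in plain Python. The helper name, the feature names, and the numbers are hypothetical, and the sketch assumes that the whole contribution of a feature could be removed by acting on it.

```python
def most_to_gain(shap_values, feature_names, actionable):
    """Return (row index, total) of the largest summed SHAP value over `actionable`."""
    cols = [feature_names.index(name) for name in actionable]
    totals = [sum(row[j] for j in cols) for row in shap_values]
    best = max(range(len(totals)), key=totals.__getitem__)
    return best, totals[best]


# Hypothetical SHAP values: 3 people x 3 features
toy_names = ["age", "physical_activity", "weight"]
toy_shap = [
    [0.9, 0.1, 0.2],
    [0.1, 0.4, -0.1],
    [0.2, 0.3, 0.5],
]

activity_only = most_to_gain(toy_shap, toy_names, ["physical_activity"])
activity_and_weight = most_to_gain(toy_shap, toy_names, ["physical_activity", "weight"])
```

In this toy example the answer changes between the two questions: person 1 gains the most from physical activity alone, while person 2 gains the most when weight is included as well.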