## Gradient Boosting Experiments

In this notebook we experiment with gradient boosting models from the scikit-survival library. We use an out-of-the-box gradient-boosted model that minimizes the Cox proportional hazards loss, with regression trees as base learners. First, we fit the model using one-hot encoding of the categorical variables and KNN imputation. Then we repeat the experiment using target encoding of the categorical variables. Neither option appears to outperform standard Cox regression.
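
For intuition about the Cox proportional hazards loss, here is a minimal sketch of the negative log partial likelihood that Cox-type models minimize. The function name and toy data are hypothetical, tied event times get no special correction, and this is not scikit-survival's implementation.

```python
import math

def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Negative log partial likelihood, without a correction for tied times.

    risk_scores: model outputs f(x_i); higher means higher hazard.
    times: observed times; events: True if the event was observed.
    """
    loss = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored samples contribute only through risk sets
        # Risk set: every sample still under observation at time t_i
        log_risk_sum = math.log(
            sum(math.exp(f_j) for f_j, t_j in zip(risk_scores, times) if t_j >= t_i)
        )
        loss -= risk_scores[i] - log_risk_sum
    return loss

# Toy data, made up for illustration
times = [2.0, 5.0, 3.0, 8.0]
events = [True, False, True, True]

# Scores that rank early events as high risk should give a lower loss
good = cox_neg_log_partial_likelihood([2.0, 0.0, 1.0, -1.0], times, events)
bad = cox_neg_log_partial_likelihood([-1.0, 0.0, 1.0, 2.0], times, events)
print(good, bad)
```

Gradient boosting fits each new regression tree to the gradient of a loss of this kind with respect to the current risk scores.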

### Load Data, One-Hot Encode Categorical Variables

In [1]:
# import libraries and modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import lifelines as lflns
sns.set_style('whitegrid')

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV

from sksurv.ensemble import GradientBoostingSurvivalAnalysis

In [2]:
# Load training data
hct_df = pd.read_csv("../data/train_set.csv")

# Replace text values that correspond to missing data with NaN
hct_df = hct_df.replace(to_replace=["Missing Disease Status", "Missing disease status"], value=np.nan)

# drop year column, which isn't appropriate for prediction
hct_df = hct_df.drop(columns=['year_hct'])

In [3]:
# Use one-hot encoding to encode categorical columns.
# The min_frequency option will bin very rare values
# of each categorical variable into a new
# 'infrequent' category.
# The minimum frequency of 0.001 corresponds to a 
# minimum of roughly 20 training samples.

cat_cols = list(hct_df.select_dtypes(include='O').columns)

encoder = ColumnTransformer(
    [
        ('one_hot', 
         OneHotEncoder(drop='first', 
                       min_frequency=0.001, 
                       handle_unknown='ignore',
                       ), 
        cat_cols
        ),
    ],
    sparse_threshold=0,
    remainder='passthrough',
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
    )

# Drop the ID column, which should not be used
# in prediction.
# We keep it in hct_df above, because it may 
# be needed for the custom score function.
df_enc = pd.DataFrame(encoder.fit_transform(hct_df), columns=encoder.get_feature_names_out()).drop("ID", axis=1)
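
As a rough illustration of what `min_frequency` does, the sketch below bins rare categories by hand using only the standard library. The helper name `bin_infrequent`, the label `"infrequent"`, and the toy column are hypothetical. It assumes the rule that a float `min_frequency` is interpreted as a fraction of the number of samples, so categories with fewer than `min_frequency * n_samples` occurrences are grouped together.

```python
from collections import Counter

def bin_infrequent(values, min_frequency, label="infrequent"):
    """Replace categories rarer than min_frequency * n_samples with one label."""
    threshold = min_frequency * len(values)
    counts = Counter(values)
    return [v if counts[v] >= threshold else label for v in values]

# Toy column: 'C' appears once in 10 rows, below a 20% threshold (2 rows)
column = ["A"] * 6 + ["B"] * 3 + ["C"]
binned = bin_infrequent(column, min_frequency=0.2)
print(Counter(binned))
```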

### Gradient Boosting with One-Hot Encoded Features

In [4]:
from sksurv.util import Surv

# Encode targets in a way that is compatible with scikit-survival
outcomes = Surv.from_arrays(event = df_enc['efs'].astype('bool'), time = df_enc['efs_time'])

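# Create dataframe of features by dropping the two target columns by name,
# rather than relying on them being the last two columns
feat_names = df_enc.columns.drop(['efs', 'efs_time'])
features = df_enc[feat_names]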

In [5]:
# Split training data further into test and train,
# to get a first pass at seeing how well the model generalizes
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, shuffle=True)

In [6]:
pipe = Pipeline(
        [
            (
                "scale",
                ColumnTransformer(
                    [
                        ('scale', StandardScaler(), ['donor_age', 'age_at_hct', 'karnofsky_score'])
                    ],
                    sparse_threshold=0,
                    remainder='passthrough',
                    verbose_feature_names_out=False,
                    force_int_remainder_cols=False
                )
            ),
            (
                "impute",
                KNNImputer()
            ),
            ("gb",
             GradientBoostingSurvivalAnalysis()
            )
        ],
    verbose=True
    )

In [7]:
pipe.fit(X_train, y_train)

[Pipeline] ............. (step 1 of 3) Processing scale, total=   0.0s
[Pipeline] ............ (step 2 of 3) Processing impute, total=  28.6s
[Pipeline] ................ (step 3 of 3) Processing gb, total= 5.8min


In [8]:
pipe.score(X_test, y_test)

np.float64(0.6540996550184679)
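
The `score` method of scikit-survival estimators returns a concordance index. As a reading aid, here is a minimal pure-Python sketch of Harrell's concordance index. The function name and toy data are hypothetical, and tied event times are handled more simply than in the library implementation.

```python
def concordance_index(risk, times, events):
    """Fraction of comparable pairs that the risk scores order correctly.

    A pair (i, j) is comparable when sample i had an observed event
    strictly before the observed time of sample j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored sample cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # earlier event has higher risk
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied predictions count as half
    return concordant / comparable

# Toy data, made up for illustration
times = [2.0, 5.0, 3.0, 8.0]
events = [True, False, True, True]

perfect = concordance_index([2.0, 0.0, 1.0, -1.0], times, events)
constant = concordance_index([0.0, 0.0, 0.0, 0.0], times, events)
print(perfect, constant)
```

A value of 0.5 corresponds to uninformative predictions and 1.0 to a perfect ranking, which gives a scale for reading the score above.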

### Gradient Boosting with Target-Encoded Features

In [9]:
from sklearn.preprocessing import TargetEncoder

cat_cols = list(hct_df.select_dtypes(include='O').columns)

encoder = ColumnTransformer(
    [
        (
            'target', 
            TargetEncoder(), 
            cat_cols
        ),
    ],
    sparse_threshold=0,
    remainder='passthrough',
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)


In [10]:
train, test = train_test_split(hct_df)

In [11]:
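# Separate features from targets by name; the ID column should not be used in prediction
features_train_raw = train.drop(columns=["ID", "efs", "efs_time"])
features_test_raw = test.drop(columns=["ID", "efs", "efs_time"])
# Target used to fit the encoder: event time only, so censoring is ignored here
train_target = train['efs_time']
# Encode outcomes in a way that is compatible with scikit-survival
train_outcomes = Surv.from_dataframe(data=train, event='efs', time='efs_time')
test_outcomes = Surv.from_dataframe(data=test, event='efs', time='efs_time')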

In [12]:
# Fit the encoder on the training split only, then apply it to the test split
features_train_processed = pd.DataFrame(encoder.fit_transform(features_train_raw, train_target), columns=encoder.get_feature_names_out())
features_test_processed = pd.DataFrame(encoder.transform(features_test_raw), columns=encoder.get_feature_names_out())
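
To make the idea of target encoding concrete, the following is a simplified sketch: each category is replaced by a smoothed mean of the target, shrunk toward the global mean. The function names, the fixed `smooth` value, and the toy data are hypothetical. scikit-learn's `TargetEncoder` goes further than this sketch: by default it selects the amount of smoothing automatically, and `fit_transform` uses cross-fitting to limit target leakage.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smooth=10.0):
    """Map each category to a smoothed mean of the target."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    # Shrink each category mean toward the global mean; rare categories shrink more
    encoding = {
        c: (sums[c] + smooth * global_mean) / (counts[c] + smooth) for c in counts
    }
    return encoding, global_mean

def apply_target_encoding(categories, encoding, global_mean):
    """Encode categories; unseen categories fall back to the global mean."""
    return [encoding.get(c, global_mean) for c in categories]

# Toy data, made up for illustration
cats = ["a", "a", "b", "b"]
y = [1.0, 3.0, 10.0, 14.0]
enc, mean_y = fit_target_encoding(cats, y, smooth=2.0)
encoded = apply_target_encoding(["a", "b", "c"], enc, mean_y)
print(encoded)
```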

In [13]:
pipe = Pipeline(
        [
            (
                "scale",
                StandardScaler()
            ),
            (
                "impute",
                KNNImputer()
            ),
            ("gb",
             GradientBoostingSurvivalAnalysis()
            )
        ],
    verbose=True
)

In [14]:
pipe.fit(features_train_processed, train_outcomes)

[Pipeline] ............. (step 1 of 3) Processing scale, total=   0.0s
[Pipeline] ............ (step 2 of 3) Processing impute, total=  18.1s
[Pipeline] ................ (step 3 of 3) Processing gb, total= 5.7min


In [15]:
pipe.score(features_test_processed, test_outcomes)

np.float64(0.6687057133349336)

In [None]:
preds = pipe.predict(features_test_processed)

In [23]:
test_idx = features_test_raw.index

In [26]:
# Use custom scoring functions to evaluate model predictions
%run -i ../examples/concordance_index.ipynb

# test_idx holds index labels, so select rows with .loc rather than .iloc
solution = hct_df.loc[test_idx]
prediction = pd.DataFrame({"ID": solution["ID"], "prediction": preds})
score(solution.copy(deep=True), prediction.copy(deep=True), "ID")

0.6554016554678938