In [None]:
!pip install scikit-learn
!pip install scikit-survival

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

from sksurv.datasets import load_flchain
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.preprocessing import OneHotEncoder
from sksurv.util import Surv
from sksurv.metrics import (concordance_index_censored,
                            concordance_index_ipcw,
                            cumulative_dynamic_auc)

plt.rcParams['figure.figsize'] = [7.2, 4.8]

We are going to use data from a study that investigated to what extent the serum immunoglobulin free light chain (FLC) assay can be used to predict overall survival. The dataset has 7874 subjects and 9 features; the endpoint is death, which occurred for 2169 subjects (27.5%). It is a stratified random sample containing 1/2 of the subjects from a study of the relationship between serum FLC and mortality. The original study includes samples from approximately 2/3 of the residents of Olmsted County aged 50 or greater.

The data frame covers 7874 persons and contains the following variables:

age= (age in years)
sex= (F=female, M=male)
sample.yr= (the calendar year in which a blood sample was obtained)
kappa= (serum free light chain, kappa portion)
lambda= (serum free light chain, lambda portion)
flc.grp= (the FLC group for the subject, as used in the original analysis)
creatinine= (serum creatinine)
mgus= (1 if the subject had been diagnosed with monoclonal gammopathy (MGUS))
futime= (days from enrollment until death. Note that there are 3 subjects whose
         sample was obtained on their death date.)
death= (0=alive at last contact date, 1=dead)
chapter= (for those who died, a grouping of their primary cause of death by chapter
          headings of the International Classification of Diseases, ICD-9)

First, we load the data and split it into a training set and a test set to evaluate how well markers generalize.

In [2]:
x, y = load_flchain()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
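
To make the shape of the outcome concrete: `y` is a NumPy structured array with a boolean event field and a time field. The sketch below builds a small made-up array of that form and passes it through `train_test_split` together with a feature matrix. The field names `death` and `futime` mirror the variables listed above; all values, and the names `toy_x` and `toy_y`, are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy outcome in the structured-array layout used for survival data:
# a boolean event indicator plus an observed time. Values are made up.
toy_y = np.array(
    [(True, 85.0), (False, 1281.0), (True, 69.0), (False, 2100.0), (True, 1039.0)],
    dtype=[("death", bool), ("futime", float)],
)
toy_x = np.arange(10, dtype=float).reshape(5, 2)  # 5 subjects, 2 hypothetical features

# Fraction of subjects with an observed event; the rest are censored.
event_rate = toy_y["death"].mean()

# Features and outcomes are split with the same row indices, so they stay aligned.
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=0
)
print(event_rate, tx_train.shape, ty_test.dtype.names)
```

The same field access works on the real data, for example `y_train["death"].mean()` for the share of observed deaths in the training set.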


Serum creatinine measurements are missing for some patients, so we impute these values with the column mean using scikit-learn's SimpleImputer.

In [3]:
num_columns = ['age', 'creatinine', 'kappa', 'lambda']

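# Learn the column means from the training set only, then apply them to both splits.
imputer = SimpleImputer().fit(x_train.loc[:, num_columns])
# transform() returns NumPy arrays with columns in the order of num_columns.
x_train = imputer.transform(x_train.loc[:, num_columns])
x_test = imputer.transform(x_test.loc[:, num_columns])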
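
As a small illustration of what this step does, the sketch below fits a `SimpleImputer` on a made-up training frame and applies it to a made-up test row. The frames `toy_train` and `toy_test` and their values are hypothetical. The point is that the fill value for the test set comes from the training data, and that the result is a plain NumPy array rather than a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values; NaN marks a missing creatinine measurement.
toy_train = pd.DataFrame({"age": [60.0, 70.0, 80.0], "creatinine": [1.0, np.nan, 2.0]})
toy_test = pd.DataFrame({"age": [65.0], "creatinine": [np.nan]})

toy_imputer = SimpleImputer().fit(toy_train)  # default strategy is "mean"
toy_test_filled = toy_imputer.transform(toy_test)

print(toy_imputer.statistics_)  # per-column means learned from the training rows
print(toy_test_filled)          # the missing test value is filled with the training mean
```

Fitting on the training split alone keeps information from the test set out of the preprocessing step, which matters when the test set is used to judge how well a marker generalizes.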