In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Following Ruibo's approach, we clean both the categorical and the numerical variables. For the categorical variables, we replace N/A and every entry in the null list with the label "missing". For the numerical variables, we replace N/A with -1.

In [3]:
df_train = pd.read_csv("../data/train_set.csv")

df_train_cleaned = df_train.copy(deep=True)

# cleaning categorical variables

cat_columns = df_train_cleaned.select_dtypes(include = ['O']).columns

null_list = ["Not done", "Not tested", "Other", "Missing disease status", "Non-resident of the U.S."]
df_train_cleaned.loc[:,cat_columns] = df_train_cleaned[cat_columns].replace(null_list, "missing")

df_train_cleaned.loc[:,cat_columns] = df_train_cleaned[cat_columns].fillna('missing')

# cleaning numerical variables

num_columns = df_train.select_dtypes(include = ['float64']).columns

df_train_cleaned.loc[:, num_columns] = df_train_cleaned[num_columns].fillna(-1.0)

Here, we give a brief overview of important functions in survival analysis and the models for survival time we consider. Given survival time $T$, the survival function $S:\mathbb{R}_{\geq 0}\rightarrow [0,1]$ is defined as 
$$
S(t) = \mathbb{P}(T > t);
$$
it measures how likely a patient is to live longer than time $t$. Associated with $S$, one can define the hazard function $h:\mathbb{R}_{\geq 0}\rightarrow \mathbb{R}$ by
$$
h(t) = -\frac{S'(t)}{S(t)};
$$
it measures the instantaneous risk that a patient dies right after time $t$, given that they have survived up to time $t$. One also considers the cumulative hazard function $H:\mathbb{R}_{\geq 0}\rightarrow \mathbb{R}$, defined as $H(t) = \int_0^t h(s)\,ds$. The two models for the survival time we consider are the Cox proportional hazards model and the accelerated failure time (AFT) model.
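
As a quick numerical sanity check of these definitions, the sketch below evaluates them for a Weibull survival function $S(t) = \exp(-(t/\lambda)^k)$. The Weibull form and the parameter values are assumptions chosen for illustration only; nothing here is fitted to the data. It compares $-S'/S$ with the closed-form Weibull hazard and checks the identity $H(t) = -\log S(t)$, which follows from $h = -(\log S)'$ and $S(0) = 1$.

```python
import numpy as np

# Illustrative Weibull parameters (assumed for this example only)
lam, k = 2.0, 1.5

def S(t):
    """Survival function S(t) = P(T > t) of a Weibull(lam, k) survival time."""
    return np.exp(-(t / lam) ** k)

def hazard_closed_form(t):
    """Closed-form Weibull hazard (k / lam) * (t / lam) ** (k - 1)."""
    return (k / lam) * (t / lam) ** (k - 1)

t = np.linspace(0.0, 5.0, 5001)

# h(t) = -S'(t) / S(t), with S' approximated by finite differences
h_numeric = -np.gradient(S(t), t) / S(t)

# H(t) = int_0^t h(s) ds, approximated with the trapezoidal rule
steps = 0.5 * (h_numeric[1:] + h_numeric[:-1]) * np.diff(t)
H_numeric = np.concatenate(([0.0], np.cumsum(steps)))

# Since S(0) = 1, integrating h = -(log S)' gives H(t) = -log S(t)
H_from_S = -np.log(S(t))

# np.gradient is less accurate at the two endpoints, so compare h in the interior
err_h = np.max(np.abs(h_numeric[1:-1] - hazard_closed_form(t[1:-1])))
err_H = np.max(np.abs(H_numeric - H_from_S))
print(err_h, err_H)
```

Both discrepancies should be small and shrink as the grid is refined.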

Let $X = (X_1, \dotsc, X_p)$ be the predictors. The Cox proportional hazards model assumes that the hazard function has the form
$$
h(t; X) = h_0(t) \exp(X\cdot\beta).
$$
A consequence of this model is that the ratio of the hazards of two patients is independent of time,
$$
\frac{h(t; X)}{h(t; X')} = \frac{\exp(X\cdot\beta)}{\exp(X'\cdot\beta)}.
$$
Therefore, it is reasonable to use the relative hazard $\exp(X\cdot\beta)$, or equivalently the linear predictor $X\cdot\beta$, as the risk score. In general, AFT models the survival time $T$ as
$$
\log{T} = \sum_{i=1}^p \beta_i X_i + \epsilon,
$$
where the $X_i$ are the covariates and $\epsilon$ is distributed as $\log{T_0}$ (the logarithm of a base survival time). Common choices for the distribution of $T_0$ include log-logistic, log-normal and Weibull. For the risk score of AFT, a direct calculation shows that 
$$
\mathbb{E}T = \exp\left( \sum_{i=1}^p \beta_i X_i\right) \mathbb{E}T_0,
$$
and this suggests that under AFT it is reasonable to use $\theta = \exp(-(\sum_{i=1}^p \beta_i X_i))$ as the risk score. 
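
The expectation identity above can be checked by simulation. The following is a minimal sketch under assumed values: the coefficients `beta`, the two hypothetical patients `X_a` and `X_b`, and the Weibull shape of the base time $T_0$ are all made up for illustration. It compares a Monte Carlo estimate of $\mathbb{E}T$ with $\exp(\sum_i \beta_i X_i)\,\mathbb{E}T_0$ and shows that the patient with the larger risk score $\theta$ has the shorter expected survival time.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Assumed values, for illustration only
beta = np.array([0.5, -0.3])      # AFT coefficients
X_a = np.array([1.0, 2.0])        # covariates of hypothetical patient A
X_b = np.array([-1.0, 0.5])       # covariates of hypothetical patient B
shape = 1.5                       # Weibull shape of the base time T_0 (scale 1)

# Mean of a Weibull(shape, scale=1) variable: Gamma(1 + 1/shape)
expected_T0 = math.gamma(1.0 + 1.0 / shape)

def sample_T(x, n=200_000):
    """Draw survival times from log T = x . beta + eps, with eps = log T_0."""
    eps = np.log(rng.weibull(shape, size=n))
    return np.exp(x @ beta + eps)

def risk_score(x):
    """Risk score theta = exp(-(x . beta)); larger means shorter expected survival."""
    return np.exp(-(x @ beta))

for name, x in [("A", X_a), ("B", X_b)]:
    mc_mean = sample_T(x).mean()                 # Monte Carlo estimate of E[T]
    predicted = np.exp(x @ beta) * expected_T0   # exp(x . beta) * E[T_0]
    print(name, mc_mean, predicted, risk_score(x))
```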

In [47]:
## this cell runs the script for the stratified concordance index

%run -i ../examples/concordance_index.ipynb
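
The script itself is not reproduced in this notebook. As a rough illustration of what a concordance index measures, the following minimal sketch implements a plain (unstratified) Harrell-style concordance index for risk scores. The function name `concordance_index` and the toy data are hypothetical, ties in survival time are simply skipped, and the `score` function loaded above may differ in how it handles ties and strata.

```python
import numpy as np

def concordance_index(event, time, risk):
    """Fraction of comparable pairs in which the earlier failure has the higher risk.

    A pair (i, j) is comparable when subject i had an observed event and
    time[i] < time[j]. Ties in risk count as half-concordant.
    """
    event = np.asarray(event, dtype=bool)
    time = np.asarray(time, dtype=float)
    risk = np.asarray(risk, dtype=float)

    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored subject cannot be the earlier failure of a pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: the third subject is censored
toy_event = [1, 1, 0, 1]
toy_time = [2.0, 4.0, 5.0, 7.0]
toy_risk = np.array([0.9, 0.6, 0.4, 0.1])

print(concordance_index(toy_event, toy_time, toy_risk))    # risk decreases with time
print(concordance_index(toy_event, toy_time, -toy_risk))   # reversed ordering
```

A value near 1 means the risk scores order the patients consistently with their observed survival, 0.5 corresponds to random ordering, and values below 0.5 indicate a reversed ordering.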

In the following, we do not attempt to select features. Instead, we use the full data set and fit a Cox proportional hazards (CPH) model with a ridge penalty. We did not use the unpenalized Cox model, since it leads to an ill-conditioned matrix during training. As a preprocessing step, we impute the missing numerical features with their means. We consider two scenarios:

1. Fit CPH directly on the imputed data.
2. Add an extra feature given by the prediction of a logistic regression, then fit CPH on the embedded data.

It turns out that the cross-validation scores of the second approach are slightly higher (about 0.6489 versus 0.6479 at the best penalty). This suggests the possibility of combining different predictors (classifiers or regressors) with CPH. To be more precise, let $\varphi_1, \dotsc, \varphi_n$ be different predictors. We consider the embedding
$$
f: (X_1,\dotsc, X_p) \mapsto (X_1,\dotsc, X_p, \varphi_1(X), \dotsc, \varphi_n(X)) = (Y_1,\dotsc, Y_{p+n}),
$$ 
and fit CPH using $Y$. Since we have a lot of data, overfitting is perhaps less of a concern. Such a strategy can incorporate the predictions of different predictors, and in principle the trained CPH model could perform better than any individual $\varphi_i$.
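
The embedding $f$ can be sketched in a few lines of NumPy. This is only an illustration: the helper `embed` and the two predictors `phi_threshold` and `phi_linear` are hypothetical stand-ins for fitted models, and in practice each $\varphi_i$ should be fitted on the training fold only, so that no information from the holdout fold leaks into the extra features.

```python
import numpy as np

def embed(X, predictors):
    """Map X = (X_1, ..., X_p) to (X_1, ..., X_p, phi_1(X), ..., phi_n(X))."""
    X = np.asarray(X, dtype=float)
    # each predictor returns one value per row; store it as an extra column
    extra = [np.asarray(phi(X), dtype=float).reshape(-1, 1) for phi in predictors]
    return np.hstack([X] + extra)

# Two hypothetical predictors standing in for fitted models
weights = np.array([0.2, -0.1, 0.4])

def phi_threshold(X):
    """A toy classifier: 1 if the first feature is positive, else 0."""
    return (X[:, 0] > 0).astype(float)

def phi_linear(X):
    """A toy regressor: a fixed linear combination of the features."""
    return X @ weights

X_toy = np.arange(12, dtype=float).reshape(4, 3) - 5.0   # 4 samples, p = 3
Y_toy = embed(X_toy, [phi_threshold, phi_linear])        # 4 samples, p + n = 5
print(Y_toy.shape)
```

The original features are kept unchanged in the first $p$ columns, so the model fitted on $Y$ can still fall back on them if an added predictor turns out to be uninformative.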

In [146]:
## use all features

from sklearn.model_selection import StratifiedKFold
from sklearn.impute import SimpleImputer
from sksurv.util import Surv
from sksurv.linear_model import CoxPHSurvivalAnalysis

kfold = StratifiedKFold(n_splits = 10,
              shuffle = True,
              random_state = 582)

# prepare the data for training
X_train = pd.get_dummies(df_train_cleaned).drop(["ID", "efs", "efs_time"], axis=1)

surv = Surv() #a helper class to construct the structured array for sksurv

y_train = surv.from_dataframe("efs", "efs_time", df_train_cleaned)

# parameters for penalty
alphas = 10.0 ** np.linspace(-4, 4, 10)

# sci[i, j] will hold the stratified concordance index of penalty alphas[i] on fold j
sci = np.zeros((len(alphas), 10))

for i in range(len(alphas)):
    for j, (train_index, test_index) in enumerate(kfold.split(X_train, y_train["efs"])):
        ## get the kfold training data
        X_train_train = X_train.iloc[train_index,:].copy() # copy, so the imputed values can be assigned below without a SettingWithCopyWarning

        y_train_train = y_train[train_index]
    
        ## get the holdout data
        X_train_holdout = X_train.iloc[test_index,:].copy() # copy, so the imputed values can be assigned below without a SettingWithCopyWarning

        y_train_holdout = y_train[test_index]

        ## fit the imputer
        num_columns = X_train_train.select_dtypes(include = ['float64']).columns

        imp = SimpleImputer(missing_values=-1.0, strategy='mean')

        imp.fit(X_train_train[num_columns])

        X_train_train.loc[:, num_columns] = imp.transform(X_train_train[num_columns])

        ## fit the model
        cph = CoxPHSurvivalAnalysis()

        cph.set_params(alpha=alphas[i])

        cph.fit(X_train_train, y_train_train) # Cox proportional hazard model with penalty

        ## impute the holdout set, then predict 

        X_train_holdout.loc[:, num_columns] = imp.transform(X_train_holdout[num_columns])

        cph_prediction = cph.predict(X_train_holdout)

        ## Record the sci
        ## To fit into the format for submission, we create a data frame where the first column contains the IDs and the second column contains the risk scores
        cph_submission = pd.DataFrame({'ID': df_train_cleaned.loc[test_index]["ID"], 'prediction': cph_prediction}) 

        sci[i, j] = score(df_train_cleaned.iloc[test_index].copy(deep=True), cph_submission.copy(deep=True), "ID")

## Compute the average score of the model
np.average(sci, axis=1)


array([0.64779443, 0.64779383, 0.64779061, 0.64779327, 0.64781716,
       0.64788686, 0.64793279, 0.64668817, 0.63852411, 0.61295798])

In [53]:
## use all features + embedding with logistic regression

from sklearn.model_selection import StratifiedKFold
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sksurv.util import Surv
from sksurv.linear_model import CoxPHSurvivalAnalysis

kfold = StratifiedKFold(n_splits = 10,
              shuffle = True,
              random_state = 582)

# prepare the data for training
X_train = pd.get_dummies(df_train_cleaned).drop(["ID", "efs", "efs_time"], axis=1)

surv = Surv() #a helper class to construct the structured array for sksurv

y_train = surv.from_dataframe("efs", "efs_time", df_train_cleaned)

# parameters for penalty
alphas = 10.0 ** np.linspace(-4, 4, 10)

# sci[i, j] will hold the stratified concordance index of penalty alphas[i] on fold j
sci = np.zeros((len(alphas), 10))

for i in range(len(alphas)):
    for j, (train_index, test_index) in enumerate(kfold.split(X_train, y_train["efs"])):
        ## get the kfold training data
        X_train_train = X_train.iloc[train_index,:].copy() # copy, so the imputed values can be assigned below without a SettingWithCopyWarning

        y_train_train = y_train[train_index]
    
        ## get the holdout data
        X_train_holdout = X_train.iloc[test_index,:].copy() # copy, so the imputed values can be assigned below without a SettingWithCopyWarning

        y_train_holdout = y_train[test_index]

        ## fit the imputer
        num_columns = X_train_train.select_dtypes(include = ['float64']).columns

        imp = SimpleImputer(missing_values=-1.0, strategy='mean')

        imp.fit(X_train_train[num_columns])

        X_train_train.loc[:, num_columns] = imp.transform(X_train_train[num_columns])

        ## create embedding with logistic regression
        clf = LogisticRegression(max_iter=15000)

        clf.fit(X_train_train, y_train_train["efs"])

        X_train_train = np.concatenate((X_train_train.to_numpy(), np.reshape(clf.predict(X_train_train), (-1, 1) )), axis=1)

        ## fit the model with the embedded data
        cph = CoxPHSurvivalAnalysis()

        cph.set_params(alpha=alphas[i])

        cph.fit(X_train_train, y_train_train) # Cox proportional hazard model with penalty

        ## impute the holdout set, embed using logistic regression and predict 

        X_train_holdout.loc[:, num_columns] = imp.transform(X_train_holdout[num_columns])

        X_train_holdout = np.concatenate((X_train_holdout.to_numpy(), np.reshape(clf.predict(X_train_holdout), (-1, 1) )), axis=1)

        cph_prediction = cph.predict(X_train_holdout)

        ## Record the sci
        ## To fit into the format for submission, we create a data frame where the first column contains the IDs and the second column contains the risk scores
        cph_submission = pd.DataFrame({'ID': df_train_cleaned.loc[test_index]["ID"], 'prediction': cph_prediction}) 

        sci[i, j] = score(df_train_cleaned.iloc[test_index].copy(deep=True), cph_submission.copy(deep=True), "ID")

## Compute the average score of the model
np.average(sci, axis=1)


STOP: TOTAL NO. OF F,G EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

array([0.64852823, 0.64852751, 0.64852353, 0.64852239, 0.64853224,
       0.64861352, 0.64885956, 0.64812194, 0.64207804, 0.61944421])