In [167]:
import numpy as np
import pandas as pd

Following Ruibo's approach, we clean both the categorical and the numerical variables. For the categorical variables, we replace entries in the null list and N/A values with the label "missing". For the numerical variables, we replace N/A with -1. Based on Ray and Ela's heuristics and Yang's observations from the feature selection notebook, we select a few features to fit the models.

In [168]:
df_train = pd.read_csv("../data/train_set.csv")

df_train_cleaned = df_train.copy(deep=True)

# cleaning categorical variables

cat_columns = df_train_cleaned.select_dtypes(include = ['O']).columns

null_list = ["Not done", "Not tested", "Other", "Missing disease status", "Non-resident of the U.S."]
df_train_cleaned.loc[:,cat_columns] = df_train_cleaned[cat_columns].replace(null_list, "missing")

df_train_cleaned.loc[:,cat_columns] = df_train_cleaned[cat_columns].fillna('missing')

# cleaning numerical variables

num_columns = df_train.select_dtypes(include = ['float64']).columns

df_train_cleaned.loc[:, num_columns] = df_train_cleaned[num_columns].fillna(-1.0)

Here, we give a brief overview of important functions in survival analysis and the models for survival time we consider. Given survival time $T$, the survival function $S:\mathbb{R}_{\geq 0}\rightarrow [0,1]$ is defined as 
$$
S(t) = \mathbb{P}(T > t);
$$
it measures how likely a patient is to live longer than time $t$. Associated with $S$, one can define the hazard function $h:\mathbb{R}_{\geq 0}\rightarrow \mathbb{R}$ by
$$
h(t) = -\frac{S'(t)}{S(t)};
$$
it measures the instantaneous rate at which a patient dies at time $t$, given survival up to time $t$. One also considers the cumulative hazard function $H:\mathbb{R}_{\geq 0}\rightarrow \mathbb{R}$, defined as $H(t) = \int_0^t h(s)\,ds$. The two models for the survival time we consider are the Cox proportional hazards model and the accelerated failure time (AFT) model.
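
As a quick numerical illustration of these definitions, the sketch below takes a Weibull survival function (an assumed example; the shape and scale values are arbitrary), computes the hazard from $h = -S'/S$ by finite differences, integrates it to obtain $H$, and checks the identity $S(t) = \exp(-H(t))$, which follows from integrating $h = -(\log S)'$ with $S(0) = 1$.

```python
import numpy as np

# Assumed example: Weibull survival function S(t) = exp(-(t / scale) ** shape).
shape, scale = 1.5, 2.0
t = np.linspace(0.0, 5.0, 5001)
S = np.exp(-(t / scale) ** shape)

# Hazard h(t) = -S'(t) / S(t), with S' approximated by finite differences.
h = -np.gradient(S, t) / S

# Cumulative hazard H(t) = int_0^t h(s) ds via the trapezoidal rule.
H = np.concatenate(([0.0], np.cumsum(0.5 * (h[1:] + h[:-1]) * np.diff(t))))

# Integrating h = -(log S)' with S(0) = 1 gives S(t) = exp(-H(t)).
max_abs_error = np.max(np.abs(S - np.exp(-H)))
print(max_abs_error)
```

The printed discrepancy should be small and comes only from the numerical differentiation and integration.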

Let $X = (X_1, \dotsc, X_p)$ be the predictors. The Cox proportional hazards model assumes that the hazard function has the form
$$
h(t; X) = h_0(t) \exp(X\cdot\beta).
$$
A consequence of this model is that the ratio of the hazards of two patients is independent of time,
$$
\frac{h(t; X)}{h(t; X')} = \frac{\exp(X\cdot\beta)}{\exp(X'\cdot\beta)}.
$$
Therefore, it is reasonable to use the relative hazard $\exp(X\cdot\beta)$ (or, equivalently for ranking, the linear predictor $X\cdot\beta$) as the risk score. The AFT model describes the survival time $T$ as
$$
\log{T} = \sum_{i=1}^p \beta_i X_i + \epsilon,
$$
where the $X_i$ are the covariates and $\epsilon$ is distributed as $\log{T_0}$ (the logarithm of a baseline survival time). Common choices for the distribution of $T_0$ include log-logistic, log-normal, and Weibull. For the risk score of AFT, a direct calculation shows that
$$
\mathbb{E}T = \exp\left( \sum_{i=1}^p \beta_i X_i\right) \mathbb{E}T_0,
$$
and this suggests that under AFT it is reasonable to use $\theta = \exp\left(-\sum_{i=1}^p \beta_i X_i\right)$ as the risk score, since a larger $\theta$ corresponds to a shorter expected survival time.
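
Both claims can be checked numerically. The sketch below uses made-up coefficients, covariates, and baseline hazard (all values are assumptions chosen for illustration). It verifies that the Cox hazard ratio is the same at every time point, and that under a log-normal AFT model the simulated mean survival time is close to $\exp(X\cdot\beta)\,\mathbb{E}T_0$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coefficients and two hypothetical patients.
beta = np.array([0.4, -0.3])
x_a = np.array([1.0, 0.0])
x_b = np.array([0.0, 2.0])

# Cox: h(t; X) = h0(t) * exp(X . beta), with an assumed baseline hazard h0(t) = 0.1 * t.
t = np.linspace(0.5, 10.0, 20)
h0 = 0.1 * t
ratio = (h0 * np.exp(x_a @ beta)) / (h0 * np.exp(x_b @ beta))
cox_ratio_is_constant = np.allclose(ratio, np.exp((x_a - x_b) @ beta))

# AFT: log T = X . beta + eps with eps ~ Normal(0, sigma^2), so T0 = exp(eps) is log-normal.
sigma = 0.5
eps = rng.normal(0.0, sigma, size=200_000)
mean_T0 = np.exp(0.5 * sigma**2)              # exact mean of the log-normal T0
mean_T_a = np.mean(np.exp(x_a @ beta + eps))  # Monte Carlo estimate of E[T] for patient a
mean_T_b = np.mean(np.exp(x_b @ beta + eps))  # Monte Carlo estimate of E[T] for patient b

# Risk score theta = exp(-X . beta): a higher theta goes with a shorter expected survival.
theta_a = np.exp(-(x_a @ beta))
theta_b = np.exp(-(x_b @ beta))

print(cox_ratio_is_constant)
print(mean_T_a, np.exp(x_a @ beta) * mean_T0)
print(theta_a < theta_b, mean_T_a > mean_T_b)
```

In this toy setup patient a has the smaller $\theta$ and the longer mean survival time, which is the ordering a risk score should produce.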

In [169]:
## this cell runs the script for the stratified concordance index

%run -i ../examples/concordance_index.ipynb
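
For orientation, a plain (unstratified) concordance index for right-censored data can be sketched as follows. This is a minimal illustration, not the `score` function loaded by the cell above: the helper name `concordance_index_sketch` is hypothetical, the tie handling is simplified, and we assume that a higher prediction means higher risk. A pair of patients is comparable when the one with the shorter time had an observed event; it is concordant when that patient also has the higher risk score.

```python
import numpy as np

def concordance_index_sketch(event, time, risk):
    """Fraction of comparable pairs ordered correctly by the risk score.

    Hypothetical helper for illustration only: O(n^2), pairs with tied
    times are skipped, and tied risk scores count as half-concordant.
    """
    event = np.asarray(event, dtype=bool)
    time = np.asarray(time, dtype=float)
    risk = np.asarray(risk, dtype=float)

    concordant = 0.0
    comparable = 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy data (made up): event = 1 means the event was observed, 0 means censored.
event = [1, 1, 0, 1, 0]
time = [2.0, 5.0, 6.0, 8.0, 9.0]

perfect = concordance_index_sketch(event, time, risk=[5, 4, 3, 2, 1])
reversed_ = concordance_index_sketch(event, time, risk=[1, 2, 3, 4, 5])
constant = concordance_index_sketch(event, time, risk=[0, 0, 0, 0, 0])
print(perfect, reversed_, constant)  # 1.0 0.0 0.5
```

A value near 0.5 therefore indicates a ranking no better than chance. The stratified variant used in this notebook is defined in the script run above.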

In the following, we fit some models using methods from scikit-survival; an introduction to the package and a comparison with other survival analysis toolboxes can be found at https://www.jmlr.org/papers/volume21/20-729/20-729.pdf. We consider both the Cox proportional hazards model and AFT.

In [170]:
## only selected hla features are used

from sklearn.model_selection import KFold
from sksurv.util import Surv
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.linear_model import IPCRidge

kfold = KFold(n_splits = 10,
              shuffle = True,
              random_state = 582)

# prepare the data for training

features = ["hla_high_res_8", "hla_low_res_8", "hla_match_drb1_high", "hla_match_drb1_low"]

X_train = df_train_cleaned[features]

surv = Surv() #a helper class to construct the structured array for sksurv

y_train = surv.from_dataframe("efs", "efs_time", df_train_cleaned)

# sci will hold the cross-validation stratified concordance index of each model (rows: Cox PH, AFT; columns: folds).
sci = np.zeros((2, 10))

for i, (train_index, test_index) in enumerate(kfold.split(X_train, y_train)):
    ## get the kfold training data
    X_train_train = X_train.iloc[train_index,:]
    y_train_train = y_train[train_index]
    
    ## get the holdout data
    X_train_holdout = X_train.iloc[test_index,:]
    y_train_holdout = y_train[test_index]

    ## Fit both models
    cph = CoxPHSurvivalAnalysis().fit(X_train_train, y_train_train) # Cox proportional hazard model
    afl = IPCRidge().fit(X_train_train, y_train_train) # accelerated failure time model
   

    ## Use both models to generate predictions on the holdout set
    cph_prediction = cph.predict(X_train_holdout)
    afl_prediction = -afl.predict(X_train_holdout) # for AFT, predict returns the predicted survival time, so we negate it to obtain a risk score
   

    ## Record the sci
    ## To fit the submission format, we create a data frame whose first column contains the IDs and whose second column contains the risk scores
    ## (rows are selected with iloc throughout, since test_index holds positional indices)
    cph_submission = pd.DataFrame({'ID': df_train_cleaned.iloc[test_index]["ID"], 'prediction': cph_prediction})
    sci[0,i] = score(df_train_cleaned.iloc[test_index].copy(deep=True), cph_submission.copy(deep=True), "ID")
    afl_submission = pd.DataFrame({'ID': df_train_cleaned.iloc[test_index]["ID"], 'prediction': afl_prediction})
    sci[1,i] = score(df_train_cleaned.iloc[test_index].copy(deep=True), afl_submission.copy(deep=True), "ID")


In [171]:
## Compute the average score of the two models
np.average(sci, axis=1)
    

array([0.50845251, 0.50464214])

In [172]:
## selected hla features + race group

from sklearn.model_selection import KFold
from sksurv.util import Surv
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.linear_model import IPCRidge

kfold = KFold(n_splits = 10,
              shuffle = True,
              random_state = 582)

# prepare the data for training

features = ["hla_high_res_8", "hla_low_res_8", "hla_match_drb1_high", "hla_match_drb1_low", "race_group"] # added race_group 

X_train = pd.get_dummies(df_train_cleaned[features])

surv = Surv() # a helper class to construct the structured array for sksurv

y_train = surv.from_dataframe("efs", "efs_time", df_train_cleaned)

# sci will hold the cross-validation stratified concordance index of each model (rows: Cox PH, AFT; columns: folds).
sci = np.zeros((2, 10))

for i, (train_index, test_index) in enumerate(kfold.split(X_train, y_train)):
    ## get the kfold training data
    X_train_train = X_train.iloc[train_index,:]
    y_train_train = y_train[train_index]
    
    ## get the holdout data
    X_train_holdout = X_train.iloc[test_index,:]
    y_train_holdout = y_train[test_index]

    ## Fit both models
    cph = CoxPHSurvivalAnalysis().fit(X_train_train, y_train_train) # Cox proportional hazard model
    afl = IPCRidge().fit(X_train_train, y_train_train) # accelerated failure time model
   

    ## Use both models to generate predictions on the holdout set
    cph_prediction = cph.predict(X_train_holdout)
    afl_prediction = -afl.predict(X_train_holdout) # for AFT, predict returns the predicted survival time, so we negate it to obtain a risk score
   

    ## Record the sci
    ## To fit the submission format, we create a data frame whose first column contains the IDs and whose second column contains the risk scores
    ## (rows are selected with iloc throughout, since test_index holds positional indices)
    cph_submission = pd.DataFrame({'ID': df_train_cleaned.iloc[test_index]["ID"], 'prediction': cph_prediction})
    sci[0,i] = score(df_train_cleaned.iloc[test_index].copy(deep=True), cph_submission.copy(deep=True), "ID")
    afl_submission = pd.DataFrame({'ID': df_train_cleaned.iloc[test_index]["ID"], 'prediction': afl_prediction})
    sci[1,i] = score(df_train_cleaned.iloc[test_index].copy(deep=True), afl_submission.copy(deep=True), "ID")
    



In [173]:
## Compute the average score of the two models
np.average(sci, axis=1)

array([0.5084331 , 0.50463732])