Survival SVM: grid search for alpha, fit and predict on the simulated data, predict on the original data, plot the predicted risk scores, and test regression on survival time (poor results)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os, glob, inspect, sys
from sksurv.ensemble import RandomSurvivalForest
from sksurv.datasets import load_gbsg2
import sksurv
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sksurv.nonparametric import kaplan_meier_estimator
from sksurv.linear_model import CoxPHSurvivalAnalysis

currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0,parentdir) 
import epri_mc_lib_3 as mc
from importlib import reload
reload(mc)

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
import seaborn as sns
from sklearn.model_selection import ShuffleSplit, GridSearchCV

from sksurv.datasets import load_veterans_lung_cancer
from sksurv.column import encode_categorical
from sksurv.metrics import concordance_index_censored
from sksurv.svm import FastSurvivalSVM

sns.set_style("whitegrid")

In [None]:
real_data = pd.read_csv(os.path.join(os.path.dirname(os.getcwd()), '../Data/Merged_data/Survival_df.csv'),
                  index_col=0)

In [None]:
data = pd.read_csv(os.path.join(os.path.dirname(os.getcwd()), '../Data/Merged_data/CopulaGAN_simulated_data_survival_2.csv'),
                  index_col=0)
data.reset_index(inplace=True)

## Separating input and output

#### Real Data

In [None]:
real_x = real_data.iloc[:, 2:]
 # The real data input

#### Scaling real data

In [None]:
sc_real_x = mc.scale_general(real_x, MinMaxScaler())[0]
sc_real_x.index = real_x.index

In [None]:
sc_real_x_sub= sc_real_x[mc.feature_selection2]

#### Simulated data (Copula GAN)

In [None]:
data_x = data.iloc[:, 2:]
 # The simulated input data

#### Scaling the simulated data

In [None]:
sc_data_x = mc.scale_general(data_x, MinMaxScaler())[0]
sc_data_x.index = data_x.index

In [None]:
sc_data_x_sub= sc_data_x[mc.feature_selection2]

#### Output of the real data

In [None]:
real_y = real_data.iloc[:, 0:2]
 # The real data output

#### Output of the simulated data

In [None]:
data_y = data.iloc[:, 0:2]


### Creating Test, Train data

In [None]:
# Hold-out split of the simulated data (seed fixed so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(sc_data_x, data_y, test_size=0.3, random_state=2020)

In [None]:
## Formatting the y datasets for the survival analysis
data_y_num=data_y.to_records(index=False)
real_y_num=real_y.to_records(index=False)
y_train_num = y_train.to_records(index=False)
y_test_num = y_test.to_records(index=False)
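
scikit-survival estimators expect the target as a structured array with a boolean event indicator as the first field and the time as the second field. The cells further down also index with `~data_y_num["Observed"]`, which only acts as a mask when that field is boolean. The following is a minimal sketch of building such an array by hand; the values are made up and only stand in for the real `Observed` and `F_Time` columns.

```python
import numpy as np

# Toy stand-ins for the "Observed" and "F_Time" columns (hypothetical values).
observed = [1, 0, 1, 1]
f_time = [120, 340, 95, 410]

# Structured array: boolean event indicator first, floating-point time second.
y = np.empty(len(observed), dtype=[("Observed", bool), ("F_Time", float)])
y["Observed"] = np.asarray(observed, dtype=bool)
y["F_Time"] = f_time

print(y.dtype)
# With a boolean field, `~` produces a mask that selects the censored samples.
print(y["F_Time"][~y["Observed"]])
```

If `Observed` is stored as 0/1 integers in the CSV, casting it with `astype(bool)` before calling `to_records` should give the same layout.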

### SVM Survival Model

This guide demonstrates how to use the efficient implementation of Survival Support Vector Machines, which is an extension of the standard Support Vector Machine to right-censored time-to-event data. Its main advantage is that it can account for complex, non-linear relationships between features and survival via the so-called kernel trick. A kernel function implicitly maps the input features into high-dimensional feature spaces where survival can be described by a hyperplane. This makes Survival Support Vector Machines extremely versatile and applicable to a wide range of data. A popular example of such a kernel function is the Radial Basis Function.
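
To make the kernel idea concrete, here is a small sketch of the Radial Basis Function kernel, k(x, z) = exp(-γ‖x − z‖²). The function name `rbf_kernel` and the default `gamma=1.0` are illustrative choices, not taken from this notebook or from scikit-survival.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Radial Basis Function kernel between two feature vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    # Squared Euclidean distance, scaled by gamma and mapped into (0, 1].
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Identical points have similarity 1; similarity decays with distance.
print(rbf_kernel([0.2, 0.5], [0.2, 0.5]))
print(rbf_kernel([0.0, 0.0], [1.0, 0.0]))
```

The kernel value acts as a similarity score, which is what lets a linear decision rule in the implicit feature space express a non-linear relationship in the original features.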

Survival analysis in the context of Support Vector Machines can be described in two different ways:

- As a ranking problem: the model learns to assign samples with shorter survival times a lower rank by considering all possible pairs of samples in the training data.
- As a regression problem: the model learns to directly predict the (log) survival time.

In both cases, the disadvantage is that predictions cannot be easily related to standard quantities in survival analysis, namely survival function and cumulative hazard function. Moreover, they have to retain a copy of the training data to do predictions.

Let’s start by taking a closer look at the Linear Survival Support Vector Machine, which does not allow selecting a specific kernel function, but can be fitted faster than the more generic Kernel Survival Support Vector Machine.

In [None]:
data_y_num.shape[0] # Total number of records

In [None]:
data_y_num["Observed"].sum() # Number of uncensored data

In [None]:
# Finding the % of the censored data
n_censored = data_y_num.shape[0] - data_y_num["Observed"].sum()
print("%.1f%% of records are censored" % (n_censored / data_y_num.shape[0] * 100))

In [None]:
plt.figure(figsize=(9, 6))
val, bins, patches = plt.hist((data_y_num["F_Time"][~data_y_num["Observed"]],
                               data_y_num["F_Time"][data_y_num["Observed"]]),
                              bins=30, stacked=True)
_ = plt.legend(patches, ["Time of Censoring", "Time of Failure"])

In [None]:
# Model initialization
estimator = FastSurvivalSVM(max_iter=1000, tol=1e-5, random_state=2020)

In [None]:
def score_survival_model(model, X, y):
    '''
    returns Harrell’s concordance index for the given estimator, X and y
    
    '''
    prediction = model.predict(X)
    result = concordance_index_censored(y['Observed'], y['F_Time'], prediction)
    return result[0]
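
For intuition about what the score measures, the sketch below computes Harrell's concordance index directly from its definition. It is a simplified illustration: it ignores tied survival times and is quadratic in the number of samples, so it is not a substitute for `concordance_index_censored`. The function name `harrell_c_index` and the toy data are hypothetical.

```python
import numpy as np

def harrell_c_index(event, time, risk):
    """Fraction of comparable pairs in which the earlier failure has the higher risk."""
    event = np.asarray(event, dtype=bool)
    time = np.asarray(time, dtype=float)
    risk = np.asarray(risk, dtype=float)
    concordant = 0.0
    comparable = 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a censored sample cannot be the earlier member of a pair
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

event = [True, True, False, True]
time = [1.0, 2.0, 3.0, 4.0]
print(harrell_c_index(event, time, risk=[4.0, 3.0, 2.0, 1.0]))  # perfectly ordered
print(harrell_c_index(event, time, risk=[1.0, 2.0, 3.0, 4.0]))  # perfectly reversed
```

A value of 0.5 corresponds to random ordering, which is the baseline to keep in mind when reading the grid-search scores.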

The hyper-parameter 𝛼>0 determines the amount of regularization to apply: a smaller value increases the amount of regularization and a higher value reduces the amount of regularization. The hyper-parameter 𝑟∈[0;1] determines the trade-off between the ranking objective and the regression objective. If 𝑟=1 it reduces to the ranking objective, and if 𝑟=0 to the regression objective. If the regression objective is used, it is advised to log-transform the observed time first.
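
The direction of the regularization is easier to see from the objective itself. The following is a sketch of the objective as stated in the scikit-survival documentation for `FastSurvivalSVM`, and it is worth checking against the installed version. Here $\mathbf{x}_i$ are the features, $y_i$ the observed time, $\delta_i$ the event indicator (1 if the event was observed), $r$ the `rank_ratio` and $\alpha$ the `alpha` parameter:

$$
\min_{\mathbf{w}, b} \; \frac{1}{2} \mathbf{w}^\top \mathbf{w}
+ \frac{\alpha}{2} \left[ r \sum_{(i,j) \in \mathcal{P}} \max\bigl(0,\, 1 - (\mathbf{w}^\top \mathbf{x}_i - \mathbf{w}^\top \mathbf{x}_j)\bigr)^2
+ (1 - r) \sum_{i} \zeta_{\mathbf{w}, b}(y_i, \mathbf{x}_i, \delta_i)^2 \right]
$$

$$
\zeta_{\mathbf{w}, b}(y_i, \mathbf{x}_i, \delta_i) =
\begin{cases}
\max(0,\, y_i - \mathbf{w}^\top \mathbf{x}_i - b) & \text{if } \delta_i = 0, \\
y_i - \mathbf{w}^\top \mathbf{x}_i - b & \text{if } \delta_i = 1,
\end{cases}
\qquad
\mathcal{P} = \{ (i, j) \mid y_i > y_j \wedge \delta_j = 1 \}
$$

Because $\alpha$ multiplies the loss terms rather than the penalty $\frac{1}{2} \mathbf{w}^\top \mathbf{w}$, a small $\alpha$ lets the penalty dominate, which is why a smaller value means more regularization.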

In [None]:
param_grid = {'alpha': 2. ** np.arange(-12, 13, 2)}
cv = ShuffleSplit(n_splits=100, test_size=0.5, random_state=2020)
gcv = GridSearchCV(estimator, param_grid, scoring=score_survival_model,
                   n_jobs=4, refit=False,
                   cv=cv)
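
As a quick check of what this grid covers, the snippet below expands the same `alpha` expression on its own. It uses only NumPy and does not touch the estimator.

```python
import numpy as np

# Same expression as in param_grid: powers of two with even exponents from -12 to 12.
alphas = 2.0 ** np.arange(-12, 13, 2)

print(len(alphas))                 # number of candidate values
print(alphas.min(), alphas.max())  # smallest and largest candidate
print(0.0625 in alphas)            # 2**-4 is one of the candidates
```

With 13 candidates and 100 shuffle splits, the search fits the model 1300 times, which is worth keeping in mind when choosing `n_jobs`.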

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
gcv = gcv.fit(sc_data_x, data_y_num)

In [None]:
round(gcv.best_score_, 3), gcv.best_params_

In [None]:
def plot_performance(gcv):
    '''
    Plots the performance of the Grid Search for each alpha
    
    '''
    n_splits = gcv.cv.n_splits
    cv_scores = {"alpha": [], "test_score": [], "split": []}
    order = []
    for i, params in enumerate(gcv.cv_results_["params"]):
        name = "%.5f" % params["alpha"]
        order.append(name)
        for j in range(n_splits):
            vs = gcv.cv_results_["split%d_test_score" % j][i]
            cv_scores["alpha"].append(name)
            cv_scores["test_score"].append(vs)
            cv_scores["split"].append(j)
    df = pandas.DataFrame.from_dict(cv_scores)
    _, ax = plt.subplots(figsize=(11, 6))
    sns.boxplot(x="alpha", y="test_score", data=df, order=order, ax=ax)
    _, xtext = plt.xticks()
    for t in xtext:
        t.set_rotation("vertical")

In [None]:
plot_performance(gcv)

The optimal alpha value is obtained after grid search.

In [None]:
# The best alpha value obtained from the Grid Search
estimator.set_params(**gcv.best_params_)
estimator.fit(sc_data_x, data_y_num)

It is important to remember that predictions denote risk scores only if the ranking objective is used exclusively (𝑟=1): a higher predicted value indicates shorter survival, and a lower value indicates longer survival.

In [None]:
pred = estimator.predict(sc_data_x.iloc[:2])
print(np.round(pred, 3))


The model predicted that the first sample has a higher risk than the second sample, which is in concordance with the actual survival times.
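
Since risk scores are only meaningful relative to each other, a common way to read them is to order the samples from highest to lowest risk. The sketch below does this with made-up scores and sample IDs; none of the values come from the fitted model.

```python
import numpy as np

risk = np.array([-1.2, 0.4, -0.3, 2.1])       # hypothetical predicted risk scores
sample_ids = np.array(["a", "b", "c", "d"])   # hypothetical sample identifiers

# Negating the scores makes argsort return indices from highest to lowest risk.
order = np.argsort(-risk)
ranked_ids = sample_ids[order]
print(ranked_ids)  # samples expected to fail first come first
```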

### Implementing on the real data

In [None]:
def pred_risk_scores(sc_train_x,sc_train_y,sc_test_x,sc_test_y):
    '''
    Predicts the Risk scores 
    
    '''
    import warnings
    warnings.filterwarnings("ignore", category=FutureWarning)
    
    # The grid search begins here
    global gcv
    gcv = gcv.fit(sc_train_x, sc_train_y)
    
    # The best parameters of the grid search are: 
    print("The best parameters from the grid search (C-Index and alpha) :")
    print(round(gcv.best_score_, 3), gcv.best_params_)
    
    # The performance of the grid search is plotted
    plot_performance(gcv)
    
    # The best alpha value obtained from the Grid Search is selected for the estimator
    estimator.set_params(**gcv.best_params_)
    
    # The estimator is fitted to the training data
    estimator.fit(sc_train_x, sc_train_y)
    
    # The prediction is done on the sample
    pred = estimator.predict(sc_test_x)
    
    # Preparing the results dataframe
    df_results=pd.DataFrame(sc_test_x.index)
    df_results["Pred_Risk_Values"]=np.round(pred, 3)
    df_results["F_Time"]=sc_test_y["F_Time"]
    df_results["Observed"]=sc_test_y["Observed"]
    df_results=df_results.sort_values(by='Pred_Risk_Values')
    
    
    # Plotting the predicted Risk values and comparing with the F_Time from the data to see correlation
    plt.figure (figsize=(14,6))
    plt.subplot(1, 2, 1)
    plt.barh(sc_test_x.index,np.round(pred, 3))
    plt.title('The comparison of Predicted Risk Values with F_Time from data',fontsize=20)
    plt.ylabel("Sample ID",fontsize=20)
    plt.xlabel('Predicted Risk Values',fontsize=20)
    
    plt.subplot(1, 2, 2)
    plt.barh(sc_test_x.index,sc_test_y["F_Time"])
    plt.xlabel('F_Time from data',fontsize=20)
    
    plt.show()
    
    # Printing the required parameters from the calculation
    print("IPCW-Index : ", mc.score_survival_model_ipcw(estimator, sc_test_x, sc_train_y, sc_test_y))
    print("Predicted risk scores on the sample are :")
    print(np.round(pred, 3))
    #print("The output data of the sample")
    #print(sc_test_y)
    
    return df_results

In [None]:
pred_risk_scores(sc_data_x,data_y_num,sc_real_x,real_y_num)

In the figure above, the predicted risk values and 'F_Time' should be inversely related: a sample with a large 'F_Time' should receive a lower (more negative) predicted risk value.
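
One way to put a number on this inverse relationship is a rank correlation between the two columns. The sketch below implements Spearman's rho with NumPy for inputs without tied values; the helper name `spearman_rho` and the data are hypothetical. Note that for censored samples 'F_Time' is only a lower bound on the true failure time, so this is a rough sanity check rather than a replacement for the concordance index.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation for 1-D inputs without tied values."""
    # argsort applied twice converts values into their ranks (0 = smallest).
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

f_time = np.array([100.0, 250.0, 400.0, 900.0])   # hypothetical observed times
risk = np.array([1.5, 0.2, -0.4, -2.0])           # hypothetical risk scores

rho = spearman_rho(f_time, risk)
print(round(rho, 3))  # values close to -1 indicate the expected inverse ordering
```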

### Plot using optimised model

In [None]:
tune_model = FastSurvivalSVM(alpha=0.0625, fit_intercept=False, max_iter=1000,
                optimizer='avltree', random_state=2020, rank_ratio=1.0,
                timeit=False, tol=1e-05, verbose=False)
tune_model.fit(sc_data_x,data_y_num)

In [None]:
y_pred = tune_model.predict(sc_real_x)

In [None]:
df_results=pd.DataFrame(index=sc_real_x.index)
df_results["Pred_Risk_Values"]=np.round(y_pred, 3)
df_results["F_Time"]=real_y_num["F_Time"]
df_results["Observed"]=real_y_num["Observed"]
df_results=df_results.sort_values(by='Pred_Risk_Values')

In [None]:
sns.set(style='white')
sns.scatterplot(x='F_Time', y='Pred_Risk_Values', data=df_results, hue='Observed',
                alpha=0.8, palette=sns.xkcd_palette(['marine blue', 'deep red'])
               )
plt.plot([0, 3500000], [0, -3.5], 'darkgray', lw=0.8)
plt.xlabel('Observed survival time from NDE measurement')
plt.ylabel('Predicted risk score')
plt.title('SVM')

#### Implementing on the subset of features

In [None]:
risk_scores = pred_risk_scores(sc_data_x_sub,data_y_num,sc_real_x_sub,real_y_num)

### Regression

If the regression objective is used (𝑟<1), the semantics are different, because now predictions are on the time scale and lower predicted values indicate shorter survival, higher values longer survival. Moreover, we saw from the histogram of observed times above that the distribution is skewed, therefore it is advised to log-transform the observed time before fitting a model. Here, we are going to use the transformation 𝑦′=log(1+𝑦).

Let’s fit a model using the regression objective (𝑟=0) and compare its performance to the ranking model from above.
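
The transformation and its inverse can be checked in isolation. The sketch below uses made-up times and also illustrates one pitfall that may or may not apply here: assigning log values into a structured-array field keeps the dtype of that field, so an integer `F_Time` field would silently truncate them. Whether `F_Time` is stored as integer or float in this dataset is an assumption to verify with `data_y_num.dtype`.

```python
import numpy as np

times = np.array([0.0, 9.0, 99.0, 3_500_000.0])  # hypothetical observed times

log_times = np.log1p(times)      # y' = log(1 + y), defined for y = 0
restored = np.expm1(log_times)   # inverse transform back to the time scale
print(np.allclose(restored, times))

# Pitfall: the target field's dtype decides how assigned values are stored.
y_int = np.zeros(3, dtype=[("F_Time", np.int64)])
y_int["F_Time"] = np.log1p([9.0, 99.0, 999.0])     # cast to integers on assignment
y_float = np.zeros(3, dtype=[("F_Time", np.float64)])
y_float["F_Time"] = np.log1p([9.0, 99.0, 999.0])   # log values are preserved

print(y_int["F_Time"])
print(np.round(y_float["F_Time"], 3))
```

If the field turns out to be an integer type, converting it to float before the log transform avoids the truncation.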

In [None]:
def pred_F_Time(sc_train_x,sc_train_y,sc_test_x,sc_test_y):
    '''
    Predicts the F_Time 
    
    '''
    # Log-transform the observed time before fitting the model.
    # The assignment keeps the dtype of the "F_Time" field, so this assumes a float field.
    y_log_t = sc_train_y.copy()
    y_log_t["F_Time"] = np.log1p(sc_train_y["F_Time"])
    
    #Defining the model and fitting the data
    ref_estimator = FastSurvivalSVM(rank_ratio=0.0, max_iter=1000, tol=1e-5, random_state=2020)
    ref_estimator.fit(sc_train_x, y_log_t)
    
    # Calculating the concordance index on the training data
    cindex = concordance_index_censored(
        sc_train_y['Observed'],
        sc_train_y['F_Time'],
        -ref_estimator.predict(sc_train_x),  # flip sign to obtain risk scores
    )
    
    print("C-Index : ", round(cindex[0], 3))
    print("IPCW-Index : ", mc.score_survival_model_ipcw(ref_estimator, sc_test_x, sc_train_y, sc_test_y))
    
    # Predicting on the real data
    pred_log = ref_estimator.predict(sc_test_x)
    pred_y = np.expm1(pred_log)
    
    plt.scatter(sc_test_y["F_Time"],pred_y)
    plt.plot([0, 3500000], [0, 3500000])
    plt.xlim([0, 3500000])
    plt.ylim([0, 3500000])
    plt.title("SVM Survival Regression")
    plt.xlabel("F_Time")
    plt.ylabel("Predicted F_Time")
    plt.show()
    
    return

Note that concordance_index_censored expects risk scores; therefore, we had to flip the sign of the predictions. The resulting performance of the regression model is comparable to that of the ranking model above.

In [None]:
pred_F_Time(sc_data_x,data_y_num,sc_real_x,real_y_num)

In [None]:
pred_F_Time(sc_data_x_sub,data_y_num,sc_real_x_sub,real_y_num)