# Predicting lung cancer survival time by OWKIN

### Problem

- supervised survival prediction problem
- predict the survival time of a patient (remaining days to live) from one three-dimensional CT scan (grayscale image) and a set of pre-extracted quantitative imaging features, as well as clinical data

### Import

In [3]:
import numpy as np
import os
import matplotlib.pyplot as plt
import pandas as pd

### Data

- x_train : data_Q0G7b5t
- y_train : output_VSVxRFU.csv
- x_test : data_9Cbe5hx

In [4]:
data_folder_path = "../data"
training_folder_path = os.path.join(data_folder_path, "data_Q0G7b5t")
test_folder_path = os.path.join(data_folder_path, "data_9Cbe5hx")

def list_npz_files(folder_path):
    # Recursively collect the paths of all .npz archives under folder_path.
    return [
        os.path.join(root, file_name)
        for root, _, file_names in os.walk(folder_path)
        for file_name in file_names
        if file_name.endswith('.npz')
    ]

training_ct_scan_names = list_npz_files(training_folder_path)
test_ct_scan_names = list_npz_files(test_folder_path)

print("Number of training ct scans : {}".format(len(training_ct_scan_names)))
print("Number of test ct scans : {}".format(len(test_ct_scan_names)))

training_features_path = os.path.join(training_folder_path, "features")
test_features_path = os.path.join(test_folder_path, "features")

Number of training ct scans : 300
Number of test ct scans : 125


In [16]:
archive = np.load(training_ct_scan_names[0])
scan = archive['scan']
mask = archive['mask']
# scan.shape equals mask.shape

In [18]:
train_output = pd.read_csv(os.path.join(data_folder_path, "output_VSVxRFU.csv"), index_col=0)
p0 = train_output.loc[202]
print("p0.Event", p0.Event) # prints 1 or 0
print("p0.SurvivalTime", p0.SurvivalTime)
# prints time to event (time to death or time to last known alive) in days

p0.Event 0
p0.SurvivalTime 1378


### Interpretation

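`Event` indicates whether the death was observed:

- `1` = death observed, so `SurvivalTime` is the time to death
- `0` = censored (the patient left the study or was still alive at the last follow-up), so `SurvivalTime` is the time to the last date the patient was known to be alive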
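To get a feel for how much of the training set is censored, the `Event` column can simply be counted. The sketch below uses a small made-up table in place of `output_VSVxRFU.csv`; the values and the `PatientID` index name are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for output_VSVxRFU.csv; all values are invented for illustration.
toy_output = pd.DataFrame(
    {"SurvivalTime": [1378, 379, 573, 959, 2119], "Event": [0, 1, 1, 0, 1]},
    index=pd.Index([202, 371, 246, 240, 284], name="PatientID"),  # assumed index name
)

n_observed = int((toy_output["Event"] == 1).sum())  # deaths observed
n_censored = int((toy_output["Event"] == 0).sum())  # censored patients
censoring_rate = n_censored / len(toy_output)

print("observed:", n_observed, "censored:", n_censored)
print("censoring rate: {:.2f}".format(censoring_rate))
```

With the real file, replacing `toy_output` by `train_output` should give the same kind of summary.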

### Load training data

In [22]:
file_name = os.path.join(training_features_path, "clinical_data.csv")
df_training_clinical_data = pd.read_csv(file_name, delimiter=',')

file_name = os.path.join(training_features_path, "radiomics.csv")
df_training_radiomics = pd.read_csv(file_name, delimiter=',', header=[0,1])

### clinical_data.csv

In [23]:
df_training_clinical_data.sample(5)

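| Unnamed: 0 | PatientID | Histology | Mstage | Nstage | SourceDataset | Tstage | age |
|---|---|---|---|---|---|---|---|
| 47 | 338 | Adenocarcinoma | 0 | 0 | l2 | 1 | 59.0 |
| 66 | 323 | Adenocarcinoma | 0 | 0 | l2 | 1 | 79.0 |
| 103 | 399 | squamous cell carcinoma | 0 | 0 | l1 | 1 | 69.41 |
| 250 | 276 | squamous cell carcinoma | 0 | 2 | l1 | 2 | 66.9733 |
| 95 | 310 | nos | 0 | 0 | l1 | 2 | 58.809 |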


### radiomics.csv

In [24]:
df_training_radiomics.sample(5)

Unnamed: 0_level_0,Unnamed: 0_level_0,shape,shape,shape,shape,shape,shape,shape,shape,firstorder,...,textural,textural,textural,textural,textural,textural,textural,textural,textural,textural
Unnamed: 0_level_1,Unnamed: 0_level_1.1,original_shape_Compactness1,original_shape_Compactness2,original_shape_Maximum3DDiameter,original_shape_SphericalDisproportion,original_shape_Sphericity,original_shape_SurfaceArea,original_shape_SurfaceVolumeRatio,original_shape_VoxelVolume,original_firstorder_Energy,...,original_glrlm_LongRunEmphasis,original_glrlm_GrayLevelNonUniformity,original_glrlm_RunLengthNonUniformity,original_glrlm_RunPercentage,original_glrlm_LowGrayLevelRunEmphasis,original_glrlm_HighGrayLevelRunEmphasis,original_glrlm_ShortRunLowGrayLevelEmphasis,original_glrlm_ShortRunHighGrayLevelEmphasis,original_glrlm_LongRunLowGrayLevelEmphasis,original_glrlm_LongRunHighGrayLevelEmphasis
298,129,0.025668,0.234096,64.876806,1.622565,0.616308,9571.67502,0.224665,42663.0,122782700.0,...,7.254939,4513.100699,8467.466888,0.511173,0.001135,1030.197953,0.000784,637.718544,0.006983,7752.599509
210,420,0.016523,0.096999,47.843495,2.176424,0.459469,3295.559398,0.594808,5589.0,988017100.0,...,1.230483,138.938728,4626.378427,0.935079,0.00363,1013.750999,0.003541,946.865103,0.00402,1356.517263
159,4,0.022972,0.187502,133.895482,1.747154,0.572359,19410.111759,0.176282,110220.0,6019421000.0,...,2.776563,3687.540945,53820.687146,0.740312,0.001657,1431.057216,0.001459,1208.546317,0.00343,3740.26835
134,121,0.02597,0.23964,19.949937,1.609954,0.621136,687.20769,0.82871,850.0,198454100.0,...,1.138374,20.662759,749.602624,0.957828,0.011637,938.502113,0.011345,899.407513,0.012811,1116.033447
152,336,0.02522,0.226,83.456576,1.641713,0.60912,12389.173051,0.200979,61734.0,3675052000.0,...,2.721813,2931.615811,30100.743101,0.742539,0.001591,1235.740892,0.001444,977.170636,0.002846,4044.221929


### Baseline model for survival regression on NSCLC clinical data: Cox proportional hazards (Cox-PH) model

This baseline is trained on a selection of features from both the clinical data file and the radiomics file. A Cox-PH model was fitted on the following eight features:

- 1 - The tumor's sphericity, a measure of how round the tumor region is relative to a sphere, regardless of its dimensions (size).
- 2 - The tumor's surface-to-volume ratio, a measure of the compactness of the tumor, related to its size.
- 3 - The tumor's maximum 3D diameter, the largest diameter measurable from the tumor volume.
- 4 - The dataset of origin.
- 5 - The N stage of the tumor, describing the involvement of nearby (regional) lymph nodes.
- 6 - The tumor's joint entropy, specifying the randomness in the image pixel values.
- 7 - The tumor's inverse difference, a measure of the local homogeneity of the tumor.
- 8 - The tumor's inverse difference moment, another measure of the local homogeneity of the tumor.
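For reference, a Cox-PH model assumes that the hazard of patient $i$ with features $x_i$ is $h(t \mid x_i) = h_0(t)\exp(\beta^\top x_i)$ and estimates $\beta$ by maximizing the partial likelihood. The sketch below is not the baseline's code: it is a minimal NumPy illustration of the negative log partial likelihood, assuming no tied survival times, with a hypothetical function name and invented toy data.

```python
import numpy as np

def cox_neg_log_partial_likelihood(beta, X, time, event):
    """Negative log partial likelihood of a Cox-PH model (assumes no tied times)."""
    risk = X @ beta                 # linear predictor beta^T x_i for each patient
    order = np.argsort(-time)       # sort patients by decreasing survival time
    risk, event = risk[order], event[order]
    # After sorting, the risk set of patient i is patients 0..i (time >= time_i),
    # so a running log-sum-exp gives log(sum over the risk set of exp(risk)).
    log_risk_set = np.logaddexp.accumulate(risk)
    # Only patients with an observed event contribute a term.
    return -np.sum((risk - log_risk_set)[event == 1])

# Invented toy data: one feature, larger values go with shorter survival.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
time = np.array([40.0, 30.0, 20.0, 10.0])
event = np.array([1, 1, 0, 1])      # the third patient is censored

nll_zero = cox_neg_log_partial_likelihood(np.array([0.0]), X, time, event)
nll_pos = cox_neg_log_partial_likelihood(np.array([1.0]), X, time, event)
nll_neg = cox_neg_log_partial_likelihood(np.array([-1.0]), X, time, event)
print(nll_zero, nll_pos, nll_neg)
```

On this toy data, a positive coefficient fits better (lower value) than a zero or negative one, which matches the way the data was built. A survival library would normally be used for the actual fit.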

### Variable names

- 1 - original_shape_Sphericity
- 2 - original_shape_SurfaceVolumeRatio
- 3 - original_shape_Maximum3DDiameter
- 4 - SourceDataset: l1 (0) or l2 (1)
- 5 - Nstage
- 6 - original_firstorder_Entropy
- 7 - inverse difference (original_glcm_Id)
- 8 - inverse difference moment (original_glcm_Idm) (according to [here](https://static-content.springer.com/esm/art%3A10.1038%2Fncomms5006/MediaObjects/41467_2014_BFncomms5006_MOESM716_ESM.pdf), Ctrl+F for IDMN, and [here](https://github.com/cerr/CERR/wiki/GLCM_global_features))
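Assembling these variables means flattening the two-level header of the radiomics table, joining it with the clinical table, and encoding the dataset of origin as 0/1. The sketch below shows one possible way to do this on invented toy frames. The `("id", "PatientID")` column label is a hypothetical naming (the real file has unnamed leading columns), and only two radiomics features are included.

```python
import pandas as pd

# Toy frames mirroring the layout of clinical_data.csv and radiomics.csv;
# all values are invented for illustration.
clinical = pd.DataFrame(
    {"PatientID": [338, 399], "SourceDataset": ["l2", "l1"], "Nstage": [0, 2]}
)
radiomics = pd.DataFrame(
    {
        ("id", "PatientID"): [338, 399],  # hypothetical label for the ID column
        ("shape", "original_shape_Sphericity"): [0.62, 0.46],
        ("shape", "original_shape_SurfaceVolumeRatio"): [0.22, 0.59],
    }
)

# Drop the first header level (feature family) to get flat column names.
radiomics_flat = radiomics.copy()
radiomics_flat.columns = radiomics_flat.columns.get_level_values(1)

# Join clinical and radiomics features on the patient identifier.
features = clinical.merge(radiomics_flat, on="PatientID")

# Encode the dataset of origin as l1 -> 0, l2 -> 1.
features["SourceDataset"] = features["SourceDataset"].map({"l1": 0, "l2": 1})
print(features)
```

With the real files, the identifier column of `df_training_radiomics` would first need to be located and renamed before a merge like this can work.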

### Remark

The baseline mixes quantitative and qualitative variables. Using the dataset of origin (l1 or l2) as a predictor makes no sense.

## $\color{red}{\text{To be continued}}$