# Predicting lung cancer survival time by OWKIN

### Problem

- supervised survival prediction problem
- predict the survival time of a patient (remaining days to live) from one three-dimensional CT scan (grayscale image) and a set of pre-extracted quantitative imaging features, as well as clinical data

### Import

In [3]:
import numpy as np
import os
import matplotlib.pyplot as plt
import pandas as pd

### Data

- x_train : data_Q0G7b5t
- y_train : output_VSVxRFU.csv
- x_test : data_9Cbe5hx

In [4]:
data_folder_path = "../data"
training_folder_path = os.path.join(data_folder_path, "data_Q0G7b5t")
test_folder_path = os.path.join(data_folder_path, "data_9Cbe5hx")

training_ct_scan_names = [os.path.join(root,file_name) for root,_,file_names in os.walk(training_folder_path) for file_name in file_names if file_name.endswith('.npz')]
test_ct_scan_names = [os.path.join(root,file_name) for root,_,file_names in os.walk(test_folder_path) for file_name in file_names if file_name.endswith('.npz')]

print("Number of training ct scans : {}".format(len(training_ct_scan_names)))
print("Number of test ct scans : {}".format(len(test_ct_scan_names)))

training_features_path = os.path.join(training_folder_path, "features")
test_features_path = os.path.join(test_folder_path, "features")

Number of training ct scans : 300
Number of test ct scans : 125


In [16]:
archive = np.load(training_ct_scan_names[0])
scan = archive['scan']
mask = archive['mask']
# scan.shape equals mask.shape

In [18]:
train_output = pd.read_csv(os.path.join(data_folder_path, "output_VSVxRFU.csv"), index_col=0)
p0 = train_output.loc[202]
print("p0.Event", p0.Event) # prints 1 or 0
print("p0.SurvivalTime", p0.SurvivalTime)
# prints time to event (time to death or time to last known alive) in days

p0.Event 0
p0.SurvivalTime 1378


### Interpretation

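`Event` indicates whether death was observed: `1` = death observed, `0` = censored (the patient left the study or was still alive at the last follow-up). For patient 202 above, `Event` is 0, so 1378 days is the time to last known alive, not the time to death.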
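Patients with `Event = 0` only give a lower bound on their survival time, so it is worth counting them before modelling. A minimal sketch, assuming a table with the same `Event` and `SurvivalTime` columns as `train_output`; the helper name `summarize_censoring` is hypothetical and every row except patient 202 is made up for illustration:

```python
import pandas as pd


def summarize_censoring(output):
    """Count observed deaths and censored patients in an outcome table."""
    observed = output["Event"] == 1  # True where death was observed
    return {
        "n_patients": len(output),
        "n_deaths": int(observed.sum()),
        "n_censored": int((~observed).sum()),
        "censored_fraction": float((~observed).mean()),
    }


# Toy outcome table: patient 202 comes from the output above,
# the three other rows are hypothetical.
toy_output = pd.DataFrame(
    {"SurvivalTime": [1378, 300, 650, 90], "Event": [0, 1, 1, 0]},
    index=pd.Index([202, 1, 2, 3], name="PatientID"),
)

summary = summarize_censoring(toy_output)
print(summary)
```

On the real data the same call would be `summarize_censoring(train_output)`.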

### Load training data

In [36]:
file_name = os.path.join(training_features_path, "clinical_data.csv")
df_training_clinical_data = pd.read_csv(file_name, delimiter=',')
print("Nb rows in df_training_clinical_data : {}".format(len(df_training_clinical_data)))

file_name = os.path.join(training_features_path, "radiomics.csv")
df_training_radiomics = pd.read_csv(file_name, delimiter=',', header=[0,1,2])
print("Nb rows in df_training_radiomics : {}".format(len(df_training_radiomics)))

Nb rows in df_training_clinical_data : 300
Nb rows in df_training_radiomics : 300


### clinical_data.csv

In [38]:
df_training_clinical_data.sample(5)

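| | PatientID | Histology | Mstage | Nstage | SourceDataset | Tstage | age |
|---|---|---|---|---|---|---|---|
| 145 | 406 | Adenocarcinoma | 0 | 0 | l2 | 1 | 61.0 |
| 248 | 32 | Adenocarcinoma | 0 | 0 | l2 | 1 | 74.0 |
| 80 | 30 | large cell | 0 | 2 | l1 | 2 | 85.1116 |
| 126 | 23 | Adenocarcinoma | 0 | 2 | l2 | 3 | 43.0 |
| 153 | 18 | NaN | 0 | 0 | l1 | 1 | NaN |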


#### Are there NaN values in df_training_clinical_data ?

In [46]:
df_training_clinical_data.isnull().sum()

PatientID         0
Histology        20
Mstage            0
Nstage            0
SourceDataset     0
Tstage            0
age              16
dtype: int64

### Remark

There are NaN values in the Histology (20) and age (16) columns. Neither column is used in this study, so the missing values are not a problem.

### radiomics.csv

In [39]:
df_training_radiomics.sample(5)

Unnamed: 0_level_0,Unnamed: 0_level_0,shape,shape,shape,shape,shape,shape,shape,shape,firstorder,...,textural,textural,textural,textural,textural,textural,textural,textural,textural,textural
Unnamed: 0_level_1,Unnamed: 0_level_1,original_shape_Compactness1,original_shape_Compactness2,original_shape_Maximum3DDiameter,original_shape_SphericalDisproportion,original_shape_Sphericity,original_shape_SurfaceArea,original_shape_SurfaceVolumeRatio,original_shape_VoxelVolume,original_firstorder_Energy,...,original_glrlm_LongRunEmphasis,original_glrlm_GrayLevelNonUniformity,original_glrlm_RunLengthNonUniformity,original_glrlm_RunPercentage,original_glrlm_LowGrayLevelRunEmphasis,original_glrlm_HighGrayLevelRunEmphasis,original_glrlm_ShortRunLowGrayLevelEmphasis,original_glrlm_ShortRunHighGrayLevelEmphasis,original_glrlm_LongRunLowGrayLevelEmphasis,original_glrlm_LongRunHighGrayLevelEmphasis
Unnamed: 0_level_2,PatientID,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,...,Unnamed: 44_level_2,Unnamed: 45_level_2,Unnamed: 46_level_2,Unnamed: 47_level_2,Unnamed: 48_level_2,Unnamed: 49_level_2,Unnamed: 50_level_2,Unnamed: 51_level_2,Unnamed: 52_level_2,Unnamed: 53_level_2
145,406,0.02637,0.247075,28.513155,1.593642,0.627494,1523.109185,0.54821,2809.0,857706700.0,...,1.350421,77.557194,2183.839359,0.910754,0.012103,511.67725,0.01148,463.452901,0.015142,830.430427
16,185,0.018633,0.123362,58.898217,2.008814,0.497806,6376.235249,0.379187,16937.0,1608007000.0,...,1.257524,658.032193,13707.746024,0.927864,0.001894,1392.650296,0.001848,1301.752166,0.002105,1844.198046
125,152,0.033609,0.401332,26.570661,1.355706,0.737623,1294.810031,0.466521,2799.0,149279200.0,...,1.778461,117.206529,1805.694745,0.840025,0.002548,835.821188,0.00244,716.690243,0.003281,1772.586151
288,375,0.027013,0.25926,31.638584,1.568271,0.637645,2109.807794,0.454712,4671.0,985046600.0,...,1.278623,144.65109,3750.591437,0.924098,0.005743,884.97954,0.005592,814.820324,0.006413,1264.551851
164,25,0.029582,0.310925,25.884358,1.476096,0.677463,1613.790265,0.474761,3431.0,1535280000.0,...,1.168042,101.53261,2954.28746,0.949712,0.029861,381.118424,0.027067,369.59331,0.044218,431.024025


#### Are there NaN values in df_training_radiomics ?

In [44]:
df_training_radiomics.isnull().sum().sum()

0

### Remark

There are no NaN values in df_training_radiomics.
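Reading radiomics.csv with `header=[0,1,2]` gives three-level column names, where the feature name sits on the second level and `PatientID` on the third. A minimal sketch of one way to flatten them to a single level, assuming pandas fills the empty header cells with `Unnamed: ...` placeholders as in the sample above; the helper name `flatten_radiomics_columns` is hypothetical and the `original_glcm_Id` values are toy values:

```python
import pandas as pd


def flatten_radiomics_columns(df):
    """Return a copy of df with one flat name per column.

    For each (level 0, level 1, level 2) tuple, keep the deepest level
    that is not a pandas 'Unnamed: ...' placeholder.
    """
    flat_names = []
    for levels in df.columns:
        named = [str(name) for name in levels
                 if not str(name).startswith("Unnamed:")]
        # Fall back to the joined tuple if every level is a placeholder.
        flat_names.append(named[-1] if named else "_".join(map(str, levels)))
    out = df.copy()
    out.columns = flat_names
    return out


# Toy frame mimicking the three-level header of radiomics.csv.
toy_columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Unnamed: 0_level_1", "PatientID"),
    ("shape", "original_shape_Sphericity", "Unnamed: 1_level_2"),
    ("textural", "original_glcm_Id", "Unnamed: 2_level_2"),
])
toy = pd.DataFrame(
    [[406, 0.627494, 0.5], [185, 0.497806, 0.4]],
    columns=toy_columns,
)

flat = flatten_radiomics_columns(toy)
print(list(flat.columns))
# ['PatientID', 'original_shape_Sphericity', 'original_glcm_Id']
```

Flat names make it straightforward to select features by name and to join the radiomics with the clinical data on `PatientID`.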

### Baseline model for survival regression on NSCLC clinical data : Cox proportional hazards (Cox-PH) model

This baseline is trained on a selection of features from both the clinical data file and the radiomics file. A Cox-PH model was fitted on the following eight features:

- 1 - Tumor sphericity: a measure of the roundness of the tumor region relative to a sphere, regardless of its dimensions (size).
- 2 - The tumor's surface-to-volume ratio: a measure of the compactness of the tumor, related to its size.
- 3 - The tumor's maximum 3D diameter: the largest diameter measurable from the tumor volume.
- 4 - The dataset of origin.
- 5 - The N stage of the tumor, describing the involvement of nearby (regional) lymph nodes.
- 6 - The tumor's joint entropy, specifying the randomness in the image pixel values.
- 7 - The tumor's inverse difference, a measure of the local homogeneity of the tumor.
- 8 - The tumor's inverse difference moment, another measure of the local homogeneity of the tumor.
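For reference, a Cox-PH model assumes a hazard of the form h(t | x) = h0(t) exp(beta . x) and is fitted by maximising the partial likelihood over beta. The sketch below evaluates the negative log partial likelihood with NumPy only. It is an illustration rather than the baseline's actual code: it applies no correction for tied event times, the function name is hypothetical, and the survival times and events are toy values (only the three sphericity values come from the radiomics sample above).

```python
import numpy as np


def cox_negative_log_partial_likelihood(beta, X, time, event):
    """Negative log partial likelihood of a Cox-PH model (no tie correction)."""
    beta = np.asarray(beta, dtype=float)
    X = np.asarray(X, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    risk_score = X @ beta  # linear predictor, one value per patient
    total = 0.0
    for i in np.flatnonzero(event):  # only observed deaths contribute
        at_risk = time >= time[i]  # patients still followed at time[i]
        total += risk_score[i] - np.log(np.sum(np.exp(risk_score[at_risk])))
    return -total


# Toy data: one feature (sphericity), three patients.
X = [[0.627494], [0.497806], [0.737623]]
time = [100, 200, 300]  # hypothetical survival times in days
event = [1, 0, 1]       # hypothetical events; second patient is censored

nll = cox_negative_log_partial_likelihood([0.0], X, time, event)
print(round(nll, 4))  # 1.0986, i.e. log(3) when beta = 0
```

With beta = 0 every patient has the same risk, so the first death (three patients at risk) contributes log(3) and the last one (a single patient at risk) contributes 0.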

### Name of variables

- 1 - original_shape_Sphericity
- 2 - original_shape_SurfaceVolumeRatio
- 3 - original_shape_Maximum3DDiameter
- 4 - SourceDataset: l1 (0) or l2 (1)
- 5 - Nstage
- 6 - original_firstorder_Entropy
- 7 - inverse difference (original_glcm_Id)
- 8 - inverse difference moment (original_glcm_Idm), according to [this document](https://static-content.springer.com/esm/art%3A10.1038%2Fncomms5006/MediaObjects/41467_2014_BFncomms5006_MOESM716_ESM.pdf) (search for IDMN) and [the CERR wiki](https://github.com/cerr/CERR/wiki/GLCM_global_features)
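The eight variables above can be gathered into a single feature table. A minimal sketch, assuming the radiomics columns have already been reduced to single-level names and that both tables carry a `PatientID` column; the helper name `build_baseline_features` is hypothetical, and the toy values are illustrative except for patient 406's shape features and the clinical rows, which come from the samples above:

```python
import pandas as pd

BASELINE_RADIOMICS = [
    "original_shape_Sphericity",
    "original_shape_SurfaceVolumeRatio",
    "original_shape_Maximum3DDiameter",
    "original_firstorder_Entropy",
    "original_glcm_Id",
    "original_glcm_Idm",
]


def build_baseline_features(clinical, radiomics):
    """Join clinical and radiomics data and keep the eight baseline variables."""
    merged = clinical.merge(
        radiomics, on="PatientID", how="inner", validate="one_to_one"
    )
    features = merged[["PatientID", "Nstage"] + BASELINE_RADIOMICS].copy()
    # Encode the dataset of origin as l1 -> 0, l2 -> 1.
    features["SourceDataset"] = merged["SourceDataset"].map({"l1": 0, "l2": 1})
    return features.set_index("PatientID")


# Toy inputs with flat column names.
toy_clinical = pd.DataFrame({
    "PatientID": [406, 30],
    "SourceDataset": ["l2", "l1"],
    "Nstage": [0, 2],
})
toy_radiomics = pd.DataFrame({
    "PatientID": [406, 30],
    "original_shape_Sphericity": [0.627494, 0.60],
    "original_shape_SurfaceVolumeRatio": [0.54821, 0.40],
    "original_shape_Maximum3DDiameter": [28.513155, 40.0],
    "original_firstorder_Entropy": [4.0, 4.5],  # toy values
    "original_glcm_Id": [0.5, 0.4],             # toy values
    "original_glcm_Idm": [0.4, 0.3],            # toy values
})

features = build_baseline_features(toy_clinical, toy_radiomics)
print(features.shape)  # (2, 8)
```

The `validate="one_to_one"` argument makes the merge fail loudly if a patient appears more than once in either table.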

### Remark

The baseline mixes quantitative and qualitative variables. Using the dataset of origin (l1 or l2) as a predictor makes no sense.

## $\color{red}{\text{To be continued}}$