# Exploratory Data Analysis (EDA)

This notebook explores the dataset for predicting lung cancer patient survival time. The dataset consists of three files:

1. **Clinical Data (`clinical.csv`)**  
   Contains patient-specific clinical information, including:
   - `PatientID`: Unique identifier
   - `Histology`: Type of lung cancer (Adenocarcinoma, Squamous Cell Carcinoma, etc.)
   - `Mstage`, `Nstage`, `Tstage`: Cancer staging
   - `SourceDataset`: Origin of the data
   - `age`: Patient's age

2. **Survival Labels (`labels.csv`)**  
   - `PatientID`: Unique identifier
   - `SurvivalTime`: Number of days the patient survived after diagnosis
   - `Event`: Whether the patient's death was observed (`1`) or the patient was still alive when last observed (`0`, a censored record)

3. **Radiomics Features (`radiomics.csv`)**  
   Contains extracted radiomic features from CT scans, such as:
   - `Compactness_1`, `Compactness_2`
   - `Maximum_Diameter`
   - `Sphericity`
   - `Surface_Area`
   - `Voxel_Volume`

In [1]:
# Import dependencies.
import pandas as pd

In [2]:
# Import the data.
data_clinical = pd.read_csv('../data/features/clinical.csv')
data_radiomics = pd.read_csv('../data/features/radiomics.csv')
labels = pd.read_csv('../data/features/labels.csv')

In [3]:
# Show the first 5 rows of the clinical data.
data_clinical.head()

Unnamed: 0,PatientID,Histology,Mstage,Nstage,SourceDataset,Tstage,age
0,202,Adenocarcinoma,0,0,l2,2,66.0
1,371,LargeCell,0,2,l1,4,64.5722
2,246,SquamousCellCarcinoma,0,3,l1,2,66.0452
3,240,Nos,0,2,l1,3,59.3566
4,284,SquamousCellCarcinoma,0,3,l1,4,71.0554


The categorical variables in the dataset must be converted into numerical ones before modelling. The "SourceDataset" column is irrelevant in our case, so we drop it.

In [4]:
# Count missing values per column, then drop every row that contains one.
data_clinical.isnull().sum() # 30 rows with missing values (across all features).
data_clinical = data_clinical.dropna()

# Drop the "SourceDataset" column. First strip any leading or trailing whitespace from the column names.
data_clinical.columns = data_clinical.columns.str.strip()
if 'SourceDataset' in data_clinical.columns: # Avoid error if we run this cell multiple times.
    data_clinical = data_clinical.drop(columns=['SourceDataset'])

# One-hot encode the categorical variables. dtype=int gives 0/1 columns, which suit the machine learning algorithms.
if 'Histology' in data_clinical.columns: # Avoid error if we run this cell multiple times.
    #data_clinical['Histology'].value_counts()
    data_clinical = pd.get_dummies(data_clinical, columns=['Histology'], dtype=int)
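
To make the encoding step concrete, here is a minimal sketch on a toy frame. The three rows are made up for illustration (`toy` and `encoded` are hypothetical names); only the `pd.get_dummies` call mirrors the cell above.

```python
import pandas as pd

# Toy frame, invented for illustration; not taken from clinical.csv.
toy = pd.DataFrame({
    "PatientID": [1, 2, 3],
    "Histology": ["Adenocarcinoma", "LargeCell", "Adenocarcinoma"],
})

# Each category becomes its own 0/1 column named "Histology_<category>".
encoded = pd.get_dummies(toy, columns=["Histology"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Every row has exactly one `1` across the new columns, so they always sum to 1. If that redundancy is a problem for a linear model, `pd.get_dummies(..., drop_first=True)` is one option; the notebook itself keeps all categories.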

In [5]:
# Show the first 5 rows of the radiomics data.
data_radiomics.head()

Unnamed: 0,PatientID,Compactness_1,Compactness_2,Maximum_Diameter,Spherical_Disproportion,Sphericity,Surface_Area,Surface_Volume_Area,Voxel_Volume
0,202,0.027815,0.274892,48.559242,1.537964,0.65021,5431.33321,0.275228,19786.0
1,371,0.023015,0.18821,75.703368,1.744961,0.573079,10369.568729,0.240727,43168.0
2,246,0.027348,0.26574,70.434367,1.55542,0.642913,10558.818691,0.200766,52655.0
3,240,0.026811,0.255406,46.8188,1.57612,0.634469,4221.412123,0.323878,13074.0
4,284,0.023691,0.199424,53.795911,1.71162,0.584242,5295.900331,0.327241,16237.0


After talking with radiologists and the organizers from Owkin, we learned that the interesting features are those describing volume and whether the cancer tends toward a more "closed" shape or a more "open", sparse one. The features to keep are therefore "Compactness_2", "Maximum_Diameter", "Sphericity", "Surface_Volume_Area", and "Voxel_Volume".
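
Several of the shape features appear to be closely related, which is one way to see why a subset can suffice. The sketch below uses only the five rows displayed above and checks two relations that hold in them up to rounding: `Spherical_Disproportion` is the reciprocal of `Sphericity`, and `Compactness_2` is its cube. Whether these relations hold for every patient in `radiomics.csv` is an assumption that should be verified on the full table.

```python
import pandas as pd

# The five rows shown in the radiomics preview above.
rows = pd.DataFrame({
    "Sphericity": [0.65021, 0.573079, 0.642913, 0.634469, 0.584242],
    "Spherical_Disproportion": [1.537964, 1.744961, 1.55542, 1.57612, 1.71162],
    "Compactness_2": [0.274892, 0.18821, 0.26574, 0.255406, 0.199424],
})

# Largest absolute deviation from each candidate relation.
err_reciprocal = (1 / rows["Sphericity"] - rows["Spherical_Disproportion"]).abs().max()
err_cube = (rows["Sphericity"] ** 3 - rows["Compactness_2"]).abs().max()

print(f"max |1/Sphericity - Spherical_Disproportion| = {err_reciprocal:.2e}")
print(f"max |Sphericity**3 - Compactness_2|          = {err_cube:.2e}")
```

If the same pattern holds on the full data, `Compactness_2` and `Sphericity` carry the same information in different scales, which may be worth keeping in mind for models that are sensitive to collinear inputs.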

In [6]:
# Drop the "Compactness_1", "Spherical_Disproportion", "Surface_Area" columns.
drop_list = ["Compactness_1", "Spherical_Disproportion", "Surface_Area"]
if all(col in data_radiomics.columns for col in drop_list): # Avoid error if we run this cell multiple times.
    data_radiomics = data_radiomics.drop(columns=drop_list)

In [7]:
# Check for missing values.
data_radiomics.isnull().sum() # No missing values.

# Merge the clinical and radiomics data on the "PatientID" column.
data = pd.merge(data_clinical, data_radiomics, on='PatientID')
data.head()

Unnamed: 0,PatientID,Mstage,Nstage,Tstage,age,Histology_Adenocarcinoma,Histology_LargeCell,Histology_Nos,Histology_SquamousCellCarcinoma,Compactness_2,Maximum_Diameter,Sphericity,Surface_Volume_Area,Voxel_Volume
0,202,0,0,2,66.0,1,0,0,0,0.274892,48.559242,0.65021,0.275228,19786.0
1,371,0,2,4,64.5722,0,1,0,0,0.18821,75.703368,0.573079,0.240727,43168.0
2,246,0,3,2,66.0452,0,0,0,1,0.26574,70.434367,0.642913,0.200766,52655.0
3,240,0,2,3,59.3566,0,0,1,0,0.255406,46.8188,0.634469,0.323878,13074.0
4,284,0,3,4,71.0554,0,0,0,1,0.199424,53.795911,0.584242,0.327241,16237.0
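
`pd.merge` performs an inner join by default, so patients missing from either table (for example, those removed by `dropna`) drop out without any warning. A minimal sketch on made-up frames (`left`, `right`, `check`, and `inner` are hypothetical names) shows how the `validate` and `indicator` arguments of `pd.merge` can surface duplicated or unmatched IDs:

```python
import pandas as pd

# Made-up frames for illustration; the IDs and values are not from the dataset.
left = pd.DataFrame({"PatientID": [1, 2, 3], "age": [60.0, 71.5, 55.2]})
right = pd.DataFrame({"PatientID": [2, 3, 4], "Sphericity": [0.61, 0.58, 0.66]})

# validate="one_to_one" raises pandas.errors.MergeError if a PatientID repeats on either side.
# how="outer" with indicator=True adds a "_merge" column recording where each row was found.
check = pd.merge(left, right, on="PatientID", how="outer",
                 validate="one_to_one", indicator=True)
print(check["_merge"].value_counts())

# The default inner join keeps only the IDs present in both frames.
inner = pd.merge(left, right, on="PatientID", validate="one_to_one")
print(inner["PatientID"].tolist())
```

Comparing `len(inner)` with the lengths of the two inputs is a quick way to see how many patients were lost in the join.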


In [8]:
# Show the first 5 rows of the labels data.
labels.head()

Unnamed: 0,PatientID,SurvivalTime,Event
0,202,1378,0
1,371,379,1
2,246,573,1
3,240,959,0
4,284,2119,0


In [9]:
# Check for any missing values in the labels.
labels.isnull().sum() # No missing values.

# Merge the data and labels on the "PatientID" column.
meta_data = pd.merge(data, labels, on='PatientID')
meta_data.head()

Unnamed: 0,PatientID,Mstage,Nstage,Tstage,age,Histology_Adenocarcinoma,Histology_LargeCell,Histology_Nos,Histology_SquamousCellCarcinoma,Compactness_2,Maximum_Diameter,Sphericity,Surface_Volume_Area,Voxel_Volume,SurvivalTime,Event
0,202,0,0,2,66.0,1,0,0,0,0.274892,48.559242,0.65021,0.275228,19786.0,1378,0
1,371,0,2,4,64.5722,0,1,0,0,0.18821,75.703368,0.573079,0.240727,43168.0,379,1
2,246,0,3,2,66.0452,0,0,0,1,0.26574,70.434367,0.642913,0.200766,52655.0,573,1
3,240,0,2,3,59.3566,0,0,1,0,0.255406,46.8188,0.634469,0.323878,13074.0,959,0
4,284,0,3,4,71.0554,0,0,0,1,0.199424,53.795911,0.584242,0.327241,16237.0,2119,0
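
If `Event == 0` is read as a censored observation, `SurvivalTime` for those rows is only a lower bound on the true survival time, so it helps to know what share of the rows are affected before modelling. The sketch below uses just the five label rows displayed above (`sample` is a hypothetical name), so its numbers say nothing about the full dataset:

```python
import pandas as pd

# The five label rows shown above.
sample = pd.DataFrame({
    "PatientID": [202, 371, 246, 240, 284],
    "SurvivalTime": [1378, 379, 573, 959, 2119],
    "Event": [0, 1, 1, 0, 0],
})

# Share of rows with an observed death (Event == 1).
event_rate = sample["Event"].mean()

# Median recorded time, split by whether the death was observed.
median_by_event = sample.groupby("Event")["SurvivalTime"].median()

print(f"event rate: {event_rate:.2f}")
print(median_by_event)
```

Running the same two lines on `meta_data` gives the corresponding figures for all patients.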


In [11]:
# Save the data to a CSV file.
meta_data.to_csv('../data/meta_data.csv', index=False)