# Data preparation

Notebook with basic data preparation and a train/test split for future experiments.

In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split

## Data loading and preparation

Load the raw data (the files have to be placed in the "raw_data_path" directory)

In [2]:
raw_data_path = '../data/raw'

patients = pd.read_csv(os.path.join(raw_data_path, 'SampleInfo_short_multiclass_2022-10-14.tsv'), sep='\t')
patients = patients.reset_index(drop=True)

patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1916 entries, 0 to 1915
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Sample.ID        1916 non-null   object 
 1   Group            1916 non-null   object 
 2   Stage            1648 non-null   object 
 3   Sex              1898 non-null   object 
 4   Age              1898 non-null   object 
 5   Lib.size         1916 non-null   int64  
 6   Description      107 non-null    object 
 7   Comments         0 non-null      float64
 8   IsNew            1916 non-null   object 
 9   PotentialIssues  36 non-null     object 
 10  TR               1916 non-null   object 
 11  RealLocation     1916 non-null   object 
 12  MultiGroup       1916 non-null   object 
 13  MultiGroup2      1916 non-null   object 
 14  MultiGroup3      1916 non-null   object 
 15  TrainTest        1916 non-null   object 
dtypes: float64(1), int64(1), object(14)
memory usage: 239.6+ KB
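
The Stage, Sex and Age columns use the string "n.a." as a missing-value marker, which is why Age is loaded as object. As an alternative to cleaning this up later, the marker can be declared at load time. This is a minimal sketch on made-up toy data, not the notebook's own approach; note that it would merge the "n.a." and NaN rows of Stage into one category, so the counts below would look different.

```python
import io
import pandas as pd

# Toy TSV standing in for the sample-info file (all values are made up)
toy_tsv = (
    "Sample.ID\tStage\tSex\tAge\n"
    "S1\tIV\tF\t61\n"
    "S2\tn.a.\tn.a.\tn.a.\n"
    "S3\tII\tM\t55\n"
)

# na_values adds the literal string "n.a." to the markers parsed as missing
toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t", na_values=["n.a."])

print(toy.isna().sum())  # missing values per column
print(toy.dtypes)        # Age is now numeric instead of object
```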


Check the distribution of Stage and Group values

In [3]:
patients['Stage'].value_counts(dropna=False)

Stage
n.a.    794
IV      497
NaN     268
III     152
II      125
I        80
Name: count, dtype: int64

In [4]:
patients['Group'].value_counts(dropna=False)

Group
NSCLC                             567
Asymptomatic controls             405
Pulmonary Hypertension            175
Ovarian cancer                    133
Glioma                            128
Pancreatic cancer                 123
Cholangiocarcinoma                 83
Multiple sclerosis                 83
Colorectal cancer                  80
Medically-intractable epilepsy     43
Endometrial cancer                 38
Angina pectoris                    26
Hepatocellular carcinoma           22
Esophageal carcinoma               10
Name: count, dtype: int64
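
The two counts above do not show which groups lack staging information. A cross-tabulation of Group against Stage makes that visible in one table. The sketch below uses a small made-up frame with the same column names; the values are illustrative only.

```python
import pandas as pd

# Toy frame with the same column names as `patients` (values are made up)
toy = pd.DataFrame({
    "Group": ["NSCLC", "NSCLC", "Glioma", "Asymptomatic controls", "Glioma"],
    "Stage": ["IV", "II", "n.a.", None, "n.a."],
})

# Turn real NaN into a visible label, then count samples per (Group, Stage)
stage_by_group = pd.crosstab(toy["Group"], toy["Stage"].fillna("NaN"))
print(stage_by_group)
```

On the real data the equivalent call would be `pd.crosstab(patients['Group'], patients['Stage'].fillna('NaN'))`.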

Keep only the samples with a defined Stage (I, II, III or IV)

This drops the rows where Stage is "n.a." or missing, which also removes the "Asymptomatic controls" group and the other groups without staging information

In [5]:
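valid_stages = ('I', 'II', 'III', 'IV')
# .copy() makes the filtered frame independent, so later column assignments do not raise SettingWithCopyWarning
patients = patients.loc[patients['Stage'].isin(valid_stages)].copy()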

patients['Stage'].value_counts(dropna=False)

Stage
IV     497
III    152
II     125
I       80
Name: count, dtype: int64

In [6]:
patients['Group'].value_counts(dropna=False)

Group
NSCLC                       404
Ovarian cancer              126
Pancreatic cancer           122
Cholangiocarcinoma           80
Colorectal cancer            63
Endometrial cancer           36
Hepatocellular carcinoma     14
Esophageal carcinoma          9
Name: count, dtype: int64

Change the data type of the Age column to numeric (values that cannot be parsed become NaN)

Replace the "n.a." markers in the Sex column with None

Rename the column with patient IDs

Select only a subset of the available columns

In [7]:
patients['Age'] = pd.to_numeric(patients['Age'], errors='coerce')
patients.loc[patients['Sex'] == 'n.a.', 'Sex'] = None
patients = patients.rename(columns={'Sample.ID': 'ID'})

cols = [
    'ID',
    'Group',
    'Sex',
    'Age',
    'Stage'
]
patients = patients.loc[:, cols]
patients.info()

<class 'pandas.core.frame.DataFrame'>
Index: 854 entries, 328 to 1755
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      854 non-null    object 
 1   Group   854 non-null    object 
 2   Sex     854 non-null    object 
 3   Age     853 non-null    float64
 4   Stage   854 non-null    object 
dtypes: float64(1), object(4)
memory usage: 40.0+ KB
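
With `errors='coerce'`, any value that cannot be parsed silently becomes NaN, so it can be useful to list which raw values were lost. A minimal sketch on a made-up series (the names `age_raw` and `coerced` are only illustrative):

```python
import pandas as pd

# Toy stand-in for the raw Age column, which is stored as strings
age_raw = pd.Series(["61", "n.a.", "55", "47"])

age = pd.to_numeric(age_raw, errors="coerce")

# Values that were present in the raw data but turned into NaN after coercion
coerced = age_raw[age.isna() & age_raw.notna()]
print(coerced.value_counts())
```

For this to work on the notebook's data, the raw Age column would have to be kept aside before it is overwritten.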


Load the dataframe with marker values and transpose it so that each row is one sample

In [8]:
markers = pd.read_csv(os.path.join(raw_data_path, 'Counts_prefiltered_multiclass_2022-10-14.tsv'), sep='\t')
markers = markers.T.reset_index(names='ID')

markers.head()

Unnamed: 0,ID,ENSG00000000419,ENSG00000000938,ENSG00000001036,ENSG00000001461,ENSG00000001629,ENSG00000001631,ENSG00000002330,ENSG00000002549,ENSG00000002586,...,ENSG00000257267,ENSG00000257923,ENSG00000258890,ENSG00000263563,ENSG00000264538,ENSG00000266356,ENSG00000266714,ENSG00000269028,ENSG00000271043,ENSG00000272168
0,Vumc-HD-101-TR922,3.064289,3.834176,4.171537,4.737304,4.272177,3.940969,4.418057,4.668012,10.759357,...,6.788573,8.077838,3.546347,4.328872,4.328872,4.803181,3.12849,7.739557,6.582879,4.418057
1,Vumc-HD-103-TR923,5.19438,6.964049,4.644469,3.8385,3.951551,5.386353,4.537357,5.478881,10.215786,...,6.073116,6.388674,5.33769,4.444854,3.877458,5.152584,5.686621,7.05587,5.815763,3.951551
2,Vumc-HD-108-TR924,5.387337,7.608523,4.097419,3.871438,5.966998,4.877867,4.097419,5.992483,9.772417,...,5.789179,7.25784,4.932819,4.490325,3.807177,4.932819,6.549959,7.091888,6.042124,3.871438
3,Vumc-HD-127-TR925,6.5843,5.626849,5.076153,3.865364,4.355678,5.188931,4.745318,5.215744,9.867106,...,6.150602,5.586682,5.390227,4.627846,4.707302,5.342562,6.746681,7.691876,6.080439,4.920776
4,Vumc-HD-130-TR926,5.684044,5.990387,4.338011,4.072761,4.029651,4.994614,4.693579,5.862317,9.94944,...,6.760555,5.605931,5.292285,4.527899,3.839155,5.188022,3.786024,8.199582,6.456418,5.275442
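
The transposition step can be illustrated on a tiny made-up matrix: genes are rows and samples are columns in the input, and the result has one row per sample with the sample name in an ID column. The sketch uses `rename_axis` followed by `reset_index`, which should behave like `reset_index(names='ID')` but does not depend on a recent pandas version.

```python
import pandas as pd

# Toy expression matrix: rows are genes, columns are samples (made-up values)
genes_by_samples = pd.DataFrame(
    {"S1": [3.1, 4.2], "S2": [5.0, 6.3], "S3": [2.2, 7.7]},
    index=["ENSG_A", "ENSG_B"],
)

# Transpose, name the sample index "ID", then move it into a regular column
samples_by_genes = genes_by_samples.T.rename_axis("ID").reset_index()
print(samples_by_genes)
```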


Merge patients with their marker values

In [9]:
df = patients.merge(markers, on='ID', how='inner')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 854 entries, 0 to 853
Columns: 4398 entries, ID to ENSG00000272168
dtypes: float64(4394), object(4)
memory usage: 28.7+ MB
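
Here the inner merge keeps all 854 patients, so every patient has marker values. When that is not guaranteed, an outer merge with `indicator=True` shows which IDs are missing on either side, and `validate='one_to_one'` raises an error if an ID is duplicated. A minimal sketch on made-up frames:

```python
import pandas as pd

# Toy stand-ins for `patients` and `markers` (IDs and values are made up)
patients_toy = pd.DataFrame({"ID": ["S1", "S2", "S3"], "Stage": ["I", "IV", "II"]})
markers_toy = pd.DataFrame({"ID": ["S1", "S2", "S4"], "ENSG_A": [3.1, 5.0, 2.2]})

# validate raises MergeError on duplicated IDs;
# indicator adds a _merge column telling which side each row came from
audit = patients_toy.merge(
    markers_toy, on="ID", how="outer", validate="one_to_one", indicator=True
)

# Rows present on only one side would be dropped by an inner merge
unmatched = audit.loc[audit["_merge"] != "both", ["ID", "_merge"]]
print(unmatched)
```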


Save the merged dataframe to a semicolon-separated CSV file

In [10]:
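cleaned_data_path = '../data/cleaned'
os.makedirs(cleaned_data_path, exist_ok=True)  # create the output directory if it does not exist yet

df.to_csv(os.path.join(cleaned_data_path, 'dataset.csv'), index=False, sep=';')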

## Train test split

The dataset will be split into train and test sets in a stratified fashion based on two columns: Group and Stage.

Some cancer/stage combinations have only one sample, so they cannot be stratified on their own. These samples are pooled under a single placeholder label ("temp"), so each of them ends up randomly in either the train or the test set.

In [12]:
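# Build one stratification label per sample, e.g. "Ovarian_cancer_III"
groups_and_stages = df['Group'].str.replace(' ', '_') + '_' + df['Stage']
groups_and_stages_counts = groups_and_stages.value_counts()
# Pool the labels that occur only once under the placeholder 'temp'
singletons = groups_and_stages_counts[groups_and_stages_counts == 1].index
groups_and_stages[groups_and_stages.isin(singletons)] = 'temp'

# No random_state is set, so each run produces a different split
train, test = train_test_split(df, test_size=0.25, stratify=groups_and_stages)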
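
The split above is not seeded, so re-running the cell gives different train and test sets. The sketch below wraps the same idea (joint label, pooled singletons, stratified split) in a function with a fixed seed. The function name, the `seed` parameter and the "pooled" label are hypothetical, and the toy frame is made up; it is one possible variant, not the code used to produce the files in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_with_pooled_singletons(frame, label_cols, test_size=0.25, seed=0):
    """Stratified split on the joined label_cols, pooling one-off combinations."""
    strata = frame[label_cols].astype(str).agg("_".join, axis=1)
    counts = strata.value_counts()
    singletons = counts[counts == 1].index
    # Combinations seen only once share the hypothetical label "pooled"
    strata = strata.where(~strata.isin(singletons), "pooled")
    return train_test_split(
        frame, test_size=test_size, stratify=strata, random_state=seed
    )

# Toy frame: three well-populated combinations and two singletons
toy = pd.DataFrame({
    "Group": ["A"] * 8 + ["B"] * 8 + ["C", "D"],
    "Stage": ["I"] * 4 + ["II"] * 4 + ["I"] * 8 + ["I", "II"],
})

train_a, test_a = split_with_pooled_singletons(toy, ["Group", "Stage"], seed=42)
train_b, test_b = split_with_pooled_singletons(toy, ["Group", "Stage"], seed=42)
print(len(train_a), len(test_a))
```

The pooled class needs at least two members for the stratified split to succeed, so this approach breaks down if exactly one singleton combination exists.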

Now both train and test sets are representative of (almost) every cancer/stage pair available in the dataset.

In [13]:
train.apply(lambda row: f"{row['Group'].replace(' ', '_')}_{row['Stage']}", axis=1).value_counts()

NSCLC_IV                        247
Pancreatic_cancer_II             48
Colorectal_cancer_IV             40
NSCLC_III                        34
Cholangiocarcinoma_IV            33
Ovarian_cancer_III               32
Ovarian_cancer_IV                26
Ovarian_cancer_I                 23
Pancreatic_cancer_III            23
Endometrial_cancer_I             18
Pancreatic_cancer_IV             18
Cholangiocarcinoma_II            17
Ovarian_cancer_II                13
NSCLC_I                          11
NSCLC_II                         10
Hepatocellular_carcinoma_IV       8
Cholangiocarcinoma_III            7
Endometrial_cancer_III            6
Esophageal_carcinoma_III          6
Cholangiocarcinoma_I              4
Colorectal_cancer_III             4
Endometrial_cancer_II             3
Colorectal_cancer_II              2
Pancreatic_cancer_I               2
Hepatocellular_carcinoma_III      2
Esophageal_carcinoma_II           1
Hepatocellular_carcinoma_I        1
Hepatocellular_carcinoma_II       1
Name: count, dtype: int64

In [14]:
test.apply(lambda row: f"{row['Group'].replace(' ', '_')}_{row['Stage']}", axis=1).value_counts()

NSCLC_IV                       83
Pancreatic_cancer_II           16
Colorectal_cancer_IV           14
NSCLC_III                      12
Cholangiocarcinoma_IV          11
Ovarian_cancer_III             11
Ovarian_cancer_IV               9
Pancreatic_cancer_III           8
Ovarian_cancer_I                8
Endometrial_cancer_I            6
Pancreatic_cancer_IV            6
Cholangiocarcinoma_II           5
Ovarian_cancer_II               4
NSCLC_I                         4
NSCLC_II                        3
Esophageal_carcinoma_III        2
Hepatocellular_carcinoma_IV     2
Endometrial_cancer_III          2
Cholangiocarcinoma_III          2
Colorectal_cancer_II            1
Cholangiocarcinoma_I            1
Endometrial_cancer_II           1
Colorectal_cancer_III           1
Pancreatic_cancer_I             1
Colorectal_cancer_I             1
Name: count, dtype: int64

In [15]:
print(f'Train set size: {len(train)}')
print(f'Test set size: {len(test)}')

Train set size: 640
Test set size: 214
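
Beyond the set sizes, the quality of the stratification can be summarized by comparing the share of each label in the two sets. A minimal sketch with made-up labels; `max_share_gap` is a hypothetical helper name:

```python
import pandas as pd

def max_share_gap(train_labels, test_labels):
    """Largest absolute difference in label share between two label series."""
    train_share = train_labels.value_counts(normalize=True)
    test_share = test_labels.value_counts(normalize=True)
    # Align on the union of labels; a label absent from one side has share 0
    shares = pd.concat(
        [train_share, test_share], axis=1, keys=["train", "test"]
    ).fillna(0.0)
    return (shares["train"] - shares["test"]).abs().max()

# Toy labels standing in for the Group/Stage labels of the two sets
train_toy = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"])
test_toy = pd.Series(["A"] * 2 + ["B"] * 2)

gap = max_share_gap(train_toy, test_toy)
print(f"Largest share difference: {gap:.2f}")
```

A gap close to zero suggests that the two sets have a similar label distribution.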


Save train/test sets to files

In [16]:
train.to_csv(os.path.join(cleaned_data_path, 'train.csv'), index=False, sep=';')
test.to_csv(os.path.join(cleaned_data_path, 'test.csv'), index=False, sep=';')