# Train/Test Split
Split the 80K-patch data set.  
Generate an ID list with patient ID, WSI ID, and CellProfiler IDs.

In [1]:
import datetime
import numpy as np
import pandas as pd
from CellProfiler_Util import CP_Util

In [2]:
print(datetime.datetime.now())
BASE_DIR = '/home/jrm/Adjeroh/Naved/CP_80K/'
# One CellProfiler output directory per cancer class
CLASS_DIR = ['Output0/', 'Output1/', 'Output2/', 'Output3/', 'Output4/', 'Output5/']
CLASSES = range(0, 6)  # use all 6 classes

2022-06-13 05:45:02.545653


## Explanation of the data
This notebook generates files that relate TCGA IDs to CellProfiler image numbers.
It also makes the 80/20 train/test split of those IDs.

This notebook refers to a set of 80K patch files.
The patch files are a random selection of WSI (whole-slide images) from a TCGA dataset.
Whereas the WSI are huge, each patch is 300x300 pixels.
The patches contain consecutive, non-overlapping sub-images.
This dataset was analyzed with a CNN, and separately with CellProfiler.

In May 2022, we developed and ran a CellProfiler pipeline called Process100.  
It was run separately on patches from each of the six cancer classes.
CellProfiler assigned a patch ID starting with 1 for each class.
Most CellProfiler outputs use just that ID to refer to the patch.
Only one CellProfiler output relates patch ID to TCGA ID.
We use that file here.

The CellProfiler data gives path and filename. 
We parse away the path to get FileName (including .png).
We also parse out the WSI ID, and the X,Y coordinates of the top-left corner of the patch relative to the WSI. 
From WSI we extract the patient ID.
The data is said to contain two WSI per patient. 
We implement the 80/20 split by patient ID.
According to TCGA documentation, the aliquot barcode
TCGA-02-0001-01C-01D-0182-01 means
tissue source site 02, participant 0001, sample 01 vial C, portion 01 analyte D, plate 0182, center 01.
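
The parsing described above can be sketched as follows. This is a minimal illustration, not the CP_Util code: the function name `parse_patch_path` is hypothetical, the directory in the example is made up, and the sketch assumes filenames of the form `<WSI>_<X>_<Y>.png`, as seen in the tables in this notebook.

```python
from pathlib import PurePosixPath

def parse_patch_path(path):
    """Split a patch path into the fields used in this notebook.

    Hypothetical helper: assumes the file is named <WSI>_<X>_<Y>.png and
    that the participant is the third hyphen-separated field of the WSI ID.
    """
    filename = PurePosixPath(path).name          # drop the directory part
    stem = PurePosixPath(filename).stem          # drop the .png extension
    wsi, x, y = stem.rsplit('_', 2)              # the last two fields are X and Y
    participant = wsi.split('-')[2]              # e.g. TCGA-DB-A4XG-... -> A4XG
    return {
        'Participant': participant,
        'FileName': filename,
        'WSI': wsi,
        'PatchX': int(x),
        'PatchY': int(y),
    }

# The directory below is invented; the filename is taken from the sample table.
fields = parse_patch_path('/some/dir/TCGA-DB-A4XG-01Z-00-DX1_10200_43500.png')
print(fields)
```

Splitting from the right with `rsplit('_', 2)` keeps the sketch working even if a WSI ID were to contain an underscore.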

Confusingly, CellProfiler calls its patch number "ImageName",
and it calls the patch file its "Image" file.
Here, we rename the column from ImageName to PatchNumber.

We use our own CP_Util to generate a pandas dataframe.
The utility uses a hard-coded seed for the random number generator.
Thus, it uses the same pseudo-random numbers on every run.
The utility reads from the CellProfiler Image csv file using its hard-coded filename.
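
The internals of CP_Util are not shown in this notebook. As a rough sketch of how a seeded, patient-level split could work, the function below shuffles the unique participant IDs with a fixed seed and assigns about 20% of them to the test set. The function name `split_by_participant`, the seed value 42, the rounding rule, and the toy data are all assumptions for illustration, not the CP_Util implementation.

```python
import numpy as np
import pandas as pd

def split_by_participant(patches, test_fraction=0.2, seed=42):
    """Return (train, test) so that no participant appears in both.

    Hypothetical sketch: the seed and the rounding rule are assumptions.
    """
    rng = np.random.default_rng(seed)            # fixed seed gives the same split every run
    participants = np.sort(patches['Participant'].unique())
    shuffled = rng.permutation(participants)
    # Keep at least one participant for testing, even in tiny classes.
    n_test = max(1, int(round(test_fraction * len(shuffled))))
    is_test = patches['Participant'].isin(shuffled[:n_test])
    return patches[~is_test], patches[is_test]

# Toy data: 10 made-up participants with 3 patches each.
demo = pd.DataFrame({
    'Participant': [f'P{i:02d}' for i in range(10) for _ in range(3)],
    'PatchX': range(30),
})
train, test = split_by_participant(demo)

# Same idea as a validation step: the two sets must not share participants.
assert set(train['Participant']).isdisjoint(set(test['Participant']))
print(train['Participant'].nunique(), 'train participants,',
      test['Participant'].nunique(), 'test participants')
```

Splitting on participants rather than on patches keeps all patches from one patient on the same side of the split, which avoids leaking patient-specific appearance into the test set.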

Here we write files that:
1. Show the ImageName assigned by CellProfiler, which we rename PatchNumber. 
2. Relate it to the TumorName assigned by TCGA, which we extract from the patch filename.
3. Assign every patch file to either the train set or the test set.

Here is a sample of the outputs:

In [3]:
cls=5
fullpath = BASE_DIR+CLASS_DIR[cls]
cp = CP_Util(fullpath)
cp.train_test_split()
cp.validate_split()
train_patches = cp.get_train_patches()
train_patches    

Train: 2 participants, 3 WSI, 1195 patches.
Test: 1 participants, 1 WSI, 396 patches.


PatchNumber,Participant,FileName,WSI,PatchX,PatchY
1,A4XG,TCGA-DB-A4XG-01Z-00-DX1_10200_43500.png,TCGA-DB-A4XG-01Z-00-DX1,10200,43500
2,A4XG,TCGA-DB-A4XG-01Z-00-DX1_10200_45600.png,TCGA-DB-A4XG-01Z-00-DX1,10200,45600
3,A4XG,TCGA-DB-A4XG-01Z-00-DX1_10200_49200.png,TCGA-DB-A4XG-01Z-00-DX1,10200,49200
4,A4XG,TCGA-DB-A4XG-01Z-00-DX1_10500_17100.png,TCGA-DB-A4XG-01Z-00-DX1,10500,17100
5,A4XG,TCGA-DB-A4XG-01Z-00-DX1_10500_18600.png,TCGA-DB-A4XG-01Z-00-DX1,10500,18600
...,...,...,...,...,...
1587,A6CU,TCGA-QH-A6CU-01Z-00-DX1_9600_26100.png,TCGA-QH-A6CU-01Z-00-DX1,9600,26100
1588,A6CU,TCGA-QH-A6CU-01Z-00-DX1_9600_50100.png,TCGA-QH-A6CU-01Z-00-DX1,9600,50100
1589,A6CU,TCGA-QH-A6CU-01Z-00-DX1_9600_52800.png,TCGA-QH-A6CU-01Z-00-DX1,9600,52800
1590,A6CU,TCGA-QH-A6CU-01Z-00-DX1_9900_32100.png,TCGA-QH-A6CU-01Z-00-DX1,9900,32100


## Data Processing

In [4]:
print(datetime.datetime.now())
for cls in CLASSES:
    fullpath = BASE_DIR+CLASS_DIR[cls]
    train_name = 'Train'+str(cls)+'.csv'
    test_name = 'Test'+str(cls)+'.csv'
    print("Class",cls)
    print('from',fullpath,'to',train_name,'and',test_name)
    cp = CP_Util(fullpath)    # looks for the CellProfiler Image csv file
    cp.train_test_split()   # sets aside 20% for testing
    cp.validate_split()  # ensures the two are non-overlapping
    train_patches = cp.get_train_patches()
    test_patches = cp.get_test_patches()
    train_patches.to_csv(train_name)
    test_patches.to_csv(test_name)
print(datetime.datetime.now())

2022-06-13 05:45:02.969740
Class 0
from /home/jrm/Adjeroh/Naved/CP_80K/Output0/ to Train0.csv and Test0.csv
Train: 37 participants, 98 WSI, 37561 patches.
Test: 9 participants, 19 WSI, 7405 patches.
Class 1
from /home/jrm/Adjeroh/Naved/CP_80K/Output1/ to Train1.csv and Test1.csv
Train: 17 participants, 28 WSI, 10925 patches.
Test: 4 participants, 5 WSI, 1977 patches.
Class 2
from /home/jrm/Adjeroh/Naved/CP_80K/Output2/ to Train2.csv and Test2.csv
Train: 14 participants, 30 WSI, 11138 patches.
Test: 3 participants, 7 WSI, 2812 patches.
Class 3
from /home/jrm/Adjeroh/Naved/CP_80K/Output3/ to Train3.csv and Test3.csv
Train: 11 participants, 13 WSI, 5006 patches.
Test: 3 participants, 4 WSI, 1358 patches.
Class 4
from /home/jrm/Adjeroh/Naved/CP_80K/Output4/ to Train4.csv and Test4.csv
Train: 6 participants, 7 WSI, 2809 patches.
Test: 1 participants, 1 WSI, 388 patches.
Class 5
from /home/jrm/Adjeroh/Naved/CP_80K/Output5/ to Train5.csv and Test5.csv
Train: 2 participants, 3 WSI, 1195 patches.
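
Because the split is made by participant rather than by patch, the share of patches that lands in each test set need not be exactly 20%. The quick check below uses only the counts printed in this notebook (the Class 5 test count is taken from the sample run in cell 3); the variable names are illustrative.

```python
# (train_patches, test_patches) per class, copied from the printed output.
counts = {
    0: (37561, 7405),
    1: (10925, 1977),
    2: (11138, 2812),
    3: (5006, 1358),
    4: (2809, 388),
    5: (1195, 396),
}

for cls, (n_train, n_test) in counts.items():
    share = n_test / (n_train + n_test)
    print(f"Class {cls}: {share:.1%} of patches in the test set")

total_train = sum(n_train for n_train, _ in counts.values())
total_test = sum(n_test for _, n_test in counts.values())
overall = total_test / (total_train + total_test)
print(f"Overall: {total_test} of {total_train + total_test} patches ({overall:.1%})")
```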

In [5]:
train4 = pd.read_csv('Train4.csv')
train4

Unnamed: 0,PatchNumber,Participant,FileName,WSI,PatchX,PatchY
0,1,A4XF,TCGA-DB-A4XF-01Z-00-DX1_10200_20700.png,TCGA-DB-A4XF-01Z-00-DX1,10200,20700
1,2,A4XF,TCGA-DB-A4XF-01Z-00-DX1_10200_28500.png,TCGA-DB-A4XF-01Z-00-DX1,10200,28500
2,3,A4XF,TCGA-DB-A4XF-01Z-00-DX1_10200_37500.png,TCGA-DB-A4XF-01Z-00-DX1,10200,37500
3,4,A4XF,TCGA-DB-A4XF-01Z-00-DX1_10500_27900.png,TCGA-DB-A4XF-01Z-00-DX1,10500,27900
4,5,A4XF,TCGA-DB-A4XF-01Z-00-DX1_10500_39300.png,TCGA-DB-A4XF-01Z-00-DX1,10500,39300
...,...,...,...,...,...,...
2804,2805,A6U9,TCGA-S9-A6U9-01Z-00-DX1_8700_81900.png,TCGA-S9-A6U9-01Z-00-DX1,8700,81900
2805,2806,A6U9,TCGA-S9-A6U9-01Z-00-DX1_9000_33900.png,TCGA-S9-A6U9-01Z-00-DX1,9000,33900
2806,2807,A6U9,TCGA-S9-A6U9-01Z-00-DX1_9000_81300.png,TCGA-S9-A6U9-01Z-00-DX1,9000,81300
2807,2808,A6U9,TCGA-S9-A6U9-01Z-00-DX1_9300_24000.png,TCGA-S9-A6U9-01Z-00-DX1,9300,24000
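
The first column above appears to be the default row index that `pd.read_csv` creates, so PatchNumber comes back as an ordinary column. If the PatchNumber index is wanted on reload, the `index_col` parameter can restore it. Here is a small sketch of the round trip, using an in-memory buffer instead of a file and two rows copied from the Train4.csv listing:

```python
import io
import pandas as pd

# Two rows copied from the Train4.csv listing, indexed by PatchNumber.
patches = pd.DataFrame(
    {
        'Participant': ['A4XF', 'A4XF'],
        'FileName': ['TCGA-DB-A4XF-01Z-00-DX1_10200_20700.png',
                     'TCGA-DB-A4XF-01Z-00-DX1_10200_28500.png'],
        'WSI': ['TCGA-DB-A4XF-01Z-00-DX1', 'TCGA-DB-A4XF-01Z-00-DX1'],
        'PatchX': [10200, 10200],
        'PatchY': [20700, 28500],
    },
    index=pd.Index([1, 2], name='PatchNumber'),
)

buffer = io.StringIO()
patches.to_csv(buffer)      # the named index is written as the first CSV column
buffer.seek(0)

# index_col turns that first column back into the index on reload.
restored = pd.read_csv(buffer, index_col='PatchNumber')
print(restored)
```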
