# Train/Test Split
Split the 80K-patch data set.  
Generate an ID list that pairs whole-slide image (WSI) IDs with CellProfiler (CP) IDs.

In [1]:
import datetime
import numpy as np
import pandas as pd
from CellProfiler_Util import CP_Util

In [2]:
print(datetime.datetime.now())
BASE_DIR='/home/jrm/Adjeroh/Naved/CP_80K/'  # append Output0/ etc
CLASS_DIR=['Output0/','Output1/','Output2/','Output3/','Output4/','Output5/',]
CLASSES=range(0,6)  # use all 6 classes

2022-06-10 08:49:30.122619


## Explanation of the data
This notebook generates files that relate TCGA IDs to CellProfiler image numbers.
It also makes the 80/20 train/test split of those IDs.

This notebook refers to a set of 80K patch files. 
The patches come from a random selection of WSI in a TCGA dataset.
Whereas the WSI are huge, each patch is 300x300 pixels.
The patches are consecutive, non-overlapping sub-images of the WSI.
This dataset was analyzed with a CNN, and separately with CellProfiler.

In May 2022, we developed and ran a CellProfiler pipeline called Process100.  
It was run separately on patches from each of the six cancer classes.
CellProfiler assigned a patch ID starting with 1 for each class.
Most CellProfiler outputs use just that ID to refer to the patch.
Only one CellProfiler output relates patch ID to TCGA ID.
We use that file here.

We extract the TCGA ID from the patch filename.
We parse away the path and present the FileName (including .png), the TumorName, and the X,Y coordinates of the top-left corner of the patch relative to the WSI. 
What we call TumorName is actually the TCGA ID for one image. 
The ID encodes tissue, patient, sample, and imaging center. 
The data appears to contain multiple images of the same tumor sample in a few instances, 
but here we treat every ID as a separate tumor.
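
The parsing described above can be sketched as follows. This is a minimal illustration, not the code inside CP_Util: the function name `parse_patch_filename` and the directory `/some/dir/` are made up for the example, and it assumes every patch file is named `<TumorName>_<X>_<Y>.png` as in the sample output in this notebook.

```python
from pathlib import PurePosixPath

def parse_patch_filename(path):
    """Split a patch path into FileName, TumorName, PatchX, PatchY."""
    file_name = PurePosixPath(path).name       # parse away the directory
    stem = file_name.rsplit('.', 1)[0]         # drop the .png extension
    # The last two '_'-separated fields are X and Y; the rest is the TCGA ID.
    tumor_name, x, y = stem.rsplit('_', 2)
    return {'FileName': file_name, 'TumorName': tumor_name,
            'PatchX': int(x), 'PatchY': int(y)}

# Hypothetical directory; the filename is taken from the sample output.
example = parse_patch_filename(
    '/some/dir/TCGA-DB-A4XG-01Z-00-DX1_10200_43500.png')
print(example)
```

Splitting from the right keeps the hyphenated TCGA ID intact even though the ID itself contains several separators.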

Confusingly, CellProfiler calls its patch number "ImageName",
and it calls the patch file its "Image" file.
Here, we rename the column from ImageName to PatchNumber.

We use our own CP_Util to generate a pandas dataframe.
The utility uses a hard-coded seed for the random number generator.
Thus, it uses the same pseudo-random numbers on every run.
The utility reads from the CellProfiler Image csv file using its hard-coded filename.
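
To illustrate why a fixed seed makes the split repeatable, here is a sketch of a seeded split that keeps all patches of one TumorName together. It is an assumption about the general approach, not a copy of CP_Util: the seed value 42, the function name `split_by_tumor`, and the toy tumor names are all hypothetical. Rounding 20% of the WSI count happens to agree with the WSI counts this notebook prints, but CP_Util may compute the split differently.

```python
import numpy as np
import pandas as pd

SEED = 42  # hypothetical; the seed hard-coded in CP_Util is not shown here

def split_by_tumor(patches, test_fraction=0.2, seed=SEED):
    """Return (train, test) with every TumorName wholly in one set."""
    rng = np.random.default_rng(seed)               # same seed, same shuffle
    tumors = sorted(patches['TumorName'].unique())  # fixed starting order
    shuffled = [str(t) for t in rng.permutation(tumors)]
    num_test = max(1, round(test_fraction * len(shuffled)))
    is_test = patches['TumorName'].isin(shuffled[:num_test])
    return patches[~is_test], patches[is_test]

# Toy data: 5 made-up tumors with 3 patches each.
toy = pd.DataFrame({'TumorName': [f'T{i}' for i in range(5) for _ in range(3)]})
train, test = split_by_tumor(toy)
rerun_train, rerun_test = split_by_tumor(toy)   # identical to the first run
print(len(train), len(test))
```

Sorting the IDs before shuffling matters: without a fixed starting order, the same seed could still produce a different split if the input rows arrive in a different order.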

Here we write files that:
1. Show the ImageName assigned by CellProfiler, which we rename PatchNumber. 
2. Relate it to the TumorName assigned by TCGA, which we extract from the patch filename.
3. Assign every patch file to either the train set or the test set.

Here is a sample of the outputs:

In [3]:
cls=5
fullpath = BASE_DIR+CLASS_DIR[cls]
cp = CP_Util(fullpath)
cp.train_test_split()
cp.validate_split()
train_patches = cp.get_train_patches()
train_patches    

Num WSI in test/train sets: 1 3
Num patches in test/train sets: 396 1195


PatchNumber,TumorName,FileName,PatchX,PatchY
1,TCGA-DB-A4XG-01Z-00-DX1,TCGA-DB-A4XG-01Z-00-DX1_10200_43500.png,10200,43500
2,TCGA-DB-A4XG-01Z-00-DX1,TCGA-DB-A4XG-01Z-00-DX1_10200_45600.png,10200,45600
3,TCGA-DB-A4XG-01Z-00-DX1,TCGA-DB-A4XG-01Z-00-DX1_10200_49200.png,10200,49200
4,TCGA-DB-A4XG-01Z-00-DX1,TCGA-DB-A4XG-01Z-00-DX1_10500_17100.png,10500,17100
5,TCGA-DB-A4XG-01Z-00-DX1,TCGA-DB-A4XG-01Z-00-DX1_10500_18600.png,10500,18600
...,...,...,...,...
1587,TCGA-QH-A6CU-01Z-00-DX1,TCGA-QH-A6CU-01Z-00-DX1_9600_26100.png,9600,26100
1588,TCGA-QH-A6CU-01Z-00-DX1,TCGA-QH-A6CU-01Z-00-DX1_9600_50100.png,9600,50100
1589,TCGA-QH-A6CU-01Z-00-DX1,TCGA-QH-A6CU-01Z-00-DX1_9600_52800.png,9600,52800
1590,TCGA-QH-A6CU-01Z-00-DX1,TCGA-QH-A6CU-01Z-00-DX1_9900_32100.png,9900,32100


## Data Processing

In [4]:
print(datetime.datetime.now())
for cls in CLASSES:
    fullpath = BASE_DIR+CLASS_DIR[cls]
    train_name = 'Train'+str(cls)+'.csv'
    test_name = 'Test'+str(cls)+'.csv'
    print("Class",cls)
    print('from',fullpath,'to',train_name,'and',test_name)
    cp = CP_Util(fullpath)    # looks for the CellProfiler Image csv file
    cp.train_test_split()   # sets aside 20% for testing
    cp.validate_split()  # ensures the two are non-overlapping
    train_patches = cp.get_train_patches()
    test_patches = cp.get_test_patches()
    train_patches.to_csv(train_name)
    test_patches.to_csv(test_name)
print(datetime.datetime.now())

2022-06-10 08:49:30.557136
Class 0
from /home/jrm/Adjeroh/Naved/CP_80K/Output0/ to Train0.csv and Test0.csv
Num WSI in test/train sets: 23 94
Num patches in test/train sets: 8803 36163
Class 1
from /home/jrm/Adjeroh/Naved/CP_80K/Output1/ to Train1.csv and Test1.csv
Num WSI in test/train sets: 7 26
Num patches in test/train sets: 2734 10168
Class 2
from /home/jrm/Adjeroh/Naved/CP_80K/Output2/ to Train2.csv and Test2.csv
Num WSI in test/train sets: 7 30
Num patches in test/train sets: 2769 11181
Class 3
from /home/jrm/Adjeroh/Naved/CP_80K/Output3/ to Train3.csv and Test3.csv
Num WSI in test/train sets: 3 14
Num patches in test/train sets: 1091 5273
Class 4
from /home/jrm/Adjeroh/Naved/CP_80K/Output4/ to Train4.csv and Test4.csv
Num WSI in test/train sets: 2 6
Num patches in test/train sets: 813 2384
Class 5
from /home/jrm/Adjeroh/Naved/CP_80K/Output5/ to Train5.csv and Test5.csv
Num WSI in test/train sets: 1 3
Num patches in test/train sets: 396 1195
2022-06-10 08:49:52.231403
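
The log reports WSI counts alongside patch counts, which suggests the split is made per WSI rather than per patch. If so, the patch-level test fraction would only approximate 20% in each class. The sketch below recomputes that fraction from the counts printed above; the dictionary is typed in by hand from the log, not read from the output files.

```python
# (test, train) patch counts per class, copied from the log above.
patch_counts = {
    0: (8803, 36163),
    1: (2734, 10168),
    2: (2769, 11181),
    3: (1091, 5273),
    4: (813, 2384),
    5: (396, 1195),
}

for cls, (num_test, num_train) in patch_counts.items():
    fraction = num_test / (num_test + num_train)
    print(f"Class {cls}: test fraction {fraction:.3f}")

total_test = sum(t for t, _ in patch_counts.values())
total_train = sum(n for _, n in patch_counts.values())
overall_fraction = total_test / (total_test + total_train)
print(f"All classes: {total_test + total_train} patches, "
      f"test fraction {overall_fraction:.3f}")
```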
