# Color Analysis - Split Nuclei
Split the nuclei into train and test sets by tumor name, so that all patches and nuclei from one tumor stay on the same side of the split.
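
A minimal sketch of a tumor-level split, assuming a patch table with a `TumorName` column like the one shown in the output at the end of this notebook. It uses scikit-learn's `GroupShuffleSplit`; whether `CP_Util.train_test_split` works this way is not shown here, and the helper name `split_by_tumor` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_tumor(patches, test_size=0.2, random_state=12345):
    """Split patch rows so that each tumor lands entirely in train or in test."""
    splitter = GroupShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_state
    )
    # groups= makes the splitter sample whole tumors, not individual patches
    train_idx, test_idx = next(
        splitter.split(patches, groups=patches["TumorName"])
    )
    return patches.iloc[train_idx], patches.iloc[test_idx]

# Toy data (hypothetical): 10 tumors with 3 patches each
patches = pd.DataFrame({
    "TumorName": [f"tumor_{i}" for i in range(10) for _ in range(3)],
    "PatchX": range(30),
})
train, test = split_by_tumor(patches)
print(sorted(test["TumorName"].unique()))
```

Splitting on the tumor rather than on the patch keeps near-duplicate tissue from the same slide out of both sets at once, which would otherwise inflate test accuracy.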

## Overall Plan
* Run CellProfiler on 80K patches. Make CSV files.
* Record bounding box of every nucleus of every patch.
* Run CNN on 80K patches. 
* For each class c, label correctly classified patches c_Cor.
* For each class c, label incorrectly classified patches c_Inc.
* Run CNN attention on 80K patches. Make heatmaps.
* Compute average heatmap color per nucleus bounding box.
* Set aside test set: 20% of images (and all their patch data) per class.
* Possibly set aside patches with too little tissue, too many red blood cells (RBC), or too few nuclei.
* Remove useless columns such as XY locations.
* Add dispersion columns such as deciles.
* Train a Cor/Inc binary classifier for each class.
* Evaluate the model by cross-validation over training data.
* If the model is accurate, extract important features.
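
One possible reading of the dispersion-column step above: summarize each per-nucleus feature over a patch by its nine deciles. This is a sketch only. The column names `PatchNumber` and `Area`, the helper `decile_columns`, and the toy values are assumptions for illustration and are not taken from the CellProfiler CSV files.

```python
import pandas as pd

def decile_columns(nuclei, feature, group_col="PatchNumber"):
    """Return one row per patch with the 10th..90th percentiles of `feature`."""
    quantiles = [d / 10 for d in range(1, 10)]
    # quantile() with a list yields a (patch, quantile) MultiIndex;
    # unstack() moves the quantile level into columns
    deciles = nuclei.groupby(group_col)[feature].quantile(quantiles).unstack()
    deciles.columns = [f"{feature}_d{d * 10}" for d in range(1, 10)]
    return deciles

# Toy data (hypothetical): patch 1 has 11 nuclei, patch 2 has 3 identical ones
nuclei = pd.DataFrame({
    "PatchNumber": [1] * 11 + [2] * 3,
    "Area": [float(a) for a in range(1, 12)] + [5.0, 5.0, 5.0],
})
patch_deciles = decile_columns(nuclei, "Area")
print(patch_deciles[["Area_d10", "Area_d50", "Area_d90"]])
```

The result has one row per patch, so it can be joined to the patch-level Cor/Inc labels before training the binary classifier.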

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import sklearn
sklearn.__version__

'1.1.1'

In [2]:
RANDOM_STATE = 12345
THIS_CLASS = 4   # use a small class for process development
NUM_CLASSES = 6
FILEPATHS = ['path'] * NUM_CLASSES   # placeholder entries; only THIS_CLASS is filled in
FILEPATHS[THIS_CLASS] = '/Users/jasonmiller/WVU/Output4/'

In [3]:
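# CP_Util is this project's own helper; its train_test_split() method is
# distinct from the sklearn function imported in the first cell.
from CellProfiler_Util import CP_Util
cputil = CP_Util(FILEPATHS[THIS_CLASS])
cputil.train_test_split()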

In [4]:
cputil.validate_split()
train_set = cputil.get_train_patches()
train_set

PatchNumber,TumorName,FileName,PatchX,PatchY
804,TCGA-DH-A66B-01Z-00-DX1,TCGA-DH-A66B-01Z-00-DX1_10200_45600.png,10200,45600
805,TCGA-DH-A66B-01Z-00-DX1,TCGA-DH-A66B-01Z-00-DX1_10200_46500.png,10200,46500
806,TCGA-DH-A66B-01Z-00-DX1,TCGA-DH-A66B-01Z-00-DX1_10500_45600.png,10500,45600
807,TCGA-DH-A66B-01Z-00-DX1,TCGA-DH-A66B-01Z-00-DX1_10500_58800.png,10500,58800
808,TCGA-DH-A66B-01Z-00-DX1,TCGA-DH-A66B-01Z-00-DX1_11400_55800.png,11400,55800
...,...,...,...,...
3193,TCGA-S9-A6WL-01Z-00-DX1,TCGA-S9-A6WL-01Z-00-DX1_9900_35400.png,9900,35400
3194,TCGA-S9-A6WL-01Z-00-DX1,TCGA-S9-A6WL-01Z-00-DX1_9900_35700.png,9900,35700
3195,TCGA-S9-A6WL-01Z-00-DX1,TCGA-S9-A6WL-01Z-00-DX1_9900_36000.png,9900,36000
3196,TCGA-S9-A6WL-01Z-00-DX1,TCGA-S9-A6WL-01Z-00-DX1_9900_36300.png,9900,36300
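
The internals of `validate_split()` are not shown in this notebook. As a hedged sketch of one check such a validation could include, the snippet below confirms that no tumor name appears on both sides of the split. The helper `shared_tumors` and the toy frames are hypothetical.

```python
import pandas as pd

def shared_tumors(train_patches, test_patches, col="TumorName"):
    """Return tumor names present in both sets; an empty set means no leakage."""
    return set(train_patches[col]) & set(test_patches[col])

# Toy frames (hypothetical) standing in for the train and test patch tables
train_patches = pd.DataFrame({"TumorName": ["tumor_A", "tumor_A", "tumor_B"]})
test_patches = pd.DataFrame({"TumorName": ["tumor_C"]})

leaked = shared_tumors(train_patches, test_patches)
assert not leaked, f"tumors on both sides of the split: {sorted(leaked)}"
```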
