# Color Analysis - Split Nuclei
Move the utility class into a module (`CellProfiler_Util`).  
Split the nuclei into train and test sets by tumor name, so that all patches from one tumor land in the same set.
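
Splitting by tumor name keeps every patch from a given tumor on one side of the split, so the test set holds no tissue that was seen in training. The sketch below shows one way to do this with scikit-learn's `GroupShuffleSplit`. It is not necessarily how `CP_Util.train_test_split()` works, and the toy table and its values are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy patch table: column names follow the notebook, values are invented.
patches = pd.DataFrame({
    "TumorName": ["T1", "T1", "T2", "T2", "T3", "T3", "T4", "T4", "T5", "T5"],
    "PatchX": range(10),
})

# Hold out about 20% of the tumors (groups), not 20% of the patches.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(patches, groups=patches["TumorName"]))
train_patches = patches.iloc[train_idx]
test_patches = patches.iloc[test_idx]

# Sanity check: no tumor may appear on both sides of the split.
train_tumors = set(train_patches["TumorName"])
test_tumors = set(test_patches["TumorName"])
assert train_tumors.isdisjoint(test_tumors)
```

Because whole tumors are held out, the patch-level test fraction only approximates 20% when tumors contribute different numbers of patches.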

## Overall Plan
* Run CellProfiler on 80K patches. Make CSV files.
* Record bounding box of every nucleus of every patch.
* Run CNN on 80K patches. 
* For each class c, label correctly classified patches c_Cor.
* For each class c, label incorrectly classified patches c_Inc.
* Run CNN attention on 80K patches. Make heatmaps.
* Compute average heatmap color per nucleus bounding box.
* Set aside test set: 20% of images (and all their patch data) per class.
* Possibly set aside patches with too little tissue, too many red blood cells (RBC), or too few nuclei.
* Remove useless columns such as XY locations.
* Add dispersion columns such as deciles.
* Train a Cor/Inc binary classifier for each class.
* Evaluate the model by cross-validation over training data.
* If the model is accurate, extract important features.
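
The dispersion step in the plan can be sketched as follows, assuming deciles are computed per patch over that patch's nuclei. The toy data and the `_d1` to `_d9` column suffixes are hypothetical; only `PatchNumber` and `AreaShape_Area` come from the CellProfiler output.

```python
import pandas as pd

# Toy nucleus table: two patches with 11 nuclei each (invented values).
nuc = pd.DataFrame({
    "PatchNumber": [404] * 11 + [405] * 11,
    "AreaShape_Area": list(range(0, 101, 10)) + list(range(100, 201, 10)),
})

# One row per patch, one column per decile (10th through 90th percentile).
by_patch = nuc.groupby("PatchNumber")["AreaShape_Area"]
deciles = pd.DataFrame({
    f"AreaShape_Area_d{k}": by_patch.quantile(k / 10) for k in range(1, 10)
})
print(deciles)
```

The same pattern can be repeated for any other per-nucleus feature, which yields patch-level columns that describe the spread of a feature rather than only its mean.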

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import sklearn
sklearn.__version__

'1.1.1'

In [2]:
THIS_CLASS=4   # use a small class for process development
NUM_CLASSES=6
FILEPATHS=['path']*NUM_CLASSES
FILEPATHS[THIS_CLASS]='/Users/jasonmiller/WVU/Output4/'

In [3]:
from CellProfiler_Util import CP_Util
cputil = CP_Util(FILEPATHS[THIS_CLASS])
cputil.train_test_split() 
cputil.validate_split()

In [4]:
train_set=cputil.get_train_patches()
train_set

PatchNumber,TumorName,FileName,PatchX,PatchY
404,TCGA-DB-A4XF-01Z-00-DX2,TCGA-DB-A4XF-01Z-00-DX2_10200_27300.png,10200,27300
405,TCGA-DB-A4XF-01Z-00-DX2,TCGA-DB-A4XF-01Z-00-DX2_10500_33600.png,10500,33600
406,TCGA-DB-A4XF-01Z-00-DX2,TCGA-DB-A4XF-01Z-00-DX2_10500_35400.png,10500,35400
407,TCGA-DB-A4XF-01Z-00-DX2,TCGA-DB-A4XF-01Z-00-DX2_10500_36900.png,10500,36900
408,TCGA-DB-A4XF-01Z-00-DX2,TCGA-DB-A4XF-01Z-00-DX2_10500_39000.png,10500,39000
...,...,...,...,...
3193,TCGA-S9-A6WL-01Z-00-DX1,TCGA-S9-A6WL-01Z-00-DX1_9900_35400.png,9900,35400
3194,TCGA-S9-A6WL-01Z-00-DX1,TCGA-S9-A6WL-01Z-00-DX1_9900_35700.png,9900,35700
3195,TCGA-S9-A6WL-01Z-00-DX1,TCGA-S9-A6WL-01Z-00-DX1_9900_36000.png,9900,36000
3196,TCGA-S9-A6WL-01Z-00-DX1,TCGA-S9-A6WL-01Z-00-DX1_9900_36300.png,9900,36300


In [5]:
nuc = cputil.get_nuclei()
nuc

PatchNumber,ObjectNumber,AreaShape_Area,AreaShape_BoundingBoxArea,AreaShape_BoundingBoxMaximum_X,AreaShape_BoundingBoxMaximum_Y,AreaShape_BoundingBoxMinimum_X,AreaShape_BoundingBoxMinimum_Y,AreaShape_Center_X,AreaShape_Center_Y,AreaShape_CentralMoment_0_0,...,Texture_Variance_Hematoxylin_4_02_256,Texture_Variance_Hematoxylin_4_03_256,Texture_Variance_Hematoxylin_5_00_256,Texture_Variance_Hematoxylin_5_01_256,Texture_Variance_Hematoxylin_5_02_256,Texture_Variance_Hematoxylin_5_03_256,Texture_Variance_Hematoxylin_7_00_256,Texture_Variance_Hematoxylin_7_01_256,Texture_Variance_Hematoxylin_7_02_256,Texture_Variance_Hematoxylin_7_03_256
404,1,381,432,300,18,276,0,287.944882,7.545932,381.0,...,472.926020,433.992091,446.022027,429.616332,468.850868,435.819105,472.378310,484.954447,473.837532,460.497900
404,2,333,420,15,82,0,54,5.987988,66.735736,333.0,...,960.487220,946.438594,893.961503,910.633845,956.217727,995.500138,879.640020,931.303401,894.149080,1185.569393
404,3,626,1014,77,115,51,76,61.797125,93.624601,626.0,...,751.848005,826.661482,789.524803,782.392906,744.234931,850.755891,850.787668,824.578326,750.389399,969.339850
404,4,365,572,88,116,66,90,77.556164,103.019178,365.0,...,878.694485,894.319551,895.501398,994.985190,901.141203,950.244359,942.455686,1023.953372,974.822222,1046.536369
404,5,360,621,173,120,150,93,159.716667,104.886111,360.0,...,1027.057762,1068.790001,1025.144211,980.865961,1006.012089,1030.046147,1085.918790,947.877531,928.062831,1024.342687
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3196,6,491,684,234,216,216,178,224.256619,197.356415,491.0,...,1422.754541,1428.622175,1361.131323,1573.083123,1431.157207,1464.698052,1423.252150,1665.083694,1452.185763,1378.132307
3196,7,298,364,149,208,136,180,142.043624,193.359060,298.0,...,1695.628540,1787.881929,1789.764515,1910.202155,1681.800641,1907.651012,2040.167169,2072.963190,1667.108287,2022.937578
3196,8,235,345,210,219,195,196,200.523404,206.736170,235.0,...,1399.381486,1530.439781,1401.851562,1347.494422,1361.041939,1664.847698,1485.990710,1525.224615,1400.377914,1819.843827
3196,9,269,414,214,233,196,210,205.635688,222.234201,269.0,...,1255.160768,1348.841787,1194.540566,1004.038428,1251.985997,1453.395007,1280.273024,901.265840,1300.529218,1757.844843
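
The nuclei table carries several location columns (bounding-box corners and centers) that the plan proposes to remove before training. A minimal sketch of that step is below. Which columns count as "XY locations" is an assumption here, expressed as a regular expression, and the toy frame reuses a few column names and values from the output above.

```python
import pandas as pd

# Toy slice of the nuclei table, indexed by patch number.
nuc = pd.DataFrame(
    {
        "ObjectNumber": [1, 2],
        "AreaShape_Area": [381, 333],
        "AreaShape_BoundingBoxMaximum_X": [300, 15],
        "AreaShape_BoundingBoxMinimum_Y": [0, 54],
        "AreaShape_Center_X": [287.944882, 5.987988],
        "AreaShape_Center_Y": [7.545932, 66.735736],
    },
    index=pd.Index([404, 404], name="PatchNumber"),
)

# Assumed definition of a location column: bounding-box corners and centers.
location_cols = nuc.filter(
    regex=r"(BoundingBox(Maximum|Minimum)|Center)_[XY]$"
).columns
features = nuc.drop(columns=location_cols)
print(list(features.columns))
```

The bounding boxes are still needed for the per-nucleus heatmap averaging in the plan, so a drop like this would come after that step.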
