# Ignore this notebook!
We need an 80/20 split by tumor name. From there we must assign every patch, and from there every nucleus. This notebook did not use tumor name, so we need to start again. We were confused by CellProfiler's column name 'ImageNumber', which is actually the patch number.
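
A minimal sketch of the split we need, assuming a hypothetical lookup table that maps each patch number to a tumor name (no such table appears in this notebook; the column names `PatchNumber` and `TumorName` are made up). scikit-learn's `GroupShuffleSplit` keeps all patches of one tumor on the same side of the split.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: one row per patch, with an assumed 'TumorName' column.
patches = pd.DataFrame({
    'PatchNumber': range(1, 21),
    'TumorName': ['T%02d' % (i // 2) for i in range(20)],  # 10 tumors, 2 patches each
})

# test_size applies to groups (tumors), not to patches or nuclei.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=12345)
train_idx, test_idx = next(splitter.split(patches, groups=patches['TumorName']))
train_patches = patches.iloc[train_idx]
test_patches = patches.iloc[test_idx]

train_tumors = set(train_patches['TumorName'])
test_tumors = set(test_patches['TumorName'])
print(len(train_tumors), 'train tumors,', len(test_tumors), 'test tumors')
```

Because the 20% is taken over tumors, the patch-level and nucleus-level ratios will only be approximately 80/20 when tumors differ in size. Nuclei can then be assigned by filtering the nucleus rows with `isin` on the chosen patch numbers.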

# Color Analysis
* Run CellProfiler on 80K patches. Make CSV files.
* Record bounding box of every nucleus of every patch.
* Run CNN on 80K patches. 
* For each class c, label correctly classified patches c_Cor.
* For each class c, label incorrectly classified patches c_Inc.
* Run CNN attention on 80K patches. Make heatmaps.
* Compute average heatmap color per nucleus bounding box.
* Set aside test set: 20% of images (and all their patch data) per class.
* Possibly set aside patches with too little tissue, too many RBC, or too few nuclei.
* Remove useless columns such as XY locations.
* Add dispersion columns such as deciles.
* Train a Cor/Inc binary classifier for each class.
* Evaluate the model by cross-validation over training data.
* If the model is accurate, extract important features.
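
The step "compute average heatmap color per nucleus bounding box" could look like the sketch below. It assumes the heatmap is a NumPy array aligned pixel-for-pixel with the patch (shape H x W or H x W x channels), which this notebook does not confirm. It also assumes the bounding-box maximum is exclusive, which is consistent with the first nucleus in the CellProfiler output (width 80 - 59 = 21, height 26 - 12 = 14, bounding-box area 294). The function name is hypothetical.

```python
import numpy as np
import pandas as pd

def mean_heatmap_per_box(heatmap, nuclei):
    """Mean heatmap value inside each nucleus bounding box (hypothetical helper).

    heatmap: array of shape (H, W) or (H, W, channels), assumed aligned with the patch.
    nuclei:  rows for ONE patch, with CellProfiler's AreaShape_BoundingBox* columns.
    """
    means = []
    for row in nuclei.itertuples(index=False):
        x0 = int(row.AreaShape_BoundingBoxMinimum_X)
        x1 = int(row.AreaShape_BoundingBoxMaximum_X)  # treated as exclusive
        y0 = int(row.AreaShape_BoundingBoxMinimum_Y)
        y1 = int(row.AreaShape_BoundingBoxMaximum_Y)  # treated as exclusive
        box = heatmap[y0:y1, x0:x1]                   # rows are Y, columns are X
        means.append(box.mean(axis=(0, 1)))           # average over pixels, per channel
    return np.array(means)

# Toy 4x4 single-channel heatmap: the top-left 2x2 corner is "hot".
heatmap = np.zeros((4, 4))
heatmap[0:2, 0:2] = 1.0
nuclei = pd.DataFrame({
    'AreaShape_BoundingBoxMinimum_X': [0, 2, 1],
    'AreaShape_BoundingBoxMaximum_X': [2, 4, 3],
    'AreaShape_BoundingBoxMinimum_Y': [0, 2, 0],
    'AreaShape_BoundingBoxMaximum_Y': [2, 4, 2],
})
box_means = mean_heatmap_per_box(heatmap, nuclei)
print(box_means)
```

The three toy boxes cover the hot corner fully, not at all, and by half, so their means should differ accordingly.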

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import sklearn
sklearn.__version__

'1.1.1'

In [2]:
RANDOM_STATE=12345
THIS_CLASS=4   # use a small class for process development
NUM_CLASSES=6
FILEPATHS=['path']*NUM_CLASSES
FILEPATHS[THIS_CLASS]='/Users/jasonmiller/WVU/Output4/'
NUCLEI_FN='Process100_Nucleus.csv'
TEST_SET_ASIDE=0.20
KEY_COL='ImageNumber'   # CellProfiler's name; per the note at top, this is actually the patch number

In [3]:
all_rows = pd.read_csv(FILEPATHS[THIS_CLASS]+NUCLEI_FN)
all_rows

Unnamed: 0,ImageNumber,ObjectNumber,AreaShape_Area,AreaShape_BoundingBoxArea,AreaShape_BoundingBoxMaximum_X,AreaShape_BoundingBoxMaximum_Y,AreaShape_BoundingBoxMinimum_X,AreaShape_BoundingBoxMinimum_Y,AreaShape_Center_X,AreaShape_Center_Y,...,Texture_Variance_Hematoxylin_4_02_256,Texture_Variance_Hematoxylin_4_03_256,Texture_Variance_Hematoxylin_5_00_256,Texture_Variance_Hematoxylin_5_01_256,Texture_Variance_Hematoxylin_5_02_256,Texture_Variance_Hematoxylin_5_03_256,Texture_Variance_Hematoxylin_7_00_256,Texture_Variance_Hematoxylin_7_01_256,Texture_Variance_Hematoxylin_7_02_256,Texture_Variance_Hematoxylin_7_03_256
0,1,1,196,294,80,26,59,12,68.193878,18.607143,...,602.743285,642.679273,642.319285,642.400310,640.764599,699.330009,644.967400,687.076735,834.618344,850.561523
1,1,2,469,609,126,37,97,16,111.159915,25.989339,...,539.175397,565.853063,548.903754,562.875134,556.688253,597.419939,567.698283,508.662662,579.599297,624.531535
2,1,3,223,304,193,41,174,25,182.551570,32.852018,...,888.080613,808.414113,845.188775,847.659739,899.641109,733.203961,877.066057,587.996061,814.744709,819.948881
3,1,4,508,798,21,61,0,23,7.982283,43.293307,...,1241.667208,1159.372486,1238.719919,1169.194520,1257.374985,1114.919318,1175.562484,1175.745806,1287.354528,1171.277847
4,1,5,521,768,143,74,111,50,126.280230,60.255278,...,951.986476,964.896780,907.957406,909.009379,956.391792,966.036051,933.923255,943.625640,974.996622,928.684668
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46520,3196,6,491,684,234,216,216,178,224.256619,197.356415,...,1422.754541,1428.622175,1361.131323,1573.083123,1431.157207,1464.698052,1423.252150,1665.083694,1452.185763,1378.132307
46521,3196,7,298,364,149,208,136,180,142.043624,193.359060,...,1695.628540,1787.881929,1789.764515,1910.202155,1681.800641,1907.651012,2040.167169,2072.963190,1667.108287,2022.937578
46522,3196,8,235,345,210,219,195,196,200.523404,206.736170,...,1399.381486,1530.439781,1401.851562,1347.494422,1361.041939,1664.847698,1485.990710,1525.224615,1400.377914,1819.843827
46523,3196,9,269,414,214,233,196,210,205.635688,222.234201,...,1255.160768,1348.841787,1194.540566,1004.038428,1251.985997,1453.395007,1280.273024,901.265840,1300.529218,1757.844843


In [4]:
im_min =    all_rows[KEY_COL].min()
im_max =    all_rows[KEY_COL].max()
im_total =  all_rows[KEY_COL].nunique()
image_ids = all_rows[KEY_COL].unique()
expect_train_size = im_total * (1-TEST_SET_ASIDE)   # float; '%d' below truncates it
expect_test_size =  im_total * TEST_SET_ASIDE       # float; '%d' below truncates it
print('There are %d unique images numbered %d to %d.'%
      (im_total,im_min,im_max))
if im_min==1 and im_max==im_total:
    print('We are good to go with %d train and %d test images.'%
         (expect_train_size,expect_test_size))
else:
    print('ERROR: The images are not consecutively numbered.')

There are 3197 unique images numbered 1 to 3197.
We are good to go with 2557 train and 639 test images.


In [5]:
print('Expected split: %.1f / %.1f of %d'%
      (expect_train_size,expect_test_size,im_total))
train_image_ids, test_image_ids = train_test_split( 
    image_ids, test_size=TEST_SET_ASIDE, random_state=RANDOM_STATE ) 
actual_train_size = len(train_image_ids)
actual_test_size = len(test_image_ids)
print('Actual split: %d+%d=%d of %d'%
      (actual_train_size,actual_test_size,
       actual_train_size+actual_test_size,im_total))

Expected split: 2557.6 / 639.4 of 3197
Actual split: 2557+640=3197 of 3197
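
The gap between the expected 639.4 and the actual 640 test images comes from rounding. To our understanding, for a fractional `test_size` the `train_test_split` function rounds the test count up and gives the remainder to the training set. The arithmetic can be reproduced directly:

```python
import math

im_total = 3197
TEST_SET_ASIDE = 0.20

n_test = math.ceil(im_total * TEST_SET_ASIDE)  # 639.4 rounds up
n_train = im_total - n_test                    # the remainder goes to train
print(n_train, n_test)
```

This reproduces the 2557+640 split reported by the cell.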


In [6]:
X_train = all_rows.loc[all_rows[KEY_COL].isin(train_image_ids)]
X_test =  all_rows.loc[all_rows[KEY_COL].isin( test_image_ids)]
train_count=len(X_train)
test_count=len(X_test)
all_count = len(all_rows)
print('Row counts: %d train + %d test = %d of %d'%
      (train_count,test_count,train_count+test_count,all_count))
train_uniq = X_train[KEY_COL].nunique()
test_uniq =  X_test [KEY_COL].nunique()
print('Image counts: %d train + %d test = %d of %d'%
      (train_uniq,test_uniq,train_uniq+test_uniq,im_total))

Row counts: 36860 train + 9665 test = 46525 of 46525
Image counts: 2557 train + 640 test = 3197 of 3197
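
The counts show that the totals add up, but not that the two sets are disjoint. A direct check that no image number appears on both sides may be worthwhile. A minimal sketch on made-up toy data, reusing this notebook's `ImageNumber` column name:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

KEY_COL = 'ImageNumber'
# Toy nucleus table: several rows (nuclei) per image number.
toy_rows = pd.DataFrame({KEY_COL: [1, 1, 2, 3, 3, 3, 4, 5, 5, 6],
                         'AreaShape_Area': range(10)})

# Split the unique image numbers, then pull the matching rows.
train_ids, test_ids = train_test_split(
    toy_rows[KEY_COL].unique(), test_size=0.20, random_state=12345)
toy_train = toy_rows.loc[toy_rows[KEY_COL].isin(train_ids)]
toy_test = toy_rows.loc[toy_rows[KEY_COL].isin(test_ids)]

# No image number may appear in both sets, and no row may be lost.
shared = set(toy_train[KEY_COL]) & set(toy_test[KEY_COL])
assert not shared, 'Leakage: images in both sets: %s' % sorted(shared)
assert len(toy_train) + len(toy_test) == len(toy_rows)
```

The same two assertions could be applied to `X_train` and `X_test` above.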


## Need to save CSV files for train and test
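
One possible way to do this, sketched with hypothetical file names (`Train_Nucleus.csv`, `Test_Nucleus.csv`), toy stand-ins for `X_train` and `X_test`, and a temporary directory in place of the real output folder. Passing `index=False` avoids writing the row index, which would otherwise come back as an extra `Unnamed: 0` column on the next `read_csv` (possibly the origin of the one seen in the table above).

```python
import os
import tempfile
import pandas as pd

# Toy stand-ins for the X_train and X_test frames built above.
X_train = pd.DataFrame({'ImageNumber': [1, 1, 2], 'ObjectNumber': [1, 2, 1]})
X_test = pd.DataFrame({'ImageNumber': [3], 'ObjectNumber': [1]})

# Hypothetical output location and file names.
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, 'Train_Nucleus.csv')
test_path = os.path.join(out_dir, 'Test_Nucleus.csv')

X_train.to_csv(train_path, index=False)  # index=False: no extra index column
X_test.to_csv(test_path, index=False)

# Round-trip check: the columns come back unchanged.
reloaded = pd.read_csv(train_path)
print(list(reloaded.columns))
```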