# Color Analysis - Nucleus rollup
Roll up the nucleus statistics per patch, for all classes.

## Overall Plan
* Run CellProfiler on 80K patches. Make CSV files.
* Record bounding box of every nucleus of every patch.
* Run CNN on 80K patches. 
* For each class c, label correctly classified patches c_Cor.
* For each class c, label incorrectly classified patches c_Inc.
* Run CNN attention on 80K patches. Make heatmaps.
* Compute average heatmap color per nucleus bounding box.
* Set aside test set: 20% of images (and all their patch data) per class.
* Possibly set aside patches with too little tissue, too many RBC, or too few nuclei.
* Remove useless columns such as XY locations.
* Add dispersion columns such as deciles.
* Train a Cor/Inc binary classifier for each class.
* Evaluate the model by cross-validation over training data.
* If the model is accurate, extract important features.
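
The dispersion columns mentioned in the plan could be computed as per-patch deciles. What follows is a minimal sketch on a toy frame, not the notebook's actual code. The column `AreaShape_Area` appears in the rollup output; the helper name `add_deciles`, the `_d10`..`_d90` suffixes, and the toy values are assumptions made for illustration.

```python
import pandas as pd

def add_deciles(nuc, key='PatchNumber'):
    """Return one row per patch holding the 10%..90% quantiles of each column.

    Hypothetical helper: output columns are named '<feature>_d<percent>',
    for example 'AreaShape_Area_d10'.
    """
    qs = [round(0.1 * i, 1) for i in range(1, 10)]
    dec = nuc.groupby(key).quantile(qs)   # rows indexed by (patch, quantile)
    dec = dec.unstack(level=-1)           # columns become (feature, quantile)
    dec.columns = [f'{feat}_d{int(round(q * 100))}' for feat, q in dec.columns]
    return dec

# Toy per-nucleus table: two patches with eleven nuclei each.
toy = pd.DataFrame({
    'PatchNumber': [1] * 11 + [2] * 11,
    'AreaShape_Area': list(range(0, 11)) + list(range(100, 111)),
})
deciles = add_deciles(toy)
print(deciles)
```

The resulting frame is indexed by patch, so it could be joined to the `describe()` rollup on `PatchNumber` if that is the desired layout.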

In [1]:
import datetime
print(datetime.datetime.now())
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import sklearn
print('scikit-learn version',sklearn.__version__)

2022-06-07 15:10:42.348221
scikit-learn version 1.0.2


In [2]:
# History of CLASSES settings; only the last assignment is in effect.
# CLASSES=[0,1,2,3,4,5]     # out of memory processing class 0, whose nucleus file is 12 GB
# CLASSES=[6,7,1,2,3,4,5]   # 6 and 7 are the first and second halves of 0, respectively
# CLASSES=[7,8,1,2,3,4,5]   # the notebook died on 7 after writing 6
CLASSES=[8,1,2,3,4,5]     # the notebook died on 8 after writing 7
# One CellProfiler output directory per index; 6 through 8 hold chunks of class 0.
FILEPATHS=[f'/home/jrm/Adjeroh/Naved/CP_80K/Output{i}/' for i in range(9)]

In [3]:
from CellProfiler_Util import CP_Util
for c in CLASSES:
    print(datetime.datetime.now())
    outfile = f'Nucleus_Rollup_{c}.csv'
    cputil = CP_Util(FILEPATHS[c])
    print('Process',FILEPATHS[c],'to',outfile)
    cputil.train_test_split() 
    cputil.validate_split()
    train_set=cputil.get_train_patches()
    nuc = cputil.get_nuclei()
    rollup = nuc.groupby(['PatchNumber']).describe() ## this is slow
    nuc = None
    rollup.columns=rollup.columns.map('_'.join)  ## helps random forest code
    rollup.to_csv(outfile)
    rollup = None
print(datetime.datetime.now())
print("Done")

2022-06-07 15:10:42.672710
Process /home/jrm/Adjeroh/Naved/CP_80K/Output8/ to Nucleus_Rollup_8.csv
2022-06-07 16:22:13.169339
Process /home/jrm/Adjeroh/Naved/CP_80K/Output1/ to Nucleus_Rollup_1.csv
2022-06-07 17:37:30.329404
Process /home/jrm/Adjeroh/Naved/CP_80K/Output2/ to Nucleus_Rollup_2.csv
2022-06-07 19:08:17.507080
Process /home/jrm/Adjeroh/Naved/CP_80K/Output3/ to Nucleus_Rollup_3.csv
2022-06-07 19:51:08.148176
Process /home/jrm/Adjeroh/Naved/CP_80K/Output4/ to Nucleus_Rollup_4.csv
2022-06-07 20:10:15.253663
Process /home/jrm/Adjeroh/Naved/CP_80K/Output5/ to Nucleus_Rollup_5.csv
2022-06-07 20:18:44.760738
Done
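
The groupby/describe step and the column flattening in the loop above can be illustrated on a toy frame. This is only a sketch: the values are made up, and the real per-nucleus table returned by `CP_Util.get_nuclei()` has many more columns.

```python
import pandas as pd

# Toy stand-in for the per-nucleus table (hypothetical values).
nuc = pd.DataFrame({
    'PatchNumber': [1, 1, 1, 2, 2],
    'AreaShape_Area': [10.0, 20.0, 30.0, 5.0, 15.0],
})

rollup = nuc.groupby(['PatchNumber']).describe()
# describe() yields two-level columns such as ('AreaShape_Area', 'mean').
print(rollup.columns.nlevels)

# Join the two levels into flat names such as 'AreaShape_Area_mean'.
rollup.columns = rollup.columns.map('_'.join)
print(list(rollup.columns))
print(rollup.loc[1, 'AreaShape_Area_mean'])
```

Flat string column names survive a round trip through `to_csv` and `read_csv` as a single header row, which is convenient for downstream model code.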


In [4]:
df=pd.read_csv('Nucleus_Rollup_5.csv')
df

Unnamed: 0,PatchNumber,ObjectNumber_count,ObjectNumber_mean,ObjectNumber_std,ObjectNumber_min,ObjectNumber_25%,ObjectNumber_50%,ObjectNumber_75%,ObjectNumber_max,AreaShape_Area_count,...,Texture_Variance_Hematoxylin_7_02_256_75%,Texture_Variance_Hematoxylin_7_02_256_max,Texture_Variance_Hematoxylin_7_03_256_count,Texture_Variance_Hematoxylin_7_03_256_mean,Texture_Variance_Hematoxylin_7_03_256_std,Texture_Variance_Hematoxylin_7_03_256_min,Texture_Variance_Hematoxylin_7_03_256_25%,Texture_Variance_Hematoxylin_7_03_256_50%,Texture_Variance_Hematoxylin_7_03_256_75%,Texture_Variance_Hematoxylin_7_03_256_max
0,1,16.0,8.5,4.760952,1.0,4.75,8.5,12.25,16.0,16.0,...,515.499745,806.931279,16.0,408.618560,214.781603,154.695633,258.212905,402.843399,513.465007,947.381911
1,2,10.0,5.5,3.027650,1.0,3.25,5.5,7.75,10.0,10.0,...,514.366296,1228.802695,10.0,483.101182,350.161055,67.633510,269.218502,376.585665,664.029826,1054.690208
2,3,29.0,15.0,8.514693,1.0,8.00,15.0,22.00,29.0,29.0,...,1367.814844,2246.295589,29.0,1170.234328,419.840252,319.085600,851.541558,1151.797133,1449.717124,2063.316367
3,4,34.0,17.5,9.958246,1.0,9.25,17.5,25.75,34.0,34.0,...,1013.044074,1326.400294,34.0,796.924000,305.644800,323.922906,556.227842,739.282772,1108.957455,1368.495121
4,5,27.0,14.0,7.937254,1.0,7.50,14.0,20.50,27.0,27.0,...,1167.610251,1776.780209,27.0,872.023588,465.019642,146.843737,573.465876,781.263463,1084.113804,2280.150933
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1190,1587,7.0,4.0,2.160247,1.0,2.50,4.0,5.50,7.0,7.0,...,2645.755171,5002.663831,7.0,2072.810441,1442.119639,682.306939,946.916962,2007.931412,2641.212717,4643.175378
1191,1588,17.0,9.0,5.049752,1.0,5.00,9.0,13.00,17.0,17.0,...,2697.422264,3144.417339,17.0,2120.125661,784.452285,435.222222,1636.718754,2149.951462,2783.133279,3362.958125
1192,1589,19.0,10.0,5.627314,1.0,5.50,10.0,14.50,19.0,19.0,...,2196.200883,2971.276407,19.0,1731.620347,639.473204,677.151060,1348.982274,1626.336416,2185.555700,2992.328549
1193,1590,33.0,17.0,9.669540,1.0,9.00,17.0,25.00,33.0,33.0,...,1931.452617,2672.847178,33.0,1330.902406,667.821030,479.463374,800.137069,1102.353972,1578.432099,2878.275735


## Memory considerations
Output0 could not be processed in one chunk. Its nucleus file was 12 GB, but our Alien computer has only 8 GB of RAM, and this notebook crashed during the groupby/describe step on that file. So we broke the class 0 data into two chunks of equal size. This notebook used 85% of RAM temporarily, then 50% growing to 60%. To break the file:

cat Process100_Nucleus.csv | awk '{c++; if (c==1 || c<500010) print $0;}' > Nucleus_0_1.csv

cat Process100_Nucleus.csv | awk '{c++; if (c==1 || c>=500010) print $0;}' > Nucleus_0_2.csv
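
The same split can be sketched in Python. This is an illustrative equivalent of the awk commands, not what was actually run; the function name `split_csv_at_line`, the file names, and the toy data are assumptions. It streams the file line by line, so the 12 GB file would never be held in memory.

```python
import os
import tempfile

def split_csv_at_line(src, cut, first, second):
    """Split a CSV at 1-based line number `cut`, repeating the header in both parts.

    Mirrors the awk commands: lines 1..cut-1 go to `first`; the header plus
    lines cut..end go to `second`.
    """
    with open(src) as fin, open(first, 'w') as f1, open(second, 'w') as f2:
        for lineno, line in enumerate(fin, start=1):
            if lineno == 1:      # header goes to both outputs
                f1.write(line)
                f2.write(line)
            elif lineno < cut:
                f1.write(line)
            else:
                f2.write(line)

# Small demonstration on a temporary file with made-up rows.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, 'Nucleus.csv')
with open(src, 'w') as f:
    f.write('ImageNumber,ObjectNumber\n')
    for i in range(1, 7):
        f.write(f'{i},1\n')

part1 = os.path.join(tmp, 'Nucleus_part1.csv')
part2 = os.path.join(tmp, 'Nucleus_part2.csv')
split_csv_at_line(src, 5, part1, part2)

with open(part1) as f:
    lines1 = f.read().splitlines()
with open(part2) as f:
    lines2 = f.read().splitlines()
print(lines1)
print(lines2)
```

A cut at a fixed line number can fall between rows that belong to the same patch, so it may be worth checking the rows on either side of the cut before rolling up each part.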

In the Output6 directory, we put Nucleus_0_1.csv and a link to the Image.csv file. In the Output7 directory, we put Nucleus_0_2.csv and a link to the Image.csv file. After processing, we concatenated classes 6 and 7 into 0.

We later had to divide chunk 7 further, into 7 and 8, at line 249995.
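
Concatenating the chunk rollups back into one class 0 table might look like the sketch below. How the concatenation was actually done is not recorded here, so treat this as one possible approach; the in-memory CSV text stands in for the real rollup files, and the single feature column is a toy example.

```python
import io
import pandas as pd

# Toy stand-ins for two chunk rollups; real files would be read with pd.read_csv(path).
chunk_a = pd.read_csv(io.StringIO('PatchNumber,AreaShape_Area_mean\n1,20.0\n2,10.0\n'))
chunk_b = pd.read_csv(io.StringIO('PatchNumber,AreaShape_Area_mean\n3,12.5\n4,30.0\n'))

combined = pd.concat([chunk_a, chunk_b], ignore_index=True)

# A patch whose nuclei straddle a cut would show up in both chunks,
# so count duplicated patch numbers before trusting the result.
dupes = int(combined['PatchNumber'].duplicated().sum())
print(len(combined), 'rows,', dupes, 'duplicated patch numbers')
```

If duplicates do appear, the affected patches would need their statistics recomputed from the nucleus rows rather than merged from the two partial summaries.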