# Color Analysis - Nucleus rollup
Roll up the nucleus statistics per patch, for all classes.

## Overall Plan
* Run CellProfiler on 80K patches. Make CSV files.
* Record bounding box of every nucleus of every patch.
* Run CNN on 80K patches. 
* For each class c, label correctly classified patches c_Cor.
* For each class c, label incorrectly classified patches c_Inc.
* Run CNN attention on 80K patches. Make heatmaps.
* Compute average heatmap color per nucleus bounding box.
* Set aside test set: 20% of images (and all their patch data) per class.
* Possibly set aside patches with too little tissue, too many RBC, or too few nuclei.
* Remove useless columns such as XY locations.
* Add dispersion columns such as deciles.
* Train a Cor/Inc binary classifier for each class.
* Evaluate the model by cross-validation over training data.
* If the model is accurate, extract important features.
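
The last three steps of the plan could look roughly like the sketch below. It is an illustration only, not the project's actual model: the data is synthetic, the column names (`Area_mean`, `Area_std`, `Intensity_mean`) are hypothetical, and a random forest is assumed as the Cor/Inc classifier.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a per-patch feature table (hypothetical column names).
rng = np.random.default_rng(0)
n_patches = 200
X = pd.DataFrame({
    'Area_mean': rng.normal(100, 10, n_patches),
    'Area_std': rng.normal(20, 5, n_patches),
    'Intensity_mean': rng.normal(0.5, 0.1, n_patches),
})
# Synthetic Cor/Inc label (1 = Cor, 0 = Inc), driven by one feature for illustration.
y = (X['Area_mean'] + rng.normal(0, 5, n_patches) > 100).astype(int)

# Evaluate by 5-fold cross-validation over the training data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))

# If the model looks accurate, fit on all training data and rank the features.
model.fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

In the real pipeline, `X` would be one class's rollup table with the test-set patches removed, and `y` would be that class's Cor/Inc labels.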

In [1]:
import datetime
print(datetime.datetime.now())
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import sklearn
print('scikit-learn version',sklearn.__version__)

2022-06-07 11:07:07.417977
scikit-learn version 1.0.2


In [2]:
# Class 0 is 12 GB and ran out of memory, so it is processed as two halves:
# 6 and 7 are the first and second half of class 0 respectively.
CLASSES=[6,7,1,2,3,4,5]
# FILEPATHS[c] is the CellProfiler output directory for class c.
FILEPATHS=[f'/home/jrm/Adjeroh/Naved/CP_80K/Output{c}/' for c in range(8)]

In [None]:
from CellProfiler_Util import CP_Util
for c in CLASSES:
    print(datetime.datetime.now())
    outfile = f'Nucleus_Rollup_{c}.csv'
    cputil = CP_Util(FILEPATHS[c])
    print('Process',FILEPATHS[c],'to',outfile)
    cputil.train_test_split() 
    cputil.validate_split()
    train_set=cputil.get_train_patches()
    nuc = cputil.get_nuclei()
    rollup = nuc.groupby(['PatchNumber']).describe() ## this is slow
    rollup.columns=rollup.columns.map('_'.join)  ## helps random forest code
    rollup.to_csv(outfile)
print(datetime.datetime.now())
print("Done")

2022-06-07 11:07:07.741586
Process /home/jrm/Adjeroh/Naved/CP_80K/Output6/ to Nucleus_Rollup_6.csv
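
To show what the groupby/describe rollup produces, here is a small sketch on a toy table. Only `PatchNumber` comes from this notebook; the `Area` column and its values are made up. It also passes `percentiles` to `describe` as one possible way to get the decile columns mentioned in the plan.

```python
import pandas as pd

# Toy nucleus table: one row per nucleus (Area is a hypothetical feature).
nuc = pd.DataFrame({
    'PatchNumber': [1, 1, 1, 2, 2],
    'Area': [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Ask describe() for deciles instead of the default quartiles.
deciles = [i / 10 for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9
rollup = nuc.groupby('PatchNumber').describe(percentiles=deciles)

# describe() returns two-level columns such as ('Area', 'mean');
# joining the levels gives flat names such as 'Area_mean'.
rollup.columns = rollup.columns.map('_'.join)
print(rollup[['Area_count', 'Area_mean', 'Area_50%']])
```

Each patch becomes one row, and each nucleus feature expands into count, mean, std, min, max, and percentile columns.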


In [None]:
df=pd.read_csv('Nucleus_Rollup_5.csv')
df

## Memory considerations
Output0 could not be processed in one chunk. Its nucleus file was 12 GB, but our Alien computer has only 8 GB RAM, and this notebook crashed during the groupby/describe step on that file. So we split the class 0 data into two chunks of equal size. With the smaller chunks, this notebook temporarily used 85% of RAM, then dropped to 50% and grew to 60%. To split the file, we ran the following two commands, each of which keeps the header line:

cat Process100_Nucleus.csv | awk '{c++; if (c==1 || c<500010) print $0;}' > Nucleus_0_1.csv

cat Process100_Nucleus.csv | awk '{c++; if (c==1 || c>=500010) print $0;}' > Nucleus_0_2.csv
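
The same split can be done in Python by streaming the file line by line, so memory use stays small. This is a minimal sketch that mirrors the two awk commands above; the function name `split_csv` is hypothetical, and it assumes a plain CSV with a single header line and no line breaks inside fields.

```python
def split_csv(src, first_out, second_out, split_line=500010):
    """Stream src into two CSV files, copying the header line into both.

    Line numbers are 1-based, like the awk counter: lines before
    split_line go to first_out, the rest go to second_out.
    """
    with open(src) as fin, \
         open(first_out, 'w') as out1, \
         open(second_out, 'w') as out2:
        for line_number, line in enumerate(fin, start=1):
            if line_number == 1:
                # Header: both halves need it.
                out1.write(line)
                out2.write(line)
            elif line_number < split_line:
                out1.write(line)
            else:
                out2.write(line)
```

For example, `split_csv('Process100_Nucleus.csv', 'Nucleus_0_1.csv', 'Nucleus_0_2.csv')` would reproduce the two files. Note that a split at a fixed line number can place the nuclei of one patch in both files unless the line happens to fall on a patch boundary.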

In the Output6 directory, we put Nucleus_0_1.csv and a link to the Image.csv file. In the Output7 directory, we put Nucleus_0_2.csv and a link to the Image.csv file. After processing, we concatenated the results for classes 6 and 7 into class 0.
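
One way to do that final concatenation is sketched below. This is an assumption about the procedure, not a record of what was actually run: the helper name `concat_rollups` and the output name `Nucleus_Rollup_0.csv` are hypothetical.

```python
import pandas as pd

def concat_rollups(paths, outfile):
    """Stack several rollup CSV files (same columns) into one CSV file."""
    parts = [pd.read_csv(p) for p in paths]
    combined = pd.concat(parts, ignore_index=True)
    # PatchNumber is an ordinary column after read_csv, so skip the new index.
    combined.to_csv(outfile, index=False)
    return combined
```

For example: `concat_rollups(['Nucleus_Rollup_6.csv', 'Nucleus_Rollup_7.csv'], 'Nucleus_Rollup_0.csv')`.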