# Color Analysis - Nucleus rollup
Roll up the per-nucleus statistics into one row of summary statistics per patch.

## Overall Plan
* Run CellProfiler on 80K patches. Make CSV files.
* Record bounding box of every nucleus of every patch.
* Run CNN on 80K patches.
* For each class c, label correctly classified patches c_Cor.
* For each class c, label incorrectly classified patches c_Inc.
* Run CNN attention on 80K patches. Make heatmaps.
* Compute average heatmap color per nucleus bounding box.
* Set aside a test set: 20% of images (and all their patch data) per class.
* Possibly set aside patches with too little tissue, too many red blood cells (RBC), or too few nuclei.
* Remove useless columns such as XY locations.
* Add dispersion columns such as deciles.
* Train a Cor/Inc binary classifier for each class.
* Evaluate the model by cross-validation over training data.
* If the model is accurate, extract important features.
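
The dispersion step in the plan above ("add dispersion columns such as deciles") could be sketched roughly as follows. This is only an illustration on synthetic data: the `Area` column, the `nuc_demo` table, and the `Area_d10` ... `Area_d90` naming are assumptions, not names taken from the CellProfiler output.

```python
import numpy as np
import pandas as pd

# Synthetic per-nucleus table; 'Area' is a hypothetical feature name
rng = np.random.default_rng(0)
nuc_demo = pd.DataFrame({
    'PatchNumber': np.repeat([1, 2, 3], 50),
    'Area': rng.normal(100, 15, 150),
})

# Deciles 0.1 ... 0.9 of the feature within each patch
deciles = [i / 10 for i in range(1, 10)]
dec = nuc_demo.groupby('PatchNumber')['Area'].quantile(deciles).unstack()
dec.columns = [f'Area_d{int(round(q * 100))}' for q in dec.columns]

# Join the decile columns onto the usual describe() summary for the same feature
summary = nuc_demo.groupby('PatchNumber')['Area'].describe().add_prefix('Area_')
rollup_demo = summary.join(dec)
print(rollup_demo.columns.tolist())
```

Each patch ends up with one row holding both the `describe()` statistics and nine decile columns, which gives a later classifier a finer view of the distribution than quartiles alone.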

In [1]:
import datetime
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split

print(datetime.datetime.now())  # timestamp the start of the run
print('scikit-learn version', sklearn.__version__)

2022-06-02 13:49:14.202902
scikit-learn version 1.1.1


In [2]:
THIS_CLASS = 4   # use a small class for process development
NUM_CLASSES = 6
FILEPATHS = ['path'] * NUM_CLASSES   # placeholders; only THIS_CLASS gets a real path
FILEPATHS[THIS_CLASS] = '/Users/jasonmiller/WVU/Output4/'

In [3]:
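from CellProfiler_Util import CP_Util   # project-local helper module
cputil = CP_Util(FILEPATHS[THIS_CLASS])
cputil.train_test_split()   # the helper's own split, not the sklearn function
cputil.validate_split()
train_set = cputil.get_train_patches()
nuc = cputil.get_nuclei()   # per-nucleus table, rolled up by PatchNumber below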

In [None]:
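print(datetime.datetime.now())
# One row per patch: count, mean, std, min, quartiles, max of each described column (slow)
rollup = nuc.groupby(['PatchNumber']).describe()
print(datetime.datetime.now())
# Flatten the (feature, statistic) column MultiIndex into names like <feature>_mean
rollup.columns = rollup.columns.map('_'.join)  # helps random forest code
print(datetime.datetime.now())
rollup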

2022-06-02 13:49:26.319563
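
To make the rollup cell above easier to follow, here is a small sketch of what `groupby(...).describe()` and the `'_'.join` column flattening do. The toy table and its `Area` and `Intensity` columns are hypothetical stand-ins for the real nucleus table.

```python
import pandas as pd

# Toy stand-in for the nucleus table; feature names are hypothetical
toy = pd.DataFrame({
    'PatchNumber': [1, 1, 1, 2, 2],
    'Area': [10.0, 20.0, 30.0, 5.0, 15.0],
    'Intensity': [0.1, 0.2, 0.3, 0.4, 0.6],
})

toy_rollup = toy.groupby(['PatchNumber']).describe()
# Columns are a two-level MultiIndex of (feature, statistic) pairs
print(toy_rollup.columns[:3].tolist())

# Join each pair with an underscore to get flat string column names
toy_rollup.columns = toy_rollup.columns.map('_'.join)
print(toy_rollup[['Area_count', 'Area_mean', 'Area_50%']])
```

Flat string names are easier to pass to code that expects a plain feature list, and they survive a round trip through CSV as a single header row.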


In [None]:
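# Same file name as before when THIS_CLASS is 4; PatchNumber is written as the index column
rollup.to_csv('Nucleus_Rollup_' + str(THIS_CLASS))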
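
The last three steps of the plan (train a Cor/Inc classifier, evaluate it by cross-validation, extract important features) are not implemented in this notebook. A minimal sketch of how they might look is shown here, assuming scikit-learn's `RandomForestClassifier` and synthetic data. The feature names, the label rule, and the hyperparameters are all illustrative assumptions; a real run would use the rollup columns and the CNN-derived Cor/Inc labels.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_patches = 200

# Synthetic per-patch features standing in for rollup columns (hypothetical names)
X = pd.DataFrame({
    'Area_mean': rng.normal(100, 10, n_patches),
    'Area_std': rng.normal(15, 3, n_patches),
    'Intensity_mean': rng.normal(0.5, 0.1, n_patches),
})
# Synthetic label (1 = Cor, 0 = Inc) driven by one feature plus noise
y = (X['Area_mean'] + rng.normal(0, 5, n_patches) > 100).astype(int)

# 5-fold cross-validation over the training data
model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print('CV accuracy: %.2f +/- %.2f' % (scores.mean(), scores.std()))

# If the accuracy looks acceptable, fit on all training data and rank the features
model.fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))
```

Impurity-based importances from a random forest are only one way to rank features; they are used here because they come with the fitted model at no extra cost.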