# Color Analysis - Tumor Names
Version 01 of this notebook made the mistake of splitting train/test by patch number, which can place patches from the same tumor in both sets. Here we analyze tumor names so the split can be made per tumor.

# Overall Plan
* Run CellProfiler on 80K patches. Make CSV files.
* Record bounding box of every nucleus of every patch.
* Run CNN on 80K patches. 
* For each class c, label correctly classified patches c_Cor.
* For each class c, label incorrectly classified patches c_Inc.
* Run CNN attention on 80K patches. Make heatmaps.
* Compute average heatmap color per nucleus bounding box.
* Set aside test set: 20% of images (and all their patch data) per class.
* Possibly set aside patches with too little tissue, too many RBC, or too few nuclei.
* Remove useless columns such as XY locations.
* Add dispersion columns such as deciles.
* Train a Cor/Inc binary classifier for each class.
* Evaluate the model by cross-validation over training data.
* If the model is accurate, extract important features.
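
The test-set step in the plan could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy `patches` table and its tumor names (`T1` to `T5`) are made up, and it assumes each patch row carries a `TumorName` column. The idea is to split the unique tumor names and let every patch follow its tumor, so no tumor contributes patches to both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the patch table; tumor names here are hypothetical.
patches = pd.DataFrame({
    'TumorName': ['T1'] * 4 + ['T2'] * 3 + ['T3'] * 2 + ['T4'] * 5 + ['T5'] * 1,
    'PatchX': range(15),
})

# Split the unique tumor names, not the individual patches.
tumors = sorted(patches['TumorName'].unique())
train_tumors, test_tumors = train_test_split(
    tumors, test_size=0.20, random_state=12345)

# Every patch follows its tumor into exactly one side of the split.
is_test = patches['TumorName'].isin(test_tumors)
train_patches = patches[~is_test]
test_patches = patches[is_test]

# Sanity check: no tumor appears in both sets.
assert set(train_patches['TumorName']).isdisjoint(test_patches['TumorName'])
```

Note that this sets aside 20% of tumors, which is not necessarily 20% of patches, since tumors differ in patch count. scikit-learn's `GroupShuffleSplit` and `GroupKFold` offer the same grouping guarantee and may fit the cross-validation step better.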

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import sklearn
sklearn.__version__

'1.1.1'

In [2]:
RANDOM_STATE=12345
THIS_CLASS=4   # use a small class for process development
NUM_CLASSES=6
FILEPATHS=['path']*NUM_CLASSES
FILEPATHS[THIS_CLASS]='/Users/jasonmiller/WVU/Output4/'
NUCLEI_FN='Process100_Nucleus.csv'
PATCH_FN='Process100_Image.csv'
TEST_SET_ASIDE=0.20
KEY_COL='ImageNumber'

In [5]:
def load_patches(filename):
    """Load dataframe from CellProfiler Image.csv file.
    Parse strings like FILE-NAME_XCOORD_YCOORD.PNG
    # example: TCGA-DB-A4XF-01Z-00-DX1_10200_20700.png"""

    cols=['ImageNumber','FileName_Tumor']
    image_info = pd.read_csv(filename,usecols=cols)
    column_rename = {'ImageNumber':'PatchNumber','FileName_Tumor':'FileName'}
    patch_info = image_info.rename(column_rename,axis=1)
    patch_info.set_index('PatchNumber',inplace=True)
    tumor_prefix = []
    tumor_x=[]
    tumor_y=[]
    for index,row in patch_info.iterrows():
        tumor = row['FileName']
        delim = tumor.index('_')
        prefix = tumor[:delim]
        suffix = tumor[delim+1:]
        x = suffix[:suffix.index('_')]
        y = suffix[suffix.index('_')+1:suffix.index('.')]
        tumor_prefix.append(prefix)
        tumor_x.append(x)
        tumor_y.append(y)
    patch_info['TumorName']=tumor_prefix
    patch_info['PatchX']=tumor_x
    patch_info['PatchY']=tumor_y
    return patch_info
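
The per-row loop in `load_patches` could also be written as a single vectorized call. The sketch below is an alternative, not the notebook's method: it assumes every file name matches the `FILE-NAME_XCOORD_YCOORD.png` layout with numeric coordinates, and unlike the loop it stores the coordinates as integers. The `demo` frame and `pattern` are names introduced only for this illustration.

```python
import pandas as pd

# Toy frame with file names in the FILE-NAME_XCOORD_YCOORD.png layout.
demo = pd.DataFrame({'FileName': [
    'TCGA-DB-A4XF-01Z-00-DX1_10200_20700.png',
    'TCGA-S9-A6WL-01Z-00-DX1_9900_35400.png',
]})

# Named groups become the column names of the extracted frame.
pattern = r'^(?P<TumorName>[^_]+)_(?P<PatchX>\d+)_(?P<PatchY>\d+)\.png$'
parts = demo['FileName'].str.extract(pattern)

# Names that do not match come back as NaN; fail loudly instead of hiding them.
if parts.isna().any().any():
    raise ValueError('unexpected file name layout')

# Assumption: coordinates are wanted as integers rather than strings.
parts = parts.astype({'PatchX': int, 'PatchY': int})
demo = demo.join(parts)
```

Integer coordinates sort numerically (9900 before 10200), which string coordinates do not.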

In [7]:
patch_info = load_patches(FILEPATHS[THIS_CLASS]+PATCH_FN)
patch_info

PatchNumber,FileName,TumorName,PatchX,PatchY
1,TCGA-DB-A4XF-01Z-00-DX1_10200_20700.png,TCGA-DB-A4XF-01Z-00-DX1,10200,20700
2,TCGA-DB-A4XF-01Z-00-DX1_10200_28500.png,TCGA-DB-A4XF-01Z-00-DX1,10200,28500
3,TCGA-DB-A4XF-01Z-00-DX1_10200_37500.png,TCGA-DB-A4XF-01Z-00-DX1,10200,37500
4,TCGA-DB-A4XF-01Z-00-DX1_10500_27900.png,TCGA-DB-A4XF-01Z-00-DX1,10500,27900
5,TCGA-DB-A4XF-01Z-00-DX1_10500_39300.png,TCGA-DB-A4XF-01Z-00-DX1,10500,39300
...,...,...,...,...
3193,TCGA-S9-A6WL-01Z-00-DX1_9900_35400.png,TCGA-S9-A6WL-01Z-00-DX1,9900,35400
3194,TCGA-S9-A6WL-01Z-00-DX1_9900_35700.png,TCGA-S9-A6WL-01Z-00-DX1,9900,35700
3195,TCGA-S9-A6WL-01Z-00-DX1_9900_36000.png,TCGA-S9-A6WL-01Z-00-DX1,9900,36000
3196,TCGA-S9-A6WL-01Z-00-DX1_9900_36300.png,TCGA-S9-A6WL-01Z-00-DX1,9900,36300
