# Preprocess CSV files from CellProfiler

We built a pipeline to segment red blood cells (RBC), nuclei (based on hematoxylin), and nucleated cells (based on eosin).  
* CP_20220429f = Ypos  
* CP_20220429g = Yneg  

We used MeasureShape and MeasureNeighbor. Thus, we get about 60 to 68 measures per object (spreadsheet columns), in 4 csv files (3 object classes plus the per-image file), for as many objects as were detected in each patch (spreadsheet rows).  

    39616 Cells.csv 68 columns
    39616 Nuclei.csv 68 columns
    60887 RBC.csv 60 columns
    12980 Image.csv 62 columns
    153099 total
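
A listing like the one above can be reproduced with a small helper. This is a minimal sketch, not the command we actually ran: `csv_shape` and `summarize_run` are hypothetical names, and we assume each file has exactly one header line. The numbers above look like line counts that include the header (RBC.csv lists 60887, while the RBC table later reports a count of 60886), so the helper reports data rows separately.

```python
import csv
import io
from pathlib import Path

def csv_shape(handle):
    """Return (data_rows, columns) for an open CSV text handle with one header line."""
    reader = csv.reader(handle)
    header = next(reader, [])          # empty file -> no columns
    rows = sum(1 for _ in reader)      # remaining lines are data rows
    return rows, len(header)

def summarize_run(run_dir, names=("Cells.csv", "Nuclei.csv", "RBC.csv", "Image.csv")):
    """Map each CSV file name in run_dir to its (data_rows, columns) shape."""
    shapes = {}
    for name in names:
        with open(Path(run_dir) / name, newline="") as handle:
            shapes[name] = csv_shape(handle)
    return shapes

# Toy check on an in-memory file: 2 data rows, 3 columns.
demo = io.StringIO("ImageNumber,ObjectNumber,AreaShape_Area\n1,1,400\n1,2,600\n")
print(csv_shape(demo))
```

Calling `summarize_run` on one of the CellProfiler run directories would return one `(rows, columns)` pair per file.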

Here, each image is actually a patch. The image data includes fields that are redundant with the object files: the counts of RBC, nuclei, and cells. The rest of the image data seems non-redundant, including the thresholds used and the total areas covered. CellProfiler assigned image numbers in order, starting at 1 for the first patch.
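
The redundancy of the count fields can be checked by recounting object rows per image. The sketch below uses toy data in place of the real files. The column names `ImageNumber`, `ObjectNumber`, and `Count_RBC` appear in our tables, but `count_objects_per_image` is a hypothetical helper.

```python
import pandas as pd

def count_objects_per_image(df_objects, image_numbers):
    """Count object rows per ImageNumber, reporting 0 for images with no objects."""
    counts = df_objects.groupby("ImageNumber").size()
    return counts.reindex(list(image_numbers), fill_value=0)

# Toy stand-ins for Image.csv and RBC.csv.
df_image = pd.DataFrame({"ImageNumber": [1, 2, 3], "Count_RBC": [2, 0, 1]})
df_rbc = pd.DataFrame({"ImageNumber": [1, 1, 3], "ObjectNumber": [1, 2, 1]})

recount = count_objects_per_image(df_rbc, df_image["ImageNumber"])
matches = bool((recount.to_numpy() == df_image["Count_RBC"].to_numpy()).all())
print("Count_RBC agrees with the RBC rows:", matches)
```

If the same check passes on the real files, the count columns carry no information beyond what the object files already hold.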

Image.csv contains fields called Filename and URL that hold the path and filename of each patch. For example, D:Martinez/B3.2.jpg is patch 2 of tumor B3.
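
The tumor and patch number could be recovered from such a name as sketched below. We assume every patch follows the `<tumor>.<patch>.jpg` pattern of the single example above, which may not hold for all files. `parse_patch_name` is a hypothetical helper, not part of this notebook.

```python
import re

# Assumed naming pattern, generalized from the one example: <tumor>.<patch>.jpg
PATCH_NAME = re.compile(r"^(?P<tumor>[^.]+)\.(?P<patch>\d+)\.jpg$")

def parse_patch_name(path):
    """Split a patch path into (tumor, patch_number)."""
    # Keep only the final path component, for either slash style.
    name = path.replace("\\", "/").rsplit("/", 1)[-1]
    match = PATCH_NAME.match(name)
    if match is None:
        raise ValueError(f"Unexpected patch name: {name}")
    return match.group("tumor"), int(match.group("patch"))

print(parse_patch_name("D:Martinez/B3.2.jpg"))  # ('B3', 2)
```

Because these strings identify the tumor, they are dropped from the features so they cannot leak the Ypos/Yneg labels.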

We ran CellProfiler on center patches from the full training set (which excludes a 20% test set). This notebook transforms per-object metrics into per-patch metrics. It also excludes patches with too few nuclei (fewer than `MIN_NUCLEI`).
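
One way to turn per-object rows into per-patch rows is to group by `ImageNumber` and summarize each measure. The sketch below uses the mean and standard deviation as summaries. That choice, the helper name `objects_to_patches`, and the toy data are assumptions for illustration. The aggregation we actually used is not shown in this part of the notebook.

```python
import pandas as pd

MIN_NUCLEI = 2  # same threshold as the notebook constant

def objects_to_patches(df_objects, prefix):
    """Aggregate per-object rows into one row per ImageNumber (mean and std)."""
    features = df_objects.drop(columns=["ObjectNumber"])
    per_patch = features.groupby("ImageNumber").agg(["mean", "std"])
    # Flatten the (column, statistic) MultiIndex into single column names.
    per_patch.columns = [f"{prefix}_{col}_{stat}" for col, stat in per_patch.columns]
    # The std of a single object is undefined (NaN); treat it as 0 here.
    return per_patch.fillna(0)

# Toy stand-ins for Nuclei.csv and Image.csv.
df_nuclei = pd.DataFrame({
    "ImageNumber": [1, 1, 2],
    "ObjectNumber": [1, 2, 1],
    "AreaShape_Area": [400.0, 600.0, 500.0],
})
df_image = pd.DataFrame({"ImageNumber": [1, 2], "Count_Nuclei": [2, 1]})

per_patch = objects_to_patches(df_nuclei, "Nuclei")
# Keep only patches with enough nuclei, then attach the per-patch features.
kept = df_image[df_image["Count_Nuclei"] >= MIN_NUCLEI].set_index("ImageNumber")
joined = kept.join(per_patch, how="left")
print(joined)
```

In this toy example patch 2 has a single nucleus, so it is removed by the `MIN_NUCLEI` filter and only patch 1 remains.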

In [1]:
from platform import python_version
print('Python',python_version())
import numpy as np
import pandas as pd
import sklearn
print('sklearn',sklearn.__version__)

Python 3.8.10
sklearn 1.0.2


In [2]:
YPOS_DIR='/home/jrm/Martinez/CellProfilerRuns/CP_20220429f/'
YNEG_DIR='/home/jrm/Martinez/CellProfilerRuns/CP_20220429g/'
IMAGES="Image.csv"
REDS="RBC.csv"
NUKES="Nuclei.csv"
CELLS="Cells.csv"
MIN_NUCLEI = 2

In [3]:
def describe_all_columns(df):
    """Print summary statistics without truncating the column list."""
    with pd.option_context('display.max_columns', None):
        print(df.describe(include='all'))

def make_dataframe(filename, verbose=False):
    """Load a CellProfiler csv file and replace NaN with zero."""
    df = pd.read_csv(filename)
    if verbose:
        nan_count = df.isnull().sum().sum()
        print('Zero out this many NaN:', nan_count)
    df = df.fillna(0)
    return df

def shave_images(df):
    """Reduce Image.csv to informative numeric columns and filter patches."""
    # drop uninformative columns (mostly just zero)
    bad_columns = ['ProcessingStatus', 'Height_HE', 'Width_HE', 'Scaling_HE']
    bad_columns += ['Series_HE', 'Frame_HE', 'Channel_HE', 'Group_Number']
    df = df.drop(columns=bad_columns)
    # drop strings, like filepaths, that could leak the Ypos/Yneg labels
    df = df.select_dtypes(['number'])
    # drop CellProfiler timing, errors (should all be zero), and metadata stats
    df = df.drop(df.filter(regex='^Metadata_').columns, axis=1)
    df = df.drop(df.filter(regex='^ExecutionTime_').columns, axis=1)
    df = df.drop(df.filter(regex='^ModuleError_').columns, axis=1)
    # keep only patches with at least MIN_NUCLEI nuclei
    df = df[df['Count_Nuclei'] >= MIN_NUCLEI]
    return df


In [4]:
filename = YPOS_DIR+IMAGES
df_ipos = make_dataframe(filename) 
df_ipos = shave_images(df_ipos)
#describe_all_columns(df_ipos)
df_ipos.describe()

statistic,Count_Cells,Count_Nuclei,Count_RBC,Group_Index,ImageNumber,Threshold_FinalThreshold_Cells,Threshold_FinalThreshold_Nuclei,Threshold_FinalThreshold_RBC,Threshold_GuideThreshold_Cells,Threshold_GuideThreshold_Nuclei,Threshold_OrigThreshold_Cells,Threshold_OrigThreshold_Nuclei,Threshold_OrigThreshold_RBC,Threshold_SumOfEntropies_Cells,Threshold_SumOfEntropies_Nuclei,Threshold_SumOfEntropies_RBC,Threshold_WeightedVariance_Cells,Threshold_WeightedVariance_Nuclei,Threshold_WeightedVariance_RBC
count,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0,10201.0
mean,3.699343,3.699343,4.346731,6446.005098,6446.005098,0.198198,0.229849,0.447377,0.190327,0.231883,0.200201,0.229477,0.372814,-11.732075,-11.436286,-10.138482,0.948547,0.679775,2.19911
std,1.478569,1.478569,3.673069,3889.139278,3889.139278,0.070158,0.051438,0.090652,0.069516,0.057026,0.070901,0.0502,0.075544,0.797363,0.926382,0.882962,0.462473,0.577341,1.485949
min,2.0,2.0,0.0,1.0,1.0,0.000166,0.026288,0.000304,0.000165,0.022708,0.000166,0.024872,0.000254,-13.905191,-13.968753,-13.900173,0.008287,0.002667,0.034349
25%,2.0,2.0,2.0,2878.0,2878.0,0.162104,0.196336,0.406246,0.153097,0.192319,0.165786,0.196755,0.338538,-12.323464,-11.89981,-10.604216,0.594257,0.285706,1.021031
50%,3.0,3.0,4.0,6650.0,6650.0,0.197002,0.234884,0.452503,0.189919,0.242477,0.199161,0.232378,0.377086,-11.903494,-11.494668,-10.201425,0.948817,0.487546,1.87339
75%,5.0,5.0,6.0,9924.0,9924.0,0.234956,0.267813,0.49846,0.226771,0.275309,0.237304,0.266793,0.415383,-11.255853,-11.144,-9.724634,1.288512,0.88029,3.134938
max,10.0,10.0,41.0,12979.0,12979.0,0.513549,0.590891,0.707806,0.526853,0.554902,0.513549,0.601256,0.589838,-7.785976,-5.882243,0.0,2.795326,3.681442,9.340085


In [5]:
filename = YNEG_DIR+IMAGES
df_ineg = make_dataframe(filename) 
df_ineg = shave_images(df_ineg)
#describe_all_columns(df_ineg)
df_ineg.describe()

statistic,Count_Cells,Count_Nuclei,Count_RBC,Group_Index,ImageNumber,Threshold_FinalThreshold_Cells,Threshold_FinalThreshold_Nuclei,Threshold_FinalThreshold_RBC,Threshold_GuideThreshold_Cells,Threshold_GuideThreshold_Nuclei,Threshold_OrigThreshold_Cells,Threshold_OrigThreshold_Nuclei,Threshold_OrigThreshold_RBC,Threshold_SumOfEntropies_Cells,Threshold_SumOfEntropies_Nuclei,Threshold_SumOfEntropies_RBC,Threshold_WeightedVariance_Cells,Threshold_WeightedVariance_Nuclei,Threshold_WeightedVariance_RBC
count,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0,14873.0
mean,3.957036,3.957036,3.508035,9014.261615,9014.261615,0.193183,0.232311,0.414436,0.186639,0.234816,0.194638,0.231926,0.345364,-11.506303,-11.281784,-10.010848,0.787749,0.495452,1.955225
std,1.579725,1.579725,3.519821,5181.611479,5181.611479,0.079657,0.043542,0.096308,0.08113,0.048858,0.079618,0.042573,0.080257,0.917771,0.952213,1.189379,0.474008,0.463761,1.706803
min,2.0,2.0,0.0,1.0,1.0,0.00025,0.020551,0.000368,0.000269,0.020493,0.00025,0.020545,0.000307,-13.900045,-14.217982,-13.879088,0.004965,0.002697,0.010274
25%,3.0,3.0,1.0,4341.0,4341.0,0.144394,0.208142,0.379074,0.133939,0.20824,0.149888,0.207586,0.315895,-12.26399,-11.728599,-10.557552,0.385899,0.210314,0.669665
50%,4.0,4.0,3.0,9133.0,9133.0,0.195527,0.240153,0.42801,0.18699,0.247547,0.197753,0.238625,0.356675,-11.662491,-11.39296,-10.057407,0.746173,0.354169,1.406422
75%,5.0,5.0,5.0,13642.0,13642.0,0.247884,0.263945,0.471838,0.243635,0.270187,0.248085,0.263287,0.393198,-10.875579,-11.022301,-9.577021,1.140852,0.612824,2.778047
max,12.0,12.0,42.0,17912.0,17912.0,0.498625,0.394072,0.684108,0.504537,0.402519,0.502132,0.394072,0.57009,-7.53263,-5.834748,0.0,2.993972,3.754664,10.8295


In [7]:
filename = YPOS_DIR+REDS
df_rpos = make_dataframe(filename) 
describe_all_columns(df_rpos)
#df_rpos.describe()

        ImageNumber  ObjectNumber  AreaShape_Area  AreaShape_BoundingBoxArea  \
count  60886.000000  60886.000000    60886.000000               60886.000000   
mean    6761.198699      4.342230      638.321831                1149.941974   
std     3598.488363      3.513238      380.694314                 756.017250   
min        1.000000      1.000000      315.000000                 324.000000   
25%     4178.000000      2.000000      387.000000                 676.000000   
50%     6309.000000      3.000000      507.000000                 910.000000   
75%     9692.000000      6.000000      738.000000                1344.000000   
max    12979.000000     41.000000     2952.000000               12382.000000   

       AreaShape_BoundingBoxMaximum_X  AreaShape_BoundingBoxMaximum_Y  \
count                    60886.000000                    60886.000000   
mean                       129.654551                      128.257498   
std                         63.592907                       