# Preprocess csv files from CellProfiler

We built a pipeline to segment red blood cells (RBC), nuclei (based on hematoxylin), and nucleated cells (based on eosin).  
* CP_20220429f = Ypos  
* CP_20220429g = Yneg  

We used MeasureShape and MeasureNeighbor. Thus, we get 60 to 68 columns per spreadsheet (mostly measures per object), across 4 csv files (three object classes plus one per-image file), with one spreadsheet row for each object detected in each patch.

    39616 Cells.csv 68 columns
    39616 Nuclei.csv 68 columns
    60887 RBC.csv 60 columns
    12980 Image.csv 62 columns
    153099 total

Here, each image is actually a patch. The image data includes redundant fields, namely the per-patch object counts (Count_RBC, Count_Nuclei, Count_Cells), which could be recomputed from the per-object files. The rest of the image data seems non-redundant, including the thresholds used and the total areas covered. CellProfiler assigned image numbers in order, starting at 1 for the first patch.
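The redundancy of the count fields can be checked by recounting object rows per image number. The sketch below uses made-up toy frames in place of Image.csv and Nuclei.csv, and it assumes that the per-object files carry `ImageNumber` and `ObjectNumber` columns; the helper name `count_objects_per_image` is hypothetical.

```python
import pandas as pd

def count_objects_per_image(objects: pd.DataFrame, n_images: int) -> pd.Series:
    """Count object rows per ImageNumber, with 0 for patches that have no objects."""
    counts = objects.groupby("ImageNumber").size()
    # Image numbers are assumed to run 1..n_images, as described above.
    all_images = pd.RangeIndex(1, n_images + 1, name="ImageNumber")
    return counts.reindex(all_images, fill_value=0)

# Toy stand-ins for Image.csv and Nuclei.csv (values are made up).
images = pd.DataFrame({"ImageNumber": [1, 2, 3], "Count_Nuclei": [2, 0, 1]})
nuclei = pd.DataFrame({"ImageNumber": [1, 1, 3], "ObjectNumber": [1, 2, 1]})

recount = count_objects_per_image(nuclei, n_images=len(images))
is_redundant = (recount.to_numpy() == images["Count_Nuclei"].to_numpy()).all()
print("Count_Nuclei matches per-object rows:", is_redundant)
```

If the recount disagrees with the stored count on real data, that would suggest objects were filtered after counting, which is worth knowing before dropping either field.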

The Image.csv file contains fields called Filename and URL that contain the path and filename of the patch. For example, D:Martinez/B3.2.jpg is patch 2 of tumor B3.
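The naming convention in that example can be decoded with a small regular expression. This is only a sketch: it assumes every patch filename has the form `<tumor>.<patch>.jpg`, which is inferred from the single example above, and the function name `parse_patch_path` is hypothetical.

```python
import re

# Assumed pattern: tumor id, a dot, the patch number, then the .jpg extension.
PATCH_NAME = re.compile(r"^(?P<tumor>[^./\\]+)\.(?P<patch>\d+)\.jpg$", re.IGNORECASE)

def parse_patch_path(path: str):
    """Return (tumor, patch_number) from a path ending in '<tumor>.<patch>.jpg'."""
    # Take the last path component, accepting either slash style.
    basename = re.split(r"[/\\]", path)[-1]
    match = PATCH_NAME.match(basename)
    if match is None:
        raise ValueError(f"Unexpected patch filename: {basename!r}")
    return match.group("tumor"), int(match.group("patch"))

print(parse_patch_path("D:Martinez/B3.2.jpg"))  # ('B3', 2)
```

Raising on an unexpected name, rather than returning a default, makes any file that breaks the assumed convention visible immediately.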

We ran CellProfiler on center patches from the full training set (which excludes a 20% test set). In this notebook, we transform per-object metrics into per-patch metrics and exclude patches with no nuclei.
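One possible shape for that transformation is a group-by on the image number, followed by a filter on the nuclei count. The sketch below is an illustration on made-up data, not the notebook's actual method: the choice of mean and standard deviation as summary statistics, the `ImageNumber` and `ObjectNumber` columns in the per-object file, and the helper name `per_patch_features` are all assumptions.

```python
import pandas as pd

def per_patch_features(objects: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Aggregate per-object measurements to one row per patch (ImageNumber)."""
    measures = objects.drop(columns=["ObjectNumber"]).groupby("ImageNumber")
    stats = measures.agg(["mean", "std"])
    # Flatten the (measure, statistic) column MultiIndex into single names.
    stats.columns = [f"{prefix}_{col}_{stat}" for col, stat in stats.columns]
    # std is NaN for a patch with a single object; treat that as zero spread.
    return stats.fillna(0).reset_index()

# Toy stand-ins for Nuclei.csv and Image.csv (values are made up).
nuclei = pd.DataFrame({
    "ImageNumber": [1, 1, 3],
    "ObjectNumber": [1, 2, 1],
    "AreaShape_Area": [100.0, 140.0, 90.0],
})
images = pd.DataFrame({"ImageNumber": [1, 2, 3], "Count_Nuclei": [2, 0, 1]})

# Keep only patches with at least one nucleus, then attach the aggregates.
kept = images[images["Count_Nuclei"] > 0]
patches = kept.merge(per_patch_features(nuclei, "Nuclei"), on="ImageNumber", how="left")
print(patches)
```

Filtering on the count before merging means patch 2, which has no nuclei, never reaches the feature table.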

In [1]:
from platform import python_version
print('Python',python_version())
import numpy as np
import pandas as pd
import sklearn
print('sklearn',sklearn.__version__)

Python 3.8.10
sklearn 1.0.2


In [2]:
YPOS_DIR='/home/jrm/Martinez/CellProfilerRuns/CP_20220429f/'
YNEG_DIR='/home/jrm/Martinez/CellProfilerRuns/CP_20220429g/'
IMAGES="Image.csv"
REDS="RBC.csv"
NUKES="Nuclei.csv"
CELLS="Cells.csv"

In [20]:
def describe_all_columns(df):
    """Print summary statistics for every column, without column truncation."""
    with pd.option_context('display.max_columns', None):
        print(df.describe(include='all'))

def make_dataframe(filename,verbose=False):
    """Load one CellProfiler csv file, replacing every NaN with zero."""
    df = pd.read_csv(filename)
    if verbose:
        count1 = df.isnull().sum().sum()
        print('Zero out this many NaN:', count1)
    df = df.fillna(0)
    return df

def shave_images(df):
    """Reduce the Image.csv dataframe to numeric, informative columns."""
    # drop uninformative columns (mostly just zero)
    bad_columns = ['ProcessingStatus', 'Height_HE', 'Width_HE', 'Scaling_HE', 'Series_HE', 'Frame_HE', 'Group_Number']
    df = df.drop(columns=bad_columns)
    # drop strings, like filepaths, that could leak the Ypos/Yneg labels
    df = df.select_dtypes(['number'])
    # drop CellProfiler timing, errors (should all be zero), and metadata stats
    df = df.drop(df.filter(regex='^Metadata_').columns, axis=1)
    df = df.drop(df.filter(regex='^ExecutionTime_').columns, axis=1)
    df = df.drop(df.filter(regex='^ModuleError_').columns, axis=1)
    return df

# Load the Ypos image table, shave it, and summarize what remains
filename = YPOS_DIR+IMAGES
df = make_dataframe(filename)
df = shave_images(df)
describe_all_columns(df)

       Channel_HE   Count_Cells  Count_Nuclei     Count_RBC   Group_Index  \
count     12979.0  12979.000000  12979.000000  12979.000000  12979.000000   
mean         -2.0      3.052238      3.052238      4.691116   6490.000000   
std           0.0      1.817379      1.817379      3.747426   3746.858907   
min          -2.0      0.000000      0.000000      0.000000      1.000000   
25%          -2.0      2.000000      2.000000      2.000000   3245.500000   
50%          -2.0      3.000000      3.000000      4.000000   6490.000000   
75%          -2.0      4.000000      4.000000      7.000000   9734.500000   
max          -2.0     10.000000     10.000000     41.000000  12979.000000   

        ImageNumber  Threshold_FinalThreshold_Cells  \
count  12979.000000                    12979.000000   
mean    6490.000000                        0.187894   
std     3746.858907                        0.070089   
min        1.000000                        0.000166   
25%     3245.500000            

In [4]:
df.describe()

Unnamed: 0,Channel_HE,Count_Cells,Count_Nuclei,Count_RBC,ExecutionTime_01Images,ExecutionTime_02Metadata,ExecutionTime_03NamesAndTypes,ExecutionTime_04Groups,ExecutionTime_05UnmixColors,ExecutionTime_06IdentifyPrimaryObjects,...,Threshold_OrigThreshold_Cells,Threshold_OrigThreshold_Nuclei,Threshold_OrigThreshold_RBC,Threshold_SumOfEntropies_Cells,Threshold_SumOfEntropies_Nuclei,Threshold_SumOfEntropies_RBC,Threshold_WeightedVariance_Cells,Threshold_WeightedVariance_Nuclei,Threshold_WeightedVariance_RBC,Width_HE
count,12979.0,12979.0,12979.0,12979.0,12979.0,12979.0,12979.0,12979.0,12979.0,12979.0,...,12979.0,12979.0,12979.0,12979.0,12979.0,12979.0,12979.0,12979.0,12979.0,12979.0
mean,-2.0,3.052238,3.052238,4.691116,0.0,2e-06,0.070269,7e-06,0.009494,0.053006,...,0.189411,0.234294,0.371383,-11.849882,-11.319628,-10.250422,1.024952,0.647304,2.428445,224.0
std,0.0,1.817379,1.817379,3.747426,0.0,0.000194,0.024801,0.000336,0.008375,0.012499,...,0.071284,0.049905,0.076229,0.780875,0.970909,0.893448,0.479942,0.555952,1.529482,0.0
min,-2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,...,0.000166,0.024872,0.000254,-13.905191,-13.968753,-13.900173,0.008287,0.002667,0.034349,224.0
25%,-2.0,2.0,2.0,2.0,0.0,0.0,0.0625,0.0,0.0,0.046875,...,0.154098,0.202338,0.338583,-12.419574,-11.817203,-10.721451,0.658918,0.272578,1.161727,224.0
50%,-2.0,3.0,3.0,4.0,0.0,0.0,0.0625,0.0,0.015625,0.046875,...,0.189916,0.238565,0.377706,-12.049719,-11.420145,-10.333776,1.042017,0.464456,2.178341,224.0
75%,-2.0,4.0,4.0,7.0,0.0,0.0,0.078125,0.0,0.015625,0.0625,...,0.226971,0.271689,0.414974,-11.420395,-11.019305,-9.832758,1.390086,0.835508,3.505212,224.0
max,-2.0,10.0,10.0,41.0,0.0,0.015625,1.15625,0.015625,0.09375,0.1875,...,0.513549,0.601256,0.589838,-7.785976,-5.882243,0.0,2.856014,3.681442,9.340085,224.0


In [5]:
FILENAME_YPOS = '/home/jrm/Martinez/CellProfilerRuns/CP_20220417_Ypos/Nuclei.CP_20220417_Ypos.csv'
feature_vec_Ypos = make_dataframe(FILENAME_YPOS)
#feature_vec_Ypos

In [6]:
FILENAME_YNEG = '/home/jrm/Martinez/CellProfilerRuns/CP_20220417_Yneg/Nuclei.CP_20220417_Yneg.csv'
feature_vec_Yneg = make_dataframe(FILENAME_YNEG)
#feature_vec_Yneg

In [7]:
Ypos_rows,Ypos_cols = feature_vec_Ypos.shape
Yneg_rows,Yneg_cols = feature_vec_Yneg.shape
if Ypos_cols == Yneg_cols:
    print('The dataframes are compatible.')
else:
    print('ERROR! Column counts do not match.')

The dataframes are compatible.
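The check above compares only column counts, so two files with the same number of differently named columns would still pass. A stricter variant, sketched here on toy frames with hypothetical helper names, compares the column names themselves and reports which ones differ.

```python
import pandas as pd

def check_same_columns(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """True only if both frames have identical column names in the same order."""
    return list(left.columns) == list(right.columns)

def column_differences(left: pd.DataFrame, right: pd.DataFrame):
    """Return (names only in left, names only in right), each sorted."""
    only_left = sorted(set(left.columns) - set(right.columns))
    only_right = sorted(set(right.columns) - set(left.columns))
    return only_left, only_right

# Toy frames: same column count, but one name differs (values are made up).
ypos = pd.DataFrame({"ImageNumber": [1], "AreaShape_Area": [100.0]})
yneg = pd.DataFrame({"ImageNumber": [1], "AreaShape_Perimeter": [40.0]})

if check_same_columns(ypos, yneg):
    print('The dataframes are compatible.')
else:
    print('ERROR! Columns do not match:', column_differences(ypos, yneg))
```

This matters before `pd.concat`, which aligns on column names and would silently fill mismatched columns with NaN.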


In [8]:
feature_vec_all = pd.concat([feature_vec_Ypos, feature_vec_Yneg], ignore_index=True)
label_vec_Ypos = np.ones(Ypos_rows,dtype=int)
label_vec_Yneg = np.zeros(Yneg_rows,dtype=int)
label_vec_all = np.concatenate([label_vec_Ypos, label_vec_Yneg])

In [9]:
# train_test_split was never imported above, which raised a NameError
from sklearn.model_selection import train_test_split
# Default test size is 25%
Xtrain,Xtest,ytrain,ytest = train_test_split(feature_vec_all, label_vec_all.ravel(), random_state=42)
print('Xtrain',Xtrain.shape,'ytrain',ytrain.shape,'ones:',np.count_nonzero(ytrain))
print('Xtest',Xtest.shape,'ytest',ytest.shape,'ones:',np.count_nonzero(ytest))