# Preprocess csv files from CellProfiler

We built a pipeline to segment red blood cells (RBC), nuclei (based on hematoxylin), and nucleated cells (based on eosin).  
* CP_20220429f = Ypos  
* CP_20220429g = Yneg  

We used MeasureShape and MeasureNeighbor. Thus, we get 60 to 68 measures per object (spreadsheet columns), in 4 csv files (three object classes plus one per-image file), for as many objects as were detected in each patch (spreadsheet rows).  

    39616 Cells.csv 68 columns
    39616 Nuclei.csv 68 columns
    60887 RBC.csv 60 columns
    12980 Image.csv 62 columns
    153099 total

Here, each image is actually a patch. The image data includes redundant fields (the counts of RBC, nuclei, and cells, which can be recomputed from the object files). The rest of the image data seems non-redundant, including the thresholds used and the total areas covered. CellProfiler assigned image numbers in order, starting at 1 for the first patch. It would take some work to reconstruct the mapping of these numbers to tumor names (e.g. patches 1-200 came from tumor B3); check whether CellProfiler has an option to add filenames to the csv files.
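
If a filename column were available in the per-image file, the mapping from image number to tumor name could be recovered with a merge. The sketch below is only an illustration: the column name `FileName_Patch` and the `B3_patch001.tif` naming pattern are hypothetical and would need to be checked against the real Image.csv header and patch filenames.

```python
import pandas as pd

# Hypothetical per-image table: 'FileName_Patch' and the 'B3_...' naming pattern
# are illustrative assumptions, not taken from the real Image.csv.
image = pd.DataFrame({
    'ImageNumber': [1, 2, 3],
    'FileName_Patch': ['B3_patch001.tif', 'B3_patch002.tif', 'C7_patch001.tif'],
})
nuclei = pd.DataFrame({'ImageNumber': [1, 1, 3], 'ObjectNumber': [1, 2, 1]})

# Take the text before the first underscore as the tumor name.
image['Tumor'] = image['FileName_Patch'].str.split('_').str[0]

# Attach the tumor name to every nucleus through the shared ImageNumber key.
nuclei_named = nuclei.merge(image[['ImageNumber', 'Tumor']], on='ImageNumber', how='left')
print(nuclei_named)
```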

We ran CellProfiler on the full training set of center patches (this excludes a 20% test set). In this notebook, we transform per-object metrics into per-patch metrics and exclude patches with no nuclei.
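
One possible way to turn per-object rows into per-patch rows is to group by image number and summarize each measure. This is a minimal sketch on toy data, not the notebook's actual method. It assumes the object file has `ImageNumber` and `ObjectNumber` columns and that the image file has a `Count_Nuclei` column; `AreaShape_Area` stands in for the real measures, and the choice of mean and standard deviation as summaries is ours.

```python
import pandas as pd

# Toy stand-ins for Nuclei.csv and Image.csv (column names are assumptions).
nuclei = pd.DataFrame({
    'ImageNumber': [1, 1, 3],
    'ObjectNumber': [1, 2, 1],
    'AreaShape_Area': [100.0, 140.0, 90.0],
})
image = pd.DataFrame({'ImageNumber': [1, 2, 3], 'Count_Nuclei': [2, 0, 1]})

def per_patch_summary(objects, key='ImageNumber'):
    """Collapse per-object rows into one row per patch (mean and std of each measure)."""
    measures = objects.drop(columns=['ObjectNumber'])
    summary = measures.groupby(key).agg(['mean', 'std'])
    # Flatten the (measure, statistic) column pairs into single names.
    summary.columns = ['_'.join(pair) for pair in summary.columns]
    return summary.reset_index()

patch_features = per_patch_summary(nuclei)

# Keep only patches that contain at least one nucleus.
patches_with_nuclei = image.loc[image['Count_Nuclei'] > 0, 'ImageNumber']
patch_features = patch_features[patch_features['ImageNumber'].isin(patches_with_nuclei)]
print(patch_features)
```

A patch with no nuclei contributes no rows to the object file, so the grouping already leaves it out; the filter on the image table only makes that exclusion explicit. A patch with a single nucleus gets NaN for the standard deviation, which would need handling before modeling.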

In [1]:
from platform import python_version
print('Python',python_version())
import numpy as np
import pandas as pd
import sklearn
print('sklearn',sklearn.__version__)

Python 3.8.10
sklearn 1.0.2


## Load Train and Test Sets

In [4]:
def make_dataframe(filename):
    """Load a CellProfiler csv as float32, replace NaN with 0, and print summary stats."""
    df = pd.read_csv(filename,dtype=np.float32)  # remove dtype?
    count1 = df.isnull().sum().sum()
    print('Zero out this many NaN:', count1)
    df = df.fillna(0)
    count2 = df.isnull().sum().sum()
    print('Now how many NaN?:', count2)
    print('Largest value:', df.max().max())
    print('Smallest:', df.min().min())
    return df

In [5]:
FILENAME_YPOS = '/home/jrm/Martinez/CellProfilerRuns/CP_20220417_Ypos/Nuclei.CP_20220417_Ypos.csv'
feature_vec_Ypos = make_dataframe(FILENAME_YPOS)
#feature_vec_Ypos

Zero out this many NaN: 35
Now how many NaN?: 0
Largest value: 19328.0
Smallest: -89.999825


In [6]:
FILENAME_YNEG = '/home/jrm/Martinez/CellProfilerRuns/CP_20220417_Yneg/Nuclei.CP_20220417_Yneg.csv'
feature_vec_Yneg = make_dataframe(FILENAME_YNEG)
#feature_vec_Yneg

Zero out this many NaN: 20
Now how many NaN?: 0
Largest value: 25110.0
Smallest: -89.99928
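
The totals above say how many NaN were zeroed but not where they were. Before filling with 0, it may help to see which columns hold them. The helper below is a sketch on toy data; `nan_report` is a hypothetical name, not part of the notebook.

```python
import numpy as np
import pandas as pd

def nan_report(df):
    """Return the NaN count per column, listing only columns that contain NaN."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Toy frame: 'a' has one NaN, 'b' has two, 'c' has none.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 1.0],
    'c': [0.0, 1.0, 2.0],
})
report = nan_report(toy)
print(report)
```

Calling it on the frame returned by `pd.read_csv`, before `fillna(0)`, would show whether the NaN are concentrated in a few measures.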


In [7]:
Ypos_rows,Ypos_cols = feature_vec_Ypos.shape
Yneg_rows,Yneg_cols = feature_vec_Yneg.shape
if Ypos_cols == Yneg_cols:
    print('The dataframes are compatible.')
else:
    print('ERROR! Column counts do not match.')

The dataframes are compatible.
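
Equal column counts do not guarantee equal column names. `pd.concat` aligns columns by name, so a mismatch would silently add columns filled with NaN. A stricter check, sketched here on toy frames with a hypothetical helper `same_columns`, compares the names and their order.

```python
import pandas as pd

def same_columns(df_a, df_b):
    """True only if both frames have identical column names in identical order."""
    return list(df_a.columns) == list(df_b.columns)

a = pd.DataFrame({'x': [1.0], 'y': [2.0]})
b = pd.DataFrame({'x': [3.0], 'y': [4.0]})
c = pd.DataFrame({'x': [5.0], 'z': [6.0]})  # same column count as a, different names

print(same_columns(a, b))
print(same_columns(a, c))
# Concatenating mismatched frames yields the union of the columns.
print(pd.concat([a, c], ignore_index=True).shape)
```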


In [8]:
feature_vec_all = pd.concat ( [feature_vec_Ypos, feature_vec_Yneg], ignore_index=True )
label_vec_Ypos = np.ones(Ypos_rows,dtype=int)
label_vec_Yneg = np.zeros(Yneg_rows,dtype=int)
label_vec_all = np.concatenate ( [label_vec_Ypos, label_vec_Yneg] )

In [9]:
from sklearn.model_selection import train_test_split
# Default test size is 25%
Xtrain,Xtest,ytrain,ytest = train_test_split(feature_vec_all, label_vec_all.ravel(), random_state=42)
print('Xtrain',Xtrain.shape,'ytrain',ytrain.shape,'ones:',np.count_nonzero(ytrain))
print('Xtest',Xtest.shape,'ytest',ytest.shape,'ones:',np.count_nonzero(ytest))

Xtrain (28364, 68) ytrain (28364,) ones: 13621
Xtest (9455, 68) ytest (9455,) ones: 4517
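
The split above relies on the default test size and on random sampling to balance the classes. As a sketch of an alternative, not what was run here, `train_test_split` accepts an explicit `test_size` and a `stratify` argument that keeps the class proportions the same on both sides. The toy arrays below are placeholders for the real feature and label vectors.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 40 samples, 2 features, balanced labels.
X = np.arange(80, dtype=np.float32).reshape(40, 2)
y = np.array([1] * 20 + [0] * 20)

# Explicit 25% test size; stratify=y preserves the 50/50 label ratio in both parts.
Xtr, Xte, ytr, yte = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

print('train', Xtr.shape, 'ones:', np.count_nonzero(ytr))
print('test', Xte.shape, 'ones:', np.count_nonzero(yte))
```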
