# Random Forest Classifier
Center patches from two H&E slides.  
Patches are 224 x 224.  
Ypos example: D1  18K patches  
Yneg example: E5  20K patches  
Feature extraction by CellProfiler, segment nuclei, measure shape and neighbors.  
Same 68 features for both.    

In notebook 001, we managed to train an RF, getting past NaN issues.  
Here, run a simple cross validation.  
Also, fit to train data and test on test data.  

The model was 61% accurate in cross-validation.  
This result may not generalize, so do not over-interpret it.  
All patches come from only two slide images, one Ypos and one Yneg.  
The tumors may differ in ways besides Ychr.  

Our data have one row per nucleus (called 'object').  
This classifier classified each nucleus (and ImageNumber, which identifies the patch, ranked high in importance).  
We really want to classify patches, not objects.  
We could classify patches by counting objects.  
But let's see if CellProfiler could output stats per patch instead of per object.
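
A minimal sketch of the counting idea, assuming we already have one 0/1 prediction per nucleus plus its ImageNumber. The function name `patch_votes` and its `threshold` parameter are hypothetical, not something from notebook 001. Each patch is called Ypos when more than `threshold` of its nuclei were predicted Ypos.

```python
import numpy as np
import pandas as pd

def patch_votes(image_numbers, object_predictions, threshold=0.5):
    """Aggregate per-nucleus predictions (0/1) into one call per patch."""
    votes = pd.DataFrame({
        'ImageNumber': np.asarray(image_numbers),
        'pred': np.asarray(object_predictions),
    })
    # Fraction of nuclei voted Ypos, and how many nuclei voted, per patch.
    summary = votes.groupby('ImageNumber')['pred'].agg(
        fraction_Ypos='mean', num_objects='size')
    summary['patch_pred'] = (summary['fraction_Ypos'] > threshold).astype(int)
    return summary

# Toy input: patch 1 has three nuclei, patch 2 has two.
demo = patch_votes([1, 1, 1, 2, 2], [1, 1, 0, 0, 0])
print(demo)
```

ImageNumber restarts at 1 for each slide, so this would have to be applied to one slide at a time, or grouped on slide plus ImageNumber.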

In [1]:
import numpy as np
import pandas as pd
import sklearn
print(sklearn.__version__)

1.0.2


In [2]:
def make_dataframe(filename):
    """Load a CellProfiler per-object CSV as float32 and replace NaN with 0."""
    df = pd.read_csv(filename, dtype=np.float32)  # remove dtype?
    count1 = df.isnull().sum().sum()  # total NaN cells before cleanup
    print('Zero out this many NaN:', count1)
    df = df.fillna(0)
    count2 = df.isnull().sum().sum()  # should now be 0
    print('Now how many NaN?:', count2)
    print('Largest value:', df.max().max())
    print('Smallest:', df.min().min())
    return df

In [3]:
FILENAME_YPOS = '/home/jrm/Martinez/CellProfilerRuns/CP_20220417_Ypos/Nuclei.CP_20220417_Ypos.csv'
feature_vec_Ypos = make_dataframe(FILENAME_YPOS)
feature_vec_Ypos

Zero out this many NaN: 35
Now how many NaN?: 0
Largest value: 19328.0
Smallest: -89.999825


Unnamed: 0,ImageNumber,ObjectNumber,AreaShape_Area,AreaShape_BoundingBoxArea,AreaShape_BoundingBoxMaximum_X,AreaShape_BoundingBoxMaximum_Y,AreaShape_BoundingBoxMinimum_X,AreaShape_BoundingBoxMinimum_Y,AreaShape_Center_X,AreaShape_Center_Y,...,Location_Center_Y,Location_Center_Z,Neighbors_AngleBetweenNeighbors_Expanded,Neighbors_FirstClosestDistance_Expanded,Neighbors_FirstClosestObjectNumber_Expanded,Neighbors_NumberOfNeighbors_Expanded,Neighbors_PercentTouching_Expanded,Neighbors_SecondClosestDistance_Expanded,Neighbors_SecondClosestObjectNumber_Expanded,Number_Object_Number
0,1.0,1.0,843.0,1394.0,38.0,59.0,4.0,18.0,22.590748,34.474495,...,34.474495,0.0,52.354198,59.501389,3.0,2.0,70.562767,72.924950,5.0,1.0
1,1.0,2.0,1089.0,2432.0,121.0,64.0,83.0,0.0,104.489441,31.939394,...,31.939394,0.0,84.178970,43.707657,5.0,3.0,78.797470,74.513222,4.0,2.0
2,1.0,3.0,2532.0,7020.0,65.0,147.0,0.0,39.0,30.254740,93.802528,...,93.802528,0.0,75.474617,59.501389,1.0,4.0,78.444443,59.648666,5.0,3.0
3,1.0,4.0,967.0,1806.0,179.0,115.0,158.0,29.0,167.997925,70.884178,...,70.884178,0.0,115.062004,51.082310,6.0,3.0,66.080399,74.513222,2.0,4.0
4,1.0,5.0,2509.0,7656.0,118.0,134.0,60.0,2.0,85.536865,71.314865,...,71.314865,0.0,137.543488,43.707657,2.0,6.0,93.320236,59.648666,3.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18133,1434.0,14.0,2221.0,3600.0,157.0,224.0,112.0,144.0,134.861328,191.363800,...,191.363800,0.0,60.717670,50.313515,17.0,5.0,82.692307,54.764091,13.0,14.0
18134,1434.0,15.0,605.0,2014.0,100.0,218.0,47.0,180.0,72.646278,200.705780,...,200.705780,0.0,57.875278,33.664047,18.0,4.0,84.482758,45.451160,11.0,15.0
18135,1434.0,16.0,665.0,1224.0,41.0,217.0,7.0,181.0,22.075188,199.330826,...,199.330826,0.0,79.765228,18.758469,18.0,2.0,77.419357,38.419117,11.0,16.0
18136,1434.0,17.0,1995.0,3072.0,208.0,224.0,160.0,160.0,185.139343,189.518799,...,189.518799,0.0,63.777115,50.313515,14.0,3.0,57.333332,53.246357,13.0,17.0


In [4]:
FILENAME_YNEG = '/home/jrm/Martinez/CellProfilerRuns/CP_20220417_Yneg/Nuclei.CP_20220417_Yneg.csv'
feature_vec_Yneg = make_dataframe(FILENAME_YNEG)
feature_vec_Yneg

Zero out this many NaN: 20
Now how many NaN?: 0
Largest value: 25110.0
Smallest: -89.99928


Unnamed: 0,ImageNumber,ObjectNumber,AreaShape_Area,AreaShape_BoundingBoxArea,AreaShape_BoundingBoxMaximum_X,AreaShape_BoundingBoxMaximum_Y,AreaShape_BoundingBoxMinimum_X,AreaShape_BoundingBoxMinimum_Y,AreaShape_Center_X,AreaShape_Center_Y,...,Location_Center_Y,Location_Center_Z,Neighbors_AngleBetweenNeighbors_Expanded,Neighbors_FirstClosestDistance_Expanded,Neighbors_FirstClosestObjectNumber_Expanded,Neighbors_NumberOfNeighbors_Expanded,Neighbors_PercentTouching_Expanded,Neighbors_SecondClosestDistance_Expanded,Neighbors_SecondClosestObjectNumber_Expanded,Number_Object_Number
0,1.0,1.0,522.0,1260.0,52.0,30.0,10.0,0.0,35.624519,10.103448,...,10.103448,0.0,52.348301,61.014439,2.0,3.0,51.724136,62.971436,4.0,1.0
1,1.0,2.0,4162.0,5607.0,136.0,63.0,47.0,0.0,94.156418,27.335415,...,27.335415,0.0,65.665215,54.718487,4.0,4.0,79.500000,61.014439,1.0,2.0
2,1.0,3.0,699.0,1806.0,187.0,42.0,144.0,0.0,165.296143,15.801145,...,15.801145,0.0,78.999229,71.271278,6.0,2.0,52.264809,72.070114,2.0,3.0
3,1.0,4.0,1160.0,2970.0,86.0,93.0,31.0,39.0,58.443104,68.788795,...,68.788795,0.0,26.785013,38.306911,7.0,5.0,100.000000,51.534393,5.0,4.0
4,1.0,5.0,525.0,812.0,28.0,102.0,0.0,73.0,10.695238,88.080002,...,88.080002,0.0,67.267570,22.964008,8.0,2.0,72.727272,24.466272,7.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19676,1434.0,15.0,1902.0,4292.0,155.0,224.0,97.0,150.0,123.790749,186.090424,...,186.090424,0.0,34.170803,16.905943,13.0,8.0,91.735535,37.599354,11.0,15.0
19677,1434.0,16.0,1237.0,2464.0,67.0,222.0,11.0,178.0,36.443008,198.021011,...,198.021011,0.0,98.946457,25.149639,17.0,4.0,70.300751,46.310112,12.0,16.0
19678,1434.0,17.0,619.0,1116.0,71.0,224.0,35.0,193.0,57.436188,211.848145,...,211.848145,0.0,153.307632,25.149639,16.0,3.0,78.181816,41.495163,18.0,17.0
19679,1434.0,18.0,548.0,799.0,124.0,224.0,77.0,207.0,98.649635,216.675186,...,216.675186,0.0,122.564507,39.682652,15.0,2.0,71.165642,41.495163,17.0,18.0


In [5]:
Ypos_rows,Ypos_cols = feature_vec_Ypos.shape
Yneg_rows,Yneg_cols = feature_vec_Yneg.shape
# Compare column names and order, not just counts.
if feature_vec_Ypos.columns.equals(feature_vec_Yneg.columns):
    print('The dataframes are compatible.')
else:
    print('ERROR! Columns do not match.')

The dataframes are compatible.


In [6]:
feature_vec_all = pd.concat([feature_vec_Ypos, feature_vec_Yneg], ignore_index=True)
feature_vec_all

Unnamed: 0,ImageNumber,ObjectNumber,AreaShape_Area,AreaShape_BoundingBoxArea,AreaShape_BoundingBoxMaximum_X,AreaShape_BoundingBoxMaximum_Y,AreaShape_BoundingBoxMinimum_X,AreaShape_BoundingBoxMinimum_Y,AreaShape_Center_X,AreaShape_Center_Y,...,Location_Center_Y,Location_Center_Z,Neighbors_AngleBetweenNeighbors_Expanded,Neighbors_FirstClosestDistance_Expanded,Neighbors_FirstClosestObjectNumber_Expanded,Neighbors_NumberOfNeighbors_Expanded,Neighbors_PercentTouching_Expanded,Neighbors_SecondClosestDistance_Expanded,Neighbors_SecondClosestObjectNumber_Expanded,Number_Object_Number
0,1.0,1.0,843.0,1394.0,38.0,59.0,4.0,18.0,22.590748,34.474495,...,34.474495,0.0,52.354198,59.501389,3.0,2.0,70.562767,72.924950,5.0,1.0
1,1.0,2.0,1089.0,2432.0,121.0,64.0,83.0,0.0,104.489441,31.939394,...,31.939394,0.0,84.178970,43.707657,5.0,3.0,78.797470,74.513222,4.0,2.0
2,1.0,3.0,2532.0,7020.0,65.0,147.0,0.0,39.0,30.254740,93.802528,...,93.802528,0.0,75.474617,59.501389,1.0,4.0,78.444443,59.648666,5.0,3.0
3,1.0,4.0,967.0,1806.0,179.0,115.0,158.0,29.0,167.997925,70.884178,...,70.884178,0.0,115.062004,51.082310,6.0,3.0,66.080399,74.513222,2.0,4.0
4,1.0,5.0,2509.0,7656.0,118.0,134.0,60.0,2.0,85.536865,71.314865,...,71.314865,0.0,137.543488,43.707657,2.0,6.0,93.320236,59.648666,3.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37814,1434.0,15.0,1902.0,4292.0,155.0,224.0,97.0,150.0,123.790749,186.090424,...,186.090424,0.0,34.170803,16.905943,13.0,8.0,91.735535,37.599354,11.0,15.0
37815,1434.0,16.0,1237.0,2464.0,67.0,222.0,11.0,178.0,36.443008,198.021011,...,198.021011,0.0,98.946457,25.149639,17.0,4.0,70.300751,46.310112,12.0,16.0
37816,1434.0,17.0,619.0,1116.0,71.0,224.0,35.0,193.0,57.436188,211.848145,...,211.848145,0.0,153.307632,25.149639,16.0,3.0,78.181816,41.495163,18.0,17.0
37817,1434.0,18.0,548.0,799.0,124.0,224.0,77.0,207.0,98.649635,216.675186,...,216.675186,0.0,122.564507,39.682652,15.0,2.0,71.165642,41.495163,17.0,18.0
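
The table includes bookkeeping columns alongside the shape and neighbor measurements. A hedged sketch of dropping them before training, assuming ImageNumber, ObjectNumber, and Number_Object_Number are identifiers rather than morphology. The column names come from the header above; `ID_COLUMNS` and `drop_id_columns` are hypothetical names.

```python
import pandas as pd

# Assumption: these columns identify a patch or object and carry no
# biological signal. Names are taken from the dataframe header.
ID_COLUMNS = ['ImageNumber', 'ObjectNumber', 'Number_Object_Number']

def drop_id_columns(df, id_columns=ID_COLUMNS):
    """Return a copy of df without whichever identifier columns are present."""
    present = [c for c in id_columns if c in df.columns]
    return df.drop(columns=present)

# Toy frame with two identifier columns and one real measurement.
demo = pd.DataFrame({
    'ImageNumber': [1.0, 1.0],
    'ObjectNumber': [1.0, 2.0],
    'AreaShape_Area': [843.0, 1089.0],
})
features_only = drop_id_columns(demo)
print(list(features_only.columns))
```

Whether position columns such as Location_Center_X should also go is a separate judgment call and is not decided here.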


In [7]:
# Was silly to convert numpy to pandas when we need numpy eventually
#label_vec_Ypos = pd.DataFrame (np.ones(Ypos_rows,dtype=int))
#label_vec_Yneg = pd.DataFrame (np.zeros(Yneg_rows,dtype=int))
#label_vec_all = pd.concat ( [label_vec_Ypos, label_vec_Yneg], ignore_index=True )
label_vec_Ypos = np.ones(Ypos_rows,dtype=int)
label_vec_Yneg = np.zeros(Yneg_rows,dtype=int)
label_vec_all = np.concatenate ( [label_vec_Ypos, label_vec_Yneg] )
label_vec_all

array([1, 1, 1, ..., 0, 0, 0])

In [8]:
# Was looking for which data rows caused NaN errors during fit().
#feature_vec_all = feature_vec_Ypos[:1226]
#label_vec_all =     label_vec_Ypos[:1226]
#print(feature_vec_all.shape)
#pd.set_option('display.max_rows', None)
#print(feature_vec_all.iloc[-1])
#label_vec_all.shape

In [9]:
from sklearn.model_selection import train_test_split
#Xtrain,Xtest,ytrain,ytest = train_test_split(feature_vec_all, label_vec_all.ravel(), test_size=100, random_state=41)
# Default test size is 25%
Xtrain,Xtest,ytrain,ytest = train_test_split(feature_vec_all, label_vec_all.ravel(), random_state=42)
print('Xtrain',Xtrain.shape,'ytrain',ytrain.shape)
print('Xtest',Xtest.shape,'ytest',ytest.shape)

Xtrain (28364, 68) ytrain (28364,)
Xtest (9455, 68) ytest (9455,)
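
The split above is random over nuclei, so nuclei from one patch can land in both train and test, which may inflate the test score. A sketch of a patch-aware alternative using scikit-learn's `GroupShuffleSplit`. The helper `split_by_patch` is a hypothetical name, and the toy data stand in for the real features.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def split_by_patch(X, y, image_numbers, test_size=0.25, random_state=42):
    """Split rows so every patch lands wholly in train or wholly in test."""
    y = np.asarray(y)
    # ImageNumber restarts at 1 for each slide, so pair it with the 0/1
    # label to get a group ID that is unique per patch.
    groups = np.asarray(image_numbers).astype(int) * 2 + y
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=random_state)
    train_idx, test_idx = next(splitter.split(X, y, groups=groups))
    return train_idx, test_idx, groups

# Toy data: 2 slides x 4 patches x 3 nuclei.
rng = np.random.default_rng(0)
image_numbers = np.tile(np.repeat(np.arange(1, 5), 3), 2)
y_demo = np.repeat([1, 0], 12)
X_demo = rng.normal(size=(24, 5))
train_idx, test_idx, groups = split_by_patch(X_demo, y_demo, image_numbers)
shared = set(groups[train_idx]) & set(groups[test_idx])
print('patches in both sets:', len(shared))
```

The group ID trick only works for binary labels. With more slides, a slide identifier column would be the cleaner grouping key.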


In [10]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()

In [11]:
# 10-fold stratified CV, repeated 3 times, gives 30 accuracy scores.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
n_scores = cross_val_score(model, Xtrain, ytrain, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

Accuracy: 0.609 (0.009)


In [12]:
model.fit(Xtrain,ytrain)
ypred = model.predict(Xtest)
#for i in range(len(ypred)):
#    print('Actual',ytest[i],'Predict',ypred[i])

In [13]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(ytest, ypred)
cm

array([[3334, 1604],
       [2049, 2468]])
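
To read the matrix, a short sketch that derives accuracy, sensitivity, and specificity from the counts above, plus the accuracy of always guessing the larger class. It assumes scikit-learn's layout: rows are true labels, columns are predictions, in sorted label order (0 = Yneg, 1 = Ypos).

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[3334, 1604],
               [2049, 2468]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)   # fraction of Ypos nuclei called Ypos
specificity = tn / (tn + fp)   # fraction of Yneg nuclei called Yneg
# Accuracy of always predicting the larger true class in the test set.
baseline = cm.sum(axis=1).max() / cm.sum()

print('Accuracy: %.3f' % accuracy)
print('Sensitivity: %.3f' % sensitivity)
print('Specificity: %.3f' % specificity)
print('Majority baseline: %.3f' % baseline)
```

Test accuracy comes out near 0.61, in line with cross-validation, against a majority-class baseline near 0.52. The model misses Ypos nuclei more often than Yneg nuclei.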

In [15]:
print('The impurity-based feature importances.')
names = model.feature_names_in_
importances = model.feature_importances_
pairs = np.column_stack( (names,importances) )
sorted(pairs, key = lambda e:e[1], reverse=True)

The impurity-based feature importances.


[array(['AreaShape_Orientation', 0.03129507536909002], dtype=object),
 array(['AreaShape_MeanRadius', 0.020069586896440193], dtype=object),
 array(['ImageNumber', 0.019144306218596994], dtype=object),
 array(['Neighbors_SecondClosestDistance_Expanded', 0.018853010253342056],
       dtype=object),
 array(['AreaShape_Solidity', 0.018016707752642105], dtype=object),
 array(['Neighbors_FirstClosestDistance_Expanded', 0.017743964215453215],
       dtype=object),
 array(['AreaShape_Extent', 0.01749113895867147], dtype=object),
 array(['AreaShape_Compactness', 0.017351841191621486], dtype=object),
 array(['AreaShape_FormFactor', 0.01718101943401564], dtype=object),
 array(['AreaShape_Zernike_0_0', 0.01687326268082619], dtype=object),
 array(['AreaShape_Zernike_9_1', 0.01673088614226464], dtype=object),
 array(['AreaShape_Zernike_8_8', 0.016663620777941576], dtype=object),
 array(['AreaShape_Zernike_4_0', 0.01665080621415337], dtype=object),
 array(['Neighbors_AngleBetweenNeighbors_Expanded', 