# Random Forest

## Prior Progress
Random Forest on 6 classes x 100 samples per class
* Random Forest 06 - 09 ran on the "random" sample of 100 per class out of the 80K from Naved.
* Random Forest 08 used a rollup of Nucleus and RBC counts. It invoked describe() to get quartile stats per column per patch. In the six-way classification, it had 89.3% accuracy. The confusion matrix indicates slight over-prediction of class 5. The top-ranked features were the 50% and 75% quartiles of Nuc_Texture_SumAverage_Eosin_5_00_256 and there were no RBC features in the top 5.
* Random Forest 09 used a rollup of Nucleus and RBC counts. It did not invoke describe(); it used only the max per column per patch. In the six-way classification, it had 82.7% accuracy. The confusion matrix indicates slight over-prediction of class 5. The top-ranked feature was Nuc_Texture_SumAverage_Eosin_5_00_256 and there were no RBC features in the top 5.

Rollup nucleus stats to patch level: 
* Use CP_Util to obtain the 80% train set only. This uses all patches from 80% of the WSI IDs. We later discovered that multiple WSI IDs can represent the same patient+tumor+sample; they differ only by the "center" that generated the image, so the same sample may end up on both sides of the split. We must revisit this problem!
* The rollup takes an input csv with one line per nucleus and writes an output csv with one line per patch.
* We used pandas describe() to expand each feature into its count, mean, std, min, quartiles, and max.
* The rollup decreases the number of rows (instances) but increases the number of columns (features).
* Color Analysis 04 through 08 generated Nucleus_Rollup_?.csv for ?=1,2,3,4,5 (class 0 is still missing).
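
The rollup described above can be sketched as follows. This is a minimal illustration, not the code from the Color Analysis notebooks: it assumes the per-nucleus table has a `PatchNumber` column to group on, and the function name `rollup_to_patch` and the toy values are hypothetical.

```python
import pandas as pd

def rollup_to_patch(nuclei: pd.DataFrame, key: str = "PatchNumber") -> pd.DataFrame:
    """Collapse one-row-per-nucleus data to one row per patch."""
    stats = nuclei.groupby(key).describe()
    # describe() returns two-level columns such as ('AreaShape_Area', '50%');
    # join the levels to get flat names such as 'AreaShape_Area_50%'.
    stats.columns = [f"{feature}_{stat}" for feature, stat in stats.columns]
    return stats.reset_index()

# Toy input: five nuclei spread over two patches (values are made up).
nuclei = pd.DataFrame({
    "PatchNumber": [404, 404, 404, 405, 405],
    "AreaShape_Area": [10.0, 20.0, 30.0, 5.0, 15.0],
})
patches = rollup_to_patch(nuclei)
print(patches.shape)          # fewer rows, more columns
print(list(patches.columns))
```

Each input feature becomes eight output columns (count, mean, std, min, 25%, 50%, 75%, max), which is why the row count shrinks while the column count grows.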

## Next Task
Run Random Forest on the 80K set, using 5 of the 6 classes. Use the rollups described above.

In [1]:
import datetime
print(datetime.datetime.now())
from platform import python_version
print('Python',python_version())
import numpy as np
import pandas as pd
import sklearn   # pip install --upgrade scikit-learn
print('sklearn',sklearn.__version__)
import tensorflow as tf
tf.config.list_physical_devices('GPU')

2022-06-06 12:27:58.236238
Python 3.8.10
sklearn 1.0.2


[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [2]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import RFE
import joblib # used to dump/load sklearn models

In [3]:
BASE_DIR='/home/jrm/ShepherdML/TumorII/'
CLASS_FILES=[
    'Nucleus_Rollup_0.csv',
    'Nucleus_Rollup_1.csv',
    'Nucleus_Rollup_2.csv',
    'Nucleus_Rollup_3.csv',
    'Nucleus_Rollup_4.csv',
    'Nucleus_Rollup_5.csv']
CLASSES=range(1,6)   # full run: we only have classes 1 through 5 right now
CLASSES=range(4,6)   # override for this run: load just 2 small classes (4 and 5) for testing
MODELS_DIR='/home/jrm/Adjeroh/Naved/models/RandomForest.10/'
DESCRIBE=False  # just retain the mean over objects in patch (not even the count!)
DESCRIBE=True   # override for this run: compute stats for every column

In [5]:
print(datetime.datetime.now())
df = None
def load_all_classes():
    X = None
    y = None
    for i in CLASSES:
        Xi = pd.read_csv(BASE_DIR+CLASS_FILES[i])
        size = len(Xi)
        yi = np.ones(size) * i   # e.g. class 3
        if X is None:
            X = Xi
            y = yi
        else:
            X = pd.concat( (X,Xi) )
            y = np.concatenate( (y,yi) )
    X.fillna(0,inplace=True)  
    return X,y
X,y=load_all_classes()
X

2022-06-06 12:28:23.811825


Unnamed: 0,PatchNumber,ObjectNumber_count,ObjectNumber_mean,ObjectNumber_std,ObjectNumber_min,ObjectNumber_25%,ObjectNumber_50%,ObjectNumber_75%,ObjectNumber_max,AreaShape_Area_count,...,Texture_Variance_Hematoxylin_7_02_256_75%,Texture_Variance_Hematoxylin_7_02_256_max,Texture_Variance_Hematoxylin_7_03_256_count,Texture_Variance_Hematoxylin_7_03_256_mean,Texture_Variance_Hematoxylin_7_03_256_std,Texture_Variance_Hematoxylin_7_03_256_min,Texture_Variance_Hematoxylin_7_03_256_25%,Texture_Variance_Hematoxylin_7_03_256_50%,Texture_Variance_Hematoxylin_7_03_256_75%,Texture_Variance_Hematoxylin_7_03_256_max
0,404,19.0,10.0,5.627314,1.0,5.50,10.0,14.50,19.0,19.0,...,1144.132739,1478.077515,19.0,966.724256,362.321824,390.335351,742.188304,975.444261,1220.595523,1783.006359
1,405,17.0,9.0,5.049752,1.0,5.00,9.0,13.00,17.0,17.0,...,760.967400,980.534043,17.0,604.258528,178.547488,205.978733,525.962433,605.744924,655.724148,973.894819
2,406,19.0,10.0,5.627314,1.0,5.50,10.0,14.50,19.0,19.0,...,1109.622853,1710.046251,19.0,956.605344,369.223619,216.555801,757.328160,980.648035,1123.184088,1611.128641
3,407,12.0,6.5,3.605551,1.0,3.75,6.5,9.25,12.0,12.0,...,1513.105372,1724.063900,12.0,1237.469664,474.639615,360.595372,1009.561389,1262.841822,1550.750741,1961.868103
4,408,3.0,2.0,1.000000,1.0,1.50,2.0,2.50,3.0,3.0,...,893.088973,902.385591,3.0,796.103551,197.290069,608.127305,693.382321,778.637336,890.091675,1001.546013
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1190,1587,7.0,4.0,2.160247,1.0,2.50,4.0,5.50,7.0,7.0,...,2645.755171,5002.663831,7.0,2072.810441,1442.119639,682.306939,946.916962,2007.931412,2641.212717,4643.175378
1191,1588,17.0,9.0,5.049752,1.0,5.00,9.0,13.00,17.0,17.0,...,2697.422264,3144.417339,17.0,2120.125661,784.452285,435.222222,1636.718754,2149.951462,2783.133279,3362.958125
1192,1589,19.0,10.0,5.627314,1.0,5.50,10.0,14.50,19.0,19.0,...,2196.200883,2971.276407,19.0,1731.620347,639.473204,677.151060,1348.982274,1626.336416,2185.555700,2992.328549
1193,1590,33.0,17.0,9.669540,1.0,9.00,17.0,25.00,33.0,33.0,...,1931.452617,2672.847178,33.0,1330.902406,667.821030,479.463374,800.137069,1102.353972,1578.432099,2878.275735
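
The display above shows `PatchNumber` among the columns, and the importance ranking at the end of this notebook places it second. If `PatchNumber` is an identifier rather than a measurement (an assumption, not something the data confirms), it may be worth setting it aside before training. A minimal sketch, with the hypothetical helper `split_ids_from_features` and a toy frame:

```python
import pandas as pd

ID_COLUMNS = ["PatchNumber"]  # assumption: identifiers, not measurements

def split_ids_from_features(X: pd.DataFrame, id_columns=ID_COLUMNS):
    """Return (ids, features) so identifier columns never reach the classifier."""
    present = [c for c in id_columns if c in X.columns]
    return X[present].copy(), X.drop(columns=present)

# Toy frame with the same leading columns as the display above.
toy = pd.DataFrame({
    "PatchNumber": [404, 405],
    "ObjectNumber_count": [19.0, 17.0],
    "ObjectNumber_mean": [10.0, 9.0],
})
ids, features = split_ids_from_features(toy)
print(list(ids.columns))
print(list(features.columns))
```

Keeping the identifiers in a separate frame, rather than discarding them, still allows predictions to be traced back to patches.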


In [6]:
print(datetime.datetime.now())
Xtrain,Xvalid,ytrain,yvalid = train_test_split(X, y.ravel())
        # Default split is 75% train, 25% validation.
        # Pass random_state=42 to make the split reproducible.
print('Xtrain',Xtrain.shape,'ytrain',ytrain.shape,'non-zero:',np.count_nonzero(ytrain))
print('Xvalid',Xvalid.shape,'yvalid',yvalid.shape,'non-zero:',np.count_nonzero(yvalid))

2022-06-06 12:28:47.671315
Xtrain (2684, 5193) ytrain (2684,) non-zero: 2684
Xvalid (895, 5193) yvalid (895,) non-zero: 895
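
The split above is unseeded and unstratified, and the two classes are unbalanced (the validation set holds 595 rows of class 4 and 300 of class 5, per the confusion matrix below). The following sketch shows what a seeded, stratified split might look like. It runs on toy data; `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for X and y: 120 rows with a 2:1 class imbalance.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = np.array([4] * 80 + [5] * 40)

# stratify keeps the 2:1 class ratio in both subsets;
# random_state makes the split repeatable from run to run.
Xtr, Xva, ytr, yva = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42)

print("train", np.unique(ytr, return_counts=True))
print("valid", np.unique(yva, return_counts=True))
```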


In [7]:
# RandomForestClassifier can only track feature names of type string.
num_problems=0
for name in Xtrain.columns:
    if not isinstance(name,str):
        num_problems += 1
        print(type(name),name)
if num_problems==0:
    print("Ok")

Ok


In [8]:
print(datetime.datetime.now())
class RF_Util:
    def __init__(self):
        self.model=RandomForestClassifier()
    def get_model(self):
        return self.model
    def set_train(self,X,y):
        self.Xtr = X
        self.ytr = y
    def set_validation(self,X,y):
        self.Xval = X
        self.yval = y
    def fit(self):
        self.model.fit(self.Xtr,self.ytr)
        #print(dir(self.model))  # see whether feature_names_in_ got created
    def validation_accuracy(self):
        ypred = self.model.predict(self.Xval)
        matches = np.count_nonzero(self.yval==ypred)
        accuracy = 100.0 * matches / len(ypred)  # percent of validation rows predicted correctly
        return accuracy
    def validation_confusion(self):
        ypred = self.model.predict(self.Xval)
        cm = confusion_matrix(self.yval, ypred)
        return cm
    def important_features(self):
        names = self.model.feature_names_in_
        importances = self.model.feature_importances_
        # Build the table directly (no loop); columns are 0=importance, 1=feature_name.
        top_df = pd.DataFrame({0: importances, 1: names})
        top_df = top_df.sort_values(by=0, ascending=False).reset_index(drop=True)
        return top_df

2022-06-06 12:29:17.977455
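
joblib is imported and `MODELS_DIR` is defined above, but this notebook does not yet save a model. A minimal sketch of a save/load round trip follows. It uses a temporary directory as a stand-in for `MODELS_DIR`, a toy model, and a hypothetical file name.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data and a small forest, just to have something to persist.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as models_dir:    # stand-in for MODELS_DIR
    path = os.path.join(models_dir, "rf_all_features.joblib")  # hypothetical name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original.
same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print("restored model reproduces predictions:", same)
```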


In [9]:
print('Train on all Features')
rf1 = RF_Util()
rf1.set_train(Xtrain,ytrain)
rf1.set_validation(Xvalid,yvalid)
print(datetime.datetime.now())
rf1.fit()
print(datetime.datetime.now())
print('Accuracy:',rf1.validation_accuracy())
print('Confusion:')
print(rf1.validation_confusion())
print('The impurity-based feature importances.')
top = rf1.important_features()
top.head()

Train on all Features
2022-06-06 12:29:18.618257
2022-06-06 12:29:25.325565
Accuracy: 92.9608938547486
Confusion:
[[589   6]
 [ 57 243]]
The impurity-based feature importances.


Unnamed: 0,0,1
0,0.009002,Texture_InfoMeas2_Hematoxylin_3_03_256_count
1,0.008343,PatchNumber
2,0.006938,Texture_Correlation_Eosin_3_00_256_75%
3,0.006857,Neighbors_SecondClosestObjectNumber_Expanded_std
4,0.005923,Texture_Contrast_Eosin_3_03_256_count
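
The overall accuracy hides a difference between the two classes. In scikit-learn's confusion matrix, rows are true labels and columns are predicted labels, so the matrix printed above can be broken down per class as in this short worked computation. The model recovers about 99% of class 4 but only 81% of class 5.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes (4, 5),
# columns are predicted classes (4, 5).
cm = np.array([[589, 6],
               [57, 243]])

accuracy = np.trace(cm) / cm.sum()          # (589 + 243) / 895
recall = np.diag(cm) / cm.sum(axis=1)       # fraction of each true class recovered
precision = np.diag(cm) / cm.sum(axis=0)    # fraction of each predicted class that is correct

print(f"accuracy  {accuracy:.4f}")
print("recall   ", np.round(recall, 4))
print("precision", np.round(precision, 4))
```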


## Issues
* We only used two classes (4 and 5). Need to try all 5 available classes (until we get the 6th).
* We only did one train/validate split. Need to do cross-validation.
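
The cross-validation step could be sketched with the `RepeatedStratifiedKFold` and `cross_val_score` imports already present in this notebook. The example below runs on synthetic data from `make_classification`, not on the rollup files; the fold counts and forest size are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the patch-level features and labels.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0)

# 5 stratified folds, repeated twice with different shuffles: 10 scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
model = RandomForestClassifier(n_estimators=25, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, scoring="accuracy", cv=cv)

print(f"{len(scores)} folds, mean accuracy {scores.mean():.3f} "
      f"+/- {scores.std():.3f}")
```

Reporting the spread across folds, not just the mean, gives a sense of how much a single train/validate split could mislead.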