# Random Forest Classifier with Feature Elimination

Try recursive feature elimination (RFE) to improve the explainability of results when features are correlated.

Darst, B.F., Malecki, K.C. & Engelman, C.D. Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genet 19, 65 (2018). https://doi.org/10.1186/s12863-018-0633-8

Gregorutti, B., Michel, B. & Saint-Pierre, P. Correlation and variable importance in random forests. Stat Comput 27, 659–678 (2017). https://doi.org/10.1007/s11222-016-9646-1

The sklearn class RFE recursively removes the least important feature (one per round by default) until the requested number remain; by default, half of the features are kept.  
The sklearn class RFECV decides when to stop by running cross-validation after each round.  
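
To make the difference between the two classes concrete, here is a minimal sketch on synthetic data. The dataset from `make_classification`, the small forest size, and `cv=3` are assumptions chosen only to keep the example fast; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, RFECV

# Synthetic stand-in data (an assumption): 10 features, 3 of them informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=0)

# RFE: drop one feature per round (step=1) until half remain (the default).
rfe = RFE(forest, step=1).fit(X, y)
print('RFE kept', rfe.n_features_, 'features')

# RFECV: same elimination, but cross-validation chooses how many to keep.
rfecv = RFECV(forest, step=1, cv=3).fit(X, y)
print('RFECV kept', rfecv.n_features_, 'features')
```

With 10 input features, RFE keeps 5 by default, while the number RFECV keeps depends on the cross-validation scores.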


In [1]:
from platform import python_version
print('Python',python_version())
import numpy as np
import pandas as pd
import sklearn
print('sklearn',sklearn.__version__)

Python 3.8.10
sklearn 1.0.2


In [2]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import RFE

In [3]:
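# How to interpret the confusion matrix.
# sklearn puts true labels on rows and predicted labels on columns, sorted 0 then 1.
# With 1=Ypos as the positive class and 0=Yneg as the negative class:
# TN FP |  true 0, predicted 0    true 0, predicted 1
# FN TP |  true 1, predicted 0    true 1, predicted 1
# Example: both samples are truly Ypos; one is predicted 1 (TP) and one is predicted 0 (FN).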
ytest=[1,1]
ypred=[1,0]
confusion_matrix(ytest,ypred)  

array([[0, 0],
       [1, 1]])
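
In scikit-learn, `confusion_matrix` places true labels on rows and predicted labels on columns, in ascending label order. For a binary problem the four cells can be unpacked with `ravel()`, as in this small sketch. The label vectors here are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels (not from the dataset): 1 = Ypos, 0 = Yneg.
y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()  # row-major order: [[TN, FP], [FN, TP]]
print(cm)
print('TN', tn, 'FP', fp, 'FN', fn, 'TP', tp)
```

Passing `labels=[0, 1]` explicitly keeps the matrix 2x2 even when one class is absent from the inputs.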

## Load Train and Test Sets

In [4]:
def make_dataframe(filename):
    df = pd.read_csv(filename,dtype=np.float32)  # remove dtype?
    count1 = df.isnull().sum().sum()
    print('Zero out this many NaN:', count1)
    df = df.fillna(0)
    count2 = df.isnull().sum().sum()
    print('Now how many NaN?:', count2)
    print('Largest value:', df.max().max())
    print('Smallest:', df.min().min())
    return df

In [5]:
FILENAME_YPOS = '/home/jrm/Martinez/CellProfilerRuns/CP_20220417_Ypos/Nuclei.CP_20220417_Ypos.csv'
feature_vec_Ypos = make_dataframe(FILENAME_YPOS)
#feature_vec_Ypos

Zero out this many NaN: 35
Now how many NaN?: 0
Largest value: 19328.0
Smallest: -89.999825


In [6]:
FILENAME_YNEG = '/home/jrm/Martinez/CellProfilerRuns/CP_20220417_Yneg/Nuclei.CP_20220417_Yneg.csv'
feature_vec_Yneg = make_dataframe(FILENAME_YNEG)
#feature_vec_Yneg

Zero out this many NaN: 20
Now how many NaN?: 0
Largest value: 25110.0
Smallest: -89.99928


In [7]:
Ypos_rows,Ypos_cols = feature_vec_Ypos.shape
Yneg_rows,Yneg_cols = feature_vec_Yneg.shape
if Ypos_cols == Yneg_cols:
    print('The dataframes are compatible.')
else:
    print('ERROR! Column counts do not match.')

The dataframes are compatible.


In [8]:
feature_vec_all = pd.concat ( [feature_vec_Ypos, feature_vec_Yneg], ignore_index=True )
label_vec_Ypos = np.ones(Ypos_rows,dtype=int)
label_vec_Yneg = np.zeros(Yneg_rows,dtype=int)
label_vec_all = np.concatenate ( [label_vec_Ypos, label_vec_Yneg] )

In [9]:
# Default test size is 25%
Xtrain,Xtest,ytrain,ytest = train_test_split(feature_vec_all, label_vec_all.ravel(), random_state=42)
print('Xtrain',Xtrain.shape,'ytrain',ytrain.shape,'ones:',np.count_nonzero(ytrain))
print('Xtest',Xtest.shape,'ytest',ytest.shape,'ones:',np.count_nonzero(ytest))

Xtrain (28364, 68) ytrain (28364,) ones: 13621
Xtest (9455, 68) ytest (9455,) ones: 4517


## Random Forest Utility Class

In [10]:
class RF_Util:
    def __init__(self):
        self.model=RandomForestClassifier()
    def get_model(self):
        return self.model
    def set_train(self,X,y):
        self.Xtr = X
        self.ytr = y
    def set_test(self,X,y):
        self.Xte = X
        self.yte = y
    def fit(self):
        self.model.fit(self.Xtr,self.ytr)
    def test_accuracy(self):
        ypred = self.model.predict(self.Xte)
        matches = np.count_nonzero(self.yte==ypred)
        # Divide by this object's own test labels, not the global ytest.
        accuracy = 100.0 * matches / len(self.yte)
        return accuracy
    def test_confusion(self):
        ypred = self.model.predict(self.Xte)
        cm = confusion_matrix(self.yte, ypred)
        return cm
    def important_features(self):
        names = self.model.feature_names_in_
        importances = self.model.feature_importances_
        pairs = np.column_stack( (names,importances) )
        top_array = sorted(pairs, key = lambda e:e[1], reverse=True)
        # There must be a way to do this without a loop!
        top_list = []
        for i in top_array:
            # i[0]=feature_name, i[1]=importance; stored as (importance, name)
            top_list.append((i[1],i[0]))
        top_df = pd.DataFrame(top_list)
        return top_df

## Random Forest 1 - All Features

In [11]:
print('Train on all Features')
rf1 = RF_Util()
rf1.set_train(Xtrain,ytrain)
rf1.set_test(Xtest,ytest)
rf1.fit()
print('Accuracy:',rf1.test_accuracy())
print('Confusion:')
print(rf1.test_confusion())
print('The impurity-based feature importances.')
top = rf1.important_features()
top.head()

Train on all Features
Accuracy: 61.068217874140664
Confusion:
[[3273 1665]
 [2016 2501]]
The impurity-based feature importances.


|   | 0 | 1 |
|---|---|---|
| 0 | 0.031536 | AreaShape_Orientation |
| 1 | 0.019301 | Neighbors_SecondClosestDistance_Expanded |
| 2 | 0.01906 | AreaShape_MeanRadius |
| 3 | 0.018697 | ImageNumber |
| 4 | 0.018116 | AreaShape_Extent |


## Random Forest 2 - Reduced Features

In [12]:
model = rf1.get_model()
rfe = RFE(model)  # Recursive Feature Elimination wrapped around the random forest
rfe.fit(Xtrain,ytrain) # This is slow! Uses 100% cpu but 0% gpu.
print('Ranking',rfe.ranking_) # Selected features get rank=1. Large numbers mean not selected.
support = rfe.support_
no_support = np.invert(rfe.support_)
selected = rfe.feature_names_in_[rfe.support_]
not_selected = rfe.feature_names_in_[no_support]
Xtest_reduced = Xtest.drop(not_selected,axis=1)
Xtrain_reduced = Xtrain.drop(not_selected,axis=1)

Ranking [ 1 30 18 19 25 24 28 29  1 12  1 22 17  1 34  1  1 15 21 10  1 32  1  1
  1 20  1  1  1  1  1 14  7  1  9  1  1  1 16 11  1  5  4  8  1  1  1  1
  6  1  1  1  1  1  3  2  1 13  1 35  1  1 26 33 27  1 23 31]
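
The raw ranking array is hard to read without feature names. A minimal sketch of pairing ranks with column names, and of selecting columns with the boolean mask, is shown here on synthetic data. The data and the `feat_i` column names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Hypothetical stand-in data: 8 named features.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
X = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(8)])

rfe = RFE(RandomForestClassifier(n_estimators=20, random_state=0)).fit(X, y)

# Pair each rank with its column name; rank 1 means the feature was selected.
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking)

# A boolean mask keeps only the selected columns.
X_reduced = X.loc[:, rfe.support_]
print(X_reduced.columns.tolist())
```

Selecting with `support_` directly is an alternative to building a `not_selected` list and calling `drop`; both give the same columns.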


In [13]:
print('Train on reduced features')
rf2 = RF_Util()
rf2.set_train(Xtrain_reduced,ytrain) # X has fewer columns but y is unchanged
rf2.set_test(Xtest_reduced,ytest)
rf2.fit()
print('Accuracy:',rf2.test_accuracy())
print('Confusion:')
print(rf2.test_confusion())
print('The impurity-based feature importances.')
top = rf2.important_features()
top.head()

Train on reduced features
Accuracy: 61.64992067689053
Confusion:
[[3314 1624]
 [2002 2515]]
The impurity-based feature importances.


|   | 0 | 1 |
|---|---|---|
| 0 | 0.044978 | AreaShape_Orientation |
| 1 | 0.035577 | AreaShape_MeanRadius |
| 2 | 0.031821 | Neighbors_SecondClosestDistance_Expanded |
| 3 | 0.031673 | ImageNumber |
| 4 | 0.03155 | AreaShape_Extent |
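
The two single-split accuracies (about 61.07 and 61.65) differ by less than one percentage point, and the default `RandomForestClassifier()` is not seeded, so a gap that small may be within run-to-run noise. One way to check is repeated cross-validation with the already-imported `cross_val_score` and `RepeatedStratifiedKFold`. The sketch below runs on synthetic data; the dataset, fold counts, forest size, and the use of a `Pipeline` to refit RFE inside each fold are assumptions, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption), kept small so this runs quickly.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

# Baseline: random forest on all features.
forest = RandomForestClassifier(n_estimators=20, random_state=0)
scores_all = cross_val_score(forest, X, y, cv=cv, scoring='accuracy')

# Reduced: RFE sits inside the pipeline, so selection is refit per fold
# and never sees that fold's held-out samples.
reduced = Pipeline([
    ('select', RFE(RandomForestClassifier(n_estimators=20, random_state=0))),
    ('forest', RandomForestClassifier(n_estimators=20, random_state=0)),
])
scores_reduced = cross_val_score(reduced, X, y, cv=cv, scoring='accuracy')

print('all features:     %.3f +/- %.3f' % (scores_all.mean(), scores_all.std()))
print('reduced features: %.3f +/- %.3f' % (scores_reduced.mean(), scores_reduced.std()))
```

If the two means differ by less than the fold-to-fold standard deviation, the reduced model is probably best described as matching the full model rather than improving on it.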
