## <center><font color='green'> Random Forest Classifier - Rule Extraction

> https://stackoverflow.com/questions/50600290/how-extraction-decision-rules-of-random-forest-in-python

> Random Forests are a combination of tree predictors where each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The basic principle is that a group of “weak learners” can come together to form a “strong learner”. Random Forests work well for prediction because, by the law of large numbers, their generalization error converges to a limit as more trees are added, so growing a larger forest does not by itself cause overfitting. Introducing the right kind of randomness makes them accurate classifiers and regressors.

> Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction (see figure below).

<img src="./images/random_forest.jpeg" align="left" width=400>
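
The voting step described above can be sketched in a few lines of NumPy. This is only an illustration: `majority_vote` and the `votes` array are made-up names and data, not part of this notebook's pipeline. Note also that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes, so its result can differ from this sketch in close cases.

```python
import numpy as np

def majority_vote(tree_votes):
    """Return the most frequent class label per sample.

    tree_votes: integer array of shape (n_trees, n_samples) holding each
    tree's predicted class label (hypothetical illustration data).
    """
    tree_votes = np.asarray(tree_votes)
    n_classes = tree_votes.max() + 1
    # counts[c, j] = number of trees that voted for class c on sample j
    counts = np.stack([(tree_votes == c).sum(axis=0) for c in range(n_classes)])
    # argmax breaks ties in favour of the lowest class index
    return counts.argmax(axis=0)

# Three trees (rows) voting on three samples (columns)
votes = np.array([
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
])
forest_prediction = majority_vote(votes)
print(forest_prediction)
```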

In [1]:
import numpy
from sklearn.model_selection import train_test_split
from sklearn import metrics, datasets, ensemble

In [2]:
import pandas as pd 
import numpy as np
import matplotlib.pylab as plt 

import warnings
warnings.filterwarnings('ignore')

In [3]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split

In [4]:
from sklearn.ensemble import RandomForestClassifier

In [5]:
from rulefit import RuleFit

#### <font color='purple'> Load dataset  

In [6]:
train_dataset_upsampled = pd.read_csv("dataset/Resampled_neonates_train_data_4.csv")
test_dataset_upsampled = pd.read_csv("dataset/Resampled_neonates_test_data_4.csv")

X_train = train_dataset_upsampled.drop(["DEAD"], axis=1) 
y_train = train_dataset_upsampled["DEAD"]

X_test = test_dataset_upsampled.drop(["DEAD"], axis=1) 
y_test = test_dataset_upsampled["DEAD"]

#### <font color='purple'> Normalizing 

In [7]:
columns = X_train.columns.to_list() 

Min_max_scaler = MinMaxScaler().fit(X_train)

## Scaling 
X_train_mm_scaled = Min_max_scaler.transform(X_train)
X_test_mm_scaled = Min_max_scaler.transform(X_test)

## Numpy Array to DataFrame 
df_train_mm_scaled = pd.DataFrame(X_train_mm_scaled, columns = columns)
df_test_mm_scaled = pd.DataFrame(X_test_mm_scaled, columns = columns)
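
As a rough illustration of what the scaling cell above does, the sketch below reimplements the default `feature_range=(0, 1)` min-max transform by hand on toy arrays. The helper names (`fit_min_max`, `apply_min_max`) and the toy values are assumptions made for this example only. The point it tries to show is that the minimum and range are learned from the training data alone, so test values outside the training range land outside [0, 1].

```python
import numpy as np

def fit_min_max(train):
    """Learn the per-column minimum and range from the training data only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Guard against constant columns to avoid division by zero
    col_range = np.where(col_range == 0, 1.0, col_range)
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Rescale with previously learned statistics: x' = (x - min) / (max - min)."""
    return (data - col_min) / col_range

toy_train = np.array([[36.0, 100.0], [37.0, 300.0], [38.0, 200.0]])
toy_test = np.array([[39.0, 50.0]])  # deliberately outside the training range

col_min, col_range = fit_min_max(toy_train)
scaled_train = apply_min_max(toy_train, col_min, col_range)
scaled_test = apply_min_max(toy_test, col_min, col_range)
print(scaled_train)
print(scaled_test)
```

Tree splits depend only on the ordering of feature values, so this kind of rescaling should not change the partitions a forest can learn. It mainly changes the units in which split thresholds are reported.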

#### <font color='purple'> Feature Selection  

In [7]:
## Labelled copies of the scaled frames (.copy() avoids pandas' SettingWithCopyWarning)
train_mm_scaled_df = df_train_mm_scaled.copy()
train_mm_scaled_df["DEAD"] = y_train

test_mm_scaled_df = df_test_mm_scaled.copy()
test_mm_scaled_df["DEAD"] = y_test

## Score every feature by its mutual information with the target
importances = mutual_info_classif(df_train_mm_scaled, y_train)
feat_importance = pd.Series(importances, index=df_train_mm_scaled.columns)
feat_importance = feat_importance.sort_values(ascending=False)

## Keep the 30 highest-scoring features
selected_features = feat_importance[:30]
selected_features_list_mm_scaled = selected_features.index.to_list()

train_mm_scaled_df[selected_features_list_mm_scaled].head(2)
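
To make the ranking step above concrete, here is a small self-contained sketch on synthetic data. The column names (`informative`, `noise_a`, `noise_b`) and the data are invented for illustration; only `mutual_info_classif` and its `random_state` parameter come from scikit-learn. The label is built from a single column, so that column would be expected to score highest.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
toy_X = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
# The label depends only on the "informative" column
toy_y = (toy_X["informative"] > 0).astype(int)

# random_state makes the estimator's internal randomness repeatable
scores = mutual_info_classif(toy_X, toy_y, random_state=0)
ranking = pd.Series(scores, index=toy_X.columns).sort_values(ascending=False)

# Keep the top-k feature names (k=1 here, k=30 in the notebook)
top_k = ranking.index[:1].to_list()
print(ranking)
```

Passing `random_state` is optional, but without it the scores, and potentially the selected feature list, can vary slightly between runs.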


In [9]:
X_train_mm = df_train_mm_scaled[selected_features_list_mm_scaled].copy() ## scaled values, selected features only
X_test_mm = df_test_mm_scaled[selected_features_list_mm_scaled].copy()   ## scaled values, selected features only

In [20]:
## Overrides the scaled frames above: the model below is fit on the original (unscaled) values
## of the selected features, which is why the displayed data and split thresholds are in raw units
X_train_mm = X_train[selected_features_list_mm_scaled].copy()
X_test_mm = X_test[selected_features_list_mm_scaled].copy()

In [21]:
X_train_mm.rename({'temperature_mean': 'Temperature (mean)', 
           'temperature_var': 'Temperature (variance)', 
           'temperature_std': 'Temperature (std)',  
           'respRate_std': 'Respiratory Rate (std)', 
           'respRate_var': 'Respiratory Rate (variance)',
           'respRate_mean': 'Respiratory Rate (mean)', 
           'skinTemperature_var': 'Skin Temperature (variance)', 
           'skinTemperature_std': 'Skin Temperature (std)',
           'skinTemperature_mean': 'Skin Temperature (mean)',
           'heartRate_std': 'Heart Rate (std)', 
           'heartRate_var': 'Heart Rate (variance)',
           'heartRate_mean': 'Heart Rate (mean)',
           'bpCuffMean_std': 'Blood Pressure Cuff Mean (std)', 
           'bpCuffMean_var': 'Blood Pressure Cuff Mean (variance)',
           'bpCuffMean_mean': 'Blood Pressure Cuff Mean (mean)',
           'bpCuffSystolic_std': 'Blood Pressure Cuff Systolic (std)',
           'bpCuffSystolic_var': 'Blood Pressure Cuff Systolic (variance)',
           'bpCuffSystolic_mean': 'Blood Pressure Cuff Systolic (mean)',
           'bpCuffDiastolic_var': 'Blood Pressure Cuff Diastolic (variance)',
           'bpCuffDiastolic_std': 'Blood Pressure Cuff Diastolic (std)',
           'bpCuffDiastolic_mean': 'Blood Pressure Cuff Diastolic (mean)',
           'sao2_var': 'SaO2 (variance)', 
           'sao2_std': 'SaO2 (std)',
           'sao2_mean': 'SaO2 (mean)',
           'glucometer_std': 'Glucometer (std)', 
           'glucometer_var': 'Glucometer (variance)', 
           'glucometer_mean': 'Glucometer (mean)',
           'BIRTH_WEIGHT': 'Birth Weight (kg)', 
           'PLATELET': 'Platelet',
           'D10W_SUM': 'Sum of D10W (input)'}, axis=1, inplace=True)

In [22]:
X_test_mm.rename({'temperature_mean': 'Temperature (mean)', 
           'temperature_var': 'Temperature (variance)', 
           'temperature_std': 'Temperature (std)',  
           'respRate_std': 'Respiratory Rate (std)', 
           'respRate_var': 'Respiratory Rate (variance)',
           'respRate_mean': 'Respiratory Rate (mean)', 
           'skinTemperature_var': 'Skin Temperature (variance)', 
           'skinTemperature_std': 'Skin Temperature (std)',
           'skinTemperature_mean': 'Skin Temperature (mean)',
           'heartRate_std': 'Heart Rate (std)', 
           'heartRate_var': 'Heart Rate (variance)',
           'heartRate_mean': 'Heart Rate (mean)',
           'bpCuffMean_std': 'Blood Pressure Cuff Mean (std)', 
           'bpCuffMean_var': 'Blood Pressure Cuff Mean (variance)',
           'bpCuffMean_mean': 'Blood Pressure Cuff Mean (mean)',
           'bpCuffSystolic_std': 'Blood Pressure Cuff Systolic (std)',
           'bpCuffSystolic_var': 'Blood Pressure Cuff Systolic (variance)',
           'bpCuffSystolic_mean': 'Blood Pressure Cuff Systolic (mean)',
           'bpCuffDiastolic_var': 'Blood Pressure Cuff Diastolic (variance)',
           'bpCuffDiastolic_std': 'Blood Pressure Cuff Diastolic (std)',
           'bpCuffDiastolic_mean': 'Blood Pressure Cuff Diastolic (mean)',
           'sao2_var': 'SaO2 (variance)', 
           'sao2_std': 'SaO2 (std)',
           'sao2_mean': 'SaO2 (mean)',
           'glucometer_std': 'Glucometer (std)', 
           'glucometer_var': 'Glucometer (variance)', 
           'glucometer_mean': 'Glucometer (mean)',
           'BIRTH_WEIGHT': 'Birth Weight (kg)', 
           'PLATELET': 'Platelet',
           'D10W_SUM': 'Sum of D10W (input)'}, axis=1, inplace=True)
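
The two rename cells above repeat the same 30-entry mapping. One possible way to avoid the duplication is to build the mapping once from the signal prefixes and statistic suffixes and reuse it for both frames. The names `SIGNAL_NAMES`, `STAT_NAMES`, `EXTRA_NAMES`, `build_rename_map` and the toy frame are assumptions introduced for this sketch.

```python
import pandas as pd

# Display name for each vital-sign prefix used in the raw column names
SIGNAL_NAMES = {
    "temperature": "Temperature",
    "respRate": "Respiratory Rate",
    "skinTemperature": "Skin Temperature",
    "heartRate": "Heart Rate",
    "bpCuffMean": "Blood Pressure Cuff Mean",
    "bpCuffSystolic": "Blood Pressure Cuff Systolic",
    "bpCuffDiastolic": "Blood Pressure Cuff Diastolic",
    "sao2": "SaO2",
    "glucometer": "Glucometer",
}
# Display name for each statistic suffix
STAT_NAMES = {"mean": "mean", "std": "std", "var": "variance"}
# Columns that do not follow the <signal>_<stat> pattern
EXTRA_NAMES = {
    "BIRTH_WEIGHT": "Birth Weight (kg)",
    "PLATELET": "Platelet",
    "D10W_SUM": "Sum of D10W (input)",
}

def build_rename_map():
    """Combine every signal with every statistic, then add the extra columns."""
    mapping = {
        f"{signal}_{stat}": f"{label} ({stat_label})"
        for signal, label in SIGNAL_NAMES.items()
        for stat, stat_label in STAT_NAMES.items()
    }
    mapping.update(EXTRA_NAMES)
    return mapping

RENAME_MAP = build_rename_map()

# Toy frame standing in for X_train_mm / X_test_mm
toy = pd.DataFrame({"heartRate_mean": [140.0], "sao2_var": [7.2], "PLATELET": [193.0]})
renamed = toy.rename(columns=RENAME_MAP)
print(renamed.columns.to_list())
```

With a shared mapping, both frames could be renamed with `rename(columns=RENAME_MAP)`, which keeps the train and test column names in sync and confines any label typo to one place.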

In [28]:
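## Export features plus target from a separate frame, so DEAD is never added to X_train_mm (the model input)
train_export = X_train_mm.assign(DEAD=y_train)
train_export.to_csv("train_classification_dataset.csv", index=False)

In [29]:
train_export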

| | Temperature (mean) | Respiratory Rate (std) | Respiratory Rate (variance) | Skin Temperature (variance) | Skin Temperature (std) | Heart Rate (std) | Heart Rate (variance) | Blood Pressure Cuff Mean (variance) | SaO2 (std) | SaO2 (variance) | ... | Blood Pressure Cuff Systolic (mean) | Blood Pressure Cuff Diastolic (mean) | Temperature (std) | Glucometer (mean) | Temperature (variance) | SaO2 (mean) | Blood Pressure Cuff Mean (mean) | Platelet | Sum of D10W (input) | DEAD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36.165218 | 14.233050 | 202.579710 | 0.133913 | 0.365941 | 10.892675 | 118.650362 | 23.982143 | 2.686183 | 7.215580 | ... | 54.600000 | 31.200000 | 0.503270 | 85.285714 | 0.253280 | 95.791667 | 40.625000 | 193.0 | 115.999999 | 0 |
| 1 | 35.658334 | 19.684283 | 387.471014 | 0.055958 | 0.236555 | 13.269776 | 176.086957 | 9.142857 | 2.222660 | 4.940217 | ... | 58.428571 | 32.714286 | 0.246645 | 76.666667 | 0.060834 | 97.375000 | 42.142857 | 258.0 | 157.600001 | 0 |
| 2 | 35.608333 | 19.811430 | 392.492754 | 0.094493 | 0.307397 | 8.722368 | 76.079710 | 49.839286 | 3.583375 | 12.840580 | ... | 61.375000 | 30.375000 | 0.270131 | 93.800000 | 0.072971 | 96.666667 | 40.875000 | 245.0 | 159.600000 | 0 |
| 3 | 36.004999 | 19.231428 | 369.847826 | 0.151869 | 0.389703 | 11.852964 | 140.492754 | 41.333333 | 2.496374 | 6.231884 | ... | 64.000000 | 43.000000 | 0.374833 | 78.666667 | 0.140500 | 97.166667 | 48.333333 | 287.0 | 189.000001 | 0 |
| 4 | 36.158334 | 14.442289 | 208.579710 | 0.011884 | 0.109014 | 11.374313 | 129.375000 | 74.564103 | 0.916831 | 0.840580 | ... | 65.200000 | 38.500000 | 0.190917 | 102.166667 | 0.036449 | 99.333333 | 42.307692 | 295.0 | 187.900000 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4315 | 36.017391 | 13.774320 | 189.731884 | 0.008775 | 0.093673 | 5.496870 | 30.215580 | 153.300000 | 2.626258 | 6.897233 | ... | 67.600000 | 35.400000 | 0.049103 | 111.800000 | 0.002411 | 93.521739 | 45.600000 | 255.0 | 26.500000 | 1 |
| 4316 | 36.062500 | 5.727681 | 32.806324 | 0.025634 | 0.160107 | 11.760903 | 138.318841 | 2.250000 | 1.888178 | 3.565217 | ... | 60.500000 | 33.250000 | 0.087539 | 58.600000 | 0.007663 | 98.000000 | 43.750000 | 380.0 | 122.200000 | 1 |
| 4317 | 36.017391 | 13.774320 | 189.731884 | 0.008775 | 0.093673 | 5.496870 | 30.215580 | 153.300000 | 2.626258 | 6.897233 | ... | 67.600000 | 35.400000 | 0.049103 | 111.800000 | 0.002411 | 93.521739 | 45.600000 | 255.0 | 26.500000 | 1 |
| 4318 | 36.244444 | 7.701700 | 59.316176 | 0.039015 | 0.197522 | 8.833426 | 78.029412 | 142.000000 | 1.944071 | 3.779412 | ... | 61.428571 | 37.285714 | 0.328295 | 58.500000 | 0.107777 | 98.176471 | 42.000000 | 233.0 | 76.699999 | 1 |


#### <font color='purple'> Model and Fit

In [23]:
model_rf = RandomForestClassifier(criterion='gini', n_estimators=80, max_depth=2)

model_rf.fit(X_train_mm, y_train)

print(f'Random Forest Classifier accuracy: {round(model_rf.score(X_test_mm, y_test), 3)}')

Random Forest Classifier accuracy: 0.957


In [24]:
def print_decision_rules(rf, columns):
    """Print every node of every tree in a fitted forest as an if/else rule."""
    for tree_idx, est in enumerate(rf.estimators_):
        tree = est.tree_
        assert tree.value.shape[1] == 1 # no support for multi-output

        print('TREE: {}'.format(tree_idx))

        iterator = enumerate(zip(tree.children_left, tree.children_right, tree.feature, tree.threshold, tree.value))
        for node_idx, data in iterator:
            left, right, feature, th, value = data

            # left: index of the left child (-1 for a leaf)
            # right: index of the right child (-1 for a leaf)
            # feature: index of the feature tested at this node
            # th: the threshold to compare against
            # value: per-class weights of the training samples that reach this node

            # the predicted class is the one with the largest weight
            class_idx = numpy.argmax(value[0])

            if left == -1 and right == -1:
                print('{} LEAF: return class={}'.format(node_idx, class_idx))
            else:
                # scikit-learn sends samples with feature <= threshold to the left child
                print('{} NODE: if [{}] <= {} then next={} else next={}'.format(node_idx, columns[feature], th, left, right))

In [26]:
## Note: refitting without a fixed random_state builds a new forest,
## so the rules printed here need not match the model scored above
model_rf.fit(X_train_mm, y_train)

print_decision_rules(model_rf, X_train_mm.columns.to_list())

TREE: 0
0 NODE: if [Temperature (mean)] <= 36.01213264465332 then next=1 else next=4
1 NODE: if [Blood Pressure Cuff Systolic (mean)] <= 79.96666717529297 then next=2 else next=3
2 LEAF: return class=0
3 LEAF: return class=1
4 NODE: if [SaO2 (std)] <= 1.8879384398460388 then next=5 else next=6
5 LEAF: return class=0
6 LEAF: return class=1
TREE: 1
0 NODE: if [Respiratory Rate (std)] <= 6.013504981994629 then next=1 else next=4
1 NODE: if [Skin Temperature (variance)] <= 0.013913213275372982 then next=2 else next=3
2 LEAF: return class=0
3 LEAF: return class=1
4 NODE: if [Birth Weight (kg)] <= 1.2774999737739563 then next=5 else next=6
5 LEAF: return class=1
6 LEAF: return class=0
TREE: 2
0 NODE: if [SaO2 (variance)] <= 24.473731994628906 then next=1 else next=4
1 NODE: if [SaO2 (mean)] <= 95.47916793823242 then next=2 else next=3
2 LEAF: return class=1
3 LEAF: return class=0
4 NODE: if [Birth Weight (kg)] <= 2.575000047683716 then next=5 else next=6
5 LEAF: return class=1
6 LEAF: return class=0
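
The node-by-node listing above can also be flattened into one rule per leaf, which is often easier to read: each rule is the conjunction of conditions on the path from the root to that leaf. The sketch below shows one way this might be done. `extract_rule_paths` is a hypothetical helper, and the Iris data and `toy_forest` stand in for the notebook's dataset and `model_rf` so that the example runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def extract_rule_paths(estimator, feature_names):
    """Return one (conditions, predicted_class_index) pair per leaf of a fitted tree."""
    tree = estimator.tree_
    rules = []

    def walk(node, conditions):
        left, right = tree.children_left[node], tree.children_right[node]
        if left == -1 and right == -1:
            # Leaf: predict the class with the largest weight at this node
            class_idx = int(np.argmax(tree.value[node][0]))
            rules.append((conditions, class_idx))
            return
        name = feature_names[tree.feature[node]]
        threshold = tree.threshold[node]
        # Samples with feature <= threshold go left, the rest go right
        walk(left, conditions + [f"{name} <= {threshold:.3f}"])
        walk(right, conditions + [f"{name} > {threshold:.3f}"])

    walk(0, [])
    return rules

# Toy stand-in for the notebook's model: a small, shallow forest on Iris
iris = load_iris()
toy_forest = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0)
toy_forest.fit(iris.data, iris.target)

first_tree_rules = extract_rule_paths(toy_forest.estimators_[0], iris.feature_names)
for conditions, class_idx in first_tree_rules:
    print("IF " + " AND ".join(conditions) + f" THEN class={class_idx}")
```

The returned class index refers to a position in the forest's `classes_` array. With `max_depth=2`, as in the model above, each rule has at most two conditions.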
