# Introduction

In this notebook, we implement the paper [CHIRPS](https://link.springer.com/article/10.1007/s10462-020-09833-6). CHIRPS is a method for explaining the decision-making process of a random forest classifier. In other words, it generates data-driven rules that describe how the random forest classifier makes its decisions. For instance, suppose we have data with a single feature $X \in \mathbb{R}$ and two classes $\{0,1\}$. Also suppose that observations with positive values of $X$ have label $1$, and those with negative values of $X$ have label $0$. If we model this data with a random forest, a simple decision the forest can make is to check whether $X$ is positive. Therefore, in this very simple case, the rule is:

$(X > 0) \longrightarrow \text{label} = 1$
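To make the toy example concrete, the sketch below builds such a one-feature dataset and checks where each tree places its first split. The data, the sample size, and the names `X_toy`, `y_toy`, and `forest` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: one feature, label 1 when the feature is positive
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1.0, 1.0, size=(200, 1))
y_toy = (X_toy[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Node 0 of each fitted tree is its root; `threshold[0]` is the split value there
root_thresholds = np.array([tree.tree_.threshold[0] for tree in forest.estimators_])
print(root_thresholds.round(3))

# The forest should reproduce the rule (X > 0) -> label = 1
print(forest.predict([[-0.5], [0.5]]))
```

Each root threshold should sit close to zero, which is the forest's learned version of the rule above.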

# Import libraries

In [4]:
import numpy as np
from sklearn import datasets 
from sklearn.ensemble import RandomForestClassifier

# Implement AdjacentSpaces
**See Algorithm 1 on page 18**

In [12]:
def _get_adjacent_identifier(subspace_range, space_range):
    """
    Find the identifiers of the spaces adjacent to `subspace_range` within `space_range`.

    Parameters
    ----------
    subspace_range : numpy.ndarray
        Has shape (p, 2), where p is the number of dimensions. The first column holds the lower bounds and the
        second column holds the upper bounds.

    space_range : numpy.ndarray
        Has shape (p, 2), where p is the number of dimensions. The first column holds the lower bounds and the
        second column holds the upper bounds. The space defined by subspace_range is a subset of this space.

    Returns
    -------
    adj_identifier_below : numpy.ndarray
        Has shape (p, 2), where the i-th row is the identifier of the adjacent space that lies below the
        subspace in the i-th dimension.

    adj_identifier_above : numpy.ndarray
        Has shape (p, 2), where the i-th row is the identifier of the adjacent space that lies above the
        subspace in the i-th dimension.

    Notes
    -----
    Each row of an `adj_identifier` array holds the bounds of one adjacent space in a single dimension only.
    The i-th row gives the bounds of that adjacent space in the i-th dimension; its bounds in every other
    dimension are the same as those given in `subspace_range`.
    """
    if subspace_range.shape != space_range.shape:
        raise ValueError("The two inputs must have the same shape.")
        
    # Index with [0] / [1] to keep a (p, 1) column, so each bound broadcasts across its own row
    if (np.any(subspace_range < space_range[:, [0]])
        or
        np.any(subspace_range > space_range[:, [1]])
    ):
        raise ValueError("subspace_range is not fully enclosed by space_range")
        
    
    adj_identifier_below = np.c_[space_range[:,0], subspace_range[:,0]]
    adj_identifier_above = np.c_[subspace_range[:,1], space_range[:,1]]
    
    return adj_identifier_below, adj_identifier_above

In [13]:
space_range = np.array([[-10.0, 10.0],[-5.0, 5.0]], dtype=np.float64)
subspace_range = np.array([[-2.0, 2.0],[-4.0, 4.0]], dtype=np.float64)

adj_identifier_below, adj_identifier_above = _get_adjacent_identifier(subspace_range, space_range)

In [14]:
adj_identifier_below

array([[-10.,  -2.],
       [ -5.,  -4.]])

Consider the first row of `adj_identifier_below`, i.e., `[-10., -2.]`. This row gives the bounds of the adjacent space in the first dimension; its bounds in the other dimension come from `subspace_range`. The bounds in the second dimension are therefore `subspace_range[1]`, which is `[-4.0, 4.0]`.

Note that the subspace spans `[-2.0, 2.0]` in the first dimension, whereas this adjacent space spans `[-10.0, -2.0]`. In other words, it lies to the left of (below) the subspace along that dimension. The other adjacent spaces can be found similarly.

In [17]:
adj_identifier_above

array([[ 2., 10.],
       [ 4.,  5.]])
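Because each identifier row stores the bounds of only one dimension, it can help to expand the identifiers into full `(p, 2)` ranges. The helper below, `expand_adjacent_spaces`, is a hypothetical illustration and not part of the paper's algorithm; it assumes the identifier layout described in the docstring above.

```python
import numpy as np

def expand_adjacent_spaces(adj_identifier, subspace_range):
    """Return an array of shape (p, p, 2): the full range of each adjacent space."""
    p = subspace_range.shape[0]
    # Start from p copies of the subspace bounds
    adjacent = np.repeat(subspace_range[np.newaxis, :, :], p, axis=0)
    # In the i-th copy, overwrite the i-th dimension with the i-th identifier row
    idx = np.arange(p)
    adjacent[idx, idx, :] = adj_identifier
    return adjacent

subspace_range = np.array([[-2.0, 2.0], [-4.0, 4.0]])
adj_identifier_below = np.array([[-10.0, -2.0], [-5.0, -4.0]])

adjacent_below = expand_adjacent_spaces(adj_identifier_below, subspace_range)
print(adjacent_below[0])  # adjacent space below the subspace in the first dimension
print(adjacent_below[1])  # adjacent space below the subspace in the second dimension
```

Here `adjacent_below[0]` combines `[-10., -2.]` in the first dimension with the subspace bounds `[-4., 4.]` in the second, matching the reading of the first identifier row given above.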

# RandomForest Path

The idea is to extract the decision paths from the random forest and then rank them.

### Extracting Random Forest Paths

In [19]:
iris_data = datasets.load_iris()
X = iris_data['data']
y = iris_data['target']

# Drop the class with label y == 2, leaving a binary problem
mask = y == 2
X = X[~mask]
y = y[~mask]

In [20]:
X.shape

(100, 4)

In [24]:
seed = 0
classifier = RandomForestClassifier(random_state=seed).fit(X, y)
indicator, n_nodes_ptr = classifier.decision_path(X)

In [25]:
indicator = indicator.toarray()
indicator

array([[1, 1, 0, ..., 1, 1, 0],
       [1, 1, 0, ..., 1, 1, 0],
       [1, 1, 0, ..., 1, 1, 0],
       ...,
       [1, 0, 1, ..., 1, 0, 1],
       [1, 0, 1, ..., 1, 0, 1],
       [1, 0, 1, ..., 1, 0, 1]], dtype=int64)

In [26]:
n_nodes_ptr

array([  0,   3,   6,  13,  22,  25,  28,  31,  34,  37,  40,  43,  46,
        49,  56,  59,  66,  69,  76,  83,  86,  89,  92,  95,  98, 101,
       104, 107, 110, 113, 116, 119, 122, 125, 136, 139, 142, 145, 148,
       151, 154, 157, 160, 163, 166, 173, 176, 179, 182, 185, 188, 191,
       194, 197, 200, 203, 206, 209, 212, 215, 218, 221, 224, 227, 234,
       243, 246, 253, 256, 259, 266, 275, 278, 281, 284, 287, 294, 297,
       300, 307, 310, 315, 318, 321, 324, 327, 330, 333, 336, 339, 342,
       345, 348, 351, 354, 357, 360, 363, 366, 369, 372], dtype=int32)

What do these two outputs, `indicator` and `n_nodes_ptr`, tell us?
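According to the scikit-learn documentation for `decision_path`, `indicator` has one row per sample and one column per node across all trees, and the columns `n_nodes_ptr[i]:n_nodes_ptr[i+1]` belong to the i-th tree. A nonzero entry means the sample passes through that node. The sketch below refits the same model under separate names (`X_demo`, `forest`, and so on, all chosen here for illustration) and reads off the path of one sample in one tree.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
keep = iris["target"] != 2
X_demo, y_demo = iris["data"][keep], iris["target"][keep]

forest = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)
indicator_demo, ptr_demo = forest.decision_path(X_demo)

tree_idx, sample_idx = 0, 0
# Columns belonging to the chosen tree
start, stop = ptr_demo[tree_idx], ptr_demo[tree_idx + 1]
tree_block = indicator_demo[:, start:stop].toarray()

# Node ids (local to this tree) visited by the chosen sample, from root to leaf
path_nodes = np.flatnonzero(tree_block[sample_idx])
print(f"tree {tree_idx} has {stop - start} nodes; sample {sample_idx} visits {path_nodes}")
```

The first visited node is always the root (node 0), and the last one is the leaf that `forest.apply` reports for that sample and tree.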