## Random Forest Interpretation
<br> <br> 
#### - Confidence Based on Tree Variance
> The variance of the predictions across the trees. The forest's prediction is usually the average of the individual trees' predictions of the dependent variable. This confidence measure calculates the variance of the tree predictions around that final prediction (the overall average).

<br> 
Calculate the variance for a group of samples/observations to check how much the trees disagree on the prediction. If the variance is high, we may not be confident in the prediction result.
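
A minimal sketch of this idea, assuming a scikit-learn `RandomForestRegressor` and synthetic data (the data and the names `preds`, `pred_mean` and `pred_std` are made up for illustration, not taken from the notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the real training set (an assumption for illustration).
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 10 * X[:, 0] + rng.normal(scale=0.5, size=200)

m = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# One row of predictions per tree: shape (n_trees, n_samples).
preds = np.stack([t.predict(X) for t in m.estimators_])

pred_mean = preds.mean(axis=0)  # the forest's prediction for each row
pred_std = preds.std(axis=0)    # spread across trees; larger means less confidence
```

Rows with a large `pred_std` are the ones where the trees disagree most, so their predictions deserve the least trust.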
<br> <br> 
<br> <br> <br> 
#### - Feature Importance
> Check which features matter most to the prediction result. This is a good tool for analysis: unimportant columns can be removed, or features re-weighted, to improve the prediction result.


<br> 
Steps:
* Separate the dataset into independent variables `X_train` and the dependent variable `y_train` for training.
* Use the training data to build a random forest model.
* Compare the prediction result `y_pred` with `y_train` and calculate `score`.
* The goal is to find which features (columns in `X_train`) influence the prediction most, without rebuilding the model.
* Shuffle one column and feed the new dataset to the model again to get the prediction result `y_sub`.
* Compare the prediction result `y_sub` with `y_train` and calculate `score_sub`.
* Compare `score_sub` with `score`: the bigger the drop, the more the model relied on that column.
* Repeat for every column and rank the `score_sub`s to check which features have a great effect on the overall prediction.
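
The steps above can be sketched as follows. This is an illustrative sketch on synthetic data, assuming the score is scikit-learn's R² (`m.score`); the column names `a`, `b`, `noise` and the `importance` dictionary are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data (an assumption): 'a' matters most, 'noise' not at all.
rng = np.random.RandomState(0)
X_train = pd.DataFrame({'a': rng.rand(300), 'b': rng.rand(300), 'noise': rng.rand(300)})
y_train = 10 * X_train['a'].values + 2 * X_train['b'].values

m = RandomForestRegressor(n_estimators=30, random_state=0).fit(X_train, y_train)
score = m.score(X_train, y_train)  # baseline R^2 with all columns intact

importance = {}
for col in X_train.columns:
    X_sub = X_train.copy()
    X_sub[col] = rng.permutation(X_sub[col].values)  # shuffle one column only
    score_sub = m.score(X_sub, y_train)              # same model, no retraining
    importance[col] = score - score_sub              # drop in score for this column

# Features sorted from most to least important.
ranking = sorted(importance, key=importance.get, reverse=True)
```

The model is never rebuilt; only the input is perturbed, which is why this check is cheap compared with retraining once per column.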

<br> <br> <br> 
#### - Partial Dependence
> Analyze an individual factor that influences the overall prediction result, assuming that all other features are held equal. Used to specify the relationship between one factor and the prediction. For example, how did `YearMade` influence the dependent variable `SalePrice` over the past 50 years?
<br> <br> It pulls out the underlying truth behind feature importance.


<br> 
Steps:
* Leave the other factors as they are in the dataset.
* For the feature being studied, set the whole column to a __constant value__ each time, instead of shuffling the column as in feature importance.
<br><br>
[The rest is the same as calculating feature importance: feed the modified dataset to the model and get the predictions.]<br><br>
* Plot one line __for each sample__ (row in the dataset) as the __constant value__ changes.
* Calculate the median value across the lines of all samples.
* Do cluster analysis to find a few different shapes (one shape per cluster).
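
A rough sketch of these steps on synthetic data, assuming a scikit-learn forest. The column `Hours`, the grid of years, and the names `ice` and `pdp_median` are assumptions for illustration; plotting and the cluster analysis step are left out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption): the target rises with YearMade and falls with Hours.
rng = np.random.RandomState(0)
X = pd.DataFrame({'YearMade': rng.randint(1970, 2011, size=300),
                  'Hours': rng.rand(300) * 1000})
y = 0.05 * (X['YearMade'].values - 1970) - 0.001 * X['Hours'].values

m = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

grid = np.arange(1970, 2011, 5)        # constant values to try for YearMade
ice = np.empty((len(X), len(grid)))    # one line per sample, one column per value
for j, year in enumerate(grid):
    X_const = X.copy()
    X_const['YearMade'] = year         # every row gets the same YearMade
    ice[:, j] = m.predict(X_const)     # all other columns stay as they are

pdp_median = np.median(ice, axis=0)    # median line across all samples
```

Each row of `ice` is the line for one sample; `pdp_median` is the summary curve of the prediction against `YearMade`.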

<br> <br> <br> 
#### - Tree Interpreter
> One observation (one row in the dataset) can only take __one path__ through the tree. Check how each decision or feature influences the final prediction result __for a particular observation.__<br><br>
Pick out __one particular__ row/observation.<br> <br>
Get the `prediction` for that observation and the overall `bias` for all data (the average value at the root of the tree).<br> <br> 
Get the `contributions`, which measure how much each feature moved the prediction for this observation.

<br> 
Steps:
* Trace the path back after the observation reaches its leaf node.
* Check how much each decision node has changed the prediction value from the start. 
<br> (usually recorded in a table)
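
As an illustration, the walk can be written by hand for a single scikit-learn `DecisionTreeRegressor`, using the fitted `tree_` arrays. This is only a sketch on synthetic data; the helper `interpret` is hypothetical and is not the API of any tree interpreter library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration).
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 5 * X[:, 0] + 2 * X[:, 1]

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
t = tree.tree_

def interpret(row):
    """Walk one row from root to leaf; return (prediction, bias, contributions)."""
    row = row.astype(np.float32)       # scikit-learn compares float32 inputs to thresholds
    node = 0
    bias = t.value[0, 0, 0]            # mean of y at the root, before any decision
    contributions = np.zeros(X.shape[1])
    while t.children_left[node] != -1:  # -1 marks a leaf
        feat = t.feature[node]
        if row[feat] <= t.threshold[node]:
            child = t.children_left[node]
        else:
            child = t.children_right[node]
        # Credit the change in node mean to the feature used for this split.
        contributions[feat] += t.value[child, 0, 0] - t.value[node, 0, 0]
        node = child
    return t.value[node, 0, 0], bias, contributions

prediction, bias, contributions = interpret(X[0])
```

By construction, `prediction` equals `bias` plus the sum of `contributions`. For a forest, the same quantities would be averaged over the trees.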

<br> 
__Interaction feature importance:__ <br>
Although each decision can be analyzed individually, the result at each decision node is the result of the combination, or interaction, of the current decision and the previous ones.

## Random Forest from Scratch

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import re
import graphviz  # needed by draw_tree below
from fastai.imports import *
from fastai.tabular import *
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import export_graphviz  # needed by draw_tree below
from IPython.display import display
from sklearn import metrics

from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype

In [3]:
# Copied from the old version of fastai (fastai.structured) 
# https://github.com/fastai/fastai/blob/master/old/fastai/structured.py

def numericalize(df, col, name, max_n_cat):
    ''' For non-numeric columns, replace the values with their categorical codes + 1,
    e.g. NaN values become 0 instead of -1.'''
    
    if not is_numeric_dtype(col) and ( max_n_cat is None or len(col.cat.categories)>max_n_cat):
        df[name] = pd.Categorical(col).codes+1
    
    
def fix_missing(df, col, name, na_dict):
    if is_numeric_dtype(col):
        if pd.isnull(col).sum() or (name in na_dict):
            df[name+'_na'] = pd.isnull(col)
            filler = na_dict[name] if name in na_dict else col.median()
            df[name] = col.fillna(filler)
            na_dict[name] = filler
    return na_dict


def proc_df(df, y_fld=None, skip_flds=None, ignore_flds=None, do_scale=False, na_dict=None,
            preproc_fn=None, max_n_cat=None, subset=None, mapper=None):
    if not ignore_flds: ignore_flds=[]
    if not skip_flds: skip_flds=[]
    if subset: df = get_sample(df,subset)
    else: df = df.copy()
    ignored_flds = df.loc[:, ignore_flds]
    df.drop(ignore_flds, axis=1, inplace=True)
    if preproc_fn: preproc_fn(df)
    if y_fld is None: y = None
    else:
        if not is_numeric_dtype(df[y_fld]): df[y_fld] = pd.Categorical(df[y_fld]).codes
        y = df[y_fld].values
        skip_flds += [y_fld]
    df.drop(skip_flds, axis=1, inplace=True)

    if na_dict is None: na_dict = {}
    else: na_dict = na_dict.copy()
    na_dict_initial = na_dict.copy()
    for n,c in df.items(): na_dict = fix_missing(df, c, n, na_dict)
    if len(na_dict_initial.keys()) > 0:
        df.drop([a + '_na' for a in list(set(na_dict.keys()) - set(na_dict_initial.keys()))], axis=1, inplace=True)
    if do_scale: mapper = scale_vars(df, mapper)
    for n,c in df.items(): numericalize(df, c, n, max_n_cat)
    df = pd.get_dummies(df, dummy_na=True)
    df = pd.concat([ignored_flds, df], axis=1)
    res = [df, y, na_dict]
    if do_scale: res = res + [mapper]
    return res


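def set_rf_samples(n):
    """ Gives each tree a random sample of n rows (undo with reset_rf_samples). """
    # `forest` is the old `sklearn.ensemble.forest` module; it is not imported above.
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n))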

    
def reset_rf_samples():
    """ Undoes the changes produced by set_rf_samples.
    """
    forest._generate_sample_indices = (lambda rs, n_samples:
    forest.check_random_state(rs).randint(0, n_samples, n_samples))
    
    
def draw_tree(t, df, size=10, ratio=0.6, precision=0):
    s=export_graphviz(t, out_file=None, feature_names=df.columns, filled=True,
                      special_characters=True, rotate=True, precision=precision)
    display(graphviz.Source(re.sub('Tree {',
       f'Tree {{ size={size}; ratio={ratio}', s)))
    
    
def train_cats(df):
    """Change any columns of strings in a pandas DataFrame to columns of
    ordered categorical values. The changes are applied in place.
    """
    for n,c in df.items():
        if is_string_dtype(c): df[n] = c.astype('category').cat.as_ordered()
            
def split_vals(a, n):
    return a[:n], a[n:]

### 1 Load in Data from Last Lesson

In [4]:
PATH = 'data/bulldozers/'

df_raw = pd.read_feather('tmp/raw')
df_trn, y_trn, _ = proc_df(df_raw, 'SalePrice')

In [5]:
n_valid = 12000
n_trn = len(df_trn) - n_valid

X_train, X_valid = split_vals(df_trn, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)

raw_train, raw_valid = split_vals(df_raw, n_trn)

In [7]:
x_sub = X_train[['YearMade', 'MachineHoursCurrentMeter']]

### 2 Basic Data Structures
<br><br>
**np.random.seed(42):** sets the seed of NumPy's global random number generator so that the random numbers are predictable;
        the same sequence of random numbers is generated every time the generator is reseeded with 42.
<br><br>
**np.random.permutation(num):** returns a randomly shuffled int sequence from zero to int *num* (not included). *num* can be int or array_like; for an array, a shuffled copy of it is returned.
<br>
                    Can be used in machine learning to generate random samples.
<br><br>
The **iloc** indexer for a Pandas DataFrame is used for integer-location based indexing / selection by position. For example, `data.iloc[0]` selects the first row of the data frame; note that the output is a Series. Multiple columns and rows can be selected together using the `.iloc` indexer.
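
A small sketch that puts the three pieces together; the tiny DataFrame and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
first = np.random.permutation(5)   # shuffled copy of 0..4
np.random.seed(42)
second = np.random.permutation(5)  # identical to `first` because the seed was reset

# Hypothetical toy data frame.
df = pd.DataFrame({'YearMade': [1990, 2000, 2005, 2010, 1985],
                   'SalePrice': [9.5, 10.1, 10.4, 10.9, 9.2]})

sample = df.iloc[first[:3]]  # three randomly chosen rows, selected by position
first_row = df.iloc[0]       # a single row comes back as a Series
```

This is the same pattern `TreeEnsemble.create_tree` relies on: permute the row positions, keep the first few, and index with `iloc`.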

<br><br>
Python property: http://funhacks.net/explore-python/Class/property.html

<br><br>
Python numpy.arange(): return evenly spaced values within a given interval (default step=1).
<br>
https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html
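
For illustration, a short sketch of both features; the class `Node` is hypothetical and only mirrors the way a property can be used for a leaf check.

```python
import numpy as np

class Node:
    def __init__(self, score=float('inf')):
        self.score = score

    @property
    def is_leaf(self):
        # Read like an attribute (node.is_leaf), but computed on each access.
        return self.score == float('inf')

leaf = Node()
internal = Node(score=1.5)

idxs = np.arange(5)          # 0, 1, 2, 3, 4 (default start=0, step=1)
evens = np.arange(0, 10, 2)  # 0, 2, 4, 6, 8 (the stop value is excluded)
```

A property keeps the call site short (`leaf.is_leaf` rather than `leaf.is_leaf()`) while the value still follows any later change to `score`.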

In [9]:
class TreeEnsemble():
    def __init__(self, x, y, n_trees, sample_sz, min_leaf=5):
        np.random.seed(42)
        self.x, self.y, self.sample_sz, self.min_leaf = x, y, sample_sz, min_leaf
        self.trees = [self.create_tree() for i in range(n_trees)]
        
    def create_tree(self):
        rnd_idxs = np.random.permutation(len(self.y))[:self.sample_sz]
        return DecisionTree(self.x.iloc[rnd_idxs], self.y[rnd_idxs], min_leaf=self.min_leaf)
    
    def predict(self, x):
        return np.mean([t.predict(x) for t in self.trees], axis=0)

self.n: the number of rows given to the tree (`len(idxs)`, which equals `len(y)` at the root)
<br><br>
self.c: the number of columns given to the tree (equal to the number of columns in the independent variables, i.e. the second element of the DataFrame's `shape`)
<br><br>
self.val: the prediction value for this node of the tree (just the average of `y` over the rows in `idxs`)

In [10]:
class DecisionTree():
    def __init__(self, x, y, idxs=None, min_leaf=5):
        if idxs is None:
            idxs = np.arange(len(y))
        self.x, self.y, self.idxs, self.min_leaf = x, y, idxs, min_leaf
        self.n, self.c = len(idxs), x.shape[1]
        self.val = np.mean(y[idxs])
        self.score = float('inf')
        self.find_varsplit()
        
    def find_varsplit(self):
        for i in range(self.c):
            self.find_better_split(i)
    
    def find_better_split(self, var_idx):
        pass
    
    @property
    def split_name(self):
        return self.x.columns[self.var_idx]
    
    @property
    def split_col(self):
        return self.x.values[self.idxs, self.var_idx]
    
    @property
    def is_leaf(self):
        return self.score == float('inf')
    
    def __repr__(self):
        s = f'n: {self.n}; val:{self.val}'
        if not self.is_leaf:
            s += f'; score:{self.score}; split:{self.split}; var:{self.split_name}'
        return s
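
`find_better_split` is left as a placeholder above. One possible way to fill it in is sketched below as a standalone function, under stated assumptions: it tries every distinct value of one column as a threshold and scores a split by the weighted standard deviation of the two sides. The scoring rule, the signature and the return value are assumptions for illustration, not the notebook's final implementation.

```python
import numpy as np

def find_better_split(x_col, y, min_leaf=1):
    """Return (best_score, best_split) for one column, or (inf, None) if no split is valid."""
    best_score, best_split = float('inf'), None
    for candidate in np.unique(x_col):
        lhs = x_col <= candidate
        rhs = ~lhs
        # Skip splits that would leave too few rows on either side.
        if lhs.sum() < min_leaf or rhs.sum() < min_leaf:
            continue
        # Lower is better: each side's spread, weighted by its row count.
        score = y[lhs].std() * lhs.sum() + y[rhs].std() * rhs.sum()
        if score < best_score:
            best_score, best_split = score, candidate
    return best_score, best_split

x_col = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([0., 0., 0., 0., 10., 10., 10., 10.])
score, split = find_better_split(x_col, y)
```

On this toy column the best threshold separates the rows with `y == 0` from those with `y == 10`. Inside the class, the method would store the result on `self.score`, `self.split` and `self.var_idx` instead of returning it.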