# Illustrating the `teller` (v0.3.0)

This notebook is about the [`teller`](https://github.com/thierrymoudiki/teller), a model-agnostic tool for Machine Learning explainability. Version `0.3.0` makes it possible to compare models and understand their performance. We are going to compare 2 ML models on the Boston Housing dataset: [Extremely Randomized Trees](https://en.wikipedia.org/wiki/Random_forest#ExtraTrees) and Random Forest regressions.

Currently, the `teller` can be installed from GitHub as:

In [0]:
!pip install git+https://github.com/thierrymoudiki/teller.git

The data for the demo is the Boston Housing dataset. The response is MEDV, the median value of owner-occupied homes in $1000’s. The variables are:



- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000’s (the __response__)


We start by importing the packages and data necessary for our demo:


In [0]:
# Import packages and data
import teller as tr
import pandas as pd
import numpy as np
import lightgbm as lgb
import xgboost as xgb
import math

from sklearn import datasets, linear_model, metrics
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, BaggingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

# import data
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston = datasets.load_boston()
X = np.delete(boston.data, 11, 1)  # drop the column at index 11 (feature 'B')
y = boston.target
# names of the 12 remaining features, followed by the response
col_names = np.append(np.delete(boston.feature_names, 11), 'MEDV')
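
The column index `11` removed by `np.delete` corresponds to the feature `B` in the usual ordering of the Boston Housing features, which is why `B` is absent from the variable list above. Here is a minimal plain-Python sketch of the same bookkeeping, assuming that standard ordering (the list is typed by hand for illustration, not read from scikit-learn, and `col_names_demo` is a hypothetical name):

```python
# Standard feature ordering of the Boston Housing data (assumed, typed by hand).
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

DROPPED = 11  # same index as in np.delete(boston.data, 11, 1)

# Keep every name except the one at position 11, then append the response.
kept = [name for i, name in enumerate(feature_names) if i != DROPPED]
col_names_demo = kept + ["MEDV"]

print(feature_names[DROPPED])   # name of the dropped feature
print(len(kept))                # number of explanatory variables left
print(col_names_demo[:-1])      # the names that would be passed as X_names
```

Slicing with `[:-1]` removes the response again, which is how the feature names are passed to the explainer in the cells that follow.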


We split data into a training and a testing set:

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=123)
print(X_train.shape)
print(X_test.shape)
print("mean of y_test: ")
print(np.mean(y_test))
print("std. deviation of y_test: ")
print(np.std(y_test))

(404, 12)
(102, 12)
mean of y_test: 
23.158823529411773
std. deviation of y_test: 
9.095919715930988
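
The two shapes printed above follow from simple arithmetic on the 506 rows of the dataset. The sketch below assumes that a fractional `test_size` is rounded up for the test set and that the complement is rounded down for the training set; this assumption reproduces the printed shapes, but it is not taken from this notebook.

```python
import math

n_samples = 404 + 102   # rows in the full dataset, from the shapes printed above
test_size = 0.2

# Assumption: fractional test size rounded up, training share rounded down.
n_test = math.ceil(test_size * n_samples)
n_train = math.floor((1 - test_size) * n_samples)

print(n_train, n_test)
```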


Now we train our 2 models, starting with the Extremely Randomized Trees:

In [5]:
# fit an Extra Trees model to Boston Housing data
regr2 = ExtraTreesRegressor(n_estimators=1000, 
                            max_features=int(math.sqrt(X_train.shape[1])),
                            random_state=123)
regr2.fit(X_train, y_train)


# creating the explainer
expr2 = tr.Explainer(obj=regr2)


# fitting the explainer (for heterogeneity of effects only)
expr2.fit(X_test, y_test, X_names=col_names[:-1], method="avg")


# confidence intervals and tests on marginal effects (Jackknife)
expr2.fit(X_test, y_test, X_names=col_names[:-1], method="ci")


# summary of results for the model
print(expr2.summary())




Calculating the effects...
12/12 [██████████████████████████████] - 2s 153ms/step




Calculating the effects...
102/102 [██████████████████████████████] - 13s 124ms/step




Score (rmse): 
 10.813


Residuals: 
     Min       1Q    Median        3Q       Max
-11.7904 -1.84795 -0.288655  0.937975  18.51445


Tests on marginal effects (Jackknife): 
          Estimate   Std. Error   95% lbound   95% ubound     Pr(>|t|)     
NOX       -59.4205  2.22045e-16     -59.4205     -59.4205            0  ***
PTRATIO   -2.00072     0.390455     -2.77528     -1.22616  1.44031e-06  ***
CRIM             0  2.22045e-16 -4.40477e-16  4.40477e-16            1    -
ZN               0  2.22045e-16 -4.40477e-16  4.40477e-16            1    -
CHAS             0  2.22045e-16 -4.40477e-16  4.40477e-16            1    -
RAD              0  2.22045e-16 -4.40477e-16  4.40477e-16            1    -
TAX      0.0121302  2.22045e-16    0.0121302    0.0121302            0  ***
INDUS    0.0125259  3.31241e-16    0.012

Extra Trees predictions for home value are highly sensitive to air pollution. An increase of 1 in nitric oxides concentration (parts per 10 million) leads, all else held constant and on average, to a decrease of about 59k$ in median home value (the NOX estimate is -59.4205, with MEDV in $1000’s). The increase in home value is driven by the number of rooms. We can also note that variables such as the crime rate and the accessibility to radial highways seem to have a negligible impact on model predictions.
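
To make the notion of a marginal effect concrete, here is a minimal sketch that approximates the sensitivity of a prediction function to each feature with central finite differences, then averages it over observations. It is an illustration under stated assumptions, not the `teller`'s actual implementation: `toy_model`, `average_effects` and the step `h` are hypothetical names chosen for this example.

```python
def toy_model(x):
    """Hypothetical stand-in for a fitted model's predict function."""
    return 3.0 * x[0] - 2.0 * x[1] ** 2


def average_effects(predict, rows, h=1e-4):
    """Average central finite-difference slope of `predict` for each feature."""
    n_features = len(rows[0])
    effects = []
    for j in range(n_features):
        slopes = []
        for row in rows:
            up = list(row)
            up[j] += h          # nudge feature j upwards
            down = list(row)
            down[j] -= h        # nudge feature j downwards
            slopes.append((predict(up) - predict(down)) / (2 * h))
        effects.append(sum(slopes) / len(slopes))
    return effects


rows = [[1.0, 0.5], [2.0, 1.0], [3.0, 1.5]]   # toy observations
effects = average_effects(toy_model, rows)
print(effects)
```

For this toy function the first effect is the constant slope 3, while the second is the average of `-4 * x[1]` over the three rows. A tree ensemble is piecewise constant, so a small perturbation often leaves its prediction unchanged; that is one plausible reason for estimates of exactly 0 in a table like the one above, though this is an interpretation rather than something the output states.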



Now, we'll train a `RandomForest` on the same dataset, and see what it tells us about its predictions: 

In [6]:
# fit a random forest model 
regr1 = RandomForestRegressor(n_estimators=1000, 
                              max_features=int(math.sqrt(X_train.shape[1])),
                              random_state=123)
regr1.fit(X_train, y_train)


# creating the explainer
expr1 = tr.Explainer(obj=regr1)


# fitting the explainer (for heterogeneity of effects only)
expr1.fit(X_test, y_test, X_names=col_names[:-1], method="avg")


# confidence intervals and tests on marginal effects (Jackknife)
expr1.fit(X_test, y_test, X_names=col_names[:-1], method="ci")


# summary of results for the model
print(expr1.summary())



Calculating the effects...
12/12 [██████████████████████████████] - 2s 143ms/step




Calculating the effects...
102/102 [██████████████████████████████] - 12s 116ms/step




Score (rmse): 
 13.639


Residuals: 
     Min     1Q  Median       3Q      Max
-10.6667 -1.396 -0.5047  1.25705  22.4512


Tests on marginal effects (Jackknife): 
         Estimate   Std. Error   95% lbound   95% ubound     Pr(>|t|)     
NOX      -65.9852      23.5248     -112.652     -19.3183   0.00603773   **
PTRATIO  -19.0443      5.74131     -30.4335      -7.6551   0.00126512   **
LSTAT      -2.972      3.11832     -9.15791      3.21392     0.342827    -
INDUS    -1.90767      2.88467     -7.63009      3.81474     0.509917    -
ZN      -0.670289     0.429838     -1.52297     0.182394      0.12203    -
TAX     -0.412312    0.0252358    -0.462373    -0.362251  4.10351e-30  ***
CHAS            0  2.22045e-16 -4.40477e-16  4.40477e-16            1    -
AGE      0.583416   5.5788e-15     0.583416     0.583416    

For this model too, air pollution is an important variable driving the decrease in home value. The pupil-teacher ratio plays a more important role than in Extra Trees (an estimate of -19.04 versus -2.00) and, contrary to Extra Trees, the Random Forest gives much more importance to the accessibility of radial highways.
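
The standard errors in the tables above are labelled as Jackknife estimates. As a reminder of the general technique (a sketch, not the `teller`'s code), the jackknife recomputes a statistic with each observation left out in turn and derives a standard error from the spread of those leave-one-out values. The names `jackknife_se` and `mean_of`, and the toy data, are hypothetical.

```python
import math


def mean_of(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


def jackknife_se(values, statistic):
    """Jackknife standard error of `statistic` computed on `values`."""
    n = len(values)
    # Statistic recomputed n times, each time without observation i.
    leave_one_out = [statistic(values[:i] + values[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    return math.sqrt((n - 1) / n * sum((t - center) ** 2 for t in leave_one_out))


slopes = [-2.1, -1.8, -2.4, -1.9, -2.3]   # toy per-observation effects
se_varying = jackknife_se(slopes, mean_of)

constant = [0.5, 0.5, 0.5, 0.5]           # identical effect for every observation
se_constant = jackknife_se(constant, mean_of)

print(se_varying, se_constant)
```

When every observation yields the same effect, the leave-one-out values coincide and the jackknife standard error collapses to zero. The entries of `2.22045e-16` in the tables equal double-precision machine epsilon, which suggests such degenerate cases are floored rather than reported as exactly zero; this is an inference from the printed values only.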

We can finally __compare both models side by side__, using the `teller`'s `Comparator`:

In [7]:
# create object for model comparison
# expr1 is for Random Forest 
# expr2 is for Extra Trees
cpr = tr.Comparator(expr1, expr2)


# print summary of results for model comparison
print(cpr.summary())



Scores (rmse): 
Object1: 13.639
Object2: 10.813


R-squared: 
Object1: 
Multiple:  0.835, Adjusted:  0.813
Object2: 
Multiple:  0.869, Adjusted:  0.852


Residuals: 
Object1: 
     Min     1Q  Median       3Q      Max
-10.6667 -1.396 -0.5047  1.25705  22.4512
Object2: 
     Min       1Q    Median        3Q       Max
-11.7904 -1.84795 -0.288655  0.937975  18.51445


Paired t-test (H0: mean(resids1) > mean(resids2) at 5%): 
statistic: 0.18249
p.value: 0.57231
conf. int: [-inf, 0.90189]
mean of x: -0.11477
mean of y: -0.20446
alternative: less


Marginal effects: 
        Estimate1  Std. Error1 Signif.  Estimate2  Std. Error2 Signif.
AGE      0.583416   5.5788e-15     ***   0.643206  6.69456e-15     ***
CHAS            0  2.22045e-16       -          0  2.22045e-16       -
CRIM      4.74938  1.16039e-13     ***          0  2.22045e-16       -
DIS       10.7329  2.14226e-13     ***    1.17726  2.45467e-14     ***
INDUS    -1.90767      2.88467       -  0.0125259  3.31241e-16     ***
LSTA

The first output is the test set Root Mean Squared Error (RMSE) for both models; then we have information such as the Multiple R-Squared and the distribution of residuals. The confidence interval (given by a Student t-test) around the difference of the residuals' means contains 0 and the p-value is 0.57231, so the null hypothesis is not rejected at 5%.
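
The paired test in the summary can be sketched with the standard library alone. The example below computes the paired t statistic on toy residuals and, as a stated simplification, uses a normal approximation to the Student t distribution for the one-sided p-value (alternative `less`). The function name `paired_t_less` and the data are hypothetical, and this is not the `teller`'s implementation.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t_less(resids1, resids2):
    """Paired t statistic and approximate p-value for the alternative 'less'."""
    diffs = [a - b for a, b in zip(resids1, resids2)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    t_stat = mean(diffs) / standard_error
    # Assumption: normal approximation to the t distribution (fine for large n,
    # rough for the tiny toy sample used here).
    p_value = NormalDist().cdf(t_stat)
    return t_stat, p_value


resids1 = [0.5, -1.2, 0.3, 2.0, -0.7, 1.1]   # toy residuals of model 1
resids2 = [0.4, -1.0, 0.1, 1.5, -0.9, 1.3]   # toy residuals of model 2
t_stat, p_value = paired_t_less(resids1, resids2)
print(round(t_stat, 3), round(p_value, 3))

# Same approximation applied to the statistic printed in the summary.
p_from_summary = NormalDist().cdf(0.18249)
print(round(p_from_summary, 2))
```

Applied to the printed statistic of 0.18249, the normal approximation gives roughly 0.57, in line with the reported p-value of 0.57231.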

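The R-squared figures in the summary can be cross-checked from numbers printed earlier in the notebook. The sketch below assumes the usual definitions: multiple R-squared as one minus the mean squared error divided by the variance of `y_test`, and adjusted R-squared with 102 observations and 12 features. The function names are hypothetical.

```python
def r_squared(mse, y_std):
    """Multiple R-squared from a mean squared error and the std. deviation of y."""
    return 1.0 - mse / y_std ** 2


def adjusted_r_squared(r2, n_obs, n_features):
    """Adjusted R-squared for n_obs observations and n_features predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


Y_STD = 9.095919715930988     # std. deviation of y_test printed earlier
N_OBS, N_FEATURES = 102, 12   # shape of X_test

results = {}
for label, score in [("Random Forest", 13.639), ("Extra Trees", 10.813)]:
    # Assumption: the printed score is used here as a mean squared error.
    r2 = r_squared(score, Y_STD)
    adj = adjusted_r_squared(r2, N_OBS, N_FEATURES)
    results[label] = (round(r2, 3), round(adj, 3))
    print(label, results[label])
```

The printed scores reproduce the reported multiple and adjusted R-squared only when they are treated as mean squared errors, which suggests that the values labelled `rmse` may be on the squared scale. This is an observation from the arithmetic, not a statement about the `teller`'s internals.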