# Illustrating the `teller`

This notebook illustrates the use of the [`teller`](https://github.com/thierrymoudiki/teller), a model-agnostic tool for Machine Learning explainability. Two models are used: a linear model and a [Random Forest](https://en.wikipedia.org/wiki/Random_forest) (here, the _black-box_ model). The most straightforward way to illustrate the `teller` is to use a linear model. In this case, the effects of model covariates on the response can be directly related to the linear model's coefficients.

Currently, the `teller` can be installed from GitHub as:

In [18]:
pip install git+https://github.com/thierrymoudiki/teller.git

Collecting git+https://github.com/thierrymoudiki/teller.git
  Cloning https://github.com/thierrymoudiki/teller.git to /tmp/pip-req-build-54wbobzk
  Running command git clone -q https://github.com/thierrymoudiki/teller.git /tmp/pip-req-build-54wbobzk
Building wheels for collected packages: teller
  Building wheel for teller (setup.py) ... done
  Created wheel for teller: filename=teller-0.1.0-py2.py3-none-any.whl size=9119 sha256=ed591690821ca4691e5ff0921e417b60b922fccb12512edfbd57bf9380d2657b
  Stored in directory: /tmp/pip-ephem-wheel-cache-a416_fr3/wheels/d9/51/a6/69fa991f7529be33ba87e6e684fdc936eb67a827aa7a2bbfcf
Successfully built teller


The data for the demo is the Boston Housing dataset. The response is MEDV, the median value of owner-occupied homes in $1000’s. The variables are:



- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- LSTAT % lower status of the population
- MEDV median value of owner-occupied homes in $1000’s (the __response__)


In [0]:
import teller as tr
import pandas as pd
import numpy as np

from sklearn import datasets, linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


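# import data (load_boston was removed in scikit-learn 1.2, so this cell needs an older version)
boston = datasets.load_boston()
X = np.delete(boston.data, 11, 1)  # drop the column at index 11
y = boston.target
# remaining feature names, followed by the name of the response
col_names = np.append(np.delete(boston.feature_names, 11), 'MEDV')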
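
To make the column removal above concrete, here is a minimal sketch on a toy array. It assumes the usual ordering of the 13 Boston feature names, in which index 11 is `B`; the names array and the toy data are illustrative stand-ins, not values read from `boston`.

```python
import numpy as np

# Assumed ordering of the 13 feature names (illustrative, not read from the dataset)
names = np.array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                  'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
data = np.arange(2 * 13).reshape(2, 13)  # toy 2 x 13 matrix, one column per name

kept_names = np.delete(names, 11)        # remove the name at index 11
kept_data = np.delete(data, 11, axis=1)  # remove the matching column

print(kept_names)
print(kept_data.shape)
```

Deleting the same index from the names and from the data keeps the two aligned, which is what the explainer's data frame relies on later.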


Split data into a training and a testing set:

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=123)
print(X_train.shape)
print(X_test.shape)


(404, 12)
(102, 12)
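
The shapes printed above follow from how `train_test_split` rounds: with a fractional `test_size`, the number of test rows is rounded up and the rest go to training. A small sketch on placeholder data with the same number of rows (506 = 404 + 102); the arrays here are zeros and only their shapes matter.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 506                      # 404 + 102, the row counts printed above
X_demo = np.zeros((n_samples, 12))   # placeholder features
y_demo = np.zeros(n_samples)         # placeholder response

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=123)

n_test = math.ceil(0.2 * n_samples)  # test rows are rounded up
n_train = n_samples - n_test
print(X_tr.shape, X_te.shape, n_train, n_test)
```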


As we said before, the most straightforward way to illustrate the `teller` is to use a linear model. In this case, the effects of model covariates on the response can be directly related to the linear model's coefficients.

In [21]:
# fit a linear regression model 
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
print(col_names)
print(regr.coef_) # these will be compared to effects 


['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'LSTAT' 'MEDV']
[-1.01154624e-01  4.76941400e-02  6.25165481e-02  1.47253911e+00
 -1.61503638e+01  4.19880279e+00  1.85740482e-03 -1.37739515e+00
  2.62817392e-01 -1.28645883e-02 -8.92383870e-01 -5.72958247e-01]


Now, using the `teller`, we can obtain a similar result. Notice that there is no heterogeneity in the effects of covariates on the response (the standard deviations are numerically zero), and that the effects are equal to the linear model's coefficients.

In [22]:
# creating the explainer (needs a data frame, for column names)
df_test = pd.DataFrame(data = np.column_stack((X_test, y_test)), 
                       columns = col_names)
expr = tr.Explainer(obj=regr, df=df_test, target='MEDV')

# fitting the explainer
expr.fit()

# model effects, to be compared to regr.coef_
print(expr.effects_)

              mean           std        min        max
NOX     -16.150364  4.653454e-10 -16.150364 -16.150364
DIS      -1.377395  7.876549e-11  -1.377395  -1.377395
PTRATIO  -0.892384  1.272029e-11  -0.892384  -0.892384
LSTAT    -0.572958  2.521022e-11  -0.572958  -0.572958
CRIM     -0.101155  3.162134e-09  -0.101155  -0.101155
TAX      -0.012865  7.018682e-13  -0.012865  -0.012865
AGE       0.001857  5.350552e-12   0.001857   0.001857
ZN        0.047694  1.182888e-11   0.047694   0.047694
INDUS     0.062517  3.562424e-11   0.062517   0.062517
RAD       0.262817  5.098845e-11   0.262817   0.262817
CHAS      1.472539  8.301885e-11   1.472539   1.472539
RM        4.198803  5.172671e-11   4.198803   4.198803
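
One way to see why the effects match the coefficients: for a linear model, the sensitivity of the prediction to a covariate is the same at every point and equals that covariate's coefficient. The sketch below checks this with a central finite difference on synthetic data. The helper `finite_difference_effects` and the step size `h` are hypothetical and only illustrate the idea; they are not the `teller`'s actual implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known linear structure (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
model = LinearRegression().fit(X_toy, y_toy)


def finite_difference_effects(predict, X, j, h=1e-2):
    """Central-difference sensitivity of predict() to column j, one value per row."""
    X_up, X_down = X.copy(), X.copy()
    X_up[:, j] += h
    X_down[:, j] -= h
    return (predict(X_up) - predict(X_down)) / (2 * h)


# Per-covariate mean and spread of the sensitivities, next to the coefficient
for j in range(X_toy.shape[1]):
    eff = finite_difference_effects(model.predict, X_toy, j)
    print(j, eff.mean(), eff.std(), model.coef_[j])
```

The mean sensitivity reproduces each coefficient and the spread is zero up to floating-point error, which mirrors the `mean` and `std` columns in the table above.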


__All else held constant__, home values are most affected by air pollution here, with a decrease of about 16k\$ for each one-unit increase in the nitric oxides concentration (measured in parts per 10 million). The number of rooms drives the increase in home values, with an increase of about 4k\$ for each additional room. What story does the Random Forest (here, the _black-box_ model) tell us?

In [23]:
# fit a Random Forest model
regr2 = RandomForestRegressor(n_estimators=1000, random_state=123)
regr2.fit(X_train, y_train)


# creating the explainer
df_test = pd.DataFrame(data = np.column_stack((X_test, y_test)), 
                       columns = col_names)
expr = tr.Explainer(obj=regr2, df=df_test, target='MEDV')


# fitting the explainer
expr.fit()


# heterogeneity of effects
print(expr.effects_)

              mean         std          min          max
LSTAT   -11.541770  104.111356  -680.369720   335.990384
PTRATIO  -5.795078   26.975073  -155.914653    56.827716
INDUS    -3.425733   26.951393  -258.382895     0.000000
TAX      -0.052272    0.824834    -6.479723     4.839278
CHAS      0.000000    0.000000     0.000000     0.000000
AGE       0.970438    5.204533    -7.242999    39.647849
ZN        1.043840   11.672871   -28.280289    83.808739
NOX       1.286747  325.585815 -1258.347012  1937.006074
DIS       2.014293   20.343364     0.000000   205.457901
RAD      18.420244  192.075879  -247.710558  1791.773035
RM       28.570050  146.327113  -123.772764  1126.812921
CRIM     72.200382  585.402432     0.000000  5685.533164
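
A plausible reading of the wide ranges above: a Random Forest's prediction is piecewise constant in each covariate, so a local sensitivity is zero on a flat stretch and large wherever a small perturbation crosses a split. The sketch below shows this heterogeneity on synthetic data with a central finite difference. The helper `finite_difference_effects` and the step size `h` are hypothetical choices for illustration, not the `teller`'s actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data whose true relationship is linear (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)


def finite_difference_effects(predict, X, j, h=1e-2):
    """Central-difference sensitivity of predict() to column j, one value per row."""
    X_up, X_down = X.copy(), X.copy()
    X_up[:, j] += h
    X_down[:, j] -= h
    return (predict(X_up) - predict(X_down)) / (2 * h)


# Sensitivities to the first covariate now differ from one observation to the next
effects = finite_difference_effects(forest.predict, X_toy, j=0)
print(effects.mean(), effects.std(), effects.min(), effects.max())
```

Even though the underlying relationship is linear, the forest's sensitivities vary across observations, and their size depends strongly on the step `h`. Summary statistics like those in the table are therefore best read together with the spread, not from the mean alone.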


Here, home values decrease most when the percentage of "lower" status population increases, or when the pupil-teacher ratio rises (fewer teachers per pupil in the area). __All else held constant__, the number of rooms is still an important driver of an increase. Accessibility to highways and distance to employment centers also play an important role here. Conversely, what is said about the crime rate (a positive average effect, with a very large spread) is rather surprising.

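__(Very) Important__: Typically, these interpretability numbers would be coupled with the __model's accuracy__ (and other performance considerations for production). Below, accuracy is measured by the root mean squared error (RMSE) on the test set; lower is better.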

In [24]:
# test-set RMSE of the linear model
print(np.sqrt(np.mean((regr.predict(X_test) - y_test)**2)))

# test-set RMSE of the Random Forest
print(np.sqrt(np.mean((regr2.predict(X_test) - y_test)**2)))

5.431091875823595
4.322189349251635
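
The two numbers above are root mean squared errors. As a small sketch, the same computation can be wrapped in a helper (the name `rmse` is hypothetical, introduced here for illustration) and checked on values simple enough to verify by hand:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Errors are 2 and 0, so the RMSE is sqrt((4 + 0) / 2) = sqrt(2)
print(rmse([3.0, 5.0], [1.0, 5.0]))
```

Because the response is in $1000’s, an RMSE read this way is in the same unit.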
