# Break-down Plots for Instance Level Attributions
### Code snippets for Python
##### Wojciech Kretowicz
##### for Dalex 1.0

* prepare data
* construct the model
* construct an explainer
* prepare an instance to be explained
* `predict_parts` on an example of Break Down
* advanced use of the `predict_parts()`

In this section, we use the `dalex.Explainer.predict_parts()` method to calculate Break Down plots.

If you want to learn more about Break Down plots, read https://pbiecek.github.io/ema/breakDown.html.

# Prepare data

In this example we use the Titanic data. It has only a few variables, all of them easy to understand. The dataset contains some missing values; we drop the affected rows, which keeps the model simpler.
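To see what dropping missing values does, here is a minimal sketch on a hypothetical three-row frame (the column values are made up and are not taken from `titanic.csv`). `dropna()` with default arguments removes every row that has at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NOT the Titanic file used in this notebook.
toy = pd.DataFrame({
    "age": [8.0, np.nan, 42.0],
    "fare": [72.0, 10.0, np.nan],
    "gender": ["male", "female", "female"],
})

# Default dropna() drops a row if ANY of its columns is missing.
complete = toy.dropna()
n_dropped = len(toy) - len(complete)

print(complete)
print("rows dropped:", n_dropped)
```

It is worth checking how many rows are lost this way before fitting a model on the remaining data.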

In [4]:
import pandas as pd
data = pd.read_csv("../data/titanic.csv", index_col=0).dropna()
data.head()

import numpy
print(numpy.__version__)


1.14.5


In [2]:
from sklearn.preprocessing import LabelEncoder

data.loc[:, 'survived'] = LabelEncoder().fit_transform(data.survived)

X = data.drop(columns='survived')
y = data.survived
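The following sketch shows what `LabelEncoder` does to a two-class target. The labels `"no"` and `"yes"` are an assumption made for illustration; the actual values stored in the `survived` column of the CSV file may differ.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels, used only for illustration.
labels = ["no", "yes", "yes", "no"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# Classes are sorted, so "no" maps to 0 and "yes" maps to 1.
print(list(encoder.classes_))
print(encoded.tolist())

# The mapping can be reversed when the original labels are needed again.
decoded = encoder.inverse_transform(encoded).tolist()
```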

# Construct the model

Because of the categorical variables (`gender`, `class`, `embarked`, `country`, and the target `survived`), we have to include proper preprocessing in our model. We one-hot encode these features, except for the target, `survived`, which is simply encoded as $0$ and $1$. All numerical variables are standardized.

* pipeline "numeric_transformer":
    * we choose numerical features to transform
    * scale data with standard scaler
    
* pipeline "categorical_transformer":
    * we choose categorical features to transform
    * one hot encode
    
* we aggregate those two pipelines into ColumnTransformer "preprocessor"

In [3]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [4]:
numeric_features = ['age', 'fare', 'sibsp', 'parch']
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['gender', 'class', 'embarked', 'country']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])


clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', GradientBoostingClassifier(random_state=77))])
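To check what such a preprocessor produces, the sketch below fits a smaller `ColumnTransformer` of the same shape on a hypothetical four-row frame (made-up values, only four of the columns). It also shows the effect of `handle_unknown='ignore'`: a category not seen during fitting is encoded as all zeros instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, not the Titanic file.
toy = pd.DataFrame({
    "age": [10.0, 20.0, 30.0, 40.0],
    "fare": [1.0, 2.0, 3.0, 4.0],
    "gender": ["male", "female", "male", "female"],
    "class": ["1st", "2nd", "3rd", "1st"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "class"]),
])

def to_dense(matrix):
    """ColumnTransformer may return a sparse matrix; convert it for inspection."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

# 2 scaled numeric columns + 2 gender columns + 3 class columns = 7 columns.
Z = to_dense(toy_preprocessor.fit_transform(toy))

# A new row with a class value that was not present during fitting.
new_row = pd.DataFrame({
    "age": [25.0], "fare": [2.5], "gender": ["male"], "class": ["crew"],
})
Z_new = to_dense(toy_preprocessor.transform(new_row))

print(Z.shape)
print(Z_new)
```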

### Train

In [5]:
clf.fit(X, y)

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('scaler',
                                                                   StandardScaler(copy=True,
                                                                                  with_mean=True,
                                                                                  with_std=True))],
                                                           verbose=False),
                                                  ['age', 'fare', 'sibsp',
                                                   'parch']),
                                                 ('c

# Construct an explainer

Different models have different structures, so we must construct an explainer: an object that wraps the model in a uniform interface.
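The idea of a uniform interface can be illustrated with a small sketch. `ToyExplainer`, `ProbaModel` and `PlainModel` are hypothetical classes written for this example; they are not part of dalex and do not reproduce how `dx.Explainer` is implemented.

```python
import numpy as np

class ProbaModel:
    """Toy classifier exposing predict_proba, as scikit-learn classifiers do."""
    def predict_proba(self, X):
        p = 1.0 / (1.0 + np.exp(-np.asarray(X, dtype=float).sum(axis=1)))
        return np.column_stack([1.0 - p, p])

class PlainModel:
    """Toy regressor exposing only predict."""
    def predict(self, X):
        return np.asarray(X, dtype=float).mean(axis=1)

class ToyExplainer:
    """Hypothetical wrapper: one predict() call regardless of the model's API."""
    def __init__(self, model, data, y, label):
        self.model = model
        self.data = data
        self.y = np.asarray(y, dtype=float)
        self.label = label

    def predict(self, X):
        # Classifiers: return the probability of the positive class.
        if hasattr(self.model, "predict_proba"):
            return self.model.predict_proba(X)[:, 1]
        return self.model.predict(X)

    def residuals(self):
        # Difference between y and the predicted values.
        return self.y - self.predict(self.data)

X_toy = np.array([[0.0, 0.0], [1.0, 1.0]])
y_toy = np.array([0, 1])

ex_a = ToyExplainer(ProbaModel(), X_toy, y_toy, "proba model")
ex_b = ToyExplainer(PlainModel(), X_toy, y_toy, "plain model")

pred_a = ex_a.predict(X_toy)
pred_b = ex_b.predict(X_toy)
print(pred_a, pred_b, ex_b.residuals())
```

Both wrapped models are now queried in exactly the same way, which is what lets one explanation method work with many kinds of models.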

In [6]:
import dalex as dx

ex = dx.Explainer(
    model=clf,
    data=X,
    y=y,
    label='Gradient boosting classification'
)

Preparation of a new explainer is initiated

  -> target variable   :  Gradient boosting classification
  -> data              : 2099 rows 8 cols
  -> target variable   :  Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2099 values
  -> predict function  : <function yhat.<locals>.<lambda> at 0x11292e6a8> will be used
  -> predicted values  : min = 0.04031140910577281, mean = 0.3244828587155178, max = 0.9815486368550363
  -> residual function : difference between y and yhat
  -> residuals         : min = -0.9436208420083135, mean = -4.26490918874969e-05, max = 0.935340460409561
  -> model_info        : package sklearn

A new explainer has been created!


This is the average prediction:

In [9]:
ex.predict(X).mean()

0.3244828587155178

# Prepare instance

For instance-level explanations we need an observation for which to generate the explanation.

Let’s create a pandas.DataFrame with a single row that corresponds to an 8-year-old boy from the 1st class.

*Important: instance has to be represented as a pandas.DataFrame*

In [10]:
johny = pd.DataFrame({
    'gender': ['male'],
    'age': [8],
    'class': ['1st'],
    'embarked': ['Southampton'],
    'country': ['England'],
    'fare': [72],
    'sibsp': [0],
    'parch': [0]},
    index = ['johny_d'])

johny

|  | gender | age | class | embarked | country | fare | sibsp | parch |
|---|---|---|---|---|---|---|---|---|
| johny_d | male | 8 | 1st | Southampton | England | 72 | 0 | 0 |
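Because the instance must have the same structure as the data used for training, it can help to verify the column names before asking for a prediction. The sketch below does this on a hypothetical reduced frame `X_toy` (made-up values, four columns only) and then puts the instance columns in the same order as the training data.

```python
import pandas as pd

# Hypothetical stand-in for the training data; values are made up.
X_toy = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [30.0, 25.0],
    "class": ["3rd", "1st"],
    "fare": [7.0, 80.0],
})

# Single-row instance with the same columns in a different order.
candidate = pd.DataFrame(
    {"age": [8], "fare": [72], "gender": ["male"], "class": ["1st"]},
    index=["johny_d"],
)

# Columns that are missing from, or unexpected in, the instance.
missing = set(X_toy.columns) - set(candidate.columns)
extra = set(candidate.columns) - set(X_toy.columns)
if missing or extra:
    raise ValueError(f"column mismatch: missing={missing}, extra={extra}")

# Reorder the instance columns to match the training data.
aligned = candidate.loc[:, list(X_toy.columns)]
print(aligned)
```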


The predicted survival for Johny:

In [11]:
ex.predict(johny)

array([0.81198888])

# `predict_parts()` on an example of Break Down

The `dalex.Explainer.predict_parts()` method calculates the variable attributions for a selected model and the instance of interest.

The result is a data frame containing the calculated attributions. 
In the simplest call, the function requires only three arguments: 

* the model explainer (in Python, the object on which the method is called),
* the data frame with the instance of interest and 
* the method for calculation of variable attribution, for example `break_down`. 
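To make the notion of a break-down attribution concrete, here is a minimal from-scratch sketch for a fixed variable order. It is not the dalex implementation: `break_down` and `toy_predict` are hypothetical helpers and the data are random. Starting from the mean prediction, each variable in turn is set to the instance's value in every row of the data, and the change in the mean prediction is recorded as that variable's contribution.

```python
import numpy as np

def break_down(predict, data, instance, order):
    """Fix variables to the instance's values one by one; track the mean prediction."""
    current = np.array(data, dtype=float, copy=True)
    baseline = predict(current).mean()      # average prediction over the data
    contributions = {}
    previous = baseline
    for j in order:
        current[:, j] = instance[j]         # fix variable j for every row
        mean_now = predict(current).mean()
        contributions[j] = mean_now - previous
        previous = mean_now
    return baseline, contributions

def toy_predict(X):
    # Hypothetical model with an interaction between columns 1 and 2.
    return 0.1 + 0.5 * X[:, 0] + 0.2 * X[:, 1] * X[:, 2]

rng = np.random.RandomState(0)
data = rng.uniform(size=(200, 3))
instance = np.array([1.0, 0.5, 2.0])

baseline, contrib = break_down(toy_predict, data, instance, order=[0, 1, 2])
prediction = toy_predict(instance.reshape(1, -1))[0]
total = baseline + sum(contrib.values())

print("baseline:", baseline)
print("contributions:", contrib)
print("baseline + contributions:", total, "prediction:", prediction)
```

Once all variables are fixed, every row equals the instance, so the baseline plus the contributions always adds up to the model's prediction for that instance.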

In [12]:
pp = ex.predict_parts(johny, type='break_down')

TypeError: flip() missing 1 required positional argument: 'axis'

The method returns a new object, `BreakDown`. Its `plot` method creates an interactive plot.

In [13]:
pp.plot()

NameError: name 'pp' is not defined
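The `TypeError` shown above is consistent with the NumPy version printed earlier (1.14.5): in NumPy releases older than 1.15, `numpy.flip` requires an explicit `axis` argument. Whether upgrading NumPy resolves the error in this particular environment is an assumption and was not verified here. A minimal check of the installed version:

```python
import numpy as np

# Parse "major.minor" from the installed NumPy version string.
major_minor = tuple(int(part) for part in np.__version__.split(".")[:2])

# From NumPy 1.15 on, np.flip(m) may be called without an axis.
flip_axis_optional = major_minor >= (1, 15)
print("NumPy", np.__version__, "- axis optional in np.flip:", flip_axis_optional)

# Passing the axis explicitly works in old and new versions alike.
arr = np.arange(6).reshape(2, 3)
flipped = np.flip(arr, axis=0)
print(flipped)
```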

# Advanced use of the `predict_parts()` 

The `predict_parts()` method accepts more arguments. The most commonly used are:

* the explainer itself - a wrapper over a model created with `dx.Explainer()`; in Python it is the object on which the method is called rather than an argument,
* `new_observation` - the observation to be explained; this has to be a data frame with a structure that matches the training data,
* `order` - a list or array of strings (column names) or integers (column indexes) that specifies the order of explanatory variables used for computing the variable attributions. If not specified (default), a one-step heuristic is used to determine the order.
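Why the order matters can be seen in a small hand-checkable sketch. The model `x0 * x1`, the two data rows and the helper `break_down` are hypothetical and written only for this illustration; this is not the dalex implementation. With an interaction in the model, two orders give different attributions to the same variables, although both sum to the same prediction.

```python
import numpy as np

def break_down(predict, data, instance, order):
    """Fix variables to the instance's values in the given order."""
    current = np.array(data, dtype=float, copy=True)
    previous = predict(current).mean()
    baseline = previous
    contributions = {}
    for j in order:
        current[:, j] = instance[j]
        mean_now = predict(current).mean()
        contributions[j] = mean_now - previous
        previous = mean_now
    return baseline, contributions

def interaction_model(X):
    # Pure interaction: the effect of x0 depends on the value of x1.
    return X[:, 0] * X[:, 1]

data = np.array([[0.0, 1.0], [2.0, 3.0]])
instance = np.array([1.0, 1.0])

baseline, first_x0 = break_down(interaction_model, data, instance, order=[0, 1])
_, first_x1 = break_down(interaction_model, data, instance, order=[1, 0])

print("baseline:", baseline)          # mean of 0*1 and 2*3
print("order [0, 1]:", first_x0)
print("order [1, 0]:", first_x1)
```

Here variable 0 receives a contribution of -1 in the first order and 0 in the second, which is why an explicit `order` can change the picture noticeably.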

In what follows we illustrate the use of the arguments.

First, we specify the ordering of the explanatory variables. To this end we can use integer indexes or variable names.
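Integer indexes and variable names describe the same ordering, and pandas can translate between the two. In the sketch below the column order is copied from the instance shown earlier; the order of columns in `X` may differ, so treat it as an assumption. Whether passing names instead of integers changes the behaviour of `predict_parts()` in a given dalex version is not verified here.

```python
import pandas as pd

# Column order assumed from the single-row instance above.
columns = pd.Index([
    "gender", "age", "class", "embarked",
    "country", "fare", "sibsp", "parch",
])

order_idx = [2, 1, 0, 5, 7, 6, 3, 4]

# A valid order mentions every column position exactly once.
is_permutation = sorted(order_idx) == list(range(len(columns)))

# Positions -> names, and names -> positions.
order_names = columns[order_idx].tolist()
back_to_idx = columns.get_indexer(order_names).tolist()

print(order_names)
print(back_to_idx)
```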

In [14]:
import numpy as np

pp = ex.predict_parts(johny, type='break_down', order=np.array([2, 1, 0, 5, 7, 6, 3, 4]))
pp.plot(max_vars=4)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



KeyError: '[5] not in index'