# Raw features explainability

This notebook uses the Titanic data set. The raw data is a mixture of categorical and numeric columns; the categorical columns are featurized with one-hot encoding before the model is trained.
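As a quick illustration of what one-hot encoding does to a categorical column, here is a minimal sketch on a made-up three-row frame. It uses `pd.get_dummies` for brevity rather than the `OneHotEncoder` used later in this notebook, and the `toy` data is invented for the example.

```python
import pandas as pd

# Made-up rows in the style of the Titanic data (not taken from the real file)
toy = pd.DataFrame(
    {"age": [22.0, 38.0, 26.0], "embarked": ["S", "C", "Q"]}
)

# Each distinct category becomes its own 0/1 indicator column;
# the numeric column is passed through unchanged.
encoded = pd.get_dummies(toy, columns=["embarked"], dtype=int)
print(encoded)
```

One raw column (`embarked`) turns into several engineered columns, which is why explanations can be reported either per raw feature or per engineered feature.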

Explain a model with the AML explain-model package on raw features:

1. Train a Logistic Regression model using Scikit-learn
2. Run 'explain_model' with full dataset in local mode, which doesn't contact any Azure services.
3. Run 'explain_model' with summarized dataset in local mode, which doesn't contact any Azure services.
4. Visualize the global and local explanations with the visualization dashboard.

In [None]:
# Import needed packages
import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper


In [None]:
# We are using the Titanic dataset for this example
data_url = (
    "https://raw.githubusercontent.com/amueller/"
    "scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv"
)
data = pd.read_csv(data_url)

# Fill missing values: forward-fill, then back-fill any gaps left at the start
data = data.ffill().bfill()

data.head()

## Create model and train

The numeric data is standard-scaled after median-imputation, while the categorical data is one-hot encoded after imputing missing values with a new category ('missing').

Finally, the preprocessing steps are combined with a simple classification model into a full prediction pipeline using sklearn.pipeline.Pipeline.

In [None]:
from sklearn.model_selection import train_test_split

# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

numeric_features = ["age", "fare"]
categorical_features = ["embarked", "sex", "pclass"]

In [None]:
y = data["survived"].values
X = data[categorical_features + numeric_features]

# split the data in train and test
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [None]:
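# Impute and standardize the numeric features, and one-hot encode the categorical features.
# Each entry below pairs a list of raw columns with the transformer(s) applied to them.
# Note: scikit-learn >= 1.2 renames the OneHotEncoder argument `sparse` to `sparse_output`.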

transformations = [
    (
        ["age", "fare"],
        Pipeline(
            steps=[
                ("imputer", SimpleImputer(strategy="median")),
                ("scaler", StandardScaler()),
            ]
        ),
    ),
    (
        ["embarked"],
        Pipeline(
            steps=[
                ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                ("encoder", OneHotEncoder(sparse=False)),
            ]
        ),
    ),
    (["sex", "pclass"], OneHotEncoder(sparse=False)),
]


# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(
    steps=[
        ("preprocessor", DataFrameMapper(transformations)),
        ("classifier", LogisticRegression(solver="lbfgs")),
    ]
)


## Train a logistic regression model, which is what we want to explain

In [None]:
model = clf.fit(x_train, y_train)

In [None]:
print(f"model score: {clf.score(x_test, y_test)}")

## Using the explain model package

See https://docs.microsoft.com/en-us/azure/machine-learning/service/machine-learning-interpretability-explainability

We will be using the tabular explainer and passing it the list of feature transformations defined above.
As a result, explanations are reported in terms of the raw features before transformation (rather than the engineered features). If you do not pass the transformations, the explainer reports explanations in terms of the engineered features.

The supported format for transformations is the same as the one described in sklearn-pandas. In general, any transformation is supported as long as it operates on a single column and is therefore clearly one-to-many.
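To make the raw-versus-engineered distinction concrete, the sketch below rolls per-row importances for engineered columns up to the raw column each one came from. It assumes that summing is a reasonable aggregation (the SDK's actual method may differ), and the mapping, column names, and numbers are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from each engineered column to the raw column it came from.
# Because each transformation is one-to-many, every engineered column has one parent.
engineered_to_raw = {
    "age": "age",
    "fare": "fare",
    "embarked_C": "embarked",
    "embarked_Q": "embarked",
    "embarked_S": "embarked",
    "sex_female": "sex",
    "sex_male": "sex",
}

# Made-up signed importance values for a single row, one per engineered column.
engineered_importance = {
    "age": 0.12,
    "fare": 0.08,
    "embarked_C": 0.04,
    "embarked_Q": 0.01,
    "embarked_S": 0.05,
    "sex_female": -0.12,
    "sex_male": -0.18,
}


def aggregate_to_raw(importances, mapping):
    """Sum engineered-column importances into their parent raw column."""
    raw = defaultdict(float)
    for name, value in importances.items():
        raw[mapping[name]] += value
    return dict(raw)


raw_importance = aggregate_to_raw(engineered_importance, engineered_to_raw)

# Rank raw features by the magnitude of their aggregated importance.
for name, value in sorted(raw_importance.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.2f}")
```

The aggregation is only well defined because each engineered column traces back to exactly one raw column, which is the one-to-many condition stated above.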


---
Meta explainers automatically select a suitable direct explainer and generate the best explanation info based on the given model and data sets. The meta explainers leverage all the libraries (SHAP, LIME, Mimic, etc.) that we have integrated or developed. The following are the meta explainers available in the SDK:

- Tabular Explainer: Used with tabular datasets.
- Text Explainer: Used with text datasets.


In [None]:
from azureml.explain.model.tabular_explainer import TabularExplainer

# Explain predictions on the local machine
# clf.steps[-1][1] returns the trained classification model
# Pass transformation as an input to create the explanation object
# "features" and "classes" fields are optional

tabular_explainer = TabularExplainer(
    clf.steps[-1][1],
    initialization_examples=x_train,
    features=x_train.columns,
    transformations=transformations,
)


In [None]:
# Pass the test dataset as the evaluation examples; it must be a representative sample of the original data.
# x_train can be passed as well; with more examples, explanations take longer but may be more accurate.

global_explanation = tabular_explainer.explain_global(x_test)

Now we can see the global importance of the features in our model:

In [None]:
sorted_global_importance_values = global_explanation.get_ranked_global_values()
sorted_global_importance_names = global_explanation.get_ranked_global_names()
dict(zip(sorted_global_importance_names, sorted_global_importance_values))

## Explain overall model predictions as a collection of local (instance-level) explanations
You can apply the interpretability classes and methods to understand the model’s global behavior or specific predictions. The former is called global explanation and the latter is called local explanation.
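One way the two views can relate is sketched below: a global importance per feature is computed as the mean absolute value of that feature's local importances across rows. This is a common convention (for example in SHAP summary plots) and is assumed here, not confirmed as what the SDK does; the feature names and numbers are made up.

```python
# Made-up local importance values: one row per passenger, one column per feature.
feature_names = ["sex", "pclass", "age"]
local_importances = [
    [0.30, -0.10, 0.05],
    [-0.25, 0.15, -0.02],
    [0.35, -0.05, 0.08],
]


def mean_absolute_importance(rows):
    """Average the magnitude of each feature's local importance over all rows."""
    n_rows = len(rows)
    n_features = len(rows[0])
    return [
        sum(abs(row[j]) for row in rows) / n_rows for j in range(n_features)
    ]


global_importances = mean_absolute_importance(local_importances)

# Rank features from most to least important overall.
ranked = sorted(
    zip(feature_names, global_importances), key=lambda pair: pair[1], reverse=True
)
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```

Taking absolute values keeps positive and negative local contributions from cancelling out when they are averaged.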

Here we explain the prediction for the first row of the test set.

In [None]:
local_explanation = tabular_explainer.explain_local(x_test[:1])

In [None]:
# Get the prediction for the first member of the test set so we can explain why the model made it
prediction_value = clf.predict(x_test)[0]

In [None]:
sorted_local_importance_values = local_explanation.get_ranked_local_values()[prediction_value]
sorted_local_importance_names = local_explanation.get_ranked_local_names()[prediction_value]

Kernel Explainer: SHAP's Kernel explainer uses a specially weighted local linear regression to estimate SHAP values for any model.
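For intuition about the quantity being estimated, the sketch below computes exact Shapley values by brute-force enumeration of feature subsets. This is not Kernel SHAP itself (which approximates these values with a weighted regression so that it scales beyond a handful of features), and `toy_model`, `x`, and `baseline` are hypothetical names and values chosen for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to one baseline row.

    Features outside a subset are "switched off" by using the baseline value.
    Cost grows as 2**n, so this is only practical for very small n.
    """
    n = len(x)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for subsets of this size that exclude feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [
                    x[j] if (j in subset or j == i) else baseline[j]
                    for j in range(n)
                ]
                without_i = [
                    x[j] if j in subset else baseline[j] for j in range(n)
                ]
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


def toy_model(row):
    """Hypothetical linear model used only for this illustration."""
    return 2.0 * row[0] - 1.0 * row[1] + 0.5 * row[2]


x = [1.0, 3.0, 4.0]
baseline = [0.0, 1.0, 2.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

For a linear model each value reduces to the coefficient times the feature's offset from the baseline, and the values sum to the difference between the prediction at `x` and the prediction at `baseline`.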

In [None]:
# Sorted local SHAP values
print('ranked local importance values: {}'.format(sorted_local_importance_values))
# Corresponding feature names
print('ranked local importance names: {}'.format(sorted_local_importance_names))

## Load visualization dashboard

In [None]:
from azureml.contrib.explain.model.visualize import ExplanationDashboard


In [None]:
ExplanationDashboard(global_explanation, model, x_test)