# Arize Tutorial: SHAP Value For Every Model

Let's get started on using Arize! ✨

Arize helps you visualize your model performance, understand drift & data quality issues, and share insights learned from your models.

**SHAP (SHapley Additive exPlanations)** is a game theoretic approach to explain the output of any machine learning model.
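To make the game-theoretic idea concrete, here is a minimal brute-force sketch of Shapley values for a three-feature toy function. It is an illustration only, not how the `shap` library computes its values; `exact_shapley_values`, `toy_model`, and the all-zero baseline are hypothetical names and choices made for this example.

```python
from itertools import combinations
from math import factorial


def exact_shapley_values(value_fn, x, baseline):
    """Brute-force Shapley values for a single input x.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n_features, so this only suits tiny examples.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                with_i = [
                    x[j] if (j in coalition or j == i) else baseline[j]
                    for j in range(n)
                ]
                without_i = [
                    x[j] if j in coalition else baseline[j] for j in range(n)
                ]
                phi[i] += weight * (value_fn(with_i) - value_fn(without_i))
    return phi


def toy_model(features):
    # Hypothetical model: a linear part plus one interaction term
    a, b, c = features
    return 2.0 * a + 1.0 * b + a * c


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley_values(toy_model, x, baseline)
print(phi)  # approximately [3.5, 2.0, 1.5]
```

The attributions sum to `toy_model(x) - toy_model(baseline)`, which is the "additive" property in the name, and the interaction term `a * c` is split evenly between the two features involved.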

For more in-depth usage of the `shap` library, visit [SHAP Core Explainers](https://shap-lrjball.readthedocs.io/en/docs_update/generated/shap.Explainer.html) and read more about the math behind each type of explainer. `shap.Explainer` is the default explainer and will match any model type, but you can also specify an explainer type yourself. For example, you can choose `shap.TreeExplainer`, but it won't work on models such as `sklearn.linear_model.LogisticRegression`.

In this notebook, we will show you which `shap` explainer type to use for some common ML models 🚀.

### Running This Notebook
1. Save a copy in Google Drive for yourself.
2. Step through each section below, pressing play on the code blocks to run the cells.
3. In Step 2, use your own Space and API key from your Arize account.

## Part 0: SHAP API Quick-Start
If you want to start logging to Arize with the `shap` library right away, here are some quick code snippets for you to copy.


| Model                     | Explainer                                                | Generating SHAP Values  |
|:-|:-|:-|
| sklearn.linear_model      | exp = shap.LinearExplainer(model, X_train)               | exp.shap_values(X_test) |
| sklearn.ensemble          | exp = shap.TreeExplainer(model, X_train)                 | exp.shap_values(X_test) |
| Neural and Model Agnostic | exp = shap.KernelExplainer(model.predict_proba, X_train) | exp.shap_values(X_test) |

First, remember to `pip install shap` and `import shap`.

In [None]:
!pip install -q shap # Run this if you do not have shap installed
import shap

## Part 1: Load Data and Build Model

In [None]:
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn.model_selection import train_test_split

# 1. Load and split the data
data = datasets.load_breast_cancer()

# NOTE: Labels are cast with y.astype(str) because Arize expects non-integer labels here.
X, y = datasets.load_breast_cancer(return_X_y=True)
X, y = X.astype(np.float32), y.astype(str)
X, y = pd.DataFrame(X, columns=data["feature_names"]), pd.Series(y)

# For SVM/Neural Example - shrink data for faster example runtime
X = X.sample(n=100, random_state=1)
y = y.sample(n=100, random_state=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, random_state=42)


def predict(model, X):
    pred_proba = model.predict_proba(X)
    pred = pd.Series((str(np.argmax(p)) for p in pred_proba), index=X.index)
    pred_score = pd.Series((p[1] for p in pred_proba), index=X.index)
    return pred, pred_score


print("Step 1 ✅: Load Data Done!")
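The `predict` helper above turns each row of `predict_proba` output into a string label (the index of the largest probability) and a score (the probability of class `"1"`). Here is a minimal NumPy-only sketch of the same conversion, using a made-up probability matrix instead of a fitted model:

```python
import numpy as np

# Hypothetical predict_proba output: one row per sample, one column per class
pred_proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])

# Label = index of the most probable class, cast to str to match the string labels in y
pred_labels = np.argmax(pred_proba, axis=1).astype(str)

# Score = probability assigned to the positive class (column 1)
pred_scores = pred_proba[:, 1]

print(pred_labels.tolist())  # ['0', '1', '1']
print(pred_scores.tolist())  # [0.1, 0.8, 0.6]
```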

## Step 2: Import and Setup Arize Client
First, copy the Arize `API_KEY` and `SPACE_KEY` from your admin page, as shown below!



<img src="https://storage.googleapis.com/arize-assets/fixtures/copy-keys.png" width="700">

In [None]:
!pip install -q arize
from arize.pandas.logger import Client, Schema
from arize.utils.types import ModelTypes, Environments

SPACE_KEY = "SPACE_KEY"
API_KEY = "API_KEY"
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)

model_id = "breast_cancer_prediction_SHAP"
model_type = ModelTypes.SCORE_CATEGORICAL

if SPACE_KEY == "SPACE_KEY" or API_KEY == "API_KEY":
    raise ValueError("❌ NEED TO CHANGE SPACE AND/OR API_KEY")
else:
    print("Step 2 ✅: Import and Setup Arize Client Done! Now we can start using Arize!")

## Logging During Production
We'll use the following helper functions to generate prediction IDs and timestamps to simulate a production environment.

In [None]:
import uuid
from datetime import datetime, timedelta

# Prediction ID is required for logging any dataset
def generate_prediction_ids(X):
    return pd.Series((str(uuid.uuid4()) for _ in range(len(X))), index=X.index)


# OPTIONAL: We can directly specify when inferences were made
def simulate_production_timestamps(X, days=30):
    t = datetime.now()
    current_t, earlier_t = t.timestamp(), (t - timedelta(days=days)).timestamp()
    return pd.Series(np.linspace(earlier_t, current_t, num=len(X)), index=X.index)

## Part 3: Arize Quick-Start Helper
The helper in this section reshapes the SHAP output into the form Arize needs for logging: a `pandas.DataFrame` with one row per prediction and one column per feature.

In [None]:
def get_shap_values(model, X_train, X_test, ExplainerType, show_graph=False):
    # Linear explainers return a 2-D array that can be logged to Arize as is
    if ExplainerType == shap.LinearExplainer:
        explainer = shap.LinearExplainer(model, X_train)
        shap_values = explainer.shap_values(X_test)

    # Tree model explainers: [1] keeps the values for class "1"
    # (assumes shap_values returns one array per class)
    elif ExplainerType == shap.TreeExplainer:
        explainer = shap.TreeExplainer(model, X_train)
        shap_values = np.array(explainer.shap_values(X_test)[1])

    # Model-agnostic explainers: same per-class indexing as above
    else:
        explainer = shap.KernelExplainer(model.predict_proba, X_train)
        shap_values = np.array(explainer.shap_values(X_test)[1])

    # When not in production, it can be helpful to check graphs for feature explainability
    if show_graph:
        shap.summary_plot(shap_values, X_test, feature_names=data["feature_names"])

    return pd.DataFrame(shap_values, columns=data["feature_names"], index=X_test.index)


print("Step 3 ✅: Helper functions defined!")
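The `[1]` indexing in the helper above assumes that `shap_values` returns one `(n_samples, n_features)` array per class. As far as we know, some newer `shap` releases instead return a single array of shape `(n_samples, n_features, n_classes)`, in which case `[1]` would pick the second sample, not the positive class. Below is a hedged sketch of a shape-normalizing step. `select_class_shap` is a hypothetical helper, not part of `shap` or Arize, and the arrays are synthetic stand-ins for explainer output.

```python
import numpy as np


def select_class_shap(raw_values, class_index=1):
    """Return an (n_samples, n_features) array of SHAP values for one class."""
    if isinstance(raw_values, list):
        # Per-class list layout: one (n_samples, n_features) array per class
        return np.asarray(raw_values[class_index])
    values = np.asarray(raw_values)
    if values.ndim == 3:
        # Single-array layout: (n_samples, n_features, n_classes)
        return values[:, :, class_index]
    if values.ndim == 2:
        # Already one value per sample and feature (e.g. a linear explainer)
        return values
    raise ValueError(f"Unexpected SHAP value shape: {values.shape}")


# Synthetic stand-ins: 5 samples, 3 features, 2 classes
per_class_list = [np.zeros((5, 3)), np.ones((5, 3))]
stacked = np.stack(per_class_list, axis=-1)  # shape (5, 3, 2)

from_list = select_class_shap(per_class_list)
from_stacked = select_class_shap(stacked)
print(from_list.shape, from_stacked.shape)  # (5, 3) (5, 3)
```

Both layouts produce the same 2-D result, so the DataFrame construction at the end of `get_shap_values` would not need to change.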

## Part 4: Linear Models

In [None]:
model_version = f"{model_id}-linear-model-1.0"

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_test_pred, y_test_pred_score = predict(logreg, X_test)
shap_values = get_shap_values(logreg, X_train, X_test, shap.LinearExplainer)

# For this example we need to first assemble our data into a pandas DataFrame

# A mapping is needed to pair up each SHAP value column with its feature name
shap_values_column_names_mapping = {
    f"{feat}": f"{feat}_shap" for feat in data["feature_names"]
}

production_dataset = X_test.join(
    [
        pd.DataFrame(
            {
                "prediction_id": generate_prediction_ids(X_test),
                "prediction_ts": simulate_production_timestamps(X_test),
                "prediction_label": y_test_pred,
                "prediction_score": y_test_pred_score,
                "actual_label": y_test,
            }
        ),
        shap_values.rename(columns=shap_values_column_names_mapping),
    ]
)
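Because the mapping pairs each feature with its SHAP column purely by name, a typo on either side is easy to miss. As an optional check, you could confirm that every key and value of the mapping exists among the dataset's columns before logging. This is only a sketch: `check_shap_columns` is a hypothetical helper and the column list below is a small made-up example.

```python
def check_shap_columns(dataset_columns, shap_mapping):
    """Return (missing_features, missing_shap_columns) for a feature -> SHAP column mapping."""
    columns = set(dataset_columns)
    missing_features = [feat for feat in shap_mapping if feat not in columns]
    missing_shap = [col for col in shap_mapping.values() if col not in columns]
    return missing_features, missing_shap


# Made-up example: the SHAP column for "mean texture" was never joined in
columns = ["mean radius", "mean texture", "mean radius_shap"]
mapping = {feat: f"{feat}_shap" for feat in ["mean radius", "mean texture"]}

missing_features, missing_shap = check_shap_columns(columns, mapping)
print(missing_features)  # []
print(missing_shap)      # ['mean texture_shap']
```

In this notebook the call would be `check_shap_columns(production_dataset.columns, shap_values_column_names_mapping)`, and both returned lists should be empty.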

Three easy steps to log a `pandas.DataFrame`. See [docs](https://docs.arize.com/arize/api-reference/python-sdk/arize.pandas) for more details.

1.   Define `Schema` to designate column names
2.   Call `arize_client.log()`
3.   Check `response.status_code`

In [None]:
# Define a Schema() object for Arize to pick up data from the correct columns for logging
production_schema = Schema(
    prediction_id_column_name="prediction_id",  # REQUIRED
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="prediction_label",
    prediction_score_column_name="prediction_score",
    actual_label_column_name="actual_label",
    feature_column_names=data["feature_names"],
    shap_values_column_names=shap_values_column_names_mapping,
)

# arize_client.log returns a Response object from Python's requests module
response = arize_client.log(
    dataframe=production_dataset,
    schema=production_schema,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.PRODUCTION,
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(
        f"❌ logging failed with response code {response.status_code}, {response.text}"
    )
else:
    print(
        f"Step 4 ✅: You have successfully logged {len(production_dataset)} data points to Arize!"
    )
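The status check above is repeated after every logging call in this notebook. One way to avoid the repetition is a small function that builds the message from the response. The sketch below is an assumption-laden illustration: `describe_log_response` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins exposing only `status_code` and `text`, not real HTTP responses.

```python
from types import SimpleNamespace


def describe_log_response(response, n_rows, step):
    """Build the status message to print after a logging call."""
    if response.status_code != 200:
        return (
            f"❌ logging failed with response code "
            f"{response.status_code}, {response.text}"
        )
    return f"Step {step} ✅: You have successfully logged {n_rows} data points to Arize!"


# Stand-in responses for illustration only
ok = SimpleNamespace(status_code=200, text="")
bad = SimpleNamespace(status_code=401, text="unauthorized")

print(describe_log_response(ok, n_rows=25, step=4))
print(describe_log_response(bad, n_rows=25, step=4))
```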

## Part 5: Tree Models

In [None]:
model_version = f"{model_id}-tree-model-1.0"

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier().fit(X_train, y_train)
y_test_pred, y_test_pred_score = predict(clf, X_test)
shap_values = get_shap_values(clf, X_train, X_test, shap.TreeExplainer)

# For this example we need to first assemble our data into a pandas DataFrame

# A mapping is needed to pair up each SHAP value column with its feature name
shap_values_column_names_mapping = {
    f"{feat}": f"{feat}_shap" for feat in data["feature_names"]
}

production_dataset = X_test.join(
    [
        pd.DataFrame(
            {
                "prediction_id": generate_prediction_ids(X_test),
                "prediction_ts": simulate_production_timestamps(X_test),
                "prediction_label": y_test_pred,
                "prediction_score": y_test_pred_score,
                "actual_label": y_test,
            }
        ),
        shap_values.rename(columns=shap_values_column_names_mapping),
    ]
)

Three easy steps to log a `pandas.DataFrame`. See [docs](https://docs.arize.com/arize/api-reference/python-sdk/arize.pandas) for more details.

1.   Define `Schema` to designate column names
2.   Call `arize_client.log()`
3.   Check `response.status_code`

In [None]:
# Define a Schema() object for Arize to pick up data from the correct columns for logging
production_schema = Schema(
    prediction_id_column_name="prediction_id",  # REQUIRED
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="prediction_label",
    prediction_score_column_name="prediction_score",
    actual_label_column_name="actual_label",
    feature_column_names=data["feature_names"],
    shap_values_column_names=shap_values_column_names_mapping,
)

# arize_client.log returns a Response object from Python's requests module
response = arize_client.log(
    dataframe=production_dataset,
    schema=production_schema,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.PRODUCTION,
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(
        f"❌ logging failed with response code {response.status_code}, {response.text}"
    )
else:
    print(
        f"Step 5 ✅: You have successfully logged {len(production_dataset)} data points to Arize!"
    )

## Part 6: Neural Models and Model Agnostic
In this section, we will use `shap.KernelExplainer`, which is model agnostic and therefore also works for models such as neural networks and SVMs.

In [None]:
model_version = f"{model_id}-nn-model-1.0"

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier().fit(X_train, y_train)
y_test_pred, y_test_pred_score = predict(clf, X_test)
shap_values = get_shap_values(clf, X_train, X_test, shap.KernelExplainer)

# For this example we need to first assemble our data into a pandas DataFrame

# A mapping is needed to pair up each SHAP value column with its feature name
shap_values_column_names_mapping = {
    f"{feat}": f"{feat}_shap" for feat in data["feature_names"]
}

production_dataset = X_test.join(
    [
        pd.DataFrame(
            {
                "prediction_id": generate_prediction_ids(X_test),
                "prediction_ts": simulate_production_timestamps(X_test),
                "prediction_label": y_test_pred,
                "prediction_score": y_test_pred_score,
                "actual_label": y_test,
            }
        ),
        shap_values.rename(columns=shap_values_column_names_mapping),
    ]
)

Three easy steps to log a `pandas.DataFrame`. See [docs](https://docs.arize.com/arize/api-reference/python-sdk/arize.pandas) for more details.

1.   Define `Schema` to designate column names
2.   Call `arize_client.log()`
3.   Check `response.status_code`

In [None]:
# Define a Schema() object for Arize to pick up data from the correct columns for logging
production_schema = Schema(
    prediction_id_column_name="prediction_id",  # REQUIRED
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="prediction_label",
    prediction_score_column_name="prediction_score",
    actual_label_column_name="actual_label",
    feature_column_names=data["feature_names"],
    shap_values_column_names=shap_values_column_names_mapping,
)

# arize_client.log returns a Response object from Python's requests module
response = arize_client.log(
    dataframe=production_dataset,
    schema=production_schema,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.PRODUCTION,
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(
        f"❌ logging failed with response code {response.status_code}, {response.text}"
    )
else:
    print(
        f"Step 6 ✅: You have successfully logged {len(production_dataset)} data points to Arize!"
    )
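Part 1 shrinks the data to 100 rows to keep the runtime of this example short, since the cost of `shap.KernelExplainer` grows with the size of the background data it is given. On larger datasets, one common approach is to pass a smaller background set. The sketch below draws a random subset with NumPy. `sample_background` is a hypothetical helper written for this example and `X_large` is synthetic data; the `shap` library also provides `shap.sample` and `shap.kmeans` for the same purpose.

```python
import numpy as np


def sample_background(X, n_rows, seed=0):
    """Return up to n_rows randomly chosen rows of a 2-D array, without replacement."""
    X = np.asarray(X)
    if len(X) <= n_rows:
        return X
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(X), size=n_rows, replace=False)
    return X[chosen]


# Synthetic stand-in for a large training matrix: 1000 rows, 30 features
X_large = np.random.default_rng(1).normal(size=(1000, 30))
background = sample_background(X_large, n_rows=50)
print(background.shape)  # (50, 30)
```

In this notebook, the sampled array would take the place of `X_train` in `shap.KernelExplainer(model.predict_proba, X_train)`. A smaller background set trades some accuracy of the SHAP estimates for speed.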

### Check Data Ingestion Information
You now know how to seamlessly log SHAP values onto the Arize platform. Go to [Arize](https://app.arize.com/) in order to analyze and monitor the logged SHAP values.

Data becomes available in the UI about 10 minutes after it is received. If you send data for a new model, the model appears in the Arize platform almost immediately, but its data will not be visible yet. To verify that data was sent correctly and is being processed, we recommend that you check the Data Ingestion tab.

You will be able to see the predictions, actuals, and feature importances that have been sent in the last week, last day or last 30 minutes.

An example view of the Data Ingestion tab from a model, when data is sent continuously over 30 minutes, is shown in the image below.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/data-ingestion-tab.png" width="700">



### Overview
Arize is an end-to-end ML observability and model monitoring platform. The platform is designed to help ML engineers and data science practitioners surface and fix issues with ML models in production faster with:
- Automated ML monitoring and model monitoring
- Workflows to troubleshoot model performance
- Real-time visualizations for model performance monitoring, data quality monitoring, and drift monitoring
- Model prediction cohort analysis
- Pre-deployment model validation
- Integrated model explainability

### Website
Visit Us At: https://arize.com/model-monitoring/

### Additional Resources
- [What is ML observability?](https://arize.com/what-is-ml-observability/)
- [Playbook to model monitoring in production](https://arize.com/the-playbook-to-monitor-your-models-performance-in-production/)
- [Using statistical distance metrics for ML monitoring and observability](https://arize.com/using-statistical-distance-metrics-for-machine-learning-observability/)
- [ML infrastructure tools for data preparation](https://arize.com/ml-infrastructure-tools-for-data-preparation/)
- [ML infrastructure tools for model building](https://arize.com/ml-infrastructure-tools-for-model-building/)
- [ML infrastructure tools for production](https://arize.com/ml-infrastructure-tools-for-production-part-1/)
- [ML infrastructure tools for model deployment and model serving](https://arize.com/ml-infrastructure-tools-for-production-part-2-model-deployment-and-serving/)
- [ML infrastructure tools for ML monitoring and observability](https://arize.com/ml-infrastructure-tools-ml-observability/)

Visit the [Arize Blog](https://arize.com/blog) and [Resource Center](https://arize.com/resource-hub/) for more resources on ML observability and model monitoring.
