# Arize Tutorial: Logging Predictions First, Then Logging SHAP After

Let's get started on using Arize! ✨

Arize helps you visualize your model performance, understand drift & data quality issues, and share insights learned from your models.

In this tutorial, we will use our Score Categorical model, which predicts whether someone has breast cancer, to showcase one of the many ways of using `arize.pandas.log` to log (i.e., send) data from a Pandas DataFrame to the Arize platform.

### Why Use Multiple `log` Calls 🤔
Sometimes, we want to `log` predictions during production and store our `prediction_ids` right away for model tracking, but we don't have ground truth labels available until much later. Other times, they become available at the same time. Depending on your situation, you may need to use `log` differently.

**In this notebook, we will show how to `log` using `prediction_ids` to log only your predictions, then follow up to log the delayed SHAP values 🚀**

For more of our use-case tutorials, visit our other [example tutorials](https://arize.gitbook.io/arize/examples).

In general, if any part of your data (including `features`) becomes available later and you can't log it right away, Arize can match it to the original prediction through `prediction_ids`, which are a required input for all `log` calls.

### Running This Notebook
1. Save a copy in Google Drive for yourself.
2. Step through each section below, pressing play on the code blocks to run the cells.
3. In Step 2, use your own Org and API key from your Arize account.


## Step 1: Load Data and Build Model

In [None]:
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

###############################################################################
# 1 Load data and split data
data = datasets.load_breast_cancer()
X, y = data.data, data.target  # reuse the loaded dataset instead of loading twice

# NOTE: Cast y to str since SCORE_CATEGORICAL labels are expected to be non-integer.
X, y = X.astype(np.float32), y.astype(str)
X, y = pd.DataFrame(X, columns=data["feature_names"]), pd.Series(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, random_state=42)

###############################################################################
# 2 Fit a simple logistic regression model
clf = LogisticRegression(max_iter=3000, verbose=False).fit(X_train, y_train)

# 3 Use the model to generate predictions
def predict(model, X):
    proba = model.predict_proba(X)
    pred = pd.Series((str(np.argmax(p)) for p in proba), index=X.index)
    score = pd.Series((p[1] for p in proba), index=X.index)
    return pred, score


y_train_pred, y_train_pred_score = predict(clf, X_train)
y_val_pred, y_val_pred_score = predict(clf, X_val)
y_test_pred, y_test_pred_score = predict(clf, X_test)

print("Step 1 ✅: Load Data & Build Model Done!")
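
To see how many rows end up in each split (and therefore how many predictions get logged later), the sketch below reproduces the arithmetic of the two `train_test_split` calls. It assumes scikit-learn's default `test_size` of 0.25 with the held-out count rounded up, and 569 rows in the breast cancer dataset; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
from math import ceil


def split_sizes(n_samples, test_fraction=0.25):
    # Assumption: the held-out count is rounded up, the rest is kept for training
    n_held_out = ceil(test_fraction * n_samples)
    return n_samples - n_held_out, n_held_out


n_total = 569  # rows in the breast cancer dataset
n_train_full, n_test = split_sizes(n_total)  # first split: train+val vs. test
n_train, n_val = split_sizes(n_train_full)  # second split: train vs. val

print(n_train, n_val, n_test)  # 319 107 143
```

Under these assumptions, `X_test` holds 143 rows, which is the number of predictions sent to Arize in Step 3.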

## Step 2: Import and Setup Arize Client
First, copy the Arize `API_KEY` and `ORGANIZATION_KEY` from your admin page linked below!

[![Button_Open.png](https://storage.googleapis.com/arize-assets/fixtures/Button_Open.png)](https://app.arize.com/admin)

<img src="https://storage.googleapis.com/arize-assets/fixtures/copy-keys.jpeg" width="600">

In [None]:
!pip install -q arize
from arize.pandas.logger import Client, Schema
from arize.utils.types import ModelTypes, Environments

ORGANIZATION_KEY = "ORGANIZATION_KEY"
API_KEY = "API_KEY"
arize_client = Client(organization_key=ORGANIZATION_KEY, api_key=API_KEY)

model_id = "logging_tutorial_delayed_shap"
model_version = "1.0"
model_type = ModelTypes.SCORE_CATEGORICAL

if ORGANIZATION_KEY == "ORGANIZATION_KEY" or API_KEY == "API_KEY":
    raise ValueError("❌ NEED TO CHANGE ORGANIZATION AND/OR API_KEY")
else:
    print("Step 2 ✅: Import and Setup Arize Client Done! Now we can start using Arize!")

# Logging Tutorial
We'll use the following helper functions to generate prediction IDs and timestamps to simulate a production environment.

In [None]:
import uuid
from datetime import datetime, timedelta

# Prediction ID is required for all datasets
def generate_prediction_ids(X):
    return pd.Series((str(uuid.uuid4()) for _ in range(len(X))), index=X.index)


# OPTIONAL: We can directly specify when inferences were made
def simulate_production_timestamps(X, days=30):
    t = datetime.now()
    current_ts, earlier_ts = t.timestamp(), (t - timedelta(days=days)).timestamp()
    return pd.Series(np.linspace(earlier_ts, current_ts, num=len(X)), index=X.index)
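
To make the behavior of these helpers easier to inspect, here is a dependency-free sketch of the same idea using only the standard library. The names `make_prediction_ids` and `spaced_timestamps` are hypothetical stand-ins for the helpers above, not part of the Arize SDK.

```python
import uuid
from datetime import datetime, timedelta


def make_prediction_ids(n):
    # One random UUID4 string per row; collisions are vanishingly unlikely
    return [str(uuid.uuid4()) for _ in range(n)]


def spaced_timestamps(n, days=30, now=None):
    # Evenly spaced Unix timestamps from `days` ago up to `now`
    now = now or datetime.now()
    start = (now - timedelta(days=days)).timestamp()
    end = now.timestamp()
    if n == 1:
        return [start]
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]


ids = make_prediction_ids(5)
ts = spaced_timestamps(5, days=30)

print(len(set(ids)) == len(ids))  # every ID should be distinct
print(ts == sorted(ts))  # timestamps should increase monotonically
```

Unique IDs matter because they are the key used to match later data to each prediction; the spread-out timestamps simply make the simulated traffic look like a month of production inferences.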

## Step 3: Logging Predictions
We can log predictions to Arize first, and match various other values such as actuals, explainability (e.g., SHAP), or even features later.

In this example, we will use `arize.pandas.log` to log only the `prediction_labels` and `features`, assuming you have them available. This simulates predictions being made in production as features become available.

You can see our `arize.pandas.log()` documentation by clicking the button below.

[![Buttons_OpenOrange.png](https://storage.googleapis.com/arize-assets/fixtures/Buttons_OpenOrange.png)](https://docs.arize.com/arize/sdks-and-integrations/python-sdk/arize.pandas)

In [None]:
# For this example we need to first assemble our data into a pandas DataFrame
production_dataset = X_test.join(
    pd.DataFrame(
        {
            "prediction_id": generate_prediction_ids(X_test),
            "prediction_ts": simulate_production_timestamps(X_test, days=30),
            "prediction_label": y_test_pred,
            "prediction_score": y_test_pred_score,
        }
    )
)

Three easy steps to log a `pandas.DataFrame`. See [docs](https://docs.arize.com/arize/api-reference/python-sdk/arize.pandas) for more details.

1.   Define `Schema` to designate column names
2.   Call `arize.pandas.log()`
3.   Check `response.status_code`
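
Before step 1, it can help to confirm that every column the `Schema` will name actually exists in the DataFrame. The sketch below is one way to do that; `missing_columns` is a hypothetical helper written for this tutorial, not an Arize SDK function, and the column lists are made up for illustration.

```python
def missing_columns(available, required):
    # Return the required column names that are absent, preserving their order
    present = set(available)
    return [name for name in required if name not in present]


# Hypothetical column lists; with pandas you would pass `df.columns` as `available`
available = ["mean radius", "prediction_id", "prediction_ts", "prediction_label"]
required = ["prediction_id", "prediction_ts", "prediction_label", "prediction_score"]

print(missing_columns(available, required))  # ['prediction_score']
```

An empty list means the schema and the DataFrame agree on column names, so a failed `log` call is less likely to be caused by a simple typo.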

In [None]:
# Define a Schema() object for Arize to pick up data from the correct columns for logging
production_schema = Schema(
    prediction_id_column_name="prediction_id",  # REQUIRED
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="prediction_label",
    prediction_score_column_name="prediction_score",
    feature_column_names=data["feature_names"],
)

# arize_client.log returns a Response object from the `requests` library
response = arize_client.log(
    dataframe=production_dataset,
    schema=production_schema,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.PRODUCTION,
    path="inferences.bin",
    batch_id=None,
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(
        f"❌ logging failed with response code {response.status_code}, {response.text}"
    )
else:
    print(
        f"Step 3 ✅: You have successfully logged {len(production_dataset)} data points to Arize!"
    )

## Step 4.1: Generating and Formatting SHAP Values
**SHAP (SHapley Additive exPlanations)** is a game theoretic approach to explain the output of any machine learning model.

For more in-depth usage of the `shap` library, visit [SHAP Core Explainers](https://shap-lrjball.readthedocs.io/en/docs_update/generated/shap.Explainer.html) and pick an explainer specific to your machine learning model. `shap.Explainer` is the default explainer that matches the model type, but you can specify your own. For example, you can choose `shap.TreeExplainer`, but it won't work on models such as `sklearn.linear_model.LogisticRegression`.

We create the helper function `get_shap_values` to format the data and, optionally, create visualizations for our SHAP values. We will store our results in a `pd.DataFrame` with matching columns for logging later.
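
For intuition about what a linear explainer produces, the sketch below computes SHAP values for a linear model by hand, assuming features are treated as independent: each value is the coefficient times the feature's deviation from its background mean. The weights and data are made up for illustration, and this is not the `shap` library's implementation.

```python
def linear_shap(weights, background, x):
    # Mean of each feature over the background rows
    n = len(background)
    means = [sum(row[j] for row in background) / n for j in range(len(weights))]
    # Contribution of feature j: w_j * (x_j - E[x_j])
    return [w * (xj - m) for w, xj, m in zip(weights, x, means)]


def linear_output(weights, bias, x):
    # Raw output of the linear model (for logistic regression, the log-odds)
    return bias + sum(w * xj for w, xj in zip(weights, x))


weights, bias = [2.0, -1.0], 0.5
background = [[0.0, 0.0], [2.0, 4.0]]  # feature means are 1.0 and 2.0
x = [3.0, 1.0]

phi = linear_shap(weights, background, x)
base = sum(linear_output(weights, bias, row) for row in background) / len(background)

print(phi)  # [4.0, 1.0]
print(base + sum(phi) == linear_output(weights, bias, x))  # True
```

The final check is the additivity property: the average model output over the background data plus the per-feature SHAP values recovers the model output for `x`. There is one SHAP value per feature per row, which is why the result fits in a DataFrame with the same columns as the features.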

In [None]:
!pip install -q shap
import shap


def get_shap_values(model, X_train, X_test, ExplainerType, show_graph=False):
    # Linear explainers return SHAP values directly in a shape that can be logged to Arize
    if ExplainerType == shap.LinearExplainer:
        explainer = shap.LinearExplainer(model, X_train)
        shap_values = explainer.shap_values(X_test)

    # Tree Model Explainers
    elif ExplainerType == shap.TreeExplainer:
        explainer = shap.TreeExplainer(model, X_train)
        shap_values = np.array(explainer.shap_values(X_test)[1])

    # Model Agnostic Explainers
    else:
        explainer = shap.KernelExplainer(model.predict_proba, X_train)
        shap_values = np.array(explainer.shap_values(X_test)[1])

    # When not in production, it can be helpful to check graphs for feature explainability
    if show_graph:
        shap.summary_plot(shap_values, X_test, feature_names=data["feature_names"])

    return pd.DataFrame(shap_values, columns=data["feature_names"], index=X_test.index)


shap_values = get_shap_values(
    clf, X_train, X_test, shap.LinearExplainer, show_graph=True
)
print(
    f"Step 4.1 ✅: If no errors showed up, you have just generated {len(shap_values)} rows of SHAP values!"
)

## Step 4.2: Matching Explainability
Sometimes, we want to log explainability metrics after production time, when the predictions have already been logged. If the `log` calls are made separately, the shape, length, and order of the `prediction_labels` and `shap_values` do not need to match.

**IMPORTANT:** To match a SHAP value with a prediction, both MUST be logged with the same `prediction_id`
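
The idea of matching by `prediction_id` can be illustrated locally with a pandas join. This is only a sketch with made-up IDs and values; how Arize performs the match on its side is not shown here.

```python
import pandas as pd

# Hypothetical predictions that were logged first
predictions = pd.DataFrame(
    {"prediction_id": ["a", "b", "c"], "prediction_label": ["1", "0", "1"]}
)

# Hypothetical SHAP values arriving later, in a different order and for fewer rows
shap_later = pd.DataFrame(
    {"prediction_id": ["c", "a"], "mean radius_shap": [0.2, -0.1]}
)

# A left join keyed on prediction_id lines each SHAP row up with its prediction
matched = predictions.merge(shap_later, on="prediction_id", how="left")
print(matched)
```

Rows `a` and `c` pick up their SHAP values regardless of order, while `b` stays empty until its SHAP value is supplied under the same `prediction_id`.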

In [None]:
# A mapping is needed to pair up each SHAP value column with its feature name
shap_values_column_names_mapping = {
    f"{feat}": f"{feat}_shap" for feat in data["feature_names"]
}

# Here we create a SHAP values DataFrame with matching prediction_ids
shap_dataset = production_dataset[["prediction_id"]].join(
    shap_values.rename(columns=shap_values_column_names_mapping)
)

Three easy steps to log a `pandas.DataFrame`. See [docs](https://docs.arize.com/arize/api-reference/python-sdk/arize.pandas) for more details.

1.   Define `Schema` to designate column names
2.   Call `arize.pandas.log()`
3.   Check `response.status_code`

In [None]:
# Define a Schema() object for Arize to pick up data from the correct columns for logging
shap_schema = Schema(
    prediction_id_column_name="prediction_id",  # REQUIRED
    shap_values_column_names=shap_values_column_names_mapping,
    feature_column_names=[],
)

# arize_client.log returns a Response object from the `requests` library
response = arize_client.log(
    dataframe=shap_dataset,
    schema=shap_schema,
    path="inferences.bin",
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.PRODUCTION,
    batch_id=None,
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(
        f"❌ logging failed with response code {response.status_code}, {response.text}"
    )
else:
    print(
        f"Step 4.2 ✅: You have successfully logged {len(shap_dataset)} data points to Arize!"
    )

### Overview
Arize is an end-to-end ML observability and model monitoring platform. The platform is designed to help ML engineers and data science practitioners surface and fix issues with ML models in production faster with:
- Automated ML monitoring and model monitoring
- Workflows to troubleshoot model performance
- Real-time visualizations for model performance monitoring, data quality monitoring, and drift monitoring
- Model prediction cohort analysis
- Pre-deployment model validation
- Integrated model explainability

### Website
Visit Us At: https://arize.com/model-monitoring/

### Additional Resources
- [What is ML observability?](https://arize.com/what-is-ml-observability/)
- [Playbook to model monitoring in production](https://arize.com/the-playbook-to-monitor-your-models-performance-in-production/)
- [Using statistical distance metrics for ML monitoring and observability](https://arize.com/using-statistical-distance-metrics-for-machine-learning-observability/)
- [ML infrastructure tools for data preparation](https://arize.com/ml-infrastructure-tools-for-data-preparation/)
- [ML infrastructure tools for model building](https://arize.com/ml-infrastructure-tools-for-model-building/)
- [ML infrastructure tools for production](https://arize.com/ml-infrastructure-tools-for-production-part-1/)
- [ML infrastructure tools for model deployment and model serving](https://arize.com/ml-infrastructure-tools-for-production-part-2-model-deployment-and-serving/)
- [ML infrastructure tools for ML monitoring and observability](https://arize.com/ml-infrastructure-tools-ml-observability/)

Visit the [Arize Blog](https://arize.com/blog) and [Resource Center](https://arize.com/resource-hub/) for more resources on ML observability and model monitoring.
