# Arize Tutorial: Surrogate Model Feature Importance

Let's get started on using Arize! ✨

Arize helps you visualize your model performance, understand drift & data quality issues, and share insights learned from your models.

A surrogate model is an interpretable model trained to predict the predictions of a black box model. The goal is to approximate the black box model's predictions as closely as possible and then generate feature importance values from the interpretable surrogate. The benefit of this approach is that it does not require any knowledge of the inner workings of the black box model.
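
To make the idea concrete, here is a minimal sketch of the surrogate approach using only scikit-learn. It is not Arize's implementation: the shallow decision tree, the `max_depth=3` setting, and the fidelity check are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Black box: we only use its predictions, never its internals.
black_box = SVC().fit(X, y)
black_box_preds = black_box.predict(X)

# Surrogate: an interpretable model trained to reproduce the black box's
# predictions (the target is black_box_preds, not the true labels y).
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, black_box_preds)

# Fidelity: how often the surrogate agrees with the black box.
fidelity = accuracy_score(black_box_preds, surrogate.predict(X))

# Feature importances come from the surrogate, not from the black box.
importances = dict(zip(X.columns, surrogate.feature_importances_))
top_features = sorted(importances, key=importances.get, reverse=True)[:5]
print(f"fidelity={fidelity:.3f}", top_features)
```

A low fidelity score would suggest the surrogate is too simple to stand in for the black box, in which case its feature importances should be read with caution.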

In this tutorial we use the `surrogate_explainability` flag to compute feature importance values from a surrogate model using only the prediction outputs of a black box model. Both [classification](#classification) and [regression](#regression) examples are provided below, and the feature importance values are logged to Arize using our [Pandas logger](https://docs.arize.com/arize/api-reference/python-sdk/arize.pandas).

## Install Dependencies and Import Libraries 📚

In [1]:
!pip install -q "arize[MimicExplainer]"

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
import uuid
from datetime import datetime, timedelta

from arize.pandas.logger import Client, Schema
from arize.utils.types import ModelTypes, Environments

<a name="classification"></a>
# Classification Example
### Generate example
In this example we'll use a support vector machine (SVM) as our black box model. Only the prediction outputs of the SVM model are needed to train the surrogate model; feature importances are generated from the surrogate model and logged to Arize.

In [2]:
bc = load_breast_cancer()

feature_names = bc.feature_names
target_names = bc.target_names
data, target = bc.data, bc.target

df = pd.DataFrame(data, columns=feature_names)

model = SVC(probability=True).fit(df, target)

# Map class indices to class names with NumPy indexing
prediction_label = pd.Series(target_names[model.predict(df)])
# Probability of the positive class (column 1 of predict_proba)
prediction_score = pd.Series(model.predict_proba(df)[:, 1])
actual_label = pd.Series(target_names[target])
actual_score = pd.Series(target)

## Initialize Arize client
You can find your `API_KEY` and `SPACE_KEY` by navigating to the settings page in your workspace as shown below (only space admins can see the keys). 



<img src="https://storage.cloud.google.com/arize-assets/fixtures/copy-keys.png" width="700">

In [None]:
SPACE_KEY = "SPACE_KEY"
API_KEY = "API_KEY"
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)

model_id = "surrogate_model_example_classification"
model_version = "1.0"
model_type = ModelTypes.SCORE_CATEGORICAL


if SPACE_KEY == "SPACE_KEY" or API_KEY == "API_KEY":
    raise ValueError("❌ NEED TO CHANGE SPACE AND/OR API_KEY")
else:
    print("✅ Import and Setup Arize Client Done! Now we can start using Arize!")

### Use the `surrogate_explainability` flag in the Python SDK


Helper functions to simulate prediction IDs and timestamps.

In [None]:
# Prediction ID is required for logging any dataset
def generate_prediction_ids(df):
    return pd.Series((str(uuid.uuid4()) for _ in range(len(df))), index=df.index)


# OPTIONAL: We can directly specify when inferences were made
def simulate_production_timestamps(df, days=30):
    t = datetime.now()
    current_t, earlier_t = t.timestamp(), (t - timedelta(days=days)).timestamp()
    return pd.Series(np.linspace(earlier_t, current_t, num=len(df)), index=df.index)
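
The timestamps produced above are Unix epoch seconds stored as floats. As a quick sanity check, the sketch below reproduces the same `np.linspace` logic on a hypothetical three-row frame (`toy_df` is illustrative only) and converts the values back to readable datetimes.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"x": [1, 2, 3]}, index=[10, 11, 12])  # hypothetical data

now = datetime.now()
start_ts, end_ts = (now - timedelta(days=30)).timestamp(), now.timestamp()

# Evenly spaced epoch seconds, aligned to the frame's own index
ts = pd.Series(np.linspace(start_ts, end_ts, num=len(toy_df)), index=toy_df.index)

# Convert epoch seconds to datetimes for inspection
readable = pd.to_datetime(ts, unit="s")
print(readable)
```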

Assemble new Pandas DataFrame as a production dataset with prediction IDs and timestamps.

In [None]:
production_dataset = pd.concat(
    [
        pd.DataFrame(
            {
                "prediction_id": generate_prediction_ids(df),
                "prediction_ts": simulate_production_timestamps(df),
                "prediction_label": prediction_label,
                "actual_label": actual_label,
                "prediction_score": prediction_score,
                "actual_score": actual_score,
            }
        ),
        df
    ],
    axis=1,
)
production_dataset
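
`pd.concat(..., axis=1)` aligns rows by index label rather than by position, which is why the helpers above build their Series with `index=df.index`. The sketch below uses hypothetical toy frames to show what goes wrong when the indexes differ, and one way to guard against it.

```python
import pandas as pd

# Hypothetical feature frame whose index does not start at 0 (e.g. after filtering)
features = pd.DataFrame({"f1": [0.1, 0.2, 0.3]}, index=[5, 6, 7])

# A Series built without an explicit index gets a fresh RangeIndex (0, 1, 2)
labels_default = pd.Series(["a", "b", "c"], name="prediction_label")
misaligned = pd.concat([labels_default, features], axis=1)
print(misaligned)  # rows cover the union of both indexes, padded with NaN

# Reusing the feature frame's index keeps rows paired one-to-one
labels_aligned = pd.Series(["a", "b", "c"], index=features.index, name="prediction_label")
aligned = pd.concat([labels_aligned, features], axis=1)
assert len(aligned) == len(features) and not aligned.isna().any().any()
```

If the feature DataFrame has been filtered or shuffled, calling `reset_index(drop=True)` on it before building the other columns is another way to avoid the mismatch.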

In [None]:
# Define a Schema() object for Arize to pick up data from the correct columns for logging
production_schema = Schema(
    prediction_id_column_name="prediction_id",  # REQUIRED
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="prediction_label",
    prediction_score_column_name="prediction_score",
    actual_label_column_name="actual_label",
    actual_score_column_name="actual_score",
    feature_column_names=list(feature_names),  # convert the NumPy array to a plain list
)

# arize_client.log returns a Response object from the `requests` library
response = arize_client.log(
    dataframe=production_dataset,
    schema=production_schema,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.PRODUCTION,
    surrogate_explainability=True,  # set the surrogate_explainability flag to True here
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(
        f"❌ logging failed with response code {response.status_code}, {response.text}"
    )
else:
    print(
        f"✅ You have successfully logged {len(production_dataset)} data points to Arize!"
    )
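
Before calling `arize_client.log`, it can help to confirm locally that every column named in the schema exists in the DataFrame and that prediction IDs are present and unique. The helper below is a hypothetical pre-flight check written with plain pandas; `find_missing_columns` is not part of the Arize SDK.

```python
import pandas as pd


def find_missing_columns(dataframe, required_columns):
    """Return the required column names that are absent from the DataFrame."""
    return [col for col in required_columns if col not in dataframe.columns]


# Hypothetical miniature of the production dataset
toy = pd.DataFrame(
    {
        "prediction_id": ["a1", "b2"],
        "prediction_label": ["benign", "malignant"],
        "mean radius": [12.3, 20.1],
    }
)

required = ["prediction_id", "prediction_ts", "prediction_label", "mean radius"]
missing = find_missing_columns(toy, required)
print(missing)  # → ['prediction_ts']

# Prediction IDs should be non-null and unique
ids_ok = bool(toy["prediction_id"].notna().all() and toy["prediction_id"].is_unique)
print(ids_ok)  # → True
```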

<a name="regression"></a>
# Regression Example
### Generate example
In this example we'll use a support vector regressor (SVR) as our black box model. Only the prediction outputs of the SVR model are needed to train the surrogate model; feature importances are generated from the surrogate model and sent to Arize.

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()

# Use only 1,000 data points for a speedier example
data_reg = housing.data[:1000]
target_reg = housing.target[:1000]
feature_names_reg = housing.feature_names

df_reg = pd.DataFrame(data_reg, columns=feature_names_reg)

from sklearn.svm import SVR

model_reg = SVR().fit(df_reg, target_reg)

prediction_label_reg = pd.Series(model_reg.predict(df_reg))
actual_label_reg = pd.Series(target_reg)
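
For regression, the surrogate idea is the same, but agreement with the black box is measured with a regression metric instead of accuracy. The sketch below is an illustration only, not Arize's implementation: the synthetic data, the `DecisionTreeRegressor` surrogate, and the use of R² as a fidelity score are all assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): feature 0 dominates the target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)

black_box = SVR().fit(X, y)
black_box_preds = black_box.predict(X)

# Surrogate regressor trained on the black box's outputs, not on y
surrogate = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, black_box_preds)

# Fidelity for regression: R^2 between black-box and surrogate predictions
fidelity = r2_score(black_box_preds, surrogate.predict(X))
importances = surrogate.feature_importances_
print(f"fidelity={fidelity:.3f}", importances.round(3))
```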

### Use the `surrogate_explainability` flag in the Python SDK

Assemble Pandas DataFrame as a production dataset with prediction IDs and timestamps.


In [None]:
production_dataset_reg = pd.concat(
    [
        pd.DataFrame(
            {
                "prediction_id": generate_prediction_ids(df_reg),
                "prediction_ts": simulate_production_timestamps(df_reg),
                "prediction_label": prediction_label_reg,
                "actual_label": actual_label_reg,
            }
        ),
        df_reg
    ],
    axis=1,
)

production_dataset_reg

Send DataFrame to Arize.

In [None]:
# Define a Schema() object for Arize to pick up data from the correct columns for logging
production_schema_reg = Schema(
    prediction_id_column_name="prediction_id",  # REQUIRED
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="prediction_label",
    actual_label_column_name="actual_label",
    feature_column_names=feature_names_reg,
)

# arize_client.log returns a Response object from the `requests` library
response_reg = arize_client.log(
    dataframe=production_dataset_reg,
    schema=production_schema_reg,
    model_id="surrogate_model_example_regression",
    model_type=ModelTypes.NUMERIC,
    environment=Environments.PRODUCTION,
    surrogate_explainability=True,  # set the surrogate_explainability flag to True here
)

# If successful, the server will return a status_code of 200
if response_reg.status_code != 200:
    print(
        f"❌ logging failed with response code {response_reg.status_code}, {response_reg.text}"
    )
else:
    print(
        f"✅ You have successfully logged {len(production_dataset_reg)} data points to Arize!"
    )

### Check Data Ingestion Information
You now know how to log surrogate model feature importance values to the Arize platform. Go to [Arize](https://app.arize.com/) to analyze and monitor the logged SHAP values.

Data becomes available in the UI about 10 minutes after it is received. When you send data for a new model, the model itself appears in the Arize platform almost immediately, but its data will not be visible yet. To verify that data has been sent correctly and is being processed, we recommend checking the Data Ingestion tab.

You will be able to see the predictions, actuals, and feature importances that have been sent in the last week, last day or last 30 minutes.

An example view of the Data Ingestion tab from a model, when data is sent continuously over 30 minutes, is shown in the image below.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/data-ingestion-tab.png" width="700">



### Overview
Arize is an end-to-end ML observability and model monitoring platform. The platform is designed to help ML engineers and data science practitioners surface and fix issues with ML models in production faster with:
- Automated ML monitoring and model monitoring
- Workflows to troubleshoot model performance
- Real-time visualizations for model performance monitoring, data quality monitoring, and drift monitoring
- Model prediction cohort analysis
- Pre-deployment model validation
- Integrated model explainability

### Website
Visit Us At: https://arize.com/model-monitoring/

### Additional Resources
- [What is ML observability?](https://arize.com/what-is-ml-observability/)
- [Playbook to model monitoring in production](https://arize.com/the-playbook-to-monitor-your-models-performance-in-production/)
- [Using statistical distance metrics for ML monitoring and observability](https://arize.com/using-statistical-distance-metrics-for-machine-learning-observability/)
- [ML infrastructure tools for data preparation](https://arize.com/ml-infrastructure-tools-for-data-preparation/)
- [ML infrastructure tools for model building](https://arize.com/ml-infrastructure-tools-for-model-building/)
- [ML infrastructure tools for production](https://arize.com/ml-infrastructure-tools-for-production-part-1/)
- [ML infrastructure tools for model deployment and model serving](https://arize.com/ml-infrastructure-tools-for-production-part-2-model-deployment-and-serving/)
- [ML infrastructure tools for ML monitoring and observability](https://arize.com/ml-infrastructure-tools-ml-observability/)

Visit the [Arize Blog](https://arize.com/blog) and [Resource Center](https://arize.com/resource-hub/) for more resources on ML observability and model monitoring.
