<img src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="200"/>

# <center>Getting Started with the Arize Platform</center>
## <center>Investigating Embedding Drift in Object Detection</center>


In this tutorial, we are going to ingest model data from the [Facebook DETR](https://huggingface.co/facebook/detr-resnet-101) model, using input images from the [COCO dataset](https://cocodataset.org/#home).


Guides for other model types are available [here](https://docs.arize.com/arize/sending-data-to-arize/model-types).


**In this walkthrough, we are going to ingest embedding data and look at embedding drift.** 

In this scenario, you are in charge of maintaining an Object Detection model. Your model, [Facebook DETR](https://huggingface.co/facebook/detr-resnet-101), will localize and classify entities in input images from the [COCO dataset](https://cocodataset.org/#home). However, once the model is released into production, you notice that the performance of the model has degraded over a period of time.


This notebook shows how Arize can automatically surface the reason for this performance degradation, and help you troubleshoot it, by analyzing the _image vectors_ associated with the input images. You can then take the right action, such as retraining your model or cleaning your data, without spending time and effort wrangling and visualizing the datasets yourself. In this example, the production set contains lower-quality images during a certain period of time.

It is worth noting that, according to our research, inspecting embedding drift can surface problems with your data before they cause performance degradation.
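
One simple way to put a number on embedding drift is the Euclidean distance between the average (centroid) embedding of a baseline set and that of a production window. The sketch below illustrates the idea on made-up 2-dimensional vectors. It is an illustration of the concept only, not a description of what the Arize platform computes internally, and the helper names are hypothetical.

```python
import math


def centroid(vectors):
    """Mean vector of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def centroid_distance(baseline, current):
    """Euclidean distance between the centroids of two embedding sets."""
    return math.dist(centroid(baseline), centroid(current))


# Toy 2-dimensional "embeddings" (made-up numbers, not from the tutorial dataset).
training_vectors = [[0.0, 0.0], [2.0, 0.0]]    # centroid is (1, 0)
production_vectors = [[4.0, 4.0], [6.0, 4.0]]  # centroid is (5, 4)

drift = centroid_distance(training_vectors, production_vectors)
print(drift)
```

A value near zero means the two sets occupy roughly the same region of the embedding space; larger values suggest the production inputs have moved away from the training data.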

In this tutorial, we will start from scratch. We will:
* Download the embedding vectors and predictions of a dataset we have curated for this tutorial
* Log the inferences into Arize
* Visually explore embeddings in the Arize Platform

**Note**: This example compares training vs. production data. Arize also supports sending a single dataset.


Let's get started! 


# Step 0. Install Dependencies and Import Libraries 📚

In [None]:
!pip install -q arize 

In [None]:
import uuid
import pandas as pd

from datetime import datetime
from arize.pandas.logger import Client
from arize.utils.types import (
    Schema,
    Environments,
    ModelTypes,
    EmbeddingColumnNames,
    ObjectDetectionColumnNames,
)

# Step 1. Download and Display the data

We have curated a dataset for you so that you can send it to Arize in this tutorial.

In [None]:
url = "https://storage.googleapis.com/arize-assets/fixtures/Embeddings/arize-demo-models-data/CV/Object-Detection/coco_detection_quality_drift"
train_df = pd.read_parquet(f"{url}_training.parquet")
prod_df = pd.read_parquet(f"{url}_production.parquet")

# Step 2. Prepare your data to be sent to Arize


## Update the timestamps

The data that you are working with was constructed in May of 2023. Hence, we will update the timestamps so they are current at the time that you're sending data to Arize.

In [None]:
# Shift every timestamp by the same offset so that the most recent
# production record lands at the current time.
last_ts = prod_df["prediction_ts"].max()
now_ts = datetime.now().timestamp()
delta_ts = now_ts - last_ts

train_df["prediction_ts"] = (train_df["prediction_ts"] + delta_ts).astype(float)
prod_df["prediction_ts"] = (prod_df["prediction_ts"] + delta_ts).astype(float)
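
Because the same offset is added to every row, the gaps between predictions are preserved and the newest production record lands exactly at the current time. Here is a minimal sketch of that arithmetic on plain lists, using made-up epoch seconds and a fixed stand-in for the current time; `shift_to_now` is a hypothetical helper, not part of the tutorial code.

```python
def shift_to_now(timestamps, reference_max, now_ts):
    """Add one constant offset so that reference_max maps to now_ts."""
    delta = now_ts - reference_max
    return [ts + delta for ts in timestamps]


# Made-up epoch seconds standing in for the "prediction_ts" columns.
train_ts = [1_000.0, 1_060.0]
prod_ts = [2_000.0, 2_030.0, 2_090.0]
now_ts = 10_000.0  # fixed stand-in for datetime.now().timestamp()

# Both datasets are shifted relative to the latest production timestamp.
last_ts = max(prod_ts)
new_train_ts = shift_to_now(train_ts, last_ts, now_ts)
new_prod_ts = shift_to_now(prod_ts, last_ts, now_ts)

print(new_train_ts)
print(new_prod_ts)
```

The training timestamps stay earlier than the production ones, which keeps the two datasets in their original chronological order.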

## Add prediction ids

The Arize platform uses prediction IDs to link a prediction to an actual. Visit the [Arize documentation](https://docs.arize.com/arize/data-ingestion/model-schema/5.-prediction-id?q=prediction_id) for more details.

You can generate prediction IDs as follows:

In [None]:
def add_prediction_id(df):
    """Return one random UUID4 string per row of df."""
    return [str(uuid.uuid4()) for _ in range(df.shape[0])]

In [None]:
train_df["prediction_id"] = add_prediction_id(train_df)
prod_df["prediction_id"] = add_prediction_id(prod_df)
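
Since prediction IDs are what link a prediction to its actual, it is worth confirming that no two rows share an ID. The sketch below runs that check on a freshly generated list; `make_prediction_ids` is a hypothetical stand-in for the helper above that takes a row count instead of a dataframe.

```python
import uuid


def make_prediction_ids(n_rows):
    """Return n_rows random UUID4 strings, one per row."""
    return [str(uuid.uuid4()) for _ in range(n_rows)]


ids = make_prediction_ids(1_000)

# Every ID should be distinct, and each UUID string is 36 characters long.
assert len(set(ids)) == len(ids), "duplicate prediction IDs found"
assert all(len(prediction_id) == 36 for prediction_id in ids)
print(ids[0])
```

With a dataframe, the equivalent check is that the `prediction_id` column has as many unique values as rows.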

# Step 3. Sending Data into Arize 💫

## Import and Setup Arize Client

The first step is to set up the Arize client. After that, we will log the data.

Copy the Arize `API_KEY` and `SPACE_ID` from your Space Settings page (shown below) to the variables in the cell below. We will also be setting up some metadata to use across all logging.

<img src="https://storage.googleapis.com/arize-assets/fixtures/copy-id-and-key.png" width="700">

In [None]:
SPACE_ID = "SPACE_ID"
API_KEY = "API_KEY"
model_id = "CV-demo-coco-object-detection-quality-drift"
model_version = "1.0"
model_type = ModelTypes.OBJECT_DETECTION

# Fail fast if the placeholders above were not replaced
if SPACE_ID == "SPACE_ID" or API_KEY == "API_KEY":
    raise ValueError("❌ CHANGE SPACE_ID AND/OR API_KEY")

arize_client = Client(space_id=SPACE_ID, api_key=API_KEY)
print("✅ Import and Setup Arize Client Done! Now we can start using Arize!")
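
If you would rather not paste credentials into the notebook, one option is to read them from environment variables. The sketch below assumes two hypothetical variable names, `ARIZE_SPACE_ID` and `ARIZE_API_KEY`, which are not defined by this tutorial or the SDK; substitute whatever names your environment uses.

```python
import os


def load_credentials(env=os.environ):
    """Read the space ID and API key from a mapping of environment variables.

    ARIZE_SPACE_ID and ARIZE_API_KEY are hypothetical names chosen for
    this sketch.
    """
    space_id = env.get("ARIZE_SPACE_ID", "")
    api_key = env.get("ARIZE_API_KEY", "")
    if not space_id or not api_key:
        raise ValueError("Set ARIZE_SPACE_ID and ARIZE_API_KEY before running.")
    return space_id, api_key


# Demonstration with an in-memory mapping instead of the real environment.
demo_env = {"ARIZE_SPACE_ID": "my-space", "ARIZE_API_KEY": "my-key"}
space_id, api_key = load_credentials(demo_env)
print(space_id)
```

The returned values can then be passed to `Client(space_id=..., api_key=...)` in place of the hard-coded strings.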

Now that our Arize client is set up, let's go ahead and log all of our data to the platform. For more details on how **`arize.pandas.logger`** works, visit our documentation.

[![Buttons_OpenOrange.png](https://storage.googleapis.com/arize-assets/fixtures/Buttons_OpenOrange.png)](https://docs.arize.com/arize/sdks-and-integrations/python-sdk/arize.pandas)

We will use the `ObjectDetectionColumnNames` and `EmbeddingColumnNames` classes from Arize's SDK. 

[Here](https://docs.arize.com/arize/sending-data/model-schema-reference#8.-embedding-features-unstructured) is more information about defining embedding features using `EmbeddingColumnNames`.

## Define Schema

A Schema specifies the column names for corresponding data in the dataframe. While we could define different Schemas for training and production datasets, the dataframes have the same column names, so the Schema will be the same in this example.

#### Embedding features

To ingest non-embedding features, it suffices to provide a list of column names that contain the features in our dataframe. Embedding features, however, are a little bit different.

Arize allows you to ingest not only the embedding vector but also the raw data associated with that embedding, or a URL link to that raw data. Therefore, up to 3 columns can be associated with the same _embedding object_\*. To make this possible, Arize's SDK provides the `EmbeddingColumnNames` class, used below.

**\*NOTE**: An _embedding object_ is how we refer to the 3 possible pieces of information that can be sent for an embedding feature:
* Embedding `vector` (required)
* Embedding `data` (optional): raw text associated with the embedding vector
* Embedding `link_to_data` (optional): link to the data file (image, audio, ...) associated with the embedding vector
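
Before logging, it can help to confirm that every row of the vector column holds a vector of the same length. The sketch below assumes, as a working hypothesis rather than a documented rule, that all vectors of one embedding feature should share a dimension. The rows and URLs are made up, and `embedding_dimensions` is a hypothetical helper.

```python
def embedding_dimensions(vectors):
    """Return the set of vector lengths found in an embedding column."""
    return {len(v) for v in vectors}


# Made-up rows mimicking the two columns used for the image embedding:
# a vector and a link to the image file. The URLs are placeholders.
rows = [
    {"image_vector": [0.1, 0.2, 0.3], "url": "https://example.com/a.jpg"},
    {"image_vector": [0.4, 0.5, 0.6], "url": "https://example.com/b.jpg"},
]

dims = embedding_dimensions(row["image_vector"] for row in rows)
assert len(dims) == 1, f"inconsistent embedding sizes: {dims}"
print(dims)
```

With the tutorial dataframes, the same helper could be applied to the `image_vector` column.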

#### Object Detection predictions & actuals

In order to send predictions and actuals for an object detection model, we will use the `ObjectDetectionColumnNames` class. Similar to `EmbeddingColumnNames`, `ObjectDetectionColumnNames` takes up to 3 pieces of information:
* `bounding_boxes_coordinates_column_name` (str): Column name containing the coordinates of the rectangular outline that locates an object within an image or video. Pascal VOC format is required. The contents of this column must be a List[List[float]].
* `categories_column_name` (str): Column name containing the predefined classes or labels used by the model to classify the detected objects. The contents of this column must be a List[str].
* `scores_column_name` (str, optional): Column name containing the confidence scores that the model assigns to its predictions, indicating how certain the model is that the predicted class is contained within the bounding box. This argument is only applicable for prediction values. The contents of this column must be a List[float].
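
Pascal VOC boxes are written as `[x_min, y_min, x_max, y_max]`, whereas COCO annotations natively use `[x_min, y_min, width, height]`. The curated dataset is presumably already in the required format, but if you bring your own COCO-style boxes, a conversion along these lines may be needed. The sketch also checks one image's detections, assuming one category and one score per box; both helper names are hypothetical and the values are made up.

```python
def coco_to_pascal_voc(box):
    """Convert [x_min, y_min, width, height] to [x_min, y_min, x_max, y_max]."""
    x_min, y_min, width, height = box
    return [float(x_min), float(y_min), float(x_min + width), float(y_min + height)]


def check_detection_row(bboxes, categories, scores=None):
    """Return a list of problems found in one image's detections (empty if none)."""
    problems = []
    if len(bboxes) != len(categories):
        problems.append("bboxes and categories differ in length")
    if scores is not None and len(scores) != len(bboxes):
        problems.append("scores and bboxes differ in length")
    for box in bboxes:
        # A valid box has 4 values with min corners not exceeding max corners.
        if len(box) != 4 or box[0] > box[2] or box[1] > box[3]:
            problems.append(f"not a valid [x_min, y_min, x_max, y_max] box: {box}")
    return problems


# One made-up image with two detections.
bboxes = [coco_to_pascal_voc([10, 20, 30, 40]), coco_to_pascal_voc([0, 0, 5, 5])]
categories = ["dog", "ball"]
scores = [0.92, 0.81]

print(bboxes)
print(check_detection_row(bboxes, categories, scores))
```

For actuals, omit `scores`, since confidence scores only apply to predictions.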

In [None]:
tags = ["drift_type"]
embedding_feature_column_names = {
    "image_embedding": EmbeddingColumnNames(
        vector_column_name="image_vector", link_to_data_column_name="url"
    )
}
object_detection_prediction_column_names = ObjectDetectionColumnNames(
    bounding_boxes_coordinates_column_name="prediction_bboxes",
    categories_column_name="prediction_categories",
    scores_column_name="prediction_scores",
)
object_detection_actual_column_names = ObjectDetectionColumnNames(
    bounding_boxes_coordinates_column_name="actual_bboxes",
    categories_column_name="actual_categories",
)

# Define a Schema() object for Arize to pick up data from the correct columns for logging
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    tag_column_names=tags,
    embedding_feature_column_names=embedding_feature_column_names,
    object_detection_prediction_column_names=object_detection_prediction_column_names,
    object_detection_actual_column_names=object_detection_actual_column_names,
)
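
A Schema only names columns, so a typo in a column name would surface at logging time. A quick pre-flight check is to compare the names the schema refers to against the columns actually present. The sketch below works on plain lists of names; `missing_columns` is a hypothetical helper and the `available` list is made up for the demonstration.

```python
def missing_columns(required, available):
    """Return the required column names absent from `available`, in order."""
    available = set(available)
    return [name for name in required if name not in available]


# Column names referenced by the schema in this tutorial.
required = [
    "prediction_id", "prediction_ts", "drift_type", "image_vector", "url",
    "prediction_bboxes", "prediction_categories", "prediction_scores",
    "actual_bboxes", "actual_categories",
]

# With real data you would pass train_df.columns or prod_df.columns here.
# This made-up list deliberately lacks "prediction_id".
available = [name for name in required if name != "prediction_id"]

print(missing_columns(required, available))
```

An empty result means every column the schema points at exists in the dataframe.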

## Log Training Data

**Note**: This example compares training vs. production data. Arize also supports sending a single dataset.

In [None]:
# Logging Training DataFrame
response = arize_client.log(
    dataframe=train_df,
    schema=schema,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.TRAINING,
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(
        f"❌ logging failed with response code {response.status_code}, {response.text}"
    )
else:
    print("✅ You have successfully logged the training set to Arize")

## Log Production Data

In [None]:
# Logging Production DataFrame
response = arize_client.log(
    dataframe=prod_df,
    schema=schema,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.PRODUCTION,
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(
        f"❌ logging failed with response code {response.status_code}, {response.text}"
    )
else:
    print("✅ You have successfully logged the production set to Arize")
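
Both logging cells end with the same status check, so a small helper can keep the messages consistent and tied to the dataset that was actually sent. The sketch below relies only on the `status_code` and `text` attributes used in the cells above; `log_result_message` is a hypothetical helper, and the responses are stand-ins built for illustration.

```python
from types import SimpleNamespace


def log_result_message(response, dataset_name):
    """Build a status message from a response with status_code and text."""
    if response.status_code != 200:
        return (
            f"❌ logging the {dataset_name} failed with response code "
            f"{response.status_code}, {response.text}"
        )
    return f"✅ You have successfully logged the {dataset_name} to Arize"


# Stand-in responses for illustration; a real one comes from arize_client.log(...).
ok = SimpleNamespace(status_code=200, text="")
bad = SimpleNamespace(status_code=401, text="unauthorized")

print(log_result_message(ok, "production set"))
print(log_result_message(bad, "production set"))
```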

# Step 4. Confirm Data in Arize ✅
Note that the Arize platform takes about 15 minutes to index embedding data. While the model should appear immediately, the data will not show up until the indexing is complete. Feel free to head over to the **Data Ingestion** tab for your model to watch Arize work its magic!🔮

You will be able to see the predictions, actuals, and feature importances that have been sent in the last 30 minutes, last day or last week.

An example view of the Data Ingestion tab from a model, when data is sent continuously over 30 minutes, is shown in the image below.

<img src="https://storage.googleapis.com/arize-assets/fixtures/data-ingestion-tab.png" width="700">

# Wrap Up 🎁
Congratulations, you've now sent your first machine learning embedding data to the Arize platform!

Additionally, if you want to remove this example model from your account, just click **Models** -> **CV-demo-coco-object-detection-quality-drift** -> **config** -> **delete**

### Overview
Arize is an end-to-end ML observability and model monitoring platform. The platform is designed to help ML engineers and data science practitioners surface and fix issues with ML models in production faster with:
- Automated ML monitoring and model monitoring
- Workflows to troubleshoot model performance
- Real-time visualizations for model performance monitoring, data quality monitoring, and drift monitoring
- Model prediction cohort analysis
- Pre-deployment model validation
- Integrated model explainability

### Website
Visit Us At: https://arize.com/model-monitoring/

### Additional Resources
- [What is ML observability?](https://arize.com/what-is-ml-observability/)
- [Monitor Unstructured Data with Arize](https://arize.com/blog/monitor-unstructured-data-with-arize)
- [Getting Started With Embeddings Is Easier Than You Think](https://arize.com/blog/getting-started-with-embeddings-is-easier-than-you-think)
- [Playbook to model monitoring in production](https://arize.com/the-playbook-to-monitor-your-models-performance-in-production/)
- [Using statistical distance metrics for ML monitoring and observability](https://arize.com/using-statistical-distance-metrics-for-machine-learning-observability/)
- [ML infrastructure tools for model building](https://arize.com/ml-infrastructure-tools-for-model-building/)
- [ML infrastructure tools for production](https://arize.com/ml-infrastructure-tools-for-production-part-1/)
- [ML infrastructure tools for ML monitoring and observability](https://arize.com/ml-infrastructure-tools-ml-observability/)

Visit the [Arize Blog](https://arize.com/blog) and [Resource Center](https://arize.com/resource-hub/) for more resources on ML observability and model monitoring.
