<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Active Learning for a Drifting Image Classification Model</h1>

Imagine you're in charge of maintaining a model that classifies the actions of people in photographs. Your model initially performs well in production, but its performance gradually degrades over time.

Phoenix helps you surface the reason for this regression by analyzing the embeddings representing each image. Your model was trained on crisp and high-resolution images, but as you'll discover, it's encountering blurred and noisy images in production that it can't correctly classify.

In this tutorial, you will:

- Download curated datasets of embeddings and predictions
- Define a schema to describe the format of your data
- Launch Phoenix to visually explore your embeddings
- Investigate problematic clusters
- Export problematic production data for labeling and fine-tuning

Let's get started!

## 1. Install Dependencies and Import Libraries

In [None]:
!pip install -q arize-phoenix

In [None]:
from dataclasses import replace

from IPython.display import display, HTML
import pandas as pd
import phoenix as px

## 2. Download and Inspect the Data

Download the curated dataset.

In [None]:
train_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/cv/human-actions/human_actions_training.parquet"
)
prod_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/cv/human-actions/human_actions_production.parquet"
)

View the first few rows of the training DataFrame.

In [None]:
train_df.head()

The columns of the DataFrame are:
- **prediction_id:** a unique identifier for each data point
- **prediction_ts:** the Unix timestamps of your predictions
- **url:** a link to the image data
- **image_vector:** the embedding vectors representing each image
- **actual_action:** the ground truth for each image (sleeping, eating, running, etc.)
- **predicted_action:** the predicted class for the image
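
As a quick sanity check on the columns listed above, the sketch below confirms that every embedding vector has the same length and that prediction IDs are unique. It runs on a tiny synthetic stand-in for `train_df`: the values, the example URLs, and the 4-dimensional vectors are made up for illustration, and `check_embeddings` is a hypothetical helper, not part of Phoenix.

```python
import pandas as pd

# Synthetic stand-in for train_df; real embeddings are much higher-dimensional.
toy_df = pd.DataFrame(
    {
        "prediction_id": ["a", "b", "c"],
        "prediction_ts": [1_650_000_000.0, 1_650_000_060.0, 1_650_000_120.0],
        "url": [
            "https://example.com/1.png",
            "https://example.com/2.png",
            "https://example.com/3.png",
        ],
        "image_vector": [[0.0] * 4, [1.0] * 4, [0.5] * 4],
        "actual_action": ["sleeping", "eating", "running"],
        "predicted_action": ["sleeping", "eating", "eating"],
    }
)


def check_embeddings(df, vector_column="image_vector", id_column="prediction_id"):
    """Return the shared embedding dimension, raising if the data is inconsistent."""
    dimensions = {len(vector) for vector in df[vector_column]}
    if len(dimensions) != 1:
        raise ValueError(f"Inconsistent embedding dimensions: {sorted(dimensions)}")
    if not df[id_column].is_unique:
        raise ValueError("Prediction IDs are not unique")
    return dimensions.pop()


embedding_dim = check_embeddings(toy_df)
print(f"Embedding dimension: {embedding_dim}")
```

The same helper could be called on `train_df` and `prod_df` once they are downloaded, assuming they use the column names described above.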

View the first few rows of the production DataFrame.

In [None]:
prod_df.head()

Notice that the production data is missing ground truth, i.e., it has no "actual_action" column.

Display a few images alongside their predicted and actual labels.

In [None]:
def display_examples(df):
    """
    Displays each image alongside the actual (when available) and predicted classes.
    """
    # Production data may lack ground truth, so keep only the columns that are present.
    columns = [c for c in ["actual_action", "predicted_action", "url"] if c in df.columns]
    sample_df = df[columns].rename(columns={"url": "image"})
    html = sample_df.to_html(
        escape=False, index=False, formatters={"image": lambda url: f'<img src="{url}">'}
    )
    display(HTML(html))


display_examples(train_df.head())

## 3. Launch Phoenix

### a) Define Your Schema
To launch Phoenix with your data, you first need to define a schema that tells Phoenix which columns of your DataFrames correspond to features, predictions, actuals (i.e., ground truth), embeddings, etc.

The trickiest part is defining embedding features. In this case, each embedding feature has two pieces of information: the embedding vector itself contained in the "image_vector" column and the link to the image contained in the "url" column.

Define a schema for your training data.

In [None]:
train_schema = px.Schema(
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="predicted_action",
    actual_label_column_name="actual_action",
    embedding_feature_column_names={
        "image_embedding": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="url",
        ),
    },
)

The schema for your production data is the same, except it does not have an actual label column.

In [None]:
prod_schema = replace(train_schema, actual_label_column_name=None)
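
`dataclasses.replace` returns a copy of a dataclass instance with the named fields overridden, leaving the original untouched. The sketch below illustrates this with `ToySchema`, a hypothetical frozen dataclass standing in for `px.Schema`. It assumes nothing about Phoenix's actual implementation beyond the fact that this tutorial passes a schema to `replace`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToySchema:
    """Hypothetical stand-in for px.Schema with just two fields."""

    prediction_label_column_name: Optional[str] = None
    actual_label_column_name: Optional[str] = None


toy_train_schema = ToySchema(
    prediction_label_column_name="predicted_action",
    actual_label_column_name="actual_action",
)

# replace() builds a new instance; the training schema keeps its actual label column.
toy_prod_schema = replace(toy_train_schema, actual_label_column_name=None)

print(toy_train_schema.actual_label_column_name)  # actual_action
print(toy_prod_schema.actual_label_column_name)  # None
print(toy_prod_schema.prediction_label_column_name)  # predicted_action
```

Deriving the production schema from the training schema this way keeps the two in sync: any field you do not override is carried over unchanged.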

### b) Define Your Datasets
Next, define your primary and reference datasets. In this case, your reference dataset contains training data and your primary dataset contains production data.

In [None]:
prod_ds = px.Dataset(prod_df, prod_schema)
train_ds = px.Dataset(train_df, train_schema)

### c) Create a Phoenix Session

In [None]:
session = px.launch_app(prod_ds, train_ds)

### d) Launch the Phoenix UI

You can open Phoenix by copying and pasting the output of `session.url` into a new browser tab.

In [None]:
session.url

Alternatively, you can open the Phoenix UI in your notebook with:

In [None]:
session.view()

## 4. Find and Export Problematic Clusters

### Steps

1. Click on "image_embedding" in the "Embeddings" section.
1. In the Euclidean distance graph at the top of the page, select a point on the graph where the Euclidean distance is high.
1. Click on the top cluster in the panel on the left.
1. Use the panel at the bottom to examine the data points in this cluster.
1. Click on the "Export" button to save your cluster.

### Questions

1. What does the Euclidean distance graph measure?
1. What do the points in the point cloud represent?
1. What do you notice about the cluster you selected?
1. What's gone wrong with your model in production?

### Answers

1. This graph measures the drift of your production data relative to your training data over time.
1. Each point in the point cloud corresponds to an image. Phoenix has taken the high-dimensional embeddings in your original DataFrame and projected them into a lower-dimensional space so that you can view them.
1. It consists almost entirely of production data, meaning that your model is seeing data in production the likes of which it never saw during training.
1. Your model was trained on crisp and high-resolution images. In production, your model is encountering blurry and noisy images that it cannot correctly classify.
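
To make the first answer concrete, here is a minimal sketch of one way to quantify embedding drift: the Euclidean distance between the centroid (mean vector) of the production embeddings and the centroid of the training embeddings. The arrays are synthetic and `centroid_distance` is a hypothetical helper. This illustrates the idea and is not a reproduction of Phoenix's internal computation.

```python
import numpy as np


def centroid_distance(primary, reference):
    """Euclidean distance between the mean vectors of two embedding sets."""
    primary_centroid = np.asarray(primary, dtype=float).mean(axis=0)
    reference_centroid = np.asarray(reference, dtype=float).mean(axis=0)
    return float(np.linalg.norm(primary_centroid - reference_centroid))


# Synthetic 2-D embeddings: production is shifted away from training.
train_vectors = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
prod_vectors = train_vectors + np.array([3.0, 4.0])

print(centroid_distance(train_vectors, train_vectors))  # 0.0
print(centroid_distance(prod_vectors, train_vectors))  # 5.0
```

Computing this distance separately for each time window of production data would give a curve over time, where peaks mark periods in which production embeddings sit far from the training data.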

## 5. Load and View Exported Data

View your exported files.

In [None]:
session.exports

Load your most recent exported data back into a DataFrame.

In [None]:
export_df = session.exports[0].dataframe
export_df.head()

Display a few examples from your export.

In [None]:
display_examples(export_df.head())

Congrats! You've pinpointed the blurry or noisy images that are hurting your model's performance in production. As an actionable next step, you can label your exported production data and fine-tune your model to improve performance.
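
One possible way to prepare the export for labeling, sketched under assumptions: build a sheet with one row per image and an empty column for annotators to fill in. The toy frame below stands in for `export_df`, and `make_labeling_sheet` and the `label` column name are hypothetical, not part of Phoenix.

```python
import pandas as pd


def make_labeling_sheet(df, id_column="prediction_id", url_column="url"):
    """Return one row per unique image with an empty 'label' column to fill in."""
    sheet = df[[id_column, url_column, "predicted_action"]].drop_duplicates(subset=id_column)
    # pd.NA marks rows that still need a human-assigned label.
    return sheet.assign(label=pd.NA).reset_index(drop=True)


# Synthetic stand-in for export_df, including one duplicated row.
toy_export_df = pd.DataFrame(
    {
        "prediction_id": ["x1", "x2", "x2"],
        "url": [
            "https://example.com/blurry-1.png",
            "https://example.com/blurry-2.png",
            "https://example.com/blurry-2.png",
        ],
        "predicted_action": ["running", "eating", "eating"],
    }
)

labeling_sheet = make_labeling_sheet(toy_export_df)
print(labeling_sheet)
```

From here, the sheet could be written out with `labeling_sheet.to_csv(...)` and handed to annotators. The filled-in labels would then serve as ground truth for fine-tuning.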

## 6. Close the App

When you're done, don't forget to close the app.

In [None]:
px.close_app()