<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Phoenix Quickstart</h1>

This quickstart dives straight into the code with minimal explanation. Pick a section below to see Phoenix applied to a specific type of data.

- [Computer Vision](#Computer-Vision)
- [Natural Language Processing](#Natural-Language-Processing)
- [Tabular Data](#Tabular-Data)

## Computer Vision

Install Phoenix.

In [None]:
!pip install arize-phoenix

Import dependencies.

In [None]:
from dataclasses import replace

from IPython.display import display, HTML
import pandas as pd
import phoenix as px

Download production and training image data containing photographs of people performing various actions (sleeping, eating, running, etc.).

In [None]:
train_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/cv/human-actions/human_actions_training.parquet"
)
prod_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/cv/human-actions/human_actions_production.parquet"
)

View a few training data points.

In [None]:
train_df.head()

The columns of the DataFrame are:
- **prediction_id:** a unique identifier for each data point
- **prediction_ts:** the Unix timestamps of your predictions
- **url:** a link to the image data
- **image_vector:** the embedding vectors representing each image
- **actual_action:** the ground truth for each image
- **predicted_action:** the predicted class for the image
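
The `prediction_ts` values are Unix timestamps, which are hard to read at a glance. Below is a minimal sketch for converting one to a timezone-aware datetime, assuming the values are in seconds since the epoch (the unit is an assumption; check the data before relying on it). The helper name `unix_to_utc` and the sample value are hypothetical.

```python
from datetime import datetime, timezone


def unix_to_utc(ts: float) -> datetime:
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# Hypothetical sample value, not taken from the dataset.
print(unix_to_utc(1_680_000_000).isoformat())
```

For a whole column, `pd.to_datetime(train_df["prediction_ts"], unit="s", utc=True)` does the same under the same assumption.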

View a few production data points.

In [None]:
prod_df.head()

Notice that the production data is missing ground truth, i.e., has no "actual_action" column.

Display a few images alongside their predicted and actual labels.

In [None]:
def display_examples(df):
    """
    Displays each image alongside the actual and predicted classes.
    """
    sample_df = df[["actual_action", "predicted_action", "url"]].rename(columns={"url": "image"})
    html = sample_df.to_html(
        escape=False, index=False, formatters={"image": lambda url: f'<img src="{url}">'}
    )
    display(HTML(html))


display_examples(train_df.head())
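
The formatter in `display_examples` interpolates each URL directly into an `<img>` tag. A hedged variant that escapes the URL and caps the image width is sketched below; `img_tag` and the default width of 200 pixels are hypothetical choices, not part of Phoenix.

```python
from html import escape


def img_tag(url: str, width: int = 200) -> str:
    """Build an <img> tag with the URL escaped for use in an HTML attribute."""
    return f'<img src="{escape(url, quote=True)}" width="{width}">'


# Hypothetical URL used only for illustration.
print(img_tag("https://example.com/a.png?x=1&y=2"))
```

It could be passed to `to_html` as `formatters={"image": img_tag}`.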

Define a schema for your training data.

In [None]:
train_schema = px.Schema(
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="predicted_action",
    actual_label_column_name="actual_action",
    embedding_feature_column_names={
        "image_embedding": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="url",
        ),
    },
)

The schema for your production data is the same, except it does not have an actual label column.

In [None]:
prod_schema = replace(train_schema, actual_label_column_name=None)
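
`dataclasses.replace` returns a copy of a dataclass instance with the named fields overridden and leaves the original untouched, which is why `train_schema` still carries its actual label column after the call. Below is a small sketch of that behavior, using a hypothetical `MiniSchema` as a stand-in for `px.Schema`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MiniSchema:
    """Hypothetical stand-in for px.Schema with just two fields."""

    prediction_label_column_name: Optional[str] = None
    actual_label_column_name: Optional[str] = None


train = MiniSchema("predicted_action", "actual_action")
prod = replace(train, actual_label_column_name=None)

print(train.actual_label_column_name)  # the original is unchanged
print(prod.actual_label_column_name)   # the copy has the field cleared
```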

Define your primary and reference datasets.

In [None]:
prod_ds = px.Dataset(prod_df, prod_schema)
train_ds = px.Dataset(train_df, train_schema)

Launch Phoenix.

In [None]:
session = px.launch_app(prod_ds, train_ds)

Open the Phoenix UI by copying and pasting the session URL into a new browser tab.

In [None]:
session.url

Alternatively, open the Phoenix UI in your notebook.

In [None]:
session.view()

Navigate to the embeddings view. Find a cluster of production data that is unlike any of your training data. Export the cluster.

View the exported cluster as a DataFrame in your notebook.

In [None]:
export_df = session.exports[-1]
export_df.head()
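
Indexing with `[-1]` picks the most recent export. Assuming `session.exports` behaves like a Python list, it raises `IndexError` if nothing has been exported yet. Below is a defensive sketch, with plain lists standing in for `session.exports` and a hypothetical helper `latest_export`.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def latest_export(exports: Sequence[T]) -> Optional[T]:
    """Return the most recent export, or None when nothing has been exported."""
    return exports[-1] if len(exports) > 0 else None


# Plain lists stand in for session.exports here.
print(latest_export([]))
print(latest_export(["first", "second"]))
```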

Display a few examples from your exported data.

In [None]:
display_examples(export_df.head())

Close the app.

In [None]:
px.close_app()

## Natural Language Processing

Install Phoenix.

In [None]:
!pip install -q arize-phoenix

Import dependencies.

In [None]:
import pandas as pd
import phoenix as px

Download training and production data from a model that classifies the sentiment of product reviews as positive, negative, or neutral.

In [None]:
train_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/nlp/sentiment-classification-language-drift/sentiment_classification_language_drift_training.parquet",
)
prod_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/nlp/sentiment-classification-language-drift/sentiment_classification_language_drift_production.parquet",
)

View a few training data points.

In [None]:
train_df.head()

The columns of the DataFrame are:
- **prediction_ts:** the Unix timestamps of your predictions
- **review_age**, **reviewer_gender**, **product_category**, **language:** the features of your model
- **text:** the text of each product review
- **text_vector:** the embedding vectors representing each review
- **pred_label:** the label your model predicted
- **label:** the ground-truth label for each review

Define your schema.

In [None]:
schema = px.Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="pred_label",
    actual_label_column_name="label",
    embedding_feature_column_names={
        "text_embedding": px.EmbeddingColumnNames(
            vector_column_name="text_vector", raw_data_column_name="text"
        ),
    },
)

Define your primary and reference datasets.

In [None]:
prim_ds = px.Dataset(dataframe=prod_df, schema=schema, name="production")
ref_ds = px.Dataset(dataframe=train_df, schema=schema, name="training")

Launch Phoenix.

In [None]:
session = px.launch_app(primary=prim_ds, reference=ref_ds)

Open Phoenix by copying and pasting the output of `session.url` into a new browser tab.

In [None]:
session.url

Alternatively, open the Phoenix UI in your notebook.

In [None]:
session.view()

Navigate to the embeddings page. Select a period of high drift. Click on the clusters on the left and inspect the data in each cluster. One cluster contains positive reviews, one contains negative reviews, and another contains production data that has drifted from the training distribution.
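
One common way to quantify how far production embeddings have moved from training embeddings is the Euclidean distance between the two centroids. The sketch below illustrates that idea in plain Python on made-up 2-D vectors; it is not presented as the exact computation Phoenix performs, and `centroid` and `centroid_distance` are hypothetical helpers.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty collection of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def centroid_distance(
    a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]
) -> float:
    """Euclidean distance between the centroids of two sets of vectors."""
    return math.dist(centroid(a), centroid(b))


# Made-up vectors for illustration only.
train_vecs = [[0.0, 0.0], [2.0, 0.0]]  # centroid is (1, 0)
prod_vecs = [[4.0, 4.0], [4.0, 4.0]]   # centroid is (4, 4)
print(centroid_distance(train_vecs, prod_vecs))
```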

Close the app.

In [None]:
px.close_app()

## Tabular Data

Install Phoenix and Arize auto-embeddings.

In [None]:
!pip install -q arize-phoenix "arize[AutoEmbeddings]"

Import dependencies.

In [None]:
from arize.pandas.embeddings.tabular_generators import EmbeddingGeneratorForTabularFeatures
import pandas as pd
import phoenix as px
import torch

Download your training and production data from a fraud detection model.

In [None]:
train_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/structured/credit-card-fraud/credit_card_fraud_train.parquet",
)
prod_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/structured/credit-card-fraud/credit_card_fraud_production.parquet",
)
train_df.head()

The columns of the DataFrame are:
- **prediction_id:** the unique ID for each prediction
- **prediction_timestamp:** the timestamps of your predictions
- **predicted_label:** the label your model predicted
- **predicted_score:** the score of each prediction
- **actual_label:** the ground-truth label for each prediction (fraud vs. not_fraud)
- **tabular_vector:** pre-computed tabular embeddings for each row of data
- **age:** a tag used to filter your data in the Phoenix UI
- the rest of the columns are features

Run the cell below to define the feature columns used by the schema. If you have a CUDA-enabled GPU, the cell also computes embeddings for your tabular data from scratch; otherwise, it keeps the pre-computed embeddings downloaded with the rest of your data.

In [None]:
feature_column_names = [
    "fico_score",
    "loan_amount",
    "term",
    "interest_rate",
    "installment",
    "grade",
    "home_ownership",
    "annual_income",
    "verification_status",
    "pymnt_plan",
    "addr_state",
    "dti",
    "delinq_2yrs",
    "inq_last_6mths",
    "mths_since_last_delinq",
    "mths_since_last_record",
    "open_acc",
    "pub_rec",
    "revol_bal",
    "revol_util",
    "state",
    "merchant_ID",
    "merchant_risk_score",
]

if torch.cuda.is_available():
    generator = EmbeddingGeneratorForTabularFeatures(
        model_name="distilbert-base-uncased",
    )
    train_df["tabular_vector"] = generator.generate_embeddings(
        train_df,
        selected_columns=feature_column_names,
    )
    prod_df["tabular_vector"] = generator.generate_embeddings(
        prod_df,
        selected_columns=feature_column_names,
    )
else:
    print("CUDA is not available. Using pre-computed embeddings.")
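
Before generating embeddings or building a schema, it can help to confirm that every name in `feature_column_names` is actually a column of the DataFrame. Below is a minimal sketch with a hypothetical helper `missing_columns`; the column names in the demo are a made-up subset.

```python
from typing import Iterable, List


def missing_columns(expected: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return the expected names that are absent from the available ones, in order."""
    available_set = set(available)
    return [name for name in expected if name not in available_set]


# Made-up stand-in for train_df.columns.
demo_columns = ["fico_score", "loan_amount", "term"]
print(missing_columns(["fico_score", "dti"], demo_columns))
```

In the notebook this would be called as `missing_columns(feature_column_names, train_df.columns)`; an empty list means every feature is present.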

Define your schema.

In [None]:
schema = px.Schema(
    prediction_id_column_name="prediction_id",
    prediction_label_column_name="predicted_label",
    prediction_score_column_name="predicted_score",
    actual_label_column_name="actual_label",
    timestamp_column_name="prediction_timestamp",
    feature_column_names=feature_column_names,
    tag_column_names=["age"],
    embedding_feature_column_names={
        "tabular_embedding": px.EmbeddingColumnNames(
            vector_column_name="tabular_vector",
        ),
    },
)

Define your primary and reference datasets.

In [None]:
prod_ds = px.Dataset(dataframe=prod_df, schema=schema, name="production")
train_ds = px.Dataset(dataframe=train_df, schema=schema, name="training")

Launch Phoenix.

In [None]:
session = px.launch_app(primary=prod_ds, reference=train_ds)

Open Phoenix by copying and pasting the output of `session.url` into a new browser tab.

In [None]:
session.url

Alternatively, open the Phoenix UI in your notebook.

In [None]:
session.view()

Navigate to the embeddings page. Select a period of high drift, then color your data by the `merchant_ID` feature. Select a cluster of drifted production data and notice that much of it consists of fraudulent transactions from the Scammeds merchant. Export the cluster.

View your most recently exported data as a DataFrame.

In [None]:
export_df = session.exports[-1]
export_df.head()
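
To check whether a single merchant dominates the exported cluster, the rows can be tallied by merchant. The sketch below uses `collections.Counter` on made-up values standing in for `export_df["merchant_ID"]`; "Merchant A" and all counts are invented for illustration.

```python
from collections import Counter

# Made-up values standing in for export_df["merchant_ID"].
merchant_ids = ["Scammeds", "Scammeds", "Merchant A", "Scammeds"]

counts = Counter(merchant_ids)
top_merchant, top_count = counts.most_common(1)[0]

# Share of exported rows that belong to the most frequent merchant.
top_share = top_count / len(merchant_ids)
print(top_merchant, top_share)
```

With pandas, `export_df["merchant_ID"].value_counts(normalize=True)` gives the same breakdown directly.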

Close the app.

In [None]:
px.close_app()