<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Root-Cause Analysis for a Drifting Sentiment Classification Model</h1>

Imagine you're in charge of maintaining a model that takes as input online reviews of your U.S.-based product and classifies the sentiment of each review as positive, negative, or neutral. Your model initially performs well in production, but its performance gradually degrades over time.

Phoenix helps you surface the reason for this regression by analyzing the embeddings representing the text of each review. Your model was trained on English reviews, but as you'll discover, it's encountering Spanish reviews in production that it can't correctly classify.

In this tutorial, you will:
* Download curated datasets of embeddings and predictions
* Define a schema to describe the format of your data
* Launch Phoenix to visually explore your embeddings
* Investigate problematic clusters to identify the root cause of your model performance issue

Let's get started!

## 1. Install Dependencies and Import Libraries

In [None]:
%pip install -q arize-phoenix

In [None]:
import pandas as pd
import phoenix as px

## 2. Download the Data

Load your training and production data into two Pandas DataFrames.

In [None]:
train_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/nlp/sentiment-classification-language-drift/sentiment_classification_language_drift_training.parquet",
)
prod_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/nlp/sentiment-classification-language-drift/sentiment_classification_language_drift_production.parquet",
)

Inspect a few rows of the training DataFrame.

In [None]:
train_df.head()

The columns of the DataFrame are:
- **prediction_ts:** the Unix timestamps of your predictions
- **review_age**, **reviewer_gender**, **product_category**, **language:** the features of your model
- **text:** the text of each review
- **text_vector:** the embedding vectors representing each review
- **pred_label:** the label your model predicted
- **label:** the ground-truth label for each review
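
Because `prediction_ts` holds Unix timestamps, it can help to convert them to readable datetimes and check the time span each DataFrame covers. The sketch below assumes the timestamps are in seconds (the `unit="s"` argument). `time_range` is a hypothetical helper name, and the toy DataFrame is a made-up stand-in for `train_df` or `prod_df`.

```python
import pandas as pd


def time_range(df: pd.DataFrame, column: str = "prediction_ts"):
    """Return the earliest and latest prediction times as pandas Timestamps.

    Assumes the column holds Unix timestamps in seconds.
    """
    times = pd.to_datetime(df[column], unit="s")
    return times.min(), times.max()


# Toy stand-in for train_df / prod_df (made-up values, not the real data).
toy_df = pd.DataFrame({"prediction_ts": [0, 43_200, 86_400]})
start, end = time_range(toy_df)
print(f"{start} to {end}")
```

With the real data, you would call `time_range(train_df)` and `time_range(prod_df)` to see whether the two periods overlap.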

## 3. Launch Phoenix

### a) Define Your Schema

To launch Phoenix with your data, you first need to define a schema that tells Phoenix which columns of your DataFrames correspond to features, predictions, actuals (i.e., ground truth), embeddings, etc.

The trickiest part is defining embedding features. In this case, the single embedding feature, "text_embedding", combines two pieces of information: the embedding vector itself, stored in the "text_vector" column, and the raw review text, stored in the "text" column.

In [None]:
embedding_features = {
    "text_embedding": px.EmbeddingColumnNames(
        vector_column_name="text_vector", raw_data_column_name="text"
    ),
}
schema = px.Schema(
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="pred_label",
    actual_label_column_name="label",
    embedding_feature_column_names=embedding_features,
)
schema
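
Before handing your DataFrames to Phoenix, you may want to confirm that the columns named in the schema exist and that every embedding vector has the same length. The following is a minimal sketch in plain pandas, not part of the Phoenix API. `missing_columns` and `embedding_dimensions` are hypothetical helpers, and the toy DataFrame is made up for illustration.

```python
import pandas as pd

# Column names referenced by the schema in this tutorial.
SCHEMA_COLUMNS = ("prediction_ts", "pred_label", "label", "text", "text_vector")


def missing_columns(df: pd.DataFrame, required=SCHEMA_COLUMNS) -> list:
    """Return the required column names that are absent from the DataFrame."""
    return [name for name in required if name not in df.columns]


def embedding_dimensions(df: pd.DataFrame, vector_column: str = "text_vector") -> set:
    """Return the set of distinct vector lengths found in the embedding column."""
    return {int(len(vector)) for vector in df[vector_column]}


# Toy stand-in for train_df / prod_df (made-up rows, not the real data).
toy_df = pd.DataFrame(
    {
        "prediction_ts": [0, 1],
        "pred_label": ["positive", "negative"],
        "label": ["positive", "neutral"],
        "text": ["great product", "it was fine"],
        "text_vector": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    }
)
print(missing_columns(toy_df))       # an empty list means nothing is missing
print(embedding_dimensions(toy_df))  # a single value means consistent lengths
```

Running both helpers on `train_df` and `prod_df` should give the same single dimension for each if the two sets of embeddings came from the same encoder.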

### b) Define Your Datasets

Next, define your primary and reference datasets. In this case, your reference dataset contains training data and your primary dataset contains production data.

In [None]:
prim_ds = px.Dataset(dataframe=prod_df, schema=schema, name="primary")
ref_ds = px.Dataset(dataframe=train_df, schema=schema, name="reference")

Inspect your primary dataset.

In [None]:
prim_ds

### c) Create a Phoenix Session

In [None]:
session = px.launch_app(primary=prim_ds, reference=ref_ds)
session

### d) Launch the Phoenix UI

You can open Phoenix by copying and pasting the output of `session.url` into a new browser tab.

In [None]:
session.url

Alternatively, you can open the Phoenix UI directly in your notebook:

In [None]:
session.view()

## 4. Explore Your Data

### Steps:

1. Click on "text_embedding" in the "Embeddings" section.
1. In the Euclidean distance graph at the top of the page, click a point on the graph where the Euclidean distance is high.
1. Click on the top cluster in the panel on the left.
1. Use the panel at the bottom to examine the data points in this cluster.

### Questions:

1. What does the Euclidean distance graph measure?
1. What do the points in the point cloud represent?
1. What do you notice about the cluster you selected?
1. What's gone wrong with your model in production?


### Answers:

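1. The Euclidean distance graph measures how far your production embeddings have drifted from your training embeddings at each point in time. The higher the value, the more the production data differs from the training data.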
1. Each point in the point cloud represents a single product review.
1. It consists almost entirely of production data, meaning that your model is seeing data in production the likes of which it never saw during training.
1. Your model was trained on examples of labeled product reviews in English. In production, your model is encountering product reviews in Spanish whose sentiment it cannot correctly predict.

Congrats! You've identified the root cause of your model's performance issue.
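
If you'd like to double-check these findings outside the UI, the sketch below shows two rough checks. It assumes drift can be approximated as the Euclidean distance between the centroids (mean vectors) of the two embedding sets; Phoenix's own computation may differ in detail, for example by evaluating drift over time windows. `centroid_distance` and `accuracy_by_group` are hypothetical helpers, and the toy values (including the language and label strings) are made up.

```python
import numpy as np
import pandas as pd


def centroid_distance(primary_vectors, reference_vectors) -> float:
    """Euclidean distance between the mean vectors of two embedding sets."""
    primary = np.asarray(list(primary_vectors), dtype=float)
    reference = np.asarray(list(reference_vectors), dtype=float)
    return float(np.linalg.norm(primary.mean(axis=0) - reference.mean(axis=0)))


def accuracy_by_group(
    df: pd.DataFrame,
    group_column: str = "language",
    predicted: str = "pred_label",
    actual: str = "label",
) -> pd.Series:
    """Fraction of correct predictions within each value of `group_column`."""
    correct = df[predicted] == df[actual]
    return correct.groupby(df[group_column]).mean()


# Toy embeddings: reference centroid is (1, 0), primary centroid is (4, 4).
toy_reference = [[0.0, 0.0], [2.0, 0.0]]
toy_primary = [[3.0, 4.0], [5.0, 4.0]]
drift = centroid_distance(toy_primary, toy_reference)
print(drift)

# Toy production rows (made-up values, not the real data).
toy_prod = pd.DataFrame(
    {
        "language": ["english", "english", "spanish", "spanish"],
        "pred_label": ["positive", "negative", "positive", "positive"],
        "label": ["positive", "negative", "negative", "negative"],
    }
)
accuracy = accuracy_by_group(toy_prod)
print(accuracy)
```

On the real data, `centroid_distance(prod_df["text_vector"], train_df["text_vector"])` and `accuracy_by_group(prod_df)` would be the corresponding calls.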

## 5. Close the App 🧹

When you're done, don't forget to close the app.

In [None]:
px.close_app()