# Investigating Embedding Drift for a Sentiment Classification Model

Imagine you're in charge of maintaining a model that takes as input online reviews of your U.S.-based product and classifies the sentiment of each review as positive, negative, or neutral. Your model initially performs well in production, but its performance gradually degrades over time.

Phoenix helps you surface the reason for this regression by analyzing the [embeddings](https://docs.arize.com/phoenix/concepts/embeddings) representing the text of each review. Your model was trained on English reviews, but as you'll discover, it's encountering Spanish reviews in production that it can't correctly classify.

In this tutorial, you will:
* Download curated datasets of embeddings and predictions
* Define a schema to describe the format of your data
* Launch Phoenix to visually explore your embeddings
* Investigate problematic clusters

Let's get started!

### 1. Install Dependencies and Import Libraries 📚

In [1]:
!pip install -q arize-phoenix

Note: you may need to restart the kernel to use updated packages.


In [6]:
import pandas as pd
import phoenix as px

### 2. Download the Data 📊

Load your training and production data into two Pandas DataFrames.

In [7]:
train_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/nlp/sentiment-classification-language-drift/sentiment_classification_language_drift_training.parquet",
)
prod_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/nlp/sentiment-classification-language-drift/sentiment_classification_language_drift_production.parquet",
)

Inspect a few rows of the training DataFrame.

In [8]:
train_df.head()

Unnamed: 0,prediction_ts,reviewer_age,reviewer_gender,product_category,language,text,text_vector,label,pred_label
0,1650092000.0,21,female,apparel,english,Poor quality of fabric and ridiculously tight ...,"[-0.070516996, 0.6640034, 0.33579218, -0.26907...",negative,negative
1,1650093000.0,29,male,kitchen,english,"Love these glasses, thought they'd be everyday...","[-0.0024410924, -0.5406275, 0.31713492, -0.033...",positive,positive
2,1650093000.0,26,female,sports,english,"These are disgusting, it tastes like you are ""...","[0.40487882, 0.8235396, 0.38333943, -0.4269158...",negative,negative
3,1650093000.0,26,male,other,english,My husband has a pair of TaoTronics so I decid...,"[0.018816521, 0.53441304, 0.4907303, -0.024163...",neutral,neutral
4,1650093000.0,37,male,home_improvement,english,"Threads too deep. Engages on tank, but gasket ...","[-0.25348073, 0.31603432, 0.35810202, -0.24672...",negative,negative


The columns of the DataFrame are:
- **prediction_ts:** the Unix timestamps of your predictions
- **reviewer_age**, **reviewer_gender**, **product_category**, **language:** the features of your model
- **text:** the text of each review
- **text_vector:** the embedding vectors representing each review
- **pred_label:** the label your model predicted
- **label:** the ground-truth label for each review
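Before handing the data to Phoenix, it can help to sanity-check two of these columns. The sketch below is a minimal illustration on a tiny stand-in DataFrame: the helper `summarize_embeddings` and the `toy_df` frame are hypothetical names introduced here, and the code assumes that `prediction_ts` holds Unix timestamps in seconds and that each `text_vector` entry is a fixed-length array.

```python
import numpy as np
import pandas as pd


def summarize_embeddings(df: pd.DataFrame, vector_column: str = "text_vector") -> dict:
    """Return the row count and the distinct vector lengths in the embedding column."""
    lengths = df[vector_column].map(len)
    return {
        "rows": len(df),
        "dimensions": sorted(int(n) for n in lengths.unique()),
    }


# Tiny stand-in frame (hypothetical data) using the tutorial's column names.
toy_df = pd.DataFrame(
    {
        "prediction_ts": [1650092000.0, 1650093000.0],
        "text": ["Poor quality", "Love these glasses"],
        "text_vector": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
    }
)

# A single entry in "dimensions" means every vector has the same length.
summary = summarize_embeddings(toy_df)

# Assumption: timestamps are in seconds, so unit="s" yields readable datetimes.
readable_ts = pd.to_datetime(toy_df["prediction_ts"], unit="s")
```

On the real data, you would call `summarize_embeddings(train_df)` and `summarize_embeddings(prod_df)` and expect both to report the same single dimension.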

### 3. Launch Phoenix 🔥🐦

#### a) Define Your Schema

To launch Phoenix with your data, you first need to define a schema that tells Phoenix which columns of your DataFrames correspond to features, predictions, actuals (i.e., ground truth), embeddings, etc.

The trickiest part is defining embedding features. In this case, each embedding feature has two pieces of information: the embedding vector itself contained in the "text_vector" column and the review text contained in the "text" column.

In [14]:
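import uuid


def add_prediction_id(df):
    # Generate one random UUID string per row to serve as a unique prediction ID.
    return [str(uuid.uuid4()) for _ in range(df.shape[0])]


# The schema references this column through prediction_id_column_name.
train_df['prediction_id'] = add_prediction_id(train_df)
prod_df['prediction_id'] = add_prediction_id(prod_df)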

In [15]:
embedding_features = {
    "text_embedding": px.EmbeddingColumnNames(
        vector_column_name="text_vector", raw_data_column_name="text"
    ),
}
schema = px.Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="pred_label",
    actual_label_column_name="label",
    embedding_feature_column_names=embedding_features,
)
schema

Schema(prediction_id_column_name='prediction_id', timestamp_column_name='prediction_ts', feature_column_names=None, tag_column_names=None, prediction_label_column_name='pred_label', prediction_score_column_name=None, actual_label_column_name='label', actual_score_column_name=None, embedding_feature_column_names={'text_embedding': EmbeddingColumnNames(vector_column_name='text_vector', raw_data_column_name='text', link_to_data_column_name=None)}, excludes=None)

You'll notice that the schema above doesn't explicitly specify features. That's because feature columns are [implicitly inferred](https://docs.arize.com/phoenix/how-to/define-your-schema#implicit-features) if you don't pass `feature_column_names` to your `Schema` object.
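If you would rather see which columns are left over for feature inference, or pass them explicitly through `feature_column_names`, a rough sketch follows. It only approximates the rule described in the linked docs (columns not claimed by any other schema field); Phoenix's actual inference may differ. The helper `remaining_columns` and the empty `toy_df` are hypothetical and exist only for illustration.

```python
from typing import Iterable, List

import pandas as pd


def remaining_columns(df: pd.DataFrame, assigned: Iterable[str]) -> List[str]:
    """List the columns that no schema field has claimed, in DataFrame order."""
    assigned_set = set(assigned)
    return [col for col in df.columns if col not in assigned_set]


# Columns the tutorial's schema names, including both embedding columns.
assigned_columns = [
    "prediction_id",
    "prediction_ts",
    "pred_label",
    "label",
    "text_vector",
    "text",
]

# Empty stand-in frame (hypothetical) with the tutorial's column names.
toy_df = pd.DataFrame(
    columns=[
        "prediction_id",
        "prediction_ts",
        "reviewer_age",
        "reviewer_gender",
        "product_category",
        "language",
        "text",
        "text_vector",
        "label",
        "pred_label",
    ]
)

candidate_features = remaining_columns(toy_df, assigned_columns)
```

Note that any stray index column in your DataFrame would also show up in this list, which is one reason to list features explicitly when you want full control.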

#### b) Define Your Datasets

Next, define your [primary and reference datasets](https://docs.arize.com/phoenix/concepts/phoenix-basics#which-dataset-is-which). In this case, your reference dataset contains training data and your primary dataset contains production data.

In [16]:
prim_ds = px.Dataset(dataframe=prod_df, schema=schema, name="primary")
ref_ds = px.Dataset(dataframe=train_df, schema=schema, name="reference")

Dataset info written to '/Users/kiko/.phoenix/datasets/primary'
Dataset already persisted
Dataset: primary initialized
Dataset info written to '/Users/kiko/.phoenix/datasets/reference'
Dataset already persisted
Dataset: reference initialized


Inspect your primary dataset.

In [17]:
prim_ds

<phoenix.datasets.dataset.Dataset at 0x7fdc29c7c550>

#### c) Create a Phoenix Session

In [18]:
session = px.launch_app(primary=prim_ds, reference=ref_ds)
session

⏳Launching Phoenix...Phoenix failed to launch. Please try again.
🌍 To view the Phoenix app in your browser, visit http://localhost:6060/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


<phoenix.session.session.Session at 0x7fdc29c5feb0>

#### d) Launch the Phoenix UI

You can open Phoenix by copying and pasting the output of `session.url` into a new browser tab.

In [19]:
session.url

'http://localhost:6060/'

Alternatively, you can open the Phoenix UI in your notebook with:

In [None]:
session.view()

### 4. Explore Your Data 📈

Investigate troublesome clusters of your data:

1. Navigate to the "Embeddings" tab and click on "text_embedding".
2. In the Euclidean distance graph at the top of the page, click a point on the graph where the Euclidean distance is high.
3. Click on the top cluster in the panel on the left.
4. Use the panel at the bottom to examine the data points in this cluster.

Try to answer the questions below, then click each one to reveal its answer:

<details>
    <summary>
        What does the Euclidean distance graph measure?
    </summary>
    <p>
        This graph measures the drift of your production data relative to your training data over time. See <a href="https://docs.arize.com/phoenix/reference/metrics/euclidean-distance">here</a> for details.
    </p>
</details>
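To make the idea of a Euclidean-distance drift value concrete, here is a minimal sketch that measures the distance between the centroids (mean vectors) of two sets of embeddings. It assumes a centroid-based definition and ignores time windows; see the linked reference for how Phoenix defines the metric. The function `centroid_distance` and the toy arrays are hypothetical.

```python
import numpy as np


def centroid_distance(primary: np.ndarray, reference: np.ndarray) -> float:
    """Euclidean distance between the mean vectors of two embedding matrices."""
    primary_centroid = primary.mean(axis=0)
    reference_centroid = reference.mean(axis=0)
    return float(np.linalg.norm(primary_centroid - reference_centroid))


# Toy 2-D embeddings (hypothetical), one row per data point.
reference_vectors = np.array([[0.0, 0.0], [2.0, 0.0]])  # centroid at (1, 0)
primary_vectors = np.array([[4.0, 4.0], [6.0, 4.0]])    # centroid at (5, 4)

drift = centroid_distance(primary_vectors, reference_vectors)
no_drift = centroid_distance(reference_vectors, reference_vectors)
```

With the tutorial data, the inputs could be built with `np.stack(prod_df["text_vector"].to_numpy())` and `np.stack(train_df["text_vector"].to_numpy())`, assuming every vector has the same length.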

<details>
    <summary>
        What do the points in the point cloud represent?
    </summary>
    <p>
        Each point in the point cloud corresponds to a single product review. Phoenix has projected the high-dimensional embeddings from your original DataFrame into a low-dimensional space so that you can visualize them.
    </p>
</details>
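As a rough illustration of what reducing dimensionality means, the sketch below projects random 16-dimensional vectors down to three dimensions. PCA (via NumPy's SVD) is used here purely as a simple stand-in; it is not claimed to be the method Phoenix uses. The function `project_with_pca` and the random data are hypothetical.

```python
import numpy as np


def project_with_pca(vectors: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project row vectors onto their top principal components."""
    # Center the data so the components capture variance around the mean.
    centered = vectors - vectors.mean(axis=0)
    # Rows of `components` are principal directions, ordered by explained variance.
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    return centered @ components[:n_components].T


# Random stand-in embeddings: 50 points in 16 dimensions.
rng = np.random.default_rng(0)
high_dim = rng.normal(size=(50, 16))

low_dim = project_with_pca(high_dim)
```

Each row of `low_dim` is one point that could be plotted in a 3-D scatter plot, much like a point in the point cloud.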

<details>
    <summary>
        What do you notice about the cluster you selected?
    </summary>
    <p>
        It consists almost entirely of production data, meaning that your model is seeing data in production unlike anything it saw during training.
    </p>
</details>

<details>
    <summary>
        What's gone wrong with your model in production?
    </summary>
    <p>
        Your model was fine-tuned on examples of labeled product reviews in English. In production, your model is encountering product reviews in Spanish whose sentiment it cannot correctly predict.
    </p>
</details>

Congrats! You've identified the root cause of your model's performance issue.
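You can cross-check this finding outside the UI by comparing accuracy across languages. The sketch below is a minimal example on toy data; it assumes your production DataFrame has the same `language`, `label`, and `pred_label` columns as the training DataFrame shown earlier. The helper `accuracy_by_group` and the `toy_prod_df` frame are hypothetical.

```python
import pandas as pd


def accuracy_by_group(
    df: pd.DataFrame,
    group_column: str = "language",
    actual: str = "label",
    predicted: str = "pred_label",
) -> pd.Series:
    """Fraction of rows where the prediction matches the ground truth, per group."""
    correct = df[actual] == df[predicted]
    return correct.groupby(df[group_column]).mean()


# Toy production data (hypothetical): predictions are worse on Spanish reviews.
toy_prod_df = pd.DataFrame(
    {
        "language": ["english", "english", "spanish", "spanish"],
        "label": ["positive", "negative", "positive", "negative"],
        "pred_label": ["positive", "negative", "neutral", "negative"],
    }
)

per_language = accuracy_by_group(toy_prod_df)
```

Running `accuracy_by_group(prod_df)` on the real data should show whether accuracy differs by language in the way the cluster suggests.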

### 5. Close the App 🧹

When you're done, don't forget to close the app.

In [None]:
px.close_app()