# <center>Phoenix in Flight</center>
## <center>Investigating Embedding Drift for a Sentiment Classification Model</center>

Imagine you're in charge of maintaining a model that takes as input online reviews of your U.S.-based product and classifies the sentiment of each review as positive, negative, or neutral. Your model initially performs well in production, but its performance gradually degrades over time.

Phoenix helps you surface the reason for this regression by analyzing the *embeddings* representing the text of each review. Your model was trained on English reviews, but as you'll discover, it's encountering Spanish reviews in production that it can't correctly classify.

According to our research, embedding drift often precedes performance degradation, so Phoenix can help you detect and fix the issue proactively, before it becomes noticeable to your users.

In this tutorial, you will:
* Download curated datasets of embeddings and predictions
* Visually explore embeddings in Phoenix
* Investigate problematic clusters
* Export data for labeling and re-training

Let's get started!

### 1. Install Dependencies and Import Libraries 📚

In [1]:
%pip install -q arize-phoenix

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import phoenix as px

### 2. Download the Data 📊

Load your training and production data into two pandas dataframes and inspect a few rows of the training dataframe.

In [3]:
training_dataframe = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/nlp/sentiment-classification-language-drift/sentiment_classification_language_drift_training.parquet",
)
production_dataframe = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/nlp/sentiment-classification-language-drift/sentiment_classification_language_drift_production.parquet",
)
training_dataframe.head()

| | prediction_ts | reviewer_age | reviewer_gender | product_category | language | text | text_vector | label | pred_label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1650092000.0 | 21 | female | apparel | english | Poor quality of fabric and ridiculously tight ... | [-0.070516996, 0.6640034, 0.33579218, -0.26907... | negative | negative |
| 1 | 1650093000.0 | 29 | male | kitchen | english | Love these glasses, thought they'd be everyday... | [-0.0024410924, -0.5406275, 0.31713492, -0.033... | positive | positive |
| 2 | 1650093000.0 | 26 | female | sports | english | These are disgusting, it tastes like you are "... | [0.40487882, 0.8235396, 0.38333943, -0.4269158... | negative | negative |
| 3 | 1650093000.0 | 26 | male | other | english | My husband has a pair of TaoTronics so I decid... | [0.018816521, 0.53441304, 0.4907303, -0.024163... | neutral | neutral |
| 4 | 1650093000.0 | 37 | male | home_improvement | english | Threads too deep. Engages on tank, but gasket ... | [-0.25348073, 0.31603432, 0.35810202, -0.24672... | negative | negative |


The columns of the dataframe are:
- **prediction_ts:** the Unix timestamps of your predictions
- **reviewer_age**, **reviewer_gender**, **product_category**, **language:** the features of your model
- **text:** the text of each review
- **text_vector:** the embedding vectors representing each review
- **pred_label:** the label your model predicted
- **label:** the ground-truth label for each review
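
Since **language** is one of the columns, a quick way to sanity-check your data is to compare how often each language appears in the two dataframes. The sketch below is an illustration only: `language_share` is a hypothetical helper, and the toy dataframes are made-up stand-ins rather than the downloaded data.

```python
import pandas as pd


def language_share(dataframe: pd.DataFrame, column: str = "language") -> pd.Series:
    """Return the fraction of rows for each value of `column`."""
    return dataframe[column].value_counts(normalize=True)


# Toy stand-ins for the real dataframes (hypothetical values, not the tutorial data).
toy_training = pd.DataFrame({"language": ["english"] * 4})
toy_production = pd.DataFrame(
    {"language": ["english", "english", "spanish", "spanish"]}
)

# Put the shares side by side; a language missing from one dataframe becomes 0.0.
comparison = pd.concat(
    {
        "training": language_share(toy_training),
        "production": language_share(toy_production),
    },
    axis=1,
).fillna(0.0)
print(comparison)
```

To try this on the real data, you could pass `training_dataframe` and `production_dataframe` in place of the toy dataframes.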

### 3. Launch Phoenix 🔥🐦

#### a) Define Your Schema

To launch Phoenix with your data, you first need to define a schema that tells Phoenix which columns of your dataframes correspond to features, predictions, actuals (i.e., ground truth), embeddings, etc.

The trickiest part is defining embedding features. In this case, each embedding feature has two pieces of information: the embedding vector itself, contained in the "text_vector" column, and the review text, contained in the "text" column.
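
Before pairing these two columns, it can be worth checking that every vector has the same length and that no review text is missing. The following is a minimal sketch under stated assumptions: `embedding_dimensions` is a hypothetical helper (not part of Phoenix), and the toy dataframe uses made-up three-dimensional vectors.

```python
import pandas as pd


def embedding_dimensions(dataframe: pd.DataFrame, vector_column: str = "text_vector") -> set:
    """Return the set of distinct vector lengths found in the embedding column."""
    return {len(vector) for vector in dataframe[vector_column]}


# Toy stand-in for the real dataframe (hypothetical values).
toy_dataframe = pd.DataFrame(
    {
        "text": ["great product", "arrived broken"],
        "text_vector": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    }
)

# A single element in the set means all vectors share one dimension.
dimensions = embedding_dimensions(toy_dataframe)
# True if any row lacks its raw review text.
has_missing_text = bool(toy_dataframe["text"].isna().any())
print(dimensions, has_missing_text)
```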

In [4]:
embedding_features = {
    "text_embedding": px.EmbeddingColumnNames(
        vector_column_name="text_vector", raw_data_column_name="text"
    ),
}
schema = px.Schema(
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="pred_label",
    actual_label_column_name="label",
    embedding_feature_column_names=embedding_features,
)

You'll notice that the schema above doesn't explicitly specify features. That's because feature columns are automatically inferred if you don't pass `feature_column_names` to your `Schema` object.
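
To build intuition for what "inferred" could mean here, the sketch below treats every column that the schema has not already assigned a role as a feature. This is an assumption made for illustration, not Phoenix's actual implementation, and `infer_feature_columns` is a hypothetical helper.

```python
from typing import Iterable, List


def infer_feature_columns(
    all_columns: Iterable[str], claimed_columns: Iterable[str]
) -> List[str]:
    """Return the columns with no assigned role, preserving their order."""
    claimed = set(claimed_columns)
    return [column for column in all_columns if column not in claimed]


# Column names from the training dataframe shown earlier.
columns = [
    "prediction_ts", "reviewer_age", "reviewer_gender", "product_category",
    "language", "text", "text_vector", "label", "pred_label",
]
# Columns the schema above refers to explicitly.
claimed = ["prediction_ts", "pred_label", "label", "text_vector", "text"]

inferred = infer_feature_columns(columns, claimed)
print(inferred)
# ['reviewer_age', 'reviewer_gender', 'product_category', 'language']
```

The result matches the four feature columns listed in the column descriptions earlier.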

#### b) Define Your Datasets

Next, define your primary and reference datasets. In this case, your reference dataset contains training data and your primary dataset contains production data.

In [5]:
primary_dataset = px.Dataset(dataframe=production_dataframe, schema=schema, name="primary")
reference_dataset = px.Dataset(dataframe=training_dataframe, schema=schema, name="reference")

Dataset info written to '/Users/natemar/.phoenix/datasets/primary'
Dataset already persisted
Dataset: primary initialized
Dataset info written to '/Users/natemar/.phoenix/datasets/reference'
Dataset already persisted
Dataset: reference initialized
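
One simple way to see why a reference dataset is useful is to measure how far the production embeddings have moved from the training embeddings. The sketch below computes the Euclidean distance between the two centroids (mean vectors). It is an illustrative drift signal, not necessarily how Phoenix computes drift; `centroid_distance` is a hypothetical helper, and the toy vectors are made up.

```python
import numpy as np


def centroid_distance(reference_vectors, primary_vectors) -> float:
    """Euclidean distance between the mean vectors of two embedding sets."""
    reference_centroid = np.asarray(reference_vectors, dtype=float).mean(axis=0)
    primary_centroid = np.asarray(primary_vectors, dtype=float).mean(axis=0)
    return float(np.linalg.norm(primary_centroid - reference_centroid))


# Toy two-dimensional embeddings (hypothetical values).
toy_reference = [[0.0, 0.0], [2.0, 0.0]]  # centroid is (1, 0)
toy_primary = [[4.0, 4.0], [4.0, 4.0]]  # centroid is (4, 4)

distance = centroid_distance(toy_reference, toy_primary)
print(distance)  # sqrt(3**2 + 4**2) = 5.0
```

On the real data, and assuming every entry of "text_vector" is an array of the same length, you could stack each column with `np.stack(training_dataframe["text_vector"])` and pass the results to the same function.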


#### c) Create a Phoenix Session

In [14]:
session = px.launch_app(primary=primary_dataset, reference=reference_dataset)

⏳Launching Phoenix...Phoenix failed to launch. Please try again.


#### d) Launch the Phoenix UI

In [15]:
session.view()

INFO:     Started server process [17902]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:6060 (Press CTRL+C to quit)


1️⃣ primary dataset: primary
2️⃣ reference dataset: reference
INFO:     127.0.0.1:52344 - "GET /embeddings/RW1iZWRkaW5nRGltZW5zaW9uOjA%3D HTTP/1.1" 200 OK
INFO:     127.0.0.1:52344 - "GET /index.js HTTP/1.1" 200 OK
node: EmbeddingDimension 0
INFO:     127.0.0.1:52344 - "POST /graphql HTTP/1.1" 200 OK
INFO:     127.0.0.1:52344 - "POST /graphql HTTP/1.1" 200 OK
node: EmbeddingDimension 0
INFO:     127.0.0.1:52344 - "POST /graphql HTTP/1.1" 200 OK


OMP: Info #273: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.


node: EmbeddingDimension 0
{<class 'phoenix.datasets.event.EventId'>}
{<class 'numpy.ndarray'>}
INFO:     127.0.0.1:52345 - "POST /graphql HTTP/1.1" 200 OK


### 4. Explore Your Data 📈

Phoenix is under active development. At the moment, we display your model schema and a few data quality statistics. Check back soon for more updates.

### 5. Close the App 🧹

When you're done, don't forget to close the app.

In [11]:
px.close_app()

INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [17712]
