<img src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="200"/>

# Getting Started with the Arize Platform - Investigating Embedding Drift in NLP

**In this walkthrough, we are going to ingest embedding data and look at embedding drift.** 

In this scenario, you are in charge of maintaining a sentiment classification model. This simple model takes online reviews of your U.S.-based product as input and predicts whether the reviewer's sentiment was positive, negative, or neutral. You trained your sentiment classification model on English reviews. However, once the model was released into production, you noticed that its performance degraded over time.

Arize is able to surface the reason for this performance degradation. In this example, the presence of reviews written in Spanish impacts the model's performance. You can surface and troubleshoot this issue by analyzing the _embedding vectors_ associated with the online review text.

It is worth noting that, according to our research, inspecting embedding drift can surface problems with your data before they cause performance degradation.

In this tutorial, we will start from scratch. We will:
* Download the data
* Obtain embedding vectors using OpenAI's API
* Train the model
* Obtain predictions
* Log the inferences into the Arize Platform

We will be using [OpenAI](https://openai.com/)'s API to make this process extremely easy. The OpenAI API can be applied to virtually any task that involves understanding or generating natural language, offering a wide variety of models for different applications, from content generation to semantic search and classification.

In this tutorial we will be leveraging OpenAI's tools to generate embedding representations of the input text. Next, we will train a simple `RandomForestClassifier` to classify the online reviews into the following classes: `Positive`, `Negative`, `Neutral`.

Before we start, if this is your first Arize Tutorial, we recommend that you complete [Send Data to Arize in 5 Easy Steps](https://colab.research.google.com/github/Arize-ai/client_python/blob/main/arize/examples/tutorials/Arize_Tutorials/Quick_Start/Send_data_to_Arize_in_5_easy_steps_classification.ipynb) before continuing. If you are familiar with sending data to Arize, it only takes a few more lines to send embedding data. 

Let's get started!

# Step 0. Setup and Getting the Data


## Install Dependencies and Import Libraries 📚

In [None]:
!pip install datasets
!pip install umap-learn
!pip install pyyaml==5.4.1
!pip install --upgrade openai
!pip install arize


import uuid
import time
from datetime import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from datasets import load_dataset

import numpy as np
import pandas as pd
from umap import UMAP
import matplotlib.pyplot as plt

from arize.pandas.logger import Client, Schema
from arize.utils.types import Environments, ModelTypes, EmbeddingColumnNames

import openai
from openai.embeddings_utils import get_embedding

## **🌐 Download the Data**

The easiest way to load a dataset is from the [Hugging Face Hub](https://huggingface.co/datasets). There are already over 900 datasets in over 100 languages on the Hub. At Arize, we have crafted the [arize-ai/ecommerce_reviews_with_language_drift](https://huggingface.co/datasets/arize-ai/ecommerce_reviews_with_language_drift) dataset for this example notebook. 

Thanks to Hugging Face 🤗 Datasets, we can download the data in one line of code. The Dataset Object comes equipped with methods that make it very easy to inspect, pre-process, and post-process your data. 

In [None]:
dataset = load_dataset("arize-ai/ecommerce_reviews_with_language_drift")

You can select the splits of the dataset as you would in a dictionary.

In [None]:
train_ds, val_ds, prod_ds = dataset['training'], dataset['validation'], dataset['production']

## Inspect the Data

It is often convenient to convert a Hugging Face `Dataset` object to a Pandas `DataFrame` so we can access high-level APIs for data visualization. To do so, the 🤗Datasets library provides a `set_format()` method that allows us to change the output format of the `Dataset`. This does not change the underlying data format, an Arrow table. When the `DataFrame` format is no longer needed, we can reset the output format using `reset_format()`.

From this point forward, it is convenient to use Pandas DataFrames. We can do so easily using the format methods we have already seen.

In [None]:
def from_dataset_to_dataframe(ds):
    ds.set_format(type="pandas")
    return ds[:]

train_df = from_dataset_to_dataframe(train_ds)
val_df = from_dataset_to_dataframe(val_ds)
prod_df = from_dataset_to_dataframe(prod_ds)

To stay within the limits of OpenAI's free tier account, we will sample our dataset.

In [None]:
train_df = train_df.sample(200, ignore_index=True)
val_df = val_df.sample(50, ignore_index=True)
prod_df = prod_df.sample(500, ignore_index=True)

In [None]:
train_df.head()

# Step 1. Developing your Sentiment Classification Model

## Obtain text embeddings

The OpenAI Python library provides convenient access to the OpenAI API. We use the `get_embedding` function to generate an embedding vector from a piece of text - making use of one of the pre-trained models from OpenAI.

OpenAI offers three families of embedding models, each for a different task: text search, text similarity, and code search. Each family includes up to four models on a spectrum of capability:

* Ada (1024 dimensions),
* Babbage (2048 dimensions),
* Curie (4096 dimensions),
* Davinci (12288 dimensions).

Given that our use case is sentiment classification, we will use the pre-trained model `text-similarity-babbage-001`.

To use OpenAI's tools, create a free account [here](https://openai.com/api/). Then, find your `API_KEY` by clicking on your profile icon and selecting "View API Keys". If you are logged in as part of an organization, you'll need to enter your organization's key as well.

In [None]:
ORGANIZATION_KEY = "OPENAI_ORG_KEY"
API_KEY = "OPENAI_API_KEY"
if API_KEY == "OPENAI_API_KEY":
    raise ValueError("❌ NEED TO CHANGE OPENAI's API_KEY")

openai.api_key = API_KEY
if ORGANIZATION_KEY != "OPENAI_ORG_KEY":
    openai.organization = ORGANIZATION_KEY

OpenAI's free tier accounts are rate limited. We have prepared the following function so that you can call the embedding model at the rate allowed on the free tier.

In [None]:
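def get_embeddings_from_series(series, max_rate=60):
    """Embed the texts in `series` in batches of `max_rate`, pausing 60 seconds after each batch."""
    emb_series = series.copy()
    # Number of batches needed to cover the whole series
    N = np.ceil(len(series)/max_rate).astype(int)
    for i in range(N):
        start = i*max_rate
        end = min((i+1)*max_rate, len(series))

        emb_series[start:end] = series[start:end].apply(
            lambda x: get_embedding(x, engine='text-similarity-babbage-001')
        )
        # Wait a minute before sending more requests to stay under the rate limit
        time.sleep(60)
    return emb_series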
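
The batching logic inside `get_embeddings_from_series` can be checked without calling the API. The sketch below reproduces only the slice boundaries; `batch_bounds` is a hypothetical helper introduced here for illustration and is not part of the notebook or the OpenAI library.

```python
import math

def batch_bounds(n_items, max_rate=60):
    """Yield (start, end) slice bounds covering n_items in chunks of at most max_rate."""
    n_batches = math.ceil(n_items / max_rate)
    for i in range(n_batches):
        start = i * max_rate
        end = min((i + 1) * max_rate, n_items)
        yield start, end

# 200 training rows at 60 requests per batch: four batches, the last one partial
print(list(batch_bounds(200)))  # [(0, 60), (60, 120), (120, 180), (180, 200)]
```

With the sample sizes used above (200, 50, and 500 rows) and one 60-second pause per batch, the three embedding cells spend roughly 4, 1, and 9 minutes waiting, respectively.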

Finally, we use the `get_embeddings_from_series` function to obtain the embedding vectors of our dataset.

In [None]:
train_df['text_vector'] = get_embeddings_from_series(train_df['text'])

In [None]:
val_df['text_vector'] = get_embeddings_from_series(val_df['text'])

In [None]:
prod_df['text_vector'] = get_embeddings_from_series(prod_df['text'])

## Inspect text embeddings

We can do a quick inspection of how the embedding vectors obtained for our training set look using [UMAP](https://umap-learn.readthedocs.io/en/latest/). When using Arize, you will get this view automatically so you won't have to do this with your data.

In [None]:
def inspect_embeddings(X,y):
    projections = UMAP().fit_transform(X)
    
    df_emb = pd.DataFrame({})
    df_emb["X"] = projections[:,0]
    df_emb["Y"] = projections[:,1]
    df_emb['label'] = y  # use the labels passed to the function, not a global variable

    fig, axes = plt.subplots(1, 3, figsize=(7,3))
    axes = axes.flatten()
    cmaps = ["Reds","Greys","Greens"]
    labels = train_ds.features["label"].names

    for i, (label, cmap) in enumerate(zip(labels, cmaps)):
        df_emb_sub = df_emb.query(f"label=={i}")
        axes[i].hexbin(df_emb_sub["X"], df_emb_sub["Y"], cmap=cmap, gridsize=20, linewidths=(0,))
        axes[i].set_title(label)
        axes[i].set_xticks([])
        axes[i].set_yticks([])

In [None]:
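# Stack the per-row embedding vectors into a 2D array before plotting
X_train, y_train = np.stack(train_df['text_vector']), np.stack(train_df['label'])
inspect_embeddings(X_train, y_train)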

The pre-trained model from OpenAI is able to extract embedding vectors that are different depending on the label values, without ever seeing the data or being trained on it!

## Train the model

Now that we have used the pre-trained model to perform feature extraction and obtain embedding vectors, we can train a simple text sentiment classifier that uses the embedding vectors as features. We will use `RandomForestClassifier` from the [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) library.

In [None]:
clf = RandomForestClassifier(n_estimators=100)

X_train, y_train = np.stack(train_df['text_vector']), np.stack(train_df['label'])
X_val, y_val = np.stack(val_df['text_vector']), np.stack(val_df['label'])
X_prod, y_prod = np.stack(prod_df['text_vector']), np.stack(prod_df['label'])

clf.fit(X_train, y_train)

## Evaluate the model

Let's evaluate the performance of the model on our validation and production sets.

In [None]:
preds_val = clf.predict(X_val)

report = classification_report(y_val, preds_val)
print(report)

In [None]:
preds_prod = clf.predict(X_prod)

report = classification_report(y_prod, preds_prod)
print(report)

Something is happening with our data that causes production performance degradation. Let's use Arize to identify the issue and troubleshoot.

# Step 2. Prepare your data to be sent to Arize



## Update the timestamps

The data that you are working with was constructed in April of 2022. Hence, we will update the timestamps so they are current at the time that you're sending data to Arize.

In [None]:
train_df['prediction_ts'] = (train_df['prediction_ts']/4).astype(float)
val_df['prediction_ts'] = (val_df['prediction_ts']/4).astype(float)
prod_df['prediction_ts'] = (prod_df['prediction_ts']/4).astype(float)

In [None]:
last_ts = max(prod_df['prediction_ts'])
now_ts = datetime.timestamp(datetime.now())
delta_ts = now_ts - last_ts    

train_df['prediction_ts'] = (train_df['prediction_ts'] + delta_ts).astype(float)
val_df['prediction_ts'] = (val_df['prediction_ts'] + delta_ts).astype(float)
prod_df['prediction_ts'] = (prod_df['prediction_ts'] + delta_ts).astype(float)
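
Taken together, the two cells above compress the original time span by a factor of 4 and then shift every timestamp so that the most recent production record lands at the current time. A minimal sketch of the same arithmetic on plain Python lists follows; `rescale_timestamps` and the sample values are hypothetical and used only for illustration.

```python
def rescale_timestamps(timestamps, now_ts, factor=4):
    """Divide each timestamp by `factor`, then shift so the latest one equals now_ts."""
    scaled = [ts / factor for ts in timestamps]
    delta = now_ts - max(scaled)
    return [ts + delta for ts in scaled]

# Three records spaced 400 seconds apart end up 100 seconds apart, ending at now_ts
original = [1_000.0, 1_400.0, 1_800.0]
print(rescale_timestamps(original, now_ts=5_000.0))  # [4800.0, 4900.0, 5000.0]
```

Note that in the notebook the shift is computed from the production set alone and then applied to all three DataFrames.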

## Map labels to class names

For readability, we will want to log our inferences (predictions and actuals) with class labels instead of numeric labels. Since we used Hugging Face 🤗 Datasets to download our dataset, it already comes equipped with methods to do this.

The dataset we downloaded defined the label to be an instance of the `datasets.ClassLabel` class, which has the convenient method `int2str` (visit [Hugging Face documentation](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.ClassLabel.names) for more information).

In [None]:
def label_int2str(row):
    return train_ds.features['label'].int2str(row)

In [None]:
train_df['label'] = train_df['label'].apply(label_int2str)
val_df['label'] = val_df['label'].apply(label_int2str)
prod_df['label'] = prod_df['label'].apply(label_int2str)

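# Attach the predictions of the classifier trained above, then map them to class names
train_df['pred_label'] = clf.predict(X_train)
val_df['pred_label'] = clf.predict(X_val)
prod_df['pred_label'] = clf.predict(X_prod)

train_df['pred_label'] = train_df['pred_label'].apply(label_int2str)
val_df['pred_label'] = val_df['pred_label'].apply(label_int2str)
prod_df['pred_label'] = prod_df['pred_label'].apply(label_int2str)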
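
Conceptually, `int2str` is a lookup from an integer index into the list of class names stored on the `ClassLabel` feature. A plain-Python sketch of that idea follows; the `names` list and its ordering are assumptions made for illustration (the real values come from `train_ds.features['label'].names`), and `int_to_name` is a hypothetical stand-in.

```python
# Assumed class names and ordering, for illustration only
names = ["negative", "neutral", "positive"]

def int_to_name(label_id):
    """Map an integer class index to its class name."""
    if not 0 <= label_id < len(names):
        raise ValueError(f"Invalid label id: {label_id}")
    return names[label_id]

print([int_to_name(i) for i in [2, 0, 1]])  # ['positive', 'negative', 'neutral']
```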

## Add prediction ids

Our system needs prediction IDs to link a `prediction` to an `actual` (find out more in our [documentation](https://docs.arize.com/arize/data-ingestion/model-schema/5.-prediction-id?q=prediction_id)). You can generate them as follows:

In [None]:
def add_prediction_id(df):
    return [str(uuid.uuid4()) for _ in range(df.shape[0])]

In [None]:
train_df['prediction_id'] = add_prediction_id(train_df)
val_df['prediction_id'] = add_prediction_id(val_df)
prod_df['prediction_id'] = add_prediction_id(prod_df)
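
Because `uuid.uuid4()` draws random identifiers, collisions are extremely unlikely, but a quick uniqueness check before logging is cheap. A small sketch, where `make_prediction_ids` is a hypothetical stand-in for `add_prediction_id` that takes a row count instead of a DataFrame:

```python
import uuid

def make_prediction_ids(n_rows):
    """Return one random UUID string per row."""
    return [str(uuid.uuid4()) for _ in range(n_rows)]

ids = make_prediction_ids(500)
# Every ID should be distinct so each prediction can be matched to its actual
assert len(set(ids)) == len(ids)
```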

# Step 3. Sending Data into Arize 💫

## Select the columns we want to send to Arize (optional)

This step is not really necessary, since we will select the columns we want to send to Arize using the `Schema` definition (below). However, for the purpose of visibility, this is our final `DataFrame` with the data that will be sent to Arize.

In [None]:
arize_columns = [
    'prediction_id', 
    'prediction_ts', 
    'reviewer_age', 
    'reviewer_gender', 
    'product_category', 
    'language',
    'text',
    'text_vector',
    'label',
    'pred_label'
    ]

train_df = train_df[arize_columns]
val_df = val_df[arize_columns]
prod_df = prod_df[arize_columns]

train_df.head()

## Import and Setup Arize Client

The first step is to set up the Arize client. After that we will log the data.

Copy the Arize `API_KEY` and `SPACE_KEY` from your admin page (shown below) to the variables in the cell below. We will also be setting up some metadata to use across all logging.

<img src="https://storage.googleapis.com/arize-assets/fixtures/copy-keys.png" width="700">

In [None]:
SPACE_KEY = "SPACE_KEY"
API_KEY = "API_KEY"
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)
model_id = "NLP-reviews-demo-language-drift"
model_version = "1.0"
model_type = ModelTypes.CATEGORICAL
if SPACE_KEY == "SPACE_KEY" or API_KEY == "API_KEY":
    raise ValueError("❌ NEED TO CHANGE SPACE AND/OR API_KEY")
else:
    print("✅ Import and Setup Arize Client Done! Now we can start using Arize!")

Now that the Arize client is set up, let's go ahead and log all of our data to the platform. For more details on how **`arize.pandas.logger`** works, visit our documentation.

[![Buttons_OpenOrange.png](https://storage.googleapis.com/arize-assets/fixtures/Buttons_OpenOrange.png)](https://docs.arize.com/arize/sdks-and-integrations/python-sdk/arize.pandas)

## Define the Schema 

A Schema instance specifies the column names for corresponding data in the dataframe. While we could define different Schemas for training and production datasets, the dataframes have the same column names, so the Schema will be the same in this instance.

To ingest non-embedding features, it suffices to provide a list of column names that contain the features in our dataframe. Embedding features, however, are a little bit different.

Arize allows you to ingest not only the embedding vector, but also the raw data associated with that embedding, or a URL link to that raw data. Therefore, up to 3 columns can be associated with the same _embedding object_*. To be able to do this, Arize's SDK provides the `EmbeddingColumnNames` class, used below.

*NOTE: This is how we refer to the 3 possible pieces of information that can be sent as embedding objects:
* Embedding `vector` (required)
* Embedding `data` (optional): raw text, image, ...; associated with the embedding vector
* Embedding `link_to_data` (optional): link to the data associated with the embedding vector

Learn more [here](https://docs.arize.com/arize/data-ingestion/model-schema/7b.-embedding-features).

In [None]:
features = [
    'reviewer_age',
    'reviewer_gender',
    'product_category',
    'language',
]

embedding_features = [
    EmbeddingColumnNames(
        vector_column_name="text_vector",  # Will be name of embedding feature in the app
        data_column_name="text",
    ),
]

# Define a Schema() object for Arize to pick up data from the correct columns for logging
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="pred_label",
    actual_label_column_name="label",
    feature_column_names=features,
    embedding_feature_column_names=embedding_features
)
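
Vectors produced by a single embedding model should all have the same dimension, so it can help to confirm this locally before logging. A minimal sketch on plain Python lists follows; `check_vector_lengths` is a hypothetical helper and is not part of the Arize SDK.

```python
def check_vector_lengths(vectors):
    """Return the common vector length, or raise ValueError if lengths differ."""
    lengths = {len(v) for v in vectors}
    if len(lengths) != 1:
        raise ValueError(f"Expected one vector length, found {sorted(lengths)}")
    return lengths.pop()

# Toy 3-dimensional vectors, for illustration only
print(check_vector_lengths([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]))  # 3
```

Assuming the same helper is applied to a column such as `train_df['text_vector']`, it would be expected to report 2048 for the Babbage model used in this tutorial.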



## Log Training Data

In [None]:
# Logging Training DataFrame
response = arize_client.log(
    dataframe=train_df,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.TRAINING,
    schema=schema,
    sync=True
)


# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"✅ You have successfully logged training set to Arize")


## Log Validation Data

In [None]:
# Logging Validation DataFrame
response = arize_client.log(
    dataframe=val_df,
    model_id=model_id,
    model_version=model_version,
    batch_id="validation",
    model_type=model_type,
    environment=Environments.VALIDATION,
    schema=schema,
    sync=True
)


# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"✅ You have successfully logged validation set to Arize")


## Log Production Data

In [None]:
# send production data
response = arize_client.log(
    dataframe=prod_df,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.PRODUCTION,
    schema=schema,
    sync=True
)

if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"✅ You have successfully logged production set to Arize")
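
The same status check is repeated after each `log` call. One way to factor it out is sketched below; `report_log_status` is a hypothetical helper, and the `SimpleNamespace` object merely stands in for the response returned by `arize_client.log`, exposing only the `status_code` and `text` attributes used in the cells above.

```python
from types import SimpleNamespace

def report_log_status(response, dataset_name):
    """Return a message describing whether a log call succeeded."""
    if response.status_code != 200:
        return f"❌ logging failed with response code {response.status_code}, {response.text}"
    return f"✅ You have successfully logged {dataset_name} set to Arize"

# Stand-in response, for illustration only
ok = SimpleNamespace(status_code=200, text="")
print(report_log_status(ok, "validation"))
```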

# Step 4. Confirm Data in Arize ✅
Note that the Arize platform takes about 15 minutes to index embedding data. While the model should appear immediately, the data will not show up until the indexing is complete. Feel free to head over to the **Data Ingestion** tab for your model to watch Arize work its magic!🔮

You will be able to see the predictions, actuals, and feature importances that have been sent in the last 30 minutes, last day or last week.

An example view of the Data Ingestion tab from a model, when data is sent continuously over 30 minutes, is shown in the image below.

<img src="https://storage.googleapis.com/arize-assets/fixtures/data-ingestion-tab.png" width="700">

# Check the Embedding Data in Arize

First, set the baseline to the training set that we logged before.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/embedding_setup_baseline.gif" width="700">


If your model contains embedding data, you will see it in your Model's Overview page. 

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP-reviews-demo-language-drift-overview.jpg" width="700">

Click on the Embedding Name or the Euclidean Distance value to see how your embedding data is drifting over time. The picture below shows the global Euclidean distance between your production set (at different points in time) and the baseline (which we set to be our training set). We can see a period of about a week during which the distance is suddenly much higher. This tells us that, during that time, the model received text data that differed from what it was trained on (English). This is the period when reviews written in Spanish were sent alongside the expected English reviews.
 
<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP-reviews-demo-language-drift-emb-0.jpg" width="700">
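
To build intuition for the metric in the plot, one simple way to summarize drift is the Euclidean distance between the centroid (mean vector) of the baseline embeddings and the centroid of the production embeddings. The sketch below assumes that definition for illustration; it is not necessarily the exact computation Arize performs, and the toy 2-dimensional vectors are made up.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def centroid_distance(baseline, production):
    """Euclidean distance between the centroids of two sets of vectors."""
    return math.dist(centroid(baseline), centroid(production))

baseline = [[0.0, 0.0], [2.0, 0.0]]  # centroid (1, 0)
similar = [[1.0, 0.0], [1.0, 0.0]]   # centroid (1, 0)
shifted = [[1.0, 3.0], [1.0, 5.0]]   # centroid (1, 4)

print(centroid_distance(baseline, similar))  # 0.0
print(centroid_distance(baseline, shifted))  # 4.0
```

Production data that resembles the baseline keeps the distance near zero, while a new cluster of vectors pulls the production centroid away and raises it.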

In addition to the drift tracking plot, you can find the UMAP visualization of your data for the selected point in time. Notice that the production data and our baseline (training) data are superimposed, which indicates that the model is seeing data in production similar to the data it was trained on.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP-reviews-demo-language-drift-emb-1.jpg" width="700">

Next, select a point in time when the drift was high and select a UMAP visualization in 2D. We can see that both training and production data are superimposed for the most part, but another cluster of production data has appeared. This indicates that the model is seeing data in production qualitatively different to the data it was trained on, and in this case causing performance degradation.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP-reviews-demo-language-drift-emb-2.jpg" width="700">

For further inspection, you may select a 3D UMAP view and click _Explore UMAP_ to expand the view. With this view we can interact in 3D with our dataset. We can zoom, rotate, and drag so we can see the areas of our dataset that are most interesting to us. Check out the workflow below:

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP-reviews-demo-language-drift-workflow.gif" width="700">

You can see that the coloring has been made to distinguish production data vs baseline data (training in this example). More coloring options will be added, to help understand/debug your dataset, including:
* Color by prediction label
* Color by actual label
* Color by feature value
* Color by accuracy (correct vs incorrect predictions)

# Wrap Up 🎁
Congratulations, you've now sent your first machine learning embedding data to the Arize platform!!

Additionally, if you want to remove this example model from your account, just click **Models** -> **NLP-reviews-demo-language-drift** -> **config** -> **delete**

### Overview
Arize is an end-to-end ML observability and model monitoring platform. The platform is designed to help ML engineers and data science practitioners surface and fix issues with ML models in production faster with:
- Automated ML monitoring and model monitoring
- Workflows to troubleshoot model performance
- Real-time visualizations for model performance monitoring, data quality monitoring, and drift monitoring
- Model prediction cohort analysis
- Pre-deployment model validation
- Integrated model explainability

### Website
Visit Us At: https://arize.com/model-monitoring/

### Additional Resources
- [What is ML observability?](https://arize.com/what-is-ml-observability/)
- [Monitor Unstructured Data with Arize](https://arize.com/blog/monitor-unstructured-data-with-arize)
- [Getting Started With Embeddings Is Easier Than You Think](https://arize.com/blog/getting-started-with-embeddings-is-easier-than-you-think)
- [Playbook to model monitoring in production](https://arize.com/the-playbook-to-monitor-your-models-performance-in-production/)
- [Using statistical distance metrics for ML monitoring and observability](https://arize.com/using-statistical-distance-metrics-for-machine-learning-observability/)
<!-- - [ML infrastructure tools for data preparation](https://arize.com/ml-infrastructure-tools-for-data-preparation/) -->
- [ML infrastructure tools for model building](https://arize.com/ml-infrastructure-tools-for-model-building/)
- [ML infrastructure tools for production](https://arize.com/ml-infrastructure-tools-for-production-part-1/)
<!-- - [ML infrastructure tools for model deployment and model serving](https://arize.com/ml-infrastructure-tools-for-production-part-2-model-deployment-and-serving/) -->

- [ML infrastructure tools for ML monitoring and observability](https://arize.com/ml-infrastructure-tools-ml-observability/)

Visit the [Arize Blog](https://arize.com/blog) and [Resource Center](https://arize.com/resource-hub/) for more resources on ML observability and model monitoring.
