# Getting Started with the Arize Platform - Investigating Embedding Drift

**In this walkthrough, we are going to ingest embedding data and look at embedding drift.** 

You are in charge of maintaining a sentiment classification model. This simple model takes online reviews as input and predicts whether each one is positive or negative.

In this scenario, you trained the model on movie reviews using the IMDB dataset. Once the model was released into production, hotel reviews from Trip Advisor were ingested in addition to the expected movie reviews. This may degrade your model's performance.

Arize can surface the presence of this unexpected data, regardless of whether it affects performance, by analyzing the embedding vectors associated with the reviews.

After the model has been trained and deployed, the inference data is stored here (in HDF5 format):
*   [Training Set](https://storage.googleapis.com/arize-assets/documentation-sample-data/embeddings/NLP/embeddings_IMDB_Movie_Reviews_train)
*   [Production Set](https://storage.googleapis.com/arize-assets/documentation-sample-data/embeddings/NLP/embeddings_IMDB_Movie_Reviews+TripAdvisor_Hotel_Reviews_prod)

These files contain the actual labels, review text, features, embedding vectors, and other inference data (see the dataframes below).

If you are familiar with sending data to Arize, it only takes a few more lines to send embedding data. If this is your first Arize tutorial, we recommend that you start with this one instead: [Send Data to Arize in 5 Easy Steps](https://colab.research.google.com/github/Arize-ai/client_python/blob/main/arize/examples/tutorials/Arize_Tutorials/Quick_Start/Send_data_to_Arize_in_5_easy_steps_classification.ipynb)

# Step 0. Setup and Getting the Data

Our embedding offering is still under development so you will have to specify the SDK version of our release candidate.

### Install Dependencies and Import Libraries 📚

In [None]:
!pip install arize==4.1.0rc0

import pandas as pd
from arize.pandas.logger import Client, Schema
from arize.utils.types import ModelTypes, Environments, EmbeddingColumnNames

## **🌐 Download the Data**
We have already prepared the data for you and split it into 2 pandas dataframes: training and production.

In [None]:
gcs_dir = "https://storage.googleapis.com/arize-assets/documentation-sample-data/embeddings/NLP/"
train_name = "embeddings_IMDB_Movie_Reviews_train"
prod_name = "embeddings_IMDB_Movie_Reviews+TripAdvisor_Hotel_Reviews_prod"

train_link = f"{gcs_dir:s}{train_name:s}"
prod_link = f"{gcs_dir:s}{prod_name:s}"

!wget $train_link
!wget $prod_link

train_df = pd.read_hdf(train_name, index_col=False)
prod_df = pd.read_hdf(prod_name, index_col=False)

## Inspect the Data 

In [None]:
train_df.head()

In [None]:
prod_df.head()

The _context_ feature has been added artificially so that we can tell apart reviews coming from different sources, i.e., movie reviews and hotel reviews.

In [None]:
prod_df[prod_df['context']=='hotels'].head()

# Step 1. Sending Data into Arize 💫
## Import and Setup Arize Client

The first step is to set up our Arize client. After that, we will log the data.

Copy the Arize `API_KEY` and `SPACE_KEY` from your admin page (shown below) into the setup cell that follows. We will also set up some metadata to use across all logging calls.


<img src="https://storage.googleapis.com/arize-assets/fixtures/copy-keys.png" width="700">

In [None]:
SPACE_KEY = "SPACE_KEY"
API_KEY = "API_KEY"
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)
model_id = "NLP-reviews-demo"
model_version = "1.0"
model_type = ModelTypes.CATEGORICAL
if SPACE_KEY == "SPACE_KEY" or API_KEY == "API_KEY":
    raise ValueError("❌ NEED TO CHANGE SPACE AND/OR API_KEY")
else:
    print("✅ Import and Setup Arize Client Done! Now we can start using Arize!")
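
As an alternative to pasting the keys directly into the notebook, they can be read from environment variables. The following is a minimal sketch, not part of the Arize SDK: the helper `load_arize_keys` and the variable names `ARIZE_SPACE_KEY` and `ARIZE_API_KEY` are hypothetical and chosen only for illustration.

```python
import os
from typing import Mapping, Tuple


def load_arize_keys(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (space_key, api_key) read from a mapping of environment variables.

    The variable names below are hypothetical, not an Arize convention.
    """
    space_key = env.get("ARIZE_SPACE_KEY", "")
    api_key = env.get("ARIZE_API_KEY", "")
    missing = [
        name
        for name, value in (("ARIZE_SPACE_KEY", space_key), ("ARIZE_API_KEY", api_key))
        if not value
    ]
    if missing:
        raise ValueError(f"missing environment variables: {', '.join(missing)}")
    return space_key, api_key


# Illustration with a fake environment; in a notebook, call load_arize_keys()
# with no arguments after exporting the two variables.
fake_env = {"ARIZE_SPACE_KEY": "my-space", "ARIZE_API_KEY": "my-api-key"}
example_space_key, example_api_key = load_arize_keys(fake_env)
```

The returned values can then be passed to `Client(space_key=..., api_key=...)` exactly as in the cell above.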


Now that our Arize client is set up, let's go ahead and log all of our data to the platform. For more details on how **`arize.pandas.logger`** works, visit our documentation page below.

[![Buttons_OpenOrange.png](https://storage.googleapis.com/arize-assets/fixtures/Buttons_OpenOrange.png)](https://docs.arize.com/arize/sdks-and-integrations/python-sdk/arize.pandas)

## Define the Schema 

A Schema instance specifies the column names for corresponding data in the dataframe. While we could define different Schemas for training and production datasets, in this case the dataframes have the same column names so the Schema will be the same.

To ingest features, it suffices to give a list of column names that contain the features in our dataframe. However, embeddings are a little more complex.

Arize allows you to ingest not only the embedding vector, but also the raw data associated with that embedding or a URL link to that raw data. Hence, up to 3 columns can be associated with the same _embedding object_*. To make this possible, Arize's SDK provides the `EmbeddingColumnNames` class, used below.

*NOTE: An _embedding object_ refers to the 3 possible pieces of information that can be sent:
* Embedding `vector` (required)
* Embedding `data` (optional): the raw text, image, etc. associated with the embedding vector
* Embedding `link_to_data` (optional): a link to the data associated with the embedding vector
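
To make the three pieces concrete, here is a small plain-Python sketch that checks a batch of records before they are sent. It is not part of the Arize SDK. The helper `summarize_embedding_records` is hypothetical, the key names `review_vector` and `review_text` follow the columns used in this tutorial, `review_link` is an invented name for the optional link, and the requirement that all vectors share one dimension is an assumption rather than something this tutorial states.

```python
from typing import Any, Dict, List, Optional


def summarize_embedding_records(
    records: List[Dict[str, Any]],
    vector_key: str = "review_vector",
    data_key: str = "review_text",
    link_key: Optional[str] = "review_link",  # hypothetical column name
) -> Dict[str, int]:
    """Check the required vector and count how often the optional pieces appear."""
    if not records:
        raise ValueError("no records to check")
    dimension = None
    with_data = 0
    with_link = 0
    for index, record in enumerate(records):
        vector = record.get(vector_key)
        if vector is None or len(vector) == 0:
            raise ValueError(f"record {index}: the embedding vector is required")
        if dimension is None:
            dimension = len(vector)
        elif len(vector) != dimension:
            # Assumption: every vector of one embedding feature has the same length.
            raise ValueError(f"record {index}: expected dimension {dimension}")
        if record.get(data_key) is not None:
            with_data += 1
        if link_key is not None and record.get(link_key) is not None:
            with_link += 1
    return {"dimension": dimension, "with_data": with_data, "with_link": with_link}


example_records = [
    {"review_vector": [0.1, 0.2, 0.3], "review_text": "Great movie"},
    {"review_vector": [0.4, 0.5, 0.6]},  # data and link are optional
]
summary = summarize_embedding_records(example_records)
```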

In [None]:
features = [
    'age',
    'gender',
    'context',
]

embedding_features = [
    EmbeddingColumnNames(
        vector_column_name="review_vector",  # Will be name of embedding feature in the app
        data_column_name="review_text",
    ),
]

# Define a Schema() object for Arize to pick up data from the correct columns for logging
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="prediction_label",
    actual_label_column_name="actual_label",
    feature_column_names=features,
    embedding_feature_column_names=embedding_features
)
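
Since the same Schema is used for both dataframes, it can be worth confirming that every column it names is actually present before logging. The sketch below is one way to do that in plain Python. The helper `missing_columns` is hypothetical, and `example_columns` is made-up data standing in for a dataframe's column names.

```python
from typing import Iterable, List


def missing_columns(available: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent, in the order given."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


# Column names referenced by the Schema and feature lists in this tutorial.
required_columns = [
    "prediction_id",
    "prediction_ts",
    "prediction_label",
    "actual_label",
    "age",
    "gender",
    "context",
    "review_vector",
    "review_text",
]

# Made-up example in which the embedding vector column is missing.
example_columns = [name for name in required_columns if name != "review_vector"]
absent = missing_columns(example_columns, required_columns)
```

In the notebook, the same helper could be called as `missing_columns(train_df.columns, required_columns)` and again with `prod_df.columns`; an empty list means every column was found.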



## Log Training Data

In [None]:
# Logging Training DataFrame
training_response = arize_client.log(
    dataframe=train_df,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.TRAINING,
    schema=schema,
    sync=True
)


# If successful, the server will return a status_code of 200
if training_response.status_code != 200:
    print(f"❌ logging failed with response code {training_response.status_code}, {training_response.text}")
else:
    print("✅ You have successfully logged training set to Arize")


## Log Production Data

In [None]:
# send production data
production_response = arize_client.log(
    dataframe=prod_df,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.PRODUCTION,
    schema=schema,
    sync=True
)

if production_response.status_code != 200:
    print(f"❌ logging failed with response code {production_response.status_code}, {production_response.text}")
else:
    print("✅ You have successfully logged production set to Arize")
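
The two status checks above follow the same pattern, so they could be folded into one helper that stops the notebook on failure instead of only printing. This is a sketch under the assumption that the response exposes `status_code` and `text`, as used in the cells above. The helper `ensure_logged` is hypothetical, and the `SimpleNamespace` objects are stand-ins for real responses.

```python
from types import SimpleNamespace


def ensure_logged(response, dataset_name: str) -> str:
    """Return a success message, or raise if the status code is not 200."""
    if response.status_code != 200:
        raise RuntimeError(
            f"logging {dataset_name} failed with response code "
            f"{response.status_code}, {response.text}"
        )
    return f"successfully logged {dataset_name} to Arize"


# Stand-in objects for illustration only; real responses come from arize_client.log(...).
ok_response = SimpleNamespace(status_code=200, text="")
bad_response = SimpleNamespace(status_code=400, text="bad request")

ok_message = ensure_logged(ok_response, "training set")
try:
    ensure_logged(bad_response, "production set")
    failure_message = None
except RuntimeError as err:
    failure_message = str(err)
```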

# Step 2. Confirm Data in Arize ✅
Note that the Arize platform takes about 15 minutes to index embedding data. While the model should appear immediately, the data will not show up until the indexing is complete. Feel free to head over to the **Data Ingestion** tab for your model to watch Arize work its magic!🔮

You will be able to see the predictions, actuals, and feature importances that have been sent in the last 30 minutes, last day or last week.

An example view of the Data Ingestion tab from a model, when data is sent continuously over 30 minutes, is shown in the image below. 

<img src="https://storage.googleapis.com/arize-assets/fixtures/data-ingestion-tab.png" width="700">

# Check the Embedding Data in Arize

First, set the baseline to the training set that we logged before.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/embedding_setup_baseline.gif" width="700">


If your model contains embedding data, you will see it in your Model's Overview page. 

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/model-health-with-embeddings.png" width="700">




Click on the Embedding Name or the Euclidean Distance value to see how your embedding data is drifting over time.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/embedding-drift.png" width="700">

This chart shows the global Euclidean distance between your production set (at different points in time) and the baseline (which we set to be our training set). There is a period of 3 months during which the distance is suddenly much higher. This tells us that very different text data was sent to our model during that time, as expected: it is the period when hotel reviews were sent together with movie reviews.
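
For intuition about what a Euclidean distance between two sets of embeddings can look like, here is a minimal plain-Python sketch. This tutorial does not say how Arize computes the value shown in the chart, so the sketch assumes one simple option: the distance between the centroid (mean vector) of the baseline and the centroid of the production data. All function names and the toy 2-dimensional vectors are illustrative only.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Straight-line distance between two points of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid_drift(baseline: Sequence[Vector], production: Sequence[Vector]) -> float:
    """Assumed drift measure: distance between the two centroids."""
    return euclidean_distance(centroid(baseline), centroid(production))


# Toy data: the baseline centroid is (1, 0).
baseline = [[0.0, 0.0], [2.0, 0.0]]
similar_production = [[1.0, 0.0], [1.0, 0.0]]  # centroid (1, 0)
shifted_production = [[1.0, 0.0], [7.0, 8.0]]  # centroid (4, 4)

drift_similar = centroid_drift(baseline, similar_production)
drift_shifted = centroid_drift(baseline, shifted_production)
```

With this toy data, production vectors that resemble the baseline give a distance of 0, while a batch containing a very different vector moves the centroid and gives a distance of 5.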

# Wrap Up 🎁
Congratulations, you've now sent your first machine learning embedding data to the Arize platform!

Additional:
- If you want to remove this example model from your platform, just click **Models** -> **NLP-reviews-demo** -> **config** -> **delete**

### Overview
Arize is an end-to-end ML observability and model monitoring platform. The platform is designed to help ML engineers and data science practitioners surface and fix issues with ML models in production faster with:
- Automated ML monitoring and model monitoring
- Workflows to troubleshoot model performance
- Real-time visualizations for model performance monitoring, data quality monitoring, and drift monitoring
- Model prediction cohort analysis
- Pre-deployment model validation
- Integrated model explainability

### Website
Visit Us At: https://arize.com/model-monitoring/

### Additional Resources
- [What is ML observability?](https://arize.com/what-is-ml-observability/)
- [Playbook to model monitoring in production](https://arize.com/the-playbook-to-monitor-your-models-performance-in-production/)
- [Using statistical distance metrics for ML monitoring and observability](https://arize.com/using-statistical-distance-metrics-for-machine-learning-observability/)
- [ML infrastructure tools for data preparation](https://arize.com/ml-infrastructure-tools-for-data-preparation/)
- [ML infrastructure tools for model building](https://arize.com/ml-infrastructure-tools-for-model-building/)
- [ML infrastructure tools for production](https://arize.com/ml-infrastructure-tools-for-production-part-1/)
- [ML infrastructure tools for model deployment and model serving](https://arize.com/ml-infrastructure-tools-for-production-part-2-model-deployment-and-serving/)
- [ML infrastructure tools for ML monitoring and observability](https://arize.com/ml-infrastructure-tools-ml-observability/)

Visit the [Arize Blog](https://arize.com/blog) and [Resource Center](https://arize.com/resource-hub/) for more resources on ML observability and model monitoring.
