<center><img src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="200"/></center>

# <center>Getting Started with the Arize Platform</center>
## <center> Investigating Embedding Drift in Tabular Data: Fraud Detection</center>

**In this walkthrough, we are going to ingest embedding data and look at embedding drift.** 

You are responsible for a credit card fraud model at a large bank or payment processing company. You have been alerted to a spike in credit card chargebacks, leading you to suspect that fraudsters are getting away with committing fraud undetected! Realizing that this flaw in your model's performance has a heavy cost for your company and customers, you understand the need for a powerful toolset to troubleshoot (and prevent) costly model degradations. You turn to Arize to find out what changed in your credit card fraud detection model and how you can improve it.

In this walkthrough, we are going to investigate your production credit card fraud model. We will validate the degradation in model performance, troubleshoot its root cause, come up with actionable steps to improve the model, and set up proactive monitors to mitigate the impact of future degradations.

Our steps to resolving this issue will be:
1. Get our model onto the Arize platform to investigate
2. Set up a performance dashboard to look at prediction performance
3. Understand where the model is underperforming
4. Discover the root cause of why a slice (grouping) of predictions is underperforming
5. Set up proactive monitoring to mitigate the impact of such degradations in the future

The production data contains 1 month of data where 2 main issues exist. You will work on identifying these issues over the course of this exercise.

1. An upstream data source has introduced bad (null) values for ENTRY_MODE 
2. The model is missing fraud for certain merchant types, entry modes, and merchant IDs, especially in certain regions.

**Note**: This example compares training vs production data, but Arize also supports sending only one dataset.

Let's get started!

# Step 0. Install Dependencies, Import Libraries, Use GPU 📚

To use the automatic embedding generation functionality of the Arize SDK, we need to install it with the
extra `[AutoEmbeddings]`.

⚠️ Use a GPU to save time generating embeddings. Click on 'Runtime', select 'Change Runtime Type' and
select 'GPU'.

In [None]:
!pip install -q 'arize[AutoEmbeddings]'

import uuid
from datetime import datetime

import pandas as pd
from arize.pandas.logger import Client
from arize.utils.types import Environments, ModelTypes, EmbeddingColumnNames, Schema
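
Before generating embeddings, it can help to confirm that the runtime actually exposes a GPU. The sketch below is a minimal check, assuming PyTorch is the backend installed by the `[AutoEmbeddings]` extra. `describe_accelerator` is a hypothetical helper name and is not part of the Arize SDK.

```python
import importlib.util


def describe_accelerator() -> str:
    """Return 'cuda', 'cpu', or a note that PyTorch is not installed."""
    # Assumption: the embedding models run on PyTorch.
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Accelerator: {describe_accelerator()}")
```

If this prints `cpu` in Colab, the runtime type has probably not been switched to GPU yet.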

# Step 1. Download the data

We have curated a dataset for you so that you can send it to Arize in this tutorial.


In [None]:
url="https://storage.googleapis.com/arize-assets/fixtures/Embeddings/arize-demo-models-data/TABULAR/tabular_embeddings_fraud_demo"

train_df = pd.read_csv(url + "_training.csv")
prod_df = pd.read_csv(url + "_production.csv")

This is what the dataset looks like. 

In [None]:
prod_df.head()

We will need to do a bit more data preparation before sending the data to Arize.



# Step 2. Generate embedding vectors from tabular features using Arize

We can generate an embedding vector per row of our dataframe by first converting each row to a text prompt and then using that text to compute an embedding vector that carries multivariate information about the features. This allows us to track multivariate drift and identify patterns in the embedding space that are not evident in the table.
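
To make the table-to-text idea concrete, here is a minimal sketch of how one row could be turned into a prompt. The sentence template and the helper name `row_to_prompt` are assumptions for illustration; the exact wording that Arize's generator produces may differ.

```python
def row_to_prompt(row: dict, columns: list[str]) -> str:
    """Turn the selected columns of one row into a sentence-style prompt."""
    parts = []
    for col in columns:
        # Replace underscores so column names read like natural language.
        readable = col.replace("_", " ")
        parts.append(f"The {readable} is {row[col]}.")
    return " ".join(parts)


# Toy row with made-up values; prediction_id is deliberately left out of the prompt.
example_row = {"fico_score": 650, "state": "CA", "prediction_id": "abc-123"}
print(row_to_prompt(example_row, ["fico_score", "state"]))
```

A language model then embeds this text, so a change in any of the included features can move the resulting vector.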

Arize offers the ability to generate embeddings seamlessly using large pre-trained models. In this example, we will use the pre-trained language model `distilbert-base-uncased`.

**NOTE: We recommend utilizing GPUs to optimize embedding generation. In Google Colaboratory, navigate to the 'Runtime' menu and select 'Change runtime type'. If you are interested in accessing even more powerful GPUs, upgrade to Colab Pro for enhanced speed and performance.** 

The large language models that Arize's embedding generators use have already been trained on such a large amount of data that the embeddings can capture relevant structure in your data without being fine-tuned.

The first step is to import `EmbeddingGenerator` and `UseCases`.

In [None]:
from arize.pandas.embeddings import EmbeddingGenerator, UseCases

Next, we define our generator, choosing the model `distilbert-base-uncased`. Another important variable to set is the `tokenizer_max_length`. This is the maximum number of tokens that the tokenizer will produce. When the text is tokenized, the list of tokens is truncated if it surpasses the given `tokenizer_max_length`. If most of your dataframe rows contain large pieces of text, we recommend keeping the maximum number of tokens high. For this example, we will leave it as its default value, 512.

NOTE: The higher the maximum number of tokens, the longer it will take to generate the embeddings.
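
The effect of truncation can be illustrated with a toy example. This sketch assumes a whitespace split as a stand-in for the real subword tokenizer, and `truncate_tokens` is a hypothetical helper, so the token counts will not match what `distilbert-base-uncased` produces.

```python
def truncate_tokens(text: str, max_length: int) -> list[str]:
    """Split on whitespace and keep at most max_length tokens."""
    # Assumption: whitespace splitting stands in for a real subword tokenizer.
    tokens = text.split()
    return tokens[:max_length]


prompt = "The fico score is 650. The state is CA. The age is 41."
kept = truncate_tokens(prompt, max_length=9)
print(kept)
```

With a limit of 9 tokens, everything about `age` is dropped, which is why features placed at the end of a long prompt can silently disappear from the embedding.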

In addition to `tokenizer_max_length`, you can also set the `batch_size`. This allows you to process the data in smaller batches if you are running out of resources. The default `batch_size` is 100.
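
As a rough picture of what batching means here, the sketch below splits a list of prompts into chunks of at most `batch_size` items. How the Arize generator batches internally is an assumption, and `iter_batches` is a hypothetical helper.

```python
from typing import Iterator


def iter_batches(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


prompts = [f"prompt {i}" for i in range(250)]
batch_lengths = [len(batch) for batch in iter_batches(prompts, batch_size=100)]
print(batch_lengths)
```

Smaller batches lower peak memory use at the cost of more passes through the model.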

Arize then downloads the models and tokenizers from the 🤗 HuggingFace Hub.

In [None]:
generator = EmbeddingGenerator.from_use_case(
    use_case=UseCases.STRUCTURED.TABULAR_EMBEDDINGS,
    model_name="distilbert-base-uncased",
    tokenizer_max_length=512
)

We can explore details about our generators as follows:

In [None]:
generator

If we want to check details about our model or tokenizer, we can reference them as follows (note that the `model_max_length` matches our `tokenizer_max_length` parameter).

In [None]:
generator.tokenizer

To generate the embeddings, we must pass:
1. the dataframe,
2. a list of columns to be considered, and
3. (optional) a dictionary mapping column names to more descriptive names.

If the table is very wide, with many features, the text generated from it can be too long and the tokenizer may truncate it, which negatively affects the results. To avoid this, it is good practice to select the columns to be considered. For instance, it is a good idea to ignore columns whose values carry no meaning as text (hashed IDs, prediction IDs, prediction timestamps) and to prioritize the columns that are most important to your use case.
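
One way to start a column selection is to filter out names that look like identifiers or timestamps and then review the result by hand. The suffix convention (`_id`, `_ts`) and the helper `pick_prompt_columns` are assumptions for illustration, not an Arize feature.

```python
def pick_prompt_columns(
    columns: list[str],
    excluded_suffixes: tuple[str, ...] = ("_id", "_ts"),
) -> list[str]:
    """Keep columns whose lowercased name does not end in an excluded suffix."""
    return [c for c in columns if not c.lower().endswith(excluded_suffixes)]


all_columns = ["prediction_id", "prediction_ts", "fico_score", "merchant_ID", "state"]
print(pick_prompt_columns(all_columns))
```

A name-based filter is only a first pass; deciding which of the remaining features matter most still requires domain knowledge.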

In [None]:
selected_cols = [
    'fico_score', 'merchant_risk_score', 'loan_amount', 'term',
    'interest_rate', 'installment', 'grade', 'home_ownership',
    'annual_income', 'verification_status', 'num_credit_lines',
    'dti', 'delinq_2yrs', 'inq_last_6mths', 'mths_since_last_delinq', 
    'open_acc','revol_bal', 'state', 'age'
]

Finally, generating the embeddings can be done with a simple line of code:

`train_df['tabular_vector'] = generator.generate_embeddings(train_df, selected_columns=selected_cols)`

If you also want to get a column with the generated prompts, you can add the parameter `return_prompt_col=True` (the default is `False`).

In [None]:
train_df['tabular_vector'], train_prompts = generator.generate_embeddings(
    train_df,
    selected_columns=selected_cols,
    return_prompt_col=True
)
prod_df['tabular_vector'] = generator.generate_embeddings(
    prod_df,
    selected_columns=selected_cols,
)

Let's take a look at an example of a generated prompt:

In [None]:
train_prompts[0]

Let's now explore our dataframe and discover the new column `tabular_vector` with the generated embeddings!

In [None]:
train_df.head()
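
Before logging, a quick sanity check on the generated vectors can catch problems early. The sketch below uses toy vectors and a hypothetical helper, `check_vectors`, to verify that every vector has the same length and contains only finite numbers.

```python
import math


def check_vectors(vectors: list[list[float]]) -> int:
    """Return the shared vector length, or raise if the vectors are inconsistent."""
    if not vectors:
        raise ValueError("no vectors to check")
    dim = len(vectors[0])
    for i, vec in enumerate(vectors):
        if len(vec) != dim:
            raise ValueError(f"row {i} has length {len(vec)}, expected {dim}")
        if not all(math.isfinite(x) for x in vec):
            raise ValueError(f"row {i} contains NaN or infinite values")
    return dim


toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(check_vectors(toy_vectors))
```

On the real data this could be called as `check_vectors(train_df['tabular_vector'].tolist())`. Since the hidden size of `distilbert-base-uncased` is 768, that is the length one would expect to see, assuming the generator returns one hidden-state-sized vector per row.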

# Step 3. Prepare your data to be sent to Arize


## Update the timestamps

The data that you are working with was constructed in April of 2022. Hence, we will update the timestamps so they are current at the time that you're sending data to Arize.

In [None]:
last_ts = max(prod_df['prediction_ts'])
now_ts = datetime.now().timestamp()
delta_ts = now_ts - last_ts    

prod_df['prediction_ts'] = (prod_df['prediction_ts'] + delta_ts).astype(float)
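
The same shift can be checked on a tiny list of Unix timestamps. This is a standalone sketch with made-up values and a hypothetical helper, `shift_to_now`; it shows that the most recent record lands at the current time while the spacing between records is preserved.

```python
from datetime import datetime


def shift_to_now(timestamps: list[float], now_ts: float) -> list[float]:
    """Shift all timestamps by the same offset so the latest one equals now_ts."""
    delta = now_ts - max(timestamps)
    return [ts + delta for ts in timestamps]


# Three consecutive days in April 2022, expressed in seconds since the epoch.
old = [1_650_000_000.0, 1_650_086_400.0, 1_650_172_800.0]
now_ts = datetime.now().timestamp()
shifted = shift_to_now(old, now_ts)

print(shifted[-1] - shifted[0])  # still two days apart, in seconds
```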

## Add prediction ids

The Arize platform uses prediction IDs to link a prediction to an actual. Visit the [Arize documentation](https://docs.arize.com/arize/data-ingestion/model-schema/5.-prediction-id?q=prediction_id) for more details.

You can generate prediction IDs as follows:

In [None]:
def add_prediction_id(df):
    return [str(uuid.uuid4()) for _ in range(df.shape[0])]

In [None]:
train_df['prediction_id'] = add_prediction_id(train_df)
prod_df['prediction_id'] = add_prediction_id(prod_df)
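
To see why unique IDs matter, consider a toy join between predictions and actuals that arrive later. This is only an illustration with plain dictionaries and made-up labels, not a description of how Arize implements the join.

```python
import uuid

# Predictions are recorded first, each under a fresh unique ID.
predictions = {str(uuid.uuid4()): label for label in ["fraud", "not_fraud", "not_fraud"]}

# Actuals arrive later and reference the same IDs (here every actual is 'not_fraud').
actuals = {pid: "not_fraud" for pid in predictions}

# The shared ID is the only thing that ties a prediction to its actual.
joined = {pid: (pred, actuals[pid]) for pid, pred in predictions.items() if pid in actuals}

for pid, (pred, actual) in joined.items():
    print(pid[:8], pred, actual)
```

If two predictions shared an ID, there would be no way to tell which one a later actual belongs to.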

# Step 4. Sending Data into Arize 💫


## Import and Setup Arize Client

The first step is to set up the Arize client. After that, we will log the data.

Copy the Arize `API_KEY` and `SPACE_ID` from your Space Settings page (shown below) to the variables in the cell below. We will also be setting up some metadata to use across all logging.

<img src="https://storage.googleapis.com/arize-assets/fixtures/copy-id-and-key.png" width="700">

In [None]:
SPACE_ID = "SPACE_ID"
API_KEY = "API_KEY"
arize_client = Client(space_id=SPACE_ID, api_key=API_KEY)
model_id = "demo-auto-embeddings-tabular-fraud-detection"
model_version = "1.0"
model_type = ModelTypes.SCORE_CATEGORICAL
if SPACE_ID == "SPACE_ID" or API_KEY == "API_KEY":
    raise ValueError("❌ CHANGE SPACE_ID AND/OR API_KEY")
else:
    print("✅ Import and Setup Arize Client Done! Now we can start using Arize!")


Now that our Arize client is set up, let's go ahead and log all of our data to the platform. For more details on how **`arize.pandas.logger`** works, visit our documentation.

[![Buttons_OpenOrange.png](https://storage.googleapis.com/arize-assets/fixtures/Buttons_OpenOrange.png)](https://docs.arize.com/arize/sdks-and-integrations/python-sdk/arize.pandas)

## Define the Schema 

A Schema instance specifies the column names for corresponding data in the dataframe. While we could define different Schemas for training and production datasets, the dataframes have the same column names, so the Schema will be the same in this instance.

To ingest non-embedding features, it suffices to provide a list of column names that contain the features in our dataframe. Embedding features, however, are a little bit different.

Arize allows you to ingest not only the embedding vector, but also the raw data associated with that embedding, or a URL link to that raw data. Therefore, up to 3 columns can be associated with the same _embedding object_*. To do this, Arize's SDK provides the `EmbeddingColumnNames` class, used below.

*NOTE: This is how we refer to the 3 possible pieces of information that can be sent as embedding objects:
* Embedding `vector` (required)
* Embedding `data` (optional): raw text associated with the embedding vector
* Embedding `link_to_data` (optional): link to the data file (image, audio, ...) associated with the embedding vector

Learn more [here](https://docs.arize.com/arize/sending-data/model-schema-reference#8.-embedding-features-unstructured).

In [None]:
features = [
    'fico_score', 'merchant_risk_score', 'loan_amount', 'term', 
    'interest_rate', 'installment', 'grade', 'home_ownership',
    'annual_income', 'verification_status', 'pymnt_plan', 'merchant_ID', 
    'num_credit_lines', 'dti', 'delinq_2yrs', 'inq_last_6mths', 
    'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 
    'pub_rec', 'revol_bal', 'revol_util', 'state', 'age', 
    'drift_presence' # Artificially added this feature to locate drifted points in UMAP plot
]

embedding_features = {
    # Dictionary keys will be name of embedding feature in the app
    "tabular embedding": EmbeddingColumnNames(
        vector_column_name="tabular_vector",
    ),
}
    
# Define a Schema() object for Arize to pick up data from the correct columns for logging
schema = Schema(
    timestamp_column_name="prediction_ts",
    prediction_id_column_name="prediction_id",
    prediction_label_column_name="pred_label",
    actual_label_column_name="label",
    feature_column_names=features,
    embedding_feature_column_names=embedding_features
)
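
Because the Schema refers to columns by name, it can be worth checking that every referenced column exists before logging. The sketch below is a minimal, SDK-independent check on toy column lists; `missing_columns` is a hypothetical helper.

```python
def missing_columns(available: list[str], required: list[str]) -> list[str]:
    """Return the required column names that are not present in available."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


# Toy column lists; 'state' is required by the schema but absent from the data.
available = ["prediction_ts", "prediction_id", "pred_label", "label", "fico_score", "tabular_vector"]
required = ["prediction_ts", "prediction_id", "pred_label", "label", "fico_score", "state", "tabular_vector"]
print(missing_columns(available, required))
```

With the real data, `available` would be `list(train_df.columns)` and `required` would combine the `features` list with the timestamp, prediction ID, label, and embedding vector column names used in the Schema.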

## Log Training Data

**Note**: This example compares training vs production data, but Arize also supports sending only one dataset.

In [None]:
# Logging Training DataFrame
response = arize_client.log(
    dataframe=train_df,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.TRAINING,
    schema=schema,
)


# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"✅ You have successfully logged training set to Arize")


## Log Production Data

In [None]:
# Logging Production DataFrame
response = arize_client.log(
    dataframe=prod_df,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.PRODUCTION,
    schema=schema
)

if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"✅ You have successfully logged production set to Arize")
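
Since the same status check appears for both datasets, it could be factored into a small helper. The sketch below assumes only that the response object exposes `status_code` and `text`, as used above; `summarize_response` is a hypothetical name, and `SimpleNamespace` objects stand in for real responses.

```python
from types import SimpleNamespace


def summarize_response(response, dataset_name: str) -> str:
    """Build a status message from any object with status_code and text."""
    if response.status_code == 200:
        return f"✅ You have successfully logged {dataset_name} set to Arize"
    return (
        f"❌ logging {dataset_name} set failed with response code "
        f"{response.status_code}, {response.text}"
    )


# Stand-ins for real responses, used only to exercise both branches.
fake_ok = SimpleNamespace(status_code=200, text="")
fake_err = SimpleNamespace(status_code=400, text="bad request")
print(summarize_response(fake_ok, "training"))
print(summarize_response(fake_err, "production"))
```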

# Step 5. Confirm Data in Arize ✅
Note that the Arize platform takes about 15 minutes to index embedding data. While the model should appear immediately, the data will not show up until the indexing is complete. Feel free to head over to the **Data Ingestion** tab for your model to watch Arize work its magic!🔮

You will be able to see the predictions, actuals, and feature importances that have been sent in the last 30 minutes, last day or last week.

An example view of the Data Ingestion tab from a model, when data is sent continuously over 30 minutes, is shown in the image below.

<img src="https://storage.googleapis.com/arize-assets/fixtures/data-ingestion-tab.png" width="700">

## Check the Embedding Data in Arize

First, set the baseline to the training set that we logged before.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/embedding_setup_baseline.gif" width="700">


If your model contains embedding data, you will see it in your Model's Overview page. 

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP/NLP-reviews-demo-language-drift-overview.jpg" width="700">

Click on the Embedding Name or the Euclidean Distance value to see how your embedding data is drifting over time. The screenshots in this section come from an NLP reviews demo, but the workflow is the same for the tabular embedding logged above. The picture below shows the global Euclidean distance between the production set (at different points in time) and the baseline (which we set to be our training set). There is a period of about a week where the distance is suddenly much higher. In that demo, this is the period when reviews written in Spanish were sent alongside the expected English reviews, so the model received text that differed from what it was trained on.
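
As a rough intuition for the drift value, the sketch below computes the Euclidean distance between the centroid (mean vector) of a baseline set and the centroid of a production set. That the platform's metric is computed exactly this way is an assumption here; the vectors are toy 2D values and the helper names are hypothetical.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Return the element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def centroid_distance(baseline: list[list[float]], production: list[list[float]]) -> float:
    """Euclidean distance between the centroids of two sets of vectors."""
    return math.dist(centroid(baseline), centroid(production))


baseline = [[0.0, 0.0], [2.0, 0.0]]    # centroid (1, 0)
production = [[3.0, 4.0], [5.0, 4.0]]  # centroid (4, 4)
print(centroid_distance(baseline, production))  # 5.0
```

When production data resembles the baseline, the two centroids nearly coincide and the distance stays close to zero.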
 
<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP/NLP-reviews-demo-language-drift-emb-0.jpg" width="700">

In addition to the drift tracking plot above, below you can find the UMAP visualization of your data, according to the point in time selected. Notice that the production data and our baseline (training) data are superimposed, which is indicative that the model is seeing data in production similar to the data it was trained on.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP/NLP-reviews-demo-language-drift-emb-1.jpg" width="700">

Next, select a point in time when the drift was high and select a UMAP visualization in 2D. We can see that both training and production data are superimposed for the most part, but another cluster of production data has appeared. This indicates that the model is seeing data in production that is qualitatively different from the data it was trained on, and in this case it is causing performance degradation.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP/NLP-reviews-demo-language-drift-emb-2.jpg" width="700">

For further inspection, you may select a 3D UMAP view and click _Explore UMAP_ to expand the view. With this view we can interact in 3D with our dataset. We can zoom, rotate, and drag so we can see the areas of our dataset that are most interesting to us. Check out the workflow below:

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP/NLP-reviews-demo-language-drift-workflow.gif" width="700">

In the display above, Arize offers many coloring options:
1. By Dataset: You can see that the coloring has been made to distinguish production data vs baseline data (training in this example). This is specifically useful to detect drift. In this example, we can see that there is some production data far away from any training data, giving an indication of severe dataset drift. We can identify exactly which datapoints our baseline is missing so that we can retrain effectively.
2. By Prediction Label: This coloring option gives an insight into how our model is making decisions. Where are the different classes located in the space? Is the model predicting one class in regions where it should be predicting another?
3. By Actual Label: This coloring option is great if we want to identify labeling issues. For instance, if we can see points of other colors inside the orange cloud, it is a good idea to check whether the labels are wrong. We can then use the corrected labels for retraining.
4. By Correctness: This coloring option offers a quick way of identifying where the bulk of your model's mistakes are located, giving you an area to pay attention to. In this example, we can see that the Spanish reviews are almost all red.
5. By Confusion Matrix: This coloring option allows you to select a `positive class` and color the data-points as `True Positives`, `True Negatives`, `False Positives`, `False Negatives`.
6. By Feature: You can identify areas of the space where your model might be underperforming and, by coloring the points by feature, identify patterns at feature level. In other words, you can identify a slice of your data sharing a common feature (or features) that are causing a problem.
7. By Prediction Score: You can identify areas where your model is more confident of its predictions and areas where your model struggled more to make a decision.
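
The categories behind the "By Confusion Matrix" coloring can be sketched in a few lines. The function name and the label strings are hypothetical; the logic is simply the standard definition of the four outcomes relative to a chosen positive class.

```python
def confusion_category(predicted: str, actual: str, positive_class: str) -> str:
    """Classify one prediction relative to the chosen positive class."""
    predicted_positive = predicted == positive_class
    actual_positive = actual == positive_class
    if predicted_positive and actual_positive:
        return "True Positive"
    if predicted_positive:
        return "False Positive"
    if actual_positive:
        return "False Negative"
    return "True Negative"


# Made-up (predicted, actual) pairs with 'fraud' as the positive class.
pairs = [("fraud", "fraud"), ("fraud", "not_fraud"), ("not_fraud", "fraud"), ("not_fraud", "not_fraud")]
for predicted, actual in pairs:
    print(predicted, actual, confusion_category(predicted, actual, positive_class="fraud"))
```

For a fraud model, the False Negatives are the missed fraud cases, so that is the color to look for when hunting for the clusters driving chargebacks.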

More coloring options will be added to help you understand and debug your model and dataset.

# Wrap Up 🎁
Congratulations, you've now sent your first machine learning embedding data to the Arize platform!!

Additionally, if you want to remove this example model from your account, just click **Models** -> **demo-auto-embeddings-tabular-fraud-detection** -> **config** -> **delete**

### Overview
Arize is an end-to-end ML observability and model monitoring platform. The platform is designed to help ML engineers and data science practitioners surface and fix issues with ML models in production faster with:
- Automated ML monitoring and model monitoring
- Workflows to troubleshoot model performance
- Real-time visualizations for model performance monitoring, data quality monitoring, and drift monitoring
- Model prediction cohort analysis
- Pre-deployment model validation
- Integrated model explainability

### Website
Visit Us At: https://arize.com/model-monitoring/

### Additional Resources
- [What is ML observability?](https://arize.com/what-is-ml-observability/)
- [Monitor Unstructured Data with Arize](https://arize.com/blog/monitor-unstructured-data-with-arize)
- [Getting Started With Embeddings Is Easier Than You Think](https://arize.com/blog/getting-started-with-embeddings-is-easier-than-you-think)
- [Playbook to model monitoring in production](https://arize.com/the-playbook-to-monitor-your-models-performance-in-production/)
- [Using statistical distance metrics for ML monitoring and observability](https://arize.com/using-statistical-distance-metrics-for-machine-learning-observability/)
<!-- - [ML infrastructure tools for data preparation](https://arize.com/ml-infrastructure-tools-for-data-preparation/) -->
- [ML infrastructure tools for model building](https://arize.com/ml-infrastructure-tools-for-model-building/)
- [ML infrastructure tools for production](https://arize.com/ml-infrastructure-tools-for-production-part-1/)
<!-- - [ML infrastructure tools for model deployment and model serving](https://arize.com/ml-infrastructure-tools-for-production-part-2-model-deployment-and-serving/) -->

- [ML infrastructure tools for ML monitoring and observability](https://arize.com/ml-infrastructure-tools-ml-observability/)

Visit the [Arize Blog](https://arize.com/blog) and [Resource Center](https://arize.com/resource-hub/) for more resources on ML observability and model monitoring.
