<img src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="200"/>

# Getting Started with the Arize Platform - Ranking

**In this walkthrough, we are going to investigate the performance of a Hotel Booking Ranking model using nDCG.** 

Use case: You manage the Machine Learning models for a Hotel Booking company and spent a significant amount of time collecting online data and training your model. Now that your model is in production, you have little understanding of how it performs, but you notice a higher bounce rate than normal. Use Arize to identify your model issues and gain insights into how to improve your models.

In this walkthrough we will look at a few scenarios common to a ranking use case, focusing on NDCG at different @k values.

You will learn to:

1. Get training and production data into the Arize platform
2. Set up threshold alerts
3. Understand where the model is underperforming
4. Discover the root cause of issues

The sample data contains 1 month of information in which 2 main characteristics exist. You will work on identifying these characteristics in the course of this exercise. At a glance:

1. A new untrained domain is introduced
2. Difference in NDCG performance at different K values



---
NDCG (normalized discounted cumulative gain) measures a model's ability to rank query results in order of relevance, from highest to lowest. Actual relevance scores are usually determined by user interaction. The k value sets the cutoff: gains are summed only over the first k positions in a list.



<img src="https://1591756861-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAlgpMyBRcl2qFZRQ67%2Fuploads%2FBtoFDYp86ob7FpqFEOi4%2F1_sb2sXH1RHQFgZgl4l9pCSw.png?alt=media&token=5997e120-65d9-4c88-9364-57d5c0bc16d2"
width="200"/>
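
For reference, one common way to write the metric is shown below. This is a sketch of the standard textbook definition with a linear gain and a log2 discount; the walkthrough does not state which gain function Arize applies, so treat that choice as an assumption.

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```

Here $rel_i$ is the actual relevance of the item at rank $i$, and $\mathrm{IDCG@}k$ is the $\mathrm{DCG@}k$ of the same items sorted from most to least relevant, so a perfectly ordered list scores 1.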

# Step 0. Setup and Getting the Data

The first step is to load our preexisting dataset which includes training and production environments for our ranking model. Using a preexisting dataset illustrates how simple it is to get started with the Arize platform.

## Install Dependencies and Import Libraries 📚

In [None]:
!pip install -q arize


import datetime
import tempfile
import uuid
from datetime import timedelta

import numpy as np
import pandas as pd
import requests
from arize.pandas.logger import Client
from arize.utils.types import Environments, ModelTypes, Schema


## **🌐 Download the Data**
In this walkthrough, we’ll be sending real historical data. Note that while feature names and values are made explicit in this dataset, you can achieve the same level of ML Observability using obfuscated features.



In [None]:
production = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/fixtures/Ranking-Use-Case/ndcg_ranking_production_dataset.parquet",
)
train = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/fixtures/Ranking-Use-Case/ndcg_ranking_training_dataset.parquet",
)

print("✅ Dependencies installed and data successfully downloaded!")

✅ Dependencies installed and data successfully downloaded!


## Inspect the Data 

Take a quick look at the dataset. The data represents a model designed to predict the likelihood that a user clicks on a recommended hotel in an ordered list. The model's predictions are based on features such as destination, location, country, etc. This dataset contains one month of data.

In [None]:
production["actual_relevancy"]

0         [Not Relevant]
1         [Not Relevant]
2         [Not Relevant]
3         [Not Relevant]
4         [Not Relevant]
               ...      
161532    [Not Relevant]
161533    [Not Relevant]
161534    [Not Relevant]
161535        [Relevant]
161536    [Not Relevant]
Name: actual_relevancy, Length: 161537, dtype: object

In [None]:
feature_column_names = [
  'prop_log_historical_price',
  'price_usd',
  'promotion_flag',
  'search_destination_id',
  'search_length_of_stay',
  'search_booking_window',
  'search_adults_count',
  'search_children_count',
  'search_room_count',
  'search_saturday_night_bool',
  'destination'
]
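
Before logging, it can help to confirm that every name in `feature_column_names` is actually a column of the DataFrame. The helper below is a minimal sketch and not part of the Arize SDK; `missing_columns` is a hypothetical name, and the toy frame only stands in for `production`.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the names in `required` that are not columns of `df`, in order."""
    return [name for name in required if name not in df.columns]


# Toy frame standing in for `production`; the real dataset has more columns.
toy = pd.DataFrame(
    {
        "price_usd": [120.0, 95.5],
        "promotion_flag": [0, 1],
        "destination": ["VAIL", "ASPEN"],
    }
)

toy_missing = missing_columns(
    toy, ["price_usd", "promotion_flag", "search_room_count"]
)
print(toy_missing)  # ['search_room_count']
```

On the real data the call would be `missing_columns(production, feature_column_names)`, and an empty list means every feature column is present.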

# Step 1. Sending Data into Arize 💫

Now that we have our dataset imported, we are ready to integrate into Arize. We do this by logging (sending) important data we want to analyze to the platform. There, the data will be easily visualized and troubleshooting workflows will help us find the source of our problem.

For our model, we are going to log:
*   feature data
*   group id
*   rank
*   actuals

## Import and Set Up Arize Client

The first step is to set up our Arize client. After that we will log the data.

First, copy your Arize `API_KEY` and `SPACE_KEY` from the Space Settings page (shown below) and paste them into the setup cell. We will also set some metadata to use across all logging calls.




<img src="https://storage.googleapis.com/arize-assets/fixtures/copy-keys.png" width="700">

In [None]:
SPACE_KEY = "SPACE_KEY"
API_KEY = "API_KEY"

arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)

model_id = (
    "ranking-demo-model"  # This is the model name that will show up in Arize
)
model_version = "v1.0"  # Version of model - can be any string

if SPACE_KEY == "SPACE_KEY" or API_KEY == "API_KEY":
    raise ValueError("❌ NEED TO CHANGE SPACE AND/OR API_KEY")
else:
    print("✅ Arize setup complete!")

✅ Arize setup complete!
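
If you would rather not paste keys into the notebook, one option is to read them from environment variables. The sketch below is an assumption, not something the Arize SDK requires: the function name `read_arize_keys` and the variable names `ARIZE_SPACE_KEY` and `ARIZE_API_KEY` are hypothetical and chosen only for illustration.

```python
import os


def read_arize_keys(env=os.environ):
    """Return (space_key, api_key) from `env`, or raise if either is unset."""
    space_key = env.get("ARIZE_SPACE_KEY")  # hypothetical variable name
    api_key = env.get("ARIZE_API_KEY")  # hypothetical variable name
    if not space_key or not api_key:
        raise ValueError("Set ARIZE_SPACE_KEY and ARIZE_API_KEY first")
    return space_key, api_key


# An explicit mapping is passed here so the sketch runs without real credentials.
demo_keys = read_arize_keys(
    {"ARIZE_SPACE_KEY": "my-space", "ARIZE_API_KEY": "my-key"}
)
print(demo_keys)  # ('my-space', 'my-key')
```

In the notebook, `SPACE_KEY, API_KEY = read_arize_keys()` would then replace the two hard-coded assignments.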


## Log Training & Production Data to Arize 

Now that our Arize client is set up, let's go ahead and log all of our data to the platform. For more details on how **`arize.pandas.logger`** works, visit our documentation page below.

[![Buttons_OpenOrange.png](https://storage.googleapis.com/arize-assets/fixtures/Buttons_OpenOrange.png)](https://docs.arize.com/arize/sdks-and-integrations/python-sdk/arize.pandas)

*   **prediction_group_id_column_name**: The query id that identifies each ranking group or list in a ranking model (Required).
*   **rank_column_name**: The rank of each element within its group or list (Required).
*   **actual_label_column_name**: A list of strings that represent the engagement actions for each element (at least one of `actual_score_column_name` or `actual_label_column_name` is required).
*   **actual_score_column_name**: Relevance scores generated from the engagement actions for each element (at least one of `actual_score_column_name` or `actual_label_column_name` is required).
*   **prediction_score_column_name**: The prediction scores used to generate the rank (Optional).
*   **prediction_label_column_name**: Set to "relevant", since only relevant results are displayed (Optional).

Because our model orders results within each search rather than predicting a single class or value, we will use [ModelTypes.RANKING](https://docs.arize.com/arize/product-guides-1/models/model-types) to perform this analysis.
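
The relationship between label lists and relevance scores can be sketched as follows. This walkthrough only logs `actual_relevancy` labels, so the `actual_score` column, the `LABEL_GAIN` mapping, and the "take the highest gain" rule are all hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical label-to-gain mapping; adjust to your own engagement actions.
LABEL_GAIN = {"Relevant": 1.0, "Not Relevant": 0.0}


def labels_to_score(labels):
    """Score one row as the highest gain among its labels (unknown labels count as 0)."""
    return max((LABEL_GAIN.get(label, 0.0) for label in labels), default=0.0)


# Toy frame using the same list-of-strings shape as `actual_relevancy`.
toy = pd.DataFrame({"actual_relevancy": [["Not Relevant"], ["Relevant"], []]})
toy["actual_score"] = toy["actual_relevancy"].apply(labels_to_score)
print(toy["actual_score"].tolist())  # [0.0, 1.0, 0.0]
```

A column built this way could be passed as `actual_score_column_name` in the schema if numeric relevance is preferred over labels.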



## Log Training Data

In [None]:
# Define a Schema() object for Arize to pick up data from the correct columns for logging
training_schema = Schema(
    prediction_id_column_name="prediction_id",
    prediction_group_id_column_name="search_id",
    rank_column_name="rank",
    actual_label_column_name="actual_relevancy",
    feature_column_names=feature_column_names,
)

# Logging Training DataFrame
training_response = arize_client.log(
    dataframe=train,
    model_id=model_id,
    model_version=model_version,
    model_type=ModelTypes.RANKING,
    environment=Environments.TRAINING,
    schema=training_schema,
)

# If successful, the server will return a status_code of 200
if training_response.status_code != 200:
    print(
        f"logging failed with response code {training_response.status_code}, {training_response.text}"
    )
else:
    print(f"✅ You have successfully logged training set to Arize")

Success! Check out your data at https://app.arize.com/organizations/QWNjb3VudE9yZ2FuaXphdGlvbjo4Nw==/spaces/U3BhY2U6ODc=/models/modelName/ranking-demo-model-parquet2?selectedTab=dataIngestion


INFO:arize.pandas.logger:Success! Check out your data at https://app.arize.com/organizations/QWNjb3VudE9yZ2FuaXphdGlvbjo4Nw==/spaces/U3BhY2U6ODc=/models/modelName/ranking-demo-model-parquet2?selectedTab=dataIngestion


✅ You have successfully logged training set to Arize


## Log Production Data


In [None]:
# Shift dates for ease of visualization / to mimic a recent production dataset
END_DATE = datetime.date.today().strftime("%Y-%m-%d")
START_DATE = (datetime.date.today() - timedelta(31)).strftime("%Y-%m-%d")


def setPredictionIDandTime(df, start, end):
    out_df = pd.DataFrame()
    dts = pd.date_range(start, end).to_pydatetime().tolist()
    for dt in dts:
        day_df = df.loc[df["day"] == dt.day].copy()
        # dt.timestamp() is portable; strftime("%s") is a platform-specific extension
        day_df["prediction_ts"] = int(dt.timestamp())
        out_df = pd.concat([out_df, day_df], ignore_index=True)
    out_df["prediction_id"] = [str(uuid.uuid4()) for _ in range(out_df.shape[0])]
    return out_df.drop(columns="day")


production = setPredictionIDandTime(production, START_DATE, END_DATE)

production_schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_group_id_column_name="search_id",
    rank_column_name="rank",
    actual_label_column_name="actual_relevancy",
    feature_column_names=feature_column_names,
)

production_response = arize_client.log(
    dataframe=production,
    model_id=model_id,
    model_version=model_version,
    model_type=ModelTypes.RANKING,
    environment=Environments.PRODUCTION,
    schema=production_schema,
)

if production_response.status_code != 200:
    print(
        f"logging failed with response code {production_response.status_code}, {production_response.text}"
    )
else:
    print(f"✅ You have successfully logged production set to Arize")

Success! Check out your data at https://app.arize.com/organizations/QWNjb3VudE9yZ2FuaXphdGlvbjo4Nw==/spaces/U3BhY2U6ODc=/models/modelName/ranking-demo-model-parquet2?selectedTab=dataIngestion


INFO:arize.pandas.logger:Success! Check out your data at https://app.arize.com/organizations/QWNjb3VudE9yZ2FuaXphdGlvbjo4Nw==/spaces/U3BhY2U6ODc=/models/modelName/ranking-demo-model-parquet2?selectedTab=dataIngestion


✅ You have successfully logged production set to Arize


# Step 2. Confirm Data in Arize ✅

Note that Arize takes about 10 minutes to index the data. While the model should appear immediately, the data will not show up until the indexing is complete. Feel free to head over to the **Data Ingestion** tab for your model to watch Arize work its magic!🔮

You will be able to see the predictions, actuals, and feature importances that have been sent in the last 30 minutes, last day or last week.

An example view of the Data Ingestion tab from a model, when data is sent continuously over 30 minutes, is shown in the image below.

<img src="https://storage.googleapis.com/arize-assets/fixtures/data-ingestion-tab.png" width="700">


**⚠️ DON'T SKIP:**
In order to move on to the next step, make sure your actuals and training/production sets are loaded into the platform. To check:
1. Navigate to models from the left bar, locate and click on model **ranking-demo-model**
2. On the **Overview Tab**, make sure you can see Predictions and Actuals under the **Model Health** section. Once production actuals have been fully recorded on Arize, the row title will change from **0 Actuals** to **Actuals** with summary statistics such as cardinality listed in the tables.
3. Verify the list of **Categorical** and **Numeric Features** below **Actuals**.

![image](https://storage.googleapis.com/arize-assets/fixtures/Ranking-Use-Case/model_overview.png)

# Step 3. Set up Model Baseline

Now that our data has been logged into the [Arize platform](https://app.arize.com/), we can begin investigating our poorly performing hotel search ranking model.

First, set the baseline to the training set that we logged before.

![image](https://storage.googleapis.com/arize-assets/fixtures/Ranking-Use-Case/model_baseline.gif)

Under the **Config** tab we can also set **Performance Configs**. Select the following configurations:


1.   Default Metric: `NDCG`
2.   Default @K value: 10
3.   Positive Class: `Relevant`


![image](https://storage.googleapis.com/arize-assets/fixtures/Ranking-Use-Case/performance_config.gif)

# Step 4. Performance Analysis

Let's begin troubleshooting our model by navigating to the **Performance Tracing** page. Here, with our @k value set to 10, we notice that nDCG is extremely low for the first 10 recommendations in each list.

![image](https://storage.googleapis.com/arize-assets/fixtures/Ranking-Use-Case/perf_tracing.png)

To further troubleshoot our performance degradation, we'll increase our @k value from 10 to 20 to see if performance changes once lower-ranked recommendations are included. This helps us gauge where in a ranked list the model is failing.
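
To see why a larger @k can raise the score, here is a minimal sketch of NDCG@k on a toy list. It assumes binary relevance, linear gain, and a log2 discount; the helper names `dcg_at_k` and `ndcg_at_k` are illustrative and this is not necessarily how Arize computes the metric internally.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain, log2 discount)."""
    return sum(
        rel / math.log2(pos + 1)
        for pos, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Toy list of 20 results in ranked order: the only relevant hotel sits at rank 15.
relevances = [0] * 14 + [1] + [0] * 5

ndcg_10 = ndcg_at_k(relevances, 10)
ndcg_20 = ndcg_at_k(relevances, 20)
print(f"NDCG@10 = {ndcg_10:.3f}, NDCG@20 = {ndcg_20:.3f}")
# NDCG@10 = 0.000, NDCG@20 = 0.250
```

The relevant item falls outside the top 10, so it contributes nothing at k = 10 and only a discounted gain at k = 20. A score that improves as k grows therefore suggests relevant items are being ranked too low.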

When we change our k value to 20, we can see a significant improvement in our model performance. To investigate further, we'll add a **comparison model**, which will help us evaluate our performance breakdown chart.

To add a comparison model, navigate to the top of the 'Performance Tracing' page and click on add comparison. Since we want to compare our production data with our training data, toggle 'production' to **'training'** to populate an additional dataset.

![image](https://storage.googleapis.com/arize-assets/fixtures/Ranking-Use-Case/ranking_vid1.gif)

Now that we've populated our training dataset as 'Dataset B', we can compare how features impact performance between two environments. Scroll down to the **'Performance Breakdown'** card to compare features between our two datasets.

When comparing two histograms, look for:
- different colors (the redder the bar, the worse the performance)
- missing values

As we scroll through our model's feature comparisons, notice how `DESTINATION` has a gap between the two histograms. From here, we can click into the card for a detailed view of exactly what's missing. 

![image](https://storage.googleapis.com/arize-assets/fixtures/Ranking-Use-Case/ranking_recording2.gif)

Our performance breakdown comparison indicates our training data is missing `DESTINATION` data for `JACKSONHOLE`, `BREKENRIDGE`, `VAIL`, `ASPEN`, and `PARK CITY`. To confirm our data quality issues, click on the **'View Feature Details'** link on the top right, which takes us to a page where we can visualize the distribution comparison, cardinality, and % empty over time.

This ultimately indicates that our model was never trained on winter destinations and will likely need to be **retrained** on a new training dataset that includes them.
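
The same gap can be checked directly in pandas by comparing the category values seen in each environment. The sketch below uses toy frames in place of `train` and `production`; the helper name `unseen_categories` and the values `MIAMI` and `SAN DIEGO` are made up for illustration.

```python
import pandas as pd


def unseen_categories(train_df, prod_df, column):
    """Return sorted values of `column` that occur in prod_df but never in train_df."""
    train_values = set(train_df[column].dropna().unique())
    prod_values = set(prod_df[column].dropna().unique())
    return sorted(prod_values - train_values)


# Toy frames standing in for `train` and `production`.
toy_train = pd.DataFrame({"destination": ["MIAMI", "SAN DIEGO", "MIAMI"]})
toy_prod = pd.DataFrame({"destination": ["MIAMI", "VAIL", "ASPEN", "VAIL"]})

new_destinations = unseen_categories(toy_train, toy_prod, "destination")
print(new_destinations)  # ['ASPEN', 'VAIL']
```

Running `unseen_categories(train, production, "destination")` on the real data should list the destinations that appear only in production.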



# 📚 Conclusion 
In this walkthrough we've shown how to log prediction data, pinpoint model performance degradation, and use a rank-aware evaluation metric to improve our model.

We completed the following tasks: 
1. Uploaded data from a ranking model 
2. Set up a model baseline, evaluated different @k values, added a comparison dataset, and root-caused our model's issue
3. Compared production data against training data to identify problematic areas in the performance heatmap
4. Identified correlations between our model's degrading performance with data quality issues in training data


We identified the following areas of concern: 
1. nDCG at @k = 10 is worse than nDCG at higher @k values
2. Missing training data for `DESTINATION`
3. Poor data quality for `DESTINATION`, with `JACKSONHOLE`, `BREKENRIDGE`, `VAIL`, `ASPEN`, and `PARK CITY` missing compared to production data


# About Arize
Arize is an end-to-end ML observability and model monitoring platform. The platform is designed to help ML engineers and data science practitioners surface and fix issues with ML models in production faster with:
- Automated ML monitoring and model monitoring
- Workflows to troubleshoot model performance
- Real-time visualizations for model performance monitoring, data quality monitoring, and drift monitoring
- Model prediction cohort analysis
- Pre-deployment model validation
- Integrated model explainability

### Website
Visit Us At: https://arize.com/model-monitoring/

### Additional Resources
- [What is ML observability?](https://arize.com/what-is-ml-observability/)
- [Playbook to model monitoring in production](https://arize.com/the-playbook-to-monitor-your-models-performance-in-production/)
- [Using statistical distance metrics for ML monitoring and observability](https://arize.com/using-statistical-distance-metrics-for-machine-learning-observability/)
- [ML infrastructure tools for data preparation](https://arize.com/ml-infrastructure-tools-for-data-preparation/)
- [ML infrastructure tools for model building](https://arize.com/ml-infrastructure-tools-for-model-building/)
- [ML infrastructure tools for production](https://arize.com/ml-infrastructure-tools-for-production-part-1/)
- [ML infrastructure tools for model deployment and model serving](https://arize.com/ml-infrastructure-tools-for-production-part-2-model-deployment-and-serving/)
- [ML infrastructure tools for ML monitoring and observability](https://arize.com/ml-infrastructure-tools-ml-observability/)

Visit the [Arize Blog](https://arize.com/blog) and [Resource Center](https://arize.com/resource-hub/) for more resources on ML observability and model monitoring.
