# <center>Phoenix in Flight</center>
## <center>Surfacing Feature Drift and Data Quality Issues for a Fraud-Detection Model</center>

Imagine you maintain a fraud-detection service for your e-commerce company. In the past few weeks, there's been an alarming spike in undetected cases of fraudulent credit card transactions. These false negatives are hurting your bottom line, and you've been tasked with solving the issue.

Phoenix provides opinionated workflows that surface feature drift and data quality issues quickly, so you can get straight to the root cause of the problem. As you'll see, your fraud-detection service is receiving more and more traffic from an untrustworthy merchant, and a missing feature in your pipeline is causing your model's false negative rate to skyrocket.

In this tutorial, you will:
* Download curated datasets of credit card transaction and fraud-detection data
* Investigate troublesome "slices" of your features to detect drift caused by a fraudulent merchant
* Uncover a data quality issue causing a spike in false negatives
* Generate a report to share these insights with your co-workers and other stakeholders at your company

Let's get started!

### 1. Install Dependencies and Import Libraries 📚

In [None]:
%pip install -q arize-phoenix

In [None]:
import pandas as pd
import phoenix as px

### 2. Download the Data 📊

Load your training and production data into two pandas dataframes and inspect a few rows of the training dataframe.

In [None]:
train_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/structured/credit-card-fraud/credit_card_fraud_train.parquet",
)
prod_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-assets/phoenix/datasets/structured/credit-card-fraud/credit_card_fraud_production.parquet",
)
train_df.head()

The columns of the dataframe are:
- **prediction_id:** the unique ID for each prediction
- **prediction_timestamp:** the timestamps of your predictions
- **predicted_label:** the label your model predicted
- **predicted_score:** the score of each prediction
- **actual_label:** the true, ground-truth label for each prediction (fraud vs. not_fraud)
- **age:** a tag used to filter your data in the Phoenix UI
- the rest of the columns are features
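
Since the scenario centers on false negatives, it can help to see how that rate falls out of the `predicted_label` and `actual_label` columns. The following is a minimal sketch on a toy dataframe, not the tutorial's data. It assumes `"fraud"` is the positive label, and `false_negative_rate` is a hypothetical helper rather than part of Phoenix.

```python
import pandas as pd


def false_negative_rate(df, positive_label="fraud"):
    """Share of truly fraudulent rows that the model did not flag as fraud."""
    actual_positive = df["actual_label"] == positive_label
    if not actual_positive.any():
        # No fraudulent rows at all, so the rate is undefined.
        return float("nan")
    missed = actual_positive & (df["predicted_label"] != positive_label)
    return missed.sum() / actual_positive.sum()


# Toy data: three fraudulent transactions, two of which the model missed.
toy_df = pd.DataFrame(
    {
        "predicted_label": ["fraud", "not_fraud", "not_fraud", "not_fraud"],
        "actual_label": ["fraud", "fraud", "fraud", "not_fraud"],
    }
)
fnr = false_negative_rate(toy_df)
```

Calling the same helper on `train_df` and `prod_df` would give a rough before-and-after comparison, provided both dataframes use these label values.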

### 3. Generate Embeddings using Arize AutoEmbeddings

We can generate an embedding vector for each row of our dataframe using `arize[AutoEmbeddings]`. Arize can generate embeddings seamlessly using large pre-trained models. In this example, we will use the pre-trained language model `distilbert-base-uncased`.

**NOTE: The use of GPUs is recommended for embedding generation. If you are running in Colab, we encourage upgrading to Colab Pro.** 

The large language models that Arize's embedding generators use have already been trained on so much data that their embeddings can capture relevant structure in your data without fine-tuning.

In [None]:
%pip install -q "arize[AutoEmbeddings]"
from arize.pandas.embeddings.tabular_generators import EmbeddingGeneratorForTabularFeatures

In [None]:
generator = EmbeddingGeneratorForTabularFeatures(
    model_name="distilbert-base-uncased",
    tokenizer_max_length=512,
)

selected_cols = [
    'fico_score', 'merchant_risk_score', 'loan_amount', 'term',
    'interest_rate', 'installment', 'grade', 'home_ownership',
    'annual_income', 'verification_status', 'num_credit_lines',
    'dti', 'delinq_2yrs', 'inq_last_6mths', 'mths_since_last_delinq',
    'open_acc', 'revol_bal', 'state', 'age'
]

train_df['tabular_vector'] = generator.generate_embeddings(
    train_df,
    selected_columns=selected_cols,
)
prod_df['tabular_vector'] = generator.generate_embeddings(
    prod_df,
    selected_columns=selected_cols,
)
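
Before handing an embedding column to Phoenix, one quick sanity check is to confirm that every row received a vector of the same length. The sketch below uses a toy dataframe and assumes the column holds one array-like value per row, which is worth verifying in your own environment. `embedding_dims` is a hypothetical helper, not part of Arize or Phoenix.

```python
import numpy as np
import pandas as pd


def embedding_dims(df, column="tabular_vector"):
    """Return the set of distinct vector lengths found in an embedding column."""
    return {len(np.asarray(vector)) for vector in df[column]}


# Toy stand-in for a dataframe whose rows each carry an embedding vector.
toy_df = pd.DataFrame({"tabular_vector": [np.zeros(4), np.ones(4), np.full(4, 0.5)]})
dims = embedding_dims(toy_df)

# A single distinct length means the vectors are consistent across rows.
is_consistent = len(dims) == 1
```

If the set contains more than one length, some rows were probably embedded differently or left empty, and they should be inspected before you launch the app.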

### 4. Launch Phoenix 🔥🐦

#### a) Define Your Schema

To launch Phoenix with your data, you first need to define a schema that tells Phoenix which columns of your dataframes correspond to features, predictions, actuals (i.e., ground truth), tags, etc.

In [None]:
embedding_features = {
    "tabular_embedding": px.EmbeddingColumnNames(
        vector_column_name="tabular_vector", 
    ),
}
    
schema = px.Schema(
    prediction_id_column_name="prediction_id",
    prediction_label_column_name="predicted_label",
    prediction_score_column_name="predicted_score",
    actual_label_column_name="actual_label",
    timestamp_column_name="prediction_timestamp",
    tag_column_names=["age"],
    embedding_feature_column_names=embedding_features,
)

You'll notice that the schema above doesn't explicitly specify features. That's because feature columns are automatically inferred if you don't pass `feature_column_names` to your `Schema` object.
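
To build intuition for what "inferred" means here, the sketch below treats every dataframe column that the schema does not name as a feature. This illustrates the idea only and is not Phoenix's actual implementation. `infer_feature_columns` is a hypothetical helper, and excluding the embedding vector column is an assumption.

```python
import pandas as pd


def infer_feature_columns(df, schema_columns):
    """Columns of df that are not claimed by any other schema field."""
    return [column for column in df.columns if column not in schema_columns]


# Column names already assigned a role in the schema above.
schema_columns = {
    "prediction_id",
    "predicted_label",
    "predicted_score",
    "actual_label",
    "prediction_timestamp",
    "age",  # tag column
    "tabular_vector",  # embedding vector column (assumed to be excluded)
}

# Toy dataframe with a few schema columns and two leftover columns.
toy_df = pd.DataFrame(
    {
        "prediction_id": ["a", "b"],
        "predicted_label": ["fraud", "not_fraud"],
        "age": [31, 47],
        "fico_score": [700, 650],
        "state": ["CA", "NY"],
    }
)
inferred = infer_feature_columns(toy_df, schema_columns)
```

Here `inferred` contains only `fico_score` and `state`, the two columns with no other role.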

#### b) Define Your Datasets

Next, define your primary and reference datasets. In this case, your reference dataset contains training data and your primary dataset contains production data.

In [None]:
primary_dataset = px.Dataset(dataframe=prod_df, schema=schema, name="primary")
reference_dataset = px.Dataset(dataframe=train_df, schema=schema, name="reference")
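
Because the reference dataset is the baseline against which the primary dataset is compared, a rough pandas-only comparison of missing-value rates can hint at data quality problems before you open the UI. The sketch below uses toy dataframes rather than the tutorial's data, and `missing_rate_change` is a hypothetical helper, not a Phoenix function.

```python
import pandas as pd


def missing_rate_change(reference, primary):
    """Per-column missing-value rates in both dataframes, largest increase first."""
    shared = [column for column in reference.columns if column in primary.columns]
    report = pd.DataFrame(
        {
            "reference": reference[shared].isna().mean(),
            "primary": primary[shared].isna().mean(),
        }
    )
    report["change"] = report["primary"] - report["reference"]
    return report.sort_values("change", ascending=False)


# Toy data: fico_score is complete in the reference but half missing in the primary.
toy_reference = pd.DataFrame(
    {"fico_score": [700, 650, 720, 680], "state": ["CA", "NY", "TX", "CA"]}
)
toy_primary = pd.DataFrame(
    {"fico_score": [None, None, 710, 690], "state": ["CA", "NY", "TX", "WA"]}
)
report = missing_rate_change(toy_reference, toy_primary)
```

Columns at the top of `report` are the ones whose missing-value rate grew the most between the two dataframes, which makes them natural candidates to inspect first.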

#### c) Create a Phoenix Session

In [None]:
session = px.launch_app(primary=primary_dataset, reference=reference_dataset)

#### d) Launch the Phoenix UI

You can open Phoenix by copying and pasting the output of `session.url` into a new browser tab.

In [None]:
session.url

Alternatively, you can open the Phoenix UI directly in your notebook:

In [None]:
session.view()

### 5. Explore Your Data 📈

Phoenix is under active development. At the moment, we display your model schema and a few data quality statistics. Check back soon for more updates.

### 6. Close the App 🧹

When you're done, don't forget to close the app.

In [None]:
px.close_app()