<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Detecting Fraud with Tabular Embeddings</h1>

Imagine you maintain a fraud-detection service for your e-commerce company. In the past few weeks, there's been an alarming spike in undetected cases of fraudulent credit card transactions. These false negatives are hurting your bottom line, and you've been tasked with solving the issue.

Phoenix provides opinionated workflows that surface feature drift and data quality issues quickly, so you can get straight to the root cause of the problem. As you'll see, your fraud-detection service is receiving more and more traffic from an untrustworthy merchant, causing your model's false negative rate to skyrocket.

In this tutorial, you will:
* Download curated datasets of credit card transaction and fraud-detection data
* Compute tabular embeddings to represent each transaction
* Pinpoint fraudulent transactions from a suspicious merchant
* Export data from this merchant to retrain your model

Let's get started!

## Install Dependencies and Import Libraries

Install Phoenix and the Arize SDK, which provides convenience methods for extracting embeddings for tabular data.

In [None]:
!pip install -q arize-phoenix "arize[AutoEmbeddings]"

Import dependencies.

In [None]:
from arize.pandas.embeddings.tabular_generators import EmbeddingGeneratorForTabularFeatures
import pandas as pd
import phoenix as px
from sklearn.metrics import recall_score

## Download the Data

Download the training and production data for your fraud-detection model.

In [None]:
train_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/structured/credit-card-fraud/credit_card_fraud_train.parquet",
)
prod_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/structured/credit-card-fraud/credit_card_fraud_production.parquet",
)
# Resetting the index is recommended when using EmbeddingGeneratorForTabularFeatures.
train_df = train_df.reset_index(drop=True)
prod_df = prod_df.reset_index(drop=True)
train_df.head()

The dataframe includes the following columns:
- **prediction_id:** the unique ID for each prediction
- **prediction_timestamp:** the timestamps of your predictions
- **predicted_label:** the label your model predicted
- **predicted_score:** the score of each prediction
- **actual_label:** the ground-truth label for each prediction (`fraud` or `not_fraud`, or `uncertain` when the ground truth is unknown)
- **tabular_vector:** pre-computed tabular embeddings for each row of data
- **age:** a tag used to filter your data in the Phoenix UI

The rest of the columns are features.

In [None]:
feature_column_names = [
    "fico_score",
    "loan_amount",
    "term",
    "interest_rate",
    "installment",
    "grade",
    "home_ownership",
    "annual_income",
    "verification_status",
    "pymnt_plan",
    "addr_state",
    "dti",
    "delinq_2yrs",
    "inq_last_6mths",
    "mths_since_last_delinq",
    "mths_since_last_record",
    "open_acc",
    "pub_rec",
    "revol_bal",
    "revol_util",
    "state",
    "merchant_ID",
    "merchant_risk_score",
]

## Compute Tabular Embeddings (Optional)

⚠️ This step requires a GPU.

Run the cell below to compute embeddings for your tabular data from scratch, or skip this step to use the pre-computed embeddings downloaded with the rest of your data.

`EmbeddingGeneratorForTabularFeatures` represents each row of your dataframe as a piece of text and computes an embedding for that text using a pre-trained language model (in this case, "distilbert-base-uncased"). For example, if a row of your dataframe represents a transaction in the state of California from a merchant named "Leannon Ward" with a FICO score of 616 and a merchant risk score of 23, `EmbeddingGeneratorForTabularFeatures` computes an embedding for the text: "The state is CA. The merchant ID is Leannon Ward. The fico score is 616. The merchant risk score is 23..."

In [None]:
generator = EmbeddingGeneratorForTabularFeatures(
    model_name="distilbert-base-uncased",
)
train_df["tabular_vector"] = generator.generate_embeddings(
    train_df,
    selected_columns=feature_column_names,
)
prod_df["tabular_vector"] = generator.generate_embeddings(
    prod_df,
    selected_columns=feature_column_names,
)
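
To make the row-to-text idea described above concrete, here is a minimal sketch in plain Python. It is not the Arize SDK's implementation: the helper `row_to_text`, the sentence template, and the underscore-to-space renaming are assumptions chosen to reproduce the example sentence from this tutorial. It stops before the language-model step, so it does not need a GPU.

```python
def row_to_text(row, columns):
    """Render one row as one short sentence per column (hypothetical helper)."""
    sentences = []
    for column in columns:
        # Assumption: column names become readable by swapping "_" for " ".
        readable_name = column.replace("_", " ")
        sentences.append(f"The {readable_name} is {row[column]}.")
    return " ".join(sentences)


# Illustrative row using the values quoted in the explanation above.
example_row = {
    "state": "CA",
    "merchant_ID": "Leannon Ward",
    "fico_score": 616,
    "merchant_risk_score": 23,
}
example_text = row_to_text(
    example_row,
    ["state", "merchant_ID", "fico_score", "merchant_risk_score"],
)
print(example_text)
```

A pre-trained language model would then embed text like this, so rows with similar feature values tend to land near each other in embedding space.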

## Launch Phoenix

Define a schema to tell Phoenix what the columns of your dataframe represent (features, predictions, actuals, tags, embeddings, etc.). See the [docs](https://docs.arize.com/phoenix/) for guides on how to define your own schema and API reference on `phoenix.Schema` and `phoenix.EmbeddingColumnNames`.

In [None]:
schema = px.Schema(
    prediction_id_column_name="prediction_id",
    prediction_label_column_name="predicted_label",
    prediction_score_column_name="predicted_score",
    actual_label_column_name="actual_label",
    timestamp_column_name="prediction_timestamp",
    feature_column_names=feature_column_names,
    tag_column_names=["age"],
    embedding_feature_column_names={
        "tabular_embedding": px.EmbeddingColumnNames(
            vector_column_name="tabular_vector",
        ),
    },
)

Create Phoenix datasets that wrap your dataframes with schemas that describe them.

In [None]:
prod_ds = px.Dataset(dataframe=prod_df, schema=schema, name="production")
train_ds = px.Dataset(dataframe=train_df, schema=schema, name="training")

Launch Phoenix. Follow the instructions in the cell output to open the Phoenix UI in the notebook or in a new browser tab.

In [None]:
session = px.launch_app(primary=prod_ds, reference=train_ds)

In [None]:
session.view()

## Find the Fraudulent Merchant and Export Data

Click on "tabular_embedding" in the "Embeddings" section.

![click on tabular embeddings](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/credit-card-fraud-detection-tutorial/click_on_tabular_embedding.png)

Select a period of high drift in the Euclidean distance graph at the top.

![select period of high drift](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/credit-card-fraud-detection-tutorial/select_period_of_high_drift.png)
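
As a rough intuition for what this graph measures, the sketch below computes the Euclidean distance between the centroid (mean vector) of a set of production embeddings and the centroid of the training embeddings. That centroid-based definition is an assumption made for illustration, and the helper names and toy two-dimensional vectors are hypothetical; real embeddings have hundreds of dimensions.

```python
import math


def centroid(vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    dimension = len(vectors[0])
    return [sum(vector[i] for vector in vectors) / len(vectors) for i in range(dimension)]


def euclidean_drift(primary_vectors, reference_vectors):
    """Distance between the centroids of two sets of embeddings."""
    return math.dist(centroid(primary_vectors), centroid(reference_vectors))


# Toy data: the reference centroid is (1, 0) and the primary centroid is (4, 4).
reference_vectors = [[0.0, 0.0], [2.0, 0.0]]
primary_vectors = [[4.0, 4.0], [4.0, 4.0]]
drift = euclidean_drift(primary_vectors, reference_vectors)
print(drift)
```

Under this definition, a time window whose production embeddings sit far from the training embeddings produces a peak in the graph, which is why periods of high drift are a good place to start looking.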

Hover over the top clusters in the left side panel. Phoenix flags these clusters as problematic because they consist entirely or almost entirely of production data, meaning that your model is making production inferences on data unlike anything it saw during training.

❗ Your point cloud won't look identical to the one in this picture if you computed your own embeddings.

![hover over top clusters](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/credit-card-fraud-detection-tutorial/hover_over_top_clusters.png)
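
The idea of a cluster that consists "almost entirely of production data" can be sketched as a simple ratio. The code below is an illustration under stated assumptions, not Phoenix's actual clustering or ranking logic: it assumes each point already carries a cluster label and a source (`"production"` or `"training"`), and the function name and toy data are hypothetical.

```python
from collections import Counter


def production_share_by_cluster(points):
    """Rank clusters by the fraction of their points that come from production.

    `points` is an iterable of (cluster_id, source) pairs.
    """
    totals = Counter()
    production_counts = Counter()
    for cluster_id, source in points:
        totals[cluster_id] += 1
        if source == "production":
            production_counts[cluster_id] += 1
    shares = {
        cluster_id: production_counts[cluster_id] / totals[cluster_id]
        for cluster_id in totals
    }
    # Highest production share first: these clusters are the least represented in training.
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


points = [
    ("a", "production"), ("a", "production"), ("a", "production"),
    ("b", "production"), ("b", "training"),
    ("c", "training"), ("c", "training"),
]
ranked_clusters = production_share_by_cluster(points)
print(ranked_clusters)
```

In this toy data, cluster `a` contains only production points, so it would be the first place to look for inputs the model never saw during training.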

In the display settings in the bottom left, select "dimension" in the "Color By" dropdown. Then select the "merchant_ID" feature in the "Dimension" dropdown. Notice that the troublesome clusters you found in the previous step are coming from the Scammeds merchant.

![color by merchant id](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/credit-card-fraud-detection-tutorial/color_by_merchant_id.png)

Next, color your data by correctness to confirm that these clusters are where your model is failing. Notice that many of the data points in the Scammeds clusters are incorrectly classified.

![color by correctness](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/credit-card-fraud-detection-tutorial/color_by_correctness.png)

Select the points in the Scammeds clusters using the lasso and export them for further analysis.

![select points with lasso and export](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/credit-card-fraud-detection-tutorial/select_points_with_lasso_and_export.png)

## View and Analyze Exported Data

View your most recently exported data as a dataframe.

In [None]:
export_df = session.exports[-1]
export_df.head()

Compute the false negative rate on your exported data.

In [None]:
# Remove rows with unknown ground truth.
export_df = export_df[export_df.actual_label != "uncertain"]
# Recall on the "fraud" class; the false negative rate is its complement.
recall = recall_score(
    y_true=export_df.actual_label == "fraud",
    y_pred=export_df.predicted_label == "fraud",
)
false_negative_rate = 1 - recall
false_negative_rate
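
The cell above relies on the identity that the false negative rate equals one minus recall. As a sanity check, the sketch below counts false negatives directly on a handful of made-up labels; the function name and toy data are hypothetical and stand in for the exported dataframe.

```python
def false_negative_rate_from_counts(actual_labels, predicted_labels, positive="fraud"):
    """Compute FN / (FN + TP) by counting over the actual positive cases."""
    true_positives = 0
    false_negatives = 0
    for actual, predicted in zip(actual_labels, predicted_labels):
        if actual != positive:
            continue  # only actual fraud cases can be true positives or false negatives
        if predicted == positive:
            true_positives += 1
        else:
            false_negatives += 1
    positives = true_positives + false_negatives
    if positives == 0:
        raise ValueError("no positive examples, so the rate is undefined")
    return false_negatives / positives


# Toy labels: four actual fraud cases, only one of which the model catches.
actual = ["fraud", "fraud", "fraud", "fraud", "not_fraud"]
predicted = ["fraud", "not_fraud", "not_fraud", "not_fraud", "not_fraud"]
toy_fnr = false_negative_rate_from_counts(actual, predicted)
print(toy_fnr)
```

Here recall is 1/4, so the false negative rate is 3/4, matching the `1 - recall` computation applied to the exported data above.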

The false negative rate on your exported data is unacceptably high, which confirms that this merchant is driving the spike. Congrats, you've identified the fraudulent merchant! As a next step, you can fine-tune your model on the misclassified examples and report the merchant to the proper authorities.