<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Detecting Fraud with Tabular Embeddings</h1>

Imagine you maintain a fraud-detection service for your e-commerce company. In the past few weeks, there's been an alarming spike in undetected cases of fraudulent credit card transactions. These false negatives are hurting your bottom line, and you've been tasked with solving the issue.

Phoenix provides opinionated workflows to surface feature drift and data quality issues quickly so you can get straight to the root cause of the problem. As you'll see, your fraud-detection service is receiving more and more traffic from an untrustworthy merchant, causing your model's false negative rate to skyrocket.

In this tutorial, you will:
* Download curated datasets of credit card transaction and fraud-detection data
* Compute tabular embeddings to represent each transaction
* Pinpoint fraudulent transactions from a suspicious merchant
* Export data from this merchant to retrain your model

Let's get started!

## 1. Install Dependencies and Import Libraries

In [None]:
!pip install -q arize-phoenix "arize[AutoEmbeddings]"

In [None]:
from arize.pandas.embeddings.tabular_generators import EmbeddingGeneratorForTabularFeatures
import pandas as pd
import phoenix as px
import torch

## 2. Download the Data

Load your training and production data into two pandas DataFrames and inspect a few rows of the training DataFrame.

In [None]:
train_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/structured/credit-card-fraud/credit_card_fraud_train.parquet",
)
prod_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/structured/credit-card-fraud/credit_card_fraud_production.parquet",
)
train_df.head()

The columns of the DataFrame are:
- **prediction_id:** the unique ID for each prediction
- **prediction_timestamp:** the timestamps of your predictions
- **predicted_label:** the label your model predicted
- **predicted_score:** the score of each prediction
- **actual_label:** the true, ground-truth label for each prediction (fraud vs. not_fraud)
- **tabular_vector:** pre-computed tabular embeddings for each row of data
- **age:** a tag used to filter your data in the Phoenix UI
- the rest of the columns are features

## 3. Compute Embeddings

Run the cell below whether or not you have a GPU. It defines `feature_column_names`, which the schema in step 4 relies on. If a CUDA-enabled GPU is available, the cell also computes embeddings for your tabular data from scratch; otherwise, it keeps the pre-computed embeddings downloaded with the rest of your data in step 2.

`EmbeddingGeneratorForTabularFeatures` represents each row of your DataFrame as a piece of text and computes an embedding for that text using a pre-trained large language model (in this case, "distilbert-base-uncased"). For example, if a row of your DataFrame represents a transaction in the state of California from a merchant named "Leannon Ward" with a FICO score of 616 and a merchant risk score of 23, `EmbeddingGeneratorForTabularFeatures` computes an embedding for the text: "The state is CA. The merchant ID is Leannon Ward. The fico score is 616. The merchant risk score is 23..."
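To make the row-to-text idea concrete, here is a minimal sketch that reproduces the example sentence above. The helper `row_to_text` is hypothetical and only illustrates the idea; the exact template that `EmbeddingGeneratorForTabularFeatures` uses internally may differ.

```python
def row_to_text(row: dict) -> str:
    """Render one tabular row as a string with one sentence per column.

    Hypothetical helper for illustration only; it is not part of the
    arize package and its template is an assumption.
    """
    sentences = []
    for column, value in row.items():
        # Turn a column name such as "fico_score" into "fico score".
        readable_name = column.replace("_", " ")
        sentences.append(f"The {readable_name} is {value}.")
    return " ".join(sentences)


# A toy row that mirrors the example in the paragraph above.
example_row = {
    "state": "CA",
    "merchant_ID": "Leannon Ward",
    "fico_score": 616,
    "merchant_risk_score": 23,
}
text = row_to_text(example_row)
print(text)
```

The resulting string is what a language model would then embed, so rows with similar feature values end up with similar text and, ideally, nearby embeddings.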

In [None]:
feature_column_names = [
    "fico_score",
    "loan_amount",
    "term",
    "interest_rate",
    "installment",
    "grade",
    "home_ownership",
    "annual_income",
    "verification_status",
    "pymnt_plan",
    "addr_state",
    "dti",
    "delinq_2yrs",
    "inq_last_6mths",
    "mths_since_last_delinq",
    "mths_since_last_record",
    "open_acc",
    "pub_rec",
    "revol_bal",
    "revol_util",
    "state",
    "merchant_ID",
    "merchant_risk_score",
]

if torch.cuda.is_available():
    generator = EmbeddingGeneratorForTabularFeatures(
        model_name="distilbert-base-uncased",
    )
    train_df["tabular_vector"] = generator.generate_embeddings(
        train_df,
        selected_columns=feature_column_names,
    )
    prod_df["tabular_vector"] = generator.generate_embeddings(
        prod_df,
        selected_columns=feature_column_names,
    )
else:
    print("CUDA is not available. Using pre-computed embeddings.")

## 4. Launch Phoenix

### a) Define Your Schema

To launch Phoenix with your data, you first need to define a schema that tells Phoenix which columns of your DataFrames correspond to features, predictions, actuals (i.e., ground truth), tags, etc.

In [None]:
schema = px.Schema(
    prediction_id_column_name="prediction_id",
    prediction_label_column_name="predicted_label",
    prediction_score_column_name="predicted_score",
    actual_label_column_name="actual_label",
    timestamp_column_name="prediction_timestamp",
    feature_column_names=feature_column_names,
    tag_column_names=["age"],
    embedding_feature_column_names={
        "tabular_embedding": px.EmbeddingColumnNames(
            vector_column_name="tabular_vector",
        ),
    },
)

The schema above passes `feature_column_names` explicitly. This argument is optional: if you don't pass `feature_column_names` to your `Schema` object, feature columns are automatically inferred.
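As a rough mental model of that inference, you can think of it as "every column the schema does not otherwise claim". The sketch below illustrates this with plain Python lists. The helper `infer_feature_columns` is hypothetical, and the rule it encodes is an assumption about the behavior, not Phoenix's actual implementation.

```python
def infer_feature_columns(all_columns, claimed_columns):
    """Return the columns not already assigned a role, preserving order.

    Hypothetical helper: it illustrates the idea of feature inference
    and is not part of the Phoenix API.
    """
    claimed = set(claimed_columns)
    return [column for column in all_columns if column not in claimed]


# A shortened column list, used here only for illustration.
all_columns = [
    "prediction_id",
    "prediction_timestamp",
    "predicted_label",
    "predicted_score",
    "actual_label",
    "tabular_vector",
    "age",
    "fico_score",
    "merchant_ID",
    "merchant_risk_score",
]
# Columns the schema assigns to IDs, timestamps, predictions, actuals,
# embeddings, and tags.
claimed_columns = all_columns[:7]

inferred = infer_feature_columns(all_columns, claimed_columns)
print(inferred)
```

Listing features explicitly, as this tutorial does, keeps the schema predictable when extra columns are added to the DataFrame later.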

### b) Define Your Datasets

Next, define your primary and reference datasets. In this case, your reference dataset contains training data and your primary dataset contains production data.

In [None]:
prod_ds = px.Dataset(dataframe=prod_df, schema=schema, name="production")
train_ds = px.Dataset(dataframe=train_df, schema=schema, name="training")

### c) Create a Phoenix Session

In [None]:
session = px.launch_app(primary=prod_ds, reference=train_ds)

### d) Launch the Phoenix UI

You can open Phoenix by copying and pasting the output of `session.url` into a new browser tab.

In [None]:
session.url

Alternatively, you can open the Phoenix UI in your notebook with:

In [None]:
session.view()

## 5. Find and Export Fraudulent Transactions

### Steps

1. Click on "tabular_embedding" in the "Embeddings" section.
1. In the Euclidean distance graph at the top of the page, select a point in time where the distance is high.
1. In the display settings in the bottom left, select "dimension" in the "Color By" dropdown. Then select the "merchant_ID" feature in the "Dimension" dropdown.
1. Click on the top cluster in the panel on the left.
1. Click on the "Export" button to save your cluster.

### Questions

1. What does the Euclidean distance graph measure?
1. What do the points in the point cloud represent?
1. What do you notice about the cluster you selected?
1. What is the cause of your model's high false negative rate in production?

### Answers

1. This graph measures the drift of your production data relative to your training data over time.
1. Each point in the point cloud represents an individual credit card transaction.
1. It consists mostly of production data from the Scammeds merchant.
1. Your model was trained on relatively little data from the Scammeds merchant, but is seeing a high volume of transactions from this merchant in production.
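
To build intuition for the first answer, here is a minimal sketch of one common way to measure embedding drift: the Euclidean distance between the centroid (mean vector) of the production embeddings and the centroid of the training embeddings. It assumes a centroid-based definition and uses made-up two-dimensional vectors; the function names are hypothetical and this is not Phoenix's code.

```python
import math


def centroid(vectors):
    """Return the component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def euclidean_drift(primary_vectors, reference_vectors):
    """Euclidean distance between the centroids of two sets of embeddings."""
    return math.dist(centroid(primary_vectors), centroid(reference_vectors))


# Toy embeddings, invented for illustration only.
train_embeddings = [[0.0, 0.0], [2.0, 0.0]]  # centroid is [1.0, 0.0]
prod_embeddings = [[4.0, 4.0], [4.0, 4.0]]   # centroid is [4.0, 4.0]

drift = euclidean_drift(prod_embeddings, train_embeddings)
print(drift)
```

Computing this distance for successive time windows of production data yields a curve over time, and a spike suggests that production transactions have moved away from what the model saw during training.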

## 6. Load and View Exported Data

View your most recently exported data as a DataFrame.

In [None]:
export_df = session.exports[-1]
export_df.head()

Congrats! You've successfully pinpointed a cluster of fraudulent transactions. You can now fine-tune your model on the exported data in order to detect similar cases of fraud in the future.
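
Before retraining, it can help to confirm that the exported cluster really is dominated by missed fraud. The sketch below computes a false negative rate from a list of row dictionaries using the `actual_label` and `predicted_label` columns described in step 2. The helper `false_negative_rate` is hypothetical and the rows are toy data invented for illustration; with a real export, `export_df.to_dict("records")` would produce rows in this shape.

```python
def false_negative_rate(records, positive="fraud"):
    """Fraction of actual-positive rows that the model failed to flag.

    Returns None when there are no actual positives to evaluate.
    """
    actual_positives = [r for r in records if r["actual_label"] == positive]
    if not actual_positives:
        return None
    missed = sum(1 for r in actual_positives if r["predicted_label"] != positive)
    return missed / len(actual_positives)


# Toy rows, invented for illustration only.
records = [
    {"merchant_ID": "Scammeds", "actual_label": "fraud", "predicted_label": "not_fraud"},
    {"merchant_ID": "Scammeds", "actual_label": "fraud", "predicted_label": "not_fraud"},
    {"merchant_ID": "Scammeds", "actual_label": "fraud", "predicted_label": "fraud"},
    {"merchant_ID": "Leannon Ward", "actual_label": "not_fraud", "predicted_label": "not_fraud"},
]

fnr = false_negative_rate(records)
print(fnr)
```

A high value on the exported cluster, compared with the rest of production, supports the conclusion that these transactions are the ones your model is missing.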

## 7. Close the App

When you're done, don't forget to close the app.

In [None]:
px.close_app()