# **FICO Analytic Challenge © Fair Isaac 2024**

# Week 10: Performance Metrics on Blind Holdout Set

## Model Performance Metrics Importance

In the past weeks we introduced Logistic Regression, Neural Networks, inference and explainability. This week we'll focus on model performance.

In the credit fraud detection world, our primary goal is to accurately identify fraudulent activities while minimizing the impact on genuine transactions. To achieve this, we rely on machine learning models to sift through the vast amounts of transaction data and flag suspicious activities. But how do we know if our models are effective? This is where model performance metrics come into play, serving as vital tools to evaluate and compare the effectiveness of our models. This week, we'll dive into why these metrics are crucial and why, in the fraud detection industry, we often need to develop specific metrics to ensure we deliver the best models that provide the highest value to our clients.

### Why Model Performance Metrics Matter

- **True Positive Rate (TPR)** measures how well the model is correctly identifying actual cases of fraud.
- **False Positive Rate (FPR)** shows how often the model incorrectly labels normal transactions as fraud.

These metrics help us understand the trade-offs between catching fraudsters and avoiding unnecessary disruptions for genuine customers.
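
To make the two definitions concrete, here is a minimal sketch that computes TPR and FPR from a handful of made-up labels and flags. The function name `tpr_fpr` and the toy data are illustrative assumptions, not part of `fico_functions`.

```python
def tpr_fpr(labels, flagged):
    """Return (TPR, FPR) for binary labels (1 = fraud) and binary flags (1 = flagged)."""
    pairs = list(zip(labels, flagged))
    tp = sum(1 for y, f in pairs if y == 1 and f == 1)  # fraud, flagged
    fn = sum(1 for y, f in pairs if y == 1 and f == 0)  # fraud, missed
    fp = sum(1 for y, f in pairs if y == 0 and f == 1)  # genuine, flagged
    tn = sum(1 for y, f in pairs if y == 0 and f == 0)  # genuine, left alone
    return tp / (tp + fn), fp / (fp + tn)

# Toy data (assumed): 4 fraud and 6 genuine transactions
labels  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
flagged = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tpr, fpr = tpr_fpr(labels, flagged)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Here 3 of the 4 frauds are caught (TPR = 0.75) while 1 of the 6 genuine transactions is disturbed (FPR of about 0.17).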

### Custom Metrics: Necessity for Accurate Evaluation

Credit fraud detection models are often built using neural networks, whose predictions are presented as scores typically ranging from 1 to 1000. These scores help us rank transactions by their likelihood of being fraud. However, comparing scores from different models can be tricky. Each model might assign scores differently, making it hard to directly compare one model’s scoring to another’s.

To address this, we need to create metrics that can compare models more holistically. One approach is to use cumulative gain or lift charts, which show what percentage of fraud is caught as we move through the score range. By comparing the areas under these curves, we can get a better sense of overall performance.
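
As a rough illustration of the cumulative gain idea, the sketch below ranks transactions by score and tracks the fraction of all fraud captured as we move down the ranking. Because only the ranking matters, two models with very different score scales can be compared directly. The function name `cumulative_gain` and the toy scores are assumptions made for this example.

```python
import numpy as np

def cumulative_gain(scores, is_fraud):
    """Fraction of all fraud captured after reviewing the top-k scored transactions, k = 1..n."""
    scores = np.asarray(scores, dtype=float)
    is_fraud = np.asarray(is_fraud, dtype=int)
    order = np.argsort(-scores, kind="stable")  # highest score first
    captured = np.cumsum(is_fraud[order])       # running count of frauds seen so far
    return captured / is_fraud.sum()

# Two hypothetical models scoring the same six transactions on different scales
is_fraud = [1, 1, 0, 1, 0, 0]
scores_a = [950, 900, 700, 400, 300, 100]
scores_b = [99, 20, 80, 60, 10, 5]

gain_a = cumulative_gain(scores_a, is_fraud)
gain_b = cumulative_gain(scores_b, is_fraud)

# The mean of each gain curve is a simple stand-in for the area under it
print("Model A:", gain_a.round(2), "mean gain:", round(gain_a.mean(), 3))
print("Model B:", gain_b.round(2), "mean gain:", round(gain_b.mean(), 3))
```

In this toy case model A reaches two thirds of the fraud after two reviews while model B needs three, so A has the larger area.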

Additionally, creating metrics that account for the specific cost-benefit scenarios of our use case can ensure that we are making informed decisions. By developing custom metrics and carefully evaluating score distributions, we can make sure our models are not just effective but also finely tuned for catching real-world credit fraud.

In the fraud detection field, there are industry standard performance metrics that give meaningful insight. We'll focus on a few key performance metrics that are especially useful for understanding these models.

## False-Positive and Detection Rates
- Assessing the performance of a model is a matter of performing a cost-benefit analysis.
- The cost involves the number of false positives (FP), which are normal transactions mistakenly tagged as fraud. The benefit is correct fraud predictions and the reduction in fraud losses achieved by acting upon those predictions.
- Ideally, we want our model to increase the number of correct fraud predictions without raising, or even reducing, the number of false positives.
- If a model scores at least one transaction on a fraud account above a suspect threshold score, that fraudulent account is considered to be detected.

### Transaction Based Metrics
- **Percent Non-Fraud (%NF):** False positives are measured using a Percent Non-Fraud metric.
    - This percentage is the number of transactions from non-fraud accounts that scored above our suspect threshold, divided by the total number of transactions from these non-fraud accounts.
    - As the threshold score increases, the number of false positives decreases, but so does the number of actual frauds detected.
      - For instance, if a bank raises the threshold to reduce false alarms, they might also miss catching some genuine fraud cases.
- **Transaction Value Detection Rate (TVDR):** This percentage shows us how much money involved in fraud transactions is caught by our model. It looks at the transactions that score above a certain threshold and calculates what percentage of the total fraud amount these represent.
  - For example, if 100k of fraudulent transactions occur and the model identifies 80k of it, the TVDR would be 80%.
- **Transaction Detection Rate (TDR) or Percent Fraud (%F):** The percentage of fraud transactions with scores above a score threshold.
    - If there are 100 fraud transactions, and the model correctly identifies 72 of them, the TDR is 72 percent.
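
The three transaction-based metrics can be sketched for a single threshold as follows. This is not the `calcMetrics` implementation from `fico_functions`; it is a simplified illustration that reuses the notebook's column names (`score`, `mdlIsFraudTrx`, `mdlIsFraudAcct`, `transactionAmount`) on made-up data, and it assumes a transaction is flagged when its score is at or above the threshold.

```python
import pandas as pd

def transaction_metrics(df, threshold):
    """Return (%NF, TDR, TVDR) in percent at one threshold (simplified sketch)."""
    flagged = df["score"] >= threshold           # assumption: "at or above" counts as flagged
    fraud_trx = df["mdlIsFraudTrx"] == 1
    nonfraud_acct = df["mdlIsFraudAcct"] == 0

    pnf = 100 * (flagged & nonfraud_acct).sum() / nonfraud_acct.sum()
    tdr = 100 * (flagged & fraud_trx).sum() / fraud_trx.sum()
    caught_value = df.loc[flagged & fraud_trx, "transactionAmount"].sum()
    total_value = df.loc[fraud_trx, "transactionAmount"].sum()
    tvdr = 100 * caught_value / total_value
    return pnf, tdr, tvdr

# Toy data (assumed): accounts A and D are fraud, B and C are not
toy = pd.DataFrame({
    "pan":               ["A", "A", "B", "B", "C", "C", "D", "D"],
    "mdlIsFraudAcct":    [1, 1, 0, 0, 0, 0, 1, 1],
    "mdlIsFraudTrx":     [1, 1, 0, 0, 0, 0, 1, 1],
    "transactionAmount": [100, 300, 50, 20, 10, 40, 400, 200],
    "score":             [900, 400, 800, 100, 200, 50, 750, 700],
})

pnf, tdr, tvdr = transaction_metrics(toy, threshold=700)
print(f"%NF = {pnf:.1f}, TDR = {tdr:.1f}, TVDR = {tvdr:.1f}")
```

At a threshold of 700, one of the four non-fraud transactions is flagged (%NF = 25), three of the four fraud transactions are caught (TDR = 75), and those three carry 700 of the 1000 in fraud value (TVDR = 70).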

### Account Based Metrics
- **Account Percent Non-Fraud (A%NF):** Similar to %NF but at the account level, this measures the number of non-fraud accounts that score above the threshold, divided by the total number of non-fraud accounts. This helps in understanding how often legitimate accounts are mistakenly flagged as suspicious.
- **Account Detection Rate (ADR):** The percentage of correctly identified fraud accounts. This is calculated as the number of fraud accounts detected at or above some score threshold, divided by the total number of actual fraud accounts.
    - For example, if there are 100 fraud accounts and the model identifies 80, the ADR is 80%.
- **Value Detection Rate (VDR):** The total value of the frauds detected at some score threshold, divided by the total value of all frauds. Essentially, this tells us what fraction of the total fraudulent dollars was successfully identified by the model.
  - For example, if the total amount of fraud is 100k and the model catches frauds totaling 80k, the VDR would be 80%.
  - There are two types: OLVDR (on-line) and RTVDR (real-time).
- **Real-Time Value Detection Rate (RTVDR):** This measures the amount of money saved from correct fraud predictions, expressed as a percentage of the total amount fraudulently charged against accounts. Only amounts associated with approved fraud transactions are counted.
    - Lowering the threshold score generally improves RTVDR but may increase false positives, as more transactions are flagged as potential fraud.
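
For the account-level view, a minimal sketch of A%NF and ADR is shown below. It assumes an account counts as flagged when at least one of its transactions scores at or above the threshold, which follows the detection rule stated earlier; the helper name `account_metrics` and the toy data are hypothetical. RTVDR is left out because it depends on which fraud transactions were approved, and this toy data does not carry that information.

```python
import pandas as pd

def account_metrics(df, threshold):
    """Return (A%NF, ADR) in percent at one threshold (simplified sketch)."""
    # Collapse transactions to one row per account: highest score and fraud tag
    acct = df.groupby("pan").agg(
        max_score=("score", "max"),
        is_fraud=("mdlIsFraudAcct", "max"),
    )
    flagged = acct["max_score"] >= threshold
    fraud = acct["is_fraud"] == 1

    apnf = 100 * (flagged & ~fraud).sum() / (~fraud).sum()
    adr = 100 * (flagged & fraud).sum() / fraud.sum()
    return apnf, adr

# Toy data (assumed): accounts A and D are fraud, B and C are not
toy_accounts = pd.DataFrame({
    "pan":            ["A", "A", "B", "B", "C", "C", "D", "D"],
    "mdlIsFraudAcct": [1, 1, 0, 0, 0, 0, 1, 1],
    "score":          [900, 400, 800, 100, 200, 50, 750, 700],
})

apnf, adr = account_metrics(toy_accounts, threshold=780)
print(f"A%NF = {apnf:.1f}, ADR = {adr:.1f}")
```

At a threshold of 780, account A (fraud) and account B (non-fraud) are flagged, so one of two fraud accounts is detected and one of two legitimate accounts is disturbed.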

## Types of Measurements

### Receiver Operating Characteristic (ROC) Curve
- Created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
- Curves closer to the top-left corner indicate a better performance.
    - The diagonal line (FPR = TPR) represents a random classifier; the closer the curve is to this line, the less accurate the model.

### Area Under Curve (AUC)  
- Specifically, we’re looking at the area under the Receiver Operating Characteristic (ROC) curve.
- This area, ranging from 0 to 1, measures the ability of a classification model to separate the two classes and sift signal from noise.
- To compute it, we first draw the ROC curve by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR).
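
One common way to estimate the area is the trapezoid rule applied to the ROC points. The sketch below is an illustration under stated assumptions: the rates are fractions between 0 and 1 (not percentages), the points include the corners (0, 0) and (1, 1), and the helper name `roc_auc_trapezoid` is made up for this example.

```python
import numpy as np

def roc_auc_trapezoid(fpr, tpr):
    """Area under a ROC curve given matching FPR/TPR points, using the trapezoid rule."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    order = np.argsort(fpr, kind="stable")  # integrate from left to right
    fpr, tpr = fpr[order], tpr[order]
    widths = np.diff(fpr)                   # horizontal step between points
    heights = (tpr[1:] + tpr[:-1]) / 2      # average height over each step
    return float(np.sum(widths * heights))

# Toy ROC points (assumed)
auc_model = roc_auc_trapezoid([0.0, 0.1, 0.5, 1.0], [0.0, 0.6, 0.9, 1.0])
auc_random = roc_auc_trapezoid([0.0, 1.0], [0.0, 1.0])
print(f"model AUC = {auc_model:.3f}, random classifier AUC = {auc_random:.3f}")
```

The diagonal of a random classifier gives an area of 0.5, which is the baseline any useful model should beat.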

## Mount the Google Drive

In [None]:
import os
import sys
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

path = '/content/drive/MyDrive/FICO Analytic Challenge/'
sys.path.append(path +'Data')
sys.path.append(path +'Week 04')
sys.path.append(path +'Week 06')
sys.path.append(path +'Week 07')
sys.path.append(path +'Week 10')
os.chdir(path)
print(os.getcwd())

### Import the required libraries

In [None]:
# Import the necessary libraries
import numpy as np
import pandas as pd
from pickle import dump, load
from fico_functions import *

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.preprocessing import MinMaxScaler
import math

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import warnings
warnings.filterwarnings('ignore')

# Removing limitation in viewing pandas columns and rows
pd.set_option('display.max_columns', None, 'display.max_rows', None)

In [None]:
# path to model
mdlPath = f"{path}Model"

# Folder's name that's holding data of interest
data = 'Data'

# Model name; this will be used to distinguish model's output files
model='NNet'

# import scale file 
scaleFile = os.path.join(path + data, 'scaler.' + model + '.' + data + ".pkl")

### Data location
- test_C_notags.csv is the blind holdout dataset
    - you should have already created the features for it and named it either of the following:
        - test_C_notags_features.csv
            - if only using features from week 4
        - test_C_notags_advanced_features.csv
            - if also using week 8
- score.NNet.test_C_notags_features.csv or score.NNet.test_C_notags_advanced_features.csv
    - this should have scores from your trained NNet model
    - this dataset doesn't have the following columns since it has "notags"
        - mdlIsFraudTrx
        - mdlIsFraudAcct
- score.NNet.test_C_features.csv or score.NNet.test_C_advanced_features.csv
    - this is the name of the file that we return to you, which includes the tags

In [None]:
# Change to correct file name
blindholdoutFile = ['test_C']

# CSV filename suffix
featureTestFileSuffix="_advanced_features.csv"

# Scored Blind Holdout file location
blindholdoutCSV = os.path.join(path + data, 'score.' + model + '.' + blindholdoutFile[0] + featureTestFileSuffix)

if not os.path.isfile(blindholdoutCSV):
    featureTestFileSuffix="_features.csv"
    blindholdoutCSV = os.path.join(path + data, 'score.' + model + '.' + blindholdoutFile[0] + featureTestFileSuffix)

In [None]:
# test dataset
df_test = import_df(blindholdoutCSV)

In [None]:
df_test.head()

### Removing Non-Fraud Transactions from Fraud Accounts

A fraud account contains both fraud and non-fraud transactions. To avoid any uncertainty about whether a transaction in a fraud account is truly non-fraud, we remove the non-fraud transactions from fraud accounts.

In [None]:
df_test = filterNFTrxfromFAccn(df_test)

In [None]:
df_test = matureProf_n_months(df_test, 'transactionDateTime', ['pan','transactionDateTime'], n_months=2)

In [None]:
print("\033[1mNNet\033[0m")
dataset_count(df_test, df_isTrain=False)

### Threshold Score Value
We're working with metrics that vary based on a threshold score. Our goal is to calculate these values for a specific range of threshold scores, which will then be used to generate the ROC plot. This can help reduce False Positives (FP) by adjusting final outputs to the desired Threshold Score where any transaction scoring at or above the threshold would be considered a fraud transaction.

In [None]:
# Getting a range of threshold scores
threshold_list = list(range(0,980, 5))
threshold_list.extend(range(980, 1000, 1))

# Removing any duplicates and sorting the list
threshold_list = sorted(set(threshold_list))
# print(threshold_list)

## Calculate Test Sets Metrics

<font color='red'>(**Warning: Takes a long time**)</font>

In [None]:
# Dataset to calculate performance metrics for (train or test) and the required columns
trainOrTest= 'test'
perfCols = ['pan', 'is_train', 'mdlIsFraudTrx', 'mdlIsFraudAcct', 'transactionAmount', 'transactionDateTime', 'score']

In [None]:
pNF, TDR, TVDR, ApNF, ADR, RTVDR = calcMetrics(df_test[perfCols], threshold_list, model, df_isTrain=trainOrTest)

#### TDR vs %NF (**ROC**)

In [None]:
tdr_pNF = plot_roc_NNet(TDR, pNF, xlabel='Percent Non-Fraud', ylabel='Transaction Detection Rate (%)', f1=blindholdoutFile, legend=trainOrTest)

#### TVDR vs %NF (**Dollar Weighted ROC**)

In [None]:
tvdr_pNF = plot_roc_NNet(TVDR, pNF, xlabel='Percent Non-Fraud', ylabel='Transaction Value Detection Rate (%)', f1=blindholdoutFile, legend=trainOrTest)

#### ADR vs A%NF

In [None]:
adr_afpr = plot_roc_NNet(ADR, ApNF, xlabel='Account % Non-Fraud', ylabel='Account Detection Rate (%)', f1=blindholdoutFile, legend=trainOrTest)

#### RTVDR vs A%NF

In [None]:
rtvdr_apnf = plot_roc_NNet(RTVDR, ApNF, xlabel='Account % Non-Fraud', ylabel='Real-Time Value Detection Rate (%)', f1=blindholdoutFile, legend=trainOrTest)

### Score Distribution Plot

In [None]:
plot_scoreDist_NNet(df_test[df_test['is_train']==0], f1=blindholdoutFile)

# Aside from plots, make statements like the following

- Our model captured X% Fraud Transactions and prevented Y% Fraud Loss at a 0.5% NF review rate
- Our model captured X% Fraud Accounts and prevented Y% Fraud Loss at a 1% NF Account review rate

Recall, **Review Rate** is the total number of accounts with score >= score threshold divided by the total number of accounts. From a business perspective, it's not feasible to review all False Positives (FPs), so there is a trade-off. The goal is to get the fewest FPs while capturing the most True Positives (TPs).
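
The review rate definition can be sketched as below. The sketch assumes an account's score is the highest score among its transactions, which is an interpretation and not something stated by `fico_functions`; the helper name `account_review_rate` and the toy data are hypothetical.

```python
import pandas as pd

def account_review_rate(df, threshold):
    """Percent of all accounts with at least one transaction scoring at or above the threshold."""
    max_score = df.groupby("pan")["score"].max()  # assumption: account score = max transaction score
    return 100 * (max_score >= threshold).sum() / len(max_score)

# Toy data (assumed): four accounts with two transactions each
toy_scores = pd.DataFrame({
    "pan":   ["A", "A", "B", "B", "C", "C", "D", "D"],
    "score": [900, 400, 800, 100, 200, 50, 750, 700],
})

rate = account_review_rate(toy_scores, threshold=780)
print(f"Account review rate at 780: {rate:.1f}%")
```

Two of the four accounts (A and B) reach the threshold, so half of all accounts would be sent for review.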

Be sure to understand the meaning of what you're stating. It's fair game to be questioned, in detail, on any statements you claim.

## Understanding the Lists Produced from calcMetrics

The variable **threshold_list** is the list of all thresholds for which calcMetrics calculates the metrics. If **threshold_list** is [0, 5, 10, 15, 20], it has 5 elements (i.e., len(threshold_list)=5). This means each list output by calcMetrics (i.e., pNF, TDR, TVDR, ApNF, ADR, RTVDR) will also have 5 elements, where each element's value corresponds to the respective threshold value in **threshold_list**.

Here is an example, using index numbers, to help understand the values produced. Say, **threshold_list** = [0, 5, 10, 15, 20]. Index value of 0 corresponds to the list value 0, index value of 1 corresponds to the list value of 5, index value of 2 corresponds to the list value of 10, etc. In code form:
- threshold_list[0] is 0
- threshold_list[1] is 5
- threshold_list[2] is 10
- threshold_list[3] is 15
- threshold_list[4] is 20

When calcMetrics is called, it processes each element in **threshold_list** sequentially, one at a time. The pNF, TDR, TVDR, ApNF, ADR, and RTVDR values for that element's threshold are generated and stored in their respective lists.
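
The alignment between the threshold list and the metric lists can be seen in a toy stand-in for calcMetrics. The function `toy_sweep` below is a hypothetical simplification that only produces two lists from made-up scores; the real calcMetrics may work differently internally. The variable names are deliberately different from the notebook's so that running this cell does not overwrite your results.

```python
def toy_sweep(scores, is_fraud, thresholds):
    """Toy stand-in for calcMetrics: returns two lists aligned with `thresholds`."""
    n_fraud = sum(is_fraud)
    n_nonfraud = len(is_fraud) - n_fraud
    pnf_list, tdr_list = [], []
    for t in thresholds:                       # one pass per threshold, in order
        flagged = [s >= t for s in scores]
        tp = sum(f and y == 1 for f, y in zip(flagged, is_fraud))
        fp = sum(f and y == 0 for f, y in zip(flagged, is_fraud))
        tdr_list.append(100 * tp / n_fraud)    # position i belongs to thresholds[i]
        pnf_list.append(100 * fp / n_nonfraud)
    return pnf_list, tdr_list

# Toy data (assumed): five genuine and three fraud transactions
toy_thresholds = [0, 5, 10, 15, 20]
toy_scores     = [2, 4, 7, 12, 16, 8, 18, 25]
toy_is_fraud   = [0, 0, 0, 0, 0, 1, 1, 1]

toy_pnf, toy_tdr = toy_sweep(toy_scores, toy_is_fraud, toy_thresholds)

# Walk the three lists together: the same index refers to the same threshold
for t, p, d in zip(toy_thresholds, toy_pnf, toy_tdr):
    print(f"threshold {t:>2}: %NF = {p:5.1f}, TDR = {d:5.1f}")
```

Each output list has exactly as many elements as the threshold list, so an index found in one list can be used to look up the matching value in any of the others.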

The above needs to be understood so that you can produce the types of statements provided above. To get the TDR and TVDR values at a 0.5% NF review rate, you need to find the index of the pNF value closest to 0.5. Similarly, to get the ADR and RTVDR values at a 1% NF review rate, you need to find the index of the ApNF value closest to 1. Below is example code that can help.

In [None]:
def find_closest_index(lst, target):
    # Use the `min` function with a custom key to find the closest value
    closest_index = min(range(len(lst)), key=lambda i: abs(lst[i] - target))
    return closest_index

In [None]:
# Check the values in pNF, but keep in mind we want the index of the pNF value closest to 0.5
pNF

In [None]:
# Then get the index so you can use that index to get the values in TDR and TVDR at that pNF value
idx_pNF = find_closest_index(pNF, 0.5)
print(f'NF review rate = {pNF[idx_pNF]}% at idx = {idx_pNF}')

In [None]:
# Now you can use the index to generate your results
print(f'Our model captured {TDR[idx_pNF]}% Fraud Transactions and prevented {TVDR[idx_pNF]}% Fraud Loss at a {pNF[idx_pNF]}% NF review rate')

In [None]:
# Check the values in ApNF, but keep in mind we want the index of the ApNF value closest to 1
ApNF

In [None]:
# Then get the index so you can use that index to get the values in ADR and RTVDR at that ApNF value
idx_ApNF = find_closest_index(ApNF, 1)
print(f'NF Account review rate = {ApNF[idx_ApNF]}% at idx = {idx_ApNF}')

In [None]:
# Now you can use the index to generate your results
print(f'Our model captured {ADR[idx_ApNF]}% Fraud Accounts and prevented {RTVDR[idx_ApNF]}% Fraud Loss at a {ApNF[idx_ApNF]}% NF Account review rate')
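
The closest-index lookup reports the metric at whichever computed rate happens to be nearest the target, which may not be exactly 0.5% or 1%. As an optional alternative, not part of the original workflow, linear interpolation between the two neighbouring points gives an estimate at the exact target rate. The sketch below uses NumPy's `np.interp` on made-up lists; the helper name `metric_at_rate` is hypothetical, and the result is an estimate, not a value the model actually produced at some threshold.

```python
import numpy as np

def metric_at_rate(rates, metric, target):
    """Linearly interpolate `metric` at `target`, given parallel lists of rates and metric values."""
    rates = np.asarray(rates, dtype=float)
    metric = np.asarray(metric, dtype=float)
    order = np.argsort(rates, kind="stable")  # np.interp needs increasing x values
    return float(np.interp(target, rates[order], metric[order]))

# Toy lists (assumed), ordered by increasing threshold as calcMetrics would return them
toy_rates  = [100.0, 10.0, 1.0, 0.4, 0.1]    # e.g. %NF values
toy_detect = [100.0, 90.0, 70.0, 55.0, 30.0] # e.g. TDR values

tdr_at_half_pct = metric_at_rate(toy_rates, toy_detect, 0.5)
print(f"Estimated TDR at a 0.5% NF review rate: {tdr_at_half_pct:.1f}%")
```

If you report an interpolated figure, say so explicitly, since it lies between two thresholds that were actually evaluated.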