[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1d2kYPsvC9MLgBhpsRJfMS8SvpMlTmF4Q?usp=sharing)

# Imbalanced classification: credit card fraud detection

https://keras.io/examples/structured_data/imbalanced_classification/

## Introduction

This example looks at the
[Kaggle Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud/)
dataset to demonstrate how
to train a classification model on data with highly imbalanced classes.

## First, vectorize the CSV data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!ls drive/MyDrive/data

creditcard.csv


In [None]:
import csv
import numpy as np

# Get the real data from --> https://www.kaggle.com/mlg-ulb/creditcardfraud/
fname = "/content/drive/MyDrive/data/creditcard.csv"

all_features = []
all_targets = []
with open(fname) as f:
    for i, line in enumerate(f):
        if i == 0:
            print("HEADER:", line.strip())
            continue  # Skip header
        fields = line.strip().split(",")
        all_features.append([float(v.replace('"', "")) for v in fields[:-1]])
        all_targets.append([int(fields[-1].replace('"', ""))])
        if i == 1:
            print("EXAMPLE FEATURES:", all_features[-1])

features = np.array(all_features, dtype="float32")
targets = np.array(all_targets, dtype="uint8")
print("features.shape:", features.shape)
print("targets.shape:", targets.shape)

HEADER: "Time","V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount","Class"
EXAMPLE FEATURES: [0.0, -1.3598071336738, -0.0727811733098497, 2.53634673796914, 1.37815522427443, -0.338320769942518, 0.462387777762292, 0.239598554061257, 0.0986979012610507, 0.363786969611213, 0.0907941719789316, -0.551599533260813, -0.617800855762348, -0.991389847235408, -0.311169353699879, 1.46817697209427, -0.470400525259478, 0.207971241929242, 0.0257905801985591, 0.403992960255733, 0.251412098239705, -0.018306777944153, 0.277837575558899, -0.110473910188767, 0.0669280749146731, 0.128539358273528, -0.189114843888824, 0.133558376740387, -0.0210530534538215, 149.62]
features.shape: (284807, 30)
targets.shape: (284807, 1)


## Prepare a validation set

In [None]:
num_val_samples = int(len(features) * 0.2)
train_features = features[:-num_val_samples]
train_targets = targets[:-num_val_samples]
test_features = features[-num_val_samples:]
test_targets = targets[-num_val_samples:]

print("Number of training samples:", len(train_targets))
print("Number of validation samples:", len(test_features))

Number of training samples: 227846
Number of validation samples: 56961


## [Recommended] Normalize the data

In [None]:
train_features = ...
test_features = ...
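
One possible way to approach this step, sketched on synthetic data: compute the per-column mean and standard deviation on the training split only, then apply those same statistics to both splits so that nothing from the held-out data leaks into preprocessing. The arrays `demo_train` and `demo_test` are hypothetical stand-ins for `train_features` and `test_features`; this is an illustration under those assumptions, not the notebook's reference solution.

```python
import numpy as np

# Synthetic stand-ins for train_features / test_features (hypothetical data).
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 4)).astype("float32")
demo_test = rng.normal(loc=5.0, scale=3.0, size=(200, 4)).astype("float32")

# Statistics come from the training split only.
mean = demo_train.mean(axis=0)
std = demo_train.std(axis=0)
std[std == 0] = 1.0  # avoid division by zero for constant columns

# Apply the same shift and scale to both splits.
demo_train_norm = (demo_train - mean) / std
demo_test_norm = (demo_test - mean) / std

print("train mean per column:", demo_train_norm.mean(axis=0))
print("train std per column:", demo_train_norm.std(axis=0))
```

After this transformation the training columns have roughly zero mean and unit variance; the test columns will be close to that but not exact, which is expected.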

## Build a binary classification model

In [None]:
import keras

model = ...
model.summary()

## Exercise: detect 90% of frauds in the test dataset (TP >= 68)

**Tip**: see the `class_weight` parameter of the `fit` method in the Keras documentation: https://keras.io/api/models/model_training_apis/#fit-method
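
As a rough illustration of what a `class_weight` dictionary can look like, the sketch below weights each class by the inverse of its frequency, which is one common heuristic among several. The array `demo_targets` is a hypothetical stand-in shaped like `train_targets`, and the weighting scheme is an assumption rather than the notebook's intended answer.

```python
import numpy as np

# Hypothetical labels shaped like train_targets: 995 negatives, 5 positives.
demo_targets = np.array([[0]] * 995 + [[1]] * 5, dtype="uint8")

# Count the samples in each class (index 0 = legitimate, index 1 = fraud).
counts = np.bincount(demo_targets[:, 0], minlength=2)

# Inverse-frequency weights: the rare class receives the larger weight.
class_weight = {
    0: float(1.0 / counts[0]),
    1: float(1.0 / counts[1]),
}

print("counts:", counts.tolist())
print("class_weight:", class_weight)
```

A dictionary of this shape, mapping integer class indices to float weights, is the form that the `class_weight` argument of `fit` accepts.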

In [None]:
metrics = [
  keras.metrics.TruePositives(name="tp"),
  ...
]

model.compile(
    ...
)

model.fit(
    train_features,
    train_targets,
    batch_size=...,
    epochs=...,
    callbacks=...,
    verbose=2,
    validation_split=...,
    class_weight=...,
)

In [None]:
results = model.evaluate(test_features, test_targets, verbose=0)
print('Test Loss: {}'.format(results[0]))
print('Test TP: {}'.format(results[1]))
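
To judge whether a reported TP count meets the 90% goal, it helps to derive the threshold from the number of positive labels instead of hard-coding it. The sketch below does this on hypothetical labels: the 75 positives in `demo_test_targets` are an assumption chosen to be consistent with the TP >= 68 target, and `demo_tp` is a made-up result. The actual count should be checked with `test_targets.sum()`.

```python
import math
import numpy as np


def required_true_positives(labels, target_recall=0.9):
    """Smallest TP count that reaches target_recall on the given 0/1 labels."""
    positives = int(np.sum(labels))
    return math.ceil(target_recall * positives)


# Hypothetical test labels: 1000 samples, 75 of them fraudulent (assumed count).
demo_test_targets = np.zeros((1000, 1), dtype="uint8")
demo_test_targets[:75] = 1

needed = required_true_positives(demo_test_targets)

# Hypothetical evaluation result, standing in for results[1].
demo_tp = 70
recall = demo_tp / int(demo_test_targets.sum())

print("TP needed for 90% recall:", needed)
print("Recall at TP = {}: {:.3f}".format(demo_tp, recall))
print("Target met:", demo_tp >= needed)
```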