
# Imbalanced classification: credit card fraud detection

**Author:** [fchollet](https://twitter.com/fchollet)<br>
**Date created:** 2019/05/28<br>
**Last modified:** 2020/04/17<br>
**Description:** Demonstration of how to handle highly imbalanced classification problems.

## Introduction

This example looks at the
[Kaggle Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud/)
dataset to demonstrate how
to train a classification model on data with highly imbalanced classes.

## First, vectorize the CSV data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import csv
import numpy as np

# Get the real data from https://www.kaggle.com/mlg-ulb/creditcardfraud/
fname = "/content/drive/My Drive/creditcard.csv"

# Load the CSV by hand; pandas could do the same in a single call.
all_features = []
all_targets = []  # Class labels
with open(fname) as f:
    for i, line in enumerate(f):
        if i == 0:
            print("HEADER:", line.strip())
            continue  # Skip header
        # strip() removes extraneous characters such as leading and trailing whitespace
        fields = line.strip().split(",")
        # Every column except the last one is a feature
        all_features.append([float(v.replace('"', "")) for v in fields[:-1]])
        # The last column holds the label, so it goes into the targets
        all_targets.append([int(fields[-1].replace('"', ""))])
        if i == 1:
            print("EXAMPLE FEATURES:", all_features[-1])

features = np.array(all_features, dtype="float32")
targets = np.array(all_targets, dtype="uint8")
print("features.shape:", features.shape)
print("targets.shape:", targets.shape)

HEADER: "Time","V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount","Class"
EXAMPLE FEATURES: [0.0, -1.3598071336738, -0.0727811733098497, 2.53634673796914, 1.37815522427443, -0.338320769942518, 0.462387777762292, 0.239598554061257, 0.0986979012610507, 0.363786969611213, 0.0907941719789316, -0.551599533260813, -0.617800855762348, -0.991389847235408, -0.311169353699879, 1.46817697209427, -0.470400525259478, 0.207971241929242, 0.0257905801985591, 0.403992960255733, 0.251412098239705, -0.018306777944153, 0.277837575558899, -0.110473910188767, 0.0669280749146731, 0.128539358273528, -0.189114843888824, 0.133558376740387, -0.0210530534538215, 149.62]
features.shape: (284807, 30)
targets.shape: (284807, 1)
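
The loop above strips the quote characters by hand. As a minimal sketch of an alternative, the standard-library `csv.reader` parses quoted fields itself. The helper name `load_features_and_targets` and the three-column sample are made up for illustration; the real file has 30 feature columns.

```python
import csv
import io

import numpy as np


def load_features_and_targets(f):
    """Read a CSV whose last column is the integer label (hypothetical helper)."""
    reader = csv.reader(f)  # handles quoted fields such as "0" without .replace()
    header = next(reader)   # first row holds the column names
    rows = [row for row in reader if row]  # skip blank lines
    features = np.array(
        [[float(v) for v in row[:-1]] for row in rows], dtype="float32"
    )
    targets = np.array([[int(row[-1])] for row in rows], dtype="uint8")
    return header, features, targets


# Tiny in-memory stand-in for creditcard.csv (illustrative values only).
sample = io.StringIO(
    '"Time","V1","Amount","Class"\n0,-1.5,149.62,"0"\n1,1.25,2.69,"1"\n'
)
header, features, targets = load_features_and_targets(sample)
print("header:", header)
print("features.shape:", features.shape)
print("targets.shape:", targets.shape)
```

To use it on the real data, open the file with `open(fname, newline="")` and pass the file object in place of `sample`.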


## Prepare a validation set

In [3]:
# Split manually instead of using scikit-learn's train_test_split.
# Hold out 20% of the samples (rows) for validation.
num_val_samples = int(len(features) * 0.2)
# Training set: every row except the last 20%
train_features = features[:-num_val_samples]
train_targets = targets[:-num_val_samples]
# Validation set: the last 20% of the rows
val_features = features[-num_val_samples:]
val_targets = targets[-num_val_samples:]

print("Number of training samples:", len(train_features))
print("Number of validation samples:", len(val_features))
# The result is an 80/20 train/validation split.

Number of training samples: 227846
Number of validation samples: 56961


## Analyze class imbalance in the targets

In [4]:
# np.bincount() counts how many samples fall into class 0 and how many into class 1.
# 1 = fraud, 0 = not fraud
# Percentage of fraud in the training data: 100 * float(counts[1]) / len(train_targets)
counts = np.bincount(train_targets[:, 0])
print(
    "Number of positive samples in training data: {} ({:.2f}% of total)".format(
        counts[1], 100 * float(counts[1]) / len(train_targets)
    )
)

# Weight each class by the inverse of its sample count (two weights, since the problem is binary)
weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]

Number of positive samples in training data: 417 (0.18% of total)
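
To make the inverse-frequency weights concrete, here is a small worked example using the counts implied by the output above. The negative count is derived as 227,846 - 417 rather than printed by the notebook, so treat it as an assumption.

```python
import numpy as np

# 417 positives out of 227,846 training samples; the negative count is derived.
counts = np.array([227846 - 417, 417])

weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]

# How much more one fraud sample counts than one legitimate sample.
ratio = weight_for_1 / weight_for_0
print(f"weight_for_0 = {weight_for_0:.3e}")
print(f"weight_for_1 = {weight_for_1:.3e}")
print(f"fraud / legitimate weight ratio = {ratio:.1f}")

# With inverse-count weights, each class carries the same total weight.
total_weight_0 = counts[0] * weight_for_0
total_weight_1 = counts[1] * weight_for_1
print("total weight per class:", total_weight_0, total_weight_1)
```

Under these weights a single fraudulent transaction counts roughly 545 times as much as a single legitimate one, and the two classes contribute equally in aggregate.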


## Normalize the data using training set statistics

In [5]:
# Normalize by hand instead of using scikit-learn
mean = np.mean(train_features, axis=0)
# Zero-center both sets with the training mean
train_features -= mean
val_features -= mean
# Standard deviation of the training features
std = np.std(train_features, axis=0)
# After scaling, every training feature is zero-centered with unit variance, so all dimensions have the same spread around zero
train_features /= std
val_features /= std
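
The same normalization can be sketched as a reusable function and checked on synthetic data. The function name `standardize` and the `eps` guard against zero-variance columns are additions for illustration and are not part of the notebook.

```python
import numpy as np


def standardize(train, val, eps=1e-7):
    """Scale both arrays with the training set's per-feature mean and std.

    eps is an added safeguard against division by zero (an assumption,
    not something the original cell does).
    """
    mean = train.mean(axis=0)
    std = train.std(axis=0)  # std is unchanged by subtracting the mean
    return (train - mean) / (std + eps), (val - mean) / (std + eps)


# Synthetic stand-in data: 4 features with mean 5 and std 3.
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=3.0, size=(1000, 4)).astype("float32")
val = rng.normal(loc=5.0, scale=3.0, size=(200, 4)).astype("float32")

train_n, val_n = standardize(train, val)
print("train mean per feature:", train_n.mean(axis=0))
print("train std per feature:", train_n.std(axis=0))
print("val mean per feature:", val_n.mean(axis=0))
```

The validation statistics come out close to, but not exactly, 0 and 1. That is expected: the validation set is scaled with training statistics so that no information from it leaks into preprocessing.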

## Build a binary classification model

In [6]:
import keras

# The layers are passed as a comma-separated list to the Sequential constructor.
# An equivalent alternative is model = keras.Sequential() followed by model.add() calls.
model = keras.Sequential(
    [
        # train_features.shape is (num_samples, num_features); shape[1:] drops the sample axis, leaving (30,)
        keras.Input(shape=train_features.shape[1:]),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3),
        # Output layer: a single sigmoid unit giving a yes/no (fraud or not) probability
        keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.summary()
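
The output of `model.summary()` is not shown here, but the parameter count can be worked out by hand. A dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and dropout layers add no parameters. The helper `dense_params` is a name chosen for this sketch.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Input width followed by the width of each Dense layer in the model above.
layer_sizes = [30, 256, 256, 256, 1]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print("parameters per Dense layer:", per_layer)
print("total trainable parameters:", total)
```

This gives 7,936 + 65,792 + 65,792 + 257 = 139,777 parameters.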

## Train the model with `class_weight` argument

In [7]:
metrics = [
    keras.metrics.FalseNegatives(name="fn"),
    keras.metrics.FalsePositives(name="fp"),
    keras.metrics.TrueNegatives(name="tn"),
    keras.metrics.TruePositives(name="tp"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-2), loss="binary_crossentropy", metrics=metrics
)
# Save the model at the end of every epoch (more info: https://aparat.com/v/r879wf1?playlist=342711)
callbacks = [keras.callbacks.ModelCheckpoint("fraud_model_at_epoch_{epoch}.keras")]
# class_weight must be a dict mapping each class index to its weight
class_weight = {0: weight_for_0, 1: weight_for_1}

model.fit(
    train_features,
    train_targets,
    batch_size=2048,
    epochs=30,
    verbose=2,
    # Optional: only needed to save the per-epoch checkpoints
    callbacks=callbacks,
    validation_data=(val_features, val_targets),
    # Weight each sample's loss by the weight of its class
    class_weight=class_weight,
)

Epoch 1/30
112/112 - 10s - 94ms/step - fn: 55.0000 - fp: 30202.0000 - loss: 2.4622e-06 - precision: 0.0118 - recall: 0.8681 - tn: 197227.0000 - tp: 362.0000 - val_fn: 10.0000 - val_fp: 824.0000 - val_loss: 0.0891 - val_precision: 0.0731 - val_recall: 0.8667 - val_tn: 56062.0000 - val_tp: 65.0000
Epoch 2/30
112/112 - 2s - 19ms/step - fn: 34.0000 - fp: 6667.0000 - loss: 1.3530e-06 - precision: 0.0543 - recall: 0.9185 - tn: 220762.0000 - tp: 383.0000 - val_fn: 6.0000 - val_fp: 2066.0000 - val_loss: 0.1174 - val_precision: 0.0323 - val_recall: 0.9200 - val_tn: 54820.0000 - val_tp: 69.0000
Epoch 3/30
112/112 - 0s - 4ms/step - fn: 30.0000 - fp: 9930.0000 - loss: 1.5505e-06 - precision: 0.0375 - recall: 0.9281 - tn: 217499.0000 - tp: 387.0000 - val_fn: 10.0000 - val_fp: 490.0000 - val_loss: 0.0828 - val_precision: 0.1171 - val_recall: 0.8667 - val_tn: 56396.0000 - val_tp: 65.0000
Epoch 4/30
112/112 - 0s - 4ms/step - fn: 29.0000 - fp: 6483.0000 - loss: 1.0603e-06 - precision: 0.0565 - recall: 

<keras.src.callbacks.history.History at 0x7d9a5c8bb0d0>
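
The `class_weight` argument passed to `model.fit` scales each training sample's loss by the weight of its class. The NumPy sketch below illustrates that idea on a toy batch. It is a simplified picture, not Keras's exact internals: the function name `weighted_bce` and the plain mean reduction are assumptions.

```python
import numpy as np


def weighted_bce(y_true, y_prob, class_weight, eps=1e-7):
    """Mean binary cross-entropy with a per-class weight on each sample."""
    y_true = np.asarray(y_true, dtype="float64")
    y_prob = np.clip(np.asarray(y_prob, dtype="float64"), eps, 1 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    return float(np.mean(weights * per_sample))


# Toy batch: the model predicts a low fraud probability for every sample,
# which is wrong only for the single positive.
y_true = [0, 0, 0, 1]
y_prob = [0.1, 0.1, 0.1, 0.1]

unweighted = weighted_bce(y_true, y_prob, {0: 1.0, 1: 1.0})
weighted = weighted_bce(y_true, y_prob, {0: 1.0, 1: 10.0})
print("unweighted loss:", unweighted)
print("loss with class 1 weighted 10x:", weighted)
```

Raising the weight of class 1 makes the missed positive dominate the loss. Because both weights used in this notebook are far below 1, this scaling is also the likely reason the training loss in the log is on the order of 1e-6 while the validation loss, which is not class-weighted, is not.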

## Conclusions

At the end of training, out of 56,961 validation transactions, we are:

- Correctly identifying 66 of them as fraudulent
- Missing 9 fraudulent transactions
- At the cost of incorrectly flagging 441 legitimate transactions
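
These three counts determine the validation precision and recall. The short calculation below derives them, along with the true-negative count, which is obtained by subtraction rather than read from the log.

```python
# Counts quoted in the conclusions above.
tp, fn, fp = 66, 9, 441
total = 56961

tn = total - tp - fn - fp   # legitimate transactions correctly left alone (derived)
recall = tp / (tp + fn)     # share of fraud that was caught
precision = tp / (tp + fp)  # share of flagged transactions that were fraud

print(f"true negatives: {tn}")
print(f"recall:    {recall:.4f}")
print(f"precision: {precision:.4f}")
```

The model catches 88% of the fraudulent transactions, but only about 13% of the transactions it flags are actually fraudulent.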

In the real world, one would put an even higher weight on class 1,
so as to reflect that False Negatives are more costly than False Positives.
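
One way to express that idea, as a sketch only, is to multiply the inverse-frequency weight of class 1 by a cost factor. The helper `make_class_weight`, the parameter `fn_cost_factor`, and the value 5.0 are hypothetical. A real factor would come from the business cost of a missed fraud relative to a false alarm.

```python
def make_class_weight(counts, fn_cost_factor=1.0):
    """Inverse-frequency class weights, with class 1 scaled by a cost factor.

    fn_cost_factor is a hypothetical knob: 1.0 reproduces the weights used
    in this notebook, larger values penalize missed fraud more heavily.
    """
    return {0: 1.0 / counts[0], 1: fn_cost_factor / counts[1]}


# Training counts from this notebook (the negative count is derived).
counts = [227846 - 417, 417]

baseline = make_class_weight(counts)
costly_fn = make_class_weight(counts, fn_cost_factor=5.0)
print("baseline:", baseline)
print("with 5x cost on false negatives:", costly_fn)
```

The resulting dict can be passed as `class_weight` to `model.fit` in place of the original one. Expect recall to rise and precision to fall as the factor grows.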

Next time your credit card gets declined in an online purchase -- this is why.

