# Credit Card Fraud Detection

The previous exercises had you take a closer look at the different parts of a neural network: 
* the architecture of a sequential Dense Neural Network
* the compilation method
* the fitting

Let's now work on a real-life dataset that has **a lot of data**!

💸 **The dataset : `Credit Card Transactions`** 💸

For this open challenge, you will `work with data extracted from credit card transactions`. 

Because the data is sensitive, only 3 of the 31 columns are known: the other 28 have been transformed to `anonymize` them (in fact, they are `PCA projections of the initial data`).

The three known columns are:

* `Time`: the time elapsed between the transaction and the first transaction in the dataset
* `Amount`: the amount of the transaction
* `Class` (our target): 
    * `0 : valid transaction` 
    * `1 : fraudulent transaction`

❓ **Question** ❓ Start by downloading the dataset:
* on the Kaggle website [here](https://www.kaggle.com/mlg-ulb/creditcardfraud) 
* or from our [URL](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/creditcard.csv) 

Load data to create `X` and `y`

In [1]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv("creditcard.csv")
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [15]:
X = data.drop(columns="Class")
y = data['Class']

## 1. Rebalancing classes

In [12]:
# Let's check class balance
pd.Series(y).value_counts(normalize=True)

0    0.998273
1    0.001727
Name: Class, dtype: float64

☝️ In this `fraud detection` challenge, **the classes are extremely imbalanced**:
* 99.8 % of normal transactions
* 0.2 % of fraudulent transactions

**We won't be able to detect frauds unless we apply some serious rebalancing strategies!**
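To see why this imbalance is a problem, consider a "model" that always predicts the majority class. The sketch below uses synthetic labels (an assumption: 10,000 rows with 17 frauds, roughly the ratio above) rather than the real dataset. Accuracy looks excellent while not a single fraud is caught.

```python
import numpy as np

# Synthetic labels mimicking the class ratio above (assumption: 10,000 rows, 17 frauds).
y_fake = np.zeros(10_000, dtype=int)
y_fake[:17] = 1

# A "model" that always predicts the majority class (valid transaction).
y_always_valid = np.zeros_like(y_fake)

accuracy = (y_always_valid == y_fake).mean()
true_positives = int(((y_always_valid == 1) & (y_fake == 1)).sum())
recall_baseline = true_positives / int((y_fake == 1).sum())

print(f"accuracy = {accuracy:.4f}")        # very high, despite catching nothing
print(f"recall   = {recall_baseline:.4f}")  # no fraud detected
```

This is why accuracy is a poor metric here and why recall and precision are tracked instead.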

❓ **Question** ❓
1. **First**, create three separate splits `Train/Val/Test` from your dataset. It is extremely important to keep validation and testing sets **unbalanced** so that when you evaluate your model, it is done in true conditions, without data leakage. Keep your test set for the very last cell of this notebook... !

&nbsp;
2. **Second**, rebalance your training set (and only this one). You have many choices:

- Simply oversample the minority class randomly using plain Numpy functions (not the best option, since duplicating the same few rows makes it easy to overfit them)
- Or use <a href="https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/">**`Synthetic Minority Oversampling Technique - SMOTE`**</a> to generate new datapoints by interpolating between existing minority samples
- In addition, you can also try a <a href="https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/">**`RandomUnderSampler`**</a> to downsample the majority class a little bit
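As a minimal sketch of the first option, random oversampling with plain NumPy could look like the following. The toy arrays `X_toy` and `y_toy` are assumptions standing in for your training split. The code draws minority rows with replacement until both classes have the same size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set (assumption): 1,000 rows, 30 features, 1% positives.
X_toy = rng.normal(size=(1000, 30))
y_toy = np.zeros(1000, dtype=int)
y_toy[:10] = 1

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Draw minority rows with replacement until both classes have the same size.
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_balanced = np.concatenate([X_toy, X_toy[extra_idx]])
y_balanced = np.concatenate([y_toy, y_toy[extra_idx]])

# Shuffle so the duplicated rows are not grouped at the end.
order = rng.permutation(len(y_balanced))
X_balanced, y_balanced = X_balanced[order], y_balanced[order]

print(np.bincount(y_balanced))  # class counts after rebalancing
```

Only the training split should go through this step. Validation and test data stay untouched.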

In [56]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.5)

In [57]:
len(X_train), len(X_val), len(X_test)

(85442, 85442, 113923)
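The split above is purely random, so with so few frauds the number of positives in each split can vary from run to run. One possible variant, shown here on toy data (the arrays and `random_state` are assumptions), passes `stratify` to `train_test_split` so that each split keeps the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data (assumption): 10,000 rows with 20 positives, i.e. a 0.2% fraud rate.
X_toy = rng.normal(size=(10_000, 5))
y_toy = np.zeros(10_000, dtype=int)
y_toy[:20] = 1

# First carve out the test set, then split the remainder into train and validation.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_toy, y_toy, train_size=0.6, stratify=y_toy, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, train_size=0.5, stratify=y_tmp, random_state=0)

# Print split name, size, and number of positives.
for name, labels in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(name, len(labels), int(labels.sum()))
```

Stratifying does not rebalance anything: validation and test sets stay as imbalanced as the original data, which is what the evaluation needs.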

In [71]:
!pip install imblearn

Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.1/199.1 KB 2.5 MB/s eta 0:00:00
Installing collected packages: imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.9.0 imblearn-0.0

In [72]:
from imblearn.over_sampling import SMOTE


In [73]:
oversample = SMOTE()
X_train_r, y_train_r = oversample.fit_resample(X_train, y_train)

In [74]:
X_train_r.shape, X_train.shape

((170618, 30), (85442, 30))
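The resampled shape is consistent with SMOTE's default behaviour of growing the minority class until it matches the majority class (170618 = 2 × 85309). The core idea, interpolating between a minority point and one of its nearest minority neighbours, can be sketched with NumPy as below. This is a simplified illustration on toy 2D points, not the `imblearn` implementation, and the helper name `smote_like_sample` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points in 2D (assumption, for illustration only).
minority = rng.normal(loc=3.0, scale=0.5, size=(20, 2))

def smote_like_sample(points, k=5, rng=rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    distances = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]  # skip position 0, the point itself
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority) for _ in range(100)])
print(synthetic.shape)
```

Because each synthetic point lies on a segment between two real minority points, the new rows are not exact duplicates, unlike with random oversampling.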

In [31]:
import matplotlib.pyplot as plt

def plot_loss_recall(history, title=None):
    fig, ax = plt.subplots(1,2, figsize=(20,7))
    
    # --- LOSS --- 
    
    ax[0].plot(history.history['loss'])
    ax[0].plot(history.history['val_loss'])
    ax[0].set_title('Model loss')
    ax[0].set_ylabel('Loss')
    ax[0].set_xlabel('Epoch')
    ax[0].set_ylim((0,3))
    ax[0].legend(['Train', 'Validation'], loc='best')
    ax[0].grid(axis="x",linewidth=0.5)
    ax[0].grid(axis="y",linewidth=0.5)
    
    # --- RECALL ---
    
    ax[1].plot(history.history['recall'])
    ax[1].plot(history.history['val_recall'])
    ax[1].set_title('Model recall')
    ax[1].set_ylabel('Recall')
    ax[1].set_xlabel('Epoch')
    ax[1].legend(['Train', 'Validation'], loc='best')
    ax[1].set_ylim((0,1))
    ax[1].grid(axis="x",linewidth=0.5)
    ax[1].grid(axis="y",linewidth=0.5)
    
    if title:
        fig.suptitle(title)

## 2. Neural Network iterations

Now that you have rebalanced your classes, try to fit a neural network to optimize your test score. Feel free to use the following hints:

- Normalize your inputs!
    - Use preferably a [`Normalization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/Normalization) layer inside the model to "pipeline" your preprocessing within your model. 
    - Or use sklearn's [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) outside of your model, fitted on `X_train` only and then applied to `X_train`, `X_val` and `X_test`.
- Make your model overfit, then regularize it using:
    - Early Stopping criteria 
    - [`Dropout`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout) layers
    - or kernel [`regularizers`](https://www.tensorflow.org/api_docs/python/tf/keras/regularizers) on your Dense layers
- 🚨 Think carefully about the metrics you want to track and the loss function you want to use... !
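As a rough illustration of the first hint, standardization boils down to subtracting a mean and dividing by a standard deviation, both computed on the training set only. The sketch below uses toy arrays (an assumption) with one large-scale "amount-like" column and one PCA-like column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy features on very different scales (assumption): one amount-like column, one PCA-like column.
X_tr_toy = np.column_stack([rng.exponential(100.0, 1000), rng.normal(0.0, 1.0, 1000)])
X_va_toy = np.column_stack([rng.exponential(100.0, 200), rng.normal(0.0, 1.0, 200)])

# Statistics come from the training set only...
train_mean = X_tr_toy.mean(axis=0)
train_std = X_tr_toy.std(axis=0)

# ...and are then reused unchanged on validation (and test) data.
X_tr_scaled = (X_tr_toy - train_mean) / train_std
X_va_scaled = (X_va_toy - train_mean) / train_std

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
print(X_va_scaled.mean(axis=0).round(3), X_va_scaled.std(axis=0).round(3))
```

The validation statistics end up close to, but not exactly, 0 and 1. That is expected: recomputing them on the validation set would leak information about it into preprocessing.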


In [58]:
X_train.shape

(85442, 30)

In [149]:
from tensorflow.keras import models, layers, regularizers
from tensorflow.keras.layers.experimental.preprocessing import Normalization
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.metrics import Recall
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay

In [59]:
normalizer = Normalization() # Instantiate a "normalizer" layer
normalizer.adapt(X_train) # "Fit" it on the train set

In [165]:
es = EarlyStopping(patience = 6, restore_best_weights=True, monitor="val_recall", mode='max')
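The patience logic behind this callback can be sketched in plain Python. This is a simplified model of the behaviour for a metric to maximize, not Keras's actual code. The function name and the validation recalls are hypothetical.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stopped_epoch, best_epoch), both 1-indexed, for a metric to maximize."""
    best_score = float("-inf")
    best_epoch = 0
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # Improvement: remember this epoch and reset the counter.
            best_score, best_epoch, wait = score, epoch, 0
        else:
            # No improvement: stop once `patience` epochs in a row have failed.
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_scores), best_epoch

# Hypothetical validation recalls: the first epoch is never beaten afterwards.
val_recall_demo = [0.90, 0.88, 0.89, 0.87, 0.89, 0.88, 0.86, 0.85]
print(early_stopping_epoch(val_recall_demo, patience=6))  # → (7, 1)
```

With `patience=6`, a run that stops after 7 epochs, like the training log further down, would be consistent with the best validation recall having been reached at the very first epoch. `restore_best_weights=True` then rolls the model back to that epoch.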

In [141]:
initial_learning_rate = 0.1 # much higher than Adam's default value (0.001)

lr_schedule = ExponentialDecay(
    # Every 4000 iterations, multiply the learning rate by 0.7
    initial_learning_rate, decay_steps = 4000, decay_rate = 0.7)
    
opt_schedule = Adam(learning_rate=lr_schedule)
opt_schedule

<tensorflow.python.keras.optimizer_v2.adam.Adam at 0x15427ae20>
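With its default non-staircase setting, `ExponentialDecay` follows `initial_learning_rate * decay_rate ** (step / decay_steps)`. The short pure-Python sketch below (the helper name is hypothetical) evaluates that formula with the values used above, to give a feel for how quickly the learning rate shrinks.

```python
def exponential_decay(step, initial_learning_rate=0.1, decay_steps=4000, decay_rate=0.7):
    """Continuous (non-staircase) exponential decay of the learning rate."""
    return initial_learning_rate * decay_rate ** (step / decay_steps)

for step in (0, 2000, 4000, 8000, 20000):
    print(f"step {step:>6}: lr = {exponential_decay(step):.5f}")
```

Here a "step" is one batch update, not one epoch, so the number of steps per epoch depends on the training set size and the batch size.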

In [166]:
def initialize_model(lr):
    model = models.Sequential()

    #reg_l1 = regularizers.L1(0.01)
    #reg_l1_l2 = regularizers.l1_l2(l1=0.005, l2=0.005)
    model.add(normalizer)  # preprocessing layer adapted on X_train above
    model.add(layers.Dense(15, activation='relu'))
    model.add(layers.Dropout(rate=0.3))
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dropout(rate=0.2))
    model.add(layers.Dense(1, activation='sigmoid'))  # outputs a fraud probability

    model.compile(loss='binary_crossentropy',
                  optimizer=Adam(learning_rate=lr),
                  metrics=[Recall(name='recall')])  # logged as 'recall' / 'val_recall'
    return model
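The model is trained with `binary_crossentropy`, which penalizes confident wrong predictions heavily. A minimal NumPy sketch of that loss, on hypothetical labels and probabilities, shows how much a single missed fraud costs compared with a correct prediction. The clipping constant `eps` is an assumption chosen for illustration.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

labels = [0, 0, 0, 1]                      # the last transaction is a fraud
catches_the_fraud = [0.01, 0.02, 0.01, 0.95]
misses_the_fraud = [0.01, 0.02, 0.01, 0.05]

print(binary_crossentropy(labels, catches_the_fraud))
print(binary_crossentropy(labels, misses_the_fraud))
```

On the original imbalanced data, frauds contribute very little to the average loss because there are so few of them, which is one way to see why rebalancing the training set matters.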

In [167]:
%%time
results =[]

model = initialize_model(0.05)
history = model.fit(X_train_r, y_train_r, 
                    validation_data=(X_val, y_val), 
                    callbacks=[es], epochs = 1000, verbose = 1)
results.append(model.evaluate(X_val, y_val)[1])
print(results)

Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
[0.896774172782898]
CPU times: user 51.8 s, sys: 9.97 s, total: 1min 1s
Wall time: 36.7 s


By optimizing on the recall, we are "sacrificing" the precision!

🎯 As a bank manager, you want all the frauds to be detected.

✅ It's fine to predict False Positives, False Alarms: `Better be safe than sorry...`
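The trade-off can be made concrete by moving the decision threshold. In the sketch below, the labels and predicted probabilities are hypothetical (not model output): lowering the threshold flags more transactions, so recall goes up while precision goes down.

```python
import numpy as np

# Hypothetical true labels and predicted fraud probabilities (assumption, not model output).
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_prob_demo = np.array([0.05, 0.10, 0.20, 0.36, 0.45, 0.55, 0.40, 0.60, 0.70, 0.90])

def precision_recall_at(threshold):
    """Precision and recall when predicting fraud for probabilities >= threshold."""
    y_pred = (y_prob_demo >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true_demo == 1)).sum())
    fp = int(((y_pred == 1) & (y_true_demo == 0)).sum())
    fn = int(((y_pred == 0) & (y_true_demo == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for threshold in (0.65, 0.5, 0.35):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```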

## 3. Score your model on the unseen Test set

❓ **Questions** ❓: 

* Evaluate your model on the test set
* Print the Confusion Matrix
* What are your `precision` and `recall` on the test set? 

In [117]:
model.evaluate(X_test, y_test)



[0.09781114012002945, 0.8186274766921997]

In [168]:
from sklearn.metrics import confusion_matrix

# Turn predicted probabilities into class labels with a 0.5 threshold
y_pred = (model.predict(X_test) >= 0.5).astype(int).ravel()
confusion_matrix(y_test, y_pred)

array([[111165,   2554],
       [    20,    184]])
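Precision and recall can be read directly off this matrix. The sketch below copies the values printed above and relies on scikit-learn's layout (rows are true classes, columns are predicted classes). The variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[111165, 2554],
               [20, 184]])

tn, fp, fn, tp = cm.ravel()

precision_cm = tp / (tp + fp)  # share of flagged transactions that are real frauds
recall_cm = tp / (tp + fn)     # share of real frauds that were flagged

print(f"precision = {precision_cm:.4f}")
print(f"recall    = {recall_cm:.4f}")
```

For this particular run, about 90% of the frauds are caught, but fewer than 7% of the alerts are real frauds: the false alarms are the price paid for the high recall.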

In [121]:
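# Counts from the earlier run evaluated in In [117] (TP = 167, FP = 70, FN = 37);
# they do not match the confusion matrix printed above, which comes from a later run.
precision = 167 / (70 + 167)
recall = 167 / (37 + 167)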

### 🧪 Test your score

Store your real test performance, measured on an (`X_test`, `y_test`) sample that is representative of the original unbalanced dataset, in the `precision` and `recall` variables.

In [122]:
from nbresult import ChallengeResult

result = ChallengeResult('solution',
    precision=precision,
    recall=recall,
    fraud_number=len(y_test[y_test == 1]),
    non_fraud_number=len(y_test[y_test == 0]),
)

result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/06-Deep-Learning/02-Optimizer-loss-and-fitting/04-Credit-Card-Challenge
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 2 items

tests/test_solution.py::TestSolution::test_is_score_good_enough PASSED   [ 50%]
tests/test_solution.py::TestSolution::test_is_test_set_representative PASSED [100%]



💯 You can commit your code:

git add tests/solution.pickle

git commit -m 'Completed solution step'

git push origin master


## 🏁 Optional : Read Google's solution for this challenge

Congratulations on finishing all the challenges for this session!

To conclude, take some time to read Google's own solution directly [on Colab here](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/structured_data/imbalanced_data.ipynb). 

You will discover interesting techniques and best practices.
