# Data Preprocessing

Run this notebook to preprocess the data.
It scales the `Amount` and `Time` columns, corrects the class imbalance by oversampling the minority class, and generates a training subset.

> **NOTE**: This notebook performs all the necessary steps to preprocess the data before training.
You do not need to develop additional preprocessing steps.



Install the `imbalanced-learn` package.
The notebook uses this package to rebalance the dataset by using resampling techniques.

In [1]:
%pip install imbalanced-learn==0.11.0


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
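
The resampling technique used later in this notebook is SMOTE. As a rough illustration of the idea behind it (not the actual `imbalanced-learn` implementation), the sketch below creates one synthetic minority sample by interpolating between a minority point and its nearest minority neighbor. The helper names `interpolate_sample` and `squared_distance` and the toy data are made up for this example, and the real algorithm picks randomly among several nearest neighbors rather than always using the closest one.

```python
import random


def interpolate_sample(point, neighbor, rng):
    """Return a random point on the line segment between point and neighbor."""
    gap = rng.random()  # uniform in [0.0, 1.0)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


rng = random.Random(42)

# Hypothetical minority-class samples with two features each.
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]]

base = minority[0]
others = minority[1:]
# Simplification: use the single nearest neighbor instead of k candidates.
neighbor = min(others, key=lambda m: squared_distance(base, m))

synthetic = interpolate_sample(base, neighbor, rng)
print(synthetic)
```

The synthetic point always lies between the two original points, which is why this kind of oversampling adds new minority examples instead of duplicating existing rows.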


Import dependencies:

In [2]:
from imblearn.over_sampling import SMOTE
from numpy import save, count_nonzero
from pandas import read_csv
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import RobustScaler

The following code scales the `Amount` and `Time` columns, then builds a training subset that contains a balanced number of each class (`Fraud` (1), `No Fraud` (0)).

Preprocessing produces two files: `data/training_samples.npy`, which contains all columns except the class, and `data/training_labels.npy`, which contains only the class column.

In [3]:
df = read_csv('data/creditcard.csv')

# Scale `Amount` and `Time` with RobustScaler (median and IQR), which is less
# sensitive to outliers than mean/variance scaling. The scaler is refit per column.
rob_scaler = RobustScaler()

df['scaled_amount'] = rob_scaler.fit_transform(
    df['Amount'].values.reshape(-1, 1)
)
df['scaled_time'] = rob_scaler.fit_transform(
    df['Time'].values.reshape(-1, 1)
)
df.drop(['Time', 'Amount'], axis=1, inplace=True)

# Move the scaled columns to the front of the DataFrame.
scaled_amount = df.pop('scaled_amount')
scaled_time = df.pop('scaled_time')
df.insert(0, 'scaled_amount', scaled_amount)
df.insert(1, 'scaled_time', scaled_time)

X = df.drop('Class', axis=1)
y = df['Class']
sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)

# Each pass overwrites the previous one, so only the training indices of the
# last of the five splits are kept (about 80% of the rows).
for train_index, test_index in sss.split(X, y):
    original_Xtrain = X.iloc[train_index]
    original_ytrain = y.iloc[train_index]

original_Xtrain = original_Xtrain.values
original_ytrain = original_ytrain.values

# Oversample the minority (fraud) class until both classes have the same count.
sm = SMOTE(sampling_strategy='minority', random_state=42)
Xsm_train, ysm_train = sm.fit_resample(original_Xtrain, original_ytrain)

save('data/training_samples.npy', Xsm_train)
save('data/training_labels.npy', ysm_train)

print('Data processing done!')

Data processing done!
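
To make the scaling step concrete, here is a minimal pure-Python sketch of what `RobustScaler` computes with its default settings: subtract the median, then divide by the interquartile range (IQR). The helper `robust_scale`, the toy values, and the zero-IQR guard are assumptions of this sketch, not code from the notebook.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Center values on the median and divide by the interquartile range."""
    # method="inclusive" uses linear interpolation between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # assumption of this sketch: leave constant columns unscaled
    med = median(values)
    return [(v - med) / iqr for v in values]


# Hypothetical amounts with one extreme outlier.
amounts = [1.0, 2.0, 3.0, 4.0, 1000.0]
scaled = robust_scale(amounts)
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 498.5]
```

The outlier stays large after scaling, but it does not distort the scale of the other values, because neither the median nor the IQR depends on it.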


In [4]:
num_frauds = count_nonzero(ysm_train)

print("Fraud cases", num_frauds)
print("No fraud cases", ysm_train.size - num_frauds)

Fraud cases 227452
No fraud cases 227452
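
Before training, it can be useful to confirm that the two saved files line up. The following is a small sketch, assuming NumPy is available as in the rest of this notebook; the helper name `summarize_training_files` is hypothetical and not part of the notebook.

```python
import numpy as np


def summarize_training_files(samples_path, labels_path):
    """Load both arrays and return (rows, features, fraud, no_fraud)."""
    X = np.load(samples_path)
    y = np.load(labels_path)
    if X.shape[0] != y.shape[0]:
        raise ValueError("samples and labels have different row counts")
    n_fraud = int(np.count_nonzero(y))
    return X.shape[0], X.shape[1], n_fraud, int(y.size - n_fraud)
```

Calling `summarize_training_files('data/training_samples.npy', 'data/training_labels.npy')` after running the cells above should report equal fraud and no-fraud counts, matching the output printed by the previous cell.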
