# Prepare the fraud detection dataset for comparing fraud detection classifiers
We use the Kaggle "Credit Card Fraud Detection" dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud

The dataset has to be downloaded, placed in the `data` folder, and unzipped there.

This notebook splits the data into stratified train and test sets and preprocesses them, so that different ML models can be compared on the same data. The models should be compared by their performance on the test set. Useful metrics for imbalanced datasets like this one are F1 and average precision. See:

- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html
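
To make the choice of metrics concrete, here is a minimal sketch on an invented, heavily imbalanced toy example (the labels and scores are made up for illustration and are not taken from the dataset). It compares accuracy with F1 and average precision for a "classifier" that never predicts fraud.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, f1_score

# Toy labels: 95 normal transactions and 5 frauds (invented numbers).
y_true = np.array([0] * 95 + [1] * 5)

# A useless "classifier" that always predicts "normal".
y_pred_all_normal = np.zeros_like(y_true)

# Accuracy looks high although not a single fraud is found ...
acc = accuracy_score(y_true, y_pred_all_normal)
print('accuracy:', acc)

# ... while F1 for the fraud class exposes the problem.
# zero_division=0 avoids a warning because no positives are predicted.
f1 = f1_score(y_true, y_pred_all_normal, zero_division=0)
print('F1:', f1)

# Average precision works on scores instead of hard labels.
# Constant scores carry no ranking information about the frauds.
y_score_constant = np.zeros(len(y_true))
ap = average_precision_score(y_true, y_score_constant)
print('average precision:', ap)
```

In this toy case accuracy rewards the model for the majority class only, which is why F1 and average precision are the more informative numbers to report.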

Since I am not sure about copyright issues, I will not include any data here. Just download it from Kaggle and use this notebook to generate the train and test datasets.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


In [2]:
# load the data to pandas frame
data = pd.read_csv('./data/creditcard.csv')


In [3]:
# show the data
data.head()


,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
# get some info about the data
data.describe()


,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.16598e-15,3.416908e-16,-1.37315e-15,2.086869e-15,9.604066e-16,1.490107e-15,-5.556467e-16,1.177556e-16,-2.406455e-15,...,1.656562e-16,-3.44485e-16,2.578648e-16,4.471968e-15,5.340915e-16,1.687098e-15,-3.666453e-16,-1.220404e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


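Here we see that the data is not fully normalized: the `V` columns have a mean close to zero but standard deviations different from 1, and `Amount` is not scaled at all. We will fix that later, after the train / test split.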

# Size of the dataset?

In [5]:
data.shape


(284807, 31)

30 features, 1 label, 284807 rows


# Do we have missing values?

In [6]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26  

We have no missing values.
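
Besides reading the `info()` output, missing values can also be counted directly. The following is a minimal sketch on a tiny stand-in frame with invented values (the name `toy` is hypothetical); in this notebook the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with one missing value (invented numbers).
toy = pd.DataFrame({'V1': [0.1, np.nan, 0.3], 'Amount': [1.0, 2.0, 3.0]})

# Number of missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing values; 0 would mean the frame is complete.
total_missing = int(missing_per_column.sum())
print('total missing:', total_missing)
```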

# Is the dataset balanced?

In [7]:
fraud = data[(data['Class'] != 0)]
normal = data[(data['Class'] == 0)]

print('len fraud: {}'.format(len(fraud)))
print('len normal: {}'.format(len(normal)))


len fraud: 492
len normal: 284315


No. There are 284315 normal cases and just 492 fraud cases.
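
To put the imbalance into numbers, this short sketch computes the fraud ratio from the two counts printed above. The variable names are chosen for illustration only.

```python
# Counts printed by the cell above.
n_fraud = 492
n_normal = 284315

# Share of fraud cases among all transactions.
fraud_ratio = n_fraud / (n_fraud + n_normal)
print('fraud ratio: {:.4%}'.format(fraud_ratio))

# How many normal transactions there are per fraud case.
print('normal transactions per fraud: {:.0f}'.format(n_normal / n_fraud))
```

The ratio agrees with the mean of the `Class` column (0.001727) in the `describe()` output.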

# Drop the Time
The `Time` feature has the following meaning:
number of seconds elapsed between this transaction and the first transaction in the dataset.

Now we drop the time values. Although they could be used for useful feature engineering,
this dataset is meant for comparing different ML techniques and not for testing fancy feature engineering
methods.

In [8]:
data = data.drop(['Time'],axis=1)


# Split features and labels

In [9]:
# create label
y = np.array(data['Class'].tolist())

# create features
data = data.drop('Class', axis=1)  # pass axis by keyword; the positional form fails on newer pandas
x = np.array(data.values)

print('x.shape:', x.shape)
print('y.shape:', y.shape)


x.shape: (284807, 29)
y.shape: (284807,)


# Split Train (80%) and Test (20%) Dataset
Data is split in a stratified fashion, using `y` as the class labels.

This way the ratio of fraud to non-fraud cases is (almost) the same in the train and the test set.

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, 
                                                    random_state = 42, 
                                                    stratify = y,
                                                    shuffle = True
                                                   )

print('x_train.shape: {}'.format(x_train.shape))
print('y_train.shape: {}'.format(y_train.shape))
print('Non fraud in train set:', sum(y_train == 0))
print('Fraud in train set:', sum(y_train == 1))
print('x_test.shape: {}'.format(x_test.shape))
print('y_test.shape: {}'.format(y_test.shape))
print('Non fraud in test set:', sum(y_test == 0))
print('Fraud in test set:', sum(y_test == 1))

x_train.shape: (227845, 29)
y_train.shape: (227845,)
Non fraud in train set: 227451
Fraud in train set: 394
x_test.shape: (56962, 29)
y_test.shape: (56962,)
Non fraud in test set: 56864
Fraud in test set: 98


Shapes and fraud distribution look good after the split.
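
As a quick sanity check of the stratification, the sketch below recomputes the fraud ratios and the test fraction from the counts printed above. The variable names are illustrative.

```python
# Counts printed by the split cell above.
train_fraud, train_total = 394, 227845
test_fraud, test_total = 98, 56962

# Fraud ratio in each set; with stratification they should be nearly equal.
train_ratio = train_fraud / train_total
test_ratio = test_fraud / test_total
print('fraud ratio train: {:.4%}'.format(train_ratio))
print('fraud ratio test:  {:.4%}'.format(test_ratio))

# Fraction of all rows that ended up in the test set (requested: 0.20).
test_fraction = test_total / (train_total + test_total)
print('test fraction: {:.4f}'.format(test_fraction))
```

The two ratios cannot match exactly because the 492 fraud cases have to be split into whole rows.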

# Do we need to scale the data?
Yes. See below.

In [11]:
# this should all be zero
np.mean(x_train, axis=0)


array([ 7.95290922e-04, -4.81013687e-04, -3.83022392e-04, -1.97223997e-05,
        1.74584369e-04, -1.17118617e-03,  7.67387574e-05, -4.33928397e-04,
        7.01756965e-04, -3.89440775e-04, -7.87247374e-04,  2.71812789e-03,
       -4.93121683e-04, -5.86174977e-05,  7.44064525e-04, -1.06990340e-03,
        3.90248279e-04,  5.62369121e-05,  6.63511187e-04, -9.98938638e-04,
        3.69741873e-04,  3.02706709e-04,  5.03893413e-05, -2.77277157e-04,
       -6.55396231e-04,  1.26095357e-04, -7.04733971e-05,  1.53721000e-04,
        8.81762977e+01])

The last value (the `Amount` column) is far from 0.

In [12]:
# this should all be 1
np.std(x_train, axis=0)


array([  1.95892123,   1.64908876,   1.51528902,   1.41586397,
         1.37956617,   1.33121479,   1.23842802,   1.19147935,
         1.0973489 ,   1.08624961,   1.01919752,   0.9965068 ,
         0.99410071,   0.95599104,   0.91514082,   0.87424144,
         0.84423309,   0.83887189,   0.81392879,   0.77080421,
         0.73496097,   0.72573391,   0.62741717,   0.60532034,
         0.52123459,   0.48193803,   0.40493302,   0.32693846,
       250.72205134])

The last value (the `Amount` column) is much larger than 1, and the other columns deviate from 1 as well.

# Scale
First we fit the scaler on the train set only and then use it to scale both the train and the test set. This way no information from the test set leaks into the preprocessing.

In [13]:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train) # fit and transform in one step
x_test = scaler.transform(x_test)
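
What the scaler does can be sketched by hand with NumPy. The example below uses invented toy arrays (`x_train_toy` and `x_test_toy` are hypothetical names) and is only meant to illustrate the idea of taking the statistics from the train set and applying them to both sets; it is not a replacement for `StandardScaler`.

```python
import numpy as np

# Toy data with invented values; the second column mimics an unscaled amount.
x_train_toy = np.array([[0.0, 10.0], [1.0, 200.0], [2.0, 90.0]])
x_test_toy = np.array([[1.0, 100.0], [3.0, 50.0]])

# "Fit": the statistics come from the train set only.
train_mean = x_train_toy.mean(axis=0)
train_std = x_train_toy.std(axis=0)  # population std (ddof=0)

# "Transform": apply the same train statistics to both sets.
x_train_scaled = (x_train_toy - train_mean) / train_std
x_test_scaled = (x_test_toy - train_mean) / train_std

print('train mean after scaling:', x_train_scaled.mean(axis=0))
print('train std after scaling: ', x_train_scaled.std(axis=0))
print('test mean after scaling: ', x_test_scaled.mean(axis=0))
```

Only the train set ends up with mean 0 and standard deviation 1. The test set is usually slightly off, which is expected because its own statistics were never used.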


# Scaling ok now?
Yes. It is.

In [14]:
# this should all be zero
np.mean(x_train, axis=0)


array([-2.32333336e-17, -3.71305513e-17,  1.42867911e-18, -1.38308270e-17,
        1.19600712e-17, -1.29258426e-17,  9.00313908e-17, -2.18386721e-17,
       -5.10488131e-18, -2.65148614e-18, -1.75819624e-18,  1.15741524e-17,
        1.88963764e-18,  4.17006678e-18, -1.88232858e-18, -4.77779139e-17,
        1.46770953e-17, -1.46766080e-17,  3.12769626e-17,  4.66440339e-19,
       -1.10884648e-17,  3.34774793e-17, -1.33478195e-17,  1.26963379e-17,
       -2.10359842e-17, -4.68993640e-17,  2.35824634e-17,  2.05876947e-17,
        7.10472435e-14])

In [15]:
# this should all be one
np.std(x_train, axis=0)


array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

# Check the shapes again

In [16]:
print('x_train.shape:', x_train.shape)
print('y_train.shape:', y_train.shape)
print('x_test.shape:', x_test.shape)
print('y_test.shape:', y_test.shape)


x_train.shape: (227845, 29)
y_train.shape: (227845,)
x_test.shape: (56962, 29)
y_test.shape: (56962,)


# Save the data

In [17]:
x_train_df = pd.DataFrame(x_train)
y_train_df = pd.DataFrame(y_train)
x_test_df = pd.DataFrame(x_test)
y_test_df = pd.DataFrame(y_test)

# we save with index=False so just the data is saved and not the pandas index
x_train_df.to_csv('./data/x_train.csv', index=False)
y_train_df.to_csv('./data/y_train.csv', index=False)
x_test_df.to_csv('./data/x_test.csv', index=False)
y_test_df.to_csv('./data/y_test.csv', index=False)


# Load data again and check shape

In [18]:
x_train_read = pd.read_csv('./data/x_train.csv').values
y_train_read = pd.read_csv('./data/y_train.csv').values[:,0]
x_test_read = pd.read_csv('./data/x_test.csv').values
y_test_read = pd.read_csv('./data/y_test.csv').values[:,0]

print('x_train_read.shape:', x_train_read.shape)
print('y_train_read.shape:', y_train_read.shape)
print('x_test_read.shape:', x_test_read.shape)
print('y_test_read.shape:', y_test_read.shape)


x_train_read.shape: (227845, 29)
y_train_read.shape: (227845,)
x_test_read.shape: (56962, 29)
y_test_read.shape: (56962,)


# Test if loaded data and saved data are the same

In [19]:
# all must be true
print(np.allclose(x_train_read, x_train))
print(np.allclose(y_train_read, y_train))
print(np.allclose(x_test_read, x_test))
print(np.allclose(y_test_read, y_test))


True
True
True
True
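
The same save / load check can be tried without touching the disk. The following sketch writes a small random array (invented data) to an in-memory CSV buffer and reads it back. `np.allclose` is used instead of an exact comparison as a precaution, on the assumption that floats may not survive the text round trip bit for bit.

```python
import io

import numpy as np
import pandas as pd

# Small random array standing in for x_train (invented data).
rng = np.random.default_rng(0)
original = rng.normal(size=(5, 3))

# Write to an in-memory buffer instead of a file, without the pandas index.
buffer = io.StringIO()
pd.DataFrame(original).to_csv(buffer, index=False)

# Rewind the buffer and read the data back.
buffer.seek(0)
restored = pd.read_csv(buffer).values

print('same shape:', restored.shape == original.shape)
print('same values:', np.allclose(restored, original))
```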
