# Introduction

This notebook prepares the dataset for modelling in three steps:
* Train:Test split
* Scaling via Robust Scaler
* Resampling via SMOTE and/or Random Undersampling

# Import Libraries

In [1]:
# Dataframes
import pandas as pd
import numpy as np
# Data Preparation
    # Train:Test
from sklearn.model_selection import train_test_split
    # Scaling
from sklearn.preprocessing import RobustScaler

# Resampling
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
from imblearn.pipeline import make_pipeline, Pipeline

In [2]:
# Import functions.py file
import sys
sys.path.append('../')

from functions.functions import resample_training_data, convert_array_to_dataframe

In [3]:
random_seed = 1

# Import Cleaned Data

In [4]:
df = pd.read_csv('../data/processed/cleaned_dataframe.gz', compression='gzip')
df.head(3)

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0


# Train:Test Split

In [5]:
X = df.drop(['Class'], axis=1)
y = df['Class']

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.3, 
                                                    stratify=y, 
                                                    random_state=random_seed)

print("No. of samples in each training set:\t{}".format(X_train.shape[0]))
print("No. of samples in each test set:\t{}".format(X_test.shape[0]))

No. of samples in each training set:	199364
No. of samples in each test set:	85443


* We stratify the split so that the class distribution is the same in the training set and the test set.
* Because fraudulent transactions are so rare, stratifying also ensures that the test set contains at least some of them.
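
To make the effect of stratifying concrete, here is a minimal pure-Python sketch of a stratified split. It is not how `train_test_split` is implemented; the helper name `stratified_split` and the toy labels are assumptions made purely for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with every class split in the same proportion.

    Hypothetical helper for illustration only; not scikit-learn's implementation.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then cut off the test share of that class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=1)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

In this toy case both subsets keep the original 99:1 class ratio, and the test subset is guaranteed to receive some of the rare class.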

# Scaling

* Scaling the data **improves the predictive performance** of some machine learning models.  
    * Without scaling, models fitted with gradient-based optimisation (such as linear or logistic regression) may take longer to converge, or may not converge at all.

In [6]:
scaler = RobustScaler()

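* As seen earlier in the "Distribution of Amount", the data is not normally distributed and contains **many outliers, so we use Robust Scaling** rather than Standard Scaling. Robust Scaling centres each feature on its median and scales it by its interquartile range, both of which are far less sensitive to outliers than the mean and standard deviation.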
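
As a rough illustration of why outliers matter here, the sketch below compares the statistics behind Standard Scaling (mean, standard deviation) with those behind Robust Scaling (median, interquartile range) on a single feature. The values in `amounts` are made up for the example and are not taken from the dataset.

```python
import statistics

# Toy "Amount" values with one extreme outlier (made-up numbers).
amounts = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0]

# Standard scaling statistics: both are pulled towards the outlier.
mean = statistics.mean(amounts)
std = statistics.pstdev(amounts)

# Robust scaling statistics: median and interquartile range (IQR).
median = statistics.median(amounts)
q1, _, q3 = statistics.quantiles(amounts, n=4, method="inclusive")
iqr = q3 - q1

standard_scaled = [(x - mean) / std for x in amounts]
robust_scaled = [(x - median) / iqr for x in amounts]

print("mean={:.2f} std={:.2f}".format(mean, std))
print("median={:.2f} iqr={:.2f}".format(median, iqr))
```

With standard scaling the eight ordinary values are squeezed into a very narrow band because the single outlier inflates the standard deviation; with robust scaling they stay spread out and only the outlier lands far from zero.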

In [7]:
# Fit scaler on training data and transform Training data
X_train = scaler.fit_transform(X_train)

# Use the scaler to transform the Test data
X_test = scaler.transform(X_test)

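* We must be careful to only **scale the test data using the scaling parameters learned on the train data**. Fitting the scaler on the test data would leak information about the test set into the preprocessing.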
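
The fit-on-train, transform-on-test pattern can be sketched with a tiny single-feature scaler. `SimpleRobustScaler` is a hypothetical class written for this illustration; it mimics the `fit` / `transform` split but is not scikit-learn's `RobustScaler`.

```python
import statistics


class SimpleRobustScaler:
    """Single-feature robust scaler sketch: (x - median) / IQR.

    Hypothetical class for illustration; not scikit-learn's RobustScaler.
    """

    def fit(self, values):
        # Learn the centre and scale from the training data only.
        self.center_ = statistics.median(values)
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        self.scale_ = q3 - q1
        return self

    def transform(self, values):
        # Reuse the stored parameters; nothing is re-estimated here.
        return [(x - self.center_) / self.scale_ for x in values]


train = [10.0, 20.0, 30.0, 40.0, 50.0]
test = [30.0, 70.0]

scaler = SimpleRobustScaler().fit(train)   # parameters come from train
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)       # test reuses the train parameters
print(test_scaled)
```

The test values are expressed relative to the training median and IQR, so a test point far outside the training range simply maps to a large scaled value instead of shifting the scale itself.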

# Resampling

We use resampling techniques to counter the class imbalance and so improve the effectiveness of our machine learning models.
* Resampling draws samples from the original dataset to create a new dataset in which either the share of the majority class is reduced (undersampling) or the share of the minority class is increased (oversampling). A combination of the two can also be used.

For this project we will be using **Random Undersampling** and **Synthetic Minority Oversampling TEchnique (SMOTE)**.

### Random Undersampling
* This technique under-samples the majority class (legitimate transactions) randomly and uniformly.  
* This can lead to a loss of information, but if the transactions have similar feature values this loss will be minimized. 
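
A minimal sketch of random undersampling, assuming a binary label list and a target minority-to-majority ratio. The helper `random_undersample` is hypothetical and written only to show the idea; it is not imbalanced-learn's implementation.

```python
import random
from collections import Counter


def random_undersample(labels, ratio, seed, majority=0, minority=1):
    """Return the kept row indices so that n_minority / n_majority equals `ratio`.

    Hypothetical helper for illustration; not imbalanced-learn's implementation.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    minority_idx = [i for i, y in enumerate(labels) if y == minority]

    # Number of majority rows to keep for the requested ratio.
    n_keep = int(len(minority_idx) / ratio)

    # Uniform random choice without replacement; all minority rows are kept.
    kept_majority = rng.sample(majority_idx, n_keep)
    return sorted(kept_majority + minority_idx)


# Toy labels: 1000 legitimate (0) and 20 fraudulent (1) transactions.
labels = [0] * 1000 + [1] * 20
kept = random_undersample(labels, ratio=0.5, seed=1)
counts = Counter(labels[i] for i in kept)
print(counts)
```

Every discarded majority row is information the model never sees, which is the loss of information mentioned above.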

### SMOTE
* This technique over-samples the minority class by generating synthetic data.
* This new data is based on the feature space similarities between fraudulent transactions. 
    * It finds the K-nearest neighbors of an individual fraudulent transaction and randomly selects one of them.  
    * A new fraudulent transaction is then synthetically generated in between the original fraud transaction and its neighbor.
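
The interpolation step described above can be sketched in a few lines. This is a simplified illustration of the SMOTE idea on toy two-dimensional points, not imbalanced-learn's implementation; the function name `smote_sample` and the points in `fraud` are assumptions.

```python
import math
import random


def smote_sample(minority, k, rng):
    """Generate one synthetic point from a list of minority-class feature vectors.

    Simplified illustration of the SMOTE idea; not imbalanced-learn's implementation.
    """
    # Pick a minority point at random.
    base = rng.choice(minority)

    # Find its k nearest minority neighbours (excluding itself) by Euclidean distance.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]

    # Pick one neighbour and interpolate at a random position on the segment
    # between the base point and that neighbour.
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # value in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


rng = random.Random(1)
fraud = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
synthetic = [smote_sample(fraud, k=2, rng=rng) for _ in range(100)]
print(synthetic[:3])
```

Because each synthetic point lies on a line segment between two real minority points, the new samples stay inside the region already occupied by the minority class.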
    
Note that we **do not resample the test data**, as it represents unseen data. If it is unseen, we will not know which class it falls in, and thus we will not know whether to undersample or oversample it.

In [8]:
# Class distribution before resampling
counter = Counter(y_train)
print('Distribution of data before resampling:')
ratio = round(counter[0]/counter[1], 1)
print('\t0: {:<15}1: {:^10}{:>15} : 1'.format(counter[0], counter[1], ratio))

# Resample the dataset with SMOTE
resample_smote = SMOTE(sampling_strategy=0.1, random_state=random_seed)
X_train_smote, y_train_smote = resample_training_data(resample_smote, X_train, y_train, 'after SMOTE')

# Resample the dataset with Random Undersampling
resample_under = RandomUnderSampler(sampling_strategy=0.5, random_state=random_seed)
X_train_under, y_train_under = resample_training_data(resample_under, X_train, y_train, 'after Random Undersampling')

# Resample the dataset with SMOTE and Random Undersampling
over = SMOTE(sampling_strategy=0.1, random_state=random_seed)
under = RandomUnderSampler(sampling_strategy=0.5, random_state=random_seed)
steps = [('o', over), ('u', under)]
resample_smote_under = Pipeline(steps=steps)
X_train_smote_under, y_train_smote_under = resample_training_data(resample_smote_under, X_train, y_train, 'after SMOTE and Random Undersampling')

Distribution of data before resampling:
	0: 199020         1:    344              578.5 : 1

Distribution of Training Data after SMOTE:
	0: 199020         1:   19902              10.0 : 1

Distribution of Training Data after Random Undersampling:
	0: 688            1:    344                2.0 : 1

Distribution of Training Data after SMOTE and Random Undersampling:
	0: 39804          1:   19902               2.0 : 1
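
The counts above follow from how a float `sampling_strategy` is interpreted: it is the desired ratio of minority to majority samples after resampling. A quick arithmetic check, using only the class counts printed above:

```python
# Class counts in the training set before resampling (from the output above).
n_majority, n_minority = 199020, 344

# SMOTE with sampling_strategy=0.1: grow the minority class until
# n_minority / n_majority == 0.1; the majority class is untouched.
smote_minority = round(0.1 * n_majority)

# Random undersampling with sampling_strategy=0.5: shrink the majority class
# until n_minority / n_majority == 0.5; the minority class is untouched.
under_majority = round(n_minority / 0.5)

# SMOTE followed by undersampling: undersampling acts on the SMOTE output.
combined_minority = smote_minority
combined_majority = round(combined_minority / 0.5)

print(smote_minority, under_majority, combined_majority)
```

This reproduces the 19902, 688 and 39804 figures reported in the output.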


# Save new data

## Convert numpy arrays to pandas dataframes

In [9]:
# create list of training numpy arrays for ease of iteration
list_of_X_trains = [X_train, X_train_smote, X_train_under, X_train_smote_under]
list_of_y_trains = [y_train, y_train_smote, y_train_under, y_train_smote_under]

In [10]:
# Set columns for training and test dataframes
X_columns = df.columns[:-1]
y_columns = [df.columns[-1]]

In [11]:
# Original Train
X_train, input_dtype, output_dtype = convert_array_to_dataframe(X_train, X_columns)
print('X_train\n\tInput:\t{}\n\tOutput:\t{}\n'.format(input_dtype, output_dtype))
y_train, input_dtype, output_dtype = convert_array_to_dataframe(y_train, y_columns)
print('y_train\n\tInput:\t{}\n\tOutput:\t{}\n'.format(input_dtype, output_dtype))

# Original Test
X_test, input_dtype, output_dtype = convert_array_to_dataframe(X_test, X_columns)
print('X_test\n\tInput:\t{}\n\tOutput:\t{}\n'.format(input_dtype, output_dtype))
y_test, input_dtype, output_dtype = convert_array_to_dataframe(y_test, y_columns)
print('y_test\n\tInput:\t{}\n\tOutput:\t{}\n'.format(input_dtype, output_dtype))

# SMOTE Train
X_train_smote, input_dtype, output_dtype = convert_array_to_dataframe(X_train_smote, X_columns)
print('X_train_smote\n\tInput:\t{}\n\tOutput:\t{}\n'.format(input_dtype, output_dtype))
y_train_smote, input_dtype, output_dtype = convert_array_to_dataframe(y_train_smote, y_columns)
print('y_train_smote\n\tInput:\t{}\n\tOutput:\t{}\n'.format(input_dtype, output_dtype))

# Undersampled Train
X_train_under, input_dtype, output_dtype = convert_array_to_dataframe(X_train_under, X_columns)
print('X_train_under\n\tInput:\t{}\n\tOutput:\t{}\n'.format(input_dtype, output_dtype))
y_train_under, input_dtype, output_dtype = convert_array_to_dataframe(y_train_under, y_columns)
print('y_train_under\n\tInput:\t{}\n\tOutput:\t{}\n'.format(input_dtype, output_dtype))

# SMOTE and Undersampled Train
X_train_smote_under, input_dtype, output_dtype = convert_array_to_dataframe(X_train_smote_under, X_columns)
print('X_train_smote_under\n\tInput:\t{}\n\tOutput:\t{}\n'.format(input_dtype, output_dtype))
y_train_smote_under, input_dtype, output_dtype = convert_array_to_dataframe(y_train_smote_under, y_columns)
print('y_train_smote_under\n\tInput:\t{}\n\tOutput:\t{}\n'.format(input_dtype, output_dtype))

X_train
	Input:	<class 'numpy.ndarray'>
	Output:	<class 'pandas.core.frame.DataFrame'>

y_train
	Input:	<class 'pandas.core.series.Series'>
	Output:	<class 'pandas.core.frame.DataFrame'>

X_test
	Input:	<class 'numpy.ndarray'>
	Output:	<class 'pandas.core.frame.DataFrame'>

y_test
	Input:	<class 'pandas.core.series.Series'>
	Output:	<class 'pandas.core.frame.DataFrame'>

X_train_smote
	Input:	<class 'numpy.ndarray'>
	Output:	<class 'pandas.core.frame.DataFrame'>

y_train_smote
	Input:	<class 'pandas.core.series.Series'>
	Output:	<class 'pandas.core.frame.DataFrame'>

X_train_under
	Input:	<class 'numpy.ndarray'>
	Output:	<class 'pandas.core.frame.DataFrame'>

y_train_under
	Input:	<class 'pandas.core.series.Series'>
	Output:	<class 'pandas.core.frame.DataFrame'>

X_train_smote_under
	Input:	<class 'numpy.ndarray'>
	Output:	<class 'pandas.core.frame.DataFrame'>

y_train_smote_under
	Input:	<class 'pandas.core.series.Series'>
	Output:	<class 'pandas.core.frame.DataFrame'>



## Save Training Data

In [12]:
# Original data
X_train.to_csv('../data/processed/X_train.gz', index=False, compression='gzip')
y_train.to_csv('../data/processed/y_train.gz', index=False, compression='gzip', header=True)

# Resampled with SMOTE
X_train_smote.to_csv('../data/processed/X_train_smote.gz', index=False, compression='gzip')
y_train_smote.to_csv('../data/processed/y_train_smote.gz', index=False, compression='gzip', header=True)

# Resampled with Random Undersampling
X_train_under.to_csv('../data/processed/X_train_under.gz', index=False, compression='gzip')
y_train_under.to_csv('../data/processed/y_train_under.gz', index=False, compression='gzip', header=True)

# Resampled with SMOTE and Random Undersampling
X_train_smote_under.to_csv('../data/processed/X_train_smote_under.gz', index=False, compression='gzip')
y_train_smote_under.to_csv('../data/processed/y_train_smote_under.gz', index=False, compression='gzip', header=True)

## Save Test Data

In [13]:
X_test.to_csv('../data/processed/X_test.gz', index=False, compression='gzip')
y_test.to_csv('../data/processed/y_test.gz', index=False, compression='gzip', header=True)