# Feature Engineering

### Action Plan
- Scale the "Amount" and "Time" features to match the remaining (PCA-transformed) features
- Check "describe" on features to confirm scale
- Will need to subsample and/or oversample to balance classes
- Do comparative heatmaps, boxplots, histplots (see if I can save images from EDA)
- Retry easy/difficult visualization on sampled data
- Do the train/test split BEFORE sampling, and resample only the training set so the test set keeps the real class balance (see https://imbalanced-learn.org/stable/common_pitfalls.html)
- Remember outlier corrections
- Try both subsampling and oversampling to see if either produces "better" results
- Extra credit: repeat from the sampling step as a kind of cross-validation of the sampling tactic
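
The split-versus-sample ordering in the plan can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual pipeline: the array shapes, the 1% positive rate, and the hand-rolled `undersample` helper are assumptions made for the example (the helper is a plain NumPy stand-in for a library undersampler). The only point is that the split happens first and the test set is never resampled.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic, heavily imbalanced data (assumed for illustration): 1% positives.
X = rng.normal(size=(10_000, 5))
y = np.zeros(10_000, dtype=int)
y[:100] = 1

# 1) Split first, stratified so both parts keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

def undersample(X_part, y_part, rng):
    """Hypothetical helper: keep all positives plus an equal-sized random draw of negatives."""
    pos_idx = np.flatnonzero(y_part == 1)
    neg_idx = np.flatnonzero(y_part == 0)
    keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    keep = np.concatenate([pos_idx, keep_neg])
    return X_part[keep], y_part[keep]

# 2) Resample the training part only; the test part stays as it is.
X_train_bal, y_train_bal = undersample(X_train, y_train, rng)

print("balanced train:", np.bincount(y_train_bal))
print("untouched test:", np.bincount(y_test))
```

Any metric computed on `X_test` then reflects the real class imbalance rather than the artificial 50/50 mix used for training.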

### Notes:
- You may need to install imblearn (imbalanced-learn)
- You may need to download data yourself (too big to upload with my current GitHub file size limit)
    -  https://www.kaggle.com/mlg-ulb/creditcardfraud
- Credit to https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets
- and to https://www.kaggle.com/hazratnit/credit-fraud-detection
- easy/difficult visualization from https://www.researchgate.net/publication/283349138_Calibrating_Probability_with_Undersampling_for_Unbalanced_Classification (Section III-D)

In [2]:
import numpy as np 
import pandas as pd 
# import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
# from sklearn.manifold import TSNE
# from sklearn.decomposition import PCA, TruncatedSVD
# import matplotlib.patches as mpatches
# import time

# Classifier Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# import collections


# Other Libraries
from sklearn.model_selection import train_test_split
# from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
from collections import Counter
from sklearn.model_selection import KFold, StratifiedKFold
# import warnings
# warnings.filterwarnings("ignore")


# Forward slash avoids the invalid "\c" escape sequence and works on all platforms
df = pd.read_csv('data/creditcard.csv')

In [4]:
# Check "describe" on features to confirm scale
df.describe()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | ... | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 |
| mean | 94813.859575 | 3.918649e-15 | 5.682686e-16 | -8.761736e-15 | 2.811118e-15 | -1.552103e-15 | 2.04013e-15 | -1.698953e-15 | -1.893285e-16 | -3.14764e-15 | ... | 1.47312e-16 | 8.042109e-16 | 5.282512e-16 | 4.456271e-15 | 1.426896e-15 | 1.70164e-15 | -3.662252e-16 | -1.217809e-16 | 88.349619 | 0.001727 |
| std | 47488.145955 | 1.958696 | 1.651309 | 1.516255 | 1.415869 | 1.380247 | 1.332271 | 1.237094 | 1.194353 | 1.098632 | ... | 0.734524 | 0.7257016 | 0.6244603 | 0.6056471 | 0.5212781 | 0.482227 | 0.4036325 | 0.3300833 | 250.120109 | 0.041527 |
| min | 0.0 | -56.40751 | -72.71573 | -48.32559 | -5.683171 | -113.7433 | -26.16051 | -43.55724 | -73.21672 | -13.43407 | ... | -34.83038 | -10.93314 | -44.80774 | -2.836627 | -10.2954 | -2.604551 | -22.56568 | -15.43008 | 0.0 | 0.0 |
| 25% | 54201.5 | -0.9203734 | -0.5985499 | -0.8903648 | -0.8486401 | -0.6915971 | -0.7682956 | -0.5540759 | -0.2086297 | -0.6430976 | ... | -0.2283949 | -0.5423504 | -0.1618463 | -0.3545861 | -0.3171451 | -0.3269839 | -0.07083953 | -0.05295979 | 5.6 | 0.0 |
| 50% | 84692.0 | 0.0181088 | 0.06548556 | 0.1798463 | -0.01984653 | -0.05433583 | -0.2741871 | 0.04010308 | 0.02235804 | -0.05142873 | ... | -0.02945017 | 0.006781943 | -0.01119293 | 0.04097606 | 0.0165935 | -0.05213911 | 0.001342146 | 0.01124383 | 22.0 | 0.0 |
| 75% | 139320.5 | 1.315642 | 0.8037239 | 1.027196 | 0.7433413 | 0.6119264 | 0.3985649 | 0.5704361 | 0.3273459 | 0.597139 | ... | 0.1863772 | 0.5285536 | 0.1476421 | 0.4395266 | 0.3507156 | 0.2409522 | 0.09104512 | 0.07827995 | 77.165 | 0.0 |
| max | 172792.0 | 2.45493 | 22.05773 | 9.382558 | 16.87534 | 34.80167 | 73.30163 | 120.5895 | 20.00721 | 15.59499 | ... | 27.20284 | 10.50309 | 22.52841 | 4.584549 | 7.519589 | 3.517346 | 31.6122 | 33.84781 | 25691.16 | 1.0 |


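Hmpf. The V features shown are all centered near zero with standard deviations between roughly 0.3 and 2, but their extremes run from about -114 (V5) to about +121 (V7), so the ranges are not very uniform. Time and Amount are on completely different scales again. Let's see what the scaler produces.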

In [11]:
# Scale "amount" and "time" features to match remaining
from sklearn.preprocessing import StandardScaler, RobustScaler

# RobustScaler centers on the median and scales by the IQR, so it is less sensitive to outliers than StandardScaler.

std_scaler = StandardScaler()
rob_scaler = RobustScaler()

df['std_scaled_amount'] = std_scaler.fit_transform(df['Amount'].values.reshape(-1,1))
df['std_scaled_time'] = std_scaler.fit_transform(df['Time'].values.reshape(-1,1))

df['rob_scaled_amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1,1))
df['rob_scaled_time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1,1))

df[['std_scaled_amount', 'std_scaled_time', 'rob_scaled_amount', 'rob_scaled_time']].describe()


| | std_scaled_amount | std_scaled_time | rob_scaled_amount | rob_scaled_time |
|---|---|---|---|---|
| count | 284807.0 | 284807.0 | 284807.0 | 284807.0 |
| mean | 3.202236e-16 | -1.050379e-14 | 0.927124 | 0.118914 |
| std | 1.000002 | 1.000002 | 3.495006 | 0.557903 |
| min | -0.3532294 | -1.996583 | -0.307413 | -0.994983 |
| 25% | -0.3308401 | -0.855212 | -0.229162 | -0.35821 |
| 50% | -0.2652715 | -0.2131453 | 0.0 | 0.0 |
| 75% | -0.04471707 | 0.9372174 | 0.770838 | 0.64179 |
| max | 102.3622 | 1.642058 | 358.683155 | 1.035022 |
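
As a sanity check, the extremes in this table can be reproduced by hand from the `describe()` output for `Amount` further up. StandardScaler computes (x - mean) / std, and RobustScaler with its default quantile range computes (x - median) / IQR. The sketch below just plugs in the printed summary values (the function names are made up for the example). Small differences in the last digits are expected, because the printed values are rounded and `describe()` reports the sample standard deviation while StandardScaler uses the population one.

```python
# Summary statistics for Amount, copied from the describe() output above.
amount_mean, amount_std = 88.349619, 250.120109
amount_q1, amount_median, amount_q3 = 5.6, 22.0, 77.165
amount_min, amount_max = 0.0, 25691.16

def standard_scale(x):
    # StandardScaler: subtract the mean, divide by the standard deviation.
    return (x - amount_mean) / amount_std

def robust_scale(x):
    # RobustScaler (default quantile range): subtract the median, divide by the IQR.
    return (x - amount_median) / (amount_q3 - amount_q1)

print("standard min/max:", standard_scale(amount_min), standard_scale(amount_max))
print("robust   min/max:", robust_scale(amount_min), robust_scale(amount_max))
```

Either way the largest transaction ends up far from the bulk of the data (about 102 standard deviations, or about 359 IQRs, above the center), which shows how heavy the right tail of `Amount` is.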


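I will use the standard scaler: its output has mean 0 and standard deviation 1, which is closest to the PCA-transformed V features (means near 0, standard deviations of roughly 0.3 to 2).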
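
One thing worth noting: the scaling cell above fits the scalers on the full dataset, before the train/test split further down, so the test rows contribute to the mean and standard deviation. With 284,807 rows the effect is probably tiny, but a leakage-free variant would fit on the training rows only. A minimal sketch on synthetic data (the column values and variable names are assumptions for the example, not the real data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the raw Amount and Time columns (assumed values).
X = pd.DataFrame({
    "Amount": rng.exponential(scale=90.0, size=1_000),
    "Time": rng.uniform(0, 172_792, size=1_000),
})
y = rng.integers(0, 2, size=1_000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only...
X_train_scaled = scaler.fit_transform(X_train)
# ...and reuse those same statistics for the test rows.
X_test_scaled = scaler.transform(X_test)

print("train mean/std:", X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print("test  mean/std:", X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```

The training columns come out with mean 0 and standard deviation 1 by construction. The test columns will be close to that but not exact, which is what an honest hold-out set should look like.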

In [12]:
df.drop(['Time','Amount', 'rob_scaled_amount', 'rob_scaled_time'], axis=1, inplace=True)

In [15]:
# train/test split BEFORE under or over sampling
X = df.drop('Class', axis=1)
y = df['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# split Class balance evaluation
classcount = y_train.value_counts()
perc_pos = (classcount[1]/(classcount[0]+classcount[1]))*100
print("Training Data contains {} fraudulent transactions which are {:.3f}% of all transactions".format(classcount[1], perc_pos))
classcount = y_test.value_counts()
perc_pos = (classcount[1]/(classcount[0]+classcount[1]))*100
print("Test Data contains {} fraudulent transactions which are {:.3f}% of all transactions".format(classcount[1], perc_pos))
print("Original Data contains 492 fraudulent transactions which are 0.173% of all transactions") # calculated in data wrangling

Training Data contains 344 fraudulent transactions which are 0.173% of all transactions
Test Data contains 148 fraudulent transactions which are 0.173% of all transactions
Original Data contains 492 fraudulent transactions which are 0.173% of all transactions


In [None]:
# Will need to subsample and/or oversample to balance classes
