This notebook contains the work for Step 4 of the Data Science Method.

The Data Science Method

1. Problem Identification

2. Data Wrangling

    . Data Collection
    . Data Organization
    . Data Definition
    . Data Cleaning

3. Exploratory Data Analysis

    . Build data profile tables and plots
    . Outliers & Anomalies
    . Explore data relationships
    . Identification and creation of features

4. Pre-processing and Training Data Development

    . Create dummy or indicator features for categorical variables
    . Standardize the magnitude of numeric features
    . Split into testing and training datasets
    . Apply scaler to the testing set

5. Modeling

    . Fit Models with Training Data Set
    . Review Model Outcomes - Iterate over additional models as needed
    . Identify the Final Model

6. Documentation

    . Review the Results
    . Present and share your findings - storytelling
    . Finalize Code
    . Finalize Documentation


In [44]:
# load Python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
from scipy.stats import chi2_contingency
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [45]:
# set options
pd.set_option('display.max_rows', 1500)

In [46]:
# load the data saved from step 3
df=pd.read_csv('C:\\Users\\arna_mora\\Springboard\\unit 7\\creditcard.csv')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [47]:
df.dtypes

Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object

Data Pre-processing

First, let's look at how the transactions are distributed between the fraudulent and valid classes.

In [48]:
valid_df = df[df['Class'] == 0]
fraud_df = df[df['Class'] == 1]

print("Number of valid transactions: " + str(len(valid_df.index)))
print("Number of fraudulent transactions: " + str(len(fraud_df.index)))

Number of valid transactions: 283253
Number of fraudulent transactions: 473


The classes are heavily imbalanced, so let's start by equalizing the number of fraudulent and valid transactions. With balanced classes, the model sees fraud and non-fraud equally often during training, which suits a binary outcome.
If we left the distribution as is, the model would be heavily skewed towards the valid class, and the 473 fraudulent transactions would have almost no effect on it.
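
To make the imbalance concrete, here is a small sketch of the arithmetic using the class counts printed above. It shows why plain accuracy is misleading on the raw data: a hypothetical model that labels every transaction as valid would already score very high.

```python
# Class counts as printed earlier in this notebook.
n_valid = 283253
n_fraud = 473
n_total = n_valid + n_fraud

# Share of transactions that are fraudulent.
fraud_rate = n_fraud / n_total

# Accuracy of a trivial (hypothetical) model that always predicts "valid".
baseline_accuracy = n_valid / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Always-valid baseline accuracy: {baseline_accuracy:.4%}")
```

Any model trained on the raw data has to beat that baseline to be useful, which is one reason to balance the classes first.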

Equalization of data between fraudulent and nonfraudulent.

The cell below takes the first 473 valid transactions in file order (the `[:473]` slice is not a random sample), combines them with the 473 fraudulent transactions, and shuffles the result.

In [49]:
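# Take the first 473 valid transactions (a positional slice, not a random sample)
valid_df = df.loc[df['Class'] == 0][:473]
# Combine with the fraudulent transactions and shuffle the rows reproducibly
equalized_df = pd.concat([fraud_df, valid_df])
equalized_df = equalized_df.sample(frac = 1, random_state = 42)
# Re-derive the per-class frames from the balanced data to check the counts
valid_df = equalized_df[equalized_df['Class'] == 0]
fraud_df = equalized_df[equalized_df['Class'] == 1]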

print("Number of nonfraudulent transactions: " + str(len(valid_df.index)))
print("Number of fraudulent transactions: " + str(len(fraud_df.index)))

Number of nonfraudulent transactions: 473
Number of fraudulent transactions: 473
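
If a random sample of valid transactions is wanted instead of a fixed slice of rows, `DataFrame.sample` can draw one. Below is a minimal sketch on a small synthetic frame. The helper name `undersample_majority` and the toy data are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd


def undersample_majority(df, label_col="Class", random_state=42):
    """Randomly downsample the larger class to the size of the smaller one."""
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    n_minority = counts.min()

    minority = df[df[label_col] == minority_label]
    # Draw a random subset of the majority class, without replacement.
    majority = df[df[label_col] != minority_label].sample(
        n=n_minority, random_state=random_state
    )
    # Concatenate and shuffle so the classes are interleaved.
    return pd.concat([minority, majority]).sample(frac=1, random_state=random_state)


# Toy data (assumption): 20 "fraud" rows and 980 "valid" rows.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"V1": rng.normal(size=1000), "Class": [1] * 20 + [0] * 980})

balanced = undersample_majority(toy)
print(balanced["Class"].value_counts())
```

Sampling at random avoids tying the valid class to whatever rows happen to come first in the file, which here would be the earliest transactions by `Time`.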


Now that we've equalized the number of fraudulent and non-fraudulent transactions, the next step is normalization, so that features with large values do not dominate the others and skew the results.

Normalization scales values to lie between specified limits, usually -1 to 1 or 0 to 1. This matters because many machine learning models are sensitive to differences in feature magnitude. The cell below applies MinMaxScaler (default range 0 to 1) to Time and Amount only, and then drops both columns, so only V1 through V28 and Class remain in equalized_df.
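
As a quick illustration of what min-max scaling does, the sketch below applies the formula (x - min) / (max - min) by hand to the five Amount values shown in `df.head()` and checks that `MinMaxScaler` gives the same result. The variable names are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The five Amount values from the df.head() output, as a single-column array.
amounts = np.array([[149.62], [2.69], [378.66], [123.50], [69.99]])

# Manual min-max scaling to the [0, 1] range.
manual = (amounts - amounts.min()) / (amounts.max() - amounts.min())

# The same transformation with scikit-learn's default MinMaxScaler.
scaled = MinMaxScaler().fit_transform(amounts)

print(np.allclose(manual, scaled))
print(scaled.ravel())
```

The smallest value maps to 0 and the largest to 1, with everything else placed proportionally in between.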

In [60]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# fit_transform fits and transforms in one call, so no separate fit() is needed
equalized_df['Time'] = scaler.fit_transform(equalized_df['Time'].values.reshape(-1, 1))
equalized_df['Amount'] = scaler.fit_transform(equalized_df['Amount'].values.reshape(-1, 1))
# Note: the scaled Time and Amount columns are dropped here and not used downstream
equalized_df = equalized_df.drop(['Time','Amount'],axis=1)
equalized_df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Class
154103,-4.221221,2.871121,-5.888716,6.890952,-3.404894,-1.154394,-7.739928,2.851363,-2.507569,-5.110728,...,-0.227882,1.620591,1.567947,-0.578007,-0.059045,-1.829169,-0.072429,0.136734,-0.599848,1
8803,-4.696795,2.693867,-4.475133,5.467685,-1.556758,-1.54942,-4.104215,0.553934,-1.498468,-4.594952,...,-0.158971,0.573898,-0.080163,0.318408,-0.245862,0.338238,0.032271,-1.508458,0.608075,1
347,1.026702,-0.661665,0.897601,0.144403,-1.105205,0.020245,-0.707267,0.195918,0.743371,-0.118939,...,0.112745,0.150661,0.322398,-0.154736,0.097655,0.174639,1.103205,-0.060709,0.016598,0
232390,-1.611877,-0.40841,-3.829762,6.249462,-3.360922,1.147964,1.858425,0.474858,-3.838399,-1.445375,...,2.425677,1.245582,0.616383,2.251439,-0.066096,0.53871,0.541325,-0.136243,-0.009852,1
17262,-27.848181,15.598193,-28.923756,6.418442,-20.346228,-4.828202,-19.210896,18.329406,-3.668735,-8.009159,...,1.697856,1.802149,-2.062934,-1.269843,0.165409,1.999499,-0.211059,1.324809,0.38809,1


Let's split our data into training, validation, and test sets using train_test_split.

Our workflow has three phases: training, validation, and testing. Training comes first, and it's where the model develops "intuition" about how to tell fraudulent from non-fraudulent transactions.
The validation phase is how we check that the model isn't overfitting to the training data. Cross-validation will be used when calculating accuracies.
The testing phase is where we see how the model performs on held-out data whose outcome we know.
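
For reference, here is a minimal sketch of how cross-validated accuracy could be computed with scikit-learn. The data are synthetic (`make_classification`), and `LogisticRegression` is only a placeholder assumption, since this notebook has not chosen a model yet.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in (assumption): 946 rows and 28 features, roughly balanced classes.
X, y = make_classification(
    n_samples=946, n_features=28, n_informative=10, random_state=42
)

# Placeholder model; any scikit-learn classifier could be used here.
model = LogisticRegression(max_iter=1000)

# Five stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

Each fold is held out once while the model trains on the other four, so the mean score is less dependent on one particular split.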


In [54]:
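# Hold out 30% of the balanced data as the test set
training, test = train_test_split(equalized_df, train_size = 0.7, test_size = 0.3, shuffle=True)
# Split the remaining 70% again: 70% for training, 30% for validation
training, valid = train_test_split(training, train_size = 0.7, test_size = 0.3, shuffle=True)

# Separate the target column from the features
training_label = training.pop('Class')
test_label = test.pop('Class')
valid_label = valid.pop('Class')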

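print("Number of transactions in training dataset: ", len(training))
print("Number of transactions in validation dataset: ", len(valid))
print("Number of transactions in test dataset: ", len(test))
print("Total number of transactions: ", len(training) + len(valid) + len(test))

Number of transactions in training dataset:  463
Number of transactions in validation dataset:  199
Number of transactions in test dataset:  284
Total number of transactions:  946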
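
The outline for this step also lists applying the scaler to the testing set. One way to do that without leaking information from held-out rows is to fit the scaler on the training split only and then reuse it on the validation and test splits. The sketch below assumes `StandardScaler` and a synthetic frame. The toy columns and the `X_*`/`y_*` names are illustrative and are not this notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 946 balanced rows with two numeric features.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Amount": rng.exponential(scale=100.0, size=946),
    "V1": rng.normal(size=946),
    "Class": np.repeat([0, 1], 473),
})
X = toy.drop(columns="Class")
y = toy["Class"]

# First split off the test set, then carve a validation set out of the rest.
# stratify keeps the class ratio the same in every split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.3, stratify=y_train, random_state=42
)

# Fit the scaler on the training data only, then apply it everywhere.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_valid_s = scaler.transform(X_valid)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_valid), len(X_test))
```

Because the scaler only ever sees training rows, the validation and test scores reflect how the model would behave on data it has not seen.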
