In [99]:
import pandas as pd
import numpy as np
from sklearn.utils import class_weight
from sklearn.utils import resample

df_raw = pd.read_csv("Payments.crdownload")

Let's take another look at our data set.

In [100]:
df_raw.describe()

,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,5500631.0,5500631.0,5500630.0,5500630.0,5500630.0,5500630.0,5500630.0,5500630.0
mean,207.5349,181259.6,841164.7,863076.2,1084465.0,1213316.0,0.0007680938,5.453921e-07
std,111.2426,629042.3,2918859.0,2955706.0,3269174.0,3584992.0,0.02770386,0.0007385065
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,138.0,13355.06,0.0,0.0,0.0,0.0,0.0,0.0
50%,212.0,75502.8,13869.0,0.0,136265.8,219181.9,0.0,0.0
75%,302.0,209558.7,107363.0,144954.0,946798.6,1118979.0,0.0,0.0
max,380.0,92445520.0,43818860.0,43686620.0,355553400.0,356015900.0,1.0,1.0


In [101]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5500631 entries, 0 to 5500630
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         float64
 10  isFlaggedFraud  float64
dtypes: float64(7), int64(1), object(3)
memory usage: 461.6+ MB


In [102]:
df_raw.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     1
newbalanceOrig    1
nameDest          1
oldbalanceDest    1
newbalanceDest    1
isFraud           1
isFlaggedFraud    1
dtype: int64

In [103]:
df = df_raw.dropna()
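
Every affected column above reports exactly one missing value, which suggests (but does not prove) that a single incomplete row is responsible. The sketch below shows one way to confirm which rows dropna() removes. It runs on a small hypothetical frame named toy; the values are made up for illustration and are not taken from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_raw: the last row is incomplete.
toy = pd.DataFrame({
    "step": [1, 1, 2],
    "amount": [9839.64, 181.0, 500.0],
    "nameDest": ["M0000000001", "C0000000002", None],
    "isFraud": [0.0, 1.0, np.nan],
})

# Select the rows that contain at least one missing value.
incomplete = toy[toy.isnull().any(axis=1)]
print(incomplete)

# dropna() with default arguments removes exactly those rows.
cleaned = toy.dropna()
print(len(toy) - len(cleaned), "row(s) dropped")
```

Applying the same isnull().any(axis=1) filter to df_raw would show whether all the missing values sit in one row.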

We will now drop the columns that our EDA showed to be irrelevant to modeling.

In [104]:
df = df.drop(columns=["isFlaggedFraud", "nameDest", "nameOrig"])

Next, we need to handle the remaining string column. Without some transformation we cannot use it in a model. Because the type column is important and strongly related to fraud, I am choosing frequency encoding: how often transaction types such as Cash Out and Transfer occur can be informative for fraud prediction.

In [105]:
frequency_map = df['type'].value_counts().to_dict()
df['type'] = df['type'].map(frequency_map)
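
To make the effect of this encoding concrete, here is a minimal sketch on a hypothetical six-row frame. The category labels and the type_encoded column name are illustrative assumptions, not values read from the data set.

```python
import pandas as pd

# Hypothetical transaction types, for illustration only.
toy = pd.DataFrame(
    {"type": ["CASH_OUT", "PAYMENT", "CASH_OUT", "TRANSFER", "CASH_OUT", "PAYMENT"]}
)

# Count how often each category occurs.
frequency_map = toy["type"].value_counts().to_dict()

# Replace each category with its count, keeping the original column for comparison.
toy["type_encoded"] = toy["type"].map(frequency_map)
print(toy)
```

Each label is replaced by the number of rows that carry it, so the column becomes numeric. One design caveat: two categories that happen to have the same count would receive the same code and become indistinguishable to the model.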

Given the computing power needed for modeling, we will continue with a random sample rather than the full data. The entire data frame has 5,500,630 rows; we will use 30,000 of them, which is roughly 0.5% of the data.

In [106]:
df = df.sample(n=30000, replace=False, random_state=42)  # fixed seed so the sample is reproducible
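
As a quick sanity check on the sample size, the arithmetic below uses only figures from the outputs above: the row count after dropping missing values and the mean of isFraud from describe(). The variable names are illustrative.

```python
n_rows = 5_500_630          # rows remaining after dropna()
n_sample = 30_000           # sample size used above
fraud_rate = 0.0007680938   # mean of isFraud reported by describe()

# Share of the full data that the sample covers.
sample_fraction = n_sample / n_rows

# Expected number of fraud rows in a simple random sample of this size.
expected_frauds = n_sample * fraud_rate

print(f"sample fraction: {sample_fraction:.2%}")
print(f"expected fraud rows: {expected_frauds:.1f}")
```

With only a couple of dozen fraud rows expected, it is worth running df["isFraud"].value_counts() on the sample to see how many positive cases it actually contains.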

In [109]:
df.to_csv("Wrangled_Payments.csv", index=False)