## Data Collection


We will be using data from here: https://www.kaggle.com/ealaxi/paysim1

In [1]:
import pandas as pd

In [2]:
rawData = pd.read_csv('Synthetic_Fraud_Data.csv')  # read_csv already returns a DataFrame
rawData.head()

|   | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


The dataset is synthetic, so there should be very little cleaning to do.

In [3]:
rawData.shape

(6362620, 11)

With over 6.3 million records, no further data needs to be added, which is good, because I couldn't find any more.

## Data Organization

Created a repository at https://github.com/NickLamm/SpringboardCapstone2

## Data Definition

In [4]:
print(rawData.columns)
rawData.describe()

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')


|   | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| count | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 |
| mean | 243.3972 | 179861.9 | 833883.1 | 855113.7 | 1100702.0 | 1224996.0 | 0.00129082 | 2.514687e-06 |
| std | 142.332 | 603858.2 | 2888243.0 | 2924049.0 | 3399180.0 | 3674129.0 | 0.0359048 | 0.001585775 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 156.0 | 13389.57 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 239.0 | 74871.94 | 14208.0 | 0.0 | 132705.7 | 214661.4 | 0.0 | 0.0 |
| 75% | 335.0 | 208721.5 | 107315.2 | 144258.4 | 943036.7 | 1111909.0 | 0.0 | 0.0 |
| max | 743.0 | 92445520.0 | 59585040.0 | 49585040.0 | 356015900.0 | 356179300.0 | 1.0 | 1.0 |


### Column definitions, from the Kaggle page:

step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).

type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

amount - amount of the transaction in local currency.

nameOrig - customer who started the transaction

oldbalanceOrg - initial balance before the transaction

newbalanceOrig - new balance after the transaction

nameDest - customer who is the recipient of the transaction

oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers that start with M (Merchants).

newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers that start with M (Merchants).

isFraud - These are the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.

isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
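The balance definitions above imply a simple piece of arithmetic for outgoing transactions: `newbalanceOrig` should equal `oldbalanceOrg` minus `amount`. Below is a minimal sketch that checks this on the five rows shown by `head()` only. The column name `balanceConsistent` is a hypothetical helper, and whether the relationship holds across the full dataset (or for incoming types such as CASH_IN) is an assumption that would need checking separately.

```python
import numpy as np
import pandas as pd

# The five rows shown by rawData.head(), re-typed here as a small sample.
sample = pd.DataFrame({
    'type': ['PAYMENT', 'PAYMENT', 'TRANSFER', 'CASH_OUT', 'PAYMENT'],
    'amount': [9839.64, 1864.28, 181.0, 181.0, 11668.14],
    'oldbalanceOrg': [170136.0, 21249.0, 181.0, 181.0, 41554.0],
    'newbalanceOrig': [160296.36, 19384.72, 0.0, 0.0, 29885.86],
})

# Expected new balance if the full amount leaves the origin account.
expected = sample['oldbalanceOrg'] - sample['amount']

# Compare with a small tolerance, since these are floating-point values.
# 'balanceConsistent' is a hypothetical helper column, not part of the dataset.
sample['balanceConsistent'] = np.isclose(expected, sample['newbalanceOrig'], atol=0.01)
print(sample[['type', 'amount', 'balanceConsistent']])
```

On the real data, the same comparison could be run per transaction type to see where the relationship breaks down.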

### Checking data

Step should have 744 unique values and type should have 5. isFraud and isFlaggedFraud are binary, so each should have 2 unique values (0 and 1).

In [5]:
unq = rawData.apply(lambda col: len(col.unique()))
unq

step                  743
type                    5
amount            5316900
nameOrig          6353307
oldbalanceOrg     1845844
newbalanceOrig    2682586
nameDest          2722362
oldbalanceDest    3614697
newbalanceDest    3555499
isFraud                 2
isFlaggedFraud          2
dtype: int64

Good thing we checked: there are only 743 unique steps in this data, one fewer than the 744 described on the Kaggle page.
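To see which step is absent, the observed values can be compared with the documented range of 1 to 744. The sketch below uses a hypothetical helper, `missing_steps`, on toy data rather than the real file. For the real data, the `describe()` output already reports a minimum of 1 and a maximum of 743, which together with 743 unique integer values means the steps run from 1 to 743 and only step 744 is missing.

```python
import pandas as pd

def missing_steps(steps: pd.Series, expected_total: int = 744) -> list:
    """Return the step values in 1..expected_total that never occur in `steps`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    expected = set(range(1, expected_total + 1))
    observed = set(steps.unique().tolist())
    return sorted(expected - observed)

# Toy data standing in for rawData['step']: steps 1..743, with a few repeats.
toy_steps = pd.Series(list(range(1, 744)) + [1, 1, 743])
print(missing_steps(toy_steps))
```

On the notebook's DataFrame, the equivalent call would be `missing_steps(rawData['step'])`.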

Checking unique values for names:

In [6]:
print(f'Unique origin names: {unq["nameOrig"]}, {round(100*unq["nameOrig"]/len(rawData),2)}% of total records\n')
print(f'Unique destination names: {unq["nameDest"]}, {round(100*unq["nameDest"]/len(rawData),2)}% of total records')

Unique origin names: 6353307, 99.85% of total records

Unique destination names: 2722362, 42.79% of total records


While there are almost as many unique origin names as transactions, there are fewer than half as many unique destination names. This most likely reflects many individuals paying a smaller number of larger entities, such as merchants.
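One way to probe this is to split destination names by their first character, since the column notes say names starting with M are merchants. The sketch below runs on the five `nameDest` values shown by `head()` only. Treating the `C` prefix as "customer" is an assumption, and the variable names are illustrative.

```python
import pandas as pd

# The nameDest values from the five head() rows, used as a small sample.
sample = pd.DataFrame({
    'nameDest': ['M1979787155', 'M2044282225', 'C553264065',
                 'C38997010', 'M1230701703'],
})

# First character of each ID: 'M' marks merchants per the column notes;
# 'C' is assumed here to mark ordinary customers.
prefixCounts = sample['nameDest'].str[0].value_counts()
print(prefixCounts)

# Number of transactions received by each destination, largest first.
txPerDest = sample['nameDest'].value_counts()
print(txPerDest.head())
```

Run against the full `rawData`, `txPerDest` would show whether a small number of destinations receive a large share of transactions.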

In [7]:
rawData.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

## Data Cleaning

As noted above, not much cleaning should be needed.

Type can be a category, fraud flags can be boolean.

In [8]:
rawData['type'] = rawData['type'].astype('category')
rawData['isFraud'] = rawData['isFraud'].astype('bool')
rawData['isFlaggedFraud'] = rawData['isFlaggedFraud'].astype('bool')
rawData.dtypes

step                 int64
type              category
amount             float64
nameOrig            object
oldbalanceOrg      float64
newbalanceOrig     float64
nameDest            object
oldbalanceDest     float64
newbalanceDest     float64
isFraud               bool
isFlaggedFraud        bool
dtype: object
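One likely benefit of these conversions on a table this size is lower memory use. The sketch below measures that on toy data with `memory_usage(deep=True)`. The toy values and variable names are made up for illustration, and the actual savings on the full dataset are not measured here.

```python
import pandas as pd

# Toy stand-in for the real data: a low-cardinality string column and a 0/1 flag.
toy = pd.DataFrame({
    'type': ['PAYMENT', 'TRANSFER', 'CASH_OUT', 'CASH_IN', 'DEBIT'] * 2000,
    'isFraud': [0, 0, 1, 0, 0] * 2000,
})

# Bytes per column before conversion (deep=True counts the string contents too).
before = toy.memory_usage(deep=True)

# Same conversions as in the cell above, applied in one call.
converted = toy.astype({'type': 'category', 'isFraud': 'bool'})
after = converted.memory_usage(deep=True)

print(pd.DataFrame({'before': before, 'after': after}))
```

The same two `memory_usage` calls around the notebook's own `astype` lines would give the real figures.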

In [9]:
rawData.isna().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

No missing values, as expected for data that comes from a simulation.

In [10]:
rawData.duplicated().sum()

0

No exact duplicates.

In [19]:
try:
    # Every flagged transaction should be for at least 200,000 ...
    assert rawData[rawData['isFlaggedFraud']]['amount'].min() >= 200000
    # ... and no transaction under 200,000 should carry the flag.
    assert rawData[rawData['amount'] < 200000]['isFlaggedFraud'].sum() == 0
except AssertionError:
    print("There are incorrect values in the 'isFlaggedFraud' column.")
else:
    print('All transactions are correctly flagged based on amount.')

All transactions are correctly flagged based on amount.
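Note that these assertions only test one direction: every flagged transaction is at least 200,000. They do not test whether every transfer above 200,000 was actually flagged, which is what the column definition describes. A minimal sketch of the reverse check follows, on toy rows whose values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented toy rows (not from the dataset) to illustrate the reverse check.
toy = pd.DataFrame({
    'type': ['TRANSFER', 'TRANSFER', 'CASH_OUT', 'TRANSFER'],
    'amount': [250000.0, 450000.0, 300000.0, 150.0],
    'isFlaggedFraud': [True, False, False, False],
})

# Rows the stated rule would cover: transfers of more than 200,000.
ruleApplies = (toy['type'] == 'TRANSFER') & (toy['amount'] > 200000)

# Of those, the rows the flag missed.
unflagged = toy[ruleApplies & ~toy['isFlaggedFraud']]
print(f'{len(unflagged)} of {ruleApplies.sum()} qualifying transfers were not flagged')
```

Replacing `toy` with `rawData` would show how closely the flag follows the stated rule in the real data.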


In [12]:
flagCt = len(rawData[rawData['isFlaggedFraud']])
correctFlagsRatio = len(rawData[rawData['isFlaggedFraud'] & rawData['isFraud']])/(flagCt)
fraudCt = len(rawData[rawData['isFraud']])
print(f'Filter identified {flagCt} fraudulent transactions, {100*correctFlagsRatio}% of which were truly fraud.')
print(f'However, {fraudCt-(flagCt*correctFlagsRatio)} fraudulent transactions were not flagged')

Filter identified 16 fraudulent transactions, 100.0% of which were truly fraud.
However, 8197.0 fraudulent transactions were not flagged


'isFlaggedFraud' will likely be irrelevant: it is just a simple boolean condition, and it caught only 16 fraudulent transactions while missing 8,197. Depending on data exploration, it will probably be dropped.
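To put numbers on how little the flag contributes, the counts printed above can be turned into precision and recall. This is a small worked calculation using only those printed counts; the variable names are illustrative.

```python
# Counts taken from the printed output of the previous cell.
flagged_total = 16          # transactions with isFlaggedFraud set
flagged_and_fraud = 16      # 100% of flagged transactions were truly fraud
fraud_not_flagged = 8197    # fraudulent transactions the flag missed

# Total fraudulent transactions in the dataset.
fraud_total = flagged_and_fraud + fraud_not_flagged

# Precision: share of flagged transactions that are fraud.
precision = flagged_and_fraud / flagged_total
# Recall: share of fraudulent transactions that were flagged.
recall = flagged_and_fraud / fraud_total

print(f'fraud_total = {fraud_total}')
print(f'precision = {precision:.1%}, recall = {recall:.3%}')
```

Perfect precision with recall well under 1% is consistent with treating the flag as a narrow business rule rather than a useful fraud signal.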