### The purpose of this notebook is to perform initial data explorations for the Banking Fraud Data set.

### Data Sample Specifications 

Below is one sample row, followed by an explanation of each column:

1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0

step - a unit of time in the real world. Here 1 step is 1 hour. There are 744 steps in total; the dataset description calls this a 30-day simulation, although 744 hours is 31 days.

type - the transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT or TRANSFER.

amount - amount of the transaction in local currency.

nameOrig - customer who started the transaction

oldbalanceOrg - initial balance before the transaction

newbalanceOrig - new balance after the transaction

nameDest - customer who is the recipient of the transaction

oldbalanceDest - the recipient's initial balance before the transaction. Note that there is no information for customers whose names start with M (Merchants).

newbalanceDest - the recipient's new balance after the transaction. Note that there is no information for customers whose names start with M (Merchants).

isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and emptying the funds by transferring them to another account and then cashing out of the system.

isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
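
To make the mapping between the sample row and the column descriptions concrete, here is a minimal sketch that pairs each value with its column name. The helper `label_row` is hypothetical (it is not part of the notebook), and the column order is assumed to follow the order of the descriptions above.

```python
import csv
from io import StringIO

# Column names, in the order they are described above (assumed to be file order).
COLUMNS = [
    "step", "type", "amount", "nameOrig", "oldbalanceOrg", "newbalanceOrig",
    "nameDest", "oldbalanceDest", "newbalanceDest", "isFraud", "isFlaggedFraud",
]

# The sample row quoted at the top of this section.
SAMPLE_ROW = "1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0"


def label_row(line, columns=COLUMNS):
    """Pair each comma-separated value with its column name (hypothetical helper)."""
    values = next(csv.reader(StringIO(line)))
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
    return dict(zip(columns, values))


record = label_row(SAMPLE_ROW)
for name, value in record.items():
    print(f"{name:>15}: {value}")

# Merchant recipients are identified by the 'M' prefix on nameDest.
is_merchant = record["nameDest"].startswith("M")
```

In this row the recipient is a merchant, which is consistent with both destination balances being reported as 0.0.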

In [34]:
import os
import boto3
import pandas as pd
import sagemaker as sm
from io import StringIO

# aws_key = os.environ['AWS_ACCESS_KEY']
# aws_secret = os.environ['AWS_SECRET_ACCESS_KEY']

# role = sm.get_execution_role()
# sm_session = sm.session.Session()

bucket = '1s-gary'
data_key = 'aiml-blackbelt-2021/rawdata/PS_20174392719_1491204439457_log.csv'
data_sample_key = 'aiml-blackbelt-2021/rawdata/PS_20174392719_1491204439457_log_small.csv'

In [35]:
s3_client = boto3.client('s3')
csv_obj = s3_client.get_object(Bucket=bucket, Key=data_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8')
df = pd.read_csv(StringIO(csv_string))

csv_obj = s3_client.get_object(Bucket=bucket, Key=data_sample_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8')
df_sample = pd.read_csv(StringIO(csv_string))
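
The two download-and-parse sequences above repeat the same steps, so they could be folded into one function. The following is a minimal sketch, not the notebook's own code: `read_csv_from_s3` is a hypothetical helper, and it assumes only that the client exposes the `get_object(Bucket=..., Key=...)` call and the `Body.read()` method already used above.

```python
from io import BytesIO

import pandas as pd


def read_csv_from_s3(client, bucket, key, **read_csv_kwargs):
    """Fetch one S3 object and parse it as CSV (hypothetical helper).

    `client` is anything exposing get_object(Bucket=..., Key=...) whose
    response carries a readable "Body", as boto3's S3 client does.
    """
    response = client.get_object(Bucket=bucket, Key=key)
    # Pass the raw bytes to pandas, which decodes them as UTF-8 by default,
    # so no intermediate decoded string is needed.
    return pd.read_csv(BytesIO(response["Body"].read()), **read_csv_kwargs)
```

With the names defined earlier, usage would look like `df = read_csv_from_s3(s3_client, bucket, data_key)` and `df_sample = read_csv_from_s3(s3_client, bucket, data_sample_key)`.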

In [36]:
# s3loc = f's3://{bucket}/{data_key}'
df_sample

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,285,TRANSFER,3339025.78,C1026038263,0.00,0.00,C1232579010,6667523.60,10006549.38,0,0
1,274,CASH_OUT,312496.50,C1686642649,0.00,0.00,C233417034,840086.56,1152583.05,0,0
2,353,CASH_IN,97106.03,C36774230,1980240.72,2077346.74,C16389454,157997.97,60891.94,0,0
3,20,PAYMENT,20027.28,C1302277217,23063.50,3036.22,M861546597,0.00,0.00,0,0
4,322,PAYMENT,8367.50,C853508209,0.00,0.00,M1037844532,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
636257,300,CASH_OUT,73810.32,C832374530,2447.00,0.00,C1385336131,26856.93,100667.25,0,0
636258,331,CASH_IN,154164.85,C584806951,10597079.36,10751244.21,C1545515550,383048.47,228883.62,0,0
636259,309,TRANSFER,1475519.65,C1085117261,0.00,0.00,C1949312416,1476274.08,2951793.73,0,0
636260,158,CASH_OUT,410630.52,C1436073414,10324.00,0.00,C1710527462,821797.30,1232427.81,0,0


### Philosophy of this dataset

Many thorough explorations of this dataset already exist. Since my focus is on creating an AutoML / MLOps exploration, I keep the data exploration brief so that I can start the data engineering. We look at just a few items:

- nameOrig and nameDest: we would retain them if accounts had enough repeated transactions for their history to help; this turns out not to be the case, so we will drop both
- The dataset is well behaved (no missing data, etc.), except that transaction amounts do not always match the changes in balance
- We will leave the data much as it is
- isFlaggedFraud is not useful and so is dropped
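
The balance mismatch mentioned in the list above can be checked row by row. The sketch below is an illustration under stated assumptions, not the notebook's method: `balance_mismatch` is a hypothetical helper, the tolerance is arbitrary, and the sign convention (CASH_IN increases the originator's balance, every other type decreases it) is inferred from the rows displayed earlier.

```python
import pandas as pd

# Three rows copied from the df_sample display above (originator columns only).
rows = pd.DataFrame({
    "type": ["TRANSFER", "CASH_IN", "PAYMENT"],
    "amount": [3339025.78, 97106.03, 20027.28],
    "oldbalanceOrg": [0.00, 1980240.72, 23063.50],
    "newbalanceOrig": [0.00, 2077346.74, 3036.22],
})


def balance_mismatch(df, tol=0.05):
    """Flag rows whose originator balance change differs from the amount.

    Assumption: CASH_IN adds the amount to the originator's balance and
    all other transaction types subtract it.
    """
    sign = df["type"].eq("CASH_IN").map({True: 1.0, False: -1.0})
    expected_new = df["oldbalanceOrg"] + sign * df["amount"]
    return (expected_new - df["newbalanceOrig"]).abs() > tol


mismatch = balance_mismatch(rows)
print(rows.assign(mismatch=mismatch))
```

On these three rows only the TRANSFER is flagged: its originator balance stays at 0.00 even though more than 3.3 million was moved.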

In [48]:
# simple data quality checks
def quality_checks(df):
    print(f'The shape of the data is {df.shape}')
    print(f'the number of rows with NaN is {sum(df.isna().any(axis=1))}')

quality_checks(df_sample)

The shape of the data is (636262, 11)
the number of rows with NaN is 0


### nameOrig

Run through full dataset and subsample by nameOrig for exploratory work.

Our first hypothesis is that individual users may be prone to fraudulent transactions, or that the transaction history of individuals
will help us create a better model. If that is the case, we would need to randomize by user when splitting into train and test sets.

To this end, we start by exploring nameOrig.
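
Randomizing by user means that all rows of a given nameOrig land on the same side of the split, so a model cannot see a user's history in training and then be tested on the same user. A minimal sketch of that idea follows; `split_by_user` is a hypothetical helper, and the toy data, test fraction and seed are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def split_by_user(df, user_col="nameOrig", test_frac=0.2, seed=0):
    """Split rows so each user appears in train or test, never both."""
    rng = np.random.default_rng(seed)
    users = df[user_col].unique()
    # Sample whole users, not rows, for the test set.
    n_test = max(1, int(round(test_frac * len(users))))
    test_users = rng.choice(users, size=n_test, replace=False)
    in_test = df[user_col].isin(test_users)
    return df[~in_test], df[in_test]


# Toy data for illustration only.
toy = pd.DataFrame({
    "nameOrig": ["C1", "C1", "C2", "C3", "C3", "C4", "C5"],
    "isFraud": [0, 0, 1, 0, 0, 0, 1],
})
train, test = split_by_user(toy, test_frac=0.4)
print(sorted(train["nameOrig"].unique()), sorted(test["nameOrig"].unique()))
```

If scikit-learn is available, its `GroupShuffleSplit` offers the same kind of grouped split.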

In [4]:
gb = df.groupby('nameOrig')

In [5]:
print(f'Maximum number of entries for a single nameOrig: {max(gb.count()["step"])}')
print('Conclusion: the maximum number of repeated transactions is small')

Maximum number of entries for a single nameOrig: 2
Conclusion: the maximum number of repeated transactions is small


In [16]:
df1 = df.groupby('nameOrig').filter(lambda s: s.step.count()>=2)

if sum(df1.isFraud)==0:
    print(f'None of the repeated {len(df1.nameOrig.unique())} nameOrig\'s result in fraud so we conclude that names will not help us identify positive cases of fraud  ')
else:
    print('check fraud in repeated nameOrig')

None of the repeated 88 nameOrig's result in fraud so we conclude that names will not help us identify positive cases of fraud  


In [17]:
df1 = df.groupby('nameDest').filter(lambda s: s.step.count()>=2)

if sum(df1.isFraud)==0:
    print(f'None of the repeated {len(df1.nameDest.unique())} nameDest\'s result in fraud so we conclude that names will not help us identify positive cases of fraud  ')
else:
    print('check fraud in repeated nameDest')

check fraud in repeated nameDest


In [22]:
df.nameDest.unique()

array(['C1232579010', 'C233417034', 'C16389454', ..., 'C1545515550',
       'C1949312416', 'C1710527462'], dtype=object)

### isFlaggedFraud

In [None]:
# feature engineering by nameOrig:
# count how many transactions per period
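
One way the feature described in the comments above could be computed is sketched here. This is an assumption-laden illustration: `add_txn_count_per_period` is a hypothetical helper, a period is taken to be 24 steps (one day), and steps are assumed to start at 1, as in the sample row.

```python
import pandas as pd


def add_txn_count_per_period(df, key="nameOrig", hours_per_period=24):
    """Add a column counting each key's transactions within the same period."""
    out = df.copy()
    # Assumes steps start at 1, so shift to 0-based before bucketing.
    out["period"] = (out["step"] - 1) // hours_per_period
    out["txn_count_in_period"] = (
        out.groupby([key, "period"])["step"].transform("count")
    )
    return out


# Toy data for illustration only.
toy = pd.DataFrame({
    "step": [1, 5, 30, 2],
    "nameOrig": ["C1", "C1", "C1", "C2"],
})
features = add_txn_count_per_period(toy)
print(features)
```

Given that the maximum number of entries for a single nameOrig found earlier is 2, this count is expected to be 1 for nearly every row of the real data.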