# Solution Planning

- OUTPUT:
    - The classification model's predictions, compared against the true labels, determine the financial result of each transaction:
        - the company earns 25% of the amount when the transaction is a fraud and the model flags it as a fraud;
        - the company earns 5% of the amount when the transaction is legitimate but the model flags it as a fraud;
        - the company refunds 100% of the amount when the transaction is a fraud but the model classifies it as legitimate.

- PROCESS:
    - Check the size of the dataset;
    - Check for NULL values and outliers;
    - Create new variables through feature engineering to uncover relevant correlations;
    - If the dataset is imbalanced, try balancing methods to improve the results;
    - Compare models and assess the cost-benefit of each one;
    - Fine-tune the hyperparameters (if necessary);
    - Report the monetary result of the solution (loss or profit).

- INPUT:
    - The CSV data from PaySim;
    - Check the size of the data and make improvements, such as saving it as a Parquet file at each step of the CRISP-DM method.
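
The three payoff rules listed under OUTPUT can be sketched as a small function. This is a minimal illustration, not the notebook's final business calculation: the names `transaction_payoff` and `total_payoff` are hypothetical, the rules are read as the company's earnings and refunds, and a transaction correctly classified as legitimate is assumed to contribute nothing because the plan states no rule for it.

```python
def transaction_payoff(amount, is_fraud, flagged_as_fraud):
    """Financial result of one transaction under the three rules in the plan."""
    if flagged_as_fraud and is_fraud:
        return 0.25 * amount   # fraud caught: 25% of the amount
    if flagged_as_fraud and not is_fraud:
        return 0.05 * amount   # legitimate but flagged: 5% of the amount
    if is_fraud:
        return -1.0 * amount   # fraud missed: refund 100% of the amount
    return 0.0                 # assumption: no rule is given for this case


def total_payoff(amounts, y_true, y_pred):
    """Sum the per-transaction results over a batch of predictions."""
    return sum(
        transaction_payoff(a, bool(t), bool(p))
        for a, t, p in zip(amounts, y_true, y_pred)
    )


# Made-up example: one transaction for each of the four outcomes.
amounts = [1000.0, 200.0, 500.0, 300.0]
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
result = total_payoff(amounts, y_true, y_pred)
print(result)  # 250 + 10 - 500 + 0
```

Because a missed fraud costs the full amount while a caught one earns only a quarter of it, recall on the fraud class tends to dominate the monetary result under these rules.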

Below is a sample row, followed by an explanation of each column:

1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0

- **step** - maps a unit of time in the real world. In this case, 1 step is 1 hour. Total steps: 744 (described as a 30-day simulation, although 744 hours amount to 31 days).

- **type** - CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER.

- **amount** - amount of the transaction in local currency.

- **nameOrig** - customer who started the transaction

- **oldbalanceOrg** - initial balance before the transaction

- **newbalanceOrig** - new balance after the transaction

- **nameDest** - customer who is the recipient of the transaction

- **oldbalanceDest** - recipient's balance before the transaction. Note that there is no information for recipients whose ID starts with M (merchants).

- **newbalanceDest** - recipient's balance after the transaction. Note that there is no information for recipients whose ID starts with M (merchants).

- **isFraud** - marks the transactions made by the fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.

- **isFlaggedFraud** - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction.
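
As a quick check of the column descriptions, the sample row can be paired with the column names and tested against the rules stated above. This is a minimal sketch using only the standard library. The column order is taken from the DataFrame header shown later in the notebook, and the variable names are illustrative.

```python
import csv
import io

COLUMNS = [
    "step", "type", "amount", "nameOrig", "oldbalanceOrg", "newbalanceOrig",
    "nameDest", "oldbalanceDest", "newbalanceDest", "isFraud", "isFlaggedFraud",
]

sample = "1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0"

# Pair each value of the sample row with its column name.
row = dict(zip(COLUMNS, next(csv.reader(io.StringIO(sample)))))

# Recipients whose ID starts with "M" are merchants, so no balance is reported.
recipient_is_merchant = row["nameDest"].startswith("M")

# isFlaggedFraud rule as described: a transfer of more than 200,000 at once.
exceeds_flag_threshold = row["type"] == "TRANSFER" and float(row["amount"]) > 200_000

# The originator's balance drops by the transaction amount.
expected_new_balance = round(float(row["oldbalanceOrg"]) - float(row["amount"]), 2)

print(recipient_is_merchant, exceeds_flag_threshold, expected_new_balance)
```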

dataset => https://www.kaggle.com/datasets/ealaxi/paysim1

# Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import math
import joblib

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.preprocessing import RobustScaler, MinMaxScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [2]:
def ml_metrics(model_name, y, yhat):
    """Return precision, recall and accuracy for one model as a single-row DataFrame."""
    precision = precision_score(y, yhat)
    recall = recall_score(y, yhat)
    accuracy = accuracy_score(y, yhat)
    return pd.DataFrame({'Model Name': model_name,
                         'precision': precision,
                         'recall': recall,
                         'accuracy': accuracy}, index=[0])
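
To make the three scores reported by `ml_metrics` concrete, the sketch below recomputes them by hand from confusion-matrix counts. The labels are made up for illustration and do not come from the dataset. The formulas are the standard binary definitions with fraud (1) as the positive class.

```python
# Made-up labels: 1 = fraud, 0 = legitimate (2 frauds among 10 transactions).
y     = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

pairs = list(zip(y, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # frauds flagged as fraud
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged as fraud
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # frauds missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate left alone

precision = tp / (tp + fp)        # share of flagged transactions that are fraud
recall = tp / (tp + fn)           # share of frauds that were flagged
accuracy = (tp + tn) / len(y)     # share of all predictions that are correct

print(precision, recall, accuracy)
```

Here accuracy is 0.8 even though only half of the frauds are caught, which is why precision and recall matter more than accuracy on an imbalanced dataset.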

In [3]:
def ml_cross_validation(model_name, model, k_fold, X, y):
    """Fit `model` on each stratified fold and return one row of metrics per fold."""
    # StratifiedKFold keeps the class ratio roughly equal in every fold.
    folder = StratifiedKFold(n_splits=k_fold)

    measures = []
    for train_index, test_index in folder.split(X, y):
        # Row-only indexing works whether y is a Series or a single-column DataFrame.
        x_train_fold, x_test_fold = X.iloc[train_index], X.iloc[test_index]
        y_train_fold, y_test_fold = y.iloc[train_index], y.iloc[test_index]

        model.fit(x_train_fold, y_train_fold.values.ravel())

        y_hat = model.predict(x_test_fold)

        measures.append(ml_metrics(model_name, y_test_fold, y_hat))
    return pd.concat(measures, axis=0, ignore_index=True)
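
The reason for using `StratifiedKFold` in the function above can be seen on a toy example. The sketch below uses made-up labels (90 legitimate, 10 fraudulent) and counts how many frauds land in each test fold. The placeholder feature matrix is an assumption, since the splitter only needs its length.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 90 legitimate (0) and 10 fraudulent (1) transactions.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, ignored by the splitter

folder = StratifiedKFold(n_splits=5)

# Count the frauds that fall into the test part of each fold.
fraud_per_fold = [int(y[test_index].sum()) for _, test_index in folder.split(X, y)]
print(fraud_per_fold)
```

Each of the five test folds receives 2 of the 10 frauds, so the per-fold precision and recall are computed on the same class ratio as the full data.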

In [4]:
path = '..\\data\\raw\\'

df = pd.read_csv(path + 'PS_20174392719_1491204439457_log.csv')
df.head()

df.to_parquet('..\\data\\processed\\credit_card_transaction.parquet', index = False)

In [5]:
path = '..\\data\\processed\\'

In [6]:
df = pd.read_parquet(path + 'credit_card_transaction.parquet')
df.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


In [7]:
print(f'Number of rows: {df.shape[0]}\nNumber of columns: {df.shape[1]}')

Number of rows: 6362620
Number of columns: 11


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [9]:
df.sort_values(by=['isFlaggedFraud'], ascending = False)

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6296014 | 671 | TRANSFER | 3441041.46 | C917414431 | 3441041.46 | 3441041.46 | C1082139865 | 0.00 | 0.00 | 1 | 1 |
| 6362460 | 730 | TRANSFER | 10000000.00 | C2140038573 | 17316255.05 | 17316255.05 | C1395467927 | 0.00 | 0.00 | 1 | 1 |
| 5563713 | 387 | TRANSFER | 4892193.09 | C908544136 | 4892193.09 | 4892193.09 | C891140444 | 0.00 | 0.00 | 1 | 1 |
| 6362584 | 741 | TRANSFER | 5674547.89 | C992223106 | 5674547.89 | 5674547.89 | C1366804249 | 0.00 | 0.00 | 1 | 1 |
| 3760288 | 279 | TRANSFER | 536624.41 | C1035541766 | 536624.41 | 536624.41 | C1100697970 | 0.00 | 0.00 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2120869 | 183 | CASH_IN | 118980.42 | C1325911476 | 1704431.57 | 1823411.99 | C257232928 | 2001605.12 | 1882624.70 | 0 | 0 |
| 2120868 | 183 | CASH_IN | 203094.44 | C24786701 | 1501337.12 | 1704431.57 | C316761923 | 8095303.60 | 7892209.15 | 0 | 0 |
| 2120867 | 183 | CASH_IN | 174346.32 | C1795889203 | 1326990.80 | 1501337.12 | C547359094 | 276757.88 | 102411.55 | 0 | 0 |
| 2120866 | 183 | CASH_IN | 67507.70 | C1035027223 | 1259483.10 | 1326990.80 | C1730848226 | 94625.84 | 27118.13 | 0 | 0 |


In [10]:
df.isna().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64