# Technical Case Study - ML Ops
### Carlo Alberto Barranco 2022-05-08

In this notebook I build an ML classification model that identifies frauds in a given transactions dataset, and I report on the performance of that model.

The original dataset was taken from https://www.kaggle.com/kartik2112/fraud-detection-on-paysim-dataset/data

In section "Data Exploration" I loaded the data into a pandas DataFrame and checked, for each column, whether particular values or conditions are associated with an over-representation of frauds (compared to the overall average). Based on this study I selected the columns to be used by the model and added a few derived columns expected to make frauds easier to spot.

In section "Model training" I selected XGBoost as the method and trained a model on a sample of 80% of the original data.

In section "Model Evaluation" I used the remaining 20% of the data to check whether the model actually spots the frauds, and collected the results in terms of:
* True Negative (non-frauds, correctly not flagged)
* False Negative (frauds, incorrectly not flagged)
* True Positive (frauds, correctly flagged)
* False Positive (non-frauds, incorrectly flagged)
* Accuracy (number of correctly classified transactions over the total)
* Precision (true positives over all flagged transactions)
* Recall (true positives over all frauds)
* Average precision score (mean of the precision values obtained at each decision threshold, weighted by the increase in recall from the previous threshold)
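
As a minimal sketch of how accuracy, precision and recall follow from the four confusion-matrix counts, the helper below applies the definitions directly. The function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total   # correctly classified over total
    precision = tp / (tp + fp)     # share of flagged transactions that are frauds
    recall = tp / (tp + fn)        # share of frauds that were flagged
    return accuracy, precision, recall


# Hypothetical counts, chosen only to illustrate the formulas
accuracy, precision, recall = classification_metrics(tn=90, fp=2, fn=3, tp=5)
print(accuracy, precision, recall)
```

With very few frauds in the data, accuracy stays high even for a model that flags nothing, which is why precision and recall are reported alongside it.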

In section "Save model" the model defined and trained in the previous section is saved as a pickle in the folder "models". The test set is also saved for future checks of the model's efficacy.

In [2]:
DATA_PATH = 'data/initial_dataset.csv'

## Data Exploration

In [69]:
import pandas as pd
import numpy as np

In [43]:
df = pd.read_csv(DATA_PATH)

In [20]:
df.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [9]:
len(df), sum(df.isFraud), sum(df.isFlaggedFraud), sum(df.isFraud*df.isFlaggedFraud)

(6362620, 8213, 16, 16)

The dataset includes 6362620 transactions, of which 8213 (0.13%) are labelled as frauds.

The flag "isFlaggedFraud" does not seem very useful: it is set for only 16 transactions, all of which are already labelled as frauds.

In [66]:
df.sample(5).T

| | 2967978 | 1282774 | 55781 | 2803552 | 1245043 |
|---|---|---|---|---|---|
| step | 231 | 135 | 9 | 225 | 134 |
| type | CASH_OUT | CASH_OUT | CASH_IN | PAYMENT | CASH_IN |
| amount | 327854 | 54882.1 | 292949 | 53977.5 | 260425 |
| nameOrig | C1632635545 | C149577620 | C1445583652 | C1542264198 | C325118727 |
| oldbalanceOrg | 418 | 15 | 652285 | 105336 | 753229 |
| newbalanceOrig | 0 | 0 | 945233 | 51358.7 | 1.01365e+06 |
| nameDest | C1714935544 | C856300696 | C1527452964 | M2137613813 | C136564603 |
| oldbalanceDest | 318869 | 321191 | 1.17492e+06 | 0 | 311306 |
| newbalanceDest | 646723 | 713766 | 0 | 0 | 50880.9 |
| isFraud | 0 | 0 | 0 | 0 | 0 |


The balance values (oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest) are not always consistent with the amount values.

In particular, in some cases the balance does not change for a non-zero amount, or vice versa.

In [19]:
# Count the missing (NaN) values in each column
for col in df.columns:
    print(col, df[col].isna().sum())

step 0
type 0
amount 0
nameOrig 0
oldbalanceOrg 0
newbalanceOrig 0
nameDest 0
oldbalanceDest 0
newbalanceDest 0
isFraud 0
isFlaggedFraud 0


The dataset does not contain any null values.

In [23]:
txn_types = list(set(df['type']))
txn_types

['TRANSFER', 'PAYMENT', 'CASH_IN', 'DEBIT', 'CASH_OUT']

In [24]:
for t in txn_types:
    dft = df[df['type']==t]
    print(t, len(dft), sum(dft.isFraud), sum(dft.isFraud)/len(dft))

TRANSFER 532909 4097 0.007687991758442811
PAYMENT 2151495 0 0.0
CASH_IN 1399284 0 0.0
DEBIT 41432 0 0.0
CASH_OUT 2237500 4116 0.0018395530726256983


Frauds seem to occur only for transactions of type "TRANSFER" or "CASH_OUT".
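
The per-type counts above could also be obtained in a single pass with a group-by. The sketch below assumes a small toy DataFrame invented for illustration (it is not a sample of the dataset), and the output column names are arbitrary.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "PAYMENT", "CASH_OUT", "CASH_OUT", "CASH_OUT"],
    "isFraud": [1, 0, 0, 0, 1, 0],
})

# One row per type: number of transactions, number of frauds, fraud rate
fraud_by_type = toy.groupby("type")["isFraud"].agg(
    transactions="count", frauds="sum", fraud_rate="mean"
)
print(fraud_by_type)
```

Applied to `df`, the same expression should give one row per transaction type without filtering the DataFrame once for each type.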

In [44]:
txn_types_dict = {'TRANSFER': 0, 'PAYMENT': 1, 'CASH_IN': 2, 'DEBIT': 3, 'CASH_OUT': 4}
txn_types_rev_dict = {0: 'TRANSFER', 1: 'PAYMENT', 2: 'CASH_IN', 3: 'DEBIT', 4: 'CASH_OUT'}

In [45]:
df['typeId'] = df['type'].map(txn_types_dict)

The "type" column is converted to an integer using txn_types_dict, so that it can be used together with the other columns as one of the predictors.

In [70]:
# Flag = 1 when the balance is unchanged despite a non-zero amount,
# or when the balance changed despite a zero amount
df['incoerenceBalanceOrig'] = np.where(
    df.oldbalanceOrg==df.newbalanceOrig, np.where(df.amount!=0, 1, 0), np.where(df.amount==0, 1, 0)
)
df['incoerenceBalanceDest'] = np.where(
    df.oldbalanceDest==df.newbalanceDest, np.where(df.amount!=0, 1, 0), np.where(df.amount==0, 1, 0)
)
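
To make the logic of these flags explicit: the nested `np.where` returns 1 exactly when "balance unchanged" and "amount is zero" disagree. The sketch below checks this on four toy rows (values invented for illustration) and compares the nested form with a more compact, equivalent comparison; the names `nested` and `compact` are introduced here only for the example.

```python
import numpy as np

# Toy values covering the four possible cases (invented for illustration)
old_balance = np.array([100.0, 100.0, 100.0, 100.0])
new_balance = np.array([100.0, 100.0, 50.0, 50.0])
amount = np.array([50.0, 0.0, 50.0, 0.0])

# Same nested form as used in the notebook
nested = np.where(
    old_balance == new_balance,
    np.where(amount != 0, 1, 0),
    np.where(amount == 0, 1, 0),
)

# Equivalent compact form: flag when the two conditions disagree
compact = ((old_balance == new_balance) != (amount == 0)).astype(int)

print(nested)
print(compact)
```

Only the first row (balance unchanged, non-zero amount) and the last row (balance changed, zero amount) are flagged.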

In [80]:
for col in ['incoerenceBalanceOrig', 'incoerenceBalanceDest']:
    for val in [0, 1]:
        dfcv = df[(df[col]==val)&(df['type'].isin(["TRANSFER", "CASH_OUT"]))]
        print(col, val, len(dfcv), sum(dfcv.isFraud), sum(dfcv.isFraud)/len(dfcv))

incoerenceBalanceOrig 0 1461824 8172 0.005590276257606935
incoerenceBalanceOrig 1 1308585 41 3.133155278411414e-05
incoerenceBalanceDest 0 2764633 4143 0.0014985714197870024
incoerenceBalanceDest 1 5776 4070 0.7046398891966759


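Frauds seem to be much more likely when there is an incoherence in the destination balance: about 70% of such TRANSFER and CASH_OUT transactions are frauds, against about 0.15% otherwise.

It could be worth checking whether the balances are off by larger amounts for frauds.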

In [106]:
df['errorBalanceOrig'] = df['newbalanceOrig'] + df['amount'] - df['oldbalanceOrg']
df['errorBalanceDest'] = df['oldbalanceDest'] + df['amount'] - df['newbalanceDest']
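
As a worked example of these two definitions: when the origin balance decreases by the amount and the destination balance increases by the amount, both errors are zero. The sketch below uses a hypothetical helper `balance_errors` and invented figures to show a coherent transaction and one where the destination balance was not updated.

```python
def balance_errors(amount, old_orig, new_orig, old_dest, new_dest):
    """Return (errorBalanceOrig, errorBalanceDest) as defined in the cell above."""
    error_orig = new_orig + amount - old_orig
    error_dest = old_dest + amount - new_dest
    return error_orig, error_dest


# Coherent transfer of 300: origin goes 1000 -> 700, destination goes 200 -> 500
coherent = balance_errors(300, old_orig=1000, new_orig=700, old_dest=200, new_dest=500)

# Same transfer, but the destination balance stays at 0
incoherent = balance_errors(300, old_orig=1000, new_orig=700, old_dest=0, new_dest=0)

print(coherent, incoherent)
```

In the second case the destination error equals the full amount, which is the kind of discrepancy the new columns are meant to expose.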

In [110]:
for col in ['errorBalanceOrig', 'errorBalanceDest']:
    for val in [0, 1]:
        X = df[(df['isFraud']==val)&(df['type'].isin(["TRANSFER", "CASH_OUT"]))][col].abs().mean()
        print(col, val, round(X, 2))

errorBalanceOrig 0 286803.51
errorBalanceOrig 1 10692.33
errorBalanceDest 0 44302.66
errorBalanceDest 1 745138.59


Discrepancies in the destination balance are much larger for frauds (mean absolute error of about 745139, against about 44303 for non-frauds).

## Model training

In [83]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

In [113]:
Id_columns = ['nameOrig', 'nameDest']
X_columns = [
    'step', 'typeId', 'amount',
    'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest',
    'incoerenceBalanceOrig', 'incoerenceBalanceDest',
    'errorBalanceOrig', 'errorBalanceDest'
]
Y_col = 'isFraud'

In [114]:
Y = df[Y_col]
X = df[X_columns]

In [115]:
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=1)

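An XGBoost classifier is used, with trees limited to a maximum depth of 4 (max_depth=4). Because frauds are rare, the positive class is re-weighted through scale_pos_weight, set to the ratio of non-frauds to frauds.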

In [116]:
# Ratio of non-frauds to frauds, used to re-weight the rare positive class
weights = (Y == 0).sum() / (Y == 1).sum()
clsf_model = XGBClassifier(max_depth=4, scale_pos_weight=weights, n_jobs=4)
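
For a sense of scale, the weight can be reproduced from the counts reported in "Data Exploration" (6362620 transactions, 8213 frauds). The sketch below only redoes that arithmetic; the variable names are introduced here for the example.

```python
# Counts taken from the data exploration output
n_transactions = 6362620
n_frauds = 8213
n_non_frauds = n_transactions - n_frauds

# Same ratio as (Y == 0).sum() / (Y == 1).sum()
scale_pos_weight = n_non_frauds / n_frauds
print(round(scale_pos_weight, 1))
```

Each fraud therefore weighs roughly as much as 774 non-frauds during training. A possible variant, not used in this notebook, would compute the ratio on `Ytrain` only so that no information from the test set enters the model configuration; with a random 80/20 split the value is expected to be very close.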

In [117]:
clsf_model.fit(Xtrain,Ytrain)

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=4, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=4, num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

## Model Evaluation

In [118]:
# confusion matrix for test set
Ypredict = clsf_model.predict(Xtest)
conf = confusion_matrix(Ytest, Ypredict)
print("Confusion matrix")
print(conf)

Confusion matrix
[[1270870       7]
 [      4    1643]]


In [142]:
print('Accuracy', (conf[0][0] + conf[1][1])/len(Ypredict))
print('Precision', conf[1][1]/(conf[1][1]+conf[0][1]))
print('Recall', conf[1][1]/(conf[1][1]+conf[1][0]))
print('Average precision score', average_precision_score(Ytest, clsf_model.predict_proba(Xtest)[:,1]))

Accuracy 0.9999913557622488
Precision 0.9957575757575757
Recall 0.9975713418336369
Average precision score 0.9989724183186716


## Save model

In [121]:
import datetime as dt
import pickle

In [125]:
d = dt.date.today().isoformat()
filename = f'models/{d}.sav'
# Use a context manager so the file is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(clsf_model, f)
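
For later checks, the saved model can be read back with `pickle.load`. The sketch below shows the round trip on a stand-in object written to a temporary directory, so that it runs without the trained model; the file name `example.sav` and the dictionary are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model, used only to keep the example self-contained
stand_in_model = {"max_depth": 4, "n_jobs": 4}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.sav")  # hypothetical file name

    # Save the object, then load it back from disk
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

Pickle files should only be loaded from trusted sources, and a pickled model is most likely to load correctly in an environment with the same library versions used to save it.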

In [131]:
df_test = Xtest.copy()
df_test['isFraud'] = Ytest
filename = f'data/tests/test-{d}.csv'
df_test.to_csv(filename, index=False)