## Modeling

With wrangling and preprocessing complete, we can begin modeling. I will use a Random Forest classifier, which is robust and generally accurate on classification problems. We first import the required libraries and packages and load the processed data.

In [58]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import f1_score

df = pd.read_csv("../Wrangled_Payments.csv")


Here we define the target variable and the independent variables, followed by a train/test split.

In [59]:
y = df["isFraud"]

X = df.drop("isFraud", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
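
The outputs in this section show that only 21 of the 30,000 rows are fraudulent, so a purely random split can leave very few positives on one side. A minimal sketch of a stratified split follows, assuming synthetic stand-in data; the names `X_demo` and `y_demo` are hypothetical and are not the wrangled payments data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 30,000 rows with 21 positives (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(30_000, 5))
y_demo = np.zeros(30_000, dtype=int)
y_demo[:21] = 1

# stratify=y_demo keeps the class ratio approximately equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print("train positives:", int(y_tr.sum()), "test positives:", int(y_te.sum()))
```

Whether stratification is appropriate for the real data is a judgment call; it mainly matters when the minority class is as small as it is here.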


We then instantiate RandomForestClassifier and fit it to the training data to start modeling.

In [70]:

clf = RandomForestClassifier()

clf.fit(X_train, y_train)
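
A random forest is built from random bootstrap samples and random feature subsets, so an unseeded `RandomForestClassifier()` can give slightly different results on each run. The sketch below, which assumes synthetic data from `make_classification` and hypothetical names `forest_a` and `forest_b`, illustrates how fixing `random_state` makes two fits agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Two forests with the same seed draw the same bootstrap samples and features
forest_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)
forest_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

same = np.array_equal(forest_a.predict(X_demo), forest_b.predict(X_demo))
print("identical predictions:", same)
```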

In [79]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")        
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

In [80]:
print_score(clf, X_train, y_train, X_test, y_test, train=True)
print_score(clf, X_train, y_train, X_test, y_test, train=False)

test_pred = clf.predict(X_test)

scorer = f1_score(y_test, test_pred)

"F1 Score:", scorer

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
               0.0   1.0  accuracy  macro avg  weighted avg
precision      1.0   1.0       1.0        1.0           1.0
recall         1.0   1.0       1.0        1.0           1.0
f1-score       1.0   1.0       1.0        1.0           1.0
support    20083.0  17.0       1.0    20100.0       20100.0
_______________________________________________
Confusion Matrix: 
 [[20083     0]
 [    0    17]]

Test Result:
Accuracy Score: 99.99%
_______________________________________________
CLASSIFICATION REPORT:
                   0.0       1.0  accuracy    macro avg  weighted avg
precision     0.999899  1.000000  0.999899     0.999949      0.999899
recall        1.000000  0.750000  0.999899     0.875000      0.999899
f1-score      0.999949  0.857143  0.999899     0.928546      0.999892
support    9896.000000  4.000000  0.999899  9900.000000   9900.000000
__________________________________

('F1 Score:', 0.8571428571428571)

We can see that the two isFraud classes are heavily imbalanced: the test set holds 9,896 legitimate payments and only 4 fraudulent ones. The f1-score for class 0.0 is about 1.00, while for class 1.0 it is 0.86, held down by a recall of 0.75. Let's use GridSearchCV to search for better hyperparameters.
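
As a quick check on the report above, the f1-score is the harmonic mean of precision and recall. The helper `f1_from` below is a hypothetical name introduced only for this illustration; the inputs are the class 1.0 precision and recall from the test report.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for class 1.0, taken from the test report above
fraud_f1 = f1_from(precision=1.0, recall=0.75)
print(round(fraud_f1, 4))
```

This agrees with the `f1_score` value printed for the test set.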

In [75]:

from sklearn.metrics import make_scorer

# Candidate values: 4 x 4 x 4 = 64 parameter combinations
param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10, 15],
}

# Wrap f1_score so that GridSearchCV optimizes F1 instead of accuracy
scorer = make_scorer(f1_score)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring=scorer, verbose=1)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
# Refit a fresh classifier using the best parameters found by the search
best_clf_classifier = RandomForestClassifier(**best_params)
best_clf_classifier.fit(X_train, y_train)

best_score = grid_search.best_score_

test_predictions = best_clf_classifier.predict(X_test)
"Best Parameters:", best_params


Fitting 5 folds for each of 64 candidates, totalling 320 fits


('Best Parameters:',
 {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 50})
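
With its default `refit=True`, GridSearchCV already retrains the best configuration on the full training data and exposes it as `best_estimator_`, so a separate refit is optional. The following is a minimal sketch on synthetic data, not the payments data; the names `small_grid`, `search`, and `tuned_clf`, the reduced grid, and the use of `StratifiedKFold` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=0
)

# A deliberately small grid so the example runs quickly
small_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}

# Stratified folds keep some positives in every validation fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0), small_grid, cv=cv, scoring="f1"
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on all of X_tr with the best parameters
tuned_clf = search.best_estimator_
demo_f1 = f1_score(y_te, tuned_clf.predict(X_te))
print("best params:", search.best_params_)
```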

After running the grid search, we find that the best parameters are 'max_depth': None, 'min_samples_split': 2, 'n_estimators': 50. Let's look at the results after refitting the Random Forest classifier with these parameters.

In [77]:
final_clf = RandomForestClassifier(
    max_depth=None, min_samples_split=2, n_estimators=50)
final_clf.fit(X_train, y_train)


def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    # Score the classifier passed in as `clf` instead of a global model
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")

    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")


# Evaluate the tuned model (final_clf), not the original clf
print_score(final_clf, X_train, y_train, X_test, y_test, train=True)
print_score(final_clf, X_train, y_train, X_test, y_test, train=False)

test_pred = final_clf.predict(X_test)

scorer = f1_score(y_test, test_pred)

"F1 Score:", scorer

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
               0.0   1.0  accuracy  macro avg  weighted avg
precision      1.0   1.0       1.0        1.0           1.0
recall         1.0   1.0       1.0        1.0           1.0
f1-score       1.0   1.0       1.0        1.0           1.0
support    20083.0  17.0       1.0    20100.0       20100.0
_______________________________________________
Confusion Matrix: 
 [[20083     0]
 [    0    17]]

Test Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
              0.0  1.0  accuracy  macro avg  weighted avg
precision     1.0  1.0       1.0        1.0           1.0
recall        1.0  1.0       1.0        1.0           1.0
f1-score      1.0  1.0       1.0        1.0           1.0
support    9896.0  4.0       1.0     9900.0        9900.0
_______________________________________________
Confusion Matrix: 
 [[9896    0]
 [   0    4]]

('F1 Score:', 1.0)

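With the tuned hyperparameters, the model reaches a testing accuracy of 100% and an f1-score of 1.0, correctly identifying all 4 fraudulent payments in the test set. Because the test set contains only 4 fraud cases, this shows perfect prediction on this particular split; a single missed case would lower recall for class 1.0 to 0.75, as in the first run.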
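
One way to see how much an f1-score depends on a particular split is to repeat the evaluation over several stratified folds. The sketch below assumes synthetic data from `make_classification` rather than the payments data, and `RepeatedStratifiedKFold` is a technique swapped in for illustration, not something used in the notebook above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=1
)

# 5 stratified folds repeated 3 times gives 15 separate F1 estimates
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=1),
    X_demo, y_demo, cv=cv, scoring="f1",
)

print(f"F1 mean: {scores.mean():.3f}, std: {scores.std():.3f}, folds: {len(scores)}")
```

A wide spread across folds would suggest that a single perfect score reflects the split as much as the model.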