## Modeling

With wrangling and preprocessing complete, we can begin modeling. I will use a Random Forest classifier because it is robust and generally accurate on classification problems. We first import the required libraries and packages and load the processed data.

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.utils import class_weight
from sklearn.metrics import make_scorer, f1_score

df = pd.read_csv("Wrangled_Payments.csv")



Here we define the target variable and the independent variables, followed by a train/test split.

In [9]:
y = df["isFraud"]

X = df.drop("isFraud", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
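
With so few fraud cases (the outputs further down show 14 positives in 10,000 rows), a purely random split can leave the test set with very few positives. A minimal sketch of a stratified split follows, using synthetic stand-in data rather than `Wrangled_Payments.csv`; the `X_demo` and `y_demo` names are hypothetical, and `stratify=` is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the payments data: 10,000 rows, 14 positives.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 5))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:14] = 1

# stratify=y_demo keeps the class ratio roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print("positives in train:", int(y_tr.sum()))
print("positives in test:", int(y_te.sum()))
```

Stratification does not create more fraud examples; it only makes the split proportions predictable from run to run.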


In [10]:

clf = RandomForestClassifier(n_estimators=100)

clf.fit(X_train, y_train)

In [11]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    """Print accuracy, classification report and confusion matrix for one split."""
    # Pick the split to evaluate: training data if train is True, else test data.
    X, y, label = (X_train, y_train, "Train") if train else (X_test, y_test, "Test")
    pred = clf.predict(X)
    clf_report = pd.DataFrame(classification_report(y, pred, output_dict=True))
    print(f"{label} Result:\n================================================")
    print(f"Accuracy Score: {accuracy_score(y, pred) * 100:.2f}%")
    print("_______________________________________________")
    print(f"CLASSIFICATION REPORT:\n{clf_report}")
    print("_______________________________________________")
    print(f"Confusion Matrix: \n {confusion_matrix(y, pred)}\n")

In [12]:
print_score(clf, X_train, y_train, X_test, y_test, train=True)
print_score(clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
              0.0   1.0  accuracy  macro avg  weighted avg
precision     1.0   1.0       1.0        1.0           1.0
recall        1.0   1.0       1.0        1.0           1.0
f1-score      1.0   1.0       1.0        1.0           1.0
support    6690.0  10.0       1.0     6700.0        6700.0
_______________________________________________
Confusion Matrix: 
 [[6690    0]
 [   0   10]]

Test Result:
Accuracy Score: 99.91%
_______________________________________________
CLASSIFICATION REPORT:
                   0.0   1.0  accuracy    macro avg  weighted avg
precision     0.999091  1.00  0.999091     0.999545      0.999092
recall        1.000000  0.25  0.999091     0.625000      0.999091
f1-score      0.999545  0.40  0.999091     0.699773      0.998818
support    3296.000000  4.00  0.999091  3300.000000   3300.000000
_______________________________________________
Confusion Matri
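
The test report shows why accuracy alone is misleading on this data: 99.91% accuracy sits next to a fraud recall of 0.25. The counts implied by the report (support of 4, recall 0.25, precision 1.00) are 1 true positive, 3 false negatives, 0 false positives and 3296 true negatives. The sketch below recomputes the metrics from those counts; `metrics_from_counts` is a hypothetical helper, not part of scikit-learn.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts derived from the test classification report above.
m = metrics_from_counts(tn=3296, fp=0, fn=3, tp=1)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Three of the four fraud cases are missed, yet they move accuracy by less than a tenth of a percentage point, which is why the tuning step below scores on F1 instead.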

In [17]:

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
}

# Score candidates by F1 on the fraud class; accuracy is dominated by the majority class.
scorer = make_scorer(f1_score)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring=scorer, verbose=1)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Use the estimator that GridSearchCV refit with the best parameters.
# A fresh RandomForestClassifier() would discard the tuned settings.
best_clf_classifier = grid_search.best_estimator_

test_predictions = best_clf_classifier.predict(X_test)
print("Best Parameters:", best_params)
print("Best Score:", best_score)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best Parameters: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 200}
Best Score: 0.6666666666666666
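
With only 10 fraud cases in the training set, each of the 5 folds holds about 2 positives, so the mean F1 of 0.667 is likely a noisy estimate. One way to gauge this is to look at the per-candidate spread in `cv_results_`. The sketch below does so on synthetic data from `make_classification`, with a deliberately small grid; the data and grid values are illustrative assumptions, not the notebook's.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced synthetic data standing in for the payments table.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.95, 0.05], random_state=0
)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [20, 50], "max_depth": [None, 5]},
    cv=5,
    scoring="f1",
)
search.fit(X_demo, y_demo)

# Mean and standard deviation of F1 across folds for every candidate.
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))

# With the default refit=True, best_estimator_ is already trained on all
# of the data passed to fit() using the best parameters.
tuned = search.best_estimator_
print(search.best_params_)
```

A large `std_test_score` relative to the gap between candidates suggests the ranking could change with a different split.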


In [20]:
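# Assumption: class_weights is not defined in the cells shown here; balanced
# weights from compute_class_weight are one plausible definition.
class_weights = class_weight.compute_class_weight(
    "balanced", classes=np.unique(y_train), y=y_train)

# Note: max_depth and min_samples_split here differ from the grid-search best
# parameters printed above (max_depth=10, min_samples_split=2).
final_clf = RandomForestClassifier(class_weight={0: class_weights[0], 1: class_weights[1]},
                                   max_depth=20, min_samples_split=5, n_estimators=200)
final_clf.fit(X_train, y_train)

test_pred = final_clf.predict(X_test)

print("Final Classification Report:\n", classification_report(y_test, test_pred))
print("Final Confusion Matrix:\n", confusion_matrix(y_test, test_pred))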

Final Classification Report:
               precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      3296
         1.0       1.00      0.25      0.40         4

    accuracy                           1.00      3300
   macro avg       1.00      0.62      0.70      3300
weighted avg       1.00      1.00      1.00      3300

Final Confusion Matrix:
 [[3296    0]
 [   3    1]]
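
The final model weights the two classes through `class_weights`. Assuming these are scikit-learn's "balanced" weights (the source does not say how they were computed), each weight is `n_samples / (n_classes * class_count)`. The sketch below works this out for the training-set counts reported earlier (6690 legitimate, 10 fraud); `y_demo` is a hypothetical label vector built only to reproduce those counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label vector with the same class counts as the training set above.
y_demo = np.array([0] * 6690 + [1] * 10)

weights = compute_class_weight("balanced", classes=np.unique(y_demo), y=y_demo)

# Manual check of the formula n_samples / (n_classes * class_count).
manual = len(y_demo) / (2 * np.bincount(y_demo))

print("balanced weights:", weights)
print("manual formula:  ", manual)
```

Under this assumption a single fraud row counts as much as several hundred legitimate rows during training. The final confusion matrix is nonetheless identical to the untuned test result implied by the first report, so weighting alone did not recover the missed fraud cases here.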
