<a href="https://colab.research.google.com/github/ShubhamPednekar19/Credit-Card-Fraud-Detection/blob/main/Credit-Card.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit Card Fraud Detection

## Problem Statement
As we are moving towards the digital world — cybersecurity is becoming a crucial part of our life. When we talk about security in digital life then the main challenge is to find the abnormal activity.

When we make any transaction while purchasing any product online — a good amount of people prefer credit cards. The credit limit in credit cards sometimes helps us me making purchases even if we don’t have the amount at that time. but, on the other hand, these features are misused by cyber attackers.

To tackle this problem we need a system that can abort the transaction if it finds fishy.

Here, comes the need for a system that can track the pattern of all the transactions and if any pattern is abnormal then the transaction should be aborted.

## Data Analysis
In September 2013, during the course of two days, European cardholders conducted credit card transactions that are part of the data collection. 492 fraudulent transactions out of a total of 2,84,807 transactions. With the positive class (frauds) accounting for 0.172% of all transactions, this data set is seriously out of balance. In order to ensure secrecy, the data set has also been adjusted using principal component analysis (PCA). All other features (V1, V2, V3, up to V28) are main components generated using PCA, with the exception of "time" and "amount." The seconds that passed between the initial transaction in the data set and the subsequent transactions are contained in the feature "time." The transaction amount is the feature's 'amount'. The feature 'class' stands for class labeling and accepts a value of 1 in cases of fraud and 0 in others.

# Importing Dependencies

In [None]:
# Importing the libraries
import numpy as np
import pandas as pd
import time

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from scipy import stats
from scipy.stats import norm, skew
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

import sklearn
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import average_precision_score, precision_recall_curve

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.linear_model import Ridge, Lasso, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
from sklearn.ensemble import AdaBoostClassifier

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Data Analysis

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
df = pd.read_csv('gdrive/MyDrive/Minor_CS354N/creditcard.csv')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [None]:
# Drop duplicates
df.drop_duplicates(inplace=True)
df.dropna(axis=0, how="any")

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [None]:
# As time is given in relative fashion, we are using pandas.Timedelta which Represents a duration, the difference between two times or dates.
Delta_Time = pd.to_timedelta(df['Time'], unit='s')

#Create derived columns Mins and hours
df['Time_Day'] = (Delta_Time.dt.components.days).astype(int)
df['Time_Hour'] = (Delta_Time.dt.components.hours).astype(int)
df['Time_Min'] = (Delta_Time.dt.components.minutes).astype(int)

# We will drop Time,as we have derived the Day/Hour/Minutes from the time column
df.drop('Time', axis = 1, inplace= True)
# We will keep only derived column hour, as day/minutes might not be very useful
df.drop(['Time_Day', 'Time_Min', 'Time_Hour'], axis = 1, inplace= True)

import copy

v1 = copy.deepcopy(df)
v2 = copy.deepcopy(df)
v3 = copy.deepcopy(df)
v4 = copy.deepcopy(df)
v5 = copy.deepcopy(df)

In [None]:
#Create a dataframe to store results
df_Results = pd.DataFrame(columns=['Methodology','Model','Train-Accuracy', 'Train-F-1 score', 'Train-ROC', 'Test-Accuracy', 'Test-F1 Score'])



# Splitting existing data into Test and Train dataset

In [None]:
# Import library
from sklearn.model_selection import train_test_split

X = df.drop(['Class'], axis=1)
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=100)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train['Amount'] = scaler.fit_transform(X_train[['Amount']])
X_test['Amount'] = scaler.transform(X_test[['Amount']])

# Model Building

## Without Oversampling

### Decision Tree Classifier

In [None]:

from sklearn.tree import DecisionTreeClassifier

# Create the parameter grid
param_grid = {
    'max_depth': range(5, 15, 5),
    'min_samples_leaf': range(50, 150, 50),
    'min_samples_split': range(50, 150, 50),
}


# Instantiate the grid search model
dtree = DecisionTreeClassifier()

grid_search = GridSearchCV(estimator = dtree,
                           param_grid = param_grid,
                           scoring= 'roc_auc',
                           cv = 3,
                           verbose = 0)

# Fit the grid search to the data
grid_search.fit(X_train,y_train)
print(grid_search.best_estimator_)

DecisionTreeClassifier(max_depth=10, min_samples_leaf=100, min_samples_split=50)


In [None]:
dt_imb_model = DecisionTreeClassifier(criterion = "gini",
                                  random_state = 100,
                                  max_depth=10,
                                  min_samples_leaf=100,
                                  min_samples_split=50)

dt_imb_model.fit(X_train, y_train)

y_train_pred = dt_imb_model.predict(X_train)

Accuracy_train = metrics.accuracy_score(y_train, y_train_pred)
F1_train = f1_score(y_train, y_train_pred)
y_train_pred_proba = dt_imb_model.predict_proba(X_train)[:,1]
auc_train = metrics.roc_auc_score(y_train, y_train_pred_proba)


# Predictions on the test set
y_test_pred = dt_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_train, y_train_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Without Oversampling','Model': 'Decision Tree','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results


              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.74      0.84      0.79       113

    accuracy                           1.00     56746
   macro avg       0.87      0.92      0.89     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274


### Random Forest Classifier

In [None]:
# Importing random forest classifier
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'max_depth': range(5,10,5),
    'min_samples_leaf': range(50, 150, 50),
    'min_samples_split': range(50, 150, 50),
    'n_estimators': [100,200,300],
    'max_features': [10, 20]
}
# Create a based model
rf = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf,
                           param_grid = param_grid,
                           cv = 2,
                           n_jobs = -1,
                           verbose = 1,
                           return_train_score=True)

# Fit the model
grid_search.fit(X_train, y_train)
print(grid_search.best_estimator_)

In [None]:
# model with the best hyperparameters

rfc_imb_model = RandomForestClassifier(bootstrap=True,
                             max_depth=5,
                             min_samples_leaf=50,
                             min_samples_split=50,
                             max_features=10,
                             n_estimators=100)

rfc_imb_model.fit(X_train, y_train)
y_train_pred = rfc_imb_model.predict(X_train)

Accuracy_train = metrics.accuracy_score(y_train, y_train_pred)
F1_train = f1_score(y_train, y_train_pred)
y_train_pred_proba = rfc_imb_model.predict_proba(X_train)[:,1]
auc_train = metrics.roc_auc_score(y_train, y_train_pred_proba)


y_test_pred = rfc_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Without Oversampling','Model': 'Random Forest','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.84      0.73      0.78       113

    accuracy                           1.00     56746
   macro avg       0.92      0.86      0.89     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999189,0.762274,0.985278,0.999207,0.762274
2,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739


### XGBoost

In [None]:
from xgboost import XGBClassifier

params = {'learning_rate': 0.2,
          'max_depth': 2,
          'n_estimators':200,
          'subsample':0.9,
         'objective':'binary:logistic'}

# fit model on training data
xgb_imb_model = XGBClassifier(params = params)
xgb_imb_model.fit(X_train, y_train)

y_train_pred = xgb_imb_model.predict(X_train)

Accuracy_train = metrics.accuracy_score(y_train, y_train_pred)
F1_train = f1_score(y_train, y_train_pred)
y_train_pred_proba = xgb_imb_model.predict_proba(X_train)[:,1]
auc_train = metrics.roc_auc_score(y_train, y_train_pred_proba)

y_test_pred = xgb_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Without Oversampling','Model': 'XGBoost','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

Parameters: { "params" } are not used.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.96      0.83      0.89       113

    accuracy                           1.00     56746
   macro avg       0.98      0.92      0.95     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995


### Aritifical Neural Networks

In [None]:
from sklearn.neural_network import MLPClassifier


mlp = MLPClassifier(hidden_layer_sizes=(5), activation='logistic', solver='adam', max_iter=100)
mlp.fit(X_train, y_train)

y_train_pred = mlp.predict(X_train)

Accuracy_train = metrics.accuracy_score(y_train, y_train_pred)
F1_train = f1_score(y_train, y_train_pred)
y_train_pred_proba = mlp.predict_proba(X_train)[:,1]
auc_train = metrics.roc_auc_score(y_train, y_train_pred_proba)

y_test_pred = mlp.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Without Oversampling','Model': 'ANN','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results


              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.83      0.83      0.83       113

    accuracy                           1.00     56746
   macro avg       0.92      0.92      0.92     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,XGBoost,0.99944,0.816208,0.989431,0.999401,0.846847
4,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858


## Applying Feature selection and then computing metrics

In [None]:
v1.drop(['V2', 'V3', 'V4', 'V6', 'V9', 'V10', 'V12', 'V25', 'V26', 'V27', 'V28'], axis = 1, inplace= True)

X = v1.drop('Class', axis = 1)
y = v1['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=100)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train['Amount'] = scaler.fit_transform(X_train[['Amount']])
X_test['Amount'] = scaler.transform(X_test[['Amount']])



### Decision tree classifier

In [None]:
dt_imb_model = DecisionTreeClassifier(criterion = "gini",
                                  random_state = 100,
                                  max_depth=10,
                                  min_samples_leaf=100,
                                  min_samples_split=50)

dt_imb_model.fit(X_train, y_train)

y_train_pred = dt_imb_model.predict(X_train)

Accuracy_train = metrics.accuracy_score(y_train, y_train_pred)
F1_train = f1_score(y_train, y_train_pred)
y_train_pred_proba = dt_imb_model.predict_proba(X_train)[:,1]
auc_train = metrics.roc_auc_score(y_train, y_train_pred_proba)


# Predictions on the test set
y_test_pred = dt_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_train, y_train_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Without Oversampling (V1)','Model': 'Decision Tree','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.72      0.83      0.77       113

    accuracy                           1.00     56746
   macro avg       0.86      0.92      0.88     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844


### Random Forest classifier

In [None]:
# model with the best hyperparameters

rfc_imb_model = RandomForestClassifier(bootstrap=True,
                             max_depth=5,
                             min_samples_leaf=50,
                             min_samples_split=50,
                             max_features=10,
                             n_estimators=100)

rfc_imb_model.fit(X_train, y_train)
y_train_pred = rfc_imb_model.predict(X_train)

Accuracy_train = metrics.accuracy_score(y_train, y_train_pred)
F1_train = f1_score(y_train, y_train_pred)
y_train_pred_proba = rfc_imb_model.predict_proba(X_train)[:,1]
auc_train = metrics.roc_auc_score(y_train, y_train_pred_proba)


y_test_pred = rfc_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Without Oversampling (V1)','Model': 'Random Forest','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.80      0.78      0.79       113

    accuracy                           1.00     56746
   macro avg       0.90      0.89      0.89     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238


### XGBoost

In [None]:
from xgboost import XGBClassifier

params = {'learning_rate': 0.2,
          'max_depth': 2,
          'n_estimators':200,
          'subsample':0.9,
         'objective':'binary:logistic'}

# fit model on training data
xgb_imb_model = XGBClassifier(params = params)
xgb_imb_model.fit(X_train, y_train)

y_train_pred = xgb_imb_model.predict(X_train)

Accuracy_train = metrics.accuracy_score(y_train, y_train_pred)
F1_train = f1_score(y_train, y_train_pred)
y_train_pred_proba = xgb_imb_model.predict_proba(X_train)[:,1]
auc_train = metrics.roc_auc_score(y_train, y_train_pred_proba)

y_test_pred = xgb_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Without Oversampling (V1)','Model': 'XGBoost','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

Parameters: { "params" } are not used.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.95      0.82      0.88       113

    accuracy                           1.00     56746
   macro avg       0.97      0.91      0.94     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517


### Aritifical Neural Networks

In [None]:
from sklearn.neural_network import MLPClassifier


mlp = MLPClassifier(hidden_layer_sizes=(5), activation='logistic', solver='adam', max_iter=100)
mlp.fit(X_train, y_train)

y_train_pred = mlp.predict(X_train)

Accuracy_train = metrics.accuracy_score(y_train, y_train_pred)
F1_train = f1_score(y_train, y_train_pred)
y_train_pred_proba = mlp.predict_proba(X_train)[:,1]
auc_train = metrics.roc_auc_score(y_train, y_train_pred_proba)

y_test_pred = mlp.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Without Oversampling (V1)','Model': 'ANN','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.88      0.81      0.84       113

    accuracy                           1.00     56746
   macro avg       0.94      0.90      0.92     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871


## Random Oversampling

In [None]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(sampling_strategy=0.005)
X_train_over, y_train_over = ros.fit_resample(X_train, y_train)

### Decision tree classifier

In [None]:
dt_imb_model = DecisionTreeClassifier(criterion = "gini",
                                  random_state = 100,
                                  max_depth=10,
                                  min_samples_leaf=100,
                                  min_samples_split=50)

dt_imb_model.fit(X_train_over, y_train_over)

y_train_pred = dt_imb_model.predict(X_train_over)

Accuracy_train = metrics.accuracy_score(y_train_over, y_train_pred)
F1_train = f1_score(y_train_over, y_train_pred)
y_train_pred_proba = dt_imb_model.predict_proba(X_train_over)[:,1]
auc_train = metrics.roc_auc_score(y_train_over, y_train_pred_proba)


# Predictions on the test set
y_test_pred = dt_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Random Oversampling (V1)','Model': 'Decision Tree','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.72      0.83      0.77       113

    accuracy                           1.00     56746
   macro avg       0.86      0.92      0.88     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492


### Random Forest classifier

In [None]:
# model with the best hyperparameters

rfc_imb_model = RandomForestClassifier(bootstrap=True,
                             max_depth=5,
                             min_samples_leaf=50,
                             min_samples_split=50,
                             max_features=10,
                             n_estimators=100)

rfc_imb_model.fit(X_train_over, y_train_over)
y_train_pred = rfc_imb_model.predict(X_train_over)

Accuracy_train = metrics.accuracy_score(y_train_over, y_train_pred)
F1_train = f1_score(y_train_over, y_train_pred)
y_train_pred_proba = rfc_imb_model.predict_proba(X_train_over)[:,1]
auc_train = metrics.roc_auc_score(y_train_over, y_train_pred_proba)


y_test_pred = rfc_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Random Oversampling (V1)','Model': 'Random Forest','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.79      0.83      0.81       113

    accuracy                           1.00     56746
   macro avg       0.89      0.92      0.90     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


### XGBoost

In [None]:
from xgboost import XGBClassifier

params = {'learning_rate': 0.2,
          'max_depth': 2,
          'n_estimators':200,
          'subsample':0.9,
         'objective':'binary:logistic'}

# fit model on training data
xgb_imb_model = XGBClassifier(params = params)
xgb_imb_model.fit(X_train_over, y_train_over)

y_train_pred = xgb_imb_model.predict(X_train_over)

Accuracy_train = metrics.accuracy_score(y_train_over, y_train_pred)
F1_train = f1_score(y_train_over, y_train_pred)
y_train_pred_proba = xgb_imb_model.predict_proba(X_train_over)[:,1]
auc_train = metrics.roc_auc_score(y_train_over, y_train_pred_proba)

y_test_pred = xgb_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Random Oversampling (V1)','Model': 'XGBoost','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

Parameters: { "params" } are not used.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.93      0.82      0.87       113

    accuracy                           1.00     56746
   macro avg       0.96      0.91      0.94     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


### Aritifical Neural Networks

In [None]:
from sklearn.neural_network import MLPClassifier


mlp = MLPClassifier(hidden_layer_sizes=(5), activation='logistic', solver='adam', max_iter=100)
mlp.fit(X_train_over, y_train_over)

y_train_pred = mlp.predict(X_train_over)

Accuracy_train = metrics.accuracy_score(y_train_over, y_train_pred)
F1_train = f1_score(y_train_over, y_train_pred)
y_train_pred_proba = mlp.predict_proba(X_train_over)[:,1]
auc_train = metrics.roc_auc_score(y_train_over, y_train_pred_proba)

y_test_pred = mlp.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Random Oversampling (V1)','Model': 'ANN','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.85      0.83      0.84       113

    accuracy                           1.00     56746
   macro avg       0.93      0.92      0.92     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


## Oversampling - SMOTE

In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(sampling_strategy=0.005)
X_train_smote, y_train_smote = ros.fit_resample(X_train, y_train)

### Decision tree classifier

In [None]:
dt_imb_model = DecisionTreeClassifier(criterion = "gini",
                                  random_state = 100,
                                  max_depth=10,
                                  min_samples_leaf=100,
                                  min_samples_split=50)

dt_imb_model.fit(X_train_smote, y_train_smote)

y_train_pred = dt_imb_model.predict(X_train_smote)

Accuracy_train = metrics.accuracy_score(y_train_smote, y_train_pred)
F1_train = f1_score(y_train_smote, y_train_pred)
y_train_pred_proba = dt_imb_model.predict_proba(X_train_smote)[:,1]
auc_train = metrics.roc_auc_score(y_train_smote, y_train_pred_proba)


# Predictions on the test set
y_test_pred = dt_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Smote Oversampling (V1)','Model': 'Decision Tree','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.72      0.83      0.77       113

    accuracy                           1.00     56746
   macro avg       0.86      0.92      0.88     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


### Random Forest classifier

In [None]:
# model with the best hyperparameters

rfc_imb_model = RandomForestClassifier(bootstrap=True,
                             max_depth=5,
                             min_samples_leaf=50,
                             min_samples_split=50,
                             max_features=10,
                             n_estimators=100)

rfc_imb_model.fit(X_train_smote, y_train_smote)
y_train_pred = rfc_imb_model.predict(X_train_smote)

Accuracy_train = metrics.accuracy_score(y_train_smote, y_train_pred)
F1_train = f1_score(y_train_smote, y_train_pred)
y_train_pred_proba = rfc_imb_model.predict_proba(X_train_smote)[:,1]
auc_train = metrics.roc_auc_score(y_train_smote, y_train_pred_proba)


y_test_pred = rfc_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Smote Oversampling (V1)','Model': 'Random Forest','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.78      0.82      0.80       113

    accuracy                           1.00     56746
   macro avg       0.89      0.91      0.90     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


### XGBoost

In [None]:
from xgboost import XGBClassifier

params = {'learning_rate': 0.2,
          'max_depth': 2,
          'n_estimators':200,
          'subsample':0.9,
         'objective':'binary:logistic'}

# fit model on training data
xgb_imb_model = XGBClassifier(params = params)
xgb_imb_model.fit(X_train_smote, y_train_smote)

y_train_pred = xgb_imb_model.predict(X_train_smote)

Accuracy_train = metrics.accuracy_score(y_train_smote, y_train_pred)
F1_train = f1_score(y_train_smote, y_train_pred)
y_train_pred_proba = xgb_imb_model.predict_proba(X_train_smote)[:,1]
auc_train = metrics.roc_auc_score(y_train_smote, y_train_pred_proba)

y_test_pred = xgb_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Smote Oversampling (V1)','Model': 'XGBoost','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

Parameters: { "params" } are not used.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.93      0.82      0.87       113

    accuracy                           1.00     56746
   macro avg       0.96      0.91      0.94     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


### Aritifical Neural Networks

In [None]:
from sklearn.neural_network import MLPClassifier


mlp = MLPClassifier(hidden_layer_sizes=(5), activation='logistic', solver='adam', max_iter=100)
mlp.fit(X_train_smote, y_train_smote)

y_train_pred = mlp.predict(X_train_smote)

Accuracy_train = metrics.accuracy_score(y_train_smote, y_train_pred)
F1_train = f1_score(y_train_smote, y_train_pred)
y_train_pred_proba = mlp.predict_proba(X_train_smote)[:,1]
auc_train = metrics.roc_auc_score(y_train_smote, y_train_pred_proba)

y_test_pred = mlp.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Smote Oversampling (V1)','Model': 'ANN','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.86      0.82      0.84       113

    accuracy                           1.00     56746
   macro avg       0.93      0.91      0.92     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


## Oversampling - Borderline SMOTE

In [None]:
from imblearn.over_sampling import BorderlineSMOTE
bsm = BorderlineSMOTE(sampling_strategy=0.005)
X_train_bsmote, y_train_bsmote = ros.fit_resample(X_train, y_train)

### Decision tree classifier

In [None]:
dt_imb_model = DecisionTreeClassifier(criterion = "gini",
                                  random_state = 100,
                                  max_depth=10,
                                  min_samples_leaf=100,
                                  min_samples_split=50)

dt_imb_model.fit(X_train_bsmote, y_train_bsmote)

y_train_pred = dt_imb_model.predict(X_train_bsmote)

Accuracy_train = metrics.accuracy_score(y_train_bsmote, y_train_pred)
F1_train = f1_score(y_train_bsmote, y_train_pred)
y_train_pred_proba = dt_imb_model.predict_proba(X_train_bsmote)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmote, y_train_pred_proba)


# Predictions on the test set
y_test_pred = dt_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Borderline Smote Oversampling (V1)','Model': 'Decision Tree','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.72      0.83      0.77       113

    accuracy                           1.00     56746
   macro avg       0.86      0.92      0.88     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


### Random Forest classifier

In [None]:
# model with the best hyperparameters

rfc_imb_model = RandomForestClassifier(bootstrap=True,
                             max_depth=5,
                             min_samples_leaf=50,
                             min_samples_split=50,
                             max_features=10,
                             n_estimators=100)

rfc_imb_model.fit(X_train_bsmote, y_train_bsmote)
y_train_pred = rfc_imb_model.predict(X_train_bsmote)

Accuracy_train = metrics.accuracy_score(y_train_bsmote, y_train_pred)
F1_train = f1_score(y_train_bsmote, y_train_pred)
y_train_pred_proba = rfc_imb_model.predict_proba(X_train_bsmote)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmote, y_train_pred_proba)


y_test_pred = rfc_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Borderline Smote Oversampling (V1)','Model': 'Random Forest','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.78      0.83      0.80       113

    accuracy                           1.00     56746
   macro avg       0.89      0.92      0.90     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


### XGBoost

In [None]:
from xgboost import XGBClassifier

params = {'learning_rate': 0.2,
          'max_depth': 2,
          'n_estimators':200,
          'subsample':0.9,
         'objective':'binary:logistic'}

# fit model on training data
xgb_imb_model = XGBClassifier(params = params)
xgb_imb_model.fit(X_train_bsmote, y_train_bsmote)

y_train_pred = xgb_imb_model.predict(X_train_bsmote)

Accuracy_train = metrics.accuracy_score(y_train_bsmote, y_train_pred)
F1_train = f1_score(y_train_bsmote, y_train_pred)
y_train_pred_proba = xgb_imb_model.predict_proba(X_train_bsmote)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmote, y_train_pred_proba)

y_test_pred = xgb_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Borderline Smote Oversampling (V1)','Model': 'XGBoost','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

Parameters: { "params" } are not used.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.94      0.83      0.88       113

    accuracy                           1.00     56746
   macro avg       0.97      0.92      0.94     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


### Aritifical Neural Networks

In [None]:
from sklearn.neural_network import MLPClassifier


mlp = MLPClassifier(hidden_layer_sizes=(5), activation='logistic', solver='adam', max_iter=100)
mlp.fit(X_train_bsmote, y_train_bsmote)

y_train_pred = mlp.predict(X_train_bsmote)

Accuracy_train = metrics.accuracy_score(y_train_bsmote, y_train_pred)
F1_train = f1_score(y_train_bsmote, y_train_pred)
y_train_pred_proba = mlp.predict_proba(X_train_bsmote)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmote, y_train_pred_proba)

y_test_pred = mlp.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Borderline Smote Oversampling (V1)','Model': 'ANN','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.84      0.81      0.83       113

    accuracy                           1.00     56746
   macro avg       0.92      0.91      0.91     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from imblearn.combine import SMOTEENN
sme = SMOTEENN(sampling_strategy=0.005)
X_train_smoteenn, y_train_bsmoteenn = ros.fit_resample(X_train, y_train)

In [None]:
dt_imb_model = DecisionTreeClassifier(criterion = "gini",
                                  random_state = 100,
                                  max_depth=10,
                                  min_samples_leaf=100,
                                  min_samples_split=50)

dt_imb_model.fit(X_train_smoteenn, y_train_bsmoteenn)

y_train_pred = dt_imb_model.predict(X_train_smoteenn)

Accuracy_train = metrics.accuracy_score(y_train_bsmoteenn, y_train_pred)
F1_train = f1_score(y_train_bsmoteenn, y_train_pred)
y_train_pred_proba = dt_imb_model.predict_proba(X_train_smoteenn)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmoteenn, y_train_pred_proba)


# Predictions on the test set
y_test_pred = dt_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Smoteenn Oversampling (V1)','Model': 'Decision Tree','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.72      0.83      0.77       113

    accuracy                           1.00     56746
   macro avg       0.86      0.92      0.88     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
# model with the best hyperparameters

rfc_imb_model = RandomForestClassifier(bootstrap=True,
                             max_depth=5,
                             min_samples_leaf=50,
                             min_samples_split=50,
                             max_features=10,
                             n_estimators=100)

rfc_imb_model.fit(X_train_smoteenn, y_train_bsmoteenn)
y_train_pred = rfc_imb_model.predict(X_train_smoteenn)

Accuracy_train = metrics.accuracy_score(y_train_bsmoteenn, y_train_pred)
F1_train = f1_score(y_train_bsmoteenn, y_train_pred)
y_train_pred_proba = rfc_imb_model.predict_proba(X_train_smoteenn)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmoteenn, y_train_pred_proba)


y_test_pred = rfc_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Smoteenn Oversampling (V1)','Model': 'Random Forest','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.79      0.82      0.81       113

    accuracy                           1.00     56746
   macro avg       0.89      0.91      0.90     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from xgboost import XGBClassifier

params = {'learning_rate': 0.2,
          'max_depth': 2,
          'n_estimators':200,
          'subsample':0.9,
         'objective':'binary:logistic'}

# fit model on training data
xgb_imb_model = XGBClassifier(params = params)
xgb_imb_model.fit(X_train_smoteenn, y_train_bsmoteenn)

y_train_pred = xgb_imb_model.predict(X_train_smoteenn)

Accuracy_train = metrics.accuracy_score(y_train_bsmoteenn, y_train_pred)
F1_train = f1_score(y_train_bsmoteenn, y_train_pred)
y_train_pred_proba = xgb_imb_model.predict_proba(X_train_smoteenn)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmoteenn, y_train_pred_proba)

y_test_pred = xgb_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Smoteenn Oversampling (V1)','Model': 'XGBoost','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

Parameters: { "params" } are not used.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.93      0.81      0.87       113

    accuracy                           1.00     56746
   macro avg       0.96      0.91      0.93     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from sklearn.neural_network import MLPClassifier


mlp = MLPClassifier(hidden_layer_sizes=(5), activation='logistic', solver='adam', max_iter=100)
mlp.fit(X_train_smoteenn, y_train_bsmoteenn)

y_train_pred = mlp.predict(X_train_smoteenn)

Accuracy_train = metrics.accuracy_score(y_train_bsmoteenn, y_train_pred)
F1_train = f1_score(y_train_bsmoteenn, y_train_pred)
y_train_pred_proba = mlp.predict_proba(X_train_smoteenn)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmoteenn, y_train_pred_proba)

y_test_pred = mlp.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Smoteenn Oversampling (V1)','Model': 'ANN','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.86      0.81      0.83       113

    accuracy                           1.00     56746
   macro avg       0.93      0.90      0.92     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
df1 = pd.read_csv('gdrive/MyDrive/Minor_CS354N/creditcard.csv')
df1.drop_duplicates(inplace=True)
df1.dropna(axis=0, how="any")


Delta_Time = pd.to_timedelta(df1['Time'], unit='s')


df1['Time_Day'] = (Delta_Time.dt.components.days).astype(int)
df1['Time_Hour'] = (Delta_Time.dt.components.hours).astype(int)
df1['Time_Min'] = (Delta_Time.dt.components.minutes).astype(int)


df1.drop('Time', axis = 1, inplace= True)

df1.drop(['Time_Day', 'Time_Min', 'Time_Hour'], axis = 1, inplace= True)

from sklearn.model_selection import train_test_split

X = df1.drop(['Class'], axis=1)
y = df1['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=100)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train['Amount'] = scaler.fit_transform(X_train[['Amount']])
X_test['Amount'] = scaler.transform(X_test[['Amount']])

In [None]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(sampling_strategy=0.005)
X_train_over, y_train_over = ros.fit_resample(X_train, y_train)

In [None]:
dt_imb_model = DecisionTreeClassifier(criterion = "gini",
                                  random_state = 100,
                                  max_depth=10,
                                  min_samples_leaf=100,
                                  min_samples_split=50)

dt_imb_model.fit(X_train_over, y_train_over)

y_train_pred = dt_imb_model.predict(X_train_over)

Accuracy_train = metrics.accuracy_score(y_train_over, y_train_pred)
F1_train = f1_score(y_train_over, y_train_pred)
y_train_pred_proba = dt_imb_model.predict_proba(X_train_over)[:,1]
auc_train = metrics.roc_auc_score(y_train_over, y_train_pred_proba)


# Predictions on the test set
y_test_pred = dt_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Random Oversampling (V1)','Model': 'Decision Tree','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.82      0.79      0.81       113

    accuracy                           1.00     56746
   macro avg       0.91      0.89      0.90     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
# model with the best hyperparameters

rfc_imb_model = RandomForestClassifier(bootstrap=True,
                             max_depth=5,
                             min_samples_leaf=50,
                             min_samples_split=50,
                             max_features=10,
                             n_estimators=100)

rfc_imb_model.fit(X_train_over, y_train_over)
y_train_pred = rfc_imb_model.predict(X_train_over)

Accuracy_train = metrics.accuracy_score(y_train_over, y_train_pred)
F1_train = f1_score(y_train_over, y_train_pred)
y_train_pred_proba = rfc_imb_model.predict_proba(X_train_over)[:,1]
auc_train = metrics.roc_auc_score(y_train_over, y_train_pred_proba)


y_test_pred = rfc_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Random Oversampling (V1)','Model': 'Random Forest','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.81      0.83      0.82       113

    accuracy                           1.00     56746
   macro avg       0.91      0.92      0.91     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from xgboost import XGBClassifier

params = {'learning_rate': 0.2,
          'max_depth': 2,
          'n_estimators':200,
          'subsample':0.9,
         'objective':'binary:logistic'}

# fit model on training data
xgb_imb_model = XGBClassifier(params = params)
xgb_imb_model.fit(X_train_over, y_train_over)

y_train_pred = xgb_imb_model.predict(X_train_over)

Accuracy_train = metrics.accuracy_score(y_train_over, y_train_pred)
F1_train = f1_score(y_train_over, y_train_pred)
y_train_pred_proba = xgb_imb_model.predict_proba(X_train_over)[:,1]
auc_train = metrics.roc_auc_score(y_train_over, y_train_pred_proba)

y_test_pred = xgb_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Random Oversampling (V1)','Model': 'XGBoost','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

Parameters: { "params" } are not used.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.96      0.82      0.89       113

    accuracy                           1.00     56746
   macro avg       0.98      0.91      0.94     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from sklearn.neural_network import MLPClassifier


mlp = MLPClassifier(hidden_layer_sizes=(5), activation='logistic', solver='adam', max_iter=100)
mlp.fit(X_train_over, y_train_over)

y_train_pred = mlp.predict(X_train_over)

Accuracy_train = metrics.accuracy_score(y_train_over, y_train_pred)
F1_train = f1_score(y_train_over, y_train_pred)
y_train_pred_proba = mlp.predict_proba(X_train_over)[:,1]
auc_train = metrics.roc_auc_score(y_train_over, y_train_pred_proba)

y_test_pred = mlp.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Random Oversampling (V1)','Model': 'ANN','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.82      0.83      0.82       113

    accuracy                           1.00     56746
   macro avg       0.91      0.92      0.91     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(sampling_strategy=0.005)
X_train_smote, y_train_smote = ros.fit_resample(X_train, y_train)

In [None]:
dt_imb_model = DecisionTreeClassifier(criterion = "gini",
                                  random_state = 100,
                                  max_depth=10,
                                  min_samples_leaf=100,
                                  min_samples_split=50)

dt_imb_model.fit(X_train_smote, y_train_smote)

y_train_pred = dt_imb_model.predict(X_train_smote)

Accuracy_train = metrics.accuracy_score(y_train_smote, y_train_pred)
F1_train = f1_score(y_train_smote, y_train_pred)
y_train_pred_proba = dt_imb_model.predict_proba(X_train_smote)[:,1]
auc_train = metrics.roc_auc_score(y_train_smote, y_train_pred_proba)


# Predictions on the test set
y_test_pred = dt_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Smote Oversampling (V1)','Model': 'Decision Tree','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results


              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.74      0.84      0.79       113

    accuracy                           1.00     56746
   macro avg       0.87      0.92      0.89     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
# model with the best hyperparameters

rfc_imb_model = RandomForestClassifier(bootstrap=True,
                             max_depth=5,
                             min_samples_leaf=50,
                             min_samples_split=50,
                             max_features=10,
                             n_estimators=100)

rfc_imb_model.fit(X_train_smote, y_train_smote)
y_train_pred = rfc_imb_model.predict(X_train_smote)

Accuracy_train = metrics.accuracy_score(y_train_smote, y_train_pred)
F1_train = f1_score(y_train_smote, y_train_pred)
y_train_pred_proba = rfc_imb_model.predict_proba(X_train_smote)[:,1]
auc_train = metrics.roc_auc_score(y_train_smote, y_train_pred_proba)


y_test_pred = rfc_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Smote Oversampling','Model': 'Random Forest','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.79      0.82      0.81       113

    accuracy                           1.00     56746
   macro avg       0.90      0.91      0.90     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from xgboost import XGBClassifier

params = {'learning_rate': 0.2,
          'max_depth': 2,
          'n_estimators':200,
          'subsample':0.9,
         'objective':'binary:logistic'}

# fit model on training data
xgb_imb_model = XGBClassifier(params = params)
xgb_imb_model.fit(X_train_smote, y_train_smote)

y_train_pred = xgb_imb_model.predict(X_train_smote)

Accuracy_train = metrics.accuracy_score(y_train_smote, y_train_pred)
F1_train = f1_score(y_train_smote, y_train_pred)
y_train_pred_proba = xgb_imb_model.predict_proba(X_train_smote)[:,1]
auc_train = metrics.roc_auc_score(y_train_smote, y_train_pred_proba)

y_test_pred = xgb_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Smote Oversampling','Model': 'XGBoost','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

Parameters: { "params" } are not used.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.95      0.84      0.89       113

    accuracy                           1.00     56746
   macro avg       0.97      0.92      0.95     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from sklearn.neural_network import MLPClassifier


mlp = MLPClassifier(hidden_layer_sizes=(5), activation='logistic', solver='adam', max_iter=100)
mlp.fit(X_train_smote, y_train_smote)

y_train_pred = mlp.predict(X_train_smote)

Accuracy_train = metrics.accuracy_score(y_train_smote, y_train_pred)
F1_train = f1_score(y_train_smote, y_train_pred)
y_train_pred_proba = mlp.predict_proba(X_train_smote)[:,1]
auc_train = metrics.roc_auc_score(y_train_smote, y_train_pred_proba)

y_test_pred = mlp.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Smote Oversampling','Model': 'ANN','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.83      0.82      0.83       113

    accuracy                           1.00     56746
   macro avg       0.92      0.91      0.91     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from imblearn.over_sampling import BorderlineSMOTE
bsm = BorderlineSMOTE(sampling_strategy=0.005)
X_train_bsmote, y_train_bsmote = ros.fit_resample(X_train, y_train)

In [None]:

dt_imb_model = DecisionTreeClassifier(criterion = "gini",
                                  random_state = 100,
                                  max_depth=10,
                                  min_samples_leaf=100,
                                  min_samples_split=50)

dt_imb_model.fit(X_train_bsmote, y_train_bsmote)

y_train_pred = dt_imb_model.predict(X_train_bsmote)

Accuracy_train = metrics.accuracy_score(y_train_bsmote, y_train_pred)
F1_train = f1_score(y_train_bsmote, y_train_pred)
y_train_pred_proba = dt_imb_model.predict_proba(X_train_bsmote)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmote, y_train_pred_proba)


# Predictions on the test set
y_test_pred = dt_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Borderline Smote Oversampling','Model': 'Decision Tree','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results


              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.74      0.84      0.79       113

    accuracy                           1.00     56746
   macro avg       0.87      0.92      0.89     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:


# model with the best hyperparameters

rfc_imb_model = RandomForestClassifier(bootstrap=True,
                             max_depth=5,
                             min_samples_leaf=50,
                             min_samples_split=50,
                             max_features=10,
                             n_estimators=100)

rfc_imb_model.fit(X_train_bsmote, y_train_bsmote)
y_train_pred = rfc_imb_model.predict(X_train_bsmote)

Accuracy_train = metrics.accuracy_score(y_train_bsmote, y_train_pred)
F1_train = f1_score(y_train_bsmote, y_train_pred)
y_train_pred_proba = rfc_imb_model.predict_proba(X_train_bsmote)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmote, y_train_pred_proba)


y_test_pred = rfc_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Borderline Smote Oversampling','Model': 'Random Forest','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.81      0.84      0.82       113

    accuracy                           1.00     56746
   macro avg       0.90      0.92      0.91     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from xgboost import XGBClassifier

params = {'learning_rate': 0.2,
          'max_depth': 2,
          'n_estimators':200,
          'subsample':0.9,
         'objective':'binary:logistic'}

# fit model on training data
xgb_imb_model = XGBClassifier(params = params)
xgb_imb_model.fit(X_train_bsmote, y_train_bsmote)

y_train_pred = xgb_imb_model.predict(X_train_bsmote)

Accuracy_train = metrics.accuracy_score(y_train_bsmote, y_train_pred)
F1_train = f1_score(y_train_bsmote, y_train_pred)
y_train_pred_proba = xgb_imb_model.predict_proba(X_train_bsmote)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmote, y_train_pred_proba)

y_test_pred = xgb_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Borderline Smote Oversampling','Model': 'XGBoost','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results


Parameters: { "params" } are not used.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.94      0.84      0.89       113

    accuracy                           1.00     56746
   macro avg       0.97      0.92      0.94     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from sklearn.neural_network import MLPClassifier


mlp = MLPClassifier(hidden_layer_sizes=(5), activation='logistic', solver='adam', max_iter=100)
mlp.fit(X_train_bsmote, y_train_bsmote)

y_train_pred = mlp.predict(X_train_bsmote)

Accuracy_train = metrics.accuracy_score(y_train_bsmote, y_train_pred)
F1_train = f1_score(y_train_bsmote, y_train_pred)
y_train_pred_proba = mlp.predict_proba(X_train_bsmote)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmote, y_train_pred_proba)

y_test_pred = mlp.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'Borderline Smote Oversampling','Model': 'ANN','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.83      0.83      0.83       113

    accuracy                           1.00     56746
   macro avg       0.92      0.92      0.92     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from imblearn.combine import SMOTETomek
smt = SMOTETomek(sampling_strategy=0.005)
X_train_smotet, y_train_bsmotet = ros.fit_resample(X_train, y_train)

In [None]:
from xgboost import XGBClassifier

params = {'learning_rate': 0.2,
          'max_depth': 2,
          'n_estimators':200,
          'subsample':0.9,
         'objective':'binary:logistic'}

# fit model on training data
xgb_imb_model = XGBClassifier(params = params)
xgb_imb_model.fit(X_train_smotet, y_train_bsmotet)

y_train_pred = xgb_imb_model.predict(X_train_smotet)

Accuracy_train = metrics.accuracy_score(y_train_bsmotet, y_train_pred)
F1_train = f1_score(y_train_bsmotet, y_train_pred)
y_train_pred_proba = xgb_imb_model.predict_proba(X_train_smotet)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmotet, y_train_pred_proba)

y_test_pred = xgb_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'SmoteTomek Oversampling','Model': 'XGBoost','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results


Parameters: { "params" } are not used.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.96      0.83      0.89       113

    accuracy                           1.00     56746
   macro avg       0.98      0.92      0.95     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from imblearn.over_sampling import SVMSMOTE
sm = SVMSMOTE(sampling_strategy=0.005)
X_train_smotet, y_train_bsmotet = ros.fit_resample(X_train, y_train)

from xgboost import XGBClassifier

params = {'learning_rate': 0.2,
          'max_depth': 2,
          'n_estimators':200,
          'subsample':0.9,
         'objective':'binary:logistic'}

# fit model on training data
xgb_imb_model = XGBClassifier(params = params)
xgb_imb_model.fit(X_train_smotet, y_train_bsmotet)

y_train_pred = xgb_imb_model.predict(X_train_smotet)

Accuracy_train = metrics.accuracy_score(y_train_bsmotet, y_train_pred)
F1_train = f1_score(y_train_bsmotet, y_train_pred)
y_train_pred_proba = xgb_imb_model.predict_proba(X_train_smotet)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmotet, y_train_pred_proba)

y_test_pred = xgb_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'SVMSmote Oversampling','Model': 'XGBoost','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

Parameters: { "params" } are not used.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.94      0.83      0.88       113

    accuracy                           1.00     56746
   macro avg       0.97      0.92      0.94     56746
weighted avg       1.00      1.00      1.00     56746



Unnamed: 0,Methodology,Model,Train-Accuracy,Train-F-1 score,Train-ROC,Test-Accuracy,Test-F1 Score
0,Without Oversampling,Decision Tree,0.999189,0.762274,0.985278,0.999084,0.762274
1,Without Oversampling,Random Forest,0.999344,0.7739,0.985278,0.999172,0.7739
2,Without Oversampling,XGBoost,1.0,1.0,1.0,0.999595,0.890995
3,Without Oversampling,ANN,0.99941,0.808023,0.990551,0.99933,0.831858
4,Without Oversampling (V1),Decision Tree,0.999172,0.755844,0.977998,0.999013,0.755844
5,Without Oversampling (V1),Random Forest,0.999326,0.777293,0.978577,0.999172,0.789238
6,Without Oversampling (V1),XGBoost,1.0,1.0,1.0,0.999559,0.881517
7,Without Oversampling (V1),ANN,0.999454,0.814925,0.972435,0.999383,0.83871
8,Random Oversampling (V1),Decision Tree,0.99852,0.845342,0.976537,0.999013,0.770492
9,Random Oversampling (V1),Random Forest,0.998722,0.863316,0.979454,0.999225,0.810345


In [None]:
from imblearn.over_sampling import KMeansSMOTE
sm = KMeansSMOTE(
    sampling_strategy=0.005,
    cluster_balance_threshold=0.001
)
X_train_smotet, y_train_bsmotet = sm.fit_resample(X_train, y_train)

from xgboost import XGBClassifier

params = {'learning_rate': 0.2,
          'max_depth': 2,
          'n_estimators':200,
          'subsample':0.9,
         'objective':'binary:logistic'}

# fit model on training data
xgb_imb_model = XGBClassifier(params = params)
xgb_imb_model.fit(X_train_smotet, y_train_bsmotet)

y_train_pred = xgb_imb_model.predict(X_train_smotet)

Accuracy_train = metrics.accuracy_score(y_train_bsmotet, y_train_pred)
F1_train = f1_score(y_train_bsmotet, y_train_pred)
y_train_pred_proba = xgb_imb_model.predict_proba(X_train_smotet)[:,1]
auc_train = metrics.roc_auc_score(y_train_bsmotet, y_train_pred_proba)

y_test_pred = xgb_imb_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
Accuracy_test = metrics.accuracy_score(y_test, y_test_pred)
F1_test = f1_score(y_test, y_test_pred)

df_Results = df_Results.append(pd.DataFrame({'Methodology': 'KmeansSmote Oversampling','Model': 'XGBoost','Train-Accuracy': Accuracy_train,'Train-F-1 score': F1_train,'Train-ROC': auc_train, 'Test-Accuracy': Accuracy_test, 'Test-F1 Score': F1_test}, index=[0]),ignore_index= True)

df_Results

Parameters: { "params" } are not used.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56633
           1       0.96      0.82      0.89       113

    accuracy                           1.00     56746
   macro avg       0.98      0.91      0.94     56746
weighted avg       1.00      1.00      1.00     56746



NameError: ignored