## QUESTION 4

## MODELLING THE MACHINE LEARNING ALGORITHMS FOR FRAUD CLASSIFICATION

This problem is about correctly classifying a transaction as fraudulent or genuine, which makes it a binary classification problem.

I will fit four classification models:

1) Logistic Regression

2) Decision Tree

3) Random Forest 

4) XGBOOST


We will compare the performance of these algorithms and decide which one is most suitable.

In [1]:
#Import Libraries Required

import numpy as np
import pandas as pd
from matplotlib import pyplot
import matplotlib.pyplot as plt
import sklearn
import imblearn
from xgboost import XGBClassifier
import seaborn as sns
%matplotlib inline

#set random seed for reproducibility
np.random.seed(20)
import warnings
warnings.filterwarnings('ignore')

In [2]:
#Load the Data

df_preprocess = pd.read_pickle('data/pickles/df_preprocess')

In [None]:
#Check for Percentage Statistics of Fraudulent vs Genuine Transactions [Fraud: Class 1, Genuine: Class 0]

fraud_trans = len(df_preprocess[df_preprocess.isFraud == 1])
gen_trans = len(df_preprocess[df_preprocess.isFraud == 0])

fraud_pctg = (fraud_trans)/(fraud_trans + gen_trans) * 100

print("Number of Genuine transactions: ", gen_trans)
print("Number of Fraud transactions: ", fraud_trans)
print("Percentage of Fraud transactions: {:.4f}".format(fraud_pctg))

In [None]:
#Plotting the Class Imbalance 

sns.countplot(x ='isFraud', data = df_preprocess)


In [None]:
#Let us normalize fields for Logistic Regression 

from sklearn import preprocessing

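def normalize(df): 
    # Work on a copy so that df_preprocess itself is left unchanged
    df = df.copy()
    # Define columns to normalize
    fields = ["creditLimit", "availableMoney", "transactionAmount", "transactionformatted",
              "currentBalance", "timeSinceAccountOpening", "timeTillExp", "timeSinceAddressChange"]
    
    # axis=0 scales each column (feature) to unit L2 norm, rather than each row
    for j in fields:
        df[j] = preprocessing.normalize(df[[j]], axis=0).ravel()
    
    return df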

df_normalized = normalize(df_preprocess)
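To make the scaling step concrete, here is a minimal sketch of what `preprocessing.normalize(..., axis=0)` does to a single column, next to `StandardScaler` for comparison. The values are made up for illustration and `StandardScaler` is not used in this notebook; it is shown only as a common alternative.

```python
import numpy as np
from sklearn import preprocessing

# Made-up values for a single feature column (hypothetical, not from the dataset)
amounts = np.array([[3.0], [4.0], [0.0]])

# axis=0 divides the whole column by its L2 norm: sqrt(3^2 + 4^2 + 0^2) = 5
unit_norm = preprocessing.normalize(amounts, axis=0)

# StandardScaler instead subtracts the column mean and divides by the standard deviation
standardized = preprocessing.StandardScaler().fit_transform(amounts)

print("Unit-norm column:   ", unit_norm.ravel())
print("Standardized column:", standardized.ravel())
```

Unit-norm scaling keeps zeros at zero and depends on the number of rows, whereas standardization centers the column, so the two can lead to different Logistic Regression coefficients.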

In [None]:
#Define our Dependent and Independent Variables i.e X and Y

Y = df_normalized["isFraud"]
X = df_normalized.drop(["isFraud"], axis=1)



In [None]:
#Perform Undersampling to Address Class Imbalance in our Data 

sample = imblearn.under_sampling.RandomUnderSampler(random_state = 42)
X, Y = sample.fit_resample(X, Y)
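As a rough illustration of what random undersampling does, the sketch below balances a toy frame using pandas only. The data is made up, and this is a simplified stand-in for `RandomUnderSampler`, not its actual implementation.

```python
import pandas as pd

# Made-up toy data: 8 genuine rows (0) and 2 fraud rows (1)
toy = pd.DataFrame({"amount": range(10), "isFraud": [0] * 8 + [1] * 2})

# Size of the smallest class; every class is sampled down to this size
minority_size = toy["isFraud"].value_counts().min()

# Draw the same number of rows from each class without replacement
balanced = toy.groupby("isFraud").sample(n=minority_size, random_state=42)

print(balanced["isFraud"].value_counts())
```

Here 6 of the 8 genuine rows are discarded, which shows how much of the majority class is lost when the imbalance is severe.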

In [None]:
from sklearn.model_selection import train_test_split 

#Split Data into Training and Testing Sets
X_train,X_test,Y_train,Y_test = train_test_split(X,Y, test_size = 0.30, random_state = 0)

#Display Train and Test Set Shapes
print("Training Set Shape : " ,X_train.shape)
print("Testing Set Shape : " ,X_test.shape)
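The split above does not pass `stratify`, so the class ratio in the training and testing sets can drift slightly from 50/50. A minimal sketch of a stratified split on made-up data, assuming we want both sets to keep the same class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up balanced toy data: 10 rows of each class
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# stratify=y_toy keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=0, stratify=y_toy
)

print("Fraud share in training set:", y_tr.mean())
print("Fraud share in testing set: ", y_te.mean())
```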

## Logistic Regression 



In [None]:
#Fitting Logistic Regression to our Data 

from sklearn.linear_model  import LogisticRegression
Log_Regression = LogisticRegression(random_state=42, max_iter=1000)

Log_Regression.fit(X_train, Y_train)
Log_Reg_predictions = Log_Regression.predict(X_test)

Log_Reg_predictions_prob = Log_Regression.predict_proba(X_test)
Log_Reg_Roc = [Y_test, Log_Reg_predictions_prob ]



In [None]:
#Here I write a reusable function that calculates the important metrics of classification algorithms 

from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, recall_score, f1_score,roc_auc_score, ConfusionMatrixDisplay,roc_curve

def display_classification_metrics(actuals, predictions):
    # All metrics below treat class 1 as the positive class
    acc = accuracy_score(actuals, predictions)
    prec = precision_score(actuals, predictions)
    rec = recall_score(actuals, predictions)
    f1 = f1_score(actuals, predictions)
    cm = confusion_matrix(actuals, predictions)
    
    # Print the classification metrics
    print("Classification Metrics:\n")
    print("--------------------------")
    print(f"Accuracy:\t {acc:.4f}")
    print(f"Precision:\t {prec:.4f}")
    print(f"Recall:\t\t {rec:.4f}")
    print(f"F1-score:\t {f1:.4f}")
     
    
    # Display the confusion matrix
    cmd = ConfusionMatrixDisplay(cm, display_labels=[0, 1])
    cmd.plot(cmap='Blues')
    plt.title("Confusion Matrix")
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()
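For reference, the four metrics printed above can be derived directly from the confusion matrix counts. The sketch below uses made-up counts (not results from this notebook), and the helper name `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the predicted frauds, how many were fraud
    recall = tp / (tp + fn)      # of the actual frauds, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts for illustration only
acc, prec, rec, f1 = metrics_from_counts(tp=70, fp=30, fn=20, tn=80)
print(f"Accuracy: {acc:.4f}, Precision: {prec:.4f}, Recall: {rec:.4f}, F1: {f1:.4f}")
```

In a fraud setting recall is often the metric to watch, since a missed fraud (false negative) is usually more costly than a false alarm.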
    

In [None]:
#Function that will calculate feature Importance score and plot it for the top 50 features

def feature_importance(importance, top_n=50):
    # `importance` must hold one raw score per column of X_train (not argsort indices);
    # `model` is the global model name set just before this function is called
    importance = np.asarray(importance)
    # Rank by magnitude so that large negative coefficients also count as important
    order = np.argsort(np.abs(importance))[::-1][:top_n]
    num = len(order)
    print(f"Plotting Feature Importance Scores of {model} model for top {num} features:")
    print("---------------------------------------------------------------------------------")
    # summarize feature importance
    for i in order:
        print('Feature: %s, Score: %.5f' % (X_train.columns[i], importance[i]))
    # plot feature importance
    plt.figure(figsize=(20,5))
    plt.bar(np.arange(num), importance[order], color = list('rgbkymc'))
    plt.xticks(np.arange(num), np.array(X_train.columns[order]), rotation = "vertical")
    plt.show()



In [None]:
#Display Logistic Regression Metrics

print("Evaluation of Logistic Regression Model")
print("-------------------------------------------")
print()
display_classification_metrics(Y_test, Log_Reg_predictions.round())
print()
print()
print("ROC CURVE FOR LOGISTIC REGRESSION")
print("-------------------------------------------")
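# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
sklearn.metrics.RocCurveDisplay.from_estimator(Log_Regression, X_test, Y_test)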



In [None]:
#Feature Importance for Logistic Regression along with a plot

model = "Logistic Regression"
# Pass the raw coefficients; argsort() would return index positions instead of scores
feature_importance(Log_Regression.coef_[0])


## DECISION TREE

In [None]:
#Fitting Decision Tree to our data 

from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()

decision_tree.fit(X_train, Y_train)
predictions_dt = decision_tree.predict(X_test)

dt_predictions_prob = decision_tree.predict_proba(X_test)
dt_Roc = [Y_test, dt_predictions_prob]


In [None]:
#Display Decision Tree Metrics

print("Evaluation of Decision Tree Model")
print("-------------------------------------------")
display_classification_metrics(Y_test, predictions_dt.round())
print()
print()
print("ROC CURVE FOR DECISION TREE")
print("-------------------------------------------")
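# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
sklearn.metrics.RocCurveDisplay.from_estimator(decision_tree, X_test, Y_test)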


In [None]:
#Feature Importance for Decision Tree
model = "Decision Tree"
# Pass the raw importances; argsort() would return index positions instead of scores
feature_importance(decision_tree.feature_importances_)


## Random Forest

In [None]:
#Fit Random Forest to our data 

from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators= 100)

random_forest.fit(X_train, Y_train)
predictions_rf = random_forest.predict(X_test)

rf_predictions_prob = random_forest.predict_proba(X_test)
rf_Roc = [Y_test, rf_predictions_prob]

In [None]:
#Display Random Forest Metrics

print("Evaluation of Random Forest Model")
print("-------------------------------------------")
print()
display_classification_metrics(Y_test, predictions_rf.round())
print()
print()
print("ROC CURVE FOR RANDOM FOREST")
print("-------------------------------------------")
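# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
sklearn.metrics.RocCurveDisplay.from_estimator(random_forest, X_test, Y_test)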

In [None]:
#Feature Importance for Random Forest
model = "Random Forest"
# Pass the raw importances; argsort() would return index positions instead of scores
feature_importance(random_forest.feature_importances_)

##  eXtreme Gradient Boosting - XGBOOST 

I am using XGBoost primarily because it has built-in regularization techniques, which help to prevent overfitting and improve the generalization performance of the model.
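The model fitted in this section uses the default settings. As a hedged sketch, the regularization-related hyperparameters of `XGBClassifier` could be set explicitly as shown here. The values are illustrative assumptions and are not tuned for this dataset, and `build_regularized_xgb` is a hypothetical helper name.

```python
# Illustrative values only; they are assumptions, not tuned for this dataset
regularized_params = {
    "reg_lambda": 1.0,     # L2 penalty on leaf weights
    "reg_alpha": 0.1,      # L1 penalty on leaf weights
    "gamma": 1.0,          # minimum loss reduction required to split a node
    "max_depth": 4,        # shallower trees are less able to memorize the training set
    "learning_rate": 0.1,  # shrinks the contribution of each tree
    "subsample": 0.8,      # fraction of rows sampled for each tree
    "random_state": 42,
}

def build_regularized_xgb(params=regularized_params):
    # Imported here so the dictionary can be inspected without xgboost installed
    from xgboost import XGBClassifier
    return XGBClassifier(**params)
```

Calling `build_regularized_xgb().fit(X_train, Y_train)` would train a more constrained model, which could then be compared against the default one on the test set.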


In [None]:
#Fit XGBOOST to data 

from xgboost import XGBClassifier
XGB_Model = XGBClassifier(random_state=42)

XGB_Model.fit(X_train, Y_train)
predictions_XGB = XGB_Model.predict(X_test)

XGB_pred_prob = XGB_Model.predict_proba(X_test)[:,1]
xgb_roc = [Y_test, XGB_pred_prob]

In [None]:
#Display XGBOOST Metrics

print("Evaluation of XGBOOST Model")
print("-------------------------------------------")
display_classification_metrics(Y_test, predictions_XGB.round())
print()
print()
print("ROC CURVE FOR XGBOOST")
print("-------------------------------------------")
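# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
sklearn.metrics.RocCurveDisplay.from_estimator(XGB_Model, X_test, Y_test)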


In [None]:
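#Training accuracy of each model, to compare against the testing accuracy reported above
print(f"Logistic Regression training accuracy is {Log_Regression.score(X_train, Y_train)}")
print(f"Decision Tree training accuracy is {decision_tree.score(X_train, Y_train)}")
print(f"Random Forest training accuracy is {random_forest.score(X_train, Y_train)}")
print(f"XGBoost training accuracy is {XGB_Model.score(X_train, Y_train)}")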
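A training accuracy of 1.0 alongside a much lower testing accuracy is a typical sign of overfitting. The sketch below reproduces the effect on synthetic data (not the fraud dataset) by comparing an unrestricted decision tree with a depth-limited one; the depth of 4 is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data, used only to illustrate the train/test gap
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)

scores = {}
for label, depth in (("unlimited", None), ("depth_4", 4)):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # Store (training accuracy, testing accuracy) for each setting
    scores[label] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for label, (train_acc, test_acc) in scores.items():
    print(f"{label}: train = {train_acc:.4f}, test = {test_acc:.4f}")
```

The unrestricted tree memorizes the training rows, while the depth-limited tree cannot, so its training accuracy is a more honest indicator of how it will generalize.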

In [None]:
#Plot the ROC Curves of all models together 

import matplotlib.pyplot as plt
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import RocCurveDisplay

# Create a figure and axis
fig, ax = plt.subplots(figsize=(10, 5))
# Dashed diagonal marks the performance of a random classifier
ax.plot([0, 1], [0, 1], linestyle="dashed")

#Draw the Plots
RocCurveDisplay.from_estimator(XGB_Model, X_test, Y_test, ax=ax, name='XGB_Model')
RocCurveDisplay.from_estimator(random_forest, X_test, Y_test, ax=ax, name='Random Forest')
RocCurveDisplay.from_estimator(decision_tree, X_test, Y_test, ax=ax, name='Decision Tree')
RocCurveDisplay.from_estimator(Log_Regression, X_test, Y_test, ax=ax, name='Logistic Regression')

# Add a legend to the plot
plt.legend()

# Show the plot
plt.show()
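ROC curves and AUC are meant to be computed from continuous scores such as `predict_proba(X_test)[:, 1]`, not from hard 0/1 predictions. A small sketch on made-up labels and scores shows how much information is lost when the scores are thresholded first:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted fraud probabilities, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.6, 0.4, 0.7, 0.9])

# AUC from the continuous scores uses the full ranking of the transactions
auc_from_scores = roc_auc_score(y_true, y_score)

# Thresholding at 0.5 collapses the ranking to two values before computing AUC
y_hard = (y_score >= 0.5).astype(int)
auc_from_labels = roc_auc_score(y_true, y_hard)

print(f"AUC from probabilities: {auc_from_scores:.4f}")
print(f"AUC from hard labels:   {auc_from_labels:.4f}")
```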


## Comparing the Performance of Different Models



| Model | Training Accuracy | Testing Accuracy | Precision | Recall | F1 Score | AUC |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Logistic Regression | 0.6799 |  0.6721 | 0.6616  | 0.6978 | 0.6792 | 0.73 |
| Decision Tree Classifier | 1.0 |  0.6372 | 0.6312 | 0.6508 | 0.6409 | 0.64 |
| Random Forests | 1.0  | 0.7203 | 0.7175  | 0.7218 | 0.7197 | 0.80 |
| XGBoost | 0.8268 |  0.7243 | 0.7217  |  0.7256 | 0.7236 | 0.80 |

## RESULTS & CONCLUSIONS

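1) XGBoost performs best on this data, with the highest testing accuracy (0.7243) and F1 score (0.7236). Random Forest is very close behind (testing accuracy 0.7203, same AUC of 0.80).

2) None of the models is accurate enough to make reliable predictions on this data; the best testing accuracy is only about 72%.

3) The class imbalance heavily affects the results we obtained; moreover, undersampling removes a large amount of data.

4) I would be interested to see how our models perform with more data.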
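One alternative to undersampling that keeps every row is class weighting, which scikit-learn exposes through the `class_weight="balanced"` option. This was not used in this notebook; the sketch below only shows, on made-up labels, how the balanced weights are derived and how the option would be passed to a model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 95 genuine (0) and 5 fraud (1)
y_toy = np.array([0] * 95 + [1] * 5)

# Balanced weight for a class = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)
print("Weight for genuine:", weights[0])
print("Weight for fraud:  ", weights[1])

# The same weighting is applied internally when the option is passed to the model
weighted_model = LogisticRegression(class_weight="balanced", max_iter=1000)
```

With these weights, each fraud row counts as much in the loss as 19 genuine rows, so no data has to be discarded to balance the classes.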

## FUTURE SCOPE 

1) Address the class imbalance more rigorously, for example by exploring the Synthetic Minority Over-sampling Technique (SMOTE).

2) We should consider some feature selection algorithms and hyperparameter optimization. This would be interesting to explore.

3) Transaction time is a very interesting variable that must be studied more carefully. Checking for trends or seasonality might lead to some interesting insights. Fraud prevalence around public holidays such as Christmas and Easter could be studied.

4) Explore additional ensemble models to classify fraud.
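As a starting point for the transaction time idea in point 3, the sketch below derives a few calendar features with pandas and computes a fraud rate per month. The rows are made up and the column name `transactionDateTime` is an assumption about how the timestamp is stored.

```python
import pandas as pd

# Made-up rows; `transactionDateTime` is an assumed column name
toy = pd.DataFrame({
    "transactionDateTime": pd.to_datetime(
        ["2016-12-24 23:15:00", "2016-12-25 02:40:00", "2016-03-14 13:05:00"]
    ),
    "isFraud": [1, 1, 0],
})

# Calendar features that could expose daily, weekly or seasonal patterns
toy["hour"] = toy["transactionDateTime"].dt.hour
toy["dayOfWeek"] = toy["transactionDateTime"].dt.dayofweek  # Monday = 0, Sunday = 6
toy["month"] = toy["transactionDateTime"].dt.month
toy["isWeekend"] = toy["dayOfWeek"] >= 5

# Share of fraudulent transactions per month
fraud_rate_by_month = toy.groupby("month")["isFraud"].mean()
print(fraud_rate_by_month)
```

On the real data, the same grouping by hour, day of week or month would show whether fraud clusters around particular times before any of these features are added to the models.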
