## XGBoost Model Implementation

To overcome the limitations of logistic regression in capturing complex and nonlinear fraud patterns, an ensemble tree-based model (XGBoost) was implemented. XGBoost is well-suited for imbalanced classification problems due to its ability to handle class weighting and learn complex decision boundaries.


### Importing required libraries

In [31]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

from xgboost import XGBClassifier

### Reading the cleaned dataset

In [32]:
df = pd.read_csv('fraud_cleaned.csv')
print(df.shape)
df.head(3)

(1296675, 15)


Unnamed: 0.1,Unnamed: 0,category,amt,street,city,state,lat,long,city_pop,merch_lat,merch_long,is_fraud,hour,day,is_weekend
0,0,misc_net,4.97,561 Perry Cove,Moravian Falls,NC,36.0788,-81.1781,3495,36.011293,-82.048315,0,0,1,0
1,1,grocery_pos,107.23,43039 Riley Greens Suite 393,Orient,WA,48.8878,-118.2105,149,49.159047,-118.186462,0,0,1,0
2,2,entertainment,220.11,594 White Dale Suite 530,Malad City,ID,42.1808,-112.262,4154,43.150704,-112.154481,0,0,1,0


In [33]:
# Drop 'Unnamed: 0': it is a leftover index written by an earlier CSV export,
# and pandas already maintains its own index, so the column is redundant.
del df['Unnamed: 0']
df.shape

(1296675, 14)

### Training & Testing

In [16]:
X = df.drop("is_fraud", axis=1)
y = df["is_fraud"]

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

In [18]:
num_cols = X_train.select_dtypes(include=['int64','float64']).columns
cat_cols = X_train.select_dtypes(include=['object']).columns

In [19]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
    ]
)

#### Handling Class Imbalance

The dataset is highly imbalanced: fraudulent transactions make up a very small portion of the data. To address this, we use the `scale_pos_weight` parameter in XGBoost.

This parameter assigns higher importance to fraud samples during training and is calculated as:

scale_pos_weight = (Number of normal transactions) / (Number of fraud transactions)

This weighting reduces the model's bias toward the majority class.
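
As an illustration of the formula above, the ratio can be computed from the labels instead of being typed in by hand. This is a minimal sketch: the toy array `y_demo` is hypothetical stand-in data, and in the notebook the training labels `y_train` would take its place.

```python
import numpy as np

# Hypothetical toy labels (stand-in for y_train): 0 = normal, 1 = fraud.
y_demo = np.array([0] * 990 + [1] * 10)

n_normal = int((y_demo == 0).sum())
n_fraud = int((y_demo == 1).sum())

# scale_pos_weight = (number of normal transactions) / (number of fraud transactions)
scale_pos_weight = n_normal / n_fraud
print(scale_pos_weight)
```

For reference, the stratified test split reported later in this notebook contains 257,834 normal and 1,501 fraud transactions, a ratio of roughly 172, which is consistent with the value passed to the classifier.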


In [20]:
xgb_model = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=172,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1,
    eval_metric='logloss'
)

In [21]:
model = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('classifier', xgb_model)
])

In [22]:
# Fit the pipeline, then evaluate at the default decision threshold of 0.5
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99    257834
           1       0.27      0.96      0.42      1501

    accuracy                           0.98    259335
   macro avg       0.63      0.97      0.71    259335
weighted avg       1.00      0.98      0.99    259335

ROC-AUC: 0.9976489283446176
[[253861   3973]
 [    55   1446]]
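
The fraud-class metrics in the report can be checked directly against the confusion matrix. The sketch below copies the four counts from the output above and recomputes precision, recall, and F1; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are actual classes, columns are predicted classes.
cm = np.array([[253861, 3973],
               [55, 1446]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of fraud alerts that are real fraud
recall = tp / (tp + fn)      # share of real fraud that is caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.27 recall=0.96 f1=0.42
```

At the default threshold the model catches almost all fraud (55 missed cases) but raises 3,973 false alerts, which is why higher thresholds are explored next.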


In [23]:
thresholds = [0.89, 0.92, 0.93, 0.94]
for t in thresholds:
    print(f"\nThreshold: {t}")
    # Flag a transaction as fraud when its predicted probability reaches t
    y_pred_custom = (y_proba >= t).astype(int)
    print(classification_report(y_test, y_pred_custom))

print(f'THE OPTIMAL VALUE OF THRESHOLD IN THIS SCENARIO IS : {thresholds[2]}')


Threshold: 0.89
              precision    recall  f1-score   support

           0       1.00      0.99      1.00    257834
           1       0.50      0.89      0.64      1501

    accuracy                           0.99    259335
   macro avg       0.75      0.94      0.82    259335
weighted avg       1.00      0.99      1.00    259335


Threshold: 0.92
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    257834
           1       0.57      0.86      0.69      1501

    accuracy                           1.00    259335
   macro avg       0.79      0.93      0.84    259335
weighted avg       1.00      1.00      1.00    259335


Threshold: 0.93
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    257834
           1       0.61      0.85      0.71      1501

    accuracy                           1.00    259335
   macro avg       0.81      0.92      0.85    259335
weighted avg       1.00
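
The manual comparison above can be condensed into a small helper that reports only the fraud-class precision and recall for each threshold. This is a sketch under stated assumptions: `sweep_thresholds` is a hypothetical helper, and the labels and scores are synthetic stand-ins for `y_test` and `y_proba`, not the notebook's data.

```python
import numpy as np

def sweep_thresholds(y_true, y_score, thresholds):
    """Return (threshold, precision, recall) tuples for the positive class."""
    rows = []
    for t in thresholds:
        y_hat = (y_score >= t).astype(int)
        tp = int(((y_hat == 1) & (y_true == 1)).sum())
        fp = int(((y_hat == 1) & (y_true == 0)).sum())
        fn = int(((y_hat == 0) & (y_true == 1)).sum())
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((t, precision, recall))
    return rows

# Synthetic, hypothetical data: about 2% positives with higher scores on average.
rng = np.random.default_rng(0)
y_true_demo = (rng.random(5000) < 0.02).astype(int)
y_score_demo = np.clip(0.6 * y_true_demo + rng.normal(0.3, 0.15, 5000), 0, 1)

results = sweep_thresholds(y_true_demo, y_score_demo, [0.5, 0.6, 0.7, 0.8])
for t, p, r in results:
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold can only keep recall the same or lower it, so the sweep makes the precision/recall trade-off explicit. Choosing the threshold on a validation split that is separate from the final test data keeps the reported test metrics unbiased.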

## Model Comparison

The performance of Logistic Regression and XGBoost models was compared.

Logistic Regression:
- Recall: ~61%
- Precision: ~27%
- Struggled to capture nonlinear fraud patterns

XGBoost (threshold = 0.93):
- Recall: ~85%
- Precision: ~61%
- Successfully captured complex feature interactions

On this dataset, XGBoost achieves both higher recall and higher precision than logistic regression, so it is the stronger model for this fraud detection task.


### Saving XGBoost Pipeline

In [24]:
import joblib

# Save full pipeline (preprocessing + model)
joblib.dump(model, "fraud_xgboost_pipeline.pkl")

print("Model successfully saved!")

Model successfully saved!


In [40]:
df_ext = pd.read_csv("fraudTest.csv")
df_ext.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,...,33.9659,-80.9355,333497,Mechanical engineer,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0
1,1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,...,40.3207,-110.436,302,"Sales professional, IT",1990-01-17,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0
2,2,2020-06-21 12:14:53,3598215285024754,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,...,40.6729,-73.5365,34496,"Librarian, public",1970-10-21,c81755dbbbea9d5c77f094348a7579be,1371816893,40.49581,-74.196111,0
3,3,2020-06-21 12:15:15,3591919803438423,fraud_Haley Group,misc_pos,60.05,Brian,Williams,M,32941 Krystal Mill Apt. 552,...,28.5697,-80.8191,54767,Set designer,1987-07-25,2159175b9efe66dc301f149d3d5abf8c,1371816915,28.812398,-80.883061,0
4,4,2020-06-21 12:15:17,3526826139003047,fraud_Johnston-Casper,travel,3.19,Nathan,Massey,M,5783 Evan Roads Apt. 465,...,44.2529,-85.017,1126,Furniture designer,1955-07-06,57ff021bd3f328f8738bb535c302a31b,1371816917,44.959148,-85.884734,0


## Test Data Preprocessing

### Converting date columns from object to datetime

In [41]:
df_ext['trans_date_trans_time'] = pd.to_datetime(df_ext['trans_date_trans_time'])
print(type(df_ext['trans_date_trans_time'][0]))

df_ext['dob'] = pd.to_datetime(df_ext['dob'])
print(type(df_ext['dob'][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>


### Dropping the `first` and `last` columns, as names play no role in fraud

In [42]:
df_ext = df_ext.drop(columns=['first', 'last'])
df_ext.shape

(555719, 21)

### Removing Unnecessary Columns

In [68]:
# Drop identifiers and columns that were not used when training the model
df_ext = df_ext.drop(columns=[
    'trans_num', 'cc_num', 'unix_time', 'merchant',
    'job', 'dob', 'zip', 'gender'
])
df_ext.shape

(555719, 14)

In [46]:
del(df_ext['Unnamed: 0'])
df_ext.shape

(555719, 14)

### Extracting Hour, Day of Week, and Weekend Flag from Test Data

In [44]:
df_ext['hour'] = df_ext['trans_date_trans_time'].dt.hour
df_ext['day'] = df_ext['trans_date_trans_time'].dt.dayofweek
df_ext['is_weekend'] = df_ext['day'].isin([5,6]).astype(int)

In [45]:
del(df_ext['trans_date_trans_time'])
df_ext.shape

(555719, 15)
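
The preprocessing steps in this section can also be bundled into one function, so the external data always passes through the same transformations in the same order. This is a minimal sketch, not the notebook's own code: `prepare_external`, `DROP_COLS`, and the two-row `demo` frame are hypothetical names and data introduced for illustration.

```python
import pandas as pd

# Columns removed from the raw test file in the cells above.
DROP_COLS = ["first", "last", "trans_num", "cc_num", "unix_time", "merchant",
             "job", "dob", "zip", "gender", "Unnamed: 0"]

def prepare_external(df_raw):
    """Apply the section's preprocessing steps to a raw transactions frame."""
    df_out = df_raw.copy()
    ts = pd.to_datetime(df_out["trans_date_trans_time"])
    df_out["hour"] = ts.dt.hour
    df_out["day"] = ts.dt.dayofweek                      # Monday = 0 ... Sunday = 6
    df_out["is_weekend"] = df_out["day"].isin([5, 6]).astype(int)
    # errors="ignore" tolerates columns that are already absent
    return df_out.drop(columns=DROP_COLS + ["trans_date_trans_time"],
                       errors="ignore")

# Hypothetical two-row frame with a subset of the raw columns.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "trans_date_trans_time": ["2020-06-21 12:14:25", "2020-06-24 23:05:00"],
    "cc_num": [1, 2], "first": ["A", "B"], "last": ["C", "D"],
    "category": ["travel", "misc_pos"], "amt": [2.86, 60.05], "is_fraud": [0, 0],
})
clean = prepare_external(demo)
print(list(clean.columns))
```

Because the function works on a copy and ignores missing columns, it can be re-run safely, which avoids errors when notebook cells are executed more than once.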

## Testing the Saved Model

### Separating Features & Target

In [51]:
X_ext = df_ext.drop("is_fraud", axis=1)
y_ext = df_ext["is_fraud"]

### Loading Saved Pipeline

In [52]:
# Load saved model
model = joblib.load("fraud_xgboost_pipeline.pkl")

print("Model successfully loaded!")

Model successfully loaded!


### Prediction

In [53]:
y_proba_ext = model.predict_proba(X_ext)[:,1]

In [69]:
threshold = 0.93
y_pred_ext = (y_proba_ext >= threshold).astype(int)

### Printing Evaluation Report

In [70]:
print(classification_report(y_ext, y_pred_ext))
print(confusion_matrix(y_ext, y_pred_ext))
print("ROC-AUC:", roc_auc_score(y_ext, y_proba_ext))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.50      0.79      0.61      2145

    accuracy                           1.00    555719
   macro avg       0.75      0.89      0.80    555719
weighted avg       1.00      1.00      1.00    555719

[[551872   1702]
 [   458   1687]]
ROC-AUC: 0.9955263442036666
