# Fraud Detection Modeling – XGBoost

## Objective
This notebook focuses on building and evaluating a machine learning model to detect fraudulent transactions using **XGBoost**, a powerful gradient boosting algorithm widely used in fintech applications.

The objective is to accurately identify fraudulent behavior while managing the strong class imbalance inherent in fraud detection problems.


## Modeling Approach
- Time-based train/validation/test split (70/15/15)
- XGBoost classifier with class imbalance handling
- Evaluation using:
  - Precision, Recall, and F1-score
  - ROC-AUC
  - Precision-Recall AUC (primary metric)
- Preparation for model explainability and business reporting






In [1]:
# Imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    average_precision_score
)

from xgboost import XGBClassifier


In [None]:
# Load Feature Data

df = pd.read_csv("../data/features_train.csv")
df.head()


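| Unnamed: 0 | isFraud | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | card6 | addr1 | ... | id_29_missing | id_30_missing | id_31_missing | id_32_missing | id_33_missing | id_34_missing | id_35_missing | id_36_missing | id_37_missing | id_38_missing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 68.5 | W | 13926 |  | 150.0 | discover | 142.0 | credit | 315.0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 29.0 | W | 2755 | 404.0 | 150.0 | mastercard | 102.0 | credit | 325.0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 59.0 | W | 4663 | 490.0 | 150.0 | visa | 166.0 | debit | 330.0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 0 | 50.0 | W | 18132 | 567.0 | 150.0 | mastercard | 117.0 | debit | 476.0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 0 | 50.0 | H | 4497 | 514.0 | 150.0 | mastercard | 102.0 | credit | 420.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |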


In [3]:
# Split the data into features (x) and target (y)

x = df.drop(columns='isFraud')
y = df['isFraud']

In [4]:
# Identify Categorical Columns
cat_cols = x.select_dtypes(include=["object"]).columns
num_cols = x.select_dtypes(exclude=["object"]).columns

len(cat_cols), len(num_cols)


(31, 445)

In [5]:
# Encode categorical features. We use label encoding per column because
# there are too many categorical columns for one-hot encoding to be practical.
for col in cat_cols:
    le = LabelEncoder()
    x[col] = le.fit_transform(x[col].astype(str))
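The loop above fits each encoder on the full dataset, before any split. One possible alternative, sketched here on toy data, is to fit the encoding on training rows only and map categories unseen in training to a sentinel value. The sketch uses scikit-learn's `OrdinalEncoder`; the toy frames, the chosen columns, and the `-1` sentinel are assumptions for illustration, not the notebook's method.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for the training and test feature frames (illustrative values).
train = pd.DataFrame({"card4": ["visa", "mastercard", "visa"],
                      "card6": ["debit", "credit", "debit"]})
test = pd.DataFrame({"card4": ["discover", "visa"],
                     "card6": ["credit", "debit"]})
cat_cols = ["card4", "card6"]

# Categories that never appear in training are encoded as -1 (assumed sentinel).
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on training rows only, then reuse the same mapping for the test rows.
train_enc = pd.DataFrame(
    encoder.fit_transform(train[cat_cols].astype(str)),
    columns=cat_cols, index=train.index,
)
test_enc = pd.DataFrame(
    encoder.transform(test[cat_cols].astype(str)),
    columns=cat_cols, index=test.index,
)

print(test_enc)
```

Here `discover` does not occur in the toy training frame, so it is encoded as `-1` instead of receiving a code learned from later data.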



In [9]:
# Time-based train/validation/test split (70/15/15); assumes rows are already in chronological order
n = len(df)

train_end = int(n * 0.7)
val_end = int(n * 0.85)

x_train = x.iloc[:train_end]
y_train = y.iloc[:train_end]

x_val = x.iloc[train_end:val_end]
y_val = y.iloc[train_end:val_end]

x_test = x.iloc[val_end:]
y_test = y.iloc[val_end:]
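A positional split like the one above is only time-based if the rows are sorted chronologically. The following is a minimal sketch of the same 70/15/15 logic wrapped in a helper, with an explicit sort first. The helper name `time_based_split` and the timestamp column `TransactionDT` are hypothetical; substitute whatever time column the dataset actually provides.

```python
import pandas as pd


def time_based_split(frame, train_frac=0.70, val_end_frac=0.85):
    """Split a chronologically sorted frame into train/validation/test by position."""
    n = len(frame)
    train_end = int(n * train_frac)
    val_end = int(n * val_end_frac)
    return frame.iloc[:train_end], frame.iloc[train_end:val_end], frame.iloc[val_end:]


# Toy data: 100 transactions with a hypothetical timestamp column, shuffled.
toy = pd.DataFrame({"TransactionDT": range(100), "isFraud": [0] * 95 + [1] * 5})
toy = toy.sample(frac=1.0, random_state=0)

# Sort by time before splitting so later rows never leak into training.
toy = toy.sort_values("TransactionDT").reset_index(drop=True)
train, val, test = time_based_split(toy)

print(len(train), len(val), len(test))
```

Every training timestamp then precedes every validation timestamp, which in turn precedes every test timestamp.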



In [10]:
# Handle class imbalance
# Fraud is rare, so a model can score well by always predicting "non-fraud".
# We compute scale_pos_weight as the ratio of negative to positive training examples:
fraud = y_train.sum()
non_fraud = len(y_train) - fraud

scale_pos_weight = non_fraud / fraud
scale_pos_weight

np.float64(27.434310083918007)

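This tells XGBoost that fraud is rare: with `scale_pos_weight` of about 27.4, each fraud example counts roughly 27 times as much as a non-fraud example during training, so the model pays more attention to the minority class.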
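As a small worked example of that ratio, the sketch below computes it on toy labels and inverts the notebook's value to recover the implied fraud rate in the training split. The helper name `compute_scale_pos_weight` is hypothetical.

```python
import numpy as np


def compute_scale_pos_weight(labels):
    """Return (number of negatives) / (number of positives) for binary labels."""
    labels = np.asarray(labels)
    positives = int(labels.sum())
    negatives = labels.size - positives
    if positives == 0:
        raise ValueError("no positive examples; the ratio is undefined")
    return negatives / positives


# Toy labels: 5 fraud cases out of 100 transactions -> 95 / 5 = 19.0
toy_labels = np.array([0] * 95 + [1] * 5)
toy_weight = compute_scale_pos_weight(toy_labels)

# Inverting the notebook's ratio gives the training fraud rate: 1 / (1 + w).
implied_fraud_rate = 1.0 / (1.0 + 27.434310083918007)

print(toy_weight, round(implied_fraud_rate, 4))
```

A ratio of about 27.4 corresponds to roughly 3.5% fraud in the training split.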

In [11]:
# Train Baseline XGBoost Model
xgb_model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    eval_metric="logloss"
)

xgb_model.fit(x_train, y_train)


In [12]:
# Predictions
y_pred = xgb_model.predict(x_test)
y_proba = xgb_model.predict_proba(x_test)[:,1]
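`predict` applies a fixed 0.5 probability cutoff, which is not necessarily the best operating point for an imbalanced problem. One option, sketched below on synthetic scores, is to choose the cutoff on held-out validation data: keep the threshold with the highest recall among those that reach a minimum precision. The helper `pick_threshold`, the 0.5 precision floor, and the synthetic data are all assumptions; in the notebook the inputs would be `y_val` and `xgb_model.predict_proba(x_val)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, scores, min_precision=0.5):
    """Return the cutoff with the highest recall whose precision is at least
    min_precision, or None if no cutoff reaches that precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision and recall carry one extra trailing point (1.0, 0.0) that has
    # no matching threshold, so drop it to align the three arrays.
    precision, recall = precision[:-1], recall[:-1]
    meets_floor = precision >= min_precision
    if not meets_floor.any():
        return None
    best = int(np.argmax(np.where(meets_floor, recall, -1.0)))
    return float(thresholds[best])


# Synthetic stand-in for validation labels and scores; not the notebook's data.
rng = np.random.default_rng(0)
y_val_toy = (rng.random(2000) < 0.05).astype(int)
scores_toy = np.clip(rng.normal(0.2, 0.1, size=2000) + 0.3 * y_val_toy, 0.0, 1.0)

threshold = pick_threshold(y_val_toy, scores_toy, min_precision=0.5)
y_pred_toy = (scores_toy >= threshold).astype(int)

print(f"chosen threshold: {threshold:.3f}")
```

The chosen cutoff would then be applied to the test scores (`y_proba >= threshold`) in place of the default `predict` output.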

In [None]:
# Evaluation
# 1. Classification report (class 1 = fraud)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.91      0.95     85498
           1       0.22      0.69      0.33      3083

    accuracy                           0.90     88581
   macro avg       0.60      0.80      0.64     88581
weighted avg       0.96      0.90      0.93     88581



In [14]:
# 2. Confusion matrix (rows = actual class, columns = predicted class)
confusion_matrix(y_test, y_pred)


array([[77720,  7778],
       [  942,  2141]])
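The fraud-class numbers in the classification report follow directly from this matrix. The sketch below redoes the arithmetic from the four cells and adds two derived quantities (false positive rate and alerts raised per fraud caught) that are not in the report but may help with business reporting.

```python
import numpy as np

# Confusion matrix from the cell above: rows = actual, columns = predicted.
cm = np.array([[77720, 7778],
               [942, 2141]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)            # share of flagged transactions that are fraud
recall = tp / (tp + fn)               # share of fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)  # share of legitimate transactions flagged
alerts_per_fraud_caught = (tp + fp) / tp

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"FPR={false_positive_rate:.3f} alerts per caught fraud={alerts_per_fraud_caught:.1f}")
```

At the default cutoff the model catches about 69% of fraud, at the cost of roughly 4.6 alerts for every fraud case it catches.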

In [15]:
# 3. ROC-AUC
roc_auc_score(y_test, y_proba)

np.float64(0.8961249125318836)

In [16]:
# 4. Precision-Recall AUC (average precision), the primary metric
average_precision_score(y_test, y_proba)

np.float64(0.504435911174399)
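Unlike ROC-AUC, whose chance level is 0.5, the chance level for average precision is the positive-class prevalence. The sketch below puts the score above in that context, using the test-set support from the classification report, and checks the baseline empirically with uninformative random scores. The random-score simulation is an illustration, not part of the original evaluation.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Test-set support taken from the classification report.
n_fraud, n_legit = 3083, 85498
prevalence = n_fraud / (n_fraud + n_legit)

model_ap = 0.504435911174399          # average precision reported above
lift_over_chance = model_ap / prevalence

# Sanity check: random scores should give an AP close to the prevalence.
rng = np.random.default_rng(42)
y_toy = np.r_[np.ones(n_fraud, dtype=int), np.zeros(n_legit, dtype=int)]
random_ap = average_precision_score(y_toy, rng.random(y_toy.size))

print(f"prevalence={prevalence:.4f} random AP={random_ap:.4f} lift={lift_over_chance:.1f}x")
```

With a test prevalence of about 3.5%, an average precision of 0.50 is roughly 14 times the chance level.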

In [17]:
# Save the trained model as a .pkl file
import pickle

with open('xgb_model.pkl', 'wb') as file:
    pickle.dump(xgb_model, file)

print("Model saved as xgb_model.pkl")


Model saved as xgb_model.pkl


In [18]:
# Save test-set features with true labels and predicted risk scores
results = x_test.copy()
results['is_fraud'] = y_test.values
results['risk_score'] = y_proba

results.to_csv("../data/results.csv", index=False)

results.head()

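| Unnamed: 0 | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | card6 | addr1 | addr2 | ... | id_31_missing | id_32_missing | id_33_missing | id_34_missing | id_35_missing | id_36_missing | id_37_missing | id_38_missing | is_fraud | risk_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 501959 | 57.95 | 4 | 7919 | 194.0 | 150.0 | 2 | 166.0 | 2 | 143.0 | 87.0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0.006633 |
| 501960 | 47.95 | 4 | 1764 | 158.0 | 150.0 | 4 | 226.0 | 2 | 315.0 | 87.0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0.11517 |
| 501961 | 209.95 | 4 | 2455 | 321.0 | 150.0 | 4 | 226.0 | 1 | 225.0 | 87.0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0.275809 |
| 501962 | 107.95 | 4 | 7919 | 194.0 | 150.0 | 2 | 166.0 | 2 | 126.0 | 87.0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0.035903 |
| 501963 | 58.95 | 4 | 10838 | 143.0 | 150.0 | 4 | 226.0 | 2 | 205.0 | 87.0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0.036681 |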
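For business reporting, a saved risk score is often used to rank transactions for manual review. The sketch below shows one way to measure how many of the top-ranked transactions are actually fraud. The helper `precision_at_k` and the toy frame are hypothetical; only the column names `is_fraud` and `risk_score` come from the results file above.

```python
import pandas as pd


def precision_at_k(results, k, score_col="risk_score", label_col="is_fraud"):
    """Share of true fraud among the k transactions with the highest score."""
    top_k = results.nlargest(k, score_col)
    return float(top_k[label_col].mean())


# Toy results frame using the same column names as the saved CSV.
toy_results = pd.DataFrame({
    "is_fraud":   [0,    1,    0,    1,    0,    0],
    "risk_score": [0.01, 0.90, 0.30, 0.75, 0.60, 0.05],
})

# The three highest scores are 0.90, 0.75 and 0.60; two of them are fraud.
p_at_3 = precision_at_k(toy_results, k=3)
print(f"precision@3 = {p_at_3:.2f}")
```

On the real file, the frame would come from `pd.read_csv("../data/results.csv")`, with `k` set to the number of cases the review team can handle.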
