# AICA x DataCamp Capstone Project
## FraudGuard: Credit Card Transaction Anomaly Detection
**Your Role**: You are a Machine Learning Scientist at a Fintech startup. Your company processes thousands of credit card transactions daily. The fraud team has complained that their current rule-based system is missing too many sophisticated fraud attempts (False Negatives) but also flagging too many legitimate transactions (False Positives), annoying customers.

**Your Objective**: Build a machine learning model that detects fraudulent credit card transactions with a focus on maximizing Recall (catching fraud) while maintaining acceptable Precision (minimizing false alarms).
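
To make the objective concrete, the sketch below computes Recall and Precision from raw confusion counts. The counts are hypothetical placeholders chosen for illustration, not results from this notebook.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual fraud cases that the model catches."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of fraud alerts that really are fraud."""
    return tp / (tp + fp)


# Hypothetical counts, for illustration only
tp, fn, fp = 80, 20, 10
example_recall = recall(tp, fn)
example_precision = precision(tp, fp)
print(f"Recall: {example_recall:.3f}, Precision: {example_precision:.3f}")
```

Missed fraud (false negatives) lowers Recall, while false alarms (false positives) lower Precision, so the two usually have to be traded off against each other.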

### The Data Source
You will work with the Credit Card Fraud Detection dataset hosted on Kaggle, a widely used benchmark. It contains transactions made with credit cards in September 2013 by European cardholders.

Dataset link (Kaggle): https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

The "Real World" Twist: The dataset features (V1, V2, ... V28) are the result of a PCA transformation (to protect user confidentiality). The only features which have not been transformed are Time and Amount.

### Import dependencies

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import randint, uniform

from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

### Load the Dataset

In [2]:
credit_df = pd.read_csv('creditcard.csv')
credit_df.head()

index,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


### Data Exploration

In [60]:
credit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [61]:
credit_df.describe()

statistic,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.168375e-15,3.416908e-16,-1.379537e-15,2.074095e-15,9.604066e-16,1.487313e-15,-5.556467e-16,1.213481e-16,-2.406331e-15,2.239053e-15,1.673327e-15,-1.247012e-15,8.190001e-16,1.207294e-15,4.887456e-15,1.437716e-15,-3.772171e-16,9.564149e-16,1.039917e-15,6.406204e-16,1.654067e-16,-3.568593e-16,2.578648e-16,4.473266e-15,5.340915e-16,1.683437e-15,-3.660091e-16,-1.22739e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,1.08885,1.020713,0.9992014,0.9952742,0.9585956,0.915316,0.8762529,0.8493371,0.8381762,0.8140405,0.770925,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,-24.58826,-4.797473,-18.68371,-5.791881,-19.21433,-4.498945,-14.12985,-25.1628,-9.498746,-7.213527,-54.49772,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,-0.5354257,-0.7624942,-0.4055715,-0.6485393,-0.425574,-0.5828843,-0.4680368,-0.4837483,-0.4988498,-0.4562989,-0.2117214,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,-0.09291738,-0.03275735,0.1400326,-0.01356806,0.05060132,0.04807155,0.06641332,-0.06567575,-0.003636312,0.003734823,-0.06248109,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,0.4539234,0.7395934,0.618238,0.662505,0.4931498,0.6488208,0.5232963,0.399675,0.5008067,0.4589494,0.1330408,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,23.74514,12.01891,7.848392,7.126883,10.52677,8.877742,17.31511,9.253526,5.041069,5.591971,39.4209,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [62]:
credit_df['Class'].unique()

array([0, 1])

In [63]:
credit_df['Class'].value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

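## Conclusions from Data Exploration:
- The dataset is quite clean.
- There are no null values.
- `Amount` needs scaling: it ranges from 0 to 25,691.16, far wider than the PCA features (V1 to V28), whose means are already approximately zero.
- The classes are heavily imbalanced (492 fraud cases out of 284,807 transactions, about 0.17%), so the train/test split should be stratified.
- `Time` is a relative offset (it starts at 0 in this dataset), so it is treated as a poor predictor of class and dropped.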
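
The imbalance noted above can be quantified directly from the `value_counts()` output. A minimal sketch using the two counts shown earlier (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
n_legit = 284315
n_fraud = 492

n_total = n_legit + n_fraud
fraud_share = n_fraud / n_total

print(f"Total transactions: {n_total}")
print(f"Fraud share: {fraud_share:.4%}")
print(f"Legitimate transactions per fraud case: {n_legit / n_fraud:.0f}")
```

The resulting fraud share agrees with the mean of the `Class` column in the `describe()` output (0.001727).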

In [3]:
# Setting feature and target variables
y = credit_df['Class']
X = credit_df.drop(columns=['Class', 'Time'])

### Data Splitting

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=64)
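
Stratifying keeps the fraud rate nearly identical in both splits, which matters when positives are this rare. A small sketch on synthetic labels (the array sizes and the 1% positive rate are made up for the demonstration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 1% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=64
)

print("Positive rate, full: ", y_demo.mean())
print("Positive rate, train:", y_tr.mean())
print("Positive rate, test: ", y_te.mean())
```

Without `stratify`, a random 20% sample of a very rare class could end up with noticeably more or fewer positives than the training set, which makes test metrics less reliable.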

### Feature Scaling

In [6]:
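scaler = StandardScaler()

# Keep the PCA features as they are; only Amount is rescaled
X_train_scaled = X_train.drop(columns='Amount')
X_test_scaled = X_test.drop(columns='Amount')

# Fit the scaler on the training set only, then reuse it on the test set to avoid leakage
X_train_scaled['Amount_scaled'] = scaler.fit_transform(X_train[['Amount']])
X_test_scaled['Amount_scaled'] = scaler.transform(X_test[['Amount']])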
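
One alternative way to express the same idea (scale `Amount`, pass the PCA features through, learn statistics from the training data only) is a `ColumnTransformer`. This is not what the notebook uses; the sketch below runs on a tiny synthetic frame whose column names mirror the dataset but whose values are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Tiny synthetic frames standing in for X_train / X_test
train_demo = pd.DataFrame({"V1": [0.1, -0.2, 0.3, 0.0],
                           "Amount": [10.0, 20.0, 30.0, 40.0]})
test_demo = pd.DataFrame({"V1": [0.5], "Amount": [25.0]})

# Scale only Amount; every other column is passed through unchanged
preprocess = ColumnTransformer(
    transformers=[("amount", StandardScaler(), ["Amount"])],
    remainder="passthrough",
)

train_scaled = preprocess.fit_transform(train_demo)  # statistics learned here
test_scaled = preprocess.transform(test_demo)        # reused, never refit

print(train_scaled)
print(test_scaled)
```

Note that the transformed column comes first in the output array and the passthrough columns follow, so the column order differs from the input frame.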

In [7]:
X_train.head()

index,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
48674,-0.708341,0.040445,2.200083,0.749542,-0.744579,-0.01029,0.317844,0.073231,0.081554,-0.62499,-0.708012,0.55606,0.965374,-0.604689,0.240499,-0.23401,0.02175,-0.022601,0.745787,0.447952,0.128794,0.312569,0.267857,0.446607,-0.435194,0.332069,0.107945,0.168127,138.0
37114,1.107081,-0.499778,0.882831,0.18068,-0.909528,-0.050028,-0.472365,-0.012496,1.023381,-0.525302,-1.00365,1.145725,1.447667,-0.841939,-0.408806,-0.030448,-0.112201,-0.631152,0.620632,0.173777,-0.206754,-0.429161,-0.062181,0.000762,0.222293,0.961318,-0.039028,0.02483,81.12
172262,-0.345023,-4.093449,-3.284379,0.903968,-0.525472,-0.025134,2.06915,-0.650226,0.392074,-0.754407,-1.143273,0.834769,1.368906,0.460527,0.28152,-0.203375,-0.447419,-0.398204,0.065261,2.484688,0.600749,-0.981669,-0.998897,0.130377,-0.307035,-0.206071,-0.288431,0.162162,1286.22
189737,1.68089,-1.257498,-0.969382,-0.669416,-0.75548,-0.315123,-0.345804,0.059293,1.533915,-0.380343,0.305051,0.582418,-1.275604,0.356037,-0.324246,0.134541,-0.401212,-0.075229,1.057018,0.106868,-0.312931,-1.184365,0.264087,-0.578846,-0.705899,0.199461,-0.08809,-0.037789,184.4
154022,-0.940966,0.488571,0.075272,-1.988581,2.219415,4.348808,-1.230589,-0.483671,2.024167,-0.977051,0.500102,-2.363679,1.460949,1.215671,-0.408515,0.753529,-0.203869,0.291025,-1.370067,-0.588936,1.530308,-0.290991,0.29827,0.638862,-1.158172,0.239968,0.147404,0.238342,14.95


In [8]:
X_train_scaled.head()

index,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount_scaled
48674,-0.708341,0.040445,2.200083,0.749542,-0.744579,-0.01029,0.317844,0.073231,0.081554,-0.62499,-0.708012,0.55606,0.965374,-0.604689,0.240499,-0.23401,0.02175,-0.022601,0.745787,0.447952,0.128794,0.312569,0.267857,0.446607,-0.435194,0.332069,0.107945,0.168127,0.194439
37114,1.107081,-0.499778,0.882831,0.18068,-0.909528,-0.050028,-0.472365,-0.012496,1.023381,-0.525302,-1.00365,1.145725,1.447667,-0.841939,-0.408806,-0.030448,-0.112201,-0.631152,0.620632,0.173777,-0.206754,-0.429161,-0.062181,0.000762,0.222293,0.961318,-0.039028,0.02483,-0.029926
172262,-0.345023,-4.093449,-3.284379,0.903968,-0.525472,-0.025134,2.06915,-0.650226,0.392074,-0.754407,-1.143273,0.834769,1.368906,0.460527,0.28152,-0.203375,-0.447419,-0.398204,0.065261,2.484688,0.600749,-0.981669,-0.998897,0.130377,-0.307035,-0.206071,-0.288431,0.162162,4.723629
189737,1.68089,-1.257498,-0.969382,-0.669416,-0.75548,-0.315123,-0.345804,0.059293,1.533915,-0.380343,0.305051,0.582418,-1.275604,0.356037,-0.324246,0.134541,-0.401212,-0.075229,1.057018,0.106868,-0.312931,-1.184365,0.264087,-0.578846,-0.705899,0.199461,-0.08809,-0.037789,0.377465
154022,-0.940966,0.488571,0.075272,-1.988581,2.219415,4.348808,-1.230589,-0.483671,2.024167,-0.977051,0.500102,-2.363679,1.460949,1.215671,-0.408515,0.753529,-0.203869,0.291025,-1.370067,-0.588936,1.530308,-0.290991,0.29827,0.638862,-1.158172,0.239968,0.147404,0.238342,-0.290936


## Training and Evaluating Different Classification Models with Default Params

### 1. Fitting and Evaluating Logistic Regression Model as Baseline

In [9]:
logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)

LogisticRegression parameters (first rows of the estimator display):
penalty='l2'
dual=False
tol=0.0001
C=1.0
fit_intercept=True
intercept_scaling=1
class_weight=None
random_state=None
solver='lbfgs'
max_iter=100


In [10]:
y_pred_logreg_base = logreg.predict(X_test_scaled)

In [11]:
print("LOGISTIC REGRESSION METRICS (Baseline metrics)") # Serves as the baseline
print('='*50)
print('Accuracy:',accuracy_score(y_test,y_pred_logreg_base))
print('F1 Score:',f1_score(y_test,y_pred_logreg_base))
print('Precision:',precision_score(y_test,y_pred_logreg_base))
print('Recall:',recall_score(y_test,y_pred_logreg_base))

LOGISTIC REGRESSION METRICS (Baseline metrics)
Accuracy: 0.9992977774656788
F1 Score: 0.7727272727272727
Precision: 0.8717948717948718
Recall: 0.6938775510204082
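
Accuracy looks excellent here, but it is dominated by the majority class. The sketch below uses `confusion_matrix` on synthetic labels to show that a "model" which never flags fraud would still score about 99.8% accuracy. The test-set composition (56,962 rows with 98 fraud cases) is an assumption inferred from the 80/20 stratified split and the reported metrics; the notebook does not print it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Assumed test-set composition (inferred, not printed in the notebook)
n_test, n_fraud = 56_962, 98
y_true = np.zeros(n_test, dtype=int)
y_true[:n_fraud] = 1

# A "model" that predicts legitimate for every transaction
y_never_fraud = np.zeros(n_test, dtype=int)

tn, fp, fn, tp = confusion_matrix(y_true, y_never_fraud).ravel()
naive_accuracy = accuracy_score(y_true, y_never_fraud)
naive_recall = recall_score(y_true, y_never_fraud)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy: {naive_accuracy:.4f}, Recall: {naive_recall:.4f}")
```

This is why the comparison between models in this notebook rests on Recall, Precision, and F1 rather than on Accuracy.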


### 2. Fitting and Evaluating Random Forest Model

In [18]:
rf = RandomForestClassifier(class_weight = 'balanced', random_state=0, n_jobs=-1)
rf.fit(X_train_scaled, y_train)

RandomForestClassifier parameters (first rows of the estimator display):
n_estimators=100
criterion='gini'
max_depth=None
min_samples_split=2
min_samples_leaf=1
min_weight_fraction_leaf=0.0
max_features='sqrt'
max_leaf_nodes=None
min_impurity_decrease=0.0
bootstrap=True


In [19]:
y_pred_rf_base = rf.predict(X_test_scaled)

In [20]:
print("RANDOM FOREST METRICS")
print('='*30)
print('Accuracy:',accuracy_score(y_test,y_pred_rf_base))
print('F1 Score:',f1_score(y_test,y_pred_rf_base))
print('Precision:',precision_score(y_test,y_pred_rf_base))
print('Recall:',recall_score(y_test,y_pred_rf_base))

RANDOM FOREST METRICS
Accuracy: 0.9996839998595555
F1 Score: 0.9021739130434783
Precision: 0.9651162790697675
Recall: 0.8469387755102041


### 3. Fitting and Evaluating XGBoost Model

In [15]:
xgb = XGBClassifier()
xgb.fit(X_train_scaled, y_train)

XGBClassifier parameters (first rows of the estimator display):
objective='binary:logistic'
base_score=None
booster=None
callbacks=None
colsample_bylevel=None
colsample_bynode=None
colsample_bytree=None
device=None
early_stopping_rounds=None
enable_categorical=False


In [16]:
y_pred_xgb_base = xgb.predict(X_test_scaled)

In [17]:
print("XGBOOST METRICS")
print('='*30)
print('Accuracy:',accuracy_score(y_test,y_pred_xgb_base))
print('F1 Score:',f1_score(y_test,y_pred_xgb_base))
print('Precision:',precision_score(y_test,y_pred_xgb_base))
print('Recall:',recall_score(y_test,y_pred_xgb_base))

XGBOOST METRICS
Accuracy: 0.9995962220427653
F1 Score: 0.8808290155440415
Precision: 0.8947368421052632
Recall: 0.8673469387755102


## Tuning Hyperparameters

### 1. Random Forest

In [21]:
# Create parameter ranges for random search
param_dist = {
  'n_estimators': randint(100, 800),
  'max_depth': randint(3, 16),
  'min_samples_split': randint(2, 10),
  'min_samples_leaf': randint(1, 5),
  'max_features': ['sqrt', 'log2']
}

# Use random search to find the best hyperparameters with recall as priority
rand_search = RandomizedSearchCV(
  rf, param_distributions=param_dist,
  n_iter=15, cv=3, scoring='recall',
  n_jobs=-1, random_state=42)


In [22]:
# Fitting to the train set
rand_search.fit(X_train_scaled, y_train)

RandomizedSearchCV parameters:
estimator=RandomForestC...andom_state=0)
param_distributions={'max_depth': <scipy.stats....x7f4f846e7260>, 'max_features': ['sqrt', 'log2'], 'min_samples_leaf': <scipy.stats....x7f4f559c2060>, 'min_samples_split': <scipy.stats....x7f4f559c3a10>, ...}
n_iter=15
scoring='recall'
n_jobs=-1
refit=True
cv=3
verbose=0
pre_dispatch='2*n_jobs'
random_state=42

Best RandomForestClassifier parameters (first rows of the estimator display):
n_estimators=352
criterion='gini'
max_depth=3
min_samples_split=7
min_samples_leaf=2
min_weight_fraction_leaf=0.0
max_features='log2'
max_leaf_nodes=None
min_impurity_decrease=0.0
bootstrap=True


In [29]:
best_rf = rand_search.best_estimator_

# Print the best hyperparameters
print('Best hyperparameters:',  rand_search.best_params_)

Best hyperparameters: {'max_depth': 3, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 7, 'n_estimators': 352}


In [25]:
# Generate predictions with the best model
y_pred_best_rf = best_rf.predict(X_test_scaled)

# Check metrics
print("TUNED RANDOM FOREST METRICS")
print('='*40)
print('Accuracy:',accuracy_score(y_test,y_pred_best_rf))
print('F1 Score:',f1_score(y_test,y_pred_best_rf))
print('Precision:',precision_score(y_test,y_pred_best_rf))
print('Recall:',recall_score(y_test,y_pred_best_rf))

TUNED RANDOM FOREST METRICS
Accuracy: 0.9966995540886907
F1 Score: 0.48633879781420764
Precision: 0.332089552238806
Recall: 0.9081632653061225
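
Optimizing `scoring='recall'` alone lets the search trade away Precision, which is consistent with the result above. One possible alternative (not what this notebook ran) is to score candidates with an F-beta measure, which weights Recall more heavily than Precision without ignoring Precision entirely. A minimal sketch on synthetic data, with `beta=2` and the small search ranges chosen arbitrarily:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data standing in for the training set
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# F2 treats Recall as more important than Precision but still penalizes false alarms
f2_scorer = make_scorer(fbeta_score, beta=2)

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions={"n_estimators": randint(20, 60),
                         "max_depth": randint(3, 10)},
    n_iter=3, cv=3, scoring=f2_scorer, random_state=42,
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print("Best mean CV F2:", search.best_score_)
```

Whether F2, F1, or a Recall target with a Precision floor is the right criterion depends on the relative cost of missed fraud versus false alarms, which the project brief leaves open.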


### 2. XGBoost
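
The cell below sets `scale_pos_weight` to the ratio of negative to positive training samples, a common heuristic for imbalanced binary problems. The sketch here works through that arithmetic with assumed training counts (394 fraud cases out of 227,845 rows); these are inferred from the stratified 80/20 split rather than printed by the notebook.

```python
# Assumed training-set counts (inferred from the stratified 80/20 split)
n_train = 227_845
n_train_fraud = 394

n_train_legit = n_train - n_train_fraud
pos_weight_demo = n_train_legit / n_train_fraud

print(f"scale_pos_weight is roughly {pos_weight_demo:.1f}")
```

In effect, each fraud example counts several hundred times as much as a legitimate one during training, which pushes the model toward higher Recall.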

In [26]:
# Ratio of negative to positive samples, used to up-weight the rare fraud class
n_pos = int(y_train.sum())
pos_weight = (len(y_train) - n_pos) / n_pos

xgb_model = XGBClassifier(
            objective='binary:logistic',
            eval_metric='logloss',
            scale_pos_weight=pos_weight,
            random_state=42,
            n_jobs=-1,
            tree_method='hist'
)

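# Define the hyperparameter search space using distributions.
# scipy's uniform(loc, scale) samples from [loc, loc + scale].
param_dis = {
    'learning_rate': uniform(0.01, 0.5),    # [0.01, 0.51]
    'max_depth': randint(3, 10),            # integers 3..9
    'n_estimators': randint(50, 500),       # integers 50..499
    'subsample': uniform(0.6, 0.4),         # [0.6, 1.0]
    # colsample_bytree must stay within (0, 1], so loc + scale must not exceed 1
    'colsample_bytree': uniform(0.6, 0.4),  # [0.6, 1.0]
    'gamma': uniform(0, 2),                 # [0, 2]
}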


random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_dis,
    n_iter=20,
    cv=3,
    scoring='recall',
    verbose=2,
    n_jobs=-1,
    random_state=42
)
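
`scipy.stats.uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, not `[low, high]`, which is easy to get wrong when defining search ranges. Fractions such as `subsample` and `colsample_bytree` must stay within (0, 1] for XGBoost, so it is worth checking the interval each call actually covers:

```python
from scipy.stats import uniform

# uniform(loc, scale) covers [loc, loc + scale]
subsample_dist = uniform(0.6, 0.4)  # [0.6, 1.0]
wide_dist = uniform(0.9, 0.4)       # [0.9, 1.3], exceeds 1.0

for name, dist in [("uniform(0.6, 0.4)", subsample_dist),
                   ("uniform(0.9, 0.4)", wide_dist)]:
    low, high = dist.support()
    print(f"{name}: [{low:.2f}, {high:.2f}]")
```

Candidates sampled outside a parameter's valid range typically fail to fit, which silently reduces the number of useful search iterations.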

In [27]:
# Fit the model to find the best hyperparameters
random_search.fit(X_train_scaled, y_train)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


RandomizedSearchCV parameters:
estimator=XGBClassifier...ree=None, ...)
param_distributions={'colsample_bytree': <scipy.stats....x7f4f5555a690>, 'gamma': <scipy.stats....x7f4f5555be90>, 'learning_rate': <scipy.stats....x7f4f55867770>, 'max_depth': <scipy.stats....x7f4f556d89e0>, ...}
n_iter=20
scoring='recall'
n_jobs=-1
refit=True
cv=3
verbose=2
pre_dispatch='2*n_jobs'
random_state=42

Best XGBClassifier parameters (first rows of the estimator display):
objective='binary:logistic'
base_score=None
booster=None
callbacks=None
colsample_bylevel=None
colsample_bynode=None
colsample_bytree=np.float64(0.9161734358153726)
device=None
early_stopping_rounds=None
enable_categorical=False


In [28]:
# Use the best estimator to make predictions
best_xgb = random_search.best_estimator_
y_pred_best_xgb = best_xgb.predict(X_test_scaled)

# Check metrics
print("TUNED XGB METRICS")
print('='*40)
print('Accuracy:',accuracy_score(y_test,y_pred_best_xgb))
print('F1 Score:',f1_score(y_test,y_pred_best_xgb))
print('Precision:',precision_score(y_test,y_pred_best_xgb))
print('Recall:',recall_score(y_test,y_pred_best_xgb))

TUNED XGB METRICS
Accuracy: 0.9995611109160493
F1 Score: 0.8756218905472637
Precision: 0.8543689320388349
Recall: 0.8979591836734694


# Solution
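The tuned XGBoost model best meets the stated objective of maximizing **Recall** (catching fraud) while maintaining acceptable Precision (minimizing false alarms). Its test-set Recall of 0.898 is the second highest, just behind the tuned Random Forest (0.908), but it keeps Precision at 0.854, whereas the tuned Random Forest's Precision collapses to 0.332.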
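
Beyond model choice, the Recall/Precision balance can also be shifted by changing the decision threshold applied to predicted probabilities instead of relying on the default cut-off used by `predict`. This notebook does not do that; the sketch below shows the idea on synthetic data with a logistic regression, using `precision_recall_curve` to find the operating point with the highest Recall that still meets an assumed Precision floor of 0.8.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
fraud_proba = clf.predict_proba(X_te)[:, 1]

# precision and recall have one more entry than thresholds; drop the final point
precision, recall, thresholds = precision_recall_curve(y_te, fraud_proba)
precision, recall = precision[:-1], recall[:-1]

min_precision = 0.8  # assumed business requirement, purely illustrative
meets_floor = precision >= min_precision
if meets_floor.any():
    # Among thresholds that satisfy the floor, keep the one with the highest Recall
    best_idx = int(np.argmax(np.where(meets_floor, recall, -1.0)))
    print(f"threshold={thresholds[best_idx]:.3f}, "
          f"precision={precision[best_idx]:.3f}, recall={recall[best_idx]:.3f}")
else:
    print("No threshold reaches the requested Precision floor.")
```

In practice the threshold would be chosen on a separate validation split, so that the test set remains an unbiased estimate of final performance.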

### Optimum model
`best_xgb` = **XGBClassifier**(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=np.float64(0.9161734358153726), device=None,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric='logloss', feature_types=None, feature_weights=None,
              gamma=np.float64(1.4213257793715748), grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=np.float64(0.06544541040591566), max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=6, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=228, n_jobs=-1,
              num_parallel_tree=None, ...)

### Baseline Model (Logistic Regression) Metrics
- **Accuracy**: 0.9992977774656788
- **F1 Score**: 0.7727272727272727
- **Precision**: 0.8717948717948718
- **Recall**: 0.6938775510204082

### Base Random Forest Metrics
- **Accuracy**: 0.9996839998595555
- **F1 Score**: 0.9021739130434783
- **Precision**: 0.9651162790697675
- **Recall**: 0.8469387755102041

### Tuned Random Forest Metrics
- **Accuracy**: 0.9966995540886907
- **F1 Score**: 0.48633879781420764
- **Precision**: 0.332089552238806
- **Recall**: 0.9081632653061225

**High Recall, but poor Precision.**

### Base XGB Metrics
- **Accuracy**: 0.9995962220427653
- **F1 Score**: 0.8808290155440415
- **Precision**: 0.8947368421052632
- **Recall**: 0.8673469387755102

### Tuned XGB (**Best Model**) Metrics ✅
- **Accuracy**: 0.9995611109160493
- **F1 Score**: 0.8756218905472637
- **Precision**: 0.8543689320388349
- **Recall**: 0.8979591836734694
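
The selection logic behind this conclusion can be written down explicitly: keep only models whose Precision clears a floor, then pick the one with the highest Recall. The sketch below uses the test-set metrics reported above, rounded to four decimals. The 0.80 Precision floor is an assumption for illustration, since the brief does not give a numeric definition of "acceptable".

```python
# (precision, recall) on the test set, rounded from the results above
results = {
    "Logistic Regression (baseline)": (0.8718, 0.6939),
    "Random Forest (base)": (0.9651, 0.8469),
    "Random Forest (tuned)": (0.3321, 0.9082),
    "XGBoost (base)": (0.8947, 0.8673),
    "XGBoost (tuned)": (0.8544, 0.8980),
}

min_precision = 0.80  # assumed floor, not specified in the notebook

# Drop models that raise too many false alarms, then maximize Recall
eligible = {name: pr for name, pr in results.items() if pr[0] >= min_precision}
best_model = max(eligible, key=lambda name: eligible[name][1])

print("Eligible models:", sorted(eligible))
print("Selected:", best_model)
```

With a much lower floor the tuned Random Forest would win on Recall alone, so the choice of floor is a business decision about how many false alarms are tolerable.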