# Introduction

**Name:**<br>
Affan Anitya as *Data Scientist 1*,<br>
Aqsal Herdi as *Data Scientist 2*,<br>
Lia Kurniawati as *Data Analyst*,<br>
Yuana Inka as *Data Engineer*<br>
<br>
**Batch:** FTDS HCK-024

**Author of this Notebook:** Affan Anitya

# Import Libraries

These are the libraries we are going to be using.

In [10]:
import pandas as pd
import pickle

from scipy.stats import chi2_contingency, pointbiserialr

import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.metrics import precision_score, classification_report, make_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

from xgboost import XGBClassifier


# Data Loading

We are using a dataset from Kaggle, available at the link below.<br>
[Data Link](https://www.kaggle.com/datasets/jainilcoder/online-payment-fraud-detection)

In [11]:
df = pd.read_csv('onlinefraud.csv')

# Data Exploration

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [13]:
df.head()

|   | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


In [14]:
df['type'].value_counts()

CASH_OUT    2237500
PAYMENT     2151495
CASH_IN     1399284
TRANSFER     532909
DEBIT         41432
Name: type, dtype: int64

In [15]:
df['isFraud'].value_counts()

0    6354407
1       8213
Name: isFraud, dtype: int64

From this exploration we can conclude that the dataset has about 6.36 million rows and is heavily imbalanced: 6,354,407 transactions are not fraud and only 8,213 are fraud.
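To put the imbalance in numbers, here is a small sketch that uses only the two counts printed above. It also shows the accuracy a trivial model would reach by always predicting "not fraud", which is one reason plain accuracy is a poor guide on this data.

```python
# Class counts taken from the value_counts() output above.
n_not_fraud = 6_354_407
n_fraud = 8_213
n_total = n_not_fraud + n_fraud

# Share of fraud rows in the whole dataset.
fraud_share = n_fraud / n_total

# Accuracy of a trivial model that always predicts "not fraud".
baseline_accuracy = n_not_fraud / n_total

print(f"Total rows        : {n_total}")
print(f"Fraud share       : {fraud_share:.4%}")
print(f"Baseline accuracy : {baseline_accuracy:.4%}")
```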

In [16]:
df_majority = df[df['isFraud'] == 0]  # Non-fraud transactions
df_minority = df[df['isFraud'] == 1]  # Fraud transactions

# Undersample the majority class
df_majority_undersampled = resample(df_majority, 
                                    replace=False,  # No replacement (random selection)
                                    n_samples=len(df_minority),  # Match minority class count
                                    random_state=42)  # For reproducibility

# Combine undersampled majority class with minority class
df_undersampled = pd.concat([df_majority_undersampled, df_minority])

# Shuffle dataset
df_undersampled = df_undersampled.sample(frac=1, random_state=42).reset_index(drop=True)

# Verify new class distribution
df_undersampled['isFraud'].value_counts()


1    8213
0    8213
Name: isFraud, dtype: int64

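I decided to undersample the majority class because the two classes are extremely imbalanced (8,213 fraud rows against 6,354,407 non-fraud rows). Left as is, the imbalance could bias the model toward always predicting the majority class.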
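One alternative worth considering, shown here only as a sketch on hypothetical toy data, is to split first and undersample the training portion only. That way the test set keeps the original fraud rate. The `toy` DataFrame and its values are assumptions made for illustration, not part of the notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy stand-in for the real data: 1000 rows, 5% fraud (hypothetical values).
toy = pd.DataFrame({
    "amount": range(1000),
    "isFraud": [1 if i % 20 == 0 else 0 for i in range(1000)],
})

# 1) Split first, stratified so both parts keep the original fraud rate.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["isFraud"], random_state=42
)

# 2) Undersample the majority class in the training part only.
train_majority = train[train["isFraud"] == 0]
train_minority = train[train["isFraud"] == 1]
train_majority_down = resample(
    train_majority, replace=False, n_samples=len(train_minority), random_state=42
)
train_balanced = pd.concat([train_majority_down, train_minority]).sample(
    frac=1, random_state=42
)

print(train_balanced["isFraud"].value_counts().to_dict())
print(f"Test fraud rate: {test['isFraud'].mean():.2%}")
```

With this ordering, the metrics measured on `test` reflect the class balance the model would face in practice.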

# Split Data

Before going further, we split the data into training and test sets with an 80/20 ratio.

In [17]:
# define X by dropping the target column, and y as the target column
X = df_undersampled.drop(columns=['isFraud'])
y = df_undersampled['isFraud']

# split the data 80/20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=26)

print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")

Training set size: (13140, 10)
Testing set size: (3286, 10)


Next, we decide which columns are numerical and which are categorical.

In [18]:
num_cols = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
cat_cols = ['type','nameOrig','nameDest']

I want to check how the categorical features are associated with the target. Since both the target and these features are categorical, we use the chi-squared test of independence.

In [19]:
# collect the p-value and interpretation for each categorical feature
p_values = []
results = []

for feature in cat_cols:
    # Create a contingency table (cross-tabulation between target and feature)
    contingency_table = pd.crosstab(X_train[feature], y_train)
    
    # Perform chi-squared test
    chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)
    
    # Put the values in p_values
    p_values.append(p_value)
    
    # Interpret the result based on p-value
    if p_value < 0.05:
        results.append(f'{feature} is correlated with isFraud')
    else:
        results.append(f'{feature} is not correlated with isFraud')

# Display results
correlation_results = pd.DataFrame({
    'Feature': cat_cols,
    'P-Value': p_values,
    'Interpretation': results
})

correlation_results

|   | Feature | P-Value | Interpretation |
|---|---|---|---|
| 0 | type | 0.0 | type is correlated with isFraud |
| 1 | nameOrig | 0.495898 | nameOrig is not correlated with isFraud |
| 2 | nameDest | 0.492579 | nameDest is not correlated with isFraud |


Based on this, `type` is the only categorical feature that is significantly associated with the target (p < 0.05), so we keep it for feature selection. `nameOrig` and `nameDest` are not, which is unsurprising for identifier-like columns. Next, we test the relationship between the numerical features and the binary target using the point-biserial correlation.
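For intuition about what the chi-squared test measures, here is a minimal hand computation on a hypothetical contingency table. The counts are made up for illustration and are not taken from the dataset. Each cell compares the observed count with the count expected if the feature and the target were independent.

```python
# Hypothetical contingency table: rows = transaction type, columns = isFraud (0, 1).
observed = {
    "TRANSFER": [30, 70],
    "PAYMENT": [90, 10],
    "CASH_IN": [60, 40],
}

row_totals = {name: sum(counts) for name, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
grand_total = sum(row_totals.values())

chi2 = 0.0
for name, counts in observed.items():
    for j, obs in enumerate(counts):
        # Expected count under independence: row total * column total / grand total.
        expected = row_totals[name] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(col_totals) - 1)
print(f"chi2 = {chi2:.2f}, dof = {dof}")
```

A large statistic relative to the degrees of freedom gives a small p-value, which is what the loop above checks against 0.05.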

In [20]:
p_values = []
results = []

# Point Biserial Correlation (for binary categorical variables)
for feature in num_cols:
    corr, p_value = pointbiserialr(X_train[feature], y_train)  # Compute correlation

    p_values.append(p_value)
    
    # Interpret result
    if p_value < 0.05:
        results.append(f'{feature} is correlated with isFraud')
    else:
        results.append(f'{feature} is not correlated with isFraud')

# Display results
correlation_results = pd.DataFrame({
    'Feature': num_cols,
    'P-Value': p_values,
    'Interpretation': results
})

correlation_results

|   | Feature | P-Value | Interpretation |
|---|---|---|---|
| 0 | step | 4.616981e-306 | step is correlated with isFraud |
| 1 | amount | 0.0 | amount is correlated with isFraud |
| 2 | oldbalanceOrg | 4.883444e-50 | oldbalanceOrg is correlated with isFraud |
| 3 | newbalanceOrig | 2.03717e-51 | newbalanceOrig is correlated with isFraud |
| 4 | oldbalanceDest | 1.55655e-19 | oldbalanceDest is correlated with isFraud |
| 5 | newbalanceDest | 0.1987584 | newbalanceDest is not correlated with isFraud |


Based on these calculations, every numerical feature except `newbalanceDest` (p ≈ 0.20) is significantly correlated with the target, so we keep all of them except `newbalanceDest`.
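As a side note, the point-biserial coefficient is the Pearson correlation applied to one binary and one continuous variable. The sketch below checks this on hypothetical toy values (the lists `x` and `y` are assumptions, not dataset rows) by computing it both ways.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical toy data: a numeric feature and a binary target.
x = [10.0, 12.0, 9.0, 30.0, 28.0, 35.0]
y = [0, 0, 0, 1, 1, 1]

# Point-biserial formula: (M1 - M0) / s * sqrt(p * q),
# where s is the population standard deviation of x.
x1 = [xi for xi, yi in zip(x, y) if yi == 1]
x0 = [xi for xi, yi in zip(x, y) if yi == 0]
p = len(x1) / len(x)
q = 1 - p
r_pb = (mean(x1) - mean(x0)) / pstdev(x) * sqrt(p * q)

# Pearson correlation computed directly, for comparison.
mx, my = mean(x), mean(y)
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
r_pearson = cov / sqrt(
    sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y)
)

print(f"point-biserial: {r_pb:.6f}")
print(f"pearson       : {r_pearson:.6f}")
```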

### Features Selection

We choose the features based on the correlation tests above.

In [21]:
select_num_cols = ['step','amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest']
select_cat_cols = ['type']

# Pipeline Creation

Since the dataset is fairly large and the target is categorical, we will compare two tree-based classifiers: Random Forest and XGBoost.

## Preprocessing

In [22]:
onehot_encoder = OneHotEncoder()
robust_scaler = RobustScaler()
# create a preprocessing pipeline using a column transformer
preprocessing = ColumnTransformer(
    transformers=[
        # encode the low-cardinality categorical features using the one-hot encoder
        ('onehot', onehot_encoder, select_cat_cols),
        # scale the numerical columns using the robust scaler
        ('num', robust_scaler, select_num_cols)
    ],
    # drop the features that were not selected
    remainder='drop'
)

In [23]:
# check that the preprocessing step works
X_train_preprocess = preprocessing.fit_transform(X_train)
X_test_preprocess = preprocessing.transform(X_test)
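To illustrate what the robust scaler does to a numerical column, here is a standard-library sketch on hypothetical amounts. It assumes the scaler's default behaviour of centering on the median and dividing by the interquartile range (25th to 75th percentile); the values in `amounts` are made up.

```python
from statistics import median, quantiles

# Hypothetical amounts with one extreme outlier.
amounts = [1.0, 2.0, 3.0, 4.0, 100.0]

# method="inclusive" interpolates linearly between the data points.
q1, _, q3 = quantiles(amounts, n=4, method="inclusive")
center = median(amounts)
iqr = q3 - q1

# Robust scaling: subtract the median, divide by the interquartile range.
scaled = [(a - center) / iqr for a in amounts]
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier stays large after scaling, but it does not distort the scale of the ordinary values, which is the reason to prefer this over mean and standard deviation on heavy-tailed amounts.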

## Model Random Forest

In [24]:
rf_model = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",  # Automatically adjusts weights for fraud cases
    random_state=26
)

In [25]:
# creating pipeline with preprocessing and Random Forest classifier
pipeline_RandFor = Pipeline(steps=[
    ('preprocessor', preprocessing),
    ('classifier', rf_model)
])

In [26]:
# fit the pipeline
pipeline_RandFor.fit(X_train, y_train)

In [27]:
# Define a custom scorer for macro-averaged precision (unweighted mean over both classes)
custom_precision_scorer = make_scorer(precision_score, average='macro')

cv_scores = cross_val_score(
    estimator=pipeline_RandFor,      # Random Forest pipeline
    X=X_train,                       # Training features
    y=y_train,                       # Training target
    cv=5,                            # Number of folds
    scoring=custom_precision_scorer  # Macro precision metric
)

# Print cross-validation results
print('Precision Score - All - Cross Validation  : ', cv_scores)
print('Precision Score - Mean - Cross Validation : ', cv_scores.mean())
print('Precision Score - Std - Cross Validation  : ', cv_scores.std())
print('Precision Score - Range of Test-Set       : ',
      (cv_scores.mean() - cv_scores.std()), '-', (cv_scores.mean() + cv_scores.std()))


Precision Score - All - Cross Validation  :  [0.98914898 0.99391742 0.98937662 0.99430081 0.99394315]
Precision Score - Mean - Cross Validation :  0.9921373951016115
Precision Score - Std - Cross Validation  :  0.0023521109826631554
Precision Score - Range of Test-Set       :  0.9897852841189483 - 0.9944895060842747
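For reference, macro precision is the unweighted mean of the per-class precisions. The following sketch computes it by hand on hypothetical labels; `y_true_toy`, `y_pred_toy`, and `precision_for` are illustrative names and not part of the notebook.

```python
# Hypothetical true and predicted labels (0 = not fraud, 1 = fraud).
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 0]


def precision_for(label, y_true, y_pred):
    """Precision of one class: correct predictions of `label` / all predictions of `label`."""
    true_of_predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    if not true_of_predicted:
        return 0.0
    return sum(1 for t in true_of_predicted if t == label) / len(true_of_predicted)


per_class = {label: precision_for(label, y_true_toy, y_pred_toy) for label in (0, 1)}

# Macro average: every class counts equally, regardless of its size.
macro_precision = sum(per_class.values()) / len(per_class)

print("Per-class precision:", per_class)
print("Macro precision    :", macro_precision)
```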


## Model XGBoost

In [28]:
xgb_model = XGBClassifier(
    n_estimators=100,         # Number of trees, same as RF
    scale_pos_weight=10,      # Adjust based on class imbalance ratio (to be tuned)
    random_state=26,
    use_label_encoder=False,  # Avoids unnecessary warnings
    eval_metric="logloss"     # Standard evaluation metric for classification
)
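A note on `scale_pos_weight`: the XGBoost documentation suggests the ratio of negative to positive instances as a typical starting value. The sketch below applies that heuristic to the counts seen in this notebook; the helper name `suggested_scale_pos_weight` is hypothetical.

```python
def suggested_scale_pos_weight(n_negative, n_positive):
    """Heuristic starting value: number of negatives divided by number of positives."""
    return n_negative / n_positive


# Undersampled data used for training here: both classes have 8213 rows.
print(suggested_scale_pos_weight(8_213, 8_213))                # 1.0

# Original, imbalanced data.
print(round(suggested_scale_pos_weight(6_354_407, 8_213), 1))  # 773.7
```

On the balanced, undersampled data the heuristic gives 1.0, so the value of 10 used above weights the fraud class more heavily than the heuristic would and is best treated as something to tune.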


In [29]:
# creating pipeline with preprocessing and XGBoost classifier
pipeline_XG = Pipeline(steps=[
    ('preprocessor', preprocessing),
    ('classifier', xgb_model)
])

In [30]:
# Define a custom scorer for macro-averaged precision (unweighted mean over both classes)
custom_precision_scorer = make_scorer(precision_score, average='macro')

cv_scores = cross_val_score(
    estimator=pipeline_XG,           # XGBoost pipeline
    X=X_train,                       # Training features
    y=y_train,                       # Training target
    cv=5,                            # Number of folds
    scoring=custom_precision_scorer  # Macro precision metric
)

# Print cross-validation results
print('Precision Score - All - Cross Validation  : ', cv_scores)
print('Precision Score - Mean - Cross Validation : ', cv_scores.mean())
print('Precision Score - Std - Cross Validation  : ', cv_scores.std())
print('Precision Score - Range of Test-Set       : ',
      (cv_scores.mean() - cv_scores.std()), '-', (cv_scores.mean() + cv_scores.std()))

Parameters: { "use_label_encoder" } are not used.



Precision Score - All - Cross Validation  :  [0.99064379 0.99354718 0.99165989 0.99506931 0.99282206]
Precision Score - Mean - Cross Validation :  0.9927484444826007
Precision Score - Std - Cross Validation  :  0.0015260613922171901
Precision Score - Range of Test-Set       :  0.9912223830903835 - 0.9942745058748179



Based on cross-validation, Random Forest and XGBoost are practically tied, with a mean macro precision of about 0.9921 and 0.9927 respectively. To my knowledge, XGBoost handles larger datasets better. Since the model is intended for bigger data in the future, and a bank's actual data is usually larger than 6 million rows, we will use XGBoost.

## Hyperparameter Tuning

To improve the model further, we tune its hyperparameters.

In [31]:
# Define the hyperparameter grid for XGBoost
parameters = {
    'classifier__max_depth': [3, 6, 9],  # Depth of each tree
    'classifier__learning_rate': [0.01, 0.1, 0.2],  # Learning rate
    'classifier__n_estimators': [100, 300, 500],  # Number of trees
    'classifier__min_child_weight': [1, 3, 5],  # Minimum sum of instance weight (hessian)
    'classifier__gamma': [0, 0.1, 0.3],  # Minimum loss reduction
    'classifier__subsample': [0.8, 1.0],  # Fraction of samples used per tree
    'classifier__colsample_bytree': [0.8, 1.0]  # Fraction of features used per tree
}

# Using GridSearchCV
grid_search = GridSearchCV(
    estimator=pipeline_XG,  # XGBoost pipeline defined above
    param_grid=parameters,
    cv=5,  # 5-fold cross-validation
    n_jobs=-1,  # Use all available cores
    scoring=custom_precision_scorer,  # Macro precision scorer defined earlier
    verbose=2
)

# Fit the GridSearchCV
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 972 candidates, totalling 4860 fits
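The candidate count reported above follows directly from the grid: it is the product of the number of values per hyperparameter, multiplied by the number of folds. A quick check:

```python
from math import prod

# Number of values tried for each hyperparameter in the grid above.
grid_sizes = {
    "max_depth": 3,
    "learning_rate": 3,
    "n_estimators": 3,
    "min_child_weight": 3,
    "gamma": 3,
    "subsample": 2,
    "colsample_bytree": 2,
}
n_folds = 5

n_candidates = prod(grid_sizes.values())
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

Because the cost grows multiplicatively, adding one more value to any single hyperparameter adds hundreds of extra fits.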


Parameters: { "use_label_encoder" } are not used.



In [32]:
# check the best parameters and the best cross-validated score (macro precision)
print("Best Parameters:", grid_search.best_params_)
print("Best Macro Precision:", grid_search.best_score_)

Best Parameters: {'classifier__colsample_bytree': 1.0, 'classifier__gamma': 0, 'classifier__learning_rate': 0.2, 'classifier__max_depth': 6, 'classifier__min_child_weight': 1, 'classifier__n_estimators': 500, 'classifier__subsample': 1.0}
Best Macro Precision: 0.9931218535209666


In [56]:
# Get the best model
best_model = grid_search.best_estimator_

# Predict on the test set
y_pred = best_model.predict(X_test)

# Evaluate performance
print(classification_report(y_test, y_pred, target_names=['Not Fraud','Fraud']))

              precision    recall  f1-score   support

   Not Fraud       1.00      0.99      0.99      1648
       Fraud       0.99      1.00      0.99      1638

    accuracy                           0.99      3286
   macro avg       0.99      0.99      0.99      3286
weighted avg       0.99      0.99      0.99      3286



Since we want to minimize false negatives, that is, fraud (1) that goes undetected and is predicted as not fraud (0), recall is our main evaluation metric. Note that the cross-validation and grid search above were scored with macro precision, so recall is read from this classification report. After hyperparameter tuning, the model reaches a recall of 1.00 for Fraud and 0.99 for Not Fraud on the test set, so we conclude that this is our best model and set of hyperparameters.
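If recall on the fraud class should also drive model selection, one option is to pass a recall scorer to the cross-validation instead. The sketch below shows this on hypothetical toy data with a decision tree as a stand-in classifier; `X_toy`, `y_toy`, and `fraud_recall_scorer` are illustrative names, and this is not what the notebook above actually ran.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the preprocessed transactions.
X_toy, y_toy = make_classification(n_samples=500, n_features=6, random_state=26)

# Recall of the positive (fraud) class: TP / (TP + FN).
fraud_recall_scorer = make_scorer(recall_score, pos_label=1)

scores = cross_val_score(
    DecisionTreeClassifier(random_state=26),  # stand-in for the real pipeline
    X_toy,
    y_toy,
    cv=5,
    scoring=fraud_recall_scorer,
)

print("Recall per fold:", scores)
print("Mean recall    :", scores.mean())
```

The same scorer object could be given to the `scoring` argument of a grid search.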

# Model Saving

In [49]:
# Export the model using pickle
with open('model_xgb.pkl','wb') as file:
    pickle.dump(best_model, file)
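For completeness, here is a sketch of the round trip: saving an object with pickle and loading it back. A plain dictionary stands in for the fitted pipeline so that the example runs on its own, and the temporary path is an assumption. Pickle files should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted pipeline (best_model).
stand_in_model = {"name": "model_xgb", "features": ["step", "type", "amount"]}

# Write to a temporary directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "model_xgb.pkl")
with open(path, "wb") as file:
    pickle.dump(stand_in_model, file)

# Load the object back, as an inference script would.
with open(path, "rb") as file:
    loaded_model = pickle.load(file)

print(loaded_model == stand_in_model)
```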

# Model Inference

Now we will try out the model. We build the new data by randomly sampling 10 rows from the original dataset.

In [53]:
num_samples = 10  # Define how many new data points you want
sampled_data = df.sample(n=num_samples, random_state=26)  # Ensuring reproducibility

# Create the new_data dictionary with randomly sampled values
new_data = {
    'step': sampled_data['step'].tolist(),
    'type': sampled_data['type'].tolist(),
    'amount': sampled_data['amount'].tolist(),
    'nameOrig': sampled_data['nameOrig'].tolist(),
    'oldbalanceOrg': sampled_data['oldbalanceOrg'].tolist(),
    'newbalanceOrig': sampled_data['newbalanceOrig'].tolist(),
    'nameDest': sampled_data['nameDest'].tolist(),
    'oldbalanceDest': sampled_data['oldbalanceDest'].tolist(),
    'newbalanceDest': sampled_data['newbalanceDest'].tolist(),
}


In [54]:
new_data_df = pd.DataFrame(new_data)
new_data_df

|   | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 304 | TRANSFER | 10000000.0 | C878073444 | 0.0 | 0.0 | C127527113 | 24748703.03 | 35524005.63 |
| 1 | 188 | CASH_IN | 61732.14 | C434639441 | 38919.0 | 100651.14 | C617732023 | 0.0 | 0.0 |
| 2 | 48 | PAYMENT | 9534.51 | C73806064 | 1558.0 | 0.0 | M1382990621 | 0.0 | 0.0 |
| 3 | 178 | CASH_OUT | 317256.91 | C647444544 | 0.0 | 0.0 | C1162645552 | 388598.19 | 705855.1 |
| 4 | 41 | CASH_OUT | 157159.32 | C838821458 | 0.0 | 0.0 | C1588468151 | 609492.03 | 662387.08 |
| 5 | 177 | PAYMENT | 22616.23 | C40484457 | 0.0 | 0.0 | M1937185807 | 0.0 | 0.0 |
| 6 | 404 | CASH_OUT | 17367.35 | C821839809 | 0.0 | 0.0 | C1107140692 | 3062731.92 | 3080099.27 |
| 7 | 299 | CASH_OUT | 197170.73 | C1335798944 | 0.0 | 0.0 | C1876595582 | 261501.33 | 458672.06 |
| 8 | 157 | CASH_IN | 30523.5 | C1756748481 | 89.0 | 30612.5 | C1103359406 | 71521.19 | 40997.7 |
| 9 | 377 | CASH_OUT | 374755.25 | C649066660 | 41713.0 | 0.0 | C2034359429 | 0.0 | 374755.25 |


In [55]:
# make a prediction with new_data_df
prediction = best_model.predict(new_data_df)

# loop over the predictions and print a label for each row
for i in prediction:
    if i == 0:
        print('This is not A Fraud')
    elif i == 1:
        print('This is A Fraud')

This is not A Fraud
This is not A Fraud
This is not A Fraud
This is not A Fraud
This is not A Fraud
This is not A Fraud
This is not A Fraud
This is not A Fraud
This is not A Fraud
This is not A Fraud
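That every sampled row comes back as not fraud is expected given how rare fraud is in the full dataset. As a rough check using the class counts from the exploration section, the sketch below computes the probability that 10 rows drawn without replacement contain no fraud at all.

```python
# Counts from the exploration section.
n_total = 6_362_620
n_fraud = 8_213
n_samples = 10

# Probability that none of the sampled rows is fraud (sampling without replacement).
p_no_fraud = 1.0
for i in range(n_samples):
    p_no_fraud *= (n_total - n_fraud - i) / (n_total - i)

print(f"P(no fraud among {n_samples} rows) = {p_no_fraud:.4f}")
```

The result is close to 0.99, so a sample this small says little about how the model handles fraud. Sampling some rows from the fraud class explicitly would make for a more informative inference check.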
