# Fraud Transaction Detection (End-to-End Machine Learning)

**Name**: Anom Nur Maulid  
**Class**: TK4601  
**NIM**: 1103223193  

## Objective
Build an end-to-end machine learning pipeline to predict the probability of an online transaction being fraudulent (**isFraud**).

## Output
Generate a submission file:
- `TransactionID, isFraud` (probability)



## Dataset Overview and Task Description (Machine Learning Class, Individual Task)

1. Main Objective:
Design and implement an end-to-end machine learning and deep learning pipeline that predicts the probability that an online transaction is fraudulent.

2. Task Overview:
"In this assignment, you will build an end-to-end fraud detection pipeline. You will work with both the transaction and identity tables, perform data cleaning and preprocessing, handle missing values and class imbalance, and engineer or select relevant features. You are required to implement machine learning or deep learning models to predict the probability that a transaction is fraudulent (isFraud). The workflow should cover data preprocessing, model training, hyperparameter tuning (at a basic level), and evaluation using appropriate metrics."

3. Dataset Link:
https://drive.google.com/drive/folders/1JvI5xhPfN3VmjpWYZk9fCHG41xG697um

4. Notebook Link:
https://colab.research.google.com/drive/1oz46ISmhMqGWVSsHWQcfdzZYy0tR4kVH?usp=sharing

## 1. Mount Google Drive
This step mounts Google Drive so the notebook can access the dataset stored in Drive.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 2. Import Libraries
We import required libraries for:
- data processing (pandas, numpy)
- preprocessing pipeline (ColumnTransformer, imputation, encoding)
- model training (Logistic Regression, RandomForest)
- evaluation metrics (ROC-AUC, PR-AUC, confusion matrix)


In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    roc_auc_score, average_precision_score, classification_report,
    confusion_matrix, precision_recall_curve
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier


## 3. Locate Dataset Files
We verify the dataset folder exists and automatically detect:
- `train_transaction.csv`
- `test_transaction.csv`


In [None]:
import os, re

DATA_DIR = "/content/drive/MyDrive/UAS ML DL/Fraud Transcation (ML)"  # must match the Drive folder name exactly

print("DATA_DIR exists?", os.path.exists(DATA_DIR))
print("\nFolder contents:")
files = sorted(os.listdir(DATA_DIR))
for f in files:
    print("-", f)

train_path = next((os.path.join(DATA_DIR, f) for f in files if re.match(r"train_transaction.*\.csv$", f)), None)
test_path  = next((os.path.join(DATA_DIR, f) for f in files if re.match(r"test_transaction.*\.csv$", f)), None)

print("\nDetected train_path:", train_path)
print("Detected test_path :", test_path)


DATA_DIR exists? True

Folder contents:
- Fraud Transaction.ipynb
- submission_fraud.csv
- submission_fraud_rf_baseline.csv
- test_transaction.csv
- train_transaction.csv

Detected train_path: /content/drive/MyDrive/UAS ML DL/Fraud Transcation (ML)/train_transaction.csv
Detected test_path : /content/drive/MyDrive/UAS ML DL/Fraud Transcation (ML)/test_transaction.csv


## 4. Load Data and Sanity Check
We load train and test datasets, then verify:
- train contains `isFraud`
- test does not contain `isFraud`
- preview a few rows to confirm column structure


In [None]:
import pandas as pd

train = pd.read_csv(train_path)
test  = pd.read_csv(test_path)

print("Train shape:", train.shape)
print("Test shape :", test.shape)

print("\nRequired column check:")
print("TransactionID in train?", "TransactionID" in train.columns)
print("isFraud in train?", "isFraud" in train.columns)
print("TransactionID in test?", "TransactionID" in test.columns)
print("isFraud in test?", "isFraud" in test.columns)

print("\nPreview train:")
display(train.head())

print("\nPreview test:")
display(test.head())


Train shape: (590540, 394)
Test shape : (506691, 393)

Required column check:
TransactionID in train? True
isFraud in train? True
TransactionID in test? True
isFraud in test? False

Preview train:


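|  | TransactionID | isFraud | TransactionDT | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | ... | V330 | V331 | V332 | V333 | V334 | V335 | V336 | V337 | V338 | V339 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2987000 | 0 | 86400 | 68.5 | W | 13926 | NaN | 150.0 | discover | 142.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2987001 | 0 | 86401 | 29.0 | W | 2755 | 404.0 | 150.0 | mastercard | 102.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2987002 | 0 | 86469 | 59.0 | W | 4663 | 490.0 | 150.0 | visa | 166.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2987003 | 0 | 86499 | 50.0 | W | 18132 | 567.0 | 150.0 | mastercard | 117.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2987004 | 0 | 86506 | 50.0 | H | 4497 | 514.0 | 150.0 | mastercard | 102.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |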



Preview test:


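|  | TransactionID | TransactionDT | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | card6 | ... | V330 | V331 | V332 | V333 | V334 | V335 | V336 | V337 | V338 | V339 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3663549 | 18403224 | 31.95 | W | 10409 | 111.0 | 150.0 | visa | 226.0 | debit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 3663550 | 18403263 | 49.0 | W | 4272 | 111.0 | 150.0 | visa | 226.0 | debit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 3663551 | 18403310 | 171.0 | W | 4476 | 574.0 | 150.0 | visa | 226.0 | debit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 3663552 | 18403310 | 284.95 | W | 10989 | 360.0 | 150.0 | visa | 166.0 | debit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 3663553 | 18403317 | 67.95 | W | 18018 | 452.0 | 150.0 | mastercard | 117.0 | debit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |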


## 5. Prepare Features and Target
Steps:
- separate target label `isFraud` from training data
- store `TransactionID` from test for submission
- remove `TransactionID` from features (identifier)
- convert ±inf to NaN and measure missing values


In [None]:
import numpy as np

target = "isFraud"

# keep the IDs for the submission file
test_ids = test["TransactionID"].copy()

# separate the target from the features
y = train[target].astype(int)
X = train.drop(columns=[target])

# drop TransactionID from the features (it is an identifier)
X = X.drop(columns=["TransactionID"], errors="ignore")
test = test.drop(columns=["TransactionID"], errors="ignore")

# replace inf values with NaN
X = X.replace([np.inf, -np.inf], np.nan)
test = test.replace([np.inf, -np.inf], np.nan)

print("X shape:", X.shape)
print("test shape:", test.shape)
print("Fraud ratio (y.mean):", y.mean())
print("Missing values (X):", int(X.isna().sum().sum()))
print("Missing values (test):", int(test.isna().sum().sum()))


X shape: (590540, 392)
test shape: (506691, 392)
Fraud ratio (y.mean): 0.03499000914417313
Missing values (X): 95566686
Missing values (test): 73490163
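
To put the raw missing-cell counts in context, the sketch below converts them into a share of all feature cells, using only the shapes and counts printed above. The per-column ranking on a toy frame is illustrative; the helper name `missing_share` and the toy values are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_share(n_missing: int, n_rows: int, n_cols: int) -> float:
    """Fraction of cells that are missing in an n_rows x n_cols table."""
    return n_missing / (n_rows * n_cols)

# Counts and shapes taken from the notebook output above
train_share = missing_share(95_566_686, 590_540, 392)
test_share = missing_share(73_490_163, 506_691, 392)
print(f"train missing share: {train_share:.3f}")
print(f"test missing share : {test_share:.3f}")

# Toy frame (hypothetical values) showing a per-column missing-rate ranking
toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 40.0],
    "card": ["visa", None, None, "visa"],
    "v1": [np.nan, np.nan, np.nan, 1.0],
})
missing_rate = toy.isna().mean().sort_values(ascending=False)
print(missing_rate)
```

Roughly 41% of the training cells and 37% of the test cells are missing, which is why every model in this notebook needs an explicit imputation step. A per-column ranking like the toy one could also guide dropping very sparse columns, although the notebook does not do that.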


## 6. Train/Validation Split (Stratified)
We use a stratified split to keep the fraud ratio consistent in training and validation sets
because the dataset is imbalanced (~3.5% fraud).


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("X_train:", X_train.shape, "y_train:", y_train.shape)
print("X_valid:", X_valid.shape, "y_valid:", y_valid.shape)

print("Fraud ratio train:", y_train.mean())
print("Fraud ratio valid:", y_valid.mean())


X_train: (472432, 392) y_train: (472432,)
X_valid: (118108, 392) y_valid: (118108,)
Fraud ratio train: 0.03498916246147594
Fraud ratio valid: 0.0349933958749619


## 7. Preprocessing + Baseline Model (Logistic Regression)
Preprocessing:
- numeric features: median imputation
- categorical features: most-frequent imputation + one-hot encoding

Baseline model:
- Logistic Regression with `class_weight="balanced"` to address class imbalance

Metrics:
- ROC-AUC
- PR-AUC (Average Precision), important for imbalanced classification
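
As a reference point for reading PR-AUC values, the sketch below shows on synthetic labels that a scorer with no information reaches an average precision close to the positive rate (about 0.035 here), while a perfect ranking reaches 1.0. The NumPy implementation of average precision is a simplified stand-in that assumes no tied scores; it is not the scikit-learn function used in the notebook.

```python
import numpy as np

def average_precision(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Average precision: mean of precision@k over the ranks k of the positives.

    Assumes no tied scores (true for the continuous random scores used here).
    """
    order = np.argsort(-scores)           # rank by descending score
    y_sorted = y_true[order]
    hits = np.cumsum(y_sorted)            # true positives among the top-k
    ranks = np.arange(1, len(y_true) + 1)
    precision_at_k = hits / ranks
    return float(precision_at_k[y_sorted == 1].mean())

rng = np.random.default_rng(0)
n = 200_000
prevalence = 0.035                        # approximate fraud rate in this dataset
y = (rng.random(n) < prevalence).astype(int)

random_scores = rng.random(n)             # carries no information about y
perfect_scores = y + 0.5 * rng.random(n)  # every fraud outranks every non-fraud

ap_random = average_precision(y, random_scores)
ap_perfect = average_precision(y, perfect_scores)
print(f"positive rate : {y.mean():.4f}")
print(f"AP (random)   : {ap_random:.4f}")
print(f"AP (perfect)  : {ap_perfect:.4f}")
```

Under this reading, a PR-AUC should be judged against the roughly 0.035 no-skill level rather than against 0.5, which is the no-skill level for ROC-AUC.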


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

# split the columns into numeric and categorical
num_cols = X_train.select_dtypes(include=["int64","float64","int32","float32"]).columns.tolist()
cat_cols = [c for c in X_train.columns if c not in num_cols]

print("Numeric cols:", len(num_cols))
print("Categorical cols:", len(cat_cols))

numeric_tf = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
])

categorical_tf = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=True)),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_tf, num_cols),
        ("cat", categorical_tf, cat_cols),
    ],
    remainder="drop"
)

logreg = Pipeline(steps=[
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=300, class_weight="balanced"))
])

logreg.fit(X_train, y_train)
p_valid = logreg.predict_proba(X_valid)[:, 1]

roc = roc_auc_score(y_valid, p_valid)
pr  = average_precision_score(y_valid, p_valid)

print("LogReg | ROC-AUC:", roc, "| PR-AUC:", pr)


Numeric cols: 378
Categorical cols: 14


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogReg | ROC-AUC: 0.7398248593978234 | PR-AUC: 0.13386514569429775
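
The convergence warning above names feature scaling as one remedy. The sketch below, on synthetic data with deliberately mismatched feature scales, adds a `StandardScaler` after imputation. It was not run on the fraud data, so whether it would change the scores reported above is untested; the variable names and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 5_000
# Synthetic features on very different scales (hypothetical stand-in for numeric columns)
X_demo = rng.normal(size=(n, 4)) * np.array([1.0, 50.0, 1_000.0, 100_000.0])
signal = X_demo[:, 0] + X_demo[:, 1] / 50.0
y_demo = (signal + rng.normal(scale=0.5, size=n) > 2.0).astype(int)  # minority positive class
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan                       # inject missing values

scaled_logreg = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),   # puts every feature on a comparable scale
    ("clf", LogisticRegression(max_iter=300, class_weight="balanced")),
])
scaled_logreg.fit(X_demo, y_demo)

n_iter = int(scaled_logreg.named_steps["clf"].n_iter_[0])
print("lbfgs iterations used:", n_iter, "of max_iter=300")
```

In the notebook's setup, the equivalent change would be an extra scaler step inside `numeric_tf`, leaving the one-hot branch untouched.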


## 8. Main Model (RandomForest)
RandomForest is used as the main model because it can capture non-linear patterns
and feature interactions commonly found in fraud detection data.

We evaluate using ROC-AUC and PR-AUC and compare with the baseline model.


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

rf = Pipeline(steps=[
    ("prep", preprocess),
    ("clf", RandomForestClassifier(
        n_estimators=300,
        random_state=42,
        n_jobs=-1,
        class_weight="balanced_subsample"
    ))
])

rf.fit(X_train, y_train)
p_valid_rf = rf.predict_proba(X_valid)[:, 1]

roc_rf = roc_auc_score(y_valid, p_valid_rf)
pr_rf  = average_precision_score(y_valid, p_valid_rf)

print("RF | ROC-AUC:", roc_rf, "| PR-AUC:", pr_rf)


RF | ROC-AUC: 0.9407063270833512 | PR-AUC: 0.7364622807426647


## 9. Threshold Selection and Confusion Matrix
In addition to probability metrics, we select a decision threshold based on F1-score
to obtain interpretable classification performance:
- precision/recall/F1 report
- confusion matrix
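
A minimal sketch of F1-based threshold selection on a tiny hypothetical example. It relies on the fact that `precision_recall_curve` returns one more precision/recall value than thresholds, with `precision[i]` and `recall[i]` corresponding to predictions `score >= thresholds[i]`. The helper name `best_f1_threshold` is illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximizing F1 along the precision-recall curve."""
    prec, rec, thr = precision_recall_curve(y_true, scores)
    # The final (precision=1, recall=0) point has no threshold, so drop it
    prec, rec = prec[:-1], rec[:-1]
    f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
    best = int(np.argmax(f1))
    return float(thr[best]), float(f1[best])

# Tiny hypothetical example
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

thr_best, f1_best = best_f1_threshold(y_true, scores)
y_pred = (scores >= thr_best).astype(int)
f1_from_preds = f1_score(y_true, y_pred)
print("threshold:", thr_best)
print("F1 on the curve:", round(f1_best, 4), "| F1 from predictions:", round(f1_from_preds, 4))
```

With this alignment, the F1 read off the curve and the F1 recomputed from thresholded predictions agree, which is a quick sanity check for any threshold-selection code.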


In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import precision_recall_curve, classification_report, confusion_matrix

# 1) compact results table (for the report)
results = pd.DataFrame([
    {"model": "LogisticRegression", "roc_auc": roc, "pr_auc": pr},
    {"model": "RandomForest",       "roc_auc": roc_rf, "pr_auc": pr_rf},
]).sort_values("pr_auc", ascending=False)

print("=== Model Comparison ===")
display(results)

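# 2) pick the best F1-based threshold on the validation set
prec, rec, thr = precision_recall_curve(y_valid, p_valid_rf)
f1 = 2 * (prec * rec) / (prec + rec + 1e-12)

best_idx = int(np.argmax(f1))
# NOTE: precision_recall_curve pairs prec[i]/rec[i] with thr[i], so thr[best_idx-1]
# is the neighbouring (slightly lower) threshold; falls back to 0.5 when best_idx == 0
best_thr = float(thr[best_idx-1]) if best_idx > 0 else 0.5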

print("\n=== Thresholding (F1-based) ===")
print("Best F1:", float(f1[best_idx]))
print("Best threshold:", best_thr)

y_pred_rf = (p_valid_rf >= best_thr).astype(int)

print("\n=== Classification Report (RF) ===")
print(classification_report(y_valid, y_pred_rf, digits=4))

print("\n=== Confusion Matrix (RF) ===")
print(confusion_matrix(y_valid, y_pred_rf))


=== Model Comparison ===


|  | model | roc_auc | pr_auc |
|---|---|---|---|
| 1 | RandomForest | 0.940706 | 0.736462 |
| 0 | LogisticRegression | 0.739825 | 0.133865 |



=== Thresholding (F1-based) ===
Best F1: 0.6930315361134382
Best threshold: 0.19666666666666666

=== Classification Report (RF) ===
              precision    recall  f1-score   support

           0     0.9878    0.9909    0.9893    113975
           1     0.7247    0.6637    0.6929      4133

    accuracy                         0.9794    118108
   macro avg     0.8563    0.8273    0.8411    118108
weighted avg     0.9786    0.9794    0.9790    118108


=== Confusion Matrix (RF) ===
[[112933   1042]
 [  1390   2743]]
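
The fraud-class figures in the classification report can be re-derived from the confusion matrix above. The short worked example below does that arithmetic; only the four counts are taken from the output, and the variable names are illustrative.

```python
# Confusion matrix from the output above: rows = actual class, columns = predicted class
tn, fp = 112_933, 1_042
fn, tp = 1_390, 2_743

precision = tp / (tp + fp)            # share of flagged transactions that are fraud
recall = tp / (tp + fn)               # share of fraud transactions that are caught
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)  # share of legitimate transactions flagged

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} fpr={false_positive_rate:.4f}")
```

In operational terms, at this threshold about one in three fraud cases is missed, while fewer than 1% of legitimate transactions are flagged.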


## 10. Basic Hyperparameter Tuning (Memory-Aware)
Hyperparameter tuning is performed on a subset of the training data to reduce memory usage.
This is necessary because one-hot encoding combined with cross-validation can consume a large amount of RAM.

We tune:
- `max_depth`
- `min_samples_leaf`
- `max_features`

Scoring uses PR-AUC (Average Precision).
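
To show how small the sampled search is relative to the full grid, the sketch below enumerates the grid size and draws five candidates with `ParameterSampler`, without fitting any model. The search space is copied from this notebook's tuning cell; using `ParameterSampler` directly is an illustration, not something the notebook itself does.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the notebook's tuning cell
param_dist = {
    "clf__max_depth": [None, 10, 20, 30],
    "clf__min_samples_leaf": [1, 2, 5, 10],
    "clf__max_features": ["sqrt", "log2", None],
}

n_total = len(ParameterGrid(param_dist))  # size of the full grid
candidates = list(ParameterSampler(param_dist, n_iter=5, random_state=42))

print(f"{len(candidates)} of {n_total} combinations sampled")
for params in candidates:
    print(params)
```

With 5 of 48 combinations and 2 folds, the search is a coarse probe of the space, so its result is better read as a hint than as an optimum.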


In [None]:
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
import numpy as np

# 1) take a subset for tuning (e.g. 120k rows)
X_tune, _, y_tune, _ = train_test_split(
    X_train, y_train,
    train_size=120000,
    random_state=42,
    stratify=y_train
)

print("X_tune:", X_tune.shape, "Fraud ratio:", y_tune.mean())

# 2) RF pipeline dedicated to tuning (fewer trees so it runs faster)
rf_tune = Pipeline(steps=[
    ("prep", preprocess),
    ("clf", RandomForestClassifier(
        n_estimators=150,
        random_state=42,
        n_jobs=-1,  # this n_jobs belongs to the RF itself; safe because the search uses n_jobs=1
        class_weight="balanced_subsample"
    ))
])

param_dist = {
    "clf__max_depth": [None, 10, 20, 30],
    "clf__min_samples_leaf": [1, 2, 5, 10],
    "clf__max_features": ["sqrt", "log2", None],
}

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

search = RandomizedSearchCV(
    rf_tune,
    param_distributions=param_dist,
    n_iter=5,
    scoring="average_precision",
    cv=cv,
    random_state=42,
    n_jobs=1,          # IMPORTANT: do not use -1
    verbose=2,
    pre_dispatch=1     # helps save RAM
)

search.fit(X_tune, y_tune)

print("Best params:", search.best_params_)
print("Best CV PR-AUC:", search.best_score_)

best_rf = search.best_estimator_


X_tune: (120000, 392) Fraud ratio: 0.034991666666666664
Fitting 2 folds for each of 5 candidates, totalling 10 fits
[CV] END clf__max_depth=20, clf__max_features=sqrt, clf__min_samples_leaf=10; total time=  10.7s
[CV] END clf__max_depth=20, clf__max_features=sqrt, clf__min_samples_leaf=10; total time=  10.4s
[CV] END clf__max_depth=30, clf__max_features=log2, clf__min_samples_leaf=1; total time=   8.7s
[CV] END clf__max_depth=30, clf__max_features=log2, clf__min_samples_leaf=1; total time=   8.9s
[CV] END clf__max_depth=20, clf__max_features=sqrt, clf__min_samples_leaf=5; total time=  10.8s
[CV] END clf__max_depth=20, clf__max_features=sqrt, clf__min_samples_leaf=5; total time=  10.6s
[CV] END clf__max_depth=30, clf__max_features=log2, clf__min_samples_leaf=10; total time=   8.3s
[CV] END clf__max_depth=30, clf__max_features=log2, clf__min_samples_leaf=10; total time=   9.4s
[CV] END clf__max_depth=20, clf__max_features=sqrt, clf__min_samples_leaf=1; total time=  11.1s
[CV] END clf__ma

## 11. Evaluate Tuned Model on Full Validation Set
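After tuning on a subset, we evaluate the tuned model on the full validation set
to decide whether it truly improves performance.
Note that `search.best_estimator_` is refit only on the 120,000-row tuning subset (with 150 trees),
so its score is not directly comparable with models trained on the full training split.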


In [None]:
from sklearn.metrics import roc_auc_score, average_precision_score

# evaluate the tuned best_rf on the full validation set
p_valid_best = best_rf.predict_proba(X_valid)[:, 1]

roc_best = roc_auc_score(y_valid, p_valid_best)
pr_best  = average_precision_score(y_valid, p_valid_best)

print("Tuned RF (150 trees) | ROC-AUC:", roc_best, "| PR-AUC:", pr_best)


Tuned RF (150 trees) | ROC-AUC: 0.902583904223821 | PR-AUC: 0.5421079847684398


## 12. Stronger Tuned Model (More Trees) and Comparison
We re-train the tuned configuration with more trees to check if performance improves.
The final decision is based on validation PR-AUC.


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

final_rf = Pipeline(steps=[
    ("prep", preprocess),
    ("clf", RandomForestClassifier(
        n_estimators=300,
        random_state=42,
        n_jobs=-1,
        class_weight="balanced_subsample",
        max_depth=20,
        max_features="sqrt",
        min_samples_leaf=10
    ))
])

final_rf.fit(X_train, y_train)
p_valid_final = final_rf.predict_proba(X_valid)[:, 1]

roc_final = roc_auc_score(y_valid, p_valid_final)
pr_final  = average_precision_score(y_valid, p_valid_final)

print("Final Tuned RF (300 trees) | ROC-AUC:", roc_final, "| PR-AUC:", pr_final)


Final Tuned RF (300 trees) | ROC-AUC: 0.9277353760654127 | PR-AUC: 0.6260872675114829


## 13. Train Final Model and Generate Submission
We train the selected best model on the full training data and generate predictions for the test set.
The output is saved as a CSV submission file with required columns.


In [None]:
import pandas as pd
import os

# Make sure rf is the baseline RF pipeline with PR-AUC 0.736 (from Section 8)
rf.fit(X, y)
test_proba = rf.predict_proba(test)[:, 1]

submission = pd.DataFrame({
    "TransactionID": test_ids,
    "isFraud": test_proba
})

out_path = os.path.join(DATA_DIR, "submission_fraud_rf_baseline.csv")
submission.to_csv(out_path, index=False)

print("Saved:", out_path)
submission.head()


Saved: /content/drive/MyDrive/UAS ML DL/Fraud Transcation (ML)/submission_fraud_rf_baseline.csv


|  | TransactionID | isFraud |
|---|---|---|
| 0 | 3663549 | 0.013333 |
| 1 | 3663550 | 0.01 |
| 2 | 3663551 | 0.02 |
| 3 | 3663552 | 0.016667 |
| 4 | 3663553 | 0.003333 |


## 14. Validate Submission File
We validate the submission format:
- correct shape: (number_of_test_rows, 2)
- correct columns: `TransactionID`, `isFraud`
- no missing values
- probabilities are in [0, 1]
- `TransactionID` is unique
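
The checks listed above can also be enforced rather than only printed. The sketch below wraps them in a hypothetical helper, `validate_submission`, and runs it on a toy frame; the helper name and toy values are assumptions, not part of the notebook.

```python
import pandas as pd

def validate_submission(sub: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the submission violates any format rule."""
    assert sub.shape == (expected_rows, 2), f"unexpected shape {sub.shape}"
    assert sub.columns.tolist() == ["TransactionID", "isFraud"], "unexpected columns"
    assert not sub.isna().any().any(), "missing values present"
    assert sub["isFraud"].between(0.0, 1.0).all(), "probabilities outside [0, 1]"
    assert sub["TransactionID"].is_unique, "duplicate TransactionID"

# Toy submission (hypothetical values) that satisfies every rule
good = pd.DataFrame({"TransactionID": [1, 2, 3], "isFraud": [0.01, 0.5, 0.99]})
validate_submission(good, expected_rows=3)
print("toy submission passed all checks")
```

For the real file, the call would take the loaded submission and the number of test rows (506691 in this run), so a malformed file fails loudly before upload.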


In [None]:
import pandas as pd

out_path = "/content/drive/MyDrive/UAS ML DL/Fraud Transcation (ML)/submission_fraud_rf_baseline.csv"
sub = pd.read_csv(out_path)

print("Shape:", sub.shape)
print("Columns:", sub.columns.tolist())
print("Nulls:", sub.isna().sum().to_dict())
print("isFraud min/max:", sub["isFraud"].min(), sub["isFraud"].max())
print("Unique TransactionID:", sub["TransactionID"].nunique())

sub.head()


Shape: (506691, 2)
Columns: ['TransactionID', 'isFraud']
Nulls: {'TransactionID': 0, 'isFraud': 0}
isFraud min/max: 0.0 0.9966666666666668
Unique TransactionID: 506691


|  | TransactionID | isFraud |
|---|---|---|
| 0 | 3663549 | 0.013333 |
| 1 | 3663550 | 0.01 |
| 2 | 3663551 | 0.02 |
| 3 | 3663552 | 0.016667 |
| 4 | 3663553 | 0.003333 |


## Conclusion
- The dataset is imbalanced (~3.5% fraud), so PR-AUC is used as a key metric.
- RandomForest achieved the best validation PR-AUC compared to Logistic Regression.
- Tuning was tested, but the baseline RandomForest remained superior on validation.
- Final predictions were exported to `submission_fraud_rf_baseline.csv`.


## Deep Learning (MLP) for Fraud Detection (Tabular)
We build a Deep Learning model using:
- Numeric features (impute + standardize)
- Categorical features (StringLookup + Embedding)

We evaluate with ROC-AUC and PR-AUC (Average Precision).


In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf

print("TF version:", tf.__version__)
print("GPU devices:", tf.config.list_physical_devices("GPU"))

SEED = 42
tf.random.set_seed(SEED)
np.random.seed(SEED)


TF version: 2.19.0
GPU devices: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Prepare Numeric Features (Impute + Standardize)
We compute median imputation and standard scaling for numeric columns.
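
The key detail in this step is that medians, means, and standard deviations are learned from the training split only and then reused for validation and test data. The NumPy sketch below illustrates that fit/apply separation on toy values; the helper names are hypothetical, and it is a simplified stand-in for `SimpleImputer` plus `StandardScaler`, not their implementation.

```python
import numpy as np

def fit_impute_scale(X_train: np.ndarray):
    """Learn per-column medians, means and stds from the training split only."""
    medians = np.nanmedian(X_train, axis=0)
    filled = np.where(np.isnan(X_train), medians, X_train)
    means = filled.mean(axis=0)
    stds = filled.std(axis=0)
    stds[stds == 0] = 1.0          # avoid division by zero for constant columns
    return medians, means, stds

def apply_impute_scale(X: np.ndarray, medians, means, stds) -> np.ndarray:
    """Apply training-split statistics to any split (train, validation, test)."""
    filled = np.where(np.isnan(X), medians, X)
    return ((filled - means) / stds).astype("float32")

# Toy data (hypothetical values)
X_tr = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])
X_va = np.array([[np.nan, 30.0], [2.0, np.nan]])

stats = fit_impute_scale(X_tr)
X_tr_s = apply_impute_scale(X_tr, *stats)
X_va_s = apply_impute_scale(X_va, *stats)
print(X_tr_s.round(3))
print(X_va_s.round(3))
```

Reusing the training statistics keeps information from the validation and test rows out of the preprocessing, which would otherwise make validation scores optimistic.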


In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# determine the numeric and categorical columns from X_train
num_cols = X_train.select_dtypes(include=["int64","float64","int32","float32"]).columns.tolist()
cat_cols = [c for c in X_train.columns if c not in num_cols]

print("Numeric cols:", len(num_cols))
print("Categorical cols:", len(cat_cols))
print("First 10 cat cols:", cat_cols[:10])

num_imputer = SimpleImputer(strategy="median")
num_scaler = StandardScaler()

X_train_num = num_imputer.fit_transform(X_train[num_cols])
X_valid_num = num_imputer.transform(X_valid[num_cols])
X_test_num  = num_imputer.transform(test[num_cols])

X_train_num = num_scaler.fit_transform(X_train_num).astype("float32")
X_valid_num = num_scaler.transform(X_valid_num).astype("float32")
X_test_num  = num_scaler.transform(X_test_num).astype("float32")

print("X_train_num:", X_train_num.shape, X_train_num.dtype)
print("X_valid_num:", X_valid_num.shape, X_valid_num.dtype)
print("X_test_num :", X_test_num.shape,  X_test_num.dtype)


Numeric cols: 378
Categorical cols: 14
First 10 cat cols: ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'R_emaildomain', 'M1', 'M2', 'M3', 'M4', 'M5']
X_train_num: (472432, 378) float32
X_valid_num: (118108, 378) float32
X_test_num : (506691, 378) float32


## Prepare Categorical Features (Fill Missing + Build Vocab)
We convert categorical columns to string, fill missing values, and build vocabularies from train split.
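
Building the vocabulary from the training split alone means test-time categories may be unseen. The pure-Python sketch below mirrors the index layout that Keras `StringLookup` uses with `mask_token=None` and `num_oov_indices=1` (index 0 for out-of-vocabulary values, 1..N for known tokens). It is an illustration with hypothetical helper names and values, not the Keras layer itself.

```python
def build_lookup(train_values):
    """Map each training category to an index, reserving 0 for unseen values."""
    vocab = sorted(set(train_values))
    return {token: i + 1 for i, token in enumerate(vocab)}  # indices 1..len(vocab)

def encode(values, lookup):
    """Encode categories; anything outside the training vocabulary maps to 0 (OOV)."""
    return [lookup.get(v, 0) for v in values]

# Hypothetical card-network values; "missing" stands in for NaN after filling
train_card = ["visa", "mastercard", "missing", "visa", "discover"]
test_card = ["visa", "american express", "missing"]

lookup = build_lookup(train_card)
encoded_test = encode(test_card, lookup)
print(lookup)
print(encoded_test)
```

Treating `"missing"` as an ordinary token lets the model learn a dedicated embedding for absent values, while genuinely new categories share the single OOV slot.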


In [None]:
# build string versions and fill missing values
def prep_cat(df, cols):
    out = df[cols].copy()
    for c in cols:
        out[c] = out[c].astype("object").fillna("missing").astype(str)
    return out

X_train_cat = prep_cat(X_train, cat_cols)
X_valid_cat = prep_cat(X_valid, cat_cols)
X_test_cat  = prep_cat(test,   cat_cols)

# per-column vocabulary (from the training split only)
cat_vocab = {}
for c in cat_cols:
    cat_vocab[c] = sorted(X_train_cat[c].unique().tolist())

# summarize the vocabulary sizes
vocab_sizes = {c: len(v) for c, v in cat_vocab.items()}
print("Vocab sizes (first 10):", list(vocab_sizes.items())[:10])
print("Max vocab size:", max(vocab_sizes.values()) if vocab_sizes else 0)


Vocab sizes (first 10): [('ProductCD', 5), ('card4', 5), ('card6', 5), ('P_emaildomain', 60), ('R_emaildomain', 61), ('M1', 3), ('M2', 3), ('M3', 3), ('M4', 4), ('M5', 3)]
Max vocab size: 61


## Build MLP Model (Numeric + Embeddings for Categorical)
We use embeddings for categorical features and dense layers for classification.
Metrics: ROC-AUC and PR-AUC.
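
A worked example of the embedding-size heuristic used in this notebook's model cell (`emb_dim`), reproduced here with `math.sqrt` in place of `np.sqrt` so the block has no dependencies. The vocabulary sizes are the training sizes printed earlier plus one out-of-vocabulary slot, which is what `lookup.vocabulary_size()` reports with `num_oov_indices=1`.

```python
import math

def emb_dim(vocab_size: int) -> int:
    """Embedding width: about sqrt(vocab_size) + 1, clamped to the range [4, 50]."""
    return int(min(50, max(4, round(math.sqrt(vocab_size) + 1))))

# Training vocabulary sizes reported by the notebook for four of the columns
train_vocab_sizes = [("ProductCD", 5), ("M4", 4), ("P_emaildomain", 60), ("R_emaildomain", 61)]

for name, train_vocab in train_vocab_sizes:
    size_with_oov = train_vocab + 1   # one extra slot for unseen categories
    print(f"{name}: vocab={size_with_oov}, embedding dim={emb_dim(size_with_oov)}")
```

For the small vocabularies in this dataset the lower clamp of 4 applies to most columns; only the two e-mail domain columns get wider embeddings.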


In [None]:
def emb_dim(vocab_size: int) -> int:
    # simple, safe heuristic for the embedding dimension
    return int(min(50, max(4, round(np.sqrt(vocab_size) + 1))))

# inputs
inputs = {}
inputs["num"] = tf.keras.Input(shape=(len(num_cols),), dtype=tf.float32, name="num")

embeddings = []

# categorical -> StringLookup -> Embedding
for c in cat_cols:
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name=c)
    inputs[c] = inp

    lookup = tf.keras.layers.StringLookup(
        vocabulary=cat_vocab[c],
        mask_token=None,
        num_oov_indices=1,
        name=f"{c}_lookup"
    )
    idx = lookup(inp)

    vs = lookup.vocabulary_size()
    ed = emb_dim(vs)
    emb = tf.keras.layers.Embedding(input_dim=vs, output_dim=ed, name=f"{c}_emb")(idx)
    emb = tf.keras.layers.Reshape((ed,))(emb)
    embeddings.append(emb)

# concatenate all features
x = inputs["num"]
if embeddings:
    x = tf.keras.layers.Concatenate()([x] + embeddings)

x = tf.keras.layers.Dense(256, activation="relu")(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.3)(x)

x = tf.keras.layers.Dense(128, activation="relu")(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.3)(x)

x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dropout(0.2)(x)

out = tf.keras.layers.Dense(1, activation="sigmoid")(x)

dl_model = tf.keras.Model(inputs=inputs, outputs=out)

dl_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=[
        tf.keras.metrics.AUC(name="roc_auc"),
        tf.keras.metrics.AUC(curve="PR", name="pr_auc"),
    ]
)

dl_model.summary()


## Train Deep Learning Model (with class_weight)
We train using tf.data pipelines and handle class imbalance via class_weight.
We monitor val_pr_auc for EarlyStopping.
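
The `{0: 1.0, 1: neg / pos}` weighting used in the training cell can be checked by hand from the training-split size and fraud ratio reported in Section 6. The sketch below redoes that arithmetic; recovering the positive count by rounding `n_train * fraud_ratio_train` is an assumption of this example.

```python
# Figures reported earlier in the notebook for the training split
n_train = 472_432
fraud_ratio_train = 0.03498916246147594

pos = round(n_train * fraud_ratio_train)   # number of fraud rows
neg = n_train - pos                        # number of non-fraud rows
w_pos = neg / pos                          # weight applied to each fraud example

print("pos:", pos, "| neg:", neg, "| weight for class 1:", w_pos)
# With these weights both classes carry the same total weight in the loss
print("total weight, class 0:", neg * 1.0, "| class 1:", pos * w_pos)
```

Each fraud example therefore counts about 27.6 times as much as a legitimate one. This also inflates the training loss relative to the unweighted validation loss, so the two should not be compared directly.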


In [None]:
from sklearn.metrics import roc_auc_score, average_precision_score
import os

# class_weight to handle the class imbalance
pos = float(y_train.sum())
neg = float(len(y_train) - y_train.sum())
class_weight = {0: 1.0, 1: (neg / max(pos, 1.0))}
print("class_weight:", class_weight)

def make_input_dict(X_num, X_cat_df):
    d = {"num": X_num}
    for c in cat_cols:
        d[c] = X_cat_df[c].values.reshape(-1, 1)
    return d

train_in = make_input_dict(X_train_num, X_train_cat)
valid_in = make_input_dict(X_valid_num, X_valid_cat)

BATCH_SIZE = 2048

train_ds_dl = tf.data.Dataset.from_tensor_slices((train_in, y_train.values.astype("float32")))
train_ds_dl = train_ds_dl.shuffle(50000, seed=SEED).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

valid_ds_dl = tf.data.Dataset.from_tensor_slices((valid_in, y_valid.values.astype("float32")))
valid_ds_dl = valid_ds_dl.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

CKPT_PATH = os.path.join(DATA_DIR, "fraud_mlp_dl_best.keras")

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_pr_auc", mode="max", patience=2, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint(CKPT_PATH, monitor="val_pr_auc", mode="max", save_best_only=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_pr_auc", mode="max", factor=0.5, patience=1, min_lr=1e-5),
]

history_dl = dl_model.fit(
    train_ds_dl,
    validation_data=valid_ds_dl,
    epochs=10,
    class_weight=class_weight,
    callbacks=callbacks,
    verbose=1
)

# evaluate ROC-AUC & PR-AUC (sklearn) on the validation set
p_valid_dl = dl_model.predict(valid_ds_dl, verbose=0).ravel()
roc_dl = roc_auc_score(y_valid, p_valid_dl)
pr_dl  = average_precision_score(y_valid, p_valid_dl)

print("DL MLP | ROC-AUC:", roc_dl, "| PR-AUC:", pr_dl)
print("Saved best model to:", CKPT_PATH)


class_weight: {0: 1.0, 1: 27.580278281911674}
Epoch 1/10
231/231 ━━━━━━━━━━━━━━━━━━━━ 20s 48ms/step - loss: 1.1123 - pr_auc: 0.2460 - roc_auc: 0.7899 - val_loss: 0.3947 - val_pr_auc: 0.4520 - val_roc_auc: 0.8721 - learning_rate: 0.0010
Epoch 2/10
231/231 ━━━━━━━━━━━━━━━━━━━━ 10s 43ms/step - loss: 0.8855 - pr_auc: 0.3973 - roc_auc: 0.8631 - val_loss: 0.3719 - val_pr_auc: 0.4643 - val_roc_auc: 0.8805 - learning_rate: 0.0010
Epoch 3/10
231/231 ━━━━━━━━━━━━━━━━━━━━ 10s 42ms/step - loss: 0.8546 - pr_auc: 0.4300 - roc_auc: 0.8735 - val_loss: 0.3434 - val_pr_auc: 0.4909 - val_roc_auc: 0.8901 - learning_rate: 0.0010
Epoch 4/10
231/231 ━━━━━━━━━━━━━━━━━━━━ 11s 44ms/step - loss: 0.8188 - pr_auc: 0.4582 - roc_auc: 0.8861 - val_loss: 0.3383 - val_pr_auc: 0.5140 - val_roc_auc: 0.8965 - learning_rate: 0.0010
Epoch 5/10
231/231 ━━━━━━━━━━━━━━━━━━━━

## Generate Submission (Deep Learning)
We use the trained/best DL model to predict probabilities for the test set
and save the submission file (TransactionID, isFraud).


In [22]:
import pandas as pd

# load the best checkpoint (optional, as a safeguard)
best_dl = tf.keras.models.load_model(CKPT_PATH)

test_in = make_input_dict(X_test_num, X_test_cat)
test_ds_dl = tf.data.Dataset.from_tensor_slices(test_in).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

test_proba_dl = best_dl.predict(test_ds_dl, verbose=0).ravel()

submission_dl = pd.DataFrame({
    "TransactionID": test_ids,
    "isFraud": test_proba_dl
})

out_path_dl = os.path.join(DATA_DIR, "submission_fraud_dl_mlp.csv")
submission_dl.to_csv(out_path_dl, index=False)

print("Saved:", out_path_dl)
submission_dl.head()


Saved: /content/drive/MyDrive/UAS ML DL/Fraud Transcation (ML)/submission_fraud_dl_mlp.csv


|  | TransactionID | isFraud |
|---|---|---|
| 0 | 3663549 | 0.078132 |
| 1 | 3663550 | 0.011784 |
| 2 | 3663551 | 0.030822 |
| 3 | 3663552 | 0.01489 |
| 4 | 3663553 | 0.058203 |


## DL Evaluation (Validation) + Compare with ML
We evaluate the DL model on the validation set using ROC-AUC and PR-AUC,
then compare against previous ML baselines (LogReg and RandomForest).


In [23]:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score

p_valid_dl = best_dl.predict(valid_ds_dl, verbose=0).ravel()
roc_dl = roc_auc_score(y_valid, p_valid_dl)
pr_dl  = average_precision_score(y_valid, p_valid_dl)

print("DL MLP | ROC-AUC:", roc_dl, "| PR-AUC:", pr_dl)

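# Enter the ML metrics from your earlier results (in case the variables no longer exist)
# Note: the LogReg values below differ slightly from the Section 7 output above (ROC-AUC 0.7398, PR-AUC 0.1339)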
logreg_roc, logreg_pr = 0.7453853429193295, 0.13730816967359938
rf_roc, rf_pr = 0.9407063270833512, 0.7364622807426647

compare = pd.DataFrame([
    {"model": "LogisticRegression (ML)", "roc_auc": logreg_roc, "pr_auc": logreg_pr},
    {"model": "RandomForest (ML)",       "roc_auc": rf_roc,     "pr_auc": rf_pr},
    {"model": "MLP (DL)",                "roc_auc": roc_dl,     "pr_auc": pr_dl},
]).sort_values("pr_auc", ascending=False)

compare


DL MLP | ROC-AUC: 0.9166575299775553 | PR-AUC: 0.5794399575866894


|  | model | roc_auc | pr_auc |
|---|---|---|---|
| 1 | RandomForest (ML) | 0.940706 | 0.736462 |
| 2 | MLP (DL) | 0.916658 | 0.57944 |
| 0 | LogisticRegression (ML) | 0.745385 | 0.137308 |


## Conclusion (Fraud: ML vs DL)
Because the dataset is highly imbalanced (~3.5% fraud), **PR-AUC** is the main metric.

**Results (Validation):**
- Logistic Regression (ML): ROC-AUC ≈ 0.745, PR-AUC ≈ 0.137
- Random Forest (ML): ROC-AUC ≈ 0.941, PR-AUC ≈ 0.736 (**best**)
- MLP (DL): ROC-AUC ≈ 0.917, PR-AUC ≈ 0.579

**Interpretation:**
The deep learning MLP improves over Logistic Regression, but **Random Forest remains the strongest model** on PR-AUC for this tabular fraud dataset. This suggests that tree-based ensembles capture feature interactions effectively under the current preprocessing setup.


In [24]:
import os

out_cmp = os.path.join(DATA_DIR, "fraud_model_comparison.csv")
compare.to_csv(out_cmp, index=False)
print("Saved:", out_cmp)
compare


Saved: /content/drive/MyDrive/UAS ML DL/Fraud Transcation (ML)/fraud_model_comparison.csv


|  | model | roc_auc | pr_auc |
|---|---|---|---|
| 1 | RandomForest (ML) | 0.940706 | 0.736462 |
| 2 | MLP (DL) | 0.916658 | 0.57944 |
| 0 | LogisticRegression (ML) | 0.745385 | 0.137308 |
