# Fraud Detection

**_Fraud detection_** for online transactions using *machine learning* is a modern approach to countering *fraud* in digital environments. As online transaction volumes grow rapidly, companies need systems that can identify and block suspicious transactions.

Here, **_fraud detection_** is a data analysis tool that learns *fraud* patterns in online transactions from the identity profile attached to each transaction and from the available transaction data.

The dataset contains transaction identity details and transaction data. The target variable in this case study is binary: not fraud (0) and fraud (1). The feature variables are processed to predict whether a transaction is flagged as fraud or not.

The **_Precision_** metric is the ratio of True Positives to all predicted positives, TP / (TP + FP). A method tuned for precision focuses on reducing False Positives (FP), which typically means raising the decision threshold. Raising the threshold can also reduce the number of True Positives (TP) and increase the number of False Negatives (FN). Reducing False Positives (legitimate transactions wrongly flagged as fraud) matters because acting against a legitimate transaction causes unnecessary cost and loss.

However, **_Recall_**, TP / (TP + FN), is often the more important metric, because the main goal of fraud detection is to catch as many actual fraud cases as possible. False Negatives (fraud cases that are missed) must be minimized so that real fraud does not go undetected.
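The trade-off between the two metrics can be illustrated with a minimal sketch. The labels and scores are made-up toy values, not results from this dataset, and `precision_recall_at` is a hypothetical helper written only for this illustration.

```python
def precision_recall_at(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold are flagged as fraud."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (assumption): 1 = fraud, 0 = not fraud
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.45, 0.5, 0.6, 0.7, 0.9]

# A low threshold catches every fraud case but raises false alarms
low = precision_recall_at(y_true, scores, 0.4)

# A higher threshold removes the false alarms but misses one fraud case
high = precision_recall_at(y_true, scores, 0.65)

print("threshold 0.40:", low)
print("threshold 0.65:", high)
```

On this toy data, raising the threshold lifts precision while recall drops, which is the tension described above.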

The author sets the **_F1-Score_** and the **_area under the precision-recall curve (AUPRC)_** as the business metrics that guide this machine learning project, because together they give a more holistic picture of model performance in fraud detection. Both metrics balance precision against recall.
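As a rough illustration of both metrics, the sketch below computes F1 from confusion counts and a step-wise estimate of AUPRC (average precision) in plain Python. The data is a made-up toy example and both helpers are hypothetical; in practice scikit-learn's `f1_score` and `average_precision_score` would normally be used instead.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    # Rank transactions from the highest score to the lowest
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    tp = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            # Precision at this rank, weighted by the recall gained (1 / n_pos)
            ap += (tp / rank) / n_pos
    return ap


# Toy data (assumption): 1 = fraud, 0 = not fraud
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.45, 0.5, 0.6, 0.7, 0.9]

f1 = f1_from_counts(tp=3, fp=2, fn=0)   # counts at a threshold of 0.4
auprc = average_precision(y_true, scores)
print(f"F1 = {f1:.3f}, AUPRC = {auprc:.3f}")
```

F1 describes a single threshold, while AUPRC summarizes the ranking quality across all thresholds, which is why the two are reported together.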

# Data Preparation

## Importing Libraries

In [1]:
# Import libraries for data preparation and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Import pickle, json, and joblib for saving column lists and model files
import pickle
import json
import joblib
import copy

# Import train_test_split for splitting the data
from sklearn.model_selection import train_test_split
import yaml
import src.util as util
from tqdm import tqdm
import os

In [2]:
config_data = util.load_config()
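`util.load_config` is not shown in this notebook. Judging only from the keys read later (`raw_dataset_path_1`, `raw_dataset_path_2`, `train_set_path`, `valid_set_path`, `test_set_path`), the returned dictionary might look like the sketch below. Every file path is a placeholder rather than the project's real path, and `check_config` is a hypothetical helper.

```python
# Hypothetical shape of the configuration; all paths are placeholders
example_config = {
    "raw_dataset_path_1": "data/raw/train_transaction.csv",
    "raw_dataset_path_2": "data/raw/train_identity.csv",
    "train_set_path": ["data/processed/X_train.pkl", "data/processed/y_train.pkl"],
    "valid_set_path": ["data/processed/X_valid.pkl", "data/processed/y_valid.pkl"],
    "test_set_path": ["data/processed/X_test.pkl", "data/processed/y_test.pkl"],
}

REQUIRED_KEYS = [
    "raw_dataset_path_1",
    "raw_dataset_path_2",
    "train_set_path",
    "valid_set_path",
    "test_set_path",
]


def check_config(config):
    """Raise early if a key used by this notebook is missing or malformed."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing config keys: {missing}")
    # Each split stores two paths: index 0 for features, index 1 for the target
    for key in ("train_set_path", "valid_set_path", "test_set_path"):
        if len(config[key]) != 2:
            raise ValueError(f"{key} must contain exactly two paths")
    return True


config_ok = check_config(example_config)
```

Validating the configuration up front makes a missing path fail at load time instead of after the data has been processed.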

## Data Gathering

The fraud detection data used here comes from Kaggle:
    
 * https://www.kaggle.com/competitions/ieee-fraud-detection/data

In [3]:
# Function to read the raw data
def read_raw_data(config: dict) -> pd.DataFrame:
    # Paths to the raw transaction and identity CSV files
    raw_dataset_dir_1 = config["raw_dataset_path_1"]
    raw_dataset_dir_2 = config["raw_dataset_path_2"]

    # Read the transaction table
    train_transaction = pd.read_csv(raw_dataset_dir_1)

    # Read train_identity.csv
    train_identity = pd.read_csv(raw_dataset_dir_2)

    # Left-merge identity onto transactions on 'TransactionID',
    # so transactions without identity data are kept
    train_set = pd.merge(train_transaction, train_identity, how='left', on='TransactionID')

    # Return the merged raw dataset
    return train_set

In [4]:
# Read the data using the read_raw_data function
train_set = read_raw_data(config_data)
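The effect of the `how='left'` merge can be seen on a tiny example. The two frames below are toy stand-ins invented for illustration, not rows from the real files.

```python
import pandas as pd

# Toy stand-ins (assumption): three transactions, only one has identity data
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [68.5, 29.0, 59.0],
})
identity = pd.DataFrame({
    "TransactionID": [3],
    "DeviceType": ["mobile"],
})

# A left merge keeps every transaction; missing identity fields become NaN
merged = pd.merge(transactions, identity, how="left", on="TransactionID")
missing_identity = int(merged["DeviceType"].isnull().sum())

print(merged)
print("Transactions without identity data:", missing_identity)
```

This is why many identity columns in the merged dataset contain a large share of null values: a transaction with no identity row still appears, with its identity fields left empty.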

In [5]:
#Sanity Check
train_set

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987000,0,86400,68.50,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.00,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.00,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.00,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.00,H,4497,514.0,150.0,mastercard,102.0,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
590535,3577535,0,15811047,49.00,W,6550,,150.0,visa,226.0,...,,,,,,,,,,
590536,3577536,0,15811049,39.50,W,10444,225.0,150.0,mastercard,224.0,...,,,,,,,,,,
590537,3577537,0,15811079,30.95,W,12037,595.0,150.0,mastercard,224.0,...,,,,,,,,,,
590538,3577538,0,15811088,117.00,W,7826,481.0,150.0,mastercard,224.0,...,,,,,,,,,,


## Data Definition

The data describes transaction identity and transaction details, with each transaction labeled as fraud or not fraud. The features used include the following.

The data contains 590540 rows × 434 columns, including:

* TransactionID: Unique ID of each transaction.
* isFraud: Target variable indicating whether the transaction is fraud (1) or not fraud (0).
* TransactionDT: Transaction time as an offset in seconds from a reference starting time.
* TransactionAmt: Transaction amount in its original currency.
* ProductCD: Product code for the product involved in the transaction.
* card1 - card6: Payment card information such as type, category, and issuer.
* addr1 - addr2: Customer address information.
* P_emaildomain and R_emaildomain: Email domain of the purchaser (P) and the recipient (R) of the transaction.
* C1 - C14: Count features related to the payment card.
* D1 - D15: Time-delta features, such as the time between transactions.
* M1 - M9: Categorical match features indicating whether fields such as the name on the card, the address, and the email address match.
* V1 - V339: Features engineered by the data provider (Vesta), including ranking, counting, and other entity relations, that carry transaction-related information.
* id_01 - id_38: Identity features describing the customer's identity profile; id_12 - id_38 supply additional identity details.
* DeviceType: Type of device the customer used for the transaction.
* DeviceInfo: Details about the device the customer used for the transaction.

## Data Validation

Data validation is the process of evaluating and validating the integrity, quality, and suitability of the data used to build and evaluate the machine learning model. It aims to ensure the data meets the requirements and standards needed to produce an accurate and reliable model.

In [6]:
# Check data types
# Get the data type of every column
column_data_types = train_set.dtypes.to_dict()

# Print the data type of each column
for column, dtype in column_data_types.items():
    print(f"Column: {column}, Data Type: {dtype}")

Column: TransactionID, Data Type: int64
Column: isFraud, Data Type: int64
Column: TransactionDT, Data Type: int64
Column: TransactionAmt, Data Type: float64
Column: ProductCD, Data Type: object
Column: card1, Data Type: int64
Column: card2, Data Type: float64
Column: card3, Data Type: float64
Column: card4, Data Type: object
Column: card5, Data Type: float64
Column: card6, Data Type: object
Column: addr1, Data Type: float64
Column: addr2, Data Type: float64
Column: dist1, Data Type: float64
Column: dist2, Data Type: float64
Column: P_emaildomain, Data Type: object
Column: R_emaildomain, Data Type: object
Column: C1, Data Type: float64
Column: C2, Data Type: float64
Column: C3, Data Type: float64
Column: C4, Data Type: float64
Column: C5, Data Type: float64
Column: C6, Data Type: float64
Column: C7, Data Type: float64
Column: C8, Data Type: float64
Column: C9, Data Type: float64
Column: C10, Data Type: float64
Column: C11, Data Type: float64
Column: C12, Data Type: float64
Column: C13, 
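With several hundred columns, the per-column listing gets long. A more compact view, sketched here on a made-up toy frame, counts how many columns share each data type and lists the non-numeric ones.

```python
import pandas as pd

# Toy frame (assumption) mixing numeric and text columns
toy = pd.DataFrame({
    "TransactionAmt": [68.5, 29.0],
    "card1": [13926, 2755],
    "ProductCD": ["W", "H"],
    "card4": ["discover", "mastercard"],
})

# Number of columns per data type
dtype_counts = toy.dtypes.value_counts()

# Names of the non-numeric (categorical) columns
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print(dtype_counts)
print(non_numeric)
```

The list of non-numeric columns is also a convenient starting point for the categorical checks later in the notebook.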

In [7]:
# Inspect null values in each column
# Loop over the columns and print the percentage of null values in each
for column in train_set.columns:
    null_percentage = train_set[column].isnull().mean() * 100
    print(f"Null percentage in column {column}: {null_percentage:.2f}%")

Null percentage in column TransactionID: 0.00%
Null percentage in column isFraud: 0.00%
Null percentage in column TransactionDT: 0.00%
Null percentage in column TransactionAmt: 0.00%
Null percentage in column ProductCD: 0.00%
Null percentage in column card1: 0.00%
Null percentage in column card2: 1.51%
Null percentage in column card3: 0.27%
Null percentage in column card4: 0.27%
Null percentage in column card5: 0.72%
Null percentage in column card6: 0.27%
Null percentage in column addr1: 11.13%
Null percentage in column addr2: 11.13%
Null percentage in column dist1: 59.65%
Null percentage in column dist2: 93.63%
Null percentage in column P_emaildomain: 15.99%
Null percentage in column R_emaildomain: 76.75%
Null percentage in column C1: 0.00%
Null percentage in column C2: 0.00%
Null percentage in column C3: 0.00%
Null percentage in column C4:

In [8]:
# Drop columns that are not needed
columns_drop = ['TransactionID', 'TransactionDT']
train_set.drop(columns_drop, axis=1, inplace=True)

In [9]:
# Compute the percentage of null values in each column
null_percentages = train_set.isnull().mean() * 100

# Identify columns with more than 50% null values
columns_to_drop = null_percentages[null_percentages > 50].index

# Drop the columns with more than 50% null values
df_train = train_set.drop(columns=columns_to_drop)
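The same threshold rule can be traced by hand on a small frame. The values below are made up for illustration; only the 50% cutoff comes from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): one column is mostly empty
toy = pd.DataFrame({
    "TransactionAmt": [68.5, 29.0, 59.0, 50.0],    # 0% null
    "addr1": [315.0, np.nan, 330.0, 476.0],        # 25% null
    "dist2": [np.nan, np.nan, np.nan, 12.0],       # 75% null
})

# Percentage of null values per column
null_pct = toy.isnull().mean() * 100

# Columns above the 50% cutoff are removed, the rest are kept
to_drop = null_pct[null_pct > 50].index.tolist()
reduced = toy.drop(columns=to_drop)

print(null_pct)
print("Dropped:", to_drop)
```

Columns below the cutoff, such as `addr1` here, still contain nulls and need to be handled in a later imputation step.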

In [10]:
# Sanity check after dropping columns with more than 50% null values
for column in df_train.columns:
    null_percentage = df_train[column].isnull().mean() * 100
    print(f"Null percentage in column {column}: {null_percentage:.2f}%")

Null percentage in column isFraud: 0.00%
Null percentage in column TransactionAmt: 0.00%
Null percentage in column ProductCD: 0.00%
Null percentage in column card1: 0.00%
Null percentage in column card2: 1.51%
Null percentage in column card3: 0.27%
Null percentage in column card4: 0.27%
Null percentage in column card5: 0.72%
Null percentage in column card6: 0.27%
Null percentage in column addr1: 11.13%
Null percentage in column addr2: 11.13%
Null percentage in column P_emaildomain: 15.99%
Null percentage in column C1: 0.00%
Null percentage in column C2: 0.00%
Null percentage in column C3: 0.00%
Null percentage in column C4: 0.00%
Null percentage in column C5: 0.00%
Null percentage in column C6: 0.00%
Null percentage in column C7: 0.00%
Null percentage in column C8: 0.00%
Null percentage in column C9: 0.00%
Null percentage in column C1

## Data Defense

Data defense is the set of concepts and practices aimed at preserving, protecting, and improving the integrity, security, and quality of the data used in a machine learning workflow. It matters because good data quality is the foundation of an accurate and reliable machine learning model.

In [11]:
# Inspect unique values in the categorical columns

# Loop over the columns and print the unique values of each object-typed column
for column in df_train.columns:
    if df_train[column].dtype == 'object':
        unique_values = df_train[column].unique()
        print(f"Unique values for column {column}:")
        print(unique_values)
        print('\n')

Unique values for column ProductCD:
['W' 'H' 'C' 'S' 'R']


Unique values for column card4:
['discover' 'mastercard' 'visa' 'american express' nan]


Unique values for column card6:
['credit' 'debit' nan 'debit or credit' 'charge card']


Unique values for column P_emaildomain:
[nan 'gmail.com' 'outlook.com' 'yahoo.com' 'mail.com' 'anonymous.com'
 'hotmail.com' 'verizon.net' 'aol.com' 'me.com' 'comcast.net'
 'optonline.net' 'cox.net' 'charter.net' 'rocketmail.com' 'prodigy.net.mx'
 'embarqmail.com' 'icloud.com' 'live.com.mx' 'gmail' 'live.com' 'att.net'
 'juno.com' 'ymail.com' 'sbcglobal.net' 'bellsouth.net' 'msn.com' 'q.com'
 'yahoo.com.mx' 'centurylink.net' 'servicios-ta.com' 'earthlink.net'
 'hotmail.es' 'cfl.rr.com' 'roadrunner.com' 'netzero.net' 'gmx.de'
 'suddenlink.net' 'frontiernet.net' 'windstream.net' 'frontier.com'
 'outlook.es' 'mac.com' 'netzero.com' 'aim.com' 'web.de' 'twc.com'
 'cableone.net' 'yahoo.fr' 'yahoo.de' 'yahoo.es' 'sc.rr.com' 'ptd.net'
 'live.fr' 'yahoo.co.uk'
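The `P_emaildomain` values include both `'gmail'` and `'gmail.com'`. If these are taken to be the same provider, which is an assumption the notebook does not confirm, they could be merged before encoding. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy values (assumption) mirroring the two spellings seen in the output
emails = pd.Series(["gmail.com", "gmail", np.nan, "yahoo.com"], name="P_emaildomain")

# Assumption: the bare 'gmail' label refers to the same provider as 'gmail.com'
normalized = emails.replace({"gmail": "gmail.com"})

# Count the categories, keeping missing values visible
counts = normalized.value_counts(dropna=False)
print(counts)
```

Merging near-duplicate labels keeps an encoder from treating them as unrelated categories; missing values are left untouched.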

In [12]:
# Select the numeric columns
numeric_columns = df_train.select_dtypes(include=['float64', 'int64'])

# Print the value range of each column
for column in numeric_columns.columns:
    min_value = numeric_columns[column].min()
    max_value = numeric_columns[column].max()
    print(f"Value range of column {column}: {min_value} - {max_value}")

Value range of column isFraud: 0 - 1
Value range of column TransactionAmt: 0.251 - 31937.391
Value range of column card1: 1000 - 18396
Value range of column card2: 100.0 - 600.0
Value range of column card3: 100.0 - 231.0
Value range of column card5: 100.0 - 237.0
Value range of column addr1: 100.0 - 540.0
Value range of column addr2: 10.0 - 102.0
Value range of column C1: 0.0 - 4685.0
Value range of column C2: 0.0 - 5691.0
Value range of column C3: 0.0 - 26.0
Value range of column C4: 0.0 - 2253.0
Value range of column C5: 0.0 - 349.0
Value range of column C6: 0.0 - 2253.0
Value range of column C7: 0.0 - 2255.0
Value range of column C8: 0.0 - 3331.0
Value range of column C9: 0.0 - 210.0
Value range of column C10: 0.0 - 3257.0
Value range of column C11: 0.0 - 3188.0
Value range of column C12: 0.0 - 3188.0
Value range of column C13: 0.0 - 2918.0
Value range of column C14: 0.0 - 1429.0
Value range of column D1: 0.0 - 640
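The categories and ranges observed above can be turned into explicit checks for incoming data. The sketch below is one possible shape for such a check, not the project's actual implementation: `check_data` is a hypothetical helper, the allowed categories are copied from the unique-value output, and the rule that amounts must be positive is an assumption based on the observed minimum.

```python
import pandas as pd

# Allowed categories copied from the unique-value output above
ALLOWED_CATEGORIES = {
    "ProductCD": {"W", "H", "C", "S", "R"},
    "card4": {"discover", "mastercard", "visa", "american express"},
    "card6": {"credit", "debit", "debit or credit", "charge card"},
}


def check_data(df):
    """Return a list of human-readable problems found in df (empty if none)."""
    problems = []
    for column, allowed in ALLOWED_CATEGORIES.items():
        if column in df.columns:
            # Missing values are ignored here; they are handled separately
            unexpected = set(df[column].dropna().unique()) - allowed
            if unexpected:
                problems.append(f"{column}: unexpected values {sorted(unexpected)}")
    # Assumption: a transaction amount must be strictly positive
    if "TransactionAmt" in df.columns and (df["TransactionAmt"] <= 0).any():
        problems.append("TransactionAmt: non-positive amount")
    return problems


good = pd.DataFrame({"ProductCD": ["W", "H"], "TransactionAmt": [68.5, 29.0]})
bad = pd.DataFrame({"ProductCD": ["W", "X"], "TransactionAmt": [68.5, -5.0]})

good_problems = check_data(good)
bad_problems = check_data(bad)
print(good_problems)
print(bad_problems)
```

Whether a failed check should reject the record or only log a warning is a design choice that depends on how the model is served.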

## Data Splitting 

Data splitting divides the dataset into separate subsets for training, validation, and testing of the machine learning model.

In [13]:
# Separate the features X and the target y
X = df_train.drop(columns = "isFraud")
y = df_train["isFraud"]

In [14]:
# Sanity Check
X

Unnamed: 0,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,...,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321
0,68.50,W,13926,,150.0,discover,142.0,credit,315.0,87.0,...,0.0,0.000000,0.000000,0.000000,0.0,117.0,0.0,0.000000,0.000000,0.000000
1,29.00,W,2755,404.0,150.0,mastercard,102.0,credit,325.0,87.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000
2,59.00,W,4663,490.0,150.0,visa,166.0,debit,330.0,87.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000
3,50.00,W,18132,567.0,150.0,mastercard,117.0,debit,476.0,87.0,...,135.0,0.000000,0.000000,0.000000,50.0,1404.0,790.0,0.000000,0.000000,0.000000
4,50.00,H,4497,514.0,150.0,mastercard,102.0,credit,420.0,87.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
590535,49.00,W,6550,,150.0,visa,226.0,debit,272.0,87.0,...,0.0,47.950001,47.950001,47.950001,0.0,0.0,0.0,0.000000,0.000000,0.000000
590536,39.50,W,10444,225.0,150.0,mastercard,224.0,debit,204.0,87.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000
590537,30.95,W,12037,595.0,150.0,mastercard,224.0,debit,231.0,87.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000
590538,117.00,W,7826,481.0,150.0,mastercard,224.0,debit,387.0,87.0,...,117.0,317.500000,669.500000,317.500000,0.0,2234.0,0.0,0.000000,0.000000,0.000000


In [15]:
# Sanity Check
y

0         0
1         0
2         0
3         0
4         0
         ..
590535    0
590536    0
590537    0
590538    0
590539    0
Name: isFraud, Length: 590540, dtype: int64

In [16]:
#Split Data 70% training 30% testing & validation
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.3,
                                                    random_state = 123)
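The split above does not pass `stratify`, while the later test/validation split does. With a rare positive class, a stratified split keeps the fraud rate nearly identical across subsets. The sketch below shows the effect on synthetic data; the 5% positive rate and the random features are assumptions for illustration, and whether to stratify the first split as well is a design choice this notebook does not discuss.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 1000 rows, 5% labeled as fraud
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 50 + [0] * 950)

# stratify=y_toy asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy
)

rate_train = float(y_tr.mean())
rate_test = float(y_te.mean())
print(f"fraud rate, train: {rate_train:.3f}, test: {rate_test:.3f}")
```

Without stratification the rates can still come out close on a dataset this large, but they are not guaranteed to match.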

In [17]:
#Sanity Check
X_train

Unnamed: 0,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,...,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321
378585,59.000,W,6470,111.0,150.0,visa,226.0,debit,299.0,87.0,...,102.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000
85054,39.000,W,7826,481.0,150.0,mastercard,224.0,debit,184.0,87.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000
442830,48.950,W,3070,537.0,150.0,visa,226.0,debit,315.0,87.0,...,0.0,36.950001,36.950001,36.950001,0.0,0.0,0.0,0.000000,0.000000,0.000000
448182,3224.450,W,9803,583.0,150.0,visa,226.0,credit,264.0,87.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,6448.899902,6448.899902,6448.899902
390346,34.000,W,13139,512.0,150.0,mastercard,224.0,debit,143.0,87.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194278,59.000,W,3277,111.0,150.0,visa,226.0,debit,231.0,87.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000
192476,100.000,H,7585,553.0,150.0,visa,226.0,credit,299.0,87.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000
17730,47.732,C,10086,500.0,185.0,mastercard,224.0,credit,431.0,60.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000
28030,25.000,W,6933,477.0,150.0,mastercard,117.0,debit,204.0,87.0,...,119.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000


In [18]:
#Sanity Check
y_train

378585    0
85054     0
442830    0
448182    0
390346    0
         ..
194278    0
192476    0
17730     1
28030     0
277869    0
Name: isFraud, Length: 413378, dtype: int64

In [19]:
# Sanity Check 
X_test

Unnamed: 0,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,...,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321
462165,250.00,R,15497,490.0,150.0,visa,226.0,debit,299.0,87.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
153958,108.50,W,2706,555.0,150.0,visa,226.0,debit,170.0,87.0,...,0.000000,0.000000,0.000000,0.000000,117.0,117.0,117.0,0.0,0.0,0.0
31115,54.50,W,13481,199.0,150.0,mastercard,224.0,debit,123.0,87.0,...,59.000000,59.000000,59.000000,59.000000,0.0,0.0,0.0,0.0,0.0,0.0
488561,88.95,W,11727,301.0,150.0,mastercard,224.0,debit,485.0,87.0,...,26.950001,88.949997,115.900002,88.949997,0.0,0.0,0.0,0.0,0.0,0.0
559509,125.00,R,9530,554.0,150.0,visa,226.0,credit,191.0,87.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
186074,107.95,W,14482,512.0,150.0,visa,226.0,debit,181.0,87.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
557005,311.95,W,11919,170.0,150.0,mastercard,224.0,debit,299.0,87.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
348947,280.00,W,7288,321.0,150.0,visa,226.0,debit,122.0,87.0,...,1356.000000,0.000000,0.000000,0.000000,280.0,3973.0,280.0,0.0,0.0,0.0
407320,57.95,W,9480,170.0,150.0,visa,226.0,debit,225.0,87.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
#Sanity Check
y_test

462165    0
153958    0
31115     0
488561    0
559509    0
         ..
186074    0
557005    0
348947    0
407320    0
507145    0
Name: isFraud, Length: 177162, dtype: int64

In [21]:
# Split the held-out data into test (60%) and validation (40%) sets
X_test, X_valid, y_test, y_valid = train_test_split(X_test, y_test, 
                                                    test_size=0.4, 
                                                    random_state=42,
                                                    stratify = y_test
                                                   )
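The resulting subset sizes follow from simple arithmetic, sketched below. The sketch relies on scikit-learn rounding a fractional `test_size` up to the next whole row; the totals agree with the lengths printed by the sanity checks in this notebook.

```python
import math

n_total = 590540

# First split: 30% held out, rounded up to a whole number of rows
n_holdout = math.ceil(0.3 * n_total)
n_train = n_total - n_holdout

# Second split: 40% of the held-out rows become the validation set
n_valid = math.ceil(0.4 * n_holdout)
n_test = n_holdout - n_valid

# Share of the full dataset in each subset
shares = {
    "train": round(n_train / n_total, 2),
    "test": round(n_test / n_total, 2),
    "valid": round(n_valid / n_total, 2),
}
print(n_train, n_test, n_valid)
print(shares)
```

Overall, the data ends up split roughly 70% training, 18% testing, and 12% validation.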

In [22]:
#Sanity Check
X_test

Unnamed: 0,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,...,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321
504494,20.000,S,7919,194.0,150.0,mastercard,166.0,debit,203.0,87.0,...,25.0,25.0,25.0,25.0,0.0,0.0,0.0,50.0,50.0,50.0
454181,470.000,W,1804,161.0,150.0,mastercard,117.0,debit,123.0,87.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
497372,66.030,C,2256,545.0,185.0,visa,226.0,credit,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
418123,82.950,W,7505,175.0,150.0,visa,226.0,debit,264.0,87.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
387902,30.950,W,9992,455.0,150.0,mastercard,126.0,debit,143.0,87.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
171690,764.950,W,11690,111.0,150.0,visa,226.0,credit,299.0,87.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
478699,35.950,W,17400,174.0,150.0,visa,226.0,debit,123.0,87.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
351637,59.000,W,8953,321.0,150.0,visa,226.0,debit,441.0,87.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
300019,26.826,C,15885,545.0,185.0,visa,138.0,debit,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
#Sanity Check
y_test

504494    1
454181    0
497372    0
418123    0
387902    0
         ..
171690    0
478699    0
351637    0
300019    0
434151    0
Name: isFraud, Length: 106297, dtype: int64

In [24]:
#Sanity Check
X_valid

Unnamed: 0,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,...,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321
26703,100.000,W,7919,194.0,150.0,mastercard,202.0,debit,143.0,87.0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0
2877,82.950,W,11666,555.0,150.0,visa,226.0,debit,343.0,87.0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0
350334,100.000,R,3821,111.0,150.0,mastercard,219.0,credit,337.0,87.0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0
236126,107.950,W,18385,555.0,150.0,visa,226.0,debit,191.0,87.0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0
404743,209.950,W,18243,543.0,150.0,mastercard,224.0,debit,299.0,87.0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
280531,20.950,W,17188,321.0,150.0,visa,226.0,debit,299.0,87.0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0
516994,261.950,W,17188,321.0,150.0,visa,226.0,debit,299.0,87.0,...,107.949997,107.949997,107.949997,107.949997,523.900024,523.900024,523.900024,0.0,0.0,0.0
347517,49.000,W,17188,321.0,150.0,visa,226.0,debit,310.0,87.0,...,0.000000,98.000000,98.000000,98.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0
34139,77.000,W,10112,360.0,150.0,visa,166.0,debit,272.0,87.0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0


In [25]:
#Sanity Check
y_valid

26703     0
2877      0
350334    0
236126    0
404743    0
         ..
280531    0
516994    0
347517    0
34139     0
320755    0
Name: isFraud, Length: 70865, dtype: int64

## Final Result - Data Preparation

Export the data preparation results as pickle files.

In [26]:
util.pickle_dump(X_train, config_data["train_set_path"][0])
util.pickle_dump(y_train, config_data["train_set_path"][1])

util.pickle_dump(X_valid, config_data["valid_set_path"][0])
util.pickle_dump(y_valid, config_data["valid_set_path"][1])

util.pickle_dump(X_test, config_data["test_set_path"][0])
util.pickle_dump(y_test, config_data["test_set_path"][1])
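`util.pickle_dump` is not shown in this notebook. A minimal stand-in, assuming it simply serializes an object to the given path, could look like the sketch below. Both functions are hypothetical, and the real helper may use `joblib` (imported at the top of the notebook) instead of the standard `pickle` module.

```python
import os
import pickle
import tempfile


def pickle_dump(data, file_path):
    """Serialize `data` to `file_path` (hypothetical stand-in for util.pickle_dump)."""
    with open(file_path, "wb") as f:
        pickle.dump(data, f)


def pickle_load(file_path):
    """Load an object previously written by pickle_dump."""
    with open(file_path, "rb") as f:
        return pickle.load(f)


# Round-trip check in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "y_train.pkl")
    pickle_dump([0, 0, 1], path)
    restored = pickle_load(path)

print(restored)
```

Saving each split once lets later notebooks load identical data instead of re-running the split.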