# Assignment Chapter 2 - DEEP LEARNING [Case #4]
Startup Campus, Indonesia - `Artificial Intelligence (AI)` (Batch 7)
* Task: **CLASSIFICATION**
* DL Framework: **PyTorch**
* Dataset: Credit Card Fraud 2023
* Libraries: Pandas/cuDF, Scikit-learn/cuML, Numpy/cuPy
* Objective: Classify credit fraud transactions using Multilayer Perceptron

`REQUIREMENTS` All modules (in their matching versions) are installed correctly.
<br>`HOW TO WORK` Complete the lines of code marked with **#TODO**.
<br>`PORTFOLIO TARGET` Participants are able to classify fraudulent transactions using a *Multilayer Perceptron*.

### Import Libraries

In [None]:
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py


In [None]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import shutil
import cudf
import os
import matplotlib.pyplot as plt
from google.colab import files

<font color="red">**- - - - PLEASE NOTE - - - -**</font>
<br>**Enable the GPU now.** In Google Colab, click **Runtime > Change Runtime Type**, then select **T4 GPU**.

### Dataset Loading (CPU vs. GPU)

In [None]:
from pandas import read_csv as read_by_CPU
from cudf import read_csv as read_by_GPU

In [None]:
# Unzip the file
shutil.unpack_archive('dataset_case_04.zip', '/content/sample_data', 'zip')

In [None]:
# TODO: Import the dataset with Pandas, using the "read_by_CPU" function
%time data_cpu = read_by_CPU('/content/sample_data/creditcard_2023.csv')

In [None]:
# Import the dataset with cuDF (Pandas on the GPU)
%time data_gpu = read_by_GPU('/content/sample_data/creditcard_2023.csv')
print("Data shape (GPU):", data_gpu.shape)

In [None]:
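# TODO: Drop the ID column
# drop() returns a new frame; with inplace=True it returns None, so do not combine both
data_cpu = data_cpu.drop(columns=['id'])
data_gpu = data_gpu.drop(columns=['id'])
print(data_gpu)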
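
A note on `drop`: the sketch below uses a made-up toy frame in plain pandas to show what `drop` returns with and without `inplace=True`. The column values are invented, and it is an assumption that cuDF mirrors this pandas behaviour.

```python
import pandas as pd

# Toy frame with invented values; only the column layout resembles the dataset
toy = pd.DataFrame({"id": [0, 1, 2], "V1": [0.1, -0.4, 1.3], "Class": [0, 1, 0]})

# Without inplace, drop() returns a new DataFrame and leaves `toy` untouched
without_id = toy.drop(columns=["id"])

# With inplace=True, drop() modifies the frame itself and returns None,
# so assigning the result would replace the variable with None
scratch = toy.copy()
result = scratch.drop(columns=["id"], inplace=True)

print(list(without_id.columns))
print(result)
print(list(scratch.columns))
```

Either style removes the column; the point is to use one of them, not both at once.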


### Standardization (CPU vs. GPU)

In [None]:
from sklearn.preprocessing import StandardScaler as StandardScaler_CPU
from cuml.preprocessing import StandardScaler as StandardScaler_GPU

In [None]:
ScalerCPU = StandardScaler_CPU()
ScalerGPU = StandardScaler_GPU()

# Anonymized features V1..V28 (the dataset has 28 of them)
arbitrary_features = ["V"+str(i+1) for i in range(28)]

In [None]:
%%time

data_cpu[arbitrary_features] = ScalerCPU.fit_transform(data_cpu[arbitrary_features].values)
data_cpu["Amount"] = ScalerCPU.fit_transform(data_cpu["Amount"].values.reshape(-1, 1)).squeeze()

In [None]:
%%time

data_gpu[arbitrary_features] = ScalerGPU.fit_transform(data_gpu[arbitrary_features].values)
data_gpu["Amount"] = ScalerGPU.fit_transform(data_gpu["Amount"].values.reshape(-1, 1)).squeeze()
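
As a rough illustration of what standardization does to a column, the sketch below applies z = (x - mean) / std by hand with NumPy. The amounts are made up; scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`), which is what `np.std` computes by default.

```python
import numpy as np

amounts = np.array([10.0, 20.0, 30.0, 40.0])  # made-up transaction amounts

mean = amounts.mean()
std = amounts.std()  # population standard deviation (ddof=0)
scaled = (amounts - mean) / std

print(scaled)
print(scaled.mean(), scaled.std())
```

After scaling, the column has mean 0 and standard deviation 1 (up to floating-point error), which keeps `Amount` on a scale comparable to the other features.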


### Train/Test Split (CPU vs. GPU)

In [None]:
from sklearn.model_selection import train_test_split as splitCPU
from cuml.model_selection import train_test_split as splitGPU

In [None]:
# TODO: Define X (features) and y (target), using "data_gpu"
X_gpu = data_gpu.drop(columns=["Class"])  # every column except the target
y_gpu = data_gpu["Class"]                 # target: 1 = fraud, 0 = not fraud

In [None]:
# TODO: Split the dataset into 80% train set and 20% test set, using the "splitCPU" function
X_cpu = data_cpu.drop(columns=["Class"]).values
y_cpu = data_cpu["Class"].values

X_train_cpu, X_test_cpu, y_train_cpu, y_test_cpu = splitCPU(X_cpu, y_cpu, test_size=0.2, random_state=42)

In [None]:
# TODO: Do the same data splitting, but with the "splitGPU" function
X_train, X_test, y_train, y_test = splitGPU(X_gpu, y_gpu, test_size=0.2, random_state=42)

# Convert the cuDF results to float32 PyTorch tensors that stay on the GPU
def to_gpu_tensor(frame):
    return torch.as_tensor(frame.values, dtype=torch.float32, device="cuda")

X_train_gpu, X_test_gpu = to_gpu_tensor(X_train), to_gpu_tensor(X_test)
y_train_gpu, y_test_gpu = to_gpu_tensor(y_train), to_gpu_tensor(y_test)
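
To make the 80/20 split concrete, here is a minimal NumPy sketch of the idea: shuffle the row indices, then cut them into a test part and a train part. `simple_split` is a hypothetical helper written for illustration; it is not the algorithm scikit-learn or cuML actually implement, and the toy data is invented.

```python
import numpy as np

def simple_split(X, y, test_size=0.2, random_state=42):
    """Shuffle row indices, then cut them into train and test parts."""
    rng = np.random.default_rng(random_state)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# 10 made-up rows with 2 features each; row i is [2i, 2i+1] and its label is i % 2
X_toy = np.arange(20, dtype=np.float32).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = simple_split(X_toy, y_toy)
print(X_tr.shape, X_te.shape)
```

Indexing `X` and `y` with the same shuffled indices is what keeps every feature row paired with its own label.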

### Convert the dataset into Tensor

In [None]:
from torch.utils.data import Dataset

class CreditCardDataset(Dataset):
  def __init__(self, X, y):
    self.X = X
    self.y = y

  def __len__(self):
    return len(self.X)

  def __getitem__(self, idx):
    return self.X[idx], self.y[idx]
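
The sketch below shows how a map-style dataset like this one interacts with `DataLoader`. `ToyDataset` and its tensors are invented for illustration and run on the CPU; the only assumption is that batching behaves the same way for the real GPU tensors.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Same three-method pattern: store the tensors, report length, index a sample."""
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

X_toy = torch.arange(30, dtype=torch.float32).reshape(10, 3)  # 10 samples, 3 features
y_toy = torch.zeros(10)

loader = DataLoader(ToyDataset(X_toy, y_toy), batch_size=4, shuffle=False)
batch_shapes = [tuple(xb.shape) for xb, yb in loader]
print(batch_shapes)
```

With 10 samples and a batch size of 4, the loader yields two full batches and one smaller final batch.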

In [None]:
# TODO: Enable the GPU (CUDA) as the device for training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

### Batching the Dataset with PyTorch DataLoader

In [None]:
# TODO: Set the batch size
BATCH_SIZE = 32

train_dataset = CreditCardDataset(X_train_gpu, y_train_gpu)
test_dataset = CreditCardDataset(X_test_gpu, y_test_gpu)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

### Model Blueprint

In [None]:
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(MLP, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size1)
        self.layer2 = nn.Linear(hidden_size1, hidden_size2)
        self.layer3 = nn.Linear(hidden_size2, output_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)
        self.batch_norm1 = nn.BatchNorm1d(hidden_size1)
        self.batch_norm2 = nn.BatchNorm1d(hidden_size2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.layer1(x)
        x = self.batch_norm1(x)
        x = self.relu(x)
        x = self.dropout(x)

        x = self.layer2(x)
        x = self.batch_norm2(x)
        x = self.relu(x)
        x = self.dropout(x)

        x = self.sigmoid(self.layer3(x))
        return x

In [None]:
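# Note: input_size, hidden_size1, hidden_size2, output_size, and learning_rate
# are defined in the hyperparameters cell further down; run that cell first.
model = MLP(input_size, hidden_size1, hidden_size2, output_size).to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)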
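
For reference, binary cross-entropy averages -[y·log(p) + (1-y)·log(1-p)] over the samples, which matches the default `mean` reduction of `nn.BCELoss`. The plain-Python sketch below computes it by hand; `binary_cross_entropy` is a hypothetical helper and the probabilities are made up.

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)

probs = [0.9, 0.2, 0.6]   # made-up sigmoid outputs
labels = [1, 0, 1]        # true classes
loss = binary_cross_entropy(probs, labels)
print(f"{loss:.4f}")  # prints 0.2798
```

Confident correct predictions (0.9 for a positive) contribute little to the loss, while hesitant ones (0.6 for a positive) contribute more.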

In [None]:
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total number of parameters: {total_params}")
print(f"Number of trainable parameters: {trainable_params}")
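
The printed count can be cross-checked by hand. The sketch below adds up the weights and biases of each `nn.Linear` layer and the learnable scale and shift of each `nn.BatchNorm1d` layer. The hidden sizes come from the hyperparameters in this notebook; an input size of 29 (V1..V28 plus Amount) is an assumption.

```python
def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix + bias vector

def batchnorm_params(n_features):
    return 2 * n_features  # learnable scale (gamma) and shift (beta)

# input_size = 29 is assumed; the other values follow the hyperparameters cell
input_size, hidden_size1, hidden_size2, output_size = 29, 128, 64, 1

total = (
    linear_params(input_size, hidden_size1)
    + batchnorm_params(hidden_size1)
    + linear_params(hidden_size1, hidden_size2)
    + batchnorm_params(hidden_size2)
    + linear_params(hidden_size2, output_size)
)
print(total)  # prints 12545
```

ReLU, Dropout, and Sigmoid add nothing to the count because they have no learnable weights.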

### Model Hyperparameters and Parameters

In [None]:
# [ QUESTION ]
# What is the difference between hyperparameters and parameters?

[ ANSWER HERE ]

In [None]:
# TODO: Set the hyperparameters
input_size = X_train_gpu.shape[1]  # number of feature columns
hidden_size1 = 128
hidden_size2 = 64
output_size = 1
learning_rate = 0.001
num_epochs = 100

In [None]:
# TODO: Set the input size for the model
num_inputs = X_train_gpu.shape[1]

model = MLP(num_inputs, hidden_size1, hidden_size2, output_size)
model = model.to(device)

In [None]:
# Set the optimizer and loss function
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.BCELoss()

In [None]:
# Check the number of parameters
print("Number of parameters: {:,}".format(sum(p.numel() for p in model.parameters())))
print("Number of trainable parameters: {:,}".format(sum(p.numel() for p in model.parameters() if p.requires_grad)))

In [None]:
# [ QUESTION ]
# Why is the total of "trainable parameters" equal to the overall total of parameters?

[ ANSWER HERE ]

### Train the Model

In [None]:
print("Start training ...")
for epoch in range(num_epochs):
    train_loss = 0.0
    model.train()

    for data, label in train_loader:
        data = data.to(device)
        label = label.view(-1).to(device)
        optimizer.zero_grad()
        output = model(data.float())

        loss = criterion(output.view(-1), label.float())
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    # criterion returns the mean loss per batch, so average over the number of batches
    train_loss = train_loss / len(train_loader)
    if epoch % 10 == 0:
        print('  - Epoch: {} \tTraining_loss: {:.6f}'.format(epoch, train_loss))

In [None]:
# Alternative version of the loop above; running both continues training the same model
for epoch in range(num_epochs):
  model.train()
  total_loss = 0
  for batch_X, batch_y in train_loader:
    outputs = model(batch_X)
    loss = criterion(outputs, batch_y.view(-1, 1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    total_loss += loss.item()

  if (epoch + 1) % 10 == 0:
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {total_loss/len(train_loader):.4f}')


In [None]:
model.eval()
correct = 0
total = 0
with torch.no_grad():
  for batch_X, batch_y in test_loader:
    outputs = model(batch_X)
    predicted = (outputs > 0.5).float()
    total += batch_y.size(0)
    correct += (predicted.view(-1) == batch_y).sum().item()

accuracy = 100 * correct / total
print(f'Accuracy: {accuracy:.2f}%')
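
The evaluation above boils down to thresholding the sigmoid outputs at 0.5 and counting matches. A minimal NumPy sketch of that step, using made-up probabilities and labels:

```python
import numpy as np

probs = np.array([0.91, 0.12, 0.55, 0.40, 0.78])  # made-up sigmoid outputs
labels = np.array([1, 0, 0, 0, 1])                # true classes

predicted = (probs > 0.5).astype(int)             # 1 = fraud, 0 = not fraud
accuracy = 100 * (predicted == labels).mean()     # percentage of correct predictions
print(f"Accuracy: {accuracy:.2f}%")  # prints Accuracy: 80.00%
```

Only the third sample is misclassified here (0.55 is above the threshold but its label is 0), giving 4 correct predictions out of 5.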

### Model ACCURACY should reach >= 95%

In [None]:
# TODO: If the accuracy is still below 95%, perform fine-tuning

In [None]:
correct_preds = 0
total_samples = 0

model.eval()  # use running BatchNorm statistics and disable dropout
with torch.no_grad():
    for data, labels in test_loader:
        labels = labels.view(-1)
        output = model(data.float()).view(-1)

        predictions = (output >= 0.5).float()
        correct_preds += (predictions == labels).sum().item()
        total_samples += labels.numel()

# Keep accuracy in percent so it can be compared against the 95% target
accuracy = 100 * correct_preds / total_samples
print("Model accuracy: {:.2f}%".format(accuracy))

In [None]:
if accuracy < 95:
  print("Performing fine-tuning")
  # Continue training with a 10x smaller learning rate
  optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate/10)

  for epoch in range(50):
    model.train()
    total_loss = 0
    for batch_X, batch_y in train_loader:
      outputs = model(batch_X)
      loss = criterion(outputs, batch_y.view(-1, 1))
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
      total_loss += loss.item()

    if (epoch + 1) % 10 == 0:
      print(f'Fine-tuning Epoch [{epoch+1}/50], Loss: {total_loss/len(train_loader):.4f}')

In [None]:
# Final evaluation
model.eval()
correct = 0
total = 0
with torch.no_grad():
  for batch_X, batch_y in test_loader:
    outputs = model(batch_X)
    predicted = (outputs >= 0.5).float()
    total += batch_y.size(0)
    correct += (predicted.view(-1) == batch_y).sum().item()

final_accuracy = 100 * correct / total
print(f'Final Accuracy after fine-tuning: {final_accuracy:.2f}%')

### Scoring
Total `#TODO` = 12
<br>Checklist:

- [ ] Import the dataset with Pandas, using the "read_by_CPU" function
- [ ] Drop the ID column
- [ ] Define X (features) and Y (target), using "data_gpu"
- [ ] Split the dataset into 80% train set and 20% test set, using the "splitCPU" function
- [ ] Do the same data splitting, but with the "splitGPU" function
- [ ] Enable the GPU (CUDA) as the device for training
- [ ] Set the batch size
- [ ] QUESTION: What is the difference between hyperparameters and parameters?
- [ ] Set the hyperparameters
- [ ] Set the input size for the model
- [ ] QUESTION: Why is the total of "trainable parameters" equal to the overall total of parameters?
- [ ] If the accuracy is still below 95%, perform fine-tuning

### Additional readings
- N/A

### Copyright © 2024 Startup Campus, Indonesia
* Prepared by **Nicholas Dominic, M.Kom.** [(profile)](https://linkedin.com/in/nicholas-dominic)
* You may **NOT** use this file without written permission from PT. Kampus Merdeka Belajar (Startup Campus).
* Please address your questions to mentors.