# **Preprocessing**
Features V1, V2, …, V28 are the principal components obtained with PCA. The only features that have not been transformed with PCA are 'Time' and 'Amount'. 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning.

## Setup Environment
Import necessary modules and configure the project path to access source code modules from the src directory.

In [1]:
import sys
import os

project_root = os.path.abspath("..")
sys.path.append(project_root)

## Import Required Libraries
Import NumPy for numerical operations and import visualization and data processing functions from custom modules.

In [2]:
import numpy as np
from src.visualization import *
from src.data_processing import *

## Load Raw Data
Read the creditcard.csv file from the raw data directory using NumPy's genfromtxt function.

In [3]:
path_data = '../Data/raw/creditcard.csv'
print("Reading creditcard.csv...")
data = np.genfromtxt(path_data, delimiter=',', skip_header=1, dtype=str, encoding='utf-8', missing_values=None)
print("Successful!")

Reading creditcard.csv...
Successful!


In [4]:
X_raw = data[:, :-1].astype(np.float64)               # every column except the last is a feature
y = np.char.strip(data[:, -1], '"').astype(np.int32)  # last column is Class; strip quotes before casting
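
The split above can be tried on a toy array. This is a minimal sketch that assumes the label column arrives wrapped in double quotes, as the `np.char.strip` call suggests; the `rows`, `X_demo`, and `y_demo` names and the sample values are made up for illustration.

```python
import numpy as np

# Toy rows mimicking a CSV read with dtype=str: features as text, label quoted.
rows = np.array([
    ['0', '-1.36', '149.62', '"0"'],
    ['1', '1.19', '2.69', '"1"'],
])

X_demo = rows[:, :-1].astype(np.float64)                   # all columns except the last
y_demo = np.char.strip(rows[:, -1], '"').astype(np.int32)  # drop the quotes, then cast

print(X_demo.shape, y_demo)
```

Casting the quoted strings directly to an integer type would fail, which is why the quotes are stripped first.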

## Handle Missing Values
Check for missing (NaN) values in the dataset and impute them using the mean of each column.

In [5]:
missing_mask = np.isnan(X_raw)
print("Total missing values:", missing_mask.sum()) 

if missing_mask.any():
    col_means = np.nanmean(X_raw, axis=0)
    inds = np.where(missing_mask)
    X_raw[inds] = np.take(col_means, inds[1])

Total missing values: 0
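
Because the real dataset reports no missing values, the imputation branch never runs here. The same logic can be exercised on a small hypothetical array (`X_demo` and its values are illustrative only) to see what it would do:

```python
import numpy as np

# Hypothetical 3x2 matrix with one NaN in each column.
X_demo = np.array([[1.0, np.nan],
                   [3.0, 4.0],
                   [np.nan, 8.0]])

mask = np.isnan(X_demo)
col_means = np.nanmean(X_demo, axis=0)   # per-column mean, ignoring NaN
rows, cols = np.where(mask)              # coordinates of the missing entries
X_demo[rows, cols] = col_means[cols]     # fill each gap with its column's mean

print(X_demo)
```

Each NaN is replaced by the mean of the non-missing values in its own column, so the first column's gap becomes 2.0 and the second column's gap becomes 6.0.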


## Handle Outliers in Amount
Clip the Amount column (index 29) at its 99.5th percentile. Values above that threshold are capped at the threshold rather than removed, so the number of rows is unchanged.

In [6]:
amount_col_idx = 29
p99_5 = np.percentile(X_raw[:, amount_col_idx], 99.5)
X_raw[:, amount_col_idx] = np.clip(X_raw[:, amount_col_idx], None, p99_5)
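
The effect of upper clipping is easier to see on a handful of values. The sketch below uses made-up amounts and the 75th percentile (instead of the 99.5th used in the notebook) so that the cap is visible on five points:

```python
import numpy as np

# Made-up amounts with one extreme value.
amounts = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

cap = np.percentile(amounts, 75)       # upper threshold (75th percentile for this toy case)
clipped = np.clip(amounts, None, cap)  # None leaves the lower bound open

print(cap, clipped)
```

Only the extreme value changes: it is pulled down to the cap, while the other values and the array length stay the same.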


## Scale Time and Amount Features
Apply robust scaling to the Time and Amount features to bring them onto a scale comparable with the PCA components while limiting the influence of outliers.

In [7]:
time_col   = X_raw[:, 0]    
amount_col = X_raw[:, 29]   
scaled_time   = robust_scale(time_col)
scaled_amount = robust_scale(amount_col)
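
The `robust_scale` helper comes from the project's own `src` modules and its implementation is not shown in this notebook. As a rough guide only, robust scaling is commonly defined as centering on the median and dividing by the interquartile range; the function below is a hypothetical stand-in (`robust_scale_sketch`) under that assumption and may differ from the project's version.

```python
import numpy as np

def robust_scale_sketch(values):
    """Center on the median and divide by the interquartile range (IQR).

    Hypothetical stand-in; not the project's robust_scale.
    """
    values = np.asarray(values, dtype=np.float64)
    median = np.median(values)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    if iqr == 0:
        # Degenerate column: only center it to avoid dividing by zero.
        return values - median
    return (values - median) / iqr

demo = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
scaled_demo = robust_scale_sketch(demo)
print(scaled_demo)
```

Because the median and IQR barely react to the single large value, the bulk of the data lands near zero while the outlier stays clearly separated.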

## Reconstruct Feature Matrix
Remove the original Time (index 0) and Amount (index 29) columns, then place the scaled Amount and scaled Time columns in front of the 28 PCA components to form the final feature matrix.

In [8]:
X_clean = np.delete(X_raw, [0, 29], axis=1)  

X_final = np.column_stack((scaled_amount, scaled_time, X_clean))
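
A quick way to confirm the resulting column order is to run the same two calls on a small placeholder matrix. In this sketch `X_demo` is filled with `np.arange` values so each column is easy to recognize, and the original columns stand in for the scaled ones:

```python
import numpy as np

# Placeholder matrix with 30 columns: Time, V1..V28, Amount.
X_demo = np.arange(90, dtype=np.float64).reshape(3, 30)

time_demo = X_demo[:, 0]
amount_demo = X_demo[:, 29]

X_rest = np.delete(X_demo, [0, 29], axis=1)                     # keeps the 28 middle columns
X_stacked = np.column_stack((amount_demo, time_demo, X_rest))   # Amount, Time, V1..V28

print(X_rest.shape, X_stacked.shape)
```

The stacked matrix has 30 columns again, with Amount first and Time second, which matches the `scaled_amount,scaled_time,V1,...` ordering used when the file is written.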


## Save Processed Data
Export the processed and cleaned dataset to a CSV file with appropriate headers for use in the next stages of analysis.

In [9]:
path_data_processed = '../Data/processed/data_processed.csv'

header = "scaled_amount,scaled_time," + ",".join([f"V{i}" for i in range(1, 29)]) + ",Class"
output_csv = np.hstack([X_final, y.reshape(-1, 1)])

np.savetxt(path_data_processed,
           output_csv,
           delimiter=",",
           header=header,
           comments="",
           fmt="%.10f")   

print("Saved: data_processed.csv")

Saved: data_processed.csv
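
Since `np.hstack` upcasts the integer labels to float and a single `fmt="%.10f"` applies to every column, the Class column is written as `0.0000000000` or `1.0000000000`. If integer labels are preferred in the file, one option is a per-column format list. The sketch below is a variation on the save step, not what the notebook does; it writes a toy table to a temporary directory and reads it back, and the file name and values are made up.

```python
import os
import tempfile
import numpy as np

X_demo = np.array([[0.5, -1.0, 0.25],
                   [1.5, 2.0, -0.75]])
y_demo = np.array([0, 1], dtype=np.int32)

header = "scaled_amount,scaled_time,V1,Class"
table = np.hstack([X_demo, y_demo.reshape(-1, 1)])   # labels are upcast to float64 here

# One format per column: fixed-point for the features, integer for Class.
fmt = ["%.10f"] * X_demo.shape[1] + ["%d"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_processed.csv")
    np.savetxt(path, table, delimiter=",", header=header, comments="", fmt=fmt)

    with open(path, encoding="utf-8") as fh:
        first_line = fh.readline().strip()
        second_line = fh.readline().strip()

    loaded = np.loadtxt(path, delimiter=",", skiprows=1)  # round-trip check

print(first_line)
print(second_line)
```

Reading the file back with `np.loadtxt` is also a cheap check that the header and the number of columns agree before later stages depend on the file.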


## Clean Up Memory
Delete intermediate variables to free up memory after processing is complete.

In [10]:
del data, X_raw, X_clean, scaled_amount, scaled_time