In [1]:
import pandas as pd

df = pd.read_csv(r".\data\creditcard.csv")

In [2]:
# Drop 'Time' column
# 'Time' records the seconds elapsed since the first transaction in the dataset. It is dropped here on the assumption that it carries less signal than the transaction characteristics themselves.

# Scale 'Amount' feature
# 'Amount' has a very different scale and distribution from the PCA-transformed 'V' features. Standardizing it (zero mean, unit variance) puts it on a comparable scale, so its larger raw magnitude does not give it outsized influence. Isolation Forest is less sensitive to this than distance-based methods, because each split threshold is drawn within the chosen feature's own range, but standardizing is a cheap safeguard and keeps the data usable for scale-sensitive models as well. The 'V' features come out of PCA already centered, so they are left as-is.
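As a rough illustration of what standardization does to a column like 'Amount', the sketch below applies StandardScaler to a small made-up table and checks the result against the z-score formula (x - mean) / std. The toy values and the names `toy`, `scaled` and `manual` are assumptions for illustration only and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one small-scale column and one large-scale 'Amount' column.
toy = pd.DataFrame({
    "V1": [-1.2, 0.3, 0.9, -0.4, 0.4],
    "Amount": [2.5, 149.62, 378.66, 69.99, 1200.0],
})

# StandardScaler expects 2D input, hence the double brackets.
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[["Amount"]]).ravel()

# StandardScaler divides by the population standard deviation (ddof=0).
manual = (toy["Amount"] - toy["Amount"].mean()) / toy["Amount"].std(ddof=0)

toy["Amount"] = scaled
print(toy)
print("matches manual z-score:", np.allclose(scaled, manual))
```

After scaling, 'Amount' has mean 0 and standard deviation 1, which is the sense in which it becomes comparable to the centered 'V' columns.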

In [3]:
from sklearn.preprocessing import StandardScaler

print("\n--- Starting Data Preprocessing ---")

# 1. Drop the 'Time' column
df_processed = df.drop('Time', axis=1)
print(f"Dropped 'Time' column. New DataFrame shape: {df_processed.shape}")

# 2. Separate features (X) and target (y_true)
# The 'Class' column (0: legitimate, 1: fraud) is our ground truth.
# For unsupervised learning, the model will not see 'y_true' during training.
# We keep it separate solely for later evaluation of how well the unsupervised model
# aligns with actual fraud.

X = df_processed.drop('Class', axis=1)
y_true = df_processed['Class']

print(f"Separated features (X) and target (y_true). X shape: {X.shape}, y_true shape: {y_true.shape}")


--- Starting Data Preprocessing ---
Dropped 'Time' column. New DataFrame shape: (284807, 30)
Separated features (X) and target (y_true). X shape: (284807, 29), y_true shape: (284807,)


In [4]:
# 3. Scale the 'Amount' feature
# StandardScaler expects 2D input, hence the double brackets around 'Amount'.

scaler = StandardScaler()

X['Amount'] = scaler.fit_transform(X[['Amount']])

In [5]:
print("Scaled 'Amount' feature using StandardScaler.")
print("First 5 rows of features (X) after preprocessing:")
print(X.head())

Scaled 'Amount' feature using StandardScaler.
First 5 rows of features (X) after preprocessing:
         V1        V2        V3        V4        V5        V6        V7  \
0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9       V10  ...       V20       V21       V22       V23  \
0  0.098698  0.363787  0.090794  ...  0.251412 -0.018307  0.277838 -0.110474   
1  0.085102 -0.255425 -0.166974  ... -0.069083 -0.225775 -0.638672  0.101288   
2  0.247676 -1.514654  0.207643  ...  0.524980  0.247998  0.771679  0.909412   
3  0.377436 -1.387024 -0.054952  ... -0.208038 -0.108300  0.005274 -0.190321   
4 -0.270533  0.817739  0.753074  ...  0.408542 -0.009

In [6]:
X.to_csv(r".\data\preprocessed_data.csv", index=False)

In [7]:
y_true.to_csv(r".\data\y_true.csv", index=False)
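Because both files are written with index=False, a later notebook has to rely on row order to keep features and labels aligned. A minimal sketch of reloading such a pair is shown below. It uses a temporary directory and tiny stand-in frames (`X_demo`, `y_demo`, both hypothetical) rather than the real files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for X and y_true.
X_demo = pd.DataFrame({"V1": [0.1, -0.2, 0.3], "Amount": [-0.5, 0.0, 0.5]})
y_demo = pd.Series([0, 0, 1], name="Class")

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "preprocessed_data.csv"
    y_path = Path(tmp) / "y_true.csv"

    # Same call pattern as in the cells above: no index column is written.
    X_demo.to_csv(x_path, index=False)
    y_demo.to_csv(y_path, index=False)

    # The label file has a single 'Class' column; select it to get a Series back.
    X_loaded = pd.read_csv(x_path)
    y_loaded = pd.read_csv(y_path)["Class"]

print(X_loaded.shape, y_loaded.shape)
print("row counts match:", len(X_loaded) == len(y_loaded))
```

Checking that the row counts match after loading is a cheap guard against pairing a feature file with the wrong label file.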

- Created the feature matrix X, which the Isolation Forest model will learn from.
- The Amount feature is now standardized, putting it on a scale comparable to the V features so that its raw magnitude alone does not dominate the model's decisions.
- Explicitly separated y_true (the Class column) to reinforce that the model is unsupervised: it trains without ever seeing these labels, which are kept only for later evaluation.
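
The modeling step is not part of this notebook, but the role of y_true can be sketched as follows: the model is fitted on features only, and the held-out labels are used afterwards to score its flags. Everything below is an assumption for illustration, including the synthetic data, the `contamination` value and the choice of precision and recall as metrics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 500 "legitimate" rows near the origin, 10 "fraud" rows far away.
normal = rng.normal(0, 1, size=(500, 3))
outliers = rng.normal(8, 1, size=(10, 3))
X_demo = pd.DataFrame(np.vstack([normal, outliers]), columns=["V1", "V2", "Amount"])
y_demo = pd.Series([0] * 500 + [1] * 10, name="Class")

# Fit on features only; the labels are never passed to the model.
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
model.fit(X_demo)

# predict() returns -1 for anomalies and 1 for inliers; map to the 0/1 'Class' convention.
pred = (model.predict(X_demo) == -1).astype(int)

# Labels are used only here, to check how the flags line up with known fraud.
print("flagged rows:", int(pred.sum()))
print("precision:", precision_score(y_demo, pred))
print("recall:", recall_score(y_demo, pred))
```

On the real data the fraud class is far rarer and less cleanly separated than in this toy setup, so the numbers here say nothing about expected performance.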