In [1]:
import os
import pandas as pd

#### 2.1. Load Dataset from Local Storage

We read the previously saved Parquet file into a pandas DataFrame.
Parquet is a columnar format that loads quickly and stores each column's data type alongside the data.

In [2]:
df = pd.read_parquet("./data/1/df.parquet")

#### 2.2. Initial Data Exploration

We display the first 5 rows to get a quick overview of the dataset's structure
and verify that the data loaded correctly.

In [3]:
print("First 5 rows of the dataset:")
display(df.head())

First 5 rows of the dataset:


,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,371,CASH_OUT,367336.05,sdv-pii-r8zd6,4514816.83,2108392.86,sdv-pii-q6998,1265486.06,2454140.46,0,0
1,368,TRANSFER,238.63,sdv-pii-xq6z3,430944.71,1865444.6,sdv-pii-n2ql8,107927.46,2021.16,0,0
2,141,CASH_OUT,254.93,sdv-pii-805w0,839593.53,8008353.88,sdv-pii-yo0z6,773352.22,20.79,0,0
3,191,CASH_IN,501547.39,sdv-pii-279tw,41226.4,28633.52,sdv-pii-9zlyl,6825363.55,16442078.24,0,0
4,169,TRANSFER,71832.0,sdv-pii-ksz58,248694.6,793617.86,sdv-pii-0ykbo,579313.76,829850.96,0,0


We also print the dataset's shape to understand the number of rows (transactions)
and columns (features).

In [4]:
print(f"Dataset shape: {df.shape[0]} rows x {df.shape[1]} columns")

Dataset shape: 21000000 rows x 11 columns


#### 2.3. Column Data Types

Displaying the data type of each column helps identify:
- Numerical features (Int64, Float64) suitable for calculations
- String features: 'type' is a low-cardinality categorical suitable for encoding, while 'nameOrig' and 'nameDest' are account identifiers
- The target variable for fraud detection: 'isFraud'

In [5]:
print("Column data types:")
display(df.dtypes.to_frame("Data Type"))

Column data types:


,Data Type
step,Int64
type,string[python]
amount,Float64
nameOrig,string[python]
oldbalanceOrg,Float64
newbalanceOrig,Float64
nameDest,string[python]
oldbalanceDest,Float64
newbalanceDest,Float64
isFraud,Int64
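
The grouping described above can also be derived programmatically rather than read off the table. The following is a minimal sketch on a small stand-in frame (the `sample` DataFrame and its values are illustrative assumptions, not rows from the dataset); it uses the dtype checks in `pandas.api.types` to separate numeric features from string features and sets the target aside.

```python
import pandas as pd

# Stand-in frame using the same nullable dtypes as the notebook output.
# The values are made up for illustration only.
sample = pd.DataFrame(
    {
        "step": pd.array([1, 2, 3], dtype="Int64"),
        "type": pd.array(["CASH_OUT", "TRANSFER", "CASH_IN"], dtype="string"),
        "amount": pd.array([10.5, 20.0, 30.25], dtype="Float64"),
        "isFraud": pd.array([0, 0, 1], dtype="Int64"),
    }
)

target = "isFraud"

# Numeric columns, excluding the target so it is not treated as a feature.
numeric_cols = [
    col
    for col in sample.columns
    if pd.api.types.is_numeric_dtype(sample[col]) and col != target
]

# String columns (categoricals and identifiers alike).
string_cols = [
    col for col in sample.columns if pd.api.types.is_string_dtype(sample[col])
]

print("Numeric features:", numeric_cols)
print("String features:", string_cols)
```

On the real `df`, the same two comprehensions would apply unchanged; whether a string column is worth encoding still depends on its cardinality.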


#### 2.4. Descriptive Statistics

We calculate summary statistics for all columns to understand:
- The distribution of numerical features (mean, min, max, std)
- The unique values and counts for categorical features
- Potential anomalies or unexpected ranges in the data

In [6]:
print("Descriptive statistics for all columns:")
display(df.describe(include="all").T)

Descriptive statistics for all columns:


,count,unique,top,freq,mean,std,min,25%,50%,75%,max
step,21000000.0,,,,238.869977,163.244711,1.0,112.0,205.0,333.0,743.0
type,21000000.0,5.0,CASH_OUT,7392758.0,,,,,,,
amount,21000000.0,,,,179899.52086,253462.841157,0.0,18588.0975,82421.835,238472.9125,6007851.17
nameOrig,21000000.0,11464100.0,C1727902553,14.0,,,,,,,
oldbalanceOrg,21000000.0,,,,1337090.191497,2576687.424684,0.01,10633.18,215652.925,1419982.4125,40283517.67
newbalanceOrig,21000000.0,,,,2089204.302756,3416911.618226,0.0,58884.7675,586785.795,2605437.03,38668109.84
nameDest,21000000.0,8219031.0,C985934102,292.0,,,,,,,
oldbalanceDest,21000000.0,,,,1939226.612396,2508570.53668,0.01,273171.8275,1012997.705,2634455.3575,44522776.72
newbalanceDest,21000000.0,,,,3756326.413547,10367144.491859,0.0,101.26,64790.495,2020361.3425,271284551.63
isFraud,21000000.0,,,,0.001308,0.036144,0.0,0.0,0.0,0.0,1.0
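
Because 'isFraud' is a 0/1 indicator, its mean is the share of rows labeled as fraud. The mean of 0.001308 in the table corresponds to roughly 0.13% of transactions, so the classes are heavily imbalanced. A minimal sketch of reading that rate directly, using a small made-up `labels` frame (an assumption for illustration) in place of the full dataset:

```python
import pandas as pd

# Made-up labels for illustration: 1 fraudulent row out of 8.
labels = pd.DataFrame(
    {"isFraud": pd.array([0, 0, 0, 1, 0, 0, 0, 0], dtype="Int64")}
)

# For a 0/1 column, the mean is the fraction of positive rows
# and the sum is the number of positive rows.
fraud_rate = labels["isFraud"].mean()
fraud_count = int(labels["isFraud"].sum())

print(f"Fraud rate: {fraud_rate:.4%} ({fraud_count} of {len(labels)} rows)")
```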


#### 2.5. Missing Values Check

We check for missing values to ensure data completeness.
A total of 0 missing values means the dataset is complete and no imputation is needed.

In [7]:
missing_count = df.isna().sum().sum()
print(f"Total missing values in the dataset: {missing_count}")

Total missing values in the dataset: 0
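
The single total is enough when it is 0. If it were non-zero, a per-column breakdown would show where the gaps are. A minimal sketch, assuming a small stand-in frame with deliberately missing entries (the `sample` frame is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Stand-in frame with two deliberately missing entries.
sample = pd.DataFrame(
    {
        "amount": pd.array([10.5, None, 30.25], dtype="Float64"),
        "type": pd.array(["CASH_OUT", "TRANSFER", None], dtype="string"),
        "isFraud": pd.array([0, 0, 1], dtype="Int64"),
    }
)

# isna().sum() without the second .sum() gives one count per column.
missing_per_column = sample.isna().sum()

# Keep only the columns that actually contain missing values.
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
```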


#### 2.6. Duplicated Rows Check

We check for duplicate rows, which could bias analysis or model training.
`duplicated()` compares all columns, so a total of 0 means no two rows are identical in every column.

In [8]:
duplicate_count = df.duplicated().sum()
print(f"Total duplicated rows in the dataset: {duplicate_count}")

Total duplicated rows in the dataset: 0
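
A full-row check can miss records that repeat the same transaction but differ in one field. Passing `subset=` to `duplicated()` restricts the comparison to chosen columns. The sketch below uses a made-up `sample` frame, and the choice of `key_columns` is an assumption about what identifies a transaction, not something the dataset defines.

```python
import pandas as pd

# Made-up rows: the first two share every field except the label.
sample = pd.DataFrame(
    {
        "step": [1, 1, 2],
        "nameOrig": ["A", "A", "B"],
        "nameDest": ["X", "X", "Y"],
        "amount": [100.0, 100.0, 50.0],
        "isFraud": [0, 1, 0],
    }
)

# Full-row comparison: the differing label hides the repeat.
full_row_duplicates = int(sample.duplicated().sum())

# Hypothetical transaction key; adjust to whatever identifies a transaction.
key_columns = ["step", "nameOrig", "nameDest", "amount"]
key_duplicates = int(sample.duplicated(subset=key_columns).sum())

print(f"Full-row duplicates: {full_row_duplicates}")
print(f"Duplicates on key columns: {key_duplicates}")
```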


#### 2.7. Save Validated Dataset

The checks above found no missing values and no duplicated rows, so the data needs no changes.
We save the validated dataset to a new folder to preserve a verified copy for future analysis or model training.

In [9]:
os.makedirs("./data/2/", exist_ok=True)
df.to_parquet("./data/2/df.parquet", index=False)