# Libraries Used for Tabular Preprocessing

In [4]:
import shutil
import os
import kagglehub
import pandas as pd

## I. Downloading the dataset

In this step, we download the selected tabular dataset directly from **Kaggle** using the `kagglehub` library. By default, `kagglehub.dataset_download` retrieves the **latest available version** of the dataset, so the version actually used (version 1 in the download log below) should be recorded to keep results reproducible and consistent across environments.

After downloading, all dataset files are copied into a predefined project directory structure (`data/tabular/`). This organization helps maintain a clean and standardized layout for data preprocessing experiments.

In [2]:
# Download latest version
path = kagglehub.dataset_download("kartik2112/fraud-detection")

# Copy files into the project data directory
destination = "../data/tabular"
os.makedirs(destination, exist_ok=True)  # no-op if the directory already exists

for file in os.listdir(path):
    src = os.path.join(path, file)
    dst = os.path.join(destination, file)
    if os.path.isfile(src):  # skip subdirectories, which shutil.copy2 cannot copy
        shutil.copy2(src, dst)

print(f"Dataset copied to {os.path.abspath(destination)}")

Downloading from https://www.kaggle.com/api/v1/datasets/download/kartik2112/fraud-detection?dataset_version_number=1...
100%|██████████| 202M/202M [00:13<00:00, 15.2MB/s]
Extracting files...
Dataset copied to d:\Data\Learning\University\Year3\Semester 8\Data Mining\Current Semester\Preprocessing_Methods\data\tabular

After downloading and organizing the dataset, the following files are available in the `data/tabular/` directory:

```
data/tabular/
├── fraudTrain.csv
└── fraudTest.csv
```

### 1. `fraudTrain.csv`

This file contains the **training dataset**, which is used to explore the data distribution and perform preprocessing techniques such as handling missing values, normalization, categorical encoding, and feature selection.

* Represents historical transaction records used for model training.
* Includes both **numerical** and **categorical** attributes related to transactions and customers.
* Contains the **target variable** indicating whether a transaction is fraudulent or legitimate.
* Used as the primary dataset for analyzing preprocessing effects and feature behavior.

### 2. `fraudTest.csv`

This file contains the **testing dataset**, which is separated from the training data to simulate unseen data.

* Has the same schema and feature structure as `fraudTrain.csv`.
* Used to validate preprocessing consistency and evaluate how preprocessing decisions generalize to new data.
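* Must not be used to fit preprocessing parameters (imputation values, scaling statistics, encoding maps): these are learned from `fraudTrain.csv` and then applied unchanged to the test set, which prevents information leakage from the test data.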
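
To keep the test set free of leakage, preprocessing parameters are typically estimated on the training split only and then reused on the test split. The sketch below illustrates this with a z-score on `amt`; the two small frames are made-up stand-ins for `fraudTrain.csv` and `fraudTest.csv`, and the names `amt_mean`, `amt_std`, and `amt_scaled` are chosen here for illustration only.

```python
import pandas as pd

# Tiny synthetic stand-ins for fraudTrain.csv / fraudTest.csv (values are made up).
train = pd.DataFrame({"amt": [10.0, 20.0, 30.0, 40.0]})
test = pd.DataFrame({"amt": [25.0, 100.0]})

# Fit: learn the scaling parameters from the training data only.
amt_mean = train["amt"].mean()
amt_std = train["amt"].std(ddof=0)  # population standard deviation

# Transform: reuse the same training parameters on both splits.
train["amt_scaled"] = (train["amt"] - amt_mean) / amt_std
test["amt_scaled"] = (test["amt"] - amt_mean) / amt_std

print(train)
print(test)
```

Because the test values never influence `amt_mean` or `amt_std`, the scaled test column can legitimately fall outside the range seen in training, which is what would happen with genuinely unseen transactions.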

## II. Loading the datasets

In [6]:
data = pd.read_csv(os.path.join(destination, "fraudTrain.csv"))

### Inspecting basic information about the dataset

In [8]:
# Display basic information about the dataset
print("Dataset Shape:", data.shape)
print("\n" + "="*50)
print("Column Names and Data Types:")
print(data.dtypes)
print("\n" + "="*50)

# Check for missing values
print("Missing Values:")
print(data.isnull().sum())
print("\n" + "="*50)

# # Basic statistics for numerical columns
# print("Numerical Statistics:")
# print(data.describe())
# print("\n" + "="*50)

# # Check class distribution (fraud vs non-fraud)
# print("Fraud Distribution:")
# print(data['is_fraud'].value_counts())
# print("\nFraud Percentage:")
# print(data['is_fraud'].value_counts(normalize=True) * 100)
# print("\n" + "="*50)

# Check numerical columns
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns
print(f"Numerical Columns ({len(numerical_cols)}):")
print(list(numerical_cols))
print("\n" + "="*50)

# Check categorical columns
categorical_cols = data.select_dtypes(include=['object']).columns
print(f"Categorical Columns ({len(categorical_cols)}):")
print(list(categorical_cols))
print("\n" + "="*50)

# # Display unique values for some key categorical columns
# key_columns = ['category', 'gender', 'state']
# for col in key_columns:
#     if col in data.columns:
#         print(f"\nUnique values in '{col}': {data[col].nunique()}")
#         print(data[col].value_counts().head(10))

Dataset Shape: (1296675, 23)

Column Names and Data Types:
Unnamed: 0                 int64
trans_date_trans_time     object
cc_num                     int64
merchant                  object
category                  object
amt                      float64
first                     object
last                      object
gender                    object
street                    object
city                      object
state                     object
zip                        int64
lat                      float64
long                     float64
city_pop                   int64
job                       object
dob                       object
trans_num                 object
unix_time                  int64
merch_lat                float64
merch_long               float64
is_fraud                   int64
dtype: object

Missing Values:
Unnamed: 0               0
trans_date_trans_time    0
cc_num                   0
merchant                 0
category                 0
amt             

## III. Handling Missing Values
Identify patterns of missing data: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Then apply appropriate imputation techniques (mean, median, mode, forward/backward fill, K-NN imputation). Finally, compare the impact of the different imputation strategies on data quality and distribution.
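
As a rough illustration of how the simpler strategies differ, the sketch below imputes a small synthetic frame using pandas only. The data is made up (the visible part of the missing-value report above shows zeros for the real columns), and the variable names are hypothetical. K-NN imputation is not shown; it would typically rely on scikit-learn's `KNNImputer`.

```python
import numpy as np
import pandas as pd

# Synthetic example with gaps; 1000.0 acts as an outlier in `amt`.
df = pd.DataFrame({
    "amt": [10.0, np.nan, 30.0, 1000.0, np.nan],
    "category": ["grocery", "travel", None, "grocery", "grocery"],
})

# Share of missing values per column.
missing_share = df.isnull().mean()
print(missing_share)

# Numerical column: mean vs. median vs. forward/backward fill.
mean_filled = df["amt"].fillna(df["amt"].mean())
median_filled = df["amt"].fillna(df["amt"].median())
ffilled = df["amt"].ffill()
bfilled = df["amt"].bfill()  # a trailing gap stays NaN: nothing follows it

# Categorical column: most frequent value (mode).
mode_filled = df["category"].fillna(df["category"].mode().iloc[0])

# Side-by-side comparison of the numerical strategies.
comparison = pd.DataFrame({
    "original": df["amt"],
    "mean": mean_filled,
    "median": median_filled,
    "ffill": ffilled,
    "bfill": bfilled,
})
print(comparison)
print(comparison.describe().loc[["mean", "std"]])
```

In this toy example the outlier pulls the mean-imputed values far above the typical amounts, while the median stays close to them, which is the kind of distributional effect the comparison step is meant to expose.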

## IV. Data Normalization
Apply Min-Max scaling, standardization (Z-score normalization), and robust scaling for data with outliers. Then compare distributions before and after normalization using appropriate visualizations.
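
The three scalers can be written out directly in pandas, which makes their formulas explicit. This is a minimal sketch on a made-up series with one outlier, not the notebook's actual pipeline; the variable names are hypothetical.

```python
import pandas as pd

# Synthetic amounts; 100.0 plays the role of an outlier.
amt = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="amt")

# Min-Max scaling: maps the observed range onto [0, 1].
min_max = (amt - amt.min()) / (amt.max() - amt.min())

# Standardization (z-score): zero mean, unit (population) standard deviation.
z_score = (amt - amt.mean()) / amt.std(ddof=0)

# Robust scaling: center on the median, scale by the interquartile range.
q1, q3 = amt.quantile(0.25), amt.quantile(0.75)
robust = (amt - amt.median()) / (q3 - q1)

scaled = pd.DataFrame({
    "amt": amt,
    "min_max": min_max,
    "z_score": z_score,
    "robust": robust,
})
print(scaled.round(3))
```

With Min-Max scaling the outlier compresses the four ordinary values into a narrow band near 0, whereas robust scaling keeps them evenly spread around 0 and leaves the outlier as a large value. Histograms or box plots of each column would show the same contrast visually.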

## V. Categorical Encoding
Identify categorical variables requiring encoding. Apply suitable encoding for each type of categorical variable, such as one-hot encoding for nominal variables and ordinal encoding for ordinal variables. Discuss strategies for handling high-cardinality categorical features.
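
The sketch below shows one possible encoding per variable type using pandas. The `gender` and `merchant` column names come from the dataset, but the values are made up; `risk_level` is a hypothetical ordinal column added only for illustration, since the dataset's columns are mostly nominal. Frequency encoding is used here as one common option for high-cardinality features, not as the required method.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],                   # nominal, low cardinality
    "risk_level": ["low", "high", "medium", "low"],   # hypothetical ordinal column
    "merchant": ["m_a", "m_b", "m_a", "m_c"],         # stand-in for high cardinality
})

# One-hot encoding for a nominal variable.
one_hot = pd.get_dummies(df["gender"], prefix="gender", dtype=int)

# Ordinal encoding with an explicit category order.
order = ["low", "medium", "high"]
risk_code = pd.Categorical(df["risk_level"], categories=order, ordered=True).codes

# Frequency encoding: replace each category by its relative frequency,
# which avoids creating one column per category.
merchant_freq = df["merchant"].map(df["merchant"].value_counts(normalize=True))

encoded = one_hot.assign(risk_level=risk_code, merchant_freq=merchant_freq)
print(encoded)
```

On the real data, the frequency map would be computed from the training set and reused for the test set, with a fallback value for categories that never appeared in training.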

## VI. Feature Selection
Choose a suitable feature selection method. For example, calculate the correlation matrix, use variance threshold, apply feature importance from tree-based models, or implement recursive feature elimination (RFE). Then compare the selected feature sets and justify your final selection.
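
Two of the listed filters, variance threshold and correlation-based pruning, can be sketched with pandas and NumPy alone. The data below is randomly generated, the thresholds (`1e-8` and `0.95`) are arbitrary assumptions, and `amt_copy` and `constant` are hypothetical columns built to trigger each filter. Tree-based importance and RFE are not shown; they would typically use scikit-learn.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
amt = rng.normal(50, 10, n)

X = pd.DataFrame({
    "amt": amt,
    "amt_copy": amt * 2 + 1,                 # linear copy of amt (redundant)
    "city_pop": rng.normal(1000, 300, n),    # independent feature
    "constant": np.ones(n),                  # zero variance, carries no signal
})

# Variance threshold: drop features that (almost) never change.
variances = X.var()
kept = variances[variances > 1e-8].index.tolist()

# Correlation filter: look only at the upper triangle so each pair is seen once,
# then drop the later feature of any pair with |correlation| above 0.95.
corr = X[kept].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

selected = [col for col in kept if col not in to_drop]
print("Kept after variance filter:", kept)
print("Dropped as redundant:", to_drop)
print("Selected features:", selected)
```

Comparing `selected` with the sets produced by other methods gives a concrete basis for justifying the final feature list.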