# Preprocessing

This Jupyter notebook builds a scikit-learn pipeline for our data preprocessing. The steps are motivated by the EDA and are split by feature type:

- Numerical features
- High-cardinality categorical features
- Low-cardinality categorical features
- Boolean categorical features

In [44]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Binarizer, FunctionTransformer, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
import seaborn as sns

In [60]:
# Data collection
total_df = pd.read_csv('./Data/Base.csv')

# Split the DataFrame into training and test sets, stratifying on 'fraud_bool'
# so that both splits keep the same fraud rate
train_df, test_df = train_test_split(total_df, test_size=0.2, stratify=total_df['fraud_bool'], random_state=42)

# Validate the size of the data
train_shape = train_df.shape
test_shape = test_df.shape
print(f"The training data has {train_shape[0]} rows and {train_shape[1]} columns.")
print(f"The testing data has {test_shape[0]} rows and {test_shape[1]} columns.")

The training data has 800000 rows and 32 columns.
The testing data has 200000 rows and 32 columns.
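
The row counts alone do not confirm that stratification worked. A minimal sketch of such a check, using a hypothetical toy DataFrame (the `feature` column and the 2% positive rate are assumptions made for illustration, not properties of `Base.csv`):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows with a 2% positive rate in 'fraud_bool'
toy_df = pd.DataFrame({
    "feature": range(1000),
    "fraud_bool": [1] * 20 + [0] * 980,
})

toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df["fraud_bool"], random_state=42
)

# With stratification, the positive rate should be (nearly) equal in both splits
train_rate = toy_train["fraud_bool"].mean()
test_rate = toy_test["fraud_bool"].mean()
print(f"train fraud rate: {train_rate:.3f}, test fraud rate: {test_rate:.3f}")
```

The same two `.mean()` calls can be run on `train_df` and `test_df` to compare the fraud rates of the real splits.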


### Feature transformations for numerical features

The numerical features are transformed as follows:

name_email_similarity:
- Convert into 5 bins: 4 of width 0.24 and 1 of width 0.04 (0.96 to 1)
- There are no outliers

days_since_request:
- Convert into 3 bins: [0, Q2), [Q2, Q3 + 1.5 * IQR), and [Q3 + 1.5 * IQR, ∞)

intended_balcon_amount:
- Replace with a boolean feature indicating whether the amount is positive

velocity_6h, velocity_24h, velocity_4w:
- Remove all

session_length_in_minutes:
- Remove
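
The binning rules above can be illustrated on a few hand-picked values. This is only a sketch: the sample arrays are hypothetical, and passing just the interior edges to `np.digitize` (so that labels start at 0) is one possible convention, not necessarily the one used later in this notebook.

```python
import numpy as np

# Fixed bins for a similarity score in [0, 1]; interior edges only (an assumption
# of this sketch), so labels run from 0 to 4
similarity_edges = np.array([0.24, 0.48, 0.72, 0.96])
similarity = np.array([0.0, 0.1, 0.24, 0.5, 0.95, 0.96, 1.0])
similarity_bins = np.digitize(similarity, bins=similarity_edges)
print(similarity_bins)  # [0 0 1 2 3 4 4]

# Data-driven bins: median and upper fence (Q3 + 1.5 * IQR) from hypothetical values
days = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)
q1, q2, q3 = np.quantile(days, [0.25, 0.5, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
days_bins = np.digitize(days, bins=np.array([q2, upper_fence]))
print(q2, upper_fence)  # 5.0 13.0
print(days_bins)        # [0 0 0 0 1 1 1 1 2]

# The balance amount becomes a flag that is True only for strictly positive values
balcon = np.array([-1.0, 0.0, 25.0])
balcon_positive = balcon > 0
print(balcon_positive)  # [False False  True]
```

Note that a value exactly on an edge (0.24 or 0.96 here) falls into the higher bin, because `np.digitize` uses half-open intervals `[left, right)` by default.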

In [73]:
# Custom transformer for binning 'name_email_similarity' into 5 ordinal bins (0-4)
class NameEmailSimilarityBinner(BaseEstimator, TransformerMixin):
    # Interior edges only: np.digitize maps values below 0.24 to bin 0 and
    # values of 0.96 and above (including exactly 1.0) to bin 4
    bin_edges = np.array([0.24, 0.48, 0.72, 0.96])

    def fit(self, X, y=None):
        return self  # Nothing to learn: the edges are fixed

    def transform(self, X):
        # np.asarray accepts both DataFrames and NumPy arrays
        return np.digitize(np.asarray(X).ravel(), bins=self.bin_edges).reshape(-1, 1)

# Custom transformer for binning 'days_since_request' into 3 ordinal bins (0-2)
class DaysSinceRequestBinner(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Learn the median and the upper outlier fence from the training data only
        q1, q2, q3 = np.quantile(np.asarray(X).ravel(), [0.25, 0.5, 0.75])
        self.median_ = q2
        self.upper_fence_ = q3 + 1.5 * (q3 - q1)
        return self

    def transform(self, X):
        # Bins: below the median, median up to the fence, fence and above
        bin_edges = np.array([self.median_, self.upper_fence_])
        return np.digitize(np.asarray(X).ravel(), bins=bin_edges).reshape(-1, 1)

# Columns that the plan above removes entirely
COLUMNS_TO_DROP = ["velocity_6h", "velocity_24h", "velocity_4w", "session_length_in_minutes"]

def drop_unused_columns(X):
    # A named function (rather than a lambda) keeps the pipeline picklable
    return X.drop(columns=COLUMNS_TO_DROP)

# Pipeline to drop unused columns and apply transformations while keeping other columns
numerical_preprocessor = Pipeline([
    ('drop_unused_columns', FunctionTransformer(drop_unused_columns)),
    ('transformer', ColumnTransformer([
        ('name_email_similarity_binned', NameEmailSimilarityBinner(), ['name_email_similarity']),
        ('days_since_request_binned', DaysSinceRequestBinner(), ['days_since_request']),
        ('intended_balcon_amount_positive', Binarizer(threshold=0.0), ['intended_balcon_amount'])
    ], remainder='passthrough'))  # Keep other columns as they are
])

### Feature transformations for high-cardinality categorical features

### Feature transformations for low-cardinality categorical features

### Feature transformations for boolean categorical features

In [None]:
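# TODO: should this be moved to the data preprocessing pipeline?

# One-hot encoding for categorical variables
# NOTE: cat_df is not defined in this notebook; it must hold the categorical columns
encoded_train_df = pd.get_dummies(train_df, columns=cat_df.columns, drop_first=True)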

In [None]:
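# TODO: should this be moved to the data preprocessing pipeline?

# Apply log transformation
# NOTE: velocity_6h and velocity_24h are marked for removal in the numerical feature plan
for column in ['velocity_6h', 'velocity_24h', 'zip_count_4w']:
    train_df[column] = np.log1p(train_df[column])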