# Preprocessing

This Jupyter notebook builds a scikit-learn pipeline for our data preprocessing. The steps are motivated by the EDA and are split by feature type:

- Numerical features
- High-cardinality categorical features
- Low-cardinality categorical features
- Boolean categorical features
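
As a rough illustration of how columns might be sorted into these four groups, here is a minimal sketch. The helper `group_features`, the cardinality threshold, and the toy column names are all hypothetical; the actual grouping in this project comes from the EDA, not from a rule like this.

```python
import pandas as pd


def group_features(df, target, threshold=10):
    """Assign each non-target column to one of four feature groups.

    Hypothetical helper: `threshold` is an assumed cut-off above which a
    categorical column counts as high-cardinality.
    """
    groups = {"numerical": [], "high_cardinality": [], "low_cardinality": [], "boolean": []}
    for column in df.columns.drop(target):
        n_unique = df[column].nunique()
        if n_unique <= 2:
            groups["boolean"].append(column)
        elif pd.api.types.is_numeric_dtype(df[column]):
            groups["numerical"].append(column)
        elif n_unique > threshold:
            groups["high_cardinality"].append(column)
        else:
            groups["low_cardinality"].append(column)
    return groups


# Toy frame; the column names are illustrative only
toy_df = pd.DataFrame({
    "amount": [1.5, 2.0, 3.5, 4.0],
    "device": ["a", "b", "c", "d"],
    "plan": ["x", "y", "z", "x"],
    "is_free_mail": [0, 1, 0, 1],
    "fraud_bool": [0, 0, 1, 0],
})
feature_groups = group_features(toy_df, target="fraud_bool", threshold=3)
print(feature_groups)
```

A rule of this kind is only a starting point; numeric columns with few distinct values, for example, may still be better treated as categorical.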

In [21]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
import seaborn as sns

In [12]:
# Data collection
total_df = pd.read_csv('./Data/Base.csv')

# Split the DataFrame into training and test sets using stratified sampling to maintain anomaly distribution
train_df, test_df = train_test_split(total_df, test_size=0.2, stratify=total_df['fraud_bool'], random_state=42)

# Validate the size of the data
train_shape = train_df.shape
test_shape = test_df.shape
print(f"The training data has {train_shape[0]} rows and {train_shape[1]} columns.")
print(f"The testing data has {test_shape[0]} rows and {test_shape[1]} columns.")

The training data has 800000 rows and 32 columns.
The testing data has 200000 rows and 32 columns.
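
To see what the `stratify` argument does, the sketch below repeats the split on a small synthetic frame and compares the positive rate in both parts. The toy data and its 2% positive rate are assumptions chosen for illustration and are not taken from `Base.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows with a 2% positive rate
toy_df = pd.DataFrame({
    "feature": np.arange(1000),
    "fraud_bool": np.r_[np.ones(20, dtype=int), np.zeros(980, dtype=int)],
})

toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df["fraud_bool"], random_state=42
)

# With stratification the positive rate should match across both splits
train_rate = toy_train["fraud_bool"].mean()
test_rate = toy_test["fraud_bool"].mean()
print(f"train: {train_rate:.3f}, test: {test_rate:.3f}")
```

The same check can be run on `train_df` and `test_df` by comparing `train_df['fraud_bool'].mean()` with `test_df['fraud_bool'].mean()`.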


### Feature transformations for numerical features

The numerical features are transformed as follows:

name_email_similarity:
- Convert into 5 bins: four of width 0.24 covering [0, 0.96) and one of width 0.04 covering [0.96, 1]
- There are no outliers

days_since_request:
- Apply a log transform, then clip outliers
- Add a boolean feature called is_days_since_request_outlier

intended_balcon_amount:
- Apply a log transform, then clip outliers
- Add a boolean feature called has_positive_account

velocity_6h, velocity_24h, velocity_4w:
- Remove all

session_length_in_minutes:
- Remove
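
The binning and outlier steps listed above can be tried on small arrays first. This is a minimal sketch with made-up values, not output from the real data; it assumes outliers are defined by the usual upper bound of Q3 + 1.5 * IQR.

```python
import numpy as np

# Binning: the four interior edges give bin indices 0..4 for values in [0, 1]
interior_edges = np.array([0.24, 0.48, 0.72, 0.96])
similarity = np.array([0.10, 0.30, 0.50, 0.80, 0.97, 1.00])
bin_index = np.digitize(similarity, bins=interior_edges, right=False)
print(bin_index)  # [0 1 2 3 4 4]

# Outlier flag on the raw values, using the upper IQR bound
days = np.array([0.0, 1.0, 2.0, 3.0, 1000.0])
q1, q3 = np.percentile(days, [25, 75])
is_outlier = (days > q3 + 1.5 * (q3 - q1)).astype(int)
print(is_outlier)  # [0 0 0 0 1]

# Log transform, then clip at the upper IQR bound of the logged values
logged = np.log1p(days)
log_q1, log_q3 = np.percentile(logged, [25, 75])
upper_bound = log_q3 + 1.5 * (log_q3 - log_q1)
clipped = np.minimum(logged, upper_bound)
print(clipped.max() <= upper_bound)  # True
```

Note that `np.log1p` is only defined for inputs greater than -1, so a column containing values at or below -1 would need shifting or masking before the log step.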

In [27]:
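# Custom binning function with dynamic bins.
# Only the interior edges are passed to np.digitize so that values equal to
# the last edge (1.0) fall into the final bin instead of an extra one.
def binner(X, bins):
    return np.digitize(X, bins=bins[1:-1], right=False)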

# Create boolean feature for positive threshold
def bool_from_positives(X):
    return (X > 0).astype(int)

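# Log transformation (np.log1p returns NaN for inputs below -1 and -inf at -1,
# so columns with such values need handling before this step)
def log_transform(X):
    return np.log1p(X)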

# Clip upper outliers at Q3 + 1.5 * IQR (bounds are recomputed from whatever data is passed in)
def clip_outliers(X):
    Q1, Q3 = np.percentile(X, [25, 75], axis=0)
    IQR = Q3 - Q1
    upper_bound = Q3 + 1.5 * IQR
    return np.minimum(X, upper_bound)

# Create outlier indicator (Ensures output is 0 or 1)
def create_outlier_indicator(X):
    Q1, Q3 = np.percentile(X, [25, 75], axis=0)
    IQR = Q3 - Q1
    threshold = Q3 + 1.5 * IQR
    return (X > threshold).astype(int)

# Define bins for name_email_similarity
similarity_bins = np.array([0.0, 0.24, 0.48, 0.72, 0.96, 1.0])

# Pipeline for name_email_similarity
# (kw_args is used instead of a lambda so the pipeline can be pickled)
name_email_similarity_pipeline = Pipeline(steps=[
    ('binner', FunctionTransformer(func=binner, kw_args={'bins': similarity_bins}))
])

# Pipeline for days_since_request outlier indicator
days_since_request_outlier_pipeline = Pipeline(steps=[
    ('outlier_indicator', FunctionTransformer(func=create_outlier_indicator))
])

# Pipeline for days_since_request numerical transformations
days_since_request_numeric_pipeline = Pipeline(steps=[
    ('log_transform', FunctionTransformer(func=log_transform)),
    ('clip_outliers', FunctionTransformer(func=clip_outliers)),
    ('scaler', StandardScaler())
])

# Pipeline for intended_balcon_amount positive indicator
intended_balcon_amount_positive_pipeline = Pipeline(steps=[
    ('positive_indicator', FunctionTransformer(func=bool_from_positives))
])

# Pipeline for intended_balcon_amount numerical transformations
intended_balcon_amount_numeric_pipeline = Pipeline(steps=[
    ('log_transform', FunctionTransformer(func=log_transform)),
    ('clip_outliers', FunctionTransformer(func=clip_outliers)),
    ('scaler', StandardScaler())
])

# Combine all pipelines into a ColumnTransformer
numerical_preprocessor = ColumnTransformer(transformers=[
    # Binning for name_email_similarity
    ('name_email_similarity_binned', name_email_similarity_pipeline, ['name_email_similarity']),
    
    # Outlier indicator for days_since_request
    ('days_since_request_outlier', days_since_request_outlier_pipeline, ['days_since_request']),
    
    # Numerical transformations for days_since_request
    ('days_since_request_numeric', days_since_request_numeric_pipeline, ['days_since_request']),
    
    # Positive indicator for intended_balcon_amount
    ('intended_balcon_amount_positive', intended_balcon_amount_positive_pipeline, ['intended_balcon_amount']),
    
    # Numerical transformations for intended_balcon_amount
    ('intended_balcon_amount_numeric', intended_balcon_amount_numeric_pipeline, ['intended_balcon_amount'])
])

# Fit and transform the data
preprocessed_train_data = numerical_preprocessor.fit_transform(train_df)

# Define appropriate column names based on the transformations applied
new_columns = [
    'name_email_similarity_binned',               # From the binner in name_email_similarity_pipeline
    'days_since_request_outlier',                 # Boolean outlier indicator
    'days_since_request_transformed_scaled',      # Log transformed, clipped, and scaled days_since_request
    'intended_balcon_amount_positive',            # Boolean positive indicator
    'intended_balcon_amount_transformed_scaled'   # Log transformed, clipped, and scaled intended_balcon_amount
]

# Convert the preprocessed data (NumPy array) to a DataFrame
preprocessed_train_df = pd.DataFrame(preprocessed_train_data, columns=new_columns)

# Show the complete preprocessed DataFrame
preprocessed_train_df.head()





| | name_email_similarity_binned | days_since_request_outlier | days_since_request_transformed_scaled | intended_balcon_amount_positive | intended_balcon_amount_transformed_scaled |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.358111 | 1.0 | NaN |
| 1 | 3.0 | 0.0 | -0.318824 | 1.0 | NaN |
| 2 | 2.0 | 0.0 | 1.687875 | 0.0 | NaN |
| 3 | 3.0 | 0.0 | -0.87395 | 0.0 | NaN |
| 4 | 4.0 | 0.0 | -0.533002 | 0.0 | NaN |
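
The `clip_outliers` and `create_outlier_indicator` functions in the cell above recompute the quartiles on whatever array they receive, so the test set would be clipped against its own quartiles rather than those of the training set. One possible refinement is a small stateful transformer that learns the bound in `fit`. The class `IQRClipper` below is a hypothetical helper, not part of this notebook, and is only a sketch of that idea.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class IQRClipper(BaseEstimator, TransformerMixin):
    """Clip values above Q3 + factor * IQR, with the bound learned in fit.

    Hypothetical helper for illustration only.
    """

    def __init__(self, factor=1.5):
        self.factor = factor

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        q1, q3 = np.percentile(X, [25, 75], axis=0)
        # Store the per-column upper bound computed from the training data
        self.upper_bound_ = q3 + self.factor * (q3 - q1)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.minimum(X, self.upper_bound_)


train_values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
test_values = np.array([[0.0], [6.0], [100.0]])

clipper = IQRClipper().fit(train_values)
clipped_test = clipper.transform(test_values)
print(clipper.upper_bound_)  # [7.]
print(clipped_test.ravel())  # [0. 6. 7.]
```

Here the training quartiles give an upper bound of 7, so the test value 100 is clipped to 7. Recomputing the quartiles on the three test values alone would give a bound of 128 and leave 100 unchanged. Such a class could replace the `FunctionTransformer(func=clip_outliers)` steps if consistent bounds between training and test data are wanted.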


### Feature transformations for high-cardinality categorical features

### Feature transformations for low-cardinality categorical features

### Feature transformations for boolean categorical features

In [None]:
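# TODO: should this be moved to data preprocessing?

# One-hot encoding for categorical variables
# (cat_df is not defined in this notebook; selecting the object-dtype columns
# here is an assumption about what it contained)
categorical_columns = train_df.select_dtypes(include='object').columns
encoded_train_df = pd.get_dummies(train_df, columns=categorical_columns, drop_first=True)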
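
If the one-hot step is moved into the preprocessing pipeline, scikit-learn's `OneHotEncoder` is one option, because it remembers the categories seen during `fit` and produces the same columns for the test set. The sketch below uses a toy frame; the column name and category values are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data; column names and categories are illustrative only
toy_train = pd.DataFrame({"payment_type": ["AA", "AB", "AC", "AA"]})
toy_test = pd.DataFrame({"payment_type": ["AB", "AA"]})

# drop='first' mirrors drop_first=True in pd.get_dummies
encoder = OneHotEncoder(drop="first")
encoder.fit(toy_train)

# The encoder returns a sparse matrix by default, so convert it for display
encoded_test = encoder.transform(toy_test).toarray()
print(list(encoder.categories_[0]))
print(encoded_test)
```

The test frame never contains the category AC, yet the encoded output still has one column for AB and one for AC, which `pd.get_dummies` applied to the test frame alone would not guarantee.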

In [None]:
# TODO: should this be moved to data preprocessing?

# Apply log transformation
for column in ['velocity_6h', 'velocity_24h', 'zip_count_4w']:
    train_df[column] = np.log1p(train_df[column])
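
The same log transformation could also be expressed as a pipeline step instead of modifying `train_df` in place. This is a minimal sketch on a toy frame with made-up values; the transformer name `"log"` and the variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Toy frame with the same column names as in the loop above; values are made up
toy_df = pd.DataFrame({
    "velocity_6h": [0.0, 9.0, 99.0],
    "velocity_24h": [0.0, 99.0, 999.0],
    "zip_count_4w": [0.0, 9.0, 999.0],
})
log_columns = ["velocity_6h", "velocity_24h", "zip_count_4w"]

# Apply log1p to the selected columns; the input frame is left untouched
log_preprocessor = ColumnTransformer(transformers=[
    ("log", FunctionTransformer(func=np.log1p), log_columns),
])
logged = log_preprocessor.fit_transform(toy_df)
print(logged.shape)  # (3, 3)
```

Keeping the step inside a `ColumnTransformer` means the identical transformation is applied to the test set when the fitted preprocessor is reused.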