# Preprocessing

This Jupyter notebook builds a scikit-learn pipeline for our data preprocessing. The steps are motivated by the EDA and are split by feature type:

- Numerical features
- High-cardinality categorical features
- Low-cardinality categorical features
- Boolean categorical features

In [44]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import Binarizer
import seaborn as sns

In [60]:
# Data collection
total_df = pd.read_csv('./Data/Base.csv')

# Split the DataFrame into training and test sets using stratified sampling to maintain anomaly distribution
train_df, test_df = train_test_split(total_df, test_size=0.2, stratify=total_df['fraud_bool'], random_state=42)

# Validate the size of the data
train_shape = train_df.shape
test_shape = test_df.shape
print(f"The training data has {train_shape[0]} rows and {train_shape[1]} columns.")
print(f"The testing data has {test_shape[0]} rows and {test_shape[1]} columns.")

The training data has 800000 rows and 32 columns.
The testing data has 200000 rows and 32 columns.
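
A quick way to confirm that stratification preserves the class ratio is to compare the positive rate in both splits. The sketch below uses a small synthetic frame (the `toy_df` name and its 2% positive rate are assumptions for illustration, not properties of `Base.csv`).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows with a 2% positive rate
toy_df = pd.DataFrame({
    "feature": np.arange(1000),
    "fraud_bool": [1] * 20 + [0] * 980,
})

toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df["fraud_bool"], random_state=42
)

# With stratification, the positive rate should match in both splits
train_rate = toy_train["fraud_bool"].mean()
test_rate = toy_test["fraud_bool"].mean()
print(f"Train fraud rate: {train_rate:.3f}, test fraud rate: {test_rate:.3f}")
```

The same two `.mean()` calls can be run on `train_df` and `test_df` to check the real split.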


### Feature transformations for numerical features

The numerical features will be transformed as follows:

name_email_similarity:
- Convert into 5 bins: 4 of width 0.24 and 1 of width 0.04 (0.96-1)
- There are no outliers

days_since_request:
- Convert into 3 bins: [0, Q2), [Q2, Q3 + 1.5 * IQR), and [Q3 + 1.5 * IQR, ∞)

intended_balcon_amount:
- Replace with a Boolean feature that flags a positive amount

velocity_6h, velocity_24h, velocity_4w:
- Remove all

session_length_in_minutes:
- Remove
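
The binning rules above can be checked on a handful of values before wiring them into transformers. This is a minimal sketch: the helper names `bin_similarity` and `fit_days_edges` and the sample values are made up for illustration, and clipping is one possible way to keep an exact 1.0 in the fifth bin.

```python
import numpy as np

# Fixed edges for name_email_similarity: four bins of width 0.24 plus [0.96, 1.0]
similarity_edges = np.array([0.0, 0.24, 0.48, 0.72, 0.96, 1.0])

def bin_similarity(values):
    """Return bin labels 1..5; clipping keeps an exact 1.0 in the last bin."""
    labels = np.digitize(values, bins=similarity_edges, right=False)
    return np.clip(labels, 1, len(similarity_edges) - 1)

def fit_days_edges(values):
    """Compute the edges [-inf, Q2, Q3 + 1.5 * IQR, inf] from the given values."""
    q1, q2, q3 = np.quantile(values, [0.25, 0.5, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    return np.array([-np.inf, q2, upper_fence, np.inf])

similarity_sample = np.array([0.0, 0.23, 0.5, 0.96, 1.0])
similarity_labels = bin_similarity(similarity_sample)

days_sample = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 100.0])  # made-up values
days_edges = fit_days_edges(days_sample)
days_labels = np.digitize(days_sample, bins=days_edges, right=False)

print(similarity_labels)
print(days_edges)
print(days_labels)
```

The quantile-based edges should be computed on the training data only and then reused for the test data, which is why they belong in `fit` rather than `transform`.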

In [73]:
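# Custom transformer for binning 'name_email_similarity'
class NameEmailSimilarityBinner(BaseEstimator, TransformerMixin):
    def __init__(self):
        # Four bins of width 0.24 plus a final bin covering [0.96, 1.0]
        self.bin_edges = np.array([0.0, 0.24, 0.48, 0.72, 0.96, 1.0])

    def fit(self, X, y=None):
        # The edges are fixed, so there is nothing to learn
        return self

    def transform(self, X):
        bin_ids = np.digitize(X.to_numpy().flatten(), bins=self.bin_edges, right=False)
        # Clip so that an exact 1.0 stays in the fifth bin instead of opening a sixth
        return np.clip(bin_ids, 1, len(self.bin_edges) - 1).reshape(-1, 1)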

# Custom transformer for binning 'days_since_request'
class DaysSinceRequestBinner(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X_series = X.to_numpy().flatten()
        self.Q2 = np.quantile(X_series, 0.5)
        self.upper_fence = np.quantile(X_series, 0.75) + 1.5 * (np.quantile(X_series, 0.75) - np.quantile(X_series, 0.25))
        return self

    def transform(self, X):
        bin_edges = np.array([-np.inf, self.Q2, self.upper_fence, np.inf])
        return np.digitize(X.to_numpy().flatten(), bins=bin_edges, right=False).reshape(-1, 1)

# Pipeline to drop velocity columns and apply transformations while keeping other columns
numerical_preprocessor = Pipeline([
    ('drop_velocity_columns', FunctionTransformer(lambda X: X.drop(columns=["velocity_6h", "velocity_24h", "velocity_4w"]))),
    ('transformer', ColumnTransformer([
        ('name_email_similarity_binned', NameEmailSimilarityBinner(), ['name_email_similarity']),
        ('days_since_request_binned', DaysSinceRequestBinner(), ['days_since_request']),
        ('intended_balcon_amount_positive', Binarizer(threshold=0.0), ['intended_balcon_amount'])
    ], remainder='passthrough'))  # Keep other columns as they are
])
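
Because the `ColumnTransformer` uses `remainder='passthrough'`, its output is a plain array in which the transformed columns come first and the untouched columns follow in their original order. The sketch below illustrates that ordering on a three-row toy frame (the values are made up; only the column names come from the dataset).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Binarizer

# Toy frame with one column to binarize and two columns to pass through
toy = pd.DataFrame({
    "income": [0.1, 0.9, 0.5],
    "intended_balcon_amount": [-1.0, 25.0, 0.0],
    "customer_age": [20, 40, 60],
})

toy_transformer = ColumnTransformer(
    [("balcon_positive", Binarizer(threshold=0.0), ["intended_balcon_amount"])],
    remainder="passthrough",
)
out = toy_transformer.fit_transform(toy)

# Column 0 is the binarized feature; columns 1 and 2 are income and customer_age
print(out)
```

Note that `Binarizer(threshold=0.0)` maps only strictly positive values to 1, so an amount of exactly 0 is flagged as 0.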

### Feature transformations for high-cardinality categorical features

### Feature transformations for low-cardinality categorical features
The categorical features will be transformed as follows:

customer_age:
- Shows linear relationship with fraud
- Use ordinal encoding to preserve order
- Higher age correlates with higher fraud risk

payment_type:
- Categories: AA, AB, AC, AD, AE
- Use one-hot encoding
- Shows category-specific risks with AC having highest fraud rate

employment_status:
- Categories: CA, CB, CC, CD, CE, CF, CG
- Use one-hot encoding
- "Unemployed" (CD) shows highest fraud rate (~5%)
- "Self-employed" (CC) shows moderate fraud rate (~2%)
- "Employed" (CA) shows lowest fraud rate (~0.8%)

housing_status:
- Categories: BA, BB, BC, BD, BE, BF, BG
- Use one-hot encoding
- "Other" and "Parents" categories show elevated fraud rates (3-4%)
- "Own" shows lowest fraud rate (~0.5%)

device_os:
- Categories: linux, macintosh, other, windows, x11
- Use one-hot encoding
- Windows shows slightly higher fraud rate
- Keep all categories as they show distinct risk profiles

device_distinct_emails_8w:
- Already numeric (values: -1, 0, 1, 2)
- Keep as is - pass through without transformation
- Shows clear pattern with fraud rates

month:
- Categories: 0-7
- Use one-hot encoding
- No strong relationship with fraud
- Keep for potential seasonal patterns

income:
- U-shaped relationship with fraud
- Create 5 bins: [0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8-1.0]
- Use one-hot encoding on bins
- Both very low and very high values show higher fraud rates (~4-5%)

proposed_credit_limit:
- U-shaped relationship with fraud
- Create 6 bins: [0-500, 500-1000, 1000-2000, 2000-5000, 5000-10000, 10000+]
- Use one-hot encoding on bins
- Both very low and very high values show higher fraud rates

In [None]:
class LowCardinalityPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
       
        # one-hot encoded features
        self.onehot_features = [
            'payment_type',
            'employment_status',
            'housing_status',
            'device_os',
            'month',
            'credit_bin', # proposed_credit_limit
            'income_bin' # income
        ]
        self.onehot_encoder = OneHotEncoder(drop='first', handle_unknown='ignore')
        
        # ordinal encoded feature because it shows a linear relationship with fraud
        self.ordinal_features = ['customer_age']
        self.ordinal_encoder = OrdinalEncoder()
        
        
    def create_bins(self, X):
        X_transformed = X.copy()
        
        # bin credit limit (U-shaped relationship)
        credit_bins = [0, 500, 1000, 2000, 5000, 10000, float('inf')]
        X_transformed['credit_bin'] = pd.cut(X['proposed_credit_limit'],
                                           bins=credit_bins,
                                           labels=['very_low', 'low', 'medium', 'high', 'very_high', 'extreme'],
                                           include_lowest=True)  # keep values equal to 0 in the first bin

        # bin income (U-shaped relationship)
        income_bins = [0, 0.2, 0.4, 0.6, 0.8, 1.0]
        X_transformed['income_bin'] = pd.cut(X['income'],
                                           bins=income_bins,
                                           labels=['very_low', 'low', 'medium', 'high', 'very_high'],
                                           include_lowest=True)  # keep values equal to 0 in the first bin
        
        return X_transformed
        
    def fit(self, X, y=None):
        
        # creating bins and fitting categorical features
        X_binned = self.create_bins(X)
        self.onehot_encoder.fit(X_binned[self.onehot_features])
        self.ordinal_encoder.fit(X[self.ordinal_features])
        
        return self
    
    def transform(self, X):
        # create bins
        X_transformed = self.create_bins(X)
        
        # transform one-hot features and convert the result to a dense array
        onehot_encoded = self.onehot_encoder.transform(X_transformed[self.onehot_features]).toarray()
        onehot_feature_names = self.onehot_encoder.get_feature_names_out(self.onehot_features)
        
        # transform ordinal feature (customer_age)
        ordinal_encoded = self.ordinal_encoder.transform(X[self.ordinal_features])
        
        # combine all transformed features
        transformed_array = np.hstack([onehot_encoded, ordinal_encoded])
        
        # create feature names
        feature_names = [*onehot_feature_names, *self.ordinal_features]
        
        # create final dataframe
        transformed_df = pd.DataFrame(
            transformed_array,
            columns=feature_names,
            index=X.index
        )
        
        # separately handle device_distinct_emails_8w since it's already numeric
        transformed_df['device_distinct_emails_8w'] = X['device_distinct_emails_8w']
        
        return transformed_df

### Feature transformations for boolean categorical features


For feature selection, we will retain features with strong associations to fraud and drop or combine weaker predictors. Since binary features generally don’t require scaling, they can move directly into modeling, though combining weaker features may enhance their predictive utility. With this strategy, we can optimize the dataset for fraud prediction while reducing unnecessary complexity.
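
One simple way to judge how strongly a Boolean feature is associated with fraud is to compare the fraud rate when the flag is 0 against the rate when it is 1. The helper below is a hypothetical sketch on made-up data, not the method used in the EDA; `fraud_rate_by_flag` and the toy values are assumptions.

```python
import pandas as pd

# Made-up data: email_is_free separates fraud, has_other_cards does not
toy = pd.DataFrame({
    "email_is_free":   [1, 1, 1, 1, 0, 0, 0, 0],
    "has_other_cards": [1, 0, 1, 0, 1, 0, 1, 0],
    "fraud_bool":      [1, 1, 0, 0, 0, 0, 0, 0],
})

def fraud_rate_by_flag(df, flag_columns, target="fraud_bool"):
    """Return the target mean for flag == 0 and flag == 1, one row per flag."""
    rows = {}
    for column in flag_columns:
        rates = df.groupby(column)[target].mean()
        rows[column] = {
            "rate_if_0": rates.get(0, float("nan")),
            "rate_if_1": rates.get(1, float("nan")),
        }
    summary = pd.DataFrame.from_dict(rows, orient="index")
    summary["abs_difference"] = (summary["rate_if_1"] - summary["rate_if_0"]).abs()
    # Largest gap first: these are the candidates to retain as-is
    return summary.sort_values("abs_difference", ascending=False)

summary = fraud_rate_by_flag(toy, ["email_is_free", "has_other_cards"])
print(summary)
```

Flags with a small gap are the candidates for dropping or combining. With a rare target such as fraud, the gap should be read alongside the group sizes, since a rate computed on few rows is noisy.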

### Summary of feature transformations for boolean features

fraud_bool
- Retain as it is the target variable

email_is_free
- Retain

phone_home_valid and phone_mobile_valid
- Combine into total_valid_phones, which counts the valid phone numbers (0, 1, or 2).

has_other_cards
- Retain

foreign_request and keep_alive_session
- Combine into foreign_long_session, which flags kept-alive sessions on foreign requests. These may indicate unusual behavior, such as accessing accounts from unfamiliar regions.

source
- One-hot encoding

device_fraud_count and keep_alive_session
- Combine into device_and_account_history, a feature that captures both device history and session longevity, potentially signaling abnormal use patterns.

total_risk_flags
- A derived aggregated fraud risk score using high-risk indicators (email_is_free, foreign_request, has_other_cards, and keep_alive_session)

In [None]:
# Custom transformer to create `total_valid_phones`
class TotalValidPhonesTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (X['phone_home_valid'] + X['phone_mobile_valid']).values.reshape(-1, 1)

# Custom transformer to create `foreign_long_session`
class ForeignLongSessionTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (X['foreign_request'] & X['keep_alive_session']).astype(int).values.reshape(-1, 1)

# Custom transformer to create `device_and_account_history`
class DeviceAccountHistoryTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (X['device_fraud_count'] * X['keep_alive_session']).values.reshape(-1, 1)

# Custom transformer to create `total_risk_flags`
class TotalRiskFlagsTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.high_risk_features = ['email_is_free', 'foreign_request', 'has_other_cards', 'keep_alive_session']

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.high_risk_features].sum(axis=1).values.reshape(-1, 1)

# Pipeline to create new features from Boolean data
boolean_feature_preprocessor = ColumnTransformer([
    ('total_valid_phones', TotalValidPhonesTransformer(), ['phone_home_valid', 'phone_mobile_valid']),
    ('foreign_long_session', ForeignLongSessionTransformer(), ['foreign_request', 'keep_alive_session']),
    ('device_and_account_history', DeviceAccountHistoryTransformer(), ['device_fraud_count', 'keep_alive_session']),
    ('total_risk_flags', TotalRiskFlagsTransformer(), ['email_is_free', 'foreign_request', 'has_other_cards', 'keep_alive_session']),
    ('source_encoded', 'passthrough', ['source'])  # Assuming 'source' is preprocessed with one-hot encoding externally
], remainder='passthrough')

# Sample implementation for preprocessing pipeline
boolean_preprocessing_pipeline = Pipeline([
    ('boolean_feature_transform', boolean_feature_preprocessor)
])

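# Assuming train_df is your DataFrame
# Apply transformations and create a new DataFrame with engineered features
transformed_array = boolean_preprocessing_pipeline.fit_transform(train_df)

# remainder='passthrough' appends every unused column after the engineered ones,
# so the column names must cover those as well
engineered_columns = ['total_valid_phones', 'foreign_long_session', 'device_and_account_history', 'total_risk_flags', 'source_encoded']
used_columns = {'phone_home_valid', 'phone_mobile_valid', 'foreign_request', 'keep_alive_session',
                'device_fraud_count', 'email_is_free', 'has_other_cards', 'source'}
remainder_columns = [col for col in train_df.columns if col not in used_columns]
transformed_df = pd.DataFrame(transformed_array, columns=engineered_columns + remainder_columns, index=train_df.index)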

# Display the transformed DataFrame
print(transformed_df.head())

In [None]:
# SHOULD BE MOVED TO DATAPREPROCESSING?

# One-hot encoding for categorical variables
encoded_train_df = pd.get_dummies(train_df, columns=cat_df.columns, drop_first=True)

In [None]:
# SHOULD BE MOVED TO DATAPREPROCESSING?

# Apply log transformation
for column in ['velocity_6h', 'velocity_24h', 'zip_count_4w']:
    train_df[column] = np.log1p(train_df[column])

```python
# Flag missing values, then fill them with a -1 sentinel.
# Assigning the result back avoids the chained `inplace=True` pattern,
# which may not modify the original DataFrame in recent pandas versions.
for column in ['intended_balcon_amount', 'prev_address_months_count',
               'current_address_months_count', 'bank_months_count']:
    df[f'MISSING_FLAG_{column}'] = df[column].isna().astype(int)
    df[column] = df[column].fillna(-1)

# session_length_in_minutes is filled without a flag column
df['session_length_in_minutes'] = df['session_length_in_minutes'].fillna(-1)
```