# Preprocessing + Feature engineering

The preprocessing step covers the following:
- Feature encoding
- Balancing the dataset (optional)
- Scaling the data
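
As a minimal sketch of the scaling item, assuming that standardization (zero mean, unit variance) is the kind of scaling intended: the statistics are computed on the training values only and then reused for the test values. The helper names `fit_standardizer` and `standardize` and the sample amounts are hypothetical.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values: list[float]) -> tuple[float, float]:
    """Return the mean and population standard deviation of the training values."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    # Guard against constant columns, which would otherwise divide by zero.
    if std == 0:
        std = 1.0
    return mean, std


def standardize(values: list[float], mean: float, std: float) -> list[float]:
    """Scale values with statistics that were fitted on the training data."""
    return [(v - mean) / std for v in values]


# Made-up transaction amounts, for illustration only.
train_amounts = [10.0, 20.0, 30.0, 40.0]
test_amounts = [25.0, 50.0]

mean, std = fit_standardizer(train_amounts)
train_scaled = standardize(train_amounts, mean, std)
test_scaled = standardize(test_amounts, mean, std)
```

Fitting on the training values alone keeps information from the test set out of the preprocessing.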

In [1]:
# import libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import polars as pl

df = pl.read_csv("../data/Digital_Payment_Fraud_Detection_Dataset.csv")

## Feature Engineering
**Generation of # of previous frauds and # of transactions made**

In [3]:
# First, make sure all transactions are in order.
# The transaction id has the form "T" + number, so we strip the "T"
# and cast the remainder to an integer that can be sorted.
def previousActionsGenerator(df: pl.DataFrame) -> pl.DataFrame:
    """
    Generate features based on each user's past transactions.
    The generated features are:
        - fraud_any : 1 if the user committed fraud in a previous transaction, else 0
        - fraud_number : number of previous fraudulent transactions by the user
        - transaction_any : 1 if the user made a previous transaction, else 0
        - transaction_number : number of previous transactions by the user

    Parameters:
        df: pl.DataFrame

    Returns:
        df: pl.DataFrame
    """

    df = (
        df
        .with_columns(pl.col("transaction_id").str.replace("T","").alias("transaction_id_number"))
        # Int32 leaves headroom; Int16 would overflow for ids above 32767
        .cast({"transaction_id_number": pl.Int32})
        .sort("transaction_id_number")
    )

    df2 = df.select(["transaction_id_number","user_id","fraud_label"])

    df_join = (
        df
        .join(df2, 
              on='user_id',
              how="inner",
              suffix='_r')
        .filter(
            pl.col('transaction_id_number') > pl.col("transaction_id_number_r")
        )
    )

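    # Aggregate over the *previous* transactions (the "_r" columns from the self-join);
    # using the left-hand fraud_label would leak the current row's label into its features.
    df_agg = df_join.group_by("user_id", "transaction_id_number").agg(
        fraud_any=pl.col("fraud_label_r").max().cast(pl.Int8),
        fraud_number=pl.col("fraud_label_r").sum(),
        transaction_any=pl.lit(1),
        transaction_number=pl.col("fraud_label_r").count(),
    )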

    df = (
        df
        .join(
            df_agg,
            on=['user_id','transaction_id_number'],
            how='left'
        )
        .fill_null(0)
        .drop('transaction_id_number')
    )

    return df


In [4]:
df_engineer = previousActionsGenerator(df)

df_engineer.write_csv("../data/02_data_engineering.csv")
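
The history features produced by `previousActionsGenerator` can be cross-checked on a small sample with a plain-Python reference. The sketch below is not part of the notebook; the function name `previous_actions_reference` and the sample rows are hypothetical. It assumes ids of the form "T" + number and a `fraud_label` of 0 or 1.

```python
from collections import defaultdict


def previous_actions_reference(rows: list[dict]) -> list[dict]:
    """Compute per-user history features by walking transactions in order.

    Each row needs the keys "transaction_id", "user_id" and "fraud_label" (0 or 1).
    """
    # Sort by the numeric part of the id, e.g. "T12" -> 12.
    ordered = sorted(rows, key=lambda r: int(r["transaction_id"].replace("T", "", 1)))
    seen_transactions: dict = defaultdict(int)
    seen_frauds: dict = defaultdict(int)

    result = []
    for row in ordered:
        user = row["user_id"]
        n_tx, n_fraud = seen_transactions[user], seen_frauds[user]
        result.append({
            **row,
            "fraud_any": int(n_fraud > 0),
            "fraud_number": n_fraud,
            "transaction_any": int(n_tx > 0),
            "transaction_number": n_tx,
        })
        # Update the history only after the features are stored, so the
        # current label never contributes to its own row.
        seen_transactions[user] += 1
        seen_frauds[user] += row["fraud_label"]
    return result


# Made-up rows, for illustration only.
sample = [
    {"transaction_id": "T1", "user_id": "U1", "fraud_label": 1},
    {"transaction_id": "T2", "user_id": "U2", "fraud_label": 0},
    {"transaction_id": "T3", "user_id": "U1", "fraud_label": 0},
    {"transaction_id": "T4", "user_id": "U1", "fraud_label": 1},
]
features = previous_actions_reference(sample)
```

Comparing a handful of rows against this reference is a quick way to confirm that every feature depends only on earlier transactions.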

## Train/Test split
We do not use a conventional random split (for example 70/30). Instead, the test set is the last 500 transactions, and we check that its fraud ratio is consistent with the one in the training data. For validation within the training data we keep the usual approach, k-fold cross-validation, which is useful given our small amount of data.
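
The splitting scheme described above can be sketched with row positions alone. This is an illustration under assumptions: the notebook does not say which k-fold splitter is used, the helpers `holdout_last` and `k_fold_indices` are hypothetical, and in practice scikit-learn's `KFold` or `StratifiedKFold` would likely do this job.

```python
def holdout_last(n_rows: int, n_test: int) -> tuple[range, range]:
    """Split row positions so that the last n_test rows form the test set."""
    cut = n_rows - n_test
    return range(0, cut), range(cut, n_rows)


def k_fold_indices(n_rows: int, k: int):
    """Yield (train_positions, validation_positions) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, validation
        start = stop


# Small made-up sizes, for illustration only.
train_pos, test_pos = holdout_last(n_rows=20, n_test=5)
folds = list(k_fold_indices(len(train_pos), k=3))
```

Each training row appears in exactly one validation fold, while the held-out tail is never touched during cross-validation.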

In [2]:
df = pl.read_csv("../data/02_data_engineering.csv")
# As the data is already sorted, we can take the last rows as the test set
test = df.tail(500)
train = df.head(7000)

In [3]:
print("Test fraud ratio:")
display(test.select("fraud_label").mean())

print("Train fraud ratio:")
display(train.select("fraud_label").mean())

Test fraud ratio:

fraud_label
f64
0.068

Train fraud ratio:

fraud_label
f64
0.065

In [4]:
train.write_csv("../data/03_train.csv")
test.write_csv("../data/03_test.csv")

The two ratios are close (6.8% in the test set versus 6.5% in the training set), so the test set is representative and we can safely evaluate on it.
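
One way to make this comparison more concrete is a two-proportion z-score, sketched below with the ratios and sample sizes from the split above. The unpooled standard error and the function name `two_proportion_z` are choices made for this illustration, not something the notebook itself computes.

```python
from math import sqrt


def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """z-score of the difference between two independent sample proportions."""
    standard_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / standard_error


# Test set: 500 rows at 0.068; training set: 7000 rows at 0.065.
z = two_proportion_z(p1=0.068, n1=500, p2=0.065, n2=7000)
print(f"z = {z:.2f}")  # prints "z = 0.26"
```

A z-score this far below 2 suggests the difference is well within what sampling noise alone would produce.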

## Feature encoding
Some features are categorical, such as *transaction_type* or *payment_mode*. These variables need to be encoded as numbers before the different models can use them.

In [21]:
df_train:pl.DataFrame = pl.read_csv("../data/03_train.csv")

In [22]:
def encode_columns(df: pl.DataFrame, categorical_columns: list[str]) -> tuple[pl.DataFrame, dict[str, LabelEncoder]]:
    """Label-encode the given columns and return the fitted encoders for reuse."""
    labelEncoder_dict = dict()
    for col in categorical_columns:
        le = LabelEncoder()
        # Pass a 1-D array; a one-column DataFrame triggers sklearn's column_or_1d warning
        encoded_col = le.fit_transform(df.get_column(col).to_numpy())
        df = df.with_columns(pl.Series(col, encoded_col))
        labelEncoder_dict[col] = le

    return df, labelEncoder_dict

In [23]:
categorical_columns = ["transaction_type", "payment_mode", "device_type", "device_location"]
df_train, labelEncoder_dict = encode_columns(df_train, categorical_columns)


In [24]:
df_train.write_csv("../data/train_preprocessed.csv")
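
The encoders in `labelEncoder_dict` are fitted on the training data, so the test set should be encoded with those same objects rather than with newly fitted ones. The sketch below shows one possible way to do that while tolerating categories that only appear in the test set. The helper `transform_with_fallback`, the `-1` placeholder and the sample categories are assumptions for illustration.

```python
from sklearn.preprocessing import LabelEncoder


def transform_with_fallback(encoder: LabelEncoder, values: list[str], unknown: int = -1) -> list[int]:
    """Encode values with a fitted encoder, mapping unseen categories to `unknown`."""
    # LabelEncoder.transform raises ValueError on labels it was not fitted on,
    # so build an explicit lookup from the fitted classes instead.
    lookup = {label: code for code, label in enumerate(encoder.classes_)}
    return [lookup.get(value, unknown) for value in values]


# Made-up categories, for illustration only.
encoder = LabelEncoder().fit(["mobile", "desktop", "tablet"])
encoded = transform_with_fallback(encoder, ["desktop", "tablet", "smartwatch"])
# encoded == [0, 2, -1]: classes are sorted, and "smartwatch" was never seen
```

Whether an unseen category should map to a placeholder or raise an error depends on the model; a placeholder keeps the pipeline running but hides data drift.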