# Fraud Data Preprocessing Pipeline (from raw data to model-ready dataframe)

#### Creating a synthetic dataset to simulate the fraud detection problem and data preprocessing pipeline.

In this notebook we simulate a real-world fraud detection workflow, using a synthetic dataset that mimics bank transactions.

The dataset includes demographic, financial, and behavioral attributes, along with a binary target variable (1 = fraud, 0 = legitimate).

Data preprocessing typically includes:

- Handling missing values
- Removing outliers
- Normalizing numeric features
- Encoding categorical variables
- Creating and selecting the most relevant features (Feature Engineering)

These steps ensure data quality and improve model performance.
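Outlier removal is listed above but is not shown as a separate step later in this notebook. As a rough sketch of what it could look like, the snippet below flags values outside the Tukey fences (1.5 times the interquartile range). The helper `iqr_bounds`, the toy column, and the choice of `k=1.5` are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


rng = np.random.default_rng(0)
# Hypothetical toy column: exponential amounts, similar in spirit to `value`
toy = pd.DataFrame({"value": rng.exponential(scale=100, size=1_000).round(2)})

lower, upper = iqr_bounds(toy["value"])
is_outlier = ~toy["value"].between(lower, upper)  # True outside the fences
filtered = toy[~is_outlier]

print(f"fences: [{lower:.2f}, {upper:.2f}], outliers flagged: {is_outlier.sum()}")
```

In a fraud setting, unusually large amounts may themselves carry signal, so keeping `is_outlier` as a flag column can be a safer choice than dropping the rows.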

## 1. Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

## 2. Simulating a raw dataset

To create a fake raw dataset that simulates real-world bank transactions, we will use numpy and pandas.

In [2]:
np.random.seed(42)  # for reproducibility

n = 10000  # number of samples
raw_df = pd.DataFrame({
    'transaction_id': np.arange(n),
    'country': np.random.choice(['Brazil', 'USA', 'UK', 'Germany'], n),
    'city': np.random.choice(['Rio', 'New York', 'London', 'Berlin'], n),
    'district': np.random.choice(['Center', 'North', 'South', 'West'], n),
    'zip': np.random.randint(10000, 99999, n),
    'ip': [f"192.168.{np.random.randint(0,255)}.{np.random.randint(0,255)}" for _ in range(n)],
    'datetime': pd.date_range('2024-01-01', periods=n, freq='min'),
    'os': np.random.choice(['Windows', 'Android', 'iOS'], n),
    'value': np.random.exponential(scale=100, size=n).round(2),
    'background_checks': np.random.randint(0, 5, n),
    'complaints': np.random.randint(0, 10, n),
    'transactions': np.random.randint(1, 50, n),
    'credit_score': np.random.normal(650, 100, n).astype(int),
    'credit_limit': np.random.uniform(500, 20000, n).round(2),
    'device': np.random.choice(['mobile', 'desktop'], n),
    'browser': np.random.choice(['Chrome', 'Safari', 'Firefox', 'Edge'], n),
    'is_fraud': np.random.choice([0, 1], n, p=[0.985, 0.015])  # imbalanced target
})

display("Raw dataset sample:")
display(raw_df.sample(5))

'Raw dataset sample:'

Unnamed: 0,transaction_id,country,city,district,zip,ip,datetime,os,value,background_checks,complaints,transactions,credit_score,credit_limit,device,browser,is_fraud
9339,9339,Brazil,London,North,93048,192.168.176.75,2024-01-07 11:39:00,Windows,111.96,4,1,19,693,9281.39,mobile,Safari,0
7915,7915,USA,Berlin,Center,25948,192.168.241.49,2024-01-06 11:55:00,iOS,3.18,2,1,48,682,3341.21,desktop,Firefox,0
3962,3962,Brazil,Berlin,West,78315,192.168.188.88,2024-01-03 18:02:00,Windows,7.04,1,7,38,716,8277.55,mobile,Edge,0
7723,7723,Brazil,New York,Center,46559,192.168.72.88,2024-01-06 08:43:00,iOS,13.48,2,2,44,762,4688.45,mobile,Safari,0
2932,2932,Brazil,Berlin,South,87903,192.168.47.86,2024-01-03 00:52:00,Windows,20.57,0,3,34,646,15598.18,mobile,Safari,0
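Every row in the sample above is legitimate, which is expected with a fraud probability of 1.5%. A quick way to confirm the imbalance is to count the classes. The sketch below uses a stand-in series drawn with the same probabilities rather than `raw_df` itself, so the exact counts will differ from the notebook's.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical stand-in for raw_df['is_fraud'], drawn with the same probabilities
is_fraud = pd.Series(
    rng.choice([0, 1], size=10_000, p=[0.985, 0.015]), name="is_fraud"
)

counts = is_fraud.value_counts()               # absolute number of rows per class
rates = is_fraud.value_counts(normalize=True)  # share of rows per class

print(pd.DataFrame({"count": counts, "share": rates.round(4)}))
```

On the real dataframe the same check would be `raw_df['is_fraud'].value_counts(normalize=True)`.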


## 3. Feature Engineering

To enrich the dataset with additional features, we can perform feature engineering. This can include creating new features from existing data, such as time-based features, transaction amount-based features, or other relevant features.

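Here we split the `datetime` column into `day`, `hour` and `minute` columns and then drop the original `datetime` column, leaving only plain numeric features. Note that this replaces one column with three, so it does not reduce the dimensionality of the dataset.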

In [3]:
# --- 3. Feature extraction from datetime ---
raw_df['day'] = raw_df['datetime'].dt.day
raw_df['hour'] = raw_df['datetime'].dt.hour
raw_df['minute'] = raw_df['datetime'].dt.minute
raw_df.drop(columns=['datetime'], inplace=True)
raw_df.head()

Unnamed: 0,transaction_id,country,city,district,zip,ip,os,value,background_checks,complaints,transactions,credit_score,credit_limit,device,browser,is_fraud,day,hour,minute
0,0,UK,London,South,35499,192.168.140.92,iOS,121.49,0,3,19,768,6246.51,desktop,Chrome,0,1,0,0
1,1,Germany,Berlin,West,16421,192.168.106.108,Windows,182.55,2,4,20,601,18677.37,desktop,Chrome,0,1,0,1
2,2,Brazil,London,West,62204,192.168.87.249,Windows,36.29,0,9,11,655,14260.93,mobile,Chrome,0,1,0,2
3,3,UK,Berlin,North,34591,192.168.44.107,Windows,1.48,2,1,24,603,2218.27,mobile,Chrome,0,1,0,3
4,4,UK,Berlin,Center,70940,192.168.45.76,iOS,80.12,0,9,22,534,13281.11,desktop,Chrome,0,1,0,4


Next we create a **"security index"** feature, a weighted sum of background checks and complaints, and an **"average value"** per transaction feature.

In [4]:
raw_df['security_index'] = raw_df['background_checks'] * 0.7 + raw_df['complaints'] * 0.3
raw_df['avg_value_per_tx'] = raw_df['value'] / raw_df['transactions']
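As a sanity check, the two formulas can be applied by hand to the first two rows shown in the table above. The small dataframe below copies only the four columns the formulas need.

```python
import pandas as pd

# First two rows of the table above (only the columns the formulas use)
rows = pd.DataFrame({
    "background_checks": [0, 2],
    "complaints": [3, 4],
    "value": [121.49, 182.55],
    "transactions": [19, 20],
})

# Same weights as in the cell above: 0.7 for background checks, 0.3 for complaints
rows["security_index"] = rows["background_checks"] * 0.7 + rows["complaints"] * 0.3
rows["avg_value_per_tx"] = rows["value"] / rows["transactions"]

print(rows[["security_index", "avg_value_per_tx"]].round(6))
```

The first row gives 0 × 0.7 + 3 × 0.3 = 0.9 and 121.49 / 19 ≈ 6.394211. The division is safe here because `transactions` is drawn from `np.random.randint(1, 50, n)` and is therefore never zero; real data would likely need a guard for zero counts.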

Checking the summary of the dataset.

In [5]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   transaction_id     10000 non-null  int64  
 1   country            10000 non-null  object 
 2   city               10000 non-null  object 
 3   district           10000 non-null  object 
 4   zip                10000 non-null  int64  
 5   ip                 10000 non-null  object 
 6   os                 10000 non-null  object 
 7   value              10000 non-null  float64
 8   background_checks  10000 non-null  int64  
 9   complaints         10000 non-null  int64  
 10  transactions       10000 non-null  int64  
 11  credit_score       10000 non-null  int64  
 12  credit_limit       10000 non-null  float64
 13  device             10000 non-null  object 
 14  browser            10000 non-null  object 
 15  is_fraud           10000 non-null  int64  
 16  day                1000

## 4. Exploratory data analysis

Handling data types, missing values and outliers is a common task in data preprocessing. Because our dataset is synthetic and complete, we inject some **missing values** to simulate real-world scenarios.

In [6]:
# Cast zip to string (it is an identifier, not a quantity)
raw_df['zip'] = raw_df['zip'].astype(str)

# Set a random 1-3% of the values in each selected column to NaN
for col in ['city', 'district', 'os', 'zip']:
    random_percentage = np.random.uniform(0.01, 0.03)
    raw_df.loc[raw_df.sample(frac=random_percentage).index, col] = np.nan

raw_df.isnull().sum()

transaction_id         0
country                0
city                 261
district             276
zip                  223
ip                     0
os                   203
value                  0
background_checks      0
complaints             0
transactions           0
credit_score           0
credit_limit           0
device                 0
browser                0
is_fraud               0
day                    0
hour                   0
minute                 0
security_index         0
avg_value_per_tx       0
dtype: int64

Handling missing values by replacing them with an explicit `'Unknown'` category:

In [7]:
raw_df['city'] = raw_df['city'].fillna('Unknown')
raw_df['district'] = raw_df['district'].fillna('Unknown')
raw_df['os'] = raw_df['os'].fillna('Unknown')
raw_df['zip'] = raw_df['zip'].fillna('Unknown')

raw_df.isnull().sum()

transaction_id       0
country              0
city                 0
district             0
zip                  0
ip                   0
os                   0
value                0
background_checks    0
complaints           0
transactions         0
credit_score         0
credit_limit         0
device               0
browser              0
is_fraud             0
day                  0
hour                 0
minute               0
security_index       0
avg_value_per_tx     0
dtype: int64
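The four `fillna` calls above can also be written as a single call by passing a column-to-value mapping. This is a small sketch on a toy dataframe; the names `toy` and `cat_fill` are hypothetical, and the numeric column is included only to show that columns not listed in the mapping are left alone.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "city": ["Rio", np.nan, "London"],
    "os": [np.nan, "iOS", "Android"],
    "value": [10.0, 20.0, np.nan],  # numeric column, deliberately not filled
})

# Only the columns named in the mapping are filled
cat_fill = {"city": "Unknown", "os": "Unknown"}
filled = toy.fillna(cat_fill)

print(filled)
print(filled.isnull().sum())
```

Filling categorical gaps with a dedicated label keeps the rows and lets the encoder treat "missing" as its own category, which is what produces the `city_Unknown`, `district_Unknown` and `os_Unknown` columns later on.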

## 5. Variables encoding

Checking the distribution of categorical variables:

In [8]:
cat_cols = ['country', 'city', 'district', 'os', 'device', 'browser']
for col in cat_cols:
    print(f"{col}: {raw_df[col].nunique()} in {raw_df[col].shape[0]}")

country: 4 in 10000
city: 5 in 10000
district: 5 in 10000
os: 4 in 10000
device: 2 in 10000
browser: 4 in 10000


Since each categorical variable has only a few distinct classes, we will use one-hot encoding to encode them.

In [9]:
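# Dense output; categories unseen at fit time are encoded as all zeros
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(raw_df[cat_cols])
# Reuse raw_df's index so the later concat aligns row by row
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(cat_cols), index=raw_df.index)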

encoded_df.head()

Unnamed: 0,country_Brazil,country_Germany,country_UK,country_USA,city_Berlin,city_London,city_New York,city_Rio,city_Unknown,district_Center,...,os_Android,os_Unknown,os_Windows,os_iOS,device_desktop,device_mobile,browser_Chrome,browser_Edge,browser_Firefox,browser_Safari
0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0


In [10]:
# Merge encoded columns back
df = pd.concat([raw_df.drop(columns=cat_cols), encoded_df], axis=1)

pd.set_option('display.max_columns', None)
display(df.shape)
display(df.head())

(10000, 39)

Unnamed: 0,transaction_id,zip,ip,value,background_checks,complaints,transactions,credit_score,credit_limit,is_fraud,day,hour,minute,security_index,avg_value_per_tx,country_Brazil,country_Germany,country_UK,country_USA,city_Berlin,city_London,city_New York,city_Rio,city_Unknown,district_Center,district_North,district_South,district_Unknown,district_West,os_Android,os_Unknown,os_Windows,os_iOS,device_desktop,device_mobile,browser_Chrome,browser_Edge,browser_Firefox,browser_Safari
0,0,35499,192.168.140.92,121.49,0,3,19,768,6246.51,0,1,0,0,0.9,6.394211,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0
1,1,16421,192.168.106.108,182.55,2,4,20,601,18677.37,0,1,0,1,2.6,9.1275,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,2,62204,192.168.87.249,36.29,0,9,11,655,14260.93,0,1,0,2,2.7,3.299091,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
3,3,34591,192.168.44.107,1.48,2,1,24,603,2218.27,0,1,0,3,1.7,0.061667,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,4,70940,192.168.45.76,80.12,0,9,22,534,13281.11,0,1,0,4,2.7,3.641818,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0


----

## 6. Numerical features standardization and dimensionality reduction

Many machine learning estimators do not perform well when the individual features do not look more or less like standard normally distributed data (zero mean and unit variance). It is therefore good practice to standardize the numerical features, after which their dimensionality can be reduced to obtain a more compact representation.

In this approach, we will use the `StandardScaler` from `sklearn.preprocessing` to standardize the numerical features and the `PCA` from `sklearn.decomposition` to reduce feature dimensionality.

In [11]:
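# Scale numerical features: every numeric column except identifiers and the target.
# Note that this selection also includes the one-hot encoded columns.
num_cols = df.drop(columns=['transaction_id', 'zip', 'ip', 'is_fraud']).select_dtypes(include=np.number).columns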
scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[num_cols] = scaler.fit_transform(df[num_cols])

display(df_scaled.shape)
display(df_scaled.head())

(10000, 39)

Unnamed: 0,transaction_id,zip,ip,value,background_checks,complaints,transactions,credit_score,credit_limit,is_fraud,day,hour,minute,security_index,avg_value_per_tx,country_Brazil,country_Germany,country_UK,country_USA,city_Berlin,city_London,city_New York,city_Rio,city_Unknown,district_Center,district_North,district_South,district_Unknown,district_West,os_Android,os_Unknown,os_Windows,os_iOS,device_desktop,device_mobile,browser_Chrome,browser_Edge,browser_Firefox,browser_Safari
0,0,35499,192.168.140.92,0.203084,-1.411643,-0.543369,-0.430575,1.201543,-0.712857,0,-1.495605,-1.659421,-1.701871,-1.413656,-0.112908,-0.582893,-0.577504,1.74274,-0.575195,-0.567188,1.792289,-0.568112,-0.575965,-0.163705,-0.560873,-0.570576,1.761649,-0.168474,-0.567804,-0.709919,-0.143947,-0.689465,1.449737,1.016537,-1.016537,1.753562,-0.573193,-0.586896,-0.579044
1,1,16421,192.168.106.108,0.811126,0.004177,-0.197494,-0.359746,-0.469407,1.510107,0,-1.495605,-1.659421,-1.644103,-0.126566,0.004015,-0.582893,1.731589,-0.573809,-0.575195,1.763084,-0.557946,-0.568112,-0.575965,-0.163705,-0.560873,-0.570576,-0.56765,5.935645,-0.567804,-0.709919,-0.143947,1.450401,-0.68978,1.016537,-1.016537,1.753562,-0.573193,-0.586896,-0.579044
2,2,62204,192.168.87.249,-0.645348,-1.411643,1.531879,-0.997205,0.0709,0.720332,0,-1.495605,-1.659421,-1.586334,-0.050855,-0.245308,1.715581,-0.577504,-0.573809,-0.575195,-0.567188,1.792289,-0.568112,-0.575965,-0.163705,-0.560873,-0.570576,-0.56765,-0.168474,1.761171,-0.709919,-0.143947,1.450401,-0.68978,-0.983732,0.983732,1.753562,-0.573193,-0.586896,-0.579044
3,3,34591,192.168.44.107,-0.991991,0.004177,-1.235118,-0.076431,-0.449395,-1.433212,0,-1.495605,-1.659421,-1.528565,-0.807967,-0.383796,-0.582893,-0.577504,1.74274,-0.575195,1.763084,-0.557946,-0.568112,-0.575965,-0.163705,-0.560873,1.752616,-0.56765,-0.168474,-0.567804,-0.709919,-0.143947,1.450401,-0.68978,-0.983732,0.983732,1.753562,-0.573193,-0.586896,-0.579044
4,4,70940,192.168.45.76,-0.208884,-1.411643,1.531879,-0.218089,-1.139788,0.545114,0,-1.495605,-1.659421,-1.470796,-0.050855,-0.230647,-0.582893,-0.577504,1.74274,-0.575195,1.763084,-0.557946,-0.568112,-0.575965,-0.163705,1.782934,-0.570576,-0.56765,-0.168474,-0.567804,-0.709919,-0.143947,-0.689465,1.449737,1.016537,-1.016537,1.753562,-0.573193,-0.586896,-0.579044
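The cell above fits the scaler on the full dataset, which is fine for a demonstration. When the data will later be split for model evaluation, a common alternative is to split first and fit the scaler on the training rows only, so that test-set statistics do not leak into preprocessing. The following is a minimal sketch of that variant on a toy dataframe; the column choice, the 5% fraud rate and the 80/20 split are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical toy data with two numeric features and an imbalanced target
toy = pd.DataFrame({
    "value": rng.exponential(scale=100, size=2_000),
    "credit_score": rng.normal(650, 100, size=2_000),
    "is_fraud": rng.choice([0, 1], size=2_000, p=[0.95, 0.05]),
})

X = toy.drop(columns="is_fraud")
y = toy["is_fraud"]

# stratify=y keeps the fraud share similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean and std

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The training features end up with mean 0 and standard deviation 1 by construction, while the test features will be close to, but not exactly, those values.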


In [12]:
# PCA for dimensionality reduction
n_components = 20
pca = PCA(n_components=n_components)
pca_features = pca.fit_transform(df_scaled[num_cols])

df_pca = pd.DataFrame(
    pca_features,
    columns=[f'pca_{i+1}' for i in range(n_components)]
)

# Combine PCA features + scaled 'value' + target
# ('value' is also one of the PCA inputs, so it is kept here as an extra raw feature)
final_df = pd.concat([df_pca, df_scaled[['value', 'is_fraud']].reset_index(drop=True)], axis=1)

print("\nPreprocessed dataframe sample:")
display(final_df)


Preprocessed dataframe sample:


Unnamed: 0,pca_1,pca_2,pca_3,pca_4,pca_5,pca_6,pca_7,pca_8,pca_9,pca_10,pca_11,pca_12,pca_13,pca_14,pca_15,pca_16,pca_17,pca_18,pca_19,pca_20,value,is_fraud
0,-0.698411,2.376068,0.185079,-0.485557,2.084479,-1.069025,-1.879107,-0.757443,-0.941802,1.287435,0.723569,-0.549726,1.247120,-1.707125,1.018190,0.836955,-1.337190,-0.195031,0.482220,0.743740,0.203084,0
1,0.943554,1.194508,0.903734,-1.094042,-0.764599,-0.582347,0.518358,-1.095465,-1.649087,-1.110251,-0.260732,1.434685,-0.495205,-1.844151,0.301102,-0.142419,-0.316142,3.671213,1.611958,3.124371,0.811126,0
2,-0.894161,-0.945810,0.644399,-1.336722,-1.223677,-1.401517,0.737255,-1.694218,0.419510,0.068825,0.563081,-0.721563,-0.225967,-1.508925,-0.731428,2.788958,-1.380392,0.175359,1.070414,-0.812271,-0.645348,0
3,-1.764501,-0.426027,-0.709287,-1.194461,-1.124568,-0.280330,-0.522195,0.975147,-1.546995,-1.108341,2.165626,0.815748,-0.435222,-1.038596,1.295363,0.123861,-1.654121,-0.181589,0.277669,0.731499,-0.991991,0
4,0.843309,0.930054,-0.943591,-0.514943,1.595102,0.660228,-1.926685,0.623879,-0.765957,-2.027040,0.756387,0.570024,0.760188,-2.455145,-0.279026,0.475649,-0.863283,0.842043,0.492570,-0.717121,-0.208884,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,-3.265183,0.682100,1.261159,-1.079995,2.214764,0.867276,2.081545,-0.384085,-0.770713,-0.891025,0.321215,0.752589,1.056829,1.721603,1.155050,-0.849071,1.595953,-0.510098,-0.382024,-0.123874,1.113355,0
9996,-0.995182,2.587320,0.911480,1.314660,0.172853,0.554367,1.040281,2.946963,-1.001510,0.337151,1.631404,-0.672553,0.734569,0.728397,0.891523,-0.214218,0.292200,-0.549798,-0.008689,-1.202736,-0.543776,0
9997,-2.436027,0.684171,-0.090150,-1.468203,-2.193359,-0.817437,0.060198,-0.149901,-0.066886,-0.261150,-0.126848,0.259602,2.325280,2.153408,-0.915698,-0.629791,1.334523,3.067242,0.671078,3.943802,-0.319021,0
9998,-0.511385,-1.218202,-1.059426,-1.311335,0.450261,0.546786,0.312845,0.008789,1.507280,-0.092385,-0.212725,2.860977,1.309539,0.770416,-1.819281,-0.586398,-0.657498,0.212136,-0.735346,0.471382,0.611367,0
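The choice of 20 components can be checked by looking at how much variance they retain. The sketch below does this on a random matrix with 35 columns (the 39 columns of `df` minus the four excluded ones), so the numbers it prints are not those of the notebook's data. On the real pipeline the same attribute, `pca.explained_variance_ratio_`, is available on the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical standardized matrix with the same number of columns as num_cols
X = StandardScaler().fit_transform(rng.normal(size=(1_000, 35)))

pca = PCA(n_components=20).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(f"variance retained by 20 components: {cumulative[-1]:.3f}")

# A float in (0, 1) asks PCA to keep just enough components
# to reach that share of the total variance
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X)
print(f"components needed for 90% of the variance: {pca_90.n_components_}")
```

If the retained variance turns out to be low, increasing the number of components or passing a target share as shown in the second call are two possible adjustments.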


## 7. Saving the preprocessed data

Now we can save the preprocessed data to a CSV file.

In [13]:
os.makedirs("data", exist_ok=True)
final_df.to_csv(os.path.join("data", "bank_transactions_processed.csv"), index=False)
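After saving, it can be worth reading the file back to confirm that shapes and dtypes survive the round trip. The sketch below does this with a small stand-in dataframe and a temporary directory, so it does not touch the `data` folder; the frame `toy` is hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical small frame standing in for final_df
toy = pd.DataFrame({
    "pca_1": rng.normal(size=5),
    "value": rng.normal(size=5),
    "is_fraud": [0, 0, 1, 0, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bank_transactions_processed.csv")
    toy.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

# Raises if columns, dtypes or values (within float tolerance) differ
pd.testing.assert_frame_equal(toy, reloaded)
print(reloaded.dtypes)
```

Writing with `index=False`, as the cell above does, is what keeps an `Unnamed: 0` column from appearing when the file is loaded again.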