# 3. Feature Engineering

This notebook builds the features used by our fraud detection models. It processes both the fraud and credit card datasets using the logic implemented in `src/features/`, with loading and preprocessing helpers from `src/data/`.

In [5]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
from src.data.loading import load_fraud_data, load_ip_country_data, load_creditcard_data, save_processed_data
from src.features.engineering import create_all_features, create_time_features, create_time_since_signup
from src.features.geolocation import merge_ip_country, create_country_features
from src.data.preprocessing import create_preprocessing_pipeline



## 1. Fraud Dataset Feature Engineering

In [6]:
# Load raw data
fraud_df = load_fraud_data()
ip_df = load_ip_country_data()

# 1. Map each ip_address to a country using the IP range table
merged_df = merge_ip_country(fraud_df, ip_df)

# 2. Create categorical country-risk features
merged_df = create_country_features(merged_df)

# 3. Create time-based and per-user velocity features
# Note: create_all_features() handles the timestamp columns and the numeric velocity features
fraud_featured = create_all_features(merged_df)

print(f"Final Fraud Features Shape: {fraud_featured.shape}")
fraud_featured.head()

2025-12-27 19:08:08 - fraud_detection - INFO - Loading fraud data from C:\Users\Lenovo\Documents\dawir\Fraud-Detection\data\raw\Fraud_Data.csv
2025-12-27 19:08:09 - fraud_detection - INFO - Loaded 151112 records with 11 columns
2025-12-27 19:08:09 - fraud_detection - INFO - Columns: ['user_id', 'signup_time', 'purchase_time', 'purchase_value', 'device_id', 'source', 'browser', 'sex', 'age', 'ip_address', 'class']
2025-12-27 19:08:09 - fraud_detection - INFO - Loading IP-to-country data from C:\Users\Lenovo\Documents\dawir\Fraud-Detection\data\raw\IpAddress_to_Country.csv
2025-12-27 19:08:09 - fraud_detection - INFO - Loaded 138846 IP ranges
2025-12-27 19:08:09 - fraud_detection - INFO - Merging IP addresses with country data...
2025-12-27 19:08:09 - fraud_detection - INFO - Converting IP addresses to integer format...
2025-12-27 19:08:09 - fraud_detection - INFO - Performing range-based IP lookup...
2025-12-27 19:08:09 - fraud_detection - INFO - Successfully matched 0/151112 IP address
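
The log line above reports that 0 of 151112 IP addresses were matched, so the country features built in the next step carry no signal and the lookup is worth debugging before training. Possible causes to check (assumptions, since the body of `merge_ip_country` is not shown here) include mismatched dtypes between the address column and the range bounds, or range bounds that are not sorted. The following is a minimal sketch of a range-based lookup with `pandas.merge_asof` on made-up rows. The range column names `lower_bound_ip_address`, `upper_bound_ip_address`, and `country`, and the helper `lookup_country`, are hypothetical and are not the project's implementation.

```python
import pandas as pd

# Toy stand-ins for the real tables; values and range column names are assumptions.
fraud = pd.DataFrame({
    "user_id": [1, 2, 3],
    "ip_address": [16777300.7, 16779000.2, 5.0],  # float-valued addresses
})
ranges = pd.DataFrame({
    "lower_bound_ip_address": [16777216, 16777472, 16778240],
    "upper_bound_ip_address": [16777471, 16777727, 16779263],
    "country": ["Country A", "Country B", "Country C"],
})


def lookup_country(tx, ip_ranges):
    """Hypothetical helper: attach a country to each row via a range lookup."""
    # merge_asof requires sorted keys of the same dtype on both sides.
    tx = tx.assign(ip_int=tx["ip_address"].astype("int64")).sort_values("ip_int")
    ip_ranges = ip_ranges.astype(
        {"lower_bound_ip_address": "int64", "upper_bound_ip_address": "int64"}
    ).sort_values("lower_bound_ip_address")

    # For each address, pick the range with the largest lower bound <= address.
    out = pd.merge_asof(
        tx,
        ip_ranges,
        left_on="ip_int",
        right_on="lower_bound_ip_address",
        direction="backward",
    )

    # The backward match only guarantees address >= lower bound,
    # so the upper bound has to be checked explicitly.
    in_range = out["ip_int"] <= out["upper_bound_ip_address"]
    out["country"] = out["country"].where(in_range, "Unknown")
    return out.sort_values("user_id").reset_index(drop=True)


result = lookup_country(fraud, ranges)
match_rate = (result["country"] != "Unknown").mean()
print(result[["user_id", "ip_int", "country"]])
print(f"Match rate: {match_rate:.2%}")
```

Printing the match rate right after the merge makes a silent failure like the one logged above visible immediately.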

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,...,month,is_weekend,time_of_day,time_since_signup,time_since_last_txn,user_txn_count,avg_time_between_txn,user_total_amount,user_avg_amount,amount_vs_avg
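
The column list above includes time-based features (`month`, `is_weekend`, `time_since_signup`) and per-user velocity features (`time_since_last_txn`, `user_txn_count`, `user_avg_amount`, `amount_vs_avg`). As an illustration of how such columns can be derived, here is a minimal sketch on made-up rows. The helper `add_time_and_velocity_features` is hypothetical; the actual definitions live in `src/features/engineering` and may differ (for example in units or in how the first transaction per user is handled).

```python
import pandas as pd

# Made-up rows using the raw column names reported in the loading log.
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "signup_time": ["2015-01-01 10:00:00", "2015-01-01 10:00:00", "2015-02-01 08:00:00"],
    "purchase_time": ["2015-01-01 12:00:00", "2015-01-03 23:30:00", "2015-02-01 08:00:01"],
    "purchase_value": [30, 50, 20],
})


def add_time_and_velocity_features(frame):
    """Hypothetical helper: derive time-based and per-user velocity columns."""
    out = frame.copy()
    out["signup_time"] = pd.to_datetime(out["signup_time"])
    out["purchase_time"] = pd.to_datetime(out["purchase_time"])
    out = out.sort_values(["user_id", "purchase_time"]).reset_index(drop=True)

    # Calendar features taken from the purchase timestamp.
    out["month"] = out["purchase_time"].dt.month
    out["is_weekend"] = (out["purchase_time"].dt.dayofweek >= 5).astype(int)

    # Seconds between signup and purchase.
    out["time_since_signup"] = (
        out["purchase_time"] - out["signup_time"]
    ).dt.total_seconds()

    # Per-user velocity features; the first purchase per user has no predecessor (NaN).
    grouped = out.groupby("user_id")
    out["time_since_last_txn"] = grouped["purchase_time"].diff().dt.total_seconds()
    out["user_txn_count"] = grouped["purchase_value"].transform("count")
    out["user_avg_amount"] = grouped["purchase_value"].transform("mean")
    out["amount_vs_avg"] = out["purchase_value"] / out["user_avg_amount"]
    return out


featured = add_time_and_velocity_features(df)
print(featured[["user_id", "time_since_signup", "time_since_last_txn", "amount_vs_avg"]])
```

Note that per-user aggregates computed over the whole table, as in this sketch, include each user's later transactions; if that matters for the model, compute them on the training split only or with expanding windows.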


## 2. Credit Card Dataset Feature Engineering

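The credit card dataset is mostly pre-processed: `V1` through `V28` are PCA components, so only `Time` and `Amount` remain in their original units. This notebook keeps both columns unchanged and passes the data through as-is.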

In [7]:
cc_df = load_creditcard_data()

# 'Time' is the number of seconds elapsed since the first transaction in the dataset.
# Without a reference date it cannot be mapped to calendar dates, so we keep it numeric for now.
cc_featured = cc_df.copy()

print(f"Credit Card Features Shape: {cc_featured.shape}")
cc_featured.head()

2025-12-27 19:08:09 - fraud_detection - INFO - Loading credit card data from C:\Users\Lenovo\Documents\dawir\Fraud-Detection\data\raw\creditcard.csv
2025-12-27 19:08:13 - fraud_detection - INFO - Loaded 284807 records with 31 columns
2025-12-27 19:08:13 - fraud_detection - INFO - Columns: ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class']
Credit Card Features Shape: (284807, 31)


| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
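
Although this notebook leaves `Time` and `Amount` untouched, both can be transformed without a reference date. The sketch below is one possible approach rather than part of the project code: it converts `Time` to elapsed hours, derives an hour offset within a 24-hour cycle, and applies `log1p` to `Amount`. The helper `add_time_amount_features` and the new column names are hypothetical, and the hour offset is relative to the first transaction, not a wall-clock hour.

```python
import numpy as np
import pandas as pd

# Made-up rows with the two columns that are not PCA components.
cc = pd.DataFrame({
    "Time": [0.0, 3600.0, 90000.0],
    "Amount": [149.62, 0.0, 2.69],
})


def add_time_amount_features(frame):
    """Hypothetical helper: derive simple features from Time and Amount."""
    out = frame.copy()

    # Seconds since the first transaction, expressed in hours.
    out["hours_elapsed"] = out["Time"] / 3600.0

    # Position within a 24-hour cycle, measured from the first transaction.
    out["hour_in_cycle"] = ((out["Time"] // 3600) % 24).astype(int)

    # log1p compresses the long right tail and is defined for zero amounts.
    out["log_amount"] = np.log1p(out["Amount"])
    return out


cc_sketch = add_time_amount_features(cc)
print(cc_sketch)
```

Whether these columns help depends on the model: tree-based models are largely insensitive to monotonic transforms such as `log1p`, while linear models usually benefit from them.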


## 3. Preprocessing Pipelines

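Here we define the preprocessing pipeline: scaling for the numeric columns and encoding for the categorical columns. This cell only constructs the pipeline; fitting happens in the training step.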

In [8]:
# Example for fraud data
numeric_cols = ['purchase_value', 'age', 'time_since_signup', 'user_txn_count', 'user_avg_amount']
categorical_cols = ['source', 'browser', 'sex', 'country_risk_level']

pipeline = create_preprocessing_pipeline(numeric_cols, categorical_cols, scale_strategy='standard', encoding='onehot')
print("Preprocessing pipeline created successfully.")

2025-12-27 19:08:13 - fraud_detection - INFO - Created preprocessing pipeline with 5 numeric and 4 categorical features using standard scaling and onehot encoding
Preprocessing pipeline created successfully.
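
The internals of `create_preprocessing_pipeline` are not shown in this notebook. As a rough illustration of what a standard-scaling plus one-hot-encoding pipeline can look like, here is a minimal sketch built from scikit-learn's `ColumnTransformer`. The helper `build_preprocessor`, the imputation steps, and the sample rows are assumptions made for this example, not the project's implementation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Hypothetical helper: scale numeric columns, one-hot encode categoricals."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Categories unseen during fit are encoded as all zeros instead of raising.
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer(
        [("num", numeric, numeric_cols), ("cat", categorical, categorical_cols)],
        sparse_threshold=0,  # always return a dense array
    )


# Made-up training and test rows.
train = pd.DataFrame({
    "purchase_value": [10.0, 20.0, 30.0, 40.0],
    "age": [25, 35, 45, 55],
    "source": ["SEO", "Ads", "SEO", "Direct"],
    "browser": ["Chrome", "Safari", "Chrome", "Safari"],
})
test = pd.DataFrame({
    "purchase_value": [25.0],
    "age": [40],
    "source": ["Email"],  # not present in the training rows
    "browser": ["Chrome"],
})

pre = build_preprocessor(["purchase_value", "age"], ["source", "browser"])
X_train = pre.fit_transform(train)  # statistics are learned from training rows only
X_test = pre.transform(test)        # and reused unchanged on new rows
print(X_train.shape, X_test.shape)
```

Fitting on the training split and only calling `transform` on validation or test data keeps scaling statistics and category lists from leaking across splits.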


## 4. Save Processed Data

In [9]:
# Save the featured datasets for the training step
save_processed_data(fraud_featured, "fraud_featured.csv")
save_processed_data(cc_featured, "creditcard_featured.csv")

print("Datasets saved successfully to data/processed/")

2025-12-27 19:08:14 - fraud_detection - INFO - Saving processed data to C:\Users\Lenovo\Documents\dawir\Fraud-Detection\data\processed\fraud_featured.csv
2025-12-27 19:08:14 - fraud_detection - INFO - Successfully saved 0 records
2025-12-27 19:08:14 - fraud_detection - INFO - Saving processed data to C:\Users\Lenovo\Documents\dawir\Fraud-Detection\data\processed\creditcard_featured.csv
2025-12-27 19:08:34 - fraud_detection - INFO - Successfully saved 284807 records
Datasets saved successfully to data/processed/
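
The log above shows `fraud_featured.csv` being written with 0 records while the cell still prints a success message. A small validation step before saving would surface this kind of problem early. The sketch below assumes a hypothetical helper, `validate_before_save`, that is not part of `src/data/loading`; it raises if a DataFrame has too few rows or lacks expected columns.

```python
import pandas as pd


def validate_before_save(df, name, required_cols=(), min_rows=1):
    """Hypothetical helper: raise ValueError if df looks unfit to save."""
    problems = []
    if len(df) < min_rows:
        problems.append(f"{len(df)} rows (expected at least {min_rows})")
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if problems:
        raise ValueError(f"{name}: " + "; ".join(problems))
    return df


# Made-up frames: one valid, one empty with the same columns.
good = pd.DataFrame({"purchase_value": [10, 20], "class": [0, 1]})
empty = good.iloc[0:0]

# Passes silently and returns the frame unchanged.
validate_before_save(good, "fraud_featured", required_cols=["class"])

# Fails loudly for the empty frame.
try:
    validate_before_save(empty, "fraud_featured", required_cols=["class"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

In this notebook, such a check could wrap each DataFrame passed to `save_processed_data`, so that an empty result stops the run instead of producing an empty file.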
