## Feature Engineering for Credit Card Fraud Detection

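This notebook transforms raw credit card transaction data into model-ready features: temporal, amount-based, velocity, statistical, interaction, and anomaly features, followed by scaling, feature selection, and a train-test split.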

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder, RobustScaler
from sklearn.model_selection import train_test_split
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

In [3]:
print("Feature Engineering for Credit Card Fraud Detection")
print("=" * 50)
# Load the raw data
print("Loading data...")
try:
    df = pd.read_csv('../data/raw/creditcard.csv')
    print(f"Data loaded successfully. Shape: {df.shape}")
except FileNotFoundError:
    print("creditcard.csv not found. Using sample data structure...")
    # Create sample data structure for demonstration
    np.random.seed(42)
    n_samples = 10000
    df = pd.DataFrame({
        'Time': np.random.randint(0, 172800, n_samples),  # 48 hours in seconds
        # V1-V28 stand in for the PCA components (generated in order, one draw per column)
        **{f'V{i}': np.random.normal(0, 1, n_samples) for i in range(1, 29)},
        'Amount': np.random.lognormal(2, 1.5, n_samples),
        'Class': np.random.choice([0, 1], n_samples, p=[0.998, 0.002])  # Imbalanced classes
    })
    print("Sample data created for demonstration")

# Display basic information
print("\nDataset Info:")
print(f"Shape: {df.shape}")
print(f"Fraud cases: {df['Class'].sum()} ({df['Class'].mean()*100:.2f}%)")
print(f"Normal cases: {(df['Class'] == 0).sum()} ({(df['Class'] == 0).mean()*100:.2f}%)")

Feature Engineering for Credit Card Fraud Detection
Loading data...
Data loaded successfully. Shape: (284807, 31)

Dataset Info:
Shape: (284807, 31)
Fraud cases: 492 (0.17%)
Normal cases: 284315 (99.83%)


### CREATE FEATURES

In [5]:
# 1. TEMPORAL FEATURES
print("\n1. Creating Temporal Features...")
print("-" * 30)

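# Convert time to hours and extract temporal patterns
# Note: 'Time' is the number of seconds elapsed since the first transaction in the
# dataset, so Hour is an offset from that starting point rather than a known clock hour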
df['Hour'] = (df['Time'] / 3600) % 24
df['Day'] = df['Time'] // (24 * 3600)

# Create time-based features
df['Hour_sin'] = np.sin(2 * np.pi * df['Hour'] / 24)
df['Hour_cos'] = np.cos(2 * np.pi * df['Hour'] / 24)

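# Time periods
# Note: Is_Weekend assumes Day 0 is a Monday, which the data does not state. If the data
# spans only about 48 hours (as the sample data does), Day is 0 or 1 and Is_Weekend is 0
# for every row. Hour is fractional, so `<= 6` ends the night at 06:00 and `<= 17` ends
# business hours at 17:00.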
df['Is_Weekend'] = (df['Day'] % 7 >= 5).astype(int)
df['Is_Night'] = ((df['Hour'] >= 22) | (df['Hour'] <= 6)).astype(int)
df['Is_Business_Hours'] = ((df['Hour'] >= 9) & (df['Hour'] <= 17)).astype(int)

print("Temporal features created:")
print("- Hour, Day")
print("- Hour_sin, Hour_cos (cyclical encoding)")
print("- Is_Weekend, Is_Night, Is_Business_Hours")


1. Creating Temporal Features...
------------------------------
Temporal features created:
- Hour, Day
- Hour_sin, Hour_cos (cyclical encoding)
- Is_Weekend, Is_Night, Is_Business_Hours
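
To illustrate why the hour is encoded as a sine/cosine pair, the minimal sketch below compares the gap between two hours on the raw scale with the gap after encoding. The helper names `encode_hour` and `encoded_distance` are hypothetical and not part of the notebook.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / 24
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

def encoded_distance(h1, h2):
    """Euclidean distance between two hours after cyclical encoding."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

# On the raw scale 23:00 and 00:00 look 23 hours apart...
raw_gap = abs(23 - 0)
# ...but on the circle they are as close as any two adjacent hours.
wrap_gap = encoded_distance(23, 0)
adjacent_gap = encoded_distance(11, 12)
opposite_gap = encoded_distance(0, 12)

print(f"raw gap 23 vs 0: {raw_gap}")
print(f"encoded gap 23 vs 0: {wrap_gap:.3f}")
print(f"encoded gap 11 vs 12: {adjacent_gap:.3f}")
print(f"encoded gap 0 vs 12: {opposite_gap:.3f}")
```

A model that sees only the raw `Hour` value has to learn the wrap-around at midnight on its own; the encoded pair makes it explicit.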


In [6]:
# 2. AMOUNT-BASED FEATURES
print("\n2. Creating Amount-based Features...")
print("-" * 30)

# Log transform for amount (handle skewness)
df['Amount_log'] = np.log1p(df['Amount'])

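# Amount categories
# include_lowest=True puts zero-amount transactions in 'Low'; without it the
# first bin is (0, 50] and an Amount of exactly 0 becomes NaN
df['Amount_Category'] = pd.cut(df['Amount'], 
                              bins=[0, 50, 200, 1000, float('inf')], 
                              labels=['Low', 'Medium', 'High', 'Very_High'],
                              include_lowest=True)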

# Amount percentiles
df['Amount_Percentile'] = pd.qcut(df['Amount'], 
                                 q=10, 
                                 labels=False, 
                                 duplicates='drop')

# Amount statistics
df['Amount_Squared'] = df['Amount'] ** 2
df['Amount_Sqrt'] = np.sqrt(df['Amount'])

print("Amount-based features created:")
print("- Amount_log (log transformation)")
print("- Amount_Category (categorical bins)")
print("- Amount_Percentile (percentile bins)")
print("- Amount_Squared, Amount_Sqrt")


2. Creating Amount-based Features...
------------------------------
Amount-based features created:
- Amount_log (log transformation)
- Amount_Category (categorical bins)
- Amount_Percentile (percentile bins)
- Amount_Squared, Amount_Sqrt
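
How `pd.cut` treats the left edge of the first bin matters for zero-amount transactions. The sketch below uses made-up amounts with the same bin edges and labels as the cell above, and compares the default behaviour with `include_lowest=True`.

```python
import pandas as pd

# Made-up amounts, including a zero and a value that sits exactly on a bin edge
amounts = pd.Series([0.0, 25.0, 50.0, 120.0, 5000.0])
bins = [0, 50, 200, 1000, float('inf')]
labels = ['Low', 'Medium', 'High', 'Very_High']

# Default: intervals are closed on the right and the first one is open on the left
default_cut = pd.cut(amounts, bins=bins, labels=labels)
# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(amounts, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    'Amount': amounts,
    'default': default_cut,
    'include_lowest': inclusive_cut,
}))
```

With the default settings the zero amount has no category, while the value 50 falls into `Low` because each interval includes its right edge.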


In [7]:
# 3. VELOCITY FEATURES (Transaction Frequency)
print("\n3. Creating Velocity Features...")
print("-" * 30)

# Sort by time for velocity calculations
df_sorted = df.sort_values('Time').reset_index(drop=True)

# Calculate time differences between consecutive transactions
df_sorted['Time_Delta'] = df_sorted['Time'].diff()
df_sorted['Time_Delta'] = df_sorted['Time_Delta'].fillna(df_sorted['Time_Delta'].median())

# Velocity features
df_sorted['Velocity'] = 1 / (df_sorted['Time_Delta'] + 1)  # Add 1 to avoid division by zero

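# Rolling window features (simulated customer grouping)
# Note: In real scenarios, you'd group by customer ID
# With min_periods=1 the rolling std of the first row is NaN (a single observation),
# and Count_Rolling_{window} equals the window size once the window has filled up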
window_sizes = [5, 10, 20]
for window in window_sizes:
    df_sorted[f'Amount_Rolling_Mean_{window}'] = df_sorted['Amount'].rolling(window=window, min_periods=1).mean()
    df_sorted[f'Amount_Rolling_Std_{window}'] = df_sorted['Amount'].rolling(window=window, min_periods=1).std()
    df_sorted[f'Count_Rolling_{window}'] = df_sorted['Amount'].rolling(window=window, min_periods=1).count()

print("Velocity features created:")
print("- Time_Delta (time between transactions)")
print("- Velocity (transaction frequency)")
print("- Rolling statistics for windows: 5, 10, 20")


3. Creating Velocity Features...
------------------------------
Velocity features created:
- Time_Delta (time between transactions)
- Velocity (transaction frequency)
- Rolling statistics for windows: 5, 10, 20
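
The velocity cell notes that real data would be grouped by customer. A minimal sketch of that grouping follows; the `customer_id` column and all values are hypothetical, since the dataset used here has no customer identifier.

```python
import pandas as pd

# Hypothetical transactions; 'customer_id' does not exist in the notebook's data.
tx = pd.DataFrame({
    'customer_id': ['a', 'b', 'a', 'a', 'b'],
    'Time':        [0, 10, 20, 30, 40],
    'Amount':      [10.0, 100.0, 20.0, 30.0, 300.0],
}).sort_values('Time').reset_index(drop=True)

# Rolling mean over each customer's own last 2 transactions (current one included)
tx['Amount_Rolling_Mean_2'] = (
    tx.groupby('customer_id')['Amount']
      .transform(lambda s: s.rolling(window=2, min_periods=1).mean())
)

# Seconds since the same customer's previous transaction (NaN for their first one)
tx['Time_Delta'] = tx.groupby('customer_id')['Time'].diff()

print(tx)
```

Grouping keeps one customer's spending from leaking into another customer's rolling statistics, which the ungrouped version cannot guarantee.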


In [8]:
# 4. STATISTICAL FEATURES FROM PCA COMPONENTS
print("\n4. Creating Statistical Features from V1-V28...")
print("-" * 30)

# V1-V28 are PCA components, create additional statistical features
v_columns = [f'V{i}' for i in range(1, 29)]

# Statistical aggregations
df_sorted['V_Mean'] = df_sorted[v_columns].mean(axis=1)
df_sorted['V_Std'] = df_sorted[v_columns].std(axis=1)
df_sorted['V_Min'] = df_sorted[v_columns].min(axis=1)
df_sorted['V_Max'] = df_sorted[v_columns].max(axis=1)
df_sorted['V_Range'] = df_sorted['V_Max'] - df_sorted['V_Min']

# Skewness and kurtosis across the 28 components of each row
# (vectorized with axis=1, which avoids a slow row-by-row apply)
from scipy import stats
v_values = df_sorted[v_columns].to_numpy()
df_sorted['V_Skew'] = stats.skew(v_values, axis=1)
df_sorted['V_Kurt'] = stats.kurtosis(v_values, axis=1)

# Number of components whose absolute value exceeds 2
# (a fixed threshold, not 2 standard deviations of each component)
df_sorted['V_Outliers'] = (np.abs(v_values) > 2).sum(axis=1)

print("Statistical features created:")
print("- V_Mean, V_Std, V_Min, V_Max, V_Range")
print("- V_Skew, V_Kurt (distribution shape)")
print("- V_Outliers (count of extreme values)")


4. Creating Statistical Features from V1-V28...
------------------------------
Statistical features created:
- V_Mean, V_Std, V_Min, V_Max, V_Range
- V_Skew, V_Kurt (distribution shape)
- V_Outliers (count of extreme values)
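
Row-wise statistics such as these can be computed either with `DataFrame.apply(..., axis=1)` or by passing `axis=1` to the NumPy/SciPy function directly. The sketch below checks, on random stand-in data rather than the real components, that both routes give the same values.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Random stand-in for the V1-V28 columns
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 28)),
                    columns=[f'V{i}' for i in range(1, 29)])

# Row-by-row version using DataFrame.apply
skew_apply = demo.apply(lambda row: stats.skew(row), axis=1).to_numpy()
outliers_apply = demo.apply(lambda row: np.sum(np.abs(row) > 2), axis=1).to_numpy()

# Vectorized version operating on the whole array at once
values = demo.to_numpy()
skew_vec = stats.skew(values, axis=1)
outliers_vec = (np.abs(values) > 2).sum(axis=1)

print(np.allclose(skew_apply, skew_vec), np.array_equal(outliers_apply, outliers_vec))
```

On a few hundred thousand rows the vectorized form is usually much faster, because it avoids one Python function call per row.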


In [9]:
# 5. INTERACTION FEATURES
print("\n5. Creating Interaction Features...")
print("-" * 30)

# Amount interactions with time
df_sorted['Amount_Hour_Interaction'] = df_sorted['Amount'] * df_sorted['Hour']
df_sorted['Amount_Weekend_Interaction'] = df_sorted['Amount'] * df_sorted['Is_Weekend']
df_sorted['Amount_Night_Interaction'] = df_sorted['Amount'] * df_sorted['Is_Night']

# V component interactions (select a few important ones)
df_sorted['V1_V2_Interaction'] = df_sorted['V1'] * df_sorted['V2']
df_sorted['V1_Amount_Interaction'] = df_sorted['V1'] * df_sorted['Amount_log']
df_sorted['V4_V11_Interaction'] = df_sorted['V4'] * df_sorted['V11']

print("Interaction features created:")
print("- Amount with temporal features")
print("- Selected V component interactions")


5. Creating Interaction Features...
------------------------------
Interaction features created:
- Amount with temporal features
- Selected V component interactions


In [10]:
# 6. ANOMALY DETECTION FEATURES
print("\n6. Creating Anomaly Detection Features...")
print("-" * 30)

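# Isolation Forest outlier flag (simplified version)
from sklearn.ensemble import IsolationForest

# Use a subset of features for anomaly detection
anomaly_features = ['Amount_log', 'V1', 'V2', 'V3', 'V4', 'V5', 'Hour']
# contamination=0.1 marks about 10% of rows as outliers, far more than the 0.17% fraud rate
iso_forest = IsolationForest(contamination=0.1, random_state=42)
# fit_predict returns -1 for outliers and 1 for inliers, so despite its name
# Anomaly_Score ends up as a binary flag (1 = outlier)
df_sorted['Anomaly_Score'] = iso_forest.fit_predict(df_sorted[anomaly_features])
df_sorted['Anomaly_Score'] = (df_sorted['Anomaly_Score'] == -1).astype(int)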

# Distance-based features
df_sorted['Distance_from_Mean'] = np.sqrt(
    ((df_sorted[v_columns] - df_sorted[v_columns].mean()) ** 2).sum(axis=1)
)

print("Anomaly detection features created:")
print("- Anomaly_Score (Isolation Forest)")
print("- Distance_from_Mean (Euclidean distance)")


6. Creating Anomaly Detection Features...
------------------------------
Anomaly detection features created:
- Anomaly_Score (Isolation Forest)
- Distance_from_Mean (Euclidean distance)
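
An Isolation Forest can also provide a continuous score instead of a 0/1 flag. The sketch below is one possible variant on synthetic data, not the notebook's method: it uses scikit-learn's `score_samples`, which returns higher values for more normal points, and negates it so that larger means more anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a Gaussian cloud plus one obvious outlier in the last row
rng = np.random.default_rng(42)
inliers = rng.normal(0, 1, size=(500, 2))
outlier = np.array([[8.0, 8.0]])
X_demo = np.vstack([inliers, outlier])

forest = IsolationForest(random_state=42).fit(X_demo)

# score_samples is higher for more normal points, so negate it
anomaly_score = -forest.score_samples(X_demo)
# predict returns -1 for outliers and 1 for inliers
anomaly_flag = (forest.predict(X_demo) == -1).astype(int)

print(f"distinct score values: {np.unique(anomaly_score).size}")
print(f"distinct flag values: {np.unique(anomaly_flag).size}")
```

A continuous score preserves the ranking of rows by how unusual they are, which a downstream classifier can threshold on its own.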


### FEATURE SCALING

In [13]:
print("\n7. Scaling Features...")
print("-" * 30)

# Separate features and target
feature_columns = [col for col in df_sorted.columns if col not in ['Class', 'Time']]
X = df_sorted[feature_columns].copy()
y = df_sorted['Class'].copy()

# Handle categorical features
categorical_features = []
if 'Amount_Category' in X.columns:
    categorical_features = ['Amount_Category']
    # One-hot encode categorical features
    X_encoded = pd.get_dummies(X, columns=categorical_features, drop_first=True)
    print(f"One-hot encoded categorical features: {categorical_features}")
else:
    X_encoded = X.copy()

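# Scale numerical features
# Note: the scaler is fitted on all rows, before the train-test split in step 9
scaler = RobustScaler()  # Robust to outliers
X_scaled = scaler.fit_transform(X_encoded)
# Keep the original index so rows stay aligned with y
X_scaled = pd.DataFrame(X_scaled, columns=X_encoded.columns, index=X_encoded.index)

print("Features scaled using RobustScaler")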
print(f"Total features after encoding: {X_scaled.shape[1]}")


7. Scaling Features...
------------------------------
One-hot encoded categorical features: ['Amount_Category']
Features scaled using RobustScaler
Total features after encoding: 70
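
The scaler in this notebook is fitted before the data is split, so its medians and interquartile ranges also reflect rows that later land in the test set. A common alternative, sketched here on synthetic data with hypothetical variable names, is to split first and fit the scaler on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'Amount': rng.lognormal(2, 1.5, 1000),
    'V1': rng.normal(size=1000),
})
y_demo = pd.Series(rng.choice([0, 1], size=1000, p=[0.9, 0.1]))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Statistics (median, IQR) come from the training rows only
demo_scaler = RobustScaler().fit(X_tr)
X_tr_scaled = pd.DataFrame(demo_scaler.transform(X_tr),
                           columns=X_tr.columns, index=X_tr.index)
# The test rows are transformed with the training statistics
X_te_scaled = pd.DataFrame(demo_scaler.transform(X_te),
                           columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.median().round(6).to_dict())
```

The training medians become zero after scaling, while the test medians are only close to zero, which is the expected behaviour when the test set is treated as unseen data.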


### FEATURE SELECTION

In [19]:
# 8. FEATURE SELECTION
print("\n8. Feature Selection...")
print("-" * 30)

# Handle missing values before feature selection
print(f"Missing values before cleaning: {X_scaled.isnull().sum().sum()}")
print(f"Infinite values before cleaning: {np.isinf(X_scaled).sum().sum()}")

# Replace infinite values with NaN, then fill NaN values
X_scaled = X_scaled.replace([np.inf, -np.inf], np.nan)
X_scaled = X_scaled.fillna(X_scaled.median())

print(f"Missing values after cleaning: {X_scaled.isnull().sum().sum()}")
print(f"Infinite values after cleaning: {np.isinf(X_scaled).sum().sum()}")

# Calculate feature importance using mutual information
from sklearn.feature_selection import mutual_info_classif

# Calculate mutual information scores
mi_scores = mutual_info_classif(X_scaled, y, random_state=42)
feature_importance = pd.DataFrame({
    'feature': X_scaled.columns,
    'importance': mi_scores
}).sort_values('importance', ascending=False)

print("Top 10 most important features:")
print(feature_importance.head(10))

# Select top features (use all available features if there are fewer than 50)
n_features_to_select = min(50, len(feature_importance))
top_features = feature_importance.head(n_features_to_select)['feature'].tolist()
X_selected = X_scaled[top_features]

print(f"Selected {len(top_features)} most important features")


8. Feature Selection...
------------------------------
Missing values before cleaning: 0
Infinite values before cleaning: 0
Missing values after cleaning: 0
Infinite values after cleaning: 0
Top 10 most important features:
                   feature  importance
30                     Day    0.045694
35       Is_Business_Hours    0.039235
58              V_Outliers    0.016047
41                Velocity    0.012371
67  Amount_Category_Medium    0.009945
16                     V17    0.008258
13                     V14    0.008136
11                     V12    0.007601
9                      V10    0.007530
64      V4_V11_Interaction    0.007205
Selected 50 most important features
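
The same mutual-information ranking can be wrapped in scikit-learn's `SelectKBest`, which makes it straightforward to fit the selection on training rows only. The sketch below uses synthetic data in which a single column carries the signal; `functools.partial` is used only to fix `random_state`.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: five noise columns, then column 2 is shifted by the label
rng = np.random.default_rng(0)
n = 600
y_demo = rng.integers(0, 2, size=n)
X_demo = rng.normal(size=(n, 5))
X_demo[:, 2] += 3 * y_demo  # only column 2 carries information about the label

selector = SelectKBest(
    score_func=partial(mutual_info_classif, random_state=42),
    k=2,
)
selector.fit(X_demo, y_demo)

# Column indices ordered from highest to lowest mutual information
ranked = np.argsort(selector.scores_)[::-1]
print("scores:", np.round(selector.scores_, 3))
print("best column:", ranked[0])
print("kept columns:", np.flatnonzero(selector.get_support()))
```

Once fitted, `selector.transform` applies the same column choice to new data, so the selection can be reused without recomputing the scores.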


In [20]:
# 9. TRAIN-TEST SPLIT
print("\n9. Creating Train-Test Split...")
print("-" * 30)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"Training set fraud rate: {y_train.mean()*100:.2f}%")
print(f"Test set fraud rate: {y_test.mean()*100:.2f}%")


9. Creating Train-Test Split...
------------------------------
Training set shape: (227845, 50)
Test set shape: (56962, 50)
Training set fraud rate: 0.17%
Test set fraud rate: 0.17%
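
The `stratify=y` argument is what keeps the fraud rate identical in both splits. A small sketch on made-up labels with 0.2% positives shows the effect: the stratified test set receives exactly its proportional share of positives, while an unstratified split leaves that number to chance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 positives among 10,000 rows (0.2%)
rng = np.random.default_rng(0)
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1
X_demo = rng.normal(size=(10_000, 3))

_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
_, _, _, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(f"positives in stratified test set: {y_te_strat.sum()}")
print(f"positives in unstratified test set: {y_te_plain.sum()}")
```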


In [21]:
# 10. SAVE PROCESSED DATA
print("\n10. Saving Processed Data...")
print("-" * 30)

# Create processed data directory if it doesn't exist
import os
os.makedirs('../data/processed', exist_ok=True)

# Save the processed datasets
X_train.to_csv('../data/processed/X_train.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)

# Save feature names and scaler for later use
import joblib
joblib.dump(scaler, '../data/processed/scaler.pkl')
joblib.dump(top_features, '../data/processed/selected_features.pkl')
joblib.dump(feature_importance, '../data/processed/feature_importance.pkl')

print("Data saved to ../data/processed/:")
print("- X_train.csv, X_test.csv")
print("- y_train.csv, y_test.csv")
print("- scaler.pkl, selected_features.pkl, feature_importance.pkl")


10. Saving Processed Data...
------------------------------
Data saved to ../data/processed/:
- X_train.csv, X_test.csv
- y_train.csv, y_test.csv
- scaler.pkl, selected_features.pkl, feature_importance.pkl
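
The saved scaler and feature list can later be reloaded with `joblib.load`. Because the scaler was fitted on the full encoded column set and selection happened afterwards, new rows need to be scaled with all columns first and only then reduced to the selected ones. The sketch below demonstrates that order with stand-in data, a temporary directory, and hypothetical column names.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Stand-in training data and artifacts
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(100, 3)), columns=['f1', 'f2', 'f3'])
demo_scaler = RobustScaler().fit(train)
demo_selected = ['f3', 'f1']  # stand-in for the saved feature list

with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(demo_scaler, os.path.join(tmp, 'scaler.pkl'))
    joblib.dump(demo_selected, os.path.join(tmp, 'selected_features.pkl'))

    loaded_scaler = joblib.load(os.path.join(tmp, 'scaler.pkl'))
    loaded_features = joblib.load(os.path.join(tmp, 'selected_features.pkl'))

new_rows = pd.DataFrame(rng.normal(size=(5, 3)), columns=train.columns)
# Scale with the full column set the scaler was fitted on, then keep the selected columns
scaled = pd.DataFrame(loaded_scaler.transform(new_rows), columns=new_rows.columns)
prepared = scaled[loaded_features]

print(prepared.shape)
```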


In [22]:
# 11. FEATURE ENGINEERING SUMMARY
print("\n11. Feature Engineering Summary")
print("=" * 50)

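# Note: df already holds the temporal and amount features added in steps 1-2, so the
# first count covers more than the raw input columns; the second is the number of
# selected columns, which mixes raw and engineered features
print(f"Original features: {len([col for col in df.columns if col != 'Class'])}")
print(f"Engineered features: {X_selected.shape[1]}")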
print(f"Total samples: {len(df)}")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

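# Note: the counts below are hard-coded. Step 2 adds 5 columns (6 with the original
# Amount) and step 3 adds 11 (Time_Delta, Velocity, and 3 statistics for each of 3 windows)
print("\nFeature Categories Created:")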
print("1. Temporal Features (7 features)")
print("2. Amount-based Features (6 features)")
print("3. Velocity Features (10 features)")
print("4. Statistical Features (8 features)")
print("5. Interaction Features (6 features)")
print("6. Anomaly Detection Features (2 features)")
print("7. Original PCA Components (28 features)")

print("\nData Quality Checks:")
print(f"Missing values: {X_selected.isnull().sum().sum()}")
print(f"Infinite values: {np.isinf(X_selected).sum().sum()}")
# Note: the diagonal of the correlation matrix is included, so this maximum is always 1.000
print(f"Feature correlation (max): {X_selected.corr().abs().max().max():.3f}")

print("\nFeature engineering completed successfully!")
print("Data is ready for model training in 03_model_training.ipynb")


11. Feature Engineering Summary
Original features: 42
Engineered features: 50
Total samples: 284807
Training samples: 227845
Test samples: 56962

Feature Categories Created:
1. Temporal Features (7 features)
2. Amount-based Features (6 features)
3. Velocity Features (10 features)
4. Statistical Features (8 features)
5. Interaction Features (6 features)
6. Anomaly Detection Features (2 features)
7. Original PCA Components (28 features)

Data Quality Checks:
Missing values: 0
Infinite values: 0
Feature correlation (max): 1.000

Feature engineering completed successfully!
Data is ready for model training in 03_model_training.ipynb
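
The "Feature correlation (max)" check reports 1.000 because every column correlates perfectly with itself. A minimal sketch of a check that ignores the diagonal follows; `max_offdiag_corr` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'a_scaled' is a linear function of 'a', 'noise' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    'a': a,
    'a_scaled': 2 * a + 1,
    'noise': rng.normal(size=500),
})

def max_offdiag_corr(frame):
    """Largest absolute correlation between two different columns (hypothetical helper)."""
    corr = frame.corr().abs().to_numpy().copy()
    np.fill_diagonal(corr, np.nan)  # ignore each column's correlation with itself
    return float(np.nanmax(corr))

with_diag = demo.corr().abs().max().max()
print(f"including diagonal: {with_diag:.3f}")
print(f"a vs noise only: {max_offdiag_corr(demo[['a', 'noise']]):.3f}")
print(f"all columns: {max_offdiag_corr(demo):.3f}")
```

A value near 1 from this version points to a genuinely redundant pair, such as `a` and `a_scaled` here, rather than to the diagonal.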
