# Feature Engineering for Synthetic Finance Anomaly Detection

This notebook builds an enriched transaction dataset for anomaly detection in personal finance. It merges eleven synthetic financial tables into a single transaction-level DataFrame and computes features for detecting suspicious or unusual transactions.

## Overview

The feature engineering process involves:
1. **Data Loading**: Loading cleaned datasets from the processed data directory
2. **Table Merging**: Combining multiple financial tables to create enriched transactions
3. **Feature Creation**: Computing derived features for anomaly detection
4. **Data Validation**: Ensuring data quality and consistency

## Dataset Description

The synthetic dataset contains the following tables:
- **transactions** (48,042 rows) - Core transaction data
- **transaction_types** - Transaction type classifications
- **accounts** - Account information with balances
- **account_types** - Account type classifications (Checking, Savings, etc.)
- **account_statuses** - Account status (Active, Inactive, Closed)
- **customers** - Customer demographic information
- **customer_types** - Customer type classifications
- **addresses** - Customer address information
- **branches** - Bank branch information
- **loans** - Loan information linked to accounts
- **loan_statuses** - Loan status classifications

In [2]:
# Data Loading
import os
import glob
import pandas as pd

# Get all CSV files from the processed data directory
raw_data_path = "../data/processed"
csv_files = glob.glob(os.path.join(raw_data_path, "*.csv"))

# Collect the DataFrames and their names in parallel lists
dataframes = []
file_names = []

for csv_file in csv_files:
    # Extract filename without extension for reference
    file_name = os.path.basename(csv_file).replace('.csv', '')
    
    # Load CSV into DataFrame
    df = pd.read_csv(csv_file)
    
    # Store DataFrame and filename
    dataframes.append(df)
    file_names.append(file_name)
    
    print(f"Loaded {file_name}: {df.shape[0]} rows, {df.shape[1]} columns")

print(f"\nTotal DataFrames created: {len(dataframes)}")

# Create dictionary for easy access
df_dict = dict(zip(file_names, dataframes))
print(f"\nAvailable DataFrames: {list(df_dict.keys())}")


Loaded accounts_cleaned: 1635 rows, 6 columns
Loaded account_statuses: 3 rows, 2 columns
Loaded account_types: 5 rows, 2 columns
Loaded addresses_cleaned: 1210 rows, 4 columns
Loaded branches_cleaned: 50 rows, 3 columns
Loaded customers_cleaned: 1058 rows, 6 columns
Loaded customer_types: 3 rows, 2 columns
Loaded loans_cleaned: 316 rows, 7 columns
Loaded loan_statuses: 3 rows, 2 columns
Loaded transactions_cleaned: 48042 rows, 8 columns
Loaded transaction_types: 4 rows, 2 columns

Total DataFrames created: 11

Available DataFrames: ['accounts_cleaned', 'account_statuses', 'account_types', 'addresses_cleaned', 'branches_cleaned', 'customers_cleaned', 'customer_types', 'loans_cleaned', 'loan_statuses', 'transactions_cleaned', 'transaction_types']


## Step 1: Data Loading

The first step involves loading all the cleaned datasets from the processed data directory. This includes:

- **Transaction data**: Core transaction records with amounts, dates, and account references
- **Reference tables**: Lookup tables for transaction types, account types, customer types, etc.
- **Account data**: Account balances, types, and status information
- **Customer data**: Customer demographics and contact information
- **Loan data**: Loan information associated with accounts
- **Branch data**: Bank branch information

All datasets are loaded into a dictionary for easy access during the merging process.
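
The loading cell globs every `*.csv` file in the directory, so any CSV later written to the same folder is also picked up on the next run. A minimal sketch of a stricter loader follows, assuming the file stems shown in the load output above; `EXPECTED_TABLES` and `load_tables` are hypothetical names and not part of the notebook.

```python
from pathlib import Path

import pandas as pd

# Assumed file stems, copied from the load output in this notebook.
EXPECTED_TABLES = [
    "transactions_cleaned", "transaction_types",
    "accounts_cleaned", "account_types", "account_statuses",
    "customers_cleaned", "customer_types", "addresses_cleaned",
    "branches_cleaned", "loans_cleaned", "loan_statuses",
]


def load_tables(data_dir, table_names=EXPECTED_TABLES):
    """Load only the named CSV files and fail early if one is missing."""
    data_dir = Path(data_dir)
    missing = [n for n in table_names if not (data_dir / f"{n}.csv").exists()]
    if missing:
        raise FileNotFoundError(f"Missing tables in {data_dir}: {missing}")
    # Map each table name to its DataFrame; other files in the folder are ignored.
    return {n: pd.read_csv(data_dir / f"{n}.csv") for n in table_names}
```

With this helper, `df_dict = load_tables("../data/processed")` would yield the same dictionary keys as the cell above while ignoring unrelated files.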


In [3]:
# Assign datasets to individual variables for clarity
df_transactions = df_dict['transactions_cleaned']
df_transaction_types = df_dict['transaction_types']
df_account_statuses = df_dict['account_statuses']
df_account_types = df_dict['account_types']
df_accounts = df_dict['accounts_cleaned']
df_branches = df_dict['branches_cleaned']
df_customer_types = df_dict['customer_types']
df_customers = df_dict['customers_cleaned']
df_loan_statuses = df_dict['loan_statuses']
df_loans = df_dict['loans_cleaned']
df_addresses = df_dict['addresses_cleaned']

print("DataFrames assigned successfully!")
print(f"Transaction data shape: {df_transactions.shape}")
print(f"Account data shape: {df_accounts.shape}")
print(f"Customer data shape: {df_customers.shape}")


DataFrames assigned successfully!
Transaction data shape: (48042, 8)
Account data shape: (1635, 6)
Customer data shape: (1058, 6)


## Step 2: Table Merging Strategy

The goal is to create a single enriched transaction DataFrame containing all relevant information about:
- **Origin accounts**: Source account details, customer info, and loan data
- **Destination accounts**: Target account details, customer info, and loan data
- **Transaction context**: Transaction type, branch information, and timing

### Merging Process:

1. **Transaction Types**: Add transaction type names to transactions
2. **Origin Account Details**: Merge account, account type, and account status for origin accounts
3. **Destination Account Details**: Merge account, account type, and account status for destination accounts
4. **Customer Information**: Add customer demographics for both origin and destination
5. **Branch Information**: Add branch details for transaction context
6. **Loan Information**: Aggregate loan data per account and attach it to both origin and destination accounts

This creates a comprehensive dataset where each row represents a transaction with full context about all parties involved.
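
Because every step is a left join onto the transaction table, two things are worth checking after each merge: that the row count did not grow, and how many transactions found no match. The sketch below uses the `validate` and `indicator` parameters of `DataFrame.merge`; the helper `checked_left_merge` and the toy values are illustrative assumptions, and only the key names come from the notebook.

```python
import pandas as pd


def checked_left_merge(left, right, **merge_kwargs):
    """Left-merge that enforces a many-to-one join and counts unmatched rows."""
    merged = left.merge(
        right,
        how="left",
        validate="many_to_one",  # raises MergeError if the right-hand keys repeat
        indicator=True,          # adds a temporary "_merge" column
        **merge_kwargs,
    )
    unmatched = int((merged["_merge"] == "left_only").sum())
    # Redundant with validate, but states the intent explicitly.
    assert len(merged) == len(left), "left merge changed the row count"
    return merged.drop(columns="_merge"), unmatched


# Toy data using the notebook's key names; the values are made up.
transactions = pd.DataFrame({"TransactionID": [1, 2, 3],
                             "AccountOriginID": [10, 11, 99]})
accounts = pd.DataFrame({"AccountID": [10, 11], "Balance": [500.0, 0.0]})

tx, n_unmatched = checked_left_merge(
    transactions,
    accounts.add_prefix("Origin_"),
    left_on="AccountOriginID",
    right_on="Origin_AccountID",
)
print(f"{n_unmatched} transaction(s) had no matching origin account")
```

Unmatched rows are where the missing values in the merged dataset come from, so logging this count per merge makes the later missing-value totals easier to interpret.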


In [4]:
def build_enriched_transactions(transactions, transaction_types,
                                accounts, account_types, account_statuses,
                                customers, customer_types, addresses,
                                branches, loans, loan_statuses):
    """
    Build an enriched transaction dataset by merging multiple financial tables.
    
    Parameters:
    -----------
    transactions : pd.DataFrame
        Core transaction data
    transaction_types : pd.DataFrame
        Transaction type lookup table
    accounts : pd.DataFrame
        Account information
    account_types : pd.DataFrame
        Account type lookup table
    account_statuses : pd.DataFrame
        Account status lookup table
    customers : pd.DataFrame
        Customer information
    customer_types : pd.DataFrame
        Customer type lookup table
    addresses : pd.DataFrame
        Address information
    branches : pd.DataFrame
        Branch information
    loans : pd.DataFrame
        Loan information
    loan_statuses : pd.DataFrame
        Loan status lookup table
    
    Returns:
    --------
    pd.DataFrame
        Enriched transaction dataset with all relevant information
    """
    
    # === STEP 1: Transaction types ===
    tx = transactions.merge(
        transaction_types,
        on="TransactionTypeID",
        how="left"
    ).rename(columns={"TypeName": "TransactionTypeName"})

    # === STEP 2: Origin account details ===
    tx = tx.merge(
        accounts.add_prefix("Origin_"),
        left_on="AccountOriginID",
        right_on="Origin_AccountID",
        how="left"
    )
    tx = tx.merge(
        account_types.add_prefix("Origin_"),
        left_on="Origin_AccountTypeID",
        right_on="Origin_AccountTypeID",
        how="left"
    ).rename(columns={"Origin_TypeName": "Origin_AccountType"})
    tx = tx.merge(
        account_statuses.add_prefix("Origin_"),
        left_on="Origin_AccountStatusID",
        right_on="Origin_AccountStatusID",
        how="left"
    ).rename(columns={"Origin_StatusName": "Origin_AccountStatus"})

    # === STEP 3: Destination account details ===
    tx = tx.merge(
        accounts.add_prefix("Dest_"),
        left_on="AccountDestinationID",
        right_on="Dest_AccountID",
        how="left"
    )
    tx = tx.merge(
        account_types.add_prefix("Dest_"),
        left_on="Dest_AccountTypeID",
        right_on="Dest_AccountTypeID",
        how="left"
    ).rename(columns={"Dest_TypeName": "Dest_AccountType"})
    tx = tx.merge(
        account_statuses.add_prefix("Dest_"),
        left_on="Dest_AccountStatusID",
        right_on="Dest_AccountStatusID",
        how="left"
    ).rename(columns={"Dest_StatusName": "Dest_AccountStatus"})

    # === STEP 4: Customer info (origin & dest) ===
    customers_full = customers.merge(
        customer_types,
        on="CustomerTypeID",
        how="left"
    ).merge(
        addresses,
        on="AddressID",
        how="left"
    )

    # Origin customer
    tx = tx.merge(
        customers_full.add_prefix("Origin_"),
        left_on="Origin_CustomerID",
        right_on="Origin_CustomerID",
        how="left"
    )

    # Destination customer
    tx = tx.merge(
        customers_full.add_prefix("Dest_"),
        left_on="Dest_CustomerID",
        right_on="Dest_CustomerID",
        how="left"
    )

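    # === STEP 5: Branch info ===
    # Note: the address columns are prefixed here and again below, so they
    # end up named Branch_Branch_* in the merged dataset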
    branches_full = branches.merge(
        addresses.add_prefix("Branch_"),
        left_on="AddressID",
        right_on="Branch_AddressID",
        how="left"
    )

    tx = tx.merge(
        branches_full.add_prefix("Branch_"),
        left_on="BranchID",
        right_on="Branch_BranchID",
        how="left"
    )

    # === STEP 6: Loan features ===
    # Aggregate loan metrics per account
    loan_features = loans.groupby("AccountID").agg(
        LoanCount=("LoanID", "count"),
        TotalPrincipal=("PrincipalAmount", "sum"),
        AvgInterestRate=("InterestRate", "mean"),
        MaxInterestRate=("InterestRate", "max"),
        MinInterestRate=("InterestRate", "min")
    ).reset_index()

    # Loan status counts per account
    loan_status_pivot = loans.merge(
        loan_statuses,
        on="LoanStatusID",
        how="left"
    ).groupby(["AccountID", "StatusName"]).size().unstack(fill_value=0).reset_index()

    # Merge loan features for origin accounts
    tx = tx.merge(
        loan_features.add_prefix("Origin_"),
        left_on="Origin_AccountID",
        right_on="Origin_AccountID",
        how="left"
    )

    # Merge loan features for destination accounts
    tx = tx.merge(
        loan_features.add_prefix("Dest_"),
        left_on="Dest_AccountID",
        right_on="Dest_AccountID",
        how="left"
    )

    # Merge loan status counts for origin accounts
    tx = tx.merge(
        loan_status_pivot.add_prefix("Origin_LoanStatus_"),
        left_on="Origin_AccountID",
        right_on="Origin_LoanStatus_AccountID",
        how="left"
    )

    # Merge loan status counts for destination accounts
    tx = tx.merge(
        loan_status_pivot.add_prefix("Dest_LoanStatus_"),
        left_on="Dest_AccountID",
        right_on="Dest_LoanStatus_AccountID",
        how="left"
    )

    return tx

print("Table merging function defined successfully!")


Table merging function defined successfully!


In [5]:
# Execute the table merging process
print("Starting table merging process...")

df_final = build_enriched_transactions(
    df_transactions, df_transaction_types,
    df_accounts, df_account_types, df_account_statuses,
    df_customers, df_customer_types, df_addresses,
    df_branches, df_loans, df_loan_statuses
)

print(f"✅ Merging completed!")
print(f"Final dataset shape: {df_final.shape}")
print(f"Number of columns: {df_final.shape[1]}")
print(f"Number of rows: {df_final.shape[0]}")

# Display basic information about the merged dataset
print("\n📊 Dataset Overview:")
print(f"Memory usage: {df_final.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"Missing values: {df_final.isnull().sum().sum()}")
print(f"Duplicate rows: {df_final.duplicated().sum()}")


Starting table merging process...
✅ Merging completed!
Final dataset shape: (48042, 68)
Number of columns: 68
Number of rows: 48042

📊 Dataset Overview:
Memory usage: 96.47 MB
Missing values: 764446
Duplicate rows: 0


## Step 3: Feature Engineering

After merging all tables, we create derived features that are useful for anomaly detection. These features fall into several categories:

### Feature Categories:

1. **Amount-based Features**: Ratios and relationships between transaction amounts and account balances
2. **Account Status Features**: Flags for inactive or problematic accounts
3. **Customer Demographics**: Age calculations and customer type information
4. **Loan-related Features**: Leverage ratios and loan status indicators
5. **Temporal Features**: Time-based patterns in transactions
6. **Anomaly Flags**: Heuristic flags for potentially suspicious transactions

### Feature Engineering Goals:

- **Detect unusual transaction patterns**: Large transfers, unusual timing, etc.
- **Identify risky accounts**: Inactive accounts, high leverage, etc.
- **Capture customer behavior**: Age-related patterns, customer type differences
- **Enable machine learning**: Create features suitable for anomaly detection algorithms
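
One detail matters for the flag features: rows where a merge found no match carry NaN, and comparisons against NaN do not raise an error. `NaN > 0.5` evaluates to False, and `NaN != NaN` evaluates to True. The sketch below illustrates this on toy values; the column `OriginBalanceMissingOrZero` is a hypothetical companion flag, not a feature produced by this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows: a normal transfer, a zero-balance origin, and an unmatched origin.
df = pd.DataFrame({
    "Amount": [600.0, 50.0, 75.0],
    "Origin_Balance": [1000.0, 0.0, np.nan],
    "Origin_TypeName": ["Individual", "Individual", np.nan],
    "Dest_TypeName": ["Small Business", "Individual", np.nan],
})

# Ratio with zero balances mapped to NaN to avoid division by zero.
ratio = df["Amount"] / df["Origin_Balance"].replace(0, np.nan)

# Plain comparison: NaN > 0.5 is False, so unknown ratios look "not large".
df["LargeTransferFlag"] = (ratio > 0.5).astype(int)

# Hypothetical companion flag recording that the ratio is undefined.
df["OriginBalanceMissingOrZero"] = ratio.isna().astype(int)

# NaN != NaN is True, so require both sides to be known before comparing.
both_known = df["Origin_TypeName"].notna() & df["Dest_TypeName"].notna()
df["CrossTypeTransfer"] = (
    both_known & (df["Origin_TypeName"] != df["Dest_TypeName"])
).astype(int)

print(df[["LargeTransferFlag", "OriginBalanceMissingOrZero", "CrossTypeTransfer"]])
```

Keeping an explicit "missing" indicator next to each flag lets a downstream model distinguish "not suspicious" from "unknown".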


In [6]:
def compute_transaction_features(df):
    """
    Compute derived features for anomaly detection from the enriched transaction dataset.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Enriched transaction dataset with merged account, customer, and loan information
    
    Returns:
    --------
    pd.DataFrame
        Dataset with additional engineered features
    """
    import numpy as np
    import pandas as pd
    
    # Create a copy to avoid modifying the original
    df = df.copy()
    
    print("🔧 Computing transaction features...")
    
    # === AMOUNT-BASED FEATURES ===
    print("  → Amount-based features")
    
    # Amount ratios (avoid division by zero)
    df['Amount_to_OriginBalance'] = df['Amount'] / df['Origin_Balance'].replace(0, np.nan)
    df['Amount_to_DestBalance'] = df['Amount'] / df['Dest_Balance'].replace(0, np.nan)
    
    # Amount relative to average transaction size
    avg_transaction = df['Amount'].mean()
    df['Amount_to_AvgTransaction'] = df['Amount'] / avg_transaction
    
    # === ACCOUNT STATUS FEATURES ===
    print("  → Account status features")
    
    # Account inactivity flags
    df['Origin_AccountInactive'] = df['Origin_AccountStatus'].isin(['Inactive', 'Closed']).astype(int)
    df['Dest_AccountInactive'] = df['Dest_AccountStatus'].isin(['Inactive', 'Closed']).astype(int)
    
    # Account type flags
    df['Origin_IsChecking'] = (df['Origin_AccountType'] == 'Checking').astype(int)
    df['Dest_IsChecking'] = (df['Dest_AccountType'] == 'Checking').astype(int)
    df['Origin_IsSavings'] = (df['Origin_AccountType'] == 'Savings').astype(int)
    df['Dest_IsSavings'] = (df['Dest_AccountType'] == 'Savings').astype(int)
    
    # === CUSTOMER DEMOGRAPHIC FEATURES ===
    print("  → Customer demographic features")
    
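    # Calculate customer ages in whole years (approximate: uses 365-day years
    # and the run date, so values change depending on when the notebook is run)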
    today = pd.Timestamp.today()
    df['Origin_Age'] = (today - pd.to_datetime(df['Origin_DateOfBirth'], errors='coerce')).dt.days // 365
    df['Dest_Age'] = (today - pd.to_datetime(df['Dest_DateOfBirth'], errors='coerce')).dt.days // 365
    
    # Age difference between origin and destination customers
    df['Age_Difference'] = df['Origin_Age'] - df['Dest_Age']
    
    # Customer type flags
    df['Origin_IsIndividual'] = (df['Origin_TypeName'] == 'Individual').astype(int)
    df['Dest_IsIndividual'] = (df['Dest_TypeName'] == 'Individual').astype(int)
    df['Origin_IsBusiness'] = (df['Origin_TypeName'] == 'Small Business').astype(int)
    df['Dest_IsBusiness'] = (df['Dest_TypeName'] == 'Small Business').astype(int)
    
    # === LOAN-RELATED FEATURES ===
    print("  → Loan-related features")
    
    # Loan leverage ratios (debt to balance)
    df['Origin_LoanLeverage'] = df['Origin_TotalPrincipal'] / df['Origin_Balance'].replace(0, np.nan)
    df['Dest_LoanLeverage'] = df['Dest_TotalPrincipal'] / df['Dest_Balance'].replace(0, np.nan)
    
    # Loan count flags
    df['Origin_HasLoans'] = (df['Origin_LoanCount'] > 0).astype(int)
    df['Dest_HasLoans'] = (df['Dest_LoanCount'] > 0).astype(int)
    
    # High interest rate flags (above 10%)
    df['Origin_HighInterest'] = (df['Origin_AvgInterestRate'] > 10).astype(int)
    df['Dest_HighInterest'] = (df['Dest_AvgInterestRate'] > 10).astype(int)
    
    # === TEMPORAL FEATURES ===
    print("  → Temporal features")
    
    # Extract time components (parse the date column once and reuse it)
    tx_datetime = pd.to_datetime(df['TransactionDate'])
    df['TransactionHour'] = tx_datetime.dt.hour
    df['TransactionWeekday'] = tx_datetime.dt.dayofweek
    df['TransactionMonth'] = tx_datetime.dt.month
    df['TransactionQuarter'] = tx_datetime.dt.quarter
    
    # Time-based flags
    df['IsWeekend'] = (df['TransactionWeekday'] >= 5).astype(int)
    df['IsBusinessHours'] = ((df['TransactionHour'] >= 9) & (df['TransactionHour'] <= 17)).astype(int)
    df['IsNightTime'] = ((df['TransactionHour'] >= 22) | (df['TransactionHour'] <= 6)).astype(int)
    
    # === ANOMALY FLAGS ===
    print("  → Anomaly detection flags")
    
    # Large transfer flags
    df['LargeTransferFlag'] = (df['Amount_to_OriginBalance'] > 0.5).astype(int)
    df['VeryLargeTransferFlag'] = (df['Amount_to_OriginBalance'] > 0.9).astype(int)
    
    # Unusual timing flags
    df['UnusualTimingFlag'] = ((df['IsNightTime'] == 1) | (df['IsWeekend'] == 1)).astype(int)
    
    # High-risk account combinations
    df['HighRiskFlag'] = ((df['Origin_AccountInactive'] == 1) | 
                         (df['Dest_AccountInactive'] == 1) |
                         (df['Origin_LoanLeverage'] > 2) |
                         (df['Dest_LoanLeverage'] > 2)).astype(int)
    
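    # Cross-customer type transfers (business to individual, etc.)
    # NaN != NaN evaluates to True, so only compare rows where both types are known
    both_types_known = df['Origin_TypeName'].notna() & df['Dest_TypeName'].notna()
    df['CrossTypeTransfer'] = (both_types_known &
                               (df['Origin_TypeName'] != df['Dest_TypeName'])).astype(int)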
    
    print("✅ Feature engineering completed!")
    
    return df

print("Feature engineering function defined successfully!")


Feature engineering function defined successfully!


In [7]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48042 entries, 0 to 48041
Data columns (total 68 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   TransactionID                48042 non-null  int64  
 1   AccountOriginID              48042 non-null  int64  
 2   AccountDestinationID         48042 non-null  int64  
 3   TransactionTypeID            48042 non-null  int64  
 4   Amount                       48042 non-null  float64
 5   TransactionDate              48042 non-null  object 
 6   BranchID                     48042 non-null  int64  
 7   Description                  48042 non-null  object 
 8   TransactionTypeName          48042 non-null  object 
 9   Origin_AccountID             47581 non-null  float64
 10  Origin_CustomerID            47581 non-null  float64
 11  Origin_AccountTypeID         47581 non-null  float64
 12  Origin_AccountStatusID       47581 non-null  float64
 13  Origin_Balance  

In [8]:
# Execute feature engineering
print("🚀 Starting feature engineering process...")

df_final_with_features = compute_transaction_features(df_final)

print(f"\n✅ Feature engineering completed!")
print(f"Final dataset shape: {df_final_with_features.shape}")
print(f"Total features: {df_final_with_features.shape[1]}")

# Display feature summary
print(f"\n📈 Feature Summary:")
print(f"Original features: {df_final.shape[1]}")
print(f"New features added: {df_final_with_features.shape[1] - df_final.shape[1]}")
print(f"Total features: {df_final_with_features.shape[1]}")

# Show some sample features
print(f"\n🔍 Sample of new features:")
new_features = [col for col in df_final_with_features.columns if col not in df_final.columns]
print(f"New feature count: {len(new_features)}")
print(f"Sample features: {new_features[:10]}")


🚀 Starting feature engineering process...
🔧 Computing transaction features...
  → Amount-based features
  → Account status features
  → Customer demographic features
  → Loan-related features
  → Temporal features
  → Anomaly detection flags
✅ Feature engineering completed!

✅ Feature engineering completed!
Final dataset shape: (48042, 102)
Total features: 102

📈 Feature Summary:
Original features: 68
New features added: 34
Total features: 102

🔍 Sample of new features:
New feature count: 34
Sample features: ['Amount_to_OriginBalance', 'Amount_to_DestBalance', 'Amount_to_AvgTransaction', 'Origin_AccountInactive', 'Dest_AccountInactive', 'Origin_IsChecking', 'Dest_IsChecking', 'Origin_IsSavings', 'Dest_IsSavings', 'Origin_Age']


## Step 4: Data Validation

After feature engineering, we validate the dataset to ensure data quality and identify any issues that need to be addressed before modeling.

### Validation Checks:

1. **Data Completeness**: Check for missing values and data coverage
2. **Feature Quality**: Validate engineered features for reasonable ranges
3. **Data Consistency**: Ensure logical relationships between features
4. **Memory Usage**: Monitor dataset size and memory requirements
5. **Feature Distribution**: Check for extreme values and outliers
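
The consistency check (item 3) can be expressed as a handful of rules that the engineered flags must satisfy by construction. The sketch below is one possible form, assuming the flag and temporal column names created in Step 3; `check_feature_consistency` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def check_feature_consistency(df):
    """Return a dict mapping each rule name to its number of violating rows."""
    violations = {}

    # Binary flags should only contain 0 or 1.
    for col in ["LargeTransferFlag", "VeryLargeTransferFlag", "IsWeekend"]:
        violations[f"{col}_not_binary"] = int((~df[col].isin([0, 1])).sum())

    # A ratio above 0.9 is also above 0.5, so one flag implies the other.
    violations["very_large_without_large"] = int(
        ((df["VeryLargeTransferFlag"] == 1) & (df["LargeTransferFlag"] == 0)).sum()
    )

    # IsWeekend must agree with the weekday (5 = Saturday, 6 = Sunday).
    expected_weekend = (df["TransactionWeekday"] >= 5).astype(int)
    violations["weekend_mismatch"] = int((df["IsWeekend"] != expected_weekend).sum())

    # Hours must fall in the range 0..23.
    violations["hour_out_of_range"] = int(
        (~df["TransactionHour"].between(0, 23)).sum()
    )
    return violations


toy = pd.DataFrame({
    "LargeTransferFlag": [1, 0, 0],
    "VeryLargeTransferFlag": [1, 0, 1],   # third row breaks the implication
    "IsWeekend": [0, 1, 0],
    "TransactionWeekday": [2, 6, 5],      # third row breaks the weekend rule
    "TransactionHour": [9, 23, 14],
})
print(check_feature_consistency(toy))
```

On the real dataset every count should be zero; a non-zero count points at a bug in the feature code rather than at an anomalous transaction.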


In [10]:
import numpy as np
# Data Validation
print("🔍 Performing data validation...")

# Basic dataset information
print(f"\n📊 Dataset Information:")
print(f"Shape: {df_final_with_features.shape}")
print(f"Memory usage: {df_final_with_features.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"Data types: {df_final_with_features.dtypes.value_counts().to_dict()}")

# Missing values analysis
print(f"\n❌ Missing Values Analysis:")
missing_values = df_final_with_features.isnull().sum()
missing_percentage = (missing_values / len(df_final_with_features)) * 100

missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing_Count': missing_values.values,
    'Missing_Percentage': missing_percentage.values
}).sort_values('Missing_Percentage', ascending=False)

print(f"Columns with missing values: {(missing_values > 0).sum()}")
print(f"Total missing values: {missing_values.sum()}")

if (missing_values > 0).any():
    print(f"\nTop 10 columns with missing values:")
    print(missing_df.head(10).to_string(index=False))

# Feature quality checks
print(f"\n✅ Feature Quality Checks:")

# Check for infinite values
infinite_cols = []
for col in df_final_with_features.select_dtypes(include=[np.number]).columns:
    if np.isinf(df_final_with_features[col]).any():
        infinite_cols.append(col)

if infinite_cols:
    print(f"⚠️  Columns with infinite values: {infinite_cols}")
else:
    print(f"✅ No infinite values found")

# Check for extreme values in key features
print(f"\n🔍 Extreme Values Analysis:")
key_features = ['Amount', 'Amount_to_OriginBalance', 'Amount_to_DestBalance', 
                'Origin_LoanLeverage', 'Dest_LoanLeverage']

for feature in key_features:
    if feature in df_final_with_features.columns:
        q99 = df_final_with_features[feature].quantile(0.99)
        q01 = df_final_with_features[feature].quantile(0.01)
        print(f"{feature}: 1st percentile = {q01:.2f}, 99th percentile = {q99:.2f}")

# Check feature ranges
print(f"\n📈 Feature Ranges:")
numeric_features = df_final_with_features.select_dtypes(include=[np.number]).columns
print(f"Numeric features: {len(numeric_features)}")

# Sample of feature statistics
print(f"\n📋 Sample Feature Statistics:")
sample_features = ['Amount', 'TransactionHour', 'Origin_Age', 'Dest_Age', 'LargeTransferFlag']
available_features = [f for f in sample_features if f in df_final_with_features.columns]

if available_features:
    print(df_final_with_features[available_features].describe().round(2))

print(f"\n✅ Data validation completed!")


🔍 Performing data validation...

📊 Dataset Information:
Shape: (48042, 102)
Memory usage: 108.19 MB
Data types: {dtype('float64'): 41, dtype('int64'): 30, dtype('O'): 27, dtype('int32'): 4}

❌ Missing Values Analysis:
Columns with missing values: 59
Total missing values: 854449

Top 10 columns with missing values:
                   Column  Missing_Count  Missing_Percentage
     Dest_AvgInterestRate          39553           82.330045
     Dest_MaxInterestRate          39553           82.330045
     Dest_MinInterestRate          39553           82.330045
 Dest_LoanStatus_Paid Off          39553           82.330045
Dest_LoanStatus_AccountID          39553           82.330045
   Dest_LoanStatus_Active          39553           82.330045
  Dest_LoanStatus_Overdue          39553           82.330045
      Dest_TotalPrincipal          39553           82.330045
        Dest_LoanLeverage          39553           82.330045
           Dest_LoanCount          39553           82.330045

✅ Feature Qu

## Step 5: Save Processed Data

The final step is to save the enriched and feature-engineered dataset for use in machine learning models.

### Output Files:

1. **Enriched Transactions**: Complete dataset with all merged information
2. **Feature Summary**: Documentation of all engineered features
3. **Data Quality Report**: Summary of validation results (printed in Step 4; the save cell below does not write it to a file)
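
The save cell in this notebook writes the enriched CSV and `feature_summary.txt`; a data quality report could be written in the same spirit. The sketch below is a minimal version under stated assumptions: `write_quality_report` and the file name `data_quality_report.csv` are hypothetical. Since the loading cell globs `*.csv` in `../data/processed`, passing a separate output directory keeps the report from being reloaded as an input table on the next run.

```python
from pathlib import Path

import pandas as pd


def write_quality_report(df, output_dir):
    """Write a per-column missing-value report as CSV and return its path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    report = pd.DataFrame({
        "Column": df.columns,
        "Dtype": df.dtypes.astype(str).values,
        "Missing_Count": df.isnull().sum().values,
        "Missing_Percentage": (df.isnull().mean() * 100).round(2).values,
    }).sort_values("Missing_Percentage", ascending=False)

    report_path = output_dir / "data_quality_report.csv"
    report.to_csv(report_path, index=False)
    return report_path
```

A call such as `write_quality_report(df_final_with_features, "../reports")` would produce one row per column; the `../reports` path is only an example.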

### Next Steps:

- **Model Training**: Use this dataset for anomaly detection model training
- **Feature Selection**: Identify the most important features for the model
- **Model Evaluation**: Test different algorithms and hyperparameters
- **Production Deployment**: Integrate into the MLOps pipeline


In [62]:
# Save the processed data
print("💾 Saving processed data...")

# Create output directory if it doesn't exist
import os
output_dir = "../data/processed"
os.makedirs(output_dir, exist_ok=True)

# Save the enriched dataset
output_file = os.path.join(output_dir, "enriched_transactions_with_features.csv")
df_final_with_features.to_csv(output_file, index=False)

print(f"✅ Dataset saved to: {output_file}")
print(f"File size: {os.path.getsize(output_file) / 1024**2:.2f} MB")

# Create feature summary
feature_summary = {
    'Total_Features': len(df_final_with_features.columns),
    'Original_Features': len(df_final.columns),
    'Engineered_Features': len(df_final_with_features.columns) - df_final.shape[1],
    'Numeric_Features': len(df_final_with_features.select_dtypes(include=[np.number]).columns),
    'Categorical_Features': len(df_final_with_features.select_dtypes(include=['object']).columns),
    'Boolean_Features': len(df_final_with_features.select_dtypes(include=['bool']).columns),
    'Missing_Values': df_final_with_features.isnull().sum().sum(),
    'Duplicate_Rows': df_final_with_features.duplicated().sum()
}

print(f"\n📋 Feature Summary:")
for key, value in feature_summary.items():
    print(f"{key}: {value}")

# Save feature summary
feature_summary_file = os.path.join(output_dir, "feature_summary.txt")
with open(feature_summary_file, 'w') as f:
    f.write("Feature Engineering Summary\n")
    f.write("=" * 50 + "\n\n")
    for key, value in feature_summary.items():
        f.write(f"{key}: {value}\n")
    
    f.write(f"\nFeature List:\n")
    f.write("-" * 20 + "\n")
    for i, col in enumerate(df_final_with_features.columns, 1):
        f.write(f"{i:3d}. {col}\n")

print(f"✅ Feature summary saved to: {feature_summary_file}")

print(f"\n🎉 Feature engineering pipeline completed successfully!")
print(f"📊 Final dataset: {df_final_with_features.shape[0]} rows × {df_final_with_features.shape[1]} columns")
print(f"💾 Data saved to: {output_file}")
print(f"📋 Summary saved to: {feature_summary_file}")


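# Extra copy of the final dataset; index=False keeps the row index out of the file,
# matching the export above
df_final_with_features.to_csv("../data/raw/finalDataSet.csv", index=False)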


💾 Saving processed data...
✅ Dataset saved to: ../data/processed\enriched_transactions_with_features.csv
File size: 26.90 MB

📋 Feature Summary:
Total_Features: 102
Original_Features: 68
Engineered_Features: 34
Numeric_Features: 75
Categorical_Features: 27
Boolean_Features: 0
Missing_Values: 854449
Duplicate_Rows: 0
✅ Feature summary saved to: ../data/processed\feature_summary.txt

🎉 Feature engineering pipeline completed successfully!
📊 Final dataset: 48042 rows × 102 columns
💾 Data saved to: ../data/processed\enriched_transactions_with_features.csv
📋 Summary saved to: ../data/processed\feature_summary.txt


In [48]:
# Test basic functionality
print("🧪 Testing basic functionality...")

try:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    print("✅ All imports successful!")
    
    # Test data loading
    print("📁 Testing data loading...")
    import os
    import glob
    
    raw_data_path = "../data/processed"
    if os.path.exists(raw_data_path):
        csv_files = glob.glob(os.path.join(raw_data_path, "*.csv"))
        print(f"✅ Found {len(csv_files)} CSV files in {raw_data_path}")
        for csv_file in csv_files[:3]:  # Show first 3 files
            print(f"  - {os.path.basename(csv_file)}")
    else:
        print(f"❌ Directory {raw_data_path} not found")
    
    print("✅ Basic functionality test completed!")
    
except Exception as e:
    print(f"❌ Error during testing: {str(e)}")
    print(f"Error type: {type(e).__name__}")


🧪 Testing basic functionality...
✅ All imports successful!
📁 Testing data loading...
✅ Found 13 CSV files in ../data/processed
  - accounts_cleaned.csv
  - account_statuses.csv
  - account_types.csv
✅ Basic functionality test completed!


In [49]:
# Debug: Check what columns are available in the merged dataset
print("🔍 Debugging merged dataset columns...")
print(f"Shape of df_final: {df_final.shape}")
print(f"\nColumns in df_final:")
print(df_final.columns.tolist())

# Check if customer-related columns exist
customer_columns = [col for col in df_final.columns if 'Customer' in col or 'customer' in col]
print(f"\nCustomer-related columns: {customer_columns}")

# Check if we have the expected customer type columns
expected_customer_cols = ['Origin_CustomerType', 'Dest_CustomerType']
missing_cols = [col for col in expected_customer_cols if col not in df_final.columns]
print(f"\nMissing customer columns: {missing_cols}")

# Let's see what customer columns we actually have
print(f"\nAll columns containing 'Customer':")
for col in df_final.columns:
    if 'Customer' in col or 'customer' in col:
        print(f"  - {col}")


🔍 Debugging merged dataset columns...
Shape of df_final: (48042, 68)

Columns in df_final:
['TransactionID', 'AccountOriginID', 'AccountDestinationID', 'TransactionTypeID', 'Amount', 'TransactionDate', 'BranchID', 'Description', 'TransactionTypeName', 'Origin_AccountID', 'Origin_CustomerID', 'Origin_AccountTypeID', 'Origin_AccountStatusID', 'Origin_Balance', 'Origin_OpeningDate', 'Origin_AccountType', 'Origin_AccountStatus', 'Dest_AccountID', 'Dest_CustomerID', 'Dest_AccountTypeID', 'Dest_AccountStatusID', 'Dest_Balance', 'Dest_OpeningDate', 'Dest_AccountType', 'Dest_AccountStatus', 'Origin_FirstName', 'Origin_LastName', 'Origin_DateOfBirth', 'Origin_AddressID', 'Origin_CustomerTypeID', 'Origin_TypeName', 'Origin_Street', 'Origin_City', 'Origin_Country', 'Dest_FirstName', 'Dest_LastName', 'Dest_DateOfBirth', 'Dest_AddressID', 'Dest_CustomerTypeID', 'Dest_TypeName', 'Dest_Street', 'Dest_City', 'Dest_Country', 'Branch_BranchID', 'Branch_BranchName', 'Branch_AddressID', 'Branch_Branch_Add

In [50]:
# Debug: Check customer data structure
print("🔍 Debugging customer data structure...")
print(f"Customer types columns: {df_customer_types.columns.tolist()}")
print(f"Customer types data:")
print(df_customer_types.head())

print(f"\nCustomers columns: {df_customers.columns.tolist()}")
print(f"Customers data sample:")
print(df_customers.head())

# Check what happens when we merge customers with customer types
customers_full = df_customers.merge(df_customer_types, on="CustomerTypeID", how="left")
print(f"\nAfter merging customers with customer types:")
print(f"Columns: {customers_full.columns.tolist()}")
print(f"Sample data:")
print(customers_full.head())


🔍 Debugging customer data structure...
Customer types columns: ['CustomerTypeID', 'TypeName']
Customer types data:
   CustomerTypeID          TypeName
0               1        Individual
1               2    Small Business
2               3  Large Enterprise

Customers columns: ['CustomerID', 'FirstName', 'LastName', 'DateOfBirth', 'AddressID', 'CustomerTypeID']
Customers data sample:
   CustomerID FirstName    LastName DateOfBirth  AddressID  CustomerTypeID
0       10832      Nyla     Aguirre  1974-02-07        881               1
1       10983   Unknown      Battle  1963-02-01        958               2
2       10837  Angelena  Harrington  1964-03-25         86               3
3       10107    Remona       Glass  1965-09-16        595               1
4       10553      King      Becker  1966-02-20        969               3

After merging customers with customer types:
Columns: ['CustomerID', 'FirstName', 'LastName', 'DateOfBirth', 'AddressID', 'CustomerTypeID', 'TypeName']
Sample da
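
The lookup merge above relies on each `CustomerTypeID` appearing exactly once in `df_customer_types`. A minimal sketch of how that assumption could be checked, using small made-up frames in place of the real tables (the values, the `validate` argument, and the `_merge` check are illustrative and not part of the original notebook):

```python
import pandas as pd

# Toy stand-ins for df_customers and df_customer_types (illustrative values only)
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "CustomerTypeID": [1, 2, 9],  # 9 has no match in the lookup table
})
customer_types = pd.DataFrame({
    "CustomerTypeID": [1, 2, 3],
    "TypeName": ["Individual", "Small Business", "Large Enterprise"],
})

# validate="many_to_one" raises MergeError if CustomerTypeID is duplicated
# in the lookup table; indicator=True adds a _merge column with the match status.
merged = customers.merge(
    customer_types,
    on="CustomerTypeID",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows that found no customer type end up with TypeName = NaN
unmatched = merged[merged["_merge"] == "left_only"]
print(f"Rows: {len(merged)}, unmatched customer types: {len(unmatched)}")
```

With a left join and a unique lookup key, the row count should stay equal to the number of customers, so any growth would point to duplicated keys.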

In [58]:
df_final_with_features.columns

Index(['TransactionID', 'AccountOriginID', 'AccountDestinationID',
       'TransactionTypeID', 'Amount', 'TransactionDate', 'BranchID',
       'Description', 'TransactionTypeName', 'Origin_AccountID',
       ...
       'TransactionMonth', 'TransactionQuarter', 'IsWeekend',
       'IsBusinessHours', 'IsNightTime', 'LargeTransferFlag',
       'VeryLargeTransferFlag', 'UnusualTimingFlag', 'HighRiskFlag',
       'CrossTypeTransfer'],
      dtype='object', length=102)

In [66]:
import pandas as pd

df = df_final_with_features

# List of necessary features for anomaly detection
features_to_keep = [
    "TransactionTypeID", "Amount",
    "Origin_AccountTypeID", "Origin_AccountStatusID", "Origin_Balance",
    "Dest_AccountTypeID", "Dest_AccountStatusID", "Dest_Balance",
    "Origin_CustomerTypeID", "Dest_CustomerTypeID",
    "Origin_LoanCount", "Origin_TotalPrincipal", "Origin_AvgInterestRate",
    "Dest_LoanCount", "Dest_TotalPrincipal", "Dest_AvgInterestRate",
    "Origin_LoanStatus_Active", "Origin_LoanStatus_Overdue", "Origin_LoanStatus_Paid Off",
    "Dest_LoanStatus_Active", "Dest_LoanStatus_Overdue", "Dest_LoanStatus_Paid Off",
    "Amount_to_OriginBalance", "Amount_to_DestBalance", "Amount_to_AvgTransaction",
    "Origin_AccountInactive", "Dest_AccountInactive", "Age_Difference",
    "Origin_LoanLeverage", "Dest_LoanLeverage",
    "TransactionHour", "TransactionWeekday", "TransactionMonth", "TransactionQuarter",
    "IsWeekend", "IsBusinessHours", "IsNightTime",
    "LargeTransferFlag", "VeryLargeTransferFlag", "UnusualTimingFlag", "HighRiskFlag", "CrossTypeTransfer"
]

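# Keep only the selected features; .copy() makes this an independent DataFrame
# so later modifications do not trigger SettingWithCopyWarning
df_important_features = df[features_to_keep].copy()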
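
Selecting with a long hand-written list raises a `KeyError` if any single name is misspelled or was never created upstream. One possible guard is sketched below; the helper name `select_features` and the demo frame are hypothetical and only illustrate the idea:

```python
import pandas as pd

def select_features(df: pd.DataFrame, wanted: list[str]) -> pd.DataFrame:
    """Return a copy of df restricted to `wanted`, listing any missing columns."""
    missing = [col for col in wanted if col not in df.columns]
    if missing:
        raise KeyError(f"Missing feature columns: {missing}")
    return df[wanted].copy()

# Small illustrative frame, not the real dataset
demo = pd.DataFrame({
    "Amount": [10.0, 20.0],
    "IsWeekend": [0, 1],
    "Description": ["a", "b"],
})
subset = select_features(demo, ["Amount", "IsWeekend"])
print(subset.columns.tolist())
```

Reporting all missing names at once makes it easier to tell a typo in the list from a feature step that did not run.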

In [67]:
df_important_features.head()

Unnamed: 0,TransactionTypeID,Amount,Origin_AccountTypeID,Origin_AccountStatusID,Origin_Balance,Dest_AccountTypeID,Dest_AccountStatusID,Dest_Balance,Origin_CustomerTypeID,Dest_CustomerTypeID,Origin_LoanCount,Origin_TotalPrincipal,Origin_AvgInterestRate,Dest_LoanCount,Dest_TotalPrincipal,Dest_AvgInterestRate,Origin_LoanStatus_Active,Origin_LoanStatus_Overdue,Origin_LoanStatus_Paid Off,Dest_LoanStatus_Active,Dest_LoanStatus_Overdue,Dest_LoanStatus_Paid Off,Amount_to_OriginBalance,Amount_to_DestBalance,Amount_to_AvgTransaction,Origin_AccountInactive,Dest_AccountInactive,Age_Difference,Origin_LoanLeverage,Dest_LoanLeverage,TransactionHour,TransactionWeekday,TransactionMonth,TransactionQuarter,IsWeekend,IsBusinessHours,IsNightTime,LargeTransferFlag,VeryLargeTransferFlag,UnusualTimingFlag,HighRiskFlag,CrossTypeTransfer
0,3,2984.02,5.0,1.0,55292.55,2.0,2.0,54095.48,2.0,1.0,1.0,52255.85,0.1283,,,,0.0,1.0,0.0,,,,0.053968,0.055162,1.191976,0,1,4.0,0.945079,,10,4,12,4,0,1,0,0,0,0,1,1
1,3,4713.48,1.0,1.0,22940.53,4.0,1.0,70727.24,3.0,3.0,2.0,146020.58,0.07795,1.0,59673.83,0.1411,2.0,0.0,0.0,1.0,0.0,0.0,0.205465,0.066643,1.882814,0,0,16.0,6.365179,0.843718,2,4,5,2,0,0,1,0,0,1,1,0
2,3,1600.23,4.0,1.0,62435.08,5.0,2.0,90061.73,2.0,1.0,1.0,22288.38,0.0969,,,,1.0,0.0,0.0,,,,0.02563,0.017768,0.639217,0,1,6.0,0.356985,,17,1,7,3,0,1,0,0,0,0,1,1
3,1,4279.61,4.0,1.0,12739.44,3.0,1.0,32623.75,3.0,3.0,1.0,15797.11,0.0636,1.0,97362.97,0.1145,1.0,0.0,0.0,1.0,0.0,0.0,0.335934,0.131181,1.709504,0,0,-3.0,1.240016,2.98442,7,5,9,3,1,0,0,0,0,1,1,0
4,2,4125.32,4.0,1.0,56605.54,,,,2.0,,1.0,37284.69,0.0718,,,,1.0,0.0,0.0,,,,0.072878,,1.647872,0,0,,0.658676,,2,6,5,2,1,0,1,0,0,1,0,1


In [68]:
print("Remaining NaNs:", df_important_features.isna().sum().sum())

Remaining NaNs: 566670
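
A single total does not show where the gaps are. In the sample rows above they appear in destination-side and loan-related columns, so a per-column breakdown may be more informative. A small sketch on a made-up three-row frame (the column subset and the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Illustrative frame with a few columns named like those in the notebook
demo = pd.DataFrame({
    "Amount": [2984.02, 4713.48, 4125.32],
    "Dest_Balance": [54095.48, 70727.24, np.nan],
    "Dest_LoanCount": [np.nan, 1.0, np.nan],
})

# Count NaNs per column, then keep only columns that have any, largest first
nan_counts = demo.isna().sum()
nan_report = (
    pd.DataFrame({"nan_count": nan_counts, "nan_share": nan_counts / len(demo)})
    .query("nan_count > 0")
    .sort_values("nan_count", ascending=False)
)
print(nan_report)
```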


In [69]:
# fillna returns a new DataFrame, so the result must be assigned back
df_important_features = df_important_features.fillna(0)
print("Remaining NaNs:", df_important_features.isna().sum().sum())

Remaining NaNs: 0
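
Filling every gap with 0 makes a missing destination account look the same as a destination with a zero balance or no loans. One possible alternative, sketched here as an assumption rather than the notebook's method, is to record missingness in indicator columns before filling. The helper `fill_with_indicators` and the `_missing` suffix are hypothetical names:

```python
import numpy as np
import pandas as pd

def fill_with_indicators(df: pd.DataFrame, fill_value: float = 0) -> pd.DataFrame:
    """Add a 0/1 `<col>_missing` flag for each column with NaNs, then fill the NaNs."""
    out = df.copy()
    cols_with_nan = [col for col in df.columns if df[col].isna().any()]
    for col in cols_with_nan:
        out[f"{col}_missing"] = df[col].isna().astype(int)
    return out.fillna(fill_value)

# Illustrative rows: the second has no destination account at all
demo = pd.DataFrame({
    "Dest_Balance": [54095.48, np.nan],
    "Dest_LoanCount": [np.nan, np.nan],
    "Amount_to_DestBalance": [0.055162, np.nan],
})
filled = fill_with_indicators(demo)
print(filled.isna().sum().sum())
```

The extra flags let a downstream model tell "value was absent" apart from "value was zero", at the cost of more columns.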
