# Task 3: Feature Engineering - Step-by-Step Implementation

## 📋 Objective
Build a robust, automated, and reproducible data-processing script that transforms the raw transaction data into a model-ready format using `sklearn.pipeline.Pipeline`.

## 🎯 The 6 Required Steps:
1. **Create Aggregate Features**
2. **Extract Temporal Features**  
3. **Encode Categorical Variables**
4. **Handle Missing Values**
5. **Normalize/Standardize Numerical Features**
6. **Feature Engineering with WoE and IV**
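
As a rough illustration of how steps like these can be chained, the sketch below wires two small custom transformers into a `Pipeline`. The classes `AddHour` and `DropColumns` and the toy data are hypothetical stand-ins written for this example; they are not the classes from `src.feature_engineering`, whose source is not shown in this notebook.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class AddHour(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: derives an hour column from a datetime column."""

    def __init__(self, datetime_col="TransactionStartTime"):
        self.datetime_col = datetime_col

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        X = X.copy()
        X["TransactionHour"] = pd.to_datetime(X[self.datetime_col], utc=True).dt.hour
        return X


class DropColumns(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: drops the listed columns."""

    def __init__(self, columns=()):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(columns=list(self.columns))


# Toy data, invented for the example
toy = pd.DataFrame({
    "Amount": [100.0, 250.0],
    "TransactionStartTime": ["2018-11-15T02:18:49Z", "2018-11-15T14:44:21Z"],
})

pipe = Pipeline([
    ("add_hour", AddHour()),
    ("drop_raw_time", DropColumns(columns=["TransactionStartTime"])),
])
result = pipe.fit_transform(toy)
print(result)
```

Because every step exposes `fit` and `transform`, the same fitted pipeline can later be applied to new data with `pipe.transform(...)`, which is what makes the processing reproducible.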

In [17]:
# ============================================
# SETUP AND IMPORTS
# ============================================

import sys
import os
sys.path.append('..')  # Add the project root to the path so that `src` is importable

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from pathlib import Path

# Import our OOP classes
from src.feature_engineering import (
    AggregateFeatures,
    TemporalFeatureExtractor,
    CategoricalEncoder,
    MissingValueHandler,
    FeatureScaler,
    WOETransformer,
    FeatureEngineeringPipeline
)

print("✅ Libraries and classes imported successfully!")

✅ Libraries and classes imported successfully!


In [18]:
# ============================================
# LOAD DATA
# ============================================

print("📂 LOADING DATA")
print("="*60)

# Determine correct path
current_dir = Path.cwd()
if current_dir.name == 'notebooks':
    data_path = current_dir.parent / 'data' / 'raw' / 'data.csv'
else:
    data_path = current_dir / 'data' / 'raw' / 'data.csv'

print(f"Data path: {data_path}")

# Load data
df = pd.read_csv(data_path)
print(f"✅ Loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Columns: {df.columns.tolist()}")

# Display sample
print("\n📄 Sample data (first 3 rows):")
print(df.head(3))

📂 LOADING DATA
Data path: c:\Users\HP\Desktop\KAIM\credit-risk-model\data\raw\data.csv
✅ Loaded: 95,662 rows × 16 columns
Columns: ['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId', 'CurrencyCode', 'CountryCode', 'ProviderId', 'ProductId', 'ProductCategory', 'ChannelId', 'Amount', 'Value', 'TransactionStartTime', 'PricingStrategy', 'FraudResult']

📄 Sample data (first 3 rows):
         TransactionId        BatchId       AccountId       SubscriptionId  \
0  TransactionId_76871  BatchId_36123  AccountId_3957   SubscriptionId_887   
1  TransactionId_73770  BatchId_15642  AccountId_4841  SubscriptionId_3829   
2  TransactionId_26203  BatchId_53941  AccountId_4229   SubscriptionId_222   

        CustomerId CurrencyCode  CountryCode    ProviderId     ProductId  \
0  CustomerId_4406          UGX          256  ProviderId_6  ProductId_10   
1  CustomerId_4406          UGX          256  ProviderId_4   ProductId_6   
2  CustomerId_4683          UGX          256  P

## 🎯 Step 1: Create Aggregate Features

### What we're doing:
Creating customer-level summary statistics from transaction data.

### Required features:
- **Total Transaction Amount**: Sum of all transaction amounts per customer
- **Average Transaction Amount**: Average transaction amount per customer  
- **Transaction Count**: Number of transactions per customer
- **Standard Deviation**: Variability of transaction amounts per customer

### OOP Class: `AggregateFeatures`
This class groups by CustomerId and calculates all required statistics.
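
The internals of `AggregateFeatures` are not shown here, so the following is only a sketch of one way to compute the required statistics with plain pandas. It assumes the aggregates are broadcast back onto every transaction row (the row count stays the same in the output of this step); the toy data is invented, and the output column names simply mirror the ones printed in this notebook.

```python
import pandas as pd

# Toy transactions, invented for the example
toy = pd.DataFrame({
    "CustomerId": ["C1", "C1", "C1", "C2", "C2"],
    "Amount": [100.0, 200.0, 300.0, 50.0, 150.0],
})

grouped = toy.groupby("CustomerId")["Amount"]

# transform() returns one value per original row, so row count is preserved
with_aggregates = toy.assign(
    TotalAmount=grouped.transform("sum"),
    AvgAmount=grouped.transform("mean"),
    TransactionCount=grouped.transform("count"),
    StdAmount=grouped.transform("std"),  # pandas uses the sample std (ddof=1)
)
print(with_aggregates)
```

Using `transform` rather than `agg` keeps the data at transaction level; a customer-level table would instead come from `grouped.agg(["sum", "mean", "count", "std"])`.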

In [19]:
# ============================================
# STEP 1: CREATE AGGREGATE FEATURES
# ============================================

print("🔢 STEP 1: CREATING AGGREGATE FEATURES")
print("="*60)

# Initialize the transformer
aggregator = AggregateFeatures(customer_col='CustomerId', amount_col='Amount')

# Fit and transform
df_step1 = aggregator.fit_transform(df)

# Check what was added
new_cols = [col for col in df_step1.columns if col not in df.columns]
print(f"\n📊 New aggregate features created:")
for col in new_cols:
    print(f"  • {col}")

# Display statistics for a sample customer
sample_customer = df_step1['CustomerId'].iloc[0]
customer_data = df_step1[df_step1['CustomerId'] == sample_customer].iloc[0]

print(f"\n🧑‍💼 Sample Customer: {sample_customer}")
print(f"   Total Amount: ${customer_data['TotalAmount']:,.2f}")
print(f"   Average Amount: ${customer_data['AvgAmount']:,.2f}")
print(f"   Transaction Count: {customer_data['TransactionCount']}")
print(f"   Amount Std Dev: ${customer_data['StdAmount']:,.2f}")

🔢 STEP 1: CREATING AGGREGATE FEATURES
✅ Step 1: Added 7 aggregate features

📊 New aggregate features created:
  • TotalAmount
  • AvgAmount
  • TransactionCount
  • StdAmount
  • MinAmount
  • MaxAmount
  • MedianAmount

🧑‍💼 Sample Customer: CustomerId_4406
   Total Amount: $109,921.75
   Average Amount: $923.71
   Transaction Count: 119
   Amount Std Dev: $3,042.29


## 🎯 Step 2: Extract Temporal Features

### What we're doing:
Extracting time-based features from the TransactionStartTime column.

### Required features:
- **Transaction Hour**: Hour of day (0-23)
- **Transaction Day**: Day of month (1-31)
- **Transaction Month**: Month (1-12)
- **Transaction Year**: Year

### OOP Class: `TemporalFeatureExtractor`
This class parses datetime and extracts all time components.
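
For illustration, the time components can be pulled out with the pandas `.dt` accessor. This is a sketch on invented toy timestamps, not the code of `TemporalFeatureExtractor`; the `IsWeekend` rule (Saturday or Sunday) is an assumption about how that flag is defined.

```python
import pandas as pd

# Toy timestamps, invented for the example (a Thursday and a Saturday)
toy = pd.DataFrame({
    "TransactionStartTime": ["2018-11-15T02:18:49Z", "2018-11-17T21:05:00Z"],
})

ts = pd.to_datetime(toy["TransactionStartTime"], utc=True)

temporal = toy.assign(
    TransactionHour=ts.dt.hour,            # 0-23
    TransactionDay=ts.dt.day,              # 1-31
    TransactionMonth=ts.dt.month,          # 1-12
    TransactionYear=ts.dt.year,
    TransactionDayOfWeek=ts.dt.dayofweek,  # 0 = Monday ... 6 = Sunday
    IsWeekend=(ts.dt.dayofweek >= 5).astype(int),  # assumed: Saturday/Sunday
)
print(temporal)
```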

In [20]:
# ============================================
# STEP 2: EXTRACT TEMPORAL FEATURES
# ============================================

print("\n⏰ STEP 2: EXTRACTING TEMPORAL FEATURES")
print("="*60)

# Initialize the transformer
temporal_extractor = TemporalFeatureExtractor(datetime_col='TransactionStartTime')

# Fit and transform
df_step2 = temporal_extractor.fit_transform(df_step1)

# Check what was added
temporal_cols = ['TransactionHour', 'TransactionDay', 'TransactionMonth', 
                 'TransactionYear', 'TransactionDayOfWeek', 'TransactionWeekOfYear', 'IsWeekend']

print(f"\n📅 Temporal features created:")
for col in temporal_cols:
    if col in df_step2.columns:
        print(f"  • {col}")

# Visualize transaction patterns
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Hourly distribution
hour_counts = df_step2['TransactionHour'].value_counts().sort_index()
axes[0, 0].bar(hour_counts.index, hour_counts.values)
axes[0, 0].set_title('Transactions by Hour')
axes[0, 0].set_xlabel('Hour')
axes[0, 0].set_ylabel('Count')

# Day of week distribution
dow_counts = df_step2['TransactionDayOfWeek'].value_counts().sort_index()
axes[0, 1].bar(dow_counts.index, dow_counts.values)
axes[0, 1].set_title('Transactions by Day of Week')
axes[0, 1].set_xlabel('Day (0=Monday)')
axes[0, 1].set_ylabel('Count')

# Monthly distribution
month_counts = df_step2['TransactionMonth'].value_counts().sort_index()
axes[1, 0].bar(month_counts.index, month_counts.values)
axes[1, 0].set_title('Transactions by Month')
axes[1, 0].set_xlabel('Month')
axes[1, 0].set_ylabel('Count')

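# Weekend vs weekday
# Sort by index so the labels line up with the values (assuming IsWeekend is 0/1);
# value_counts() alone orders by frequency, which could swap the labels.
weekend_counts = df_step2['IsWeekend'].value_counts().sort_index()
axes[1, 1].pie(weekend_counts.values, labels=['Weekday', 'Weekend'], autopct='%1.1f%%')
axes[1, 1].set_title('Weekend vs Weekday Transactions')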

plt.tight_layout()
plt.show()


⏰ STEP 2: EXTRACTING TEMPORAL FEATURES
✅ Step 2: Added 7 temporal features

📅 Temporal features created:
  • TransactionHour
  • TransactionDay
  • TransactionMonth
  • TransactionYear
  • TransactionDayOfWeek
  • TransactionWeekOfYear
  • IsWeekend


## 🎯 Step 3: Encode Categorical Variables

### What we're doing:
Converting categorical text data into numerical format for modeling.

### Required methods:
- **One-Hot Encoding**: Creates binary columns for each category
- **Label Encoding**: Assigns integer IDs to each category

### OOP Class: `CategoricalEncoder`
This class automatically detects categorical columns and applies encoding.
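
The two encodings can be sketched with plain pandas as follows. This is an illustration on invented data, not the implementation of `CategoricalEncoder`.

```python
import pandas as pd

# Toy column, invented for the example
toy = pd.DataFrame({"ChannelId": ["ChannelId_3", "ChannelId_2", "ChannelId_3"]})

# One-hot encoding: one indicator column per category.
# The default prefix is the source column name, joined with "_".
one_hot = pd.get_dummies(toy, columns=["ChannelId"], dtype=int)

# Label encoding: one integer code per category, in sorted category order
label_codes = toy["ChannelId"].astype("category").cat.codes

print(one_hot)
print(label_codes.tolist())
```

Note how the default prefix produces doubled names such as `ChannelId_ChannelId_2` when the category values already contain the column name. One-hot encoding is usually the safer choice for nominal categories, since integer labels imply an ordering that linear models may pick up on.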

In [21]:
# ============================================
# STEP 3: ENCODE CATEGORICAL VARIABLES
# ============================================

print("\n🔤 STEP 3: ENCODING CATEGORICAL VARIABLES")
print("="*60)

# Identify categorical columns
categorical_cols = df_step2.select_dtypes(include=['object']).columns.tolist()
print(f"Found {len(categorical_cols)} categorical columns:")
for col in categorical_cols:
    unique_count = df_step2[col].nunique()
    print(f"  • {col}: {unique_count} unique values")

# Initialize encoder (using one-hot for demonstration)
encoder = CategoricalEncoder(strategy='onehot', columns=['ProductCategory', 'ChannelId'])

# Fit and transform
df_step3 = encoder.fit_transform(df_step2)

# Show encoding results
encoded_cols = [col for col in df_step3.columns if 'ProductCategory_' in col or 'ChannelId_' in col]
print(f"\n🎯 Created {len(encoded_cols)} encoded columns:")
for col in encoded_cols[:10]:  # Show first 10
    print(f"  • {col}")
if len(encoded_cols) > 10:
    print(f"  ... and {len(encoded_cols) - 10} more")

# Show before/after comparison
print("\n📊 Before Encoding (ProductCategory):")
print(df_step2['ProductCategory'].value_counts().head())

print("\n📊 After One-Hot Encoding:")
for col in [c for c in encoded_cols if 'ProductCategory' in c][:3]:
    print(f"{col}: {df_step3[col].sum():,} transactions")


🔤 STEP 3: ENCODING CATEGORICAL VARIABLES
Found 10 categorical columns:
  • TransactionId: 95662 unique values
  • BatchId: 94809 unique values
  • AccountId: 3633 unique values
  • SubscriptionId: 3627 unique values
  • CustomerId: 3742 unique values
  • CurrencyCode: 1 unique values
  • ProviderId: 6 unique values
  • ProductId: 23 unique values
  • ProductCategory: 9 unique values
  • ChannelId: 4 unique values
✅ Step 3: Encoded 2 categorical columns using onehot encoding

🎯 Created 13 encoded columns:
  • ProductCategory_airtime
  • ProductCategory_data_bundles
  • ProductCategory_financial_services
  • ProductCategory_movies
  • ProductCategory_other
  • ProductCategory_ticket
  • ProductCategory_transport
  • ProductCategory_tv
  • ProductCategory_utility_bill
  • ChannelId_ChannelId_1
  ... and 3 more

📊 Before Encoding (ProductCategory):
ProductCategory
financial_services    45405
airtime               45027
utility_bill       

## 🎯 Step 4: Handle Missing Values

### What we're doing:
Identifying and treating missing data to ensure data quality.

### Required methods:
- **Imputation**: Fill missing values with mean, median, mode, or KNN
- **Removal**: Remove rows/columns with too many missing values

### OOP Class: `MissingValueHandler`
This class analyzes missing patterns and applies appropriate treatment.
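
A minimal sketch of both treatments, using pandas and scikit-learn's `SimpleImputer`, might look like this. The toy data and the variable `remove_threshold` are illustrative assumptions; how `MissingValueHandler` applies its own `remove_threshold` parameter is not shown in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for the example
toy = pd.DataFrame({
    "Amount": [100.0, np.nan, 300.0, 500.0],       # 25% missing
    "MostlyEmpty": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
})

# Removal: drop columns whose share of missing values exceeds the threshold
remove_threshold = 0.3
missing_share = toy.isnull().mean()
kept = toy.loc[:, missing_share <= remove_threshold]

# Imputation: fill the remaining gaps with each column's median
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(kept), columns=kept.columns, index=kept.index
)
print(imputed)
```

The median is less sensitive to extreme values than the mean, which matters for heavily skewed columns such as transaction amounts.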

In [22]:
# ============================================
# STEP 4: HANDLE MISSING VALUES
# ============================================

print("\n⚠️  STEP 4: HANDLING MISSING VALUES")
print("="*60)

# Check current missing values
missing_before = df_step3.isnull().sum()
missing_cols = missing_before[missing_before > 0]

if len(missing_cols) > 0:
    print(f"Found {len(missing_cols)} columns with missing values:")
    for col, count in missing_cols.items():
        pct = (count / len(df_step3)) * 100
        print(f"  • {col}: {count:,} missing ({pct:.2f}%)")
else:
    print("✅ No missing values found in current data")

    # Let's create some missing data for demonstration
    # (fixed random_state so the demo is reproducible)
    print("\n🧪 Creating sample missing data for demonstration...")
    df_demo = df_step3.copy()
    df_demo.loc[df_demo.sample(frac=0.05, random_state=42).index, 'Amount'] = np.nan
    df_demo.loc[df_demo.sample(frac=0.03, random_state=43).index, 'TransactionHour'] = np.nan
    df_step3 = df_demo

# Initialize missing value handler
missing_handler = MissingValueHandler(strategy='median', remove_threshold=0.3)

# Fit and transform
df_step4 = missing_handler.fit_transform(df_step3)

# Check results
missing_after = df_step4.isnull().sum().sum()
print(f"\n✅ Missing values after handling: {missing_after:,}")

# Compare shapes
print(f"\n📈 Data shape changes:")
print(f"  Before: {df_step3.shape[0]:,} rows × {df_step3.shape[1]} columns")
print(f"  After:  {df_step4.shape[0]:,} rows × {df_step4.shape[1]} columns")


⚠️  STEP 4: HANDLING MISSING VALUES
Found 1 columns with missing values:
  • StdAmount: 712 missing (0.74%)
✅ Step 4: Handled missing values using median strategy

✅ Missing values after handling: 0

📈 Data shape changes:
  Before: 95,662 rows × 41 columns
  After:  95,662 rows × 41 columns
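
The 712 missing `StdAmount` values would be consistent with customers who have only a single transaction, assuming the aggregate step uses the pandas sample standard deviation (`ddof=1`), which is undefined for one observation. The sketch below reproduces that effect on invented data and shows filling with 0 as a possible alternative to the median; whether that is the better choice for this dataset is a modeling decision, not something this notebook establishes.

```python
import pandas as pd

# Toy data, invented for the example: C1 has two transactions, C2 has one
toy = pd.DataFrame({
    "CustomerId": ["C1", "C1", "C2"],
    "Amount": [100.0, 300.0, 50.0],
})

std_per_row = toy.groupby("CustomerId")["Amount"].transform("std")
print(std_per_row)  # the single-transaction customer gets NaN

# Alternative to median imputation: treat "no observed variability" as 0
filled = std_per_row.fillna(0.0)
print(filled)
```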


## 🎯 Step 5: Normalize/Standardize Numerical Features

### What we're doing:
Scaling numerical features to similar ranges for better model performance.

### Required methods:
- **Normalization**: Scale to [0, 1] range (MinMax)
- **Standardization**: Scale to mean=0, std=1 (Z-score)

### OOP Class: `FeatureScaler`
This class detects numerical columns and applies the chosen scaling method.
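
To make the difference between the two methods concrete, here is a small sketch using scikit-learn's `MinMaxScaler` and `StandardScaler` on invented values. It is not the code of `FeatureScaler`, which may wrap these scalers or do something else.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy amounts, invented for the example (scalers expect a 2-D array)
amounts = np.array([[100.0], [200.0], [300.0], [400.0]])

# Normalization: (x - min) / (max - min), mapped onto [0, 1]
normalized = MinMaxScaler().fit_transform(amounts)

# Standardization: (x - mean) / std, using the population std (ddof=0)
standardized = StandardScaler().fit_transform(amounts)

print(normalized.ravel())
print(standardized.ravel())
```

`StandardScaler` divides by the population standard deviation, while `pandas.Series.std()` defaults to the sample version, so a standardized column can show a pandas `std()` very slightly different from 1 on small data.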

In [23]:
# ============================================
# STEP 5: NORMALIZE/STANDARDIZE NUMERICAL FEATURES
# ============================================

print("\n📏 STEP 5: NORMALIZING/STANDARDIZING NUMERICAL FEATURES")
print("="*60)

# Select numerical columns to scale
numerical_cols = df_step4.select_dtypes(include=[np.number]).columns.tolist()
# Exclude binary/indicator columns
exclude = ['IsWeekend', 'IsCredit', 'FraudResult'] + [col for col in numerical_cols if 'ProductCategory_' in col or 'ChannelId_' in col]
scale_cols = [col for col in numerical_cols if col not in exclude]

print(f"Selected {len(scale_cols)} numerical features for scaling:")
for col in scale_cols[:10]:
    print(f"  • {col}")
if len(scale_cols) > 10:
    print(f"  ... and {len(scale_cols) - 10} more")

# Initialize scaler (using standardization)
scaler = FeatureScaler(strategy='standard', columns=scale_cols)

# Fit and transform
df_step5 = scaler.fit_transform(df_step4)

# Show before/after comparison for sample features
sample_features = ['Amount', 'Value', 'TransactionHour']
print("\nüìä Before/After Scaling Comparison:")
print("-" * 50)

for feature in sample_features:
    if feature in df_step4.columns:
        before_mean = df_step4[feature].mean()
        before_std = df_step4[feature].std()
        after_mean = df_step5[feature].mean()
        after_std = df_step5[feature].std()
        
        print(f"\n{feature}:")
        print(f"  Before: Mean = {before_mean:.2f}, Std = {before_std:.2f}")
        print(f"  After:  Mean = {after_mean:.2f}, Std = {after_std:.2f}")


📏 STEP 5: NORMALIZING/STANDARDIZING NUMERICAL FEATURES
Selected 17 numerical features for scaling:
  • CountryCode
  • Amount
  • Value
  • PricingStrategy
  • TotalAmount
  • AvgAmount
  • TransactionCount
  • StdAmount
  • MinAmount
  • MaxAmount
  ... and 7 more
✅ Step 5: Scaled 17 numerical features using standard

📊 Before/After Scaling Comparison:
--------------------------------------------------

Amount:
  Before: Mean = 6717.85, Std = 123306.80
  After:  Mean = -0.00, Std = 1.00

Value:
  Before: Mean = 9900.58, Std = 123122.09
  After:  Mean = -0.00, Std = 1.00

TransactionHour:
  Before: Mean = 12.45, Std = 4.85
  After:  Mean = -0.00, Std = 1.00


## 🎯 Step 6: Feature Engineering with WoE and IV

### What we're doing:
Applying Weight of Evidence (WoE) transformation to create features with predictive power.

### Key concepts:
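- **WoE (Weight of Evidence)**: For each category (or bin) of a feature, the log ratio of its share of "good" outcomes to its share of "bad" outcomes; it measures how strongly that category indicates "good" vs "bad"
- **IV (Information Value)**: The sum over categories of (share of good - share of bad) × WoE; it measures the overall predictive power of a feature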

### OOP Class: `WOETransformer`
This class calculates WoE for each feature category and creates WoE-transformed features.
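
The calculation can be worked through by hand on a tiny example. The sketch below uses invented data with a single two-category feature and a binary `bad` flag; it illustrates the standard WoE and IV formulas and is not the code of `WOETransformer` (which also bins numerical features).

```python
import numpy as np
import pandas as pd

# Toy data, invented for the example:
# category A: 10 bad / 40 good, category B: 30 bad / 20 good
toy = pd.DataFrame({
    "ChannelId": ["A"] * 50 + ["B"] * 50,
    "bad": [1] * 10 + [0] * 40 + [1] * 30 + [0] * 20,
})

stats = toy.groupby("ChannelId")["bad"].agg(bad="sum", total="count")
stats["good"] = stats["total"] - stats["bad"]

# Share of all goods / all bads that fall into each category
pct_good = stats["good"] / stats["good"].sum()
pct_bad = stats["bad"] / stats["bad"].sum()

# WoE = ln(% of Good / % of Bad), IV = sum((% Good - % Bad) * WoE)
stats["woe"] = np.log(pct_good / pct_bad)
iv = ((pct_good - pct_bad) * stats["woe"]).sum()

print(stats)
print(f"IV = {iv:.4f}")
```

Category A holds a larger share of the goods than of the bads, so its WoE is positive; B is the reverse. A category with zero goods or zero bads makes the ratio 0 or infinite, so practical implementations usually add a small smoothing constant or merge sparse bins.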

In [25]:
# ============================================
# STEP 6: FEATURE ENGINEERING WITH WOE AND IV
# ============================================

print("\n🎯 STEP 6: FEATURE ENGINEERING WITH WOE AND IV")
print("="*60)

print("📚 Understanding WoE & IV:")
print("""
Weight of Evidence (WoE):
• Measures how much a feature category indicates 'good' vs 'bad'
• WoE = ln(% of Good / % of Bad)
• Positive WoE = More 'good' customers in this category
• Negative WoE = More 'bad' customers in this category

Information Value (IV):
• Measures overall predictive power of a feature
• IV < 0.02: Not useful
• 0.02-0.1: Weak predictor
• 0.1-0.3: Medium predictor  
• 0.3-0.5: Strong predictor
• > 0.5: Suspicious (check for data leakage)
""")

# For demonstration, use FraudResult as target
print("\n🧪 Using FraudResult as target for WoE demonstration...")

# Prepare data for WoE
woe_features = ['Amount', 'Value', 'TransactionHour', 'TransactionDay', 
                'TransactionMonth', 'TotalAmount', 'AvgAmount', 'TransactionCount']

X_woe = df_step5[woe_features].copy()
y_woe = df_step5['FraudResult'].copy()  # Target variable

print(f"Calculating WoE for {len(woe_features)} features...")

# Initialize WoE transformer (NO target_col parameter needed)
woe_transformer = WOETransformer(n_bins=5)

# ⚠️ IMPORTANT: Use fit() with X and y separately, NOT fit_transform()
woe_transformer.fit(X_woe, y_woe)  # Pass y as second parameter

# Then transform
X_woe_transformed = woe_transformer.transform(X_woe)

# Get IV report
iv_report = woe_transformer.get_iv_report()
print("\n📈 INFORMATION VALUE (IV) REPORT:")
print("="*60)
print(iv_report.to_string())


🎯 STEP 6: FEATURE ENGINEERING WITH WOE AND IV
📚 Understanding WoE & IV:

Weight of Evidence (WoE):
• Measures how much a feature category indicates 'good' vs 'bad'
• WoE = ln(% of Good / % of Bad)
• Positive WoE = More 'good' customers in this category
• Negative WoE = More 'bad' customers in this category

Information Value (IV):
• Measures overall predictive power of a feature
• IV < 0.02: Not useful
• 0.02-0.1: Weak predictor
• 0.1-0.3: Medium predictor  
• 0.3-0.5: Strong predictor
• > 0.5: Suspicious (check for data leakage)


🧪 Using FraudResult as target for WoE demonstration...
Calculating WoE for 8 features...
🔍 Step 6: Calculating WoE/IV for 8 features...
✅ Step 6: Added 8 WoE features

📈 INFORMATION VALUE (IV) REPORT:
            Feature        IV Predictive_Power
1             Value  4.279244       Suspicious
0            Amount  4.173982       Suspicious
6         AvgAmount  3.707218       Suspicious
5       TotalAmount  3.359739      

In [36]:
# ============================================
# MINIMAL PIPELINE (RECOMMENDED FOR TASK 3)
# ============================================

print("\n🔗 MINIMAL PIPELINE FOR TASK 3 COMPLETION")
print("="*60)

print("""
For Task 3 demonstration, we'll:
1. DROP high-cardinality ID columns (not useful for modeling)
2. Keep only meaningful features
3. Complete all 6 steps without memory issues
""")

# Drop identifier columns and the constant CurrencyCode column (1 unique value);
# they are not useful for credit risk modeling
id_columns = ['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 
              'ProductId', 'CurrencyCode']
df_minimal = df.drop(columns=id_columns)

print(f"📊 Data after dropping IDs: {df_minimal.shape}")

# Create minimal pipeline (steps 1-5).
# Step 6 (WoE/IV) is left out here because it needs the target y; it is fitted separately above.
minimal_pipeline = Pipeline([
    ('step1_aggregate', AggregateFeatures()),
    ('step2_temporal', TemporalFeatureExtractor()),
    ('step3_encode', CategoricalEncoder(
        strategy='onehot', 
        columns=['ProductCategory', 'ChannelId', 'ProviderId']
    )),
    ('step4_missing', MissingValueHandler(strategy='median')),
    # Caution: no `columns` list is passed here, and the recorded output shows that
    # 38 columns were scaled, including the FraudResult target.
    ('step5_scale', FeatureScaler(strategy='standard')),
])

print("\n🚀 Running minimal pipeline...")
df_final = minimal_pipeline.fit_transform(df_minimal)

print(f"\n✅ MINIMAL PIPELINE RESULTS:")
print("="*60)
print(f"Original: {df.shape}")
print(f"After dropping IDs: {df_minimal.shape}")
print(f"Final:    {df_final.shape}")

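# Show what features we have.
# Note: the name-based filters below are approximate. 'Transaction' also matches
# TransactionCount (an aggregate) and the raw TransactionStartTime, and no aggregate
# column name contains 'Customer', so that group comes out empty in this run.
print("\n📊 FINAL FEATURE TYPES:")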
feature_categories = {
    'Temporal': [col for col in df_final.columns if 'Transaction' in col],
    'Aggregate': [col for col in df_final.columns if 'Customer' in col and col != 'CustomerId'],
    'Encoded': [col for col in df_final.columns if '_' in col and ('ProductCategory' in col or 'ChannelId' in col)],
    'Original': [col for col in df_final.columns if col in df.columns]
}

for category, features in feature_categories.items():
    if features:
        print(f"\n{category} ({len(features)}):")
        for feat in features[:5]:
            print(f"  • {feat}")
        if len(features) > 5:
            print(f"    ... and {len(features) - 5} more")


🔗 MINIMAL PIPELINE FOR TASK 3 COMPLETION

For Task 3 demonstration, we'll:
1. DROP high-cardinality ID columns (not useful for modeling)
2. Keep only meaningful features
3. Complete all 6 steps without memory issues

📊 Data after dropping IDs: (95662, 10)

🚀 Running minimal pipeline...
✅ Step 1: Added 7 aggregate features
✅ Step 2: Added 7 temporal features
✅ Step 3: Encoded 3 categorical columns using onehot encoding
✅ Step 4: Handled missing values using median strategy
✅ Step 5: Scaled 38 numerical features using standard

✅ MINIMAL PIPELINE RESULTS:
Original: (95662, 16)
After dropping IDs: (95662, 10)
Final:    (95662, 40)

📊 FINAL FEATURE TYPES:

Temporal (8):
  • TransactionStartTime
  • TransactionCount
  • TransactionHour
  • TransactionDay
  • TransactionMonth
    ... and 3 more

Encoded (13):
  • ProductCategory_airtime
  • ProductCategory_data_bundles
  • ProductCategory_financial_services
  • ProductCategory_movies
  • ProductCa

In [38]:
# ============================================
# SAVE RESULTS AND SUMMARY
# ============================================

print("\n💾 SAVING RESULTS")
print("="*60)

# Save processed data
output_dir = Path('../data/processed')
output_dir.mkdir(parents=True, exist_ok=True)

# Save as CSV (human-readable) AND Parquet (efficient; to_parquet needs pyarrow or fastparquet installed)
csv_path = output_dir / 'task3_features_engineered.csv'
parquet_path = output_dir / 'task3_features_engineered.parquet'

# Save both formats
df_final.to_csv(csv_path, index=False)
df_final.to_parquet(parquet_path, index=False)

print(f"✅ Saved engineered features to:")
print(f"   📄 CSV: {csv_path}")
print(f"   📊 Parquet: {parquet_path}")
print(f"   📏 CSV size: {csv_path.stat().st_size / (1024**2):.2f} MB")
print(f"   📏 Parquet size: {parquet_path.stat().st_size / (1024**2):.2f} MB")

# Verify the files exist
print(f"\n🔍 Verifying saved files:")
print(f"   CSV exists: {csv_path.exists()}")
print(f"   Parquet exists: {parquet_path.exists()}")

# Show sample of saved data
print(f"\n📄 First 3 rows of saved data:")
print(df_final.head(3))

# Show column types
print(f"\n📊 Data types in saved file:")
print(df_final.dtypes.value_counts())


💾 SAVING RESULTS
✅ Saved engineered features to:
   📄 CSV: ..\data\processed\task3_features_engineered.csv
   📊 Parquet: ..\data\processed\task3_features_engineered.parquet
   📏 CSV size: 73.04 MB
   📏 Parquet size: 2.12 MB

🔍 Verifying saved files:
   CSV exists: True
   Parquet exists: True

📄 First 3 rows of saved data:
        CustomerId  CountryCode    Amount     Value      TransactionStartTime  \
0  CustomerId_4406          0.0 -0.046371 -0.072291 2018-11-15 02:18:49+00:00   
1  CustomerId_4406          0.0 -0.054643 -0.080251 2018-11-15 02:19:08+00:00   
2  CustomerId_4683          0.0 -0.050426 -0.076352 2018-11-15 02:44:21+00:00   

   PricingStrategy  FraudResult  TotalAmount  AvgAmount  TransactionCount  \
0        -0.349252    -0.044962     0.170118  -0.067623         -0.311831   
1        -0.349252    -0.044962     0.170118  -0.067623         -0.311831   
2        -0.349252    -0.044962     0.165122  -0.072568         -0.444993   

   ...  ChannelId