# DataPreprocessor Demo

## Overview

The `DataPreprocessor` class provides a modular, production-ready pipeline for cleaning and transforming data for machine learning. This demo applies it to a car price prediction dataset with separate training and test files.

## Key Features

- **Modular Design**: Each preprocessing step is independent and configurable
- **Pipeline Architecture**: Chain multiple processors for complex transformations
- **Type Safety**: Built-in validation and error handling
- **Reproducible**: Consistent processing across train/test splits
- **Production Ready**: Save and load fitted pipelines

## What We'll Demonstrate

1. Quick start example (3 lines of code)
2. Individual feature processing (numerical and categorical)
3. Complete pipeline setup
4. Pipeline persistence for production use

In [3]:
# Essential imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import warnings
warnings.filterwarnings('ignore')

# Load our custom preprocessor (adjust the path below to your local clone of the repository)
os.chdir('/Users/leonardodicaterina/Documents/GitHub/ML_group_45')
from utils.preprocessing.Preprocessor_divided import DataPreprocessor

# Load car price dataset
train_data = pd.read_csv('Data/train.csv')
test_data = pd.read_csv('Data/test.csv')

print("Dataset loaded successfully!")
print(f"Training data: {train_data.shape}")
print(f"Test data: {test_data.shape}")
print(f"Target: price (range: £{train_data['price'].min():,.0f} - £{train_data['price'].max():,.0f})")

# Show sample data
display(train_data.head())

Dataset loaded successfully!
Training data: (75973, 14)
Test data: (32567, 13)
Target: price (range: £450 - £159,999)


Unnamed: 0,carID,Brand,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize,paintQuality%,previousOwners,hasDamage
0,69512,VW,Golf,2016.0,22290,Semi-Auto,28421.0,Petrol,,11.417268,2.0,63.0,4.0,0.0
1,53000,Toyota,Yaris,2019.0,13790,Manual,4589.0,Petrol,145.0,47.9,1.5,50.0,1.0,0.0
2,6366,Audi,Q2,2019.0,24990,Semi-Auto,3624.0,Petrol,145.0,40.9,1.5,56.0,4.0,0.0
3,29021,Ford,FIESTA,2018.0,12500,anual,9102.0,Petrol,145.0,65.7,1.0,50.0,-2.340306,0.0
4,10062,BMW,2 Series,2019.0,22995,Manual,1000.0,Petrol,145.0,42.8,1.5,97.0,3.0,0.0


In [4]:
# Quick Start: Clean data in 3 lines
preprocessor = DataPreprocessor(target_column='price')

# Configure basic processing for key features
preprocessor.add_feature_pipeline('mileage', missing_strategy='median', scaling_method='standard')
preprocessor.add_feature_pipeline('year', missing_strategy='mean', scaling_method='minmax')
preprocessor.add_feature_pipeline('transmission', missing_strategy='mode', encoding_method='onehot')

# Fit and transform
X_train, y_train = preprocessor.fit_transform(train_data)
X_test, y_test = preprocessor.transform(test_data)

print("Quick Start Results:")
print(f"Original shape: {train_data.shape}")
print(f"Processed shape: {X_train.shape}")
print(f"Missing values: {train_data.isnull().sum().sum()} → {X_train.isnull().sum().sum()}")
print(f"Features processed: mileage, year, transmission")

# Show before/after for one feature
print(f"\nMileage transformation example:")
print(f"Original range: {train_data['mileage'].min():.0f} - {train_data['mileage'].max():.0f}")
print(f"Processed range: {X_train['mileage'].min():.3f} - {X_train['mileage'].max():.3f}")
print(f"Mean: {train_data['mileage'].mean():.0f} → {X_train['mileage'].mean():.3f}")

✓ Fitting pipeline for 'mileage'
✓ Fitting pipeline for 'year'
✓ Fitting pipeline for 'transmission'
✓ Transforming 'mileage'
✓ Transforming 'year'
✓ Transforming 'transmission'
✓ Transforming 'mileage'
✓ Transforming 'year'
✓ Transforming 'transmission'
Quick Start Results:
Original shape: (75973, 14)
Processed shape: (75973, 52)
Missing values: 30993 → 26517
Features processed: mileage, year, transmission

Mileage transformation example:
Original range: -58541 - 323000
Processed range: -3.713 - 13.685
Mean: 23004 → -0.000
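
The mileage figures above are what one would expect from median imputation followed by standardization: the training mean lands at 0 and the range is expressed in standard deviations. The block below is a minimal sketch of that arithmetic in plain pandas. It is not the `DataPreprocessor` implementation; the helper names `fit_median_standard` and `apply_median_standard` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_median_standard(train_col: pd.Series) -> dict:
    """Learn imputation and scaling statistics from the training column only."""
    median = train_col.median()
    filled = train_col.fillna(median)
    return {"median": median, "mean": filled.mean(), "std": filled.std(ddof=0)}

def apply_median_standard(col: pd.Series, stats: dict) -> pd.Series:
    """Impute with the stored median, then standardize with the stored mean/std."""
    return (col.fillna(stats["median"]) - stats["mean"]) / stats["std"]

# Toy values (hypothetical), with one missing entry in each split
train_mileage = pd.Series([28421.0, 4589.0, np.nan, 9102.0, 1000.0])
test_mileage = pd.Series([15000.0, np.nan])

stats = fit_median_standard(train_mileage)
train_scaled = apply_median_standard(train_mileage, stats)
test_scaled = apply_median_standard(test_mileage, stats)  # reuses training statistics
```

Because the test split reuses the training statistics, its mean and standard deviation will generally be close to, but not exactly, 0 and 1.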


## Individual Feature Processing

### Numerical Features

This section looks at how individual features are processed, starting with the numerical features mileage and year. Mileage is log-transformed and year is converted to car age before scaling.

In [5]:
# Demonstrate detailed numerical feature processing
print("NUMERICAL FEATURE PROCESSING DEMO")
print("=" * 40)

# Create a new preprocessor for demonstration
demo_preprocessor = DataPreprocessor(target_column='price')

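# Custom transformation functions used by the pipelines below
def log_transform(x):
    """Return log1p(|x|); stays finite for zeros and for the negative mileage values in the raw data (the sign is discarded)."""
    return np.log1p(np.abs(x))

def year_to_age(year_series):
    """Convert year to car age, using 2025 as the reference year."""
    return 2025 - year_series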

# Add mileage processing with log transformation
demo_preprocessor.add_feature_pipeline(
    'mileage',
    missing_strategy='median',
    outlier_method='iqr',
    transform_func=log_transform,
    scaling_method='standard'
)

# Add year processing with age conversion
demo_preprocessor.add_feature_pipeline(
    'year',
    missing_strategy='mean',
    transform_func=year_to_age,
    scaling_method='minmax'
)

# Fit and transform
demo_preprocessor.fit(train_data)
X_demo, y_demo = demo_preprocessor.transform(train_data)

# Show detailed results
print(f"\nMileage Processing:")
print(f"  Missing values: {train_data['mileage'].isnull().sum()} → {X_demo['mileage'].isnull().sum()}")
print(f"  Skewness: {train_data['mileage'].skew():.3f} → {X_demo['mileage'].skew():.3f}")
print(f"  Range: [{train_data['mileage'].min():.0f}, {train_data['mileage'].max():.0f}] → [{X_demo['mileage'].min():.3f}, {X_demo['mileage'].max():.3f}]")

print(f"\nYear Processing (converted to age):")
print(f"  Missing values: {train_data['year'].isnull().sum()} → {X_demo['year'].isnull().sum()}")
print(f"  Original years: {train_data['year'].min():.0f} - {train_data['year'].max():.0f}")
print(f"  Ages (scaled): {X_demo['year'].min():.3f} - {X_demo['year'].max():.3f}")

NUMERICAL FEATURE PROCESSING DEMO
✓ Fitting pipeline for 'mileage'
✓ Fitting pipeline for 'year'
✓ Transforming 'mileage'
✓ Transforming 'year'

Mileage Processing:
  Missing values: 1463 → 0
  Skewness: 1.555 → -2.357
  Range: [-58541, 323000] → [-5.804, 1.122]

Year Processing (converted to age):
  Missing values: 1491 → 0
  Original years: 1970 - 2024
  Ages (scaled): 0.000 - 1.000
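
The mileage pipeline above combines `outlier_method='iqr'` with a log transform. How `DataPreprocessor` treats outliers internally is not shown here, so the following is only a sketch under one common assumption: values outside the Tukey fences (1.5 × IQR beyond the quartiles) are clipped rather than dropped. The helper `iqr_bounds` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def iqr_bounds(col: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) Tukey fences for a numeric column."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy mileage values, including a negative reading and an extreme high value
mileage = pd.Series([1000.0, 4589.0, 9102.0, 28421.0, 323000.0, -58541.0])

lower, upper = iqr_bounds(mileage)
clipped = mileage.clip(lower=lower, upper=upper)  # assumption: clip, do not drop rows

# Same transform as log_transform in the demo: log1p of the absolute value
logged = np.log1p(clipped.abs())
```

Taking the absolute value treats a reading of -58541 like 58541. Whether that is appropriate depends on what negative mileage means in the source data; treating such readings as missing is another option.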


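### Categorical Features

Categorical features are encoded rather than scaled. Here transmission is one-hot encoded and Brand is mean (target) encoded.

In [7]: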
# Demonstrate categorical feature processing
print("CATEGORICAL FEATURE PROCESSING DEMO")
print("=" * 40)

# Create preprocessor for categorical features
cat_preprocessor = DataPreprocessor(target_column='price')

# Configure transmission with one-hot encoding
cat_preprocessor.add_feature_pipeline(
    'transmission',
    missing_strategy='mode',  # Most common value
    encoding_method='onehot'
)

# Configure Brand with mean encoding (for high cardinality)
cat_preprocessor.add_feature_pipeline(
    'Brand',
    missing_strategy='mode',
    encoding_method='mean'  # Target encoding
)

# Fit and transform
cat_preprocessor.fit(train_data)
X_cat, y_cat = cat_preprocessor.transform(train_data)

# Show categorical processing results
print(f"\nTransmission (One-Hot Encoding):")
print(f"  Original categories: {train_data['transmission'].nunique()}")
transmission_cols = [col for col in X_cat.columns if 'transmission' in col]
print(f"  Created columns: {len(transmission_cols)} ({transmission_cols})")

print(f"\nBrand (Mean Encoding):")
print(f"  Original categories: {train_data['Brand'].nunique()}")
print(f"  Encoded to single numerical feature")
if 'Brand' in X_cat.columns:  # the encoded values replace the original 'Brand' column
    print(f"  Value range: {X_cat['Brand'].min():.0f} - {X_cat['Brand'].max():.0f}")

# Show sample of transformed data
print(f"\nSample of categorical transformations:")
sample_comparison = pd.DataFrame({
    'Original_Transmission': train_data['transmission'].head(),
    'Original_Brand': train_data['Brand'].head()
})

for col in X_cat.columns:
    if 'transmission' in col or 'Brand' in col:
        sample_comparison[f'Processed_{col}'] = X_cat[col].head()

display(sample_comparison)

CATEGORICAL FEATURE PROCESSING DEMO
✓ Fitting pipeline for 'transmission'
✓ Fitting pipeline for 'Brand'
✓ Transforming 'transmission'
✓ Transforming 'Brand'

Transmission (One-Hot Encoding):
  Original categories: 40
  Created columns: 40 (['transmission_ MANUAL ', 'transmission_ Manual', 'transmission_ Manual ', 'transmission_ manual ', 'transmission_ANUAL', 'transmission_AUTOMATI', 'transmission_AUTOMATIC', 'transmission_Automati', 'transmission_Automatic', 'transmission_EMI-AUTO', 'transmission_MANUA', 'transmission_MANUAL', 'transmission_Manua', 'transmission_Manual', 'transmission_Manual ', 'transmission_Other', 'transmission_SEMI-AUT', 'transmission_SEMI-AUTO', 'transmission_Semi-Aut', 'transmission_Semi-Auto', 'transmission_UNKNOWN', 'transmission_UTOMATIC', 'transmission_anua', 'transmission_anual', 'transmission_automati', 'transmission_automatic', 'transmission_emi-Aut', 'transmission_emi-Auto', 'transmission_emi-auto', 'transmission_manua', 'transmission_manual', 'transmiss

Unnamed: 0,Original_Transmission,Original_Brand,Processed_Brand,Processed_transmission_ MANUAL,Processed_transmission_ Manual,Processed_transmission_ Manual.1,Processed_transmission_ manual,Processed_transmission_ANUAL,Processed_transmission_AUTOMATI,Processed_transmission_AUTOMATIC,...,Processed_transmission_manual,Processed_transmission_manual.1,Processed_transmission_nknow,Processed_transmission_nknown,Processed_transmission_semi-aut,Processed_transmission_semi-auto,Processed_transmission_unknow,Processed_transmission_unknown,Processed_transmission_utomati,Processed_transmission_utomatic
0,Semi-Auto,VW,16897.048057,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Manual,Toyota,12499.829564,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Semi-Auto,Audi,22899.203586,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,anual,Ford,13001.569233,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Manual,BMW,22643.574627,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
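
The 40 one-hot columns above come from inconsistent spellings of a handful of transmission types (stray whitespace, mixed case, truncated strings such as `anual` or `emi-Aut`). One way to reduce them before encoding is sketched below: strip and lowercase each label, then snap it to the closest entry in a list of canonical labels using `difflib` from the standard library. The canonical list is an assumption inferred from the column names in the output, and `normalize_label` is a hypothetical helper, not part of `DataPreprocessor`.

```python
import difflib
import pandas as pd

# Assumed canonical labels, inferred from the one-hot column names above
CANONICAL = ["manual", "automatic", "semi-auto", "other", "unknown"]

def normalize_label(value, canonical=CANONICAL, cutoff=0.6):
    """Strip, lowercase, and snap a label to its closest canonical spelling."""
    if pd.isna(value):
        return value  # leave missing values for the imputation step
    cleaned = str(value).strip().lower()
    match = difflib.get_close_matches(cleaned, canonical, n=1, cutoff=cutoff)
    return match[0] if match else cleaned

raw = pd.Series([" MANUAL ", "anual", "Semi-Auto", "emi-Aut", "AUTOMATI", "nknow", None])
clean = raw.map(normalize_label)

# One-hot encode the cleaned labels; far fewer columns than the raw strings produce
encoded = pd.get_dummies(clean, prefix="transmission")
```

If the custom transform hook accepts string columns, a function like this could be applied ahead of the encoding step; otherwise it can be applied to the raw DataFrame before fitting. The similarity cutoff is a tunable guess and should be checked against the actual label set.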


## Complete Pipeline Setup

Now let's create a comprehensive pipeline that processes multiple features at once. The per-feature settings are collected in a single dictionary and registered in a loop, which keeps the whole configuration in one place.

In [10]:
# Complete pipeline with multiple features
print("COMPLETE PIPELINE DEMO")
print("=" * 40)

# Create comprehensive preprocessor
full_preprocessor = DataPreprocessor(target_column='price')

# Configure all features at once
feature_configs = {
    'mileage': {'missing_strategy': 'median', 'transform_func': log_transform, 'scaling_method': 'standard'},
    'year': {'missing_strategy': 'mean', 'transform_func': year_to_age, 'scaling_method': 'minmax'},
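    'engineSize': {'missing_strategy': 1600, 'scaling_method': 'robust'},  # constant fill value; the sample rows show engineSize in litres (e.g. 2.0), so 1.6 may be the intended value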
    'tax': {'missing_strategy': 'mean', 'scaling_method': 'standard'},
    'mpg': {'missing_strategy': 'median', 'scaling_method': 'standard'},
    'transmission': {'missing_strategy': 'mode', 'encoding_method': 'onehot'},
    'fuelType': {'missing_strategy': 'mode', 'encoding_method': 'onehot'},
    'Brand': {'missing_strategy': 'mode', 'encoding_method': 'mean'}
}

# Add all feature pipelines
for feature, config in feature_configs.items():
    full_preprocessor.add_feature_pipeline(feature, **config)

# Fit and transform
print("Fitting complete pipeline...")
full_preprocessor.fit(train_data)

print("Transforming data...")
X_train_full, y_train_full = full_preprocessor.transform(train_data)
X_test_full, y_test_full = full_preprocessor.transform(test_data)

# Show comprehensive results
print(f"\nComplete Pipeline Results:")
print(f"  Original features: {len(feature_configs)}")
print(f"  Final features: {X_train_full.shape[1]}")
print(f"  Feature expansion: +{X_train_full.shape[1] - len(feature_configs)}")
print(f"  Training data: {X_train_full.shape}")
print(f"  Test data: {X_test_full.shape}")
print(f"  Missing values: {X_train_full.isnull().sum().sum()}")

print(f"\nFeature types in final dataset:")
numeric_features = X_train_full.select_dtypes(include=[np.number]).columns
print(f"  Numerical: {len(numeric_features)} features")
print(f"  All features: {list(X_train_full.columns)}")

COMPLETE PIPELINE DEMO
Fitting complete pipeline...
✓ Fitting pipeline for 'mileage'
✓ Fitting pipeline for 'year'
✓ Fitting pipeline for 'engineSize'
✓ Fitting pipeline for 'tax'
✓ Fitting pipeline for 'mpg'
✓ Fitting pipeline for 'transmission'
✓ Fitting pipeline for 'fuelType'
✓ Fitting pipeline for 'Brand'
Transforming data...
✓ Transforming 'mileage'
✓ Transforming 'year'
✓ Transforming 'engineSize'
✓ Transforming 'tax'
✓ Transforming 'mpg'
✓ Transforming 'transmission'
✓ Transforming 'fuelType'
✓ Transforming 'Brand'
✓ Transforming 'mileage'
✓ Transforming 'year'
✓ Transforming 'engineSize'
✓ Transforming 'tax'
✓ Transforming 'mpg'
✓ Transforming 'transmission'
✓ Transforming 'fuelType'
✓ Transforming 'Brand'

Complete Pipeline Results:
  Original features: 8
  Final features: 85
  Feature expansion: +77
  Training data: (75973, 85)
  Test data: (32567, 85)
  Missing values: 6139

Feature types in final dataset:
  Numerical: 84 features
  All features: ['carID', 'Brand', 'model',
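
The results above report 6139 remaining missing values and 84 numerical columns out of 85, which suggests that columns without a configured pipeline pass through unchanged. Before handing the matrix to an estimator, it can help to list which columns still contain missing values or non-numeric data. The helper `audit_model_readiness` below is a hypothetical sketch using only pandas, shown on a toy frame rather than the real output.

```python
import numpy as np
import pandas as pd

def audit_model_readiness(X: pd.DataFrame) -> dict:
    """Report columns that would block most scikit-learn estimators."""
    non_numeric = X.select_dtypes(exclude=[np.number]).columns.tolist()
    missing = X.isnull().sum()
    with_missing = missing[missing > 0].to_dict()
    return {"non_numeric": non_numeric, "missing_counts": with_missing}

# Toy frame standing in for a processed feature matrix (values are hypothetical)
X_toy = pd.DataFrame({
    "mileage": [0.57, -0.55, 0.87],
    "model": ["I30", "Tiguan", "Golf"],
    "previousOwners": [3.0, np.nan, 2.0],
})

report = audit_model_readiness(X_toy)

# Columns that are numeric and complete, and therefore safe to pass to a model
ready_cols = [
    c for c in X_toy.columns
    if c not in report["non_numeric"] and c not in report["missing_counts"]
]
```

Columns flagged by the report can either be given their own feature pipeline or dropped before model fitting.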

In [11]:
# Apply the fitted pipeline to test data
print("APPLYING FITTED PIPELINE TO TEST DATA")
print("=" * 40)

# Use the already fitted full_preprocessor to transform test data
print("Processing test data with fitted pipeline...")
X_test_full, y_test_full = full_preprocessor.transform(test_data)

# Show consistency between train and test processing
print(f"\nTrain vs Test Consistency Check:")
print(f"  Train features: {X_train_full.shape[1]}")
print(f"  Test features: {X_test_full.shape[1]}")
print(f"  Feature names match: {list(X_train_full.columns) == list(X_test_full.columns)}")
print(f"  Train missing values: {X_train_full.isnull().sum().sum()}")
print(f"  Test missing values: {X_test_full.isnull().sum().sum()}")

# Compare scaling consistency for a numerical feature
print(f"\nScaling Consistency Example (mileage):")
print(f"  Train mean: {X_train_full['mileage'].mean():.3f}, std: {X_train_full['mileage'].std():.3f}")
print(f"  Test mean: {X_test_full['mileage'].mean():.3f}, std: {X_test_full['mileage'].std():.3f}")

# Show sample of processed test data
print(f"\nSample of processed test data:")
display(X_test_full.head())

print(f"\nData is ready for machine learning!")
print(f"  Training set: {X_train_full.shape}")
print(f"  Test set: {X_test_full.shape}")

APPLYING FITTED PIPELINE TO TEST DATA
Processing test data with fitted pipeline...
✓ Transforming 'mileage'
✓ Transforming 'year'
✓ Transforming 'engineSize'
✓ Transforming 'tax'
✓ Transforming 'mpg'
✓ Transforming 'transmission'
✓ Transforming 'fuelType'
✓ Transforming 'Brand'

Train vs Test Consistency Check:
  Train features: 85
  Test features: 85
  Feature names match: True
  Train missing values: 6139
  Test missing values: 2469

Scaling Consistency Example (mileage):
  Train mean: -0.000, std: 1.000
  Test mean: -0.007, std: 1.017

Sample of processed test data:


Unnamed: 0,carID,Brand,model,year,mileage,tax,mpg,engineSize,paintQuality%,previousOwners,...,fuelType_etrol,fuelType_hybrid,fuelType_iese,fuelType_iesel,fuelType_other,fuelType_petro,fuelType_petrol,fuelType_ther,fuelType_ybri,fuelType_ybrid
0,89856,12827.721135,I30,0.022981,0.573837,1.365245,-0.868599,0.0,61.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,106581,16897.048057,Tiguan,0.131588,0.869967,0.478418,-1.079926,0.5,60.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,80886,22643.574627,2 Series,0.150065,0.69272,0.075315,-0.234618,-0.125,94.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,100174,10368.416888,Grandland X,0.094634,-0.55146,0.397797,-0.702099,-0.5,77.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,81376,22643.574627,1 Series,0.094634,-0.227773,0.478418,-0.234618,0.5,45.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



Data is ready for machine learning!
  Training set: (75973, 85)
  Test set: (32567, 85)
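
The matching feature names reported above depend on the one-hot columns being fixed at fit time, so that categories missing from (or new in) the test data do not change the column layout. The sketch below shows one way to get that behavior with plain pandas; it illustrates the idea and is not a description of how `DataPreprocessor` does it. The toy category values are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"fuelType": ["Petrol", "Diesel", "Hybrid", "Petrol"]})
test = pd.DataFrame({"fuelType": ["Diesel", "Electric", "Petrol"]})

# "Fit": remember the dummy columns produced on the training data
train_encoded = pd.get_dummies(train["fuelType"], prefix="fuelType", dtype=float)
fitted_columns = train_encoded.columns

# "Transform": encode the test data, then force the training column layout.
# Categories unseen in training (here "Electric") become all-zero rows,
# and training categories absent from the test data are added as zero columns.
test_encoded = (
    pd.get_dummies(test["fuelType"], prefix="fuelType", dtype=float)
    .reindex(columns=fitted_columns, fill_value=0.0)
)
```

The same alignment is what scikit-learn's `OneHotEncoder(handle_unknown="ignore")` provides out of the box.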


In [12]:
# Verify train/test consistency
print("TRAIN/TEST CONSISTENCY CHECK")
print("=" * 40)

# Check that both datasets have the same structure
print(f"Train vs Test Consistency:")
print(f"  Train features: {X_train_full.shape[1]}")
print(f"  Test features: {X_test_full.shape[1]}")
print(f"  Feature names match: {list(X_train_full.columns) == list(X_test_full.columns)}")
print(f"  Train missing values: {X_train_full.isnull().sum().sum()}")
print(f"  Test missing values: {X_test_full.isnull().sum().sum()}")

# Verify scaling consistency (mean should be close to 0 for standardized features)
print(f"\nScaling Consistency (standardized features should have mean ~0):")
for feature in ['mileage', 'tax', 'mpg']:
    if feature in X_train_full.columns:
        train_mean = X_train_full[feature].mean()
        test_mean = X_test_full[feature].mean()
        print(f"  {feature}: train_mean={train_mean:.3f}, test_mean={test_mean:.3f}")

print(f"\nData is ready for machine learning!")

TRAIN/TEST CONSISTENCY CHECK
Train vs Test Consistency:
  Train features: 85
  Test features: 85
  Feature names match: True
  Train missing values: 6139
  Test missing values: 2469

Scaling Consistency (standardized features should have mean ~0):
  mileage: train_mean=-0.000, test_mean=-0.007
  tax: train_mean=-0.000, test_mean=0.003
  mpg: train_mean=-0.000, test_mean=0.004

Data is ready for machine learning!
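
The Brand feature in this pipeline is mean-encoded from the target, so its mapping must be learned from training rows only and then reused unchanged on validation or test rows. A minimal sketch of that fit/apply split follows. The function names are hypothetical, and the global-mean fallback for unseen categories is an assumption rather than documented `DataPreprocessor` behavior.

```python
import pandas as pd

def fit_mean_encoding(categories: pd.Series, target: pd.Series) -> dict:
    """Learn per-category target means plus a global fallback, from training rows only."""
    return {
        "means": target.groupby(categories).mean(),
        "fallback": target.mean(),
    }

def apply_mean_encoding(categories: pd.Series, encoding: dict) -> pd.Series:
    """Map categories to learned means; unseen categories get the global mean."""
    return categories.map(encoding["means"]).fillna(encoding["fallback"])

# Toy data (hypothetical rows)
train = pd.DataFrame({
    "Brand": ["VW", "Toyota", "VW", "Audi"],
    "price": [22290.0, 13790.0, 18000.0, 24990.0],
})
validation = pd.DataFrame({"Brand": ["Toyota", "Skoda"]})  # "Skoda" is unseen

encoding = fit_mean_encoding(train["Brand"], train["price"])
val_encoded = apply_mean_encoding(validation["Brand"], encoding)
```

Fitting the mapping on the full dataset before splitting would let validation prices influence the encoded feature, which is the kind of leakage that per-fold fitting avoids.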


In [None]:
# Demonstrate k-fold cross validation without data leakage
print("K-FOLD CROSS VALIDATION DEMO")
print("=" * 40)

from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Prepare data for cross validation (numerical features only)
X_full = train_data[['mileage', 'year', 'engineSize', 'tax', 'mpg']]
y_full = train_data['price']

print(f"Full dataset for CV: {X_full.shape}")

# K-fold cross validation with proper preprocessing
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []

for fold, (train_idx, val_idx) in enumerate(kfold.split(X_full)):
    print(f"\nFold {fold + 1}:")
    
    # Split data
    X_train_fold = X_full.iloc[train_idx]
    X_val_fold = X_full.iloc[val_idx]
    y_train_fold = y_full.iloc[train_idx]
    y_val_fold = y_full.iloc[val_idx]
    
    # Create new preprocessor for this fold (prevents data leakage)
    fold_preprocessor = DataPreprocessor(target_column='price')
    
    # Configure the same settings as before, restricted to the columns present in X_full
    for feature in X_full.columns:
        fold_preprocessor.add_feature_pipeline(feature, **feature_configs[feature])
    
    # Reassemble features and target so the preprocessor can split off the target column
    train_fold = pd.concat([X_train_fold, y_train_fold], axis=1)
    val_fold = pd.concat([X_val_fold, y_val_fold], axis=1)

    # Fit on training fold only
    fold_preprocessor.fit(train_fold)

    # Transform both training and validation folds
    X_train_processed, y_train_processed = fold_preprocessor.transform(train_fold)
    X_val_processed, y_val_processed = fold_preprocessor.transform(val_fold)
    
    # Train model
    model = Ridge(alpha=1.0)
    model.fit(X_train_processed, y_train_processed)
    
    # Predict and evaluate
    y_pred = model.predict(X_val_processed)
    rmse = np.sqrt(mean_squared_error(y_val_processed, y_pred))
    cv_scores.append(rmse)
    
    print(f"  Train size: {X_train_processed.shape[0]}")
    print(f"  Val size: {X_val_processed.shape[0]}")
    print(f"  RMSE: £{rmse:,.0f}")

# Show cross validation results
print(f"\nCross Validation Results:")
print(f"  Mean RMSE: £{np.mean(cv_scores):,.0f}")
print(f"  Std RMSE: £{np.std(cv_scores):,.0f}")
print(f"  All folds: {[f'£{score:,.0f}' for score in cv_scores]}")

print(f"\nKey Benefits:")
print(f"  ✓ No data leakage (each fold fits preprocessor independently)")
print(f"  ✓ Consistent preprocessing across all folds")
print(f"  ✓ Realistic performance estimation")
print(f"  ✓ Same preprocessing pipeline used throughout")

K-FOLD CROSS VALIDATION DEMO
Full dataset for CV: (75973, 5)

Fold 1:
✓ Fitting pipeline for 'mileage'
✓ Fitting pipeline for 'year'
✓ Fitting pipeline for 'engineSize'
✓ Fitting pipeline for 'tax'
✓ Fitting pipeline for 'mpg'
✓ Transforming 'mileage'
✓ Transforming 'year'
✓ Transforming 'engineSize'
✓ Transforming 'tax'
✓ Transforming 'mpg'
✓ Transforming 'mileage'
✓ Transforming 'year'
✓ Transforming 'engineSize'
✓ Transforming 'tax'
✓ Transforming 'mpg'
  Train size: 60778
  Val size: 15195
  RMSE: £7,939

Fold 2:
✓ Fitting pipeline for 'mileage'
✓ Fitting pipeline for 'year'
✓ Fitting pipeline for 'engineSize'
✓ Fitting pipeline for 'tax'
✓ Fitting pipeline for 'mpg'
✓ Transforming 'mileage'
✓ Transforming 'year'
✓ Transforming 'engineSize'
✓ Transforming 'tax'
✓ Transforming 'mpg'
✓ Transforming 'mileage'
✓ Transforming 'year'
✓ Transforming 'engineSize'
✓ Transforming 'tax'
✓ Transforming 'mpg'
  Train size: 60778
  Val size: 15195
  RMSE: £7,987

Fold 3:
✓ Fitting pipeline for '