# DataPrep Class - Advanced Data Preprocessing Pipeline

A comprehensive, configurable data preprocessing class for machine learning workflows with support for feature engineering, missing value handling, outlier detection, and custom transformations.

## Table of Contents
1. [Overview](#overview)
2. [Installation & Imports](#installation--imports)
3. [Core Features](#core-features)
4. [DataPrep Class Implementation](#dataprep-class-implementation)
5. [Basic Usage](#basic-usage)
6. [Advanced Features](#advanced-features)
7. [Testing Suite](#testing-suite)

---

## Overview

The `DataPrep` class provides a sklearn-like interface for data preprocessing with these key advantages:

- **Consistent Processing**: Fit once on training data, then apply the same transformations to test/validation data
- **Feature Engineering**: Create new features by combining existing ones
- **Flexible Configuration**: Dictionary-based (JSON-like) configuration for each feature
- **Custom Transformations**: Support for any callable function
- **Multiple Outlier Methods**: IQR- and Z-score-based detection
- **Error Handling**: Missing features and failed transformations print a warning and are skipped instead of raising
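
The first point is the one most easily broken in practice, so here is a minimal sketch of it in plain pandas, independent of the `DataPrep` class. The toy frames and the `fill_value` variable are made up for illustration; the only idea shown is that a statistic is learned from the training split and then reused unchanged on other splits.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the car dataset used later in this notebook
train = pd.DataFrame({"mileage": [10_000.0, 20_000.0, np.nan, 40_000.0]})
test = pd.DataFrame({"mileage": [np.nan, 5_000.0]})

# The statistic is learned from the training split only...
fill_value = train["mileage"].median()

# ...and reused unchanged on every other split
train_filled = train["mileage"].fillna(fill_value)
test_filled = test["mileage"].fillna(fill_value)

print(fill_value)             # → 20000.0
print(test_filled.tolist())   # → [20000.0, 5000.0]
```

Computing the median separately on the test split would give 5000.0 here, which is exactly the inconsistency the fit/apply separation is meant to avoid.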

---

## Installation & Imports

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from typing import Dict, Any, Optional, Union, Callable, List

---

## Core Features

### Feature Configuration Options

Each feature can be configured with the following options:

| Parameter | Type | Description | Options |
|-----------|------|-------------|---------|
| `fill_na` | str/float | Missing value strategy | `'mean'`, `'median'`, `'mode'`, or a numeric value |
| `outlier_method` | str | Outlier detection method | `'iqr'`, `'zscore'` |
| `outlier_action` | str | How to handle outliers | `'clip'` (default), `'remove'` |
| `transform` | callable | Custom transformation function | Any function that takes a Series and returns a Series |
| `zscore_threshold` | float | Z-score threshold (used only with `'zscore'`) | Default: 3.0 |
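
To make the two `outlier_method` options concrete, the sketch below computes the same bounds the class derives during fitting: `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR` for `'iqr'`, and `mean ± threshold * std` for `'zscore'`. The helper name `outlier_bounds` is hypothetical and not part of `DataPrep`; unlike the class, this sketch raises on an unknown method name.

```python
import numpy as np
import pandas as pd


def outlier_bounds(series, method="iqr", zscore_threshold=3.0):
    """Return (lower, upper) bounds; pandas skips NaN when computing the statistics."""
    if method == "iqr":
        q1 = series.quantile(0.25)
        q3 = series.quantile(0.75)
        iqr = q3 - q1
        return q1 - 1.5 * iqr, q3 + 1.5 * iqr
    if method == "zscore":
        mean, std = series.mean(), series.std()
        return mean - zscore_threshold * std, mean + zscore_threshold * std
    raise ValueError(f"Unknown outlier method: {method!r}")


# Same values as 'feature1' in the test suite: one missing value, one outlier (100)
feature = pd.Series([1, 2, np.nan, 4, 5, 100, 7, 8, 9, 10], dtype="float64")

lower, upper = outlier_bounds(feature, method="iqr")
clipped = feature.clip(lower=lower, upper=upper)

print(lower, upper)    # → -3.5 16.5
print(clipped.max())   # → 16.5
```

With `'clip'`, the value 100 is pulled down to the upper bound 16.5 and the missing value stays missing until `fill_na` handles it.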

### Combined Features

Create new features by combining existing ones:

| Parameter | Type | Description |
|-----------|------|-------------|
| `parent_features` | list | List of existing feature names to combine |
| `combine_func` | callable | Function that takes DataFrame and returns Series |
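| `fill_na` | str/float | How to handle missing values in the new feature |
| `outlier_method`, `outlier_action` | str | Optional outlier handling, same options as for regular features |
| `transform` | callable | Transformation applied last, after combination, filling, and outlier handling |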
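
As a rough illustration of the `combine_func` contract (DataFrame in, Series out), the sketch below uses the mileage-per-year formula that appears later in this notebook. The toy frame and the `reference_year` parameter are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the last one has a missing parent value
df = pd.DataFrame({
    "mileage": [30_000.0, 12_000.0, np.nan],
    "year": [2015.0, 2019.0, 2020.0],
})


def mileage_per_year(frame, reference_year=2025):
    """Takes the whole DataFrame and returns one Series aligned to its index."""
    age = reference_year - frame["year"] + 1  # +1 avoids division by zero for new cars
    return frame["mileage"] / age


combined = mileage_per_year(df)

# A missing parent value propagates into the combined feature,
# which is why combined features accept their own 'fill_na' setting
filled = combined.fillna(combined.median())
print(combined.isna().sum())  # → 1
print(filled.isna().sum())    # → 0
```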

---

## DataPrep Class Implementation

In [7]:
import pandas as pd
import numpy as np
from typing import Dict, Any, Optional, Union, Callable, List

class DataPrep:
    def __init__(self, target_column: str = 'price'):
        self.target_column = target_column
        self.feature_configs = {}
        self.fitted_params = {}
        
    def set_feature_config(self, feature_name: str, config: Dict[str, Any]):
        """
        Configure cleaning settings for a specific feature.
        
        Args:
            feature_name: Name of the feature to configure
            config: Dictionary with cleaning configuration
                   For existing features:
                   {'fill_na': 'median', 'transform': lambda x: np.log1p(x), 'outlier_method': 'iqr'}
                   
                   For new combined features:
                   {
                       'parent_features': ['feature1', 'feature2'],
                       'combine_func': lambda df: df['feature1'] / df['feature2'],
                       'fill_na': 'mean',
                       'transform': lambda x: np.log1p(x),
                       'outlier_method': 'iqr'
                   }
        """
        self.feature_configs[feature_name] = config
        
    def set_multiple_configs(self, configs: Dict[str, Dict[str, Any]]):
        """Set configurations for multiple features at once."""
        for feature_name, config in configs.items():
            self.set_feature_config(feature_name, config)
    
    def fit(self, train_data: pd.DataFrame):
        """
        Fit the cleaning parameters on training data.
        This calculates means, medians, IQR bounds, etc. based on training data.
        """
        X = train_data.drop(columns=[self.target_column])
        
        # First, create any new combined features
        X_with_new_features = self._create_combined_features(X)
        
        for feature_name, config in self.feature_configs.items():
            if feature_name not in X_with_new_features.columns:
                print(f"Warning: Feature '{feature_name}' not found in data after feature creation")
                continue
                
            feature_params = {}
            
            # Store parent features if this is a combined feature
            if 'parent_features' in config:
                feature_params['parent_features'] = config['parent_features']
                feature_params['combine_func'] = config['combine_func']
            
            # Handle missing value parameters
            if 'fill_na' in config:
                fill_method = config['fill_na']
                if fill_method == 'mean':
                    feature_params['fill_value'] = X_with_new_features[feature_name].mean()
                elif fill_method == 'median':
                    feature_params['fill_value'] = X_with_new_features[feature_name].median()
                elif fill_method == 'mode':
                    mode_val = X_with_new_features[feature_name].mode()
                    feature_params['fill_value'] = mode_val.iloc[0] if not mode_val.empty else 0
                elif isinstance(fill_method, (int, float)):
                    feature_params['fill_value'] = fill_method
                    
            # Handle outlier detection parameters
            if 'outlier_method' in config:
                if config['outlier_method'] == 'iqr':
                    Q1 = X_with_new_features[feature_name].quantile(0.25)
                    Q3 = X_with_new_features[feature_name].quantile(0.75)
                    IQR = Q3 - Q1
                    feature_params['outlier_lower'] = Q1 - 1.5 * IQR
                    feature_params['outlier_upper'] = Q3 + 1.5 * IQR
                elif config['outlier_method'] == 'zscore':
                    mean = X_with_new_features[feature_name].mean()
                    std = X_with_new_features[feature_name].std()
                    threshold = config.get('zscore_threshold', 3)
                    feature_params['outlier_lower'] = mean - threshold * std
                    feature_params['outlier_upper'] = mean + threshold * std
            
            # Store transformation function
            if 'transform' in config:
                if callable(config['transform']):
                    feature_params['transform_func'] = config['transform']
                else:
                    raise ValueError(f"Transform for feature '{feature_name}' must be a callable function")
                
            self.fitted_params[feature_name] = feature_params
            
    def _create_combined_features(self, data: pd.DataFrame) -> pd.DataFrame:
        """
        Create new features by combining existing ones based on configurations.
        
        Args:
            data: DataFrame with original features
            
        Returns:
            DataFrame with original + new combined features
        """
        data_copy = data.copy()
        
        for feature_name, config in self.feature_configs.items():
            if 'parent_features' in config and 'combine_func' in config:
                parent_features = config['parent_features']
                combine_func = config['combine_func']
                
                # Check if all parent features exist
                missing_parents = [f for f in parent_features if f not in data_copy.columns]
                if missing_parents:
                    print(f"Warning: Parent features {missing_parents} not found for combined feature '{feature_name}'")
                    continue
                
                try:
                    # Apply the combination function
                    data_copy[feature_name] = combine_func(data_copy)
                except Exception as e:
                    print(f"Warning: Failed to create combined feature '{feature_name}': {e}")
                    
        return data_copy
    
    def clean_feature(self, data: pd.DataFrame, feature_name: str) -> pd.DataFrame:
        """
        Apply cleaning to a specific feature in the given data.
        
        Args:
            data: DataFrame to clean
            feature_name: Name of the feature to clean
            
        Returns:
            DataFrame with the cleaned feature
        """
        if feature_name not in self.fitted_params:
            print(f"Warning: No configuration found for feature '{feature_name}'")
            return data
            
        data_copy = data.copy()
        params = self.fitted_params[feature_name]
        
        # Create combined feature if needed
        if 'parent_features' in params and 'combine_func' in params:
            parent_features = params['parent_features']
            missing_parents = [f for f in parent_features if f not in data_copy.columns]
            if missing_parents:
                print(f"Warning: Parent features {missing_parents} not found for combined feature '{feature_name}'")
                return data_copy
            
            try:
                data_copy[feature_name] = params['combine_func'](data_copy)
            except Exception as e:
                print(f"Warning: Failed to create combined feature '{feature_name}': {e}")
                return data_copy
        
        if feature_name not in data_copy.columns:
            print(f"Warning: Feature '{feature_name}' not found in data after creation attempt")
            return data_copy
        
        # Handle missing values
        if 'fill_value' in params:
            data_copy[feature_name] = data_copy[feature_name].fillna(params['fill_value'])
            
        # Handle outliers using the bounds learned in fit()
        if 'outlier_lower' in params and 'outlier_upper' in params:
            outlier_action = self.feature_configs[feature_name].get('outlier_action', 'clip')
            
            if outlier_action == 'clip':
                # Cap values at the fitted bounds; the row count is unchanged
                data_copy[feature_name] = data_copy[feature_name].clip(
                    lower=params['outlier_lower'], 
                    upper=params['outlier_upper']
                )
            elif outlier_action == 'remove':
                # Drop rows outside the bounds; rows where the feature is still
                # NaN are dropped too, because NaN comparisons evaluate to False
                mask = (data_copy[feature_name] >= params['outlier_lower']) & \
                       (data_copy[feature_name] <= params['outlier_upper'])
                data_copy = data_copy[mask]
                
        # Apply custom transformation function
        if 'transform_func' in params:
            try:
                data_copy[feature_name] = params['transform_func'](data_copy[feature_name])
            except Exception as e:
                print(f"Warning: Transform function failed for feature '{feature_name}': {e}")
                
        return data_copy
    
    def clean_data(self, data: pd.DataFrame, features: Optional[list] = None) -> tuple:
        """
        Apply cleaning to specified features or all configured features.
        
        Args:
            data: DataFrame to clean
            features: List of features to clean. If None, cleans all configured features.
            
        Returns:
            Tuple of (X_clean, y) where X_clean is cleaned features and y is target
        """
        if features is None:
            features = list(self.fitted_params.keys())
            
        data_clean = data.copy()
        
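        # Process features in configuration order; a combined feature is created
        # inside clean_feature() from the current, possibly already cleaned, parent columns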
        for feature in features:
            data_clean = self.clean_feature(data_clean, feature)
            
        # Split into X and y
        y = data_clean[self.target_column] if self.target_column in data_clean.columns else None
        X = data_clean.drop(columns=[self.target_column]) if self.target_column in data_clean.columns else data_clean
        
        return X, y
    
    def get_feature_info(self, feature_name: Optional[str] = None):
        """Get information about configured features and their fitted parameters."""
        if feature_name:
            if feature_name in self.feature_configs:
                info = {
                    'config': self.feature_configs[feature_name].copy(),
                    'fitted_params': self.fitted_params.get(feature_name, {}).copy()
                }
                # Replace functions with descriptions for readability
                if 'transform' in info['config'] and callable(info['config']['transform']):
                    info['config']['transform'] = f"<function {info['config']['transform'].__name__}>"
                if 'combine_func' in info['config'] and callable(info['config']['combine_func']):
                    info['config']['combine_func'] = f"<function {info['config']['combine_func'].__name__}>"
                if 'transform_func' in info['fitted_params']:
                    info['fitted_params']['transform_func'] = f"<function {info['fitted_params']['transform_func'].__name__}>"
                if 'combine_func' in info['fitted_params']:
                    info['fitted_params']['combine_func'] = f"<function {info['fitted_params']['combine_func'].__name__}>"
                return info
            else:
                return f"No configuration found for feature '{feature_name}'"
        else:
            return {
                'configured_features': list(self.feature_configs.keys()),
                'fitted_features': list(self.fitted_params.keys())
            }
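
The two `outlier_action` branches in `clean_feature` behave quite differently, which the following stand-alone sketch tries to show with plain pandas. The toy frame and the bounds are illustrative values, not output of the class.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one outlier in 'feature1'
df = pd.DataFrame({
    "feature1": [1.0, 2.0, np.nan, 100.0],
    "price": [10, 20, 30, 40],
})
lower, upper = -3.5, 16.5  # illustrative bounds, as if produced by fit()

# 'clip': values are capped, every row survives
clipped = df["feature1"].clip(lower=lower, upper=upper)

# 'remove': rows outside the bounds are dropped, and so are NaN rows,
# because comparisons against NaN evaluate to False
mask = (df["feature1"] >= lower) & (df["feature1"] <= upper)
removed = df[mask]

print(len(clipped), len(removed))  # → 4 2
```

Because `'remove'` drops whole rows, the target column shrinks along with the features. In the class, filling happens before outlier handling, so NaN rows are only dropped this way when no `fill_na` is configured for the feature.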

---

## Basic Usage

### 1. Simple Feature Cleaning

Basic example showing individual feature configuration:

In [8]:
# Load sample data
train_relative_path = '../Data/train.csv'
test_relative_path = '../Data/test.csv'
train_data = pd.read_csv(train_relative_path)
test_data = pd.read_csv(test_relative_path)

print("Sample data:")
print(train_data.head())

Sample data:
   carID   Brand      model    year  price transmission  mileage fuelType  \
0  69512      VW       Golf  2016.0  22290    Semi-Auto  28421.0   Petrol   
1  53000  Toyota      Yaris  2019.0  13790       Manual   4589.0   Petrol   
2   6366    Audi         Q2  2019.0  24990    Semi-Auto   3624.0   Petrol   
3  29021    Ford     FIESTA  2018.0  12500        anual   9102.0   Petrol   
4  10062     BMW   2 Series  2019.0  22995       Manual   1000.0   Petrol   

     tax        mpg  engineSize  paintQuality%  previousOwners  hasDamage  
0    NaN  11.417268         2.0           63.0        4.000000        0.0  
1  145.0  47.900000         1.5           50.0        1.000000        0.0  
2  145.0  40.900000         1.5           56.0        4.000000        0.0  
3  145.0  65.700000         1.0           50.0       -2.340306        0.0  
4  145.0  42.800000         1.5           97.0        3.000000        0.0  


In [9]:
def log_transform(x):
    """Log transform that tolerates negative input: log1p(|x| + 1), i.e. log(|x| + 2)."""
    return np.log1p(np.abs(x) + 1)

# Configure basic cleaning for individual features
basic_configs = {
    'mileage': {
        'fill_na': 'median',
        'transform': log_transform,
        'outlier_method': 'iqr',
        'outlier_action': 'clip'
    },
    'year': {
        'fill_na': 'mean',
        'transform': lambda x: 2025 - x,  # Convert to car age
        'outlier_method': 'zscore',
        'zscore_threshold': 2.5,
        'outlier_action': 'clip'
    },
    'engineSize': {
        'fill_na': 1.6,  # Domain-specific default
        'transform': lambda x: np.log(x + 1) * 2,
    }
}

# Initialize and fit the preprocessor
data_prepper = DataPrep(target_column='price')
data_prepper.set_multiple_configs(basic_configs)
data_prepper.fit(train_data)

# Clean the data
X_train_clean, y_train = data_prepper.clean_data(train_data)
X_test_clean, _ = data_prepper.clean_data(test_data)

print(f"Training shape: {X_train_clean.shape}")
print(f"Features cleaned: {list(basic_configs.keys())}")

Training shape: (75973, 13)
Features cleaned: ['mileage', 'year', 'engineSize']


### 2. Inspect Configuration Results

View fitted parameters and configuration details:

In [15]:
# Get information about specific features
print("Mileage configuration:")
print(data_prepper.get_feature_info('mileage'))

print("\nAll configured features:")
print(data_prepper.get_feature_info())

Mileage configuration:
{'config': {'fill_na': 'median', 'transform': '<function log_transform>', 'outlier_method': 'iqr', 'outlier_action': 'clip'}, 'fitted_params': {'fill_value': np.float64(17300.0), 'outlier_lower': np.float64(-30083.125), 'outlier_upper': np.float64(69933.875), 'transform_func': '<function log_transform>'}}

All configured features:
{'configured_features': ['mileage', 'year', 'engineSize'], 'fitted_features': ['mileage', 'year', 'engineSize']}
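
The fitted IQR bounds can be read backwards to recover the quartiles they came from, which is a quick sanity check on the printed parameters. The sketch below only uses the two `mileage` bounds shown in the output above and the IQR formulas from `fit()`.

```python
# Bounds printed above for 'mileage' (IQR method)
lower, upper = -30083.125, 69933.875

# lower = Q1 - 1.5 * IQR and upper = Q3 + 1.5 * IQR,
# so upper - lower = (Q3 - Q1) + 3 * IQR = 4 * IQR
iqr = (upper - lower) / 4
q1 = lower + 1.5 * iqr
q3 = upper - 1.5 * iqr

print(q1, q3, iqr)  # → 7423.25 32427.5 25004.25
```

Since the lower bound is negative, clipping only has a practical effect on the upper side for non-negative mileage values.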


---

## Advanced Features

### Feature Engineering with Combined Features

Create sophisticated new features by combining existing ones:

In [None]:
def log_transform(x):
    """Log transform that tolerates negative input: log1p(|x| + 1), i.e. log(|x| + 2)."""
    return np.log1p(np.abs(x) + 1)

# Advanced configuration with feature engineering
advanced_configs = {
    # Base feature cleaning
    'mileage': {
        'fill_na': 'median',
        'transform': log_transform,
        'outlier_method': 'iqr',
        'outlier_action': 'clip'
    },
    'year': {
        'fill_na': 'mean',
        'transform': lambda x: 2025 - x,
        'outlier_method': 'zscore',
        'zscore_threshold': 2.5,
        'outlier_action': 'clip'
    },
    'engineSize': {
        'fill_na': 1.6,
        'transform': lambda x: np.log(x + 1) * 2,
    },
    
    # Engineered Features
    'mileage_per_year': {
        'parent_features': ['mileage', 'year'],
        'combine_func': lambda df: df['mileage'] / (2025 - df['year'] + 1),
        'fill_na': 'median',
        'transform': lambda x: np.sqrt(x),
        'outlier_method': 'iqr',
        'outlier_action': 'clip'
    },
    'power_efficiency': {
        'parent_features': ['engineSize', 'year'],
        'combine_func': lambda df: df['engineSize'] * 100 / (2025 - df['year'] + 50),
        'fill_na': 'mean',
        'transform': log_transform,
    },
    'luxury_score': {
        'parent_features': ['engineSize', 'mileage', 'year'],
        'combine_func': lambda df: (df['engineSize'] * 10) - (df['mileage'] / 1000) + (df['year'] - 2000),
        'fill_na': 'median',
        'outlier_method': 'zscore',
        'zscore_threshold': 2,
        'outlier_action': 'clip'
    }
}

print("Original features:\n", train_data.columns.tolist())
print("-" * 50)

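# Apply advanced preprocessing
advanced_prepper = DataPrep(target_column='price')
advanced_prepper.set_multiple_configs(advanced_configs)
# Note: this fits on the full training set before the split below, so the
# validation rows influence the fitted statistics. Fit on the training part
# only if the validation score must be free of leakage.
advanced_prepper.fit(train_data)

# Create train/validation split (both parts still contain the target column)
X_train, X_val = train_test_split(train_data, test_size=0.2, random_state=42)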

# Clean all datasets
X_train_clean, y_train = advanced_prepper.clean_data(X_train)
X_val_clean, y_val = advanced_prepper.clean_data(X_val)
X_test_final, _ = advanced_prepper.clean_data(test_data)

print("Features after engineering:\n", X_train_clean.columns.tolist())
new_features = [f for f in X_train_clean.columns if f not in train_data.columns]
if new_features:
    print("New features created:")
    for f in new_features:
        print(f" - {f}: dtype={X_train_clean[f].dtype}")
else:
    print("No new features created.")

Original features:
 ['carID', 'Brand', 'model', 'year', 'price', 'transmission', 'mileage', 'fuelType', 'tax', 'mpg', 'engineSize', 'paintQuality%', 'previousOwners', 'hasDamage']
--------------------------------------------------
Features after engineering:
 ['carID', 'Brand', 'model', 'year', 'transmission', 'mileage', 'fuelType', 'tax', 'mpg', 'engineSize', 'paintQuality%', 'previousOwners', 'hasDamage', 'mileage_per_year', 'power_efficiency', 'luxury_score']
New features created:
 - mileage_per_year: dtype=float64
 - power_efficiency: dtype=float64
 - luxury_score: dtype=float64
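
One thing worth checking when reading the class as listed above: `fit()` builds combined features from the raw parent columns, while `clean_data()` processes features in configuration order, so parents listed first appear to be transformed before the combined feature is computed from them. The sketch below reproduces that difference with plain pandas for a single hypothetical row, using the `mileage` and `year` transforms and the `mileage_per_year` formula from the configuration above.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({"mileage": [30_000.0], "year": [2015.0]})  # hypothetical row


def combine(df):
    # Same formula as 'mileage_per_year' in advanced_configs
    return df["mileage"] / (2025 - df["year"] + 1)


# What fit() sees: the combined feature built from raw parents
fit_time = combine(raw)

# What clean_data() sees when the parents are cleaned first
cleaned = raw.copy()
cleaned["mileage"] = np.log1p(np.abs(cleaned["mileage"]) + 1)  # log_transform
cleaned["year"] = 2025 - cleaned["year"]                       # year becomes car age
clean_time = combine(cleaned)

print(float(fit_time.iloc[0]), float(clean_time.iloc[0]))
```

The two values differ by several orders of magnitude, so the fill value and outlier bounds fitted for the combined feature would not match the data they are applied to. One possible workaround, offered as an assumption rather than the author's intent, is to list combined features before their parents in the configuration dictionary so that they are computed from untransformed columns in both phases.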


### Feature Engineering Results

Inspect the newly created features:

In [22]:
# Analyze the engineered features
print("Combined feature details:")
for feature in ['mileage_per_year', 'power_efficiency', 'luxury_score']:
    if feature in advanced_prepper.fitted_params:
        print(f"\n{feature}:")
        print(advanced_prepper.get_feature_info(feature))

# Check for missing values
print(f"\nMissing values after processing: {X_train_clean.isna().sum().sum()}")
print(f"Shape: Train {X_train_clean.shape}, Val {X_val_clean.shape}, Test {X_test_final.shape}")

Combined feature details:

mileage_per_year:
{'config': {'parent_features': ['mileage', 'year'], 'combine_func': '<function <lambda>>', 'fill_na': 'median', 'transform': '<function <lambda>>', 'outlier_method': 'iqr', 'outlier_action': 'clip'}, 'fitted_params': {'parent_features': ['mileage', 'year'], 'combine_func': '<function <lambda>>', 'fill_value': np.float64(1998.0), 'outlier_lower': np.float64(-2555.25), 'outlier_upper': np.float64(6858.75), 'transform_func': '<function <lambda>>'}}

power_efficiency:
{'config': {'parent_features': ['engineSize', 'year'], 'combine_func': '<function <lambda>>', 'fill_na': 'mean', 'transform': '<function log_transform>'}, 'fitted_params': {'parent_features': ['engineSize', 'year'], 'combine_func': '<function <lambda>>', 'fill_value': np.float64(2.8696735843366508), 'transform_func': '<function log_transform>'}}

luxury_score:
{'config': {'parent_features': ['engineSize', 'mileage', 'year'], 'combine_func': '<function <lambda>>', 'fill_na': 'median

---

## Testing Suite

### Comprehensive Unit Tests

I asked ChatGPT to create automated tests that check the DataPrep class works correctly across various scenarios:

In [23]:
import pandas as pd
import numpy as np
import unittest
from unittest.mock import patch
import warnings

class TestDataPrep(unittest.TestCase):
    
    def setUp(self):
        """Set up test data before each test"""
        # Create sample data for testing
        np.random.seed(42)
        self.sample_data = pd.DataFrame({
            'feature1': [1, 2, np.nan, 4, 5, 100, 7, 8, 9, 10],  # Has missing values and outlier
            'feature2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
            'feature3': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
            'categorical_feature': ['A', 'B', 'A', 'C', np.nan, 'B', 'A', 'C', 'B', 'A'],
            'price': [1000, 2000, 1500, 3000, 2500, 5000, 3500, 4000, 4500, 5500]
        })
        
        # Create another dataset for testing consistency
        self.test_data = pd.DataFrame({
            'feature1': [1.5, 2.5, 3.5, np.nan, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5],
            'feature2': [15, 25, 35, 45, 55, 65, 75, 85, 95, 105],
            'feature3': [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, 1.05],
            'categorical_feature': ['A', 'B', np.nan, 'C', 'B', 'A', 'C', 'A', 'B', 'C'],
            'price': [1100, 2100, 1600, 3100, 2600, 5100, 3600, 4100, 4600, 5600]
        })
    
    def test_initialization(self):
        """Test DataPrep initialization"""
        dp = DataPrep(target_column='price')
        self.assertEqual(dp.target_column, 'price')
        self.assertEqual(dp.feature_configs, {})
        self.assertEqual(dp.fitted_params, {})
        
        # Test custom target column
        dp_custom = DataPrep(target_column='custom_target')
        self.assertEqual(dp_custom.target_column, 'custom_target')
    
    def test_set_feature_config(self):
        """Test setting individual feature configurations"""
        dp = DataPrep()
        config = {
            'fill_na': 'median',
            'transform': lambda x: np.log1p(x),
            'outlier_method': 'iqr'
        }
        
        dp.set_feature_config('feature1', config)
        self.assertEqual(dp.feature_configs['feature1'], config)
    
    def test_set_multiple_configs(self):
        """Test setting multiple feature configurations"""
        dp = DataPrep()
        configs = {
            'feature1': {'fill_na': 'mean'},
            'feature2': {'fill_na': 'median', 'transform': lambda x: x**2}
        }
        
        dp.set_multiple_configs(configs)
        self.assertEqual(len(dp.feature_configs), 2)
        self.assertIn('feature1', dp.feature_configs)
        self.assertIn('feature2', dp.feature_configs)
    
    def test_fit_basic_operations(self):
        """Test fitting with basic operations"""
        dp = DataPrep(target_column='price')
        
        configs = {
            'feature1': {
                'fill_na': 'mean',
                'outlier_method': 'iqr',
                'transform': lambda x: np.log1p(x)
            },
            'feature2': {
                'fill_na': 'median',
                'outlier_method': 'zscore',
                'zscore_threshold': 2.0
            }
        }
        
        dp.set_multiple_configs(configs)
        dp.fit(self.sample_data)
        
        # Check that parameters were fitted
        self.assertIn('feature1', dp.fitted_params)
        self.assertIn('feature2', dp.fitted_params)
        
        # Check specific fitted parameters
        feature1_params = dp.fitted_params['feature1']
        self.assertIn('fill_value', feature1_params)
        self.assertIn('outlier_lower', feature1_params)
        self.assertIn('outlier_upper', feature1_params)
        self.assertIn('transform_func', feature1_params)
    
    def test_fill_na_methods(self):
        """Test different fill_na methods"""
        dp = DataPrep(target_column='price')
        
        configs = {
            'feature1': {'fill_na': 'mean'},
            'feature2': {'fill_na': 'median'},
            'categorical_feature': {'fill_na': 'mode'},
            'feature3': {'fill_na': 0.5}  # Specific value
        }
        
        dp.set_multiple_configs(configs)
        dp.fit(self.sample_data)
        
        # Check that fill values were calculated correctly
        expected_mean = self.sample_data['feature1'].mean()
        expected_median = self.sample_data['feature2'].median()
        expected_mode = self.sample_data['categorical_feature'].mode().iloc[0]
        
        self.assertAlmostEqual(dp.fitted_params['feature1']['fill_value'], expected_mean)
        self.assertAlmostEqual(dp.fitted_params['feature2']['fill_value'], expected_median)
        self.assertEqual(dp.fitted_params['categorical_feature']['fill_value'], expected_mode)
        self.assertEqual(dp.fitted_params['feature3']['fill_value'], 0.5)
    
    def test_outlier_methods(self):
        """Test different outlier detection methods"""
        dp = DataPrep(target_column='price')
        
        configs = {
            'feature1': {'outlier_method': 'iqr'},
            'feature2': {'outlier_method': 'zscore', 'zscore_threshold': 2.5}
        }
        
        dp.set_multiple_configs(configs)
        dp.fit(self.sample_data)
        
        # Check IQR method
        Q1 = self.sample_data['feature1'].quantile(0.25)
        Q3 = self.sample_data['feature1'].quantile(0.75)
        IQR = Q3 - Q1
        expected_lower = Q1 - 1.5 * IQR
        expected_upper = Q3 + 1.5 * IQR
        
        self.assertAlmostEqual(dp.fitted_params['feature1']['outlier_lower'], expected_lower)
        self.assertAlmostEqual(dp.fitted_params['feature1']['outlier_upper'], expected_upper)
        
        # Check Z-score method
        mean_f2 = self.sample_data['feature2'].mean()
        std_f2 = self.sample_data['feature2'].std()
        expected_lower_z = mean_f2 - 2.5 * std_f2
        expected_upper_z = mean_f2 + 2.5 * std_f2
        
        self.assertAlmostEqual(dp.fitted_params['feature2']['outlier_lower'], expected_lower_z)
        self.assertAlmostEqual(dp.fitted_params['feature2']['outlier_upper'], expected_upper_z)
    
    def test_clean_feature(self):
        """Test cleaning individual features"""
        dp = DataPrep(target_column='price')
        
        config = {
            'fill_na': 'mean',
            'transform': lambda x: x * 2,
            'outlier_method': 'iqr',
            'outlier_action': 'clip'
        }
        
        dp.set_feature_config('feature1', config)
        dp.fit(self.sample_data)
        
        # Clean the feature
        cleaned_data = dp.clean_feature(self.sample_data, 'feature1')
        
        # Check that missing values were filled
        self.assertFalse(cleaned_data['feature1'].isna().any())
    
    def test_clean_data_full_pipeline(self):
        """Test the complete data cleaning pipeline"""
        dp = DataPrep(target_column='price')
        
        configs = {
            'feature1': {
                'fill_na': 'median',
                'transform': lambda x: np.log1p(x),
                'outlier_method': 'iqr',
                'outlier_action': 'clip'
            },
            'feature2': {
                'fill_na': 'mean',
                'outlier_method': 'zscore',
                'zscore_threshold': 2.0,
                'outlier_action': 'clip'
            }
        }
        
        dp.set_multiple_configs(configs)
        dp.fit(self.sample_data)
        
        X_clean, y_clean = dp.clean_data(self.sample_data)
        
        # Check that target column is separated correctly
        self.assertNotIn('price', X_clean.columns)
        self.assertEqual(len(y_clean), len(self.sample_data))
        
        # Check that configured features were cleaned
        self.assertIn('feature1', X_clean.columns)
        self.assertIn('feature2', X_clean.columns)
        
        # Check no missing values in cleaned features
        self.assertFalse(X_clean['feature1'].isna().any())
        self.assertFalse(X_clean['feature2'].isna().any())
    
    def test_combined_features(self):
        """Test creating and cleaning combined features"""
        dp = DataPrep(target_column='price')
        
        configs = {
            'feature1': {'fill_na': 'mean'},
            'feature2': {'fill_na': 'mean'},
            'combined_feature': {
                'parent_features': ['feature1', 'feature2'],
                'combine_func': lambda df: df['feature1'] + df['feature2'],
                'fill_na': 'median',
                'transform': lambda x: x / 2
            }
        }
        
        dp.set_multiple_configs(configs)
        dp.fit(self.sample_data)
        
        X_clean, y_clean = dp.clean_data(self.sample_data)
        
        # Check that combined feature was created
        self.assertIn('combined_feature', X_clean.columns)
        
        # Check that no missing values exist
        self.assertFalse(X_clean['combined_feature'].isna().any())
    
    def test_consistency_across_datasets(self):
        """Test that the same parameters are applied consistently across different datasets"""
        dp = DataPrep(target_column='price')
        
        configs = {
            'feature1': {
                'fill_na': 'mean',
                'transform': lambda x: x * 2,
                'outlier_method': 'iqr',
                'outlier_action': 'clip'
            }
        }
        
        dp.set_multiple_configs(configs)
        dp.fit(self.sample_data)  # Fit on first dataset
        
        # Get the fitted fill value
        fitted_fill_value = dp.fitted_params['feature1']['fill_value']
        
        # Clean both datasets
        X_train, _ = dp.clean_data(self.sample_data)
        X_test, _ = dp.clean_data(self.test_data)
        
        # Both should have no missing values
        self.assertFalse(X_train['feature1'].isna().any())
        self.assertFalse(X_test['feature1'].isna().any())
        
        # The fill value should be consistent (from training data)
        self.assertAlmostEqual(fitted_fill_value, self.sample_data['feature1'].mean())
    
    def test_error_handling(self):
        """Test error handling for various edge cases"""
        dp = DataPrep(target_column='price')
        
        # A valid configuration should fit without raising
        dp.set_feature_config('feature1', {'fill_na': 'mean'})
        dp.fit(self.sample_data)
        
        # A non-callable transform is expected to raise ValueError.
        # Note: this block tolerates both outcomes, so the test does not
        # fail if no exception is raised.
        try:
            dp.set_feature_config('feature1', {'transform': 'not_a_function'})
            dp.fit(self.sample_data)
        except ValueError:
            pass  # Expected behavior
        
        # Test missing feature warning
        dp2 = DataPrep(target_column='price')
        dp2.set_feature_config('nonexistent_feature', {'fill_na': 'mean'})
        with patch('builtins.print') as mock_print:
            dp2.fit(self.sample_data)
            # Should print warning about missing feature
            mock_print.assert_called()
        
        # Test cleaning feature that wasn't configured
        dp3 = DataPrep(target_column='price')
        result = dp3.clean_feature(self.sample_data, 'unconfigured_feature')
        pd.testing.assert_frame_equal(result, self.sample_data)
    
    def test_get_feature_info(self):
        """Test the get_feature_info method"""
        dp = DataPrep(target_column='price')
        
        def custom_transform(x):
            return x * 2
        
        config = {
            'fill_na': 'mean',
            'transform': custom_transform,
            'outlier_method': 'iqr'
        }
        
        dp.set_feature_config('feature1', config)
        dp.fit(self.sample_data)
        
        # Test getting info for specific feature
        info = dp.get_feature_info('feature1')
        self.assertIn('config', info)
        self.assertIn('fitted_params', info)
        
        # Test getting all features info
        all_info = dp.get_feature_info()
        self.assertIn('configured_features', all_info)
        self.assertIn('fitted_features', all_info)
        
        # Test getting info for non-existent feature
        no_info = dp.get_feature_info('nonexistent')
        self.assertIsInstance(no_info, str)

def run_tests():
    """Run all tests and display results"""
    print("Running DataPrep Class Tests...")
    print("=" * 50)
    
    # Create the test suite from the TestDataPrep test case
    test_suite = unittest.TestLoader().loadTestsFromTestCase(TestDataPrep)
    
    # Run tests with detailed output
    runner = unittest.TextTestRunner(verbosity=2)
    result = runner.run(test_suite)
    
    # Print summary
    print("\n" + "=" * 50)
    print(f"Tests run: {result.testsRun}")
    print(f"Failures: {len(result.failures)}")
    print(f"Errors: {len(result.errors)}")
    
    if result.failures:
        print("\nFailures:")
        for test, traceback in result.failures:
            print(f"- {test}: {traceback}")
    
    if result.errors:
        print("\nErrors:")
        for test, traceback in result.errors:
            print(f"- {test}: {traceback}")
    
    if result.wasSuccessful():
        print("\n✅ All tests passed!")
    else:
        print(f"\n❌ {len(result.failures + result.errors)} test(s) failed")
    
    return result.wasSuccessful()

# Additional integration tests
def test_real_world_scenario():
    """Test a realistic data preprocessing scenario"""
    print("\n" + "=" * 30)
    print("Running Real-World Scenario Test")
    print("=" * 30)
    
    # Create realistic car data
    np.random.seed(123)
    car_data = pd.DataFrame({
        'mileage': np.random.normal(50000, 30000, 100),
        'year': np.random.choice(range(2000, 2024), 100),
        'engineSize': np.random.normal(2.0, 0.8, 100),
        'price': np.random.normal(15000, 8000, 100)
    })
    
    # Add some missing values and outliers
    car_data.loc[5:10, 'mileage'] = np.nan
    car_data.loc[95:98, 'engineSize'] = np.nan
    car_data.loc[2, 'mileage'] = 500000  # Outlier
    car_data.loc[3, 'price'] = 100000    # Outlier
    
    # Configure preprocessing
    dp = DataPrep(target_column='price')
    
    configs = {
        'mileage': {
            'fill_na': 'median',
            'transform': lambda x: np.log1p(np.abs(x)),  # log(1 + |x|); log1p already adds 1
            'outlier_method': 'iqr',
            'outlier_action': 'clip'
        },
        'year': {
            'fill_na': 'mean',
            'transform': lambda x: 2024 - x,  # Convert to age
            'outlier_method': 'zscore',
            'zscore_threshold': 3,
            'outlier_action': 'clip'
        },
        'engineSize': {
            'fill_na': 'mean',
            'transform': lambda x: x ** 2,  # Square for non-linearity
        },
        'efficiency_score': {
            'parent_features': ['mileage', 'year'],
            'combine_func': lambda df: df['mileage'] / (2024 - df['year'] + 1),
            'fill_na': 'median',
            'transform': lambda x: np.sqrt(x),
            'outlier_method': 'iqr',
            'outlier_action': 'clip'
        }
    }
    
    try:
        dp.set_multiple_configs(configs)
        dp.fit(car_data)
        
        # Clean the data
        X_clean, y_clean = dp.clean_data(car_data)
        X_clean = X_clean[list(configs.keys())]  # Keep only configured features
        
        # Verify results
        print(f"Original shape: {car_data.shape}")
        print(f"Cleaned features shape: {X_clean.shape}")
        print(f"Target shape: {y_clean.shape}")
        print(f"Missing values in cleaned data: {X_clean.isna().sum().sum()}")
        print(f"Features created: {X_clean.columns.tolist()}")
        
        # Check that combined feature was created
        assert 'efficiency_score' in X_clean.columns, "Combined feature not created"
        
        # Check no missing values
        assert X_clean.isna().sum().sum() == 0, "Missing values still present"
        
        print("✅ Real-world scenario test passed!")
        return True
        
    except Exception as e:
        print(f"❌ Real-world scenario test failed: {e}")
        return False
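
The `test_error_handling` method above accepts a `ValueError` from a non-callable transform but also passes when nothing is raised, and it only checks that `print` was called at all. A stricter variant could use `assertRaises` and capture stdout to inspect the warning text. The sketch below illustrates that pattern against `TinyPrep`, a hypothetical stand-in written only for this example; it is not the real `DataPrep` class, and whether `DataPrep` raises in `set_feature_config` or in `fit` is an assumption that would need checking.

```python
import contextlib
import io
import unittest


class TinyPrep:
    """Hypothetical stand-in for DataPrep, used only to make the sketch runnable."""

    def __init__(self):
        self.configs = {}

    def set_feature_config(self, name, config):
        # Assumption: a non-callable transform is rejected at configuration time.
        transform = config.get("transform")
        if transform is not None and not callable(transform):
            raise ValueError(f"transform for {name!r} must be callable")
        self.configs[name] = config

    def fit(self, columns):
        # Assumption: a configured feature missing from the data triggers a printed warning.
        for name in self.configs:
            if name not in columns:
                print(f"Warning: feature {name!r} not found in data")
        return self


class TestErrorHandlingSketch(unittest.TestCase):
    def test_invalid_transform_raises(self):
        prep = TinyPrep()
        # Fails if no ValueError is raised, unlike a bare try/except/pass.
        with self.assertRaises(ValueError):
            prep.set_feature_config("feature1", {"transform": "not_a_function"})

    def test_missing_feature_warns(self):
        prep = TinyPrep()
        prep.set_feature_config("nonexistent_feature", {"fill_na": "mean"})
        buffer = io.StringIO()
        # Capture printed output so the warning text itself can be asserted on.
        with contextlib.redirect_stdout(buffer):
            prep.fit(columns=["feature1", "feature2"])
        self.assertIn("nonexistent_feature", buffer.getvalue())


def run_sketch():
    """Run the two sketch tests quietly and return the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(TestErrorHandlingSketch)
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
```

Splitting the checks into separate test methods also means one failing edge case no longer hides the others.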

In [24]:
# Run all tests
if __name__ == "__main__":
    success1 = run_tests()
    success2 = test_real_world_scenario()
    
    if success1 and success2:
        print("\n🎉 All tests completed successfully!")
    else:
        print("\n⚠️ Some tests failed. Please check the output above.")

test_clean_data_full_pipeline (__main__.TestDataPrep.test_clean_data_full_pipeline)
Test the complete data cleaning pipeline ... ok
test_clean_feature (__main__.TestDataPrep.test_clean_feature)
Test cleaning individual features ... ok
test_combined_features (__main__.TestDataPrep.test_combined_features)
Test creating and cleaning combined features ... ok
test_consistency_across_datasets (__main__.TestDataPrep.test_consistency_across_datasets)
Test that the same parameters are applied consistently across different datasets ... ok
test_error_handling (__main__.TestDataPrep.test_error_handl

Running DataPrep Class Tests...

Tests run: 12
Failures: 0
Errors: 0

✅ All tests passed!

Running Real-World Scenario Test
Original shape: (100, 4)
Cleaned features shape: (100, 4)
Target shape: (100,)
Missing values in cleaned data: 0
Features created: ['mileage', 'year', 'engineSize', 'efficiency_score']
✅ Real-world scenario test passed!

🎉 All tests completed successfully!
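
The per-test lines in the output above appear separately from the `print` summary, most likely because `unittest.TextTestRunner` writes to `sys.stderr` by default while `print` writes to `sys.stdout`, and notebooks render the two streams independently. One way to keep everything in a single ordered stream is to pass `stream=sys.stdout` to the runner. The following is a minimal sketch; `run_suite` and `TestExample` are illustrative names, not part of the notebook.

```python
import io
import sys
import unittest


class TestExample(unittest.TestCase):
    """Placeholder test case standing in for TestDataPrep."""

    def test_addition(self):
        self.assertEqual(1 + 1, 2)


def run_suite(test_case, stream=None):
    """Run one TestCase class, writing runner output to `stream` (stdout by default)."""
    suite = unittest.TestLoader().loadTestsFromTestCase(test_case)
    # TextTestRunner defaults to sys.stderr; an explicit stream keeps the
    # per-test lines in the same stream as surrounding print() calls.
    runner = unittest.TextTestRunner(stream=stream or sys.stdout, verbosity=2)
    return runner.run(suite)
```

Calling `run_suite(TestDataPrep)` in place of the runner setup inside `run_tests()` would then emit the verbose test lines between the printed header and summary.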
