# 📊 Enhanced Data Preparation Pipeline
## Traffic Accident Analysis Project

This notebook walks through the enhanced data preparation pipeline for the traffic accident dataset. The preprocessing steps it covers are:

- **Smart Missing Value Handling** - Multiple strategies with validation
- **Advanced Outlier Treatment** - IQR, Z-score, and robust methods
- **Intelligent Data Transformation** - Automatic skewness detection
- **Domain-Specific Feature Engineering** - Traffic-specific features
- **Smart Categorical Encoding** - Cardinality-aware encoding
- **Intelligent Feature Scaling** - Excludes binary features
- **Advanced Class Balancing** - Multiple resampling methods
- **Comprehensive Data Validation** - Quality checks and metrics

---

## 🔧 Setup and Imports

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Add the parent directory to sys.path so we can import from 'src'
sys.path.append(os.path.abspath('../'))
%load_ext autoreload
%autoreload 2

from src.data.traffic_data_prep_pipeline import TrafficDataPrep

print("✅ Setup completed successfully!")
print(f"📁 Working directory: {os.getcwd()}")

✅ Setup completed successfully!
📁 Working directory: d:\traffic\notebooks


## 📥 Data Loading with Quality Assessment

`load_data()` reads the raw CSV and prints a quality summary: dataset shape, missing values, duplicate rows, and the number of numeric and categorical features.

In [2]:
# Initialize the enhanced data preparation pipeline
prep = TrafficDataPrep("../data/raw/traffic_accidents.csv")

# Load data with automatic quality assessment
df = prep.load_data()

print("\n📋 Dataset Overview:")
print(f"   Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"   Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

# Display first few rows
df.head()

✅ Data loaded successfully.
📊 Dataset shape: 209,306 rows × 24 columns
📈 Data Quality Summary:
   Missing values: 0 (0.0%)
   Duplicate rows: 31
   Numeric features: 10
   Categorical features: 14

📋 Dataset Overview:
   Shape: 209,306 rows × 24 columns
   Memory usage: 214.1 MB


|  | crash_date | traffic_control_device | weather_condition | lighting_condition | first_crash_type | trafficway_type | alignment | roadway_surface_cond | road_defect | crash_type | ... | most_severe_injury | injuries_total | injuries_fatal | injuries_incapacitating | injuries_non_incapacitating | injuries_reported_not_evident | injuries_no_indication | crash_hour | crash_day_of_week | crash_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 07/29/2023 01:00:00 PM | TRAFFIC SIGNAL | CLEAR | DAYLIGHT | TURNING | NOT DIVIDED | STRAIGHT AND LEVEL | UNKNOWN | UNKNOWN | NO INJURY / DRIVE AWAY | ... | NO INDICATION OF INJURY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 13 | 7 | 7 |
| 1 | 08/13/2023 12:11:00 AM | TRAFFIC SIGNAL | CLEAR | DARKNESS, LIGHTED ROAD | TURNING | FOUR WAY | STRAIGHT AND LEVEL | DRY | NO DEFECTS | NO INJURY / DRIVE AWAY | ... | NO INDICATION OF INJURY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0 | 1 | 8 |
| 2 | 12/09/2021 10:30:00 AM | TRAFFIC SIGNAL | CLEAR | DAYLIGHT | REAR END | T-INTERSECTION | STRAIGHT AND LEVEL | DRY | NO DEFECTS | NO INJURY / DRIVE AWAY | ... | NO INDICATION OF INJURY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 10 | 5 | 12 |
| 3 | 08/09/2023 07:55:00 PM | TRAFFIC SIGNAL | CLEAR | DAYLIGHT | ANGLE | FOUR WAY | STRAIGHT AND LEVEL | DRY | NO DEFECTS | INJURY AND / OR TOW DUE TO CRASH | ... | NONINCAPACITATING INJURY | 5.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 19 | 4 | 8 |
| 4 | 08/19/2023 02:55:00 PM | TRAFFIC SIGNAL | CLEAR | DAYLIGHT | REAR END | T-INTERSECTION | STRAIGHT AND LEVEL | UNKNOWN | UNKNOWN | NO INJURY / DRIVE AWAY | ... | NO INDICATION OF INJURY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 14 | 7 | 8 |
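
The quality summary printed at load time can be reproduced with plain pandas. The sketch below is only an approximation of what `load_data()` might compute; the function name `quality_summary` and the tiny sample frame are hypothetical and not part of `TrafficDataPrep`.

```python
import pandas as pd

def quality_summary(df: pd.DataFrame) -> dict:
    """Collect headline quality numbers similar to those printed at load time."""
    missing = int(df.isnull().sum().sum())
    numeric_cols = df.select_dtypes(include="number").columns
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing": missing,
        "missing_pct": round(100 * missing / df.size, 1) if df.size else 0.0,
        "duplicate_rows": int(df.duplicated().sum()),
        "numeric_features": len(numeric_cols),
        # Everything that is not numeric is counted as categorical here
        "categorical_features": df.shape[1] - len(numeric_cols),
    }

# Hypothetical stand-in for the real CSV
sample = pd.DataFrame({
    "weather_condition": ["CLEAR", "CLEAR", "RAIN", "CLEAR"],
    "crash_hour": [13, 13, 0, 10],
    "injuries_total": [0.0, 0.0, 1.0, None],
})
summary = quality_summary(sample)
print(summary)
```

Note that `duplicated()` counts only the repeated copies, so two identical rows are reported as one duplicate.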


## 🔍 Enhanced Missing Value Handling

Smart missing value imputation with multiple strategies:
- **Categorical features**: Mode imputation
- **Numeric features**: Median imputation (robust to outliers)
- **High missing percentage**: Automatic detection and warnings
- **Comprehensive visualization**: Before/after comparison

In [3]:
# Handle missing values with enhanced strategy
df = prep.handle_missing_values(strategy='auto')

print("\n📊 Missing Value Summary:")
missing_after = df.isnull().sum().sum()
print(f"   Total missing values after cleaning: {missing_after:,}")
print(f"   Data completeness: {((df.size - missing_after) / df.size * 100):.2f}%")


🔍 MISSING VALUE HANDLING
✅ No missing values found!

📊 Missing Value Summary:
   Total missing values after cleaning: 0
   Data completeness: 100.00%
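
This dataset has no missing values, so the step is a no-op here. For data that does have gaps, the strategy listed above (mode for categorical features, median for numeric ones, a warning for heavily missing columns) could look roughly like the sketch below. The function name `impute_missing` and the `warn_above` cutoff are assumptions for illustration; the internals of `handle_missing_values` are not shown in this notebook.

```python
import pandas as pd

def impute_missing(df: pd.DataFrame, warn_above: float = 0.5) -> pd.DataFrame:
    """Median-impute numeric columns and mode-impute all others."""
    out = df.copy()
    for col in out.columns:
        frac = out[col].isnull().mean()
        if frac == 0:
            continue
        if frac > warn_above:
            print(f"Warning: '{col}' is {frac:.0%} missing")
        if frac == 1.0:
            continue  # nothing to estimate a fill value from
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Hypothetical sample with one gap per column
sample = pd.DataFrame({
    "weather_condition": ["CLEAR", None, "RAIN", "CLEAR"],
    "injuries_total": [0.0, 1.0, None, 21.0],
})
filled = impute_missing(sample)
print(filled)
```

The median of the observed values (0, 1, 21) is 1, so the extreme value 21 does not drag the fill value upward the way a mean would.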


## 📦 Advanced Outlier Treatment

Multiple outlier detection and treatment methods:
- **IQR Method**: Interquartile range-based detection
- **Z-Score Method**: Standard deviation-based detection
- **Robust Method**: Median absolute deviation-based
- **Capping Strategy**: Clips extreme values to the detection bounds instead of dropping rows, so the dataset size is unchanged

In [None]:
# Apply advanced outlier treatment
df = prep.outlier_treatment(method='iqr', threshold=1.5)

print("\n📈 Outlier Treatment Results:")
print(f"   Method used: IQR with threshold 1.5")
print(f"   Strategy: Capping (preserves data size)")


📦 OUTLIER TREATMENT
🔍 Analyzing outliers in 4 features using IQR method

📊 Processing 'injuries_total':
   Original range: 0.00 to 21.00
   Outliers detected: 5,692 (2.7%)
   After capping: 0.00 to 2.50

📊 Processing 'injuries_incapacitating':
   Original range: 0.00 to 7.00
   Outliers detected: 6,634 (3.2%)
   After capping: 0.00 to 0.00

📊 Processing 'injuries_non_incapacitating':
   Original range: 0.00 to 21.00
   Outliers detected: 33,000 (15.8%)
   After capping: 0.00 to 0.00

📊 Processing 'num_units':
   Original range: 1.00 to 11.00
   Outliers detected: 19,940 (9.5%)
   After capping: 2.00 to 2.00
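
IQR capping computes the bounds `Q1 - threshold * IQR` and `Q3 + threshold * IQR` and clips values to them. The sketch below is a minimal version under that standard definition; `cap_iqr` and the sample series are hypothetical and are not the `outlier_treatment` implementation.

```python
import pandas as pd

def cap_iqr(series: pd.Series, threshold: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - threshold*IQR, Q3 + threshold*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - threshold * iqr, upper=q3 + threshold * iqr)

# Q1 = 0 and Q3 = 1, so the upper bound is 1 + 1.5 * 1 = 2.5
spread = pd.Series([0, 0, 0, 0, 1, 1, 1, 21], dtype=float)
capped = cap_iqr(spread)

# Mostly-zero feature: Q1 = Q3 = 0, so both bounds are 0
mostly_zero = pd.Series([0] * 9 + [7], dtype=float)
collapsed = cap_iqr(mostly_zero)

print(capped.min(), capped.max())
print(collapsed.min(), collapsed.max())
```

The second case matches the `0.00 to 0.00` and `2.00 to 2.00` ranges in the output above: when Q1 equals Q3 the IQR is zero, both bounds coincide, and capping turns the feature into a constant. Such columns carry no information afterwards, so it may be worth reviewing them before modeling.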


## 🔄 Intelligent Data Transformation

Automatic skewness detection and transformation:
- **Skewness Analysis**: Automatic detection of skewed features
- **Log Transformation**: Applied to highly skewed features
- **Threshold-Based**: Configurable skewness threshold
- **Before/After Visualization**: Distribution comparison

In [None]:
# Apply intelligent data transformation
df = prep.data_transformation(auto_detect_skew=True, skew_threshold=1.0)

print("\n🔄 Transformation Summary:")
print(f"   Auto-detection enabled with threshold: 1.0")
print(f"   Method: Log1p transformation (handles zeros)")
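
A threshold-based skewness check followed by `log1p` might be sketched as follows. This is an illustration, not the `data_transformation` code: the function name is hypothetical, and the choices to transform only right-skewed columns and to skip columns with negative values are assumptions.

```python
import numpy as np
import pandas as pd

def log_transform_skewed(df: pd.DataFrame, skew_threshold: float = 1.0):
    """Apply log1p to non-negative numeric columns whose skewness exceeds the threshold."""
    out = df.copy()
    transformed = []
    for col in out.select_dtypes(include="number").columns:
        if out[col].min() < 0:
            continue  # assumption: leave columns with negative values untouched
        if out[col].skew() > skew_threshold:
            out[col] = np.log1p(out[col])  # log(1 + x), defined at x = 0
            transformed.append(col)
    return out, transformed

# Hypothetical sample: one heavily right-skewed column, one symmetric column
sample = pd.DataFrame({
    "injuries_total": [0, 0, 0, 0, 1, 1, 21],
    "crash_hour": [8, 9, 10, 11, 12, 13, 14],
})
result, transformed_cols = log_transform_skewed(sample)
print(transformed_cols)
print(result)
```

`log1p` maps 0 to 0, which is why it suits count features such as injury totals, where zeros are common.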

## 🔧 Domain-Specific Feature Engineering

Traffic accident domain knowledge applied to create meaningful features:
- **Time-Based Features**: Night indicator, rush hours, time periods
- **Risk Assessment**: Composite risk scores
- **Injury Analysis**: Severe injury indicators, injury rates
- **Interaction Features**: Combined risk factors
- **Weekend/Weekday**: Day-based patterns

In [None]:
# Create domain-specific features
df = prep.feature_engineering(create_interactions=True)

print("\n🔧 Feature Engineering Results:")
print(f"   New dataset shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"   Features added: Time-based, risk assessment, injury analysis")

# Show new feature columns
new_features = [col for col in df.columns if any(keyword in col.lower() 
                for keyword in ['night', 'rush', 'weekend', 'risk', 'severe'])]
if new_features:
    print(f"\n🆕 New Features Created: {', '.join(new_features)}")
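
As an illustration of the time-based and injury features described above, the sketch below derives a few indicators from columns that appear in the loaded data. The feature names, the night and rush-hour windows, and the interaction term are all hypothetical choices; the actual definitions inside `feature_engineering` are not shown in this notebook. The weekend rule relies on the sample rows shown earlier, where 07/29/2023 (a Saturday) has `crash_day_of_week` 7 and 08/13/2023 (a Sunday) has 1.

```python
import pandas as pd

def add_domain_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add illustrative time-based, injury, and interaction indicators."""
    out = df.copy()
    hour = out["crash_hour"]
    # Assumed windows: night is 21:00-05:59, rush hours are 07-09 and 16-18
    out["is_night"] = ((hour >= 21) | (hour < 6)).astype(int)
    out["is_rush_hour"] = (hour.between(7, 9) | hour.between(16, 18)).astype(int)
    # 1 = Sunday and 7 = Saturday in this dataset's encoding
    out["is_weekend"] = out["crash_day_of_week"].isin([1, 7]).astype(int)
    out["severe_injury"] = (
        (out["injuries_fatal"] + out["injuries_incapacitating"]) > 0
    ).astype(int)
    # Simple interaction feature: crash at night on a weekend
    out["night_weekend"] = out["is_night"] * out["is_weekend"]
    return out

# Values taken from the five rows displayed by df.head()
sample = pd.DataFrame({
    "crash_hour": [13, 0, 10, 19, 14],
    "crash_day_of_week": [7, 1, 5, 4, 7],
    "injuries_fatal": [0.0, 0.0, 0.0, 0.0, 0.0],
    "injuries_incapacitating": [0.0, 0.0, 0.0, 0.0, 0.0],
})
features = add_domain_features(sample)
print(features[["is_night", "is_rush_hour", "is_weekend", "night_weekend"]])
```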

## 🔤 Smart Categorical Encoding

Cardinality-aware encoding strategy:
- **Auto Strategy**: Chooses encoding based on cardinality
- **Label Encoding**: For ordinal and high-cardinality features
- **Cardinality Warnings**: Alerts for high-cardinality features
- **Encoder Storage**: Saves encoders for future use

In [None]:
# Apply smart categorical encoding
df = prep.encode_features(encoding_strategy='auto')

print("\n🔤 Encoding Results:")
print(f"   Categorical features processed: {len(prep.encoders)}")
print(f"   Encoding strategy: Auto (cardinality-aware)")

# Show data types after encoding
print("\n📊 Data Types After Encoding:")
dtype_counts = df.dtypes.value_counts()
for dtype, count in dtype_counts.items():
    print(f"   {dtype}: {count} features")
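
One way to make encoding cardinality-aware is to one-hot encode columns with few distinct values and label-encode the rest, keeping the mapping for later reuse. The sketch below follows that idea. It is an assumption about what an "auto" strategy could do, not the `encode_features` implementation; the function name and the `max_onehot_cardinality` parameter are hypothetical.

```python
import pandas as pd

def encode_auto(df: pd.DataFrame, max_onehot_cardinality: int = 5):
    """One-hot encode low-cardinality columns, label-encode the others."""
    out = df.copy()
    encoders = {}
    categorical = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    for col in categorical:
        if out[col].nunique() <= max_onehot_cardinality:
            dummies = pd.get_dummies(out[col], prefix=col, dtype=int)
            out = pd.concat([out.drop(columns=col), dummies], axis=1)
            encoders[col] = {"type": "onehot", "columns": list(dummies.columns)}
        else:
            # Integer codes follow the sorted order of the category labels
            codes, categories = pd.factorize(out[col], sort=True)
            out[col] = codes
            encoders[col] = {"type": "label", "categories": list(categories)}
    return out, encoders

# Hypothetical sample using values from the displayed rows
sample = pd.DataFrame({
    "lighting_condition": ["DAYLIGHT", "DARKNESS, LIGHTED ROAD", "DAYLIGHT"],
    "first_crash_type": ["TURNING", "REAR END", "ANGLE"],
    "crash_hour": [13, 0, 10],
})
encoded, encoders = encode_auto(sample, max_onehot_cardinality=2)
print(encoded)
```

Storing the category list alongside each encoded column is what makes it possible to apply the same mapping to new data later.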

## ⚖️ Intelligent Feature Scaling

Smart scaling that preserves binary features:
- **Binary Detection**: Automatically identifies binary features
- **Selective Scaling**: Excludes binary features from scaling
- **Multiple Methods**: Standard, Robust, MinMax scaling
- **Before/After Comparison**: Statistical comparison

In [None]:
# Apply intelligent feature scaling
df = prep.scale_features(scaling_method='standard', exclude_binary=True)

print("\n⚖️ Scaling Results:")
print(f"   Method: Standard scaling")
print(f"   Binary features excluded: Yes")
print(f"   Scaler stored for future use: Yes")
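
Standard scaling that skips binary indicators can be sketched with pandas alone. The function below is a hypothetical illustration, not the `scale_features` code: it treats any numeric column whose values are all 0 or 1 as binary, and it uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import pandas as pd

def scale_non_binary(df: pd.DataFrame):
    """Standardize numeric columns, leaving 0/1 indicator columns unchanged."""
    out = df.copy()
    params = {}
    for col in out.select_dtypes(include="number").columns:
        if set(out[col].dropna().unique()) <= {0, 1}:
            continue  # binary indicator: keep its 0/1 meaning
        mean, std = out[col].mean(), out[col].std(ddof=0)
        if std == 0:
            continue  # constant column: nothing to scale
        out[col] = (out[col] - mean) / std
        params[col] = (mean, std)  # kept so new data can be scaled identically
    return out, params

# Hypothetical sample
sample = pd.DataFrame({
    "crash_hour": [13, 0, 10, 19, 14],
    "is_weekend": [1, 1, 0, 0, 1],
})
scaled, params = scale_non_binary(sample)
print(scaled)
```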

## ⚖️ Advanced Class Imbalance Handling

Multiple resampling techniques for class imbalance:
- **SMOTE**: Synthetic Minority Oversampling Technique
- **ADASYN**: Adaptive Synthetic Sampling
- **BorderlineSMOTE**: Borderline cases focus
- **Imbalance Analysis**: Automatic ratio calculation
- **Before/After Visualization**: Class distribution comparison

In [None]:
# Handle class imbalance with advanced methods
df = prep.handle_imbalance(method='smote', sampling_strategy='auto')

print("\n⚖️ Class Balancing Results:")
print("   Method: SMOTE (Synthetic Minority Oversampling)")
print(f"   Final dataset shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

# Show final class distribution
target_dist = df[prep.target].value_counts().sort_index()
print("\n🎯 Final Class Distribution:")
for class_val, count in target_dist.items():
    pct = (count / len(df)) * 100
    print(f"   Class {class_val}: {count:,} ({pct:.1f}%)")
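
The core idea of SMOTE is to create a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbors. The NumPy sketch below shows that idea together with a simple imbalance ratio. It is a simplified illustration, not the imbalanced-learn implementation and not the code behind `handle_imbalance`; the function name and the sample arrays are hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    """Generate n_new points on segments between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances; a point is never its own neighbor
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical labels: 8 majority rows, 4 minority rows
y = np.array([0] * 8 + [1] * 4)
classes, counts = np.unique(y, return_counts=True)
imbalance_ratio = counts.max() / counts.min()

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_minority, n_new=int(counts.max() - counts.min()))
print(imbalance_ratio, synthetic.shape)
```

Because synthetic rows are derived from existing ones, many workflows apply oversampling only to the training split, so that evaluation data contains real observations only.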

## ✅ Comprehensive Data Quality Validation

Final data quality checks and validation:
- **Missing Values Check**: Ensures no missing data
- **Infinite Values Check**: Detects mathematical issues
- **Data Types Validation**: Confirms proper types
- **Target Variable Check**: Validates target classes
- **Duplicate Detection**: Identifies duplicate rows

In [None]:
# Perform comprehensive data quality validation
validation_results = prep.validate_data_quality()

print("\n📋 Validation Summary:")
passed = sum(1 for r in validation_results if r['status'] == 'PASS')
# Named `warned` so the `warnings` module imported during setup is not shadowed
warned = sum(1 for r in validation_results if r['status'] == 'WARN')
failed = sum(1 for r in validation_results if r['status'] == 'FAIL')

print(f"   ✅ Passed: {passed}")
print(f"   ⚠️ Warnings: {warned}")
print(f"   ❌ Failed: {failed}")

if failed == 0:
    print("\n🎉 Data is ready for modeling!")
else:
    print("\n⚠️ Please review failed checks before proceeding.")
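
The five checks listed above can be expressed as a list of result dictionaries with a `status` of `PASS`, `WARN`, or `FAIL`, which is the shape the summary cell expects. The sketch below is a guess at such a validator. The function name, the `check` and `detail` keys, and the choice of which checks warn rather than fail are assumptions, not the behavior of `validate_data_quality`.

```python
import numpy as np
import pandas as pd

def validate(df: pd.DataFrame, target: str) -> list:
    """Run basic quality checks and return one result dict per check."""
    results = []

    def add(check, status, detail):
        results.append({"check": check, "status": status, "detail": detail})

    missing = int(df.isnull().sum().sum())
    add("missing_values", "PASS" if missing == 0 else "FAIL", f"{missing} missing")

    numeric = df.select_dtypes(include="number")
    n_inf = int(np.isinf(numeric.to_numpy(dtype=float)).sum())
    add("infinite_values", "PASS" if n_inf == 0 else "FAIL", f"{n_inf} infinite")

    non_numeric = df.shape[1] - numeric.shape[1]
    add("data_types", "PASS" if non_numeric == 0 else "WARN",
        f"{non_numeric} non-numeric columns")

    n_classes = int(df[target].nunique())
    add("target_classes", "PASS" if n_classes >= 2 else "FAIL",
        f"{n_classes} classes")

    duplicates = int(df.duplicated().sum())
    add("duplicates", "PASS" if duplicates == 0 else "WARN",
        f"{duplicates} duplicate rows")
    return results

# Hypothetical sample with one infinite value and one duplicate row
sample = pd.DataFrame({"x": [1.0, 2.0, np.inf, 1.0], "target": [0, 1, 1, 0]})
checks = validate(sample, target="target")
statuses = [r["status"] for r in checks]
print(statuses)
```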

## 📋 Complete Preparation Summary

Comprehensive summary of all preprocessing steps and transformations applied.

In [None]:
# Generate comprehensive preparation summary
summary = prep.generate_preparation_summary()

print("\n📊 Final Dataset Characteristics:")
print(f"   Total samples: {df.shape[0]:,}")
print(f"   Total features: {df.shape[1]}")
print(f"   Numeric features: {len(df.select_dtypes(include=[np.number]).columns)}")
print(f"   Target variable: {prep.target}")
print(f"   Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

## 💾 Save Model-Ready Data

Save the fully processed dataset with comprehensive metadata for reproducibility.

In [None]:
# Save the model-ready dataset with metadata
output_path = "../data/processed/traffic_model_ready.csv"
prep.save_model_ready_data(output_path)

print("\n💾 Data Saved Successfully!")
print(f"   Main dataset: {output_path}")
print(f"   Metadata: {output_path.replace('.csv', '_metadata.json')}")
print(f"   File size: {os.path.getsize(output_path) / 1024**2:.1f} MB")
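
Writing a metadata file next to the CSV, using the same `_metadata.json` naming as the cell above, could be done along these lines. The function name, the metadata fields, and the `crash_type` target used in the demo are hypothetical; what `save_model_ready_data` actually records is not shown in this notebook.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def save_with_metadata(df: pd.DataFrame, output_path, target: str) -> Path:
    """Write the dataset as CSV plus a JSON sidecar describing it."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)  # index=False avoids an extra unnamed column on reload
    metadata = {
        "rows": int(df.shape[0]),
        "columns": list(df.columns),
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "target": target,
    }
    meta_path = output_path.with_name(output_path.stem + "_metadata.json")
    meta_path.write_text(json.dumps(metadata, indent=2))
    return meta_path

# Hypothetical demo in a temporary directory
sample = pd.DataFrame({"crash_hour": [13, 0], "crash_type": [0, 1]})
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "processed" / "traffic_model_ready.csv"
    meta_path = save_with_metadata(sample, csv_path, target="crash_type")
    meta_name = meta_path.name
    reloaded = pd.read_csv(csv_path)
    metadata = json.loads(meta_path.read_text())
print(meta_name, metadata["rows"])
```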

## 🔄 Enhanced Train-Test-Validation Splits

Create stratified splits with optional validation set:
- **Stratified Sampling**: Maintains class distribution
- **Three-Way Split**: Train, Test, and Validation sets
- **Configurable Ratios**: Flexible split proportions
- **Metadata Tracking**: Complete split information

In [None]:
# Create enhanced train-test-validation splits
splits_info = prep.save_train_test_splits(
    output_dir="../data/processed",
    test_size=0.2,
    validation_size=0.1
)

print("\n🔄 Split Creation Complete!")
print(f"   Files created in: ../data/processed/")
print(f"   Split configuration: 70% train, 20% test, 10% validation")
print(f"   Stratified: Yes (maintains class distribution)")
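
A stratified three-way split can be sketched by shuffling once and then slicing each class separately, so every split keeps roughly the same class proportions. The sketch assumes that `test_size` and `validation_size` are both fractions of the full dataset, which is consistent with the 70/20/10 configuration printed above. The function name and sample data are hypothetical, and this is not the `save_train_test_splits` implementation.

```python
import pandas as pd

def stratified_three_way_split(df: pd.DataFrame, target: str,
                               test_size: float = 0.2,
                               validation_size: float = 0.1,
                               seed: int = 42) -> dict:
    """Split df into train/test/validation, preserving class proportions."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    parts = {"train": [], "test": [], "validation": []}
    for _, group in shuffled.groupby(target):
        n_test = round(len(group) * test_size)
        n_val = round(len(group) * validation_size)
        parts["test"].append(group.iloc[:n_test])
        parts["validation"].append(group.iloc[n_test:n_test + n_val])
        parts["train"].append(group.iloc[n_test + n_val:])
    return {name: pd.concat(chunks) for name, chunks in parts.items()}

# Hypothetical sample: 70 rows of class 0 and 30 rows of class 1
sample = pd.DataFrame({
    "crash_hour": range(100),
    "target": [0] * 70 + [1] * 30,
})
splits = stratified_three_way_split(sample, target="target")
for name, part in splits.items():
    print(name, len(part), part["target"].mean())
```

Each split ends up with 30% of its rows in class 1, matching the full dataset.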

## 🎯 Final Data Overview

Complete overview of the processed dataset ready for machine learning.