# Complete ML Pipeline: Used Cars Price Prediction

This notebook demonstrates the complete end-to-end machine learning pipeline:

1. ✅ **Data Preprocessing** - Clean and prepare data
2. ✅ **Feature Engineering** - Create new features
3. ✅ **Feature Selection** - Select most important features
4. ✅ **Model Training** - Train multiple models
5. ✅ **Model Evaluation** - Evaluate and compare models
6. ✅ **Hyperparameter Tuning** - Optimize best models
7. ✅ **Final Model Selection** - Choose and save best model

## 🎯 Goal
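Build a model that predicts used car prices and save it, together with its preprocessing steps, for production use. Models are compared on R², RMSE, MAE, and MAPE measured on a held-out test set.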

## 1. Setup and Imports

In [None]:
# Import libraries
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import logging
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Import custom modules
from config import *
from data_validator import DataValidator
from preprocessing_pipeline import DataPreprocessor, split_data
from feature_engineering import FeatureEngineer
from feature_selection import FeatureSelector
from model_training import ModelTrainer
from model_evaluation import ModelEvaluator
from hyperparameter_tuning import HyperparameterTuner

# Set up plotting
%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ All modules imported successfully!")

## 2. Configure Logging

In [None]:
# Create logs directory
log_dir = Path('../logs')
log_dir.mkdir(exist_ok=True)

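# Configure logging
# force=True (Python 3.8+) replaces any handlers left on the root logger by an
# earlier run of this cell; without it, basicConfig does nothing in that case
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_dir / 'complete_pipeline.log'),
        logging.StreamHandler()
    ],
    force=True
)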

logger = logging.getLogger(__name__)
logger.info("="*80)
logger.info("COMPLETE ML PIPELINE - USED CARS PRICE PREDICTION")
logger.info("="*80)

## 3. Load and Preprocess Data

In [None]:
# Load raw data
logger.info("Loading raw data...")
df_raw = pd.read_csv(RAW_DATA_PATH)
print(f"Raw data shape: {df_raw.shape}")

# Create configuration
config = {
    'COLUMNS_TO_DROP': COLUMNS_TO_DROP,
    'PRICE_FILTER': PRICE_FILTER,
    'YEAR_FILTER': YEAR_FILTER,
    'ODOMETER_FILTER': ODOMETER_FILTER,
    'IMPUTATION_STRATEGY': IMPUTATION_STRATEGY,
    'CONSTANT_VALUES': CONSTANT_VALUES,
    'TARGET_COLUMN': TARGET_COLUMN,
    'OUTLIER_CONFIG': OUTLIER_CONFIG,
    'NUMERICAL_COLUMNS': NUMERICAL_COLUMNS,
    'ENCODING_CONFIG': ENCODING_CONFIG,
    'SCALING_CONFIG': SCALING_CONFIG,
    'TRAIN_TEST_SPLIT': TRAIN_TEST_SPLIT
}

# Preprocess data
preprocessor = DataPreprocessor(config)
df_processed = preprocessor.fit_transform(df_raw.copy())
print(f"\nProcessed data shape: {df_processed.shape}")
print(f"Rows retained: {len(df_processed) / len(df_raw) * 100:.1f}%")
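
The internals of `DataPreprocessor` are not shown in this notebook. If it learns imputation values, encodings, or scaling statistics, calling `fit_transform` on the full dataset before the split lets test rows influence those statistics. The following is a minimal sketch of the leakage-free ordering using scikit-learn, not the project's implementation; the column names `year`, `odometer`, `fuel`, and `price` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; column names and values are hypothetical stand-ins.
df = pd.DataFrame({
    "year": [2010, 2015, np.nan, 2018, 2012, 2020, 2008, 2016],
    "odometer": [120000, 60000, 90000, np.nan, 110000, 15000, 180000, 70000],
    "fuel": ["gas", "diesel", "gas", np.nan, "gas", "electric", "gas", "diesel"],
    "price": [5000, 12000, 8000, 15000, 6500, 30000, 3000, 11000],
})

X = df.drop(columns="price")
y = df["price"]

# Split first, so that nothing learned below ever sees the test rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["year", "odometer"]),
    ("cat", categorical, ["fuel"]),
])

# Medians, modes, scaling statistics and categories come from the training rows only.
X_tr_prepared = preprocess.fit_transform(X_tr)
# The test rows are transformed with the statistics learned above.
X_te_prepared = preprocess.transform(X_te)

print(X_tr_prepared.shape, X_te_prepared.shape)
```

Row-level filters such as price or year ranges do not learn anything from the data, so applying them before the split is harmless; only fitted steps need this ordering.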

## 4. Feature Engineering

In [None]:
# Initialize feature engineer
fe_config = {
    'create_polynomial': False,  # Set to True for polynomial features
    'create_statistical': False  # Set to True for statistical features (slower)
}

feature_engineer = FeatureEngineer(fe_config)

# Create engineered features
df_engineered = feature_engineer.fit_transform(df_processed.copy())
print(f"\nEngineered data shape: {df_engineered.shape}")
print(f"New features created: {len(feature_engineer.get_feature_names())}")
print(f"\nNew features: {feature_engineer.get_feature_names()}")
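
The features that `FeatureEngineer` creates are not listed here. As an illustration of the kind of derived columns that are common for used car data, the sketch below adds an age, a usage rate, and a log-transformed mileage. The function `add_basic_features`, the column names `year` and `odometer`, and the reference year are assumptions, not part of the project's code.

```python
import numpy as np
import pandas as pd


def add_basic_features(df: pd.DataFrame, reference_year: int = 2024) -> pd.DataFrame:
    """Return a copy of df with a few illustrative derived columns."""
    out = df.copy()
    # Age in years, floored at zero for listings dated in the future
    out["car_age"] = (reference_year - out["year"]).clip(lower=0)
    # Treat cars from the reference year as one year old to avoid dividing by zero
    out["miles_per_year"] = out["odometer"] / out["car_age"].replace(0, 1)
    # log1p compresses the long right tail of mileage and handles zero
    out["log_odometer"] = np.log1p(out["odometer"])
    return out


cars = pd.DataFrame({"year": [2014, 2024, 2019], "odometer": [100000, 500, 40000]})
cars_fe = add_basic_features(cars)
print(cars_fe)
```

None of these features depend on statistics of the dataset, so they can be computed before or after the train/test split without leakage.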

## 5. Split Data

In [None]:
# Split into train and test sets
X_train, X_test, y_train, y_test = split_data(df_engineered, config)

print(f"\nTrain set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTarget statistics:")
print(f"  Train - Mean: ${y_train.mean():,.2f}, Median: ${y_train.median():,.2f}")
print(f"  Test - Mean: ${y_test.mean():,.2f}, Median: ${y_test.median():,.2f}")

## 6. Feature Selection

In [None]:
# Initialize feature selector
fs_config = {
    'variance_threshold': 0.01,
    'correlation_threshold': 0.95
}

feature_selector = FeatureSelector(fs_config)

# Select features using ensemble method
n_features_to_select = min(50, X_train.shape[1])  # Top 50 features, or all of them if fewer exist
X_train_selected = feature_selector.fit_transform(
    X_train, y_train, 
    method='ensemble',  # Use ensemble of multiple methods
    n_features=n_features_to_select
)

X_test_selected = feature_selector.transform(X_test)

print(f"\nSelected features: {X_train_selected.shape[1]}")
print(f"\nTop 10 features:")
print(feature_selector.get_feature_scores().head(10)[['feature', 'avg_score']])
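
The two thresholds in `fs_config` suggest a variance filter followed by a correlation filter. How `FeatureSelector` applies them is not shown, so the following is only a sketch of that general technique in pandas; the function `filter_features` and the demo columns are hypothetical.

```python
import numpy as np
import pandas as pd


def filter_features(X, variance_threshold=0.01, correlation_threshold=0.95):
    """Drop near-constant columns, then one column of each highly correlated pair."""
    # Step 1: keep columns whose sample variance exceeds the threshold
    variances = X.var()
    X = X[variances[variances > variance_threshold].index.tolist()]

    # Step 2: look only at the upper triangle so each pair is checked once,
    # and drop the later column of any pair above the correlation threshold
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > correlation_threshold).any()]
    return X.drop(columns=to_drop)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),  # almost perfectly correlated with "a"
    "b": rng.normal(size=200),
    "constant": np.full(200, 3.0),  # zero variance
})

X_filtered = filter_features(X_demo)
print(X_filtered.columns.tolist())
```

Variance depends on the scale of each column, so a fixed threshold such as 0.01 behaves differently on scaled and unscaled features.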

In [None]:
# Plot feature importance
feature_selector.plot_feature_importance(
    feature_selector.get_feature_scores(),
    top_n=20,
    save_path='../data/processed/feature_importance.png'
)

## 7. Train Baseline Models

In [None]:
# Initialize model trainer
trainer = ModelTrainer()

# Train baseline models (subset for speed)
baseline_models = [
    'linear_regression',
    'ridge',
    'random_forest',
    'gradient_boosting',
    'xgboost',
    'lightgbm'
]

trained_models = trainer.train_all_models(
    X_train_selected, y_train,
    model_subset=baseline_models,
    use_optimized=False
)

print(f"\n✅ Trained {len(trained_models)} baseline models")

## 8. Evaluate Baseline Models

In [None]:
# Initialize evaluator
evaluator = ModelEvaluator()

# Get predictions from all models
predictions_train = trainer.predict_all(X_train_selected)
predictions_test = trainer.predict_all(X_test_selected)

# Evaluate on train set
train_results = evaluator.evaluate_all_models(predictions_train, y_train, dataset='train')

# Evaluate on test set
test_results = evaluator.evaluate_all_models(predictions_test, y_test, dataset='test')

print("\n" + "="*80)
print("BASELINE MODEL RESULTS (TEST SET)")
print("="*80)
print(test_results[['model', 'r2', 'rmse', 'mae', 'mape']].to_string(index=False))
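
For reference, the four metrics in the table can be computed directly from predictions. This sketch uses the standard definitions with NumPy; whether `ModelEvaluator` reports MAPE as a percentage or as a fraction is an assumption here (a percentage is used), and `regression_metrics` is a hypothetical helper.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute R², RMSE, MAE and MAPE (in percent) from two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": np.sqrt(np.mean(residuals ** 2)),
        "mae": np.mean(np.abs(residuals)),
        # MAPE is undefined when a true value is zero
        "mape": np.mean(np.abs(residuals / y_true)) * 100,
    }


metrics = regression_metrics([10000, 20000, 30000], [11000, 19000, 33000])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two usually points to a few badly mispredicted listings.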

In [None]:
# Plot model comparison
evaluator.plot_model_comparison(
    test_results,
    metric='r2',
    save_path='../data/processed/baseline_model_comparison.png'
)

In [None]:
# Get best baseline model
best_baseline_name, best_baseline_metrics = evaluator.get_best_model(metric='r2', dataset='test')
print(f"\n🏆 Best baseline model: {best_baseline_name}")
print(f"   R² Score: {best_baseline_metrics['r2']:.4f}")
print(f"   RMSE: ${best_baseline_metrics['rmse']:,.2f}")
print(f"   MAE: ${best_baseline_metrics['mae']:,.2f}")

## 9. Visualize Best Baseline Model

In [None]:
# Get predictions from best model
best_predictions = predictions_test[best_baseline_name]

# Plot predictions vs actual
evaluator.plot_predictions_vs_actual(
    y_test, best_predictions,
    model_name=best_baseline_name,
    save_path=f'../data/processed/{best_baseline_name}_predictions.png'
)

In [None]:
# Plot residuals
evaluator.plot_residuals(
    y_test, best_predictions,
    model_name=best_baseline_name,
    save_path=f'../data/processed/{best_baseline_name}_residuals.png'
)

## 10. Hyperparameter Tuning

In [None]:
# Initialize tuner
tuner = HyperparameterTuner()

# Models to tune: a fixed shortlist of tree ensembles (adjust it to match the baseline ranking)
models_to_tune = ['random_forest', 'xgboost', 'lightgbm']

# Tune models using random search (faster than grid search)
print("\n⚙️ Starting hyperparameter tuning...")
print("This may take several minutes...\n")

tuned_results = tuner.tune_all_models(
    X_train_selected, y_train,
    models=models_to_tune,
    method='random',  # Use 'grid' for exhaustive search
    n_iter=20,  # Number of random combinations to try
    cv=3,  # 3-fold cross-validation for speed
    scoring='r2'
)

print(f"\n✅ Tuned {len(tuned_results)} models")
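
The arguments `method='random'`, `n_iter`, `cv`, and `scoring` correspond closely to scikit-learn's `RandomizedSearchCV`. Whether `HyperparameterTuner` wraps that class is an assumption; the sketch below shows the equivalent call on synthetic data, with a small hypothetical search space and fewer iterations so that it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for X_train_selected / y_train
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Hypothetical search space; the project's real grids are not shown in this notebook
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter combinations
    cv=3,              # 3-fold cross-validation on the training data
    scoring="r2",
    random_state=0,    # makes the sampled combinations reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated R2: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` is refit on all the data passed to `fit`, so it can be used for prediction directly.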

In [None]:
# Display best parameters
print("\n" + "="*80)
print("BEST HYPERPARAMETERS")
print("="*80)

for model_name in models_to_tune:
    params = tuner.get_best_params(model_name)
    if params:
        print(f"\n{model_name.upper()}:")
        for param, value in params.items():
            print(f"  {param}: {value}")

## 11. Evaluate Tuned Models

In [None]:
# Get predictions from tuned models
tuned_predictions_test = {}

for model_name, (model, params) in tuned_results.items():
    tuned_predictions_test[f"{model_name}_tuned"] = model.predict(X_test_selected)

# Evaluate tuned models
tuned_test_results = evaluator.evaluate_all_models(
    tuned_predictions_test, y_test, dataset='test'
)

print("\n" + "="*80)
print("TUNED MODEL RESULTS (TEST SET)")
print("="*80)
print(tuned_test_results[['model', 'r2', 'rmse', 'mae', 'mape']].to_string(index=False))

In [None]:
# Compare baseline vs tuned
comparison_df = pd.concat([test_results, tuned_test_results], ignore_index=True)
comparison_df = comparison_df.sort_values('r2', ascending=False)

print("\n" + "="*80)
print("ALL MODELS COMPARISON (BASELINE + TUNED)")
print("="*80)
print(comparison_df[['model', 'r2', 'rmse', 'mae']].to_string(index=False))

## 12. Select Final Best Model

In [None]:
# Get overall best model
best_model_name = comparison_df.iloc[0]['model']
best_r2 = comparison_df.iloc[0]['r2']
best_rmse = comparison_df.iloc[0]['rmse']
best_mae = comparison_df.iloc[0]['mae']

print("\n" + "="*80)
print("🏆 FINAL BEST MODEL")
print("="*80)
print(f"\nModel: {best_model_name}")
print(f"\nPerformance Metrics:")
print(f"  R² Score: {best_r2:.4f}")
print(f"  RMSE: ${best_rmse:,.2f}")
print(f"  MAE: ${best_mae:,.2f}")
print(f"\nInterpretation:")
print(f"  - The model explains {best_r2*100:.2f}% of the variance in test-set car prices")
print(f"  - Mean absolute prediction error: ${best_mae:,.2f}")
print(f"  - Root mean squared error: ${best_rmse:,.2f}")
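
The ranking above uses test-set R² both to pick the winner and to report its performance, which tends to make the reported score slightly optimistic. One common alternative is to choose the model by cross-validation on the training data and to touch the test set only once, for the final estimate. The sketch below illustrates that pattern on synthetic data with two stand-in models; it is not the notebook's selection logic.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Stand-in candidates; in the notebook these would be the baseline and tuned models
candidates = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Model selection uses only the training data
cv_scores = {
    name: cross_val_score(model, X_tr, y_tr, cv=3, scoring="r2").mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# The test set is used once, to estimate how the chosen model generalizes
final_model = candidates[best_name].fit(X_tr, y_tr)
test_r2 = final_model.score(X_te, y_te)

print(cv_scores)
print(f"Selected: {best_name}, test R2: {test_r2:.3f}")
```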

## 13. Save Everything

In [None]:
# Create models directory
models_dir = Path('../models')
models_dir.mkdir(exist_ok=True)

# Save preprocessor
import pickle
with open(models_dir / 'preprocessor.pkl', 'wb') as f:
    pickle.dump(preprocessor, f)
print("✅ Saved: preprocessor.pkl")

# Save feature engineer
with open(models_dir / 'feature_engineer.pkl', 'wb') as f:
    pickle.dump(feature_engineer, f)
print("✅ Saved: feature_engineer.pkl")

# Save feature selector
with open(models_dir / 'feature_selector.pkl', 'wb') as f:
    pickle.dump(feature_selector, f)
print("✅ Saved: feature_selector.pkl")

# Save all trained models
trainer.save_all_models(models_dir)

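# Save tuned models
for model_name, (model, params) in tuned_results.items():
    with open(models_dir / f"{model_name}_tuned.pkl", 'wb') as f:
        pickle.dump(model, f)
    print(f"✅ Saved: {model_name}_tuned.pkl")

# Save the overall best model under the name that the production example loads
if best_model_name.endswith('_tuned'):
    best_model = tuned_results[best_model_name[:-len('_tuned')]][0]
else:
    best_model = trainer.get_model(best_model_name)
with open(models_dir / 'best_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
print("✅ Saved: best_model.pkl")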

# Save best parameters
tuner.save_best_params(models_dir / 'best_params.pkl')

# Save evaluation results
comparison_df.to_csv(models_dir / 'model_comparison.csv', index=False)
print("✅ Saved: model_comparison.csv")

# Save selected features
selected_features = feature_selector.get_selected_features()
pd.DataFrame({'feature': selected_features}).to_csv(
    models_dir / 'selected_features.csv', index=False
)
print("✅ Saved: selected_features.csv")

print(f"\n✅ All artifacts saved to: {models_dir.absolute()}")
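
Before relying on pickled artifacts, it is worth checking that they reload and reproduce the same predictions. The sketch below does this round trip for a small stand-in model in a temporary directory; the model and data are illustrative only.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model; in the notebook this would be the selected best model
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"

    # Write the model to disk, then read it back as a separate object
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(f"Round trip reproduces predictions: {same_predictions}")
```

Pickle files can execute arbitrary code when loaded, so only load artifacts from a trusted location, and load them with the same library versions that were used to save them.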

## 14. Generate Final Report

In [None]:
# Generate evaluation report
report = evaluator.generate_evaluation_report(
    comparison_df,
    save_path='../models/evaluation_report.txt'
)

print(report)

## 15. Example: Using the Model for Predictions

In [None]:
# Example of how to use the saved model for new predictions
print("\n" + "="*80)
print("EXAMPLE: MAKING PREDICTIONS ON NEW DATA")
print("="*80)

# Take a sample from the test set
sample_data = X_test.head(5)
sample_actual = y_test.head(5)

# X_test was split from df_engineered, so it is already preprocessed and
# feature-engineered; only feature selection still has to be applied
sample_selected = feature_selector.transform(sample_data)

# Make predictions with best model
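# (In production, load the saved model)
tuned_suffix = '_tuned'
if best_model_name.endswith(tuned_suffix):
    # Strip only the trailing suffix to recover the key used in tuned_results
    base_name = best_model_name[:-len(tuned_suffix)]
    best_model = tuned_results[base_name][0]
else:
    best_model = trainer.get_model(best_model_name)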

sample_predictions = best_model.predict(sample_selected)

# Display results
results_df = pd.DataFrame({
    'Actual Price': sample_actual.values,
    'Predicted Price': sample_predictions,
    'Error': sample_predictions - sample_actual.values,
    'Error %': ((sample_predictions - sample_actual.values) / sample_actual.values * 100)
})

print("\nSample Predictions:")
print(results_df.to_string(index=False))

print("\n" + "="*80)
print("✅ PIPELINE COMPLETED SUCCESSFULLY!")
print("="*80)

## Summary

### What We Accomplished:

1. ✅ **Preprocessed** 400K+ rows of raw data
2. ✅ **Engineered** new features to improve predictions
3. ✅ **Selected** the most important features
4. ✅ **Trained** multiple baseline models
5. ✅ **Evaluated** and compared all models
6. ✅ **Tuned** hyperparameters for best models
7. ✅ **Selected** the best performing model
8. ✅ **Saved** all artifacts for production use

### Next Steps:

1. **Deploy the model** to a production environment
2. **Monitor performance** on new data
3. **Retrain periodically** with new data
4. **A/B test** different models
5. **Collect feedback** and iterate

### How to Use in Production:

```python
import pickle

# Load all components
with open('../models/preprocessor.pkl', 'rb') as f:
    preprocessor = pickle.load(f)

with open('../models/feature_engineer.pkl', 'rb') as f:
    feature_engineer = pickle.load(f)

with open('../models/feature_selector.pkl', 'rb') as f:
    feature_selector = pickle.load(f)

with open('../models/best_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Process new data
# df_new: a DataFrame of new listings with the same raw columns as the training CSV
df_new_processed = preprocessor.transform(df_new)
df_new_engineered = feature_engineer.transform(df_new_processed)
df_new_selected = feature_selector.transform(df_new_engineered)

# Make predictions
predictions = model.predict(df_new_selected)
```