# Used Car Price Prediction: From Regression to Decision Support

### 1. Project Overview
+ This project builds a regression-based pricing model to estimate the market value of used vehicles based on observable characteristics such as mileage, 
vehicle age, engine specifications, brand, and model.
+ Beyond predictive accuracy, the project emphasizes interpretability, robustness, and decision usability, 
transforming a regression model into a pricing support system with confidence-aware actions.


### 2. Business Problem
Accurately pricing used vehicles is challenging due to:

+ Large variability across brands and models

+ Non-linear depreciation effects

+ Skewed price distributions

+ Sparse data for rare or luxury vehicles

A purely metric-driven model risks producing unreliable estimates when applied blindly. 
The goal is therefore not only to predict prices, but to:

+ Identify key drivers of value

+ Understand when predictions are reliable

+ Define when human review is required


### 3. Dataset Description
The dataset contains real-world used car listings with:

**Target Variable**

+ Price – final market price of the vehicle

**Numerical Features**

+ Mileage

+ Engine volume

+ Year of manufacture

**Categorical Features**

+ Brand

+ Model

+ Body type

+ Engine type

+ Registration status

**The dataset exhibits:**

+ Missing values

+ Strong right-skew in prices

+ High-cardinality categorical variables (e.g. Model)


### 4. Data Preparation and Feature Engineering
Key preprocessing steps included:

+ Handling missing values via imputation

+ Percentile-based outlier trimming to reduce leverage effects

+ One-hot encoding of categorical variables to preserve identity-based pricing signals

+ Feature scaling of numeric variables to stabilize optimization
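As an illustration of the percentile-based trimming step, here is a minimal sketch using pandas. The helper name `trim_by_percentile`, the `Price` column name, and the cut-off percentiles are assumptions made for this example; the project's actual thresholds are not stated here.

```python
import pandas as pd


def trim_by_percentile(df, column, lower=0.01, upper=0.99):
    """Keep rows whose `column` value lies inside the [lower, upper] quantile range."""
    low_value, high_value = df[column].quantile([lower, upper])
    return df[df[column].between(low_value, high_value)]


# Tiny synthetic sample (not real listings) with one extreme price.
listings = pd.DataFrame({"Price": [4_000, 5_500, 7_000, 8_200, 9_900, 250_000]})

# Hypothetical cut-offs chosen so the effect is visible on six rows.
trimmed = trim_by_percentile(listings, "Price", lower=0.0, upper=0.9)
```

Trimming of this kind is usually applied to the training split only, so that the test set still reflects the full range of listings the model may encounter.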

All preprocessing was implemented using scikit-learn Pipelines and ColumnTransformers to ensure:

+ Consistent feature construction

+ No data leakage between training and testing
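The preprocessing steps above could be wired together roughly as follows. This is a sketch under assumptions, not the project's exact code: the column names, the imputation strategies, and the synthetic sample are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column names; adjust to the actual dataset schema.
NUMERIC_FEATURES = ["Mileage", "EngineVolume", "Year"]
CATEGORICAL_FEATURES = ["Brand", "Model", "BodyType", "EngineType", "Registration"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent label, then one-hot encode.
# handle_unknown="ignore" avoids errors on categories unseen during fitting.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_steps, NUMERIC_FEATURES),
    ("cat", categorical_steps, CATEGORICAL_FEATURES),
])

# Tiny synthetic sample (not real listings) with one missing mileage value.
sample = pd.DataFrame({
    "Mileage": [120.0, np.nan, 60.0, 200.0],
    "EngineVolume": [2.0, 1.6, 3.0, 2.0],
    "Year": [2010, 2015, 2018, 2005],
    "Brand": ["BMW", "Toyota", "BMW", "Audi"],
    "Model": ["320", "Corolla", "X5", "A4"],
    "BodyType": ["sedan", "sedan", "crossover", "sedan"],
    "EngineType": ["Petrol", "Petrol", "Diesel", "Diesel"],
    "Registration": ["yes", "yes", "no", "yes"],
})
features = preprocessor.fit_transform(sample)
```

Because imputation values, scaling statistics, and category lists are all learned inside `fit`, calling `fit` on the training split and only `transform` on the test split keeps test-set information out of the features.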

### 5. Modeling Approach

Three regression models were evaluated:

**1. Linear Regression (Baseline)**

Provided a reasonable baseline but showed sensitivity to outliers and high-priced vehicles.

**2. Ridge Regression**

Introduced regularization to control coefficient magnitude in the presence of many correlated one-hot encoded features, 
improving stability and generalization.

**3. Ridge Regression on log(Price)**

Applied a logarithmic transformation to the target variable to:

+ Address price skewness

+ Stabilize variance

+ Model proportional price effects rather than absolute differences

This final model achieved the best balance of accuracy, robustness, and realism.
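One way to fit Ridge on log(Price) while still getting predictions in price units is scikit-learn's `TransformedTargetRegressor`. The sketch below is illustrative only: it assumes this wrapper rather than a manual transform, uses a placeholder `alpha`, and runs on synthetic data in which price falls by 10% per year of age.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Ridge is fitted on log(price); predictions are mapped back with exp.
log_ridge = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),  # alpha is a placeholder, not a tuned value
    func=np.log,
    inverse_func=np.exp,
)

# Synthetic data (not real listings): price drops 10% per year of age.
age = np.arange(1, 11, dtype=float).reshape(-1, 1)
price = 20_000 * 0.9 ** age.ravel()

log_ridge.fit(age, price)
predicted = log_ridge.predict(age)  # back on the original price scale

# In a log-target model a coefficient b implies a proportional effect of exp(b) - 1.
age_coefficient = np.ravel(log_ridge.regressor_.coef_)[0]
pct_change_per_year = np.expm1(age_coefficient)
```

On this synthetic data `pct_change_per_year` comes out close to -10%, slightly shrunk toward zero by the Ridge penalty. This is what modeling proportional rather than absolute price effects means in practice.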

### 6. Model Performance 

Rather than focusing on metrics alone, performance was interpreted in business terms:

+ The final model explains a large proportion of price variation using observable features

+ Typical prediction errors are small enough for valuation guidance

+ Larger errors are concentrated in rare or high-end vehicles

This aligns with real market behavior and highlights the importance of context-aware model usage.
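A sketch of how performance might be summarized in business terms on the original price scale. The function name, the choice of metrics, the 10% tolerance, and the sample numbers are assumptions for illustration; they are not results from this project.

```python
import numpy as np
from sklearn.metrics import r2_score


def summarize_errors(actual, predicted, tolerance=0.10):
    """Summarize prediction quality using percentage errors on the price scale."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    pct_error = np.abs(predicted - actual) / actual
    return {
        "r2": r2_score(actual, predicted),
        "median_abs_pct_error": float(np.median(pct_error)),
        "share_within_tolerance": float(np.mean(pct_error <= tolerance)),
    }


# Made-up prices: three mid-range cars predicted well, one expensive car missed.
actual = [5_000, 8_000, 12_000, 30_000]
predicted = [5_250, 7_600, 12_600, 24_000]
summary = summarize_errors(actual, predicted)
```

A median percentage error and a "share within tolerance" figure tend to be easier to discuss with pricing staff than a squared-error metric, since they map directly onto how far off a typical quote would be.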

### 7. Key Business Insights

**What Drives Vehicle Price?**

+ Vehicle age and mileage strongly influence depreciation

+ Brand and model identity produce large price shifts

+ Registration status introduces discrete price effects

Performance improvements after adding categorical features confirm that identity-based attributes, not just wear-and-age variables, are critical pricing drivers.

**When to Trust the Model vs Be Cautious**

+ Predictions are most reliable for common, mid-range vehicles

+ Uncertainty increases for rare models and extreme price ranges

This insight is supported by error concentration analysis on the test set.
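An error concentration analysis of this kind could look roughly like the following. The column names, the rarity threshold of three listings, and the data are hypothetical; in practice rarity would more likely be counted on the training set, with errors measured on the test set.

```python
import numpy as np
import pandas as pd

# Made-up test-set results (not real listings).
results = pd.DataFrame({
    "Model": ["Corolla", "Corolla", "Corolla", "Golf", "Golf", "Golf", "Phaeton"],
    "actual": [9_000, 9_500, 10_000, 11_000, 12_000, 11_500, 40_000],
    "predicted": [9_300, 9_200, 10_400, 10_600, 12_500, 11_700, 29_000],
})

# Absolute percentage error per listing.
results["abs_pct_error"] = (
    (results["predicted"] - results["actual"]).abs() / results["actual"]
)

# Label each listing by how often its model appears (threshold is an assumption).
listings_per_model = results["Model"].map(results["Model"].value_counts())
results["segment"] = np.where(listings_per_model < 3, "rare", "common")

# Compare error levels between segments.
error_by_segment = results.groupby("segment")["abs_pct_error"].agg(["count", "median"])
```

The same grouping can be repeated over price bands (for example with `pd.qcut` on the actual price) to check whether errors also grow toward the extremes of the price range.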

**Model Limitations**

+ Sparse data for rare models limits accuracy in those segments

+ Linear structure (even with transformation) cannot capture all market nuances

+ Predictions should not be treated as absolute ground truth

Acknowledging these limitations is essential for responsible use.

### 8. From Prediction to Decision: Action Thresholds

To translate predictions into actionable decisions, confidence tiers were introduced based on:

+ Historical prediction error

+ Vehicle price segment

+ Rarity of the model

**Confidence Tiers**

| Tier | Interpretation | Recommended Action |
|------|----------------|--------------------|
| High Confidence | Low historical error, common vehicles | Automatic pricing guidance |
| Medium Confidence | Moderate uncertainty | Use as reference with review |
| Low Confidence | High uncertainty, rare or expensive vehicles | Manual expert review |

This framework ensures the model supports decisions without overstepping its reliability.
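To make the tiering concrete, here is a minimal sketch of how the three inputs could be combined into a tier. The function name and every threshold (error cut-offs, minimum listing count, price ceiling) are hypothetical placeholders; the actual rules used in this project are not specified here.

```python
RECOMMENDED_ACTION = {
    "High Confidence": "Automatic pricing guidance",
    "Medium Confidence": "Use as reference with review",
    "Low Confidence": "Manual expert review",
}


def assign_confidence_tier(
    segment_median_error,   # historical median abs. percentage error for the segment
    model_listing_count,    # number of training listings for this vehicle model
    predicted_price,        # model output on the original price scale
    low_error=0.10,         # hypothetical thresholds below; tune on held-out data
    high_error=0.25,
    min_listings=30,
    high_price=50_000,
):
    """Map error history, rarity, and price segment to a confidence tier."""
    is_rare = model_listing_count < min_listings
    is_expensive = predicted_price > high_price
    if segment_median_error > high_error or is_rare or is_expensive:
        return "Low Confidence"
    if segment_median_error > low_error:
        return "Medium Confidence"
    return "High Confidence"


tier = assign_confidence_tier(0.05, 200, 12_000)
action = RECOMMENDED_ACTION[tier]
```

Checking the low-confidence conditions first means that a rare or expensive vehicle is routed to manual review even when its segment's historical error happens to look small.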

### 9. Why This Model Was Chosen

The final Ridge regression model with a log-transformed target was selected because it:

+ Handles skewed price distributions effectively

+ Controls coefficient instability in high-dimensional feature space

+ Generalizes well to unseen data

+ Produces realistic, interpretable predictions

Most importantly, it integrates naturally into a risk-aware decision workflow.

### 10. Practical Applications

This system can be used for:

+ Used-car pricing guidance

+ Detection of under- or over-valued listings

+ Inventory valuation and screening

+ Supporting negotiation and appraisal processes

Predictions are accompanied by confidence tiers to guide appropriate action.

### Final Note

This project reflects an end-to-end regression workflow that prioritizes business relevance, transparency, 
and responsible use, not just predictive performance.