# Used Car Price Prediction: From Regression to Decision Support

### 1. Project Overview

This project builds a regression-based pricing model to estimate the market value of used vehicles based on observable characteristics such as mileage, vehicle age, engine specifications, brand, and model.

Beyond predictive accuracy, the project emphasizes interpretability, robustness, and decision usability, transforming a regression model into a pricing support system with confidence-aware actions.

#### Aim: 
+ Predict used-car price from mileage, engine size, year, and vehicle attributes (body type, fuel type, registration status, model). 
+ This is a supervised regression task. 
+ Success is measured by generalization performance on a held-out test set.

In [59]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error


In [60]:
df = pd.read_csv('1.04.+Real-life+example.csv')
df.head()

,Brand,Price,Body,Mileage,EngineV,Engine Type,Registration,Year,Model
0,BMW,4200.0,sedan,277,2.0,Petrol,yes,1991,320
1,Mercedes-Benz,7900.0,van,427,2.9,Diesel,yes,1999,Sprinter 212
2,Mercedes-Benz,13300.0,sedan,358,5.0,Gas,yes,2003,S 500
3,Audi,23000.0,crossover,240,4.2,Petrol,yes,2007,Q7
4,Toyota,18300.0,crossover,120,2.0,Petrol,yes,2011,Rav 4


In [61]:
print("Shape:", df.shape)
print(df.dtypes)

missing = df.isna().sum().sort_values(ascending=False)
missing[missing > 0]


Shape: (4345, 9)
Brand            object
Price           float64
Body             object
Mileage           int64
EngineV         float64
Engine Type      object
Registration     object
Year              int64
Model            object
dtype: object


Price      172
EngineV    150
dtype: int64

In [62]:
y = df["Price"]
X = df.drop(columns=["Price"])


In [63]:
numeric_features = ["Mileage", "EngineV", "Year"]
categorical_features = ["Brand", "Body", "Engine Type", "Registration", "Model"]


In [64]:
### Trim extreme outliers using percentile thresholds to reduce leverage.

In [65]:
df2 = df.copy()

# Keep only rows where target exists (we must have y for training)
df2 = df2.dropna(subset=["Price"])

# Percentile-based trimming (optional but common for this dataset)
price_hi = df2["Price"].quantile(0.99)
mileage_hi = df2["Mileage"].quantile(0.99)
year_lo = df2["Year"].quantile(0.01)

df2 = df2[
    (df2["Price"] <= price_hi) &
    (df2["Mileage"] <= mileage_hi) &
    (df2["Year"] >= year_lo)
].copy()

# Re-split X/y after trimming
y = df2["Price"]
X = df2.drop(columns=["Price"])


In [66]:
### Split data into train/test before fitting any preprocessing to avoid leakage.

In [67]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
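
The percentile trimming earlier in the notebook computes its thresholds on all rows, before this split, so test rows influence the cutoffs. A minimal sketch of a split-first variant follows. It assumes the same column names and uses a synthetic stand-in DataFrame (`toy`) rather than the real car data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car data (illustrative values only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Price": rng.lognormal(mean=9.5, sigma=0.8, size=500),
    "Mileage": rng.integers(0, 400, size=500),
    "Year": rng.integers(1985, 2017, size=500),
})

# Split first, so the test rows play no part in choosing the cutoffs
train_df, test_df = train_test_split(toy, test_size=0.2, random_state=42)

# Thresholds come from the training rows only
price_hi = train_df["Price"].quantile(0.99)
mileage_hi = train_df["Mileage"].quantile(0.99)
year_lo = train_df["Year"].quantile(0.01)

train_trimmed = train_df[
    (train_df["Price"] <= price_hi)
    & (train_df["Mileage"] <= mileage_hi)
    & (train_df["Year"] >= year_lo)
]

print(len(train_df), "->", len(train_trimmed), "training rows; test rows kept:", len(test_df))
```

Whether to trim the test set at all is a separate choice. Leaving it untrimmed reports performance on the full range of listings, including the extremes.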


In [68]:
### Use a ColumnTransformer pipeline so numeric scaling + categorical encoding are applied consistently to train and test.

In [69]:
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
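
The `handle_unknown="ignore"` setting matters for a high-cardinality column such as `Model`, because some test-set models may never appear in training. A small sketch of what the encoder does in that case, using made-up brand values rather than the real data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on three known categories (illustrative values only)
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["BMW"], ["Audi"], ["Toyota"]], dtype=object))

# A known category gets a single 1 in its column (columns are sorted: Audi, BMW, Toyota)
known = enc.transform(np.array([["BMW"]], dtype=object)).toarray()

# An unseen category is encoded as all zeros instead of raising an error
unseen = enc.transform(np.array([["Tesla"]], dtype=object)).toarray()

print(known)
print(unseen)
```

An all-zero row means the model falls back on the remaining features for that vehicle, which is one reason predictions for unseen models deserve extra caution.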


In [70]:
### Train a baseline linear regression model.

In [71]:
lin_model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LinearRegression())
])

lin_model.fit(X_train, y_train)

pred_train = lin_model.predict(X_train)
pred_test = lin_model.predict(X_test)


In [72]:
### Evaluate using R² (explained variance) and MAE/RMSE (error magnitude)

In [73]:
def regression_metrics(y_true, y_pred):
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # avoids squared=False, which newer scikit-learn releases no longer accept
    return r2, mae, rmse

train_r2, train_mae, train_rmse = regression_metrics(y_train, pred_train)
test_r2, test_mae, test_rmse = regression_metrics(y_test, pred_test)

print("TRAIN -> R2:", train_r2, "MAE:", train_mae, "RMSE:", train_rmse)
print("TEST  -> R2:", test_r2,  "MAE:", test_mae,  "RMSE:", test_rmse)


TRAIN -> R2: 0.8320271424173001 MAE: 4650.991002924482 RMSE: 7508.8934286344465
TEST  -> R2: 0.7759216446205555 MAE: 5776.488972622222 RMSE: 9999.10377736539
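
For reference, the three metrics can be reproduced by hand with NumPy. The sketch below uses three made-up prices, not values from the dataset, to show how each number is derived from the residuals.

```python
import numpy as np

# Made-up true and predicted prices (illustrative values only)
y_true = np.array([10000.0, 20000.0, 30000.0])
y_pred = np.array([12000.0, 19000.0, 27000.0])

resid = y_true - y_pred

# MAE: average size of the miss, in price units
mae = np.mean(np.abs(resid))

# RMSE: like MAE, but squaring penalizes large misses more heavily
rmse = np.sqrt(np.mean(resid ** 2))

# R²: 1 minus the ratio of residual variation to total variation around the mean
r2 = 1 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```

An RMSE well above the MAE, as in the results above, suggests that a minority of predictions miss by a large amount.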




In [74]:
### Check whether predictions collapse to a narrow range (flat-line symptom)

In [75]:
def pred_summary(name, preds):
    preds = np.asarray(preds)
    print(f"{name}: mean={preds.mean():.2f}, std={preds.std():.2f}, min={preds.min():.2f}, max={preds.max():.2f}")

pred_summary("Pred TRAIN", pred_train)
pred_summary("Pred TEST", pred_test)


Pred TRAIN: mean=17712.27, std=16711.88, min=-18378.12, max=116614.71
Pred TEST: mean=18567.77, std=18438.25, min=-22143.02, max=110982.76
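
The summary rules out a flat-line collapse, but the negative minimums show that a linear model on raw price can predict values below zero. A quick check along these lines can quantify that. The array below is made up for illustration, apart from reusing the reported training minimum and maximum.

```python
import numpy as np

# Illustrative predictions: first and last values are the reported train min/max,
# the middle two are made up
preds = np.array([-18378.12, 4200.0, 13300.0, 116614.71])

# Share of predictions that are not valid prices
share_negative = (preds < 0).mean()

# Clipping at zero removes the symptom but not its cause
clipped = np.clip(preds, 0, None)

print(f"negative share: {share_negative:.2%}, min after clipping: {clipped.min():.2f}")
```

Modeling log(price), as done later in the notebook, avoids the problem by construction, since exponentiating any prediction yields a positive price.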


In [76]:
### Use Ridge regression to reduce coefficient variance and test whether it improves generalization

In [77]:
ridge_model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", Ridge(alpha=10.0, random_state=42))
])

ridge_model.fit(X_train, y_train)

pred_train_r = ridge_model.predict(X_train)
pred_test_r = ridge_model.predict(X_test)

print("RIDGE TRAIN:", regression_metrics(y_train, pred_train_r))
print("RIDGE TEST :", regression_metrics(y_test, pred_test_r))


RIDGE TRAIN: (0.7638172868963666, 5689.8405153444455, 8903.903384658788)
RIDGE TEST : (0.7392570929377476, 6539.932940283585, 10786.172790714727)




In [78]:
### Model log(Price) to handle right-skew and stabilize variance.

In [79]:
y_train_log = np.log(y_train)
y_test_log = np.log(y_test)

ridge_log = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", Ridge(alpha=10.0, random_state=42))
])

ridge_log.fit(X_train, y_train_log)

pred_test_log = ridge_log.predict(X_test)
pred_test_price = np.exp(pred_test_log)

print("TEST on original price scale:")
print(regression_metrics(y_test, pred_test_price))


TEST on original price scale:
(0.8316284082997705, 3935.1114545620662, 8667.533530267438)
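
The manual `np.log` / `np.exp` steps above can also be wrapped in scikit-learn's `TransformedTargetRegressor`, which applies the transform during `fit` and inverts it during `predict`. The sketch below runs on synthetic data (`X_toy`, `y_toy` are assumptions, not the car dataset). In the notebook, the regressor could be the existing preprocessing-plus-Ridge pipeline.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Synthetic features and a strictly positive, right-skewed target (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.exp(9.0 + X_toy @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=200))

# fit() trains Ridge on log(y); predict() returns exp(prediction), i.e. the price scale
log_ridge = TransformedTargetRegressor(
    regressor=Ridge(alpha=10.0),
    func=np.log,
    inverse_func=np.exp,
)
log_ridge.fit(X_toy, y_toy)

pred_price = log_ridge.predict(X_toy)
print(pred_price[:3])
```

Keeping the transform inside the estimator means the same object can be passed to cross-validation or saved for deployment without a separate back-transform step. Note that exponentiating a log-scale prediction estimates something closer to the median price than the mean, so it tends to sit below the average for skewed prices.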




### Model Comparison and Selection

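+ The baseline linear regression reached a test R² of 0.776 (MAE ≈ 5,776, RMSE ≈ 9,999). Its RMSE is well above its MAE, which points to a minority of large errors, and it produced negative price predictions for some vehicles (minimum test prediction ≈ -22,143).
+ Ridge regression (alpha = 10) shrinks the coefficients of the many correlated one-hot encoded features and narrows the train/test gap in R² (0.764 vs. 0.739), but at this untuned penalty it underfits: test R² fell to 0.739 and test MAE rose to about 6,540.
+ Applying a log transformation to the target improved performance by addressing price skew and heteroscedasticity: test R² rose to 0.832 and test MAE fell to about 3,935 on the original price scale.
+ The log-Ridge model had the highest test R² and the lowest test MAE and RMSE of the three models and was selected as the final model.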
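
Collecting the test metrics printed in the cells above into one table makes the comparison explicit. The numbers are copied (rounded) from the notebook output. The row labels are chosen here for readability.

```python
import pandas as pd

# Test-set metrics copied from the outputs above, rounded
comparison = pd.DataFrame(
    {
        "test_r2": [0.7759, 0.7393, 0.8316],
        "test_mae": [5776.49, 6539.93, 3935.11],
        "test_rmse": [9999.10, 10786.17, 8667.53],
    },
    index=["linear", "ridge (alpha=10)", "ridge on log(price)"],
)

# Pick the model with the lowest MAE and the one with the highest R²
best_by_mae = comparison["test_mae"].idxmin()
best_by_r2 = comparison["test_r2"].idxmax()

print(comparison)
print("Best by MAE:", best_by_mae, "| Best by R2:", best_by_r2)
```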

### Business Insights

+ The final model provides reasonably accurate price estimates for most vehicles, with typical errors small enough to support pricing guidance, valuation screening, or market analysis tasks.
+ The mean absolute error on the test set is about $3,935, so a typical estimate misses the listed price by roughly $4,000. The test RMSE of about $8,668 shows that some individual misses are much larger.
+ The model explains about 83% of the variation in test-set car prices (R² ≈ 0.83 on the original price scale) using observable vehicle characteristics.
+ Modeling the logarithm of price makes the model fit proportional differences rather than absolute dollar differences, which matches how market prices typically behave.

#### What factors matter most when pricing a car?

In [80]:
feature_names = (
    ridge_log.named_steps["preprocess"]
    .get_feature_names_out()
)

coefficients = ridge_log.named_steps["model"].coef_

coef_df = pd.DataFrame({
    "feature": feature_names,
    "coefficient": coefficients
}).sort_values(by="coefficient", key=np.abs, ascending=False)

coef_df.head(10)


,feature,coefficient
7,cat__Brand_Renault,-0.569128
2,num__Year,0.527029
294,cat__Model_Vito,-0.505707
20,cat__Registration_no,-0.442757
21,cat__Registration_yes,0.442757
184,cat__Model_Land Cruiser 200,0.434624
207,cat__Model_Multivan,0.428914
224,cat__Model_Polo,-0.425787
120,cat__Model_Caddy,-0.42468
5,cat__Brand_Mercedes-Benz,0.377138


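+ Of the ten largest coefficients by magnitude, seven are brand or model indicators. The other three are vehicle year and the two registration indicators, which mirror each other (-0.44 and +0.44). Categorical identity therefore has a strong influence on price predictions.
+ Mileage does not appear in the top ten. On this ranking, the model points to vehicle year, registration status, and brand/model identity as the largest contributors to price variation, consistent with real-world automotive market behavior. The ranking is only a rough guide, since coefficients on standardized numeric features and on one-hot indicators are not on a directly comparable scale.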
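
Because the model predicts log(price), each coefficient acts multiplicatively on the price scale: a coefficient `b` corresponds to a factor of `exp(b)`. The sketch below converts three coefficients from the table above into approximate percentage effects. It assumes all other features are held fixed, and for `num__Year` the effect is per one standard deviation of year, since numeric features are standardized.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table above (log-price scale)
coefs = pd.Series({
    "cat__Brand_Renault": -0.569128,
    "num__Year": 0.527029,
    "cat__Registration_no": -0.442757,
})

# exp(b) is the multiplicative effect on price; subtract 1 for a percentage change
multipliers = np.exp(coefs)
pct_change = (multipliers - 1) * 100

print(pct_change.round(1))
```

Read this way, the Renault indicator corresponds to a predicted price roughly 43% lower than the model's baseline for an otherwise identical vehicle.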

#### When should one trust this model, and when should one be cautious?

In [None]:
abs_errors = np.abs(y_test - pred_test_price)

error_df = X_test.copy()
error_df["true_price"] = y_test
error_df["abs_error"] = abs_errors

error_df.sort_values("abs_error", ascending=False).head(10)


+ The largest prediction errors occur predominantly among high-priced and less frequent vehicle types, 
indicating reduced reliability for rare market segments.
+ The model is most reliable for mainstream vehicles with sufficient historical data and less reliable
for rare or high-end vehicles where training data is sparse.

**Model limitations**

In [None]:

model_counts = df["Model"].value_counts()
rare_models = model_counts[model_counts < 10].index

error_df["is_rare_model"] = error_df["Model"].isin(rare_models)

error_df.groupby("is_rare_model")["abs_error"].mean()


+ Vehicles from rarely observed models (fewer than 10 listings) and high-priced vehicles tend to show higher prediction errors, because the model has few comparable examples to learn from.
+ These cases should be treated as higher-risk estimates rather than definitive prices.
+ The improvement observed after log-transforming the target indicates that price dynamics are not strictly linear in absolute terms, which limits the effectiveness of untransformed linear regression.
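
The rarity counts above are taken from the full `df`, which includes test rows and rows later removed by trimming. A variant that counts on training data only is sketched below with toy model names (`train_models` and `test_models` stand in for `X_train["Model"]` and `X_test["Model"]`). It also flags models that never appear in training, which a plain `isin(rare_models)` check would miss.

```python
import pandas as pd

# Toy stand-ins for X_train["Model"] and X_test["Model"] (illustrative values only)
train_models = pd.Series(["Vito"] * 12 + ["Polo"] * 15 + ["Q7"] * 3)
test_models = pd.Series(["Vito", "Q7", "Phaeton"])  # "Phaeton" never appears in training

MIN_COUNT = 10  # same cutoff as the notebook

# Count listings per model using training rows only
train_counts = train_models.value_counts()

# Models missing from training get a count of 0, so they are flagged as rare too
test_counts = test_models.map(train_counts).fillna(0)
is_rare = test_counts < MIN_COUNT

print(is_rare.tolist())
```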

### Action Thresholds

In [None]:
# Error-based thresholds
low_error_thresh = abs_errors.quantile(0.50)   # median error
high_error_thresh = abs_errors.quantile(0.80)  # high error


In [None]:
# Price-based thresholds (only high_price is used by the tier logic; low_price is kept for reference)
low_price = y_test.quantile(0.33)
high_price = y_test.quantile(0.67)


In [None]:
# Rarity threshold
model_counts = df["Model"].value_counts()
rare_models = model_counts[model_counts < 10].index


In [84]:
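results_df = error_df.assign(pred_price=pred_test_price)  # assumed: results_df is not defined in the cells shown
results_df["is_rare_model"] = results_df["Model"].isin(rare_models)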


In [85]:
# Action logic: tiers are assigned from the realized test error,
# so they describe past behavior rather than predict it
def assign_confidence(row):
    if (
        row["abs_error"] <= low_error_thresh
        and row["true_price"] <= high_price
        and not row["is_rare_model"]
    ):
        return "High confidence"
    if row["abs_error"] <= high_error_thresh:
        return "Medium confidence"
    return "Low confidence"


In [87]:
results_df["confidence_tier"] = results_df.apply(assign_confidence, axis=1)


In [88]:
results_df[["true_price", "pred_price", "abs_error", "is_rare_model", "confidence_tier"]].head()


,true_price,pred_price,abs_error,is_rare_model,confidence_tier
1034,14250.0,13482.316112,767.683888,True,Medium confidence
843,6300.0,8289.031773,1989.031773,False,Medium confidence
1621,2550.0,3424.927764,874.927764,True,Medium confidence
2183,6900.0,6844.342391,55.657609,False,High confidence
1174,12200.0,12908.829101,708.829101,False,High confidence


In [89]:
results_df.groupby("confidence_tier")["true_price"].agg(
    count="count",
    mean_price="mean",
    min_price="min",
    max_price="max"
)


confidence_tier,count,mean_price,min_price,max_price
High confidence,306,7648.408333,1300.0,17499.0
Low confidence,162,44423.60858,2250.0,125000.0
Medium confidence,343,16783.012157,1200.0,86000.0


In [90]:
confidence_percentages = (
    results_df["confidence_tier"]
    .value_counts(normalize=True)
    .mul(100)
    .round(2)
)

confidence_percentages


confidence_tier
Medium confidence    42.29
High confidence      37.73
Low confidence       19.98
Name: proportion, dtype: float64

In [91]:
final_confidence_summary = results_df.groupby("confidence_tier").agg(
    predictions=("confidence_tier", "count"),
    avg_abs_error=("abs_error", "mean"),
    avg_price=("true_price", "mean")
)

final_confidence_summary


confidence_tier,predictions,avg_abs_error,avg_price
High confidence,306,617.973401,7648.408333
Low confidence,162,14236.435032,44423.60858
Medium confidence,343,2029.075958,16783.012157


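+ The confidence tiers separate test predictions by their realized error. High-confidence predictions (37.7% of the test set) average about $618 of absolute error on common vehicles priced up to $17,499, while low-confidence predictions (20.0%) average about $14,236 and are concentrated among high-priced vehicles (mean price about $44,424).
+ Because the tiers are assigned using the true price and the realized error, they describe historical error behavior on the test set. Scoring a new listing requires a rule built only from information available at prediction time, such as the predicted price and model rarity.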
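
The tier logic above relies on `abs_error` and `true_price`, neither of which is known for a new listing. One possible prediction-time rule is sketched below. The function `tier_at_prediction_time`, the cutoff `HIGH_PRICE`, and the example listings are all hypothetical, chosen only to illustrate the idea.

```python
import pandas as pd

def tier_at_prediction_time(pred_price, is_rare_model, high_price):
    """Hypothetical rule: uses only the prediction and model rarity, never the true price."""
    if is_rare_model and pred_price > high_price:
        return "Low confidence"
    if is_rare_model or pred_price > high_price:
        return "Medium confidence"
    return "High confidence"

# Assumed cutoff, roughly the top price seen in the high-confidence tier above
HIGH_PRICE = 17500.0

# Made-up new listings with predicted prices and rarity flags
new_listings = pd.DataFrame({
    "pred_price": [6800.0, 13500.0, 52000.0],
    "is_rare_model": [False, True, True],
})
new_listings["confidence_tier"] = [
    tier_at_prediction_time(p, r, HIGH_PRICE)
    for p, r in zip(new_listings["pred_price"], new_listings["is_rare_model"])
]

print(new_listings)
```

A rule like this would still need to be checked against held-out errors, for example by confirming that average absolute error actually rises from the high tier to the low tier.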