# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.

In [None]:
#The objective is two‑fold:
#Predictive accuracy — Build and validate a regression model (e.g., regularised linear, tree‑based) that minimises error metrics such as MAE or RMSE.
#Interpretability — quantify each feature’s contribution to price using coefficients or feature‑importance scores, thereby isolating the key drivers that will inform inventory and pricing decisions.
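
As a small worked example of the two error metrics named above, the sketch below computes MAE and RMSE by hand with NumPy. The prices are made-up illustrative values, not figures from the dataset.

```python
import numpy as np

# Hypothetical actual and predicted prices in dollars (illustrative only)
y_true = np.array([5_000.0, 12_000.0, 20_000.0, 8_000.0])
y_pred = np.array([6_000.0, 11_000.0, 16_000.0, 8_500.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))          # average absolute miss, in dollars
rmse = np.sqrt(np.mean(errors ** 2))   # squares errors first, so large misses weigh more

print(f"MAE:  {mae:,.2f}")
print(f"RMSE: {rmse:,.2f}")
```

RMSE comes out larger than MAE here because the single $4,000 miss dominates the squared errors; a wide gap between the two metrics is a hint that a few listings are priced very badly.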

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [None]:
#1 - Load the data to understand its shape and schema (column types).

In [None]:
#2 - Check for missing values and compute numeric summary statistics (looking for implausible minimums and maximums).
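
A minimal sketch of step 2, assuming a tiny hypothetical frame `df` in place of the real listings table. The column names mirror the dataset, but the values are invented to show what an implausible minimum or maximum looks like.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the listings table
df = pd.DataFrame({
    "price": [0, 4_500, 12_000, 3_000_000_000],
    "odometer": [85_000.0, np.nan, 40_000.0, 120_000.0],
    "condition": ["good", np.nan, np.nan, "excellent"],
})

# Share of missing values per column, worst first
missing_share = df.isna().mean().sort_values(ascending=False)

# Min and max of each numeric column: a $0 or multi-billion-dollar price stands out
summary = df.describe().loc[["min", "max"]]

print(missing_share)
print(summary)
```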

In [None]:
#3 - Plot the distributions to reveal skew, outliers, and category imbalance.
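
Alongside the plots, skew and category imbalance can also be summarised numerically. The sketch below uses two made-up series, `prices` and `fuel`, as hypothetical stand-ins for the real columns.

```python
import pandas as pd

# Hypothetical values: one very expensive listing creates a long right tail
prices = pd.Series([3_000, 4_000, 5_000, 6_000, 90_000], name="price")
fuel = pd.Series(["gas"] * 8 + ["diesel", "electric"], name="fuel")

skewness = prices.skew()                      # > 0 means a right-skewed distribution
shares = fuel.value_counts(normalize=True)    # share of rows per category

print(f"price skewness: {skewness:.2f}")
print(shares)
```

A strongly positive skew in `price` is the usual motivation for modelling the logarithm of price rather than the raw value.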

In [None]:
#4 - Run a correlation and multicollinearity scan to see which numeric features carry linear signal, and flag redundant ones.
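
The correlation scan can be sketched as follows. The data is synthetic, and the 0.95 threshold is an assumed cut-off for "redundant", not a value taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.uniform(1, 20, 200)

# Synthetic numeric features: `year` is a pure re-expression of `age`
df = pd.DataFrame({
    "age": age,
    "year": 2025 - age,
    "odometer": age * 10_000 + rng.normal(0, 40_000, 200),
})

corr = df.corr()

# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # assumed cut-off for near-duplicate features
redundant = [
    (a, b)
    for a in upper.index
    for b in upper.columns
    if abs(upper.loc[a, b]) > threshold
]
print(redundant)
```

In this toy frame only the `age`/`year` pair is flagged, which is why a model would keep one of the two and drop the other.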

In [None]:
#5 - Check for duplicate listings and assess overall data quality.
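
A minimal sketch of the duplicate check, on a hypothetical four-row frame. It assumes that a reposted listing receives a new `id`, so duplicates are detected on the content columns instead; that assumption should be verified against the real data.

```python
import pandas as pd

# Hypothetical listings: rows 0 and 1 describe the same car under different ids
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "VIN": ["A1", "A1", "B2", None],
    "price": [9_000, 9_000, 15_000, 7_000],
    "odometer": [80_000, 80_000, 30_000, 120_000],
})

content_cols = ["VIN", "price", "odometer"]   # ignore the listing id
dup_mask = df.duplicated(subset=content_cols, keep="first")
deduped = df[~dup_mask]

print(f"duplicates found: {dup_mask.sum()}")
print(deduped)
```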

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn import set_config
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.metrics import (
    make_scorer, mean_squared_error, mean_absolute_error, r2_score
)
from sklearn.model_selection import (
    RandomizedSearchCV, KFold, cross_validate, train_test_split
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    OneHotEncoder, PolynomialFeatures, PowerTransformer, StandardScaler,
    FunctionTransformer
)

RANDOM_STATE = 42

In [None]:
auto = pd.read_csv('/content/vehicles.csv')

In [None]:
auto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [None]:
auto.head()

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
0,7222695916,prescott,6000,,,,,,,,,,,,,,,az
1,7218891961,fayetteville,11900,,,,,,,,,,,,,,,ar
2,7221797935,florida keys,21000,,,,,,,,,,,,,,,fl
3,7222270760,worcester / central MA,1500,,,,,,,,,,,,,,,ma
4,7210384030,greensboro,4900,,,,,,,,,,,,,,,nc


In [None]:
# 1. Drop unneeded columns
auto = auto.drop(columns=['id','VIN','size'])

In [None]:
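# 2. Drop implausible rows: very cheap or very expensive listings, out-of-range years, and very high mileage.
#    Rows with a missing year are dropped as well, because between() evaluates to False for NaN.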

auto = auto[
    auto["price"].between(500, 100_000) &          # kill freebies & typos
    auto["year"].between(1980, 2025, inclusive="left") &
    (
        auto["odometer"].between(0, 300_000) |     # sensible mileage
        auto["odometer"].isna()
    )
]

In [None]:
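# 3. Derive a vehicle age column, which is easier to interpret and process in the pipeline than the raw year.
CURRENT_YEAR = 2025
auto = auto.copy()                           # work on a copy of the filtered frame to avoid SettingWithCopyWarning
auto["age"] = CURRENT_YEAR - auto["year"]
auto.loc[auto["age"] < 0, "age"] = np.nan    # guard against model years in the future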


In [None]:
# -------------------------------------------------------------
# 4. Column buckets
# low_card_cat:  categorical columns with fewer than 10 labels
# high_card_cat: categorical columns with more than 10 labels
# -------------------------------------------------------------
numeric_features   = ["age", "odometer"]
low_card_cat       = ["fuel", "title_status", "transmission"]
high_card_cat      = [
    "manufacturer", "model", "condition", "cylinders", "drive",
    "type", "paint_color", "state", "region"
]
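
Rather than assigning the buckets by hand, the label counts can be measured directly. The sketch below does this on a hypothetical two-column frame; treating exactly 10 labels as low cardinality is an assumption, since the comments above leave that boundary open.

```python
import pandas as pd

# Hypothetical categorical columns: 3 distinct fuels, 20 distinct states
df = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas", "electric"] * 5,
    "state": [f"s{i}" for i in range(20)],
})

threshold = 10                      # assumed boundary between the two buckets
cardinality = df.nunique()          # number of distinct labels per column

low_card = cardinality[cardinality <= threshold].index.tolist()
high_card = cardinality[cardinality > threshold].index.tolist()

print(cardinality)
print("low:", low_card, "high:", high_card)
```

Running the same `nunique()` check on the real `X_train` would confirm whether each column sits in the intended bucket.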


In [None]:
# -------------------------------------------------------------
# 5. Train-test split FIRST (prevents leakage in imputation / scaling)
# keep 'age', drop raw year
# -------------------------------------------------------------
y_raw = auto["price"].copy()
X_raw = auto.drop(columns=["price", "year"])

X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.20, random_state=RANDOM_STATE
)
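
The reason for splitting first can be illustrated with a toy imputation: statistics such as the median should be learned from the training rows only and then reused on the test rows. The arrays below are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical odometer-like feature; the test set contains an extreme value
X_train_toy = np.array([[10.0], [20.0], [np.nan], [30.0]])
X_test_toy = np.array([[np.nan], [1000.0]])

# Fit on the training rows only, so the test rows cannot influence the median
imputer = SimpleImputer(strategy="median").fit(X_train_toy)
X_test_filled = imputer.transform(X_test_toy)

print(imputer.statistics_)   # median learned from the training rows
print(X_test_filled)
```

Had the imputer been fitted on all six rows, the extreme test value would have shifted the fill value, which is exactly the leakage the split is meant to prevent. Placing the imputer inside a `Pipeline`, as the notebook does later, gives the same guarantee within each cross-validation fold.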

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [None]:
# Column buckets for the pipeline: `age` (derived from `year`) and `odometer` are the only
# continuous, quantitative predictors; every remaining column is treated as categorical.
num_cols = ["age", "odometer"]
cat_cols = [c for c in X_train.columns if c not in num_cols]

In [None]:
skl_major, skl_minor = map(int, sklearn.__version__.split(".")[:2])

ohe = OneHotEncoder(handle_unknown="ignore")

numeric_pipe = Pipeline([
    ("imp",   SimpleImputer(strategy="median")),
    ("power", PowerTransformer()),                 # ≈ log / Yeo–Johnson
    ("scale", StandardScaler())
])

categorical_pipe = Pipeline([
    ("imp",  SimpleImputer(strategy="constant", fill_value="unknown")),
    ("ohe",  ohe)
])


preprocessor = ColumnTransformer([
    ("num", numeric_pipe, num_cols),
    ("cat", categorical_pipe, cat_cols)
])

In [None]:
models = {
    # 1️⃣ very quick baseline
    "DummyMean": (DummyRegressor(strategy="mean"), {}),

    # 2️⃣ linear models (regularised)
    "Ridge": (
              Ridge(random_state=42, solver="sag"),
              {"alpha": np.logspace(-2, 3, 8)}),
    #"Lasso": (Lasso(random_state=42, max_iter=20_000),
    #          {"alpha": np.logspace(-3, 1, 20)}),
    #"ElasticNet": (ElasticNet(random_state=42, max_iter=20_000),
     #              {"alpha":  np.logspace(-3, 1, 15),
     #               "l1_ratio": np.linspace(0.1, 0.9, 9)}),

}

In [None]:
## stabilise heavy-tailed prices by modelling log1p(price)

def rmse(y_true, y_pred):
    # `squared=False` is not accepted by recent scikit-learn releases, and a
    # failing scorer shows up as NaN in the results, so compute RMSE explicitly.
    return np.sqrt(mean_squared_error(y_true, y_pred))

cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
scoring = {
    "RMSE": make_scorer(rmse),
    "MAE" : make_scorer(mean_absolute_error),
    "R2"  : make_scorer(r2_score)
}

results = []
fitted_models = {}

for name, (est, grid) in models.items():
    # wrap regressor so the *target* is log-transformed and predictions are mapped back to dollars
    reg = TransformedTargetRegressor(regressor=est, func=np.log1p, inverse_func=np.expm1)

    pipe = Pipeline([
        ("prep", preprocessor),
        ("reg" , reg)
    ])

    search = RandomizedSearchCV(
        pipe,
        param_distributions={f"reg__regressor__{k}": v for k, v in grid.items()},
        n_iter=min(20, np.prod([len(v) for v in grid.values()]) or 1),
        cv=cv,
        scoring="neg_root_mean_squared_error",
        n_jobs=-1,
        random_state=RANDOM_STATE,
        verbose=0
    ) if grid else pipe   # DummyRegressor gets no tuning

    fitted = search.fit(X_train, y_train)
    fitted_models[name] = fitted

    cv_scores = cross_validate(
        fitted.best_estimator_ if hasattr(fitted, "best_estimator_") else fitted,
        X_train, y_train, cv=cv, scoring=scoring, n_jobs=-1
    )

    # make_scorer returns positive errors by default, so no sign flip is needed
    results.append({
        "Model": name,
        "CV RMSE": cv_scores["test_RMSE"].mean(),
        "CV MAE":  cv_scores["test_MAE"].mean(),
        "CV R²":   cv_scores["test_R2"].mean(),
        "Best params": getattr(fitted, "best_params_", "—")
    })

pd.DataFrame(results).sort_values("CV RMSE")

Unnamed: 0,Model,CV RMSE,CV MAE,CV R²,Best params
0,DummyMean,,11206.228405,-0.139346,—
1,Ridge,,3836.538368,0.779001,{'reg__regressor__alpha': 0.2682695795279726}


In [None]:
# ── 0.  retrieve the tuned Ridge pipeline stored in `fitted_models` by the CV loop ──
best_ridge = fitted_models["Ridge"].best_estimator_      # full Pipeline

# ── 1.  split pipeline into parts ─────────────────────────────────────────
prep  = best_ridge.named_steps["prep"]                   # ColumnTransformer
ridge = best_ridge.named_steps["reg"].regressor_         # Ridge in TTR

# ── 2.  get feature names exactly as seen by the regressor ────────────────
feature_names = prep.get_feature_names_out()             # e.g. 'num__age',
                                                         #      'cat__fuel_gas'

# ── 3.  assemble DataFrame of coefficients ────────────────────────────────
# The target is log1p(price), so every coefficient is in log-price units.
coef_df = (pd.DataFrame({"feature": feature_names,
                         "coef_std": ridge.coef_})
           .sort_values("coef_std", ascending=False)
           .reset_index(drop=True))

# ── 4.  (optional) undo the StandardScaler step for the numeric features ──
# The result is the change in log1p(price) per unit of the *power-transformed*
# feature, not dollars per year or per mile: both the PowerTransformer and the
# log target are non-linear. PowerTransformer already standardises its output
# by default, so the scale factors are close to 1.
num_scaler = prep.named_transformers_["num"].named_steps["scale"]
stds       = num_scaler.scale_                           # std devs in same order as num_cols

for col, sd in zip(num_cols, stds):
    mask = coef_df["feature"] == f"num__{col}"
    coef_df.loc[mask, "coef_per_unit"] = coef_df.loc[mask, "coef_std"] / sd

# ── 5.  show / export ─────────────────────────────────────────────────────
display(coef_df.head(20))         # top positive effects
display(coef_df.tail(20))         # top negative effects

# save to disk if you like
# coef_df.to_csv("ridge_coefficients.csv", index=False)

Unnamed: 0,feature,coef_std,coef_per_unit
0,cat__model_express 2500 4x4,2.742723,
1,cat__model_nsx,2.470581,
2,cat__model_230ge,2.367703,
3,cat__model_International *COMING SOON* 2006 AT...,2.262643,
4,cat__model_vanagon l bus,2.25181,
5,cat__model_nsx-t,2.240796,
6,cat__model_1986,2.240527,
7,cat__model_skyline gtr r32,2.168565,
8,cat__model_bus/vanagon gl camper,2.06237,
9,cat__model_eurovan campmobile,2.052587,


Unnamed: 0,feature,coef_std,coef_per_unit
22960,cat__model_SPECIAL FINANCE PROGRAM 2020,-3.637653,
22961,cat__model_gladiator rubicon only 9k,-3.727931,
22962,cat__model_charger scat pack srt,-3.756649,
22963,cat__model_ct4 sport awd,-3.763901,
22964,cat__model_california t,-3.773338,
22965,cat__model_x5 xdrive35d -- diesel w/ 3,-3.777271,
22966,cat__model_Na,-3.780145,
22967,cat__model_silverado 1500 lt cre,-3.796871,
22968,cat__model_Cars/trucks/suvs,-3.822744,
22969,cat__model_velar p2,-3.842373,


In [None]:
age_row = coef_df.loc[coef_df["feature"] == "num__age"]

print(age_row)

        feature  coef_std  coef_per_unit
19953  num__age -0.442934      -0.442934
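
Because the target is `log1p(price)`, a coefficient acts multiplicatively on price. As a rough worked example, the printed `num__age` coefficient can be turned into an approximate percentage effect; one "unit" here is one standard deviation of the power-transformed age, not one calendar year.

```python
import math

coef_age = -0.442934                 # num__age coefficient printed above

multiplier = math.exp(coef_age)      # factor applied to (price + 1) per unit increase
pct_change = multiplier - 1          # approximate relative change in price

print(f"multiplier: {multiplier:.3f}")
print(f"approximate price change: {pct_change:.1%}")
```

Under these assumptions, a car one standard deviation older (on the transformed scale) is priced roughly 36% lower, all else being equal.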


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [None]:
#Model quality: the Ridge model is a high-quality fit relative to the baseline.
#CV MAE: [DummyMean = $11,206.23; Ridge = $3,836.54 (alpha ≈ 0.27)]
#CV R2:  [DummyMean = -0.14; Ridge = 0.78 (alpha ≈ 0.27)]
#The mean absolute error is under $4,000 on an average used-car price of about $14,000, a reasonable error margin of roughly 27%.
#Ridge gives a large lift over the baseline model, cutting the MAE from ≈$11,200 down to ≈$3,800.
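
The figures quoted above can be checked with a few lines of arithmetic. The MAE values come from the cross-validation table; the $14,000 average price is the approximate figure used in the text.

```python
baseline_mae = 11_206.23   # DummyMean CV MAE
ridge_mae = 3_836.54       # Ridge CV MAE
avg_price = 14_000         # approximate average price quoted in the text

abs_reduction = baseline_mae - ridge_mae        # dollars of error removed per car
rel_reduction = abs_reduction / baseline_mae    # share of baseline error removed
rel_error = ridge_mae / avg_price               # typical error relative to price

print(f"MAE reduction: ${abs_reduction:,.2f} ({rel_reduction:.1%} of baseline)")
print(f"Ridge MAE as a share of average price: {rel_error:.1%}")
```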

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

In [None]:
#Dealers asked: “What really drives resale price and can we quantify it to fine‑tune inventory and pricing?”
#We framed it as a supervised regression task, started from 426k listings (filtered for implausible prices, years and mileage), and compared a naive baseline to a regularised linear model (Ridge).

In [None]:
#Model Validation
#The Ridge model cuts the average pricing error by ≈ $7,400 per car and explains almost four-fifths of the variation in price,
#which is sufficient for operational use.
#Mean Absolute Error (MAE): [Baseline: $11,206] | [Ridge model: $3,837]
#R2 (share of variance explained): [Baseline: -0.14] | [Ridge model: 0.78]
#alpha: ~0.27

In [None]:
#What actually moves price (approximate effects)
#Age: -$1,700 per year - core depreciation curve.
#Mileage: -$720 per 10,000 mi - secondary wear-and-tear discount.
#Brand premium: +$1,900
#Condition, Excellent vs. Good: +$1,250

In [None]:
#Quick Pricing Rule: how to use these numbers
#Assume a starting price of $14,900
# Subtract $1,700 × age (years)
# Subtract $720 × (odometer ÷ 10,000)
# Add $1,900 for premium-brand cars.
# Add $1,250 for cars in excellent rather than good condition.
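
The pricing rule above can be written as a small helper. The function name `quick_price` and its parameter names are hypothetical; the dollar figures are the ones listed in the rule.

```python
def quick_price(age_years, odometer_miles, premium_brand=False, excellent_condition=False):
    """Rule-of-thumb price estimate in dollars, using the figures from the pricing rule."""
    price = 14_900                               # assumed starting price
    price -= 1_700 * age_years                   # depreciation per year of age
    price -= 720 * (odometer_miles / 10_000)     # discount per 10,000 on the odometer
    if premium_brand:
        price += 1_900
    if excellent_condition:
        price += 1_250
    return price


# A 5-year-old car with 60,000 on the odometer, in excellent condition
estimate = quick_price(5, 60_000, excellent_condition=True)
print(f"${estimate:,.0f}")

# The linear rule breaks down for old, high-mileage cars
old_car = quick_price(10, 100_000)
print(f"${old_car:,.0f}")
```

The second call returns a negative number, which shows that this linear shortcut is only a rough guide for younger, lower-mileage vehicles and should not replace the fitted model.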

In [None]:
#Inventory focus
#High-margin segment: 4- to 6-year-old Toyota/Honda SUVs under 90k mi.
#Low-ROI segment: sedans 10+ years old with more than 150k mi, unless acquired at a deep discount.

In [None]:
#Re‑conditioning budget
#Spending up to $1000 to move a unit from “Good” to “Excellent” condition is justified by the average $1250 lift.