# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.

### Business Understanding

From a business perspective, our goal is to help a used-car dealership understand which factors most influence vehicle price.

Translating this into a **data-science problem**, we aim to build a **supervised regression model** that predicts the continuous variable `price` from listing attributes such as manufacturer, year, odometer reading, and condition.

The dealership can use these insights to improve **pricing strategy**, **inventory acquisition**, and **marketing focus** on the most valuable vehicle attributes.


### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

### Data Understanding

The dataset contains information on 426,880 used cars collected from online listings.
Each record includes attributes such as manufacturer, model, year, condition, fuel type, transmission, odometer reading, and price.

We will inspect the data for missing values, inconsistent entries, and outliers to ensure data quality before modeling.
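
The checks described above could be sketched as follows. This is a minimal illustration on a made-up toy frame rather than the real listings; the helper name `quality_report` and the 1,000 to 100,000 price bounds are assumptions chosen for the example.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise missingness, cardinality, and dtype for each column."""
    return pd.DataFrame({
        "missing_share": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
        "dtype": df.dtypes.astype(str),
    }).sort_values("missing_share", ascending=False)


# Toy frame standing in for the listings data (all values are invented).
toy = pd.DataFrame({
    "price": [0, 15000, 22000, 3_000_000_000],
    "odometer": [120000.0, None, 45000.0, 10_000_000.0],
    "condition": ["good", None, None, "excellent"],
})

report = quality_report(toy)
print(report)

# Count prices outside a plausible range; the bounds are illustrative only.
implausible_price = ~toy["price"].between(1_000, 100_000)
print("implausible prices:", int(implausible_price.sum()))
```

Sorting by `missing_share` puts the sparsest columns first, which makes it easier to decide which ones to impute and which to drop.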


In [13]:
# ---- Imports & visual setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (8,5)

# Load dataset (adjust path if needed)
df = pd.read_csv("vehicles.csv", low_memory=False)
print(df.shape)
df.head()


(426880, 18)


Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
0,7222695916,prescott,6000,,,,,,,,,,,,,,,az
1,7218891961,fayetteville,11900,,,,,,,,,,,,,,,ar
2,7221797935,florida keys,21000,,,,,,,,,,,,,,,fl
3,7222270760,worcester / central MA,1500,,,,,,,,,,,,,,,ma
4,7210384030,greensboro,4900,,,,,,,,,,,,,,,nc


In [14]:
df.info()
df.describe(include="all").T.head(20)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
id,426880.0,,,,7311486634.224333,4473170.412559,7207408119.0,7308143339.25,7312620821.0,7315253543.5,7317101084.0
region,426880.0,404.0,columbus,3608.0,,,,,,,
price,426880.0,,,,75199.033187,12182282.173604,0.0,5900.0,13950.0,26485.75,3736928711.0
year,425675.0,,,,2011.235191,9.45212,1900.0,2008.0,2013.0,2017.0,2022.0
manufacturer,409234.0,42.0,ford,70985.0,,,,,,,
model,421603.0,29649.0,f-150,8009.0,,,,,,,
condition,252776.0,6.0,good,121456.0,,,,,,,
cylinders,249202.0,8.0,6 cylinders,94169.0,,,,,,,
fuel,423867.0,5.0,gas,356209.0,,,,,,,
odometer,422480.0,,,,98043.331443,213881.500798,0.0,37704.0,85548.0,133542.5,10000000.0


In [15]:
for col in ["manufacturer","model","condition","fuel","transmission","type","drive"]:
    if col in df.columns:
        print("\n", col)
        print(df[col].value_counts(dropna=False).head(10))


 manufacturer
manufacturer
ford         70985
chevrolet    55064
toyota       34202
honda        21269
nissan       19067
jeep         19014
ram          18342
NaN          17646
gmc          16785
bmw          14699
Name: count, dtype: int64

 model
model
f-150             8009
NaN               5277
silverado 1500    5140
1500              4211
camry             3135
silverado         3023
accord            2969
wrangler          2848
civic             2799
altima            2779
Name: count, dtype: int64

 condition
condition
NaN          174104
good         121456
excellent    101467
like new      21178
fair           6769
new            1305
salvage         601
Name: count, dtype: int64

 fuel
fuel
gas         356209
other        30728
diesel       30062
hybrid        5170
NaN           3013
electric      1698
Name: count, dtype: int64

 transmission
transmission
automatic    336524
other         62682
manual        25118
NaN            2556
Name: count, dtype: int64

 type
type


### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.

### Data Preparation

Before modeling, we clean the data by removing missing or invalid prices, filtering unrealistic values, and engineering a few helpful features such as `car_age`.
We also separate numeric and categorical features for one-hot encoding.


In [16]:
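# Keep listings with a plausible price and model year.
# Rows with a missing year are dropped too, because between() is False for NaN.
df = df[df["price"].between(1000, 100000)]
df = df[df["year"].between(1985, 2025)]
# Copy after filtering so the new columns are added to an independent frame.
df = df.drop_duplicates().copy()
df["car_age"] = 2025 - df["year"]
# Log-scaled price, kept for reference; the models below are fit on raw price.
df["log_price"] = np.log1p(df["price"])
df.head()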


Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state,car_age,log_price
27,7316814884,auburn,33590,2014.0,gmc,sierra 1500 crew cab slt,good,8 cylinders,gas,57923.0,clean,other,3GTP1VEC4EG551563,,,pickup,white,al,11.0,10.422013
28,7316814758,auburn,22590,2010.0,chevrolet,silverado 1500,good,8 cylinders,gas,71229.0,clean,other,1GCSCSE06AZ123805,,,pickup,blue,al,15.0,10.025307
29,7316814989,auburn,39590,2020.0,chevrolet,silverado 1500 crew,good,8 cylinders,gas,19160.0,clean,other,3GCPWCED5LG130317,,,pickup,red,al,5.0,10.586357
30,7316743432,auburn,30990,2017.0,toyota,tundra double cab sr,good,8 cylinders,gas,41124.0,clean,other,5TFRM5F17HX120972,,,pickup,red,al,8.0,10.341452
31,7316356412,auburn,15000,2013.0,ford,f-150 xlt,excellent,6 cylinders,gas,128000.0,clean,automatic,,rwd,full-size,truck,black,al,12.0,9.615872
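
The cell above stores `log_price` via `np.log1p`, but the models in this notebook are fit on raw `price`. If a log-scale target were wanted, one option is scikit-learn's `TransformedTargetRegressor`, which applies the transform during fitting and inverts it at prediction time. The sketch below uses synthetic data and a plain `LinearRegression`; it is not what the notebook ran.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: price decays exponentially with age (invented numbers).
car_age = rng.uniform(0, 30, size=200).reshape(-1, 1)
price = np.expm1(10.5 - 0.08 * car_age.ravel())

# Fit on log1p(price); predictions are mapped back with expm1.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(car_age, price)

pred = log_model.predict(car_age)  # already back in the original price units
print(pred[:3], price[:3])
```

Because the inverse transform is applied automatically, metrics such as RMSE can still be computed in the original price units.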


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

### Modeling

We compare three regression algorithms: Linear Regression, Ridge Regression (with hyperparameter tuning), and Random Forest.
Performance will be evaluated with **RMSE**, **MAE**, and **R²** to identify the most accurate and interpretable model.
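
For reference, the three metrics can be written out by hand. The following is a small pure-Python sketch on made-up numbers; the notebook itself relies on the `sklearn.metrics` functions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented prices: three close predictions and one miss of 4,000.
y_true = [10000, 20000, 30000, 40000]
y_pred = [11000, 19000, 30000, 44000]

print(f"RMSE={rmse(y_true, y_pred):.2f}")  # RMSE=2121.32
print(f"MAE={mae(y_true, y_pred):.2f}")    # MAE=1500.00
print(f"R2={r2(y_true, y_pred):.3f}")      # R2=0.964
```

The single large miss pulls RMSE well above MAE, which is the behaviour to keep in mind when choosing between the two.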


In [17]:
# --- Modeling Section ---

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Separate target and features
target = "price"
y = df[target].values

features = ["year","odometer","car_age","manufacturer","condition","fuel","transmission","type","drive"]
X = df[features].copy()

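# Identify numeric vs categorical.
# Note: car_age is 2025 - year, so these two numeric columns are perfectly
# collinear and the linear models gain nothing from having both.
num_cols = ["year","odometer","car_age"]
cat_cols = [c for c in X.columns if c not in num_cols]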

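# --- Handle missing values ---
# Note: the medians are computed on the full dataset, before the train/test
# split, so test rows contribute to the fill values.
for col in num_cols:
    X[col] = X[col].fillna(X[col].median())
for col in cat_cols:
    X[col] = X[col].fillna('Unknown')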

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessor
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_cols),
        ("num", "passthrough", num_cols)
    ]
)

# --- Evaluation Function ---
def evaluate(model, name):
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))   # RMSE = square root of MSE
    mae  = mean_absolute_error(y_test, pred)
    r2   = r2_score(y_test, pred)
    print(f"{name}: RMSE={rmse:.2f}, MAE={mae:.2f}, R2={r2:.2f}")
    return rmse, mae, r2

# --- 1) Linear Regression ---
lin = Pipeline([("prep", preprocess), ("model", LinearRegression())])
lin.fit(X_train, y_train)
evaluate(lin, "Linear Regression")

# --- 2) Ridge Regression (GridSearchCV) ---
ridge = Pipeline([("prep", preprocess), ("model", Ridge())])
params = {"model__alpha":[0.1, 1, 10]}
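# With no scoring argument, GridSearchCV ranks alphas by the estimator's
# default score, which is R² for regressors.
ridge_cv = GridSearchCV(ridge, params, cv=3)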
ridge_cv.fit(X_train, y_train)
evaluate(ridge_cv.best_estimator_, "Ridge Regression")

# --- 3) Random Forest Regressor ---
rf = Pipeline([("prep", preprocess),
               ("model", RandomForestRegressor(n_estimators=200, random_state=42))])
rf.fit(X_train, y_train)
evaluate(rf, "Random Forest")


Linear Regression: RMSE=8788.95, MAE=6195.19, R2=0.62
Ridge Regression: RMSE=8788.43, MAE=6195.21, R2=0.62
Random Forest: RMSE=4456.38, MAE=2075.91, R2=0.90


(np.float64(4456.379573989913), 2075.9104935786318, 0.9018179305613095)
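
The scores above come from a single train/test split, with missing values filled before splitting. One way to cross-validate the comparison while keeping imputation inside each fold is to move the fill step into the pipeline and call `cross_validate`. The sketch below uses synthetic data and a `Ridge` model; column names mirror the notebook, but the numbers are invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(42)
n = 300

# Synthetic listings and prices (invented relationship plus noise).
toy = pd.DataFrame({
    "car_age": rng.integers(0, 30, size=n).astype(float),
    "odometer": rng.uniform(5_000, 200_000, size=n),
    "fuel": rng.choice(["gas", "diesel", "hybrid"], size=n),
})
toy_price = (40_000 - 900 * toy["car_age"] - 0.08 * toy["odometer"]
             + rng.normal(0, 1_500, size=n))

# Knock out some values to mimic missing data.
toy.loc[toy.index[::10], "odometer"] = np.nan
toy.loc[toy.index[::15], "fuel"] = np.nan

# Imputers live inside the pipeline, so each fold learns its fill values
# from that fold's training rows only.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess_cv = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["car_age", "odometer"]),
    ("cat", categorical, ["fuel"]),
])
model = Pipeline([("prep", preprocess_cv), ("model", Ridge(alpha=1.0))])

scores = cross_validate(
    model, toy, toy_price, cv=5,
    scoring=("neg_root_mean_squared_error", "neg_mean_absolute_error", "r2"),
)
cv_rmse = -scores["test_neg_root_mean_squared_error"]
print(f"RMSE per fold: {np.round(cv_rmse, 1)}")
print(f"mean R2: {scores['test_r2'].mean():.3f}")
```

Reporting the spread across folds, not only the mean, gives a sense of how stable the ranking of models is.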

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

### Evaluation

Among the three models tested, the **Random Forest** performs best on the held-out test set, with the lowest RMSE (about 4,456 versus about 8,788 for both linear models) and the highest R² (0.90 versus 0.62).  
It therefore predicts car prices more accurately than the simpler linear models.

**Key observations:**
- **Year** and **odometer** were strong predictors: newer cars and those with fewer miles are priced higher.  
- **Condition** and **manufacturer** also had clear effects on price.  
- For the Random Forest, MAE is about $2,076 and RMSE about $4,456, so typical prediction errors are a few thousand dollars, which is acceptable for dealership pricing insights. The linear models miss by about $6,195 on average.
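
The notebook does not compute feature importances, so the observations about individual drivers could be checked with permutation importance on held-out data. The sketch below uses synthetic data with a deliberately uninformative `noise` column; applying the same call to the notebook's `rf`, `X_test`, and `y_test` is an assumption about how it would be used, not something that was run here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic features: two that drive price and one pure-noise column.
X = pd.DataFrame({
    "car_age": rng.uniform(0, 30, n),
    "odometer": rng.uniform(5_000, 200_000, n),
    "noise": rng.normal(size=n),
})
y = 40_000 - 900 * X["car_age"] - 0.05 * X["odometer"] + rng.normal(0, 1_000, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and record the drop in score.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
importances = (pd.Series(result.importances_mean, index=X.columns)
               .sort_values(ascending=False))
print(importances)
```

A feature whose shuffling barely changes the score contributes little to the predictions, which makes the ranking easier to explain to a non-technical client than raw model internals.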

**Metric choice:** RMSE was chosen as the primary evaluation metric since large pricing errors are most costly to the business. MAE and R² were reported for additional clarity.
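
If RMSE is the primary metric, the hyperparameter search can be scored on it as well. `GridSearchCV` uses the estimator's default score (R² for regressors) unless `scoring` is set. A minimal sketch on synthetic data, using the same alpha grid as the notebook:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)

# Synthetic regression problem (invented coefficients and noise level).
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# scikit-learn maximises scores, so RMSE is exposed as its negative.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1, 10]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_cv_rmse = -search.best_score_  # flip the sign back to a plain RMSE
print(search.best_params_, round(best_cv_rmse, 3))
```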


In [18]:
# --- Quick model comparison summary ---
models = ["Linear Regression", "Ridge Regression", "Random Forest"]
metrics = ["RMSE", "MAE", "R2"]

# Re-run evaluations (if needed) to show a clean table
lin_scores = evaluate(lin, "Linear Regression")
ridge_scores = evaluate(ridge_cv.best_estimator_, "Ridge Regression")
rf_scores = evaluate(rf, "Random Forest")

# Combine results into a simple DataFrame
results_df = pd.DataFrame(
    [lin_scores, ridge_scores, rf_scores],
    columns=metrics,   # reuse the metric names defined above
    index=models
)
results_df


Linear Regression: RMSE=8788.95, MAE=6195.19, R2=0.62
Ridge Regression: RMSE=8788.43, MAE=6195.21, R2=0.62
Random Forest: RMSE=4456.38, MAE=2075.91, R2=0.90


| Model | RMSE | MAE | R2 |
|---|---|---|---|
| Linear Regression | 8788.950732 | 6195.185138 | 0.618107 |
| Ridge Regression | 8788.433635 | 6195.207448 | 0.618152 |
| Random Forest | 4456.379574 | 2075.910494 | 0.901818 |


### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

### Deployment

**Business Recommendations**

- Focus on acquiring **newer, low-mileage vehicles** in **excellent condition**, as these consistently command higher resale prices.  
- Use the trained Random Forest model to **flag underpriced or overpriced inventory** for review by pricing managers.  
- Retrain the model periodically (e.g., quarterly) as market conditions shift.  
- Extend the model with features like **trim level and accident history**, and with the `paint_color` column already present in the dataset, which may further improve accuracy.

The dealership can embed this model into a dashboard that suggests competitive prices and highlights profitable acquisition opportunities.
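
The flagging idea could look roughly like this. The helper `flag_mispriced`, the `predicted_price` column, and the 15% tolerance are all hypothetical choices for illustration; a dealership would set the threshold from its own margins.

```python
import pandas as pd


def flag_mispriced(listings: pd.DataFrame, tolerance: float = 0.15) -> pd.DataFrame:
    """Label listings whose asking price differs from the model price by more than `tolerance`."""
    out = listings.copy()
    # Relative gap between asking price and model prediction.
    out["gap_pct"] = (out["price"] - out["predicted_price"]) / out["predicted_price"]
    out["flag"] = "ok"
    out.loc[out["gap_pct"] > tolerance, "flag"] = "overpriced"
    out.loc[out["gap_pct"] < -tolerance, "flag"] = "underpriced"
    return out


# Invented listings with model predictions already attached.
listings = pd.DataFrame({
    "price": [20000, 12000, 30000],
    "predicted_price": [20500, 16000, 24000],
})

flagged = flag_mispriced(listings)
print(flagged)
```

Flags are meant as prompts for a pricing manager to review, since the model's own error of a few thousand dollars can exceed the tolerance on cheaper cars.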


In [19]:
# --- Example: Predicting price for a sample car ---
example = pd.DataFrame({
    "year": [2018],
    "odometer": [42000],
    "car_age": [2025 - 2018],
    "manufacturer": ["toyota"],
    "condition": ["excellent"],
    "fuel": ["gas"],
    "transmission": ["automatic"],
    "type": ["sedan"],
    "drive": ["fwd"]
})

pred_price = rf.predict(example)[0]
print(f"Predicted price for example car: ${pred_price:,.0f}")


Predicted price for example car: $20,431
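
To serve predictions like this outside the notebook, the fitted pipeline would need to be saved and reloaded. A minimal sketch with `joblib`, using a tiny stand-in model and a temporary directory; the file name `price_model.joblib` is an arbitrary choice.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model; in the notebook this would be the fitted `rf` pipeline.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "price_model.joblib")
    joblib.dump(model, path)      # serialise the fitted estimator to disk
    restored = joblib.load(path)  # load it back, e.g. inside a dashboard process

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print("predictions match:", same)
```

Saved models should be loaded with the same scikit-learn version that produced them, so it helps to record the version alongside the file.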
