# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

We are tasked with identifying the *key drivers* of used car prices. Let's start with our business understanding as outlined by the CRISP-DM approach.
We will structure *Stage 1* as follows:

**Business Understanding**
- **Problem statement**: “What factors drive used car prices?”
- **Audience**: used car dealers.
- **Success metric**: good predictive performance + interpretable coefficients

---
# Business Understanding

A used car dealership wants to understand **what drives the price of a used car** so they can make smarter decisions about which vehicles to buy, how to price them, and which features to highlight to customers.

From a business perspective, the key questions are:

- Which characteristics (e.g., brand, age, mileage, fuel type, transmission) most strongly influence price?
- Which combinations of features tend to command a premium?
- Where are the biggest discounts or price penalties (e.g., high mileage, older cars, certain brands or trims)?

From a data perspective, we can frame this as a **supervised regression problem**:

- **Target variable:** `price` (continuous)
- **Features (predictors):** vehicle attributes such as `year`, `odometer` (mileage), `manufacturer` (brand), `model`, `fuel`, `transmission`, `condition`, etc.
- **Goal:** 
  - Build regression models that can **predict price** reasonably well.
  - Use **model coefficients and feature importance** to interpret which factors increase or decrease price, holding other variables constant.
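
As a rough illustration of this framing, the sketch below fits an ordinary least squares model on a tiny synthetic table. The column names `year` and `odometer` and the price rule are assumptions made up for the example, not values taken from the dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic toy data, not drawn from the Kaggle file; column names are assumed.
cars = pd.DataFrame({
    "year": [2010, 2012, 2014, 2016, 2018, 2020],
    "odometer": [150_000, 90_000, 120_000, 60_000, 70_000, 20_000],
})
# Price is built from a known rule so the fitted coefficients can be checked.
cars["price"] = 20_000 + 1_000 * (cars["year"] - 2010) - 0.05 * cars["odometer"]

X = cars[["year", "odometer"]]   # features (predictors)
y = cars["price"]                # continuous target

model = LinearRegression().fit(X, y)
coefs = dict(zip(X.columns, model.coef_))
print(coefs)
```

Each coefficient reads as the change in price for a one-unit change in that feature with the other held constant, which is the interpretation the goal above asks for.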

Success will be measured by:

- A clear, interpretable model (or set of models) that explains **which features matter most**.
- A reasonable error metric (e.g., RMSE or MAE) that shows the model captures pricing patterns.
- Actionable insights that a nontechnical audience (used car dealers) can use to **fine-tune inventory and pricing strategy**.
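
To make the error metrics concrete, here is a minimal sketch that computes MAE and RMSE by hand with NumPy. The actual and predicted prices are invented numbers used only for illustration.

```python
import numpy as np

# Hypothetical actual and predicted prices in dollars (made-up numbers).
y_true = np.array([10_000, 15_000, 20_000])
y_pred = np.array([11_000, 14_000, 23_000])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))          # average absolute miss
rmse = np.sqrt(np.mean(errors ** 2))   # penalizes large misses more heavily

print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```

Both metrics are in dollars, which makes them easy to explain to a dealer. RMSE is never smaller than MAE, and a large gap between the two suggests a few badly mispriced cars.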

### Data Exploration Findings
*Because the dataset contains substantial missing values and some unrealistic entries (e.g., extreme prices, odometer readings, and years), part of the technical challenge will involve cleaning and filtering the data before modeling. Many categorical variables such as manufacturer, model, condition, cylinders, and drive have missingness ranging from 20% to 70%, so we will need to decide whether to impute, simplify, or drop certain features.*
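
One possible way to quantify missingness and screen out implausible rows is sketched below on a toy frame. The column names are assumed, and the numeric bounds are placeholders rather than thresholds derived from the real data.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real data; column names are assumed.
df = pd.DataFrame({
    "price": [0, 8_500, 12_000, 3_000_000, 15_000],
    "year": [2015, 1900, 2012, 2018, 2016],
    "odometer": [60_000, 120_000, 9_999_999, 30_000, 45_000],
    "condition": ["good", np.nan, np.nan, "excellent", "good"],
    "drive": ["fwd", "4wd", np.nan, "rwd", "fwd"],
})

# Share of missing values per column, highest first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Placeholder plausibility bounds; tune these after inspecting the real data.
plausible = (
    df["price"].between(500, 150_000)
    & df["year"].between(1990, 2022)
    & df["odometer"].between(0, 500_000)
)
cleaned = df[plausible]
print(f"kept {len(cleaned)} of {len(df)} rows")
```

The missing-share table helps decide, column by column, whether to impute, simplify, or drop.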

*Given that our audience is a used car dealership, our modeling approach must balance predictive performance with interpretability. Linear and regularized regression models will allow us to quantify how each feature (e.g., mileage, age, brand, transmission) affects price while still producing actionable insights.*
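
The sketch below shows one way a regularized linear model could be read for interpretability: standardize the features, fit a ridge regression, and rank the coefficients by magnitude. The data is synthetic, and the feature names, price rule, and `alpha=1.0` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
# Synthetic features; names and the price rule are assumptions for illustration.
X = pd.DataFrame({
    "age": rng.uniform(0, 20, n),
    "odometer": rng.uniform(0, 200_000, n),
    "noise": rng.normal(size=n),   # deliberately unrelated to price
})
price = 30_000 - 900 * X["age"] - 0.04 * X["odometer"] + rng.normal(0, 500, n)

# Standardizing first puts coefficients on a comparable per-standard-deviation scale.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, price)
coefs = pd.Series(pipe[-1].coef_, index=X.columns)
print(coefs.sort_values(key=np.abs, ascending=False))
```

Because the features are standardized, each coefficient is the price change per one standard deviation of that feature, so the ranking gives a rough ordering of influence.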

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
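
A minimal sketch of such a preparation step with `sklearn` might look like the following. The column names, the imputation strategies, and the choice of a log transform on price are all assumptions for illustration, not a prescribed recipe.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows; column names are assumed to match the dataset.
df = pd.DataFrame({
    "price": [8_500, 12_000, 15_000, 22_000],
    "year": [2010, 2014, np.nan, 2019],
    "odometer": [150_000, 90_000, 70_000, np.nan],
    "fuel": ["gas", "diesel", np.nan, "gas"],
})

numeric = ["year", "odometer"]
categorical = ["fuel"]

preprocess = ColumnTransformer([
    # Median imputation, then scaling so numeric features share one scale.
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), numeric),
    # Missing categories become their own level before one-hot encoding.
    ("cat", make_pipeline(
        SimpleImputer(strategy="constant", fill_value="missing"),
        OneHotEncoder(handle_unknown="ignore"),
    ), categorical),
])

X = preprocess.fit_transform(df[numeric + categorical])
y = np.log1p(df["price"])   # log transform to reduce right skew in price
print(X.shape)
```

Keeping the steps inside a `ColumnTransformer` means the same imputation and scaling are learned on training folds only when the transformer is later placed in a cross-validated pipeline.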

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
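
As one example of exploring parameters with cross-validation, the sketch below grid-searches the ridge penalty with 5-fold CV. The data is a synthetic stand-in for a prepared feature matrix, and the grid of `alpha` values is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for a prepared feature matrix and log-price target.
X = rng.normal(size=(300, 5))
y = X @ np.array([0.8, -0.5, 0.0, 0.0, 0.3]) + rng.normal(0, 0.2, 300)

pipe = make_pipeline(StandardScaler(), Ridge())
# Candidate regularization strengths; the grid itself is an arbitrary choice.
grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(pipe, grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)

print("best alpha:", search.best_params_["ridge__alpha"])
print("cross-validated RMSE:", -search.best_score_)
```

The same pattern extends to other estimators (for example `Lasso`) by swapping the final pipeline step and the parameter name in the grid.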

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from it.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need to be revisited and adjusted or whether you have information of value to bring back to your client.

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.