# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

#### Answer

The goal of this project is to predict used car prices using several machine learning techniques. The purpose of this report is to identify which features (year, odometer, cylinders, age, etc.) have the strongest predictive power for used car prices. We aim to explain to our client the relationship between these features and price through model coefficients and model performance, and to determine feature importance so that the dealership understands which factors consumers value most when purchasing used vehicles.


### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

#### Answer
To understand the data, we will:

1. Inspect the dataset
   - Load the dataset
   - Display first few rows to understand structure
   - Check data types of each column

2. Deal with missing values
   - Identify which features have critical missing data that would impact modeling, and clean them accordingly
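The inspection steps above can be sketched as follows. This is a minimal illustration, not the exact notebook code: the source does not give the CSV path, so a tiny made-up DataFrame stands in for the real file, and `missing_summary` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real CSV (values are made up for illustration).
# With the real data, this frame would come from pd.read_csv on the provided file.
cars = pd.DataFrame({
    "price": [15000.0, 0.0, 8500.0, 27000.0],
    "year": [2015.0, 2008.0, np.nan, 2019.0],
    "odometer": [60000.0, 150000.0, 98000.0, np.nan],
    "condition": ["good", None, "fair", None],
})


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "pct_missing": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


print(cars.head())   # first rows, to see the structure
print(cars.dtypes)   # data type of each column
summary = missing_summary(cars)
print(summary)       # which columns have the most missing data
```

Sorting by missing count puts the columns that need a cleaning decision (drop, fill, or ignore) at the top.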

Looking at the CSV file, we can see that our dataset contains roughly 426,000 used car entries (a subset of the original 3 million). Our features include:
  1. region : string
  2. price : float
  3. year : float
  4. manufacturer : string
  5. model : string
  6. condition : string
  7. cylinders : string
  8. fuel : string
  9. odometer : float
  10. title_status : string
  11. transmission : string
  12. VIN : string
  13. drive : string
  14. size : string (e.g., full-size)
  15. type : string
  16. paint_color : string
  17. state : string

Given this, we will focus only on the float-valued features for linear regression.



### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

#### Answer
In this stage, we did the following:

1. Data Cleaning:
   - Removed invalid prices: filtered to keep prices between $1 and $200,000 (removed 0 and extreme outliers)
   - Dropped rows with missing critical features: year, manufacturer, and odometer (essential for pricing)
   - Filled missing values: condition='unknown', transmission='unknown', fuel='gas', drive='unknown'

2. Feature Engineering:
   - Extracted numeric values from cylinders column (string format)
   - Created 'age' feature: age = 2025 - year (more interpretable than year)
   - Filtered out unrealistic ages (kept 0-50 years)

3. Feature Selection:
   - Selected numerical features: year, odometer, cylinders, age
   - Removed categorical features for initial linear regression models

4. Data Splitting:
   - Split data into training (70%) and test (30%) sets using random_state=42 for reproducibility
   - Final dataset: 370,156 records after cleaning
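The preparation steps above can be sketched roughly as below. This is an illustration under assumptions rather than the original notebook code: the `prepare` function name and the small `raw` frame are hypothetical, and cylinder strings that cannot be parsed are simply left as `NaN`, since the source does not say how they were handled.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

REFERENCE_YEAR = 2025  # reference year used for the age feature


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning and feature-engineering steps to a raw frame."""
    out = df[df["price"].between(1, 200_000)].copy()       # valid prices only
    out = out.dropna(subset=["year", "manufacturer", "odometer"])
    out = out.fillna({"condition": "unknown", "transmission": "unknown",
                      "fuel": "gas", "drive": "unknown"})
    # Pull the number out of strings such as "6 cylinders"
    out["cylinders"] = (
        out["cylinders"].str.extract(r"(\d+)", expand=False).astype(float)
    )
    out["age"] = REFERENCE_YEAR - out["year"]
    return out[out["age"].between(0, 50)]                   # realistic ages


# Made-up rows that exercise each rule (not real listings).
raw = pd.DataFrame({
    "price": [15000, 0, 8500, 27000, 500000, 12000, 9000, 4000],
    "year": [2015, 2010, 2012, 2019, 2020, np.nan, 1960, 2005],
    "manufacturer": ["ford", "bmw", "kia", "audi", "ram", "gmc", "ford", None],
    "odometer": [60000, 90000, 98000, 20000, 5000, 70000, 30000, 180000],
    "cylinders": ["6 cylinders", "4 cylinders", "4 cylinders", "8 cylinders",
                  "8 cylinders", "6 cylinders", "8 cylinders", "4 cylinders"],
    "condition": ["good", None, None, "excellent", "new", "good", "fair", "fair"],
    "transmission": ["automatic"] * 8,
    "fuel": ["gas"] * 8,
    "drive": ["fwd", None, "4wd", "rwd", "4wd", "fwd", "rwd", "fwd"],
})

clean = prepare(raw)
features = ["year", "odometer", "cylinders", "age"]
X_train, X_test, y_train, y_test = train_test_split(
    clean[features], clean["price"], test_size=0.3, random_state=42
)
print(clean[features + ["price"]])
```

In the made-up frame, only three rows survive: the others are removed for a zero or extreme price, a missing year or manufacturer, or an age above 50.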




### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

#### Answer
The models we built for this project include:

1. Single-feature linear regression models
2. Polynomial feature models (year only)
3. Multiple linear regression (4 features)
4. Polynomial regression (4 features, degree 2)

Based on these models, we found:
- Year/Age is the strongest single predictor (R² = 0.26)
- Polynomial features improve performance substantially (for year alone, R² rises from 0.26 to 0.39)
- Combining multiple features with polynomial transformations yields the best results (R² = 0.49)
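One way to compare polynomial degrees with cross-validation is sketched below. It is a hedged example on synthetic data, not the analysis itself: the generated `age`, `odometer`, and `cylinders` columns, the price formula, and the `cv_r2` helper are all assumptions made for illustration, so the scores it prints are not the report's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a curved price/age relationship (illustrative only).
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(0, 30, n)
odometer = age * 10_000 + rng.normal(0, 15_000, n)
cylinders = rng.choice([4, 6, 8], n)
X = np.column_stack([age, odometer, cylinders])
y = 30_000 * np.exp(-0.1 * age) + 1_000 * cylinders + rng.normal(0, 1_000, n)


def cv_r2(degree: int) -> float:
    """Mean 5-fold cross-validated R^2 for a polynomial regression."""
    model = make_pipeline(
        StandardScaler(),  # keeps powers of large values like odometer stable
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    folds = KFold(n_splits=5, shuffle=True, random_state=42)
    return float(cross_val_score(model, X, y, cv=folds, scoring="r2").mean())


scores = {degree: cv_r2(degree) for degree in (1, 2, 3)}
for degree, score in scores.items():
    print(f"degree {degree}: mean CV R^2 = {score:.3f}")
```

Putting the scaler and the polynomial expansion inside the pipeline means both are refit on each training fold, so the validation folds do not leak into preprocessing.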


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

#### Answer
The best-performing model was the polynomial regression with 4 features (degree 2):
- R² = 0.4906: the model explains about 49% of the variance in price
- MAE = 6,984.82: average absolute prediction error, in dollars
- RMSE = 10,467.73: penalizes larger errors more heavily
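For reference, the three metrics can be computed with `sklearn.metrics` as in this small sketch. The `y_true` and `y_pred` arrays are made-up numbers chosen so the arithmetic is easy to follow; they are not taken from the model above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices (illustrative only).
y_true = np.array([10_000.0, 20_000.0, 30_000.0, 40_000.0])
y_pred = np.array([12_000.0, 18_000.0, 33_000.0, 30_000.0])

r2 = r2_score(y_true, y_pred)                        # share of variance explained
mae = mean_absolute_error(y_true, y_pred)            # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # square root of mean error^2

print(f"R^2  = {r2:.3f}")
print(f"MAE  = {mae:,.2f}")
print(f"RMSE = {rmse:,.2f}")
```

Here the single 10,000 miss pulls RMSE well above MAE, which is the same pattern as in the reported results (RMSE of 10,467.73 versus MAE of 6,984.82) and suggests that a subset of cars is predicted with large errors.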

Key Drivers of Used Car Prices:

1. Year/Age (strongest predictor):
   - Newer cars command higher prices
   - R² = 0.26 as a single feature, improving to 0.39 with polynomial terms
   - The relationship is non-linear (polynomial degree 3 performs best for year alone)

2. Odometer (weakest single predictor):
   - R² = 0.05 as a single feature, which was surprisingly low
   - Lower mileage generally increases value, but the relationship is weak on its own

3. Cylinders:
   - More cylinders are associated with higher prices
   - R² = 0.07 as a single feature, which is still low

4. Feature interactions:
   - The polynomial model (degree 2) captures interactions between features
   - R² jumps from 0.37 to 0.49, showing that combining multiple features with polynomial terms works best here
   

These findings suggest the following for our client:
- Increase inventory of newer vehicles, since year/age is the strongest price driver
- Mileage matters, but less than vehicle age
- Engine size (cylinders) has a moderate impact



### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.