# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

# Business Understanding

## Determine Business Objectives
### Background
* Client is a used car dealership.
### Business Objectives
* Identify what factors make a car more or less expensive
* Related Question: Are different factors more or less important depending on the demographic?
### Business Success Criteria
* The client will be able to accurately estimate how profitable a used car will be before buying one

## Assess Situation
### Inventory of Resources
* For this educational example project, only a pre-selected dataset will be available, and the analysis/modeling will be done by a single data scientist
### Requirements, Assumptions, and Constraints
* Project completion date: 4/15
* Predictions will only be applicable for used cars and buyers with similar features as ones that can be found in the dataset

## Determine Data Mining Goals
### Data Mining Goals
* Predict the sales price of a used car based on features of the car
### Data Mining Success Criteria
* Predicted sales price will be accurate with an acceptable error margin
* Most important features impacting sales price will be identified


### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

# Data Understanding

## Collect Initial Data
### Initial Data Collection Report
* Data is a subset of a publicly available dataset from kaggle.com

## Describe Data
### Data Description Report
* Format: CSV
* Number of rows: 426,888
* Number of columns: 18
* Numeric columns: 'odometer', 'price'
* Date columns: 'year'
* Categorical columns: 'region', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color', 'state'
* Unique columns: 'id', 'VIN'

## Explore Data
### Data Exploration Report
* Target/Prediction column: 'price'
* As expected, there appears to be a correlation between price and model year, as well as a correlation between price and odometer
* Most columns are categorical and will need to be cleaned and prepared for further analysis
* More data exploration details can be found in the 'Data Understanding.ipynb' notebook

## Verify Data Quality
### Data Quality Report
* Unique columns 'id' and 'VIN' are unnecessary
* Only 34,868 rows have complete data (that is, no columns contain 'NaN')
  * If 'VIN' column is removed first, there are 79,195 rows of complete data
* There are clear outliers and incorrect data
  * Model year includes a range of years from 1900-1960. These can safely be ignored as they are either incorrect or extremely atypical
  * Some odometer readings exceed 1 million miles; these are clearly incorrect
  * Some sale prices exceed 1 million; these are extreme outliers for used cars
* Some categorical columns are too numerous to reasonably encode, so only columns with 13 or fewer unique values will be examined
  * One-Hot encoding: 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'type', 'paint_color'
  * Ordinal encoding: 'condition', 'size'
#### Data Cleaning
* Remove 'id' and 'VIN' column
* Remove rows with missing data
* Remove outlier/incorrect data for 'year', 'odometer', and 'price'
* Convert 'year' to 'age' by subtracting it from the current year
* Scale 'price' and 'odometer' by factor of 1000
* Remaining categorical data will have to be preprocessed (such as one-hot encoding and ordinal encoding)
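
The completeness counts reported above can be reproduced along the following lines. This is a minimal sketch on a made-up frame: `toy` and all of its values are hypothetical stand-ins, not rows from the real dataset. It shows how dropping the identifier columns first changes the number of complete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the vehicles data (values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "VIN": ["A1", None, None, "D4"],
    "price": [5000, 12000, 8000, 3000],
    "odometer": [120000.0, 45000.0, np.nan, 200000.0],
    "fuel": ["gas", "diesel", "gas", None],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows with no missing values in any column
complete_rows = len(toy.dropna())

# Rows that are complete once the identifier columns are removed
complete_rows_no_ids = len(toy.drop(columns=["id", "VIN"]).dropna())

print(missing_per_column)
print(complete_rows, complete_rows_no_ids)
```

On the real data, the same two `dropna()` calls would give the complete-row counts listed in the Data Quality Report.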

### Plots of numeric data after cleaning

![](images/year.png)

![](images/odometer.png)

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

In [12]:
import numpy as np
import pandas as pd
from datetime import datetime
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder, OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector, ColumnTransformer
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV

import warnings
warnings.filterwarnings("ignore")

In [13]:
df = pd.read_csv('data/vehicles.csv')

In [14]:
# Remove 'id' and 'VIN' columns, then all rows with NaN
df = df.drop(['id','VIN'], axis=1)
df = df.dropna()

# Remove all rows with a model year of 1960 or earlier
mask_year = df['year'] > 1960
df = df[mask_year]

# Convert 'year' column to age of car
df['age'] = datetime.now().year - df['year']
df = df.drop('year', axis=1)

# Remove all rows where odometer is 300,000 or greater
mask_od = df['odometer'] < 300000
df = df[mask_od]

# Remove all rows where sales price is 1,000,000 or greater
mask_price = df['price'] < 1000000
df = df[mask_price]

# Scale odometer by factor of 1000
df['odometer'] = df['odometer'].div(1000, fill_value=0.000)

# Scale price by factor of 1000
df['price'] = df['price'].div(1000, fill_value=0.000)

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [16]:
# Split data into training set and test set
X = df.drop(['price', 'region', 'manufacturer', 'model', 'state'], axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)

In [17]:
# Lists of columns
num_cols = ['age', 'odometer']
ohe_cols = ['cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'type', 'paint_color']
ord_cols = ['condition', 'size']
# List of Ordinal values
condition_cats = ['salvage', 'fair', 'good', 'excellent', 'like new', 'new']
size_cats = ['sub-compact', 'compact', 'mid-size', 'full-size']
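
The ordered category lists above matter because `OrdinalEncoder` sorts categories alphabetically unless it is told otherwise. The sketch below uses a small made-up frame (`toy` is hypothetical) to compare the explicit ordering with the default one.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Category orders as defined in the notebook, from lowest to highest
condition_cats = ['salvage', 'fair', 'good', 'excellent', 'like new', 'new']
size_cats = ['sub-compact', 'compact', 'mid-size', 'full-size']

# Hypothetical rows used only for illustration
toy = pd.DataFrame({
    "condition": ["good", "new", "salvage"],
    "size": ["full-size", "compact", "mid-size"],
})

# Explicit order: each value maps to its position in the supplied list
explicit = OrdinalEncoder(categories=[condition_cats, size_cats]).fit_transform(toy)

# Default order: observed categories are sorted alphabetically
default = OrdinalEncoder().fit_transform(toy)

print(explicit)
print(default)
```

With the explicit lists, 'salvage' maps to 0 and 'new' to 5, so the encoded value increases with condition. With the default, 'good' sorts before 'new' and 'salvage', so the numeric order no longer reflects quality.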

In [18]:
# Linear regression models with polynomial degree 1-10

# Create empty lists of training and test mean squared errors
train_mses = []
test_mses = []

# For loop to iterate over each polynomial degree
for i in range(1,11):
    # Preprocessor created to iterate polynomial degree
    num_trans = Pipeline([
        ('polyfeatures', PolynomialFeatures(degree=i)),
        ('scaler', StandardScaler())
    ])
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', num_trans, num_cols),
            # No explicit category order is given here, so categories are sorted alphabetically
            ('ord', OrdinalEncoder(), ord_cols),
            ('ohe', OneHotEncoder(), ohe_cols)
        ]
    )
    # Pipeline for linear regression
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('linreg', LinearRegression())
    ])
    # Fit the model on training data
    pipe.fit(X_train, y_train)
    # Make predictions for training and test data
    p1 = pipe.predict(X_train)
    p2 = pipe.predict(X_test)
    # Mean squared errors calculated for train and test sets
    train_mses.append(mean_squared_error(y_train, p1))
    test_mses.append(mean_squared_error(y_test, p2))

# Hold-out comparison (a single train/test split, not k-fold cross-validation)
# Create dataframe with training and test mean squared errors for each polynomial degree k
k_values = np.array(range(1,11))
MSEs_w_k = pd.DataFrame({"k": k_values, "Train MSE": train_mses, "Test MSE": test_mses})
MSEs_w_k

|   | k | Train MSE | Test MSE |
|---|---|---|---|
| 0 | 1 | 84.21392 | 79.868194 |
| 1 | 2 | 75.525471 | 71.694607 |
| 2 | 3 | 75.044165 | 71.342374 |
| 3 | 4 | 74.829402 | 71.388489 |
| 4 | 5 | 74.429077 | 71.095033 |
| 5 | 6 | 74.064486 | 70.768859 |
| 6 | 7 | 73.907264 | 70.538416 |
| 7 | 8 | 73.758238 | 71.391891 |
| 8 | 9 | 73.53616 | 71.400594 |
| 9 | 10 | 73.400239 | 72.260815 |
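
The comparison above relies on a single train/test split. A k-fold alternative could look like the sketch below, which scores each polynomial degree with `cross_val_score`. It runs on synthetic data (`X_toy`, `y_toy`, and the degree range 1-4 are assumptions made for illustration), so the numbers it prints say nothing about the car dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic two-feature regression problem with a quadratic term (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 3, size=(200, 2))
y_toy = 30 - 4 * X_toy[:, 0] - 2 * X_toy[:, 1] ** 2 + rng.normal(0, 1, size=200)

# Mean 5-fold cross-validated MSE for each polynomial degree
cv_mse = {}
for degree in range(1, 5):
    pipe = Pipeline([
        ('polyfeatures', PolynomialFeatures(degree=degree, include_bias=False)),
        ('scaler', StandardScaler()),
        ('linreg', LinearRegression()),
    ])
    scores = cross_val_score(pipe, X_toy, y_toy, cv=5,
                             scoring='neg_mean_squared_error')
    cv_mse[degree] = -scores.mean()  # scores are negated MSEs

# Degree with the lowest cross-validated error
best_degree = min(cv_mse, key=cv_mse.get)
print(cv_mse, best_degree)
```

Because every fold serves as validation data once, the chosen degree depends less on one particular split, and the test set stays untouched until the final evaluation.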


In [19]:
# Column Transformer for numeric values; degree 7 Polynomial Features are created and then scaled
num_transformer = Pipeline([
    ('polyfeatures', PolynomialFeatures(degree=7)),
    ('scaler', StandardScaler())
])

# Create preprocessor for numeric columns, ordinal columns, and OneHot encoding
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols),
        ('ord', OrdinalEncoder(categories=[condition_cats, size_cats]), ord_cols),
        ('ohe', OneHotEncoder(), ohe_cols)
    ]
)

In [20]:
# Sequential feature selector applied to the degree 7 polynomial features to select 6 features
lin_model = Pipeline([
    ('preprocessor', preprocessor),
    ('selector', SequentialFeatureSelector(LinearRegression(), n_features_to_select=6)),
    ('linreg', LinearRegression())
])
# Set feature names for sequential feature selector
sfs = lin_model.named_steps['selector']
sfs.feature_names_in_ = X.columns

# Fit pipeline
lin_model.fit(X_train, y_train)

# Calculate mean squared errors
lin_train_mse = mean_squared_error(lin_model.predict(X_train), y_train)
lin_test_mse = mean_squared_error(lin_model.predict(X_test), y_test)

# Get model coefficients and the names of the selected features
lin_coefs = lin_model.named_steps['linreg'].coef_
lin_selected_features = sfs.get_feature_names_out()

# Visualize pipeline
lin_model

In [21]:
# Ridge regression model
ridge_model = Pipeline([
    ('preprocessor', preprocessor),
    ('ridge', Ridge())
])

# Grid search CV
parameters_to_try = {'ridge__alpha': 10**np.linspace(-4, 3, 10)}

ridge_model_finder = GridSearchCV(estimator=ridge_model,
                           param_grid=parameters_to_try,
                           scoring='neg_mean_squared_error')

ridge_model_finder.fit(X_train, y_train)

# Get best ridge model from GridSearchCV
ridge_best_model = ridge_model_finder.best_estimator_

# Calculate mean squared errors
ridge_train_mse = mean_squared_error(ridge_best_model.predict(X_train), y_train)
ridge_test_mse = mean_squared_error(ridge_best_model.predict(X_test), y_test)

# Visualize model
ridge_best_model

In [22]:
# LASSO regression model
lasso_model = Pipeline([
    ('preprocessor', preprocessor),
    ('lasso', Lasso())
])

# GridSearchCV
lasso_parameters_to_try = {'lasso__alpha': 10**np.linspace(-4, 3, 10)}

lasso_model_finder = GridSearchCV(estimator=lasso_model,
                           param_grid=lasso_parameters_to_try,
                           scoring='neg_mean_squared_error')

lasso_model_finder.fit(X_train, y_train)

# Get best LASSO model from GridSearchCV
lasso_best_model = lasso_model_finder.best_estimator_

#Calculate MSE for training and test set
lasso_train_mse = mean_squared_error(lasso_best_model.predict(X_train), y_train)
lasso_test_mse = mean_squared_error(lasso_best_model.predict(X_test), y_test)

# Visualize model
lasso_best_model

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [24]:
# Dataframe to view MSEs of best type of each model
# Create lists for each column
models = ['Linear Regression', 'Ridge Regression', 'LASSO Regression']
train_mses = [lin_train_mse, ridge_train_mse, lasso_train_mse]
test_mses = [lin_test_mse, ridge_test_mse, lasso_test_mse]
n_features = [lin_model.named_steps['linreg'].n_features_in_,
               ridge_best_model.named_steps['ridge'].n_features_in_,
               lasso_best_model.named_steps['lasso'].n_features_in_]

# Create and show dataframe
summary = pd.DataFrame({"Model": models, "Training MSEs": train_mses, "Test MSEs": test_mses, "Number of Features": n_features})
# Add column for square root of test MSE (RMSE), a typical prediction error in thousands of dollars
summary['Mean Error (in thousands of dollars)'] = np.sqrt(summary['Test MSEs'])
summary

|   | Model | Training MSEs | Test MSEs | Number of Features | Mean Error (in thousands of dollars) |
|---|---|---|---|---|---|
| 0 | Linear Regression | 87.6381 | 83.808613 | 6 | 9.154704 |
| 1 | Ridge Regression | 73.023316 | 69.734788 | 88 | 8.350736 |
| 2 | LASSO Regression | 74.008042 | 70.593258 | 88 | 8.401979 |
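
Since 'price' was divided by 1000 during preparation, each MSE above is in squared thousands of dollars, and its square root is a root mean squared error (RMSE) in thousands of dollars. As a short worked check, the snippet below recomputes the last column from the test MSEs in the table and converts it to dollars; the name `rmse_dollars` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Test MSEs copied from the summary table (units: squared thousands of dollars)
test_mse = pd.Series({
    'Linear Regression': 83.808613,
    'Ridge Regression': 69.734788,
    'LASSO Regression': 70.593258,
})

# RMSE in thousands of dollars, converted to dollars
rmse_dollars = np.sqrt(test_mse) * 1000

print(rmse_dollars.round(0))
print(rmse_dollars.idxmin())
```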


In [25]:
# Identifying index of SequentialFeatureSelector selected features
sfs_features_raw = lin_model.named_steps['selector'].get_feature_names_out()
sfs_features_index = [s[1:] for s in sfs_features_raw]
sfs_features_index = [int(x) for x in sfs_features_index]

In [26]:
# Get feature names from the preprocessor
lin_feature_names = lin_model.named_steps['preprocessor'].get_feature_names_out()

# Create dataframe with feature names of SequentialFeatureSelector features and their coefficients
lin_coef_df = pd.DataFrame({"Feature Name": lin_feature_names})
lin_coef_df = lin_coef_df.iloc[sfs_features_index,:]
lin_coef_df['Coefficients'] = lin_coefs

lin_coef_df.sort_values(by=['Coefficients'], ascending=False)

|   | Feature Name | Coefficients |
|---|---|---|
| 46 | ohe__fuel_diesel | 13.30243 |
| 12 | num__age^2 odometer^2 | 6.940993 |
| 44 | ohe__cylinders_8 cylinders | 4.683232 |
| 72 | ohe__type_sedan | -3.044951 |
| 61 | ohe__drive_fwd | -3.92289 |
| 4 | num__age odometer | -12.058363 |


In [27]:
# Create dataframe for ridge features names and coefficients
ridge_feature_names = ridge_best_model.named_steps['preprocessor'].get_feature_names_out()
ridge_coefs = ridge_best_model.named_steps['ridge'].coef_
ridge_feature_coef_mapping = dict(zip(ridge_feature_names, ridge_coefs))

ridge_coef_df = pd.DataFrame({"Feature Name": ridge_feature_names, "Coefficient": ridge_coefs})
# Show the 10 largest (most positive) coefficients
ridge_coef_df.sort_values(by=['Coefficient'], ascending=False).head(10)

|   | Feature Name | Coefficient |
|---|---|---|
| 13 | num__age odometer^3 | 1765.650549 |
| 20 | num__odometer^5 | 1500.496601 |
| 12 | num__age^2 odometer^2 | 1210.141844 |
| 25 | num__age^2 odometer^4 | 762.65552 |
| 15 | num__age^5 | 688.342598 |
| 26 | num__age odometer^5 | 644.871228 |
| 11 | num__age^3 odometer | 566.646086 |
| 4 | num__age odometer | 400.465941 |
| 9 | num__odometer^3 | 332.457643 |
| 24 | num__age^3 odometer^3 | 277.293215 |


In [28]:
# Create dataframe for LASSO features names and coefficients
lasso_feature_names = lasso_best_model.named_steps['preprocessor'].get_feature_names_out()
lasso_coefficients = lasso_best_model.named_steps['lasso'].coef_
feature_coef_mapping = dict(zip(lasso_feature_names, lasso_coefficients))
coef_df = pd.DataFrame({"Feature Name": lasso_feature_names, "Coefficient": lasso_coefficients})
# Show the 10 largest (most positive) coefficients
coef_df.sort_values(by=['Coefficient'], ascending=False).head(10)

|   | Feature Name | Coefficient |
|---|---|---|
| 3 | num__age^2 | 17.506895 |
| 46 | ohe__fuel_diesel | 9.766301 |
| 39 | ohe__cylinders_12 cylinders | 8.282558 |
| 4 | num__age odometer | 7.220695 |
| 47 | ohe__fuel_electric | 5.747451 |
| 65 | ohe__type_convertible | 5.171402 |
| 52 | ohe__title_status_lien | 4.148462 |
| 69 | ohe__type_offroad | 3.470049 |
| 9 | num__odometer^3 | 3.270022 |
| 73 | ohe__type_truck | 2.615409 |


### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

# Findings

## Model Selection
* Three different regression algorithms were used to produce models, which were then compared against each other
  * Algorithms used: Linear Regression, Ridge Regression, LASSO Regression
* For each algorithm, many models were created with different initialization parameters to determine the optimal parameters for each model given the dataset

## Model Evaluation
* After the optimal model for each algorithm was identified, the models were tested on a new subset of the data that was not available to them during training
  * The performance of each model for this new test dataset was evaluated by calculating the mean squared error (MSE)
  * Taking the square root of each model's MSE gives the root mean squared error (RMSE), a typical size of the prediction error in units of 1,000 dollars
* With the above evaluation, the Ridge Regression model was determined to be the best model, with an RMSE of approximately 8,350 dollars on the test set
* It is worth noting that the Ridge Regression model uses every feature that was fed into it (88 after preprocessing)
  * This increases computational complexity, but results in a more accurate model
* The LASSO model performed slightly worse (RMSE of approximately 8,400 dollars) and was much more computationally complex
### Coefficient Evaluation
* It is also important to look at the coefficients of the resulting models
  * These can effectively be viewed as the "weight" of a particular feature. A coefficient with a larger absolute value means the prediction is more heavily affected by that feature, and its sign shows whether the feature raises or lowers the predicted price
* The coefficients from each model can provide insight
  * The Linear Regression model was restricted to using 6 input features based on an automated selection process. All 6 coefficients can be evaluated
  * Both the Ridge and LASSO Regression models used all input features, which totals to 88 after data preprocessing. For this reason, the top 10 coefficients will be evaluated
* Linear Regression
  * The feature with the largest coefficient is whether the car uses diesel fuel, followed by two terms that combine the age and odometer reading of the car, and then whether it has 8 cylinders. Front-wheel drive and a sedan body type both have negative coefficients
* Ridge Regression
  * All of the top 10 coefficients (out of 88) are based on the age and odometer reading of the car, or are derived features based on those two values
* LASSO Regression
  * Terms based on age and odometer again have some of the largest coefficients, led by age squared
  * Use of diesel fuel has the second-largest coefficient
  * The remaining significant features are, in order from most to least significant:
    * If the engine is 12 cylinder
    * If the car is an electric vehicle
    * If the car is a convertible
    * If there is a lien on the title of the vehicle
    * If the car is an offroad vehicle
    * If the car is a truck
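
The coefficient tables in the Evaluation section are sorted by signed value, so strongly negative coefficients end up at the bottom or are left out. One possible way to rank by strength of effect instead is to sort by absolute value, sketched here with the six Linear Regression coefficients reported above (`lin_coef` and `ranked` are names introduced for this illustration).

```python
import pandas as pd

# Linear Regression coefficients copied from the Evaluation output
lin_coef = pd.DataFrame({
    'Feature Name': [
        'ohe__fuel_diesel', 'num__age^2 odometer^2', 'ohe__cylinders_8 cylinders',
        'ohe__type_sedan', 'ohe__drive_fwd', 'num__age odometer',
    ],
    'Coefficient': [13.30243, 6.940993, 4.683232, -3.044951, -3.92289, -12.058363],
})

# Sort by magnitude so large negative effects rank alongside large positive ones
ranked = lin_coef.sort_values(by='Coefficient', key=lambda s: s.abs(),
                              ascending=False)
print(ranked)
```

Ranked this way, the age-odometer interaction comes second, directly behind diesel fuel. The numeric features were standardized while the one-hot features were not, so magnitudes across the two groups are only roughly comparable.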

## Summary

After performing this analysis, it has been determined that a Ridge Regression model was the most accurate in predicting the sales price of a used car based on all known attributes of the car. Its root mean squared prediction error on the test set is approximately 8,350 dollars.

Additionally, the models that were generated were analyzed to assess the most important factors that determine the final sales price of a used car. The Ridge Regression model, while most accurate, was mostly based on the age of the vehicle and how many miles are on the odometer. In contrast, the other models identified more features that impact the sales price. Although these models are less accurate, their error margin is still within 1000 dollars of the Ridge model. Other factors that are important in determining the sales price of a car include:
* The car uses diesel fuel
* The engine is 12 cylinder
* The car is electric
* The car is a convertible
* There is a lien on the car title
* The car is offroad
* The car is a truck

*Note that the above features might not all be present in a single car. For instance, there wouldn't be a diesel-fueled electric car. This list should be interpreted as a checklist of decreasing importance.*