# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.filters.filtertools import convolution_filter
import statsmodels.graphics.tsaplots as tsplots
from statsmodels.tsa.seasonal import seasonal_decompose, STL
from statsmodels.tsa.forecasting.stl import STLForecast
from statsmodels.tsa.arima.model import ARIMA

In [17]:
data = pd.read_csv('data/vehicles.csv')
data = data.dropna()
data.head(5)

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
126,7305672709,auburn,0,2018.0,chevrolet,express cargo van,like new,6 cylinders,gas,68472.0,clean,automatic,1GCWGAFP8J1309579,rwd,full-size,van,white,al
127,7305672266,auburn,0,2019.0,chevrolet,express cargo van,like new,6 cylinders,gas,69125.0,clean,automatic,1GCWGAFP4K1214373,rwd,full-size,van,white,al
128,7305672252,auburn,0,2018.0,chevrolet,express cargo van,like new,6 cylinders,gas,66555.0,clean,automatic,1GCWGAFPXJ1337903,rwd,full-size,van,white,al
215,7316482063,birmingham,4000,2002.0,toyota,echo,excellent,4 cylinders,gas,155000.0,clean,automatic,JTDBT123520243495,fwd,compact,sedan,blue,al
219,7316429417,birmingham,2500,1995.0,bmw,525i,fair,6 cylinders,gas,110661.0,clean,automatic,WBAHD6322SGK86772,rwd,mid-size,sedan,white,al


From the data above, we can observe certain factors that have the potential to raise or lower vehicle prices. Framed as a data task, this is a supervised regression problem with price as the target: we ask which features contribute most to predicting how expensive a vehicle is, and relay that information to the used car dealership.

In [19]:
# List of the unique cylinder labels, kept for reference. Rows labeled 'other' are removed
# and the ' cylinders' suffix is stripped so the column can later be converted to integers.
unique_cyl = ['6 cylinders', '4 cylinders', '8 cylinders', '5 cylinders','10 cylinders', '3 cylinders', '12 cylinders']
# .copy() avoids pandas' SettingWithCopyWarning when the column is reassigned below
data = data[~data['cylinders'].str.contains('other')].copy()
data['cylinders'] = data['cylinders'].str.replace(' cylinders', '', regex=False)


In [24]:
data['cylinders'] = data['cylinders'].astype(int)
data.head()

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
126,7305672709,auburn,0,2018.0,chevrolet,express cargo van,like new,6,gas,68472.0,clean,automatic,1GCWGAFP8J1309579,rwd,full-size,van,white,al
127,7305672266,auburn,0,2019.0,chevrolet,express cargo van,like new,6,gas,69125.0,clean,automatic,1GCWGAFP4K1214373,rwd,full-size,van,white,al
128,7305672252,auburn,0,2018.0,chevrolet,express cargo van,like new,6,gas,66555.0,clean,automatic,1GCWGAFPXJ1337903,rwd,full-size,van,white,al
215,7316482063,birmingham,4000,2002.0,toyota,echo,excellent,4,gas,155000.0,clean,automatic,JTDBT123520243495,fwd,compact,sedan,blue,al
219,7316429417,birmingham,2500,1995.0,bmw,525i,fair,6,gas,110661.0,clean,automatic,WBAHD6322SGK86772,rwd,mid-size,sedan,white,al


### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

The main objective is to identify the features that influence vehicle price the most and to build the analysis around them. Some features, such as manufacturer, are useful for splitting the data into groups, so that cars from the same manufacturer can be compared to understand price discrepancies. Quantifying descriptive features such as condition would also provide more numeric data points to work with.

Step 1: Quantify any descriptive elements that could be useful.

Step 2: Separate the data into different groups to simplify analysis.
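As a rough illustration of what such a first pass could look like, the sketch below checks missing values, counts suspicious zero prices, and compares price levels per manufacturer. The `toy` table and its values are hypothetical stand-ins, not rows from the real dataset.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the listings table (not real data).
toy = pd.DataFrame({
    "manufacturer": ["ford", "ford", "honda", "honda", "bmw", None],
    "condition": ["good", "excellent", None, "fair", "like new", "good"],
    "price": [12000, 18000, 9000, 0, 30000, 7000],
    "odometer": [90000.0, 40000.0, 120000.0, np.nan, 15000.0, 150000.0],
})

# Share of missing values per column, worst first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Number of listings with a price of 0, which are unlikely to be real sale prices.
zero_price_count = int((toy["price"] == 0).sum())

# Step 2 in miniature: compare price levels across manufacturer groups.
# groupby drops rows whose manufacturer is missing by default.
price_by_maker = toy.groupby("manufacturer")["price"].agg(["count", "median"])

print(missing_share)
print(zero_price_count)
print(price_by_maker)
```

On the full dataset, the same three checks would show how much data a blanket `dropna()` discards and which columns are responsible.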

In [113]:
data['condition'].unique()
convert_condition_dict = {'salvage' : 0, 'fair' : 1, 'good' : 2, 'excellent' : 3, 'like new' : 4, 'new' : 5}
data_new = data.replace({'condition': convert_condition_dict})
# Drop columns that are not used in this analysis
remove_cols = ['fuel', 'title_status', 'transmission', 'VIN', 'drive', 'paint_color', 'state', 'id']
data_new = data_new.drop(labels=remove_cols, axis=1)
# Remove any vehicles with a price of 0 or an odometer reading of 0
data_new = data_new[data_new['price'] > 0 ]
data_new = data_new[data_new['odometer'] > 0]
data_new.sample(10)

Unnamed: 0,region,price,year,manufacturer,model,condition,cylinders,odometer,size,type
23244,bakersfield,31995,2017.0,ford,f-150 xlt,3,6,111574.0,full-size,truck
164058,waterloo / cedar falls,10900,2012.0,audi,a5,3,4,93000.0,sub-compact,convertible
388349,vermont,9990,1991.0,honda,acty,2,3,38700.0,full-size,truck
16795,tucson,10995,2006.0,chevrolet,silverado 2500,3,8,193410.0,full-size,pickup
117080,tampa bay area,80000,2019.0,bmw,m550i xdrive,4,8,7000.0,full-size,sedan
172963,louisville,33990,2007.0,chevrolet,cc4500,2,8,77581.0,full-size,truck
220570,joplin,14901,2012.0,acura,mdx,3,6,134092.0,full-size,SUV
265752,albany,13950,2014.0,honda,civic,3,4,47026.0,compact,sedan
421941,madison,15985,2010.0,ram,1500,2,8,105831.0,full-size,truck
302466,toledo,21572,2018.0,honda,cr-v,3,4,42416.0,mid-size,SUV


Here we can see that with a trimmed down data set it becomes a little easier to understand the driving factors of car cost. There is much more to interpret and to prepare in order to make an informed decision.

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
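A minimal sketch of two of the transformations mentioned above, a logarithm on the target and standard scaling on a feature, is shown here. The numbers are made up for illustration, and choosing `log1p` and `StandardScaler` is an assumption about what might suit this data rather than a prescribed recipe.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "price": [2500, 4000, 9000, 16000, 48000],
    "odometer": [180000.0, 155000.0, 90000.0, 47000.0, 8000.0],
})

# log1p compresses the long right tail of price; expm1 undoes it exactly.
log_price = np.log1p(toy["price"])
recovered = np.expm1(log_price)

# StandardScaler centers the column to mean 0 and scales it to unit variance.
# It divides by the population standard deviation (ddof=0), whereas
# pandas' Series.std() defaults to the sample standard deviation (ddof=1).
scaler = StandardScaler()
odometer_scaled = scaler.fit_transform(toy[["odometer"]])

print(log_price.round(3).tolist())
print(odometer_scaled.ravel().round(3))
```

When the data are split into training and test sets, the scaler would normally be fitted on the training portion only and then applied to the test portion with `transform`.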

In [95]:
# Split data by manufacturer: keep only the Chevrolet listings
data_chevrolet = data_new[data_new['manufacturer'].str.contains('chevrolet')]
data_chevrolet.sample(10)

Unnamed: 0,region,price,year,manufacturer,model,condition,cylinders,odometer,size,type
347236,greenville / upstate,48900,2018.0,chevrolet,silverado 2500 hd ltz,4,8,43034.0,full-size,truck
213748,minneapolis / st paul,10995,2009.0,chevrolet,silverado 1500,2,8,184728.0,full-size,truck
49820,reno / tahoe,7500,1991.0,chevrolet,s10,2,8,164296.0,compact,pickup
308589,tulsa,9900,2015.0,chevrolet,equinox lt,2,4,143214.0,mid-size,SUV
198829,flint,13900,2008.0,chevrolet,silverado 3500hd,2,8,181364.0,full-size,truck
396082,richmond,39995,2016.0,chevrolet,silverado 3500hd,3,8,35809.0,full-size,truck
363188,amarillo,29900,2010.0,chevrolet,corvette grand sport,4,8,63350.0,full-size,coupe
248674,las vegas,6900,1991.0,chevrolet,corvette coupe,2,8,119286.0,full-size,coupe
154767,south bend / michiana,4595,2005.0,chevrolet,cobalt,3,4,120314.0,compact,sedan
210824,bemidji,16990,2012.0,chevrolet,avalanche 4x4,3,8,99179.0,full-size,pickup


In [96]:
chevy_plot = px.scatter(x = np.log10(data_chevrolet['odometer']), y = data_chevrolet['price'], trendline='ols')
chevy_plot.show()
high_mile_chevies = data_chevrolet[data_chevrolet['odometer'] > 500_000]
high_mile_chevies

Unnamed: 0,region,price,year,manufacturer,model,condition,cylinders,odometer,size,type
12564,phoenix,9500,1979.0,chevrolet,k10,2,8,583120.0,full-size,pickup
190505,western massachusetts,11999,2009.0,chevrolet,tahoe,2,8,999999.0,full-size,SUV
417643,green bay,39500,1958.0,chevrolet,impala,2,8,1710000.0,full-size,coupe


Interestingly, we run into outliers that probably exist across all manufacturers in this data. For now, I will take at face value that there is a single Chevy Impala on the resale market that is in good condition and has been driven 1.71 million miles.

Now to prepare the data for modeling and sklearn.

In [102]:
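# Normalize odometer to a z-score (zero mean, unit standard deviation).
# Work on a copy so the raw odometer values in data_new stay available for later cells.
norm_df = data_new.copy()
o_mean = data_new['odometer'].mean()
o_std = data_new['odometer'].std()
norm_df['odometer'] = (data_new['odometer'] - o_mean) / o_std
norm_df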

Unnamed: 0,region,price,year,manufacturer,model,condition,cylinders,odometer,size,type
215,birmingham,4000,2002.0,toyota,echo,3,4,0.446696,compact,sedan
219,birmingham,2500,1995.0,bmw,525i,1,6,0.013412,mid-size,sedan
268,birmingham,9000,2008.0,mazda,miata mx-5,3,4,-0.513900,compact,convertible
337,birmingham,8950,2011.0,ford,f-150,3,6,0.534645,full-size,truck
338,birmingham,4000,1972.0,mercedes-benz,benz,1,6,-0.207056,full-size,coupe
...,...,...,...,...,...,...,...,...,...,...
426785,wyoming,23495,2015.0,ford,f150 xlt 4x4,4,8,0.366516,full-size,truck
426788,wyoming,12995,2016.0,chevrolet,cruze lt,4,4,-0.470639,compact,sedan
426792,wyoming,32999,2014.0,ford,"f350, xlt",3,8,0.443198,full-size,pickup
426793,wyoming,15999,2018.0,chevrolet,"cruze, lt",3,4,-0.711638,mid-size,sedan


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
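One way the cross-validation step could be sketched is shown below: polynomial pipelines of several degrees are compared by 5-fold cross-validated mean squared error. The data are synthetic (a made-up log-price that falls with mileage plus noise), so the scores say nothing about the real listings; the fold count and the candidate degrees are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: log-price falls linearly with mileage, plus Gaussian noise.
odometer = rng.uniform(5_000, 200_000, size=300)
log_price = 10.5 - 6e-6 * odometer + rng.normal(0, 0.2, size=300)
X = odometer.reshape(-1, 1)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mse = {}
for degree in (1, 2, 3):
    # Scaling first keeps the polynomial terms numerically well behaved.
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    # sklearn reports negated MSE so that larger is better; flip the sign back.
    scores = cross_val_score(model, X, log_price, cv=cv,
                             scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()

print(cv_mse)
```

Because every fold is scored on data the model did not see during fitting, the degree with the lowest value in `cv_mse` is a more trustworthy choice than the one with the lowest training error.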

In [125]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
#Pipeline with degree 2 polynomial features and normal Linear Regression
pipe_deg2 = Pipeline([('quad_features', PolynomialFeatures(degree=2, include_bias=False)),
                ('quad_model', LinearRegression())])
X = data_new[['odometer']]
y = data_new['price']
pipe_deg2.fit(X, y)
pipe_mse = mean_squared_error(y, pipe_deg2.predict(X))
pipe_deg2.predict(np.array([[10000]]))


X does not have valid feature names, but PolynomialFeatures was fitted with feature names



array([24479.3076968])

In [126]:
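# Pipeline with degree 3 polynomial features and ordinary linear regression
pipe_deg3 = Pipeline([('cubic_features', PolynomialFeatures(degree=3, include_bias=False)),
                     ('cubic_model', LinearRegression())])
pipe_deg3.fit(X,y)
# Stored under its own name so the degree 2 MSE in pipe_mse is not overwritten
pipe_deg3_mse = mean_squared_error(y, pipe_deg3.predict(X))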
pipe_deg3.predict(np.array([[10000]]))


X does not have valid feature names, but PolynomialFeatures was fitted with feature names



array([26483.24464038])

In [143]:
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SequentialFeatureSelector
X_train, X_test, y_train, y_test = train_test_split(data_new.drop('price', axis=1), np.log1p(data_new.price),
                                                   test_size=.3)
poly_features = PolynomialFeatures(degree = 3, include_bias=False)
X_train_poly = poly_features.fit_transform(X_train[['condition', 'odometer']])
# Use transform (not fit_transform) so the test set reuses the transformer fitted on the training set
X_test_poly = poly_features.transform(X_test[['condition', 'odometer']])
columns = poly_features.get_feature_names_out()
print(columns)
train_df = pd.DataFrame(X_train_poly, columns=columns)
test_df = pd.DataFrame(X_test_poly, columns=columns)

['condition' 'odometer' 'condition^2' 'condition odometer' 'odometer^2'
 'condition^3' 'condition^2 odometer' 'condition odometer^2' 'odometer^3']


In [144]:
linReg = LinearRegression()
linReg.fit(train_df, y_train)
train_preds = linReg.predict(train_df)
test_preds = linReg.predict(test_df)
train_mse = mean_squared_error(y_train, train_preds)
test_mse = mean_squared_error(y_test, test_preds)
print(train_mse)
print(test_mse)
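# Predict for condition = 1 and odometer = 10,000 by expanding the raw inputs with the fitted
# transformer (writing the powers by hand would need **, since ^ is bitwise XOR in Python)
new_car = pd.DataFrame({'condition': [1], 'odometer': [10000]})
linReg.predict(pd.DataFrame(poly_features.transform(new_car), columns=columns))

2.217345801211969
2.147094206638203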

In [165]:
# Create a DataFrame to compare the model's predictions against mileage
end_df = test_df.copy()
# expm1 inverts the log1p transform that was applied to price before fitting
end_df['pred price'] = np.expm1(test_preds)
end_df['log odometer'] = np.log10(end_df['odometer'])
end_fig = px.scatter(end_df, x='log odometer', y='pred price')
end_fig.show()

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

After fitting a model of the relationship between odometer reading and price, we can see that the two features are related: price tends to fall as mileage rises. However, the data were more skewed than I had expected, and they need further adjustment before the full picture emerges.
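To put a number on a relationship like this when the data are skewed, a rank-based correlation can be more informative than the usual Pearson coefficient. The sketch below uses hypothetical values, including one implausible odometer reading, and computes Spearman's correlation as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical listings; the last row mimics an implausible odometer entry.
toy = pd.DataFrame({
    "odometer": [8000, 47000, 90000, 120000, 155000, 999999],
    "price": [48000, 16000, 12000, 9000, 4000, 11999],
})

# Pearson measures linear association and is pulled around by extreme values.
pearson = toy["odometer"].corr(toy["price"])

# Spearman is Pearson applied to ranks, so it only depends on the ordering.
spearman = toy["odometer"].rank().corr(toy["price"].rank())

print(round(pearson, 3), round(spearman, 3))
```

In this toy example both coefficients are negative, but the single extreme reading weakens the Pearson value far more than the rank-based one.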

In [171]:
# To better understand the data, I would split it by region and by manufacturer; we'll check one view here
check_df = data_new.copy()
check_df['log odometer'] = np.log10(data_new['odometer'])
manufacturer_fig = px.scatter(check_df, x='manufacturer', y='price', color='region')
manufacturer_fig.show()

Here we have a much better idea of what is skewing the price data. Without proper context it is impossible to say why some cars are listed at a price of 1. To make the data fit for further analysis, we should restrict it to a plausible price range.
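A minimal sketch of such a restriction follows. The bounds `MIN_PRICE` and `MAX_PRICE` are hypothetical values picked for illustration; in practice they would be agreed with the dealership or derived from the price distribution.

```python
import pandas as pd

# Hypothetical prices, including placeholder and clearly erroneous entries.
toy = pd.DataFrame({
    "price": [1, 1, 2500, 4000, 9000, 16000, 31995, 80000, 123456789],
})

# Assumed plausible range for a used-car listing (illustrative only).
MIN_PRICE, MAX_PRICE = 500, 150_000

# between() is inclusive on both ends by default.
plausible = toy[toy["price"].between(MIN_PRICE, MAX_PRICE)]
dropped = len(toy) - len(plausible)

print(len(plausible), dropped)
```

Reporting how many rows are dropped alongside the filter makes it easier to judge whether the chosen bounds are too aggressive.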

In [173]:
odometer_fig = px.scatter(check_df, x='odometer', y='price', color='manufacturer')
odometer_fig.show()

Here it becomes apparent that the odometer values are more skewed than originally thought. While the model predicts prices with some accuracy, its figure needs context to be fully understood.

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

I have finished my analysis of car prices using a dataset of about 426,000 used cars that have been either sold or listed at a given price. The model predicts that a car's value drops the more it has been driven before sale. Cars with more than 100,000 miles begin to see a drastic drop in value, as shown in the figure below. Newer cars with fewer miles will sell at a higher value, regardless of their listed condition.

In [174]:
end_fig.show()
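For a dealership audience, a table of typical prices per mileage band may be easier to act on than a scatter plot. The sketch below shows one way to build it; the listings and the band edges are hypothetical and would need to be replaced with the cleaned data and thresholds that make sense for the client.

```python
import pandas as pd

# Hypothetical listings for illustration only.
toy = pd.DataFrame({
    "odometer": [10000, 30000, 45000, 60000, 90000, 110000, 140000, 160000, 210000],
    "price": [30000, 26000, 24000, 20000, 18000, 12000, 10000, 7000, 5000],
})

# Assign each car to a mileage band (edges are an assumption, right-inclusive).
bands = pd.cut(
    toy["odometer"],
    bins=[0, 50_000, 100_000, 150_000, 250_000],
    labels=["0-50k", "50-100k", "100-150k", "150-250k"],
)

# The median is less sensitive to extreme listings than the mean.
median_by_band = toy.groupby(bands, observed=True)["price"].median()

print(median_by_band)
```

Reading down the resulting table shows directly how much the typical asking price changes from one mileage band to the next.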