# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

From the CRISP-DM Manual: 
This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

*Determine Business Objectives:*
* Understand what makes cars more or less expensive. In other words, determine what features of a car contribute to that car's increase or decrease in value.

*Assess Situation:*
* Resources: We have a Kaggle dataset with data on 426k used cars. We will investigate it in the Data Understanding Section.

*Determine Data Mining Goals:*
* Data Mining Goals:
    * Create models that can demonstrate the relationship between features of cars and the value of those cars in order to investigate how the features correlate with the values.
* Data Mining Success Criteria:
    * Find factors which contribute to value of each car. Determine to what degree they do and how (i.e., do they increase or decrease value, and at what rate?)

*Produce Project Plan:*
* Project Plan: Fine tune the data for our models, then progressively try models until we can come away with actionable information for used car sales personnel to adequately assess inventory to maximize the potential of their sales. **In other words, how to determine the most valuable features that determine the price of their cars.**


* Initial Assessment (of Tools, and Techniques): The dataset needs to be prepared for our analysis. We are aware of techniques such as linear regression, sequential feature selection, and regularization, which can help us draw actionable conclusions about which features are most important in determining the price of a car.
* For numeric entries, we can use multivariate techniques like sequential feature selection or ridge regression, and use standardization to account for the differences in these features' magnitudes.
* For non-numeric entries, we can create dummy-encoded versions of a given column and use `LinearRegression` to determine how each value in that column predicts price. We will come away with an understanding of how certain aspects of a car can be used to determine the price of that car.
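
To make the dummy-encoding idea concrete, here is a minimal sketch on a made-up five-row table (the values are illustrative and are not taken from the Kaggle dataset). It suggests that when a regression is fit on the dummy columns of a single feature without an intercept, each coefficient is simply the mean price of that category.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "fuel": ["gas", "gas", "diesel", "diesel", "hybrid"],
    "price": [10000, 14000, 30000, 34000, 18000],
})

# One dummy column per category; dtype=float keeps the design matrix numeric.
X = pd.get_dummies(toy[["fuel"]], dtype=float)
y = toy["price"]

# With no intercept, least squares assigns each category its own mean price.
model = LinearRegression(fit_intercept=False).fit(X, y)
coefs = pd.Series(model.coef_, index=X.columns)
group_means = toy.groupby("fuel")["price"].mean()

print(coefs)
print(group_means)
```

Read this way, a single-feature coefficient table is a table of average prices per category, which also means it is sensitive to a few extreme prices within a category.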


### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

* Determine which features of the cars are likely to be useful and which may not matter in our analysis.
* Find NaNs and consider dropping the affected rows or columns.
* Note which columns are useful and which are not.

Inspecting the data reveals many duplicates, as well as many entries with missing values in columns that could be crucial for analysis. Both should be dropped during preparation.
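
The checks listed above can be sketched as follows. This is a minimal illustration on a made-up frame; the column names mirror the dataset, but the rows are invented.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "price": [6000, 6000, 11900, 0, 21000],
    "year": [2012.0, 2012.0, np.nan, 2019.0, 2020.0],
    "size": ["mid-size", "mid-size", np.nan, np.nan, np.nan],
})

# Share of missing values per column, worst first.
nan_share = toy.isna().mean().sort_values(ascending=False)

# Number of rows that repeat an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

# Prices of 0 or 1 are unlikely to be real sale prices.
n_suspicious_prices = int((toy["price"] <= 1).sum())

print(nan_share)
print(n_duplicates, n_suspicious_prices)
```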

***Feature analysis***
* By analyzing the columns, we can anticipate which features of our vehicles will be helpful for predicting price and which may be arbitrary.
* Year, Manufacturer, Condition, Cylinders, Fuel, Odometer, Transmission, Type, Paint_color may all reliably contribute to price
* Region may be analyzed, but may not be useful
* Size is missing many values, and may be worth dropping
* VIN is arbitrary and can be dropped for analysis
* ID is arbitrary, but can be used for our index

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

***General possible integrity issues***
* Integrity issues: NaNs in both numeric and non-numeric columns, and duplicate rows
* Considerations: useful vs. non-useful columns
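
One of the transformations mentioned above, a logarithm of the target, can be sketched with scikit-learn's `TransformedTargetRegressor`. This is only an illustration under stated assumptions: the `age` feature and the synthetic prices are invented, and the notebook itself does not apply a log transform.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): price decays roughly 10% per year of age.
rng = np.random.default_rng(0)
age = rng.uniform(0, 20, size=200)
price = 30000 * np.exp(-0.1 * age) * rng.lognormal(0, 0.1, size=200)
X = age.reshape(-1, 1)

# Fit on log1p(price); predictions are mapped back to dollars with expm1.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X, price)

slope = log_model.regressor_.coef_[0]  # change in log-price per year of age
preds = log_model.predict(X)           # back on the original price scale
print(slope)
```

A log target tends to reduce the influence of a handful of very large prices, which is relevant when a few listings are orders of magnitude above the rest.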

In [863]:
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt

vehicles = pd.read_csv('data/vehicles.csv', index_col=0)
vehicles.info

<bound method DataFrame.info of                             region  price    year manufacturer  \
id                                                               
7222695916                prescott   6000     NaN          NaN   
7218891961            fayetteville  11900     NaN          NaN   
7221797935            florida keys  21000     NaN          NaN   
7222270760  worcester / central MA   1500     NaN          NaN   
7210384030              greensboro   4900     NaN          NaN   
...                            ...    ...     ...          ...   
7301591192                 wyoming  23590  2019.0       nissan   
7301591187                 wyoming  30590  2020.0        volvo   
7301591147                 wyoming  34990  2020.0     cadillac   
7301591140                 wyoming  28990  2018.0        lexus   
7301591129                 wyoming  30590  2019.0          bmw   

                               model condition    cylinders    fuel  odometer  \
id                          
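
The output above starts with `<bound method DataFrame.info ...` because `vehicles.info` was referenced without parentheses, so the notebook displays the method object instead of calling it. A minimal sketch on a made-up frame shows the difference:

```python
import io
import pandas as pd

# Toy frame, invented for illustration only.
toy = pd.DataFrame({"price": [6000, 11900], "year": [None, 2019.0]})

# Without parentheses this is only a reference to the method.
is_method = callable(toy.info)

# Calling the method writes dtypes and non-null counts; buf captures the text.
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# head() likewise needs parentheses to return the first rows.
first_rows = toy.head()
```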

In [864]:
vehicles = vehicles.drop_duplicates()
vehicles.head
#We've dropped about 50k rows by dropping duplicates

<bound method NDFrame.head of                             region  price    year manufacturer  \
id                                                               
7222695916                prescott   6000     NaN          NaN   
7218891961            fayetteville  11900     NaN          NaN   
7221797935            florida keys  21000     NaN          NaN   
7222270760  worcester / central MA   1500     NaN          NaN   
7210384030              greensboro   4900     NaN          NaN   
...                            ...    ...     ...          ...   
7301591192                 wyoming  23590  2019.0       nissan   
7301591187                 wyoming  30590  2020.0        volvo   
7301591147                 wyoming  34990  2020.0     cadillac   
7301591140                 wyoming  28990  2018.0        lexus   
7301591129                 wyoming  30590  2019.0          bmw   

                               model condition    cylinders    fuel  odometer  \
id                            

***Too many NaNs for analysis***

This dataset has plenty of NaNs in certain columns. Anything with a high number of NaN values should be omitted from the data.

In [865]:
# print(vehicles.isna().sum().sort_values(ascending=False))

***Deciding what to do***
* We can remove size for having too many NaNs. If we parse away NaNs later, we might still be able to analyze certain columns like cylinders.
* VIN is not useful
* We should remove 'model', its entries are too specific.
* Let's analyse some columns for usability.
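
The column-dropping decision can also be expressed as a rule rather than by hand. The sketch below assumes a hypothetical cutoff, `MAX_NAN_SHARE`, which is not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500],
    "size": [np.nan, np.nan, np.nan, "compact"],
    "cylinders": ["4 cylinders", np.nan, "8 cylinders", "6 cylinders"],
})

MAX_NAN_SHARE = 0.5  # hypothetical threshold, chosen only for this example

# Columns whose share of missing values exceeds the threshold.
nan_share = toy.isna().mean()
too_sparse = nan_share[nan_share > MAX_NAN_SHARE].index.tolist()

trimmed = toy.drop(columns=too_sparse)
print(too_sparse)
```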

In [866]:
# print(vehicles['title_status'].value_counts())

In [867]:
#Drop size -- too many NaNs
vehicles = vehicles.drop('size', axis=1)
#Drop VIN, model, region, state -- not useful for analysis
vehicles = vehicles.drop(['VIN','model','region','state'], axis=1)

#Remove placeholder prices (price == 0 or price == 1)
vehicles = vehicles[~vehicles['price'].isin([0, 1])]
# vehicles.info

In [868]:
# print(vehicles['cylinders'].value_counts())
cylinder_replacement_map = {'3 cylinders':3,'4 cylinders':4,'5 cylinders':5,'6 cylinders':6,'8 cylinders':8,'10 cylinders':10, '12 cylinders':12, 'other':''}
vehicles = vehicles.replace({'cylinders' : cylinder_replacement_map})
#drop empty value
vehicles = vehicles[vehicles['cylinders']!='']
#print new values
# print(vehicles['cylinders'].dropna().value_counts())
vehicles['cylinders'] = pd.to_numeric(vehicles['cylinders'])
# numeric_columns = vehicles.select_dtypes('number').columns
# print(f"Numeric columns: {list(numeric_columns)}")

#Success!

vehicles_num = vehicles.select_dtypes(include=np.number)
# vehicles_num.info
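
As an alternative to the replacement map, the cylinder count could be pulled out of the text with a regular expression. This is a sketch of a different technique, not the method used above, and it behaves slightly differently: `'other'` becomes NaN instead of the row being dropped.

```python
import pandas as pd

# Example values in the same format as the 'cylinders' column.
raw = pd.Series(["4 cylinders", "8 cylinders", "other", None, "10 cylinders"])

# Extract the leading digits; entries without digits become NaN.
cylinders = pd.to_numeric(raw.str.extract(r"(\d+)", expand=False), errors="coerce")
print(cylinders)
```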

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
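
Cross-validation of a regression pipeline can be sketched as follows. The data here is synthetic and the `Ridge(alpha=1.0)` setting is an arbitrary assumption; the point is only the shape of the `cross_val_score` call.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target, invented for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 5000 * X[:, 0] - 2000 * X[:, 2] + rng.normal(scale=500, size=300)

pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge(alpha=1.0))])

# scikit-learn reports negated MSE so that larger scores are always better.
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_squared_error")
cv_mse = -scores.mean()
print(cv_mse)
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no information from the held-out fold leaks into the preprocessing.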

### Non-numeric features: Single feature Linear Regression

***Manufacturer***

In [869]:
#These imports are needed here; the single-feature cells below rely on them too
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = pd.get_dummies(vehicles[['manufacturer']])
# print(X.columns.to_list)
y = vehicles['price']
manu_linreg = LinearRegression(fit_intercept=False).fit(X, y)
# print(manu_linreg.coef_)
manu_coef = pd.DataFrame({'manufacturer':list(X.columns),'coef':manu_linreg.coef_})
# print(manu_coef)
sorted_manu = manu_coef.sort_values(by='coef')
print(sorted_manu)
manu_mse = mean_squared_error(y, manu_linreg.predict(X))
# print(manu_mse)



                    manufacturer           coef
27          manufacturer_mercury    5876.729424
36           manufacturer_saturn    7253.224707
32          manufacturer_pontiac    8554.404192
8          manufacturer_chrysler   11243.786842
16            manufacturer_honda   11667.566560
17          manufacturer_hyundai   12241.834633
21              manufacturer_kia   12917.249017
12             manufacturer_fiat   12984.507375
30           manufacturer_morgan   13100.000000
15  manufacturer_harley-davidson   13354.756757
40       manufacturer_volkswagen   13438.250868
25            manufacturer_mazda   13772.292089
37           manufacturer_subaru   13911.915148
29       manufacturer_mitsubishi   14862.913656
22       manufacturer_land rover   15103.000000
28             manufacturer_mini   15310.870518
9            manufacturer_datsun   15757.200000
10            manufacturer_dodge   17467.666731
6          manufacturer_cadillac   20898.713694
23            manufacturer_lexus   20903

***Drive***

In [870]:
X = pd.get_dummies(vehicles[['drive']])

y = vehicles['price']
drive_linreg = LinearRegression(fit_intercept=False).fit(X, y)

drive_coef = pd.DataFrame({'drive':list(X.columns),'coef':drive_linreg.coef_})

sorted_drive = drive_coef.sort_values(by='coef')
print(sorted_drive)
drive_mse = mean_squared_error(drive_linreg.predict(X),y)
# print(drive_mse)
var = np.var(y - drive_linreg.predict(X))
# print(var)




       drive           coef
1  drive_fwd   17324.612954
2  drive_rwd   45931.616868
0  drive_4wd  138212.830207


***Condition***

In [871]:
X = pd.get_dummies(vehicles[['condition']])

y = vehicles['price']
cond_linreg = LinearRegression(fit_intercept=False).fit(X, y)

cond_coef = pd.DataFrame({'cond':list(X.columns),'coef':cond_linreg.coef_})

sorted_cond = cond_coef.sort_values(by='coef')
print(sorted_cond)
mse = mean_squared_error(cond_linreg.predict(X),y)
# print(mse)
var = np.var(y - cond_linreg.predict(X))
# print(var)




                  cond           coef
5    condition_salvage    3736.608379
4        condition_new   29779.991471
2       condition_good   33613.216880
3   condition_like new   42033.571666
0  condition_excellent   62320.623768
1       condition_fair  796604.879988


***Fuel***

In [872]:
X = pd.get_dummies(vehicles[['fuel']])

y = vehicles['price']
fuel_linreg = LinearRegression(fit_intercept=False).fit(X, y)

fuel_coef = pd.DataFrame({'fuel':list(X.columns),'coef':fuel_linreg.coef_})

sorted_fuel = fuel_coef.sort_values(by='coef')
print(sorted_fuel)
mse = mean_squared_error(fuel_linreg.predict(X),y)
# print('mse: ', mse)
var = np.var(y - fuel_linreg.predict(X))
# print('var: ',var)




            fuel           coef
3    fuel_hybrid   15577.421172
1  fuel_electric   26490.500000
4     fuel_other   76943.532335
2       fuel_gas   89552.846094
0    fuel_diesel  149649.129281


***Title status***

In [873]:
X = pd.get_dummies(vehicles[['title_status']])

y = vehicles['price']
title_linreg = LinearRegression(fit_intercept=False).fit(X, y)

title_coef = pd.DataFrame({'title status':list(X.columns),'coef':title_linreg.coef_})

sorted_title = title_coef.sort_values(by='coef')
print(sorted_title)
mse = mean_squared_error(title_linreg.predict(X),y)
# print('mse: ', mse)
var = np.var(y - title_linreg.predict(X))
# print('var: ',var)


              title status          coef
5     title_status_salvage  10847.200115
2     title_status_missing  11993.183060
3  title_status_parts only  13648.404908
4     title_status_rebuilt  15572.034870
1        title_status_lien  22454.155684
0       title_status_clean  95358.841552


### Considerations -- Multidimensional Data
Because our data is multidimensional (we would like to know how multiple factors jointly affect the price of a vehicle), we will likely use sequential feature selection or regularization for our eventual analysis. We will build linear regression models inside pipelines, fit them to our data, and then select the best model to represent our findings.

### Linear Regression Model -- all features


In [874]:
###Linear regression
import plotly.express as px
import numpy as np
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')


vehicles_dropped = vehicles_num.dropna()
X = pd.get_dummies(vehicles_dropped.drop('price', axis = 1))
y = vehicles_dropped['price']
all_features_linreg = LinearRegression(fit_intercept=False).fit(X, y)
linreg_mse = mean_squared_error(all_features_linreg.predict(X), y)



print(all_features_linreg)
print(all_features_linreg.coef_)
print(linreg_mse)

feature_names = ('year','cylinders','odometer')
linreg_df = pd.DataFrame({'feature': feature_names, 'coef': all_features_linreg.coef_})
print(linreg_df)
# vehicles_dropped.info




LinearRegression(fit_intercept=False)
[-7.02542401e+01  3.58517492e+04  4.53417992e-02]
187808051052264.94
     feature          coef
0       year    -70.254240
1  cylinders  35851.749181
2   odometer      0.045342


In [875]:
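#imports needed by this cell and by the pipeline cells below
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso

#generate train/test data for vehicles
# vehicles =  vehicles.select_dtypes(include=np.number).dropna()
v_dropped = vehicles_num.dropna()
vehicles_X = v_dropped.drop(['price'], axis = 1)
vehicles_y = v_dropped['price']
vehicles_X_train, vehicles_X_test, vehicles_y_train, vehicles_y_test = train_test_split(vehicles_X, vehicles_y, 
                                                                       test_size = 0.3,
                                                                       random_state = 42)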

Lasso -- degree 1

In [876]:
vehicles_pipe = Pipeline([('polyfeatures', PolynomialFeatures(degree = 1, include_bias = False)),
                      ('scaler', StandardScaler()),
                     ('lasso', Lasso(random_state = 42))])
vehicles_pipe.fit(vehicles_X_train, vehicles_y_train)
lasso_coefs = vehicles_pipe.named_steps['lasso'].coef_

# print(type(lasso_coefs))
# print(lasso_coefs)


lasso_train_mse = mean_squared_error(vehicles_y_train, vehicles_pipe.predict(vehicles_X_train))
lasso_test_mse = mean_squared_error(vehicles_y_test, vehicles_pipe.predict(vehicles_X_test))
var = np.var(vehicles_y_test - vehicles_pipe.predict(vehicles_X_test))

# print(lasso_train_mse)
# print(lasso_test_mse)
# print(var)
#Higher than variance


feature_names = vehicles_pipe.named_steps['polyfeatures'].get_feature_names_out()
lasso_df = pd.DataFrame({'feature': feature_names, 'coef': lasso_coefs})

# print(type(feature_names))
lasso_df.loc[lasso_df['coef'] != 0]

Unnamed: 0,feature,coef
0,year,4631.151787
1,cylinders,35304.618688
2,odometer,-78.139261


Lasso -- degree 2

In [877]:
vehicles_pipe = Pipeline([('polyfeatures', PolynomialFeatures(degree = 2, include_bias = False)),
                      ('scaler', StandardScaler()),
                     ('lasso', Lasso(random_state = 42))])
vehicles_pipe.fit(vehicles_X_train, vehicles_y_train)
lasso_coefs = vehicles_pipe.named_steps['lasso'].coef_

# print(type(lasso_coefs))
# print(lasso_coefs)


lasso_train_mse = mean_squared_error(vehicles_y_train, vehicles_pipe.predict(vehicles_X_train))
lasso_test_mse = mean_squared_error(vehicles_y_test, vehicles_pipe.predict(vehicles_X_test))
var = np.var(vehicles_y_test - vehicles_pipe.predict(vehicles_X_test))

# print(lasso_train_mse)
# print(lasso_test_mse)
# print(var)
#Higher than variance


feature_names = vehicles_pipe.named_steps['polyfeatures'].get_feature_names_out()
lasso_df = pd.DataFrame({'feature': feature_names, 'coef': lasso_coefs})

# print(type(feature_names))
lasso_df.loc[lasso_df['coef'] != 0]

Unnamed: 0,feature,coef
0,year,-56203.18995
1,cylinders,11411.006298
2,odometer,184335.417428
3,year^2,57157.290861
4,year cylinders,184294.250604
5,year odometer,-109717.175312
6,cylinders^2,-152880.33771
7,cylinders odometer,-75506.237636
8,odometer^2,-1887.10846


In [878]:

from sklearn.feature_selection import SequentialFeatureSelector

sequential_pipe = Pipeline([('poly_features', PolynomialFeatures(degree = 2, include_bias = False)),
                           ('selector', SequentialFeatureSelector(LinearRegression(), 
                                                                  n_features_to_select=2)),
                           ('linreg', LinearRegression())])
sequential_pipe.fit(vehicles_X_train, vehicles_y_train)
sequential_train_mse = mean_squared_error(vehicles_y_train, sequential_pipe.predict(vehicles_X_train))
sequential_test_mse = mean_squared_error(vehicles_y_test, sequential_pipe.predict(vehicles_X_test))

# print(sequential_train_mse)
# print(sequential_test_mse)
sequential_pipe

# var1 = np.var(vehicles_y_train - sequential_pipe.predict(vehicles_X_train))
# print(var1)
var2 = np.var(vehicles_y_test - sequential_pipe.predict(vehicles_X_test))
# print(var2)
#Both higher than variance :c

Model Selector

In [879]:
# model_selector_pipe = Pipeline([('poly_features', PolynomialFeatures(degree = 2, include_bias = False)),
#                                 ('scaler', StandardScaler()),
#                                 ('selector', SelectFromModel(Lasso())),
#                                     ('linreg', LinearRegression())])


# model_selector_pipe.fit(vehicles_X_train, vehicles_y_train)
# selector_train_mse = mean_squared_error(vehicles_y_train, model_selector_pipe.predict(vehicles_X_train))
# selector_test_mse = mean_squared_error(vehicles_y_test, model_selector_pipe.predict(vehicles_X_test))



# print(selector_train_mse)
# print(selector_test_mse)

# var2 = np.var(vehicles_y_test - model_selector_pipe.predict(vehicles_X_test))
# print(var2)



# #Test MSE is higher than variance

GridSearchCV with StandardScaler

In [880]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [881]:
pipe = Pipeline([('polyfeatures', PolynomialFeatures(degree = 2, include_bias = False)),('scale', StandardScaler()), ('ridge', Ridge())])
param_dict = {'ridge__alpha': [0.001, 0.1, 1.0, 10.0, 100.0, 1000.0]}

#search over the full pipeline so that the 'ridge__alpha' key resolves to the ridge step
grid = GridSearchCV(pipe, param_grid=param_dict)
grid.fit(vehicles_X_train, vehicles_y_train)
train_preds = grid.predict(vehicles_X_train)
test_preds = grid.predict(vehicles_X_test)
train_mse = mean_squared_error(vehicles_y_train, train_preds)
test_mse = mean_squared_error(vehicles_y_test, test_preds)

best_alpha = grid.best_params_
print(f'Test MSE: {test_mse}')
print(f'Best Alpha: {list(best_alpha.values())[0]}')

Test MSE: 248923241710778.47
Best Alpha: 10.0
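
By default `GridSearchCV` ranks candidates with the estimator's own score, which is R² for `Ridge`. If the comparison should be made on MSE directly, the scoring can be set explicitly. The sketch below uses synthetic data and is only one possible setup.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with one interaction term, invented for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=200)

pipe = Pipeline([
    ("polyfeatures", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])
param_grid = {"ridge__alpha": [0.001, 0.1, 1.0, 10.0, 100.0, 1000.0]}

# Negated MSE is used because GridSearchCV always maximizes the score.
grid = GridSearchCV(pipe, param_grid=param_grid,
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X, y)

best_alpha = grid.best_params_["ridge__alpha"]
best_cv_mse = -grid.best_score_
print(best_alpha, best_cv_mse)
```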


In [882]:
pipe = Pipeline([('polyfeatures', PolynomialFeatures(degree = 2, include_bias = False)),('scale', StandardScaler()), ('ridge', Ridge(alpha = 10.0))])
pipe.fit(vehicles_X_train, vehicles_y_train)

ridge_coefs = pipe.named_steps['ridge'].coef_

feature_names = pipe.named_steps['polyfeatures'].get_feature_names_out()
ridge_df = pd.DataFrame({'feature': feature_names, 'coef': ridge_coefs})

print(type(feature_names))
ridge_df.loc[ridge_df['coef'] != 0]

<class 'numpy.ndarray'>


Unnamed: 0,feature,coef
0,year,-197131.3
1,cylinders,-1000064.0
2,odometer,577059.6
3,year^2,177784.4
4,year cylinders,1180929.0
5,year odometer,-499212.3
6,cylinders^2,-141303.4
7,cylinders odometer,-74170.28
8,odometer^2,-6745.02


GridSearchCV without StandardScaler

In [883]:
# pipe = Pipeline([('polyfeatures', PolynomialFeatures(degree = 2, include_bias = False)), ('ridge', Ridge())])
# param_dict = {'ridge__alpha': [0.001, 0.1, 1.0, 10.0, 100.0, 1000.0]}

# ridge = Ridge()
# grid = GridSearchCV(ridge, param_grid=params_dict)
# grid.fit(vehicles_X_train, vehicles_y_train)
# train_preds = grid.predict(vehicles_X_train)
# test_preds = grid.predict(vehicles_X_test)
# train_mse = mean_squared_error(vehicles_y_train, train_preds)
# test_mse = mean_squared_error(vehicles_y_test, test_preds)

# best_alpha = grid.best_params_
# print(f'Test MSE: {test_mse}')
# print(f'Best Alpha: {list(best_alpha.values())[0]}')

In [884]:
# pipe = Pipeline([('polyfeatures', PolynomialFeatures(degree = 2, include_bias = False)), ('ridge', Ridge(alpha = 10.0))])
# pipe.fit(vehicles_X_train, vehicles_y_train)

# ridge_coefs = pipe.named_steps['ridge'].coef_

# feature_names = pipe.named_steps['polyfeatures'].get_feature_names_out()
# lasso_df = pd.DataFrame({'feature': feature_names, 'coef': ridge_coefs})

# print(type(feature_names))
# lasso_df.loc[lasso_df['coef'] != 0]

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

#### ***Single feature linear regression***
* In these models, our MSE was never significantly lower than our variance, demonstrating that these models essentially predicted the mean of the values they were analysing. This gives us little insight into how non-numerical factors dynamically affect the price of used vehicles. However, analysing them may at least show sales personnel the average value associated with certain features of a car, like its manufacturer or fuel type, which can help them assess the value of their inventory.
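
The MSE-versus-variance comparison can be made explicit with a small sketch on made-up numbers. Note that the variance of the residuals can never exceed the MSE (the MSE equals the residual variance plus the squared mean residual), so a more informative baseline is the variance of the target itself, which is the MSE of always predicting the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy prices and predictions, invented for illustration only.
y_true = np.array([5000.0, 12000.0, 20000.0, 31000.0])
mean_preds = np.full_like(y_true, y_true.mean())
better_preds = np.array([6000.0, 11000.0, 21000.0, 29000.0])

# Always predicting the mean gives an MSE equal to the target variance.
baseline_mse = mean_squared_error(y_true, mean_preds)
target_variance = np.var(y_true)

# R^2 is the share of that baseline error a model removes.
model_mse = mean_squared_error(y_true, better_preds)
r2 = 1 - model_mse / target_variance
print(baseline_mse, model_mse, r2)
```

An R² near zero corresponds to the situation described above, where a model does little better than predicting the mean.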

#### ***Re-evaluations***
* Some models returned coefficients that were obviously wrong, like year having a negative linear regression coefficient, or odometer having a positive one. Interpreted plainly, these would imply that newer cars are cheaper and that higher-mileage cars are more expensive, which runs against intuition.
* **Revisiting the data preparation phase** helped us overcome these obvious disparities. One of the most influential changes was to omit data entries where the 'price' field was set to 0 or 1. These data points gave us no real insight into actual trends and were creating huge disparities in our models, and omitting them gave us much more robust information.
* Some models were not that useful overall and their code has been commented out, but still included to indicate that they were assessed.

#### ***Compiling new data***

**Data from LASSO model (degree 1):**


| feature | coef |
| --- | --- |
| year | 4631.151787 |
| cylinders | 35304.618688 |
| odometer | -78.139261 |

**Data from GridSearchCV model:**


| feature | coef |
| --- | --- |
| year | -1.971313e+05 |
| cylinders | -1.000064e+06 |
| odometer | 5.770596e+05 |
| year^2 | 1.777844e+05 |
| year cylinders | 1.180929e+06 |
| year odometer | -4.992123e+05 |
| cylinders^2 | -1.413034e+05 |
| cylinders odometer | -7.417028e+04 |
| odometer^2 | -6.745020e+03 |

**Data from LASSO model (degree 2):**

| feature | coef |
| --- | --- |
| year | -56203.189950 |
| cylinders | 11411.006298 |
| odometer | 184335.417428 |
| year^2 | 57157.290861 |
| year cylinders | 184294.250604 |
| year odometer | -109717.175312 |
| cylinders^2 | -152880.337710 |
| cylinders odometer | -75506.237636 |
| odometer^2 | -1887.108460 |

### Single non-numeric features

*Condition*
* condition_salvage    3736.608379
* condition_new   29779.991471
* condition_good   33613.216880
* condition_like new   42033.571666
* condition_excellent   62320.623768
* condition_fair  796604.879988

*Fuel*
* fuel_hybrid   15577.421172
* fuel_electric   26490.500000
* fuel_other   76943.532335
* fuel_gas   89552.846094
* fuel_diesel  149649.129281

*Title Status*
* title_status_salvage  10847.200115
* title_status_missing  11993.183060
* title_status_parts only  13648.404908
* title_status_rebuilt  15572.034870
* title_status_lien  22454.155684
* title_status_clean  95358.841552

*Drive*
* drive_fwd   17324.612954
* drive_rwd   45931.616868
* drive_4wd  138212.830207


#### ***Valuable takeaways***

* Our analysis using the LASSO model suggests relationships between year, cylinders, and odometer with respect to price. None of the three coefficients was shrunk to zero, so the model does not indicate that any of these features is insignificant. Because the features were standardized before fitting, each coefficient is the change in price per one standard deviation of that feature: the cylinder count has a very high impact on price, the year has a significant impact, and the odometer reading has a slight impact. The signs support our intuition, since newer cars and cars with more cylinders are priced higher, while higher mileage lowers the price.
* Car sales personnel will be happy to know that cylinder count is a huge driver of value according to the LASSO analysis, while price rises significantly with model year and falls gradually as the odometer reading increases.

* Further LASSO analysis from the degree 2 model gives us insight on how some variables interact with each other. For example, higher year values combined with higher odometer values indicate a very strong decrease in price. A similar but smaller effect can be seen with increased cylinder and odometer values. This demonstrates that greater mileage has a stronger detrimental effect on car price when occurring in newer cars and on cars with higher cylinder value.


### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

# What Drives the Price of a Car?: Used Car Feature to Price Analysis
In our analysis, we used statistical techniques to analyze what drives the value of used cars. Thanks to our analysis, we can draw promising conclusions about which car features contribute to price, and how cars can be assessed to ensure that sales personnel maintain a high-value inventory.

## Single Non-Numeric Values

Features like fuel type, condition, and title status all reliably contribute to the value of a car. Thanks to our analysis, we can pinpoint to what degree these features affect price and advise sales personnel on how valuable certain features are.

### Condition
* Our analysis demonstrates that "fair" and "excellent" condition are massive drivers of value. These conditions are associated with a much higher selling price. Cars in "like new" condition are statistically not as strongly preferred in the used car market, and "new" cars even less so, demonstrating buyers' preference for previously used cars and a likelihood that less costly brands are present in inventory. ***Used car sales personnel should prioritize "fair" and "excellent" cars** most to align with consumer interest, followed by "like new" and "good", then by "new" and "salvage".*


### Drive
* Among cars designated front-wheel drive (FWD), rear-wheel drive (RWD), and four-wheel drive (4WD), 4WD is by far the biggest contributor to value. A 4WD designation contributes about three times as much to a car's price as an RWD designation (roughly 138,000 versus 46,000), and about eight times as much as an FWD designation (roughly 17,000). ***Used car sales personnel should prioritize 4WD in their inventory**, followed by RWD and then FWD vehicles, in order to maximize the price of their stock.*


### Fuel
* In terms of fuel type, sales personnel can **expect diesel and gas cars to be the highest-priced vehicles on their lot**. Vehicles with an "other" designation are less valuable, while **electric and hybrid vehicles are less valuable still** according to our analysis. This likely reflects the used car market, where buyers want reliable cars for everyday use and may not have the means to charge an electric vehicle.

### Title Status
* In terms of title status, sales personnel can rest assured that the results match intuition: the better a car's title status, the higher its price. In essence, **a clean title indicates the highest pricing, then lien**, followed by rebuilt, parts only, missing, and salvage, in that order.


## Numeric Values

### Cylinders, Odometer, Year

* Our analysis provides promising information about the numeric features of cars in the inventory of our sales personnel. Firstly, **increasing cylinder count has a very strong effect on price**, where each increase in cylinder count increases the price of a car greatly. **Sales personnel should prioritize higher cylinder count cars** to ensure greater pricing. **Newer cars are more valuable**, where each additional model year leads to a moderate increase in price. **Sales personnel should stock up on newer car models** to maximize their pricing. Finally, price gradually decreases as the odometer reading increases. Unsurprisingly, **sales personnel should be aware that greater mileage lowers the price of a car** and should favor lower-mileage cars to maximize value.