# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

## Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

**Background:  
The relationship between the price of a used car and a single variable, such as the age of the model or the visible condition of the vehicle, is relatively intuitive (greater age decreases price, better visible condition increases price). When these variables are combined, the relationship is harder to characterize. We will therefore design and implement a machine learning regression model that combines these separate characteristics (variables), so that we can draw more precise conclusions about how to optimize pricing for a lot filled with used cars.**

**Goal:  
Use a regression model to gain a deeper understanding of consumer preferences when purchasing used cars, in order to optimize inventory and sales strategies for used car salespeople and used car lot owners.**

## Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [22]:
# Import all the necessary packages, objects, and functions
from sklearn.pipeline import Pipeline
from category_encoders import MEstimateEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler 
from sklearn.compose import make_column_transformer
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn import set_config
from zipfile import ZipFile
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import warnings
set_config(display="diagram")
warnings.filterwarnings('ignore')

# Set display parameters for all printed pandas dataframe tables
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)

# freq_encoder was tried while testing the model but did not give the best results
def freq_encoder(df, col_name):
    """Replace each category in `col_name` with its number of occurrences."""
    # value_counts() ignores missing values, so NaN entries stay NaN after mapping
    counts = df[col_name].value_counts()
    df[col_name] = df[col_name].map(counts)

    return df

#### **Initial Checks**
- Check the data types of each variable (int, object, str, etc.)
- Check the number of null values for each variable
- Check the number of unique values for each variable

In [24]:
zip_file = ZipFile('data/vehicles.zip')
cars = pd.read_csv(zip_file.open('vehicles.csv'))
# cars = pd.read_csv('data/vehicles.zip', compression = 'zip')
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [None]:
column_names = cars.columns
uniq_cnt = []
null_cnt = []
for column in column_names:
    uniq_cnt.append(len(cars[column].unique()))
    null_cnt.append(cars[column].isnull().sum())

variable_info_df = pd.DataFrame(data = zip(null_cnt,uniq_cnt), index = column_names, columns = ['n_null_values','n_uniques']).sort_values('n_null_values',ascending=False)
variable_info_df.head(20)
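
The same summary can be built without an explicit loop. The sketch below uses a small made-up frame (`toy`) as a stand-in for `cars`; the column values are assumptions for illustration only. Note that `nunique(dropna=False)` is used so that missing values count as one distinct value, which matches what `len(Series.unique())` does in the loop above.

```python
import pandas as pd

# Toy stand-in for the `cars` frame (values are made up for illustration)
toy = pd.DataFrame({
    "price": [5000, 12000, 8000, 12000],
    "condition": ["good", None, "fair", None],
    "fuel": ["gas", "gas", "diesel", None],
})

summary = pd.DataFrame({
    # Number of missing entries per column
    "n_null_values": toy.isnull().sum(),
    # dropna=False counts a missing value as its own category,
    # which matches len(Series.unique())
    "n_uniques": toy.nunique(dropna=False),
}).sort_values("n_null_values", ascending=False)

print(summary)
```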

#### **Conclusions from variable inspection**
- Variables that can identify each individual car (ID, VIN)
- Variables detailing the location of car sale (region, state)
- Variables indicating external characteristics of cars (paint color, size, type)
- Variables describing the ignition/operation of car (cylinders, fuel, transmission, drive)
- Variables for production characteristics (manufacturer, model)
- Variables for the state of car being sold (condition, title status, odometer, year)
- TARGET VARIABLE: price

**The first step is to remove vehicle identifiers and variables that carry redundant information**
- VIN and ID can be eliminated for our purposes
- Region and state are redundant, and since region has more unique values we will eliminate it and keep state
- Size, condition, drive and paint color will be eliminated due to their large numbers of missing values
- Cylinders could reflect the amount of wear on the engine, so I suspect this variable will be useful
- If we are going to use cylinders, we can transform the text to numerical values

#### **We will now check the distributions of our numerical values: price, odometer, and year**

In [None]:
#We will start by using plotly.express for speed of generating the histograms
px.histogram(cars['price'],
            labels = {'value': 'Price ($)'},
            title = 'Distribution of Prices')

In [None]:
px.histogram(cars['odometer'],
             labels = {'value': 'Distance traveled (miles)'},
             title = 'Distribution of Odometer Values')

In [None]:
px.histogram(cars['year'],
            labels = {'value': 'Year'},
            title = 'Years of Production')

**From these histograms, we can draw various conclusions:**
- There are a large number of outliers for price, odometer, and year
- The values for year should be changed to age values for our regression purposes
- We will test taking the logarithm of price and odometer values
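
As a minimal sketch of the year-to-age conversion, assuming 2022 as the reference year (the value used later in this notebook) and a hypothetical `age` column name: writing the result to a new column avoids the values flipping back if the cell is run a second time.

```python
import pandas as pd

REFERENCE_YEAR = 2022  # reference year used later in this notebook

# Toy data (made up); `age` is a hypothetical column name
toy = pd.DataFrame({"year": [2015.0, 1999.0, None, 2021.0]})

# Writing to a new column keeps `year` intact, so re-running is harmless
toy["age"] = REFERENCE_YEAR - toy["year"]

print(toy)
```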

In [None]:
cars.describe()

1) subplot hist for all numerical variables
2) subplot hist for all categorical variables that are eliminated
3) subplot hist for all categorical variables that are included
4) subplot hist after cleaning
5) subplot hist after taking log values

- make plots using seaborn
    - scatter of odometer vs. price
    - scatter of log odometer vs. log price (painted for various categorical variables)
    - histplots with density function

## Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

**We use visual inspection of distributions for price as an example for identifying our outlier bounds, and the same process is applied to odometer and year**

In [None]:
sns.histplot(cars['price'].loc[(cars['price'] < 500000)]).set_title('Distribution of Car Prices')

We can visually assess that the upper bound needs to be decreased, and that a lower bound should be introduced as well

In [None]:
sns.histplot(cars['price'].loc[(cars['price'] > 1000) & (cars['price'] < 60000)]).set_title('Distribution of Car Prices')

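We will also try taking the logarithm of these values, which makes the distribution closer to a Gaussian. Strictly speaking, linear regression assumes normally distributed residuals rather than a normally distributed target, but a less skewed target usually makes that assumption easier to satisfy.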
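
To make the effect of the logarithm concrete, the sketch below compares the skewness of a right-skewed sample before and after a log10 transform. The data are synthetic (log-normal by construction, which is an assumption chosen for the demonstration) and are not taken from the cars dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices": log-normal by construction (demo assumption)
prices = rng.lognormal(mean=9.5, sigma=0.6, size=10_000)

def skewness(x):
    """Sample skewness: third standardized moment (0 for a symmetric sample)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

raw_skew = skewness(prices)
log_skew = skewness(np.log10(prices))

print(f"skewness of raw prices:   {raw_skew:.2f}")
print(f"skewness of log10 prices: {log_skew:.2f}")
```

A skewness near zero after the transform indicates a roughly symmetric distribution; on real price data the improvement is typically less clean than in this idealized case.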

In [None]:
sns.histplot(cars['price'].loc[(cars['price'] > 1000) & (cars['price'] < 60000)], log_scale=True).set_title('Distribution of Car Prices')

In [None]:
# We will check the same results for odometer
sns.histplot(cars['odometer'].loc[(cars['odometer'] > 1000) & (cars['odometer'] < 400000)], log_scale=True).set_title('Distribution of Odometer Readings')

In [None]:
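# We check the year values as well
# Convert year of production to age in years (relative to 2022); the column keeps the name 'year'
# Note: run this cell only once, since running it again would convert the ages back to years
cars['year'] = 2022-cars['year']
sns.histplot(cars['year'].loc[(cars['year'] < 40)]).set_title('Distribution of Car Ages (years)')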

**Having identified reasonable ranges and transformations for our numerical variables, we will assess our categorical variables**

In [None]:
#Prepared data with appropriate ranges for price, odometer, and year (which now holds the age in years)
cars_fix = cars.loc[(cars['price'] < 58000) & (cars['price'] > 1000) & (cars['odometer'] < 400000) & (cars['odometer'] > 10000)]
#Filter on cars_fix itself and take a copy so that the column assignments below act on an independent frame
cars_fix = cars_fix.loc[cars_fix['year'] < 40].copy()

#Transform price and odometer values with log10 (both are strictly positive after the filtering above)
cars_fix['log_price'] = np.log10(cars_fix['price'])
cars_fix['log_odometer'] = np.log10(cars_fix['odometer'])
cars_fix.drop(['odometer','price'],axis = 1, inplace=True)

cars_fix.shape


From this point on, we will use log_price and log_odometer in place of the price and odometer values.
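
Because the target is now in log10 units, predictions and errors need to be converted back before they can be read in dollars. The sketch below uses made-up numbers (not model outputs) to show the back-transformation and why an absolute error in log10 space corresponds to a multiplicative error in price.

```python
import numpy as np

# Hypothetical predictions in log10(USD); not taken from the fitted models
log_price_pred = np.array([3.7, 4.0, 4.3])
price_pred = 10 ** log_price_pred  # back to dollars
print(price_pred.round(0))

# An absolute error in log10 units is a multiplicative error in dollars
log_mae = 0.1  # hypothetical value
factor = 10 ** log_mae
print(f"a log10 error of {log_mae} is a factor of about {factor:.2f} in price")
```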

In [None]:
# Here we can see the boxplots of our categorical variables against the log_price variable
#    The important thing to note is that the ranges covered by the boxes have a large amount of overlap between categories
#    The impact of the observed outliers will be reduced by our encoding scheme
fig, axes = plt.subplots(ncols=4, nrows=1, sharey = True, figsize=(15,6))
sns.boxplot(data = cars_fix, x = 'fuel', y = 'log_price', ax = axes[0]).set_title('Fuel Boxplot')
axes[0].tick_params(axis='x', labelrotation=90)
sns.boxplot(data = cars_fix, x = 'transmission', y = 'log_price', ax = axes[1]).set_title('Transmission Boxplot')
axes[1].tick_params(axis='x', labelrotation=90)
sns.boxplot(data = cars_fix, x = 'title_status', y = 'log_price', ax = axes[2]).set_title('Title Status Boxplot')
axes[2].tick_params(axis='x', labelrotation=90)
sns.boxplot(data = cars_fix, x = 'type', y = 'log_price', ax = axes[3]).set_title('Car Type Boxplot')
axes[3].tick_params(axis='x', labelrotation=90)

Finally, we will visualize the distributions for our remaining categorical variables: manufacturer and state
- The model variable has too many unique values to visualize

**These variables, as well as the previous categorical variables, will be encoded based on their corresponding prices (target encoding), which compensates for categories that have only a few instances**
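
The idea behind this kind of target-based encoding can be sketched by hand. The function below is a hand-rolled illustration of an m-estimate, in which each category is replaced by a smoothed mean of the target that is pulled toward the global mean; it is not the `category_encoders` implementation, and the function name, the toy data, and the choice `m=1.0` are assumptions for the example.

```python
import pandas as pd

def m_estimate_encode(categories, target, m=1.0):
    """Hand-rolled m-estimate: (sum_of_target + m * prior) / (count + m)."""
    prior = target.mean()  # global mean of the target
    stats = target.groupby(categories).agg(["sum", "count"])
    encoding = (stats["sum"] + m * prior) / (stats["count"] + m)
    return categories.map(encoding), encoding

# Toy data (made up): one frequent category and one rare category
toy = pd.DataFrame({
    "manufacturer": ["ford", "ford", "ford", "ferrari"],
    "log_price": [4.0, 4.0, 4.0, 5.0],
})

encoded, table = m_estimate_encode(toy["manufacturer"], toy["log_price"], m=1.0)
print(table)
```

The rare category ("ferrari", one row) is pulled further toward the global mean than the frequent one, which is how the smoothing limits the influence of categories with few instances.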

In [None]:
fig = plt.figure(figsize=(10,5))
sns.histplot(data = cars_fix, x = 'manufacturer').set_title('Distribution of Manufacturers')
_ = plt.xticks(rotation=90)

In [None]:
fig = plt.figure(figsize=(10,5))
sns.histplot(data = cars_fix, x = 'state').set_title('Distribution of States')
_ = plt.xticks(rotation=90)

**We will prepare our final variables**

In [None]:
#Translate cylinders into numerical values and replace in cars_fix dataframe
cylinder_rename = {'3 cylinders': 3, '4 cylinders': 4, '5 cylinders': 5, '6 cylinders': 6,
                   '8 cylinders': 8, '10 cylinders': 10, '12 cylinders': 12, 'other': 16}
#Assign the result back instead of calling replace(..., inplace=True) on a column selection
cars_fix['cylinders'] = cars_fix['cylinders'].replace(cylinder_rename)

#Drop other unnecessary variables, then drop the rows that still contain missing values
cars_fix = cars_fix.drop(['VIN','id','size','paint_color','region','condition','drive'], axis = 1).dropna()

cars_fix.shape

**We have managed to preserve ~40% of our data after all of our data cleaning and variable inclusions/exclusions**
- 170k data points is a robust dataset that permits a high level of statistical power, sufficient for our current use

## Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

**We will try 3 types of regression, each fitted on polynomial features:**
- Linear regression (ordinary least squares)
- Ridge regression
- Lasso regression

**We will use 2 different transformers in our pipeline:**
- MEstimateEncoder: replaces each category with a smoothed mean of the target (log_price) for that category
- PolynomialFeatures: to combine the influence of log_odometer, year, and cylinders since these variables should have some nonlinear effects on our data
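
To get a feel for what the polynomial expansion produces, the sketch below enumerates the monomials of three input variables up to a given total degree, without a bias term. It is a plain-Python illustration of the kind of terms such an expansion contains, not the scikit-learn implementation, and the helper name `polynomial_terms` is hypothetical.

```python
from itertools import combinations_with_replacement

def polynomial_terms(names, degree):
    """List every monomial of total degree 1..degree (no bias term)."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append(" * ".join(combo))
    return terms

names = ["log_odometer", "year", "cylinders"]

degree_2_terms = polynomial_terms(names, 2)
degree_4_terms = polynomial_terms(names, 4)

print(degree_2_terms)
print(len(degree_4_terms), "terms at degree 4")
```

The number of terms grows quickly with the degree, which is one reason the polynomial degree has to be balanced against computational cost.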

**We will use MSE, MAE, and R2 score as the evaluation metrics for our cross-validation and the strength of our model**
- The combination of MSE and MAE allows us to identify undue influence from outliers
- Since we are dealing with a numerical regression instead of a classification problem, the most appropriate metric for comparing our model to the underlying dataset is the R2 score. This allows us to assess how much of the variance in our target variable is explained by the model
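
The three metrics can be written out directly, which also shows why comparing MSE and MAE reveals outlier influence. The sketch below uses made-up values in log10 units; a single badly wrong prediction inflates the MSE far more than the MAE because the errors are squared.

```python
import numpy as np

def mse(y, p):
    return float(np.mean((y - p) ** 2))

def mae(y, p):
    return float(np.mean(np.abs(y - p)))

def r2(y, p):
    ss_res = np.sum((y - p) ** 2)         # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return float(1 - ss_res / ss_tot)

# Made-up targets and predictions (log10 units)
y_true = np.array([4.0, 4.2, 3.8, 4.5, 4.1])
y_pred = np.array([4.1, 4.1, 3.9, 4.4, 4.0])

# Same predictions, but with one badly wrong value
y_pred_outlier = y_pred.copy()
y_pred_outlier[0] = 5.0

mse_ratio = mse(y_true, y_pred_outlier) / mse(y_true, y_pred)
mae_ratio = mae(y_true, y_pred_outlier) / mae(y_true, y_pred)

print(f"MSE={mse(y_true, y_pred):.3f}  MAE={mae(y_true, y_pred):.3f}  R2={r2(y_true, y_pred):.3f}")
print(f"with one outlier: MSE grows x{mse_ratio:.1f}, MAE grows x{mae_ratio:.1f}")
```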

***It is important to note that the results of our models will also be assessed using the coefficients derived from each regression as well as the permutation importance that evaluates the influence of each variable on our target value (log_price)***
- The permutation importance will give us a measure of the influence that each of our used car characteristics had on the result
- The coefficients will give us a measure of the degree of influence that each regressor had on the result
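
The mechanics of permutation importance can be sketched by hand: shuffle one column, re-score the fitted model, and record how much the score drops. The example below uses synthetic data and a plain least-squares fit with NumPy; it is an illustration of the idea under those assumptions and not the scikit-learn routine used in the cells that follow. For brevity it scores on the same data it was fitted on, whereas the notebook evaluates on the held-out test set.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
# Column 0 matters most, column 1 a little, column 2 is pure noise
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Ordinary least squares with an intercept column
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def r2_score(X_eval, y_eval):
    pred = np.column_stack([np.ones(len(X_eval)), X_eval]) @ coef
    return 1 - np.sum((y_eval - pred) ** 2) / np.sum((y_eval - y_eval.mean()) ** 2)

baseline = r2_score(X, y)
importances = []
for j in range(X.shape[1]):
    drops = []
    for _ in range(10):  # repeat the shuffle to estimate the spread
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - r2_score(X_perm, y))
    importances.append((float(np.mean(drops)), float(np.std(drops))))

for j, (mean, std) in enumerate(importances):
    print(f"feature {j}: importance = {mean:.4f} +/- {std:.4f}")
```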

In [None]:
#Let's begin by declaring our output table for the evaluation of our three different regression models
#  We have the train and test values for comparison of the MSE, MAE, and R2 score values
#  We also have the optimum alpha value determined by our GridSearchCV for the two models that accept this parameter (ridge and lasso)
model_results = pd.DataFrame(columns=['loss','Linear','Ridge','Lasso'])
model_results['loss']=['alpha', 'MSE_Train','MSE_Test','MAE_Train','MAE_Test','R2_Train','R2_Test']
model_results = model_results.set_index('loss')
model_results.head(6)

In [None]:
#Here we declare our transformer pipeline and our cross-validation datasets
degree = 4
targcats = ['manufacturer','transmission','fuel','state','type','title_status','model']

transformer = make_column_transformer((MEstimateEncoder(), targcats),
                                      (PolynomialFeatures(degree = degree, include_bias = False), ['log_odometer','year','cylinders']))

X_train, X_test, y_train, y_test = train_test_split(cars_fix.drop('log_price', axis = 1), cars_fix['log_price'], test_size = 0.3, random_state=42)

### Ridge Regression

In [None]:
#Beginning with Ridge regression
pipe = Pipeline([('transformer', transformer),
                    ('ridge', Ridge(fit_intercept=True))])

param_dict = {'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid = GridSearchCV(pipe, param_grid=param_dict)
grid.fit(X_train, y_train)

train_preds = grid.predict(X_train)
test_preds = grid.predict(X_test)

In [None]:
#Calculate the evaluation metrics
train_mse = mean_squared_error(y_train, train_preds)
test_mse = mean_squared_error(y_test, test_preds)
train_mae = mean_absolute_error(y_train, train_preds)
test_mae = mean_absolute_error(y_test, test_preds)
r2_train = grid.score(X_train, y_train)
r2_test = grid.score(X_test, y_test)
alpha_val = list(grid.best_params_.values())[0]

model_results['Ridge'] = [alpha_val, train_mse, test_mse, train_mae, test_mae, r2_train, r2_test]

In [None]:
#Save the coefficients derived from the model
fit_coef = grid.best_estimator_.named_steps.ridge.coef_
fit_feat = grid.best_estimator_.named_steps.transformer.get_feature_names_out()
ridge_results = pd.DataFrame(fit_coef).transpose()
ridge_results.rename(columns={x:y for x,y in zip(range(0,len(fit_feat)),fit_feat)}, inplace=True)
ridge_results.rename({0: 'Ridge'}, axis=0, inplace=True)

In [None]:
#Save the permutation importance results
seqp = Pipeline([('transformer', transformer), ('ridge', Ridge(alpha = alpha_val, fit_intercept=True))])
seqp.fit(X_train, y_train)

r = permutation_importance(seqp, X_test, y_test,
                           n_repeats=30,
                           random_state=0)

features = cars_fix.drop('log_price', axis=1).columns
r_dict = {}

for i in r.importances_mean.argsort()[::-1]:
    if r.importances_mean[i] - 2 * r.importances_std[i] > 0:
        r_dict[features[i]] = [r.importances_mean[i], r.importances_std[i]]
        
ridge_perm = pd.DataFrame(r_dict)
ridge_perm.rename({0: 'ridge_mean', 1: 'ridge_std'}, axis=0, inplace=True)

### Lasso Regression

In [None]:
#Followed by Lasso regression
pipe = Pipeline([('transformer', transformer),
                    ('lasso', Lasso(fit_intercept=True))])

param_dict = {'lasso__alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid = GridSearchCV(pipe, param_grid=param_dict)
grid.fit(X_train, y_train)

train_preds = grid.predict(X_train)
test_preds = grid.predict(X_test)

In [None]:
#Calculate the evaluation metrics
train_mse = mean_squared_error(y_train, train_preds)
test_mse = mean_squared_error(y_test, test_preds)
train_mae = mean_absolute_error(y_train, train_preds)
test_mae = mean_absolute_error(y_test, test_preds)
r2_train = grid.score(X_train, y_train)
r2_test = grid.score(X_test, y_test)
alpha_val = list(grid.best_params_.values())[0]

model_results['Lasso'] = [alpha_val, train_mse, test_mse, train_mae, test_mae, r2_train, r2_test]

In [None]:
#Save the coefficients derived from the model
fit_coef = grid.best_estimator_.named_steps.lasso.coef_
lasso_results = pd.DataFrame(fit_coef).transpose()
lasso_results.rename(columns={x:y for x,y in zip(range(0,len(fit_feat)),fit_feat)}, inplace=True)
lasso_results.rename({0: 'Lasso'}, axis=0, inplace=True)

In [None]:
#Save the permutation importance results
seqp = Pipeline([('transformer', transformer), ('lasso', Lasso(alpha = alpha_val, fit_intercept=True))])
seqp.fit(X_train, y_train)

r = permutation_importance(seqp, X_test, y_test,
                           n_repeats=30,
                           random_state=0)

features = cars_fix.drop('log_price', axis=1).columns
r_dict = {}

for i in r.importances_mean.argsort()[::-1]:
    if r.importances_mean[i] - 2 * r.importances_std[i] > 0:
        r_dict[features[i]] = [r.importances_mean[i], r.importances_std[i]]
        
lasso_perm = pd.DataFrame(r_dict)
lasso_perm.rename({0: 'lasso_mean', 1: 'lasso_std'}, axis=0, inplace=True)

### Linear Regression

In [None]:
#Finally, ordinary linear regression (no regularization parameter, so no grid search is needed)
pipe = Pipeline([('transformer', transformer),
                    ('linreg', LinearRegression(fit_intercept=True))])

pipe.fit(X_train, y_train)

train_preds = pipe.predict(X_train)
test_preds = pipe.predict(X_test)

In [None]:
#Calculate the evaluation metrics
train_mse = mean_squared_error(y_train, train_preds)
test_mse = mean_squared_error(y_test, test_preds)
train_mae = mean_absolute_error(y_train, train_preds)
test_mae = mean_absolute_error(y_test, test_preds)
r2_train = pipe.score(X_train, y_train)
r2_test = pipe.score(X_test, y_test)
alpha_val = '-'

model_results['Linear'] = [alpha_val, train_mse, test_mse, train_mae, test_mae, r2_train, r2_test]

In [None]:
#Save the coefficients derived from the model
fit_coef = pipe.named_steps.linreg.coef_
lin_results = pd.DataFrame(fit_coef).transpose()
lin_results.rename(columns={x:y for x,y in zip(range(0,len(fit_feat)),fit_feat)}, inplace=True)
lin_results.rename({0: 'Linear'}, axis=0, inplace=True)
coef_results = pd.concat([lin_results, ridge_results, lasso_results], axis=0)#, ignore_index=True

In [None]:
#Save the permutation importance results
seqp = Pipeline([('transformer', transformer), ('linreg', LinearRegression(fit_intercept=True))])
seqp.fit(X_train, y_train)

r = permutation_importance(seqp, X_test, y_test,
                           n_repeats=30,
                           random_state=0)

features = cars_fix.drop('log_price', axis=1).columns
r_dict = {}

for i in r.importances_mean.argsort()[::-1]:
    if r.importances_mean[i] - 2 * r.importances_std[i] > 0:
        r_dict[features[i]] = [r.importances_mean[i], r.importances_std[i]]
        
lin_perm = pd.DataFrame(r_dict)
lin_perm.rename({0: 'linear_mean', 1: 'linear_std'}, axis=0, inplace=True)
perm_results = pd.concat([lin_perm, ridge_perm, lasso_perm], axis=0).transpose()

#### Tested parameters:
- JamesSteinEncoder, TargetEncoder, and a self-written frequency encoder were tested, but MEstimateEncoder gave the best results
- Polynomial features were tested with and without the bias term; excluding the bias gave the best results
- The train/test split size (test_size) was tested at 0.2, 0.3, and 0.4; 0.3 gave the most consistent results across regression models
- Polynomial degree was tested between 1 and 10; degree = 4 was selected as a compromise between computational cost and model simplicity

## Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [None]:
model_results.head(20)

**Based on the model results, we can derive the following conclusions:**
- The results from linear and ridge regression are very difficult to distinguish; these two models perform almost equally well
- The lasso regression model performs worse across all metrics
- The grid search selected the same alpha value for lasso and ridge. Note that an identical alpha does not imply an identical strength of regularization, because the L1 and L2 penalties act differently on the coefficients

*MSE & MAE*
- The MSE and MAE values are extremely close for all 3 regression models, although our train values are slightly lower than our test values. Given the size of these values, this discrepancy can be considered negligible.

*R2 Score*
- The R2 scores for our train sets across all 3 regression models are slightly higher than for the test sets, but an R2 score of around 0.80 is a strong result regardless

In [None]:
perm_results.head(20)

**Based on the permutation importance results, we can derive the following conclusions:**
- All 3 regression models showed the greatest influence from year, model, and log_odometer
- The lasso regression, which can shrink coefficients to exactly zero and thereby exclude unnecessary characteristics, eliminated title_status and manufacturer
- The lasso results are consistent with the results from the other two models, which demonstrate that manufacturer and title_status contribute minimally to all models
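
Why lasso can remove variables while ridge only shrinks them can be illustrated with the two shrinkage rules below. This is an idealized sketch: it assumes an orthonormal design, in which case the penalized solutions reduce to elementwise operations on the unpenalized coefficients (up to how the penalty is scaled in the objective). The coefficient values and the penalty are made up.

```python
import numpy as np

def soft_threshold(b, t):
    """Lasso-style shrinkage: move b toward 0 by t and clip at exactly 0."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def ridge_shrink(b, lam):
    """Ridge-style shrinkage: scale b toward 0; it never reaches 0 for b != 0."""
    return b / (1.0 + lam)

# Hypothetical unpenalized coefficients and penalty strength
ols_coefs = np.array([2.0, 0.5, 0.05, -0.02])
penalty = 0.1

lasso_like = soft_threshold(ols_coefs, penalty)
ridge_like = ridge_shrink(ols_coefs, penalty)

print("lasso-like:", lasso_like)
print("ridge-like:", ridge_like)
```

Coefficients smaller in magnitude than the penalty are set to exactly zero by the lasso-style rule, which corresponds to a variable being dropped from the model.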

In [None]:
coef_results

*With so many regression coefficients, the results become slightly difficult to interpret, so we will isolate the values for each model and sort them based on their magnitude to identify the largest influences*

In [None]:
coef_results.loc['Linear',:].sort_values(ascending=False)

**Linear model coefficients:**
- For our linear model, the most impactful regressors are `log_odometer^2`, `log_odometer*year`, and `log_odometer`
- `log_odometer^2` and `log_odometer*year` had positive coefficients with respect to log_price, while `log_odometer` had a negative coefficient
- These results are consistent with our intuitive assumptions, and show that the use of polynomial features was important to fully characterize the relationship between these variables and the log_price target values

In [None]:
coef_results.loc['Ridge',:].sort_values(ascending=False)

**Ridge model coefficients:**
- For our ridge model, the most impactful regressors are `log_odometer^2`, `log_odometer*year`, and `log_odometer`
- `log_odometer^2` and `log_odometer*year` had positive coefficients with respect to log_price, while `log_odometer` had a negative coefficient
- These results are consistent with our intuitive assumptions, and show that the use of polynomial features was important to fully characterize the relationship between these variables and the log_price target values

In [None]:
coef_results.loc['Lasso',:].sort_values(ascending=False)

**Lasso model coefficients:**
- For our lasso model, the most impactful regressors are model and transmission
- While model had a positive coefficient, transmission had a negative one. Both variables are encoded by the MEstimateEncoder, which uses target values to encode categorical variables. The positive coefficient for model therefore mostly says that car models with higher average prices are predicted to have a higher log_price, a result that contributes little to deriving meaningful conclusions from this regression model
- These discrepancies with the ridge and linear models may explain the weaker performance of the lasso regression model

### **Based on the evaluation metrics, permutation importances, and coefficients, we can take the linear and ridge regression models to perform equivalently**

## Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

**CONCLUSIONS:**
1) The majority of the used cars in our data are priced between 1,000 and 58,000 USD, and have odometer readings between 1,000 and 400,000 miles
2) Most used cars that have been purchased are less than 40 years old
3) Diesel and electric cars tend to sell for the highest prices
4) California, Florida and Texas have the greatest amount of used car purchases, although state had a very low influence on determining the actual price of the car sold
5) The most influential factors in the price of a used car are the odometer reading, the age of the car (year of manufacture), and the car model
6) The number of cylinders has an influence on price when considered together with age and odometer reading. One possible interpretation is that buyers perceive engines with more cylinders as wearing more as age and distance traveled increase
7) Some variables have similar influences on the price of a used car, such as fuel + transmission and model + manufacturer

**ACTION STEPS:**
1) Focus on cars manufactured within the last 40 years
2) Lower odometer readings are important for the desirability of a used car
3) The model of a car can mitigate higher odometer readings to some extent
4) For cars with more cylinders, it is important to focus on those with lower odometer readings in particular