# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

========================================================================================================================
========================================================================================================================
Task 1. Build several predictive models: Linear Regression with varying degrees of PolynomialFeatures, PCA to reduce dimensionality before applying Linear Regression, a LASSO model to determine the top factors, and a Ridge model.

Measures of effectiveness:

- Mean Squared Error on the training dataset

- Mean Squared Error on the test dataset

- R^2 value (for linear regression) to determine how much of the variance can be explained by the model

Task 2. Using the above metrics, determine the best model

Task 3. For the best model, present the coefficients

========================================================================================================================
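
As a rough illustration of the measures listed above, the sketch below computes MSE, R^2 and the explained variance score on made-up numbers (not taken from the car data). It assumes predictions that are all off by the same constant, which is one case where R^2 and explained variance disagree: explained variance ignores a constant offset in the residuals, R^2 does not.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

# Toy targets and predictions (made-up numbers, not from the car data).
y_true = np.array([9.0, 9.5, 10.0, 10.5, 11.0])
y_pred = y_true + 0.2          # every prediction is off by the same constant

mse = mean_squared_error(y_true, y_pred)        # average squared residual
r2 = r2_score(y_true, y_pred)                   # 1 - SS_res / SS_tot
ev = explained_variance_score(y_true, y_pred)   # 1 - Var(residuals) / Var(y_true)

print(f"MSE = {mse:.3f}, R^2 = {r2:.3f}, explained variance = {ev:.3f}")
```

When the residuals average to zero, as they do on the training data of an ordinary least-squares fit with an intercept, the two scores coincide.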

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.
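
A minimal sketch of the kind of quality checks this step might involve, run on a tiny made-up frame rather than the real file. The column names mirror the vehicles data, but the values and the variable names (`toy`, `missing_share`, `suspect_prices`) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the vehicles data.
toy = pd.DataFrame({
    "price": [0, 4500, 12000, 12000, 3_000_000],
    "year": [2012, np.nan, 2018, 2018, 1999],
    "fuel": ["gas", None, "hybrid", "hybrid", "diesel"],
})

missing_share = toy.isna().mean().sort_values(ascending=False)  # fraction missing per column
n_duplicates = int(toy.duplicated().sum())                       # fully repeated rows
fuel_levels = toy["fuel"].nunique()                              # distinct categories (NaN excluded)
suspect_prices = toy[(toy["price"] <= 0) | (toy["price"] > 250_000)]  # implausible prices

print(missing_share)
print(f"duplicates: {n_duplicates}, fuel levels: {fuel_levels}, suspect prices: {len(suspect_prices)}")
```

The same four checks can be pointed at `cars` once it is loaded; the $250,000 threshold here simply matches the cut-off used later in this notebook.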

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error, explained_variance_score
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel

In [None]:
cars = pd.read_csv('data/vehicles.csv')

In [None]:
cars.info()

In [None]:
px.scatter(cars, x='year', y='price')

In [None]:
# Due to extreme outliers, it is recommended that we limit the car prices to no more than $250,000
px.scatter(cars[cars['price'] <= 250_000], x='year', y='price')

In [None]:
px.box(cars[cars['price'] <= 250_000], x='year', y='price')

In [None]:
# Uncomment the code below to see the spread of car prices next to the scatter plot. The scatter and box plots are also shown separately above.
#px.scatter(cars[cars['price'] <= 250_000], x='year', y='price', marginal_y='box')

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

In [None]:
# 1. Convert the categorical columns to numeric dummy (one-hot) variables

In [None]:
dummies=pd.get_dummies(cars[['manufacturer','condition', 'cylinders', 'fuel', 'title_status', 'transmission',
                             'drive', 'size', 'type']])
cars_dummies = pd.concat([cars, dummies], axis=1)
cars_dummies=cars_dummies.drop(columns = ['manufacturer','condition', 'cylinders', 'fuel', 'title_status', 'transmission',
                             'drive', 'size', 'type'])
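
The encode, concatenate and drop sequence above can also be written as a single `pd.get_dummies` call with the `columns=` argument. The sketch below checks this on a small made-up frame (`toy` and `cat_cols` are illustrative names), and also shows that a missing category becomes a row of all-zero dummies rather than a NaN.

```python
import pandas as pd

toy = pd.DataFrame({
    "price": [4500, 12000, 8000],
    "fuel": ["gas", "diesel", "gas"],
    "drive": ["fwd", "4wd", None],     # one missing category on purpose
})
cat_cols = ["fuel", "drive"]

# Two-step version, mirroring the cell above.
two_step = pd.concat([toy, pd.get_dummies(toy[cat_cols])], axis=1).drop(columns=cat_cols)

# One-call version: `columns=` encodes the listed columns and drops the originals.
one_call = pd.get_dummies(toy, columns=cat_cols)

print(list(one_call.columns))
print(one_call.iloc[2])                # the row with the missing `drive` value
```

One consequence worth keeping in mind: because missing categories turn into all-zero dummy rows, a later `dropna()` only removes rows that are missing a value in the remaining non-encoded columns.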

In [None]:
# 2. Remove unnecessary data 

In [None]:
df_cars = cars_dummies.drop(columns = ['id', 'region','model','VIN', 'paint_color','state'])
df_cars = df_cars.dropna()

In [None]:
# Retain reasonable values of price

# I remove all car prices > $250,000, since anything higher does not look realistic and I assumed it was a data-entry error.
# I limited the data to car prices > $0 for two reasons: 1. giving a car away for free is possible, but it won't tell us
# much about price prediction; 2. I decided to model the logarithm of the car price, so that after we apply the exponential
# function to reverse the logarithm, the predicted price is positive. The logarithm is undefined at zero. (I tried an
# earlier model and, surprisingly, many of the predicted prices were negative.)

df_cars=df_cars[(df_cars['price']<=250_000) & (df_cars['price']>0) ]

In [None]:
y=np.log(df_cars['price'])
X=df_cars.drop(columns=['price'])
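
To illustrate why modelling `log(price)` keeps predictions positive, here is a small sketch on synthetic data. The single feature `age`, the decay rate and the noise level are all made-up assumptions; the point is only the comparison between fitting the raw price and fitting its logarithm.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(0, 25, size=200).reshape(-1, 1)     # hypothetical car age in years
price = 30_000 * np.exp(-0.15 * age[:, 0]) * rng.lognormal(0, 0.2, size=200)

raw_fit = LinearRegression().fit(age, price)            # price modelled directly
log_fit = LinearRegression().fit(age, np.log(price))    # log(price) modelled instead

old_cars = np.array([[30.0], [40.0]])                   # older than anything in the training data
raw_pred = raw_fit.predict(old_cars)                    # a straight line can cross below zero
log_pred = np.exp(log_fit.predict(old_cars))            # exp(...) is positive for any input

print("raw-price model:", raw_pred.round(0))
print("log-price model:", log_pred.round(0))
```

Note that the MSE values reported later in this notebook are therefore in squared log-price units, not squared dollars.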

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
np.abs(df_cars.corr())['price'].sort_values(ascending = False).head(15)


In [None]:
# 3. Repeat steps 1 and 2 but this time, remove manufacturer data. 

In [None]:
dummies2=pd.get_dummies(cars[['condition', 'cylinders', 'fuel', 'title_status', 'transmission',
                             'drive', 'size', 'type']])
cars_dummies2 = pd.concat([cars, dummies2], axis=1)
cars_dummies2=cars_dummies2.drop(columns = ['condition', 'cylinders', 'fuel', 'title_status', 'transmission',
                             'drive', 'size', 'type'])

In [None]:
df_cars2 = cars_dummies2.drop(columns = ['id', 'region','model','VIN', 'paint_color','state', 'manufacturer'])
df_cars2 = df_cars2.dropna()
df_cars2=df_cars2[(df_cars2['price']<=250_000) & (df_cars2['price']>0) ]

In [None]:
y2=np.log(df_cars2['price'])
X2=df_cars2.drop(columns=['price'])

In [None]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.3)

In [None]:
train_mse = []
test_mse = []
explained_variance = []
model =[]

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
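
One possible way to cross-validate a pipeline is `cross_val_score` with a `KFold` splitter. The sketch below uses synthetic data from `make_regression` (an assumption, so that it runs on its own); `X_demo`, `y_demo` and `pipe` are illustrative names, not objects from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),      # refit on the training part of every fold
    ("linreg", LinearRegression()),
])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error")
cv_mse = -neg_mse                     # scikit-learn reports the negated MSE

print(f"fold MSEs: {np.round(cv_mse, 2)}; mean = {cv_mse.mean():.2f}")
```

Keeping the scaler inside the pipeline means each fold's validation rows never influence the scaling statistics, which is the main reason to cross-validate the whole pipeline rather than only the final estimator.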

### Model 1: Linear Regression with 92 features

In [None]:
# Linear Regression (Degree = 1) using 92 features

In [None]:
LR1_model = Pipeline([
    ('transform', PolynomialFeatures(degree=1, include_bias=False)),
    ('scale', StandardScaler()),
    ('linreg', LinearRegression())
])
LR1_model.fit(X_train, y_train)

LR1_train_mse= round(mean_squared_error(LR1_model.predict(X_train), y_train), 4)
LR1_test_mse=round(mean_squared_error(LR1_model.predict(X_test), y_test),4)
LR1_EV = explained_variance_score(y_train, LR1_model.predict(X_train))

train_mse.append(LR1_train_mse)
test_mse.append(LR1_test_mse)
explained_variance.append(LR1_EV)
model.append("Model 1 - LR1")

In [None]:
plt.figure(figsize=(16,6))
plt.scatter(X_train['year'], np.exp(y_train), c ="blue", alpha=0.3, label='observed values')
plt.scatter(X_train['year'], np.exp(LR1_model.predict(X_train)), c ="orange", alpha=0.3, label = 'predicted value')
plt.title('Predicted vs Observed (Using ordinary linear regression on 92 features); Training MSE = ' + str(LR1_train_mse) +
          '; Development MSE = ' + str(LR1_test_mse))
plt.xlabel('Year')
plt.ylabel('Car Price')
plt.legend()

### Model 2: Linear Regression with 50 features 

In [None]:
# Linear Regression (Degree = 1) using 50 features

In [None]:
LR2_model = Pipeline([
    ('transform', PolynomialFeatures(degree=1, include_bias=False)),
    ('scale', StandardScaler()),
    ('linreg', LinearRegression())
])
LR2_model.fit(X_train2, y_train2)

LR2_train_mse= round(mean_squared_error(LR2_model.predict(X_train2), y_train2), 4)
LR2_test_mse=round(mean_squared_error(LR2_model.predict(X_test2), y_test2),4)
LR2_EV = explained_variance_score(y_train2, LR2_model.predict(X_train2))

train_mse.append(LR2_train_mse)
test_mse.append(LR2_test_mse)
explained_variance.append(LR2_EV)
model.append("Model 2 - LR2")

In [None]:
plt.figure(figsize=(16,6))
plt.scatter(X_train2['year'], np.exp(y_train2), c ="blue", alpha=0.3, label='observed values')
plt.scatter(X_train2['year'], np.exp(LR2_model.predict(X_train2)), c ="orange", alpha=0.3, label = 'predicted value')
plt.title('Predicted vs Observed (Using ordinary linear regression on 50 features); Training MSE = ' + str(LR2_train_mse) +
          '; Development MSE = ' + str(LR2_test_mse))
plt.xlabel('Year')
plt.ylabel('Car Price')
plt.legend()

### Model 3: Linear Regression with 50 features and applying PolynomialFeatures(degree = 2) 

In [None]:
LR3_model = Pipeline([
    ('transform', PolynomialFeatures(degree=2, include_bias=False)),
    ('scale', StandardScaler()),
    ('linreg', LinearRegression())
])
LR3_model.fit(X_train2, y_train2)

LR3_train_mse= round(mean_squared_error(LR3_model.predict(X_train2), y_train2), 4)
LR3_test_mse=round(mean_squared_error(LR3_model.predict(X_test2), y_test2),4)
LR3_EV= explained_variance_score(y_train2, LR3_model.predict(X_train2))

train_mse.append(LR3_train_mse)
test_mse.append(LR3_test_mse)
explained_variance.append(LR3_EV)
model.append("Model 3 - LR3Deg2")

In [None]:
plt.figure(figsize=(16,6))
plt.scatter(X_train2['year'], np.exp(y_train2), c ="blue", alpha=0.3, label='observed values')
plt.scatter(X_train2['year'], np.exp(LR3_model.predict(X_train2)), c ="orange", alpha=0.3, label = 'predicted value')
plt.title('Predicted vs Observed (Using linear regression with degree =2 on 50 features); Training MSE = ' + str(LR3_train_mse) +
          '; Development MSE = ' + str(LR3_test_mse))
plt.xlabel('Year')
plt.ylabel('Car Price')
plt.legend()

### Model 4: LASSO model

In [None]:
pipe_lasso = Pipeline([
    ('transform', PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()), 
    ('lasso', Lasso())
])
pipe_lasso.fit(X_train2, y_train2)

lasso_train_mse= round(mean_squared_error(pipe_lasso.predict(X_train2), y_train2), 4)
lasso_test_mse=round(mean_squared_error(pipe_lasso.predict(X_test2), y_test2),4)
lasso_EV= explained_variance_score(y_train2, pipe_lasso.predict(X_train2))

train_mse.append(lasso_train_mse)
test_mse.append(lasso_test_mse)
explained_variance.append(lasso_EV)
model.append("Model 4 - Lasso")

In [None]:
lasso_coef = pipe_lasso.named_steps['lasso'].coef_
lasso_coef
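
A raw coefficient array is hard to read, so a common follow-up is to pair each coefficient with its feature name and keep only the non-zero ones. The sketch below does this on synthetic data (`X_demo`, `demo_lasso` and the `feat_*` column names are assumptions for illustration only).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_arr, y_demo = make_regression(n_samples=300, n_features=6, n_informative=2,
                                noise=5.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

demo_lasso = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(alpha=1.0)),
]).fit(X_demo, y_demo)

# Label each coefficient, then keep the features Lasso did not shrink to zero.
coefs = pd.Series(demo_lasso.named_steps["lasso"].coef_, index=X_demo.columns)
kept = coefs[coefs != 0].sort_values(key=np.abs, ascending=False)

print(f"{len(kept)} of {len(coefs)} coefficients are non-zero")
print(kept)
```

With the default `alpha=1.0` and a log-scale target, Lasso may shrink most or even all coefficients to zero, so it is worth checking the non-zero count before reading anything into the ranking.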

In [None]:
plt.figure(figsize=(16,6))
plt.scatter(X_train2['year'], np.exp(y_train2), c ="blue", alpha=0.3, label='observed values')
plt.scatter(X_train2['year'], np.exp(pipe_lasso.predict(X_train2)), c ="orange", alpha=0.3, label = 'predicted value')
plt.title('Predicted vs Observed (Using Lasso model); Training MSE = ' + str(lasso_train_mse) +
          '; Development MSE = ' + str(lasso_test_mse))
plt.xlabel('Year')
plt.ylabel('Car Price')
plt.legend()

### Model 5: Ridge model with varying alphas
#### Please note that the algorithm below will take a few minutes to run, as it is trained on a high-degree model and runs an optimisation to find the best solution

In [None]:
ridge_model = Pipeline([
    ('transform', PolynomialFeatures(degree=2, include_bias=False)),
    ('scale', StandardScaler()),
    ('ridge', Ridge())
])

alpha_value = {'ridge__alpha': [0.1,1,10]}

model_finder = GridSearchCV(estimator = ridge_model, 
                           param_grid=alpha_value,
                           scoring = "neg_mean_squared_error"
                           )

model_finder.fit(X_train2, y_train2)

best_model=model_finder.best_estimator_

In [None]:
best_alpha = model_finder.best_params_
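
Beyond `best_params_`, the fitted search object also exposes per-candidate scores in `cv_results_`, which shows how close the alternatives were. A minimal sketch on synthetic data (the names `search` and `results` are illustrative, and the data is not the car data):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("ridge", Ridge())]),
    param_grid={"ridge__alpha": [0.1, 1, 10]},
    scoring="neg_mean_squared_error",
    cv=5,
).fit(X_demo, y_demo)

# One row per alpha, with the mean and spread of the cross-validated score.
results = pd.DataFrame(search.cv_results_)[
    ["param_ridge__alpha", "mean_test_score", "std_test_score"]
].copy()
results["mean_cv_mse"] = -results["mean_test_score"]   # flip the sign back to a plain MSE
chosen_alpha = search.best_params_["ridge__alpha"]     # best_params_ is a dict

print(results)
print("chosen alpha:", chosen_alpha)
```

If the chosen alpha sits at the edge of the grid, that is usually a hint to widen the grid in that direction.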

In [None]:
ridge_train_mse = round(mean_squared_error(best_model.predict(X_train2), y_train2),4)
ridge_test_mse = round(mean_squared_error(best_model.predict(X_test2), y_test2),4)
ridge_EV= explained_variance_score(y_train2, best_model.predict(X_train2))

train_mse.append(ridge_train_mse)
test_mse.append(ridge_test_mse)
explained_variance.append(ridge_EV)
model.append("Model 5 - Ridge")

In [None]:
plt.figure(figsize=(16,6))
plt.scatter(X_train2['year'], np.exp(y_train2), c ="blue", alpha=0.3, label='observed values')
plt.scatter(X_train2['year'], np.exp(best_model.predict(X_train2)), c ="orange", alpha=0.3, label = 'predicted value')
plt.title('Predicted vs Observed (Using ridge model with alpha = ' + str(best_alpha['ridge__alpha']) + '); Training MSE = ' + str(ridge_train_mse) +
          '; Development MSE = ' + str(ridge_test_mse))
plt.xlabel('Year')
plt.ylabel('Car Price')
plt.legend()

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [None]:
MetricsTable = pd.DataFrame({
    'model': model,
    'train_mse': train_mse,
    'test_mse':test_mse,
    'explained_variance': explained_variance
})

MetricsTable

### Deploying the best performing model - the ridge model (PolynomialFeatures(degree=2), alpha =1)

In [None]:
ridge_value = pd.DataFrame({
    'features': best_model.named_steps['transform'].get_feature_names_out(),
    'coefficient': best_model.named_steps['ridge'].coef_
})


In [None]:
ridge_value['abs_coef'] = np.abs(ridge_value['coefficient'])

In [None]:
ridge_value.sort_values(by = 'abs_coef', ascending = False).head(15)
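
Because the target is `log(price)` and the features are standardized, a coefficient can be read as an approximate percentage effect: `exp(coef)` is the factor by which the predicted price changes for a one-standard-deviation increase in that term, with the other terms held fixed. The sketch below uses made-up coefficient values, not the fitted ones.

```python
import numpy as np
import pandas as pd

# Made-up coefficients on standardized features from a log-price model.
demo_coefs = pd.Series({"year": 0.35, "odometer": -0.20, "fuel_diesel": 0.05})

# exp(coef) - 1 is the relative change in predicted price per one-standard-deviation
# increase in the feature; multiply by 100 to express it as a percentage.
pct_change = (np.exp(demo_coefs) - 1) * 100

print(pct_change.round(1))
```

This reading is cleanest for the degree-1 terms. For squared and interaction terms from `PolynomialFeatures(degree=2)`, other terms cannot be held fixed while the underlying feature changes, so those coefficients are better treated as a ranking signal than as direct percentage effects.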

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

### Refer to https://github.com/CarolTeo11/Assignment-11.1 for the model and deployment