# Ford Focus Modelling

In [38]:
import pandas as pd
import numpy as np
from data_preparation import split_years_df, split_train
from models import linear_model, custom_input_for_model, xgb_r
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, cross_val_score
import xgboost as xg 


The goal of this project is to build a model that can be used to find potentially undervalued cars. I will use linear regression to build a model that predicts a car's price from various features.

Several models will have to be trained and tested in order to find the best set of features. The first part explores the features and their correlations.

In [39]:
df_focus_17, df_focus_18, df_focus_19 = split_years_df()

In [40]:
df_focus_17.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1288 entries, 12 to 5441
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   year                    1288 non-null   int64  
 1   price                   1288 non-null   int64  
 2   mileage                 1288 non-null   int64  
 3   engineSize              1288 non-null   float64
 4   transmission_Automatic  1288 non-null   uint8  
 5   transmission_Manual     1288 non-null   uint8  
 6   transmission_Semi-Auto  1288 non-null   uint8  
 7   fuelType_Diesel         1288 non-null   uint8  
 8   fuelType_Petrol         1288 non-null   uint8  
dtypes: float64(1), int64(3), uint8(5)
memory usage: 56.6 KB


In [41]:
focus_17_mean = df_focus_17["price"].mean()
focus_18_mean = df_focus_18["price"].mean()
focus_19_mean = df_focus_19["price"].mean()

print("Mean price for 17 models: £{}".format(round(focus_17_mean, 1)))
print("Mean price for 18 models: £{}".format(round(focus_18_mean, 1)))
print("Mean price for 19 models: £{}".format(round(focus_19_mean, 1)))

Mean price for 17 models: £12588.9
Mean price for 18 models: £14613.9
Mean price for 19 models: £17859.3


In [42]:
corr = df_focus_17.corr()

In [43]:
corr

| | year | price | mileage | engineSize | transmission_Automatic | transmission_Manual | transmission_Semi-Auto | fuelType_Diesel | fuelType_Petrol |
|---|---|---|---|---|---|---|---|---|---|
| year | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| price | NaN | 1.0 | -0.348932 | 0.696645 | 0.002393 | 0.016651 | -0.026427 | -0.093647 | 0.093647 |
| mileage | NaN | -0.348932 | 1.0 | 0.098406 | -0.08979 | 0.093344 | -0.035052 | 0.372073 | -0.372073 |
| engineSize | NaN | 0.696645 | 0.098406 | 1.0 | -0.117206 | 0.170547 | -0.11538 | 0.430865 | -0.430865 |
| transmission_Automatic | NaN | 0.002393 | -0.08979 | -0.117206 | 1.0 | -0.716401 | -0.071647 | -0.144955 | 0.144955 |
| transmission_Manual | NaN | 0.016651 | 0.093344 | 0.170547 | -0.716401 | 1.0 | -0.644568 | 0.149911 | -0.149911 |
| transmission_Semi-Auto | NaN | -0.026427 | -0.035052 | -0.11538 | -0.071647 | -0.644568 | 1.0 | -0.05547 | 0.05547 |
| fuelType_Diesel | NaN | -0.093647 | 0.372073 | 0.430865 | -0.144955 | 0.149911 | -0.05547 | 1.0 | -1.0 |
| fuelType_Petrol | NaN | 0.093647 | -0.372073 | -0.430865 | 0.144955 | -0.149911 | 0.05547 | -1.0 | 1.0 |
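
A quick way to read a correlation table like this is to rank the features by the absolute value of their correlation with price. The sketch below does that on a small made-up frame (`demo` is a hypothetical stand-in, not the Focus data), so the printed numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for df_focus_17; the values are made up.
demo = pd.DataFrame({
    "price": [9000, 10500, 12000, 13500, 15000, 16500],
    "mileage": [60000, 52000, 41000, 30000, 24000, 10000],
    "engineSize": [1.0, 1.0, 1.5, 1.5, 2.0, 2.0],
    "fuelType_Diesel": [1, 0, 1, 0, 1, 0],
})

# Correlation of each feature with price (the price-price entry is dropped).
price_corr = demo.corr()["price"].drop("price")

# Order by strength of the relationship, keeping the sign for readability.
order = price_corr.abs().sort_values(ascending=False).index
ranked = price_corr.reindex(order)
print(ranked.round(3))
```

Applied to `df_focus_17.corr()`, the same ranking would put engineSize (0.697) and mileage (-0.349) at the top, matching the table above.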


# Linear Regression Modelling

From the correlation table above, we can see that several of the variables intended for the modelling correlate only weakly with price. Including all of them could result in overfitting, because the model may be too complex. I will start by modelling the 2017 dataset with all variables.
I will then gradually reduce the variables to just mileage and engineSize, the two variables with the strongest correlation with price (-0.349 and 0.697 respectively). The expectation is that this improves the R2 score of the model.
The model is going to be used with a web scraper, so it will be easier to use one model for all predictions. Once the variables have been reduced and the optimum x variables selected, I will try the model with the 2017-19 data combined, using year as an extra x variable. This would mean only one model needs to be trained, which is more efficient.

### Model for all X variables using the 2017 dataset

In [44]:
x_train, x_test, y_train, y_test = split_train(df_focus_17, ["mileage", "engineSize", "transmission_Automatic", 
                                                             "transmission_Manual", "transmission_Semi-Auto",
                                                             "fuelType_Diesel", "fuelType_Petrol"])

In [45]:
model_fit, r2 = linear_model(x_train, x_test, y_train, y_test)
r2.round(3)

0.823

In [46]:
cross_val = cross_validate(LinearRegression(), df_focus_17[["mileage", "engineSize", "transmission_Automatic", 
                                                   "transmission_Manual", "transmission_Semi-Auto",
                                                   "fuelType_Diesel", "fuelType_Petrol"]], 
                                         df_focus_17["price"], cv=2 )

In [47]:
print(cross_val["test_score"].mean().round(3))

0.741


The mean R2 score under 2-fold cross-validation is 0.741. I'll now start to remove some features to see if the model can be improved.
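
The scores in this notebook are R2 values: when no `scoring` argument is passed, `cross_validate` uses the estimator's own `score` method, which for `LinearRegression` is the coefficient of determination. As a reminder of what that number measures, here is a minimal hand-rolled version. The function name `r2_score_manual` and the prices are made up for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Illustrative prices only, not taken from the dataset.
actual = [10000, 12000, 14000, 16000]
predicted = [10500, 11500, 14500, 15500]

score = r2_score_manual(actual, predicted)
print(round(score, 3))

# Always predicting the mean price scores exactly 0.
baseline = r2_score_manual(actual, [13000] * 4)
print(baseline)
```

A score of 1 means perfect predictions and 0 means the model does no better than always predicting the mean price.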

### Model without Transmission Type using the 2017 dataset

In [48]:
x_train_2, x_test_2, y_train_2, y_test_2 = split_train(df_focus_17, ["mileage", "engineSize", 
                                                                     "fuelType_Diesel", "fuelType_Petrol"])

In [49]:
model_2, r2_2 = linear_model(x_train_2, x_test_2, y_train_2, y_test_2)

In [50]:
r2_2.round(3)

0.826

In [51]:
cross_val_2 = cross_validate(LinearRegression(), df_focus_17[["mileage", "engineSize",
                                                "fuelType_Diesel", "fuelType_Petrol"]], 
                                         df_focus_17["price"], cv=2 )

In [52]:
print(cross_val_2["test_score"].mean().round(3))

0.741


The cross-validation score is the same as before (0.741). This is expected, because the transmission features that were removed had almost no correlation with price. The model will now be simplified further, leaving just mileage and engine size, which have the strongest relationship with price.
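
The feature-removal steps in this section all follow the same pattern, so they can be wrapped in one loop. The sketch below assumes a synthetic frame (`demo`) with made-up prices and a hypothetical helper `mean_cv_r2`; it is not the notebook's `split_train` or `linear_model`, just the same `cross_validate` call applied to several feature lists.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for df_focus_17; all numbers are made up.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "mileage": rng.integers(5_000, 80_000, n),
    "engineSize": rng.choice([1.0, 1.5, 2.0], n),
    "fuelType_Diesel": rng.integers(0, 2, n),
})
demo["fuelType_Petrol"] = 1 - demo["fuelType_Diesel"]
demo["price"] = (
    9_000 - 0.05 * demo["mileage"] + 4_000 * demo["engineSize"]
    + rng.normal(0, 500, n)
)

# Candidate feature lists, from most to least complex.
feature_sets = {
    "mileage + engineSize + fuel": [
        "mileage", "engineSize", "fuelType_Diesel", "fuelType_Petrol"
    ],
    "mileage + engineSize": ["mileage", "engineSize"],
    "mileage only": ["mileage"],
}

def mean_cv_r2(frame, columns, folds=2):
    """Mean cross-validated R2 of a linear model on the given columns."""
    result = cross_validate(
        LinearRegression(), frame[columns], frame["price"], cv=folds
    )
    return float(result["test_score"].mean())

scores = {name: mean_cv_r2(demo, cols) for name, cols in feature_sets.items()}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Keeping the comparison in one place makes it harder to accidentally change the fold count or target column between runs.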

### Model for Mileage and Engine Size using the 2017 dataset

In [53]:
x_train_3, x_test_3, y_train_3, y_test_3 = split_train(df_focus_17, ["mileage", "engineSize"])

In [54]:
model_3, r2_3 = linear_model(x_train_3, x_test_3, y_train_3, y_test_3)

In [55]:
r2_3.round(3)

0.663

In [56]:
cross_val_3 = cross_validate(LinearRegression(), df_focus_17[["mileage", "engineSize"]], 
                                         df_focus_17["price"], cv=2 )

In [57]:
print(cross_val_3["test_score"].mean().round(3))

0.647


The new cross-validation score (0.647) is a lot worse than for the previous models. Ideally we want one model for all the years, so I will try these features along with year as a new x variable, combining all the years into one dataframe.

### Model for Mileage, Engine Size and Year (2017-2019)

In [59]:
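# Assumption: the cell that builds df_17_to_19 (In [58]) is not shown;
# it is presumed to stack the three yearly frames into one dataframe.
df_17_to_19 = pd.concat([df_focus_17, df_focus_18, df_focus_19])

x_train_all, x_test_all, y_train_all, y_test_all = split_train(df_17_to_19, ["year", "mileage", "engineSize"])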

In [60]:
model_all, r2_all = linear_model(x_train_all, x_test_all, y_train_all, y_test_all)

In [61]:
r2_all.round(3)

0.705

In [62]:
cross_val_all = cross_validate(LinearRegression(), df_17_to_19[["year", "mileage", "engineSize"]], df_17_to_19["price"], cv=2 )

In [63]:
print(cross_val_all["test_score"].mean().round(4))

0.6253


This also performs poorly and isn't accurate enough to use as the final model. I will try XGBoost to see if it can improve the model; if not, separate models for each year will have to be used for prediction.

### Combined years using XGBoost Regressor

In [64]:
xgb_r_all = xg.XGBRegressor(objective = "reg:squarederror", n_estimators = 10)

In [65]:
cross_val_boost_all = cross_validate(xgb_r_all, df_17_to_19[["year", "mileage", "engineSize"]], 
                                         df_17_to_19["price"], cv=5 )

In [66]:
print(cross_val_boost_all["test_score"].mean().round(3))

0.712


This is a clear improvement on the combined linear model (0.712 against 0.625) and is only slightly below the best model, the first one (0.741). This will be used as the final predictor model.
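
One caveat when comparing these scores: the XGBoost run uses `cv=5` while the linear models use `cv=2`, and an integer `cv` gives unshuffled K-fold splits for regressors, so if the combined frame happens to be ordered by year the folds may not be representative. A like-for-like comparison can be sketched by passing the same shuffled splitter to both models. The data below is synthetic, and scikit-learn's `GradientBoostingRegressor` is used as a stand-in for `XGBRegressor` to keep the example dependency-free; it is a different implementation, not the model trained above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for df_17_to_19; all numbers are made up.
rng = np.random.default_rng(1)
n = 300
year = rng.choice([2017, 2018, 2019], n)
mileage = rng.uniform(5_000, 80_000, n)
engine = rng.choice([1.0, 1.5, 2.0], n)
X = np.column_stack([year, mileage, engine])
y = (
    12_000 + 2_500 * (year - 2017) - 0.06 * mileage + 3_000 * engine
    + rng.normal(0, 600, n)
)

# One shuffled splitter shared by both models, so folds are identical.
splitter = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "linear": LinearRegression(),
    "boosted": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    fold_scores = cross_validate(model, X, y, cv=splitter)["test_score"]
    results[name] = float(fold_scores.mean())
    print(f"{name}: mean R2 = {results[name]:.3f}")
```

On the real data, swapping in `xg.XGBRegressor(...)` and the `df_17_to_19` columns would show whether the gap between the two models survives identical folds.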

### Final Model 

The final model is trained on all of the data and then used for some predictions. See models.py for the function used to train the model.

In [67]:
xgb_r_final = xgb_r(df_17_to_19)



Now I will test the model with data taken from www.autotrader.co.uk.
On the website, each car is given a price label, which is a useful benchmark for the model's predictions.
The rating categories are: lower price, great price, good price, fair price and higher price.
The following is taken from their website to show how they calculate the categories:

    - The car's condition
    - The car's colour
    - Regional supply and demand
    - Special modifications
    - Additional services, warranties, admin fees and finance deals
    - The number of owners
    - The car's service history

### Car 1: Ford Focus 2017 (50,206 miles, 1.5l engine and asking price £8925)
This car is advertised as "lower price", so let's see if the model agrees.

In [68]:
custom_input_for_model(xgb_r_final, [2017, 50206, 1.5], 8925)

9833
Potential Profit: £908


### Car 2: Ford Focus 2017 (5500 miles, 1.0l engine and asking price £12,800)

Advertised as "fair price"

In [69]:
custom_input_for_model(xgb_r_final, [2017, 5500, 1.0], 12800)

12133
Potential Profit: £-667


### Car 3: Ford Focus 2018 (34,875 miles, 1.5l engine and asking price is £10,499)
Advertised as "great price"

In [70]:
custom_input_for_model(xgb_r_final, [2018, 34875, 1.5], 10499)

11805
Potential Profit: £1306


### Car 4: Ford Focus 2019 (9,020 miles, 2.3l and asking price is £26750)
Advertised as "fair price"

In [71]:
custom_input_for_model(xgb_r_final, [2019, 9020, 2.3], 26750)

25213
Potential Profit: £-1537


### Car 5: Ford Focus 2019 (15,900 miles, 1.0l and asking price is £15,740)
Advertised as "great price"

In [72]:
custom_input_for_model(xgb_r_final, [2019, 15900, 1.0], 15740)

15720
Potential Profit: £-20


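We can see from the results above that the model's predictions broadly agree with the categorical labels that Auto Trader has given the cars: the "lower price" and "great price" cars (Cars 1 and 3) come out as undervalued, and the "fair price" cars (Cars 2 and 4) come out as slightly overpriced. The exception is Car 5, a "great price" car that the model values at almost exactly the asking price. One big factor that the model can't consider is the car's condition.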
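
The "Potential Profit" figures printed for the five cars are consistent with predicted price minus asking price. `custom_input_for_model` lives in models.py and is not shown here, so the helper below (`potential_profit`, a hypothetical name) is only a reconstruction from the printed outputs, using the predictions and asking prices listed above.

```python
def potential_profit(predicted_price, asking_price):
    """Positive when the model values the car above its asking price."""
    return predicted_price - asking_price

# (label, model prediction, asking price) copied from the outputs above.
cars = [
    ("Car 1", 9833, 8925),
    ("Car 2", 12133, 12800),
    ("Car 3", 11805, 10499),
    ("Car 4", 25213, 26750),
    ("Car 5", 15720, 15740),
]

profits = {label: potential_profit(pred, ask) for label, pred, ask in cars}
undervalued = [label for label, profit in profits.items() if profit > 0]

for label, profit in profits.items():
    print(f"{label}: potential profit £{profit}")
print("Flagged as undervalued:", undervalued)
```

A scraper built on this would probably want a margin threshold rather than `profit > 0`, since small differences such as Car 5's are well within the model's error.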