# Evaluation of the Models

## Results of the models

### Linear regression

#### Price

The main observations for the main target variable price:

- Mean absolute error (MAE) in test data is 4482\\$. This means that on average the error in the predicted prices is roughly $\pm$4500\\$.
  
- Mean absolute percentage error (MAPE) in test data is 18.6%. In other words, the average of individual relative errors in the predicted prices is 18.6%.
  
- $R^2$ coefficient of determination in test data is roughly 0.69. This means that 69% of the variance in price (dependent variable) can be explained by using the selected numerical features (independent variables).
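
For reference, the three metrics can be computed directly from the actual and predicted values. The following is a minimal sketch with made-up toy prices, not the code used to produce the numbers above; the function name `regression_metrics` is an assumption introduced for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, MAPE (in percent) and R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    # MAPE averages the individual relative errors, so every y_true must be non-zero.
    mape = 100.0 * np.mean(np.abs(errors) / np.abs(y_true))
    # R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean).
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": float(mae), "mape": float(mape), "r2": float(r2)}


# Toy prices in dollars (hypothetical, for illustration only).
actual = [10_000, 20_000, 30_000, 40_000]
predicted = [12_000, 18_000, 33_000, 39_000]
metrics = regression_metrics(actual, predicted)
```

On this toy data the sketch gives an MAE of 2000, a MAPE of 10.625% and an $R^2$ of 0.964.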

The MAE of about 4500\\$ and the $R^2$ value of 0.69 are rather good for a linear model, bearing in mind that we used only numeric features in the linear regression, so many features are missing from the model. The MAPE of 18.6% is good as well, since it falls between 10% and 20%. Some other observations:
  
- The five features that affect the price the most are `horsepower`, `year`, `mileage`, `major_options` and `torque`. Of these five, `mileage` contributes negatively to the price and the rest contribute positively. These features and the direction of their contributions make sense for used-car data from the US.
  
- The scatter plot of predicted prices against their deviations from the actual prices shows a cone-shaped distribution, which means that the variability of the model's predictions grows for higher-priced vehicles. The plot also reveals unrealistic predictions: some of the predicted prices are negative.
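
The cone shape and the negative predictions can also be quantified rather than only read off the plot. The following is a rough sketch on synthetic data whose noise grows with the price; the helper names `residual_spread_by_bin` and `share_negative` are hypothetical and the data is not from our dataset.

```python
import numpy as np


def residual_spread_by_bin(y_true, y_pred, n_bins=4):
    """Mean absolute residual within equally sized bins of the actual value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.argsort(y_true)  # sort cars from cheapest to most expensive
    abs_res = np.abs(y_true - y_pred)[order]
    return [float(chunk.mean()) for chunk in np.array_split(abs_res, n_bins)]


def share_negative(y_pred):
    """Fraction of predictions below zero, which is impossible for a price."""
    return float(np.mean(np.asarray(y_pred, dtype=float) < 0))


# Synthetic data (assumption): the noise standard deviation is 20% of the price.
rng = np.random.default_rng(0)
actual = rng.uniform(1_000, 60_000, size=2_000)
predicted = actual + rng.normal(0.0, 0.2 * actual)

spread = residual_spread_by_bin(actual, predicted)
negative_share = share_negative(predicted)
```

If the mean absolute residual increases from the cheapest bin to the most expensive one, the errors fan out in the way the cone-shaped plot suggests.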

#### Days on market
The main observations for the secondary target variable days on market:

- Mean absolute error (MAE) in test data is 48. This means that on average the error in the predicted days is roughly $\pm$48 days.
  
- Mean absolute percentage error (MAPE) was not calculated because the data contains cars whose days on market value is 0, so calculating MAPE for days on market would lead to division by zero. Instead, the MAE relative to the average days on market is 81.6% and the MAE relative to the median is 141.6%.
  
- $R^2$ coefficient of determination in test data is roughly 0.05. This means that only 5% of the variance in days on market (dependent variable) can be explained by using the selected numerical features (independent variables).
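
The relative MAE values replace the per-car denominator of MAPE with a single denominator, the mean or the median of the actual values, which stays well defined even when some cars have 0 days on market. A minimal sketch with made-up values (the function name `relative_mae` is an assumption):

```python
import numpy as np


def relative_mae(y_true, y_pred):
    """MAE divided by the mean and by the median of the actual values (in percent)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    return {
        "mae": float(mae),
        "relative_to_mean": float(100.0 * mae / y_true.mean()),
        "relative_to_median": float(100.0 * mae / np.median(y_true)),
    }


# Toy days-on-market values (hypothetical); note the car listed for 0 days,
# which would make the per-car relative error of MAPE undefined.
actual_days = [0, 10, 20, 30, 140]
predicted_days = [20, 30, 40, 50, 60]
result = relative_mae(actual_days, predicted_days)
```

Here the MAE is 32 days, which is 80% of the mean (40 days) and 160% of the median (20 days). The same arithmetic applied to the reported figures implies a mean of roughly 59 days and a median of roughly 34 days on market in the test data, so the distribution is skewed towards long listing times.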

The MAE of 48 days and the $R^2$ value of 0.05 are very poor. Even allowing for the fact that we used only numeric features in the linear regression, these results can be considered extremely bad. Some other observations:
  
- The five features that affect the days on market the most are `mileage`, `combine_fuel_economy`, `engine_displacement`, `savings_amount` and `torque`. All five contribute negatively to the days on market, and their contributions are harder to interpret than in the case of price.
  
- The scatter plot of predicted days on market against their deviations from the actual values shows a wide variance in predictions, particularly for vehicles with a higher number of days on market: the model struggles to predict longer market times accurately. The plot also reveals unrealistic predictions: some of the predicted days on market are negative.


### Random forest regression

#### Price
The main observations for the main target variable price:

- Mean absolute error (MAE) in test data is 1617\\$. This means that on average the error in the predicted prices is roughly $\pm$1600\\$.
  
- Mean absolute percentage error (MAPE) in test data is 6.4%. In other words, the average of individual relative errors in the predicted prices is 6.4%.
  
- $R^2$ coefficient of determination in test data is roughly 0.95. This means that 95% of the variance in price (dependent variable) can be explained by using the other features (independent variables).

The MAE value 1600\\$ and $R^2$ value 0.95 are excellent for the random forest model. We used all features of the cleaned dataset in training, validation and testing. The MAPE value 6.4% of the random forest model is excellent as well since it is well under 10%. Some other observations:
  
- The five most important features which affect the price the most are `year`, `horsepower`, `mileage`, `make_name` and `size`. In contrast to the linear model we do not know whether the contributions are positive or negative. These top 5 important features make sense since we are talking about used cars data in the US.
  
- The scatter plot visualization of predicted prices and deviations from actual prices shows that predictions are very close to the actual prices, with minor deviations, demonstrating the model's accuracy for those instances. The model seems to maintain its performance even for higher-value vehicles, which can be challenging to predict due to the complexity of factors that drive higher prices.
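
The difference between signed linear coefficients and unsigned random forest importances can be seen on a small example. This is a sketch on synthetic stand-in data, assuming scikit-learn's `LinearRegression` and `RandomForestRegressor`; the data, the two-feature setup and the model settings are assumptions and do not reproduce our trained models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical): price rises with year, falls with mileage.
rng = np.random.default_rng(0)
n = 500
year = rng.uniform(2000, 2020, n)
mileage = rng.uniform(0, 200_000, n)
price = 1_000 * (year - 2000) - 0.05 * mileage + rng.normal(0, 500, n)
X = np.column_stack([year, mileage])
names = ["year", "mileage"]

linear = LinearRegression().fit(X, price)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, price)

# Linear coefficients carry a sign: positive for year, negative for mileage.
coefficients = dict(zip(names, linear.coef_))
# Impurity-based importances are non-negative shares that sum to 1, with no direction.
importances = dict(zip(names, forest.feature_importances_))
```

The importances tell us how much each feature is used to reduce the prediction error in the trees, which is why they rank features but cannot say whether a feature raises or lowers the price.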

#### Days on market
The main observations for the secondary target variable days on market:

- Mean absolute error (MAE) in test data is 38. This means that on average the error in the predicted days is roughly $\pm$38 days.
  
- Mean absolute percentage error (MAPE) was not calculated since the data contains cars which have days on market value 0. Calculating MAPE value for days on market feature would lead to division by zero. Relative mean absolute error with respect to average is 64.4% and relative mean absolute error with respect to median is 111.8%.
  
- $R^2$ coefficient of determination in test data is roughly 0.34. This means that 34% of the variance in days on market (dependent variable) can be explained by using the other features (independent variables).

The MAE of 38 days and the $R^2$ value of 0.34 are rather poor for the random forest model, even though we used all features of the cleaned dataset in training, validation and testing. The $R^2$ value is promising compared with the linear model, but the relative MAE with respect to both the average and the median is too large for making any practical predictions. Some other observations:
  
- The five most important features which affect the days on market the most are `savings_amount`, `mileage`, `seller_rating`, `year` and `major_options`. In contrast to the linear model we do not know whether the contributions are positive or negative and the interpretation of the contributions is not as clear as in the case of the feature price.
  
- The scatter plot of predicted days on market against their deviations from the actual values shows a wide variance in predictions, especially for vehicles with a longer predicted days on market. The predictions tend to become less accurate as the days on market increase.

## Comparison of the models

Some observations when comparing the two models and their performance:

- The random forest regression is significantly more accurate than linear regression. Especially when predicting price, all three metrics are far better for the random forest: MAE 1617\\$ against 4482\\$, MAPE 6.4% against 18.6%, and $R^2$ 0.95 against 0.69.

- In linear regression we included only numerical features, which means that there were fewer independent variables in the model than in the random forest regression. This naturally leads to poorer accuracy. We could have included categorical features in the linear regression model by one-hot encoding them with pandas' `get_dummies`, which might have improved its accuracy.

- Even though linear regression is less accurate, it is far faster to train and evaluate, especially on large datasets. Training and evaluating the random forest regressor took almost two hours of computing time, whereas the linear model was fitted almost instantaneously. In addition, our linear model has no hyperparameters to tune, so the validation phase can be skipped, which saves a lot of time.

- Both models struggle to predict days on market. The random forest model is more accurate, but not accurate enough to be used in any practical application.

- The results of the models are almost the same on the training/validation data and on the test data. This means that the models are not overfitting and that they have good potential to generalize to new, unseen car data.

- In linear regression the sign of each coefficient shows whether the dependence between a feature and price or days on market is positive or negative. Random forest regression only gives feature importances as percentage values and says nothing about the direction of the dependence. However, one should be careful when interpreting the coefficients of the linear model, since they can differ considerably from the pairwise linear dependencies shown in the correlation matrix.

- When predicting the feature price the features year, horsepower and mileage seem to be important to both models. Since the models are independent of each other we can conclude that their contribution to price is real and not just a by-product of the individual model and its performance.

- When predicting the feature days on market the features savings amount and mileage seem to be important to both models. Again by the independence of the models we can say that these are true factors which contribute to the days on market value.
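
The encoding mentioned for the categorical features can be illustrated on a tiny example. This sketch uses made-up rows, with `make_name` and `mileage` borrowed only as column names; the integer codes are produced with pandas' categorical codes as a stand-in for a label encoder, which is an assumption about how such an encoding behaves rather than our preprocessing code.

```python
import pandas as pd

# Tiny hypothetical sample; the values are made up.
cars = pd.DataFrame({
    "make_name": ["Ford", "Toyota", "BMW", "Ford", "Honda"],
    "mileage": [42_000, 15_000, 60_000, 8_000, 31_000],
})

# Integer codes: still one column, but the numbers impose an arbitrary order on the makes.
label_encoded = cars.assign(make_name=cars["make_name"].astype("category").cat.codes)

# One-hot encoding: one indicator column per unique value, no artificial order.
one_hot = pd.get_dummies(cars, columns=["make_name"])

n_label_columns = label_encoded.shape[1]
n_one_hot_columns = one_hot.shape[1]
```

With four distinct makes the one-hot frame has five columns instead of two. A linear model can give each indicator column its own coefficient, whereas a single integer-coded column forces one slope across an arbitrary ordering of categories.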

## Possible improvements


- The engineered feature `size` could have been designed better. The dimensions were in inches, so the dominant part of the new feature came from the dimensions of the car, while the volume of the fuel tank and the maximum seating contributed little. Converting each component to a common unit such as cubic meters might have given a more balanced outcome.

- The models could be improved by gathering data from different sources. Here one should be aware that, for example, US citizens and European citizens might have different preferences for used cars, so the combined data could contain conflicting information.

- The accuracy of the models could have been improved by investigating the data more thoroughly. Publicly available information about sold used cars could help with feature engineering and outlier removal. For example, the feature interior color now contains almost 27 000 unique values, which may be too many for a categorical feature.

- Including geolocation data could also have improved the accuracy. Cars can be sold at a higher price in wealthier areas, and the climate and environment can dramatically affect the condition of the car being sold. We also dropped many features because they contained a lot of missing values. With careful investigation these features could have been filled reliably and used as independent variables to increase the accuracy of the models.

- The poor accuracy for days on market may be because we dropped something essential from the data, or because days on market does not depend strongly on the other features. This could be resolved by a more thorough analysis of the raw data and by using third-party information about the sales of used cars.

- We encoded the categorical variables using `LabelEncoder`. The accuracy might be increased by using pandas' `get_dummies` (one-hot) encoding instead. However, `get_dummies` adds as many columns to the data as there are unique values in the feature, so the dataset grows considerably and training the machine learning models requires a sufficient amount of memory (and computing power).

- The random forest regressor is not easy to optimize with a dataset as large as ours, since training and validation take almost an hour. One option would be to tune the hyperparameters (`n_estimators` and `max_depth`) in the validation phase on a smaller random sample of the data.

- The accuracy of the linear regression model was not as good as the correlation matrix led us to expect. This may be because correlation does not imply causation: a third feature could drive a linear increase or decrease in both members of a pair of features (one of which is price). With a more careful investigation of the original raw data and further feature engineering, it may be possible to increase the accuracy of the linear model.
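
One way a correlation matrix and a fitted linear model can disagree is when two features are strongly correlated with each other. The sketch below uses purely synthetic variables `x1`, `x2` and `y` (hypothetical, not columns of our dataset) to show a feature whose pairwise correlation with the target is positive while its coefficient in a multiple regression is negative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# x2 is almost a copy of x1, so the two features are strongly correlated.
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
# By construction the target depends positively on x1 and negatively on x2.
y = 3.0 * x1 - 1.0 * x2 + 0.1 * rng.normal(size=n)

# Pairwise view, as in a correlation matrix: x2 looks positively related to y.
corr_x2_y = float(np.corrcoef(x2, y)[0, 1])

# Multiple regression with an intercept, solved by ordinary least squares.
A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
coef_x2 = float(coef[2])
```

The pairwise correlation of `x2` with `y` is positive only because `x2` moves together with `x1`; once `x1` is in the model, the coefficient of `x2` reflects its own negative effect. High pairwise correlations therefore do not guarantee a correspondingly accurate or easily interpretable linear model.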

## Conclusions

A short summary of the evaluation phase:

- Both linear regression and random forest regression perform well in predicting the price of a used car. The performance of the random forest model (MAPE = 6.4% and $R^2$ = 0.95) is much better than that of the linear model (MAPE = 18.6% and $R^2$ = 0.69), but the random forest takes much more time to train. However, both models fail to give accurate and practical predictions for days on market.

- Features year, horsepower and mileage seem to be important to both models when predicting price. Features savings amount and mileage are important to both models when predicting days on market.

- The similarity of results in training/validation data and test data indicates that our models are not overfitting. Thus we can safely apply them to unseen new car data. 

- There are a lot of potential ways to improve our models (e.g. by doing more careful data preparation, preprocessing and feature selection/engineering) and these can be taken into account in the future iterations of CRISP-DM phases.
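
The overfitting check summarised above amounts to scoring the same fitted model on the training data and on the test data and comparing the two. A minimal sketch, assuming scikit-learn and synthetic data in place of the cleaned dataset; the helper name `train_test_gap` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def train_test_gap(model, X_train, y_train, X_test, y_test):
    """Score a fitted model on both splits so the results can be compared."""
    report = {}
    for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X)
        report[name] = {"mae": mean_absolute_error(y, pred), "r2": r2_score(y, pred)}
    # A large positive gap would suggest that the model memorises the training data.
    report["r2_gap"] = report["train"]["r2"] - report["test"]["r2"]
    return report


# Synthetic regression data (assumption) standing in for the real features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=1_000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
report = train_test_gap(model, X_train, y_train, X_test, y_test)
```

The same helper works for any fitted scikit-learn regressor, so it could be applied to a random forest as well. What counts as an acceptably small gap is a judgment call that depends on the metric and the application.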

Finally, we can conclude that:

- As both of our models are performing well in price prediction and give the same three most important features we can deduce that the data preparation, preprocessing, feature selection and training and testing of models went well for our main target variable price.
  
- Our secondary target variable days on market was much more difficult to predict, which suggests that something went wrong in the data preprocessing and feature selection phases, or that days on market does not depend strongly on the other features of the original dataset. This could be investigated in more detail in the future.
