# Used Cars Market Value Predictor

-----

## Overview

### Description

<div style="color: #196CC4;">
The used car sales service Rusty Bargain is developing an app to attract new customers. Thanks to this app, you can quickly find out the market value of your car. You have access to the history: technical specifications, equipment versions and prices. You need to create a model that determines the market value. Rusty Bargain is interested in:<br>

▶ prediction quality;<br>
▶ prediction speed;<br>
▶ training time required
 </div>

### Objective

<div style="color: #196CC4;">Develop a used vehicle price prediction model for Rusty Bargain, aiming to provide users with an accurate and quick estimate of their car's market value. The model must meet Rusty Bargain's quality criteria, ensuring accurate predictions and adequate execution speed to enhance the user experience on the platform.</div>

### Resources

<div style="color: #196CC4;">
<b>Features</b><br>
▶ DateCrawled — date the profile was downloaded from the database<br>
▶ VehicleType — vehicle body type<br>
▶ RegistrationYear — year of vehicle registration<br>
▶ Gearbox — type of gearbox<br>
▶ Power — power (hp)<br>
▶ Model — vehicle model<br>
▶ Mileage — mileage (measured in km according to the regional specifics of the dataset)<br>
▶ RegistrationMonth — month of vehicle registration<br>
▶ FuelType — fuel type<br>
▶ Brand — vehicle brand<br>
▶ NotRepaired — vehicle with or without repair<br>
▶ DateCreated — date the profile was created<br>
▶ NumberOfPictures — number of vehicle pictures<br>
▶ PostalCode — postal code of the profile owner (user)<br>
▶ LastSeen — date the user was last active<br>

<b>Target</b><br>
▶ Price — price (in euros)<br>
</div>

### Methodology

<div style="color: #196CC4;">
<ol>
<li><strong>Data Initialization and Exploratory Analysis</strong>
<ul>
<li>Import libraries, modules, and the dataset: car_data.csv.</li>
<li>Perform exploratory data analysis, including preliminary correlations.</li>
<li>Calculate descriptive statistics.</li>
<li>Identify and handle missing and duplicate values.</li>
<li>Remove features that do not contribute to the model.</li>
<li>Normalize features to ensure a common scale.</li>
<li>Encode categorical variables (One-Hot Encoding and Label Encoding).</li>
<li>Split the dataset into training, validation, and test sets (3:1:1 ratio).</li>
</ul>
</li>

<li><strong>Model Training</strong>
<ul>
<li>Train models with and without gradient boosting.</li>
<li>Models without gradient boosting: Decision Tree, Linear Regression, and Random Forest.</li>
<li>Models with gradient boosting: LightGBM, CatBoost, and XGBoost.</li>
<li>Evaluate models using MSE, RMSE, MAPE, and training/prediction time.</li>
</ul>
</li>

<li><strong>Evaluation and Comparison</strong>
<ul>
<li>Present a summary of the metrics obtained for each model.</li>
<li>Compare models in terms of prediction quality (MSE, RMSE, MAPE) and efficiency (training and prediction time).</li>
</ul>
</li>

<li><strong>Conclusions</strong>
<ul>
<li>Select the optimal model based on performance and project requirements.</li>
</ul>
</li>
</ol>

</div>

-----

## General Information

### Initialization

In [1]:
# Data analysis
import pandas as pd

# Label Encoding for categories
from sklearn.preprocessing import LabelEncoder

# Testing sets
from sklearn.model_selection import train_test_split

# Model training
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Gradient boosting model training
import lightgbm as lgb
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
import time

from sklearn.preprocessing import MinMaxScaler

In [2]:
# Import data
cars = pd.read_csv('datasets/car_data.csv')

### Dataset general overview


<div style="color: #196CC4;">
▶ DataFrame General properties
</div>

In [3]:
# General Dataframe properties
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [4]:
# General data overview
display(cars.head(3))

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47


<div style="color: #196CC4;">
▶ Checking for duplicate values
</div>

In [5]:
# Duplicated values
cars_duplicates = cars.duplicated()

# Sum of duplicated values
total_cars_duplicates = cars_duplicates.sum()

# Duplicated rows
cars_duplicates_rows = cars[cars_duplicates]

# Display data
print("Total of duplicate values:")
print(total_cars_duplicates)
print()
#print("List of duplicated rows:")
#display(cars_duplicates_rows)

Total of duplicate values:
262



<div style="color: #196CC4;">
▶ Descriptive statistics for numerical data
</div>

In [6]:
# Descriptive statistics
cars.describe()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


<div style="color: #196CC4;">
▶ Regarding categorical variables, we will now check the number of values for each one:
</div>

In [7]:
# Unique values in 'DateCrawled'
unique_date_crawled = cars['DateCrawled'].nunique()
print("Number of unique values in the 'DateCrawled' series:", unique_date_crawled)

# Unique values in 'VehicleType'
unique_vehicle_type = cars['VehicleType'].nunique()
print("Number of unique values in the 'VehicleType' series:", unique_vehicle_type)

# Unique values in 'Gearbox'
unique_gearbox = cars['Gearbox'].nunique()
print("Number of unique values in the 'Gearbox' series:", unique_gearbox)

# Unique values in 'Model'
unique_model = cars['Model'].nunique()
print("Number of unique values in the 'Model' series:", unique_model)

# Unique values in 'FuelType'
unique_fuel_type = cars['FuelType'].nunique()
print("Number of unique values in the 'FuelType' series:", unique_fuel_type)

# Unique values in 'Brand'
unique_brand = cars['Brand'].nunique()
print("Number of unique values in the 'Brand' series:", unique_brand)

# Unique values in 'NotRepaired'
unique_not_repaired = cars['NotRepaired'].nunique()
print("Number of unique values in the 'NotRepaired' series:", unique_not_repaired)

# Unique values in 'DateCreated'
unique_date_created = cars['DateCreated'].nunique()
print("Number of unique values in the 'DateCreated' series:", unique_date_created)

# Unique values in 'LastSeen'
unique_last_seen = cars['LastSeen'].nunique()
print("Number of unique values in the 'LastSeen' series:", unique_last_seen)

Number of unique values in the 'DateCrawled' series: 15470
Number of unique values in the 'VehicleType' series: 8
Number of unique values in the 'Gearbox' series: 2
Number of unique values in the 'Model' series: 250
Number of unique values in the 'FuelType' series: 7
Number of unique values in the 'Brand' series: 40
Number of unique values in the 'NotRepaired' series: 2
Number of unique values in the 'DateCreated' series: 109
Number of unique values in the 'LastSeen' series: 18592
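
The nine near-identical statements in the cell above can be collapsed into a single call. The following is a minimal sketch on a small stand-in frame; the `sample` data is invented for illustration, and with the real data `cars` would take its place.

```python
import pandas as pd

# Hypothetical stand-in for the real `cars` DataFrame
sample = pd.DataFrame({
    "Gearbox": ["manual", "auto", "manual", None],
    "Brand": ["audi", "jeep", "audi", "volkswagen"],
    "Price": [480, 18300, 9800, 1500],
})

# Number of distinct values in every non-numeric column;
# missing values are not counted as a category
unique_counts = sample.select_dtypes(exclude="number").nunique()
print(unique_counts)
```

Selecting by dtype keeps the check in sync if columns are later added or dropped.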


### Initial Observations

<div style="color: #196CC4;">
<b>Initial Observations:</b><br>
▶ Series names are capitalized.<br>
▶ There are 262 duplicate rows.<br>
▶ Several series, such as "DateCrawled", "Model" (since we have the vehicle brand), "DateCreated", and "LastSeen", are not relevant to the purpose of this project.<br>
▶ There are missing values in "VehicleType", "Gearbox", "Model", "FuelType", and "NotRepaired".<br>
▶ Some series are of the "object" data type, requiring careful consideration for data treatment.<br>
▶ Descriptive statistics reveal outliers in the "Price", "RegistrationYear", and "Power" series.
</div>
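
The missing-value observation can be quantified per column. This is a small sketch on invented data (the `sample` frame is hypothetical); it reports both the count and the share of missing entries.

```python
import pandas as pd

# Hypothetical stand-in for the real `cars` DataFrame
sample = pd.DataFrame({
    "VehicleType": [None, "coupe", "suv", "small"],
    "Gearbox": ["manual", "manual", "auto", None],
    "Price": [480, 18300, 9800, 1500],
})

# Count and share of missing values per column
missing = pd.DataFrame({
    "missing": sample.isna().sum(),
    "share": sample.isna().mean(),
})

# Show only the columns that actually have gaps
print(missing[missing["missing"] > 0])
```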

-----

## Exploratory Data Analysis (EDA)

### Data Cleaning

<div style="color: #196CC4;">
▶ It is suggested that column names be lowercased for greater consistency and easier manipulation.
</div>

In [8]:
# Convert to lowercase
mapeo = {columna: columna.lower() for columna in cars.columns}

# Renaming DataFrame
cars = cars.rename(columns=mapeo)

<div style="color: #196CC4;">
<b>Duplicate Values</b><br>
▶ The 262 duplicate data points in "cars" represent a small portion of the overall dataset (354,369) and do not significantly impact the overall representation of the remaining data. Therefore, it is suggested to remove these records as they may affect the training results.
</div>

In [9]:
# Delete duplicates
cars = cars.drop_duplicates().reset_index(drop=True)

<div style="color: #196CC4;">
<b>Series Removal Without Affecting Results</b><br>
▶ <b>datecrawled:</b> With 15,470 unique values, the dates on which the profiles were downloaded from the database are unlikely to be significant for vehicle prices.<br>
▶ <b>model:</b> With 250 unique values, the model may be too detailed to be useful in predicting prices, and we already have the vehicle brand.<br>
▶ <b>datecreated:</b> With 109 unique values, the profile creation date may not be relevant for predicting vehicle prices.<br>
▶ <b>lastseen:</b> With 18,592 unique values, the date of the user's last activity may not have a direct influence on vehicle prices.<br>
▶ <b>numberofpictures:</b> The number of pictures does not influence the project's objectives.<br>
▶ <b>postalcode:</b> Similarly, the postal code does not influence the project's objectives.<br>
</div>

In [10]:
# Useless Series
delete_series = ['datecrawled', 'model', 'datecreated', 'lastseen', 'numberofpictures', 'postalcode']

# Delete Series
cars = cars.drop(delete_series, axis=1)

# DataFrame updated
print(cars.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354107 entries, 0 to 354106
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   price              354107 non-null  int64 
 1   vehicletype        316623 non-null  object
 2   registrationyear   354107 non-null  int64 
 3   gearbox            334277 non-null  object
 4   power              354107 non-null  int64 
 5   mileage            354107 non-null  int64 
 6   registrationmonth  354107 non-null  int64 
 7   fueltype           321218 non-null  object
 8   brand              354107 non-null  object
 9   notrepaired        282962 non-null  object
dtypes: int64(5), object(5)
memory usage: 27.0+ MB
None


<div style="color: #196CC4;">
<b>Handling Missing Values</b><br>
▶ <b>Row Elimination:</b> The dataset has 354,107 entries, and the number of missing values is comparatively small. Therefore, all rows with a missing value in the "vehicletype" series are removed. This leaves 316,623 entries, or 89.4% of the original dataset.<br>
▶ <b>Imputation:</b> The number of remaining missing values is relatively small compared to the initial size when we had the complete dataset. Therefore, to continue working with that 89.4% of data without reducing the sample of valuable information further, we will impute the missing values using the mode (most frequent value) of each series for "gearbox", "fueltype", and "notrepaired". <br><br>
Partially eliminating rows with missing values helps ensure that the remaining data is complete and ready for analysis. In this way, the amount of data to be imputed is also reduced, thus reducing the bias in the data.
</div>
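
The two-step strategy described above (drop, then impute with the mode) can be sketched on a toy frame. The data below is invented, and the assignment form of `fillna` is used because it does not rely on in-place modification of a column selection.

```python
import pandas as pd

# Hypothetical toy data with gaps in several categorical columns
sample = pd.DataFrame({
    "vehicletype": [None, "coupe", "suv", "small", "sedan"],
    "gearbox": ["manual", None, "auto", "manual", "manual"],
    "fueltype": ["petrol", "gasoline", None, "petrol", "petrol"],
})

# 1) Drop rows where 'vehicletype' is missing
cleaned = sample.dropna(subset=["vehicletype"]).copy()

# 2) Fill the remaining gaps with each column's most frequent value
for col in ["gearbox", "fueltype"]:
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode()[0])

print(cleaned)
```

Note that the mode is computed after the rows are dropped, so the imputed value reflects only the data that is kept.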

In [11]:
# Delete rows
cars.dropna(subset=['vehicletype'], inplace=True)

In [12]:
# Mode Imputation
cars['gearbox'] = cars['gearbox'].fillna(cars['gearbox'].mode()[0])
cars['fueltype'] = cars['fueltype'].fillna(cars['fueltype'].mode()[0])
cars['notrepaired'] = cars['notrepaired'].fillna(cars['notrepaired'].mode()[0])

# DataFrame updated
print(cars.info())

<class 'pandas.core.frame.DataFrame'>
Index: 316623 entries, 1 to 354106
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   price              316623 non-null  int64 
 1   vehicletype        316623 non-null  object
 2   registrationyear   316623 non-null  int64 
 3   gearbox            316623 non-null  object
 4   power              316623 non-null  int64 
 5   mileage            316623 non-null  int64 
 6   registrationmonth  316623 non-null  int64 
 7   fueltype           316623 non-null  object
 8   brand              316623 non-null  object
 9   notrepaired        316623 non-null  object
dtypes: int64(5), object(5)
memory usage: 26.6+ MB
None

<div style="color: #196CC4;">
<b>Outliers</b><br>
▶ The specific values for the mean and standard deviation of each variable are calculated below. These values can be used to adjust the limits for identifying and removing outliers in the "price", "registrationyear", and "power" series.
</div>

In [13]:
# Mean and the Standard Deviation
mean_price = cars['price'].mean()
std_price = cars['price'].std()

mean_registrationyear = cars['registrationyear'].mean()
std_registrationyear = cars['registrationyear'].std()

mean_power = cars['power'].mean()
std_power = cars['power'].std()

# Print the mean and standard deviation for each numerical column

print("Mean of 'price' column:", mean_price)
print("Standard deviation of 'price' column:", std_price)
print()
print("Mean of 'registrationyear' column:", mean_registrationyear)
print("Standard deviation of 'registrationyear' column:", std_registrationyear)
print()
print("Mean of 'power' column:", mean_power)
print("Standard deviation of 'power' column:", std_power)

Mean of 'price' column: 4658.036368172874
Standard deviation of 'price' column: 4584.561111519157

Mean of 'registrationyear' column: 2002.238640275659
Standard deviation of 'registrationyear' column: 6.566531306470144

Mean of 'power' column: 114.76780903471952
Standard deviation of 'power' column: 185.93147452990652


<div style="color: #196CC4;">
▶ Based on these values, the limits for identifying outliers will be ±2 standard deviations from the mean.
</div>
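
The ±2 standard deviation rule can be expressed as a small helper that derives the limits from the data itself instead of typing them in by hand. This is a sketch under that assumption; `sigma_bounds` and the toy `prices` series are hypothetical names introduced for illustration.

```python
import pandas as pd

def sigma_bounds(series: pd.Series, k: float = 2.0) -> tuple[float, float]:
    """Return (mean - k*std, mean + k*std) for a numeric series."""
    mean, std = series.mean(), series.std()
    return mean - k * std, mean + k * std

# Hypothetical toy prices with one extreme value
prices = pd.Series([1000, 1200, 1100, 900, 1050, 950, 1150, 20000])

low, high = sigma_bounds(prices)
kept = prices[prices.between(low, high)]  # inclusive on both ends
print(low, high, len(kept))
```

Because the mean and standard deviation are themselves pulled by extreme values, the lower limit can fall below zero for a right-skewed column such as price, in which case it removes nothing on that side.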

In [14]:
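# Limit Outliers
# Note: the constants below are hardcoded and do not match the mean and
# standard deviation printed by the previous cell; the statistics reported
# after filtering were produced with these values.
lower_bound_price = 4431.15 - 2 * 4261.85
upper_bound_price = 4431.15 + 2 * 4261.85

lower_bound_registrationyear = 2002.51 - 2 * 5.52
upper_bound_registrationyear = 2002.51 + 2 * 5.52

lower_bound_power = 110.71 - 2 * 60.40
upper_bound_power = 110.71 + 2 * 60.40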

<div style="color: #196CC4;">
▶ These limits will be used to identify and remove outliers, as shown below:
</div>

In [15]:
# Delete Outliers
cars = cars[(cars['price'] >= lower_bound_price) & (cars['price'] <= upper_bound_price)]
cars = cars[(cars['registrationyear'] >= lower_bound_registrationyear) & (cars['registrationyear'] <= upper_bound_registrationyear)]
cars = cars[(cars['power'] >= lower_bound_power) & (cars['power'] <= upper_bound_power)]

In [16]:
# Descriptive statistics
cars.describe()

Unnamed: 0,price,registrationyear,power,mileage,registrationmonth
count,271236.0,271236.0,271236.0,271236.0,271236.0
mean,3587.375724,2002.390604,103.115774,132129.252754,5.935801
std,3139.455275,4.713636,51.284131,33332.441315,3.636323
min,0.0,1992.0,0.0,5000.0,0.0
25%,1100.0,1999.0,71.0,125000.0,3.0
50%,2500.0,2002.0,102.0,150000.0,6.0
75%,5350.0,2006.0,140.0,150000.0,9.0
max,12950.0,2013.0,231.0,150000.0,12.0


<div style="color: #196CC4;">
<b>Data Scaling</b><br>
▶ Data scaling is a common preprocessing step that puts numeric features on the same scale.<br>
▶ The 'power' and 'mileage' features are min-max scaled to the [0, 1] range next; the target 'price' is left in euros.
</div>
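
With its default settings, `MinMaxScaler` maps each feature to [0, 1] using x' = (x - min) / (max - min). The sketch below reproduces that formula with NumPy on a few values chosen to span the post-filter ranges reported earlier (power 0 to 231, mileage 5,000 to 150,000); `min_max_scale` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def min_max_scale(values: np.ndarray) -> np.ndarray:
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)

# Power values spanning the post-filter range (0 to 231 hp)
power = np.array([0.0, 75.0, 163.0, 231.0])
# Mileage values spanning 5,000 to 150,000 km
mileage = np.array([5000.0, 90000.0, 125000.0, 150000.0])

scaled_power = min_max_scale(power)
scaled_mileage = min_max_scale(mileage)
print(scaled_power.round(6))
print(scaled_mileage.round(6))
```

One design point worth considering: fitting the scaler on the whole dataset before splitting lets the validation and test rows influence the minimum and maximum. Fitting on the training split only and then transforming the other splits avoids that.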

In [17]:
# Initializer
escalador = MinMaxScaler()

# Transformation
#cars[['price','power', 'mileage']] = escalador.fit_transform(cars[['price', 'power', 'mileage']])
cars[['power', 'mileage']] = escalador.fit_transform(cars[['power', 'mileage']])

In [18]:
display(cars[['power', 'mileage']])
display(cars)

Unnamed: 0,power,mileage
2,0.705628,0.827586
3,0.324675,1.000000
4,0.298701,0.586207
5,0.441558,1.000000
6,0.471861,1.000000
...,...,...
354100,0.974026,1.000000
354101,0.000000,1.000000
354104,0.437229,0.827586
354105,0.441558,1.000000


Unnamed: 0,price,vehicletype,registrationyear,gearbox,power,mileage,registrationmonth,fueltype,brand,notrepaired
2,9800,suv,2004,auto,0.705628,0.827586,8,gasoline,jeep,no
3,1500,small,2001,manual,0.324675,1.000000,6,petrol,volkswagen,no
4,3600,small,2008,manual,0.298701,0.586207,7,gasoline,skoda,no
5,650,sedan,1995,manual,0.441558,1.000000,10,petrol,bmw,yes
6,2200,convertible,2004,manual,0.471861,1.000000,8,petrol,peugeot,no
...,...,...,...,...,...,...,...,...,...,...
354100,3200,sedan,2004,manual,0.974026,1.000000,5,petrol,seat,yes
354101,1150,bus,2000,manual,0.000000,1.000000,3,petrol,opel,no
354104,1199,convertible,2000,auto,0.437229,0.827586,3,petrol,smart,no
354105,9200,bus,1996,manual,0.441558,1.000000,3,gasoline,volkswagen,no


<div style="color: #196CC4;">
<b>Encoding the remaining "object"-type series</b><br>
▶ One-Hot Encoding (OHE): With binary columns for each category. This is applied to the series 'vehicletype', 'gearbox', 'fueltype', 'notrepaired' as they have between 2 and 8 values.<br>
▶ Label Encoding: Assigning an integer to each unique category. This is applied to the 'brand' series.
</div>
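
The difference between the two encodings is easiest to see on a tiny example. The frame below is invented for illustration; the calls are the same ones the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data
sample = pd.DataFrame({
    "gearbox": ["manual", "auto", "manual"],
    "brand": ["volkswagen", "audi", "jeep"],
})

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(sample, columns=["gearbox"])

# Label encoding: categories are sorted, then numbered from 0
encoder = LabelEncoder()
encoded["brand"] = encoder.fit_transform(encoded["brand"])

print(encoded)
print(list(encoder.classes_))
```

Label encoding imposes an arbitrary (alphabetical) order on the brands. Tree-based models can usually work around this, whereas a linear model will treat the codes as magnitudes, which may hurt its fit.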

In [19]:
# OHE for Encoding
cars = pd.get_dummies(cars, columns=['vehicletype', 'gearbox', 'fueltype', 'notrepaired'])

In [20]:
# Label Encoding for 'Brand'
label_encoder = LabelEncoder()
cars['brand'] = label_encoder.fit_transform(cars['brand'])

### Display of Information

<div style="color: #196CC4;">
▶ Next, I will verify the changes made to the properties of the DataFrame 'cars' and preview it.<br>
</div>

In [21]:
# Display data
cars.info()
display(cars.head(5))

<class 'pandas.core.frame.DataFrame'>
Index: 271236 entries, 2 to 354106
Data columns (total 25 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   price                    271236 non-null  int64  
 1   registrationyear         271236 non-null  int64  
 2   power                    271236 non-null  float64
 3   mileage                  271236 non-null  float64
 4   registrationmonth        271236 non-null  int64  
 5   brand                    271236 non-null  int32  
 6   vehicletype_bus          271236 non-null  bool   
 7   vehicletype_convertible  271236 non-null  bool   
 8   vehicletype_coupe        271236 non-null  bool   
 9   vehicletype_other        271236 non-null  bool   
 10  vehicletype_sedan        271236 non-null  bool   
 11  vehicletype_small        271236 non-null  bool   
 12  vehicletype_suv          271236 non-null  bool   
 13  vehicletype_wagon        271236 non-null  bool   
 14  gearbox_a

Unnamed: 0,price,registrationyear,power,mileage,registrationmonth,brand,vehicletype_bus,vehicletype_convertible,vehicletype_coupe,vehicletype_other,...,gearbox_manual,fueltype_cng,fueltype_electric,fueltype_gasoline,fueltype_hybrid,fueltype_lpg,fueltype_other,fueltype_petrol,notrepaired_no,notrepaired_yes
2,9800,2004,0.705628,0.827586,8,14,False,False,False,False,...,False,False,False,True,False,False,False,False,True,False
3,1500,2001,0.324675,1.0,6,38,False,False,False,False,...,True,False,False,False,False,False,False,True,True,False
4,3600,2008,0.298701,0.586207,7,31,False,False,False,False,...,True,False,False,True,False,False,False,False,True,False
5,650,1995,0.441558,1.0,10,2,False,False,False,False,...,True,False,False,False,False,False,False,True,False,True
6,2200,2004,0.471861,1.0,8,25,False,True,False,False,...,True,False,False,False,False,False,False,True,True,False


### Model initialization

<div style="color: #196CC4;">
<b>The source data will be divided as follows to achieve a 3:1:1 ratio:</b><br>
▶ 60% Dataset for training<br>
▶ 20% Validation dataset<br>
▶ 20% Test dataset<br><br>
The larger the training set, the more data the model will have to learn patterns and relationships in the data. On the other hand, the validation dataset will be used to evaluate the model's performance, and the test dataset will be held out until the end to evaluate the final performance of the model (once the training and validation process is complete).
</div>
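
A 3:1:1 split takes two calls to `train_test_split`: first hold out 40%, then halve that remainder. The sketch below checks the resulting sizes on a 100-row toy frame (invented data).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 100 rows
data = pd.DataFrame({"feature": range(100), "price": range(100)})

# 60% train, 40% held out
train, remaining = train_test_split(data, test_size=0.4, random_state=12345)
# Split the held-out 40% evenly into validation and test
valid, test = train_test_split(remaining, test_size=0.5, random_state=12345)

print(len(train), len(valid), len(test))
```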

In [22]:
# Split the data into a training set (80%) and a test set (20%)
#df_train, df_test = train_test_split(cars, test_size=0.2, random_state=12345)

# Split the data into a training set (60%), a validation set (20%), and a test set (20%)
df_train, remaining_data = train_test_split(cars, test_size=0.4, random_state=12345)
df_valid, df_test = train_test_split(remaining_data, test_size=0.5, random_state=12345)

<div style="color: #196CC4;">
▶ The next cell separates the features from the target variable ('price') in each dataset, in preparation for training and evaluating the models.<br>
</div>

In [23]:
# Extract features and target variables from the training and validation sets
features_train = df_train.drop(['price'], axis=1)
target_train = df_train['price']

features_valid = df_valid.drop(['price'], axis=1)
target_valid = df_valid['price']

# FINAL TESTING SETS ↓

# Train + validation sets combined
total_features_train = pd.concat([features_train, features_valid])
total_target_train = pd.concat([target_train, target_valid])

features_test = df_test.drop(['price'], axis=1)
target_test = df_test['price']

-----

## Training without Gradient Boosting

<div style="color: #196CC4;">
The following describes the training and evaluation process of three machine learning models without gradient boosting: <br>
▶ DecisionTree <br>
▶ Linear Regression<br>
▶ Random Forest<br>

With this historical data, we will be able to predict car prices, and these results will be evaluated based on three main criteria: <br>
▶ Prediction quality through MSE and RMSE<br>
▶ Training time<br>
▶ Prediction time.</div>
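
Since every model is measured the same way (fit, predict, time both, compute the error), the pattern can be wrapped in one helper. This is a sketch, not the notebook's code: `evaluate` and `MeanRegressor` are hypothetical names, and `time.perf_counter()` is used here as a clock intended for measuring short intervals.

```python
import time
import numpy as np

class MeanRegressor:
    """Hypothetical stand-in model: always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

def evaluate(model, X_train, y_train, X_valid, y_valid):
    """Fit, predict, and return MSE/RMSE plus wall-clock timings in seconds."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    start = time.perf_counter()
    pred = model.predict(X_valid)
    pred_time = time.perf_counter() - start

    mse = float(np.mean((np.asarray(y_valid) - pred) ** 2))
    return {"mse": mse, "rmse": mse ** 0.5,
            "train_time": train_time, "pred_time": pred_time}

# Toy data: targets alternate between 1 and 3
X = np.arange(10).reshape(-1, 1)
y = np.array([1.0, 3.0] * 5)
result = evaluate(MeanRegressor(), X[:6], y[:6], X[6:], y[6:])
print(result)
```

Any estimator with `fit` and `predict` methods, including the scikit-learn and boosting models used below, could be passed in place of the stand-in.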

### DecisionTree Model

<div style="color: #196CC4;">
▶ A decision tree is a model that makes decisions by splitting the data into segments based on simple questions or rules. Each "node" in the tree represents a question about a feature, each "branch" is the outcome of that question, and each "leaf" is a final prediction.</div>

In [24]:
# Hyperparameters
max_depth = 5
min_samples_split = 2

# Train
tree_start_train_time = time.time()
decisiontree = DecisionTreeRegressor(random_state=12345, max_depth=max_depth, min_samples_split=min_samples_split)
decisiontree.fit(features_train, target_train)

# Training time
tree_end_train_time = time.time()
tree_train_time = tree_end_train_time - tree_start_train_time

# Price prediction
tree_start_pred_time = time.time()
decisiontree_pred = decisiontree.predict(features_valid)

# Predictions time
tree_end_pred_time = time.time()
tree_pred_time = tree_end_pred_time - tree_start_pred_time

In [25]:
# Quality of prediction metrics
tree_mse = mean_squared_error(target_valid, decisiontree_pred)
tree_rmse = tree_mse ** 0.5
tree_mape = mean_absolute_percentage_error(target_valid, decisiontree_pred)

# Print DecisionTree Model
print("DecisionTree Model")
print()

# Print metrics
print("Predictions quality:")
print("Mean Squared Error (MSE):", tree_mse)
print("Root Mean Squared Error (RMSE):", tree_rmse)
print("Mean Absolute Percentage Error (MAPE):", tree_mape)
print()

# Print time
print("Training Time:", tree_train_time, "seconds")
print("Prediction Time:", tree_pred_time, "seconds")

DecisionTree Model

Predictions quality:
Mean Squared Error (MSE): 2975726.1010329495
Root Mean Squared Error (RMSE): 1725.0293043983193
Mean Absolute Percentage Error (MAPE): 2.325411609911504e+17

Training Time: 0.17223548889160156 seconds
Prediction Time: 0.006007194519042969 seconds
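
The MAPE value on the order of 10^17 is most likely caused by listings with a price of 0, which remain in the data (the post-filter minimum reported earlier is 0.0). scikit-learn's `mean_absolute_percentage_error` divides each absolute error by max(|y_true|, ε), where ε is machine epsilon, so a single zero-price row produces an enormous term. The sketch below mirrors that formula in NumPy on invented numbers; `mape_nonzero` is a hypothetical alternative, not the metric used in this notebook.

```python
import numpy as np

def mape_sklearn_style(y_true, y_pred):
    """MAPE with the denominator floored at machine epsilon."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    eps = np.finfo(np.float64).eps
    return float(np.mean(np.abs(y_pred - y_true) / np.maximum(np.abs(y_true), eps)))

def mape_nonzero(y_true, y_pred):
    """MAPE restricted to rows whose true value is non-zero (an assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return float(np.mean(np.abs(y_pred[mask] - y_true[mask]) / np.abs(y_true[mask])))

y_true = [0, 1000, 2000]   # one listing priced at 0 euros
y_pred = [500, 900, 2200]

blown_up = mape_sklearn_style(y_true, y_pred)  # dominated by the zero-price row
masked = mape_nonzero(y_true, y_pred)
print(blown_up, masked)
```

For this reason the MAPE figures reported for the models are not informative as they stand; RMSE is the more reliable quality metric here unless zero-price rows are excluded.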


### Linear Regression Model

<div style="color: #196CC4;">
▶ Linear regression is a model that assumes a linear relationship between the features of an object and its value. It tries to find the straight line that best fits the data to make predictions. If we think of a graph, linear regression seeks the line that best describes how the price changes as the car's parameters are modified.</div>

In [26]:
# Model
linear_reg = LinearRegression()

# Train
rlineal_start_train_time = time.time()
linear_reg.fit(features_train, target_train)

# Training time
rlineal_end_train_time = time.time()
rlineal_train_time = rlineal_end_train_time - rlineal_start_train_time

# Price prediction
rlineal_start_pred_time = time.time()
linear_reg_pred = linear_reg.predict(features_valid)

# Predictions time
rlineal_end_pred_time = time.time()
rlineal_pred_time = rlineal_end_pred_time - rlineal_start_pred_time

In [27]:
# Quality of prediction metrics
rlineal_mse = mean_squared_error(target_valid, linear_reg_pred)
rlineal_rmse = rlineal_mse ** 0.5
rlineal_mape = mean_absolute_percentage_error(target_valid, linear_reg_pred)

# Print Linear Regression
print("Linear Regression Model")
print()

# Print metrics
print("Predictions quality:")
print("Mean Squared Error (MSE):", rlineal_mse)
print("Root Mean Squared Error (RMSE):", rlineal_rmse)
print("Mean Absolute Percentage Error (MAPE):", rlineal_mape)
print()

# Print time
print("Training Time:", rlineal_train_time, "seconds")
print("Prediction Time:", rlineal_pred_time, "seconds")

Linear Regression Model

Predictions quality:
Mean Squared Error (MSE): 3425450.8321811836
Root Mean Squared Error (RMSE): 1850.797350382041
Mean Absolute Percentage Error (MAPE): 2.388739192612065e+17

Training Time: 0.1382596492767334 seconds
Prediction Time: 0.008177042007446289 seconds


### Random Forest Model

<div style="color: #196CC4;">
▶ Random Forest is an ensemble of many decision trees that work together. Each tree is trained on different parts of the dataset, and their results are combined to make a more accurate and robust prediction.</div>

In [28]:
# Parameters
param_dist_forest = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5, 7],
    'min_samples_leaf': [1, 3, 5],
    'min_samples_split': [2, 4, 6]
}

# Randomized search & cross-validation
random_search_forest = RandomizedSearchCV(RandomForestRegressor(random_state=12345), param_dist_forest, scoring='neg_mean_squared_error', cv=2, n_iter=10)

# Training
forest_start_train_time = time.time()
random_search_forest.fit(features_train, target_train)

# Training time
forest_end_train_time = time.time()
forest_train_time = forest_end_train_time - forest_start_train_time

# Best model
best_model_forest = random_search_forest.best_estimator_

# Predictions
forest_start_pred_time = time.time()
predicted_valid_forest_random = best_model_forest.predict(features_valid)

# Predictions time
forest_end_pred_time = time.time()
forest_pred_time = forest_end_pred_time - forest_start_pred_time

In [29]:
# Quality of prediction metrics
forest_mse = mean_squared_error(target_valid, predicted_valid_forest_random)
forest_rmse = forest_mse ** 0.5
forest_mape = mean_absolute_percentage_error(target_valid, predicted_valid_forest_random)

# Print Random Forest
print("Random Forest Model")
print()

# Print metrics
print("Predictions quality:")
print("Mean Squared Error (MSE):", forest_mse)
print("Root Mean Squared Error (RMSE):", forest_rmse)
print("Mean Absolute Percentage Error (MAPE):", forest_mape)
print()

# Print time
print("Training Time:", forest_train_time, "seconds")
print("Prediction Time:", forest_pred_time, "seconds")

Random Forest Model

Predictions quality:
Mean Squared Error (MSE): 2414588.9269462116
Root Mean Squared Error (RMSE): 1553.8947605762146
Mean Absolute Percentage Error (MAPE): 2.260626719437647e+17

Training Time: 163.4826033115387 seconds
Prediction Time: 0.271392822265625 seconds
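
The training time above covers the whole randomized search (10 sampled candidates × 2 folds, plus the final refit), so it is not directly comparable to the single fits timed for the other models. The sketch below, on small synthetic data, shows one way to separate the two figures using the `refit_time_` attribute of the fitted search; passing `random_state` to the search also makes the sampled candidates reproducible. The data and the reduced parameter grid are invented for illustration.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=12345),
    {"n_estimators": [5, 10], "max_depth": [2, 3]},
    scoring="neg_mean_squared_error",
    cv=2,
    n_iter=3,
    random_state=12345,  # makes the sampled candidates reproducible
)

start = time.perf_counter()
search.fit(X, y)
search_time = time.perf_counter() - start

# refit_time_ is the time spent refitting the best candidate on all the data
print("Search + refit:", search_time, "seconds")
print("Refit only:", search.refit_time_, "seconds")
```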


-----

## Training with Gradient Boosting

<div style="color: #196CC4;">
<b>Gradient boosting</b> is a technique used in machine learning to improve the accuracy of predictive models, such as decision trees. Instead of training a single tree and hoping it is perfect, gradient boosting works in conjunction with many small trees, each of which learns from the mistakes of the previous one.<br>

The following describes the training and evaluation process of three machine learning models with gradient boosting: <br>
▶ Light Gradient Boosting Machine (LGBM)<br>
▶ CatBoost<br>
▶ XGBoost<br>

Similarly, with this historical data, we will be able to predict car prices, and these results will be evaluated based on three main criteria: <br>
▶ Prediction quality through MSE and RMSE<br>
▶ Training time<br>
▶ Prediction time.</div>
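
The idea that each tree learns from the mistakes of the previous one can be shown in a few lines for the squared-error case, where the "mistakes" are simply the residuals. This is a deliberately simplified sketch on synthetic data; LightGBM, CatBoost, and XGBoost add many refinements and do not work exactly this way internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a non-linear "price" curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) * 1000 + 3000

learning_rate = 0.1
prediction = np.full(len(y), y.mean())   # start from the mean
baseline_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(50):
    residuals = y - prediction                     # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                         # next tree learns those errors
    prediction += learning_rate * tree.predict(X)  # small corrective step
    trees.append(tree)

boosted_mse = np.mean((y - prediction) ** 2)
print(baseline_mse, boosted_mse)
```

The learning rate shrinks each tree's contribution, trading more iterations for a smoother fit.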

### LGBM

<div style="color: #196CC4;">
▶ <b>LGBM</b> is a gradient boosting algorithm that also uses decision trees; it is very efficient and fast, ideal for large datasets. LGBM also trains trees one after another, each learning from the mistakes of the previous one to improve prediction.</div>

In [30]:
# Hyperparameters
lgbm = LGBMRegressor(
    objective='regression',
    num_leaves=35,
    seed=23
)

# Training
lgbm_start_train_time = time.time()
lgbm.fit(features_train, target_train)

# Training time
lgbm_end_train_time = time.time()
lgbm_train_time = lgbm_end_train_time - lgbm_start_train_time

# Predictions
lgbm_start_pred_time = time.time()
lgbm_predicted_valid = lgbm.predict(features_valid)

# Predictions time
lgbm_end_pred_time = time.time()
lgbm_pred_time = lgbm_end_pred_time - lgbm_start_pred_time



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002990 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 346
[LightGBM] [Info] Number of data points in the train set: 162741, number of used features: 24
[LightGBM] [Info] Start training from score 3588.844753


In [31]:
# Quality of prediction metrics
lgbm_mse = mean_squared_error(target_valid, lgbm_predicted_valid)
lgbm_rmse = lgbm_mse ** 0.5
lgbm_mape = mean_absolute_percentage_error(target_valid, lgbm_predicted_valid)

# Print LGBM
print("LGBM Model")
print()

# Print metrics
print("Predictions quality:")
print("Mean Squared Error (MSE):", lgbm_mse)
print("Root Mean Squared Error (RMSE):", lgbm_rmse)
print("Mean Absolute Percentage Error (MAPE):", lgbm_mape)
print()

# Print time
print("Training Time:", lgbm_train_time, "seconds")
print("Prediction Time:", lgbm_pred_time, "seconds")

LGBM Model

Predictions quality:
Mean Squared Error (MSE): 1651852.9482498362
Root Mean Squared Error (RMSE): 1285.2443146148657
Mean Absolute Percentage Error (MAPE): 2.1233470751532112e+17

Training Time: 0.3313560485839844 seconds
Prediction Time: 0.03812909126281738 seconds


### CatBoost

<div style="color: #196CC4;">
▶ <b>CatBoost</b> is a gradient boosting library that trains a sequence of decision trees for regression and classification problems. It is best known for its built-in handling of categorical features.
</div>

In [32]:
# Hyperparameters
catboost = CatBoostRegressor(
    loss_function='RMSE',
    iterations=100,
    learning_rate=0.05,
    depth=6,
    random_seed=23
)

# Training
catboost_start_train_time = time.time()
catboost.fit(features_train, target_train, verbose=False)

# Training time
catboost_end_train_time = time.time()
catboost_train_time = catboost_end_train_time - catboost_start_train_time

# Predictions
catboost_start_pred_time = time.time()
catboost_predicted_valid = catboost.predict(features_valid)

# Predictions time
catboost_end_pred_time = time.time()
catboost_pred_time = catboost_end_pred_time - catboost_start_pred_time


In [33]:
# Quality of prediction metrics
catboost_mse = mean_squared_error(target_valid, catboost_predicted_valid)
catboost_rmse = catboost_mse ** 0.5
catboost_mape = mean_absolute_percentage_error(target_valid, catboost_predicted_valid)

# Print CatBoost
print("CatBoost Model")
print()

# Print metrics
print("Predictions quality:")
print("Mean Squared Error (MSE):", catboost_mse)
print("Root Mean Squared Error (RMSE):", catboost_rmse)
print("Mean Absolute Percentage Error (MAPE):", catboost_mape)
print()

# Print time
print("Training Time:", catboost_train_time, "seconds")
print("Prediction Time:", catboost_pred_time, "seconds")

CatBoost Model

Predictions quality:
Mean Squared Error (MSE): 2001682.4135732444
Root Mean Squared Error (RMSE): 1414.8082603565913
Mean Absolute Percentage Error (MAPE): 2.2214954381836336e+17

Training Time: 0.8778512477874756 seconds
Prediction Time: 0.010690450668334961 seconds


### XGBoost

<div style="color: #196CC4;">
▶ <b>XGBoost</b> is an optimized implementation of gradient boosting algorithms designed to improve the accuracy and efficiency of machine learning models. Its ability to handle sparse data and its fast training speed make it suitable for complex datasets like ours.
</div>

In [34]:
# Hyperparameters
xgboost = XGBRegressor(
    objective='reg:squarederror',
    n_estimators=100,
    learning_rate=0.05,
    max_depth=6,
    random_state=23
)

# Training
xgboost_start_train_time = time.time()
xgboost.fit(features_train, target_train)

# Training time
xgboost_end_train_time = time.time()
xgboost_train_time = xgboost_end_train_time - xgboost_start_train_time

# Predictions
xgboost_start_pred_time = time.time()
xgboost_predicted_valid = xgboost.predict(features_valid)

# Predictions time
xgboost_end_pred_time = time.time()
xgboost_pred_time = xgboost_end_pred_time - xgboost_start_pred_time

In [35]:
# Quality of prediction metrics
xgboost_mse = mean_squared_error(target_valid, xgboost_predicted_valid)
xgboost_rmse = xgboost_mse ** 0.5
xgboost_mape = mean_absolute_percentage_error(target_valid, xgboost_predicted_valid)

# Print XGBoost
print("XGBoost Model")
print()

# Print metrics
print("Predictions quality:")
print("Mean Squared Error (MSE):", xgboost_mse)
print("Root Mean Squared Error (RMSE):", xgboost_rmse)
print("Mean Absolute Percentage Error (MAPE):", xgboost_mape)
print()

# Print time
print("Training Time:", xgboost_train_time, "seconds")
print("Prediction Time:", xgboost_pred_time, "seconds")

XGBoost Model

Predictions quality:
Mean Squared Error (MSE): 1776524.0930035997
Root Mean Squared Error (RMSE): 1332.8631186298162
Mean Absolute Percentage Error (MAPE): 2.163655593663008e+17

Training Time: 0.7920944690704346 seconds
Prediction Time: 0.023999691009521484 seconds


-----

## Model evaluation

<div style="color: #196CC4;">
Below is a comprehensive analysis of all metrics obtained from training models with and without Gradient Boosting<br><br>
<b>Prediction Quality</b> <br>
▶ <b>Mean Squared Error (MSE):</b> the average of the squared differences between the predicted and the actual prices. Because the errors are squared, large misses are penalized heavily. A lower MSE means the model fits the data better.<br>
▶ <b>Root Mean Squared Error (RMSE):</b> the square root of the MSE. It expresses the typical error size in the same unit as the target (euros), which makes it easier to interpret. A lower RMSE indicates more accurate price predictions.<br>
▶ <b>Mean Absolute Percentage Error (MAPE):</b> the average absolute error expressed as a fraction of the actual value. Lower is better, but the metric becomes unstable when actual values are zero or close to zero. The values on the order of 10<sup>17</sup> reported in this project suggest that the target contains such prices, so MAPE should be read with caution here.<br>
▶ <b>Comparison:</b> MSE and RMSE are absolute measures: they report the error in (squared) units of the target. MAPE is a relative measure: it reports the error in proportion to the size of the actual value.<br><br>
<b>Time</b><br>
▶ <b>Training Time:</b> This tells us how long it takes to build and fit the model from the training data. A shorter training time is preferable as it allows for faster development and tuning of models.<br>
▶ <b>Prediction Speed:</b> This tells us how long it takes the model to make predictions on new data after it has been trained. A faster prediction speed is desirable, especially in real-time applications where predictions need to be generated quickly.<br>
</div>
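
<div style="color: #196CC4;">
The three quality metrics can be written out in a few lines of plain Python, which also shows why MAPE can explode. The sketch below is not the scikit-learn code used in this notebook. It assumes the denominator is clamped to machine epsilon when an actual value is zero, which is the convention scikit-learn follows, and all numbers are made-up toy values.
</div>

```python
import math
import sys


def mse(y_true, y_pred):
    """Mean of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred, eps=sys.float_info.epsilon):
    """Mean absolute error relative to the actual value (returned as a fraction).

    The denominator is clamped to a tiny epsilon so that a zero actual value
    does not raise ZeroDivisionError; it produces a huge term instead.
    """
    return sum(abs(t - p) / max(abs(t), eps) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy prices in euros (illustrative only): every prediction is off by 10 %.
y_true = [1000, 2000, 4000]
y_pred = [1100, 1800, 4400]
print("MSE :", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("MAPE:", mape(y_true, y_pred))

# The same predictions, but one listing has a price of 0.
y_true_zero = [0, 2000, 4000]
y_pred_zero = [100, 1800, 4400]
print("MAPE with a zero price:", mape(y_true_zero, y_pred_zero))

# One possible workaround: score MAPE only on rows with a positive price.
kept = [(t, p) for t, p in zip(y_true_zero, y_pred_zero) if t > 0]
mape_positive = mape([t for t, _ in kept], [p for _, p in kept])
print("MAPE on positive prices only:", mape_positive)
```

<div style="color: #196CC4;">
A single zero in the target is enough to push MAPE to the order of 10<sup>17</sup>, while MSE and RMSE stay well behaved. That is why the comparison in this project relies on RMSE.
</div>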

### Metrics Summary

<div style="color: #196CC4;">
▶ Below are all metrics obtained for all models used throughout the project
</div>

In [36]:
# Data
modelos = [
    "DecisionTree",
    "Linear Regression",
    "Random Forest",
    "LGBM",
    "CatBoost",
    "XGBoost"
]

mse = [
    tree_mse,
    rlineal_mse,
    forest_mse,
    lgbm_mse,
    catboost_mse,
    xgboost_mse
]

rmse = [
    tree_rmse,
    rlineal_rmse,
    forest_rmse,
    lgbm_rmse,
    catboost_rmse,
    xgboost_rmse
]

mape = [
    tree_mape,
    rlineal_mape,
    forest_mape,
    lgbm_mape,
    catboost_mape,
    xgboost_mape
]

train_time = [
    tree_train_time,
    rlineal_train_time,
    forest_train_time,
    lgbm_train_time,
    catboost_train_time,
    xgboost_train_time
]

pred_time = [
    tree_pred_time,
    rlineal_pred_time,
    forest_pred_time,
    lgbm_pred_time,
    catboost_pred_time,
    xgboost_pred_time
]

# DataFrame Creation
df_metrics = pd.DataFrame({
    "Model": modelos,
    "Mean Squared Error (MSE)": mse,
    "Root Mean Squared Error (RMSE)": rmse,
    "Mean Absolute Percentage Error (MAPE)": mape,
    "Training Time (seconds)": train_time,
    "Prediction Time (seconds)": pred_time
})

display(df_metrics)

| | Model | Mean Squared Error (MSE) | Root Mean Squared Error (RMSE) | Mean Absolute Percentage Error (MAPE) | Training Time (seconds) | Prediction Time (seconds) |
|---|---|---|---|---|---|---|
| 0 | DecisionTree | 2975726.0 | 1725.029304 | 2.325412e+17 | 0.172235 | 0.006007 |
| 1 | Linear Regression | 3425451.0 | 1850.79735 | 2.388739e+17 | 0.13826 | 0.008177 |
| 2 | Random Forest | 2414589.0 | 1553.894761 | 2.260627e+17 | 163.482603 | 0.271393 |
| 3 | LGBM | 1651853.0 | 1285.244315 | 2.123347e+17 | 0.331356 | 0.038129 |
| 4 | CatBoost | 2001682.0 | 1414.80826 | 2.221495e+17 | 0.877851 | 0.01069 |
| 5 | XGBoost | 1776524.0 | 1332.863119 | 2.163656e+17 | 0.792094 | 0.024 |


### Prediction Quality Analysis

<div style="color: #196CC4;">
<b>MSE (Mean Squared Error):</b><br>
▶ Measures the average of the squared errors, so lower values indicate better performance.<br>
▶ The LGBM model has the lowest MSE (about 1.65 million), followed by XGBoost, CatBoost, and Random Forest.<br>
▶ DecisionTree and Linear Regression have the highest MSE; every ensemble model outperforms them.<br>

<b>RMSE (Root Mean Squared Error):</b> <br>
▶ RMSE is the square root of the MSE, so it expresses the typical error size in euros.<br>
▶ As with MSE, the LGBM model has the lowest RMSE (about 1,285 euros), followed by XGBoost (1,333), CatBoost (1,415), and Random Forest (1,554).<br>
▶ DecisionTree (1,725) and Linear Regression (1,851) have the highest RMSE.<br>

<b>LGBM shows the best performance in terms of MSE and RMSE, with short training and prediction times (about 0.33 and 0.04 seconds).
XGBoost and CatBoost come next in quality and train in under a second, while Random Forest ranks fourth and takes about 163 seconds to train.</b>
</div>
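
<div style="color: #196CC4;">
The ranking can be checked directly from the summary table. The sketch below copies the RMSE values from that table (rounded to two decimals) into a plain list and sorts it. The variable names are illustrative, and in the notebook the same result could be obtained by sorting <code>df_metrics</code>.
</div>

```python
# RMSE values copied from the metrics summary table, rounded to two decimals.
results = [
    {"model": "DecisionTree", "rmse": 1725.03},
    {"model": "Linear Regression", "rmse": 1850.80},
    {"model": "Random Forest", "rmse": 1553.89},
    {"model": "LGBM", "rmse": 1285.24},
    {"model": "CatBoost", "rmse": 1414.81},
    {"model": "XGBoost", "rmse": 1332.86},
]

# Sort from best (lowest RMSE) to worst.
by_rmse = sorted(results, key=lambda row: row["rmse"])
best_rmse = by_rmse[0]["rmse"]

for rank, row in enumerate(by_rmse, start=1):
    # Gap to the best model, as a percentage of the best RMSE.
    gap_pct = 100 * (row["rmse"] - best_rmse) / best_rmse
    print(f"{rank}. {row['model']:<18} RMSE = {row['rmse']:8.2f}  (+{gap_pct:.1f} %)")

ranking = [row["model"] for row in by_rmse]
```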

### Performance Analysis

<div style="color: #196CC4;">
<b>Training Time:</b><br>
▶ The model with the shortest training time is Linear Regression (0.14 s), followed by DecisionTree (0.17 s) and LGBM (0.33 s).<br>
▶ XGBoost (0.79 s) and CatBoost (0.88 s) still train in under a second, while Random Forest is by far the slowest at about 163 s.<br>
▶ The gradient boosting models deliver better prediction quality than Linear Regression and DecisionTree for only a fraction of a second of extra training time. Random Forest pays a much higher training cost for a smaller gain.<br>

<b>Prediction Speed:</b><br>
▶ DecisionTree (0.006 s) and Linear Regression (0.008 s) have the lowest prediction times, followed by CatBoost (0.011 s).<br>
▶ XGBoost (0.024 s) and LGBM (0.038 s) are somewhat slower, and Random Forest is the slowest (0.271 s).<br>
▶ Although DecisionTree and Linear Regression predict fastest, their prediction quality is inferior to that of the ensemble models.
</div>
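
<div style="color: #196CC4;">
Most of the times above are fractions of a second and come from a single run, so small differences between models can be measurement noise. One way to get steadier numbers is to repeat the call and take the median. The helper below is a hypothetical sketch (<code>time_call</code> is not part of the notebook), and it uses <code>time.perf_counter</code>, a clock intended for measuring short durations.
</div>

```python
import statistics
import time


def time_call(fn, *args, repeats=5):
    """Run fn(*args) several times; return (median seconds, result of the last run)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), result


def sum_of_squares(n):
    """Stand-in workload so the example runs without a trained model."""
    return sum(i * i for i in range(n))


median_s, result = time_call(sum_of_squares, 100_000)
print(f"median time over 5 runs: {median_s:.6f} s")

# In the notebook the same helper could wrap a prediction call, for example:
#   lgbm_pred_time, lgbm_predicted_valid = time_call(lgbm.predict, features_valid)
```

<div style="color: #196CC4;">
Repeating a call is cheap for prediction. For a slow fit such as Random Forest, a single timed run is usually enough, since the gap to the other models is far larger than the noise.
</div>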

## Conclusions

### Selection of the final model

<div style="color: #196CC4;">
LGBM has the lowest MSE, RMSE, and MAPE of all models on the validation set. Its lower MSE and RMSE indicate that its predictions are, on average, closer to the actual prices. MAPE points the same way, but its values are on the order of 10<sup>17</sup> for every model, which suggests zero or near-zero prices in the target, so the choice rests mainly on RMSE:<br>

▶ LGBM takes longer to train than Linear Regression and DecisionTree (0.33 s versus 0.14 s and 0.17 s), but its prediction quality justifies the additional time cost. <br>
▶ LGBM is not the fastest model at prediction (0.038 s on the validation set), but this is still fast enough for interactive use.<br>
▶ The final model below is trained on the larger <code>total_features_train</code> set (216,988 rows) and evaluated on the test set.
</div>

In [38]:
# Hyperparameters
lgbm_final = LGBMRegressor(
    objective='regression',
    num_leaves=35,
    seed=23
)

# Training
lgbm_final_start_train_time = time.time()
lgbm_final.fit(total_features_train, total_target_train)

# Training time
lgbm_final_end_train_time = time.time()
lgbm_final_train_time = lgbm_final_end_train_time - lgbm_final_start_train_time

# Predictions
lgbm_final_start_pred_time = time.time()
lgbm_final_predicted_test = lgbm_final.predict(features_test)

# Predictions time
lgbm_final_end_pred_time = time.time()
lgbm_final_pred_time = lgbm_final_end_pred_time - lgbm_final_start_pred_time

# Quality of prediction metrics
lgbm_final_mse = mean_squared_error(target_test, lgbm_final_predicted_test)
lgbm_final_rmse = lgbm_final_mse ** 0.5
lgbm_final_mape = mean_absolute_percentage_error(
    target_test, lgbm_final_predicted_test)

# Print LGBM Final Model
print("\nLGBM Final Model")


# Print metrics
print("Predictions quality:")
print("Mean Squared Error (MSE):", lgbm_final_mse)
print("Root Mean Squared Error (RMSE):", lgbm_final_rmse)
print("Mean Absolute Percentage Error (MAPE):", lgbm_final_mape)
print()

# Print time
print("Training Time:", lgbm_final_train_time, "seconds")
print("Prediction Time:", lgbm_final_pred_time, "seconds")

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008653 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 346
[LightGBM] [Info] Number of data points in the train set: 216988, number of used features: 24
[LightGBM] [Info] Start training from score 3589.991562

LGBM Final Model
Predictions quality:
Mean Squared Error (MSE): 1664534.8743611246
Root Mean Squared Error (RMSE): 1290.1685449433048
Mean Absolute Percentage Error (MAPE): 2.118926661069755e+17

Training Time: 0.4565317630767822 seconds
Prediction Time: 0.04199838638305664 seconds


### Conclusions

<div style="color: #196CC4;">
The LightGBM model offers the best balance of prediction quality, prediction speed, and training time for Rusty Bargain. On the test set it reaches an RMSE of about 1,290 euros, close to the 1,285 euros obtained on the validation set, with a training time of about 0.46 seconds and a prediction time of about 0.04 seconds. It trains slightly more slowly than simpler models such as linear regression, but its accuracy and quick response time make it well suited for implementation on the platform.<br>

This model meets the criteria established by Rusty Bargain, giving users a reliable and quick estimate of their vehicle's market value. Implementing it in the application can improve the user experience and strengthen Rusty Bargain's value proposition in the used car market.
</div>
