# Development of a Machine Learning Model for Used Car Price Prediction

## Table of Contents
1. [Introduction](#introduction)
2. [Objectives](#objectives)
3. [Library Import](#library-import)
4. [Data Analysis](#data-analysis)
5. [Model Training](#model-training)
   - [Hyperparameter Tuning](#hyperparameter-tuning)
   - [Linear Regression](#linear-regression)
   - [Decision Tree](#decision-tree)
   - [Boosted Decision Tree (Gradient Boosting)](#boosted-decision-tree-gradient-boosting)
   - [CatBoost](#catboost)
6. [Conclusion](#conclusion)


## Introduction
This project builds a machine learning model that predicts the market value of used cars, giving Rusty Bargain's app an effective pricing tool. The work covers data exploration and preprocessing, analysis of vehicle-related features, and a comparison of several supervised learning methods. Models are judged on three criteria: prediction quality, prediction speed, and training time. Prediction quality is measured with the Root Mean Squared Error (RMSE).
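For reference, RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same units as the price. A minimal pure-Python sketch follows; the helper name `rmse` and the sample prices are illustrative assumptions, not values from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices, for illustration only
actual = [1000, 2000, 3000]
predicted = [1100, 1900, 3200]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones.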

## Objectives

- __Preparing and Exploring Data__:

Examine the dataset for duplicates, missing values, and data-entry errors.
Clean the data and encode the categorical features so that the dataset is compatible with the machine learning models.

- __Training and Comparing Models__:

Implement and compare several regression techniques: linear regression, a decision tree, and gradient boosting models (scikit-learn's `GradientBoostingRegressor` and CatBoost).
Tune the models' hyperparameters to improve performance, while tracking prediction accuracy and training time.

- __Assessment and Selection of Models__:

Evaluate the models with the RMSE metric, paying particular attention to computational efficiency and prediction quality.
Review the results to determine which model best suits Rusty Bargain's needs.
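
Since every model is judged on the same three criteria, the measurements can be wrapped in one helper. This is a minimal sketch, assuming a hypothetical `benchmark` function and a toy `MeanBaseline` stand-in; any estimator with scikit-learn-style `fit` and `predict` methods could be passed instead.

```python
import math
import time


def benchmark(model, X_train, y_train, X_val, y_val):
    """Fit `model`, predict on the validation set, and return timings plus RMSE.

    `model` is any object exposing scikit-learn-style fit(X, y) and predict(X).
    """
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = model.predict(X_val)
    predict_time = time.perf_counter() - start

    mse = sum((t - p) ** 2 for t, p in zip(y_val, y_pred)) / len(y_val)
    return {"train_time": train_time, "predict_time": predict_time, "rmse": math.sqrt(mse)}


class MeanBaseline:
    """Hypothetical stand-in model: always predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


# Toy data, for illustration only
result = benchmark(MeanBaseline(), [[1], [2], [3]], [10, 20, 30], [[4], [5]], [20, 40])
print(result)
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock meant for measuring short durations.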

## Library Import

In [18]:
# Pin NumPy below 2.0: upgrading to numpy 2.2.2 conflicts with catboost 1.2.7,
# which requires numpy<2.0,>=1.16.0
%pip install "numpy<2.0"


In [23]:
%pip install catboost

Note: you may need to restart the kernel to use updated packages.

In [1]:
# Import necessary libraries
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Scikit-learn utilities
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score, mean_squared_error

# Regression models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

from catboost import CatBoostRegressor


In [4]:
# Load the dataset
df = pd.read_csv('car_data.csv')

In [5]:
# Rename the dataset columns
df = df.rename(columns={
    'DateCrawled': 'date_crawled',
    'Price': 'price',
    'VehicleType': 'vehicle_type',
    'RegistrationYear': 'registration_year',
    'Gearbox': 'gearbox',
    'Power': 'power',
    'Model': 'model',
    'Mileage': 'mileage',
    'RegistrationMonth': 'registration_month',
    'FuelType': 'fuel_type',
    'Brand': 'brand',
    'NotRepaired': 'not_repaired',
    'DateCreated': 'date_created',
    'NumberOfPictures': 'number_of_pictures',
    'PostalCode': 'postcode',
    'LastSeen': 'lastseen'
})

In [6]:
df.head()

|   | date_crawled | price | vehicle_type | registration_year | gearbox | power | model | mileage | registration_month | fuel_type | brand | not_repaired | date_created | number_of_pictures | postcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24/03/2016 11:52 | 480 | NaN | 1993 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 24/03/2016 00:00 | 0 | 70435 | 07/04/2016 03:16 |
| 1 | 24/03/2016 10:58 | 18300 | coupe | 2011 | manual | 190 | NaN | 125000 | 5 | gasoline | audi | yes | 24/03/2016 00:00 | 0 | 66954 | 07/04/2016 01:46 |
| 2 | 14/03/2016 12:52 | 9800 | suv | 2004 | auto | 163 | grand | 125000 | 8 | gasoline | jeep | NaN | 14/03/2016 00:00 | 0 | 90480 | 05/04/2016 12:47 |
| 3 | 17/03/2016 16:54 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no | 17/03/2016 00:00 | 0 | 91074 | 17/03/2016 17:40 |
| 4 | 31/03/2016 17:25 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no | 31/03/2016 00:00 | 0 | 60437 | 06/04/2016 10:17 |


## Data Analysis

In [7]:
# Perform a preliminary analysis of the data
df.info()
print('=' * 50)
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        354369 non-null  object
 1   price               354369 non-null  int64 
 2   vehicle_type        316879 non-null  object
 3   registration_year   354369 non-null  int64 
 4   gearbox             334536 non-null  object
 5   power               354369 non-null  int64 
 6   model               334664 non-null  object
 7   mileage             354369 non-null  int64 
 8   registration_month  354369 non-null  int64 
 9   fuel_type           321474 non-null  object
 10  brand               354369 non-null  object
 11  not_repaired        283215 non-null  object
 12  date_created        354369 non-null  object
 13  number_of_pictures  354369 non-null  int64 
 14  postcode            354369 non-null  int64 
 15  lastseen            354369 non-null  object
dtypes:

|   | price | registration_year | power | mileage | registration_month | number_of_pictures | postcode |
|---|---|---|---|---|---|---|---|
| count | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 |
| mean | 4416.656776 | 2004.234448 | 110.094337 | 128211.172535 | 5.714645 | 0.0 | 50508.689087 |
| std | 4514.158514 | 90.227958 | 189.850405 | 37905.34153 | 3.726421 | 0.0 | 25783.096248 |
| min | 0.0 | 1000.0 | 0.0 | 5000.0 | 0.0 | 0.0 | 1067.0 |
| 25% | 1050.0 | 1999.0 | 69.0 | 125000.0 | 3.0 | 0.0 | 30165.0 |
| 50% | 2700.0 | 2003.0 | 105.0 | 150000.0 | 6.0 | 0.0 | 49413.0 |
| 75% | 6400.0 | 2008.0 | 143.0 | 150000.0 | 9.0 | 0.0 | 71083.0 |
| max | 20000.0 | 9999.0 | 20000.0 | 150000.0 | 12.0 | 0.0 | 99998.0 |


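The dataset contains null values in five categorical columns: `vehicle_type`, `gearbox`, `model`, `fuel_type`, and `not_repaired`. The data types need no changes for the features used in modeling; the three date columns are stored as `object`, but they are dropped before training. The summary statistics also show implausible values: `registration_year` ranges from 1000 to 9999, `power` reaches 20000, and the minimum `price` is 0. The `number_of_pictures` column is always 0.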
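
The notebook does not filter out the extreme values visible in the summary statistics. If such filtering were wanted, one possible approach is a simple range mask. This is a minimal sketch: the function name `drop_implausible_rows` and every bound in it (years 1900 to 2016, power 1 to 1000, price above 0) are assumptions chosen for illustration, not thresholds from the original analysis.

```python
import pandas as pd


def drop_implausible_rows(df, min_year=1900, max_year=2016, max_power=1000):
    """Return a copy of `df` without rows whose values fall outside the assumed bounds.

    The default bounds are illustrative assumptions, not values taken from the notebook.
    """
    mask = (
        df["registration_year"].between(min_year, max_year)
        & df["power"].between(1, max_power)
        & (df["price"] > 0)
    )
    return df.loc[mask].copy()


# Tiny hypothetical sample, for illustration only
sample = pd.DataFrame({
    "price": [480, 0, 9800, 1500],
    "registration_year": [1993, 2011, 9999, 2001],
    "power": [90, 190, 163, 20000],
})
cleaned = drop_implausible_rows(sample)
print(cleaned)
```

Any bounds of this kind trade data volume against label noise, so they should be checked against the actual distributions before being applied.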

In [8]:
# Fill null values in categorical columns.
# Assign the result back instead of calling fillna(..., inplace=True) on a column
# selection: that pattern raises a pandas FutureWarning and stops working in pandas 3.0.
for col in ['gearbox', 'vehicle_type', 'model', 'fuel_type']:
    df[col] = df[col].fillna(df[col].mode()[0])
df['not_repaired'] = df['not_repaired'].fillna("Unknown")

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Verify null values
print('Null values verification:\n', df.isnull().sum())
print('=' * 25)

# Verify duplicate rows
print('Total number of duplicate values =', df.duplicated().sum())

Null values verification:
 date_crawled          0
price                 0
vehicle_type          0
registration_year     0
gearbox               0
power                 0
model                 0
mileage               0
registration_month    0
fuel_type             0
brand                 0
not_repaired          0
date_created          0
number_of_pictures    0
postcode              0
lastseen              0
dtype: int64
Total number of duplicate values = 0


Missing values have been filled and duplicate rows removed, so the data is ready for modeling.

## Model Training

### Hyperparameter Tuning

In [9]:
# Exclude columns that are not used as features
df = df.drop(['date_crawled', 'date_created', 'lastseen', 'registration_month', 'number_of_pictures', 'postcode'], axis=1)

# Separate the features from the target
X = df.drop('price', axis=1)  # 'price' is the target variable
y = df['price']

# Convert categorical columns into dummy variables (One-Hot Encoding)
X_encoded = pd.get_dummies(X, drop_first=True)

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
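
The cell above prepares the features and the train/validation split. The tuning step itself can be sketched as a cross-validated grid search on the training set. The following is a minimal sketch, assuming a hypothetical parameter grid for `DecisionTreeRegressor` and synthetic data from `make_regression` standing in for `X_train` and `y_train`; none of these values come from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X_encoded / y so the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Hypothetical grid; the values are assumptions, not tuned settings from the notebook
param_grid = {"max_depth": [3, 6, 10], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the negation
    cv=3,
)
search.fit(X_tr, y_tr)

best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(search.best_params_, best_rmse)
```

Tuning with cross-validation on the training split keeps the validation set untouched, so the final RMSE comparison between models remains a fair one.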


### Linear Regression

In [10]:
# Create the linear regression model
model = LinearRegression()

# Measure training time
start_train = time.time()
model.fit(X_train, y_train)
end_train = time.time()

# Training time
train_time = end_train - start_train
print(f"Training time: {train_time} seconds")

# Measure prediction time on the validation set
start_predict = time.time()
y_pred = model.predict(X_val)
end_predict = time.time()
predict_time = end_predict - start_predict  # measured, but not printed in this run

# Calculate RMSE manually
rmse = np.sqrt(np.mean((y_val - y_pred) ** 2))

print(f"RMSE: {rmse}")

Training time: 8.428355932235718 seconds
RMSE: 3187.8423893330164


### Decision Tree

In [11]:
# Create the Decision Tree model
model = DecisionTreeRegressor(random_state=42)

# Train the model once and measure training time
start_train = time.time()
model.fit(X_train, y_train)
end_train = time.time()

# Training time
train_time = end_train - start_train
print(f"Training time: {train_time} seconds")

# Measure prediction time on the validation set
start_predict = time.time()
y_pred = model.predict(X_val)
end_predict = time.time()

# Prediction time
predict_time = end_predict - start_predict
print(f"Prediction time: {predict_time} seconds")

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print("RMSE:", rmse)


Training time: 10.817238569259644 seconds
Prediction time: 0.12571215629577637 seconds
RMSE: 2081.4097691055617


### Boosted Decision Tree (Gradient Boosting)

In [12]:
# Create the Gradient Boosting model
model = GradientBoostingRegressor(random_state=42)

# Measure training time
start_train = time.time()
model.fit(X_train, y_train)
end_train = time.time()

# Training time
train_time = end_train - start_train
print(f"Training time: {train_time} seconds")

# Measure prediction time
start_predict = time.time()
y_pred = model.predict(X_val)
end_predict = time.time()

# Prediction time
predict_time = end_predict - start_predict
print(f"Prediction time: {predict_time} seconds")

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"RMSE: {rmse}")

Training time: 166.36652326583862 seconds
Prediction time: 0.41561055183410645 seconds
RMSE: 2067.2720644200763


### CatBoost

In [13]:
# One-Hot Encoding was executed previously, no need to specify categorical features
model = CatBoostRegressor(iterations=1000,    # Number of iterations
                          learning_rate=0.1,  # Learning rate
                          depth=6,            # Tree depth
                          verbose=100)        # Display every 100 iterations

# Measure training time
start_train = time.time()
model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50)
end_train = time.time()

# Training time
train_time = end_train - start_train
print(f"Training time: {train_time} seconds")

# Measure prediction time
start_predict = time.time()
y_pred = model.predict(X_val)
end_predict = time.time()

# Prediction time
predict_time = end_predict - start_predict
print(f"Prediction time: {predict_time} seconds")

# Calculate RMSE
rmse = np.sqrt(np.mean((y_val - y_pred) ** 2))
print(f"RMSE: {rmse}")

0:	learn: 4238.8198947	test: 4238.6117021	best: 4238.6117021 (0)	total: 166ms	remaining: 2m 46s
100:	learn: 1961.2207816	test: 1972.6055926	best: 1972.6055926 (100)	total: 2.55s	remaining: 22.7s
200:	learn: 1872.8541664	test: 1888.8926982	best: 1888.8926982 (200)	total: 4.82s	remaining: 19.2s
300:	learn: 1827.1184337	test: 1848.6293692	best: 1848.6293692 (300)	total: 7.14s	remaining: 16.6s
400:	learn: 1798.2636534	test: 1825.3441473	best: 1825.3441473 (400)	total: 9.48s	remaining: 14.2s
500:	learn: 1774.7634790	test: 1806.4918783	best: 1806.4918783 (500)	total: 11.6s	remaining: 11.6s
600:	learn: 1755.8400097	test: 1792.8463683	best: 1792.8463683 (600)	total: 13.7s	remaining: 9.12s
700:	learn: 1739.7824686	test: 1781.8195460	best: 1781.8195460 (700)	total: 15.9s	remaining: 6.78s
800:	learn: 1725.4068219	test: 1772.3808289	best: 1772.3808289 (800)	total: 18s	remaining: 4.48s
900:	learn: 1713.1088681	test: 1764.9020891	best: 1764.9020891 (900)	total: 20.2s	remaining: 2.22s
999:	learn: 170

## Conclusion

Among the models trained, linear regression was the least suitable, with a validation RMSE of about 3188. The decision tree improved on it substantially (RMSE of about 2081, with roughly 11 seconds of training). scikit-learn's gradient boosting improved only slightly on the single tree (RMSE of about 2067) while taking far longer to train (about 166 seconds). CatBoost proved the most suitable for our case: its training log shows a validation RMSE of about 1765 at iteration 900, reached after roughly 20 seconds of training, so it combines the best prediction quality with a short training time.