# Project Title: Predicting Car Prices for Rusty Bargain: A Machine Learning Approach

# Project Description:

Rusty Bargain is developing an application to attract new customers. Within this application, users can quickly find out the market value of their car. Historical data are available for this purpose: technical specifications, trim versions, and the corresponding prices. The project's objective is to build a model that determines a car's value accurately.

Rusty Bargain is interested in:

- The quality of the prediction
- The speed of the prediction
- The time required for training

To that end, the project builds and compares several models: linear regression as a baseline, random forest, and the gradient boosting libraries LightGBM, CatBoost, and XGBoost. Each model is assessed on prediction quality (RMSE) and on how long it takes to train and predict, with attention to hyperparameter tuning and to the encoding of categorical features. The analysis aims to show which models are most effective for predicting car prices.

## Data preparation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
import time
from sklearn.impute import SimpleImputer 
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline



In [2]:

df = pd.read_csv('/datasets/car_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

In [3]:
df

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354364,21/03/2016 09:50,0,,2005,manual,0,colt,150000,7,petrol,mitsubishi,yes,21/03/2016 00:00,0,2694,21/03/2016 10:42
354365,14/03/2016 17:48,2200,,2005,,0,,20000,1,,sonstige_autos,,14/03/2016 00:00,0,39576,06/04/2016 00:46
354366,05/03/2016 19:56,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no,05/03/2016 00:00,0,26135,11/03/2016 18:17
354367,19/03/2016 18:57,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no,19/03/2016 00:00,0,87439,07/04/2016 07:15


In [4]:
# Display the first few rows of the dataframe
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [5]:
# Convert column names to lowercase
df.columns = df.columns.str.lower()
df

Unnamed: 0,datecrawled,price,vehicletype,registrationyear,gearbox,power,model,mileage,registrationmonth,fueltype,brand,notrepaired,datecreated,numberofpictures,postalcode,lastseen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354364,21/03/2016 09:50,0,,2005,manual,0,colt,150000,7,petrol,mitsubishi,yes,21/03/2016 00:00,0,2694,21/03/2016 10:42
354365,14/03/2016 17:48,2200,,2005,,0,,20000,1,,sonstige_autos,,14/03/2016 00:00,0,39576,06/04/2016 00:46
354366,05/03/2016 19:56,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no,05/03/2016 00:00,0,26135,11/03/2016 18:17
354367,19/03/2016 18:57,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no,19/03/2016 00:00,0,87439,07/04/2016 07:15


In [6]:
# Drop irrelevant columns
df = df.drop(columns=['datecrawled', 'datecreated', 'lastseen', 'numberofpictures', 'postalcode'])
df

Unnamed: 0,price,vehicletype,registrationyear,gearbox,power,model,mileage,registrationmonth,fueltype,brand,notrepaired
0,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,
1,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no
...,...,...,...,...,...,...,...,...,...,...,...
354364,0,,2005,manual,0,colt,150000,7,petrol,mitsubishi,yes
354365,2200,,2005,,0,,20000,1,,sonstige_autos,
354366,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no
354367,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no


In [7]:
# Handling missing values
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 245814 entries, 3 to 354367
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   price              245814 non-null  int64 
 1   vehicletype        245814 non-null  object
 2   registrationyear   245814 non-null  int64 
 3   gearbox            245814 non-null  object
 4   power              245814 non-null  int64 
 5   model              245814 non-null  object
 6   mileage            245814 non-null  int64 
 7   registrationmonth  245814 non-null  int64 
 8   fueltype           245814 non-null  object
 9   brand              245814 non-null  object
 10  notrepaired        245814 non-null  object
dtypes: int64(5), object(6)
memory usage: 22.5+ MB


In [8]:
# Encoding categorical features
df = pd.get_dummies(df, columns=['vehicletype', 'gearbox', 'model', 'fueltype', 'brand', 'notrepaired'])
df

Unnamed: 0,price,registrationyear,power,mileage,registrationmonth,vehicletype_bus,vehicletype_convertible,vehicletype_coupe,vehicletype_other,vehicletype_sedan,...,brand_skoda,brand_smart,brand_subaru,brand_suzuki,brand_toyota,brand_trabant,brand_volkswagen,brand_volvo,notrepaired_no,notrepaired_yes
3,1500,2001,75,150000,6,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
4,3600,2008,69,90000,7,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
5,650,1995,102,150000,10,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
6,2200,2004,109,150000,8,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
7,0,1980,50,40000,7,0,0,0,0,1,...,0,0,0,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354360,3999,2005,3,150000,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
354362,3200,2004,225,150000,5,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
354363,1150,2000,0,150000,3,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
354366,1199,2000,101,125000,3,0,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0


In [9]:
# Display basic statistics of numerical features
numerical_stats = df.describe()
print(numerical_stats)


               price  registrationyear          power        mileage  \
count  245814.000000     245814.000000  245814.000000  245814.000000   
mean     5125.346717       2002.918699     119.970884  127296.716216   
std      4717.948673          6.163765     139.387116   37078.820368   
min         0.000000       1910.000000       0.000000    5000.000000   
25%      1499.000000       1999.000000      75.000000  125000.000000   
50%      3500.000000       2003.000000     110.000000  150000.000000   
75%      7500.000000       2007.000000     150.000000  150000.000000   
max     20000.000000       2018.000000   20000.000000  150000.000000   

       registrationmonth  vehicletype_bus  vehicletype_convertible  \
count      245814.000000    245814.000000            245814.000000   
mean            6.179701         0.096207                 0.065932   
std             3.479519         0.294875                 0.248164   
min             0.000000         0.000000                 0.000000   

Price: The minimum price is 0, which is not a realistic value for a car. These rows need to be investigated and handled.

Power: The maximum of 20,000 is implausibly high and is most likely an outlier or a data-entry error. Such extreme values should be examined and filtered.

Mileage: The values lie between 5,000 and 150,000, which is a reasonable range, although the distribution is still worth checking for outliers.

Registration Year: The minimum of 1910 is probably an outlier or an error. It should be checked before deciding how to handle such cases.

Registration Month: The values range from 0 to 12, although months run from 1 to 12. A value of 0 may indicate a missing or unrecorded month and needs further investigation.
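
Before dropping anything, it can help to count how many rows each plausibility rule would reject. The sketch below does this on a small made-up frame. The column names follow the notebook and the thresholds (1000 for power, 1950 for registration year) are the ones used in the cells that follow, but the rows themselves and the names `toy`, `rules`, and `clean` are purely illustrative.

```python
import pandas as pd

# Hypothetical toy rows: column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "price":             [0, 1500, 3600, 650, 2200],
    "power":             [50, 75, 20000, 102, 109],
    "registrationyear":  [1980, 2001, 2008, 1910, 2004],
    "registrationmonth": [7, 6, 0, 10, 8],
})

# One boolean mask per plausibility rule (True means the row passes the rule).
# The thresholds are assumptions that should be checked against domain knowledge.
rules = {
    "price > 0":      toy["price"] > 0,
    "power <= 1000":  toy["power"] <= 1000,
    "year >= 1950":   toy["registrationyear"] >= 1950,
    "month in 1..12": toy["registrationmonth"].between(1, 12),
}

# Count how many rows each rule would reject, before removing anything.
rejected = {name: int((~mask).sum()) for name, mask in rules.items()}
print(rejected)

# Keep only the rows that pass every rule.
keep = pd.concat(rules, axis=1).all(axis=1)
clean = toy[keep]
print(len(toy), "->", len(clean))
```

Reporting the per-rule counts first makes it visible when a single threshold removes far more data than expected.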

In [10]:
# Handling Zero Values in 'Price'
df = df[df['price'] > 0]
df

Unnamed: 0,price,registrationyear,power,mileage,registrationmonth,vehicletype_bus,vehicletype_convertible,vehicletype_coupe,vehicletype_other,vehicletype_sedan,...,brand_skoda,brand_smart,brand_subaru,brand_suzuki,brand_toyota,brand_trabant,brand_volkswagen,brand_volvo,notrepaired_no,notrepaired_yes
3,1500,2001,75,150000,6,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
4,3600,2008,69,90000,7,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
5,650,1995,102,150000,10,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
6,2200,2004,109,150000,8,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
10,2000,2004,105,150000,12,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354360,3999,2005,3,150000,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
354362,3200,2004,225,150000,5,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
354363,1150,2000,0,150000,3,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
354366,1199,2000,101,125000,3,0,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0


In [11]:
# Handling Unusually High 'Power' Values
max_power_threshold = 1000 
df = df[df['power'] <= max_power_threshold]
df

Unnamed: 0,price,registrationyear,power,mileage,registrationmonth,vehicletype_bus,vehicletype_convertible,vehicletype_coupe,vehicletype_other,vehicletype_sedan,...,brand_skoda,brand_smart,brand_subaru,brand_suzuki,brand_toyota,brand_trabant,brand_volkswagen,brand_volvo,notrepaired_no,notrepaired_yes
3,1500,2001,75,150000,6,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
4,3600,2008,69,90000,7,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
5,650,1995,102,150000,10,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
6,2200,2004,109,150000,8,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
10,2000,2004,105,150000,12,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354360,3999,2005,3,150000,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
354362,3200,2004,225,150000,5,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
354363,1150,2000,0,150000,3,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
354366,1199,2000,101,125000,3,0,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0


In [12]:
# Handling Unusual 'Registration Year' Values
min_registration_year_threshold = 1950  
df = df[df['registrationyear'] >= min_registration_year_threshold]
df

Unnamed: 0,price,registrationyear,power,mileage,registrationmonth,vehicletype_bus,vehicletype_convertible,vehicletype_coupe,vehicletype_other,vehicletype_sedan,...,brand_skoda,brand_smart,brand_subaru,brand_suzuki,brand_toyota,brand_trabant,brand_volkswagen,brand_volvo,notrepaired_no,notrepaired_yes
3,1500,2001,75,150000,6,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
4,3600,2008,69,90000,7,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
5,650,1995,102,150000,10,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
6,2200,2004,109,150000,8,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
10,2000,2004,105,150000,12,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354360,3999,2005,3,150000,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
354362,3200,2004,225,150000,5,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
354363,1150,2000,0,150000,3,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
354366,1199,2000,101,125000,3,0,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0


In [13]:
# Handling Unusual 'Registration Month' Values
df = df[(df['registrationmonth'] >= 1) & (df['registrationmonth'] <= 12)]
df

Unnamed: 0,price,registrationyear,power,mileage,registrationmonth,vehicletype_bus,vehicletype_convertible,vehicletype_coupe,vehicletype_other,vehicletype_sedan,...,brand_skoda,brand_smart,brand_subaru,brand_suzuki,brand_toyota,brand_trabant,brand_volkswagen,brand_volvo,notrepaired_no,notrepaired_yes
3,1500,2001,75,150000,6,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
4,3600,2008,69,90000,7,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
5,650,1995,102,150000,10,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
6,2200,2004,109,150000,8,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
10,2000,2004,105,150000,12,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354360,3999,2005,3,150000,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
354362,3200,2004,225,150000,5,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
354363,1150,2000,0,150000,3,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
354366,1199,2000,101,125000,3,0,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0


In [14]:
# Display updated basic statistics of numerical features
numerical_stats = df.describe()
print(numerical_stats)

               price  registrationyear          power        mileage  \
count  235425.000000     235425.000000  235425.000000  235425.000000   
mean     5276.564838       2003.090572     119.172909  126981.522778   
std      4726.837762          6.079106      57.583969   37106.606850   
min         1.000000       1950.000000       0.000000    5000.000000   
25%      1550.000000       1999.000000      75.000000  125000.000000   
50%      3650.000000       2004.000000     113.000000  150000.000000   
75%      7700.000000       2007.000000     150.000000  150000.000000   
max     20000.000000       2018.000000    1000.000000  150000.000000   

       registrationmonth  vehicletype_bus  vehicletype_convertible  \
count      235425.000000    235425.000000            235425.000000   
mean            6.373244         0.097560                 0.066713   
std             3.353039         0.296719                 0.249526   
min             1.000000         0.000000                 0.000000   

## Model training

In [15]:
# Separate categorical and numerical columns
categorical_cols = df.select_dtypes(include=['object']).columns
numerical_cols = df.select_dtypes(exclude=['object']).columns

# Create transformers for numerical and categorical features
numerical_transformer = Pipeline(steps=[('num_imputer', SimpleImputer(strategy='mean'))])

categorical_transformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='other')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers into a preprocessor using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
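
Note that by this point `df` has already been expanded by `pd.get_dummies` and still contains `price`, so `categorical_cols` selects no columns and `numerical_cols` includes the target. A minimal sketch of the alternative, assuming the column lists are derived from the un-encoded feature frame, might look as follows. The frame `toy` and the names `X_toy`, `numeric_cols`, and `toy_preprocessor` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame standing in for the car data before get_dummies.
toy = pd.DataFrame({
    "price": [1500, 3600, 650, 2200],
    "power": [75.0, 69.0, np.nan, 109.0],
    "brand": ["volkswagen", "skoda", "bmw", "skoda"],
})

# Separate the target first, so it cannot end up in any column list.
X_toy = toy.drop(columns="price")
y_toy = toy["price"]

# Derive the column lists from the features only.
numeric_cols = X_toy.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in X_toy.columns if c not in numeric_cols]

toy_preprocessor = ColumnTransformer(transformers=[
    # Fill missing numeric values with the column mean.
    ("num", SimpleImputer(strategy="mean"), numeric_cols),
    # One-hot encode the categories; unseen categories become all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

encoded = toy_preprocessor.fit_transform(X_toy)
print(numeric_cols, categorical_cols)
print(encoded.shape)  # one numeric column plus one indicator per distinct brand
```

With this ordering the encoder is fitted inside the pipeline, so it learns the categories from the training split only.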


In [16]:
# Split data into features and target
X = df.drop(columns='price')
y = df['price']

In [17]:
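# Caution: this puts the target back into the features (target leakage), so each
# model below is given the price it must predict and the reported errors are not
# meaningful. Remove this line, and 'price' from numerical_cols, for a valid evaluation.
X['price'] = y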

In [18]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
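
Because cell 17 adds `price` back into `X`, the split above carries the target inside the features. The following is a minimal sketch of a leak-free split with a simple guard, shown on made-up data. The frame `toy` and the helper `find_target_copies` are hypothetical names introduced only for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "price":   [1500, 3600, 650, 2200, 2000, 3999, 3200, 1150, 1199, 9200],
    "power":   [75, 69, 102, 109, 105, 3, 225, 0, 101, 102],
    "mileage": [150000, 90000, 150000, 150000, 150000,
                150000, 150000, 150000, 125000, 150000],
})


def find_target_copies(features, target):
    """Return the names of feature columns whose values equal the target exactly."""
    return [c for c in features.columns
            if (features[c].to_numpy() == target.to_numpy()).all()]


target_name = "price"
X_toy = toy.drop(columns=target_name)  # features only
y_toy = toy[target_name]               # target only

# Guard against leakage before any model sees the data.
assert target_name not in X_toy.columns
assert find_target_copies(X_toy, y_toy) == []

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```

The helper also catches the case where the target survives under a different column name, which a plain name check would miss.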


In [19]:
# Linear Regression as a baseline model
lr_model = LinearRegression()

# Measure training time for Linear Regression
start_time_lr = time.time()
lr_model.fit(X_train, y_train)
end_time_lr = time.time()

lr_preds = lr_model.predict(X_test)

# Evaluate Linear Regression
lr_rmse = sqrt(mean_squared_error(y_test, lr_preds))
print(f'Linear Regression RMSE: {lr_rmse}')
print(f'Linear Regression Training Time: {end_time_lr - start_time_lr} seconds')


Linear Regression RMSE: 4.7550137573905135e-11
Linear Regression Training Time: 14.770609140396118 seconds


In [20]:
# Random Forest
rf_model = RandomForestRegressor(n_estimators=20, random_state=42)

# Measure training time for Random Forest
start_time_rf = time.time()
rf_model.fit(X_train, y_train)
end_time_rf = time.time()

rf_preds = rf_model.predict(X_test)

# Evaluate Random Forest
rf_rmse = sqrt(mean_squared_error(y_test, rf_preds))
print(f'Random Forest RMSE: {rf_rmse}')
print(f'Random Forest Training Time: {end_time_rf - start_time_rf} seconds')

Random Forest RMSE: 0.5064293547010914
Random Forest Training Time: 64.06066608428955 seconds


## Model analysis

In [21]:
# LightGBM (note: the timed interval below covers both fit and predict)
lgbm_model = LGBMRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
start_time = time.time()
lgbm_model.fit(X_train, y_train)
lgbm_preds = lgbm_model.predict(X_test)
end_time = time.time()

In [22]:
# Evaluate LightGBM
lgbm_rmse = sqrt(mean_squared_error(y_test, lgbm_preds))
print(f'LightGBM RMSE: {lgbm_rmse}')
print(f'LightGBM Training Time: {end_time - start_time} seconds')


LightGBM RMSE: 21.471290364599994
LightGBM Training Time: 5.585995435714722 seconds
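
Rusty Bargain cares about training time and prediction speed separately, while the interval timed above spans both `fit` and `predict`. One possible way to report the two numbers apart is sketched below. The function `benchmark` and the synthetic data are assumptions made for this illustration, and `LinearRegression` is used only because it needs nothing beyond scikit-learn; any estimator with `fit` and `predict` could be passed in.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def benchmark(model, X_train, y_train, X_test, y_test):
    """Return the RMSE plus separate wall-clock times for fit and predict."""
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    t1 = time.perf_counter()
    preds = model.predict(X_test)
    t2 = time.perf_counter()
    rmse = float(np.sqrt(mean_squared_error(y_test, preds)))
    return {"rmse": rmse, "fit_seconds": t1 - t0, "predict_seconds": t2 - t1}


# Synthetic regression data, not the car data: a linear signal plus small noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

result = benchmark(LinearRegression(),
                   X_demo[:400], y_demo[:400],
                   X_demo[400:], y_demo[400:])
print(result)
```

Running every model through the same helper keeps the timings comparable, because each one is measured over the same two intervals.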


In [23]:
# LightGBM with Hyperparameter Tuning
# (LGBMRegressor is already imported in cell 1)
from sklearn.model_selection import GridSearchCV

lgbm_model = LGBMRegressor(random_state=42)

# Hyperparameter grid for tuning
param_grid = {
    'n_estimators': [10, 15, 20],
    'learning_rate': [0.01, 0.1, 0.2]
}

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=lgbm_model, param_grid=param_grid, 
                           scoring='neg_mean_squared_error', cv=5, verbose=0)
start_time_grid = time.time()
grid_search.fit(X_train, y_train)
end_time_grid = time.time()

# Best hyperparameters from the grid search
best_params = grid_search.best_params_
print(f'Best Hyperparameters: {best_params}')


Best Hyperparameters: {'learning_rate': 0.2, 'n_estimators': 20}


In [24]:
# Train the LightGBM model with the best hyperparameters
# (note: the timed interval below covers both fit and predict)
lgbm_model_tuned = LGBMRegressor(random_state=42, **best_params)
start_time = time.time()
lgbm_model_tuned.fit(X_train, y_train)
lgbm_preds_tuned = lgbm_model_tuned.predict(X_test)
end_time = time.time()

# Evaluate LightGBM with hyperparameter tuning
lgbm_rmse_tuned = sqrt(mean_squared_error(y_test, lgbm_preds_tuned))
print(f'LightGBM RMSE with Hyperparameter Tuning: {lgbm_rmse_tuned}')
print(f'LightGBM Training Time with Hyperparameter Tuning: {end_time - start_time} seconds')
print(f'Grid Search Time for Hyperparameter Tuning: {end_time_grid - start_time_grid} seconds')

LightGBM RMSE with Hyperparameter Tuning: 59.169958317778146
LightGBM Training Time with Hyperparameter Tuning: 3.2125399112701416 seconds
Grid Search Time for Hyperparameter Tuning: 118.20733785629272 seconds
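
The tuned model scores worse than the untuned one (RMSE 59.17 against 21.47). The most likely reason is that the grid only offers 10 to 20 trees, whereas the untuned model used 100, so the search never gets to consider the stronger setting. The sketch below shows a grid that includes the baseline configuration and reads the cross-validated RMSE directly. It runs on synthetic data, and scikit-learn's `GradientBoostingRegressor` is swapped in for LightGBM so that the example needs only scikit-learn; the names `param_grid_demo` and `search` are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, not the car data.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(300, 4))
y_demo = (np.sin(3 * X_demo[:, 0]) + X_demo[:, 1] ** 2
          + rng.normal(scale=0.05, size=300))

# Include the untuned setting (100 trees) among the candidates, so the search
# can only match or improve on it in cross-validation.
param_grid_demo = {
    "n_estimators": [20, 100],
    "learning_rate": [0.1, 0.2],
}

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),  # stand-in for LGBMRegressor
    param_grid=param_grid_demo,
    scoring="neg_root_mean_squared_error",  # scores are negated RMSE values
    cv=3,
)
search.fit(X_demo, y_demo)

best_cv_rmse = -search.best_score_  # flip the sign to get an RMSE
print(search.best_params_, best_cv_rmse)
```

Scoring with `neg_root_mean_squared_error` puts the search on the same scale as the RMSE reported elsewhere in the notebook.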


In [25]:
# CatBoost (note: the timed interval below covers both fit and predict)
catboost_model = CatBoostRegressor(iterations=100, learning_rate=0.1, random_seed=42, verbose=0)
start_time = time.time()
catboost_model.fit(X_train, y_train)
catboost_preds = catboost_model.predict(X_test)
end_time = time.time()

In [26]:
# Evaluate CatBoost
catboost_rmse = sqrt(mean_squared_error(y_test, catboost_preds))
print(f'CatBoost RMSE: {catboost_rmse}')
print(f'CatBoost Training Time: {end_time - start_time} seconds')


CatBoost RMSE: 66.05260939982051
CatBoost Training Time: 5.71368145942688 seconds


In [27]:
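# XGBoost inside the preprocessing pipeline
# (the categorical features were already one-hot encoded by get_dummies in cell 8;
# the timed interval below covers both fit and predict)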
xgboost_model = XGBRegressor(n_estimators=20, learning_rate=0.1, random_state=42)

# Create and fit the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('model', xgboost_model)])

start_time = time.time()
pipeline.fit(X_train, y_train)
xgboost_preds = pipeline.predict(X_test)
end_time = time.time()


In [28]:
# Evaluate XGBoost
xgboost_rmse = sqrt(mean_squared_error(y_test, xgboost_preds))
print(f'XGBoost RMSE: {xgboost_rmse}')
print(f'XGBoost Training Time: {end_time - start_time} seconds')

XGBoost RMSE: 861.9906587585083
XGBoost Training Time: 73.77553057670593 seconds


# Conclusion:
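This project compared several approaches to predicting car prices for Rusty Bargain: linear regression, random forest, and the gradient boosting libraries LightGBM, CatBoost, and XGBoost. Each model was assessed on prediction quality (RMSE) and on how long it took to run.

The measured results were:

- Linear regression: RMSE 4.76e-11, training time 14.8 s
- Random forest (20 trees): RMSE 0.51, training time 64.1 s
- LightGBM (100 trees): RMSE 21.47, fit and predict time 5.6 s
- LightGBM tuned by grid search: RMSE 59.17, fit and predict time 3.2 s, plus 118.2 s for the grid search
- CatBoost (100 iterations): RMSE 66.05, fit and predict time 5.7 s
- XGBoost (20 trees, inside the preprocessing pipeline): RMSE 861.99, fit and predict time 73.8 s

These RMSE values should not be read as real prediction errors. The feature matrix still contained the price column (cell 17), so every model was given the target as an input. An RMSE of essentially zero for linear regression is exactly what such a leak produces. The quality comparison has to be repeated with price removed from the features before any model can be recommended on accuracy.

Two further points affect the comparison. The tuned LightGBM model scored worse than the untuned one, most likely because the grid only considered 10 to 20 trees while the untuned model used 100. The timings for LightGBM, CatBoost, and XGBoost include prediction, whereas those for linear regression and random forest cover fitting only.

On speed, LightGBM and CatBoost were clearly the fastest, at roughly 3 to 6 seconds compared with 14.8 seconds for linear regression, 64.1 seconds for random forest, and 73.8 seconds for the XGBoost pipeline. Both libraries are designed to train efficiently on large datasets, which makes them strong candidates for Rusty Bargain. The final choice depends on how the company weighs prediction accuracy against training and prediction time, and it should be made once the models have been re-evaluated without the leaked target.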

