OVERVIEW

We are still working on the hackathon competition to predict used car prices. A baseline model was built previously, and the goal now is to improve on it, that is, to obtain a lower Root Mean Squared Error (RMSE) on the validation set, since a lower RMSE means predictions that are closer to the true prices.

We built the baseline model with LinearRegression() and used cross-validation to check how robust it was. Now we will try to improve on it using the following:
- LGBMRegressor()
- RandomizedSearchCV()
- Hyperparameter tuning, etc.
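
Since RMSE is the metric every model below is judged by, here is a minimal sketch of how it is computed. The prices and predictions are made-up values for illustration and do not come from the competition data.

```python
import numpy as np

# Hypothetical true prices and predictions (illustrative values only)
y_true = np.array([20000.0, 35000.0, 50000.0])
y_pred = np.array([23000.0, 31000.0, 50000.0])

# RMSE = square root of the mean of the squared errors
errors = y_pred - y_true
rmse = np.sqrt(np.mean(errors ** 2))
print(round(rmse, 2))
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones, and the result is expressed in the same unit as the price.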

In [1]:
pip install lightgbm

Note: you may need to restart the kernel to use updated packages.


In [2]:
#importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from lightgbm import LGBMRegressor

In [3]:
#Print out libraries' versions

print("Pandas version is:", pd.__version__)
print("Numpy version is:", np.__version__)
print("scikit-learn version is:", sklearn.__version__)

Pandas version is: 2.2.2
Numpy version is: 1.26.4
scikit-learn version is: 1.5.1


In [4]:
# A wrangle function to handle part of our data preprocessing
def wrangle(filepath):
    #read data into a DataFrame
    df = pd.read_csv(filepath)
    print("length of df with outliers:", len(df))

    #Remove outliers: keep only rows whose price lies between the 10th and 90th percentiles
    q1, q9 = df['price'].quantile([0.1, 0.9])
    df = df[df['price'].between(q1, q9)]
    print("length of df without outliers:", len(df))
    
    return df
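
To make the trimming rule in wrangle() concrete, the sketch below applies the same quantile filter to a toy price column. The toy values (1 to 100) are an assumption chosen only so the cut-offs are easy to follow.

```python
import pandas as pd

# Toy target column: prices 1..100 (hypothetical, for illustration only)
toy = pd.DataFrame({"price": range(1, 101)})

# Same rule as wrangle(): keep rows between the 10th and 90th percentiles
low, high = toy["price"].quantile([0.1, 0.9])
trimmed = toy[toy["price"].between(low, high)]

print(low, high)
print(len(toy), "->", len(trimmed))
```

Roughly 80% of the rows survive, which is in line with the drop from 188533 to 151067 rows reported for the real training file.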

In [5]:
df = wrangle(r'C:\Users\USER\Desktop\Kaggle Hackathon\archive\train.csv')
df.head()

length of df with outliers: 188533
length of df without outliers: 151067


Unnamed: 0,id,brand,model,model_year,milage,fuel_type,engine,transmission,ext_col,int_col,accident,clean_title,price
2,2,Chevrolet,Silverado 2500 LT,2002,136731,E85 Flex Fuel,320.0HP 5.3L 8 Cylinder Engine Flex Fuel Capab...,A/T,Blue,Gray,None reported,Yes,13900
3,3,Genesis,G90 5.0 Ultimate,2017,19500,Gasoline,420.0HP 5.0L 8 Cylinder Engine Gasoline Fuel,Transmission w/Dual Shift Mode,Black,Black,None reported,Yes,45000
5,5,Audi,A6 2.0T Sport,2018,40950,Gasoline,252.0HP 2.0L 4 Cylinder Engine Gasoline Fuel,A/T,White,–,None reported,Yes,29950
6,6,Audi,A8 L 3.0T,2016,62200,Gasoline,333.0HP 3.0L V6 Cylinder Engine Gasoline Fuel,8-Speed A/T,Black,Black,None reported,Yes,28500
7,7,Chevrolet,Silverado 1500 1LZ,2016,102604,E85 Flex Fuel,355.0HP 5.3L 8 Cylinder Engine Flex Fuel Capab...,A/T,White,Gray,None reported,Yes,12500


In [6]:
#Split data into feature matrix and target vector
target = 'price'
X = df.drop(columns=[target])
y = df[target]

In [7]:
#Classify features by data type

cat_features = X.select_dtypes('object').columns.to_list()
num_features = X.select_dtypes('number').columns.to_list()
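
One consequence of selecting by dtype is that the integer id column ends up among the numeric features. The sketch below shows this on a toy frame that mimics a few columns of the training data. Filtering id out is shown only as an option (an assumption); the notebook itself keeps it.

```python
import pandas as pd

# Toy frame mimicking a few columns of the training data (values are illustrative)
toy_X = pd.DataFrame({
    "id": [2, 3, 5],
    "brand": pd.Series(["Chevrolet", "Genesis", "Audi"], dtype=object),
    "model_year": [2002, 2017, 2018],
    "milage": [136731, 19500, 40950],
})

toy_cat = toy_X.select_dtypes("object").columns.to_list()
toy_num = toy_X.select_dtypes("number").columns.to_list()
print(toy_cat)
print(toy_num)

# Assumption: if the row identifier should not act as a predictor, filter it out
toy_num_no_id = [col for col in toy_num if col != "id"]
print(toy_num_no_id)
```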

In [8]:
#Split the rows into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print('length of X_train:', len(X_train))
print('length of X_val:', len(X_val))
print('length of y_train:', len(y_train))
print('length of y_val:', len(y_val))

length of X_train: 120853
length of X_val: 30214
length of y_train: 120853
length of y_val: 30214


In [9]:
#Build a preprocessing pipeline

num_processing = Pipeline(
    [
        ('scaler', StandardScaler())
    ]
)

cat_processing = Pipeline(
    [
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]
)

In [10]:
transform = ColumnTransformer(
    transformers=[
        ('num', num_processing, num_features),
        ('cat', cat_processing, cat_features)
    ]
)
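
The sketch below rebuilds the same preprocessing on a tiny toy frame to show what it produces: one scaled numeric column plus one column per category seen during fitting. The toy data and the names toy_train, toy_new and to_dense are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one numeric column and one categorical column with a missing value
toy_train = pd.DataFrame({
    "milage": [10_000.0, 50_000.0, 90_000.0, 30_000.0],
    "brand": pd.Series(["Audi", "Ford", np.nan, "Audi"], dtype=object),
})
# A new row whose brand was never seen during fitting
toy_new = pd.DataFrame({
    "milage": [20_000.0],
    "brand": pd.Series(["Kia"], dtype=object),
})

toy_transform = ColumnTransformer([
    ("num", Pipeline([("scaler", StandardScaler())]), ["milage"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["brand"]),
])

def to_dense(matrix):
    # The transformer may return a sparse matrix; convert it for easy inspection
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

train_matrix = to_dense(toy_transform.fit_transform(toy_train))
new_matrix = to_dense(toy_transform.transform(toy_new))

print(train_matrix.shape)   # 1 scaled column + 2 one-hot columns (Audi, Ford)
print(new_matrix[0, 1:])    # unseen brand is encoded as all zeros
```

With handle_unknown='ignore', a category that appears only in the validation or test data does not raise an error; it simply gets zeros in every one-hot column.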

In [11]:
#Build a model pipeline
model = Pipeline(
    [
        ('transform', transform),
        ('regressor', LGBMRegressor())
    ]
)
model.fit(X_train, y_train)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.561840 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4893
[LightGBM] [Info] Number of data points in the train set: 120853, number of used features: 2177
[LightGBM] [Info] Start training from score 33815.281201


In [12]:
#Evaluate model performance (for a regressor, score returns R^2, not RMSE)
model.score(X_val, y_val)

0.5670898453118675
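
The value above is a coefficient of determination (R^2), because score on a scikit-learn style regressor returns R^2 rather than an error in price units. The sketch below checks that equivalence on synthetic data. LinearRegression is used as a stand-in for the LightGBM pipeline so that the example needs only scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=50)

reg = LinearRegression().fit(X_toy, y_toy)

# score() and r2_score() on the same predictions give the same number
score = reg.score(X_toy, y_toy)
manual = r2_score(y_toy, reg.predict(X_toy))
print(score, manual)
```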

In [13]:
LGBMRegressor().get_params()

{'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'importance_type': 'split',
 'learning_rate': 0.1,
 'max_depth': -1,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_split_gain': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'num_leaves': 31,
 'objective': None,
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0}
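
These are the regressor's own parameter names. Once the regressor sits inside a Pipeline step called 'regressor', each one is addressed as regressor__<name>, which is why the search grids in this notebook use that prefix. A minimal sketch, with Ridge standing in for LGBMRegressor (an assumption made to keep the example dependency-free):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Ridge is a stand-in for LGBMRegressor in this illustration
toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", Ridge()),
])

# The step's own parameter is 'alpha'; through the pipeline it is 'regressor__alpha'
has_prefixed_name = "regressor__alpha" in toy_pipe.get_params()
print(has_prefixed_name)

# Setting the prefixed name updates the estimator inside the step
toy_pipe.set_params(regressor__alpha=10.0)
print(toy_pipe.named_steps["regressor"].alpha)
```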

In [14]:
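#Hyperparameter tuning with RandomizedSearchCV
param_grid = {
    'regressor__n_estimators': [10, 50, 100, 150],
    'regressor__learning_rate': [0.01, 0.05, 0.1],
    'regressor__max_depth': [-1, 5, 10],
    #'regressor__force_row_wise': [True],  #optional; needs the step prefix to reach LightGBM
}
#rf holds the randomized search: 5 sampled candidates, each scored with 5-fold CV
rf = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
                        n_iter=5, scoring='neg_root_mean_squared_error', cv=5, verbose=0)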
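
To see how much of the search space n_iter=5 covers, the sketch below counts the full grid and draws five candidates from it with ParameterSampler, which uses the same sampling logic as a randomized search over lists. The random_state=0 value is an assumption added for repeatability; the search above does not set one.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

grid = {
    "regressor__n_estimators": [10, 50, 100, 150],
    "regressor__learning_rate": [0.01, 0.05, 0.1],
    "regressor__max_depth": [-1, 5, 10],
}

# Every possible combination: 4 * 3 * 3
n_candidates = len(ParameterGrid(grid))

# Five combinations drawn at random (random_state is an assumption for repeatability)
sampled = list(ParameterSampler(grid, n_iter=5, random_state=0))

# Each sampled candidate is fitted once per fold (cv=5)
n_fits = len(sampled) * 5

print(n_candidates, len(sampled), n_fits)
for params in sampled:
    print(params)
```

Only 5 of the 36 combinations are tried, so the best parameters found can change from run to run unless a random_state is fixed.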

In [15]:
rf.fit(X_train, y_train)

print("Best Parameters:", rf.best_params_)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.350467 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4399
[LightGBM] [Info] Number of data points in the train set: 96682, number of used features: 1930
[LightGBM] [Info] Start training from score 33843.380857
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.398102 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4413
[LightGBM] [Info] Number of data points in the train set: 96682, number of used features: 1937
[LightGBM] [Info] Start training from score 33793.953797
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.461700 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory 

In [16]:
best_model = rf.best_estimator_
y_val_pred = best_model.predict(X_val)
y_val_pred[:5]

array([60706.70293865, 56508.28175838, 52516.16342822, 41330.29140821,
       35171.69153452])

In [29]:
#Load the test data
test = pd.read_csv(r'C:\Users\USER\Desktop\Kaggle Hackathon\archive\test.csv')
test.head()

Unnamed: 0,id,brand,model,model_year,milage,fuel_type,engine,transmission,ext_col,int_col,accident,clean_title
0,188533,Land,Rover LR2 Base,2015,98000,Gasoline,240.0HP 2.0L 4 Cylinder Engine Gasoline Fuel,6-Speed A/T,White,Beige,None reported,Yes
1,188534,Land,Rover Defender SE,2020,9142,Hybrid,395.0HP 3.0L Straight 6 Cylinder Engine Gasoli...,8-Speed A/T,Silver,Black,None reported,Yes
2,188535,Ford,Expedition Limited,2022,28121,Gasoline,3.5L V6 24V PDI DOHC Twin Turbo,10-Speed Automatic,White,Ebony,None reported,
3,188536,Audi,A6 2.0T Sport,2016,61258,Gasoline,2.0 Liter TFSI,Automatic,Silician Yellow,Black,None reported,
4,188537,Audi,A6 2.0T Premium Plus,2018,59000,Gasoline,252.0HP 2.0L 4 Cylinder Engine Gasoline Fuel,A/T,Gray,Black,None reported,Yes


In [33]:
#make predictions on the test data
y_test_pred = best_model.predict(test).round(2)
y_test_pred[:5]

array([18613.73, 50797.06, 47747.73, 26648.13, 31970.25])

In [34]:
#Create a submission DataFrame
submission_2 = pd.DataFrame(
    {'id': test['id'], 'price': y_test_pred}
)
try:
    submission_2.to_csv(r'C:\Users\USER\Desktop\Kaggle Hackathon\submission_2.csv', index=False)
    print('File saved successfully!')
except Exception as e:
    print(f"Error saving file: {e}")

File saved successfully!


In [36]:
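#Calculate Root Mean Squared Error on the validation set
#(root_mean_squared_error replaces the deprecated squared=False argument)
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(y_val, y_val_pred)
print('RMSE for this model is:', round(rmse, 2))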

RMSE for this model is: 11426.04


Our baseline model's RMSE stood at 11689.28. The difference is not huge (about 263, or roughly 2.3%), but it is an improvement on the baseline.

In [17]:
# Hyperparameter tuning with GridSearchCV
param_grid = {
    'regressor__n_estimators': [100, 300, 500],
    'regressor__learning_rate': [0.01, 0.05, 0.1],
    'regressor__max_depth': [-1, 5, 10]
}
gs = GridSearchCV(model, param_grid, cv=3, 
                  scoring='neg_root_mean_squared_error', verbose=1, n_jobs=-1)
gs.fit(X_train, y_train)

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.578240 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4893
[LightGBM] [Info] Number of data points in the train set: 120853, number of used features: 2177
[LightGBM] [Info] Start training from score 33815.281201



In [18]:
print("Best Parameters:", gs.best_params_)

Best Parameters: {'regressor__learning_rate': 0.05, 'regressor__max_depth': -1, 'regressor__n_estimators': 500}
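
Besides best_params_, a fitted search also stores the cross-validated score of the winning candidate in best_score_. With scoring='neg_root_mean_squared_error' that score is the negative of the RMSE, so it has to be negated before it is read as an error. A minimal sketch on synthetic data, with Ridge standing in for the LightGBM pipeline (an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=60)

toy_search = GridSearchCV(
    Ridge(),
    {"alpha": [0.1, 1.0, 10.0]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
toy_search.fit(X_toy, y_toy)

# best_score_ is negative RMSE; flip the sign to get the cross-validated RMSE
cv_rmse = -toy_search.best_score_
print(toy_search.best_params_, round(cv_rmse, 3))
```

Comparing this cross-validated RMSE with the RMSE on the held-out validation set is a quick way to check that the two estimates agree.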


In [25]:
best_gs_model = gs.best_estimator_
y_val_gs = best_gs_model.predict(X_val)
y_val_gs[:5]

array([62850.7749126 , 56798.77326673, 51511.82215677, 42723.5999015 ,
       34436.53675603])

In [21]:
test = pd.read_csv(r'C:\Users\USER\Desktop\Kaggle Hackathon\archive\test.csv')
test.head()

Unnamed: 0,id,brand,model,model_year,milage,fuel_type,engine,transmission,ext_col,int_col,accident,clean_title
0,188533,Land,Rover LR2 Base,2015,98000,Gasoline,240.0HP 2.0L 4 Cylinder Engine Gasoline Fuel,6-Speed A/T,White,Beige,None reported,Yes
1,188534,Land,Rover Defender SE,2020,9142,Hybrid,395.0HP 3.0L Straight 6 Cylinder Engine Gasoli...,8-Speed A/T,Silver,Black,None reported,Yes
2,188535,Ford,Expedition Limited,2022,28121,Gasoline,3.5L V6 24V PDI DOHC Twin Turbo,10-Speed Automatic,White,Ebony,None reported,
3,188536,Audi,A6 2.0T Sport,2016,61258,Gasoline,2.0 Liter TFSI,Automatic,Silician Yellow,Black,None reported,
4,188537,Audi,A6 2.0T Premium Plus,2018,59000,Gasoline,252.0HP 2.0L 4 Cylinder Engine Gasoline Fuel,A/T,Gray,Black,None reported,Yes


In [22]:
test_pred = best_gs_model.predict(test)
test_pred[:5]

array([18114.04867042, 52777.56227522, 48469.90978217, 25008.84518373,
       30558.46030664])

In [23]:
#Create the submission_3 DataFrame
submission_3 = pd.DataFrame(
    {'id': test['id'], 'price': test_pred}
)
try:
    submission_3.to_csv(r'C:\Users\USER\Desktop\Kaggle Hackathon\submission_3.csv', index=False)
    print('File saved successfully!')
except Exception as e:
    print(f"Error saving file: {e}")

File saved successfully!


In [26]:
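#Calculate RMSE of the GridSearchCV model on the validation set
#(root_mean_squared_error replaces the deprecated squared=False argument)
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(y_val, y_val_gs)
print('RMSE for this model is:', round(rmse, 2))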

RMSE for this model is: 11131.22
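
To put the three validation results side by side, the short calculation below uses only the RMSE values reported in this notebook and expresses each tuned model's gain over the baseline in absolute and relative terms.

```python
# Validation RMSE values reported in this notebook
baseline_rmse = 11689.28
tuned_rmse = {
    "RandomizedSearchCV": 11426.04,
    "GridSearchCV": 11131.22,
}

improvements = {}
for name, value in tuned_rmse.items():
    drop = baseline_rmse - value          # absolute reduction in RMSE
    pct = 100 * drop / baseline_rmse      # reduction relative to the baseline
    improvements[name] = (drop, pct)
    print(f"{name}: {drop:.2f} lower than baseline ({pct:.2f}%)")
```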




In [24]:
#Save the model
import pickle

filename = 'trained_gs_model.pkl'
#Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(best_gs_model, f)
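
A saved model is only useful if it can be loaded back and gives the same predictions. The sketch below shows the round trip with a small stand-in model written to a temporary directory. LinearRegression and the file name toy_model.pkl are assumptions for illustration; the same pattern would apply to trained_gs_model.pkl.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# LinearRegression is a stand-in for best_gs_model in this illustration
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = LinearRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    # Write the fitted model to disk
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    # Read it back into a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = bool(np.allclose(toy_model.predict(X_toy), restored.predict(X_toy)))
print(same)
```

Pickled models should be loaded with the same library versions they were saved with, and only from files you trust, since unpickling can execute arbitrary code.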