# Implementation of Light Gradient Boosting Machine (LightGBM) for Prediction of Sale Price of Bulldozers

In this notebook, we use the Kaggle dataset [Blue Book for Bulldozers](https://www.kaggle.com/competitions/bluebook-for-bulldozers/) to train a machine learning model that predicts the auction sale price of heavy machinery from factors such as its configuration and usage.

The primary focus is on cleaning the data so that the most relevant information is fed to the model, and then on training a LightGBM (LGBM) regressor on the result.

In [1]:
# Please install the lightgbm library if not already installed by uncommenting the following line: 
# !pip install lightgbm

## Importing Libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import mean_squared_error
import lightgbm as lgb

## Loading Dataset

Dataset obtained from [Kaggle](https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview). 

In [3]:
df = pd.read_csv("Train.csv", low_memory=False, parse_dates=["saledate"])

## Data Preprocessing

In this step, the `saledate` column is split into 'Year' and 'Month' features (`saleYear` and `saleMonth`). These features can be helpful in making predictions about the sale price.

'Year' is helpful because the price of a used vehicle tends to fall as it ages, while newer vehicles become more expensive due to inflation. The hope is that the model can capture these trends from this information.

'Month' is potentially useful, as prices can vary with the season, especially in regions where construction is unfeasible over the winter and machines sit unused for some months.

Columns for the day, day of the week, and day of the year were also tried, but these led to poorer model performance, most likely because they carry little meaningful signal and the model may slightly overfit to them. They were therefore not included.

The original 'saledate' column is dropped.

In [4]:

df["saleYear"] = df.saledate.dt.year
df["saleMonth"] = df.saledate.dt.month
# df["saleDay"] = df.saledate.dt.day
# df["saleDayOfWeek"] = df.saledate.dt.dayofweek
# df["saleDayOfYear"] = df.saledate.dt.dayofyear
df.drop("saledate", axis=1, inplace=True)
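
As a small illustration of what the `.dt` accessor returns, the sketch below applies the same two extractions to a toy frame. The dates and the `toy` variable are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data: made-up sale dates, used only to illustrate the .dt accessor.
toy = pd.DataFrame(
    {"saledate": pd.to_datetime(["2006-11-16", "2004-03-26", "2011-05-19"])}
)

# Same feature extraction as in the notebook cell.
toy["saleYear"] = toy["saledate"].dt.year
toy["saleMonth"] = toy["saledate"].dt.month

# The raw timestamp is no longer needed once the parts are extracted.
toy = toy.drop(columns="saledate")

print(toy)
```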

There are currently a lot of null values in the dataset. Some columns already contain "None or Unspecified" entries, which effectively means that "None or Unspecified" is a valid value for them.

So, we find any columns that contain such "None or Unspecified" values, and just replace the actual null values with "None or Unspecified" to complete these columns. This ensures that the rest of the information in these columns can be used and they do not have to be dropped. 

In [5]:
for column in df.columns:
    if (df[column] == 'None or Unspecified').any():
        df[column] = df[column].fillna('None or Unspecified')
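
To make the effect of this loop concrete, here is a minimal sketch on a toy frame. The column names `with_placeholder` and `without_placeholder` and their values are hypothetical; only the column that already contains the placeholder string has its nulls filled.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: one already uses the placeholder, the other does not.
toy = pd.DataFrame({
    "with_placeholder": ["None or Unspecified", np.nan, "Single Shank"],
    "without_placeholder": ["A", np.nan, "B"],
})

for column in toy.columns:
    # Fill nulls only where the placeholder is already a known category.
    if (toy[column] == "None or Unspecified").any():
        toy[column] = toy[column].fillna("None or Unspecified")

# Remaining nulls per column after the fill.
print(toy.isna().sum())
```

The second column keeps its null, which is why a separate imputation step is still needed later.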

Many columns still have a large share of missing values. Imputing a column in which most values are missing is generally unwise, because whatever statistic we impute with (mean, median, etc.) is calculated from very little information and could be misleading.

Therefore, columns with more than 75% missing values were dropped. 

In [6]:
percentage_missing = df.isna().sum() / len(df) * 100
columns_to_drop = percentage_missing[percentage_missing > 75].index

print("Columns to be dropped:", list(columns_to_drop))

df.drop(columns=columns_to_drop, inplace=True)

Columns to be dropped: ['UsageBand', 'fiModelSeries', 'fiModelDescriptor', 'Stick', 'Engine_Horsepower', 'Track_Type', 'Grouser_Type', 'Differential_Type', 'Steering_Controls']
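
The threshold logic can be checked on a toy frame. This is only a sketch: the column names and values below are invented, and `isna().mean()` is used as an equivalent shorthand for `isna().sum() / len(df)`.

```python
import numpy as np
import pandas as pd

# Invented columns with different shares of missing values.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 4 of 5 missing
    "partly_missing": [np.nan, np.nan, 3.0, 4.0, 5.0],        # 2 of 5 missing
    "complete": [1, 2, 3, 4, 5],                              # none missing
})

# Mean of the boolean mask is the fraction of missing values per column.
percentage_missing = toy.isna().mean() * 100

# Same 75% cut-off as in the notebook cell.
dropped = percentage_missing[percentage_missing > 75].index
toy = toy.drop(columns=dropped)

print(list(dropped))
print(list(toy.columns))
```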


The dataset was studied in depth and analyzed manually using the "Data Dictionary.xlsx" provided by Kaggle. This document contains brief descriptions of what each feature in the dataset means (where each feature is some characteristic of the machine being sold).

Many features that could be either useless or redundant (as in, features that were one-to-one related to other features) were shortlisted: 
- SalesID
- MachineID
- state
- fiModelDesc
- fiBaseModel
- fiSecondaryDesc
- fiProductClassDesc
- datasource
- auctioneerID
- ProductGroupDesc

Experimentation was done with dropping some of them.  

Somewhat counter-intuitively, dropping most of these columns made the model's RMSLE ever so slightly worse. So most of them were kept, and the model was given a free hand to learn which ones matter. Only the few columns whose removal produced an actual improvement were dropped.

In [7]:
columns_to_drop2 = ['datasource', 'auctioneerID', 'ProductGroupDesc']

df = df.drop(columns=columns_to_drop2)

### Imputing Missing Values

In this step, the values that are still missing are handled:
1. Numeric columns with missing values are imputed with the median. This is because the median is less sensitive to outliers than the mean.
2. Categorical columns are imputed with the most frequent value in each column.

In [8]:
numeric_imputer = SimpleImputer(strategy='median')
df[df.select_dtypes(include=['float64']).columns] = numeric_imputer.fit_transform(df.select_dtypes(include=['float64']))

categorical_imputer = SimpleImputer(strategy='most_frequent')
df[df.select_dtypes(include=['object']).columns] = categorical_imputer.fit_transform(df.select_dtypes(include=['object']))
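
The claim about outliers can be seen on a toy column. The sketch below uses plain pandas rather than `SimpleImputer` to stay dependency-light; the `hours` column and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up usage hours with one extreme outlier and one missing value.
toy = pd.DataFrame({"hours": [100.0, 120.0, 110.0, 50000.0, np.nan]})

# Both statistics ignore the missing entry when computed.
median_value = toy["hours"].median()
mean_value = toy["hours"].mean()

median_filled = toy["hours"].fillna(median_value)
mean_filled = toy["hours"].fillna(mean_value)

print("median fill:", median_filled.iloc[-1])
print("mean fill:", mean_filled.iloc[-1])
```

The single outlier pulls the mean far away from the typical values, while the median stays close to them.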


### Encoding of Categorical Variables

As all the data for our LGBM model has to be in numerical form, categorical columns in the dataset were selected and converted into numerical form using label encoding.

Label encoding transforms categorical data into a simple numerical format that is efficient in terms of memory and computation, which matters because our dataset is large (around 401k rows). It was therefore chosen over one-hot encoding (which would vastly increase the size of the dataset).


In [9]:
for column in df.select_dtypes(include=['object']).columns:
    df[column] = LabelEncoder().fit_transform(df[column])

# df = pd.get_dummies(df, columns=df.select_dtypes(include=['object']).columns)
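
The size difference between the two encodings can be sketched on a toy column. As an assumption made to keep the example dependency-light, pandas category codes stand in for `LabelEncoder` here; both assign integer codes to the sorted unique values. The values in `toy` are invented.

```python
import pandas as pd

# Invented categorical column with three distinct values.
toy = pd.DataFrame({"state": ["Texas", "Ohio", "Texas", "Florida", "Ohio"]})

# Label-style encoding: one integer column, codes follow sorted category order.
codes = toy["state"].astype("category").cat.codes

# One-hot encoding: one indicator column per distinct value.
one_hot = pd.get_dummies(toy, columns=["state"])

print(codes.tolist())
print(one_hot.shape)
```

With label encoding the column count stays the same, whereas one-hot encoding adds a column per category, which grows quickly for high-cardinality features. The integer codes do imply an arbitrary ordering, which tree-based models generally tolerate better than linear ones.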

In [10]:
# df.dtypes

## Splitting Dataset

The dataset is split into: 
- X -> Features
- y -> labels 

In [11]:
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']

In [12]:
# Splitting into training and validation sets

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_val.shape)
print(y_val.shape)

(320900, 41)
(320900,)
(80225, 41)
(80225,)


## Training LGBM Model

The `lgb.Dataset` class converts the data into a format that LightGBM internally optimizes for speed and memory usage.

In [13]:
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_val, y_val, reference=lgb_train)


In this step two things are done:

1. The parameters of the LGBM are defined. These include: 
- the type of model ('gbdt' for gradient boosted decision trees)
- the objective (regression)
- metric ({'l2', 'l1'}: both mean squared error (L2) and mean absolute error (L1) are reported on the validation set; L2 penalizes large errors, and hence outliers, more heavily than L1)
- num_leaves (a key parameter which specifies the maximum number of leaves in one tree, controlling the complexity of the model)
- learning rate 

2. The LGBM model is trained on the training set (`lgb_train`). 'num_boost_round' sets the maximum number of boosting iterations. The `lgb.early_stopping` callback (`stopping_rounds=5`) stops training if the validation score does not improve for 5 consecutive rounds, which helps prevent overfitting.

Note: Increasing num_boost_round decreased the RMSLE.
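
The idea behind early stopping can be sketched in a few lines of plain Python. This is not LightGBM's own implementation, only a simplified illustration of the patience rule; the function name `best_round_with_patience` and the loss values are hypothetical.

```python
def best_round_with_patience(scores, patience):
    """Return (best_round, stopped_early) for a list of validation losses.

    Rounds are 1-indexed and lower scores are better. Training is considered
    stopped once `patience` consecutive rounds pass without a new best score.
    """
    best_score = float("inf")
    best_round = 0
    for round_number, score in enumerate(scores, start=1):
        if score < best_score:
            # New best: remember it and reset the patience window.
            best_score = score
            best_round = round_number
        elif round_number - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, True
    # Ran out of rounds before the patience window was exhausted.
    return best_round, False


# Hypothetical validation losses that bottom out at round 3.
losses = [10.0, 8.0, 7.5, 7.6, 7.7, 7.8, 7.9, 8.0, 8.1]
print(best_round_with_patience(losses, patience=5))
```

If the loss keeps improving until the last round, the rule never triggers, which corresponds to a run that reports it did not meet early stopping.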

In [14]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# Train the model
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=90,
                valid_sets=lgb_eval,
                callbacks=[lgb.early_stopping(stopping_rounds=5)])


Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[90]	valid_0's l1: 6739.08	valid_0's l2: 9.97596e+07


## Making Prediction and Calculating RMSLE

The trained model is now used to predict the sale price on the validation set. 

RMSLE (Root Mean Squared Logarithmic Error) is used to evaluate model performance. Because it is computed on the logarithm of the values, RMSLE reflects the ratio between the predicted and actual values rather than their absolute difference. This helps evaluate performance well in cases where the target variable has a wide range of values. Our target variable here is the sale price of the machine, and it does indeed have a very wide range of possible values. Also, RMSLE was the metric used to evaluate the performance of the models in the original Kaggle competition for this dataset.

A smaller RMSLE value means better performance, with 0 being the ideal score indicating perfect predictions.
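
A minimal NumPy sketch of the metric, following the same `log1p` formulation used in this notebook, shows the ratio behaviour. The helper name `rmsle_score` and the prices are made up for illustration.

```python
import numpy as np


def rmsle_score(y_true, y_pred):
    """Root mean squared error between log1p(y_pred) and log1p(y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


# Both predictions are 10% too high, but the absolute errors differ tenfold.
cheap = rmsle_score([10_000], [11_000])       # off by 1,000
expensive = rmsle_score([100_000], [110_000])  # off by 10,000

print("cheap machine:", cheap)
print("expensive machine:", expensive)
print("perfect prediction:", rmsle_score([5_000, 20_000], [5_000, 20_000]))
```

The two 10% overestimates score almost identically, so an expensive machine does not dominate the metric the way it would with a plain RMSE.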

In [15]:
y_pred = gbm.predict(X_val, num_iteration=gbm.best_iteration)


In [16]:

rmsle = np.sqrt(mean_squared_error(np.log1p(y_val), np.log1p(y_pred)))
print("RMSLE on validation set:", rmsle)

RMSLE on validation set: 0.31593054635125856


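The RMSLE of ~0.3159 is a fairly good value: it would rank amongst the top 100 on the [Kaggle competition leaderboard](https://www.kaggle.com/competitions/bluebook-for-bulldozers/leaderboard). Note, however, that it was measured on a random 20% hold-out of Train.csv rather than on the competition's test set, so the comparison is only indicative.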