# Implementation of a Light Gradient-Boosting Machine (LightGBM) for Predicting the Sale Price of Bulldozers

In [15]:
# Please install lightgbm if not already installed: 
# !pip install lightgbm

## Importing Libraries

In this step we import all the libraries used in this code.


In [16]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import mean_squared_error
import lightgbm as lgb

## Loading Dataset

In this step the CSV dataset is loaded into a pandas DataFrame, with the `saledate` column parsed as a datetime.

In [17]:
df = pd.read_csv("Train.csv", low_memory=False, parse_dates=["saledate"])

## Data Preprocessing

In this step the `saledate` column is split into year and month. These features can be helpful in predicting the sale price.  
Year is helpful because prices of used vehicles tend to fall over the years, while newer vehicles become more expensive due to inflation. We hoped the model could capture these trends from this information.  
Month is potentially useful because prices can vary with the season, especially in regions where construction may be infeasible over the winter and machines sit unused for some months.  
We also experimented with adding columns for the day, day of the week, and day of the year, but these led to poorer model performance, most likely because the model found it difficult to extract meaningful information from them and may have been slightly overfitting to them. So we chose not to include them.  
The original 'saledate' column is dropped.

In [18]:

df["saleYear"] = df.saledate.dt.year
df["saleMonth"] = df.saledate.dt.month
# df["saleDay"] = df.saledate.dt.day
# df["saleDayOfWeek"] = df.saledate.dt.dayofweek
# df["saleDayOfYear"] = df.saledate.dt.dayofyear
df.drop("saledate", axis=1, inplace=True)

There are currently a lot of missing values in the dataset. In some columns a missing value effectively means the value was not specified, since these columns already contain entries that say "None or Unspecified".  
So, we find every column that contains "None or Unspecified" and replace its null values with "None or Unspecified", which completes these columns and ensures the rest of their information can be used.

In [19]:
for column in df.columns:
    if (df[column] == 'None or Unspecified').any():
        df[column] = df[column].fillna('None or Unspecified')

There are still many columns with a large share of missing values. We felt it would be unwise to impute values in columns where the majority of entries are missing, as whatever we impute with (mean, median, etc.) could be misleading when the statistic is calculated from so little information.  
Therefore, we dropped columns with more than 75% missing values.

In [20]:
percentage_missing = df.isna().sum() / len(df) * 100
columns_to_drop = percentage_missing[percentage_missing > 75].index

print("Columns to be dropped:", list(columns_to_drop))

df.drop(columns=columns_to_drop, inplace=True)

Columns to be dropped: ['UsageBand', 'fiModelSeries', 'fiModelDescriptor', 'Stick', 'Engine_Horsepower', 'Track_Type', 'Grouser_Type', 'Differential_Type', 'Steering_Controls']


Next, we analyzed the dataset in-depth, manually, using the "Data Dictionary.xlsx" provided by Kaggle with the dataset. We shortlisted many features that could be either useless or redundant (as in, features that were one-to-one related to other features) and experimented with dropping some of them.  
Somewhat counter-intuitively, dropping most of these columns made our model performance ever so slightly worse on the RMSLE. So, we chose to keep most of them and let the model learn by itself, and only dropped the few that showed an actual improvement.  
Columns we experimented with manually dropping include:  
'SalesID', 'MachineID', 'state', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiProductClassDesc', 'datasource', 'auctioneerID', 'ProductGroupDesc'

In [21]:
columns_to_drop2 = ['datasource', 'auctioneerID', 'ProductGroupDesc']

df = df.drop(columns=columns_to_drop2)

### Imputing Missing Values

In this step the values that are still missing are handled.
1. Numeric (float64) columns with missing values are imputed with the median, because the median is less sensitive to outliers than the mean.
2. Categorical columns are imputed with the most frequent value in each column.

In [22]:
numeric_imputer = SimpleImputer(strategy='median')
df[df.select_dtypes(include=['float64']).columns] = numeric_imputer.fit_transform(df.select_dtypes(include=['float64']))

categorical_imputer = SimpleImputer(strategy='most_frequent')
df[df.select_dtypes(include=['object']).columns] = categorical_imputer.fit_transform(df.select_dtypes(include=['object']))
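
The choice of the median over the mean can be illustrated with a minimal sketch. The values below are hypothetical and not taken from the dataset; they only show how a single extreme entry shifts the mean while leaving the median near the typical values.

```python
import numpy as np

# Hypothetical values for one numeric column; the last one is an extreme outlier.
values = np.array([1000.0, 1200.0, 1100.0, 1300.0, 250000.0])

mean_value = values.mean()        # pulled far upward by the outlier
median_value = np.median(values)  # stays near the typical values

print("mean:", mean_value)
print("median:", median_value)
```

Imputing with the mean here would fill the gaps with a value that no typical row resembles, which is the concern behind choosing the median.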


### Encoding of Categorical Variables

Categorical columns in the dataset are selected and converted into numeric form using label encoding, because LightGBM, like many other algorithms, does not accept raw string (object) columns.
Label encoding maps each category to an integer, which is efficient in terms of memory and computation. As our dataset is large (around 401k rows), it seemed a more suitable choice than one-hot encoding, which adds one column per category and so increases the size of the dataset.


In [23]:
for column in df.select_dtypes(include=['object']).columns:
    df[column] = LabelEncoder().fit_transform(df[column])

# df = pd.get_dummies(df, columns=df.select_dtypes(include=['object']).columns)
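
The size difference between the two encodings can be seen on a small example. This is only a sketch on a hypothetical toy column (the values are made up for illustration), not a measurement on the actual dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for one categorical feature.
toy = pd.DataFrame({"state": ["Texas", "Ohio", "Texas", "Florida", "Ohio"]})

# Label encoding: a single integer column, codes assigned in sorted class order.
label_encoded = LabelEncoder().fit_transform(toy["state"])

# One-hot encoding: one new column per distinct category.
one_hot = pd.get_dummies(toy, columns=["state"])

print(list(label_encoded))  # one value per row
print(one_hot.shape)        # (rows, number of distinct categories)
```

With three distinct categories the one-hot version already needs three columns where label encoding needs one. The trade-off is that the integer codes imply an order that has no real meaning, which tree-based models can usually work around by splitting on the codes.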

In [24]:
# df.dtypes

## Splitting Dataset

The dataset is split into X (features) and y (labels).

In [25]:
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']

In this step the data is split into training and validation sets (80% and 20%) using train_test_split, and their sizes are printed.

In [26]:

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_val.shape)
print(y_val.shape)

(320900, 41)
(320900,)
(80225, 41)
(80225,)


## Training LGBM Model 

lgb.Dataset converts the data into a format that is internally optimized for speed and memory usage by LightGBM.
LightGBM's advanced features, such as handling categorical features, optimizing memory usage, and speeding up training, are largely due to this specialized data structure. The validation set is created with reference=lgb_train so that it is aligned with the training set.

In [27]:
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_val, y_val, reference=lgb_train)


In this step two things are done:
1. The parameters of the LGBM model are defined. These include the boosting type (gbdt for gradient boosted decision trees), the objective (regression), and the evaluation metrics: 'metric': {'l2', 'l1'} means that both L2 (mean squared error) and L1 (mean absolute error) are evaluated on the validation set. L2 is sensitive to outliers, whereas L1 is more robust to them. num_leaves specifies the maximum number of leaves in one tree and is the key parameter controlling the complexity of the model. The remaining parameters set the learning rate and the feature and row subsampling (feature_fraction, bagging_fraction, bagging_freq).
2. The LGBM model is trained on the training set (lgb_train). num_boost_round sets the maximum number of boosting iterations. The lgb.early_stopping callback stops training if the validation score does not improve for 5 consecutive rounds, which helps prevent overfitting.

Key Note: Increasing num_boost_round decreased the RMSLE.
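
The early-stopping rule described above can be sketched in plain Python. This is a simplified illustration of the idea for a single metric, not LightGBM's actual implementation; the function name `best_round_with_patience` and the loss values are hypothetical.

```python
def best_round_with_patience(val_scores, patience):
    """Return (best_round, stopped_early) for per-round validation losses.

    Rounds are 1-indexed. Training stops once `patience` consecutive rounds
    pass without a new best (lower) loss.
    """
    best_score = float("inf")
    best_round = 0
    for round_number, score in enumerate(val_scores, start=1):
        if score < best_score:
            # New best validation loss: remember it and keep training.
            best_score = score
            best_round = round_number
        elif round_number - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop.
            return best_round, True
    return best_round, False


# Hypothetical validation losses: improvement stalls after round 3.
losses = [10.0, 8.0, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5, 6.0]
print(best_round_with_patience(losses, patience=5))
```

In this toy run training stops at round 8 and round 3 is kept as the best, so the later improvement at round 9 is never seen. A small patience therefore trades some potential accuracy for shorter training.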

In [28]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# Train the model
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=90,
                valid_sets=lgb_eval,
                callbacks=[lgb.early_stopping(stopping_rounds=5)])


Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[90]	valid_0's l1: 6739.08	valid_0's l2: 9.97596e+07
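
The two logged metrics are on different scales: l1 is in the units of SalePrice, while l2 is in squared units. A small sketch, using only the numbers printed in the log above, converts l2 to a root mean squared error so the two can be compared directly.

```python
import math

# Values copied from the training log above.
valid_l2 = 9.97596e07  # mean squared error, in squared SalePrice units
valid_l1 = 6739.08     # mean absolute error, in SalePrice units

# Taking the square root puts the L2 metric back on the price scale.
valid_rmse = math.sqrt(valid_l2)

print(f"RMSE: {valid_rmse:.0f}")
print(f"MAE:  {valid_l1:.0f}")
```

The RMSE is never smaller than the MAE, and the gap between the two here suggests that a number of predictions have comparatively large errors.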


## Making Prediction and Calculating RMSLE

The trained model is now used to predict the sale price on the validation set. We use the root mean squared logarithmic error (RMSLE) to evaluate model performance. RMSLE is the root mean squared difference between log(1 + predicted) and log(1 + actual), so it depends on the ratio between the predicted and actual values rather than on their absolute difference. A smaller RMSLE means better performance, with 0 being the ideal score indicating perfect predictions.
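
To make the ratio interpretation concrete, here is a minimal sketch of the metric on hypothetical prices. The helper name `rmsle_score` is introduced only for this illustration.

```python
import numpy as np


def rmsle_score(y_true, y_pred):
    """Root mean squared error between log1p(y_true) and log1p(y_pred)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


# Hypothetical prices: both predictions are exactly double the actual value.
cheap = rmsle_score([10_000], [20_000])
expensive = rmsle_score([100_000], [200_000])

print(cheap, expensive)
```

Both cases give an error close to log(2), even though the absolute errors differ by a factor of ten. This suits a dataset where prices span a wide range, since an error of a fixed amount matters more on a cheap machine than on an expensive one.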

In [29]:
y_pred = gbm.predict(X_val, num_iteration=gbm.best_iteration)


In [30]:

rmsle = np.sqrt(mean_squared_error(np.log1p(y_val), np.log1p(y_pred)))
print("RMSLE on validation set:", rmsle)

RMSLE on validation set: 0.31593054635125856
