# Big Mart Sales Prediction

<u>**Assignment Objective**</u>: Given a training dataset (~8k rows), predict product sales for the given test dataset (~5k rows)

<u>**Notebook Objective**</u>: Explore different ML models, tune hyperparameters, optimize the best performing model further

**Content**:
1. Data Import
2. Model Tuning Setup
3. Model Inference and CV Scores
4. Additional Optimization

In [1]:
import optuna
import numpy as np
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.metrics import make_scorer, mean_squared_error

## Data Import

In [2]:
df_train = pd.read_csv('../data/df_train.csv',index_col=0)
df_test = pd.read_csv('../data/df_test.csv',index_col=0)

# Importing the original test data set to get the item and outlet IDs
test = pd.read_csv('../data/test.csv')

In [3]:
X = df_train.drop(['Item_Outlet_Sales'],axis=1)
y = df_train['Item_Outlet_Sales']

In [4]:
X.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_ID,Outlet_Age,price/weight,Outlet_Identifier_HIGH,Outlet_Identifier_LOW,Outlet_Identifier_MED,Item_MRP_Group_A,Item_MRP_Group_B,Item_MRP_Group_C,Item_MRP_Group_D
0,9.3,0,-4.132215,249.8092,1,1,1,2,14,26.861204,0,0,1,0,0,0,1
1,5.92,1,-3.94878,48.2692,1,3,2,0,4,8.153581,0,0,1,1,0,0,0
2,17.5,0,-4.088756,141.618,1,1,1,2,14,8.092457,0,0,1,0,0,1,0
3,19.2,1,-4.181625,182.095,0,3,0,2,15,9.484115,0,1,0,0,0,1,0
4,8.93,0,-4.818097,53.8614,2,3,1,1,26,6.031512,0,0,1,1,0,0,0


In [5]:
df_test.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_ID,Outlet_Age,price/weight,Outlet_Identifier_HIGH,Outlet_Identifier_LOW,Outlet_Identifier_MED,Item_MRP_Group_A,Item_MRP_Group_B,Item_MRP_Group_C,Item_MRP_Group_D
0,20.75,0,-4.884245,107.8622,1,1,1,2,14,5.198178,0,0,1,0,1,0,0
1,8.3,1,-3.258977,87.3198,0,2,1,2,6,10.520458,0,0,1,0,1,0,0
2,14.6,0,-2.306845,241.7538,0,3,0,1,15,16.558479,0,1,0,0,0,0,1
3,7.315,0,-4.174142,155.034,0,2,1,2,6,21.193985,0,0,1,0,0,1,0
4,13.6,1,-2.132005,234.23,1,3,3,2,28,17.222794,1,0,0,0,0,0,1


In [6]:
y

0       3735.1380
1        443.4228
2       2097.2700
3        732.3800
4        994.7052
          ...    
8518    2778.3834
8519     549.2850
8520    1193.1136
8521    1845.5976
8522     765.6700
Name: Item_Outlet_Sales, Length: 8523, dtype: float64

---

## Model Tuning Setup

We will create a model tuner class with built-in hyperparameter search spaces for 3 models: LightGBM, GradientBoosting and XGBoost.

We will also create objective functions and a run function to optimize the hyperparameters using Optuna.

In addition, we use **RepeatedKFold** cross-validation so that each CV score is averaged over many folds, which gives a lower-variance estimate of how well a model generalizes.
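
To make the repeated K-fold idea concrete, here is a minimal plain-Python sketch of the splitting scheme. It is an illustration only and not scikit-learn's implementation: the helper name `repeated_kfold_indices` is hypothetical, and the shuffling differs from `RepeatedKFold`, so the exact indices will not match. The defaults mirror the `n_splits=5, n_repeats=5` used by the tuner.

```python
import random

def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=5, seed=42):
    """Yield (train_idx, val_idx) pairs for repeated K-fold.

    Illustrative only: scikit-learn's RepeatedKFold shuffles differently,
    so the exact indices will not match.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repeat
        for k in range(n_splits):
            val_idx = indices[k::n_splits]  # every n_splits-th row forms one fold
            val_set = set(val_idx)
            train_idx = [i for i in indices if i not in val_set]
            yield train_idx, val_idx

folds = list(repeated_kfold_indices(n_samples=20))
print(len(folds))  # 25 = 5 splits x 5 repeats
```

Within one repeat every row is used for validation exactly once, so averaging over all 25 folds smooths out the luck of any single split.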

In [7]:
class ModelTuner:
    def __init__(self, X, y, n_splits=5, n_repeats=5, random_state=42):
        self.X = X
        self.y = y
        self.rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=random_state) 
        self.rmse_scorer = make_scorer(self.rmse, greater_is_better=False)

    # Since our metric is RMSE, we'll create a custom scorer here for it
    @staticmethod
    def rmse(y_true, y_pred):
        return np.sqrt(mean_squared_error(y_true, y_pred))

    # XGBoost Objective
    def objective_xgb(self, trial):
        params = {
            "objective": "reg:squarederror",
            "eval_metric": "rmse",
            "random_state": 42,
            "n_jobs": -1,
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
            "max_depth": trial.suggest_int("max_depth", 3, 8),
            "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
            "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.5, 1.0),
            "colsample_bynode": trial.suggest_float("colsample_bynode", 0.5, 1.0),
            "gamma": trial.suggest_float("gamma", 0, 5),
            "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 5.0),
            "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 10.0),
            "n_estimators": trial.suggest_int("n_estimators", 200, 1500),
            "max_leaves": trial.suggest_int("max_leaves", 4, 256)
        }
        model = xgb.XGBRegressor(**params)
        scores = cross_val_score(model, self.X, self.y, scoring=self.rmse_scorer, cv=self.rkf, n_jobs=-1)
        return -np.mean(scores)

    # GradientBoost Objective
    def objective_gb(self, trial):
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
            'max_depth': trial.suggest_int('max_depth', 3, 8),
            'min_samples_split': trial.suggest_int('min_samples_split', 2, 30),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20),
            'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
            'subsample': trial.suggest_float('subsample', 0.6, 1.0),
            "random_state": 42,
        }
        model = GradientBoostingRegressor(**params)
        scores = cross_val_score(model, self.X, self.y, scoring=self.rmse_scorer, cv=self.rkf, n_jobs=-1)
        return -np.mean(scores)

    # LightGBM Objective
    def objective_lgb(self, trial):
        params = {
            "objective": "regression",
            "metric": "rmse",
            "verbosity": -1,
            'boosting_type': 'gbdt',
            "random_state": 42,
            "n_jobs": -1,
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
            'num_leaves': trial.suggest_int('num_leaves', 15, 150),
            'max_depth': trial.suggest_int('max_depth', 3, 10),
            'min_child_samples': trial.suggest_int('min_child_samples', 5, 50),
            'min_child_weight': trial.suggest_float('min_child_weight', 1e-3, 10.0, log=True),
            'subsample': trial.suggest_float('subsample', 0.5, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
            'reg_alpha': trial.suggest_float('reg_alpha', 1e-5, 5.0, log=True),
            'reg_lambda': trial.suggest_float('reg_lambda', 1e-5, 5.0, log=True),
            'n_estimators': trial.suggest_int('n_estimators', 100, 1000)
        }
        model = lgb.LGBMRegressor(**params)
        scores = cross_val_score(model, self.X, self.y, scoring=self.rmse_scorer, cv=self.rkf, n_jobs=-1)
        return -np.mean(scores)

    # Optimization function
    def run(self, model_name, n_trials):
        mapping = {
            "xgb": self.objective_xgb,
            "gb": self.objective_gb,
            "lgb": self.objective_lgb,
        }
        if model_name not in mapping:
            raise ValueError("Model name must be one of: 'xgb', 'gb', 'lgb'")
        
        study = optuna.create_study(direction="minimize")
        study.optimize(mapping[model_name], n_trials=n_trials, n_jobs=1)
        return study
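
Several of the search spaces above use `log=True`, for example `learning_rate` between 0.01 and 0.2. With that flag Optuna treats the range in the log domain, so small values are explored as densely as large ones. The sketch below illustrates the effect with plain random sampling; it is an assumption-laden stand-in, not Optuna's sampler (the default TPE sampler adapts its proposals to earlier trials), and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
draws = [sample_log_uniform(0.01, 0.2, rng) for _ in range(10_000)]

# Geometric midpoint of the range: about half of the draws should fall below it.
midpoint = math.sqrt(0.01 * 0.2)
share_below = sum(d < midpoint for d in draws) / len(draws)
print(f"midpoint={midpoint:.4f}, share below={share_below:.2f}")
```

With a linear scale, only about 18% of the draws would land below roughly 0.045, so the low learning rates would be sampled far less often.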

---

## Model Inference and CV Scores

In this section, I ran 10, 20, 50 and 100 trials for each of the 3 models and found that XGBoost consistently scored better on the leaderboard, despite a slightly higher (worse) CV RMSE than LightGBM. This could be because XGBoost generalizes better to the test set while LightGBM overfits slightly in comparison.

In [115]:
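# Uncomment to rerun the studies; the cells below expect study_gb, study_lgb and study_xgb to exist
# tuner = ModelTuner(X, y)
# study_gb  = tuner.run(model_name="gb", n_trials=100)
# study_lgb = tuner.run(model_name="lgb", n_trials=100)
# study_xgb = tuner.run(model_name="xgb", n_trials=100)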

In [None]:
print(f'BEST CV SCORE for GradientBoosting: {study_gb.best_value:0.2f}')

BEST CV SCORE for GradientBoosting: 1079.63
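
The printed score is a positive RMSE even though the scorer was created with `greater_is_better=False`. That flag makes scikit-learn negate the metric so that higher is always better, and each objective then negates the mean again before returning it to Optuna. A small sketch of that sign handling, using hypothetical fold data and a NumPy-free `rmse`:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, computed without NumPy for illustration."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical targets and predictions for two validation folds.
folds = [
    ([100.0, 200.0], [110.0, 190.0]),  # errors of +/-10 -> RMSE 10
    ([300.0, 400.0], [320.0, 380.0]),  # errors of +/-20 -> RMSE 20
]

# With greater_is_better=False the scorer reports the negated RMSE per fold.
scores = [-rmse(y_true, y_pred) for y_true, y_pred in folds]

# The objective negates the mean again, giving a positive RMSE to minimize.
objective_value = -sum(scores) / len(scores)
print(scores, objective_value)  # [-10.0, -20.0] 15.0
```

This is why the study is created with `direction="minimize"`: the value Optuna sees is the ordinary RMSE.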


In [None]:
gb_best_params = study_gb.best_params
gb_best_params['random_state'] = 42

final_model = GradientBoostingRegressor(**gb_best_params)
final_model.fit(X,y)
y_preds = final_model.predict(df_test)

submit = test[['Item_Identifier','Outlet_Identifier']].copy()
submit['Item_Outlet_Sales'] = y_preds
submit.Item_Outlet_Sales = submit.Item_Outlet_Sales.clip(lower=0) # Clipping lower bound to 0 since sales cannot be < 0
submit.to_csv('submission.csv',index=False)
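
The `clip(lower=0)` step matters because a tree ensemble trained with squared error has no constraint that keeps its predictions non-negative. A plain-Python sketch of what the clipping does, with hypothetical prediction values and a hypothetical helper `clip_lower`:

```python
def clip_lower(values, lower=0.0):
    """Replace every value below `lower` with `lower`."""
    return [max(lower, v) for v in values]

# Hypothetical raw predictions; a regressor is free to go below zero.
raw_preds = [1520.4, -35.2, 0.0, 87.9]
clipped = clip_lower(raw_preds)
print(clipped)  # [1520.4, 0.0, 0.0, 87.9]
```

Since true sales are never negative, moving a negative prediction up to 0 can only reduce its error.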

In [None]:
print(f'BEST CV SCORE for LightGBM: {study_lgb.best_value:0.2f}')

BEST CV SCORE for LightGBM: 1077.86


In [None]:
lgb_best_params = study_lgb.best_params
lgb_best_params['objective'] ='regression'
lgb_best_params['metric'] = 'rmse'
lgb_best_params['random_state'] = 42
lgb_best_params['n_jobs'] = -1
lgb_best_params['verbosity'] = -1
lgb_best_params['boosting_type'] = 'gbdt'

final_model = lgb.LGBMRegressor(**lgb_best_params)
final_model.fit(X,y)
y_preds = final_model.predict(df_test)

submit = test[['Item_Identifier','Outlet_Identifier']].copy()
submit['Item_Outlet_Sales'] = y_preds
submit.Item_Outlet_Sales = submit.Item_Outlet_Sales.clip(lower=0) # Clipping lower bound to 0 since sales cannot be < 0
submit.to_csv('submission.csv',index=False)

In [None]:
print(f'BEST CV SCORE for XGBoost: {study_xgb.best_value:0.2f}')

BEST CV SCORE for XGBoost: 1078.34


In [None]:
xgb_best_params = study_xgb.best_params
xgb_best_params['objective'] ='reg:squarederror'
xgb_best_params['eval_metric'] = 'rmse'
xgb_best_params['random_state'] = 42
xgb_best_params['n_jobs'] = -1

final_model = xgb.XGBRegressor(**xgb_best_params)
final_model.fit(X,y)
y_preds = final_model.predict(df_test)

submit = test[['Item_Identifier','Outlet_Identifier']].copy()
submit['Item_Outlet_Sales'] = y_preds
submit.Item_Outlet_Sales = submit.Item_Outlet_Sales.clip(lower=0) # Clipping lower bound to 0 since sales cannot be < 0
submit.to_csv('submission.csv',index=False)

Since XGBoost has consistently performed better on LB scores (despite a worse CV score than LightGBM), we will move on to more rigorous optimization with XGBoost alone and focus on additional feature engineering and a larger number of tuning trials.

GradientBoosting, on the other hand, not only has a comparatively bad CV score but also fails to perform better on the leaderboard.
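
To put the CV differences in perspective, the following sketch ranks the three best CV scores printed above and expresses each gap relative to the best one. Only the three RMSE values come from the notebook; the rest is illustrative arithmetic.

```python
# Best CV RMSE values as printed in the cells above.
cv_rmse = {"GradientBoosting": 1079.63, "LightGBM": 1077.86, "XGBoost": 1078.34}

best_name = min(cv_rmse, key=cv_rmse.get)
best = cv_rmse[best_name]

# Print the models from best to worst with absolute and relative gaps.
for name, score in sorted(cv_rmse.items(), key=lambda item: item[1]):
    gap = score - best
    print(f"{name:<17} {score:8.2f}  gap={gap:5.2f}  ({100 * gap / best:.3f}%)")
```

All three models are within about 0.2% of each other. Gaps this small may well be within fold-to-fold variation, which would be consistent with the leaderboard ordering differing from the CV ordering. The notebook does not report the per-fold spread, so this is only a plausible reading.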

---

## Additional Optimization

From our inferences, we have concluded that XGBoost is not only faster to train but also ends up giving a better leaderboard score. Now we try to improve the hyperparameters further with a higher number of trials.

Since this is an iterative process, we will not only perform hyperparameter tuning but also do **additional feature engineering** in our EDA notebook to see what works and what doesn't.

In [114]:
# tuner = ModelTuner(X, y)
# study_xgb = tuner.run(model_name="xgb", n_trials=200)
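
Raising the trial count has a direct cost, because every trial runs one full repeated K-fold evaluation. A rough count of model fits per study, assuming the `ModelTuner` defaults of `n_splits=5` and `n_repeats=5` (the helper `total_fits` is hypothetical):

```python
def total_fits(n_trials, n_splits=5, n_repeats=5):
    """Number of model fits for one Optuna study with repeated K-fold CV."""
    return n_trials * n_splits * n_repeats

# Trial counts used in this notebook.
for n_trials in (10, 20, 50, 100, 200):
    print(n_trials, total_fits(n_trials))
```

At 200 trials this comes to 5000 fits, double the 2500 fits of a 100-trial study, which is one reason to narrow the search to a single model first.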

In [None]:
xgb_best_params = study_xgb.best_params
xgb_best_params['objective'] ='reg:squarederror'
xgb_best_params['eval_metric'] = 'rmse'
xgb_best_params['random_state'] = 42
xgb_best_params['n_jobs'] = -1

final_model = xgb.XGBRegressor(**xgb_best_params)
final_model.fit(X,y)
y_preds = final_model.predict(df_test)

submit = test[['Item_Identifier','Outlet_Identifier']].copy()
submit['Item_Outlet_Sales'] = y_preds
submit.Item_Outlet_Sales = submit.Item_Outlet_Sales.clip(lower=0) # Clipping lower bound to 0 since sales cannot be < 0
submit.to_csv('submission.csv',index=False)

After numerous trials and loads of feature engineering (adding, removing and modifying features in the dataset), I finally have a solution with a leaderboard RMSE of roughly 1146, putting me at a rank of ***259*** out of a total of ***53k*** registered participants.