```
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

# Boosting Machine on M5 Forecasting Accuracy Dataset

## Background 

The goal of this learning task is to predict daily sales at Walmart, the world's largest company by revenue, based on hierarchical sales data from the past two years.

## Source

The raw dataset can be obtained directly from [Kaggle](https://www.kaggle.com/competitions/m5-forecasting-accuracy). 

In this example, we download the dataset directly from Kaggle using their API. 

For this to work, you must log in to Kaggle and follow [these instructions](https://www.kaggle.com/docs/api) to install your API token on your machine.
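
Before running the download, it can help to confirm that the token is where the Kaggle client looks for it. The snippet below is a minimal sketch, not part of the original notebook: it assumes the token is a file named `kaggle.json` stored in `~/.kaggle` or in the directory given by the `KAGGLE_CONFIG_DIR` environment variable, and the helper name `find_kaggle_token` is hypothetical.

```python
import os
from pathlib import Path


def find_kaggle_token(config_dir=None):
    """Return the path to kaggle.json if it exists, otherwise None.

    Assumption: the token lives in KAGGLE_CONFIG_DIR when that variable
    is set, and in ~/.kaggle otherwise.
    """
    base = config_dir or os.environ.get("KAGGLE_CONFIG_DIR") or Path.home() / ".kaggle"
    token = Path(base) / "kaggle.json"
    return token if token.is_file() else None


token_path = find_kaggle_token()
if token_path is None:
    print("No Kaggle token found; the dataset download will likely fail.")
else:
    print("Kaggle token found at %s" % token_path)
```

If no token is found, revisit the Kaggle API instructions linked above before continuing.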

## Goal

The goal of this notebook is to illustrate how Snap ML's boosting machine can perform Poisson regression and provide best-in-class accuracy when compared to XGBoost and LightGBM.

## Code


In [None]:
cd ../../

In [None]:
CACHE_DIR='cache-dir'

In [None]:
import numpy as np
import pandas as pd
import time
from datasets import M5Forecasting
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from snapml import BoostingMachineRegressor as SnapBoostingMachineRegressor
from sklearn.metrics import mean_poisson_deviance

In [None]:
dataset = M5Forecasting(cache_dir=CACHE_DIR)
X_train, X_test, y_train, y_test = dataset.get_train_test_split()

In [None]:
print("Number of examples: %d" % (X_train.shape[0]))
print("Number of features: %d" % (X_train.shape[1]))

We will train all three boosting frameworks with the Poisson objective.
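
Each model is scored with `mean_poisson_deviance` from scikit-learn. As a rough illustration of what that metric measures, here is a plain NumPy sketch of the standard mean Poisson deviance formula, `2 * mean(y * log(y / mu) - y + mu)`. The function name `poisson_deviance` and the demo values are made up for illustration; the notebook itself uses the scikit-learn implementation.

```python
import numpy as np


def poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance: 2 * mean(y * log(y / mu) - y + mu)."""
    y = np.asarray(y_true, dtype=float)
    mu = np.asarray(y_pred, dtype=float)
    if np.any(y < 0) or np.any(mu <= 0):
        raise ValueError("requires y_true >= 0 and y_pred > 0")
    # y * log(y / mu) is taken as 0 when y == 0 (its limit as y -> 0),
    # so the ratio is replaced by 1 there to keep the log finite.
    ratio = np.where(y > 0, y / mu, 1.0)
    return float(2.0 * np.mean(y * np.log(ratio) - y + mu))


# Made-up counts and predictions, only to show the call.
y_demo = [0, 1, 4]
mu_demo = [0.5, 1.0, 3.0]
print(poisson_deviance(y_demo, mu_demo))
```

The deviance is zero for a perfect prediction and grows as predictions move away from the observed counts, so lower values are better.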

We will use the following parameters for the optimization in all cases:

In [None]:
NUM_ROUND = 100
LEARNING_RATE = 0.5
MAX_DEPTH = 6
NUM_THREADS = 8
LAMBDA_2 = 0.1
MAX_DELTA_STEP = 0.7
MAX_BINS = 256
RANDOM_STATE = 42

In [None]:
df = pd.DataFrame(columns=['poisson_loss'])

#### XGBoost

In [None]:
params_xgb = dict(    
    learning_rate=LEARNING_RATE,
    n_estimators=NUM_ROUND,
    max_depth=MAX_DEPTH,
    reg_lambda = LAMBDA_2,
    max_delta_step = MAX_DELTA_STEP,
    n_jobs = NUM_THREADS,    
    min_child_weight = 0.0,  
    max_bin = MAX_BINS,
    random_state=RANDOM_STATE, 
)

gbr_x = XGBRegressor(objective="count:poisson",                    
                   tree_method='hist',
                   **params_xgb)
                        
gbr_x.fit(X_train, y_train)

# XGBoost Prediction   
score_xgboost = mean_poisson_deviance(y_test, gbr_x.predict(X_test))
    
res_xgboost = pd.Series({'poisson_loss': score_xgboost}, name='xgboost')
df.loc[res_xgboost.name] = res_xgboost  # DataFrame.append was removed in pandas 2.0
print(df)

#### LightGBM

In [None]:
params_lgb = dict(
    learning_rate=LEARNING_RATE,
    n_estimators=NUM_ROUND,
    max_depth=MAX_DEPTH,
    reg_lambda = LAMBDA_2,  # L2 regularization, matching the XGBoost and Snap ML settings
    max_delta_step = MAX_DELTA_STEP,
    n_jobs = NUM_THREADS, 
    min_child_weight = 0.0,
    max_bin = MAX_BINS,
    random_state=RANDOM_STATE,     
    num_leaves = 2**MAX_DEPTH + 1,  # ** is exponentiation; ^ is bitwise XOR in Python
)

gbr_l = LGBMRegressor(objective='poisson',
                      **params_lgb)
                        
gbr_l.fit(X_train, y_train)

# LightGBM Prediction
score_lightgbm = mean_poisson_deviance(y_test, gbr_l.predict(X_test))

res_lightgbm = pd.Series({'poisson_loss': score_lightgbm}, name='lightgbm')
df.loc[res_lightgbm.name] = res_lightgbm  # DataFrame.append was removed in pandas 2.0
print(df)

#### SnapBoost

In [None]:
params_snap = dict(
    learning_rate=LEARNING_RATE,
    num_round=NUM_ROUND,
    max_depth=MAX_DEPTH,
    lambda_l2 = LAMBDA_2,
    max_delta_step = MAX_DELTA_STEP,
    n_jobs = NUM_THREADS,
    use_gpu =  False,
    use_histograms = True,
    hist_nbins = MAX_BINS
)


gbr_s = SnapBoostingMachineRegressor(objective = "poisson",
                                    random_state=RANDOM_STATE, 
                                    **params_snap)
                             
gbr_s.fit(X_train, y_train)

# SnapBoost Prediction    
score_snapml = mean_poisson_deviance(y_test, gbr_s.predict(X_test))

res_snapml = pd.Series({'poisson_loss': score_snapml}, name='snapml')
df.loc[res_snapml.name] = res_snapml  # DataFrame.append was removed in pandas 2.0
print(df)

### Calculate Leaderboard

In [None]:
df = df.sort_values(by='poisson_loss')
df['rank'] = df['poisson_loss'].rank()
df
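
The ranking step can also be read in isolation. The sketch below builds a leaderboard from a dictionary of scores, sorted so that the lowest Poisson deviance comes first. The model names and scores are placeholders, not results from this notebook, and `build_leaderboard` is a hypothetical helper.

```python
import pandas as pd

# Placeholder scores, NOT results produced by this notebook.
demo_scores = {"model_a": 1.30, "model_b": 1.10, "model_c": 1.20}


def build_leaderboard(scores):
    """Sort models by Poisson deviance (lower is better) and attach a rank."""
    board = pd.DataFrame({"poisson_loss": pd.Series(scores, dtype=float)})
    board = board.sort_values(by="poisson_loss")
    # method="min" gives tied models the same (best) rank.
    board["rank"] = board["poisson_loss"].rank(method="min").astype(int)
    return board


print(build_leaderboard(demo_scores))
```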

## Disclaimer

Performance results always depend on the hardware and software environment. 

Information about the environment that was used to run this notebook is provided below:

In [None]:
import utils
environment = utils.get_environment()
for k,v in environment.items():
    print("%20s: %s" % (k, v))
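
The `utils.get_environment()` helper comes from the surrounding repository and its implementation is not shown here. If it is unavailable, a rough stand-in can be assembled from the standard library. This is only a sketch under that assumption: the function names and the set of reported keys are hypothetical and may differ from what `utils.get_environment()` returns.

```python
import os
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def describe_environment():
    """Collect a few basic facts about the machine and the Python runtime."""
    return {
        "platform": platform.platform(),
        "machine": platform.machine(),
        "python_version": sys.version.split()[0],
        "cpu_count": os.cpu_count(),
        "xgboost": package_version("xgboost"),
        "lightgbm": package_version("lightgbm"),
        "snapml": package_version("snapml"),
    }


for key, value in describe_environment().items():
    print("%20s: %s" % (key, value))
```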

## Record Statistics

Finally, we record the environment and performance statistics for analysis outside of this standalone notebook.

In [None]:
import scrapbook as sb
sb.glue("result", {
    'dataset': dataset.name,
    'n_examples_train': X_train.shape[0],
    'n_examples_test': X_test.shape[0],
    'n_features': X_train.shape[1],
    'model': 'BoostingMachineRegressor',
    'score': 'mean_poisson_deviance',    
    'score_xgboost': score_xgboost,
    'score_lightgbm': score_lightgbm,
    'score_snapml': score_snapml,
    **environment,
})