# Kaggle Example: Store Item Demand Forecasting Challenge

Following this example notebook, you'll see how easily you can boost your ML tasks with Upgini. We will enrich a dataset with relevant features and build a better model on top of them.

If you don't have our library yet, you can install it now. You can also install CatBoost, which is used in the last part of this demonstration.

In [1]:
%pip install -Uq upgini catboost

Note: you may need to restart the kernel to use updated packages.


## Prepare the input data

For this demo we will use the train dataset from the [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only). You can download it from [here](https://www.kaggle.com/c/demand-forecasting-kernels-only/data?select=train.csv) or get it from [our repo](https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip).

To speed up the search let's take a random sample.

In [2]:
from os.path import exists
import pandas as pd

df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path)
df = df.sample(n=7_000, random_state=0)
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)
df["date"] = pd.to_datetime(df["date"])
df.head()

| | date | store | item | sales |
|---|---|---|---|---|
| 335813 | 2017-07-14 | 4 | 19 | 56 |
| 630838 | 2015-05-19 | 6 | 35 | 45 |
| 365685 | 2014-05-01 | 1 | 21 | 48 |
| 322781 | 2016-11-06 | 7 | 18 | 85 |
| 151590 | 2013-02-02 | 4 | 9 | 46 |


This dataset contains 5 years of records from 2013 to 2017. Let's split it into the train (2013–2016) and the evaluation (2017) parts.

In [3]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]
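
As a quick sanity check, the sketch below reproduces the same date-based split on a tiny made-up frame and confirms that every row lands in exactly one part and that the train part ends before the test part begins. The `toy` frame and its values are illustrative assumptions, not rows from the competition file.

```python
import pandas as pd

# Hypothetical toy data standing in for the sampled competition frame.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02"]
        ),
        "sales": [10, 12, 11, 13],
    }
)

cutoff = pd.Timestamp("2017-01-01")
toy_train = toy[toy["date"] < cutoff]   # everything strictly before 2017
toy_test = toy[toy["date"] >= cutoff]   # 2017 onwards

# Every row belongs to exactly one part, and train ends before test begins.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["date"].max() < toy_test["date"].min()
print(len(toy_train), len(toy_test))
```

Using `<` for train and `>=` for test on the same cutoff keeps the boundary date from being dropped or counted twice.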

Let's also separate features from targets for future use.

In [4]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]

## Search relevant features with FeaturesEnricher

Next, we will use FeaturesEnricher on the train dataset to find features best suited for this particular target prediction. To do this we need to specify the column containing dates and provide the target to predict. Also, we can specify any number of additional datasets to evaluate the features. We will use our test dataset later to get the evaluation metrics.

In [5]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys={"date": SearchKey.DATE},
    keep_input=True,
    cv=CVType.time_series
)
enricher.fit(train_features, train_target, eval_set=[(test_features, test_target)])

Detected task type: ModelTaskType.REGRESSION


| | Column name | Status | Description |
|---|---|---|---|
| 0 | target | All valid | All values in this column are good to go |
| 1 | date | All valid | All values in this column are good to go |


Running search request with search_id=3bf0c90e-9462-412e-a53e-b0ba0ef39286
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

We found 11 useful feature(s) for you by search keys: ['date']


| | feature_name | shap_value | coverage % | type |
|---|---|---|---|---|
| 0 | item | 0.435446 | 100.0 | CHARACTER |
| 1 | store | 0.17338 | 100.0 | CHARACTER |
| 2 | f_weather_pca_0_94efd18d | 0.09102 | 100.0 | NUMERIC |
| 3 | f_week_sin1_a71d22f6 | 0.024524 | 100.0 | NUMERIC |
| 4 | f_cpi_pca_2_3c36cd6c | 0.016749 | 100.0 | NUMERIC |
| 5 | f_year_cos1_cd165f8c | 0.013541 | 100.0 | NUMERIC |
| 6 | f_c2c_fraud_score_5028232e | 0.009125 | 100.0 | NUMERIC |
| 7 | f_dow_jones_89547e1d | 0.00835 | 100.0 | NUMERIC |
| 8 | f_weather_umap_48_66a91289 | 0.007987 | 100.0 | NUMERIC |
| 9 | f_credit_default_score_05229fa7 | 0.007271 | 100.0 | NUMERIC |


In our case the task is auto-detected as regression, so RMSE is auto-selected as the metric to optimize.
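
For reference, RMSE is the square root of the mean squared difference between predictions and targets. A minimal sketch in plain Python follows; the `rmse` helper and the sample numbers are illustrative assumptions and not part of the Upgini API.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up sales values and predictions, for illustration only.
actual = [56, 45, 48, 85]
predicted = [50, 45, 52, 80]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses weigh more heavily on RMSE than many small ones.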

## Get the features and test them locally

Finally, we can enrich our datasets with the features found and use them in our own ML pipelines. Let's enrich both the train and the test datasets.

In [6]:
enriched_train_features = enricher.transform(train_features)
enriched_test_features = enricher.transform(test_features)
enriched_train_features.head()

74.43151% of the rows are fully duplicated


| | Column name | Status | Description |
|---|---|---|---|
| 0 | date | All valid | All values in this column are good to go |


Running search request with search_id=8ad84636-43e8-4e4d-b767-efec04682335
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

Executing transform step
Done
74.55830% of the rows are fully duplicated


| | Column name | Status | Description |
|---|---|---|---|
| 0 | date | All valid | All values in this column are good to go |


Running search request with search_id=ba3a17a3-d8a6-4ada-ac92-ef4ccca21b7d
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

Executing transform step
Done


| | date | store | item | f_weather_pca_0_94efd18d | f_week_sin1_a71d22f6 | f_cpi_pca_2_3c36cd6c | f_year_cos1_cd165f8c | f_c2c_fraud_score_5028232e | f_dow_jones_89547e1d | f_weather_umap_48_66a91289 | f_credit_default_score_05229fa7 | f_weather_umap_30_98fa4f7d | f_payment_fraud_score_3cae9c42 | f_week_cos1_d3d56d7f |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 630838 | 2015-05-19 | 6 | 35 | -13.630459 | 0.781831 | -24.552701 | -0.82877 | 0.31325 | 18312.390625 | 4.268885 | 0.033271 | 3.372021 | 0.093462 | 0.62349 |
| 365685 | 2014-05-01 | 1 | 21 | -5.923637 | 0.433884 | -27.784719 | -0.618671 | 0.424054 | 16558.869141 | 3.907936 | 0.080416 | 3.313002 | 0.170168 | -0.900969 |
| 322781 | 2016-11-06 | 7 | 18 | 6.533717 | -0.781831 | -6.686327 | 0.704066 | 0.390712 | 17888.279297 | 4.077544 | 0.130224 | 3.399531 | 0.320714 | 0.62349 |
| 151590 | 2013-02-02 | 4 | 9 | 29.567388 | -0.974928 | -43.102328 | 0.749826 | 0.414091 | 14009.790039 | 3.790206 | 0.166972 | 3.52304 | 0.275836 | -0.222521 |
| 572011 | 2014-04-19 | 4 | 32 | 2.09524 | -0.974928 | -27.784719 | -0.444378 | 0.409781 | 16408.539062 | 3.805812 | 0.160551 | 2.945799 | 0.29632 | -0.222521 |


Here we've got 11 extra features in addition to our initial columns. They should improve the quality of our model.

Let's evaluate the SMAPE metric on the train and test datasets using a CatBoost model:

In [7]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric
from sklearn.metrics import make_scorer

model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)
smape_scorer = make_scorer(
    lambda y_true, y_pred: eval_metric(y_true.values, y_pred, "SMAPE")[0], 
    greater_is_better=False
)
smape_scorer.__name__ = "SMAPE"
enricher.calculate_metrics(
    train_features, train_target, 
    eval_set=[(test_features, test_target)],
    estimator=model,
    scoring=smape_scorer
)

| | match_rate | baseline SMAPE | enriched SMAPE | uplift |
|---|---|---|---|---|
| train | 100.0 | -26.015335 | -15.36308 | 10.652255 |
| eval 1 | 100.0 | -25.342502 | -13.347232 | 11.99527 |


The output shows SMAPE values for the train dataset (computed with cross-validation) and for every evaluation dataset we provided. The values are negative because the scorer was created with `greater_is_better=False`, which flips the sign of the metric. There are also match rate values (the percentage of rows enriched with features) and uplift values (the difference between the enriched and the baseline scores, so a positive uplift means a lower SMAPE after enrichment).
Here we can see a strong uplift both on cross-validation and on the out-of-time validation dataset.
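
To make these numbers concrete, here is a minimal sketch of SMAPE and of how the table columns relate to each other. It assumes the common definition 100 * mean(|y - ŷ| / ((|y| + |ŷ|) / 2)), which to our understanding is what CatBoost's `SMAPE` computes; the `smape` helper is hypothetical and written only for illustration. The last lines recompute the uplift for the evaluation set from the two scores reported in the table.

```python
def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        # Treat a 0/0 pair as a perfect prediction (an assumption of this sketch).
        terms.append(0.0 if denom == 0 else abs(t - p) / denom)
    return 100 * sum(terms) / len(terms)


# Made-up values: one prediction is 10 units off, the other is exact.
print(smape([100, 50], [110, 50]))

# Scores as reported in the metrics table (sign-flipped SMAPE).
baseline_eval = -25.342502
enriched_eval = -13.347232
uplift_eval = enriched_eval - baseline_eval
print(round(uplift_eval, 5))  # 11.99527
```

Since the reported scores are sign-flipped, a score closer to zero is better, and a positive uplift corresponds to a lower SMAPE.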

The result after enrichment is much better: SMAPE on the 2017 evaluation set drops from about 25.3 to about 13.3. That's the magic of using our library.