# Kaggle Example: Store Item Demand Forecasting Challenge

In this example notebook you'll see how easily you can improve your ML tasks with Upgini. We will enrich a dataset with relevant features and build a better model on them.

If you haven't installed our library yet, you can do so now. You can also install CatBoost, which is used in the last part of this demonstration.

In [1]:
%pip install -Uq upgini catboost

Note: you may need to restart the kernel to use updated packages.


## Prepare the input data

For this demo we will use the train dataset from [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only). You can download it from [here](https://www.kaggle.com/c/demand-forecasting-kernels-only/data?select=train.csv) or get it from [our repo](https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip).

To speed up the search, let's take a random sample of 7,000 rows.

In [1]:
from os.path import exists
import pandas as pd

df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path)
df = df.sample(n=7_000, random_state=0)
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)
df["date"] = pd.to_datetime(df["date"])
df.sort_values("date", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

| | date | store | item | sales |
|---|---|---|---|---|
| 0 | 2013-01-01 | 10 | 21 | 33 |
| 1 | 2013-01-01 | 5 | 24 | 26 |
| 2 | 2013-01-01 | 3 | 27 | 11 |
| 3 | 2013-01-02 | 9 | 7 | 24 |
| 4 | 2013-01-02 | 6 | 40 | 9 |


This dataset contains 5 years of records from 2013 to 2017. Let's split it into the train (2013–2016) and the evaluation (2017) parts.

In [2]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]
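As a quick sanity check on this kind of date-based split, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration and are not part of the competition data; the point is only that every row lands in exactly one part and that all training dates precede all evaluation dates.

```python
import pandas as pd

# Toy frame standing in for the sampled dataset (values are made up).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02"]),
    "sales": [10, 12, 9, 11],
})

# Same rule as in the cell above: everything before 2017 is train, the rest is test.
cutoff = pd.Timestamp("2017-01-01")
toy_train = toy[toy["date"] < cutoff]
toy_test = toy[toy["date"] >= cutoff]

# Every row lands in exactly one part, and train strictly precedes test in time.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["date"].max() < toy_test["date"].min()
```

Comparing against a `pd.Timestamp` rather than a string makes the boundary explicit, although pandas also accepts the string form used above.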

Let's also separate features from targets for future use.

In [5]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]

## Search relevant features with FeaturesEnricher

Next, we will use `FeaturesEnricher` on the train dataset to find the features best suited to predicting this target. To do this, we need to specify the column containing dates and provide the target to predict. We can also pass any number of additional datasets via `eval_set` to evaluate the features; here we pass the test dataset, which is used later to compute the evaluation metrics.

In [6]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys={"date": SearchKey.DATE},
    keep_input=True,
    cv=CVType.time_series
)
enricher.fit(train_features, train_target, eval_set=[(test_features, test_target)])

Detected task type: ModelTaskType.REGRESSION


| | Column name | Status | Description |
|---|---|---|---|
| 0 | date | All valid | All values in this column are good to go |
| 1 | target | All valid | All values in this column are good to go |


Running search request with search_id=8dde9b44-bf97-4d23-bf93-d4e77e5edbfd
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

[92m[1m
We found 12 useful feature(s) for you by search keys: ['date'][0m


| | feature_name | shap_value | coverage % | type |
|---|---|---|---|---|
| 0 | item | 0.434272 | 100.0 | CHARACTER |
| 1 | store | 0.16374 | 100.0 | CHARACTER |
| 2 | f_weather_pca_0_94efd18d | 0.097509 | 100.0 | NUMERIC |
| 3 | f_year_cos1_cd165f8c | 0.016779 | 100.0 | NUMERIC |
| 4 | f_payment_fraud_score_3cae9c42 | 0.015332 | 100.0 | NUMERIC |
| 5 | f_week_sin1_a71d22f6 | 0.014975 | 100.0 | NUMERIC |
| 6 | f_week_cos1_d3d56d7f | 0.012109 | 100.0 | NUMERIC |
| 7 | f_c2c_fraud_score_5028232e | 0.010888 | 100.0 | NUMERIC |
| 8 | f_cpi_pca_2_3c36cd6c | 0.010555 | 100.0 | NUMERIC |
| 9 | f_finance_umap_0_ad818bcb | 0.008208 | 100.0 | NUMERIC |


In our case the task is auto-detected as a regression. Hence the metric to optimize is auto-selected as RMSE.
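The enricher above was created with `cv=CVType.time_series`. How Upgini builds its folds internally is not shown in this notebook, so the following is only a rough illustration of the general idea of time-series cross-validation, using scikit-learn's `TimeSeriesSplit` as a stand-in. The array `X` is a made-up placeholder for ten time-ordered rows.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten time-ordered observations (row indices stand in for sorted dates).
X = np.arange(10).reshape(-1, 1)

# Expanding-window splits: each fold trains on the past and validates on the next block.
splitter = TimeSeriesSplit(n_splits=3)
folds = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in splitter.split(X)
]

for train_idx, val_idx in folds:
    # Validation rows always come after every training row, so no future data leaks in.
    print(f"train={train_idx} validate={val_idx}")
```

Unlike shuffled k-fold, this scheme never validates on rows that precede the training data, which is why it suits forecasting tasks like this one.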

## Get the features and test them locally

Finally, we can enrich our datasets with the features found and use them in our own ML pipelines. Let's enrich both the train and the test datasets.

In [7]:
enriched_train_features = enricher.transform(train_features)
enriched_test_features = enricher.transform(test_features)
enriched_train_features.head()

74.43151% of the rows are fully duplicated


| | Column name | Status | Description |
|---|---|---|---|
| 0 | date | All valid | All values in this column are good to go |


Running search request with search_id=c75b81d6-6f9d-40a0-b1bb-1b0d636452d6
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

Executing transform step
Done
74.55830% of the rows are fully duplicated


| | Column name | Status | Description |
|---|---|---|---|
| 0 | date | All valid | All values in this column are good to go |


Running search request with search_id=363c8988-f548-4850-b865-7770bf946ede
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

Executing transform step
Done


| | date | store | item | f_weather_pca_0_94efd18d | f_year_cos1_cd165f8c | f_payment_fraud_score_3cae9c42 | f_week_sin1_a71d22f6 | f_week_cos1_d3d56d7f | f_c2c_fraud_score_5028232e | f_cpi_pca_2_3c36cd6c | f_finance_umap_0_ad818bcb | f_credit_default_score_05229fa7 | f_italy_match_cnt_fdb09b71 | f_finance_umap_1_15890450 | f_weather_umap_30_98fa4f7d |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | 10 | 21 | 28.661328 | 0.98522 | 0.232837 | 0.781831 | 0.62349 | 0.369604 | -33.814365 | 10.026849 | 0.118754 | 0 | 9.95028 | 3.547175 |
| 1 | 2013-01-01 | 5 | 24 | 28.661328 | 0.98522 | 0.232837 | 0.781831 | 0.62349 | 0.369604 | -33.814365 | 10.026849 | 0.118754 | 0 | 9.95028 | 3.547175 |
| 2 | 2013-01-01 | 3 | 27 | 28.661328 | 0.98522 | 0.232837 | 0.781831 | 0.62349 | 0.369604 | -33.814365 | 10.026849 | 0.118754 | 0 | 9.95028 | 3.547175 |
| 3 | 2013-01-02 | 9 | 7 | 28.79589 | 0.982126 | 0.115787 | 0.974928 | -0.222521 | 0.277366 | -33.814365 | 10.075461 | 0.050849 | 0 | 9.880929 | 3.400228 |
| 4 | 2013-01-02 | 6 | 40 | 28.79589 | 0.982126 | 0.115787 | 0.974928 | -0.222521 | 0.277366 | -33.814365 | 10.075461 | 0.050849 | 0 | 9.880929 | 3.400228 |


Here, we've got 12 extra features in addition to our initial columns. They should improve the quality of our model.
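To get a feel for what some of these columns may contain, note that the `f_week_sin1_…` and `f_week_cos1_…` values in the table above look consistent with a standard cyclical encoding of the day of the week. This is our reading of the numbers, not documented Upgini behavior, and the formula below (Monday = 0, period of 7 days) is an assumption.

```python
import numpy as np
import pandas as pd

# The two dates that appear in the first rows of the enriched table.
dates = pd.Series(pd.to_datetime(["2013-01-01", "2013-01-02"]))

# Assumed encoding: position in the week mapped onto the unit circle (Monday = 0).
angle = 2 * np.pi * dates.dt.dayofweek / 7
week_sin = np.sin(angle)
week_cos = np.cos(angle)

print(pd.DataFrame({
    "date": dates,
    "week_sin": week_sin.round(6),
    "week_cos": week_cos.round(6),
}))
```

An encoding like this keeps Sunday and Monday close to each other in feature space, which a plain 0 to 6 integer would not.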

Let's evaluate the SMAPE metric on the train and test datasets using a CatBoost model:

In [9]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric
from sklearn.metrics import make_scorer

model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)
smape_scorer = make_scorer(
    lambda y_true, y_pred: eval_metric(y_true.values, y_pred, "SMAPE")[0], 
    greater_is_better=False
)
smape_scorer.__name__ = "SMAPE"
enricher.calculate_metrics(
    train_features, train_target, 
    eval_set=[(test_features, test_target)],
    estimator=model,
    scoring=smape_scorer
)

| | match_rate | baseline SMAPE | enriched SMAPE | uplift |
|---|---|---|---|---|
| train | 100.0 | -26.577846 | -16.24334 | 10.334506 |
| eval 1 | 100.0 | -25.713841 | -14.510655 | 11.203186 |


In the output you see SMAPE values for the train dataset (computed with cross-validation) and for every evaluation dataset we provided. The values are negative because the scorer was created with `greater_is_better=False`, which flips the sign, so a value closer to zero means a better fit. There are also match rate values (the percentage of rows enriched with features) and uplift values (the difference between the enriched and the baseline SMAPE, in percentage points).
Here we can see a strong uplift both on cross-validation and on the out-of-time validation dataset.
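For readers unfamiliar with the metric, here is a minimal NumPy sketch of SMAPE and of the uplift arithmetic. The helper `smape` is hypothetical and uses the common definition (absolute error divided by the mean of the absolute actual and predicted values, in percent), which to our understanding is what CatBoost's `"SMAPE"` metric computes. The uplift lines simply reproduce the numbers in the table above.

```python
import numpy as np


def smape(y_true, y_pred):
    """Symmetric MAPE in percent (hypothetical helper, common definition).

    Assumes no row has both the actual and the predicted value equal to zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return 100 * np.mean(np.abs(y_true - y_pred) / denom)


# Toy example: errors of 10 on ~100 and 20 on ~200 give roughly 10% SMAPE.
example = smape([100, 200], [110, 180])

# Uplift from the table: enriched score minus baseline score (both sign-flipped).
train_uplift = -16.24334 - (-26.577846)
eval_uplift = -14.510655 - (-25.713841)

print(round(example, 4), round(train_uplift, 6), round(eval_uplift, 6))
```

Because both scores carry a flipped sign, a positive uplift means the enriched model has the lower SMAPE.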

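The result after enrichment is much better: SMAPE drops from about 26.6 to 16.2 on the train dataset and from about 25.7 to 14.5 on the 2017 evaluation dataset. That's the magic of using our library.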