In [9]:
%pip install -Uq upgini catboost


Prepare input data

In [10]:
from os.path import exists
import pandas as pd

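# Use the local copy if present; otherwise download it from the Upgini repository.
df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path)
# Work with a reproducible 19,000-row sample.
df = df.sample(n=19_000, random_state=0)
# Treat store and item identifiers as strings rather than numbers.
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)

df["date"] = pd.to_datetime(df["date"])

# Order rows chronologically and rebuild the index.
df.sort_values("date", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()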


index,date,store,item,sales
0,2013-01-01,7,5,5
1,2013-01-01,4,9,19
2,2013-01-01,1,33,37
3,2013-01-01,3,41,14
4,2013-01-01,5,24,26


In [11]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]
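
The split above is purely chronological: rows dated before 2017-01-01 go to train, the rest to test. A minimal sketch of the same idea on a toy frame, with a check that no test date precedes a train date. The `time_split` helper and the toy data are illustrative names, not part of the notebook or of Upgini.

```python
import pandas as pd


def time_split(frame: pd.DataFrame, cutoff: str, date_col: str = "date"):
    """Split a frame into rows strictly before `cutoff` and rows on or after it."""
    cutoff_ts = pd.Timestamp(cutoff)
    before = frame[frame[date_col] < cutoff_ts]
    after = frame[frame[date_col] >= cutoff_ts]
    return before, after


# Toy data standing in for the sales frame (values are made up for illustration).
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02"]),
        "sales": [10, 12, 9, 11],
    }
)
toy_train, toy_test = time_split(toy, "2017-01-01")

# Every row lands in exactly one part, and the parts do not overlap in time.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["date"].max() < toy_test["date"].min()
```

Keeping the comparison strict on one side (`<`) and inclusive on the other (`>=`) guarantees that a row dated exactly on the cutoff is never counted twice.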


In [12]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]

Enrich Features

In [13]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

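# Search external data sources by date, validating with a time-series split.
enricher = FeaturesEnricher(
    search_keys={
        "date": SearchKey.DATE,
    },
    cv=CVType.time_series,
)
# The held-out 2017 rows are passed as an evaluation set.
enricher.fit(
    train_features,
    train_target,
    eval_set=[(test_features, test_target)],
)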


Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IPv4 to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history
Detected task type: ModelTaskType.REGRESSION



Column name,Status,Errors
target,All valid,-
date,All valid,-



Running search request, search_id=716b43e6-1232-478b-9bb4-6ec62b5636cd
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

[92m[1m
24 relevant feature(s) found with the search keys: ['date'][0m


Provider,Source,Feature name,SHAP value,Coverage %,Type,Feature type
,,item,0.590452,100.0,categorical,
,,store,0.200864,100.0,categorical,
Upgini,Weather & climate normals data,f_weather_date_weather_umap_31_fa6d9a99,0.039431,100.0,numerical,Trial
Upgini,Calendar data,f_events_date_week_sin1_847b5db1,0.032182,100.0,numerical,Free
Upgini,Weather & climate normals data,f_weather_date_weather_umap_48_b39cd0c4,0.030379,100.0,numerical,Trial
Upgini,Calendar data,f_events_date_year_cos1_9014a856,0.030034,100.0,numerical,Free
Upgini,Weather & climate normals data,f_weather_date_weather_pca_0_d7e0a1fc,0.012436,100.0,numerical,Trial
Upgini,Weather & climate normals data,f_weather_date_weather_umap_34_c3ef5b4f,0.010964,100.0,numerical,Trial
Upgini,Weather & climate normals data,f_weather_date_weather_umap_47_5123ef0a,0.009575,100.0,numerical,Trial
Upgini,Calendar data,f_events_date_week_cos3_7525fe31,0.009001,100.0,numerical,Free


We detected 113 outliers in your sample.
Examples of outliers with maximum value of target:
84    205
47    196
38    187
Name: target, dtype: int64
Outliers will be excluded during the metrics calculation.
Before dropping target outliers size: 19000
After dropping target outliers size: 18887
Calculating accuracy uplift after enrichment...

which makes metrics between the train and eval_set incomparable.
[92m[1m
Quality metrics[0m


Dataset,Rows,Baseline mean_squared_error,Enriched mean_squared_error,Uplift
Train,15148.0,309.990498,190.800456,119.190041
Eval 1,3739.0,509.28974,371.455111,137.834629
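
In the table above, the Uplift column matches the difference between the baseline and the enriched error. MSE is a lower-is-better metric, so a positive uplift means the enriched model is more accurate. A small sketch that recomputes the column from the reported numbers and also derives a relative improvement; the `uplift` helper and the relative figure are additions for illustration, not Upgini output.

```python
def uplift(baseline: float, enriched: float) -> tuple[float, float]:
    """Return (absolute, relative) error reduction for a lower-is-better metric."""
    absolute = baseline - enriched
    relative = absolute / baseline
    return absolute, relative


# Values copied from the quality metrics table above.
reported = {
    "Train": (309.990498, 190.800456),
    "Eval 1": (509.28974, 371.455111),
}

for name, (baseline_mse, enriched_mse) in reported.items():
    absolute, relative = uplift(baseline_mse, enriched_mse)
    print(f"{name}: uplift={absolute:.6f}, relative={relative:.1%}")
```

The relative figure makes the train and eval rows easier to compare, since their baseline errors differ in scale.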


In [14]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric

model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)

enricher.calculate_metrics(
    train_features,
    train_target,
    eval_set=[(test_features, test_target)],
    estimator=model,
    scoring="mean_absolute_percentage_error",
)

Calculating accuracy uplift after enrichment...
which makes metrics between the train and eval_set incomparable.


Dataset,Rows,Baseline mean_absolute_percentage_error,Enriched mean_absolute_percentage_error,Uplift
Train,15148.0,0.255109,0.153784,0.101325
Eval 1,3739.0,0.270377,0.198137,0.072241


In [15]:
enriched_train_features = enricher.transform(train_features, keep_input=True)
enriched_test_features = enricher.transform(test_features, keep_input=True)
enriched_train_features.head()





Column name,Status,Errors
date,All valid,-



Running search request, search_id=21444a05-012b-4646-9ab7-3c30ef7ed2ab
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

Retrieving selected features from data sources...




Column name,Status,Errors
date,All valid,-



Running search request, search_id=fd6a0eb8-63c2-462e-820d-db0bd37e0294
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

Retrieving selected features from data sources...


index,date,store,item,f_events_date_week_sin1_847b5db1,f_events_date_year_cos1_9014a856,f_events_date_week_cos3_7525fe31,f_financial_date_silver_7d_to_7d_1y_shift_55fa8001,f_financial_date_nasdaq_7d_to_1y_b00bfaa7,f_financial_date_stoxx_043cbcd4,f_economic_date_cbpol_pca_2_33d6e3fc,f_economic_date_cbpol_pca_9_bde660b4,f_events_date_year_sin1_3c44bc64,f_economic_date_cbpol_umap_4_c5ce4e90,f_economic_date_cci_pca_3_10646e17,f_economic_date_cpi_umap_7_20d961e2
0,2013-01-01,7,5,0.781831,0.98522,-0.900969,1.072025,1.006665,278.779999,-0.938709,-0.332055,0.171293,4.006054,-1.962578,12.812381
1,2013-01-01,4,9,0.781831,0.98522,-0.900969,1.072025,1.006665,278.779999,-0.938709,-0.332055,0.171293,4.006054,-1.962578,12.812381
2,2013-01-01,1,33,0.781831,0.98522,-0.900969,1.072025,1.006665,278.779999,-0.938709,-0.332055,0.171293,4.006054,-1.962578,12.812381
3,2013-01-01,3,41,0.781831,0.98522,-0.900969,1.072025,1.006665,278.779999,-0.938709,-0.332055,0.171293,4.006054,-1.962578,12.812381
4,2013-01-01,5,24,0.781831,0.98522,-0.900969,1.072025,1.006665,278.779999,-0.938709,-0.332055,0.171293,4.006054,-1.962578,12.812381
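
The `f_events_date_week_*` columns look like cyclical (Fourier) encodings of the day of week. For 2013-01-01, a Tuesday, sin(2π·1/7) ≈ 0.781831 and cos(3·2π·1/7) ≈ -0.900969, which agrees with the first rows above. The sketch below reproduces those two values under that assumption. It is a guess at the encoding inferred from the numbers, not Upgini's documented implementation, and `weekly_harmonic` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def weekly_harmonic(dates: pd.Series, k: int, kind: str) -> pd.Series:
    """k-th sine or cosine harmonic of the day of week (Monday=0 ... Sunday=6)."""
    angle = 2 * np.pi * k * dates.dt.dayofweek / 7
    return np.sin(angle) if kind == "sin" else np.cos(angle)


dates = pd.Series(pd.to_datetime(["2013-01-01", "2013-01-02"]))
week_sin1 = weekly_harmonic(dates, k=1, kind="sin")
week_cos3 = weekly_harmonic(dates, k=3, kind="cos")

# Compare the first entries with the enriched table (2013-01-01 rows).
print(round(week_sin1.iloc[0], 6), round(week_cos3.iloc[0], 6))
```

Encoding the weekday as sine and cosine harmonics keeps Sunday and Monday close together numerically, which a plain 0 to 6 integer would not.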


In [17]:
model.fit(train_features, train_target)
preds = model.predict(test_features)
eval_metric(test_target.values, preds, "SMAPE")

[37.65141857448004]

In [18]:
model.fit(enriched_train_features, train_target)
enriched_preds = model.predict(enriched_test_features)
eval_metric(test_target.values, enriched_preds, "SMAPE")

[14.771608037567335]
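
SMAPE (symmetric mean absolute percentage error) scales each absolute error by the mean of the absolute actual and predicted values, so it lies between 0 and 200 when expressed in percent, and lower is better. A minimal NumPy sketch of that common definition follows. It is meant to illustrate the metric only; CatBoost's `eval_metric` may differ in details such as weighting or the handling of zero denominators, and `smape` is a name chosen here.

```python
import numpy as np


def smape(actual, predicted) -> float:
    """Symmetric MAPE in percent: 100 * mean(|a - p| / ((|a| + |p|) / 2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    denom = (np.abs(actual) + np.abs(predicted)) / 2
    # Assumption: a pair where both values are zero contributes zero error.
    safe_denom = np.where(denom == 0, 1.0, denom)
    ratio = np.where(denom == 0, 0.0, np.abs(actual - predicted) / safe_denom)
    return float(100 * ratio.mean())


# Toy values for illustration, not taken from the sales data.
print(smape([100, 200], [110, 180]))
```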