1️⃣ Prepare input data

In [1]:
%pip install -Uq upgini catboost


2️⃣ Search new relevant features with FeaturesEnricher

In [12]:
from os.path import exists
import pandas as pd

df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path)
df = df.sample(n=19_000, random_state=0)
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)

df["date"] = pd.to_datetime(df["date"])

df.sort_values("date", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

Unnamed: 0,date,store,item,sales
0,2013-01-01,7,5,5
1,2013-01-01,4,9,19
2,2013-01-01,1,33,37
3,2013-01-01,3,41,14
4,2013-01-01,5,24,26


In [41]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]

In [42]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]

Enrich Features

In [46]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys={
        "date": SearchKey.DATE,
    },
    cv=CVType.time_series,
)
enricher.fit(
    train_features,
    train_target,
    eval_set=[(test_features, test_target)],
)

Detected task type: ModelTaskType.REGRESSION


Column name,Status,Description
date,All valid,All values in this column are good to go
target,All valid,All values in this column are good to go


Running search request with search_id=53fb0bf7-e49f-4b98-90a2-5a0f081701b9
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

28 relevant feature(s) found with the search keys: ['date'].


Unnamed: 0,feature_name,shap_value,coverage %,type
0,item,0.487726,100.0,CHARACTER
1,store,0.172106,100.0,CHARACTER
2,f_weather_pca_0_94efd18d,0.056047,100.0,NUMERIC
3,f_week_sin1_a71d22f6,0.044632,100.0,NUMERIC
4,f_week_cos1_d3d56d7f,0.029552,100.0,NUMERIC
5,f_weather_umap_48_66a91289,0.025132,100.0,NUMERIC
6,f_weather_umap_24_409427e4,0.019315,100.0,NUMERIC
7,f_weather_umap_33_b9760f68,0.014638,100.0,NUMERIC
8,f_year_cos1_cd165f8c,0.012112,100.0,NUMERIC
9,f_dow_jones_89547e1d,0.007461,100.0,NUMERIC


In [54]:
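# Same configuration for a dataset whose date column is named "sales_date".
# A separate variable keeps the fitted `enricher` above available for later cells.
sales_enricher = FeaturesEnricher(
    search_keys={"sales_date": SearchKey.DATE},
    cv=CVType.time_series
)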



In [67]:
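from upgini import ModelTaskType
# Setting the task type explicitly instead of relying on auto-detection.
# A separate variable keeps the fitted `enricher` above available for later cells.
subscription_enricher = FeaturesEnricher(
    search_keys={"subscription_activation_date": SearchKey.DATE},
    model_task_type=ModelTaskType.REGRESSION
)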

In [70]:
import numpy as np
from catboost import Pool, CatBoostRegressor
# initialize data
train_data = np.random.randint(0, 100, size=(100, 10))
train_label = np.random.randint(0, 1000, size=100)
test_data = np.random.randint(0, 100, size=(50, 10))
# initialize Pool
train_pool = Pool(train_data, train_label, cat_features=[0, 2, 5])
test_pool = Pool(test_data, cat_features=[0, 2, 5])

# specify the training parameters
model = CatBoostRegressor(iterations=2, depth=2, learning_rate=1, loss_function="RMSE")
# train the model
model.fit(train_pool)
# make the prediction using the resulting model
preds = model.predict(test_pool)
print(preds)


0:	learn: 292.3990486	total: 847us	remaining: 847us
1:	learn: 288.8401734	total: 2.34ms	remaining: 0us
[596.7818746  462.69141328 511.71789713 338.4496672  511.71789713
 511.71789713 696.45721443 511.71789713 511.71789713 596.7818746
 511.71789713 596.7818746  572.47335993 387.47615105 511.71789713
 511.71789713 511.71789713 511.71789713 462.69141328 511.71789713
 511.71789713 572.47335993 572.47335993 596.7818746  596.7818746
 511.71789713 511.71789713 511.71789713 511.71789713 448.23161385
 511.71789713 511.71789713 572.47335993 572.47335993 696.45721443
 511.71789713 572.47335993 511.71789713 511.71789713 387.47615105
 472.54012852 511.71789713 596.7818746  596.7818746  596.7818746
 511.71789713 596.7818746  511.71789713 572.47335993 511.71789713]


3️⃣ Calculate model metrics and uplift from new relevant features

Any estimator with a scikit-learn-compatible interface can be used. This example uses the CatBoost regressor.
There are two ways to specify the evaluation metric:

---> Use a predefined evaluation function alias from the Upgini library, such as RMSLE for Root Mean Squared Logarithmic Error.

---> Define a custom evaluation function with scikit-learn's make_scorer, for example SMAPE.

The evaluation metric for both the train and validation datasets is calculated with the same cross-validation strategy as FeaturesEnricher.fit(), which in this example is time series CV.
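
For the second option, a custom SMAPE scorer might look like the sketch below. The `smape` function and the `smape_scorer` name are assumptions made for this example, and the formula is one common SMAPE definition expressed in percent, so other libraries may scale it differently.

```python
import numpy as np
from sklearn.metrics import make_scorer


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0 is a perfect fit)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # When both values are zero the error is zero, so avoid dividing 0 by 0.
    safe_denominator = np.where(denominator == 0, 1.0, denominator)
    ratio = np.abs(y_true - y_pred) / safe_denominator
    return 100.0 * float(np.mean(ratio))


# Lower SMAPE is better, so the scorer reports the negated value.
smape_scorer = make_scorer(smape, greater_is_better=False)
```

A scorer built this way could then be passed as the `scoring` argument in place of a predefined alias.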

In [71]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric
model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)

# Calculate metrics before and after enrichment with new relevant features
enricher.calculate_metrics(
    train_features,
    train_target,
    eval_set=[(test_features, test_target)],
    estimator=model,
    scoring="mean_absolute_percentage_error",
)

4️⃣ Enrich datasets with the new features and retrain model

In [None]:
enriched_train_features = enricher.transform(train_features, keep_input=True)
enriched_test_features = enricher.transform(test_features, keep_input=True)
enriched_train_features.head()

BEFORE enrichment with the new features:

In [None]:
model.fit(train_features, train_target)
preds = model.predict(test_features)
eval_metric(test_target.values, preds, "SMAPE")

AFTER enrichment with the new features:

In [None]:
model.fit(enriched_train_features, train_target)
enriched_preds = model.predict(enriched_test_features)
eval_metric(test_target.values, enriched_preds, "SMAPE")
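
To compare the BEFORE and AFTER SMAPE values, one option is to express the change as a relative improvement. The sketch below assumes that definition. `relative_uplift` is a hypothetical helper, the numbers are placeholders rather than results from this notebook, and Upgini may define uplift differently.

```python
def relative_uplift(baseline_error, enriched_error):
    """Relative improvement, in percent, of an error metric where lower is better."""
    if baseline_error == 0:
        raise ValueError("baseline_error must be non-zero")
    return 100.0 * (baseline_error - enriched_error) / baseline_error


# Placeholder SMAPE values, not results from this notebook.
example_uplift = relative_uplift(baseline_error=20.0, enriched_error=15.0)
print(f"uplift: {example_uplift:.1f}%")  # prints "uplift: 25.0%"
```

A positive value means the enriched model has the lower error. A negative value means the added features made the error worse.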