![Upgini logo](https://upgini.com/lib_tHloTHmnvYomfhRQ/1lth2xdahxnz5tr4.svg?w=206)

# Quick Start guide: Search for new relevant external features for product sales forecasting
_________________

Following this guide, you'll learn how to **search for new relevant features with the Upgini library**. We will enrich a dataset with new features and improve model accuracy, all in 4 simple steps.  
The goal is to predict future sales of different goods in stores based on a 5-year sales history. The evaluation metric is SMAPE.  
⏱ Time needed: *15 minutes.*  

Download this notebook: [GitHub Link](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)
_________________

First, let's install the latest version of the Upgini library. We'll also need CatBoost for the last part of this guide.

In [1]:
%pip install -Uq upgini catboost

Note: you may need to restart the kernel to use updated packages.


## 1️⃣ Prepare input data

For this guide we'll use the train dataset from [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only). You can download it from [here](https://www.kaggle.com/c/demand-forecasting-kernels-only/data?select=train.csv).  
To speed up the search, we'll take a subsample.  
⚠️ All date/datetime columns in the input dataset should be converted to pandas datetime objects so that dates are represented correctly.

In [2]:
from os.path import exists
import pandas as pd

df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path).sample(n=19_000, random_state=0)
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)

# Convert date column to datetime pandas object
df["date"] = pd.to_datetime(df["date"])

df.sort_values("date", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

| | date | store | item | sales |
|---|---|---|---|---|
| 0 | 2013-01-01 | 7 | 5 | 5 |
| 1 | 2013-01-01 | 4 | 9 | 19 |
| 2 | 2013-01-01 | 1 | 33 | 37 |
| 3 | 2013-01-01 | 3 | 41 | 14 |
| 4 | 2013-01-01 | 5 | 24 | 26 |
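
Before going further, it can help to confirm that the date column really is a pandas datetime. The snippet below is a minimal sketch on a made-up two-row frame (the `sample` DataFrame and its values are illustrative, not part of the dataset); the explicit `format` argument is an optional assumption that the dates are ISO-formatted.

```python
import pandas as pd

# Illustrative frame with dates stored as plain strings (not the real dataset).
sample = pd.DataFrame({"date": ["2013-01-01", "2017-12-31"], "sales": [5, 19]})

# Convert to pandas datetime; an explicit format makes malformed values fail loudly.
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d")

# Check the dtype before passing the frame to any date-aware tooling.
is_datetime = pd.api.types.is_datetime64_any_dtype(sample["date"])
print(is_datetime)
```

The same check can be run on `df["date"]` right after the conversion in the cell above.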


This dataset contains 5 years of records from 2013 to 2017. Let's split it into the train (2013–2016) and the evaluation (2017) parts.

In [3]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]

Let's also separate features from targets in *a scikit-learn style* (X and y).

In [4]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]

In [5]:
LIMIT = 400 # free version's limitation
train_features = train_features[:LIMIT]
train_target = train_target[:LIMIT]
test_features = test_features[:LIMIT // 10]
test_target = test_target[:LIMIT // 10]

## 2️⃣ Search new relevant features with FeaturesEnricher

Next, we will use **`FeaturesEnricher`** on the train dataset to find new features relevant for predicting this target.  
* To do this, we need to specify the column(s) containing [**search key(s)**](https://github.com/upgini/upgini#-search-key-types-we-support-more-to-come) (in this case, `date`) and provide the target to predict.  
* We can also specify any number of additional out-of-time validation datasets to evaluate the robustness of the new features.  
* The search task will be auto-detected as regression. Because this is a time series prediction (daily sales as the target variable), we have to pass the [**time series specific cross-validation split**](https://github.com/upgini/upgini#-time-series-prediction-support) **`CVType.time_series`**. The search algorithm then knows that this is a time series prediction task rather than a plain regression, and uses [time series CV](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) when searching for new features.  

The search step will take around *2.5 minutes*.
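
To see what a time series split means in practice, here is a small sketch using scikit-learn's `TimeSeriesSplit` on twelve dummy rows. It only illustrates the general idea (each fold validates on rows that come after its training rows); the exact folds Upgini builds for `CVType.time_series` may differ.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve dummy rows, assumed to be already sorted by date.
X = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))

for fold, (train_idx, valid_idx) in enumerate(folds):
    # Training rows always precede validation rows, so no future data leaks in.
    print(
        f"fold {fold}: train rows {train_idx.min()}-{train_idx.max()}, "
        f"validate rows {valid_idx.min()}-{valid_idx.max()}"
    )
```

Unlike a shuffled K-fold split, the training window only grows forward in time, which is why the input should be sorted by date first.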

In [6]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys = {
      "date": SearchKey.DATE,
    },
    cv = CVType.time_series,
)
enricher.fit(
  train_features,
  train_target,
  eval_set=[(test_features, test_target)]
)

Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IP to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history


Detected task type: ModelTaskType.REGRESSION. Reason: date search key is present, treating as regression
You can set task type manually with argument `model_task_type` of FeaturesEnricher constructor if task type detected incorrectly





| Column name | Status | Errors |
|---|---|---|
| target | All valid | - |
| date | All valid | - |


please update with “%pip install -U upgini” to the latest 1.2.31 and restart Jupyter kernel


Running search request, search_id=cf09f2e9-f832-49fd-aedf-826b207bc6f4
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
2 relevant feature(s) found with the search keys: ['date']


| Feature name | SHAP value | Coverage % | Value preview | Provider | Source | Updates |
|---|---|---|---|---|---|---|
| f_autofe_mul_028109ec | 1.8778 | 100.0 | 282.8929, 287.717, 290.0917 | Upgini | AutoFE: features from Markets data | Daily |
| f_financial_date_snp500_daydiff_15190f0a | 0.3142 | 100.0 | -4.74, 0.3, 15.0601 | Upgini | Markets data | Daily |


| Provider | Source | All features SHAP | Number of relevant features |
|---|---|---|---|
| Upgini | AutoFE: features from Markets data | 1.8778 | 1 |
| Upgini | Markets data | 0.3142 | 1 |


| Sources | Feature name | Feature 1 | Feature 2 | Function |
|---|---|---|---|---|
| Markets data | f_autofe_mul_028109ec | f_financial_date_crude_oil_1d_to_7d_471b3b15 | f_financial_date_stoxx_043cbcd4 | `*` |


We've got **2 new relevant features**, both derived from Markets data (one of them generated by AutoFE as the product of two market features), which are expected to improve the accuracy of the model. Other [connected sources include weather data, calendar data, and financial data](https://github.com/upgini/upgini#-connected-data-sources-and-coverage). Features are ranked by [SHAP values](https://en.wikipedia.org/wiki/Shapley_value).

The initial features from the search dataset are checked for relevance as well, so you don't need an extra feature selection step.

## 3️⃣ Calculate model metrics and uplift from new relevant features
You can use any model estimator with scikit-learn compatible interface. Let's take CatBoost regressor.  
For the evaluation metric there are two options:
* Use a predefined evaluation function alias from the [*Upgini library*](https://github.com/upgini/upgini#-accuracy-and-uplift-metrics-calculations), like **`RMSLE`** for Root Mean Squared Logarithmic Error
* Define a custom evaluation function using [scikit-learn make_scorer](https://scikit-learn.org/0.15/modules/model_evaluation.html#defining-your-scoring-strategy-from-score-functions), for example [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error)

The model evaluation metric for both the train and validation datasets is calculated with the same cross-validation strategy as in **`FeaturesEnricher.fit()`**, in this example [time series CV](https://github.com/upgini/upgini#-time-series-prediction-support).
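
As an illustration of the second option, a custom SMAPE scorer could look like the sketch below. The `smape` function and `smape_scorer` name are our own illustrative choices; the formula follows the common percentage form (0 to 200), and treating a 0/0 term as zero error is an assumption, not something the library prescribes.

```python
import numpy as np
from sklearn.metrics import make_scorer


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, on a 0-200 scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Assumption: when both actual and predicted are 0, count the error as 0.
    ratio = np.divide(
        np.abs(y_true - y_pred), denom, out=np.zeros_like(denom), where=denom != 0
    )
    return 100.0 * ratio.mean()


# Lower SMAPE is better, so tell scikit-learn to negate it when scoring.
smape_scorer = make_scorer(smape, greater_is_better=False)

print(round(smape([100, 200], [110, 180]), 3))
```

A scorer built this way is what you would pass as the `scoring` argument instead of a predefined alias.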

In [7]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric
model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)

# Calculate metrics before and after enrichment with the new relevant features
enricher.calculate_metrics(
    estimator = model,
    scoring = "mean_absolute_percentage_error"
)

Calculating accuracy uplift after enrichment...

| | Dataset type | Rows | Mean target | Baseline mean_absolute_percentage_error | Enriched mean_absolute_percentage_error | Uplift |
|---|---|---|---|---|---|---|
| 0 | Train | 400 | 29.5075 | 0.373 ± 0.095 | 0.380 ± 0.087 | -0.006277 |
| 1 | Eval 1 | 40 | 34.775 | 0.300 ± 0.054 | 0.279 ± 0.059 | 0.021127 |


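**After enrichment**, the metric improves on the out-of-time validation dataset (*Eval 1*): MAPE drops from 0.300 to 0.279. On the cross-validation (*Train*) the enriched score is marginally worse (0.380 vs. 0.373), a difference well within the reported standard deviation.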
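
To read the Uplift column, note that the reported values are consistent with a simple difference between the baseline and the enriched error. The sketch below reproduces that arithmetic from the rounded numbers in the table; the `uplift` helper is our own illustration, and the small mismatch with the table comes from the library working with unrounded scores.

```python
def uplift(baseline_error, enriched_error):
    """Positive when the enriched model has the lower error (assumed definition)."""
    return baseline_error - enriched_error


# Rounded MAPE values copied from the metrics table above.
rows = {"Train": (0.373, 0.380), "Eval 1": (0.300, 0.279)}

for name, (baseline, enriched) in rows.items():
    print(f"{name}: uplift ~ {uplift(baseline, enriched):+.3f}")
```

A positive value means the new features reduced the error on that dataset; a negative one means they did not.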

## 4️⃣ Enrich datasets with new features and retrain model

Now we can enrich our datasets with the features found and use them in our own ML pipelines. Let's enrich both the train and the test datasets.  
The enrichment step for the two datasets will take about *2.5 minutes*.

In [8]:
enriched_train_features = enricher.transform(train_features, keep_input = True)
enriched_test_features = enricher.transform(test_features, keep_input = True)
enriched_train_features.head()

You use Trial access to Upgini data enrichment. Limit for Trial: 1000 rows. You have already enriched: 179 rows.
Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IP to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history




| Column name | Status | Errors |
|---|---|---|
| date | All valid | - |



Running search request, search_id=7ef3b53f-ae39-4a35-ac9f-e2ded300e826
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Retrieving selected features from data sources...


You use Trial access to Upgini data enrichment. Limit for Trial: 1000 rows. You have already enriched: 218 rows.
Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IP to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history




| Column name | Status | Errors |
|---|---|---|
| date | All valid | - |



Running search request, search_id=2672b5ab-a0ae-476b-8997-61ea5c02c6fc
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Retrieving selected features from data sources...


| | date | store | item | datetime_day_in_quarter_sin | datetime_day_in_quarter_cos | f_autofe_mul_028109ec | f_financial_date_snp500_daydiff_15190f0a |
|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | 7 | 5 | 0.069756 | 0.997564 | 282.320305 | 23.759888 |
| 1 | 2013-01-01 | 4 | 9 | 0.069756 | 0.997564 | 282.320305 | 23.759888 |
| 2 | 2013-01-01 | 1 | 33 | 0.069756 | 0.997564 | 282.320305 | 23.759888 |
| 3 | 2013-01-01 | 3 | 41 | 0.069756 | 0.997564 | 282.320305 | 23.759888 |
| 4 | 2013-01-01 | 5 | 24 | 0.069756 | 0.997564 | 282.320305 | 23.759888 |


We now have the new features and are ready to retrain the model.  
**BEFORE** enrichment with the new features:

In [9]:
model.fit(train_features, train_target)
preds = model.predict(test_features)
eval_metric(test_target.values, preds, "SMAPE")

[29.78405634120717]

**AFTER** enrichment:

In [10]:
model.fit(enriched_train_features, train_target)
enriched_preds = model.predict(enriched_test_features)
eval_metric(test_target.values, enriched_preds, "SMAPE")

[27.679888249190476]
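
To put the two SMAPE values side by side, the short sketch below computes the absolute and relative improvement. The two numbers are copied from the cell outputs above; the variable names are ours.

```python
# SMAPE values copied from the BEFORE and AFTER cell outputs.
baseline_smape = 29.78405634120717
enriched_smape = 27.679888249190476

absolute_gain = baseline_smape - enriched_smape
relative_gain = absolute_gain / baseline_smape

print(f"SMAPE reduced by {absolute_gain:.2f} points ({relative_gain:.1%} relative)")
# prints: SMAPE reduced by 2.10 points (7.1% relative)
```

Keep in mind that this comparison is based on only 40 test rows, so the exact size of the gain may vary with a different sample.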

______________________________
Thanks for reading! If you found this useful or interesting, please share with a friend.
______________________________
## 🔗 Useful links
* Upgini Library [Documentation](https://github.com/upgini/upgini#readme)
* More [Notebooks and Guides](https://github.com/upgini/upgini#briefcase-use-cases)
* Kaggle public [Notebooks](https://www.kaggle.com/romaupgini/code)


<sup>😔 Found a typo or a bug in a code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
Please report it here.</a></sup>