# <div style="color:white;display:fill;border-radius:5px;background-color:#9DCDD1;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;font-size:100%;letter-spacing:0.5px;margin:0"><b>1 | About</b></div>

Sales forecasting and data enrichment model.

## <div style="color:white;display:fill;border-radius:5px;background-color:#9DCDD1;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;font-size:100%;letter-spacing:0.5px;margin:0"><b>2 | Data overview</b></div>
- Tabular data (.csv)
- 5 years' worth of product sales data
- 4 columns
  - ["date", "store", "item", "sales"]
- Limited information: apart from the store and item identifiers, only the date and past sales are available for the model to learn how to successfully predict future sales
- Goal: forecast product sales for the next 3 months

## <div style="color:white;display:fill;border-radius:5px;background-color:#9DCDD1;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;font-size:100%;letter-spacing:0.5px;margin:0"><b>3 | Stack</b></div>

- CatBoost: state-of-the-art gradient boosting on decision trees
- Upgini: data enrichment when the available data is limited
  - automatically searches thousands of public data sources for the features most relevant to the project


## <div style="color:white;display:fill;border-radius:5px;background-color:#9DCDD1;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;font-size:100%;letter-spacing:0.5px;margin:0"><b>4 | Exploratory Data Analysis</b></div>

In [None]:
!pip install -q upgini catboost

In [2]:
from os.path import exists
import pandas as pd

df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path)
df = df.sample(n=19_000, random_state=0)
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)

# Convert date column to datetime pandas object
df["date"] = pd.to_datetime(df["date"])

df.sort_values("date", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

| | date | store | item | sales |
|---|---|---|---|---|
| 0 | 2013-01-01 | 7 | 5 | 5 |
| 1 | 2013-01-01 | 4 | 9 | 19 |
| 2 | 2013-01-01 | 1 | 33 | 37 |
| 3 | 2013-01-01 | 3 | 41 | 14 |
| 4 | 2013-01-01 | 5 | 24 | 26 |


## <div style="color:white;display:fill;border-radius:5px;background-color:#9DCDD1;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;font-size:100%;letter-spacing:0.5px;margin:0"><b>5 | Training</b></div>

### <b><span style='color:#58A2A8'>5.1</span> | Creating train and test splits</b>

In [3]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]

In [4]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]

### <b><span style='color:#58A2A8'>5.2</span> | Data enrichment</b>

In [5]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys = {
      "date": SearchKey.DATE,
    },
    cv = CVType.time_series
)
enricher.fit(train_features,
             train_target,
             eval_set=[(test_features, test_target)]
)

Detected task type: ModelTaskType.REGRESSION


| Column name | Status | Description |
|---|---|---|
| target | All valid | All values in this column are good to go |
| date | All valid | All values in this column are good to go |


Running search request with search_id=46f93703-d8be-45d2-a0df-27322a181932
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

28 relevant feature(s) found with the search keys: ['date'].


| | feature_name | shap_value | coverage % | type |
|---|---|---|---|---|
| 0 | item | 0.487726 | 100.0 | CHARACTER |
| 1 | store | 0.172106 | 100.0 | CHARACTER |
| 2 | f_weather_pca_0_94efd18d | 0.056047 | 100.0 | NUMERIC |
| 3 | f_week_sin1_a71d22f6 | 0.044632 | 100.0 | NUMERIC |
| 4 | f_week_cos1_d3d56d7f | 0.029552 | 100.0 | NUMERIC |
| 5 | f_weather_umap_48_66a91289 | 0.025132 | 100.0 | NUMERIC |
| 6 | f_weather_umap_24_409427e4 | 0.019315 | 100.0 | NUMERIC |
| 7 | f_weather_umap_33_b9760f68 | 0.014638 | 100.0 | NUMERIC |
| 8 | f_year_cos1_cd165f8c | 0.012112 | 100.0 | NUMERIC |
| 9 | f_dow_jones_89547e1d | 0.007461 | 100.0 | NUMERIC |
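
Several of the selected features (`f_week_sin1_…`, `f_week_cos1_…`, `f_year_cos1_…`) look like cyclical encodings of the date. The notebook does not say how Upgini computes them, so the following is only a sketch of the general technique, assuming a plain sine/cosine transform of the day of the week. `weekly_cycle` is a hypothetical helper, not part of the Upgini API.

```python
import math
from datetime import date


def weekly_cycle(d: date) -> tuple[float, float]:
    """Map a date's day of week (Monday=0 ... Sunday=6) onto the unit circle."""
    angle = 2 * math.pi * d.weekday() / 7
    return math.sin(angle), math.cos(angle)


# 2013-01-01 was a Tuesday, so weekday() == 1.
sin_w, cos_w = weekly_cycle(date(2013, 1, 1))
print(round(sin_w, 6), round(cos_w, 6))  # → 0.781831 0.62349
```

These two values agree with the `f_week_sin1` and `f_week_cos1` columns shown for 2013-01-01 in the enriched output of section 5.3, which is consistent with this assumption but does not prove it. The point of the encoding is that Sunday and Monday end up close together on the circle, which a raw 0 to 6 integer would not express.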


### <b><span style='color:#58A2A8'>5.3</span> | Model creation</b>

In [6]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric
model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)

# Calculate metrics before and after enrichment with new relevant features
enricher.calculate_metrics(
    train_features, train_target, 
    eval_set = [(test_features, test_target)],
    estimator = model,
    scoring = "mean_absolute_percentage_error"
)

Calculating metrics...
Done


| | match_rate | baseline mean_absolute_percentage_error | enriched mean_absolute_percentage_error | uplift |
|---|---|---|---|---|
| train | 100.0 | 0.255844 | 0.16662 | 0.089224 |
| eval 1 | 100.0 | 0.243877 | 0.13113 | 0.112746 |
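
The `uplift` column appears to be the baseline error minus the enriched error (for example, 0.255844 − 0.16662 = 0.089224 on the train row). Below is a small sketch of that arithmetic, together with a plain-Python MAPE on made-up numbers. The helper names `mape` and `uplift` are illustrative and are not Upgini functions.

```python
def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.1 == 10%)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def uplift(baseline_error, enriched_error):
    """Error reduction from enrichment; positive means the enriched model is better."""
    return baseline_error - enriched_error


# Toy values, not taken from the dataset.
print(mape([10, 20, 40], [9, 22, 40]))

# Recompute the uplift column from the rounded errors in the table above.
print(round(uplift(0.255844, 0.16662), 6))  # train row
print(round(uplift(0.243877, 0.13113), 6))  # eval 1 row
```

The recomputed eval value can differ from the table in the last digit because the table shows errors that were already rounded.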


In [7]:
enriched_train_features = enricher.transform(train_features, keep_input = True)
enriched_test_features = enricher.transform(test_features, keep_input = True)
enriched_train_features.head()

90.39637% of the rows are fully duplicated


| Column name | Status | Description |
|---|---|---|
| date | All valid | All values in this column are good to go |


Running search request with search_id=7248c24e-3de2-4444-ac74-2f29ba2bbc0a
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

Collecting selected features...
Done
90.36176% of the rows are fully duplicated


| Column name | Status | Description |
|---|---|---|
| date | All valid | All values in this column are good to go |


Running search request with search_id=f418d6bd-e742-4a2a-b4b1-9fb804b77a7b
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

Collecting selected features...
Done


| | date | f_cbpol_pca_3_2e94b9bf | f_cbpol_umap_1_34dc2149 | f_cbpol_umap_6_f175da9a | f_cpi_pca_5_db7798a3 | f_dow_jones_7d_to_7d_1y_shift_9628c89b | f_dow_jones_89547e1d | f_finance_umap_3_424d51ca | f_italy_game_cnt_9cfcfe65 | f_mlending_approve_score_d4c33397 | ... | f_weather_umap_34_39fc3e94 | f_weather_umap_35_436c04a5 | f_weather_umap_43_4e9820c4 | f_weather_umap_45_b348f420 | f_weather_umap_48_66a91289 | f_week_cos1_d3d56d7f | f_week_sin1_a71d22f6 | f_year_cos1_cd165f8c | item | store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | -0.323471 | 4.815701 | 1.367325 | -8.943169 | 1.065267 | 13104.139648 | 7.647812 | 0 | 0.338412 | ... | 5.664261 | 4.76773 | 5.079482 | 4.923654 | 4.540985 | 0.62349 | 0.781831 | 0.98522 | 5 | 7 |
| 1 | 2013-01-01 | -0.323471 | 4.815701 | 1.367325 | -8.943169 | 1.065267 | 13104.139648 | 7.647812 | 0 | 0.338412 | ... | 5.664261 | 4.76773 | 5.079482 | 4.923654 | 4.540985 | 0.62349 | 0.781831 | 0.98522 | 9 | 4 |
| 2 | 2013-01-01 | -0.323471 | 4.815701 | 1.367325 | -8.943169 | 1.065267 | 13104.139648 | 7.647812 | 0 | 0.338412 | ... | 5.664261 | 4.76773 | 5.079482 | 4.923654 | 4.540985 | 0.62349 | 0.781831 | 0.98522 | 33 | 1 |
| 3 | 2013-01-01 | -0.323471 | 4.815701 | 1.367325 | -8.943169 | 1.065267 | 13104.139648 | 7.647812 | 0 | 0.338412 | ... | 5.664261 | 4.76773 | 5.079482 | 4.923654 | 4.540985 | 0.62349 | 0.781831 | 0.98522 | 41 | 3 |
| 4 | 2013-01-01 | -0.323471 | 4.815701 | 1.367325 | -8.943169 | 1.065267 | 13104.139648 | 7.647812 | 0 | 0.338412 | ... | 5.664261 | 4.76773 | 5.079482 | 4.923654 | 4.540985 | 0.62349 | 0.781831 | 0.98522 | 24 | 5 |


## <div style="color:white;display:fill;border-radius:5px;background-color:#9DCDD1;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;font-size:100%;letter-spacing:0.5px;margin:0"><b>6 | Performance and Evaluation</b></div>

In [8]:
model.fit(train_features, train_target)
preds = model.predict(test_features)
eval_metric(test_target.values, preds, "SMAPE")

[37.65141857448004]

In [9]:
model.fit(enriched_train_features, train_target)
enriched_preds = model.predict(enriched_test_features)
eval_metric(test_target.values, enriched_preds, "SMAPE")

[14.504497540797917]

Training on the features enriched with Upgini lowered the test-set SMAPE from 37.65 to 14.50 (lower is better), a relative reduction of about 61%.
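
For reference, SMAPE is commonly defined as 100 · mean(|a − p| / ((|a| + |p|) / 2)), which bounds the score between 0 and 200. The sketch below implements that common definition in plain Python on toy numbers. It is assumed, not verified here, that CatBoost's `eval_metric(..., "SMAPE")` follows the same formula, and `smape` is a hypothetical helper name.

```python
def smape(actual, predicted):
    """Symmetric MAPE in percent, range 0..200; 0/0 terms count as zero error."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        total += abs(a - p) / denom if denom else 0.0
    return 100 * total / len(actual)


print(smape([10, 20], [10, 20]))  # perfect forecast → 0.0
print(smape([10, 0], [0, 10]))    # worst case → 200.0
```

Unlike MAPE, this metric stays finite when an actual value is zero, which can matter for sales data that contains days with no sales.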