# <div style="color:white;display:fill;border-radius:5px;background-color:#9DCDD1;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;font-size:100%;letter-spacing:0.5px;margin:0"><b>1 | About</b></div>

Sales prediction and data enrichment using the CatBoost algorithm and Upgini. The goal is to forecast future sales for the next 3 months and determine whether enriching the data improves model accuracy.

[GitHub](https://github.com/1391819/sales-forecasting)

## <div style="color:white;display:fill;border-radius:5px;background-color:#9DCDD1;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;font-size:100%;letter-spacing:0.5px;margin:0"><b>2 | Data overview</b></div>
- Tabular data
- 5 years' worth of product sales (19k samples)
- 4 columns:
  - date, store, item, and sales
- Sales data before 2017 will be used as training data (15213 samples), while everything from 2017 onward will be our test data (3787 samples)
- Limited information (only a date, two identifiers, and the sales target) for our model to learn how to predict future sales

## <div style="color:white;display:fill;border-radius:5px;background-color:#9DCDD1;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;font-size:100%;letter-spacing:0.5px;margin:0"><b>3 | Stack</b></div>

- CatBoost: state-of-the-art gradient boosting on decision trees
- Upgini: data enrichment given limited data
  - used to automatically search through thousands of public data sources to find the features most relevant to the project


## <div style="color:white;display:fill;border-radius:5px;background-color:#9DCDD1;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;font-size:100%;letter-spacing:0.5px;margin:0"><b>4 | Exploratory Data Analysis</b></div>

In [1]:
!pip install -q upgini catboost


In [2]:
# imports 
from os.path import exists
import pandas as pd

# creating dataframe
df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path)
df = df.sample(n=19_000, random_state=0) # 19k samples

# converting store and item features to strings (they are int initially)
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)

# converting date column to datetime pandas object
df["date"] = pd.to_datetime(df["date"])

# sorting values by date
df.sort_values("date", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

Unnamed: 0,date,store,item,sales
0,2013-01-01,7,5,5
1,2013-01-01,4,9,19
2,2013-01-01,1,33,37
3,2013-01-01,3,41,14
4,2013-01-01,5,24,26


## <div style="color:white;display:fill;border-radius:5px;background-color:#9DCDD1;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;font-size:100%;letter-spacing:0.5px;margin:0"><b>5 | Training</b></div>

### <b><span style='color:#58A2A8'>5.1</span> | Creating train and test splits</b>

In [3]:
# splitting data into training and testing
# data before 2017 = training, data from 2017 onward = testing
train = df[df["date"] < "2017-01-01"] # 15213 samples for training
test = df[df["date"] >= "2017-01-01"] # 3787 samples for testing
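
The date cutoff used above can be wrapped in a small helper that also checks for leakage between the two splits. This is a minimal sketch on toy data; `split_by_date` and the `toy` frame are hypothetical names introduced for illustration and are not part of the notebook.

```python
import pandas as pd


def split_by_date(frame, cutoff, date_col="date"):
    """Return (train, test): rows strictly before `cutoff`, rows on or after it."""
    cutoff = pd.Timestamp(cutoff)
    before = frame[frame[date_col] < cutoff]
    after = frame[frame[date_col] >= cutoff]
    return before, after


# toy data standing in for the real sales frame (values are made up)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02"]),
    "sales": [10, 12, 9, 11],
})
toy_train, toy_test = split_by_date(toy, "2017-01-01")

# every training date must precede every test date, and no row may be lost
assert toy_train["date"].max() < toy_test["date"].min()
assert len(toy_train) + len(toy_test) == len(toy)
```

Splitting on a date rather than at random keeps future observations out of the training set, which matters for any forecasting evaluation.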

In [4]:
# separating features and target for both training and testing datasets
train_features = train.drop(columns=["sales"])
train_target = train["sales"]

test_features = test.drop(columns=["sales"])
test_target = test["sales"]

### <b><span style='color:#58A2A8'>5.2</span> | Data enrichment</b>

In [5]:
# imports
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

# creating the Upgini enricher, using date as the search key
enricher = FeaturesEnricher(
    search_keys = {
      "date": SearchKey.DATE,
    },
    cv = CVType.time_series
)

# fitting training data to the enricher
enricher.fit(train_features,
             train_target,
             eval_set=[(test_features, test_target)]
)


Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IPv4 to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history
Detected task type: ModelTaskType.REGRESSION



Column name,Status,Errors
date,All valid,-
target,All valid,-



Running search request, search_id=4210daf9-f71a-4891-bc8a-98434d0f9a14
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
30 relevant feature(s) found with the search keys: ['date']


Provider,Source,Feature name,SHAP value,Coverage %,Type,Feature type
,,item,0.593929,100.0,categorical,
,,store,0.19591,100.0,categorical,
Upgini,Public data,f_weather_date_weather_umap_31_fa6d9a99,0.036087,100.0,numerical,Free
Upgini,Public data,f_events_date_week_sin1_847b5db1,0.033563,100.0,numerical,Free
Upgini,Public data,f_events_date_year_cos1_9014a856,0.03218,100.0,numerical,Free
Upgini,Public data,f_weather_date_weather_umap_48_b39cd0c4,0.030332,100.0,numerical,Free
Upgini,Public data,f_weather_date_weather_umap_47_5123ef0a,0.014408,100.0,numerical,Free
Upgini,Public data,f_weather_date_weather_umap_34_c3ef5b4f,0.013787,100.0,numerical,Free
Upgini,Public data,f_weather_date_weather_pca_0_d7e0a1fc,0.013082,100.0,numerical,Free
Upgini,Public data,f_events_date_week_cos3_7525fe31,0.009484,100.0,numerical,Free


We detected 113 outliers in your sample.
Examples of outliers with maximum value of target:
84    205
47    196
38    187
Name: target, dtype: int64
Outliers will be excluded during the metrics calculation.
Before dropping target outliers size: 19000
After dropping target outliers size: 18887
Calculating accuracy uplift after enrichment...
which makes metrics between the train and eval_set incomparable.
Quality metrics


Dataset,Rows,Baseline mean_squared_error,Enriched mean_squared_error,Uplift
Train,15148.0,309.990498,195.876523,114.113975
Eval 1,3739.0,509.28974,366.309441,142.980299


### <b><span style='color:#58A2A8'>5.3</span> | Model creation</b>

In [6]:
# imports
from catboost import CatBoostRegressor
from catboost.utils import eval_metric

# creating CatBoost model
model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)

# calculating metrics before and after enrichment with new relevant features
enricher.calculate_metrics(
    train_features, train_target, 
    eval_set = [(test_features, test_target)],
    estimator = model,
    scoring = "mean_absolute_percentage_error"
)

Calculating accuracy uplift after enrichment...
which makes metrics between the train and eval_set incomparable.

Dataset,Rows,Baseline mean_absolute_percentage_error,Enriched mean_absolute_percentage_error,Uplift
Train,15148.0,0.255109,0.155266,0.099843
Eval 1,3739.0,0.270377,0.194607,0.07577
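
In both quality-metric tables above, the Uplift column appears to equal the baseline error minus the enriched error. That reading is inferred from the reported numbers, not from the Upgini documentation, and the helper names below are hypothetical. A short sketch that checks the arithmetic and also expresses the gain in relative terms:

```python
# (baseline, enriched, reported uplift), copied from the two tables above
reported = {
    ("MSE", "Train"): (309.990498, 195.876523, 114.113975),
    ("MSE", "Eval 1"): (509.28974, 366.309441, 142.980299),
    ("MAPE", "Train"): (0.255109, 0.155266, 0.099843),
    ("MAPE", "Eval 1"): (0.270377, 0.194607, 0.07577),
}


def absolute_uplift(baseline, enriched):
    """Error reduction in the metric's own units (assumed meaning of Uplift)."""
    return baseline - enriched


def relative_uplift(baseline, enriched):
    """Error reduction as a fraction of the baseline error."""
    return (baseline - enriched) / baseline


for (metric, split), (baseline, enriched, uplift) in reported.items():
    # the reported uplift should match baseline - enriched up to rounding
    assert abs(absolute_uplift(baseline, enriched) - uplift) < 1e-6
    print(f"{metric} {split}: relative reduction {relative_uplift(baseline, enriched):.1%}")
```

A relative figure makes the MSE and MAPE rows comparable, since their absolute uplifts are on very different scales.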


In [7]:
# joining the input columns (date, store, item) with the enriched features found using Upgini
enriched_train_features = enricher.transform(train_features, keep_input = True)
enriched_test_features = enricher.transform(test_features, keep_input = True)
enriched_train_features.head()


Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IPv4 to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history


Column name,Status,Errors
date,All valid,-



Running search request, search_id=79571097-bed2-4a27-8966-1ab14ec168e8
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Retrieving selected features from data sources...
Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IPv4 to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history


Column name,Status,Errors
date,All valid,-



Running search request, search_id=c8ee0eb1-703b-4155-8b90-3fcaaa767978
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Retrieving selected features from data sources...

Unnamed: 0,date,store,item,f_weather_date_weather_umap_31_fa6d9a99,f_events_date_week_sin1_847b5db1,f_events_date_year_cos1_9014a856,f_weather_date_weather_umap_48_b39cd0c4,f_weather_date_weather_umap_47_5123ef0a,f_weather_date_weather_umap_34_c3ef5b4f,f_weather_date_weather_pca_0_d7e0a1fc,...,f_economic_date_cci_pca_3_10646e17,f_economic_date_cci_pca_6_aa7c1005,f_events_date_year_sin1_3c44bc64,f_weather_date_weather_umap_12_d03be9a0,f_economic_date_cbpol_pca_1_31e5f62c,f_financial_date_finance_umap_0_526a7a88,f_weather_date_weather_umap_35_5ddaa0ba,f_weather_date_weather_umap_22_0342ee9e,f_economic_date_cpi_umap_7_d43e2396,f_weather_date_weather_umap_45_d474bf8d
0,2013-01-01,7,5,4.712653,0.781831,0.98522,4.540985,5.927147,5.664261,29.676683,...,-1.962578,-1.387072,0.171293,3.739055,-0.438029,10.955449,4.76773,4.806711,11.154531,4.923654
1,2013-01-01,4,9,4.712653,0.781831,0.98522,4.540985,5.927147,5.664261,29.676683,...,-1.962578,-1.387072,0.171293,3.739055,-0.438029,10.955449,4.76773,4.806711,11.154531,4.923654
2,2013-01-01,1,33,4.712653,0.781831,0.98522,4.540985,5.927147,5.664261,29.676683,...,-1.962578,-1.387072,0.171293,3.739055,-0.438029,10.955449,4.76773,4.806711,11.154531,4.923654
3,2013-01-01,3,41,4.712653,0.781831,0.98522,4.540985,5.927147,5.664261,29.676683,...,-1.962578,-1.387072,0.171293,3.739055,-0.438029,10.955449,4.76773,4.806711,11.154531,4.923654
4,2013-01-01,5,24,4.712653,0.781831,0.98522,4.540985,5.927147,5.664261,29.676683,...,-1.962578,-1.387072,0.171293,3.739055,-0.438029,10.955449,4.76773,4.806711,11.154531,4.923654


## <div style="color:white;display:fill;border-radius:5px;background-color:#9DCDD1;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;font-size:100%;letter-spacing:0.5px;margin:0"><b>6 | Performance and Evaluation</b></div>

In [8]:
# baseline data - performance analysis
model.fit(train_features, train_target)
preds = model.predict(test_features)
eval_metric(test_target.values, preds, "SMAPE")

[37.65141857448004]

In [9]:
# enriched data - performance analysis
model.fit(enriched_train_features, train_target)
enriched_preds = model.predict(enriched_test_features)
eval_metric(test_target.values, enriched_preds, "SMAPE")

[14.681142656550785]

With the features added by Upgini, SMAPE on the test set fell from 37.65 to 14.68, so enriching the data improved the model's accuracy.
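
For readers unfamiliar with the metric, SMAPE can be sketched in a few lines of NumPy. This assumes the common definition, 100 times the mean of |prediction - actual| divided by the average of |actual| and |prediction|. CatBoost's own implementation may handle weights and edge cases differently, and the `smape` function below is a hypothetical stand-in, not the library's code.

```python
import numpy as np


def smape(actual, predicted):
    """Symmetric mean absolute percentage error on a 0-200 scale."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    denom = (np.abs(actual) + np.abs(predicted)) / 2.0
    # treat terms where both values are zero as contributing zero error
    ratio = np.divide(
        np.abs(predicted - actual),
        denom,
        out=np.zeros_like(denom),
        where=denom != 0,
    )
    return 100.0 * ratio.mean()


# toy values, made up for illustration
toy_actual = [10, 20, 30]
toy_pred = [12, 18, 30]
print(smape(toy_actual, toy_pred))

# relative reduction between the two SMAPE scores reported above
baseline_smape = 37.65141857448004
enriched_smape = 14.681142656550785
reduction = (baseline_smape - enriched_smape) / baseline_smape
print(f"relative SMAPE reduction: {reduction:.1%}")
```

Because the denominator uses both the actual and predicted values, over- and under-predictions of the same size are not penalised identically, which is worth keeping in mind when comparing SMAPE with the MAPE figures reported earlier.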