# Business Case


## II. Model & Evaluation

#### Introduction
We decided to focus on fbProphet as the main library for time series forecasting. The library was developed by Facebook and gives good results. The aim is to start from the base algorithm and then adjust the parameters and the model inputs, so that every variant can be compared on the **same KPI**. The variants are:

- a. Two univariate time series: one for products in promotion and one for products not in promotion
- b. Multivariate time series with a correlated variable (IsPromo)
- c. Multivariate time series with correlated variables (IsPromo and Communication Channel)
- d. Multivariate time series with a correlated variable (Communication Channel) and a logarithm transformation of UnitSales

Each algorithm comes with an explanation of:
- Feature Engineering
- Model Evaluation based on the RMSE and NRMSE of each algorithm



#### The evaluation of the model will be based on:

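$$RMSE = \sqrt{\dfrac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}}$$

$$NRMSE = \dfrac{RMSE}{\bar{y}}$$

where $y_i$ is the observed UnitSales, $\hat{y}_i$ is the forecast, and $\bar{y}$ is the average of the observed UnitSales.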

We use the NRMSE to compare a model across all products, because the RMSE depends on the scale of UnitSales for each product.
- If the average UnitSales for product A is 1000, an RMSE of 1000 gives an NRMSE of 1
- If the average UnitSales for product B is 10, an RMSE of 10 gives an NRMSE of 1
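
As a minimal sketch of the two metrics, the snippet below computes the RMSE as the square root of the mean squared error and the NRMSE as the RMSE divided by the mean of the observed values. The function names and the sales figures are hypothetical and are not taken from the project code; product B is simply product A scaled down by 100, to show that the NRMSE is scale-free while the RMSE is not.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def nrmse(actual, predicted):
    """RMSE divided by the mean of the observed values (assumed non-zero)."""
    mean_actual = sum(actual) / len(actual)
    return rmse(actual, predicted) / mean_actual


# Hypothetical UnitSales and forecasts (illustration only).
product_a = [900.0, 1000.0, 1100.0]
forecast_a = [1000.0, 1000.0, 1000.0]
product_b = [9.0, 10.0, 11.0]
forecast_b = [10.0, 10.0, 10.0]

print("A:", rmse(product_a, forecast_a), nrmse(product_a, forecast_a))
print("B:", rmse(product_b, forecast_b), nrmse(product_b, forecast_b))
```

The two RMSE values differ by a factor of 100, while the two NRMSE values are equal, which is why the NRMSE is the one summed or averaged across products.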



#### During the EDA, we discovered that:
**IsPromo** is one of the most important variables, because we want to know the impact of promotions on UnitSales. However, the first promotion appears in the dataset only in November 2016, so either there were no promotions before that date or they were not labelled.
This matters: if we feed wrongly labelled data to the model, we can expect wrong forecasts.
So we always run each model on two inputs:
- Both years, 2016 and 2017
- Only the year 2017


In [1]:
import pandas as pd
import warnings

# Constants
pd.set_option("display.max_columns", None)
warnings.filterwarnings('ignore')

## 1. Getting Data

In [2]:
import sys
from src.ah_forecast_sales.utils.exploratory_analysis import get_procceed_data
del sys.modules['src.ah_forecast_sales.utils.exploratory_analysis']
from src.ah_forecast_sales.utils.exploratory_analysis import get_procceed_data

df = get_procceed_data()

Number of observation: 1597612
Number of features: 35
Variable with an unique Value ['RainFallSum', 'MinAge', 'AlcoholPercentage']
Variable with NaN Value for ShelfCapacity 0.021527129240391282
Variable with NaN Value for UnitPromotionThreshold 0.9241167442407794
Variable with NaN Value for CommunicationChannel 0.923170957654299
Variable with NaN Value for UnitSales 0.06667388577451847
Variable with NaN Value for BasePrice 0.923170957654299
Variable with NaN Value for DiscountPercentage 0.923170957654299
INFO: Pandarallel will run on 12 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
Number of observation: 1479823
Number of features: 36


### A. Run the different models on a random item


In [6]:
itemNumberSample = df[
    df.ItemNumber == 10496
].copy()

itemNumberSample.groupby('IsPromo').size().reset_index()

Unnamed: 0,IsPromo,0
0,False,665
1,True,63


## a. fbProphetUnivariate 

The first model uses fbProphet with only the UnitSales values of the dataset. To compare a product in promotion with one that is not, two models are created:
- an fbProphet model using only the data which is **not in promotion**
- an fbProphet model using only the data which is in **promotion**

Then, depending on whether the product was in promotion at a given time, we select the output of the corresponding model. Forecasting a week ahead works the same way: depending on whether the product will be in promotion, we use one of these two models.
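
The selection step described above can be sketched as follows. This is not the implementation inside `fbProphetUnivariate`, which is not shown in this notebook; the function name `combine_forecasts` and all values are hypothetical. It only illustrates picking, period by period, the forecast of the model that matches the promotion flag.

```python
def combine_forecasts(is_promo, yhat_promo, yhat_no_promo):
    """For each period, keep the forecast of the model matching the promo flag."""
    if not (len(is_promo) == len(yhat_promo) == len(yhat_no_promo)):
        raise ValueError("all inputs must have the same length")
    return [
        promo if flag else no_promo
        for flag, promo, no_promo in zip(is_promo, yhat_promo, yhat_no_promo)
    ]


# Hypothetical flags and forecasts for four periods (illustration only).
is_promo = [False, True, False, True]
yhat_promo = [300.0, 320.0, 310.0, 330.0]      # model fitted on promotion data
yhat_no_promo = [100.0, 110.0, 105.0, 120.0]   # model fitted on non-promotion data

combined = combine_forecasts(is_promo, yhat_promo, yhat_no_promo)
print(combined)
```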

In [9]:
# The class needs to be imported at least once before running it
import sys
from src.ah_forecast_sales.pipeline.fbProphetUnivariate import fbProphetUnivariate
del sys.modules['src.ah_forecast_sales.pipeline.fbProphetUnivariate']
from src.ah_forecast_sales.pipeline.fbProphetUnivariate import fbProphetUnivariate

## Based on the full Dataset 2016/2017
print('Model based on 2016 and 2017 data')
fb_prophet_forecast = fbProphetUnivariate(itemNumberSample, start_date = '2018-01-01')
print('The RMSE is {}'. format(fb_prophet_forecast.rmse))
print('The NRMSE is {}'. format(fb_prophet_forecast.nrmse))

fb_prophet_forecast.get_vizualisation_metrics().show()

## Based on the last year of the dataset
print('Model based on 2017 data')
tmp = itemNumberSample[
    itemNumberSample.years=='2017'
].copy()
fb_prophet_forecast = fbProphetUnivariate(tmp, start_date = '2018-01-01')
print('The RMSE is {}'. format(fb_prophet_forecast.rmse))
print('The NRMSE is {}'. format(fb_prophet_forecast.nrmse))

fb_prophet_forecast.get_vizualisation_metrics().show()

Model based on 2016 and 2017 data

Initial log joint probability = -31.4547
Iteration  1. Log joint probability =    27.6919. Improved by 59.1466.
Iteration  2. Log joint probability =    61.5922. Improved by 33.9003.
Iteration  3. Log joint probability =    75.7023. Improved by 14.1101.
Iteration  4. Log joint probability =    77.6841. Improved by 1.98179.
Iteration  5. Log joint probability =    77.6848. Improved by 0.000740647.
Iteration  6. Log joint probability =    77.6872. Improved by 0.00234653.
Iteration  7. Log joint probability =    77.7139. Improved by 0.026745.
Iteration  8. Log joint probability =    78.0195. Improved by 0.305618.
Iteration  9. Log joint probability =    78.0211. Improved by 0.0015472.
Iteration 10. Log joint probability =     78.057. Improved by 0.0359088.
Iteration 11. Log joint probability =    78.1817. Improved by 0.124743.
Iteration 12. Log joint probability =    78.3139. Improved by 0.132119.
Iteration 13. Log joint probability =    78.3186. Improve

The RMSE is 869.5570914341259
The NRMSE is 0.5950322572605548


Model based on 2017 data

Initial log joint probability = -21.9267
Iteration  1. Log joint probability =    36.9021. Improved by 58.8287.
Iteration  2. Log joint probability =    77.9545. Improved by 41.0524.
Iteration  3. Log joint probability =    86.4684. Improved by 8.51392.
Iteration  4. Log joint probability =    86.8079. Improved by 0.339514.
Iteration  5. Log joint probability =    86.8805. Improved by 0.0726332.
Iteration  6. Log joint probability =    86.9009. Improved by 0.0203283.
Iteration  7. Log joint probability =    87.0125. Improved by 0.111674.
Iteration  8. Log joint probability =    87.0387. Improved by 0.0261477.
Iteration  9. Log joint probability =    87.0422. Improved by 0.00349901.
Iteration 10. Log joint probability =    87.0538. Improved by 0.0116665.
Iteration 11. Log joint probability =    87.0571. Improved by 0.00324637.
Iteration 12. Log joint probability =    87.0593. Improved by 0.00221257.
Iteration 13. Log joint probability =    87.0602. Improved by 

## b. fbProphetMultivariate using the IsPromo variable

The second model uses fbProphet on the time series values of the dataset. To compare a product in promotion with one that is not, two models are created:
- an fbProphet model using only the data which is **not in promotion**
- an fbProphet model using only the data which is in **promotion**

Then, depending on whether the product was in promotion at a given time, we select the output of the corresponding model. Forecasting a week ahead works the same way: depending on whether the product will be in promotion, we use one of these two models.

In [5]:
# The class needs to be imported at least once before running it
import sys
from src.ah_forecast_sales.pipeline.fbProphetMultivariate import fbProphetMultivariate
del sys.modules['src.ah_forecast_sales.pipeline.fbProphetMultivariate']
from src.ah_forecast_sales.pipeline.fbProphetMultivariate import fbProphetMultivariate

## Based on the full Dataset 2016/2017
print('Model based on 2016 and 2017 data')
fb_prophet_forecast = fbProphetMultivariate(itemNumberSample, start_date = '2018-01-01')
print('The RMSE is {}'. format(fb_prophet_forecast.rmse))
print('The NRMSE is {}'. format(fb_prophet_forecast.nrmse))

fb_prophet_forecast.get_vizualisation_metrics().show()

## Based on the last year of the dataset
print('Model based on 2017 data')
tmp = itemNumberSample[
    itemNumberSample.years=='2017'
].copy()
fb_prophet_forecast = fbProphetMultivariate(tmp, start_date = '2018-01-01')
print('The RMSE is {}'. format(fb_prophet_forecast.rmse))
print('The NRMSE is {}'. format(fb_prophet_forecast.nrmse))

fb_prophet_forecast.get_vizualisation_metrics().show()

Model based on 2016 and 2017 data
The RMSE is 873.6743559781817
The NRMSE is 0.5978496745865958


Model based on 2017 data
The RMSE is 331.99978058656063
The NRMSE is 0.2678151325464658


## c. fbProphetMultivariate using IsPromo and a correlated variable

In this case, we pass CommunicationChannelCode to the model as an additional regressor.


In [53]:
# The class needs to be imported at least once before running it
import sys
from src.ah_forecast_sales.pipeline.fbProphetMultivariate import fbProphetMultivariate
del sys.modules['src.ah_forecast_sales.pipeline.fbProphetMultivariate']
from src.ah_forecast_sales.pipeline.fbProphetMultivariate import fbProphetMultivariate

## Based on the full Dataset 2016/2017
print('Model based on 2016 and 2017 data')
fb_prophet_forecast = fbProphetMultivariate(
    itemNumberSample,
    start_date = '2018-01-01',
    regressors = ['CommunicationChannelCode']

)
print('The RMSE is {}'. format(fb_prophet_forecast.rmse))
print('The NRMSE is {}'. format(fb_prophet_forecast.nrmse))

fb_prophet_forecast.get_vizualisation_metrics().show()

## Based on the last year of the dataset
print('Model based on 2017 data')
tmp = itemNumberSample[
    itemNumberSample.years=='2017'
].copy()
fb_prophet_forecast = fbProphetMultivariate(
    tmp,
    start_date = '2018-01-01',
    regressors = ['CommunicationChannelCode']
    
)
print('The RMSE is {}'. format(fb_prophet_forecast.rmse))
print('The NRMSE is {}'. format(fb_prophet_forecast.nrmse))

fb_prophet_forecast.get_vizualisation_metrics().show()

Model based on 2016 and 2017 data
The RMSE is 872.2445917321634
The NRMSE is 0.5968712962201385


Model based on 2017 data
The RMSE is 311.2062836022457
The NRMSE is 0.2510415878738755


### Logarithm
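
The `log=True` option below applies a logarithm transformation to UnitSales before fitting. How the class implements it is not shown in this notebook, so the following is only a sketch under assumptions: it uses `log1p`/`expm1` (an assumption, chosen so that zero sales stay defined) and hypothetical helper names. The forecast has to be transformed back to the original scale before computing the RMSE and NRMSE, otherwise the metrics are not comparable with the other models.

```python
import math


def to_log(values):
    """Transform sales to log scale; log1p keeps zero sales well defined."""
    return [math.log1p(v) for v in values]


def from_log(values):
    """Inverse of to_log: bring forecasts back to the original sales scale."""
    return [math.expm1(v) for v in values]


# Hypothetical UnitSales values (illustration only).
unit_sales = [0.0, 12.0, 150.0, 2400.0]

log_sales = to_log(unit_sales)   # what the model would be fitted on
restored = from_log(log_sales)   # what the metrics should be computed on

print(log_sales)
print(restored)
```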

In [55]:
# The class needs to be imported at least once before running it
import sys
from src.ah_forecast_sales.pipeline.fbProphetMultivariate import fbProphetMultivariate
del sys.modules['src.ah_forecast_sales.pipeline.fbProphetMultivariate']
from src.ah_forecast_sales.pipeline.fbProphetMultivariate import fbProphetMultivariate

## Based on the full Dataset 2016/2017
print('Model based on 2016 and 2017 data')
fb_prophet_forecast = fbProphetMultivariate(
    itemNumberSample,
    start_date = '2018-01-01',
    regressors = ['CommunicationChannelCode'],
    log=True


)
print('The RMSE is {}'. format(fb_prophet_forecast.rmse))
print('The NRMSE is {}'. format(fb_prophet_forecast.nrmse))

fb_prophet_forecast.get_vizualisation_metrics().show()

## Based on the last year of the dataset
print('Model based on 2017 data')
tmp = itemNumberSample[
    itemNumberSample.years=='2017'
].copy()
fb_prophet_forecast = fbProphetMultivariate(
    tmp,
    start_date = '2018-01-01',
    regressors = ['CommunicationChannelCode'],
    log=True
    
)
print('The RMSE is {}'. format(fb_prophet_forecast.rmse))
print('The NRMSE is {}'. format(fb_prophet_forecast.nrmse))

fb_prophet_forecast.get_vizualisation_metrics().show()

Model based on 2016 and 2017 data
The RMSE is 868.0901854886561
The NRMSE is 0.5940284630709377


Model based on 2017 data
The RMSE is 306.8719674408603
The NRMSE is 0.24754521370396224


In [71]:
fb_prophet_forecast.forecast.IsPromo.unique()

array([0.        , 0.93713261])

## D. Evaluation and Final Model 


We run the four model variants on a random sample of items and store the metrics of each variant in the sample dataframe.


In [80]:
import sys
from src.ah_forecast_sales.pipeline.evaluation import get_evaluation_fbProphetUnivariate
del sys.modules['src.ah_forecast_sales.pipeline.evaluation']
from src.ah_forecast_sales.pipeline.evaluation import get_evaluation_fbProphetUnivariate
from src.ah_forecast_sales.pipeline.evaluation import get_evaluation_fbProphetMultivariate
from src.ah_forecast_sales.utils.exploratory_analysis import get_sample
from tqdm import tqdm

sample = get_sample(df, n=100)


for observation in tqdm(sample.to_dict('records')):
    # Univariate
    sample = get_evaluation_fbProphetUnivariate(
        sample,
        df,
        observation['ItemNumber'],
        'univariate'
    ) 
    
    # Multivariate with the column IsPromo only
    sample = get_evaluation_fbProphetMultivariate(
        sample,
        df,
        observation['ItemNumber'],
        'multivariate_isPromo',
        regressors=[],
        log=False
    ) 
    # Multivariate with the columns IsPromo + CommunicationChannelCode
    sample = get_evaluation_fbProphetMultivariate(
        sample,
        df,
        observation['ItemNumber'],
        'multivariate_isPromo_CommunicationChannel',
        regressors=['CommunicationChannelCode'],
        log=False

    )
    # Multivariate with the columns IsPromo + CommunicationChannelCode and a log transformation
    sample = get_evaluation_fbProphetMultivariate(
        sample,
        df,
        observation['ItemNumber'],
        'multivariate_isPromo_CommunicationChannel_Log',
        regressors=['CommunicationChannelCode'],
        log=True
    ) 
    
    

 14%|█▍        | 7/50 [01:40<10:24, 14.52s/it]INFO:fbprophet:n_changepoints greater than number of observations. Using 21.
 18%|█▊        | 9/50 [02:08<09:43, 14.22s/it]INFO:fbprophet:n_changepoints greater than number of observations. Using 21.
INFO:fbprophet:n_changepoints greater than number of observations. Using 21.
 36%|███▌      | 18/50 [04:26<08:14, 15.46s/it]INFO:fbprophet:n_changepoints greater than number of observations. Using 21.
INFO:fbprophet:n_changepoints greater than number of observations. Using 21.
 84%|████████▍ | 42/50 [10:54<02:07, 15.97s/it]INFO:fbprophet:n_changepoints greater than number of observations. Using 21.
 96%|█████████▌| 48/50 [12:33<00:32, 16.48s/it]INFO:fbprophet:n_changepoints greater than number of observations. Using 21.
INFO:fbprophet:n_changepoints greater than number of observations. Using 21.
100%|██████████| 50/50 [13:13<00:00, 15.88s/it]


In [81]:
columns = [x for x in list(sample) if 'NRMSE' in x]
sample[columns].sum().reset_index()

Unnamed: 0,index,0
0,univariate_NRMSE_2year,34.466114
1,univariate_NRMSE_1year,26.054567
2,multivariate_isPromo_NRMSE_2year,35.638079
3,multivariate_isPromo_NRMSE_1year,28.472552
4,multivariate_isPromo_CommunicationChannel_NRMS...,34.020857
5,multivariate_isPromo_CommunicationChannel_NRMS...,25.851822
6,multivariate_isPromo_CommunicationChannel_Log_...,34.020857
7,multivariate_isPromo_CommunicationChannel_Log_...,25.851822


In [1]:
sample[columns].mean().reset_index()

NameError: name 'sample' is not defined

In [None]:
sample[columns].median().reset_index()

In [None]:
sample[columns].describe()[[
    'univariate_NRMSE_1year',
    'multivariate_isPromo_NRMSE_1year',
    'multivariate_isPromo_CommunicationChannel_NRMSE_1year'
]]


### Conclusion

- Feature Engineering
- Model Evaluation based on the RMSE and NRMSE of each algorithm

- The approach works only if the product has promotions
- The RMSE is hard to compare across products

The conclusion should state which model is best for the data we have, and also what data is needed.

Next steps:
- Try to use more variables


One idea is to use the whole week of a national holiday instead of only the day itself.
For Christmas, for example, the effect on sales most likely comes from the weeks before the holiday rather than from the day itself.
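
The holiday-window idea can be sketched as follows. Prophet accepts a holidays dataframe with the columns `holiday` and `ds`, plus optional `lower_window` and `upper_window` columns that extend the effect to the days around the date. The helper names, the 7-day window and the example date are assumptions for illustration, not part of the project code; the rows are built with the standard library only and could be passed to `pandas.DataFrame` afterwards.

```python
from datetime import date, timedelta


def holiday_rows(holidays, days_before=7, days_after=0):
    """Build holiday rows whose effect covers the days before and after the date."""
    return [
        {
            "holiday": name,
            "ds": day,
            "lower_window": -days_before,  # negative: days before the holiday
            "upper_window": days_after,    # positive: days after the holiday
        }
        for name, day in holidays
    ]


def covered_dates(row):
    """List every date covered by a holiday row, window included."""
    start = row["ds"] + timedelta(days=row["lower_window"])
    n_days = row["upper_window"] - row["lower_window"] + 1
    return [start + timedelta(days=i) for i in range(n_days)]


# Hypothetical example: Christmas 2017 with the week before it.
rows = holiday_rows([("Christmas", date(2017, 12, 25))], days_before=7)
christmas_dates = covered_dates(rows[0])
print(christmas_dates[0], "to", christmas_dates[-1])
```

The window length is a modelling choice: a longer `days_before` captures the pre-holiday build-up, at the cost of one extra effect to estimate per day.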

