# Chapter 3.5 and 4 - Experimentation and Results (Pt. 2)
This notebook contains the code for the multivariate experiments conducted as part of my thesis. Together with the notebook on univariate models, it forms the basis for Chapter 3.4, which covers the approach pursued in the multivariate NeuralProphet (NP) experiments, and for Chapter 4, which presents the results of the experiments. Note that the experiments were only run on datasets C4 and C6, the two prototypes with enough data to execute them.

## Initialization

In [None]:
# set up connection to Google Sheets to access the datasets
from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

In [None]:
# general imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import signal
import math
from sklearn.metrics import mean_squared_error, explained_variance_score
import warnings
warnings.filterwarnings('ignore')

In [None]:
# clone NeuralProphet git repository
!git clone https://github.com/ourownstory/neural_prophet.git

Cloning into 'neural_prophet'...
remote: Enumerating objects: 9204, done.
remote: Counting objects: 100% (644/644), done.
remote: Compressing objects: 100% (320/320), done.
remote: Total 9204 (delta 443), reused 476 (delta 317), pack-reused 8560
Receiving objects: 100% (9204/9204), 190.61 MiB | 13.36 MiB/s, done.
Resolving deltas: 100% (6333/6333), done.


In [None]:
cd neural_prophet

In [None]:
# install NeuralProphet repository
pip install .[live]

In [None]:
# import NeuralProphet functionality
from neuralprophet import NeuralProphet

In [None]:
# ensuring reproducibility of results
from neuralprophet import set_random_seed 

## Group 1 - prototype dataset C4

C4 is the longest dataset and shows low volatility with only a few outliers.

In [None]:
# load Sales model data from Google Sheets
worksheet = gc.open('C4').sheet1
rows = worksheet.get_all_values()

# convert to dataframe
opp_df = pd.DataFrame.from_records(rows[1:],columns=rows[0])
opp_df['VALUE'] = pd.to_numeric(opp_df['VALUE'])
opp_df['DATE'] = pd.to_datetime(opp_df['DATE'])
opp_df.head()

In [None]:
# Group and pivot data
opp_df = pd.DataFrame(opp_df.groupby(['DATE','METRIC'])['VALUE'].sum()).reset_index()
opp_df = opp_df.pivot(index=['DATE'],columns=['METRIC'],values=['VALUE'])
opp_df.columns = [col[1] for col in opp_df.columns]
opp_df.head()
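
To illustrate the reshaping step on a small scale, the following sketch applies the same group-and-pivot logic to a made-up long-format table. The values and the two metric names are chosen purely for illustration, and `pivot` is called with scalar arguments here, which yields flat column names directly.

```python
import pandas as pd

# illustrative long-format data: one row per (DATE, METRIC) entry, with a duplicate entry
long_df = pd.DataFrame({
    'DATE': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-01',
                            '2021-02-01', '2021-02-01']),
    'METRIC': ['CREATED_OPPORTUNITIES', 'WON_REVENUE', 'WON_REVENUE',
               'CREATED_OPPORTUNITIES', 'WON_REVENUE'],
    'VALUE': [3, 100.0, 50.0, 4, 200.0],
})

# sum duplicate entries per date and metric, then spread the metrics into columns
summed = long_df.groupby(['DATE', 'METRIC'], as_index=False)['VALUE'].sum()
wide_df = summed.pivot(index='DATE', columns='METRIC', values='VALUE')
wide_df.columns.name = None  # drop the 'METRIC' label from the column axis
print(wide_df)
```

Summing before pivoting matters because `pivot` raises an error when a (DATE, METRIC) pair occurs more than once.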

In [None]:
# load Marketing model data from Google Sheets
worksheet = gc.open('C4').worksheet('Lead Print Out')
rows = worksheet.get_all_values()

# convert to dataframe
lead_df = pd.DataFrame.from_records(rows[1:],columns=rows[0])
lead_df['VALUE'] = pd.to_numeric(lead_df['VALUE'])
lead_df['DATE'] = pd.to_datetime(lead_df['DATE'])
lead_df.head()

In [None]:
# Group and pivot data
lead_df = pd.DataFrame(lead_df.groupby(['DATE','METRIC'])['VALUE'].sum()).reset_index()
lead_df = lead_df.pivot(index=['DATE'],columns=['METRIC'],values=['VALUE'])
lead_df.columns = [col[1] for col in lead_df.columns]
lead_df.head()

In [None]:
# join datasets on DATE index
df = opp_df.join(lead_df)
df.head()

In [None]:
# drop irrelevant columns and replace missing CREATED_LEADS values (NaN) with 0
df = df.drop(['THREE_MONTHS_ACV','THREE_MONTHS_OPP_WON_CVR','SIX_MONTHS_LEAD_MQL_CVR','SIX_MONTHS_LEAD_OPP_CVR','SIX_MONTHS_MQL_OPP_CVR','WON_CUSTOMERS','MQLS'],axis=1)
df['CREATED_LEADS'] = [x if x==x else 0 for x in df['CREATED_LEADS']]
df.head()

In [None]:
# adjust data format to NP requirements
df = df.reset_index()
df.columns = ['ds','CREATED_OPPORTUNITIES','y','CREATED_LEADS']
df

#### Full series

In [None]:
# split data
df_train = df[:-12]
df_test = df[-12:]

In [None]:
# create enlarged train series for prediction generation (12 future months with unknown values)
future_df = pd.DataFrame({'ds': pd.date_range('2022-01-01', periods=12, freq='MS'),
                          'CREATED_OPPORTUNITIES': np.nan,
                          'CREATED_LEADS': np.nan,
                          'y': np.nan})
enlarged_df = pd.concat([df_train, future_df])  # DataFrame.append was removed in pandas 2.0
enlarged_df.tail(13)

##### Heuristics
For the heuristics approach, n_forecasts was set equal to the forecasting horizon (12 periods), n_lags to twice the forecasting horizon (24 periods), num_hidden_layers to 0, and loss_func to the MSE.

In [None]:
# initialize heuristics model
set_random_seed(0)
NP = NeuralProphet(growth='linear',
                   changepoints=None,
                   n_changepoints=10,
                   changepoints_range=0.8,
                   trend_reg=0, 
                   trend_reg_threshold=False, 
                   trend_global_local='global', 
                   yearly_seasonality='auto', 
                   weekly_seasonality='auto', 
                   daily_seasonality='auto', 
                   seasonality_mode='additive', 
                   seasonality_reg=0, 
                   season_global_local='global', 
                   n_forecasts=12, 
                   n_lags=24, 
                   num_hidden_layers=0, 
                   d_hidden=None, 
                   ar_reg=None, 
                   learning_rate=None, 
                   epochs=None, 
                   batch_size=None, 
                   loss_func='MSE',
                   optimizer='AdamW', 
                   newer_samples_weight=2, 
                   newer_samples_start=0.0, 
                   quantiles=None, 
                   impute_missing=True, 
                   impute_linear=10, 
                   impute_rolling=10, 
                   drop_missing=False, 
                   collect_metrics=True, 
                   normalize='auto', 
                   global_normalization=False, 
                   global_time_normalization=True, 
                   unknown_data_normalization=False)

In [None]:
# add past covariates to the model
NP.add_lagged_regressor(['CREATED_OPPORTUNITIES','CREATED_LEADS'])

INFO - (NP.forecaster.add_lagged_regressor) - n_lags = 'auto', number of lags for regressor is set to Autoregression number of lags (24)


<neuralprophet.forecaster.NeuralProphet at 0x7f86988d5220>

In [None]:
# train heuristics model
train_metrics = NP.fit(df_train)

INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [98.876]% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as MS
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling weekly seasonality. Run NeuralProphet with weekly_seasonality=True to override this.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 16
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 820


Finding best initial lr:   0%|          | 0/206 [00:00<?, ?it/s]

Training: 0it [00:00, ?it/s]

In [None]:
# generate prediction
predicted = NP.predict(enlarged_df)

INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [99.01]% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - MS
INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [99.01]% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - MS


Predicting: 4it [00:00, ?it/s]

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column


In [None]:
# clip values below 0
predicted['yhat12'] = [x if (x!=x or x >= 0) else 0 for x in predicted['yhat12']]
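
The clipping cell relies on the comparison `x != x`, which is true only for NaN. The rows without a 12-step-ahead forecast therefore stay NaN instead of being turned into 0. A minimal sketch of the same expression on a plain list, where `clip_negative` is a hypothetical helper name and the numbers are illustrative:

```python
import math

def clip_negative(values):
    # x != x is True only for NaN, so NaN entries are passed through unchanged;
    # every other negative value is replaced by 0
    return [x if (x != x or x >= 0) else 0 for x in values]

clipped = clip_negative([math.nan, -3.5, 0.0, 12.0])
print(clipped)
```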

In [None]:
# plot heuristics forecast
plt.figure(figsize=(15, 7.5))
plt.plot(predicted.set_index('ds')['yhat12'], color='r', label='model')
plt.axvspan(predicted.set_index('ds').index[-12], predicted.set_index('ds').index[-1], alpha=0.5, color='lightgrey')
plt.plot(df.set_index("ds")['y'], label='actual')
plt.grid(axis='y')
plt.xlabel('DATE', fontsize=20)
plt.ylabel('WON_REVENUE', fontsize=20)
plt.legend()
plt.show()

In [None]:
# plot NP parameters
NP.plot_parameters()

In [None]:
print('RMSE:',math.sqrt(mean_squared_error(df_test['y'], predicted['yhat12'][-12:])))
print('Explained variance:',explained_variance_score(df_test['y'], predicted['yhat12'][-12:]))

RMSE: 11385.428203876641
Explained variance: -39.114878531272275
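
A strongly negative explained variance means that the forecast errors vary far more than the actual series itself. The following sketch uses made-up numbers (not thesis data) and plain Python instead of scikit-learn to show how both metrics are computed. It relies on the standard definition of the explained variance score, 1 - Var(y - y_hat) / Var(y); the helper names are hypothetical.

```python
import math
from statistics import pvariance

def rmse(actual, forecast):
    # root of the mean squared difference between actual and forecast values
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def explained_variance(actual, forecast):
    # 1 - Var(residuals) / Var(actual); negative once the residuals vary more than the series
    residuals = [a - f for a, f in zip(actual, forecast)]
    return 1 - pvariance(residuals) / pvariance(actual)

actual = [100, 110, 90, 105]    # illustrative series with low volatility
forecast = [100, 160, 40, 130]  # illustrative forecast that swings far more than the series

print('RMSE:', rmse(actual, forecast))
print('Explained variance:', explained_variance(actual, forecast))
```

Because the score is bounded above by 1 but unbounded below, a low-volatility series such as C4 can yield very large negative values even for moderate absolute errors.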


##### Optimized
For the optimization approach, n_forecasts was again set equal to the forecasting horizon (12 periods) and loss_func to the MSE. This time, however, n_lags was varied from 1 to 24 and num_hidden_layers from 0 to 2. In addition, a validation split of 12 periods was taken from the train set so that the configuration could be selected by validation loss.
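
The search described above can be sketched as a plain loop over all 24 x 3 combinations. This is a minimal sketch, not the code used for the thesis: `grid_search` and `dummy_evaluate` are hypothetical names, and `dummy_evaluate` merely stands in for fitting a model on the train split and returning its validation loss.

```python
from itertools import product

def grid_search(evaluate, lag_values, layer_values):
    """Return (best_config, results); results maps (layers, lags) to the validation loss."""
    results = {}
    for layers, lags in product(layer_values, lag_values):
        results[(layers, lags)] = evaluate(n_lags=lags, num_hidden_layers=layers)
    # the configuration with the lowest validation loss wins
    best = min(results, key=results.get)
    return best, results

def dummy_evaluate(n_lags, num_hidden_layers):
    # hypothetical stand-in for "fit on train split, return validation loss"
    return abs(n_lags - 5) + abs(num_hidden_layers - 2)

best_config, all_results = grid_search(dummy_evaluate, range(1, 25), range(0, 3))
print(best_config, len(all_results))
```

The configuration tuples are ordered as (num_hidden_layers, n_lags), matching the order in which the notebook stores them in `configs`.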

In [None]:
# initialize lists to store configurations along with the respective train, validation and test loss
'''configs = []
test_losses = []
train_losses = []
val_losses = []'''

In [None]:
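# Grid search over num_hidden_layers for a fixed n_lags = p; rerun with p = 1, ..., 24 (wrapped in a string literal so that it is not executed by default)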
'''p = 1
for l in range(0,3):
    set_random_seed(0)
    NP = NeuralProphet(growth='linear',
                      n_lags=p, 
                      n_forecasts=12,
                      num_hidden_layers=l, 
                      d_hidden=None,
                      loss_func='MSE')
    NP.add_lagged_regressor(['CREATED_OPPORTUNITIES','CREATED_LEADS'])
    train,val = NP.split_df(df_train,valid_p=1)
    train_metrics = NP.fit(train)
    val_metrics = NP.test(val)
    configs.append((l,p))
    predicted = NP.predict(enlarged_df)
    test_losses.append(mean_squared_error(df_test['y'], predicted['yhat12'][-12:]))
    train_losses.append(train_metrics.iloc[-1,2])
    val_losses.append(val_metrics.iloc[0,0]) '''

In [None]:
# sorted results from hyperparameter tuning
results = pd.DataFrame({'config':configs,'test loss':test_losses, 'val loss':val_losses, 'train loss':train_losses})
results.sort_values('val loss')

In [None]:
# initialize optimal model with n_lags=5 and num_hidden_layers=2
set_random_seed(0)
NP = NeuralProphet(growth='linear',
                   changepoints=None,
                   n_changepoints=10,
                   changepoints_range=0.8,
                   trend_reg=0, 
                   trend_reg_threshold=False, 
                   trend_global_local='global', 
                   yearly_seasonality='auto', 
                   weekly_seasonality='auto', 
                   daily_seasonality='auto', 
                   seasonality_mode='additive', 
                   seasonality_reg=0, 
                   season_global_local='global', 
                   n_forecasts=12, 
                   n_lags=5, 
                   num_hidden_layers=2, 
                   d_hidden=None, 
                   ar_reg=None, 
                   learning_rate=None, 
                   epochs=None, 
                   batch_size=None, 
                   loss_func='MSE',
                   optimizer='AdamW', 
                   newer_samples_weight=2, 
                   newer_samples_start=0.0, 
                   quantiles=None, 
                   impute_missing=True, 
                   impute_linear=10, 
                   impute_rolling=10, 
                   drop_missing=False, 
                   collect_metrics=True, 
                   normalize='auto', 
                   global_normalization=False, 
                   global_time_normalization=True, 
                   unknown_data_normalization=False)

In [None]:
# add past covariates to the model
NP.add_lagged_regressor(['CREATED_OPPORTUNITIES','CREATED_LEADS'])

INFO - (NP.forecaster.add_lagged_regressor) - n_lags = 'auto', number of lags for regressor is set to Autoregression number of lags (5)


<neuralprophet.forecaster.NeuralProphet at 0x7f867f65ed00>

In [None]:
# train optimized model
train_metrics = NP.fit(df_train)

INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [98.876]% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as MS
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling weekly seasonality. Run NeuralProphet with weekly_seasonality=True to override this.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 16
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 662


Finding best initial lr:   0%|          | 0/206 [00:00<?, ?it/s]

Training: 0it [00:00, ?it/s]

In [None]:
# generate prediction
predicted = NP.predict(enlarged_df)

INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [99.01]% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - MS
INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [99.01]% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - MS


Predicting: 5it [00:00, ?it/s]

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column


In [None]:
# clip values below 0
predicted['yhat12'] = [x if (x!=x or x >= 0) else 0 for x in predicted['yhat12']]

In [None]:
# plot optimized forecast
plt.figure(figsize=(15, 7.5))
plt.plot(predicted.set_index('ds')['yhat12'], color='r', label='model')
plt.axvspan(predicted.set_index('ds').index[-12], predicted.set_index('ds').index[-1], alpha=0.5, color='lightgrey')
plt.plot(df.set_index("ds")['y'], label='actual')
plt.grid(axis='y')
plt.xlabel('DATE', fontsize=20)
plt.ylabel('WON_REVENUE', fontsize=20)
plt.legend()
plt.show()

In [None]:
# plot NP parameters
NP.plot_parameters()

In [None]:
print('RMSE:',math.sqrt(mean_squared_error(df_test['y'], predicted['yhat12'][-12:])))
print('Explained variance:',explained_variance_score(df_test['y'], predicted['yhat12'][-12:]))

RMSE: 18958.16733656666
Explained variance: -39.98507553384325


#### 2 years window
For the 2-year window, only the heuristics approach was pursued. Thus, n_forecasts was set equal to the forecasting horizon (12 periods), num_hidden_layers to 0, and loss_func to the MSE. Because the shortened training series covers only 24 months, n_lags could be set to at most 12.
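
The limit on n_lags follows from how many training samples the series can provide. Assuming each sample is a sliding window of n_lags input periods followed by n_forecasts target periods (an assumption about NP's windowing, although it is consistent with the auto-set batch size of 1 logged during training), the count can be sketched as follows; `n_training_samples` is a hypothetical helper.

```python
def n_training_samples(series_length, n_lags, n_forecasts):
    # each sample needs n_lags + n_forecasts consecutive observations
    return max(series_length - n_lags - n_forecasts + 1, 0)

# 24 months of training data with a 12-period horizon
print(n_training_samples(24, n_lags=12, n_forecasts=12))  # exactly one sample fits
print(n_training_samples(24, n_lags=24, n_forecasts=12))  # no sample fits
```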

In [None]:
# shorten dataset to only retain most recent three years (2 years of training data + 1 year of test data)
df = df[df['ds']>=pd.Timestamp("2020-01-01")]
df.head()

In [None]:
# split data
df_train = df[:-12]
df_test = df[-12:]

In [None]:
# create enlarged train series for prediction generation (12 future months with unknown values)
future_df = pd.DataFrame({'ds': pd.date_range('2022-01-01', periods=12, freq='MS'),
                          'CREATED_OPPORTUNITIES': np.nan,
                          'CREATED_LEADS': np.nan,
                          'y': np.nan})
enlarged_df = pd.concat([df_train, future_df])  # DataFrame.append was removed in pandas 2.0
enlarged_df.tail(13)

In [None]:
# initialize heuristics model
set_random_seed(0)
NP = NeuralProphet(growth='linear',
                   changepoints=None,
                   n_changepoints=10,
                   changepoints_range=0.8,
                   trend_reg=0, 
                   trend_reg_threshold=False, 
                   trend_global_local='global', 
                   yearly_seasonality='auto', 
                   weekly_seasonality='auto', 
                   daily_seasonality='auto', 
                   seasonality_mode='additive', 
                   seasonality_reg=0, 
                   season_global_local='global', 
                   n_forecasts=12, 
                   n_lags=12, 
                   num_hidden_layers=0, 
                   d_hidden=None, 
                   ar_reg=None, 
                   learning_rate=None, 
                   epochs=None, 
                   batch_size=None, 
                   loss_func='MSE', #quantile loss?
                   optimizer='AdamW', 
                   newer_samples_weight=2, 
                   newer_samples_start=0.0, 
                   quantiles=None, 
                   impute_missing=True, 
                   impute_linear=10, 
                   impute_rolling=10, 
                   drop_missing=False, 
                   collect_metrics=True, 
                   normalize='auto', 
                   global_normalization=False, 
                   global_time_normalization=True, 
                   unknown_data_normalization=False)

In [None]:
# add past covariates to the model
NP.add_lagged_regressor(['CREATED_OPPORTUNITIES','CREATED_LEADS'])

INFO - (NP.forecaster.add_lagged_regressor) - n_lags = 'auto', number of lags for regressor is set to Autoregression number of lags (12)


<neuralprophet.forecaster.NeuralProphet at 0x7f867fd3b670>

In [None]:
# train heuristics model
train_metrics = NP.fit(df_train)

INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [95.833]% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as MS
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling yearly seasonality. Run NeuralProphet with yearly_seasonality=True to override this.
INFO - (NP.utils.set_auto_seasonalities) - Disabling weekly seasonality. Run NeuralProphet with weekly_seasonality=True to override this.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 1
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 1000


Finding best initial lr:   0%|          | 0/202 [00:00<?, ?it/s]

Training: 0it [00:00, ?it/s]

In [None]:
# generate prediction
predicted = NP.predict(enlarged_df)

INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [97.222]% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - MS
INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [97.222]% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - MS


Predicting: 1it [00:00, ?it/s]

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column


In [None]:
# clip values below 0
predicted['yhat12'] = [x if (x!=x or x >= 0) else 0 for x in predicted['yhat12']]

In [None]:
# plot heuristics forecast
plt.figure(figsize=(15, 7.5))
plt.plot(predicted.set_index('ds')['yhat12'], color='r', label='model')
plt.axvspan(predicted.set_index('ds').index[-12], predicted.set_index('ds').index[-1], alpha=0.5, color='lightgrey')
plt.plot(df.set_index("ds")['y'], label='actual')
plt.grid(axis='y')
plt.xlabel('DATE', fontsize=20)
plt.ylabel('WON_REVENUE', fontsize=20)
plt.legend()
plt.show()

In [None]:
# plot NP parameters
NP.plot_parameters()

In [None]:
print('RMSE:',math.sqrt(mean_squared_error(df_test['y'], predicted['yhat12'][-12:])))
print('Explained variance:',explained_variance_score(df_test['y'], predicted['yhat12'][-12:]))

RMSE: 28072.133176794414
Explained variance: -171.17716216665076


## Group 2 - prototype dataset C6

C6 is of medium length and shows high volatility with strong irregularity.

In [None]:
# load Sales model data from Google Sheets
worksheet = gc.open('C6').sheet1
rows = worksheet.get_all_values()

# convert to dataframe
opp_df = pd.DataFrame.from_records(rows[1:],columns=rows[0])
opp_df['VALUE'] = pd.to_numeric(opp_df['VALUE'])
opp_df['DATE'] = pd.to_datetime(opp_df['DATE'])
opp_df.head()

In [None]:
# Group and pivot data
opp_df = pd.DataFrame(opp_df.groupby(['DATE','METRIC'])['VALUE'].sum()).reset_index()
opp_df = opp_df.pivot(index=['DATE'],columns=['METRIC'],values=['VALUE'])
opp_df.columns = [col[1] for col in opp_df.columns]
opp_df.head()

In [None]:
# load Marketing model data from Google Sheets
worksheet = gc.open('C6').worksheet('Lead Print Out')
rows = worksheet.get_all_values()

# convert to dataframe
lead_df = pd.DataFrame.from_records(rows[1:],columns=rows[0])
lead_df['VALUE'] = pd.to_numeric(lead_df['VALUE'])
lead_df['DATE'] = pd.to_datetime(lead_df['DATE'])
lead_df.head()

In [None]:
# Group and pivot data
lead_df = pd.DataFrame(lead_df.groupby(['DATE','METRIC'])['VALUE'].sum()).reset_index()
lead_df = lead_df.pivot(index=['DATE'],columns=['METRIC'],values=['VALUE'])
lead_df.columns = [col[1] for col in lead_df.columns]
lead_df.head()

In [None]:
# join datasets on DATE index
df = opp_df.join(lead_df)
df.head()

In [None]:
# drop irrelevant columns
df = df.drop(['SIX_MONTHS_ACV','SIX_MONTHS_OPP_WON_CVR','THREE_MONTHS_ACV','THREE_MONTHS_OPP_WON_CVR','THREE_MONTHS_LEAD_MQL_CVR','THREE_MONTHS_LEAD_OPP_CVR','THREE_MONTHS_MQL_OPP_CVR','WON_CUSTOMERS','MQLS'],axis=1)
df.head()

In [None]:
# adjust data format to NP requirements
df = df.reset_index()
df.columns = ['ds','CREATED_OPPORTUNITIES','y','CREATED_LEADS']
df

#### Full series

In [None]:
# split data
df_train = df[:-12]
df_test = df[-12:]

In [None]:
# create enlarged train series for prediction generation (12 future months with unknown values)
future_df = pd.DataFrame({'ds': pd.date_range('2022-01-01', periods=12, freq='MS'),
                          'CREATED_OPPORTUNITIES': np.nan,
                          'CREATED_LEADS': np.nan,
                          'y': np.nan})
enlarged_df = pd.concat([df_train, future_df])  # DataFrame.append was removed in pandas 2.0
enlarged_df.tail(13)

##### Heuristics
For the heuristics approach, n_forecasts was set equal to the forecasting horizon (12 periods), n_lags to twice the forecasting horizon (24 periods), num_hidden_layers to 0, and loss_func to the MSE.

In [None]:
# initialize heuristics model
set_random_seed(0)
NP = NeuralProphet(growth='linear',
                   changepoints=None,
                   n_changepoints=10,
                   changepoints_range=0.8,
                   trend_reg=0, 
                   trend_reg_threshold=False, 
                   trend_global_local='global', 
                   yearly_seasonality='auto', 
                   weekly_seasonality='auto', 
                   daily_seasonality='auto', 
                   seasonality_mode='additive', 
                   seasonality_reg=0, 
                   season_global_local='global', 
                   n_forecasts=12, 
                   n_lags=24, 
                   num_hidden_layers=0, 
                   d_hidden=None, 
                   ar_reg=None, 
                   learning_rate=None, 
                   epochs=None, 
                   batch_size=None, 
                   loss_func='MSE',
                   optimizer='AdamW', 
                   newer_samples_weight=2, 
                   newer_samples_start=0.0, 
                   quantiles=None, 
                   impute_missing=True, 
                   impute_linear=10, 
                   impute_rolling=10, 
                   drop_missing=False, 
                   collect_metrics=True, 
                   normalize='auto', 
                   global_normalization=False, 
                   global_time_normalization=True, 
                   unknown_data_normalization=False)

In [None]:
# add past covariates to the model
NP.add_lagged_regressor(['CREATED_OPPORTUNITIES','CREATED_LEADS'])

INFO - (NP.forecaster.add_lagged_regressor) - n_lags = 'auto', number of lags for regressor is set to Autoregression number of lags (24)


<neuralprophet.forecaster.NeuralProphet at 0x7f86801b44c0>

In [None]:
# train heuristics model
train_metrics = NP.fit(df_train)

INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [98.507]% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as MS
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling weekly seasonality. Run NeuralProphet with weekly_seasonality=True to override this.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 16
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 1000


Finding best initial lr:   0%|          | 0/205 [00:00<?, ?it/s]

Training: 0it [00:00, ?it/s]

In [None]:
# generate prediction
predicted = NP.predict(enlarged_df)

INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [98.734]% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - MS
INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [98.734]% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - MS


Predicting: 2it [00:00, ?it/s]

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column


In [None]:
# clip values below 0
predicted['yhat12'] = [x if (x!=x or x >= 0) else 0 for x in predicted['yhat12']]

In [None]:
# plot heuristics forecast
plt.figure(figsize=(15, 7.5))
plt.plot(predicted.set_index('ds')['yhat12'], color='r', label='model')
plt.axvspan(predicted.set_index('ds').index[-12], predicted.set_index('ds').index[-1], alpha=0.5, color='lightgrey')
plt.plot(df.set_index("ds")['y'], label='actual')
plt.grid(axis='y')
plt.xlabel('DATE', fontsize=20)
plt.ylabel('WON_REVENUE', fontsize=20)
plt.legend()
plt.show()

In [None]:
# plot NP parameters
NP.plot_parameters()

In [None]:
print('RMSE:',math.sqrt(mean_squared_error(df_test['y'], predicted['yhat12'][-12:])))
print('Explained variance:',explained_variance_score(df_test['y'], predicted['yhat12'][-12:]))

RMSE: 158705.36982290875
Explained variance: 0.052086937114757315


##### Optimized
For the optimization approach, n_forecasts was again set equal to the forecasting horizon (12 periods) and loss_func to the MSE. This time, however, n_lags was varied from 1 to 24 and num_hidden_layers from 0 to 2. In addition, a validation split of 12 periods was taken from the train set so that the configuration could be selected by validation loss.

In [None]:
# initialize lists to store configurations along with the respective train, validation and test loss
'''configs = []
test_losses = []
train_losses = []
val_losses = []'''

In [None]:
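# Grid search over num_hidden_layers for a fixed n_lags = p; rerun with p = 1, ..., 24 (wrapped in a string literal so that it is not executed by default)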
'''p = 1
for l in range(0,3):
    set_random_seed(0)
    NP = NeuralProphet(growth='linear',
                      n_lags=p, 
                      n_forecasts=12,
                      num_hidden_layers=l, 
                      d_hidden=None,
                      loss_func='MSE')
    NP.add_lagged_regressor(['CREATED_OPPORTUNITIES','CREATED_LEADS'])
    train,val = NP.split_df(df_train,valid_p=1)
    train_metrics = NP.fit(train)
    val_metrics = NP.test(val)
    configs.append((l,p))
    predicted = NP.predict(enlarged_df)
    test_losses.append(mean_squared_error(df_test['y'], predicted['yhat12'][-12:]))
    train_losses.append(train_metrics.iloc[-1,2])
    val_losses.append(val_metrics.iloc[0,0]) '''

In [None]:
# sorted results from hyperparameter tuning
results = pd.DataFrame({'config':configs,'test loss':test_losses, 'val loss':val_losses, 'train loss':train_losses})
results.sort_values('val loss')

| | config | test loss | val loss | train loss |
|---|---|---|---|---|
| 6 | (0, 3) | 2.305943e+10 | 0.530428 | 0.022731 |
| 0 | (0, 1) | 2.312171e+10 | 0.563162 | 0.024955 |
| 3 | (0, 2) | 2.604733e+10 | 0.574342 | 0.023598 |
| 29 | (2, 10) | 2.844907e+10 | 0.592984 | 0.018774 |
| 1 | (1, 1) | 2.865230e+10 | 0.599964 | 0.025180 |
| ... | ... | ... | ... | ... |
| 33 | (0, 12) | 4.957205e+10 | 1.517755 | 0.000027 |
| 40 | (1, 14) | 2.761474e+10 | 1.615229 | 0.003377 |
| 53 | (2, 18) | 4.480163e+10 | 1.729113 | 0.006697 |
| 46 | (1, 16) | 3.751091e+10 | 1.761422 | 0.001640 |
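
The selection rule can be replayed on the rows displayed above. The sketch below copies only those nine visible rows (the elided middle of the table is not reconstructed) and picks the configuration by validation loss. It also shows that the lowest train loss belongs to a configuration with a poor validation loss, which is why the train loss was not used for selection.

```python
# (num_hidden_layers, n_lags) -> (test loss, val loss, train loss); displayed rows only
displayed = {
    (0, 3): (2.305943e10, 0.530428, 0.022731),
    (0, 1): (2.312171e10, 0.563162, 0.024955),
    (0, 2): (2.604733e10, 0.574342, 0.023598),
    (2, 10): (2.844907e10, 0.592984, 0.018774),
    (1, 1): (2.865230e10, 0.599964, 0.025180),
    (0, 12): (4.957205e10, 1.517755, 0.000027),
    (1, 14): (2.761474e10, 1.615229, 0.003377),
    (2, 18): (4.480163e10, 1.729113, 0.006697),
    (1, 16): (3.751091e10, 1.761422, 0.001640),
}

best_by_val = min(displayed, key=lambda cfg: displayed[cfg][1])
best_by_train = min(displayed, key=lambda cfg: displayed[cfg][2])

print('selected by validation loss:', best_by_val)
print('lowest train loss:', best_by_train, 'with val loss', displayed[best_by_train][1])
```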


In [None]:
# initialize optimal model with n_lags=3 and num_hidden_layers=0
set_random_seed(0)
NP = NeuralProphet(growth='linear',
                   changepoints=None,
                   n_changepoints=10,
                   changepoints_range=0.8,
                   trend_reg=0, 
                   trend_reg_threshold=False, 
                   trend_global_local='global', 
                   yearly_seasonality='auto', 
                   weekly_seasonality='auto', 
                   daily_seasonality='auto', 
                   seasonality_mode='additive', 
                   seasonality_reg=0, 
                   season_global_local='global', 
                   n_forecasts=12, 
                   n_lags=3, 
                   num_hidden_layers=0, 
                   d_hidden=None, 
                   ar_reg=None, 
                   learning_rate=None, 
                   epochs=None, 
                   batch_size=None, 
                   loss_func='MSE',
                   optimizer='AdamW', 
                   newer_samples_weight=2, 
                   newer_samples_start=0.0, 
                   quantiles=None, 
                   impute_missing=True, 
                   impute_linear=10, 
                   impute_rolling=10, 
                   drop_missing=False, 
                   collect_metrics=True, 
                   normalize='auto', 
                   global_normalization=False, 
                   global_time_normalization=True, 
                   unknown_data_normalization=False)

In [None]:
# add past covariates to the model
NP.add_lagged_regressor(['CREATED_OPPORTUNITIES','CREATED_LEADS'])

INFO - (NP.forecaster.add_lagged_regressor) - n_lags = 'auto', number of lags for regressor is set to Autoregression number of lags (3)


<neuralprophet.forecaster.NeuralProphet at 0x7f86803d3250>

In [None]:
# train optimized model
train_metrics = NP.fit(df_train)

INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [98.507]% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as MS
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling weekly seasonality. Run NeuralProphet with weekly_seasonality=True to override this.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 16
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 831


Finding best initial lr:   0%|          | 0/205 [00:00<?, ?it/s]

Training: 0it [00:00, ?it/s]

In [None]:
# generate prediction
predicted = NP.predict(enlarged_df)

INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [98.734]% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - MS
INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [98.734]% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - MS


Predicting: 4it [00:00, ?it/s]

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column


In [None]:
# clip values below 0
predicted['yhat12'] = [x if (x!=x or x >= 0) else 0 for x in predicted['yhat12']]

In [None]:
# plot optimized forecast
plt.figure(figsize=(15, 7.5))
plt.plot(predicted.set_index('ds')['yhat12'], color='r', label='model')
plt.axvspan(predicted.set_index('ds').index[-12], predicted.set_index('ds').index[-1], alpha=0.5, color='lightgrey')
plt.plot(df.set_index("ds")['y'], label='actual')
plt.grid(axis='y')
plt.xlabel('DATE', fontsize=20)
plt.ylabel('WON_REVENUE', fontsize=20)
plt.legend()
plt.show()

In [None]:
# plot NP parameters
NP.plot_parameters()

In [None]:
# evaluate the 12-step-ahead forecast against the held-out test year
print('RMSE:', math.sqrt(mean_squared_error(df_test['y'], predicted['yhat12'][-12:])))
print('Explained variance:', explained_variance_score(df_test['y'], predicted['yhat12'][-12:]))

RMSE: 171660.69782902786
Explained variance: -0.24798697918613666
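
A negative explained variance, as reported above, means that the forecast residuals vary more than the actual values themselves. The following minimal sketch recomputes both metrics in plain Python to make that visible. The helper names and the numbers are illustrative assumptions and not thesis data; the explained variance formula is the standard 1 - Var(y - yhat) / Var(y).

```python
import math
from statistics import pvariance


def rmse(actual, forecast):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def explained_variance(actual, forecast):
    """1 - Var(residuals) / Var(actual), using population variances."""
    residuals = [a - f for a, f in zip(actual, forecast)]
    return 1 - pvariance(residuals) / pvariance(actual)


# made-up illustrative values (not taken from C4 or C6)
actual = [10.0, 12.0, 11.0, 13.0]
close_forecast = [10.5, 11.5, 11.5, 12.5]  # small residuals
poor_forecast = [13.0, 10.0, 14.0, 9.0]    # residuals spread wider than the actuals

for name, forecast in [('close', close_forecast), ('poor', poor_forecast)]:
    print(name, 'RMSE:', rmse(actual, forecast),
          'Explained variance:', explained_variance(actual, forecast))
```

The poor forecast yields a score below zero because its residual variance exceeds the variance of the actual series, which is the same pattern the model output above shows.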


#### 2 years window
For the 2 years window, only the heuristics approach was pursued. Accordingly, n_forecasts was set equal to the forecasting horizon (12 periods), num_hidden_layers to 0, and loss_func to MSE. Because the training series is shortened to 24 months, n_lags could be set to at most 12: 12 lags plus 12 forecast steps already use up the entire training series.
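
The limit on n_lags can be checked with a small back-of-the-envelope calculation. The sketch below assumes that each training sample needs n_lags past values followed by n_forecasts target values, so a series with n_rows observations yields n_rows - n_lags - n_forecasts + 1 sliding windows. The function name is hypothetical and not part of NeuralProphet.

```python
def count_training_windows(n_rows: int, n_lags: int, n_forecasts: int) -> int:
    """Number of sliding (input, target) windows a series of n_rows can yield."""
    return max(n_rows - n_lags - n_forecasts + 1, 0)


train_rows = 24  # 2 years of monthly training data
horizon = 12     # forecasting horizon used for n_forecasts

# compare a shorter lag window, the chosen one, and one that is too long
for lags in (6, 12, 13):
    windows = count_training_windows(train_rows, lags, horizon)
    print(f'n_lags={lags}: {windows} training window(s)')
```

Under this assumption, n_lags = 12 leaves a single training window, and any larger value leaves none.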

In [None]:
# shorten dataset to only retain most recent three years (2 years of training data + 1 year of test data)
df = df[df['ds']>=pd.Timestamp("2020-01-01")]
df.head()

In [None]:
# split data
df_train = df[:-12]
df_test = df[-12:]

In [None]:
# create enlarged train series for prediction generation
# DataFrame.append was removed in pandas 2.0, so the 12 empty future rows are added via pd.concat
future = pd.DataFrame({'ds': pd.date_range('2022-01-01', periods=12, freq='MS'), 'CREATED_OPPORTUNITIES': np.nan, 'CREATED_LEADS': np.nan, 'y': np.nan})
enlarged_df = pd.concat([df_train, future])
enlarged_df.tail(13)

In [None]:
# initialize heuristics model
set_random_seed(0)
NP = NeuralProphet(growth='linear',
                   changepoints=None,
                   n_changepoints=10,
                   changepoints_range=0.8,
                   trend_reg=0, 
                   trend_reg_threshold=False, 
                   trend_global_local='global', 
                   yearly_seasonality='auto', 
                   weekly_seasonality='auto', 
                   daily_seasonality='auto', 
                   seasonality_mode='additive', 
                   seasonality_reg=0, 
                   season_global_local='global', 
                   n_forecasts=12, 
                   n_lags=12, 
                   num_hidden_layers=0, 
                   d_hidden=None, 
                   ar_reg=None, 
                   learning_rate=None, 
                   epochs=None, 
                   batch_size=None, 
                   loss_func='MSE', #quantile loss?
                   optimizer='AdamW', 
                   newer_samples_weight=2, 
                   newer_samples_start=0.0, 
                   quantiles=None, 
                   impute_missing=True, 
                   impute_linear=10, 
                   impute_rolling=10, 
                   drop_missing=False, 
                   collect_metrics=True, 
                   normalize='auto', 
                   global_normalization=False, 
                   global_time_normalization=True, 
                   unknown_data_normalization=False)

In [None]:
# add past covariates to the model
NP.add_lagged_regressor(['CREATED_OPPORTUNITIES','CREATED_LEADS'])

INFO - (NP.forecaster.add_lagged_regressor) - n_lags = 'auto', number of lags for regressor is set to Autoregression number of lags (12)


<neuralprophet.forecaster.NeuralProphet at 0x7f867f8cfb50>

In [None]:
# train heuristics model
train_metrics = NP.fit(df_train)

INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [95.833]% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as MS
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling yearly seasonality. Run NeuralProphet with yearly_seasonality=True to override this.
INFO - (NP.utils.set_auto_seasonalities) - Disabling weekly seasonality. Run NeuralProphet with weekly_seasonality=True to override this.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 1
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 1000


Finding best initial lr:   0%|          | 0/202 [00:00<?, ?it/s]

Training: 0it [00:00, ?it/s]

In [None]:
# generate prediction
predicted = NP.predict(enlarged_df)

INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [97.222]% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - MS
INFO - (NP.df_utils._infer_frequency) - Major frequency MS corresponds to [97.222]% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - MS


Predicting: 1it [00:00, ?it/s]

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column


In [None]:
# plot heuristics forecast
plt.figure(figsize=(15, 7.5))
plt.plot(predicted.set_index('ds')['yhat12'], color='r', label='model')
plt.axvspan(predicted.set_index('ds').index[-12], predicted.set_index('ds').index[-1], alpha=0.5, color='lightgrey')
plt.plot(df.set_index("ds")['y'], label='actual')
plt.grid(axis='y')
plt.xlabel('DATE', fontsize=20)
plt.ylabel('WON_REVENUE', fontsize=20)
plt.legend()
plt.show()

In [None]:
# plot NP parameters
NP.plot_parameters()

In [None]:
# evaluate the 12-step-ahead forecast against the held-out test year
print('RMSE:', math.sqrt(mean_squared_error(df_test['y'], predicted['yhat12'][-12:])))
print('Explained variance:', explained_variance_score(df_test['y'], predicted['yhat12'][-12:]))

RMSE: 330643.71835086984
Explained variance: -2.08103354321336
