## Experiment 1: The role of variable importance
 
  
Feature importance plays a fundamental role in our method. Given a new product to forecast, its closest neighbours are found with a distance computed on the feature values scaled by their feature importance. The inverse of this distance is then used to weight each internal prediction when calculating the final forecast.
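
As a minimal sketch of the distance described above, the snippet below assumes an importance-weighted Euclidean form, $d(\mathbf{a},\mathbf{b}) = \sqrt{\sum_j v_j (a_j - b_j)^2}$. The function name `importance_weighted_distance` is hypothetical and is not part of `contrastiveRegressor`. The form is inferred from the notebook output rather than from the library source: it reproduces the distance reported for the first neighbour in the first forecast table of this notebook.

```python
import numpy as np

def importance_weighted_distance(a, b, importance):
    """Euclidean distance with each squared difference scaled by its importance.

    Assumption: this form is inferred from the notebook output, not taken
    from the contrastiveRegressor source.
    """
    a, b, importance = (np.asarray(v, dtype=float) for v in (a, b, importance))
    return float(np.sqrt(np.sum(importance * (a - b) ** 2)))

# Values copied from the first forecast table of this notebook
variable_importance = [41.169522, 34.755821, 15.565136, 3.822892, 4.686629]
new_product = [0.1, 0.5, 0.5, 0.5, 0.5]
neighbour_266 = [0.115239, 0.543172, 0.651175, 0.672674, 0.627717]

distance = importance_weighted_distance(new_product, neighbour_266, variable_importance)
# The table reports y_train_distances = 0.787715 for this neighbour
print(f"{distance:.4f}")
```

Under this assumed form, a variable with a small importance (here `x_4`) contributes little to the distance even when its values differ widely.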

The purpose of this first experiment is to demonstrate that our method recognises the importance of the variables that contribute to promotional sales, and to interpret the results that the model produces.

A simple way to evaluate whether the model finds the relevant features that drive the sales is to use a linear model in which the contribution of each independent variable to the response variable is known.

Let us generate 500 samples of a model with 5 independent variables ($x_1, \dots, x_5$) drawn from the uniform distribution $\mathcal{U}(0,1)$, so that $\mathbf{X} \in \mathbb{R}^{500 \times 5}$. To define the impact of the independent variables on the response variable, we use a vector of weights $\mathbf{w} = [42,34,16,0,8]^\top$ with $\sum_{i=1}^5 w_i = 100$. The response variable $\mathbf{y} \in \mathbb{R}^{500}$ is the linear combination $\mathbf{y}=\mathbf{X}\mathbf{w}$. We then train our model on $\mathbf{X}$ and $\mathbf{y}$ using _CatBoost_ as the base regressor, with a learning rate of 0.08, a validation set of 20%, a tree depth of 12 and 300 iterations.


In [3]:
import fcn_helpers as fhelp
import pandas as pd
import numpy as np
from contrastiveRegressor import contrastiveRegressor
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor
import preprocessing_utils as pt
import datetime as dt

Define a linear model in which the sales are the product of the input variables and predefined weights.

In [4]:
# Fake sales
experiment_label = 'linear_model'
num_samples = 500
num_features = 5
input_vars = [f'x_{idx}' for idx in range(1,num_features+1)]
input_data = np.random.rand(num_samples, num_features)

weights = np.array([42,34,16,0,8])
y_train = np.dot(input_data, weights.T)

df = pd.DataFrame(input_data, columns=input_vars)


# Ad-hoc test set to see the influence of the variables
df_test = pd.DataFrame([{'x_1': 0.1, 'x_2': 0.5, 'x_3': 0.5, 'x_4': 0.5, 'x_5': 0.5},
{'x_1': 0.9, 'x_2': 0.5, 'x_3': 0.5, 'x_4': 0.5, 'x_5':0.5 },
{'x_1': 0.5, 'x_2': 0.9, 'x_3': 0.5, 'x_4': 0.5, 'x_5':0.5 },
{'x_1': 0.5, 'x_2': 0.1, 'x_3': 0.5, 'x_4': 0.5, 'x_5':0.5 }])
# Response variable
y_actual = np.dot(df_test.values, weights.T)
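
The ad-hoc test set changes one variable at a time around the midpoint 0.5, so the expected responses can be checked by hand. The following standalone sketch recomputes them from the weights and test rows defined above; it only re-derives `y_actual` and does not call the model.

```python
import numpy as np

weights = np.array([42, 34, 16, 0, 8])
# The four ad-hoc test rows defined above (columns x_1 ... x_5)
x_test = np.array([
    [0.1, 0.5, 0.5, 0.5, 0.5],
    [0.9, 0.5, 0.5, 0.5, 0.5],
    [0.5, 0.9, 0.5, 0.5, 0.5],
    [0.5, 0.1, 0.5, 0.5, 0.5],
])
y_actual = x_test @ weights
print(np.round(y_actual, 1))  # [33.2 66.8 63.6 36.4]

# Moving x_1 from 0.1 to 0.9 shifts the response by 0.8 * 42,
# moving x_2 from 0.1 to 0.9 shifts it by 0.8 * 34
shift_x1 = y_actual[1] - y_actual[0]
shift_x2 = y_actual[2] - y_actual[3]
```

Because only `x_1` or `x_2` moves between rows, the difference between paired rows isolates the weight of that variable.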

In [5]:
# Get the actual weights in a DF
df_actual_weights = pd.DataFrame(weights, index=df_test.columns, columns=['Weights'])

Set the parameters and the base regressor for the contrastive algorithm (the parameters are fixed; HyperOpt is not used in this experiment).

#### Train the model using CatBoost as the base regressor

In [6]:
numericalVars = input_vars
categoricalVars = []

num_inputVars = len(input_vars)

# Hyper-parameters
num_neighbours = 5
validation_test_size = 0.20
feat_importance_keyword = 'feature_importances_'
# Regressor
num_iterations = 300
learning_rate  = 0.08 
depth = 12
# CatBoost
cb_model = CatBoostRegressor(iterations=num_iterations, learning_rate=learning_rate,
depth=depth, loss_function='RMSE', cat_features=None, silent=True)
# Create the forecaster
contrastiveReg = contrastiveRegressor(num_neighbours = num_neighbours, 
  validation_test_size = validation_test_size)

# Set the regressor
contrastiveReg.set_regressor(cb_model, feat_importance_keyword, input_vars)
# fit the regressor
contrastiveReg.fit(df.values, y_train)
# eval results
contrastiveReg.predict_eval_test()
eval_results = contrastiveReg.get_results()

Preparing Training set...
Training set (2000, 10). Evaluation (500, 10)...done.
...Symmetrical Weights
MAE: 1.20
MSE: 2.23
RMSE: 1.49
meanError: 0.19
MAPE: 2.69
R2: 0.99
frc_error: 0.02
frc_bias: 0.00
frc_acc: 1.00
Var explained: 0.99


In [7]:
# Predict
contrastiveReg.predict(df_test.values)
cold_start_results = contrastiveReg.get_results()

...Symmetrical Weights


#### First question to address

*Is the variable importance representative of the mechanisms driving the sales?*

To facilitate interpretability, the model returns $\mathbf{v'}= \mathbf{v^{neig}} + \mathbf{v^{ref}}$ as the variable importance. The individual contributions can still be retrieved as _contrastiveReg.x_weights_ and _contrastiveReg.x_ref_weights_.


In [8]:
# Sort by importance
df_feature_importances = cold_start_results.get('df_feat_importances', None)
df_feature_importances.columns = ['variable_importance']
pd.concat([df_actual_weights, df_feature_importances], axis=1)

| | Weights | variable_importance |
|---|---|---|
| x_1 | 42 | 41.169522 |
| x_2 | 34 | 34.755821 |
| x_3 | 16 | 15.565136 |
| x_4 | 0 | 3.822892 |
| x_5 | 8 | 4.686629 |
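
One way to read the table above is to check whether the learned importances order the variables the same way as the true weights, and where the largest gap lies. The sketch below does this with plain pandas on the values copied from the table; the column `abs_diff` and the ranking comparison are illustrative additions, not part of the model output.

```python
import pandas as pd

# Values copied from the table above
df_compare = pd.DataFrame(
    {
        "Weights": [42, 34, 16, 0, 8],
        "variable_importance": [41.169522, 34.755821, 15.565136, 3.822892, 4.686629],
    },
    index=["x_1", "x_2", "x_3", "x_4", "x_5"],
)
df_compare["abs_diff"] = (df_compare["Weights"] - df_compare["variable_importance"]).abs()

# Do the true weights and the learned importances order the variables identically?
rank_true = df_compare["Weights"].sort_values(ascending=False).index.tolist()
rank_learned = df_compare["variable_importance"].sort_values(ascending=False).index.tolist()
same_ranking = rank_true == rank_learned

print(df_compare.round(2))
print("Same ranking:", same_ranking)
```

In this run the ranking is identical, and the largest absolute gap is on `x_4`, whose true weight is zero but which still receives a small importance.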


In [9]:
# Forecast errors
y_forecast = cold_start_results['y_hat_weighted']
_ = contrastiveReg.get_frc_errors(y_actual, y_forecast)

MAE: 1.42
MSE: 2.02
RMSE: 1.42
meanError: -0.65
MAPE: 3.17
R2: 0.99
frc_error: 0.03
frc_bias: -0.01
frc_acc: 1.01
Var explained: 0.99
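
The implementation of `get_frc_errors` is not shown in this notebook. As a rough cross-check, the sketch below assumes textbook definitions of MAE, MSE, RMSE, mean error and MAPE, with the error taken as actual minus forecast (an assumption that matches the negative `meanError` above). The forecasts are the `y_weighted_forecast` values listed in the per-product tables of this notebook; the remaining `frc_*` metrics are not reproduced because their definitions are unknown.

```python
import numpy as np

y_actual = np.array([33.2, 66.8, 63.6, 36.4])
y_forecast = np.array([34.636666, 67.977817, 62.069896, 37.91765])

error = y_actual - y_forecast  # assumed sign convention
mae = np.mean(np.abs(error))
mse = np.mean(error ** 2)
rmse = np.sqrt(mse)
mean_error = np.mean(error)
mape = 100 * np.mean(np.abs(error) / np.abs(y_actual))

print(f"MAE: {mae:.2f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}")
print(f"meanError: {mean_error:.2f}  MAPE: {mape:.2f}")
```

MAE and RMSE coincide at two decimals here because the four absolute errors are of similar size.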


In [10]:
# Predict using random neighbours to see if it makes a difference
y_hat_random = fhelp.frc_with_random_neighbours(contrastiveReg.X_train, df_test.values,
  contrastiveReg.num_neighbours, contrastiveReg)
# Predict with CatBoost (original)
y_hat_catboost = fhelp.frc_plain_CatBoost(num_neighbours, validation_test_size,
    num_iterations, learning_rate, depth,
    contrastiveReg.X_train, contrastiveReg.y_train, df_test.values)


0:	learn: 16.6999983	test: 16.6974534	best: 16.6974534 (0)	total: 50.9ms	remaining: 15.2s
50:	learn: 2.6021158	test: 6.0040334	best: 6.0040334 (50)	total: 2.46s	remaining: 12s
100:	learn: 0.7473834	test: 4.8439246	best: 4.8439246 (100)	total: 4.85s	remaining: 9.56s
150:	learn: 0.3155133	test: 4.6190509	best: 4.6190509 (150)	total: 7.23s	remaining: 7.14s
200:	learn: 0.1611658	test: 4.5457619	best: 4.5457619 (200)	total: 9.69s	remaining: 4.77s
250:	learn: 0.0954347	test: 4.5224198	best: 4.5224198 (250)	total: 12.1s	remaining: 2.36s
299:	learn: 0.0602161	test: 4.5145075	best: 4.5144976 (298)	total: 14.5s	remaining: 0us

bestTest = 4.514497627
bestIteration = 298

Shrink model to first 299 iterations.


Arrange all the forecasts in a single DataFrame: the contrastive results plus the CatBoost forecast.

In [11]:
all_cold_forecast = []

for idx_review in range(y_actual.shape[0]):
  df_forecast_ext = contrastiveReg.arrange_regressor_results(idx_review, df,
    y_train, None, input_vars,
    None, df_test, y_actual, num_inputVars)
  df_forecast_ext['y_hat_catboost'] = ''
  df_forecast_ext.reset_index(inplace=True)
  # Write the CatBoost forecast into the second-to-last row (the test product) with a
  # single .loc call; chained indexing (df[col].iloc[-2] = ...) raises SettingWithCopyWarning
  df_forecast_ext.loc[df_forecast_ext.index[-2], 'y_hat_catboost'] = y_hat_catboost[idx_review]
  all_cold_forecast.append(df_forecast_ext)

# Append them all
df_all_cold_forecast = pd.concat(all_cold_forecast)


In [12]:
# Review first forecast
all_cold_forecast[0]

| | index | x_1 | x_2 | x_3 | x_4 | x_5 | y_train | delta_y_train | y_train_plus_delta | y_train_distances | y_actual | y_forecast | y_weighted_forecast | y_hat_catboost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 266 | 0.115239 | 0.543172 | 0.651175 | 0.672674 | 0.627717 | 38.748421 | -2.993457 | 35.754964 | 0.787715 | | | | |
| 1 | 185 | 0.100129 | 0.398977 | 0.355752 | 0.616011 | 0.647458 | 28.642321 | 7.08661 | 35.728931 | 0.912105 | | | | |
| 2 | 94 | 0.113257 | 0.35213 | 0.44567 | 0.447399 | 0.446392 | 27.431068 | 6.210376 | 33.641444 | 0.914974 | | | | |
| 3 | 462 | 0.007468 | 0.519497 | 0.320056 | 0.485472 | 0.584748 | 27.775438 | 6.788332 | 34.56377 | 0.950884 | | | | |
| 4 | 304 | 0.096481 | 0.662936 | 0.563723 | 0.327809 | 0.547491 | 39.991502 | -6.885717 | 33.105786 | 1.053723 | | | | |
| 5 | 0 | 0.1 | 0.5 | 0.5 | 0.5 | 0.5 | | | | | 33.2 | 34.558979 | 34.636666 | 35.0799 |
| 6 | variable_importance | 41.169522 | 34.755821 | 15.565136 | 3.822892 | 4.686629 | | | | | | | | |
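
The columns of the table above can be combined by hand to see how the two contrastive forecasts arise. The sketch below assumes that `y_forecast` is the plain mean of `y_train_plus_delta` over the five neighbours and that `y_weighted_forecast` is the same mean weighted by the inverse of `y_train_distances`. These are assumptions inferred from the numbers in the table, not taken from the `contrastiveRegressor` source.

```python
import numpy as np

# Neighbour rows copied from the table above
y_train = np.array([38.748421, 28.642321, 27.431068, 27.775438, 39.991502])
delta_y_train = np.array([-2.993457, 7.08661, 6.210376, 6.788332, -6.885717])
distances = np.array([0.787715, 0.912105, 0.914974, 0.950884, 1.053723])

# Each internal prediction: the neighbour's sales plus its delta
y_train_plus_delta = y_train + delta_y_train

# Plain average over the neighbours
y_forecast = y_train_plus_delta.mean()

# Inverse-distance weighted average: closer neighbours count more
inverse_distance = 1.0 / distances
y_weighted_forecast = np.sum(inverse_distance * y_train_plus_delta) / inverse_distance.sum()

print(f"{y_forecast:.3f} {y_weighted_forecast:.3f}")
```

Both values agree with the `y_forecast` and `y_weighted_forecast` entries in the test-product row of the table, up to the rounding of the displayed inputs.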


In [13]:
# Review the second forecast
all_cold_forecast[1]

| | index | x_1 | x_2 | x_3 | x_4 | x_5 | y_train | delta_y_train | y_train_plus_delta | y_train_distances | y_actual | y_forecast | y_weighted_forecast | y_hat_catboost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 190 | 0.861984 | 0.543925 | 0.498406 | 0.445981 | 0.517049 | 66.807685 | 1.642119 | 68.449804 | 0.372978 | | | | |
| 1 | 334 | 0.865691 | 0.547308 | 0.407673 | 0.641035 | 0.460283 | 65.172494 | 0.431251 | 65.603744 | 0.585117 | | | | |
| 2 | 279 | 0.941851 | 0.389306 | 0.408907 | 0.537923 | 0.416496 | 62.668618 | 6.027992 | 68.696611 | 0.815669 | | | | |
| 3 | 271 | 0.827327 | 0.553836 | 0.628042 | 0.697368 | 0.508695 | 67.696412 | 0.289189 | 67.985601 | 0.85007 | | | | |
| 4 | 93 | 0.832146 | 0.564888 | 0.673431 | 0.371335 | 0.540361 | 69.254124 | 0.502444 | 69.756568 | 0.935404 | | | | |
| 5 | 1 | 0.9 | 0.5 | 0.5 | 0.5 | 0.5 | | | | | 66.8 | 68.098466 | 67.977817 | 67.6128 |
| 6 | variable_importance | 41.169522 | 34.755821 | 15.565136 | 3.822892 | 4.686629 | | | | | | | | |


In [14]:
# Review the third forecast
all_cold_forecast[2]

| | index | x_1 | x_2 | x_3 | x_4 | x_5 | y_train | delta_y_train | y_train_plus_delta | y_train_distances | y_actual | y_forecast | y_weighted_forecast | y_hat_catboost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67 | 0.573351 | 0.824451 | 0.542359 | 0.419824 | 0.356705 | 63.643452 | -4.155337 | 59.488115 | 0.754067 | | | | |
| 1 | 473 | 0.551059 | 0.838274 | 0.445321 | 0.212784 | 0.456894 | 62.426078 | 0.159124 | 62.585202 | 0.781257 | | | | |
| 2 | 457 | 0.412544 | 0.914293 | 0.59471 | 0.579891 | 0.3123 | 60.42653 | 2.694204 | 63.120733 | 0.806923 | | | | |
| 3 | 101 | 0.501369 | 0.819987 | 0.566166 | 0.890815 | 0.414399 | 61.310893 | 2.510842 | 63.821735 | 0.953397 | | | | |
| 4 | 358 | 0.511901 | 0.908727 | 0.32044 | 0.144068 | 0.577395 | 62.142735 | -0.452602 | 61.690133 | 1.011292 | | | | |
| 5 | 2 | 0.5 | 0.9 | 0.5 | 0.5 | 0.5 | | | | | 63.6 | 62.141184 | 62.069896 | 60.1431 |
| 6 | variable_importance | 41.169522 | 34.755821 | 15.565136 | 3.822892 | 4.686629 | | | | | | | | |


In [15]:
# Review the fourth forecast
all_cold_forecast[3]

| | index | x_1 | x_2 | x_3 | x_4 | x_5 | y_train | delta_y_train | y_train_plus_delta | y_train_distances | y_actual | y_forecast | y_weighted_forecast | y_hat_catboost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 350 | 0.472353 | 0.101474 | 0.647858 | 0.454556 | 0.523629 | 37.843722 | -0.595954 | 37.247768 | 0.618337 | | | | |
| 1 | 254 | 0.550579 | 0.139386 | 0.501355 | 0.757431 | 0.673026 | 41.26932 | -3.639788 | 37.629532 | 0.743585 | | | | |
| 2 | 294 | 0.421152 | 0.118796 | 0.410555 | 0.281586 | 0.531195 | 32.545904 | 5.653872 | 38.199776 | 0.761372 | | | | |
| 3 | 272 | 0.441256 | 0.171372 | 0.601778 | 0.667023 | 0.550181 | 38.389303 | 0.244261 | 38.633564 | 0.773822 | | | | |
| 4 | 198 | 0.453571 | 0.122721 | 0.486151 | 0.709685 | 0.172339 | 32.379651 | 5.692749 | 38.0724 | 0.883697 | | | | |
| 5 | 3 | 0.5 | 0.1 | 0.5 | 0.5 | 0.5 | | | | | 36.4 | 37.956608 | 37.91765 | 36.4797 |
| 6 | variable_importance | 41.169522 | 34.755821 | 15.565136 | 3.822892 | 4.686629 | | | | | | | | |
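
To compare the three forecasts side by side, the test-product rows of the four tables can be gathered into one frame. The sketch below copies those values by hand and computes the mean absolute error per method; the frame `df_summary` is an illustrative construction, not an object produced by the notebook. With only four test points, the comparison is indicative rather than conclusive.

```python
import pandas as pd

# Test-product rows copied from the four forecast tables
df_summary = pd.DataFrame({
    "y_actual": [33.2, 66.8, 63.6, 36.4],
    "y_forecast": [34.558979, 68.098466, 62.141184, 37.956608],
    "y_weighted_forecast": [34.636666, 67.977817, 62.069896, 37.91765],
    "y_hat_catboost": [35.0799, 67.6128, 60.1431, 36.4797],
})

forecast_cols = ["y_forecast", "y_weighted_forecast", "y_hat_catboost"]
# Absolute error of every forecast column against y_actual, averaged per method
mae_by_method = (
    df_summary[forecast_cols].sub(df_summary["y_actual"], axis=0).abs().mean()
)
print(mae_by_method.round(2))
```

On these four points the two contrastive forecasts have a slightly lower MAE than the plain CatBoost forecast, mostly because of the third test product.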
