## Experiment 1: The role of variable importance
 
  
  Why?

  To generate contrastive explanations we rearrange the data into neighbours and references. As a result, the variable importance
  calculated by the GBDT regressor is split into two parts. Assuming that the data have no date or time variables, the importance
  vector can be written as $\mathbf{v} = [\mathbf{v^{neig}}, \mathbf{v^{ref}}]$.

  In this experiment we demonstrate that the method accurately recovers the variable importance and that the two parts
  can be recombined as $\mathbf{v'}= \mathbf{v^{neig}} + \mathbf{v^{ref}}$ to facilitate interpretability.
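
  As a minimal sketch of this rearrangement, assuming the importance vector lists the neighbour block first and the reference block second (as in $\mathbf{v}$ above), the two halves can be summed per variable. The numbers are hypothetical and `fold_importances` is an illustrative helper, not part of `contrastiveRegressor`.

```python
import numpy as np

def fold_importances(v, num_features):
    """Sum the neighbour and reference halves of an importance vector.

    Assumes v = [v_neig, v_ref], each half of length num_features.
    """
    v = np.asarray(v, dtype=float)
    if v.shape != (2 * num_features,):
        raise ValueError("expected a vector of length 2 * num_features")
    v_neig, v_ref = v[:num_features], v[num_features:]
    return v_neig + v_ref

# Hypothetical importances for 5 variables (neighbour half, then reference half)
v = np.array([20.0, 18.0, 7.0, 2.0, 3.0, 20.0, 18.0, 7.0, 2.0, 3.0])
v_folded = fold_importances(v, num_features=5)
print(v_folded)
```

  The folded vector has one entry per original variable, so it can be compared directly with the weights of the generating model.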

  How?

  To demonstrate this, let us generate a noise-free linear model with 500 samples and 5 independent variables,
  $\mathbf{X} \in \mathbb{R}^{500 \times 5}$. Each variable $\mathbf{x_1}, \dots, \mathbf{x_5}$ is drawn from the standard uniform distribution $\textit{U}(0,1)$.

  The dependent variable is a linear combination of the inputs, $\mathbf{y} = \mathbf{X}\mathbf{w}$, with weights $\mathbf{w} = [42,34,16,0,8]$.


In [2]:
import fcn_helpers as fhelp
import pandas as pd
import numpy as np
from contrastiveRegressor import contrastiveRegressor
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor
import preprocessing_utils as pt
from os import path as _p
import datetime as dt

Define a linear model in which the sales are the dot product of the input variables and predefined weights.

In [4]:
# Fake sales
experiment_label = 'linear_model'
num_samples = 500
num_features = 5
input_vars = [f'x_{idx}' for idx in range(1,num_features+1)]
input_data = np.random.rand(num_samples, num_features)

weights = np.array([42,34,16,0,8])
y_train = np.dot(input_data, weights.T)

df = pd.DataFrame(input_data, columns=input_vars)


# Ad-hoc test set to see the influence of the variables
df_test = pd.DataFrame([{'x_1': 0.1, 'x_2': 0.5, 'x_3': 0.5, 'x_4': 0.5, 'x_5': 0.5},
{'x_1': 0.9, 'x_2': 0.5, 'x_3': 0.5, 'x_4': 0.5, 'x_5':0.5 },
{'x_1': 0.5, 'x_2': 0.9, 'x_3': 0.5, 'x_4': 0.5, 'x_5':0.5 },
{'x_1': 0.5, 'x_2': 0.1, 'x_3': 0.5, 'x_4': 0.5, 'x_5':0.5 }])
# Response variable
y_actual = np.dot(df_test.values, weights.T)
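
Because the test rows differ from one another in a single variable, the change in the response isolates that variable's weight. The sketch below repeats the definitions from the cell above so that it runs on its own; it is a sanity check of the design, not part of the original notebook, and `X_test`, `y_check`, `delta_x1` and `delta_x2` are illustrative names.

```python
import numpy as np

weights = np.array([42, 34, 16, 0, 8])
X_test = np.array([
    [0.1, 0.5, 0.5, 0.5, 0.5],
    [0.9, 0.5, 0.5, 0.5, 0.5],
    [0.5, 0.9, 0.5, 0.5, 0.5],
    [0.5, 0.1, 0.5, 0.5, 0.5],
])

# Noise-free response of the linear model
y_check = X_test @ weights

# Rows 0 and 1 differ only in x_1 (0.1 -> 0.9); rows 3 and 2 differ only in x_2
delta_x1 = y_check[1] - y_check[0]
delta_x2 = y_check[2] - y_check[3]
print(np.round(y_check, 2))
print(round(delta_x1, 2), round(delta_x2, 2))
```

Each difference equals the step of 0.8 multiplied by the corresponding weight, that is 0.8 × 42 for `x_1` and 0.8 × 34 for `x_2`.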

In [26]:
# Get the actual weights in a DF
df_actual_weights = pd.DataFrame(weights, index=df_test.columns, columns=['Weights'])

Weights of the input variables


variable,Weights
x_1,42
x_2,34
x_3,16
x_4,0
x_5,8


Set the parameters and the base regressor for the contrastive algorithm (the parameters are fixed; HyperOpt is not used in this experiment).

In [7]:
numericalVars = input_vars
categoricalVars = []

num_inputVars = len(input_vars)

# Hyper-parameters
num_neighbours = 5
validation_test_size = 0.20
feat_importance_keyword = 'feature_importances_'

# Regressor
num_iterations = 300
learning_rate  = 0.08 
depth = 12
# CatBoost
cb_model = CatBoostRegressor(iterations=num_iterations, learning_rate=learning_rate,
depth=depth, loss_function='RMSE', cat_features=None, silent=True)

In [8]:
'''
  Model. Using CatBoost here
'''
# Create the forecaster
contrastiveReg = contrastiveRegressor(num_neighbours = num_neighbours, 
  validation_test_size = validation_test_size)


# Set the regressor
contrastiveReg.set_regressor(cb_model, feat_importance_keyword, input_vars)
# fit the regressor
contrastiveReg.fit(df.values, y_train)
# eval results
contrastiveReg.predict_eval_test()
eval_results = contrastiveReg.get_results()

Preparing Training set...
Training set (2000, 10). Evaluation (500, 10)...done.
...Symmetrical Weights
MAE: 1.26
MSE: 2.46
RMSE: 1.57
meanError: 0.46
MAPE: 2.91
R2: 0.99
frc_error: 0.02
frc_bias: 0.01
frc_acc: 0.99
Var explained: 0.99
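
The reported shapes are consistent with pairing every sample with each of its neighbours and concatenating the two feature vectors. The arithmetic below is an interpretation of the log line, not code taken from `contrastiveRegressor`; the names `n_eval`, `n_train`, `train_shape` and `eval_shape` are illustrative.

```python
num_samples = 500
num_features = 5
num_neighbours = 5
validation_test_size = 0.20

# Hold-out split: 20% of the samples are kept for evaluation
n_eval = round(num_samples * validation_test_size)
n_train = num_samples - n_eval

# Assumption: one row per (sample, neighbour) pair, with both feature vectors side by side
train_shape = (n_train * num_neighbours, 2 * num_features)
eval_shape = (n_eval * num_neighbours, 2 * num_features)
print(train_shape, eval_shape)
```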


In [18]:
# Predict
contrastiveReg.predict(df_test.values)
cold_start_results = contrastiveReg.get_results()

...Symmetrical Weights


First question to address:
*Is the variable importance representative of the mechanisms driving the sales?*

In [28]:
# Sort by importance
df_feature_importances = cold_start_results.get('df_feat_importances', None)
df_feature_importances.columns = ['variable_importance']
pd.concat([df_actual_weights, df_feature_importances], axis=1)

variable,Weights,variable_importance
x_1,42,40.118651
x_2,34,36.112082
x_3,16,14.047777
x_4,0,4.12634
x_5,8,5.59515
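
Both the weights and the importances sum to 100 here, so they can be compared directly. The sketch below copies the values from the table above and checks the per-variable gap and the ranking; the frame `df_cmp` and the column `abs_gap` are introduced only for illustration.

```python
import pandas as pd

df_cmp = pd.DataFrame(
    {
        'Weights': [42, 34, 16, 0, 8],
        'variable_importance': [40.118651, 36.112082, 14.047777, 4.12634, 5.59515],
    },
    index=['x_1', 'x_2', 'x_3', 'x_4', 'x_5'],
)

# Absolute gap between the true weight and the estimated importance
df_cmp['abs_gap'] = (df_cmp['variable_importance'] - df_cmp['Weights']).abs()

# Compare the ordering of the variables under both measures
rank_weights = df_cmp['Weights'].sort_values(ascending=False).index.tolist()
rank_importance = df_cmp['variable_importance'].sort_values(ascending=False).index.tolist()
same_ranking = rank_weights == rank_importance

print(df_cmp)
print('Same ranking:', same_ranking)
```

The ranking of the variables is preserved, and the largest gap is on `x_4`, whose true weight is 0 but which still receives about 4% of the importance.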


In [30]:
# Forecast errors
y_forecast = cold_start_results['y_hat_weighted']
_ = contrastiveReg.get_frc_errors(y_actual, y_forecast)

MAE: 1.15
MSE: 2.14
RMSE: 1.46
meanError: -0.20
MAPE: 2.61
R2: 0.99
frc_error: 0.02
frc_bias: -0.00
frc_acc: 1.00
Var explained: 0.99
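
For reference, the standard metrics in this report can be computed as in the sketch below. The forecast values are hypothetical, the sign convention for the mean error (forecast minus actual) and the percentage scale of MAPE are assumptions, and the `frc_*` figures are specific to `contrastiveRegressor` and are not reproduced here.

```python
import numpy as np

y_true = np.array([33.2, 66.8, 63.6, 36.4])   # responses of the ad-hoc test set
y_pred = np.array([31.2, 67.8, 62.6, 38.4])   # hypothetical forecasts

err = y_pred - y_true                          # assumed convention: forecast - actual
mae = np.mean(np.abs(err))
mse = np.mean(err ** 2)
rmse = np.sqrt(mse)
mean_error = np.mean(err)
mape = 100 * np.mean(np.abs(err / y_true))     # in percent
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f'MAE: {mae:.2f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}')
print(f'meanError: {mean_error:.2f}  MAPE: {mape:.2f}  R2: {r2:.2f}')
```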


In [33]:
# Predict using random neighbours to see if it makes a difference
y_hat_random = fhelp.frc_with_random_neighbours(contrastiveReg.X_train, df_test.values,
  contrastiveReg.num_neighbours, contrastiveReg)
# Predict with CatBoost
y_hat_catboost = fhelp.frc_plain_CatBoost(num_neighbours, validation_test_size,
    num_iterations, learning_rate, depth,
    contrastiveReg.X_train, contrastiveReg.y_train, df_test.values)


0:	learn: 15.1641458	test: 14.8473588	best: 14.8473588 (0)	total: 126ms	remaining: 37.6s
50:	learn: 2.4438440	test: 5.8381183	best: 5.8381183 (50)	total: 3.9s	remaining: 19s
100:	learn: 0.7611357	test: 4.9557010	best: 4.9557010 (100)	total: 7.05s	remaining: 13.9s
150:	learn: 0.3130393	test: 4.7995102	best: 4.7995102 (150)	total: 10.5s	remaining: 10.3s
200:	learn: 0.1547523	test: 4.7429465	best: 4.7429465 (200)	total: 13.9s	remaining: 6.85s
250:	learn: 0.0864935	test: 4.7201917	best: 4.7201917 (250)	total: 17.1s	remaining: 3.35s
299:	learn: 0.0521557	test: 4.7106081	best: 4.7106081 (299)	total: 20.3s	remaining: 0us

bestTest = 4.710608059
bestIteration = 299



Arrange all the forecasts in a single DataFrame: the contrastive results plus the CatBoost forecast.
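
The columns of this DataFrame can be related to each other. Judging from the rows reported for the first test point in the output below, `y_train_plus_delta` is `y_train + delta_y_train`, `y_forecast` is the plain mean of `y_train_plus_delta` over the five neighbours, and `y_weighted_forecast` is consistent with an inverse-distance weighted mean. This is inferred from the reported numbers rather than from the library's source, so the sketch is a check under that assumption.

```python
import numpy as np

# Neighbour rows reported for the first test point
y_train = np.array([37.034292, 37.698193, 28.262059, 32.984342, 30.498087])
delta_y_train = np.array([-5.755472, -6.56236, 4.025594, -3.773362, 2.275931])
distances = np.array([0.791325, 0.824015, 0.861557, 1.011015, 1.015802])

# Each neighbour's sales corrected by the predicted difference to the test point
y_train_plus_delta = y_train + delta_y_train

# Unweighted forecast: plain average over the neighbours
y_forecast_check = y_train_plus_delta.mean()

# Weighted forecast; assumption: weights are the inverse of the reported distances
inv_dist = 1.0 / distances
y_weighted_check = np.average(y_train_plus_delta, weights=inv_dist)

print(round(y_forecast_check, 6), round(y_weighted_check, 6))
```

Both values can be compared with the `y_forecast` and `y_weighted_forecast` entries of the test-point row.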

In [37]:
all_cold_forecast = []

for idx_review in range(y_actual.shape[0]):
  df_forecast_ext = contrastiveReg.arrange_regressor_results(idx_review, df,
    y_train, None, input_vars,
    None, df_test, y_actual, num_inputVars)
  df_forecast_ext['y_hat_catboost'] = ''
  df_forecast_ext.reset_index(inplace=True)
  # The test point sits in the second-to-last row; assign with a single .loc call
  # to avoid chained indexing, which raises pandas' SettingWithCopyWarning
  df_forecast_ext.loc[df_forecast_ext.index[-2], 'y_hat_catboost'] = y_hat_catboost[idx_review]
  all_cold_forecast.append(df_forecast_ext)
  y_actual_A = df_forecast_ext['y_actual'].iloc[-2]
  y_forecast = df_forecast_ext['y_weighted_forecast'].iloc[-2]

# Append them all
df_all_cold_forecast = pd.concat(all_cold_forecast)

df_all_cold_forecast


Unnamed: 0,index,x_1,x_2,x_3,x_4,x_5,y_train,delta_y_train,y_train_plus_delta,y_train_distances,y_actual,y_forecast,y_weighted_forecast,y_hat_catboost
0,428,0.076108,0.525374,0.68426,0.551406,0.628364,37.034292,-5.755472,31.27882,0.791325,,,,
1,248,0.163153,0.478689,0.547078,0.289518,0.727133,37.698193,-6.56236,31.135833,0.824015,,,,
2,415,0.123789,0.397592,0.464679,0.551673,0.26374,28.262059,4.025594,32.287653,0.861557,,,,
3,79,0.004801,0.592981,0.428241,0.508175,0.721189,32.984342,-3.773362,29.21098,1.011015,,,,
4,213,0.14866,0.494795,0.247506,0.561978,0.433908,30.498087,2.275931,32.774018,1.015802,,,,
5,0,0.1,0.5,0.5,0.5,0.5,,,,,33.2,31.337461,31.354387,34.0202
6,variable_importance,40.118651,36.112082,14.047777,4.12634,5.59515,,,,,,,,
0,293,0.806047,0.566573,0.491117,0.382977,0.564253,65.489337,0.349299,65.838636,0.771294,,,,
1,9,0.868585,0.459564,0.422812,0.541824,0.214107,60.583571,4.928333,65.511904,0.804285,,,,
2,194,0.97209,0.397908,0.436838,0.397039,0.656022,66.594229,2.094805,68.689034,0.906022,,,,
