## DM2 DMC | "Big" Cluster

Credits: Building on datamining2/neuralnetworks/mlp_baseline.ipynb

Install XGBoost using e.g.: conda install -c rdonnelly py-xgboost

For an introductory example on XGBoost, see: https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/

#### Working Directory

In [1]:
working_directory = 'F:/S2_slides/Data Mining2/DMC2018/Github_clone/datamining2-develop/data/clean/'

#### Imports

In [2]:
import pandas as pd
import numpy as np
import pickle
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor, GradientBoostingRegressor,RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
import multiprocessing as mp
import itertools

Count of logical processors for speeding-up computations:

In [3]:
cpus = mp.cpu_count()

#### Loading the Data

As provided by Chung (and modified to also filter Y_full):

In [4]:
# Import cluster identifier
sales = pd.read_csv(working_directory + 'data_v0.1_sales.csv')
big_key = sales['key'][sales['cluster'] == "big"]
print(len(big_key.unique())) # Should only have 2907 keys remaining

# Import datasets
X_full = pickle.load(open(working_directory + 'X_flat.pkl', 'rb'))
Y_full = pickle.load(open(working_directory + 'Y_flat.pkl', 'rb'))

# Keep only rows which belong to cluster 'big'; should be 2,907*123 = 357,561 rows
X_full['key'] = X_full['key'].astype(str)
X_big = X_full[X_full['key'].isin(big_key.astype(str))]
X_big = X_big.reset_index(drop=True)
print(X_big.shape) # Check the number of rows = 357,561

# Keep only rows which belong to cluster 'big'; should be 2,907*123 = 357,561 rows
Y_full['key'] = Y_full['key'].astype(str)
Y_big = Y_full[Y_full['key'].isin(big_key.astype(str))]
Y_big = Y_big.reset_index(drop=True)
print(Y_big.shape) # Check the number of rows = 357,561

2907
(357561, 108)
(357561, 3)


In [5]:
X_full = X_big

In [6]:
Y_full = Y_big

In [7]:
X_full.shape

(357561, 108)

In [8]:
Y_full.shape

(357561, 3)

#### Train/Test Split

In [9]:
X_full['month'] = pd.DatetimeIndex(X_full['date']).month

In [10]:
Y_full['month'] = pd.DatetimeIndex(Y_full['date']).month

In [11]:
X_full_train = X_full.loc[X_full['month'] != 1]

In [12]:
Y_full_train = Y_full.loc[Y_full['month'] != 1]

In [13]:
X_full_test = X_full.loc[X_full['month'] == 1]

In [14]:
Y_full_test = Y_full.loc[Y_full['month'] == 1]

#### Grid Search for Equal Step Width Leave-One-Out-Validation w.r.t. Dates with Lagged Embargo for Hyperparameter Tuning (Model Selection)

Note: When I use the term 'test set' in the context of validation, I refer to a subset of the training data, not to the January test data.

As we do not have a lot of observations (Oct-Dec for training, only), it makes sense to use leave-one-out-validation w.r.t the date attribute. This also ensures that in our respective test sets, there are no overlapping observations from the training data w.r.t. to earliest and latest date of the test records. Consequently, "purging" as described by Lopez de Prado [2018] is not necessary. However, we have to prevent leakage from the respective training set into the respective test set by removing from the respective training set all records which dates "[...] immediately follow an observation in the testing set. I call this process "embargo."" [Lopez de Prado, 2018]. As we only have one test date for validation in each round of the leave-one-out-validation, we simply have to remove all records from the respectively upcoming n days, where n is equal to the number of lags of sales data that we include as features times two(!). Times two indicates two different problems that need to be addressed: [1] "embargo" and [2] "lagged embargo" hereinafter. [1] For example, if we chose November 1 as one of the single validation dates, we would have to remove all records from the training set which date value is somewhere between (border values included) November 2 and November 15 (assume, that we drop 'last_15_day_sales', ..., 'last_28_day_sales' so that we do not loose too many training records). This is the above-mentioned "embargo". [2] As we deal with lagged features, we additionally have to remove all records that contain as values for the lagged features (sales) values from the "embargo" period. Consequently, following the example, we would have to remove all records with date values between (border values included) November 16 and November 29. E.g. the problem with November 29 is that it includes as lagged feature for all items 'last_14_day_sales' which refers in this case to November 15. The sales on November 15, however, have as lagged feature 'last_14_day_sales', as well. Unfortunately, this would refer here to November 1, which is our test date. To get rid of all undue leakage from test data into training data, however, we would have to remove records from November 29, as well, as we do not want to include a date in the training data which lagged feature value ('last_14_day_sales') is derived from data from the "embargo" period. One could argue that e.g. for November 29 data, we could at least keep 'last_1_day_sales', ..., 'last_13_day_sales'. That is certainly right, but would introduce a new problem: How to deal with the missing values (e.g. feature value 'last_14_day_sales' that is missing for November 29)? Thus, it might be reasonable to just drop records from all dates from (border values included) November 2 to November 29 in the example. As we have 2907 items in our "big" cluster, there are 2907 records for testing in each round of validation that are available, which should be sufficient. Due to the "embargo", we drop 14 x 2907 = 40,698 observations. Due to the "lagged embargo", we additionally have to drop the same number of observations. Consequently, we are left with (92-2x14) x 2907 = 186,048 records for training (that's about 52% of the complete training data set's records: 357561). We choose the testing dates such that they are equally-distributed: day 11, day 29, day 47, day 65, and day 83 (note the step-width of 18 and that there are 10 days before day 11 and 9 days after day 83 (until the last day in the training data, day 92)). That we train on the future to validate (test) on the past should not be an issue, as we assume by training on Oct-Dec and then finally testing on January data that the overall relationship between the features and the target remains the same and we make sure to remove all undue influence of the next 28 days after the respective validation dates, anyway.

Note on why k-fold cross-validation might be problematic (thanks @Sun Jing for asking that important question): The problem with k-fold w.r.t items is that we cannot e.g. use the sales of item 2 on Nov 2 for training when we test on item 1 on Nov 1, as the sales of item 2 on Nov 2 are probably related to the sales of item 1 on Nov 1 (that is why Nov 2 is in the "embargo" time frame). W.r.t dates the problem is that we have too less data overall, as we also loose even more date due to the (lagged) embargo and that we have to avoid that we have to apply "purging" (cp. above) as this would further reduce the amount of training date available.

Source: https://books.google.de/books?id=oU9KDwAAQBAJ&pg=PA103&lpg=PA110&dq=purged+cv+github&source=bl&ots=7TFGU-xxfx&sig=e94OZffPDeAaRJdn9k_pUHuR2t0&hl=de&sa=X&ved=0ahUKEwiNn-jXv6_aAhWFJZoKHWQCAOUQ6AEIXjAH#v=onepage&q&f=false

In each round of the validation, the respective validation_dates, embargo_dates and lagged_embargo_dates have to be removed:

In [15]:
validation_dates = [['2017-10-11'],
                    ['2017-10-29'],
                    ['2017-11-16'],
                    ['2017-12-04'],
                    ['2017-12-22']]

In [16]:
embargo_dates = [['2017-10-12', '2017-10-13', '2017-10-14', '2017-10-15', '2017-10-16', '2017-10-17', '2017-10-18', '2017-10-19', '2017-10-20', '2017-10-21', '2017-10-22', '2017-10-23', '2017-10-24', '2017-10-25'],
                 ['2017-10-30', '2017-10-31', '2017-11-01', '2017-11-02', '2017-11-03', '2017-11-04', '2017-11-05', '2017-11-06', '2017-11-07', '2017-11-08', '2017-11-09', '2017-11-10', '2017-11-11', '2017-11-12'],
                 ['2017-11-17', '2017-11-18', '2017-11-19', '2017-11-20', '2017-11-21', '2017-11-22', '2017-11-23', '2017-11-24', '2017-11-25', '2017-11-26', '2017-11-27', '2017-11-28', '2017-11-29', '2017-11-30'],
                 ['2017-12-05', '2017-12-06', '2017-12-07', '2017-12-08', '2017-12-09', '2017-12-10', '2017-12-11', '2017-12-12', '2017-12-13', '2017-12-14', '2017-12-15', '2017-12-16', '2017-12-17', '2017-12-18'],
                 ['2017-12-23', '2017-12-24', '2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29', '2017-12-30', '2017-12-31']]

In [17]:
lagged_embargo_dates = [['2017-10-26', '2017-10-27', '2017-10-28', '2017-10-29', '2017-10-30', '2017-10-31', '2017-11-01', '2017-11-02', '2017-11-03', '2017-11-04', '2017-11-05', '2017-11-06', '2017-11-07', '2017-11-08'],
                        ['2017-11-13', '2017-11-14', '2017-11-15', '2017-11-16', '2017-11-17', '2017-11-18', '2017-11-19', '2017-11-20', '2017-11-21', '2017-11-22', '2017-11-23', '2017-11-24', '2017-11-25', '2017-11-26'],
                        ['2017-12-01', '2017-12-02', '2017-12-03', '2017-12-04', '2017-12-05', '2017-12-06', '2017-12-07', '2017-12-08', '2017-12-09', '2017-12-10', '2017-12-11', '2017-12-12', '2017-12-13', '2017-12-14'],
                        ['2017-12-19', '2017-12-20', '2017-12-21', '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29', '2017-12-30', '2017-12-31'],
                        []]

In [18]:
X_train_subsets = []

In [19]:
Y_train_subsets = []

In [20]:
X_validation_subsets = []

In [21]:
Y_validation_subsets = []

In [22]:
drop_X_cols = ['key', 'pid_x', 'size_x', 'color', 'brand', 'rrp', 'date', 'day_of_week', 
               'mainCategory', 'category', 'subCategory', 'releaseDate', 
               'rrp', 'price', 'month',
               'last_15_day_sales', 'last_16_day_sales', 'last_17_day_sales', 'last_18_day_sales', 'last_19_day_sales', 'last_20_day_sales', 'last_21_day_sales', 
               'last_22_day_sales', 'last_23_day_sales', 'last_24_day_sales', 'last_25_day_sales', 'last_26_day_sales', 'last_27_day_sales', 'last_28_day_sales']

In [23]:
drop_Y_cols = ['key', 'date', 'month']

In [24]:
for i in range(0, 5):
    full_embargo_set = set(validation_dates[i] + embargo_dates[i] + lagged_embargo_dates[i])
    validation_date = validation_dates[i]
    
    X_train_subsets.append(X_full_train.loc[X_full_train['date'].apply(lambda x: x not in full_embargo_set)].drop(drop_X_cols, axis=1).as_matrix())
    Y_train_subsets.append(Y_full_train.loc[Y_full_train['date'].apply(lambda x: x not in full_embargo_set)].drop(drop_Y_cols, axis=1).as_matrix())

    X_validation_subsets.append(X_full_train.loc[X_full_train['date'].apply(lambda x: x in validation_date)].drop(drop_X_cols, axis=1).as_matrix())
    Y_validation_subsets.append(Y_full_train.loc[Y_full_train['date'].apply(lambda x: x in validation_date)].drop(drop_Y_cols, axis=1).as_matrix())

#### Additional Preparations

In [25]:
keys_dates = pd.DataFrame(X_full['key']).join(X_full['date']) # Store for future lookups

In [26]:
X_train = X_full_train.drop(drop_X_cols, axis=1).as_matrix()

In [27]:
Y_train = Y_full_train.drop(drop_Y_cols, axis=1).as_matrix()

In [28]:
X_test = X_full_test.drop(drop_X_cols, axis=1).as_matrix()

In [29]:
Y_test = Y_full_test.drop(drop_Y_cols, axis=1).as_matrix()

#### Model Selection

In [30]:
models = [RandomForestRegressor]
#XGBRegressor, LinearRegression, MLPRegressor, AdaBoostRegressor, ExtraTreesRegressor,GradientBoostingRegressor,

In [31]:
models_called = [] 

XGBoost Hyperparameters to Try:

In [None]:
XGBoost_hyperparameters_gbtree_options = {
    'booster': ['gbtree'],
    'n_jobs': [cpus],
    'learning_rate': [i/100 for i in range(25, 41, 5)],
    'n_estimators': [i for i in range(40, 71, 5)],
    'max_depth': [2, 3],
    'min_child_weight': [1, 5, 11],
    'subsample': [0.7, 0.85, 1.0],
    #'max_depth ': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
}

In [None]:
keys, values = zip(*XGBoost_hyperparameters_gbtree_options.items())
XGBoost_hyperparameters_gbtree = [dict(zip(keys, v)) for v in itertools.product(*values)]

In [None]:
XGBoost_hyperparameters_gblinear_options = {
    'booster': ['gblinear'],
    'n_jobs': [cpus],
    #'learning_rate': [i/100 for i in range(21, 40, 1)],
    #'n_estimators': [i for i in range(20, 120, 20)],
    #'alpha': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    #'lambda': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}

In [None]:
keys, values = zip(*XGBoost_hyperparameters_gblinear_options.items())
XGBoost_hyperparameters_gblinear = [dict(zip(keys, v)) for v in itertools.product(*values)]

In [None]:
XGBoost_hyperparameters_dart_options = {
    'booster': ['dart'],
    'n_jobs': [cpus],
    #'learning_rate': [i/100 for i in range(21, 40, 1)],
    #'n_estimators': [i for i in range(40, 71, 1)],
}

In [None]:
keys, values = zip(*XGBoost_hyperparameters_dart_options.items())
XGBoost_hyperparameters_dart = [dict(zip(keys, v)) for v in itertools.product(*values)]

In [38]:
#XGBoost_hyperparameters = XGBoost_hyperparameters_gbtree + XGBoost_hyperparameters_gblinear + XGBoost_hyperparameters_dart

MLPRegressor Hyperparameters to Try:

In [None]:
MLPRegressor_hyperparameters_options = {
    'activation': ['identity', 'logistic', 'tanh', 'relu'],
    'hidden_layer_sizes': [(50, ), (75, ), (100, ), (125, )],
    #'batch_size': ['auto', 10, 20, 40, 60, 80, 100],
    #'max_iter': [10, 50, 100, 200],
}

In [None]:
keys, values = zip(*MLPRegressor_hyperparameters_options.items())
MLPRegressor_hyperparameters = [dict(zip(keys, v)) for v in itertools.product(*values)]

Linear Regression Hyperparameters to try:

In [None]:
LinearRegression_hyperparameters_options = {
    'fit_intercept': [True, False],
    'n_jobs': [cpus],
}

In [None]:
keys, values = zip(*LinearRegression_hyperparameters_options.items())
LinearRegression_hyperparameters = [dict(zip(keys, v)) for v in itertools.product(*values)]

AdaBoostRegressor Hyperparameters to Try:

In [None]:
AdaBoostRegressor_hyperparameters_options = {
    #'n_estimators':[(50, ), (75, ), (100, )],
    'learning_rate':[(0.7), (0.8), (1.0)], 
    'loss':['linear','square','exponential'],
    #'n_jobs': [cpus],
}

In [100]:
keys, values = zip(*AdaBoostRegressor_hyperparameters_options.items())
AdaBoostRegressor_hyperparameters = [dict(zip(keys, v)) for v in itertools.product(*values)]

ExtraTreesRegressor Hyperparameters to Try:

In [34]:
ExtraTreesRegressor_hyperparameters_options = {
    'n_estimators':[(10)],#, (20, ), (30, )
    #'min_samples_split':[(2,)],#, (3), (4)
    #'n_jobs': [cpus],
    #'verbose':[(0)],
    #'min_samples_leaf':[(0.1,)],
    #'max_features':['auto']#,'sqrt','log2'
}

In [35]:
keys, values = zip(*ExtraTreesRegressor_hyperparameters_options.items())
ExtraTreesRegressor_hyperparameters = [dict(zip(keys, v)) for v in itertools.product(*values)]

In [36]:
model_hyperparameters = [ExtraTreesRegressor_hyperparameters]
models_avg_rmse_scores = []

In [37]:
for model_id, model_val in enumerate(models):
    model_hyperparameters_called = []
    models_hyperparameters_avg_rmse_scores = []
    
    for model_hyperparameter_id, model_hyperparameter_val in enumerate(model_hyperparameters[model_id]):
        model = model_val(**model_hyperparameter_val)
        model_hyperparameters_called.append(model)
        models_hyperparameters_rmse_score_subset = []
        
        for X_train_subset_id, X_train_subset_val in enumerate(X_train_subsets):
            model.fit(X_train_subset_val, Y_train_subsets[X_train_subset_id])
            models_hyperparameters_rmse_score_subset.append(sqrt(mean_squared_error(Y_validation_subsets[X_train_subset_id], model.predict(X_validation_subsets[X_train_subset_id]))))
            
        models_hyperparameters_avg_rmse_scores.append(np.average(models_hyperparameters_rmse_score_subset))

        print(model_hyperparameters_called[-1:])
        print(models_hyperparameters_avg_rmse_scores[-1:])
        print("\n")
    
    models_called.append(model_hyperparameters_called)
    models_avg_rmse_scores.append(models_hyperparameters_avg_rmse_scores)

  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()


[ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
          oob_score=False, random_state=None, verbose=0, warm_start=False)]
[2.275156494384631]




IndexError: list index out of range

GradientBoostingRegressor Hyperparameters to Try:

In [70]:
GradientBoostingRegressor_hyperparameters_options = {
    'n_estimators':[(10), (20)],
    #'learning_rate':[(0.0),(0.3), (0.5)], 
    #'max_features':['auto'],
    #'max_depth':[(2, ), (3, ), (8, ), (10, )],
    #'min_samples_leaf':[(0.1,)],
    #'alpha':[(0.1), (0.5),(0.9)],
    #'n_jobs': [cpus],
}

In [71]:
keys, values = zip(*GradientBoostingRegressor_hyperparameters_options.items())
GradientBoostingRegressor_hyperparameters = [dict(zip(keys, v)) for v in itertools.product(*values)]

In [72]:
model_hyperparameters = [GradientBoostingRegressor_hyperparameters]
models_avg_rmse_scores = []

In [None]:
for model_id, model_val in enumerate(models):
    model_hyperparameters_called = []
    models_hyperparameters_avg_rmse_scores = []
    
    for model_hyperparameter_id, model_hyperparameter_val in enumerate(model_hyperparameters[model_id]):
        model = model_val(**model_hyperparameter_val)
        model_hyperparameters_called.append(model)
        models_hyperparameters_rmse_score_subset = []
        
        for X_train_subset_id, X_train_subset_val in enumerate(X_train_subsets):
            model.fit(X_train_subset_val, Y_train_subsets[X_train_subset_id])
            models_hyperparameters_rmse_score_subset.append(sqrt(mean_squared_error(Y_validation_subsets[X_train_subset_id], model.predict(X_validation_subsets[X_train_subset_id]))))
            
        models_hyperparameters_avg_rmse_scores.append(np.average(models_hyperparameters_rmse_score_subset))

        print(model_hyperparameters_called[-1:])
        print(models_hyperparameters_avg_rmse_scores[-1:])
        print("\n")
    
    models_called.append(model_hyperparameters_called)
    models_avg_rmse_scores.append(models_hyperparameters_avg_rmse_scores)

  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()


[ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
          oob_score=False, random_state=None, verbose=0, warm_start=False)]
[2.2456600111750906]




  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()


RandomForestRegressor Hyperparameters to Try:

In [33]:
RandomForestRegressor_hyperparameters_options = {
    'n_estimators':[(10), (30), (50)],
    #'learning_rate':[(0.7), (0.8), (1.0)], 
    'loss':['linear','square','exponential'], 
    #'n_jobs': [cpus],
    'oob_score':[False],
    #'learning_rate':[(0.0), (0.1), (0.2), (0.3), (0.4), (0.5), (0.6)], 
    #'max_features':['auto','sqrt','log2'],
    'max_depth':[(2, ), (3, ), (8, ), (10, )],
    #'min_samples_leaf':[(0.1,)],
    #'alpha':[(0.1), (0.3), (0.4), (0.5), (0.6),(0.9)],
}

In [34]:
keys, values = zip(*RandomForestRegressor_hyperparameters_options.items())
RandomForestRegressor_hyperparameters = [dict(zip(keys, v)) for v in itertools.product(*values)]

In [35]:
model_hyperparameters = [RandomForestRegressor_hyperparameters]
models_avg_rmse_scores = []

In [36]:
for model_id, model_val in enumerate(models):
    model_hyperparameters_called = []
    models_hyperparameters_avg_rmse_scores = []
    
    for model_hyperparameter_id, model_hyperparameter_val in enumerate(model_hyperparameters[model_id]):
        model = model_val(**model_hyperparameter_val)
        model_hyperparameters_called.append(model)
        models_hyperparameters_rmse_score_subset = []
        
        for X_train_subset_id, X_train_subset_val in enumerate(X_train_subsets):
            model.fit(X_train_subset_val, Y_train_subsets[X_train_subset_id])
            models_hyperparameters_rmse_score_subset.append(sqrt(mean_squared_error(Y_validation_subsets[X_train_subset_id], model.predict(X_validation_subsets[X_train_subset_id]))))
            
        models_hyperparameters_avg_rmse_scores.append(np.average(models_hyperparameters_rmse_score_subset))

        print(model_hyperparameters_called[-1:])
        print(models_hyperparameters_avg_rmse_scores[-1:])
        print("\n")
    
    models_called.append(model_hyperparameters_called)
    models_avg_rmse_scores.append(models_hyperparameters_avg_rmse_scores)

  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()


[RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)]
[2.347180430802927]




  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()


[RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)]
[2.1539202477220933]




  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()
  # This is added back by InteractiveShellApp.init_path()


[RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)]
[2.173084840122146]




Hyperparameter Settings to Try for all Models:

In [289]:
model_hyperparameters = [ExtraTreesRegressor_hyperparameters,
                         GradientBoostingRegressor_hyperparameters,
                         RandomForestRegressor_hyperparameters,
                         XGBoost_hyperparameters,
                         LinearRegression_hyperparameters,
                         MLPRegressor_hyperparameters,
                         AdaBoostRegressor_hyperparameters]

In [290]:
models_avg_rmse_scores = []

In [291]:
for model_id, model_val in enumerate(models):
    model_hyperparameters_called = []
    models_hyperparameters_avg_rmse_scores = []
    
    for model_hyperparameter_id, model_hyperparameter_val in enumerate(model_hyperparameters[model_id]):
        model = model_val(**model_hyperparameter_val)
        model_hyperparameters_called.append(model)
        models_hyperparameters_rmse_score_subset = []
        
        for X_train_subset_id, X_train_subset_val in enumerate(X_train_subsets):
            model.fit(X_train_subset_val, Y_train_subsets[X_train_subset_id])
            models_hyperparameters_rmse_score_subset.append(sqrt(mean_squared_error(Y_validation_subsets[X_train_subset_id], model.predict(X_validation_subsets[X_train_subset_id]))))
            
        models_hyperparameters_avg_rmse_scores.append(np.average(models_hyperparameters_rmse_score_subset))

        print(model_hyperparameters_called[-1:])
        print(models_hyperparameters_avg_rmse_scores[-1:])
        print("\n")
    
    models_called.append(model_hyperparameters_called)
    models_avg_rmse_scores.append(models_hyperparameters_avg_rmse_scores)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


[AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
         n_estimators=10, random_state=None)]
[2.7417077881870684]




  # This is added back by InteractiveShellApp.init_path()


TypeError: '<=' not supported between instances of 'tuple' and 'int'

The best hyperparameter settings for the respective models are:

In [259]:
selected_models = []

In [None]:
for models_avg_rmse_scores_id, models_avg_rmse_scores_val in enumerate(models_avg_rmse_scores):
    selected_model = models_called[models_avg_rmse_scores_id][models_avg_rmse_scores_val.index(min(models_avg_rmse_scores_val))]
    print(selected_model)
    selected_models.append(selected_model)

#### Train Selected Models on the Whole Training Data and Test on the January Data

In [None]:
selected_models_avg_rmse_scores = []

In [None]:
selected_models_predicted_sales = []

In [None]:
for selected_model in selected_models:
    selected_model.fit(X_train, Y_train)
    
    selected_model_predicted_sales = selected_model.predict(X_test)
    
    selected_models_predicted_sales.append(selected_model_predicted_sales)
    selected_models_avg_rmse_scores.append(sqrt(mean_squared_error(Y_test, selected_model_predicted_sales)))

In [None]:
selected_models_avg_rmse_scores

#### Reading-Out Predicted Sales to Continue Work on the Evaluation on the Test Data Provided by the Chair (Sold-Out Dates rather than Sales)

Example: For the item with key '12985L', we predicted at '2018-01-12' 21.877096 with the hyperparameter-optimized XGBRegressor and 24.39178646 with the hyperparameter-optimized linear regression model. The true value is 29.

In [None]:
X_full_test.iloc[23230]['key']

In [None]:
X_full_test.iloc[23230]['date']

In [None]:
selected_models_predicted_sales[0][23230]

In [None]:
selected_models_predicted_sales[1][23230]

In [None]:
Y_test[23230]