## DM2 DMC | All Products

Credits: Building on datamining2/neuralnetworks/mlp_baseline.ipynb

Install XGBoost using e.g.: conda install -c rdonnelly py-xgboost

For an introductory example on XGBoost, see: https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/

#### Working Directory

In [70]:
working_directory = 'C:/Users/JulianWeller/Desktop/DM2_DMC_Data/'

In [71]:
test_data_directory = 'C:/Users/JulianWeller/OneDrive - Julian Weller/01_MMDS/03_Semester/04_A_6_Data Mining II/03_DMC/02_Test_Data/DMC_2018_test/'

#### Imports

In [72]:
import pandas as pd
import numpy as np
import pickle
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
import multiprocessing as mp
import itertools

Count of logical processors for speeding-up computations:

In [73]:
cpus = mp.cpu_count()

#### Loading the Data

As provided by Chung (and modified to also filter Y_full):

In [75]:
# Import datasets
X_big = pickle.load(open(working_directory + 'X_flat.pkl', 'rb'))
Y_big = pickle.load(open(working_directory + 'Y_flat.pkl', 'rb'))

Backup with cluster filter

In [5]:
# # Import cluster identifier
# sales = pd.read_csv(working_directory + 'data_v0.1_sales.csv')
# big_key = sales['key'][sales['cluster'] == "big"]
# print(len(big_key.unique())) # Should only have 2907 keys remaining

# # Import datasets
# X_full = pickle.load(open(working_directory + 'X_flat.pkl', 'rb'))
# Y_full = pickle.load(open(working_directory + 'Y_flat.pkl', 'rb'))

# # Keep only rows which belong to cluster 'big'; should be 2,907*123 = 357,561 rows
# X_full['key'] = X_full['key'].astype(str)
# X_big = X_full[X_full['key'].isin(big_key.astype(str))]
# X_big = X_big.reset_index(drop=True)
# print(X_big.shape) # Check the number of rows = 357,561

# # Keep only rows which belong to cluster 'big'; should be 2,907*123 = 357,561 rows
# Y_full['key'] = Y_full['key'].astype(str)
# Y_big = Y_full[Y_full['key'].isin(big_key.astype(str))]
# Y_big = Y_big.reset_index(drop=True)
# print(Y_big.shape) # Check the number of rows = 357,561

2907
(357561, 108)
(357561, 3)


In [6]:
X_full = X_big

In [7]:
Y_full = Y_big

In [8]:
X_full.shape

(357561, 108)

In [9]:
Y_full.shape

(357561, 3)

#### Train/Test Split

In [10]:
X_full['month'] = pd.DatetimeIndex(X_full['date']).month

In [11]:
Y_full['month'] = pd.DatetimeIndex(Y_full['date']).month

In [12]:
X_full_train = X_full.loc[X_full['month'] != 1]

In [13]:
Y_full_train = Y_full.loc[Y_full['month'] != 1]

In [14]:
X_full_test = X_full.loc[X_full['month'] == 1]

In [15]:
Y_full_test = Y_full.loc[Y_full['month'] == 1]

#### Grid Search for Equal Step Width Leave-One-Out-Validation w.r.t. Dates with Lagged Embargo for Hyperparameter Tuning (Model Selection)

Note: When I use the term 'test set' in the context of validation, I refer to a subset of the training data, not to the January test data.

Because we have few observations (only Oct-Dec for training), it makes sense to use leave-one-out validation with respect to the date attribute: each round holds out a single date. This also ensures that no training observation overlaps the span between the earliest and latest date of the test records. Consequently, "purging" as described by Lopez de Prado [2018] is not necessary. However, we still have to prevent leakage from the respective training set into the respective test set by removing from the training set all records whose dates "[...] immediately follow an observation in the testing set. I call this process "embargo."" [Lopez de Prado, 2018].

As there is only one test date in each round of the leave-one-out validation, we simply remove all records from the upcoming n days, where n is equal to the number of lags of sales data that we include as features times two(!). The factor of two reflects two different problems that need to be addressed, called [1] "embargo" and [2] "lagged embargo" hereinafter.

[1] Embargo: If we chose November 1 as one of the single validation dates, we would have to remove from the training set all records dated between November 2 and November 15 (border values included). This assumes that we drop 'last_15_day_sales', ..., 'last_28_day_sales' so that we do not lose too many training records.

[2] Lagged embargo: As we deal with lagged features, we additionally have to remove all records whose lagged sales features take their values from the "embargo" period. Following the example, these are all records dated between November 16 and November 29 (border values included). Take November 29: its lagged feature 'last_14_day_sales' refers to the sales on November 15. The November 15 record, however, has a 'last_14_day_sales' feature as well, and that one refers to November 1, which is our test date. To get rid of all undue leakage from test data into training data, we therefore have to remove the November 29 records too, as we do not want a date in the training data whose lagged feature value is derived from data from the "embargo" period.

One could argue that for November 29 we could at least keep 'last_1_day_sales', ..., 'last_13_day_sales'. That is certainly right, but it would introduce a new problem: how to deal with the missing values (e.g. the feature value 'last_14_day_sales' that is missing for November 29)? Thus, it seems reasonable to just drop the records of all dates from November 2 to November 29 (border values included) in the example.

As we have 2907 items in our "big" cluster, 2907 records are available for testing in each round of validation, which should be sufficient. Due to the "embargo", we drop 14 x 2907 = 40,698 observations. Due to the "lagged embargo", we additionally drop the same number of observations. Consequently, we are left with (92 - 2 x 14) x 2907 = 186,048 records for training. That is about 52% of the 357,561 records in the complete data set (which includes January), or about 70% of the 92 x 2907 = 267,444 Oct-Dec training records.

We choose the testing dates such that they are equally spaced: day 11, day 29, day 47, day 65, and day 83. Note the step width of 18, and that there are 10 days before day 11 and 9 days after day 83 (up to day 92, the last day in the training data).

That we train on the future to validate (test) on the past should not be an issue: by training on Oct-Dec and finally testing on January data, we already assume that the overall relationship between the features and the target remains the same, and we remove all undue influence of the 28 days following the respective validation dates anyway.
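
The two windows from the November 1 example can also be derived from the validation date instead of being typed by hand. The following is a minimal sketch and not part of the original notebook: the helper name `embargo_windows` and its parameters are hypothetical, and it assumes 14 lags and a training period that ends on 2017-12-31.

```python
from datetime import date, timedelta

def embargo_windows(validation_day, n_lags=14, last_train_day=date(2017, 12, 31)):
    """Return (embargo, lagged_embargo) as lists of ISO date strings.

    Hypothetical helper: both windows are n_lags days long and are clipped
    at last_train_day, the assumed end of the training period.
    """
    def window(first_offset):
        days = (validation_day + timedelta(days=first_offset + k) for k in range(n_lags))
        return [d.isoformat() for d in days if d <= last_train_day]

    embargo = window(1)                  # days 1..n_lags after the validation day
    lagged_embargo = window(1 + n_lags)  # days n_lags+1..2*n_lags after it
    return embargo, lagged_embargo

embargo, lagged_embargo = embargo_windows(date(2017, 11, 1))
print(embargo[0], embargo[-1])                # 2017-11-02 2017-11-15
print(lagged_embargo[0], lagged_embargo[-1])  # 2017-11-16 2017-11-29
```

For a validation date close to the end of December, the clipping shortens the embargo window and can leave the lagged embargo window empty.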

Note on why k-fold cross-validation might be problematic (thanks @Sun Jing for asking that important question): With k-fold w.r.t. items, the problem is that we cannot e.g. use the sales of item 2 on Nov 2 for training when we test on item 1 on Nov 1, as the sales of item 2 on Nov 2 are probably related to the sales of item 1 on Nov 1 (that is why Nov 2 is in the "embargo" time frame). With k-fold w.r.t. dates, the problem is that we have too little data overall: we lose even more data to the (lagged) embargo, and we would additionally have to apply "purging" (cf. above), which would further reduce the amount of training data available.

Source: https://books.google.de/books?id=oU9KDwAAQBAJ&pg=PA103&lpg=PA110&dq=purged+cv+github&source=bl&ots=7TFGU-xxfx&sig=e94OZffPDeAaRJdn9k_pUHuR2t0&hl=de&sa=X&ved=0ahUKEwiNn-jXv6_aAhWFJZoKHWQCAOUQ6AEIXjAH#v=onepage&q&f=false

In each round of the validation, the respective validation_dates, embargo_dates and lagged_embargo_dates have to be removed:

In [16]:
validation_dates = [['2017-10-11'],
                    ['2017-10-29'],
                    ['2017-11-16'],
                    ['2017-12-04'],
                    ['2017-12-22']]
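
As a cross-check, the equally spaced validation dates can be generated from the day numbers 11, 29, 47, 65 and 83. This is a small sketch under the assumption that day 1 of the training period is 2017-10-01; the variable names are illustrative and not used elsewhere in the notebook.

```python
from datetime import date, timedelta

first_train_day = date(2017, 10, 1)  # assumed day 1 of the Oct-Dec training period
day_numbers = range(11, 93, 18)      # 11, 29, 47, 65, 83 (step width 18, at most day 92)

generated_validation_dates = [
    [(first_train_day + timedelta(days=n - 1)).isoformat()] for n in day_numbers
]
print(generated_validation_dates)
```

Each date is wrapped in its own list so that the result has the same nested shape as the hand-written list.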

In [17]:
embargo_dates = [['2017-10-12', '2017-10-13', '2017-10-14', '2017-10-15', '2017-10-16', '2017-10-17', '2017-10-18', '2017-10-19', '2017-10-20', '2017-10-21', '2017-10-22', '2017-10-23', '2017-10-24', '2017-10-25'],
                 ['2017-10-30', '2017-10-31', '2017-11-01', '2017-11-02', '2017-11-03', '2017-11-04', '2017-11-05', '2017-11-06', '2017-11-07', '2017-11-08', '2017-11-09', '2017-11-10', '2017-11-11', '2017-11-12'],
                 ['2017-11-17', '2017-11-18', '2017-11-19', '2017-11-20', '2017-11-21', '2017-11-22', '2017-11-23', '2017-11-24', '2017-11-25', '2017-11-26', '2017-11-27', '2017-11-28', '2017-11-29', '2017-11-30'],
                 ['2017-12-05', '2017-12-06', '2017-12-07', '2017-12-08', '2017-12-09', '2017-12-10', '2017-12-11', '2017-12-12', '2017-12-13', '2017-12-14', '2017-12-15', '2017-12-16', '2017-12-17', '2017-12-18'],
                 ['2017-12-23', '2017-12-24', '2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29', '2017-12-30', '2017-12-31']]

In [18]:
lagged_embargo_dates = [['2017-10-26', '2017-10-27', '2017-10-28', '2017-10-29', '2017-10-30', '2017-10-31', '2017-11-01', '2017-11-02', '2017-11-03', '2017-11-04', '2017-11-05', '2017-11-06', '2017-11-07', '2017-11-08'],
                        ['2017-11-13', '2017-11-14', '2017-11-15', '2017-11-16', '2017-11-17', '2017-11-18', '2017-11-19', '2017-11-20', '2017-11-21', '2017-11-22', '2017-11-23', '2017-11-24', '2017-11-25', '2017-11-26'],
                        ['2017-12-01', '2017-12-02', '2017-12-03', '2017-12-04', '2017-12-05', '2017-12-06', '2017-12-07', '2017-12-08', '2017-12-09', '2017-12-10', '2017-12-11', '2017-12-12', '2017-12-13', '2017-12-14'],
                        ['2017-12-19', '2017-12-20', '2017-12-21', '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29', '2017-12-30', '2017-12-31'],
                        []]

In [19]:
X_train_subsets = []

In [20]:
Y_train_subsets = []

In [21]:
X_validation_subsets = []

In [22]:
Y_validation_subsets = []

In [23]:
drop_X_cols = ['key', 'pid_x', 'size_x', 'color', 'brand', 'rrp', 'date', 'day_of_week', 
               'mainCategory', 'category', 'subCategory', 'releaseDate', 
               'rrp', 'price', 'month',
               'last_15_day_sales', 'last_16_day_sales', 'last_17_day_sales', 'last_18_day_sales', 'last_19_day_sales', 'last_20_day_sales', 'last_21_day_sales', 
               'last_22_day_sales', 'last_23_day_sales', 'last_24_day_sales', 'last_25_day_sales', 'last_26_day_sales', 'last_27_day_sales', 'last_28_day_sales']

In [24]:
drop_Y_cols = ['key', 'date', 'month']

In [25]:
for i in range(len(validation_dates)):
    # All dates that must not appear in this round's training subset
    full_embargo_set = set(validation_dates[i] + embargo_dates[i] + lagged_embargo_dates[i])
    validation_date = validation_dates[i]
    
    # Training subsets: every record outside the validation date and both embargo windows
    # (.values replaces the deprecated .as_matrix())
    X_train_subsets.append(X_full_train.loc[X_full_train['date'].apply(lambda x: x not in full_embargo_set)].drop(drop_X_cols, axis=1).values)
    Y_train_subsets.append(Y_full_train.loc[Y_full_train['date'].apply(lambda x: x not in full_embargo_set)].drop(drop_Y_cols, axis=1).values)

    # Validation subsets: only the records of the single validation date
    X_validation_subsets.append(X_full_train.loc[X_full_train['date'].apply(lambda x: x in validation_date)].drop(drop_X_cols, axis=1).values)
    Y_validation_subsets.append(Y_full_train.loc[Y_full_train['date'].apply(lambda x: x in validation_date)].drop(drop_Y_cols, axis=1).values)

#### Additional Preparations

In [26]:
keys_dates = pd.DataFrame(X_full['key']).join(X_full['date']) # Store for future lookups

In [27]:
X_train = X_full_train.drop(drop_X_cols, axis=1).values  # .values replaces the deprecated .as_matrix()

In [28]:
Y_train = Y_full_train.drop(drop_Y_cols, axis=1).values  # .values replaces the deprecated .as_matrix()

#### Model Selection

In [None]:
models = [Lasso, LinearRegression, XGBRegressor, MLPRegressor]

In [None]:
models_called = []

Lasso Hyperparameters to Try:

In [None]:
Lasso_hyperparameters_options = {
    'fit_intercept': [True, False],
}

In [None]:
keys, values = zip(*Lasso_hyperparameters_options.items())
Lasso_hyperparameters = [dict(zip(keys, v)) for v in itertools.product(*values)]

Linear Regression Hyperparameters to try:

In [None]:
LinearRegression_hyperparameters_options = {
    'fit_intercept': [True, False],
    'n_jobs': [cpus],
}

In [None]:
keys, values = zip(*LinearRegression_hyperparameters_options.items())
LinearRegression_hyperparameters = [dict(zip(keys, v)) for v in itertools.product(*values)]

XGBoost Hyperparameters to Try:

In [None]:
XGBoost_hyperparameters_gbtree_options = {
    'booster': ['gbtree'],
    'n_jobs': [cpus],
    'learning_rate': [0.01, 0.1, 0.3],
    'n_estimators': [40, 65, 100],
    'subsample': [0.7, 1.0],
    'max_depth': [5, 10, 15, 20],  # key must match the parameter name exactly (no trailing space)
}

In [None]:
keys, values = zip(*XGBoost_hyperparameters_gbtree_options.items())
XGBoost_hyperparameters_gbtree = [dict(zip(keys, v)) for v in itertools.product(*values)]
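
The `zip`/`itertools.product` pattern used in these cells turns a dictionary of option lists into one dictionary per combination. The sketch below wraps it in a helper to show the size of the resulting grid; `expand_grid` and `example_options` are hypothetical names, and the option values are copied from the gbtree cell without `n_jobs`.

```python
import itertools

def expand_grid(options):
    """Hypothetical helper: one dict per combination of the listed option values."""
    keys, values = zip(*options.items())
    return [dict(zip(keys, combo)) for combo in itertools.product(*values)]

example_options = {
    'booster': ['gbtree'],
    'learning_rate': [0.01, 0.1, 0.3],
    'n_estimators': [40, 65, 100],
    'subsample': [0.7, 1.0],
    'max_depth': [5, 10, 15, 20],
}

grid = expand_grid(example_options)
print(len(grid))  # 1 * 3 * 3 * 2 * 4 = 72 settings
print(grid[0])    # first combination; the last key varies fastest
```

With five validation rounds per setting, a grid of this size means 72 x 5 = 360 model fits for the gbtree booster alone.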

In [None]:
XGBoost_hyperparameters_gblinear_options = {
    'booster': ['gblinear'],
    'n_jobs': [cpus],
    #'learning_rate': [i/100 for i in range(21, 40, 1)],
    #'n_estimators': [i for i in range(20, 120, 20)],
    #'alpha': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    #'lambda': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}

In [None]:
keys, values = zip(*XGBoost_hyperparameters_gblinear_options.items())
XGBoost_hyperparameters_gblinear = [dict(zip(keys, v)) for v in itertools.product(*values)]

In [None]:
XGBoost_hyperparameters_dart_options = {
    'booster': ['dart'],
    'n_jobs': [cpus],
    #'learning_rate': [i/100 for i in range(21, 40, 1)],
    #'n_estimators': [i for i in range(40, 71, 1)],
}

In [None]:
keys, values = zip(*XGBoost_hyperparameters_dart_options.items())
XGBoost_hyperparameters_dart = [dict(zip(keys, v)) for v in itertools.product(*values)]

In [None]:
XGBoost_hyperparameters = XGBoost_hyperparameters_gbtree + XGBoost_hyperparameters_gblinear + XGBoost_hyperparameters_dart

MLPRegressor Hyperparameters to Try:

In [None]:
MLPRegressor_hyperparameters_options = {
    'activation': ['identity'],
    #'activation': ['identity', 'logistic', 'tanh', 'relu'],
    'hidden_layer_sizes': [(25, )],
    #'batch_size': ['auto', 10, 20, 40, 60, 80, 100],
    #'max_iter': [10, 50, 100, 200],
}

In [None]:
keys, values = zip(*MLPRegressor_hyperparameters_options.items())
MLPRegressor_hyperparameters = [dict(zip(keys, v)) for v in itertools.product(*values)]

Hyperparameter Settings to Try for all Models:

In [None]:
model_hyperparameters = [Lasso_hyperparameters,
                         LinearRegression_hyperparameters,
                         XGBoost_hyperparameters,
                         MLPRegressor_hyperparameters]

In [None]:
models_avg_rmse_scores = []

In [None]:
for model_id, model_val in enumerate(models):
    model_hyperparameters_called = []
    models_hyperparameters_avg_rmse_scores = []
    
    for model_hyperparameter_id, model_hyperparameter_val in enumerate(model_hyperparameters[model_id]):
        model = model_val(**model_hyperparameter_val)
        model_hyperparameters_called.append(model)
        models_hyperparameters_rmse_score_subset = []
        
        for X_train_subset_id, X_train_subset_val in enumerate(X_train_subsets):
            model.fit(X_train_subset_val, Y_train_subsets[X_train_subset_id])
            models_hyperparameters_rmse_score_subset.append(sqrt(mean_squared_error(Y_validation_subsets[X_train_subset_id], np.round(model.predict(X_validation_subsets[X_train_subset_id])))))
            
        models_hyperparameters_avg_rmse_scores.append(np.average(models_hyperparameters_rmse_score_subset))

        print(model_hyperparameters_called[-1:])
        print(models_hyperparameters_avg_rmse_scores[-1:])
        print("\n")
    
    models_called.append(model_hyperparameters_called)
    models_avg_rmse_scores.append(models_hyperparameters_avg_rmse_scores)
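
The scoring logic of this loop (fit on each training subset, round the predictions, compute the RMSE on the matching validation subset, average over the rounds) can be illustrated without any fitted library model. The sketch below uses two toy stand-ins for models and made-up data; all names and numbers are illustrative assumptions, not results from this notebook.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def fit_mean(X, y):
    """Toy stand-in for a model: always predicts the training mean."""
    mean = sum(y) / len(y)
    return lambda row: mean

def fit_last_lag(X, y):
    """Toy stand-in for a model: predicts the first feature (yesterday's sales)."""
    return lambda row: row[0]

# Each fold is (X_train, y_train, X_val, y_val); the values are made up.
folds = [
    ([[0], [2], [1]], [1, 2, 1], [[2], [0]], [2, 0]),
    ([[1], [3], [0]], [1, 3, 1], [[1], [4]], [1, 3]),
]

def mean_fold_rmse(fit, folds):
    """Average validation RMSE of one candidate over all folds."""
    scores = []
    for X_train, y_train, X_val, y_val in folds:
        predict = fit(X_train, y_train)
        rounded = [round(predict(row)) for row in X_val]  # sales are whole units
        scores.append(rmse(y_val, rounded))
    return sum(scores) / len(scores)

scores = {name: mean_fold_rmse(fit, folds)
          for name, fit in [('mean', fit_mean), ('last_lag', fit_last_lag)]}
best = min(scores, key=scores.get)
print(scores, best)
```

Picking the candidate with the lowest average score corresponds to the `min(...)` lookup used for model selection.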

The best hyperparameter settings for the respective models are:

In [None]:
selected_models = []

In [None]:
for models_avg_rmse_scores_id, models_avg_rmse_scores_val in enumerate(models_avg_rmse_scores):
    selected_model = models_called[models_avg_rmse_scores_id][models_avg_rmse_scores_val.index(min(models_avg_rmse_scores_val))]
    print(selected_model)
    selected_models.append(selected_model)

In [32]:
selected_models= [Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=8, normalize=False),
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.3, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=65,
       n_jobs=8, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=0.7),
MLPRegressor(activation='identity', alpha=0.0001, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(25,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)]

In [33]:
selected_models

[Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
    normalize=False, positive=False, precompute=False, random_state=None,
    selection='cyclic', tol=0.0001, warm_start=False),
 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=8, normalize=False),
 XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
        colsample_bytree=1, gamma=0, learning_rate=0.3, max_delta_step=0,
        max_depth=3, min_child_weight=1, missing=None, n_estimators=65,
        n_jobs=8, nthread=None, objective='reg:linear', random_state=0,
        reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
        silent=True, subsample=0.7),
 MLPRegressor(activation='identity', alpha=0.0001, batch_size='auto',
        beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
        hidden_layer_sizes=(25,), learning_rate='constant',
        learning_rate_init=0.001, max_iter=200, momentum=0.9,
        nesterovs_momentum=True, power_t=0.5, random_state=None,
        

# Thanks Lu!

## Model
### 1. Import and load

In [108]:
import pandas as pd
import numpy as np
import pickle
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt

In [109]:
# # Import cluster identifier

# sales = pd.read_csv(working_directory + 'data_v0.1_sales.csv')
# big_key = sales['key'][sales['cluster'] == "big"]

In [110]:
# # Import datasets

# X_full = pickle.load(open(working_directory + 'X_flat.pkl', 'rb'))
# Y_full = pickle.load(open(working_directory + 'Y_flat.pkl', 'rb'))

In [111]:
#drop_cols = X_full.columns[12+14:12+14+14]
#X_full = X_full.drop(drop_cols, axis=1)
#print(X_full.columns)

### 2. Prepare

In [112]:
# # Find rows which belong to cluster 'big' in X_full

# X_full['key'] = X_full['key'].astype(str)
# X_big = X_full[X_full['key'].isin(big_key.astype(str))]
# X_big = X_big.reset_index(drop=True)

In [113]:
# # Find rows which belong to cluster 'big' in Y_full

# Y_full['key'] = Y_full['key'].astype(str)
# Y_big = Y_full[Y_full['key'].isin(big_key.astype(str))]
# Y_big = Y_big.reset_index(drop=True)

In [114]:
# Split X_big and Y_big into training and test sets

X_big['month'] = pd.DatetimeIndex(X_big['date']).month
Y_big['month'] = pd.DatetimeIndex(Y_big['date']).month
X_big_train = X_big.loc[X_big['month'] != 1]
Y_train = Y_big.loc[Y_big['month'] != 1]['sales']
X_big_test = X_big.loc[X_big['month'] == 1]
Y_test = Y_big.loc[Y_big['month'] == 1]['sales']

In [115]:
# Prepare the data for fitting the input of the model

drop_x_cols = ['key', 'pid_x', 'size_x', 'color', 'brand', 'rrp', 'date', 'day_of_week', 
               'mainCategory', 'category', 'subCategory', 'releaseDate', 
               'rrp', 'price', 'month']
X_train = X_big_train.drop(drop_x_cols, axis=1)
X_test = X_big_test.drop(drop_x_cols, axis=1)
print(X_train.columns)
# .values replaces the deprecated .as_matrix()
X_train = X_train.values
# Drop columns 14..27, i.e. 'last_15_day_sales', ..., 'last_28_day_sales'
X_train = np.delete(X_train, np.s_[14:28], axis=1)
Y_train = Y_train.values
X_test = X_test.values
X_test = np.delete(X_test, np.s_[14:28], axis=1)
Y_test = Y_test.values

Index(['last_1_day_sales', 'last_2_day_sales', 'last_3_day_sales',
       'last_4_day_sales', 'last_5_day_sales', 'last_6_day_sales',
       'last_7_day_sales', 'last_8_day_sales', 'last_9_day_sales',
       'last_10_day_sales', 'last_11_day_sales', 'last_12_day_sales',
       'last_13_day_sales', 'last_14_day_sales', 'last_15_day_sales',
       'last_16_day_sales', 'last_17_day_sales', 'last_18_day_sales',
       'last_19_day_sales', 'last_20_day_sales', 'last_21_day_sales',
       'last_22_day_sales', 'last_23_day_sales', 'last_24_day_sales',
       'last_25_day_sales', 'last_26_day_sales', 'last_27_day_sales',
       'last_28_day_sales', 'is_eleventh', 'is_crazy_day', 'day_Friday',
       'day_Monday', 'day_Saturday', 'day_Sunday', 'day_Thursday',
       'day_Tuesday', 'day_Wednesday', 'days_since_release', 'price_diff',
       'color_beige', 'color_blau', 'color_braun', 'color_gelb', 'color_gold',
       'color_grau', 'color_gruen', 'color_khaki', 'color_lila',
       'color_orange

In [116]:
X_test.shape

(397544, 81)

### 3. Train model

In [117]:
# Training the model

model = selected_models[2]
model.fit(X_train, Y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.3, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=65,
       n_jobs=8, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=0.7)

## Test

In [118]:
# Test the model
# In 'X_test', only the first day (Jan 1st) of each item has correct lagged sales values
# Select the 'Jan 1st' row of each item, i.e. every 31st row
# (assumes 31 consecutive January rows per item)

X_Jan1 = X_test[::31, :]
print(X_Jan1)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]]


In [119]:
print(X_Jan1.shape)

(12824, 81)


In [120]:
# Predict the sales units on Jan 1st for each item
# Round the predictions and reshape them into a column vector

Y_Jan1 = model.predict(X_Jan1)
prediction_1 = np.asarray([round(value) for value in Y_Jan1])
prediction_1 = np.reshape(prediction_1, (len(prediction_1),1))
print(prediction_1)

[[0.]
 [0.]
 [0.]
 ...
 [0.]
 [0.]
 [0.]]


In [121]:
# Delete the oldest remaining lag column, 'last_14_day_sales' (index 13)
# Prepend the prediction results as the new 'last_1_day_sales'

X_Jan = X_Jan1
X_Jan = np.delete(X_Jan, np.s_[13:14], axis=1)
X_Jan = np.append(prediction_1, X_Jan, axis=1)
print(X_Jan)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]


In [122]:
# Put the process above into a loop
# Predict the sales units for every day in January for each item

predictions = prediction_1
for i in range(30):
    Y_Jan = model.predict(X_Jan)
    prediction = np.asarray([round(value) for value in Y_Jan])
    prediction = np.reshape(prediction, (len(prediction),1))
    predictions = np.append(predictions, prediction, axis=1)
    X_Jan = np.delete(X_Jan, np.s_[13:14], axis=1)
    X_Jan = np.append(prediction, X_Jan, axis=1)
print(predictions)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
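
The loop above is a recursive (iterated) one-step forecast: each prediction is fed back as the newest lag while the oldest lag is dropped. The sketch below shows the same window shift for a single item with plain lists. `recursive_forecast` and `toy_predict` are hypothetical stand-ins for the loop and for `model.predict`, and the non-lag features are held constant, as they are in the loop above.

```python
def recursive_forecast(predict, last_lags, extra_features, horizon):
    """Roll a one-step model forward, feeding each forecast back as lag 1.

    predict: callable taking one feature list and returning a float
    last_lags: [last_1_day_sales, ..., last_n_day_sales] on the first forecast day
    extra_features: non-lag features, kept unchanged for every step
    """
    lags = list(last_lags)
    forecasts = []
    for _ in range(horizon):
        y_hat = round(predict(lags + list(extra_features)))
        forecasts.append(y_hat)
        lags = [y_hat] + lags[:-1]  # newest forecast in front, oldest lag dropped
    return forecasts

def toy_predict(row):
    """Toy model (assumption): tomorrow's sales equal the mean of the three lags."""
    return sum(row[:3]) / 3

print(recursive_forecast(toy_predict, [3, 0, 0], [], horizon=4))  # [1, 1, 2, 1]
```

Because every step consumes earlier forecasts, errors can accumulate over the horizon.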


## Evaluation

### 1. Restructure the predictions

In [123]:
# Reshape predictions: row ->'big' items, columns -> date

column_name = X_big_test['date'].unique().astype(str)
row_name = X_big_test['key'].unique().astype(str)

In [124]:
# Cumulative predicted sales per item: column k holds the total from Jan 1st up to day k+1
# np.cumsum returns a new array, so 'predictions' keeps the daily values

pred_agg = np.cumsum(predictions, axis=1)
print(pred_agg)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
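
What the aggregation step computes is a running total per item across the days of the month. A tiny pure-Python illustration with made-up daily forecasts (the numbers are not taken from the notebook):

```python
from itertools import accumulate

# Made-up daily forecasts: one row per item, one column per day
daily = [[0, 1, 0, 2],
         [1, 0, 0, 0]]

# Running total per row; the daily values themselves stay untouched
cumulative = [list(accumulate(row)) for row in daily]
print(cumulative)  # [[0, 1, 1, 3], [1, 1, 1, 1]]
```

The last column of each row is the total predicted sales for the whole period, which is what the stock is later compared against.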


### 2. Test_0.csv
#### 2.1. Prepare 

In [125]:
# Load the test data

test_0 = pd.read_csv(test_data_directory + 'test_0.csv')

In [126]:
# Add 'key' for test data by merging 'pid' and 'size'
# Select useful attributes

test_0["key"] = test_0["pid"].map(int).map(str) + test_0["size"]
test_0_big = test_0.loc[test_0['key'].isin(row_name)]
subtest_0_big = test_0_big[['key','stock','sold_out_date']]
test = np.asarray(subtest_0_big)
print(test)

[['10001L' 1.0 '2018-01-07']
 ['100035 ( 43-46 )' 1.0 '2018-01-30']
 ['10008XL' 4.0 '2018-01-30']
 ...
 ['228782XL' 1.0 '2018-01-02']
 ['22878M' 2.0 '2018-01-28']
 ['22881S' 1.0 '2018-01-02']]


#### 2.2. Match the 'test' with the 'pred_agg'

In [127]:
# Define arrays for storing the predicted day and date

pred_day = np.zeros((len(test),1), dtype=int)
pred_date = np.asarray(test_0_big[['sold_out_date']])

for i in range(len(test)):
    
    # 'key' is the key for each test item
    key = test[i,0]
    # Find the index of the item in predictions sharing the same key
    index = 0
    
    # Match the items in 'test' with the items in 'pred_agg'
    # Return the index of the item in the 'pred_agg'
    for j in range(len(row_name)):
        if row_name[j] == key:
            index = j
            break
    
    # Match
    if test[i,1] < pred_agg[index,0]:
        pred_day[i,0] = 1
        pred_date[i,0] = column_name[0]
        continue
    if test[i,1] > pred_agg[index,30]:
        pred_day[i,0] = 15
        pred_date[i,0] = column_name[14]
        continue
    for k in range(len(pred_agg[0])):
        if pred_agg[index,k] - test[i,1] >=0:
            pred_day[i,0] = k+1
            pred_date[i,0] = column_name[k]
            break
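
The matching rule above amounts to: find the first day on which the cumulative forecast reaches the stock, and fall back to day 15 if it never does. The sketch below expresses that rule with a dictionary lookup and a binary search. It assumes non-negative daily forecasts, so that each cumulative row is non-decreasing; `sold_out_day` and `row_index` are hypothetical names, and the forecast values are made up.

```python
from bisect import bisect_left

def sold_out_day(cumulative, stock, fallback_day=15):
    """1-based day on which the cumulative forecast first reaches `stock`.

    Assumes `cumulative` is non-decreasing. Returns fallback_day if the
    stock is never reached within the forecast horizon.
    """
    if stock > cumulative[-1]:
        return fallback_day
    return bisect_left(cumulative, stock) + 1

row_name = ['10001L', '10008XL']                         # item keys
pred_agg = [[0, 1, 1, 3], [0, 0, 0, 0]]                  # made-up cumulative forecasts
row_index = {key: i for i, key in enumerate(row_name)}   # key -> row, replaces the inner search loop

print(sold_out_day(pred_agg[row_index['10001L']], stock=1))   # 2
print(sold_out_day(pred_agg[row_index['10001L']], stock=3))   # 4
print(sold_out_day(pred_agg[row_index['10008XL']], stock=4))  # 15
```

Building the dictionary once avoids scanning all keys for every test record.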

#### 2.3. Error Calculation

In [128]:
# Find the day of the real sold out date

day = [i[-2:] for i in test[:,2].tolist()]
real_day = np.reshape(np.asarray(list(map(int,day))), (len(test),1))
print(real_day)

[[ 7]
 [30]
 [30]
 ...
 [ 2]
 [28]
 [ 2]]


In [129]:
# Show the predicted sold out date

print(pred_day)

[[15]
 [15]
 [15]
 ...
 [15]
 [15]
 [15]]


In [130]:
# Error calculation

error = sqrt(np.sum(np.abs(np.subtract(pred_day, real_day))))
print(error)

258.20921749620015
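
The error computed above is the square root of the summed absolute differences between predicted and real sold-out day. A minimal sketch on the three leading rows printed above (predicted day 15 against real days 7, 30 and 30); the function name `sold_out_error` is hypothetical:

```python
from math import sqrt

def sold_out_error(predicted_days, real_days):
    """Square root of the summed absolute day differences."""
    return sqrt(sum(abs(p - r) for p, r in zip(predicted_days, real_days)))

# |15 - 7| + |15 - 30| + |15 - 30| = 38
print(sold_out_error([15, 15, 15], [7, 30, 30]))  # sqrt(38), about 6.16
```

Since this is the root of a sum rather than of a mean, the value grows with the number of test records and is only comparable between test sets of similar size.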


#### 2.4. Visualize result

In [131]:
# Visualize the result in dataframe

result = np.append(test, pred_date, axis=1)
result_column =['key','stock','real_sold_out_day','predicted_sold_out_day']
prediction_result = pd.DataFrame(result, columns=result_column)
prediction_result

| | key | stock | real_sold_out_day | predicted_sold_out_day |
|---|---|---|---|---|
| 0 | 10001L | 1 | 2018-01-07 | 2018-01-15 |
| 1 | 100035 ( 43-46 ) | 1 | 2018-01-30 | 2018-01-15 |
| 2 | 10008XL | 4 | 2018-01-30 | 2018-01-15 |
| 3 | 10013L | 1 | 2018-01-24 | 2018-01-15 |
| 4 | 10013M | 1 | 2018-01-22 | 2018-01-15 |
| 5 | 10020XL | 1 | 2018-01-18 | 2018-01-15 |
| 6 | 1003143 | 1 | 2018-01-11 | 2018-01-15 |
| 7 | 10035L ( 152-158 ) | 1 | 2018-01-26 | 2018-01-15 |
| 8 | 10035XL ( 158-170 ) | 4 | 2018-01-16 | 2018-01-15 |
| 9 | 10035XS ( 116-128 ) | 1 | 2018-01-31 | 2018-01-15 |


### 3. Test_1.csv

In [132]:
# Load the test data
test_0 = pd.read_csv(test_data_directory + 'test_1.csv')

# Add 'key' for test data by merging 'pid' and 'size'
# Select useful attributes
test_0["key"] = test_0["pid"].map(int).map(str) + test_0["size"]
test_0_big = test_0.loc[test_0['key'].isin(row_name)]
subtest_0_big = test_0_big[['key','stock','sold_out_date']]
test = np.asarray(subtest_0_big)

In [133]:
# Define arrays for storing the predicted day and date

pred_day = np.zeros((len(test),1), dtype=int)
pred_date = np.asarray(test_0_big[['sold_out_date']])

for i in range(len(test)):
    
    # 'key' is the key for each test item
    key = test[i,0]
    # Find the index of the item in predictions sharing the same key
    index = 0
    
    # Match the items in 'test' with the items in 'pred_agg'
    # Return the index of the item in the 'pred_agg'
    for j in range(len(row_name)):
        if row_name[j] == key:
            index = j
            break
    
    # Match
    if test[i,1] < pred_agg[index,0]:
        pred_day[i,0] = 1
        pred_date[i,0] = column_name[0]
        continue
    if test[i,1] > pred_agg[index,30]:
        pred_day[i,0] = 15
        pred_date[i,0] = column_name[14]
        continue
    for k in range(len(pred_agg[0])):
        if pred_agg[index,k] - test[i,1] >=0:
            pred_day[i,0] = k+1
            pred_date[i,0] = column_name[k]
            break

In [134]:
# Error calculation

day = [i[-2:] for i in test[:,2].tolist()]
real_day = np.reshape(np.asarray(list(map(int,day))), (len(test),1))
error = sqrt(np.sum(np.abs(np.subtract(pred_day, real_day))))
print(error)

258.2034081881957


### 4. Test_2.csv

In [135]:
# Load the test data
test_0 = pd.read_csv(test_data_directory + 'test_2.csv')

# Add 'key' for test data by merging 'pid' and 'size'
# Select useful attributes
test_0["key"] = test_0["pid"].map(int).map(str) + test_0["size"]
test_0_big = test_0.loc[test_0['key'].isin(row_name)]
subtest_0_big = test_0_big[['key','stock','sold_out_date']]
test = np.asarray(subtest_0_big)

In [136]:
# Define arrays for storing the predicted day and date

pred_day = np.zeros((len(test),1), dtype=int)
pred_date = np.asarray(test_0_big[['sold_out_date']])

for i in range(len(test)):
    
    # 'key' is the key for each test item
    key = test[i,0]
    # Find the index of the item in predictions sharing the same key
    index = 0
    
    # Match the items in 'test' with the items in 'pred_agg'
    # Return the index of the item in the 'pred_agg'
    for j in range(len(row_name)):
        if row_name[j] == key:
            index = j
            break
    
    # Match
    if test[i,1] < pred_agg[index,0]:
        pred_day[i,0] = 1
        pred_date[i,0] = column_name[0]
        continue
    if test[i,1] > pred_agg[index,30]:
        pred_day[i,0] = 15
        pred_date[i,0] = column_name[14]
        continue
    for k in range(len(pred_agg[0])):
        if pred_agg[index,k] - test[i,1] >=0:
            pred_day[i,0] = k+1
            pred_date[i,0] = column_name[k]
            break

In [137]:
# Error calculation

day = [i[-2:] for i in test[:,2].tolist()]
real_day = np.reshape(np.asarray(list(map(int,day))), (len(test),1))
error = sqrt(np.sum(np.abs(np.subtract(pred_day, real_day))))
print(error)

258.5169240107889


### 5. Test_3.csv

In [138]:
# Load the test data
test_0 = pd.read_csv(test_data_directory + 'test_3.csv')

# Add 'key' for test data by merging 'pid' and 'size'
# Select useful attributes
test_0["key"] = test_0["pid"].map(int).map(str) + test_0["size"]
test_0_big = test_0.loc[test_0['key'].isin(row_name)]
subtest_0_big = test_0_big[['key','stock','sold_out_date']]
test = np.asarray(subtest_0_big)

In [139]:
# Define arrays for storing the predicted day and date

pred_day = np.zeros((len(test),1), dtype=int)
pred_date = np.asarray(test_0_big[['sold_out_date']])

for i in range(len(test)):
    
    # 'key' is the key for each test item
    key = test[i,0]
    # Find the index of the item in predictions sharing the same key
    index = 0
    
    # Match the items in 'test' with the items in 'pred_agg'
    # Return the index of the item in the 'pred_agg'
    for j in range(len(row_name)):
        if row_name[j] == key:
            index = j
            break
    
    # Match
    if test[i,1] < pred_agg[index,0]:
        pred_day[i,0] = 1
        pred_date[i,0] = column_name[0]
        continue
    if test[i,1] > pred_agg[index,30]:
        pred_day[i,0] = 15
        pred_date[i,0] = column_name[14]
        continue
    for k in range(len(pred_agg[0])):
        if pred_agg[index,k] - test[i,1] >=0:
            pred_day[i,0] = k+1
            pred_date[i,0] = column_name[k]
            break

In [140]:
# Error calculation

day = [i[-2:] for i in test[:,2].tolist()]
real_day = np.reshape(np.asarray(list(map(int,day))), (len(test),1))
error = sqrt(np.sum(np.abs(np.subtract(pred_day, real_day))))
print(error)

257.9961240018927


### 6. Test_4.csv

In [141]:
# Load the test data
test_0 = pd.read_csv(test_data_directory + 'test_4.csv')

# Add 'key' for test data by merging 'pid' and 'size'
# Select useful attributes
test_0["key"] = test_0["pid"].map(int).map(str) + test_0["size"]
test_0_big = test_0.loc[test_0['key'].isin(row_name)]
subtest_0_big = test_0_big[['key','stock','sold_out_date']]
test = np.asarray(subtest_0_big)

In [142]:
# Define arrays for storing the predicted day and date

pred_day = np.zeros((len(test),1), dtype=int)
pred_date = np.asarray(test_0_big[['sold_out_date']])

for i in range(len(test)):
    
    # 'key' is the key for each test item
    key = test[i,0]
    # Find the index of the item in predictions sharing the same key
    index = 0
    
    # Match the items in 'test' with the items in 'pred_agg'
    # Return the index of the item in the 'pred_agg'
    for j in range(len(row_name)):
        if row_name[j] == key:
            index = j
            break
    
    # Match
    if test[i,1] < pred_agg[index,0]:
        pred_day[i,0] = 1
        pred_date[i,0] = column_name[0]
        continue
    if test[i,1] > pred_agg[index,30]:
        pred_day[i,0] = 15
        pred_date[i,0] = column_name[14]
        continue
    for k in range(len(pred_agg[0])):
        if pred_agg[index,k] - test[i,1] >=0:
            pred_day[i,0] = k+1
            pred_date[i,0] = column_name[k]
            break

In [143]:
# Error calculation

day = [i[-2:] for i in test[:,2].tolist()]
real_day = np.reshape(np.asarray(list(map(int,day))), (len(test),1))
error = sqrt(np.sum(np.abs(np.subtract(pred_day, real_day))))
print(error)

257.93603858321154
