# Baseline One-Hot Evaluation Results

This notebook reports baseline results for several evaluation models trained on the one-hot encoded product data, together with a confusion matrix. It records the preliminary process of trying out different regression models and neural networks. The team selected an MLP classifier as the main evaluation model going forward. See Evaluation.ipynb for the full evaluation of the team's embeddings.

The classification task is to predict "merch_lob_nm". For a fair prediction, the one-hot encoded vectors exclude all hierarchical columns (merch_division_nm, merch_lob_nm, merch_bus_cat_nm, merch_subcat_nm, merch_fineline_nm), so the features cannot leak the label.
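The notebook loads a precomputed `onehot.csv`, so the encoding step itself is not shown. As a minimal sketch of the idea, assuming a toy product table with made-up values and using `pd.get_dummies` (which may differ from how the team actually built the file), the hierarchy columns can be dropped before encoding:

```python
import pandas as pd

# Hypothetical toy product table; column names mirror those mentioned above,
# the values are invented for illustration only.
products = pd.DataFrame(
    {
        "ctr_product_num": [1, 2, 3],
        "corporate_status_cd": ["ACT", "FD", "ACT"],
        "package_weight_qty": [0.5, 45.3, 1.0],
        "merch_division_nm": ["A", "B", "A"],
        "merch_lob_nm": ["HARDWARE", "TIRES", "FISHING"],
    }
).set_index("ctr_product_num")

# Hierarchy columns are dropped so the features cannot reveal the label.
hierarchy_cols = [
    "merch_division_nm",
    "merch_lob_nm",
    "merch_bus_cat_nm",
    "merch_subcat_nm",
    "merch_fineline_nm",
]
# errors="ignore" skips any listed column that is absent from the toy table.
features = products.drop(columns=hierarchy_cols, errors="ignore")

# One-hot encode the categorical column; numeric columns pass through unchanged.
onehot = pd.get_dummies(features, columns=["corporate_status_cd"], dtype=int)
print(onehot)
```

The label column is kept aside and joined back only when the classification target is needed, as done in the product evaluation section.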

References:

https://www.statology.org/one-hot-encoding-in-python/

In [99]:
# Import the packages for this lab
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Import linear regression models
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV

# Import logistic regression models
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Import confusion matrix function from sklearn
from sklearn.metrics import confusion_matrix

In [100]:
# pandas and numpy are already imported above
from sklearn.preprocessing import OneHotEncoder


In [101]:
# pandas and numpy are already imported above
from sklearn.model_selection import train_test_split


In [102]:
%matplotlib inline

In [103]:
# Forward slashes avoid invalid escape sequences such as "\o" and work on every OS
product_onehot_path = "embeddings/onehot.csv"
df_cont_and_onehot = pd.read_csv(product_onehot_path, index_col=0, header=0)

store_onehot_path = "embeddings/onehot_store.csv"
df_cont_and_onehot_store = pd.read_csv(store_onehot_path, index_col=0, header=0)

# Product One-hot Embedding Evaluation

In [104]:
product_path = "clean_data/cleaned_products.csv"
df_product_standard = pd.read_csv(product_path)
df_product_standard = df_product_standard[["ctr_product_num", "merch_lob_nm"]]

df_cont_and_onehot = df_cont_and_onehot.join(df_product_standard.set_index("ctr_product_num"))
df_cont_and_onehot.dropna(inplace=True)

In [105]:
df_cont_and_onehot

ctr_product_num,corporate_status_cd_ACT,corporate_status_cd_DWO,corporate_status_cd_FD,corporate_status_cd_INA,corporate_status_cd_INC,corporate_status_cd_SD,corporate_status_cd_TD,ctr_good_better_best_cd_BEST,ctr_good_better_best_cd_BETTER,ctr_good_better_best_cd_GOOD,...,cold_sensitive_ind_N,cold_sensitive_ind_Y,heat_sensitive_ind_N,heat_sensitive_ind_Y,package_depth_qty,package_height_qty,package_volume_qty,package_weight_qty,national_consumer_price_amt,merch_lob_nm
650092,0,0,1,0,0,0,0,0,1,0,...,1,0,1,0,21.3,17.5,0.097515,0.667,0.000,HARDWARE
62383,0,0,1,0,0,0,0,0,0,0,...,1,0,1,0,31.8,9.3,5.442437,45.300,191.490,TIRES
6680281,1,0,0,0,0,0,0,0,0,1,...,1,0,1,0,1.0,1.0,0.000579,1.000,13.990,BACKYARD LIVING
1121723,1,0,0,0,0,0,0,1,0,0,...,1,0,1,0,32.0,6.0,0.777778,10.500,0.000,HEAVY AUTO PARTS
467698,0,0,1,0,0,0,0,1,0,0,...,1,0,1,0,15.5,15.1,0.262990,1.983,119.990,CAR CARE & ACCESSORIES
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
779497,1,0,0,0,0,0,0,1,0,0,...,1,0,1,0,7.9,6.8,0.153483,13.227,0.000,FISHING
962038,0,0,1,0,0,0,0,0,0,0,...,1,0,1,0,0.1,4.8,0.013333,0.100,5.052,NON MERCHANDISING LOB
1951189,0,0,1,0,0,0,0,0,0,0,...,1,0,1,0,1.6,46.9,0.124262,10.970,16.410,NON MERCHANDISING LOB
6511575,1,0,0,0,0,0,0,0,0,1,...,1,0,1,0,1.0,1.0,0.000579,1.000,5.790,SEASONAL


In [106]:
#replace merch_lob_nm with numerical values
#original_label = df_cont_and_onehot.merch_lob_nm
#df_cont_and_onehot.merch_lob_nm = pd.Categorical(pd.factorize(df_cont_and_onehot.merch_lob_nm)[0])

In [107]:
#create a mapping of original label to the new numerical label
#label_map = dict(zip(df_cont_and_onehot['merch_lob_nm'], original_label))
#label_map

Sample 100k products

In [108]:
df_cont_and_onehot100k = df_cont_and_onehot.sample(frac=1, random_state=42)[:100000]

Create train test split for product embeddings.

In [109]:
drop_for_X = ["merch_lob_nm"]

X = df_cont_and_onehot100k.drop(columns=drop_for_X)
# Select the label as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning in fit()
Y = df_cont_and_onehot100k["merch_lob_nm"]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

## MLP Classifier (Neural Network Model)

https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier

In [110]:
# train_test_split is already imported above
from sklearn.neural_network import MLPClassifier

In [111]:
#3 hidden layers
mlp = MLPClassifier(hidden_layer_sizes=(150, 100, 50), random_state=1, max_iter=300).fit(X_train, y_train)

train_score = mlp.score(X_train, y_train)
test_score = mlp.score(X_test, y_test)

y_pred = mlp.predict(X_test)



In [112]:
print(train_score)
print(test_score)

0.812179104477612
0.7552727272727273
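The two numbers above are the mean accuracy on the training and test splits. A confusion matrix breaks the test accuracy down per class. The following is a minimal sketch on made-up labels that stand in for `y_test` and `y_pred`; the label values and the variable names `y_true`, `y_hat`, and `cm_df` are illustrative assumptions, not results from this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Invented labels standing in for y_test and y_pred from the MLP cell.
y_true = np.array(["HARDWARE", "TIRES", "TIRES", "FISHING", "HARDWARE", "TIRES"])
y_hat = np.array(["HARDWARE", "TIRES", "HARDWARE", "FISHING", "HARDWARE", "TIRES"])

# Fix the class order explicitly so rows and columns are easy to read.
labels = sorted(set(y_true) | set(y_hat))
cm = confusion_matrix(y_true, y_hat, labels=labels)

# Rows are true classes, columns are predicted classes.
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)

# The diagonal holds the correct predictions, so its share equals the accuracy.
accuracy = np.trace(cm) / cm.sum()
print(f"accuracy = {accuracy:.3f}")
```

With the real predictions, `cm_df` could be passed to `sns.heatmap(cm_df, annot=True, fmt="d")` to see which lines of business are confused with each other.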


# Store One-hot Embedding Evaluation

In [113]:
#read in the sales data
# Forward slashes avoid invalid escape sequences such as "\s" and work on every OS
store_sales_path = "embeddings/store_embeddings/store_sales_embedding.csv"
df_sales = pd.read_csv(store_sales_path)
df_sales = df_sales[["yr_num","wk_num","store_num","sales_qty"]]

#append one-hot store embeddings onto the sales data 
df_cont_and_onehot_store = df_cont_and_onehot_store.join(df_sales.set_index("store_num"))
df_cont_and_onehot_store.dropna(inplace=True)
df_cont_and_onehot_store

store_num,latitude_qty,longitude_qty,retail_square_ft_qty,ins_garden_centre_sqr_ft_qty,number_of_service_bays_qty,checkouts_count,province_cd_AB,province_cd_BC,province_cd_MB,province_cd_NB,...,store_concept_type_nm_Smart2,store_concept_type_nm_Traditional,onsite_propane_txt_No,onsite_propane_txt_Yes,winterized_canopy_txt_No,winterized_canopy_txt_Not Determined,winterized_canopy_txt_Yes,yr_num,wk_num,sales_qty
1,44.149236,-79.884000,47006,0,10,11,0,0,0,0,...,0,0,0,1,0,1,0,2021,36,22847
1,44.149236,-79.884000,47006,0,10,11,0,0,0,0,...,0,0,0,1,0,1,0,2021,37,21874
1,44.149236,-79.884000,47006,0,10,11,0,0,0,0,...,0,0,0,1,0,1,0,2021,38,20441
1,44.149236,-79.884000,47006,0,10,11,0,0,0,0,...,0,0,0,1,0,1,0,2021,39,21213
1,44.149236,-79.884000,47006,0,10,11,0,0,0,0,...,0,0,0,1,0,1,0,2021,40,19903
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
935,48.471791,-123.334038,1995,0,8,2,0,1,0,0,...,0,0,1,0,0,1,0,2022,31,1252
935,48.471791,-123.334038,1995,0,8,2,0,1,0,0,...,0,0,1,0,0,1,0,2022,32,1631
935,48.471791,-123.334038,1995,0,8,2,0,1,0,0,...,0,0,1,0,0,1,0,2022,33,1749
935,48.471791,-123.334038,1995,0,8,2,0,1,0,0,...,0,0,1,0,0,1,0,2022,34,1310
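Each store appears once per sales week in the output above because `DataFrame.join` performs a left join on the index, and the sales table has many rows per `store_num`. A minimal sketch with invented numbers (the two small frames below are assumptions for illustration) shows the one-to-many behaviour:

```python
import pandas as pd

# One row per store, indexed by store_num (toy values).
stores = pd.DataFrame(
    {"store_num": [1, 2], "retail_square_ft_qty": [47006, 1995]}
).set_index("store_num")

# Several weekly sales rows per store; store 3 has no matching store features.
sales = pd.DataFrame(
    {
        "store_num": [1, 1, 2, 3],
        "yr_num": [2021, 2021, 2022, 2022],
        "wk_num": [36, 37, 31, 31],
        "sales_qty": [22847, 21874, 1252, 500],
    }
)

# Left join on the index: store features are repeated for every sales week,
# and sales rows for stores missing from `stores` are dropped.
joined = stores.join(sales.set_index("store_num"))
print(joined)
```

One consequence is that rows from the same store share identical store features, so a random train/test split places the same store on both sides of the split.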


## Sales Forecasting using Linear Regression 

In [116]:
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV

#shuffle the data
df = df_cont_and_onehot_store.sample(frac=1, random_state=42)

drop_for_X = ["sales_qty"]

X = df.drop(columns=drop_for_X)
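# Select the target as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning in fit()
Y = df["sales_qty"]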

#obtain train test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

# Lasso CV applies cross validation automatically
linreg = LassoCV()
linreg.fit(X_train, y_train)

# Predict the sales
y_test_predictions = linreg.predict(X_test)



In [117]:
# Evaluate models
train_score = linreg.score(X_train, y_train)
test_score = linreg.score(X_test, y_test)
print(f'The train score is {train_score:.3f} and the test score is {test_score:.3f}')

The train score is 0.511 and the test score is 0.509


The baseline R^2 score for the one-hot store embeddings is 0.509 on the test data (0.511 on the training data).
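For regressors such as `LassoCV`, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A minimal sketch on synthetic data (the data, coefficients, and variable names below are assumptions for illustration, not the store data) checks that a manual computation matches `score`:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic regression problem: three features, linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# LassoCV picks the regularisation strength by cross-validation.
model = LassoCV(cv=5).fit(X, y)
y_hat = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

r2_score_method = model.score(X, y)
print(r2_manual, r2_score_method)
```

Read this way, a test score of 0.509 means the model explains roughly half of the variance in weekly `sales_qty` relative to always predicting the mean.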