# One-Hot Encodings

This notebook creates the one-hot encoded product data and writes it to a CSV file for use by other notebooks. Columns with hierarchical data are excluded from the one-hot encoded vectors.

References:

https://www.statology.org/one-hot-encoding-in-python/

In [93]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

In [94]:
product_path = "clean_data/cleaned_products.csv"
df_product_standard = pd.read_csv(product_path)

df_product_standard = df_product_standard.drop(columns=["Unnamed: 0"])
df_product_standard = df_product_standard.sample(frac=1, random_state=42)

In [None]:
df_product_standard.head()

In [96]:
#list the continuous columns and categorical columns
contcols = ["ctr_product_num", "package_depth_qty", "package_height_qty", "package_volume_qty", "package_weight_qty", "national_consumer_price_amt"]
catcols = ["corporate_status_cd", "ctr_good_better_best_cd", "ctr_product_profile_cd", "ctr_consumer_role_cd", "cold_sensitive_ind", "heat_sensitive_ind"]

df_categorical = df_product_standard[catcols]
df_continuous = df_product_standard[contcols]

In [None]:
#get one hot encodings for categorical columns
df_onehot = pd.get_dummies(df_categorical)
df_onehot

In [None]:
#combine continuous column values with onehot encoding for each product
df_cont_and_onehot = df_onehot.join(df_continuous)
df_cont_and_onehot

In [99]:
df_cont_and_onehot.set_index("ctr_product_num", inplace=True)

In [100]:
df_cont_and_onehot.dropna(inplace=True)

In [None]:
df_cont_and_onehot

In [102]:
df_cont_and_onehot.to_csv("embeddings/onehot.csv")

# Data Preview

These embeddings are generated from product_standard_details.csv by first converting the categorical columns into one-hot encodings, then combining them with the numerical (continuous) values. The end result is a sparse vector for each product. Note: hierarchical categorical columns were dropped, since these embeddings will later be used to predict product category.
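The steps described above can be sketched on a toy table. This is only an illustration: the column names (`product_id`, `tier`, `weight`) and values are invented, not taken from the product data.

```python
import pandas as pd

# Toy data; column names and values are invented for illustration.
toy = pd.DataFrame(
    {
        "product_id": [101, 102, 103],
        "tier": ["good", "best", "good"],
        "weight": [1.5, None, 0.7],
    }
)

# One-hot encode the categorical column: one indicator column per category.
toy_onehot = pd.get_dummies(toy[["tier"]])

# Join the continuous columns back on the shared row index.
toy_vectors = toy_onehot.join(toy[["product_id", "weight"]])

# Index by product id and drop rows that contain missing values.
toy_vectors = toy_vectors.set_index("product_id").dropna()

print(toy_vectors)
```

Product 102 is dropped because its weight is missing, which mirrors the `dropna` step applied to the real data.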

For example, product 779760 is a fishing hook, with the following embedding:

In [None]:
#look up by product number (the index label), not by row position
df_cont_and_onehot.loc[779760]
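Because `ctr_product_num` was set as the index earlier, a product number is an index label rather than a row position. A small sketch of the difference, using an invented three-row frame (the prices are made up):

```python
import pandas as pd

# Hypothetical frame indexed by product number; values are invented.
demo = pd.DataFrame(
    {"price": [9.99, 4.50, 12.00]},
    index=pd.Index([779760, 12, 5000], name="ctr_product_num"),
)

by_label = demo.loc[779760]  # the row whose index label is 779760
by_position = demo.iloc[0]   # the first row, whatever its label is
```

Here both lookups return the same row, but only `.loc` is tied to the product number; `demo.iloc[779760]` would raise an `IndexError` because the frame has only three rows.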

# One-Hot Encoded Store Embeddings

In [104]:
store_path = "clean_data/cleaned_store.csv"
df_store = pd.read_csv(store_path)

df_store = df_store.drop(columns=["Unnamed: 0"])
#df_store = df_store.sample(frac=1, random_state=42)

In [None]:
df_store

In [106]:
#list the continuous columns and categorical columns (dropping store name and shopping center name)
contcols_store = ["store_num", "latitude_qty", "longitude_qty", "retail_square_ft_qty", "ins_garden_centre_sqr_ft_qty", "number_of_service_bays_qty", "checkouts_count"]
catcols_store = ["province_cd", "store_size_cd", "store_concept_type_nm", "onsite_propane_txt", "winterized_canopy_txt"]

df_categorical_store = df_store[catcols_store]
df_continuous_store = df_store[contcols_store]

In [None]:
#df with only continuous (numerical) columns
df_continuous_store

In [None]:
#df with only categorical columns
df_categorical_store

In [None]:
#get one hot encodings for categorical columns
df_onehot_store = pd.get_dummies(df_categorical_store)
df_onehot_store

In [None]:
#combine continuous column values with onehot encoding for each store
df_cont_and_onehot_store = df_continuous_store.join(df_onehot_store)
df_cont_and_onehot_store.set_index("store_num", inplace=True)
df_cont_and_onehot_store.dropna(inplace=True)
df_cont_and_onehot_store

In [111]:
df_cont_and_onehot_store.to_csv("embeddings/onehot_store.csv")
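A downstream notebook would likely want the id column restored as the index when loading these files. The following is a minimal sketch of that round trip, assuming the default `to_csv` output; the toy frame and the in-memory buffer stand in for the real embeddings and the file on disk.

```python
import io

import pandas as pd

# Invented stand-in for the store embeddings; real columns will differ.
toy_store = pd.DataFrame(
    {"checkouts_count": [8, 12], "province_cd_ON": [1, 0]},
    index=pd.Index([1, 2], name="store_num"),
)

buffer = io.StringIO()
toy_store.to_csv(buffer)  # the named index is written as the first column
buffer.seek(0)

# Restore the id column as the index instead of getting a default RangeIndex.
reloaded = pd.read_csv(buffer, index_col="store_num")
```

Without `index_col`, `store_num` would come back as an ordinary column alongside the features.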