![HSV-AI Logo](https://github.com/HSV-AI/hugo-website/blob/master/static/images/logo_v9.png?raw=true)

[Dataset](https://www.kaggle.com/mkechinov/ecommerce-purchase-history-from-electronics-store)

This file contains purchase data from April 2020 to November 2020 from a large home appliances and electronics online store.

Each row in the file represents an event. All events are related to products and users. Each event is like a many-to-many relation between products and users.


In [1]:
%reload_kedro

2022-03-07 13:24:35,298 - kedro.framework.session.store - INFO - `read()` not implemented for `BaseSessionStore`. Assuming empty store.
2022-03-07 13:24:35,330 - kedro.config.config - INFO - Config from path `/home/jlangley/git/product-recommendation/conf/electronics` will override the following existing top-level config keys: alpha, factors, filter_value, iterations, maximum_order_size, minimum_order_size, regularization, seed, test_size, wandb_project
2022-03-07 13:24:35,331 - root - INFO - ** Kedro project productrec
2022-03-07 13:24:35,332 - root - INFO - Defined global variable `context`, `session` and `catalog`
2022-03-07 13:24:35,345 - root - INFO - Registered line magic `run_viz`


No files found in ['/home/jlangley/git/product-recommendation/conf/base', '/home/jlangley/git/product-recommendation/conf/electronics'] matching the glob pattern(s): ['credentials*', 'credentials*/**', '**/credentials*']
  warn(f"Credentials not found in your Kedro project config.\n{str(exc)}")


# Global Imports

In [2]:
import pandas as pd
import numpy as np
import random
from matplotlib import pyplot as plt

# Data Exploration

In [3]:
df = catalog.load("electronics_kaggle_data")

df.head()


2022-03-07 13:24:37,502 - kedro.io.data_catalog - INFO - Loading data from `electronics_kaggle_data` (CSVDataSet)...


| Unnamed: 0 | event_time | order_id | product_id | category_id | category_code | brand | price | user_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-04-24 11:50:39 UTC | 2294359932054536986 | 1515966223509089906 | 2.268105e+18 | electronics.tablet | samsung | 162.01 | 1.515916e+18 |
| 1 | 2020-04-24 11:50:39 UTC | 2294359932054536986 | 1515966223509089906 | 2.268105e+18 | electronics.tablet | samsung | 162.01 | 1.515916e+18 |
| 2 | 2020-04-24 14:37:43 UTC | 2294444024058086220 | 2273948319057183658 | 2.268105e+18 | electronics.audio.headphone | huawei | 77.52 | 1.515916e+18 |
| 3 | 2020-04-24 14:37:43 UTC | 2294444024058086220 | 2273948319057183658 | 2.268105e+18 | electronics.audio.headphone | huawei | 77.52 | 1.515916e+18 |
| 4 | 2020-04-24 19:16:21 UTC | 2294584263154074236 | 2273948316817424439 | 2.268105e+18 | | karcher | 217.57 | 1.515916e+18 |


## Checking for missing values

It looks like the order and product IDs are always available. Those are all that we will be using from this dataset, so the rest is fine.

In [None]:
df.isna().sum()

Let's look at the number of products and see how they are distributed among the orders. We can use the value_counts method from pandas to get an idea of how often each product is ordered.

In [None]:
product_counts = df['product_id'].value_counts().to_numpy()
print('There are', len(product_counts), 'unique products\n')
print('Here are the counts of products ordered from largest to smallest')
print(product_counts)
print('\nAnd a graph of what the curve looks like:')
plt.plot(product_counts) 
plt.show()

Wow! It looks like there are a few products that are purchased a lot. Let's take a look at those to see what they are.

In [None]:
df['product_id'].value_counts().head()

In [None]:
print(len(df['order_id'].unique()))
print(len(df))
# from collections import Counter
# Counter(df['product_id'].value_counts().to_numpy())[3]

This is a very extreme curve. It's unlikely that we will be able to use any products that don't appear in multiple orders. We can do a few more things to see how much usable data we have.

First, we will tell `value_counts` to report fractions of the total instead of raw counts (`normalize=True`) and to group the values into 10 equal-width bins (`bins=10`).

In [None]:
df['product_id'].value_counts(normalize=True, bins=10)
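
One detail worth keeping in mind: when `bins` is passed, pandas bins the values of the Series itself, which here are the product IDs. If the goal is to see how the per-product frequencies are distributed, one option is to bin the counts instead. The sketch below uses a made-up toy Series (not the Kaggle data) and is only one possible way to do it.

```python
import pandas as pd

# Toy data (hypothetical): product 'a' appears 6 times, 'b' twice, 'c' and 'd' once
toy = pd.Series(list("aaaaaabbcd"), name="product_id")

# Number of rows for each product
per_product = toy.value_counts()

# Share of products whose row count falls into each of 5 equal-width bins
binned = per_product.value_counts(normalize=True, bins=5).sort_index()
print(binned)
```

In this toy case three of the four products land in the lowest bin, which is the same long-tail shape seen in the plot above.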

In [None]:
df['price'].value_counts().sort_index().plot()

In [None]:
totals = df.groupby(df.order_id)['price'].sum()
 
totals.plot()

In [None]:
df['category_code'].value_counts()

In [None]:
df['brand'].value_counts()

Another thing we can do is compute the sparsity of the data. This is useful to see if there is enough overlap between the orders and products to make a useful decision for recommendations.

In [None]:
order_counts = df['order_id'].value_counts()
num_orders = len(order_counts)
num_items = len(product_counts)
sparsity = 1 - len(df) / (num_orders * num_items)
print(f'number of orders: {num_orders}, number of items: {num_items}')
print(f'matrix sparsity: {sparsity:f}')

Compare that with the MovieLens 100k dataset, which has:

```
number of users: 943, number of items: 1682
matrix sparsity: 0.936953
```
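
The sparsity computed for the ecommerce data uses `len(df)` as the number of filled cells, so a repeated (order, product) row is counted more than once. A slightly stricter variant counts each pair only once. The helper name `interaction_sparsity` and the toy frame below are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def interaction_sparsity(frame, row_col="order_id", item_col="product_id"):
    """Fraction of empty cells in the order x product matrix."""
    # Each distinct (order, product) pair fills exactly one cell
    pairs = frame[[row_col, item_col]].drop_duplicates()
    n_rows = pairs[row_col].nunique()
    n_items = pairs[item_col].nunique()
    return 1 - len(pairs) / (n_rows * n_items)

# Toy data (hypothetical): order 1 contains product 10 twice
toy = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 3],
    "product_id": [10, 10, 11, 10, 12],
})
toy_sparsity = interaction_sparsity(toy)
print(f"matrix sparsity: {toy_sparsity:f}")
```

Here 4 distinct pairs fill a 3 x 3 matrix, so the sparsity is 1 - 4/9. Counting the duplicate row as well would give 1 - 5/9 and make the data look denser than it is.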

Besides the sparsity, another issue with this dataset is the much larger number of items and orders. When I tried to reuse a notebook built for the MovieLens 100k dataset on this ecommerce data, it immediately ran out of memory when attempting to use the KNNBasic algorithm.

In [None]:
product = 943 * 1682
print(f'Size for movielens: {product:,}')

product = 1435266 * 25113
print(f'Size for ecommerce dataset: {product:,}')

This is a pretty clear reason why the in-memory recommendation approaches that work with movielens run out of memory when trying to apply them to the ecommerce dataset.
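
To put those cell counts into memory terms, here is a rough back-of-the-envelope sketch. It assumes 4 bytes per cell for a dense matrix and, for a coordinate-style sparse layout, one 4-byte value plus two 4-byte indices per stored entry. Those sizes are assumptions; the notebook does not say how KNNBasic stores its data internally.

```python
GIB = 1024 ** 3

def dense_bytes(num_rows, num_cols, bytes_per_cell=4):
    """Memory for a fully materialized matrix (assumed 4-byte cells)."""
    return num_rows * num_cols * bytes_per_cell

def coo_bytes(num_nonzero, bytes_per_value=4, bytes_per_index=4):
    """Memory for a coordinate-style sparse layout: value + row + column per entry."""
    return num_nonzero * (bytes_per_value + 2 * bytes_per_index)

movielens_dense = dense_bytes(943, 1682)
ecommerce_dense = dense_bytes(1435266, 25113)
movielens_sparse = coo_bytes(100_000)  # the 100k dataset holds 100,000 ratings

print(f"movielens dense:  {movielens_dense / GIB:.4f} GiB")
print(f"movielens sparse: {movielens_sparse / GIB:.4f} GiB")
print(f"ecommerce dense:  {ecommerce_dense / GIB:.1f} GiB")
```

Under these assumptions the dense ecommerce matrix needs well over a hundred GiB, while a sparse layout scales with the number of interactions rather than with orders times products.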

We need to look at reducing the dataset into something both useful and manageable. To start with, we can remove any products that appear fewer than some minimum number of times (`filter_value`).

In [None]:
filter_value = 100

In [None]:
product_group = df.loc[:, ['order_id', 'product_id']].groupby('product_id').count()
 
multi_product = product_group[product_group.order_id >= filter_value].count()
single_product = product_group[product_group.order_id < filter_value].count()
 
print('Products in at least', filter_value, 'orders:', multi_product['order_id'])
print('Products in fewer than', filter_value, 'orders:', single_product['order_id'])

# Capture the list of products that meet the threshold:
product_filter = product_group[product_group.order_id >= filter_value].index.tolist()
 
product_filtered_df = df[df['product_id'].isin(product_filter)].copy()

We can also remove orders that have too few or too many items, keeping only those between a minimum and a maximum order size.



In [None]:
minimum_order_size = 5
maximum_order_size = 20

In [None]:
order_group = product_filtered_df.loc[:, ['order_id', 'product_id']].groupby('order_id').count()
 
multi_order = order_group[(order_group.product_id >= minimum_order_size) & (order_group.product_id <= maximum_order_size)].count()
single_order = order_group[(order_group.product_id < minimum_order_size) | (order_group.product_id > maximum_order_size)].count()
 
print('Orders with between', minimum_order_size, 'and', maximum_order_size, 'products:', multi_order['product_id'])
print('Orders outside that range:', single_order['product_id'])

# Capture the list of orders within the size range:
order_filter = order_group[(order_group.product_id >= minimum_order_size) & (order_group.product_id <= maximum_order_size)].index.tolist()

In [None]:
filtered_df = product_filtered_df[product_filtered_df['order_id'].isin(order_filter)].copy()
print('Original dataframe length:', len(df))
print('Filtered dataframe length:', len(filtered_df))
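
The two filtering steps above can also be folded into one small helper. This is a minimal sketch, assuming the same column names as the Kaggle data. The function name `filter_transactions` and the toy frame are hypothetical, and it uses `groupby(...).transform("count")` instead of the intermediate ID lists built above.

```python
import pandas as pd

def filter_transactions(frame, min_product_rows, min_order_size, max_order_size):
    """Drop rare products first, then orders outside the size range."""
    # Step 1: keep products that appear in at least `min_product_rows` rows
    product_rows = frame.groupby("product_id")["order_id"].transform("count")
    kept = frame[product_rows >= min_product_rows]
    # Step 2: keep orders whose remaining row count is within the range (inclusive)
    order_rows = kept.groupby("order_id")["product_id"].transform("count")
    return kept[order_rows.between(min_order_size, max_order_size)].copy()

# Toy data (hypothetical): product 12 is rare, order 3 is too small
toy = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2, 3],
    "product_id": [10, 11, 12, 10, 11, 10],
})
toy_filtered = filter_transactions(
    toy, min_product_rows=2, min_order_size=2, max_order_size=5
)
print(toy_filtered)
```

As in the cells above, the order sizes are measured after the rare products have been removed, so an order can fall below the minimum because of the first step.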

In [None]:
product_counts = filtered_df['product_id'].value_counts().to_numpy()
print('There are', len(product_counts), 'unique products\n')
print('\nAnd a graph of what the curve looks like:')
plt.plot(product_counts) 
plt.show()
 
order_counts = filtered_df['order_id'].value_counts()
num_orders = len(order_counts)
num_items = len(product_counts)
sparsity = 1 - len(filtered_df) / (num_orders * num_items)  # use the filtered rows, not the original df
print(f'number of orders: {num_orders}, number of items: {num_items}')
print(f'matrix sparsity: {sparsity:f}')

In [None]:
filtered_df['product_id'] = filtered_df['product_id'].astype(str)
filtered_df['quantity'] = 1
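# A missing brand or category_code would turn the whole description into NaN, so fill them first
filtered_df['description'] = (filtered_df['brand'].fillna('') + ' ' + filtered_df['category_code'].fillna('')).str.strip()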

item_lookup = filtered_df[['product_id', 'description']].drop_duplicates() # Only get unique item/description pairs
item_lookup['product_id'] = item_lookup.product_id.astype(str) # Encode as strings for future lookup ease


In [None]:
catalog.save("electronics_transactions", filtered_df)
catalog.save("electronics_products", item_lookup)