![HSV-AI Logo](https://github.com/HSV-AI/hugo-website/blob/master/static/images/logo_v9.png?raw=true)

# Implicit Recommendation from ECommerce Data

Some of the material for this work is based on [A Gentle Introduction to Recommender Systems with Implicit Feedback](https://jessesw.com/Rec-System/) by Jesse Steinweg-Woods. That tutorial includes an implementation of the Alternating Least Squares algorithm and some other useful functions, such as an area-under-the-curve calculation. Other parts of the tutorial rely on a previous version of the Implicit library and had to be reworked here.

The dataset used for this work comes from Kaggle: [E-Commerce Data, Actual transactions from UK retailer](https://www.kaggle.com/carrie1/ecommerce-data).


# Global Imports

In [None]:
import pandas as pd
import numpy as np
import random
from matplotlib import pyplot as plt
import implicit
import scipy.sparse  # import the submodule explicitly so scipy.sparse.csr_matrix is available below
from sklearn import metrics
from pandas.api.types import CategoricalDtype


# Data Exploration

In [None]:
# The CSV file appears to be encoded as ISO-8859-1 (this was a guess), so it has to be loaded with the encoding parameter.
df = pd.read_csv('../data/external/ecommerce/data.csv', encoding='iso-8859-1')
df.head()
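
One way to test an encoding guess is to try a strict UTF-8 decode first and fall back only when it fails. The sketch below runs on made-up bytes, not on the Kaggle file, and `decode_with_fallback` is a hypothetical helper name. Note that ISO-8859-1 accepts every byte value, so a successful decode shows only that the file is readable, not that the guess is right.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes `raw` without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

# Made-up sample: byte 0xA3 is the pound sign in ISO-8859-1 but is not valid UTF-8 on its own.
sample = b"InvoiceNo,Description\n536365,GIFT VOUCHER \xa310\n"
text, used = decode_with_fallback(sample)
print(used)
print(text)
```

The same idea applies to `pd.read_csv`: try the default encoding, and pass `encoding='iso-8859-1'` only if the default raises a decode error.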

In [None]:
print('Unique invoices', len(pd.unique(df['InvoiceNo'])))
print('Unique products', len(pd.unique(df['StockCode'])))
print('Total rows', len(df))

## Checking for missing values

It looks like InvoiceNo, StockCode, and Quantity are always available. Those are the only columns the purchase matrix is built from, so missing values in the other columns are acceptable.

In [None]:
df.isna().sum()

Let's look at the number of products and see how they are distributed among the orders. We can use the value_counts method from pandas to get an idea of how often each product is ordered.

In [None]:
product_counts = df['StockCode'].value_counts().to_numpy()
print('There are', len(product_counts), 'unique products\n')
print('Here are the counts of products ordered from largest to smallest')
print(product_counts)
print('\nAnd a graph of what the curve looks like:')
plt.plot(product_counts) 
plt.show()
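
The shape of the curve plotted above can also be summarized as a number: the share of all rows covered by the most frequent products. This is a minimal sketch on made-up stock codes, and `top_share` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def top_share(codes: pd.Series, top_n: int) -> float:
    """Fraction of all rows that belong to the `top_n` most frequent values."""
    counts = codes.value_counts()  # sorted from most to least frequent
    return counts.head(top_n).sum() / counts.sum()

# Made-up stock codes: one best seller and a tail of rarely ordered items.
toy_codes = pd.Series(["A"] * 6 + ["B"] * 2 + ["C", "D"])
print(top_share(toy_codes, 1))  # share of rows taken by the single most common code
print(top_share(toy_codes, 2))  # share of rows taken by the two most common codes
```

On the real data, `top_share(df['StockCode'], 100)` would report how much of the dataset the 100 most ordered products account for.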

It appears that a few items in the store sell a LOT, while most are sold only a few times. This long-tailed pattern seems normal for a retail store. Let's take a quick look at the most purchased item to see if it makes sense.

In [None]:
df['StockCode'].value_counts().head()

In [None]:
df[df['StockCode']=='85123A'].head()

We don't have information about the retailer's market, but at a unit price of 2.55 this looks like a normal high-volume item.

Now we can check the value of each invoice and see what jumps out.

In [None]:
df['StockTotal'] = df['Quantity'] * df['UnitPrice']
totals = df.groupby(df.InvoiceNo)['StockTotal'].sum()
totals.plot()

Well, there's something worth looking into. We need to figure out where the negative order totals come from. The cause has to be either a negative quantity or a negative price, so let's find out which it is.

In [None]:
print('There are', len(df[df.Quantity < 0]), 'negative quantities')
df[df.Quantity < 0].head()

Now we need to figure out what to do with these. We could throw out every invoice that includes a negative quantity, or just the rows with negative quantities. Let's check whether any invoice mixes negative and positive quantities.

In [None]:
temp_df = df.groupby(df.InvoiceNo).agg(minQ=('Quantity', 'min'), 
                               maxQ=('Quantity', 'max'))
temp_df[(temp_df.minQ < 0) & (temp_df.maxQ > 0)].head()
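
An empty result from the check above means that no invoice mixes returns and purchases. To see what a non-empty result would look like, here is the same named-aggregation pattern on a made-up frame. The invoice labels and quantities are invented for illustration.

```python
import pandas as pd

# Made-up invoices: 'C1' holds only negative lines, '1002' mixes a purchase and a return.
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "C1", "C1", "1002", "1002"],
    "Quantity":  [3, 1, -2, -1, 5, -5],
})

# Per-invoice minimum and maximum quantity via named aggregation.
span = toy.groupby("InvoiceNo").agg(minQ=("Quantity", "min"), maxQ=("Quantity", "max"))

# An invoice is mixed when it has at least one negative and one positive line.
mixed = span[(span.minQ < 0) & (span.maxQ > 0)].index.tolist()
print(mixed)
```

Only the mixed invoice is reported. Invoices that are entirely negative or entirely positive pass through the filter unflagged.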

No invoice mixes negative and positive quantities, so the rows with negative quantities can be removed without breaking up any purchase invoice. Next, let's check for negative unit prices.

In [None]:
print('There are', len(df[df.UnitPrice < 0]), 'negative unit prices')

df[df.UnitPrice < 0].head()

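It looks like we can throw out anything with a negative UnitPrice. The filter below keeps only rows where both UnitPrice and Quantity are strictly positive, so it also drops zero-priced rows and the negative quantities found earlier.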

In [None]:
df = df[(df.UnitPrice > 0) & (df.Quantity > 0)]

Now we need to look into those very large sums on the invoice total to see what is happening there.

In [None]:
totals.sort_values(ascending=False)

In [None]:
df[df.InvoiceNo == '541431'].head()

In [None]:
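# Sum only the numeric columns; summing text columns such as Description is not meaningful
totals = df.groupby('InvoiceNo').sum(numeric_only=True)
print(len(totals))
# Keep the invoices whose total quantity is at most 100 units
quantity_filter = totals[totals.Quantity <= 100].index.tolist()
print(len(quantity_filter))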

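It looks like these were real orders with very large quantities. They don't come from typical customers, so we may need to try the model both with and without them. The `quantity_filter` list above holds the invoices with a total quantity of at most 100 for that purpose.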

Another thing we can do is compute the sparsity of the data. This is useful to see if there is enough overlap between the orders and products to make a useful decision for recommendations.

In [None]:
order_counts = df['InvoiceNo'].value_counts()
num_orders = len(order_counts)
num_items = df['StockCode'].nunique()  # recount here, because product_counts was computed before the rows above were filtered out
sparsity = 1 - len(df) / (num_orders * num_items)
print(f'number of orders: {num_orders}, number of items: {num_items}')
print(f'matrix sparsity: {sparsity:f}')

Compare that with the MovieLens 100K dataset, which has:

```
number of users: 943, number of items: 1682
matrix sparsity: 0.936953
```
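
The sparsity figure used in this comparison is one minus the fraction of order/product cells that contain a purchase. The sketch below computes it on made-up data and counts each (invoice, product) pair once. That is an assumption that differs slightly from using the raw row count, which counts a product twice when it appears on two lines of the same invoice. `interaction_sparsity` is a hypothetical helper name.

```python
import pandas as pd

def interaction_sparsity(frame: pd.DataFrame, row_key: str, col_key: str) -> float:
    """1 - (distinct observed pairs) / (all possible pairs)."""
    observed = len(frame[[row_key, col_key]].drop_duplicates())
    possible = frame[row_key].nunique() * frame[col_key].nunique()
    return 1 - observed / possible

# Made-up invoice lines; product 'A' appears twice on invoice 1001.
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1001", "1002", "1003"],
    "StockCode": ["A", "B", "A", "A", "C"],
})
print(interaction_sparsity(toy, "InvoiceNo", "StockCode"))
```

Here 4 of the 9 possible cells are filled, so the sparsity is 5/9. Using the raw row count of 5 instead would understate it.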

Given that this is intended to be used for recommendations based on individual orders, we can remove any invoice that has fewer than 2 items.

In [None]:
minimum_order_size = 2
order_group = df.loc[:, ['InvoiceNo', 'StockCode']].groupby('InvoiceNo').count()
 
multi_order = order_group[(order_group.StockCode >= minimum_order_size)].count()
single_order = order_group[(order_group.StockCode < minimum_order_size)].count()
 
print('Orders with at least',minimum_order_size,'products:',multi_order['StockCode'])
print('Orders with less than',minimum_order_size,'products:',single_order['StockCode'])
 
# We can capture the list of multiple-product orders with this:
order_filter = order_group[(order_group.StockCode >= minimum_order_size)].index.tolist()
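
An alternative to building an explicit list of invoice numbers is to broadcast the per-invoice line count back onto every row with `transform` and filter in one step. This is a sketch on made-up data, not the approach used in the notebook, and it counts lines rather than distinct products, just as the cell above does.

```python
import pandas as pd

# Made-up invoice lines: 1001 has two lines, 1002 has one, 1003 has three.
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1003", "1003", "1003"],
    "StockCode": ["A", "B", "A", "B", "C", "D"],
})
minimum_order_size = 2

# Number of lines on each invoice, repeated on every row of that invoice.
lines_per_invoice = toy.groupby("InvoiceNo")["StockCode"].transform("count")

# Keep only the rows that belong to invoices with enough lines.
kept = toy[lines_per_invoice >= minimum_order_size]
print(sorted(kept["InvoiceNo"].unique()))
```

Swapping `"count"` for `"nunique"` would require at least two distinct products instead of two lines.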

In [None]:
filtered_df = df[df['InvoiceNo'].isin(order_filter)].copy()

# Also filter by quantity
filtered_df = filtered_df[filtered_df['InvoiceNo'].isin(quantity_filter)].copy()

print('Original dataframe length:', len(df))
print('Filtered dataframe length:', len(filtered_df))

In [None]:
product_counts = filtered_df['StockCode'].value_counts().to_numpy()
print('There are', len(product_counts), 'unique products\n')
print('\nAnd a graph of what the curve looks like:')
plt.plot(product_counts) 
plt.show()
 
order_counts = filtered_df['InvoiceNo'].value_counts()
num_orders = len(order_counts)
num_items = len(product_counts)
sparsity = 1 - len(filtered_df) / (num_orders * num_items)  # use the filtered frame, not the original df
print(f'number of orders: {num_orders}, number of items: {num_items}')
print(f'matrix sparsity: {sparsity:f}')

In [None]:
filtered_df['StockCode'] = filtered_df['StockCode'].astype(str)

In [None]:
item_lookup = filtered_df[['StockCode', 'Description']].drop_duplicates() # Only get unique item/description pairs
item_lookup['StockCode'] = item_lookup.StockCode.astype(str) # Encode as strings for future lookup ease

price_lookup = filtered_df[['StockCode', 'UnitPrice']].drop_duplicates()
price_lookup['StockCode'] = price_lookup.StockCode.astype(str)

In [None]:
item_lookup.to_pickle('../data/interim/item_lookup.gz')
item_lookup.to_csv('../data/interim/item_lookup.csv', index_label=False)

In [None]:
selected_df = filtered_df[['InvoiceNo', 'StockCode', 'Quantity']]
selected_df.info()
selected_df.head()

In [None]:
invoices = list(np.sort(selected_df.InvoiceNo.unique())) # Unique invoices (used here in place of customers)
products = list(selected_df.StockCode.unique()) # Unique products that were purchased
quantity = list(selected_df.Quantity) # Quantity on every invoice line

# Column indices: the position of each line's invoice in `invoices`
cols = selected_df.InvoiceNo.astype(CategoricalDtype(categories=invoices, ordered=True)).cat.codes
# Row indices: the position of each line's product in `products`
rows = selected_df.StockCode.astype(CategoricalDtype(categories=products, ordered=True)).cat.codes
# Build the product-by-invoice matrix; repeated (product, invoice) pairs are summed
purchases_sparse = scipy.sparse.csr_matrix((quantity, (rows, cols)), shape=(len(products), len(invoices)))
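
To make the row and column layout concrete, here is the same construction on a made-up frame that is small enough to print in full. It uses pandas' default category ordering rather than the explicit `CategoricalDtype` above, which is a simplification for the sketch.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up invoice lines; product 'A' appears twice on invoice 1002.
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1002", "1002"],
    "StockCode": ["A", "B", "A", "C", "A"],
    "Quantity":  [2, 1, 4, 3, 1],
})

# Map each label to a dense integer index (categories are sorted by default).
items = toy["StockCode"].astype("category")
orders = toy["InvoiceNo"].astype("category")

# Rows are products, columns are invoices, values are quantities.
matrix = csr_matrix(
    (toy["Quantity"].to_numpy(),
     (items.cat.codes.to_numpy(), orders.cat.codes.to_numpy())),
    shape=(len(items.cat.categories), len(orders.cat.categories)),
)
print(matrix.toarray())
```

The two lines for product 'A' on invoice 1002 end up as a single cell holding their summed quantity, because `csr_matrix` adds duplicate coordinates together when it is built from `(data, (row, col))` input.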

In [None]:
matrix_size = purchases_sparse.shape[0]*purchases_sparse.shape[1] # Number of possible interactions in the matrix
num_purchases = len(purchases_sparse.nonzero()[0]) # Number of (product, invoice) cells that contain a purchase
sparsity = 100*(1 - (num_purchases/matrix_size)) # Percentage of cells that are empty
sparsity

In [None]:
selected_df.to_pickle('../data/interim/selected_invoices.gz')
selected_df.to_csv('../data/interim/selected_invoices.csv', index_label=False)