![HSV-AI Logo](https://github.com/HSV-AI/hugo-website/blob/master/static/images/logo_v9.png?raw=true)

# Implicit Recommendation from ECommerce Data

Some of the material for this work is based on [A Gentle Introduction to Recommender Systems with Implicit Feedback](https://jessesw.com/Rec-System/) by Jesse Steinweg-Woods. That tutorial includes an implementation of the Alternating Least Squares algorithm and some other useful functions (such as the area under the curve calculation). Other parts of the tutorial were written against an earlier version of the Implicit library and had to be reworked.

The data comes from dunnhumby's [Complete Journey Dataset](https://www.kaggle.com/frtgnn/dunnhumby-the-complete-journey).

This dataset contains household level transactions over two years from a group of 2,500 households who are frequent shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.

# Global Imports

In [11]:
import pandas as pd
import numpy as np
import random
from matplotlib import pyplot as plt
import implicit
import scipy.sparse
from sklearn import metrics
from pandas.api.types import CategoricalDtype

In [12]:
%run Common-Functions.ipynb

In [13]:
transactions = pd.read_pickle('../data/interim/journey/transactions.gz')
print('Loaded',len(transactions),'rows')

Loaded 2482662 rows


In [14]:
transaction_list = list(np.sort(transactions.order_id.unique())) # Get our unique orders
item_list = list(transactions.product_id.unique()) # Get our unique products that were purchased
quantity_list = list(transactions.quantity) # All of our purchases

# Get the associated column indices (one column per order)
cols = transactions.order_id.astype(CategoricalDtype(categories=transaction_list, ordered=True)).cat.codes
# Get the associated row indices (one row per product)
rows = transactions.product_id.astype(CategoricalDtype(categories=item_list, ordered=True)).cat.codes
# Build the product x order matrix of purchase quantities
purchases_sparse = scipy.sparse.csr_matrix((quantity_list, (rows, cols)), shape=(len(item_list), len(transaction_list)))

In [15]:
matrix_size = purchases_sparse.shape[0]*purchases_sparse.shape[1] # Number of possible interactions in the matrix
num_purchases = len(purchases_sparse.nonzero()[0]) # Number of (product, order) cells that hold a purchase
sparsity = 100*(1 - (num_purchases/matrix_size)) # Percentage of cells that are empty
sparsity

99.98008043099004
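
To make the two cells above easier to follow, here is a minimal sketch of the same steps on a tiny made-up table. The `toy` DataFrame and its values are assumptions for illustration only and are not part of the dataset.

```python
import numpy as np
import pandas as pd
import scipy.sparse
from pandas.api.types import CategoricalDtype

# Toy transactions (made-up values, not taken from the dataset)
toy = pd.DataFrame({
    "order_id":   [10, 10, 11, 12, 12, 12],
    "product_id": ["A", "B", "A", "B", "C", "D"],
    "quantity":   [1, 2, 1, 3, 1, 1],
})

orders = list(np.sort(toy.order_id.unique()))
products = list(toy.product_id.unique())

# Row index = position of the product, column index = position of the order
rows = toy.product_id.astype(CategoricalDtype(categories=products, ordered=True)).cat.codes
cols = toy.order_id.astype(CategoricalDtype(categories=orders, ordered=True)).cat.codes

toy_sparse = scipy.sparse.csr_matrix(
    (toy.quantity.to_numpy(), (rows.to_numpy(), cols.to_numpy())),
    shape=(len(products), len(orders)),
)

# Sparsity = share of (product, order) cells with no purchase
filled = toy_sparse.count_nonzero()
possible = toy_sparse.shape[0] * toy_sparse.shape[1]
toy_sparsity = 100 * (1 - filled / possible)

print(toy_sparse.toarray())
print(f"{toy_sparsity:.1f}% of cells are empty")
```

Each row of the printed array is a product and each column is an order, which is the same orientation as `purchases_sparse`.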

# Training & Test Datasets

We will use the function below, taken from the tutorial linked at the top, to create a training and a test dataset. The training dataset masks some percentage of the purchases so that they can be tested later against the recommendations.
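
The real `make_train` lives in `Common-Functions.ipynb` and is not shown here. As a rough sketch of the masking idea only, the hypothetical `mask_interactions` helper below hides a share of the nonzero cells; its name, signature, and return values are assumptions and may differ from the actual function.

```python
import numpy as np
import scipy.sparse

def mask_interactions(purchases, pct_test=0.1, seed=0):
    """Hypothetical stand-in for make_train: hide a share of the nonzero cells."""
    csr = purchases.tocsr().copy()
    csr.eliminate_zeros()
    coo = csr.tocoo()

    # Pick which nonzero cells to hide from the training matrix
    rng = np.random.default_rng(seed)
    n_mask = int(np.ceil(pct_test * coo.nnz))
    masked = rng.choice(coo.nnz, size=n_mask, replace=False)
    keep = np.ones(coo.nnz, dtype=bool)
    keep[masked] = False

    # Training set: original values with the masked cells removed
    train = scipy.sparse.csr_matrix(
        (coo.data[keep], (coo.row[keep], coo.col[keep])), shape=coo.shape)
    # Test set: every original interaction, binarized to 1
    test = scipy.sparse.csr_matrix(
        (np.ones(coo.nnz), (coo.row, coo.col)), shape=coo.shape)
    # Rows that had at least one cell hidden
    altered_rows = np.unique(coo.row[masked])
    return train, test, altered_rows

# Made-up 4 x 5 matrix in which every cell is nonzero
demo = scipy.sparse.csr_matrix(np.arange(1, 21).reshape(4, 5))
train, test, altered_rows = mask_interactions(demo, pct_test=0.25)
print("masked cells:", test.count_nonzero() - train.count_nonzero())
```

The difference between the nonzero counts of the test and training matrices is the number of masked cells, which is the same quantity the next cell prints for the real data.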

In [16]:
product_train, product_test, products_altered, transactions_altered = make_train(purchases_sparse, pct_test = 0.1)
print('Total number of masked items:',product_test.count_nonzero()-product_train.count_nonzero())


Total number of masked items: 248267


# Implicit Recommendation Model

The code below creates and trains one of the models available from the Implicit package. The hyperparameters are the ones suggested by various tutorials and have not been tuned.

In [17]:
alpha = 15
factors = 64
regularization = 0.003
iterations = 50

model = implicit.als.AlternatingLeastSquares(factors=factors,
                                    regularization=regularization,
                                    iterations=iterations)

## BayesianPersonalizedRanking was pretty bad
# model = implicit.bpr.BayesianPersonalizedRanking(factors=31,
#                                     regularization=0.1,
#                                     iterations=50)


# model = implicit.lmf.LogisticMatrixFactorization(factors=32,
#                                     regularization=0.1,
#                                     iterations=50)

model.fit((product_train * alpha).astype('double'))

user_vecs = model.user_factors
item_vecs = model.item_factors

# Deprecated function below
# user_vecs, item_vecs = implicit.alternating_least_squares((product_train*alpha).astype('double'), 
#                                                           factors=32, 
#                                                           regularization = 0.1, 
#                                                           iterations = 50)


In [18]:
# Save next to the journey data that was loaded above
np.save('../data/interim/journey/user_factors', user_vecs)
np.save('../data/interim/journey/item_factors', item_vecs)
np.save('../data/interim/journey/product_train', product_train*alpha)

# Scoring the Model

Following the tutorial, we score the model with the area under the Receiver Operating Characteristic curve (AUC) and compare it against a baseline that always recommends the most popular item.
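
As a rough illustration of what the AUC measures for a single row, the sketch below computes it as the share of (held-out, not purchased) pairs that are ranked in the right order. The `pairwise_auc` helper and the toy scores are assumptions for illustration; `calc_mean_auc` in `Common-Functions.ipynb` may be implemented differently.

```python
import numpy as np

def pairwise_auc(scores, held_out):
    """AUC as the chance that a held-out item outscores a non-purchased one."""
    scores = np.asarray(scores, dtype=float)
    held_out = np.asarray(held_out, dtype=bool)
    pos = scores[held_out]
    neg = scores[~held_out]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one held-out and one other item")
    # Compare every held-out item with every other item; ties count as half
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Made-up example: 5 candidate items, items 0 and 3 were masked purchases
held_out = [1, 0, 0, 1, 0]
model_scores = [0.9, 0.2, 0.4, 0.3, 0.1]   # hypothetical model predictions
popularity = [5, 40, 12, 30, 3]            # hypothetical purchase counts

print("model AUC:     ", pairwise_auc(model_scores, held_out))
print("popularity AUC:", pairwise_auc(popularity, held_out))
```

A value of 0.5 means the ranking is no better than chance and 1.0 means every held-out purchase is ranked above every other item.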

In [19]:
test, popular = calc_mean_auc(product_train, products_altered, 
              [scipy.sparse.csr_matrix(item_vecs), scipy.sparse.csr_matrix(user_vecs.T)], product_test)


print('Our model scored',test,'versus a score of',popular,'if we always recommended the most popular item.')

KeyboardInterrupt: 

# Spot Checking

With the overall scoring set up, we can spot check a few things, such as finding similar items and checking the item recommendations for an existing order.

In [20]:
transactions.head()

| | order_id | product_id | quantity | price | description |
|---|---|---|---|---|---|
| 0 | 26984851472 | 1004906 | 1 | 1.39 | |
| 1 | 26984851472 | 1033142 | 1 | 0.82 | |
| 2 | 26984851472 | 1036325 | 1 | 0.99 | |
| 3 | 26984851472 | 1082185 | 1 | 1.21 | |
| 4 | 26984851472 | 8160430 | 1 | 1.5 | |


In [21]:
item_lookup = transactions[['product_id', 'description']].drop_duplicates() # Only get unique item/description pairs
item_lookup['product_id'] = item_lookup.product_id.astype(str) # Encode as strings for future lookup ease

price_lookup = transactions[['product_id', 'price']].drop_duplicates() # Only get unique item/price pairs
price_lookup['product_id'] = price_lookup.product_id.astype(str) # Encode as strings for future lookup ease


In [22]:
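# Look up the items most similar to the item at matrix row 1284
related = model.similar_items(1284)
for rel in related:
    index = rel[0]
    score = rel[1]  # similarity score, not a probability
    item = item_lookup[item_lookup.product_id == str(item_list[index])].values
    print(score, item[0][1])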

1.0000001 nan
0.99391216 nan
0.9938326 nan
0.993668 nan
0.9936674 nan
0.9935585 nan
0.9935377 nan
0.99352044 nan
0.9934621 nan
0.9933757 nan
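
The first result above is the query item itself with a score of about 1, which is what a cosine similarity between item factor vectors would produce. The sketch below shows that calculation on made-up factors; the `most_similar` helper is hypothetical and only assumes cosine similarity, it is not the library's code.

```python
import numpy as np

def most_similar(item_factors, item_index, n=3):
    """Rank items by cosine similarity to the item at item_index."""
    norms = np.linalg.norm(item_factors, axis=1)
    norms = np.where(norms == 0, 1e-12, norms)  # guard against all-zero vectors
    sims = item_factors @ item_factors[item_index] / (norms * norms[item_index])
    best = np.argsort(-sims)[:n]
    return [(int(i), float(sims[i])) for i in best]

# Made-up 2-factor vectors for four items
toy_factors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

for index, score in most_similar(toy_factors, 0):
    print(index, round(score, 4))
```

Item 1 points in nearly the same direction as item 0 and so ranks right behind it, while item 3 points the opposite way and is left out of the top three.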


In [25]:
# Orders as rows and products as columns, scaled the same way as the training data
user_items = (product_train * alpha).astype('double').T.tocsr()

def recommend(order):
    """Print the contents of an order followed by the model's recommendations for it."""
    print('Order Contents:')
    print(transactions[transactions.order_id == transaction_list[order]].loc[:, ['product_id', 'description']])
    print('Recommendations:')
    recommendations = model.recommend(order, user_items)
    for rec in recommendations:
        index = rec[0]
        score = rec[1]  # model score, not a calibrated probability
        stock_code = item_list[index]
        item = item_lookup[item_lookup.product_id == str(stock_code)].values
        print(score, stock_code, item[0][1])

In [26]:
recommend(1)

Order Contents:
    product_id description
5       826249         NaN
6      1043142         NaN
7      1085983         NaN
8      1102651         NaN
9      6423775         NaN
10     9487839         NaN
Recommendations:
0.18406916 1106523 nan
0.13535233 1068719 nan
0.116720706 1082185 nan
0.10422222 1004906 nan
0.099963754 866227 nan
0.08210845 1005186 nan
0.07961941 840361 nan
0.07687943 845208 nan
0.07628969 1139830 nan
0.074534245 981944 nan


In [27]:
recommend(2200)

Order Contents:
       product_id description
26995      883932         NaN
26996      942792         NaN
26997     1056082         NaN
26998     1110572         NaN
26999     1135868         NaN
Recommendations:
0.5290389 1085604 nan
0.388162 986912 nan
0.35630012 879755 nan
0.35545498 1005186 nan
0.32639477 1080414 nan
0.3128656 1038217 nan
0.30810016 1022003 nan
0.29854798 1037894 nan
0.280601 1050229 nan
0.27526525 866227 nan
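
For intuition about where scores like the ones above come from, here is a minimal sketch that scores every item as the dot product of an order's factor vector with the item factors and then skips items already in the order. The `top_n_for_order` helper and the toy vectors are assumptions for illustration and are not the library's implementation.

```python
import numpy as np

def top_n_for_order(order_vec, item_factors, already_bought, n=3):
    """Score items by dot product and drop the ones already in the order."""
    scores = (item_factors @ order_vec).astype(float)
    bought = np.array(sorted(already_bought), dtype=int)
    scores[bought] = -np.inf  # never recommend what is already in the basket
    best = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in best]

# Made-up 2-factor vectors for five items and one order
toy_item_factors = np.array([
    [1.0, 0.0],
    [0.8, 0.2],
    [0.0, 1.0],
    [0.5, 0.5],
    [-1.0, 0.0],
])
toy_order_vec = np.array([1.0, 0.5])

for index, score in top_n_for_order(toy_order_vec, toy_item_factors, {0}):
    print(index, round(score, 3))
```

Item 0 would have had the highest score, but it is excluded because the toy order already contains it.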


In [28]:
transactions['ItemTotal'] = transactions['quantity'] * transactions['price']

In [29]:
recommended_price = []
for user in range(0, len(transaction_list)):
    # Take the top recommendation for this order
    recommendations = model.recommend(user, user_items)
    index = recommendations[0][0]
    # A product can appear at several prices; use the first one listed
    price = price_lookup[price_lookup.product_id == str(item_list[index])].values
    recommended_price.append(price[0][1])

total_recommended = np.sum(recommended_price)


KeyboardInterrupt: 

In [27]:
accept_rate = 0.3
print('After recommending',len(transaction_list),'items with a',accept_rate,'acceptance rate, there would be an increase of',
      "${:,.2f}".format(total_recommended*accept_rate),'in additional purchases.')

After recommending 85348 items with a 0.3 acceptance rate, there would be an increase of $5,825,037.11 in additional purchases.


In [28]:
totals = transactions.groupby(transactions.order_id)['ItemTotal'].sum()
total = totals.sum()

print('Added to the initial total of all',len(transaction_list),'purchases valued at',
      "${:,.2f}".format(total),', the percentage increase in revenue would be', "{:,.4f}%".format(total_recommended*accept_rate / total * 100 ))


Added to the initial total of all 85348 purchases valued at $32,997,331.17 , the percentage increase in revenue would be 17.6531%
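
The arithmetic behind the two figures above can be sketched as follows. The `estimate_uplift` helper and the toy prices are assumptions for illustration; the last lines only recompute the percentage from the two dollar amounts printed above.

```python
def estimate_uplift(recommended_prices, accept_rate, baseline_total):
    """Expected extra revenue if a fixed share of recommendations is accepted."""
    expected_extra = accept_rate * sum(recommended_prices)
    pct_increase = 100 * expected_extra / baseline_total
    return expected_extra, pct_increase

# Made-up example: four recommended items, a 25% acceptance rate, $200 baseline
extra, pct = estimate_uplift([2.00, 4.00, 10.00, 4.00], 0.25, 200.0)
print("${:,.2f} extra, {:,.4f}% increase".format(extra, pct))

# Recompute the percentage from the dollar amounts printed above
reported_extra = 5_825_037.11
reported_total = 32_997_331.17
reported_pct = "{:,.4f}%".format(100 * reported_extra / reported_total)
print(reported_pct)
```

Note that the estimate assumes every order receives exactly one recommendation and that the acceptance rate is the same for every order.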
