![HSV-AI Logo](https://github.com/HSV-AI/hugo-website/blob/master/static/images/logo_v9.png?raw=true)

# Implicit Recommendation from ECommerce Data

Some of the material for this work is based on [A Gentle Introduction to Recommender Systems with Implicit Feedback](https://jessesw.com/Rec-System/) by Jesse Steinweg-Woods. That tutorial includes an implementation of the Alternating Least Squares algorithm and some other useful functions (such as the area-under-the-curve calculation). Other parts of the tutorial are based on a previous version of the Implicit library and had to be reworked.


## Basics of EDA

Here are a few things that we are looking for in the invoice / transaction data:

1. Were there any negative totals? If so, why?
2. What percentage of the purchases actually contained multiple items?
3. What is the spread of purchases by customer ID? Do we have a few customers whose behavior may drive recommendations in a way that doesn't fit the average customer?
4. Were there any purchases that were VERY large? If so, why? Do we want to include these values to train model behavior?
5. Is there any missing data that we need to scrub?
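
The checks above can be sketched on a small, made-up frame. The column names (`order_id`, `customer_id`, `product_id`, `price`) mirror the ones used later in this notebook, but the values are invented for illustration, `toy` and the other variable names are hypothetical, and the 95th-percentile cutoff for "very large" is an arbitrary assumption.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "order_id":    ["a", "a", "b", "c", "c", "d"],
    "customer_id": ["u1", "u1", "u2", "u1", "u1", None],
    "product_id":  ["p1", "p2", "p1", "p3", "p1", "p2"],
    "price":       [10.0, 5.0, -10.0, 500.0, 10.0, 5.0],
})

# 1. Orders whose summed price is negative
order_totals = toy.groupby("order_id")["price"].sum()
negative_orders = order_totals[order_totals < 0]

# 2. Percentage of orders containing more than one item
items_per_order = toy.groupby("order_id")["product_id"].count()
multi_item_pct = 100 * (items_per_order > 1).mean()

# 3. Spread of purchases by customer (rows per customer, largest first)
rows_per_customer = toy["customer_id"].value_counts()

# 4. Very large purchases: order totals above the 95th percentile (assumed cutoff)
large_orders = order_totals[order_totals > order_totals.quantile(0.95)]

# 5. Missing values per column
missing = toy.isna().sum()
```

Each result is a small Series or scalar that can be printed or plotted in its own cell.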


In [None]:
%reload_kedro

In [None]:
import pandas as pd
import numpy as np
import random
import scipy.sparse
from matplotlib import pyplot as plt
from pandas.api.types import CategoricalDtype


# Available Files

Let's go ahead and look into some of these files and see what we can see.

In [None]:
transactions = catalog.load("brazilian_kaggle_order_data")
products = catalog.load("brazilian_kaggle_product_data")
customers = catalog.load("brazilian_kaggle_customer_data")

In [None]:
transactions['order_id'] = transactions.order_id.astype(str)
transactions['product_id'] = transactions.product_id.astype(str)
transactions.head()

In [None]:
products['product_id'] = products.product_id.astype(str)
products.head()

In [None]:
customers.head()

In [None]:
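# Take an explicit copy so the column assignments below do not act on a view of the original frame
customers = customers[["customer_id", "order_id", "order_purchase_timestamp"]].copy()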
customers['order_id'] = customers.order_id.astype(str)
customers['customer_id'] = customers.customer_id.astype(str)

customers.head()

In [None]:
print(transactions.dtypes)
print(customers.dtypes)
transactions = transactions.merge(customers, on='order_id')

# Checking for missing data

In [None]:
print('Total length is',len(transactions))
transactions.isna().sum()

In [None]:
transaction_counts = transactions['order_id'].value_counts().to_numpy()
print('There are', len(transaction_counts), 'unique transactions\n')
print('Here are the counts of transactions ordered from largest to smallest')
print(transaction_counts)
print('\nAnd a graph of what the curve looks like:')
plt.plot(transaction_counts) 
plt.show()

# User Interactions

Let's take a look at how many unique customers are included in this dataset.

In [None]:
user_counts = transactions['customer_id'].value_counts().to_numpy()
print('There are', len(user_counts), 'unique customers\n')
print('Here are the counts of transactions per customer ordered from largest to smallest')
print(user_counts)
print('\nAnd a graph of what the curve looks like:')
plt.plot(user_counts) 
plt.show()

In [None]:
transactions.groupby(['customer_id'])['customer_id'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['count'], ascending=False) \
                             .head(10)

It looks like "-1" is used when the customer is unknown. Let's take a look at the customer ID with the very high count of items in the transactions.

In [None]:
# Count rows per order for the single busiest customer
transactions[transactions.customer_id == "fc3d1daec319d62d49bfb5e1f83123e9"].groupby('order_id').count()

It appears that there are a lot of different transactions, so probably not just the same thing being purchased over and over. Not really sure what to do with this at the moment.
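
One possible way to judge whether a heavy buyer like this could skew recommendations is to measure what share of all rows the busiest customer accounts for, and how many distinct orders stand behind that share. This is only a sketch on invented data. `toy`, `top_share`, and `orders_per_customer` are hypothetical names, and any cutoff for treating a customer as an outlier would be a judgment call rather than something this notebook establishes.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "customer_id": ["u1"] * 6 + ["u2"] * 2 + ["u3", "u4"],
    "order_id":    ["a", "a", "b", "b", "c", "c", "d", "d", "e", "f"],
})

# Rows per customer, largest first
rows_per_customer = toy["customer_id"].value_counts()

# Fraction of all rows contributed by the single busiest customer
top_share = rows_per_customer.iloc[0] / len(toy)

# Distinct orders per customer separates "one huge order" from "many orders"
orders_per_customer = toy.groupby("customer_id")["order_id"].nunique()
```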

# Transactions over Time

Now we need to look at the number of items purchased each day to see if there is anything interesting that pops out.

In [None]:
from datetime import datetime

datetime_object = datetime.strptime('2017-10-02 10:56:33', '%Y-%m-%d %H:%M:%S')

In [None]:
# Convert the purchase timestamp to a YYYYMMDD string so that rows can be grouped by day
transactions['datetime'] = pd.to_datetime(
    transactions['order_purchase_timestamp'], format='%Y-%m-%d %H:%M:%S'
).dt.strftime('%Y%m%d')


In [None]:
transactions.groupby(['datetime'])['datetime'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['datetime'], ascending=True) \
                             .plot(figsize=(15,10))

# Checking Invoice Totals

We need to make sure all the invoice totals that we're using are positive, so that we don't train on invoices that record customer returns.
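
As a sketch of what acting on this check could look like, one option is to keep only the rows that belong to orders whose summed price is positive. The frame below is made up and `positive_only` is a hypothetical name. The notebook itself only inspects totals and counts negative prices; it does not apply this filter.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "order_id": ["a", "a", "b", "c"],
    "price":    [10.0, 5.0, -10.0, 20.0],
})

# Broadcast each order's total back onto its own rows
order_total = toy.groupby("order_id")["price"].transform("sum")

# Keep only rows that belong to orders with a positive total
positive_only = toy[order_total > 0]
```

Using `transform("sum")` keeps the result aligned with the original rows, so the boolean mask can be applied directly without a merge.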

In [None]:
totals = transactions.groupby(transactions.order_id)['price'].sum()
totals.plot()

In [None]:
totals.sort_values(ascending=False)

In [None]:
print('There are', len(transactions[transactions.price < 0]), 'negative prices')
transactions[transactions.price < 0].head()

In [None]:
q = transactions["price"].quantile(0.98)
# transactions = transactions[transactions["price"] < q]
print(q)

Next, we need to remove transactions that included only a single item.

In [None]:
minimum_order_size = 2
order_group = transactions.loc[:, ['order_id', 'product_id']].groupby('order_id').count()
 
multi_order = order_group[(order_group.product_id >= minimum_order_size)].count()
single_order = order_group[(order_group.product_id < minimum_order_size)].count()
 
print('Orders with at least',minimum_order_size,'products:',multi_order['product_id'])
print('Orders with less than',minimum_order_size,'products:',single_order['product_id'])
 
# We can capture the list of multiple-product orders with this:
order_filter = order_group[(order_group.product_id >= minimum_order_size)].index.tolist()

filtered_df = transactions[transactions['order_id'].isin(order_filter)].copy()

print('Original dataframe length:', len(transactions))
print('Filtered dataframe length:', len(filtered_df))

filtered_df['quantity'] = 1

Well, it looks like every transaction in this dataset contains multiple products, so there is no need to filter out transactions with only a single product.

# Data Sparsity

Let's take a look at the sparsity of the data. This will tell us how many products were purchased across multiple orders. This is directly related to how well a recommendation system can be trained.
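
As a small illustration of the idea, sparsity here means the percentage of (product, order) cells in the matrix that hold no purchase. The toy matrix below is invented, and `toy_matrix`, `toy_sparsity`, and `products_in_multiple_orders` are hypothetical names. The sketch assumes each (product, order) pair is stored at most once.

```python
import numpy as np
import scipy.sparse

# Toy product-by-order matrix: 3 products (rows) x 4 orders (columns)
rows = np.array([0, 0, 1, 2])   # product indices
cols = np.array([0, 1, 1, 3])   # order indices
data = np.ones(len(rows))       # one unit purchased per (product, order) pair

toy_matrix = scipy.sparse.csr_matrix((data, (rows, cols)), shape=(3, 4))

possible = toy_matrix.shape[0] * toy_matrix.shape[1]   # total number of cells
observed = toy_matrix.nnz                              # cells that hold a purchase
toy_sparsity = 100 * (1 - observed / possible)         # percentage of empty cells

# Stored entries per row of the CSR matrix, i.e. orders per product
orders_per_product = np.diff(toy_matrix.indptr)
products_in_multiple_orders = int((orders_per_product > 1).sum())
```

In this toy case 4 of the 12 cells are filled, so about two thirds of the matrix is empty, and only one product appears in more than one order.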

In [None]:
transaction_list = list(np.sort(filtered_df.order_id.unique())) # Get our unique orders
item_list = list(filtered_df.product_id.unique()) # Get our unique products that were purchased
quantity_list = list(filtered_df.quantity) # All of our purchases

# Get the column index (order) for each purchase
cols = filtered_df.order_id.astype(CategoricalDtype(categories=transaction_list, ordered=True)).cat.codes
# Get the row index (product) for each purchase
rows = filtered_df.product_id.astype(CategoricalDtype(categories=item_list, ordered=True)).cat.codes
# Build the sparse product-by-order matrix
purchases_sparse = scipy.sparse.csr_matrix((quantity_list, (rows, cols)), shape=(len(item_list), len(transaction_list)))

In [None]:
matrix_size = purchases_sparse.shape[0]*purchases_sparse.shape[1] # Number of possible interactions in the matrix
num_purchases = len(purchases_sparse.nonzero()[0]) # Number of non-zero (product, order) pairs
sparsity = 100*(1 - (num_purchases/matrix_size))
sparsity

In [None]:
plt.figure(figsize=(15, 15))
plt.spy(purchases_sparse, markersize=1, aspect='auto')

# Storing Interim Data

Now that we have the data cleaned up a bit and formatted correctly, we can save it to an interim file to be picked up by the model training algorithm.

In [None]:
catalog.save("brazilian_transactions", filtered_df[["order_id", "product_id", "price", "quantity"]])

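# Concatenate the category name with the description length, converted element-wise to a string
# (str() on the whole Series would append the Series repr to every row)
products["description"] = products["product_category_name"] + products["product_description_lenght"].astype(str)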

catalog.save("brazilian_products", products[["product_id", "description"]])