![HSV-AI Logo](https://github.com/HSV-AI/hugo-website/blob/master/static/images/logo_v9.png?raw=true)

[Dataset](https://www.kaggle.com/mkechinov/ecommerce-purchase-history-from-electronics-store)

This file contains purchase data from April 2020 to November 2020 from a large home appliances and electronics online store.

Each row in the file represents an event. All events are related to products and users. Each event acts like a many-to-many relation between products and users.


# Global Imports

In [None]:
import pandas as pd
import numpy as np
import random
from matplotlib import pyplot as plt

# Data Exploration

In [None]:
df = pd.read_csv('../data/external/kz.csv')
df.head()

## Checking for missing values

It looks like the order id and product id are always available. Those are the only columns we will use from this dataset, so missing values in the rest are fine.

In [None]:
df.isna().sum()

Let's look at the number of products and see how they are distributed among the orders. We can use the value_counts method from pandas to get an idea of how often each product is ordered.

In [None]:
product_counts = df['product_id'].value_counts().to_numpy()
print('There are', len(product_counts), 'unique products\n')
print('Here are the counts of products ordered from largest to smallest')
print(product_counts)
print('\nAnd a graph of what the curve looks like:')
plt.plot(product_counts) 
plt.show()

Wow! It looks like there are a few products that are purchased a lot. Let's take a look at those to see what they are.

In [None]:
df['product_id'].value_counts().head()

In [None]:
print(len(df['order_id'].unique()))
print(len(df))
# from collections import Counter
# Counter(df['product_id'].value_counts().to_numpy())[3]

This is a very extreme curve. It's unlikely that we will be able to use any products that don't appear in multiple orders. We can do a few more things to see how much usable data we have.

First, we count how many times each product is ordered. Then we tell value_counts to split those counts into 10 equal-width bins and to report each bin as a fraction of all products instead of a raw total.

In [None]:
df['product_id'].value_counts().value_counts(normalize=True, bins=10)
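
To see what `bins` does on a small scale, here is a minimal sketch on made-up product ids (the ids and the choice of 3 bins are assumptions for illustration). Note that `bins` groups the values of the Series it is called on, so the per-product counts are computed first and then binned.

```python
import pandas as pd

# Made-up product ids, one entry per order line (illustrative only)
toy_products = pd.Series([101, 101, 101, 101, 202, 202, 303, 404, 505])

# Step 1: how many order lines each product appears in
per_product = toy_products.value_counts()

# Step 2: bin those counts into 3 equal-width intervals and report
# the fraction of products that falls into each interval
binned = per_product.value_counts(normalize=True, bins=3)
print(binned)
```

Most of the toy products land in the lowest bin, which is the same long-tail shape the plot above shows for the real data.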

In [None]:
df['price'].value_counts().sort_index().plot()

In [None]:
totals = df.groupby(df.order_id)['price'].sum()
 
totals.plot()

In [None]:
df['category_code'].value_counts()

In [None]:
df['brand'].value_counts()

Another thing we can do is compute the sparsity of the data. This is useful to see if there is enough overlap between the orders and products to make a useful decision for recommendations.

In [None]:
order_counts = df['order_id'].value_counts()
num_orders = len(order_counts)
num_items = len(product_counts)
sparsity = 1 - len(df) / (num_orders * num_items)
print(f'number of orders: {num_orders}, number of items: {num_items}')
print(f'matrix sparsity: {sparsity:f}')
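
As a quick sanity check of the formula, here is a small sketch on a hand-made set of orders (the order and product ids are invented for illustration). Sparsity is one minus the share of order/product cells that are actually filled.

```python
# Toy order -> products mapping (made-up ids)
toy_orders = {
    "o1": {"p1", "p2"},
    "o2": {"p2"},
    "o3": {"p1", "p3", "p4"},
}

num_orders = len(toy_orders)
num_items = len(set().union(*toy_orders.values()))

# Number of (order, product) cells that contain a purchase
filled = sum(len(items) for items in toy_orders.values())

sparsity = 1 - filled / (num_orders * num_items)
print(f"number of orders: {num_orders}, number of items: {num_items}")
print(f"matrix sparsity: {sparsity:f}")
```

Here 6 of the 12 possible cells are filled, so the sparsity is 0.5. Real purchase data is usually far closer to 1.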

Compare that with the 100k movielens dataset that has:

```
number of users: 943, number of items: 1682
matrix sparsity: 0.936953
```

In addition to the need to reduce the sparsity, another issue with this dataset is the much larger number of items and orders. When I tried to reuse a notebook built for the 100k movielens dataset on this ecommerce data, it immediately ran out of memory when attempting to use the KNNBasic algorithm.

In [None]:
movielens_size = 943 * 1682
print(f'Size for movielens: {movielens_size:,}')

ecommerce_size = 1435266 * 25113
print(f'Size for ecommerce dataset: {ecommerce_size:,}')

This is a pretty clear reason why the in-memory recommendation approaches that work with movielens run out of memory when trying to apply them to the ecommerce dataset.
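
To put those cell counts in terms of memory, here is a rough sketch that assumes every cell is stored as an 8-byte float in a dense array. That storage layout is an assumption for illustration; the structures a given library actually allocates may differ.

```python
def dense_size_gb(num_rows, num_cols, bytes_per_cell=8):
    """Approximate size in gigabytes of a dense num_rows x num_cols array."""
    return num_rows * num_cols * bytes_per_cell / 1e9


print(f"movielens: {dense_size_gb(943, 1682):.3f} GB")
print(f"ecommerce: {dense_size_gb(1435266, 25113):.1f} GB")
```

Under that assumption the movielens matrix fits in a few megabytes, while the ecommerce matrix would need hundreds of gigabytes.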

We need to look at reducing the dataset into something both useful and manageable. To start with, we can remove any products that appear in fewer than some minimum number of orders.

In [None]:
#@title Product filter
filter_value = 1000  #@param {type: "number"}

In [None]:
product_group = df.loc[:, ['order_id', 'product_id']].groupby('product_id').count()
 
multi_product = product_group[product_group.order_id >= filter_value].count()
single_product = product_group[product_group.order_id < filter_value].count()
 
print('Products in at least', filter_value, 'orders:', multi_product['order_id'])
print('Products in fewer than', filter_value, 'orders:', single_product['order_id'])

# We can capture the list of products that pass the filter with this:
product_filter = product_group[product_group.order_id >= filter_value].index.tolist()
 
product_filtered_df = df[df['product_id'].isin(product_filter)].copy()

We can also remove orders that have too few or too many items.



In [None]:
#@title Order size limits
minimum_order_size = 3  #@param {type: "number"}
maximum_order_size = 20  #@param {type: "number"}

In [None]:
order_group = product_filtered_df.loc[:, ['order_id', 'product_id']].groupby('order_id').count()
 
multi_order = order_group[(order_group.product_id >= minimum_order_size) & (order_group.product_id <= maximum_order_size)].count()
single_order = order_group[(order_group.product_id < minimum_order_size) | (order_group.product_id > maximum_order_size)].count()
 
print('Orders with between', minimum_order_size, 'and', maximum_order_size, 'products:', multi_order['product_id'])
print('Orders outside that range:', single_order['product_id'])

# We can capture the list of orders within the size limits with this:
order_filter = order_group[(order_group.product_id >= minimum_order_size) & (order_group.product_id <= maximum_order_size)].index.tolist()

In [None]:
filtered_df = product_filtered_df[product_filtered_df['order_id'].isin(order_filter)].copy()
print('Original dataframe length:', len(df))
print('Filtered dataframe length:', len(filtered_df))
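
The two filtering steps can also be written with `groupby(...).transform('size')`, which attaches each group's row count to every row. The sketch below uses made-up order lines, and the names `min_product_rows` and `min_order_size` are hypothetical stand-ins for the form values above.

```python
import pandas as pd

# Made-up order lines (illustrative only)
toy = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2, 3, 3, 3, 4],
    "product_id": ["a", "b", "c", "a", "b", "a", "b", "c", "z"],
})

min_product_rows = 2  # hypothetical stand-in for filter_value
min_order_size = 3    # hypothetical stand-in for minimum_order_size

# Keep products that appear on at least min_product_rows order lines
product_rows = toy.groupby("product_id")["order_id"].transform("size")
kept = toy[product_rows >= min_product_rows]

# Then keep orders that still have at least min_order_size lines
order_rows = kept.groupby("order_id")["product_id"].transform("size")
kept = kept[order_rows >= min_order_size]

print(sorted(kept["order_id"].unique().tolist()))
```

The order matters: dropping rare products first can shrink an order below the minimum size, which is why the order filter runs on the already filtered frame.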

In [None]:
product_counts = filtered_df['product_id'].value_counts().to_numpy()
print('There are', len(product_counts), 'unique products\n')
print('\nAnd a graph of what the curve looks like:')
plt.plot(product_counts) 
plt.show()
 
order_counts = filtered_df['order_id'].value_counts()
num_orders = len(order_counts)
num_items = len(product_counts)
sparsity = 1 - len(filtered_df) / (num_orders * num_items)
print(f'number of orders: {num_orders}, number of items: {num_items}')
print(f'matrix sparsity: {sparsity:f}')

In [None]:
filtered_df['product_id'] = filtered_df['product_id'].astype(str)

In [None]:
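# Join the sorted product ids of each order into one string that identifies the basket
orderdf = filtered_df[['order_id', 'product_id']].sort_values('product_id').groupby('order_id')['product_id'].agg(','.join).to_frame()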

In [None]:
order_distro = orderdf['product_id'].value_counts().to_numpy()
print('There are', len(order_distro), 'unique product combinations\n')
print('\nAnd a graph of what the curve looks like:')
plt.plot(order_distro) 
plt.show()

# Fun with Numbers

The initial work I had done with this dataset used 1.0 as the "rating" for each product in an order. That turned out to be problematic because some algorithms multiply ratings as part of their score. And of course, this meant that a product "rated" once has the same score as a product "rated" 20 times.

Changing this "rating" to a 5 seems to move past that issue.

In [None]:
filtered_df['rating'] = 5.0
print(filtered_df)

In [None]:
selected_df = filtered_df[['order_id', 'product_id', 'rating']].apply(pd.to_numeric, errors='coerce')
selected_df.info()
selected_df.head()

In [None]:
from collections import defaultdict


def precision_recall_at_k(predictions, k=10, threshold=.75):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls
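
To make the counting concrete, here is a worked example for a single hypothetical user. The (estimated, true) rating pairs, `k`, and `threshold` are invented for illustration; the three counts follow the same definitions as the function above.

```python
# (estimated rating, true rating) pairs for one hypothetical user
user_ratings = [(4.8, 5.0), (4.5, 2.0), (3.9, 5.0), (2.1, 4.0), (1.0, 1.0)]
k, threshold = 3, 4.0

# Rank by estimated rating, highest first
ranked = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)

# Relevant items overall: true rating at or above the threshold
n_rel = sum(true_r >= threshold for _, true_r in ranked)

# Recommended items in the top k: estimate at or above the threshold
n_rec_k = sum(est >= threshold for est, _ in ranked[:k])

# Items in the top k that are both relevant and recommended
n_rel_and_rec_k = sum(
    true_r >= threshold and est >= threshold for est, true_r in ranked[:k]
)

precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
recall = n_rel_and_rec_k / n_rel if n_rel else 0
print(precision, recall)
```

Two of the top three items are recommended and only one of those is relevant, so precision is 1/2. Three items are relevant overall, so recall is 1/3.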

In [None]:
def get_f_score(precision, recall):
    """Return the F1 score (harmonic mean of precision and recall)."""
    denominator = precision + recall
    if denominator == 0:
        return 0
    return 2 * (precision * recall) / denominator
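
A short sketch of how the harmonic mean behaves, using a standalone helper named `f1` (a hypothetical name for this example) and made-up precision and recall values. Unlike a plain average, it stays low when either input is low.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, 0 when both are 0."""
    total = precision + recall
    return 0 if total == 0 else 2 * precision * recall / total


balanced = f1(0.5, 0.5)    # both inputs equal
lopsided = f1(0.9, 0.1)    # same arithmetic mean, very different inputs
undefined = f1(0.0, 0.0)   # guarded division by zero

print(balanced, lopsided, undefined)
```

Both pairs average to 0.5, but the lopsided pair scores far lower, which is why F1 is a stricter summary than the mean.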

In [None]:
!pip install implicit

In [None]:
selected_df.head(5)

In [None]:
selected_df.head(5).values

In [None]:
import scipy.sparse

print(scipy.sparse.csr_matrix(selected_df.values))

In [None]:
 
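import implicit
import scipy.sparse

# Map raw order and product ids to contiguous row/column indices
order_codes = selected_df['order_id'].astype('category').cat.codes.to_numpy()
product_codes = selected_df['product_id'].astype('category').cat.codes.to_numpy()

# Build a product x order matrix of confidence weights (repeat purchases are summed)
sparse_product = scipy.sparse.coo_matrix(
    (selected_df['rating'].to_numpy(), (product_codes, order_codes))
).tocsr()

# The transpose gives the order x product matrix used when recommending
sparse_order = sparse_product.T.tocsr()

# initialize a model
model = implicit.als.AlternatingLeastSquares(factors=50)

# train the model on a sparse matrix of item/user/confidence weight
# (newer implicit releases may expect the transposed user/item matrix; check the installed version)
model.fit(sparse_product)

# recommend items for a user
# user_items = item_user_data.T.tocsr()
# recommendations = model.recommend(userid=0, user_items=[0], recalculate_user=True)

# # find related items
# related = model.similar_items(3000)
# print(related)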

In [None]:
related = model.similar_items(0)
print(related)

In [None]:
model.recommend(1, sparse_order)

In [None]:
from implicit.datasets.lastfm import get_lastfm

In [None]:
artists, users, plays = get_lastfm()

In [None]:
plays_scr = plays.tocsr()
print(plays_scr)

# Surprise Install and Import

In [None]:
!pip install surprise

In [None]:
from surprise import Reader, Dataset
from surprise import SVD, KNNBasic
from surprise.model_selection import cross_validate


# Model Training and Evaluation

In [None]:
# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(0.0, 5.0))
# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(filtered_df[['order_id', 'product_id', 'rating']], reader)

filtered_df.count()

In [None]:
from surprise.model_selection import KFold
from collections import defaultdict

kf = KFold(n_splits=5)

# We'll use the famous SVD algorithm.
algo = SVD()

for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=5)

    # Precision and recall can then be averaged over all users
    precision = sum(prec for prec in precisions.values()) / len(precisions)
    recall = sum(rec for rec in recalls.values()) / len(recalls)
    fscore = get_f_score(precision, recall)
    print('Precision:', precision, 'Recall:', recall, 'F1 Score:', fscore)

In [None]:
## TODO - figure out what the scores would be if we simply recommended the most popular item

## TODO - what happens if we remove duplicates - but no really, how the heck do we even do that?
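
# One possible starting point for the first TODO, sketched on made-up data.
# The orders, held-out items, and the hit-rate metric are assumptions for
# illustration, not results from this dataset.

```python
from collections import Counter

# Made-up training baskets and one held-out item per order (illustrative only)
train_orders = {
    "o1": ["a", "b"],
    "o2": ["a", "c"],
    "o3": ["a", "b"],
    "o4": ["c", "d"],
}
held_out = {"o1": "c", "o2": "b", "o3": "c", "o4": "a"}

# Most frequently purchased item across the training baskets
popularity = Counter(item for items in train_orders.values() for item in items)
most_popular, _ = popularity.most_common(1)[0]

# Hit rate: share of orders whose held-out item is the most popular one
hits = sum(1 for item in held_out.values() if item == most_popular)
hit_rate = hits / len(held_out)
print(most_popular, hit_rate)
```

# Any model score would then be compared against this popularity baseline.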




Show the probability that a recommendation was acted on, for example 50%.

Assume that in the real world it is closer to 25%.

For 25%, calculate the additional value that would have been added if this system had been in place.




In [None]:
for i in range(20):
    uid = random.choice(order_filter)
    # product_id was cast to str before training, so the raw item id must be a str too
    iid = str(random.choice(product_filter))

    print(algo.predict(uid, iid))
  

In [None]:
def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.
    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.
    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
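
To show the shape of the result without training a model, here is a small self-contained sketch. It uses hand-written tuples as stand-ins for Surprise `Prediction` objects (user id, item id, true rating, estimate, details) and a compact helper named `top_n_per_user`, which is a hypothetical name for this example.

```python
from collections import defaultdict


def top_n_per_user(predictions, n=2):
    """Group (uid, iid, true_r, est, details) tuples by user and keep the n best."""
    top_n = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        top_n[uid].append((iid, est))

    # Sort each user's items by estimate, highest first, and truncate
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda pair: pair[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n


# Hand-written tuples standing in for Surprise Prediction objects
toy_predictions = [
    ("order_1", "p1", None, 4.2, {}),
    ("order_1", "p2", None, 4.9, {}),
    ("order_1", "p3", None, 3.1, {}),
    ("order_2", "p1", None, 2.5, {}),
]

result = top_n_per_user(toy_predictions)
for uid, items in result.items():
    print(uid, [iid for iid, _ in items])
```

Each user maps to at most `n` (item, estimate) pairs, so a user with fewer predictions than `n` simply gets a shorter list.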


In [None]:
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

uid = random.choice(order_filter)

top_n = get_top_n(predictions, n=10)