In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Kaggle Notebook - [my link](https://www.kaggle.com/tolulade/recommender-system-using-alternating-least-squares)

# With Implicit Feedback Datasets

Unlike explicit feedback, implicit feedback can be tracked automatically, for example by monitoring events on a site such as clicks, view times, purchases, and so on.

They are much easier to collect.
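
As a small illustration of how such tracked events can become a single interaction strength per user and item, here is a minimal sketch. The event log, the event names, and the weights are all hypothetical and are not part of this notebook's dataset.

```python
from collections import Counter

# Hypothetical event log: (user, item, event_type). Illustrative only.
events = [
    ("u1", "lantern", "view"),
    ("u1", "lantern", "purchase"),
    ("u2", "lantern", "view"),
    ("u2", "mug", "click"),
    ("u2", "mug", "click"),
]

# Assumed weights: a purchase is taken as a stronger signal than a view or click.
EVENT_WEIGHTS = {"view": 1, "click": 1, "purchase": 5}


def interaction_strengths(events, weights):
    """Sum weighted events into one implicit-feedback score per (user, item)."""
    strengths = Counter()
    for user, item, event_type in events:
        strengths[(user, item)] += weights[event_type]
    return dict(strengths)


strengths = interaction_strengths(events, EVENT_WEIGHTS)
print(strengths)
```

In this notebook the only signal used is the purchased quantity, so no weighting of event types is needed.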


**Project Goal:**

Build an item recommendation system by creating a specific ranking of a set of items for each user.

To determine user preferences for each item, we have to learn from the users' past interactions with the system, for example online purchasing, gaming, etc.

For online purchasing, in other words, we are building a recommender system that provides personalised recommendations to customers based on their purchasing history.

## Data Extraction from Repo

Let's source our dataset.

For this exercise, we are going to use the Online Retail dataset from the UCI Machine Learning Repository. We will use the Implicit library, "Fast Python Collaborative Filtering for Implicit Datasets", for our matrix factorization.

[Source dataset](https://archive.ics.uci.edu/ml/datasets/online+retail)

For the dataset extraction and download, I can do any of these:

1. Upload to GitHub and pull from git into Kaggle.
2. Upload to the Kaggle data drive, then read the data from there.
3. Use the **wget command** to get the data directly from the web source/data repository.

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx

Notice that the data is in xlsx and not csv.

Also, let's import libraries. 

[Implicit python docs](https://pypi.org/project/implicit/)

In [None]:
!pip install openpyxl #for kaggle - python excel reader 

In [None]:
import sys
import pandas as pd
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.linalg import spsolve
import random
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
import implicit
import openpyxl

online_retail = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
online_retail

In [None]:
# !apt install cuda #might not be needed as implicit installation worked in kaggle

In [None]:
# !pip install git+https://github.com/benfred/implicit.git@18a8010b07e8a86f8eb37f837b5bbda11647501f
# Might not be needed: implicit imported fine on Kaggle, unlike on Google Colab and Jupyter Notebook

In [None]:
online_retail.info()

In [None]:
online_retail.describe()

In [None]:
online_retail.Quantity.describe()

In [None]:
online_retail.CustomerID.value_counts() # count the number of occurrences of each CustomerID

In [None]:
print(f'Number of rows: {online_retail.shape[0]}; Number of columns: {online_retail.shape[1]}; Number of missing values: {sum(online_retail.isna().sum())}')

## Lots of missing values - 136534

Many rows have no “CustomerID”, so we have to remove those rows, using .notna().

Then we:

1. Group by “CustomerID” and “StockCode” (keeping “Description”) and sum the “Quantity”, so that we get one row per customer-item interaction.
2. Change “Quantity” to 1 where the summed value is 0, since a net zero still means the customer interacted with the item.
3. Eliminate rows with a negative “Quantity”.

In [None]:
# Keep only rows that have a CustomerID (drop missing values)
online_retail = online_retail[online_retail['CustomerID'].notna()]
grouped_retail = online_retail[['CustomerID', 'StockCode', 'Description', 'Quantity']].groupby(['CustomerID', 'StockCode', 'Description']).sum().reset_index()
# A summed quantity of 0 still counts as one interaction
grouped_retail.loc[grouped_retail['Quantity'] == 0, ['Quantity']] = 1
# Eliminate negative quantities
grouped_retail = grouped_retail.loc[grouped_retail['Quantity'] > 0]

In [None]:
grouped_retail

In [None]:
import plotly.express as px

fig = px.histogram(grouped_retail, x='Quantity', title='Distribution of the purchase quantity', nbins=500)
fig.show()

The purchased quantity is very low for the majority of customers; only a few interactions exceed 2,000 pieces.

In [None]:
grouped_retail.CustomerID.value_counts() #some rows have been removed

In [None]:
print(f'Number of unique customers: {grouped_retail.CustomerID.nunique()}')
print(f'Number of unique items: {grouped_retail.StockCode.nunique()}')

print(f'Average purchase quantity per interaction: {int(grouped_retail.Quantity.mean())}')
print(f'Minimum purchase quantity per interaction: {grouped_retail.Quantity.min()}')
print(f'Maximum purchase quantity per interaction: {grouped_retail.Quantity.max()}')

In [None]:
import seaborn as sns
sns.countplot(x=online_retail['CustomerID'])  # pass the column as x explicitly

In [None]:
sns.countplot(x=grouped_retail['CustomerID'])  # pass the column as x explicitly

# Implicit Feedback

"Instead of representing an explicit rating, the “Quantity” can represent a “confidence” in terms of how strong the interaction was. Items with a larger number of “Quantity” by a customer can carry more weight in our ratings matrix of “Quantity”."

1. Let's create numeric “customer_id” and “item_id” columns.

2. Create two matrices, one for fitting the model (item-customer) and another one for recommendations (customer-item).

3. Initialise the Alternating Least Squares (ALS) recommendation model.

4. Fit the model using the sparse item-customer matrix.

5. We set the type of our matrix to double for the ALS function to run properly.
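
To make the alternating updates more concrete, here is a minimal dense NumPy sketch of confidence-weighted ALS on a toy matrix. It is not the Implicit library's implementation, which works on sparse matrices and is much faster. The function name `als_implicit`, the toy data, and the confidence form `1 + alpha * quantity` are assumptions made for illustration; the notebook itself simply multiplies the quantities by `alpha` before fitting.

```python
import numpy as np


def als_implicit(R, factors=2, alpha=15.0, reg=0.1, iterations=30, seed=0):
    """Toy dense ALS for implicit feedback; R is a (users x items) count matrix."""
    R = np.asarray(R, dtype=float)
    P = (R > 0).astype(float)      # binary preference: did the user buy the item?
    C = 1.0 + alpha * R            # assumed confidence weighting
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=0.1, size=(R.shape[0], factors))  # user factors
    Y = rng.normal(scale=0.1, size=(R.shape[1], factors))  # item factors
    ridge = reg * np.eye(factors)

    def solve_rows(fixed, conf, pref):
        # One weighted ridge regression per row, holding the other side fixed
        out = np.empty((conf.shape[0], factors))
        for r in range(conf.shape[0]):
            weighted = fixed.T * conf[r]   # columns scaled by this row's confidences
            out[r] = np.linalg.solve(weighted @ fixed + ridge, weighted @ pref[r])
        return out

    for _ in range(iterations):
        X = solve_rows(Y, C, P)        # update users with items fixed
        Y = solve_rows(X, C.T, P.T)    # update items with users fixed
    return X, Y


# Toy data: users 0-1 buy items 0-1, users 2-3 buy items 2-3
toy_R = np.array([
    [2, 1, 0, 0],
    [1, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 2],
])
user_factors, item_factors = als_implicit(toy_R)
scores = user_factors @ item_factors.T
print(np.round(scores, 2))
```

The printed score matrix should be high where a user bought an item and low elsewhere, which is the structure the recommendations later rely on.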

In [None]:
unique_customers = grouped_retail.CustomerID.unique()
customer_ids = dict(zip(unique_customers, np.arange(unique_customers.shape[0], dtype=np.int32)))

unique_items = grouped_retail.StockCode.unique()
item_ids = dict(zip(unique_items, np.arange(unique_items.shape[0], dtype=np.int32)))

grouped_retail['customer_id'] = grouped_retail.CustomerID.apply(lambda i: customer_ids[i])
grouped_retail['item_id'] = grouped_retail.StockCode.apply(lambda i: item_ids[i])

sparse_item_customer = sparse.csr_matrix((grouped_retail['Quantity'].astype(float), (grouped_retail['item_id'], grouped_retail['customer_id'])))
sparse_customer_item = sparse.csr_matrix((grouped_retail['Quantity'].astype(float), (grouped_retail['customer_id'], grouped_retail['item_id'])))


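# Note: the matrix orientation expected by fit() depends on the installed implicit
# version. Releases before 0.5 take an item-user matrix, as done here; from 0.5
# onwards fit() takes a user-item matrix (sparse_customer_item).
model = implicit.als.AlternatingLeastSquares(factors=20, regularization=0.1, iterations=50)

# Scale the raw quantities by alpha to use them as confidence values
alpha = 15
data = (sparse_item_customer * alpha).astype('double')

model.fit(data)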

# Recommendation Findings

Let’s start with “WHITE METAL LANTERN”. We found that “item_id” for “WHITE METAL LANTERN” is 1319.

In [None]:
grouped_retail.loc[grouped_retail['item_id'] == 1319].head()

Finding the 10 most similar items to “WHITE METAL LANTERN”:

1. Get the customer and item vectors from our trained model.
2. Calculate the vector norms.
3. Calculate the similarity score.
4. Get the top 10 items.
5. Create a list of item-score tuples of the items most similar to this item.
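
The steps above amount to a cosine similarity between item vectors. The sketch below runs them on a tiny hand-made set of item vectors so the result can be checked by eye. The helper name `most_similar_items` and the toy vectors are hypothetical, and it assumes no item vector has zero norm.

```python
import numpy as np


def most_similar_items(item_vecs, item_id, n_similar=3):
    """Return (item index, cosine similarity) pairs, most similar first."""
    item_vecs = np.asarray(item_vecs, dtype=float)
    norms = np.linalg.norm(item_vecs, axis=1)                 # one norm per item
    scores = item_vecs @ item_vecs[item_id] / (norms * norms[item_id])
    top_idx = np.argpartition(scores, -n_similar)[-n_similar:]  # unordered top n
    top_idx = top_idx[np.argsort(-scores[top_idx])]              # sort descending
    return [(int(i), float(scores[i])) for i in top_idx]


toy_item_vecs = np.array([
    [1.0, 0.0],   # item 0 (the query item)
    [2.0, 0.1],   # item 1: almost the same direction as item 0
    [0.0, 1.0],   # item 2: orthogonal to item 0
    [1.0, 1.0],   # item 3: halfway between
])
similar_items = most_similar_items(toy_item_vecs, item_id=0, n_similar=3)
for idx, score in similar_items:
    print(idx, round(score, 3))
```

The query item itself always comes first with a similarity of 1, which is why the first description printed by the notebook cell below is the lantern.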

In [None]:
item_id = 1319
n_similar = 10

item_vecs = model.item_factors
customer_vecs = model.user_factors

# Euclidean norm of every item vector
item_norms = np.sqrt((item_vecs * item_vecs).sum(axis=1))

scores = item_vecs.dot(item_vecs[item_id]) / item_norms
top_idx = np.argpartition(scores, -n_similar)[-n_similar:]
similar = sorted(zip(top_idx, scores[top_idx] / item_norms[item_id]), key=lambda x: -x[1])

for item in similar:
    idx, score = item
    print(grouped_retail.Description.loc[grouped_retail.item_id == idx].iloc[0])

In [None]:
def recommend(customer_id, sparse_customer_item, customer_vecs, item_vecs, num_items=10):
    
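    # Dense row of this customer's purchase quantities
    customer_interactions = sparse_customer_item[customer_id,:].toarray()
    # Turn it into a mask: 1 for items not yet purchased, 0 for items already purchased
    customer_interactions = customer_interactions.reshape(-1) + 1
    customer_interactions[customer_interactions > 1] = 0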
    
    rec_vector = customer_vecs[customer_id,:].dot(item_vecs.T).toarray()
    
    min_max = MinMaxScaler()
    rec_vector_scaled = min_max.fit_transform(rec_vector.reshape(-1,1))[:,0]
    recommend_vector = customer_interactions * rec_vector_scaled

    item_idx = np.argsort(recommend_vector)[::-1][:num_items]
    
    descriptions = []
    scores = []

    for idx in item_idx:
        descriptions.append(grouped_retail.Description.loc[grouped_retail.item_id == idx].iloc[0])
        scores.append(recommend_vector[idx])

    recommendations = pd.DataFrame({'description': descriptions, 'score': scores})

    return recommendations
    
customer_vecs = sparse.csr_matrix(model.user_factors)
item_vecs = sparse.csr_matrix(model.item_factors)
# Create recommendations for customer with id 2
customer_id = 2
recommendations = recommend(customer_id, sparse_customer_item, customer_vecs, item_vecs)

print(recommendations)
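
The masking and rescaling inside `recommend` can be followed on a toy vector. This is a minimal sketch with made-up numbers; `rank_unseen_items` is a hypothetical helper that replaces `MinMaxScaler` with the equivalent min-max arithmetic in NumPy.

```python
import numpy as np


def rank_unseen_items(interactions, raw_scores, num_items=2):
    """Min-max scale the scores, zero out purchased items, return the top items."""
    interactions = np.asarray(interactions, dtype=float)
    raw_scores = np.asarray(raw_scores, dtype=float)

    # 1 for items the customer has not bought yet, 0 for items already bought
    unseen_mask = (interactions == 0).astype(float)

    # Min-max scaling to the range [0, 1]
    span = raw_scores.max() - raw_scores.min()
    if span > 0:
        scaled = (raw_scores - raw_scores.min()) / span
    else:
        scaled = np.zeros_like(raw_scores)

    masked = unseen_mask * scaled
    top_idx = np.argsort(masked)[::-1][:num_items]
    return [(int(i), float(masked[i])) for i in top_idx]


toy_interactions = [3, 0, 0, 1, 0]             # quantities bought per item
toy_raw_scores = [0.9, 0.4, -0.1, 0.7, 0.2]    # customer vector dot item vectors
top_items = rank_unseen_items(toy_interactions, toy_raw_scores)
print(top_items)
```

Items 0 and 3 have the highest raw scores but are dropped because the customer already bought them. Note that the lowest-scoring unseen item also ends up at 0 after scaling, so it ties with the purchased items.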

So we have the top 10 recommendations for customer_id 2.

Let’s get the top 20 items this customer has purchased.

In [None]:
grouped_retail.loc[grouped_retail['customer_id'] == 2].sort_values('Quantity', ascending=False)[['customer_id', 'Description', 'Quantity']].head(20)


### **Conclusions**

We can see the customer’s top purchases, which are lip glosses, designed tissues, holiday cake cases, etc. (it looks like a seasonal or anniversary occasion/party).

Items recommended to the customer include fruit straws, gift boxes, cocktail parasols, etc., mainly suited to people hosting an anniversary occasion, a party, and the like.

Remember **“Customers who bought this item also bought…”?**


Please note:

"The best evaluation metrics for a recommender system is how much the system adds value to the customers and/or business, whether the system increase sales and profits."

Performing some kind of online A/B testing to evaluate these metrics would help.

There are, however, other common metrics for evaluating the performance of a recommender in isolation.

One such metric is the **AUC**, calculated for each customer in the training set that had at least one item purchased, and compared with the [AUC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc) obtained by simply recommending the most popular items.
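
As a sketch of how such a comparison could look for a single customer, the snippet below computes the AUC directly from its pairwise definition: the fraction of (purchased, not purchased) item pairs that the scores rank correctly. The labels and scores are made-up toy values, not results from this notebook. `roc_auc_score` from scikit-learn computes the same quantity.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data for one customer: 1 = item was actually purchased, 0 = not purchased
held_out = [1, 0, 1, 0, 0]
model_scores = [0.9, 0.3, 0.4, 0.5, 0.1]        # hypothetical recommender scores
popularity_scores = [0.2, 0.8, 0.6, 0.4, 0.1]   # hypothetical popularity baseline

model_auc = auc(held_out, model_scores)
popularity_auc = auc(held_out, popularity_scores)
print(f"model AUC: {model_auc:.3f}, popularity AUC: {popularity_auc:.3f}")
```

Averaging both values over all customers gives one number for the recommender and one for the popularity baseline. A recommender is only adding value if its average is clearly above the baseline.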

**Reference**

[Source 1](https://actsusanli.medium.com/building-a-recommender-system-with-implicit-feedback-datasets-using-alternating-least-squares-64d4f5ba3c57)