# Recommender Systems on Retail/E-commerce data

## Table Of Contents
* [Overview](#section-1)
* [Dataset](#section-2)
* [Objective](#section-3)
* [Cost](#section-4)
* [Load Data](#section-5)
* [Analyze the data](#section-6)
* [Perform feature Engineering](#section-7)
* [Select required amount of training data](#section-8)
* [Define Training Parameters](#section-9)
* [Perform Cross-Validation](#section-10)
* [Evaluate the best model from Cross-Validation](#section-11)
* [Generate recommendations](#section-12)
* [Write the recommendations to Bigquery](#section-13)
* [Save the trained model to GCS path](#section-14)
* [Clean Up](#section-15)





## Overview
<a name="section-1"></a>


Recommender systems model existing customer behavior to generate recommendations. These models generally build large matrices that map out existing customer preferences in order to find intersecting interests and offer recommendations. Because these matrices can be very large, they benefit from distributed computing and large memory pools, which makes them a good fit for Vertex AI and PySpark.


## Dataset
<a name="section-2"></a>


This notebook uses the <a href="https://www.kaggle.com/retailrocket/ecommerce-dataset">"Retailrocket recommender system dataset - Ecommerce data: web events, item properties (with texts), category tree"</a> dataset from Kaggle. The dataset consists of three parts: behaviour data, item properties and a category tree.
 
The behaviour data, i.e. events like clicks, add to carts, transactions, represent interactions that were collected over a period of 4.5 months. A visitor can make three types of events, namely “view”, “addtocart” or “transaction”. In total there are 2,756,101 events including 2,664,312 views, 69,332 add to carts and 22,457 transactions produced by 1,407,580 unique visitors. 
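
The counts above imply a heavily skewed event distribution. The short sketch below only redoes that arithmetic from the quoted numbers; the variable names are illustrative and nothing is read from the dataset itself.

```python
# Event counts quoted in the dataset description above.
event_counts = {"view": 2_664_312, "addtocart": 69_332, "transaction": 22_457}
total_events = sum(event_counts.values())

# Share of each event type as a percentage of all events.
event_share = {
    name: 100 * count / total_events for name, count in event_counts.items()
}

for name, share in event_share.items():
    print(f"{name:<12}{share:6.2f}%")
```

Views account for roughly 97% of all events and transactions for less than 1%, so any rating derived from these events will be dominated by the lowest score.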

Users and products have been obfuscated by replacing the text with numerical IDs. 


## Objective
<a name="section-3"></a>


In this notebook, we build a recommendation system on the <a href="https://www.kaggle.com/retailrocket/ecommerce-dataset">Retail-Rocket dataset</a>. To do so, we use managed instances from Vertex AI and the interactive PySpark services offered by Vertex AI. The approach is <a href="https://en.wikipedia.org/wiki/Collaborative_filtering#:~:text=Collaborative%20filtering%20(CF)%20is%20a%20technique%20used%20by%20recommender%20systems.&text=In%20the%20newer%2C%20narrower%20sense,from%20many%20users%20(collaborating).">collaborative filtering</a>, with <a href="http://dl.acm.org/citation.cfm?id=1608614"><b>Alternating Least Squares (ALS)</b></a> as the learning algorithm.
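
As a rough illustration of the alternating idea behind ALS, the sketch below fits a single latent factor per user and per item on a handful of made-up ratings. It is not Spark's implementation: the rank is fixed at 1, the regularization is a plain L2 penalty, and the names `als_rank1`, `squared_error` and `toy_ratings` are hypothetical.

```python
def als_rank1(ratings, n_iters=20, reg=0.1):
    """Fit one latent factor per user and per item by alternating least squares."""
    users = {u for u, _ in ratings}
    items = {i for _, i in ratings}
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}

    for _ in range(n_iters):
        # Hold item factors fixed: each user factor has a closed-form ridge solution.
        for u in users:
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(item_f[i] ** 2 for (uu, i) in ratings if uu == u)
            user_f[u] = num / den
        # Hold user factors fixed: solve for each item factor the same way.
        for i in items:
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(user_f[u] ** 2 for (u, ii) in ratings if ii == i)
            item_f[i] = num / den
    return user_f, item_f


def squared_error(ratings, user_f, item_f):
    """Sum of squared differences between observed and reconstructed ratings."""
    return sum((r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items())


# Made-up (user, item) -> rating pairs; each user has one unobserved item.
toy_ratings = {
    ("u1", "a"): 3, ("u1", "b"): 1,
    ("u2", "a"): 3, ("u2", "c"): 2,
    ("u3", "b"): 1, ("u3", "c"): 2,
}

user_f, item_f = als_rank1(toy_ratings)
print("Squared error on observed ratings:", squared_error(toy_ratings, user_f, item_f))
print("Predicted rating for (u1, c):", user_f["u1"] * item_f["c"])
```

Each half-step is an ordinary least-squares problem because one side of the factorization is held fixed, which is what makes the method easy to distribute across users and items.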

Things to do before running this notebook : 
* Collect data from Kaggle and store them into a GCS bucket
* Create a Dataproc cluster with the JupyterLab extension and component gateway enabled.
* Change the kernel of this notebook on the Vertex AI managed instance to the PySpark kernel on the created Dataproc cluster (remote).

<img src="images/Cluster_setup.PNG"></img>

Once the cluster creation step has finished, wait ~3 minutes for it to become available to the Managed Notebooks. Once the kernel for the cluster is available, select it.

<img src="images/cluster_kernel_selection.PNG"></img>


#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
import os
PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [9]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

## Cost
<a name="section-4"></a>


## Import libraries and define constants

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

## Load Data
<a name="section-5"></a>

The dataset mainly consists of events data and items data. This notebook uses only the events data for the collaborative filtering approach.

Load the data from the GCS bucket. The contents of GCS buckets can be browsed from the GCS file browser on the left.

In [None]:

events = pd.read_csv('path to events.csv file')
print (events.shape)

## Analyze the data
<a name="section-6"></a>


In [None]:
events.head()

Check the event distribution


In [None]:
events['event'].value_counts()

Check null values in the data


In [None]:
events.isna().sum()

Check the unique ids


In [None]:
print (events['visitorid'].unique().shape, events['itemid'].unique().shape)

There are three types of events in the data: "view", "addtocart" and "transaction", spread across ~1.4M visitorids and ~200K itemids. Among the given fields, transactionid has many null values, which makes sense because a visitor does not always make a transaction. Most of the time, a visitor just views an item or adds it to the cart without purchasing.

## Perform feature Engineering
<a name="section-7"></a>


In collaborative filtering, a user-item matrix is generally built to provide a quantitative measure of the association between users and items. To build such an association in the current solution, a new column <b>product_rating</b> is derived from the events each user generated. This column serves as the score between each user and the items associated with them.

Assign a score associated with each item for a user based on the events


In [None]:
def product_rating(interactions):
    addtocart = 0
    view = 0
    
    for e in interactions:
        if e == 'transaction':
            return 3
        elif e == 'addtocart':
            addtocart += 1
        elif e == 'view':
            view += 1
         
    if addtocart > 0:
        return 2
    
    if view > 0:
        return 1
    
    return 0

Aggregate the data to collect the event data for each user and apply the defined product_rating function


In [None]:
convertedData = events.groupby(by=['visitorid', 'itemid'])['event'].agg([product_rating]).reset_index()
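
The scoring rule in `product_rating` amounts to "the strongest event wins". The sketch below expresses the same rule as a maximum over event weights, using plain Python and a few made-up rows; `EVENT_WEIGHT`, `rating_from_events` and `toy_events` are illustrative names, not part of the notebook.

```python
from collections import defaultdict

# Event weights implied by product_rating: view < addtocart < transaction.
EVENT_WEIGHT = {"view": 1, "addtocart": 2, "transaction": 3}


def rating_from_events(interactions):
    """Return the highest event weight seen, or 0 if no known event occurs."""
    return max((EVENT_WEIGHT.get(e, 0) for e in interactions), default=0)


# Made-up event rows: (visitorid, itemid, event).
toy_events = [
    (1, 10, "view"),
    (1, 10, "addtocart"),
    (1, 11, "view"),
    (2, 10, "view"),
    (2, 10, "transaction"),
]

# Collect the events of each (visitorid, itemid) pair, mirroring the groupby above.
grouped = defaultdict(list)
for visitor, item, event in toy_events:
    grouped[(visitor, item)].append(event)

toy_ratings = {pair: rating_from_events(evts) for pair, evts in grouped.items()}
print(toy_ratings)
```

Note that repeated views or repeated add-to-cart events do not raise the score under this rule; only the type of the strongest event matters.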

Check the data


In [None]:
print (convertedData.shape)
convertedData.head()

Check the distribution of the product_rating column


In [None]:
sns.histplot(convertedData['product_rating'], kde=False)

Check unique user ids and item ids


In [None]:
print (convertedData['visitorid'].unique().shape, convertedData['itemid'].unique().shape)

Let's check the distribution of the number of items associated with each user


In [None]:
item_count = convertedData[['visitorid','itemid']].groupby(by=['visitorid']).count()
sns.boxplot(x=item_count['itemid'])

## Select required amount of training data
<a name="section-8"></a>


It can be seen that most users sit in the low range, i.e. they are associated with only one or a few items. Also, there are ~1.4 million users in this dataset, which is far too many for the current solution. Moving ahead, the training dataset is limited to the top 1000 users ranked by the number of items they are associated with.

Select the top 1000 users based on the number of items they are associated with


In [None]:
total_users = 1000
item_count.sort_values(by='itemid', ascending=False, inplace=True)
convertedData = convertedData[convertedData['visitorid'].isin(item_count.iloc[:total_users].index)]
convertedData.shape

Check the Spark session to ensure it is connected to the remote Dataproc cluster.
* **Note: To connect to the Dataproc cluster, change the kernel to the remote PySpark instance for the created cluster.**

In [None]:

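# The spark.jars setting specifies the jar file required to instantiate the BigQuery connector.
# (A comment cannot follow a line-continuation backslash, so it is placed here instead.)
spark = SparkSession.builder \
.appName('Recommendations') \
.config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
.getOrCreate()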

spark

## Define Training Parameters
<a name="section-9"></a>


Create ALS model


In [None]:
als = ALS(
         userCol="visitorid", 
         itemCol="itemid",
         ratingCol="product_rating", 
         nonnegative = True, 
         implicitPrefs = False,
         coldStartStrategy="drop"
)

Add hyperparameters and their respective values to param_grid


In [None]:
param_grid = ParamGridBuilder() \
            .addGrid(als.rank, [10, 150]) \
            .addGrid(als.regParam, [.01, .1]) \
            .build()

Define the evaluator with RMSE as the metric and print the number of models in the parameter grid


In [None]:
evaluator = RegressionEvaluator(
           metricName="rmse", 
           labelCol="product_rating", 
           predictionCol="prediction") 
print ("Num models to be tested: ", len(param_grid))

Create a SparkDataframe to train the ALS model


In [None]:
ratings = convertedData[['visitorid', 'itemid', 'product_rating']]
ratings=spark.createDataFrame(ratings) 
ratings.printSchema()
ratings.show()

Check the data count


In [None]:
ratings.count()

## Split the data into Train-Test


Create test and train set


In [None]:
(train, test) = ratings.randomSplit([0.8, 0.2], seed = 36)
train.count(), test.count()
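
The weights passed to `randomSplit` are target proportions rather than exact sizes, so the two counts above will be close to 80/20 without matching it exactly. The sketch below mimics that behavior with one independent random draw per row; it is a plain-Python illustration under that assumption, and `random_split` and `rows` are hypothetical names.

```python
import random


def random_split(rows, train_fraction=0.8, seed=36):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        target = train_rows if rng.random() < train_fraction else test_rows
        target.append(row)
    return train_rows, test_rows


rows = list(range(1000))  # stand-in for the rating rows
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))  # roughly 800 and 200
```

Fixing the seed makes the split reproducible across runs, which keeps the reported RMSE values comparable.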

## Perform Cross-Validation
<a name="section-10"></a>


Build cross validation using CrossValidator


In [None]:
cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=4)
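
Before fitting, it helps to estimate how much training the cross-validation will trigger. The sketch below enumerates the grid with `itertools.product`, using the same candidate values as `param_grid`; `grid_values` and `num_folds` are illustrative names that mirror the cells above.

```python
from itertools import product

# Same candidate values as the param_grid cell above.
grid_values = {"rank": [10, 150], "regParam": [0.01, 0.1]}
num_folds = 4

# Every combination of hyperparameter values is one candidate model.
combinations = [
    dict(zip(grid_values, combo)) for combo in product(*grid_values.values())
]

# Each candidate is fitted once per fold; the best one is then refitted on the full training set.
fits_during_cv = len(combinations) * num_folds
total_fits = fits_during_cv + 1

print(len(combinations), fits_during_cv, total_fits)
```

Adding one more value to either hyperparameter increases the number of fits by a multiple of `num_folds`, so the grid is best kept small on a modest cluster.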

Fit cross validator to the train dataset


In [None]:
model = cv.fit(train)
#Extract best model from the cv model above
best_model = model.bestModel
print("##Parameters for the Best Model##")
print("Rank:", best_model._java_obj.parent().getRank())
print("MaxIter:", best_model._java_obj.parent().getMaxIter())
print("RegParam:", best_model._java_obj.parent().getRegParam())

## Evaluate the best model from Cross-Validation
<a name="section-11"></a>


View the rating predictions by the model on train and test sets


In [None]:
train_predictions = best_model.transform(train)
train_RMSE = evaluator.evaluate(train_predictions)

test_predictions = best_model.transform(test)
test_RMSE = evaluator.evaluate(test_predictions)

print("Train RMSE ", train_RMSE)
print("Test RMSE " , test_RMSE)
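
For reference, RMSE is the square root of the mean squared difference between the observed `product_rating` and the prediction. The sketch below computes it by hand on made-up values; `rmse`, `labels` and `predictions` are illustrative names. Because the model uses `coldStartStrategy="drop"`, rows without a prediction are removed before evaluation, so the reported RMSE only covers users and items seen during training.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between observed ratings and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings on the 1-3 product_rating scale and made-up predictions.
labels = [1, 1, 2, 3]
predictions = [1.0, 2.0, 2.0, 1.0]

score = rmse(labels, predictions)
print("RMSE:", score)
```

A single large miss (here, predicting 1.0 for a rating of 3) contributes more to the score than several small ones, since errors are squared before averaging.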

## Generate recommendations
<a name="section-12"></a>


Generate n Recommendations for all users


In [None]:
nrecommendations = best_model.recommendForAllUsers(10)
nrecommendations.limit(10).show()

### Generate for a specific user

Identify the items already associated with a user 


In [None]:
train.where(train.visitorid == 2326 ).select ("itemid").collect()

Get recommendations for the items for the selected user


In [None]:
nrecommendations.where(nrecommendations.visitorid == 2326).select("recommendations.itemid", "recommendations.rating").collect()
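
Comparing the two results shows whether the recommendations contain items the user has already interacted with, which `recommendForAllUsers` does not exclude on its own. The sketch below shows one way to drop such items after collecting both results; the values and the name `drop_seen` are hypothetical.

```python
def drop_seen(recommendations, seen_items):
    """Keep (itemid, rating) pairs whose item the user has not interacted with."""
    return [(item, score) for item, score in recommendations if item not in seen_items]


# Made-up values standing in for the two collect() results above.
seen_items = {101, 205}
recommendations = [(101, 2.9), (330, 2.4), (205, 2.1), (412, 1.7)]

new_items = drop_seen(recommendations, seen_items)
print(new_items)
```

Whether to filter depends on the use case: for repeat purchases, previously seen items may still be worth recommending.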

## Write the recommendations to Bigquery
<a name="section-13"></a>


Spark's ALS model generates a specified number of item recommendations for every user it was trained on. The recommendations for a particular user can then be filtered from that output. To serve the recommendations to end users or applications, they can be written to a BigQuery table using Spark's BigQuery connector.


### Create a Dataset in Bigquery

#@bigquery
-- create a dataset in Bigquery
CREATE SCHEMA recommender_sys
OPTIONS(
  location="us"
  )

### Write the Recommendations to Bigquery

In [None]:
DATASET = "recommender_sys"
TABLE = "recommendations"
TEMPORARY_GCS_PATH = "vertex_ai_managed_services_demo/recommender_systems/temporarySparkfolder"

nrecommendations.write \
  .format("bigquery") \
  .option("table","{}.{}".format(DATASET, TABLE)) \
  .option("temporaryGcsBucket", TEMPORARY_GCS_PATH) \
  .mode('overwrite') \
  .save()

## Save the trained model to GCS path
<a name="section-14"></a>


The trained model's save() method creates a folder at the specified path and stores the model there. The path can point directly to a GCS bucket, whose contents can then be browsed in the GCS file browser.

Save the trained model


In [None]:
GCS_OUTPUT_PATH = "gs://path-to-save-the-model"
best_model.save(GCS_OUTPUT_PATH)

## Clean Up
<a name="section-15"></a>


In [None]:
# After the model has been trained and saved, turn off or delete the Dataproc
# cluster to avoid unnecessary charges.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# Set dataset_id to the ID of the dataset to delete, in the form 'your-project.your_dataset'.
# This assumes PROJECT_ID and DATASET are still defined from the earlier cells.
dataset_id = "{}.{}".format(PROJECT_ID, DATASET)

# Use the delete_contents parameter to delete a dataset and its contents.
# Use the not_found_ok parameter to not receive an error if the dataset has already been deleted.
client.delete_dataset(
    dataset_id, delete_contents=True, not_found_ok=True
)  # Make an API request.

print("Deleted dataset '{}'.".format(dataset_id))