## ALS

Alternating Least Squares

https://medium.com/analytics-vidhya/model-based-recommendation-system-with-matrix-factorization-als-model-and-the-math-behind-fdce8b2ffe6d

In [None]:
# Install the required packages

# Java must be installed locally; Spark requires it.
# Java: https://www.java.com/en/download/
# Spark: https://spark.apache.org/downloads.html

! pip install pyspark  
! pip install findspark

! pip install pandas
! pip install numpy
! pip install python-dotenv

! pip install azure-cosmos
! pip install azure-core
! pip install aiohttp

In [1]:
# Import the required packages
import pandas as pd
import numpy as np

import findspark
findspark.init()

import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

from dotenv import dotenv_values
from uuid import uuid4 as GUID

from azure.cosmos.aio import CosmosClient
from azure.cosmos import PartitionKey, exceptions
from azure.cosmos import ThroughputProperties

In [2]:
env_name = "myconfig.env" 
config = dotenv_values(env_name)

# Cosmos Client
cosmos_endpoint = config['cosmos_endpoint']
cosmos_key = config['cosmos_key']
database_name = config['cosmos_database']
actual_ratings_name = config['cosmos_actual_ratings']
predicted_ratings_name = config['cosmos_predicted_ratings']
product_catalog_name = config['cosmos_product_catalog']

cosmos_client = CosmosClient(cosmos_endpoint, cosmos_key)

database = cosmos_client.get_database_client(database_name)
predicted_ratings_container = database.get_container_client(predicted_ratings_name)


In [3]:
# Only execute this cell if you need to recreate the predicted ratings collection (uncomment the delete_container line)
async def recreate_predicted_ratings_collection():
    
    # Database
    database = await cosmos_client.create_database_if_not_exists(id=database_name)

    # Commented out to protect against accidental deletion.
    # The async client's delete_container is a coroutine, so it must be awaited.
    # await database.delete_container(predicted_ratings_name)

    # Ratings Data Collections
    await database.create_container_if_not_exists(
        id=predicted_ratings_name, 
        partition_key=PartitionKey(path="/UserId"),
        offer_throughput=ThroughputProperties(auto_scale_max_throughput=4000))
    
await recreate_predicted_ratings_collection()

The data we start with here is a set of ratings that have been generated by users for products they have previously purchased.

In this cell we start a local Spark session and read the user-item ratings from `./data/ratings/actualRatings.json` into a Pandas DataFrame, keeping only the relevant columns (UserId, ProductId, Rating). This DataFrame is then converted into a Spark DataFrame, which is required to run the ALS model.

In [4]:

spark = SparkSession.builder \
    .appName("Retail-Product-Predictions") \
    .config("spark.master", "local") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.cores", "4") \
    .config("spark.driver.cores", "4") \
    .getOrCreate()

# Load the data from the json file
full_pd_data = pd.read_json("./data/ratings/actualRatings.json")

# just keep the required columns
pd_data = full_pd_data[['UserId', 'ProductId', 'Rating']]

# convert the data to spark dataframe, required to run the ALS model
data = spark.createDataFrame(pd_data)

# count the number of rows in the data
data.count()

200000

We next need to split the dataset into training and testing sets with an 80-20 split, ensuring that models are trained on a majority of the data while having a separate subset for evaluation. We then cache both to improve performance by storing them in memory.

In [5]:
# Split to create train (80%) and test (20%) datasets
train, test = data.randomSplit([0.8, 0.2], seed=10001)

#cache the train and test datasets
train.cache()
test.cache()

DataFrame[UserId: bigint, ProductId: bigint, Rating: double]
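
`randomSplit` assigns each row to a split at random according to the weights, so the resulting sizes are only approximately 80% and 20%. The seed (10001 above) is there to make the assignment repeatable for the same input. The pure-Python sketch below illustrates the idea on a plain list; the helper name `random_split` and the sample rows are made up for illustration, and this is not Spark's implementation.

```python
import random

def random_split(rows, train_fraction=0.8, seed=10001):
    """Assign each row to train or test independently (illustrative helper)."""
    rng = random.Random(seed)
    train_part, test_part = [], []
    for row in rows:
        # Each row lands in the training split with probability train_fraction
        (train_part if rng.random() < train_fraction else test_part).append(row)
    return train_part, test_part

# 1,000 made-up (UserId, ProductId) pairs
rows = [(user_id, product_id) for user_id in range(100) for product_id in range(10)]
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))  # close to 800 and 200, but not exactly
```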

We now need to set up the ALS model, a collaborative filtering technique that uses the ratings users have given to products they purchased to predict, for each user, the rating they might give to products they have not rated yet.
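
To make the "alternating" part concrete, here is a minimal pure-Python sketch with a single latent factor per user and per product. It holds the product factors fixed and solves a small regularized least-squares problem for each user, then swaps roles, and repeats. The tiny ratings dictionary, the function name `als_rank1`, and the plain `reg` penalty are illustrative assumptions; Spark's ALS uses `rank` factors per user and product, scales the regularization differently, and runs a distributed solver.

```python
def als_rank1(ratings, n_iters=20, reg=0.1):
    """Fit one latent factor per user and per item by alternating least squares.

    ratings: dict mapping (user, item) -> observed rating.
    """
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}

    for _ in range(n_iters):
        # Fix item factors. Each user's factor x_u then has a closed form:
        # minimize sum_i (r_ui - x_u * y_i)^2 + reg * x_u^2
        for u in users:
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = sum(item_f[i] ** 2 for (uu, i), _ in ratings.items() if uu == u)
            user_f[u] = num / (den + reg)
        # Fix user factors and solve for each item's factor the same way
        for i in items:
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = sum(user_f[u] ** 2 for (u, ii), _ in ratings.items() if ii == i)
            item_f[i] = num / (den + reg)
    return user_f, item_f

# Made-up ratings: (UserId, ProductId) -> Rating
toy_ratings = {(1, 10): 5.0, (1, 20): 3.0, (2, 10): 4.0,
               (2, 30): 1.0, (3, 20): 2.0, (3, 30): 1.0}
toy_user_f, toy_item_f = als_rank1(toy_ratings)

# Predicted rating for a pair that was never observed: user 1, product 30
print(toy_user_f[1] * toy_item_f[30])
```

Each update exactly minimizes the regularized squared error over one block of factors while the other block is held fixed, so the objective never increases from one step to the next.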

Hyperparameter tuning is performed through a grid search over a defined parameter space, combined with cross-validation to check that the model generalizes. The CrossValidator in PySpark automates this process, evaluating each candidate with the RMSE (Root Mean Square Error) metric. RMSE is the square root of the mean squared difference between predicted and actual values, so it can be read as a typical prediction error in rating units, with large errors weighted more heavily than small ones.
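
As a worked illustration of the metric, RMSE can be computed by hand for a few ratings. The values below are made up for the example and are not taken from the dataset; `rmse_by_hand` is a hypothetical helper name.

```python
import math

def rmse_by_hand(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

example_actual = [5.0, 3.0, 4.0, 1.0]
example_predicted = [4.5, 3.5, 4.0, 2.0]

# Squared errors are 0.25, 0.25, 0.0, 1.0; their mean is 0.375
print(rmse_by_hand(example_actual, example_predicted))  # about 0.612
```

The mean absolute error for the same values is 0.5; RMSE is higher because the single error of 1.0 is squared before averaging.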

After training, the best model is selected based on its performance, and its hyperparameters are printed for inspection. The model is then used to make predictions on the test set, and the RMSE of these predictions is calculated to assess the model's accuracy.

In [6]:
# we use the cross validator to tune the hyperparameters
als = ALS(
         userCol="UserId", 
         itemCol="ProductId",
         ratingCol="Rating", 
         coldStartStrategy="drop"
)

param_grid = ParamGridBuilder() \
            .addGrid(als.rank, [10, 100]) \
            .addGrid(als.regParam, [.1]) \
            .addGrid(als.maxIter, [10]) \
            .build()

evaluator = RegressionEvaluator(
           metricName="rmse", 
           labelCol="Rating", 
           predictionCol="prediction")

cv = CrossValidator(
    estimator=als,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=6)

In [7]:
# train the model itself. This takes 1-2 minutes
model = cv.fit(train)

# return the best model from those that were trained above
best_model = model.bestModel
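
The selected hyperparameters can be recovered by pairing each parameter map in the grid with its cross-validated average metric. The helper below is a sketch: `best_param_map` is a hypothetical name, and it assumes a lower metric is better, as with RMSE. With the fitted `model` from the cell above it would be called as `best_param_map(model.getEstimatorParamMaps(), model.avgMetrics)`; the stand-in inputs here only demonstrate the shape of the data, and the metric values are invented.

```python
def best_param_map(param_maps, avg_metrics):
    """Return (params, metric) for the grid entry with the lowest average metric."""
    best_index = min(range(len(avg_metrics)), key=lambda k: avg_metrics[k])
    # Spark Param objects expose .name; plain string keys are passed through unchanged
    params = {getattr(p, "name", p): v for p, v in param_maps[best_index].items()}
    return params, avg_metrics[best_index]

# Stand-in inputs shaped like the two-entry grid defined earlier (metrics are made up)
example_maps = [
    {"rank": 10, "regParam": 0.1, "maxIter": 10},
    {"rank": 100, "regParam": 0.1, "maxIter": 10},
]
example_metrics = [0.41, 0.36]

print(best_param_map(example_maps, example_metrics))
```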

In [8]:
# now take the held-out test data and calculate how well the model predicts the ratings.
prediction = best_model.transform(test)
rmse = evaluator.evaluate(prediction)
print(f'RMSE = {rmse}. This is the average difference between the actual and predicted ratings. Lower values are better.')

RMSE = 0.35502647060722764. This is the average difference between the actual and predicted ratings. Lower values are better.


Gather the latent factor vectors for one user and one product (the first row of each factor table).

In [None]:
# latent factor vector describing which user attributes matter when predicting a rating (first user row only)
user_v = best_model.userFactors.collect()[0].features

# latent factor vector describing which product attributes matter when predicting a rating (first product row only)
item_v = best_model.itemFactors.collect()[0].features

print('User Vector: ' + str(user_v))
print('Product Vector: ' + str(item_v))

The **dot product** of a user vector and an item vector is the model's predicted rating for that user-item pair. It is computed here as an example of how a rating is predicted from latent features.

In [10]:
np.dot((user_v),(item_v))

5.048279562412358
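
The same calculation written out by hand, using short made-up vectors (real factor vectors have `rank` entries): matching entries are multiplied and the products are summed. The names `dot`, `example_user`, and `example_item` are illustrative only.

```python
def dot(u, v):
    """Dot product of two equal-length vectors (what np.dot returns for 1-D inputs)."""
    return sum(a * b for a, b in zip(u, v))

example_user = [1.0, 0.5, 2.0]  # hypothetical user factors
example_item = [2.0, 1.0, 0.5]  # hypothetical product factors

print(dot(example_user, example_item))  # 1.0*2.0 + 0.5*1.0 + 2.0*0.5 = 3.5
```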

Instead of calculating the dot product each time we need a prediction, we precompute the recommendations for all users across all products so that our ecommerce app stays as fast as possible. We then save them in Azure Cosmos DB for NoSQL.

In [11]:
async def generated_predicted_ratings():
    
    final_predictions = []
    
    # Use the best model from the tuning process to generate recommendations for all users.
    # The argument is the maximum number of products returned per user.
    predicted_recommendations = best_model.recommendForAllUsers(10001)

    # Build one JSON document per user holding that user's predicted rating for each product
    for user in predicted_recommendations.collect():
        final_predictions.append({
            'id': str(user.UserId),
            'UserId': str(user.UserId),
            'Predictions': [
                {"ProductId": str(rec.ProductId), "Rating": rec.rating}
                for rec in user.recommendations
            ],
        })

    # Insert the documents into the predicted ratings container
    for item in final_predictions:
        await predicted_ratings_container.create_item(item)

    print(f"Number of predicted ratings added: {len(final_predictions)}")

    # Display the predicted recommendations
    predicted_recommendations.show()

await generated_predicted_ratings()


Number of predicted ratings added: 2080
+------+--------------------+
|UserId|     recommendations|
+------+--------------------+
|     1|[{42, 9.012838}, ...|
|     2|[{72, 7.738515}, ...|
|     3|[{42, 8.855551}, ...|
|     4|[{72, 7.276962}, ...|
|     5|[{42, 7.388352}, ...|
|     6|[{42, 8.582646}, ...|
|     7|[{72, 7.672081}, ...|
|     8|[{42, 8.615616}, ...|
|     9|[{72, 7.3360405},...|
|    10|[{42, 8.526747}, ...|
|    11|[{72, 8.249162}, ...|
|    12|[{42, 8.783088}, ...|
|    13|[{42, 7.350541}, ...|
|    14|[{72, 7.131305}, ...|
|    15|[{72, 8.900069}, ...|
|    16|[{72, 8.677717}, ...|
|    17|[{42, 8.914647}, ...|
|    18|[{42, 8.335895}, ...|
|    19|[{42, 8.984622}, ...|
|    20|[{42, 7.158357}, ...|
+------+--------------------+
only showing top 20 rows
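
Once the documents are stored, the app only needs to read one document per user and pick the highest-rated entries. A minimal sketch of that last step, assuming the document shape built in the cell above; `top_n` is a hypothetical helper and the sample values are invented.

```python
def top_n(user_document, n=3):
    """Return the n ProductIds with the highest predicted rating."""
    ranked = sorted(user_document["Predictions"], key=lambda p: p["Rating"], reverse=True)
    return [p["ProductId"] for p in ranked[:n]]

# Invented document in the same shape as the items written to the container
sample_document = {
    "id": "1",
    "UserId": "1",
    "Predictions": [
        {"ProductId": "72", "Rating": 7.7},
        {"ProductId": "42", "Rating": 9.0},
        {"ProductId": "15", "Rating": 6.2},
    ],
}

print(top_n(sample_document, n=2))  # ['42', '72']
```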

