## ALS

Alternating Least Squares

https://medium.com/analytics-vidhya/model-based-recommendation-system-with-matrix-factorization-als-model-and-the-math-behind-fdce8b2ffe6d

In [None]:
# Install the required packages
# be sure you have Java installed on your machine, https://www.java.com/en/download/
! pip install pyspark  
! pip install findspark

! pip install pandas

! pip install numpy
! pip install pymongo
! pip install python-dotenv

In [1]:
# Import the required packages
import pandas as pd
import numpy as np

import findspark
findspark.init()

import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

from dotenv import dotenv_values
import pymongo

In [2]:
# Specify the name of the .env file
env_name = "myconfig.env"  # copy the example.env template and change this to your own .env file name
config = dotenv_values(env_name)

# Connection string
cosmos_conn = config['cosmos_mongo_connection_string']
cosmos_client = pymongo.MongoClient(cosmos_conn)

# Database name
database = cosmos_client[config['cosmos_database']]

# Collection to put the predicted ratings
predicted_ratings = database[config['cosmos_predicted_ratings']]

# delete all documents in the collection to clear it out
#predicted_ratings.delete_many({})

The data we start with here is a set of ratings that have been generated by users for products they have previously purchased.

In this cell we read the dataset of user-item ratings from the file AugmentedRating.json into a Pandas DataFrame, keeping only the relevant columns (UserId, ProductId, Rating). This DataFrame is then converted into a Spark DataFrame, which is required to run the ALS model.

In [3]:
# Create a Spark session
conf = SparkConf()
conf.set("spark.executor.memory","6g")
conf.set("spark.driver.memory", "6g")
conf.set("spark.driver.cores", "8")
sc = SparkContext.getOrCreate(conf)
spark = SparkSession.builder.getOrCreate()

# Load the data from the json file
full_pd_data = pd.read_json("./data/ratings/AugmentedRating.json")
# just keep the required columns
pd_data = full_pd_data[['UserId', 'ProductId', 'Rating']]
# convert the data to spark dataframe, required to run the ALS model
data = spark.createDataFrame(pd_data)
# count the number of rows in the data
data.count()


200000

We next need to split the dataset into training and testing sets with an 80-20 split, ensuring that models are trained on a majority of the data while having a separate subset for evaluation. We then cache both to improve performance by storing them in memory.
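To illustrate what a seeded random split does, here is a minimal pure-Python sketch. It is not how Spark implements `randomSplit`; the helper `seeded_split` and the toy rows are hypothetical. In this sketch each row is assigned independently, so the resulting sizes are only approximately 80% and 20%, and reusing the same seed reproduces the same split.

```python
import random

def seeded_split(rows, train_fraction=0.8, seed=10001):
    """Assign each row to train or test independently, using a fixed seed."""
    rng = random.Random(seed)
    train_part, test_part = [], []
    for row in rows:
        # Each row lands in train with probability train_fraction.
        (train_part if rng.random() < train_fraction else test_part).append(row)
    return train_part, test_part

# Toy (UserId, ProductId, Rating) rows, made up for illustration.
rows = [(u, p, float((u + p) % 5 + 1)) for u in range(100) for p in range(10)]

train_rows, test_rows = seeded_split(rows)
print(len(train_rows), len(test_rows))  # close to an 80/20 split, but not exact
```

Fixing the seed matters when comparing models: every run is then evaluated on the same held-out rows.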

In [4]:
# Split to create train (80%) and test (20%) datasets
train, test = data.randomSplit([0.8,0.2],10001)

#cache the train and test datasets
train.cache()
test.cache()

DataFrame[UserId: bigint, ProductId: bigint, Rating: double]

We now need to set up the ALS model, a collaborative filtering technique that uses the ratings users have given to products they purchased to predict the rating each user might give to any product, including ones they have not rated yet.
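The "alternating" part of ALS can be sketched in a few lines of plain Python. This is a simplified illustration, not Spark's implementation: it uses a single latent factor per user and per item (rank 1), made-up ratings, and hypothetical helper names (`loss`, `solve`). With the item factors held fixed, each user factor has a closed-form least squares solution, and vice versa, so the algorithm alternates between the two.

```python
# Observed ratings as {(user, item): rating}; toy data, made up for illustration.
ratings = {
    (0, 0): 5.0, (0, 1): 4.0,
    (1, 0): 4.0, (1, 2): 2.0,
    (2, 1): 2.0, (2, 2): 1.0,
}
users = sorted({u for u, _ in ratings})
items = sorted({i for _, i in ratings})
reg = 0.1  # regularization strength (plays the role of regParam)

def loss(user_f, item_f):
    """Regularized squared error over the observed ratings."""
    err = sum((r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items())
    penalty = reg * (sum(x * x for x in user_f.values())
                     + sum(y * y for y in item_f.values()))
    return err + penalty

def solve(fixed, own_ids, own_is_user):
    """Closed-form ridge update for one side while the other side is held fixed."""
    updated = {}
    for k in own_ids:
        num = den = 0.0
        for (u, i), r in ratings.items():
            own, other = (u, i) if own_is_user else (i, u)
            if own == k:
                num += r * fixed[other]
                den += fixed[other] ** 2
        updated[k] = num / (den + reg)
    return updated

# Rank 1: a single latent factor per user and per item.
user_f = {u: 1.0 for u in users}
item_f = {i: 1.0 for i in items}

history = [loss(user_f, item_f)]
for _ in range(10):  # plays the role of maxIter
    user_f = solve(item_f, users, own_is_user=True)   # fix items, solve users
    item_f = solve(user_f, items, own_is_user=False)  # fix users, solve items
    history.append(loss(user_f, item_f))

print(history[0], history[-1])  # the loss shrinks as the two steps alternate
```

A predicted rating is then the product of a user factor and an item factor; with a higher rank it becomes a dot product of two vectors.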

Hyperparameter tuning is performed through a grid search over a defined parameter space, combined with cross-validation to check that the model generalizes. The CrossValidator in PySpark automates this process, evaluating each candidate with the RMSE (Root Mean Square Error) metric. RMSE is the square root of the average squared difference between predicted and actual values, so larger errors are penalized more heavily.
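As a rough sketch of what the grid search enumerates, the snippet below expands the grid values used in this notebook into one configuration per combination and counts how many models are trained during a 3-fold search. It uses plain Python only; the names `grid`, `param_maps`, and `fits_during_search` are illustrative and not part of the PySpark API.

```python
from itertools import product

# The grid values used in this notebook: two ranks, one regParam, one maxIter.
grid = {"rank": [10, 100], "regParam": [0.1], "maxIter": [10]}
num_folds = 3

# Every combination of values, one dict per candidate configuration.
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]

# Each candidate is trained once per fold.
fits_during_search = len(param_maps) * num_folds

for params in param_maps:
    print(params)
print(fits_during_search)  # 2 candidates x 3 folds = 6
```

Adding more values to any parameter multiplies the number of fits, which is why the grid here is kept small.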

After training, the best model is selected based on its performance, and its hyperparameters are printed for inspection. The model is then used to make predictions on the test set, and the RMSE of these predictions is calculated to assess the model's accuracy.
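To make the RMSE metric concrete, here is a small standalone sketch with made-up ratings. The function names are illustrative; in the notebook itself the metric is computed by `RegressionEvaluator`. The mean absolute error is shown next to it to highlight that RMSE weights large errors more heavily.

```python
import math

def root_mean_squared_error(actual, predicted):
    """Square root of the mean of the squared differences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences, shown for comparison."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy ratings, made up for illustration: the errors are 1, 0, 0 and 2.
actual_ratings = [5.0, 3.0, 4.0, 1.0]
predicted_ratings_demo = [4.0, 3.0, 4.0, 3.0]

rmse_demo = root_mean_squared_error(actual_ratings, predicted_ratings_demo)
mae_demo = mean_absolute_error(actual_ratings, predicted_ratings_demo)
print(rmse_demo, mae_demo)  # RMSE is sqrt(1.25), MAE is 0.75
```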

In [5]:
# we use the cross validator to tune the hyperparameters
als = ALS(
         userCol="UserId", 
         itemCol="ProductId",
         ratingCol="Rating", 
         coldStartStrategy="drop"
)

param_grid = ParamGridBuilder() \
            .addGrid(als.rank, [10, 100]) \
            .addGrid(als.regParam, [.1]) \
            .addGrid(als.maxIter, [10]) \
            .build()

evaluator = RegressionEvaluator(
           metricName="rmse", 
           labelCol="Rating", 
           predictionCol="prediction")

cv = CrossValidator(
           estimator=als,
           estimatorParamMaps=param_grid,
           evaluator=evaluator,
           numFolds=3,
           parallelism=6)

In [6]:
# train the model itself. This takes 1-2 minutes
model = cv.fit(train)

# return the best model from those that were trained above
best_model = model.bestModel

In [7]:
# now take the test data and calculate how well the best model predicts its ratings.
prediction = best_model.transform(test)
rmse = evaluator.evaluate(prediction)
print(f'RMSE = {rmse}. This is the average difference between the actual and predicted ratings. Lower values are better.')

RMSE = 0.3472303397007194. This is the average difference between the actual and predicted ratings. Lower values are better.


Gather the latent factor vectors for the first user and the first product returned by the model.

In [8]:
# vector to describe what user attributes are important in making a prediction for a rating
user_v = best_model.userFactors.collect()[0].features

# vector to describe what product attributes are important in making a prediction for a rating
item_v = best_model.itemFactors.collect()[0].features

print('User Vector: ' + str(user_v))
print('Product Vector: ' + str(item_v))

User Vector: [0.045703258365392685, 0.716562807559967, -0.008130370639264584, 0.3819727599620819, -0.015725957229733467, -0.2852918803691864, -0.2048124372959137, 0.3428661525249481, -0.31104201078414917, 0.010159505531191826, 0.046196043491363525, -0.4496069550514221, -0.04833582416176796, -0.40178343653678894, -0.01445831824094057, 0.1220896914601326, 0.10066267848014832, -0.08861759305000305, 0.20473569631576538, 0.230158731341362, -0.1505148708820343, -0.15549343824386597, 0.09260598570108414, 0.18946217000484467, -0.2803293466567993, -0.3203554153442383, 0.15769560635089874, -0.7656079530715942, -0.05069020763039589, -0.3472103476524353, -0.4141215980052948, 0.0756869912147522, -0.5032243728637695, 0.0721718817949295, -0.10517171025276184, -0.464933305978775, 0.517517626285553, 0.2943427860736847, 0.1725325584411621, 0.47285524010658264, 0.1917082667350769, 0.046788834035396576, -0.4201229512691498, 0.2171483188867569, -0.05244824290275574, -0.5383483171463013, -0.1663093715906143

The **dot product** of a user vector and an item vector gives the predicted rating for that user and item, based on their latent features. Here it is computed for the two vectors above as an example.

In [9]:
np.dot((user_v),(item_v))

5.058172606116622

Instead of calculating the dot product each time we need a prediction, we first generate the recommendations for all users and all products, then save them in Azure Cosmos DB for MongoDB. This keeps our ecommerce app as fast as possible.
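The idea of precomputing can be sketched without Spark. The snippet below scores every product for every user with a dot product, keeps the top N per user, and builds documents with the same field names this notebook stores (UserId, Predictions, ProductId, rating). The factor values and the helper `top_n_for_all_users` are made up for illustration; the notebook relies on the model's `recommendForAllUsers` instead.

```python
# Toy latent factors (rank 2), made up for illustration.
user_factors = {1: [1.0, 0.5], 2: [0.2, 1.0]}
item_factors = {10: [4.0, 1.0], 11: [1.0, 4.0], 12: [2.0, 3.0]}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_n_for_all_users(user_factors, item_factors, n):
    """Build one document per user with that user's n highest predicted ratings."""
    docs = []
    for user_id, u_vec in user_factors.items():
        scored = [
            {"ProductId": item_id, "rating": dot(u_vec, i_vec)}
            for item_id, i_vec in item_factors.items()
        ]
        # Highest predicted rating first.
        scored.sort(key=lambda rec: rec["rating"], reverse=True)
        docs.append({"UserId": user_id, "Predictions": scored[:n]})
    return docs

docs = top_n_for_all_users(user_factors, item_factors, n=2)
for doc in docs:
    print(doc["UserId"], [rec["ProductId"] for rec in doc["Predictions"]])
```

At request time the app then only needs a single lookup by UserId rather than a scoring pass over every product.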

In [None]:
# Use the best model from the tuning process to generate recommendations for all users.
# The argument is the maximum number of products to recommend per user.
val_recommendations = best_model.recommendForAllUsers(10001)
rec_sys_final_predictions = []

# Build one document per user holding that user's predicted ratings
for user in val_recommendations.collect():
    to_insert = {
        'UserId': user.UserId,
        'Predictions': [
            {"ProductId": rec.ProductId, "rating": rec.rating}
            for rec in user.recommendations
        ],
    }
    rec_sys_final_predictions.append(to_insert)

# Comment out the next line if you are just rerunning the model to test or view results
predicted_ratings.insert_many(rec_sys_final_predictions)

In [None]:
# print the first rows of the recommendations
val_recommendations.show()