# Matrix Factorization

Matrix Factorization is a powerful technique for building recommendation systems, especially when working with sparse user-item interaction data. One popular approach to matrix factorization is Alternating Least Squares (ALS), which efficiently handles large-scale datasets by decomposing a user-item interaction matrix into two low-rank matrices: one representing user preferences and the other item attributes.
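To make the alternating updates concrete, here is a minimal NumPy sketch of ALS on a tiny explicit-rating matrix. It is an illustration only, not the Spark implementation used later in this notebook: the toy matrix, the `als_factorize` helper, and the hyperparameter values are all assumptions chosen for readability.

```python
import numpy as np

def als_factorize(ratings, mask, rank=2, reg=0.1, n_iters=50, seed=0):
    """Factor `ratings` (users x items) into U @ V.T using observed entries only.

    `mask` is a boolean array marking which ratings are observed.
    Hypothetical helper for illustration; Spark's ALS differs in details
    such as how the regularization term is scaled.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)

    for _ in range(n_iters):
        # Hold item factors fixed and solve a small ridge regression per user.
        for u in range(n_users):
            obs = mask[u]
            if obs.any():
                Vo = V[obs]
                U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ ratings[u, obs])
        # Hold user factors fixed and solve a small ridge regression per item.
        for i in range(n_items):
            obs = mask[:, i]
            if obs.any():
                Uo = U[obs]
                V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ ratings[obs, i])
    return U, V

# Toy data (assumed, not the project dataset); 0 marks a missing rating.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
observed = R > 0
U, V = als_factorize(R, observed)
train_mse = float(np.mean((R - U @ V.T)[observed] ** 2))
print(f"training MSE on observed entries: {train_mse:.4f}")
```

Each inner step is an ordinary least-squares problem with a closed-form solution, which is why the per-user and per-item updates can be computed independently and distributed.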

We explored both ALS and SVD for our project.
For our goals, we concluded that ALS is generally the better choice because:

1) It is optimized for sparse, explicit feedback datasets.
2) It scales well with larger datasets and distributed computation.
3) It includes regularization and implicit feedback handling, making it versatile for different recommendation scenarios.

However, if you have a specific need to explore latent factors in-depth, or you're working with dense data and non-distributed settings, SVD could also be a viable option.
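For comparison, the SVD route on a small dense matrix can be sketched as follows. This is a toy example under the assumption that every entry is filled in, since plain SVD has no notion of missing ratings; the matrix and the `truncated_svd` helper are made up for illustration.

```python
import numpy as np

# Small dense matrix (assumed toy data): every user has rated every item.
R = np.array([[5, 3, 4, 1],
              [4, 3, 4, 1],
              [1, 1, 2, 5],
              [2, 1, 5, 4]], dtype=float)

def truncated_svd(matrix, k):
    """Return the rank-k approximation built from the top k singular triplets."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

for k in (1, 2, 4):
    err = np.linalg.norm(R - truncated_svd(R, k))
    print(f"rank {k}: reconstruction error {err:.3f}")
```

The singular vectors give the latent factors directly, which is convenient for inspecting them, but the whole matrix must be dense and fit on one machine.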

In [2]:
!pip install pyspark
!pip install pandas
import pandas as pd
user_train_df = pd.read_csv('../Katherine W/dataSets/user_train_df.csv')
user_test_df = pd.read_csv('../Katherine W/dataSets/user_test_df.csv')


We use a user-based 80-20 train-test split: for each user, the earliest 80% of ratings (by timestamp) go into the training set and the latest 20% go into the test set.
We also explored a global 80-20 split, where the first 80% of all ratings are included in the training set and the remaining 20% are in the test set.

We chose the user-based 80-20 split because:
1) It ensures that each user has both training and test data, so the model can learn from and be evaluated on every individual.
2) It contains the impact of highly active users (those with many ratings), preventing them from dominating the training process and skewing the test results.

However, a major drawback of the user-based split is that, in systems with large user bases, it may require additional computation to manage each user's data individually, especially when training models like matrix factorization.
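The per-user chronological split described above can be sketched in plain Python. This is not the code that produced `user_train_df` and `user_test_df`; the `user_based_split` helper and the `(user, item, rating, timestamp)` tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

def user_based_split(ratings, train_frac=0.8):
    """Split (user, item, rating, timestamp) tuples per user by time.

    The earliest `train_frac` of each user's ratings go to train and the
    rest go to test. Hypothetical helper; the field order is an assumption.
    """
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, test = [], []
    for rows in by_user.values():
        rows.sort(key=lambda r: r[3])      # oldest rating first
        cut = int(len(rows) * train_frac)  # number of ratings kept for training
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test

# Toy example: user 1 has 5 ratings, user 2 has 10.
toy = [(1, i, 3.0, 100 + i) for i in range(5)] + \
      [(2, i, 4.0, 200 + i) for i in range(10)]
train, test = user_based_split(toy)
print(len(train), len(test))
```

Note that with this rounding a user with a single rating ends up only in the test set, so a real pipeline would need a rule for very short histories.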

## ALS Model

The decision to drop variables such as user information (e.g., age, occupation, gender) and item information (e.g., genre) in the following model can be explained by the following considerations:
1) The ALS (Alternating Least Squares) algorithm used in the model is inherently a collaborative filtering approach. Collaborative filtering relies only on user-item interaction data (e.g., user IDs, item IDs, and ratings) to make predictions. The assumption is that similar users (based on their historical preferences) and similar items (based on their popularity among users) can be identified purely from the interaction matrix. Including demographic or item-specific data is not essential in this paradigm.
2) ALS models are designed to handle sparse user-item interaction matrices efficiently. Adding additional features like user demographics or item metadata would require incorporating these into the matrix factorization process, significantly increasing computational complexity.
3) The ALS algorithm, in its standard implementation, does not natively support incorporating additional side information like user or item metadata. While hybrid approaches exist (e.g., feature-enhanced matrix factorization or deep learning models), they require additional preprocessing, modeling, and computational resources.

In [3]:
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import col, collect_list, explode
from pyspark.ml.evaluation import RegressionEvaluator

# Initialize Spark session
spark = SparkSession.builder.appName("RecommenderSystem").getOrCreate()

# Prepare the data
columns = ["User ID", "Item ID", "Rating"]
user_train_spark_df = spark.createDataFrame(user_train_df[columns])

# ALS Model Setup
als = ALS(
    userCol="User ID",
    itemCol="Item ID",
    ratingCol="Rating",
    maxIter=10,  # Number of ALS iterations
    regParam=0.1,  # Regularization strength, which helps prevent overfitting
    rank=10,  # Number of latent factors, which controls model complexity
    coldStartStrategy="drop"  # Drop rows with NaN predictions (users/items unseen in training)
)

# Train the model
als_model = als.fit(user_train_spark_df)


We have now trained our ALS model. Note that the hyperparameter values used above have already been optimised; the tuning method is shown in "Hyperparameter Tuning" in the Alan G folder.

Now we will generate a few recommendations to check that the model works.

In [4]:
# Extract the Item ID to Movie Title mapping
item_to_title_mapping = user_train_df[["Item ID", "Movie Title"]].drop_duplicates()

# Convert the mapping to a Spark DataFrame for integration with recommendations
item_to_title_spark_df = spark.createDataFrame(item_to_title_mapping)

# Generate recommendations for all users
user_recommendations = als_model.recommendForAllUsers(5)  # Top 5 recommendations per user

# Explode the recommendations and map Item IDs to Movie Titles
exploded_recommendations = user_recommendations.withColumn("recommendation", explode(col("recommendations")))
exploded_recommendations = exploded_recommendations.withColumn("Item ID", col("recommendation").getItem("Item ID")) \
                                                   .withColumn("Rating", col("recommendation").getItem("Rating")) \
                                                   .drop("recommendation")

# Join with the movie titles
recommendations_with_titles = exploded_recommendations.join(item_to_title_spark_df, on="Item ID", how="inner")

from pyspark.sql.functions import split, trim

# Split the Movie Title column to extract only the relevant part before the '|'
recommendations_cleaned = recommendations_with_titles.withColumn(
    "Movie Title", 
    trim(split(col("Movie Title"), r"\|").getItem(0))  # Get the first part before the '|'
)

# Display cleaned recommendations for 5 users
recommendations_cleaned.select("User ID", "Movie Title", "Rating") \
                       .orderBy(col("User ID").asc(), col("Rating").desc()) \
                       .show(25, truncate=False)  # Show 5 users, 5 movies each


                                                                                

+-------+--------------------------------------+---------+
|User ID|Movie Title                           |Rating   |
+-------+--------------------------------------+---------+
|1      |Pather Panchali (1955)                |5.1304927|
|1      |Boys, Les (1997)                      |5.099012 |
|1      |Angel Baby (1995)                     |4.9815707|
|1      |Faust (1994)                          |4.964393 |
|1      |Anna (1996)                           |4.8791194|
|2      |Mina Tannenbaum (1994)                |5.5116477|
|2      |Angel Baby (1995)                     |5.216749 |
|2      |L.A. Confidential (1997)              |5.071832 |
|2      |Whole Wide World, The (1996)          |4.9777217|
|2      |Spanish Prisoner, The (1997)          |4.8399215|
|3      |Santa with Muscles (1996)             |3.6494858|
|3      |Return of the Jedi (1983)             |3.5813563|
|3      |Boys, Les (1997)                      |3.5257242|
|3      |Anne Frank Remembered (1995)          |3.429989

The Rating values are the ALS model's predicted ratings for the recommended movies. They are used to rank recommendations, with higher scores indicating a stronger predicted preference. ALS does not bound its predictions, so values can fall outside the range of the observed ratings.
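Conceptually, each predicted score is the dot product of a user's latent vector with an item's latent vector, and the recommendations are the items with the highest scores. A minimal NumPy sketch of that ranking step follows; the factor values and the `top_n_items` helper are made up for illustration and are not taken from the trained model.

```python
import numpy as np

def top_n_items(user_vector, item_factors, n=5):
    """Return (item_index, score) pairs for the n highest predicted scores."""
    scores = item_factors @ user_vector    # one dot product per item
    best = np.argsort(scores)[::-1][:n]    # item indices by descending score
    return [(int(i), float(scores[i])) for i in best]

# Assumed toy factors: 4 items, rank 2.
item_factors = np.array([[2.0, 0.1],
                         [1.0, 0.5],
                         [0.0, 2.5],
                         [0.2, 2.0]])
user_vector = np.array([0.5, 2.0])
print(top_n_items(user_vector, item_factors, n=2))
```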

In [5]:
# Evaluate the model with MSE on test set
user_test_spark_df = spark.createDataFrame(user_test_df[columns])
predictions = als_model.transform(user_test_spark_df)
evaluator = RegressionEvaluator(
    metricName="mse",
    labelCol="Rating",
    predictionCol="prediction"
)
mse = evaluator.evaluate(predictions)
print(f"MSE: {mse}")

                                                                                

MSE: 0.48037015142478745


We chose MSE (Mean Squared Error) as the performance metric because:
1) MSE is a standard metric in regression and rating-prediction tasks, making it easier to compare models across studies and datasets.
2) It squares each error, so large errors are penalized more heavily than small ones, which highlights models that make large errors on certain predictions.
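The effect of squaring can be seen in a small worked example. The ratings below are made up; both sets of predictions have the same total absolute error (2.0), but MSE scores the one with a single large miss as four times worse.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between paired ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [4.0, 3.0, 5.0, 2.0]
spread_out = [4.5, 3.5, 4.5, 2.5]  # four errors of 0.5 each
one_big = [4.0, 3.0, 3.0, 2.0]     # a single error of 2.0

print(mean_squared_error(actual, spread_out))  # 0.25
print(mean_squared_error(actual, one_big))     # 1.0
```

Because MSE is in squared rating units, taking the square root can make it easier to interpret: the reported MSE of about 0.480 corresponds to an RMSE of roughly 0.69 rating points.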

We also computed the MSE for a global 80-20 train-test split in "Global-ALS" in the Alan G folder. The user-based split gave the lower MSE, which suggests that it might align better with individual preferences.

## Larger Scale Datasets

The ALS model in Spark can handle large-scale datasets, but as the dataset size increases by a factor of 1000, computational challenges (e.g., memory requirements, training time) and data sparsity issues become more pronounced. For example, in Spark, ALS requires shuffling during each iteration to update the latent factors. As the dataset grows, this process becomes more time-consuming and memory-intensive, which could create bottlenecks in the training process. The network traffic required for shuffling data between nodes increases, which can negatively impact training time.

To address these issues, optimizations like increasing cluster resources, tuning hyperparameters efficiently, and improving data partitioning will be essential. Increasing cluster resources means raising the number of executors, the memory per executor, and the number of cores per executor in the Spark cluster configuration.
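As a rough sketch of what that configuration could look like, the helper below collects the three executor settings as standard Spark properties and applies them when building a session. The numbers are placeholders rather than recommendations, and `executor_settings` and `build_session` are hypothetical helper names.

```python
def executor_settings(num_executors, cores_per_executor, memory_gb):
    """Build Spark properties for executor count, cores, and memory."""
    return {
        "spark.executor.instances": str(num_executors),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{memory_gb}g",
    }

def build_session(settings, app_name="RecommenderSystem"):
    """Create a SparkSession with the given properties (requires pyspark)."""
    # Imported lazily so the helper above can run without Spark installed.
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# Placeholder values (assumed); suitable numbers depend on the cluster.
settings = executor_settings(num_executors=8, cores_per_executor=4, memory_gb=16)
print(settings)
```

These executor properties only matter when running on a cluster manager rather than in local mode. For data partitioning, Spark's `ALS` estimator also accepts `numUserBlocks` and `numItemBlocks`, which control how the user and item factors are split across the cluster.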

Alternative approaches, such as matrix factorization trained with stochastic gradient descent (SGD), GPU-accelerated ALS, and deep learning methods, can be more suitable for extremely large datasets, depending on the specific requirements and infrastructure available. Each method comes with its own trade-offs in terms of computational complexity, memory efficiency, and training time. For example, GPU-accelerated ALS is an extension of the standard ALS algorithm in which the factorization is performed on GPUs instead of CPUs. This can significantly speed up training, especially for large datasets, by taking advantage of parallel computation. However, GPU-based solutions require specialized hardware and may involve additional complexity in setting up the environment.
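To illustrate how SGD-based training differs from ALS, here is a minimal pure-Python sketch that updates the factors one rating at a time instead of solving a least-squares problem per user or item. The toy ratings, the helper names, and the learning rate and regularization values are all assumptions chosen for illustration.

```python
import random

def sgd_factorize(ratings, n_users, n_items, rank=2, lr=0.02, reg=0.05,
                  n_epochs=200, seed=0):
    """Fit user and item factors to (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    data = list(ratings)
    for _ in range(n_epochs):
        rng.shuffle(data)  # visit the ratings in a new random order each epoch
        for u, i, r in data:
            pred = sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            err = r - pred
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the squared error plus an L2 penalty.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def factor_mse(ratings, P, Q):
    """Mean squared error of the factor model on the given triples."""
    errors = [(r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
              for u, i, r in ratings]
    return sum(errors) / len(errors)

# Toy (user, item, rating) triples, assumed for illustration.
toy = [(0, 0, 5), (0, 1, 3), (0, 3, 1), (1, 0, 4), (1, 3, 1), (2, 0, 1),
       (2, 1, 1), (2, 3, 5), (3, 1, 1), (3, 2, 5), (3, 3, 4)]
P0, Q0 = sgd_factorize(toy, n_users=4, n_items=4, n_epochs=0)
P, Q = sgd_factorize(toy, n_users=4, n_items=4)
print(f"MSE before training: {factor_mse(toy, P0, Q0):.3f}")
print(f"MSE after training:  {factor_mse(toy, P, Q):.3f}")
```

Each update touches only one user vector and one item vector, which keeps memory use low but makes the method sensitive to the learning rate.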
