Matrix factorization is a widely used technique for building recommendation systems, especially when the user-item interaction data is sparse. One popular approach is Alternating Least Squares (ALS), which scales to large datasets by approximating the user-item interaction matrix as the product of two low-rank matrices: one holding latent user preferences and the other latent item attributes.

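ALS is well-suited for the MovieLens 100K dataset because:
1) Scalability: With one factor matrix held fixed, each user's (or item's) factors are the solution of an independent regularized least-squares problem, so the updates parallelize well and remain efficient on much larger datasets.
2) Sparse Data Handling: The dataset is sparse (100,000 ratings from 943 users on 1,682 movies, so only about 6.3% of user-item pairs are observed), and ALS fits the factors using only the observed ratings.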
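
To make the alternating updates concrete, here is a minimal NumPy sketch of ALS on a tiny dense matrix with a mask marking the observed ratings. It is an illustration only, not Spark's implementation: the function `als_factorize`, its parameters, and the toy ratings are all hypothetical, and the regularization is a plain ridge penalty.

```python
import numpy as np

def als_factorize(ratings, mask, rank=2, reg=0.1, n_iters=20, seed=0):
    """Approximate `ratings` (users x items) as user_factors @ item_factors.T,
    fitting only the entries where `mask` is True. Illustrative sketch."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    user_factors = rng.normal(scale=0.1, size=(n_users, rank))
    item_factors = rng.normal(scale=0.1, size=(n_items, rank))
    identity = np.eye(rank)

    for _ in range(n_iters):
        # Hold item factors fixed: each user's vector is the solution of
        # an independent ridge-regression problem over that user's ratings.
        for u in range(n_users):
            observed = mask[u]
            Y = item_factors[observed]
            A = Y.T @ Y + reg * identity
            b = Y.T @ ratings[u, observed]
            user_factors[u] = np.linalg.solve(A, b)
        # Hold user factors fixed: solve the same kind of problem per item.
        for i in range(n_items):
            observed = mask[:, i]
            X = user_factors[observed]
            A = X.T @ X + reg * identity
            b = X.T @ ratings[observed, i]
            item_factors[i] = np.linalg.solve(A, b)
    return user_factors, item_factors

# Toy 4-user x 4-item matrix; 0.0 marks a missing rating (hypothetical data).
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])
mask = ratings > 0

user_factors, item_factors = als_factorize(ratings, mask)
predicted = user_factors @ item_factors.T
observed_rmse = np.sqrt(np.mean((ratings[mask] - predicted[mask]) ** 2))
print(f"RMSE on observed entries: {observed_rmse:.3f}")
print("Predicted scores for the missing entries:", predicted[~mask].round(2))
```

The two inner loops are independent across users and across items, which is why the updates can be distributed; the missing entries never enter the least-squares problems, only the observed ones do.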

In [1]:
!pip install pyspark
!pip install pandas
import pandas as pd
global_train_df = pd.read_csv("../MATHM0029_2024_TB-1/global_train_df.csv")
global_test_df = pd.read_csv("../MATHM0029_2024_TB-1/global_test_df.csv")

We use an 80-20 train-test split, where the test data consists of the most recent 20% of ratings for each user.
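
A per-user chronological split of this kind can be sketched with pandas as follows. This is a sketch under assumptions, not the procedure used to produce the provided CSV files: it assumes a `Timestamp` column (not shown in this notebook) alongside `User ID`, `Item ID`, and `Rating`, and the helper `split_latest_per_user` and the toy data are hypothetical.

```python
import pandas as pd

def split_latest_per_user(ratings_df, test_fraction=0.2,
                          user_col="User ID", time_col="Timestamp"):
    """Hold out each user's most recent ratings as test data (sketch)."""
    # Order every user's ratings from oldest to newest.
    ordered = ratings_df.sort_values([user_col, time_col])
    by_user = ordered.groupby(user_col)
    position = by_user.cumcount()                    # 0-based position in time
    n_ratings = by_user[user_col].transform("size")  # ratings per user
    n_test = (n_ratings * test_fraction).round().astype(int)
    # The last n_test ratings of each user go to the test set.
    is_test = position >= (n_ratings - n_test)
    return ordered[~is_test], ordered[is_test]

# Hypothetical toy data: two users with five ratings each.
toy = pd.DataFrame({
    "User ID":   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Item ID":   [10, 11, 12, 13, 14, 10, 12, 15, 16, 17],
    "Rating":    [4, 3, 5, 2, 4, 5, 1, 3, 4, 2],
    "Timestamp": [5, 1, 3, 2, 4, 50, 10, 30, 20, 40],
})

train_df, test_df = split_latest_per_user(toy)
print(test_df)
```

Splitting by time within each user, rather than at random, means the model is always evaluated on ratings made after the ones it was trained on.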

# Standard Model

User information (e.g., age, occupation, gender) and item information (e.g., genre) are dropped from the following model for three reasons:
1) ALS is a collaborative filtering method. Collaborative filtering relies only on user-item interaction data (user IDs, item IDs, and ratings) to make predictions. The assumption is that similar users (those with similar rating histories) and similar items (those rated similarly by the same users) can be identified purely from the interaction matrix, so demographic or item-specific data is not essential in this paradigm.
2) ALS is designed to handle sparse user-item interaction matrices efficiently. Incorporating additional features such as user demographics or item metadata would mean extending the factorization itself, which significantly increases computational complexity.
3) The standard ALS implementation does not natively support side information such as user or item metadata. Hybrid approaches exist (e.g., feature-enhanced matrix factorization or deep learning models), but they require additional preprocessing, modeling, and computational resources.


In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import col, collect_list
from pyspark.ml.evaluation import RegressionEvaluator

# Initialize Spark session
spark = SparkSession.builder.appName("RecommenderSystem").getOrCreate()

# Prepare the data
columns = ["User ID", "Item ID", "Rating"]
global_train_spark_df = spark.createDataFrame(global_train_df[columns])
global_test_spark_df = spark.createDataFrame(global_test_df[columns])

# ALS Model Setup
als = ALS(
    userCol="User ID",
    itemCol="Item ID",
    ratingCol="Rating",
    maxIter=10,
    regParam=0.1,
    rank=10,
    coldStartStrategy="drop"  # Drop rows for users/items unseen in training (avoids NaN predictions)
)

# Train the model
als_model = als.fit(global_train_spark_df)

# Generate Recommendations for All Users
user_recommendations = als_model.recommendForAllUsers(10)  # Top 10 recommendations per user

# Evaluate the model with MSE on the test set
predictions = als_model.transform(global_test_spark_df)
evaluator = RegressionEvaluator(
    metricName="mse",
    labelCol="Rating",
    predictionCol="prediction"
)
mse = evaluator.evaluate(predictions)
print(f"MSE: {mse}")