# ALS

The Alternating Least Squares (ALS) model is a popular collaborative filtering technique used for building recommender systems. It is particularly effective for handling large-scale data and sparse user-item matrices. ALS works by factorizing the user-item interaction matrix into two lower-dimensional matrices: one for users and one for items.
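
As a rough sketch of what this factorization optimizes, the textbook formulation can be written as follows. The symbols are assumptions introduced here for illustration: $R$ is the user-item matrix, $\Omega$ is the set of observed pairs, $\mathbf{x}_u$ and $\mathbf{y}_i$ are the $k$-dimensional factor vectors of user $u$ and item $i$, and $\lambda$ is the regularization strength (the role played by `regParam` later in this notebook). Spark's implementation may differ in details such as how the regularization term is scaled.

$$
R \approx X Y^\top, \qquad
\min_{X,\,Y} \sum_{(u,i) \in \Omega} \left( r_{ui} - \mathbf{x}_u^\top \mathbf{y}_i \right)^2
+ \lambda \left( \sum_u \lVert \mathbf{x}_u \rVert^2 + \sum_i \lVert \mathbf{y}_i \rVert^2 \right)
$$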

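The key idea behind ALS is to alternate between optimizing the user matrix and the item matrix, fixing one while solving for the other with a least squares approach. The process repeats until convergence or until a fixed number of iterations is reached (`maxIter` in the code below). The predicted score for a user-item pair, including pairs never observed together, is the dot product of the corresponding user and item factor vectors.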
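
To make the alternation concrete, here is a minimal pure-Python sketch restricted to rank 1, so each user and item has a single scalar factor and every least squares step has a closed form. It is not how Spark implements ALS (Spark uses rank-`k` vectors and a distributed solver). The function names `objective` and `als_rank1` and the toy ratings are hypothetical, chosen only for illustration.

```python
def objective(ratings, u, v, reg):
    """Regularized squared error over the observed (user, item) entries."""
    fit = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
    penalty = reg * (sum(x * x for x in u) + sum(y * y for y in v))
    return fit + penalty


def als_rank1(ratings, n_users, n_items, reg=0.01, n_iters=20):
    """Rank-1 ALS. `ratings` maps (user, item) -> observed value."""
    u = [1.0] * n_users  # one scalar factor per user
    v = [1.0] * n_items  # one scalar factor per item
    history = []
    for step in range(n_iters):
        # Fix the item factors and solve each user's least squares problem.
        for i in range(n_users):
            seen = [(j, r) for (a, j), r in ratings.items() if a == i]
            u[i] = sum(r * v[j] for j, r in seen) / (reg + sum(v[j] ** 2 for j, r in seen))
        # Fix the user factors and solve each item's least squares problem.
        for j in range(n_items):
            seen = [(i, r) for (i, b), r in ratings.items() if b == j]
            v[j] = sum(r * u[i] for i, r in seen) / (reg + sum(u[i] ** 2 for i, r in seen))
        history.append(objective(ratings, u, v, reg))
    return u, v, history


# Toy data (hypothetical): entries of the rank-1 matrix [1, 2, 3] x [0.5, 1, 2],
# with the pair (user 2, item 2) deliberately left unobserved.
true_u, true_v = [1.0, 2.0, 3.0], [0.5, 1.0, 2.0]
ratings = {
    (i, j): true_u[i] * true_v[j]
    for i in range(3)
    for j in range(3)
    if (i, j) != (2, 2)
}

u, v, history = als_rank1(ratings, n_users=3, n_items=3)
print("objective per iteration:", [round(h, 4) for h in history])
print("prediction for the unobserved pair (2, 2):", round(u[2] * v[2], 3))
```

Because each half-step solves its subproblem exactly, the objective cannot increase from one iteration to the next, which is what makes the alternation well behaved even though the joint problem is not convex.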

## Setup

In [None]:
%%bash
DATA_FOLDER="../data"
if [ ! -d "$DATA_FOLDER" ]; then
    wget --no-check-certificate "https://drive.usercontent.google.com/download?id=1qe5hOSBxzIuxBb1G_Ih5X-O65QElollE&export=download&confirm=t&uuid=b2002093-cc6e-4bd5-be47-9603f0b33470" -O KuaiRec.zip
    unzip KuaiRec.zip -d "$DATA_FOLDER"
    rm KuaiRec.zip
fi

In [2]:
import os
import shutil
import pandas as pd
from pyspark.sql import SparkSession, Row
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

In [3]:
# Load the dataset.
# Use the small matrix: big_matrix makes Spark run out of memory.
data = "../data/KuaiRec 2.0/data"
interactions = pd.read_csv(f"{data}/small_matrix.csv")

In [4]:
# Drop rows with missing values, exact duplicates, and negative timestamps.
interactions = interactions.dropna()
interactions = interactions.drop_duplicates()
interactions = interactions[interactions["timestamp"] >= 0]

# Drop the columns the model does not use; ALS below reads only user_id, video_id, and watch_ratio.
interactions.drop(columns=["play_duration", "video_duration", "time", "date", "timestamp"], inplace=True)

In [5]:
# Setup Spark.
spark = SparkSession.builder.appName("ALS Recommender System").getOrCreate()
spark_df = spark.createDataFrame(interactions)

## Training the model

In [6]:
als = ALS(
    maxIter=10,
    regParam=0.1,
    userCol="user_id",
    itemCol="video_id",
    ratingCol="watch_ratio",
    coldStartStrategy="drop"
)

In [7]:
model = als.fit(spark_df)

model_path = "als_model"

# Delete model if it exists.
if os.path.exists(model_path) and os.path.isdir(model_path):
    shutil.rmtree(model_path)

model.save(model_path)

## Test the model

In [8]:
# Predict the watch ratio of video 6523 for user 7162.
user_item_df = spark.createDataFrame([Row(user_id=7162, video_id=6523)], ["user_id", "video_id"])
predictions = model.transform(user_item_df)
predictions.show()

+-------+--------+----------+
|user_id|video_id|prediction|
+-------+--------+----------+
|   7162|    6523| 1.3845127|
+-------+--------+----------+



## Benchmarks

In [9]:
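# Score the same interactions the model was trained on,
# so the metrics below are in-sample (no held-out split is used).
predictions = model.transform(spark_df)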

evaluator = RegressionEvaluator(metricName="rmse", labelCol="watch_ratio", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)

mae_evaluator = RegressionEvaluator(metricName="mae", labelCol="watch_ratio", predictionCol="prediction")
mae = mae_evaluator.evaluate(predictions)

evaluator = RegressionEvaluator(metricName="r2", labelCol="watch_ratio", predictionCol="prediction")
r2 = evaluator.evaluate(predictions)

print("📊 Model Evaluation Metrics:")
print(f"➡️  MAE  (Mean Absolute Error)    : {round(mae, 4)}")
print(f"➡️  RMSE (Root Mean Squared Error): {round(rmse, 4)}")
print(f"➡️  R² Score                      : {round(r2, 4)}")

📊 Model Evaluation Metrics:
➡️  MAE  (Mean Absolute Error)    : 0.3515
➡️  RMSE (Root Mean Squared Error): 1.2322
➡️  R² Score                      : 0.1748


                                                                                

## Conclusion

In this notebook, we trained and evaluated an **ALS (Alternating Least Squares)** model for collaborative filtering with Spark ML, using `watch_ratio` from the KuaiRec small matrix as the rating signal.

### Insights from the evaluation
- **MAE**: The **Mean Absolute Error** of **0.3515** means that, on average, the predicted `watch_ratio` differs from the observed value by about 0.35. Lower is better.

- **RMSE**: The **Root Mean Squared Error** of **1.2322** is about 3.5 times the MAE. Because RMSE squares each error before averaging, a gap this large indicates that a small number of interactions are predicted very poorly, while most errors are much smaller.

- **R² Score**: The **R² Score** of **0.1748** implies that the model explains about 17.5% of the variance in the observed watch ratios. The ALS model captures some of the patterns in the data, but most of the variance remains unexplained.
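
To illustrate how the three metrics above relate to each other, here is a small pure-Python sketch using their textbook definitions. The helper name `regression_metrics` and the five watch ratios are hypothetical and unrelated to the KuaiRec results; the example only shows how a single badly predicted interaction pulls RMSE well above MAE.

```python
import math


def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R^2) for two equally long lists of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical watch ratios: four small errors and one large miss (6.0 vs 2.0).
actual = [0.5, 1.0, 1.5, 1.0, 6.0]
predicted = [0.6, 0.9, 1.4, 1.1, 2.0]

mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```

In this toy case the single large miss accounts for nearly all of the squared error, so RMSE ends up roughly twice the MAE even though four of the five predictions are close.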

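Taken together, the metrics indicate a modest fit. They were also computed on the same interactions used for training, so they say little about performance on unseen data; a held-out split (for example with `DataFrame.randomSplit`) would be needed to measure that. Tuning hyperparameters such as `rank`, `regParam`, and `maxIter` is the natural next step for improving **accuracy**.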

Overall, the ALS model is a useful baseline for collaborative filtering tasks, but it needs further tuning before it can deliver accurate recommendations.
