<i>Adapted from Recommenders ALS example</i>

# Running ALS on MIND (with PySpark)

Matrix factorization by [ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS) (Alternating Least Squares) is a well-known collaborative filtering algorithm.

This notebook shows how to use and evaluate the ALS implementation in PySpark ML (DataFrame-based API), which is meant for large-scale distributed datasets, on the MIND news dataset.

In [39]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import os
import sys
import pyspark
from pyspark.ml.recommendation import ALS
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, IntegerType

from recommenders.utils.timer import Timer
from recommenders.datasets.spark_splitters import spark_random_split
from recommenders.evaluation.spark_evaluation import SparkRatingEvaluation, SparkRankingEvaluation
from recommenders.utils.spark_utils import start_or_get_spark

from tempfile import TemporaryDirectory
from recommenders.datasets.mind import download_mind
from recommenders.datasets.download_utils import unzip_file

print(f"System version: {sys.version}")
print(f"Spark version: {pyspark.__version__}")


System version: 3.12.3 (main, Jan 17 2025, 18:03:48) [GCC 13.3.0]
Spark version: 3.5.4


Set the default parameters.

In [18]:
# top k items to recommend
TOP_K = 10

# MIND sizes: "demo", "small", or "large"
mind_type = 'demo'

# Column names for the dataset
COL_USER = "user_id"
COL_ITEM = "news_id"
COL_RATING = "rating"

### 0. Set up Spark context & directory

The following settings work well for debugging locally on a VM; change them when running on a cluster. We set up a single large executor with many threads and specify a memory cap.

In [19]:
# the following settings work well for debugging locally on VM - change when running on a cluster
# set up a giant single executor with many threads and specify memory cap
spark = start_or_get_spark("ALS PySpark", memory="16g")
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")

In [20]:
# Set up a temporary data directory, then download and unzip MIND

tmpdir = TemporaryDirectory()
data_path = tmpdir.name
train_zip, valid_zip = download_mind(size=mind_type, dest_path=data_path)
unzip_file(train_zip, os.path.join(data_path, 'train'), clean_zip_file=False)
unzip_file(valid_zip, os.path.join(data_path, 'valid'), clean_zip_file=False)
train_behaviors_path = os.path.join(data_path, "train", "behaviors.tsv")

100%|██████████| 17.0k/17.0k [00:06<00:00, 2.45kKB/s]
100%|██████████| 9.84k/9.84k [00:04<00:00, 2.25kKB/s]


### 1. Load and preprocess the MIND dataset

In [25]:
# Schema for behaviors.tsv
schema = StructType([
    StructField("impression_id", StringType(), True),  # Ignored for ALS
    StructField(COL_USER, StringType(), True),  # Will be converted later
    StructField("timestamp", StringType(), True),  # Convert to long if needed
    StructField("history", StringType(), True),  # List of past clicked news
    StructField("impressions", StringType(), True)  # Needs to be split into news_id + rating
])

# Load raw behaviors.tsv
data = (
    spark.read.option("sep", "\t").option("header", "false")
    .schema(schema)
    .csv(train_behaviors_path)
)

# Split impressions column ("n4-1 n5-0 n6-1") into separate rows
data = data.withColumn("impressions", F.explode(F.split(F.col("impressions"), " ")))

# Extract news_id and click status (e.g., "n4-1" → news_id="n4", rating=1)
data = data.withColumn(COL_ITEM, F.split(F.col("impressions"), "-")[0])
data = data.withColumn(COL_RATING, F.split(F.col("impressions"), "-")[1].cast(IntegerType()))

# Convert user_id and news_id to integers (ALS requires numeric IDs)
data = data.withColumn(COL_USER, F.hash(COL_USER).cast(IntegerType()))
data = data.withColumn(COL_ITEM, F.hash(COL_ITEM).cast(IntegerType()))

# Drop unnecessary columns
data = data.select(COL_USER, COL_ITEM, COL_RATING)

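# Keep only users and articles with at least 10 interactions.
# Counts are computed once, on the unfiltered data.
user_counts = data.groupBy(COL_USER).count().filter(F.col("count") >= 10)
news_counts = data.groupBy(COL_ITEM).count().filter(F.col("count") >= 10)

# Inner joins drop rows whose user or article is below the threshold.
# Note: each join also carries its "count" column into the result.
data = data.join(user_counts, COL_USER).join(news_counts, COL_ITEM)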


# Show transformed data
data.show()
data.groupBy(COL_RATING).count().show()


+-----------+---------+------+-----+-----+
|    news_id|  user_id|rating|count|count|
+-----------+---------+------+-----+-----+
| 1572667918|641278344|     0|  162| 1893|
|-1179971679|641278344|     0|  162|  725|
|-1588919390|641278344|     0|  162|  881|
| -704032733|641278344|     0|  162| 1269|
| -945553399|641278344|     0|  162|  604|
|  206958755|641278344|     0|  162|  391|
| -473001627|641278344|     0|  162|  590|
| 1343085119|641278344|     0|  162|  513|
|  584181417|641278344|     0|  162|  864|
|-1875555226|641278344|     0|  162| 1305|
| -146224004|641278344|     0|  162|   31|
| 1885426536|641278344|     0|  162|  937|
|-1471141211|641278344|     0|  162|  255|
|  969666805|641278344|     0|  162|  484|
| -652530682|641278344|     0|  162| 1523|
| 2032506888|641278344|     0|  162| 1486|
| 1938754531|641278344|     0|  162|  708|
|  710941895|641278344|     0|  162| 1195|
|-1433334447|641278344|     0|  162| 1387|
| 2114527946|641278344|     0|  162|  639|
+-----------+---------+------+-----+-----+
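
The impression parsing in the cell above can be sketched in plain Python. This is only an illustration of the string handling, not the Spark code path: `parse_impressions` and the user ID `"u1"` are hypothetical, and the sketch assumes every token is well formed (`<news_id>-<label>`).

```python
def parse_impressions(user_id, impressions):
    """Split an impressions string into (user_id, news_id, rating) tuples."""
    rows = []
    for token in impressions.split(" "):
        # Each token looks like "n4-1": news ID, a dash, then the click label.
        news_id, label = token.split("-")
        rows.append((user_id, news_id, int(label)))
    return rows


# Hypothetical user "u1" with the example string from the comment above.
rows = parse_impressions("u1", "n4-1 n5-0 n6-1")
print(rows)
```

Each impression token becomes one row, which is what `F.explode` followed by the two `F.split` calls does on the DataFrame.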

### 2. Split the data using the Spark random splitter provided in utilities

In [26]:
train, test = spark_random_split(data, ratio=0.75, seed=123)
print("N train", train.cache().count())
print("N test", test.cache().count())

                                                                                

N train 618004
N test 205737
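
The split sizes are close to, but not exactly, 75/25. A minimal sketch of why, assuming the splitter assigns each row independently with a seeded random draw (as Spark's `randomSplit` does); the `random_split` helper below is hypothetical and is not the `recommenders` implementation.

```python
import random


def random_split(rows, ratio=0.75, seed=123):
    """Assign each row to train with probability `ratio`, otherwise to test."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        (train_rows if rng.random() < ratio else test_rows).append(row)
    return train_rows, test_rows


train_rows, test_rows = random_split(list(range(10_000)))
print(len(train_rows), len(test_rows))

# Realized train fraction from the counts printed by the notebook.
observed_fraction = 618004 / (618004 + 205737)
print(round(observed_fraction, 4))
```

Because every row is drawn independently, the realized fraction fluctuates slightly around the requested ratio.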


                                                                                

### 3. Train the ALS model on the training data, and get the top-k recommendations for our testing data

MIND has no explicit ratings, so instead of movie-style ratings we use the click labels in the training set (1 = clicked, 0 = shown but not clicked) as users' implicit feedback and set `implicitPrefs=True`.

In [44]:
header = {
    "userCol": COL_USER,
    "itemCol": COL_ITEM,
    "ratingCol": COL_RATING,
}


als = ALS(
    rank=50,
    maxIter=15,
    implicitPrefs=True,
    regParam=0.01,
    coldStartStrategy='drop',
    nonnegative=True,
    seed=42,
    alpha=45,
    **header
)
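
With `implicitPrefs=True`, ALS does not fit the rating values directly. In the formulation of Hu, Koren, and Volinsky, on which Spark's implicit ALS is based, each rating r becomes a binary preference (1 if r > 0, else 0) weighted by a confidence of 1 + alpha * r. The sketch below illustrates that mapping for the click labels used here; `implicit_targets` is a hypothetical helper, and Spark's internal handling may differ in detail.

```python
def implicit_targets(rating, alpha=45.0):
    """Map a raw rating to (preference, confidence) as in Hu et al.'s formulation."""
    preference = 1.0 if rating > 0 else 0.0
    confidence = 1.0 + alpha * abs(rating)
    return preference, confidence


for rating in (0, 1):
    print(rating, implicit_targets(rating))
```

Under this assumption, `alpha=45` means a click is weighted 46 times as heavily as a non-click in the loss.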

In [31]:
with Timer() as train_time:
    model = als.fit(train)

print(f"Took {train_time.interval} seconds for training.")

                                                                                

Took 19.21276932500041 seconds for training.


In news recommendation, recommending articles that a user has already been shown (clicked or not) in the training set does not make sense. Therefore, those articles are removed from the recommended items.

To achieve this, we score every article for every user, and then remove the user-article pairs that exist in the training dataset.

In [32]:
with Timer() as test_time:

    # Get the cross join of all user-item pairs and score them.
    users = train.select(COL_USER).distinct()
    items = train.select(COL_ITEM).distinct()
    user_item = users.crossJoin(items)
    dfs_pred = model.transform(user_item)

    # Remove seen items.
    dfs_pred_exclude_train = dfs_pred.alias("pred").join(
        train.alias("train"),
        (dfs_pred[COL_USER] == train[COL_USER]) & (dfs_pred[COL_ITEM] == train[COL_ITEM]),
        how='outer'
    )

    top_all = dfs_pred_exclude_train.filter(dfs_pred_exclude_train[f"train.{COL_RATING}"].isNull()) \
        .select('pred.' + COL_USER, 'pred.' + COL_ITEM, 'pred.' + "prediction")

    # Spark transformations are lazily evaluated, so use an action
    # (count) to force execution and measure the prediction time.
    top_all.cache().count()

print(f"Took {test_time.interval} seconds for prediction.")

25/02/10 13:55:59 WARN Column: Constructing trivially true equals predicate, 'user_id#3230 = user_id#3230'. Perhaps you need to use aliases.
25/02/10 13:55:59 WARN Column: Constructing trivially true equals predicate, 'news_id#3238 = news_id#3238'. Perhaps you need to use aliases.

Took 204.26381239799957 seconds for prediction.


                                                                                

In [33]:
top_all.show()

+-----------+-----------+------------+
|    user_id|    news_id|  prediction|
+-----------+-----------+------------+
|-2147037735|-2121750872|         0.0|
|-2147037735|-2043278394| 0.027811762|
|-2147037735|-1863699942| 0.011070045|
|-2147037735|-1794496239| 0.006054819|
|-2147037735|-1699973722|         0.0|
|-2147037735|-1584119234|0.0031453876|
|-2147037735|-1493080894|3.7277272E-4|
|-2147037735| -960522234| 0.013794501|
|-2147037735| -532027687|  0.04946326|
|-2147037735|   51381774|  0.22901253|
|-2147037735|   65115344|         0.0|
|-2147037735|  227072232| 0.021506796|
|-2147037735|  324601621|         0.0|
|-2147037735|  604387412|  0.17451674|
|-2147037735| 1029253112|  0.03898862|
|-2147037735| 1104310681|         0.0|
|-2147037735| 1732144371|         0.0|
|-2147037735| 1762144659| 0.012311478|
|-2147037735| 2075122702|         0.0|
|-2146676128|-1876942238|         0.0|
+-----------+-----------+------------+
only showing top 20 rows
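
The seen-item filter above, followed by a per-user top-k selection, can be sketched in plain Python. This is an illustration on toy data, not the Spark implementation: `top_k_unseen`, the user and item names, and the scores are all hypothetical. In the notebook itself, `top_all` keeps every unseen pair and the top-k cut is left to the evaluator.

```python
import heapq
from collections import defaultdict


def top_k_unseen(scores, seen, k):
    """Return the k best-scored items per user, skipping pairs seen in training.

    scores: dict mapping (user, item) -> predicted score
    seen:   set of (user, item) pairs present in the training data
    """
    by_user = defaultdict(list)
    for (user, item), score in scores.items():
        if (user, item) not in seen:  # anti-join against the training pairs
            by_user[user].append((score, item))
    return {
        user: [item for _, item in heapq.nlargest(k, candidates)]
        for user, candidates in by_user.items()
    }


scores = {
    ("u1", "a"): 0.9, ("u1", "b"): 0.4, ("u1", "c"): 0.7,
    ("u2", "a"): 0.2, ("u2", "b"): 0.8, ("u2", "c"): 0.1,
}
seen = {("u1", "a"), ("u2", "b")}
recs = top_k_unseen(scores, seen, k=2)
print(recs)
```

A set lookup plays the role of the join on `(user_id, news_id)`; on a cluster the same effect could be had with a left anti join, which is an alternative to the outer join and null filter used in the cell.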



### 4. Evaluate how well ALS performs

In [34]:
rank_eval = SparkRankingEvaluation(test, top_all, k = TOP_K, col_user=COL_USER, col_item=COL_ITEM, 
                                    col_rating=COL_RATING, col_prediction="prediction", 
                                    relevancy_method="top_k")

                                                                                

In [35]:
print("Model:\tALS",
      "Top K:\t%d" % rank_eval.k,
      "MAP:\t%f" % rank_eval.map_at_k(),
      "NDCG:\t%f" % rank_eval.ndcg_at_k(),
      "Precision@K:\t%f" % rank_eval.precision_at_k(),
      "Recall@K:\t%f" % rank_eval.recall_at_k(), sep='\n')



Model:	ALS
Top K:	10
MAP:	0.026334
NDCG:	0.066722
Precision@K:	0.058610
Recall@K:	0.020850
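
For reference, here is a minimal sketch of Precision@K and Recall@K for a single user, using the common textbook definitions. `precision_recall_at_k` and the toy item IDs are hypothetical, and the conventions inside `SparkRankingEvaluation` (for example, how ground-truth items are chosen or how results are averaged over users) may differ.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


recommended = ["n1", "n2", "n3", "n4", "n5"]  # ranked best first
relevant = {"n2", "n5", "n9"}                 # items the user engaged with in test
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(precision, recall)
```

Two of the five recommendations are relevant, and two of the three relevant items are retrieved.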


                                                                                

### 5. Evaluate rating prediction

In [36]:
# Generate predicted ratings.
prediction = model.transform(test)
prediction.cache().show()




+-----------+-----------+------+-----+-----+------------+
|    news_id|    user_id|rating|count|count|  prediction|
+-----------+-----------+------+-----+-----+------------+
|-2120713446|-1482601076|     0|  478| 1936|  0.17146717|
|-2090215252|-1482601076|     0|  478|  426| 0.039239056|
|-2031019083|-1482601076|     0|  478| 3512|  0.10484441|
|-1989979155|  685447168|     0|   53| 1341|    0.105992|
|-1986381822|-1482601076|     0|  478|   11|         0.0|
|-1977550441| 1215769402|     0|  529|  244| 0.115620576|
|-1976588074|-1230168420|     1|  259| 1654|2.5876146E-4|
|-1888287721| 1504945774|     0|   71|  353|         0.0|
|-1869033591|-1230168420|     0|  259|  514|         0.0|
|-1865583007| -859415635|     0|  382|  225|  0.27295318|
|-1853492005| 1215769402|     0|  529|  119|  0.06399846|
|-1849280153| 1215769402|     0|  529|  169| 0.027338777|
|-1844046333|  666429805|     0|  400| 1535|  0.88676363|
|-1838142718| 1215769402|     0|  529|  584| 0.014787877|
|-1835425778|-

                                                                                

In [37]:
rating_eval = SparkRatingEvaluation(test, prediction, col_user=COL_USER, col_item=COL_ITEM, 
                                    col_rating=COL_RATING, col_prediction="prediction")

print("Model:\tALS rating prediction",
      "RMSE:\t%f" % rating_eval.rmse(),
      "MAE:\t%f" % rating_eval.mae(),
      "Explained variance:\t%f" % rating_eval.exp_var(),
      "R squared:\t%f" % rating_eval.rsquared(), sep='\n')

                                                                                

Model:	ALS rating prediction
RMSE:	0.317204
MAE:	0.167511
Explained variance:	-1.369951
R squared:	-1.660865
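
A negative R squared means the predictions have a larger squared error than a constant predictor that always outputs the mean rating. The sketch below shows this on toy data, assuming the usual definition R² = 1 - SS_res / SS_tot; `regression_metrics` is a hypothetical helper and the evaluator's exact computation may differ.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) with r2 = 1 - SS_res / SS_tot."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


y_true = [0, 0, 0, 1]          # mostly non-clicks, as in MIND
y_pred = [0.5, 0.5, 0.5, 0.0]  # toy predictions that miss in both directions

rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(rmse, mae, r2)

# Baseline: always predict the mean of y_true, which gives r2 == 0.
baseline = regression_metrics(y_true, [0.25] * 4)
print(baseline)
```

Under this definition, the R squared of -1.66 reported above would correspond to a squared error roughly 2.66 times that of the mean predictor.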


                                                                                

In [38]:
# cleanup spark instance and clear temp directory
spark.stop()
tmpdir.cleanup()

### 6. Changes over time

#### Attempt 0 - no modifications

Model:	ALS
Top K:	10
MAP:	0.000041
NDCG:	0.000069
Precision@K:	0.000020
Recall@K:	0.000041

Model:	ALS rating prediction
RMSE:	0.194057
MAE:	0.047414
Explained variance:	0.019782
R squared:	-0.002237

These extremely low metrics might be due to the dataset consisting mostly of non-clicks (rating 0). Possible changes:
- Remove articles with low engagement
- Tune hyperparameters: increase rank, decrease regParam, and change alpha
- Convert data to implicit feedback
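
The first suggestion can be sketched in plain Python as a minimum-interaction filter, applied to users as well as articles to match the threshold of 10 used in the preprocessing cell. `filter_min_interactions` and the toy rows are hypothetical; as in the Spark code, counts are computed once on the unfiltered data rather than iteratively.

```python
from collections import Counter


def filter_min_interactions(rows, min_count=10):
    """Keep (user, item, rating) rows whose user and item both meet min_count."""
    user_counts = Counter(user for user, _, _ in rows)
    item_counts = Counter(item for _, item, _ in rows)
    return [
        (user, item, rating)
        for user, item, rating in rows
        if user_counts[user] >= min_count and item_counts[item] >= min_count
    ]


rows = [("u1", "a", 1), ("u1", "b", 0), ("u2", "a", 0), ("u3", "c", 1)]
kept = filter_min_interactions(rows, min_count=2)
print(kept)
```

Only the row whose user and article both appear at least twice survives.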

#### Attempt 1 - applying the Attempt 0 suggestions

Model:	ALS
Top K:	10
MAP:	0.026334
NDCG:	0.066722
Precision@K:	0.058610
Recall@K:	0.020850

Model:	ALS rating prediction
RMSE:	0.317204
MAE:	0.167511
Explained variance:	-1.369951
R squared:	-1.660865

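The ranking metrics improved by more than two orders of magnitude but are still very low, while the rating metrics got worse (RMSE rose from 0.194 to 0.317 and R squared fell to -1.66). The latter is plausibly because implicit-feedback ALS produces preference scores rather than rating estimates. It may be worth trying a different method before tuning further.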
