<i>Adapted from Recommenders ALS example</i>

# Running ALS on MIND (with PySpark)

Matrix factorization by [ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS) (Alternating Least Squares) is a well-known collaborative filtering algorithm.

This notebook shows how to train and evaluate the ALS implementation in PySpark ML (the DataFrame-based API), which is designed for large-scale distributed datasets, on the MIND news dataset.
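
To make the "alternating" part concrete, here is a minimal pure-Python sketch of rank-1 ALS with plain L2 regularization. It is an illustration only, not Spark's implementation, which differs in many details (higher rank, distributed solves, its own regularization scaling). The function names `als_rank1` and `loss` and the toy `ratings` triples are hypothetical.

```python
def als_rank1(ratings, n_users, n_items, reg=0.05, n_iters=15):
    """Fit rating ~ u[user] * v[item] on observed (user, item, rating) triples."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(n_iters):
        # Fix the item factors; each user factor is a 1-D ridge regression
        # with a closed-form solution.
        num = [0.0] * n_users
        den = [reg] * n_users
        for user, item, r in ratings:
            num[user] += r * v[item]
            den[user] += v[item] ** 2
        u = [n / d for n, d in zip(num, den)]

        # Fix the user factors; solve for each item factor the same way.
        num = [0.0] * n_items
        den = [reg] * n_items
        for user, item, r in ratings:
            num[item] += r * u[user]
            den[item] += u[user] ** 2
        v = [n / d for n, d in zip(num, den)]
    return u, v


def loss(ratings, u, v, reg):
    """Regularized squared error over the observed entries only."""
    squared_error = sum((r - u[a] * v[b]) ** 2 for a, b, r in ratings)
    penalty = reg * (sum(x * x for x in u) + sum(x * x for x in v))
    return squared_error + penalty


# Toy data: the pair (user 1, item 2) is unobserved.
ratings = [(0, 0, 1.0), (0, 1, 0.5), (0, 2, 2.0), (1, 0, 2.0), (1, 1, 1.0)]
u, v = als_rank1(ratings, n_users=2, n_items=3)
print("final loss:", loss(ratings, u, v, reg=0.05))
print("predicted rating for the unobserved pair:", u[1] * v[2])
```

Each half-step minimizes the regularized loss exactly for one block of factors while the other is held fixed, so the loss never increases from one iteration to the next.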

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import os
import sys
import pyspark
from pyspark.ml.recommendation import ALS
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, IntegerType

from recommenders.utils.timer import Timer
from recommenders.datasets.spark_splitters import spark_random_split
from recommenders.evaluation.spark_evaluation import SparkRatingEvaluation, SparkRankingEvaluation
from recommenders.utils.spark_utils import start_or_get_spark

from tempfile import TemporaryDirectory
from recommenders.datasets.mind import download_mind
from recommenders.datasets.download_utils import unzip_file

print(f"System version: {sys.version}")
print(f"Spark version: {pyspark.__version__}")


System version: 3.12.3 (main, Jan 17 2025, 18:03:48) [GCC 13.3.0]
Spark version: 3.5.4


Set the default parameters.

In [2]:
# top k items to recommend
TOP_K = 10

# MIND sizes: "demo", "small", or "large"
mind_type = 'demo'

# Column names for the dataset
COL_USER = "user_id"
COL_ITEM = "news_id"
COL_RATING = "rating"

### 0. Set up Spark context & directory

The following settings work well for debugging locally on a VM; change them when running on a cluster. We set up a single large executor with many threads and specify a memory cap.

In [16]:
# the following settings work well for debugging locally on VM - change when running on a cluster
# set up a giant single executor with many threads and specify memory cap
spark = start_or_get_spark("ALS PySpark", memory="16g")
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")

In [4]:
# Setup data storage location

tmpdir = TemporaryDirectory()
data_path = tmpdir.name
train_zip, valid_zip = download_mind(size=mind_type, dest_path=data_path)
unzip_file(train_zip, os.path.join(data_path, 'train'), clean_zip_file=False)
unzip_file(valid_zip, os.path.join(data_path, 'valid'), clean_zip_file=False)
train_behaviors_path = os.path.join(data_path, "train", "behaviors.tsv")

25/02/10 13:38:20 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
100%|██████████| 17.0k/17.0k [00:08<00:00, 2.12kKB/s]
100%|██████████| 9.84k/9.84k [00:03<00:00, 2.89kKB/s]


### 1. Load and transform the MIND dataset

In [5]:
# Schema for behaviors.tsv
schema = StructType([
    StructField("impression_id", StringType(), True),  # Ignored for ALS
    StructField(COL_USER, StringType(), True),  # Will be converted later
    StructField("timestamp", StringType(), True),  # Convert to long if needed
    StructField("history", StringType(), True),  # List of past clicked news
    StructField("impressions", StringType(), True)  # Needs to be split into news_id + rating
])

# Load raw behaviors.tsv
data = (
    spark.read.option("sep", "\t").option("header", "false")
    .schema(schema)
    .csv(train_behaviors_path)
)

# Split impressions column ("n4-1 n5-0 n6-1") into separate rows
data = data.withColumn("impressions", F.explode(F.split(F.col("impressions"), " ")))

# Extract news_id and click status (e.g., "n4-1" → news_id="n4", rating=1)
data = data.withColumn(COL_ITEM, F.split(F.col("impressions"), "-")[0])
data = data.withColumn(COL_RATING, F.split(F.col("impressions"), "-")[1].cast(IntegerType()))

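# Convert user_id and news_id to integers (ALS requires numeric IDs in the 32-bit integer range).
# F.hash returns a 32-bit hash, so distinct IDs can collide in rare cases.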
data = data.withColumn(COL_USER, F.hash(COL_USER).cast(IntegerType()))
data = data.withColumn(COL_ITEM, F.hash(COL_ITEM).cast(IntegerType()))

# Drop unnecessary columns
data = data.select(COL_USER, COL_ITEM, COL_RATING)

# Show transformed data
data.show()
data.groupBy(COL_RATING).count().show()


+---------+-----------+------+
|  user_id|    news_id|rating|
+---------+-----------+------+
|641278344| 1572667918|     0|
|641278344|-1179971679|     0|
|641278344|-1588919390|     0|
|641278344| -704032733|     0|
|641278344| -945553399|     0|
|641278344|  206958755|     0|
|641278344| -473001627|     0|
|641278344|-1578886272|     0|
|641278344| 1343085119|     0|
|641278344|  584181417|     0|
|641278344|-1875555226|     0|
|641278344| -146224004|     0|
|641278344| 1885426536|     0|
|641278344|-1471141211|     0|
|641278344|  969666805|     0|
|641278344| -652530682|     0|
|641278344| 2032506888|     0|
|641278344| 1938754531|     0|
|641278344|  710941895|     0|
|641278344|-1433334447|     0|
+---------+-----------+------+
only showing top 20 rows





+------+------+
|rating| count|
+------+------+
|     1| 34747|
|     0|811630|
+------+------+
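
For readers less familiar with the Spark column functions, the impression-splitting step can be sketched in plain Python on a single string. The function name `parse_impressions` is hypothetical, and the input reuses the example from the code comment above. This sketch splits on the last hyphen only, which matches the notebook's behavior as long as news IDs contain no hyphens (an assumption).

```python
def parse_impressions(impressions):
    """Split an impressions string into (news_id, clicked) pairs."""
    pairs = []
    for token in impressions.split():
        # Split on the last hyphen so the click label is always the final field.
        news_id, label = token.rsplit("-", 1)
        pairs.append((news_id, int(label)))
    return pairs


print(parse_impressions("n4-1 n5-0 n6-1"))
# [('n4', 1), ('n5', 0), ('n6', 1)]
```

Each pair corresponds to one row produced by `F.explode` in the Spark version, before the IDs are hashed.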

### 2. Split the data using the Spark random splitter provided in utilities

In [6]:
train, test = spark_random_split(data, ratio=0.75, seed=123)
print("N train", train.cache().count())
print("N test", test.cache().count())

N train 634960
N test 211417

### 3. Train the ALS model on the training data, and get the top-k recommendations for our testing data

To model article interactions, we use the click labels (the `rating` column) in the training set as users' explicit feedback.

In [7]:
header = {
    "userCol": COL_USER,
    "itemCol": COL_ITEM,
    "ratingCol": COL_RATING,
}


als = ALS(
    rank=10,
    maxIter=15,
    implicitPrefs=False,
    regParam=0.05,
    coldStartStrategy='drop',
    nonnegative=False,
    seed=42,
    **header
)

In [8]:
with Timer() as train_time:
    model = als.fit(train)

print(f"Took {train_time.interval} seconds for training.")

25/02/10 13:39:35 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS

Took 15.069147076999798 seconds for training.


In the news recommendation use case, recommending articles that already appear in a user's training interactions does not make sense. Therefore, these articles are removed from the recommended items.

To achieve this, we score all articles for all users and then remove the user-article pairs that exist in the training dataset.
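
The filtering logic can be sketched without Spark on a handful of made-up scores. The function name `top_k_unseen` and the toy data are hypothetical. The top-k truncation is included only for illustration: the notebook cell itself keeps every unseen pair and passes `k` to the evaluator later.

```python
import heapq


def top_k_unseen(scores, seen, k):
    """Return the k highest-scoring (item, score) pairs per user, skipping seen pairs.

    scores: dict mapping (user, item) -> predicted score
    seen:   set of (user, item) pairs present in the training data
    """
    per_user = {}
    for (user, item), score in scores.items():
        if (user, item) not in seen:
            per_user.setdefault(user, []).append((item, score))
    return {
        user: heapq.nlargest(k, candidates, key=lambda pair: pair[1])
        for user, candidates in per_user.items()
    }


# Toy predictions for two users and three items (made-up values).
scores = {
    ("u1", "a"): 0.9, ("u1", "b"): 0.4, ("u1", "c"): 0.7,
    ("u2", "a"): 0.2, ("u2", "b"): 0.8,
}
seen = {("u1", "a"), ("u2", "b")}

recs = top_k_unseen(scores, seen, k=1)
print(recs)
# {'u1': [('c', 0.7)], 'u2': [('a', 0.2)]}
```

Note that the highest-scoring item for each user is skipped here because it already appears in `seen`.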

In [9]:
with Timer() as test_time:

    # Get the cross join of all user-item pairs and score them.
    users = train.select(COL_USER).distinct()
    items = train.select(COL_ITEM).distinct()
    user_item = users.crossJoin(items)
    dfs_pred = model.transform(user_item)

    # Remove seen items.
    dfs_pred_exclude_train = dfs_pred.alias("pred").join(
        train.alias("train"),
        (dfs_pred[COL_USER] == train[COL_USER]) & (dfs_pred[COL_ITEM] == train[COL_ITEM]),
        how='outer'
    )

    top_all = dfs_pred_exclude_train.filter(dfs_pred_exclude_train[f"train.{COL_RATING}"].isNull()) \
        .select('pred.' + COL_USER, 'pred.' + COL_ITEM, 'pred.' + "prediction")

    # Spark transformations are lazily evaluated.
    # Use an action to force execution and measure the test time.
    top_all.cache().count()

print(f"Took {test_time.interval} seconds for prediction.")

25/02/10 13:40:02 WARN Column: Constructing trivially true equals predicate, 'user_id#33 = user_id#33'. Perhaps you need to use aliases.
25/02/10 13:40:02 WARN Column: Constructing trivially true equals predicate, 'news_id#41 = news_id#41'. Perhaps you need to use aliases.

Took 158.32637336300104 seconds for prediction.



In [10]:
top_all.show()

+-----------+-----------+------------+
|    user_id|    news_id|  prediction|
+-----------+-----------+------------+
|-2147037735|-2121750872|         0.0|
|-2147037735|-2106887980|         0.0|
|-2147037735|-2043278394| 0.001307607|
|-2147037735|-2037856057|         0.0|
|-2147037735|-1896789963|         0.0|
|-2147037735|-1867395284|         0.0|
|-2147037735|-1863699942|0.0029707374|
|-2147037735|-1824052933|         0.0|
|-2147037735|-1794496239|0.0067589157|
|-2147037735|-1702684571|         0.0|
|-2147037735|-1699973722|         0.0|
|-2147037735|-1584119234| 8.228555E-4|
|-2147037735|-1545121822|         0.0|
|-2147037735|-1493080894| 0.006497734|
|-2147037735|-1362864834|         0.0|
|-2147037735|-1344848052|         0.0|
|-2147037735|-1126679957|         0.0|
|-2147037735|-1034258104| 0.051721632|
|-2147037735|-1004796837|  0.04036267|
|-2147037735| -999960394|         0.0|
+-----------+-----------+------------+
only showing top 20 rows



### 4. Evaluate how well ALS performs

In [11]:
rank_eval = SparkRankingEvaluation(test, top_all, k=TOP_K, col_user=COL_USER, col_item=COL_ITEM,
                                   col_rating=COL_RATING, col_prediction="prediction",
                                   relevancy_method="top_k")


In [12]:
print("Model:\tALS",
      "Top K:\t%d" % rank_eval.k,
      "MAP:\t%f" % rank_eval.map_at_k(),
      "NDCG:\t%f" % rank_eval.ndcg_at_k(),
      "Precision@K:\t%f" % rank_eval.precision_at_k(),
      "Recall@K:\t%f" % rank_eval.recall_at_k(), sep='\n')



Model:	ALS
Top K:	10
MAP:	0.000041
NDCG:	0.000069
Precision@K:	0.000020
Recall@K:	0.000041
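
As a reminder of what the last two numbers measure, here is a minimal sketch of the textbook definitions of precision@k and recall@k for a single user. The function name and toy lists are hypothetical, and `SparkRankingEvaluation` may differ in details such as how the relevant set is built and how per-user values are averaged.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user.

    recommended: list of items ordered by predicted score, best first
    relevant:    set of items the user actually interacted with in the test set
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


recommended = ["a", "b", "c", "d"]
relevant = {"b", "d", "e"}
p, r = precision_recall_at_k(recommended, relevant, k=2)
print(f"precision@2 = {p:.3f}, recall@2 = {r:.3f}")
# precision@2 = 0.500, recall@2 = 0.333
```

Values on the order of 1e-5, as reported above, mean that almost none of the top 10 recommendations per user appear in that user's test interactions.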



### 5. Evaluate rating prediction

In [13]:
# Generate predicted ratings.
prediction = model.transform(test)
prediction.cache().show()




+-----------+-----------+------+------------+
|    user_id|    news_id|rating|  prediction|
+-----------+-----------+------+------------+
|-1713021867|-1560029370|     0|  0.06315659|
|-1713021867|-1080198752|     0|  0.20678647|
|-1713021867| -903223166|     1|  0.03217992|
|-1713021867| -513306611|     0|  0.10695357|
|-1482601076|-2090215252|     0| 0.006467443|
|-1482601076|-1793048640|     0|0.0106193535|
|-1482601076|-1755588721|     0|         0.0|
|-1482601076|-1688951603|     0|0.0066513084|
|-1482601076|-1677718748|     0| 0.016954418|
|-1482601076|-1560029370|     0| 0.019275913|
|-1482601076| -674018655|     0|0.0037705882|
|-1482601076| -560679498|     0| 0.013008596|
|-1482601076| -553221724|     0| 0.011344071|
|-1482601076| -547306936|     0| 0.010215879|
|-1482601076| -289872533|     0|         0.0|
|-1482601076| -199816989|     0| 0.004642839|
|-1482601076| -189976086|     0| 0.010997589|
|-1482601076| -155987723|     0| 0.016828649|
|-1482601076| -129293393|     0|  


In [14]:
rating_eval = SparkRatingEvaluation(test, prediction, col_user=COL_USER, col_item=COL_ITEM, 
                                    col_rating=COL_RATING, col_prediction="prediction")

print("Model:\tALS rating prediction",
      "RMSE:\t%f" % rating_eval.rmse(),
      "MAE:\t%f" % rating_eval.mae(),
      "Explained variance:\t%f" % rating_eval.exp_var(),
      "R squared:\t%f" % rating_eval.rsquared(), sep='\n')


Model:	ALS rating prediction
RMSE:	0.194057
MAE:	0.047414
Explained variance:	0.019782
R squared:	-0.002237



In [15]:
# cleanup spark instance and clear temp directory
spark.stop()
tmpdir.cleanup()

### 6. Changes over time

#### Attempt 0 - no modifications

Model:	ALS
Top K:	10
MAP:	0.000041
NDCG:	0.000069
Precision@K:	0.000020
Recall@K:	0.000041

Model:	ALS rating prediction
RMSE:	0.194057
MAE:	0.047414
Explained variance:	0.019782
R squared:	-0.002237

These extremely low metrics may be due to the dataset consisting mostly of non-interactions: only 34,747 of the 846,377 impressions (about 4.1%) are clicks. Possible changes:
- Remove articles with low engagement
- Tune hyperparameters: increase rank, decrease regParam, and change alpha (alpha only takes effect when implicitPrefs=True)
- Convert data to implicit feedback (implicitPrefs=True)
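
To put the class imbalance in numbers, the sketch below computes two trivial baselines from the click counts printed in section 1. It uses the counts for the full dataset rather than the test split, so the values are only an approximate reference point, not a replacement for evaluating a baseline on `test`.

```python
import math

clicks = 34747       # rows with rating == 1 (from the groupBy output in section 1)
non_clicks = 811630  # rows with rating == 0

click_rate = clicks / (clicks + non_clicks)

# For a 0/1 label, always predicting the mean gives an RMSE equal to the
# label's standard deviation, sqrt(p * (1 - p)).
baseline_rmse = math.sqrt(click_rate * (1 - click_rate))

# Always predicting 0 gives an MAE equal to the click rate.
baseline_mae = click_rate

print(f"click rate:          {click_rate:.4f}")
print(f"constant-mean RMSE:  {baseline_rmse:.4f}")
print(f"always-zero MAE:     {baseline_mae:.4f}")
```

Both baselines land close to the reported RMSE (0.194) and MAE (0.047), which is consistent with the near-zero R squared: the model currently predicts ratings about as well as a constant would.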