# 1. LAST-FM: Baseline ALS Recommender

This notebook implements a simple ALS (alternating least squares) recommender on the LastFM user listening dataset. It uses Spark and is written in Scala. Only minimal data cleaning and pre-processing is performed, so the result serves as a baseline model.

## 1.1 Imports and set up 

Key libraries are imported, the Spark session is initialised and the listening data is loaded.

In [68]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.functions.vector_to_array
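import org.apache.spark.sql.types._
// count() is used in the aggregation step; some kernels import it implicitly.
import org.apache.spark.sql.functions.count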


In [2]:
val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("lastfm")
  .getOrCreate()

22/07/24 22:07:52 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2f8bb184


In [3]:
// path to last-fm dataset. Can be downloaded here: http://millionsongdataset.com/lastfm/
val data_path: String = "../resources/lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv"

data_path: String = ../resources/lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv


In [4]:
// schema defined below to set column names and types. 
val schema = new StructType()
            .add("user_id", StringType, true)
            .add("timestamp", StringType, true)
            .add("artist_id", StringType, true)
            .add("artist_name", StringType, true)
            .add("track_id", StringType, true)
            .add("track_name", StringType, true)

schema: org.apache.spark.sql.types.StructType = StructType(StructField(user_id,StringType,true),StructField(timestamp,StringType,true),StructField(artist_id,StringType,true),StructField(artist_name,StringType,true),StructField(track_id,StringType,true),StructField(track_name,StringType,true))


In [5]:
// read in data
val listener_data = spark.read.option("header", false).schema(schema).option("sep", "\t").csv(data_path)
listener_data.show()

+-----------+--------------------+--------------------+---------------+--------------------+--------------------+
|    user_id|           timestamp|           artist_id|    artist_name|            track_id|          track_name|
+-----------+--------------------+--------------------+---------------+--------------------+--------------------+
|user_000001|2009-05-04T23:08:57Z|f1b1cf71-bd35-4e9...|      Deep Dish|                null|Fuck Me Im Famous...|
|user_000001|2009-05-04T13:54:10Z|a7f7df4a-77d8-4f1...|       坂本龍一|                null|Composition 0919 ...|
|user_000001|2009-05-04T13:52:04Z|a7f7df4a-77d8-4f1...|       坂本龍一|                null|Mc2 (Live_2009_4_15)|
|user_000001|2009-05-04T13:42:52Z|a7f7df4a-77d8-4f1...|       坂本龍一|                null|Hibari (Live_2009...|
|user_000001|2009-05-04T13:42:11Z|a7f7df4a-77d8-4f1...|       坂本龍一|                null|Mc1 (Live_2009_4_15)|
|user_000001|2009-05-04T13:38:31Z|a7f7df4a-77d8-4f1...|       坂本龍一|                null|To Stanford (Liv

listener_data: org.apache.spark.sql.DataFrame = [user_id: string, timestamp: string ... 4 more fields]


We remove all rows that contain null values, as determining ratings from missing data would be problematic for a simple model. The timestamp column is also dropped, as it does not contribute directly to the user-item matrix we are trying to build.

In [6]:
val df = listener_data.drop("timestamp").na.drop()
df.show()

+-----------+--------------------+----------------+--------------------+--------------------+
|    user_id|           artist_id|     artist_name|            track_id|          track_name|
+-----------+--------------------+----------------+--------------------+--------------------+
|user_000001|a7f7df4a-77d8-4f1...|        坂本龍一|f7c1f8f8-b935-45e...|The Last Emperor ...|
|user_000001|a7f7df4a-77d8-4f1...|        坂本龍一|475d4e50-cebb-4cd...|Tibetan Dance (Ve...|
|user_000001|ba2f4f3b-0293-4bc...|      Underworld|dc394163-2b78-4b5...|Boy, Boy, Boy (Sw...|
|user_000001|ba2f4f3b-0293-4bc...|      Underworld|340d9a0b-9a43-409...|Crocodile (Innerv...|
|user_000001|a16e47f5-aa54-47f...| Ennio Morricone|0b04407b-f517-4e0...|Ninna Nanna In Bl...|
|user_000001|463a94f1-2713-40b...|         Minus 8|4e78efc4-e545-47a...|      Elysian Fields|
|user_000001|ad0811ea-e213-451...|       Beanfield|fb51d2c4-cc69-412...|  Planetary Deadlock|
|user_000001|309e2dfc-678e-4d0...|        Dj Linus|4277434f-e3c2-41a

df: org.apache.spark.sql.DataFrame = [user_id: string, artist_id: string ... 3 more fields]
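
Because only `user_id` and `track_id` feed the user-item matrix, it may be worth checking how many rows are lost to nulls in columns the model never reads. The sketch below uses a made-up three-row frame (the `events` name and its values are hypothetical, not taken from the dataset) to contrast `na.drop()` on every column with `na.drop` restricted to `track_id`.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("na-drop-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical events; None becomes a null cell in the DataFrame.
val events = Seq(
  ("user_a", Some("artist_1"), Some("track_1")),
  ("user_a", Some("artist_1"), None),
  ("user_b", None, Some("track_2"))
).toDF("user_id", "artist_id", "track_id")

// Drops a row if ANY column is null (the behaviour used in this notebook).
val anyNull = events.na.drop()

// Drops a row only if track_id is null, keeping rows whose artist_id is missing.
val trackOnly = events.na.drop(Seq("track_id"))

println(s"all columns: ${anyNull.count()} rows, track_id only: ${trackOnly.count()} rows")
```

Which variant is appropriate depends on whether the other columns are needed later, for example to display artist names next to recommendations.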


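The dataframe is then aggregated by user and track to get the number of times each user has played each track. The result is ordered by `user_id` and only the first 30,000 aggregated rows are kept, so the subset covers only the users that sort first.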

In [113]:
// Count plays per (user, track) pair.
val df_agg = df.select("user_id", "track_id")
  .groupBy("user_id", "track_id")
  .agg(count("*").alias("count"))
  .orderBy("user_id")
// Keep the first 30,000 aggregated rows.
val df_agg_filtered = df_agg.limit(30000)
df_agg_filtered.show()

22/07/25 01:15:51 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

df_agg: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [user_id: string, track_id: string ... 1 more field]
df_agg_filtered: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [user_id: string, track_id: string ... 1 more field]
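
To make the aggregation concrete, here is a minimal plain-Scala sketch of the same idea on a handful of made-up events. The `plays` and `playCounts` names and all values are hypothetical; the notebook itself does this with Spark's `groupBy` and `count`.

```scala
// Hypothetical listening events as (user_id, track_id) pairs; the values are made up.
val plays = Seq(
  ("user_a", "track_1"),
  ("user_a", "track_1"),
  ("user_a", "track_2"),
  ("user_b", "track_1")
)

// Group identical (user, track) pairs and count them,
// mirroring groupBy("user_id", "track_id").agg(count("*")).
val playCounts: Map[(String, String), Int] =
  plays.groupBy(identity).map { case (pair, hits) => pair -> hits.size }

// Sort by (user, track) so the printed order is deterministic.
playCounts.toSeq.sortBy(_._1).foreach { case ((user, track), n) =>
  println(s"$user $track $n")
}
```

Each resulting triple (user, track, play count) corresponds to one cell of the user-item matrix that ALS later factorises.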


In [114]:
// 70/30 train/test split with a fixed seed for reproducibility.
val Array(training, test) = df_agg_filtered.randomSplit(Array(0.7, 0.3), seed = 18L)

training: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [user_id: string, track_id: string ... 1 more field]
test: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [user_id: string, track_id: string ... 1 more field]


In [119]:
// Encode the string ids as numeric indices and scale play counts into a rating.
// TODO: revisit to make more efficient.

val user_indexer = new StringIndexer()
  .setInputCol("user_id")
  .setOutputCol("user_index")
val track_indexer = new StringIndexer()
  .setInputCol("track_id")
  .setOutputCol("track_index")

// StandardScaler works on vector columns, so wrap "count" in a one-element vector.
val va = new VectorAssembler()
  .setInputCols(Array("count"))
  .setOutputCol("count_assembled")

// Default settings: scale to unit standard deviation without centering.
val scaler = new StandardScaler()
  .setInputCol("count_assembled")
  .setOutputCol("rating")

// Training set: index users and tracks, scale counts, unpack the vector to a double.
val tr_u = user_indexer.fit(training).transform(training)
val tr_i = track_indexer.fit(tr_u).transform(tr_u)
val tr_a = va.transform(tr_i)
val tr_s = scaler.fit(tr_a).transform(tr_a)
val tr_full = tr_s.withColumn("rating_as_array", vector_to_array(tr_s("rating")).getItem(0))
val tr_final = tr_full.select("user_index", "track_index", "rating_as_array").orderBy("user_index")

// Test set. Caveat: the indexers and the scaler are refit on the test split here,
// so its indices and scale are not guaranteed to match those seen in training.
val ts_u = user_indexer.fit(test).transform(test)
val ts_i = track_indexer.fit(ts_u).transform(ts_u)
val ts_a = va.transform(ts_i)
val ts_s = scaler.fit(ts_a).transform(ts_a)
val ts_full = ts_s.withColumn("rating_as_array", vector_to_array(ts_s("rating")).getItem(0))
val ts_final = ts_full.select("user_index", "track_index", "rating_as_array").orderBy("user_index")

<console>: 91: error: value useMean is not a member of org.apache.spark.ml.feature.StandardScaler

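The error shown above refers to a `useMean` call that does not appear in the cell as listed; the setters that `StandardScaler` provides are `setWithMean` and `setWithStd`. Separately, the cell fits the indexers and the scaler once on the training split and again on the test split. One possible alternative, sketched below on made-up data, is to fit a single `Pipeline` on the training split and reuse it for the test split, skipping ids that were never seen in training. The toy frames and the names `trainToy`, `testToy`, `prep`, `finish` and `rating_vec` are assumptions for illustration, not part of the notebook.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.ml.functions.vector_to_array

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("indexing-sketch")
  .getOrCreate()
import spark.implicits._

// Toy stand-ins for the training and test splits (values are made up).
val trainToy = Seq(("u1", "t1", 3L), ("u1", "t2", 1L), ("u2", "t1", 5L))
  .toDF("user_id", "track_id", "count")
val testToy = Seq(("u2", "t2", 2L), ("u3", "t1", 4L))
  .toDF("user_id", "track_id", "count")

// handleInvalid = "skip" drops rows whose id was not seen when the indexer was fitted.
val userIndexer = new StringIndexer()
  .setInputCol("user_id").setOutputCol("user_index").setHandleInvalid("skip")
val trackIndexer = new StringIndexer()
  .setInputCol("track_id").setOutputCol("track_index").setHandleInvalid("skip")
val assembler = new VectorAssembler()
  .setInputCols(Array("count")).setOutputCol("count_assembled")
val scaler = new StandardScaler()
  .setInputCol("count_assembled").setOutputCol("rating_vec")

// Fit every stage once, on the training data only.
val prep = new Pipeline()
  .setStages(Array(userIndexer, trackIndexer, assembler, scaler))
  .fit(trainToy)

// Apply the fitted stages and unpack the one-element vector into a double.
def finish(df: DataFrame): DataFrame =
  prep.transform(df)
    .withColumn("rating", vector_to_array($"rating_vec").getItem(0))
    .select("user_index", "track_index", "rating")

val trPrepared = finish(trainToy)
val tsPrepared = finish(testToy)
tsPrepared.show()
```

With this arrangement a given `user_id` maps to the same `user_index` in both splits, and the test row for the unseen user `u3` is dropped rather than indexed independently.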
In [116]:
tr_final.show()

22/07/25 01:17:18 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

In [117]:
// Implicit-feedback ALS: the scaled play counts are passed as the rating column.
val als = new ALS()
  .setMaxIter(10)
  .setRegParam(0.01)
  .setImplicitPrefs(true)
  .setUserCol("user_index")
  .setItemCol("track_index")
  .setRatingCol("rating_as_array")

// Fit on the training split and score the held-out split.
val model = als.fit(tr_final)

val predictions = model.transform(ts_final)

// RMSE between the scaled play counts and the predicted scores.
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating_as_array")
  .setPredictionCol("prediction")

val rmse = evaluator.evaluate(predictions)
println(s"Root-mean-square error = $rmse")


22/07/25 01:17:31 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

als: org.apache.spark.ml.recommendation.ALS = als_7187857ae7a2
model: org.apache.spark.ml.recommendation.ALSModel = ALSModel: uid=als_7187857ae7a2, rank=10
predictions: org.apache.spark.sql.DataFrame = [user_index: double, track_index: double ... 2 more fields]
evaluator: org.apache.spark.ml.evaluation.RegressionEvaluator = RegressionEvaluator: uid=regEval_cbae2dd1c686, metricName=rmse, throughOrigin=false
rmse: Double = 1.0764759282390737
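
For reference, the `rmse` metric is the square root of the mean squared difference between label and prediction. The plain-Scala sketch below computes it on made-up (prediction, label) pairs and skips `NaN` predictions, which ALS can emit for users or items it has not seen; the `rmseOf` helper and the `toy` values are hypothetical. Since the model is trained with `setImplicitPrefs(true)`, its predictions are preference scores rather than reconstructed counts, so the figure above is probably best read as a rough sanity check.

```scala
// Root-mean-square error over (prediction, label) pairs, ignoring NaN predictions.
def rmseOf(pairs: Seq[(Double, Double)]): Double = {
  val valid = pairs.filterNot { case (prediction, _) => prediction.isNaN }
  require(valid.nonEmpty, "no valid predictions to evaluate")
  val squaredErrors = valid.map { case (prediction, label) =>
    val err = prediction - label
    err * err
  }
  math.sqrt(squaredErrors.sum / valid.size)
}

// Made-up values: errors are -1, 0 and -2; the NaN row is skipped.
val toy = Seq((1.0, 2.0), (3.0, 3.0), (Double.NaN, 1.0), (0.0, 2.0))
println(rmseOf(toy)) // sqrt(5 / 3), roughly 1.291
```

In Spark itself, `ALS.setColdStartStrategy("drop")` removes rows with `NaN` predictions before evaluation.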


In [118]:
// Top 10 tracks for every user, and top 10 users for every track.
val userRecs = model.recommendForAllUsers(10)
val trackRecs = model.recommendForAllItems(10)

userRecs: org.apache.spark.sql.DataFrame = [user_index: int, recommendations: array<struct<track_index:int,rating:float>>]
trackRecs: org.apache.spark.sql.DataFrame = [track_index: int, recommendations: array<struct<user_index:int,rating:float>>]
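
The recommendation frames hold one array of (index, score) structs per row, which is awkward to join back to track ids. A minimal sketch of flattening that array into one row per recommendation is given below; it trains a throwaway model on made-up data so that it runs on its own, and the names `toy`, `toyModel`, `flat` and `score` are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.ml.recommendation.ALS

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("recs-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up (user_index, track_index, rating) triples: 3 users, 3 tracks.
val toy = Seq((0, 0, 3.0f), (0, 1, 1.0f), (1, 0, 2.0f), (1, 2, 4.0f), (2, 1, 5.0f))
  .toDF("user_index", "track_index", "rating")

val toyModel = new ALS()
  .setMaxIter(5)
  .setRank(2)
  .setRegParam(0.01)
  .setImplicitPrefs(true)
  .setUserCol("user_index")
  .setItemCol("track_index")
  .setRatingCol("rating")
  .setSeed(18L)
  .fit(toy)

// explode() turns each element of the recommendations array into its own row;
// the struct fields are then promoted to ordinary columns.
val flat = toyModel.recommendForAllUsers(2)
  .select(col("user_index"), explode(col("recommendations")).alias("rec"))
  .select(
    col("user_index"),
    col("rec.track_index").alias("track_index"),
    col("rec.rating").alias("score")
  )
flat.show()
```

From this flat form, the numeric indices could be joined back to the original `user_id` and `track_id` values produced by the indexing step.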
