# 1. LAST-FM: Baseline ALS Recommender

This notebook implements a simple ALS (alternating least squares) recommender on the Last.fm user listening dataset. It uses Spark and is written in Scala. Only minimal data cleaning and pre-processing is performed, so the result serves as a baseline model.

## 1.1 Imports and set up 

Key libraries are imported, the Spark session is initialised and the listening data is loaded.

In [1]:
import $ivy.`org.apache.spark::spark-sql:3.0.0` // or any other 3.x version (vector_to_array, imported below, needs Spark 3.0+)

[32mimport [39m[36m$ivy.$                                   // Or use any other 2.x version here[39m

In [2]:
import $ivy.`org.apache.spark::spark-mllib:3.0.0`

[32mimport [39m[36m$ivy.$                                    [39m

In [3]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.mllib.evaluation.{RankingMetrics, RegressionMetrics}
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._

[32mimport [39m[36morg.apache.spark.sql.SparkSession
[39m
[32mimport [39m[36morg.apache.spark.ml.feature.StringIndexer
[39m
[32mimport [39m[36morg.apache.spark.ml.Pipeline
[39m
[32mimport [39m[36morg.apache.spark.ml.recommendation.ALS
[39m
[32mimport [39m[36morg.apache.spark.ml.evaluation.RegressionEvaluator
[39m
[32mimport [39m[36morg.apache.spark.mllib.evaluation.{RankingMetrics, RegressionMetrics}
[39m
[32mimport [39m[36morg.apache.spark.ml.feature.QuantileDiscretizer
[39m
[32mimport [39m[36morg.apache.spark.ml.feature.VectorAssembler
[39m
[32mimport [39m[36morg.apache.spark.ml.functions.vector_to_array
[39m
[32mimport [39m[36morg.apache.spark.sql.types._
[39m
[32mimport [39m[36morg.apache.spark.sql.functions._
[39m
[32mimport [39m[36morg.apache.spark.sql._[39m

In [4]:
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)



val spark = {
  NotebookSparkSession.builder()
    .master("local[*]")
    .appName("lastfm")
    .getOrCreate()
}

Loading spark-stubs
Getting spark JARs
Creating SparkSession


Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties


[32mimport [39m[36morg.apache.log4j.{Level, Logger}
[39m
[36mspark[39m: [32mSparkSession[39m = org.apache.spark.sql.SparkSession@563b8a75

In [5]:
// Path to the Last.fm dataset. It can be downloaded here: http://millionsongdataset.com/lastfm/
val data_path: String = "../resources/lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv"

In [6]:
// schema defined below to set column names and types. 
val schema = new StructType()
            .add("user_id", StringType, true)
            .add("timestamp", StringType, true)
            .add("artist_id", StringType, true)
            .add("artist_name", StringType, true)
            .add("track_id", StringType, true)
            .add("track_name", StringType, true)

[36mschema[39m: [32mStructType[39m = [33mStructType[39m(
  [33mStructField[39m([32m"user_id"[39m, StringType, true, {}),
  [33mStructField[39m([32m"timestamp"[39m, StringType, true, {}),
  [33mStructField[39m([32m"artist_id"[39m, StringType, true, {}),
  [33mStructField[39m([32m"artist_name"[39m, StringType, true, {}),
  [33mStructField[39m([32m"track_id"[39m, StringType, true, {}),
  [33mStructField[39m([32m"track_name"[39m, StringType, true, {})
)

In [7]:
// read in data
val listener_data = spark.read
  .format("csv")           // short name for the built-in CSV data source
  .option("header", false) // the file has no header row
  .option("sep", "\t")     // tab-separated values
  .schema(schema)
  .load(data_path)
listener_data.show()

+-----------+--------------------+--------------------+---------------+--------------------+--------------------+
|    user_id|           timestamp|           artist_id|    artist_name|            track_id|          track_name|
+-----------+--------------------+--------------------+---------------+--------------------+--------------------+
|user_000001|2009-05-04T23:08:57Z|f1b1cf71-bd35-4e9...|      Deep Dish|                null|Fuck Me Im Famous...|
|user_000001|2009-05-04T13:54:10Z|a7f7df4a-77d8-4f1...|       坂本龍一|                null|Composition 0919 ...|
|user_000001|2009-05-04T13:52:04Z|a7f7df4a-77d8-4f1...|       坂本龍一|                null|Mc2 (Live_2009_4_15)|
|user_000001|2009-05-04T13:42:52Z|a7f7df4a-77d8-4f1...|       坂本龍一|                null|Hibari (Live_2009...|
|user_000001|2009-05-04T13:42:11Z|a7f7df4a-77d8-4f1...|       坂本龍一|                null|Mc1 (Live_2009_4_15)|
|user_000001|2009-05-04T13:38:31Z|a7f7df4a-77d8-4f1...|       坂本龍一|                null|To Stanford (Liv

[36mlistener_data[39m: [32mDataFrame[39m = [user_id: string, timestamp: string ... 4 more fields]

We remove all rows that contain null values, as deriving ratings from incomplete records would be problematic for a simple model. The timestamp column is also dropped because it does not contribute directly to the user-item matrix we are trying to build.

In [8]:
val df = listener_data.drop("timestamp").na.drop()
df.show()

+-----------+--------------------+----------------+--------------------+--------------------+
|    user_id|           artist_id|     artist_name|            track_id|          track_name|
+-----------+--------------------+----------------+--------------------+--------------------+
|user_000001|a7f7df4a-77d8-4f1...|        坂本龍一|f7c1f8f8-b935-45e...|The Last Emperor ...|
|user_000001|a7f7df4a-77d8-4f1...|        坂本龍一|475d4e50-cebb-4cd...|Tibetan Dance (Ve...|
|user_000001|ba2f4f3b-0293-4bc...|      Underworld|dc394163-2b78-4b5...|Boy, Boy, Boy (Sw...|
|user_000001|ba2f4f3b-0293-4bc...|      Underworld|340d9a0b-9a43-409...|Crocodile (Innerv...|
|user_000001|a16e47f5-aa54-47f...| Ennio Morricone|0b04407b-f517-4e0...|Ninna Nanna In Bl...|
|user_000001|463a94f1-2713-40b...|         Minus 8|4e78efc4-e545-47a...|      Elysian Fields|
|user_000001|ad0811ea-e213-451...|       Beanfield|fb51d2c4-cc69-412...|  Planetary Deadlock|
|user_000001|309e2dfc-678e-4d0...|        Dj Linus|4277434f-e3c2-41a

[36mdf[39m: [32mDataFrame[39m = [user_id: string, artist_id: string ... 3 more fields]

The dataframe is then aggregated by user and track to get the number of times each user has played a particular track. The aggregated result is ordered by user and limited to the first 5,000 rows.

In [9]:
val df_agg = df.select("user_id", "track_id")
            .groupBy("user_id", "track_id")
            .agg(count("*").alias("count")).orderBy("user_id")
val df_agg_filtered = df_agg.limit(5000)
df_agg_filtered.show()

+-----------+--------------------+-----+
|    user_id|            track_id|count|
+-----------+--------------------+-----+
|user_000001|20a5a368-3f4d-433...|   27|
|user_000001|d276b077-c05d-43c...|    4|
|user_000001|763b2ea5-3314-48c...|    2|
|user_000001|e7638bb7-bf57-435...|   14|
|user_000001|caca6626-7ba5-474...|    3|
|user_000001|19d1c947-fea8-459...|    2|
|user_000001|7b793966-abc8-423...|   10|
|user_000001|a494e993-a717-498...|    5|
|user_000001|687f53a1-b800-488...|    7|
|user_000001|d1b1d17a-87c8-410...|    4|
|user_000001|d9b7d831-e92a-4bb...|    1|
|user_000001|0024d72c-136f-49f...|    4|
|user_000001|95a65991-79d6-41d...|    4|
|user_000001|08cc9791-ac56-47d...|    6|
|user_000001|64e65892-1ab9-4c7...|   21|
|user_000001|97d2ae22-d794-49f...|   15|
|user_000001|f1400a93-16d0-452...|    3|
|user_000001|cd16ace9-2044-495...|    5|
|user_000001|c1680617-1a68-4ca...|   21|
|user_000001|d19e67ef-6ae5-446...|    3|
+-----------+--------------------+-----+
only showing top 20 rows

[36mdf_agg[39m: [32mDataset[39m[[32mRow[39m] = [user_id: string, track_id: string ... 1 more field]
[36mdf_agg_filtered[39m: [32mDataset[39m[[32mRow[39m] = [user_id: string, track_id: string ... 1 more field]

In [10]:
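// 80/20 train/test split with a fixed seed.
val Array(training, test) = df_agg_filtered.randomSplit(Array[Double](0.8, 0.2), 18)

// revisit to make more efficient

// Index the string id columns (user_id -> user_index, track_id -> track_index).
// The indexers are fitted on the full filtered data so train and test share one mapping.
val feat = df_agg_filtered.columns.filter(_.contains("id"))
val inds = feat.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName.replace("id", "index"))
    .fit(df_agg_filtered)
    .setHandleInvalid("keep") // values unseen at fit time go to one extra index
}

// Assembles the count into a vector column; its output is not used by the final selection below.
val va = new VectorAssembler()
    .setInputCols(Array("count"))
    .setOutputCol("count_assembled")

// Bucket the raw play counts into at most 5 quantile-based buckets, used as the rating.
val scaler = new QuantileDiscretizer()
  .setInputCol("count")
  .setOutputCol("rating")
  .setNumBuckets(5)

val pipeline = new Pipeline()
  .setStages(inds.toArray ++ Array(va, scaler))

// Fit once on the training split and reuse the fitted pipeline for both splits.
val pipelineModel = pipeline.fit(training)
val tr_s = pipelineModel.transform(training)
val ts_s = pipelineModel.transform(test)

// val tr_full = tr_s.withColumn("rating_as_array", vector_to_array(tr_s("rating")).getItem(0))
// val ts_full = ts_s.withColumn("rating_as_array", vector_to_array(ts_s("rating")).getItem(0))

val tr_final = tr_s.select("user_index", "track_index", "rating").orderBy("user_index")
val ts_final = ts_s.select("user_index", "track_index", "rating").orderBy("user_index")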

[36mtraining[39m: [32mDataset[39m[[32mRow[39m] = [user_id: string, track_id: string ... 1 more field]
[36mtest[39m: [32mDataset[39m[[32mRow[39m] = [user_id: string, track_id: string ... 1 more field]
[36mfeat[39m: [32mArray[39m[[32mString[39m] = [33mArray[39m([32m"user_id"[39m, [32m"track_id"[39m)
[36minds[39m: [32mArray[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32mml[39m.[32mfeature[39m.[32mStringIndexerModel[39m] = [33mArray[39m(
  StringIndexerModel: uid=strIdx_27fc685e97b0, handleInvalid=keep,
  StringIndexerModel: uid=strIdx_565c2eed98e9, handleInvalid=keep
)
[36mva[39m: [32mVectorAssembler[39m = VectorAssembler: uid=vecAssembler_79e5b8037ffc, handleInvalid=error, numInputCols=1
[36mscaler[39m: [32mQuantileDiscretizer[39m = quantileDiscretizer_9e21d341cb3b
[36mpipeline[39m: [32mPipeline[39m = pipeline_9379c28e453f
[36mtr_s[39m: [32mDataFrame[39m = [user_id: string, track_id: string ... 5 more fields]
[36mts_s[39m: [3

In [11]:
tr_final.show()

+----------+-----------+------+
|user_index|track_index|rating|
+----------+-----------+------+
|       0.0|       15.0|   3.0|
|       0.0|       74.0|   2.0|
|       0.0|       22.0|   2.0|
|       0.0|       23.0|   1.0|
|       0.0|       24.0|   2.0|
|       0.0|       25.0|   4.0|
|       0.0|       26.0|   4.0|
|       0.0|       29.0|   2.0|
|       0.0|       36.0|   4.0|
|       0.0|       38.0|   2.0|
|       0.0|       41.0|   4.0|
|       0.0|       42.0|   3.0|
|       0.0|       45.0|   4.0|
|       0.0|       53.0|   1.0|
|       0.0|       60.0|   1.0|
|       0.0|       61.0|   1.0|
|       0.0|       63.0|   2.0|
|       0.0|       70.0|   1.0|
|       0.0|       71.0|   4.0|
|       0.0|       72.0|   4.0|
+----------+-----------+------+
only showing top 20 rows



In [12]:
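// ALS on the bucketed play counts, treated as implicit feedback.
val als = new ALS()
  .setRank(5)
  .setUserCol("user_index")
  .setImplicitPrefs(true)
  .setItemCol("track_index")
  .setRatingCol("rating")

val model = als.fit(tr_final)
// Drop rows whose user or track has no learned factors, so NaN predictions do not enter the metric.
model.setColdStartStrategy("drop")

val predictions = model.transform(ts_final)

// Note: with implicitPrefs = true the predictions are preference scores rather than values
// on the rating scale, so this RMSE is only a rough baseline figure.
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")

val rmse = evaluator.evaluate(predictions)
println(s"Root-mean-square error = $rmse")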


22/07/29 18:06:25 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
22/07/29 18:06:25 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS


22/07/29 18:06:25 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
22/07/29 18:06:25 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK


Root-mean-square error = 1.8399802861614443


[36mals[39m: [32mALS[39m = als_6902f9e866ad
[36mmodel[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32mml[39m.[32mrecommendation[39m.[32mALSModel[39m = ALSModel: uid=als_6902f9e866ad, rank=5
[36mres11_2[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32mml[39m.[32mrecommendation[39m.[32mALSModel[39m = ALSModel: uid=als_6902f9e866ad, rank=5
[36mpredictions[39m: [32mDataFrame[39m = [user_index: double, track_index: double ... 2 more fields]
[36mevaluator[39m: [32mRegressionEvaluator[39m = RegressionEvaluator: uid=regEval_ae81e2e2ea9b, metricName=rmse, throughOrigin=false
[36mrmse[39m: [32mDouble[39m = [32m1.8399802861614443[39m

In [13]:
predictions.show()

+----------+-----------+------+------------+
|user_index|track_index|rating|  prediction|
+----------+-----------+------+------------+
|       0.0|        1.0|   1.0|0.0013718307|
|       1.0|        6.0|   4.0|0.0012261495|
|       0.0|        9.0|   1.0|0.0014073029|
|       1.0|        7.0|   1.0|0.0011952445|
|       1.0|        0.0|   3.0|0.0012261495|
|       0.0|     4985.0|   4.0|    1.364928|
|       0.0|     4985.0|   1.0|    1.364928|
|       0.0|     4985.0|   3.0|    1.364928|
|       0.0|     4985.0|   3.0|    1.364928|
|       0.0|     4985.0|   3.0|    1.364928|
|       0.0|     4985.0|   1.0|    1.364928|
|       0.0|     4985.0|   2.0|    1.364928|
|       0.0|     4985.0|   2.0|    1.364928|
+----------+-----------+------+------------+
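
The predictions in the table above sit far from the rating scale, which helps explain the large RMSE. As a minimal sketch of how the metric is computed, the snippet below applies the RMSE formula to the first five (rating, prediction) pairs copied from that table. `RmseSketch` and its members are hypothetical names, and the sample is too small to reproduce the reported figure.

```scala
// Plain-Scala RMSE over (label, prediction) pairs.
// Illustrative only; the notebook itself uses Spark's RegressionEvaluator.
object RmseSketch {
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (label, prediction) pair")
    val squaredErrors = pairs.map { case (label, pred) => math.pow(label - pred, 2) }
    math.sqrt(squaredErrors.sum / pairs.size)
  }

  // First five (rating, prediction) rows from the predictions table.
  val sample: Seq[(Double, Double)] = Seq(
    (1.0, 0.0013718307),
    (4.0, 0.0012261495),
    (1.0, 0.0014073029),
    (1.0, 0.0011952445),
    (3.0, 0.0012261495)
  )

  val sampleRmse: Double = rmse(sample)
}
```

Since the model is trained with implicit preferences, most predictions stay close to zero while the labels are bucket indices, so the error is dominated by the size of the labels themselves. A ranking metric may be a better fit for this setup.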



In [None]:
// Top 10 track recommendations for every user.
val userRecs = model.recommendForAllUsers(10)
// Top 10 user recommendations for every track.
val trackRecs = model.recommendForAllItems(10)
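
The recommendations are expressed in index values rather than the original ids. A hedged sketch of mapping an index back to its id is shown below. It assumes a label array in which position i holds the id for index i, such as the array exposed by a fitted `StringIndexerModel` through `labels`. `IndexLookupSketch` and `toLabel` are hypothetical names.

```scala
// Map a numeric index back to its original string id.
// Illustrative sketch; the label array here is made up.
object IndexLookupSketch {
  // Returns None for indices without a label, e.g. the extra "keep" bucket
  // that the indexer assigns to values it did not see when it was fitted.
  def toLabel(labels: Array[String], index: Int): Option[String] =
    if (index >= 0 && index < labels.length) Some(labels(index)) else None

  // Made-up labels standing in for real track ids.
  val toyLabels: Array[String] = Array("track_a", "track_b", "track_c")
}
```

With `handleInvalid` set to `"keep"`, unseen values receive the index equal to the number of labels, which is why the lookup returns an `Option` instead of indexing the array directly.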