<h1>Collaborative filtering</h1>

[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering) is often used for recommender systems.

<tt>spark.mllib</tt> uses the alternating least squares (ALS) algorithm to fill in the missing entries of a user-item association matrix. The following parameters are available:

- *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure).
- *rank* is the number of latent factors in the model.
- *iterations* is the number of iterations of ALS to run.
- *lambda* specifies the regularization parameter in ALS.
- *implicitPrefs* specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
- *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
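
The parameters above correspond to setter methods on the `ALS` class in <tt>spark.mllib</tt>. The fragment below is a minimal sketch of setting them explicitly, assuming Spark is already on the classpath; the values are illustrative rather than recommendations, and `ratings` stands for an `RDD[Rating]` that is assumed to exist.

```scala
import org.apache.spark.mllib.recommendation.ALS

// Illustrative values only; tune them for your own data.
val als = new ALS()
  .setBlocks(-1)            // numBlocks: -1 auto-configures the parallelism
  .setRank(10)              // rank: number of latent factors
  .setIterations(10)        // iterations: number of ALS iterations
  .setLambda(0.01)          // lambda: regularization parameter
  .setImplicitPrefs(false)  // implicitPrefs: false selects the explicit feedback variant
  .setAlpha(1.0)            // alpha: only relevant when implicitPrefs is true

// val model = als.run(ratings)  // ratings: RDD[Rating], assumed to be defined elsewhere
```

The convenience method `ALS.train(...)` sets a subset of these in one call; the builder form is mainly useful when you want to vary one parameter at a time.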

We start up in the usual way and carry out the relevant imports.

In [None]:
val sparkVersion = "2.0.1"
val scalaVersion = scala.util.Properties.versionNumberString

In [None]:
classpath.add(
    "org.apache.spark" %% "spark-yarn" % sparkVersion,
    "org.apache.spark" %% "spark-mllib" % sparkVersion
)

In [None]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.util.MLUtils

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

In the cell below, we present a small example of [collaborative filtering](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#collaborative-filtering) with the data taken from the [MovieLens](http://grouplens.org/datasets/movielens/) project. In this notebook, we use the old 100k dataset (available to Jupyter at <tt>files/ml.data</tt>).

The dataset looks like this:

    196     242     3       881250949
    186     302     3       891717742
    22      377     1       878887116
    244     51      2       880606923
    ...

Each line is a tab-separated record with the following fields:
    
    user id | item id | rating | timestamp 
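
To make the record layout concrete, here is a small sketch in plain Scala (no Spark required) that parses one such line. The `RawRating` case class and the `parseLine` helper are hypothetical names introduced for illustration only; they are not part of Spark or of this notebook's code.

```scala
import scala.util.Try

// Hypothetical record type for one line of the file (not part of Spark).
case class RawRating(user: Int, item: Int, rating: Double, timestamp: Long)

// Parse one tab-separated line; returns None if the line is malformed.
def parseLine(line: String): Option[RawRating] =
  line.split("\t") match {
    case Array(user, item, rating, timestamp) =>
      Try(RawRating(user.toInt, item.toInt, rating.toDouble, timestamp.toLong)).toOption
    case _ => None
  }

// The first line of the sample shown above, with explicit tab separators.
val sample = "196\t242\t3\t881250949"
println(parseLine(sample))
```

Returning an `Option` is one way to skip malformed lines instead of failing on them, for example by using `flatMap` in place of `map` when parsing a whole file.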

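We use the default [ALS](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS).train() method, which assumes ratings are explicit. The mean squared error (MSE) of the rating predictions is used to evaluate the recommendation model. Note that the cell below computes it on the same ratings the model was trained on, so it measures training error rather than performance on unseen data.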
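
For reference, the error can be written out as follows. The symbols are introduced here for illustration: $\Omega$ is the set of (user, product) pairs that have both an observed rating $r_{ui}$ and a predicted rating $\hat{r}_{ui}$.

$$
\mathrm{MSE} = \frac{1}{|\Omega|} \sum_{(u,i) \in \Omega} \left( r_{ui} - \hat{r}_{ui} \right)^2
$$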

In [None]:
val sparkSession = SparkSession.builder
  .master("local[1]")
  .appName("Collaborative filtering")
  .getOrCreate()

val sc = sparkSession.sparkContext

// Load and parse the data
val data = sc.textFile("files/ml.data")
val ratings = data.map(_.split("\t") match { case Array(user, item, rate, timestamp) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
    (user, product)
}

val predictions =
    model.predict(usersProducts).map { case Rating(user, product, rate) =>
        ((user, product), rate)
}

val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
    ((user, product), rate)
}.join(predictions)

val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
    val err = (r1 - r2)
    err * err
}.mean()

println("Mean Squared Error = " + MSE)

<h1>Exercises</h1>

<h2>Exercise 1</h2>

Create a standalone program that carries out collaborative filtering. Run this on the [MovieLens](http://grouplens.org/datasets/movielens/) 10M dataset (available from the link provided).

<h2>Exercise 2</h2>

Use 10-fold cross-validation (with an 80% training and 20% testing split) to find the average mean absolute (or squared) error on your test data. Keep your program as parallel as possible. You can create your splits randomly (or any other way you choose!), and don't forget who has access to which variables and who doesn't...
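
As a starting point for the error metrics only, the sketch below computes both on a small local collection of (actual, predicted) pairs. The values and the helper names `meanSquaredError` and `meanAbsoluteError` are made up for illustration. It is not a solution to the exercise, where the same per-pair computation would run on an RDD and then be averaged across splits.

```scala
// Pairs of (actual rating, predicted rating); values are invented for illustration.
val pairs = Seq((3.0, 2.5), (1.0, 2.0), (4.0, 4.0))

// Mean of the squared differences between actual and predicted ratings.
def meanSquaredError(ps: Seq[(Double, Double)]): Double =
  ps.map { case (actual, predicted) => math.pow(actual - predicted, 2) }.sum / ps.size

// Mean of the absolute differences between actual and predicted ratings.
def meanAbsoluteError(ps: Seq[(Double, Double)]): Double =
  ps.map { case (actual, predicted) => math.abs(actual - predicted) }.sum / ps.size

println("MSE = " + meanSquaredError(pairs))
println("MAE = " + meanAbsoluteError(pairs))
```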