## Recommendations with Spark ALS

This notebook assumes nothing about your Jupyter installation; it does not need to be configured to talk to PySpark. Set `spark_home` in the first code cell below. You may also need to set some environment variables before launching Jupyter:

    export SPARK_HOME=/where/spark/lives
    export PYSPARK_PYTHON=python3
    export PYSPARK_DRIVER_PYTHON=python3
    jupyter notebook
    


In [None]:
# set this to point to your spark installation
spark_home = "/srv/spark"

from glob import glob
import sys, os
spark_python = os.path.join(spark_home, 'python')
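# locate the py4j zip bundled with Spark; fail with a clear message if spark_home is wrong
py4j_zips = glob(os.path.join(spark_python, 'lib', 'py4j-*.zip'))
if not py4j_zips:
    raise RuntimeError("no py4j zip found under %s; check spark_home" % spark_python)
py4j = py4j_zips[0]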
sys.path[:0] = [spark_python, py4j]
import pyspark

In [None]:
sc = pyspark.SparkContext("local[*]")

In [None]:
from pyspark.mllib.recommendation import ALS
from pyspark.mllib.recommendation import Rating

In [None]:
def expand_user(a, user):
    # one Rating per movie this user has rated; 0 means "not rated" and is skipped
    return [Rating(user, item, rating) for item, rating in enumerate(a) if rating != 0]

In [None]:
def expand_all(a):
    # one list of Ratings per user (row of the matrix)
    return [expand_user(items, user) for user, items in enumerate(a)]
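
To see what these helpers produce without starting Spark, here is a minimal sketch that uses a `namedtuple` as a stand-in for `pyspark.mllib.recommendation.Rating`. The stand-in and the two-user matrix `tiny` are assumptions for illustration only. Zeros are treated as "not rated" and dropped, so only the observed entries become (user, product, rating) triples.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating (assumption: same field order).
Rating = namedtuple("Rating", ["user", "product", "rating"])

def expand_user(a, user):
    # keep only the non-zero (observed) ratings for this user
    return [Rating(user, item, rating) for item, rating in enumerate(a) if rating != 0]

def expand_all(a):
    # one list of Ratings per user (row)
    return [expand_user(items, user) for user, items in enumerate(a)]

# Hypothetical 2-user, 3-movie matrix.
tiny = [
    [5, 0, 1],
    [0, 4, 0],
]
expanded = expand_all(tiny)
for row in expanded:
    print(row)
```

User 0 yields two ratings and user 1 yields one; the zero entries produce nothing.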

### Here we have ratings from eight users for six movies: Titanic, Dirty Dancing, Die Hard, Terminator 2, Wayne's World, and Zoolander. In other words, two romantic films, two action films, and two comedies. Each row is a user and each column is a movie. A rating of 0 means the user has not rated that movie.

### The ratings are constructed so that if a user has seen both movies in one of these pairs, their ratings for the two movies are similar.

### There is no evidence in this data that anyone likes all three film genres.

In [None]:
rawdata = [
    [5,5,0,0,0,0],
    [0,0,5,5,0,0],
    [0,0,0,0,5,5],
    [0,1,5,5,5,0],
    [1,1,5,0,5,5],
    [5,5,0,5,1,1],
    [5,0,0,5,0,1],
    [5,5,5,0,1,0]
    ]
list_of_ratings = expand_all(rawdata)
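
The claim that ratings within a genre pair are similar can be checked directly. The sketch below repeats `rawdata` so that it runs on its own; `genre_pairs` and `differences` are names introduced here for illustration. It collects the absolute difference between the two ratings in each pair, but only where the user has rated both movies.

```python
rawdata = [
    [5, 5, 0, 0, 0, 0],
    [0, 0, 5, 5, 0, 0],
    [0, 0, 0, 0, 5, 5],
    [0, 1, 5, 5, 5, 0],
    [1, 1, 5, 0, 5, 5],
    [5, 5, 0, 5, 1, 1],
    [5, 0, 0, 5, 0, 1],
    [5, 5, 5, 0, 1, 0],
]

# Column indices of each genre pair: romance, action, comedy.
genre_pairs = [(0, 1), (2, 3), (4, 5)]

# Rating gap within a pair, only where the user rated both movies.
differences = [
    abs(row[a] - row[b])
    for row in rawdata
    for a, b in genre_pairs
    if row[a] != 0 and row[b] != 0
]
print(len(differences), max(differences))
```

A maximum difference of zero means every user who rated both movies in a pair gave them the same score.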

In [None]:
# construct an RDD of Ratings for every non-zero rating
ratings = [val for sublist in list_of_ratings for val in sublist]
ratingsRDD = sc.parallelize(ratings)
ratingsRDD.take(5)

In [None]:
rank = 2
numIterations = 20
als_lambda = 0.1
model = ALS.train(ratingsRDD, rank, numIterations, als_lambda, seed=4242, nonnegative=True)
# there is also a trainImplicit method that one uses when
# working with implicit ratings (it uses a different cost function)

In [None]:
# here we see the model's vector of features for each user
users = model.userFeatures().collect()
sorted(users, key=lambda x: x[0])

In [None]:
# and the features for the "products"
products = model.productFeatures().collect()
sorted(products, key=lambda x: x[0])
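
Each predicted rating is the dot product of a user's feature vector and a product's feature vector. The sketch below shows that arithmetic in plain Python. `predict_rating` is a hypothetical helper, and the feature values are made up for illustration; they are not output from the model above.

```python
def predict_rating(user_vec, product_vec):
    # dot product of the two latent-feature vectors
    return sum(u * p for u, p in zip(user_vec, product_vec))

# Hypothetical rank-2 feature vectors (not taken from the trained model).
user_vec = [2.0, 0.5]
product_vec = [2.0, 1.0]

predicted = predict_rating(user_vec, product_vec)
print(predicted)  # 2.0*2.0 + 0.5*1.0 = 4.5
```

In the notebook, applying the same calculation to a pair of vectors from `users` and `products` should agree with `model.predict(user, product)` up to floating-point rounding.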

In [None]:
# recommend 3 items for user 2
model.recommendProducts(2, 3)

### Display the original matrix side-by-side with the reconstructed matrix. The values that were originally non-zero should be closely approximated, and the values that were zero (empty) now have predictions.

In [None]:
print(" original      reconstructed")
for user in range(len(rawdata)):
    for product in range(len(rawdata[0])):
        sys.stdout.write("%d " % rawdata[user][product])
    sys.stdout.write("    ")
    for product in range(len(rawdata[0])):
        sys.stdout.write("%0.0f " % model.predict(user, product))
    print(" ")

In [None]:
print(" original         errors        predictions")
for user in range(len(rawdata)):
    # column 1: the original ratings
    for product in range(len(rawdata[0])):
        sys.stdout.write("%d " % rawdata[user][product])
    sys.stdout.write("    ")
    # column 2: predictions that round to something other than the known rating
    for product in range(len(rawdata[0])):
        if rawdata[user][product] != 0:
            prediction = model.predict(user, product)
            if rawdata[user][product] != round(prediction, 0):
                sys.stdout.write("%0.0f " % prediction)
            else:
                sys.stdout.write("- ")
        else:
            sys.stdout.write("- ")
    sys.stdout.write("    ")
    # column 3: predictions for the entries that had no rating
    for product in range(len(rawdata[0])):
        if rawdata[user][product] == 0:
            prediction = model.predict(user, product)
            sys.stdout.write("%0.0f " % prediction)
        else:
            sys.stdout.write("- ")
    print(" ")

### Compute the mean squared error of the reconstructed matrix over the ratings that were actually given (the non-zero entries). This can help decide whether the rank is sufficiently large.

In [None]:
evalRDD = ratingsRDD.map(lambda p: (p[0], p[1]))
evalRDD.take(5)

In [None]:
predictions = model.predictAll(evalRDD).map(lambda r: ((r[0], r[1]), r[2]))
predictions.take(5)

In [None]:
ratingsAndPreds = ratingsRDD.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
ratingsAndPreds.take(5)

In [None]:
ratingsAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
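
The join-then-average steps above amount to the following calculation, sketched here in plain Python on a few made-up (user, product) pairs. The dictionaries `actual` and `predicted` are hypothetical stand-ins for `ratingsRDD` and `predictions`.

```python
# Hypothetical observed ratings and model predictions, keyed by (user, product).
actual = {(0, 0): 5.0, (0, 1): 5.0, (1, 2): 1.0}
predicted = {(0, 0): 4.5, (0, 1): 5.0, (1, 2): 2.0}

# "Join" on the shared keys, then average the squared differences.
squared_errors = [(actual[k] - predicted[k]) ** 2 for k in actual if k in predicted]
mse = sum(squared_errors) / len(squared_errors)
print(mse)
```

The squared errors here are 0.25, 0 and 1, so the mean is 1.25 / 3.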

With a larger dataset we would split the rating data into training and test sets, train on the former, and see how well the predicted ratings match the held-out test ratings.
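
One way to make such a split is sketched below in plain Python, under the assumption that the ratings fit in a list. `split_ratings` is a hypothetical helper, not part of Spark, and the triples are made up. On an RDD, `randomSplit` plays a similar role.

```python
import random

def split_ratings(ratings, test_fraction=0.2, seed=4242):
    """Shuffle a copy of the ratings and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (user, product, rating) triples.
ratings = [(u, p, float((u + p) % 5 + 1)) for u in range(4) for p in range(5)]
train, test = split_ratings(ratings)
print(len(train), len(test))
```

The model would then be trained on `train` only, and the mean squared error computed on `test`.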

### Questions

How does lambda affect the results?

* Try setting lambda to 0.01 (this is the default in some versions of Spark).
* Can you get good results? What if you increase the rank?
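
A small grid search makes these comparisons systematic. The sketch below is one assumption about how to organize it: `sweep` and `score_fn` are hypothetical names, and `fake_score` is a placeholder so that the example runs without Spark. In the notebook, `score_fn` would wrap the `ALS.train` call and the mean squared error cells.

```python
from itertools import product

def sweep(score_fn, ranks, lambdas):
    """Call score_fn(rank, lam) for every combination; return results sorted by score."""
    results = [(score_fn(rank, lam), rank, lam) for rank, lam in product(ranks, lambdas)]
    return sorted(results)

def fake_score(rank, lam):
    # Placeholder only: a made-up function, NOT a real model error.
    return round(abs(rank - 3) + abs(lam - 0.1), 6)

results = sweep(fake_score, ranks=[1, 2, 3, 4], lambdas=[0.01, 0.1, 1.0])
best_score, best_rank, best_lambda = results[0]
print(best_rank, best_lambda)
```

Sorting by score puts the lowest-error combination first, so `results[0]` is the best setting under whatever `score_fn` measures.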

What happens as you increase the rank?

How sensitive are the results to the random seed?

What would happen if one movie was universally loved, or hated?

What happens if you remove some of the rating data?
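
To experiment with the last question, one option is to blank out a random subset of the known ratings before calling `expand_all`. `drop_ratings` below is a hypothetical helper sketched for this purpose; it assumes the same convention as `rawdata`, where 0 means "not rated". The `example` matrix is made up.

```python
import random

def drop_ratings(matrix, fraction, seed=0):
    """Return a copy of matrix with about `fraction` of its non-zero entries set to 0."""
    rng = random.Random(seed)
    # positions of all known (non-zero) ratings
    known = [(u, p) for u, row in enumerate(matrix) for p, r in enumerate(row) if r != 0]
    n_drop = int(len(known) * fraction)  # rounds down
    dropped = set(rng.sample(known, n_drop))
    return [[0 if (u, p) in dropped else r for p, r in enumerate(row)]
            for u, row in enumerate(matrix)]

# Hypothetical 3-user, 4-movie matrix with 7 known ratings.
example = [
    [5, 5, 0, 0],
    [0, 1, 5, 5],
    [1, 0, 5, 0],
]
thinned = drop_ratings(example, fraction=0.5)
print(thinned)
```

Retraining on `expand_all(thinned)` and comparing the predictions against the removed entries shows how much the model depends on each rating.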