The two most common types of recommender systems are content-based filtering and collaborative filtering (CF).

Collaborative filtering produces recommendations from what is known about users' attitudes toward items; that is, it uses the "wisdom of the crowd" to recommend items. In general, collaborative filtering is used more often than content-based systems because it usually gives better results and is relatively easy to understand from an implementation perspective.

Content-based recommender systems focus on the attributes of the items and give you recommendations based on how similar those attributes are from one item to another.
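
To make the content-based idea concrete, here is a minimal sketch in plain Python that ranks items by the cosine similarity of their attribute vectors. The item names and genre flags are hypothetical and are not part of the MovieLens file used below; cosine similarity is one common choice of similarity measure, not the only one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an item with no attributes set is treated as unrelated
    return dot / (norm_a * norm_b)

# Hypothetical genre flags: [action, comedy, romance, sci-fi]
items = {
    "movie_a": [1, 0, 0, 1],
    "movie_b": [1, 0, 0, 0],
    "movie_c": [0, 1, 1, 0],
}

def most_similar(target, catalog):
    """Rank every other item by similarity to `target`, best first."""
    scores = [
        (name, cosine_similarity(catalog[target], vec))
        for name, vec in catalog.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = most_similar("movie_a", items)
print(ranked)
```

Here movie_b ranks first because it shares the action flag with movie_a, while movie_c shares no attributes and scores 0.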

In [1]:
# Boilerplate: point findspark at the local Spark install, then start a session
import findspark
import numpy as np
findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')  # path to this machine's Spark install
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('recsys').getOrCreate()

In [2]:
from pyspark.ml.recommendation import ALS

In [3]:
from pyspark.ml.evaluation import RegressionEvaluator

In [5]:
data = spark.read.csv('movielens_ratings.csv', header=True, inferSchema=True)

In [6]:
data.show()

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|      2|   3.0|     0|
|      3|   1.0|     0|
|      5|   2.0|     0|
|      9|   4.0|     0|
|     11|   1.0|     0|
|     12|   2.0|     0|
|     15|   1.0|     0|
|     17|   1.0|     0|
|     19|   1.0|     0|
|     21|   1.0|     0|
|     23|   1.0|     0|
|     26|   3.0|     0|
|     27|   1.0|     0|
|     28|   1.0|     0|
|     29|   1.0|     0|
|     30|   1.0|     0|
|     31|   1.0|     0|
|     34|   1.0|     0|
|     37|   1.0|     0|
|     41|   2.0|     0|
+-------+------+------+
only showing top 20 rows



In [7]:
data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



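Split the data into a training set and a test set. Keep in mind that recommender systems can be pretty hard to evaluate, because a held-out score only measures how well the model reproduces ratings that users already gave.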

In [8]:
training, test = data.randomSplit([0.8, 0.2])
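
Because the split is random, rerunning the notebook produces a different training and test set, so the numbers below will not reproduce exactly. The following plain-Python sketch illustrates the idea of a seeded, per-row random split. It is an illustration of the concept under that assumption, not Spark's actual implementation, and the seed value 42 is arbitrary.

```python
import random

def random_split(rows, train_fraction=0.8, seed=None):
    """Assign each row to train or test independently, using a seeded RNG."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1501))  # stand-in for the 1501 rating rows

# The same seed gives the same split on every run.
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

# Sizes are only approximately 80/20 because each row is assigned independently.
print(len(train_a), len(test_a))
```

In the notebook itself, the equivalent would be passing the optional `seed` argument, for example `data.randomSplit([0.8, 0.2], seed=42)`.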

In [13]:
# Create the Alternating Least Squares (ALS) estimator
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating')
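
For intuition about what ALS does, here is a minimal plain-Python sketch with a single latent factor per user and per movie. It alternates between two closed-form least-squares updates: solve for the user factors with the movie factors held fixed, then the reverse. This is a simplified illustration, not Spark's implementation: Spark's ALS learns several factors per user and item (its `rank` parameter) and handles regularization and distribution differently. The toy ratings and the `als_rank1` and `loss` helpers are hypothetical.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iters=5):
    """Rank-1 alternating least squares on (user, item, rating) triples."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iters):
        # Hold item factors fixed; each user factor has a closed-form solution.
        num = [0.0] * n_users
        den = [reg] * n_users
        for u, i, r in ratings:
            num[u] += r * item_f[i]
            den[u] += item_f[i] ** 2
        user_f = [n / d for n, d in zip(num, den)]
        # Hold user factors fixed; solve for each item factor the same way.
        num = [0.0] * n_items
        den = [reg] * n_items
        for u, i, r in ratings:
            num[i] += r * user_f[u]
            den[i] += user_f[u] ** 2
        item_f = [n / d for n, d in zip(num, den)]
    return user_f, item_f

def loss(ratings, user_f, item_f, reg):
    """Squared error on observed ratings plus the L2 penalty on all factors."""
    err = sum((r - user_f[u] * item_f[i]) ** 2 for u, i, r in ratings)
    penalty = reg * (sum(x * x for x in user_f) + sum(y * y for y in item_f))
    return err + penalty

# Hypothetical toy data: (userId, movieId, rating) triples.
toy = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 5.0), (2, 2, 2.0)]

user_f, item_f = als_rank1(toy, n_users=3, n_items=3)

# Predicted rating for a pair that is missing from the data: user 0, movie 2.
print(user_f[0] * item_f[2])
print(loss(toy, user_f, item_f, reg=0.01))
```

Each half-step exactly minimizes the regularized loss over one set of factors, so the loss never increases from one iteration to the next.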

In [14]:
model = als.fit(training)

In [15]:
predictions = model.transform(test)

In [16]:
predictions.show()

+-------+------+------+------------+
|movieId|rating|userId|  prediction|
+-------+------+------+------------+
|     31|   1.0|    29|     1.44495|
|     31|   1.0|    18|  0.33091593|
|     85|   1.0|    28| -0.06769058|
|     85|   2.0|    20|   2.0418518|
|     85|   1.0|     4|    2.891379|
|     65|   1.0|    28|    0.864897|
|     65|   2.0|    15|   1.4939982|
|     65|   5.0|    23|   1.1819575|
|     53|   1.0|    12|   1.4822946|
|     53|   3.0|    13|   2.3080106|
|     53|   1.0|     9|   2.2862818|
|     53|   1.0|    23|  -0.5441834|
|     53|   1.0|     7|   3.3260317|
|     78|   1.0|    19|   1.1040903|
|     34|   1.0|    15|   1.1419798|
|     34|   4.0|     2|  -0.7178331|
|     81|   1.0|    22|   1.2614721|
|     81|   1.0|     6|-0.036242664|
|     81|   1.0|    15| -0.07101822|
|     28|   1.0|    27| -0.17232063|
+-------+------+------+------------+
only showing top 20 rows



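User 29 saw movie 31 and gave it a rating of 1. Our model predicted that user 29 would give movie 31 a rating of about 1.4.

Keep in mind that this is a pretty small data set (1,501 ratings in total), so the predictions are noisy. Some even fall outside the 1 to 5 rating scale, such as the negative values above, because ALS does not constrain its output to that range.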

In [17]:
evaluator = RegressionEvaluator(metricName='rmse',labelCol='rating',predictionCol='prediction')

In [18]:
# How far off were our predictions from the actual ratings?
rmse = evaluator.evaluate(predictions)

In [19]:
print("RMSE")
print(rmse)

RMSE
1.8151677777581123


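This is a pretty bad RMSE. Ratings only range from 1 to 5, and the error is larger than the standard deviation of the ratings (about 1.19), so simply predicting the mean rating for everyone would likely do better. The small dataset is largely to blame.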
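
To make the metric concrete, here is a small plain-Python sketch of the RMSE calculation, applied to the first five rows of the predictions table above and compared with a baseline that always predicts the mean rating. Five rows are far too few to judge the model; the point is only to show the arithmetic. The `rmse` helper is a hypothetical stand-in for what `RegressionEvaluator` computes with `metricName='rmse'`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# First five rows of the predictions table above: (rating, prediction).
rows = [
    (1.0, 1.44495),
    (1.0, 0.33091593),
    (1.0, -0.06769058),
    (2.0, 2.0418518),
    (1.0, 2.891379),
]
actual = [r for r, _ in rows]
predicted = [p for _, p in rows]

model_rmse = rmse(actual, predicted)

# Baseline: always predict the mean of the ratings being scored.
mean_rating = sum(actual) / len(actual)
baseline_rmse = rmse(actual, [mean_rating] * len(actual))

print(model_rmse, baseline_rmse)
```

A model is only useful if its RMSE is clearly below that of such a baseline.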

In [25]:
single_user = test.filter(test['userId']==11).select(['movieId','userId'])

In [26]:
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|     22|    11|
|     35|    11|
|     39|    11|
|     59|    11|
|     70|    11|
|     75|    11|
|     80|    11|
|     82|    11|
|     97|    11|
+-------+------+



These are the movies user 11 rated in the test set, with the ratings column dropped. Let's predict which of them they'd like...

In [27]:
recommendations = model.transform(single_user)

In [28]:
recommendations.orderBy('prediction',ascending=False).show()

+-------+------+----------+
|movieId|userId|prediction|
+-------+------+----------+
|     39|    11|  4.401478|
|     80|    11| 3.1656933|
|     70|    11| 2.8276863|
|     97|    11| 1.9877481|
|     59|    11| 1.4024693|
|     82|    11| 1.3245432|
|     22|    11| 1.1847949|
|     75|    11| 1.0997206|
|     35|    11|-1.0592866|
+-------+------+----------+



Based on our recommender system, user 11 might enjoy movie 39, which has the highest predicted rating (about 4.4).
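
Turning predictions into a short recommendation list is a matter of sorting and cutting off. The sketch below does this in plain Python with the values from the table above; the `top_n` helper and its optional `min_score` threshold are hypothetical, added only for illustration.

```python
def top_n(predictions, n=3, min_score=None):
    """Return the n (movieId, prediction) pairs with the highest prediction."""
    ranked = sorted(predictions, key=lambda pair: pair[1], reverse=True)
    if min_score is not None:
        # Optionally drop movies the model expects the user to dislike.
        ranked = [pair for pair in ranked if pair[1] >= min_score]
    return ranked[:n]

# (movieId, prediction) pairs for user 11, copied from the table above.
user_11 = [
    (39, 4.401478), (80, 3.1656933), (70, 2.8276863),
    (97, 1.9877481), (59, 1.4024693), (82, 1.3245432),
    (22, 1.1847949), (75, 1.0997206), (35, -1.0592866),
]

print(top_n(user_11, n=3))
# [(39, 4.401478), (80, 3.1656933), (70, 2.8276863)]
```

Note that these nine movies are ones user 11 has already rated in the test set. A real recommender would score movies the user has not seen yet.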