# Recommender Code Along

[MovieLens data set](https://grouplens.org/datasets/movielens/)

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('recSystem').getOrCreate()

Collaborative filtering predicts (filters) a user's interests by collecting preference information from many users (collaborating). New predictions are built from the existing ratings of other users whose ratings are similar to those of the active user. For example:

<img src="https://upload.wikimedia.org/wikipedia/commons/5/52/Collaborative_filtering.gif" />
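
To make the idea concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. The users, items, and ratings are invented for illustration, and cosine similarity is only one common choice of similarity measure. This neighborhood approach is not what Spark's ALS does internally (ALS learns latent factors instead); it only illustrates the "similar users" intuition described above.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}
ratings = {
    "alice": {"A": 5.0, "B": 4.0, "C": 1.0},
    "bob":   {"A": 4.0, "B": 5.0, "C": 1.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 2.0, "C": 5.0, "D": 1.0},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine_similarity(ratings[user], their)
        num += sim * their[item]
        den += abs(sim)
    return num / den if den else None

# alice has not rated item D; bob (similar taste) liked it, carol did not
print(predict("alice", "D"))
```

Because alice's ratings resemble bob's much more than carol's, the prediction lands closer to bob's rating for item D than to carol's.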

Spark's MLlib machine learning library implements collaborative filtering with Alternating Least Squares (ALS). The implementation has the parameters below. Where the DataFrame-based `pyspark.ml.recommendation.ALS` class used in this notebook spells a name differently, that name is given in parentheses:

* numBlocks (`numUserBlocks`, `numItemBlocks`) is the number of blocks used to parallelize computation (in the RDD-based API, set to -1 to auto-configure).
* rank is the number of latent factors in the model.
* iterations (`maxIter`) is the number of iterations to run.
* lambda (`regParam`) specifies the regularization parameter in ALS.
* implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
* alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
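
As a rough illustration of what rank, iterations, and lambda control, the following NumPy sketch runs alternating least squares on a tiny made-up rating matrix. It is a simplified teaching version, not Spark's implementation: the function name `als_explicit`, the toy matrix, and the convention that 0 means "not rated" are all assumptions made here.

```python
import numpy as np

def als_explicit(R, mask, rank=2, iterations=10, lam=0.1, seed=0):
    """Approximate R by U @ V.T using only the observed entries (mask == True)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))   # user latent factors
    V = rng.normal(scale=0.1, size=(n_items, rank))   # item latent factors
    reg = lam * np.eye(rank)                          # regularization term
    for _ in range(iterations):
        # Hold V fixed: each user's factors solve a small ridge regression
        for u in range(n_users):
            Vu = V[mask[u]]                           # items this user rated
            U[u] = np.linalg.solve(Vu.T @ Vu + reg, Vu.T @ R[u, mask[u]])
        # Hold U fixed: each item's factors solve the mirror-image problem
        for i in range(n_items):
            Ui = U[mask[:, i]]                        # users who rated this item
            V[i] = np.linalg.solve(Ui.T @ Ui + reg, Ui.T @ R[mask[:, i], i])
    return U, V

# Hypothetical 4 users x 4 movies matrix; 0 marks a missing rating
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
mask = R > 0

U, V = als_explicit(R, mask, rank=2, iterations=20, lam=0.1)
pred = U @ V.T
train_rmse = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
print(np.round(pred, 2))
print("RMSE on observed entries:", train_rmse)
```

The entries of `pred` at the positions that were 0 in `R` are the model's guesses for the missing ratings. A larger `rank` gives the model more capacity, while a larger `lam` shrinks the factors and trades training fit for smoother predictions.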


In [4]:
data = spark.read.csv('movielens_ratings.csv',inferSchema=True,header=True)

In [11]:
data.show(7)

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|      2|   3.0|     0|
|      3|   1.0|     0|
|      5|   2.0|     0|
|      9|   4.0|     0|
|     11|   1.0|     0|
|     12|   2.0|     0|
|     15|   1.0|     0|
+-------+------+------+
only showing top 7 rows



In [6]:
data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



We can split the data into training and test sets to evaluate how well the model performs on ratings it has not seen.

In [7]:
# The dataset is small, so use an 80/20 train/test split

(training, test) = data.randomSplit([0.8, 0.2])

In [9]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

In [10]:
# Build the recommendation model using ALS on the training data

als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")

model = als.fit(training)

Now let's see how the model performed!

In [12]:
# Generate predictions for the held-out test data (the RMSE is computed below)
predictions = model.transform(test)

In [13]:
predictions.show()

+-------+------+------+-----------+
|movieId|rating|userId| prediction|
+-------+------+------+-----------+
|     31|   1.0|    13|  1.3541343|
|     31|   2.0|    25| 0.22553566|
|     85|   1.0|    26| -4.7627378|
|     85|   4.0|     7| -1.1938593|
|     85|   1.0|    29|  1.4854777|
|     85|   3.0|    21|  3.4416761|
|     65|   1.0|    28|  0.7786545|
|     65|   1.0|     4|-0.62037826|
|     65|   5.0|    23|  2.1504264|
|     53|   3.0|    20|  1.0125616|
|     53|   5.0|     8|  2.5181978|
|     53|   5.0|    21|  3.0839343|
|     78|   1.0|    11| 0.43350846|
|     81|   1.0|     6|  1.5962185|
|     81|   1.0|    15|   2.909353|
|     28|   3.0|     1|   1.365908|
|     28|   1.0|    17| -1.4350067|
|     28|   1.0|    23| -1.3205612|
|     76|   5.0|    14| -2.3735511|
|     26|   3.0|    15| -0.6821565|
+-------+------+------+-----------+
only showing top 20 rows



# Evaluate

In [14]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

rmse = evaluator.evaluate(predictions)

print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 2.016007506631373
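
RMSE is the square root of the mean squared difference between each rating and its prediction. The plain-Python sketch below recomputes it by hand on the first five rows of the predictions table above, and also shows the effect of clipping predictions to the 1.0 to 5.0 rating range reported by `describe()`. The helper names `rmse` and `clip` are made up for this example, and clipping is a suggested post-processing step, not something the notebook or `RegressionEvaluator` does.

```python
from math import sqrt

def rmse(labels, predictions):
    """Root-mean-square error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return sqrt(sum(squared) / len(squared))

def clip(value, low=1.0, high=5.0):
    """Force a prediction into the valid rating range (assumed to be 1 to 5)."""
    return max(low, min(high, value))

# (rating, prediction) pairs copied from the first five rows shown above
rows = [
    (1.0, 1.3541343),
    (2.0, 0.22553566),
    (1.0, -4.7627378),
    (4.0, -1.1938593),
    (1.0, 1.4854777),
]
labels = [rating for rating, _ in rows]
raw = [prediction for _, prediction in rows]
clipped = [clip(p) for p in raw]

print("RMSE, raw predictions:    ", rmse(labels, raw))
print("RMSE, clipped predictions:", rmse(labels, clipped))
```

Several raw predictions are negative even though no rating is below 1.0, which is part of why the error is large on this small dataset. Clipping can only move an out-of-range prediction closer to a true rating that lies inside the range, so it never increases the RMSE.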


### So now that we have the model, how would you actually supply a recommendation to a single user?

In [15]:
single_user = test.filter(test['userId']==11).select(['movieId','userId'])

In [16]:
# User had 11 ratings in the test data set

single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|     11|    11|
|     12|    11|
|     19|    11|
|     25|    11|
|     32|    11|
|     67|    11|
|     70|    11|
|     77|    11|
|     78|    11|
|     89|    11|
|     94|    11|
+-------+------+



In [17]:
recommendations = model.transform(single_user)

In [18]:
recommendations.orderBy('prediction', ascending=False).show()

+-------+------+-----------+
|movieId|userId| prediction|
+-------+------+-----------+
|     94|    11|  3.9866672|
|     89|    11|   3.216775|
|     32|    11|  2.7695441|
|     12|    11|  2.3684196|
|     70|    11|  1.9249603|
|     25|    11|  1.6387994|
|     11|    11|  1.2789872|
|     67|    11|  1.1437042|
|     19|    11|  0.8907236|
|     78|    11| 0.43350846|
|     77|    11|-0.46602625|
+-------+------+-----------+



# Great Job!