# Movie Recommender Systems using Spark with PySpark

This project builds a movie recommendation system from explicit feedback (users' movie ratings) using collaborative filtering with Alternating Least Squares (ALS).

### Two main approaches to recommendation systems

<b>Collaborative Filtering:</b> 
we make predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption is that if a user A has the same opinion as a user B on an issue, A is more likely to have B's opinion on a different issue x than to have the opinion on x of a user chosen randomly.

<b>Content-Based Filtering</b>
methods are based on a description of the item and a profile of the user's preferences. In a content-based recommender system, keywords are used to describe the items and a user profile is built to indicate the type of item this user likes. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended.

In this project we focus on a type of collaborative filtering called Alternating Least Squares (ALS).

<b>ALS:</b> ALS is a low-rank matrix approximation algorithm for collaborative filtering. It decomposes the user-item rating matrix into two low-rank matrices: a user matrix and an item matrix. Users and items are each described by a small set of latent factors that can be used to predict missing entries, and ALS learns these latent factors by matrix factorization, alternately holding one matrix fixed and solving a least-squares problem for the other.
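To make the alternating step concrete, here is a minimal pure-Python sketch of ALS with a single latent factor (rank 1). It is an illustration only, not Spark's implementation: the function name `als_rank1`, the toy ratings, and the default `reg` and `iterations` values are all assumptions made for this example.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=20):
    """Fit r_ui ~ u[user] * v[item] by alternating least squares (rank 1).

    ratings is a list of (user, item, rating) tuples for observed entries only.
    """
    u = [1.0] * n_users  # one latent factor per user
    v = [1.0] * n_items  # one latent factor per item
    for _step in range(iterations):
        # Hold item factors fixed; each user factor has a closed-form ridge solution.
        for user in range(n_users):
            seen = [(i, r) for (a, i, r) in ratings if a == user]
            numerator = sum(r * v[i] for i, r in seen)
            denominator = reg + sum(v[i] ** 2 for i, r in seen)
            u[user] = numerator / denominator
        # Hold user factors fixed; solve each item factor the same way.
        for item in range(n_items):
            seen = [(a, r) for (a, i, r) in ratings if i == item]
            numerator = sum(r * u[a] for a, r in seen)
            denominator = reg + sum(u[a] ** 2 for a, r in seen)
            v[item] = numerator / denominator
    return u, v


# Hypothetical toy data: 3 users x 3 items, with user 2's rating of item 2 left out.
toy_ratings = [
    (0, 0, 2.0), (0, 1, 1.0), (0, 2, 2.0),
    (1, 0, 4.0), (1, 1, 2.0), (1, 2, 4.0),
    (2, 0, 3.0), (2, 1, 1.5),
]
u, v = als_rank1(toy_ratings, n_users=3, n_items=3)
predicted = u[2] * v[2]  # estimate for the missing entry
print(round(predicted, 2))
```

The toy ratings were chosen to follow an exact rank-1 pattern, so the estimate for the missing entry should land close to 3.0. Real rating data is noisier and usually needs more than one latent factor.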

In [1]:
import findspark
findspark.init('/home/sonal/spark-2.4.5-bin-hadoop2.7')
import pyspark

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('recommendation').getOrCreate()

Spark's MLlib machine learning library provides a collaborative filtering implementation based on Alternating Least Squares. The implementation has the following parameters; the DataFrame-based `pyspark.ml.recommendation.ALS` class used below names some of them differently, as noted:

* numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure); `pyspark.ml` splits this into `numUserBlocks` and `numItemBlocks`.
* rank is the number of latent factors in the model.
* iterations is the number of iterations to run (`maxIter` in `pyspark.ml`).
* lambda specifies the regularization parameter in ALS (`regParam` in `pyspark.ml`).
* implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
* alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.

Let's see this all in action!

In [3]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

In [4]:
data = spark.read.csv('movielens_ratings.csv',inferSchema=True,header=True)

In [5]:
data.head()

Row(movieId=2, rating=3.0, userId=0)

In [6]:
data.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- userId: integer (nullable = true)



In [7]:
data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



We can split the data to evaluate how well the model performs, but keep in mind that it is hard to know conclusively how well a recommender system is truly working, especially when subjectivity is involved. For example, not everyone who loves Star Wars is going to love Star Trek, even though a recommendation system may suggest otherwise.

In [8]:
# Smaller dataset so we will use 0.8 / 0.2
(train_data, test_data) = data.randomSplit([0.8, 0.2], seed=42)

In [9]:
# Build the recommendation model using ALS on the training data
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(train_data)

Now let's see how the model performed!

In [10]:
# Predict ratings for the test data (the RMSE is computed from these predictions)
predictions = model.transform(test_data)

In [11]:
predictions.show()

+-------+------+------+-----------+
|movieId|rating|userId| prediction|
+-------+------+------+-----------+
|     31|   1.0|    26| -0.8328631|
|     31|   1.0|     5| 0.91361725|
|     31|   1.0|     4|  1.8772194|
|     31|   2.0|    25| -1.0943574|
|     31|   1.0|    18| -0.2835476|
|     85|   3.0|     1|  3.0139313|
|     85|   1.0|    13|  2.2822132|
|     85|   3.0|     6| -1.7273896|
|     85|   1.0|    25| -1.3156357|
|     65|   1.0|    16|-0.07137105|
|     65|   1.0|     2|  2.3545816|
|     78|   1.0|     1|  1.1508821|
|     78|   1.0|    19| 0.69761777|
|     78|   1.0|    24|   1.624121|
|     78|   1.0|     2|  1.4557397|
|     34|   1.0|    28|  2.2461605|
|     34|   1.0|    16|-0.91016656|
|     34|   1.0|    19| 0.17841205|
|     81|   1.0|     6|   2.857556|
|     81|   2.0|     5|-0.10178676|
+-------+------+------+-----------+
only showing top 20 rows



In [12]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 1.874538232312848


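The RMSE expresses our error in the same units as the rating column (stars). An RMSE of about 1.87 is larger than the standard deviation of the ratings in the full dataset (about 1.19), which suggests the model is doing worse here than simply predicting the mean rating.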
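To show what the metric measures, the sketch below computes RMSE by hand for the first five rows of the predictions shown earlier. It also compares the result with clamping each prediction to the 1.0 to 5.0 range that the ratings span. The clamping step is an assumption added for illustration and is not part of the notebook; `rmse` and `clip` are hypothetical helper names.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    return math.sqrt(sum((actual - pred) ** 2 for actual, pred in pairs) / len(pairs))


def clip(value, low=1.0, high=5.0):
    """Clamp a prediction to the valid rating range."""
    return max(low, min(high, value))


# (rating, prediction) pairs copied from the first five rows of predictions.show()
rows = [
    (1.0, -0.8328631),
    (1.0, 0.91361725),
    (1.0, 1.8772194),
    (2.0, -1.0943574),
    (1.0, -0.2835476),
]
raw_rmse = rmse(rows)
clipped_rmse = rmse([(actual, clip(pred)) for actual, pred in rows])
print(raw_rmse, clipped_rmse)
```

Because every true rating lies inside the 1.0 to 5.0 range, clamping can only move a prediction closer to its target, so the clamped RMSE is never larger than the raw one.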

So now that we have the model, how would you actually supply a recommendation to a user?

The same way we did with the test data! For example:

In [22]:
single_user = test_data.filter(test_data['userId']==12).select(['movieId','userId'])

In [23]:
# User 12 has 11 ratings in the test data set
# Realistically this should be some sort of hold out set!
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|     18|    12|
|     24|    12|
|     25|    12|
|     30|    12|
|     38|    12|
|     45|    12|
|     50|    12|
|     54|    12|
|     57|    12|
|     79|    12|
|     83|    12|
+-------+------+



In [24]:
recommendations = model.transform(single_user)

In [25]:
recommendations.orderBy('prediction',ascending=False).show()

+-------+------+----------+
|movieId|userId|prediction|
+-------+------+----------+
|     50|    12|  4.368422|
|     79|    12| 3.8063822|
|     18|    12| 3.4928117|
|     30|    12|  3.491328|
|     25|    12| 3.3623793|
|     54|    12| 1.2524549|
|     24|    12|0.97046584|
|     38|    12| 0.7397181|
|     57|    12|0.47302982|
|     45|    12|0.08336234|
|     83|    12| -0.785038|
+-------+------+----------+



Based on these predictions, we can recommend the movie with id 50 to this user, who is likely to enjoy it given their rating history. Conversely, we would not recommend the movies with ids 83, 45, and 57, which have the lowest predicted ratings.
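The ranking step can also be sketched in plain Python. The example below takes the (movieId, prediction) pairs for user 12 from the output above and keeps the highest-scoring ones; `top_n` is a hypothetical helper name, not part of Spark.

```python
import heapq


def top_n(scored, n=3):
    """Return the n movie ids with the highest predicted rating."""
    best = heapq.nlargest(n, scored, key=lambda pair: pair[1])
    return [movie_id for movie_id, _score in best]


# (movieId, prediction) pairs for user 12, copied from the output above
user_12 = [
    (50, 4.368422), (79, 3.8063822), (18, 3.4928117), (30, 3.491328),
    (25, 3.3623793), (54, 1.2524549), (24, 0.97046584), (38, 0.7397181),
    (57, 0.47302982), (45, 0.08336234), (83, -0.785038),
]
print(top_n(user_12))  # [50, 79, 18]
```

This only ranks movies that were passed to `model.transform`. Spark's `ALSModel` also provides `recommendForAllUsers(numItems)`, which scores the full catalogue for each user.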

What if a user is new or has never rated a movie? This is called the cold start problem. In that case, we can ask the user to take a short survey to get an idea of their taste in movies, or we can fall back to recommending movies that other users rate highly. Cold start is a general problem for recommendation systems.
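One simple fallback for a brand-new user is to recommend movies that are popular across all users. The sketch below ranks movies by mean rating and ignores movies with too few ratings. It is only one possible approach: the function name `popular_movies`, the `min_ratings` threshold, and the sample rows are hypothetical.

```python
from collections import defaultdict


def popular_movies(ratings, n=2, min_ratings=2):
    """Rank movies by mean rating, skipping movies with fewer than min_ratings ratings."""
    by_movie = defaultdict(list)
    for movie_id, rating, _user_id in ratings:
        by_movie[movie_id].append(rating)
    means = {
        movie_id: sum(values) / len(values)
        for movie_id, values in by_movie.items()
        if len(values) >= min_ratings
    }
    # Highest mean rating first; ties are broken by movie id so the result is deterministic.
    return sorted(means, key=lambda movie_id: (-means[movie_id], movie_id))[:n]


# Hypothetical (movieId, rating, userId) rows, in the same column order as the dataset
sample = [
    (1, 5.0, 0), (1, 4.0, 1),
    (2, 3.0, 0), (2, 2.0, 2),
    (3, 5.0, 1),
    (4, 4.0, 0), (4, 4.0, 2),
]
print(popular_movies(sample))  # [1, 4]
```

Movie 3 has the highest single rating but is excluded because it has only one rating, which keeps a lone enthusiastic vote from dominating the list.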