# Recommendation System - MovieLens Data

Aim - Build a recommendation engine on MovieLens data that predicts which movies a user would enjoy.

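Spark's MLlib machine learning library implements collaborative filtering with Alternating Least Squares (ALS). The implementation has the parameters listed below. The names are those used in the MLlib documentation; where the DataFrame-based pyspark.ml ALS class used in this notebook names a parameter differently, that name is given in parentheses.

* numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure). pyspark.ml splits this into numUserBlocks and numItemBlocks.
* rank is the number of latent factors in the model.
* iterations (maxIter in pyspark.ml) is the number of iterations to run.
* lambda (regParam in pyspark.ml) is the regularization parameter in ALS.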
* implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
* alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
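To make the roles of rank, iterations and lambda concrete, here is a minimal plain-Python sketch of explicit-feedback ALS. It is an illustration only, not Spark's implementation: it leaves out blocking, the implicit-feedback variant and other details, and the names `als`, `update`, `solve`, `predict` and the `toy` ratings are made up for this example.

```python
import random

def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def update(fixed, ratings_by_key, rank, lam):
    """Re-solve one side's factors while the other side stays fixed."""
    out = {}
    for key, rated in ratings_by_key.items():
        # Regularized normal equations: (sum v v^T + lam * I) x = sum r * v
        a = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, r in rated:
            v = fixed[other]
            for i in range(rank):
                b[i] += r * v[i]
                for j in range(rank):
                    a[i][j] += v[i] * v[j]
        out[key] = solve(a, b)
    return out

def als(ratings, rank=2, iterations=10, lam=0.1, seed=0):
    """ratings is a list of (user, item, rating) triples."""
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    # Start from random item factors; user factors are solved in the loop.
    items = {i: [rng.random() for _ in range(rank)] for i in by_item}
    users = {}
    for _ in range(iterations):
        users = update(items, by_user, rank, lam)  # fix items, solve users
        items = update(users, by_item, rank, lam)  # fix users, solve items
    return users, items

def predict(users, items, u, i):
    """Predicted rating is the dot product of the two factor vectors."""
    return sum(a * b for a, b in zip(users[u], items[i]))

# Made-up toy ratings, for illustration only.
toy = [(0, 0, 5.0), (0, 1, 1.0), (0, 2, 4.0), (1, 0, 4.0),
       (1, 1, 1.0), (2, 0, 1.0), (2, 1, 5.0), (2, 2, 2.0)]
users, items = als(toy, rank=2, iterations=10, lam=0.1)
```

Each pass alternates between the two sides: with the item factors held fixed, every user's factor vector is the solution of a small regularized least-squares problem, and vice versa. A larger rank gives the model more capacity, while a larger lambda shrinks the factors and reduces overfitting.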

Steps:

1. Create a spark session and import data
2. Check whether data is in movieId, userId and rating (all numerical) format
3. Check for missing values and see the summary to get an idea of the data
4. Split data into train and test to evaluate the performance of the recommendation system
5. Import ALS and build the model on the train set
6. Predict the ratings on the test set
7. Import an evaluator to evaluate the accuracy of the predictions

In [1]:
# Create a spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('movie').getOrCreate()

In [2]:
# Import movielens data

data = spark.read.csv('movielens_ratings.csv', inferSchema=True, header=True)

In [3]:
data.head().asDict()

{'movieId': 2, 'rating': 3.0, 'userId': 0}

In [4]:
data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



In [5]:
# Check for any missing values
from pyspark.sql.functions import isnan, isnull, when, count, col

data.select([count(when(isnan(c)| isnull(c), c)).alias(c) for c in data.columns]).show()

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|      0|     0|     0|
+-------+------+------+



In [6]:
# Split the data into train and test set to evaluate the performance of our recommendation system
train, test = data.randomSplit([0.9, 0.1])
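
Note that randomSplit also accepts an optional seed, for example data.randomSplit([0.9, 0.1], seed=42). Without one, every run produces a different split and therefore different predictions and a different RMSE than the outputs shown here. The plain-Python sketch below illustrates the idea of a seeded, weight-based random split; it is a simplified stand-in, not Spark's actual implementation, and `random_split` is a name invented for this example.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by one uniform draw against cumulative weights."""
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total  # weights are normalized to sum to 1
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for k, upper in enumerate(bounds):
            # The last bucket catches any floating-point slack.
            if x < upper or k == len(bounds) - 1:
                splits[k].append(row)
                break
    return splits

rows = list(range(1501))  # stand-in for the 1501 rating rows
train_rows, test_rows = random_split(rows, [0.9, 0.1], seed=42)
print(len(train_rows), len(test_rows))
```

Because each row is assigned independently, the split sizes are only approximately 90% and 10%, but the same seed always reproduces the same split.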

In [7]:
# Import ALS

from pyspark.ml.recommendation import ALS

In [8]:
# Build the recommendation model using ALS on the training data
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(train)

In [9]:
# Predict the ratings on the test set
predictions = model.transform(test)

In [10]:
predictions.show()

+-------+------+------+----------+
|movieId|rating|userId|prediction|
+-------+------+------+----------+
|     31|   1.0|    27|0.42751443|
|     31|   1.0|     5| 2.7613196|
|     78|   1.0|     1|0.96409684|
|     34|   1.0|     4| 1.2006195|
|     81|   1.0|     7| 0.7912039|
|     28|   1.0|     2| 4.9103336|
|     26|   1.0|    19| 0.8854952|
|     26|   3.0|    15|  2.203689|
|     26|   1.0|     2| 2.0831113|
|     27|   1.0|     5|  4.838405|
|     27|   3.0|    24| 0.9148974|
|     44|   1.0|    22|  0.672422|
|     44|   4.0|    18| 1.0771607|
|     12|   1.0|     1| 1.3843888|
|     12|   1.0|    21|0.63092315|
|     12|   2.0|    18| 0.4667408|
|     91|   1.0|    20| 1.2612588|
|     47|   1.0|    26|-1.6067481|
|     47|   4.0|    25|  2.358118|
|     47|   1.0|    24| 2.4211466|
+-------+------+------+----------+
only showing top 20 rows



In [11]:
# Import evaluator

from pyspark.ml.evaluation import RegressionEvaluator

Because the predicted ratings are continuous values, we use RegressionEvaluator to score them and report the root mean squared error (RMSE).
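
For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings. The sketch below computes it by hand in plain Python on the first five rows of the predictions table above; the helper name `rmse` is introduced here for illustration, and five rows are far too few to say anything about the model.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))

# First five rows of the predictions table: actual rating vs. prediction.
labels = [1.0, 1.0, 1.0, 1.0, 1.0]
preds = [0.42751443, 2.7613196, 0.96409684, 1.2006195, 0.7912039]
print(rmse(labels, preds))
```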

In [12]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root mean square error = " + str(rmse))

Root mean square error = 1.5892349847085798


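This is poor: on a 1-to-5 star scale, an RMSE of about 1.6 means the predictions are typically off by more than a star and a half. For comparison, the standard deviation of the ratings in the summary above is about 1.19, so always predicting the mean rating would be expected to give an RMSE near that value, which is better than the model. A likely cause is the small dataset (only 1501 observations). The split is also unseeded, so the exact RMSE changes from run to run.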
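
Part of the error comes from predictions that fall outside the valid rating range, such as the -1.6067481 in the table above, because ALS output is unbounded. One common post-processing step, shown here as a plain-Python sketch rather than as something this notebook does, is to clip predictions to the 1-to-5 range. The helpers `clip` and `rmse` are names invented for this example, and the four rows are taken from the predictions table purely for illustration.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def clip(value, low=1.0, high=5.0):
    """Force a predicted rating into the valid range [low, high]."""
    return max(low, min(high, value))

# Four rows from the predictions table: actual rating vs. raw prediction.
labels = [1.0, 1.0, 4.0, 4.0]
raw = [4.9103336, -1.6067481, 2.358118, 1.0771607]
clipped = [clip(p) for p in raw]

print(rmse(labels, raw), rmse(labels, clipped))
```

When every true rating lies inside the range, clipping can only move a prediction closer to its label or leave it unchanged, so the clipped RMSE is never higher than the raw one.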

Finally, we use the model to predict ratings for a single user.

In [13]:
single_user = test.filter(test['userId']==11).select(['movieId','userId'])

In [14]:
# Show the movies that userId 11 rated in the test set
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|     10|    11|
|     20|    11|
|     36|    11|
|     39|    11|
|     62|    11|
|     75|    11|
|     82|    11|
|     88|    11|
+-------+------+



In [15]:
recommendations = model.transform(single_user)

In [16]:
recommendations.orderBy('prediction',ascending=False).show()

+-------+------+----------+
|movieId|userId|prediction|
+-------+------+----------+
|     10|    11|  2.567011|
|     39|    11| 2.2638342|
|     20|    11| 2.1479535|
|     36|    11| 1.9692445|
|     88|    11| 1.9242296|
|     82|    11| 1.7972623|
|     75|    11| 1.5097756|
|     62|    11|-2.6585352|
+-------+------+----------+
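
The ranking above covers only movies that user 11 already rated in the test split. To produce actual recommendations, one would usually score the movies the user has not rated yet and keep the top few. The plain-Python sketch below shows that selection step; `top_n_unseen`, `score` and the `fake_scores` values are hypothetical and are not output from the ALS model. Recent Spark versions also offer recommendForAllUsers on the fitted model for this purpose.

```python
def top_n_unseen(user_id, all_items, seen_items, score, n=3):
    """Rank items the user has not rated by predicted score, highest first."""
    candidates = [i for i in all_items if i not in seen_items]
    ranked = sorted(candidates, key=lambda i: score(user_id, i), reverse=True)
    return ranked[:n]

# Hypothetical scores, for illustration only (not produced by the model).
fake_scores = {(11, 1): 2.5, (11, 2): 0.3, (11, 3): 4.1, (11, 4): 3.2}

def score(user_id, item_id):
    """Stand-in for the model's predicted rating of item_id for user_id."""
    return fake_scores.get((user_id, item_id), 0.0)

# Movie 10 is already rated by user 11, so it is excluded from the candidates.
print(top_n_unseen(11, all_items=[1, 2, 3, 4, 10], seen_items={10}, score=score, n=2))
```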



------------------------------------