## Movie Recommender System - Collaborative Filtering

Dataset:

https://www.kaggle.com/rounakbanik/the-movies-dataset

Source:

http://www.3leafnodes.com/apache-spark-introduction-recommender-system

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('MovieRecommender').getOrCreate()
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

24/01/30 18:35:48 WARN Utils: Your hostname, Sais-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 10.150.105.91 instead (on interface en0)
24/01/30 18:35:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/30 18:35:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/01/30 18:35:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


### Import Data

In [3]:
ratings = spark.read.csv("/Users/saiomkarkandukuri/Desktop/Academics/Academics-001/big-data-platforms/week8/the-movies-dataset/ratings_small.csv", inferSchema=True, header=True)
movies = spark.read.csv("/Users/saiomkarkandukuri/Desktop/Academics/Academics-001/big-data-platforms/week8/the-movies-dataset/movies_metadata.csv", inferSchema=True, header=True)

In [4]:
ratings.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



### Data Exploration

In [5]:
ratings.columns

['userId', 'movieId', 'rating', 'timestamp']

In [6]:
movies.columns

['adult',
 'belongs_to_collection',
 'budget',
 'genres',
 'homepage',
 'id',
 'imdb_id',
 'original_language',
 'original_title',
 'overview',
 'popularity',
 'poster_path',
 'production_companies',
 'production_countries',
 'release_date',
 'revenue',
 'runtime',
 'spoken_languages',
 'status',
 'tagline',
 'title',
 'video',
 'vote_average',
 'vote_count']

In [7]:
ratings = ratings.select(['userId', 'movieId', 'rating'])

In [8]:
ratings.head(5)

[Row(userId=1, movieId=31, rating=2.5),
 Row(userId=1, movieId=1029, rating=3.0),
 Row(userId=1, movieId=1061, rating=3.0),
 Row(userId=1, movieId=1129, rating=2.0),
 Row(userId=1, movieId=1172, rating=4.0)]

In [9]:
ratings.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|     31|   2.5|
|     1|   1029|   3.0|
|     1|   1061|   3.0|
|     1|   1129|   2.0|
|     1|   1172|   4.0|
|     1|   1263|   2.0|
|     1|   1287|   2.0|
|     1|   1293|   2.0|
|     1|   1339|   3.5|
|     1|   1343|   2.0|
|     1|   1371|   2.5|
|     1|   1405|   1.0|
|     1|   1953|   4.0|
|     1|   2105|   4.0|
|     1|   2150|   3.0|
|     1|   2193|   2.0|
|     1|   2294|   2.0|
|     1|   2455|   2.5|
|     1|   2968|   1.0|
|     1|   3671|   3.0|
+------+-------+------+
only showing top 20 rows



In [10]:
ratings.describe().show()

+-------+------------------+------------------+------------------+
|summary|            userId|           movieId|            rating|
+-------+------------------+------------------+------------------+
|  count|            100004|            100004|            100004|
|   mean| 347.0113095476181|12548.664363425463| 3.543608255669773|
| stddev|195.16383797819535|26369.198968815268|1.0580641091070326|
|    min|                 1|                 1|               0.5|
|    max|               671|            163949|               5.0|
+-------+------------------+------------------+------------------+



In [11]:
training, test = ratings.randomSplit([0.8,0.2])
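`randomSplit` is called here without a seed, so every run produces a different 80/20 split and slightly different metrics. It accepts an optional `seed` argument (for example `ratings.randomSplit([0.8, 0.2], seed=42)`, where 42 is an arbitrary choice). The plain-Python sketch below only illustrates the idea of a seeded, weighted split. The helper `random_split` and the stand-in rows are hypothetical, and it assumes each row is assigned independently at random, which is why the split sizes are approximate rather than exact.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random() * total
        upper = 0.0
        for k, w in enumerate(weights):
            upper += w
            if x < upper:
                splits[k].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point round-off
    return splits

rows = list(range(1000))  # stand-in for rating rows (hypothetical data)
train_a, test_a = random_split(rows, [0.8, 0.2], seed=42)
train_b, test_b = random_split(rows, [0.8, 0.2], seed=42)

# The same seed reproduces the same split; the sizes are only roughly 800/200.
print(train_a == train_b, len(train_a), len(test_a))
```

Fixing the seed makes the RMSE reported later in the notebook comparable across runs.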

### ALS

[Alternating Least Squares (ALS)](https://spark.apache.org/docs/latest/ml-collaborative-filtering.html) is the model we’ll use to fit our data and find similarities. ALS is an iterative optimization process: at every iteration it tries to move closer to a factorized representation of the original data.
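The alternating updates can be sketched in a few lines. This is a minimal rank-1 illustration in plain Python, not Spark's blocked implementation: the toy ratings, `REG`, and every function name are made up for the example. With one side held fixed, each factor has a closed-form ridge-regression solution, so the regularized loss cannot increase from one sweep to the next.

```python
# Toy observed ratings keyed by (user, item); hypothetical data, not from the dataset.
toy_ratings = {
    (0, 0): 5.0, (0, 1): 3.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 2.0, (2, 2): 1.0,
}
N_USERS, N_ITEMS = 3, 3
REG = 0.1  # L2 penalty, playing the role of regParam

def loss(u, v):
    """Squared error on observed ratings plus the L2 penalty on both factor lists."""
    err = sum((r - u[i] * v[j]) ** 2 for (i, j), r in toy_ratings.items())
    return err + REG * (sum(x * x for x in u) + sum(x * x for x in v))

def solve_side(fixed, n, axis):
    """Closed-form rank-1 ridge solution for one side while the other is fixed."""
    out = []
    for idx in range(n):
        num, den = 0.0, REG
        for key, r in toy_ratings.items():
            if key[axis] == idx:
                other = fixed[key[1 - axis]]
                num += r * other
                den += other * other
        out.append(num / den)
    return out

def als_rank1(n_iters=10):
    u, v = [1.0] * N_USERS, [1.0] * N_ITEMS
    history = [loss(u, v)]
    for _ in range(n_iters):
        u = solve_side(v, N_USERS, axis=0)  # fix item factors, update user factors
        v = solve_side(u, N_ITEMS, axis=1)  # fix user factors, update item factors
        history.append(loss(u, v))
    return u, v, history

u, v, history = als_rank1()
print([round(h, 4) for h in history])  # regularized loss after each sweep
```

With a rank above 1, each scalar division becomes a small linear solve per user or item, but the alternation is the same.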

For implicit preference data, the algorithm used is based on “Collaborative Filtering for Implicit Feedback Datasets”, adapted for the blocked approach used here.

Essentially, instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r > 0 and 0 if r <= 0. The ratings then act as ‘confidence’ values related to the strength of indicated user preferences rather than explicit ratings given to items. This notebook trains on explicit ratings (the ALS default, `implicitPrefs=False`), so the implicit variant is background only.
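To make the preference and confidence idea concrete, here is a small sketch. The preference rule follows the text directly. The confidence form `1 + alpha * r` is the one proposed in the cited paper and is an assumption about what any particular implementation does internally. The function names and the play counts are hypothetical.

```python
def preference(r):
    """Binary preference p: 1 if any positive interaction was observed, else 0."""
    return 1.0 if r > 0 else 0.0

def confidence(r, alpha=1.0):
    """Confidence weight c = 1 + alpha * r (assumed form, taken from the cited paper)."""
    return 1.0 + alpha * r

# Hypothetical implicit feedback: number of times a user played each item.
play_counts = [0, 1, 12]
for r in play_counts:
    print(r, preference(r), confidence(r, alpha=40.0))
```

An item played 12 times and an item played once both get preference 1. The difference shows up only in how strongly each observation is weighted during fitting.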

### Cold Start Predictions

When predictions are requested for cold-start users or items (ones the model did not see during training), the predictions are NaN. This also causes evaluation with the mean squared error to produce NaN. To solve this problem, the rows can be dropped with <code>predictions.na.drop()</code>. A more streamlined way is to pass <code>coldStartStrategy="drop"</code> as a model parameter.
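The effect of a single NaN prediction on the metric can be reproduced without Spark. The sketch below uses made-up `(rating, prediction)` pairs and a hand-written `rmse` helper. It only illustrates why dropping the rows fixes the metric and is not how `RegressionEvaluator` is implemented.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))

# Hypothetical test rows; the middle one belongs to an item unseen in training.
pairs = [(4.0, 3.5), (3.0, math.nan), (2.0, 2.5)]

rmse_with_nan = rmse(pairs)  # one NaN poisons the whole sum
kept = [(y, p) for y, p in pairs if not math.isnan(p)]
rmse_dropped = rmse(kept)    # same metric after dropping the cold-start row

print(rmse_with_nan, rmse_dropped)
```

Dropping rows means the metric describes only users and items the model has seen, which is worth remembering when comparing runs.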

In [12]:
# Build the recommendation model using ALS on the training data
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics

als = ALS(maxIter=10, regParam=0.01, rank=10, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop", nonnegative=True)

In [15]:
#fit and predict
model = als.fit(training)
predictions = model.transform(test)

In [18]:
#explain parameters of the model
model.explainParams()

"blockSize: block size for stacking input data in matrices. Data is stacked within partitions. If block size is more than remaining data in a partition then it is adjusted to the size of this data. (default: 4096)\ncoldStartStrategy: strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: 'nan', 'drop'. (default: nan, current: drop)\nitemCol: column name for item ids. Ids must be within the integer value range. (default: item, current: movieId)\npredictionCol: prediction column name. (default: prediction)\nuserCol: column name for user ids. Ids must be within the integer value range. (default: user, current: userId)"

In [19]:
#item factors 
model.itemFactors.show(10, truncate = False)

+---+-----------------------------------------------------------------------------------------------------------------+
|id |features                                                                                                         |
+---+-----------------------------------------------------------------------------------------------------------------+
|10 |[0.4526526, 1.10226, 0.6139859, 0.93697685, 0.7231332, 0.49364826, 1.6725879, 0.5438942, 0.45856705, 1.0070657]  |
|20 |[1.408679, 0.0, 1.1627545, 0.86438787, 1.0241486, 0.95211715, 0.42245862, 0.08277715, 1.4531595, 1.0112128]      |
|30 |[1.1937718, 0.46731228, 0.24893686, 2.0532336, 0.18143256, 1.4616607, 0.05457021, 2.1099515, 0.0, 1.9221573]     |
|40 |[0.9202554, 1.4581052, 0.28719586, 0.5365094, 0.6499276, 2.0185122, 1.4123722, 0.3286701, 0.061067536, 1.0841638]|
|50 |[0.9901864, 1.9146293, 0.0, 0.9764758, 1.0007045, 0.18279846, 1.6571925, 0.44981188, 0.76740134, 2.0315585]      |
|60 |[1.0849234, 1.4610733, 0.94705063, 
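Each row above is a rank-10 item vector, and a predicted rating is the dot product of a user's vector with an item's vector. The sketch below reuses the two complete item vectors printed above (ids 10 and 20). The user vector is hypothetical, since the notebook never prints `model.userFactors`. Because the model was built with `nonnegative=True`, no factor is negative, which is also why exact zeros appear in the output.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Item factors copied from the itemFactors output (ids 10 and 20).
item_factors = {
    10: [0.4526526, 1.10226, 0.6139859, 0.93697685, 0.7231332,
         0.49364826, 1.6725879, 0.5438942, 0.45856705, 1.0070657],
    20: [1.408679, 0.0, 1.1627545, 0.86438787, 1.0241486,
         0.95211715, 0.42245862, 0.08277715, 1.4531595, 1.0112128],
}

# Hypothetical user factor vector, chosen only for illustration.
user_vector = [0.5] * 10

predicted = {item_id: dot(user_vector, f) for item_id, f in item_factors.items()}
print(predicted)
```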

In [20]:
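movies = movies.select('id','title','genres')
# Caution: ratings_small.csv uses MovieLens movieId values, while `id` in
# movies_metadata.csv appears to be a TMDB id, so this direct join can attach
# the wrong title to a prediction.
predictions = predictions.join(movies, movies.id == predictions.movieId)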

In [21]:
predictions = predictions.na.drop()
predictions.show(10, truncate = False)

+------+-------+------+----------+----+-----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|userId|movieId|rating|prediction|id  |title                                                                                                |genres                                                                                                                                                     |
+------+-------+------+----------+----+-----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|148   |58     |4.0   |4.190801  |58  |Pirates of the Caribbean: Dead Man's Chest                         
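The title shown for `movieId` 58 is worth a second look: the ratings file uses MovieLens ids, while `id` in `movies_metadata.csv` appears to be a TMDB id. The sketch below shows the difference between joining directly and joining through a mapping table. It assumes the dataset's `links_small.csv` (with `movieId` and `tmdbId` columns) is the intended bridge. Every id and title in the example is an invented placeholder, and the helper names are hypothetical.

```python
# All ids and titles below are invented placeholders, not real dataset rows.
links = [
    {"movieId": 1, "tmdbId": 900},
    {"movieId": 2, "tmdbId": 901},
]
metadata = [
    {"id": 900, "title": "Film A"},
    {"id": 1, "title": "Film B"},  # TMDB id 1 is unrelated to MovieLens id 1
]
predictions = [{"userId": 7, "movieId": 1, "prediction": 4.1}]

tmdb_by_movie = {row["movieId"]: row["tmdbId"] for row in links}
title_by_tmdb = {row["id"]: row["title"] for row in metadata}

def title_direct(pred):
    """Treats the MovieLens id as if it were a TMDB id (the direct join)."""
    return title_by_tmdb.get(pred["movieId"])

def title_via_links(pred):
    """Maps MovieLens id -> TMDB id first, then looks up the title."""
    return title_by_tmdb.get(tmdb_by_movie.get(pred["movieId"]))

for pred in predictions:
    print(pred["movieId"], title_direct(pred), title_via_links(pred))
```

The direct lookup still returns a title, so the mismatch produces no error and is easy to miss.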

### Prediction Performance 

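The RMSE on the held-out test split of the 100,004-rating dataset was 1.1244220 in one run; the run shown below gives about 1.0264. The value varies between runs because `randomSplit` is called without a seed.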

Training on the full dataset (26,024,289 ratings) is expected to improve prediction performance. Run this notebook with the full dataset to see the lift.

In [22]:
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating')
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 1.0263904594881685


### Predictions

In [23]:
# Generate top 10 movie recommendations for each user
userRecs = model.recommendForAllUsers(10)
userRecs.show(10)



+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|     1|[{6450, 7.6936827...|
|     2|[{501, 8.549845},...|
|     3|[{6216, 8.864064}...|
|     4|[{3929, 6.9032474...|
|     5|[{6450, 9.5201435...|
|     6|[{1192, 7.647624}...|
|     7|[{95449, 7.077262...|
|     8|[{390, 8.49079}, ...|
|     9|[{1180, 9.157441}...|
|    10|[{1295, 7.1830487...|
+------+--------------------+
only showing top 10 rows
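Conceptually, the top-N list for a user comes from scoring every item with the user-item dot product and keeping the N largest. The sketch below shows that ranking step with `heapq.nlargest`. The factor values and the `top_k` helper are hypothetical, and Spark's own implementation is distributed and more involved. The scores are raw dot products, which is why the values in the output above can exceed the 0.5 to 5.0 rating scale.

```python
import heapq

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(user_vector, item_factors, k):
    """Return the k highest-scoring (item_id, score) pairs for one user."""
    scored = ((dot(user_vector, f), item_id) for item_id, f in item_factors.items())
    return [(item_id, score) for score, item_id in heapq.nlargest(k, scored)]

# Hypothetical rank-2 factors, for illustration only.
toy_item_factors = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
toy_user_vector = [2.0, 1.0]

recs = top_k(toy_user_vector, toy_item_factors, k=2)
print(recs)
```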

In [24]:
# Generate top 10 user recommendations for each movie
movieRecs = model.recommendForAllItems(10)

movieRecs.show(10, truncate=False)



+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|movieId|recommendations                                                                                                                                                                 |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|12     |[{289, 13.281313}, {629, 12.158485}, {296, 11.839071}, {20, 10.543511}, {174, 10.407533}, {332, 9.963968}, {114, 9.522163}, {611, 9.408291}, {586, 8.868663}, {498, 8.346689}]  |
|26     |[{348, 7.22755}, {193, 7.0855393}, {360, 7.070237}, {264, 7.069119}, {37, 6.985383}, {123, 6.8676195}, {145, 6.632053}, {645, 6.5727496}, {54, 6.5641513}, {225, 6.4865994}]    |
|27     |[{650, 4.003733}, {273, 3.9162965}, {453, 3.9011173}, {3

In [25]:
# Generate top 10 movie recommendations for a specified set of users
users = ratings.select(als.getUserCol()).distinct().limit(3)
userSubsetRecs = model.recommendForUserSubset(users, 10)

userSubsetRecs.show(10, truncate=False)

+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                                                                                                                   |
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|471   |[{279, 7.025709}, {4967, 6.4434013}, {299, 6.4163504}, {3355, 6.281493}, {6003, 6.089942}, {2290, 6.0044303}, {1306, 5.928119}, {766, 5.8316755}, {2068, 5.776059}, {3030, 5.6900644}]            |
|463   |[{392, 6.819105}, {65188, 5.9430437}, {1202, 5.940975}, {6433, 5.8109007}, {1180, 5.7846346}, {3266, 5.697556}, {77658, 5.657424}, {98961, 5.6093698}, {83411, 5.601468}, {83318

In [26]:
# Generate top 10 user recommendations for a specified set of movies
# Use a new name so the `movies` metadata DataFrame from In [20] is not overwritten
movieSubset = ratings.select(als.getItemCol()).distinct().limit(3)
movieSubSetRecs = model.recommendForItemSubset(movieSubset, 10)

movieSubSetRecs.show(10, truncate=False)

+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|movieId|recommendations                                                                                                                                                                |
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1580   |[{337, 5.061896}, {46, 5.028004}, {568, 4.964072}, {663, 4.934806}, {432, 4.928523}, {290, 4.894332}, {47, 4.8710666}, {177, 4.8688655}, {287, 4.8474236}, {448, 4.791448}]    |
|3794   |[{332, 6.2603335}, {310, 5.991803}, {325, 5.6808276}, {198, 5.247655}, {28, 5.170321}, {543, 4.99607}, {337, 4.9277444}, {401, 4.7955823}, {154, 4.719326}, {408, 4.68405}]    |
|2659   |[{308, 6.918976}, {348, 6.8935122}, {51, 6.8734593}, {264, 6.