<a href="https://colab.research.google.com/github/NBK-code/Movie_Recommendation_System/blob/main/Movie_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install PySpark

In [None]:
! pip install pyspark

Import necessary libraries

In [2]:
import pandas as pd
from pyspark.sql.functions import col, explode
from pyspark import SparkContext

In [3]:
from pyspark.sql import SparkSession
# Build (or reuse) the session first, then take the SparkContext instance from it
spark = SparkSession.builder.appName('Recommendations').getOrCreate()
sc = spark.sparkContext
# sc.setCheckpointDir('checkpoint')

In [4]:
from google.colab import drive
drive.mount('gdrive')

Mounted at gdrive


Load data

In [5]:
movies = spark.read.csv("./gdrive/MyDrive/Colab_Notebooks/Projects/Recommendation_System/Data_/movies.csv",header=True)
ratings = spark.read.csv("./gdrive/MyDrive/Colab_Notebooks/Projects/Recommendation_System/Data_/ratings.csv",header=True)

In [6]:
ratings.show(10)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
|     1|     70|   3.0|964982400|
|     1|    101|   5.0|964980868|
|     1|    110|   4.0|964982176|
|     1|    151|   5.0|964984041|
|     1|    157|   5.0|964984100|
+------+-------+------+---------+
only showing top 10 rows



In [7]:
ratings.printSchema()

root
 |-- userId: string (nullable = true)
 |-- movieId: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- timestamp: string (nullable = true)



Spark's implementation of ALS requires user and item IDs to be numeric values within the integer range, and ratings to be numeric. The CSV reader loaded every column as a string, so cast userId and movieId to integer and rating to float, and drop the unused timestamp column.

In [8]:
ratings = ratings.\
    withColumn('userId', col('userId').cast('integer')).\
    withColumn('movieId', col('movieId').cast('integer')).\
    withColumn('rating', col('rating').cast('float')).\
    drop('timestamp')
ratings.show(10)

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
+------+-------+------+
only showing top 10 rows



The user-by-movie ratings matrix is very sparse: most (user, movie) pairs have no rating. Calculate the percentage of empty entries.

In [9]:
# Count the total number of ratings in the dataset
numerator = ratings.select("rating").count()

# Count the number of distinct userIds and distinct movieIds
num_users = ratings.select("userId").distinct().count()
num_movies = ratings.select("movieId").distinct().count()

# Set the denominator equal to the number of users multiplied by the number of movies
denominator = num_users * num_movies

# Divide the numerator by the denominator
sparsity = (1.0 - (numerator)/denominator)*100
print("The ratings dataframe is ", "%.2f" % sparsity + "% empty.")

The ratings dataframe is  98.30% empty.
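
To make the sparsity formula concrete, here is a minimal pure-Python sketch of the same calculation on a tiny made-up dataset. The `toy_ratings` list and the `sparsity_percent` helper are hypothetical names used only for illustration; they are not part of the notebook or of PySpark.

```python
def sparsity_percent(ratings):
    """Return the percentage of empty cells in the user x movie matrix.

    `ratings` is an iterable of (user_id, movie_id, rating) tuples.
    """
    ratings = list(ratings)
    num_ratings = len(ratings)
    num_users = len({user for user, _, _ in ratings})
    num_movies = len({movie for _, movie, _ in ratings})
    # Same formula as the Spark cell: 1 - observed / (users * movies)
    return (1.0 - num_ratings / (num_users * num_movies)) * 100


# Hypothetical toy data: 3 users, 4 movies, 5 observed ratings
toy_ratings = [
    (1, 10, 4.0),
    (1, 20, 5.0),
    (2, 10, 3.0),
    (3, 30, 4.5),
    (3, 40, 2.0),
]

print("%.2f%% empty" % sparsity_percent(toy_ratings))  # 58.33% empty
```

With 5 ratings out of 3 x 4 = 12 possible cells, 7 cells are empty, which gives 58.33%.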


Import necessary PySpark ML libraries

In [10]:
# Import the required functions
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

Create the train and test sets, then create an Alternating Least Squares (ALS) model instance.

In [11]:
# Create test and train set
(train, test) = ratings.randomSplit([0.8, 0.2], seed = 1234)

# Create ALS model
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", nonnegative = True, implicitPrefs = False, coldStartStrategy="drop")

# Confirm that the ALS estimator "als" was created
type(als)

pyspark.ml.recommendation.ALS

Create hyperparameter grid for optimization.

In [12]:
# Add hyperparameters and their respective values to param_grid
param_grid = ParamGridBuilder() \
            .addGrid(als.rank, [10, 50, 100, 150]) \
            .addGrid(als.regParam, [.01, .05, .1, .15]) \
            .build()
            #.addGrid(als.maxIter, [5, 50, 100, 200]) \

           
# Define the RMSE evaluator and print the number of hyperparameter combinations in the grid
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
print("Num models to be tested: ", len(param_grid))

Num models to be tested:  16
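
The count of 16 is the Cartesian product of the two value lists. The sketch below reproduces it in plain Python and also estimates how many ALS fits a 5-fold cross-validation over this grid implies. The variable names are illustrative only, and the "+ 1" assumes the cross-validator refits the best configuration once on the full training set after the folds finish.

```python
from itertools import product

ranks = [10, 50, 100, 150]
reg_params = [0.01, 0.05, 0.1, 0.15]
num_folds = 5  # assumed to match numFolds passed to CrossValidator

# Every (rank, regParam) pair is one candidate configuration
combinations = list(product(ranks, reg_params))
num_candidates = len(combinations)

# Each candidate is trained once per fold
num_cv_fits = num_candidates * num_folds

# Plus one final refit of the winning configuration on all of the training data
total_fits = num_cv_fits + 1

print(num_candidates, num_cv_fits, total_fits)  # 16 80 81
```

This is why the training cell can take a long time: larger grids or more folds multiply the number of ALS runs.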


In [13]:
# Build cross validation using CrossValidator
cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

# Confirm cv was built
print(cv)

CrossValidator_94ac475b7dce


Train the model



In [14]:
# Fit the cross-validator to the 'train' dataset
model = cv.fit(train)

# Extract the best model from the fitted cross-validator
best_model = model.bestModel

In [15]:
# Print the type of best_model
print(type(best_model))

# Extract the hyperparameters of the best ALS model
print("**Best Model**")

# Print "Rank"
print("  Rank:", best_model._java_obj.parent().getRank())

# Print "MaxIter"
print("  MaxIter:", best_model._java_obj.parent().getMaxIter())

# Print "RegParam"
print("  RegParam:", best_model._java_obj.parent().getRegParam())

<class 'pyspark.ml.recommendation.ALSModel'>
**Best Model**
  Rank: 100
  MaxIter: 10
  RegParam: 0.15


Model predictions

In [16]:
# Predict ratings for the test set and compute the RMSE
test_predictions = best_model.transform(test)
RMSE = evaluator.evaluate(test_predictions)
print(RMSE)

0.8686831316495118
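
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The following is a minimal sketch of that calculation in plain Python; the `rmse` helper and the `toy_pairs` values are hypothetical and do not come from the notebook's data.

```python
import math


def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    pairs = list(pairs)
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(pairs))


# Hypothetical (rating, prediction) pairs
toy_pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0)]

print(round(rmse(toy_pairs), 4))  # 0.7071
```

An RMSE of about 0.87 on the test set therefore means the predictions are typically off by a little under one star on the rating scale.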


In [17]:
test_predictions.show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   148|    356|   4.0| 3.5258026|
|   148|   4896|   4.0|  3.519052|
|   148|   4993|   3.0|  3.481496|
|   148|   7153|   3.0| 3.4399748|
|   148|   8368|   4.0| 3.5952313|
|   148|  40629|   5.0|   3.25957|
|   148|  50872|   3.0| 3.6326275|
|   148|  60069|   4.5| 3.6905906|
|   148|  69757|   3.5|  3.290472|
|   148|  72998|   4.0| 3.2283332|
|   148|  81847|   4.5|  3.406727|
|   148|  98491|   5.0| 3.7326372|
|   148| 115617|   3.5|  3.604421|
|   148| 122886|   3.5| 3.4251683|
|   463|    296|   4.0|  4.176363|
|   463|    527|   4.0| 3.8242025|
|   463|   2019|   4.0| 3.9122622|
|   471|    527|   4.5| 3.8093982|
|   471|   6016|   4.0|  3.947259|
|   471|   6333|   2.5|  3.247336|
+------+-------+------+----------+
only showing top 20 rows



In [18]:
# Generate the top 10 recommendations for every user
nrecommendations = best_model.recommendForAllUsers(10)
nrecommendations.limit(10).show()



+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|     1|[{3379, 5.7009163...|
|     3|[{5746, 4.852619}...|
|     5|[{5490, 4.5228095...|
|     6|[{3925, 4.809606}...|
|     9|[{3379, 4.822992}...|
|    12|[{77846, 5.566331...|
|    13|[{3379, 4.876487}...|
|    15|[{3379, 4.4335475...|
|    16|[{3379, 4.5730414...|
|    17|[{3379, 5.1175084...|
+------+--------------------+



In [19]:
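# Flatten the recommendations array into one row per (user, movie) pair;
# the 'rating' column here holds the predicted rating
nrecommendations = nrecommendations\
    .withColumn("rec_exp", explode("recommendations"))\
    .select('userId', col("rec_exp.movieId"), col("rec_exp.rating"))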

nrecommendations.limit(10).show()

+------+-------+---------+
|userId|movieId|   rating|
+------+-------+---------+
|     1|   3379|5.7009163|
|     1|  33649| 5.554334|
|     1|   5490|5.4997315|
|     1| 171495|5.3938065|
|     1|   5915| 5.359663|
|     1|   5328| 5.328846|
|     1|   5416| 5.328846|
|     1|   3951| 5.328846|
|     1|  78836| 5.326736|
|     1|   6460| 5.313456|
+------+-------+---------+
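
The explode step turns one row per user, holding a list of recommendations, into one row per (user, movie) pair. A rough pure-Python analogue is sketched below; the `nested` data and the `explode_recommendations` helper are made up for illustration and only mimic the shape of the Spark output.

```python
# Hypothetical nested rows, shaped like the output of recommendForAllUsers
nested = [
    {"userId": 1, "recommendations": [(3379, 5.70), (33649, 5.55)]},
    {"userId": 3, "recommendations": [(5746, 4.85)]},
]


def explode_recommendations(rows):
    """Flatten one row per user into one row per (user, movie) pair."""
    return [
        (row["userId"], movie_id, predicted_rating)
        for row in rows
        for movie_id, predicted_rating in row["recommendations"]
    ]


flat = explode_recommendations(nested)
for record in flat:
    print(record)
```

Each user contributes as many output rows as they have recommendations, so 10 recommendations per user yields 10 rows per user in the flattened DataFrame.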



In [20]:
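# Top 10 recommended movies for user 100, with titles and genres ('rating' is the predicted rating)
nrecommendations.join(movies, on='movieId').filter('userId = 100').show()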

+-------+------+---------+--------------------+--------------------+
|movieId|userId|   rating|               title|              genres|
+-------+------+---------+--------------------+--------------------+
|  67618|   100|5.0475445|Strictly Sexual (...|Comedy|Drama|Romance|
|  33649|   100|5.0124555|  Saving Face (2004)|Comedy|Drama|Romance|
|  42730|   100|4.9845896|   Glory Road (2006)|               Drama|
|   3379|   100|4.9830713| On the Beach (1959)|               Drama|
|  74282|   100|4.8988805|Anne of Green Gab...|Children|Drama|Ro...|
| 117531|   100|  4.88108|    Watermark (2014)|         Documentary|
|   7071|   100|  4.88108|Woman Under the I...|               Drama|
| 179135|   100|  4.88108|Blue Planet II (2...|         Documentary|
|  84273|   100|  4.88108|Zeitgeist: Moving...|         Documentary|
|  26073|   100|  4.88108|Human Condition I...|           Drama|War|
+-------+------+---------+--------------------+--------------------+



In [21]:
# For comparison: the 10 movies that user 100 actually rated highest
ratings.join(movies, on='movieId').filter('userId = 100').sort('rating', ascending=False).limit(10).show()

+-------+------+------+--------------------+--------------------+
|movieId|userId|rating|               title|              genres|
+-------+------+------+--------------------+--------------------+
|   1101|   100|   5.0|      Top Gun (1986)|      Action|Romance|
|   1958|   100|   5.0|Terms of Endearme...|        Comedy|Drama|
|   2423|   100|   5.0|Christmas Vacatio...|              Comedy|
|   4041|   100|   5.0|Officer and a Gen...|       Drama|Romance|
|   5620|   100|   5.0|Sweet Home Alabam...|      Comedy|Romance|
|    368|   100|   4.5|     Maverick (1994)|Adventure|Comedy|...|
|    934|   100|   4.5|Father of the Bri...|              Comedy|
|    539|   100|   4.5|Sleepless in Seat...|Comedy|Drama|Romance|
|     16|   100|   4.5|       Casino (1995)|         Crime|Drama|
|    553|   100|   4.5|    Tombstone (1993)|Action|Drama|Western|
+-------+------+------+--------------------+--------------------+

