<a href="https://colab.research.google.com/github/4k5h1t/PySpark-Movie-Rec/blob/main/movie_recommender_using_spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommender System using Apache Spark (PySpark)

## Installing required dependencies


In [None]:
!pip install findspark
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845514 sha256=ae590a356fa8d79f4fa7d53ae9ab9e9727f51273d06c96d6ad8036efd6f11070
  Stored in directory: /root/.cache/pip/wheels/42/59/f5/79a5bf931714dcd201b26025347785f08

## Downloading and Extracting Dataset

In [None]:
!apt-get install -y aria2
!mkdir -p ./MovieLens/ 
!aria2c -s 16 -x 16 "https://files.grouplens.org/datasets/movielens/ml-25m.zip" -d ./MovieLens/
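import zipfile
# Use the same relative path the archive was downloaded to, so the cell
# also works outside Colab's /content working directory.
with zipfile.ZipFile("MovieLens/ml-25m.zip", 'r') as zip_ref:
    zip_ref.extractall("MovieLens")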

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  libc-ares2
The following NEW packages will be installed:
  aria2 libc-ares2
0 upgraded, 2 newly installed, 0 to remove and 5 not upgraded.
Need to get 1,274 kB of archives.
After this operation, 4,912 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libc-ares2 amd64 1.14.0-1ubuntu0.1 [37.5 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 aria2 amd64 1.33.1-1 [1,236 kB]
Fetched 1,274 kB in 1s (1,053 kB/s)
Selecting previously unselected package libc-ares2:amd64.
(Reading database ... 123991 files and directories currently installed.)
Preparing to unpack .../libc-ares2_1.14.0-1ubuntu0.1_amd64.deb ...
Unpacking libc-ares2:amd64 (1.14.0

## Installing and Configuring Spark and Java

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

!wget -q https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz

!tar xf spark-3.2.1-bin-hadoop3.2.tgz

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"

## Importing PySpark

In [None]:
import findspark
findspark.init('/content/spark-3.2.1-bin-hadoop3.2')
import pyspark

## Starting the Spark Session

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('recommendation').getOrCreate()

## Importing the ALS Model and the Evaluation Metric (Root Mean Square Error)

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

## Reading loaded Dataset

In [None]:
data = spark.read.csv('MovieLens/ml-25m/ratings.csv', inferSchema=True, header=True)

## Describing the Dataset

In [None]:
data.head()

Row(userId=1, movieId=296, rating=5.0, timestamp=1147880044)

In [None]:
data.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [None]:
data.describe().show()

+-------+-----------------+------------------+------------------+--------------------+
|summary|           userId|           movieId|            rating|           timestamp|
+-------+-----------------+------------------+------------------+--------------------+
|  count|         25000095|          25000095|          25000095|            25000095|
|   mean|81189.28115381162|21387.981943268616| 3.533854451353085|1.2156014431215513E9|
| stddev|46791.71589745776| 39198.86210105973|1.0607439611423535| 2.268758080595386E8|
|    min|                1|                 1|               0.5|           789652009|
|    max|           162541|            209171|               5.0|          1574327703|
+-------+-----------------+------------------+------------------+--------------------+



## Implementing ML Algorithm and Evaluation

### Train Test Split

In [None]:
(train_data, test_data) = data.randomSplit([0.7, 0.3], seed=42)
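
`randomSplit` assigns rows to the two sets at random, so the 70/30 proportions are approximate, and a movie (or user) with very few ratings can end up only in the test set. The pure-Python sketch below illustrates that effect on made-up data. The `random_split` helper and the toy rows are hypothetical and only mimic the idea; this is not Spark's implementation.

```python
import random


def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test independently, like a coin flip per row."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


# Hypothetical (userId, movieId) pairs; movie 99 has a single rating.
rows = [(u, m) for u in range(1, 6) for m in (10, 20, 30)] + [(1, 99)]
train, test = random_split(rows)

# Test rows whose movie never appears in the training set ("cold start").
train_movies = {m for _, m in train}
cold_start = [row for row in test if row[1] not in train_movies]
print(len(train), len(test), cold_start)
```

Any row that lands in `cold_start` is one the model has no learned factors for, which matters later when the predictions are evaluated.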

### Setting up and Training the ALS Model

In [None]:
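# coldStartStrategy is left at its default ("nan"), so users or movies that
# appear only in the test set receive NaN predictions.
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(train_data)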
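
To give a feel for what `als.fit` does, here is a minimal pure-Python sketch of alternating least squares with a single latent factor per user and per movie. It is an illustration under simplifying assumptions, not Spark's implementation: the toy ratings, the `rank1_als` and `squared_error` helpers, and their defaults are made up for this example, whereas Spark's `ALS` uses a higher rank (10 by default) and a distributed solver.

```python
def rank1_als(ratings, n_iter=5, reg=0.01):
    """Fit rating ~ u[user] * v[item] by alternating least squares.

    ratings: list of (user, item, rating) tuples (toy data, not MovieLens).
    """
    users = {user for user, _, _ in ratings}
    items = {item for _, item, _ in ratings}
    u = {user: 1.0 for user in users}  # user factors
    v = {item: 1.0 for item in items}  # item factors

    for _ in range(n_iter):
        # Hold item factors fixed; each user factor is a 1-D ridge regression.
        for user in users:
            num = sum(r * v[i] for x, i, r in ratings if x == user)
            den = reg + sum(v[i] ** 2 for x, i, r in ratings if x == user)
            u[user] = num / den
        # Hold user factors fixed; solve for each item factor the same way.
        for item in items:
            num = sum(r * u[x] for x, i, r in ratings if i == item)
            den = reg + sum(u[x] ** 2 for x, i, r in ratings if i == item)
            v[item] = num / den
    return u, v


def squared_error(ratings, u, v):
    """Sum of squared differences between ratings and u[user] * v[item]."""
    return sum((r - u[x] * v[i]) ** 2 for x, i, r in ratings)


toy_ratings = [
    (1, "A", 5.0), (1, "B", 3.0),
    (2, "A", 4.0), (2, "C", 2.0),
    (3, "B", 3.5), (3, "C", 2.5),
]
u, v = rank1_als(toy_ratings)
print(squared_error(toy_ratings, u, v))
```

Each half-step only needs the ratings of one user (or one movie), which is what makes the method easy to parallelise across a cluster.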

### Testing trained model

In [None]:
predictions = model.transform(test_data)

In [None]:
predictions.show()

+------+-------+------+----------+----------+
|userId|movieId|rating| timestamp|prediction|
+------+-------+------+----------+----------+
|     1|    307|   5.0|1147868828| 3.8360734|
|     1|   1175|   3.5|1147868826| 4.8629894|
|     1|   1237|   5.0|1147868839|  3.840872|
|     1|   1250|   4.0|1147868414| 3.4070103|
|     1|   2012|   2.5|1147868068|  2.580594|
|     1|   2068|   2.5|1147869044| 3.6549873|
|     1|   2161|   3.5|1147868609| 2.5584805|
|     1|   2692|   5.0|1147869100|  3.935215|
|     1|   3448|   4.0|1147868480| 3.1274965|
|     1|   3949|   5.0|1147868678| 4.5230117|
|     1|   4144|   5.0|1147868898| 3.4086895|
|     1|   4703|   4.0|1147869223| 3.0952768|
|     1|   4973|   4.5|1147869080| 4.9007077|
|     1|   5147|   4.0|1147877654| 3.3945804|
|     1|   5684|   2.0|1147879797|  4.514249|
|     1|   5878|   4.0|1147868807| 3.6380153|
|     1|   5912|   3.0|1147878698| 1.8637687|
|     1|   6377|   4.0|1147868469|  4.405485|
|     1|   6954|   3.5|1147869150|

### Evaluating Predictions

In [None]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print(f"Root-mean-square error = {rmse}")

Root-mean-square error = nan


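RMSE expresses the error in the same units as the rating column, i.e. stars. It is `nan` here because ALS, by default, predicts `NaN` for any user or movie in the test set that was not seen during training, and a single `NaN` prediction makes the whole metric `NaN`.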
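
The `nan` result above can be reproduced without Spark. The sketch below computes RMSE by hand on a few hypothetical (rating, prediction) pairs, one of which has a `NaN` prediction as a cold-start row would, and then again after filtering that row out. The `rmse` helper and the numbers are made up for illustration.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    squared = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical (rating, prediction) pairs; the last one mimics a cold-start row.
pairs = [(5.0, 4.0), (3.0, 4.0), (4.0, float("nan"))]

print(rmse(pairs))  # nan: one NaN prediction poisons the whole metric

# Keep only rows with a real prediction, then evaluate again.
valid = [(a, p) for a, p in pairs if not math.isnan(p)]
print(rmse(valid))  # 1.0
```

In Spark, the `ALS` estimator exposes this choice through its `coldStartStrategy` parameter: setting it to `"drop"` removes rows with `NaN` predictions from the output of `model.transform`, so the evaluator returns a finite RMSE.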

Now that we have the model, how do we actually recommend movies to a user?

The same way we scored the test data: pass a DataFrame of user and movie IDs to `model.transform`. For example:

## Extracting information of one user

In [None]:
userId = int(input('Enter a User ID to find recommendations for: '))
single_user = test_data.filter(test_data['userId']==userId).select(['movieId','userId'])

Enter a User ID to find recommendations for: 4


In [None]:
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|   1036|     4|
|   1210|     4|
|   1220|     4|
|   1527|     4|
|   1610|     4|
|   1732|     4|
|   2985|     4|
|   3033|     4|
|   3114|     4|
|   3827|     4|
|   5299|     4|
|   5952|     4|
|   6156|     4|
|   6874|     4|
|   7373|     4|
|   8641|     4|
|   8665|     4|
|  34048|     4|
|  34405|     4|
|  45431|     4|
+-------+------+
only showing top 20 rows



## Running the Model for the Selected User (Testing)

In [None]:
reccomendations = model.transform(single_user)

In [None]:
reccomendations.orderBy('prediction',ascending=False).show()

+-------+------+----------+
|movieId|userId|prediction|
+-------+------+----------+
| 180989|     4|       NaN|
|  86345|     4| 4.7234216|
| 148426|     4| 4.4276094|
|   1732|     4|  4.346176|
| 176371|     4| 4.2983685|
| 164179|     4| 4.2658653|
| 115713|     4|  4.204432|
|  79702|     4| 4.1920958|
| 115569|     4| 4.1544185|
| 148626|     4|  4.154091|
|  72226|     4| 4.0784645|
|  51255|     4|  4.045676|
|   1220|     4|  4.003692|
|  99114|     4| 3.8006232|
|  34405|     4|   3.79591|
|   8641|     4| 3.7825315|
|   6874|     4| 3.7486234|
|  70286|     4| 3.7247272|
|  58559|     4| 3.7219198|
|  96737|     4| 3.6870158|
+-------+------+----------+
only showing top 20 rows
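
In the table above, the `NaN` row comes first because Spark treats `NaN` as larger than any other numeric value when sorting. A usable top-N list needs those rows filtered out before ranking. The sketch below shows the idea in plain Python on a few rows copied from that table; the `top_n` helper is hypothetical and only illustrates the filtering and sorting step.

```python
import math


def top_n(predictions, n=3):
    """Return the n highest-scoring (movie_id, score) pairs, skipping NaN scores."""
    scored = [(m, s) for m, s in predictions if not math.isnan(s)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# A few (movieId, prediction) pairs copied from the table above.
sample = [
    (180989, float("nan")),
    (86345, 4.7234216),
    (148426, 4.4276094),
    (1732, 4.346176),
    (176371, 4.2983685),
]

print(top_n(sample))
# [(86345, 4.7234216), (148426, 4.4276094), (1732, 4.346176)]
```

Note also that `single_user` is built from `test_data`, so these are movies the user has already rated; the ranking shows how the model scores them rather than suggesting unseen titles.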



## Adding Movie Titles from the Supporting Dataset (Easier to Understand)

In [None]:
moviesdf = spark.read.csv("MovieLens/ml-25m/movies.csv", inferSchema=True, header=True)
moviesdf.show()

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
|     11|American Presiden...|Comedy|Drama|Romance|
|     12|Dracula: Dead and...|       Comedy|Horror|
|     13|        Balto (1995)|Adventure|Animati...|
|     14|        Nixon (1995)|               Drama|
|     15|Cutthroat Island ...|Action|Adventure|...|
|     16|       Casino (1995)|         Crime|Drama|
|     17|Sen

## Final Predictions with Movie Titles

In [None]:
rec = reccomendations
joined = moviesdf.join(rec, ['movieId'],how="inner")
joined.select('userId', 'movieId', 'title', 'genres', 'prediction').orderBy('prediction', ascending=False).show()

+------+-------+--------------------+--------------------+----------+
|userId|movieId|               title|              genres|prediction|
+------+-------+--------------------+--------------------+----------+
|     4| 180989| Alien Planet (2005)|Animation|Documen...|       NaN|
|     4|  86345|Louis C.K.: Hilar...|              Comedy| 4.7234216|
|     4| 148426|Fateful Findings ...|Drama|Fantasy|Thr...| 4.4276094|
|     4|   1732|Big Lebowski, The...|        Comedy|Crime|  4.346176|
|     4| 176371|Blade Runner 2049...|              Sci-Fi| 4.2983685|
|     4| 164179|      Arrival (2016)|              Sci-Fi| 4.2658653|
|     4| 115713|   Ex Machina (2015)|Drama|Sci-Fi|Thri...|  4.204432|
|     4|  79702|Scott Pilgrim vs....|Action|Comedy|Fan...| 4.1920958|
|     4| 115569| Nightcrawler (2014)|Crime|Drama|Thriller| 4.1544185|
|     4| 148626|Big Short, The (2...|               Drama|  4.154091|
|     4|  72226|Fantastic Mr. Fox...|Adventure|Animati...| 4.0784645|
|     4|  51255|    