
# Movie Recommender Systems using Spark with PySpark

## Installing required dependencies


In [1]:
!pip install findspark
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845512 sha256=118e6f351de951b865718db98298fb1ca47c635995ed1a29d7c1bf15d21e8c0b
  Stored in directory: /root/.cache/pip/wheels/43/dc/11/ec201cd671da62fa9c5cc77078235e407

## Downloading the Dataset

In [2]:
!apt-get install -y aria2
!mkdir -p ./MovieLens/ 
!aria2c -s 16 -x 16 "https://files.grouplens.org/datasets/movielens/ml-25m.zip" -d ./MovieLens/

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  libc-ares2
The following NEW packages will be installed:
  aria2 libc-ares2
0 upgraded, 2 newly installed, 0 to remove and 7 not upgraded.
Need to get 1,274 kB of archives.
After this operation, 4,912 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libc-ares2 amd64 1.14.0-1ubuntu0.1 [37.5 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 aria2 amd64 1.33.1-1 [1,236 kB]
Fetched 1,274 kB in 1s (1,661 kB/s)
Selecting previously unselected package libc-ares2:amd64.
(Reading database ... 124015 files and directories currently installed.)
Preparing to unpack .../libc-ares2_1.14.0-1ubuntu0.1_amd64.deb ...
Unpacking libc-ares2:amd64 (1.14.0

## Downloading Spark and Java Dependencies

In [3]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

!wget -q https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz

!tar xf spark-3.2.1-bin-hadoop3.2.tgz


## Extracting the Dataset

In [4]:
import zipfile
with zipfile.ZipFile("/content/MovieLens/ml-25m.zip", 'r') as zip_ref:
    zip_ref.extractall("/content/MovieLens")
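
Re-running the extraction cell rewrites every file. The sketch below shows one way to make the step idempotent using only the standard library. The helper name `extract_once` and the tiny stand-in archive built in a temporary directory are assumptions for illustration, not part of the notebook.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_once(zip_path, dest_dir):
    """Extract zip_path into dest_dir, skipping members that already exist.

    Hypothetical helper: returns the list of member names actually written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for member in zip_ref.namelist():
            if (dest / member).exists():
                continue  # already extracted on an earlier run
            zip_ref.extract(member, dest)
            written.append(member)
    return written


# Demonstration on a small stand-in archive (not the real ml-25m.zip).
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr(
            "ml-demo/ratings.csv",
            "userId,movieId,rating,timestamp\n1,296,5.0,1147880044\n",
        )
    first_run = extract_once(archive, Path(tmp) / "out")
    second_run = extract_once(archive, Path(tmp) / "out")
    extracted_ok = (Path(tmp) / "out" / "ml-demo" / "ratings.csv").exists()

print(first_run, second_run, extracted_ok)
```

The second call writes nothing because the file is already present. In the notebook the same helper could be pointed at `/content/MovieLens/ml-25m.zip`.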

## Start Setup Time

In [5]:
import time
setupst = time.time()
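
The notebook tracks each phase with a pair of `time.time()` variables. A small context manager can do the same bookkeeping in one place. This is a minimal sketch; the names `timed` and `timings` are hypothetical, and `time.sleep` stands in for the real work.

```python
import time
from contextlib import contextmanager

timings = {}  # phase name -> elapsed seconds (hypothetical container)


@contextmanager
def timed(phase):
    """Record the wall-clock duration of the enclosed block under `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = time.perf_counter() - start


with timed("setup"):
    time.sleep(0.01)  # stand-in for the real setup work

print(f"Time to setup = {timings['setup']:.3f}")
```

`time.perf_counter()` is used here because it is a monotonic clock intended for measuring intervals.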

## Setting up Spark and Java Paths

In [6]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"

## Importing PySpark

In [7]:
import findspark
findspark.init('/content/spark-3.2.1-bin-hadoop3.2')
import pyspark

## Starting up Spark Session, Clusters etc. 

In [8]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('recommendation').getOrCreate()

## Importing ALS model 

In [9]:
from pyspark.ml.recommendation import ALS

## Reading loaded Dataset

In [10]:
data = spark.read.csv('MovieLens/ml-25m/ratings.csv', inferSchema=True, header=True)
setupet = time.time()

## Describing / Showcasing Dataset

In [11]:
data.head()

Row(userId=1, movieId=296, rating=5.0, timestamp=1147880044)

In [12]:
data.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [13]:
data.describe().show()

+-------+-----------------+------------------+------------------+--------------------+
|summary|           userId|           movieId|            rating|           timestamp|
+-------+-----------------+------------------+------------------+--------------------+
|  count|         25000095|          25000095|          25000095|            25000095|
|   mean|81189.28115381162|21387.981943268616| 3.533854451353085|1.2156014431215513E9|
| stddev|46791.71589745776| 39198.86210105973|1.0607439611423535| 2.268758080595386E8|
|    min|                1|                 1|               0.5|           789652009|
|    max|           162541|            209171|               5.0|          1574327703|
+-------+-----------------+------------------+------------------+--------------------+



## Implementing ML Algorithm and Evaluation

### Train Test Split

In [14]:
(train_data, test_data) = data.randomSplit([0.7, 0.3], seed=42)
setupTime = setupet - setupst

### Setting up and Training the ALS Model

In [15]:
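trainst = time.time()
# coldStartStrategy is left at its default ("nan"), so users or movies that
# appear only in the test split receive NaN predictions.
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(train_data)
trainet = time.time()
trainTime = trainet - trainst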
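
For intuition about what `als.fit` does, the sketch below runs alternating least squares by hand on a tiny rating set. It is a deliberately simplified, rank-1, pure-Python illustration and not Spark's implementation: the data, the function name `als_rank1`, and the plain (unweighted) regularization are all assumptions.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iters=20):
    """Fit rating[u, i] ~ user_f[u] * item_f[i] by alternating least squares.

    `ratings` maps (user, item) -> rating for the observed entries only.
    """
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iters):
        # Fix the item factors and solve a 1-D ridge problem for every user.
        for u in range(n_users):
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(item_f[i] ** 2 for (uu, i) in ratings if uu == u)
            user_f[u] = num / den
        # Fix the user factors and solve for every item.
        for i in range(n_items):
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(user_f[u] ** 2 for (u, ii) in ratings if ii == i)
            item_f[i] = num / den
    return user_f, item_f


# Three observed ratings; the (user 1, item 1) entry is held out.
observed = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0}
user_f, item_f = als_rank1(observed, n_users=2, n_items=2)
predicted = user_f[1] * item_f[1]
print(f"predicted rating for user 1, item 1: {predicted:.2f}")
```

Each half-step has a closed-form solution because one set of factors is held fixed, which is what makes the alternating scheme attractive for large, sparse rating data.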

### Testing trained model

In [16]:
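testst = time.time()
# transform() is lazy: this only builds the query plan, so testTime does not
# include the cost of actually computing the predictions.
predictions = model.transform(test_data)
testet = time.time()
testTime = testet - testst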
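
Spark DataFrame transformations are evaluated lazily, so `model.transform` returns before any rating is predicted; the work happens when an action such as `show()` runs. The sketch below is only a plain-Python analogy built on a generator (it does not use Spark). `slow_predict` and the sleep duration are stand-ins.

```python
import time


def slow_predict(x):
    """Stand-in for an expensive per-row prediction."""
    time.sleep(0.01)
    return x * 2


rows = range(20)

start = time.perf_counter()
lazy_predictions = (slow_predict(x) for x in rows)  # nothing runs yet
build_time = time.perf_counter() - start

start = time.perf_counter()
results = list(lazy_predictions)  # forces evaluation, like a Spark action
run_time = time.perf_counter() - start

print(f"build: {build_time:.4f}s, run: {run_time:.4f}s")
```

In the notebook, timing an action that consumes the `prediction` column, such as an aggregate over it, would give a more representative figure for scoring cost.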

In [17]:
predictions.show()

+------+-------+------+----------+----------+
|userId|movieId|rating| timestamp|prediction|
+------+-------+------+----------+----------+
|     1|    307|   5.0|1147868828|  4.044257|
|     1|   1175|   3.5|1147868826| 3.8279672|
|     1|   1237|   5.0|1147868839| 4.0510416|
|     1|   1250|   4.0|1147868414| 3.5275607|
|     1|   2012|   2.5|1147868068| 2.2809074|
|     1|   2068|   2.5|1147869044| 3.8753283|
|     1|   2161|   3.5|1147868609| 3.6879287|
|     1|   2692|   5.0|1147869100| 4.2605157|
|     1|   3448|   4.0|1147868480| 3.7469707|
|     1|   3949|   5.0|1147868678| 4.7491083|
|     1|   4144|   5.0|1147868898| 3.6138039|
|     1|   4703|   4.0|1147869223| 3.8178573|
|     1|   4973|   4.5|1147869080| 4.5596046|
|     1|   5147|   4.0|1147877654| 3.3674011|
|     1|   5684|   2.0|1147879797|      3.33|
|     1|   5878|   4.0|1147868807| 3.9194965|
|     1|   5912|   3.0|1147878698|  3.214029|
|     1|   6377|   4.0|1147868469| 3.5168176|
|     1|   6954|   3.5|1147869150|
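
A common way to quantify prediction quality for explicit ratings is root-mean-square error (RMSE). The sketch below computes it in plain Python over only the complete rows displayed above, so the result describes that small sample and not the full test set. NaN predictions, which ALS can produce for users or movies absent from training, are skipped. The function name `rmse` is an assumption.

```python
import math

# (rating, prediction) pairs copied from the complete rows shown above.
sample = [
    (5.0, 4.044257), (3.5, 3.8279672), (5.0, 4.0510416), (4.0, 3.5275607),
    (2.5, 2.2809074), (2.5, 3.8753283), (3.5, 3.6879287), (5.0, 4.2605157),
    (4.0, 3.7469707), (5.0, 4.7491083), (5.0, 3.6138039), (4.0, 3.8178573),
    (4.5, 4.5596046), (4.0, 3.3674011), (2.0, 3.33), (4.0, 3.9194965),
    (3.0, 3.214029), (4.0, 3.5168176),
]


def rmse(pairs):
    """RMSE over (actual, predicted) pairs, ignoring NaN predictions."""
    errors = [(a - p) ** 2 for a, p in pairs if not math.isnan(p)]
    if not errors:
        raise ValueError("no valid predictions to evaluate")
    return math.sqrt(sum(errors) / len(errors))


sample_rmse = rmse(sample)
print(f"RMSE on the {len(sample)} displayed rows: {sample_rmse:.3f}")
```

On the full `predictions` DataFrame, Spark's `RegressionEvaluator` with `metricName="rmse"` computes the same metric at scale.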

## Extracting information of one user

In [19]:
single_user = test_data.filter(test_data['userId']==12).select(['movieId','userId'])


In [20]:
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|      1|    12|
|      2|    12|
|     22|    12|
|     50|    12|
|     88|    12|
|    101|    12|
|    140|    12|
|    145|    12|
|    150|    12|
|    163|    12|
|    165|    12|
|    175|    12|
|    185|    12|
|    203|    12|
|    209|    12|
|    257|    12|
|    319|    12|
|    351|    12|
|    377|    12|
|    433|    12|
+-------+------+
only showing top 20 rows



## Running Model again for one selected user (testing)

In [21]:
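test1st = time.time()
# As with the full test set, transform() is lazy, so test1Time covers only
# building the query plan.
reccomendations = model.transform(single_user)
test1et = time.time()
test1Time = test1et - test1st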

In [22]:
reccomendations.orderBy('prediction',ascending=False).show()

+-------+------+----------+
|movieId|userId|prediction|
+-------+------+----------+
|    858|    12| 4.2775836|
|   4973|    12|  4.250431|
|    745|    12| 4.2312336|
|   1207|    12| 4.1879983|
|  40870|    12| 4.1767607|
|   1136|    12| 4.1566386|
|   1212|    12| 4.1313143|
|   1219|    12| 4.1263475|
|    926|    12|  4.109017|
|   3435|    12| 4.1078067|
|     50|    12|  4.094791|
|   1197|    12| 4.0918846|
|  63876|    12|  4.078976|
|  74416|    12| 4.0785985|
|    527|    12|  4.067445|
|   1225|    12| 4.0649686|
|   1361|    12|  4.055188|
|   4226|    12|  4.011339|
|   3996|    12|  4.007953|
|  55442|    12| 3.9923105|
+-------+------+----------+
only showing top 20 rows
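
Because `single_user` is drawn from `test_data`, the table above ranks movies that user 12 has already rated. To recommend new titles, candidates are usually restricted to movies the user has not rated. A minimal plain-Python sketch of that filtering and ranking step follows; the scores and the `top_n_unseen` helper are hypothetical, illustrative values and not model output.

```python
import heapq


def top_n_unseen(scores, already_rated, n=3):
    """Return the n highest-scoring (movie_id, score) pairs not yet rated."""
    candidates = ((m, s) for m, s in scores.items() if m not in already_rated)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])


# Hypothetical predicted scores for one user: movieId -> score.
scores = {10: 3.1, 20: 4.6, 30: 4.9, 40: 2.2, 50: 4.4, 60: 4.8}
already_rated = {30, 40}

recommended = top_n_unseen(scores, already_rated)
print(recommended)
```

Spark's `ALSModel` also provides `recommendForUserSubset`, which returns the top-scoring items per user across all items; filtering out already-rated movies is still left to the caller.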



## Analysing with the help of supporting Datasets (Easier to Understand)

In [23]:
moviesdf = spark.read.csv("MovieLens/ml-25m/movies.csv", inferSchema=True, header=True)
moviesdf.show()

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
|     11|American Presiden...|Comedy|Drama|Romance|
|     12|Dracula: Dead and...|       Comedy|Horror|
|     13|        Balto (1995)|Adventure|Animati...|
|     14|        Nixon (1995)|               Drama|
|     15|Cutthroat Island ...|Action|Adventure|...|
|     16|       Casino (1995)|         Crime|Drama|
|     17|Sen

## Final Predictions with Movie Titles

In [24]:
rec = reccomendations
joined = moviesdf.join(rec, ['movieId'],how="inner")
joined.select('userId', 'movieId', 'title', 'genres', 'prediction').orderBy('prediction', ascending=False).show()

+------+-------+--------------------+--------------------+----------+
|userId|movieId|               title|              genres|prediction|
+------+-------+--------------------+--------------------+----------+
|    12|    858|Godfather, The (1...|         Crime|Drama| 4.2775836|
|    12|   4973|Amelie (Fabuleux ...|      Comedy|Romance|  4.250431|
|    12|    745|Wallace & Gromit:...|Animation|Childre...| 4.2312336|
|    12|   1207|To Kill a Mocking...|               Drama| 4.1879983|
|    12|  40870|   C.R.A.Z.Y. (2005)|               Drama| 4.1767607|
|    12|   1136|Monty Python and ...|Adventure|Comedy|...| 4.1566386|
|    12|   1212|Third Man, The (1...|Film-Noir|Mystery...| 4.1313143|
|    12|   1219|       Psycho (1960)|        Crime|Horror| 4.1263475|
|    12|    926|All About Eve (1950)|               Drama|  4.109017|
|    12|   3435|Double Indemnity ...|Crime|Drama|Film-...| 4.1078067|
|    12|     50|Usual Suspects, T...|Crime|Mystery|Thr...|  4.094791|
|    12|   1197|Prin

In [28]:
print("Time to setup = ", setupTime)
print("Time to train = ", trainTime)
print("Time to test = ", testTime)
print("Time to test1 = ", test1Time)

Time to setup =  51.211623191833496
Time to train =  208.77153968811035
Time to test =  0.14310717582702637
Time to test1 =  0.08272528648376465
