# **Spark HW2 Movie Recommendation**
In this notebook, we use the Alternating Least Squares (ALS) algorithm from Spark ML to predict movie ratings on the [MovieLens small dataset](https://grouplens.org/datasets/movielens/latest/).

# Spark Setup

In [1]:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz

Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:4 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:5 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Ign:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [696 B]
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:10 https://developer.download.nvid

In [2]:
!pip install -q findspark
!pip install py4j

# `!export ...` runs in a throwaway subshell and does not affect the notebook process,
# so the environment variables are set from Python instead.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"
import findspark
findspark.init(os.environ["SPARK_HOME"])  # point findspark at SPARK_HOME


from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Collecting py4j
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)

In [3]:
spark.version

'3.2.1'

# Data ETL and Data Exploration

In [4]:
# load packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math

import os
os.environ["PYSPARK_PYTHON"] = "python3"

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
!unzip "/content/drive/MyDrive/ml-latest-small.zip"

Archive:  /content/drive/MyDrive/ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [7]:
movies_df = spark.read.load("/content/ml-latest-small/movies.csv", format='csv', header = True)
ratings_df = spark.read.load("/content/ml-latest-small/ratings.csv", format='csv', header = True)
links_df = spark.read.load("/content/ml-latest-small/links.csv", format='csv', header = True)
tags_df = spark.read.load("/content/ml-latest-small/tags.csv", format='csv', header = True)

In [8]:
movies_df.show(5)

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows



In [9]:
ratings_df.show(5)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
+------+-------+------+---------+
only showing top 5 rows



In [10]:
links_df.show(5)

+-------+-------+------+
|movieId| imdbId|tmdbId|
+-------+-------+------+
|      1|0114709|   862|
|      2|0113497|  8844|
|      3|0113228| 15602|
|      4|0114885| 31357|
|      5|0113041| 11862|
+-------+-------+------+
only showing top 5 rows



In [11]:
tags_df.show(5)

+------+-------+---------------+----------+
|userId|movieId|            tag| timestamp|
+------+-------+---------------+----------+
|     2|  60756|          funny|1445714994|
|     2|  60756|Highly quotable|1445714996|
|     2|  60756|   will ferrell|1445714992|
|     2|  89774|   Boxing story|1445715207|
|     2|  89774|            MMA|1445715200|
+------+-------+---------------+----------+
only showing top 5 rows



In [12]:
tmp1 = ratings_df.groupBy("userId").count().toPandas()['count'].min()
tmp2 = ratings_df.groupBy("movieId").count().toPandas()['count'].min()
print('For the users that rated movies and the movies that were rated:')
print('Minimum number of ratings per user is {}'.format(tmp1))
print('Minimum number of ratings per movie is {}'.format(tmp2))

For the users that rated movies and the movies that were rated:
Minimum number of ratings per user is 20
Minimum number of ratings per movie is 1


In [13]:
tmp1 = sum(ratings_df.groupBy("movieId").count().toPandas()['count'] == 1)
tmp2 = ratings_df.select('movieId').distinct().count()
print('{} out of {} movies are rated by only one user'.format(tmp1, tmp2))

3446 out of 9724 movies are rated by only one user


# Part 1: Spark SQL and OLAP

In [14]:
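# registerTempTable is deprecated; createOrReplaceTempView is its replacement
movies_df.createOrReplaceTempView("movies")
ratings_df.createOrReplaceTempView("ratings")
links_df.createOrReplaceTempView("links")
tags_df.createOrReplaceTempView("tags")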



### Q1: The number of users

In [15]:
q1_result = spark.sql('SELECT count(DISTINCT userId) AS Number_of_Users FROM ratings')
q1_result.show()

+---------------+
|Number_of_Users|
+---------------+
|            610|
+---------------+



### Q2: The number of movies

In [16]:
q2_result = spark.sql('SELECT count(DISTINCT movieId) AS Number_of_Movies FROM movies')
q2_result.show()

+----------------+
|Number_of_Movies|
+----------------+
|            9742|
+----------------+



### Q3: How many movies have been rated by users? List the movies that have never been rated

In [17]:
q3_result_1 = spark.sql('SELECT count(DISTINCT movieId) AS Number_of_Rated_Movies FROM ratings')
q3_result_1.show()

+----------------------+
|Number_of_Rated_Movies|
+----------------------+
|                  9724|
+----------------------+



In [18]:
movie_not_rated = '''
SELECT movieId, title
FROM movies
WHERE movieId not in
(SELECT movieId FROM ratings)
'''
q3_result_2 = spark.sql(movie_not_rated)
q3_result_2.show()

+-------+--------------------+
|movieId|               title|
+-------+--------------------+
|   1076|Innocents, The (1...|
|   2939|      Niagara (1953)|
|   3338|For All Mankind (...|
|   3456|Color of Paradise...|
|   4194|I Know Where I'm ...|
|   5721|  Chosen, The (1981)|
|   6668|Road Home, The (W...|
|   6849|      Scrooge (1970)|
|   7020|        Proof (1991)|
|   7792|Parallax View, Th...|
|   8765|This Gun for Hire...|
|  25855|Roaring Twenties,...|
|  26085|Mutiny on the Bou...|
|  30892|In the Realms of ...|
|  32160|Twentieth Century...|
|  32371|Call Northside 77...|
|  34482|Browning Version,...|
|  85565|  Chalet Girl (2011)|
+-------+--------------------+



### Q4: List movie genres

In [19]:
q4_query = '''
SELECT EXPLODE(split(genres, '[|]')) AS genres
FROM movies
'''
q4_result = spark.sql(q4_query)
q4_result.show()

+---------+
|   genres|
+---------+
|Adventure|
|Animation|
| Children|
|   Comedy|
|  Fantasy|
|Adventure|
| Children|
|  Fantasy|
|   Comedy|
|  Romance|
|   Comedy|
|    Drama|
|  Romance|
|   Comedy|
|   Action|
|    Crime|
| Thriller|
|   Comedy|
|  Romance|
|Adventure|
+---------+
only showing top 20 rows
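
The output above repeats genres because `EXPLODE` emits one row per (movie, genre) pair; adding `DISTINCT` to the query would collapse them into a genre list. The plain-Python sketch below mimics the split-and-explode step and then deduplicates. It is an illustration only, not the Spark implementation: `genre_strings` is an assumption, hand-reconstructed from the first 14 exploded rows shown above. The pattern `[|]` is needed because the second argument of `split` is a regular expression, in which a bare `|` means alternation.

```python
import re

# Genre strings of the first five movies, reconstructed from the exploded rows above.
genre_strings = [
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
    "Comedy|Drama|Romance",
    "Comedy",
]

# split(genres, '[|]') followed by EXPLODE: one output row per genre per movie.
exploded = [g for s in genre_strings for g in re.split("[|]", s)]

# DISTINCT: keep each genre once (sorted here only for a stable order).
distinct_genres = sorted(set(exploded))

print(len(exploded))
print(distinct_genres)
```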



### Q5: Movies in each category

In [20]:
q5_query = '''
SELECT genres, count(*) AS Num_Movies
FROM (SELECT movieId, title, EXPLODE(split(genres, '[|]')) AS genres
FROM movies)
GROUP BY genres
ORDER BY Num_Movies DESC
'''
spark.sql(q5_query).show()

+------------------+----------+
|            genres|Num_Movies|
+------------------+----------+
|             Drama|      4361|
|            Comedy|      3756|
|          Thriller|      1894|
|            Action|      1828|
|           Romance|      1596|
|         Adventure|      1263|
|             Crime|      1199|
|            Sci-Fi|       980|
|            Horror|       978|
|           Fantasy|       779|
|          Children|       664|
|         Animation|       611|
|           Mystery|       573|
|       Documentary|       440|
|               War|       382|
|           Musical|       334|
|           Western|       167|
|              IMAX|       158|
|         Film-Noir|        87|
|(no genres listed)|        34|
+------------------+----------+



In [21]:
# Register a Python UDF that joins a list of titles into one comma-separated string
concat_movies = lambda x: ','.join(x)
spark.udf.register('concat_movies', concat_movies)

q5_query_2 = '''
SELECT genres, concat_movies(collect_set(title)) AS MovieList
FROM (SELECT movieId, title, EXPLODE(split(genres, '[|]')) AS genres
FROM movies)
GROUP BY genres
ORDER BY genres
'''
spark.sql(q5_query_2).show()

+------------------+--------------------+
|            genres|           MovieList|
+------------------+--------------------+
|(no genres listed)|T2 3-D: Battle Ac...|
|            Action|Stealing Rembrand...|
|         Adventure|Ice Age: Collisio...|
|         Animation|Ice Age: Collisio...|
|          Children|Ice Age: Collisio...|
|            Comedy|Hysteria (2011),H...|
|             Crime|Stealing Rembrand...|
|       Documentary|The Barkley Marat...|
|             Drama|Airport '77 (1977...|
|           Fantasy|Masters of the Un...|
|         Film-Noir|Rififi (Du rififi...|
|            Horror|Underworld: Rise ...|
|              IMAX|Harry Potter and ...|
|           Musical|U2: Rattle and Hu...|
|           Mystery|Before and After ...|
|           Romance|Vampire in Brookl...|
|            Sci-Fi|Push (2009),SORI:...|
|          Thriller|Element of Crime,...|
|               War|General, The (192...|
|           Western|Man Who Shot Libe...|
+------------------+--------------------+

# Part 2: Spark ALS-based approach for model training
We will use Spark ML to predict the ratings. We reuse the `ratings_df` DataFrame loaded earlier, drop the `timestamp` column, and cast the remaining columns to numeric types, so that each row has the form (userId, movieId, rating).

In [22]:
ratings_df.show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
|     1|     70|   3.0|964982400|
|     1|    101|   5.0|964980868|
|     1|    110|   4.0|964982176|
|     1|    151|   5.0|964984041|
|     1|    157|   5.0|964984100|
|     1|    163|   5.0|964983650|
|     1|    216|   5.0|964981208|
|     1|    223|   3.0|964980985|
|     1|    231|   5.0|964981179|
|     1|    235|   4.0|964980908|
|     1|    260|   5.0|964981680|
|     1|    296|   3.0|964982967|
|     1|    316|   3.0|964982310|
|     1|    333|   5.0|964981179|
|     1|    349|   4.0|964982563|
+------+-------+------+---------+
only showing top 20 rows



In [23]:
movie_ratings=ratings_df.drop('timestamp')

In [24]:
# Data type convert
from pyspark.sql.types import IntegerType, FloatType
movie_ratings = movie_ratings.withColumn("userId", movie_ratings["userId"].cast(IntegerType()))
movie_ratings = movie_ratings.withColumn("movieId", movie_ratings["movieId"].cast(IntegerType()))
movie_ratings = movie_ratings.withColumn("rating", movie_ratings["rating"].cast(FloatType()))

In [25]:
movie_ratings.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
|     1|    163|   5.0|
|     1|    216|   5.0|
|     1|    223|   3.0|
|     1|    231|   5.0|
|     1|    235|   4.0|
|     1|    260|   5.0|
|     1|    296|   3.0|
|     1|    316|   3.0|
|     1|    333|   5.0|
|     1|    349|   4.0|
+------+-------+------+
only showing top 20 rows



## ALS Model Selection and Evaluation
With the ALS model, we run a grid search over `rank`, `maxIter`, and `regParam` with 5-fold cross-validation and keep the combination with the lowest RMSE.
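
For reference, a common textbook form of the objective that ALS minimizes is sketched below. The symbols are assumptions introduced here, not notation from the notebook: $r_{ui}$ is an observed rating, $x_u$ and $y_i$ are the user and movie factor vectors of length `rank`, $\lambda$ plays the role of `regParam`, and $\Omega$ is the set of observed (user, movie) pairs. Spark's implementation may scale the regularization term differently.

$$
\min_{X,\,Y} \; \sum_{(u,i)\in\Omega} \left(r_{ui} - x_u^{\top} y_i\right)^2 \;+\; \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
$$

With $Y$ held fixed, the problem is a regularized least-squares problem in each $x_u$, and the same holds for each $y_i$ with $X$ held fixed, so the algorithm alternates between the two updates for `maxIter` rounds.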

In [26]:
# import package
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator,ParamGridBuilder

In [27]:
#Create test and train set
(training,test)=movie_ratings.randomSplit([0.8,0.2])

In [28]:
#Create ALS model
als = ALS(maxIter=5, rank=10, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")

In [29]:
#Tune model using ParamGridBuilder
param_grid = ParamGridBuilder() \
            .addGrid(als.rank, [5, 20, 80]) \
            .addGrid(als.maxIter, [1, 5, 10]) \
            .addGrid(als.regParam, [0.05, 0.1, 0.5]) \
            .build()
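
To get a feel for the cost of this search, the sketch below counts the model fits it implies. It uses only the grid values above and the `numFolds = 5` setting from the `CrossValidator` cell; the variable names are illustrative and not part of the notebook.

```python
from itertools import product

# Candidate values copied from the ParamGridBuilder cell above.
ranks = [5, 20, 80]
max_iters = [1, 5, 10]
reg_params = [0.05, 0.1, 0.5]
num_folds = 5  # matches numFolds in the CrossValidator cell

# Every (rank, maxIter, regParam) combination in the grid.
grid = list(product(ranks, max_iters, reg_params))

# One fit per combination per fold; this count ignores the final refit
# of the best combination on the full training set.
cv_fits = len(grid) * num_folds

print(len(grid), cv_fits)
```

With 27 combinations and 5 folds, the search performs 135 ALS fits before the final refit, which is why the `cv.fit` cell is slow.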

In [30]:
# Build and fit the recommendation model using ALS on the training data
#model = als.fit(training)
# Generate predictions on test data
#predictions = model.transform(test)

In [31]:
# Define evaluator as RMSE
# Tell Spark how to evaluate predictions
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")

In [32]:
# Build Cross validation
cv = CrossValidator(estimator = als, estimatorParamMaps = param_grid, evaluator = evaluator, numFolds = 5)

In [33]:
#Fit ALS model to training data
cvModel = cv.fit(training)

In [34]:
#Extract best model from the tuning exercise using ParamGridBuilder
bestModel = cvModel.bestModel

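# Model Testing
Finally, we generate predictions with the best model and compute the RMSE. Note that the cell below scores the `training` split, so the value it reports is the training error; passing `test` to `bestModel.transform` instead would give the testing error.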

In [35]:
predictions = bestModel.transform(training)
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 0.4516656450482216
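
As a reminder of what the evaluator computes, RMSE is the square root of the mean squared difference between ratings and predictions. The plain-Python sketch below applies that definition to three (rating, prediction) pairs copied from the `predictions.show()` output in this section. The helper `rmse` is hypothetical and is not how `RegressionEvaluator` is implemented.

```python
import math

def rmse(ratings, predictions):
    """Root-mean-square error between two equal-length sequences."""
    if len(ratings) != len(predictions) or not ratings:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(r - p) ** 2 for r, p in zip(ratings, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# (rating, prediction) pairs copied from the first three rows of predictions.show().
sample_ratings = [3.0, 4.0, 3.0]
sample_predictions = [3.5508718, 3.779701, 3.1318376]

print(rmse(sample_ratings, sample_predictions))
```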


In [36]:
#Print evaluation metrics and model parameters
print('---Best Model---')
print('Rank: ', bestModel._java_obj.parent().getRank())
print('regParam: ', bestModel._java_obj.parent().getRegParam())
print('maxIter: ', bestModel._java_obj.parent().getMaxIter())

---Best Model---
Rank:  80
regParam:  0.1
maxIter:  10


In [37]:
predictions.show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   148|   1197|   3.0| 3.5508718|
|   148|   4896|   4.0|  3.779701|
|   148|   4993|   3.0| 3.1318376|
|   148|   5618|   3.0|  3.370928|
|   148|   5816|   4.0| 3.7675152|
|   148|   5952|   3.0|  3.150608|
|   148|   7153|   3.0| 3.2662375|
|   148|   8368|   4.0| 3.9628048|
|   148|  40629|   5.0| 4.2056103|
|   148|  44191|   4.0| 3.6175473|
|   148|  50872|   3.0|  3.458837|
|   148|  54001|   4.0| 3.7767105|
|   148|  60069|   4.5| 3.9827845|
|   148|  68954|   4.0| 4.0239353|
|   148|  69757|   3.5| 3.5186436|
|   148|  69844|   4.0| 3.8859644|
|   148|  72998|   4.0| 3.5891728|
|   148|  79132|   1.5| 2.9572644|
|   148|  79702|   4.0|  3.763425|
|   148|  81834|   4.0| 4.0631843|
+------+-------+------+----------+
only showing top 20 rows



## Apply the model to all ratings and check the performance

In [38]:
alldata = bestModel.transform(movie_ratings)
rmse = evaluator.evaluate(alldata)
print ("RMSE = "+str(rmse))

RMSE = 0.560407733578558


In [39]:
alldata.createOrReplaceTempView("alldata")

In [40]:
spark.sql("Select * From alldata").show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   463|   1088|   3.5| 3.3214397|
|   137|   1580|   3.5| 3.2332358|
|   580|   1580|   4.0| 3.5752473|
|   580|   3175|   2.5|  3.191332|
|   580|  44022|   3.5| 3.4787579|
|   133|    471|   4.0| 3.4408293|
|   322|   1580|   3.5|  3.116364|
|   362|   1591|   4.0|  3.337042|
|   362|   1645|   5.0| 3.9485297|
|   593|   1580|   1.5| 2.3164692|
|   597|    471|   2.0| 2.7939858|
|   597|   1580|   3.0|   3.27649|
|   597|   1959|   4.0| 3.9473586|
|   597|   2366|   5.0| 3.8442469|
|   108|   1959|   5.0| 4.5452566|
|   155|   1580|   4.0| 3.9446092|
|   155|   3175|   4.0| 3.8770685|
|    34|   1580|   2.5| 2.9747825|
|    34|   3997|   2.0| 1.8643966|
|   368|   1580|   3.0| 2.9085107|
+------+-------+------+----------+
only showing top 20 rows



In [41]:
spark.sql('select * from movies join alldata on movies.movieId=alldata.movieId').show()

+-------+--------------------+--------------------+------+-------+------+----------+
|movieId|               title|              genres|userId|movieId|rating|prediction|
+-------+--------------------+--------------------+------+-------+------+----------+
|    356| Forrest Gump (1994)|Comedy|Drama|Roma...|   148|    356|   4.0|   3.72073|
|   1197|Princess Bride, T...|Action|Adventure|...|   148|   1197|   3.0| 3.5508718|
|   4308| Moulin Rouge (2001)|Drama|Musical|Rom...|   148|   4308|   4.0| 3.0344296|
|   4886|Monsters, Inc. (2...|Adventure|Animati...|   148|   4886|   3.0|  3.784968|
|   4896|Harry Potter and ...|Adventure|Childre...|   148|   4896|   4.0|  3.779701|
|   4993|Lord of the Rings...|   Adventure|Fantasy|   148|   4993|   3.0| 3.1318376|
|   5618|Spirited Away (Se...|Adventure|Animati...|   148|   5618|   3.0|  3.370928|
|   5816|Harry Potter and ...|   Adventure|Fantasy|   148|   5816|   4.0| 3.7675152|
|   5952|Lord of the Rings...|   Adventure|Fantasy|   148|   5952

# Recommend movies to users with IDs 575 and 232

In [56]:
def topKRec(k,id,model):
  #k: the number of movies to recommend
  #id: the id of the user to give recommendations
  #model: the trained model for recommendation

  # Top-k recommendations for every user: one array of (movieId, rating) structs per user
  all_recs_df = model.recommendForAllUsers(k)
  all_recs_df.createOrReplaceTempView('all_recs')
  all_recs_clean = spark.sql('SELECT userId, idNrating.movieId AS movieId, idNrating.rating AS prediction\
  FROM all_recs\
  LATERAL VIEW explode(recommendations) exploded_table AS idNrating')

  # Keep only movies the user has not rated yet; this runs after the top-k cut,
  # so fewer than k rows may be returned
  temp = all_recs_clean.join(movie_ratings, ['userId','movieId'], how='left').filter(movie_ratings.rating.isNull())
  final= temp.join(movies_df, on='movieId', how='left')

  userRec = final.where(final.userId==id).toPandas()
  return userRec
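
`topKRec` takes the top `k` recommendations first and removes already-rated movies afterwards, so a call with `k=10` can return fewer than ten rows. The plain-Python sketch below contrasts the two orders of operation on made-up scores. `scores`, `already_rated`, and both helper functions are hypothetical and only illustrate the ordering issue; they are not part of the Spark API.

```python
def top_k_then_filter(scores, already_rated, k):
    """Take the k best-scored items, then drop the ones already rated."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [item for item in top if item not in already_rated]

def filter_then_top_k(scores, already_rated, k):
    """Drop already-rated items first, then take the k best of the rest."""
    unseen = [item for item in scores if item not in already_rated]
    return sorted(unseen, key=scores.get, reverse=True)[:k]

# Hypothetical predicted scores for one user (item id -> score).
scores = {"A": 4.9, "B": 4.7, "C": 4.5, "D": 4.2, "E": 4.0}
already_rated = {"A", "C"}

print(top_k_then_filter(scores, already_rated, 3))  # can be shorter than k
print(filter_then_top_k(scores, already_rated, 3))  # k items when enough remain
```

One way to get closer to `k` unseen movies with the existing function is to request more than `k` recommendations and truncate after filtering.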

In [57]:
topKRec(10,575,bestModel)



Unnamed: 0,movieId,userId,prediction,rating,title,genres
0,177593,575,4.489295,,"Three Billboards Outside Ebbing, Missouri (2017)",Crime|Drama
1,2160,575,4.327541,,Rosemary's Baby (1968),Drama|Horror|Thriller
2,158966,575,4.2951,,Captain Fantastic (2016),Drama
3,50,575,4.293466,,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
4,104879,575,4.250824,,Prisoners (2013),Drama|Mystery|Thriller
5,174053,575,4.245746,,Black Mirror: White Christmas (2014),Drama|Horror|Mystery|Sci-Fi|Thriller
6,48516,575,4.223469,,"Departed, The (2006)",Crime|Drama|Thriller


In [58]:
topKRec(10,232,bestModel)



Unnamed: 0,movieId,userId,prediction,rating,title,genres
0,171495,232,4.421206,,Cosmos,(no genres listed)
1,78836,232,4.374097,,Enter the Void (2009),Drama
2,2131,232,4.278293,,Autumn Sonata (Höstsonaten) (1978),Drama
3,177593,232,4.277124,,"Three Billboards Outside Ebbing, Missouri (2017)",Crime|Drama
4,179135,232,4.269114,,Blue Planet II (2017),Documentary
5,117531,232,4.269114,,Watermark (2014),Documentary
6,184245,232,4.269114,,De platte jungle (1978),Documentary
7,7071,232,4.269114,,"Woman Under the Influence, A (1974)",Drama
8,26073,232,4.269114,,"Human Condition III, The (Ningen no joken III)...",Drama|War


# Find similar movies for the movies with IDs 463 and 471

In [108]:
#generate the movie factor matrix
movie_factors= bestModel.itemFactors
#feature_len = len(movie_factors.collect()[0][1])
movie_factors = movie_factors.select(['id'] + [movie_factors.features[i] for i in range(bestModel.rank)])
movie_factors.show()




In [127]:
#generate the movie factor matrix
movieFactors= bestModel.itemFactors.toPandas()



Unnamed: 0,id,features
0,10,"[-0.02601017989218235, 0.18641135096549988, 0...."
1,20,"[0.02545834518969059, 0.49619802832603455, -0...."
2,30,"[0.16602551937103271, 0.25617387890815735, 0.0..."
3,40,"[-0.2425207942724228, 0.2238989621400833, -0.0..."
4,50,"[0.07817284017801285, 0.38108113408088684, -0...."
...,...,...
8955,185029,"[0.1228843405842781, 0.027487922459840775, 0.1..."
8956,188189,"[0.15198981761932373, 0.35885190963745117, -0...."
8957,190209,"[0.10802774876356125, 0.474465012550354, -0.07..."
8958,190219,"[0.027006937190890312, 0.1186162531375885, -0...."


In [185]:
movies_df = movies_df.toPandas().astype({'movieId': int})

In [187]:
# Use cosine similarity method to evaluate the similarity between movies

def similarMovie(k, movieId):
  '''
  k: number of similar movies to find
  movieId: id of the movie to find similarities
  '''

  # Look up the factor vector of the query movie; movies absent from the model have none
  match = movieFactors.loc[movieFactors.id==movieId,'features']
  if match.empty:
    return 'There is no such movie with id ' + str(movieId)
  movieFeature = match.to_numpy()[0]

  # Cosine similarity between the query movie and every movie in the factor matrix.
  # Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
  rows = []
  for id, featureList in movieFactors.to_numpy():
    cosine = np.dot(movieFeature,featureList)/(np.linalg.norm(movieFeature)*np.linalg.norm(featureList))
    rows.append({'movieId': id, 'cosine_simi': cosine})
  similar = pd.DataFrame(rows, columns=['movieId','cosine_simi'])
  # Skip the first row, which is the query movie itself (similarity 1)
  similar = similar.sort_values(by=['cosine_simi'],ascending=False)[1:k+1]
  similar = similar.astype({'movieId': int})
  final = similar.merge(movies_df, left_on='movieId', right_on = 'movieId', how='left')
  return final[['movieId','title','genres']]
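
The same cosine-similarity ranking can also be computed without a Python loop. The NumPy sketch below is one possible vectorized variant under stated assumptions: `top_k_cosine` and the toy factor matrix are hypothetical, every row is assumed to have a non-zero norm, and the query row is excluded explicitly rather than by assuming it sorts first.

```python
import numpy as np

def top_k_cosine(factors, query_index, k):
    """Return indices and similarities of the k rows most cosine-similar to one row."""
    factors = np.asarray(factors, dtype=float)
    # Normalize each row to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(factors, axis=1, keepdims=True)
    unit = factors / norms
    sims = unit @ unit[query_index]
    # Exclude the query row itself instead of assuming it sorts first.
    sims[query_index] = -np.inf
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Hypothetical 2-D factor vectors for four items.
toy = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
idx, sims = top_k_cosine(toy, query_index=0, k=2)
print(idx, sims)
```

In the notebook's setting, `factors` would be the stacked `features` column of `movieFactors`, and the returned row indices would still need to be mapped back to movie ids.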

In [188]:
similarMovie(10,463)

'There is no such movie with id 463'

In [189]:
similarMovie(10,471)

Unnamed: 0,movieId,title,genres
0,708,"Truth About Cats & Dogs, The (1996)",Comedy|Romance
1,6331,Spellbound (2002),Documentary
2,532,Serial Mom (1994),Comedy|Crime|Horror
3,745,Wallace & Gromit: A Close Shave (1995),Animation|Children|Comedy
4,2723,Mystery Men (1999),Action|Comedy|Fantasy
5,144,"Brothers McMullen, The (1995)",Comedy
6,7266,"Lost Skeleton of Cadavra, The (2002)",Comedy|Horror|Sci-Fi
7,32469,We're No Angels (1955),Comedy|Crime|Drama
8,6296,"Mighty Wind, A (2003)",Comedy|Musical
9,26,Othello (1995),Drama


# REPORT Section