## Steven Miller
## DSC 650
## 2020-02-18

### 11.2 Programming Exercise: Movie Recommendation Engine

a. Prepare Data

Load the data from the ratings.csv and movies.csv files and join them on movieId. The resulting dataset should contain all of the user ratings along with the movie titles. The schema should look something like this.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Exercise12').getOrCreate()
movies = spark.read.load('movielens/movies.csv', format='csv', inferSchema=True, header=True)
ratings = spark.read.load('movielens/ratings.csv', format='csv', inferSchema=True, header=True)

In [2]:
df = movies.join(ratings, 'movieId', 'inner')
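As a rough illustration of what an inner join on movieId does, here is a plain-Python sketch (not Spark code). The two movie rows and their ratings mirror rows shown in this notebook's output; the rating for movieId 999 and the helper name `inner_join` are hypothetical, added only to show that a key with no match on the other side is dropped.

```python
# Plain-Python illustration of an inner join on a shared key (not Spark code).
movies_rows = [
    {"movieId": 1, "title": "Toy Story (1995)"},
    {"movieId": 6, "title": "Heat (1995)"},
]
ratings_rows = [
    {"movieId": 1, "userId": 1, "rating": 4.0},
    {"movieId": 6, "userId": 1, "rating": 4.0},
    {"movieId": 999, "userId": 2, "rating": 3.5},  # hypothetical: no matching movie
]


def inner_join(left, right, key):
    """Return one merged row for every left/right pair that shares the key value."""
    index = {}
    for row in left:
        index.setdefault(row[key], []).append(row)
    # Rows on the right whose key is absent from the index produce no output.
    return [{**l, **r} for r in right for l in index.get(r[key], [])]


joined = inner_join(movies_rows, ratings_rows, "movieId")
for row in joined:
    print(row)
```

Each joined row carries both the title and the rating, which is the shape the combined dataset needs for the later steps.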

In [3]:
df.show()

+-------+--------------------+--------------------+------+------+---------+
|movieId|               title|              genres|userId|rating|timestamp|
+-------+--------------------+--------------------+------+------+---------+
|      1|    Toy Story (1995)|Adventure|Animati...|     1|   4.0|964982703|
|      3|Grumpier Old Men ...|      Comedy|Romance|     1|   4.0|964981247|
|      6|         Heat (1995)|Action|Crime|Thri...|     1|   4.0|964982224|
|     47|Seven (a.k.a. Se7...|    Mystery|Thriller|     1|   5.0|964983815|
|     50|Usual Suspects, T...|Crime|Mystery|Thr...|     1|   5.0|964982931|
|     70|From Dusk Till Da...|Action|Comedy|Hor...|     1|   3.0|964982400|
|    101|Bottle Rocket (1996)|Adventure|Comedy|...|     1|   5.0|964980868|
|    110|   Braveheart (1995)|    Action|Drama|War|     1|   4.0|964982176|
|    151|      Rob Roy (1995)|Action|Drama|Roma...|     1|   5.0|964984041|
|    157|Canadian Bacon (1...|          Comedy|War|     1|   5.0|964984100|
|    163|   

b. Train Recommender

Using the data you prepared in the last step, create a movie recommendation model using collaborative filtering. Spark’s collaborative filtering documentation provides a template for building and testing this model.

Before you train the recommendation model, split the data into a training dataset and a testing dataset using the randomSplit dataframe method. Use 80% of your data for training and 20% for testing.
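To build intuition for a weighted random split, the sketch below assigns each row independently to a split with probability proportional to its weight. It is a plain-Python approximation, assuming (as I understand the Spark documentation) that randomSplit normalizes its weights and samples rows at random, so the resulting sizes are close to 80/20 but not exact. The function name `random_split` and the seed value are hypothetical.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row independently to one split, proportionally to the weights."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative boundaries of the normalized weights, e.g. [0.8, 1.0].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw not claimed earlier.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


training_rows, test_rows = random_split(range(1000), [0.8, 0.2], seed=42)
print(len(training_rows), len(test_rows))
```

Passing a seed makes the split repeatable, which helps when comparing error metrics across runs.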

After fitting your model using the training dataset, calculate the predictions on the test dataset and use the RegressionEvaluator to calculate the root-mean-square error of the model.

As a reminder, Spark’s collaborative filtering documentation will be helpful in completing this task.

In [4]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

(training, test) = df.randomSplit([0.8, 0.2])

als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(training)

In [5]:
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 1.0790394185889431
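For reference, root-mean-square error is the square root of the mean squared difference between actual and predicted ratings. The plain-Python sketch below shows the arithmetic on three hypothetical rating pairs; the helper name `rmse` and the sample values are illustrative only and are not taken from the dataset.

```python
import math


def rmse(labels, predictions):
    """Square root of the mean squared difference between labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [4.0, 5.0, 3.0]     # hypothetical true ratings
predicted = [3.0, 5.0, 5.0]  # hypothetical model predictions
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few badly wrong predictions raise the metric more than many slightly wrong ones.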


c. Generate top 10 movie recommendations

Using the recommendation model, generate the top ten recommendations for each user. Using the show method, print the recommendations for user IDs 127, 151, and 300. Do not truncate the results; call the show method like this: recommendations_127.show(truncate=False).
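Conceptually, an ALS model stores a latent factor vector for each user and each movie, and the predicted rating is the dot product of the two. A top-k recommendation then amounts to scoring every movie for a user and keeping the k highest scores. The sketch below shows that idea in plain Python; the function name `top_k`, the factor values, and the movie IDs are all hypothetical and far smaller than a trained model's.

```python
import heapq


def top_k(user_vector, item_factors, k=10):
    """Return the k (item_id, score) pairs with the highest dot-product score."""
    scores = (
        (sum(u * v for u, v in zip(user_vector, vec)), item_id)
        for item_id, vec in item_factors.items()
    )
    best = heapq.nlargest(k, scores)
    return [(item_id, score) for score, item_id in best]


# Hypothetical two-dimensional factors for one user and three movies.
user_vector = [1.0, 0.5]
item_factors = {10: [1.0, 0.0], 20: [0.0, 1.0], 30: [2.0, 2.0]}
print(top_k(user_vector, item_factors, k=2))
```

Since the score is an unbounded dot product, nothing constrains it to the rating scale, which is one plausible reason predicted values can exceed 5.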

In [6]:
from pyspark.sql.functions import col

user_ids = [127,151,300]
users = ratings.select(als.getUserCol()).distinct().where(col('userId').isin(user_ids))
userSubsetRecs = model.recommendForUserSubset(users, 10)

In [7]:
userSubsetRecs.show(truncate=False)

+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                                                                                                                |
+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|300   |[[89118, 7.4159837], [1251, 7.3640213], [49932, 7.159583], [91077, 6.921607], [446, 6.8803563], [1227, 6.8485675], [1211, 6.7971067], [86320, 6.782563], [3266, 6.7721453], [68073, 6.7703805]]|
|127   |[[3040, 12.728895], [4450, 12.044945], [7169, 11.58663], [7155, 11.300601], [55052, 10.821467], [33672, 10.620742], [79185, 10.263626], [72, 10.216737], [2384, 10.184221], [3633, 9.966223]

In [8]:
for user_row in userSubsetRecs.collect():
    recommendations = []

    # Look up the title for each recommended (movieId, predicted rating) pair.
    for rec in user_row['recommendations']:
        name = movies.select('title').where(movies.movieId == rec[0])
        recommendations.append({'name': name.collect()[0][0], 'rating': rec[1]})

    print(f'\nRecommendations for user {user_row[0]}\n')
    # Left-align titles to 50 characters and show ratings to three decimals.
    for row in recommendations:
        print(f"{row['name']:<50}\t{row['rating']:.3f}")


Recommendations for user 300

Skin I Live In, The (La piel que habito) (2011)   	7.416
8 1/2 (8½) (1963)                                 	7.364
Inland Empire (2006)                              	7.160
Descendants, The (2011)                           	6.922
Farewell My Concubine (Ba wang bie ji) (1993)     	6.880
Once Upon a Time in America (1984)                	6.849
Wings of Desire (Himmel über Berlin, Der) (1987)  	6.797
Melancholia (2011)                                	6.783
Man Bites Dog (C'est arrivé près de chez vous) (1992)	6.772
Pirate Radio (2009)                               	6.770

Recommendations for user 127

Meatballs (1979)                                  	12.729
Bully (2001)                                      	12.045
Chasing Liberty (2004)                            	11.587
Calendar Girls (2003)                             	11.301
Atonement (2007)                                  	10.821
Lords of Dogtown (2005)                           	10.621
Knight and Day (2