## Import useful Python packages

In [0]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

## Check everything is ok

In [0]:
spark

In [0]:
sc._conf.getAll()

# **MovieLens Recommender System**

In this notebook, we will be using a dataset from [GroupLens](https://grouplens.org/datasets/movielens/), which collected and made available rating data sets from the [MovieLens web site](http://movielens.org). 

More specifically, this dataset, released in December 2019, is the **MovieLens 25M Dataset**, containing **25 million** ratings applied to 62,000 movies by 162,000 users (using a 5-star rating system). 
The original collection also includes **1 million** tags and tag genome data with **15 million** relevance scores across 1,129 tags, which we will not use here. If you are interested in working with those data as well, the whole collection can be downloaded from this [link](http://files.grouplens.org/datasets/movielens/ml-25m.zip).
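
As a rough back-of-the-envelope check, the figures quoted above imply that the user-movie rating matrix is extremely sparse. The sketch below uses the rounded counts from the text (not exact values), so the result is only approximate.

```python
# Rounded figures quoted in the text above (approximate, not exact counts)
n_ratings = 25_000_000
n_movies = 62_000
n_users = 162_000

# Number of cells in the full user-movie matrix
n_cells = n_users * n_movies

# Fraction of cells that actually hold a rating
density = n_ratings / n_cells

print("Possible (user, movie) pairs: {:,d}".format(n_cells))
print("Observed ratings: {:.3%} of all pairs".format(density))
```

In other words, only about 1 in 400 entries is observed; the remaining entries are what a recommender has to estimate.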

The task is to recommend to each user the movies they are likely to be interested in and engage with.

## **1. Data Acquisition**

This is the first step we need to accomplish before going any further. As usual, the dataset will be downloaded to the driver node and then moved to DBFS.

### Download the dataset to the local driver node's ```/tmp``` folder using ```wget```

In [0]:
%sh wget -P /tmp https://github.com/gtolomei/big-data-computing/raw/master/datasets/movielens-ratings-25m.csv.bz2

In [0]:
%fs ls file:/tmp/

### Move the file from local driver node's file system to DBFS

In [0]:
dbutils.fs.mv("file:/tmp/movielens-ratings-25m.csv.bz2", "dbfs:/bdc-2020-21/datasets/movielens-ratings-25m.csv.bz2")

In [0]:
%fs ls /bdc-2020-21/datasets/

### **Read dataset file into a Spark Dataframe**

In [0]:
ratings_df = spark.read.load("dbfs:/bdc-2020-21/datasets/movielens-ratings-25m.csv.bz2", 
                         format="csv", 
                         sep=",", 
                         inferSchema="true", 
                         header="true"
                         )

### **Check the shape of the loaded dataset, i.e., number of rows and columns**

In [0]:
print("The shape of the dataset is {:d} rows by {:d} columns".format(ratings_df.count(), len(ratings_df.columns)))

### **Print out the schema of the loaded dataset**

In [0]:
ratings_df.printSchema()

### **Dataset Shape and Schema**

The dataset contains **25,000,095** records, each one corresponding to the `rating` that an anonymized user (`userId`) gave to a specific movie (`movieId`). The `timestamp` associated with each rating is also recorded.

### **Display the first 5 rows of the dataset**

In [0]:
ratings_df.show(5)

### **Check for any missing values**

In [0]:
for c in ratings_df.columns:
  print("N. of missing values of column `{:s}` = {:d}".format(c, ratings_df.where(col(c).isNull()).count()))

### **Check the number of unique users**

In [0]:
print("The number of unique users is: {:d}".format(ratings_df.select("userId").distinct().count()))

### **Check the number of unique movies**

In [0]:
print("The number of unique movies is: {:d}".format(ratings_df.select("movieId").distinct().count()))

## **2. Data Exploration**

### **Summary of Descriptive Statistics**

In [0]:
ratings_df.describe().toPandas()

In [0]:
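# To use plotting libraries, we first need to convert our PySpark DataFrame into a Pandas DataFrame
# Note: `toPandas()` collects every row onto the driver, which must have enough memory for all ~25M ratings
ratings_pdf = ratings_df.toPandas()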

In [0]:
# Set some default plotting configuration using seaborn properties
sns.set_style("darkgrid")
sns.set_context("notebook", rc={"lines.linewidth": 2, 
                                "xtick.labelsize":14, 
                                "ytick.labelsize":14,
                                "axes.labelsize": 18,
                                "axes.titlesize": 20,
                                })

### **Analysis of Rating Distributions**

In [0]:
fig, ax = plt.subplots(1, 1, figsize=(8,8))

ax = sns.countplot(x="rating", data=ratings_pdf)
#ax = sns.barplot(x="rating", y="rating", data=ratings_pdf, estimator=lambda x: len(x)/len(ratings_pdf) * 100)
#ax.set_ylabel("Frequency (%)")

### **Dataset Splitting: Training vs. Test Set**

Before moving along with any preprocessing involving data transformations, we will split our dataset into **2** portions:
- _training set_ (e.g., accounting for **80%** of the total number of instances);
- _test set_ (e.g., accounting for the remaining **20%** of instances).
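
The idea behind a seeded random split can be sketched in plain Python, assuming each row is assigned to a portion independently at random. The helper `random_split_sketch` is hypothetical and only illustrates the concept; it is not Spark's implementation of `randomSplit`. As with `randomSplit`, the resulting proportions are approximate rather than exact.

```python
import math
import random

def random_split_sketch(rows, weights, seed):
    """Assign each row independently to one portion, with probability proportional to `weights`."""
    rng = random.Random(seed)
    total = math.fsum(weights)
    # Cumulative boundaries in [0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point round-off
    portions = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform draw in [0, 1)
        for i, b in enumerate(bounds):
            if u < b:
                portions[i].append(row)
                break
    return portions

toy_rows = list(range(10_000))
toy_train, toy_test = random_split_sketch(toy_rows, [0.8, 0.2], seed=42)
print(len(toy_train), len(toy_test))
```

Re-running with the same seed reproduces the same split, which is why a fixed seed is useful for reproducible experiments.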

In [0]:
RANDOM_SEED = 42
# Randomly split our original dataset `ratings_df` into 80/20 for training and test, respectively
train_df, test_df = ratings_df.randomSplit([0.8, 0.2], seed=RANDOM_SEED)

In [0]:
print("Training set size: {:d} instances".format(train_df.count()))
print("Test set size: {:d} instances".format(test_df.count()))

### **Working on the Training Set only**

From now on, we will be working on the training set portion only. The test set will come back into play when we evaluate our learned model.

# **Matrix Factorization using Alternating Least Squares (ALS)**

We use ALS to factorize the user-movie rating matrix R (_m_ x _n_) into the product of two lower-rank matrices, the **user-factor matrix** X (_m_ x _d_) and the **item-factor matrix** W (_n_ x _d_), so that R ≈ XWᵀ, using the training set above. To do so, we use the blocked implementation of ALS (i.e., the `ALS` object) provided by the [PySpark API](https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html?highlight=als#pyspark.ml.recommendation.ALS) within the package `pyspark.ml.recommendation`.

The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly solved factor matrix is then held constant while solving for the other factor matrix.
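
The alternating scheme can be illustrated with a small NumPy sketch. This is a simplified, single-machine version written for illustration only: it is not Spark's blocked implementation, the helper `als_toy` and the toy matrix are assumptions, and it uses plain ridge regularization on the observed entries.

```python
import numpy as np

def als_toy(R, mask, d=2, reg=0.1, n_iters=20, seed=0):
    """Toy ALS on a small dense array R; `mask` is True where a rating is observed."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    X = rng.normal(scale=0.1, size=(m, d))  # user factors (m x d)
    W = rng.normal(scale=0.1, size=(n, d))  # item factors (n x d)
    I = np.eye(d)
    for _ in range(n_iters):
        # Hold W fixed: solve a small regularized least-squares problem per user
        for u in range(m):
            Wu = W[mask[u]]  # factors of the items rated by user u
            X[u] = np.linalg.solve(Wu.T @ Wu + reg * I, Wu.T @ R[u, mask[u]])
        # Hold X fixed: solve a small regularized least-squares problem per item
        for i in range(n):
            Xi = X[mask[:, i]]  # factors of the users who rated item i
            W[i] = np.linalg.solve(Xi.T @ Xi + reg * I, Xi.T @ R[mask[:, i], i])
    return X, W

# Made-up ratings; zeros stand for missing entries in this toy example
toy_R = np.array([[5.0, 3.0, 0.0, 1.0],
                  [4.0, 0.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0, 5.0],
                  [1.0, 0.0, 0.0, 4.0],
                  [0.0, 1.0, 5.0, 4.0]])
toy_mask = toy_R > 0

toy_X, toy_W = als_toy(toy_R, toy_mask)
toy_pred = toy_X @ toy_W.T
toy_rmse = np.sqrt(np.mean((toy_R[toy_mask] - toy_pred[toy_mask]) ** 2))
print("RMSE on observed entries: {:.3f}".format(toy_rmse))
```

Each inner solve involves only a _d_ x _d_ system, which is what makes the per-user and per-item updates cheap and easy to parallelize.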

Among all the parameters that the API offers, the following deserve specific attention:

- `rank` is the rank of the user and item factor matrices X and W, i.e., the number of latent features _d_ (by default `rank=10`);
- `maxIter` is the maximum number of iterations performed (by default `maxIter=10`);
- `regParam` is the regularization parameter (by default `regParam=0.1`);
- `implicitPrefs` is a boolean flag that switches between the explicit-feedback and implicit-feedback versions of ALS (by default `implicitPrefs=False`, i.e., explicit feedback);
- `coldStartStrategy` indicates the strategy for dealing with unknown or new users/items at prediction time (i.e., cold-start). This may be useful in cross-validation or production scenarios, for handling users/items that the model has not seen in the training data. Supported values are `nan` (default) or `drop`;
- `numUserBlocks`/`numItemBlocks` indicate the number of blocks into which users and items are partitioned in order to parallelize the computation (by default both are set to `10`).

As always, the optimal values of the **hyperparameters** above should be tuned using a dedicated portion of the dataset (i.e., a **validation set**) or by performing **k-fold cross validation**.
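
To see why `coldStartStrategy` matters for evaluation, consider the plain-Python sketch below, which uses made-up numbers. A single `NaN` prediction (as produced for a user or movie unseen at training time) makes the RMSE itself `NaN`, whereas dropping such rows first yields a usable value. The helper `rmse_of` and the toy values are illustrative assumptions.

```python
import math

def rmse_of(pairs):
    """Root mean squared error over (rating, prediction) pairs."""
    total = 0.0
    for r, p in pairs:
        total += (r - p) ** 2
    return math.sqrt(total / len(pairs))

# Made-up (rating, prediction) pairs; the last prediction is NaN (cold-start case)
toy_pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0), (3.0, math.nan)]

# Keeping NaN predictions, as with coldStartStrategy="nan"
rmse_with_nan = rmse_of(toy_pairs)

# Dropping rows with NaN predictions, as with coldStartStrategy="drop"
toy_kept = [(r, p) for r, p in toy_pairs if not math.isnan(p)]
rmse_after_drop = rmse_of(toy_kept)

print("RMSE keeping NaN predictions: {}".format(rmse_with_nan))
print("RMSE after dropping them: {:.3f}".format(rmse_after_drop))
```

This is the reason the models in this notebook are built with `coldStartStrategy="drop"`: a random split can leave some users or movies only in the test set.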

## **A First Matrix Factorization Model**

### **Model Training**

In [0]:
from pyspark.ml.recommendation import ALS
# Build the recommendation model using ALS on the training data
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(train_df)

### **Making Rating Predictions on the Test Set**

In [0]:
predictions =  model.transform(test_df)

In [0]:
predictions.select(["userId", "movieId", "rating", "prediction"]).show()

### **Measuring Model's Performance**

We will use Root Mean Squared Error (RMSE) measured on the held-out portion to assess the quality of our movie recommender system.
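
For reference, RMSE can be written as follows, where $T$ denotes the set of (user, movie) pairs in the held-out portion for which a prediction is available, $r_{ui}$ is the observed rating, and $\hat{r}_{ui}$ is the predicted one. The symbols $T$, $x_u$, and $w_i$ are notation introduced here for illustration, with $x_u$ and $w_i$ standing for the rows of X and W associated with user $u$ and movie $i$.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left(r_{ui} - \hat{r}_{ui}\right)^2},
\qquad \hat{r}_{ui} = x_u^{\top} w_i
$$

Lower values are better, and the metric is expressed in the same units as the ratings (stars).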

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)

print("Root Mean Squared Error = {:.5f}".format(rmse))

## **Tuning Hyperparameters**

In the following, we wrap the whole pipeline in a function that also uses _k_-fold cross validation to get a better estimate of the generalization performance of our matrix factorization model.

More specifically, we will tune the three hyperparameters: `rank`, `regParam`, and `maxIter`.
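
To get a feel for the cost of this search, the plain-Python sketch below enumerates a grid and counts the resulting model fits. The candidate values mirror the grid used in the function defined in this section, and `n_folds = 5` matches its default `k_fold`; the names `toy_grid` and `toy_settings` are illustrative.

```python
import itertools

# Candidate values for each hyperparameter
toy_grid = {
    "rank": [10, 25],
    "regParam": [0.01, 0.1],
    "maxIter": [10],
}
n_folds = 5

# Every combination of one value per hyperparameter
toy_names = list(toy_grid)
toy_settings = [dict(zip(toy_names, values))
                for values in itertools.product(*toy_grid.values())]

for setting in toy_settings:
    print(setting)

# Each setting is trained once per fold
n_fits = len(toy_settings) * n_folds
print("{:d} settings x {:d} folds = {:d} model fits".format(len(toy_settings), n_folds, n_fits))
```

After selecting the best setting, `CrossValidator` also refits it once on the whole training set, so the total training cost grows quickly as values are added to the grid.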

In [0]:
# This function defines the general pipeline for ALS matrix factorization with k-fold cross validation
def matrix_factorization(train, k_fold=5):

    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml import Pipeline

    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")

    #pipeline = Pipeline(stages=stages)

    # We use a ParamGridBuilder to construct a grid of parameters to search over.
    # A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
    # With 2 values for als.rank, 2 values for als.regParam, and 1 value for als.maxIter,
    # this grid will have 2 x 2 x 1 = 4 parameter settings for CrossValidator to choose from.
    param_grid = ParamGridBuilder()\
    .addGrid(als.rank, [10, 25]) \
    .addGrid(als.regParam, [0.01, 0.1]) \
    .addGrid(als.maxIter, [10]) \
    .build()
    
    cross_val = CrossValidator(estimator=als, 
                               estimatorParamMaps=param_grid,
                               evaluator=RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction"),
                               numFolds=k_fold,
                               collectSubModels=True # this flag allows us to store ALL the models trained during k-fold cross validation
                               )

    # Run cross-validation, and choose the best set of parameters.
    cv_model = cross_val.fit(train)

    return cv_model

In [0]:
cv_model = matrix_factorization(train_df)

In [0]:
# This function summarizes all the models trained during k-fold cross validation
def summarize_all_models(cv_models):
    for k, models in enumerate(cv_models):
        print("*************** Fold #{:d} ***************\n".format(k+1))
        for i, m in enumerate(models):
            print("--- Model #{:d} out of {:d} ---".format(i+1, len(models)))
            print("\tParameters: rank=[{:d}]".format(m.rank))
            print("\tModel summary: {}\n".format(m))
        print("***************************************\n")

In [0]:
# Call the function above
summarize_all_models(cv_model.subModels)

In [0]:
for i, avg_rmse in enumerate(cv_model.avgMetrics):
    print("Avg. RMSE computed across k-fold cross validation for model setting #{:d}: {:.3f}".format(i+1, avg_rmse))
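
The loop above identifies each setting only by its position. A small helper can pair each average metric with the hyperparameter values that produced it. The sketch below is a hypothetical helper (`rank_settings`) demonstrated on made-up numbers, not on results from this notebook; the commented lines at the end show how its inputs could be built from `cv_model`.

```python
def rank_settings(param_maps, avg_metrics):
    """Pair each parameter setting with its average metric, lowest (best RMSE) first."""
    paired = list(zip(param_maps, avg_metrics))
    return sorted(paired, key=lambda pair: pair[1])

# Made-up values for illustration only (not results from this notebook)
toy_param_maps = [
    {"rank": 10, "regParam": 0.01},
    {"rank": 10, "regParam": 0.1},
    {"rank": 25, "regParam": 0.01},
    {"rank": 25, "regParam": 0.1},
]
toy_avg_metrics = [0.4, 0.2, 0.3, 0.1]

for params, metric in rank_settings(toy_param_maps, toy_avg_metrics):
    print("{:.3f}  {}".format(metric, params))

# With the fitted cross-validation model, the inputs could be built as:
# param_maps = [{p.name: v for p, v in pm.items()} for pm in cv_model.getEstimatorParamMaps()]
# ranked = rank_settings(param_maps, cv_model.avgMetrics)
```

This works because `avgMetrics` lists one value per parameter map, in the same order as the maps passed to the cross-validator.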

In [0]:
print("Best model according to k-fold cross validation: rank=[{:d}]".
      format(cv_model.bestModel.rank)
      )
print(cv_model.bestModel)

### **Using the best model from _k_-fold cross validation to make predictions**

In [0]:
# Make predictions on the test set (`cv_model` contains the best model according to the result of k-fold cross validation)
# `test_df` will follow exactly the same pipeline defined above, and already fit to `train_df`
test_predictions = cv_model.transform(test_df)

In [0]:
test_predictions.select("userId", "movieId", "rating", "prediction").show(5)

In [0]:
def evaluate_model(predictions, metric="rmse", labelCol="rating", predictionCol="prediction"):
    
    from pyspark.ml.evaluation import RegressionEvaluator

    evaluator = RegressionEvaluator(metricName=metric, labelCol=labelCol, predictionCol=predictionCol)

    return evaluator.evaluate(predictions)

### **Evaluate model performance on the Test Set**

In [0]:
print("***** Test Set *****")
print("RMSE: {:.3f}".format(evaluate_model(test_predictions)))
print("***** Test Set *****")

### **Provide top-_K_ Recommendations to all users**

In [0]:
K = 10 # number of recommended items for each user
user_recs = cv_model.bestModel.recommendForAllUsers(K)
user_recs.show(10, truncate=False)
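
Conceptually, a top-_K_ recommendation scores each user against every item using the dot product of their latent factors and keeps the _K_ best. The NumPy sketch below illustrates this with made-up factor matrices; it is not Spark's implementation, and the helper `top_k_items` is hypothetical.

```python
import numpy as np

def top_k_items(user_factors, item_factors, top_k):
    """Return, for each user, the indices of the `top_k` highest-scoring items."""
    scores = user_factors @ item_factors.T  # predicted score for every (user, item) pair
    order = np.argsort(-scores, axis=1)     # item indices sorted by decreasing score
    return order[:, :top_k]

# Tiny made-up factor matrices: 2 users, 4 items, 2 latent features
toy_user_factors = np.array([[1.0, 0.0],
                             [0.0, 1.0]])
toy_item_factors = np.array([[0.9, 0.1],
                             [0.2, 0.8],
                             [0.5, 0.5],
                             [0.1, 0.3]])

print(top_k_items(toy_user_factors, toy_item_factors, top_k=2))
```

Note that this sketch ranks all items, including ones a user may already have rated; excluding those would require a separate filtering step.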