# Recommendation Systems 

**In this notebook:**
* Load IMDB dataset
* Explore dataset
* Build two movie recommendation Classifier
* Evaluate classifier

## Imports

In [1]:
import os
from pyspark.sql import SQLContext
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import Row
import numpy as np
from time import time

%matplotlib inline

## Evironment

Connecting to Spark cluster. 

In [2]:
conf = SparkConf().setAppName("HW2").setMaster("local[4]")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)


## Data Loading

Loading IMDB data into Spark data frame.

In [3]:
data_path = "../../../data/HW2/"

In [4]:
users = (
    sqlContext.read.format("csv")
    .options(
        header="true",
        inferSchema="True",
        delimiter=",",
    )
    .load("file://" + os.path.abspath(data_path) + "/ratings.csv")
)

movies = (
    sqlContext.read.format("csv")
    .options(
        header="true",
        inferSchema="True",
        delimiter=",",
    )
    .load("file://" + os.path.abspath(data_path) + "/movies.csv")
)


## Exploring Dataset

Exploring dataset and searching for:
* top-10 movies with the largest number of ratings,
* top-10 movies with the highest average rating grouped by genre, and
* common support for all pair of the first 100 movies. 

In [5]:
# User Data Frame Description
users.show(2)
users.describe().show()
# Movies Data Frame Description
movies.show(2)
movies.describe().show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
+------+-------+------+---------+
only showing top 2 rows

+-------+------------------+----------------+------------------+--------------------+
|summary|            userId|         movieId|            rating|           timestamp|
+-------+------------------+----------------+------------------+--------------------+
|  count|            100836|          100836|            100836|              100836|
|   mean|326.12756356856676|19435.2957177992| 3.501556983616962|1.2059460873684695E9|
| stddev| 182.6184914635004|35530.9871987003|1.0425292390606342|2.1626103599513078E8|
|    min|                 1|               1|               0.5|           828124615|
|    max|               610|          193609|               5.0|          1537799250|
+-------+------------------+----------------+------------------+-------------------

In [6]:
# movies rating table 
movies_ratings = movies.join(users,"movieId", "right")
movies_ratings.show()

+-------+--------------------+--------------------+------+------+---------+
|movieId|               title|              genres|userId|rating|timestamp|
+-------+--------------------+--------------------+------+------+---------+
|      1|    Toy Story (1995)|Adventure|Animati...|     1|   4.0|964982703|
|      3|Grumpier Old Men ...|      Comedy|Romance|     1|   4.0|964981247|
|      6|         Heat (1995)|Action|Crime|Thri...|     1|   4.0|964982224|
|     47|Seven (a.k.a. Se7...|    Mystery|Thriller|     1|   5.0|964983815|
|     50|Usual Suspects, T...|Crime|Mystery|Thr...|     1|   5.0|964982931|
|     70|From Dusk Till Da...|Action|Comedy|Hor...|     1|   3.0|964982400|
|    101|Bottle Rocket (1996)|Adventure|Comedy|...|     1|   5.0|964980868|
|    110|   Braveheart (1995)|    Action|Drama|War|     1|   4.0|964982176|
|    151|      Rob Roy (1995)|Action|Drama|Roma...|     1|   5.0|964984041|
|    157|Canadian Bacon (1...|          Comedy|War|     1|   5.0|964984100|
|    163|   

In [8]:
# Names of the top-10 movies with the largest number of ratings
movies_ratings.groupby(["movieId", "title"]).count().sort(col("count").desc()).show(10)

+-------+--------------------+-----+
|movieId|               title|count|
+-------+--------------------+-----+
|    356| Forrest Gump (1994)|  329|
|    318|Shawshank Redempt...|  317|
|    296| Pulp Fiction (1994)|  307|
|    593|Silence of the La...|  279|
|   2571|  Matrix, The (1999)|  278|
|    260|Star Wars: Episod...|  251|
|    480|Jurassic Park (1993)|  238|
|    110|   Braveheart (1995)|  237|
|    589|Terminator 2: Jud...|  224|
|    527|Schindler's List ...|  220|
+-------+--------------------+-----+
only showing top 10 rows



In [9]:
# Names of the top-10 movies with the highest average rating grouped by genre, and

# Define windowing function
w = Window().partitionBy("genre").orderBy(col("avg(rating)").desc())

# Calculate avg rating per movie
movies_avg_ratings = movies_ratings.groupby("movieId").avg("rating")

# Get top 10 movies per group
movies_avg_ratings_by_genre = (
    movies_ratings.withColumn("genre", explode(split("genres", "\|")))
    .join(movies_avg_ratings, "movieId", "left")
    .withColumn("rn", row_number().over(w))
    .where(col("rn") <= 10)
    .select("movieId", "title", "avg(rating)", "genre")
)

movies_avg_ratings_by_genre.show(movies_avg_ratings_by_genre.count())

+-------+--------------------+-----------------+------------------+
|movieId|               title|      avg(rating)|             genre|
+-------+--------------------+-----------------+------------------+
| 147300|Adventures Of She...|              5.0|             Crime|
|    876|Supercop 2 (Proje...|              5.0|             Crime|
| 179133|Loving Vincent (2...|              5.0|             Crime|
| 147286|The Adventures of...|              5.0|             Crime|
|  80124|Sisters (Syostry)...|              5.0|             Crime|
|  59814|   Ex Drummer (2007)|              5.0|             Crime|
| 109241|On the Other Side...|              5.0|             Crime|
|  26840|Sonatine (Sonachi...|              5.0|             Crime|
|   8238|Little Murders (1...|              5.0|             Crime|
|   6086|  I, the Jury (1982)|              5.0|             Crime|
|    496|What Happened Was...|              5.0|           Romance|
|  67618|Strictly Sexual (...|              5.0|

In [10]:
# Common support for all pair of the first 100 movies.

# last id in the first 100 movies
limit = movies.take(100)[-1].movieId

movies_ratings_filtered = movies_ratings.where(col("movieId") <= limit)

movies_ratings_filtered.join(
    movies_ratings_filtered.select("userId", col("movieId").alias("movieIdB")),
    "userId",
    "left",
).where(col("movieIdB") != col("movieId")).groupby(
    ["movieId", "movieIdB"]
).count().select(
    "movieId", "movieIdB", col("count").alias("support")
).sort(
    col("support").desc()
).show(10)

+-------+--------+-------+
|movieId|movieIdB|support|
+-------+--------+-------+
|     47|     110|    134|
|    110|      47|    134|
|     47|      50|    120|
|     50|      47|    120|
|      1|     110|    117|
|    110|       1|    117|
|     50|     110|    110|
|    110|      50|    110|
|      1|      32|    104|
|     32|       1|    104|
+-------+--------+-------+
only showing top 10 rows



## Classification

In [7]:
# Create train, test and validation set
train, test, validation = users.randomSplit([0.8, 0.1, 0.1])
mean_rating = train.select('rating').groupBy().avg("rating").take(1)[0][0]

train.show(5)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|     50|   5.0|964982931|
|     1|     70|   3.0|964982400|
|     1|    101|   5.0|964980868|
+------+-------+------+---------+
only showing top 5 rows



### Baseline Prediction

One of the classifier is a baseline precition. A user bias and movie bias is calculated. Those can be used to give recommendations to users for movies they not rated yet. In the following, two variatons are implemented to train the biases. One optimization is executed on RDDs and the other on DataFrames. 

In [8]:
# Predict ratings based on trained user and movie bias
def baseline_predictor(test, user_bias, movie_bias):
    test = (
        test.join(user_bias, "userId", "inner")
        .join(movie_bias, "movieId", "inner")
        .withColumn("mean", lit(mean_rating))
        .withColumn(
            "prediction",
            col("mean") + col("user_bias") + col("movie_bias"),
        )
    )
    return test

#### With RDD

In [10]:
# Map Reduce Methods used during training
def gradient(matrix, user_bias, movie_bias):
    rating = matrix[:, 2]
    user_ids = list(matrix[:, 0].astype(int))
    movie_ids = list(matrix[:, 1].astype(int))
    print(movie_ids)
    error = rating.T - (
        mean_rating + np.add(user_bias[[user_ids]], movie_bias[[movie_ids]])
    )
    user_bias_error = counter_user_bias = np.zeros(611)
    movie_bias_error = counter_movie_bias = np.zeros(193610)
    for i, user in enumerate(user_ids):
        user_bias_error[user] = error[i]
        counter_user_bias[user] = 1
    for i, movie in enumerate(movie_ids):
        movie_bias_error[movie] = error[i]
        counter_movie_bias[user] = 1
    return user_bias_error, movie_bias_error, counter_user_bias, counter_movie_bias


def add(x, y):
    x1 = np.add(x[0], y[0]) 
    x2 = np.add(x[1], y[1]) 
    x3 = np.add(x[2], y[2]) 
    x4 = np.add(x[3], y[3]) 
    return x1, x2, x3, x4

In [11]:
# Convert RDD Partitions
def readBatch(iterator):
    tuples = list(iterator)
    matrix = np.zeros((len(tuples), 3))
    for i, tup in enumerate(tuples):
        matrix[i][0] = int(tup[0])
        matrix[i][1] = int(tup[1])
        matrix[i][2] = tup[2]
    return [matrix]

In [12]:
# Training Methods
def fit(train, lr, lambda_, iterations, verbose=False):
    # Initialize user and movie bias
    user_bias = np.zeros(611)
    movie_bias = np.zeros(193610)
    for i in range(iterations):
        lr_it = lr / (i + 1)
        #print("On iteration %i" % (i + 1))
        user_bias_error, movie_bias_error = train_rdd.map(lambda m: gradient(m, user_bias, movie_bias)).reduce(add)
        user_bias += lr_it * (user_bias_error - lambda_ * user_bias)
        movie_bias += lr_it * (movie_bias_error - lambda_ * movie_bias)
        if verbose:
            rmse_baseline = evaluate(test, user_bias, movie_bias)
            print("Root-mean-square error = " + str(rmse_baseline))
    return user_bias, movie_bias
    

In [13]:
# Evaluation Method for fitted user and movie bias
def evaluate(test, user_bias, movie_bias):
    df_user = pd.DataFrame(user_bias, columns=["user_bias"])
    df_user['userId'] = df_user.index
    df_movie = pd.DataFrame(user_bias, columns=["movie_bias"])
    df_movie['movieId'] = df_movie.index
    df_user_bias = sqlContext.createDataFrame(df_user)
    df_movie_bias = sqlContext.createDataFrame(df_movie)
    baseline_predictions = baseline_predictor(test, df_user_bias, df_movie_bias)
    evaluator = RegressionEvaluator(
                metricName="rmse", labelCol="rating", predictionCol="prediction"
            )
    rmse_baseline = evaluator.evaluate(baseline_predictions)
    return rmse_baseline

In [15]:
# Map DF to RDD and convert paritions into numpy arrays (faster)
train_rdd = train.rdd.map(tuple).repartition(100).mapPartitions(readBatch).cache()

In [29]:
# Grid search
lambda_list = [0.4, 0.5, 1, 2]
lrs = [0.1, 0.5, 0.01, 0.05, 0.2, 0.3, 0.4]
lowest_rmse = 10


for lambda_ in lambda_list:
    for lr in lrs:
        # Fit biases
        user_bias, movie_bias = fit(train, lr, lambda_, 15)
        rmse_baseline = evaluate(test, user_bias, movie_bias)
        # Save best parameters
        if rmse_baseline < lowest_rmse:
            lowest_rmse = rmse_baseline
            best_lr = lr
            best_lambda = lambda_
            
# Best Parameters
best_lr, best_lambda, lowest_rmse

(0.2, 2, 14, 1.0469672593140502)

In [18]:
# Traning with best parameters
#best_lr = 0.4
#best_lambda = 2
start = time()

user_bias = np.zeros(611)
movie_bias = np.zeros(193610)
user_bias, movie_bias = fit(train, best_lr, best_lambda, 15, verbose=True)

print(f"Training finished after {time() - start} seconds")

Root-mean-square error = 1.043807838776407
Root-mean-square error = 1.040980388897031
Root-mean-square error = 1.0404587965090073
Root-mean-square error = 1.0402463121727379
Root-mean-square error = 1.0401324130819276
Root-mean-square error = 1.0400619177512702
Root-mean-square error = 1.0400142111735278
Root-mean-square error = 1.0399798936701232
Root-mean-square error = 1.039954084823282
Root-mean-square error = 1.0399340062930822
Root-mean-square error = 1.0399179635499685
Root-mean-square error = 1.039904866115887
Root-mean-square error = 1.0398939813904027
Root-mean-square error = 1.0398847995226919
Root-mean-square error = 1.039876955064277
Training finished after 50.54073643684387 seconds


#### With DataFrame

In [22]:
# I tried different combinations of learning rates, and lambdas, the best results I got with: 
lr = 0.4
lambda_ = 1

In [23]:
init = train.withColumn("user_bias", lit(0)).withColumn("movie_bias", lit(0)).cache()
start = time()
for i in range(8):
    start_iteration = time()
    init = init.withColumn(
        "error", col("rating") - (mean_rating + col("user_bias") + col("movie_bias"))
    ).cache()
    # print(f"{time() - start_iteration} seconds for error calculation")
    time_stamp = time()
    error_user = (
        init.groupBy("userId")
        .avg("error")
        .withColumnRenamed("avg(error)", "user_error")
    )
    # print(f"{time() - time_stamp} seconds for error user grouping")
    time_stamp = time()
    error_movie = (
        init.groupBy("movieId")
        .avg("error")
        .withColumnRenamed("avg(error)", "movie_error")
    )
    # print(f"{time() - time_stamp} seconds for error movie grouping")
    time_stamp = time()
    init = (
        init.join(error_user, "userId", "left")
        .withColumn(
            "user_bias",
            col("user_bias") + lr * (col("user_error") - lambda_ * col("user_bias")),
        )
        .join(error_movie, "movieId", "left")
        .withColumn(
            "movie_bias",
            col("movie_bias")
            + lr * (col("movie_error") - lambda_ * col("movie_bias")),
        )
        .drop("user_error")
        .drop("movie_error")
        .cache()
    )

    # print(f"{time() - time_stamp} seconds for bias calculation")
    user_bias = (
        init.groupby("userId")
        .max("user_bias")
        .withColumnRenamed("max(user_bias)", "user_bias")
    )
    movie_bias = (
        init.groupby("movieId")
        .max("movie_bias")
        .withColumnRenamed("max(movie_bias)", "movie_bias")
    )
    baseline_predictions = baseline_predictor(test, user_bias, movie_bias)
    evaluator = RegressionEvaluator(
        metricName="rmse", labelCol="rating", predictionCol="prediction"
    )
    rmse_baseline = evaluator.evaluate(baseline_predictions)
    print(f"{rmse_baseline} after {i+1} iterations")
print(f"Training finished after {time() - start} seconds")

0.9071443483368775 after 1 iterations
0.8969218406766489 after 2 iterations
0.8942664032856237 after 3 iterations
0.8934225167282644 after 4 iterations
0.8931293421157787 after 5 iterations
0.8930230828273417 after 6 iterations
0.8929834387520453 after 7 iterations
0.8929682945703843 after 8 iterations
Training finished after 110.29225063323975 seconds


### Collaborative Filtering

In [39]:
# Build the recommendation model using ALS 
start = time()
als = ALS(maxIter=5,rank=15, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")

evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")

# Cross validation on training and test data
paramGrid = ParamGridBuilder() \
    .addGrid(als.maxIter, [5, 10, 15, 4, 6]) \
    .addGrid(als.regParam, [0.1, 0.05, 0.2, 0.3]) \
    .addGrid(als.coldStartStrategy, ["drop", "nan"]) \
    .build()

crossval = CrossValidator(estimator=als,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=10)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
train_new = train.union(test)
als = crossval.fit(train_new)

print(f"Cross validation finished after {time() - start} seconds")

Cross validation finished after 1299.9577195644379 seconds


In [40]:
als.avgMetrics

[0.8833934807591354,
 0.9544909458847385,
 0.8847087159245665,
 0.9240325278419677,
 0.880932949494346,
 0.9523791158642447,
 0.8790761422787422,
 0.9154148748376895,
 0.8795190671811418,
 0.94707204949201,
 0.8778950088006628,
 0.9142074222722704,
 0.8857927626943973,
 0.954788475333234,
 0.8884747078584726,
 0.929602237367652,
 0.8824157273408866,
 0.9546835142621444,
 0.8825648910133064,
 0.9206615631096309]

In [24]:
start = time()
als = ALS(maxIter=5,rank=10, regParam=0.1, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
als = als.fit(train)
print(f"Training finished after {time() - start} seconds")

Training finished after 3.1862528324127197 seconds


## Evaluate

In [25]:
# Evaluate the model by computing the RMSE on the val data

### ALS
predictions = als.transform(validation)
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error ALS = " + str(rmse))

### Baseline DataFrame
baseline_predictions = baseline_predictor(validation, user_bias, movie_bias)
rmse_baseline = evaluator.evaluate(baseline_predictions)
print("Root-mean-square error Baseline= " + str(rmse_baseline))

Root-mean-square error ALS = 0.8935371713550776
Root-mean-square error Baseline= 0.9031937195812763


ALS performs here a little bit better, and it for one training run also why faster. 

In [28]:
# Movie Recommendation
# Generate top 10 movie recommendations for each user
userRecs = als.recommendForAllUsers(10)

In [31]:
userRecs.toPandas().to_csv('output.csv')