# BIG DATA ASSIGNMENT WEEK 09
## Collaborative Filtering
- Name: Agustinus Aldi Irawan Rahardja
- Student ID: 05111942000010
- Class: Big Data A
- Lecturer: Abdul Munif, S.Kom., M.Sc.

### Reference
https://spark.apache.org/docs/latest/ml-collaborative-filtering.html

## Install & Initialization

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
     281.4/281.4 MB 4.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     199.7/199.7 kB 6.6 MB/s eta 0:00:00
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824028 sha256=9ad2ad4a76b5d8b9900c63666a5bd9b6958080c7118e18ffb7771074eb4049ae
  Stored in directory: /root/.cache/pip/wheels/6c/e3/9b/0525ce8a69478916513509d43693511463c6468db0de237c86
Successfully built pyspark
Installing collected packages: py4j, pyspa

In [1]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row, SparkSession

In [2]:
# SparkSession Initialization
spark = SparkSession.builder \
    .master("local") \
    .appName("MovieLens") \
    .getOrCreate()

In [3]:
# Read data from a text file and separate elements of each line
lines = spark.read.text("./sample_data/sample_movielens_ratings.txt").rdd
parts = lines.map(lambda row: row.value.split("::"))

In [4]:
# Convert each parsed line into a Row with userId, movieId, rating, and timestamp fields
ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2]), timestamp=int(p[3])))

# Create a DataFrame from the rows, then randomly split it into
# training (about 80%) and test (about 20%) sets
ratings = spark.createDataFrame(ratingsRDD)
(training, test) = ratings.randomSplit([0.8, 0.2])
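
The parsing step can be illustrated without Spark. The following is a minimal sketch in plain Python, assuming each line has the form `userId::movieId::rating::timestamp` (as the `split("::")` call and the `Row` fields suggest). The `Rating` tuple, the `parse_rating` helper, and the sample line are hypothetical and not part of the notebook.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One parsed rating record (hypothetical helper type, not a Spark Row)."""
    userId: int
    movieId: int
    rating: float
    timestamp: int


def parse_rating(line: str) -> Rating:
    """Split a '::'-delimited line and cast each field, as the Row(...) lambda does."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


# Made-up sample line, for illustration only
sample = parse_rating("0::2::3::1424380312")
print(sample)
```

Unpacking into exactly four names makes a malformed line fail with a `ValueError` instead of silently producing a partial record.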

## Build Recommendation Model Using ALS

In [5]:
# Initialize the parameters to be tried
max_iters = [5, 10, 20]
reg_params = [0.1, 0.5, 1.0]

# Dictionary to store RMSE results
results = {}
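
The two lists define a grid of 3 x 3 = 9 hyperparameter combinations. As a small sketch of one alternative to nested loops, `itertools.product` from the standard library enumerates the same pairs in the same order; the `grid` variable is introduced here only for illustration.

```python
from itertools import product

max_iters = [5, 10, 20]
reg_params = [0.1, 0.5, 1.0]

# Every (maxIter, regParam) pair, in the same order as a nested for-loop
grid = list(product(max_iters, reg_params))
print(len(grid))
print(grid[0], grid[-1])
```

A loop written as `for max_iter, reg_param in product(max_iters, reg_params):` would visit the same combinations with one level of indentation less.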

In [10]:
!pip install findspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [6]:
# Loop for every combination of maxIter and regParam
for max_iter in max_iters:
    for reg_param in reg_params:
        # Build the recommendation model using ALS on the training data
        # Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
        als = ALS(maxIter=max_iter, regParam=reg_param, userCol="userId", itemCol="movieId", ratingCol="rating",
                  coldStartStrategy="drop")
        model = als.fit(training)

        # Evaluate the model by computing the RMSE on the test data
        predictions = model.transform(test)
        evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                        predictionCol="prediction")
        rmse = evaluator.evaluate(predictions)

        # Save the RMSE result in the dictionary
        results[(max_iter, reg_param)] = rmse
        print(f"Root-mean-square error for maxIter={max_iter}, regParam={reg_param} = {rmse}")

Root-mean-square error for maxIter=5, regParam=0.1 = 1.1181599522779593
Root-mean-square error for maxIter=5, regParam=0.5 = 1.3420819037625813
Root-mean-square error for maxIter=5, regParam=1.0 = 1.648282507027418
Root-mean-square error for maxIter=10, regParam=0.1 = 1.0516795859269352
Root-mean-square error for maxIter=10, regParam=0.5 = 1.3397644437245055
Root-mean-square error for maxIter=10, regParam=1.0 = 1.6482810348270984
Root-mean-square error for maxIter=20, regParam=0.1 = 1.0470312120846181
Root-mean-square error for maxIter=20, regParam=0.5 = 1.3402623990325133
Root-mean-square error for maxIter=20, regParam=1.0 = 1.6482810342764438
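
RMSE is the square root of the mean squared difference between actual and predicted ratings, so lower is better. The sketch below computes it in plain Python and also imitates the effect of `coldStartStrategy="drop"` by skipping NaN predictions. In Spark the dropping happens in `model.transform`, not in `RegressionEvaluator`; the filter sits inside the hypothetical `rmse` helper only to keep the example short, and the ratings are made up.

```python
import math


def rmse(labels, predictions):
    """Root-mean-square error over the pairs whose prediction is not NaN."""
    pairs = [(y, p) for y, p in zip(labels, predictions) if not math.isnan(p)]
    if not pairs:
        raise ValueError("no valid predictions to evaluate")
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))


# Made-up ratings and predictions; the NaN stands for a user or movie
# that did not appear in the training split (a cold-start case)
labels = [4.0, 3.0, 5.0, 1.0]
predictions = [3.0, 3.0, float("nan"), 3.0]
print(rmse(labels, predictions))
```

Without the NaN filter, a single cold-start row would make the whole metric NaN, which is the situation the comment in the training loop refers to.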


In [7]:
# Find the hyperparameter combination with the lowest RMSE
best_params = min(results, key=results.get)
best_rmse = results[best_params]
print(f"\nBest hyperparameters: maxIter={best_params[0]}, regParam={best_params[1]} with RMSE={best_rmse}")


Best hyperparameters: maxIter=20, regParam=0.1 with RMSE=1.0470312120846181
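
To see how the combinations compare beyond the single best one, the results can be sorted by RMSE. The sketch below copies the RMSE values printed by the run above into a dictionary and ranks them; the `ranked` variable is introduced only for illustration.

```python
# RMSE values copied from the run above, keyed by (maxIter, regParam)
results = {
    (5, 0.1): 1.1181599522779593,
    (5, 0.5): 1.3420819037625813,
    (5, 1.0): 1.648282507027418,
    (10, 0.1): 1.0516795859269352,
    (10, 0.5): 1.3397644437245055,
    (10, 1.0): 1.6482810348270984,
    (20, 0.1): 1.0470312120846181,
    (20, 0.5): 1.3402623990325133,
    (20, 1.0): 1.6482810342764438,
}

# Sort the combinations from lowest to highest RMSE
ranked = sorted(results.items(), key=lambda item: item[1])
for (max_iter, reg_param), rmse in ranked[:3]:
    print(f"maxIter={max_iter}, regParam={reg_param}: RMSE={rmse:.4f}")

# Same selection rule as the notebook cell
best_params = min(results, key=results.get)
print(best_params)
```

In this run the three lowest RMSE values all use `regParam=0.1`, and for `regParam=1.0` the RMSE barely changes with `maxIter`, so the regularization strength matters far more here than the number of iterations.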


## Generate Movie Recommendations

In [8]:
# Train the model with the best hyperparameters
best_als = ALS(maxIter=best_params[0], regParam=best_params[1], userCol="userId", itemCol="movieId", ratingCol="rating",
               coldStartStrategy="drop")
best_model = best_als.fit(training)

## Print Results and Show Output

In [9]:
# Generate top 10 movie recommendations for each user
userRecs = best_model.recommendForAllUsers(10)
userRecs.show()

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|    20|[{22, 3.9751616},...|
|    10|[{2, 3.4396794}, ...|
|     0|[{92, 2.9995809},...|
|     1|[{62, 3.087924}, ...|
|    21|[{53, 4.068073}, ...|
|    11|[{30, 4.540894}, ...|
|    12|[{46, 4.1709056},...|
|    22|[{75, 4.627915}, ...|
|     2|[{93, 4.684606}, ...|
|    13|[{93, 3.1345673},...|
|     3|[{51, 4.071301}, ...|
|    23|[{46, 4.97689}, {...|
|     4|[{53, 3.8484697},...|
|    24|[{96, 3.7843173},...|
|    14|[{29, 4.6642785},...|
|     5|[{32, 3.7756653},...|
|    15|[{1, 2.7751334}, ...|
|    25|[{47, 2.9922917},...|
|    26|[{22, 4.817207}, ...|
|     6|[{25, 3.8545783},...|
+------+--------------------+
only showing top 20 rows
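
Each entry in the `recommendations` column pairs a movie ID with a predicted rating. An ALS model predicts a rating as the dot product of a user's latent factor vector and a movie's latent factor vector, and a top-N recommendation keeps the N highest-scoring movies. The following is a minimal single-machine sketch of that scoring idea, not Spark's distributed implementation; the factor values and the `dot` and `recommend` helpers are made up for illustration.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length factor vectors."""
    return sum(a * b for a, b in zip(u, v))


def recommend(user_vec, item_factors, k):
    """Return the k (item_id, score) pairs with the highest predicted score."""
    scored = ((dot(user_vec, vec), item_id) for item_id, vec in item_factors.items())
    top = heapq.nlargest(k, scored)
    return [(item_id, score) for score, item_id in top]


# Hypothetical rank-2 factors for one user and four movies
user_vec = [1.0, 0.5]
item_factors = {
    10: [2.0, 0.0],
    11: [0.0, 2.0],
    12: [3.0, 1.0],
    13: [-1.0, 1.0],
}
print(recommend(user_vec, item_factors, 2))
```

`heapq.nlargest` avoids sorting every score when only the top few are needed.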



In [10]:
# Generate top 10 user recommendations for each movie
movieRecs = best_model.recommendForAllItems(10)
movieRecs.show()

+-------+--------------------+
|movieId|     recommendations|
+-------+--------------------+
|     20|[{17, 3.8699784},...|
|     40|[{2, 3.4557195}, ...|
|     10|[{17, 3.5745504},...|
|     50|[{23, 3.7012196},...|
|     80|[{26, 3.936153}, ...|
|     70|[{21, 3.1901}, {8...|
|     60|[{22, 2.6208816},...|
|     90|[{17, 4.729562}, ...|
|     30|[{11, 4.540894}, ...|
|      0|[{28, 2.524856}, ...|
|     31|[{12, 2.8562407},...|
|     81|[{28, 3.9458597},...|
|     91|[{12, 2.812576}, ...|
|      1|[{25, 2.8214736},...|
|     41|[{21, 3.222383}, ...|
|     61|[{7, 2.0230844}, ...|
|     51|[{26, 4.3740354},...|
|     21|[{26, 2.6921322},...|
|     11|[{2, 1.387227}, {...|
|     71|[{11, 2.5183535},...|
+-------+--------------------+
only showing top 20 rows



In [11]:
# Generate top 10 movie recommendations for a specific set of users
users = ratings.select(best_als.getUserCol()).distinct().limit(3)
userSubsetRecs = best_model.recommendForUserSubset(users, 10)
userSubsetRecs.show()

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|    26|[{22, 4.817207}, ...|
|    19|[{94, 3.4780343},...|
|    29|[{46, 4.0811605},...|
+------+--------------------+



In [12]:
# Generate top 10 user recommendations for a specific set of movies
movies = ratings.select(best_als.getItemCol()).distinct().limit(3)
movieSubSetRecs = best_model.recommendForItemSubset(movies, 10)
movieSubSetRecs.show()

+-------+--------------------+
|movieId|     recommendations|
+-------+--------------------+
|     65|[{23, 4.01176}, {...|
|     26|[{25, 2.0301652},...|
|     29|[{8, 4.7330728}, ...|
+-------+--------------------+



## Summary

The notebook implements the following steps:

* Data Loading and Preprocessing: <br>
We loaded the sample MovieLens ratings file (`sample_movielens_ratings.txt`), parsed each line into `userId`, `movieId`, `rating`, and `timestamp`, and randomly split the data into training (80%) and test (20%) sets.

* Hyperparameter Tuning: <br>
We trained an ALS model for each combination of `maxIter` (5, 10, 20) and `regParam` (0.1, 0.5, 1.0), computed the RMSE of each model on the test set, and stored the results in a dictionary.

* Model Selection: <br>
We selected the combination with the lowest RMSE, which in this run was `maxIter=20` and `regParam=0.1` with an RMSE of about 1.047.

* Generating Recommendations: <br>
We retrained the model with the best hyperparameters and generated recommendations for all users, all movies, a subset of three users, and a subset of three movies.

Each output table holds the top 10 recommendations per user or per movie. Note that `show()` prints only the first 20 rows and truncates the `recommendations` column.