# COURSE: Big Data - CTS43135

## Lab Instruction #6:  
**Building a Movie Recommendation System Using PySpark and Spark MLlib**

### Lab Objectives:
- Understand how to process and analyze large-scale data using PySpark and Spark MLlib.
- Explore and manipulate structured datasets with Spark SQL and DataFrames.
- Implement collaborative filtering for recommendation systems using the ALS algorithm.

### Prerequisites:
- Basic knowledge of Python, SQL, and Machine Learning is recommended.
- This lab runs on Python 3.6+ with Apache Spark 3.x.
- A working environment like Jupyter Notebook or Google Colab is recommended for easier execution.

### Activity 1: Preparing the MovieLens Dataset

#### 1. Download the MovieLens 1M dataset.

In [1]:
!wget -O ml-1m.zip https://files.grouplens.org/datasets/movielens/ml-1m.zip

--2025-05-13 08:03:17--  https://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘ml-1m.zip’


2025-05-13 08:03:20 (3.34 MB/s) - ‘ml-1m.zip’ saved [5917549/5917549]



#### 2. Extract the ZIP file.

In [None]:
import zipfile
with zipfile.ZipFile("ml-1m.zip", "r") as zip_ref:
    # Extracts the contents into a folder named "ml-1m"
    zip_ref.extractall("ml-1m")
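
The loading step later in this lab reads from `ml-1m/ml-1m/`, which suggests the archive already contains a top-level `ml-1m/` folder. The sketch below is one way to check that layout before extracting. The helper names `list_archive_members` and `missing_members` are hypothetical and not part of the lab, and the expected member names are an assumption based on the paths used later.

```python
import zipfile


def list_archive_members(zip_path):
    """Return the sorted member names of a ZIP archive without extracting it."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        return sorted(zip_ref.namelist())


def missing_members(zip_path, expected):
    """Return the expected member names that are absent from the archive."""
    present = set(list_archive_members(zip_path))
    return [name for name in expected if name not in present]
```

Calling `missing_members("ml-1m.zip", ["ml-1m/ratings.dat", "ml-1m/movies.dat"])` should return an empty list if the archive matches that assumed layout.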

#### 3. Define Schemas for the MovieLens Dataset in PySpark.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

spark = SparkSession.builder \
    .appName("MovieLensRecommendation") \
    .getOrCreate()

ratings_schema = StructType([
    StructField("userId", IntegerType(), True),
    StructField("movieId", IntegerType(), True),
    StructField("rating", DoubleType(), True),
    StructField("timestamp", IntegerType(), True)
])

#### Task 1: Create a schema for the movies.dat file from the MovieLens dataset.


In [5]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

movies_schema = StructType([
    StructField("movieId", IntegerType(), True),
    StructField("title", StringType(), True),
    StructField("genres", StringType(), True)
])

#### 4. Load the MovieLens Dataset.

In [None]:
ratings_df = spark.read.csv(
    'ml-1m/ml-1m/ratings.dat', sep="::", schema=ratings_schema, header=False)
ratings_df = ratings_df.drop("timestamp")
movies_df = spark.read.csv('ml-1m/ml-1m/movies.dat',
                           sep="::", schema=movies_schema, header=False)
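
To see what the `::` separator and the two schemas mean for a single record, here is a minimal plain-Python sketch of the same parsing. It only illustrates the file format and is not how Spark reads the files. The function names are hypothetical, and the timestamp value in the sample line is illustrative.

```python
def parse_rating_line(line):
    """Split one 'userId::movieId::rating::timestamp' line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating), int(timestamp)


def parse_movie_line(line):
    """Split one 'movieId::title::genres' line; genres stay a '|'-joined string."""
    movie_id, title, genres = line.strip().split("::")
    return int(movie_id), title, genres


# Sample lines shaped like the rows displayed later in this lab
print(parse_rating_line("1::1193::5::978300760"))
print(parse_movie_line("1::Toy Story (1995)::Animation|Children's|Comedy"))
```

The field order and types mirror `ratings_schema` and `movies_schema`: integers for the IDs, a floating-point rating, and strings for the title and genres.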

#### 5. Cache the DataFrames. 

In [None]:
ratings_df.cache()
movies_df.cache()

25/05/13 08:14:23 WARN CacheManager: Asked to cache already cached data.


DataFrame[movieId: int, title: string, genres: string]

#### 6. Count the Number of Ratings and Movies.

In [None]:
ratings_count = ratings_df.count()
movies_count = movies_df.count()
print(
    f"There are {ratings_count} ratings and {movies_count} movies in the dataset.")

                                                                                

There are 1000209 ratings and 3883 movies in the dataset.


#### 7. Display Sample Data.

In [None]:
print("Ratings Data Sample:")
ratings_df.show(3)  # Show first 3 rows of ratings data
print("Movies Data Sample:")
# Show first 3 rows of movies data (without truncating movie titles)
movies_df.show(3, truncate=False)

Ratings Data Sample:
+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|   1193|   5.0|
|     1|    661|   3.0|
|     1|    914|   3.0|
+------+-------+------+
only showing top 3 rows

Movies Data Sample:
+-------+-----------------------+----------------------------+
|movieId|title                  |genres                      |
+-------+-----------------------+----------------------------+
|1      |Toy Story (1995)       |Animation|Children's|Comedy |
|2      |Jumanji (1995)         |Adventure|Children's|Fantasy|
|3      |Grumpier Old Men (1995)|Comedy|Romance              |
+-------+-----------------------+----------------------------+
only showing top 3 rows



### Activity 2: Computing Average Ratings


#### 1. Import Required Libraries.

In [None]:
from pyspark.sql import functions as F

#### 2. Compute the Average Ratings Per Movie.

Using ratings_df, compute the average rating and the number of ratings per movie.

In [None]:
movie_ids_with_avg_ratings_df = (ratings_df
                                 .groupBy('movieId')  # Group by movie ID
                                 .agg(F.count(ratings_df.rating).alias("count"),  # Count the number of ratings
                                      F.avg(ratings_df.rating).alias("average"))  # Compute the average rating
                                 )

In [None]:
print("movie_ids_with_avg_ratings_df:")
movie_ids_with_avg_ratings_df.show(3, truncate=False)

movie_ids_with_avg_ratings_df:
+-------+-----+------------------+
|movieId|count|average           |
+-------+-----+------------------+
|1580   |2538 |3.739952718676123 |
|2366   |756  |3.6560846560846563|
|1088   |687  |3.3114992721979624|
+-------+-----+------------------+
only showing top 3 rows
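
As a rough mental model of what the `groupBy` and `agg` calls compute, the sketch below does the same count and average per movie in plain Python on a tiny in-memory list. The function name `average_ratings` and the toy data are hypothetical. Spark performs this aggregation in a distributed way, so this only illustrates the result, not the execution.

```python
from collections import defaultdict


def average_ratings(ratings):
    """Map movieId -> (count, average) from (userId, movieId, rating) tuples."""
    totals = defaultdict(lambda: [0, 0.0])  # movieId -> [count, sum of ratings]
    for _user_id, movie_id, rating in ratings:
        totals[movie_id][0] += 1
        totals[movie_id][1] += rating
    return {movie_id: (count, total / count)
            for movie_id, (count, total) in totals.items()}


# Hypothetical toy data, not taken from MovieLens
sample = [(1, 10, 5.0), (2, 10, 3.0), (1, 20, 4.0), (3, 30, 2.0), (2, 30, 3.0)]
stats = average_ratings(sample)
for movie_id in sorted(stats):
    print(movie_id, stats[movie_id])
```

Each movie ends up with one row holding its rating count and mean, which is the shape of movie_ids_with_avg_ratings_df.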



                                                                                

#### 3. Add Movie Titles to the DataFrame.

In [None]:
movie_names_with_avg_ratings_df = movie_ids_with_avg_ratings_df.join(
    movies_df,  # Joining with movies dataset
    movie_ids_with_avg_ratings_df.movieId == movies_df.movieId  # Matching on movie ID
)

In [None]:
print("movie_names_with_avg_ratings_df:")
movie_names_with_avg_ratings_df.show(3, truncate=False)

movie_names_with_avg_ratings_df:
+-------+-----+------------------+-------+--------------------+------------------------------+
|movieId|count|average           |movieId|title               |genres                        |
+-------+-----+------------------+-------+--------------------+------------------------------+
|1580   |2538 |3.739952718676123 |1580   |Men in Black (1997) |Action|Adventure|Comedy|Sci-Fi|
|2366   |756  |3.6560846560846563|2366   |King Kong (1933)    |Action|Adventure|Horror       |
|1088   |687  |3.3114992721979624|1088   |Dirty Dancing (1987)|Musical|Romance               |
+-------+-----+------------------+-------+--------------------+------------------------------+
only showing top 3 rows



#### Task 2: Filter the movies dataset to include only movies with at least 500 reviews and sort them by highest average rating.


In [None]:
filtered_movies_df = (movie_names_with_avg_ratings_df
                      .filter(movie_names_with_avg_ratings_df["count"] >= 500)  # Filter movies with at least 500 reviews
                      .orderBy(movie_names_with_avg_ratings_df["average"].desc())  # Sort by highest average rating
                      )

print("Filtered Movies with at least 500 reviews, sorted by highest average rating:")
filtered_movies_df.show(10, truncate=False)  # Display the top 10 movies

Filtered Movies with at least 500 reviews, sorted by highest average rating:
+-------+-----+-----------------+-------+-------------------------------------------------------------------+-------------------------------+
|movieId|count|average          |movieId|title                                                              |genres                         |
+-------+-----+-----------------+-------+-------------------------------------------------------------------+-------------------------------+
|2019   |628  |4.560509554140127|2019   |Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)|Action|Drama                   |
|318    |2227 |4.554557700942973|318    |Shawshank Redemption, The (1994)                                   |Drama                          |
|858    |2223 |4.524966261808367|858    |Godfather, The (1972)                                              |Action|Crime|Drama             |
|745    |657  |4.52054794520548 |745    |Close Shave, A (1995)         

### Activity 3: Collaborative Filtering with ALS


#### 1. Creating a Training Set.

In [None]:
from pyspark.mllib.recommendation import Rating

# Convert DataFrame to RDD for MLlib compatibility
ratings_rdd = ratings_df.rdd.map(lambda row: Rating(row.userId, row.movieId, row.rating))
# Randomly split the RDD into training (60%), validation (20%), and test (20%) sets
seed = 1800009193
(training_rdd, validation_rdd, test_rdd) = ratings_rdd.randomSplit(
    [0.6, 0.2, 0.2], seed=seed)
# Cache the RDDs for performance
training_rdd.cache()
validation_rdd.cache()
test_rdd.cache()
# Print dataset sizes
print(f'Training: {training_rdd.count()}, Validation: {validation_rdd.count()}, Test: {test_rdd.count()}\n')
# Show sample data from each dataset
print("Training Sample:")
print(training_rdd.take(3)) # Take 3 samples from training set
print("Validation Sample:")
print(validation_rdd.take(3)) # Take 3 samples from validation set
print("Test Sample:")
print(test_rdd.take(3)) # Take 3 samples from test set

25/05/13 08:25:42 WARN BlockManager: Task 27 already completed, not releasing lock for rdd_70_0
25/05/13 08:25:42 WARN BlockManager: Task 28 already completed, not releasing lock for rdd_71_0


Training: 600148, Validation: 200360, Test: 199701

Training Sample:
[Rating(user=1, product=1193, rating=5.0), Rating(user=1, product=661, rating=3.0), Rating(user=1, product=914, rating=3.0)]
Validation Sample:
[Rating(user=1, product=3408, rating=4.0), Rating(user=1, product=1035, rating=5.0), Rating(user=1, product=720, rating=3.0)]
Test Sample:
[Rating(user=1, product=2804, rating=5.0), Rating(user=1, product=594, rating=4.0), Rating(user=1, product=919, rating=4.0)]
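
The split sizes above are close to 60/20/20 but not exact, because each row is assigned to a split at random rather than by counting rows. The following is a conceptual plain-Python sketch of that idea, assuming one uniform draw per row. The function `random_split` is hypothetical and does not reproduce Spark's sampler or its random number stream.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, one per split
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(bounds):
            if draw < upper:
                splits[index].append(row)
                break
        else:
            # Guard against floating-point round-off in the last bound
            splits[-1].append(row)
    return splits


train, validation, test = random_split(range(1000), [0.6, 0.2, 0.2], seed=42)
print(len(train), len(validation), len(test))
```

Fixing the seed makes the assignment reproducible, which is the reason the lab passes an explicit `seed` to `randomSplit`.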


                                                                                

#### 2. Alternating Least Squares

In [None]:
from pyspark.mllib.recommendation import ALS, Rating
from math import sqrt


In [None]:
# Define hyperparameters for ALS
rank = 10  # Number of latent factors
iterations = 10  # Number of iterations
reg_param = 0.1  # Regularization parameter

# Train the ALS model
model = ALS.train(training_rdd, rank, iterations=iterations, lambda_=reg_param)


25/05/13 08:29:18 WARN BlockManager: Task 30 already completed, not releasing lock for rdd_70_0
25/05/13 08:29:21 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
25/05/13 08:29:22 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK
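
To make the "alternating" part of ALS concrete, here is a minimal plain-Python sketch for the special case of a single latent factor (rank 1). With one side fixed, each user or item factor has a closed-form regularized least-squares solution, and the two sides are updated in turn. The function `als_rank1` and the toy ratings are hypothetical. Spark's implementation differs in several respects (higher ranks, distributed solves, its own regularization scaling), so treat this only as an illustration of the idea.

```python
from collections import defaultdict


def als_rank1(ratings, iterations=10, reg_param=0.1):
    """Fit rating ~ x[user] * y[item] by alternating closed-form updates."""
    by_user = defaultdict(list)  # user -> [(item, rating), ...]
    by_item = defaultdict(list)  # item -> [(user, rating), ...]
    for user, item, rating in ratings:
        by_user[user].append((item, rating))
        by_item[item].append((user, rating))
    x = {user: 1.0 for user in by_user}  # user factors
    y = {item: 1.0 for item in by_item}  # item factors
    for _ in range(iterations):
        # Hold item factors fixed and solve each user's 1-D ridge regression
        for user, rated in by_user.items():
            num = sum(rating * y[item] for item, rating in rated)
            den = reg_param + sum(y[item] ** 2 for item, _ in rated)
            x[user] = num / den
        # Hold user factors fixed and solve for each item
        for item, raters in by_item.items():
            num = sum(rating * x[user] for user, rating in raters)
            den = reg_param + sum(x[user] ** 2 for user, _ in raters)
            y[item] = num / den
    return x, y


# Hypothetical toy ratings; user 2 has not rated item "c"
toy = [(1, "a", 2.0), (1, "b", 1.5), (1, "c", 1.0), (2, "a", 4.0), (2, "b", 3.0)]
x, y = als_rank1(toy, iterations=20, reg_param=0.01)
print(x[2] * y["c"])  # predicted rating for the unseen (user 2, item "c") pair
```

In this sketch `rank`, `iterations`, and `lambda_` from `ALS.train` correspond to the number of factors per user and item (here fixed at 1), the number of alternation rounds, and `reg_param`.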


In [None]:
# Generate predictions using the ALS model and the validation dataset
predictions_rdd = model.predictAll(validation_rdd.map(lambda x: (x.user, x.product)))

# Convert predictions to the format ((userId, movieId), rating)
predictions_rdd = predictions_rdd.map(lambda r: ((r.user, r.product), r.rating))

# Convert actual ratings to the same format ((userId, movieId), rating)
actual_ratings_rdd = validation_rdd.map(lambda x: ((x[0], x[1]), x[2]))
# Join actual ratings with predictions
joined_rdd = actual_ratings_rdd.join(predictions_rdd)

25/05/13 08:32:43 WARN BlockManager: Task 155 already completed, not releasing lock for rdd_72_0
                                                                                

In [None]:
# Compute RMSE
mse = joined_rdd.map(lambda x: (x[1][0] - x[1][1]) ** 2).mean()
rmse = sqrt(mse)
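
The RMSE here is the square root of the mean squared difference between actual and predicted ratings, taken over the (userId, movieId) keys that survive the join. A minimal plain-Python sketch of the same computation, using dictionaries in place of keyed RDDs, is shown below. The function name and values are hypothetical.

```python
from math import sqrt


def rmse_from_pairs(actual, predicted):
    """RMSE over the (user, movie) keys present in both dictionaries."""
    shared = actual.keys() & predicted.keys()  # inner join on the key
    if not shared:
        raise ValueError("no overlapping (user, movie) pairs to compare")
    squared_errors = [(actual[key] - predicted[key]) ** 2 for key in shared]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values for illustration
actual = {(1, 10): 4.0, (1, 20): 3.0, (2, 10): 5.0}
predicted = {(1, 10): 3.0, (1, 20): 3.0, (3, 30): 2.0}
print(rmse_from_pairs(actual, predicted))
```

Because the join is an inner join, pairs without a prediction are silently dropped, and the actual ratings and the predictions need to come from the same split for the keys to overlap at all.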


                                                                                

#### 3. Implementing Everything in a Single Function.

In [None]:
def train_als_model(training_rdd, validation_rdd, rank, iterations=5, reg_param=0.1):
    """
    Trains an ALS model and evaluates its performance using RMSE.
    Parameters:
    training_rdd (RDD): Training dataset in RDD format
    validation_rdd (RDD): Validation dataset in RDD format
    rank (int): Number of latent factors for ALS
    iterations (int): Number of iterations for ALS optimization
    reg_param (float): Regularization parameter for ALS
    Returns:
    model: Trained ALS model
    rmse (float): Computed RMSE value
    """
    # Train ALS model
    model = ALS.train(training_rdd, rank, iterations=iterations, lambda_=reg_param)
    
    # Generate predictions for validation set using predictAll
    validation_pairs_rdd = validation_rdd.map(lambda x: (x[0], x[1]))  # Extract (userId, movieId) pairs
    predictions_rdd = model.predictAll(validation_pairs_rdd)
    
    # Format predictions as ((userId, movieId), rating)
    predictions_rdd = predictions_rdd.map(lambda r: ((r.user, r.product), r.rating))
    actual_ratings_rdd = validation_rdd.map(lambda x: ((x[0], x[1]), x[2]))
    
    # Join actual ratings with predictions
    joined_rdd = actual_ratings_rdd.join(predictions_rdd)
    
    # Compute RMSE
    mse = joined_rdd.map(lambda x: (x[1][0] - x[1][1]) ** 2).mean()
    rmse = sqrt(mse)
    
    return model, rmse

#### 4. Train the Model with Sample Parameters.

In [None]:
# Define model parameters
rank = 8
iterations = 5
reg_param = 0.1

# Train the model and get RMSE
als_model, rmse = train_als_model(training_rdd, validation_rdd, rank, iterations, reg_param)
print(f"Trained ALS Model with Rank {rank}, RMSE: {rmse}")

25/05/13 08:42:47 WARN BlockManager: Task 502 already completed, not releasing lock for rdd_70_0
                                                                                

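TypeError: predict() missing 1 required positional argument: 'product'

Note: this error is raised when an RDD of (userId, movieId) pairs is passed to model.predict, which expects one user and one product as separate arguments. Batch predictions over an RDD go through model.predictAll, as train_als_model does.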
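
A function that returns a validation RMSE makes it straightforward to compare several hyperparameter settings. The sketch below shows one possible selection loop. The helper `select_best_rank` and the candidate ranks are assumptions for illustration, and the evaluation function is passed in so the loop itself does not depend on Spark.

```python
def select_best_rank(ranks, evaluate):
    """Return (best_rank, best_rmse, results) for the lowest validation RMSE."""
    results = {}
    for rank in ranks:
        results[rank] = evaluate(rank)  # evaluate(rank) -> RMSE as a float
    best_rank = min(results, key=results.get)
    return best_rank, results[best_rank], results


# With the objects from this lab (assumed to be defined and working):
# best_rank, best_rmse, _ = select_best_rank(
#     [4, 8, 12],
#     lambda rank: train_als_model(training_rdd, validation_rdd, rank)[1])
```

The test set is left untouched during this comparison so that it can give an unbiased estimate for the chosen model afterwards.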