# Lab Instruction #6: Building a Book Recommendation System Using PySpark and Spark MLlib  
## Lab Assignment: Spark MLlib – Book Recommendation    

**Student Information**  
- Name: Thái Hồ Phú Gia  
- Class: 23MMT  
- Student ID: 11012302891  

**Objective**  
- Load and process the Book-Crossing dataset using PySpark.  
- Perform data cleaning and transformation to structure the dataset for recommendations.  
- Use Spark MLlib’s ALS (Alternating Least Squares) to build a book recommendation system.  
- Tune hyperparameters to optimize the recommendation model.  
- Evaluate model performance using Root Mean Squared Error (RMSE).  

**Instructions**  
Download the Book-Crossing dataset.  
It contains user ratings for books. Your goal is to process these ratings with Spark and apply ALS (or a similar collaborative filtering method) from Spark MLlib to build a book recommendation system.  

- Load and preprocess the dataset, ensuring valid user ratings.  
- Filter out books with very few ratings to improve model performance.  
- Train an ALS model using PySpark MLlib to generate book recommendations.  
- Evaluate the model using Root Mean Squared Error (RMSE).  
- Tune hyperparameters (rank, lambda_, iterations; exposed as `rank`, `regParam`, and `maxIter` in `pyspark.ml`) to optimize the recommendation model.  
- Generate and display the top 5 book recommendations for a given user.  

**Submission**  
- Submission deadline: 2 weeks from the assignment date.  
- Submission format: upload the executed notebook (or similar) to the LMS (lms.siu.edu.vn).  

**Suggested Resources**  
- [Spark Collaborative Filtering Documentation](https://spark.apache.org/docs/latest/ml-collaborative-filtering.html)  
- [Spark SQL Documentation](https://spark.apache.org/sql/)  

### 1. Load and process the Book-Crossing dataset using PySpark.  

In [22]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
import pandas as pd
from pyspark.sql import Row


In [23]:
spark = SparkSession.builder \
    .appName("BookRecommendation") \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")


In [24]:
ratings_path = "./dataset/Ratings.csv"
books_path   = "./dataset/Books.csv"
users_path   = "./dataset/Users.csv"

ratings = spark.read.csv(ratings_path, header=True, inferSchema=True, sep=';')
books   = spark.read.csv(books_path,   header=True, inferSchema=True, sep=';')
users   = spark.read.csv(users_path,   header=True, inferSchema=True, sep=';')

In [25]:
ratings.printSchema()
ratings.show(5, truncate=False)

books.printSchema()
books.show(5, truncate=False)

root
 |-- User-ID: integer (nullable = true)
 |-- ISBN: string (nullable = true)
 |-- Rating: integer (nullable = true)

+-------+----------+------+
|User-ID|ISBN      |Rating|
+-------+----------+------+
|276725 |034545104X|0     |
|276726 |0155061224|5     |
|276727 |0446520802|0     |
|276729 |052165615X|3     |
|276729 |0521795028|6     |
+-------+----------+------+
only showing top 5 rows

root
 |-- ISBN: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Author: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Publisher: string (nullable = true)

+----------+--------------------------------------------------------------------------------------------------+--------------------+----+-----------------------+
|ISBN      |Title                                                                                             |Author              |Year|Publisher              |
+----------+------------------------------------------------------------------------

In [26]:
users.printSchema()
users.show(5, truncate=False)

root
 |-- User-ID: string (nullable = true)
 |-- Age: string (nullable = true)

+-------+----+
|User-ID|Age |
+-------+----+
|1      |NULL|
|2      |18  |
|3      |NULL|
|4      |17  |
|5      |NULL|
+-------+----+
only showing top 5 rows



### 2. Perform data cleaning and transformation to structure the dataset for recommendations.

In [27]:

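# Keep explicit ratings only: in Book-Crossing a rating of 0 marks an implicit interaction.
ratings_clean = ratings.filter(col("Rating").isNotNull() & (col("Rating") > 0))

# Count explicit ratings per book and keep books with at least 5 of them.
book_counts = ratings_clean.groupBy("ISBN").agg(count("*").alias("cnt"))
popular_books = book_counts.filter(col("cnt") >= 5).select("ISBN")

# The inner join discards ratings of books below the threshold.
ratings_filt = ratings_clean.join(popular_books, on="ISBN", how="inner")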
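
The two filters in this cell can be checked on a toy list without Spark. The following is a minimal plain-Python sketch; the sample tuples and the `filter_ratings` helper are made up for illustration and are not part of the notebook:

```python
from collections import Counter

# Hypothetical (user_id, isbn, rating) tuples; a rating of 0 marks an implicit interaction.
sample = [
    (1, "A", 0), (1, "B", 7), (2, "B", 9), (3, "B", 5),
    (2, "C", 8), (3, "A", 6), (4, "B", 10), (5, "B", 4),
]

def filter_ratings(rows, min_ratings):
    """Drop missing and implicit (0) ratings, then keep books with at least `min_ratings` ratings."""
    explicit = [r for r in rows if r[2] is not None and r[2] > 0]
    counts = Counter(isbn for _, isbn, _ in explicit)
    return [r for r in explicit if counts[r[1]] >= min_ratings]

kept = filter_ratings(sample, min_ratings=5)
print(sorted({isbn for _, isbn, _ in kept}))  # ['B']
```

Only book `B` survives: it has five explicit ratings, while `A` and `C` have one each once the implicit rating is removed.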


In [28]:
train, test = ratings_filt.randomSplit([0.8, 0.2], seed=42)



In [29]:
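from pyspark.ml.feature import StringIndexer

# ALS needs numeric item IDs, so map each ISBN string to an index.
# The indexer is fitted on the training split only; with the default
# handleInvalid="error", an ISBN that never appears in train raises an error.
isbn_indexer = StringIndexer(
    inputCol="ISBN",
    outputCol="itemID"
).fit(train)

train = isbn_indexer.transform(train)
test  = isbn_indexer.transform(test)

train.printSchema()
train.show(5, truncate=False)
test.printSchema()
test.show(5, truncate=False)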

root
 |-- ISBN: string (nullable = true)
 |-- User-ID: integer (nullable = true)
 |-- Rating: integer (nullable = true)
 |-- itemID: double (nullable = false)

+----------+-------+------+-------+
|ISBN      |User-ID|Rating|itemID |
+----------+-------+------+-------+
|0000000000|8094   |10    |11305.0|
|0000000000|11676  |9     |11305.0|
|0000000000|71285  |7     |11305.0|
|0002005018|8      |5     |7054.0 |
|0002005018|11676  |8     |7054.0 |
+----------+-------+------+-------+
only showing top 5 rows

root
 |-- ISBN: string (nullable = true)
 |-- User-ID: integer (nullable = true)
 |-- Rating: integer (nullable = true)
 |-- itemID: double (nullable = false)

+----------+-------+------+-------+
|ISBN      |User-ID|Rating|itemID |
+----------+-------+------+-------+
|0000000000|11795  |7     |11305.0|
|0002005018|67544  |8     |7054.0 |
|0002251760|37712  |10    |7055.0 |
|0002558122|11676  |8     |7056.0 |
|0006385427|38835  |10    |13356.0|
+----------+-------+------+-------+
only showing top 5 rows
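
The `itemID` values above come from a frequency-based mapping. The plain-Python sketch below imitates that idea, assuming the default `frequencyDesc` ordering with alphabetical tie-breaking; `fit_index` and `transform` are hypothetical helper names, not Spark APIs:

```python
from collections import Counter

def fit_index(values):
    """Order distinct strings by descending frequency (ties alphabetical).

    The position of a string in the returned list is its index,
    in the same spirit as the `labels` list of a fitted indexer.
    """
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))

def transform(values, labels):
    """Replace each string by its index as a float; unseen strings raise KeyError."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    return [lookup[v] for v in values]

train_isbns = ["X", "Y", "Y", "Z", "Y", "Z"]  # hypothetical ISBNs
labels = fit_index(train_isbns)
print(labels)                          # ['Y', 'Z', 'X']
print(transform(["Z", "Y"], labels))   # [1.0, 0.0]
```

Because the mapping is learned from the training split, the same ISBN receives the same `itemID` in both splits, which is why `0000000000` maps to `11305.0` in both tables.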

### 3. Use Spark MLlib’s ALS (Alternating Least Squares) to build a book recommendation system.  

In [30]:
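als = ALS(
    userCol="User-ID",
    itemCol="itemID",
    ratingCol="Rating",
    nonnegative=True,          # constrain the latent factors to be non-negative
    implicitPrefs=False,       # ratings are explicit 1-10 scores
    coldStartStrategy="drop"   # drop NaN predictions for users/items unseen in training
)

# Hyperparameters that are not set keep their defaults (rank=10, maxIter=10, regParam=0.1).
model = als.fit(train)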
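
To make the "alternating" part of ALS concrete, here is a minimal plain-Python sketch restricted to rank 1, where each update has a closed form. It is an illustration under simplifying assumptions, not Spark's implementation: it uses a plain L2 penalty `reg`, no non-negativity constraint, and a made-up toy rating list.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, iters=10):
    """Alternate closed-form ridge updates for 1-D user and item factors.

    `ratings` is a list of (user, item, rating) with 0-based integer ids.
    """
    x = [1.0] * n_users  # user factors
    y = [1.0] * n_items  # item factors
    for _ in range(iters):
        # Fix the item factors and solve each user's factor.
        for u in range(n_users):
            obs = [(i, r) for (uu, i, r) in ratings if uu == u]
            x[u] = sum(r * y[i] for i, r in obs) / (reg + sum(y[i] ** 2 for i, _ in obs))
        # Fix the user factors and solve each item's factor.
        for i in range(n_items):
            obs = [(u, r) for (u, ii, r) in ratings if ii == i]
            y[i] = sum(r * x[u] for u, r in obs) / (reg + sum(x[u] ** 2 for u, _ in obs))
    return x, y

def sse(ratings, x, y):
    """Sum of squared errors of the rank-1 predictions x[u] * y[i]."""
    return sum((r - x[u] * y[i]) ** 2 for u, i, r in ratings)

# Hypothetical toy data: 3 users, 3 items, 6 observed ratings.
ratings = [(0, 0, 8.0), (0, 1, 4.0), (1, 0, 10.0), (1, 2, 5.0), (2, 1, 3.0), (2, 2, 3.0)]
x, y = als_rank1(ratings, n_users=3, n_items=3)
print(f"training SSE: {sse(ratings, x, y):.3f}")
print(f"prediction for unseen pair (user 0, item 2): {x[0] * y[2]:.2f}")
```

Each half-step minimizes the regularized squared error exactly while the other side is held fixed, so the objective never increases from one sweep to the next. With `rank` greater than 1 the scalar division becomes a small linear solve per user and per item.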

In [31]:
predictions = model.transform(test)
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="Rating",
    predictionCol="prediction"
)
rmse = evaluator.evaluate(predictions)
print(f"Initial RMSE = {rmse:.4f}")

Initial RMSE = 2.0300
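
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. A minimal plain-Python sketch with made-up (actual, predicted) pairs follows; skipping NaN predictions here only imitates the effect of `coldStartStrategy="drop"` and is not how the evaluator itself works:

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs, skipping NaN predictions."""
    kept = [(a, p) for a, p in pairs if not math.isnan(p)]
    return math.sqrt(sum((a - p) ** 2 for a, p in kept) / len(kept))

# Hypothetical pairs; the last one stands for a user or book unseen in training.
pairs = [(8, 6.0), (5, 7.0), (10, 9.0), (7, float("nan"))]
print(f"{rmse(pairs):.4f}")  # 1.7321
```

On the 1 to 10 rating scale used here, an RMSE of about 2.03 means the predictions are off by roughly two rating points in the root-mean-square sense.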


### 4. Tune hyperparameters to optimize the recommendation model and evaluate model performance using Root Mean Squared Error (RMSE).

In [32]:
paramGrid = ParamGridBuilder() \
    .addGrid(als.rank,        [10, 20]) \
    .addGrid(als.regParam,    [0.01, 0.1]) \
    .addGrid(als.maxIter,     [10, 20]) \
    .build()

cv = CrossValidator(
    estimator=als,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=2
)

cvModel = cv.fit(train)
bestModel = cvModel.bestModel

bestPreds = bestModel.transform(test)
bestRmse  = evaluator.evaluate(bestPreds)
print(f"Best RMSE = {bestRmse:.4f}")
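# RMSE is lower-is-better, so the winning grid entry has the smallest average metric.
best_idx = min(range(len(paramGrid)), key=lambda i: cvModel.avgMetrics[i])
best_params = {param.name: value for param, value in paramGrid[best_idx].items()}
print(f"Best rank = {best_params['rank']}, regParam = {best_params['regParam']}, maxIter = {best_params['maxIter']}")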

Best RMSE = 1.9652
Best rank = 20, regParam = 0.1, maxIter = 20
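
The cost of this search grows with the size of the grid. The plain-Python sketch below counts the model fits implied by the grid and fold count used above, and shows how a lower-is-better metric selects the winner. The `pick_best` helper and the three sample averages are hypothetical, not values from this run:

```python
from itertools import product

grid = {"rank": [10, 20], "regParam": [0.01, 0.1], "maxIter": [10, 20]}
num_folds = 3

# Every combination of the listed values is one candidate setting.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

# Each candidate is fitted once per fold; the best one is then refitted on all training data.
fits = len(combos) * num_folds + 1
print(len(combos), fits)  # 8 25

def pick_best(avg_rmse):
    """Index of the smallest average RMSE (lower is better)."""
    return min(range(len(avg_rmse)), key=avg_rmse.__getitem__)

print(pick_best([2.10, 1.90, 2.00]))  # 1 (made-up averages)
```

Both selected values, `rank = 20` and `maxIter = 20`, sit at the upper edge of their ranges, so extending the grid in those directions may be worth trying if the extra fits are affordable.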


### 5. Book recommendation system

In [33]:
from pyspark.sql.functions import explode


def get_book_details(isbn, books_df):
    """Return "Title by Author (Year)" for an ISBN, or a placeholder if the book is missing."""
    book = books_df.filter(col("ISBN") == isbn).first()
    if book:
        # `"name" in row` checks the Row's field names, not its values.
        title = book["Title"] if "Title" in book else "Unknown"
        author = book["Author"] if "Author" in book else "Unknown"
        # The Books.csv schema printed earlier names this column "Year", so this
        # lookup falls back to "Unknown" unless the column is renamed.
        year = book["Year-Of-Publication"] if "Year-Of-Publication" in book else "Unknown"
        return f"{title} by {author} ({year})"
    return f"Unknown book (ISBN: {isbn})"


def recommend_books_for_user(user_id, model, train_df, books_df, isbn_indexer, num_recommendations=5):
    """Print the top-N ALS recommendations for a user that appears in the training set."""
    # ALS has no factors for users it never saw, so stop early for unknown users.
    if train_df.filter(col("User-ID") == user_id).count() == 0:
        print(f"User {user_id} not found in the training set")
        return

    user_df = spark.createDataFrame([Row(**{"User-ID": user_id})])

    recs = model.recommendForUserSubset(user_df, num_recommendations)
    if recs.count() == 0:
        print(f"No recommendations could be generated for User {user_id}")
        return

    # Flatten the array of (itemID, rating) structs into one row per recommendation.
    recs = recs.select("User-ID", explode("recommendations").alias("rec"))
    recs = recs.select("User-ID", col("rec.itemID").alias("itemID"), col("rec.rating").alias("score"))

    # Map itemID back to ISBN: a label's position in `labels` is its index.
    itemid_isbn = pd.DataFrame({
        "itemID": list(range(len(isbn_indexer.labels))),
        "ISBN": isbn_indexer.labels
    })
    itemid_isbn_sdf = spark.createDataFrame(itemid_isbn)

    # A join does not guarantee row order, so sort by score explicitly.
    recs = recs.join(itemid_isbn_sdf, on="itemID", how="left").orderBy(col("score").desc())

    print(f"\nTop {num_recommendations} book recommendations for User {user_id}:")
    for i, row in enumerate(recs.collect(), 1):
        isbn = row["ISBN"]
        print(f"{i}. {get_book_details(isbn, books_df)} (score: {row['score']:.2f})")
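
The last two steps of this function, ranking by predicted score and mapping item indices back to ISBNs, can be illustrated in plain Python. The labels and scores below are made up, and `top_n` is a hypothetical helper, not a Spark call:

```python
import heapq

labels = ["0000000001", "0000000002", "0000000003", "0000000004"]  # hypothetical ISBNs
scores = {0: 7.2, 1: 9.5, 2: 8.1, 3: 9.9}  # hypothetical itemID -> predicted score

def top_n(scores, labels, n):
    """Return the n highest-scoring items as (ISBN, score), best first."""
    best = heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])
    # The integer item index is the position of the ISBN in `labels`.
    return [(labels[item_id], score) for item_id, score in best]

print(top_n(scores, labels, 2))  # [('0000000004', 9.9), ('0000000002', 9.5)]
```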


In [34]:
train.select("User-ID").distinct().show(10)

+-------+
|User-ID|
+-------+
|  80945|
|  64171|
|  76355|
|  44925|
|  10917|
|    232|
|  15957|
|   7346|
|  10030|
|  73486|
+-------+
only showing top 10 rows



In [35]:
recommend_books_for_user(80945, bestModel, train, books, isbn_indexer, num_recommendations=5)

Top 5 book recommendations for User 80945:
1. Redeeming Love by Francine Rivers (Unknown) (score: 10.69)
2. Per Anhalter Durch Die Galaxis by Adams (Unknown) (score: 10.65)
3. Miles from Nowhere: A Round the World Bicycle Adventure by Barbara Savage (Unknown) (score: 10.61)
4. Wild Orchids : A Novel by Jude Deveraux (Unknown) (score: 10.41)
5. The Law by Frederic Bastiat (Unknown) (score: 10.34)
