### Problem

One of the biggest challenges in reading is finding the right book. That is why we made BookForYou, a recommender system that suggests books based on the user's stated preferences for author, title, and book category. It uses book reviews from Amazon's book database to find the best candidate.

### Identification of required data

The dataset used is [Amazon Book Reviews](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?select=books_data.csv).

The dataset consists of two entities, one with book details and the second containing book reviews. Each entity has 10 features, for a combined dataset size of 3.04 GB. As shown below, one book can have many reviews, but a review can only belong to a single book. Books are identified by their titles. From the book details, the title, author, year and category will be used. From the reviews entity, the content of the reviews, book rating, and the helpfulness rating of a given review will be used.

![Entities picture](images/entities.png)

For Book_details, the following features will be used:
* Title (book title)
* authors (author of the book)
* publishedDate (publication year)
* categories (book category)

For reviews, the following features will be used:
* Id (the ID of the book)
* Title (book title)
* User_id (ID of the user who rated the book)
* review/score (rating from 0 to 5 for the book)
* review/summary (the summary of the text review)

Open question: the relevance of these features could be checked with a correlation metric (e.g. Pearson correlation) or with decision-tree feature importance.

### Data PreProcessing

The following imports are used for data preprocessing.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler


Creating SparkSession

In [2]:
def init_spark():
    spark = SparkSession \
        .builder \
        .appName("BookForYou Recommender System") \
        .config("spark.driver.memory", "8g") \
        .config("spark.executor.memory", "8g") \
        .getOrCreate()
    return spark

Decision trees for feature importance
(unused; the code is kept commented out for reference)

In [None]:

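# # imports this snippet would need if it were re-enabled
# from functools import reduce
# from pyspark.ml.classification import DecisionTreeClassifier

# # define the categorical features to one-hot encode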
# categorical_cols = ['Id', 'Title', 'Price', 'User_id', 'profileName', 'review/helpfulness', 'review/score', 'review/time', 'review/summary', 'review/text']


# # create a list of string indexers for each categorical feature
# indexers = [StringIndexer(inputCol=col, outputCol=col + "_index") for col in categorical_cols]

# encoder = OneHotEncoder(inputCols=[indexer.getOutputCol() for indexer in indexers],
#                                  outputCols=[col + "_encoded" for col in categorical_cols])

# # fit the indexers and encoder to the data
# indexers_models = [indexer.fit(df) for indexer in indexers]
# encoded_df = encoder.fit(
#     reduce(lambda data, model: model.transform(data), indexers_models, df)
# ).transform(
#     reduce(lambda data, model: model.transform(data), indexers_models, df)
# )
# print(encoded_df.columns)
# dt = DecisionTreeClassifier(labelCol="review/score_index", featuresCol='review/summary_encoded', maxDepth=2, maxBins=1000)
# model = dt.fit(encoded_df)
# print(model.featureImportances)
# print(df.count())

### Importing the dataset

In [3]:
spark  = init_spark()
df_ratings = spark.read.csv("data\\preprocessed\\reviews.csv", inferSchema=True, header=True)
df_books = spark.read.csv("data\\preprocessed\\book_details.csv", inferSchema=True, header=True)

Missing Data
Data sources frequently contain gaps, which leaves three main options for handling them:

1. Keep the missing data points.
2. Drop the missing data points (including the entire row).
3. Fill them in with some other value.
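
The three options can be illustrated with a minimal plain-Python sketch. The rows and the placeholder defaults below are made up for illustration and do not come from the dataset. In Spark, the corresponding calls would be `DataFrame.na.drop` for option 2 and `DataFrame.na.fill` for option 3.

```python
# Toy rows standing in for reviews; None marks a missing value (made-up data)
rows = [
    {"Id": "b1", "User_id": "u1", "review/score": 5.0},
    {"Id": "b2", "User_id": None, "review/score": 4.0},
    {"Id": "b3", "User_id": "u3", "review/score": None},
]
required = ["Id", "User_id", "review/score"]

# Option 1: keep every row, gaps included
kept = list(rows)

# Option 2: drop any row with a missing value in a required column
dropped = [r for r in rows if all(r[c] is not None for c in required)]

# Option 3: fill missing values with a per-column placeholder (hypothetical defaults)
defaults = {"User_id": "unknown", "review/score": 0.0}
filled = [
    {c: (defaults.get(c) if r[c] is None else r[c]) for c in required}
    for r in rows
]

print(len(kept), len(dropped), len(filled))
```

For a recommender, filling in a missing user ID or score would invent interactions that never happened, which is one reason dropping incomplete rows can be the safer choice.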

In [4]:
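# Keep only the columns needed for collaborative filtering
df_ratings = df_ratings.select("Id", "Title", "User_id", "review/score", "review/summary")
# Drop rows with a missing value in any of the selected columns
df_ratings = df_ratings.na.drop(subset=["Id","Title","User_id","review/score","review/summary"])
# Rename the raw string IDs; they are mapped to numeric IDs later
df_ratings = df_ratings.withColumnRenamed("Id", "book_string")
df_ratings = df_ratings.withColumnRenamed("User_id", "User_string")
# Keep only scores in the range 1 to 5
df_ratings = df_ratings.filter(df_ratings["review/score"] <= 5)
df_ratings = df_ratings.filter(df_ratings["review/score"] >= 1)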


Content-based filtering => clustering issue



In [5]:
df_books.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|               Title|         description|             authors|               image|         previewLink|           publisher|       publishedDate|            infoLink|          categories|        ratingsCount|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Its Only Art If I...|                null|    ['Julie Strain']|http://books.goog...|http://books.goog...|                null|                1996|http://books.goog...|['Comics & Graphi...|                null|
|Dr. Seuss: Americ...|"Philip Nel takes...| like that of Lew...| has changed lang...| giving us new wo...| inspiring artist...|      ['Philip Nel']|http
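
In the second row above, the description text appears to spill into the `authors`, `image`, and later columns, and the author list ends up under `publishedDate`. One possible cause is that commas or quotes inside the description field are not being interpreted the way the file expects. The sketch below uses Python's standard `csv` module and a made-up line (not taken from the dataset) to show how quote handling changes the number of fields. In Spark, the `quote`, `escape`, and `multiLine` options of the CSV reader are the settings to check; which values this dataset needs is an assumption that would have to be verified.

```python
import csv
import io

# A made-up CSV line with commas inside a quoted description field
line = 'Some Title,"A description, with commas, inside quotes",[\'An Author\']\n'

# Naive splitting treats every comma as a separator
naive = line.strip().split(",")

# A quote-aware parser keeps the quoted field intact
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive))   # number of fields from the naive split
print(len(parsed))  # number of fields from the quote-aware parser
print(parsed[1])    # the description, recovered as a single field
```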

Collaborative filtering => based off ratings of other users

In [6]:
print('There are {} ratings in the dataset'.format(df_ratings.count()))
df_ratings.show()

There are 2420208 ratings in the dataset
+-----------+--------------------+--------------+------------+--------------------+
|book_string|               Title|   User_string|review/score|      review/summary|
+-----------+--------------------+--------------+------------+--------------------+
| 1882931173|Its Only Art If I...| AVCGYZL8FQQTD|         4.0|Nice collection o...|
| 0826414346|Dr. Seuss: Americ...|A30TK6U7DNS82R|         5.0|   Really Enjoyed It|
| 0826414346|Dr. Seuss: Americ...|A3UH4UZ4RSVO82|         5.0|Essential for eve...|
| 0826414346|Dr. Seuss: Americ...|A2MVUWT453QH61|         4.0|Phlip Nel gives s...|
| 0826414346|Dr. Seuss: Americ...|A22X4XUPKF66MR|         4.0|Good academic ove...|
| 0826414346|Dr. Seuss: Americ...|A2F6NONFUDB6UK|         4.0|One of America's ...|
| 0826414346|Dr. Seuss: Americ...|A14OJS0VWMOSWO|         5.0|A memorably excel...|
| 0826414346|Dr. Seuss: Americ...|A2RSSXTDZDUSH4|         5.0|Academia At It's ...|
| 0826414346|Dr. Seuss: Americ...|A

In [7]:
df_ratings.describe().show()

+-------+--------------------+--------------------+--------------------+-----------------+--------------------+
|summary|         book_string|               Title|         User_string|     review/score|      review/summary|
+-------+--------------------+--------------------+--------------------+-----------------+--------------------+
|  count|             2420208|             2420208|             2420208|          2420208|             2420208|
|   mean| 1.072548440143609E9|  2029.0781365666878|                null|4.227116429662244|            Infinity|
| stddev|1.2973025907814867E9|  1738.2674242229316|                null|1.179690753614754|                 NaN|
|    min|          0001047604|""" We'll Always ...|A00109803PZJ91RLT...|              1.0|                   !|
|    max|          B0064P287I|xBase Programming...|       AZZZZW74AAX75|              5.0|~~~~~~~~~~~~~~~~~...|
+-------+--------------------+--------------------+--------------------+-----------------+--------------

In [5]:
from pyspark.ml.feature import StringIndexer

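# Create a StringIndexer that maps the book ID, user ID, and score columns to numeric indices.
# Note: by default StringIndexer orders labels by descending frequency, so "rating"
# holds a frequency rank (0.0 = most common score), not the original review/score value.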
indexer = StringIndexer(inputCols=["book_string","User_string","review/score"], outputCols=["book_id","User_id","rating"])

# Fit the StringIndexer to the DataFrame
df_ratings = indexer.fit(df_ratings).transform(df_ratings).drop("book_string","User_string","review/score")

# Show the result
df_ratings.show()

+--------------------+--------------------+--------+--------+------+
|               Title|      review/summary| book_id| User_id|rating|
+--------------------+--------------------+--------+--------+------+
|Its Only Art If I...|Nice collection o...|180201.0|167334.0|   1.0|
|Dr. Seuss: Americ...|   Really Enjoyed It| 40116.0|    64.0|   0.0|
|Dr. Seuss: Americ...|Essential for eve...| 40116.0|105599.0|   0.0|
|Dr. Seuss: Americ...|Phlip Nel gives s...| 40116.0|  4472.0|   1.0|
|Dr. Seuss: Americ...|Good academic ove...| 40116.0| 31627.0|   1.0|
|Dr. Seuss: Americ...|One of America's ...| 40116.0|  3581.0|   1.0|
|Dr. Seuss: Americ...|A memorably excel...| 40116.0|     0.0|   0.0|
|Dr. Seuss: Americ...|Academia At It's ...| 40116.0|637113.0|   0.0|
|Dr. Seuss: Americ...|And to think that...| 40116.0|130558.0|   0.0|
|Dr. Seuss: Americ...|Fascinating accou...| 40116.0|837115.0|   1.0|
|Wonderful Worship...|Outstanding Resou...| 76542.0|999164.0|   0.0|
|Wonderful Worship...|Small Church
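
Comparing this table with the earlier one, a `review/score` of 5.0 became a `rating` of 0.0 and a score of 4.0 became 1.0. This matches StringIndexer's default behaviour of assigning indices by descending label frequency. The following plain-Python sketch mimics that ordering on made-up scores; the function name and the tie-breaking rule are assumptions of the sketch, not Spark's implementation.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each distinct value to an index, most frequent value first.

    Ties are broken by the value itself (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Made-up scores: 5.0 is the most common, then 4.0, then 1.0
scores = [5.0, 5.0, 5.0, 4.0, 4.0, 1.0]
mapping = frequency_desc_index(scores)
print(mapping)  # {5.0: 0.0, 4.0: 1.0, 1.0: 2.0}
```

If the original 1 to 5 scale is wanted as the ALS target, one option would be to leave `review/score` out of the indexer and cast it to a numeric type instead.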

Splitting dataset into test and training sets

In [1]:
# Inspect rows with a given rating value.
# Note: the column is named "rating" and holds StringIndexer indices, not raw scores.
df_ratings.filter(df_ratings["rating"] == 5.0).show()

In [6]:
ratings_training_set, ratings_test_set = df_ratings.randomSplit([0.8, 0.2], seed=1234)

### Creating the model

In [7]:
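from pyspark.ml.recommendation import ALS  # alternating least squares
from pyspark.ml.evaluation import RegressionEvaluator

# coldStartStrategy="drop" removes rows whose prediction is NaN,
# i.e. users or books that were not seen during training
recommender = ALS(userCol="User_id", itemCol="book_id", ratingCol="rating", coldStartStrategy="drop")
# fit() returns the trained model; the variable name is reused for it
recommender = recommender.fit(ratings_training_set)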
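
As a rough illustration of what the fitted model does, ALS learns a latent factor vector for each user and each book, and a prediction is the dot product of the two. The sketch below uses hypothetical rank-2 factors and plain Python rather than Spark. It also mimics `coldStartStrategy="drop"` by discarding predictions for IDs that have no factors.

```python
import math

# Hypothetical latent factors (rank 2) for users and books seen during training
user_factors = {642: [0.9, 0.1], 1025: [0.2, 0.8]}
item_factors = {26: [0.5, 0.4], 473: [1.0, 0.3]}

def predict(user_id, book_id):
    """Dot product of user and item factors; NaN if either ID was unseen."""
    u = user_factors.get(user_id)
    v = item_factors.get(book_id)
    if u is None or v is None:
        return float("nan")
    return sum(a * b for a, b in zip(u, v))

pairs = [(642, 26), (1025, 473), (999, 26)]  # user 999 has no factors
predictions = [(u, b, predict(u, b)) for u, b in pairs]

# Mimic coldStartStrategy="drop": discard rows with NaN predictions
kept = [p for p in predictions if not math.isnan(p[2])]
print(kept)
```

Without dropping these rows, a single NaN prediction would make the RMSE itself NaN.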

Predicting with the test set

In [8]:
predictions = recommender.transform(ratings_test_set)

In [9]:
predictions.show(10)

+---------+--------------------+-------+-------+------+------------+
|    Title|      review/summary|book_id|User_id|rating|  prediction|
+---------+--------------------+-------+-------+------+------------+
|The Giver|Nothing less than...|   26.0|  642.0|   0.0|  0.19744895|
|The Giver|Just As Haunting ...|   26.0| 1025.0|   0.0|   0.6680782|
|The Giver|Unabrid Audio 4 c...|   26.0| 1307.0|   0.0|  0.45030588|
|The Giver|  Recovering Skeptic|   26.0| 1404.0|   0.0|-0.057295006|
|The Giver|What a book!! Les...|   26.0| 1483.0|   0.0|  0.12075398|
|The Giver| Inspirational Read!|   26.0| 1873.0|   0.0|  0.43365362|
|The Giver|A Fable About The...|   26.0| 3691.0|   0.0|  0.36760265|
|The Giver|  A Perfect Society?|   26.0| 5417.0|   0.0|  0.19119999|
|The Giver|    Utopian distopia|   26.0|12315.0|   1.0|   0.7907043|
|The Giver|       Great Book!!!|   26.0|26273.0|   0.0|  0.28892416|
+---------+--------------------+-------+-------+------+------------+
only showing top 10 rows



### Evaluating the model

Compute the root-mean-square error (RMSE) using RegressionEvaluator:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{pred}_i - \mathrm{actual}_i\right)^2}$$

In [10]:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 0.8836118651317195
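
For reference, a minimal plain-Python sketch of the RMSE computation, using made-up values rather than the model's predictions. The function name `rmse` is illustrative and not part of PySpark.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error of two equal-length, non-empty sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    # Mean of the squared errors, then the square root
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values, not taken from the model above
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Because the `rating` column was produced by StringIndexer, the error reported above is measured on the index scale rather than on the 1 to 5 score scale.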


Making Recommendations

In [17]:
ratings_test_set.filter(ratings_test_set["User_id"] == 642.0).show()

+--------------------+--------------------+--------+-------+------+
|               Title|      review/summary| book_id|User_id|rating|
+--------------------+--------------------+--------+-------+------+
|Witness to Myself...|Terrific modern n...| 33535.0|  642.0|   0.0|
|         Red Prophet|Slow going, but s...|  8659.0|  642.0|   2.0|
|   Carry On, Jeeves!|Classic Wodehousi...|  7566.0|  642.0|   0.0|
|     Nightmare House|A great audio of ...|  8136.0|  642.0|   0.0|
|            Vendetta|Complex and inter...| 87914.0|  642.0|   0.0|
|           The Giver|Nothing less than...|    26.0|  642.0|   0.0|
|    A Stir of Echoes|One of his best; ...|  6264.0|  642.0|   0.0|
|Donovan's Brain (...|      Landmark Novel|104538.0|  642.0|   1.0|
| Beyond the Outposts|A great book in a...| 91328.0|  642.0|   0.0|
|Claudius the god:...|The Sopranos of A...|  8208.0|  642.0|   0.0|
|    Come Out Tonight|One of Laymon's best|213273.0|  642.0|   0.0|
|Stranger in a Str...|Were my expectati...|   47

In [23]:
test_user = ratings_test_set.filter(ratings_test_set["User_id"] == 642.0).select("book_id","User_id","Title","review/summary")

In [24]:
test_user.show()

+--------+-------+--------------------+--------------------+
| book_id|User_id|               Title|      review/summary|
+--------+-------+--------------------+--------------------+
| 33535.0|  642.0|Witness to Myself...|Terrific modern n...|
|  8659.0|  642.0|         Red Prophet|Slow going, but s...|
|  7566.0|  642.0|   Carry On, Jeeves!|Classic Wodehousi...|
|  8136.0|  642.0|     Nightmare House|A great audio of ...|
| 87914.0|  642.0|            Vendetta|Complex and inter...|
|    26.0|  642.0|           The Giver|Nothing less than...|
|  6264.0|  642.0|    A Stir of Echoes|One of his best; ...|
|104538.0|  642.0|Donovan's Brain (...|      Landmark Novel|
| 91328.0|  642.0| Beyond the Outposts|A great book in a...|
|  8208.0|  642.0|Claudius the god:...|The Sopranos of A...|
|213273.0|  642.0|    Come Out Tonight|One of Laymon's best|
|   473.0|  642.0|Stranger in a Str...|Were my expectati...|
| 10481.0|  642.0|The Big Rock Cand...|Terrific autobiog...|
| 98585.0|  642.0|Last W

In [25]:
recommendations = recommender.transform(test_user)

In [26]:
recommendations.orderBy(desc("prediction")).show()

+--------+-------+--------------------+--------------------+------------+
| book_id|User_id|               Title|      review/summary|  prediction|
+--------+-------+--------------------+--------------------+------------+
|   509.0|  642.0|Stanger in a Stra...|Were my expectati...|   1.5090193|
|   473.0|  642.0|Stranger in a Str...|Were my expectati...|    1.481797|
| 33535.0|  642.0|Witness to Myself...|Terrific modern n...|   1.2461519|
|  8659.0|  642.0|         Red Prophet|Slow going, but s...|   0.9878093|
|  7566.0|  642.0|   Carry On, Jeeves!|Classic Wodehousi...|  0.67062634|
| 10481.0|  642.0|The Big Rock Cand...|Terrific autobiog...|   0.5467957|
| 10579.0|  642.0|The Big Rock Cand...|Terrific autobiog...|  0.52053463|
|  1803.0|  642.0|  Tarzan of the Apes|&quot;A Ripping G...|  0.35655648|
|104538.0|  642.0|Donovan's Brain (...|      Landmark Novel|  0.35217682|
|    26.0|  642.0|           The Giver|Nothing less than...|  0.19744895|
|  5672.0|  642.0|The New Shorter O...
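
The table above scores books that user 642 already reviewed in the test set, so it ranks known books rather than suggesting new ones. A recommendation step would usually score candidate books, exclude those the user has already rated, and keep the top N. The plain-Python sketch below shows that filtering with hypothetical scores. In PySpark, `ALSModel.recommendForUserSubset` is the built-in way to produce top-N items for selected users.

```python
def top_n_unseen(predicted_scores, already_rated, n=3):
    """Return the n highest-scoring book IDs the user has not rated yet."""
    candidates = [
        (book, score)
        for book, score in predicted_scores.items()
        if book not in already_rated
    ]
    # Highest predicted score first
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [book for book, _ in candidates[:n]]

# Hypothetical predicted scores for one user, keyed by book_id
predicted_scores = {26: 0.20, 473: 1.48, 1803: 0.36, 5000: 0.90, 6000: 1.10}
already_rated = {26, 473}

print(top_n_unseen(predicted_scores, already_rated, n=2))  # [6000, 5000]
```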