### Problem

One of the biggest challenges for readers is finding the right book to read. That is why we made BookForYou, a recommender system that suggests books based on the user's stated preferences for author, title, and book category. It uses book reviews from Amazon's book database to find the best candidate.

### Identification of required data

The dataset used is [Amazon Book Reviews](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?select=books_data.csv).

The dataset consists of two entities, one with book details and the second containing book reviews. Each entity has 10 features, for a combined dataset size of 3.04 GB. As shown below, one book can have many reviews, but a review can only belong to a single book. Books are identified by their titles. From the book details, the title, author, year and category will be used. From the reviews entity, the content of the reviews, book rating, and the helpfulness rating of a given review will be used.

For reviews the following features will be used:

* Id (the ID of the book)
* Title (the book title)
* User_id (the ID of the user who rated the book)
* review/score (the book's rating, from 0 to 5)

### Data Preprocessing

The following imports will be used for data preprocessing.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

Creating SparkSession

In [2]:
def init_spark():
    spark = SparkSession \
        .builder \
        .appName("BookForYou Recommender System") \
        .config("spark.driver.memory", "8g") \
        .config("spark.executor.memory", "8g") \
        .getOrCreate()
    return spark

### Importing the dataset

In [3]:
spark = init_spark()
df_ratings = spark.read.csv("data\\preprocessed\\reviews.csv", inferSchema=True, header=True)

Missing Data
Data sources often contain gaps, which leaves three main options for handling them:

1. Keep the missing data points.
2. Drop the missing data points (including the entire row).
3. Fill them in with some other value.

This notebook takes the second option: cell In [4] drops every row with a missing value in any of the selected columns.
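
The three options above can be sketched in plain Python on a handful of rows. This is only an illustration: the rows, the `required` list, and the fill values (`"unknown"` and the mean score) are made up for the example and are not part of the notebook.

```python
# Hypothetical sample rows; None marks a missing value.
rows = [
    {"Id": "book-1", "User_id": "user-1", "score": 5.0},
    {"Id": "book-1", "User_id": None, "score": 4.0},
    {"Id": "book-2", "User_id": "user-2", "score": None},
]
required = ["Id", "User_id", "score"]

# Option 1: keep every row, gaps included.
kept = list(rows)

# Option 2: drop any row that has a gap in a required column.
dropped = [r for r in rows if all(r[c] is not None for c in required)]

# Option 3: fill gaps with a substitute value
# (here a placeholder user and the mean of the known scores).
known_scores = [r["score"] for r in rows if r["score"] is not None]
mean_score = sum(known_scores) / len(known_scores)
filled = [
    {
        "Id": r["Id"],
        "User_id": r["User_id"] if r["User_id"] is not None else "unknown",
        "score": r["score"] if r["score"] is not None else mean_score,
    }
    for r in rows
]

print(len(kept), len(dropped), len(filled))  # 3 1 3
```

Dropping is the simplest choice but it shrinks the dataset; filling keeps every row at the cost of introducing values that were never observed.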

In [4]:
df_ratings = df_ratings.select("Id", "Title", "User_id", "review/score", "review/summary")
df_ratings = df_ratings.na.drop(subset=["Id","Title","User_id","review/score","review/summary"])
df_ratings = df_ratings.withColumnRenamed("Id", "book_string")
df_ratings = df_ratings.withColumnRenamed("User_id", "User_string")
df_ratings = df_ratings.filter(df_ratings["review/score"] <= 5)
df_ratings = df_ratings.filter(df_ratings["review/score"] >= 1)
#df.describe().show()

**Content-based recommender system**

In [5]:
df_ratings.show()

+-----------+--------------------+--------------+------------+--------------------+
|book_string|               Title|   User_string|review/score|      review/summary|
+-----------+--------------------+--------------+------------+--------------------+
| 1882931173|Its Only Art If I...| AVCGYZL8FQQTD|         4.0|Nice collection o...|
| 0826414346|Dr. Seuss: Americ...|A30TK6U7DNS82R|         5.0|   Really Enjoyed It|
| 0826414346|Dr. Seuss: Americ...|A3UH4UZ4RSVO82|         5.0|Essential for eve...|
| 0826414346|Dr. Seuss: Americ...|A2MVUWT453QH61|         4.0|Phlip Nel gives s...|
| 0826414346|Dr. Seuss: Americ...|A22X4XUPKF66MR|         4.0|Good academic ove...|
| 0826414346|Dr. Seuss: Americ...|A2F6NONFUDB6UK|         4.0|One of America's ...|
| 0826414346|Dr. Seuss: Americ...|A14OJS0VWMOSWO|         5.0|A memorably excel...|
| 0826414346|Dr. Seuss: Americ...|A2RSSXTDZDUSH4|         5.0|Academia At It's ...|
| 0826414346|Dr. Seuss: Americ...|A25MD5I2GUIW6W|         5.0|And to think t

**#Map strings to numerical values**

In [6]:
from pyspark.ml.feature import StringIndexer

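# Create a StringIndexer that maps the book, user, score and title columns to numeric indices.
# Indices are assigned by descending frequency (the default), so "rating" holds the
# frequency rank of each score (the most common score becomes 0.0), not the score itself.
indexer = StringIndexer(inputCols=["book_string","User_string","review/score","Title"], outputCols=["book_id","User_id","rating","Book_title"])

# Fit the StringIndexer to the DataFrame, then drop the original columns
df_ratings = indexer.fit(df_ratings).transform(df_ratings).drop("book_string","User_string", "review/score","Title", "review/summary")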


In [7]:
# Drop the indexed title and show the result
df_ratings = df_ratings.drop("Book_title")
df_ratings.show()


+--------+--------+------+
| book_id| User_id|rating|
+--------+--------+------+
|180201.0|167334.0|   1.0|
| 40116.0|    64.0|   0.0|
| 40116.0|105599.0|   0.0|
| 40116.0|  4472.0|   1.0|
| 40116.0| 31627.0|   1.0|
| 40116.0|  3581.0|   1.0|
| 40116.0|     0.0|   0.0|
| 40116.0|637113.0|   0.0|
| 40116.0|130558.0|   0.0|
| 40116.0|837115.0|   1.0|
| 76542.0|999164.0|   0.0|
| 76542.0|  6737.0|   0.0|
| 76542.0|905770.0|   0.0|
| 76542.0|271312.0|   0.0|
| 11449.0|272659.0|   3.0|
| 11449.0|385682.0|   1.0|
| 11449.0|307548.0|   3.0|
| 11449.0| 38094.0|   0.0|
| 11449.0|885875.0|   0.0|
| 11449.0|473482.0|   0.0|
+--------+--------+------+
only showing top 20 rows
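
To make the `rating` column above easier to read, here is a minimal plain-Python sketch of frequency-ordered indexing, which is what `StringIndexer` does by default: values are cast to strings and the most frequent value receives index 0.0. The helper `frequency_index`, the sample scores, and the alphabetical tie-break are assumptions of this sketch.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct value to a float index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken by the value itself (sketch assumption).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical scores in which 5.0 is the most common and 4.0 the second most common.
scores = ["5.0", "5.0", "5.0", "4.0", "4.0", "3.0"]
mapping = frequency_index(scores)
print(mapping)  # {'5.0': 0.0, '4.0': 1.0, '3.0': 2.0}
```

This matches the pattern visible in the two tables: reviews scored 5.0 show up with `rating` 0.0 and reviews scored 4.0 with `rating` 1.0, so the indexed `rating` should be read as a popularity rank of the score rather than as the score.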



**#Combining the features using VectorAssembler**

In [8]:
# Transform the dataset
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=df_ratings.columns, outputCol='features')
df_ratings = assembler.transform(df_ratings)
df_ratings.select(df_ratings["features"]).show()

+--------------------+
|            features|
+--------------------+
|[180201.0,167334....|
|  [40116.0,64.0,0.0]|
|[40116.0,105599.0...|
|[40116.0,4472.0,1.0]|
|[40116.0,31627.0,...|
|[40116.0,3581.0,1.0]|
|   [40116.0,0.0,0.0]|
|[40116.0,637113.0...|
|[40116.0,130558.0...|
|[40116.0,837115.0...|
|[76542.0,999164.0...|
|[76542.0,6737.0,0.0]|
|[76542.0,905770.0...|
|[76542.0,271312.0...|
|[11449.0,272659.0...|
|[11449.0,385682.0...|
|[11449.0,307548.0...|
|[11449.0,38094.0,...|
|[11449.0,885875.0...|
|[11449.0,473482.0...|
+--------------------+
only showing top 20 rows



**#Normalizing using StandardScaler**

In [9]:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol='features', outputCol='scaled_features')
df_ratings = scaler.fit(df_ratings).transform(df_ratings)

In [10]:
df_ratings.select("scaled_features").show()

+--------------------+
|     scaled_features|
+--------------------+
|[4.48501061902150...|
|[0.99844443700460...|
|[0.99844443700460...|
|[0.99844443700460...|
|[0.99844443700460...|
|[0.99844443700460...|
|[0.99844443700460...|
|[0.99844443700460...|
|[0.99844443700460...|
|[0.99844443700460...|
|[1.90504871116777...|
|[1.90504871116777...|
|[1.90504871116777...|
|[1.90504871116777...|
|[0.28495339413864...|
|[0.28495339413864...|
|[0.28495339413864...|
|[0.28495339413864...|
|[0.28495339413864...|
|[0.28495339413864...|
+--------------------+
only showing top 20 rows
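
As a rough illustration of what the scaler does with its default settings (scale to unit standard deviation, no mean centering), the sketch below divides each column by its sample standard deviation. The helper `scale_by_std`, the sample rows, and the handling of a constant column (mapped to 0.0) are assumptions of this sketch.

```python
from statistics import stdev

def scale_by_std(rows):
    """Divide each column by its sample standard deviation, without centering."""
    columns = list(zip(*rows))
    stds = [stdev(col) for col in columns]
    return [
        # A constant column has zero deviation; this sketch maps it to 0.0.
        [value / std if std != 0 else 0.0 for value, std in zip(row, stds)]
        for row in rows
    ]

# Hypothetical rows: the first column varies, the second is constant.
rows = [[2.0, 10.0], [4.0, 10.0], [6.0, 10.0]]
scaled = scale_by_std(rows)
print(scaled)  # [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
```

Because nothing is subtracted, equal inputs stay equal and larger IDs stay larger after scaling, which is consistent with the repeated leading values in the output above for rows that share a `book_id`.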



**#Creating the model**

In [11]:
# Creating the model
from pyspark.ml.clustering import KMeans
kmeans = KMeans(featuresCol='scaled_features', k=4)
# Note: `kmeans` is rebound here from the estimator to the fitted KMeansModel
kmeans = kmeans.fit(df_ratings)


In [12]:
# Getting the predictions
# Note: `kmeans` is rebound again and now holds the DataFrame with a "prediction" column
kmeans = kmeans.transform(df_ratings)

In [13]:
kmeans.show(20)

+--------+--------+------+--------------------+--------------------+----------+
| book_id| User_id|rating|            features|     scaled_features|prediction|
+--------+--------+------+--------------------+--------------------+----------+
|180201.0|167334.0|   1.0|[180201.0,167334....|[4.48501061902150...|         2|
| 40116.0|    64.0|   0.0|  [40116.0,64.0,0.0]|[0.99844443700460...|         1|
| 40116.0|105599.0|   0.0|[40116.0,105599.0...|[0.99844443700460...|         1|
| 40116.0|  4472.0|   1.0|[40116.0,4472.0,1.0]|[0.99844443700460...|         1|
| 40116.0| 31627.0|   1.0|[40116.0,31627.0,...|[0.99844443700460...|         1|
| 40116.0|  3581.0|   1.0|[40116.0,3581.0,1.0]|[0.99844443700460...|         1|
| 40116.0|     0.0|   0.0|   [40116.0,0.0,0.0]|[0.99844443700460...|         1|
| 40116.0|637113.0|   0.0|[40116.0,637113.0...|[0.99844443700460...|         0|
| 40116.0|130558.0|   0.0|[40116.0,130558.0...|[0.99844443700460...|         1|
| 40116.0|837115.0|   1.0|[40116.0,83711
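
The `prediction` column holds the index of the nearest cluster centre. A minimal plain-Python sketch of the assign-and-update loop behind k-means is shown below. It is a textbook version with hand-picked starting centroids, not Spark's implementation (which chooses its own initial centres); all function names and sample points are assumptions of this sketch.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(old)  # keep the centroid of an empty cluster
            continue
        new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
    return new_centroids

def kmeans_sketch(points, centroids, iterations=10):
    """Alternate assignment and update for a fixed number of iterations."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        centroids = update(points, labels, centroids)
    return assign(points, centroids), centroids

# Hypothetical 2-D points forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans_sketch(points, centroids=[[0.0, 0.0], [10.0, 10.0]])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [[0.0, 0.5], [10.0, 10.5]]
```

The cluster indices carry no meaning of their own: starting from the centroids in the opposite order would yield the same grouping with labels 1 and 0 swapped.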

**# Evaluate the Model**

**1: Silhouette Score**

In [14]:
from pyspark.ml.evaluation import ClusteringEvaluator

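# Evaluate the clustering using the Silhouette score.
# `kmeans` is the predictions DataFrame here. ClusteringEvaluator reads the
# "features" column by default (the unscaled vectors); pass
# featuresCol="scaled_features" to score the space the model was fit on.
evaluator = ClusteringEvaluator()
silhouette_score = evaluator.evaluate(kmeans)
print("Silhouette Score: ", silhouette_score)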


Silhouette Score:  0.35501550777553265
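
For intuition about the number above, here is a small plain-Python sketch of the mean silhouette coefficient with squared Euclidean distance. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the lowest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`. It follows the textbook definition and assumes at least two clusters; the function names and sample data are assumptions of this sketch.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # a singleton cluster scores 0 by convention
            continue
        # a: mean distance to the other members (the distance to itself is 0)
        a = sum(squared_distance(p, q) for q in own) / (len(own) - 1)
        # b: lowest mean distance to the members of any other cluster
        b = min(
            sum(squared_distance(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical 1-D points with a good and a poor labelling.
points = [[0.0], [1.0], [10.0], [11.0]]
good = mean_silhouette(points, [0, 0, 1, 1])
poor = mean_silhouette(points, [0, 1, 0, 1])
print(good > poor)  # True
```

Scores close to 1 indicate compact, well-separated clusters, scores around 0 indicate overlapping clusters, and negative scores suggest points sit closer to another cluster than to their own.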


**2: Accuracy**

In [16]:
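# Calculate "accuracy" as the share of rows whose cluster ID equals the indexed rating.
# Caution: KMeans cluster IDs are arbitrary, so this agreement is not a true accuracy.
accuracy = kmeans.filter(kmeans["rating"] == kmeans["prediction"]).count() / kmeans.count()
print("Accuracy: ", accuracy)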


Accuracy:  0.34797215776495244
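
Cluster IDs produced by KMeans are arbitrary, so comparing them directly with a label column depends on how the clusters happen to be numbered. One common workaround, shown here only as a hedged sketch and not as the method used in this notebook, is to map each cluster to its most frequent label before counting matches. The helper name and sample data are hypothetical.

```python
from collections import Counter, defaultdict

def majority_label_accuracy(labels, clusters):
    """Map each cluster to its most frequent label, then compute accuracy."""
    by_cluster = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        by_cluster[cluster][label] += 1
    cluster_to_label = {
        cluster: counts.most_common(1)[0][0]
        for cluster, counts in by_cluster.items()
    }
    correct = sum(cluster_to_label[c] == l for l, c in zip(labels, clusters))
    return correct / len(labels)

# Hypothetical labels and cluster IDs: the grouping is good, the numbering is not.
labels = [0, 0, 0, 1, 1, 2]
clusters = [3, 3, 3, 0, 0, 0]

raw = sum(l == c for l, c in zip(labels, clusters)) / len(labels)
mapped = majority_label_accuracy(labels, clusters)
print(raw, mapped)
```

The mapped figure is optimistic by construction, because the mapping is chosen on the same data it is scored on, so it is best read as an upper bound on how well the clusters line up with the labels.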


**3: F1-Measure**

In [27]:
# Calculate the F1 measure
# KMeans is unsupervised, so F1 is only defined against reference labels.
# Here the indexed "rating" column stands in for the labels and the cluster ID
# for the prediction; cluster IDs are arbitrary, so read the score with care.

# Import the evaluator used to calculate the F1 measure
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Cast the cluster IDs to double for the evaluator
f1 = kmeans.withColumn("prediction", kmeans["prediction"].cast("double"))

# Instantiate the MulticlassClassificationEvaluator with 'f1' as the metric
f1_evaluator = MulticlassClassificationEvaluator(labelCol="rating", predictionCol="prediction", metricName="f1")

# Calculate the F1 measure
f1_score = f1_evaluator.evaluate(f1)

# Print the F1 measure
print("F1 Score: ", f1_score)

F1 Score:  0.3284099367863862
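
The `f1` metric of `MulticlassClassificationEvaluator` is a weighted F1: a per-class F1 averaged with each class weighted by its share of the labels. The plain-Python sketch below shows that calculation on a few made-up values; the function name and data are assumptions of this sketch.

```python
from collections import Counter

def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights proportional to class frequency."""
    total = len(labels)
    support = Counter(labels)
    pairs = list(zip(labels, predictions))
    score = 0.0
    for cls, count in support.items():
        tp = sum(1 for l, p in pairs if l == cls and p == cls)
        fp = sum(1 for l, p in pairs if l != cls and p == cls)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += f1 * count / total
    return score

# Hypothetical labels and predictions.
labels = [0.0, 0.0, 1.0, 1.0]
predictions = [0.0, 1.0, 1.0, 1.0]
result = weighted_f1(labels, predictions)
print(round(result, 4))  # 0.7333
```

Here class 0.0 has F1 = 2/3 and class 1.0 has F1 = 0.8; both classes make up half of the labels, so the weighted result is their plain average.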


**4: Recall**

In [28]:
# Calculate the recall for indexed rating 1, treating cluster 1 as the matching prediction
# Recall is defined as the ratio of true positives to the sum of true positives and false negatives

# Count the number of true positives
true_positives = kmeans.where((kmeans.rating == 1) & (kmeans.prediction == 1)).count()

# Count the number of false negatives
# Note: only cluster 0 is counted; with k=4, rows assigned to clusters 2 and 3 are left out.
# Using (kmeans.prediction != 1) instead would count every miss.
false_negatives = kmeans.where((kmeans.rating == 1) & (kmeans.prediction == 0)).count()

# Calculate recall
recall = true_positives / (true_positives + false_negatives)

# Print the recall
print("Recall: ", recall)

Recall:  0.800670969992221
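
The recall cell above counts only rows in cluster 0 as false negatives, while the model has four clusters. The sketch below contrasts that restricted count with a one-vs-rest recall in which every prediction other than the positive class is a miss. The function name and sample data are hypothetical, and the figures are not results from the dataset.

```python
def recall_for_class(labels, predictions, positive):
    """One-vs-rest recall: every non-positive prediction on a positive row is a miss."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for l, p in pairs if l == positive and p == positive)
    fn = sum(1 for l, p in pairs if l == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0

# Hypothetical labels and cluster IDs; one positive row lands in cluster 2.
labels = [1, 1, 1, 1, 0]
predictions = [1, 1, 0, 2, 0]

full = recall_for_class(labels, predictions, positive=1)

# Restricted variant: only cluster 0 counts as a false negative.
tp = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 1)
fn_cluster0 = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 0)
restricted = tp / (tp + fn_cluster0)

print(full, round(restricted, 4))  # 0.5 0.6667
```

Leaving out the rows assigned to other clusters shrinks the denominator, so the restricted figure can only be equal to or higher than the one-vs-rest recall.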
