### Problem

One of the biggest challenges for readers is finding the right book to read next, which is why we built BookForYou. BookForYou is a recommender system that suggests books based on the user's stated preferences for author, title, and book category. It uses book reviews from Amazon's book database to find the best candidate.

### Identification of required data

The dataset used is [Amazon Book Reviews](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?select=books_data.csv).

The dataset consists of two entities, one with book details and the second containing book reviews. Each entity has 10 features, for a combined dataset size of 3.04 GB. As shown below, one book can have many reviews, but a review can only belong to a single book. Books are identified by their titles. From the book details, the title, author, year and category will be used. From the reviews entity, the content of the reviews, book rating, and the helpfulness rating of a given review will be used.

![Entities picture](images/entities.png)
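
The one-to-many relationship described above can be sketched with a few in-memory records. This is a minimal illustration rather than the project's code: the sample rows, the key names `title`, `author`, and `score`, and the helper `group_reviews_by_title` are all hypothetical, and plain Python dictionaries stand in for the two CSV entities.

```python
from collections import defaultdict

# Hypothetical sample rows; the real entities are CSV files with more features.
books = [
    {"title": "Book A", "author": "Author One"},
    {"title": "Book B", "author": "Author Two"},
]
reviews = [
    {"title": "Book A", "score": 4.0},
    {"title": "Book A", "score": 5.0},
    {"title": "Book B", "score": 3.0},
]


def group_reviews_by_title(reviews):
    """Map each book title to the list of its reviews (one book, many reviews)."""
    grouped = defaultdict(list)
    for review in reviews:
        grouped[review["title"]].append(review)
    return dict(grouped)


reviews_by_title = group_reviews_by_title(reviews)
for book in books:
    # Each review is stored under exactly one title, the title of its book.
    book_reviews = reviews_by_title.get(book["title"], [])
    print(book["title"], len(book_reviews))
```

Because the title is the only link between the two entities, any difference in spelling or casing between the files would leave a review without a matching book.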

For Book_details, the following features will be used:
* title
* author
* year
* category

For reviews, the following features will be used:
* Id (the id of the book)
* title (Book Title)
* user_id (Id of the user who rated the book)
* review/score (rating from 0 to 5 for the book)
* review/summary (the summary of the text review)
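
To make the feature selection concrete, here is a small sketch that keeps only the listed review columns and skips rows with an empty value. It uses the standard `csv` module on an in-memory sample instead of the real file. The sample rows, the extra `Price` column, and the helper name `select_review_features` are assumptions made for illustration.

```python
import csv
import io

# Column names taken from the feature list above.
REVIEW_COLUMNS = ["Id", "Title", "User_id", "review/score", "review/summary"]

# Hypothetical two-row sample; the "Price" column is invented to show it is dropped.
SAMPLE_CSV = """Id,Title,Price,User_id,review/score,review/summary
0000000001,Example Book,9.99,USER1,4.0,Nice read
0000000002,Another Book,,,5.0,Missing user id
"""


def select_review_features(csv_text):
    """Keep only the listed columns and skip rows with an empty value."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        selected = {name: record[name] for name in REVIEW_COLUMNS}
        # An empty string is falsy, so all() rejects rows with a missing feature.
        if all(selected.values()):
            rows.append(selected)
    return rows


clean_rows = select_review_features(SAMPLE_CSV)
print(clean_rows)
```

The second sample row is discarded because its `User_id` is empty, while the first keeps its five selected features and loses `Price`.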

### Data Preprocessing

The following imports will be used for data preprocessing.

In [9]:
import csv
import os
import sys
from csv import reader

# Spark imports
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

# Dask imports
import dask.bag as db
# Aliased as dd (not df) so it matches the dd.read_csv calls used later
# and is not overwritten by DataFrame variables named df.
import dask.dataframe as dd


Spark initialization:

In [10]:
def init_spark():
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    return spark

### Reviews
Here we reduce the reviews entity to the features listed above and drop every row in which one of them is empty.

In [26]:
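spark = init_spark()
# Read the reviews CSV; without a schema, every column is loaded as a string.
reviews_df = spark.read.csv("data/preprocessed/reviews.csv", header=True)
# Keep only the features listed above.
columns = ["Id", "Title", "User_id", "review/score", "review/summary"]
reviews_df = reviews_df.select(*columns)
# Drop rows with an empty value in any selected column. Null values are
# dropped as well, because a comparison with null never evaluates to true.
for column in columns:
    reviews_df = reviews_df.filter(reviews_df[column] != '')
reviews_df.show()
print(reviews_df.count())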


+----------+--------------------+--------------+------------+--------------------+
|        Id|               Title|       User_id|review/score|      review/summary|
+----------+--------------------+--------------+------------+--------------------+
|1882931173|Its Only Art If I...| AVCGYZL8FQQTD|         4.0|Nice collection o...|
|0826414346|Dr. Seuss: Americ...|A30TK6U7DNS82R|         5.0|   Really Enjoyed It|
|0826414346|Dr. Seuss: Americ...|A3UH4UZ4RSVO82|         5.0|Essential for eve...|
|0826414346|Dr. Seuss: Americ...|A2MVUWT453QH61|         4.0|Phlip Nel gives s...|
|0826414346|Dr. Seuss: Americ...|A22X4XUPKF66MR|         4.0|Good academic ove...|
|0826414346|Dr. Seuss: Americ...|A2F6NONFUDB6UK|         4.0|One of America's ...|
|0826414346|Dr. Seuss: Americ...|A14OJS0VWMOSWO|         5.0|A memorably excel...|
|0826414346|Dr. Seuss: Americ...|A2RSSXTDZDUSH4|         5.0|Academia At It's ...|
|0826414346|Dr. Seuss: Americ...|A25MD5I2GUIW6W|         5.0|And to think that...|
|082

#### Identification of required data
In the `book_details.csv` file, we identify which data will be useful to our recommender system. 

In [None]:
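import dask.dataframe as dd

# book_details = dd.read_csv('data/preprocessed/book_details.csv')
# blocksize is given in bytes; 1000 would cut a large file into a huge number of tiny partitions.
book_ratings = dd.read_csv('data/preprocessed/reviews.csv', blocksize="64MB")
# book_details.head(10)
# Preview the first rows instead of loading the whole file into memory with compute().
book_ratings.head(10)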
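
The `blocksize` argument of `dd.read_csv` is a number of bytes, and it largely determines how many partitions Dask creates for a file. The rough arithmetic below shows why a very small value is costly. It is an estimate only: the file size is an assumed figure (the actual size of `reviews.csv` is not stated here), `estimate_partitions` is a hypothetical helper, and Dask's real partition count can differ slightly because blocks are aligned to line breaks.

```python
import math


def estimate_partitions(file_size_bytes, blocksize_bytes):
    """Rough partition count for one file: ceil(size / blocksize)."""
    return math.ceil(file_size_bytes / blocksize_bytes)


# Assumed file size, for illustration only.
assumed_size = 2 * 10**9  # 2 GB

small_blocks = estimate_partitions(assumed_size, 1000)        # blocksize=1000 bytes
large_blocks = estimate_partitions(assumed_size, 64 * 10**6)  # blocksize of 64 MB

print(small_blocks, large_blocks)
```

Under these assumptions, a 1000-byte block size yields about two million partitions, while a 64 MB block size yields a few dozen, which keeps the task graph small.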