### Problem

Finding the right book is one of the biggest challenges for readers, which is why we built BookForYou. BookForYou is a recommender system that suggests books based on the user's stated preferences for author, title, and book category. It uses book reviews from Amazon's book database to find the best candidate.

### Identification of required data

The dataset used is [Amazon Book Reviews](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?select=books_data.csv).

The dataset consists of two entities: one with book details and one with book reviews. Each entity has 10 features, for a combined dataset size of 3.04 GB. As shown below, one book can have many reviews, but each review belongs to exactly one book. Books are identified by their titles. From the book details, we use the title, author, year, and category. From the reviews, we use the review text, the book rating, and the helpfulness rating of each review.

![Entities picture](images/entities.png)
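The one-to-many relationship between books and reviews can be sketched in plain Python. This is only an illustration: the sample rows, the `review/score` column name, and the `group_reviews_by_title` helper are assumptions made for the example, not part of the project code.

```python
from collections import defaultdict

# Hypothetical sample rows; the real data lives in the two CSV files.
books = [{"Title": "Book A"}, {"Title": "Book B"}]
reviews = [
    {"Title": "Book A", "review/score": 5.0},
    {"Title": "Book A", "review/score": 3.0},
    {"Title": "Book B", "review/score": 4.0},
]


def group_reviews_by_title(reviews):
    """Group reviews under the title of the single book each one belongs to."""
    grouped = defaultdict(list)
    for review in reviews:
        grouped[review["Title"]].append(review)
    return dict(grouped)


reviews_by_title = group_reviews_by_title(reviews)
for book in books:
    # A book may have many reviews; a title with no reviews yields an empty list.
    print(book["Title"], len(reviews_by_title.get(book["Title"], [])))
```

Because the title is the only key shared by both entities, any inconsistency in titles (casing, whitespace) would split a book's reviews across several groups.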

In the `book_details.csv` file, we will keep only some of the features. The following features will be removed:
-   image
-   previewLink
-   infoLink
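A minimal sketch of dropping these three features, using only the standard library `csv` module. The `drop_columns` helper and the inline sample are hypothetical; the notebook itself selects columns with Spark.

```python
import csv
import io

# Features listed above as not needed by the recommender.
DROPPED = {"image", "previewLink", "infoLink"}


def drop_columns(csv_text, dropped=DROPPED):
    """Parse CSV text and return rows as dicts without the dropped columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {key: value for key, value in row.items() if key not in dropped}
        for row in reader
    ]


# Hypothetical two-line sample with a quoted field that contains a comma.
sample = (
    "Title,image,previewLink,infoLink,authors\n"
    '"Art, Well Hung",http://img,http://prev,http://info,"[\'Julie Strain\']"\n'
)
print(drop_columns(sample))
```

A CSV-aware parser is used here because fields such as titles and descriptions can contain commas inside quotes.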

For null or empty values in the features being kept, we will:
-   Remove the entry if `title` is NULL
-   Use the `reviews_text` feature if `description` is NULL
-   Replace `publishedDate` with the mean of the other values if it is NULL
    -   Only years will be kept
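The rules above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's Spark implementation: `extract_year`, `clean_books`, the `publishedYear` key, and the `review_text_by_title` lookup are hypothetical names, and the mean year is rounded to the nearest integer.

```python
import re
from statistics import mean

# Matches a leading four-digit year, e.g. '1996', '1996-05', '1996-05-01'.
YEAR_RE = re.compile(r"^(\d{4})")


def extract_year(value):
    """Return the leading four-digit year as an int, or None if there is none."""
    if not value:
        return None
    match = YEAR_RE.match(value.strip())
    return int(match.group(1)) if match else None


def clean_books(books, review_text_by_title):
    """Apply the three null-handling rules to a list of book dicts."""
    # Rule 1: drop entries without a title.
    kept = [dict(book) for book in books if book.get("Title")]
    for book in kept:
        # Rule 2: fall back to review text when the description is missing.
        if not book.get("description"):
            book["description"] = review_text_by_title.get(book["Title"])
        # Keep only the year of the published date.
        book["publishedYear"] = extract_year(book.get("publishedDate"))
    # Rule 3: replace missing years with the (rounded) mean of the known years.
    known = [b["publishedYear"] for b in kept if b["publishedYear"] is not None]
    fallback = round(mean(known)) if known else None
    for book in kept:
        if book["publishedYear"] is None:
            book["publishedYear"] = fallback
    return kept
```

Matching only the leading year lets year-only values such as `1996` and full dates such as `1996-05-01` be handled the same way, while values that do not start with a year are treated as missing.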

### Data Preprocessing

#### Identification of required data
In the `book_details.csv` file, we identify which data will be useful to our recommender system.
For this entity, we will keep only the features selected above and handle missing values.

In [1]:
import dask.dataframe as dd

# #book_details = dd.read_csv('data/preprocessed/book_details.csv')
# book_ratings= dd.read_csv('data/preprocessed/reviews.csv', blocksize=1000)
# #book_details.head(10)
# book_ratings.compute()

In [2]:
import csv
import os
import sys
# Spark imports
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
# Dask imports
import dask.bag as db
import dask.dataframe as dd  # you can use Dask bags or dataframes
from csv import reader
from datetime import datetime



In [3]:
# Spark initialization
def init_spark():
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    return spark

In [6]:
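spark = init_spark()
# Note: descriptions contain quoted commas and doubled quotes (""). With the
# default CSV options these can shift columns, as the output below shows.
# Passing quote='"', escape='"', multiLine=True to read.csv may fix this.
details = spark.read.csv("data\\preprocessed\\book_details.csv", header=True)
# Keep only the features the recommender needs
details = details.select("Title", "description", "authors", "publisher", "publishedDate", "categories", "ratingsCount")

# Drop rows without a title, then fill missing text fields with a placeholder
details = details.filter(details["Title"] != '')
details = details.fillna({"authors": "Unknown", "publisher": "Unknown", "publishedDate": "Unknown"})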

# TODO: if description is null, use review text (not implemented yet)

# Check the first 7 rows and report publishedDate values that are not full
# YYYY-MM-DD dates (year-only values such as '1996', or misaligned columns).
# take(7) avoids collecting the whole dataset on the driver.
for row in details.take(7):
    publish_date = row['publishedDate']
    try:
        publish_date = datetime.strptime(publish_date, '%Y-%m-%d').date()
        # TODO: keep only the year, e.g. publish_year = publish_date.year
    except (TypeError, ValueError):
        print(f'row {row}')
        print("Bad dates:", publish_date)

# details = details.filter(details["User_id"] != '')

# details = details.filter(details["review/score"] <= 5)
# details = details.filter(details["review/score"] >= 1)
# no need to filter null values from review/summary
# details.dropna()
# details.show()
# print(details.count())

row Row(Title='Its Only Art If Its Well Hung!', description=None, authors="['Julie Strain']", publisher='Unknown', publishedDate='1996', categories="['Comics & Graphic Novels']", ratingsCount=None)
Bad dates: 1996
row Row(Title='Dr. Seuss: American Icon', description='"Philip Nel takes a fascinating look into the key aspects of Seuss\'s career - his poetry, politics, art, marketing, and place in the popular imagination."" ""Nel argues convincingly that Dr. Seuss is one of the most influential poets in America. His nonsense verse', authors=' like that of Lewis Carroll and Edward Lear', publisher=' inspiring artists like filmmaker Tim Burton and illustrator Lane Smith. --from back cover"', publishedDate="['Philip Nel']", categories='http://books.google.nl/books?id=IjvHQsCn_pgC&printsec=frontcover&dq=Dr.+Seuss:+American+Icon&hl=&cd=1&source=gbs_api', ratingsCount='A&C Black')
Bad dates: ['Philip Nel']
row Row(Title='Wonderful Worship in Smaller Churches', description='This resource includ