### Problem

Finding the right book is one of the biggest challenges for readers, which is why we built BookForYou. BookForYou is a recommender system that suggests books based on the user's stated preferences for author, title, and book category. It draws on book reviews from Amazon's book database to find the best candidate.

### Identification of required data

The dataset used is [Amazon Book Reviews](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?select=books_data.csv).

The dataset consists of two entities, one with book details and the second containing book reviews. Each entity has 10 features, for a combined dataset size of 3.04 GB. As shown below, one book can have many reviews, but a review can only belong to a single book. Books are identified by their titles. From the book details, the title, author, year and category will be used. From the reviews entity, the content of the reviews, book rating, and the helpfulness rating of a given review will be used.

![Entities picture](images/entities.png)
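The one-to-many relationship can be sketched in plain Python. This is only an illustration: the sample rows and the `score` and `text` field names are hypothetical stand-ins, not the dataset's actual column names.

```python
from collections import defaultdict

# Hypothetical sample rows; field names other than Title are illustrative only.
books = [
    {"Title": "Book A", "authors": "['Author One']"},
    {"Title": "Book B", "authors": "['Author Two']"},
]
reviews = [
    {"Title": "Book A", "score": 5.0, "text": "Great read."},
    {"Title": "Book A", "score": 3.0, "text": "It was fine."},
    {"Title": "Book B", "score": 4.0, "text": "Enjoyable."},
]


def group_reviews_by_title(reviews):
    """Map each book title to the list of reviews written for it."""
    grouped = defaultdict(list)
    for review in reviews:
        grouped[review["Title"]].append(review)
    return dict(grouped)


reviews_by_title = group_reviews_by_title(reviews)
for book in books:
    # A book may have many reviews; each review points at exactly one title.
    book_reviews = reviews_by_title.get(book["Title"], [])
    print(book["Title"], len(book_reviews))
```

Because the title is the only key linking the two entities, any title cleaning applied to one entity would need to be applied identically to the other.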

In the `book_details.csv` file, we keep only some of the features. The following features will be removed:
-   image
-   previewLink
-   infoLink
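A minimal sketch of dropping these three columns, using only the standard library and a hypothetical one-row sample in place of the real file (the helper name `drop_columns` is ours, not part of the project):

```python
import csv
import io

# Columns the text above says will be removed.
DROPPED_COLUMNS = {"image", "previewLink", "infoLink"}


def drop_columns(rows, dropped=DROPPED_COLUMNS):
    """Yield a copy of each row without the dropped columns."""
    for row in rows:
        yield {key: value for key, value in row.items() if key not in dropped}


# Hypothetical sample standing in for the real CSV file.
sample = io.StringIO(
    "Title,authors,image,previewLink,infoLink\n"
    "Book A,['Author One'],http://example.com/a.png,"
    "http://example.com/p,http://example.com/i\n"
)
kept_rows = list(drop_columns(csv.DictReader(sample)))
print(kept_rows)
```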

For null or empty values in the features being kept, we will:
-   Remove the entry if `title` is NULL
-   Use the `reviews_text` feature if `description` is NULL
-   Take the mean of other values in `publishedDate` if it is NULL
    -   Only years will be kept
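The rules above could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the `review_text_by_title` lookup, the `year` key, and the helper names are hypothetical, and the year is assumed to be the leading four digits of `publishedDate` (which covers values such as `1996`, `2005-02`, and `2005-01-01`).

```python
import re
from statistics import mean

YEAR_PATTERN = re.compile(r"\d{4}")


def extract_year(published_date):
    """Return the leading four-digit year as an int, or None if absent."""
    if not published_date:
        return None
    match = YEAR_PATTERN.match(published_date.strip())
    return int(match.group()) if match else None


def clean_books(books, review_text_by_title):
    """Apply the three null-handling rules to a list of book dicts."""
    # Rule 1: drop entries with no title.
    kept = [dict(book) for book in books if book.get("Title")]
    for book in kept:
        # Rule 2: fall back to review text when the description is missing.
        if not book.get("description"):
            book["description"] = review_text_by_title.get(book["Title"])
        book["year"] = extract_year(book.get("publishedDate"))
    # Rule 3: replace missing years with the rounded mean of the known years.
    known_years = [book["year"] for book in kept if book["year"] is not None]
    if known_years:
        fallback_year = round(mean(known_years))
        for book in kept:
            if book["year"] is None:
                book["year"] = fallback_year
    return kept


# Hypothetical sample rows.
sample_books = [
    {"Title": "Book A", "description": None, "publishedDate": "1996"},
    {"Title": "Book B", "description": "A biography.", "publishedDate": "2005-01-01"},
    {"Title": "Book C", "description": "A novel.", "publishedDate": "2005-02"},
    {"Title": "Book D", "description": "A study.", "publishedDate": None},
    {"Title": None, "description": "Orphan row.", "publishedDate": "2001"},
]
sample_reviews = {"Book A": "Review text used as a description."}

cleaned = clean_books(sample_books, sample_reviews)
for book in cleaned:
    print(book["Title"], book["year"], book["description"])
```

Rounding the mean keeps the imputed value a whole year; whether a mean, a median, or a sentinel value is the better choice depends on how the year is used downstream.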

### Data Preprocessing

#### Identification of required data
In the `book_details.csv` file, we identify which data will be useful to our recommender system.
For this entity, we keep the `Title`, `description`, `authors`, `publisher`, `publishedDate`, and `categories` columns.

In [1]:
import dask.dataframe as dd

# #book_details = dd.read_csv('data/preprocessed/book_details.csv')
# book_ratings= dd.read_csv('data/preprocessed/reviews.csv', blocksize=1000)
# #book_details.head(10)
# book_ratings.compute()

In [2]:
import csv
import os
import sys
from csv import reader
from datetime import datetime

# Spark imports
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

# Dask imports
import dask.bag as db
# Aliased as dd so the module is not shadowed by the `details` DataFrame defined later
import dask.dataframe as dd  # you can use Dask bags or dataframes



In [3]:
# Spark initialization
def init_spark():
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    return spark

In [7]:

# import pandas as pd

# cols = ["Title", "description", "authors", "publisher", "publishedDate", "categories"]
# df = pd.read_csv("data\\books_details.csv", usecols = cols)

# #df[["ratingsCount"]]=df[["ratingsCount"]].astype(str)

# pandas_df = df.fillna("NULL VALUES")
# pandas_df["Title"] = pandas_df["Title"].astype("str")
# pandas_df["description"] = pandas_df["description"].astype("str")
# pandas_df["authors"] = pandas_df["authors"].astype("str")
# pandas_df["publisher"] = pandas_df["publisher"].astype("str")
# pandas_df["publishedDate"] = pandas_df["publishedDate"].astype("str")
# pandas_df["categories"] = pandas_df["categories"].astype("str")


# # pandas_df.head(40)
# spark.conf.set("spark.sql.execution.arrow.enabled","true")

# # Convert the Pandas DataFrame to a Spark DataFrame
# sdf = spark.createDataFrame(pandas_df)

# # Show the contents of the Spark DataFrame
# sdf.show()


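spark = init_spark()
details = spark.read.csv("data\\output3.csv", header=True)
# Keep only the columns needed by the recommender
details = details.select("Title", "description", "authors", "publisher", "publishedDate", "categories")
#details = details.select( "description")

#details=details.filter(details["description"].contains(","))

# Drop rows with an empty title; rows with a null title are dropped too,
# because a comparison against null does not evaluate to true
details = details.filter(details["Title"] != '')
# Replace missing values in the remaining metadata columns with a placeholder
details = details.fillna({"authors": "Unknown", "publisher": "Unknown", "publishedDate": "Unknown"})

# dropna() is not applied here: its result was never assigned, and rows with a
# null description are kept so that review text can stand in for it later
details.show()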


# if description is null, use review text
# i = 0
# for row in details.collect():
#   publish_date = row['publishedDate']
#   i= i+1
#   try:
#     publish_date = datetime.strptime(publish_date, '%Y-%m-%d').date()

#   except:
#     print(f'row {row}')
#     print("Bad dates:", publish_date)

#   if(i==7):
#     break

  # publish_date = datetime.strptime(publish_date, '%Y-%m-%d').date()
  # publish_year = publish_date.year
  # print(type(publish_date))
 
    # print(details[row])
    # break;
# details = details.filter(details["User_id"] != '')

# details = details.filter(details["review/score"] <= 5)
# details = details.filter(details["review/score"] >= 1)
# no need to filter null values from review/summary
# details.dropna()
# details.show()
# print(details.count())

+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
|               Title|         description|             authors|           publisher|publishedDate|          categories|
+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
|Its Only Art If I...|                null|    ['Julie Strain']|             Unknown|         1996|['Comics & Graphi...|
|Dr. Seuss: Americ...|Philip Nel takes ...|      ['Philip Nel']|           A&C Black|   2005-01-01|['Biography & Aut...|
|Wonderful Worship...|This resource inc...|    ['David R. Ray']|             Unknown|         2000|        ['Religion']|
|Whispers of the W...|Julia Thomas find...| ['Veronica Haddon']|           iUniverse|      2005-02|         ['Fiction']|
|Nation Dance: Rel...|                null|     ['Edward Long']|             Unknown|   2003-03-01|                null|
|The Church of Chr...|In The Chu

In [5]:
import csv

def ignore_commas_and_quotes(input_file, output_file):
    """Copy a CSV file, stripping commas and double quotes from every field."""
    # newline='' lets the csv module handle line endings inside quoted fields;
    # encoding='utf-8' is an assumption about the source file
    with open(input_file, 'r', newline='', encoding='utf-8') as input_csv_file, \
         open(output_file, 'w', newline='', encoding='utf-8') as output_csv_file:
        csv_reader = csv.reader(input_csv_file)
        csv_writer = csv.writer(output_csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for row in csv_reader:
            # Remove commas and double quotes from each field, then trim whitespace
            new_row = [field.replace(',', '').replace('"', '').strip() for field in row]
            csv_writer.writerow(new_row)


ignore_commas_and_quotes("data\\books_details.csv","data\\output.csv")
print("done")

done
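To see what this cleaning step does to a single row, here is a small in-memory sketch. The sample text and the helper name `strip_commas_and_quotes` are hypothetical; the per-field logic mirrors the function above.

```python
import csv
import io


def strip_commas_and_quotes(row):
    """Remove commas and double quotes from every field, then trim whitespace."""
    return [field.replace(",", "").replace('"', "").strip() for field in row]


# Hypothetical sample: a quoted title containing a comma and a description
# containing escaped double quotes and trailing whitespace.
raw = io.StringIO('Title,description\n"Book, The","He said ""hi"", then left. "\n')
cleaned_rows = [strip_commas_and_quotes(row) for row in csv.reader(raw)]
print(cleaned_rows)
```

`csv.reader` splits the quoted fields correctly before any characters are removed, so the column count is preserved. The trade-off is that punctuation inside titles and descriptions is lost, and list-like fields such as `authors` lose the commas separating their entries.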
