### Problem

One of the biggest challenges for readers is finding the right book to read. That is why we made BookForYou. BookForYou is a recommender system that suggests books based on the user's stated preferences for author, title, and book category. It uses book reviews from Amazon’s book database to find the best candidate.

### Identification of required data

For reviews the following features will be used:

* Id (the id of the book)
* Title (the book title)
* User_id (the id of the user who rated the book)
* review/score (the rating of the book, from 0 to 5; only scores from 1 to 5 are kept during preprocessing)
* review/summary (the summary of the text review)

### Data PreProcessing

The following imports will be used for data preprocessing.

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler


Creating SparkSession

In [4]:
def init_spark():
    spark = SparkSession \
        .builder \
        .appName("BookForYou Recommender System") \
        .config("spark.driver.memory", "8g") \
        .config("spark.executor.memory", "8g") \
        .getOrCreate()
    return spark

### Importing the dataset

In [5]:
spark = init_spark()
df_ratings = spark.read.csv("data/preprocessed/reviews.csv", inferSchema=True, header=True)

Missing Data
Data sources frequently contain gaps, which leaves three main options for handling them:

1. Keep the missing data points.
2. Drop the missing data points (including the entire row).
3. Fill them in with some other value.
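
The three options can be sketched in plain Python, independent of Spark. The rows below are made-up placeholders rather than records from the dataset, and the fill values in `defaults` are arbitrary choices for illustration.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"Id": "b1", "User_id": "u1", "review/score": 4.0},
    {"Id": "b2", "User_id": None, "review/score": 5.0},
    {"Id": "b3", "User_id": "u3", "review/score": None},
]
required = ["Id", "User_id", "review/score"]

# Option 1: keep the rows as they are, missing values included.
kept = list(rows)

# Option 2: drop every row that has a missing value in a required column.
dropped = [r for r in rows if all(r[c] is not None for c in required)]

# Option 3: fill missing values with a placeholder chosen per column.
defaults = {"User_id": "unknown", "review/score": 0.0}
filled = [
    {c: (defaults.get(c) if v is None else v) for c, v in r.items()}
    for r in rows
]

print(len(kept), len(dropped), len(filled))
```

In PySpark, the second and third options correspond to `DataFrame.na.drop(subset=...)` and `DataFrame.na.fill(...)`.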

In [6]:
df_ratings = df_ratings.select("Id", "Title", "User_id", "review/score", "review/summary")
df_ratings = df_ratings.na.drop(subset=["Id","Title","User_id","review/score","review/summary"])
df_ratings = df_ratings.withColumnRenamed("Id", "book_string")
df_ratings = df_ratings.withColumnRenamed("User_id", "User_string")
df_ratings = df_ratings.filter(df_ratings["review/score"] <= 5)
df_ratings = df_ratings.filter(df_ratings["review/score"] >= 1)


Collaborative filtering: recommendations are based on the ratings of other users.

In [7]:
df_ratings.show()

+-----------+--------------------+--------------+------------+--------------------+
|book_string|               Title|   User_string|review/score|      review/summary|
+-----------+--------------------+--------------+------------+--------------------+
| 1882931173|Its Only Art If I...| AVCGYZL8FQQTD|         4.0|Nice collection o...|
| 0826414346|Dr. Seuss: Americ...|A30TK6U7DNS82R|         5.0|   Really Enjoyed It|
| 0826414346|Dr. Seuss: Americ...|A3UH4UZ4RSVO82|         5.0|Essential for eve...|
| 0826414346|Dr. Seuss: Americ...|A2MVUWT453QH61|         4.0|Phlip Nel gives s...|
| 0826414346|Dr. Seuss: Americ...|A22X4XUPKF66MR|         4.0|Good academic ove...|
| 0826414346|Dr. Seuss: Americ...|A2F6NONFUDB6UK|         4.0|One of America's ...|
| 0826414346|Dr. Seuss: Americ...|A14OJS0VWMOSWO|         5.0|A memorably excel...|
| 0826414346|Dr. Seuss: Americ...|A2RSSXTDZDUSH4|         5.0|Academia At It's ...|
| 0826414346|Dr. Seuss: Americ...|A25MD5I2GUIW6W|         5.0|And to think t

In [8]:
df_ratings.describe()

DataFrame[summary: string, book_string: string, Title: string, User_string: string, review/score: string, review/summary: string]

In [9]:
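from pyspark.ml.feature import StringIndexer

# Index the book, user, and score columns in one pass.
# ALS needs numeric user and item ids, so the string ids are mapped to indices.
# Note: StringIndexer orders labels by descending frequency by default, so
# "ratings" is a frequency rank (0.0 = most common score), not the original score.
indexer = StringIndexer(inputCols=["book_string","User_string","review/score"], outputCols=["book_id","User_id","ratings"])

# Fit the indexer, add the indexed columns, and drop the original columns
df_ratings = indexer.fit(df_ratings).transform(df_ratings).drop("book_string","User_string","review/score")

# Show the result
#df_ratings.show()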
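
By default, StringIndexer assigns indices by descending label frequency, so the indexed `ratings` column holds a frequency rank rather than the original star score. The plain-Python sketch below emulates that ordering on made-up scores; `frequency_index` is a hypothetical helper, not part of PySpark.

```python
from collections import Counter

def frequency_index(labels):
    """Map each label to its rank by descending frequency (ties broken by label)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], str(label)))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up scores: 5.0 is the most common, 1.0 the rarest.
scores = [5.0, 5.0, 5.0, 4.0, 4.0, 1.0]
mapping = frequency_index(scores)
print(mapping)  # {5.0: 0.0, 4.0: 1.0, 1.0: 2.0}
```

If the original scale is wanted as the ALS label, one option is to cast the score column to a numeric type (for example `col("review/score").cast("double")`) instead of indexing it. That would change the predictions and metrics reported in the rest of the notebook.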

In [10]:
df_ratings.orderBy(desc("ratings")).show()

+--------------------+--------------------+-------+--------+-------+
|               Title|      review/summary|book_id| User_id|ratings|
+--------------------+--------------------+-------+--------+-------+
|Dragons of the Dw...|Trying to fix a g...| 4434.0| 82277.0|    4.0|
|Led Zeppelin: Daz...|   It all depends...|29089.0|777099.0|    4.0|
|Dragons of the Dw...|Good Book! BAD E-...| 4434.0| 77266.0|    4.0|
|Led Zeppelin: Daz...|Led Zeppelin Dese...|29089.0|  4796.0|    4.0|
|Dragons of the Dw...|A disappointment;...| 4434.0| 89924.0|    4.0|
|  Becoming Strangers|Do yourself a fav...|42794.0|578970.0|    4.0|
|Dragons of the Dw...|I REALLY WANTED T...| 4434.0| 51150.0|    4.0|
|  Becoming Strangers|This book is a wa...|42794.0|103704.0|    4.0|
|The value of post...|             Alright|46176.0|291819.0|    4.0|
|          RED LEAVES|Just OK-don't bot...| 7232.0|220507.0|    4.0|
|Confessions of an...|It's not the Mess...|  162.0|  1199.0|    4.0|
|Dragons of the Dw...|            

Splitting dataset into test and training sets

In [11]:
ratings_training_set, ratings_test_set = df_ratings.randomSplit([0.8, 0.2], seed=1234)

### Creating the model

In [12]:
from pyspark.ml.recommendation import ALS #alternating least squares algorithm
from pyspark.ml.evaluation import RegressionEvaluator

recommender = ALS(userCol="User_id", itemCol="book_id", ratingCol="ratings", coldStartStrategy="drop")
recommender = recommender.fit(ratings_training_set)
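
ALS is a matrix-factorization method: it learns one latent-factor vector per user and per book, and a predicted rating is the dot product of the two. The sketch below uses made-up three-dimensional factors (the values and the `predict` helper are illustrative, not taken from the fitted model) to show why predictions are unbounded real numbers and can fall below zero.

```python
def predict(user_factors, item_factors):
    """Predicted rating = dot product of the user and item factor vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))

# Hypothetical latent factors, for illustration only.
user = [0.5, -0.2, 0.1]
book_a = [1.0, 0.5, 0.0]
book_b = [-0.4, 0.3, 0.2]

print(predict(user, book_a))  # positive score
print(predict(user, book_b))  # negative score
```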

Predicting with the test set

In [21]:
predictions = recommender.transform(ratings_test_set)
predictions.count()

322080

In [14]:
predictions.show()


+---------+--------------------+-------+--------+-------+------------+
|    Title|      review/summary|book_id| User_id|ratings|  prediction|
+---------+--------------------+-------+--------+-------+------------+
|The Giver|Nothing less than...|   26.0|   642.0|    0.0| -0.16424593|
|The Giver|Just As Haunting ...|   26.0|  1025.0|    0.0| -0.13324575|
|The Giver|Unabrid Audio 4 c...|   26.0|  1307.0|    0.0|  0.32836226|
|The Giver|  Recovering Skeptic|   26.0|  1404.0|    0.0| 0.011291621|
|The Giver|What a book!! Les...|   26.0|  1483.0|    0.0|-0.012859484|
|The Giver| Inspirational Read!|   26.0|  1873.0|    0.0|   0.3416095|
|The Giver|A Fable About The...|   26.0|  3691.0|    0.0|   0.9679783|
|The Giver|  A Perfect Society?|   26.0|  5417.0|    0.0|  0.24573886|
|The Giver|    Utopian distopia|   26.0| 12315.0|    1.0|   1.1893966|
|The Giver|       Great Book!!!|   26.0| 26273.0|    0.0|  0.19406965|
|The Giver|A life-changing book|   26.0| 29426.0|    0.0|         0.0|
|The G

### Evaluate the model

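Compute the root-mean-square error (RMSE) over the $n$ test predictions using RegressionEvaluator:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{pred}_i - \mathrm{actual}_i\right)^2}$$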
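
As a quick check of the definition, RMSE can be computed by hand on a few values. The numbers below are made up for illustration, and `rmse` is a hypothetical helper that is separate from Spark's evaluator.

```python
import math

def rmse(predicted, actual):
    """Square root of the mean squared difference between paired values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# One prediction is off by 2, the other two are exact.
print(rmse([1.0, 2.0, 5.0], [1.0, 2.0, 3.0]))
```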

In [15]:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="rmse", labelCol="ratings", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 0.8822855895638484


In [22]:
# Define a function to calculate precision on a Spark DataFrame of predictions
def precision(data, threshold):
    # Count the true positives and false positives
    true_positives = data.filter((data["ratings"] >= threshold) & (data["prediction"] >= threshold)).count()
    false_positives = data.filter((data["ratings"] < threshold) & (data["prediction"] >= threshold)).count()
    # Calculate precision: TP / (TP + FP)
    return true_positives / (true_positives + false_positives)

# Calculate precision for a rating threshold of 3 or higher
p = precision(predictions, 3)
print("Precision for threshold 3 or higher:", p)

Precision for threshold 3 or higher: 0.9681633878960805


In [23]:
# Define a function to calculate recall on a Spark DataFrame of predictions
def recall(data, threshold):
    # Count the true positives and false negatives
    true_positives = data.filter((data["ratings"] >= threshold) & (data["prediction"] >= threshold)).count()
    false_negatives = data.filter((data["ratings"] >= threshold) & (data["prediction"] < threshold)).count()
    # Calculate recall: TP / (TP + FN)
    return true_positives / (true_positives + false_negatives)

# Calculate recall for a rating threshold of 3 or higher
r = recall(predictions, 3)
print("Recall for threshold 3 or higher:", r)

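
Precision and recall at a threshold can be checked on a small hand-made sample. The pairs below are invented, and `precision_recall` is a hypothetical helper that mirrors the true-positive, false-positive, and false-negative counts used in the cells above. Returning `None` when a denominator is zero is an assumption of this sketch, since the notebook functions do not handle that case.

```python
def precision_recall(pairs, threshold):
    """pairs: list of (actual, predicted) tuples. Returns (precision, recall)."""
    tp = sum(1 for a, p in pairs if a >= threshold and p >= threshold)
    fp = sum(1 for a, p in pairs if a < threshold and p >= threshold)
    fn = sum(1 for a, p in pairs if a >= threshold and p < threshold)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall

# Invented (actual, predicted) pairs: 2 true positives, 1 false positive,
# 1 false negative, 1 true negative at threshold 3.
sample = [(4, 3.5), (5, 2.0), (2, 3.1), (3, 3.0), (1, 0.5)]
print(precision_recall(sample, 3))
```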

Making Recommendations

In [27]:
ratings_test_set.filter(ratings_test_set["User_id"] == 642.0).show()

+--------------------+--------------------+--------+-------+-------+
|               Title|      review/summary| book_id|User_id|ratings|
+--------------------+--------------------+--------+-------+-------+
|Witness to Myself...|Terrific modern n...| 33535.0|  642.0|    0.0|
|         Red Prophet|Slow going, but s...|  8659.0|  642.0|    2.0|
|   Carry On, Jeeves!|Classic Wodehousi...|  7566.0|  642.0|    0.0|
|     Nightmare House|A great audio of ...|  8136.0|  642.0|    0.0|
|            Vendetta|Complex and inter...| 87914.0|  642.0|    0.0|
|           The Giver|Nothing less than...|    26.0|  642.0|    0.0|
|    A Stir of Echoes|One of his best; ...|  6264.0|  642.0|    0.0|
|Donovan's Brain (...|      Landmark Novel|104538.0|  642.0|    1.0|
| Beyond the Outposts|A great book in a...| 91328.0|  642.0|    0.0|
|Claudius the god:...|The Sopranos of A...|  8208.0|  642.0|    0.0|
|    Come Out Tonight|One of Laymon's best|213273.0|  642.0|    0.0|
|Stranger in a Str...|Were my expe

In [28]:
test_user = ratings_test_set.filter(ratings_test_set["User_id"] == 642.0).select("book_id","User_id","Title","review/summary")

In [29]:
test_user.show()

+--------+-------+--------------------+--------------------+
| book_id|User_id|               Title|      review/summary|
+--------+-------+--------------------+--------------------+
| 33535.0|  642.0|Witness to Myself...|Terrific modern n...|
|  8659.0|  642.0|         Red Prophet|Slow going, but s...|
|  7566.0|  642.0|   Carry On, Jeeves!|Classic Wodehousi...|
|  8136.0|  642.0|     Nightmare House|A great audio of ...|
| 87914.0|  642.0|            Vendetta|Complex and inter...|
|    26.0|  642.0|           The Giver|Nothing less than...|
|  6264.0|  642.0|    A Stir of Echoes|One of his best; ...|
|104538.0|  642.0|Donovan's Brain (...|      Landmark Novel|
| 91328.0|  642.0| Beyond the Outposts|A great book in a...|
|  8208.0|  642.0|Claudius the god:...|The Sopranos of A...|
|213273.0|  642.0|    Come Out Tonight|One of Laymon's best|
|   473.0|  642.0|Stranger in a Str...|Were my expectati...|
| 10481.0|  642.0|The Big Rock Cand...|Terrific autobiog...|
| 98585.0|  642.0|Last W

In [30]:
recommendations = recommender.transform(test_user)

In [31]:
recommendations.orderBy(desc("prediction")).show()

+--------+-------+--------------------+--------------------+-----------+
| book_id|User_id|               Title|      review/summary| prediction|
+--------+-------+--------------------+--------------------+-----------+
|   473.0|  642.0|Stranger in a Str...|Were my expectati...|   1.593513|
|   509.0|  642.0|Stanger in a Stra...|Were my expectati...|  1.5815586|
|  8659.0|  642.0|         Red Prophet|Slow going, but s...|  1.3491691|
|  8136.0|  642.0|     Nightmare House|A great audio of ...| 0.89534837|
| 33535.0|  642.0|Witness to Myself...|Terrific modern n...|  0.8458453|
|  6264.0|  642.0|    A Stir of Echoes|One of his best; ...|  0.8253751|
|  5672.0|  642.0|The New Shorter O...|Best dictionary f...|  0.7862632|
|104538.0|  642.0|Donovan's Brain (...|      Landmark Novel| 0.40847343|
|  8208.0|  642.0|Claudius the god:...|The Sopranos of A...| 0.24503751|
|  7566.0|  642.0|   Carry On, Jeeves!|Classic Wodehousi...| 0.22320217|
|    26.0|  642.0|           The Giver|Nothing less
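
Note that `recommender.transform(test_user)` only scores the books this user reviewed in the test set. To get a top-N list over all books known to the model, `ALSModel.recommendForUserSubset` can be used. The wrapper below is a minimal sketch (`top_n_for_users` is a hypothetical helper name), and the returned list may include books the user already rated in the training set.

```python
def top_n_for_users(model, users_df, n=10):
    """Return the model's top-n book recommendations for each user in users_df.

    model:    a fitted pyspark.ml.recommendation.ALSModel
    users_df: a Spark DataFrame containing the model's user column ("User_id" here)
    """
    return model.recommendForUserSubset(users_df, n)

# Usage in the notebook, assuming `recommender` and `test_user` from the cells above:
# user_642 = test_user.select("User_id").distinct()
# top_n_for_users(recommender, user_642, n=10).show(truncate=False)
```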