## Building a recommendation engine to recommend books in Spark

Using collaborative filtering to predict ratings of unread books on 'Goodreads'

When Warren Buffett was once asked about the key to success, he pointed to a stack of nearby books and said, "Read 500 pages like this every day. That's how knowledge works. It builds up, like compound interest. All of you can do it, but I guarantee not many of you will do it."
Books are among the best resources most of us have for developing and gaining perspective.

I love reading books myself. Once I like a book, I have a habit of going to Goodreads or asking someone with similar taste to recommend the next books I might like.
Artificial intelligence has made this easier by recommending books, movies, and products based on past data, which saves us the time and energy of analyzing different options. Sometimes machines can even recommend better than we would choose ourselves, because they don't suffer from emotional biases.

**The intuition behind Alternating Least Squares**

For someone who loves reading academic papers, here's the link to the paper: https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf.

One of the most important kinds of recommender system is the collaborative filtering approach. Say you have a friend whose taste matches yours because you both love psychology; you might then enjoy books that your friend has read but you haven't. This is the core idea behind collaborative filtering. Now let's talk specifically about the Alternating Least Squares (ALS) method.
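To make the similar-taste idea concrete before moving on, here is a minimal sketch in plain Python. It is a neighbourhood-style illustration, not the ALS method used later in this notebook; the reader names, the reading lists, and the helper functions `jaccard` and `recommend` are all hypothetical.

```python
# Hypothetical toy data: which books each reader has finished.
read_books = {
    "you":    {"Thinking, Fast and Slow", "Influence", "Flow"},
    "friend": {"Thinking, Fast and Slow", "Influence", "Quiet", "Grit"},
    "other":  {"Dune", "Neuromancer"},
}

def jaccard(a, b):
    """Overlap between two reading lists (0 = nothing shared, 1 = identical)."""
    return len(a & b) / len(a | b)

def recommend(user, library):
    """Suggest what the most similar reader has read that `user` has not."""
    others = [name for name in library if name != user]
    nearest = max(others, key=lambda name: jaccard(library[user], library[name]))
    return nearest, sorted(library[nearest] - library[user])

neighbour, suggestions = recommend("you", read_books)
print(neighbour, suggestions)
```

The reader with the largest overlap acts as the "friend", and the books only they have read become the suggestions.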


Collaborative filtering can be implemented with matrix factorization techniques such as Singular Value Decomposition (SVD), where the user-rating matrix is decomposed into a user-concept matrix, a concept-weights matrix, and a book-concept matrix. Concepts are latent (hidden) factors that the decomposition generates implicitly; for books, the concepts might correspond to psychology, data science, philosophy, and so on.

Most matrix factorization techniques, including SVD, cannot deal with an incomplete (sparse) matrix, that is, a user-rating matrix with empty values, which is common because no user has read every book. Traditionally, engineers imputed those values with the mean or median before factorizing. This distorts the model: books a user never rated are treated as if they had received the mean or median rating, and in a sparse matrix these imputed values far outnumber the real ones and skew the results towards them.

More recent methods such as Alternating Least Squares don't suffer from these drawbacks. They model only the observed ratings directly, while avoiding overfitting through a regularized model.
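The alternating idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Spark's ALS implementation: the toy ratings are hypothetical, the helper names (`solve`, `update`, `als`, `loss`) are made up for this sketch, and a plain L2 penalty `reg` stands in for the regularization. Each step holds one side's factors fixed and solves a small ridge regression for the other side, using only the observed ratings.

```python
import random

def solve(A, b):
    """Solve the small linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[k][n] / M[k][k] for k in range(n)]

def update(fixed, observed, rank, reg):
    """Re-fit one side's factors by ridge regression, holding the other side fixed."""
    new = {}
    for key, pairs in observed.items():
        A = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, rating in pairs:  # only observed ratings contribute
            f = fixed[other]
            for i in range(rank):
                b[i] += rating * f[i]
                for j in range(rank):
                    A[i][j] += f[i] * f[j]
        new[key] = solve(A, b)
    return new

def als(ratings, rank=2, reg=0.1, iters=20, seed=0):
    rng = random.Random(seed)
    by_user, by_book = {}, {}
    for user, book, rating in ratings:
        by_user.setdefault(user, []).append((book, rating))
        by_book.setdefault(book, []).append((user, rating))
    books = {b: [rng.random() for _ in range(rank)] for b in by_book}
    users = {}
    for _ in range(iters):
        users = update(books, by_user, rank, reg)  # fix books, solve for users
        books = update(users, by_book, rank, reg)  # fix users, solve for books
    return users, books

def loss(ratings, users, books, reg):
    """Squared error on observed ratings plus the L2 penalty on all factors."""
    err = sum((r - sum(p * q for p, q in zip(users[u], books[b]))) ** 2
              for u, b, r in ratings)
    vectors = list(users.values()) + list(books.values())
    return err + reg * sum(x * x for vec in vectors for x in vec)

# Hypothetical (user_id, book_id, rating) triples; most user-book pairs are missing.
toy = [(1, 10, 5), (1, 11, 4), (2, 10, 4), (2, 12, 2),
       (3, 11, 5), (3, 12, 1), (4, 10, 5), (4, 12, 2)]
users, books = als(toy)
print(round(loss(toy, users, books, 0.1), 3))
```

Because each half-step exactly minimizes the regularized loss for one side, the loss never increases from one iteration to the next, and missing ratings never need to be imputed.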

In [2]:
#import to_read data

file_location = "/FileStore/tables/to_read.csv"
file_type = "csv"

# CSV options
infer_schema = "True"
first_row_is_header = "True"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
to_read = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(to_read)

# Create Temporary Tables
to_read.createOrReplaceTempView("to_read")

user_id,book_id
1,112
1,235
1,533
1,1198
1,1874
1,2058
1,3334
2,4
2,11
2,13


In [3]:
#import ratings data
file_location = "/FileStore/tables/ratings.csv"
file_type = "csv"

# CSV options
infer_schema = "True"
first_row_is_header = "True"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
ratings = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(ratings)

# Create Temporary Tables
ratings.createOrReplaceTempView("ratings")

book_id,user_id,rating
1,314,5
1,439,3
1,588,5
1,1169,4
1,1185,4
1,2077,4
1,2487,4
1,2900,5
1,3662,4
1,3922,5


Let's see the distribution of **average ratings**

In [5]:
%sql

select * from ratings

book_id,user_id,rating
1,314,5
1,439,3
1,588,5
1,1169,4
1,1185,4
1,2077,4
1,2487,4
1,2900,5
1,3662,4
1,3922,5


In [6]:
#import books data
file_location = "/FileStore/tables/books.csv"
file_type = "csv"

# CSV options
infer_schema = "True"
first_row_is_header = "True"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
books = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(books)

# Create Temporary Tables
books.createOrReplaceTempView("books")

id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,title,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
1,2767052,2767052,2792775,272,439023483,9780439023480.0,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,4780653,4942365,155254,66715.0,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m/2767052.jpg,https://images.gr-assets.com/books/1447303603s/2767052.jpg
2,3,3,4640799,491,439554934,9780439554930.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,"Harry Potter and the Sorcerer's Stone (Harry Potter, #1)",eng,4.44,4602479,4800065,75867,75504.0,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m/3.jpg,https://images.gr-assets.com/books/1474154022s/3.jpg
3,41865,41865,3212258,226,316015849,9780316015840.0,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,3866839,3916824,95009,456191.0,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m/41865.jpg,https://images.gr-assets.com/books/1361039443s/41865.jpg
4,2657,2657,3275794,487,61120081,9780061120080.0,Harper Lee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,eng,4.25,3198671,3340896,72586,60427.0,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m/2657.jpg,https://images.gr-assets.com/books/1361975680s/2657.jpg
5,4671,4671,245494,1356,743273567,9780743273560.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,The Great Gatsby,eng,3.89,2683664,2773745,51992,86236.0,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m/4671.jpg,https://images.gr-assets.com/books/1490528560s/4671.jpg
6,11870085,11870085,16827462,226,525478817,9780525478810.0,John Green,2012.0,The Fault in Our Stars,The Fault in Our Stars,eng,4.26,2346404,2478609,140739,47994.0,92723,327550,698471,1311871,https://images.gr-assets.com/books/1360206420m/11870085.jpg,https://images.gr-assets.com/books/1360206420s/11870085.jpg
7,5907,5907,1540236,969,618260307,9780618260300.0,J.R.R. Tolkien,1937.0,The Hobbit or There and Back Again,The Hobbit,en-US,4.25,2071616,2196809,37653,46023.0,76784,288649,665635,1119718,https://images.gr-assets.com/books/1372847500m/5907.jpg,https://images.gr-assets.com/books/1372847500s/5907.jpg
8,5107,5107,3036731,360,316769177,9780316769170.0,J.D. Salinger,1951.0,The Catcher in the Rye,The Catcher in the Rye,eng,3.79,2044241,2120637,44920,109383.0,185520,455042,661516,709176,https://images.gr-assets.com/books/1398034300m/5107.jpg,https://images.gr-assets.com/books/1398034300s/5107.jpg
9,960,960,3338963,311,1416524797,9781416524790.0,Dan Brown,2000.0,Angels & Demons,"Angels & Demons (Robert Langdon, #1)",en-CA,3.85,2001311,2078754,25112,77841.0,145740,458429,716569,680175,https://images.gr-assets.com/books/1303390735m/960.jpg,https://images.gr-assets.com/books/1303390735s/960.jpg
10,1885,1885,3060926,3455,679783261,9780679783270.0,Jane Austen,1813.0,Pride and Prejudice,Pride and Prejudice,eng,4.24,2035490,2191465,49152,54700.0,86485,284852,609755,1155673,https://images.gr-assets.com/books/1320399351m/1885.jpg,https://images.gr-assets.com/books/1320399351s/1885.jpg


In [7]:
#converting books data into pandas dataframe

books_df = books.toPandas()
books_df.head()

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,title,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,4780653,4942365,155254,66715.0,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,Harry Potter and the Sorcerer's Stone (Harry P...,eng,4.44,4602479,4800065,75867,75504.0,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,3866839,3916824,95009,456191.0,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,eng,4.25,3198671,3340896,72586,60427.0,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,The Great Gatsby,eng,3.89,2683664,2773745,51992,86236.0,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


The dataset includes a book from 1750 BC:

In [9]:
%sql

select * from books where original_publication_year = -1750

id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,title,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
2076,19351,19351,3802528,266,141026286,9780141026280.0,"Anonymous, N.K. Sandars",-1750.0,Shūtur eli sharrī,The Epic of Gilgamesh,eng,3.63,44345,55856,2247,1551.0,5850,17627,17485,13343,https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png,https://s.gr-assets.com/assets/nophoto/book/50x75-a91bf249278a81aabab721ef782c4a74.png


**Distribution of number of books published across years**

In [11]:
%sql

select original_publication_year, count(*) as count from books where original_publication_year > 1950 group by original_publication_year 

original_publication_year,count
1988.0,89
1951.0,13
1976.0,39
1979.0,48
1953.0,26
1987.0,83
1959.0,24
1978.0,48
1968.0,35
2010.0,473


In [12]:
import pandas as pd
import numpy as np

In [13]:
books_df.shape

In [14]:
# Sanity check: keep only rows whose ratings_count consists solely of digits
books_df = books_df[books_df.ratings_count.str.isdigit() == True].copy()

In [15]:
books_df.ratings_count = books_df.ratings_count.astype('int')

In [16]:
#Top books with most number of ratings on goodbooks

books_df.sort_values(by = 'ratings_count', ascending = False)[['original_title','ratings_count', 'average_rating' ]][0:10]

Unnamed: 0,original_title,ratings_count,average_rating
0,The Hunger Games,4780653,4.34
1,Harry Potter and the Philosopher's Stone,4602479,4.44
2,Twilight,3866839,3.57
3,To Kill a Mockingbird,3198671,4.25
4,The Great Gatsby,2683664,3.89
5,The Fault in Our Stars,2346404,4.26
6,The Hobbit or There and Back Again,2071616,4.25
7,The Catcher in the Rye,2044241,3.79
9,Pride and Prejudice,2035490,4.24
8,Angels & Demons,2001311,3.85


In [17]:
most_ratings = books_df.sort_values(by = 'ratings_count', ascending = False)[['original_title','ratings_count', 'average_rating', 'image_url' ]][0:10]

**Printing top books with most number of ratings**

In [19]:
import pandas as pd
from IPython.display import Image, HTML
most_ratings['img_html'] = most_ratings['image_url']\
    .str.replace(
        r'^(.+)$',
        r'<img src="\1" style="max-height:124px;"></img>',
        regex=True
    )
with pd.option_context('display.max_colwidth', 10000):
  
  display(HTML(most_ratings[['original_title', 'img_html', 'ratings_count', 'average_rating' ]].to_html(escape=False)))

Unnamed: 0,original_title,img_html,ratings_count,average_rating
0,The Hunger Games,,4780653,4.34
1,Harry Potter and the Philosopher's Stone,,4602479,4.44
2,Twilight,,3866839,3.57
3,To Kill a Mockingbird,,3198671,4.25
4,The Great Gatsby,,2683664,3.89
5,The Fault in Our Stars,,2346404,4.26
6,The Hobbit or There and Back Again,,2071616,4.25
7,The Catcher in the Rye,,2044241,3.79
9,Pride and Prejudice,,2035490,4.24
8,Angels & Demons,,2001311,3.85


In [20]:
books_df.head()

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,title,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,4780653,4942365,155254,66715.0,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,Harry Potter and the Sorcerer's Stone (Harry P...,eng,4.44,4602479,4800065,75867,75504.0,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,3866839,3916824,95009,456191.0,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,eng,4.25,3198671,3340896,72586,60427.0,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,The Great Gatsby,eng,3.89,2683664,2773745,51992,86236.0,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In [21]:
books_df.average_rating = books_df.average_rating.astype('float')

In [22]:
#Top books with top average ratings on goodbooks

books_df.sort_values(by = 'average_rating', ascending = False)[['original_title','ratings_count', 'average_rating' ]][0:10]

Unnamed: 0,original_title,ratings_count,average_rating
3627,The Complete Calvin and Hobbes,28900,4.82
861,Words of Radiance,73572,4.77
3274,,33220,4.77
8853,Mark of the Lion Trilogy,9081,4.76
7946,,8953,4.76
4482,It's a Magical World: A Calvin and Hobbes Coll...,22351,4.75
6360,There's Treasure Everywhere: A Calvin and Hobb...,16766,4.74
421,Complete Harry Potter Boxed Set,190050,4.74
6589,The Authoritative Calvin and Hobbes,16087,4.73
6919,The Indispensable Calvin and Hobbes: A Calvin ...,14597,4.73


In [23]:
high_rating_books = books_df.sort_values(by = 'average_rating', ascending = False)[['original_title','ratings_count','image_url', 'average_rating' ]][0:10]

**Printing top books with highest average ratings**

In [25]:
high_rating_books['img_html'] = high_rating_books['image_url']\
    .str.replace(
        r'^(.+)$',
        r'<img src="\1" style="max-height:124px;"></img>',
        regex=True
    )
with pd.option_context('display.max_colwidth', 10000):
  
  display(HTML(high_rating_books[['original_title', 'img_html','ratings_count', 'average_rating' ]].to_html(escape=False)))

Unnamed: 0,original_title,img_html,ratings_count,average_rating
3627,The Complete Calvin and Hobbes,,28900,4.82
861,Words of Radiance,,73572,4.77
3274,,,33220,4.77
8853,Mark of the Lion Trilogy,,9081,4.76
7946,,,8953,4.76
4482,It's a Magical World: A Calvin and Hobbes Collection,,22351,4.75
6360,There's Treasure Everywhere: A Calvin and Hobbes Collection,,16766,4.74
421,Complete Harry Potter Boxed Set,,190050,4.74
6589,The Authoritative Calvin and Hobbes,,16087,4.73
6919,The Indispensable Calvin and Hobbes: A Calvin and Hobbes Treasury,,14597,4.73


In [26]:
authors_with_most_books = pd.DataFrame(books_df.authors.value_counts()[0:10]).reset_index()
authors_with_most_books.columns = ['author', 'number_of_books']

In [27]:
 authors_with_most_books

Unnamed: 0,author,number_of_books
0,Stephen King,60
1,Nora Roberts,59
2,Dean Koontz,47
3,Terry Pratchett,42
4,Agatha Christie,39
5,Meg Cabot,37
6,James Patterson,36
7,David Baldacci,34
8,John Grisham,33
9,J.D. Robb,33


In [28]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
plt.title('Distribution of Average Ratings')
books_df['average_rating'].hist()
display()

In [29]:
ratings.show(5)

In [30]:
#1 million ratings

ratings.describe().show()

In [31]:
# Number of ratings per book (most books have 100 in the ratings dataframe)

ratings.groupby('book_id').count().show()

Here are a few books with fewer than **100 ratings**

In [33]:
%sql

SELECT b.original_title, r.book_id,count(*)  FROM ratings r inner join books b on b.book_id = r.book_id group by r.book_id, b.original_title having count(*) <100 

original_title,book_id,count(1)
I Can Read with My Eyes Shut!,7785,98
Fried Green Tomatoes at the Whistle Stop Cafe,9375,77
Bluebeard,9601,87
Papillon,6882,95
A Case of Need,7663,91
The Complete Short Stories of Ernest Hemingway,4625,96
"Trials of Death (Cirque du Freak, #5)",8967,89
"The Vampire Prince (Cirque Du Freak, #6)",8968,96
Comfort Me with Apples: More Adventures at the Table,8725,99
El club Dumas,7194,95


In [34]:
ratings = ratings.select(ratings.user_id,
                         ratings.book_id,
                         ratings.rating.cast("double"))

In [35]:
ratings.show(5)

In [36]:
# Count the total number of ratings in the dataset
numerator = ratings.select("rating").count()

# Count the number of distinct Id's
num_users = ratings.select("user_id").distinct().count()
num_items = ratings.select("book_id").distinct().count()

# Set the denominator equal to the number of users multiplied by the number of items
denominator = num_users * num_items

# Divide the numerator by the denominator
sparsity = (1.0 - (numerator * 1.0)/ denominator) * 100
print(f"The ratings dataframe is {sparsity:.2f}% empty.")
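As a worked example of the sparsity formula, here is the same calculation on a tiny hypothetical case. The function name `sparsity_pct` and the numbers are assumptions made for illustration only.

```python
def sparsity_pct(num_ratings, num_users, num_items):
    """Percentage of user-item cells that hold no rating."""
    return (1.0 - num_ratings / (num_users * num_items)) * 100

# Hypothetical toy case: 3 users, 4 books, 5 ratings -> 12 cells, 7 of them empty.
print(f"{sparsity_pct(5, 3, 4):.2f}% empty")
```

The denominator counts every possible user-book pair, so the result is the share of the user-rating matrix that a model has to fill in.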

In [37]:
# Min num ratings 
print("Item with the fewest ratings: ")
ratings.groupBy("book_id").count().sort('count').show(10)

In [38]:
# Group data by user_id, count ratings
(ratings.groupBy("user_id")
    .count()
    .filter("`count` >= 5")
    .orderBy('count', ascending=False)
    .show(n = 10))

In [39]:
(ratings.groupBy("book_id")
    .count()
    .filter("`count` > 1")
    .orderBy('count', ascending=False)
    .show(n = 10))

**Now, let's split the data into training and test sets to apply collaborative filtering using the Alternating Least Squares method**

In [41]:
(training, test) = ratings.randomSplit([0.8, 0.2])

In [42]:
test.show(5)

**Let's import ALS and regression evaluator to find RMSE.**

In [44]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row

In [45]:
als = ALS( userCol="user_id", itemCol="book_id", ratingCol="rating",
          coldStartStrategy="drop", nonnegative = True, implicitPrefs = False)

**The 'implicitPrefs' argument controls whether ALS treats the input as explicit feedback, like the ratings here, or as implicit feedback. Sometimes companies don't have explicit data like ratings and still want to build a recommendation engine using proxies such as views, clicks, wishlists, etc.**

**In that case 'implicitPrefs = True' is used, but that is out of scope for our goodbooks project. 'coldStartStrategy' matters when a user or book in the test set has no ratings in the training set: the model has no factors for it, so the prediction is null (NaN). We set the strategy to "drop" so that such rows are removed from the predictions and do not distort the evaluation.**
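A small plain-Python sketch shows why unseen users matter for evaluation. The rows and the `rmse` helper are hypothetical; the point is only that a single NaN prediction turns the whole metric into NaN, while dropping such rows keeps it meaningful.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# Hypothetical test rows: the last user never appeared in training,
# so the model has no factors for them and the prediction is NaN.
rows = [(4.0, 3.5), (3.0, 3.5), (5.0, float("nan"))]

with_nan = rmse(rows)
dropped = rmse([(a, p) for a, p in rows if not math.isnan(p)])
print(with_nan, dropped)
```

The trade-off is that dropped rows are silently excluded, so the reported error only covers users and books seen during training.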


**Now let's build our hyperparameter list and then fit the algorithm on the training data.**

In [47]:
type(als)

In [48]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator 

param_grid = ParamGridBuilder() \
            .addGrid(als.rank, [10, 50, 75, 100]) \
            .addGrid(als.maxIter, [5, 50, 75, 100]) \
            .addGrid(als.regParam, [.01, .05, .1, .15]) \
            .build()

**Num models to be tested using param_grid: 64
The grid contains 4 × 4 × 4 = 64 hyperparameter combinations to be tested and tuned before we receive the final model. Spark's parallelization helps keep this search manageable.**
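The count can be checked without Spark by enumerating the same candidate values as the ParamGridBuilder call above. This is a plain-Python sketch; the dictionary layout and the variable names are assumptions made for illustration.

```python
from itertools import product

# Same candidate values as the ParamGridBuilder call above.
grid = {
    "rank": [10, 50, 75, 100],
    "maxIter": [5, 50, 75, 100],
    "regParam": [0.01, 0.05, 0.1, 0.15],
}
num_folds = 5  # matches numFolds in the CrossValidator cell

combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))              # hyperparameter combinations: 64
print(len(combos) * num_folds)  # model fits during cross-validation: 320
```

With 5 folds, each combination is fitted once per fold, so the search cost grows quickly as values are added to any one parameter.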

In [50]:
# Define evaluator as RMSE
evaluator = RegressionEvaluator(metricName = "rmse", 
                                labelCol = "rating", 
                                predictionCol = "prediction")
# Print the number of models in the parameter grid
print ("Num models to be tested using param_grid: ", len(param_grid))

In [51]:
# Build cross validation using CrossValidator
cv = CrossValidator(estimator = als, 
                    estimatorParamMaps = param_grid, 
                    evaluator = evaluator, 
                    numFolds = 5)

In [52]:
print(cv)

In [53]:
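# Fit ALS with its default hyperparameters. The CrossValidator (cv) defined above is
# not fitted here; cv.fit(training).bestModel would run the full grid search instead.
model = als.fit(training)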

In [54]:
predictions = model.transform(test)

In [55]:
predictions.show(n = 10)

**The predictions on the test set are reasonably close to the original ratings for many rows. For example, user_id 14372 originally rated this book 3 and our algorithm predicted 3.19, which is pretty close.**

In [57]:
predictions.createOrReplaceTempView("predictions")

In [58]:
%sql
select * from predictions

user_id,book_id,rating,prediction
3922,148,3.0,3.7151182
32055,148,3.0,3.1291625
27834,148,3.0,3.832256
13407,148,4.0,3.9250367
29703,148,4.0,3.7403727
40167,148,5.0,3.2674868
46139,148,5.0,3.8008392
17228,148,5.0,3.9937172
14372,148,3.0,3.188208
30681,148,2.0,3.1382155


In [59]:
%sql
select predictions.user_id, predictions.book_id, predictions.rating, predictions.prediction, books.title from 
predictions inner join books 
ON predictions.book_id = books.id

user_id,book_id,rating,prediction,title
3922,148,3.0,3.7151182,Girl with a Pearl Earring
32055,148,3.0,3.1291625,Girl with a Pearl Earring
27834,148,3.0,3.832256,Girl with a Pearl Earring
13407,148,4.0,3.9250367,Girl with a Pearl Earring
29703,148,4.0,3.7403727,Girl with a Pearl Earring
40167,148,5.0,3.2674868,Girl with a Pearl Earring
46139,148,5.0,3.8008392,Girl with a Pearl Earring
17228,148,5.0,3.9937172,Girl with a Pearl Earring
14372,148,3.0,3.188208,Girl with a Pearl Earring
30681,148,2.0,3.1382155,Girl with a Pearl Earring


In [60]:
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

**Root-mean-square error = 0.913. In other words, the predicted rating typically differs from the original rating by about 0.9 stars (RMSE weights large errors more heavily than a plain average of the errors). Now, let's predict 10 books and ratings for each user.**
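To see how the metric is computed, here is a plain-Python sketch that applies the RMSE formula to the ten prediction rows displayed above and compares it with the mean absolute error (MAE). It only uses those ten rows, so it will not reproduce the full test-set figure exactly.

```python
import math

# (rating, prediction) pairs copied from the ten prediction rows displayed above.
rows = [(3.0, 3.7151182), (3.0, 3.1291625), (3.0, 3.832256), (4.0, 3.9250367),
        (4.0, 3.7403727), (5.0, 3.2674868), (5.0, 3.8008392), (5.0, 3.9937172),
        (3.0, 3.188208), (2.0, 3.1382155)]

errors = [pred - actual for actual, pred in rows]
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
mae = sum(abs(e) for e in errors) / len(errors)
print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```

RMSE is never smaller than MAE; the gap here comes mostly from the few rows where a 5-star rating was predicted well below 4.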

In [62]:
# Generate n recommendations for all users
ALS_recommendations = model.recommendForAllUsers(numItems = 10) # n = 10

In [63]:
ALS_recommendations.show(n = 10)

In [64]:
# Temporary table
ALS_recommendations.createOrReplaceTempView("ALS_recs_temp")

In [65]:
clean_recs = spark.sql("""SELECT user_id,
                            bookIds_and_ratings.book_id AS book_id,
                            bookIds_and_ratings.rating AS prediction
                        FROM ALS_recs_temp
                        LATERAL VIEW explode(recommendations) exploded_table
                            AS bookIds_and_ratings""")
clean_recs.show()

In [66]:
# Recommendations for unread books
(clean_recs.join(ratings, ["user_id", "book_id"], "left")
    .filter(ratings.rating.isNull()).show())

In [67]:
new_books = (clean_recs.join(ratings, ["user_id", "book_id"], "left")
    .filter(ratings.rating.isNull()))
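The two steps above, exploding the nested recommendations and then keeping only unread books, can be sketched in plain Python. The data and variable names here are hypothetical and only mimic the shape of the Spark output.

```python
# Hypothetical shape of the recommendForAllUsers output: one entry per user,
# each holding a list of (book_id, predicted_rating) pairs.
nested_recs = {
    1: [(10, 4.8), (11, 4.6), (12, 4.5)],
    2: [(10, 4.9), (13, 4.4)],
}
# Hypothetical (user_id, book_id) pairs the users have already rated.
already_rated = {(1, 10), (2, 13)}

# "Explode": one (user_id, book_id, prediction) row per recommended book.
flat_recs = [(user, book, pred)
             for user, recs in nested_recs.items()
             for book, pred in recs]

# Left join + "rating IS NULL" keeps only pairs with no match in the ratings.
unread_recs = [row for row in flat_recs if (row[0], row[1]) not in already_rated]
print(unread_recs)
```

In Spark, the same filtering could also be written as a join with `how="left_anti"`, which returns only the left-side rows that have no match.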

In [68]:
print(new_books.count())

In [69]:
new_books.show(5)

In [70]:
to_read.show(5)

In [71]:
# Create Temporary Tables
new_books.createOrReplaceTempView("new_books")

In [72]:
print(new_books.count())

In [73]:
print(to_read.count())

In [74]:
# Create Temporary Tables
to_read.createOrReplaceTempView("to_read")

In [75]:
recommendations = new_books.join(to_read,
                              on = ["user_id", "book_id"], 
                              how = "inner")
recommendations.show()

In [76]:
print(recommendations.count())

In [77]:
# Bucket predictions by their integer part (first character of the value cast to string)
(recommendations
     .withColumn('pred_trunc', recommendations.prediction.substr(1,1))
     .groupby('pred_trunc')
     .count()
     .sort('pred_trunc')
     .show())

In [78]:
# Create Temporary Tables
recommendations.createOrReplaceTempView("recommendations")

In [79]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
plt.title('Most of the predicted ratings are above 3.8', fontsize = 12)
plt.suptitle('Distribution of predicted ratings for the to_read lists', fontsize = 18)
rec = recommendations.toPandas()
rec['prediction'].hist()
display()

**Most of the books in the wishlists are predicted at 4.5 or more, which is expected.**