In [1]:
#Import the pyspark sql libraries.  These are used to execute sql queries.
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.functions import *
from pyspark.sql.functions import col, avg, min, monotonically_increasing_id
from pyspark.sql.functions import isnan, when, count, col

#Import the pyspark machine learning libraries
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator


from pyspark.mllib.recommendation import MatrixFactorizationModel, Rating
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.evaluation import BinaryClassificationMetrics





sqlContext = SQLContext(sc)

import numpy as np
import pandas as pd
import random as rand
from IPython.display import Image, HTML


## Introduction

For the final project in this class, my goal was to build an ALS recommendation model on the Book-Crossing dataset and to tune its hyper-parameters with cross validation in order to find the best model. Unfortunately, after numerous attempts over several days and after research online, I found that the processing speed of the Databricks Community Edition does not allow a cross validation model to run without the cluster timing out.  Attempts to install pyspark locally on my desktop also failed.  When I tried to tune the hyper-parameters manually, Databricks timed out again before finding the best set of hyper-parameters. To complete the project, I hard coded the hyper-parameters, and even then I had to limit the rank setting to 16.  With any rank over 16, the processing times out. 

With all of that said, <b>I was able to complete the project. </b>  I built an ALS recommendation model to make recommendations to users in the dataset.  The project covers the following sections:<br/>

<li>A description of the Book-Crossing Dataset</li>
<li>Import and Explore the Books Data dataset</li>
<li>Import and Explore the Books Rating dataset</li>
<li>A Look at the Ratings and their Frequency</li>
<li>A Look at the Users and their Frequency</li>
<li>The Most Rated Books</li>
<li>Data Preparation</li>
<li>Build the Model</li>
<li>Make Recommendations to Frequent Users</li>
<li>Code that Did Not Run</li>
<li>Summary</li>

## The Book-Crossing Dataset

The Book-Crossing dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. It consists of three tables: <br />

BX-Books, which contains metadata about the books in the dataset: title, author, year published, publisher, ISBN, and links to images of the book covers. <br />
BX-Book-Ratings, which contains user ratings for the books. <br />
BX-Users, which contains demographic information about the users.  For this project, I did not import this data.

## Import and Explore the BX-Books Dataset

In this section, I imported the BX-Books dataset into a Spark dataframe called books_data.

In [5]:
booksdata = sc.textFile("dbfs:/FileStore/tables/BX_Books.csv")
books_data = spark.read.csv(booksdata, header=True, inferSchema=True)
books_data.cache()

The schema of the Books dataset is listed below. The dataset consists of 271,369 records.

In [7]:
books_data.printSchema()

In [8]:
books_data.select("ISBN").count()

The dataset consists of the following columns:

'ISBN': The International Standard Book Number is a numeric commercial book identifier which is intended to be unique. <br />
'Book-Title' : The title of the book. <br />
'Book-Author': The author of the book. <br />
'Year-Of-Publication': The year that the book was published. <br />
'Publisher': Publishing company for the book. <br />
'Image-URL-S', 'Image-URL-M', 'Image-URL-L' : hyperlinks to the cover image of the books in small, medium, and large.

In [10]:
print(books_data.columns)

In [11]:
books_data.show(5)

In [12]:
books_data = books_data.withColumn("ISBN", trim(col("ISBN")))
books_data = books_data.withColumn("length_of_ISBN", F.length("ISBN"))
books_data = books_data.filter(books_data.length_of_ISBN <11)
books_data = books_data.withColumn('ISBN', lpad(books_data.ISBN,10, '0'))
books_data.show(3)
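
The cell above trims each ISBN, drops values with 11 or more characters, and left-pads the remaining values with zeros to a width of 10. As a rough plain-Python sketch of the same rule on single values (the helper name `normalize_isbn` and the sample strings are hypothetical and are not part of the notebook):

```python
from typing import Optional


def normalize_isbn(raw: str) -> Optional[str]:
    """Mirror the trim / length filter / left-pad steps for one value."""
    # Note: str.strip() removes all surrounding whitespace,
    # which is slightly broader than Spark's trim().
    isbn = raw.strip()
    if len(isbn) >= 11:          # same condition as filtering on length < 11
        return None              # the row would be dropped
    return isbn.rjust(10, "0")   # same idea as lpad(ISBN, 10, '0')


# Hypothetical inputs: one short value, one valid value, one over-long value.
samples = [" 195153448 ", "074322678X", "97803930452180"]
cleaned = [normalize_isbn(s) for s in samples]
print(cleaned)
```

Padding matters because a leading zero can be lost when a file is exported from a spreadsheet, and an ISBN that lost its zero would not match the same book in the ratings data.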

We see that 4,619 books in the dataset have no year of publication (recorded as 0), and that a handful of books have a publication year in the future.

In [14]:
display(books_data.groupby("Year-Of-Publication").count().sort(col("Year-Of-Publication").desc()))

Year-Of-Publication,count
2050,2
2038,1
2037,1
2030,7
2026,1
2024,1
2021,1
2020,3
2012,1
2011,2


Clean the Year of Publication: a year of 0 is replaced with 1900, and any year after 2020 is capped at 2020.

In [16]:
books_data = books_data.withColumn("Year-Of-Publication", \
              when(books_data["Year-Of-Publication"] == 0, 1900).otherwise(books_data["Year-Of-Publication"]))

In [17]:
books_data = books_data.withColumn("Year-Of-Publication", \
              when(books_data["Year-Of-Publication"] > 2020, 2020).otherwise(books_data["Year-Of-Publication"]))
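
The two cells above apply a simple rule to each year. A minimal plain-Python sketch of that rule follows; the function name `clean_year` is hypothetical, and the counts are copied from the first year table shown earlier:

```python
def clean_year(year: int, floor: int = 1900, cap: int = 2020) -> int:
    """Replace a missing year (0) with the floor and cap future years."""
    if year == 0:
        return floor
    if year > cap:
        return cap
    return year


cleaned_years = [clean_year(y) for y in [0, 1999, 2004, 2030, 2050]]
print(cleaned_years)

# Counts from the first table for the years 2020 through 2050.
# After capping, all of these rows land in the 2020 bucket.
counts_2020_and_later = {2050: 2, 2038: 1, 2037: 1, 2030: 7,
                         2026: 1, 2024: 1, 2021: 1, 2020: 3}
rows_in_2020_bucket = sum(counts_2020_and_later.values())
print(rows_in_2020_bucket)
```

The total of 17 agrees with the count for 2020 in the table produced after cleaning.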

In [18]:
display(books_data.groupby("Year-Of-Publication").count().sort(col("Year-Of-Publication").desc()))

Year-Of-Publication,count
2020,17
2012,1
2011,2
2010,2
2008,1
2006,3
2005,46
2004,5839
2003,14359
2002,17627


In [19]:
books_data.show(5)

Check for null values

In [21]:
books_data.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in books_data.columns]).show()
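
The expression above counts, for every column, the rows whose value is either NaN or null. A small plain-Python sketch of the same idea, assuming toy rows in which the missing values are planted purely for illustration:

```python
import math

# Toy rows; the None and NaN entries are made up to show the counting.
rows = [
    {"ISBN": "0195153448", "Author": "Mark P. O. Morford", "Year": 2002.0},
    {"ISBN": "0002005018", "Author": None, "Year": float("nan")},
    {"ISBN": "0060973129", "Author": "Carlo D'Este", "Year": 1991.0},
]


def is_missing(value) -> bool:
    """True for None (null) and for floating-point NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# One count per column, like one aliased count(...) per column in the cell above.
missing_counts = {
    column: sum(is_missing(row[column]) for row in rows)
    for column in rows[0]
}
print(missing_counts)
```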

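Finally, I need to cast "Year-Of-Publication" to an integer type and create a new dataframe with better naming conventions.

In [23]:
# A cast returns a new column expression, so the result must be assigned back with withColumn
books_data = books_data.withColumn("Year-Of-Publication", col("Year-Of-Publication").cast(IntegerType()))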

books_info = books_data.select(
                col("ISBN"),
                col("Book-Title").alias("Title"),
                col("Book-Author").alias("Author"),
                col("Year-Of-Publication").alias("Year_Published"),
                col("Publisher"),
                col("Image-URL-S").alias("Image_URL_S"),
                col("Image-URL-M").alias("Image_URL_M"),
                col("Image-URL-L").alias("Image_URL_L"))


display(books_info)

ISBN,Title,Author,Year_Published,Publisher,Image_URL_S,Image_URL_M,Image_URL_L
0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg
0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg
0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0060973129.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0060973129.01.LZZZZZZZ.jpg
0374157065,Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0374157065.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0374157065.01.LZZZZZZZ.jpg
0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,"W. W. Norton &, Company",http://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg
0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0399135782.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0399135782.01.LZZZZZZZ.jpg
0425176428,What If?: The World's Foremost Military Historians Imagine What Might Have Been,Robert Cowley,2000,Berkley Publishing Group,http://images.amazon.com/images/P/0425176428.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0425176428.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0425176428.01.LZZZZZZZ.jpg
0671870432,PLEADING GUILTY,Scott Turow,1993,Audioworks,http://images.amazon.com/images/P/0671870432.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0671870432.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0671870432.01.LZZZZZZZ.jpg
0679425608,Under the Black Flag: The Romance and the Reality of Life Among the Pirates,David Cordingly,1996,Random House,http://images.amazon.com/images/P/0679425608.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0679425608.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0679425608.01.LZZZZZZZ.jpg
074322678X,Where You'll Find Me: And Other Stories,Ann Beattie,2002,Scribner,http://images.amazon.com/images/P/074322678X.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/074322678X.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/074322678X.01.LZZZZZZZ.jpg


In [24]:
books_info.printSchema()

#### Exploration of Authors
In the next code blocks, I explored the top 5 authors in the dataset.  The author with the most ISBN numbers is Agatha Christie, followed by William Shakespeare, Stephen King, Ann M. Martin, and Francine Pascal.  This does not mean that these authors wrote this many books; it means that their works have this many ISBN numbers, since each edition of a work gets its own ISBN.

In [26]:
books_info.createOrReplaceTempView("books_info")

In [27]:
%sql

select Author, count(distinct ISBN) AS Count
from books_info 
group by Author
order by Count desc
Limit 5


Author,Count
Agatha Christie,632
William Shakespeare,567
Stephen King,524
Ann M. Martin,423
Francine Pascal,373


In this section, I take a look at Stephen King's books in the dataset in order to get an idea of the metadata and to see the covers.

In [29]:
king_df = sqlContext.sql("select Title, Author, Year_Published, Image_URL_L \
                          from books_info \
                          where Author = 'Stephen King' \
                          Limit 5")
king_df.show()

In [30]:
from IPython.display import Image, HTML

kingDF = king_df.toPandas()
# Wrap each non-empty URL in an <img> tag; regex=True is needed in recent pandas versions
kingDF['Image_URL_L'] = kingDF['Image_URL_L'].str.replace(r'^(.+)$', r'<img src="\1" style="max-height:124px;"></img>', regex=True)

with pd.option_context('display.max_colwidth', 10000):
  display(HTML(kingDF[["Title", "Author",'Year_Published',"Image_URL_L" ]].to_html(escape=False)))





Unnamed: 0,Title,Author,Year_Published,Image_URL_L
0,The Girl Who Loved Tom Gordon,Stephen King,2000,
1,Pet Sematary,Stephen King,1994,
2,Mientras Escribo,Stephen King,2002,
3,The Shining,Stephen King,2001,
4,Dreamcatcher,Stephen King,2001,


## Import and Explore the Book Ratings Data

The second dataset contains user ratings for the books. You can see from its schema that this dataset consists of three features:

User-ID: a unique identification number for each user providing ratings <br />
ISBN: a unique identification number for each book in the dataset<br />
Book-Rating: a rating on a scale of 1 to 10. The column also contains 0 scores, which mark an implicit rating, meaning that the user did not score the book explicitly.

In [32]:
rawdata = sc.textFile("dbfs:/FileStore/tables/BX_Book_Ratings.csv")
book_Ratings = spark.read.csv(rawdata, header=True, inferSchema=True)
book_Ratings.printSchema()


In this code block, I ensured that the ISBN identification numbers in the ratings dataset match those in the books dataset. The inner join keeps only the ratings whose ISBN also appears in the books data.

In [34]:
book_Ratings = book_Ratings.withColumn("ISBN", trim(col("ISBN")))
book_Ratings = book_Ratings.withColumn("length_of_ISBN", F.length("ISBN"))
book_Ratings = book_Ratings.join(books_data, on=["ISBN"], how='inner') 
book_Ratings = book_Ratings.select(
                col("User-ID").alias("userId"),
                col("ISBN").alias("ISBN"),
                col("Book-Rating").alias("rating"))


In [35]:
book_Ratings.show(5)

#### A Look at the Ratings
After ensuring that the ISBN numbers in the ratings dataset are also in the books dataset, there are 1,031,174 user ratings in this dataset.

In [37]:
book_Ratings.select("userId").count()

##### Frequency of the Ratings

The ratings range from 0 to 10.  I am assuming that 0 means that no rating is available, not that the work is rated 0. 
The most frequent value is 0, at roughly 647K of the 1,031,174 rows, which means that explicit ratings are sparse.
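
As a quick worked check of how dominant the zeros are, the share can be computed from the two figures reported in this section (the total row count and the frequency of rating 0). The variable names are only illustrative:

```python
total_ratings = 1_031_174   # rows in the ratings data after the join
zero_ratings = 647_323      # rows with a rating of 0

zero_share = zero_ratings / total_ratings
explicit_ratings = total_ratings - zero_ratings

print(f"{zero_share:.1%} of the rows are zeros")
print(f"{explicit_ratings:,} rows carry a rating from 1 to 10")
```

Roughly 63% of the rows are zeros, leaving 383,851 rows with a score from 1 to 10.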

In [39]:
# Count the total number of ratings in the dataset
ratings_frequency = book_Ratings.select("rating").groupby("rating").agg(count("*").alias("Frequency")).sort("rating")
display(ratings_frequency)               

rating,Frequency
0,647323
1,1481
2,2375
3,5119
4,7617
5,45355
6,31689
7,66404
8,91806
9,60779


### A Look at the Users

There are 92,106 distinct users in this dataset.

In [41]:
book_Ratings.select("userId").distinct().count()

#### Limit Users in the Dataset

Given the processing time issues discussed in the introduction, I had to limit the ratings dataset to the 1,000 users with the most ratings, which narrows the dataset to 508,585 user ratings.
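
The filtering step counts ratings per user, keeps the most active users, and then keeps only the ratings that belong to them. A minimal plain-Python sketch of that logic, assuming toy `(userId, ISBN, rating)` triples and a hypothetical helper `keep_top_users`:

```python
from collections import Counter

# Toy (userId, ISBN, rating) triples; the values are illustrative only.
toy_ratings = [
    (1, "A", 5), (1, "B", 0), (1, "C", 8),
    (2, "A", 7), (2, "C", 0),
    (3, "B", 9),
]


def keep_top_users(ratings, n):
    """Keep only the ratings made by the n users with the most ratings."""
    counts = Counter(user for user, _, _ in ratings)
    top_users = {user for user, _ in counts.most_common(n)}
    # Equivalent in spirit to an inner join back onto the ratings.
    return [row for row in ratings if row[0] in top_users]


filtered = keep_top_users(toy_ratings, n=2)
print(len(filtered), sorted({row[0] for row in filtered}))
```

One trade-off of this choice is that the model only ever sees very active users, so its results may not carry over to users with few ratings.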

In [43]:
ratings = book_Ratings.select("userId").groupby("userId").agg(count("userId").alias("Frequency")).sort("Frequency", ascending=False).limit(1000)

bookRatings = ratings.join(book_Ratings, ["userId"], how="inner")

bookRatings.select("userId").count()



### Frequency of Users
In the output below, we see that user 11676 rated the most books, at over 11,000.  According to the Users dataset (which was not imported), user 11676 stands for N/A or null. In other words, no demographic information was collected for this user.

In [45]:
users_frequency = bookRatings.select("userId").groupby("userId").agg(count("*").alias("Frequency")).sort("Frequency", ascending=False)
display(users_frequency) 

userId,Frequency
11676,11144
198711,6456
153662,5814
98391,5779
35859,5646
212898,4290
278418,3996
76352,3329
110973,2971
235105,2943


### Most Rated Books

In this section, I looked at the six books with the most ratings.

In [47]:
bookRatings.createOrReplaceTempView("ratings_info")
ISBN_df = sqlContext.sql("select ISBN, count(ISBN) as Frequency \
                          from ratings_info \
                          group by ISBN \
                          order by Frequency desc \
                          limit 6")
                  

most_rated = ISBN_df.join(books_data, on=["ISBN"], how='inner')  
display(most_rated.sort(col("Frequency"), ascending=True))

ISBN,Frequency,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,length_of_ISBN
044021145X,219,The Firm,John Grisham,1992,Bantam Dell Publishing Group,http://images.amazon.com/images/P/044021145X.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/044021145X.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/044021145X.01.LZZZZZZZ.jpg,10
0440214041,230,The Pelican Brief,John Grisham,1993,Dell,http://images.amazon.com/images/P/0440214041.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0440214041.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0440214041.01.LZZZZZZZ.jpg,9
0060928336,237,Divine Secrets of the Ya-Ya Sisterhood: A Novel,Rebecca Wells,1997,Perennial,http://images.amazon.com/images/P/0060928336.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0060928336.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0060928336.01.LZZZZZZZ.jpg,8
0385504209,239,The Da Vinci Code,Dan Brown,2003,Doubleday,http://images.amazon.com/images/P/0385504209.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0385504209.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0385504209.01.LZZZZZZZ.jpg,9
0316666343,296,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown",http://images.amazon.com/images/P/0316666343.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0316666343.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0316666343.01.LZZZZZZZ.jpg,9
0971880107,383,Wild Animus,Rich Shapero,2004,Too Far,http://images.amazon.com/images/P/0971880107.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0971880107.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0971880107.01.LZZZZZZZ.jpg,9


I found that the most rated book, with 383 ratings in the filtered dataset, was "Wild Animus" by Rich Shapero.  I had never heard of this book before, and after some research, its history looks a little sketchy: it seems more like a marketing project to sell a book than an actual book.  For this analysis, I left it in. 

https://litreactor.com/columns/what-the-hell-is-wild-animus

In [49]:
most_rated_df = most_rated.toPandas()
most_rated_df['Image-URL-L'] = most_rated_df['Image-URL-L'].str.replace(r'^(.+)$', r'<img src="\1" style="max-height:124px;"></img>', regex=True)

with pd.option_context('display.max_colwidth', 10000):
  display(HTML(most_rated_df[["Book-Title", "Book-Author","Year-Of-Publication","Image-URL-L" ]].to_html(escape=False)))




Unnamed: 0,Book-Title,Book-Author,Year-Of-Publication,Image-URL-L
0,Wild Animus,Rich Shapero,2004,
1,The Lovely Bones: A Novel,Alice Sebold,2002,
2,The Firm,John Grisham,1992,
3,The Da Vinci Code,Dan Brown,2003,
4,Divine Secrets of the Ya-Ya Sisterhood: A Novel,Rebecca Wells,1997,
5,The Pelican Brief,John Grisham,1993,


## Data Preparation

In order for the ALS algorithm to work, it needs the Item column, in this case ISBN, to be an integer.  To do that, I took the following steps: <br />

Extract unique ISBN ids <br/>
Assign Unique integers to each id<br/>
Rejoin unique integer ids back to the ratings data<br/>
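
The three steps above can be sketched in plain Python as follows. This is only an illustration of the mapping idea on toy rows, not the Spark code used in this notebook, and the names `isbn_to_id` and `ratings_with_ids` are hypothetical:

```python
# Toy (userId, ISBN, rating) rows; the ratings are illustrative only.
toy_ratings = [
    (11676, "0971880107", 0),
    (11676, "0316666343", 9),
    (198711, "0971880107", 7),
]

# Step 1: extract the unique ISBN ids (sorted only to make the numbering reproducible).
unique_isbns = sorted({isbn for _, isbn, _ in toy_ratings})

# Step 2: assign a unique integer to each ISBN.
isbn_to_id = {isbn: idx for idx, isbn in enumerate(unique_isbns)}

# Step 3: attach the integer id to every rating row.
ratings_with_ids = [
    (user, isbn_to_id[isbn], rating) for user, isbn, rating in toy_ratings
]
print(ratings_with_ids)
```

An ISBN such as `074322678X` cannot simply be cast to a number because of the trailing `X`, which is one reason a lookup table is a safer choice than a direct cast.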

In [51]:
ISBN_df = bookRatings.select('ISBN').distinct()
ISBN_df.show(5)

In the block below, I created a new field, isbnId, which numbers the distinct ISBN values sequentially. Coalescing to a single partition first keeps the generated ids consecutive.

In [53]:
ISBN_df = ISBN_df.coalesce(1)
ISBN_df = ISBN_df.withColumn("isbnId", monotonically_increasing_id()).persist()
ISBN_df.show(5)

In [54]:
ISBN_df.printSchema()

I then joined the ISBN_df dataframe to the bookRatings dataframe so that each rating carries the integer id of its ISBN.

In [56]:
book_ratings_01 = bookRatings.join(ISBN_df, ["ISBN"], "left")
book_ratings_01.printSchema()

In [57]:
book_ratings_01.select("rating").count()

In the final step, I created the final dataset for analysis, ratings_data

In [59]:
ratings_data = book_ratings_01.select(col("userId"), col("isbnId"),  col("rating"))

ratings_data.sort("userId").show(5)

Below, I perform a quick check for nulls.

In [61]:
ratings_data.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in ratings_data.columns]).show()

### Build the Model

As discussed in the introduction, my plan was to build the ALS model using cross validation to find the best model, but after several attempts and using a wide array of parameters, the model took too long to run on Databricks. Failing that, I created a custom function that tries and tests different hyper-parameters, but that too timed out.

As a last resort, I hard coded the hyper-parameters, and found that using a rank=16 was the only way to get the model to run.  Any rank over 16, and the process timed out.

In [63]:
(training, test)= ratings_data.randomSplit([0.8,0.2],seed=123)
training.cache()


In [64]:
als = ALS(userCol="userId", itemCol="isbnId", ratingCol="rating",
          rank=16,
          regParam = .975,
          maxIter=20,
          coldStartStrategy="drop",
          nonnegative=True,
          implicitPrefs=False)

In [65]:
model_01 = als.fit(training)


In [66]:
predictions_01 = model_01.transform(test)

In [67]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

In [68]:
rmse = evaluator.evaluate(predictions_01)
print(rmse)

#### RMSE

With the hyper-parameters rank=16, regParam=0.975, and maxIter=20, the model's RMSE is 3.4, which is not very good on a 0 to 10 rating scale, but given the processing limitations, this is the best model that I could build.
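
For reference, RMSE is the square root of the mean squared difference between the actual and the predicted ratings. A minimal plain-Python sketch of the formula on toy values follows; the numbers are made up and are not output from the model:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy ratings and predictions, for illustration only.
toy_actual = [10, 0, 8, 0]
toy_predicted = [7.0, 1.0, 8.0, 2.0]
toy_rmse = rmse(toy_actual, toy_predicted)
print(round(toy_rmse, 3))
```

Read this way, an RMSE of 3.4 means that the predictions miss the recorded rating by about 3.4 points in the root-mean-square sense.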

### Recommendations

In the sections below, I create two functions, user_history and user_suggestions, that look at a user's history and make recommendations to the user with the model that I built.

##### Recommendations for user 11676

The first user that I chose was the user with the most ratings, user 11676.

Below are the top 5 rated books for user 11676

In [73]:
def user_history(userId):
    userHistory_df = training.filter(training['userId']==userId).sort(col("rating"), ascending=False).limit(5)
    userHistory_df = userHistory_df.join(ISBN_df, "isbnId", "inner")
    userHistory_df = userHistory_df.join(books_info, "ISBN", "inner")
    userHistory_df = userHistory_df.toPandas()
    userHistory_df['Image_URL_L'] = userHistory_df['Image_URL_L'].str.replace(r'^(.+)$', r'<img src="\1" style="max-height:124px;"></img>', regex=True)
    return(userHistory_df)

userHistory = user_history(11676)

with pd.option_context('display.max_colwidth', 10000):
    display(HTML(userHistory[["Title", "Author","Year_Published", "rating", "Image_URL_L" ]].to_html(escape=False)))





Unnamed: 0,Title,Author,Year_Published,rating,Image_URL_L
0,I Am Winnie the Pooh (Golden Story Book),Betty Birney,1994,10,
1,The Stolen,Alex Shearer,2002,10,
2,Remembrance,Danielle Steel,1983,10,
3,House of Secrets,Lowell Cauffiel,1998,10,
4,The Second Time Around : A Novel,Mary Higgins Clark,2003,10,


User Recommendations for user 11676

Below are the top recommendations for the user: the books in the test set with the highest predicted ratings.

In [75]:
def user_suggestions(userId):
  user_suggest = test.filter(test['userId']==userId).select(['isbnId', 'userId'])
  user_offer = model_01.transform(user_suggest)
  user_offer = user_offer.join(ISBN_df, "isbnId", "inner")
  user_offer = user_offer.join(books_info, "ISBN", "inner")
  user_offer = user_offer.orderBy('prediction', ascending=False).limit(5)
  user_offer = user_offer.toPandas()
  user_offer['Image_URL_L'] = user_offer['Image_URL_L'].str.replace(r'^(.+)$', r'<img src="\1" style="max-height:124px;"></img>', regex=True)
  return(user_offer)

userSuggestions = user_suggestions(11676)

with pd.option_context('display.max_colwidth', 10000):
    display(HTML(userSuggestions[["Title", "Author","prediction", "Year_Published",  "Image_URL_L" ]].to_html(escape=False)))




Unnamed: 0,Title,Author,prediction,Year_Published,Image_URL_L
0,Frankenstein (Watermill Classic),Mary Wollstonecraft Shelley,10.516162,1993,
1,Birds of Prey: A Novel of Suspense,J.A. Jance,8.620851,2001,
2,"We Are Experiencing Parental Difficulties...Please Stand By : Baby Blues Scrapbook No.5 (Baby Blues Scrapbook, No 5)",Rick Kirkman,8.554343,1995,
3,The Other,Thomas Tryon,7.695937,1987,
4,Leadership is an Art,MAX DEPREE,7.467162,1990,


User histories and recommendations follow for users 200674, 28204, and 119575.

In [77]:
userHistory = user_history(200674)

with pd.option_context('display.max_colwidth', 10000):
    display(HTML(userHistory[["Title", "Author","Year_Published", "rating", "Image_URL_L" ]].to_html(escape=False)))


Unnamed: 0,Title,Author,Year_Published,rating,Image_URL_L
0,When the Wind Blows,James Patterson,1999,10,
1,She Who Remembers,Linda Lay Shuler,1989,10,
2,The Perfect Husband,LISA GARDNER,1997,10,
3,In the Name of Love : Ann Rule's Crime Files Volume 4 (Ann Rule's Crime Files),Ann Rule,1998,10,
4,Random Acts of Kindness,Dawna Markova,2002,10,


In [78]:
userSuggestions = user_suggestions(200674)

with pd.option_context('display.max_colwidth', 10000):
    display(HTML(userSuggestions[["Title", "Author","prediction", "Year_Published",  "Image_URL_L" ]].to_html(escape=False)))

Unnamed: 0,Title,Author,prediction,Year_Published,Image_URL_L
0,TV Troubl-Trol: Man St,Brown,0.405629,1997,
1,Fertile Ground,Charles Wilson,0.292099,1996,
2,Tears of Rage,John Walsh,0.287797,1998,
3,Shiver: A Novel,Brian Harper,0.277616,1992,
4,I Hope You Dance,Mark D. Sanders,0.269828,2000,


In [79]:
userHistory = user_history(28204)

with pd.option_context('display.max_colwidth', 10000):
    display(HTML(userHistory[["Title", "Author","Year_Published", "rating", "Image_URL_L" ]].to_html(escape=False)))

Unnamed: 0,Title,Author,Year_Published,rating,Image_URL_L
0,Deja Dead,Kathy Reichs,1998,9,
1,"Angels &, Demons",Dan Brown,2001,10,
2,Cats and Their Women,Louise Taylor,1992,10,
3,The Pillars of the Earth,Ken Follett,1996,10,
4,Toujours Provence (Vintage Departures),Peter Mayle,1992,10,


In [80]:
userSuggestions = user_suggestions(28204)

with pd.option_context('display.max_colwidth', 10000):
    display(HTML(userSuggestions[["Title", "Author","prediction", "Year_Published",  "Image_URL_L" ]].to_html(escape=False)))

Unnamed: 0,Title,Author,prediction,Year_Published,Image_URL_L
0,Ethan Frome: Authoritative Text Backgrounds and Contexts Criticism (Norton Critical Editions),Edith Wharton,1.838138,1995,
1,The Complete Guide to Writing Fiction,Barnaby Conrad,1.470906,1990,
2,L'Amant,Marguerite Duras,1.293505,1984,
3,Countdown,David Hagberg,1.249985,1991,
4,"IMZADI: STAR TREK, NEXT GENERATION (Star Trek the Next Generation)",Peter David,1.249911,1992,


In [81]:
userHistory = user_history(119575)

with pd.option_context('display.max_colwidth', 10000):
    display(HTML(userHistory[["Title", "Author","Year_Published", "rating", "Image_URL_L" ]].to_html(escape=False)))

Unnamed: 0,Title,Author,Year_Published,rating,Image_URL_L
0,"The Bungalow Mystery (Nancy Drew Mystery Stories, No 3)",Carolyn Keene,1991,10,
1,Warlock,Wilbur A. Smith,2001,10,
2,Here Comes Garfield,Jim Davis,1982,10,
3,"The Secret of Shadow Ranch (Nancy Drew Mystery Stories, No 5)",Carolyn Keene,1965,10,
4,City of Joy,Dominique Lapierre,1986,10,


In [82]:
userSuggestions = user_suggestions(119575)

with pd.option_context('display.max_colwidth', 10000):
    display(HTML(userSuggestions[["Title", "Author","prediction", "Year_Published",  "Image_URL_L" ]].to_html(escape=False)))

Unnamed: 0,Title,Author,prediction,Year_Published,Image_URL_L
0,"White Gold Wielder (The Second Chronicles of Thomas Covenant, Bk. 3)",Stephen R. Donaldson,3.386677,1983,
1,"A Clash of Kings (A Song of Fire and Ice, Book 2)",George R. R. Martin,2.67503,2000,
2,Spanky,Christopher Fowler,2.20523,2000,
3,Salem's Lot,Stephen King,2.130235,1999,
4,"A Game of Thrones (A Song of Ice and Fire, Book 1)",George R.R. Martin,1.859701,1997,


## Code that Did Not Run

In the next three blocks, I present commented code that timed out the cluster before completing. The first two blocks build the cross validation model. The third block, the function InterateALS, was my attempt to build a custom cross validation routine.  After multiple attempts tweaking each approach and doing research, I surmise that this code times out because of the limits of the Community Edition of Databricks.  Maybe if I upgraded, these blocks would run.

### Cross Validation

In [85]:
# param_grid = ParamGridBuilder() \
#               .addGrid(als.rank, [1,5,10]) \
#              .addGrid(als.maxIter, [20]) \
#              .addGrid(als.regParam, [0.75, 0.1, 0.125]) \
#              .build()
# print(len(param_grid))

In [86]:
# cv = CrossValidator(estimator = als,
#                      estimatorParamMaps = param_grid, 
#                      evaluator = evaluator,
#                      numFolds = 10)

# model = cv.fit(training) 
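
To get a feel for why this step is expensive, the number of ALS fits implied by the grid and fold settings above can be counted. This is a back-of-the-envelope sketch in plain Python. It assumes the grid values from the commented code and relies on the fact that CrossValidator fits one model per parameter combination per fold, then refits the best combination on the full training data:

```python
from itertools import product

# Values copied from the commented-out ParamGridBuilder and CrossValidator cells.
ranks = [1, 5, 10]
max_iters = [20]
reg_params = [0.75, 0.1, 0.125]
num_folds = 10

grid = list(product(ranks, max_iters, reg_params))
cv_fits = len(grid) * num_folds   # one fit per combination per fold
total_fits = cv_fits + 1          # plus the final refit of the best combination

print(len(grid), cv_fits, total_fits)
```

With 9 combinations and 10 folds, that is 91 fits in total, compared with the single fit used for the final model in this notebook.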
                    

### Iterate Model

In [88]:
# Note: the rating column in ratings_data is named 'rating', and regParam must be defined before it is printed.
# def InterateALS(data, k=3, userCol='userId', itemCol='isbnId', ratingCol='rating', metricName='rmse'):
#         models = []
#         for i in range(1, k+1): 
#             (trainingSet, testingSet) = data.randomSplit([.8,.2])
#             trainingSet.cache()
#             testingSet.cache()
#             rank=60          
#             maxIter=200
#             regParam=.1
#             als = ALS(userCol=userCol, itemCol=itemCol, ratingCol=ratingCol,
#                       rank=rank,
#                       regParam = regParam,
#                       maxIter = maxIter,
#                       nonnegative = True,
#                       coldStartStrategy="drop",
#                       implicitPrefs = False)
#             model = als.fit(trainingSet)
#             predictions = model.transform(testingSet)
#             evaluator = RegressionEvaluator(metricName=metricName, labelCol=ratingCol, predictionCol='prediction')
#             evaluation = evaluator.evaluate(predictions)
#             print('Iteration {}: {} = {}, rank={}, regParam ={}, maxIter={}'.format(i , metricName, evaluation,rank, regParam, maxIter))
#             models.append(model)
#         return models


### Summary

The accuracy of the model is not very high, as shown by the RMSE of 3.4, which is due to the limitations on tuning the hyper-parameters of the model. However, the approach to the project is sound, and it does render results. Personally, I learned a lot about pyspark during this project, and I am eager to learn more.  If I had more time, maybe I could improve the model. It's definitely worth studying.

### References

Apache Spark https://spark.apache.org/<br />
Cai-Nicolas Ziegler, "Books-Crossing Dataset", http://www2.informatik.uni-freiburg.de/~cziegler/BX/<br />
Jamen Long (DataCamp), "Building Recommendation Engines with PySpark",  https://learn.datacamp.com/courses/recommendation-engines-in-pyspark<br />
Saket Garodia (Towards Data Science Blog), "Building a recommendation engine to recommend books in Spark", https://towardsdatascience.com/building-a-recommendation-engine-to-recommend-books-in-spark-f09334d47d67<br />
M Hendra Herviawan, "Movie Recommendation based on Alternating Least Squares (ALS) with Apache Spark", https://hendra-herviawan.github.io/build-movie-recommendation-with-apache-spark.html