## Data Processing Pipeline - Spark EMR Cluster, with Livy as the REST Interface to Spark Clusters

We will look at the [book ratings dataset](https://www.kaggle.com/zygmunt/goodbooks-10k), which contains about 6 million ratings of 10,000 books by 53,424 users. The goal of this notebook is to create a subset of ratings that keeps only users who have rated more than 1% of the books and books that have been rated by at least 2% of the users. The resulting dataset will then carry a rich history of user preferences, along with the popularity of books. The intent is to use the generated dataset to recommend books to users.

The two source datasets can be placed in a designated S3 bucket:

-  ratings.csv
-  books.csv

In [1]:
%%info

In [38]:
s3_bucket = 's3://ai-in-aws/'
prefix = 'Chapter7/'
output_ds_loc = prefix + 'object2vec/'

### Read the Book Ratings Dataset

In [3]:
from pyspark.sql.window import Window 
from pyspark.sql import functions as F

In [5]:
#ratings = spark.read.option("header","true").option("quote", "\"").option("delimiter", ";").csv("s3://ai-in-aws/awsglue-datasets/BX-Book-Ratings.csv")

# Read ratings dataset
ratings = spark.read.option("header","true").option("delimiter", ",").csv(s3_bucket + prefix + "ratings.csv")
#Read books csv to load book title
books_csv = spark.read.option("header","true").option("delimiter", ",").csv(s3_bucket + prefix + "books.csv")


In [10]:
ratings.count()

5976479

Explore the first few records of the ratings dataframe

In [5]:
ratings.show(20)

+-------+-------+------+
|user_id|book_id|rating|
+-------+-------+------+
|      1|    258|     5|
|      2|   4081|     4|
|      2|    260|     5|
|      2|   9296|     5|
|      2|   2318|     3|
|      2|     26|     4|
|      2|    315|     3|
|      2|     33|     4|
|      2|    301|     5|
|      2|   2686|     5|
|      2|   3753|     5|
|      2|   8519|     5|
|      4|     70|     4|
|      4|    264|     3|
|      4|    388|     4|
|      4|     18|     5|
|      4|     27|     5|
|      4|     21|     5|
|      4|      2|     5|
|      4|     23|     5|
+-------+-------+------+
only showing top 20 rows

Explore the first few records of the books_csv dataframe

In [13]:
books_csv.show(5)

+-------+-----------------+------------+-------+-----------+---------+-----------------+--------------------+-------------------------+--------------------+--------------------+-------------+--------------+-------------+------------------+-----------------------+---------+---------+---------+---------+---------+--------------------+--------------------+
|book_id|goodreads_book_id|best_book_id|work_id|books_count|     isbn|           isbn13|             authors|original_publication_year|      original_title|               title|language_code|average_rating|ratings_count|work_ratings_count|work_text_reviews_count|ratings_1|ratings_2|ratings_3|ratings_4|ratings_5|           image_url|     small_image_url|
+-------+-----------------+------------+-------+-----------+---------+-----------------+--------------------+-------------------------+--------------------+--------------------+-------------+--------------+-------------+------------------+-----------------------+---------+---------+-----

Obtain book title and average rating across all users

In [17]:
books_csv.select("title", "average_rating").show(5)

+--------------------+--------------+
|               title|average_rating|
+--------------------+--------------+
|The Hunger Games ...|          4.34|
|Harry Potter and ...|          4.44|
|Twilight (Twiligh...|          3.57|
|To Kill a Mocking...|          4.25|
|    The Great Gatsby|          3.89|
+--------------------+--------------+
only showing top 5 rows

### Understand characteristics of the Data
Let us analyze the number of ratings by user and by book. We will define a function, value_counts(), that groups the ratings by a given column (user or book) and counts the rows in each group.

In [6]:
def value_counts(df, colName):
    return (df.groupby(colName).count()
              .orderBy('count', ascending=False))
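
To see what value_counts() produces without a cluster, here is a minimal plain-Python analogue. The helper name `value_counts_local` and the toy rows are hypothetical and only illustrate the group, count and sort-descending steps; they are not part of the notebook.

```python
from collections import Counter


def value_counts_local(rows, col_name):
    """Count rows per distinct value of col_name, most frequent first."""
    counts = Counter(row[col_name] for row in rows)
    # most_common() sorts by descending count, like orderBy('count', ascending=False)
    return counts.most_common()


# Hypothetical toy ratings, not taken from the dataset
toy_ratings = [
    {"user_id": 1, "book_id": 258, "rating": 5},
    {"user_id": 2, "book_id": 258, "rating": 4},
    {"user_id": 2, "book_id": 260, "rating": 5},
    {"user_id": 2, "book_id": 26, "rating": 3},
]

per_user = value_counts_local(toy_ratings, "user_id")
per_book = value_counts_local(toy_ratings, "book_id")
print(per_user)  # [(2, 3), (1, 1)]
print(per_book)
```

Each pair is (value, count), which corresponds to one row of the dataframe returned by value_counts().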

We will begin by looking at the distribution of users by the number of books they rated (bottom 1% of users, bottom 2% of users, ..., top 1% of users).

In [7]:
# Number of ratings per user
users = value_counts(ratings, 'user_id')
users.approxQuantile('count', [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 1.0], 0.01)
#approxQuantile(col, probabilities, relativeError)

[19.0, 19.0, 61.0, 70.0, 71.0, 71.0, 82.0, 96.0, 112.0, 128.0, 148.0, 162.0, 164.0, 164.0, 171.0, 200.0, 200.0]

The approxQuantile() method of a PySpark dataframe takes a column name, a list of quantile probabilities and a relative error. Its output is the approximate quantile at each of the given probabilities.
As can be seen above, the top 1% of users rated about 200 books, users in the middle of the distribution rated about 112 books, and users in the bottom 25% rated at most about 96 books.
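
The relative error deserves a closer look. Spark's documentation states that, for a dataframe with N rows, the value returned for probability p has an exact rank between floor((p - err) * N) and ceil((p + err) * N). The sketch below computes that rank interval. The helper name `rank_bounds` is hypothetical, and N is assumed to be the 53,424 users quoted in the introduction.

```python
import math


def rank_bounds(p, relative_error, n):
    """Rank interval in which the value returned for probability p may fall."""
    lower = max(0, math.floor((p - relative_error) * n))
    upper = min(n, math.ceil((p + relative_error) * n))
    return lower, upper


n_users = 53424  # assumption: number of users stated in the introduction
for p in (0.5, 0.99):
    print(p, rank_bounds(p, 0.01, n_users))
```

With a relative error of 0.01, the interval for p = 0.99 reaches the last rank, which may be why the values reported for 0.99 and 1.0 coincide in the output above. A smaller relative error narrows the interval at the cost of more computation.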

In [17]:
users.show()

+-------+-----+
|user_id|count|
+-------+-----+
|  30944|  200|
|  12874|  200|
|  52036|  199|
|  28158|  199|
|  12381|  199|
|   6630|  197|
|  45554|  197|
|  15604|  196|
|   9806|  196|
|  19729|  196|
|  37834|  196|
|   9668|  196|
|  14372|  196|
|  24143|  196|
|   7563|  196|
|   9731|  195|
|  38798|  195|
|  10509|  195|
|  33065|  195|
|  25840|  195|
+-------+-----+
only showing top 20 rows

Now let us look at the distribution of books by the number of users who rated them (bottom 1% of books, bottom 2% of books, ..., top 1% of books).

In [8]:
# Number of ratings per book
books = value_counts(ratings, 'book_id')
books.approxQuantile('count', [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 1.0], 0.01)

[8.0, 8.0, 96.0, 101.0, 101.0, 102.0, 119.0, 156.0, 254.0, 547.0, 1274.0, 2587.0, 3029.0, 3948.0, 3948.0, 22806.0, 22806.0]

As can be seen above, the top 1% of books were rated by up to 22,806 users, books in the middle of the distribution were rated by about 254 users, and books in the bottom 25% were rated by at most about 156 users.

### Prepare Analytics Ready Dataset

We will filter the ratings dataset to include only books that have been rated by at least 1200 users (2.2% of the entire user population) and only users who have rated at least 130 books (1.3% of the entire book population).

In [19]:
# Filter ratings by selecting books that have been rated by at least 1200 users and users who have rated at least 130 books
fil_users = users.filter(F.col("count") >= 130)
fil_books = books.filter(F.col("count") >= 1200)
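
The percentages quoted for the two thresholds can be checked with a short calculation. This is a sketch that assumes the population sizes given in the introduction (53,424 users and 10,000 books); the variable names are illustrative only.

```python
n_users = 53424  # users in the dataset, per the introduction
n_books = 10000  # books in the dataset, per the introduction

min_users_per_book = 1200  # threshold applied to `books`
min_books_per_user = 130   # threshold applied to `users`

# Express each absolute threshold as a share of the opposite population
book_threshold_pct = 100 * min_users_per_book / n_users
user_threshold_pct = 100 * min_books_per_user / n_books

print(f"{book_threshold_pct:.1f}% of users")  # 2.2% of users
print(f"{user_threshold_pct:.1f}% of books")  # 1.3% of books
```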

In [10]:
#Number of books meeting the threshold
fil_books.count()
#Number of users meeting the threshold
fil_users.count()

37084

Get the title and average rating for each of the books shortlisted

In [20]:
fil_books = fil_books.join(books_csv, on=['book_id'], how='inner')\
                    .select(F.col("book_id"),
                       F.col("count"),
                       F.col("title"),
                       F.col("average_rating")     
                       )

Filter the ratings dataset to include only the selected books and users

In [21]:
# Create filtered ratings containing users and books meeting thresholds 
fil_ratings = ratings.join(fil_users, on=['user_id'], how='inner').join(fil_books, on=['book_id'], how='inner')\
                .select(F.col("book_id"),
                       F.col("user_id"),
                       F.col("rating"),
                       F.col("title"),
                       F.col("average_rating")
                     )
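
The effect of the two inner joins can be sketched in plain Python on toy data: a rating survives only when both its user and its book passed their threshold. All names, rows and the small thresholds below are hypothetical stand-ins for 130 and 1200.

```python
from collections import Counter

# Hypothetical toy ratings as (user_id, book_id, rating) tuples
toy = [
    ("u1", "b1", 5), ("u1", "b2", 4), ("u1", "b3", 3),
    ("u2", "b1", 4), ("u2", "b2", 5),
    ("u3", "b1", 2),
]
MIN_BOOKS_PER_USER = 2  # stands in for 130
MIN_USERS_PER_BOOK = 2  # stands in for 1200

# Counts are taken on the full data, before any filtering
user_counts = Counter(u for u, _, _ in toy)
book_counts = Counter(b for _, b, _ in toy)

keep_users = {u for u, c in user_counts.items() if c >= MIN_BOOKS_PER_USER}
keep_books = {b for b, c in book_counts.items() if c >= MIN_USERS_PER_BOOK}

# Equivalent of the two inner joins: both keys must be in the kept sets
filtered = [r for r in toy if r[0] in keep_users and r[1] in keep_books]
print(filtered)
```

Because the counts are computed before filtering, a retained user can end up with fewer ratings than the threshold once unpopular books are dropped; in the toy data, u1 rated three books but keeps only two.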

In [25]:
# Final count of ratings, users and books

fil_ratings.count() # ~1 million - 1,051,299
fil_ratings.select('user_id').distinct().count() #12,347 users
fil_ratings.select('book_id').distinct().count()  #985 books

985

#### Create integer indexes for users and books to develop a recommender system

In [29]:
#Create indexes for books and users

#Determine unique users and books from ratings
uniq_users = value_counts(fil_ratings, 'user_id')
uniq_books = value_counts(fil_ratings, 'book_id')

#object2vec algorithm takes user_ind and book_ind starting from zero
w1 = Window.orderBy("user_id") 
uniq_users = uniq_users.withColumn("user_ind", F.row_number().over(w1)-1)

w2 = Window.orderBy("book_id") 
uniq_books = uniq_books.withColumn("book_ind", F.row_number().over(w2)-1)
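
The row_number() - 1 pattern amounts to sorting the distinct ids and numbering them from zero. A minimal plain-Python sketch follows, with hypothetical ids. It assumes the id columns are strings, since the read calls above set neither a schema nor inferSchema, so the ordering is lexicographic rather than numeric.

```python
# Hypothetical ids, held as strings as a schema-less CSV read would produce
user_ids = ["4", "2", "10", "2", "4"]

# Sort the distinct ids and number them from zero, like row_number() - 1
user_index = {uid: i for i, uid in enumerate(sorted(set(user_ids)))}
print(user_index)  # {'10': 0, '2': 1, '4': 2}
```

The lexicographic order does not matter for the recommender: each id still receives a unique index and the indexes are contiguous from zero.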

In [30]:
# Create filtered ratings containing user and book indexes, along with rating
upd_fil_ratings = fil_ratings.join(uniq_users, on=['user_id'], how='inner').join(uniq_books, on=['book_id'], how='inner')\
                .select(F.col("book_id"),
                       F.col("user_id"),
                       F.col("rating"),
                       F.col("title"), 
                       F.col("book_ind"),
                       F.col("user_ind"))

Check that the user and book indexes form a sequential, zero-based listing and that the rating values are greater than zero

In [35]:
fil_ratings.select('rating').distinct().show()
print(upd_fil_ratings.agg({"user_ind": "max"}).collect()[0])
print(upd_fil_ratings.agg({"book_ind": "max"}).collect()[0])

Row(max(book_ind)=984)
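
A maximum book_ind of 984 together with 985 distinct books is consistent with a gap-free, zero-based index. The check can be made explicit as below; the helper name `is_contiguous_zero_based` is hypothetical, and the sketch works on a plain list of indexes rather than on the Spark dataframe.

```python
def is_contiguous_zero_based(indexes):
    """True when the distinct indexes are exactly 0, 1, ..., n - 1."""
    distinct = set(indexes)
    return distinct == set(range(len(distinct)))


print(is_contiguous_zero_based([0, 1, 2, 2, 1]))  # True
print(is_contiguous_zero_based([0, 2, 3]))        # False, index 1 is missing
```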

Save the prepared dataset in parquet format

In [39]:
# output_ds_loc already ends with '/', so the file name needs no leading slash
upd_fil_ratings.write.parquet(s3_bucket + output_ds_loc + "bookratings.parquet", mode='overwrite')