# Book Recommendations

Every user has preferences regarding the items a service offers. These preferences can be inferred from a variety of signals (for books, for example): page view time, clicks, adding to favorites, ratings, written reviews, and so on.

All interactions between users and items can be converted to numbers and stored in a so-called user-item interaction matrix. Here is an example from [this article](https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada):

<img src="pics/interactions.png" width=900 height=450 align="center">
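As a minimal sketch of how such a matrix can be built, the snippet below pivots a small interaction log into a user-by-book table. The user ids, book ids, and ratings are made up for illustration and do not come from the datasets used here.

```python
import pandas as pd

# Toy interaction log; all ids and ratings are hypothetical.
interactions = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3', 'u3'],
    'book_id': ['b1', 'b2', 'b2', 'b1', 'b3'],
    'rating': [8, 6, 9, 4, 10],
})

# Rows are users, columns are books; NaN marks a missing interaction.
interaction_matrix = interactions.pivot(
    index='user_id', columns='book_id', values='rating')

# The share of empty cells shows how sparse the matrix is.
sparsity = interaction_matrix.isna().to_numpy().mean()
print(interaction_matrix)
print(f'Sparsity: {sparsity:.2f}')
```

A dense pivot like this only suits small examples; at the scale of real rating data a sparse representation is usually needed.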

The purpose of a recommender system is to predict the ratings a user would give to items they have not rated yet. Roughly, recommender algorithms can be classified as follows (the picture is from the same [article](https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada)):

<img src="pics/recommender_classification.png" width=900 height=450 align="center">

A typical difficulty in building recommender systems is the small number of ratings. To mitigate this problem, we will combine information from the [Book–Crossing](../book_crossing/data_prep) and [Goodreads](../goodreads/data_prep) datasets.

In [1]:
import os
import sys

# Append the sys.path with the project root path
sys.path.append(os.path.dirname(os.path.abspath('')))

In [2]:
import pickle
from typing import Dict, Tuple

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Set the random seed for reproducibility
RANDOM_SEED = 54
np.random.seed(RANDOM_SEED)

# Data Preparation

## Book Crossing

Load preprocessed Book–Crossing data:

In [3]:
path_bc = os.path.join('..', 'book_crossing', 'data_prep')
books_bc = pd.read_csv(os.path.join(path_bc, 'books.csv'),
                       usecols=['isbn13', 'book_title'],
                       index_col=['isbn13'],
                       dtype={'isbn13': 'category',
                              'book_title': 'str'})
ratings_bc = pd.read_csv(os.path.join(path_bc, 'ratings.csv'),
                         dtype={'user_id': 'category',
                                'rating': 'uint8',
                                'isbn13': 'category'})

In [4]:
books_bc.head(3)

isbn13,book_title
9780195153446,Classical Mythology
9780002005012,Clara Callan
9780060973124,Decision in Normandy


In [5]:
# Drop all implicit ratings
ratings_bc = ratings_bc[ratings_bc['rating'] > 0]
ratings_bc.head(3)

,user_id,rating,isbn13
1,276726,5,9780155061224
3,276729,3,9780521656153
4,276729,6,9780521795029


In [6]:
print(f'Number of books: {len(books_bc)}')
print(f'Number of ratings: {len(ratings_bc)}')

Number of books: 270947
Number of ratings: 384127


In [7]:
books_bc.rename(columns={'book_title': 'title'},
                inplace=True)
books_bc.head(2)

isbn13,title
9780195153446,Classical Mythology
9780002005012,Clara Callan


## Goodreads

Load preprocessed Goodreads data:

In [8]:
path_gr = os.path.join('..', 'goodreads', 'data_prep')
books_gr = pd.read_csv(os.path.join(path_gr, 'books.csv'),
                       usecols=['isbn13', 'title', 'work_id', 'book_id'],
                       index_col=['book_id'],
                       dtype={'isbn13': 'category', 'work_id': 'category',
                              'title': 'str', 'book_id': 'category'})
ratings_gr = pd.read_csv(os.path.join(path_gr, 'ratings.csv'),
                         usecols=['user_id', 'book_id', 'rating'],
                         dtype={'user_id': 'category',
                                'rating': 'float16',
                                'book_id': 'category'})

In [9]:
books_gr.head(3)

book_id,isbn13,work_id,title
5333265,9780312853129,5400751,W.C. Fields: A Life on Film
1333909,9780743509985,1323437,Good Harbor
6066819,9780743294294,6243154,Best Friends Forever


In [10]:
# Drop all implicit ratings
ratings_gr = ratings_gr[~ratings_gr['rating'].isna()]
ratings_gr['rating'] = ratings_gr['rating'].astype('uint8')
ratings_gr.head(3)

,rating,book_id,user_id
0,5,12,8842281e1d1347389f2ab93d60773d4d
1,5,21,8842281e1d1347389f2ab93d60773d4d
2,5,30,8842281e1d1347389f2ab93d60773d4d


In [11]:
print(f'Number of books: {len(books_gr)}')
print(f'Number of ratings: {len(ratings_gr)}')

Number of books: 1599130
Number of ratings: 89625534


As we remember from the data preprocessing stage, there are some duplicated ISBNs:

In [12]:
books_gr_duplicates = books_gr[books_gr.duplicated(['isbn13'], keep=False)]\
    .sort_values('isbn13')
books_gr_duplicates.head(5)

book_id,isbn13,work_id,title
26812295,9780007255764,19269242,The Year of Reading Dangerously: How Fifty Gre...
25412569,9780007255764,19269242,The Year of Reading Dangerously: How Fifty Gre...
25401812,9780007282586,2288775,"A Murder Is Announced (Miss Marple, #5)"
25386782,9780007282586,2288775,A Murder Is Announced
13146527,9780007395200,2677305,Number the Stars


We have to delete the duplicates and make the corresponding changes in the ratings dataset:

In [13]:
# Group them and get indexes
books_gr_duplicates_idx = books_gr_duplicates\
    .groupby(['isbn13'], observed=True)\
    .apply(lambda x: list(x.index)).tolist()

# Iterate over each group and keep only one book id
to_replace = {}
for book_group in books_gr_duplicates_idx:
    to_leave = book_group.pop()
    for index in book_group:
        to_replace[index] = to_leave

# Drop duplicates from the book dataset
books_gr.drop(index=to_replace.keys(), inplace=True)

# Replace in the ratings
ratings_gr['book_id'] = ratings_gr['book_id']\
    .map(lambda x: to_replace.get(x, x))

This transformation causes duplicated rows in ratings, so we will drop them:

In [14]:
ratings_gr.drop_duplicates(['user_id', 'book_id'],
                           keep='first', inplace=True)
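The remapping steps can be traced on a toy example. This is only a sketch under assumed data: the book ids, ISBN placeholders, and ratings are hypothetical, and `transform('last')` is used here as a compact stand-in for the explicit loop over duplicate groups.

```python
import pandas as pd

# Hypothetical toy data: book ids 11 and 12 are two records of one ISBN.
books = pd.DataFrame(
    {'isbn13': ['A', 'A', 'B']},
    index=pd.Index([11, 12, 13], name='book_id'))
ratings = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'book_id': [11, 12, 13],
    'rating': [5, 4, 3],
})

# For every record, find the last book id that shares its ISBN.
canonical = books.reset_index().groupby('isbn13')['book_id'].transform('last')
to_replace = {old: new for old, new in zip(books.index, canonical)
              if old != new}

# Drop the redundant book records and point their ratings at the kept id.
books = books.drop(index=list(to_replace))
ratings['book_id'] = ratings['book_id'].replace(to_replace)

# u1 now has two rows for book 12; keep only the first one.
ratings = ratings.drop_duplicates(['user_id', 'book_id'], keep='first')
print(ratings)
```

Keeping the first row is an arbitrary choice: the second rating by `u1` (4) is discarded rather than averaged with the first.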

Change `book_id` to `isbn13`:

In [15]:
ratings_gr = ratings_gr.merge(books_gr[['isbn13']], left_on='book_id',
                              right_index=True, how='left')
ratings_gr.drop(columns=['book_id'], inplace=True)
ratings_gr.head(2)

,rating,user_id,isbn13
0,5,8842281e1d1347389f2ab93d60773d4d,9780517226957
1,5,8842281e1d1347389f2ab93d60773d4d,9780767908184


## Join

Combine book info:

In [16]:
books = books_gr.merge(books_bc, left_on='isbn13',
                       right_index=True, how='outer')

# Merge titles
books['title'] = books['title_x']
missing_titles = books['title'].isna()
books.loc[missing_titles, 'title'] = books.loc[missing_titles, 'title_y']

# Drop unused columns
books.drop(columns=['title_x', 'title_y'], inplace=True)

# Set ISBN as index
books.set_index('isbn13', inplace=True)

Since books from the Book-Crossing dataset have no work ids, we need to create unique ones:

In [17]:
books['work_id'] = books['work_id'].astype('float')
missing_work_ids = books['work_id'].isna()
existing_work_ids = set(books.loc[~missing_work_ids, 'work_id'])

new_work_ids = []
counter = 0
nans_count = missing_work_ids.sum()
while len(new_work_ids) < nans_count:
    if counter not in existing_work_ids:
        new_work_ids.append(counter)
    counter += 1

books.loc[missing_work_ids, 'work_id'] = np.array(new_work_ids)
books['work_id'] = books['work_id'].astype('int').astype('category')
books.head(2)

isbn13,work_id,title
9780312853129,5400751,W.C. Fields: A Life on Film
9780743509985,1323437,Good Harbor
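The search for unused ids can also be written without an explicit loop. The sketch below assumes non-negative integer ids and uses made-up values; it picks the smallest free integers, which is what the counter-based loop does as well.

```python
import numpy as np

# Hypothetical ids that are already taken by existing works.
existing_work_ids = np.array([0, 1, 3, 7])
missing_count = 4

# A range of this length always contains at least `missing_count` free ids,
# because at most len(existing_work_ids) candidates can be taken.
candidates = np.arange(len(existing_work_ids) + missing_count)

# setdiff1d returns the sorted candidates that are not in the existing ids.
new_work_ids = np.setdiff1d(candidates, existing_work_ids)[:missing_count]
print(new_work_ids)
```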


Before merging the rating data, we need to scale ratings to the common range:

In [18]:
# Scale Goodreads ratings to the 10-point scale
ratings_gr['rating'] *= 2
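To make the effect of this scaling concrete, the sketch below assumes Goodreads ratings are whole stars from 1 to 5. It compares doubling with a hypothetical linear map that sends 1 to 1 and 5 to 10; the second option is shown only for contrast and is not used in this notebook.

```python
import numpy as np

# Assumption: explicit Goodreads ratings are whole stars from 1 to 5.
stars = np.array([1, 2, 3, 4, 5])

# Doubling keeps integer values, but the lowest possible rating becomes 2.
doubled = stars * 2

# Hypothetical alternative: a linear map sending 1 -> 1 and 5 -> 10.
linear = 1 + (stars - 1) * (10 - 1) / (5 - 1)

print(doubled)
print(linear)
```

Doubling keeps the values compatible with the `uint8` rating column, while the linear map would require a floating-point column.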

We assume that users from Goodreads and Book-Crossing communities are completely different. Check this:

In [19]:
ratings_gr['user_id'].isin(ratings_bc['user_id']).any()

False

Append Book-Crossing ratings to Goodreads ones:

In [20]:
# DataFrame.append was removed in pandas 2.0, so use pd.concat
ratings = pd.concat([ratings_gr, ratings_bc])

# Add work_ids
ratings = ratings.merge(books[['work_id']], how='left',
                        left_on='isbn13', right_index=True)
ratings.head(2)

,rating,user_id,isbn13,work_id
0,10,8842281e1d1347389f2ab93d60773d4d,9780517226957,135328
1,10,8842281e1d1347389f2ab93d60773d4d,9780767908184,2305997


Since people most often rate the book content rather than a particular edition of the book, we will build recommendations using `work_id` instead of `isbn13`.

Some users may have rated different editions of the same book. Let's average their ratings so that there is only one rating per book from each user:

In [21]:
ratings_per_work = ratings[['work_id', 'user_id', 'rating']]\
    .groupby(['work_id', 'user_id'], observed=True).mean()

# Drop duplicated ratings
work_ratings = ratings.drop_duplicates(['user_id', 'work_id'], keep='first')
work_ratings = work_ratings[['user_id', 'work_id']]\
    .merge(ratings_per_work, left_on=['user_id', 'work_id'],
           right_on=['user_id', 'work_id'], how='left')
work_ratings.head(4)

,user_id,work_id,rating
0,8842281e1d1347389f2ab93d60773d4d,135328,10.0
1,8842281e1d1347389f2ab93d60773d4d,2305997,10.0
2,8842281e1d1347389f2ab93d60773d4d,89369,10.0
3,8842281e1d1347389f2ab93d60773d4d,1699340,10.0
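The averaging step can be illustrated on a toy table. In this sketch the ids and ratings are made up, and a single `groupby` with `as_index=False` is used as a shorter route to one row per user and work.

```python
import pandas as pd

# Hypothetical data: u1 rated two editions of work 100 (8 and 10).
ratings = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u2'],
    'work_id': [100, 100, 200, 100],
    'rating': [8, 10, 6, 7],
})

# One row per (user, work) pair with the mean rating;
# sort=False keeps the pairs in order of first appearance.
work_ratings = ratings.groupby(
    ['user_id', 'work_id'], as_index=False, sort=False)['rating'].mean()
print(work_ratings)
```

Here the two ratings by `u1` for work 100 collapse into a single rating of 9.0.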


In [22]:
# Group data to get statistics
for parameter in ['work_id', 'user_id']:
    ratings_by_param = work_ratings[[parameter, 'rating']]\
        .groupby(parameter, observed=True).count()
    rated_count = len(ratings_by_param)
    five_times_rated_count = (ratings_by_param['rating'] >= 5).sum()
    ten_times_rated_count = (ratings_by_param['rating'] >= 10).sum()
    print(f'Number of {parameter} which have ratings: {rated_count}')
    print(f'Number of {parameter} which have at least 5 ratings: '
          f'{five_times_rated_count}')
    print(f'Number of {parameter} which have at least 10 ratings: '
          f'{ten_times_rated_count}\n')

# Print the total once, after both groups of statistics
print(f'Total number of ratings: {len(work_ratings)}')

Number of work_id which have ratings: 1141892
Number of work_id which have at least 5 ratings: 568805
Number of work_id which have at least 10 ratings: 394932

Number of user_id which have ratings: 876176
Number of user_id which have at least 5 ratings: 750186
Number of user_id which have at least 10 ratings: 692320

Total number of ratings: 89686460


## Preprocess

To test the recommender performance, we need to split the data into train, validation, and test datasets. We will use only users who left at least 4 ratings: 2 for training, 1 for validation, and 1 for testing. If a user has more ratings, we will use 60% for training and 20% each for validation and testing. Since the dataset does not contain rating timestamps, we will split the data randomly.

In [23]:
ratings_per_user = work_ratings.groupby('user_id', observed=True)
ratings_per_user_count = ratings_per_user['rating'].count()
min_four_users = ratings_per_user_count[ratings_per_user_count >= 4].index
print(f'Number of users who left at least four ratings: '
      f'{len(min_four_users)} '
      f'({round(len(min_four_users) * 100 / len(ratings_per_user), 2)}%)')

Number of users who left at least four ratings: 764983 (87.31%)


In [24]:
# Drop users with fewer than four ratings
work_ratings_min_four = ratings_per_user.filter(lambda x: len(x) >= 4)

In [25]:
# Split into train, test, and val datasets
work_ratings_train, work_ratings_val = train_test_split(
    work_ratings_min_four, test_size=0.4, random_state=RANDOM_SEED,
    stratify=work_ratings_min_four['user_id'])
work_ratings_test, work_ratings_val = train_test_split(
    work_ratings_val, test_size=0.5, random_state=RANDOM_SEED,
    stratify=work_ratings_val['user_id'])

# Show the shape
work_ratings_train.shape, work_ratings_val.shape, work_ratings_test.shape

((53708019, 3), (17902673, 3), (17902673, 3))
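For comparison, the per-user rule (at least one rating each for validation and testing, roughly 20% each for larger histories) can be sketched with an explicit loop over users. This is a different technique from the stratified `train_test_split` call above, written on hypothetical toy data; the `subset` column name is an assumption made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(54)

# Hypothetical data: user 'a' has 4 ratings, user 'b' has 10.
ratings = pd.DataFrame({
    'user_id': ['a'] * 4 + ['b'] * 10,
    'work_id': range(14),
    'rating': [8, 6, 9, 7, 5, 6, 7, 8, 9, 10, 4, 6, 8, 10],
})

# Every rating starts in train; part of each user's rows is moved out.
labels = np.full(len(ratings), 'train', dtype=object)
for positions in ratings.groupby('user_id').indices.values():
    shuffled = rng.permutation(positions)
    # About 20% of the user's ratings, but never fewer than one.
    n_holdout = max(1, round(0.2 * len(positions)))
    labels[shuffled[:n_holdout]] = 'val'
    labels[shuffled[n_holdout:2 * n_holdout]] = 'test'
ratings['subset'] = labels

print(ratings.groupby(['user_id', 'subset']).size())
```

A Python-level loop like this would be slow on hundreds of thousands of users, so it is meant to clarify the rule rather than to replace the library call.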

To compare scores from different users, we need to normalize them: one user may always give high ratings (8 to 10), while another may rate everything from 6 to 8, even though their tastes may be exactly the same. Thus, we will use StandardScaler:

In [26]:
def fit_user_scalers(ratings: pd.DataFrame) -> Dict[str, Tuple[float, float]]:
    """Get statistics per suer to normalize their ratings.

    :param ratings: all ratings.
    :return: average and standard deviation of users' ratings.
    """
    user_ratings_scalers = {}
    scaler = StandardScaler()
    for user_id, user_data in ratings.groupby('user_id'):
        scaler.fit(user_data[['rating']])
        user_ratings_scalers[user_id] = scaler.scale_, scaler.mean_
    return user_ratings_scalers


def ratings_normalize(user_ratings: pd.Series, user_scale: float,
                      user_mean: float) -> np.ndarray:
    """Transform user's ratings.

    :param user_ratings: user's ratings.
    :param user_scale: standard deviation of user's ratings.
    :param user_mean: average of user's ratings.
    :return: transformed values.
    """
    scaler = StandardScaler()
    scaler.scale_ = user_scale
    scaler.mean_ = user_mean
    return scaler.transform(user_ratings.values.reshape(-1, 1)).ravel()

In [27]:
# Fit the scalers on the training dataset only to avoid data leakage
user_ratings_scalers = fit_user_scalers(work_ratings_train)

In [28]:
# Transform all datasets
for dataset in [work_ratings_val, work_ratings_test, work_ratings_train]:
    dataset['rating_scaled'] = dataset.groupby('user_id')['rating']\
        .transform(lambda x: ratings_normalize(x, *user_ratings_scalers[x.name]))
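The idea behind per-user standardization can be checked on two hypothetical users, one rating from 8 to 10 and one from 6 to 8. The sketch below uses plain pandas instead of StandardScaler and, for brevity, computes the statistics on the same rows it transforms rather than on a separate training set.

```python
import pandas as pd

# Hypothetical users with the same taste but different rating ranges.
ratings = pd.DataFrame({
    'user_id': ['high'] * 3 + ['low'] * 3,
    'rating': [8, 9, 10, 6, 7, 8],
})

grouped = ratings.groupby('user_id')['rating']
mean = grouped.transform('mean')
# ddof=0 gives the population standard deviation, as StandardScaler does.
std = grouped.transform(lambda x: x.std(ddof=0))

ratings['rating_scaled'] = (ratings['rating'] - mean) / std
print(ratings)
```

Both users end up with identical scaled ratings. Note that a user whose ratings are all equal has zero deviation and would produce NaN here, whereas StandardScaler falls back to a scale of 1 in that case.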

## Save Data

In [29]:
# Save scaler info
with open(os.path.join('data_interm', 'user_ratings_scalers.pkl'), 'wb') as file:
    pickle.dump(user_ratings_scalers, file)

# Save datasets
work_ratings_train.to_csv(os.path.join('data_interm', 'work_ratings_train.csv'),
                          index=False)
work_ratings_val.to_csv(os.path.join('data_interm', 'work_ratings_val.csv'),
                        index=False)
work_ratings_test.to_csv(os.path.join('data_interm', 'work_ratings_test.csv'),
                         index=False)
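CSV files do not store column types, so identifiers may be read back with a different type than they were saved with. The sketch below uses a temporary directory and made-up ids to show why passing `dtype` when reloading can matter; the file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical ids with leading zeros, stored as strings.
saved = pd.DataFrame({'user_id': ['007', '042'], 'rating': [8, 6]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'ratings.csv')
    saved.to_csv(path, index=False)

    # Without a dtype, pandas infers integers and drops the zeros.
    inferred = pd.read_csv(path)
    # An explicit dtype keeps the ids as strings.
    restored = pd.read_csv(path, dtype={'user_id': 'str'})

print(inferred['user_id'].tolist())
print(restored['user_id'].tolist())
```

The same consideration applies to the keys of the pickled scaler dictionary, which must match the type of `user_id` after reloading.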