# Data loading and preprocessing

In this notebook we retrieve our dataset and preprocess it into a format that is ready for training our matrix-factorization-based recommender system.

First, we import the necessary Python packages.

In [1]:
import gdown
import numpy as np
import pandas as pd
import scipy.sparse
from tqdm.auto import tqdm
import itertools
import json
import gzip
import math
tqdm.pandas() # for progress_apply etc.

## Loading

Next, we download the files for our dataset (Goodreads). We use the gdown package to retrieve them from the Google Drive they're originally hosted on. 

> Since we will be implementing a collaborative filtering algorithm, we only need the interactions part of the dataset. The code for reading in the other parts of the dataset was left as comments for potential future reference.

We define the URLs for each file...

In [2]:
URLS = {
    # "BOOKS": "https://drive.google.com/uc?id=1ICk5x0HXvXDp5Zt54CKPh5qz1HyUIn9m",
    # "AUTHORS": "https://drive.google.com/uc?id=19cdwyXwfXx_HDIgxXaHzH0mrx8nMyLvC",
    # "REVIEWS": "https://drive.google.com/u/0/uc?id=1V4MLeoEiPQdocCbUHjR_7L9ZmxTufPFe",
    "INTERACTIONS": "https://drive.google.com/uc?id=1CCj-cQw_mJLMdvF_YYfQ7ibKA-dC_GA2"
}

and download each file, unless it was already downloaded in a previous run of the notebook.

In [3]:
for name, url in URLS.items():
    gdown.cached_download(url, f"./data/{name}.json.gz", quiet=False)

File exists: ./data/INTERACTIONS.json.gz


We now define a function to read the dataset into a Pandas dataframe. This implementation is faster and more memory efficient than the read_json function provided by Pandas.

In [4]:
def read_json_fast(filename, nrows=None):
    """
    Loads line-delimited JSON files faster than the
    read_json function provided by Pandas.

    Iterates over the file line by line, so it shouldn't
    cause out-of-memory issues unless the resulting
    DataFrame is too big.

    Args:
      filename: path of the gzipped JSON file
      nrows: maximum number of rows to read from the file (None reads all)

    Returns:
      Pandas DataFrame containing (part of) the data.
    """
    with gzip.open(filename) as f:
        print(f"Processing {filename.split('/')[-1]}:")
        # islice stops after nrows lines; without nrows we read the whole file
        if nrows is not None:
            pbar = tqdm(itertools.islice(f, nrows), unit="lines")
        else:
            pbar = tqdm(f, unit="lines")
        return pd.DataFrame(json.loads(l) for l in pbar)
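
To see the line-by-line reading pattern in isolation, the sketch below writes a small gzipped JSON Lines file and reads back only its first rows. It uses only the standard library; the helper name `iter_json_lines` and the sample records are hypothetical and not part of the Goodreads dataset.

```python
import gzip
import itertools
import json
import os
import tempfile

def iter_json_lines(filename, nrows=None):
    """Yield one parsed record per line of a gzipped JSON Lines file."""
    with gzip.open(filename, "rt", encoding="utf-8") as f:
        # islice stops after nrows lines, so the rest of the file is never parsed
        lines = itertools.islice(f, nrows) if nrows is not None else f
        for line in lines:
            yield json.loads(line)

# Hypothetical sample records, written to a temporary file for the demo.
sample = [
    {"user_id": "u1", "book_id": "10", "rating": 4},
    {"user_id": "u1", "book_id": "11", "rating": 0},
    {"user_id": "u2", "book_id": "10", "rating": 5},
]
path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

first_two = list(iter_json_lines(path, nrows=2))
print(first_two)
```

Because only one line is held in memory at a time while parsing, memory use is dominated by the collected records rather than by the raw file.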

We define the dataset file locations...

In [5]:
# books_file = './data/BOOKS.json.gz' # book metadata
interactions_file = './data/INTERACTIONS.json.gz' # user-book interactions (ratings)
# reviews_file = './data/REVIEWS.json.gz' # user-book interactions (reviews)
# authors_file = './data/AUTHORS.json.gz' # author metadata

and load the necessary files.

In [6]:
# df_books = read_json_fast(books_file)
df_interactions = read_json_fast(interactions_file)
# df_authors =  read_json_fast(authors_file)

Processing INTERACTIONS.json.gz:


0lines [00:00, ?lines/s]

Now we look at the contents of the loaded dataset.

In [7]:
display(df_interactions.head(1))

Unnamed: 0,user_id,book_id,review_id,is_read,rating,review_text_incomplete,date_added,date_updated,read_at,started_at
0,8842281e1d1347389f2ab93d60773d4d,836610,6b4db26aafeaf0da77c7de6214331e1e,False,0,,Mon Aug 21 12:11:00 -0700 2017,Mon Aug 21 12:11:00 -0700 2017,,


We can see that the dataset contains quite a few columns that are of no use to us. To make everything a little less cluttered we remove the columns that we don't use from the dataframe.

In [8]:
df_interactions = df_interactions[['user_id', 'book_id', 'rating', 'date_updated']]

In [9]:
display(df_interactions.head(1))

Unnamed: 0,user_id,book_id,rating,date_updated
0,8842281e1d1347389f2ab93d60773d4d,836610,0,Mon Aug 21 12:11:00 -0700 2017


## Pre-processing

The first pre-processing step we apply is converting all dates into a more standardized format.

In [10]:
format_str = '%a %b %d %H:%M:%S %z %Y' #see https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
def convert_date(date_string):
  return pd.to_datetime(date_string, utc=True, format=format_str)

_df_interactions = df_interactions.copy()
_df_interactions['date_updated'] =  _df_interactions['date_updated'].progress_apply(convert_date)
_df_interactions['date_updated'] = _df_interactions['date_updated'].dt.tz_localize(None)  # drops utc timezone

  0%|          | 0/7347630 [00:00<?, ?it/s]
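
As a small illustration of what this conversion does to a single value, the sketch below parses the date string shown earlier with the standard library, converts it to UTC and then drops the timezone information. It mirrors the pandas steps above but is not the notebook's own code, and it assumes an English locale for the day and month names.

```python
from datetime import datetime, timezone

format_str = "%a %b %d %H:%M:%S %z %Y"
raw = "Mon Aug 21 12:11:00 -0700 2017"

parsed = datetime.strptime(raw, format_str)   # timezone-aware, offset -07:00
as_utc = parsed.astimezone(timezone.utc)      # same instant expressed in UTC
naive_utc = as_utc.replace(tzinfo=None)       # keep the UTC wall-clock time, drop tz info
print(naive_utc)  # 2017-08-21 19:11:00
```

Passing the whole column to `pd.to_datetime` in a single call, instead of applying it row by row, should give the same result and is usually considerably faster.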

Now we define a pre-processing function that:

1. Drops ratings of 0, which mean that the user did not provide a rating.
2. Removes duplicate (user, item) pairs.
3. Removes users that occur in fewer than minsup interactions.

In [36]:
def preprocess(df, minsup=10):
    """
    Goal: - Drop ratings == 0, i.e. "not provided"
          - Remove reconsumption items (duplicate user-book pairs)
          - Remove users that have fewer than minsup interactions

    :input df: DataFrame containing user_id, book_id and rating
    """
    # drop 0 ratings
    before = df.shape[0]
    df = df[(df["rating"] > 0)]
    print("After dropping  0-ratings: {} -> {}".format(before,df.shape[0]))
    # drop reconsumption items
    before = df.shape[0]
    df = df.drop_duplicates(subset=["user_id","book_id"])
    print("After drop_duplicates (reconsumption items): {} -> {}".format(before,df.shape[0]))
    # drop users with fewer than minsup items in history
    g = df.groupby('user_id', as_index=False)['book_id'].size()
    g = g.rename({'size': 'user_sup'}, axis='columns')
    df = pd.merge(df, g, how='left', on=['user_id'])
    before = df.shape[0]
    df = df[df['user_sup'] >= minsup]
    print("After dropping users with less than {} interactions: {} -> {}".format(minsup, before,df.shape[0]))
    return df
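
The three filtering steps can be traced on a toy example. The sketch below uses a hypothetical six-row DataFrame and a `minsup` of 3, and it counts interactions per user with `groupby(...).transform("count")` instead of the merge used above, which is an alternative chosen here for brevity.

```python
import pandas as pd

# Hypothetical toy interactions: u1 rated three books, u2 rated only one.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u1", "u2", "u2"],
    "book_id": ["a", "b", "c", "d", "a", "b"],
    "rating":  [5, 0, 3, 4, 0, 2],
})

# Step 1: drop ratings of 0 ("not provided").
rated = toy[toy["rating"] > 0]
# Step 2: drop duplicate (user, book) pairs.
deduped = rated.drop_duplicates(subset=["user_id", "book_id"])
# Step 3: keep only users with at least minsup remaining interactions.
minsup = 3
user_sup = deduped.groupby("user_id")["book_id"].transform("count")
kept = deduped[user_sup >= minsup]

print(kept)
```

Only the three rated rows of `u1` survive, since `u2` is left with a single interaction after the 0-rating is removed.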

Then we apply the pre-processing function to the dataframe and log the change in number of samples, number of unique users and number of unique items.

In [37]:
#print number of users and items
print(f"number of unique users: {_df_interactions['user_id'].nunique()}")
print(f"number of unique items: {_df_interactions['book_id'].nunique()}")
processed_df_interactions = preprocess(_df_interactions.copy())
# display(processed_df_interactions.head(5))
print(f"number of unique users: {processed_df_interactions['user_id'].nunique()}")
print(f"number of unique items: {processed_df_interactions['book_id'].nunique()}")
# create sequential ids
processed_df_interactions['user_id_seq'] = processed_df_interactions['user_id'].astype('category').cat.codes
processed_df_interactions['book_id_seq'] = processed_df_interactions['book_id'].astype('category').cat.codes
# merge book id and rating for easier 
display(processed_df_interactions.head(5))

number of unique users: 342415
number of unique items: 89411
After dropping  0-ratings: 7347630 -> 4514094
After drop_duplicates (reconsumption items): 4514094 -> 4514094
After dropping users with less than 10 interactions: 4514094 -> 4041762
number of unique users: 68401
number of unique items: 88276


Unnamed: 0,user_id,book_id,rating,date_updated,user_sup,user_id_seq,book_id_seq
20,06316bec7a49286f1f98d5acce24f923,575753,4,2012-06-05 16:35:39,35,1721,73285
21,06316bec7a49286f1f98d5acce24f923,47694,4,2012-06-05 16:35:15,35,1721,71428
22,06316bec7a49286f1f98d5acce24f923,47700,3,2012-06-05 16:34:57,35,1721,71432
23,06316bec7a49286f1f98d5acce24f923,47720,4,2012-06-05 16:34:37,35,1721,71440
24,06316bec7a49286f1f98d5acce24f923,25104,5,2012-06-05 16:34:26,35,1721,42032


We store the mapping from the original user and book IDs to their sequential IDs in external files. This might come in handy in other notebooks.

In [38]:
processed_df_interactions[['user_id', 'user_id_seq']].drop_duplicates().to_pickle("./data/user_id_map.pkl")
processed_df_interactions[['book_id', 'book_id_seq']].drop_duplicates().to_pickle("./data/book_id_map.pkl")
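
The sketch below shows, on hypothetical IDs, how `cat.codes` assigns sequential integers and how such a mapping table can be turned into lookup dictionaries in both directions. The dictionary names `to_seq` and `to_raw` are illustrative only.

```python
import pandas as pd

# Hypothetical raw IDs; "b7" occurs twice and must map to a single code.
toy = pd.DataFrame({"user_id": ["b7", "a3", "b7", "c9"]})
# Categories are sorted, so codes follow the sorted order of the raw IDs.
toy["user_id_seq"] = toy["user_id"].astype("category").cat.codes

# One row per distinct ID, as in the pickled mapping files.
id_map = toy[["user_id", "user_id_seq"]].drop_duplicates()

# Lookup dictionaries in both directions (tolist() yields plain Python ints).
to_seq = dict(zip(id_map["user_id"], id_map["user_id_seq"].tolist()))
to_raw = {seq: raw for raw, seq in to_seq.items()}

print(to_seq)
```

The reverse dictionary is what a later notebook would need to translate recommended sequential item IDs back into the original book IDs.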

We sort the interactions by their date and group them by user. This allows us to perform a session-based train-test split.

In [39]:
# Sort on date and group per user
sessions_df = processed_df_interactions.sort_values(['date_updated'],ascending=True).groupby(by='user_id_seq', as_index=False)[['book_id_seq','date_updated', 'rating']].agg(list)
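
To illustrate why the sort has to happen before the grouping, the sketch below applies the same sort-then-aggregate pattern to a hypothetical five-row DataFrame. Since `groupby` keeps the row order within each group, every resulting list is in chronological order.

```python
import pandas as pd

# Hypothetical interactions, deliberately not in chronological order.
toy = pd.DataFrame({
    "user_id_seq": [0, 1, 0, 1, 0],
    "book_id_seq": [10, 20, 11, 21, 12],
    "rating": [5, 4, 3, 5, 4],
    "date_updated": pd.to_datetime(
        ["2013-03-01", "2012-05-01", "2013-01-01", "2012-06-01", "2013-02-01"]
    ),
})

# Sort by date first, then collect each user's items and ratings into lists.
sessions = (
    toy.sort_values("date_updated")
       .groupby("user_id_seq", as_index=False)[["book_id_seq", "rating"]]
       .agg(list)
)
print(sessions)
```

The item and rating lists stay aligned position by position, which is what allows both to be split at the same index later on.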

We then perform the session-based split.

In [40]:
# Function to perform split
def split(row, col, percentage_train):
    items = row[col]
    no_train_items = math.floor(len(items) * percentage_train)
    return items[0:no_train_items], items[no_train_items:]

# Use the first 70% of each user's interactions (ordered by date) for training and the remaining 30% for testing.
percentage_train = 0.7
# train_items, test_items = split(items, percentage_train)
sessions_df[['history', 'future']] = sessions_df.progress_apply(lambda row: split(row, 'book_id_seq', percentage_train), axis=1, result_type='expand')
sessions_df[['history_ratings', 'future_ratings']] = sessions_df.progress_apply(lambda row: split(row, 'rating', percentage_train), axis=1, result_type='expand')
display(sessions_df.head(5))

  0%|          | 0/68401 [00:00<?, ?it/s]

  0%|          | 0/68401 [00:00<?, ?it/s]

Unnamed: 0,user_id_seq,book_id_seq,date_updated,rating,history,future,history_ratings,future_ratings
0,0,"[49283, 53514, 53515, 53516, 53522, 53519, 535...","[2013-06-21 17:23:44, 2013-06-21 17:24:05, 201...","[4, 4, 4, 4, 3, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, ...","[49283, 53514, 53515, 53516, 53522, 53519, 535...","[53512, 53508, 53520, 53509, 53513, 72017, 720...","[4, 4, 4, 4, 3, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4]","[4, 4, 4, 4, 4, 5, 5, 5]"
1,1,"[70584, 81684, 14070, 10809, 69705, 29863, 341...","[2012-10-21 16:23:19, 2012-10-21 16:27:04, 201...","[5, 5, 4, 4, 5, 4, 4, 5, 4, 4, 4, 4, 5]","[70584, 81684, 14070, 10809, 69705, 29863, 341...","[10030, 13982, 8144, 19061]","[5, 5, 4, 4, 5, 4, 4, 5, 4]","[4, 4, 4, 5]"
2,2,"[86771, 86776, 86912, 86890, 1087, 69, 85992, ...","[2008-02-22 22:33:45, 2008-02-23 03:14:49, 200...","[5, 4, 3, 4, 3, 5, 3, 4, 4, 5, 4, 4, 4, 4, 4, ...","[86771, 86776, 86912, 86890, 1087, 69, 85992, ...","[83894, 72445, 16120, 58134, 3327, 86827, 63486]","[5, 4, 3, 4, 3, 5, 3, 4, 4, 5, 4, 4, 4, 4]","[4, 3, 4, 4, 4, 3, 4]"
3,3,"[71311, 73341, 56070, 70016, 82251, 86852, 859...","[2013-11-10 22:46:12, 2013-11-10 22:47:06, 201...","[5, 5, 5, 5, 5, 5, 5, 5, 3, 3, 4, 5, 5, 3, 5, ...","[71311, 73341, 56070, 70016, 82251, 86852, 859...","[73097, 27581, 11308, 74054, 22902, 65494, 785...","[5, 5, 5, 5, 5, 5, 5, 5, 3, 3, 4, 5, 5, 3, 5, ...","[5, 5, 4, 5, 5, 5, 5, 5, 5, 3, 2, 5, 5, 5, 4, ..."
4,4,"[71311, 73815, 17559, 59228, 70489, 81010, 879...","[2012-10-19 15:54:06, 2012-10-19 16:03:30, 201...","[5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]","[71311, 73815, 17559, 59228, 70489, 81010, 879...","[71654, 71656, 87721, 71653]","[5, 4, 4, 4, 4, 4, 4, 4]","[4, 4, 4, 4]"
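
The split arithmetic can be checked on a single list. The sketch below uses a hypothetical helper `split_history` that applies the same floor-based rule to one user's chronologically ordered items.

```python
import math

def split_history(items, percentage_train):
    """Split a chronologically ordered list into (history, future)."""
    n_train = math.floor(len(items) * percentage_train)
    return items[:n_train], items[n_train:]

# A user with 13 interactions: floor(13 * 0.7) = 9 go to training, 4 to testing.
items = list(range(13))
history, future = split_history(items, 0.7)
print(len(history), len(future))  # 9 4
```

This agrees with the user at index 1 in the table above, whose 13 ratings are divided into 9 history ratings and 4 future ratings. Because the count is rounded down, the test part always receives at least 30% of a user's interactions.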


Finally, we create a sparse representation of the user-item interaction matrix for our train and test set.

In [42]:
def create_sparse_repr(df, column, shape):
    user_ids = []
    item_ids = []
    values = []
    for idx, row in tqdm(df.iterrows()):
        items = row[column]
        item_ids.extend(items)
        user = row['user_id_seq']
        user_ids.extend([user] * len(items))
        ratings = row[column + "_ratings"]
        values.extend(ratings)
    matrix = scipy.sparse.coo_matrix((values, (user_ids, item_ids)), shape=shape, dtype=np.int32)
    return matrix
    

shape = (processed_df_interactions['user_id_seq'].max() + 1,  processed_df_interactions['book_id_seq'].max() + 1)
train = create_sparse_repr(sessions_df, column='history', shape=shape)
test = create_sparse_repr(sessions_df, column='future', shape=shape)

0it [00:00, ?it/s]

0it [00:00, ?it/s]
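
For reference, this is how the coordinate (COO) format lays out the three parallel lists. The sketch below builds a small matrix for hypothetical data with 2 users and 4 items, and converts it to a dense array only for display.

```python
import numpy as np
import scipy.sparse

# Hypothetical toy data: entry k says that user_ids[k] rated item_ids[k] with ratings[k].
user_ids = [0, 0, 1]
item_ids = [1, 3, 2]
ratings = [5, 3, 4]

matrix = scipy.sparse.coo_matrix(
    (ratings, (user_ids, item_ids)), shape=(2, 4), dtype=np.int32
)
dense = matrix.toarray()  # only feasible for tiny examples
print(dense)
```

Passing the shape explicitly matters here: it guarantees that the train and test matrices have identical dimensions even when some users or items occur in only one of the two.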

We store the train and test sets externally so that they can be used in the training and evaluation notebook.

In [46]:
scipy.sparse.save_npz('./data/train.npz', train)
scipy.sparse.save_npz('./data/test.npz', test)
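
A later notebook can read these files back with `scipy.sparse.load_npz`. The sketch below round-trips a hypothetical toy matrix through a temporary file and converts it to CSR, which is an assumption about what downstream code may want, since CSR supports row slicing per user while COO does not.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Hypothetical 2 x 3 interaction matrix standing in for train.npz.
matrix = scipy.sparse.coo_matrix(
    ([5, 4], ([0, 1], [1, 0])), shape=(2, 3), dtype=np.int32
)

path = os.path.join(tempfile.mkdtemp(), "toy.npz")
scipy.sparse.save_npz(path, matrix)

loaded = scipy.sparse.load_npz(path)   # comes back in the format it was saved in
loaded_csr = loaded.tocsr()            # CSR allows indexing individual user rows
row0 = loaded_csr[0].toarray()
print(row0)
```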