# Data loading and preprocessing

In this notebook we retrieve our dataset and preprocess it into a format that is ready for training our matrix-factorization-based recommender system.

First, we import the necessary Python packages.

In [74]:
import gdown
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import itertools
import json
import gzip
import math
import pickle
tqdm.pandas() # for progress_apply etc.

## Loading

Next, we download the files for our dataset (Goodreads). We use the gdown package to retrieve them from the Google Drive they're originally hosted on. 

> Since we will be implementing a collaborative filtering algorithm, we only need the interactions part of the dataset. The code for reading in the other parts of the dataset was left as comments for potential future reference.

We define the URLs for each file...

In [2]:
URLS = {
    # "BOOKS": "https://drive.google.com/uc?id=1ICk5x0HXvXDp5Zt54CKPh5qz1HyUIn9m",
    # "AUTHORS": "https://drive.google.com/uc?id=19cdwyXwfXx_HDIgxXaHzH0mrx8nMyLvC",
    # "REVIEWS": "https://drive.google.com/u/0/uc?id=1V4MLeoEiPQdocCbUHjR_7L9ZmxTufPFe",
    "INTERACTIONS": "https://drive.google.com/uc?id=1CCj-cQw_mJLMdvF_YYfQ7ibKA-dC_GA2"
}

and download each file (unless it has already been downloaded in a previous run of the notebook).

In [3]:
for name, url in URLS.items():
    gdown.cached_download(url, f"./data/{name}.json.gz", quiet=False)

File exists: ./data/INTERACTIONS.json.gz


We now define a function to read the dataset into a Pandas dataframe. This implementation parses the file line by line and is faster and more memory-efficient than the read_json function provided by Pandas.

In [4]:
def read_json_fast(filename, nrows=None):
    """
    Loads line-delimited JSON files faster than the
    read_json function provided by Pandas.

    Iterates over the file line by line, so it shouldn't
    cause out-of-memory issues unless the resulting
    DataFrame is too big.

    Args:
        filename: path of the gzip-compressed JSON file
        nrows: maximum number of rows to read from the file (None reads all)

    Returns:
        Pandas DataFrame containing (part of) the data.
    """
    with gzip.open(filename) as f:
        print(f"Processing {filename.split('/')[-1]}:")
        if nrows is not None:
            pbar = tqdm(itertools.islice(f, nrows), unit="lines")
        else:
            pbar = tqdm(f, unit="lines")
        return pd.DataFrame(json.loads(line) for line in pbar)
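
To try the reader on a small sample before loading the full file, the number of lines can be capped. The sketch below is self-contained and uses the same gzip, itertools.islice and json.loads approach. The three records and the temporary file path are made up for illustration and are not part of the Goodreads data.

```python
import gzip
import itertools
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample records, only for illustration.
records = [
    {"user_id": "u1", "book_id": "10", "rating": 5},
    {"user_id": "u1", "book_id": "11", "rating": 0},
    {"user_id": "u2", "book_id": "10", "rating": 3},
]

# Write them as gzip-compressed, line-delimited JSON (one object per line).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read back only the first two lines, parsing one line at a time.
with gzip.open(sample_path, "rt", encoding="utf-8") as f:
    sample_df = pd.DataFrame(json.loads(line) for line in itertools.islice(f, 2))

print(sample_df)
```

Only the requested lines are decompressed and parsed, so a quick look at the columns does not require reading the whole file.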

We define the dataset file locations...

In [5]:
# books_file = './data/BOOKS.json.gz' # book metadata
interactions_file = './data/INTERACTIONS.json.gz' # user-book interactions (ratings)
# reviews_file = './data/REVIEWS.json.gz' # user-book interactions (reviews)
# authors_file = './data/AUTHORS.json.gz' # author metadata

and load the necessary files.

In [19]:
# df_books = read_json_fast(books_file)
df_interactions = read_json_fast(interactions_file)
# df_authors =  read_json_fast(authors_file)

Processing INTERACTIONS.json.gz:



Now we look at the contents of the loaded dataset.

In [20]:
display(df_interactions.head(1))

| | user_id | book_id | review_id | is_read | rating | review_text_incomplete | date_added | date_updated | read_at | started_at |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8842281e1d1347389f2ab93d60773d4d | 836610 | 6b4db26aafeaf0da77c7de6214331e1e | False | 0 | | Mon Aug 21 12:11:00 -0700 2017 | Mon Aug 21 12:11:00 -0700 2017 | | |


We can see that the dataset contains several columns that are of no use to us. To reduce clutter, we keep only the columns we need: the user, the book, the rating and the date of the last update.

In [36]:
df_interactions = df_interactions[['user_id', 'book_id', 'rating', 'date_updated']]

In [37]:
display(df_interactions.head(1))

| | user_id | book_id | rating | date_updated |
|---|---|---|---|---|
| 0 | 8842281e1d1347389f2ab93d60773d4d | 836610 | 0 | Mon Aug 21 12:11:00 -0700 2017 |


## Pre-processing

The first pre-processing step we apply is converting all dates into a more standardized format.

In [38]:
format_str = '%a %b %d %H:%M:%S %z %Y'  # see https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior

def convert_date(date_string):
    return pd.to_datetime(date_string, utc=True, format=format_str)

_df_interactions = df_interactions.copy()
_df_interactions['date_updated'] = _df_interactions['date_updated'].progress_apply(convert_date)
_df_interactions['date_updated'] = _df_interactions['date_updated'].dt.tz_localize(None)  # drops utc timezone

  0%|          | 0/7347630 [00:00<?, ?it/s]
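
To make the conversion concrete, here is a minimal sketch on two timestamps. The first is the date_updated value shown earlier in this notebook; the second is made up for illustration. The sketch parses the whole column in a single pd.to_datetime call rather than row by row. That is an alternative to the cell above, not the method used there, and we assume it yields the same values.

```python
import pandas as pd

# Same format string as in the cell above.
format_str = "%a %b %d %H:%M:%S %z %Y"

# First value comes from the dataset preview; the second is hypothetical.
dates = pd.Series([
    "Mon Aug 21 12:11:00 -0700 2017",
    "Tue Dec 02 02:49:29 +0000 2014",
])

# Parse the whole column at once, converting every UTC offset to UTC ...
parsed = pd.to_datetime(dates, format=format_str, utc=True)
# ... then drop the timezone information, leaving naive UTC timestamps.
parsed = parsed.dt.tz_localize(None)

print(parsed)
```

The -0700 offset is applied before the timezone is dropped, so 12:11:00 local time becomes 19:11:00.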

Now we define a pre-processing function that:

1. Drops ratings of 0, since a 0 means "Not provided".
2. Removes duplicate (user, item) pairs.
3. Removes users that occur in fewer than minsup interactions.

In [69]:
def preprocess(df, minsup=4):
    """
    Goal: - Drop 0 ratings (0 == "Not Provided")
          - Remove reconsumption items (duplicate (user, book) pairs)
          - Remove users that have fewer than minsup interactions

    :input df: Dataframe containing user_id, book_id, rating and date_updated
    :input minsup: minimum number of interactions a user needs to be kept
    """
    # drop 0 ratings
    before = df.shape[0]
    df = df[(df["rating"] != 0)]
    print("After dropping  ratings == 0 (not provided ratings): {} -> {}".format(before,df.shape[0]))
    # drop reconsumption items
    before = df.shape[0]
    df = df.drop_duplicates(subset=["user_id","book_id"])
    print("After drop_duplicates (reconsumption items): {} -> {}".format(before,df.shape[0]))
    # drop users with fewer than minsup items in their history
    g = df.groupby('user_id', as_index=False)['book_id'].size()
    g = g.rename({'size': 'user_sup'}, axis='columns')
    df = pd.merge(df, g, how='left', on=['user_id'])
    before = df.shape[0]
    df = df[df['user_sup'] >= minsup]
    print("After dropping users with less than {} interactions: {} -> {}".format(minsup, before,df.shape[0]))
    return df
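
The three filtering steps can be traced on a toy example. The sketch below uses made-up interactions and a hypothetical minsup of 3. It counts interactions per user with groupby(...).transform("size") instead of the merge used in the function above, which is a swapped-in shortcut and not the notebook's implementation.

```python
import pandas as pd

# Made-up interactions, for illustration only.
toy = pd.DataFrame({
    "user_id": ["a", "a", "a", "a", "b", "b", "b"],
    "book_id": ["1", "2", "2", "3", "1", "2", "3"],
    "rating":  [5,   4,   4,   0,   3,   2,   1],
})
minsup = 3

# 1. Drop ratings of 0 ("not provided").
toy = toy[toy["rating"] != 0]
# 2. Keep only the first occurrence of each (user, book) pair.
toy = toy.drop_duplicates(subset=["user_id", "book_id"])
# 3. Count the remaining interactions per user and keep users with at least minsup.
user_sup = toy.groupby("user_id")["book_id"].transform("size")
toy = toy[user_sup >= minsup]

print(toy)
```

User a starts with four rows but is left with only two after steps 1 and 2, so step 3 removes that user entirely, while user b keeps all three interactions. The order of the steps matters: the support is counted after the 0 ratings and duplicates are gone.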

Then we apply the pre-processing function to the dataframe and log the change in number of samples, number of unique users and number of unique items.

In [71]:
#print number of users and items
print(f"number of unique users: {_df_interactions['user_id'].nunique()}")
print(f"number of unique items: {_df_interactions['book_id'].nunique()}")
processed_df_interactions = preprocess(_df_interactions.copy())
# display(processed_df_interactions.head(5))
print(f"number of unique users: {processed_df_interactions['user_id'].nunique()}")
print(f"number of unique items: {processed_df_interactions['book_id'].nunique()}")
# create sequential ids
processed_df_interactions['user_id_seq'] = processed_df_interactions['user_id'].astype('category').cat.codes
processed_df_interactions['book_id_seq'] = processed_df_interactions['book_id'].astype('category').cat.codes
# show the first rows, now including the sequential ids
display(processed_df_interactions.head(5))

number of unique users: 342415
number of unique items: 89411
After dropping  ratings == 0 (not provided ratings): 7347630 -> 4514094
After drop_duplicates (reconsumption items): 4514094 -> 4514094
After dropping users with less than 4 interactions: 4514094 -> 4314063
number of unique users: 114648
number of unique items: 88848


| | user_id | book_id | rating | date_updated | user_sup | user_id_seq | book_id_seq |
|---|---|---|---|---|---|---|---|
| 0 | 8842281e1d1347389f2ab93d60773d4d | 24815 | 5 | 2008-04-18 06:42:49 | 4 | 61194 | 41327 |
| 1 | 8842281e1d1347389f2ab93d60773d4d | 24818 | 5 | 2008-04-18 06:42:42 | 4 | 61194 | 41336 |
| 2 | 8842281e1d1347389f2ab93d60773d4d | 59715 | 5 | 2008-04-18 06:42:40 | 4 | 61194 | 74153 |
| 3 | 8842281e1d1347389f2ab93d60773d4d | 24816 | 5 | 2008-04-18 06:42:38 | 4 | 61194 | 41329 |
| 6 | f8a89075dc6de14857561522e729f82c | 18465601 | 4 | 2014-12-02 02:49:29 | 5 | 111384 | 24172 |


We store the mappings from the original user and book ids to their sequential ids in external files. These might come in handy in other notebooks.

In [81]:
processed_df_interactions[['user_id', 'user_id_seq']].drop_duplicates().to_pickle("./data/user_id_map.pkl")
processed_df_interactions[['book_id', 'book_id_seq']].drop_duplicates().to_pickle("./data/book_id_map.pkl")
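
As a small illustration of how the sequential ids and the stored mapping relate, the sketch below builds both for a handful of made-up user ids and turns the mapping into lookup dictionaries. The names raw_to_seq and seq_to_raw are hypothetical helpers, not something defined elsewhere in this project.

```python
import pandas as pd

# Made-up raw ids, for illustration only.
df = pd.DataFrame({"user_id": ["u9", "u3", "u9", "u5"]})

# Category codes number the distinct values 0..n-1 in sorted order.
df["user_id_seq"] = df["user_id"].astype("category").cat.codes

# One row per distinct id, the same shape as the pickled mapping above.
id_map = df[["user_id", "user_id_seq"]].drop_duplicates()

# Dictionaries for lookups in both directions (tolist() yields plain Python ints).
raw_to_seq = dict(zip(id_map["user_id"].tolist(), id_map["user_id_seq"].tolist()))
seq_to_raw = {seq: raw for raw, seq in raw_to_seq.items()}

print(raw_to_seq)
```

Because the codes follow the sorted order of the distinct ids, they depend on which users and books survive pre-processing. A mapping saved from one run is therefore only valid for the train and test sets produced by that same run.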

We sort the interactions by their date and group them by user. This allows us to perform a session-based split.

In [73]:
# Sort on date and group per user
sessions_df = processed_df_interactions.sort_values(['date_updated'],ascending=True).groupby(by='user_id_seq', as_index=False)[['book_id_seq','date_updated']].agg(list)

# Function to perform split
def split(items, percentage_train):
  no_train_items = math.floor(len(items) * percentage_train)
  return items[0:no_train_items], items[no_train_items:]

# Split dataset into 0.75 training and 0.25 test samples, split in the temporal dimension.
percentage_train = 0.75
sessions_df['history'] = sessions_df['book_id_seq'].apply(lambda items: split(items, percentage_train)[0])
sessions_df['future'] = sessions_df['book_id_seq'].apply(lambda items: split(items, percentage_train)[1])
display(sessions_df.head(5))

| | user_id_seq | book_id_seq | date_updated | history | future |
|---|---|---|---|---|---|
| 0 | 0 | [49600, 53860, 53861, 53862, 53868, 53865, 538... | [2013-06-21 17:23:44, 2013-06-21 17:24:05, 201... | [49600, 53860, 53861, 53862, 53868, 53865, 538... | [53866, 53855, 53859, 72519, 72520, 72518] |
| 1 | 1 | [71813, 42294, 82121, 65876, 71257, 58000, 218... | [2014-06-30 02:30:37, 2014-06-30 02:30:44, 201... | [71813, 42294, 82121, 65876, 71257, 58000] | [21835, 74568, 80549] |
| 2 | 2 | [71085, 82221, 14162, 10873, 70198, 30053, 344... | [2012-10-21 16:23:19, 2012-10-21 16:27:04, 201... | [71085, 82221, 14162, 10873, 70198, 30053, 344... | [10087, 14074, 8195, 19176] |
| 3 | 3 | [87333, 87338, 87475, 87452, 1098, 69, 86550, ... | [2008-02-22 22:33:45, 2008-02-23 03:14:49, 200... | [87333, 87338, 87475, 87452, 1098, 69, 86550, ... | [72948, 16224, 58520, 3346, 87389, 63919] |
| 4 | 4 | [71813, 73846, 56442, 70513, 82790, 87414, 865... | [2013-11-10 22:46:12, 2013-11-10 22:47:06, 201... | [71813, 73846, 56442, 70513, 82790, 87414, 865... | [74562, 23039, 65951, 79067, 25893, 71275, 312... |
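
The arithmetic of the split can be checked by hand on user 1 from the output above, who has nine interactions. The sketch below repeats the split function so that it runs on its own; the four-item list at the end is a made-up example of the smallest history that passes the minsup=4 filter.

```python
import math

def split(items, percentage_train):
    # Same logic as the cell above: the first floor(n * p) items go to train.
    no_train_items = math.floor(len(items) * percentage_train)
    return items[:no_train_items], items[no_train_items:]

# The nine interactions of user 1, oldest first (history + future from the output above).
user_1_items = [71813, 42294, 82121, 65876, 71257, 58000, 21835, 74568, 80549]

# floor(9 * 0.75) = floor(6.75) = 6 items for training, the remaining 3 for testing.
history, future = split(user_1_items, 0.75)
print(len(history), len(future))

# A user with exactly four interactions gets floor(4 * 0.75) = 3 train items and 1 test item.
small_history, small_future = split([1, 2, 3, 4], 0.75)
print(small_history, small_future)
```

Since every remaining user has at least four interactions, both halves are non-empty for every user under this 0.75 split.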


Finally, we create a sparse representation of the binary user-item interaction matrix for our train and test set. This sparse representation is a list of (user, item) tuples that have value 1 in the matrix.

In [84]:
def create_sparse_repr(sessions_df, column):
    user_ids = []
    item_ids = []
    for idx, row in tqdm(sessions_df.iterrows()):
        items = row[column]
        item_ids.extend(items)
        user = row['user_id_seq']
        user_ids.extend([user] * len(items))
    return list(zip(user_ids, item_ids))
    

train = create_sparse_repr(sessions_df, column='history')
test = create_sparse_repr(sessions_df, column='future')
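
For larger datasets, the same list of (user, item) pairs could also be produced without a Python-level loop over rows. The sketch below uses DataFrame.explode on made-up sessions. It is an alternative under the assumption that every list is non-empty (an empty list would explode to NaN and break the integer cast), not the method used in the cell above.

```python
import pandas as pd

# Made-up sessions in the same layout as sessions_df, for illustration only.
toy_sessions = pd.DataFrame({
    "user_id_seq": [0, 1],
    "history": [[10, 11, 12], [11, 13]],
})

# explode() repeats the user id once per item in the list column.
pairs = toy_sessions[["user_id_seq", "history"]].explode("history")

# The exploded column has object dtype, so cast it back to integers.
toy_train = list(zip(pairs["user_id_seq"].tolist(),
                     pairs["history"].astype(int).tolist()))

print(toy_train)
```

Each tuple marks one cell of the binary user-item matrix that has value 1; all other cells are implicitly 0.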


We store the train and test sets externally so they can be used in the training and evaluation notebook.

In [85]:
with open("./data/train.pkl", "wb") as f:
    pickle.dump(train, f)
with open("./data/test.pkl", "wb") as f:
    pickle.dump(test, f)
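
A later notebook can read these files back with pickle.load. The round trip below is a minimal sketch with made-up pairs and a hypothetical temporary path, so it runs without the real ./data/train.pkl.

```python
import os
import pickle
import tempfile

# Made-up (user, item) pairs standing in for the real train set.
toy_train = [(0, 10), (0, 11), (1, 11)]

# Hypothetical path in a temporary directory; the notebook itself writes ./data/train.pkl.
path = os.path.join(tempfile.mkdtemp(), "train.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_train, f)

# Reading the file back restores the same list of tuples.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == toy_train)
```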