# Data loading and preprocessing

In this notebook we retrieve our dataset and preprocess it into a format that is ready to use for training our matrix-factorization-based recommender system.

First, we import the necessary Python packages.

In [1]:
import gdown
import numpy as np
import pandas as pd
import scipy.sparse
from tqdm.auto import tqdm
import itertools
import json
import gzip
import math
from utils import read_json_fast
tqdm.pandas() # for progress_apply etc.

## Loading

Next, we download the files for our dataset (Goodreads). We use the gdown package to retrieve them from the Google Drive they're originally hosted on. 

> Since we will be implementing a collaborative filtering algorithm, we only need the interactions part of the dataset. The code for reading in the other parts of the dataset was left in as comments for potential future reference.

We define the URLs for each file...

In [2]:
URLS = {
    # "BOOKS": "https://drive.google.com/uc?id=1ICk5x0HXvXDp5Zt54CKPh5qz1HyUIn9m",
    # "AUTHORS": "https://drive.google.com/uc?id=19cdwyXwfXx_HDIgxXaHzH0mrx8nMyLvC",
    # "REVIEWS": "https://drive.google.com/u/0/uc?id=1V4MLeoEiPQdocCbUHjR_7L9ZmxTufPFe",
    "INTERACTIONS": "https://drive.google.com/uc?id=1CCj-cQw_mJLMdvF_YYfQ7ibKA-dC_GA2"
}

and download each file, unless it was already downloaded in a previous run of the notebook.

In [3]:
for name, url in URLS.items():
    gdown.cached_download(url, f"./data/{name}.json.gz", quiet=False)

File exists: ./data/INTERACTIONS.json.gz
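
The "download only when missing" behaviour can be sketched without any caching helper. This is a minimal illustration, not the notebook's actual code: `fetch_if_missing` and its `download` parameter are hypothetical names, and the demo swaps in a stand-in downloader so that it runs without network access.

```python
import os
import tempfile


def fetch_if_missing(url, path, download=None):
    """Download `url` to `path` unless a file already exists there.

    `download` is a hypothetical injection point: any callable taking
    (url, path). When omitted, gdown.download is used.
    """
    if os.path.exists(path):
        return False  # cached copy found, nothing to do
    if download is None:
        import gdown  # imported lazily so the sketch runs without gdown
        download = lambda u, p: gdown.download(u, output=p, quiet=False)
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    download(url, path)
    return True


# Stand-in downloader for the demo: writes a placeholder file, no network.
def fake_download(url, path):
    with open(path, "wb") as fh:
        fh.write(b"placeholder")


demo_path = os.path.join(tempfile.mkdtemp(), "data", "INTERACTIONS.json.gz")
first = fetch_if_missing("https://example.invalid/file", demo_path, fake_download)
second = fetch_if_missing("https://example.invalid/file", demo_path, fake_download)
print(first, second)  # the second call finds the file and skips the download
```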


We define the dataset file locations...

In [4]:
# books_file = './data/BOOKS.json.gz' # book metadata
interactions_file = './data/INTERACTIONS.json.gz' # user-book interactions (ratings)
# reviews_file = './data/REVIEWS.json.gz' # user-book interactions (reviews)
# authors_file = './data/AUTHORS.json.gz' # author metadata

and load the necessary files.

In [5]:
%%time
# df_books = read_json_fast(books_file)
df_interactions = read_json_fast(interactions_file)
# df_authors =  read_json_fast(authors_file)

Processing INTERACTIONS.json.gz:



Wall time: 55.1 s
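
The `read_json_fast` helper lives in our local `utils` module and is not shown in this notebook. As a rough idea of what such a loader could look like, here is a minimal sketch that assumes the file is gzipped with one JSON object per line. The name `read_json_lines_gz` and its parameters are hypothetical and may differ from the real helper.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def read_json_lines_gz(path, columns=None, max_lines=None):
    """Read a gzipped JSON-lines file into a DataFrame (hypothetical helper)."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for n, line in enumerate(fh):
            if max_lines is not None and n >= max_lines:
                break
            record = json.loads(line)
            if columns is not None:
                # keep only the requested fields to save memory
                record = {key: record.get(key) for key in columns}
            records.append(record)
    return pd.DataFrame.from_records(records, columns=columns)


# Demo on a tiny temporary file with made-up records.
demo_file = os.path.join(tempfile.mkdtemp(), "demo.json.gz")
with gzip.open(demo_file, "wt", encoding="utf-8") as fh:
    fh.write(json.dumps({"user_id": "u1", "book_id": 10, "rating": 0, "is_read": False}) + "\n")
    fh.write(json.dumps({"user_id": "u2", "book_id": 11, "rating": 5, "is_read": True}) + "\n")

demo_df = read_json_lines_gz(demo_file, columns=["user_id", "book_id", "rating"])
print(demo_df)
```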


Now we look at the contents of the loaded dataset.

In [6]:
display(df_interactions.head(1))

Unnamed: 0,user_id,book_id,review_id,is_read,rating,review_text_incomplete,date_added,date_updated,read_at,started_at
0,8842281e1d1347389f2ab93d60773d4d,836610,6b4db26aafeaf0da77c7de6214331e1e,False,0,,Mon Aug 21 12:11:00 -0700 2017,Mon Aug 21 12:11:00 -0700 2017,,


We can see that the dataset contains quite a few columns that are of no use to us. To make everything a little less cluttered we remove the columns that we don't use from the dataframe.

In [7]:
df_interactions = df_interactions[['user_id', 'book_id', 'rating']]

In [8]:
display(df_interactions.head(1))

Unnamed: 0,user_id,book_id,rating
0,8842281e1d1347389f2ab93d60773d4d,836610,0


## Pre-processing

Now we define a pre-processing function that removes users with fewer than `minsup` interactions.

In [9]:
def preprocess(df, minsup=5):
    """
    Goal: - Remove users that have less than minsup interactions
               
    :input df: Dataframe containing user_id and item_id 
    """
    # drop users with less then minsup items in history
    before = df.shape[0]
    g = df.groupby('user_id', as_index=False)['book_id'].size()
    g = g.rename({'size': 'user_sup'}, axis='columns')
    g = g[g.user_sup >= minsup]
    df = pd.merge(df, g, how='inner', on=['user_id'])
    print(f"After dropping users with less than {minsup} interactions: {before} -> {df.shape[0]}")
    return df
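
To see the effect of the filter on a small scale, the sketch below applies the same idea to a made-up dataframe. It uses `groupby(...).transform('size')` instead of the merge above; this is an alternative formulation that should keep the same rows, though it preserves the original index rather than resetting it. The name `filter_min_support` is hypothetical.

```python
import pandas as pd

# Toy interactions: user "a" has 3, user "b" has 1, user "c" has 2.
toy = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "c", "c"],
    "book_id": [1, 2, 3, 1, 2, 3],
    "rating": [5, 0, 4, 3, 2, 5],
})


def filter_min_support(df, minsup):
    """Keep only rows of users with at least `minsup` interactions."""
    # per-row count of how many interactions the row's user has
    user_sup = df.groupby("user_id")["book_id"].transform("size")
    return df[user_sup >= minsup].assign(user_sup=user_sup)


filtered = filter_min_support(toy, minsup=2)
print(filtered)  # user "b" is dropped, "a" and "c" remain
```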

Then we apply the pre-processing function to the dataframe, log the change in the number of samples, unique users and unique items, and assign sequential integer ids to the users and books.

In [12]:
# print the number of users and items before and after pre-processing
print(f"number of unique users: {df_interactions['user_id'].nunique()}")
print(f"number of unique items: {df_interactions['book_id'].nunique()}")
processed_df_interactions = preprocess(df_interactions.copy())
# display(processed_df_interactions.head(5))
print(f"number of unique users: {processed_df_interactions['user_id'].nunique()}")
print(f"number of unique items: {processed_df_interactions['book_id'].nunique()}")
# create sequential ids
processed_df_interactions['user_id_seq'] = processed_df_interactions['user_id'].astype('category').cat.codes
processed_df_interactions['book_id_seq'] = processed_df_interactions['book_id'].astype('category').cat.codes
# preview the result (book id and rating are zipped into tuples in a later cell)
display(processed_df_interactions.head(5))

number of unique users: 342415
number of unique items: 89411
After dropping users with less than 5 interactions: 7347630 -> 6995891
number of unique users: 148438
number of unique items: 89276


Unnamed: 0,user_id,book_id,rating,user_sup,user_id_seq,book_id_seq
0,8842281e1d1347389f2ab93d60773d4d,836610,0,14,79093,84684
1,8842281e1d1347389f2ab93d60773d4d,7648967,0,14,79093,82142
2,8842281e1d1347389f2ab93d60773d4d,15704307,0,14,79093,13284
3,8842281e1d1347389f2ab93d60773d4d,6902644,0,14,79093,79463
4,8842281e1d1347389f2ab93d60773d4d,9844623,0,14,79093,88989
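
The sequential ids come from pandas categorical codes. The toy example below, with made-up ids, illustrates two properties we rely on: codes are assigned in sorted order of the distinct values, and they run from 0 to the number of distinct values minus one, so `max() + 1` gives a valid matrix dimension.

```python
import pandas as pd

ids = pd.Series(["u9", "u2", "u9", "u5"])

# Categories are the sorted distinct values: u2 -> 0, u5 -> 1, u9 -> 2.
codes = ids.astype("category").cat.codes
print(codes.tolist())

# pd.factorize is an alternative that numbers values by first appearance.
appearance_codes, uniques = pd.factorize(ids)
print(appearance_codes.tolist(), list(uniques))

# Either way the codes are contiguous, so max + 1 equals the number of ids.
n_ids = int(codes.max()) + 1
```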


We store the mappings from the original user and book ids to their sequential ids in external files. These might come in handy in other notebooks.

In [13]:
processed_df_interactions[['user_id', 'user_id_seq']].drop_duplicates().to_pickle("./data/user_id_map_optim.pkl")
processed_df_interactions[['book_id', 'book_id_seq']].drop_duplicates().to_pickle("./data/book_id_map_optim.pkl")
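
In another notebook, such a mapping could be read back and turned into a lookup from sequential id to original id. The snippet below is only a sketch of that idea: it writes a tiny made-up mapping to a temporary file rather than touching the real pickles, and the variable names are illustrative.

```python
import os
import tempfile

import pandas as pd

# Made-up mapping with duplicates, shaped like the book id map stored above.
mapping = pd.DataFrame({
    "book_id": [836610, 7648967, 836610],
    "book_id_seq": [0, 1, 0],
}).drop_duplicates()

map_path = os.path.join(tempfile.mkdtemp(), "book_id_map_demo.pkl")
mapping.to_pickle(map_path)

# Later: load the mapping and build a dict from sequential id to original id.
loaded = pd.read_pickle(map_path)
seq_to_book = loaded.set_index("book_id_seq")["book_id"].to_dict()
print(seq_to_book[1])
```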

We group the interactions by user.

In [23]:
# Group per user
grouped_df = processed_df_interactions.groupby(by='user_id_seq')[['book_id_seq', 'rating']].agg(list)

In [24]:
grouped_df.head()

Unnamed: 0_level_0,book_id_seq,rating
user_id_seq,Unnamed: 1_level_1,Unnamed: 2_level_1
0,"[72891, 72893, 72892, 54114, 54119, 54116, 541...","[5, 5, 5, 4, 4, 4, 5, 4, 4, 4, 4, 4, 3, 4, 4, ..."
1,"[80948, 74947, 21936, 58280, 71625, 66209, 825...","[3, 3, 3, 3, 3, 4, 4, 0, 0, 0, 0, 0, 4, 4]"
2,"[46254, 14134, 10124, 10037, 8229, 19262, 5738...","[0, 4, 4, 4, 4, 5, 5, 0, 4, 4, 5, 0, 0, 4, 4, ..."
3,"[3360, 58805, 16295, 73323, 84853, 72690, 1291...","[4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 5, 4, 4, 3, 5, ..."
4,"[35464, 13870, 38229, 19984, 10867, 29282]","[0, 2, 0, 0, 4, 2]"


We zip the book_id_seq and rating columns for easier random splitting.

In [27]:
grouped_df['interactions'] = grouped_df.progress_apply(lambda row: list(zip(row['book_id_seq'], row['rating'])), axis=1)
grouped_df.head()


Unnamed: 0_level_0,book_id_seq,rating,interactions
user_id_seq,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,"[72891, 72893, 72892, 54114, 54119, 54116, 541...","[5, 5, 5, 4, 4, 4, 5, 4, 4, 4, 4, 4, 3, 4, 4, ...","[(72891, 5), (72893, 5), (72892, 5), (54114, 4..."
1,"[80948, 74947, 21936, 58280, 71625, 66209, 825...","[3, 3, 3, 3, 3, 4, 4, 0, 0, 0, 0, 0, 4, 4]","[(80948, 3), (74947, 3), (21936, 3), (58280, 3..."
2,"[46254, 14134, 10124, 10037, 8229, 19262, 5738...","[0, 4, 4, 4, 4, 5, 5, 0, 4, 4, 5, 0, 0, 4, 4, ...","[(46254, 0), (14134, 4), (10124, 4), (10037, 4..."
3,"[3360, 58805, 16295, 73323, 84853, 72690, 1291...","[4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 5, 4, 4, 3, 5, ...","[(3360, 4), (58805, 4), (16295, 4), (73323, 3)..."
4,"[35464, 13870, 38229, 19984, 10867, 29282]","[0, 2, 0, 0, 4, 2]","[(35464, 0), (13870, 2), (38229, 0), (19984, 0..."
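
The grouping and zipping steps are easier to follow on a toy dataframe. The sketch below uses made-up ids and builds the tuples with a list comprehension instead of a row-wise `progress_apply`; this is just one possible alternative and we have not benchmarked it against the version above.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id_seq": [0, 0, 1, 1, 1],
    "book_id_seq": [10, 11, 10, 12, 13],
    "rating": [5, 0, 3, 4, 4],
})

# One row per user, each cell holding that user's values as a list.
grouped = toy.groupby(by="user_id_seq")[["book_id_seq", "rating"]].agg(list)

# Pair every book with its rating so both move together when we split.
grouped["interactions"] = [
    list(zip(books, ratings))
    for books, ratings in zip(grouped["book_id_seq"], grouped["rating"])
]
print(grouped["interactions"].tolist())
```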


We create 5 random 80/20 train/test splits for validation. Each split divides every user's interactions separately, using a different random seed per split.

In [29]:
from sklearn.model_selection import train_test_split as split

In [32]:
# Split each user's interactions into 0.8 training and 0.2 test samples; repeat for 5 independent random splits (seeds 1 to 5)
percentage_train = 0.8

for i in range(1,6):
    grouped_df[[f'train{i}', f'test{i}']] = grouped_df.progress_apply(
        lambda row: split(row['interactions'], train_size=percentage_train, random_state=i), 
        axis=1, 
        result_type='expand'
    )


In [31]:
grouped_df.head()

Unnamed: 0_level_0,book_id_seq,rating,interactions,train1,test1,train2,test2,train3,test3,train4,test4,train5,test5
user_id_seq,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
0,"[72891, 72893, 72892, 54114, 54119, 54116, 541...","[5, 5, 5, 4, 4, 4, 5, 4, 4, 4, 4, 4, 3, 4, 4, ...","[(72891, 5), (72893, 5), (72892, 5), (54114, 4...","[(54117, 4), (54128, 4), (54119, 4), (72892, 5...","[(54113, 4), (54129, 4), (54114, 4), (54125, 4...","[(54121, 4), (54114, 4), (54131, 3), (54126, 4...","[(54122, 4), (72891, 5), (54123, 4), (54111, 5...","[(54121, 4), (72893, 5), (72892, 5), (54122, 4...","[(54123, 4), (54131, 3), (54117, 4), (54120, 4...","[(54128, 4), (54111, 5), (54130, 4), (72892, 5...","[(54123, 4), (54122, 4), (72891, 5), (54114, 4...","[(49843, 4), (54113, 4), (54128, 4), (54116, 4...","[(54124, 4), (72892, 5), (54117, 4), (54129, 4..."
1,"[80948, 74947, 21936, 58280, 71625, 66209, 825...","[3, 3, 3, 3, 3, 4, 4, 0, 0, 0, 0, 0, 4, 4]","[(80948, 3), (74947, 3), (21936, 3), (58280, 3...","[(21936, 3), (74222, 0), (71625, 3), (74947, 3...","[(58280, 3), (6866, 0), (82526, 4)]","[(80948, 3), (42502, 0), (58280, 3), (74947, 3...","[(31470, 0), (71625, 3), (66209, 4)]","[(21936, 3), (72186, 4), (82526, 4), (66209, 4...","[(6866, 0), (71625, 3), (74947, 3)]","[(42502, 0), (82526, 4), (72186, 4), (21936, 3...","[(71625, 3), (58280, 3), (31470, 0)]","[(21936, 3), (74222, 0), (72186, 4), (31470, 0...","[(66209, 4), (74947, 3), (6866, 0)]"
2,"[46254, 14134, 10124, 10037, 8229, 19262, 5738...","[0, 4, 4, 4, 4, 5, 5, 0, 4, 4, 5, 0, 0, 4, 4, ...","[(46254, 0), (14134, 4), (10124, 4), (10037, 4...","[(10915, 4), (42454, 0), (8905, 0), (8229, 4),...","[(57385, 5), (10037, 4), (14227, 4), (10124, 4)]","[(70563, 5), (19262, 5), (10037, 4), (14134, 4...","[(30219, 4), (8229, 4), (82626, 5), (46254, 0)]","[(8229, 4), (14227, 4), (57385, 5), (42454, 0)...","[(76666, 0), (10124, 4), (14134, 4), (10915, 4)]","[(8229, 4), (46254, 0), (8905, 0), (4462, 0), ...","[(57385, 5), (10037, 4), (82626, 5), (76666, 0)]","[(76666, 0), (71452, 5), (4462, 0), (42454, 0)...","[(19262, 5), (14134, 4), (70563, 5), (10124, 4)]"
3,"[3360, 58805, 16295, 73323, 84853, 72690, 1291...","[4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 5, 4, 4, 3, 5, ...","[(3360, 4), (58805, 4), (16295, 4), (73323, 3)...","[(70, 5), (87812, 3), (84853, 4), (16295, 4), ...","[(1102, 3), (42878, 5), (73323, 3), (87898, 3)...","[(76791, 4), (72074, 4), (87898, 3), (84853, 4...","[(42878, 5), (12912, 4), (3360, 4), (70, 5), (...","[(1102, 3), (72074, 4), (84853, 4), (87875, 4)...","[(87761, 4), (86969, 3), (58805, 4), (16295, 4...","[(87754, 5), (1102, 3), (84853, 4), (87812, 3)...","[(87761, 4), (87898, 3), (73323, 3), (3360, 4)...","[(72690, 4), (58805, 4), (86969, 3), (72074, 4...","[(16295, 4), (87875, 4), (87761, 4), (87754, 5..."
4,"[35464, 13870, 38229, 19984, 10867, 29282]","[0, 2, 0, 0, 4, 2]","[(35464, 0), (13870, 2), (38229, 0), (19984, 0...","[(10867, 4), (35464, 0), (19984, 0), (29282, 2)]","[(38229, 0), (13870, 2)]","[(19984, 0), (38229, 0), (29282, 2), (35464, 0)]","[(10867, 4), (13870, 2)]","[(10867, 4), (13870, 2), (35464, 0), (38229, 0)]","[(19984, 0), (29282, 2)]","[(19984, 0), (35464, 0), (13870, 2), (38229, 0)]","[(10867, 4), (29282, 2)]","[(10867, 4), (13870, 2), (35464, 0), (19984, 0)]","[(29282, 2), (38229, 0)]"
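
To make the per-user split concrete, the sketch below runs `train_test_split` on a single list of six `(book_id_seq, rating)` tuples. With `train_size=0.8` scikit-learn rounds the training size down, so six interactions give four training and two test items, and a user with the minimum of five interactions ends up with four and one. Fixing `random_state` makes the split reproducible.

```python
from sklearn.model_selection import train_test_split

interactions = [(35464, 0), (13870, 2), (38229, 0), (19984, 0), (10867, 4), (29282, 2)]

train, test = train_test_split(interactions, train_size=0.8, random_state=1)
print(len(train), len(test))

# The same seed yields the same split again.
train_again, test_again = train_test_split(interactions, train_size=0.8, random_state=1)

# A user with exactly 5 interactions still gets one test item.
small_train, small_test = train_test_split(interactions[:5], train_size=0.8, random_state=1)
```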


Finally, we create a sparse representation of the user-item interaction matrix for each of our train and test sets.

In [39]:
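def create_sparse_repr(df, column, shape):
    """
    Build a sparse user-item matrix from the (item, rating) tuples in `column`.

    :input df: Dataframe indexed by user_id_seq
    :input column: column holding a list of (book_id_seq, rating) tuples per user
    :input shape: (number of users, number of items) of the resulting matrix
    """
    user_ids = []
    item_ids = []
    values = []
    # collect one (user, item, rating) triplet per interaction
    for idx, row in tqdm(df.iterrows()):
        interactions = row[column]
        items, ratings = list(zip(*interactions))
        item_ids.extend(items)
        values.extend(ratings)
        user_ids.extend([idx] * len(items))
    matrix = scipy.sparse.coo_matrix((values, (user_ids, item_ids)), shape=shape, dtype=np.int32)
    return matrix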
    

shape = (processed_df_interactions['user_id_seq'].max() + 1,  processed_df_interactions['book_id_seq'].max() + 1)

for i in range(1,6):
    train = create_sparse_repr(grouped_df, column=f'train{i}', shape=shape)
    test = create_sparse_repr(grouped_df, column=f'test{i}', shape=shape)
    # We store the train and test set externally to be used in the training and evaluating notebook.
    scipy.sparse.save_npz(f'./data/train{i}.npz', train)
    scipy.sparse.save_npz(f'./data/test{i}.npz', test)

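
The small sketch below shows how the triplet (COO) format works on made-up data, and one detail worth keeping in mind when loading the saved matrices: interactions with rating 0 are stored as explicit entries, so they count towards `nnz` until `eliminate_zeros()` is called. Whether those entries should be kept depends on how the training notebook interprets a rating of 0, which is an assumption we leave open here.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Three interactions: (user, item, rating). One of them has rating 0.
users = [0, 0, 1]
items = [2, 0, 1]
ratings = [5, 0, 3]

coo = scipy.sparse.coo_matrix((ratings, (users, items)), shape=(2, 3), dtype=np.int32)
dense = coo.toarray()
print(dense)

# CSR is usually more convenient for row (per-user) access.
csr = coo.tocsr()
stored_before = csr.nnz   # counts the explicit zero as a stored entry
csr.eliminate_zeros()
stored_after = csr.nnz    # the rating-0 interaction is no longer stored

# Round trip through the .npz format used for the train and test sets.
npz_path = os.path.join(tempfile.mkdtemp(), "toy.npz")
scipy.sparse.save_npz(npz_path, coo)
loaded = scipy.sparse.load_npz(npz_path)
```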