# Data loading and preprocessing

In this notebook we retrieve our dataset and preprocess it into a format that is ready to use for training our matrix factorization based recommender system.

First, we import the necessary Python packages.

In [1]:
import gdown
import numpy as np
import pandas as pd
import scipy.sparse
from tqdm.auto import tqdm
import itertools
import json
import gzip
import math
from utils import read_json_fast
tqdm.pandas() # for progress_apply etc.

## Loading

Next, we download the files for our dataset (Goodreads). We use the gdown package to retrieve them from the Google Drive they're originally hosted on. 

> Since we will be implementing a collaborative filtering algorithm, we only need the interactions part of the dataset. The code for reading the other parts of the dataset was left in as comments for potential future reference.

We define the URLs for each file...

In [2]:
URLS = {
    # "BOOKS": "https://drive.google.com/uc?id=1ICk5x0HXvXDp5Zt54CKPh5qz1HyUIn9m",
    # "AUTHORS": "https://drive.google.com/uc?id=19cdwyXwfXx_HDIgxXaHzH0mrx8nMyLvC",
    # "REVIEWS": "https://drive.google.com/u/0/uc?id=1V4MLeoEiPQdocCbUHjR_7L9ZmxTufPFe",
    "INTERACTIONS": "https://drive.google.com/uc?id=1CCj-cQw_mJLMdvF_YYfQ7ibKA-dC_GA2"
}

and download each file (unless it was already downloaded in a previous run of the notebook).

In [3]:
for name, url in URLS.items():
    gdown.cached_download(url, f"./data/{name}.json.gz", quiet=False)

File exists: ./data/INTERACTIONS.json.gz


We define the dataset file locations...

In [4]:
# books_file = './data/BOOKS.json.gz' # book metadata
interactions_file = './data/INTERACTIONS.json.gz' # user-book interactions (ratings)
# reviews_file = './data/REVIEWS.json.gz' # user-book interactions (reviews)
# authors_file = './data/AUTHORS.json.gz' # author metadata

and load the necessary files.

In [5]:
%%time
# df_books = read_json_fast(books_file)
df_interactions = read_json_fast(interactions_file)
# df_authors =  read_json_fast(authors_file)

Processing INTERACTIONS.json.gz:



Wall time: 2min 59s
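
The `read_json_fast` helper comes from a local `utils` module that is not shown in this notebook. As a rough idea of what such a loader could look like, here is a minimal sketch that assumes the file is gzip-compressed with one JSON object per line. The function name `read_json_lines_sketch` and the `max_rows` parameter are hypothetical and not part of the original helper.

```python
import gzip
import json

import pandas as pd


def read_json_lines_sketch(path, max_rows=None):
    """Read a gzip-compressed JSON Lines file into a DataFrame (illustrative only)."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for i, line in enumerate(handle):
            # Optionally stop early, which is handy for quick experiments
            if max_rows is not None and i >= max_rows:
                break
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return pd.DataFrame.from_records(records)
```

Reading line by line keeps only the parsed records in memory, not the whole decompressed file, which matters for a file that takes minutes to load.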


Now we look at the contents of the loaded dataset.

In [6]:
display(df_interactions.head(1))

| | user_id | book_id | review_id | is_read | rating | review_text_incomplete | date_added | date_updated | read_at | started_at |
| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:|
| 0 | 8842281e1d1347389f2ab93d60773d4d | 836610 | 6b4db26aafeaf0da77c7de6214331e1e | False | 0 | | Mon Aug 21 12:11:00 -0700 2017 | Mon Aug 21 12:11:00 -0700 2017 | | |


We can see that the dataset contains quite a few columns that are of no use to us. To make everything a little less cluttered we remove the columns that we don't use from the dataframe.

In [7]:
df_interactions = df_interactions[['user_id', 'book_id', 'rating', 'date_updated']]

In [8]:
display(df_interactions.head(1))

| | user_id | book_id | rating | date_updated |
| ---:| ---:| ---:| ---:| ---:|
| 0 | 8842281e1d1347389f2ab93d60773d4d | 836610 | 0 | Mon Aug 21 12:11:00 -0700 2017 |


## Pre-processing

### Agreed pre-processing, training and evaluation for the Goodreads team (after discussion with Len)

#### preprocessing

- Minimum support for users: 5
- Minimum support for items: 1 (this should already hold in the dataset; check to be sure)
- Keep all the ratings and scores

#### training

- For validation we consider doing 5 random 80%/20% train-test splits.
- For each train-test pair, we first perform hyperparameter optimisation on the train part (via cross-validation) and then evaluate on the test part.
- This gives us 5x(recall@10, ndcg@10) for which we compute the mean and stdev.
- For the 5 random splits, we should all use the same seeds.
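
A minimal sketch of how the shared-seed agreement could be implemented, assuming the split is applied to row indices. The seed values in `SEEDS` and the function `random_split_indices` are hypothetical; the agreement only states that everyone uses the same seeds, not which ones.

```python
import numpy as np

# Hypothetical seed values; any fixed list shared by the team would do
SEEDS = [0, 1, 2, 3, 4]


def random_split_indices(n_rows, seed, train_fraction=0.8):
    """Return (train_idx, test_idx) for one reproducible random split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = int(round(n_rows * train_fraction))
    return order[:n_train], order[n_train:]


# Five 80%/20% splits over 10 toy rows, one per seed
splits = [random_split_indices(10, seed) for seed in SEEDS]
```

Because each split depends only on its seed, two people running this with the same list obtain identical train and test partitions.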

#### evaluation

- Recall@k for k = 5, 10
- NDCG@k for k = 5, 10
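
For reference, a minimal sketch of the two metrics with binary relevance. Definitions vary between libraries (for example, in the denominator used for recall), so the functions `recall_at_k` and `ndcg_at_k` below are one common formulation and an assumption, not necessarily the exact variant used in the evaluation notebook.

```python
import math


def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG: discounted hits divided by the best achievable value."""
    relevant = set(relevant_items)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: items 1 and 3 are hits, item 9 is never recommended
ranked = [1, 2, 3, 4, 5]
relevant = [1, 3, 9]
example_recall = recall_at_k(ranked, relevant, k=5)
example_ndcg = ndcg_at_k(ranked, relevant, k=5)
```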

The first pre-processing step we apply is converting all dates into a more standardized format.

In [9]:
%%time

from datetime import datetime

format_str = '%a %b %d %H:%M:%S %z %Y' #see https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
def convert_date(date_string):
  return pd.to_datetime(date_string, utc=True, format=format_str)

_df_interactions = df_interactions.copy()
# _df_interactions['date_updated'] =  _df_interactions['date_updated'].progress_apply(convert_date)
_df_interactions['date_updated'] = _df_interactions['date_updated'].progress_apply(lambda s: np.datetime64(datetime.strptime(s,format_str)))
_df_interactions['date_updated'] = _df_interactions['date_updated'].dt.tz_localize(None)  # drops utc timezone

  0%|          | 0/7347630 [00:00<?, ?it/s]



Wall time: 5min 2s
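
The row-by-row conversion above takes several minutes. A vectorized alternative is sketched below on two toy strings; it assumes that `pd.to_datetime` with an explicit `format` and `utc=True` is acceptable here, and it has not been timed on the full dataset.

```python
import pandas as pd

format_str = '%a %b %d %H:%M:%S %z %Y'

# Two toy timestamps in the same layout as the date_updated column
raw = pd.Series([
    'Mon Aug 21 12:11:00 -0700 2017',
    'Wed Jun 06 04:19:55 +0000 2012',
])

# Parse the whole column at once and normalize every offset to UTC
parsed = pd.to_datetime(raw, format=format_str, utc=True)

# Drop the timezone information while keeping the UTC wall-clock time
naive_utc = parsed.dt.tz_localize(None)
```

The first toy value becomes 2017-08-21 19:11:00, matching the converted `date_updated` value shown for the same interaction later in this notebook.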


Now we define a pre-processing function that:

1. Keeps all ratings, including the zero-ratings.
2. Removes duplicate (user, item) pairs.
3. Removes users that occur in fewer than user_minsup interactions (5 by default).
4. Removes items that occur in fewer than item_minsup interactions (1 by default).

In [10]:
# For optimization phase:
# Default minsup for users changed from 10 to 5
# Default minsup for items is added and defaults to 1
# All scores are taken into account because implicit feedback is used
def preprocess(df, user_minsup=5, item_minsup=1, min_score=None):
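    """
    Goal: - Optionally remove ratings below min_score
          - Remove reconsumption items (duplicate (user_id, book_id) pairs)
          - Remove users that have fewer than user_minsup interactions
          - Remove items that have fewer than item_minsup interactions

    :input df: Dataframe containing user_id, book_id, rating and date_updated
    """
    # drop ratings below min_score (skipped when min_score is None, so 0 ratings are kept by default)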
    if min_score is not None:
        before = df.shape[0]
        df = df[(df["rating"] >= min_score)]
        print(f"After dropping ratings below {min_score}: {before} -> {df.shape[0]}")
    # drop reconsumption items
    before = df.shape[0]
    df = df.drop_duplicates(subset=["user_id","book_id"])
    print(f"After drop_duplicates (reconsumption items): {before} -> {df.shape[0]}")
    # drop users with less than user_minsup items in history
    if user_minsup is not None:
        before = df.shape[0]
        g = df.groupby('user_id', as_index=False)['book_id'].size()
        g = g.rename({'size': 'user_sup'}, axis='columns')
        g = g[g.user_sup >= user_minsup]
        df = pd.merge(df, g, how='inner', on=['user_id'])
        print(f"After dropping users with less than {user_minsup} interactions: {before} -> {df.shape[0]}")
    # drop items with less than item_minsup interactions
    if item_minsup is not None:
        before = df.shape[0]
        g = df.groupby('book_id', as_index=False)['book_id'].size()
        g = g.rename({'size': 'item_sup'}, axis='columns')
        g = g[g.item_sup >= item_minsup]
        df = pd.merge(df, g, how='inner', on=['book_id'])
        print(f"After dropping items with less than {item_minsup} interactions: {before} -> {df.shape[0]}")
    return df
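
To see the minimum-support filter in isolation, here is a small sketch on toy data. It uses `groupby(...).transform("size")` in place of the groupby-and-merge approach above, which is a swapped-in technique: it keeps the same rows but does not add the `user_sup` column. The toy frame and the threshold of 3 are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2", "u1"],
    "book_id": ["b1", "b2", "b3", "b1", "b2", "b1"],
    "rating": [5, 0, 3, 4, 2, 1],
})

# Step 1: drop repeated (user, book) pairs, keeping the first occurrence
deduped = toy.drop_duplicates(subset=["user_id", "book_id"])

# Step 2: count interactions per user, aligned with the rows of `deduped`
user_sup = deduped.groupby("user_id")["book_id"].transform("size")

# Step 3: keep only users with at least 3 interactions (toy threshold)
filtered = deduped[user_sup >= 3]
```

Here `u1` has three distinct books after deduplication and survives, while `u2` has only two and is removed.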

In [11]:
#print number of users and items
print(f"number of unique users: {_df_interactions['user_id'].nunique()}")
print(f"number of unique items: {_df_interactions['book_id'].nunique()}")
processed_df_interactions = preprocess(_df_interactions.copy())
# display(processed_df_interactions.head(5))
print(f"number of unique users: {processed_df_interactions['user_id'].nunique()}")
print(f"number of unique items: {processed_df_interactions['book_id'].nunique()}")
# create sequential ids
processed_df_interactions['user_id_seq'] = processed_df_interactions['user_id'].astype('category').cat.codes
processed_df_interactions['book_id_seq'] = processed_df_interactions['book_id'].astype('category').cat.codes
# merge book id and rating for easier 
display(processed_df_interactions.head(5))

number of unique users: 342415
number of unique items: 89411
After drop_duplicates (reconsumption items): 7347630 -> 7347630
After dropping users with less than 5 interactions: 7347630 -> 6995891
After dropping items with less than 1 interactions: 6995891 -> 6995891
number of unique users: 148438
number of unique items: 89276


| | user_id | book_id | rating | date_updated | user_sup | item_sup | user_id_seq | book_id_seq |
| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:|
| 0 | 8842281e1d1347389f2ab93d60773d4d | 836610 | 0 | 2017-08-21 19:11:00 | 14 | 2222 | 79093 | 84684 |
| 1 | 37e4d1438f5918fd1400c12b49b80f61 | 836610 | 5 | 2012-06-06 04:19:55 | 261 | 2222 | 32344 | 84684 |
| 2 | a3d0b73b0f3580b60075b04ea76dd4cd | 836610 | 5 | 2016-02-16 03:36:35 | 93 | 2222 | 95123 | 84684 |
| 3 | 5752e83b5c1ccdd8014aaf358b80c199 | 836610 | 4 | 2009-08-11 18:11:43 | 46 | 2222 | 50661 | 84684 |
| 4 | eadf8c06370d9c6fd74121723c8d20e3 | 836610 | 5 | 2014-04-13 03:23:31 | 736 | 2222 | 136236 | 84684 |


The cell above applies the pre-processing function to the dataframe and logs the change in the number of samples, unique users and unique items.

We store the mapping from user/book IDs to their sequential IDs in external files. This might come in handy in other notebooks.

In [12]:
processed_df_interactions[['user_id', 'user_id_seq']].drop_duplicates().to_pickle("./data/user_id_map.pkl")
processed_df_interactions[['book_id', 'book_id_seq']].drop_duplicates().to_pickle("./data/book_id_map.pkl")
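
As an illustration of how such a mapping could be used elsewhere, the sketch below builds sequential IDs on a toy frame in the same way and turns the deduplicated map into lookup dictionaries. The toy book IDs and the names `seq_to_book` and `book_to_seq` are hypothetical; a real notebook would first load the map with `pd.read_pickle`.

```python
import pandas as pd

toy = pd.DataFrame({"book_id": [836610, 12, 836610, 7]})

# Category codes follow the sorted order of the distinct IDs: 7 -> 0, 12 -> 1, 836610 -> 2
toy["book_id_seq"] = toy["book_id"].astype("category").cat.codes

# Same shape as the frame stored in book_id_map.pkl
id_map = toy[["book_id", "book_id_seq"]].drop_duplicates()

# Plain dictionaries for translating in both directions
seq_to_book = dict(zip(id_map["book_id_seq"].tolist(), id_map["book_id"].tolist()))
book_to_seq = dict(zip(id_map["book_id"].tolist(), id_map["book_id_seq"].tolist()))
```

The reverse lookup is what lets a recommendation expressed as a matrix column index be translated back to the original Goodreads book ID.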

We sort the interactions by date and group them by user. This allows us to perform a session-based train-test split, in which each user's earliest interactions form the training history and the remaining ones form the test set.

In [22]:
def train_test_split_session(df, perc_train=0.7):
    # Sort on date and group per user
    sessions_df = df.sort_values(['date_updated'],ascending=True).groupby(by='user_id_seq', as_index=False)[['book_id_seq', 'rating', 'date_updated']].agg(list)

    # Function to perform split
    def split(row, col, percentage_train):
        items = row[col]
        no_train_items = math.floor(len(items) * percentage_train)
        return items[0:no_train_items], items[no_train_items:]

    # Default: split dataset into 0.7 training and 0.3 test samples, split in the temporal dimension.
    percentage_train = perc_train
    # train_items, test_items = split(items, percentage_train)
    sessions_df[['history', 'future']] = sessions_df.progress_apply(lambda row: split(row, 'book_id_seq', percentage_train), axis=1, result_type='expand')
    sessions_df[['history_ratings', 'future_ratings']] = sessions_df.progress_apply(lambda row: split(row, 'rating', percentage_train), axis=1, result_type='expand')
    return sessions_df
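
A small worked example of the split for a single user, using made-up book IDs and dates. With five interactions and a training share of 0.7, `math.floor(5 * 0.7)` gives three training items, so the two most recent interactions end up in the test part.

```python
import math

import pandas as pd

toy = pd.DataFrame({
    "user_id_seq": [0, 0, 0, 0, 0],
    "book_id_seq": [14, 10, 12, 11, 13],
    "date_updated": pd.to_datetime([
        "2017-01-05", "2017-01-01", "2017-01-03", "2017-01-02", "2017-01-04",
    ]),
})

# Order this user's books from oldest to newest interaction
items = toy.sort_values("date_updated")["book_id_seq"].tolist()

# Same rule as split(): the first floor(n * 0.7) items are the history
n_train = math.floor(len(items) * 0.7)
history, future = items[:n_train], items[n_train:]
```

Because the cut uses `floor`, a user with exactly five interactions (the minimum support) contributes three items to training and two to testing.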

Finally, we create a sparse representation of the user-item interaction matrix for our train and test set.

In [23]:
def create_sparse_repr_session(df, column, shape):
    user_ids = []
    item_ids = []
    values = []
    for idx, row in tqdm(df.iterrows()):
        items = row[column]
        item_ids.extend(items)
        user = row['user_id_seq']
        user_ids.extend([user] * len(items))
        ratings = row[column + "_ratings"]
        values.extend(ratings)
    matrix = scipy.sparse.coo_matrix((values, (user_ids, item_ids)), shape=shape, dtype=np.int32)
    return matrix

def train_test_split_coo(df):
    shape = (df['user_id_seq'].max() + 1,  df['book_id_seq'].max() + 1)
    session_df = train_test_split_session(df.copy())
    # display(session_df.head(5))
    train_coo = create_sparse_repr_session(session_df, column='history', shape=shape)
    test_coo = create_sparse_repr_session(session_df, column='future', shape=shape)
    return train_coo, test_coo
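
The sketch below shows, on three toy interactions, how the parallel lists of users, items and ratings line up in a COO matrix. It also illustrates one consequence of keeping zero ratings: they are stored as explicit zeros, so they count as stored entries but look like empty cells once the matrix is made dense. The toy values are made up.

```python
import numpy as np
import scipy.sparse

# One entry per interaction: user 0 rated item 0 with 5 and item 2 with 0, user 1 rated item 1 with 3
user_ids = [0, 0, 1]
item_ids = [0, 2, 1]
values = [5, 0, 3]

matrix = scipy.sparse.coo_matrix(
    (values, (user_ids, item_ids)), shape=(2, 3), dtype=np.int32
)

stored_entries = matrix.nnz                 # counts explicitly stored values, zeros included
nonzero_entries = matrix.count_nonzero()    # counts only values different from zero
dense = matrix.toarray()
```

Whether the training code should treat an explicit zero as an interaction or as missing data is a modelling choice that this notebook leaves to the downstream notebooks.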

We store the train and test sets externally so they can be used in the training and evaluation notebook.

In [24]:
train, test = train_test_split_coo(processed_df_interactions)
scipy.sparse.save_npz('./data/train.npz', train)
scipy.sparse.save_npz('./data/test.npz', test)
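
A minimal sketch of how a downstream notebook might read such a file back, shown as a round trip on a toy matrix written to a temporary directory. The file name `toy_train.npz` is hypothetical, and the conversion to CSR is an assumption made because CSR supports row (user) slicing, which COO does not.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

toy = scipy.sparse.coo_matrix(
    ([5, 3, 4], ([0, 1, 1], [0, 1, 2])), shape=(2, 3), dtype=np.int32
)

# Save to a temporary location; the notebook itself writes to ./data/
path = os.path.join(tempfile.mkdtemp(), "toy_train.npz")
scipy.sparse.save_npz(path, toy)

# Load and convert to CSR so that single users can be sliced efficiently
loaded = scipy.sparse.load_npz(path).tocsr()

# Column indices of the items that user 1 interacted with
user_1_items = loaded[1].indices
```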


## Implicit feedback processing

The user's rating for a book ranges from "1" to "5"; "0" indicates "not provided".

We convert the explicit feedback into implicit feedback using the following rules, where P is the preference, C is the confidence and alpha is the rate of increase (default 40):

| R         | P             | C            |
| ---------:|--------------:|-------------:|
| NA       | 0            | 1 |
| 0        | 0            | 2 |
| < 2.5    | 0            | 1 + (alpha x (5 - R)) |
| > 2.5    | 1            | 1 + (alpha x R) |
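
The rules in the table can be written as a small function, sketched below. The name `preference_and_confidence` is hypothetical, and `None` is used here to stand for the NA case. The table does not define a rating of exactly 2.5, which does not matter for integer ratings; this sketch treats it like the last row.

```python
def preference_and_confidence(rating, alpha=40):
    """Return (P, C) for one rating according to the table above."""
    if rating is None:          # NA: no interaction recorded
        return 0, 1
    if rating == 0:             # interaction without a provided rating
        return 0, 2
    if rating < 2.5:            # low rating: no preference, confidence grows as the rating drops
        return 0, 1 + alpha * (5 - rating)
    return 1, 1 + alpha * rating  # high rating: preference, confidence grows with the rating


# Worked values with the default alpha of 40
low = preference_and_confidence(1)    # (0, 1 + 40 * 4)
high = preference_and_confidence(5)   # (1, 1 + 40 * 5)
```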
    

In [17]:
def create_sparce_cui_matrix(df, alpha=40):
    """
    Creates a sparse user-item confidence matrix.

    The user-item matrix does double duty here. It defines which items are liked by which
    users (P_ui in the original paper), as well as how much confidence we have that the user
    liked the item (C_ui).
    In the intended scheme, positive items (rating > 2.5) get a positive C_ui,
    negative items (0 < rating < 2.5) get a negative C_ui, and unseen items (rating 0)
    get C_ui = -2. Items without a rating are implicitly defined with C_ui = -1 and are
    not stored in the matrix.
    The current implementation stores only items with rating > 2, with
    C_ui = 1 + alpha * (rating - 2); the negative and unseen cases are commented out.

    Args:
        df (pandas.DataFrame): user-item-scores data frame
        alpha (int): rate of increase

    Returns:
        cui_matrix (scipy.sparse.coo_matrix): user-item confidence sparse matrix
    """
    df_cui = df.copy()[['user_id_seq', 'book_id_seq', 'rating']]
    # df_cui.loc[:, 'cui'] = -2 # default cui for non-seen items
    df_cui = df_cui[df_cui.rating > 2]
    df_cui.loc[:, 'cui'] = 1 + alpha * (df_cui.rating - 2)
    # df_cui.loc[mask, 'cui'] = 1 + alpha * (df.rating - 2)
    # mask = ((df.rating < 2.5) & (df.rating > 0))
    # df_cui.loc[mask, 'cui'] = -(1 + alpha * (3 - df.rating))
    df_cui = df_cui.drop(columns='rating')
    shape = (df['user_id_seq'].max() + 1,  df['book_id_seq'].max() + 1)
    matrix = scipy.sparse.coo_matrix((df_cui.cui, (df_cui.user_id_seq, df_cui.book_id_seq)), shape=shape, dtype=np.float64)
    return matrix
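
To make the behaviour of the function concrete, the sketch below repeats its active lines on four toy interactions. Only ratings above 2 are kept, so with alpha = 40 a rating of 5 becomes 1 + 40 * 3 = 121 and a rating of 3 becomes 1 + 40 * 1 = 41, while ratings of 0 and 2 leave no entry. The toy frame is made up.

```python
import numpy as np
import pandas as pd
import scipy.sparse

alpha = 40
toy = pd.DataFrame({
    "user_id_seq": [0, 0, 1, 1],
    "book_id_seq": [0, 1, 1, 2],
    "rating": [5, 0, 2, 3],
})

# Keep only ratings above 2 and turn them into confidence values
kept = toy[toy.rating > 2].copy()
kept["cui"] = 1 + alpha * (kept.rating - 2)

# The shape comes from the full frame, so filtered-out users and items keep their index
shape = (int(toy.user_id_seq.max()) + 1, int(toy.book_id_seq.max()) + 1)
cui_toy = scipy.sparse.coo_matrix(
    (kept.cui.to_numpy(), (kept.user_id_seq.to_numpy(), kept.book_id_seq.to_numpy())),
    shape=shape,
    dtype=np.float64,
)
```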

In [18]:
%%time

df_rating = processed_df_interactions[['user_id_seq', 'book_id_seq', 'rating', 'date_updated']]
cui = create_sparce_cui_matrix(df_rating)
scipy.sparse.save_npz('./data/cui2.npz', cui)

Wall time: 3.46 s


In [77]:
cui.dtype

dtype('float64')