In [None]:
# for Colab notebooks, we will start by installing the ``collie_recs`` library
!pip install collie_recs --quiet

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

%env DATA_PATH data/

env: DATA_PATH=data/


# Data Preparation 

In [2]:
import os

import joblib

from collie_recs.cross_validation import stratified_split
from collie_recs.interactions import Interactions
from collie_recs.movielens import read_movielens_df
from collie_recs.utils import convert_to_implicit, remove_users_with_fewer_than_n_interactions

### Intuition 

Before we can start building a deep learning recommendation algorithm, we will need data! 

The core of any Collie model is interactions data, which is a matrix where the **rows** are the users and the **columns** are the items. Since models in Collie are *implicit* recommendation models, the **values** of the matrix will be ``0`` if a user has not interacted with an item and ``1`` if they have. 

In most recommendation scenarios, there are large numbers of users and items, but most users will have interacted with only a small percentage of these items. In an effort to make optimal use of the available memory, we will use a **sparse** matrix to represent this data. 
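To make the memory argument concrete, here is a minimal pure-Python sketch that stores only the observed (user, item) pairs and compares that count against the size of the dense matrix. It is an illustration of the idea only, not Collie's internal representation; the toy IDs and the helper names ``density`` and ``lookup`` are hypothetical.

```python
# Toy interactions: each pair is one observed (user_id, item_id) interaction.
# These IDs are made up for illustration.
observed = {(0, 2), (0, 5), (1, 2), (3, 0), (3, 5), (3, 7)}

num_users = 4
num_items = 8


def density(pairs, num_users, num_items):
    """Fraction of cells in the dense user-item matrix that hold a 1."""
    return len(pairs) / (num_users * num_items)


def lookup(pairs, user_id, item_id):
    """Return 1 if the interaction was observed, else the implicit 0."""
    return 1 if (user_id, item_id) in pairs else 0


print(f"dense cells:  {num_users * num_items}")
print(f"stored cells: {len(observed)}")
print(f"density:      {density(observed, num_users, num_items):.4f}")
```

Only the six observed pairs are stored; every other cell is an implied ``0`` that costs no memory.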

Collie contains a variety of recommendation models based around collaborative filtering, a technique behind recommendations of the "users who viewed this item also viewed these others" variety. The tutorial series will walk you through building a simple movie recommender before exploring different improvements Collie offers. Now, on to films! 

### MovieLens 100K 

MovieLens 100K<sup>1</sup> is a benchmark dataset for recommendation algorithms showing a subset of users' viewing preferences for a subset of films. There are **943 users** and **1682 items** in the dataset, with **100,000** interactions total. Though this dataset can easily fit in memory without having to be represented with a sparse matrix, we will still go ahead and treat it like we would a larger dataset. MovieLens 100K is a great choice for quick interactive training on a CPU. If you have a GPU available, give MovieLens 1M or MovieLens 10M a try - Collie works seamlessly on either! 

One important caveat is that MovieLens 100K data contains explicit ratings: instead of a ``0``/``1`` system for the values of our interaction matrix, users rate each movie on a 5-point scale. So, to start this toy example, we'll treat each 4 or 5 rating as a ``1`` and everything else as a ``0``. 

<font size="1"><sup>1</sup> <span id="fn1">Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic
Framework for Performing Collaborative Filtering. Proceedings of the
1999 Conference on Research and Development in Information
Retrieval. Aug. 1999.</span></font>
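The thresholding rule described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the implementation of ``convert_to_implicit``; the toy ratings and the helper name ``to_implicit`` are hypothetical.

```python
# Hypothetical explicit ratings: (user_id, item_id, rating on a 1-5 scale).
explicit_ratings = [
    (0, 10, 5),
    (0, 11, 2),
    (1, 10, 4),
    (1, 12, 3),
    (2, 11, 1),
]


def to_implicit(ratings, min_rating_to_keep=4):
    """Keep ratings at or above the threshold and relabel them as 1."""
    return [
        (user_id, item_id, 1)
        for user_id, item_id, rating in ratings
        if rating >= min_rating_to_keep
    ]


implicit_ratings = to_implicit(explicit_ratings)
print(implicit_ratings)
```

Rows below the threshold are dropped rather than stored as ``0``, since a sparse representation treats every missing pair as a ``0`` already.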

In [3]:
# Collie offers some helper functions to pull in + parse MovieLens 100K data.
# We'll shift our user and item IDs to start at 0 rather than 1 when we read the data here
df = read_movielens_df(decrement_ids=True)


df.head()

Making data path at ``data``...
Downloading MovieLens 100K data...


| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 195 | 241 | 3 | 881250949 |
| 1 | 185 | 301 | 3 | 891717742 |
| 2 | 21 | 376 | 1 | 878887116 |
| 3 | 243 | 50 | 2 | 880606923 |
| 4 | 165 | 345 | 1 | 886397596 |


In [4]:
# convert the explicit data to implicit by only keeping interactions with a rating ``>= 4``
implicit_df = convert_to_implicit(df, min_rating_to_keep=4)

# we'll also go ahead and remove users with fewer than 3 interactions so all remaining users
# have viewed enough films to provide good signal that a recommendations model can learn from
implicit_df = remove_users_with_fewer_than_n_interactions(implicit_df, min_num_of_interactions=3)
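For intuition, the minimum-interaction filter can be sketched without pandas as a count-then-filter pass. This is a hedged illustration, not the implementation of ``remove_users_with_fewer_than_n_interactions``; the toy data and the helper name ``keep_active_users`` are hypothetical.

```python
from collections import Counter

# Hypothetical implicit interactions: (user_id, item_id).
toy_pairs = [(0, 1), (0, 2), (0, 3), (1, 1), (2, 4), (2, 5), (2, 6), (2, 7)]


def keep_active_users(pairs, min_num_of_interactions=3):
    """Drop every interaction belonging to a user below the threshold."""
    counts = Counter(user_id for user_id, _ in pairs)
    return [
        (user_id, item_id)
        for user_id, item_id in pairs
        if counts[user_id] >= min_num_of_interactions
    ]


filtered = keep_active_users(toy_pairs)
print(sorted({user_id for user_id, _ in filtered}))
```

In this toy data, user ``1`` has a single interaction and is removed, which leaves a gap in the user IDs.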

In [5]:
# we now have a small, clean subset of data ready to train!
implicit_df.head()

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 297 | 473 | 1 | 884182806 |
| 1 | 252 | 464 | 1 | 891628467 |
| 2 | 285 | 1013 | 1 | 879781125 |
| 3 | 199 | 221 | 1 | 876042340 |
| 4 | 121 | 386 | 1 | 879270459 |


## ``Interactions`` 
While we have chosen to represent the data as a ``pandas.DataFrame`` for easy viewing now, Collie uses a custom ``torch.utils.data.Dataset`` called ``Interactions``. This class stores a sparse representation of the data and offers some handy benefits, including: 

* The ability to index the data with a ``__getitem__`` method 
* The ability to sample many negative items (we will get to this later!) 
* Nice quality checks to ensure data is free of errors before model training 
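As a rough picture of what "sampling negative items" means, the sketch below draws item IDs a user has *not* interacted with using simple rejection sampling. This approach is an assumption chosen for clarity and is not necessarily how ``Interactions`` does it; the function name ``sample_negatives`` and the toy values are hypothetical.

```python
import random


def sample_negatives(positive_items, num_items, num_negative_samples, rng):
    """Draw item IDs the user has not interacted with, with replacement."""
    if len(positive_items) >= num_items:
        raise ValueError("user has interacted with every item")
    negatives = []
    while len(negatives) < num_negative_samples:
        candidate = rng.randrange(num_items)
        # Reject any candidate that is actually a positive item.
        if candidate not in positive_items:
            negatives.append(candidate)
    return negatives


rng = random.Random(42)
positives = {1, 4, 7}  # hypothetical items one user has interacted with
negatives = sample_negatives(positives, num_items=10, num_negative_samples=5, rng=rng)
print(negatives)
```

Rejection sampling is cheap when users have interacted with only a small share of the catalog, which is the typical case for sparse interactions data.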

Instantiating the object is simple! 

In [6]:
# Since some users only rated movies below a 4, their IDs will be missing from the
# ``Interactions`` dataset, which will throw a ``ValueError``. Since we understand why
# this is happening and know it will not interfere with model results, we can go ahead
# and set ``allow_missing_ids`` to ``True``.
# Alternatively, we can renumber the user and item columns with something like
# ``implicit_df[col].astype('category').cat.codes``
interactions = Interactions(
    users=implicit_df['user_id'],
    items=implicit_df['item_id'],
    ratings=implicit_df['rating'],
    allow_missing_ids=True,
)


interactions

Checking ``num_negative_samples`` is valid...
Maximum number of items a user has interacted with: 378
Generating positive items set...


Interactions object with 55375 interactions between 943 users and 1674 items, returning 10 negative samples per interaction.
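The renumbering alternative mentioned in the cell above maps each distinct ID to a contiguous code. A pure-Python sketch of that mapping, which mirrors what pandas category codes produce for sortable, non-null values, might look like the following; the helper name ``renumber`` and the toy IDs are hypothetical.

```python
def renumber(ids):
    """Map each distinct ID to a contiguous code, ordered by sorted value."""
    code_for = {value: code for code, value in enumerate(sorted(set(ids)))}
    return [code_for[value] for value in ids], code_for


# Hypothetical user IDs with gaps (IDs 1 and 3 are missing).
user_ids = [0, 2, 2, 5, 4, 0]
codes, code_for = renumber(user_ids)
print(codes)
print(code_for)
```

If you renumber, keep the mapping around so that model outputs can be translated back to the original IDs.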

## Data Splits 
With an ``Interactions`` dataset, Collie supports two types of data splits. 

1. **Random split**: This code randomly assigns each interaction to a ``train``, ``validation``, or ``test`` dataset. While this is significantly faster to perform than a stratified split, it does not guarantee any balance, so a user may end up with no interactions in the ``train`` dataset and all of them in the ``test`` dataset. 
2. **Stratified split**: While this code runs slower than a random split, it guarantees that each user will be represented in the ``train``, ``validation``, and ``test`` datasets. This is by far the fairest way to train and evaluate a recommendation model. 

Since this is a small dataset and we have time, we will go ahead and use ``stratified_split``. If you're short on time, a ``random_split`` can easily be swapped in, since both functions share the same API! 
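To show what the stratified guarantee means in practice, here is a minimal two-way sketch that splits each user's interactions separately so that every user appears in both splits. It is not Collie's ``stratified_split`` implementation; the function name ``stratified_split_sketch``, the rounding rule, and the toy data are assumptions made for illustration, and the sketch assumes every user has at least two interactions.

```python
import random
from collections import defaultdict


def stratified_split_sketch(pairs, test_p=0.1, seed=42):
    """Split (user_id, item_id) pairs so each user lands in both splits.

    Assumes every user has at least two interactions.
    """
    rng = random.Random(seed)
    items_by_user = defaultdict(list)
    for user_id, item_id in pairs:
        items_by_user[user_id].append(item_id)

    train, test = [], []
    for user_id, items in items_by_user.items():
        rng.shuffle(items)
        # Hold out at least one item, but always leave one for training.
        num_test = min(max(1, round(len(items) * test_p)), len(items) - 1)
        test.extend((user_id, item_id) for item_id in items[:num_test])
        train.extend((user_id, item_id) for item_id in items[num_test:])
    return train, test


# Hypothetical toy data: 3 users with 10, 4, and 2 interactions.
toy_pairs = (
    [(0, i) for i in range(10)] + [(1, i) for i in range(4)] + [(2, i) for i in range(2)]
)
train, test = stratified_split_sketch(toy_pairs)
print(len(train), len(test))
```

A purely random split over all pairs would skip the per-user grouping, which is why it runs faster but can leave a user entirely in one split.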

In [7]:
train_interactions, val_interactions = stratified_split(interactions, test_p=0.1, seed=42)


train_interactions, val_interactions

Generating positive items set...
Generating positive items set...


(Interactions object with 49426 interactions between 943 users and 1674 items, returning 10 negative samples per interaction.,
 Interactions object with 5949 interactions between 943 users and 1674 items, returning 10 negative samples per interaction.)

In [8]:
# now we're going to dump these to files and move on to the next notebook!
joblib.dump(train_interactions, os.path.join(os.environ.get('DATA_PATH', 'data/'), 'train_interactions.pkl'))
joblib.dump(val_interactions, os.path.join(os.environ.get('DATA_PATH', 'data/'), 'val_interactions.pkl'))

['data/val_interactions.pkl']

Our data is ready! In the next notebook, we will train our first Collie model! 

----- 