Libraries:
----------
- demonstration of library functionality with a dataset: https://github.com/cheungdaven/DeepRec/blob/master/test/test_item_ranking.py
- https://github.com/lyst/lightfm
- https://github.com/Microsoft/Recommenders
- https://maciejkula.github.io/spotlight

Reading list: https://github.com/DeepGraphLearning/RecommenderSystems/blob/master/readingList.md

Datasets
----------
http://cseweb.ucsd.edu/~jmcauley/datasets.html#goodreads
* Items:	1,561,465
* Users:	808,749
* Interactions:	225,394,930

```json
{
  "user_id": "8842281e1d1347389f2ab93d60773d4d",
  "book_id": "130580",
  "review_id": "330f9c153c8d3347eb914c06b89c94da",
  "isRead": true,
  "rating": 4,
  "date_added": "Mon Aug 01 13:41:57 -0700 2011",
  "date_updated": "Mon Aug 01 13:42:41 -0700 2011",
  "read_at": "Fri Jan 01 00:00:00 -0800 1988",
  "started_at": ""
}
```
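The timestamps in the record above are plain strings, and unset fields such as `started_at` are empty. A minimal sketch of parsing them with the standard library follows; the helper name `parse_goodreads_time` and the format constant are illustrative and not part of the dataset's tooling, and the format string is inferred from this single sample record.

```python
from datetime import datetime

# Layout inferred from the sample record, e.g. "Mon Aug 01 13:41:57 -0700 2011".
# This is an assumption based on one record, not a documented dataset format.
GOODREADS_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_goodreads_time(value):
    """Return a timezone-aware datetime, or None when the field is empty."""
    if not value:
        return None
    return datetime.strptime(value, GOODREADS_TIME_FORMAT)


record = {
    "date_added": "Mon Aug 01 13:41:57 -0700 2011",
    "date_updated": "Mon Aug 01 13:42:41 -0700 2011",
    "started_at": "",
}

# Parse every field; empty strings become None.
parsed = {key: parse_goodreads_time(value) for key, value in record.items()}

# Time elapsed between adding and updating the shelf entry, in seconds.
seconds_between = (parsed["date_updated"] - parsed["date_added"]).total_seconds()
```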

https://snap.stanford.edu/data/amazon-meta.html




In [1]:
pip install git+https://github.com/maciejkula/spotlight.git

Collecting git+https://github.com/maciejkula/spotlight.git
  Cloning https://github.com/maciejkula/spotlight.git to /tmp/pip-req-build-ktb6jy_n
  Running command git clone -q https://github.com/maciejkula/spotlight.git /tmp/pip-req-build-ktb6jy_n
Building wheels for collected packages: spotlight
  Building wheel for spotlight (setup.py) ... done
  Created wheel for spotlight: filename=spotlight-0.1.6-cp36-none-any.whl size=33920 sha256=74e452ea107e62378a9064698daa3a12b469875135a32c4482e53739d43fe4bc
  Stored in directory: /tmp/pip-ephem-wheel-cache-lcv8z7zq/wheels/0a/33/c8/e8510ea648aaacf6031e128dfa92bcd3750f02db2aaf0922fe
Successfully built spotlight
Installing collected packages: spotlight
Successfully installed spotlight-0.1.6


In [0]:
from spotlight.datasets.goodbooks import get_goodbooks_dataset, _get_dataset
from spotlight.interactions import Interactions


In [3]:
!wget https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv

--2020-02-11 12:13:22--  https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3286659 (3.1M) [text/plain]
Saving to: ‘books.csv’


2020-02-11 12:13:22 (45.8 MB/s) - ‘books.csv’ saved [3286659/3286659]



In [0]:
import pandas as pd
books = pd.read_csv('books.csv', index_col=0)

In [5]:
books.head()

book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,title,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,Harry Potter and the Sorcerer's Stone (Harry P...,eng,4.44,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,eng,4.25,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,The Great Gatsby,eng,3.89,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In [0]:
def get_book_titles(book_ids):
  '''Get book titles by book ids
  Example:
  --------
  >>> get_book_titles(1)
  ['The Hunger Games (The Hunger Games, #1)']
  '''
  if isinstance(book_ids, int):
    book_ids = [book_ids]
  titles = []
  for book_id in book_ids:
    titles.append(books.loc[book_id, 'title'])
  return titles

In [0]:
data = _get_dataset()
interactions = Interactions(*data)

In [8]:
data

(array([    1,     2,     2, ..., 49925, 49925, 49925], dtype=int32),
 array([ 258, 4081,  260, ...,  722,  949, 1023], dtype=int32),
 array([5., 4., 5., ..., 4., 5., 4.], dtype=float32),
 array([      0,       1,       2, ..., 5976476, 5976477, 5976478],
       dtype=int32))

In [9]:
print(interactions)

<Interactions dataset (53425 users x 10001 items x 5976479 interactions)>


In [0]:
import torch

from spotlight.factorization.explicit import ExplicitFactorizationModel

model = ExplicitFactorizationModel(loss='regression',
                                   embedding_dim=128,  # latent dimensionality
                                   n_iter=10,  # number of epochs of training
                                   batch_size=1024,  # minibatch size
                                   l2=1e-9,  # strength of L2 regularization
                                   learning_rate=1e-3,
                                   use_cuda=torch.cuda.is_available())


In [0]:
from spotlight.cross_validation import random_train_test_split
import numpy as np

train, test = random_train_test_split(interactions, random_state=np.random.RandomState(42))


In [12]:
print('Split into \n {} and \n {}.'.format(train, test))

Split into 
 <Interactions dataset (53425 users x 10001 items x 4781183 interactions)> and 
 <Interactions dataset (53425 users x 10001 items x 1195296 interactions)>.


In [30]:
model.fit(train, verbose=True)
from spotlight.evaluation import rmse_score, precision_recall_score

train_rmse = rmse_score(model, train)
test_rmse = rmse_score(model, test)
train_precision, train_recall = precision_recall_score(model, train, k=5)
test_precision, test_recall = precision_recall_score(model, test, k=5)

print('Train RMSE {:.3f}, test RMSE {:.3f}'.format(train_rmse, test_rmse))
print(
    'mean train precision at 5: {:.3f}'.format(
        train_precision.mean()
))
print(
    'mean test precision at 5: {:.3f}'.format(
        test_precision.mean()
))

Epoch 0: loss 0.08782204981789599
Epoch 1: loss 0.07508557955445264
Epoch 2: loss 0.06620433954661613
Epoch 3: loss 0.05987947982370981
Epoch 4: loss 0.05520968570890467
Epoch 5: loss 0.05163776899749768
Epoch 6: loss 0.04881051659296871
Epoch 7: loss 0.046574856045733685
Epoch 8: loss 0.04472420661205653
Epoch 9: loss 0.043194792465939
Train RMSE 0.199, test RMSE 1.052
mean train precision at 5: 0.008
mean test precision at 5: 0.013
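The gap between the train RMSE (0.199) and the test RMSE (1.052) suggests the model fits the training ratings far more closely than unseen ones. For reference, root mean squared error can be sketched in plain Python as below; this is an illustration of the metric, not Spotlight's `rmse_score` implementation, and the function name `rmse` and the sample ratings are made up.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up predictions and true ratings on a 1-5 scale.
# Squared errors are 1, 0 and 4, so the mean is 5/3.
example_rmse = rmse([4.0, 3.0, 5.0], [5.0, 3.0, 3.0])
```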


In [0]:
# explaining predictions. Based on https://github.com/lyst/lightfm/blob/master/examples/quickstart/quickstart.ipynb

def sample_recommendation(model, user_ids, train, item_labels):
    '''Give recommendations for users given a model and explain recommendations.
    '''
    n_users, n_items = train.shape

    for user_id in user_ids:
        known_positives = item_labels[train[user_id].indices]
        
        scores = model.predict(user_id, np.arange(n_items))
        top_items = item_labels[np.argsort(-scores)]
        
        print("User %s" % user_id)
        print("     Known positives:")
        
        for x in known_positives[:3]:
            print("        %s" % x)

        print("     Recommended:")
        
        for x in top_items[:3]:
            print("        %s" % x)

In [15]:
book_labels = get_book_titles(list(train.item_ids))
book_labels[:10]

["Ahab's Wife, or The Star-Gazer",
 'City of Glass (The Mortal Instruments, #3)',
 "Enchanters' End Game (The Belgariad, #5)",
 'Frankenstein',
 'The Atlantis Complex (Artemis Fowl, #7)',
 'The Life and Times of the Thunderbolt Kid',
 'A Game of Thrones (A Song of Ice and Fire, #1)',
 'Disgrace',
 'Beautiful Creatures (Caster Chronicles, #1)',
 'The Alchemist']
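Note that `train.item_ids` holds one item id per interaction (see the `data` arrays above), so `book_labels` is aligned with interactions, whereas `sample_recommendation` indexes its labels by item id. A sketch of building a label list aligned with item ids instead is given below. The helper `build_item_labels` is hypothetical, and it assumes that Spotlight's item ids coincide with the `book_id` index of `books.csv` and that id 0 is unused.

```python
def build_item_labels(titles_by_id, num_items, missing="<unknown>"):
    """Return a list where position i holds the title of item id i."""
    labels = [missing] * num_items
    for item_id, title in titles_by_id.items():
        # Ignore ids that fall outside the model's item range.
        if 0 <= item_id < num_items:
            labels[item_id] = title
    return labels


# Titles taken from the books.head() output above, keyed by book_id.
titles_by_id = {
    1: "The Hunger Games (The Hunger Games, #1)",
    3: "Twilight (Twilight, #1)",
    4: "To Kill a Mockingbird",
    5: "The Great Gatsby",
}

item_labels = build_item_labels(titles_by_id, num_items=6)
```

With the notebook's objects, this might be called as `build_item_labels(books['title'].to_dict(), train.num_items)` and wrapped in `np.array(...)` before being passed to `sample_recommendation`.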

In [16]:
sample_recommendation(model, [3, 9999, 15000], train.tocsr(), np.array(book_labels))

User 3
     Known positives:
        The Atlantis Complex (Artemis Fowl, #7)
        Sentinel (Covenant, #5)
        The Devil Wears Prada (The Devil Wears Prada, #1)
     Recommended:
        Beautiful Creatures (Caster Chronicles, #1)
        The Silver Chair (Chronicles of Narnia, #4)
        The Husband's Secret
User 9999
     Known positives:
        City of Glass (The Mortal Instruments, #3)
        The Magicians' Guild (Black Magician Trilogy, #1)
        Bridge to Terabithia
     Recommended:
        Fearless Fourteen (Stephanie Plum, #14)
        Futures and Frosting (Chocolate Lovers, #2)
        The Selfish Gene
User 15000
     Known positives:
        Enchanters' End Game (The Belgariad, #5)
        The Life and Times of the Thunderbolt Kid
        Beautiful Creatures (Caster Chronicles, #1)
     Recommended:
        Matilda
        Othello
        Founding Brothers: The Revolutionary Generation


In [17]:
from spotlight.evaluation import precision_recall_score

train_prs = precision_recall_score(model, train, k=5)
test_prs = precision_recall_score(model, test, k=5)

print('Train RMSE {:.3f}, test RMSE {:.3f}'.format(train_rmse, test_rmse))


Train RMSE 0.266, test RMSE 0.964


In [0]:
precision_train, _ = train_prs
precision_test, _ = test_prs

In [29]:
print(
    'mean train precision at 5: {:.3f}'.format(
        precision_train.mean()
))
print(
    'mean test precision at 5: {:.3f}'.format(
        precision_test.mean()
))

mean train precision at 5: 0.028
mean test precision at 5: 0.020


In [20]:
train_prs[0].mean()

0.02811096136567835

In [24]:
test_prs[0].mean()

0.01989068174160458
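The values above are per-user precision at 5, averaged over users. A minimal sketch of the metric for a single user follows; it illustrates the definition only and may differ in details from Spotlight's `precision_recall_score` (for example in how already-seen items are handled). The item ids are made up.

```python
def precision_at_k(ranked_items, relevant_items, k=5):
    """Fraction of the top-k ranked items that are relevant for the user."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k


# Items ordered from highest to lowest predicted score (made-up ids).
ranked = [10, 42, 7, 99, 3, 8]
# Items the user actually interacted with in the held-out set.
relevant = {42, 3, 8}

# Items 42 and 3 are in the top 5; item 8 is ranked sixth and does not count.
example_precision = precision_at_k(ranked, relevant, k=5)
```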

Epoch 0: loss 2.750798608710475
Epoch 1: loss 0.749443441718753
Epoch 2: loss 0.6722490007157499
Epoch 3: loss 0.5743380024997274
Epoch 4: loss 0.45117901287318807
Epoch 5: loss 0.3322077094168428
Epoch 6: loss 0.24039550715901867
Epoch 7: loss 0.176841794404279
Epoch 8: loss 0.13474581096425045
Epoch 9: loss 0.10675936884683511

In [0]:
import os
import shutil
from sklearn.model_selection import ParameterSampler

CUDA = (os.environ.get('CUDA') is not None or
        shutil.which('nvidia-smi') is not None)

NUM_SAMPLES = 100

LEARNING_RATES = [1e-3, 1e-2, 5 * 1e-2, 1e-1]
LOSSES = ['bpr', 'hinge', 'adaptive_hinge', 'pointwise']
BATCH_SIZE = [8, 16, 32, 256]
EMBEDDING_DIM = [8, 16, 32, 64, 128, 256]
N_ITER = list(range(5, 20))
L2 = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.0]



def sample_cnn_hyperparameters(random_state, num):

    space = {
        'n_iter': N_ITER,
        'batch_size': BATCH_SIZE,
        'l2': L2,
        'learning_rate': LEARNING_RATES,
        'loss': LOSSES,
        'embedding_dim': EMBEDDING_DIM,
        'kernel_width': [3, 5, 7],
        'num_layers': list(range(1, 10)),
        'dilation_multiplier': [1, 2],
        'nonlinearity': ['tanh', 'relu'],
        'residual': [True, False]
    }

    sampler = ParameterSampler(space,
                               n_iter=num,
                               random_state=random_state)

    for params in sampler:
        params['dilation'] = list(params['dilation_multiplier'] ** (i % 8)
                                  for i in range(params['num_layers']))

        yield params
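The `dilation` entry is derived from the sampled `dilation_multiplier` and `num_layers`: layer `i` gets `dilation_multiplier ** (i % 8)`, so the dilation grows geometrically and wraps around every eight layers. The standalone helper below (the name `dilation_schedule` is illustrative) reproduces that expression so the resulting schedule can be inspected.

```python
def dilation_schedule(dilation_multiplier, num_layers, cycle=8):
    """Per-layer dilation: multiplier ** (layer % cycle), as in the sampler above."""
    return [dilation_multiplier ** (layer % cycle) for layer in range(num_layers)]


# With multiplier 2 the dilation doubles per layer and resets after 8 layers.
doubling = dilation_schedule(2, 10)

# With multiplier 1 every layer has dilation 1 (an ordinary convolution).
constant = dilation_schedule(1, 3)
```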


In [0]:
seed = 41
random_state = np.random.RandomState(seed)
hyperparameters = next(sample_cnn_hyperparameters(random_state, 1))

In [0]:
hyperparameters

{'batch_size': 32,
 'dilation': [1],
 'dilation_multiplier': 1,
 'embedding_dim': 16,
 'kernel_width': 5,
 'l2': 0.001,
 'learning_rate': 0.001,
 'loss': 'adaptive_hinge',
 'n_iter': 7,
 'nonlinearity': 'tanh',
 'num_layers': 1,
 'residual': True}

In [0]:
train.to_sequence()

<Sequence interactions dataset (502153 sequences x 10 sequence length)>
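The output shows each user's interactions turned into fixed-length rows of 10 items. The sketch below illustrates the general idea of cutting one user's chronologically ordered items into padded windows. It is not Spotlight's implementation: the function name, the use of 0 as padding, the left-padding, and the non-overlapping step are all assumptions made for illustration.

```python
def to_fixed_length_sequences(items, max_len=10, pad=0):
    """Split an ordered item list into windows of max_len, left-padding short ones."""
    sequences = []
    for start in range(0, len(items), max_len):
        chunk = items[start:start + max_len]
        # Pad on the left so the most recent items stay at the end of the row.
        sequences.append([pad] * (max_len - len(chunk)) + chunk)
    return sequences


# A made-up user with 13 interactions, item ids 1..13 in time order.
windows = to_fixed_length_sequences(list(range(1, 14)), max_len=10)
```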

In [0]:
# https://github.com/maciejkula/spotlight/tree/master/examples/movielens_sequence
import torch

from spotlight.sequence.implicit import ImplicitSequenceModel
from spotlight.sequence.representations import CNNNet
from spotlight.evaluation import sequence_mrr_score


net = CNNNet(train.num_items,
             embedding_dim=hyperparameters['embedding_dim'],
             kernel_width=hyperparameters['kernel_width'],
             dilation=hyperparameters['dilation'],
             num_layers=hyperparameters['num_layers'],
             nonlinearity=hyperparameters['nonlinearity'],
             residual_connections=hyperparameters['residual'])

model = ImplicitSequenceModel(loss=hyperparameters['loss'],
                              representation=net,
                              batch_size=hyperparameters['batch_size'],
                              learning_rate=hyperparameters['learning_rate'],
                              l2=hyperparameters['l2'],
                              n_iter=hyperparameters['n_iter'],
                              use_cuda=torch.cuda.is_available(),
                              random_state=random_state)

model.fit(train.to_sequence())

test_mrr = sequence_mrr_score(model, test.to_sequence())
#val_mrr = sequence_mrr_score(model, validation.to_sequence())




In [0]:
# Mean reciprocal rank
train_mrr = sequence_mrr_score(model, train.to_sequence())
test_mrr = sequence_mrr_score(model, test.to_sequence())



In [0]:
print('Train overall MRR {:.3f}, test overall MRR {:.3f}'.format(train_mrr.mean(), test_mrr.mean()))

Train overall MRR 0.017, test overall MRR 0.013
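Mean reciprocal rank averages `1 / rank` of the held-out item over all test cases, so values near 0.01 to 0.02 mean the correct item typically sits far down the ranking. A plain-Python sketch of the metric follows; it illustrates the definition rather than Spotlight's `sequence_mrr_score`, and the rankings and targets are made up.

```python
def reciprocal_rank(ranked_items, target):
    """Return 1 / position of target in the ranking, or 0.0 if it is absent."""
    for position, item in enumerate(ranked_items, start=1):
        if item == target:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings, targets):
    """Average the reciprocal rank over all (ranking, target) pairs."""
    ranks = [reciprocal_rank(r, t) for r, t in zip(rankings, targets)]
    return sum(ranks) / len(ranks)


# Target 5 is ranked first, then second, and the last target (4) is missing.
rankings = [[5, 3, 9], [3, 5, 9], [9, 5, 3]]
targets = [5, 5, 4]
example_mrr = mean_reciprocal_rank(rankings, targets)
```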


In [0]:
train_rmse = rmse_score(model, train)
test_rmse = rmse_score(model, test)
print('Train RMSE {:.3f}, test RMSE {:.3f}'.format(train_rmse, test_rmse))

ValueError: ignored

In [0]:
sample_recommendation(model, [3, 9999, 15000], train.tocsr(), np.array(book_labels))



RuntimeError: ignored

In [0]:
pip install lightfm

Collecting lightfm
  Downloading https://files.pythonhosted.org/packages/e9/8e/5485ac5a8616abe1c673d1e033e2f232b4319ab95424b42499fabff2257f/lightfm-1.15.tar.gz (302kB)

In [0]:
# from tutorial at https://github.com/lyst/lightfm
from lightfm import LightFM
from lightfm.evaluation import precision_at_k

# The original tutorial loads the MovieLens 100k dataset and treats only
# five-star ratings as positive:
#data = fetch_movielens(min_rating=5.0)
# Here the goodbooks train split created above is used instead.

# Instantiate and train the model
model = LightFM(loss='warp')
model.fit(train.tocoo(), epochs=30, num_threads=2)

<lightfm.lightfm.LightFM at 0x7f05f786c5f8>

In [0]:
# Evaluate the trained model
test_precision = precision_at_k(model, test.tocoo(), k=5)

In [0]:
test_precision  # .mean()

array([0.6, 0. , 0. , ..., 0.2, 0.2, 0.4], dtype=float32)

In [0]:
test_precision.mean()

0.11494516

In [0]:
sample_recommendation(model, [3, 9999, 15000], train.tocsr(), np.array(book_labels))

User 3
     Known positives:
        The Atlantis Complex (Artemis Fowl, #7)
        Sentinel (Covenant, #5)
        The Devil Wears Prada (The Devil Wears Prada, #1)
     Recommended:
        The Life and Times of the Thunderbolt Kid
        The Atlantis Complex (Artemis Fowl, #7)
        Beautiful Creatures (Caster Chronicles, #1)
User 9999
     Known positives:
        City of Glass (The Mortal Instruments, #3)
        The Magicians' Guild (Black Magician Trilogy, #1)
        Bridge to Terabithia
     Recommended:
        Enchanters' End Game (The Belgariad, #5)
        City of Glass (The Mortal Instruments, #3)
        Frankenstein
User 15000
     Known positives:
        Enchanters' End Game (The Belgariad, #5)
        The Life and Times of the Thunderbolt Kid
        Beautiful Creatures (Caster Chronicles, #1)
     Recommended:
        Where the Heart Is
        Mockingjay (The Hunger Games, #3)
        A Map of the World


In [0]:
# https://github.com/cheungdaven/DeepRec/blob/master/test/test_item_ranking.py

In [0]:
# https://github.com/microsoft/recommenders/blob/master/benchmarks/movielens.ipynb