 - Explore a different recommendation dataset
 - Develop and evaluate baseline recommender systems
 - Implement hybrid recommender models
 - Explore diversification issues in recommender systems

# Part-Pre. Preparation

## Pre 1. Setup Block

This exercise will use the [Goodreads]() dataset for books. These blocks set up the data files, the Python environment, etc.

In [None]:
!rm -rf ratings* books* to_read* test*

!curl -o ratings.csv "https://www.dcs.gla.ac.uk/~craigm/recsysH/coursework/final-ratings.csv"
!curl -o books.csv "https://www.dcs.gla.ac.uk/~craigm/recsysH/coursework/final-books.csv"
!curl -o to_read.csv "https://www.dcs.gla.ac.uk/~craigm/recsysH/coursework/final-to_read.csv"
!curl -o test.csv "https://www.dcs.gla.ac.uk/~craigm/recsysH/coursework/final-test.csv"



In [None]:
#Standard setup
import pandas as pd
import numpy as np
import torch
!pip install git+https://github.com/cmacdonald/spotlight.git@seed#egg=spotlight
from spotlight.interactions import Interactions
SEED=20
BPRMF=None


## Pre 2. Data Preparation

Let's load the `goodbooks` dataset into dataframes.
- `ratings.csv`: It contains ratings sorted by time. Ratings go from one to five.
- `to_read.csv`: It provides IDs of the books marked "to read" by each user, as <user_id, book_id> pairs.
- `books.csv`: It has metadata for each book (goodreads IDs, authors, title, average rating, etc.).

In [None]:
#load in the csv files
ratings_df = pd.read_csv("ratings.csv")
books_df = pd.read_csv("books.csv")
to_read_df = pd.read_csv("to_read.csv")
test = pd.read_csv("test.csv")

In [None]:
## Test
to_read_df.head()

Unnamed: 0.1,Unnamed: 0,user_id,book_id
0,560054,2278,4232
1,277900,2118,298
2,87083,2769,3934
3,77727,2540,293
4,240676,3142,147


In [None]:
#cut down the number of items and users
counts=ratings_df[ratings_df["book_id"] < 2000].groupby(["book_id"]).count().reset_index()
valid_books=counts[counts["user_id"] >= 10][["book_id"]]
print(valid_books)

books_df = books_df.merge(valid_books, on="book_id")
ratings_df = ratings_df[ratings_df["user_id"] < 2000].merge(valid_books, on="book_id")
to_read_df = to_read_df[to_read_df["user_id"] < 2000].merge(valid_books, on="book_id")
test = test[test["user_id"] < 2000].merge(valid_books, on="book_id")


#stringify the id columns
def str_col(df):
  if "user_id" in df.columns:
    df["user_id"] = "u" + df.user_id.astype(str)
  if "book_id" in df.columns:
    df["book_id"] = "b" + df.book_id.astype(str)

str_col(books_df)
str_col(ratings_df)
str_col(to_read_df)
str_col(test)

      book_id
0           1
1           2
2           3
3           4
4           5
...       ...
1873     1990
1874     1991
1876     1993
1877     1997
1879     1999

[1826 rows x 1 columns]


Here we construct the Interactions objects from `ratings.csv`, `to_read.csv` and `test.csv`. We manually specify the num_users and num_items parameters to all Interactions objects, in case the test set differs from your training sets.
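The string ids first have to be mapped to contiguous integer ids. One idiom for this is `defaultdict(count().__next__)`, which hands out the next unused integer the first time a key is seen. The snippet below is a minimal, self-contained sketch of that idiom; the `raw_ids` values are made up for illustration.

```python
from collections import defaultdict
from itertools import count

# each unseen key is assigned the next integer: 0, 1, 2, ...
id_map = defaultdict(count().__next__)

raw_ids = ["b7", "b3", "b7", "b9"]  # toy ids, not taken from the dataset
internal = [id_map[r] for r in raw_ids]

# reverse map: internal integer id -> original string id
rev_map = {v: k for k, v in id_map.items()}

print(internal)    # [0, 1, 0, 2]
print(rev_map[2])  # b9
```

Because ids are assigned in order of first appearance, whichever dataframe is mapped first determines which entity receives id 0.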

In [None]:
from collections import defaultdict
from itertools import count

from spotlight.cross_validation import random_train_test_split

iid_map = defaultdict(count().__next__)


rating_iids = np.array([iid_map[iid] for iid in ratings_df["book_id"].values], dtype = np.int32)
test_iids = np.array([iid_map[iid] for iid in test["book_id"].values], dtype = np.int32)
toread_iids = np.array([iid_map[iid] for iid in to_read_df["book_id"].values], dtype = np.int32)


uid_map = defaultdict(count().__next__)
test_uids = np.array([uid_map[uid] for uid in test["user_id"].values], dtype = np.int32)
rating_uids = np.array([uid_map[uid] for uid in ratings_df["user_id"].values], dtype = np.int32)
toread_uids = np.array([uid_map[uid] for uid in to_read_df["user_id"].values], dtype = np.int32)


uid_rev_map = {v: k for k, v in uid_map.items()}
iid_rev_map = {v: k for k, v in iid_map.items()}


rating_dataset = Interactions(user_ids=rating_uids,
                               item_ids=rating_iids,
                               ratings=ratings_df["rating"].values,
                               num_users=len(uid_rev_map),
                               num_items=len(iid_rev_map))

toread_dataset = Interactions(user_ids=toread_uids,
                               item_ids=toread_iids,
                               num_users=len(uid_rev_map),
                               num_items=len(iid_rev_map))

test_dataset = Interactions(user_ids=test_uids,
                               item_ids=test_iids,
                               num_users=len(uid_rev_map),
                               num_items=len(iid_rev_map))

print(rating_dataset)
print(toread_dataset)
print(test_dataset)

#here we define the validation set
toread_dataset_train, validation = random_train_test_split(toread_dataset, random_state=np.random.RandomState(SEED))

num_items = test_dataset.num_items
num_users = test_dataset.num_users

<Interactions dataset (1999 users x 1826 items x 124762 interactions)>
<Interactions dataset (1999 users x 1826 items x 135615 interactions)>
<Interactions dataset (1999 users x 1826 items x 33917 interactions)>


Finally, this is some utility code that we will use in the exercise.

In [None]:
def getAuthorTitle(iid):
  bookid = iid_rev_map[iid]
  row = books_df[books_df.book_id == bookid]
  return row.iloc[0]["authors"] + " / " + row.iloc[0]["title"]

print("iid 0: " + getAuthorTitle(0) )

iid 0: Carlos Ruiz Zafón, Lucia Graves / The Shadow of the Wind (The Cemetery of Forgotten Books,  #1)


## Pre 3. Example Code


Here is an example recommender object that returns 0 for each item, regardless of user.

In [None]:
from spotlight.evaluation import mrr_score, precision_recall_score

class dummymodel:

  def __init__(self, numitems):
    self.predictions=np.zeros(numitems)

  #uid is the user we are requesting recommendations for;
  #returns an array of scores, one for each item
  def predict(self, uid):
    #this model returns all zeros, regardless of userid
    return( self.predictions )

#let's evaluate the effectiveness of dummymodel

print(mrr_score(dummymodel(num_items), test_dataset, train=rating_dataset, k=100).mean())
#as expected, a recommendation model that gives 0 scores for all items obtains an MRR score of 0

0.0


In [None]:
#note that mrr_score() displays a progress bar if you set verbose=True
print(mrr_score(dummymodel(num_items), test_dataset, train=rating_dataset, k=100, verbose=True).mean())

1999it [00:00, 2488.00it/s]

0.0
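To make the metric more concrete, here is a simplified sketch of a reciprocal rank computation for a single user. It is an illustration only: `reciprocal_rank` is a hypothetical helper, it considers only the first relevant item within the top `k`, and Spotlight's `mrr_score()` may differ in its details (for example in how training items and ties are handled).

```python
import numpy as np

def reciprocal_rank(scores, relevant_iids, k=100):
    """Return 1/rank of the first relevant item in the top k, or 0.0 if there is none."""
    # item ids ordered by descending score, truncated to the top k
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")[:k]
    relevant = set(int(i) for i in relevant_iids)
    for rank, iid in enumerate(order, start=1):
        if int(iid) in relevant:
            return 1.0 / rank
    return 0.0

scores = np.array([0.1, 0.9, 0.3, 0.7])  # toy scores for four items
print(reciprocal_rank(scores, relevant_iids=[3]))  # item 3 is ranked second -> 0.5
```

Averaging this value over all users gives a mean reciprocal rank.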





# Part-A. Combination of Recommendation Models

## Explicit & Implicit Matrix Factorisation Models

Create and train three matrix factorisation systems:

(NOTE: Different models will be trained using DIFFERENT datasets)
 - "EMF": explicit MF, trained on the **ratings** Interactions object (`rating_dataset`)
 - "IMF": implicit MF, trained on the **toread** Interactions object (`toread_dataset_train`)
 - "BPRMF": implicit MF with the BPR loss function (`loss='bpr'`), trained on the **toread** Interactions object (`toread_dataset_train`)

Normally, the hyper-parameters (e.g. `embedding_dim`) would be tuned separately for each model using the `validation` set, but here, to simplify the exercise, we use fixed hyper-parameter settings and a fixed random seed.
  
In all cases, use the standard initialisation arguments, i.e.
`n_iter=10, embedding_dim=32, use_cuda=False, random_state=np.random.RandomState(SEED)`.
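Passing a freshly constructed `np.random.RandomState(SEED)` to each model means that every model starts from the same random state. A small sketch of why this gives repeatable draws (plain NumPy, independent of Spotlight):

```python
import numpy as np

SEED = 20  # same value as in the setup block

# two generators created from the same seed produce identical draws
first = np.random.RandomState(SEED).rand(3)
second = np.random.RandomState(SEED).rand(3)

print(np.array_equal(first, second))  # True
```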

Evaluate each of these models in terms of Mean Reciprocal Rank on the test set. MRR can be obtained using:
```python
mrr_score(X, test_dataset, train=rating_dataset, k=100, verbose=True).mean()
```
where X is an instance of a Spotlight model.
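For intuition about the BPR loss used by the third model, the sketch below implements the pairwise objective as it is usually stated in the literature: for each (positive item, sampled negative item) pair, minimise `-log(sigmoid(score_pos - score_neg))`. This is an assumption-laden illustration; `bpr_loss_sketch` is a hypothetical helper, and the loss that Spotlight applies for `loss='bpr'` may be implemented differently.

```python
import numpy as np

def bpr_loss_sketch(pos_scores, neg_scores):
    """Mean of -log(sigmoid(pos - neg)) over all pairs."""
    diff = np.asarray(pos_scores, dtype=float) - np.asarray(neg_scores, dtype=float)
    # -log(sigmoid(d)) == log(1 + exp(-d)), computed stably with logaddexp
    return float(np.mean(np.logaddexp(0.0, -diff)))

# positives scored above negatives give a small loss, and vice versa
print(bpr_loss_sketch([2.0, 3.0], [0.0, 1.0]))
print(bpr_loss_sketch([0.0, 1.0], [2.0, 3.0]))
```

The loss depends only on score differences, which is why BPR optimises the ranking of items rather than their absolute scores.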

### Implement the explicit MF model

In [None]:
from spotlight.factorization.explicit import ExplicitFactorizationModel
from spotlight.factorization.implicit import ImplicitFactorizationModel
from spotlight.losses import bpr_loss

EMF = ExplicitFactorizationModel(n_iter=10,
                                 embedding_dim=32,
                                 use_cuda=False,
                                 random_state=np.random.RandomState(SEED))
EMF.fit(rating_dataset, verbose=True)

x  =  mrr_score(EMF, test_dataset, train=rating_dataset, k=100, verbose=True).mean()
x = round(x, 4)
x

Epoch 0: loss 3.8710269004595084
Epoch 1: loss 0.7940810405817188
Epoch 2: loss 0.638251251617416
Epoch 3: loss 0.5217335396980654
Epoch 4: loss 0.44844858281192235
Epoch 5: loss 0.4054335441257133
Epoch 6: loss 0.38238635689753003
Epoch 7: loss 0.3633662538572413
Epoch 8: loss 0.3513794243641076
Epoch 9: loss 0.3396693015257355


1999it [00:02, 892.38it/s]


0.059

### Implement the implicit MF model

In [None]:
# Add your solution here
IMF = ImplicitFactorizationModel(n_iter=10,
                                 embedding_dim=32,
                                 use_cuda=False,
                                 random_state=np.random.RandomState(SEED))
IMF.fit(toread_dataset_train, verbose=True)
IMF_score = mrr_score(IMF, test_dataset, train=toread_dataset_train, k=100, verbose=True).mean()
x = round(IMF_score, 4)
x

Epoch 0: loss 0.7677980376020918
Epoch 1: loss 0.5387786055370322
Epoch 2: loss 0.4701719809112684
Epoch 3: loss 0.42832199997215903
Epoch 4: loss 0.3983901780590696
Epoch 5: loss 0.36827549528119696
Epoch 6: loss 0.34734796809981455
Epoch 7: loss 0.32980164122890754
Epoch 8: loss 0.3187009900597469
Epoch 9: loss 0.3048194520316034


1999it [00:02, 873.49it/s]


0.3035

### Implement the BPRMF model

In [None]:
BPRMF = ImplicitFactorizationModel(loss='bpr', n_iter=10,
                                 embedding_dim=32,
                                 use_cuda=False,
                                 random_state=np.random.RandomState(SEED))
BPRMF.fit(toread_dataset_train, verbose=True)

BPRMF_score = mrr_score(BPRMF, test_dataset, train=rating_dataset, k=100, verbose=True).mean()
x = round(BPRMF_score, 4)
x

BPRMF_scores = mrr_score(BPRMF, test_dataset, train=rating_dataset, k=100, verbose=True)
len(BPRMF_scores)

Epoch 0: loss 0.3389544660612097
Epoch 1: loss 0.1964499857276678
Epoch 2: loss 0.15870639708174286
Epoch 3: loss 0.14147727969893306
Epoch 4: loss 0.1328272745891843
Epoch 5: loss 0.12213623430579901
Epoch 6: loss 0.11668405982331848
Epoch 7: loss 0.11047121703202994
Epoch 8: loss 0.1088867612745402
Epoch 9: loss 0.1040047079693737


1999it [00:02, 892.52it/s]
1999it [00:02, 879.23it/s]


1999

Now you can answer quiz question 3

## Hybrid Model

(a) Linearly combine the *scores* from IMF and BPRMF.  
(b) Apply a pipelining recommender, where the top 100 items are obtained from IMF and re-ranked using the scores of BPRMF. Items not returned by IMF get a score of 0.
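The two hybridisation strategies can be illustrated on toy score arrays before applying them to real models. The sketch below assumes min-max normalisation before combining; `minmax`, `combsum` and `pipeline` are hypothetical helper names, and the arrays stand in for the outputs of two models' `predict()` calls.

```python
import numpy as np

def minmax(x):
    """Rescale scores to [0, 1]; a constant array maps to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def combsum(scores_a, scores_b):
    """(a) Linear combination: sum of the normalised scores."""
    return minmax(scores_a) + minmax(scores_b)

def pipeline(first_stage, second_stage, n):
    """(b) Keep the top n items of the first stage, scored by the second stage; all others get 0."""
    first_stage = np.asarray(first_stage, dtype=float)
    top_n = np.argsort(-first_stage, kind="stable")[:n]
    out = np.zeros(len(first_stage))
    out[top_n] = minmax(second_stage)[top_n]
    return out

a = np.array([0.0, 2.0, 4.0, 1.0])      # toy first-model scores
b = np.array([10.0, 30.0, 20.0, 40.0])  # toy second-model scores
print(combsum(a, b))
print(pipeline(a, b, n=2))
```

In the pipeline, item 3 has the highest second-stage score but is dropped because the first stage did not return it in its top 2.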


In [None]:
def test_Hybrid_a(combsumObj):
  for i, u in enumerate([5, 20]):
    print("Hybrid a test case %d" % i)
    print(np.count_nonzero(combsumObj.predict(u) > 1))

def test_Hybrid_b(pipeObj):
  for i, iid in enumerate([3, 0]):
    print("Hybrid b test case %d" % i)
    print(pipeObj.predict(0)[iid])



In [None]:
from sklearn.preprocessing import minmax_scale

#Implement combsum hybrid model
class CombSumModel:
    def __init__(self, IMF, BPRMF):
        self.IMF = IMF
        self.BPRMF = BPRMF

    def predict(self, uid):
        # Normalise obtained predicted scores
        scores_imf_norm = minmax_scale(self.IMF.predict(uid))
        scores_bprmf_norm = minmax_scale(self.BPRMF.predict(uid))

        # Combine scores by summing them
        combined_scores = scores_imf_norm + scores_bprmf_norm

        return combined_scores

linearModel = CombSumModel(IMF, BPRMF)

combsum_scores = mrr_score(linearModel, test_dataset, train=rating_dataset, k=100, verbose=True)
combsum_score = combsum_scores.mean()

improved_scores_count = sum(a > b for a, b in zip(combsum_scores, BPRMF_scores))
degraded_scores_count = sum(a < b for a, b in zip(combsum_scores, BPRMF_scores))
improved_scores_count

1999it [00:06, 307.76it/s]


736

In [None]:
#Implement pipeline hybrid model
class PipelineModel:
    def __init__(self, IMF, BPRMF):
        self.IMF = IMF
        self.BPRMF = BPRMF

    def predict(self, uid):
        # Obtain the iids of the 100 highest-scored items according to IMF
        top100_imf = (-self.IMF.predict(uid)).argsort()[:100]

        # Normalise the scores predicted by BPRMF
        bprmf_scores = minmax_scale(self.BPRMF.predict(uid))

        # Items not returned by IMF get a score of 0
        reranked_scores = np.zeros(len(bprmf_scores))
        reranked_scores[top100_imf] = bprmf_scores[top100_imf]
        return reranked_scores

pipeModel = PipelineModel(IMF, BPRMF)

pipeline_scores = mrr_score(pipeModel, test_dataset, train=rating_dataset, k=100, verbose=True)
pipeline_score = pipeline_scores.mean()
improved_scores_count = sum(a > b for a, b in zip(pipeline_scores, BPRMF_scores))
degraded_scores_count = sum(a < b for a, b in zip(pipeline_scores, BPRMF_scores))
improved_scores_count, degraded_scores_count

1999it [00:05, 365.84it/s]


(272, 1653)

In [None]:
#Now test hybrid approaches

test_Hybrid_a(linearModel)
test_Hybrid_b(pipeModel)


Hybrid a test case 0
445
Hybrid a test case 1
407
Hybrid b test case 0
0.12721455097198486
Hybrid b test case 1
0.1450352966785431


# Part-B. Analysing Recommendation Models

## Utility methods

In [None]:
from typing import Sequence, Tuple

def get_top_K(model, uid : int, k : int) -> Tuple[ Sequence[int], Sequence[float],  np.ndarray ] :
  #returns iids, their (normalised) scores in descending order, and item embeddings for the top k predictions of the given uid.

  from sklearn.preprocessing import minmax_scale

  from scipy.stats import rankdata
  # get scores from model
  scores = model.predict(uid)

  # map scores into the range 0..1 over the entire item space
  scores = minmax_scale(scores)

  #compute their ranks
  ranks = rankdata(-scores)
  print(ranks)

  # get and filter iids, scores and embeddings
  rtr_scores = scores[ranks <= k]
  rtr_iids = np.argwhere(ranks <= k).flatten()
  if hasattr(model, '_net'):
    embs = model._net.item_embeddings.weight[rtr_iids].detach()
  else:
    # not a model that has any embeddings
    embs = np.zeros([k,1])

  # identify correct ordering using numpy.argsort()
  ordering = (-1*rtr_scores).argsort()

  #return iids, scores and their embeddings in descending order of score
  return rtr_iids[ordering], rtr_scores[ordering], embs[ordering]

if BPRMF is not None: # BPRMF is the model name defined in Task 1
  iids, scores, embs = get_top_K(BPRMF, 0, 10)
  print("Returned iids: %s" % str(iids))
  print("Returned scores: %s" % str(scores))
  print("Returned embeddings: %s" % str(embs))
else:
  print("You need to define BPRMF in Task 1")

[ 204.  596.  922. ... 1753. 1142. 1742.]
Returned iids: [ 23 108  21  33   9  81  52 254  16   3]
Returned scores: [0.99999994 0.9895164  0.98483366 0.9225092  0.9070964  0.9065484
 0.9005373  0.8931043  0.8837875  0.88369954]
Returned embeddings: tensor([[-0.0454,  1.3716, -0.8307, -1.2616,  1.6699,  1.0161,  1.1168,  2.3530,
         -1.2027,  0.8522, -1.0941, -0.6864, -0.5725, -2.0335, -1.2591,  0.6154,
         -0.1374, -1.6868, -1.8616, -0.7514,  1.9909, -0.3909,  1.9239,  1.3293,
         -1.2834, -0.4520,  1.1338,  0.3468,  2.5168, -2.1586,  1.2310,  1.1670],
        [ 0.1239,  1.1003,  0.0531, -1.1045,  1.9932,  1.5049,  1.0011,  1.9734,
         -1.6322, -0.8913, -0.6372,  0.7721, -1.1422, -2.2424, -1.1936, -0.5770,
          0.0762, -1.0283, -1.2806, -2.0889,  2.8154, -0.9600, -0.1419,  0.8408,
         -1.6067, -1.2905,  1.9168,  1.3988,  1.8646, -2.2029,  0.5365,  0.2022],
        [ 0.3844,  0.8189, -0.1892, -1.1793,  2.1731,  0.6669,  1.1271,  1.4538,
         -1.2173, -0

## Qualitatively Examining Recommendations

From now on, we will consider the `BPRMF` model.

Write a function, which given a uid (int), prints the *title and authors* of:
 - (a) the books that the user has previously shelved (c.f. `toread_dataset_train`)
 - (b) the books that the user will read in the future (c.f. `test_dataset`)
 - (c) the top 10 books that were recommended to the user by `BPRMF` - you can make use of `get_top_K()`.

Then, we will examine two specific users, namely uid 1805 (user u336) and uid 179 (user u1331), to analyse whether their recommendations make sense.
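Parts (a) and (b) both reduce to selecting the item ids that belong to one user from a pair of parallel `user_ids` / `item_ids` arrays. A minimal sketch with toy arrays (the values are made up, and `items_for_user` is a hypothetical helper):

```python
import numpy as np

# parallel arrays: interaction n links user_ids[n] to item_ids[n]
user_ids = np.array([0, 1, 0, 2, 1])
item_ids = np.array([10, 11, 12, 13, 14])

def items_for_user(uid):
    """Return the item ids of all interactions recorded for uid."""
    return item_ids[user_ids == uid]

print(items_for_user(0))  # [10 12]
```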

In [None]:
def get_title_author(uid : int):
  user_indices_toread = np.where(toread_dataset_train.user_ids == uid)[0]
  user_indices_test = np.where(test_dataset.user_ids == uid)[0]

  book_ids_toread = toread_dataset_train.item_ids[user_indices_toread]
  book_ids_test = test_dataset.item_ids[user_indices_test]
  iids, scores, emb = get_top_K(BPRMF, uid, 10)

  def get_author_title_list(book_ids):
    # use iid rather than id, to avoid shadowing the built-in id()
    return [getAuthorTitle(iid) for iid in book_ids]

  res_a = get_author_title_list(book_ids_toread)
  res_b = get_author_title_list(book_ids_test)
  res_c = get_author_title_list(iids)

  return res_a, res_b, res_c

res_a_1805, res_b_1805, res_c_1805 = get_title_author(1805)

score_1805 = BPRMF_scores[1805]

# count how many of the top-10 recommendations appear among the user's future reads (test set)
hits_1805 = len([b for b in res_c_1805 if b in res_b_1805])
print(res_c_1805)
print(res_b_1805)
print(hits_1805)

[ 446.  260.  575. ... 1006. 1095. 1404.]
['Suzanne Collins / The Hunger Games (The Hunger Games, #1)', 'Dan Brown / The Da Vinci Code (Robert Langdon, #2)', 'Dan Brown / The Lost Symbol (Robert Langdon, #3)', 'Michael Crichton / Disclosure', 'George R.R. Martin / A Clash of Kings  (A Song of Ice and Fire, #2)', 'Dan Brown / Angels & Demons  (Robert Langdon, #1)', 'John Grisham / The Broker', 'Khaled Hosseini / The Kite Runner', 'George R.R. Martin / A Game of Thrones (A Song of Ice and Fire, #1)', 'Suzanne Collins / Mockingjay (The Hunger Games, #3)']
['John Grisham / The Pelican Brief', 'Stieg Larsson, Reg Keeland / The Girl Who Played with Fire (Millennium, #2)', 'Gillian Flynn / Gone Girl', 'Tom Clancy / The Hunt for Red October (Jack Ryan Universe, #4)', 'Chuck Palahniuk / Fight Club', 'Umberto Eco, William Weaver, Seán Barrett / The Name of the Rose', 'John Grisham / The Runaway Jury', 'Thomas Harris / Hannibal (Hannibal Lecter, #3)', 'Lee Child / The Affair (Jack Reacher, #16)',

# Part-C. Diversity of Recommendations

## Measuring Intra-List Diversity


For the BPR implicit factorisation model, implement the Intra-list diversity measure of the top 5 scored items based on their item embeddings in the `BPRMF` model.

Implement your ILD as a function with the specification:
```python
def measure_ild(top_books : Sequence[int], K : int=5) -> float
```
where:
 - `top_books` is a list or a Numpy array of iids that have been returned for a particular user. For instance, it can be obtained from `get_top_K()`.
 - `K` is the number of top-ranked items to consider from `top_books`.
 - Your implementation should use the item embeddings stored in the `BPRMF` model.

Calculate the ILD (with k=5), identify the books previously shelved and recommended for the specific users requested in the quiz, and use these to analyse the recommendations.
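As a reference point, intra-list diversity is commonly defined as the average pairwise distance between the items of a list. The sketch below uses cosine distance (1 minus cosine similarity) on toy 2-dimensional embeddings; it is written in plain NumPy, assumes non-zero embedding vectors, and `ild_sketch` is a hypothetical name rather than the required `measure_ild()`.

```python
import numpy as np
from itertools import combinations

def ild_sketch(embs):
    """Average cosine distance over all unordered pairs of rows in embs."""
    embs = np.asarray(embs, dtype=float)
    # normalise each row to unit length, so that a dot product is a cosine similarity
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    dists = [1.0 - float(normed[i] @ normed[j])
             for i, j in combinations(range(len(embs)), 2)]
    return sum(dists) / len(dists)

# two identical items and one orthogonal item: distances 1, 0 and 1
print(ild_sketch([[1, 0], [0, 1], [1, 0]]))
```

A list of near-identical items yields an ILD close to 0, while mutually orthogonal embeddings yield an ILD of 1.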


In [None]:
import torch.nn.functional as f

def measure_ild(top_books : Sequence[int], K : int=5) -> float:
  # retrieve the embeddings of the K top-ranked books
  top_embeddings  = BPRMF._net.item_embeddings.weight[top_books[:K]]
  # sum the cosine distances (1 - cosine similarity) over all pairs i < j
  ild_sum = 0
  for i in range(K):
      for j in range(i + 1, K):
          distance = 1 - f.cosine_similarity(top_embeddings[i], top_embeddings[j], dim=0)
          ild_sum += distance.item()
  # average over the K * (K - 1) / 2 pairs
  ild = (2 / (K * (K - 1))) * ild_sum
  return ild

top_books_1805 , _, _ = get_top_K(BPRMF, 1805, 5)
ild_1805 = measure_ild(top_books_1805, 5)
authors_1805 = [getAuthorTitle(iid) for iid in top_books_1805]

top_books_179 , _, _ = get_top_K(BPRMF, 179, 10)
ild_179 = measure_ild(top_books_179, 5)
authors_179 =[getAuthorTitle(iid) for iid in top_books_179]
authors_179


[ 446.  260.  575. ... 1006. 1095. 1404.]
[ 646.  976.  490. ... 1043. 1525. 1523.]


['John Grisham / The Partner',
 'John Grisham / The Pelican Brief',
 'John Grisham / The Client',
 'John Grisham / The Brethren',
 'John Grisham / The Street Lawyer',
 'John Grisham / The Broker',
 'John Grisham / The Rainmaker',
 'John Grisham / The King of Torts',
 "J.K. Rowling, Mary GrandPré / Harry Potter and the Sorcerer's Stone (Harry Potter, #1)",
 'John Grisham / The Runaway Jury']

## Task 5. Implement MMR Diversification

Develop a Maximal Marginal Relevance (MMR) diversification technique to re-rank the top-ranked recommendations for a given user.
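One common way to write the greedy MMR selection step is shown below, as a sketch rather than the required formulation. The notation is introduced here and is an assumption: $R$ is the candidate list, $S$ the set of items already selected, $\mathrm{rel}(i)$ the normalised score of item $i$, $e_i$ its embedding, $\mathrm{sim}$ the cosine similarity, and $\alpha$ the weight on relevance (so that $\alpha = 1$ reduces to ranking by score alone).

$$
i^{*} = \arg\max_{i \in R \setminus S} \Big[ \alpha \, \mathrm{rel}(i) - (1 - \alpha) \max_{j \in S} \mathrm{sim}(e_i, e_j) \Big]
$$

The chosen item $i^{*}$ is appended to $S$ and the step is repeated until every candidate has been placed; while $S$ is empty, the redundancy term is taken to be 0.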


In [None]:
from typing import Sequence
import torch.nn.functional as f

def mmr(iids : Sequence[int], scores : Sequence[float], embs : np.ndarray, alpha : float) -> Sequence[int]:
  # greedy MMR: repeatedly select the candidate maximising
  # alpha * relevance - (1 - alpha) * (max cosine similarity to the items already selected)
  assert len(iids) == len(scores)
  assert len(iids) == embs.shape[0]
  assert len(embs.size()) == 2

  # pairwise cosine similarities between all candidate embeddings
  normed = f.normalize(embs, dim=1)
  sims = normed @ normed.T

  selected = []
  remaining = list(range(len(iids)))
  while remaining:
    def mmr_score(i):
      # nothing is redundant while no item has been selected yet
      redundancy = max(sims[i, j].item() for j in selected) if selected else 0.0
      return alpha * float(scores[i]) - (1 - alpha) * redundancy
    # ties go to the earliest candidate, i.e. the one ranked highest by the model
    best = max(remaining, key=mmr_score)
    selected.append(best)
    remaining.remove(best)

  # map the selected positions back to iids
  rtr_iids = [iids[idx] for idx in selected]
  return rtr_iids

iids = mmr( *get_top_K(BPRMF, 0, 10), 0.5)
iids

[ 204.  596.  922. ... 1753. 1142. 1742.]


[108, 21, 23, 81, 33, 3, 254, 9, 16, 52]

In [None]:
def run_MMR_testcases(mmrfn):
  example_embeddings1 = torch.tensor([[1.0,1.0],[1.0,1.0],[0,1.0],[0.1, 1.0]])
  example_embeddings2 = torch.tensor([[1.0,1.0],[1.0,1.0],[0.02,1.0],[0.01,1.0]])
  print("Testcase 0 : %s" % mmrfn([1,2,3,4], [0.5, 0.5, 0.5, 0.5],  example_embeddings1, 0.5)[0] )
  print("Testcase 1 : %s" % mmrfn([1,2,3,4], [0.5, 0.5, 0.5, 0.5],  example_embeddings1, 0.5)[1] )
  print("Testcase 2 : %s" % mmrfn([1,2,3,4], [4, 3, 2, 1],  example_embeddings1, 1)[1] )
  print("Testcase 3 : %s" % mmrfn([1,2,3,4], [0.99, 0.98, 0.97, 0.001],  example_embeddings2, 0.001)[1] )
  print("Testcase 4 : %s" % mmrfn([1,2,3,4], [0.99, 0.98, 0.97, 0.001],  example_embeddings2, 0.5)[1] )

run_MMR_testcases(mmr)

Testcase 0 : 1
Testcase 1 : 3
Testcase 2 : 2
Testcase 3 : 2
Testcase 4 : 2


Now we can analyse the impact of our MMR implementation. Let's consider again uid 179 (user u1331).

Apply MMR on the top 10 results obtained from the BPRMF model using `get_top_K()`, with an alpha value of 0.5. The following code should help:
```python
mmr( *get_top_K(BPRMF, 179, 10), 0.5)
```

Finally, analyse the returned books. Calculate the ILD (with `k=5`), and examine the authors and titles (using `getAuthorTitle()`).

In [None]:
iids_179 = mmr( *get_top_K(BPRMF, 179, 10), 0.5)
items_recc = [getAuthorTitle(iid) for iid in iids_179]
print(items_recc)
ild_179 = measure_ild(iids_179, 5)
ild_179

[ 646.  976.  490. ... 1043. 1525. 1523.]
['John Grisham / The Partner', 'John Grisham / The Client', 'John Grisham / The Runaway Jury', 'John Grisham / The Pelican Brief', 'John Grisham / The Street Lawyer', 'John Grisham / The Rainmaker', 'John Grisham / The Broker', 'John Grisham / The Brethren', "J.K. Rowling, Mary GrandPré / Harry Potter and the Sorcerer's Stone (Harry Potter, #1)", 'John Grisham / The King of Torts']


0.2928944885730744