# Recommendation - Model 🍿

---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

We will train a machine learning model on the sparse **ratings** matrix we built previously, and use it as a recommendation engine.

First, reload the `movies` and `ratings` dataframes.

In [2]:
import os
import pandas as pd
import scipy
import pickle
import lightfm
import numpy as np

from lightfm import LightFM
from lightfm import cross_validation
from lightfm.evaluation import precision_at_k

In [3]:
### Load the movies and ratings datasets
path = os.path.join("/","home","guillaume","code","GGIML","vivadata-student","data","ml-latest-small")
movies = pd.read_csv(os.path.join(path, "movies.csv"))
ratings = pd.read_csv(os.path.join(path, "ratings.csv"))

In [4]:
dst_dir = os.path.join("/","home","guillaume","code","GGIML","vivadata-student","data","netflix")

In [5]:
with open(dst_dir+os.path.sep+'ratings_matrix.pkl', 'rb') as f:
    ratings_matrix = pickle.load(f)

In [6]:
with open(dst_dir+os.path.sep+'mappers.pkl', 'rb') as f:
    mappers = pickle.load(f)

In [7]:
idx_to_mid, mid_to_idx, uid_to_idx, idx_to_uid = mappers[0], mappers[1], mappers[2], mappers[3]
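
For reference, mappers of this kind translate raw MovieLens ids into contiguous matrix indices and back. The sketch below is only an assumption about how `mappers.pkl` could have been produced (ids numbered in order of first appearance, which is consistent with the `idx_to_mid` printed later in this notebook); `build_mappers` is a hypothetical helper, not part of the original pipeline.

```python
def build_mappers(ids):
    """Map raw ids to contiguous indices (and back), in order of first appearance."""
    # dict.fromkeys keeps insertion order and drops duplicates
    unique_ids = list(dict.fromkeys(ids))
    idx_to_id = dict(enumerate(unique_ids))
    id_to_idx = {raw_id: idx for idx, raw_id in idx_to_id.items()}
    return idx_to_id, id_to_idx

# Toy example: movie ids as they might appear in a ratings file (hypothetical data)
toy_idx_to_mid, toy_mid_to_idx = build_mappers([1, 3, 6, 47, 3, 1, 50])
print(toy_idx_to_mid)      # {0: 1, 1: 3, 2: 6, 3: 47, 4: 50}
print(toy_mid_to_idx[47])  # 3
```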

Let's split the data randomly into a `train` matrix and a `test` matrix, with 20% of the interactions in the test set.

In [8]:
train_ratings, test_ratings = cross_validation.random_train_test_split(ratings_matrix, test_percentage=.2, random_state=np.random.RandomState(0))
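
The idea behind such a split can be pictured with plain NumPy: shuffle the stored (user, movie, rating) triplets and cut them 80/20. This is a sketch on made-up data, not LightFM's implementation; `split_interactions` and the toy arrays are hypothetical.

```python
import numpy as np

def split_interactions(rows, cols, vals, test_percentage=0.2, seed=0):
    """Randomly split interaction triplets into a train part and a test part."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(vals))            # shuffled positions of the triplets
    cutoff = int((1.0 - test_percentage) * len(vals))
    train_idx, test_idx = order[:cutoff], order[cutoff:]
    train = (rows[train_idx], cols[train_idx], vals[train_idx])
    test = (rows[test_idx], cols[test_idx], vals[test_idx])
    return train, test

# Ten toy interactions: user index, movie index, rating
rows = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
cols = np.array([0, 1, 2, 0, 3, 1, 2, 4, 0, 4])
vals = np.array([4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 1.0, 5.0, 2.5])

train, test = split_interactions(rows, cols, vals)
print(len(train[2]), len(test[2]))  # 8 2
```

Each interaction ends up in exactly one of the two parts, so the test set only contains ratings the model has never seen during training.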

Let's train a LightFM model for 20 epochs. First, a quick look at the ratings matrix we start from (610 users by 9724 movies):

In [9]:
ratings_matrix

<610x9724 sparse matrix of type '<class 'numpy.float64'>'
	with 100836 stored elements in Compressed Sparse Row format>

We use the **WARP** (Weighted Approximate-Rank Pairwise) loss, which maximises the rank of positive examples by repeatedly sampling negative examples until a rank-violating one is found. It is useful when only positive interactions are present and the goal is to optimise the top of the recommendation list.
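
To make the sampling step concrete, here is a simplified illustration of the idea, not LightFM's actual implementation. The margin of 1 and the logarithmic weight follow the usual textbook description of WARP and are assumptions here; `warp_sample` and the toy scores are hypothetical.

```python
import numpy as np

def warp_sample(scores, positive_item, negative_items, rng, max_sampled=10):
    """Sample negatives until one violates the ranking.

    Returns (negative_item, update_weight), or None if no violation is found.
    """
    n_items = len(scores)
    for n_sampled in range(1, max_sampled + 1):
        negative = rng.choice(negative_items)
        # Violation: the negative scores within a margin of 1 of the positive
        if scores[negative] > scores[positive_item] - 1.0:
            # Few draws needed => the positive is probably ranked low => larger weight
            estimated_rank = (n_items - 1) // n_sampled
            weight = np.log(max(1.0, estimated_rank))
            return int(negative), float(weight)
    return None  # no violation found: the positive is already ranked well

rng = np.random.RandomState(0)

# Badly ranked positive (item 0): every negative outscores it
badly_ranked = warp_sample(np.array([0.2, 3.0, 2.5, 2.8, 2.9]), 0, [1, 2, 3, 4], rng)

# Well ranked positive (item 0): no negative comes within the margin
well_ranked = warp_sample(np.array([5.0, 0.1, 0.2, 0.3, 0.0]), 0, [1, 2, 3, 4], rng)

print(badly_ranked, well_ranked)
```

In the first case a violation is found on the first draw, so the estimated rank (and therefore the update weight) is as large as it can be; in the second case no update would be made at all.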

In [304]:
lfm = LightFM(no_components=30, loss='warp')

In [305]:
lfm.fit(train_ratings, epochs=20)

<lightfm.lightfm.LightFM at 0x7ff2a3fa0cd0>

Let's evaluate the model on the test set using precision at k (`k=10` by default).

In [306]:
precision_at_k(lfm, test_ratings).mean()

0.08174342
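
As a reminder of what this metric measures, here is a minimal NumPy sketch of precision at k for a single user, on made-up scores. `precision_at_k_single` is a hypothetical helper; LightFM computes the same kind of quantity for every user and we then average it.

```python
import numpy as np

def precision_at_k_single(scores, relevant_items, k=10):
    """Fraction of the k highest-scored items that belong to relevant_items."""
    top_k = np.argsort(-scores)[:k]
    hits = np.isin(top_k, list(relevant_items)).sum()
    return hits / k

# Toy example: 6 movies, of which movies 0, 3 and 4 are in the user's test set
toy_scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7, 0.2])
toy_relevant = {0, 3, 4}

toy_precision = precision_at_k_single(toy_scores, toy_relevant, k=3)
print(toy_precision)  # top 3 is movies 0, 2, 4, so 2 hits out of 3
```

Read this way, a mean precision at 10 of about 0.08 says that, on average, a little under one of the ten top-ranked movies per user appears among that user's held-out ratings.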

In [307]:
lfm.item_embeddings[0]

array([ 0.42676395, -0.3410659 ,  0.15136187, -0.42883462, -0.00804605,
        0.13135365,  0.3427576 , -0.4339824 , -0.1129528 ,  0.21282552,
        0.6817652 ,  0.33451477, -0.02379951, -0.12735833,  0.37334958,
       -0.22396557,  0.3733976 , -0.7290944 ,  0.3329436 ,  0.04068927,
        0.13342984,  0.29168043,  0.24499665, -0.5547059 , -0.18547276,
        0.52311647, -0.38751566,  0.18248416,  0.41903153, -0.423558  ],
      dtype=float32)

The `item_embeddings` are latent features that the model learned automatically for our movies. They can be used to recommend movies similar to a given movie, in the spirit of content-based filtering.

We just trained a model that factorized our ratings matrix into a U matrix of shape (n_users, no_components), `lfm.user_embeddings`, and a V matrix of shape (n_movies, no_components), `lfm.item_embeddings`.
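
The sketch below shows how two such matrices turn into scores, assuming the usual LightFM scoring without extra features (dot product of the latent vectors plus a user bias and an item bias). The random arrays are stand-ins for the trained embeddings, not values from our model.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, no_components = 4, 6, 3

# Stand-ins for the user/item embeddings and the bias vectors of a trained model
U = rng.normal(size=(n_users, no_components))
V = rng.normal(size=(n_movies, no_components))
user_biases = rng.normal(size=n_users)
item_biases = rng.normal(size=n_movies)

# Score of every (user, movie) pair: dot product of latent vectors plus biases
scores = U @ V.T + user_biases[:, None] + item_biases[None, :]

# Movies ranked for user 0, best first
ranking_user_0 = np.argsort(-scores[0])
print(scores.shape)  # (4, 6)
print(ranking_user_0)
```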

Now we want to compute **similarity between each pair of movies**.

In [308]:
len(lfm.item_embeddings)

9724

In [309]:
from sklearn.metrics.pairwise import cosine_similarity

In [310]:
# Pearson correlation between every pair of movie embeddings (one row per movie)
# Alternative: similarity_scores = cosine_similarity(lfm.item_embeddings)
similarity_scores = np.corrcoef(lfm.item_embeddings)
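
The two similarity measures in this cell are closely related: the Pearson correlation returned by `np.corrcoef` is the cosine similarity of the embeddings after each row has been centered on its own mean. A small check on random stand-in embeddings (`cosine_sim` is a hypothetical helper written for this illustration):

```python
import numpy as np

def cosine_sim(X):
    """Cosine similarity between all pairs of rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.RandomState(0)
toy_embeddings = rng.normal(size=(5, 30))   # 5 toy movies, 30 components

pearson = np.corrcoef(toy_embeddings)
centered = toy_embeddings - toy_embeddings.mean(axis=1, keepdims=True)

# Pearson correlation equals the cosine similarity of the row-centered vectors
print(np.allclose(pearson, cosine_sim(centered)))  # True
```

With 30 components per movie, the row means are often small, so the two measures tend to give similar rankings, but they are not guaranteed to agree.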

In [311]:
type(similarity_scores)

numpy.ndarray

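Which movies are most similar to the movie at index 20? Sorting its row of similarity scores in descending order gives the full ranking; the movie itself normally comes first, and the next 10 entries are its 10 most similar movies.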

In [312]:
most_similar_movies_20 = np.argsort(-similarity_scores[20])
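
To keep only the top of that ranking and leave out the movie itself, a small helper along these lines could be used. This is a sketch on a toy similarity matrix; `top_k_similar` is a hypothetical name, not something defined elsewhere in this notebook.

```python
import numpy as np

def top_k_similar(similarity, idx, k=10):
    """Indices of the k rows most similar to row idx, excluding idx itself."""
    order = np.argsort(-similarity[idx])
    order = order[order != idx]      # drop the movie itself
    return order[:k]

toy_similarity = np.array([
    [1.0, 0.2, 0.9, 0.4],
    [0.2, 1.0, 0.1, 0.8],
    [0.9, 0.1, 1.0, 0.3],
    [0.4, 0.8, 0.3, 1.0],
])
toy_top_2 = top_k_similar(toy_similarity, 0, k=2)
print(toy_top_2)  # [2 3]
```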

Let's now test the engine! Suppose we have a user who likes **Toy Story** (movie_id = 1). Which movies would you recommend to that user?

In [313]:
most_similar_movies_toy_story = np.argsort(-similarity_scores[mid_to_idx[1]])

In [314]:
toy_story_similar_titles = []
for index in most_similar_movies_toy_story:
    toy_story_similar_titles.append(movies[movies['movieId']==idx_to_mid[index]]['title'].values)

In [315]:
toy_story_similar_titles[1:6]

[array(['Jurassic Park (1993)'], dtype=object),
 array(['Mask, The (1994)'], dtype=object),
 array(['Back to the Future (1985)'], dtype=object),
 array(['Star Wars: Episode IV - A New Hope (1977)'], dtype=object),
 array(['Forrest Gump (1994)'], dtype=object)]

In [102]:
import pickle

In [103]:
with open(dst_dir+os.path.sep+'similarity_scores.pkl', 'wb') as pklfile:
    pickle.dump(similarity_scores, pklfile)

In [104]:
movies.to_pickle(dst_dir+os.path.sep+'movies.pkl')

In [25]:
list_mid = [1,44,592]

sims = []
for mid in list_mid:
    sims.append(similarity_scores[mid_to_idx[mid]])

In [26]:
len(sims)

3

In [43]:
movies.iloc[481]

movieId                                                549
title      Thirty-Two Short Films About Glenn Gould (1993)
genres                                       Drama|Musical
Name: 481, dtype: object

In [28]:
movies.title[0]

'Toy Story (1995)'

In [85]:
def get_sim_scores(list_mid):
    if isinstance(list_mid, int):
        list_mid = [list_mid]
    sims = []
    for mid in list_mid:
        sims.append(similarity_scores[mid_to_idx[mid]])
    return sims

In [92]:
def get_ranked_recos(sims):
    recos = []
    sum_sims = np.sum(sims, axis = 0)
    most_similar_movies = np.argsort(-sum_sims)
    for index in most_similar_movies:
        recos.append((idx_to_mid[index], sum_sims[index], str(''.join(movies[movies['movieId']==idx_to_mid[index]].title.values))))
    return recos

In [93]:
get_ranked_recos(get_sim_scores(2))

[(2, 1.0, 'Jumanji (1995)'),
 (596, 0.97948056, 'Pinocchio (1940)'),
 (367, 0.94778603, 'Mask, The (1994)'),
 (10, 0.94637084, 'GoldenEye (1995)'),
 (653, 0.94063276, 'Dragonheart (1996)'),
 (780, 0.93556273, 'Independence Day (a.k.a. ID4) (1996)'),
 (784, 0.9345796, 'Cable Guy, The (1996)'),
 (588, 0.9333846, 'Aladdin (1992)'),
 (303, 0.9331948, 'Quick and the Dead, The (1995)'),
 (788, 0.92988724, 'Nutty Professor, The (1996)'),
 (48, 0.92257595, 'Pocahontas (1995)'),
 (586, 0.92141026, 'Home Alone (1990)'),
 (364, 0.91876686, 'Lion King, The (1994)'),
 (165, 0.9153266, 'Die Hard: With a Vengeance (1995)'),
 (6593, 0.907245, 'Freaky Friday (2003)'),
 (19, 0.9068981, 'Ace Ventura: When Nature Calls (1995)'),
 (377, 0.90664357, 'Speed (1994)'),
 (1073, 0.90595484, 'Willy Wonka & the Chocolate Factory (1971)'),
 (428, 0.90422696, 'Bronx Tale, A (1993)'),
 (595, 0.89932984, 'Beauty and the Beast (1991)'),
 (500, 0.896887, 'Mrs. Doubtfire (1993)'),
 (344, 0.8936827, 'Ace Ventura: Pet Dete

In [94]:
def get_sim_scores(list_mid):
    if isinstance(list_mid, int):
        list_mid = [list_mid]
    sims = []
    for mid in list_mid:
        sims.append(similarity_scores[mid_to_idx[mid]])
    return sims

def get_ranked_recos(sims):
    recos = []
    sum_sims = np.sum(sims, axis = 0)
    most_similar_movies = np.argsort(-sum_sims)
    for index in most_similar_movies:
        recos.append((idx_to_mid[index], sum_sims[index], str(''.join(movies[movies['movieId']==idx_to_mid[index]].title.values))))
    return recos

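def get_reco(list_mids, N=5, exclude_selection=False):
    # Accept a single movie id as well as a list of ids
    if isinstance(list_mids, int):
        list_mids = [list_mids]
    ranked = get_ranked_recos(get_sim_scores(list_mids))
    # Skip the first len(list_mids) results, assumed to be the selected movies
    start = len(list_mids) if exclude_selection else 0
    return ranked[start:start + N]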

In [95]:
get_reco(1)

[(1, 1.0000001, 'Toy Story (1995)'),
 (1035, 0.9784269, 'Sound of Music, The (1965)'),
 (318, 0.9766954, 'Shawshank Redemption, The (1994)'),
 (47, 0.9737339, 'Seven (a.k.a. Se7en) (1995)'),
 (527, 0.9677736, "Schindler's List (1993)")]
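
Note that skipping the first `len(list_mids)` results assumes the selected movies always occupy the top ranks. With summed similarities over several movies this is not guaranteed, since a movie close to all of the selected ones can outrank them. A possible alternative, sketched here on a toy matrix with a hypothetical `recommend_from_selection` helper, is to mask the selected indices explicitly:

```python
import numpy as np

def recommend_from_selection(similarity, selected_idx, n=5):
    """Rank items by summed similarity to the selection, never returning the selection."""
    summed = similarity[selected_idx].sum(axis=0)
    summed[selected_idx] = -np.inf       # mask the selected items explicitly
    return np.argsort(-summed)[:n]

toy_similarity = np.array([
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.9, 0.3],
    [0.9, 0.9, 1.0, 0.1],
    [0.2, 0.3, 0.1, 1.0],
])
# Items 0 and 1 are selected; item 2 is close to both and sums to 1.8,
# which is higher than the 1.1 of the selected items themselves
toy_recos = recommend_from_selection(toy_similarity, [0, 1], n=2)
print(toy_recos)  # [2 3]
```

In this toy case, dropping the first two entries of the ranking would have discarded item 2, the best recommendation.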

In [317]:
def get_similarity_scores():
    print("Loading model saved as similarity scores")
    # Use `f` rather than `input`, which would shadow the Python builtin
    with open(os.path.join(dst_dir, 'similarity_scores.pkl'), 'rb') as f:
        similarity_scores = pickle.load(f)
    return similarity_scores

In [318]:
sim_score2 = get_similarity_scores()

Loading model saved as similarity scores


In [323]:
sim_score2[mid_to_idx[3]]

array([ 0.7026446 ,  1.        ,  0.64096839, ..., -0.86666573,
       -0.88497698, -0.78055759])

In [324]:
sim_score2

array([[ 1.        ,  0.7026446 ,  0.85925494, ..., -0.70526429,
        -0.83986228, -0.79124129],
       [ 0.7026446 ,  1.        ,  0.64096839, ..., -0.86666573,
        -0.88497698, -0.78055759],
       [ 0.85925494,  0.64096839,  1.        , ..., -0.65735629,
        -0.7766206 , -0.70661544],
       ...,
       [-0.70526429, -0.86666573, -0.65735629, ...,  1.        ,
         0.94400759,  0.95151998],
       [-0.83986228, -0.88497698, -0.7766206 , ...,  0.94400759,
         1.        ,  0.96592074],
       [-0.79124129, -0.78055759, -0.70661544, ...,  0.95151998,
         0.96592074,  1.        ]])

In [320]:
mid_to_idx[3]

1

In [321]:
movies

|  | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... | ... |
| 9737 | 193581 | Black Butler: Book of the Atlantic (2017) | Action\|Animation\|Comedy\|Fantasy |
| 9738 | 193583 | No Game No Life: Zero (2017) | Animation\|Comedy\|Fantasy |
| 9739 | 193585 | Flint (2017) | Drama |
| 9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | Action\|Animation |


In [327]:
def get_movies():
    print("Loading movies dataset as pickle and then transforming it to dict")
    with open(os.path.join(dst_dir, 'movies.pkl'), 'rb') as f:
        df_movies = pickle.load(f)
    # Index by movieId so that movies_dict[movie_id]['title'] returns the title
    movies = df_movies[["movieId", "title"]].set_index("movieId")
    movies_dict = movies.to_dict(orient="index")
    return movies_dict

In [328]:
movies_dict = get_movies()

Loading movies dataset as pickle and then transforming it to dict


In [340]:
idx_to_mid

{0: 1,
 1: 3,
 2: 6,
 3: 47,
 4: 50,
 5: 70,
 6: 101,
 7: 110,
 8: 151,
 9: 157,
 10: 163,
 11: 216,
 12: 223,
 13: 231,
 14: 235,
 15: 260,
 16: 296,
 17: 316,
 18: 333,
 19: 349,
 20: 356,
 21: 362,
 22: 367,
 23: 423,
 24: 441,
 25: 457,
 26: 480,
 27: 500,
 28: 527,
 29: 543,
 30: 552,
 31: 553,
 32: 590,
 33: 592,
 34: 593,
 35: 596,
 36: 608,
 37: 648,
 38: 661,
 39: 673,
 40: 733,
 41: 736,
 42: 780,
 43: 804,
 44: 919,
 45: 923,
 46: 940,
 47: 943,
 48: 954,
 49: 1009,
 50: 1023,
 51: 1024,
 52: 1025,
 53: 1029,
 54: 1030,
 55: 1031,
 56: 1032,
 57: 1042,
 58: 1049,
 59: 1060,
 60: 1073,
 61: 1080,
 62: 1089,
 63: 1090,
 64: 1092,
 65: 1097,
 66: 1127,
 67: 1136,
 68: 1196,
 69: 1197,
 70: 1198,
 71: 1206,
 72: 1208,
 73: 1210,
 74: 1213,
 75: 1214,
 76: 1219,
 77: 1220,
 78: 1222,
 79: 1224,
 80: 1226,
 81: 1240,
 82: 1256,
 83: 1258,
 84: 1265,
 85: 1270,
 86: 1275,
 87: 1278,
 88: 1282,
 89: 1291,
 90: 1298,
 91: 1348,
 92: 1377,
 93: 1396,
 94: 1408,
 95: 1445,
 96: 1473,
 

In [342]:
movies_dict[idx_to_mid[3]]['title']

'Seven (a.k.a. Se7en) (1995)'