# Recommendation - Model 🍿

In [2]:
pip install lightfm

Collecting lightfm
  Downloading lightfm-1.17.tar.gz (316 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py) ... done
  Created wheel for lightfm: filename=lightfm-1.17-cp310-cp310-linux_x86_64.whl size=808332 sha256=962c55ced1c9bf508262fcb39456f8fe43a6adfe2389498050703978c58314f8
  Stored in directory: /root/.cache/pip/wheels/4f/9b/7e/0b256f2168511d8fa4dae4fae0200fdbd729eb424a912ad636
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.17


---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

Now for the exciting part! We will train a machine learning model on the sparse **ratings** matrix built previously and use it as a recommendation engine.

First, reload the `movies` and `ratings` DataFrames.

In [3]:
### TODO: load the movies and ratings datasets

import pandas as pd
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

print(movies.head())
print(ratings.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


**Q1**. Start by loading all the pickle files you saved during the last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`

In [5]:
import pickle

def load_pickle(path):
  # Open in binary mode and close the file once loading is done
  with open(path, 'rb') as f:
    return pickle.load(f)

ratings_matrix = load_pickle('ratings_matrix.pkl')
idx_to_mid = load_pickle('idx_to_mid.pkl')
mid_to_idx = load_pickle('mid_to_idx.pkl')
uid_to_idx = load_pickle('uid_to_idx.pkl')
idx_to_uid = load_pickle('idx_to_uid.pkl')

**Q2**. Because the dataset differs from the usual setup (X as features, y as target), the usual `train_test_split` function from scikit-learn does not apply.

Fortunately, `lightfm` provides a `random_train_test_split` function in its `cross_validation` module, dedicated to this use case 🙂

Split the data randomly into a `train` matrix and a `test` matrix with 20% of interactions into the test set.

In [7]:
import numpy as np

from lightfm.cross_validation import random_train_test_split

train, test = random_train_test_split(ratings_matrix, test_percentage=0.2, random_state=np.random.RandomState(0))
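
For intuition, a random split of this kind assigns each stored interaction (one user, one movie, one rating) to either the train or the test set. The following is a minimal sketch of that idea on made-up data, not LightFM's implementation; `split_interactions` and the toy arrays are hypothetical names. A SciPy COO matrix exposes the same three arrays as `.row`, `.col` and `.data`.

```python
import numpy as np

def split_interactions(rows, cols, values, test_percentage=0.2, seed=0):
    """Randomly assign each (user, movie, rating) triple to train or test."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(values))              # shuffled interaction indices
    cutoff = int((1.0 - test_percentage) * len(values))
    train_idx, test_idx = order[:cutoff], order[cutoff:]

    def pick(idx):
        return rows[idx], cols[idx], values[idx]

    return pick(train_idx), pick(test_idx)

# Made-up interactions: 10 ratings from 4 users on 5 movies.
rows = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
cols = np.array([0, 2, 4, 1, 3, 0, 3, 4, 1, 2])
values = np.array([5, 3, 1, 4, 2, 1, 5, 4, 3, 2], dtype=np.float32)

toy_train, toy_test = split_interactions(rows, cols, values)
print(len(toy_train[2]), "train interactions,", len(toy_test[2]), "test interactions")
```

Both parts keep the shape of the original matrix; only the set of non-zero entries differs, which is why no interaction appears in both.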

**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [8]:
from lightfm import LightFM

model = LightFM(no_components=100, loss="warp", random_state=0)

model.fit(train, epochs=10, verbose=True)

Epoch: 100%|██████████| 10/10 [00:03<00:00,  3.09it/s]


<lightfm.lightfm.LightFM at 0x7983913665f0>

**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.

In [10]:
from lightfm.evaluation import precision_at_k

k = 5
precision_k = precision_at_k(model, test, train, k=k).mean()

print("Precision at k:", k, "is", precision_k)

Precision at k: 5 is 0.29261085
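
To make the metric concrete: for one user, precision at k is the fraction of the k highest-ranked movies that appear among that user's test interactions, and the value printed above is the mean over users. Below is a small hand-rolled sketch for a single user with made-up scores. `precision_at_k_single` is a hypothetical helper, not part of LightFM; it mirrors the idea of leaving out movies already seen in training.

```python
def precision_at_k_single(scores, test_items, train_items, k):
    """Fraction of the k highest-scored unseen items that are test positives."""
    # Items already seen in training are removed from the ranking.
    candidates = [i for i in range(len(scores)) if i not in train_items]
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if i in test_items)
    return hits / k

scores = [0.9, 0.1, 0.8, 0.4, 0.7, 0.3]   # made-up model scores for 6 movies
train_items = {0}                          # movie 0 was rated in the train set
test_items = {2, 3}                        # movies 2 and 3 are held out

p = precision_at_k_single(scores, test_items, train_items, k=3)
print(p)   # 2 of the top 3 unseen movies are test positives
```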


**Q5**. What does the `item_embeddings` attribute of `model` contain? This will be the heart of your recommendation engine! 💟 So make sure you fully understand what it contains.

In [11]:
print(model.item_embeddings.shape)

(3650, 100)


**Q6**. We just trained a model that factorizes our ratings matrix into a matrix U of shape (n_users, no_components), `model.user_embeddings`, and a matrix V of shape (n_movies, no_components), `model.item_embeddings`.
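
As a rough illustration of what this factorization means, the sketch below uses random stand-in matrices (not the trained embeddings) to show how the two shapes combine into one score per (user, movie) pair. LightFM's own predictions also add per-user and per-item bias terms, which are left out here.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, no_components = 4, 6, 3

U = rng.normal(size=(n_users, no_components))    # stand-in for model.user_embeddings
V = rng.normal(size=(n_movies, no_components))   # stand-in for model.item_embeddings

scores = U @ V.T          # one affinity score per (user, movie) pair
print(scores.shape)       # (4, 6)

# The score for one pair is the dot product of the two embedding rows.
assert np.isclose(scores[1, 2], np.dot(U[1], V[2]))
```

Each row of V is therefore a dense vector describing one movie, which is what makes comparing movies by their rows possible.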

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: As a similarity measure, we can use either cosine similarity or Pearson correlation:
> - **Cosine similarity** between two vectors or matrices X and Y is given by:
> ``` python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
> - **Pearson correlation** between two vectors or matrices X and Y is given by:
> ``` python
> import numpy as np
> np.corrcoef(X, Y)
> ```

Compute `similarity_scores`, a matrix of shape (n_movies, n_movies) whose element (i, j) is the similarity between the movies at indices i and j.

In [12]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores = cosine_similarity(model.item_embeddings)
similarity_scores

array([[ 0.9999996 ,  0.13317415,  0.24826698, ..., -0.42391142,
         0.04463575, -0.40535694],
       [ 0.13317415,  0.9999999 ,  0.1236582 , ..., -0.156093  ,
        -0.31249687,  0.03511212],
       [ 0.24826698,  0.1236582 ,  1.0000002 , ..., -0.28943077,
        -0.33148974,  0.12372942],
       ...,
       [-0.42391142, -0.156093  , -0.28943077, ...,  0.99999994,
         0.56635493,  0.13436124],
       [ 0.04463575, -0.31249687, -0.33148974, ...,  0.56635493,
         0.99999994, -0.2925488 ],
       [-0.40535694,  0.03511212,  0.12372942, ...,  0.13436124,
        -0.2925488 ,  0.99999994]], dtype=float32)
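
The hint mentions Pearson correlation as an alternative. The two measures are closely related: the Pearson correlation between two rows equals the cosine similarity of those rows after subtracting each row's mean. The sketch below checks this on random stand-in embeddings. `cosine_sim` is a small NumPy helper written for this example rather than the scikit-learn function. Note that `np.corrcoef` called on a single matrix treats each row as a variable, so it returns one value per pair of rows.

```python
import numpy as np

def cosine_sim(X):
    """Pairwise cosine similarity between the rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.RandomState(0)
embeddings = rng.normal(size=(5, 8))   # stand-in for model.item_embeddings

cos = cosine_sim(embeddings)
pearson = np.corrcoef(embeddings)      # rows are treated as variables

# Pearson correlation is the cosine similarity of mean-centered rows.
centered = embeddings - embeddings.mean(axis=1, keepdims=True)
assert np.allclose(pearson, cosine_sim(centered))
print(cos.shape, pearson.shape)
```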

**Q7**. For the movie at index 20, what are the indices of the 10 most similar movies?

In [13]:
idx = 20
similarity_idx = similarity_scores[idx]
ranked_idx = np.argsort(-similarity_idx)
ranked_mid = [idx_to_mid[x] for x in ranked_idx]
for mid in ranked_mid[:10]:
  print(movies[movies.movieId == mid]["title"])

314    Forrest Gump (1994)
Name: title, dtype: object
510    Silence of the Lambs, The (1991)
Name: title, dtype: object
0    Toy Story (1995)
Name: title, dtype: object
277    Shawshank Redemption, The (1994)
Name: title, dtype: object
1284    Good Will Hunting (1997)
Name: title, dtype: object
506    Aladdin (1992)
Name: title, dtype: object
257    Pulp Fiction (1994)
Name: title, dtype: object
1503    Saving Private Ryan (1998)
Name: title, dtype: object
505    Ghost (1990)
Name: title, dtype: object
990    Indiana Jones and the Last Crusade (1989)
Name: title, dtype: object


**Q8**. Let's now test our engine! Suppose we have a user who likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to Toy Story?

> ⚠️ **Warning**: Remember that `similarity_scores` is indexed by `idx`, whereas you only know the `movie_id` of your movie, so convert it with `mid_to_idx` first.

Retrieve the **top 5 recommendations**.

In [15]:
idx = mid_to_idx[1]
similarity_idx = similarity_scores[idx]
ranked_idx = np.argsort(-similarity_idx)
ranked_mid = [idx_to_mid[x] for x in ranked_idx]
for mid in ranked_mid[:5]:
  print(movies[movies.movieId == mid]["title"])

0    Toy Story (1995)
Name: title, dtype: object
314    Forrest Gump (1994)
Name: title, dtype: object
224    Star Wars: Episode IV - A New Hope (1977)
Name: title, dtype: object
506    Aladdin (1992)
Name: title, dtype: object
1757    Bug's Life, A (1998)
Name: title, dtype: object
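
In the output above, the first result is Toy Story itself, since every movie has the highest similarity with itself. One possible way to leave the query movie out of the recommendations is sketched below on a made-up similarity matrix; `top_k_similar`, `toy_scores` and `toy_idx_to_mid` are hypothetical names used only for this illustration.

```python
import numpy as np

def top_k_similar(similarity_scores, idx, k):
    """Indices of the k items most similar to `idx`, excluding `idx` itself."""
    ranked = np.argsort(-similarity_scores[idx])   # most similar first
    ranked = ranked[ranked != idx]                 # drop the query item
    return ranked[:k]

# Made-up 4x4 similarity matrix and index-to-movieId mapping.
toy_scores = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
toy_idx_to_mid = {0: 1, 1: 10, 2: 34, 3: 50}

top = top_k_similar(toy_scores, idx=0, k=2)
print([toy_idx_to_mid[int(i)] for i in top])   # [34, 50]
```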


Since the next step is to **deploy your model**, you now need to:

**Q9**. Save your `similarity_scores` in pickle format. Also save the `movies` DataFrame in pickle format. Save both in the `data/netflix` directory at the root of the repository.

In [19]:
import os

directory = "./data"
os.makedirs(directory, exist_ok=True)  # create the target directory if it is missing


In [20]:
pickle.dump(similarity_scores, open(directory + "/similarity_scores.pkl", "wb"))
pickle.dump(movies, open(directory + "/movies.pkl", "wb"))
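
A quick way to sanity-check saved artifacts is to write them and load them back. The sketch below does this in a temporary directory standing in for the repository root, creating the target directory first, since opening a file inside a missing directory raises `FileNotFoundError`. `save_pickle` and `load_pickle` are hypothetical helpers and the values are made up.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write `obj` to `path`, creating parent directories if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Read back an object previously written with save_pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)

# A temporary directory stands in for the repository root.
with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "data", "netflix", "similarity_scores.pkl")
    toy_scores = [[1.0, 0.5], [0.5, 1.0]]   # made-up values
    save_pickle(toy_scores, path)
    restored = load_pickle(path)

print(restored == toy_scores)
```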

**Q10**. Encapsulate the previous code into functions. In particular, you will need:
- `get_sim_scores(mid)`, which returns the vector `sims` of similarity scores between the movie `mid` and all the other movies
- `get_ranked_recos(sims)`, which takes a vector of similarity scores `sims` and returns all n_movies recommendations, ranked from most to least recommended, as a list of (mid, score, name) tuples.

In [25]:
def get_movie_name(mid, movies):
  try:
    name = movies.loc[movies.movieId == mid].title.values[0]
  except IndexError:  # no row in `movies` matches this movieId
    name = "Unknown"
  return name


def get_sim_scores(mid):
  idx = mid_to_idx[mid]
  sims = similarity_scores[idx]
  return sims

def get_ranked_recos(sims):
  recos = []
  for idx in np.argsort(-sims):
    mid = idx_to_mid[idx]
    name = get_movie_name(mid, movies)
    score = sims[idx]
    recos.append((mid, score, name))
  return recos

def get_recommendations(mid, movies, k):
  sim_scores = get_sim_scores(mid)
  return get_ranked_recos(sim_scores)[:k]


In [26]:
get_recommendations(3, movies, 10)

[(3, 0.9999999, 'Grumpier Old Men (1995)'),
 (719, 0.63637155, 'Multiplicity (1996)'),
 (852, 0.61769545, 'Tin Cup (1996)'),
 (432, 0.60410196, "City Slickers II: The Legend of Curly's Gold (1994)"),
 (252, 0.5790136, 'I.Q. (1994)'),
 (383, 0.5700361, 'Wyatt Earp (1994)'),
 (333, 0.5599552, 'Tommy Boy (1995)'),
 (809, 0.5561594, 'Fled (1996)'),
 (1409, 0.55322385, 'Michael (1996)'),
 (802, 0.5472849, 'Phenomenon (1996)')]

If you have extra time, feel free to improve your recommendation engine!