# Recommendation - Model 🍿

In [32]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [33]:
%cd /content/drive/MyDrive/uni/Lab

/content/drive/MyDrive/uni/Lab


---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

Now for the exciting part! We will train a machine learning model on the sparse **ratings** matrix built previously and use it as the basis of a recommendation engine.

First, load the `movies` and `ratings` DataFrames again.

In [34]:
### TODO: load the movies and ratings datasets
import pandas as pd

movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

print(movies.head())
print(ratings.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


**Q1**. Start by loading all the pickle files you saved during the last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`

In [35]:
import pickle

pkl_data = "./data/netflix"

def load_pickle(name):
    # Open each file in a context manager so the handle is closed after reading
    with open(f"{pkl_data}/{name}.pkl", "rb") as f:
        return pickle.load(f)

ratings_matrix = load_pickle("ratings_matrix")
idx_to_mid = load_pickle("idx_to_mid")
mid_to_idx = load_pickle("mid_to_idx")
uid_to_idx = load_pickle("uid_to_idx")
idx_to_uid = load_pickle("idx_to_uid")

**Q2**. Because the dataset differs from what we have been used to (X as features, y as target), the usual `train_test_split` method from scikit-learn does not apply.

Fortunately, `lightfm` provides a `random_train_test_split` function in its `cross_validation` module, dedicated to this use case 🙂

Split the data randomly into a `train` matrix and a `test` matrix, with 20% of the interactions in the test set.
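
To make the idea concrete, here is a minimal NumPy sketch of an interaction-level split. It is not LightFM's implementation: the function name `split_interactions` and the toy triplets are assumptions made for illustration. The point is that individual (user, movie, rating) entries are shuffled and divided, rather than whole rows.

```python
import numpy as np

def split_interactions(rows, cols, vals, test_percentage=0.2, seed=0):
    """Randomly assign each stored interaction to train or test (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(vals))                 # shuffle interaction positions
    n_test = int(round(test_percentage * len(vals)))   # size of the test set
    test_ids, train_ids = order[:n_test], order[n_test:]
    train = (rows[train_ids], cols[train_ids], vals[train_ids])
    test = (rows[test_ids], cols[test_ids], vals[test_ids])
    return train, test

# Toy data: 10 (user, movie, rating) triplets, invented for this example.
rows = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
cols = np.array([0, 1, 2, 0, 3, 1, 2, 4, 0, 4])
vals = np.array([4.0, 3.0, 5.0, 2.0, 4.0, 5.0, 1.0, 3.0, 4.0, 2.0])

train_triplets, test_triplets = split_interactions(rows, cols, vals)
print(len(train_triplets[2]), len(test_triplets[2]))
```

A consequence of splitting this way is that the same user can appear in both sets, which is what lets the model be evaluated on held-out ratings of users it has already seen.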

In [36]:
!pip install lightfm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [37]:
import numpy as np
from lightfm.cross_validation import random_train_test_split

train, test = random_train_test_split(ratings_matrix,test_percentage=0.2,
                                      random_state=np.random.RandomState(0))



**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [38]:
from lightfm import LightFM

model = LightFM(no_components=10, loss='warp')
model.fit(train, epochs=10,verbose=True)

Epoch: 100%|██████████| 10/10 [00:01<00:00,  7.47it/s]


<lightfm.lightfm.LightFM at 0x7f98aafb26d0>

**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.

In [39]:
from lightfm.evaluation import precision_at_k

k = 10
# Passing the train matrix excludes already-seen interactions from the ranking
precision_measurement = precision_at_k(model, test, train_interactions=train, k=k)
print("Precision at k is:", precision_measurement.mean())

Precision at k is: 0.21299343
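
Roughly, precision at k is the fraction of a user's top-k ranked items that turn out to be positives in the test set, averaged over users. The sketch below computes it by hand for a single user. The function name `precision_at_k_for_user` and the toy scores are assumptions for illustration, not part of LightFM.

```python
import numpy as np

def precision_at_k_for_user(scores, test_items, train_items, k=10):
    """Fraction of the top-k scored items that appear in the user's test set."""
    scores = np.asarray(scores, dtype=float).copy()
    seen = np.array(sorted(train_items), dtype=int)
    scores[seen] = -np.inf                    # never rank items seen in training
    top_k = np.argsort(-scores)[:k].tolist()  # best-scored items first
    hits = sum(1 for item in top_k if item in test_items)
    return hits / k

# Toy scores for 6 items: item 0 was seen in training, items 2 and 4 are test positives.
scores = [0.9, 0.1, 0.8, 0.7, 0.3, 0.6]
print(precision_at_k_for_user(scores, test_items={2, 4}, train_items={0}, k=3))
```

Here the top 3 items after masking are 2, 3 and 5, of which only item 2 is a test positive, so the result is 1/3.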


**Q5**. What does the attribute `item_embeddings` of `model` contain? This will be the heart of your recommendation engine! 💟 So make sure you fully understand what it contains.

In [40]:
print(model.user_embeddings.shape)
print(model.item_embeddings.shape)

(610, 10)
(9724, 10)


**Q6**. We just trained a model that factorizes our ratings matrix into a matrix U of shape (n_users, no_components), `model.user_embeddings`, and a matrix V of shape (n_movies, no_components), `model.item_embeddings`.
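
To see what this factorization buys us, the sketch below uses small random matrices as stand-ins for `model.user_embeddings` and `model.item_embeddings` (the sizes and values are made up). The affinity between a user and a movie is the dot product of their two vectors. LightFM also adds per-user and per-item bias terms, which are left out here for simplicity.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, no_components = 4, 6, 3

# Random stand-ins for model.user_embeddings and model.item_embeddings.
U = rng.normal(size=(n_users, no_components))
V = rng.normal(size=(n_movies, no_components))

# One affinity score per (user, movie) pair: the dot product of their vectors.
scores = U @ V.T
print(scores.shape)  # (4, 6)

# Movie indices ranked for user 0, best first.
print(np.argsort(-scores[0]))
```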

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: As a similarity measure, we can use either cosine similarity or Pearson correlation:
> - **Cosine similarity** between two vectors or matrices X and Y is given by:
> ``` python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
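> - **Pearson correlation** between two vectors or matrices X and Y is given by the code below. Note that `np.corrcoef` treats each row as a variable and, given two arguments, stacks X and Y into one combined correlation matrix: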
> ``` python
> import numpy as np
> np.corrcoef(X, Y)
> ```

Compute `similarity_scores`, a matrix of shape (n_movies, n_movies) whose element (i, j) is the similarity between the movie of index i and the movie of index j.
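
For intuition, both measures can be written in a few lines of NumPy. The sketch below is an illustration under assumed toy inputs, not a replacement for the library functions: cosine similarity is the dot product of row-normalized vectors, and Pearson correlation is the same computation after centering each row.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Pairwise cosine similarity between the rows of E."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    unit = E / np.clip(norms, 1e-12, None)   # guard against all-zero rows
    return unit @ unit.T

# Toy embeddings: rows 0 and 1 point the same way, row 3 the opposite way.
E = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
S = cosine_similarity_matrix(E)
print(np.round(S, 2))

# Pearson correlation between rows = cosine similarity of the centered rows.
rng = np.random.RandomState(0)
X = rng.normal(size=(5, 4))
pearson = cosine_similarity_matrix(X - X.mean(axis=1, keepdims=True))
print(np.allclose(pearson, np.corrcoef(X)))
```

The result depends only on the direction of each vector, not its length, which is why rows 0 and 1 get a similarity of 1 even though one is twice as long as the other.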

In [41]:
from sklearn.metrics.pairwise import cosine_similarity
similarity_scores = cosine_similarity(model.item_embeddings)
similarity_scores

array([[ 0.9999999 ,  0.80755246,  0.81798285, ..., -0.56051445,
        -0.60890186, -0.69671315],
       [ 0.80755246,  0.99999994,  0.73034537, ..., -0.6483241 ,
        -0.66268545, -0.72428036],
       [ 0.81798285,  0.73034537,  1.0000001 , ..., -0.33328223,
        -0.26356238, -0.42354044],
       ...,
       [-0.56051445, -0.6483241 , -0.33328223, ...,  1.        ,
         0.9102558 ,  0.9445128 ],
       [-0.60890186, -0.66268545, -0.26356238, ...,  0.9102558 ,
         0.9999999 ,  0.9556289 ],
       [-0.69671315, -0.72428036, -0.42354044, ...,  0.9445128 ,
         0.9556289 ,  1.0000001 ]], dtype=float32)

**Q7**. For movie of idx 20, what are the idx of the 10 most similar movies?

In [53]:
idx = 20
sim_row = similarity_scores[idx]
# argsort is ascending, so negate the scores to rank from most to least similar
ranked_idx = np.argsort(-sim_row)
# Note: the first result is the movie itself (its similarity with itself is 1)
print("Idx for ten Most Similar:", ranked_idx[:10])
ranked_mid = [idx_to_mid[x] for x in ranked_idx]
print("mId for ten Most Similar:", ranked_mid[:10])
ranked_titles = [movies[movies.movieId == mid]["title"] for mid in ranked_mid[:10]] # top 10
print(ranked_titles)

Idx for ten Most Similar: [  20   16   26    3  118  433    4   15 1779  217]
mId for ten Most Similar: [356, 296, 480, 47, 2005, 4014, 50, 260, 2087, 3489]
[314    Forrest Gump (1994)
Name: title, dtype: object, 257    Pulp Fiction (1994)
Name: title, dtype: object, 418    Jurassic Park (1993)
Name: title, dtype: object, 43    Seven (a.k.a. Se7en) (1995)
Name: title, dtype: object, 1480    Goonies, The (1985)
Name: title, dtype: object, 2998    Chocolat (2000)
Name: title, dtype: object, 46    Usual Suspects, The (1995)
Name: title, dtype: object, 224    Star Wars: Episode IV - A New Hope (1977)
Name: title, dtype: object, 1550    Peter Pan (1953)
Name: title, dtype: object, 2608    Hook (1991)
Name: title, dtype: object]
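
In the ranking above, the first entry is the movie itself, since every movie has a similarity of 1 with itself. If you want neighbours only, one option is to mask the query index before sorting. The helper name `top_k_similar` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def top_k_similar(similarity_scores, idx, k=10):
    """Indices of the k most similar items to `idx`, excluding `idx` itself."""
    sims = similarity_scores[idx].astype(float)  # astype returns a copy
    sims[idx] = -np.inf                          # push the query item to the end
    return np.argsort(-sims)[:k]

# Toy symmetric similarity matrix for 4 items.
toy = np.array([
    [1.0, 0.2, 0.9, -0.5],
    [0.2, 1.0, 0.1, 0.7],
    [0.9, 0.1, 1.0, 0.0],
    [-0.5, 0.7, 0.0, 1.0],
])
print(top_k_similar(toy, idx=0, k=2))  # [2 1]
```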


**Q8**. Let's now test our engine! Suppose we have a user who likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to Toy Story?

> ⚠️ **Warning**: Remember that `similarity_scores` is indexed by `idx`, whereas what you have is the `movie_id` of your movie.

Retrieve the **top 5 recommendations**.

In [43]:
idx = mid_to_idx[1]
similarities_idx = similarity_scores[idx]
ranked_idx = np.argsort(-similarities_idx)
ranked_mid = [idx_to_mid[x] for x in ranked_idx]
print([movies[movies.movieId == mid]["title"] for mid in ranked_mid[:5]])

[0    Toy Story (1995)
Name: title, dtype: object, 1545    Little Mermaid, The (1989)
Name: title, dtype: object, 512    Beauty and the Beast (1991)
Name: title, dtype: object, 2029    Wild Wild West (1999)
Name: title, dtype: object, 510    Silence of the Lambs, The (1991)
Name: title, dtype: object]


As the next step is to **deploy your model**, you now need to:

**Q9**. Save your `similarity_scores` in pickle format. Also save the `movies` DataFrame in pickle format. Save both in the `data/netflix` directory at the root of the repository.

In [44]:
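dst_dir = "./data/netflix"
# Context managers make sure each file is flushed and closed after writing
with open(dst_dir + "/similarity_scores.pkl", "wb") as f:
    pickle.dump(similarity_scores, f)
with open(dst_dir + "/movies.pkl", "wb") as f:
    pickle.dump(movies, f)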
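
Before deploying, it can be worth checking that a saved object loads back unchanged. The sketch below does a round trip in a temporary directory. The helper names `save_obj` and `load_obj` and the toy scores are assumptions for illustration.

```python
import os
import pickle
import tempfile

def save_obj(obj, path):
    """Write `obj` to `path` in pickle format, closing the file afterwards."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_obj(path):
    """Read a pickled object back from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip in a throwaway directory so nothing in the repository is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "similarity_scores.pkl")
    toy_scores = [[1.0, 0.5], [0.5, 1.0]]
    save_obj(toy_scores, path)
    restored = load_obj(path)

print(restored == toy_scores)  # True
```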

**Q10**. Encapsulate the previous code into functions. In particular, you will need:
- a `get_sim_scores(mid)` function that returns the vector `sims` of similarity scores between the movie `mid` and all the other movies
- a `get_ranked_recos(sims)` function that takes a vector of similarity scores `sims` and returns the list of all n_movies recommendations, ranked from most to least recommended, as a list of (mid, score, name) tuples.

In [45]:
def get_sim_scores(mid):
    idx = mid_to_idx[mid]
    sims = similarity_scores[idx]
    return sims

In [52]:
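def get_ranked_recos(sims):
    # Negate to sort in descending order (`~` is bitwise NOT and fails on floats)
    ranked_idx = np.argsort(-sims)
    # Index titles by movieId once instead of filtering the DataFrame per movie
    titles = movies.set_index("movieId")["title"]
    # Return (mid, score, name) tuples, as the question asks
    return [(idx_to_mid[i], sims[i], titles.loc[idx_to_mid[i]]) for i in ranked_idx]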
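
As a usage sketch, the two steps can be chained into a single call that returns the top n recommendations and skips the query movie itself. Everything below runs on toy stand-ins: the 3x3 similarity matrix is invented, the `titles` dict replaces the `movies` DataFrame, and the function name `recommend` is hypothetical.

```python
import numpy as np

# Toy stand-ins for the notebook's globals (similarity values are made up).
similarity_scores = np.array([
    [1.0, 0.8, -0.3],
    [0.8, 1.0, 0.1],
    [-0.3, 0.1, 1.0],
])
idx_to_mid = {0: 1, 1: 2, 2: 3}
mid_to_idx = {mid: idx for idx, mid in idx_to_mid.items()}
titles = {1: "Toy Story (1995)", 2: "Jumanji (1995)", 3: "Grumpier Old Men (1995)"}

def recommend(mid, n=5):
    """Top-n (mid, score, name) tuples for `mid`, skipping the movie itself."""
    sims = similarity_scores[mid_to_idx[mid]]
    ranked = [(idx_to_mid[i], float(sims[i]), titles[idx_to_mid[i]])
              for i in np.argsort(-sims).tolist()]   # most similar first
    return [reco for reco in ranked if reco[0] != mid][:n]

for reco in recommend(1, n=2):
    print(reco)
```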

If you have extra time, feel free to improve your recommendation engine!