# Recommendation - Model 🍿

---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

Now, time for the exciting part! We will train a machine learning model on the sparse **ratings** matrix from the previous challenge and use it to build a recommendation engine.

First, reload the `movies` and `ratings` DataFrames.

In [28]:
### TODO: load the movies and ratings datasets
import pandas as pd

movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

print(movies.head())
ratings.head()

   movieId                               title   
0        1                    Toy Story (1995)  \
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


**Q1**. Start by loading all the pickles you saved during the last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`.

In [59]:
import pickle

def load_pickle(path):
    # The context manager closes the file even if unpickling fails.
    with open(path, 'rb') as f:
        return pickle.load(f)

ratings_matrix = load_pickle('./data/ratings_matrix.pkl')
idx_to_mid = load_pickle('./data/idx_to_mid.pkl')
mid_to_idx = load_pickle('./data/mid_to_idx.pkl')
uid_to_idx = load_pickle('./data/uid_to_idx.pkl')
idx_to_uid = load_pickle('./data/idx_to_uid.pkl')

FileNotFoundError: [Errno 2] No such file or directory: './data/idx_to_mid.pkl'
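
The `FileNotFoundError` above means at least one pickle is not where the cell expects it. As a minimal sketch, the check below lists every missing file at once before any loading is attempted. The helper name `missing_pickles` is hypothetical, and it assumes each object was saved as `<name>.pkl` in `./data`, as in the cell above.

```python
from pathlib import Path


def missing_pickles(directory, names):
    """Return the names whose `<name>.pkl` file is absent from `directory`."""
    directory = Path(directory)
    return [name for name in names if not (directory / f"{name}.pkl").is_file()]


names = ["ratings_matrix", "idx_to_mid", "mid_to_idx", "uid_to_idx", "idx_to_uid"]
missing = missing_pickles("./data", names)
if missing:
    print(f"Missing pickles, re-run the previous challenge to create them: {missing}")
```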

**Q2**. Because the dataset is slightly different from what we have been used to (X as features, y as target), the usual `train_test_split` function from scikit-learn does not apply.

Fortunately, `lightfm` comes with a `random_train_test_split` function in its `cross_validation` module, dedicated to this use case 🙂

Split the data randomly into a `train` matrix and a `test` matrix with 20% of interactions into the test set.
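
To make "splitting interactions" concrete, here is a NumPy-only sketch of the idea: every known (user, movie, rating) triple is randomly assigned to either the train set or the test set. This is an illustration under simplifying assumptions, not LightFM's implementation; the function name `split_interactions` and the toy data are hypothetical.

```python
import numpy as np


def split_interactions(rows, cols, values, test_percentage=0.2, seed=0):
    """Randomly assign each (row, col, value) interaction to train or test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(values))
    n_test = int(round(len(values) * test_percentage))
    test_idx, train_idx = order[:n_test], order[n_test:]
    train = (rows[train_idx], cols[train_idx], values[train_idx])
    test = (rows[test_idx], cols[test_idx], values[test_idx])
    return train, test


# Toy data: 10 ratings given by 3 users to 5 movies.
rows = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
cols = np.array([0, 1, 2, 0, 3, 4, 1, 2, 3, 4])
values = np.array([4.0, 3.0, 5.0, 2.0, 4.5, 3.5, 1.0, 5.0, 4.0, 3.0])

train, test = split_interactions(rows, cols, values)
print(len(train[2]), len(test[2]))  # 8 2
```

Unlike a row-wise split, the same user can appear in both sets; only individual interactions are held out.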

In [3]:
import numpy as np
from lightfm.cross_validation import random_train_test_split

# Hold out 20% of the interactions for testing.
train, test = random_train_test_split(ratings_matrix, test_percentage=0.2)

**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [7]:
from lightfm import LightFM

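# loss='warp' failed to run here, so the logistic loss is used instead.
# Note: this cell trains for 20 epochs with 500 components, not the 10 epochs suggested above.
model = LightFM(loss='logistic', no_components=500, random_state=0)
model.fit(train, epochs=20, verbose=True)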

Epoch: 100%|██████████| 20/20 [01:25<00:00,  4.26s/it]


<lightfm.lightfm.LightFM at 0x261bb275c10>

**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.

In [8]:
from lightfm.evaluation import precision_at_k

k = 5
precisionK = precision_at_k(model, test, train, k=k).mean()

precisionK

0.19046053
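
For intuition about the number above, here is a small sketch of precision at k for a single user: the fraction of the k highest-scored items that are actual positives in the test set. LightFM averages this over users and, when `train_interactions` is passed, leaves known training positives out of the ranking; the helper name `precision_at_k_single` and the toy scores below are hypothetical.

```python
import numpy as np


def precision_at_k_single(scores, relevant_items, k=5):
    """Fraction of the k highest-scored items that appear in `relevant_items`."""
    top_k = np.argsort(scores)[::-1][:k]
    hits = sum(1 for item in top_k if int(item) in relevant_items)
    return hits / k


scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6])
relevant = {0, 4, 5}  # items the user actually interacted with in the test set
print(precision_at_k_single(scores, relevant, k=5))  # 0.4
```

With k = 5, a mean precision of about 0.19 corresponds to roughly one test positive among the top 5 recommendations per user, on average.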

**Q5**. What does the `item_embeddings` attribute of `model` contain? This will be the heart of your recommendation engine! 💟 So make sure you fully understand what it contains.

In [9]:
print(model.item_embeddings.shape)

# item_embeddings is a matrix of shape (n_movies, no_components).
# Each row is the position of one movie in a no_components-dimensional space, where movies that lie closer together are more similar.

(3650, 500)
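
To make these shapes concrete, the toy sketch below builds random stand-ins for the user and item embeddings and combines them into one score per (user, movie) pair with a dot product. The sizes and values are made up for illustration, and the bias terms that LightFM adds to its predictions are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, no_components = 4, 6, 3

# Stand-ins for model.user_embeddings and model.item_embeddings
# (random here, learned during training in practice).
user_embeddings = rng.normal(size=(n_users, no_components))
item_embeddings = rng.normal(size=(n_movies, no_components))

# One score per (user, movie) pair: the dot product of the two embedding vectors.
scores = user_embeddings @ item_embeddings.T
print(scores.shape)  # (4, 6)

# The score of user 1 for movie 2 only depends on row 1 of U and row 2 of V.
assert np.isclose(scores[1, 2], user_embeddings[1] @ item_embeddings[2])
```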


**Q6**. We just trained a model that factorized our ratings matrix into a matrix U of shape (n_users, no_components), `model.user_embeddings`, and a matrix V of shape (n_movies, no_components), `model.item_embeddings`.

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: To measure similarity we can use either cosine similarity or Pearson correlation:
> - **Cosine similarity** between two vectors, or matrices X and Y is given by:
> ``` python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
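> - **Pearson correlation** between two vectors X and Y is given by the off-diagonal entry of the matrix below; for a single matrix X, `np.corrcoef(X)` returns the correlation between every pair of rows: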
> ``` python
> import numpy as np
> np.corrcoef(X, Y)
> ```

Compute the `similarity_scores` of size (n_movies, n_movies), containing for each element (i, j) the similarity between movie of index i and movie of index j.

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between all movie embeddings: shape (n_movies, n_movies).
similarityScores = cosine_similarity(model.item_embeddings)

print(similarityScores)

[[0.9999999  0.92144024 0.9526029  ... 0.32358593 0.6287932  0.28094938]
 [0.92144024 0.9999997  0.9202921  ... 0.27235782 0.580153   0.2605363 ]
 [0.9526029  0.9202921  1.         ... 0.29539362 0.5879576  0.26487043]
 ...
 [0.32358593 0.27235782 0.29539362 ... 1.0000002  0.21019645 0.14478612]
 [0.6287932  0.580153   0.5879576  ... 0.21019645 1.0000001  0.15994556]
 [0.28094938 0.2605363  0.26487043 ... 0.14478612 0.15994556 1.0000001 ]]
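
As a cross-check of what `cosine_similarity` computes, the sketch below reproduces it with plain NumPy on a toy matrix: normalize each row to unit length, then take all pairwise dot products. The function name `cosine_similarity_matrix` is hypothetical, and the sketch assumes no row is all zeros.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms  # assumes no all-zero row
    return unit @ unit.T


toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
sims = cosine_similarity_matrix(toy)
print(np.round(sims, 3))
```

Rows 0 and 1 point in the same direction, so their similarity is 1 even though their lengths differ; rows 0 and 2 are orthogonal, so their similarity is 0. The diagonal is mathematically 1, which is why values such as 0.9999999 or 1.0000002 in the output above are just float32 rounding.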


**Q7**. For the movie at idx 20, what are the idx values of the 10 most similar movies?

In [34]:
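idx = 20

# Sort by similarity, highest first. The slice keeps 11 indices because the movie
# itself (similarity ≈ 1) is expected to rank first, followed by its 10 nearest neighbours.
highestScores = np.argsort(similarityScores[idx])[-1:-12:-1]
movieIDs = [idx_to_mid[x] for x in highestScores]

for mid in movieIDs:
    print(movies[movies.movieId == mid]['title'])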

314    Forrest Gump (1994)
Name: title, dtype: object
277    Shawshank Redemption, The (1994)
Name: title, dtype: object
257    Pulp Fiction (1994)
Name: title, dtype: object
510    Silence of the Lambs, The (1991)
Name: title, dtype: object
461    Schindler's List (1993)
Name: title, dtype: object
507    Terminator 2: Judgment Day (1991)
Name: title, dtype: object
43    Seven (a.k.a. Se7en) (1995)
Name: title, dtype: object
97    Braveheart (1995)
Name: title, dtype: object
418    Jurassic Park (1993)
Name: title, dtype: object
0    Toy Story (1995)
Name: title, dtype: object
46    Usual Suspects, The (1995)
Name: title, dtype: object
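
The list above has 11 entries because the slice `[-1:-12:-1]` also returns the query movie. One possible alternative, sketched below on a toy vector, masks the query explicitly and uses `np.argpartition` so that only the k best candidates get sorted. The helper name `top_k_similar` is hypothetical, and it assumes k is smaller than the number of movies.

```python
import numpy as np


def top_k_similar(similarity_row, idx, k=10):
    """Indices of the k most similar items to `idx`, excluding `idx` itself."""
    scores = similarity_row.astype(float)  # astype returns a copy by default
    scores[idx] = -np.inf  # never recommend the query item
    # argpartition puts the k largest values last without sorting everything.
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates, most similar first.
    return candidates[np.argsort(scores[candidates])[::-1]]


row = np.array([0.2, 1.0, 0.9, 0.4, 0.95, 0.1])
print(top_k_similar(row, idx=1, k=3))  # [4 2 3]
```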


**Q8**. Let's now test our engine! Suppose we have a user who likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to Toy Story?

> ⚠️ **Warning**: Remember that your `similarity_scores` works with `idx` and you have the `movie_id` associated to your movie.

Retrieve the **top 5 recommendations**.

In [38]:
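idx = mid_to_idx[1]  # convert the movie_id of Toy Story to its matrix index
# Skip the top-ranked position (expected to be Toy Story itself) and keep the next 5.
recommendations = np.argsort(similarityScores[idx])[-2:-7:-1]
movieIDs = [idx_to_mid[x] for x in recommendations]
for mid in movieIDs:
    print(movies[movies.movieId == mid]['title'])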



257    Pulp Fiction (1994)
Name: title, dtype: object
314    Forrest Gump (1994)
Name: title, dtype: object
418    Jurassic Park (1993)
Name: title, dtype: object
615    Independence Day (a.k.a. ID4) (1996)
Name: title, dtype: object
510    Silence of the Lambs, The (1991)
Name: title, dtype: object


As the next step is to **deploy your model**, you now need to:

**Q9**. Save your `similarity_scores` in pickle format. Also save the `movies` DataFrame in pickle format. Save both in the `data/netflix` directory at the root of the repository.

In [58]:
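import pickle
import os

# Note: the exercise asks for data/netflix; this cell writes to ./data/ instead.
path = "./data/"
os.makedirs(path, exist_ok=True)
# Context managers make sure each file is flushed and closed after writing.
with open(os.path.join(path, 'similarity_scores.pkl'), 'wb') as f:
    pickle.dump(similarityScores, f)
with open(os.path.join(path, 'movies.pkl'), 'wb') as f:
    pickle.dump(movies, f)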

**Q10**. Encapsulate the previous code into functions. In particular, you will need:
- a `get_sim_scores(mid)` function that returns the vector `sims` of similarity scores between a movie `mid` and all the other movies
- a `get_ranked_recos(sims)` function that takes a vector of similarity scores `sims` and returns the ranked list of all n_movies recommendations (from most recommended to least recommended), as a list of (mid, score, name) tuples.

In [53]:
def getSimScores(mid, similarityScores):
    # Row of the similarity matrix for the movie with id `mid`.
    return similarityScores[mid_to_idx[mid]]

def getRankedRecos(sims):
    # Map each movieId to its title once instead of filtering the DataFrame per movie.
    titles = movies.set_index('movieId')['title']
    nMovies = []
    # Indices sorted from most to least similar.
    for idx in np.argsort(sims)[::-1]:
        mid = idx_to_mid[idx]
        # Store the title as a plain string rather than a one-element Series.
        nMovies.append((mid, sims[idx], titles.loc[mid]))

    return nMovies

def getRecommendations(mid, similarityScores):
    sims = getSimScores(mid, similarityScores)
    return getRankedRecos(sims)

In [56]:
nMovies = getRecommendations(2, similarityScores)

nMovies[:5]

[(2, 1.0, 'Jumanji (1995)'),
 (364, 0.9678494, 'Lion King, The (1994)'),
 (480, 0.96732163, 'Jurassic Park (1993)'),
 (500, 0.965642, 'Mrs. Doubtfire (1993)'),
 (296, 0.96554583, 'Pulp Fiction (1994)')]

If you have extra time, feel free to improve your recommendation engine!