# Recommendation - Model 🍿

---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

Now, time for the exciting part! We will train a Machine Learning model based on our previous **ratings** sparse matrix, so that it creates a recommendation engine automatically! 

In [19]:
import pickle
import pandas as pd
import numpy as np
from lightfm import LightFM
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import precision_at_k
from sklearn.metrics.pairwise import cosine_similarity

First, reload the `movies` and `ratings` DataFrames.

In [None]:
### TODO: load the movies and ratings datasets

**Q1**. Start by loading all the pickles you saved during the last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`

In [6]:
directory = './data'

def load_pickle(name):
    # Use a context manager so the file handle is closed after loading
    with open(f'{directory}/{name}.pkl', 'rb') as f:
        return pickle.load(f)

ratings_matrix = load_pickle('ratings_matrix')
moviesDf = load_pickle('moviesDf')
idx_to_mid = load_pickle('idx_to_mid')
mid_to_idx = load_pickle('mid_to_idx')
uid_to_idx = load_pickle('uid_to_idx')
idx_to_uid = load_pickle('idx_to_uid')

**Q2**. Because the dataset is slightly different from what we have been used to (X as features, y as target), the usual `train_test_split` method from scikit-learn does not apply.

Fortunately, `lightfm` comes with a `random_train_test_split` function, located in its `cross_validation` module, dedicated to this use case 🙂

Split the data randomly into a `train` matrix and a `test` matrix, with 20% of the interactions in the test set.

In [8]:
train, test = random_train_test_split(ratings_matrix, test_percentage=0.2, random_state=np.random.RandomState(0))
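
To make the split more concrete, here is a minimal NumPy-only sketch of what a random interaction split does: it shuffles the observed (user, movie, rating) triplets and sends a fraction of them to the test set. The helper name `split_interactions` and the toy triplets are hypothetical and only serve as an illustration. In the challenge itself, LightFM's `random_train_test_split` works directly on the sparse matrix and is the one to use.

```python
import numpy as np

def split_interactions(rows, cols, vals, test_percentage=0.2, seed=0):
    """Hypothetical helper: randomly split observed interactions into train/test triplets."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(vals))                 # shuffled interaction positions
    n_test = int(round(test_percentage * len(vals)))   # number of held-out interactions
    test_idx, train_idx = order[:n_test], order[n_test:]
    train = (rows[train_idx], cols[train_idx], vals[train_idx])
    test = (rows[test_idx], cols[test_idx], vals[test_idx])
    return train, test

# Toy data: 10 observed ratings in a 4-user x 5-movie matrix (made up for illustration)
rows = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])
cols = np.array([0, 2, 1, 3, 4, 0, 4, 1, 2, 3])
vals = np.array([5, 3, 4, 2, 5, 1, 4, 3, 5, 2], dtype=np.float32)

train_triplets, test_triplets = split_interactions(rows, cols, vals)
print(len(train_triplets[2]), "train interactions,", len(test_triplets[2]), "test interactions")
```

Each interaction ends up in exactly one of the two sets, so the model is never evaluated on a rating it was trained on.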

**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [12]:
model = LightFM(no_components=100, loss='warp', random_state=0)
model.fit(train, epochs=10, verbose=True)

Epoch: 100%|████████████████████████████████████| 10/10 [00:01<00:00,  5.22it/s]


<lightfm.lightfm.LightFM at 0x135441d10>

**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.

In [17]:
k = 5
precision_k = precision_at_k(model, test, train, k=k)
print(f'Precision at k: {precision_k.mean()}')

Precision at k: 0.2896551787853241
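
To see what this number measures, here is a small sketch of precision at k for a single user: the share of the k highest-scored movies that appear among that user's test interactions. The helper `precision_at_k_single_user`, the scores and the item sets are all made up for illustration. LightFM's `precision_at_k` returns one such value per user, which is why the cell above takes the mean. Passing the training matrix as the third argument excludes already-seen items from the ranking, which this sketch imitates by masking their scores.

```python
import numpy as np

def precision_at_k_single_user(scores, test_items, train_items, k=5):
    """Hypothetical helper: share of the top-k ranked items that are test positives."""
    scores = np.asarray(scores, dtype=float).copy()
    scores[list(train_items)] = -np.inf          # push items seen in training to the bottom
    top_k = np.argsort(-scores)[:k]              # indices of the k highest scores
    hits = sum(1 for item in top_k if item in test_items)
    return hits / k

# Made-up scores for 8 items and one user
scores = [0.9, 0.1, 0.8, 0.4, 0.7, 0.3, 0.6, 0.2]
train_items = {0}        # item 0 was rated in the training set
test_items = {2, 6, 7}   # held-out positives

p = precision_at_k_single_user(scores, test_items, train_items, k=5)
print(f"Precision at 5: {p}")
```

Here two of the five top-ranked items (2 and 6) are held-out positives, so the precision at 5 for this user is 0.4.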


**Q5**. What does the `item_embeddings` attribute of `model` contain? This will be the heart of your recommendation engine! 💟 So make sure you fully understand what it contains.

In [18]:
print(model.item_embeddings.shape)

(3650, 100)


The first dimension is the number of unique movies (3650). The second is the number of components of the model, so each movie is represented by a 100-dimensional vector.

**Q6**. We just trained a model that factorized our ratings matrix into a matrix U of shape (n_users, no_components), `model.user_embeddings`, and a matrix V of shape (n_movies, no_components), `model.item_embeddings`.
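
As a rough illustration of how the two matrices combine, the sketch below uses small random arrays as stand-ins for `model.user_embeddings` and `model.item_embeddings` (the sizes are arbitrary assumptions). The score for a (user, movie) pair is the dot product of the two embedding rows. A trained LightFM model also adds bias terms (`user_biases`, `item_biases`) to this dot product, which are left out here for brevity.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, no_components = 4, 6, 3   # toy sizes, not the real dataset

# Random stand-ins for model.user_embeddings and model.item_embeddings
U = rng.normal(size=(n_users, no_components))
V = rng.normal(size=(n_movies, no_components))

# Score of every (user, movie) pair: one dot product per cell
scores = U @ V.T
print(scores.shape)

# The score for a single pair is the dot product of the two embedding rows
user_idx, movie_idx = 2, 5
single_score = U[user_idx] @ V[movie_idx]
print(np.isclose(scores[user_idx, movie_idx], single_score))
```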

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: As a similarity measure, we can use either cosine similarity or Pearson correlation:
> - **Cosine similarity** between the rows of two matrices X and Y is given by:
> ```python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
> - **Pearson correlation** between the rows of two matrices X and Y is given by:
> ```python
> import numpy as np
> # Rows are treated as variables; the result covers the rows of X and Y stacked together
> np.corrcoef(X, Y)
> ```

Compute `similarity_scores`, a matrix of shape (n_movies, n_movies) whose element (i, j) is the similarity between the movie at index i and the movie at index j.

In [20]:
similarity_scores = cosine_similarity(model.item_embeddings)
similarity_scores.shape

(3650, 3650)
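
The two measures from the hint are closely related, which the following NumPy-only sketch tries to show: the Pearson correlation between two rows is the cosine similarity of those rows after subtracting each row's mean. The helper `cosine_rows` and the random matrix `V` (a stand-in for `model.item_embeddings`) are assumptions made for this example.

```python
import numpy as np

def cosine_rows(X):
    """Pairwise cosine similarity between the rows of X (assumes no all-zero row)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / norms                      # scale every row to unit length
    return X_unit @ X_unit.T

rng = np.random.RandomState(0)
V = rng.normal(size=(5, 4))                 # stand-in for model.item_embeddings

cos_sim = cosine_rows(V)

# Pearson correlation = cosine similarity after centering each row on its mean
V_centered = V - V.mean(axis=1, keepdims=True)
pearson_sim = cosine_rows(V_centered)

print(cos_sim.shape)
print(np.allclose(pearson_sim, np.corrcoef(V)))
```

Both matrices are symmetric with ones on the diagonal, since every movie is perfectly similar to itself.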

**Q7**. For the movie at idx 20, what are the idx of the 10 most similar movies?

In [24]:
idx = 20
similarity_idx = similarity_scores[idx]
# Sort movie indices from most to least similar
ranked_idx = np.argsort(-similarity_idx)
ranked_mid = [idx_to_mid[x] for x in ranked_idx]
for mid in ranked_mid[:10]:
    print(moviesDf[moviesDf.movieId == mid]['title'])

314    Forrest Gump (1994)
Name: title, dtype: object
510    Silence of the Lambs, The (1991)
Name: title, dtype: object
0    Toy Story (1995)
Name: title, dtype: object
506    Aladdin (1992)
Name: title, dtype: object
1284    Good Will Hunting (1997)
Name: title, dtype: object
1503    Saving Private Ryan (1998)
Name: title, dtype: object
277    Shawshank Redemption, The (1994)
Name: title, dtype: object
3568    Monsters, Inc. (2001)
Name: title, dtype: object
1757    Bug's Life, A (1998)
Name: title, dtype: object
1438    Rain Man (1988)
Name: title, dtype: object


**Q8**. Let's now test our engine! Suppose we have a user who likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to Toy Story?

> ⚠️ **Warning**: Remember that `similarity_scores` is indexed by `idx`, whereas what you have is the `movie_id` of your movie.

Retrieve the **top 5 recommendations**.

In [27]:
# Convert the movie_id into the matrix index before looking up similarities
idx = mid_to_idx[1]
similarity_idx = similarity_scores[idx]
ranked_idx = np.argsort(-similarity_idx)
ranked_mid = [idx_to_mid[x] for x in ranked_idx]
for mid in ranked_mid[:5]:
    print(moviesDf[moviesDf.movieId == mid]['title'])

0    Toy Story (1995)
Name: title, dtype: object
224    Star Wars: Episode IV - A New Hope (1977)
Name: title, dtype: object
314    Forrest Gump (1994)
Name: title, dtype: object
418    Jurassic Park (1993)
Name: title, dtype: object
1757    Bug's Life, A (1998)
Name: title, dtype: object
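
In the output above, Toy Story comes back as its own best match, because the cosine similarity of a vector with itself is 1. One possible way to avoid recommending the movie the user already likes is to drop the query index from the ranking. The helper `top_k_similar` and the small similarity matrix below are hypothetical and only illustrate the idea.

```python
import numpy as np

def top_k_similar(similarity_scores, idx, k=5):
    """Hypothetical helper: indices of the k most similar items, excluding idx itself."""
    ranked = np.argsort(-similarity_scores[idx])   # most similar first
    ranked = ranked[ranked != idx]                 # drop the query item
    return ranked[:k]

# Made-up symmetric similarity matrix for 4 items
sims = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.7],
    [0.9, 0.1, 1.0, 0.3],
    [0.5, 0.7, 0.3, 1.0],
])

print(top_k_similar(sims, idx=0, k=2))
```

For item 0, the two nearest neighbours are items 2 and 3, and item 0 itself no longer appears in the result.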


As the next step is to **deploy your model**, you now need to:

**Q9**. Save your `similarity_scores` in pickle format, along with the `movies` DataFrame. Save both in the `data/netflix` directory at the root of the repository.

In [28]:
with open(directory + '/similarity_scores.pkl', 'wb') as f:
    pickle.dump(similarity_scores, f)
with open(directory + '/moviesDf.pkl', 'wb') as f:
    pickle.dump(moviesDf, f)
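
Before deploying, it can be worth checking that a saved file reloads to exactly the same object. The sketch below does such a round trip in a temporary directory, with a small identity matrix standing in for `similarity_scores`. The variable names and the temporary path are assumptions for this example, not part of the challenge.

```python
import os
import pickle
import tempfile

import numpy as np

# Small stand-in for similarity_scores
similarity_scores_demo = np.eye(3, dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "similarity_scores.pkl")

    # Write the array to disk; the with-block closes the file afterwards
    with open(path, "wb") as f:
        pickle.dump(similarity_scores_demo, f)

    # Read it back and keep the reloaded copy for comparison
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

print(np.array_equal(similarity_scores_demo, reloaded))
```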

**Q10**. Encapsulate the previous code into functions. In particular, you will need:
- `get_sim_scores(mid)`: returns the vector of similarity scores `sims` between a movie `mid` and all the other movies
- `get_ranked_recos(sims)`: returns, for a vector of similarity scores `sims`, the list of all n_movies recommendations ranked from most to least recommended, as a list of (mid, score, name) tuples.

In [32]:
def get_movie_name(mid, movies):
    # Look the title up in the DataFrame passed in, not in the global one
    try:
        name = movies.loc[movies.movieId == mid].title.values[0]
    except IndexError:
        # No row matches this movie id
        name = 'Unknown'
    return name

def get_sim_scores(mid):
    idx = mid_to_idx[mid]
    return similarity_scores[idx]

def get_ranked_recos(sims, movies):
    recos = []
    # Walk through the movie indices from most to least similar
    for idx in np.argsort(-sims):
        mid = idx_to_mid[idx]
        name = get_movie_name(mid, movies)
        score = sims[idx]
        recos.append((mid, score, name))
    return recos

def get_recommendations(mid, movies, k):
    sim_scores = get_sim_scores(mid)
    return get_ranked_recos(sim_scores, movies)[:k]

In [33]:
get_recommendations(3, moviesDf, 10)

[(3, 1.0, 'Grumpier Old Men (1995)'),
 (432, 0.6149578, "City Slickers II: The Legend of Curly's Gold (1994)"),
 (719, 0.6083785, 'Multiplicity (1996)'),
 (852, 0.60509646, 'Tin Cup (1996)'),
 (383, 0.5771905, 'Wyatt Earp (1994)'),
 (802, 0.57461876, 'Phenomenon (1996)'),
 (252, 0.5653284, 'I.Q. (1994)'),
 (809, 0.5639537, 'Fled (1996)'),
 (736, 0.55087614, 'Twister (1996)'),
 (65, 0.5355233, 'Bio-Dome (1996)')]

If you have extra time, feel free to improve your recommendation engine!