# Recommender Systems

Recommender systems suggest items to users based on their past preferences. They are generally classified into the following categories:

- **Content-based** filtering, based on the similarity between item attributes (e.g. keywords, categories) and the user's preferences. For instance, if a user liked Item A, the recommender system would recommend items with similar characteristics. For movies, those characteristics would include the producer, the genre, etc. Typically, this can be done with a supervised binary classification algorithm.


- **Collaborative filtering**, which calculates similarity from interactions (e.g. ratings, number of purchases, likes, etc.). This process finds customers who have similar preferences, and then recommends to one customer items that are new to that customer and preferred by other customers with a similar profile. The system assumes that users with similar movie-watching habits have similar preferences in general. It finds users who watch videos similar to those the user watches, then identifies videos those similar users have watched but the first user has not. There are three approaches to this: user-user collaborative filtering, item-item collaborative filtering and matrix factorization.


- **Hybrid systems**, a newer and usually more powerful approach, which combine the two aforementioned methods.

<img src="data/image.png" width=600>

The above diagram categorises in detail the different methodologies for building a recommender system.

In the context of the `movielens` dataset, which will be used later as a sample dataset:
- We will suggest different movies based on content similarity (e.g. genre, cast, etc.), applying item-content filtering.
- We could also compare user metadata, such as age and gender, and suggest items that similar users have liked. In that case, we would use user-content filtering. The `movielens` dataset doesn't contain any user content data, so we will only build item-item collaborative filtering.

**Memory-based** content filtering

In memory-based methods, no model learns from the data to make predictions. Instead, a pre-computed matrix of similarities is constructed that can be used for movie predictions.
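
As a minimal sketch of the memory-based idea, the snippet below pre-computes an item-item cosine similarity matrix from a tiny, made-up ratings matrix and then simply looks up the nearest neighbour of an item. The toy values and the helper name `most_similar` are illustrative assumptions, not part of the challenge.

```python
import numpy as np

# Hypothetical toy data: 4 items (rows) rated by 3 users (columns); 0 = not rated.
item_user = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
    [0.0, 1.0, 4.0],
])

# Pre-compute the item-item cosine similarity matrix once:
# normalise every row to unit length, then take all pairwise dot products.
norms = np.linalg.norm(item_user, axis=1, keepdims=True)
unit_rows = item_user / norms
similarity = unit_rows @ unit_rows.T


def most_similar(item: int) -> int:
    """Return the index of the item most similar to `item`, excluding itself."""
    scores = similarity[item].copy()
    scores[item] = -np.inf  # never recommend the query item itself
    return int(np.argmax(scores))


print(most_similar(0), most_similar(2))
```

Nothing is trained here: a prediction is only a lookup in the stored matrix, which is what distinguishes memory-based methods from model-based ones.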

## Data collection and cleaning

__Load the datasets into three different pandas dataframes: `movies`, `tags` and `ratings`.__

In [1]:
import pandas as pd
import numpy as np

In [2]:
movies = pd.read_csv("data/movies.csv")
tags = pd.read_csv("data/tags.csv")
ratings = pd.read_csv("data/ratings.csv")

__Remove the '|' that separates the different genres of a movie and replace it with a space.__

In [3]:
# Vectorised string replacement; regex=False treats '|' as a literal character
movies['genres'] = movies['genres'].str.replace('|', ' ', regex=False)

In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


__Filter the `movies` dataframe to keep only movies that have received a rating.__

In [5]:
len(tags)

3683

In [6]:
movies = movies[movies['movieId'].isin(ratings['movieId'])]

### 🧪 Test your code

In [7]:
from nbresult import ChallengeResult

result = ChallengeResult('datasets', 
                         movies_shape=movies.shape, 
                         tags_shape=tags.shape, 
                         ratings_shape=ratings.shape, 
                         genres_cleaned=movies['genres'].str.find('|') >= 0)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/06-Unsupervised-Learning/03-Movie-Recommendation
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 4 items

tests/test_datasets.py::TestDatasets::test_genres_are_cleaned PASSED     [ 25%]
tests/test_datasets.py::TestDatasets::test_movies_shape PASSED           [ 50%]
tests/test_datasets.py::TestDatasets::test_ratings_shape PASSED          [ 75%]
tests/test_datasets.py::TestDatasets::test_tags_shape PASSED             [100%]

💯 You can commit your code:

git add tests/datasets.pickle

git commit -m 'Completed datasets step'

git push origin master


## Feature engineering

We're going to create a new feature called `metadata` that merges **all the text information** we have about a movie: the genres and the tags.

__Merge the `movies` and `tags` dataframes.__

In [8]:
new_df = movies.merge(tags, on='movieId', how='left')
new_df.head(), len(movies)

(   movieId             title                                       genres  \
 0        1  Toy Story (1995)  Adventure Animation Children Comedy Fantasy   
 1        1  Toy Story (1995)  Adventure Animation Children Comedy Fantasy   
 2        1  Toy Story (1995)  Adventure Animation Children Comedy Fantasy   
 3        2    Jumanji (1995)                   Adventure Children Fantasy   
 4        2    Jumanji (1995)                   Adventure Children Fantasy   
 
    userId               tag     timestamp  
 0   336.0             pixar  1.139046e+09  
 1   474.0             pixar  1.137207e+09  
 2   567.0               fun  1.525286e+09  
 3    62.0           fantasy  1.528844e+09  
 4    62.0  magic board game  1.528844e+09  ,
 9724)

__Create a new dataframe `merged_df` with the `metadata` column, containing a string of both tags AND genres merged together.__

So you first have to merge the **tags** together per movie, and **then** merge the combined tags with the **genres** column.

For example, the `metadata` column of the movie Toy Story will be `pixar pixar fun Adventure Animation Children Comedy Fantasy`.

👉 `merged_df` must contain at least the following columns: `movieId`, `title` and `metadata`.
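
The two steps described above can be sketched on a miniature example. The dataframes `movies_toy` and `tags_toy` are hypothetical stand-ins for `movies` and `tags`, and filling missing tags with an empty string is one possible choice (an assumption here), not something the challenge prescribes.

```python
import pandas as pd

# Hypothetical miniature versions of the `movies` and `tags` dataframes.
movies_toy = pd.DataFrame({
    "movieId": [1, 4],
    "title": ["Toy Story (1995)", "Waiting to Exhale (1995)"],
    "genres": ["Adventure Animation Children Comedy Fantasy", "Comedy Drama Romance"],
})
tags_toy = pd.DataFrame({
    "movieId": [1, 1, 1],
    "tag": ["pixar", "pixar", "fun"],
})

# 1. Merge the tags together per movie (one string per movieId).
tags_per_movie = (
    tags_toy.groupby("movieId")["tag"]
    .agg(lambda s: " ".join(s))
    .reset_index()
)

# 2. Attach them to the movies; movies without any tag get an empty string.
merged_toy = movies_toy.merge(tags_per_movie, on="movieId", how="left")
merged_toy["tag"] = merged_toy["tag"].fillna("")

# 3. Tags first, then genres; strip the leading space left by an empty tag.
merged_toy["metadata"] = (merged_toy["tag"] + " " + merged_toy["genres"]).str.strip()
merged_toy = merged_toy[["movieId", "title", "metadata"]]

print(merged_toy["metadata"].tolist())
```

Aggregating the tags before the merge keeps one row per movie, and handling the missing tags explicitly avoids the literal string `nan` ending up in the metadata of untagged movies.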

In [22]:
print(new_df['tag'].unique())

['pixar' 'fun' 'fantasy' ... 'star wars' 'gintama' 'remaster']


In [26]:
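def mafonc(s):
    # Join all the tags of one movie into a single space-separated string.
    # Note: a missing tag (NaN) is converted to the literal string "nan".
    return " ".join([str(k) for k in s])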

In [27]:
merged_df = new_df[['title','movieId','genres','tag']]. \
groupby(['movieId','title','genres'], as_index=False)['tag'].agg(mafonc)

merged_df

Unnamed: 0,movieId,title,genres,tag
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,pixar pixar fun
1,2,Jumanji (1995),Adventure Children Fantasy,fantasy magic board game Robin Williams game
2,3,Grumpier Old Men (1995),Comedy Romance,moldy old
3,4,Waiting to Exhale (1995),Comedy Drama Romance,
4,5,Father of the Bride Part II (1995),Comedy,pregnancy remake
...,...,...,...,...
9719,193581,Black Butler: Book of the Atlantic (2017),Action Animation Comedy Fantasy,
9720,193583,No Game No Life: Zero (2017),Animation Comedy Fantasy,
9721,193585,Flint (2017),Drama,
9722,193587,Bungo Stray Dogs: Dead Apple (2018),Action Animation,


In [28]:
merged_df = pd.DataFrame(merged_df)
merged_df['metadata'] = [f"{str(i)} {str(j)}" for i,j in zip(merged_df['genres'],merged_df['tag'])]
merged_df = merged_df[['movieId','title','metadata']]
merged_df

Unnamed: 0,movieId,title,metadata
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy pi...
1,2,Jumanji (1995),Adventure Children Fantasy fantasy magic board...
2,3,Grumpier Old Men (1995),Comedy Romance moldy old
3,4,Waiting to Exhale (1995),Comedy Drama Romance nan
4,5,Father of the Bride Part II (1995),Comedy pregnancy remake
...,...,...,...
9719,193581,Black Butler: Book of the Atlantic (2017),Action Animation Comedy Fantasy nan
9720,193583,No Game No Life: Zero (2017),Animation Comedy Fantasy nan
9721,193585,Flint (2017),Drama nan
9722,193587,Bungo Stray Dogs: Dead Apple (2018),Action Animation nan


### 🧪 Test your code

In [29]:
from nbresult import ChallengeResult
import numpy as np

result = ChallengeResult('feature_engineering', 
                         unique_movies=np.all(merged_df[['movieId']].value_counts() > 1), 
                         metadata=merged_df[merged_df['title'] == 'Copycat (1995)'], 
                         merged_df_rows=merged_df.shape[0])
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/06-Unsupervised-Learning/03-Movie-Recommendation
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 4 items

tests/test_feature_engineering.py::TestFeatureEngineering::test_merged_df_rows PASSED [ 25%]
tests/test_feature_engineering.py::TestFeatureEngineering::test_metadata_has_genres PASSED [ 50%]
tests/test_feature_engineering.py::TestFeatureEngineering::test_metadata_has_tags PASSED [ 75%]
tests/test_feature_engineering.py::TestFeatureEngineering::test_unicity_of_movies PASSED [100%]

💯 You can commit your code:

git add tests/feature_engineering.pickle

git commit -m 'Completed feature_engineering step'

git push origin master


## Create a content latent matrix from metadata

### Count Vectorizer

Next, we need to transform the metadata text into vectors so that we can feed it to our machine learning algorithms. Machine learning models don't understand raw text, so we need to encode it numerically.

We're going to use the [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to encode the metadata column.

Create a new dataframe `count_df` with the vectors you get from this count transformation. Each row is the term-frequency vector of one movie.

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

In [36]:
cv = CountVectorizer()
count_df = cv.fit_transform(merged_df['metadata'])
count_df = count_df.toarray()
count_df.shape

(9724, 1747)
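
To make the encoding concrete, here is a small sketch on two made-up metadata strings (the `docs` list and the `toy_` names are assumptions for illustration). It shows that each column corresponds to one token of the learned vocabulary and that, with its default settings, `CountVectorizer` lowercases the text before counting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical metadata strings for two movies.
docs = ["pixar pixar fun Adventure", "Adventure Fantasy"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(docs)  # sparse matrix: one row per document

vocab = toy_cv.vocabulary_               # mapping token -> column index
dense = toy_counts.toarray()

# "pixar" appears twice in the first document and never in the second.
print(dense[0, vocab["pixar"]], dense[1, vocab["pixar"]])
# Tokens are stored in lowercase, so "Adventure" becomes "adventure".
print(sorted(vocab))
```

Note that `fit_transform` returns a sparse matrix. Calling `.toarray()` is convenient for inspection, but the sparse form can usually be passed on as is, which saves memory on larger vocabularies.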

### Dimensionality reduction

Each movie's metadata was transformed into a vector of length 1747!

As we saw in previous lessons, we can apply dimensionality reduction methods to describe the data (movies) without much loss of information. Truncated singular value decomposition (SVD) is another advanced tool to reduce dimensions.

Contrary to PCA, this estimator **does not center the data before computing the singular value decomposition**. This means it can work with **sparse matrices efficiently**. In particular, truncated SVD works on term count/frequency matrices. In that context, it is known as latent semantic analysis (LSA).

You can look at the [`TruncatedSVD`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) class in scikit-learn; apart from the centering step, it follows the same principle as PCA.
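
The difference with PCA can be illustrated with plain NumPy. The sketch below is not scikit-learn's implementation, only the underlying linear algebra on a hypothetical count matrix `X`: the decomposition is applied to `X` directly, whereas PCA would first subtract the column means and thereby destroy the sparsity.

```python
import numpy as np

# Hypothetical 4x3 term-count matrix (4 movies, 3 terms).
X = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 1.0],
])
k = 2  # number of components to keep

# Truncated SVD: decompose X itself, with no centering step.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :k] * S[:k]

# Equivalent view: project X onto the first k right singular vectors.
X_projected = X @ Vt[:k].T

# PCA would instead start from the centered matrix, which is dense
# even when X is mostly zeros.
X_centered = X - X.mean(axis=0)

print(X_reduced.shape)
print(np.count_nonzero(X), np.count_nonzero(X_centered))
```

On a 9724 x 1747 count matrix that is mostly zeros, skipping the centering is what allows the computation to stay in sparse form.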

__👉 Apply the truncated SVD with a reduced number of features (say 25) to reduce the dimension of your count matrix.__

In [37]:
from sklearn.decomposition import TruncatedSVD

In [39]:
tsvd = TruncatedSVD(n_components=25)

count_trans = tsvd.fit_transform(count_df)
count_trans.shape

(9724, 25)

👉 **Plot the cumulative explained variance ratio as a function of the number of components.**

In [44]:
np.cumsum(tsvd.explained_variance_ratio_)[24]

0.8190229187659267
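
The cell above reads off only the last value of the cumulative curve. As a hedged sketch, the whole curve and the number of components needed to reach a given threshold can be computed as follows; the `ratios` array is made up and stands in for `tsvd.explained_variance_ratio_`, and the 0.85 threshold is an arbitrary choice.

```python
import numpy as np

# Hypothetical explained-variance ratios, one value per component.
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

# Cumulative explained variance as a function of the number of components.
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.85
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative)
print(n_needed)
```

The `cumulative` array is what one would hand to a plotting function such as `matplotlib.pyplot.plot` to obtain the requested curve.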

We can see that with only the first 25 components (out of 1747 initially) we can explain **more than 80%** of the variance, which suffices for our study purposes.

👉 **Save the 25 components of this matrix into a new `latent_df` dataframe, indexed by the titles from the movies.**

In [86]:
latent_df = count_trans
latent_df

array([[ 0.64491248,  0.95374363,  0.40536339, ...,  0.05040158,
        -0.028162  , -0.03807131],
       [ 0.30631091,  0.17817898,  0.502314  , ...,  0.07867106,
        -0.03171101, -0.01425967],
       [ 0.50295021,  0.85052777, -0.32881652, ..., -0.01657279,
        -0.0057268 , -0.00229172],
       ...,
       [ 1.15433829, -0.45353723, -0.48395616, ...,  0.00691445,
         0.00865034,  0.00463522],
       [ 0.97790449, -0.05031216,  0.59482375, ...,  0.06013595,
         0.00865129, -0.00409142],
       [ 1.10880462,  0.7808703 ,  0.0023096 , ...,  0.00242983,
         0.01802243, -0.0118926 ]])

### 🧪 Test your code

In [46]:
from nbresult import ChallengeResult

result =  ChallengeResult('metadata', 
                          counter_shape=count_df.shape, 
                          latent_shape=latent_df.shape)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/06-Unsupervised-Learning/03-Movie-Recommendation
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 3 items

tests/test_metadata.py::TestMetadata::test_counter_columns PASSED        [ 33%]
tests/test_metadata.py::TestMetadata::test_counter_rows PASSED           [ 66%]
tests/test_metadata.py::TestMetadata::test_features_number PASSED        [100%]

💯 You can commit your code:

git add tests/metadata.pickle

git commit -m 'Completed metadata step'

git push origin master


## Create a latent matrix from user ratings

Besides the metadata, we have another valuable source of information: **the user ratings**.

A recommender system can recommend a movie that is similar on the basis of user ratings (item-item collaborative filtering).

👉 **We prepare the dataset below to have movies as rows, `userId`s as columns.**

In [47]:
# Merge
ratings1 = pd.merge(movies[['movieId']], ratings, on="movieId", how="right")
# Pivot
ratings2 = ratings1.pivot(index = 'movieId', columns ='userId', values = 'rating').fillna(0)
display(ratings2.head())
ratings2.shape

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


(9724, 610)
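
The pivot step can be followed on a tiny example. The `toy_ratings` dataframe below is made up; it only mimics the `userId`, `movieId` and `rating` columns of `ratings`.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [1, 3, 1, 2],
    "rating":  [4.0, 4.0, 4.5, 3.0],
})

# Wide format: movies as rows, users as columns, missing ratings filled with 0.
toy_matrix = toy_ratings.pivot(
    index="movieId", columns="userId", values="rating"
).fillna(0)

print(toy_matrix)
```

Two points are worth keeping in mind: `pivot` raises an error if the same (`movieId`, `userId`) pair appears twice, and filling with 0 makes "not rated" indistinguishable from a very low rating, which is a simplification accepted here.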

In [70]:
ratings2

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193583,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193585,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193587,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


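We just created a matrix of shape (9724, 610): each movie (row) is described by a vector of 610 user ratings, and each user's ratings (column) form a vector of length 9724. Missing ratings are filled with 0.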

👉 **Once again, we will apply SVD, this time to the `ratings2` dataframe, keeping only the first 200 components. Call the result `latent_df_2`.**

In [65]:
tsvd2 = TruncatedSVD(n_components=200)

latent_df_2 = tsvd2.fit_transform(ratings2)

**👉 Reindex by the film title.**

In [69]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,2,Jumanji (1995),Adventure Children Fantasy
2,3,Grumpier Old Men (1995),Comedy Romance
3,4,Waiting to Exhale (1995),Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action Animation Comedy Fantasy
9738,193583,No Game No Life: Zero (2017),Animation Comedy Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action Animation


In [72]:
latent_df_2 = pd.DataFrame(latent_df_2, index = ratings2.index)

In [73]:
latent_df_2

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,37.649828,-6.380094,14.994617,0.286947,-1.887147,4.790050,8.807629,-2.198635,1.153305,1.387563,...,1.260928,1.627575,2.918042,0.124539,1.071617,0.750785,3.848728,2.762340,-0.850380,-1.076279
2,20.596193,-0.477880,10.865916,-8.635795,-3.072411,-0.517149,4.359215,-3.821408,-3.394116,0.236472,...,-0.180896,0.421606,1.324547,-2.500562,-0.636296,0.844001,-0.824260,-0.677874,-1.263475,-0.602603
3,8.504182,-5.714924,3.441693,-6.382810,0.024162,-0.783432,-1.423027,-0.542292,3.270654,1.897237,...,0.067131,0.538835,0.775452,0.496078,1.303613,0.380570,-0.690115,-0.914148,1.392469,0.723870
4,0.465720,-0.828101,0.954583,-0.239080,-0.735695,-0.877414,0.214720,-0.527368,0.079467,0.015616,...,0.137722,0.319802,0.060195,-0.110588,-0.380696,-0.132477,-0.318583,-0.143114,0.316005,-0.092256
5,6.071045,-2.673536,4.698195,-4.415807,-3.714221,-1.673627,1.297653,-1.308023,1.325734,1.229592,...,1.038862,0.160762,-0.500823,-0.260540,0.233009,-0.229721,-0.513940,-0.989256,0.677854,0.705489
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.039507,0.157924,-0.019030,0.009546,0.024836,-0.164105,0.182753,0.071142,0.009693,-0.046211,...,-0.283803,-0.296700,0.212703,-0.006597,0.286744,-0.194077,0.107057,0.076491,0.025577,0.118604
193583,0.034568,0.138184,-0.016651,0.008353,0.021732,-0.143591,0.159909,0.062250,0.008481,-0.040434,...,-0.248327,-0.259612,0.186115,-0.005772,0.250901,-0.169817,0.093675,0.066930,0.022380,0.103779
193585,0.034568,0.138184,-0.016651,0.008353,0.021732,-0.143591,0.159909,0.062250,0.008481,-0.040434,...,-0.248327,-0.259612,0.186115,-0.005772,0.250901,-0.169817,0.093675,0.066930,0.022380,0.103779
193587,0.034568,0.138184,-0.016651,0.008353,0.021732,-0.143591,0.159909,0.062250,0.008481,-0.040434,...,-0.248327,-0.259612,0.186115,-0.005772,0.250901,-0.169817,0.093675,0.066930,0.022380,0.103779


In [82]:
latent_df_2 = latent_df_2.merge(movies, left_index=True, right_on='movieId').set_index('title') \
        .drop(columns=['movieId','genres'])

In [83]:
latent_df_2.shape

(9724, 200)

### 🧪 Test your code

In [84]:
from nbresult import ChallengeResult

result = ChallengeResult('ratings', latent_shape=latent_df_2.shape)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/06-Unsupervised-Learning/03-Movie-Recommendation
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 1 item

tests/test_ratings.py::TestRatings::test_latent_shape PASSED             [100%]

💯 You can commit your code:

git add tests/ratings.pickle

git commit -m 'Completed ratings step'

git push origin master


## Apply cosine similarity in content and collaborative matrices

Next, we will use a similarity measure to find the top $N$ **most similar movies to "Toy Story"** based on the filtering methods we created. Cosine similarity is one such measure, available in scikit-learn as [`cosine_similarity`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).
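
When only one movie is of interest, the full movie-by-movie matrix is not required: `cosine_similarity` accepts two inputs and compares every row of the first with every row of the second. The sketch below uses a made-up latent matrix (`toy_latent`) to compare a single query movie against all the others.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical latent matrix: 4 movies described by 2 components.
toy_latent = pd.DataFrame(
    [[1.0, 0.0], [2.0, 0.5], [0.0, 3.0], [1.0, 1.0]],
    index=["Toy Story (1995)", "Toy Story 2 (1999)", "Heat (1995)", "Jumanji (1995)"],
)

# Double brackets keep the query as a 2D (1 x n_components) array.
query = toy_latent.loc[["Toy Story (1995)"]].to_numpy()

# Result has shape (1, n_movies); take the single row.
scores = cosine_similarity(query, toy_latent.to_numpy())[0]

toy_similarity = pd.Series(scores, index=toy_latent.index).sort_values(ascending=False)
print(toy_similarity)
```

Because cosine similarity depends only on the angle between vectors, a movie whose vector points in almost the same direction as the query ranks high even if its norm is different.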

__👉 Compute the cosine similarity of a sample movie ("Toy Story") to both content (metadata) and collaborative (ratings) latent matrices.__

In [85]:
from sklearn.metrics.pairwise import cosine_similarity



In [105]:
latent_df_2

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Toy Story (1995),37.649828,-6.380094,14.994617,0.286947,-1.887147,4.790050,8.807629,-2.198635,1.153305,1.387563,...,1.260928,1.627575,2.918042,0.124539,1.071617,0.750785,3.848728,2.762340,-0.850380,-1.076279
Jumanji (1995),20.596193,-0.477880,10.865916,-8.635795,-3.072411,-0.517149,4.359215,-3.821408,-3.394116,0.236472,...,-0.180896,0.421606,1.324547,-2.500562,-0.636296,0.844001,-0.824260,-0.677874,-1.263475,-0.602603
Grumpier Old Men (1995),8.504182,-5.714924,3.441693,-6.382810,0.024162,-0.783432,-1.423027,-0.542292,3.270654,1.897237,...,0.067131,0.538835,0.775452,0.496078,1.303613,0.380570,-0.690115,-0.914148,1.392469,0.723870
Waiting to Exhale (1995),0.465720,-0.828101,0.954583,-0.239080,-0.735695,-0.877414,0.214720,-0.527368,0.079467,0.015616,...,0.137722,0.319802,0.060195,-0.110588,-0.380696,-0.132477,-0.318583,-0.143114,0.316005,-0.092256
Father of the Bride Part II (1995),6.071045,-2.673536,4.698195,-4.415807,-3.714221,-1.673627,1.297653,-1.308023,1.325734,1.229592,...,1.038862,0.160762,-0.500823,-0.260540,0.233009,-0.229721,-0.513940,-0.989256,0.677854,0.705489
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Black Butler: Book of the Atlantic (2017),0.039507,0.157924,-0.019030,0.009546,0.024836,-0.164105,0.182753,0.071142,0.009693,-0.046211,...,-0.283803,-0.296700,0.212703,-0.006597,0.286744,-0.194077,0.107057,0.076491,0.025577,0.118604
No Game No Life: Zero (2017),0.034568,0.138184,-0.016651,0.008353,0.021732,-0.143591,0.159909,0.062250,0.008481,-0.040434,...,-0.248327,-0.259612,0.186115,-0.005772,0.250901,-0.169817,0.093675,0.066930,0.022380,0.103779
Flint (2017),0.034568,0.138184,-0.016651,0.008353,0.021732,-0.143591,0.159909,0.062250,0.008481,-0.040434,...,-0.248327,-0.259612,0.186115,-0.005772,0.250901,-0.169817,0.093675,0.066930,0.022380,0.103779
Bungo Stray Dogs: Dead Apple (2018),0.034568,0.138184,-0.016651,0.008353,0.021732,-0.143591,0.159909,0.062250,0.008481,-0.040434,...,-0.248327,-0.259612,0.186115,-0.005772,0.250901,-0.169817,0.093675,0.066930,0.022380,0.103779


In [104]:
pd.DataFrame(cosine_similarity(latent_df_2))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9714,9715,9716,9717,9718,9719,9720,9721,9722,9723
0,1.000000,0.464747,0.358228,0.017563,0.380384,0.409755,0.340778,0.170896,0.295199,0.435607,...,-0.004700,-0.004700,-0.004700,-0.004700,-0.004700,-0.004700,-0.004700,-0.004700,-0.004700,0.017610
1,0.464747,1.000000,0.357179,0.144258,0.343334,0.352884,0.316361,0.214307,0.070825,0.496689,...,-0.011056,-0.011056,-0.011056,-0.011056,-0.011056,-0.011056,-0.011056,-0.011056,-0.011056,0.007182
2,0.358228,0.357179,1.000000,0.184831,0.567682,0.324374,0.543521,0.373024,0.426274,0.309351,...,0.000221,0.000221,0.000221,0.000221,0.000221,0.000221,0.000221,0.000221,0.000221,-0.002756
3,0.017563,0.144258,0.184831,1.000000,0.390742,0.166072,0.385515,0.278864,0.077321,0.141521,...,-0.024790,-0.024790,-0.024790,-0.024790,-0.024790,-0.024790,-0.024790,-0.024790,-0.024790,0.012172
4,0.380384,0.343334,0.567682,0.390742,1.000000,0.345670,0.651622,0.384117,0.470799,0.276769,...,-0.012919,-0.012919,-0.012919,-0.012919,-0.012919,-0.012919,-0.012919,-0.012919,-0.012919,-0.006462
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9719,-0.004700,-0.011056,0.000221,-0.024790,-0.012919,-0.023883,-0.024108,0.003244,-0.002253,0.020714,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,-0.005823
9720,-0.004700,-0.011056,0.000221,-0.024790,-0.012919,-0.023883,-0.024108,0.003244,-0.002253,0.020714,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,-0.005823
9721,-0.004700,-0.011056,0.000221,-0.024790,-0.012919,-0.023883,-0.024108,0.003244,-0.002253,0.020714,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,-0.005823
9722,-0.004700,-0.011056,0.000221,-0.024790,-0.012919,-0.023883,-0.024108,0.003244,-0.002253,0.020714,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,-0.005823


We can also create a hybrid filter, which is an average measure of similarity from both content and collaborative filtering. 

`hybrid_similarity = (content_similarity + collaborative_similarity) / 2`
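
The formula above can be checked on two small, made-up similarity matrices. The `blend` helper and its `weight` parameter are hypothetical additions showing how the plain average generalises to a weighted one; the challenge itself only asks for the 50/50 average.

```python
import numpy as np

# Hypothetical 3x3 similarity matrices for the same 3 movies, in the same order.
content_similarity = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])
collaborative_similarity = np.array([
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.0],
    [0.5, 0.0, 1.0],
])

# Plain average, as in the formula above.
hybrid_similarity = (content_similarity + collaborative_similarity) / 2


def blend(content, collaborative, weight=0.5):
    """Weighted average; weight=0.5 reproduces the plain average."""
    return weight * content + (1 - weight) * collaborative


print(hybrid_similarity[0])
```

The average is only meaningful if row and column `i` refer to the same movie in both matrices, so the two latent matrices must list the movies in the same order.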

__Calculate the average measure of both content and collaborative.__

In [111]:
data_sim = (pd.DataFrame(cosine_similarity(latent_df), index = latent_df_2.index) + pd.DataFrame(cosine_similarity(latent_df_2), index = latent_df_2.index)) / 2

In [112]:
data_sim

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,9714,9715,9716,9717,9718,9719,9720,9721,9722,9723
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Toy Story (1995),1.000000,0.593850,0.335537,0.119186,0.412240,0.204459,0.327187,0.343177,0.147114,0.329118,...,0.235647,0.128436,0.125281,0.157933,-0.002522,0.296908,0.332468,-0.002554,0.128219,0.165121
Jumanji (1995),0.593850,1.000000,0.178359,0.071811,0.171842,0.176288,0.158173,0.341125,0.034850,0.349317,...,-0.006347,-0.007572,-0.005780,-0.008285,-0.005733,0.175691,0.197161,-0.005767,-0.007837,0.003031
Grumpier Old Men (1995),0.335537,0.178359,1.000000,0.445881,0.637712,0.162252,0.771749,0.186975,0.213719,0.154835,...,0.248490,0.000113,0.204029,0.000615,0.000141,0.158049,0.176878,0.000420,0.000336,0.248889
Waiting to Exhale (1995),0.119186,0.071811,0.445881,1.000000,0.445209,0.208162,0.545961,0.284127,0.215881,0.195909,...,0.163308,0.276588,0.420657,0.164915,0.164489,0.211339,0.237838,0.341464,0.132224,0.359853
Father of the Bride Part II (1995),0.412240,0.171842,0.637712,0.445209,1.000000,0.172400,0.680075,0.192265,0.234583,0.138989,...,0.345257,-0.007245,0.281523,-0.007894,-0.006891,0.215676,0.241918,-0.006821,-0.007616,0.349030
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Black Butler: Book of the Atlantic (2017),0.296908,0.175691,0.158049,0.211339,0.215676,0.211990,0.145379,0.131402,0.315860,0.234300,...,0.817595,0.758095,0.758342,0.816534,0.658382,1.000000,0.947194,0.658486,0.887483,0.313722
No Game No Life: Zero (2017),0.332468,0.197161,0.176878,0.237838,0.241918,0.113373,0.164157,0.146633,0.176248,0.135662,...,0.766952,0.788548,0.788886,0.853757,0.677097,0.947194,1.000000,0.677110,0.788717,0.351032
Flint (2017),-0.002554,-0.005767,0.000420,0.341464,-0.006821,0.164931,-0.012105,0.205700,0.248830,0.187165,...,0.500267,0.908591,0.908515,0.750138,0.750101,0.658486,0.677110,1.000000,0.704234,0.247128
Bungo Stray Dogs: Dead Apple (2018),0.128219,-0.007837,0.000336,0.132224,-0.007616,0.277001,-0.012476,0.168572,0.407444,0.299264,...,0.706289,0.833194,0.666974,0.908212,0.704389,0.887483,0.788717,0.704234,1.000000,0.201445


__Create a dataframe with the final similarities to Toy Story.__

In [114]:
ToyStory = data_sim[[0]]
ToyStory

Unnamed: 0_level_0,0
title,Unnamed: 1_level_1
Toy Story (1995),1.000000
Jumanji (1995),0.593850
Grumpier Old Men (1995),0.335537
Waiting to Exhale (1995),0.119186
Father of the Bride Part II (1995),0.412240
...,...
Black Butler: Book of the Atlantic (2017),0.296908
No Game No Life: Zero (2017),0.332468
Flint (2017),-0.002554
Bungo Stray Dogs: Dead Apple (2018),0.128219


__Sort your dataframe by most similar by collaborative similarity.__

In [117]:
ToyStory.sort_values(by=0,ascending=False)

Unnamed: 0_level_0,0
title,Unnamed: 1_level_1
Toy Story (1995),1.000000
Toy Story 2 (1999),0.791531
Shrek (2001),0.748547
"Monsters, Inc. (2001)",0.732445
Finding Nemo (2003),0.713528
...,...
Spiral (2018),-0.027974
Hommage à Zgougou (et salut à Sabine Mamou) (2002),-0.027974
Liquid Truth (2017),-0.028006
John From (2015),-0.028006


You should see Toy Story as the most similar movie of course (similarity of 1 for every column).

__You can sort on content as well as hybrid and see which one gives the best recommendation.__

In [None]:
# YOUR CODE HERE
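
One possible way to compare the three rankings is sketched below on made-up scores. The `toy_scores` dataframe, its column names and the `top_n` helper are assumptions for illustration; they are not the expected solution of the challenge.

```python
import pandas as pd

# Hypothetical similarities to "Toy Story (1995)" under each method.
toy_scores = pd.DataFrame(
    {
        "content": [1.0, 0.9, 0.2, 0.6],
        "collaborative": [1.0, 0.4, 0.8, 0.3],
    },
    index=["Toy Story (1995)", "Toy Story 2 (1999)", "Heat (1995)", "Jumanji (1995)"],
)
toy_scores["hybrid"] = (toy_scores["content"] + toy_scores["collaborative"]) / 2


def top_n(scores, by, n=2, query="Toy Story (1995)"):
    """Titles of the n most similar movies for one method, excluding the query."""
    ranked = scores.drop(index=query).sort_values(by=by, ascending=False)
    return ranked.head(n).index.tolist()


for method in ["content", "collaborative", "hybrid"]:
    print(method, top_n(toy_scores, by=method))
```

Dropping the query movie first avoids the trivial result where a movie is recommended as its own best match.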

**❓ Which similarity do you think is the best for building a realistic movie recommender system? Assign it to the `best_similarity` variable.**

In [102]:
a = 'content'
b = 'collaborative'
c = 'hybrid'
best_similarity = c # fill in with the right answer

### 🧪 Test your code

In [103]:
from nbresult import ChallengeResult

result = ChallengeResult('recommender', best_similarity=best_similarity)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/06-Unsupervised-Learning/03-Movie-Recommendation
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 1 item

tests/test_recommender.py::TestRecommender::test_best_similarity PASSED  [100%]

💯 You can commit your code:

git add tests/recommender.pickle

git commit -m 'Completed recommender step'

git push origin master


## 🏁 Good job!

You can commit and push your code on GitHub.