# Recommender Systems

Recommender systems suggest items to users based on their past preferences. They are generally classified into the following categories:

- **Content-based** filtering, which relies on the similarity between item attributes (e.g. keywords, categories) and the user's preferences. For instance, if a user liked Item A, the recommender system recommends items with similar characteristics. For movies, those characteristics would include producer, genre, etc. Typically, this can be done with a supervised binary classification algorithm.


- **Collaborative filtering**, which calculates similarity from interactions (e.g. ratings, number of purchases, likes). It finds customers with similar preferences, then recommends to one customer items that are new to that customer and preferred by other customers with a similar profile. The system assumes that users with similar movie-watching habits have similar preferences in general: it finds users who watch videos similar to those the user watches, then identifies videos those similar users have watched but the first user has not. There are three approaches: user-user collaborative filtering, item-item collaborative filtering and matrix factorization.


- **Hybrid systems**, a newer and usually more powerful approach, which combine the two methods above.
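The collaborative idea described in the list above can be sketched on a tiny ratings matrix. This is only an illustration under stated assumptions: the ratings are made up, `0` is taken to mean "not rated", and the `cosine` helper is a hypothetical name, not part of any library.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],   # user 0: the user we want recommendations for
    [5, 5, 4, 1],   # user 1: tastes close to user 0
    [1, 0, 2, 5],   # user 2: tastes far from user 0
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two rating vectors (hypothetical helper)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = 0
# Similarity of the target user to every other user.
similarities = {
    other: cosine(ratings[target], ratings[other])
    for other in range(len(ratings))
    if other != target
}
nearest = max(similarities, key=similarities.get)

# Items the nearest neighbour has rated but the target user has not.
unseen = np.where((ratings[target] == 0) & (ratings[nearest] > 0))[0]
print(nearest, unseen.tolist())
```

Here the recommendation comes from a single nearest neighbour; a real user-user filter would usually weight several neighbours.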

<img src="data/image.png" width=600>

The above diagram categorises in detail the different methodologies for building a recommender system.

In the context of the `movielens` dataset that will be used later as a sample dataset:
- We will suggest movies based on content similarity (e.g. genre, cast), applying item-content filtering.
- We could also compare user metadata, such as age and gender, and suggest items that similar users have liked. That would be user-content filtering. The `movielens` dataset doesn't contain any user content data, so instead of user-content filtering we will build only an item-item collaborative filter from the ratings.

**Memory-based** content filtering

In memory-based methods, no model learns from the data to make predictions. Instead, a pre-computed matrix of similarities is constructed and used for movie predictions.

## Data collection and cleaning

__Load the datasets in three different pandas dataframes `movies`, `tags` and `ratings`.__

In [0]:
# YOUR CODE HERE

__Remove the '|' that separates the different genres of movies and replace with an empty space.__
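As a hint, here is a minimal sketch of the replacement on a two-row stand-in dataframe. The name `movies_demo` and the second row are hypothetical; only the `genres` column name is taken from the dataset.

```python
import pandas as pd

# Hypothetical two-row stand-in for the `movies` dataframe.
movies_demo = pd.DataFrame({
    "title": ["Toy Story (1995)", "Movie B"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Action|Crime"],
})

# regex=False treats '|' as a literal character rather than regex alternation.
movies_demo["genres"] = movies_demo["genres"].str.replace("|", " ", regex=False)
print(movies_demo["genres"].tolist())
```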

In [0]:
# YOUR CODE HERE

__Filter the `movies` dataframe, to have only movies that have received a rating.__
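One possible way to do this is a membership test on `movieId`, sketched below on made-up data. The `_demo` dataframes and their values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the `movies` and `ratings` dataframes.
movies_demo = pd.DataFrame({"movieId": [1, 2, 3], "title": ["Movie A", "Movie B", "Movie C"]})
ratings_demo = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 3],
    "rating": [4.0, 5.0, 3.5],
})

# Keep only the movies whose id appears at least once in the ratings.
rated_ids = ratings_demo["movieId"].unique()
rated_movies = movies_demo[movies_demo["movieId"].isin(rated_ids)]
print(rated_movies)
```

Movie B has no rating in this toy data, so it is dropped.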

In [0]:
# YOUR CODE HERE

### 🧪 Test your code

In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('datasets', 
                         movies_shape=movies.shape, 
                         tags_shape=tags.shape, 
                         ratings_shape=ratings.shape, 
                         genres_cleaned=movies['genres'].str.find('|') >= 0)
result.write()
print(result.check())

## Feature engineering

We're gonna create a new feature called `metadata` that merges **all the text data information** we have about a movie: the genres and the tags.

__Merge the `movies` and `tags` dataframes.__

In [0]:
# YOUR CODE HERE

__Create a new dataframe `merged_df` with the `metadata` column, containing a string of both tags AND genres merged together.__

**First** merge the **tags** together per movie, and **then** merge the combined tags with the **genres** column.

For example the `metadata` column of the movie Toy Story will be `pixar pixar fun Adventure Animation Children Comedy Fantasy`.

👉 `merged_df` must contain at least the following columns: `movieId`, `title` and `metadata`.
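The two-step merge can be sketched as follows. This is a sketch on made-up data, assuming the tag text lives in a column named `tag` and that movies without tags should keep their genres only; the `_demo` names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins; genres are assumed to be already space-separated.
movies_demo = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Movie B"],
    "genres": ["Adventure Animation Children Comedy Fantasy", "Drama"],
})
tags_demo = pd.DataFrame({"movieId": [1, 1, 1], "tag": ["pixar", "pixar", "fun"]})

# Step 1: one row per movie, with all of its tags joined into a single string.
tags_per_movie = tags_demo.groupby("movieId")["tag"].agg(" ".join).reset_index()

# Step 2: a left merge keeps movies without tags; their missing tag becomes "".
merged_demo = movies_demo.merge(tags_per_movie, on="movieId", how="left")
merged_demo["tag"] = merged_demo["tag"].fillna("")

# Step 3: tags first, then genres; strip() removes the stray space of untagged movies.
merged_demo["metadata"] = (merged_demo["tag"] + " " + merged_demo["genres"]).str.strip()
print(merged_demo[["movieId", "title", "metadata"]])
```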

In [0]:
# YOUR CODE HERE

### 🧪 Test your code

In [0]:
from nbresult import ChallengeResult
import numpy as np

result = ChallengeResult('feature_engineering', 
                         unique_movies=np.all(merged_df[['movieId']].value_counts() > 1), 
                         metadata=merged_df[merged_df['title'] == 'Copycat (1995)'], 
                         merged_df_rows=merged_df.shape[0])
result.write()
print(result.check())

## Create a content latent matrix from metadata

### Count Vectorizer

Next we need to transform the metadata text to vectors, in order to feed it to our machine learning algorithms. Machine Learning models don't understand text data so we need to encode it.

We're gonna use the [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to encode the metadata column. 

Create a new dataframe `count_df` with the vectors you get from this count transformation. Every line is gonna be the frequency vector for this specific movie.

In [0]:
# YOUR CODE HERE

### Dimensionality reduction

Each movie metadata was transformed into a vector of length ~1675!

As we saw in previous lessons, we can apply dimensionality reduction methods to describe the data (movies) without much loss of information. Truncated singular value decomposition (SVD) is another advanced tool to reduce dimensions.

Contrary to PCA, this estimator **does not center the data before computing the singular value decomposition**. This means it can work with **sparse matrices efficiently**. In particular, truncated SVD works on term count/frequency matrices. In that context, it is known as latent semantic analysis (LSA).

You can look at the [`TruncatedSVD`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) class in scikit-learn; it follows the same principle as PCA.

__👉 Apply the truncated SVD with a reduced number of features (say 25) to reduce the dimension of your count matrix.__
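A minimal sketch of the API on a random sparse matrix, to show the shapes involved. The data, the matrix size and the choice of 5 components are assumptions for illustration, not the values to use in the exercise.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical sparse count matrix: 50 "movies" by 40 "terms", mostly zeros.
counts = csr_matrix(rng.poisson(0.3, size=(50, 40)).astype(float))

# TruncatedSVD accepts the sparse matrix directly, without densifying or centering it.
svd = TruncatedSVD(n_components=5, random_state=0)
latent = svd.fit_transform(counts)

print(latent.shape)                         # one row per movie, one column per component
print(svd.explained_variance_ratio_.shape)  # one ratio per component
```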

In [0]:
# YOUR CODE HERE

👉 **Plot the cumulative explained variance ratio as a function of the number of components.**
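The cumulative curve is the running sum of `explained_variance_ratio_`. Below is a sketch on random data, assuming matplotlib is available; the matrix and the 10 components are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical count matrix standing in for the real one.
counts = rng.poisson(0.3, size=(50, 40)).astype(float)

svd = TruncatedSVD(n_components=10, random_state=0).fit(counts)

# Running sum: entry k is the variance explained by the first k+1 components.
cumulative = np.cumsum(svd.explained_variance_ratio_)

fig, ax = plt.subplots()
ax.plot(np.arange(1, len(cumulative) + 1), cumulative, marker="o")
ax.set_xlabel("Number of components")
ax.set_ylabel("Cumulative explained variance ratio")
```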

In [0]:
# YOUR CODE HERE

We can see that with only the first 25 components (from initially about 1700) we can explain **more than 80%** of the variance, which suffices for our study purposes.

👉 **Save the 25 components of this matrix into a new `latent_df` dataframe, indexed by the titles from the movies.**

In [0]:
# YOUR CODE HERE

### 🧪 Test your code

In [0]:
from nbresult import ChallengeResult

result =  ChallengeResult('metadata', 
                          counter_shape=count_df.shape, 
                          latent_shape=latent_df.shape)
result.write()
print(result.check())

## Create a latent matrix from user ratings

Apart from the metadata, we have another valuable source of information: **the user ratings**.

A recommender system can recommend a movie that is similar on the basis of user ratings (item-item collaborative filtering).

👉 **We prepare the dataset below to have movies as rows, `userId`s as columns.**

In [0]:
# Merge
ratings1 = pd.merge(movies[['movieId']], ratings, on="movieId", how="right")
# Pivot
ratings2 = ratings1.pivot(index = 'movieId', columns ='userId', values = 'rating').fillna(0)
display(ratings2.head())
ratings2.shape

We just created a dataset with users' ratings as vectors of length 9724.

👉 **Once again, apply SVD to the `ratings2` dataframe, keeping only the first 200 components. Call the resulting dataframe `latent_df_2`.**

In [0]:
# YOUR CODE HERE

**👉 Reindex by the film title.**
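One possible approach is to map each `movieId` in the index to its title, sketched here on made-up data. The `_demo` names and values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins: a movies table and a latent matrix indexed by movieId.
movies_demo = pd.DataFrame({"movieId": [1, 2, 3], "title": ["Movie A", "Movie B", "Movie C"]})
latent_demo = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]],
    index=pd.Index([3, 1], name="movieId"),
)

# Lookup Series: movieId -> title.
id_to_title = movies_demo.set_index("movieId")["title"]

# Replace each movieId in the index with its title, keeping the row order.
latent_demo.index = latent_demo.index.map(id_to_title)
print(latent_demo)
```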

In [0]:
# YOUR CODE HERE

In [0]:
latent_df_2.shape

### 🧪 Test your code

In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('ratings', latent_shape=latent_df_2.shape)
result.write()
print(result.check())

## Apply cosine similarity in content and collaborative matrices

Next, we will use a similarity measure to find the top $N$ **most similar movies to "Toy Story"** based on the filtering methods we created. Cosine similarity is one such measure, available in scikit-learn as [`cosine_similarity`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).

__👉 Compute the cosine similarity of a sample movie ("Toy Story") to both content (metadata) and collaborative (ratings) latent matrices.__
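As a hint, `cosine_similarity` expects two 2-D inputs, so the query movie has to be passed as a one-row matrix. Here is a sketch on a made-up latent matrix; the names and values are hypothetical.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical latent matrix: one row per movie, indexed by title.
latent_demo = pd.DataFrame(
    [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]],
    index=["Movie A", "Movie B", "Movie C"],
)

# Double brackets keep the query as a 2-D frame of shape (1, n_features).
query = latent_demo.loc[["Movie A"]]

# Result has shape (n_movies, 1): similarity of every movie to the query.
scores = cosine_similarity(latent_demo, query)
similarity = pd.Series(scores.ravel(), index=latent_demo.index)
print(similarity)
```

Movie B points in the same direction as Movie A, so its similarity is 1 even though its vector is longer; Movie C is orthogonal and scores 0.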

In [0]:
# YOUR CODE HERE

We can also create a hybrid filter, which is an average measure of similarity from both content and collaborative filtering. 

`hybrid_similarity = (content_similarity + collaborative_similarity) / 2`
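The averaging relies on both similarity vectors sharing the same index, so that values line up movie by movie. A sketch with made-up scores, where the movie names and numbers are hypothetical:

```python
import pandas as pd

titles = ["Movie A", "Movie B", "Movie C"]
# Hypothetical similarity scores of each movie to the query movie (Movie A).
content = pd.Series([1.0, 0.2, 0.8], index=titles)
collaborative = pd.Series([1.0, 0.9, 0.1], index=titles)

similarity_demo = pd.DataFrame({
    "content": content,
    "collaborative": collaborative,
    "hybrid": (content + collaborative) / 2,  # element-wise average, aligned on title
})

# Most similar movies first, according to the hybrid score.
ranked = similarity_demo.sort_values("hybrid", ascending=False)
print(ranked)
```

Sorting on `"content"` or `"collaborative"` instead gives the ranking of each individual filter.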

__Calculate the average measure of both content and collaborative.__

In [0]:
# YOUR CODE HERE

__Create a dataframe with the final similarities to Toy Story.__

In [0]:
# YOUR CODE HERE

__Sort your dataframe by collaborative similarity, most similar first.__

In [0]:
# YOUR CODE HERE

You should see Toy Story as the most similar movie of course (similarity of 1 for every column).

__You can sort on content as well as hybrid and see which one gives the best recommendation.__

In [0]:
# YOUR CODE HERE

**❓ Which similarity do you think is best for building a realistic movie recommender system? Assign it to the `best_similarity` variable.**

In [0]:
a = 'content'
b = 'collaborative'
c = 'hybrid'
best_similarity = c # fill in with the right answer

### 🧪 Test your code

In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('recommender', best_similarity=best_similarity)
result.write()
print(result.check())

## 🏁 Good job!

You can commit and push your code on GitHub.