**Have you ever sat in front of your TV and spent hours trying to decide what to watch?**
<img src="https://mlvillage.blob.core.windows.net/unsupervised/confused.jpeg" alt="confused" width="200" align="right" style="display:inline-block;"/>

**Do you ever wonder how companies like Netflix, Amazon etc. know exactly what to recommend to you?**

If yes, then you're in the right place. In this lab, we will introduce some basic recommendation techniques. By the time you finish, you should be able to create your own recommendation system!!


### Authors of this lab
**Super volunteer**: Priyank Mathur (prmathur@adobe.com) <br>
**Awesome volunteers**: Ankit Aggarwal (aagarwa@adobe.com), Megha Rawat (mrawat@adobe.com), Pulkit Gera (pugera@adobe.com), Rahul Mittal (rahmitta@adobe.com), Sudheer Sana (sudheer@adobe.com), Tarun Vashisth (vashisth@adobe.com)


# Welcome to the level 2 lab for unsupervised machine learning

**Unsupervised learning** is the branch of machine learning that learns from data that has not been labeled, classified or categorized. While we build our recommendation system, we will take a high-level tour of some of the techniques that can be employed when you are working with untagged data.

### We will cover the following topics -
* Importing and basic data manipulation
* Exploratory analysis and visualization
* Distance metrics
* Clustering
* Dimensionality reduction
* Collaborative filtering
* Matrix factorization

We will be using the topics listed above to improve the recommendation system we build and understand the data better. At the end, we will compare all the recommendation methods that we've developed to see how they perform.

We will be using the following special-purpose Python libraries that are useful for data science -
* [pandas](https://pandas.pydata.org/) - provides performant and easy-to-use data structures & analysis tools
* [numpy](http://www.numpy.org/) - the fundamental package for working with arrays and matrices
* [matplotlib](https://matplotlib.org/) - provides 2D plotting capability
   * [mplot3d](https://matplotlib.org/mpl_toolkits/mplot3d/index.html) - adds 3D plotting capability
   * [seaborn](https://seaborn.pydata.org/) - provides a high-level interface for drawing attractive and informative statistical graphics
* [scikit-learn](https://scikit-learn.org) - tools for machine learning, data mining and data analysis

In [None]:
import sys
sys.path.insert(0, '/home/asruser/unsupervised-learning-lab/')

# import necessary libraries, after executing this cell
# you can click the button below to see
# details on what was imported
from utils.imports import *
from utils import widget_utils

# Set properties
sns.set(style='whitegrid')

# Give option to show what was imported
widget_utils.show_imports()

# Working with data

## Load data
The dataset has been obtained from https://grouplens.org/datasets/movielens/.

<cite>F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (2015).</cite> https://doi.org/10.1145/2827872

The Pandas library provides a very handy function called [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to load the data into dataframes.

In [None]:
data_root = '/home/asruser/unsupervised-learning-lab/data/ml-latest-small/'

# load data
movies_df = pd.read_csv(os.path.join(data_root, 'movies.csv'))
ratings_df = pd.read_csv(os.path.join(data_root, 'ratings.csv'))

print('Data load complete')

Here's a brief overview of what the data is comprised of - 

* movies.csv - contains movie information where each line of this file after the header row represents one movie, and has the following format: **movieId, title, genres**
* ratings.csv - each line of this file after the header row represents one rating of one movie by one user, and has the following format: **userId, movieId, rating, timestamp**

Let's see a preview of the information in each dataframe using the [head](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) function in pandas.

In [None]:
display(Markdown('#### Movies'))
display(movies_df.head())
display(Markdown('#### Ratings'))
display(ratings_df.head())

## Data preparation and cleanup

Data manipulation is the process of changing data to make it easier to read or work with. Real-life datasets are generally unsuitable for direct use in any kind of analysis or learning algorithm, so we need to transform them into an appropriate format. In fact, for many projects, you might end up spending up to 90% of your time on data cleanup. Many data science teams have groups that work just on managing data.

In our case, the dataset is already prepared for analysis for the most part. We will perform the following additional cleanups to make our analysis easier.

* **Expand genres** in the movies dataframe to split genres into separate rows using a combination of these pandas data manipulation operations - 
[stack](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.stack.html),
[reset_index](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html),
[merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html).

* **Convert** unix timestamps to years to see patterns over time.
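
As a side note, the genre expansion can also be sketched with `DataFrame.explode` (available in pandas 0.25 and later). This is an alternative to the stack/reset_index/merge combination used in this lab, not the lab's own method, and the two-row frame below is hypothetical data made up for illustration.

```python
import pandas as pd

# Hypothetical two-movie frame in the same layout as movies.csv
toy_movies = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Movie A', 'Movie B'],
    'genres': ['Adventure|Children', 'Adventure|Children|Fantasy'],
})

# Split the pipe-separated string into a list, then emit one row per list item
expanded = (
    toy_movies.assign(genre=toy_movies['genres'].str.split('|'))
    .explode('genre')[['movieId', 'genre', 'title']]
    .reset_index(drop=True)
)

print(expanded)
```

Each movie ends up with one row per genre, which is the same shape the lab's own cell produces.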

In [None]:
# For each movie, split the genre column into separate rows,
# then reset the index to movieId so we can merge
movieId_to_genres = pd.DataFrame(movies_df.genres.str.split('|').tolist(), 
                                 index=movies_df.movieId).stack().reset_index()[['movieId', 0]]

# Merge the dataframe above with the movies dataframe using how="left".
# This is essentially equal to a left join in SQL.
# Click on the button below to learn about left join.
movies_df_expanded = movieId_to_genres.merge(movies_df[['movieId', 'title']], how='left', 
                                             left_on='movieId', right_on='movieId')

# Assign column names in the new dataframe
movies_df_expanded.columns = ['movieId', 'genre', 'title']

# Create a lookup table for title->movieId, i.e. getting the movie id given the title of the movie
movies_idx = movies_df[['movieId', 'title']].set_index('title')['movieId']

widget_utils.show_join_description()

display(movies_df_expanded.head(10))

As you may have noticed, each movie row has been split into multiple rows based on how many genres the movie belongs to.

And now, let's extract the year from the timestamp field of the ratings dataframe using the [Series apply](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.apply.html) function provided by pandas.

***Optional*** Learn more about Python lambda functions [here](https://www.w3schools.com/python/python_lambda.asp).
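
For comparison, here is a minimal vectorized sketch of the same idea using `pd.to_datetime`. The timestamps are made-up values, and note one difference: this version interprets them as UTC, whereas `datetime.fromtimestamp` uses the machine's local timezone, so years can differ for ratings made close to midnight on New Year's Eve.

```python
import pandas as pd

# Hypothetical unix timestamps (seconds since 1970-01-01 UTC)
timestamps = pd.Series([0, 86400 * 365, 1_500_000_000])

# Convert the whole column at once and read off the year
years = pd.to_datetime(timestamps, unit='s').dt.year

print(years.tolist())
```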

In [None]:
# Parse the timestamp of each row and extract the year using datetime library
ratings_df.timestamp = ratings_df.timestamp.apply(lambda x: datetime.fromtimestamp(x).year)
ratings_df.head()

## Data exploration

Data exploration is the initial step in data analysis, where we explore a large data set in an unstructured way to uncover initial properties, characteristics, and patterns of interest. These can include the size of the data, its completeness and correctness, and possible relationships among data elements or files/tables. This allows us to make informed decisions about what our next steps might be. For example:

1. If some columns in the dataset have missing values, can we ignore the rows that are missing data, or should we try to find a way to fill them in?
2. If there are clear groups in the data, should the groups be modeled differently?

Let's start by performing exploratory analysis on our dataset.

#### What are the genres across all the movies?
Let's start by looking at all the genres that the movies in our dataset belong to.

In [None]:
# Get the unique values in the genre column
all_genres = movies_df_expanded.genre.unique().tolist()

all_genres

#### And how many movies are in each genre?
We can use the [grouping](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) functionality of pandas to create groups and perform aggregate calculations.

In [None]:
# Generate groups and compute the count of each group
genre_count = movies_df_expanded.groupby('genre').count()['movieId']

# At this point we have a dataframe with genre as the index and count as the value.
# Reset the index so genre itself becomes a column
genre_count = genre_count.reset_index()

# Sort by descending count
genre_count = genre_count.sort_values(by=['movieId'], ascending=False)

genre_count

It appears that Drama is a very common genre. To get a better understanding of how the movies are distributed across genres, we will create a barplot of the counts below.

In [None]:
plt.figure(figsize=(12, 5))
sns.barplot(data=genre_count, x='genre', y='movieId', palette=sns.color_palette('Blues'))
plt.xlabel('genre')
plt.ylabel('number of movies')
plt.xticks(rotation=45);

#### Popularity of genres over time
Now that we know which are the most common genres, can we also get some information about which are the most popular ones?

We can combine the movie rating information with the movie details (like genre) using the pandas [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) function, and this should yield a data frame containing all the data we need to identify popular genres over time.

In [None]:
# Perform inner join on the ratings and movies with genres data frame
genre_ratings = ratings_df.merge(movies_df_expanded, left_on='movieId', right_on='movieId', how='inner')

genre_ratings.head()

Before plotting this data, we can standardize the ratings so that they vary from 0 to 1 by using a technique called min-max scaling.
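
As a quick illustration of min-max scaling, each value $x$ is mapped to $(x - min) / (max - min)$. The ratings below are hypothetical values chosen only for this sketch.

```python
import numpy as np

# Hypothetical ratings, chosen only for illustration
ratings = np.array([0.5, 2.0, 3.5, 5.0])

lo, hi = ratings.min(), ratings.max()

# The smallest rating maps to 0, the largest to 1, the rest fall in between
scaled = (ratings - lo) / (hi - lo)

print(scaled)
```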

In [None]:
min_rating, max_rating = ratings_df.rating.min(), ratings_df.rating.max()

genre_ratings.rating = (genre_ratings.rating - min_rating) / (max_rating - min_rating)

By grouping the data by year and genre, we can compute the mean rating and its standard deviation for each genre for each year.

In [None]:
# Group by multiple columns to get groups for all combinations of year and genre
rating_groups = genre_ratings.groupby(['timestamp', 'genre'])['rating']

# For each group, compute the mean and standard deviation
ratings_per_genre_per_year_summary = rating_groups.agg(['mean', 'std']).reset_index()

ratings_per_genre_per_year_summary.head(10)

One of the easiest ways to see how the popularity of a genre changes over time is to create a line plot, where the y-axis shows the changes in the rating.

We will use the IPython widget feature of Jupyter notebooks to create an interactive plot that will help us see and compare genre popularity. Choose a genre from the list after the cell below and use "Show confidence" to toggle the confidence display. Use the shift or ctrl (cmd) key to select multiple genres.

In [None]:
widget_utils.show_genre_popularity_interaction(ratings_per_genre_per_year_summary, all_genres)

One interesting observation that can be made from the plots above is that almost all the genres were very highly rated in the year 2013.

**Optional exercise -** after finishing with this lab, can you figure out why this might be?

# Building a recommendation system

A recommendation system can be built in a number of ways - 
* Editorial and hand-curated lists can be prepared by experts in the field. For example, a food critic can prepare a list of the best restaurants in a city.
* Simple aggregate-based systems can recommend the most used or the most popular items to the user.
* Personalized recommender systems are tailored to individual users. These are the kinds of systems employed by companies like Amazon and Netflix to recommend items/movies to users.

## Finding similar movies

In this lab, we will try to build a system that, given a query movie, suggests similar movies using the information in the dataset above. 

As we explore and develop better methods, we will take the example of the movie ***Jumanji***.

As a reminder, this is what each row in the movies data frame looks like:

In [None]:
jumanji_movie_id = 2
display(movies_df[movies_df.title == 'Jumanji (1995)'])

Below is a general purpose function that takes a movie id as input, and based on the similarity function we want, returns most similar movies. The 'fast' in the name represents the fact that to compute the most similar movies, we [vectorize](https://towardsdatascience.com/data-science-with-python-turn-your-conditional-loops-to-numpy-vectors-9484ff9c622e) our operations to make them faster.

In [None]:
def get_similar_movies_fast(movie_id, data_frame, feature_columns, sim_function, number_of_recommendations=5):
    '''
    Returns movies similar to the movie with id movie_id.
    Vectorized implementation.

    Parameters
    ----------
    movie_id : int
        query movie id
    data_frame : pandas dataframe
        dataframe with information about all movies
    feature_columns: column names of features
    sim_function : function that returns similarity between query features 
                and all data points
    number_of_recommendations : int, default 5
        number of recommendations to return

    Returns
    -------
    dataframe : most similar movies
    '''
    # Get query movie features
    query_row = data_frame.loc[data_frame.movieId==movie_id]
    query_feats = query_row[feature_columns].values

    # Get features of all movies (including the query) in an m*n array
    # where m is the number of movies and n is the number of features
    data_feats = data_frame[feature_columns].values
   
    # Apply the similarity function to the two feature sets,
    # passing the query first as the sim_function description states
    similarities = sim_function(query_feats, data_feats)
    
    # Sort by similarity and return n most similar movies
    movie_id_similarity = zip(data_frame.movieId.values, similarities)
    movie_id_similarity = sorted(movie_id_similarity, key=operator.itemgetter(1), reverse=True)
    
    movie_id_similarity = pd.DataFrame(movie_id_similarity[:number_of_recommendations+1], 
                                       columns=['movieId', 'similarity'])
    
    movie_id_similarity = movie_id_similarity[movie_id_similarity.movieId!=movie_id]
    
    movie_id_similarity = movie_id_similarity.merge(data_frame, 
                                                    left_on='movieId', 
                                                    right_on='movieId')[:number_of_recommendations]
        
    return movie_id_similarity[['movieId', 'title', 'genres', 'similarity']]

## Collaborative filtering
Collaborative filtering is based entirely on past behavior and not on context. More specifically, it is based on the similarity in preferences, tastes and choices of users. It analyzes how similar the tastes of one user are to those of another and makes recommendations on that basis.

<img src="https://mlvillage.blob.core.windows.net/unsupervised/Collaborative_filtering.png" alt="Collaborative_filtering" width="400"/>
In the picture above the third person likes watermelons, and so do person one and two. We can also see that person one and two both like grapes. This makes us think that the third person also might like grapes and we can try and give a recommendation. This way, our recommendation might lead the third person to discover a completely new taste they like! 

In this lab we use a version of collaborative filtering called item-item filtering. The basic idea is that each user likes to watch similar kinds of movies. If you like movies such as The Lord of the Rings, Harry Potter and Avengers, it is likely that in the future you'd like to watch The Hobbit, The Dark Knight and other action/adventure movies.

This lets us use each movie's user ratings as its representation.

In [None]:
# Pivot the rating data frame to create an array of user->movie rating
user_movie_rating_df = ratings_df.pivot(index='movieId', columns='userId', values='rating').fillna(0.0)

# Join the above data frame with movies to add columns like title, genres etc.
user_movie_rating_df = movies_df.merge(user_movie_rating_df, right_index=True, left_on='movieId')

user_movie_rating_df.head()

In [None]:
# Extract just the rating columns
rating_feature_columns = list(range(1, 611))
user_movie_rating_data = user_movie_rating_df[rating_feature_columns]

user_movie_rating_data.head()

#### How to *measure* similarity?

For the above representation of a movie, we need to define a way to compute and measure the similarity between a pair of movies. It can be something as simple as the "number of common genres" that the two movies have.

However, simply counting common genres is often not sufficient. For example, a movie that belongs to many genres will appear **very** similar to almost all other movies because it shares genres with many of them. We need a similarity metric that somehow normalizes this effect.

For example:

If $movie_A$ belongs to Action, Adventure;

$movie_B$ belongs to Action, Adventure, Children; and

$movie_C$ belongs to Action, Adventure, Children, Comedy, Fantasy, Sci-Fi

Our metric should return higher similarity for movies $movie_A$ and $movie_B$ than the similarity for movies $movie_A$ and $movie_C$, i.e.

$$ sim(movie_A, movie_B) > sim(movie_A, movie_C) $$
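
One metric with this property is Jaccard similarity, the number of shared genres divided by the number of distinct genres across both movies. The lab does not use it later; the sketch below (with a hypothetical helper name, `jaccard_similarity`) only shows that such a normalized metric ranks the three example movies as desired.

```python
def jaccard_similarity(genres_a, genres_b):
    """Shared genres divided by all distinct genres of the two movies."""
    a, b = set(genres_a), set(genres_b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch: two empty sets are not similar
    return len(a & b) / len(a | b)

movie_a = ['Action', 'Adventure']
movie_b = ['Action', 'Adventure', 'Children']
movie_c = ['Action', 'Adventure', 'Children', 'Comedy', 'Fantasy', 'Sci-Fi']

sim_ab = jaccard_similarity(movie_a, movie_b)  # 2 shared out of 3 distinct
sim_ac = jaccard_similarity(movie_a, movie_c)  # 2 shared out of 6 distinct

print(sim_ab, sim_ac)
```

Both pairs share exactly two genres, yet the normalization makes $movie_B$ score higher than $movie_C$.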

In our case, we have real (continuous) features and not just boolean indicators, so we can implement a method for identifying similar movies using [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity). If A represents the ratings for $movie_A$ and B represents the ratings for $movie_B$, Cosine Similarity returns

$$ sim_{Cosine}(A, B) = \frac{A \cdot B}{\lvert\lvert A \rvert\rvert \times \lvert\lvert B \rvert \rvert} $$
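
To make the formula concrete, here is a small NumPy sketch. The helper name `cosine_sim` and the rating vectors are hypothetical; the lab itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_sim(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings given by the same four users to three movies
ratings_a = [5.0, 0.0, 3.0, 4.0]
ratings_b = [4.0, 0.0, 3.0, 5.0]   # rated by the same users as movie A
ratings_c = [0.0, 5.0, 0.0, 0.0]   # rated only by a user who skipped movie A

sim_ab = cosine_sim(ratings_a, ratings_b)
sim_ac = cosine_sim(ratings_a, ratings_c)

print(sim_ab, sim_ac)
```

Movies rated similarly by the same users score close to 1, while movies with no raters in common score 0.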

In [None]:
def get_similarity_cosine(query_feats, data_feats):
    '''
    Returns similarity computed using cosine similarity

    Parameters
    ----------
    query_feats : np.array
        features of the query (2-D array)
    data_feats : np.array
        features of all data points (2-D array)

    Returns
    -------
    np.array
        cosine similarity of query with each item in the data
    '''
    similarities = cosine_similarity(query_feats, data_feats)
    
    return similarities.flatten()

Let's use this similarity measure to get the most similar movies to Jumanji.

In [None]:
%%time

# compute and show similar movies
display(get_similar_movies_fast(jumanji_movie_id, user_movie_rating_df, rating_feature_columns, get_similarity_cosine))

Although the genres of these predictions do not always match those of the query, they do look like movies that are pretty similar to *Jumanji*.

#### But do we need to look at *all* the movies to find the most similar ones?
In a real world recommendation system, the number of options to select most similar movies from would be huge. As such, we cannot compare the query movie to each and every other movie in the dataset. In such a scenario, a common technique is to identify a subset of the options that are *approximately* very similar to the query and then compute the similarity on this reduced set. One way to accomplish this is **clustering**.

## Clustering

<img src="https://mlvillage.blob.core.windows.net/unsupervised/clustering.png" alt="drawing" width="400"/>

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters).

In this lab, we will use an algorithm called **K-Means**.


#### K-Means
K-Means is one of the most popular clustering algorithms. It stores k centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.

K-Means finds the best centroids by alternating between:
1. assigning data points to clusters based on the current centroids 
2. choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
<img src="https://mlvillage.blob.core.windows.net/unsupervised/kmeans-wiki.gif" alt="drawing" width="400"/>
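
The two alternating steps can be sketched in a few lines of NumPy. This is a simplified illustration on made-up 2-D points, not scikit-learn's implementation (which the lab uses); the function name `kmeans_sketch` and its parameters are hypothetical.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points picked at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move every centroid to the mean of the points assigned to it
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical data: two well separated groups of three points each
toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

labels, centroids = kmeans_sketch(toy_points, k=2)
print(labels)
print(centroids)
```

A fixed number of iterations keeps the sketch short; real implementations typically stop once the assignments no longer change.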

**Optional note -** K-Means starts from randomly chosen initial centroids, so it does not necessarily produce the same solution on every run. Fixing `random_state`, as the code below does, makes the result repeatable.

Before clustering any dataset, we need to decide how many clusters to segment the data into. One way to do this is **elbow analysis**: we cluster with an increasing number of clusters and record how well each configuration fits the data. Here we measure fit with *inertia*, the sum of squared distances from each point to its nearest centroid, where lower is better. If we plot inertia against the number of clusters, the first few clusters reduce it sharply (they explain a lot of variance), but at some point the marginal gain drops, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion".

In [None]:
inertias = []
num_clusters = range(2, 20, 3)

with tqdm.tqdm(total=len(num_clusters)) as p_bar:
    for k in num_clusters:
        # Perform k-means for k clusters
        kmeans_u_m = KMeans(n_clusters=k, random_state=0)
        kmeans_u_m.fit(user_movie_rating_data)
        
        # Compute and record how well we fit the data
        inertias.append(kmeans_u_m.inertia_)
        p_bar.update(1)
        p_bar.set_description('Completed for num_clusters {}'.format(k))

# Plot data fit as a function of number of clusters
plt.plot(num_clusters, inertias, marker='o')

We see that around $k=8$, the marginal gain from increasing the number of clusters goes down. Let us try to visualize what our clusters look like.

One of the most common ways to analyze data is to plot it and see how it is distributed across the feature space. This not only helps us visualize the clustering output, but also provides insight into how the data is structured. This is straightforward when we have 3 or fewer features, since we can plot them in 2D or 3D, but it becomes tricky with more than 3 dimensions. In these scenarios we employ a technique called **dimensionality reduction**.

Dimensionality reduction refers to the process of converting data with many dimensions into data with fewer dimensions while ensuring that it conveys similar information concisely. In addition to helping with visualizations, these techniques are typically used while solving machine learning problems to obtain better features.

In this lab, we will use an algorithm called **Principal Component Analysis** (PCA) to reduce the dimensionality of the data to 2 so we can see it on a 2-D plot. PCA tries to capture and retain the directions of greatest variance in the data so that we can represent much of the same information in fewer dimensions.

**Optional** To learn more about PCA, refer to [this](http://setosa.io/ev/principal-component-analysis/) article.
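
To see what "directions of greatest variance" means in practice, here is a minimal NumPy sketch of PCA on synthetic data. It assumes the standard formulation (center the data, then take the top right singular vectors); the lab itself uses scikit-learn's `PCA`, and the data below is randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 points in 3-D that vary mostly along a single direction
t = rng.normal(size=100)
data = np.column_stack([t, 2 * t, 0.1 * rng.normal(size=100)])

# PCA works on centered data
centered = data - data.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing variance
_, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)

# Keep the first two directions to get 2-D coordinates for plotting
projected = centered @ Vt[:2].T

# Fraction of the total variance captured by each direction
explained = singular_values ** 2 / np.sum(singular_values ** 2)

print(projected.shape)
print(explained)
```

When most of the variance sits in the first one or two directions, the 2-D projection loses little information.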

In [None]:
# Perform k-means using optimal number of clusters, which is 8
# Note that we are clustering the points in the high dimensional space
# The 2 dimensional projection is only for visualization in our case
kmeans_u_m = KMeans(n_clusters=8, random_state=0)
kmeans_u_m.fit(user_movie_rating_data)

# Perform PCA on ratings representation
pca_u_m = PCA(n_components=2)
u_m_xy = pca_u_m.fit_transform(user_movie_rating_data)
u_m_x, u_m_y = zip(*u_m_xy)

# Now that we have all the points projected into two dimensions, 
# we can easily visualize them with a scatter plot.
fig = plt.figure(figsize=(10, 5))
plt.scatter(u_m_x, u_m_y, c=kmeans_u_m.labels_.astype(float), cmap=plt.cm.Dark2, alpha=0.3);

Although often very close, the clusters occupy visibly different areas of space.

Now, let's look at a sample of movies in each cluster predicted by using user ratings.

In [None]:
user_movie_rating_df.loc[:, 'cluster'] = kmeans_u_m.labels_

widget_utils.show_cluster_interaction(user_movie_rating_df)

A curious thing to notice here is that in addition to clusters that have movies with similar content, there are clusters that contain movies around the same time period.

Having this information enables us to make fast and efficient recommendations in three steps:
1. Look up a movie the user likes
2. Find all movies in the same cluster as this movie
3. Compute the similarity of the movies in the cluster to the one the user likes

In [None]:
%%time

# find the subset of movies in same cluster as Jumanji
query_cluster = user_movie_rating_df[user_movie_rating_df.movieId==jumanji_movie_id]['cluster'].values[0]
reduced_user_movie_rating_df = user_movie_rating_df[user_movie_rating_df.cluster == query_cluster]

# compute and show similar movies
display(get_similar_movies_fast(jumanji_movie_id, reduced_user_movie_rating_df, rating_feature_columns, get_similarity_cosine))

You'll notice almost a 6x speed improvement by reducing the number of candidates that we have to compute the similarity with.

### Pros of collaborative filtering
* This method works for any kind of item and we do not need to handcraft profiles.
* Provides broader recommendations that can fall outside a user's usual tastes.

### Cons of collaborative filtering
* Suffers from cold start and sparsity problems: we cannot provide recommendations if there are not enough users or ratings available.
* We cannot recommend an item that has not been rated so far.
* It tends to recommend popular items.

## Recommendation using matrix factorization (latent factors)

When we use distance-based similarity approaches on raw data, we match using sparse low-level details that we **assume** represent the user’s preference. What we need is some technique to derive the actual tastes and preferences from the raw data, and one such method is low-rank matrix factorization. The goal of matrix factorization is to learn the latent preferences of users and the latent attributes of items from known ratings (learn features that describe the characteristics of ratings) and then predict the unknown ratings.

When we have a very sparse matrix with a lot of dimensions, by doing matrix factorization, we can restructure the user-item rating matrix into a low-rank structure where each column represents some property. In the example below, we hope that after performing matrix factorization on the rating matrix, we will be able to extract factors like *seriousness* and *target audience*.

<img src="https://mlvillage.blob.core.windows.net/unsupervised/factors.png" alt="factors" width="600"/>

Each user has their own biases and thought process when they rate a movie. Some users might be very liberal and give high ratings to almost all the movies, while others might be very hard to please and rate low. Hence, the first thing we need to do is to normalize the data by subtracting the average rating given by the user from all the ratings that this user has given.
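
As a tiny worked example of this normalization, consider two hypothetical users who have both rated the same three movies: one generous, one hard to please. The numbers are made up for illustration, and unlike the real data there are no missing ratings here.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies
ratings = np.array([[5.0, 4.0, 5.0],    # a generous user
                    [2.0, 1.0, 3.0]])   # a user who is hard to please

# Average rating per user, kept as a column so it broadcasts across movies
user_means = ratings.mean(axis=1, keepdims=True)

# After subtracting, positive means "above this user's average"
demeaned = ratings - user_means

print(demeaned)
```

After de-meaning, a value tells us how a user felt about a movie relative to their own habits, which makes ratings from different users easier to compare.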

In [None]:
# Extract the ratings
user_mov_ratings = user_movie_rating_df[rating_feature_columns].T

# Subtract mean user ratings for each user to normalize the data
user_rating_means = (user_mov_ratings.values).sum(axis=1) / (user_mov_ratings.values > 0).sum(axis=1)
user_mov_ratings_demean = user_mov_ratings.values - user_rating_means.reshape(-1, 1)

# Print the first row for the de-meaned ratings
user_mov_ratings_demean[0]

Let's see if there is any trend in how users rate movies on average.

In [None]:
# Plot the distribution of the average user movie rating computed above
sns.kdeplot(user_rating_means, shade=True)

As we can see above, most users give an average rating of around 3.5. Given that the rating scale runs from 0.5 to 5, we can say that most users are fairly generous when rating movies.

In this lab, we will use a method for matrix factorization called singular value decomposition (SVD). At a high level, SVD lets us find the best lower-rank (i.e. smaller/simpler) approximation of the original matrix A. Mathematically, it decomposes A into two unitary matrices and a diagonal matrix, $A = U \Sigma V^T$, where A is the input data matrix (ratings), U is the matrix of left singular vectors (user “features” matrix), $\Sigma$ is the diagonal matrix of singular values (strengths of each concept), and $V^T$ is the matrix of right singular vectors (movie “features” matrix).

<img src="https://mlvillage.blob.core.windows.net/unsupervised/svd.png" alt="svd" width="400"/>

Here, each column of the matrix U and each row of the matrix $V^T$ corresponds to one latent factor: an underlying property or characteristic of movies.
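
A small sketch of the decomposition and of a rank-1 approximation, using NumPy's dense `np.linalg.svd` rather than the truncated `svds` routine the lab uses. The 4 x 3 matrix is hypothetical, constructed so that a single factor explains most of it.

```python
import numpy as np

# Hypothetical de-meaned ratings: 4 users x 3 movies
A = np.array([[ 1.0,  0.5, -1.5],
              [ 0.8,  0.7, -1.5],
              [-1.0, -0.5,  1.5],
              [-0.9, -0.6,  1.5]])

# s holds the singular values (the diagonal of Sigma), largest first
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the strongest factor to build a rank-1 approximation of A
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Relative size of what the approximation misses
rel_error = np.linalg.norm(A - A_k) / np.linalg.norm(A)

print(U.shape, s.shape, Vt.shape)
print(rel_error)
```

Here the first two users and the last two users have opposite tastes, so one latent factor is enough to reconstruct the matrix with a small error.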

**Optional note -** You can find more about the relationship between PCA and SVD [here](https://intoli.com/blog/pca-and-svd/).

In [None]:
# Perform SVD to get the latent factors
U, sigma, Vt = svds(user_mov_ratings_demean, k = 50)

print(U.shape, sigma.shape, Vt.shape)

In [None]:
# Create a dataframe for movies with the new representation containing factors
movies_df_with_svd = pd.concat([user_movie_rating_df.reset_index()[['movieId', 'title', 'genres']], 
                                pd.DataFrame(Vt.T)], axis=1)
movie_feature_columns = list(range(Vt.shape[0]))

display(movies_df_with_svd.head())

Finally, we will use these new features that represent the high level features of a movie to identify other similar movies to ***Jumanji***.

In [None]:
%%time
display(get_similar_movies_fast(jumanji_movie_id, movies_df_with_svd, movie_feature_columns, get_similarity_cosine))

# Compare methods

Having built two very different types of recommendation systems, we will extract some popular movies and test the different methods from above.

In [None]:
'''
1. Get count of ratings per movie
2. Sort by most rated
3. Get 20 most rated movieIds
'''
popular_movie_ids = ratings_df.merge(movies_df, left_on='movieId', right_on='movieId')\
        .groupby('movieId')\
        .count()\
        .sort_values('rating')\
        .tail(20)\
        .index\
        .tolist()

# Get titles for popular movies
popular_movie_titles = movies_df[movies_df.movieId.isin(popular_movie_ids)]['title'].values.tolist()

popular_movie_titles

From the interactive widget below, select a query movie and a method for producing recommendations.

In [None]:
methods = ['item-item collaborative filtering', 'matrix factorization']

widget_utils.show_compare_interactive(sorted(popular_movie_titles),
                                      movies_idx,
                                      movies_df,
                                      methods,
                                      None,
                                      user_movie_rating_df,
                                      movies_df_with_svd)

## Thank you
for completing level 2 of the unsupervised learning lab. Make sure you also check out the other labs in the Machine Learning village.

## Congratulations!

You finished this lab. If you are participating in the TS2019 Game, **please contact a volunteer to get your points!**

If you have another 45 seconds, please help us by <a href="https://www.surveymonkey.com/r/SK6XVKG" target="_blank">filling out this survey</a>.

We hope you had fun with this lab! But, wait, there is more! Come and visit the ML village website. It has access to all our lab materials so you can run the notebooks on your own machine, whenever and wherever you want. We also assembled a list of useful resources. To find our homepage you can scan the QR code below, or go to:  [https://git.corp.adobe.com/pages/TechSummit2019MLVillage/](https://git.corp.adobe.com/pages/TechSummit2019MLVillage/)

![ML Village Home Page](https://mlvillage.blob.core.windows.net/imagerecognition/ML_village_QR_code.png "ML Village Home Page")