Boilerplate code below.

In [1]:
import pandas as pd
import numpy as np
from typing import List, Dict
from IPython.display import display, HTML, Markdown

import warnings
warnings.filterwarnings('ignore')


def display_best_and_worse_recommendations(recommendations: pd.DataFrame):
    # Sort into a new DataFrame so the caller's data is left untouched
    ranked = recommendations.sort_values('Estimated Prediction', ascending=False)

    # .copy() avoids renaming columns on a slice of `ranked`
    top_recommendations = ranked.iloc[:10].copy()
    top_recommendations.columns = ['Prediction (sorted by best)', 'Movie Title']

    worse_recommendations = ranked.iloc[-10:].copy()
    worse_recommendations.columns = ['Prediction (sorted by worse)', 'Movie Title']

    display(HTML("<h1>Recommendations your user will love</h1>"))
    display(top_recommendations)

    display(HTML("<h1>Recommendations your user will hate</h1>"))
    display(worse_recommendations)
    

def load_movies_dataset() -> pd.DataFrame:
    movie_data_columns = [
    'movie_id', 'title', 'release_date', 'video_release_date', 'url',
    'unknown', 'Action', 'Adventure', 'Animation', "Children's",
    'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
    'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
    'War', 'Western'
    ]

    movie_data = pd.read_csv(
        'datasets/ml-100k/u.item', 
        sep = '|', 
        encoding = "ISO-8859-1", 
        header = None, 
        names = movie_data_columns,
        index_col = 'movie_id'
    )
    movie_data['release_date'] = pd.to_datetime(movie_data['release_date'])
    return movie_data

def load_ratings() -> pd.DataFrame:
    ratings_data = pd.read_csv(
        'datasets/ml-100k/u.data',
        sep = '\t',
        encoding = "ISO-8859-1",
        header = None,
        names=['user_id', 'movie_id', 'rating', 'timestamp']
    )
    return ratings_data

# A practical guide to Singular Value Decomposition in Python

Recommender systems have become increasingly popular in recent years, and are used by some of the largest websites in the world to predict the likelihood of a user taking an action on an item. In the world of Netflix, this means recommending similar movies to the ones you have seen. In the world of dating, this means suggesting matches similar to people you already showed interest in!

My path to recommenders has been an unusual one: from Software Engineer to working on matching algorithms at a dating company, with only a little background in machine learning. With my knowledge of Python and some basic SVD (Singular Value Decomposition) frameworks, I was able to understand SVDs from the practical standpoint of what you can do with them, rather than from the underlying science.

In my talk, you will learn 2 practical ways of generating recommendations using SVDs: matrix factorization and item similarity. We will learn the high-level components of SVD the "doer way": by implementing a simple movie recommendation engine with the help of Jupyter notebooks, the MovieLens dataset, and the Surprise recommendation package.

## Table of contents

 - Downloading and exploring the MovieLens dataset
 - Training an SVD model using Surprise
 - Using the predict() API inside of Surprise
 - Recommendations via Matrix Factorization: Performing predict() manually
 - Recommendations via Product-based CF: Finding similarity between vectors

# MovieLens dataset

This dataset contains all the movies and their metadata

`movie_id` 1 is **Toy Story**

<p><img src="https://static1.squarespace.com/static/51cdafc4e4b09eb676a64e68/t/579282fabebafbb6c366252c/1469219594863/" alt="Drawing" style="width: 200px; float: left"/></p>

In [3]:
movie_data = load_movies_dataset()
ratings_data = load_ratings()
ratings_data.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
5,298,474,4,884182806
6,115,265,2,881171488
7,253,465,5,891628467
8,305,451,3,886324817
9,6,86,3,883603013


# Running our interactions through Surprise SVD

Let's take the **interactions** between the Users and Movies, and generate **latent features**  

In [4]:
from surprise import SVD, NMF, accuracy
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate, train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_data[['user_id', 'movie_id', 'rating']], reader)
trainset, testset = train_test_split(data, test_size=.25)

# Let's train a new SVD with 100 Latent features
model = SVD(n_factors=100, biased=False)
model.fit(trainset)

# Evaluate on the held-out test split: the lower the RMSE,
# the closer our predicted ratings are to the real ones
predictions = model.test(testset)
accuracy.rmse(predictions)

RMSE: 0.9604


0.9604071344125338
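
To make "latent features" more concrete, here is a minimal sketch of the idea behind this kind of model: learn one small vector per user and one per movie so that their dot product approximates the known ratings. The toy ratings, the variable names, and the hyperparameters are all made up for illustration, and this is not Surprise's actual implementation, just plain stochastic gradient descent on the same kind of objective.

```python
import numpy as np

# Toy interactions: (user index, item index, rating). Made-up values.
ratings = [
    (0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 5.0), (2, 2, 2.0), (3, 0, 1.0), (3, 2, 5.0),
]
n_users, n_items, n_factors = 4, 3, 2

rng = np.random.default_rng(0)
pu = rng.normal(0.0, 0.1, size=(n_users, n_factors))  # user latent features
qi = rng.normal(0.0, 0.1, size=(n_items, n_factors))  # item latent features


def rmse(pu, qi):
    """Root mean squared error of the dot-product predictions."""
    errors = [r - pu[u] @ qi[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))


rmse_before = rmse(pu, qi)

learning_rate, regularization = 0.05, 0.02  # illustrative values
for _ in range(500):
    for u, i, r in ratings:
        error = r - pu[u] @ qi[i]
        pu_old = pu[u].copy()  # update both vectors from the same state
        pu[u] += learning_rate * (error * qi[i] - regularization * pu[u])
        qi[i] += learning_rate * (error * pu_old - regularization * qi[i])

rmse_after = rmse(pu, qi)
print(f"RMSE before training: {rmse_before:.3f}, after: {rmse_after:.3f}")
```

Each row of `qi` plays the same role as a row of the product matrix we inspect next: a learned description of one item.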

# Inspecting our Product Matrix

Surprise SVD stores the product matrix under the `model.qi` attribute. Let's take a look

In [5]:
pd.DataFrame(model.qi).head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-0.061063,0.059416,0.116937,0.200122,-0.047376,-0.187288,-0.061984,-0.094023,0.171242,-0.264416,...,0.082201,0.138748,0.113034,-0.280768,-0.29719,-0.540336,0.323032,-1.005453,-0.20302,-0.237543
1,-0.304669,-0.215347,-0.238126,-0.00946,-0.241827,0.347439,0.157658,-0.13134,0.268331,-0.465403,...,-0.240596,0.176555,-0.120756,-0.159796,-0.539848,-0.398833,0.213463,-0.570909,-0.22503,-0.220835
2,-0.234562,0.16972,0.108425,-0.418974,-0.058458,0.043597,0.12741,0.169643,0.271624,-0.037436,...,0.050831,0.056686,-0.039044,-0.003882,-0.309512,-0.365279,0.123678,-0.446317,-0.123744,-0.048933
3,-0.385466,0.409892,-0.018661,0.228947,-0.128392,0.32658,-0.193503,-0.138809,0.272231,-0.36362,...,0.256737,0.221133,0.349778,-0.01105,-0.359474,-0.534094,-0.076799,-0.510362,0.079436,-0.066026
4,-0.324121,0.072924,-0.196967,0.113852,-0.115416,-0.080928,0.038039,-0.098371,0.031335,-0.083513,...,-0.002417,0.124899,0.317894,-0.155399,-0.540723,0.009428,-0.048847,-0.654256,-0.090699,0.196248
5,-0.507885,0.270702,0.046607,0.03292,-0.297462,0.160062,-0.193173,-0.032687,0.170563,-0.309114,...,0.199833,0.293225,0.298177,0.090809,-0.554952,-0.374542,0.109276,-0.8345,-0.038704,0.034348
6,-0.250259,0.263547,0.216782,0.021987,-0.393329,0.036197,-0.031359,0.019506,0.299731,-0.174963,...,-0.058979,0.159845,-0.001584,-0.018436,-0.328463,-0.197771,-0.032877,-0.705507,-0.320911,-0.204819
7,-0.184504,0.1184,0.077801,-0.017985,-0.153894,0.095814,0.254059,-0.037066,0.269633,-0.233825,...,-0.076749,0.286258,0.04493,-0.208489,-0.340737,-0.407629,-0.077942,-0.749707,-0.227567,0.384376
8,-0.498732,0.323445,-0.061876,0.179025,-0.056949,0.054613,0.024386,-0.117682,0.558597,-0.278122,...,0.136047,-0.094046,-0.086619,0.054432,-0.382159,-0.321067,-0.228332,-0.696384,-0.182559,0.121261
9,-0.220433,-0.090828,0.107689,0.145995,0.27798,-0.457632,0.268899,-0.166795,0.180009,0.009401,...,-0.338301,0.029704,-0.01869,-0.093245,-0.334795,-0.486999,-0.270495,-0.762408,-0.182289,-0.044484


# Exploring the product matrix

The matrix has `n_factors` columns (we chose 100). Every row represents a movie.

In [99]:
print(f"The shape of our product matrix is {model.qi.shape}.")
print(f"There are {ratings_data['movie_id'].unique().shape[0]} unique movies")

The shape of our product matrix is (1643, 100).
There are 1682 unique movies


About 2% of the movies (39 of 1,682) are not present. The model only learns a vector for movies that appear in the training set, and our 75/25 split left these movies with all of their ratings in the test set.
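
A small sketch of why a random split can lose items entirely. The data here is hypothetical (200 items, of which 50 have a single rating), not MovieLens, and the split is a plain shuffle rather than Surprise's `train_test_split`.

```python
import random

random.seed(0)

# Hypothetical long-tail data: items 0-149 have 20 ratings each,
# items 150-199 have a single rating each.
ratings = []
for item_id in range(200):
    n_ratings = 20 if item_id < 150 else 1
    for _ in range(n_ratings):
        ratings.append((random.randrange(50), item_id))  # (user_id, item_id)

# Hold out 25% of the ratings at random
random.shuffle(ratings)
n_test = int(len(ratings) * 0.25)
test, train = ratings[:n_test], ratings[n_test:]

all_items = {item for _, item in ratings}
train_items = {item for _, item in train}
missing = all_items - train_items

print(f"{len(missing)} of {len(all_items)} items have no rating in the training split")
```

Items with a single rating disappear from training whenever that one rating lands in the test split, so the model never learns a vector for them.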

# Generating predictions with simplicity

Before looking into the latent features of our movies, let's use the prediction API provided by Surprise:

 - `model.predict` computes the rating prediction for given user and movie
 
Let's look at how we can use this API to generate movies that a given user may like

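```python
>>> model.predict(302, 1)
Prediction(uid=302, iid=1, r_ui=None, est=3.5327866666666665, details={'was_impossible': False})
```

NOTE: User ID and Movie ID must have the same type as in the data the model was trained on. They are **strings** when Surprise reads ratings from a file, but **integers** here because we loaded them from a DataFrame.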

In [116]:
model.predict(196, 1).est

3.4258653536152694

In [6]:
movie_id_to_title_map: Dict[int, str] = dict(movie_data['title'])
# {1: 'Toy Story (1995)',
#  2: 'GoldenEye (1995)',
#  3: 'Four Rooms (1995)'}

def generate_recommended_movies_for_user(user_id: int) -> pd.DataFrame:
    """Return a DataFrame containing recommendations for the user, and the
    associated score
    """
    results = []
    for movie_id, movie_title in movie_id_to_title_map.items():
        
        # For each movie, calculate score prediction 
        prediction = model.predict(user_id, movie_id)
        results.append((prediction.est, movie_title))
       
    return pd.DataFrame(results, columns=['Estimated Prediction', 'Movie Title'])


# Let's generate some recommendations for a user
recommendations = generate_recommended_movies_for_user(302)
display_best_and_worse_recommendations(recommendations)

Unnamed: 0,Prediction (sorted by best),Movie Title
1681,3.53256,Scream of Stone (Schrei aus Stein) (1991)
1524,3.53256,"Object of My Affection, The (1998)"
1452,3.53256,Angel on My Shoulder (1946)
1630,3.53256,"Slingshot, The (1993)"
1620,3.53256,Butterfly Kiss (1995)
1456,3.53256,Love Is All There Is (1996)
1615,3.53256,Desert Winds (1995)
1459,3.53256,Sleepover (1995)
1606,3.53256,Hurricane Streets (1998)
1603,3.53256,He Walked by Night (1948)


Unnamed: 0,Prediction (sorted by worse),Movie Title
1447,1.0,My Favorite Season (1993)
438,1.0,Amityville: A New Generation (1993)
1429,1.0,Ill Gotten Gains (1997)
1441,1.0,"Scarlet Letter, The (1995)"
1437,1.0,Panther (1995)
1435,1.0,Mr. Jones (1993)
787,1.0,Relative Fear (1994)
1432,1.0,Men of Means (1998)
1430,1.0,Legal Deceit (1997)
1309,1.0,"Walk in the Sun, A (1945)"


# Predict, under the hood

So far we have seen how the `predict()` API works on the surface. But how does it **really** work inside of Surprise? It's, surprisingly, simple! (Get the pun?)

But before we go there, let's go back to our Feature Vectors

![Latent Features](https://cdn-images-1.medium.com/max/1600/0*_gKhyxIC3wup0cCE.jpg)
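
Since we trained with `biased=False`, a prediction boils down to the dot product of a user vector and a movie vector, clipped to the rating scale. Here is a minimal sketch of that idea with made-up factor matrices. The function name `predict_manually` and the "fall back to the global mean" branch are my own simplification of what the library does when a user or movie is unknown. With the real model, the vectors would come from `model.pu` and `model.qi`, after converting raw IDs to row indexes.

```python
import numpy as np


def predict_manually(pu, qi, user_index, item_index, global_mean,
                     rating_scale=(1.0, 5.0)):
    """Estimate a rating as the dot product of two latent vectors."""
    known_user = user_index is not None and 0 <= user_index < len(pu)
    known_item = item_index is not None and 0 <= item_index < len(qi)
    if not (known_user and known_item):
        # No latent vector to work with: fall back to the average rating
        return float(global_mean)

    estimate = float(np.dot(qi[item_index], pu[user_index]))

    # Keep the estimate inside the rating scale
    low, high = rating_scale
    return min(high, max(low, estimate))


# Made-up latent features: 2 users, 3 movies, 2 factors
pu = np.array([[1.0, 2.0], [0.5, 0.1]])
qi = np.array([[1.5, 1.0], [3.0, 2.0], [0.1, 0.1]])

print(predict_manually(pu, qi, 0, 0, global_mean=3.53))     # plain dot product
print(predict_manually(pu, qi, 0, 1, global_mean=3.53))     # clipped to the top of the scale
print(predict_manually(pu, qi, 1, 2, global_mean=3.53))     # clipped to the bottom of the scale
print(predict_manually(pu, qi, 0, None, global_mean=3.53))  # unknown movie
```

This would be consistent with the recommendation tables above: many movies share exactly the same 3.53256 estimate, and the "worst" ones all sit at exactly 1.0.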

## Looking at the Movie matrix (vT)

Let's take a look at the latent features for every movie. Product features can be found in the `qi` attribute.
 - create a DataFrame that maps product matrix row index to movie
 - join the newly created dataframe with the movie dataset
 - join the newly created dataframe with the latent features
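
Rows of the product matrix are addressed by an internal index, not by `movie_id`, which is why we need a mapping at all. A minimal sketch of such a mapping, assuming internal indexes are handed out in order of first appearance in the training ratings (an assumption about the library's internals, and the sequence of IDs below is invented). Surprise also exposes `trainset.to_inner_iid()` and `trainset.to_raw_iid()` for converting one ID at a time.

```python
# Hypothetical stream of movie IDs, in the order ratings are seen
raw_item_ids = [313, 181, 746, 313, 82, 181]

raw_to_inner = {}
for raw_id in raw_item_ids:
    if raw_id not in raw_to_inner:
        # The next free row index goes to each newly seen movie
        raw_to_inner[raw_id] = len(raw_to_inner)

# Reverse lookup: product matrix row index -> movie_id
inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}

print(raw_to_inner)
print(inner_to_raw[3])
```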

In [7]:
# Create a DataFrame that maps product matrix row index to movie
movie_to_product_matrix = pd.DataFrame(
    list(model.trainset._raw2inner_id_items.items()),
    columns=['movie_id', 'vT_index'],
    dtype=int,
).set_index('movie_id', drop=False)

# Join the newly created dataframe with the movie dataset
mapping_matrix_with_title = movie_to_product_matrix.join(movie_data['title'])

# Create a dataframe containing latent features, and join it to the remaining dataset
latent_features = pd.DataFrame(model.qi, columns=[f"Latent Feature {k}" for k in range(1, 101)])
mapping_matrix_with_title_and_features = mapping_matrix_with_title.set_index('vT_index').join(latent_features)

mapping_matrix_with_title_and_features.head(10)

vT_index,movie_id,title,Latent Feature 1,Latent Feature 2,Latent Feature 3,Latent Feature 4,Latent Feature 5,Latent Feature 6,Latent Feature 7,Latent Feature 8,...,Latent Feature 91,Latent Feature 92,Latent Feature 93,Latent Feature 94,Latent Feature 95,Latent Feature 96,Latent Feature 97,Latent Feature 98,Latent Feature 99,Latent Feature 100
0,313,Titanic (1997),-0.061063,0.059416,0.116937,0.200122,-0.047376,-0.187288,-0.061984,-0.094023,...,0.082201,0.138748,0.113034,-0.280768,-0.29719,-0.540336,0.323032,-1.005453,-0.20302,-0.237543
1,181,Return of the Jedi (1983),-0.304669,-0.215347,-0.238126,-0.00946,-0.241827,0.347439,0.157658,-0.13134,...,-0.240596,0.176555,-0.120756,-0.159796,-0.539848,-0.398833,0.213463,-0.570909,-0.22503,-0.220835
2,746,Real Genius (1985),-0.234562,0.16972,0.108425,-0.418974,-0.058458,0.043597,0.12741,0.169643,...,0.050831,0.056686,-0.039044,-0.003882,-0.309512,-0.365279,0.123678,-0.446317,-0.123744,-0.048933
3,82,Jurassic Park (1993),-0.385466,0.409892,-0.018661,0.228947,-0.128392,0.32658,-0.193503,-0.138809,...,0.256737,0.221133,0.349778,-0.01105,-0.359474,-0.534094,-0.076799,-0.510362,0.079436,-0.066026
4,258,Contact (1997),-0.324121,0.072924,-0.196967,0.113852,-0.115416,-0.080928,0.038039,-0.098371,...,-0.002417,0.124899,0.317894,-0.155399,-0.540723,0.009428,-0.048847,-0.654256,-0.090699,0.196248
5,195,"Terminator, The (1984)",-0.507885,0.270702,0.046607,0.03292,-0.297462,0.160062,-0.193173,-0.032687,...,0.199833,0.293225,0.298177,0.090809,-0.554952,-0.374542,0.109276,-0.8345,-0.038704,0.034348
6,1,Toy Story (1995),-0.250259,0.263547,0.216782,0.021987,-0.393329,0.036197,-0.031359,0.019506,...,-0.058979,0.159845,-0.001584,-0.018436,-0.328463,-0.197771,-0.032877,-0.705507,-0.320911,-0.204819
7,183,Alien (1979),-0.184504,0.1184,0.077801,-0.017985,-0.153894,0.095814,0.254059,-0.037066,...,-0.076749,0.286258,0.04493,-0.208489,-0.340737,-0.407629,-0.077942,-0.749707,-0.227567,0.384376
8,168,Monty Python and the Holy Grail (1974),-0.498732,0.323445,-0.061876,0.179025,-0.056949,0.054613,0.024386,-0.117682,...,0.136047,-0.094046,-0.086619,0.054432,-0.382159,-0.321067,-0.228332,-0.696384,-0.182559,0.121261
9,269,"Full Monty, The (1997)",-0.220433,-0.090828,0.107689,0.145995,0.27798,-0.457632,0.268899,-0.166795,...,-0.338301,0.029704,-0.01869,-0.093245,-0.334795,-0.486999,-0.270495,-0.762408,-0.182289,-0.044484


These are **learned features**. We cannot attribute them to anything specific, but they usually have some real-world correlation

In [8]:
from scipy.spatial.distance import cosine


def compute_similarity(movie_a: str, movie_b: str) -> float:
    try:
        # .as_matrix() was removed in pandas 1.0; .to_numpy() replaces it
        movie_a_vectors: np.ndarray = mapping_matrix_with_title_and_features[
            mapping_matrix_with_title_and_features['title'] == movie_a
        ].iloc[0, 2:].to_numpy(dtype=float)
        movie_b_vectors: np.ndarray = mapping_matrix_with_title_and_features[
            mapping_matrix_with_title_and_features['title'] == movie_b
        ].iloc[0, 2:].to_numpy(dtype=float)
    except IndexError:
        # Movies with no rating in the training set have no latent vector,
        # so there is nothing to compare. We signal that with -1
        return -1
    
    return 1 - cosine(movie_a_vectors, movie_b_vectors)


# compute_similarity('Evita (1996)', 'Evita (1996)')
compute_similarity('Toy Story (1995)', 'Evita (1996)')
# compute_similarity('They Made Me a Criminal (1939)', 'Toy Story (1995)')

0.45755752814307704

In [103]:
def generate_similar_movies_for_movie(movie_title: str) -> pd.DataFrame:
    # .copy() so that adding a column does not write into a slice of movie_data
    all_movies = movie_data[['title']].copy()
    all_movies['similarity'] = all_movies['title'].map(lambda title: compute_similarity(title, movie_title))
    return all_movies

# Find similar movies using Cosine Similarity

Usually, there isn't a straightforward way to pinpoint what a latent feature is a strong indicator of. Even though we don't know exactly what these features correlate to, we can still compare vectors with each other, because the latent feature at the same index of every vector relates to the same attribute.

To find how similar 2 movies are, all we need to do is compare their vectors.
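
Cosine similarity compares the direction of two vectors and ignores their length: 1 means same direction, 0 means unrelated, -1 means opposite. Here is a minimal sketch with made-up vectors. The helper names `cosine_similarity` and `most_similar_rows` are mine, and the second one shows how a whole matrix could be compared in one go instead of pair by pair.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def most_similar_rows(matrix, row_index, top_n=3):
    """Rank every row of `matrix` by cosine similarity to one row."""
    # After scaling rows to unit length, a dot product is a cosine similarity
    normalized = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    similarities = normalized @ normalized[row_index]
    ranked = np.argsort(-similarities)  # highest similarity first
    return [(int(i), float(similarities[i])) for i in ranked[:top_n]]


# Made-up latent features for 4 items
item_factors = np.array([
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],     # same direction as row 0, just longer
    [0.0, 1.0, 0.0],     # orthogonal to row 0
    [-1.0, 0.0, -1.0],   # opposite of row 0
])

print(cosine_similarity(item_factors[0], item_factors[1]))
print(cosine_similarity(item_factors[0], item_factors[2]))
print(cosine_similarity(item_factors[0], item_factors[3]))
print(most_similar_rows(item_factors, row_index=0, top_n=2))
```

Note that the item itself always comes out as its own best match, which is why each table below starts with the movie we searched for at similarity 1.0.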

In [104]:
similarity_table = generate_similar_movies_for_movie('Toy Story (1995)')
similarity_table.sort_values('similarity', ascending=False).head(5)

movie_id,title,similarity
1,Toy Story (1995),1.0
651,Glory (1989),0.771793
204,Back to the Future (1985),0.750685
164,"Abyss, The (1989)",0.748735
479,Vertigo (1958),0.747471


In [105]:
similarity_table = generate_similar_movies_for_movie('Star Wars (1977)')
similarity_table.sort_values('similarity', ascending=False).head(5)

movie_id,title,similarity
50,Star Wars (1977),1.0
181,Return of the Jedi (1983),0.838185
172,"Empire Strikes Back, The (1980)",0.832663
210,Indiana Jones and the Last Crusade (1989),0.768443
174,Raiders of the Lost Ark (1981),0.763315


In [106]:
similarity_table = generate_similar_movies_for_movie('Monty Python and the Holy Grail (1974)')
similarity_table.sort_values('similarity', ascending=False).head(5)

movie_id,title,similarity
168,Monty Python and the Holy Grail (1974),1.0
648,"Quiet Man, The (1952)",0.768556
1007,Waiting for Guffman (1996),0.745965
408,"Close Shave, A (1995)",0.738414
12,"Usual Suspects, The (1995)",0.737224




![Wallace and Gromit](https://images-na.ssl-images-amazon.com/images/M/MV5BYjkyM2Y1NzQtYmQ0Zi00MmE5LTgwY2QtNjI3MmE4NzhmNTUwXkEyXkFqcGdeQXVyNTAyODkwOQ@@._V1_.jpg)

In [79]:
ratings_data.groupby('movie_id').sum().sort_values('rating', ascending=False).join(movie_data)

movie_id,user_id,rating,timestamp,title,release_date,video_release_date,url,unknown,Action,Adventure,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
50,274817,2541,514804231448,Star Wars (1977),1977-01-01,,http://us.imdb.com/M/title-exact?Star%20Wars%2...,0,1,1,...,0,0,0,0,0,1,1,0,1,0
100,231486,2111,448437746740,Fargo (1996),1997-02-14,,http://us.imdb.com/M/title-exact?Fargo%20(1996),0,0,0,...,0,0,0,0,0,0,0,1,0,0
181,240820,2032,447548365780,Return of the Jedi (1983),1997-03-14,,http://us.imdb.com/M/title-exact?Return%20of%2...,0,1,1,...,0,0,0,0,0,1,1,0,1,0
258,235005,1936,449993136137,Contact (1997),1997-07-11,,http://us.imdb.com/Title?Contact+(1997/I),0,0,0,...,0,0,0,0,0,0,1,0,0,0
174,199104,1786,370882569779,Raiders of the Lost Ark (1981),1981-01-01,,http://us.imdb.com/M/title-exact?Raiders%20of%...,0,1,1,...,0,0,0,0,0,0,0,0,0,0
127,196558,1769,364769941246,"Godfather, The (1972)",1972-01-01,,"http://us.imdb.com/M/title-exact?Godfather,%20...",0,1,0,...,0,0,0,0,0,0,0,0,0,0
286,225292,1759,425069350834,"English Patient, The (1996)",1996-11-15,,http://us.imdb.com/M/title-exact?English%20Pat...,0,0,0,...,0,0,0,0,0,1,0,0,1,0
1,215609,1753,399028021059,Toy Story (1995),1995-01-01,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,176180,1673,344378020675,"Silence of the Lambs, The (1991)",1991-01-01,,http://us.imdb.com/M/title-exact?Silence%20of%...,0,0,0,...,0,0,0,0,0,0,0,1,0,0
288,224234,1645,422431340025,Scream (1996),1996-12-20,,http://us.imdb.com/M/title-exact?Scream%20(1996),0,0,0,...,0,0,1,0,0,0,0,1,0,0
