Boilerplate code below..

Setting a default seed for RNG

In [1]:
import random
import numpy as np

my_seed = 1337
random.seed(my_seed)
np.random.seed(my_seed)

In [2]:
import pandas as pd
import numpy as np
from typing import *
from IPython.display import display, HTML, Markdown

import warnings
warnings.filterwarnings('ignore')


def display_best_and_worse_recommendations(recommendations: pd.DataFrame):
    recommendations.sort_values('Estimated Prediction', ascending=False, inplace=True)

    top_recommendations = recommendations.iloc[:10]
    top_recommendations.columns = ['Prediction (sorted by best)', 'Movie Title']

    worse_recommendations = recommendations.iloc[-10:]
    worse_recommendations.columns = ['Prediction (sorted by worse)', 'Movie Title']

    display(HTML("<h1>Recommendations your user will love</h1>"))
    display(top_recommendations)

    display(HTML("<h1>Recommendations your user will hate</h1>"))
    display(worse_recommendations)
    

def load_movies_dataset() -> pd.DataFrame:
    movie_data_columns = [
    'movie_id', 'title', 'release_date', 'video_release_date', 'url',
    'unknown', 'Action', 'Adventure', 'Animation', "Children's",
    'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
    'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
    'War', 'Western'
    ]

    movie_data = pd.read_csv(
        'datasets/ml-100k/u.item', 
        sep = '|', 
        encoding = "ISO-8859-1", 
        header = None, 
        names = movie_data_columns,
        index_col = 'movie_id'
    )
    movie_data['release_date'] = pd.to_datetime(movie_data['release_date'])
    return movie_data

def load_ratings() -> pd.DataFrame:
    ratings_data = pd.read_csv(
        'datasets/ml-100k/u.data',
        sep = '\t',
        encoding = "ISO-8859-1",
        header = None,
        names=['user_id', 'movie_id', 'rating', 'timestamp']
    )
    return ratings_data[['user_id', 'movie_id', 'rating']]

def load_ratings_with_name() -> pd.DataFrame:
    ratings_data = load_ratings()
    movies_data = load_movies_dataset()
    ratings_data['user_id'] = ratings_data['user_id'].map(lambda k: f"User {k}")
    
    ratings_and_movies = ratings_data \
        .set_index('movie_id') \
        .join(movies_data['title']) \
        .reset_index()
    
    ratings_and_movies['movie_title'] = ratings_and_movies['title']
    return ratings_and_movies[['user_id', 'movie_title', 'rating']].sample(frac=1)
    
    


# A practical guide to Singular Value Decomposition in Python

Recommender systems have become increasingly popular in recent years, and are used by some of the largest websites in the world to predict the likelihood of a user taking an action on an item. In the world of Netflix, this means recommending similar movies to the ones you have seen. In the world of dating, this means suggesting matches similar to people you already showed interest in!

My path to recommenders has been an unusual one: from a Software Engineer to working on matching algorithms at a dating company, with a little background on machine learning. With my knowledge of Python and the use of basic SVD (Singular Value Decomposition) frameworks, I was able to understand SVDs from a practical standpoint of what you can do with them, instead of focusing on the science.

In my talk, you will learn 2 practical ways of generating recommendations using SVDs: matrix factorization and item similarity. We will be learning the high-level components of SVD the "doer way": we will be implementing a simple movie recommendation engine with the help of Jupiter notebooks, the MovieLens database, and the Surprise recommendation package.

 - Recommendations via Matrix Factorization: Performing predict() manually

## Table of contents

 - Downloading and exploring the MovieLens dataset
 - Training a SVD using Surprise in 4 simple steps
 - Using the predict() API inside of Surprise
 - Recommendations via Product based CF: Finding similarity between vectors

# Downloading and exploring the MovieLens dataset

<p><img src="https://static1.squarespace.com/static/51cdafc4e4b09eb676a64e68/t/579282fabebafbb6c366252c/1469219594863/" alt="Drawing" style="width: 400px; float: left"/></p>


# Exploring the Dataset



- Open Source dataset
- 20 million ratings
- 27,000 movies
- 138,000 users

In [3]:
movie_data = load_movies_dataset()

In [4]:
# EXPLORATION

ratings_with_movies = ratings_data.set_index('movie_title').join(movie_data.set_index('title'), how='inner')
ratings_with_genres = ratings_with_movies.reset_index(drop=True)[['user_id', 'Action', 'Adventure', 'Animation', "Children's",
    'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
    'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
    'War', 'Western']]

user_and_preferences = ratings_with_genres.groupby('user_id').sum()

total_user_rating = user_and_preferences.sum(axis=1)
user_and_preferences.std(axis=1).sort_values(ascending=False)


user_and_preferences.loc['User 450']


user_and_preferences.mean()

NameError: name 'ratings_data' is not defined

In [5]:
ratings_data = load_ratings_with_name()
ratings_data.head(5)

Unnamed: 0,user_id,movie_title,rating
36649,User 742,Jerry Maguire (1996),4
2478,User 908,"Usual Suspects, The (1995)",3
82838,User 758,Real Genius (1985),4
69729,User 393,Things to Do in Denver when You're Dead (1995),3
36560,User 66,Jerry Maguire (1996),4


In [6]:
# Remove movies with few ratings
movie_ratings = ratings_data.groupby('movie_title').size()
valid_movies = movie_ratings[movie_ratings > 15]
movie_ratings = ratings_data.set_index('movie_title', drop=False).join(valid_movies.to_frame(), how='inner').reset_index(drop=True)

del movie_ratings[0]

movie_ratings = movie_ratings.sample(frac=1)
movie_ratings.head(5)

Unnamed: 0,user_id,movie_title,rating
64719,User 730,Picture Perfect (1997),2
10385,User 399,Black Beauty (1994),3
50679,User 534,Liar Liar (1997),5
90771,User 741,Twister (1996),1
55889,User 176,Mighty Aphrodite (1995),4


# Training a SVD using Surprise in 4 simple steps

Let's take the **interactions** between the Users and Movies, and generate **latent features**  

In [7]:
from surprise import SVD, NMF, accuracy
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate, train_test_split

# Step 1: create a Reader.
# A reader tells our SVD what the lower and upper bound of our ratings is.
# MovieLens ratings are from 1 to 5
reader = Reader(rating_scale=(1, 5))

In [8]:
# Step 2: create a new Dataset instance with a DataFrame and the reader
# The DataFrame needs to have 3 columns in this specific order: [user_id, product_id, rating]
data = Dataset.load_from_df(ratings_data, reader)

In [9]:
# Step 3: keep 25% of your trainset for testing
trainset, testset = train_test_split(data, test_size=.25)

In [10]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_data, reader)
trainset, testset = train_test_split(data, test_size=.01)

In [11]:
# Step 4: train a new SVD
model = SVD(n_factors=100, biased=False)
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x111c24f28>

In [12]:
# Optionally, validate the RMSE (root-mean-square error) to ensure model training was effectiv 
predictions = model.test(testset)
accuracy.rmse(predictions)

RMSE: 0.9366


0.9365724030417191

In [13]:
# TEST mormalization
pd.DataFrame(model.qi).iloc[0].pow(2).sum()
model.qi /= np.linalg.norm(model.qi, ord=2, axis=1).reshape(-1, 1)
pd.DataFrame(model.qi).iloc[0].pow(2).sum()

1.0000000000000002

# Generating predictions with simplicity

Coomputes the rating prediction for given user and movie with `model.predict()`

In [14]:
daniel = "User 196"
toy_story = "Toy Story (1995)"
model.predict(daniel, toy_story)

Prediction(uid='User 196', iid='Toy Story (1995)', r_ui=None, est=1.5455571744427679, details={'was_impossible': False})

# Inspecting our Product Matrix

Surprise SVD stores the product matrix under the `model.qi` attribute.

In [15]:
model.qi.shape

(1664, 100)

The matrix has `n_factors` columns (we chose 100). Every row represents a movie

# Mapping every vector back to it's movie

Every row is mapped to a movie. How do we map every row to it's vector?

In [16]:
item_to_row_idx: Dict[Any, int] = model.trainset._raw2inner_id_items

# Unpacking a dict into a DataFrame for readibility
item_to_row_idx_df = pd.DataFrame(
    list(item_to_row_idx.items()),
    columns=['Movie name', 'model.qi row idx'],
)
item_to_row_idx_df.head(5)


Unnamed: 0,Movie name,model.qi row idx
0,Restoration (1995),0
1,Monty Python and the Holy Grail (1974),1
2,Starship Troopers (1997),2
3,Swingers (1996),3
4,As Good As It Gets (1997),4


# Identifying Toy Story

It's very easy to fetch the latent features of a product. Once we have the **row index** of the product, we can fetch the **row** from the `model.qi` matrix

In [17]:
toy_story_row_idx : int = item_to_row_idx['Toy Story (1995)']

In [18]:
model.qi[toy_story_row_idx]

array([-0.00974943,  0.03265917,  0.03659085, -0.06292274, -0.06491639,
        0.09442259, -0.02551203, -0.02027142, -0.02335466, -0.01492428,
       -0.06290319,  0.01435514, -0.00619643,  0.01799988,  0.02820319,
       -0.21907753, -0.05845834,  0.08035567,  0.05981695, -0.09046247,
        0.22661292,  0.03933661, -0.10093707, -0.15839697, -0.01699003,
        0.07315269, -0.11690985, -0.08911202, -0.09714794, -0.0814975 ,
       -0.01107214,  0.12426467, -0.19211571,  0.08327614, -0.07947047,
        0.06593131,  0.0065287 ,  0.01774967,  0.04007586,  0.16557257,
       -0.01542594, -0.19190932, -0.0248609 ,  0.13164055,  0.07578262,
       -0.0357468 , -0.16622227, -0.18182406,  0.06855151, -0.03599035,
        0.07501229,  0.01276342, -0.0289171 , -0.01152304, -0.09080726,
       -0.1464064 , -0.17687792, -0.09365408,  0.0740219 , -0.10258863,
        0.08704034, -0.05372785,  0.09271249,  0.04955725,  0.17063114,
        0.18700405, -0.13139778, -0.07092693,  0.05760666,  0.10

In [19]:
print(f"Every product has {model.qi[toy_story_row_idx].shape[0]} features")

Every product has 100 features


# Recommendations via Product based CF: Finding similarity between vectors

2 products are "similar" when the cosine distance of their latent features is <= 0

In [20]:
from scipy.spatial.distance import cosine


def cosine_distance(vector_a: np.array, vector_b: np.array) -> float:
    return cosine(vector_a, vector_b)

def get_vector_by_movie_title(movie_title: str, trained_model: SVD) -> np.array:
    movie_row_idx = trained_model.trainset._raw2inner_id_items[movie_title]
    return trained_model.qi[movie_row_idx]

In [28]:
# Fetch the vectors of "Toy Story" and "Wizard of Oz"
toy_story_vector: np.array = get_vector_by_movie_title('Toy Story (1995)', model)
wizard_of_oz: np.array = get_vector_by_movie_title('Wizard of Oz, The (1939)', model)

# Calculate the distance between the vectors. The smaller the number,
# the more similar the two movies are
similarity_score = cosine_distance(toy_story_vector, wizard_of_oz)
similarity_score

0.32043956574320265

In [22]:

def get_top_similarities(movie_title: str, model: SVD) -> pd.DataFrame:
    movie_vector: np.array = get_vector_by_movie_title(movie_title, model)
    similarity_table = []
    for other_movie_title in model.trainset._raw2inner_id_items.keys():
        other_movie_vector = get_vector_by_movie_title(other_movie_title, model)
        similarity_score = cosine_distance(other_movie_vector, movie_vector)
        similarity_table.append((similarity_score, other_movie_title))
        
    similarity_table = pd.DataFrame(
        similarity_table,
        columns=['similarity', 'movie title']
    ).sort_values('similarity', ascending=True)
    
    return similarity_table.iloc[:10]
    
    return similarity_table.set_index('movie title').join(movie_data).iloc[:10]

In [23]:
get_top_similarities('Toy Story (1995)', model)

Unnamed: 0,similarity,movie title
209,0.0,Toy Story (1995)
1094,0.235412,"Affair to Remember, An (1957)"
90,0.25822,"Shawshank Redemption, The (1994)"
291,0.271727,"Grand Day Out, A (1992)"
192,0.282496,"Wrong Trousers, The (1993)"
867,0.283445,Mr. Smith Goes to Washington (1939)
36,0.287266,"Abyss, The (1989)"
145,0.290202,E.T. the Extra-Terrestrial (1982)
466,0.291383,Schindler's List (1993)
120,0.292125,Stand by Me (1986)


In [24]:
get_top_similarities('Monty Python\'s Life of Brian (1979)', model)

Unnamed: 0,similarity,movie title
657,0.0,Monty Python's Life of Brian (1979)
1,0.191159,Monty Python and the Holy Grail (1974)
142,0.233888,Ran (1985)
192,0.245717,"Wrong Trousers, The (1993)"
555,0.246895,"Philadelphia Story, The (1940)"
234,0.254793,Up in Smoke (1978)
187,0.256729,Delicatessen (1991)
1116,0.257891,"Body Snatcher, The (1945)"
682,0.260078,Red Rock West (1992)
390,0.260469,Lost Horizon (1937)


# Recommendations via Matrix Reconstruction: Reconstructing the score between vectors

So far we have seen how the `predict()` API works in surface. But how does it **really** work inside of surprise. It's, surprisingly, simple! (get the pun?)

In [25]:
def dot_product(vector_a: np.array, vector_b: np.array) -> float:
    return vector_a.dot(vector_b)


def get_vector_by_user_name(user_name: str, trained_model: SVD) -> np.array:
    movie_row_idx = trained_model.trainset._raw2inner_id_users[user_name]
    return trained_model.pu[movie_row_idx]

In [26]:
user_908_vector = get_vector_by_user_name("User 908", model)
dot_product(user_908_vector, toy_story_vector)

1.4846220818359699