In [65]:
import pandas as pd
import numpy as np
from typing import List, Dict

import warnings
warnings.filterwarnings('ignore')

# A practical guide to Singular Value Decomposition in Python

Recommender systems have become increasingly popular in recent years, and are used by some of the largest websites in the world to predict the likelihood of a user taking an action on an item. In the world of Netflix, this means recommending similar movies to the ones you have seen. In the world of dating, this means suggesting matches similar to people you already showed interest in!

My path to recommenders has been an unusual one: from Software Engineer to working on matching algorithms at a dating company, with only a little background in machine learning. With my knowledge of Python and the use of basic SVD (Singular Value Decomposition) frameworks, I was able to understand SVDs from a practical standpoint of what you can do with them, instead of focusing on the science.

In my talk, you will learn 2 practical ways of generating recommendations using SVDs: matrix factorization and item similarity. We will be learning the high-level components of SVD the "doer way": we will be implementing a simple movie recommendation engine with the help of Jupyter notebooks, the MovieLens dataset, and the Surprise recommendation package.

## Table of contents

 - Downloading and exploring the MovieLens dataset
 - ROC Curve

In [3]:
from IPython.display import display, HTML, Markdown


def display_best_and_worse_recommendations(recommendations):
    recommendations.sort_values('Estimated Prediction', ascending=False, inplace=True)

    top_recommendations = recommendations.iloc[:10]
    top_recommendations.columns = ['Prediction (sorted by best)', 'Movie Title']

    worse_recommendations = recommendations.iloc[-10:]
    worse_recommendations.columns = ['Prediction (sorted by worse)', 'Movie Title']

    display(HTML("<h1>Recommendations your user will love</h1>"))
    display(top_recommendations)

    display(HTML("<h1>Recommendations your user will hate</h1>"))
    display(worse_recommendations)

In [4]:
movie_data_columns = [
    'movie_id', 'title', 'release_date', 'video_release_date', 'url',
    'unknown', 'Action', 'Adventure', 'Animation', "Children's",
    'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
    'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
    'War', 'Western'
]

movie_data = pd.read_csv(
    'datasets/ml-100k/u.item', 
    sep = '|', 
    encoding = "ISO-8859-1", 
    header = None, 
    names = movie_data_columns,
    index_col = 'movie_id'
)
movie_data['release_date'] = pd.to_datetime(movie_data['release_date'])

movie_data.loc[1]

title                                                  Toy Story (1995)
release_date                                        1995-01-01 00:00:00
video_release_date                                                  NaN
url                   http://us.imdb.com/M/title-exact?Toy%20Story%2...
unknown                                                               0
Action                                                                0
Adventure                                                             0
Animation                                                             1
Children's                                                            1
Comedy                                                                1
Crime                                                                 0
Documentary                                                           0
Drama                                                                 0
Fantasy                                                         

# Movies dataset

This dataset contains all the movies and their metadata

`movie_id` 1 is **Toy Story**

<p><img src="https://static1.squarespace.com/static/51cdafc4e4b09eb676a64e68/t/579282fabebafbb6c366252c/1469219594863/" alt="Drawing" style="width: 200px; float: left"/></p>

In [5]:
ratings_data = pd.read_csv(
    'datasets/ml-100k/u.data',
    sep = '\t',
    encoding = "ISO-8859-1",
    header = None,
    names=['user_id', 'movie_id', 'rating', 'timestamp']
)
ratings_data.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
5,298,474,4,884182806
6,115,265,2,881171488
7,253,465,5,891628467
8,305,451,3,886324817
9,6,86,3,883603013


# Ratings dataset

Contains the **interactions** between users and movies

- User **196** rated movie **242** with a score of **3** 
- User **186** rated movie **302** with a score of **3** 
- User **22** rated movie **377** with a score of **1** 
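
To make the idea of "interactions" concrete, here is a minimal sketch that places the first three rows of the table above into a user-by-movie matrix. The names `interactions`, `user_index`, `movie_index` and `user_item` are hypothetical and only used for illustration; the real dataset would produce a much larger and mostly empty matrix.

```python
import numpy as np

# The first three interactions from the table above: (user_id, movie_id, rating)
interactions = [(196, 242, 3), (186, 302, 3), (22, 377, 1)]

# Map raw IDs to consecutive row/column positions (hypothetical helper dicts)
user_index = {u: i for i, u in enumerate(sorted({u for u, _, _ in interactions}))}
movie_index = {m: j for j, m in enumerate(sorted({m for _, m, _ in interactions}))}

# NaN marks "this user has not rated this movie"
user_item = np.full((len(user_index), len(movie_index)), np.nan)
for user_id, movie_id, rating in interactions:
    user_item[user_index[user_id], movie_index[movie_id]] = rating

print(user_item)
```

Most cells stay empty, and filling in those empty cells is exactly the job of a recommender.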

In [6]:
ratings_data[ratings_data['movie_id'] == 1]['rating'].describe()

count    452.000000
mean       3.878319
std        0.927897
min        1.000000
25%        3.000000
50%        4.000000
75%        5.000000
max        5.000000
Name: rating, dtype: float64

On average, people really LOVE Toy Story! And I don't blame them!

# Running our interactions through Surprise SVD

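Let's take the **interactions** between the Users and Movies, and generate **latent features**. Note that the cell below fits Surprise's `NMF` model (non-negative matrix factorization), which the Surprise documentation describes as very similar to its `SVD` algorithm.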
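
For intuition about what "latent features" are, the sketch below runs a classical SVD on a tiny, made-up ratings matrix with NumPy. This is only an illustration under simplifying assumptions: the matrix is fully observed and the values are invented, whereas Surprise's factorization models learn their factors iteratively from the observed ratings only.

```python
import numpy as np

# Toy ratings matrix (rows: users, columns: movies); the values are made up
ratings = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Classical SVD: ratings = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Keep only the k strongest latent features
k = 2
user_features = U[:, :k] * s[:k]   # one row of k latent features per user
movie_features = Vt[:k, :].T       # one row of k latent features per movie

# Multiplying the two back together approximates the original ratings
approximation = user_features @ movie_features.T
print(np.round(approximation, 1))
```

Each user and each movie ends up described by `k` numbers, and the product of those small matrices approximates the full ratings table.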

In [7]:
from surprise import SVD, NMF
from surprise import Dataset
from surprise.model_selection import cross_validate, train_test_split


data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=.25)

model = NMF(n_factors=10, biased=False)
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.NMF at 0x10dd654e0>

# Generating predictions with simplicity

Before looking into the latent features of our movies, let's use the API provided by Surprise. We only need one method:

 - `model.predict` computes the rating prediction for a given user and movie
 
Let's look at how we can use this API to generate movies that a given user may like

```python
>>> model.predict('302', '1')
Prediction(uid=302, iid=1, r_ui=None, est=3.5327866666666665, details={'was_impossible': False})
```

NOTE: User ID and Movie ID are **strings**

In [8]:
movie_id_to_title_map: Dict[int, str] = dict(movie_data['title'])
# {1: 'Toy Story (1995)',
#  2: 'GoldenEye (1995)',
#  3: 'Four Rooms (1995)'}

def generate_recommended_movies_for_user(user_id: int) -> pd.DataFrame:
    """Return a DataFrame containing recommendations for the user, and the
    associated score
    """
    results = []
    for movie_id, movie_title in movie_id_to_title_map.items():
        
        # For each movie, calculate score prediction 
        prediction = model.predict(str(user_id), str(movie_id))
        results.append((prediction.est, movie_title))
       
    return pd.DataFrame(results, columns=['Estimated Prediction', 'Movie Title'])


# Let's generate some recommendations for a user
recommendations = generate_recommended_movies_for_user(302)
display_best_and_worse_recommendations(recommendations)

Unnamed: 0,Prediction (sorted by best),Movie Title
1241,4.605348,"Old Lady Who Walked in the Sea, The (Vieille q..."
1462,4.506306,"Boys, Les (1997)"
1448,4.45685,Pather Panchali (1955)
1655,4.25903,Little City (1998)
640,4.185027,Paths of Glory (1957)
511,4.183396,Wings of Desire (1987)
1449,4.151286,Golden Earrings (1947)
866,4.117834,"Whole Wide World, The (1996)"
1466,4.075233,"Saint of Fort Washington, The (1993)"
1367,4.016534,Mina Tannenbaum (1994)


Unnamed: 0,Prediction (sorted by worse),Movie Title
554,1.0,White Man's Burden (1995)
1608,1.0,B*A*P*S (1997)
437,1.0,Amityville 3-D (1983)
438,1.0,Amityville: A New Generation (1993)
975,1.0,Solo (1996)
1145,1.0,Calendar Girl (1993)
1087,1.0,Double Team (1997)
1586,1.0,Terror in a Texas Town (1958)
1585,1.0,Lashou shentan (1992)
456,1.0,Free Willy 3: The Rescue (1997)


# Predict, under the hood

So far we have seen how the `predict()` API works on the surface. But how does it **really** work inside Surprise? It's, surprisingly, simple! (Get the pun?)

But before we go there, let's go back to our Feature Vectors

![Latent Features](https://cdn-images-1.medium.com/max/1600/0*_gKhyxIC3wup0cCE.jpg)
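
As a preview of where this is heading: for an unbiased factorization model like the one fitted above, the estimate for a known user and movie is essentially the dot product of the user's latent vector and the movie's latent vector, which Surprise's `predict()` then clips to the rating scale by default. The sketch below illustrates that idea with made-up vectors; `estimate_rating` is a hypothetical helper, not a Surprise function.

```python
import numpy as np


def estimate_rating(user_factors, movie_factors, low=1.0, high=5.0):
    """Dot product of two latent vectors, clipped to the rating scale.

    Hypothetical helper for illustration; `low` and `high` assume the
    1-5 MovieLens rating scale.
    """
    raw = float(np.dot(user_factors, movie_factors))
    return min(high, max(low, raw))


# Made-up latent vectors with three features each
user_vector = np.array([1.0, 0.5, 2.0])
movie_vector = np.array([0.8, 1.0, 1.2])

print(estimate_rating(user_vector, movie_vector))  # 0.8 + 0.5 + 2.4, about 3.7
```

Clipping is one plausible reason the lowest predictions in the table above all sit at exactly 1.0.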

## Looking at the Movie matrix (vT)

Let's take a look at the latent features for every movie. Product (movie) features can be found in the model's `qi` attribute. To make them readable, we will:
 - create a DataFrame that maps product matrix row index to movie
 - join the newly created dataframe with the movie dataset
 - join the newly created dataframe with the latent features

In [18]:
# Create a DataFrame that maps product matrix row index to movie.
# Surprise stores raw IDs as strings, so cast both columns to int
# to match the integer index of movie_data.
movie_to_product_matrix = pd.DataFrame(
    list(model.trainset._raw2inner_id_items.items()),
    columns=['movie_id', 'vT_index']
).astype(int).set_index('movie_id', drop=False)

# Join the newly created dataframe with the movie dataset
mapping_matrix_with_title = movie_to_product_matrix.join(movie_data['title'])

# Create a dataframe containing latent features, and join it to the remaining dataset
latent_features = pd.DataFrame(model.qi, columns=[f"Latent Feature {k}" for k in range(1, 11)])
mapping_matrix_with_title_and_features = mapping_matrix_with_title.set_index('vT_index').join(latent_features)

mapping_matrix_with_title_and_features.head(10)

vT_index,movie_id,title,Latent Feature 1,Latent Feature 2,Latent Feature 3,Latent Feature 4,Latent Feature 5,Latent Feature 6,Latent Feature 7,Latent Feature 8,Latent Feature 9,Latent Feature 10
0,282,"Time to Kill, A (1996)",0.608714,0.831445,0.654854,0.402548,0.718931,0.469895,0.758823,0.254309,0.664029,0.635457
1,248,Grosse Pointe Blank (1997),0.318664,0.832503,0.834611,0.741847,0.096866,0.925916,0.618941,0.451103,0.568593,0.566105
2,1141,"War Room, The (1993)",0.248595,1.008002,0.454207,0.63309,0.936849,0.049802,0.540847,0.370435,1.016044,0.761252
3,274,Sabrina (1995),0.423493,0.232216,0.39598,0.785715,0.917521,0.391481,0.813809,0.408192,0.637816,0.658849
4,495,Around the World in 80 Days (1956),0.631047,0.739826,0.099724,0.022218,0.82029,0.773851,0.843607,0.400152,0.979798,0.550228
5,304,Fly Away Home (1996),0.841236,0.675221,0.35804,0.835458,0.217843,0.546073,0.244704,0.645574,0.605339,0.720203
6,1418,"Joy Luck Club, The (1993)",1.053016,1.139279,0.67502,0.004888,0.414575,1.371647,0.165999,0.605249,0.206292,0.669139
7,165,Jean de Florette (1986),0.91231,0.348977,1.192438,0.361417,1.231676,0.70036,0.749986,0.557291,0.324394,0.39464
8,19,Antonia's Line (1995),0.659402,0.32399,0.329784,0.229035,1.183885,1.098252,0.626037,0.740899,0.19373,0.889085
9,288,Scream (1996),0.231096,0.453338,0.61542,0.964054,0.699507,0.515848,0.394911,0.8145,0.140486,0.709994


These are **learned features**. We cannot attribute them to anything specific, but they usually have some real-world correlation

# Find similar movies using Cosine Similarity

Usually, there isn't a straightforward way to pinpoint what a latent feature may be a strong indicator of. Even though we don't know exactly what these features correlate to, we can still compare vectors with each other, because the latent feature at a given index relates to the same attribute in every vector.

To find how similar 2 movies are, all we need to do is compare their vectors
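
Cosine similarity measures the angle between two vectors: 1 means they point in the same direction, 0 means they are orthogonal. Here is a minimal NumPy sketch with made-up vectors; `cosine_similarity` is a hypothetical helper written for illustration.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Same direction, different magnitude: similarity is (numerically) 1
print(cosine_similarity([1, 2, 3], [2, 4, 6]))

# Orthogonal vectors: similarity is 0
print(cosine_similarity([1, 0], [0, 1]))
```

Because only the direction matters, two movies can be considered similar even if one has consistently larger feature values than the other.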

In [56]:
from scipy.spatial.distance import cosine


def compute_similarity(movie_a: str, movie_b: str) -> float:
    try:
        # Columns 0 and 1 are movie_id and title; the rest are latent features.
        # to_numpy() replaces as_matrix(), which was removed in pandas 1.0.
        movie_a_vectors: np.ndarray = mapping_matrix_with_title_and_features[
            mapping_matrix_with_title_and_features['title'] == movie_a
        ].iloc[0, 2:].to_numpy(dtype=float)
        movie_b_vectors: np.ndarray = mapping_matrix_with_title_and_features[
            mapping_matrix_with_title_and_features['title'] == movie_b
        ].iloc[0, 2:].to_numpy(dtype=float)
    except IndexError:
        # The movie has no latent vector, most likely because none of its
        # ratings ended up in the training split. Return -1 as a sentinel
        # so that such movies sort to the bottom.
        return -1
    
    # scipy's cosine() is a distance, so convert it to a similarity
    return 1 - cosine(movie_a_vectors, movie_b_vectors)


# compute_similarity('Evita (1996)', 'Evita (1996)')
# compute_similarity('Toy Story (1995)', 'Evita (1996)')
compute_similarity('They Made Me a Criminal (1939)', 'Toy Story (1995)')

-1

In [66]:
def generate_similar_movies_for_movie(movie_title: str) -> pd.DataFrame:
    # Copy the slice so that adding a column does not modify movie_data
    # or trigger a SettingWithCopyWarning
    all_movies = movie_data[['title']].copy()
    all_movies['similarity'] = all_movies['title'].map(lambda title: compute_similarity(title, movie_title))
    return all_movies


similarity_table = generate_similar_movies_for_movie('Postino, Il (1994)')

In [67]:
similarity_table.sort_values('similarity', ascending=False).head(10)

movie_id,title,similarity
14,"Postino, Il (1994)",1.0
636,Escape from New York (1981),0.981315
657,"Manchurian Candidate, The (1962)",0.977107
1197,"Family Thing, A (1996)",0.973401
1251,A Chef in Love (1996),0.971659
178,12 Angry Men (1957),0.96762
741,"Last Supper, The (1995)",0.967553
132,"Wizard of Oz, The (1939)",0.966423
188,Full Metal Jacket (1987),0.966104
211,M*A*S*H (1970),0.964075
