In [58]:
import pandas as pd
import numpy as np
from typing import List, Dict

# A practical guide to Singular Value Decomposition in Python

Recommender systems have become increasingly popular in recent years, and are used by some of the largest websites in the world to predict the likelihood of a user taking an action on an item. In the world of Netflix, this means recommending similar movies to the ones you have seen. In the world of dating, this means suggesting matches similar to people you already showed interest in!

My path to recommenders has been an unusual one: from Software Engineer to working on matching algorithms at a dating company, with only a little background in machine learning. With my knowledge of Python and some basic SVD (Singular Value Decomposition) frameworks, I was able to understand SVD from a practical standpoint, in terms of what you can do with it, instead of focusing on the science.

In this talk, you will learn two practical ways of generating recommendations using SVD: matrix factorization and item similarity. We will learn the high-level components of SVD the "doer way": by implementing a simple movie recommendation engine with the help of Jupyter notebooks, the MovieLens dataset, and the Surprise recommendation package.

## Table of contents

 - Downloading and exploring the MovieLens dataset
 - ROC Curve

In [145]:
from IPython.display import display, HTML, Markdown


def display_best_and_worse_recommendations(recommendations):
    # Sort into a new DataFrame so the caller's data is left untouched
    ranked = recommendations.sort_values('Estimated Prediction', ascending=False)

    # Copy each slice before renaming its columns; renaming a view of
    # `ranked` would trigger pandas' SettingWithCopyWarning
    top_recommendations = ranked.iloc[:10].copy()
    top_recommendations.columns = ['Prediction (sorted by best)', 'Movie Title']

    worse_recommendations = ranked.iloc[-10:].copy()
    worse_recommendations.columns = ['Prediction (sorted by worse)', 'Movie Title']

    display(HTML("<h1>Recommendations your user will love</h1>"))
    display(top_recommendations)

    display(HTML("<h1>Recommendations your user will hate</h1>"))
    display(worse_recommendations)

In [121]:
movie_data_columns = [
    'movie_id', 'title', 'release_date', 'video_release_date', 'url',
    'unknown', 'Action', 'Adventure', 'Animation', "Children's",
    'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
    'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
    'War', 'Western'
]

movie_data = pd.read_csv(
    'datasets/ml-100k/u.item', 
    sep = '|', 
    encoding = "ISO-8859-1", 
    header = None, 
    names = movie_data_columns,
    index_col = 'movie_id'
)
movie_data['release_date'] = pd.to_datetime(movie_data['release_date'])

movie_data.loc[1]

title                                                  Toy Story (1995)
release_date                                        1995-01-01 00:00:00
video_release_date                                                  NaN
url                   http://us.imdb.com/M/title-exact?Toy%20Story%2...
unknown                                                               0
Action                                                                0
Adventure                                                             0
Animation                                                             1
Children's                                                            1
Comedy                                                                1
Crime                                                                 0
Documentary                                                           0
Drama                                                                 0
Fantasy                                                         

# Movies dataset

This dataset contains every movie along with its metadata.

`movie_id` 1 is **Toy Story**

<p><img src="https://static1.squarespace.com/static/51cdafc4e4b09eb676a64e68/t/579282fabebafbb6c366252c/1469219594863/" alt="Drawing" style="width: 200px; float: left"/></p>

In [3]:
ratings_data = pd.read_csv(
    'datasets/ml-100k/u.data',
    sep = '\t',
    encoding = "ISO-8859-1",
    header = None,
    names=['user_id', 'movie_id', 'rating', 'timestamp']
)
ratings_data.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
5,298,474,4,884182806
6,115,265,2,881171488
7,253,465,5,891628467
8,305,451,3,886324817
9,6,86,3,883603013


# Ratings dataset

Contains the **interactions** between users and movies

- User **196** rated movie **242** with a score of **3** 
- User **186** rated movie **302** with a score of **3** 
- User **22** rated movie **377** with a score of **1**

In [14]:
ratings_data[ratings_data['movie_id'] == 1]['rating'].describe()

count    452.000000
mean       3.878319
std        0.927897
min        1.000000
25%        3.000000
50%        4.000000
75%        5.000000
max        5.000000
Name: rating, dtype: float64

With a mean rating of 3.88 out of 5 across 452 ratings, people really LOVE Toy Story, and I don't blame them!

# Running our interactions through Surprise SVD

Let's take the **interactions** between users and movies and use them to generate **latent features**.
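To make "interactions" concrete, here is a minimal sketch that pivots the first four rows of the ratings table shown earlier into a user-by-movie matrix. The `toy_ratings` frame is a hand-copied sample for illustration only; Surprise builds its own internal representation and does not need this pivot.

```python
import pandas as pd

# First four interactions from ratings_data.head(10), copied by hand
toy_ratings = pd.DataFrame({
    'user_id': [196, 186, 22, 244],
    'movie_id': [242, 302, 377, 51],
    'rating': [3, 3, 1, 2],
})

# One row per user, one column per movie; unrated pairs become NaN
interaction_matrix = toy_ratings.pivot(
    index='user_id', columns='movie_id', values='rating'
)

# Fraction of user/movie pairs with no rating
sparsity = interaction_matrix.isna().to_numpy().mean()
print(interaction_matrix)
print(f"sparsity: {sparsity:.2f}")
```

Almost every cell is empty. Those empty cells are the ratings the model will try to estimate.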

In [148]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate, train_test_split


data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=.25)

model = SVD(n_factors=10)
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x109619668>

# Generating predictions with simplicity

Before looking into the latent features of our movies, let's use the API provided by Surprise. More specifically, Surprise gives us one method:

 - `model.predict` computes the predicted rating for a given user and movie
 
Let's look at how we can use this method to find movies that a given user may like.

```python
>>> model.predict('302', '1')
Prediction(uid=302, iid=1, r_ui=None, est=3.5327866666666665, details={'was_impossible': False})
```

NOTE: The user ID and movie ID must be passed as **strings**, because Surprise reads the raw IDs of the built-in dataset as strings.

In [149]:
movie_id_to_title_map: Dict[int, str] = dict(movie_data['title'])
# {1: 'Toy Story (1995)',
#  2: 'GoldenEye (1995)',
#  3: 'Four Rooms (1995)'}

def generate_recommended_movies_for_user(user_id: int) -> pd.DataFrame:

    results = []
    for movie_id, movie_title in movie_id_to_title_map.items():
        
        # For each movie, calculate score prediction 
        prediction = model.predict(str(user_id), str(movie_id))
        results.append((prediction.est, movie_title))
       
    return pd.DataFrame(results, columns=['Estimated Prediction', 'Movie Title'])

recommendations = generate_recommended_movies_for_user(302)
display_best_and_worse_recommendations(recommendations)

Unnamed: 0,Prediction (sorted by best),Movie Title
407,4.042405,"Close Shave, A (1995)"
168,4.028172,"Wrong Trousers, The (1993)"
487,4.010458,Sunset Blvd. (1950)
482,3.990727,Casablanca (1942)
656,3.95767,"Manchurian Candidate, The (1962)"
512,3.890518,"Third Man, The (1949)"
602,3.884698,Rear Window (1954)
63,3.870787,"Shawshank Redemption, The (1994)"
11,3.866142,"Usual Suspects, The (1995)"
177,3.859788,12 Angry Men (1957)


Unnamed: 0,Prediction (sorted by worse),Movie Title
1088,1.69341,Speed 2: Cruise Control (1997)
423,1.631941,Children of the Corn: The Gathering (1996)
456,1.626604,Free Willy 3: The Rescue (1997)
686,1.614657,McHale's Navy (1997)
742,1.611026,"Crow: City of Angels, The (1996)"
889,1.595696,Mortal Kombat: Annihilation (1997)
367,1.57135,Bio-Dome (1996)
119,1.564614,Striptease (1996)
687,1.527307,Leave It to Beaver (1997)
1214,1.515884,Barb Wire (1996)
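
One thing to keep in mind: `generate_recommended_movies_for_user` scores every movie, including ones the user has already rated. A minimal sketch of dropping those before ranking, using a toy frame (the `predictions` values and the `already_rated` IDs are made up for illustration):

```python
import pandas as pd

# Hypothetical predictions for four movies
predictions = pd.DataFrame({
    'movie_id': [1, 2, 3, 4],
    'Estimated Prediction': [4.1, 2.3, 3.9, 3.2],
})

# Hypothetical list of movies this user has already rated
already_rated = [1, 4]

# Keep only unseen movies, then rank them from best to worst
unseen = predictions[~predictions['movie_id'].isin(already_rated)]
top_unseen = unseen.sort_values('Estimated Prediction', ascending=False)
print(top_unseen)
```

In the notebook, the already-rated IDs for a user could be taken from `ratings_data` by filtering on `user_id`.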


# Predict, under the hood

So far we have seen how the `predict()` API works on the surface. But how does it **really** work inside Surprise? It's surprisingly simple! (Get the pun?)

But before we go there, let's go back to our feature vectors.

![Latent Features](https://cdn-images-1.medium.com/max/1600/0*_gKhyxIC3wup0cCE.jpg)

## Looking at the Movie matrix (vT)
Let's take a look at the vector for every movie.

In [166]:
# pd.DataFrame(list(model.trainset._raw2inner_id_items.items()), columns=['movie_id', 'features_idx'])
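# model.qi holds one row of latent features per movie (n_factors=10 gives 10 columns)
latent_features = pd.DataFrame(
    model.qi,
    columns=[f"Latent Feature {k}" for k in range(1, model.n_factors + 1)]
)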
latent_features.head(10)

Unnamed: 0,Latent Feature 1,Latent Feature 2,Latent Feature 3,Latent Feature 4,Latent Feature 5,Latent Feature 6,Latent Feature 7,Latent Feature 8,Latent Feature 9,Latent Feature 10
0,0.023777,0.118232,0.123403,-0.267605,-0.059307,0.32651,0.03382,-0.249384,-0.27362,-0.305778
1,-0.128067,-0.309221,0.219295,0.172537,0.031776,-0.132634,0.334796,-0.031491,0.119166,0.083358
2,-0.071747,-0.1046,0.043873,-0.029186,-0.407443,0.120324,0.035192,-0.084826,0.207559,0.181254
3,0.213844,-0.008536,-0.088484,-0.333,-0.237115,0.096295,0.204255,0.111921,0.03192,-0.021037
4,0.067862,-0.28096,-0.098893,0.202384,-0.231357,0.106276,0.116667,-0.332775,0.292389,0.117683
5,0.02417,0.035562,0.057665,-0.037663,0.024285,0.051654,0.078961,0.165113,0.222108,0.104478
6,-0.17504,0.094611,-0.051376,0.011055,-0.138841,-0.020481,-0.012301,-0.001641,-0.083318,0.099007
7,-0.063354,-0.336975,-0.196063,0.047137,0.098619,-0.091096,-0.080262,0.082356,0.13549,-0.081132
8,-0.036663,0.171357,-0.145401,-0.067328,-0.126514,-0.09635,-0.026958,-0.07686,0.049069,0.051903
9,0.153954,0.02962,0.163092,-0.141742,0.070151,0.156195,-0.020623,0.23387,0.096179,-0.146613


# Mapping every index to its movie
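
The rows of `model.qi` are ordered by Surprise's "inner" item IDs, not by `movie_id`, which is why a mapping is needed. A minimal sketch of the idea in plain Python, assuming inner IDs are handed out in the order raw IDs first appear in the training data. The ID list and both dicts are stand-ins for illustration, not Surprise's own code.

```python
# Raw movie IDs in the order they show up in a hypothetical training set
raw_movie_ids_in_training_order = ['88', '55', '427', '88', '174', '55']

# Give each raw ID the next free integer the first time it is seen
raw_to_inner = {}
for raw_id in raw_movie_ids_in_training_order:
    if raw_id not in raw_to_inner:
        raw_to_inner[raw_id] = len(raw_to_inner)

# Reverse lookup: row index of the factor matrix back to the raw movie ID
inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}

print(raw_to_inner)
print(inner_to_raw[2])
```

On a fitted model, the public `trainset.to_inner_iid()` and `trainset.to_raw_iid()` methods perform the same lookups without touching private attributes.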

In [191]:
movie_to_vt_matrix = pd.DataFrame(
    list(model.trainset._raw2inner_id_items.items()),
    columns=['movie_id', 'vT_index'],
    dtype=int
)

mapping_matrix_with_title = movie_to_vt_matrix.set_index('movie_id', drop=False).join(movie_data['title'])
mapping_matrix_with_title.set_index('vT_index').join(latent_features).head(10)

vT_index,movie_id,title,Latent Feature 1,Latent Feature 2,Latent Feature 3,Latent Feature 4,Latent Feature 5,Latent Feature 6,Latent Feature 7,Latent Feature 8,Latent Feature 9,Latent Feature 10
0,88,Sleepless in Seattle (1993),0.023777,0.118232,0.123403,-0.267605,-0.059307,0.32651,0.03382,-0.249384,-0.27362,-0.305778
1,55,"Professional, The (1994)",-0.128067,-0.309221,0.219295,0.172537,0.031776,-0.132634,0.334796,-0.031491,0.119166,0.083358
2,427,To Kill a Mockingbird (1962),-0.071747,-0.1046,0.043873,-0.029186,-0.407443,0.120324,0.035192,-0.084826,0.207559,0.181254
3,174,Raiders of the Lost Ark (1981),0.213844,-0.008536,-0.088484,-0.333,-0.237115,0.096295,0.204255,0.111921,0.03192,-0.021037
4,50,Star Wars (1977),0.067862,-0.28096,-0.098893,0.202384,-0.231357,0.106276,0.116667,-0.332775,0.292389,0.117683
5,1227,"Awfully Big Adventure, An (1995)",0.02417,0.035562,0.057665,-0.037663,0.024285,0.051654,0.078961,0.165113,0.222108,0.104478
6,430,Duck Soup (1933),-0.17504,0.094611,-0.051376,0.011055,-0.138841,-0.020481,-0.012301,-0.001641,-0.083318,0.099007
7,171,Delicatessen (1991),-0.063354,-0.336975,-0.196063,0.047137,0.098619,-0.091096,-0.080262,0.082356,0.13549,-0.081132
8,796,Speechless (1994),-0.036663,0.171357,-0.145401,-0.067328,-0.126514,-0.09635,-0.026958,-0.07686,0.049069,0.051903
9,550,Die Hard: With a Vengeance (1995),0.153954,0.02962,0.163092,-0.141742,0.070151,0.156195,-0.020623,0.23387,0.096179,-0.146613


These are **learned features**. We cannot tie any single one of them to a specific movie attribute, but they usually correlate with something in the real world.
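
As a rough sketch of how these vectors turn into a rating, the formula documented for Surprise's `SVD` adds the global mean rating, a user bias, an item bias, and the dot product of the user and item vectors, then clips the result to the rating scale. All numbers below are made up for illustration, and `predict_rating` is a hypothetical helper, not part of Surprise. On a fitted model the corresponding values live in `model.pu`, `model.qi`, `model.bu`, `model.bi`, and `model.trainset.global_mean`.

```python
import numpy as np


def predict_rating(global_mean, user_bias, item_bias,
                   user_factors, item_factors, rating_range=(1.0, 5.0)):
    """Estimate a rating as mean + biases + dot(item vector, user vector)."""
    raw = global_mean + user_bias + item_bias + float(np.dot(item_factors, user_factors))
    low, high = rating_range
    # Keep the estimate inside the valid rating scale
    return min(high, max(low, raw))


# Toy values with two latent features instead of ten
estimate = predict_rating(
    global_mean=3.5,
    user_bias=0.2,
    item_bias=0.4,
    user_factors=np.array([0.1, -0.3]),
    item_factors=np.array([0.5, 0.2]),
)
print(round(estimate, 2))
```

The dot product is the only part that depends on the latent features: it is large when a user's vector points in the same direction as a movie's vector.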