In [1]:
import pandas as pd
import numpy as np

# A practical guide to Singular Value Decomposition in Python

Recommender systems have become increasingly popular in recent years, and some of the largest websites in the world use them to predict the likelihood of a user taking an action on an item. In the world of Netflix, this means recommending movies similar to the ones you have seen. In the world of dating, this means suggesting matches similar to people you already showed interest in!

My path to recommenders has been an unusual one: from a Software Engineer to working on matching algorithms at a dating company, with a little background in machine learning. With my knowledge of Python and the use of basic SVD (Singular Value Decomposition) frameworks, I was able to understand SVDs from a practical standpoint, focusing on what you can do with them rather than on the science.

In my talk, you will learn two practical ways of generating recommendations using SVDs: matrix factorization and item similarity. We will learn the high-level components of SVD the "doer way": by implementing a simple movie recommendation engine with the help of Jupyter notebooks, the MovieLens dataset, and the Surprise recommendation package.

## Index

 - Downloading and exploring the MovieLens dataset
 - ROC Curve

In [2]:
movie_data_columns = [
    'movie_id', 'title', 'release_date', 'video_release_date', 'url',
    'unknown', 'Action', 'Adventure', 'Animation', "Children's",
    'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
    'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
    'War', 'Western'
]

movie_data = pd.read_csv(
    'datasets/ml-100k/u.item', 
    sep = '|', 
    encoding = "ISO-8859-1", 
    header = None, 
    names = movie_data_columns,
    index_col = 'movie_id'
)
movie_data['release_date'] = pd.to_datetime(movie_data['release_date'])

In [3]:
ratings_data = pd.read_csv(
    'datasets/ml-100k/u.data',
    sep = '\t',
    encoding = "ISO-8859-1",
    header = None,
    names=['user_id', 'movie_id', 'rating', 'timestamp']
)
ratings_data.head(10)

,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
5,298,474,4,884182806
6,115,265,2,881171488
7,253,465,5,891628467
8,305,451,3,886324817
9,6,86,3,883603013


## Explanation of the above table

- User **196** rated movie **242** with a score of **3** 
- User **186** rated movie **302** with a score of **3** 
- User **22** rated movie **377** with a score of **1** 
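
Each row is one (user, movie, rating) observation. Matrix factorization works on the same data viewed as a sparse user-by-movie matrix, where most cells are empty. A minimal sketch of that view, using only the first three rows of the table above (the names `sample` and `user_item` are illustrative and not part of the notebook):

```python
import pandas as pd

# The first three rows of ratings_data.head(10), copied from the table above.
sample = pd.DataFrame(
    {
        "user_id": [196, 186, 22],
        "movie_id": [242, 302, 377],
        "rating": [3, 3, 1],
    }
)

# One row per user, one column per movie; NaN marks "not rated".
user_item = sample.pivot(index="user_id", columns="movie_id", values="rating")
print(user_item)

# Look up a single rating by (user, movie).
rating_22_377 = user_item.loc[22, 377]
```

Even in this tiny example six of the nine cells are empty; predicting those missing cells is the job of the recommender.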

In [7]:
ratings_data[ratings_data['movie_id'] == 242]['rating'].describe()

count    117.000000
mean       3.991453
std        0.995643
min        1.000000
25%        4.000000
50%        4.000000
75%        5.000000
max        5.000000
Name: rating, dtype: float64

In [14]:

movie_data.loc[242]

title                                                    Kolya (1996)
release_date                                      1997-01-24 00:00:00
video_release_date                                                NaN
url                   http://us.imdb.com/M/title-exact?Kolya%20(1996)
unknown                                                             0
Action                                                              0
Adventure                                                           0
Animation                                                           0
Children's                                                          0
Comedy                                                              1
Crime                                                               0
Documentary                                                         0
Drama                                                               0
Fantasy                                                             0
Film-Noir           

In [41]:
from abc import ABC, abstractmethod
from typing import List


class BaseRecommender(ABC):
    
    @abstractmethod
    def predict_for_user(self, user_id: int, n: int) -> List[str]:
        ...



class RandomRecommender(BaseRecommender):
    
    def __init__(self, ratings: pd.DataFrame, movies: pd.DataFrame) -> None:
        self.ratings = ratings
        self.movies = movies
    
    def predict_for_user(self, user_id: int, n: int) -> List[str]:
        
        # Get all movies seen by user
        movies_seen_by_user = self.ratings[self.ratings['user_id'] == user_id]['movie_id']
        
        # Get all movies, and remove the ones seen by the user
        movies_not_seen_by_user: pd.DataFrame = self.movies[~self.movies.index.isin(movies_seen_by_user)]
            
        # Return a random sample of N films
        return list(movies_not_seen_by_user.sample(n)['title'])
        

        
random_recommender = RandomRecommender(ratings_data, movie_data)
print(random_recommender.predict_for_user(186, 5))

['Big Green, The (1995)', 'Ed Wood (1994)', 'Alphaville (1965)', 'Apple Dumpling Gang, The (1975)', 'Bullets Over Broadway (1994)']


In [61]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate, train_test_split


data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=.25)

model = SVD()
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x110dc02b0>
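
According to the Surprise documentation, `SVD` is the model popularized by Simon Funk during the Netflix Prize: a rating is predicted as a global mean plus a user bias, an item bias, and the dot product of a user and an item factor vector, and the parameters are fitted by stochastic gradient descent. The following is a toy NumPy sketch of that idea, not Surprise's actual implementation; the ratings, the learning rate `lr`, the regularization `reg`, and all variable names are made up for illustration.

```python
import numpy as np

# Toy (user, item, rating) triples, made up for illustration only.
ratings = [
    (0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
    (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0),
]
n_users, n_items, n_factors = 3, 3, 2

rng = np.random.default_rng(0)
mu = np.mean([r for _, _, r in ratings])       # global mean rating
b_u = np.zeros(n_users)                        # user biases
b_i = np.zeros(n_items)                        # item biases
p = rng.normal(0, 0.1, (n_users, n_factors))   # user factors
q = rng.normal(0, 0.1, (n_items, n_factors))   # item factors


def predict(u, i):
    # Estimated rating: mean + user bias + item bias + factor interaction.
    return mu + b_u[u] + b_i[i] + q[i] @ p[u]


def rmse():
    # Root mean squared error on the training triples.
    return np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in ratings]))


lr, reg = 0.01, 0.02  # hypothetical learning rate and regularization
rmse_before = rmse()
for _ in range(200):
    for u, i, r in ratings:
        err = r - predict(u, i)
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        # Update both factor vectors from their values before this step.
        p_u_old = p[u].copy()
        p[u] += lr * (err * q[i] - reg * p[u])
        q[i] += lr * (err * p_u_old - reg * q[i])
rmse_after = rmse()
print(f"training RMSE: {rmse_before:.3f} -> {rmse_after:.3f}")
```

The training error only shows that the factors fit the ratings they have seen; judging a real model requires held-out data such as the `testset` created above.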

In [80]:


class MatrixFactorizationRecommender(BaseRecommender):
    
    def __init__(self, ratings: pd.DataFrame, movies: pd.DataFrame, svd: SVD) -> None:
        self.svd = svd
        self.ratings = ratings
        self.movies = movies
    
    # Returns titles with their predicted scores, so the result is a
    # DataFrame rather than the List[str] declared on the base class.
    def predict_for_user(self, user_id: int, n: int) -> pd.DataFrame:
        
        # Get all movies seen by user
        movies_seen_by_user = self.ratings[self.ratings['user_id'] == user_id]['movie_id']
        
        # Get all movies, and remove the ones seen by the user.
        # .copy() gives us our own frame, so adding a column below does not
        # trigger pandas' SettingWithCopyWarning.
        movies_not_seen_by_user: pd.DataFrame = self.movies[~self.movies.index.isin(movies_seen_by_user)].copy()
            
        def _inner_predict(idx):
            # The built-in ml-100k dataset stores raw ids as strings
            return self.svd.predict(str(user_id), str(idx)).est
        
        movies_not_seen_by_user['score'] = movies_not_seen_by_user.index.map(_inner_predict)
        return movies_not_seen_by_user.sort_values('score', ascending=False).iloc[:n][['title', 'score']]

mf_recommender = MatrixFactorizationRecommender(ratings_data, movie_data, model)
mf_recommender.predict_for_user(186, 5)

movie_id,title,score
483,Casablanca (1942),4.908095
357,One Flew Over the Cuckoo's Nest (1975),4.664586
272,Good Will Hunting (1997),4.622444
479,Vertigo (1958),4.579681
96,Terminator 2: Judgment Day (1991),4.570107
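
The second approach named in the introduction, item similarity, can reuse the factors learned by the same model: movies whose factor vectors point in a similar direction tend to be rated alike. The sketch below uses made-up two-dimensional vectors and hypothetical helper names (`cosine_similarity`, `most_similar`). With a fitted Surprise model the rows would presumably come from `model.qi`, indexed by the inner ids returned by `model.trainset.to_inner_iid`, but that wiring is an assumption and is not shown here.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two factor vectors, in [-1, 1].
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def most_similar(item_factors: np.ndarray, item: int, n: int) -> list:
    # Score every other item against `item` and keep the n best.
    scores = [
        (other, cosine_similarity(item_factors[item], item_factors[other]))
        for other in range(len(item_factors))
        if other != item
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up factor vectors for four items (one row per item).
toy_factors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
nearest = most_similar(toy_factors, item=0, n=2)
print(nearest)
```

Cosine similarity ignores vector length, so a movie with large factors does not dominate the ranking simply because of its magnitude.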


In [77]:
model.predict('186', '242')  # raw ids in the built-in dataset are strings

model.trainset._raw2inner_id_users

{'1': 442,
 '10': 23,
 '100': 573,
 '101': 141,
 '102': 287,
 '103': 855,
 '104': 399,
 '105': 474,
 '106': 808,
 '107': 335,
 '108': 598,
 '109': 229,
 '11': 60,
 '110': 314,
 '111': 906,
 '112': 325,
 '113': 515,
 '114': 418,
 '115': 587,
 '116': 116,
 '117': 35,
 '118': 447,
 '119': 382,
 '12': 709,
 '120': 908,
 '121': 788,
 '122': 626,
 '123': 685,
 '124': 811,
 '125': 166,
 '126': 269,
 '127': 311,
 '128': 517,
 '129': 797,
 '13': 159,
 '130': 86,
 '131': 915,
 '132': 752,
 '133': 338,
 '134': 643,
 '135': 620,
 '136': 695,
 '137': 479,
 '138': 624,
 '139': 611,
 '14': 243,
 '140': 640,
 '141': 175,
 '142': 688,
 '143': 368,
 '144': 150,
 '145': 220,
 '146': 936,
 '147': 875,
 '148': 604,
 '149': 546,
 '15': 29,
 '150': 537,
 '151': 323,
 '152': 318,
 '153': 762,
 '154': 776,
 '155': 365,
 '156': 234,
 '157': 634,
 '158': 226,
 '159': 184,
 '16': 485,
 '160': 162,
 '161': 772,
 '162': 769,
 '163': 16,
 '164': 632,
 '165': 764,
 '166': 478,
 '167': 122,
 '168': 791,
 '169': 584,
 