# Building a Recommender with Ratings Data

Using GraphLab Create we can take an SFrame containing user ratings for movies, and quickly create a recommender.

In [7]:
#importing necessary packages
import graphlab as gl
gl.canvas.set_target('ipynb')
gl

<module 'graphlab' from '/home/gagan/anaconda2/envs/dato-env/lib/python2.7/site-packages/graphlab/__init__.pyc'>

# Prepare the Data

We downloaded this data set from https://www.movielens.org. The MovieLens 1M Dataset is a stable benchmark dataset.
It contains 1 million ratings from 6000 users on 4000 movies. Released 2/2003. Permalink: http://grouplens.org/datasets/movielens/1m/

In [10]:
#loading datasets
rating_file = "/home/gagan/Desktop/MovieDataset/ratings.csv"
data = gl.SFrame.read_csv(rating_file , header=False)
data.rename({'X1':'user_id','X2':'movie_id','X3':'rating','X4':'timestamp'})

users_file = "/home/gagan/Desktop/MovieDataset/users.csv"
users = gl.SFrame.read_csv(users_file , header=False)
users.rename({'X1':'user_id','X2':'gender','X3':'age','X4':'occupation','X5':'zip-code'})

movies_file = "/home/gagan/Desktop/MovieDataset/Movies.csv"
items = gl.SFrame.read_csv(movies_file , header=False)
items.rename({'X1':'movie_id','X2':'title','X3':'genre'})


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


movie_id,title,genre
1,Toy Story (1995),Animation|Children's|Comedy ...
2,Jumanji (1995),Adventure|Children's|Fantasy ...
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama
5,Father of the Bride Part II (1995) ...,Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children's
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller


# Statistics about data

We can display basic summary statistics for each dataset with show().

In [11]:
#showing statistics of ratings dataset
data.show()

In [12]:
#showing statistics of users dataset
users.show()

In [13]:
#showing statistics of movies dataset
items.show()

In [14]:
#showing the first 10 rows of the movies dataset
items.head()

movie_id,title,genre
1,Toy Story (1995),Animation|Children's|Comedy ...
2,Jumanji (1995),Adventure|Children's|Fantasy ...
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama
5,Father of the Bride Part II (1995) ...,Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children's
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller


In [15]:
# Joining rating dataset with movies and users dataset
data = data.join(items , on='movie_id')

data = data.join(users , on='user_id')

data

user_id,movie_id,rating,timestamp,title,genre,gender,age,occupation,zip-code
1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975) ...,Drama,F,1,10,48067
1,661,3,978302109,James and the Giant Peach (1996) ...,Animation|Children's|Musical ...,F,1,10,48067
1,914,3,978301968,My Fair Lady (1964),Musical|Romance,F,1,10,48067
1,3408,4,978300275,Erin Brockovich (2000),Drama,F,1,10,48067
1,2355,5,978824291,"Bug's Life, A (1998)",Animation|Children's|Comedy ...,F,1,10,48067
1,1197,3,978302268,"Princess Bride, The (1987) ...",Action|Adventure|Comedy|Romance ...,F,1,10,48067
1,1287,5,978302039,Ben-Hur (1959),Action|Adventure|Drama,F,1,10,48067
1,2804,5,978300719,"Christmas Story, A (1983)",Comedy|Drama,F,1,10,48067
1,594,4,978302268,Snow White and the Seven Dwarfs (1937) ...,Animation|Children's|Musical ...,F,1,10,48067
1,919,4,978301368,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical ...,F,1,10,48067


# Data Split

To evaluate the performance of our model, we randomly split the observations in our data set into two partitions: we will use training_data when creating our model and test_data for evaluating its performance.

In [21]:
# splitting data for training and testing
training_data, test_data = gl.recommender.util.random_split_by_user(data, 'user_id', 'movie_id')
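
The idea of splitting ratings per user can be sketched in plain Python, independent of GraphLab. The helper name split_by_user, the (user_id, movie_id, rating) tuple layout, and the test_fraction parameter are assumptions made for illustration; this is not a reproduction of what random_split_by_user does internally.

```python
import random
from collections import defaultdict

def split_by_user(rows, test_fraction=0.2, seed=0):
    """Hold out a fraction of each user's ratings for testing.

    rows: list of (user_id, movie_id, rating) tuples (assumed layout).
    Returns (train_rows, test_rows).
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    train, test = [], []
    for user_id in sorted(by_user):
        user_rows = list(by_user[user_id])
        rng.shuffle(user_rows)
        # The first n_test shuffled rows of this user go to the test set.
        n_test = int(len(user_rows) * test_fraction)
        test.extend(user_rows[:n_test])
        train.extend(user_rows[n_test:])
    return train, test

# Toy data: 2 users with 5 ratings each.
toy = [(u, m, 3) for u in (1, 2) for m in range(5)]
train_rows, test_rows = split_by_user(toy, test_fraction=0.2)
```

Every user keeps most of their ratings in the training set, so the model can still learn something about each user it is later evaluated on.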

# Creating baseline recommender model

Now that we have a train and test set, let's come up with a very simple way of predicting ratings. That way when we try more complicated things, we'll have some baseline for comparison.

GraphLab's PopularityRecommender provides this functionality. When given a target column, it stores the mean rating per item: asked to predict a user's rating for a particular item, it predicts the mean of all ratings for that item and pays no attention to user information.

In order to use the PopularityRecommender, all we need to do is pass its create function the data and tell it the pertinent column names.
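
As a rough illustration of the mean-rating-per-item idea, here is a minimal sketch in plain Python. The function names fit_item_means and predict_rating are hypothetical, and the fallback to the global mean for unseen items is an assumption of this sketch rather than documented GraphLab behavior.

```python
from collections import defaultdict

def fit_item_means(rows):
    """rows: iterable of (user_id, movie_id, rating) tuples (assumed layout).

    Returns (item_means, global_mean).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _, movie_id, rating in rows:
        totals[movie_id] += rating
        counts[movie_id] += 1
    item_means = {m: totals[m] / counts[m] for m in totals}
    global_mean = sum(totals.values()) / sum(counts.values())
    return item_means, global_mean

def predict_rating(item_means, global_mean, movie_id):
    # The user is ignored entirely; unseen items fall back to the
    # global mean (an assumption of this sketch).
    return item_means.get(movie_id, global_mean)

toy_ratings = [(1, 10, 5), (2, 10, 3), (1, 20, 2)]
item_means, global_mean = fit_item_means(toy_ratings)
```

Because the prediction depends only on the item, every user receives the same score for a given movie.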


In [22]:
#Creating baseline model-popularity model
model_popularity = gl.popularity_recommender.create(training_data , 'user_id' , 'movie_id' )

In [23]:
#showing Statistics of the model
model_popularity

Class                           : PopularityRecommender

Schema
------
User ID                         : user_id
Item ID                         : movie_id
Target                          : None
Additional observation features : 0
Number of user side features    : 0
Number of item side features    : 0

Statistics
----------
Number of observations          : 967712
Number of users                 : 6040
Number of items                 : 3697

Training summary
----------------
Training time                   : 0.0155

Model Parameters
----------------
Model class                     : PopularityRecommender

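Now that we have a (simple) model, we need a way to measure the accuracy of its predictions so that we can compare the performance of different models. The Root Mean Squared Error (RMSE) is one of the most common accuracy measures. RMSE is only meaningful when the predictions are on the same scale as the ratings: the model above was created without a target column (Target: None in the summary), so its predictions are likely popularity scores rather than ratings on the 1 to 5 scale, which would explain the very large value below.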

In [24]:
baseline_rmse = gl.evaluation.rmse(test_data['rating'], model_popularity.predict(test_data))
print baseline_rmse

1017.41164026
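
For reference, RMSE is the square root of the mean of the squared differences between the true and predicted values. The sketch below computes it in plain Python; the function name rmse and the toy numbers are illustrative and are not taken from the dataset.

```python
import math

def rmse(targets, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / float(len(squared)))

# Errors are 1, 0 and 2, so the mean squared error is 5/3.
example_rmse = rmse([5, 3, 4], [4, 3, 2])
```

On a 1 to 5 rating scale, a useful rating predictor should produce an RMSE well below the width of the scale.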


# Creating ItemSimilarityModels

In [25]:
#creating Item Similarity Recommender Model
model_jaccard = gl.item_similarity_recommender.create(training_data , 'user_id', 'movie_id' )

In [26]:
#Showing Statistics of our model
model_jaccard

Class                           : ItemSimilarityRecommender

Schema
------
User ID                         : user_id
Item ID                         : movie_id
Target                          : None
Additional observation features : 0
Number of user side features    : 0
Number of item side features    : 0

Statistics
----------
Number of observations          : 967712
Number of users                 : 6040
Number of items                 : 3697

Training summary
----------------
Training time                   : 8.0717

Model Parameters
----------------
Model class                     : ItemSimilarityRecommender
only_top_k                      : 100
threshold                       : 0.001
similarity_type                 : jaccard
training_method                 : auto

In [32]:
jaccard_rmse = gl.evaluation.rmse(test_data['rating'], model_jaccard.predict(test_data))
print jaccard_rmse

3.70388758088


# Comparing performance of baseline and item_similarity_model

In [27]:
#comparing performance of these models
performance = gl.compare(test_data , [model_popularity,model_jaccard])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    | 0.243486973948 | 0.00986439814914 |
|   2    | 0.233466933868 | 0.0188487381618  |
|   3    | 0.231462925852 | 0.0286799187444  |
|   4    | 0.223947895792 | 0.0367541980367  |
|   5    | 0.212224448898 | 0.0423273852804  |
|   6    | 0.202237808951 | 0.0481091356844  |
|   7    | 0.193386773547 | 0.0530568714778  |
|   8    | 0.187124248497 | 0.0575237288517  |
|   9    | 0.181808060566 | 0.0633743942538  |
|   10   | 0.178356713427 | 0.0691317782722  |
+--------+----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.3416833667

In [28]:
# showing precision-recall graph of these two models
gl.show_comparison(performance,[model_popularity,model_jaccard])
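
The precision and recall columns reported above can be understood through a small sketch. The function precision_recall_at_k is a hypothetical helper; it assumes precision is the fraction of the top-k recommendations that appear in the user's held-out items, and recall is the fraction of held-out items that were recommended. Conventions differ between libraries, so treat this as an illustration rather than GraphLab's exact computation.

```python
def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of item ids; relevant: set of held-out item ids."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

# Toy example: one of the top-2 recommendations is in the held-out set of 3.
p, r = precision_recall_at_k([589, 1291, 457, 110], {589, 110, 858}, k=2)
```

As the cutoff grows, recall can only stay flat or rise, while precision typically falls, which matches the trend in the tables.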


# Creating Similarity Models with different similarity_methods

The ItemSimilarityRecommender supports three popular methods for measuring the similarity between items:
Jaccard, Pearson, and cosine.
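
To make the three measures concrete, here is a minimal sketch using their textbook definitions on toy data. The function names and the dict-of-ratings layout are assumptions for illustration; GraphLab's internal definitions may differ in details such as which users are included or how ratings are centered.

```python
import math

def jaccard(users_a, users_b):
    """Overlap between the sets of users who rated each item."""
    union = users_a | users_b
    return len(users_a & users_b) / float(len(union)) if union else 0.0

def cosine(ratings_a, ratings_b):
    """ratings_*: dict user_id -> rating; a missing rating counts as 0."""
    dot = sum(r * ratings_b[u] for u, r in ratings_a.items() if u in ratings_b)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pearson(ratings_a, ratings_b):
    """Correlation of ratings over the users who rated both items."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    xs = [ratings_a[u] for u in common]
    ys = [ratings_b[u] for u in common]
    mean_x = sum(xs) / float(len(xs))
    mean_y = sum(ys) / float(len(ys))
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0

# Two toy items rated by overlapping sets of users.
item_a = {1: 5, 2: 3, 3: 4}
item_b = {2: 1, 3: 2, 4: 5}
```

Jaccard ignores the rating values and looks only at who rated what, whereas Pearson depends on the ratings themselves, which is why the Pearson model in this notebook is created with target='rating'.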

In [29]:
#creating another model with pearson as the similarity method
model_pearson = gl.item_similarity_recommender.create(training_data , 'user_id', 'movie_id' , target='rating',  
                                              similarity_type='pearson' )

In [30]:
#showing statistics of model_pearson
model_pearson

Class                           : ItemSimilarityRecommender

Schema
------
User ID                         : user_id
Item ID                         : movie_id
Target                          : rating
Additional observation features : 0
Number of user side features    : 0
Number of item side features    : 0

Statistics
----------
Number of observations          : 967712
Number of users                 : 6040
Number of items                 : 3697

Training summary
----------------
Training time                   : 8.151

Model Parameters
----------------
Model class                     : ItemSimilarityRecommender
only_top_k                      : 100
threshold                       : 0.001
similarity_type                 : pearson
training_method                 : auto

In [33]:
pearson_rmse = gl.evaluation.rmse(test_data['rating'], model_pearson.predict(test_data))
print pearson_rmse

0.858916122729


In [31]:
#creating third model with similarity method as cosine
model_cosine = gl.item_similarity_recommender.create(training_data , 'user_id', 'movie_id' ,  
                                                     similarity_type='cosine' )

In [19]:
#showing statistics of model_cosine
model_cosine

Class                           : ItemSimilarityRecommender

Schema
------
User ID                         : user_id
Item ID                         : movie_id
Target                          : None
Additional observation features : 0
Number of user side features    : 0
Number of item side features    : 0

Statistics
----------
Number of observations          : 967712
Number of users                 : 6040
Number of items                 : 3697

Training summary
----------------
Training time                   : 10.8131

Model Parameters
----------------
Model class                     : ItemSimilarityRecommender
only_top_k                      : 100
threshold                       : 0.001
similarity_type                 : cosine
training_method                 : auto

In [35]:
cosine_rmse = gl.evaluation.rmse(test_data['rating'], model_cosine.predict(test_data))
print cosine_rmse

3.6757253456


In [36]:
#creating a ranking factorization model that also uses user and item side data
model_matrix = gl.ranking_factorization_recommender.create(training_data, 'user_id', 'movie_id',
                                                          user_data=users, item_data=items,
                                                          max_iterations=20, num_factors=5,
                                                          regularization=0.01)

In [37]:
model_matrix

Class                           : RankingFactorizationRecommender

Schema
------
User ID                         : user_id
Item ID                         : movie_id
Target                          : None
Additional observation features : 8
Number of user side features    : 5
Number of item side features    : 3

Statistics
----------
Number of observations          : 967712
Number of users                 : 6040
Number of items                 : 3883

Training summary
----------------
Training time                   : 339.302

Model Parameters
----------------
Model class                     : RankingFactorizationRecommender
num_factors                     : 5
binary_target                   : 1
side_data_factorization         : 1
solver                          : auto
nmf                             : 0
max_iterations                  : 20

Regularization Settings
-----------------------
regularization                  : 0.01
regularization_type             : normal
linear_regularizat

In [38]:
matrix_rmse = gl.evaluation.rmse(test_data['rating'], model_matrix.predict(test_data))
print matrix_rmse

3.21493632984


# Comparing the performance of all these models

In [39]:
# comparing performance of all models
compare_all = gl.compare(test_data , [model_popularity,model_jaccard,model_pearson,model_cosine,model_matrix])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    | 0.243486973948 | 0.00986439814914 |
|   2    | 0.233466933868 | 0.0188487381618  |
|   3    | 0.231462925852 | 0.0286799187444  |
|   4    | 0.223947895792 | 0.0367541980367  |
|   5    | 0.212224448898 | 0.0423273852804  |
|   6    | 0.202237808951 | 0.0481091356844  |
|   7    | 0.193386773547 | 0.0530568714778  |
|   8    | 0.187124248497 | 0.0575237288517  |
|   9    | 0.181808060566 | 0.0633743942538  |
|   10   | 0.178356713427 | 0.0691317782722  |
+--------+----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.3416833667

In [40]:
#plotting precision-recall curve for all models
gl.show_comparison(compare_all,[model_popularity,model_jaccard,model_pearson,model_cosine,model_matrix])

# Getting similar items

In [41]:
#getting movie details
items[items['movie_id']==1287]

movie_id,title,genre
1287,Ben-Hur (1959),Action|Adventure|Drama


In [42]:
#getting similar items

model_jaccard.get_similar_items([1287] , k=5).join(items , on={'similar':'movie_id'}) # movie_id 1287 is Ben-Hur

movie_id,similar,score,rank,title,genre
1287,1262,0.278937381404,5,"Great Escape, The (1963)",Adventure|War
1287,1954,0.280231716148,3,Rocky (1976),Action|Drama
1287,2366,0.283121597096,2,King Kong (1933),Action|Adventure|Horror
1287,2944,0.279287722587,4,"Dirty Dozen, The (1967)",Action|War
1287,2947,0.288461538462,1,Goldfinger (1964),Action


In [43]:
model_pearson.get_similar_items([1287] , k=5).join(items , on={'similar':'movie_id'})

movie_id,similar,score,rank,title,genre
1287,1276,0.140326410547,4,Cool Hand Luke (1967),Comedy|Drama
1287,1954,0.147160898384,3,Rocky (1976),Action|Drama
1287,2067,0.133204922142,5,Doctor Zhivago (1965),Drama|Romance|War
1287,2728,0.19250143265,1,Spartacus (1960),Drama
1287,2947,0.161271236883,2,Goldfinger (1964),Action


In [44]:
model_cosine.get_similar_items([1287] , k=5).join(items , on={'similar':'movie_id'})

movie_id,similar,score,rank,title,genre
1287,1291,0.440432732899,4,Indiana Jones and the Last Crusade (1989) ...,Action|Adventure
1287,1954,0.450051421739,2,Rocky (1976),Action|Drama
1287,2366,0.441647765351,3,King Kong (1933),Action|Adventure|Horror
1287,2944,0.436635778098,5,"Dirty Dozen, The (1967)",Action|War
1287,2947,0.450952267513,1,Goldfinger (1964),Action


In [46]:
model_matrix.get_similar_items([1287],k=5).join(items , on={'similar':'movie_id'})

movie_id,similar,score,rank,title,genre
1287,2,-1.0,5,Jumanji (1995),Adventure|Children's|Fantasy ...
1287,3,-1.0,4,Grumpier Old Men (1995),Comedy|Romance
1287,4,-1.0,3,Waiting to Exhale (1995),Comedy|Drama
1287,5,-1.0,2,Father of the Bride Part II (1995) ...,Comedy
1287,6,-1.0,1,Heat (1995),Action|Crime|Thriller


In [47]:
model_popularity.get_similar_items([1287],k=5).join(items , on={'similar':'movie_id'})

movie_id,similar,score,rank,title,genre
1287,288,1.0,2,Natural Born Killers (1994) ...,Action|Thriller
1287,300,0.999999856737,5,Quiz Show (1994),Drama
1287,1235,1.0,1,Harold and Maude (1971),Comedy
1287,2161,0.999999856737,4,"NeverEnding Story, The (1984) ...",Adventure|Children's|Fantasy ...
1287,2572,0.999999856737,3,10 Things I Hate About You (1999) ...,Comedy|Romance


# Generating recommendations

In [48]:
recsys = model_cosine.recommend()

In [49]:
recsys

user_id,movie_id,score,rank
1,1196,0.357700357281,1
1,1265,0.312288670805,2
1,1198,0.30882663893,3
1,1210,0.301459388527,4
1,2987,0.29966600486,5
1,2716,0.295848966179,6
1,1580,0.292194464521,7
1,593,0.283710220085,8
1,318,0.281332273949,9
1,1291,0.277629220575,10


# Generating recommendations for a particular user

In [50]:
# which movies has the user with id 4 seen
data[data['user_id']==4].join(items , on='movie_id')

user_id,movie_id,rating,timestamp,title,genre,gender,age,occupation
4,260,5,978294199,Star Wars: Episode IV - A New Hope (1977) ...,Action|Adventure|Fantasy|Sci-Fi ...,M,45,7
4,480,4,978294008,Jurassic Park (1993),Action|Adventure|Sci-Fi,M,45,7
4,1036,4,978294282,Die Hard (1988),Action|Thriller,M,45,7
4,1097,4,978293964,E.T. the Extra-Terrestrial (1982) ...,Children's|Drama|Fantasy|Sci-Fi ...,M,45,7
4,1196,2,978294199,Star Wars: Episode V - The Empire Strikes Back ...,Action|Adventure|Drama|Sci-Fi|War ...,M,45,7
4,1198,5,978294199,Raiders of the Lost Ark (1981) ...,Action|Adventure,M,45,7
4,1201,5,978294230,"Good, The Bad and The Ugly, The (1966) ...",Action|Western,M,45,7
4,1210,3,978293924,Star Wars: Episode VI - Return of the Jedi (1 ...,Action|Adventure|Romance|Sci-Fi|War ...,M,45,7
4,1214,4,978294260,Alien (1979),Action|Horror|Sci-Fi|Thriller ...,M,45,7
4,1240,5,978294260,"Terminator, The (1984)",Action|Sci-Fi|Thriller,M,45,7

zip-code,title.1,genre.1
2460,Star Wars: Episode IV - A New Hope (1977) ...,Action|Adventure|Fantasy|Sci-Fi ...
2460,Jurassic Park (1993),Action|Adventure|Sci-Fi
2460,Die Hard (1988),Action|Thriller
2460,E.T. the Extra-Terrestrial (1982) ...,Children's|Drama|Fantasy|Sci-Fi ...
2460,Star Wars: Episode V - The Empire Strikes Back ...,Action|Adventure|Drama|Sci-Fi|War ...
2460,Raiders of the Lost Ark (1981) ...,Action|Adventure
2460,"Good, The Bad and The Ugly, The (1966) ...",Action|Western
2460,Star Wars: Episode VI - Return of the Jedi (1 ...,Action|Adventure|Romance|Sci-Fi|War ...
2460,Alien (1979),Action|Horror|Sci-Fi|Thriller ...
2460,"Terminator, The (1984)",Action|Sci-Fi|Thriller


In [51]:
#recommending 20 movies to user with id = 4
model_cosine.recommend(users=[4] , k=20).join(items , on='movie_id')

user_id,movie_id,score,rank,title,genre
4,110,0.431652915777,15,Braveheart (1995),Action|Drama|War
4,457,0.487654445808,6,"Fugitive, The (1993)",Action|Thriller
4,589,0.513913049736,1,Terminator 2: Judgment Day (1991) ...,Action|Sci-Fi|Thriller
4,592,0.466959376331,8,Batman (1989),Action|Adventure|Crime|Drama ...
4,858,0.4654320524,9,"Godfather, The (1972)",Action|Crime|Drama
4,1197,0.454312913513,11,"Princess Bride, The (1987) ...",Action|Adventure|Comedy|Romance ...
4,1200,0.497097760443,5,Aliens (1986),Action|Sci-Fi|Thriller|War ...
4,1222,0.423635445763,16,Full Metal Jacket (1987),Action|Drama|War
4,1270,0.469002840768,7,Back to the Future (1985),Comedy|Sci-Fi
4,1291,0.508515858369,2,Indiana Jones and the Last Crusade (1989) ...,Action|Adventure


# Recommendations for new users

In [57]:
#getting movie details
items[items['movie_id']==565]

movie_id,title,genre
565,Cronos (1992),Horror


In [58]:
#recommendations for new users
#adding new user data
new_users = gl.SFrame()
new_users['movie_id'] = [565]
new_users['user_id'] = 99999


In [59]:
#recommending movies to new user
model_cosine.recommend(users=[99999] , new_observation_data=new_users ).join(items , on='movie_id')

user_id,movie_id,score,rank,title,genre
99999,735,0.276172385369,1,Cemetery Man (Dellamorte Dellamore) (1994) ...,Comedy|Horror
99999,1241,0.235093546661,9,Braindead (1992),Comedy|Horror
99999,2159,0.249186860723,6,Henry: Portrait of a Serial Killer (1990) ...,Crime|Horror
99999,2517,0.239471828661,8,Christine (1983),Horror
99999,2646,0.250162919535,5,House of Dracula (1945),Horror
99999,2647,0.234701397019,10,House of Frankenstein (1944) ...,Horror
99999,3013,0.262436176696,3,Bride of Re-Animator (1990) ...,Comedy|Horror
99999,3018,0.262711864407,2,Re-Animator (1985),Horror
99999,3550,0.250968599301,4,"Hunger, The (1983)",Horror
99999,3935,0.244316380409,7,Kronos (1973),Horror
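
One way to see why a single observation is enough to produce recommendations for an unseen user is the sketch below, which ranks unseen items by their mean similarity to the items the user has already interacted with. The helper recommend_from_seen, the toy item_users mapping, and the choice of Jaccard similarity are all assumptions for illustration; this is not the scoring rule GraphLab necessarily applies.

```python
def recommend_from_seen(seen_items, item_users, k=3):
    """Rank unseen items by mean Jaccard similarity to the seen items.

    item_users: dict movie_id -> set of user_ids who rated it (toy layout).
    Returns a list of (movie_id, score) pairs, best first.
    """
    def jaccard(a, b):
        union = a | b
        return len(a & b) / float(len(union)) if union else 0.0

    scores = {}
    for candidate, users in item_users.items():
        if candidate in seen_items:
            continue  # never recommend something the user already has
        sims = [jaccard(users, item_users[s]) for s in seen_items if s in item_users]
        if sims:
            scores[candidate] = sum(sims) / len(sims)
    # Sort by descending score, breaking ties by movie_id for determinism.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:k]

# Toy co-rating data; the user ids here are made up.
toy_item_users = {
    565: {1, 2, 3},
    735: {1, 2, 3, 4},
    3018: {2, 3},
    1: {7, 8, 9},
}
new_user_recs = recommend_from_seen([565], toy_item_users, k=2)
```

Because the scores depend only on item-to-item similarities learned at training time, no retraining is needed when a new user's observations arrive.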


# Visualization of recommender engine and accuracy

In [62]:
view = model_cosine.views.overview(
        validation_set=test_data,
        item_data=items,
        item_name_column='movie_id')



In [63]:
view.show()

