In [1]:
import graphlab as gl
gl.canvas.set_target('ipynb')

The following code snippet parses the books data provided at the training. If cached SFrames already exist under `books/`, it loads those instead of re-reading the CSV files.

In [2]:
import os
if os.path.exists('books/ratings'):
    ratings = gl.SFrame('books/ratings')
    items = gl.SFrame('books/items')
    users = gl.SFrame('books/users')
else:
    ratings = gl.SFrame.read_csv('books/book-ratings.csv')
    ratings.save('books/ratings')
    items = gl.SFrame.read_csv('books/book-data.csv')
    items.save('books/items')
    users = gl.SFrame.read_csv('books/user-data.csv')
    users.save('books/users')

[INFO] graphlab.cython.cy_server: GraphLab Create v2.0 started. Logging: /tmp/graphlab_server_1468091566.log


Visually explore the above data using GraphLab Canvas.

In [3]:
ratings.show()

## Recommendation systems

In this section we will build a model that can be used to recommend new books to users.

### Creating a Model

Use `gl.recommender.create()` to create a model that can be used to recommend books to each user.

In [4]:
m = gl.recommender.create(ratings, user_id='name', item_id='book')

Print a summary of the model by simply entering the name of the object.

In [5]:
m

Class                            : ItemSimilarityRecommender

Schema
------
User ID                          : name
Item ID                          : book
Target                           : None
Additional observation features  : 0
User side features               : []
Item side features               : []

Statistics
----------
Number of observations           : 98077
Number of users                  : 31856
Number of items                  : 11121

Training summary
----------------
Training time                    : 3.5958

Model Parameters
----------------
Model class                      : ItemSimilarityRecommender
threshold                        : 0.001
similarity_type                  : jaccard
training_method                  : auto

Other Settings
--------------
degree_approximation_threshold   : 4096
sparse_density_estimation_sample_size : 4096
max_data_passes                  : 4096
target_memory_usage              : 8589934592
seed_item_set_size               : 50
nearest_

Get all unique users from the first 10000 observations and save them as a variable called `users`. Note that this replaces the `users` SFrame loaded earlier.

In [6]:
users = ratings.head(10000)['name'].unique()

Get 20 recommendations for each user in your list of users. Save these as a new SFrame called `recs`.

In [7]:
recs = m.recommend(users, k=20)

### Inspecting your model

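Get an SFrame of the most similar items for each observed item. Called without arguments, as below, `get_similar_items()` returns 10 similar items per item, which matches the ranks shown in the output further down.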

In [8]:
sims = m.get_similar_items()
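
The model summary above reports `similarity_type : jaccard`. As a rough illustration of what that score measures, here is a minimal pure-Python sketch that computes Jaccard similarity between items from (user, item) pairs. The function name `jaccard_item_similarity` and the toy data are hypothetical, and this is not how GraphLab computes it internally (the library uses approximations and thresholds listed in the summary).

```python
from collections import defaultdict
from itertools import combinations

def jaccard_item_similarity(observations):
    """Return {(item_a, item_b): score} for every item pair sharing a user."""
    # Collect the set of users who interacted with each item.
    users_by_item = defaultdict(set)
    for user, item in observations:
        users_by_item[item].add(user)

    sims = {}
    for a, b in combinations(sorted(users_by_item), 2):
        shared = users_by_item[a] & users_by_item[b]
        if shared:
            union = users_by_item[a] | users_by_item[b]
            # Jaccard = |users of both| / |users of either|
            sims[(a, b)] = float(len(shared)) / len(union)
    return sims

# Hypothetical toy data, not taken from the books dataset.
toy = [
    ("ann", "Book A"), ("ann", "Book B"),
    ("bob", "Book A"), ("bob", "Book B"), ("bob", "Book C"),
    ("cat", "Book A"),
]
toy_sims = jaccard_item_similarity(toy)
```

Because the score depends only on who rated an item, not on the rating value, a very widely rated book tends to have low Jaccard similarity with everything: its set of raters makes the union large.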

This dataset has multiple rows corresponding to the same book, e.g., in situations where reprintings were done by different publishers in different years.

For each unique value of `book` in the `items` SFrame, select one of the available values for `author`, `publisher`, and `year`. Hint: Try using [`SFrame.groupby`](http://dato.com/products/create/docs/graphlab.data_structures.html#module-graphlab.aggregate) and [`gl.aggregate.SELECT_ONE`](http://dato.com/products/create/docs/graphlab.data_structures.html#graphlab.aggregate.SELECT_ONE).

In [9]:
items = items.groupby('book', {k: gl.aggregate.SELECT_ONE(k) for k in ['author', 'publisher', 'year']})

Compute the number of times each book was rated, and add a column containing these counts to the `items` SFrame using `SFrame.join`.

In [10]:
num_ratings_per_book = ratings.groupby('book', gl.aggregate.COUNT)
items = items.join(num_ratings_per_book, on='book')
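
To make the group-then-join step concrete, the following sketch reproduces the same idea on plain Python lists of dicts. The helper `add_rating_counts` and the toy rows are hypothetical, and the sketch assumes inner-join behaviour, so items without any ratings are dropped.

```python
from collections import Counter

def add_rating_counts(item_rows, rating_rows, key="book"):
    """Attach a 'Count' field to each item row that has at least one rating."""
    # Similar in spirit to groupby(key, COUNT): number of ratings per key value.
    counts = Counter(row[key] for row in rating_rows)
    # Inner-join semantics (an assumption here): unrated items are dropped.
    return [dict(row, Count=counts[row[key]])
            for row in item_rows if row[key] in counts]

# Hypothetical toy rows, not taken from the books dataset.
toy_items = [{"book": "Book A", "author": "X"},
             {"book": "Book B", "author": "Y"},
             {"book": "Book C", "author": "Z"}]
toy_ratings = [{"name": "ann", "book": "Book A"},
               {"name": "bob", "book": "Book A"},
               {"name": "bob", "book": "Book B"}]

toy_joined = add_rating_counts(toy_items, toy_ratings)
# Most frequently rated first, as in the sorted listing of the real data.
toy_sorted = sorted(toy_joined, key=lambda row: row["Count"], reverse=True)
```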

Print the first few books, sorted by the number of times they have been rated. Do these values make sense?

In [11]:
items.sort('Count', ascending=False)

book,publisher,year,author,Count
Wild Animus,Too Far,2004,Rich Shapero,581
The Da Vinci Code,Doubleday,2003,Dan Brown,488
The Secret Life of Bees,Penguin Highbridge,2002,Sue Monk Kidd,406
Bridget Jones's Diary,Picador (UK),1996,Helen Fielding,377
Life of Pi,Pub Group West,2004,Yann Martel,336
The Summons,Random House Large Print Publishing ...,2002,John Grisham,309
A Painted House,Random House Audio Publishing Group ...,2001,John Grisham,284
The Girls' Guide to Hunting and Fishing ...,Penguin Books Ltd,2000,Melissa Bank,259
Good in Bed,Atria,2001,Jennifer Weiner,247
The Five People You Meet in Heaven ...,Hyperion,2003,Mitch Albom,244


Now print the most similar items per item, ordered so that the most frequently rated books come first. Hint: Join the two SFrames you created above.

In [12]:
sims = sims.join(items[['book', 'Count']], on='book')
sims = sims.sort(['Count', 'book', 'rank'], ascending=False)
sims.print_rows(1000, max_row_width=150)

+-------------------------------+--------------------------------+------------------+------+-------+
|              book             |            similar             |      score       | rank | Count |
+-------------------------------+--------------------------------+------------------+------+-------+
|          Wild Animus          |    A Prayer for Owen Meany     | 0.00925928354263 |  10  |  581  |
|          Wild Animus          |          Empire Falls          | 0.0097222328186  |  9   |  581  |
|          Wild Animus          |      When the Wind Blows       | 0.00980395078659 |  8   |  581  |
|          Wild Animus          |   The Bonesetter's Daughter    | 0.0108991861343  |  7   |  581  |
|          Wild Animus          |           Life of Pi           | 0.0110375285149  |  6   |  581  |
|          Wild Animus          |          The Alienist          | 0.0113154053688  |  5   |  581  |
|          Wild Animus          |        A Painted House         | 0.0116959214211  |  4   

### Experimenting with other models

Create a dataset called `implicit` that contains only ratings data where `rating` was 4 or greater.

In [13]:
implicit = ratings[ratings['rating'] >= 4]

Create a train/test split of the `implicit` data created above. Hint: Use [random_split_by_user](http://graphlab.com/products/create/docs/generated/graphlab.recommender.random_split_by_user.html#graphlab.recommender.random_split_by_user).

In [14]:
train, test = gl.recommender.util.random_split_by_user(implicit, user_id='name', item_id='book')
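
The point of splitting by user is that held-out test rows belong to users who also appear in the training set, so the model can be asked for recommendations for them. The sketch below illustrates that idea in plain Python. It is not GraphLab's implementation: the function name `split_by_user`, the 20% held-out proportion, and the choice to split every user are all assumptions made for illustration.

```python
import random

def split_by_user(rows, user_key="name", test_proportion=0.2, seed=0):
    """Hold out a fraction of each user's rows for testing."""
    rng = random.Random(seed)
    # Group rows by user.
    by_user = {}
    for row in rows:
        by_user.setdefault(row[user_key], []).append(row)

    train, test = [], []
    for user in sorted(by_user):
        user_rows = by_user[user][:]
        rng.shuffle(user_rows)
        # Rounds down, so users with few rows keep everything in train.
        n_test = int(len(user_rows) * test_proportion)
        test.extend(user_rows[:n_test])
        train.extend(user_rows[n_test:])
    return train, test

# Hypothetical toy data: one heavy user and one light user.
toy_rows = [{"name": "ann", "book": "Book %d" % i} for i in range(10)]
toy_rows += [{"name": "bob", "book": "Book %d" % i} for i in range(3)]
toy_train, toy_test = split_by_user(toy_rows)
```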

Print the first 5 rows of the training set.

In [15]:
train.head(5)

name,book,rating
Channon,Dave Barry's Bad Habits a 100% Fact-Free Book ...,5
Boe,It's Not About the Bike: My Journey Back to Life ...,4
Raul,The Hero and the Crown,4
Sarah,One Night of Scandal,4
Brooklynn,Fat Ollie's Book: A Novel of the 87th ...,4


Create a `ranking_factorization_recommender` model using just the training set and 20 factors.

In [16]:
m = gl.ranking_factorization_recommender.create(train, 'name', 'book', target='rating', num_factors=20)

Evaluate how well this model recommends items that were seen in the test set you created above. Hint: Check out `m.evaluate_precision_recall()`.

In [17]:
m.evaluate_precision_recall(test, cutoffs=[50])['precision_recall_overall']

cutoff,precision,recall
50,0.0026582278481,0.102848101266
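
For reference, here is a minimal sketch of how precision and recall at a cutoff are commonly defined for a single user. The helper name `precision_recall_at_k` and the toy lists are hypothetical, and the exact averaging GraphLab applies across users is not shown in this notebook, so treat this as the standard textbook definition rather than the library's code.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    # Precision: fraction of the k recommended items that were relevant.
    precision = float(hits) / k
    # Recall: fraction of the relevant items that were recommended.
    recall = float(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy example: ranked recommendations vs. held-out items.
toy_recommended = ["A", "B", "C", "D"]
toy_relevant = ["B", "D", "E"]
toy_precision, toy_recall = precision_recall_at_k(toy_recommended, toy_relevant, k=2)
```

Under this definition, and assuming the reported figure is an average over users, a precision of roughly 0.0027 at cutoff 50 corresponds to about 0.13 relevant books among each user's 50 recommendations.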


Create an SFrame containing only one observation, where the user 'Me' has rated 'Animal Farm' with score 5.0.

In [18]:
new_observation_data = gl.SFrame({'name': ['Me'], 'book': ['Animal Farm'], 'rating': [5.0]})

Use this data when querying for recommendations.

In [19]:
m.recommend(users=['Me'], new_observation_data=new_observation_data)

name,book,score,rank
Me,The Da Vinci Code,4.72174966921,1
Me,The Secret Life of Bees,4.6367422997,2
Me,A Prayer for Owen Meany,4.60010815253,3
Me,The Five People You Meet in Heaven ...,4.59598705759,4
Me,Me Talk Pretty One Day,4.55587265482,5
Me,Life of Pi,4.54568443288,6
Me,Bridget Jones's Diary,4.54082047035,7
Me,The Handmaid's Tale,4.54068297734,8
Me,Suzanne's Diary for Nicholas ...,4.52010121038,9
Me,The Poisonwood Bible,4.5169126748,10
