# Building a song recommender


# Fire up GraphLab Create
(See [Getting Started with SFrames](../Week%201/Getting%20Started%20with%20SFrames.ipynb) for setup instructions)

In [1]:
import graphlab
import os

In [2]:
# Set the product key on this computer. After running this cell once, you do not need to re-enter the key.
# Here the key is read from an environment variable; it could also be passed directly as a string.
graphlab.product_key.set_product_key(os.environ.get('GraphlabCreate'))

# Limit number of worker processes. This preserves system memory, which prevents hosted notebooks from crashing.
graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)

# Output active product key.
# graphlab.product_key.get_product_key()

This non-commercial license of GraphLab Create for academic use is assigned to andreas.keller@gmail.com and will expire on December 19, 2017.


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1487583782.log


# Load music data

In [3]:
song_data = graphlab.SFrame('data/song_data.gl/')

# Explore data

Each row records how many times a user listened to a song, along with the song's title and artist.

In [4]:
song_data.head()

user_id,song_id,listen_count,title,artist,song
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOAKIMP12A8C130995,1,The Cove,Jack Johnson,The Cove - Jack Johnson
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Paco De Lucia,Entre Dos Aguas - Paco De Lucia ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBXHDL12A81C204C0,1,Stronger,Kanye West,Stronger - Kanye West
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBYHAJ12A6701BF1D,1,Constellations,Jack Johnson,Constellations - Jack Johnson ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODACBL12A8C13C273,1,Learn To Fly,Foo Fighters,Learn To Fly - Foo Fighters ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODDNQT12A6D4F5F7E,5,Apuesta Por El Rock 'N' Roll ...,Héroes del Silencio,Apuesta Por El Rock 'N' Roll - Héroes del ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODXRTY12AB0180F3B,1,Paper Gangsta,Lady GaGa,Paper Gangsta - Lady GaGa
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFGUAY12AB017B0A8,1,Stacked Actors,Foo Fighters,Stacked Actors - Foo Fighters ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFRQTD12A81C233C0,1,Sehr kosmisch,Harmonia,Sehr kosmisch - Harmonia
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOHQWYZ12A6D4FA701,1,Heaven's gonna burn your eyes ...,Thievery Corporation feat. Emiliana Torrini ...,Heaven's gonna burn your eyes - Thievery ...


## Showing the most popular songs in the dataset

In [5]:
# graphlab.canvas.set_target('ipynb')

In [6]:
song_data['song'].show()

Canvas is accessible via web browser at the URL: http://localhost:64130/index.html
Opening Canvas in default web browser.


In [7]:
len(song_data)

1116609

## Count number of unique users in the dataset

In [8]:
users = song_data['user_id'].unique()

In [9]:
len(users)

66346

# Create a song recommender

In [10]:
train_data,test_data = song_data.random_split(.8,seed=0)
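
As a rough illustration of what a seeded 80/20 split does, the sketch below sends each row to the training set with probability 0.8, using a seeded random generator. It is plain Python and not GraphLab's implementation; `split_rows` and the sample rows are hypothetical names introduced only for this example.

```python
import random

def split_rows(rows, fraction, seed):
    """Send each row to the first list with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))  # stand-in for listening records
train_rows, test_rows = split_rows(rows, 0.8, seed=0)

# Each row is drawn independently, so the split is only approximately 80/20.
print(len(train_rows), len(test_rows))
```

Because the seed is fixed, running the split again yields the same two sets, which keeps later model comparisons repeatable.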

## Simple popularity-based recommender

In [11]:
popularity_model = graphlab.popularity_recommender.create(train_data,
                                                          user_id='user_id',
                                                          item_id='song')

### Use the popularity model to make some predictions

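A popularity model ranks songs the same way for every user, so it provides no personalization. The lists below differ only because songs a user has already listened to are left out of that user's recommendations.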

In [12]:
popularity_model.recommend(users=[users[0]])

user_id,song,score,rank
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Sehr kosmisch - Harmonia,4754.0,1
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Undo - Björk,4227.0,2
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,You're The One - Dwight Yoakam ...,3781.0,3
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Dog Days Are Over (Radio Edit) - Florence + The ...,3633.0,4
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Revelry - Kings Of Leon,3527.0,5
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Horn Concerto No. 4 in E flat K495: II. Romance ...,3161.0,6
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Secrets - OneRepublic,3148.0,7
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Fireflies - Charttraxx Karaoke ...,2532.0,8
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Tive Sim - Cartola,2521.0,9
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Drop The World - Lil Wayne / Eminem ...,2053.0,10


In [13]:
popularity_model.recommend(users=[users[1]])

user_id,song,score,rank
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Sehr kosmisch - Harmonia,4754.0,1
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Undo - Björk,4227.0,2
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,You're The One - Dwight Yoakam ...,3781.0,3
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Dog Days Are Over (Radio Edit) - Florence + The ...,3633.0,4
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Revelry - Kings Of Leon,3527.0,5
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Horn Concerto No. 4 in E flat K495: II. Romance ...,3161.0,6
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Secrets - OneRepublic,3148.0,7
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Hey_ Soul Sister - Train,2538.0,8
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Fireflies - Charttraxx Karaoke ...,2532.0,9
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Tive Sim - Cartola,2521.0,10
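
The two lists above are nearly identical, apart from one song that appears for only one of the users. A minimal sketch of that behavior, assuming the model ranks songs by their number of observations and skips songs the user has already listened to (the notebook does not spell out the scoring rule, so treat this as an assumption). `recommend_popular` and the toy data are hypothetical.

```python
from collections import Counter

def recommend_popular(observations, user, k=3):
    """Rank songs by observation count, skipping songs `user` has already heard."""
    counts = Counter(song for _, song in observations)
    seen = {song for u, song in observations if u == user}
    ranked = [(song, n) for song, n in counts.most_common() if song not in seen]
    return ranked[:k]

# Hypothetical (user, song) pairs: A has 4 listeners, B has 3, C has 2, D has 1.
observations = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"), ("u4", "A"),
    ("u1", "B"), ("u2", "B"), ("u3", "B"),
    ("u1", "C"), ("u4", "C"),
    ("u2", "D"),
]

print(recommend_popular(observations, "u4"))   # u4 already heard A and C
print(recommend_popular(observations, "new"))  # a user with no history
```

Every user sees the same global ranking; only the songs filtered out differ from user to user.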


## Build a song recommender with personalization

We now create a model that allows us to make personalized recommendations to each user. 

In [14]:
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                user_id='user_id',
                                                                item_id='song')
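
To give a feel for what an item-similarity model computes, here is a minimal sketch. It assumes Jaccard similarity between the sets of users who listened to each song, and it scores a candidate song by its mean similarity to the songs the user already has. Both choices are assumptions made for illustration (the notebook does not state which similarity measure or scoring rule is used), and all function names are hypothetical.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def build_listeners(observations):
    """Map each song to the set of users who listened to it."""
    listeners = defaultdict(set)
    for user, song in observations:
        listeners[song].add(user)
    return listeners

def recommend_similar(observations, user, k=2):
    """Score each unheard song by its mean similarity to the user's songs."""
    listeners = build_listeners(observations)
    seen = {song for u, song in observations if u == user}
    scores = {}
    for candidate in listeners:
        if candidate in seen:
            continue
        sims = [jaccard(listeners[candidate], listeners[s]) for s in seen]
        scores[candidate] = sum(sims) / len(sims) if sims else 0.0
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Hypothetical (user, song) pairs
observations = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "A"), ("u3", "C"),
    ("u4", "D"),
]

print(recommend_similar(observations, "u1"))
```

Here song C ranks first for `u1` because it shares listeners with both of that user's songs, while D shares none. Unlike the popularity model, the ranking depends on each user's own listening history.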

### Applying the personalized model to make song recommendations

As you can see, different users get different recommendations now.

In [15]:
personalized_model.recommend(users=[users[0]])

user_id,song,score,rank
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Cuando Pase El Temblor - Soda Stereo ...,0.0194504536115,1
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Fireflies - Charttraxx Karaoke ...,0.0144737317012,2
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Love Is A Losing Game - Amy Winehouse ...,0.0142865960415,3
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Marry Me - Train,0.014133471709,4
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Secrets - OneRepublic,0.013591665488,5
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Sehr kosmisch - Harmonia,0.0133987894425,6
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Te Hacen Falta Vitaminas - Soda Stereo ...,0.0129302831796,7
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,OMG - Usher featuring will.i.am ...,0.0127778282532,8
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,Y solo se me ocurre amarte (Unplugged) - ...,0.0123411279458,9
c66c10a9567f0d82ff31441a9 fd5063e5cd9dfe8 ...,No Dejes Que... - Caifanes ...,0.0121042499175,10


In [16]:
personalized_model.recommend(users=[users[1]])

user_id,song,score,rank
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Riot In Cell Block Number Nine - Dr Feelgood ...,0.0374999940395,1
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Sei Lá Mangueira - Elizeth Cardoso ...,0.0331632643938,2
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,The Stallion - Ween,0.0322580635548,3
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Rain - Subhumans,0.0314159244299,4
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,West One (Shine On Me) - The Ruts ...,0.0306771993637,5
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Back Against The Wall - Cage The Elephant ...,0.0301204770803,6
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Life Less Frightening - Rise Against ...,0.0284431129694,7
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,A Beggar On A Beach Of Gold - Mike And The ...,0.0230024904013,8
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Audience Of One - Rise Against ...,0.0193938463926,9
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Blame It On The Boogie - The Jacksons ...,0.0189873427153,10


### We can also apply the model to find songs similar to any song in the dataset

In [17]:
personalized_model.get_similar_items(['With Or Without You - U2'])

song,similar,score,rank
With Or Without You - U2,I Still Haven't Found What I'm Looking For ...,0.042857170105,1
With Or Without You - U2,Hold Me_ Thrill Me_ Kiss Me_ Kill Me - U2 ...,0.0337349176407,2
With Or Without You - U2,Window In The Skies - U2,0.0328358411789,3
With Or Without You - U2,Vertigo - U2,0.0300751924515,4
With Or Without You - U2,Sunday Bloody Sunday - U2,0.0271317958832,5
With Or Without You - U2,Bad - U2,0.0251798629761,6
With Or Without You - U2,A Day Without Me - U2,0.0237154364586,7
With Or Without You - U2,Another Time Another Place - U2 ...,0.0203251838684,8
With Or Without You - U2,Walk On - U2,0.0202020406723,9
With Or Without You - U2,Get On Your Boots - U2,0.0196850299835,10


In [18]:
personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])

song,similar,score,rank
Chan Chan (Live) - Buena Vista Social Club ...,Murmullo - Buena Vista Social Club ...,0.188118815422,1
Chan Chan (Live) - Buena Vista Social Club ...,La Bayamesa - Buena Vista Social Club ...,0.18719214201,2
Chan Chan (Live) - Buena Vista Social Club ...,Amor de Loca Juventud - Buena Vista Social Club ...,0.184834122658,3
Chan Chan (Live) - Buena Vista Social Club ...,Diferente - Gotan Project,0.0214592218399,4
Chan Chan (Live) - Buena Vista Social Club ...,Mistica - Orishas,0.0205761194229,5
Chan Chan (Live) - Buena Vista Social Club ...,Hotel California - Gipsy Kings ...,0.0193049907684,6
Chan Chan (Live) - Buena Vista Social Club ...,Nací Orishas - Orishas,0.0191571116447,7
Chan Chan (Live) - Buena Vista Social Club ...,Le Moulin - Yann Tiersen,0.018796980381,8
Chan Chan (Live) - Buena Vista Social Club ...,Gitana - Willie Colon,0.018796980381,9
Chan Chan (Live) - Buena Vista Social Club ...,Criminal - Gotan Project,0.0187793374062,10


# Quantitative comparison between the models

We now formally compare the popularity and the personalized models using precision-recall curves. 

In [19]:
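# Compare version numbers numerically; comparing strings would rank "1.10" below "1.6".
version_info = tuple(int(part) for part in graphlab.version.split('.')[:2])
if version_info >= (1, 6):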
    model_performance = graphlab.compare(test_data, [popularity_model, personalized_model], user_sample=0.05)
    graphlab.show_comparison(model_performance,[popularity_model, personalized_model])
else:
    # %matplotlib inline
    model_performance = graphlab.recommender.util.compare_models(test_data, [popularity_model, personalized_model], user_sample=.05)

compare_models: using 2931 users to estimate model performance
PROGRESS: Evaluate model M0



Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    | 0.0290003411805 | 0.0073292793866 |
|   2    | 0.0290003411805 | 0.0145570377914 |
|   3    | 0.0257022631639 | 0.0191307763467 |
|   4    | 0.0247355851245 |  0.023718143754 |
|   5    |  0.022790856363 | 0.0269115574899 |
|   6    | 0.0216649607642 | 0.0316728790373 |
|   7    |  0.020568309207 |  0.035448474343 |
|   8    | 0.0192340498124 | 0.0380427999159 |
|   9    |  0.01812047462  | 0.0404249499695 |
|   10   | 0.0172978505629 | 0.0430846599475 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1



Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |  0.190037529853 | 0.0592172228431 |
|   2    |  0.162743091095 | 0.0988264976238 |
|   3    |  0.141134993745 |  0.123567695715 |
|   4    |  0.125127942682 |  0.143307790621 |
|   5    |  0.115046059365 |  0.162020806916 |
|   6    |  0.105879677016 |  0.177764618527 |
|   7    | 0.0983087195984 |  0.191531864114 |
|   8    | 0.0914363698396 |  0.202894743529 |
|   9    | 0.0857500284317 |  0.214055102182 |
|   10   | 0.0816103718867 |  0.225045033911 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]

Model compare metric: precision_recall
Canvas is updated and available in a tab in the default browser.


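The curve shows that the personalized model performs much better, and the summary tables agree: at a cutoff of 10, the personalized model (M1) reaches a mean precision of about 0.082 and a mean recall of about 0.225, compared with about 0.017 and 0.043 for the popularity model (M0).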
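
For reference, precision and recall at a cutoff can be computed for a single user as sketched below; the `mean_precision` and `mean_recall` columns in the tables would then be averages of such per-user values over the sampled users. This is a plain-Python sketch with hypothetical names and data, not the library's evaluation code.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k                                  # share of recommendations that were relevant
    recall = hits / len(relevant) if relevant else 0.0    # share of relevant items that were recommended
    return precision, recall

recommended = ["A", "B", "C", "D", "E"]  # hypothetical ranked recommendations
relevant = {"B", "E", "F", "G"}          # hypothetical songs the user played in the test set

for k in (1, 2, 5):
    print(k, precision_recall_at_k(recommended, relevant, k))
```

As the cutoff grows, recall can only stay flat or rise, while precision tends to fall once less relevant songs enter the list. The tables above show the same pattern.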

# Assignments

## Assignment 1

Unique users who have listened to songs by Kanye West

In [20]:
users_kanye = song_data[song_data['artist'] == 'Kanye West']['user_id'].unique()
len(users_kanye)

2522

Unique users who have listened to songs by Foo Fighters

In [21]:
users_foo = song_data[song_data['artist'] == 'Foo Fighters']['user_id'].unique()
len(users_foo)

2055

Unique users who have listened to songs by Taylor Swift

In [22]:
users_taylor = song_data[song_data['artist'] == 'Taylor Swift']['user_id'].unique()
len(users_taylor)

3246

Unique users who have listened to songs by Lady GaGa

In [23]:
users_gaga = song_data[song_data['artist'] == 'Lady GaGa']['user_id'].unique()
len(users_gaga)

2928

## Assignment 2

Artists sorted by popularity, measured by total listen count

In [24]:
artist_listen = song_data.groupby(key_columns='artist', operations={'total_count': graphlab.aggregate.SUM('listen_count')}).sort('total_count', ascending = False)

Most popular artist.

In [25]:
artist_listen[0]

{'artist': 'Kings Of Leon', 'total_count': 43218}

Least popular artist

In [26]:
artist_listen[-1]

{'artist': 'William Tabbert', 'total_count': 14}
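
The group-by-and-sum step used here can be sketched in plain Python as follows. The helper name and records are hypothetical and serve only to show the aggregation logic, independent of GraphLab.

```python
from collections import defaultdict

def total_listens_by_artist(records):
    """Sum listen counts per artist and sort from most to least listened."""
    totals = defaultdict(int)
    for artist, listen_count in records:
        totals[artist] += listen_count
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical (artist, listen_count) rows
records = [("Artist A", 3), ("Artist B", 1), ("Artist A", 2), ("Artist C", 4)]

ranking = total_listens_by_artist(records)
print(ranking[0])   # most listened artist
print(ranking[-1])  # least listened artist
```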

## Assignment 3

Split test and training data

In [27]:
train_data, test_data = song_data.random_split(.8, seed = 0)

Train similarity model

In [28]:
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                 user_id = 'user_id',
                                                                 item_id = 'song')

Find the top recommendation (k = 1) for each of the first 10,000 unique test users

In [29]:
subset_test_users = test_data['user_id'].unique()[0:10000]
subset_test_users_recom = personalized_model.recommend(subset_test_users, k = 1)

Find most recommended songs

In [30]:
most_recom = subset_test_users_recom.groupby(key_columns = 'song', operations={'count': graphlab.aggregate.COUNT()}).sort('count', ascending = False)

Most recommended song

In [31]:
most_recom[0]

{'count': 421, 'song': 'Undo - Bj\xc3\xb6rk'}

Least recommended song (any song recommended only once ties for this position)

In [32]:
most_recom[-1]

{'count': 1, 'song': 'Dark Matter - Andrew Bird'}