#Building a song recommender


#Fire up GraphLab Create

In [1]:
import graphlab

#Load music data

In [13]:
song_data = graphlab.SFrame('song_data.gl/')

#Explore data

Each row of the music data records how many times a user listened to a song, along with the details of that song.

In [34]:
song_data['artist']

dtype: str
Rows: 1116609
['Jack Johnson', 'Paco De Lucia', 'Kanye West', 'Jack Johnson', 'Foo Fighters', 'Héroes del Silencio', 'Lady GaGa', 'Foo Fighters', 'Harmonia', 'Thievery Corporation feat. Emiliana Torrini', 'Jack Johnson / Matt Costa / Zach Gill / Dan Lebowitz / Steve Adams', 'Puff Daddy', "The B-52's", 'John Mayer', 'Robert Johnson', 'The Lonely Island', 'Panic At The Disco', 'Kanye West', 'Foo Fighters', 'Fleet Foxes', 'Fleet Foxes', 'Jack Johnson / Paula Fuga', 'Andrew Bird', 'John Mayer', 'Angus & Julia Stone', 'Incubus', 'John Mayer', 'John Mayer', 'Jimmy Eat World', 'Jorge Drexler', 'Thievery Corporation', 'King Curtis', 'Band Of Horses', 'Incubus', 'Incubus', 'Foo Fighters', 'John Mayer', 'Fleet Foxes', 'Kings Of Leon', 'The String Cheese Incident', 'Fleet Foxes', 'John Mayer', 'Sublime', 'Jack Johnson', 'Jack Johnson', 'Sage Francis', 'Local Natives', 'Local Natives', 'Jack Johnson', 'Mogwai', 'Local Natives', 'Phoenix', 'Phoenix', 'Passion Pit', 'Young Jeezy', 'Octopu

##Showing the most popular songs in the dataset

In [None]:
graphlab.canvas.set_target('ipynb')

In [None]:
song_data['song'].show()

In [None]:
len(song_data)

##Count number of unique users in the dataset

In [None]:
users = song_data['user_id'].unique()

In [None]:
len(users)

#Create a song recommender

In [None]:
train_data,test_data = song_data.random_split(.8,seed=0)
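
As a rough illustration of what an 80/20 random split with a fixed seed does, here is a minimal plain-Python sketch. The helper `random_split` below and the toy rows are hypothetical; the sketch assumes each row is assigned independently with the given probability, which may differ from how GraphLab samples internally, so the resulting sizes are only approximately 80% and 20%.

```python
import random

def random_split(rows, fraction, seed=0):
    """Send each row to the first part with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

# Toy stand-in for song_data: made-up (user, song) rows.
rows = [{'user_id': 'u%d' % (i % 7), 'song': 's%d' % (i % 5)} for i in range(1000)]
train_rows, test_rows = random_split(rows, 0.8, seed=0)
print(len(train_rows), len(test_rows))
```

Because the seed is fixed, rerunning the split returns the same train and test rows, which keeps later model comparisons repeatable.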

##Simple popularity-based recommender

In [None]:
popularity_model = graphlab.popularity_recommender.create(train_data,
                                                         user_id='user_id',
                                                         item_id='song')

###Use the popularity model to make some predictions

A popularity model makes the same prediction for all users, so it provides no personalization.

In [None]:
popularity_model.recommend(users=[users[0]])

In [None]:
popularity_model.recommend(users=[users[1]])
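
The idea behind a popularity model can be sketched in a few lines of plain Python. This is an illustrative sketch, not GraphLab's implementation: it assumes popularity means the number of distinct listeners per song, and it assumes songs a user has already heard are filtered out of that user's list. The function names and toy rows are hypothetical.

```python
from collections import defaultdict

def popularity_scores(rows):
    """Score each song by its number of distinct listeners (assumed definition)."""
    listeners = defaultdict(set)
    for user, song in rows:
        listeners[song].add(user)
    return {song: len(who) for song, who in listeners.items()}

def recommend_popular(rows, user, k=3):
    """Return the k most popular songs the user has not listened to yet."""
    scores = popularity_scores(rows)
    seen = {song for u, song in rows if u == user}
    # One global ranking for everyone; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return [s for s in ranked if s not in seen][:k]

# Made-up (user, song) pairs.
toy_rows = [('u1', 'A'), ('u2', 'A'), ('u3', 'A'), ('u1', 'B'),
            ('u2', 'B'), ('u3', 'C'), ('u4', 'D')]
print(recommend_popular(toy_rows, 'u4'))
print(recommend_popular(toy_rows, 'u1'))
```

Every user is served from the same global ranking; the lists differ only because each user's own songs are removed.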

##Build a song recommender with personalization

We now create a model that allows us to make personalized recommendations to each user. 

In [None]:
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                user_id='user_id',
                                                                item_id='song')

###Applying the personalized model to make song recommendations

As you can see, different users get different recommendations now.

In [None]:
personalized_model.recommend(users=[users[0]])

In [None]:
personalized_model.recommend(users=[users[1]])

###We can also apply the model to find songs similar to any song in the dataset

In [None]:
personalized_model.get_similar_items(['With Or Without You - U2'])

In [None]:
personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])
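
One common way to define "similar songs" is by overlap in listeners. The sketch below assumes Jaccard similarity over listener sets; the notebook does not state which similarity measure the model uses, so treat that choice, the function names, and the toy data as assumptions.

```python
from collections import defaultdict

def listeners_by_song(rows):
    """Map each song to the set of users who listened to it."""
    out = defaultdict(set)
    for user, song in rows:
        out[song].add(user)
    return dict(out)

def jaccard(a, b):
    """Shared listeners divided by total distinct listeners of the two songs."""
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0

def similar_songs(rows, song, k=2):
    """Return the k songs with the highest Jaccard similarity to `song`."""
    listeners = listeners_by_song(rows)
    target = listeners.get(song, set())
    scored = [(other, jaccard(target, who))
              for other, who in listeners.items() if other != song]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]

# Made-up (user, song) pairs.
toy_rows = [('u1', 'A'), ('u2', 'A'), ('u3', 'A'), ('u1', 'B'), ('u2', 'B'),
            ('u3', 'C'), ('u4', 'C'), ('u4', 'D')]
print(similar_songs(toy_rows, 'A'))
```

In the toy data, song B shares two of song A's three listeners, so it ranks above song C, which shares only one listener out of four.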

#Quantitative comparison between the models

We now formally compare the popularity and the personalized models using precision-recall curves. 

In [None]:
# Compare version components numerically; comparing strings would misorder versions such as "1.10" and "1.6".
version_parts = tuple(int(part) for part in graphlab.version.split('.')[:2])
if version_parts >= (1, 6):
    model_performance = graphlab.compare(test_data, [popularity_model, personalized_model], user_sample=0.05)
    graphlab.show_comparison(model_performance, [popularity_model, personalized_model])
else:
    %matplotlib inline
    model_performance = graphlab.recommender.util.compare_models(test_data, [popularity_model, personalized_model], user_sample=.05)

The precision-recall curve shows that the personalized model performs much better than the popularity model.
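
For reference, each point on a precision-recall curve comes from cutting the recommendation list at some length k. The sketch below shows one common definition for a single user; the helper name and the toy lists are hypothetical, and precision is divided by k even if fewer than k songs were recommended.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(k)  # share of the k recommendations that were relevant
    recall = hits / float(len(relevant)) if relevant else 0.0  # share of relevant songs found
    return precision, recall

# Made-up example: a ranked recommendation list and the songs the user
# actually listened to in the held-out test data.
recommended = ['A', 'B', 'C', 'D', 'E']
relevant = {'B', 'E', 'F'}
for k in (1, 3, 5):
    print(k, precision_recall_at_k(recommended, relevant, k))
```

Averaging these values over users for a range of k gives one curve per model; a model whose curve lies higher has better precision at the same recall.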

In [35]:
df1 = song_data[song_data['artist']=='Kings Of Leon']

In [36]:
df1['user_id'].unique()

dtype: str
Rows: 7373
['18325842a941bc58449ee71d659a08d1c1bd2383', '046311081a8703bdee3be2e6d9da07ccdc135340', '62c4bf887b7b1e5cf6ab62723481099c7f98377e', '11e266d0d2d7a841bd1a2604b14cc05f4bcecd8e', '7e7a4eede58db53c0b4837b5b302843d845259e8', 'c46038495c5f7fe43b12c3cea012517998fd78a0', '751774a078c559b9abf6aceaec9062c4213a92ac', 'd6dbb0578fe5baf513cad07fbf07801f89fd9313', '64d4b02087e9433c4a82aeba087cd8967453afa1', '58fcc1ee8eb4bff5a9bedff6e7510f0167367c1a', '9232eeaa942af2ff3f3e2b90f996d3ad003e4b31', '24423ce2aa848fd563617d9b8a50c2c9fd1ebaac', '5264bce396b58ba18b13fa8bc692b9e13095b238', '0b8078451a33f49730e6a6b58e780ed5031a4a87', '3930e92959235c4adb586ffc8ed5429b9a304656', '7fa506764e36a8221b8de5f419f297626963e9b3', '9116c65d75343f5ac19f27b2eb8b1b447855e2ba', 'b1ad477696f5817f7cb9eaf1d13e102349236597', 'b919f662b26957782fa1ea3543438edf53489ec7', '2f33a2101e99fbdd7e6af490e2934c940cf9ac44', 'c7115319adfc1442867137d4f72477276abeab7f', 'cebf140abc6ea32caab364308509d810d378a5c9', '974ed1d7

In [37]:
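# Total listens per artist, sorted from most to least played.
# SFrame.sort() needs the name of the column to sort by; calling it with no arguments raises
# "TypeError: sort() takes at least 2 arguments (1 given)".
artist_counts = song_data.groupby(key_columns='artist', operations={'total_count': graphlab.aggregate.SUM('listen_count')})
artist_counts.sort('total_count', ascending=False)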
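
The per-artist aggregation itself, summing `listen_count` for each artist and ordering the totals, can be sketched without GraphLab. The helper name and the toy rows below are hypothetical and use made-up listen counts.

```python
from collections import defaultdict

def total_listens_by_artist(rows):
    """Sum listen_count per artist and sort from most to least played."""
    totals = defaultdict(int)
    for row in rows:
        totals[row['artist']] += row['listen_count']
    # Ties are broken alphabetically by artist name.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up rows with the same column names as song_data.
toy_rows = [
    {'artist': 'Kings Of Leon', 'listen_count': 3},
    {'artist': 'Foo Fighters', 'listen_count': 1},
    {'artist': 'Kings Of Leon', 'listen_count': 2},
    {'artist': 'Fleet Foxes', 'listen_count': 4},
]
print(total_listens_by_artist(toy_rows))
```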