# 02 - Song Embeddings - Skipgram Recommender

In this notebook, we'll use human-made music playlists to learn song embeddings. We'll treat each playlist as a sentence and the songs it contains as words. We feed these "sentences" to the word2vec algorithm, which learns an embedding for every song in the dataset. These embeddings can then be used to recommend similar songs. This technique is used by Spotify, Airbnb, Alibaba, and others, where it accounts for a large portion of user activity, media consumption, and/or sales (in the case of Alibaba).
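To make the playlist-as-sentence idea concrete, here is a minimal sketch of how skip-gram style training pairs could be formed from one playlist. The function name `skipgram_pairs` and the toy song names are made up for illustration, and the fixed window is a simplification; this is not how gensim is implemented internally.

```python
def skipgram_pairs(playlist, window):
    """Return (target, context) song pairs that lie within `window` positions of each other."""
    pairs = []
    for i, target in enumerate(playlist):
        # Context songs are the neighbours on both sides of the target
        start = max(0, i - window)
        end = min(len(playlist), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, playlist[j]))
    return pairs


# A toy playlist with hypothetical song names
toy_playlist = ["song_a", "song_b", "song_c", "song_d"]
pairs = skipgram_pairs(toy_playlist, window=1)
print(pairs)
```

Songs that often appear near each other in playlists end up in many shared pairs, which is what pushes their embeddings close together during training.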

The [dataset we'll use](https://www.cs.cornell.edu/~shuochen/lme/data_page.html) was collected by Shuo Chen from Cornell University. The dataset contains playlists from hundreds of radio stations from around the US.

## Importing packages and dataset

In [1]:
import numpy as np
import pandas as pd
import gensim 
from gensim.models import Word2Vec
from urllib import request
import warnings
warnings.filterwarnings('ignore')

The playlist dataset is a text file where every line represents a playlist. Each playlist is a sequence of song IDs separated by whitespace.

In [2]:
# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as 
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:] 

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]


The `playlists` variable now contains a Python list. Each item in this list is a playlist containing song IDs. We can look at the first two playlists here:

In [3]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])

Playlist #1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117',
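The parsing step above can be tried out on a small in-memory string. This is only a sketch: the two header lines and the song IDs in `sample_text` are invented for illustration and do not reproduce the real file's metadata.

```python
from collections import Counter

# Hypothetical stand-in for the downloaded file: two metadata lines, then playlists
sample_text = "metadata line 1\nmetadata line 2\n0 1 2 1 \n3 \n2 4 \n"

# Same steps as the notebook cell: drop the two metadata lines,
# then keep only playlists that contain more than one song
toy_lines = sample_text.split('\n')[2:]
toy_playlists = [s.rstrip().split() for s in toy_lines if len(s.split()) > 1]

# Count how often each song ID occurs across all playlists
song_counts = Counter(song for playlist in toy_playlists for song in playlist)

print(toy_playlists)
print(len(song_counts), "unique songs")
```

The single-song playlist `3` and the trailing empty line are filtered out, since a playlist with one song provides no context for training.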

## Training the Word2Vec Model
Our dataset is now in the shape that the Word2Vec model expects as input. We pass the dataset to the model and set the following key parameters:
 * **size**: Embedding size for the songs. (gensim 4.0 and later name this parameter `vector_size`.)
 * **window**: word2vec algorithm parameter -- maximum distance between the current and predicted word (song) within a sentence (playlist)
 * **negative**: word2vec algorithm parameter -- number of negative examples to use at each training step that the model needs to identify as noise
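Since the name of the embedding-size parameter depends on the installed gensim version, one option is to build the keyword arguments in a small helper. The helper `word2vec_kwargs` below is hypothetical (it is not part of gensim); it only assumes that gensim 4.0 renamed `size` to `vector_size`.

```python
def word2vec_kwargs(gensim_version, embedding_size=32, window=20, negative=50):
    """Build Word2Vec keyword arguments that match the given gensim version string."""
    major = int(gensim_version.split(".")[0])
    # gensim 4.0 renamed `size` to `vector_size`
    size_key = "vector_size" if major >= 4 else "size"
    return {
        size_key: embedding_size,
        "window": window,
        "negative": negative,
        "min_count": 1,
        "workers": 4,
    }


print(word2vec_kwargs("3.8.3"))
print(word2vec_kwargs("4.3.2"))
```

With gensim imported, this could be used as `Word2Vec(playlists, **word2vec_kwargs(gensim.__version__))`. Note also that gensim's default training mode is CBOW (`sg=0`); adding `sg=1` to the arguments selects the skip-gram architecture.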


In [4]:
model = Word2Vec(playlists, size=32, window=20, negative=50, min_count=1, workers=4)

The model is now trained. Every song has an embedding. We only have song IDs, though, no titles or other info. Let's grab the song information file.

## Song Title and Artist File
Let's load and parse the file containing song titles and artists.

In [5]:
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]

Now, `songs` is a list containing the ID, title, and artist of every song in our dataset. It looks like this:

In [6]:
songs[:3]

[['0 ', 'Gucci Time (w\\/ Swizz Beatz)', 'Gucci Mane'],
 ['1 ', 'Aston Martin Music (w\\/ Drake & Chrisette Michelle)', 'Rick Ross'],
 ['2 ', 'Get Back Up (w\\/ Chris Brown)', 'T.I.']]

To simplify looking up song titles by ID, we'll define a pandas dataframe to hold song information.

In [7]:
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')

In [23]:
songs_df.head(3000)

id,title,artist
0,Gucci Time (w\/ Swizz Beatz),Gucci Mane
1,Aston Martin Music (w\/ Drake & Chrisette Mich...,Rick Ross
2,Get Back Up (w\/ Chris Brown),T.I.
3,Hot Toddy (w\/ Jay-Z & Ester Dean),Usher
4,Whip My Hair,Willow
...,...,...
2995,Doo Doo Doo Doo Doo (Heartbreaker),The Rolling Stones
2996,Tearing Us Apart,Eric Clapton & Tina Turner
2997,Cult Of Personality,Living Colour
2998,Runaround,Van Halen


Pandas dataframes give us the ability to easily search through the columns of our dataset. We can look at the songs of a certain artist, for example.

In [24]:
songs_df[songs_df.artist == 'The Rolling Stones'].head()

id,title,artist
1333,Gimme Shelter,The Rolling Stones
1334,Carol,The Rolling Stones
1335,(I Can't Get No) Satisfaction,The Rolling Stones
1336,Jumpin' Jack Flash,The Rolling Stones
1337,Under My Thumb,The Rolling Stones


### Looking up songs by their IDs
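Pandas also gives us the ability to retrieve the information of multiple songs at once. Because the rows are stored in ID order, a song's position in the dataframe matches its ID, so we can pass a list of IDs to `iloc`. For example, let's retrieve the info for songs number 1, 10, and 100.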

In [10]:
songs_df.iloc[[1,10,100]]

id,title,artist
1,Aston Martin Music (w\/ Drake & Chrisette Mich...,Rick Ross
10,Shake It,Elephant Man
100,I'm Yours,Jason Mraz
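Positional lookups with `iloc` depend on the rows staying in ID order. As a hedged alternative, the sketch below builds a plain dictionary keyed by integer ID from rows shaped like the `songs` list shown earlier. The helper name `build_song_lookup` is made up for illustration; the sample rows are the first three entries printed above, plus a blank row like the one a trailing newline would produce.

```python
def build_song_lookup(rows):
    """Map integer song IDs to (title, artist) tuples, skipping malformed rows."""
    lookup = {}
    for row in rows:
        if len(row) != 3:
            continue  # skip blank or incomplete lines
        song_id, title, artist = row
        # IDs in the file carry trailing whitespace, so strip before converting
        lookup[int(song_id.strip())] = (title, artist)
    return lookup


sample_rows = [
    ['0 ', 'Gucci Time (w\\/ Swizz Beatz)', 'Gucci Mane'],
    ['1 ', 'Aston Martin Music (w\\/ Drake & Chrisette Michelle)', 'Rick Ross'],
    ['2 ', 'Get Back Up (w\\/ Chris Brown)', 'T.I.'],
    [''],
]
song_lookup = build_song_lookup(sample_rows)
print(song_lookup[2])
```

A lookup like this keeps working even if rows are reordered or filtered, at the cost of losing the dataframe's search and display conveniences.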


## Recommending Similar Songs
Let's now pick a song and see what similar songs the model recommends.

In [25]:
songs_df.iloc[1336]

title     Jumpin' Jack Flash
artist    The Rolling Stones
Name: 1336 , dtype: object

In [26]:
song_id = 1336

# Ask the model for the songs most similar to the chosen song
model.wv.most_similar(positive=str(song_id))

[('3049', 0.9976781010627747),
 ('3072', 0.9969066381454468),
 ('2678', 0.9961973428726196),
 ('3065', 0.9957556128501892),
 ('3894', 0.9954989552497864),
 ('3088', 0.9953351616859436),
 ('17692', 0.9952812194824219),
 ('2963', 0.9952243566513062),
 ('2728', 0.994480550289154),
 ('2708', 0.9938569068908691)]
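The scores in this output are cosine similarities between embedding vectors. The following is a minimal pure-Python sketch of that ranking on made-up 2-dimensional vectors. The names `cosine_similarity`, `rank_similar`, and `toy_embeddings` are illustrative only, and gensim's own implementation is vectorized and differs in detail.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(query_id, embeddings, topn=2):
    """Return the `topn` (song_id, similarity) pairs closest to the query song."""
    query = embeddings[query_id]
    scores = [
        (song_id, cosine_similarity(query, vector))
        for song_id, vector in embeddings.items()
        if song_id != query_id  # never recommend the query song itself
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up embeddings for four hypothetical songs
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
ranked = rank_similar("a", toy_embeddings)
print(ranked)
```

Song `b` points in nearly the same direction as `a` and ranks first, while `d` points the opposite way and is left out of the top results.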

Let's look up the titles and artists of these songs:

In [27]:
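# most_similar returns (song_id, similarity) pairs; keep only the IDs and cast
# them to integers so they can be used as positional indexers with iloc
similar_songs = np.array(model.wv.most_similar(positive=str(song_id)))[:,0].astype(int)
songs_df.iloc[similar_songs]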

id,title,artist
3049,"Hello, I Love You",The Doors
3072,You're My Best Friend,Queen
2678,Lookin' Out My Back Door,Creedence Clearwater Revival
3065,Centerfield,John Fogerty
3894,Touch Me,The Doors
3088,Hello Goodbye,The Beatles
17692,You Really Got Me,The Kinks
2963,Born To Be Wild,Steppenwolf
2728,Somebody To Love,Jefferson Airplane
2708,The House Of The Rising Sun,The Animals


Let's define a function that prints out both the song title and the recommendations based on it:


In [28]:
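def print_recommendations(song_id):
    # Show the title and artist of the query song
    print( songs_df.iloc[song_id] )
    # Keep only the IDs of the most similar songs, cast to integers for iloc
    similar_songs = np.array(model.wv.most_similar(positive=str(song_id)))[:,0].astype(int)
    return  songs_df.iloc[similar_songs] 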


## More Example Recommendations

### Hello Goodbye - The Beatles

In [29]:
print_recommendations(3088)

title     Hello Goodbye
artist      The Beatles
Name: 3088 , dtype: object


id,title,artist
2654,Fortunate Son,Creedence Clearwater Revival
2728,Somebody To Love,Jefferson Airplane
3049,"Hello, I Love You",The Doors
3894,Touch Me,The Doors
1336,Jumpin' Jack Flash,The Rolling Stones
2972,Paint It Black,The Rolling Stones
2963,Born To Be Wild,Steppenwolf
2855,Down On The Corner,Creedence Clearwater Revival
2975,Feelin' Alright,Joe Cocker
3106,You Ain't Seen Nothing Yet,Bachman-Turner Overdrive


### California Love - 2Pac

In [None]:
print_recommendations(842)

title     California Love (w\/ Dr. Dre & Roger Troutman)
artist                                              2Pac
Name: 842 , dtype: object


id,title,artist
329,Stronger,Kanye West
5668,How We Do (w\/ 50 Cent),The Game
18844,Murder She Wrote,Chaka Demus & Pliers
890,Knock You Down (w\/ Ne-Yo & Kanye West),Keri Hilson
5890,Low (w\/ T-Pain),Flo-Rida
5788,Drop It Like It's Hot (w\/ Pharrell),Snoop Dogg
11331,Take You There,Sean Kingston
36741,We Get It On (w\/ Omarion),Red Cafe
5681,Drop It Low (w\/ Chris Brown),Ester Dean
1418,Tick Tock,Kesha


### Billie Jean - Michael Jackson

In [None]:
print_recommendations(3822)

title         Billie Jean
artist    Michael Jackson
Name: 3822 , dtype: object


id,title,artist
4187,I Wanna Dance With Somebody (Who Loves Me),Whitney Houston
15660,Let The Music Play,Shannon
4157,P.Y.T. (Pretty Young Thing),Michael Jackson
4181,Kiss,Prince & The Revolution
8542,Never Gonna Give You Up,Rick Astley
3357,Manic Monday,The Bangles
3396,Holiday,Madonna
12749,Wanna Be Startin' Somethin',Michael Jackson
4271,Walking On Sunshine,Katrina & The Waves
1506,The Way You Make Me Feel,Michael Jackson
