### Objective: In this notebook we explore the Spotify playlist dataset and do the following

    1. Read the pickle file of the summarised dataset
    2. Train a word2vec model using skip-gram, with window size as a hyperparameter
    3. Explore the vectors produced by this exercise
    4. Create two functions that return the songs most similar to particular songs
    5. Take 3 songs as a list and return a playlist of 10 songs

In [1]:
### Load the required packages in the required format
import pandas as pd
import os
import warnings
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
plt.style.use('ggplot')



In [2]:
### Load the pickled dataset
with open('spotify_summary.pickle','rb') as dataset:
    spotify_summary = pickle.load(dataset)
    print (" The dataset is loaded successfully")
    print (" The shape of the dataset is as follows",spotify_summary.shape)
    print (spotify_summary.head(5))

 The dataset is loaded successfully
 The shape of the dataset is as follows (229180, 3)
                            user_id      playlistname  \
0  00055176fea33f6e027cd3302289378b              favs   
1  0007f3dd09c91198371454c608d47f22              2014   
2  0007f3dd09c91198371454c608d47f22         Fav songs   
3  0007f3dd09c91198371454c608d47f22         Sad songs   
4  000b0f32b5739f052b9d40fcc5c41079  Agnetha Fältskog   

                                           trackname  
0  [9619, 2591, 46683, 9620, 1138379, 37346, 6335...  
1                [174985, 1541, 878603, 17550, 5303]  
2  [1854415, 174985, 1684382, 955407, 19605, 1482...  
3                                 [1510871, 1448429]  
4                                  [1281658, 487582]  


In [3]:
### Gensim expects a list of lists (one inner list of tokens per "sentence"). Each playlist's track IDs are already a list, so collect them into an outer list
spotify_wrd2vec_input = [ x for x in spotify_summary['trackname']]
print ("Input data is ready for gensim models")
print ("The number of input playlists we have are as follows :",len(spotify_wrd2vec_input))

Input data is ready for gensim models
The number of input playlists we have are as follows : 229180
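
Before training, it can help to see how many distinct tracks would survive a frequency threshold, since gensim's Word2Vec drops tokens that occur fewer than `min_count` times. The sketch below is a minimal illustration on toy data; `count_surviving_tracks` and `toy_playlists` are hypothetical names, not part of the original notebook.

```python
from collections import Counter

def count_surviving_tracks(playlists, min_count):
    """Return (kept, total) distinct track IDs at a given frequency threshold."""
    # Count every occurrence of every track ID across all playlists.
    freq = Counter(track for playlist in playlists for track in playlist)
    kept = sum(1 for n in freq.values() if n >= min_count)
    return kept, len(freq)

# Toy playlists standing in for spotify_wrd2vec_input (hypothetical data).
toy_playlists = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [3]]
kept, total = count_surviving_tracks(toy_playlists, min_count=2)
print(kept, "of", total, "tracks occur at least twice")
```

Running the same function on `spotify_wrd2vec_input` with the threshold intended for training would show how much of the catalogue the model can actually recommend.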


In [5]:
### Tip: Always read the help before using a function, so we will have a look at the Word2Vec documentation
# help(gensim.models.Word2Vec)

In [9]:
### Train the word2vec model. We use negative sampling (hs = 0, negative = 10) because the full softmax can be slow
### Skip-gram: predict the context given the middle word. It works well with infrequent items, a good fit for songs, as some songs may be liked by only a few users
### Note: skip-gram needs sg = 1; the call below leaves sg at gensim's default (sg = 0, CBOW)
### Note: size and iter are the gensim 3.x parameter names (vector_size and epochs in gensim 4.x)
print ("Model Training has started")
model = gensim.models.Word2Vec(spotify_wrd2vec_input, size = 200 , window = 4 , min_count = 15,
                               seed = 1000, hs = 0,negative = 10,workers=16,iter = 100)
print ("Model Training Finished")

Model Training has started
Model Training Finished
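
To make the role of the window size concrete, here is a rough sketch of the (center, context) pairs that skip-gram learns from when a playlist is treated as a sentence. It is a simplified illustration, not gensim's implementation: as far as I know, gensim also shrinks the window at random for each position, which is omitted here. `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(playlist, window):
    """Yield (center, context) pairs within `window` positions of each track."""
    for i, center in enumerate(playlist):
        lo = max(0, i - window)
        hi = min(len(playlist), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a track is not its own context
                yield center, playlist[j]

# The second playlist shown in the dataset preview above.
toy_playlist = [174985, 1541, 878603, 17550, 5303]
pairs = list(skipgram_pairs(toy_playlist, window=1))
print(len(pairs), "pairs, starting with", pairs[:3])
```

A larger window pairs each track with songs further away in the playlist, which tends to capture broader taste rather than immediate neighbours.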


In [22]:
### Pickle the trained model and save it to a file

with open('model_spotify_word2vec.pickle','wb') as model_file:
    pickle.dump(model,model_file)
    print (" Dumping the model successful ")
    print (" The model is dumped at this location :",os.getcwd())

 Dumping the model successful 
 The model is dumped at this location : C:\Users\ash\Desktop\NLP-DL


In [8]:
### Load the pickled model and the track mapping dictionaries
with open('model_spotify_word2vec.pickle','rb') as model_file:
    model_spotify = pickle.load(model_file)

### Load the mapping from song name to numeric ID
with open('track_map_dict.pickle','rb') as dict1:
    track_dict= pickle.load( dict1)
print ("Track dict has {} observations".format(len(track_dict)))
#### Load the second mapping dictionary (used below to turn numeric IDs back into track names)
with open('track_map_comp_dict.pickle','rb') as dict2:
    track_map_comp_dict = pickle.load(dict2)
print ("Track dict has {} observations".format(len(track_map_comp_dict)))

Track dict has 1866246 observations
Track dict has 1978500 observations


### Define a function which returns songs similar to a particular song

In [10]:
#### Define a function which takes a song name and prints the n most similar songs
def similar_songs(songname,n):
    ''' Gets the song name from the user and prints the n most similar songs'''
    song_id = track_dict[songname]
    print ("Searching for songs similar to :",songname)

    # Query through .wv; calling most_similar on the model itself is deprecated
    similar = model_spotify.wv.most_similar(song_id,topn = n)
    print ("Similar songs are as follow")
    for track_id, _score in similar:
        print (track_map_comp_dict[track_id])
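
The function above raises a `KeyError` when the song name is not in `track_dict`. A minimal sketch of a more forgiving lookup is shown below; `lookup_track_id` and `toy_track_dict` are hypothetical, and the lower-casing assumes the dictionary keys are lower case, as the names printed in this notebook suggest.

```python
def lookup_track_id(songname, name_to_id):
    """Return the ID for a song name, or None when the name is unknown."""
    # Normalise case and surrounding whitespace before the lookup.
    key = songname.strip().lower()
    return name_to_id.get(key)

toy_track_dict = {"kashmir": 101, "fix you": 202}  # hypothetical IDs
print(lookup_track_id("  Kashmir ", toy_track_dict))
print(lookup_track_id("unknown song", toy_track_dict))
```

A name found in `track_dict` can still be missing from the model if the track fell below `min_count` during training. Checking `song_id in model_spotify.wv` before querying should catch that case, assuming gensim's membership test on the word vectors.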
        
    

### Define a function which takes a list of songs and creates a playlist for the user

In [11]:
#### Define a function which takes a list of song names and prints a playlist of n similar songs
def create_play_list(list_songs,n):
    ''' Gets a list of song names from the user and prints the n most similar songs'''
    seed_ids = [track_dict[song] for song in list_songs]

    print ("Searching for songs similar to :",list_songs)

    # With several positive examples, most_similar ranks tracks against their combined vector
    similar = model_spotify.wv.most_similar(positive = seed_ids,topn = n)
    print ("Playlist based on your list is as follows")
    for track_id, _score in similar:
        print (track_map_comp_dict[track_id])
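
As I understand it, when `most_similar` receives several positive examples it averages their unit-normalised vectors and ranks every other item by cosine similarity to that mean, leaving out the inputs themselves. The sketch below reproduces that idea in plain Python on toy 2-d vectors; `rank_by_mean_vector` and `toy_vectors` are hypothetical names, and this is an illustration rather than gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_by_mean_vector(seed_ids, vectors, topn):
    """Rank non-seed items by cosine similarity to the mean of the seed vectors."""
    seeds = [unit(vectors[s]) for s in seed_ids]
    dims = len(seeds[0])
    mean = unit([sum(v[d] for v in seeds) / len(seeds) for d in range(dims)])
    scores = []
    for item, vec in vectors.items():
        if item in seed_ids:
            continue  # never recommend the seed songs themselves
        cosine = sum(a * b for a, b in zip(mean, unit(vec)))
        scores.append((item, cosine))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Toy embeddings (hypothetical): "c" points between "a" and "b", "d" points away.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [-1.0, 0.0]}
ranked = rank_by_mean_vector(["a", "b"], toy_vectors, topn=2)
print(ranked)
```

Because the seeds are averaged, a playlist built from very different songs is pulled towards whatever lies between them in the vector space, which may match none of the seeds well.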

### Let's check out the results for my favourite songs list

In [12]:
create_play_list(['wonderwall','paradise','yellow','let her go','fireflies'],10)

2020-05-13 17:08:33,136 : INFO : precomputing L2-norms of word weight vectors


Searching for songs similar to : ['wonderwall', 'paradise', 'yellow', 'let her go', 'fireflies']
Playlist based on your list is as follows
world spins madly on
you and your heart
wouldn't it be nice - 1999 - remaster
where the streets have no name - unplugged
wonderwall - remastered
yellow - live
wherever you will go - acoustic
won't go home without you
you are the best thing
you and i both


### Let's check the results for a different music taste - Classic Metal | Rock

In [13]:
create_play_list(['enter sandman','fade to black','kashmir'],15)

Searching for songs similar to : ['enter sandman', 'fade to black', 'kashmir']
Playlist based on your list is as follows
fear of the dark - 1998 remastered version
eye of the tiger
for whom the bell tolls
fear of the dark
fairies wear boots
feel good inc
du hast
even flow
fuel
entre dos tierras
ett slag färgat rött
everlong
fade to black - instrumental version
fortunate son
estranged




### Try a different list of songs

In [14]:
create_play_list(['hey you','time','hypnotised','fix you'],10)

Searching for songs similar to : ['hey you', 'time', 'hypnotised', 'fix you']
Playlist based on your list is as follows
hänsyn
holy dread!
hon har ett sätt - 1998 digital remaster
high speed
highway of endless dreams
i am a man of constant sorrow - o brother, where art thou? soundtrack/with band
holiday
gap
human
i always was your girl




### Find songs similar to Kashmir by Led Zeppelin

In [15]:
similar_songs('kashmir',5)

Searching for songs similar to : kashmir
Similar songs are as follow
kashmir - live: o2 arena, london - december 10, 2007
immigrant song
keep talking - 2011 remastered version
karma police
joker and the thief




### Problems with the current dataset and next steps

1. The training data is not clean and contains many versions of the same song under different names. We could keep only the 1 or 2 most frequent versions of each song, for example
   Songs :  kashmir , kashmir - live: o2 arena, london - december 10, 2007
   
2. Songs with the same name can cater to different tastes depending on the artist. We should build the vocabulary by combining the track name with the artist name
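
Both next steps could start from a small normalisation helper, sketched below. This is a simpler stand-in for the frequency-based idea in point 1: it collapses alternate versions onto one canonical title using suffix keywords instead of counting versions. The keyword list, the `" || "` separator, and the availability of an artist name per track are all assumptions; `canonical_title` and `vocab_key` are hypothetical names.

```python
import re

# Suffixes that mark an alternate version; illustrative, not exhaustive.
VERSION_SUFFIX = re.compile(
    r"\s+-\s+.*\b(?:live|remaster(?:ed)?|acoustic|unplugged|instrumental)\b.*$"
)

def canonical_title(trackname):
    """Lower-case a track name and strip a trailing version suffix, if any."""
    return VERSION_SUFFIX.sub("", trackname.strip().lower())

def vocab_key(trackname, artistname):
    """Combine the canonical title with the artist so same-named songs stay distinct."""
    return canonical_title(trackname) + " || " + artistname.strip().lower()

print(canonical_title("kashmir - live: o2 arena, london - december 10, 2007"))
print(canonical_title("wouldn't it be nice - 1999 - remaster"))
print(vocab_key("Kashmir", "Led Zeppelin"))
```

Titles without a recognised keyword are left untouched, so soundtrack or featuring suffixes would need their own rules.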

        