0. Summary

This notebook contains five parts:

0.1. Points 1-3 document the steps taken to create the model's training data.
      This data is available in the repository under "./static/spotify.csv", so it can simply be loaded from there if the user wishes to experiment with it.

0.2. Points 4-8 demonstrate training the kNN model and testing it with input generated via the spotipy API.

0.3. Point 9 demonstrates the least-similar-song search using NumPy.

0.4. Point 10 holds the code to get a 30-second demo of a song via the spotipy API.

0.5. Point 11 briefly discusses why we chose kNN and did not opt for a neural net.

1. Load the data

In [None]:
import pandas as pd 

# forward slashes avoid accidental escape sequences such as "\d" and work on every OS
df_2018 = pd.read_csv("./data/spotify_2018.csv", encoding="latin1")
df_2019 = pd.read_csv("./data/spotify_2019.csv", encoding="latin1")
df_2020 = pd.read_csv("./data/spotify_2020.csv", encoding="latin1")

df = pd.concat([df_2018,df_2019,df_2020]).drop_duplicates().reset_index(drop=True)

2. Data cleaning

In [7]:
# remove rows that contain NaNs
df = df.dropna()

# remove tracks that are most likely audiobooks, speeches etc

df = df[(df.speechiness < 0.8)]

# remove tracks with loudness > 0

df = df[(df.loudness<=0)]

In [8]:
df = df.drop_duplicates('track_id', keep='last')
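
The cleaning rules of this section can also be bundled into one function, which makes them easy to re-run on a fresh export. The following is only a minimal sketch on a made-up toy frame: the function name `clean_tracks`, the `max_speechiness` parameter and the final `reset_index` are our own additions for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd


def clean_tracks(df: pd.DataFrame, max_speechiness: float = 0.8) -> pd.DataFrame:
    """Apply the cleaning rules of this section (hypothetical helper)."""
    df = df.dropna()                                  # drop rows with missing values
    df = df[df["speechiness"] < max_speechiness]      # drop likely spoken-word tracks
    df = df[df["loudness"] <= 0]                      # drop tracks with positive loudness
    df = df.drop_duplicates("track_id", keep="last")  # keep one row per track
    # Assumption: resetting the index so that labels and positions agree.
    return df.reset_index(drop=True)


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "track_id":    ["a", "a", "b", "c", "d"],
    "speechiness": [0.05, 0.06, 0.95, 0.10, np.nan],
    "loudness":    [-5.0, -6.0, -7.0, 1.5, -3.0],
})
cleaned = clean_tracks(toy)
print(cleaned)
```

In the toy frame only the second "a" row survives: "d" has a NaN, "b" is too speech-heavy, "c" has positive loudness, and the first "a" row is a duplicate.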

3. Prepare data for training the model

In [9]:
# save track Ids for future lookup
IDS = df.track_id

# provide features for training

features = [
 'danceability',
 'energy',
 'key',
 'loudness',
 'mode',
 'speechiness',
 'acousticness',
 'instrumentalness',
 'liveness',
 'valence',
 'tempo']

X = df[features]

In [10]:
# check that X and IDS are of the same length
len(X.energy)  == len(IDS) 

True

In [11]:
# Scale key, loudness and tempo by fixed constants so that all features have a
# comparable range (a hand-rolled stand-in for a MinMax scaler). This is
# implemented without sklearn to reduce the size of the future upload.

# work on an explicit copy so that pandas does not emit a SettingWithCopyWarning
X = X.copy()

X["key"] = X["key"] / 11
X["loudness"] = X["loudness"] / 58.882
X["tempo"] = X["tempo"] / 249.983


In [110]:
X.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
36,0.448,0.256,0.454545,-0.172718,1,0.0483,0.875,0.0,0.113,0.174,0.309109
48,0.664,0.0755,0.909091,-0.336877,1,0.0389,0.91,0.0,0.164,0.575,0.320038
51,0.652,0.486,0.636364,-0.145308,0,0.0382,0.22,0.0,0.177,0.378,0.53174
63,0.375,0.43,0.363636,-0.165772,1,0.0362,0.741,2.4e-05,0.093,0.209,0.333459
79,0.471,0.469,1.0,-0.179325,1,0.0379,0.128,2e-05,0.132,0.0798,0.479461
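
The same three divisions are applied again to the query song in point 5, so a small helper could keep training and query scaling in sync. This is a sketch under assumptions: the divisors are copied from the cell above, while the names `SCALE` and `scale_features` and the toy row are hypothetical.

```python
import pandas as pd

# Divisors copied from the scaling cell; the dict name is illustrative.
SCALE = {"key": 11, "loudness": 58.882, "tempo": 249.983}


def scale_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a scaled copy; the input frame is left untouched."""
    scaled = frame.copy()
    for column, divisor in SCALE.items():
        scaled[column] = scaled[column] / divisor
    return scaled


# Toy row, invented for illustration.
raw = pd.DataFrame(
    {"key": [11], "loudness": [-58.882], "tempo": [249.983], "energy": [0.5]}
)
scaled = scale_features(raw)
print(scaled)
```

Note that loudness is non-positive after cleaning, so the scaled loudness stays non-positive as well, which matches the negative values in the table above.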


In [12]:
# save final training dataset for the repository
X.to_csv("spotify.csv", encoding='latin1', index=False)

4. Train the model 

In [111]:
from sklearn.neighbors import NearestNeighbors

nn  = NearestNeighbors(n_neighbors=6, algorithm='brute')
nn.fit(X)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=6, p=2,
                 radius=1.0)
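
To see what the fitted model returns, here is a small sketch on four made-up 2-D points instead of the real feature matrix. The names `points` and `toy_nn` are ours; the calls themselves are the same `NearestNeighbors` API used above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four made-up 2-D points standing in for the feature matrix X.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])

toy_nn = NearestNeighbors(n_neighbors=2, algorithm="brute")
toy_nn.fit(points)

# Query with a single point; both results have shape (1, n_neighbors).
distances, indices = toy_nn.kneighbors(np.array([[0.9, 0.1]]))
print(indices[0])    # row positions in `points`, nearest first
print(distances[0])  # matching Euclidean distances
```

The indices are row positions in the fitted array, not track ids, which is why a separate lookup of the ids is needed later. A query that is itself part of the fitted data comes back as its own nearest neighbour at distance 0.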

5. Get example input with the spotipy api

In [1]:
import os

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# read the credentials from the environment instead of hard-coding them
cid = os.environ["SPOTIPY_CLIENT_ID"]
secret = os.environ["SPOTIPY_CLIENT_SECRET"]

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# search for the example song and fetch its audio features
song = sp.search('We are young fun', type='track', limit=1)
song_id = song['tracks']['items'][0]['id']
features_song = sp.audio_features([song_id])[0]

In [113]:
# create a pandas dataframe that can be used for knn

song = [[features_song[i] for i in features]]

example = pd.DataFrame(song, columns=features)

In [114]:
# scale the input with the same constants that were used for the training data

example.key = example.key/(11)
example.loudness = example.loudness/58.882
example.tempo = example.tempo/249.983

In [91]:
example.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,0.378,0.638,0.909091,-0.094698,1,0.075,0.02,7.7e-05,0.0849,0.735,0.736394


6. Make a prediction

In [115]:
distance, ids = nn.kneighbors(example)

In [116]:
# match provided indexes with track_ids
for i in ids[0]:
    print(IDS.iloc[i])

2KyeLD6a0JH1iWykA7uyvj
6ly3yIQCvsoLjFKn1sVN0T
4Pd5vQ9k3nGZtJshIneXSA
3SRQ83CRFQ8uMRX7IsJp4N
4LHvE4PjSvyOE0rInpU9C7
4cOwNYxsajIGLJzoV845ec


In [117]:
# output in a list 

output = [IDS.iloc[i] for i in ids[0]]

In [81]:
output

['72xcjEOojC20QH6LlqcdSi',
 '72xcjEOojC20QH6LlqcdSi',
 '2AaF78iCWISMWYog5RnSi5',
 '2AaF78iCWISMWYog5RnSi5',
 '06qEiiMjJKPCy5bmg47bCn',
 '06qEiiMjJKPCy5bmg47bCn']

7. Create a pickle of the model 

In [118]:
from joblib import dump

dump(nn, 'spotify2.joblib')

['spotify2.joblib']
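
Loading the model back works with `joblib.load`. The sketch below checks the round trip on a small stand-in model in a temporary directory; the file name `toy_model.joblib` and the random data are made up for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.neighbors import NearestNeighbors

# Small stand-in model; the notebook's `nn` would be saved the same way.
rng = np.random.default_rng(0)
toy_X = rng.random((20, 3))
model = NearestNeighbors(n_neighbors=3, algorithm="brute").fit(toy_X)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.joblib")  # hypothetical file name
    dump(model, path)
    restored = load(path)

# The restored model should answer queries exactly like the original.
query = toy_X[:1]
same = np.array_equal(model.kneighbors(query)[1], restored.kneighbors(query)[1])
print(same)
```

Since joblib files are pickle-based, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.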

8. Create csv for the lookup table

In [119]:
IDS.to_csv("IDS2.csv", encoding='latin1', index=False)

In [84]:
# test the csv file

test = pd.read_csv("./IDS2.csv")

In [85]:
test.head()

Unnamed: 0,track_id
0,6lfxq3CG4xtTiEg7opyCyx
1,06JmNnH3iXKENNRKifqu0v
2,7BXW1QCg56yzEBV8pW8pah
3,4MZQ3lHA1TYO6yyedtmBYg
4,4m1lB7qJ78VPYsQy7RoBcU


In [86]:
IDS.head()

0    6lfxq3CG4xtTiEg7opyCyx
1    06JmNnH3iXKENNRKifqu0v
2    7BXW1QCg56yzEBV8pW8pah
3    4MZQ3lHA1TYO6yyedtmBYg
4    4m1lB7qJ78VPYsQy7RoBcU
Name: track_id, dtype: object
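
Instead of comparing the two heads by eye, the round trip can be checked programmatically. A minimal sketch with made-up ids and an in-memory buffer standing in for "IDS2.csv":

```python
import io

import pandas as pd

# Made-up ids with a non-contiguous index, as left behind by row filtering.
ids = pd.Series(["id_a", "id_b", "id_c"], index=[36, 48, 51], name="track_id")

buffer = io.StringIO()  # in-memory stand-in for "IDS2.csv"
ids.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)["track_id"]

# Positions, not index labels, are what the model's output refers to.
matches = restored.tolist() == ids.tolist()
print(matches, restored.index.tolist())
```

Because the file is written with `index=False`, the reloaded table gets a fresh 0-based index, so the row positions returned by the model can be used on it directly.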

9. Implementing least similar song search

kNN does not offer a least-similar-neighbour query. However, finding the point in a set that is farthest from a given point is straightforward with NumPy.

In [156]:
import numpy as np

# transform the input data and the dataset to numpy arrays
input_song = example.values
data = X.values

# Euclidean distance between input_song and every row of data,
# i.e. every song that was used to train knn
distances = np.linalg.norm(data - input_song, axis=1)

# retrieve the index of the maximum distance in distances
index_least_similar_song = np.argmax(distances)

# get corresponding track_id
track_id = IDS.iloc[index_least_similar_song]


In [157]:
track_id

'6pZs7ObmFRDgcF1nz83iTx'
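
If several dissimilar songs are wanted rather than just one, the distances can be sorted instead of taking a single maximum. This is only a sketch on made-up points; `farthest_indices` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np


def farthest_indices(query: np.ndarray, data: np.ndarray, k: int = 1) -> np.ndarray:
    """Row positions of the k points in `data` farthest from `query`, farthest first."""
    distances = np.linalg.norm(data - query, axis=1)  # Euclidean distance per row
    return np.argsort(distances)[::-1][:k]


# Made-up points; the query sits at the origin.
toy_data = np.array([[0.1, 0.0], [3.0, 4.0], [1.0, 1.0], [0.0, 2.0]])
toy_query = np.array([[0.0, 0.0]])
print(farthest_indices(toy_query, toy_data, k=2))
```

As with the kNN output, the returned values are row positions, so they would be mapped to track ids with `IDS.iloc[...]` in the same way.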

10. Get a 30-second demo of a song via spotipy with a track_id

In [2]:
# the least similar track found in 9.
song_opposite = sp.track("spotify:track:6pZs7ObmFRDgcF1nz83iTx")

In [3]:
song_opposite["preview_url"]

'https://p.scdn.co/mp3-preview/4de084dacae3ed9027e7747305337c9f2bdffd66?cid=361bfc4ba3d24781af18c0585594c1ff'

In [4]:
# see 5. of this notebook for the origin of song_id
song_id

'7a86XRg84qjasly9f6bPSD'

In [5]:
# the original input song
song_input = sp.track("spotify:track:7a86XRg84qjasly9f6bPSD")
song_input["preview_url"]

'https://p.scdn.co/mp3-preview/af96d9b11588ea7d7754ccf2a8cefa7ead23a7d6?cid=361bfc4ba3d24781af18c0585594c1ff'
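
Spotify track objects may carry a null `preview_url`, so code that consumes it should handle a missing demo. The sketch below works on plain dicts shaped like the result of `sp.track(...)`; the helper name `preview_or_none` and the sample dicts are made up for illustration.

```python
from typing import Optional


def preview_or_none(track: dict) -> Optional[str]:
    """Return the preview URL of a track object, or None if there is none."""
    url = track.get("preview_url")
    return url if url else None


# Made-up track objects standing in for the result of sp.track(...).
with_preview = {"id": "example-1", "preview_url": "https://example.com/preview.mp3"}
without_preview = {"id": "example-2", "preview_url": None}

print(preview_or_none(with_preview))
print(preview_or_none(without_preview))
```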

11. Reasoning to not use a neural net

- Neural nets typically do not outperform traditional algorithms on tabular numeric data.

- One could argue that the key and mode columns in the dataset hold categorical values. However, given the way kNN works, we saw no reason to treat those values as categorical.

- As this is a student project without a budget, the primary option for deployment is Heroku. However, Heroku is known to not work well with TensorFlow, and neural nets are often simply too big to be deployed on Heroku.