# HitPredict

HitPredict predicts the popularity of a song based on some of its musical properties. We used the Spotify DB dataset from Kaggle, which contains feature rows for over 230,000 tracks. It was assembled using Spotify's API.

As always, we start by importing the libraries that we'll be using.

In [1]:
import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
np.random.seed(10)

The dataset is in .csv format, so we used pandas' read_csv() function to import it into Python. The first few rows are shown below.

In [2]:
data = pd.read_csv('SpotifyFeatures.csv')
data.head()

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39


The track ID does not give us any useful information, so let's just get rid of it. We also need to remove the track name and the artist name from the dataset, because processing them would just confuse the network.

In [3]:
del data['track_id']
del data['artist_name']
del data['track_name']

In [4]:
data.describe()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
count,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0
mean,41.127502,0.36856,0.554364,235122.3,0.570958,0.148301,0.215009,-9.569885,0.120765,117.666585,0.454917
std,18.189948,0.354768,0.185608,118935.9,0.263456,0.302768,0.198273,5.998204,0.185518,30.898907,0.260065
min,0.0,0.0,0.0569,15387.0,2e-05,0.0,0.00967,-52.457,0.0222,30.379,0.0
25%,29.0,0.0376,0.435,182857.0,0.385,0.0,0.0974,-11.771,0.0367,92.959,0.237
50%,43.0,0.232,0.571,220427.0,0.605,4.4e-05,0.128,-7.762,0.0501,115.778,0.444
75%,55.0,0.722,0.692,265768.0,0.787,0.0358,0.264,-5.501,0.105,139.054,0.66
max,100.0,0.996,0.989,5552917.0,0.999,0.999,1.0,3.744,0.967,242.903,1.0


First and foremost, let's check for any missing (null) values that we might need to replace:

In [5]:
print(pd.isnull(data).sum())

genre               0
popularity          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
dtype: int64
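
Note that isnull() counts missing entries (NaN/None), not literal zeros, which are valid values for columns such as popularity or instrumentalness. A small sketch on made-up toy data (the values below are illustrative only, not taken from the dataset) shows the difference:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing value and a couple of genuine zeros.
toy = pd.DataFrame({
    "popularity": [0, 12, 40],
    "instrumentalness": [0.0, np.nan, 0.5],
})

missing_per_column = toy.isnull().sum()   # counts NaN/None entries only
zeros_per_column = (toy == 0).sum()       # counts literal zeros (NaN == 0 is False)

print(missing_per_column)
print(zeros_per_column)
```

Whether a zero should be treated as "missing" depends on the feature, so the two checks answer different questions.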


Fortunately there are none, so let's move on to the feature engineering that we have done.

Most values are numerical and need no preprocessing. However, we do have to convert some text-based columns into numbers that can be fed to the network afterwards.
These columns are genre, key, mode and time signature, each of which takes one of a small set of unique values and will be given a numeric encoding.

I was not familiar with most of these terms; below are short descriptions that helped me understand them better.
Time signature (also known as meter signature, metre signature, or measure signature): a notational convention used in Western musical notation to specify how many beats (pulses) are contained in each measure (bar), and which note value is equivalent to a beat.
Mode: in the theory of Western music, a type of musical scale coupled with a set of characteristic melodic behaviors.

In [6]:
categorical_features = ["genre","artist_name","time_signature","key","mode"]
n_items = len(data)
for feat in categorical_features:
    print("Proccessing %s. number of unique fields: %d" % (feat, data[feat].nunique()))
    if data[feat].nunique()<50:
        print(data[feat].unique())
        print("Number of occurance of each unique value:")
        print(data.groupby(feat).count().iloc[:,0])
        for feat_value in data[feat].unique():
            if (len(data[data[feat]==feat_value]) / n_items <= 0.02):
                print("Adding %s category to the 'OTHER' category." % feat_value)
                data[feat] = data[feat].apply(lambda x: "OTHER" if x==feat_value else x, 1)
                
        print("Final number of unique fields:")
        print(data.groupby(feat).count().iloc[:,0]) 
        print("\n")

Proccessing genre. number of unique fields: 27
['Movie' 'R&B' 'A Capella' 'Alternative' 'Country' 'Dance' 'Electronic'
 'Anime' 'Folk' 'Blues' 'Opera' 'Hip-Hop' "Children's Music"
 'Children’s Music' 'Rap' 'Indie' 'Classical' 'Pop' 'Reggae' 'Reggaeton'
 'Jazz' 'Rock' 'Ska' 'Comedy' 'Soul' 'Soundtrack' 'World']
Number of occurance of each unique value:
genre
A Capella            119
Alternative         9263
Anime               8936
Blues               9023
Children's Music    5403
Children’s Music    9353
Classical           9256
Comedy              9681
Country             8664
Dance               8701
Electronic          9377
Folk                9299
Hip-Hop             9295
Indie               9543
Jazz                9441
Movie               7806
Opera               8280
Pop                 9386
R&B                 8992
Rap                 9232
Reggae              8771
Reggaeton           8927
Rock                9272
Ska                 8874
Soul                9089
Soundtrack     

KeyError: 'artist_name'
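
A minimal sketch of a more defensive version of this step, shown on toy data: it skips feature names that are no longer in the DataFrame (such as artist_name here) instead of raising, and merges rare values in one vectorized pass. The helper name merge_rare and the toy values are illustrative assumptions, and the sketch leaves out the nunique() < 50 check. Applying it to the real data could also touch time_signature, key and mode, and so change the columns produced later.

```python
import pandas as pd

def merge_rare(df, features, min_share=0.02, other="OTHER"):
    """Return a copy of df where values covering <= min_share of rows become `other`."""
    df = df.copy()
    for feat in features:
        if feat not in df.columns:
            # Column was dropped earlier: skip it rather than raise KeyError.
            print("Skipping %s: column not present." % feat)
            continue
        shares = df[feat].value_counts(normalize=True)
        rare = shares[shares <= min_share].index
        # Keep non-rare values, replace rare ones with the catch-all label.
        df[feat] = df[feat].where(~df[feat].isin(rare), other)
    return df

# Toy data (made up): "A Capella" covers 1% of the rows.
toy = pd.DataFrame({
    "genre": ["Pop"] * 60 + ["Rock"] * 39 + ["A Capella"],
    "mode": ["Major"] * 50 + ["Minor"] * 50,
})
merged = merge_rare(toy, ["genre", "artist_name", "mode"])
print(merged["genre"].value_counts())
```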

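Because the loop stopped at artist_name (which was deleted in cell 3), only genre had its rare category, A Capella, merged into OTHER; time_signature, key and mode were left unchanged. As there are not too many possible values for these categorical features, we can one-hot encode them for more efficient learning, using pandas' built-in get_dummies() function.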

In [7]:
data = pd.get_dummies(data, columns=["genre", "time_signature", "key","mode"])
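# Use the full option name; the short form "max_columns" is ambiguous in newer pandas versions
pd.set_option("display.max_columns", None)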
data.sample(10)

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,genre_Alternative,genre_Anime,genre_Blues,genre_Children's Music,genre_Children’s Music,genre_Classical,genre_Comedy,genre_Country,genre_Dance,genre_Electronic,genre_Folk,genre_Hip-Hop,genre_Indie,genre_Jazz,genre_Movie,genre_OTHER,genre_Opera,genre_Pop,genre_R&B,genre_Rap,genre_Reggae,genre_Reggaeton,genre_Rock,genre_Ska,genre_Soul,genre_Soundtrack,genre_World,time_signature_0/4,time_signature_1/4,time_signature_3/4,time_signature_4/4,time_signature_5/4,key_A,key_A#,key_B,key_C,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#,mode_Major,mode_Minor
103869,51,0.0505,0.722,237920,0.828,0.0,0.114,-3.227,0.19,117.693,0.275,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
24247,39,0.00663,0.703,235333,0.872,0.175,0.118,-6.435,0.15,90.048,0.387,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
181919,43,0.797,0.678,182893,0.312,0.0,0.115,-14.695,0.0392,119.889,0.953,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0
152944,69,0.104,0.53,203453,0.707,0.0,0.105,-5.516,0.203,109.827,0.696,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
153187,54,0.000945,0.398,232187,0.967,6e-06,0.18,-3.232,0.0611,111.676,0.611,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
36711,28,0.00292,0.601,260569,0.654,0.754,0.11,-8.077,0.0332,127.997,0.0811,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
124296,0,0.939,0.499,102107,0.146,3e-06,0.495,-24.238,0.13,60.424,0.403,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0
4637,39,0.346,0.476,225952,0.447,2.8e-05,0.0875,-6.434,0.0292,133.088,0.324,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
104698,22,0.988,0.389,105507,0.196,0.91,0.209,-23.298,0.0492,76.932,0.442,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0
23503,42,0.000586,0.378,191294,0.933,0.0132,0.219,-2.102,0.314,170.653,0.346,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
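
To see what get_dummies does on a small scale, here is a sketch on a made-up two-column frame (the values are illustrative assumptions). Each distinct value of the encoded column becomes its own indicator column named column_value, while the other columns pass through unchanged. Depending on the pandas version the indicators are stored as 0/1 integers or as booleans.

```python
import pandas as pd

# Toy frame (made-up values) with one numeric and one categorical column.
toy = pd.DataFrame({
    "tempo": [120.0, 90.5, 140.2],
    "mode": ["Major", "Minor", "Major"],
})

# One indicator column per distinct value of "mode".
encoded = pd.get_dummies(toy, columns=["mode"])
print(list(encoded.columns))
print(encoded["mode_Major"].astype(int).tolist())
```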


Finally, we split the data into 60% training, 20% validation and 20% test subsets.

In [8]:
train, validate, test = np.split(data.sample(frac=1), [int(.6*len(data)), int(.8*len(data))])

In [9]:
print(len(test), len(validate), len(train))

46545 46545 139635
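
The same 60/20/20 split could also be obtained with the train_test_split function imported at the top. This is only a sketch of an alternative, not the method used in this notebook; the toy frame and the names train_part, validate_part and test_part are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (made up) standing in for the real dataset.
toy = pd.DataFrame({"x": np.arange(100), "popularity": np.arange(100) % 10})

# First set aside 40% of the rows, then split that part in half
# to get the validation and test subsets.
train_part, rest = train_test_split(toy, test_size=0.4, random_state=10)
validate_part, test_part = train_test_split(rest, test_size=0.5, random_state=10)

print(len(train_part), len(validate_part), len(test_part))
```

Passing random_state makes the shuffle reproducible independently of the global NumPy seed.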


In [10]:
# We have plenty of rows in the dataset. Before starting a full training session, I would like to
# make sure that the network is functioning properly.
# So let's just play around with a fraction of the dataset; I don't want to lock up my computer for hours.
demo_train = train[:10000]
demo_validate = validate[:2000]
demo_test = test[:2000]

In [11]:
# Remove the 'demo_' prefix to run the training on the whole dataset.

# Let's extract the target column from the dataset.
Y_train = demo_train['popularity'].values
Y_validate = demo_validate['popularity'].values
Y_test = demo_test['popularity'].values

# We also create the train, test, and validation input here
X_train = demo_train.drop(columns=['popularity'])
X_validate = demo_validate.drop(columns=['popularity'])
X_test = demo_test.drop(columns=['popularity'])

In [12]:
from sklearn.preprocessing import StandardScaler

# Not all features are standardized, so let's do it before we start the training.
# The scaler is fitted on the training data only and then applied to all three subsets.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_validate = scaler.transform(X_validate)
X_test = scaler.transform(X_test)
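
As a small check of what this cell does, the sketch below (on made-up numbers) confirms that transform() subtracts the training mean and divides by the training standard deviation, also for data the scaler was not fitted on. The array names X_tr and X_val are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and validation inputs (made-up values), two features each.
X_tr = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_val = np.array([[2.0, 400.0]])

# Fit on the training data only, then reuse the same statistics elsewhere.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_val_scaled = scaler.transform(X_val)

# Manual equivalent: (x - training mean) / training std (population std, ddof=0).
manual = (X_val - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(manual, X_val_scaled))
```

Fitting on the training subset only keeps information from the validation and test subsets out of the preprocessing step.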

In [13]:
# importing necessary keras packages
from keras.models import Sequential
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Dropout, Activation
from keras.models import load_model
from keras.optimizers import SGD

Using TensorFlow backend.


In [14]:
# Define callbacks: stop once val_loss has not improved for `patience` epochs,
# and save only the best weights seen so far.
patience = 30
early_stopping = EarlyStopping(patience=patience, verbose=1)
checkpointer = ModelCheckpoint(filepath='weights.hdf5', save_best_only=True, verbose=1)

model = Sequential()
# Keras 2 calls the layer width `units` (`output_dim` is the deprecated Keras 1 name)
model.add(Dense(units=40, input_dim=X_train.shape[1]))
model.add(Activation('relu'))
model.add(Dense(units=30))
model.add(Activation('relu'))
model.add(Dense(units=15))
model.add(Activation('relu'))
# This is a regression problem where we target values in the 0-100 range
model.add(Dense(units=1, activation='relu'))
# Let's have a look at the model
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 40)                2280      
_________________________________________________________________
activation_1 (Activation)    (None, 40)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 30)                1230      
_________________________________________________________________
activation_2 (Activation)    (None, 30)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 15)                465       
_________________________________________________________________
activation_3 (Activation)    (None, 15)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                



In [15]:
sgd = SGD(lr=0.00001)
# using mse for regression problem
model.compile(loss='mse', optimizer=sgd)
history=model.fit(X_train,Y_train,epochs=200, 
                  batch_size=16,
                  verbose=2,
                  validation_data=(X_validate, Y_validate),
                  callbacks=[checkpointer, early_stopping])


Train on 10000 samples, validate on 2000 samples
Epoch 1/200
 - 1s - loss: 2019.7855 - val_loss: 2031.4243

Epoch 00001: val_loss improved from inf to 2031.42433, saving model to weights.hdf5
Epoch 2/200
 - 1s - loss: 2019.3284 - val_loss: 2030.0968

Epoch 00002: val_loss improved from 2031.42433 to 2030.09680, saving model to weights.hdf5
Epoch 3/200
 - 1s - loss: 1975.2878 - val_loss: 1803.6781

Epoch 00003: val_loss improved from 2030.09680 to 1803.67813, saving model to weights.hdf5
Epoch 4/200
 - 1s - loss: 736.4616 - val_loss: 188.7085

Epoch 00004: val_loss improved from 1803.67813 to 188.70845, saving model to weights.hdf5
Epoch 5/200
 - 1s - loss: 150.6487 - val_loss: 123.6564

Epoch 00005: val_loss improved from 188.70845 to 123.65637, saving model to weights.hdf5
Epoch 6/200
 - 1s - loss: 120.6651 - val_loss: 108.7864

Epoch 00006: val_loss improved from 123.65637 to 108.78640, saving model to weights.hdf5
Epoch 7/200
 - 1s - loss: 111.8462 - val_loss: 103.2076

Epoch 00007


Epoch 00061: val_loss did not improve from 90.69485
Epoch 62/200
 - 1s - loss: 87.1592 - val_loss: 90.8772

Epoch 00062: val_loss did not improve from 90.69485
Epoch 63/200
 - 1s - loss: 87.0992 - val_loss: 90.9173

Epoch 00063: val_loss did not improve from 90.69485
Epoch 64/200
 - 1s - loss: 86.9761 - val_loss: 90.7908

Epoch 00064: val_loss did not improve from 90.69485
Epoch 65/200
 - 1s - loss: 86.9336 - val_loss: 90.5816

Epoch 00065: val_loss improved from 90.69485 to 90.58160, saving model to weights.hdf5
Epoch 66/200
 - 1s - loss: 86.8683 - val_loss: 90.5939

Epoch 00066: val_loss did not improve from 90.58160
Epoch 67/200
 - 1s - loss: 86.7809 - val_loss: 90.5381

Epoch 00067: val_loss improved from 90.58160 to 90.53808, saving model to weights.hdf5
Epoch 68/200
 - 1s - loss: 86.7028 - val_loss: 90.6596

Epoch 00068: val_loss did not improve from 90.53808
Epoch 69/200
 - 1s - loss: 86.6540 - val_loss: 90.6398

Epoch 00069: val_loss did not improve from 90.53808
Epoch 70/200


In [16]:
# Load the weights that performed best on the validation dataset
from sklearn.metrics import mean_squared_error
model = load_model('weights.hdf5')
# predictions for the test dataset
preds = model.predict(X_test)
test_err = mean_squared_error(Y_test, preds)
print("Test error: ",test_err)

Test error:  100.00394219449768


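Training only on a sample of the dataset gave us a mean squared error of about 100 on the test subset. This corresponds to a root mean squared error of about 10, so the model's predictions are typically off by roughly 10 popularity points on the 0-100 scale. This is not the best result so far (for comparison, popularity has a standard deviation of about 18 across the whole dataset), but it still shows some relationship between the features and the popularity. The next step is to start investigating what effect the lyrics have on a track's popularity using NLP.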
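
To make the relation between the mean squared error and a "typical miss" concrete, here is a small sketch on made-up numbers (not the model's actual predictions). It computes the MSE, its square root (RMSE), the mean absolute error, and the MSE of a baseline that always predicts the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions: three exact hits and one miss of 20 points.
y_true = np.array([10.0, 30.0, 50.0, 70.0])
y_pred = np.array([10.0, 30.0, 50.0, 50.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as popularity
mae = np.mean(np.abs(y_true - y_pred))     # average absolute miss

# Baseline: always predict the mean of the targets.
baseline_mse = mean_squared_error(y_true, np.full_like(y_true, y_true.mean()))

print(mse, rmse, mae, baseline_mse)
```

Here the RMSE is 10 while the average absolute miss is only 5, because squaring weights large errors more heavily; comparing against the mean-only baseline gives a sense of how much the features actually help.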