# HitPredict 1st Milestone
## Data Collection, Visualisation and Feature Engineering

HitPredict will predict the popularity of a song based on some of its musical properties. We used the Spotify DB dataset from Kaggle, which contains audio features for over 230,000 tracks. It was assembled using Spotify's API.

As always, we start by importing the libraries that we'll be using.

In [1]:
import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split

The dataset is in .csv format, so we used pandas' read_csv() function to load it into Python. The first few rows are shown below.

In [2]:
data = pd.read_csv('SpotifyFeatures.csv')
data.head()

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39


The track ID does not give us any useful information, so let's just get rid of it.

In [3]:
del data['track_id']

In [4]:
data.describe()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
count,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0
mean,41.127502,0.36856,0.554364,235122.3,0.570958,0.148301,0.215009,-9.569885,0.120765,117.666585,0.454917
std,18.189948,0.354768,0.185608,118935.9,0.263456,0.302768,0.198273,5.998204,0.185518,30.898907,0.260065
min,0.0,0.0,0.0569,15387.0,2e-05,0.0,0.00967,-52.457,0.0222,30.379,0.0
25%,29.0,0.0376,0.435,182857.0,0.385,0.0,0.0974,-11.771,0.0367,92.959,0.237
50%,43.0,0.232,0.571,220427.0,0.605,4.4e-05,0.128,-7.762,0.0501,115.778,0.444
75%,55.0,0.722,0.692,265768.0,0.787,0.0358,0.264,-5.501,0.105,139.054,0.66
max,100.0,0.996,0.989,5552917.0,0.999,0.999,1.0,3.744,0.967,242.903,1.0


First and foremost, let's check for any missing (null) values that we might need to replace:

In [5]:
print(pd.isnull(data).sum())

genre               0
artist_name         0
track_name          0
popularity          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
dtype: int64


Fortunately there are none, so let's move on to some of the feature engineering that we have done.

Most values are numerical and need no preprocessing. However, we do have to convert some text-based columns into numbers that can be fed to the network afterwards.
Such columns are key, mode and time signature, all of which take only a small number of distinct values and will be encoded numerically.

I was not familiar with most of these terms, so below are short descriptions that helped me understand them better.

Time signature (also known as meter signature, metre signature, or measure signature): a notational convention used in Western musical notation to specify how many beats (pulses) are contained in each measure (bar), and which note value is equivalent to a beat.

Mode: in the theory of Western music, a type of musical scale coupled with a set of characteristic melodic behaviors.

In [6]:
categorical_features = ["genre", "artist_name", "time_signature", "key", "mode"]
n_items = len(data)
for feat in categorical_features:
    print("Processing %s. Number of unique fields: %d" % (feat, data[feat].nunique()))
    # Only features with fewer than 50 unique values are inspected and grouped.
    if data[feat].nunique() < 50:
        print(data[feat].unique())
        print("Number of occurrences of each unique value:")
        print(data.groupby(feat).count().iloc[:, 0])
        for feat_value in data[feat].unique():
            # Merge values that cover at most 2% of all rows into "OTHER".
            if len(data[data[feat] == feat_value]) / n_items <= 0.02:
                print("Adding %s category to the 'OTHER' category." % feat_value)
                data[feat] = data[feat].apply(lambda x: "OTHER" if x == feat_value else x)

        print("Final number of unique fields:")
        print(data.groupby(feat).count().iloc[:, 0])
        print("\n")

Processing genre. Number of unique fields: 27
['Movie' 'R&B' 'A Capella' 'Alternative' 'Country' 'Dance' 'Electronic'
 'Anime' 'Folk' 'Blues' 'Opera' 'Hip-Hop' "Children's Music"
 'Children’s Music' 'Rap' 'Indie' 'Classical' 'Pop' 'Reggae' 'Reggaeton'
 'Jazz' 'Rock' 'Ska' 'Comedy' 'Soul' 'Soundtrack' 'World']
Number of occurrences of each unique value:
genre
A Capella            119
Alternative         9263
Anime               8936
Blues               9023
Children's Music    5403
Children’s Music    9353
Classical           9256
Comedy              9681
Country             8664
Dance               8701
Electronic          9377
Folk                9299
Hip-Hop             9295
Indie               9543
Jazz                9441
Movie               7806
Opera               8280
Pop                 9386
R&B                 8992
Rap                 9232
Reggae              8771
Reggaeton           8927
Rock                9272
Ska                 8874
Soul                9089
Soundtrack     
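
The grouping of rare values into an "OTHER" category can also be written without the inner loop. The following is a minimal sketch, not the code used in this notebook: the helper name `group_rare_categories`, its parameters, and the toy series are hypothetical and only illustrate the same 2% threshold.

```python
import pandas as pd

def group_rare_categories(series, min_share=0.02, other_label="OTHER"):
    """Replace values whose share of all rows is <= min_share with other_label."""
    # Relative frequency of every distinct value in the column.
    shares = series.value_counts(normalize=True)
    rare_values = shares[shares <= min_share].index
    # Keep frequent values, overwrite the rare ones.
    return series.where(~series.isin(rare_values), other_label)

# Toy data (hypothetical, not taken from the Spotify dataset).
toy = pd.Series(["4/4"] * 95 + ["3/4"] * 4 + ["1/4"] * 1, name="time_signature")
grouped = group_rare_categories(toy)
print(grouped.value_counts())
```

Here only "1/4" falls at or below the 2% threshold, so it is the only value mapped to "OTHER".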

As there are not too many possible values for these categorical features, we can one-hot encode them for more efficient learning, using pandas' built-in get_dummies() function.

In [7]:
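data = pd.get_dummies(data, columns=["genre", "time_signature", "key", "mode"])
# Use the full option name; the short form "max_columns" can be ambiguous in newer pandas versions.
pd.set_option("display.max_columns", None)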
data.sample(10)

Unnamed: 0,artist_name,track_name,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,genre_Alternative,genre_Anime,genre_Blues,genre_Children's Music,genre_Children’s Music,genre_Classical,genre_Comedy,genre_Country,genre_Dance,genre_Electronic,genre_Folk,genre_Hip-Hop,genre_Indie,genre_Jazz,genre_Movie,genre_OTHER,genre_Opera,genre_Pop,genre_R&B,genre_Rap,genre_Reggae,genre_Reggaeton,genre_Rock,genre_Ska,genre_Soul,genre_Soundtrack,genre_World,time_signature_3/4,time_signature_4/4,time_signature_5/4,time_signature_OTHER,key_A,key_A#,key_B,key_C,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#,mode_Major,mode_Minor
201541,Thomas Newman,Hide and Seek,31,0.864,0.506,111640,0.161,0.843,0.115,-20.399,0.0372,120.01,0.0892,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
106198,Gaetano Donizetti,Don Pasquale: Act II: Aria: Chercherò lontana ...,4,0.989,0.223,582536,0.0555,0.00248,0.127,-21.266,0.0456,87.894,0.0608,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0
119767,Valee,Awesome (feat. Matt Ox),58,0.37,0.824,178747,0.47,0.0,0.115,-8.767,0.425,69.968,0.602,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0
87080,A Boogie Wit da Hoodie,Drowning (feat. Kodak Black),81,0.501,0.839,209269,0.81,0.0,0.117,-5.274,0.0568,129.014,0.814,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
118732,Pitbull,Hey Baby (Drop It to the Floor),63,0.0435,0.595,234453,0.913,0.0,0.259,-3.428,0.0884,128.021,0.762,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
228545,Anthony Hamilton,Pray For Me,44,0.316,0.78,279893,0.436,0.0,0.105,-5.933,0.0478,127.993,0.21,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
193858,Audra McDonald,How Much Love,9,0.989,0.449,200640,0.129,0.00448,0.0867,-14.612,0.0402,124.034,0.129,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
80832,Adolphe Adam,"Giselle, Act II: Albrecht's Variation",7,0.98,0.323,54693,0.153,0.374,0.109,-17.499,0.0377,183.866,0.599,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0
113411,Lil Yachty,SaintLaurentYSL (feat. Lil Baby),65,0.521,0.886,168439,0.303,0.0,0.169,-11.199,0.792,127.981,0.322,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
184660,Phil Harris,I`ve Got Nothing To Do But Love,10,0.974,0.848,193097,0.375,0.000273,0.102,-10.398,0.225,110.842,0.816,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
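
To make the effect of one-hot encoding easier to see than in the wide table above, here is a small sketch on a toy frame. The frame and its values are made up for illustration; only `pd.get_dummies` and the `mode` column name come from the notebook.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numerical and one categorical column.
toy = pd.DataFrame({
    "popularity": [10, 55, 80],
    "mode": ["Major", "Minor", "Major"],
})

# Each distinct value of "mode" becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["mode"])
print(encoded.columns.tolist())  # ['popularity', 'mode_Major', 'mode_Minor']
```

The indicator columns may be shown as 0/1 or as True/False depending on the pandas version; casting with `astype(int)` gives the 0/1 form seen in the output above.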


Finally, we split the data into 60% training, 20% validation and 20% test subsets.

In [8]:
train, validate, test = np.split(data.sample(frac=1), [int(.6*len(data)), int(.8*len(data))])
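
The same 60/20/20 split could also be produced with the `train_test_split` function imported at the top, which makes it straightforward to fix a seed for reproducibility. This is a sketch on a toy frame, assuming a seed of 42 (an arbitrary choice, not something the notebook specifies).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the preprocessed data (hypothetical values).
toy = pd.DataFrame({"feature": np.arange(100), "popularity": np.arange(100) % 10})

# Hold out 20% for testing, then take 25% of the remaining 80%
# (20% of the whole) for validation. random_state=42 is an assumed seed.
train_val, test = train_test_split(toy, test_size=0.2, random_state=42)
train, validate = train_test_split(train_val, test_size=0.25, random_state=42)

print(len(train), len(validate), len(test))  # 60 20 20
```

Unlike `data.sample(frac=1)` without a seed, this version returns the same subsets on every run.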