- Problem Setting
- Problem Definition
- Data Source
- Data Description
- Data Collection
- Data Processing
- Data Exploration
- Model Exploration and Model Selection
- Implementation of Selected Models
- Performance Evaluation and Interpretation
- Project Results
- Impact of the Project Outcomes
The music industry is a multibillion-dollar business made up of artists from diverse backgrounds producing songs in various genres that influence people worldwide. Yet only a handful of songs and artists make it to the top of the Billboard charts. Using historical Spotify data for every decade from 1960 to 2019, we try to predict whether a track will be a hit.
Using this dataset, created by fetching features from Spotify's Web API, we try to classify a track as a 'hit' or not for every decade from 1960 to 2019. We apply multiple classification models and assess their performance to pick the optimal one.
The dataset for this project is acquired from Kaggle:
https://www.kaggle.com/theoverman/the-spotify-hit-predictor-dataset
This dataset consists of 41,106 songs with 19 attributes spanning six decades, from 1960 to 2019. The target variable is a label indicating whether a song is a hit (1) or a flop (0). A track is considered a flop if it meets all of the following conditions:
- The track must not appear in the 'hit' list of that decade.
- The track's artist must not appear in the 'hit' list of that decade.
- The track must belong to a genre that could be considered non-mainstream and/or avant-garde.
- The track’s genre must not have a song on the ‘hit’ list.
- The track must have ‘US’ as one of its markets.
Other attributes of the dataset include track, artist, URI, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration, etc.
Figure 1: Data Sample with Descriptions
We downloaded and extracted the six CSV files, one for each decade, and then merged them into a single dataframe.
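A minimal sketch of this step (the per-decade file names are an assumption based on the Kaggle dataset's layout):

```python
import pandas as pd

# One CSV per decade (file names assumed from the Kaggle dataset).
decades = ["60s", "70s", "80s", "90s", "00s", "10s"]
frames = [pd.read_csv(f"dataset-of-{d}.csv") for d in decades]

# Merge all six decades into a single dataframe.
df = pd.concat(frames, ignore_index=True)
```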
We determined the shape of the data to confirm how many records and columns are present; all 41,106 records and 19 columns were imported successfully. Next, we inspected the data types and found that no explicit conversions were needed. We then checked for missing values or nulls and found none in the dataset. Lastly, we looked at the statistics of the dataset, gathering insights about the mean, median, and range of all the variables. Among the predictor variables, two columns, 'mode' and 'time_signature', can be considered categorical; since they are already numerical, they do not need to be converted. Our response variable is the 'target' column, which holds only the binary values 0 and 1, indicating whether a song is a hit.
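These checks reduce to a few standard pandas calls; a sketch, assuming the merged dataframe `df` from the previous step:

```python
print(df.shape)           # (41106, 19): all records and columns imported
print(df.dtypes)          # confirm no explicit type conversions are needed
print(df.isnull().sum())  # verify there are no missing values
print(df.describe())      # mean, median (50%), and range of every variable
```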
In this part, we looked at the statistics of the variables.
|  | danceability | energy | key | loudness | mode | speechiness | acousticness |
|---|---|---|---|---|---|---|---|
| count | 41106.000000 | 41106.000000 | 41106.000000 | 41106.000000 | 41106.000000 | 41106.000000 | 41106.000000 |
| mean | 0.539695 | 0.579545 | 5.213594 | -10.221525 | 0.693354 | 0.072960 | 0.364197 |
| std | 0.177821 | 0.252628 | 3.534977 | 5.311626 | 0.461107 | 0.086112 | 0.338913 |
| min | 0.000000 | 0.000251 | 0.000000 | -49.253000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.420000 | 0.396000 | 2.000000 | -12.816000 | 0.000000 | 0.033700 | 0.039400 |
| 50% | 0.552000 | 0.601000 | 5.000000 | -9.257000 | 1.000000 | 0.043400 | 0.258000 |
| 75% | 0.669000 | 0.787000 | 8.000000 | -6.374250 | 1.000000 | 0.069800 | 0.676000 |
| max | 0.988000 | 1.000000 | 11.000000 | 3.744000 | 1.000000 | 0.960000 | 0.996000 |
|  | instrumentalness | liveness | valence | tempo | duration_ms | chorus_hit | sections | time_signature |
|---|---|---|---|---|---|---|---|---|
| count | 41106.000000 | 41106.00 | 41106.00 | 41106.00 | 4.110600e+04 | 41106.000 | 41106.00 | 41106.00 |
| mean | 0.154416 | 0.201535 | 0.542440 | 119.338249 | 2.348776e+05 | 40.106041 | 10.475673 | 3.8936 |
| std | 0.303530 | 0.172959 | 0.267329 | 29.098845 | 1.189674e+05 | 19.005515 | 4.871850 | 0.4230 |
| min | 0.000000 | 0.013000 | 0.000000 | 0.000000 | 1.516800e+04 | 0.000000 | 0.000000 | 0.00 |
| 25% | 0.000000 | 0.094000 | 0.330000 | 97.397000 | 1.729278e+05 | 27.599792 | 8.000000 | 4.00 |
| 50% | 0.000120 | 0.132000 | 0.558000 | 117.565000 | 2.179070e+05 | 35.850795 | 10.000000 | 4.00 |
| 75% | 0.061250 | 0.261000 | 0.768000 | 136.494000 | 2.667730e+05 | 47.625615 | 12.000000 | 4.00 |
| max | 1.000000 | 0.999000 | 0.996000 | 241.423000 | 4.170227e+06 | 433.182000 | 169.000000 | 5.00 |

From the statistics, we can see that almost all the values for the predictor time_signature are 4 (its 25th, 50th, and 75th percentiles are all 4.00). Hence, we decide to remove this predictor in this step.
In this part, we first create a boxplot to look for outliers and to see the distribution of all the predictors.
Figure 2: Distribution of Scaled Predictors with Outliers
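A sketch of how such a boxplot can be produced, assuming the non-numeric columns ('track', 'artist', 'uri') and the target are excluded and the predictors are min-max scaled so they share one axis:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Numeric predictors only (column names assumed from the dataset).
predictors = df.drop(columns=["track", "artist", "uri", "target"])
scaled = pd.DataFrame(MinMaxScaler().fit_transform(predictors),
                      columns=predictors.columns)

# One box per predictor on a common [0, 1] scale.
scaled.boxplot(rot=90)
plt.title("Distribution of scaled predictors")
plt.tight_layout()
plt.show()
```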
There are many outliers in predictors such as instrumentalness and speechiness, which have 8,920 and 5,088 outliers respectively. We therefore remove these two predictors entirely and then drop every observation that still contains an outlier.
After removing the outliers, we have 33,401 songs and 12 predictors for building the model.
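A sketch of this filtering, assuming the standard 1.5×IQR rule is used to flag outliers:

```python
# Drop the two predictors with the most outliers.
df = df.drop(columns=["instrumentalness", "speechiness"])

# Flag an observation as an outlier if any remaining numeric predictor
# falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
num = df.select_dtypes("number").drop(columns=["target"])
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
outlier = ((num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)).any(axis=1)
df = df[~outlier]  # 33,401 songs remain in our run
```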
Figure 3 : Distribution of the Scaled Predictors after removing the outliers
This boxplot illustrates the distribution of the predictors; however, since our goal is to build a classification model, exploring the distribution of the predictors by target variable is much more informative.
Figure 4: Side-by-side boxplots by target variable
From figure 4, the predictors danceability, energy, loudness, acousticness, duration, liveness, sections, and valence have different distributions for hit songs and other songs. Hence, they can be considered good predictors.
The total number of songs after removing the outliers is 33,401. We count the number of hit songs and other songs to find out whether we need oversampling.
Figure 5: Count of hit songs and other songs
From figure 5, we can see that roughly half of the songs are hits and half are not, so we do not need oversampling.
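The check itself is a one-liner on the target column:

```python
print(df["target"].value_counts())
# The two classes are nearly balanced, so no oversampling is required.
```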
The purpose of this project is to predict whether a song is a hit or not. Since this is a classification task, we compare different classification models to get the best results. At first, 8 models are selected for this task. These models are:
- Logistic Regression
- K-Nearest Neighbors
- Decision Tree
- Support Vector Machine (Linear Kernel)
- Support Vector Machine (RBF Kernel)
- Neural Network
- Random Forest
- Gradient Boosting
Next, we split the dataset into 80% training and 20% testing sets. We use 3-fold cross-validation to choose the best models. The scores of the models are shown in table 1.
Model | Score |
---|---|
Logistic Regression | 68.1% |
K-Nearest Neighbors | 68.4% |
Decision Tree | 64.8% |
Support Vector Machine (Linear Kernel) | 68.2% |
Support Vector Machine (RBF Kernel) | 73.4% |
Neural Network | 73.1% |
Random Forest | 73.6% |
Gradient Boosting | 73.0% |
From table 1, the models fall into two groups: those with a score below 69% and those with a score above 72%. Therefore, we choose the 4 models scoring 72% and above for the next step: Support Vector Machine (RBF Kernel), Neural Network, Random Forest, and Gradient Boosting.
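A sketch of this comparison; the model hyperparameters shown are scikit-learn defaults (an assumption, aside from the kernels named above):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# 80/20 split; drops the non-numeric columns if still present.
# Features are assumed scaled (SVMs and neural nets are scale-sensitive).
X = df.drop(columns=["track", "artist", "uri", "target"], errors="ignore")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM (Linear Kernel)": SVC(kernel="linear"),
    "SVM (RBF Kernel)": SVC(kernel="rbf"),
    "Neural Network": MLPClassifier(max_iter=500, random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    score = cross_val_score(model, X_train, y_train, cv=3).mean()
    print(f"{name}: {score:.1%}")
```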
In this step, we use cross-validation to observe the performance of the selected models. We use 5-fold cross-validation: the data is divided into 5 folds, 4 of which are used for training while the remaining one serves as the validation set. Each fold is used once as the validation set, so each model is fitted 5 times.
Model | Score |
---|---|
Support Vector Machine (RBF Kernel) | 71.88% |
Neural Network | 71.73% |
Random Forest | 74.00% |
Gradient Boosting | 71.30% |
From the results of the cross-validation, we choose the Random Forest Classifier as our final model.
We first train the model on the 80% training split (with roughly 33,400 rows in total, an 80% training split is sufficient).
The model gives an accuracy of 73.74%. In the next steps, we use different methods to try to enhance the performance of the model and then evaluate it.
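A sketch of the baseline fit, reusing the split from the model-selection step:

```python
from sklearn.ensemble import RandomForestClassifier

baseline = RandomForestClassifier(random_state=42)
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # ~0.74 accuracy on the test set
```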
The random forest model splits nodes on the more important features and thus implicitly performs feature selection. In this step, we try removing some of the less important features to see whether the model's accuracy and running time improve. Figure 6 shows the importance of each feature.
It is observed that sections, mode, and key carry little importance in the random forest model.
After removing these three features, the performance of the model decreases by 0.6%, and the running time decreases by only 1 second. Therefore, we use this step only to observe the importance of each feature, and we keep all the features.
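The importances come straight from the fitted model's `feature_importances_` attribute; a sketch using the baseline model above:

```python
import pandas as pd

# Rank features by their importance in the fitted forest.
importances = pd.Series(baseline.feature_importances_,
                        index=X_train.columns).sort_values(ascending=False)
print(importances)  # sections, mode, and key rank lowest in our run
```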
In this step, we use dimension reduction methods to try to improve the performance of the model. First, we build all the principal components in order to find a subset that captures more than 90% of the total variance in the data.
From figure 7, we can conclude that 7 principal components capture 90% of the total variance. Building the model with these 7 principal components gives a cross-validation score of 68.5%.
This is about 6% less than the accuracy we had without dimension reduction, so we will not use dimension reduction before building our random forest model.
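A sketch of this step, assuming the features are standardized before PCA:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit all components, then count how many are needed to reach
# 90% cumulative explained variance.
X_std = StandardScaler().fit_transform(X_train)
pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.90)) + 1  # 7 in our run
X_train_pca = PCA(n_components=n_components).fit_transform(X_std)
```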
A random forest model has various parameters, and in this part, we try to optimize model performance by choosing the best values for them. The random forest parameters we tune are:
- Number of Estimators: the number of trees in the model.
- Maximum Depth: the maximum number of splits in a tree (maximum level).
- Minimum Sample Split: the minimum number of samples required to split a node.
- Minimum Sample Leaf: the minimum number of observations each leaf must have after all the splits.
- Bootstrap: whether or not to use bootstrap sampling.
As there are several parameters and each random forest model takes about 9 seconds to run, we first perform a randomized search over a wide set of parameter values, and afterwards perform a grid search on the best-performing values.
The parameters for the random search are shown in the table:
Parameter | values |
---|---|
N_estimators | [100,110,120,…,990,1000] |
Max_depth | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] |
Min_sample_split | [2,4,6,8,10,…,46,48,50] |
Min_sample_leaf | [2,4,6,8,10,…,46,48,50] |
Bootstrap | True, False |
After running the model with the random parameters and with 3-fold cross-validation for each of them, we obtain the following result:
Figure 8: Random Search Results
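A sketch of the randomized search over the ranges in the table above (the number of sampled candidates, `n_iter`, is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": list(range(100, 1001, 10)),
    "max_depth": list(range(1, 16)),
    "min_samples_split": list(range(2, 51, 2)),
    "min_samples_leaf": list(range(2, 51, 2)),
    "bootstrap": [True, False],
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=100,              # number of sampled settings (assumed)
    cv=3, n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
```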
For each of the parameters, we choose 2 or 3 of the highest-scoring ones based on figure 8 and perform a grid search on them.
In this part, we built models with 3-fold cross-validation for all the combinations of the following parameters:
Parameter | values |
---|---|
N_estimators | [400,900] |
Max_depth | [12,13] |
Min_sample_split | [2,12,28] |
Min_sample_leaf | [28,34] |
Bootstrap | False |
Max_features | Auto, sqrt |
A model is built for every combination with 3-fold cross-validation: 48 parameter combinations, leading to 144 total model fits.
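A sketch of the grid search over the values in the table above (note that "auto" was removed as a `max_features` option in newer scikit-learn versions, where it equaled "sqrt" for classifiers):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [400, 900],
    "max_depth": [12, 13],
    "min_samples_split": [2, 12, 28],
    "min_samples_leaf": [28, 34],
    "bootstrap": [False],
    "max_features": ["sqrt"],  # standing in for both "auto" and "sqrt"
}
# With both "auto" and "sqrt": 48 candidates x 3 folds = 144 fits.
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```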
Out of all the models, the model with the parameters in table 6 performs best, and therefore we choose the following parameters for our random forest model:
Parameter | value |
---|---|
N_estimators | 400 |
Max_depth | 13 |
Min_sample_split | 12 |
Min_sample_leaf | 28 |
Bootstrap | False |
Max_features | sqrt |
After all these steps, the final accuracy of the model is 73.0%, which is 0.82% lower than the accuracy of the baseline model.
This shows that performing PCA and hyperparameter tuning does not necessarily improve a model's performance, and here we proceed with our baseline model.
For measuring the performance of the random forest classifier, we use the accuracy score, the confusion matrix, precision, recall, and the F1-score.
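All of these are available in `sklearn.metrics`; a sketch using the baseline model and the held-out test split from above:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)

y_pred = baseline.predict(X_test)
print(accuracy_score(y_test, y_pred))   # ~0.74
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=["Non-Hit", "Hit"]))
```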
The accuracy of the predictions by the random forest classifier is 73.82% on the test data, meaning the classifier predicted 73.82% of the test records correctly. The error rate is therefore 26.18%.
The confusion matrix of the random forest classifier is as follows:
|  | Predicted Hit | Predicted Non-Hit |
|---|---|---|
| Actual Hit | 2940 | 666 |
| Actual Non-Hit | 1083 | 1992 |
The classification report for the test data and the predictions is:
|  | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Non-Hit | 0.75 | 0.64 | 0.69 | 3075 |
| Hit | 0.73 | 0.81 | 0.77 | 3606 |
| accuracy |  |  | 0.74 | 6681 |
| macro avg | 0.74 | 0.73 | 0.73 | 6681 |
| weighted avg | 0.74 | 0.74 | 0.73 | 6681 |
Precision is the number of correctly identified members of a class divided by the number of times the model predicted that class. In the case of "hits", the precision score is the number of correctly identified "hits" divided by the total number of times the classifier predicted "hit", rightly or wrongly. This value is around 73%, meaning that of all the songs the model predicted to be hits, about 73% actually are.
Recall is the number of members of a class that the classifier identified correctly divided by the total number of members of that class. For "hits", this is the fraction of actual hits that the classifier correctly identified as such. This value is 81%, meaning that of all the hit songs, 81% are correctly classified as hits.
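As a sanity check, both numbers can be recomputed directly from the confusion matrix above:

```python
tp, fn = 2940, 666    # actual hits: predicted hit / predicted non-hit
fp, tn = 1083, 1992   # actual non-hits: predicted hit / predicted non-hit

precision_hit = tp / (tp + fp)              # 2940 / 4023 ≈ 0.73
recall_hit = tp / (tp + fn)                 # 2940 / 3606 ≈ 0.81
accuracy = (tp + tn) / (tp + fn + fp + tn)  # 4932 / 6681 ≈ 0.7382
```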
The ROC curve plots sensitivity against 1-specificity for 100 different cutoffs. Here the area under the curve is 0.81, which shows that the model performs well; a random classifier has an area of 0.5, and a perfect classifier has an area of 1.
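A sketch of how the curve and its area can be produced, using the predicted probability of the "hit" class as the score:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, roc_auc_score

y_score = baseline.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_score))            # ~0.81 in our run
RocCurveDisplay.from_predictions(y_test, y_score)
plt.show()
```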
The original dataset of roughly 41,000 records contained multiple outliers among the initial 19 features. The features with the most outliers were removed, and the remaining dataset was split 80-20 into training and test sets. Classification models including Logistic Regression, K-Nearest Neighbors, Decision Tree, Support Vector Machine (Linear Kernel), Support Vector Machine (RBF Kernel), Neural Network, Random Forest, and Gradient Boosting were considered. The results show that the Random Forest model is the best classification model for our dataset: it classified hit songs and non-hit songs with an accuracy of 73.89%.
After selecting Random Forest, dimension reduction techniques such as principal component analysis and feature-importance-based selection were applied in the hope of improving model accuracy, and hyperparameter tuning techniques including random search and grid search were performed. These techniques did not meaningfully improve the performance of the model while considerably increasing computation time, so we determined that they could be skipped.
The precision, recall, and area under the ROC curve showed that the model performs well. For hit songs, the recall was 81%, meaning that of all the hit songs, the model identified 81% correctly.
As the global music industry moves more and more toward online audio streaming services such as Spotify, this project provides insights into what makes a song a Billboard hit. Through this project, we observed how features such as 'loudness' and 'energy' matter in making a song a hit. This can in turn help artists, producers, and even online streaming platforms make better decisions about promotion and about what goes into a globally popular song. The goal of the project was to label a song '1' if it is a hit and '0' otherwise. In this regard, the Random Forest classifier outperformed all the other classifiers, with high precision, recall, and F1-scores for hit songs. Based on the analysis in this project, the Random Forest classification model can be used to predict hit songs: it determines hit songs accurately and works well with highly diverse and varied predictors, which would have been difficult to do in the past.