This project predicts whether a song will be a hit based on its acoustic properties. Examples of the features used in these predictions are energy, key, loudness, tempo, album, and artist, among others. The structure and methods of this work closely follow those of Herremans et al. (2014).
- Using Spotify's API, we scraped the attributes of approximately 3000 songs.
- We labeled each song as a hit or non-hit by scraping an official charts website. The hit criterion is configurable; by default, a song is labeled a hit if it reaches the top 100 of the official charts.
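The labeling step above can be sketched as follows. This is an illustrative reimplementation of the top-100 criterion, not the repo's scraper: the function name, dict keys, and the idea of a title-to-peak-position mapping are assumptions for the example.

```python
HIT_THRESHOLD = 100  # a song is a "hit" if it peaked at position <= 100

def label_hits(songs, chart_peaks, threshold=HIT_THRESHOLD):
    """Attach a binary hit label to each song dict.

    songs:       list of dicts with at least a "title" key
    chart_peaks: dict mapping title -> best (lowest) chart position;
                 a title is absent if the song never charted
    """
    labeled = []
    for song in songs:
        peak = chart_peaks.get(song["title"])
        # hit = 1 only if the song charted at or above the threshold
        song = dict(song, hit=int(peak is not None and peak <= threshold))
        labeled.append(song)
    return labeled

songs = [{"title": "A", "energy": 0.8}, {"title": "B", "energy": 0.3}]
labeled = label_hits(songs, {"A": 17})  # "B" never charted
```

Raising or lowering `threshold` is how the hit criterion would be changed.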
We applied several preprocessing techniques:
- Missing-value treatment
- Deduplication of samples
- Normalization of features
- Feature selection to remove correlated features
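A minimal sketch of this preprocessing pipeline, assuming a pandas DataFrame of numeric features; the imputation strategy (mean), normalization (min-max), and correlation threshold here are illustrative choices, not necessarily the ones used in the repo.

```python
import pandas as pd

def preprocess(df, corr_threshold=0.9):
    """Dedup, impute, normalize, and drop one feature from each
    highly correlated pair."""
    # Deduplication of samples
    df = df.drop_duplicates().reset_index(drop=True)
    # Missing-value treatment: mean imputation per feature
    df = df.fillna(df.mean(numeric_only=True))
    # Min-max normalization to [0, 1] (constant columns left at 0)
    rng = df.max() - df.min()
    df = (df - df.min()) / rng.replace(0, 1)
    # Drop each feature that is highly correlated with an earlier one
    corr = df.corr().abs()
    to_drop = [col for i, col in enumerate(corr.columns)
               if any(corr.iloc[:i][col] > corr_threshold)]
    return df.drop(columns=to_drop)
```

Because correlation is invariant to linear rescaling, normalizing before the correlation check does not change which features are dropped.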
The algorithms we used from built-in libraries are:
- SVM
- Naive Bayes
- Logistic Regression
- Decision Trees
- Random Forests
- Adaboost
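The library-backed models above can be wired up with scikit-learn along these lines. The class names are real scikit-learn APIs; the hyperparameters and the `fit_and_score` helper are illustrative, not the project's tuned settings.

```python
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# One instance per algorithm listed above, keyed by a short code
CLASSIFIERS = {
    "SVM": SVC(kernel="rbf"),
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(n_estimators=100),
    "ADA": AdaBoostClassifier(),
}

def fit_and_score(name, X_train, y_train, X_test, y_test):
    model = CLASSIFIERS[name]
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)  # mean accuracy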
We also implemented the following algorithms from scratch:
- Neural Network
- Naive Bayes
- Multilayer, tree based Naive Bayes using error correction
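In the spirit of the from-scratch Naive Bayes, here is an independent NumPy sketch of Gaussian Naive Bayes (this is not the repo's code; the variance-smoothing constant is an assumption for numerical stability).

```python
import numpy as np

class GaussianNaiveBayes:
    """Gaussian Naive Bayes implemented from scratch with NumPy."""

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.priors_, self.means_, self.vars_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(X)
            self.means_[c] = Xc.mean(axis=0)
            self.vars_[c] = Xc.var(axis=0) + 1e-9  # variance smoothing
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        log_post = []
        for c in self.classes_:
            # log P(c) + sum of per-feature Gaussian log-likelihoods
            ll = -0.5 * (np.log(2 * np.pi * self.vars_[c])
                         + (X - self.means_[c]) ** 2 / self.vars_[c])
            log_post.append(np.log(self.priors_[c]) + ll.sum(axis=1))
        return self.classes_[np.argmax(log_post, axis=0)]
```

Working in log space avoids underflow when multiplying many small per-feature likelihoods.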
The primary issue faced in this project was class imbalance: intuitively, there are fewer hit songs than non-hit songs. To combat this, we implemented the SMOTE algorithm to oversample the minority class (hit songs).
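A minimal sketch of the SMOTE idea (an independent reimplementation, not the repo's version): each synthetic point is a random interpolation between a minority sample and one of its k nearest minority-class neighbors.

```python
import numpy as np

def smote(X_minority, n_synthetic, k=5, rng=None):
    """Generate n_synthetic samples by interpolating between minority
    samples and their k nearest minority neighbors (Euclidean)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X_minority, float)
    # Pairwise distances within the minority class
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)             # a point is not its own neighbor
    k = min(k, len(X) - 1)
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))            # pick a minority sample
        j = neighbors[i, rng.integers(k)]   # and one of its k neighbors
        gap = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.vstack(synthetic)
```

Because each synthetic point lies on a segment between two real minority samples, oversampling stays inside the minority class's feature region rather than duplicating exact rows.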
- In main.py, choose the settings and algorithm for hit song prediction. The same instructions are listed in main.py.
- For "algo", select either "RF" (random forest) or "DT" (decision tree) as the estimator for recursive feature selection. This setting is used only if the feature-selection boolean is "True" and "recursive_feature_selector" is chosen.
- For "feature_selector", select among "consistency_subset", "selectkBest", or "recursive_feature_selector" to reduce the dimensionality of the dataset by choosing the most relevant features.
- For data pre-processing, the parameters are "file_path", "data_path", "algo", "feature_selector", and three booleans. The booleans control whether to apply normalization, missing value treatment, and feature selection, respectively. Note that when "NN" is chosen, we achieved the best results with no feature selection (i.e., the last boolean set to False).
- Choose the cross-validation folds for performance assessment using "nFold".
- Finally, choose the type of model to run using "_learner". The options are:
- Naive Bayes: NB
- Decision Tree: DT
- Random Forest: RF
- Support Vector Machine: SVM
- Logistic Regression: LR
- Neural Network: MLP
- Our Implementation of Neural Network: NN
- AdaBoost: ADA
- Tree Based Naive Bayes with depth 2: NBL2
- Tree Based Naive Bayes with depth 3: NBL3
- Our Implementation of Naive Bayes: NBO
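The settings above might be wired together roughly as follows. The names "algo", "feature_selector", "nFold", and "_learner" come from the instructions; everything else (example values, the selector/estimator mappings, the `evaluate` helper) is illustrative, and "consistency_subset" has no direct scikit-learn equivalent so it is omitted here.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, RFE, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Example values for the settings described above
algo = "RF"                      # estimator for recursive feature selection
feature_selector = "selectkBest"
nFold = 5
_learner = "DT"

ESTIMATORS = {
    "RF": RandomForestClassifier(n_estimators=100),
    "DT": DecisionTreeClassifier(),
}

def make_selector(name, algo="RF", k=2):
    if name == "selectkBest":
        return SelectKBest(f_classif, k=k)
    if name == "recursive_feature_selector":
        return RFE(ESTIMATORS[algo], n_features_to_select=k)
    raise ValueError(f"unknown feature selector: {name}")

def evaluate(X, y, learner=_learner, selector=feature_selector, n_folds=nFold):
    """Stratified n-fold cross-validation; stratification keeps the
    hit/non-hit ratio similar across folds, which matters for
    imbalanced labels."""
    model = Pipeline([("select", make_selector(selector, algo=algo)),
                      ("clf", ESTIMATORS[learner])])
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    return cross_val_score(model, X, y, cv=cv).mean()
```

Putting the feature selector inside the `Pipeline` ensures it is refit on each training fold, so the held-out fold never influences which features are kept.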
Dorien Herremans, David Martens & Kenneth Sörensen (2014). Dance Hit Song Prediction. Journal of New Music Research, 43(3), 291–302.