# Spotify Hit Classifier
Using the Spotify API to predict hit songs.

## Big Picture

Spotify is a music streaming application that lets users listen to music on demand. It provides an API that exposes numerical features describing each song, such as danceability, energy, and tempo. The question that I seek to answer is whether it is possible to predict if a song will become a hit based on these values. If so, it might be possible to write music with these traits and thus raise the odds of producing a hit. This would be especially useful for [AI-generated music](https://www.theverge.com/2019/4/26/18517803/openai-musenet-artificial-intelligence-ai-music-generation-lady-gaga-harry-potter-mozart).

## Getting the data

The data that I am using was collected through the Spotify API by the Kaggle user [Farooq Ansari](https://www.kaggle.com/theoverman). They also labeled each song as a hit or a flop and grouped the songs by the decade in which they were released.

## Explore the data

In [1]:
import pandas as pd 
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('../Data/dataset-of-00s.csv')

# remove non numeric columns
del df['track']
del df['artist']
del df['uri']

df.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,chorus_hit,sections,target
0,0.578,0.471,4,-7.27,1,0.0289,0.368,0.0,0.159,0.532,133.061,196707,4,30.88059,13,1
1,0.704,0.854,10,-5.477,0,0.183,0.0185,0.0,0.148,0.688,92.988,242587,4,41.51106,10,1
2,0.162,0.836,9,-3.009,1,0.0473,0.000111,0.00457,0.174,0.3,86.964,338893,4,65.32887,13,0
3,0.188,0.994,4,-3.745,1,0.166,7e-06,0.0784,0.192,0.333,148.44,255667,4,58.59528,9,0
4,0.63,0.764,2,-4.353,1,0.0275,0.363,0.0,0.125,0.631,112.098,193760,4,22.62384,10,1


## Prepare the data

In [2]:
# Features
X = df.iloc[:,:-1] 
# Labels
y = df.iloc[:,-1] 
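
Before comparing models it can help to check how the labels are distributed, since an accuracy figure only means something relative to the share of the majority class. A minimal sketch, using only the five rows printed by `df.head()` as stand-in data and a subset of the columns (the names `sample`, `X_sample`, and `y_sample` are illustrative and not part of the original notebook):

```python
import pandas as pd

# Stand-in data: the five rows printed by df.head(), with a subset of columns.
sample = pd.DataFrame({
    "danceability": [0.578, 0.704, 0.162, 0.188, 0.63],
    "energy": [0.471, 0.854, 0.836, 0.994, 0.764],
    "tempo": [133.061, 92.988, 86.964, 148.44, 112.098],
    "target": [1, 1, 0, 0, 1],
})

# Same split as in the notebook: every column but the last is a feature.
X_sample = sample.iloc[:, :-1]
y_sample = sample.iloc[:, -1]

# Share of each class. The largest share equals the accuracy of always
# guessing the majority class, a baseline any model should beat.
class_share = y_sample.value_counts(normalize=True)
majority_baseline = class_share.max()

print(class_share)
print("Majority-class baseline: %0.2f" % majority_baseline)
```

Running the same two lines on the full `y` would show what accuracy a trivial classifier reaches on the real dataset.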

## Explore Models

In [3]:
# Train a K-nearest neighbors on the labeled data
knn_classifier = KNeighborsClassifier()
knn_scores = cross_val_score(knn_classifier, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (knn_scores.mean(), knn_scores.std() * 2))

Accuracy: 0.63 (+/- 0.03)


In [4]:
# Train an SVM on the labeled data
svm_classifier = svm.SVC(gamma=0.001, C=100.)
svm_scores = cross_val_score(svm_classifier, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (svm_scores.mean(), svm_scores.std() * 2))

Accuracy: 0.60 (+/- 0.02)
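
KNN and an RBF-kernel SVM both rely on distances between rows, so a column with a large range such as `duration_ms` can dominate columns bounded between 0 and 1 such as `danceability`. One thing that might be worth trying is standardizing the features inside a pipeline, so that the scaler is refit on each cross-validation fold. The sketch below uses synthetic stand-in data from `make_classification`, with one column inflated to mimic `duration_ms`. It is not the Spotify dataset, so its scores say nothing about the real data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (NOT the Spotify data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
# Inflate one column to mimic a large-range feature like duration_ms.
X_demo[:, 0] = X_demo[:, 0] * 100_000 + 200_000

# Baseline: KNN on the raw features, as in the notebook.
raw_knn = KNeighborsClassifier()
# Alternative: standardize each feature before KNN measures distances.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

raw_scores = cross_val_score(raw_knn, X_demo, y_demo, cv=5)
scaled_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5)

print("Raw:    %0.2f (+/- %0.2f)" % (raw_scores.mean(), raw_scores.std() * 2))
print("Scaled: %0.2f (+/- %0.2f)" % (scaled_scores.mean(), scaled_scores.std() * 2))
```

The same `make_pipeline(StandardScaler(), ...)` wrapper could be placed around `svm.SVC` and passed to `cross_val_score` with the real `X` and `y`.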


In [5]:
# Train a Decision tree on the labeled data
dtree_classifier = DecisionTreeClassifier(random_state=0)
dtree_scores = cross_val_score(dtree_classifier, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (dtree_scores.mean(), dtree_scores.std() * 2))

Accuracy: 0.77 (+/- 0.02)


## Fine-tune Model
Since the decision tree has the highest baseline accuracy, I will tune it to see if I can get an even better result.

In [11]:
for x in range(1,10):
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=x)
    dtree_scores = cross_val_score(clf, X, y, cv=5)
    print("Accuracy: %0.2f (+/- %0.2f), Max Depth: %0.0f" % (dtree_scores.mean(), dtree_scores.std() * 2, x))

Accuracy: 0.75 (+/- 0.02), Max Depth: 1
Accuracy: 0.78 (+/- 0.01), Max Depth: 2
Accuracy: 0.80 (+/- 0.01), Max Depth: 3
Accuracy: 0.80 (+/- 0.02), Max Depth: 4
Accuracy: 0.81 (+/- 0.01), Max Depth: 5
Accuracy: 0.81 (+/- 0.02), Max Depth: 6
Accuracy: 0.82 (+/- 0.02), Max Depth: 7
Accuracy: 0.81 (+/- 0.02), Max Depth: 8
Accuracy: 0.81 (+/- 0.01), Max Depth: 9
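
The manual loop could also be written with scikit-learn's `GridSearchCV`, which runs the same kind of cross-validated search and records the best setting. The sketch below additionally holds out a test split, so the final score is measured on rows that were not used to pick `max_depth`. It runs on synthetic stand-in data from `make_classification` rather than the Spotify dataset, and the split sizes and `random_state` values are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Spotify data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

# Keep 20% of the rows aside; they play no part in choosing max_depth.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Same search space as the loop: max_depth from 1 to 9.
param_grid = {"max_depth": list(range(1, 10))}
search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
# score() uses the best tree, refit on the whole training split.
test_accuracy = search.score(X_test, y_test)

print("Best max depth: %d" % best_depth)
print("Cross-validated accuracy: %0.2f" % search.best_score_)
print("Held-out accuracy: %0.2f" % test_accuracy)
```

With the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`.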


## Solution

From the fine-tuning of the model, we can see that a max depth of 7 gives the best result, with a cross-validated accuracy of 0.82 +/- 0.02, although depths 5 through 9 all score within that spread of each other. This is a strong result, suggesting that we can predict with reasonable accuracy whether a song will be a hit or not based solely on its attributes.

## Limitations and later work

The biggest limitation of this work is the way the dataset was labeled, which is not clearly documented. The only definition the author gives is: "'Flop' here merely implies that it is a track that probably could not be considered popular in the mainstream." This makes the labels seem very subjective. A better labeling system would have been based on how many times a song was played or how much money it made.
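
As a rough illustration of a play-count-based label, the sketch below marks a track as a hit when its play count falls in the top quarter of the collection. Everything here is hypothetical: the `play_count` column does not exist in the dataset, the numbers are made up for the example, and the 75th-percentile cutoff is an arbitrary assumption rather than an established definition of a hit:

```python
import pandas as pd

# Made-up play counts for illustration only; not real Spotify figures.
tracks = pd.DataFrame({
    "track": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "play_count": [120, 5_000, 80_000, 300, 1_200_000, 45, 9_500, 640_000],
})

# Hypothetical rule: a hit is any track at or above the 75th percentile.
threshold = tracks["play_count"].quantile(0.75)
tracks["target"] = (tracks["play_count"] >= threshold).astype(int)

print("Threshold: %0.0f plays" % threshold)
print(tracks)
```

A label like this is reproducible from a single documented rule, which is what the current hit/flop labels lack.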

## References

Data set retrieved from: https://www.kaggle.com/theoverman/the-spotify-hit-predictor-dataset