# Spotify Song Hit Prediction
### Andrew Smith 12/5/2022

The purpose of this project is to predict whether a song on Spotify (a popular music streaming service) will be a "Hit" or a "Flop" based on metrics derived from the Spotify API. The metrics rate each song on properties such as "loudness", "energy", "key", "tempo", and "duration". An explanation of how the metrics are derived can be found at https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features

For this project, the data to be analyzed was obtained from Kaggle: https://www.kaggle.com/datasets/theoverman/the-spotify-hit-predictor-dataset

The curator of the data used the Spotify API to obtain the song metrics for several decades of songs and also added a "target" feature which labels each song as being a "Hit" or a "Flop".  The curator made this determination based on other external sources which identified hit songs.

According to the curator, a track is labeled a "Flop" when it meets the following conditions:

- The track must not appear in the 'hit' list of that decade.
- The track's artist must not appear in the 'hit' list of that decade.
- The track must belong to a genre that could be considered non-mainstream and / or avant-garde.
- The track's genre must not have a song in the 'hit' list.
- The track must have 'US' as one of its markets.
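
The conditions listed above can be read as a single predicate over a track's metadata. The sketch below is only an illustration of that reading, not the curator's actual code: the `Track` record, the `is_flop` function, the `niche_genres` set, and all sample values are hypothetical, and it assumes that every condition must hold at once.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Track:
    # Hypothetical record; field names are illustrative, not the curator's schema.
    name: str
    artist: str
    genres: Set[str]
    markets: Set[str]


def is_flop(track, hit_tracks, hit_artists, hit_genres, niche_genres):
    """Return True when the track satisfies every listed condition (assumed to be ANDed)."""
    return (
        track.name not in hit_tracks              # track is not in the decade's hit list
        and track.artist not in hit_artists       # artist is not in the decade's hit list
        and bool(track.genres & niche_genres)     # belongs to a non-mainstream genre
        and not (track.genres & hit_genres)       # none of its genres has a hit song
        and "US" in track.markets                 # available in the US market
    )


# Made-up example data for one decade.
hit_tracks = {"Song A"}
hit_artists = {"Artist A"}
hit_genres = {"pop"}
niche_genres = {"free jazz"}

obscure = Track("Song B", "Artist B", {"free jazz"}, {"US", "DE"})
by_hit_artist = Track("Song C", "Artist A", {"free jazz"}, {"US"})

print(is_flop(obscure, hit_tracks, hit_artists, hit_genres, niche_genres))
print(is_flop(by_hit_artist, hit_tracks, hit_artists, hit_genres, niche_genres))
```

The second track fails only because its artist appears in the hit list, which shows how a single unmet condition keeps a track out of the "Flop" set under this reading.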

In [5]:
import numpy as np
import pandas as pd

import os
for dirname, _, filenames in os.walk('data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data\dataset-of-00s.csv
data\dataset-of-10s.csv
data\dataset-of-60s.csv
data\dataset-of-70s.csv
data\dataset-of-80s.csv
data\dataset-of-90s.csv
data\LICENSE
data\README.txt


#### First, read the datasets into a list of decade DataFrames

In [16]:
decade_list = [pd.read_csv(f'data/dataset-of-{decade}s.csv') for decade in ['60', '70', '80', '90', '00', '10']]

#### Let's take a look at some of the data from the first decade.

In [17]:
decade_list[0].info()
decade_list[0].head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8642 entries, 0 to 8641
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track             8642 non-null   object 
 1   artist            8642 non-null   object 
 2   uri               8642 non-null   object 
 3   danceability      8642 non-null   float64
 4   energy            8642 non-null   float64
 5   key               8642 non-null   int64  
 6   loudness          8642 non-null   float64
 7   mode              8642 non-null   int64  
 8   speechiness       8642 non-null   float64
 9   acousticness      8642 non-null   float64
 10  instrumentalness  8642 non-null   float64
 11  liveness          8642 non-null   float64
 12  valence           8642 non-null   float64
 13  tempo             8642 non-null   float64
 14  duration_ms       8642 non-null   int64  
 15  time_signature    8642 non-null   int64  
 16  chorus_hit        8642 non-null   float64


| | track | artist | uri | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | chorus_hit | sections | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jealous Kind Of Fella | Garland Green | spotify:track:1dtKN6wwlolkM8XZy2y9C1 | 0.417 | 0.62 | 3 | -7.727 | 1 | 0.0403 | 0.49 | 0.0 | 0.0779 | 0.845 | 185.655 | 173533 | 3 | 32.94975 | 9 | 1 |
| 1 | Initials B.B. | Serge Gainsbourg | spotify:track:5hjsmSnUefdUqzsDogisiX | 0.498 | 0.505 | 3 | -12.475 | 1 | 0.0337 | 0.018 | 0.107 | 0.176 | 0.797 | 101.801 | 213613 | 4 | 48.8251 | 10 | 0 |
| 2 | Melody Twist | Lord Melody | spotify:track:6uk8tI6pwxxdVTNlNOJeJh | 0.657 | 0.649 | 5 | -13.392 | 1 | 0.038 | 0.846 | 4e-06 | 0.119 | 0.908 | 115.94 | 223960 | 4 | 37.22663 | 12 | 0 |
| 3 | Mi Bomba Sonó | Celia Cruz | spotify:track:7aNjMJ05FvUXACPWZ7yJmv | 0.59 | 0.545 | 7 | -12.058 | 0 | 0.104 | 0.706 | 0.0246 | 0.061 | 0.967 | 105.592 | 157907 | 4 | 24.75484 | 8 | 0 |
| 4 | Uravu Solla | P. Susheela | spotify:track:1rQ0clvgkzWr001POOPJWx | 0.515 | 0.765 | 11 | -3.515 | 0 | 0.124 | 0.857 | 0.000872 | 0.213 | 0.906 | 114.617 | 245600 | 4 | 21.79874 | 14 | 0 |


#### Add the decade as a feature and combine the decade DataFrames into one large DataFrame

In [18]:
for df, decade in zip(decade_list, [1960, 1970, 1980, 1990, 2000, 2010]):
    df['decade'] = decade  # the scalar is broadcast to every row of this decade's DataFrame

data = pd.concat(decade_list, axis=0).reset_index(drop=True)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41106 entries, 0 to 41105
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track             41106 non-null  object 
 1   artist            41106 non-null  object 
 2   uri               41106 non-null  object 
 3   danceability      41106 non-null  float64
 4   energy            41106 non-null  float64
 5   key               41106 non-null  int64  
 6   loudness          41106 non-null  float64
 7   mode              41106 non-null  int64  
 8   speechiness       41106 non-null  float64
 9   acousticness      41106 non-null  float64
 10  instrumentalness  41106 non-null  float64
 11  liveness          41106 non-null  float64
 12  valence           41106 non-null  float64
 13  tempo             41106 non-null  float64
 14  duration_ms       41106 non-null  int64  
 15  time_signature    41106 non-null  int64  
 16  chorus_hit        41106 non-null  float64
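
To make the tagging and concatenation step easier to check in isolation, here is a minimal sketch on toy data. The two small DataFrames and their values are made up for illustration; only the `decade` and `target` column names and the `pd.concat(...).reset_index(drop=True)` pattern come from the notebook. It uses `DataFrame.assign`, which returns a tagged copy rather than modifying each DataFrame in place.

```python
import pandas as pd

# Toy stand-ins for two decade files; the values are invented for illustration.
frames = {
    1960: pd.DataFrame({"tempo": [185.6, 101.8], "target": [1, 0]}),
    1970: pd.DataFrame({"tempo": [120.0], "target": [1]}),
}

# Tag each frame with its decade, stack them, and rebuild a clean 0..n-1 index.
combined = pd.concat(
    [df.assign(decade=decade) for decade, df in frames.items()],
    axis=0,
).reset_index(drop=True)

# Count rows per decade to confirm every row kept its tag.
print(combined["decade"].value_counts().sort_index())
```

Checking the per-decade row counts after the concat is a cheap way to confirm that no file was skipped or loaded twice.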

#### Remove the features "track", "artist" and "uri" as unnecessary
While it is interesting to know the identity of the songs labeled as a Hit or Flop, these columns are not needed for training purposes.

In [19]:
data = data.drop(['track', 'artist', 'uri'], axis=1)

#### Split the data into Input matrix X and Target vector Y

In [20]:
X = data.drop('target', axis=1)
Y = data['target']

#### Further split X and Y into train/test sets, with 20% retained for testing

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.8, random_state=42)

ModuleNotFoundError: No module named 'sklearn'
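
The `ModuleNotFoundError` above indicates that scikit-learn is not installed in the environment the notebook kernel runs in; installing the `scikit-learn` package into that environment should let the cell run as written. As a stopgap, an 80/20 split can be approximated with pandas alone. The sketch below is not scikit-learn's implementation and will not reproduce its exact split; `pandas_train_test_split` is a hypothetical helper name, the toy data is made up, and the helper assumes the index is unique (as it is after `reset_index(drop=True)`).

```python
import pandas as pd


def pandas_train_test_split(X, y, train_size=0.8, random_state=42):
    """Randomly split X and y by row label using pandas only (assumes a unique index)."""
    # Sample the training rows, then take every remaining label as the test set.
    train_idx = X.sample(frac=train_size, random_state=random_state).index
    test_idx = X.index.difference(train_idx)
    return X.loc[train_idx], X.loc[test_idx], y.loc[train_idx], y.loc[test_idx]


# Toy data standing in for the feature matrix and target vector.
X_demo = pd.DataFrame({"tempo": range(10), "energy": [i / 10 for i in range(10)]})
y_demo = pd.Series([0, 1] * 5)

X_tr, X_te, y_tr, y_te = pandas_train_test_split(X_demo, y_demo)
print(len(X_tr), len(X_te))
```

Because the features and the target are selected with the same row labels, each row in `X_tr` stays aligned with its label in `y_tr`.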