## Research Question 

### Determine how well the features explain the popularity of songs on Spotify, and predict the popularity of a song from these features (using the Spotify data set)

In [None]:
# Data for colab
!wget https://raw.githubusercontent.com/SandhyaKiran04/Popularity-Of-Song-Prediction/main/data.csv

In [1]:
from   sklearn.compose            import *
from   sklearn.ensemble           import *
from   sklearn.impute             import *
from   sklearn.linear_model       import *
from   sklearn.metrics            import * 
from   sklearn.pipeline           import Pipeline
from   sklearn.model_selection    import RandomizedSearchCV
from   sklearn.preprocessing      import *
from   sklearn.model_selection    import train_test_split
from   sklearn.base               import TransformerMixin,BaseEstimator


import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

## Load Data and Examine Data

In [2]:
spotify = pd.read_csv('data.csv')

In [3]:
spotify.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.991,['Mamie Smith'],0.598,168333,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,12,1920,0.0936,149.976,0.634,1920
1,0.643,"[""Screamin' Jay Hawkins""]",0.852,150200,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,7,1920-01-05,0.0534,86.889,0.95,1920
2,0.993,['Mamie Smith'],0.647,163827,0.186,0,11m7laMUgmOKqI3oYzuhne,1.8e-05,0,0.519,-12.098,1,Golfing Papa,4,1920,0.174,97.6,0.689,1920
3,0.000173,['Oscar Velazquez'],0.73,422087,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,17,1920-01-01,0.0425,127.997,0.0422,1920
4,0.295,['Mixe'],0.704,165224,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.402,-6.036,0,Xuniverxe,2,1920-10-01,0.0768,122.076,0.299,1920


In [4]:
spotify.shape

(174389, 19)

In [5]:
cat_columns = ['key','mode','explicit']
con_columns = ['acousticness','danceability','duration_ms','energy',
               'instrumentalness','liveness','loudness','speechiness','tempo','valence','year']

In [None]:
features = spotify.drop('popularity', axis=1)  # every column except the target
target = spotify['popularity']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2, random_state = 42)

## Data PreProcessing

In [7]:
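# Defining a placeholder estimator; the search replaces it with each candidate model
class DummyEstimator(BaseEstimator):
    "Pass-through class: methods are present but do nothing."
    def fit(self, X, y=None): return self    # accept data like any scikit-learn estimator
    def score(self, X, y=None): pass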

In [8]:
# preprocessing for categorical features:

# 1) replacing missing values with the most frequent value
# 2) one-hot encoding to turn each category into a binary indicator vector
cat_pipe = Pipeline([('imputer', SimpleImputer(missing_values=np.nan,         
                                              strategy='most_frequent')),   
                    ('ohe', OneHotEncoder())])                              

# preprocessing for continuous features:

# 1) imputing missing values with the median rather than the mean, since it is more robust to outliers
# 2) StandardScaler to standardize the features (zero mean, unit variance)
con_pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),     
                    ('scaler', StandardScaler())])


# combining both pipes into a single preprocessing step and dropping the remaining columns

preprocessing = ColumnTransformer([('categorical', cat_pipe, cat_columns),
                                  ('continuous', con_pipe, con_columns)],remainder='drop')
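To make the effect of this preprocessing concrete, here is a minimal sketch on a made-up four-row frame. The frame `toy` and the names `toy_cat`, `toy_con` and `toy_pre` are illustrative assumptions, not part of the Spotify data; only the pipeline structure mirrors the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical column, one continuous column,
# and one column ('name') that should be dropped.
toy = pd.DataFrame({
    "mode": [0, 1, 1, np.nan],
    "tempo": [120.0, 90.0, np.nan, 150.0],
    "name": ["a", "b", "c", "d"],
})

toy_cat = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                    ("ohe", OneHotEncoder())])
toy_con = Pipeline([("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler())])
toy_pre = ColumnTransformer([("categorical", toy_cat, ["mode"]),
                             ("continuous", toy_con, ["tempo"])],
                            remainder="drop")

transformed = toy_pre.fit_transform(toy)
# The result can be sparse or dense depending on its density; force dense for display.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

print(transformed.shape)   # two one-hot columns for 'mode' plus one scaled 'tempo' column
print(transformed)
```

The missing `mode` is filled with the most frequent value before encoding, the missing `tempo` is filled with the median before scaling, and `name` does not appear in the output.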

In [9]:
# Main pipeline with dummy estimator
pipe = Pipeline([('preprocessing', preprocessing),
                 ('clf', DummyEstimator())])
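The placeholder in the `'clf'` step is never trained. For each sampled candidate, the search sets parameters on a clone of the pipeline, which replaces the step with a real model. The sketch below imitates that by calling `set_params` by hand; `PlaceholderEstimator`, `demo_pipe` and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class PlaceholderEstimator(BaseEstimator):
    """Stand-in final step; it is swapped out before any fitting happens."""
    def fit(self, X, y=None):
        return self

demo_pipe = Pipeline([("scaler", StandardScaler()),
                      ("clf", PlaceholderEstimator())])

# Replace the step and set one of its hyperparameters in a single call,
# using the same 'clf' / 'clf__<param>' keys as the search space.
demo_pipe.set_params(clf=Ridge(), clf__alpha=0.5)

# Synthetic, noise-free linear data, just to show the swapped pipeline fits and scores.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])

demo_pipe.fit(X_demo, y_demo)
demo_score = demo_pipe.score(X_demo, y_demo)   # R-squared, from the Ridge step
print(type(demo_pipe.named_steps["clf"]).__name__, demo_score)
```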

### Tuning hyperparameters for three algorithms: Ridge, ExtraTreesRegressor and RandomForestRegressor
### For Ridge, we tune 'alpha', which sets the regularisation strength, 'solver', which selects the computational routine, and 'max_iter', which caps the number of iterations for the iterative solvers.
### For ExtraTreesRegressor and RandomForestRegressor, we tune 'n_estimators', 'max_depth', 'min_samples_leaf' and 'max_features' to reduce overfitting, and 'bootstrap' to choose whether each tree is built on a bootstrap sample (if set to 'False', the whole training set is used to build each tree).

In [10]:
search_space = [{'clf':[Ridge()],
                 'clf__alpha' : [0.5,1,1.5,2],
                 'clf__max_iter': [100,500,1000],
                 'clf__solver': ['auto','svd','cholesky','lsqr','sparse_cg','sag','saga']},
    
                {'clf': [ExtraTreesRegressor()],
                 'clf__n_estimators': [20,30,40],
                 'clf__max_depth': [10,20,30],
                 'clf__min_samples_leaf': [3,5,7],
                 'clf__max_features' : ['auto','sqrt','log2'],   # 'auto' was removed in scikit-learn 1.3; use 1.0 there
                 'clf__bootstrap' : [True,False]},
                
                {'clf': [RandomForestRegressor()],
                 'clf__n_estimators':  [30,50,100],
                 'clf__max_features':['auto', 'sqrt'],   # 'auto' was removed in scikit-learn 1.3; use 1.0 there
                 'clf__max_depth': [10,20,30],
                 'clf__min_samples_leaf':[3,5,7],
                 'clf__bootstrap': [True, False]} 
                ]
                 

clf_algos_rand = RandomizedSearchCV(estimator=pipe, 
                                    param_distributions=search_space,
                                    scoring = 'r2',
                                    n_iter=5,
                                    cv=3, 
                                    n_jobs=-1,
                                    verbose=1)


best_model = clf_algos_rand.fit(X_train, y_train)    # fitting the training data


best_model.best_estimator_.get_params()['clf']       # best estimator found, with its tuned hyperparameters

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   13.7s finished


ExtraTreesRegressor(max_depth=10, min_samples_leaf=7, n_estimators=40)
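With `n_iter=5`, the search evaluates only five of the possible combinations. The rough count below mirrors the three grids, with plain strings standing in for the estimator objects (an assumption that keeps the snippet free of dependencies), and shows how small that sample is.

```python
from math import prod

# Same value lists as the search space; estimator objects replaced by their names.
space = [
    {"clf": ["Ridge"],
     "clf__alpha": [0.5, 1, 1.5, 2],
     "clf__max_iter": [100, 500, 1000],
     "clf__solver": ["auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"]},
    {"clf": ["ExtraTreesRegressor"],
     "clf__n_estimators": [20, 30, 40],
     "clf__max_depth": [10, 20, 30],
     "clf__min_samples_leaf": [3, 5, 7],
     "clf__max_features": ["auto", "sqrt", "log2"],
     "clf__bootstrap": [True, False]},
    {"clf": ["RandomForestRegressor"],
     "clf__n_estimators": [30, 50, 100],
     "clf__max_features": ["auto", "sqrt"],
     "clf__max_depth": [10, 20, 30],
     "clf__min_samples_leaf": [3, 5, 7],
     "clf__bootstrap": [True, False]},
]

# Number of distinct combinations per model is the product of the list lengths.
per_model = {grid["clf"][0]: prod(len(values) for values in grid.values())
             for grid in space}
total = sum(per_model.values())
coverage = 5 / total   # fraction of combinations tried with n_iter=5

print(per_model)
print(total, round(100 * coverage, 1), "% of combinations sampled")
```

Because so few candidates are tried and no `random_state` is set on the search, a rerun may select a different model and different hyperparameters.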

In [11]:
y_pred   = best_model.predict(X_test)               # predicting the values

### RMSE is the square root of the mean squared error, so it penalises large errors more heavily and is expressed in the same units as popularity.
### R-squared is the proportion of variance in the target that the model explains. It indicates goodness of fit and, when computed on held-out data, how well unseen samples are likely to be predicted. The explained variance score is closely related: it matches R-squared whenever the mean prediction error is zero.
### MAE is the mean of the absolute errors.
### Since we are interested in how well the features explain the popularity of a song, we use R-squared and the explained variance score as the main metrics.
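The definitions above can be checked by hand with NumPy. The helper `regression_metrics` and the toy values are assumptions for illustration; the notebook itself uses the scikit-learn functions in the next cell.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, R-squared and explained variance from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    rmse = np.sqrt(np.mean(residuals ** 2))
    mae = np.mean(np.abs(residuals))

    ss_res = np.sum(residuals ** 2)                      # unexplained sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    # Explained variance ignores a constant bias in the residuals.
    explained_variance = 1.0 - np.var(residuals) / np.var(y_true)
    return {"rmse": rmse, "mae": mae, "r2": r2,
            "explained_variance": explained_variance}

# Toy values whose residuals (-2, 2, -3, 3) average to zero.
toy_true = [10, 20, 30, 40]
toy_pred = [12, 18, 33, 37]
metrics = regression_metrics(toy_true, toy_pred)
print(metrics)
```

In this toy case the residuals average to zero, so R-squared and explained variance coincide. The two scores differ only when the predictions are biased on average.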

In [12]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
var_score = explained_variance_score(y_test,y_pred)
mae = mean_absolute_error(y_test, y_pred)

In [13]:
print("rmse: ", rmse)
print("r2: ", r2)
print("var_score: ", var_score)
print("mae: ",mae)

rmse:  13.628856802209896
r2:  0.6127029975695873
var_score:  0.6127033367720085
mae:  9.326975440883418
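One way to read these numbers: since R-squared equals 1 - MSE / Var(y_test), the RMSE of a baseline that always predicts the mean test popularity can be recovered from the two printed values. The variable names below are illustrative, and the inputs are copied from the output above.

```python
import math

# Values copied from the printed metrics.
reported_rmse = 13.628856802209896
reported_r2 = 0.6127029975695873

# R2 = 1 - MSE / Var(y)  =>  std(y) = RMSE / sqrt(1 - R2),
# which is the RMSE of a constant predict-the-mean baseline.
baseline_rmse = reported_rmse / math.sqrt(1.0 - reported_r2)

# Relative reduction in RMSE achieved by the model over that baseline.
improvement = 1.0 - reported_rmse / baseline_rmse

print(round(baseline_rmse, 2), round(100 * improvement, 1))
```

Under this reading, an R-squared of 0.61 corresponds to cutting the baseline RMSE by a little over a third.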


## Final Best Model

In [14]:
best_model.best_estimator_

Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('categorical',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('ohe',
                                                                   OneHotEncoder())]),
                                                  ['key', 'mode', 'explicit']),
                                                 ('continuous',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['acousticness

## Conclusion and Further Steps

### The R-squared value is about 0.61, which means the model explains roughly 61% of the variance in popularity on the held-out test set. This indicates that the features in the Spotify data carry real signal about song popularity, most likely because the data set has diverse and predictive features. This is useful in a business setting when trying to understand which songs could become popular based on features such as tempo and danceability, and it could also be used to recommend songs to users based on the features shared by the songs they listen to.

### Further steps include improving the model fit with a broader search, for example more random-search iterations or an exhaustive grid search. We should also check the features for multicollinearity and drop those that are redundant.
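One possible multicollinearity check is the variance inflation factor (VIF), which for standardized features equals the diagonal of the inverse correlation matrix. The sketch below runs on synthetic columns, not the Spotify features. The helper name and the rule of thumb that values above roughly 5 to 10 are a warning sign are assumptions, not results from this notebook.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column: the diagonal of the inverse correlation matrix."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)   # columns are features
    return np.diag(np.linalg.inv(corr))

# Synthetic example: x3 is almost a copy of x1, x2 is independent of both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)   # large for x1 and x3, close to 1 for x2
```

Applied to the notebook, the same function could be called on `X_train[con_columns]`, dropping or combining features whose VIF is high.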