# Predictive Modeling
### Kwame V. Taylor

I will set the baseline and create the first ML model to predict song popularity.

## Set up Environment

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from math import sqrt

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error, explained_variance_score, mean_absolute_error
from sklearn.linear_model import LinearRegression, TweedieRegressor, LassoLars
from sklearn.feature_selection import RFE
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import IsolationForest, RandomForestRegressor
from sklearn.svm import SVR
import sklearn.svm
import math
import itertools
import optunity
import optunity.metrics

import warnings
warnings.filterwarnings("ignore")

In [2]:
from prepare import handle_nulls, set_index
from preprocessing import spotify_split, split_df, scale_data, encode_features
from model import get_model_features, OLS_model

## Acquire data

In [3]:
df = pd.read_csv('full-playlist.csv', index_col=0)

In [4]:
df.head()

track_id,artist,album,release_date,track_name,album_popularity,label,danceability,energy,key,loudness,...,disc_number,track_number,album_id,album_type,duration_seconds,duration_minutes,is_featured_artist,release_year,release_month,release_day
6mecZbKK3JDeMdFRNxsCV5,tay-k,trapman,2020-07-12,trapman,36,Tay-K,0.792,0.594,2,-8.544,...,1,1,2J1hMj78HfdcMrmL2Sk6eR,single,232,3,0,2020,7,12
5PtMwNq8Dp31uYdGGacVJE,lil wyte,doubt me now,2003-03-04,oxy cotton,55,Hypnotize Minds Productions,0.816,0.578,9,-6.912,...,1,11,2lwxcemR1muymEHNMblCpm,album,193,3,0,2003,3,4
6s8EhlBn2PIoESylkXnwYc,kamelen,kingpin slim,2019-11-29,kingpin o.g - remix,46,NMG/G-HUSET,0.649,0.798,0,-6.45,...,1,11,6va2RTYO2ois7t88RN0LhJ,album,254,4,0,2019,11,29
2e9EZ2V5QGGZPMJacO3y0Y,waka flocka flame,flockaveli,2010-10-01,grove st. party (feat. kebo gotti),71,Asylum/Warner Records,0.705,0.702,0,-4.783,...,1,9,6MQtWELG7aRX7CkAzQ6nLM,album,250,4,1,2010,10,1
3ZRd5Z0fiYtASLdEPPb16m,project pat,mista don't play: everythangs workin',2001-02-13,don't save her (feat. crunchy black),55,Hypnotize Minds Productions,0.838,0.793,11,-5.47,...,1,5,4QzaueQPQa0lqrMmQoh4v0,album,261,4,1,2001,2,13


In [5]:
df.shape

(5733, 30)

## Prepare data

In [6]:
# handle null values
df = handle_nulls(df)

In [7]:
# check for nulls
df.isna().sum()

artist                0
album                 0
release_date          0
track_name            0
album_popularity      0
label                 0
danceability          0
energy                0
key                   0
loudness              0
mode                  0
speechiness           0
instrumentalness      0
liveness              0
valence               0
tempo                 0
duration_ms           0
time_signature        0
explicit              0
popularity            0
disc_number           0
track_number          0
album_id              0
album_type            0
duration_seconds      0
duration_minutes      0
is_featured_artist    0
release_year          0
release_month         0
release_day           0
dtype: int64

In [8]:
# check data types
df.dtypes

artist                 object
album                  object
release_date           object
track_name             object
album_popularity        int64
label                  object
danceability          float64
energy                float64
key                     int64
loudness              float64
mode                    int64
speechiness           float64
instrumentalness      float64
liveness              float64
valence               float64
tempo                 float64
duration_ms             int64
time_signature          int64
explicit                int64
popularity              int64
disc_number             int64
track_number            int64
album_id               object
album_type             object
duration_seconds        int64
duration_minutes        int64
is_featured_artist      int64
release_year            int64
release_month           int64
release_day             int64
dtype: object

In [9]:
# set index to track_id (skipped: read_csv above already used index_col=0)
#df = set_index(df)

Note to self: After the MVP, we need to convert release_date into a Timestamp.
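
The conversion mentioned in the note could look roughly like the sketch below. It runs on a small toy frame rather than the playlist data, and `toy` is a stand-in name, not part of the project code.

```python
import pandas as pd

# Toy frame standing in for the playlist data; only release_date matters here.
toy = pd.DataFrame({"release_date": ["2020-07-12", "2003-03-04", "2019-11-29"]})

# errors="coerce" turns unparseable strings into NaT instead of raising.
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")

# Date parts can then be derived directly from the Timestamp column.
toy["release_year"] = toy["release_date"].dt.year
print(toy.dtypes)
```

Any rows that come back as NaT would need a decision (drop or impute) before modeling.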

## Preprocess data

In [10]:
# show features
df.columns

Index(['artist', 'album', 'release_date', 'track_name', 'album_popularity',
       'label', 'danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_ms', 'time_signature', 'explicit', 'popularity',
       'disc_number', 'track_number', 'album_id', 'album_type',
       'duration_seconds', 'duration_minutes', 'is_featured_artist',
       'release_year', 'release_month', 'release_day'],
      dtype='object')

In [11]:
df.head(3)

track_id,artist,album,release_date,track_name,album_popularity,label,danceability,energy,key,loudness,...,disc_number,track_number,album_id,album_type,duration_seconds,duration_minutes,is_featured_artist,release_year,release_month,release_day
6mecZbKK3JDeMdFRNxsCV5,tay-k,trapman,2020-07-12,trapman,36,Tay-K,0.792,0.594,2,-8.544,...,1,1,2J1hMj78HfdcMrmL2Sk6eR,single,232,3,0,2020,7,12
5PtMwNq8Dp31uYdGGacVJE,lil wyte,doubt me now,2003-03-04,oxy cotton,55,Hypnotize Minds Productions,0.816,0.578,9,-6.912,...,1,11,2lwxcemR1muymEHNMblCpm,album,193,3,0,2003,3,4
6s8EhlBn2PIoESylkXnwYc,kamelen,kingpin slim,2019-11-29,kingpin o.g - remix,46,NMG/G-HUSET,0.649,0.798,0,-6.45,...,1,11,6va2RTYO2ois7t88RN0LhJ,album,254,4,0,2019,11,29


In [12]:
# encode features
df = encode_features(df)
df.head(3)

track_id,artist,album,release_date,track_name,album_popularity,label,danceability,energy,key,loudness,...,track_number,album_id,album_type,duration_seconds,duration_minutes,is_featured_artist,release_year,release_month,release_day,is_explicit
6mecZbKK3JDeMdFRNxsCV5,tay-k,trapman,2020-07-12,trapman,36,Tay-K,0.792,0.594,2,-8.544,...,1,2J1hMj78HfdcMrmL2Sk6eR,single,232,3,0,2020,7,12,1
5PtMwNq8Dp31uYdGGacVJE,lil wyte,doubt me now,2003-03-04,oxy cotton,55,Hypnotize Minds Productions,0.816,0.578,9,-6.912,...,11,2lwxcemR1muymEHNMblCpm,album,193,3,0,2003,3,4,1
6s8EhlBn2PIoESylkXnwYc,kamelen,kingpin slim,2019-11-29,kingpin o.g - remix,46,NMG/G-HUSET,0.649,0.798,0,-6.45,...,11,6va2RTYO2ois7t88RN0LhJ,album,254,4,0,2019,11,29,1


In [13]:
# choose features for MVP modeling
df = get_model_features(df)
df.head()

track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,time_signature,popularity,disc_number,track_number,duration_seconds,is_featured_artist,is_explicit
6mecZbKK3JDeMdFRNxsCV5,0.792,0.594,2,-8.544,1,0.3,0.0,0.244,0.351,82.512,4,43,1,1,232,0,1
5PtMwNq8Dp31uYdGGacVJE,0.816,0.578,9,-6.912,1,0.233,0.0,0.114,0.265,148.077,4,61,1,11,193,0,1
6s8EhlBn2PIoESylkXnwYc,0.649,0.798,0,-6.45,0,0.145,0.0,0.409,0.717,160.011,4,23,1,11,254,0,1
2e9EZ2V5QGGZPMJacO3y0Y,0.705,0.702,0,-4.783,0,0.108,0.0,0.364,0.771,140.059,4,62,1,9,250,1,1
3ZRd5Z0fiYtASLdEPPb16m,0.838,0.793,11,-5.47,0,0.0773,1e-06,0.106,0.8,160.003,4,45,1,5,261,1,1


In [14]:
# split the data
X_train, y_train, X_validate, y_validate, X_test, y_test, train, validate, test = spotify_split(df, 'popularity')
train.head(3)

Shape of train: (4012, 16) | Shape of validate: (861, 16) | Shape of test: (860, 16)
Percent train: 70.0        | Percent validate: 15.0       | Percent test: 15.0


track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,time_signature,popularity,disc_number,track_number,duration_seconds,is_featured_artist,is_explicit
30bqVoKjX479ab90a8Pafp,0.585,0.471,4,-9.934,0,0.0616,0.0184,0.115,0.323,93.099,4,87,1,1,142,0,1
0HO8pCseEpgozNi3z0R4bc,0.833,0.518,10,-10.126,0,0.349,0.0,0.635,0.773,180.008,4,24,1,11,120,0,1
643K3eEgRvdJiXjSzlz7dg,0.471,0.671,1,-6.05,1,0.341,0.0,0.308,0.85,176.863,4,30,1,2,252,0,1
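
For readers without access to `preprocessing.py`, a 70/15/15 split like the one reported above can be sketched with two calls to `train_test_split`. This is only an illustration: `three_way_split`, its `seed` default, and the toy data are assumptions, and the real `spotify_split` may differ (its exact row counts are not reproduced here).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(df, target, seed=123):
    """Hypothetical 70/15/15 splitter; not the author's spotify_split."""
    n_holdout = round(0.15 * len(df))  # rows for validate and for test
    # First carve off the test set, then carve validate out of the remainder.
    train_validate, test = train_test_split(df, test_size=n_holdout, random_state=seed)
    train, validate = train_test_split(train_validate, test_size=n_holdout, random_state=seed)
    # Separate features from the target for each partition.
    parts = {}
    for name, part in (("train", train), ("validate", validate), ("test", test)):
        parts[name] = (part.drop(columns=target), part[target])
    return parts

# Toy data standing in for the playlist DataFrame.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"energy": rng.uniform(size=1000),
                    "popularity": rng.integers(0, 100, size=1000)})
parts = three_way_split(toy, "popularity")
for name, (X_part, y_part) in parts.items():
    print(name, X_part.shape, y_part.shape)
```

Passing integer sizes to `test_size` keeps the partition sizes predictable and avoids rounding surprises from fractional sizes.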


In [15]:
# scale the data
X_train_scaled, X_validate_scaled, X_test_scaled = scale_data(train, validate, test, 'popularity', 'MinMax')
X_train_scaled.head(3)

track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,time_signature,disc_number,track_number,duration_seconds,is_featured_artist,is_explicit
30bqVoKjX479ab90a8Pafp,0.593306,0.439493,0.363636,0.580823,0.0,0.064033,0.019127,0.101302,0.328921,0.422695,0.8,0.0,0.0,0.235832,0.0,1.0
0HO8pCseEpgozNi3z0R4bc,0.844828,0.48996,0.909091,0.572667,0.0,0.362786,0.0,0.651741,0.787169,0.817286,0.8,0.0,0.163934,0.195612,0.0,1.0
643K3eEgRvdJiXjSzlz7dg,0.477688,0.654247,0.090909,0.745826,1.0,0.35447,0.0,0.3056,0.86558,0.803007,0.8,0.0,0.016393,0.436929,0.0,1.0
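
The usual pattern behind this kind of scaling is to fit the scaler on the train split only and reuse it on the other splits, so no information from validate or test leaks into training. The sketch below shows that pattern on toy values; whether `scale_data` works exactly this way is an assumption, and the toy frames are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy splits standing in for the real train/validate features.
toy_train = pd.DataFrame({"tempo": [80.0, 120.0, 160.0], "loudness": [-10.0, -6.0, -4.0]})
toy_validate = pd.DataFrame({"tempo": [100.0, 180.0], "loudness": [-8.0, -3.0]})

# Learn min and max from train only.
scaler = MinMaxScaler().fit(toy_train)

# Apply the same transformation to every split, keeping column names and index.
toy_train_scaled = pd.DataFrame(scaler.transform(toy_train),
                                columns=toy_train.columns, index=toy_train.index)
toy_validate_scaled = pd.DataFrame(scaler.transform(toy_validate),
                                   columns=toy_validate.columns, index=toy_validate.index)
print(toy_validate_scaled)
```

Validate values can land outside the 0 to 1 range when they exceed the train minimum or maximum; that is expected with this approach.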


In [16]:
# check data types
X_train_scaled.dtypes

danceability          float64
energy                float64
key                   float64
loudness              float64
mode                  float64
speechiness           float64
instrumentalness      float64
liveness              float64
valence               float64
tempo                 float64
time_signature        float64
disc_number           float64
track_number          float64
duration_seconds      float64
is_featured_artist    float64
is_explicit           float64
dtype: object

## Set the baseline

In [17]:
#np.median(y_train)
np.mean(y_train)

38.33150548354935

In [18]:
#baseline = y_train.median()
baseline = y_train.mean()

baseline_rmse_train = round(sqrt(mean_squared_error(y_train, np.full(len(y_train), baseline))), 6)
print('RMSE (Root Mean Square Error) of Baseline on train data:\n', baseline_rmse_train)

baseline_rmse_validate = round(sqrt(mean_squared_error(y_validate, np.full(len(y_validate), baseline))), 6)
print('RMSE (Root Mean Square Error) of Baseline on validate data:\n', baseline_rmse_validate)

RMSE (Root Mean Square Error) of Baseline on train data:
 22.897138
RMSE (Root Mean Square Error) of Baseline on validate data:
 22.837724


The mean performed better than the median as a baseline.

Our baseline prediction of popularity will be ```38.33150548354935```, with an RMSE of ```22.897138``` on the train data and ```22.837724``` on the validate data.
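
The mean-versus-median comparison can be reproduced with a small helper like the one below. `baseline_rmse` and the toy target are hypothetical names and values for illustration only. On the data a constant is fitted to, the mean always gives an RMSE at least as low as the median, because the mean is the constant that minimizes squared error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def baseline_rmse(y_fit, y_eval, strategy="mean"):
    """RMSE of predicting one constant (mean or median of y_fit) for every row of y_eval."""
    constant = np.mean(y_fit) if strategy == "mean" else np.median(y_fit)
    preds = np.full(len(y_eval), constant)
    return float(np.sqrt(mean_squared_error(y_eval, preds)))

# Toy, right-skewed target standing in for popularity.
y_toy = np.array([0, 10, 20, 30, 90])
mean_rmse = baseline_rmse(y_toy, y_toy, "mean")
median_rmse = baseline_rmse(y_toy, y_toy, "median")
print("mean baseline RMSE:  ", round(mean_rmse, 6))
print("median baseline RMSE:", round(median_rmse, 6))
```

On held-out data the ordering is not guaranteed, which is why checking the validate split as well is worthwhile.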

## Model 1 - Ordinary Least Squares (OLS) using Linear Regression

In [19]:
# show available features
X_train_scaled.columns

Index(['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'disc_number', 'track_number', 'duration_seconds', 'is_featured_artist',
       'is_explicit'],
      dtype='object')

In [20]:
# use all features
X = X_train_scaled
y = y_train

X_v = X_validate_scaled
y_v = y_validate

lm_pred, lm_rmse, lm_pred_v, lm_rmse_v = OLS_model(X, y, X_v, y_v)

RMSE for OLS using Linear Regression

On train data:
 21.513114 

 On validate data:
 21.392782


Not great results, but OLS did beat the baseline model: the validate RMSE dropped from ```22.837724``` to ```21.392782```.
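
Since `OLS_model` lives in `model.py` and is not shown here, the following is a minimal sketch of what such a helper might do, based only on its call signature and printed output. `ols_sketch` and the synthetic data are assumptions, not the author's implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def ols_sketch(X, y, X_v, y_v):
    """Hypothetical stand-in for OLS_model: fit on train, score on train and validate."""
    lm = LinearRegression().fit(X, y)
    pred = lm.predict(X)
    rmse = float(np.sqrt(mean_squared_error(y, pred)))
    pred_v = lm.predict(X_v)
    rmse_v = float(np.sqrt(mean_squared_error(y_v, pred_v)))
    return pred, rmse, pred_v, rmse_v

# Synthetic data: one informative feature plus noise.
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(200, 3))
y_toy = 40 + 10 * X_toy[:, 0] + rng.normal(scale=5, size=200)

pred, rmse, pred_v, rmse_v = ols_sketch(X_toy[:150], y_toy[:150], X_toy[150:], y_toy[150:])

# Mean baseline on the same train rows, for comparison.
toy_baseline_rmse = float(np.sqrt(np.mean((y_toy[:150] - y_toy[:150].mean()) ** 2)))
print("train RMSE:", round(rmse, 6), "| baseline:", round(toy_baseline_rmse, 6))
print("validate RMSE:", round(rmse_v, 6))
```

With an intercept, OLS can never do worse than the mean baseline on the train split, so the validate RMSE is the more informative comparison.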

## Model 2 - Support Vector Regressor using RBF Kernel

In [38]:
# use all features
X = X_train_scaled
y = y_train
X_v = X_validate_scaled
y_v = y_validate

# The most important SVR parameter is the kernel type.
# Common choices are linear, polynomial, or Gaussian (RBF).
# We suspect a non-linear relationship, so a polynomial or Gaussian kernel fits;
# here we select the RBF (Gaussian) kernel.

# create the model object
svr = SVR(kernel='rbf')

# fit the model to our training data
svr.fit(X, y)

# predict on train
svr_pred = svr.predict(X)
# compute root mean squared error
svr_rmse = sqrt(mean_squared_error(y, svr_pred))

# predict on validate
svr_pred_v = svr.predict(X_v)
# compute root mean squared error
svr_rmse_v = sqrt(mean_squared_error(y_v, svr_pred_v))

print("RMSE for SVR using RBF Kernel\n\nOn train data:\n", round(svr_rmse, 6), '\n\n', 
      "On validate data:\n", round(svr_rmse_v, 6))

#return svr_pred, svr_rmse, svr_pred_v, svr_rmse_v

RMSE for SVR using RBF Kernel

On train data:
 21.643583 

 On validate data:
 21.708837
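
The kernel choice discussed in the code comments can also be checked empirically rather than assumed. The sketch below compares three kernels on synthetic, clearly non-linear data; the data and variable names are made up for illustration, and the same loop could be pointed at `X_train_scaled` and `X_validate_scaled` instead.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Synthetic non-linear target: one sine period plus a little noise.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(300, 1))
y_toy = np.sin(2 * np.pi * X_toy[:, 0]) + rng.normal(scale=0.1, size=300)
X_fit, y_fit = X_toy[:200], y_toy[:200]
X_val, y_val = X_toy[200:], y_toy[200:]

# Fit one SVR per kernel with default hyperparameters and record validate RMSE.
results = {}
for kernel in ("linear", "poly", "rbf"):
    model = SVR(kernel=kernel).fit(X_fit, y_fit)
    results[kernel] = float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))

for kernel, score in results.items():
    print(f"{kernel:>6}: validate RMSE = {score:.4f}")
```

Defaults for `C`, `epsilon`, and `gamma` are used throughout, so this only compares kernels at one setting; tuning those values could change the ranking.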


## Model 2 - Support Vector Regressor using RBF Kernel