# Predictive Modeling
### Kwame V. Taylor

In this notebook I set a baseline and build the first machine learning model to predict song popularity.

## Set up Environment

In [25]:
import pandas as pd
import numpy as np
from scipy import stats
from math import sqrt

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error, explained_variance_score, mean_absolute_error
from sklearn.linear_model import LinearRegression, TweedieRegressor, LassoLars
from sklearn.feature_selection import RFE
from sklearn.ensemble import IsolationForest, RandomForestRegressor

import warnings
warnings.filterwarnings("ignore")

In [2]:
from prepare import handle_nulls, set_index
from preprocessing import spotify_split, split_df, scale_data, encode_features
from model import get_model_features, OLS_model

## Acquire data

In [3]:
df = pd.read_csv('full-playlist.csv', index_col=0)

In [4]:
df.head()

Unnamed: 0,artist,album,release_date,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,explicit,popularity,disc_number
0,Tay-K,TRAPMAN,2020-07-12,TRAPMAN,6mecZbKK3JDeMdFRNxsCV5,0.792,0.594,2.0,-8.544,1.0,0.3,0.0,0.244,0.351,82.512,232803.0,4.0,True,43.0,1.0
1,Lil Wyte,Doubt Me Now,2003-03-04,Oxy Cotton,5PtMwNq8Dp31uYdGGacVJE,0.816,0.578,9.0,-6.912,1.0,0.233,0.0,0.114,0.265,148.077,193920.0,4.0,True,61.0,1.0
2,Kamelen,KINGPIN SLIM,2019-11-29,Kingpin O.G - Remix,6s8EhlBn2PIoESylkXnwYc,0.649,0.798,0.0,-6.45,0.0,0.145,0.0,0.409,0.717,160.011,254390.0,4.0,True,22.0,1.0
3,Waka Flocka Flame,Flockaveli,2010-10-01,Grove St. Party (feat. Kebo Gotti),2e9EZ2V5QGGZPMJacO3y0Y,0.705,0.702,0.0,-4.783,0.0,0.108,0.0,0.364,0.771,140.059,250493.0,4.0,True,62.0,1.0
4,Project Pat,Mista Don't Play: Everythangs Workin',2001-02-13,Don't Save Her (feat. Crunchy Black),3ZRd5Z0fiYtASLdEPPb16m,0.838,0.793,11.0,-5.47,0.0,0.0773,1e-06,0.106,0.8,160.003,261933.0,4.0,True,45.0,1.0


In [5]:
df.shape

(6074, 20)

## Prepare data

In [6]:
# handle null values
df = handle_nulls(df)

In [7]:
# check for nulls
df.isna().sum()

artist              0
album               0
release_date        0
track_name          0
track_id            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
time_signature      0
explicit            0
popularity          0
disc_number         0
dtype: int64

In [8]:
# check data types
df.dtypes

artist               object
album                object
release_date         object
track_name           object
track_id             object
danceability        float64
energy              float64
key                 float64
loudness            float64
mode                float64
speechiness         float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
duration_ms         float64
time_signature      float64
explicit               bool
popularity          float64
disc_number         float64
dtype: object

In [9]:
# set index to track_id
df = set_index(df)

Note to self: After MVP we need to convert release_date into a Timestamp.
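As a reminder of what that conversion could look like, here is a minimal sketch using `pd.to_datetime` on a toy frame (the three dates are copied from `df.head()` above). The `sample` frame and the `release_year` column are illustrative names, not part of the project code.

```python
import pandas as pd

# Toy frame standing in for the playlist data; values copied from df.head().
sample = pd.DataFrame({
    "track_name": ["TRAPMAN", "Oxy Cotton", "Kingpin O.G - Remix"],
    "release_date": ["2020-07-12", "2003-03-04", "2019-11-29"],
})

# errors="coerce" turns strings that cannot be parsed into NaT instead of raising.
sample["release_date"] = pd.to_datetime(sample["release_date"], errors="coerce")

# Once the column is datetime64, calendar parts are available through .dt
sample["release_year"] = sample["release_date"].dt.year

print(sample.dtypes)
```

Any rows that end up as `NaT` after coercion would need separate handling before modeling.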

## Preprocess data

In [10]:
# show features
df.columns

Index(['artist', 'album', 'release_date', 'track_name', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature',
       'explicit', 'popularity', 'disc_number'],
      dtype='object')

In [11]:
df.head(3)

track_id,artist,album,release_date,track_name,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,explicit,popularity,disc_number
6mecZbKK3JDeMdFRNxsCV5,Tay-K,TRAPMAN,2020-07-12,TRAPMAN,0.792,0.594,2.0,-8.544,1.0,0.3,0.0,0.244,0.351,82.512,232803.0,4.0,True,43.0,1.0
5PtMwNq8Dp31uYdGGacVJE,Lil Wyte,Doubt Me Now,2003-03-04,Oxy Cotton,0.816,0.578,9.0,-6.912,1.0,0.233,0.0,0.114,0.265,148.077,193920.0,4.0,True,61.0,1.0
6s8EhlBn2PIoESylkXnwYc,Kamelen,KINGPIN SLIM,2019-11-29,Kingpin O.G - Remix,0.649,0.798,0.0,-6.45,0.0,0.145,0.0,0.409,0.717,160.011,254390.0,4.0,True,22.0,1.0


In [12]:
# encode features
df = encode_features(df)
df.head(3)

track_id,artist,album,release_date,track_name,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,popularity,disc_number,is_explicit
6mecZbKK3JDeMdFRNxsCV5,Tay-K,TRAPMAN,2020-07-12,TRAPMAN,0.792,0.594,2.0,-8.544,1.0,0.3,0.0,0.244,0.351,82.512,232803.0,4.0,43.0,1.0,1
5PtMwNq8Dp31uYdGGacVJE,Lil Wyte,Doubt Me Now,2003-03-04,Oxy Cotton,0.816,0.578,9.0,-6.912,1.0,0.233,0.0,0.114,0.265,148.077,193920.0,4.0,61.0,1.0,1
6s8EhlBn2PIoESylkXnwYc,Kamelen,KINGPIN SLIM,2019-11-29,Kingpin O.G - Remix,0.649,0.798,0.0,-6.45,0.0,0.145,0.0,0.409,0.717,160.011,254390.0,4.0,22.0,1.0,1
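The output above shows the boolean `explicit` column replaced by an integer `is_explicit` column at the end of the frame. The body of `encode_features` is not shown in this notebook, so the following is only a sketch of one way to get that result; `encode_explicit` is a hypothetical stand-in.

```python
import pandas as pd

def encode_explicit(df):
    """Hypothetical stand-in for encode_features: bool `explicit` -> int `is_explicit`."""
    out = df.copy()                                   # leave the caller's frame untouched
    out["is_explicit"] = out["explicit"].astype(int)  # True -> 1, False -> 0
    return out.drop(columns="explicit")               # drop the original boolean column

# Toy frame; column names match the notebook, values are illustrative.
sample = pd.DataFrame({
    "explicit": [True, False, True],
    "popularity": [43.0, 61.0, 22.0],
})
encoded = encode_explicit(sample)
print(encoded)
```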


In [13]:
# choose features for MVP modeling
df = get_model_features(df)

In [14]:
# split the data
X_train, y_train, X_validate, y_validate, X_test, y_test, train, validate, test = spotify_split(df, 'popularity')
train.head(3)

Shape of train: (4250, 14) | Shape of validate: (912, 14) | Shape of test: (911, 14)
Percent train: 70.0        | Percent validate: 15.0       | Percent test: 15.0


track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,popularity,disc_number,is_explicit
32tUYhAygMdx9XxFxxj3It,0.646,0.595,10.0,-6.709,0.0,0.0512,6e-06,0.0527,0.772,73.973,238880.0,4.0,62.0,1.0,0
4fDgQUNG3851Wnc67aK1hO,0.839,0.335,9.0,-14.418,1.0,0.175,9e-06,0.0967,0.566,127.053,151181.0,4.0,36.0,1.0,1
2usnXvtQNyCNiOMZGOMYkB,0.517,0.903,10.0,-6.333,0.0,0.568,0.0,0.69,0.643,84.792,196338.0,4.0,21.0,1.0,1
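For reference, a 70/15/15 split like the one reported above can be built from two calls to `train_test_split`. This is a sketch on synthetic data, not the actual `spotify_split` from `preprocessing.py`; the function name `three_way_split`, the seed, and the toy columns are assumptions, and the exact row counts may differ from the notebook's by rounding.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(df, target, seed=123):
    """Hypothetical sketch of a 70/15/15 split; not the notebook's spotify_split."""
    # Hold out 30% of the rows, then cut that holdout in half.
    train, holdout = train_test_split(df, test_size=0.30, random_state=seed)
    validate, test = train_test_split(holdout, test_size=0.50, random_state=seed)
    # Separate the features from the target for each subset.
    parts = {}
    for name, part in (("train", train), ("validate", validate), ("test", test)):
        parts[name] = (part.drop(columns=target), part[target])
    return parts

# Synthetic stand-in for the playlist data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "danceability": rng.uniform(size=1000),
    "energy": rng.uniform(size=1000),
    "popularity": rng.integers(0, 101, size=1000).astype(float),
})

parts = three_way_split(toy, "popularity")
for name, (X_part, y_part) in parts.items():
    print(name, X_part.shape, y_part.shape)
```

Fixing `random_state` keeps the split reproducible, so that later models are compared on the same rows.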


In [15]:
# scale the data
X_train_scaled, X_validate_scaled, X_test_scaled = scale_data(train, validate, test, 'popularity', 'MinMax')
X_train_scaled.head(3)

track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,disc_number,is_explicit
32tUYhAygMdx9XxFxxj3It,0.655172,0.597154,0.909091,0.727753,0.0,0.053222,6e-06,0.034912,0.786151,0.335858,0.411324,0.8,0.0,0.0
4fDgQUNG3851Wnc67aK1hO,0.850913,0.333988,0.818182,0.411772,1.0,0.181913,1e-05,0.080903,0.576375,0.576855,0.25102,0.8,0.0,1.0
2usnXvtQNyCNiOMZGOMYkB,0.524341,0.908904,0.909091,0.743165,0.0,0.590437,0.0,0.701056,0.654786,0.384979,0.333562,0.8,0.0,1.0
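When scaling, the scaler should learn its minimum and maximum from the train split only and then be applied unchanged to validate and test, so that no information leaks from the held-out data. The internals of `scale_data` are not shown here, so this is just a sketch of that pattern with `MinMaxScaler` on made-up values.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up feature values; column names borrowed from the notebook.
train = pd.DataFrame({"tempo": [60.0, 120.0, 180.0], "loudness": [-20.0, -10.0, 0.0]})
validate = pd.DataFrame({"tempo": [90.0, 200.0], "loudness": [-5.0, -25.0]})

# Fit on train only: the scaler stores train's per-column min and max.
scaler = MinMaxScaler().fit(train)

# Apply the same transformation to both splits, keeping column names and index.
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
validate_scaled = pd.DataFrame(scaler.transform(validate), columns=validate.columns, index=validate.index)

print(train_scaled)
print(validate_scaled)
```

Because the scaler is fit on train alone, validate values outside the train range land outside [0, 1]; here a tempo of 200 scales to a value above 1.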


In [16]:
# check data types
X_train_scaled.dtypes

danceability        float64
energy              float64
key                 float64
loudness            float64
mode                float64
speechiness         float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
duration_ms         float64
time_signature      float64
disc_number         float64
is_explicit         float64
dtype: object

## Set the baseline

In [32]:
#np.median(y_train)
np.mean(y_train)

38.46776470588235

In [33]:
#baseline = y_train.median()
baseline = y_train.mean()

baseline_rmse_train = round(sqrt(mean_squared_error(y_train, np.full(len(y_train), baseline))), 6)
print('RMSE (Root Mean Square Error) of Baseline on train data:\n', baseline_rmse_train)

baseline_rmse_validate = round(sqrt(mean_squared_error(y_validate, np.full(len(y_validate), baseline))), 6)
print('RMSE (Root Mean Square Error) of Baseline on validate data:\n', baseline_rmse_validate)

RMSE (Root Mean Square Error) of Baseline on train data:
 22.770177
RMSE (Root Mean Square Error) of Baseline on validate data:
 23.034868


The mean gave a lower RMSE than the median, so the mean is used as the baseline.

Our baseline prediction of popularity will be `38.46776470588235`, with an RMSE of `22.770177` on the train data and `23.034868` on the validate data.
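The mean versus median comparison can be reproduced in a few lines. The sketch below uses a small made-up target vector rather than the real `y_train`, and `constant_rmse` is a hypothetical helper name. On the data it is computed from, the mean minimizes squared error, so its RMSE can never be higher than the median's.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def constant_rmse(y_true, constant):
    """RMSE of predicting the same constant for every row (hypothetical helper)."""
    preds = np.full(len(y_true), constant)
    return float(np.sqrt(mean_squared_error(y_true, preds)))

# Made-up, right-skewed popularity scores.
y = np.array([0.0, 10.0, 20.0, 30.0, 90.0])

rmse_mean = constant_rmse(y, y.mean())
rmse_median = constant_rmse(y, np.median(y))

print("mean baseline RMSE:  ", round(rmse_mean, 3))
print("median baseline RMSE:", round(rmse_median, 3))
```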

## Model 1 - Ordinary Least Squares (OLS) using Linear Regression

In [20]:
# show available features
X_train_scaled.columns

Index(['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms',
       'time_signature', 'disc_number', 'is_explicit'],
      dtype='object')

In [22]:
# use all features
X = X_train_scaled
y = y_train

X_v = X_validate_scaled
y_v = y_validate

lm_pred, lm_rmse, lm_pred_v, lm_rmse_v = OLS_model(X, y, X_v, y_v)

RMSE for OLS using Linear Regression

On train data:
 236.436274 

 On validate data:
 230.18736


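These printed values cannot be compared with the baseline as they stand. At roughly 236 on train and 230 on validate, they are about ten times the baseline RMSE of roughly 23, even though popularity only ranges from 0 to 100. An OLS fit with an intercept cannot do worse than the mean baseline on the training data, so `OLS_model` is probably printing something other than RMSE. If the values are actually MSE (an assumption, not verified here), the RMSE would be about 15.4 on train and 15.2 on validate, which would beat the baseline.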
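A useful sanity check when reading OLS results: on the training data, a linear regression with an intercept can never have a higher RMSE than the mean baseline. The sketch below shows this on synthetic data. `ols_rmse` is a hypothetical helper, not the `OLS_model` function from `model.py`, whose body is not shown in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # Take the square root explicitly so the result is RMSE, not MSE.
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def ols_rmse(X_train, y_train, X_validate, y_validate):
    """Hypothetical sketch: fit OLS on train, return (train RMSE, validate RMSE)."""
    lm = LinearRegression().fit(X_train, y_train)
    return rmse(y_train, lm.predict(X_train)), rmse(y_validate, lm.predict(X_validate))

# Synthetic data: three scaled features, target loosely on a 0-100 scale.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = 40 + 10 * X[:, 0] + rng.normal(scale=5, size=300)
X_tr, y_tr, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

rmse_train, rmse_validate = ols_rmse(X_tr, y_tr, X_val, y_val)
baseline_rmse_train = rmse(y_tr, np.full(len(y_tr), y_tr.mean()))

print("OLS train RMSE:     ", round(rmse_train, 3))
print("OLS validate RMSE:  ", round(rmse_validate, 3))
print("Baseline train RMSE:", round(baseline_rmse_train, 3))
```

If a reported train RMSE comes out above the mean baseline, the metric calculation is worth checking before the model itself.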

## Model 2 - 