The dataset used in this notebook is available on this Kaggle page:
[Spotify dataset](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks)

#Importing Libraries

In [15]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

#model libraries
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

#metric libraries
from sklearn.metrics import mean_absolute_error


#Cross validation
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV

#Wrangle Data

Issues:


*   Whether to include year in the model. I don't believe it is a leaky feature, but popularity is biased towards newer releases, so much so that a model using year alone would predict popularity better than the baseline.

*   I removed the artists column because of its high cardinality, but I created a new column with the number of artists involved in the song. I would have liked to find a way to weight certain artists, for example if they were featured on another song or were becoming more popular.

*   I removed outlier tracks: songs shorter than 10 seconds, the single song longer than 5,000,000 ms (about 83 minutes), and songs with a tempo of 5 beats per minute or less.


*   I created a column for the release month, since a song might be a Christmas song released in December, or certain months might be better or worse for a release.


*   I set the ID as the index.

*   I also removed the song name because of its high cardinality.
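
The artist count described in the list above could also be computed by parsing the column instead of splitting on commas. The sketch below assumes the ```artists``` column stores Python-list-style strings such as `"['A', 'B']"`; the two sample rows are hypothetical and are not taken from the dataset.

```python
import ast

import pandas as pd

# Hypothetical sample rows; the real column is assumed to hold strings like these.
artists = pd.Series(["['Mamie Smith']", "['Earth, Wind & Fire', 'The Emotions']"])

# Parse each string into a real Python list, then count its elements.
parsed = artists.apply(ast.literal_eval)
num_artist = parsed.apply(len)

# Naive comma splitting over-counts names that themselves contain a comma.
naive = artists.apply(lambda s: len(s.strip("[]").split(",")))

print(num_artist.tolist())
print(naive.tolist())
```

Parsing is slower than splitting, so it is only worth it if comma-containing artist names are common enough to matter.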





In [16]:
spotify = pd.read_csv('Spotify.csv', index_col='id')

In [17]:
def wrangle(X):
  X=X.copy()

  # Remove tracks with no release date or with only a year (e.g. '1921')
  X = X[X['release_date'].notnull()]
  X = X[X['release_date'].str.len() > 4]

  # Convert the release date to datetime
  X['release_date'] = pd.to_datetime(X['release_date'])

  # Month of release (1-12)
  X['release_month'] = X['release_date'].dt.month

  #remove outlier tracks(too slow, too long, too short)
  X=X[X['tempo']>5]
  X=X[X['duration_ms']<5_000_000]
  X=X[X['duration_ms']>10_000]

  # Convert duration from milliseconds to minutes, rounded to 2 decimal places
  X['duration_minutes'] = (X['duration_ms'] / 1000 / 60).round(2)

  # Create a new column with the number of artists involved in the song.
  # Note: splitting on ',' over-counts artist names that contain a comma.
  X['artists'] = X['artists'].apply(lambda k: k.strip('[]'))
  X['artists'] = X['artists'].apply(lambda k: k.split(','))
  X['num_artist'] = X['artists'].apply(len)

  # Drop columns that are no longer needed (id is kept as the index)
  drop_cols = ['duration_ms', 'year','release_date','artists', 'name']
  X.drop(columns=drop_cols, inplace=True)

  return X
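
As a small illustration of the date handling in `wrangle`, the sketch below applies the same null and length filters to a hypothetical series of release dates and then extracts the month. The sample values are made up for the example.

```python
import pandas as pd

# Hypothetical release dates: a year-only value, two full dates, and a missing value.
dates = pd.Series(["1921", "1928-05-01", None, "2020-12-25"])

# Keep only rows that have a date and whose string is longer than a bare year.
kept = dates[dates.notnull()]
kept = kept[kept.str.len() > 4]

# Convert to datetime and pull out the month number.
months = pd.to_datetime(kept).dt.month

print(kept.tolist())
print(months.tolist())
```

Year-only rows are dropped rather than assigned a default month, which avoids inventing a release month the data does not contain.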

In [18]:
spotify = wrangle(spotify)

In [19]:
spotify.head()

| id | acousticness | danceability | energy | explicit | instrumentalness | key | liveness | loudness | mode | popularity | speechiness | tempo | valence | release_month | duration_minutes | num_artist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6M94FkXd15sOAOQYRnWPN8 | 0.995 | 0.781 | 0.13 | 0 | 0.887 | 1 | 0.111 | -14.734 | 0 | 0 | 0.0926 | 108.003 | 0.72 | 9 | 3.01 | 1 |
| 6OaJ8Bh7lsBeYoBmwmo2nh | 0.995 | 0.683 | 0.207 | 0 | 0.206 | 9 | 0.337 | -9.801 | 0 | 0 | 0.127 | 119.833 | 0.493 | 10 | 2.71 | 2 |
| 6Rwn56jcC0TdGQzbRl7NGw | 0.977 | 0.335 | 0.105 | 0 | 0.84 | 5 | 0.231 | -16.049 | 0 | 0 | 0.0716 | 80.204 | 0.406 | 1 | 4.61 | 3 |
| 6TFuAErGpJ9FpxQQ1HC8nM | 0.994 | 0.787 | 0.156 | 0 | 0.659 | 4 | 0.11 | -14.056 | 0 | 0 | 0.157 | 117.167 | 0.849 | 9 | 2.79 | 2 |
| 6Ukl7n0q3Cjd0Og8uBmVeP | 0.992 | 0.763 | 0.132 | 0 | 0.0693 | 4 | 0.112 | -13.002 | 1 | 0 | 0.0886 | 111.679 | 0.832 | 9 | 2.9 | 1 |


#Split Data to Training, Validation, Test

I split the data randomly with ```train_test_split``` rather than by year to try to limit the bias towards newer songs: a model is of little use if the release year is the only thing giving a song a higher popularity score.



In [20]:
target = 'popularity'
y=spotify[target]
X = spotify.drop(columns=target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.2, random_state=42)
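
Applying `train_test_split` twice with `test_size=0.2` leaves 64% of the rows for training, 16% for validation, and 20% for testing. The sketch below checks this on a hypothetical 1,000-row array rather than on the Spotify data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 1,000 rows.
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

# First split: hold out 20% as the test set.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# Second split: hold out 20% of the remainder as the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.2, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))  # 640 160 200
```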

#Establish Baseline

Since popularity is a continuous value, I chose MAE as the metric. The baseline predicts the mean popularity of the training set for every track.

In [21]:
y_pred = [y_train.mean()] * len(y_train)
print('Baseline MAE', mean_absolute_error(y_train, y_pred))

Baseline MAE 17.5697405403575
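
A mean baseline is a common choice, but the constant that minimizes MAE on the training data is the median, so a median baseline may be a slightly stricter bar. The sketch below compares the two with scikit-learn's `DummyRegressor` on hypothetical skewed data; it is not run on the Spotify data, so the numbers it prints say nothing about the notebook's baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical skewed target, standing in for a popularity-like score.
rng = np.random.default_rng(0)
y_demo = rng.exponential(scale=20, size=1000)
X_demo = np.zeros((len(y_demo), 1))  # DummyRegressor ignores the features

# Fit a constant predictor for each strategy and score it on the same data.
maes = {}
for strategy in ("mean", "median"):
    dummy = DummyRegressor(strategy=strategy).fit(X_demo, y_demo)
    maes[strategy] = mean_absolute_error(y_demo, dummy.predict(X_demo))

print(maes)
```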


#Ridge Regression Model

Since popularity is a continuous value, a regression model is needed. I chose Ridge regression because its L2 penalty shrinks the coefficients, which reduces their variance.

In [22]:
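model_R = make_pipeline(
    StandardScaler(),
    # Note: SelectKBest defaults to k=10 and score_func=f_classif, a
    # classification score; f_regression is the usual choice for a continuous target.
    SelectKBest(),
    Ridge(alpha=1, random_state=42)
)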

model_R.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('selectkbest',
                 SelectKBest(k=10,
                             score_func=<function f_classif at 0x7f71166e7f28>)),
                ('ridge',
                 Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None,
                       normalize=False, random_state=42, solver='auto',
                       tol=0.001))],
         verbose=False)
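
The fitted pipeline above shows `score_func=<function f_classif ...>`, which is the `SelectKBest` default and is designed for classification targets. A possible variant for a continuous target passes `f_regression` explicitly. The sketch below uses synthetic data from `make_regression` as a stand-in, so it only shows the wiring, not what the change would do to the Spotify results.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 15 features, 5 of them informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=15, n_informative=5, noise=10.0, random_state=42
)

# Same pipeline shape as model_R, but with a regression scoring function.
model_demo = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=10),
    Ridge(alpha=1),
)
model_demo.fit(X_demo, y_demo)

# Count how many features the selector kept.
n_selected = int(model_demo.named_steps["selectkbest"].get_support().sum())
print(n_selected, model_demo.score(X_demo, y_demo))
```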

##Check Metrics

In [23]:
print('Training MAE:', mean_absolute_error(y_train, model_R.predict(X_train)))
print('Validation MAE:', mean_absolute_error(y_val, model_R.predict(X_val)))
print('Training R^2:', model_R.score(X_train, y_train))
print('Validation R^2:', model_R.score(X_val, y_val))

Training MAE: 13.506244521476622
Validation MAE: 13.527477678993352
Training R^2: 0.3924794820362163
Validation R^2: 0.3834625254389077


The Ridge model's validation MAE (13.53) beats the baseline (17.57), but not by much, and the validation R-squared score of 0.38 is weak.
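
The same four metric lines are printed for every model in this notebook, so they could be wrapped in a small helper. The function name `report` and its `splits` argument are hypothetical, and the demo uses a toy linear dataset rather than the Spotify data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


def report(model, splits):
    """Print and return {split_name: (MAE, R^2)} for a fitted regressor."""
    results = {}
    for name, (X_part, y_part) in splits.items():
        mae = mean_absolute_error(y_part, model.predict(X_part))
        r2 = model.score(X_part, y_part)
        print(f"{name} MAE: {mae:.2f}  R^2: {r2:.3f}")
        results[name] = (mae, r2)
    return results


# Toy data with an exact linear relationship, split into train and validation.
X_demo = np.arange(20, dtype=float).reshape(-1, 1)
y_demo = 3 * X_demo.ravel() + 1
demo_model = LinearRegression().fit(X_demo[:15], y_demo[:15])

results = report(
    demo_model,
    {
        "Training": (X_demo[:15], y_demo[:15]),
        "Validation": (X_demo[15:], y_demo[15:]),
    },
)
```

With the notebook's variables, a call would look like `report(model_R, {"Training": (X_train, y_train), "Validation": (X_val, y_val)})`.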

#Random Forest Regression Model

In [24]:
model_forest = make_pipeline(
    RandomForestRegressor(random_state=42, n_jobs=-1)
)
model_forest.fit(X_train, y_train);

##Check Metrics

In [11]:
print('Training MAE:', mean_absolute_error(y_train, model_forest.predict(X_train)))
print('Validation MAE:', mean_absolute_error(y_val, model_forest.predict(X_val)))
print('Training R^2:', model_forest.score(X_train, y_train))
print('Validation R^2:', model_forest.score(X_val, y_val))

Training MAE: 3.7329571258992043
Validation MAE: 9.937997759367539
Training R^2: 0.9452485120901762
Validation R^2: 0.6150836927593843


Random Forest yields better results than the Ridge regression: the validation MAE is 9.94 and the validation R-squared score is 0.62. The gap between the training MAE (3.73) and the validation MAE suggests the forest is overfitting.

#Tuning Hyperparameters

##Ridge Model


I searched over ```k``` for SelectKBest from 2 to 16 features in steps of 2, and over the Ridge ```alpha``` from 1 to 10.

In [12]:
params = {
    # Note: X has 15 feature columns after wrangling, so k=16 exceeds the feature count
    'selectkbest__k':range(2,17,2),
    'ridge__alpha':range(1,11,1)
}


tuned_ridge = GridSearchCV(model_R,
             param_grid = params,
             cv=5,
             n_jobs=-1,
             verbose=1)
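
When no `scoring` argument is given, `GridSearchCV` ranks candidates by the estimator's own `score` method, which is R-squared for regressors. Since this notebook reports MAE, one option is to search on MAE directly with `scoring='neg_mean_absolute_error'`. The sketch below shows this on synthetic data with a deliberately small grid; both are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=5.0, random_state=42
)

pipe = make_pipeline(StandardScaler(), Ridge())

# scikit-learn maximizes scores, so MAE is exposed as a negative number.
search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": [1, 5, 10]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_demo, y_demo)

# Flip the sign to recover the cross-validated MAE of the best candidate.
best_mae = -search.best_score_
print(search.best_params_, best_mae)
```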


In [13]:
tuned_ridge.fit(X_train, y_train)

Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    8.3s
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:   30.3s
[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed:   59.4s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('standardscaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('selectkbest',
                                        SelectKBest(k=10,
                                                    score_func=<function f_classif at 0x7f71166e7f28>)),
                                       ('ridge',
                                        Ridge(alpha=1, copy_X=True,
                                              fit_intercept=True, max_iter=None,
                                              normalize=False, random_state=42,
                                              solver='auto', tol=0.001))],
                                verbose=False),
             iid='deprecated', n_

In [None]:
ridge_tune = tuned_ridge.best_estimator_
print('Training MAE:', mean_absolute_error(y_train, ridge_tune.predict(X_train)))
print('Validation MAE:', mean_absolute_error(y_val, ridge_tune.predict(X_val)))
print('Training R^2:', ridge_tune.score(X_train, y_train))
print('Validation R^2:', ridge_tune.score(X_val, y_val))

Tuning yields a better MAE and R-squared score, but the tuned Ridge model still does not perform as well as the Random Forest regressor.

In [None]:
tuned_ridge.best_estimator_

##Random Forest Regressor

The hyperparameters I chose to tune were ```n_estimators```, ```max_depth```, ```max_features```, and ```max_samples```.

I chose a randomized search, which fits only a fixed number of sampled candidates instead of every combination, to tune the random forest quickly.



In [None]:

params_f = {
    'randomforestregressor__n_estimators':range(10,201,10),
    'randomforestregressor__max_depth': range(5,36,5),
    # Note: X has 15 feature columns, so max_features=16 is out of range if sampled
    'randomforestregressor__max_features': range(2,17,2),
    'randomforestregressor__max_samples': np.arange(0.2,0.8,0.2)
}

tuning_forest = RandomizedSearchCV(
    model_forest, 
    param_distributions=params_f, 
    n_iter=10,
    n_jobs=8, 
    cv=5, 
    random_state=42,
    verbose=1
)
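
`RandomizedSearchCV` also accepts distributions in place of fixed ranges, so each candidate can draw values from a continuous or integer interval. The sketch below is one way this could look, using `scipy.stats.randint` and `scipy.stats.uniform`. The synthetic data, the bounds, and the small `n_iter` are assumptions chosen to keep the example fast, not settings taken from this notebook.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data with 8 features.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=5.0, random_state=42
)

param_dist = {
    "n_estimators": randint(10, 51),   # integers from 10 to 50
    "max_depth": randint(3, 11),       # integers from 3 to 10
    "max_features": randint(2, 9),     # integers from 2 to 8 (at most n_features)
    "max_samples": uniform(0.2, 0.6),  # floats in [0.2, 0.8]
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=5,          # number of sampled candidates
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```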

In [None]:
tuning_forest.fit(X_train, y_train)

In [None]:
tuned_forest = tuning_forest.best_estimator_
print('Training MAE:', mean_absolute_error(y_train, tuned_forest.predict(X_train)))
print('Validation MAE:', mean_absolute_error(y_val, tuned_forest.predict(X_val)))
print('Training R^2:', tuned_forest.score(X_train, y_train))
print('Validation R^2:', tuned_forest.score(X_val, y_val))

In [None]:
tuning_forest.best_estimator_

As ```n_estimators``` increases, so does the time to fit the model. Since the search only went up to 200 estimators, there may be some tuning left on the table, but one more try at 400 estimators showed a slightly higher validation MAE and a higher R-squared score.
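
One way to see how validation error changes with forest size without refitting from scratch each time is scikit-learn's `warm_start` option, which keeps the existing trees and only adds new ones when `n_estimators` is raised. The sketch below uses synthetic data and hypothetical forest sizes, so its numbers do not reflect the Spotify results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split into training and validation sets.
X_demo, y_demo = make_regression(
    n_samples=400, n_features=8, noise=5.0, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# warm_start=True reuses already-fitted trees on each subsequent fit call.
rf = RandomForestRegressor(warm_start=True, random_state=42, n_jobs=-1)

maes = {}
for n in (25, 50, 100):
    rf.set_params(n_estimators=n)
    rf.fit(X_tr, y_tr)  # only the additional trees are trained here
    maes[n] = mean_absolute_error(y_va, rf.predict(X_va))

print(maes)
```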

In [None]:
tuned_forest = RandomForestRegressor(max_depth=25, max_features=10,
                                       max_samples=0.4, n_estimators=400,
                                       n_jobs=-1, random_state=42)
tuned_forest.fit(X_train, y_train)

In [None]:
print('Training MAE:', mean_absolute_error(y_train, tuned_forest.predict(X_train)))
print('Validation MAE:', mean_absolute_error(y_val, tuned_forest.predict(X_val)))
print('Training R^2:', tuned_forest.score(X_train, y_train))
print('Validation R^2:', tuned_forest.score(X_val, y_val))

#Final Test MAE and R-Squared score using tuned RandomForestRegressor.

In [None]:
print('Test MAE:', mean_absolute_error(y_test, tuned_forest.predict(X_test)))
print('Test R^2:', tuned_forest.score(X_test, y_test))