### Part 3 - Preprocessing <br />
In this notebook we will be taking our now analyzed data and preparing it for modeling (which will occur in the next notebook - Regression Modeling)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import os.path

import warnings
warnings.simplefilter(action="ignore")
warnings.filterwarnings(action="ignore")

In [2]:
df = pd.read_csv('capstone3_data.csv', index_col=0)

Here we are dropping our non-numerical/categorical features for our feature. We could theoretically implement some dummy features (particularly for feature artist), but I determined it would simply be unnecessary as our end user would be Illenium anyways, The only way that it could be helpful is if he were to collaborate wth other popular artists in the future, and they would still need to determine what aspects of a song they would want to incorporate anyways.

In [3]:
preparation = df.drop(columns=["artists","id","name","feature_artist"])

Our target feature will be popularity (how can we maximize it?), the other features will determine how best to maximize popularity so they will be the features that we compare against

In [4]:
target = preparation.popularity
features = preparation.drop(columns=["popularity"])

Scaling our non-popularity features

In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

Splitting the data into training and testing data in preparation for modeling and the conclusion of the project. As stated above, we will be implementing regression models to our data.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scaled_features, target, test_size=.2, random_state=42)

### Part 4 - Modeling <br />

Here we will test some of the most popular regression machine learning models. first we import all of the models modules and train them with the training data.

In [7]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

In [8]:
lr = LinearRegression().fit(X_train, y_train)
r = Ridge().fit(X_train, y_train)
l = Lasso().fit(X_train, y_train)
rf = RandomForestRegressor().fit(X_train, y_train)
knn = KNeighborsRegressor().fit(X_train, y_train)
sv = SVR().fit(X_train, y_train)
nn = MLPRegressor().fit(X_train, y_train)

Here we check the score, predictions, r^2 score, explained variance score, and mean absolute error for all of our models

In [9]:
from sklearn import metrics

models = [lr, r, l, rf, knn, sv, nn]
metrics = {"r2_score": metrics.r2_score, "explained_variance_score": metrics.explained_variance_score, "mean_absolute_error":metrics.mean_absolute_error}

for model in models:
    print(str(model).strip("()"), 'Score:', str(model.score(X_train,y_train)))
    print("\n predictions:", model.predict(X_test),"\n")
    for k, metric in metrics.items():
        print("\t", k, ":", metric(y_test, model.predict(X_test)))
    print('-'*34)

LinearRegression Score: 0.19140385952675698

 predictions: [52.36067904 46.12238296 47.8112182  55.52120398 49.59117454 54.24344913
 47.32301572 49.20127545 45.57957551 47.8426646  51.37704596 51.73906102
 50.1519536  49.32706049 49.65194156 47.53693472 46.78374365 51.32005343
 50.93056483 44.15446007 56.51506345 47.74462297 44.57780843 48.20795386
 57.87092072 53.47258141 53.11702266 43.93373503 45.38026224 48.55912001
 51.00801367 44.89865745 50.01800176 50.69294819 51.19260663 50.16459423
 52.88750595 49.02246016 48.58303064 49.64889352 46.17943798] 

	 r2_score : 0.013541704007281274
	 explained_variance_score : 0.013782877228005752
	 mean_absolute_error : 6.129081903733724
----------------------------------
Ridge Score: 0.19139185258337932

 predictions: [52.34361901 46.15505061 47.73489254 55.48326273 49.60136684 54.21573216
 47.32681931 49.19396712 45.62613978 47.80911602 51.35394376 51.73629579
 50.14643236 49.31676498 49.6692289  47.55194373 46.81611212 51.29946775
 50.9215584

It appears that the Random Forest performs the best. This will be the best model to use for our general conclusions and moving forward with the project.

-----
##### Part II: Utilizing Model for Prediction
The goal of this project was to see if we could increase popularity with different feature input. We will randomly generate a new dataframe with each of the features and see if we can exceed our max popularity of 76

In [10]:
df.sort_values(by='popularity',ascending=False).head(1)

Unnamed: 0,artists,id,name,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,feature_artist
212,"The Chainsmokers, ILLENIUM, Lennon Stella",3g0mEQx3NTanacLseoP0Gw,Takeaway,76,0.528,0.511,3.0,-8.144,1.0,0.0324,0.126,0.0,0.101,0.351,100.1,4.0,ILLENIUM


create the randomly generated dataframe utilizing the min and max values for each feature as a range. This is making the assumption that anything out of this range is unrealistic while making the track (whether that be from constraints within genre, technology, or simply creativity level)

In [11]:
rand_df = pd.DataFrame(columns = list(features.columns))

for x in range(1000):
    rand_dict = {}
    for feature in list(features.columns):
        feat_min = int(df[feature].min()*100)
        feat_max = int(df[feature].max()*100)
        rand_float = float(np.random.randint(feat_min, feat_max)/100)
        rand_dict[feature] = rand_float
    rand_df = rand_df.append(rand_dict, ignore_index = True)
        
        
rand_df.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,0.5,0.92,1.22,-11.04,0.5,0.04,0.36,0.23,0.55,0.2,195.77,3.46
1,0.72,0.5,10.5,-2.55,0.33,0.06,0.73,0.63,0.48,0.48,73.22,2.54
2,0.27,0.15,2.97,-6.16,0.56,0.33,0.19,0.54,0.25,0.79,124.42,3.11
3,0.78,0.92,1.54,-5.44,0.8,0.23,0.59,0.6,0.59,0.16,140.5,1.92
4,0.82,0.72,2.2,-14.55,0.39,0.12,0.63,0.79,0.12,0.56,98.43,3.32


add feature/column for the new predicted popularity for each randomly generated/simulated song

In [12]:
rand_df['pred_popularity'] = rf.predict(rand_df.loc[:,:])

In [13]:
rand_df.sort_values(by='pred_popularity',ascending=False).head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,pred_popularity
287,0.3,0.06,3.36,-13.76,0.79,0.15,0.1,0.8,0.68,0.67,126.74,2.85,57.63
302,0.87,0.92,1.38,-10.54,0.49,0.26,0.04,0.51,0.67,0.96,88.49,3.16,57.41
696,0.17,0.77,8.46,-11.77,0.03,0.29,0.0,0.07,0.68,0.85,111.06,3.79,57.2
824,0.81,0.75,9.82,-3.04,0.29,0.14,0.11,0.57,0.66,0.69,110.22,3.1,57.16
776,0.87,0.73,7.75,-9.54,0.63,0.05,0.31,0.21,0.67,0.67,112.32,1.92,57.14


In [30]:
print('Illenium Average Popularity', float(df.loc[df['feature_artist'] == 'ILLENIUM',['popularity']].mean()))
print('-'*20)
for artist in list(df['feature_artist'].unique()):
    print(artist, 'Average Popularity', int(df.loc[df['feature_artist'] == artist,['popularity']].mean()))

Illenium Average Popularity 62.5
--------------------
Nurko Average Popularity 49
ARMNHMR Average Popularity 42
William Black Average Popularity 50
Said The Sky Average Popularity 54
Seven Lions Average Popularity 52
MitiS Average Popularity 45
Crystal Skies Average Popularity 41
ILLENIUM Average Popularity 62
Dabin Average Popularity 49
Slushii Average Popularity 52
Kasbo Average Popularity 49
Skrux Average Popularity 37
Jai Wolf Average Popularity 52
Crywolf Average Popularity 40
Ekali Average Popularity 46
INZO Average Popularity 44
Zeds Dead Average Popularity 55
Lost Kings Average Popularity 57
Kaivon Average Popularity 45
NGHTMRE Average Popularity 57
San Holo Average Popularity 52


It appears that the highest popularity that we can randomly generate within the specified constraints is only 58.53. Where Illenium tracks already have an average popularity of 62.5, he is exceeding this and can probably do without utilizing the analysis/modeling. This model would be more useful for any artist that has an average popularity below 57 (essentially all of the artists that were included in this analysis other than Illenium, Lost Kings, and NGHTMRE)