### Part 3 - Preprocessing <br />
In this notebook we take the data analyzed in the previous part and prepare it for modeling, which takes place in the next notebook (Regression Modeling).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import os.path

import warnings
# Silence all warnings to keep the notebook output readable.
warnings.filterwarnings(action="ignore")

In [2]:
df = pd.read_csv('capstone3_data.csv', index_col=0)

Here we drop the non-numerical (categorical) columns from our feature set. We could in principle add dummy features (particularly for `feature_artist`), but I decided this was unnecessary because our end user would be Illenium anyway. Dummy features would only help if he were to collaborate with other popular artists in the future, and even then they would still need to decide which aspects of a song to incorporate.

In [3]:
preparation = df.drop(columns=["artists","id","name","feature_artist"])
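
For reference, the dummy-feature option mentioned above could look roughly like the sketch below. It uses a small hypothetical frame (the artist names and values are made up for illustration and are not taken from `capstone3_data.csv`) and `pd.get_dummies` to one-hot encode `feature_artist`.

```python
import pandas as pd

# Hypothetical toy frame; the real data comes from capstone3_data.csv.
toy = pd.DataFrame({
    "feature_artist": ["None", "Artist A", "Artist B", "None"],
    "energy": [0.81, 0.64, 0.72, 0.90],
    "popularity": [55, 48, 61, 50],
})

# One-hot encode the categorical column. drop_first=True drops one category
# so the remaining dummy columns are not perfectly collinear.
encoded = pd.get_dummies(toy, columns=["feature_artist"], drop_first=True)
print(encoded.columns.tolist())
```

Each remaining artist becomes its own 0/1 column, which is why this approach only pays off when several collaborators appear often enough to learn from.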

Our target will be popularity (how can we maximize it?). The remaining columns are the features we will use to predict, and ultimately maximize, popularity.

In [4]:
target = preparation.popularity
features = preparation.drop(columns=["popularity"])

Next we scale the non-popularity features with `StandardScaler`, which standardizes each column to zero mean and unit variance.

In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

We now split the data into training and testing sets in preparation for modeling and the conclusion of the project. As stated above, we will fit regression models to this data.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scaled_features, target, test_size=.2, random_state=42)
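
One caveat with the cells above: the scaler is fit on every row before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that order on synthetic data; the `demo_*` and `Xtr`/`Xte` names are placeholders chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's `features` and `target` (assumption).
rng = np.random.default_rng(42)
demo_features = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
demo_target = rng.normal(loc=50.0, scale=8.0, size=200)

# Split first so the test rows never influence the scaler.
Xtr_raw, Xte_raw, ytr, yte = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=42
)

# Fit on the training rows only, then apply the same transform to both sets.
demo_scaler = StandardScaler().fit(Xtr_raw)
Xtr = demo_scaler.transform(Xtr_raw)
Xte = demo_scaler.transform(Xte_raw)  # reuses the training mean and std
```

With a dataset this small the difference is likely minor, but this order keeps the test set fully unseen until evaluation.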

### Part 4 - Modeling <br />

Here we test some of the most popular regression models. First we import the model classes, then we train each one on the training data.

In [7]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

In [8]:
lr = LinearRegression().fit(X_train, y_train)
r = Ridge().fit(X_train, y_train)
l = Lasso().fit(X_train, y_train)
rf = RandomForestRegressor().fit(X_train, y_train)
knn = KNeighborsRegressor().fit(X_train, y_train)
sv = SVR().fit(X_train, y_train)
nn = MLPRegressor().fit(X_train, y_train)

Here we print, for each model, its R^2 on the training data (`model.score`), its predictions on the test set, and the test-set R^2, explained variance score, and mean absolute error.

In [9]:
from sklearn import metrics

models = [lr, r, l, rf, knn, sv, nn]
# Use a separate name for the dict so it does not shadow the sklearn `metrics` module.
metric_fns = {"r2_score": metrics.r2_score, "explained_variance_score": metrics.explained_variance_score, "mean_absolute_error": metrics.mean_absolute_error}

for model in models:
    # Note: this score is R^2 on the training data, not the test data.
    print(str(model).strip("()"), 'Score:', str(model.score(X_train,y_train)))
    predictions = model.predict(X_test)  # predict once and reuse below
    print("\n predictions:", predictions,"\n")
    for k, metric in metric_fns.items():
        print("\t", k, ":", metric(y_test, predictions))
    print('-'*34)

LinearRegression Score: 0.19140385952675698

 predictions: [52.36067904 46.12238296 47.8112182  55.52120398 49.59117454 54.24344913
 47.32301572 49.20127545 45.57957551 47.8426646  51.37704596 51.73906102
 50.1519536  49.32706049 49.65194156 47.53693472 46.78374365 51.32005343
 50.93056483 44.15446007 56.51506345 47.74462297 44.57780843 48.20795386
 57.87092072 53.47258141 53.11702266 43.93373503 45.38026224 48.55912001
 51.00801367 44.89865745 50.01800176 50.69294819 51.19260663 50.16459423
 52.88750595 49.02246016 48.58303064 49.64889352 46.17943798] 

	 r2_score : 0.013541704007281274
	 explained_variance_score : 0.013782877228005752
	 mean_absolute_error : 6.129081903733724
----------------------------------
Ridge Score: 0.19139185258337932

 predictions: [52.34361901 46.15505061 47.73489254 55.48326273 49.60136684 54.21573216
 47.32681931 49.19396712 45.62613978 47.80911602 51.35394376 51.73629579
 50.14643236 49.31676498 49.6692289  47.55194373 46.81611212 51.29946775
 50.9215584
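
To make the three test-set metrics concrete, here is a small hand computation. The target and prediction values are hypothetical and unrelated to the results above; the formulas are the standard definitions that `r2_score`, `explained_variance_score`, and `mean_absolute_error` implement.

```python
import numpy as np

y_true = np.array([50.0, 45.0, 60.0, 55.0])  # hypothetical targets
y_pred = np.array([52.0, 47.0, 58.0, 56.0])  # hypothetical predictions

residuals = y_true - y_pred

# Mean absolute error: average size of the miss, in popularity points.
mae = np.mean(np.abs(residuals))

# R^2: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Explained variance: like R^2, but ignores a constant offset in the errors.
explained_variance = 1 - np.var(residuals) / np.var(y_true)

print(mae, r2, explained_variance)
```

Explained variance and R^2 coincide when the residuals average to zero; a gap between them indicates that the predictions are biased high or low on average.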

Based on these metrics, the Random Forest appears to perform best, so it is the model we will use for our general conclusions and for the rest of the project.
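
Because the hold-out set is small (the predictions array above has 41 values) and `RandomForestRegressor` is not seeded, a single split can rank models differently from run to run. One way to check the ranking is k-fold cross-validation. The sketch below is an illustration on synthetic data, not the notebook's actual results; `demo_X`, `demo_y`, `candidates`, and `cv_results` are placeholder names, and only two of the models are shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): one informative feature plus noise.
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(200, 4))
demo_y = 50 + 6 * demo_X[:, 0] + rng.normal(scale=3.0, size=200)

# Wrapping the scaler in a pipeline refits it inside every fold.
candidates = {
    "LinearRegression": make_pipeline(StandardScaler(), LinearRegression()),
    "RandomForestRegressor": make_pipeline(
        StandardScaler(),
        RandomForestRegressor(n_estimators=50, random_state=42),
    ),
}

cv_results = {}
for name, estimator in candidates.items():
    # 5-fold cross-validated R^2; each fold serves once as the validation set.
    scores = cross_val_score(estimator, demo_X, demo_y, cv=5, scoring="r2")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Comparing the mean and spread of the fold scores gives a steadier basis for choosing a model than a single 80/20 split.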