In this notebook, we explore our preprocessed dataset. We analyze the data for outliers and correlated features, and compare the apparent merit of each feature against the features the models select. We use several feature selection methods (PCA, forward/backward selection, and decision trees) and decide which features to exclude based on the results and the error rate. We want to train each model only on the features that matter most for determining a song's popularity within its genre; the decision tree is the exception, since it performs its own feature selection.

We also use kMeans to explore how well its clusters align with a song's popularity and, possibly, whether it can group songs into genres based on the tracks' features.

This data has been preprocessed and does not contain null values.

In [1]:
import pandas as pd
import numpy as np

In [2]:
popular_data = pd.read_csv('./data/PopularData.csv')

In [4]:
popular_data.head(3)

Unnamed: 0,SpotifyTrack,SpotifyArtist,uri,year,genre,release_date,popular,key,explicit,mode,...,danceability,energy,duration_ms,instrumentalness,valence,tempo,liveness,loudness,speechiness,time_signature
0,The Good Stuff,Kenny Chesney,spotify:track:1sR3kJi14jA8Gau3a0yXAo,2002,Country,4/2/2002,1,7,0,1,...,0.612,0.62,200440,0.0,0.502,143.78,0.129,-9.785,0.0645,4
1,Drive (For Daddy Gene),Alan Jackson,spotify:track:1FV374EPG5CrjdIbIMLkcv,2002,Country,1/15/2002,1,11,0,1,...,0.713,0.579,242733,0.0,0.511,125.179,0.174,-8.066,0.0413,4
2,Living And Living Well,George Strait,spotify:track:3YxKqZFpcxBPvpUssL8FS2,2002,Country,1/1/2001,1,1,0,1,...,0.602,0.683,218307,0.00338,0.522,120.787,0.11,-6.82,0.0304,4


In [8]:
print(popular_data.shape)
popular_data.describe()

(5655, 22)


Unnamed: 0,year,popular,key,explicit,mode,chartrank,acousticness,danceability,energy,duration_ms,instrumentalness,valence,tempo,liveness,loudness,speechiness,time_signature
count,5655.0,5655.0,5655.0,5655.0,5655.0,5655.0,5655.0,5655.0,5655.0,5655.0,5655.0,5655.0,5655.0,5655.0,5655.0,5655.0,5655.0
mean,2012.560743,1.0,5.321662,0.225464,0.65924,41.484704,0.207662,0.660912,0.683874,229145.221043,0.063507,0.580522,121.437516,0.171417,-5.894027,0.09101,3.939876
std,5.02123,0.0,3.597298,0.417925,0.474007,27.32729,0.208004,0.128686,0.153107,44091.232698,0.207379,0.223677,29.559015,0.129147,2.106324,0.09374,0.326183
min,2002.0,1.0,0.0,0.0,0.0,1.0,7.2e-05,0.201,0.0645,78200.0,0.0,0.0383,48.718,0.0148,-19.727,0.0227,1.0
25%,2009.0,1.0,2.0,0.0,0.0,19.0,0.0405,0.575,0.577,200980.0,0.0,0.4105,96.99,0.09035,-7.0275,0.0345,4.0
50%,2013.0,1.0,5.0,0.0,1.0,37.0,0.133,0.67,0.699,225205.0,0.0,0.588,119.966,0.121,-5.632,0.0489,4.0
75%,2017.0,1.0,8.0,0.0,1.0,61.0,0.322,0.753,0.8045,251353.0,3.5e-05,0.76,141.583,0.216,-4.4305,0.102,4.0
max,2020.0,1.0,11.0,1.0,1.0,100.0,0.985,0.974,0.993,992160.0,0.961,0.982,214.025,0.984,-0.025,0.597,5.0


In [5]:
raw_data = pd.read_csv('./data/rawData_final.csv')

In [6]:
raw_data.head(3)

Unnamed: 0,year,genre,popular,chartrank,acousticness,danceability,energy,duration_ms,instrumentalness,valence,...,key_10,key_11,explicit_0,explicit_1,mode_0,mode_1,time_signature_1,time_signature_3,time_signature_4,time_signature_5
0,2016,jazz,0,0,0.677,0.566,0.468,0.09433,2.5e-05,0.718,...,0,0,1,0,0,1,0,0,1,0
1,2017,latin,0,0,0.0317,0.709,0.744,0.166493,0.0,0.314,...,0,0,0,1,0,1,0,0,1,0
2,2017,latin,0,0,0.392,0.787,0.732,0.143168,3e-06,0.897,...,0,0,1,0,0,1,0,0,1,0


In [22]:
print(raw_data.shape)
raw_data.describe(include='all').T

(23127, 34)


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
year,23127,,,,2011.24,5.42064,2002.0,2007.0,2011.0,2016.0,2020.0
genre,23127,5.0,r&b,5233.0,,,,,,,
popular,23127,,,,0.244476,0.429785,0.0,0.0,0.0,0.0,1.0
chartrank,23127,,,,10.1429,22.372,0.0,0.0,0.0,0.0,100.0
acousticness,23127,,,,0.292056,0.283272,2.05e-06,0.0513,0.193,0.4785,0.996
danceability,23127,,,,0.622599,0.143972,0.0917,0.526,0.631,0.729,0.976
energy,23127,,,,0.623486,0.206216,0.00154,0.493,0.652,0.783,0.997
duration_ms,23127,,,,0.130604,0.0432525,0.0,0.107969,0.125852,0.14664,1.0
instrumentalness,23127,,,,0.0895416,0.242499,0.0,0.0,2.14e-06,0.0007815,0.988
valence,23127,,,,0.532651,0.238052,0.0266,0.344,0.534,0.725,0.983


We have a total of 23,127 tracks: 17,473 unpopular tracks and 5,654 'popular' (i.e., target) tracks. The track counts per genre all fall within about 1.5 standard deviations of the mean of 4,625.4 tracks per genre, where one standard deviation is roughly 425 tracks. The pop genre has the lowest count with 4,158 tracks, and the r&b genre has the highest with 5,233 tracks.

In [31]:
print("Unpopular & popular counts:\n", raw_data.popular.value_counts())

Unpopular & popular counts:
 0    17473
1     5654
Name: popular, dtype: int64


In [32]:
print("Tracks per genre count:\n", raw_data.genre.value_counts())
print("\nMean distribution of tracks per genre: ", (raw_data.genre.value_counts().sum())/(raw_data.genre.value_counts().shape[0]))

Tracks per genre count:
 r&b        5233
country    4739
latin      4709
jazz       4288
pop        4158
Name: genre, dtype: int64

Mean distribution of tracks per genre:  4625.4


In [30]:
# Standard deviation of tracks per genre
raw_data.genre.value_counts().std()

424.66845891824835
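
As a quick check on how evenly the genres are represented, the counts printed above can be expressed as z-scores against their mean and sample standard deviation. This is a minimal sketch that hard-codes the counts from the output above instead of reading the CSV; `genre_counts` and `z_scores` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Genre counts copied from the value_counts() output above.
genre_counts = pd.Series(
    {"r&b": 5233, "country": 4739, "latin": 4709, "jazz": 4288, "pop": 4158}
)

# Series.std() uses the sample standard deviation (ddof=1) by default,
# which is the same convention as the value printed above.
z_scores = (genre_counts - genre_counts.mean()) / genre_counts.std()

print(z_scores.round(2))
```

Every genre should land within about 1.5 standard deviations of the mean, with r&b furthest above it and pop furthest below.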

## Analyze data for correlated features

In [45]:
pd.options.display.max_columns = 40
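# Restrict to numeric columns: genre holds strings and cannot be correlated.
raw_data.select_dtypes('number').corr()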

Unnamed: 0,year,popular,chartrank,acousticness,danceability,energy,duration_ms,instrumentalness,valence,tempo,liveness,loudness,speechiness,key_0,key_1,key_2,key_3,key_4,key_5,key_6,key_7,key_8,key_9,key_10,key_11,explicit_0,explicit_1,mode_0,mode_1,time_signature_1,time_signature_3,time_signature_4,time_signature_5
year,1.0,0.138451,0.130752,-0.002452,0.046851,-0.021573,-0.167721,0.067053,-0.103803,0.021754,-0.023592,0.002073,0.049199,-0.002263,0.015966,-0.032831,-0.004625,-0.008777,0.001609,0.026043,-0.015778,0.010659,-0.016223,0.015503,0.016352,-0.16033,0.16033,0.038952,-0.038952,0.02253,0.02182,-0.024109,-0.004779
popular,0.138451,1.0,0.797029,-0.169428,0.151191,0.166754,-0.052411,-0.061045,0.114409,0.032547,-0.065085,0.220876,0.053378,-0.015835,0.019332,-0.001087,-0.009668,-0.01143,0.001985,0.002094,-0.016264,0.021887,-0.015486,0.010748,0.016039,-0.120955,0.120955,-0.004908,0.004908,-0.009251,-0.021057,0.020957,0.001589
chartrank,0.130752,0.797029,1.0,-0.144451,0.115322,0.136503,-0.058011,-0.099075,0.08228,0.033867,-0.040205,0.191342,0.082973,-0.013688,0.018005,-0.002551,-0.005895,-0.00336,0.000406,0.007925,-0.014646,0.013317,-0.010427,0.004846,0.009019,-0.142394,0.142394,-0.010256,0.010256,-0.000972,0.001475,-0.000855,-0.000557
acousticness,-0.002452,-0.169428,-0.144451,1.0,-0.209523,-0.660217,0.020545,0.226459,-0.21807,-0.115901,-0.070233,-0.568515,-0.126149,0.039998,-0.073837,0.001789,0.061509,0.010325,0.039661,-0.034303,0.015086,-0.011588,0.008964,0.019415,-0.061004,0.139627,-0.139627,-0.04624,0.04624,0.030794,0.191595,-0.187576,0.016498
danceability,0.046851,0.151191,0.115322,-0.209523,1.0,0.166004,-0.103111,-0.077973,0.433544,-0.148001,-0.133937,0.213652,0.169115,-0.02609,0.072236,-0.02759,-0.047143,-0.043062,-0.007849,0.017933,-0.022289,0.015157,-0.009861,0.032229,0.034267,-0.211484,0.211484,0.12745,-0.12745,-0.04526,-0.183928,0.208624,-0.080495
energy,-0.021573,0.166754,0.136503,-0.660217,0.166004,1.0,-0.066631,-0.216517,0.456319,0.158591,0.135999,0.758895,0.114181,-0.042096,0.031699,0.012795,-0.042484,0.003595,-0.026451,0.018111,-0.012146,0.003228,0.001124,-0.007765,0.053954,-0.0252,0.0252,0.033573,-0.033573,-0.035892,-0.20407,0.204081,-0.02675
duration_ms,-0.167721,-0.052411,-0.058011,0.020545,-0.103111,-0.066631,1.0,0.129611,-0.166101,-0.031343,0.066335,-0.097962,-0.04672,0.008103,0.007667,0.009259,0.012673,-0.020157,0.034351,-0.018302,-0.005728,-0.005835,-0.011605,-0.014414,0.001032,0.003488,-0.003488,0.057073,-0.057073,-0.031908,-0.026305,0.033703,-0.003129
instrumentalness,0.067053,-0.061045,-0.099075,0.226459,-0.077973,-0.216517,0.129611,1.0,-0.0844,-0.092717,-0.043328,-0.457803,-0.121866,0.010065,-0.023444,-0.000572,0.016227,-0.013213,0.04611,-0.02102,0.009452,-0.006836,-0.013367,0.018726,-0.020568,0.143067,-0.143067,0.100709,-0.100709,0.019908,0.036926,-0.039173,0.000969
valence,-0.103803,0.114409,0.08228,-0.21807,0.433544,0.456319,-0.166101,-0.0844,1.0,0.064667,-0.004543,0.321009,0.075949,-0.015091,-0.004774,-0.009778,-0.016717,-0.014916,0.01224,0.000384,0.005456,-0.001464,0.006941,0.017789,0.016648,0.075448,-0.075448,0.037845,-0.037845,-0.016496,-0.100361,0.106037,-0.029137
tempo,0.021754,0.032547,0.033867,-0.115901,-0.148001,0.158591,-0.031343,-0.092717,0.064667,1.0,0.024525,0.149094,0.060678,0.00749,-0.001627,0.010594,0.002791,0.023215,-0.028198,-0.005401,0.007594,0.01431,-0.001872,-0.031658,0.002043,-0.016518,0.016518,-0.049382,0.049382,-0.02523,0.066462,-0.046767,-0.01671


In [46]:
# corr() already returns a new DataFrame, so no explicit copy is needed.
data_corr_df = raw_data.select_dtypes('number').corr()

Few features correlate strongly with popularity across the dataset as a whole: apart from chartrank (0.80), the largest absolute correlation with popular is loudness (0.22). Among the features themselves, energy and loudness are highly correlated (0.76).

<font color='red'>TODO: group the data by genre, then run the correlation analysis again.  Perhaps we'll find more correlations in subgroups and can plot charts.</font>
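
One way the per-genre analysis might look is sketched below. It runs on a small synthetic frame, not on `raw_data`, so the numbers mean nothing; the helper name `corr_with_target_by_genre` and the `demo` frame are hypothetical and only illustrate the grouping pattern.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for raw_data (assumption: same column names, made-up values).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "genre": rng.choice(["pop", "jazz"], size=n),
    "energy": rng.random(n),
    "valence": rng.random(n),
})
# Make popularity depend on energy so the demo has a visible correlation.
demo["popular"] = (demo["energy"] + 0.2 * rng.standard_normal(n) > 0.5).astype(int)


def corr_with_target_by_genre(df, target="popular", group="genre"):
    """Return one row per genre holding each numeric feature's correlation with target."""
    rows = {}
    for name, genre_df in df.groupby(group):
        numeric = genre_df.select_dtypes("number")
        rows[name] = numeric.corr()[target].drop(target)
    return pd.DataFrame(rows).T


per_genre = corr_with_target_by_genre(demo)
print(per_genre.round(2))
```

Applied to the real data, the same call would take `raw_data` in place of `demo`, giving one row of correlations per genre that can be compared side by side or plotted.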

In [49]:
data_corr_df.where(data_corr_df > 0.5)

Unnamed: 0,year,popular,chartrank,acousticness,danceability,energy,duration_ms,instrumentalness,valence,tempo,liveness,loudness,speechiness,key_0,key_1,key_2,key_3,key_4,key_5,key_6,key_7,key_8,key_9,key_10,key_11,explicit_0,explicit_1,mode_0,mode_1,time_signature_1,time_signature_3,time_signature_4,time_signature_5
year,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
popular,,1.0,0.797029,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
chartrank,,0.797029,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
acousticness,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
danceability,,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,
energy,,,,,,1.0,,,,,,0.758895,,,,,,,,,,,,,,,,,,,,,
duration_ms,,,,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,
instrumentalness,,,,,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,
valence,,,,,,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,
tempo,,,,,,,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,


## Analyze data for outliers
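
A common starting point for this step is the 1.5 × IQR rule, sketched below under the assumption that it is an acceptable outlier criterion for this dataset (the notebook has not committed to one). The demo series reuses the five-number summary of `duration_ms` from the `popular_data.describe()` output; `iqr_outlier_mask` is a hypothetical helper name.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# min, 25%, 50%, 75%, max of duration_ms, copied from the describe() output above.
duration_ms = pd.Series([78200, 200980, 225205, 251353, 992160], name="duration_ms")

mask = iqr_outlier_mask(duration_ms)
print(duration_ms[mask])
```

With these quartiles the fences sit at roughly 125,420 ms and 326,913 ms, so both the shortest and the longest track would be flagged.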

## Compare feature merits against the features the model selects

## Use Feature Selection methods to determine which features are most important to determining popularity of the song per genre

### Backward Selection

### Forward Selection
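
A minimal sketch of forward selection, assuming scikit-learn's `SequentialFeatureSelector` with a logistic regression as the scoring model (neither choice is confirmed by the notebook). The data is synthetic: the column names mirror the dataset, but the values are random and the target is constructed to depend only on `energy` and `loudness`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic features (assumption: illustrative values only, not the real data).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame(
    rng.standard_normal((n, 5)),
    columns=["energy", "loudness", "tempo", "liveness", "valence"],
)
# The target depends on energy and loudness only; the other columns are noise.
y = (X["energy"] + X["loudness"] > 0).astype(int)

# Greedily add the feature that most improves cross-validated accuracy
# until two features have been selected.
selector = SequentialFeatureSelector(
    LogisticRegression(),
    n_features_to_select=2,
    direction="forward",
    cv=5,
)
selector.fit(X, y)

selected = list(X.columns[selector.get_support()])
print(selected)
```

Setting `direction="backward"` turns the same call into backward selection, which starts from all features and removes one at a time.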

### PCA
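
A minimal PCA sketch, assuming scikit-learn and standardized inputs. The frame is synthetic: `energy` and `loudness` are generated from a shared signal to imitate the strong correlation seen earlier, and `tempo` is independent noise. The names `demo`, `explained`, and `loadings` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two correlated features plus one independent feature.
rng = np.random.default_rng(0)
n = 300
shared = rng.standard_normal(n)
demo = pd.DataFrame({
    "energy": shared + 0.1 * rng.standard_normal(n),
    "loudness": shared + 0.1 * rng.standard_normal(n),
    "tempo": rng.standard_normal(n),
})

# Standardize first so no feature dominates because of its scale.
scaled = StandardScaler().fit_transform(demo)

pca = PCA()
pca.fit(scaled)

component_names = [f"PC{i + 1}" for i in range(pca.n_components_)]
explained = pd.Series(pca.explained_variance_ratio_, index=component_names)
loadings = pd.DataFrame(pca.components_, index=component_names, columns=demo.columns)

print(explained.round(3))
print(loadings.round(2))
```

Because the two correlated features collapse onto one component, the first component carries most of the variance here. On the real data, the explained-variance ratios would indicate how many components are worth keeping.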

### Decision Trees
Perhaps move this to Task 3 and briefly discuss the results here.

## Cluster Analysis

### kMeans

We use kMeans to explore how well clusters built from a track's features align with its popularity, and whether they can also separate songs into genres.
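
Since kMeans is unsupervised, one way to judge how well its clusters match a known label is to cross-tabulate cluster assignments against that label and compute the adjusted Rand index. The sketch below assumes scikit-learn and uses two synthetic, well-separated groups in place of the real tracks, so the near-perfect agreement it shows says nothing about the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic tracks (assumption): two well-separated groups in energy and loudness.
rng = np.random.default_rng(0)
n = 150
unpopular = rng.normal(loc=[0.3, -9.0], scale=[0.05, 0.5], size=(n, 2))
popular = rng.normal(loc=[0.8, -4.0], scale=[0.05, 0.5], size=(n, 2))
demo = pd.DataFrame(np.vstack([unpopular, popular]), columns=["energy", "loudness"])
demo["popular"] = [0] * n + [1] * n

# Scale the features so the distance metric treats them equally.
X = StandardScaler().fit_transform(demo[["energy", "loudness"]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
demo["cluster"] = kmeans.fit_predict(X)

# Cluster ids are arbitrary, so compare with a contingency table and
# a label-permutation-invariant score instead of plain accuracy.
print(pd.crosstab(demo["cluster"], demo["popular"]))
ari = adjusted_rand_score(demo["popular"], demo["cluster"])
print("Adjusted Rand index:", round(ari, 3))
```

The same comparison could be repeated with `n_clusters=5` against the genre column to test whether the clusters recover genres.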