# Dataset

We are doing our analysis on [Spotify Multi-Genre Playlist Data](https://www.kaggle.com/siropo/spotify-multigenre-playlists-data).
This dataset is a collection of song features taken from Spotify and separated into seven broad genres of music.
Our analysis aims to answer two questions:
1. How can we predict the genre of a song from its features?
2. Which song features are most influential for a particular genre?

It is not a random sample of songs on Spotify: each song was on a playlist made by the person who collected the dataset.
However, it still covers a wide variety of genres, so it will work for the purposes of our analysis.
 
The dataset has the following 22 columns:

1. Artist Name
2. Song Name
3. Popularity: value from 0 to 100 that represents the song's popularity (determined by Spotify's own algorithm)
4. Genres: a detailed list of the genres for each artist
5. Playlist: the name of the playlist each song came from
6. Danceability
7. Energy
8. Key
9. Loudness
10. Mode
11. Speechiness
12. Acousticness
13. Instrumentalness
14. Liveness
15. Valence
16. Tempo
17. ID
18. URI
19. HRef
20. Analysis_url
21. Duration_Ms
22. Time-Signature

## Loading the dataset

The dataset is broken into seven files, with each file containing the songs from a single genre of music.
Here, we load the files into memory and combine them into one dataset. 
We also drop the playlist, ID, URI, HRef, and Analysis_url columns because they are not relevant for our analysis. 
Since we will be combining all of the songs into a single dataset, we also have to add another column containing the genre of each song. 

In [2]:
import pandas as pd

def load_dataset(music_genre):
    # First, we read the genre dataset into memory
    # Then we drop all of the columns we don't need
    # and finally we add the genre column and return it 
    return (
        pd.read_csv(f'{music_genre}_music_data.csv')
        .drop(columns=['Playlist', 'id', 'uri', 'track_href', 'analysis_url'])
        .assign(genre=music_genre)
    )

alternative = load_dataset('alternative')
blues = load_dataset('blues')
hiphop = load_dataset('hiphop')
indie_alt = load_dataset('indie_alt')
metal = load_dataset('metal')
pop = load_dataset('pop')
rock = load_dataset('rock')
    
dataset = pd.concat([alternative, blues, hiphop, indie_alt, metal, pop, rock])
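
After combining, it can be worth counting how many rows each genre contributes, since the genres need not be equally sized. The following is a minimal sketch on made-up toy frames; the values and the two-genre `frames` dict are illustrative assumptions, not the real data.

```python
import pandas as pd

# Toy stand-ins for two per-genre frames (made-up values, not the real data)
metal = pd.DataFrame({'energy': [0.9, 0.8, 0.95], 'Playlist': ['m', 'm', 'm']})
pop = pd.DataFrame({'energy': [0.6, 0.7], 'Playlist': ['p', 'p']})

frames = {'metal': metal, 'pop': pop}

# Same steps as load_dataset: drop unused columns, then tag each row with its genre
tagged = [df.drop(columns=['Playlist']).assign(genre=name) for name, df in frames.items()]

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat(tagged, ignore_index=True)

# Number of songs per genre in the combined frame
genre_counts = combined['genre'].value_counts()
print(genre_counts)
```

Uneven counts here are one reason the training data is rebalanced later by down-sampling the songs from other genres.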

# Analysis

## Logistic Regression

In the last logistic regression analysis, we determined which variables are the least significant for identifying each genre.
In this analysis, we will use forward variable selection to determine which variables are the most significant for distinguishing each genre. 

In [11]:
import statsmodels.api as sm
import math

def make_genre_training_data(genre_name):
    # Collect all of the songs that belong to the given genre (and ignore any columns that aren't song features)
    X_same_genre = dataset[dataset['genre'] == genre_name].loc[:, 'danceability':'duration_ms']
    # Collect the songs that don't belong to the given genre (collect only a random sample to ensure a balanced dataset)
    X_different_genre = dataset[dataset['genre'] != genre_name].loc[:,'danceability':'duration_ms'].sample(n=len(X_same_genre))
    
    X_train = pd.concat([X_same_genre, X_different_genre], ignore_index=True)
    
    # For the Y data, we just need a dataframe of all 1s (same genre) or all 0s (different genre),
    # each with the same length as its matching block of X data
    Y_same_genre = pd.DataFrame({'genre': [1] * len(X_same_genre)})
    Y_different_genre = pd.DataFrame({'genre': [0] * len(X_different_genre)})

    Y_train = pd.concat([Y_same_genre,Y_different_genre], ignore_index=True)
    
    return X_train, Y_train

metal:
  Variable: danceability 	| aic: 8314.251570423245
  Variable: energy 	| aic: 8348.664131487367
  Variable: key 	| aic: 8444.53206638844
  Variable: loudness 	| aic: 8375.154284305805
  Variable: mode 	| aic: 8438.6702430024
  Variable: speechiness 	| aic: 8437.818293349814
  Variable: acousticness 	| aic: 7525.578482768849
  Variable: instrumentalness 	| aic: 8285.922858061165
  Variable: liveness 	| aic: 8422.243907612705
  Variable: valence 	| aic: 8205.731753687065
  Variable: tempo 	| aic: 8442.391793834204
  Variable: duration_ms 	| aic: 8402.867841144049
Best Variable: acousticness
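
Because `sample` draws a different random subset of the other genres on every call, the AIC values for a genre shift slightly from run to run. One way to make the draw repeatable is to pass `random_state` to `sample`. The sketch below shows this on made-up toy data; the `balanced_split` helper and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Made-up toy data: 3 'metal' songs and 6 songs from other genres
toy = pd.DataFrame({
    'acousticness': [0.01, 0.02, 0.03, 0.5, 0.6, 0.7, 0.8, 0.4, 0.9],
    'genre': ['metal'] * 3 + ['pop'] * 3 + ['blues'] * 3,
})

def balanced_split(df, genre_name, seed=None):
    """Return features X and 0/1 labels y with equal numbers of each class."""
    same = df[df['genre'] == genre_name]
    # Down-sample the other genres to match; a fixed seed makes the draw repeatable
    other = df[df['genre'] != genre_name].sample(n=len(same), random_state=seed)
    X = pd.concat([same, other], ignore_index=True).drop(columns=['genre'])
    y = pd.Series([1] * len(same) + [0] * len(other), name='genre')
    return X, y

# Two calls with the same seed select exactly the same rows
X_a, y_a = balanced_split(toy, 'metal', seed=0)
X_b, y_b = balanced_split(toy, 'metal', seed=0)
```

With a fixed seed, repeated runs fit the models on identical rows, so their AIC scores can be compared directly.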



The find_best_variable function below finds the most influential variable by fitting a separate single-feature logistic model for each feature and selecting the one with the lowest AIC score.
By default, it just returns the name of the most influential variable. However, if the verbose argument is set to a value greater than 0, it
also prints the AIC score for each feature.
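
For reference, the general definition of AIC is

$$
\mathrm{AIC} = 2k - 2\ln\hat{L}
$$

where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the model. Since every single-feature model compared here has the same $k$, choosing the lowest AIC amounts to choosing the feature whose model attains the highest likelihood.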

In [15]:
def find_best_variable(genre_name, verbose=0):
    
    X_train, Y_train = make_genre_training_data(genre_name)
    
    lowest_aic = math.inf
    best_variable = ''
    
    if verbose > 0:
        print(f'{genre_name}:')
    for col in X_train:
        logistic_model = sm.Logit(Y_train, X_train[col]).fit(disp=0)
        
        if verbose > 0:
            print('  Variable:', col,'\t| aic:', logistic_model.aic)
        
        if logistic_model.aic < lowest_aic:
            lowest_aic = logistic_model.aic
            best_variable = col
    
    return best_variable

In [18]:
find_best_variable('metal', verbose=1)
print('Best Variable for Alternative:', find_best_variable('alternative'))
print('Best Variable for Blues:', find_best_variable('blues'))
print('Best Variable for Hip Hop:', find_best_variable('hiphop'))
print('Best Variable for Indie Alt:', find_best_variable('indie_alt'))
print('Best Variable for Metal:', find_best_variable('metal'))
print('Best Variable for Pop:', find_best_variable('pop'))
print('Best Variable for Rock:', find_best_variable('rock'))

metal:
  Variable: danceability 	| aic: 8315.35351471684
  Variable: energy 	| aic: 8348.053445288784
  Variable: key 	| aic: 8444.326503098675
  Variable: loudness 	| aic: 8378.080791086475
  Variable: mode 	| aic: 8436.918402439835
  Variable: speechiness 	| aic: 8439.174986226144
  Variable: acousticness 	| aic: 7460.97763718127
  Variable: instrumentalness 	| aic: 8296.389772324055
  Variable: liveness 	| aic: 8428.821406706025
  Variable: valence 	| aic: 8214.683169953361
  Variable: tempo 	| aic: 8442.591677267035
  Variable: duration_ms 	| aic: 8405.216232530296
Best Variable for Alternative: instrumentalness
Best Variable for Blues: acousticness
Best Variable for Hip Hop: speechiness
Best Variable for Indie Alt: instrumentalness
Best Variable for Metal: acousticness
pop:
  Variable: danceability 	| aic: 10566.802759422313
  Variable: energy 	| aic: 10580.19406956776
  Variable: key 	| aic: 10623.20297579958
  Variable: loudness 	| aic: 10621.169375562775
  Variable: mode 	| aic