## Dataset

We are doing our analysis on [Spotify Multi-Genre Playlist Data](https://www.kaggle.com/siropo/spotify-multigenre-playlists-data).
This dataset is a collection of song features taken from Spotify and separated into seven broad genres of music.
It is not a random sampling of songs on Spotify: each song was on a playlist made by the person who collected the dataset. 
However, it still covers a wide variety of genres, so it will work for the purposes of our analysis.
 
The dataset has the following 22 columns:

1. Artist Name
2. Song Name
3. Popularity: value from 0 to 100 that represents the song's popularity (magically determined by Spotify)
4. Genres: a detailed list of the genres for each artist
5. Playlist: the name of the playlist each song came from
6. Danceability
7. Energy
8. Key
9. Loudness
10. Mode
11. Speechiness
12. Acousticness
13. Instrumentalness
14. Liveness
15. Valence
16. Tempo
17. ID
18. URI
19. HRef
20. Analysis_url
21. Duration_Ms
22. Time-Signature

### Loading the dataset

The dataset is split into seven files, each containing the songs from a single genre of music.
Here, we load the files into memory and combine them into one dataset. 
We also drop the playlist, ID, URI, HRef, and Analysis_url columns because they are not relevant for our analysis. 
Since we will be combining all of the songs into a single dataset, we also have to add another column containing the genre of each song. 

In [11]:
import pandas as pd

def load_dataset(music_genre):
    # Read the CSV file for this genre into memory
    genre_data = pd.read_csv(f'{music_genre}_music_data.csv')
    # Drop the columns we don't need for the analysis
    genre_data = genre_data.drop(columns=['Playlist', 'id', 'uri', 'track_href', 'analysis_url'])
    # Add a column recording which genre file each song came from
    return genre_data.assign(genre=music_genre)

alternative = load_dataset('alternative')
blues = load_dataset('blues')
hiphop = load_dataset('hiphop')
indie_alt = load_dataset('indie_alt')
metal = load_dataset('metal')
pop = load_dataset('pop')
rock = load_dataset('rock')
    
dataset = pd.concat([alternative, blues, hiphop, indie_alt, metal, pop, rock])
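
One detail worth keeping in mind: by default `pd.concat` keeps each input frame's own row index, so the combined dataset contains repeated index labels. That is harmless for the boolean filtering used in this notebook, but it can be surprising with label-based lookups such as `.loc`. The sketch below uses two tiny made-up frames (`blues_demo` and `rock_demo` are hypothetical stand-ins, not the real CSV files) to show the difference `ignore_index=True` makes.

```python
import pandas as pd

# Hypothetical stand-ins for two of the genre files (all values are made up).
blues_demo = pd.DataFrame({'song': ['a', 'b'], 'tempo': [90.0, 120.0]}).assign(genre='blues')
rock_demo = pd.DataFrame({'song': ['c', 'd', 'e'], 'tempo': [140.0, 100.0, 128.0]}).assign(genre='rock')

# Default behaviour: each frame keeps its own index, so labels 0 and 1 appear twice.
stacked = pd.concat([blues_demo, rock_demo])

# ignore_index=True renumbers the rows of the combined frame from 0 to n-1.
renumbered = pd.concat([blues_demo, rock_demo], ignore_index=True)

print('Unique index without ignore_index:', stacked.index.is_unique)
print('Unique index with ignore_index:', renumbered.index.is_unique)
```

Whether to renumber is a judgment call; it only matters if later steps rely on the index to identify a single song.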

### Dataset Meta Analysis

Before analysing the dataset, we want to know what we are working with: how many songs each genre contains and how many values are missing.

In [13]:
def inspect_dataset(music_genre):
    print('Genre:', music_genre)
    print('Number of songs:', len(dataset[dataset['genre'] == music_genre]))
    # thanks to Stack Overflow user piRSquared and their answer to this question:
    # https://stackoverflow.com/a/38733583
    print('Number of missing values:', (dataset[dataset['genre'] == music_genre].isna()).to_numpy().sum(), '\n')

genres = ['alternative', 'blues', 'hiphop', 'indie_alt', 'metal', 'pop', 'rock']

for genre in genres:
    inspect_dataset(genre)
    
print('Number of songs in dataset:', len(dataset))

Genre: alternative
Number of songs: 2160
Number of missing values: 0 

Genre: blues
Number of songs: 2050
Number of missing values: 0 

Genre: hiphop
Number of songs: 2581
Number of missing values: 0 

Genre: indie_alt
Number of songs: 4338
Number of missing values: 0 

Genre: metal
Number of songs: 3045
Number of missing values: 0 

Genre: pop
Number of songs: 3831
Number of missing values: 0 

Genre: rock
Number of songs: 8747
Number of missing values: 0 

Number of songs in dataset: 26752
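
The loop above filters the dataset once per genre. As an alternative, the same two numbers can be computed for every genre at once with `groupby`. This is only a sketch on a small made-up frame (`demo` is hypothetical and deliberately contains a few missing values so the counts are not all zero); it assumes a `genre` column like the one added when loading.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the combined dataset (all values are made up).
demo = pd.DataFrame({
    'genre': ['blues', 'blues', 'rock', 'rock', 'rock'],
    'tempo': [90.0, np.nan, 140.0, 100.0, 128.0],
    'energy': [0.4, 0.5, 0.9, np.nan, np.nan],
})

# Number of songs per genre.
songs_per_genre = demo.groupby('genre').size()

# Missing cells per genre: flag NaNs, count them per row,
# then add up those row counts within each genre.
missing_per_row = demo.drop(columns='genre').isna().sum(axis=1)
missing_per_genre = missing_per_row.groupby(demo['genre']).sum()

# Both Series are indexed by genre, so they line up as two columns.
summary = pd.DataFrame({'songs': songs_per_genre, 'missing': missing_per_genre})
print(summary)
```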


Fortunately, there are no missing values in this dataset, so we don't have to impute or drop anything.
Rock is heavily over-represented, though: 8,747 of the 26,752 songs (about 33%) are rock, roughly twice as many as the next-largest genre, indie_alt (4,338). If we end up building a classifier, it may end up being biased towards rock (that may not be a bad thing, but what do I know? I'm also biased towards rock).
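
To put numbers on the imbalance, the sketch below takes the song counts printed above and computes each genre's share of the dataset, plus one common way of compensating for it: inverse-frequency weights of the form total / (number of genres × count), which is the same heuristic scikit-learn uses for `class_weight='balanced'`. Whether any reweighting is actually needed is an open question here, since no classifier has been built yet; treat this as an illustration rather than a recommendation.

```python
import pandas as pd

# Song counts per genre, copied from the printed output above.
counts = pd.Series({
    'alternative': 2160, 'blues': 2050, 'hiphop': 2581, 'indie_alt': 4338,
    'metal': 3045, 'pop': 3831, 'rock': 8747,
})

# Fraction of the whole dataset taken up by each genre.
shares = counts / counts.sum()

# Inverse-frequency weights: total / (number of genres * count).
# A genre of exactly average size would get a weight of 1.0;
# larger genres get less than 1, smaller genres more than 1.
weights = counts.sum() / (len(counts) * counts)

print(shares.round(3))
print(weights.round(3))
```

With these weights every genre contributes the same total weight, so a weighted model would no longer be rewarded simply for guessing rock.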