#### Data Cleaning and Data Exploration
This notebook cleans the raw data into a concise dataset to feed into the model. It also verifies that the Tableau visualizations accurately reflect the raw data in ```songs_normalize.csv```.

**NOTE**: This notebook explores ONLY the genres used to train the model (top 3); see below. See ```data_exploration_raw.ipynb``` for data exploration using ALL genres included in the raw data.

_______________________________________________________________________________


_**RATIONALE FOR REDUCED GENRES**: The objective of this project is to train a model that accurately sorts songs into genres, using ```songs_normalize.csv``` as the reference. When reviewing the CSV, we noted that many genre values list more than one genre (for example, "rock, pop"), which our group believed would blur the class labels in the training data. Our group therefore kept only rows whose genre value contains a single genre, to give the models clearly differentiated classes._

In [1]:
# dependencies
import pandas as pd
import numpy as np
import requests

In [2]:
# read csv
songs_df = pd.read_csv('../Resources/songs_normalize.csv')
songs_df.head()

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
0,Britney Spears,Oops!...I Did It Again,211160,False,2000,77,0.751,0.834,1,-5.444,0,0.0437,0.3,1.8e-05,0.355,0.894,95.053,pop
1,blink-182,All The Small Things,167066,False,1999,79,0.434,0.897,0,-4.918,1,0.0488,0.0103,0.0,0.612,0.684,148.726,"rock, pop"
2,Faith Hill,Breathe,250546,False,1999,66,0.529,0.496,7,-9.007,1,0.029,0.173,0.0,0.251,0.278,136.859,"pop, country"
3,Bon Jovi,It's My Life,224493,False,2000,78,0.551,0.913,0,-4.063,0,0.0466,0.0263,1.3e-05,0.347,0.544,119.992,"rock, metal"
4,*NSYNC,Bye Bye Bye,200560,False,2000,65,0.614,0.928,8,-4.806,0,0.0516,0.0408,0.00104,0.0845,0.879,172.656,pop


#### Data Cleaning: Removing Genres not Used to Train Model

In [42]:
# review unique genre values in genre column
songs_df['genre'].unique()

array(['pop', 'rock, pop', 'pop, country', 'rock, metal',
       'hip hop, pop, R&B', 'hip hop', 'pop, rock', 'pop, R&B',
       'Dance/Electronic', 'pop, Dance/Electronic',
       'rock, Folk/Acoustic, easy listening', 'metal', 'hip hop, pop',
       'R&B', 'pop, latin', 'Folk/Acoustic, rock',
       'pop, easy listening, Dance/Electronic', 'rock',
       'rock, blues, latin', 'pop, rock, metal', 'rock, pop, metal',
       'hip hop, R&B', 'pop, Folk/Acoustic', 'set()',
       'hip hop, pop, latin', 'hip hop, Dance/Electronic',
       'hip hop, pop, rock', 'World/Traditional, Folk/Acoustic',
       'Folk/Acoustic, pop', 'rock, easy listening',
       'World/Traditional, hip hop', 'hip hop, pop, R&B, latin',
       'rock, blues', 'rock, R&B, Folk/Acoustic, pop', 'latin',
       'pop, R&B, Dance/Electronic', 'World/Traditional, rock',
       'pop, rock, Dance/Electronic', 'pop, easy listening, jazz',
       'rock, Dance/Electronic', 'World/Traditional, pop, Folk/Acoustic',
       'countr

In [13]:
# drop rows that contain more than 1 genre in genre column

# identify substring to be removed
remove = ','

# create boolean mask for rows not containing substring
mask = ~songs_df['genre'].str.contains(remove, case=False, na=False)

# apply mask
songs_filtered_df = songs_df[mask]

songs_filtered_df

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
0,Britney Spears,Oops!...I Did It Again,211160,False,2000,77,0.751,0.834,1,-5.444,0,0.0437,0.30000,0.000018,0.3550,0.894,95.053,pop
4,*NSYNC,Bye Bye Bye,200560,False,2000,65,0.614,0.928,8,-4.806,0,0.0516,0.04080,0.001040,0.0845,0.879,172.656,pop
6,Eminem,The Real Slim Shady,284200,True,2000,86,0.949,0.661,5,-4.244,0,0.0572,0.03020,0.000000,0.0454,0.760,104.504,hip hop
9,Modjo,Lady - Hear Me Tonight,307153,False,2001,77,0.720,0.808,6,-5.627,1,0.0379,0.00793,0.029300,0.0634,0.869,126.041,Dance/Electronic
10,Gigi D'Agostino,L'Amour Toujours,238759,False,2011,1,0.617,0.728,7,-7.932,1,0.0292,0.03280,0.048200,0.3600,0.808,139.066,pop
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1994,Post Malone,Goodbyes (Feat. Young Thug),174960,True,2019,1,0.580,0.653,5,-3.818,1,0.0745,0.44700,0.000000,0.1110,0.175,150.231,hip hop
1995,Jonas Brothers,Sucker,181026,False,2019,79,0.842,0.734,1,-5.065,0,0.0588,0.04270,0.000000,0.1060,0.952,137.958,pop
1996,Taylor Swift,Cruel Summer,178426,False,2019,78,0.552,0.702,9,-5.707,1,0.1570,0.11700,0.000021,0.1050,0.564,169.994,pop
1998,Sam Smith,Dancing With A Stranger (with Normani),171029,False,2019,75,0.741,0.520,8,-7.513,1,0.0656,0.45000,0.000002,0.2220,0.347,102.998,pop
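
The comma-based mask used in the cell above can be tried on a small hand-made frame. This is a minimal sketch, not the notebook's own cell: `toy_df`, `is_multi_genre`, and `toy_filtered_df` are hypothetical names, and the rows are made up for illustration. It passes `regex=False` so the comma is matched as a literal character, and takes a `.copy()` so later edits to the filtered frame do not touch the original.

```python
import pandas as pd

# Toy stand-in for songs_df; these rows are made up for illustration only.
toy_df = pd.DataFrame({
    "song": ["a", "b", "c", "d"],
    "genre": ["pop", "rock, pop", "hip hop", "pop, R&B"],
})

# regex=False treats ',' as a literal character rather than a pattern.
is_multi_genre = toy_df["genre"].str.contains(",", regex=False, na=False)

# Keep only single-genre rows; .copy() returns an independent frame.
toy_filtered_df = toy_df[~is_multi_genre].copy()

print(toy_filtered_df["genre"].tolist())
```

Only the rows whose genre value has no comma remain, which mirrors what the `unique()` check in the next cell confirms on the real data.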


In [15]:
# verify only single genre values are included in df
songs_filtered_df['genre'].unique() 

array(['pop', 'hip hop', 'Dance/Electronic', 'metal', 'R&B', 'rock',
       'set()', 'latin', 'country', 'easy listening'], dtype=object)

In [18]:
# view counts of each genre
songs_filtered_df['genre'].value_counts()

genre
pop                 428
hip hop             124
rock                 58
Dance/Electronic     41
set()                22
latin                15
R&B                  13
country              10
metal                 9
easy listening        1
Name: count, dtype: int64

In [45]:
# 'set()' does not appear to be a valid genre, remove from df
remove_set = 'set()'

# filter df
songs_filtered_df = songs_filtered_df[songs_filtered_df['genre'] != remove_set]

# 'easy listening' has 1 count, remove from df
remove_easy_listening = 'easy listening'
songs_filtered_df = songs_filtered_df[songs_filtered_df['genre'] != remove_easy_listening]

# verify genres in df
songs_filtered_df['genre'].unique()

array(['pop', 'hip hop', 'Dance/Electronic', 'metal', 'R&B', 'rock',
       'latin', 'country'], dtype=object)
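
The two removals above could also be expressed as a single filter. The sketch below assumes a hypothetical list `genres_to_drop` and a made-up `toy_df`; it uses `Series.isin` so that further unwanted labels can be added to the list without another filtering line.

```python
import pandas as pd

# Toy stand-in for songs_filtered_df; values are made up for illustration.
toy_df = pd.DataFrame(
    {"genre": ["pop", "set()", "rock", "easy listening", "hip hop"]}
)

# Hypothetical list of labels to exclude: the placeholder 'set()' and the
# single-row 'easy listening' genre.
genres_to_drop = ["set()", "easy listening"]

# isin() flags rows whose genre is in the list; ~ inverts the flag.
toy_clean_df = toy_df[~toy_df["genre"].isin(genres_to_drop)]

print(toy_clean_df["genre"].unique())
```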

In [30]:
# preview df
songs_filtered_df

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
0,Britney Spears,Oops!...I Did It Again,211160,False,2000,77,0.751,0.834,1,-5.444,0,0.0437,0.30000,0.000018,0.3550,0.894,95.053,pop
4,*NSYNC,Bye Bye Bye,200560,False,2000,65,0.614,0.928,8,-4.806,0,0.0516,0.04080,0.001040,0.0845,0.879,172.656,pop
6,Eminem,The Real Slim Shady,284200,True,2000,86,0.949,0.661,5,-4.244,0,0.0572,0.03020,0.000000,0.0454,0.760,104.504,hip hop
9,Modjo,Lady - Hear Me Tonight,307153,False,2001,77,0.720,0.808,6,-5.627,1,0.0379,0.00793,0.029300,0.0634,0.869,126.041,Dance/Electronic
10,Gigi D'Agostino,L'Amour Toujours,238759,False,2011,1,0.617,0.728,7,-7.932,1,0.0292,0.03280,0.048200,0.3600,0.808,139.066,pop
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1994,Post Malone,Goodbyes (Feat. Young Thug),174960,True,2019,1,0.580,0.653,5,-3.818,1,0.0745,0.44700,0.000000,0.1110,0.175,150.231,hip hop
1995,Jonas Brothers,Sucker,181026,False,2019,79,0.842,0.734,1,-5.065,0,0.0588,0.04270,0.000000,0.1060,0.952,137.958,pop
1996,Taylor Swift,Cruel Summer,178426,False,2019,78,0.552,0.702,9,-5.707,1,0.1570,0.11700,0.000021,0.1050,0.564,169.994,pop
1998,Sam Smith,Dancing With A Stranger (with Normani),171029,False,2019,75,0.741,0.520,8,-7.513,1,0.0656,0.45000,0.000002,0.2220,0.347,102.998,pop


In [28]:
# save filtered dataframe as CSV
songs_filtered_df.to_csv('../Resources/songs_filtered.csv', index=False)

#### Visualization 1b: Genre Distribution

In [36]:
# verify number of unique genres in csv
songs_filtered_df['genre'].nunique()

8

In [40]:
# verify genres by count in csv
songs_filtered_df['genre'].value_counts()

genre
pop                 428
hip hop             124
rock                 58
Dance/Electronic     41
latin                15
R&B                  13
country              10
metal                 9
Name: count, dtype: int64
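
The top 3 genres referenced in the sections that follow can be read off the counts programmatically rather than by eye. This is a sketch on made-up data; `toy_genres` and `top_genres` are hypothetical names. It relies on `value_counts()` returning counts in descending order.

```python
import pandas as pd

# Toy genre column; the counts are made up and deliberately all different.
toy_genres = pd.Series(
    ["pop"] * 5 + ["hip hop"] * 3 + ["rock"] * 2 + ["latin"], name="genre"
)

# value_counts() sorts by count, largest first, so head(3) is the top 3.
top_genres = toy_genres.value_counts().head(3).index.tolist()

print(top_genres)
```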

#### Visualization 2a: Top 3 Genres (Energy, Danceability, Valence)

In [62]:
# verify avg energy, danceability, valence in top 3 genres

# ENERGY: A measure from 0.0 to 1.0, represents a perceptual measure of intensity and activity.

print(f'Average Energy for Top 3 Genres:')

# pop
genre_1_energy = songs_filtered_df[songs_filtered_df['genre'] == 'pop']
avg_energy_1 = genre_1_energy['energy'].mean()
avg_energy_1

print(f'[pop] = {avg_energy_1:.4f}')

# hip hop
genre_2_energy = songs_filtered_df[songs_filtered_df['genre'] == 'hip hop']
avg_energy_2 = genre_2_energy['energy'].mean()
avg_energy_2

print(f'[hip hop] = {avg_energy_2:.4f}')

# rock
genre_3_energy = songs_filtered_df[songs_filtered_df['genre'] == 'rock']
avg_energy_3 = genre_3_energy['energy'].mean()
avg_energy_3

print(f'[rock] = {avg_energy_3:.4f}')

Average Energy for Top 3 Genres:
[pop] = 0.7164
[hip hop] = 0.6834
[rock] = 0.8093


In [50]:
# verify avg energy, danceability, valence in top 3 genres

# DANCEABILITY: 
# Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. 
# A value of 0.0 is least danceable and 1.0 is most danceable.

print(f'Average Danceability for Top 3 Genres:')

# pop
genre_1_dance = songs_filtered_df[songs_filtered_df['genre'] == 'pop']
avg_dance_1 = genre_1_dance['danceability'].mean()
avg_dance_1

print(f'[pop] = {avg_dance_1:.4f}')

# hip hop
genre_2_dance = songs_filtered_df[songs_filtered_df['genre'] == 'hip hop']
avg_dance_2 = genre_2_dance['danceability'].mean()
avg_dance_2

print(f'[hip hop] = {avg_dance_2:.4f}')

# rock
genre_3_dance = songs_filtered_df[songs_filtered_df['genre'] == 'rock']
avg_dance_3 = genre_3_dance['danceability'].mean()
avg_dance_3

print(f'[rock] = {avg_dance_3:.4f}')

Average Danceability for Top 3 Genres:
[pop] = 0.6479
[hip hop] = 0.7238
[rock] = 0.5262


In [51]:
# verify avg energy, danceability, valence in top 3 genres

# VALENCE: 
# A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. 
# Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

print(f'Average VALENCE for Top 3 Genres:')

# pop
genre_1_valence = songs_filtered_df[songs_filtered_df['genre'] == 'pop']
avg_valence_1 = genre_1_valence['valence'].mean()
avg_valence_1

print(f'[pop] = {avg_valence_1:.4f}')

# hip hop
genre_2_valence = songs_filtered_df[songs_filtered_df['genre'] == 'hip hop']
avg_valence_2 = genre_2_valence['valence'].mean()
avg_valence_2

print(f'[hip hop] = {avg_valence_2:.4f}')

# rock
genre_3_valence = songs_filtered_df[songs_filtered_df['genre'] == 'rock']
avg_valence_3 = genre_3_valence['valence'].mean()
avg_valence_3

print(f'[rock] = {avg_valence_3:.4f}')


Average VALENCE for Top 3 Genres:
[pop] = 0.5549
[hip hop] = 0.5295
[rock] = 0.5198
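
The per-genre averages computed one genre at a time in the cells above can also be obtained in one pass with `groupby`. The sketch below is an alternative formulation under stated assumptions, not the notebook's method: `toy_df` holds made-up values, and `top_genres`, `features`, and `genre_means` are hypothetical names. The same pattern should apply to any other numeric columns by changing the `features` list.

```python
import pandas as pd

# Toy stand-in for songs_filtered_df; all values are made up.
toy_df = pd.DataFrame({
    "genre": ["pop", "pop", "hip hop", "hip hop", "rock", "latin"],
    "energy": [0.8, 0.6, 0.7, 0.5, 0.9, 0.4],
    "danceability": [0.7, 0.5, 0.8, 0.6, 0.5, 0.9],
    "valence": [0.6, 0.4, 0.5, 0.3, 0.5, 0.7],
})

top_genres = ["pop", "hip hop", "rock"]
features = ["energy", "danceability", "valence"]

genre_means = (
    toy_df[toy_df["genre"].isin(top_genres)]  # keep only the top genres
    .groupby("genre")[features]               # one group per genre
    .mean()                                   # average each feature
    .reindex(top_genres)                      # restore the display order
)

print(genre_means.round(4))
```

Each row of `genre_means` is one genre and each column one feature, so a single table can be compared against the Tableau values instead of several separate printouts.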


#### Visualization 3a: Top 3 Genres (Speechiness, Acousticness, Instrumentalness)

In [63]:
# verify avg speechiness, acousticness, instrumentalness in top 3 genres

# SPEECHINESS: 
# Detects the presence of spoken words in a track. 
# The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. 
# Values above 0.66 describe tracks that are probably made entirely of spoken words. 
# Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. 
# Values below 0.33 most likely represent music and other non-speech-like tracks.

print(f'Average SPEECHINESS for Top 3 Genres:')

# pop
genre_1_speech = songs_filtered_df[songs_filtered_df['genre'] == 'pop']
avg_speech_1 = genre_1_speech['speechiness'].mean()
avg_speech_1

print(f'[pop] = {avg_speech_1:.4f}')

# hip hop
genre_2_speech = songs_filtered_df[songs_filtered_df['genre'] == 'hip hop']
avg_speech_2 = genre_2_speech['speechiness'].mean()
avg_speech_2

print(f'[hip hop] = {avg_speech_2:.4f}')

# rock
genre_3_speech = songs_filtered_df[songs_filtered_df['genre'] == 'rock']
avg_speech_3 = genre_3_speech['speechiness'].mean()
avg_speech_3

print(f'[rock] = {avg_speech_3:.4f}')

Average SPEECHINESS for Top 3 Genres:
[pop] = 0.0734
[hip hop] = 0.2010
[rock] = 0.0647


In [64]:
# verify avg speechiness, acousticness, instrumentalness in top 3 genres

# ACOUSTICNESS:
# A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 
# 1.0 represents high confidence the track is acoustic.

print(f'Average ACOUSTICNESS for Top 3 Genres:')

# pop
genre_1_acoustic = songs_filtered_df[songs_filtered_df['genre'] == 'pop']
avg_acoustic_1 = genre_1_acoustic['acousticness'].mean()
avg_acoustic_1

print(f'[pop] = {avg_acoustic_1:.4f}')

# hip hop
genre_2_acoustic = songs_filtered_df[songs_filtered_df['genre'] == 'hip hop']
avg_acoustic_2 = genre_2_acoustic['acousticness'].mean()
avg_acoustic_2

print(f'[hip hop] = {avg_acoustic_2:.4f}')

# rock
genre_3_acoustic = songs_filtered_df[songs_filtered_df['genre'] == 'rock']
avg_acoustic_3 = genre_3_acoustic['acousticness'].mean()
avg_acoustic_3

print(f'[rock] = {avg_acoustic_3:.4f}')

Average ACOUSTICNESS for Top 3 Genres:
[pop] = 0.1522
[hip hop] = 0.1397
[rock] = 0.0483


In [66]:
# verify avg speechiness, acousticness, instrumentalness in top 3 genres

# INSTRUMENTALNESS:
# Predicts whether a track contains no vocals. 
# "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". 
# The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. 
# Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

print(f'Average INSTRUMENTALNESS for Top 3 Genres:')

# pop
genre_1_instrument = songs_filtered_df[songs_filtered_df['genre'] == 'pop']
avg_instrument_1 = genre_1_instrument['instrumentalness'].mean()
avg_instrument_1

print(f'[pop] = {avg_instrument_1:.4f}')

# hip hop
genre_2_instrument = songs_filtered_df[songs_filtered_df['genre'] == 'hip hop']
avg_instrument_2 = genre_2_instrument['instrumentalness'].mean()
avg_instrument_2

print(f'[hip hop] = {avg_instrument_2:.4f}')

# rock
genre_3_instrument = songs_filtered_df[songs_filtered_df['genre'] == 'rock']
avg_instrument_3 = genre_3_instrument['instrumentalness'].mean()
avg_instrument_3

print(f'[rock] = {avg_instrument_3:.4f}')

Average INSTRUMENTALNESS for Top 3 Genres:
[pop] = 0.0075
[hip hop] = 0.0091
[rock] = 0.0497
