### Workflow for Using Machine Learning in Predictive Modeling:
**1. Preprocessing: Preparation of Data for Modeling**
    - Labels / Raw Data
    - Conversion of String Data into Binary Numerical Values
    - Conversion of Binary Numerical Data into Vectors
    - Conversion of Vector Columns into Combined Vectors
    - Pipelining of Data
**3. Learning**
    - Transformation and Fit of Data to Pipeline
    - Splitting of Training and Testing Data
    - Learning Algorithm
        - Model Selection
        - Cross-Validation
        - Create and Test Grid of Performance Metrics (Hyperparameters)
        - Hyperparameter Optimization
**4. Evaluation** 
    - Final Model
    - Labels Prediction Accuracy
**5. Prediction**
    - New Data
    - Label Predictions
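
The outline above can be sketched end to end with scikit-learn. This is an assumption: the outline's vocabulary (vector columns, pipelines) fits several libraries, and the synthetic data, the logistic regression estimator, and the small parameter grid below are illustrative choices rather than the notebook's actual model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 4 numeric features, 2 classes (hypothetical)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split first so the test rows stay unseen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and the learning algorithm chained in one pipeline
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(solver="lbfgs", max_iter=1000)),
])

# Cross-validated search over a small hyperparameter grid
search = GridSearchCV(pipeline, {"model__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Evaluate the final model, then predict labels for "new" rows
test_accuracy = search.score(X_test, y_test)
new_predictions = search.predict(X_test[:5])
print(search.best_params_, test_accuracy, new_predictions)
```

The sketch splits the data before fitting the pipeline so that the scaler never sees test rows; the outline lists the split after the pipeline fit, so treat this ordering as a design choice of the sketch.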

In [1]:
# Import the relevant python libraries for the analysis
import pandas as pd
from pandas import DataFrame
import numpy as np

## Section 1: Preprocessing and Exploratory Analysis
- Load and Clean SongDb.tsv dataset
- Visualize Data

### A. Load Genre Audio Categories and Clean Data

In [2]:
# Load the tab-separated songDb.tsv dataset; undecodable bytes are backslash-escaped
file_encoding = 'utf8'
with open('data/songDb.tsv', encoding=file_encoding, errors='backslashreplace') as input_fd:
    beats = pd.read_csv(input_fd, delimiter='\t', low_memory=False)
beats.head()

Unnamed: 0,Name,Danceability,Energy,Key,Loudness,Mode,Speechness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Type,ID,Uri,Ref_Track,URL_features,Duration_ms,time_signature,Genre
0,YuveYuveYu,0.624,0.857,10.0,-6.25,0.0,0.0542,0.0208,0.206,0.11,0.324,131.926,audio_features,6J2VvzKwWc2f0JP5RQVZjq,spotify:track:6J2VvzKwWc2f0JP5RQVZjq,https://api.spotify.com/v1/tracks/6J2VvzKwWc2f...,https://api.spotify.com/v1/audio-analysis/6J2V...,282920.0,4.0,celticmetal
1,Gloryhammer,0.517,0.916,0.0,-4.933,1.0,0.0559,0.000182,0.00191,0.306,0.444,135.996,audio_features,4HA34COgxgVJ6zK88UN4Ik,spotify:track:4HA34COgxgVJ6zK88UN4Ik,https://api.spotify.com/v1/tracks/4HA34COgxgVJ...,https://api.spotify.com/v1/audio-analysis/4HA3...,300320.0,4.0,celticmetal
2,Nostos,0.251,0.894,8.0,-4.103,0.0,0.057,0.0144,0.0,0.123,0.297,114.223,audio_features,3W6Xik6Xxf06JuUoZSATlD,spotify:track:3W6Xik6Xxf06JuUoZSATlD,https://api.spotify.com/v1/tracks/3W6Xik6Xxf06...,https://api.spotify.com/v1/audio-analysis/3W6X...,175353.0,4.0,celticmetal
3,Yggdrasil,0.469,0.743,1.0,-5.57,0.0,0.0272,0.00222,0.000111,0.276,0.481,86.953,audio_features,2gGveBaLJQMtJ43X4UL5kH,spotify:track:2gGveBaLJQMtJ43X4UL5kH,https://api.spotify.com/v1/tracks/2gGveBaLJQMt...,https://api.spotify.com/v1/audio-analysis/2gGv...,272292.0,4.0,celticmetal
4,Incense&Iron,0.487,0.952,1.0,-4.429,0.0,0.0613,0.000228,0.0,0.161,0.329,125.993,audio_features,1lRF81A1C9QoCgBcEop2zg,spotify:track:1lRF81A1C9QoCgBcEop2zg,https://api.spotify.com/v1/tracks/1lRF81A1C9Qo...,https://api.spotify.com/v1/audio-analysis/1lRF...,237933.0,4.0,celticmetal


In [3]:
# Number of columns, number of unique genres, and number of rows in the dataset
len(beats.columns), len(beats.Genre.unique()), len(beats)

(20, 626, 131580)

In [4]:
# List column names
list(beats.columns)

['Name',
 'Danceability',
 'Energy',
 'Key',
 'Loudness',
 'Mode',
 'Speechness',
 'Acousticness',
 'Instrumentalness',
 'Liveness',
 'Valence',
 'Tempo',
 'Type',
 'ID',
 'Uri',
 'Ref_Track',
 'URL_features',
 'Duration_ms',
 'time_signature',
 'Genre']

#### Assess Data Cleanliness: Drop all rows with NaN values and Drop Unnecessary Information

In [5]:
beats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131580 entries, 0 to 131579
Data columns (total 20 columns):
Name                131578 non-null object
Danceability        131580 non-null float64
Energy              131580 non-null float64
Key                 131580 non-null float64
Loudness            131580 non-null float64
Mode                131580 non-null float64
Speechness          131580 non-null float64
Acousticness        131580 non-null float64
Instrumentalness    131580 non-null float64
Liveness            131580 non-null float64
Valence             131580 non-null float64
Tempo               131580 non-null object
Type                131580 non-null object
ID                  131580 non-null object
Uri                 131580 non-null object
Ref_Track           131580 non-null object
URL_features        131580 non-null object
Duration_ms         131580 non-null float64
time_signature      131580 non-null object
Genre               131554 non-null object
dtypes: float64(11), object(9)
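
The `info()` output points to two problems: `Name` and `Genre` have fewer non-null entries than there are rows, and `Tempo` and `time_signature` are stored as `object` rather than `float64`. A minimal sketch of how to surface both directly, using a small hypothetical frame in place of `beats`:

```python
import pandas as pd

# Hypothetical miniature of the dataset: one missing Genre, Tempo stored as strings
sample = pd.DataFrame({
    "Energy": [0.857, 0.916, 0.894],
    "Tempo": ["131.926", "135.996", "114.223"],
    "Genre": ["celticmetal", None, "celticmetal"],
})

# Count missing values per column
missing_counts = sample.isna().sum()

# List the columns that pandas did not parse as numbers
non_numeric_columns = list(sample.select_dtypes(exclude="number").columns)

print(missing_counts)
print(non_numeric_columns)
```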

In [6]:
beats.head(1)

Unnamed: 0,Name,Danceability,Energy,Key,Loudness,Mode,Speechness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Type,ID,Uri,Ref_Track,URL_features,Duration_ms,time_signature,Genre
0,YuveYuveYu,0.624,0.857,10.0,-6.25,0.0,0.0542,0.0208,0.206,0.11,0.324,131.926,audio_features,6J2VvzKwWc2f0JP5RQVZjq,spotify:track:6J2VvzKwWc2f0JP5RQVZjq,https://api.spotify.com/v1/tracks/6J2VvzKwWc2f...,https://api.spotify.com/v1/audio-analysis/6J2V...,282920.0,4.0,celticmetal


In [8]:
#list(beats['Genre'].unique())

### Necessary Adjustments to Data:

#### 1. Group Genres into 5 Categories:
- RocknRoll
- Electronic
- HipHop
- Indie
- Pop

In [9]:
# Query the DataFrame with SQL via pandasql
import pandasql as ps

In [10]:
music_query = """
select
  case 
    when Genre like '%metal%' then 'RocknRoll'
    when Genre like '%punk%' then 'RocknRoll'
    when Genre like '%rock%' then 'RocknRoll'
    when Genre like '%alternative%' then 'RocknRoll'
    when Genre like '%beat%' then 'RocknRoll'
    
    when Genre like '%trance%' then 'Electronic'
    when Genre like '%electro%' then 'Electronic'
    when Genre like '%house%' then 'Electronic'
    when Genre like '%dub%' then 'Electronic'
    when Genre like '%techno%' then 'Electronic'
    when Genre like '%chill%' then 'Electronic'
    when Genre like '%deep%' then 'Electronic'
    when Genre like '%rave%' then 'Electronic'
    
    when Genre like '%trap%' then 'HipHop'
    when Genre like '%hiphop%' then 'HipHop'
    
    when Genre like '%folk%' then 'Indie'
    when Genre like '%indie%' then 'Indie'
    
    when Genre like '%pop%' then 'Pop'
    
  end as Genre
from beats"""

In [11]:
# Add the grouped genres to beats as a new 'Genres' column
beats['Genres'] = ps.sqldf(music_query)['Genre']
beats.tail()

Unnamed: 0,Name,Danceability,Energy,Key,Loudness,Mode,Speechness,Acousticness,Instrumentalness,Liveness,...,Tempo,Type,ID,Uri,Ref_Track,URL_features,Duration_ms,time_signature,Genre,Genres
131575,Youth,0.568,0.708,8.0,-9.96,1.0,0.0601,0.00793,0.000528,0.266,...,127.741,audio_features,5AozgGtATNJi2Yx5Vb2InS,spotify:track:5AozgGtATNJi2Yx5Vb2InS,https://api.spotify.com/v1/tracks/5AozgGtATNJi...,https://api.spotify.com/v1/audio-analysis/5Aoz...,259560.0,4.0,britishindierock,RocknRoll
131576,IFoundOut,0.47,0.909,4.0,-1.674,1.0,0.0546,0.0611,0.0,0.294,...,146.986,audio_features,34XDIqYypZc3jyGRDgd5p4,spotify:track:34XDIqYypZc3jyGRDgd5p4,https://api.spotify.com/v1/tracks/34XDIqYypZc3...,https://api.spotify.com/v1/audio-analysis/34XD...,127400.0,4.0,britishindierock,RocknRoll
131577,Animal,0.272,0.918,11.0,-2.589,0.0,0.0625,0.000749,0.0092,0.307,...,139.574,audio_features,5MpD4w1JTHkesmjn9I8Qo5,spotify:track:5MpD4w1JTHkesmjn9I8Qo5,https://api.spotify.com/v1/tracks/5MpD4w1JTHke...,https://api.spotify.com/v1/audio-analysis/5MpD...,159627.0,4.0,britishindierock,RocknRoll
131578,PostBreak-UpSex,0.402,0.902,5.0,-4.115,1.0,0.0469,7.3e-05,0.00465,0.261,...,136.883,audio_features,77GZxme1GMNbbooEH8nHNX,spotify:track:77GZxme1GMNbbooEH8nHNX,https://api.spotify.com/v1/tracks/77GZxme1GMNb...,https://api.spotify.com/v1/audio-analysis/77GZ...,174453.0,4.0,britishindierock,RocknRoll
131579,WantYouSoBad,0.482,0.839,3.0,-4.171,0.0,0.1,0.0134,0.244,0.104,...,176.068,audio_features,6XoNBGHoWlqMtSTob5heto,spotify:track:6XoNBGHoWlqMtSTob5heto,https://api.spotify.com/v1/tracks/6XoNBGHoWlqM...,https://api.spotify.com/v1/audio-analysis/6XoN...,258720.0,4.0,britishindierock,RocknRoll
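
The `case` expression returns the label of the first matching `when` branch, so branch order matters: `britishindierock` contains both `indie` and `rock`, but the `rock` branch comes first and the track is labeled `RocknRoll`. For comparison, here is a pandas-only sketch of the same first-match rule. The helper `group_genre` and the shortened keyword list are hypothetical, and this is not the notebook's method.

```python
import pandas as pd

# Keyword -> category pairs, checked in order (shortened from the SQL query)
GENRE_RULES = [
    ("metal", "RocknRoll"),
    ("rock", "RocknRoll"),
    ("house", "Electronic"),
    ("trap", "HipHop"),
    ("indie", "Indie"),
    ("pop", "Pop"),
]

def group_genre(genre):
    """Return the category of the first keyword found in genre, else None."""
    if not isinstance(genre, str):
        return None  # a missing genre stays missing
    lowered = genre.lower()  # SQLite's LIKE ignores case for ASCII letters
    for keyword, category in GENRE_RULES:
        if keyword in lowered:
            return category
    return None  # no keyword matched, like a CASE with no ELSE

genres = pd.Series(["celticmetal", "britishindierock", "indiepop", "jazz", None])
grouped = genres.map(group_genre)
print(grouped.tolist())
```

Genres that match no keyword come back as missing values, which is why the later `dropna()` call also removes tracks outside the five chosen groups.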


#### 2. Drop all NaN values
    - Drop all rows with music not clearly allocated into the 5 chosen genres
#### 3. Convert numerical strings to numerical floats
    - Tempo
    - time_signature
#### 4. Drop unnecessary columns
    - Name
    - ID
    - Uri
    - Ref_Track
    - URL_features
    - Type
    - Genre

In [12]:
# Drop NaN values
beats = beats.dropna()

# Convert column values to numbers
beats['Tempo'] = pd.to_numeric(beats['Tempo'])
beats['time_signature'] = pd.to_numeric(beats['time_signature'])

# Drop unnecessary columns
beats = beats.drop(['Name','ID','Uri','Ref_Track','URL_features','Type', 'Genre'], axis=1)
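
By default, `pd.to_numeric` raises a `ValueError` when a string cannot be parsed. If that happens on other data, one option is `errors='coerce'`, which turns unparsable entries into `NaN` so they can be dropped along with the other missing rows. A minimal sketch on a hypothetical frame (the bad value `'n/a'` is invented for illustration):

```python
import pandas as pd

# Hypothetical rows: Tempo arrives as strings, one of them unparsable
sample = pd.DataFrame({
    "Tempo": ["131.926", "n/a", "114.223"],
    "Genres": ["RocknRoll", "Pop", None],
})

# Coerce unparsable strings to NaN instead of raising
sample["Tempo"] = pd.to_numeric(sample["Tempo"], errors="coerce")

# Drop every row that has a missing value in any column
cleaned = sample.dropna()
print(cleaned)
```

Here the conversion runs before `dropna()` so that coerced values are removed in the same pass; the cell above converts after dropping, which works when every remaining string is a valid number.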

In [13]:
# Test output
beats.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 79233 entries, 0 to 131579
Data columns (total 14 columns):
Danceability        79233 non-null float64
Energy              79233 non-null float64
Key                 79233 non-null float64
Loudness            79233 non-null float64
Mode                79233 non-null float64
Speechness          79233 non-null float64
Acousticness        79233 non-null float64
Instrumentalness    79233 non-null float64
Liveness            79233 non-null float64
Valence             79233 non-null float64
Tempo               79233 non-null float64
Duration_ms         79233 non-null float64
time_signature      79233 non-null float64
Genres              79233 non-null object
dtypes: float64(13), object(1)
memory usage: 9.1+ MB


In [14]:
# Store as global variable
%store beats

Stored 'beats' (DataFrame)


## Section 2: Preparation of Data for Modeling

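### Encode Categorical Data into Numerical Data
- In one-hot encoding, every observation of a categorical feature is represented by a vector in which all elements are 0 except for a single 1 at the position of that observation's category level.
- The cell below takes a simpler route: it maps each string in `Genres` to an integer category code.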
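
One-hot vectors and integer category codes are two different encodings of a categorical column. The sketch below applies both to the same hypothetical genre labels, using `cat.codes` for the integer form and `pd.get_dummies` for the one-hot form. The sample values are invented.

```python
import pandas as pd

# Hypothetical genre labels
genres = pd.Series(["RocknRoll", "Pop", "Indie", "Pop"], name="Genres")

# Integer codes: categories are sorted, so Indie=0, Pop=1, RocknRoll=2
codes = genres.astype("category").cat.codes

# One-hot: one column per category, with a single 1 in each row
one_hot = pd.get_dummies(genres).astype(int)

print(codes.tolist())
print(one_hot)
```

Integer codes impose an arbitrary order on the categories. That is harmless for a target column such as `Genres`, but it can mislead a model when applied to input features.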

In [15]:
# Save the original genre strings (in row order) before they are encoded
genre_strings = beats['Genres']

In [16]:
# Encode string columns as integer category codes; fill numeric gaps with the median
for column in beats.columns:
    if beats[column].dtype == object:
        # mode() returns a Series, so take its first entry as the fill value
        beats[column] = beats[column].fillna(beats[column].mode().iloc[0])
        beats[column] = beats[column].astype("category").cat.codes
    else:
        beats[column] = beats[column].fillna(beats[column].median())
beats.head()

Unnamed: 0,Danceability,Energy,Key,Loudness,Mode,Speechness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Duration_ms,time_signature,Genres
0,0.624,0.857,10.0,-6.25,0.0,0.0542,0.0208,0.206,0.11,0.324,131.926,282920.0,4.0,4
1,0.517,0.916,0.0,-4.933,1.0,0.0559,0.000182,0.00191,0.306,0.444,135.996,300320.0,4.0,4
2,0.251,0.894,8.0,-4.103,0.0,0.057,0.0144,0.0,0.123,0.297,114.223,175353.0,4.0,4
3,0.469,0.743,1.0,-5.57,0.0,0.0272,0.00222,0.000111,0.276,0.481,86.953,272292.0,4.0,4
4,0.487,0.952,1.0,-4.429,0.0,0.0613,0.000228,0.0,0.161,0.329,125.993,237933.0,4.0,4


In [17]:
beats.tail()

Unnamed: 0,Danceability,Energy,Key,Loudness,Mode,Speechness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Duration_ms,time_signature,Genres
131575,0.568,0.708,8.0,-9.96,1.0,0.0601,0.00793,0.000528,0.266,0.214,127.741,259560.0,4.0,4
131576,0.47,0.909,4.0,-1.674,1.0,0.0546,0.0611,0.0,0.294,0.607,146.986,127400.0,4.0,4
131577,0.272,0.918,11.0,-2.589,0.0,0.0625,0.000749,0.0092,0.307,0.53,139.574,159627.0,4.0,4
131578,0.402,0.902,5.0,-4.115,1.0,0.0469,7.3e-05,0.00465,0.261,0.569,136.883,174453.0,4.0,4
131579,0.482,0.839,3.0,-4.171,0.0,0.1,0.0134,0.244,0.104,0.571,176.068,258720.0,4.0,4


In [18]:
# Create copy of beats (beats2) and store as global variable
beats2 = beats.copy()
%store beats2
%store genre_strings

Stored 'beats2' (DataFrame)
Stored 'genre_strings' (Series)
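
Because `genre_strings` was saved before encoding, the integer codes in `Genres` can be mapped back to names. `astype("category")` sorts the category labels, so with all five groups present, code `4` corresponds to `RocknRoll`. A sketch with a hypothetical stand-in for `genre_strings`:

```python
import pandas as pd

# Hypothetical stand-in for genre_strings (the saved pre-encoding labels)
genre_strings = pd.Series(
    ["RocknRoll", "Electronic", "Pop", "HipHop", "Indie", "RocknRoll"]
)

# Re-derive the same categorical encoding that was applied to the column
as_category = genre_strings.astype("category")
codes = as_category.cat.codes

# Build a code -> label lookup from the sorted categories
code_to_genre = dict(enumerate(as_category.cat.categories))

print(code_to_genre)
print(codes.map(code_to_genre).tolist())
```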


### Watermark Extension
- Documentation of when program was run and with which packages

In [29]:
# Install the packages listed in requirements.txt
!pip install -r requirements.txt

In [30]:
# install watermark extension
!pip install --upgrade pip
!pip install watermark

Requirement already up-to-date: pip in /anaconda3/lib/python3.7/site-packages (19.1.1)


In [31]:
# Load the watermark extension
%load_ext watermark

In [33]:
%watermark -a "Emily Schoof" -d -t -v -p numpy,pandas,seaborn,matplotlib,sklearn

Emily Schoof 2019-07-10 14:57:23 

CPython 3.7.3
IPython 7.4.0

numpy 1.16.2
pandas 0.24.2
seaborn 0.9.0
matplotlib 3.0.3
sklearn 0.20.3
