# Data Science Life Cycle and Process

Step 1: Data Engineering 

- Get Data.ipynb
>
> This notebook is the start of the data ingestion pipeline: 
> it gets the raw data from the two sources, GTZAN and Spotify.

Step 2: Exploratory Data Analysis 

- EDA.ipynb
>
> This notebook is the EDA for audio feature extraction using Librosa. 
> We explore Mel-Frequency Cepstral Coefficients, the Constant-Q Transform, and the Chromagram with pitch classes.

Step 3: Modeling, Validation, and Testing

CNN - Convolutional Neural Network    
KNN - K-Nearest Neighbors 
- Modeling.ipynb
>
> This notebook implements KNN and CNN with Keras/TensorFlow and can be run on either the GTZAN (CNN only) or Spotify datasets.



# This notebook covers how to interact with the Spotify API

Info: https://developer.spotify.com/dashboard/

In [None]:
# Settings used downstream by the rest of the notebook

# Parameters to decide what to pull from Spotify based on the Genre DataSet
NumberOfGenres = 10 # Max 5683
NumberOfSongs = 200 # Songs to collect per genre
Randomize = False # When True, genres and songs are shuffled so the selection acts as a random sample
RandomSeed = 6 # Provide repeatability to the experiment

# Google Drive path: create this folder in your Drive to use as the root
RootPath = '/content/drive/MyDrive/W207'

# Spotify Rest API Credentials
ClientID = 'YOUR-CLIENT-ID' 
SecretID = 'YOUR-SECRET-ID' 

# Genre source data file path
GenreFilePath = RootPath + '/Data/Genre.txt'

# Path where the pulled Spotify data sets are persisted
SpotifyDataPath = RootPath + "/Data/Spotify/" + str(NumberOfGenres) + "-" + str(NumberOfSongs) + "-" + str(Randomize) + "-" + str(RandomSeed)

# GTZAN Data path
GTZANDataPath = RootPath + "/Data/GTZAN/"
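
As a quick check on the path convention above, the sketch below rebuilds the Spotify data path from the same four run parameters. The helper name `spotify_data_path` is hypothetical and not part of the notebook; it only illustrates how the parameters end up in the folder name, so that each parameter combination gets its own folder.

```python
def spotify_data_path(root, n_genres, n_songs, randomize, seed):
    """Build the per-experiment folder path from the run parameters."""
    # Join the four parameters with dashes, e.g. 10-200-False-6
    run_id = "-".join(str(v) for v in (n_genres, n_songs, randomize, seed))
    return root + "/Data/Spotify/" + run_id


path = spotify_data_path('/content/drive/MyDrive/W207', 10, 200, False, 6)
print(path)  # /content/drive/MyDrive/W207/Data/Spotify/10-200-False-6
```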

# Imports

In [None]:
# Imports
from IPython.display import clear_output
from google.colab import drive
from bs4 import BeautifulSoup
import pandas as pd
import requests
import tarfile
import random
import json
import time
import os

# Methods

## Define helper methods

In [None]:
# Method for displaying progress for long running processes
def update_progress(current, total, msg = '', bar_length = 20):
  '''
    Controls the output of a Jupyter cell for progress and message display without taking up too much notebook space.

          Parameters:
                  current (int): An integer
                  Defines the current iteration of the progress
                      
                  total (int): An integer
                  Defines the number of elements for total progress

                  msg (str): A string
                  The message to be displayed

                  bar_length (int): An integer 
                  Determines the width of the progress bar by number of ticks to display

          Returns:
                  None. Prints controlled progress output in a Jupyter Notebook; best used for long-running processes 
                  that don't require persistence of the output for the record. 
  '''
  
  # Defines the style to display on the progress blocks (ANSI bold black text on a yellow background)
  style = lambda x: "\x1b[1;30;43m" + str(x) + "\x1b[0m"

  # Determines the progress ratio
  progress = current / total

  # Clamp the ratio to the range [0, 1]
  if isinstance(progress, int): progress = float(progress)
  if not isinstance(progress, float) or progress < 0: progress = 0
  if progress >= 1: progress = 1

  block = int(round(bar_length * progress))

  # Using imports (from IPython.display import clear_output)
  clear_output(wait = True)

  text = "Progress: [{0}] {1:.1f}% ({2} of {3})".format( style("░" * block) + "-" * (bar_length - block), progress * 100, current, total)
  
  # Display the text to output
  print(text)
  # Display further message context
  if msg != '': 
    print(msg)
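
The bar arithmetic inside `update_progress` can be looked at in isolation. The sketch below is a minimal, hypothetical `render_bar` helper (not part of the notebook) that returns the bar as a plain string, using `#` instead of the styled block character and guarding against a zero `total`.

```python
def render_bar(current, total, bar_length=20):
    """Return a plain-text progress bar such as '[#####---------------] 25.0%'."""
    # Clamp the ratio to [0, 1] so out-of-range counters cannot break the bar.
    ratio = 0.0 if total <= 0 else min(max(current / total, 0.0), 1.0)
    filled = int(round(bar_length * ratio))
    return "[{}{}] {:.1f}%".format("#" * filled, "-" * (bar_length - filled), ratio * 100)


print(render_bar(250, 1000))  # [#####---------------] 25.0%
print(render_bar(750, 1000))  # [###############-----] 75.0%
```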

In [None]:
def laptime(stime, total, current):
  running = (time.time() - stime)
  hrs, r = divmod(running, 3600)
  mins, secs = divmod(r, 60)

  # Estimate the remaining time from the average time per completed item (never negative)
  remaining = max((running/(current+1)) * (total - (current+1)), 0)
  hours, rem = divmod(remaining, 3600)
  minutes, seconds = divmod(rem, 60)

  running = {"hours":hrs,"remainder":r,"minutes":mins,"seconds":secs}
  remaining = {"hours":hours,"remainder":rem,"minutes":minutes,"seconds":seconds}
    
  return {"Running":running, "Remaining":remaining}
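
The time arithmetic in `laptime` comes down to two steps: a linear estimate of the time left, and two `divmod` calls to split seconds into hours, minutes and seconds. The sketch below separates the two, with hypothetical helper names (`split_hms`, `estimate_remaining`) that are not part of the notebook.

```python
def split_hms(total_seconds):
    """Split a duration in seconds into (hours, minutes, seconds)."""
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return int(hours), int(minutes), seconds


def estimate_remaining(elapsed, done, total):
    """Linear estimate: average time per finished item times the items left."""
    if done <= 0:
        return 0.0
    # Clamp at zero so an overshooting counter never yields a negative estimate.
    return max((elapsed / done) * (total - done), 0.0)


h, m, s = split_hms(3905.5)
print("{:0>2}:{:0>2}:{:05.2f}".format(h, m, s))  # 01:05:05.50
print(estimate_remaining(100.0, 50, 200))         # 300.0
```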

## Define the method to get Tracks directly from Spotify REST API

In [None]:
def GetAPIToken(ClientID, SecretID):
  AUTH_URL = 'https://accounts.spotify.com/api/token'

  # POST
  auth_response = requests.post(AUTH_URL, {
      'grant_type': 'client_credentials',
      'client_id': ClientID,
      'client_secret': SecretID,
  })

  # convert the response to JSON
  auth_response_data = auth_response.json()

  # save the access token
  access_token = auth_response_data['access_token']

  return access_token
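
`GetAPIToken` indexes `auth_response_data['access_token']` directly, so a rejected request would surface as a bare `KeyError`. The sketch below shows one way to report the failure more clearly. The helper name `extract_access_token` and both example payloads are assumptions for illustration; the exact fields of an error response should be checked against the Spotify documentation.

```python
def extract_access_token(payload):
    """Return the bearer token from a token-endpoint JSON payload.

    Raises ValueError with the reported error when no token is present.
    """
    token = payload.get("access_token")
    if not token:
        raise ValueError("Token request failed: {}".format(payload.get("error", "unknown error")))
    return token


# Hypothetical payloads shaped like a success and a failure response.
ok = {"access_token": "abc123", "token_type": "Bearer", "expires_in": 3600}
bad = {"error": "invalid_client"}

print(extract_access_token(ok))  # abc123
```

Inside `GetAPIToken`, such a helper would be called on `auth_response.json()` in place of the direct dictionary lookup.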

In [None]:
def APIrequest(get, access_token):
  '''
  Sends a GET request to the Spotify REST API and returns the decoded JSON response.

          Parameters:
                  get (string): A string that is appended to the base url path for the endpoint being used    
                  e.g. https://api.spotify.com/v1/{artists/{id}/albums}

                  access_token (string): The bearer token returned by GetAPIToken

          Returns:
                  A dictionary parsed from the JSON response (list endpoints return their results under 'items')
                  "See Spotify API guidelines"
  '''
  headers = {'Authorization': 'Bearer {token}'.format(token=access_token)}

  # base URL of all Spotify API endpoints
  BASE_URL = 'https://api.spotify.com/v1/'

  # Rest API throttle and error trapping logic
  error = True
  z = 1
  while error:
    try:
      request = requests.get(BASE_URL + get, headers=headers).json()
      error = not error
    except Exception: # Network or JSON decoding failure; retry after a growing delay
      print('Throttling API request...')
      time.sleep(z) # Sleep for z seconds to throttle for REST API limits
      z += 1

  time.sleep(.025) # Short fixed delay after each call

  return request
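
One caveat with the retry loop in `APIrequest`: `requests` does not raise an exception for an HTTP 429 (rate limited) response, so the `try`/`except` only fires on network or JSON decoding errors. The sketch below is a hedged alternative that checks the status code and honors the `Retry-After` header when the server sends one. The helper name `get_with_retry` and its parameters are hypothetical; `fetch` stands in for a call such as `lambda: requests.get(BASE_URL + get, headers=headers)`.

```python
import time


def get_with_retry(fetch, max_attempts=5, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or attempts run out.

    fetch is any zero-argument callable returning an object with
    .status_code and .headers (for example a requests.Response).
    """
    for attempt in range(1, max_attempts + 1):
        response = fetch()
        if response.status_code != 429:
            return response
        # Honor the server's Retry-After hint when present, else back off linearly.
        wait = float(response.headers.get("Retry-After", attempt))
        sleep(wait)
    raise RuntimeError("Gave up after {} rate-limited attempts".format(max_attempts))
```

Bounding the number of attempts also avoids the endless loop that `while error:` can fall into when a request keeps failing.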

In [None]:
def GetTracks(GenreFilePath, AccessToken ,NumberOfGenres = 10, NumberOfSongs = 10, Rnd = True, Seed = 0):
  '''
  Gets tracks directly from the Spotify REST API.

          Parameters:
                  GenreFilePath (str): A string
                  path of the local genre file that is passed to Genres().

                  AccessToken (str): A string
                  bearer token returned by GetAPIToken().

                  NumberOfGenres (int): An integer (0, N)
                  defines the number of Genres to get playlist songs for.                

                  NumberOfSongs (int): An integer (0, N)
                  defines the number of songs to collect for each Genre.
                  
                  Rnd (boolean): A boolean or integer (1, 0) ~ (True, False)
                  defines if the function is to use random selection and shuffling.
                  
                  Seed (int): An integer (0, N)
                  defines the random seed to use for repeatability.              

          Returns:
                  A DataFrame of Genre playlist song ('tracks') information, including the preview urls to download when sampling the Music Genre Classification dataset from Spotify.
                  100 songs per Genre playlist as a max. "See Spotify API guidelines"
  '''
  g = Genres(GenreFilePath)
  genres = list(g.keys())
  if Rnd:
    random.seed(Seed)
    random.shuffle(genres)

  columns = ['Genre', 'PlaylistID', 'TrackID', 'ArtistID', 'PreviewUrl']
  df = pd.DataFrame(columns = columns)
  stime = time.time() # used to capture run time and estimate total time to execute

  for genre in genres[:NumberOfGenres]: 
    playlist_id = g[genre]
    
    # Rest API throttle and error trapping logic
    playlist_tracks = APIrequest('playlists/' + playlist_id + '/tracks', AccessToken)
    if 'items' in playlist_tracks.keys() and Rnd: random.shuffle(playlist_tracks['items'])  

    # Add existing info of tracks per genre playlist
    for tr in playlist_tracks['items']:  

      track = tr['track']
      idx = len(df) 

      if str(track['preview_url']) == 'None': continue # If we don't have a url move to the next

      df.loc[idx, 'Genre'] = genre
      df.loc[idx, 'PlaylistID'] = g[genre]   
      df.loc[idx, 'TrackID'] = str(track['id'])
      df.loc[idx, 'ArtistID'] = str(track['album']['artists'][0]['id']) if track['album']['artists'] else ''
      df.loc[idx, 'PreviewUrl'] = str(track['preview_url']) 
        
      lt = laptime(stime, NumberOfGenres * NumberOfSongs, idx  + 1)
      msg = 'Extracting... ' \
          + '\n       Genre: ' + genre \
          + '\n   TrackName: ' + str(track['name']) \
          + '\n' \
          + '\n        Time Running : ' + "{:0>2}:{:0>2}:{:05.2f}".format(int(lt['Running']['hours']),int(lt['Running']['minutes']),lt['Running']['seconds']) \
          + '\n Time Remaining Est. : ' + "{:0>2}:{:0>2}:{:05.2f}".format(int(lt['Remaining']['hours']),int(lt['Remaining']['minutes']),lt['Remaining']['seconds']) 

      update_progress(idx  + 1, NumberOfGenres * NumberOfSongs, msg)

      if len(df[df['Genre'] == genre]) >= NumberOfSongs - 1: 
        break # Move to the next genre once we have reached the number of songs
      else: # Extend songs beyond the standard genre playlist to meet the number of songs
      
        # Get the artists list for extending genre song depth
        artists = [artist[0]['id'] for artist in [track['track']['album']['artists'] for track in playlist_tracks['items']]] 

        for artist_id in artists:
          # eg. https://api.spotify.com/v1/artists/{id}/albums
          r1 = APIrequest('artists/' + artist_id + '/albums', AccessToken)

          # Get Tracks by Album
          for album in r1['items']:
            
            # Get Track id's by Album
            # eg. https://api.spotify.com/v1/albums/{id}/tracks
            r2 = APIrequest('albums/' + album['id'] + '/tracks', AccessToken) 

            for track in r2['items']:        
              # new df record
              idx = len(df)
              
              if str(track['preview_url']) == 'None': continue # If we don't have a url move to the next

              # Update df with the new TrackID and ArtistID, carrying over the genre
              df.loc[idx, 'Genre'] = genre
              df.loc[idx, 'PlaylistID'] = 'None'  
              df.loc[idx, 'TrackID'] = track['id']
              df.loc[idx, 'ArtistID'] = artist_id
              df.loc[idx, 'PreviewUrl'] = str(track['preview_url'])
              
              lt = laptime(stime, NumberOfGenres * NumberOfSongs, idx  + 1)
              msg = 'Extracting extended genre by artist... ' \
                  + '\n       Genre: ' + genre \
                  + '\n   TrackName: ' + str(track['name']) \
                  + '\n' \
                  + '\n        Time Running : ' + "{:0>2}:{:0>2}:{:05.2f}".format(int(lt['Running']['hours']),int(lt['Running']['minutes']),lt['Running']['seconds']) \
                  + '\n Time Remaining Est. : ' + "{:0>2}:{:0>2}:{:05.2f}".format(int(lt['Remaining']['hours']),int(lt['Remaining']['minutes']),lt['Remaining']['seconds']) 

              update_progress(idx  + 1, NumberOfGenres * NumberOfSongs, msg)

              if len(df[df['Genre'] == genre]) >= NumberOfSongs - 1: break # Move to the next genre once we have reached the number of songs
            if len(df[df['Genre'] == genre]) >= NumberOfSongs - 1: break # Move to the next genre once we have reached the number of songs
          if len(df[df['Genre'] == genre]) >= NumberOfSongs - 1: break # Move to the next genre once we have reached the number of songs

  return df
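
The per-track filtering in `GetTracks` (skip tracks without a preview URL, keep the first album artist) can be sketched on its own, without the API calls or the DataFrame. The helper name `playlist_rows` and the example payload are hypothetical. The payload only mimics the fields the notebook reads (`items`, `track`, `preview_url`, `album`, `artists`), and treating a missing `track` as skippable is an added assumption.

```python
def playlist_rows(genre, playlist_id, payload, limit):
    """Collect up to `limit` track records that have a preview URL."""
    rows = []
    for item in payload.get("items", []):
        track = item.get("track")
        # Skip entries without track data and tracks without a preview clip.
        if not track or not track.get("preview_url"):
            continue
        artists = track.get("album", {}).get("artists") or []
        rows.append({
            "Genre": genre,
            "PlaylistID": playlist_id,
            "TrackID": track["id"],
            "ArtistID": artists[0]["id"] if artists else "",
            "PreviewUrl": track["preview_url"],
        })
        if len(rows) >= limit:
            break
    return rows


# Hypothetical payload: one usable track, one without a preview, one empty entry.
payload = {"items": [
    {"track": {"id": "t1", "preview_url": "https://example.com/a.mp3",
               "album": {"artists": [{"id": "a1"}]}}},
    {"track": {"id": "t2", "preview_url": None, "album": {"artists": []}}},
    {"track": None},
]}
rows = playlist_rows("pop", "p1", payload, limit=10)
print(len(rows), rows[0]["TrackID"])  # 1 t1
```

A list of dictionaries like this can be turned into a table in one step with `pd.DataFrame(rows, columns=columns)`, which avoids growing the DataFrame one `df.loc` assignment at a time.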

In [None]:
def GetFeatures(DatasetPath, AccessToken, NumberOfGenres = 10, NumberOfSongs = 10, Seed = 0):
  '''
  Gets track audio features directly from the Spotify REST API for every downloaded track.

          Parameters:
                  DatasetPath (str): A string
                  root folder of the downloaded dataset, laid out as <Genre>/<TrackID>.mp3.

                  AccessToken (str): A string
                  bearer token returned by GetAPIToken().

                  NumberOfGenres, NumberOfSongs, Seed (int): Integers
                  accepted for a consistent call signature; not used by this function.

          Returns:
                  A DataFrame with the Genre, TrackID, Danceability, Energy and Valence of each track.
  '''
  columns = ['Genre', 'TrackID', "Danceability", "Energy", "Valence"]
  df = pd.DataFrame(columns = columns)
  stime = time.time() # used to capture run time and estimate total time to execute

  file_count = sum(len(files) for _, _, files in os.walk(DatasetPath))

  for (root, dirs, files) in os.walk(DatasetPath):
    if len(files) == 0 : continue # skip top root path that list folders only
        
    Genre = root.split('/')[-1]
    for file in files:
      track_id = file.split(".")[0]    
    
      # Rest API throttle and error trapping logic
      track_features = APIrequest('audio-features/' + track_id, AccessToken)
      if 'error' in track_features.keys(): continue

      # Add existing info of tracks per genre playlist
      idx = len(df) 

      df.loc[idx, 'Genre'] = Genre
      df.loc[idx, 'TrackID'] = str(track_features['id'])
      df.loc[idx, 'Danceability'] = str(track_features['danceability'])
      df.loc[idx, 'Energy'] = str(track_features['energy'])
      df.loc[idx, 'Valence'] = str(track_features['valence'])
        
      lt = laptime(stime, file_count, idx  + 1)
      msg = 'Extracting... ' \
          + '\n       Genre: ' + Genre \
          + '\n   TrackName: ' + str(track_features['id']) \
          + '\n' \
          + '\n        Time Running : ' + "{:0>2}:{:0>2}:{:05.2f}".format(int(lt['Running']['hours']),int(lt['Running']['minutes']),lt['Running']['seconds']) \
          + '\n Time Remaining Est. : ' + "{:0>2}:{:0>2}:{:05.2f}".format(int(lt['Remaining']['hours']),int(lt['Remaining']['minutes']),lt['Remaining']['seconds']) 

      update_progress(idx  + 1, file_count, msg)

  return df
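
`GetFeatures` recovers the genre and track id from the folder layout with `root.split('/')[-1]` and `file.split(".")[0]`. As a minimal sketch of the same idea using `os.path`, the hypothetical helper below (not part of the notebook) takes a full file path and returns both parts. Note that `os.path.splitext` splits on the last dot rather than the first, which gives the same result for file names with a single dot, such as `<TrackID>.mp3`.

```python
import os


def genre_and_track_id(file_path):
    """Split '<...>/<genre>/<track_id>.mp3' into (genre, track_id)."""
    folder, file_name = os.path.split(file_path)
    genre = os.path.basename(folder)
    track_id, _ext = os.path.splitext(file_name)
    return genre, track_id


print(genre_and_track_id("/data/Spotify/modern rock/77KnJc8o5G1eKVwX5ywMeZ.mp3"))
# ('modern rock', '77KnJc8o5G1eKVwX5ywMeZ')
```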

## Define Get Genres method

Persist to file

You will need to attach your Google Drive and set the 

Source: https://everynoise.com/everynoise1d.cgi?root=dark%20minimal%20techno&scope=all

In [None]:
def Genres(FilePath, local = True, write = False):
  '''
  Returns a dictionary of Genres from everynoise.com.

          Parameters:
                  FilePath (str): A string
                  path of the local JSON file used to cache the dictionary.

                  local (boolean): A boolean or integer (1, 0) ~ (True, False)
                  defines if the function is to use the local file for loading the dictionary.
                  If the local file doesn't exist, the function goes to the web url and downloads it.                 

                  write (boolean): A boolean
                  controls whether the downloaded dictionary is written to FilePath.

          Returns:
                  Genres (dictionary): labeled list of music genres with Spotify playlist ids that correspond to the genre.
                  100 songs per playlist
  '''
  dictionary = {}

  if local and os.path.exists(FilePath):
    with open(FilePath, 'r') as f:
      data = f.read()
      dictionary = json.loads(data)
  else:
    url = 'https://everynoise.com/everynoise1d.cgi?scope=all'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    for row in soup.find('table').find_all('tr'):
        for i, data in enumerate(row.find_all('td')):        
            if i == 1:
                uri = data.find('a', href=True)['href'].split(':')[-1]            
            if i == 2:
                genre = data.find('a').getText()
        dictionary[genre] = uri
        #dictionary = {row.find_all('td')[2].find('a').getText(): row.find_all('td')[1].find('a', href=True)['href'] for i, data in enumerate(row.find_all('td')) for row in soup.find('table').find_all('tr') if i > 1 }
    
    if write or not os.path.exists(FilePath): # Write when requested or when no cached file exists yet
      with open(FilePath, 'w') as file: 
          file.write(json.dumps(dictionary))

  return dictionary
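
The caching pattern in `Genres` (load the local file if it exists, otherwise build the dictionary and write it out) can be sketched independently of the web scraping. The helper name `load_or_build` and the stand-in builder are hypothetical; in the notebook the builder would be the everynoise.com scrape.

```python
import json
import os
import tempfile


def load_or_build(path, build):
    """Return the cached dict at `path`, or call build() and cache its result."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    data = build()
    with open(path, "w") as f:
        json.dump(data, f)
    return data


# Demo with a throwaway file and a stand-in builder instead of a web request.
calls = []

def fake_build():
    calls.append(1)
    return {"pop": "playlist-id-1"}

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "Genre.txt")
    first = load_or_build(cache, fake_build)   # builds and writes the cache
    second = load_or_build(cache, fake_build)  # reads the cache, no rebuild

print(first == second, len(calls))  # True 1
```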

In [None]:
len(Genres(GenreFilePath))

5683

## Define Get Spotify Raw Data download method

In [None]:
# Download Spotify Preview Samples for Genre's and Songs
def downloadRawData(df):
  '''
  Downloads the raw preview audio files from Spotify.

          Parameters:
                  df (pandas dataframe): A dataframe 
                  holds the list of preview urls from Spotify to download      

          Returns:
                  None
  '''
  stime = time.time() 
  for i, row in df.iterrows():
    url = row['PreviewUrl']
    filename = SpotifyDataPath + "/" + row['Genre'] + "/" + row['TrackID'] + ".mp3"

    # Create the genre folder if needed; exist_ok avoids an error when it already exists
    os.makedirs(os.path.dirname(filename), exist_ok=True)

    with open(filename, 'wb') as file:
      file.write(requests.get(url).content)     

    lt = laptime(stime, len(df), i + 1)
    msg = r'Presisting to... ' \
        + '\n    Path: ' + SpotifyDataPath \
        + '\n   Genre: ' + row['Genre'] \
        + '\n TrackID: ' + row['TrackID'] \
        + '\n     Url: ' + url \
        + '\n' \
        + '\n        Time Running : ' + "{:0>2}:{:0>2}:{:05.2f}".format(int(lt['Running']['hours']),int(lt['Running']['minutes']),lt['Running']['seconds']) \
        + '\n Time Remaining Est. : ' + "{:0>2}:{:0>2}:{:05.2f}".format(int(lt['Remaining']['hours']),int(lt['Remaining']['minutes']),lt['Remaining']['seconds']) 

        
    update_progress(i + 1, len(df), msg)

## Test methods

In [None]:
update_progress(250,1000, 'Testing message')

Progress: [░░░░░---------------] 25.0% (250 of 1000)
Testing message


In [None]:
update_progress(750,1000, '')

Progress: [░░░░░░░░░░░░░░░-----] 75.0% (750 of 1000)


In [None]:
for i in range(0,100):
  time.sleep(.05)
  msg = '' \
      + 'Current iteration: ' + str(i) \
      + '\n Out of: ' + str(100)
  update_progress(i, 100, msg )

Progress: [[1;30;43m░░░░░░░░░░░░░░░░░░░░[0m] 99.0% (99 of 100)
Current iteration: 99
 Out of: 100


# Preprocessing of data 

### Using Google Drive 

Google Drive is mounted to support larger data workloads with Colab.

In [None]:
drive.mount('/content/drive/')

#drive.flush_and_unmount()

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


# Get Spotify Data 

Download Spotify music previews as sample dataset(s) with a play length of 30 seconds

Info: https://developer.spotify.com/documentation/web-api/

In [None]:
# Generate the AccessToken for OAuth Rest API connections for this notebook
AccessToken = GetAPIToken(ClientID, SecretID)

In [None]:
df = GetTracks(GenreFilePath
             , AccessToken 
             , NumberOfGenres  
             , NumberOfSongs                
             , Randomize
             , RandomSeed)

Progress: [░░░░░░░░░░░░░░░░░░░░] 100.0% (2000 of 2000)
Extracting... 
       Genre: modern rock
   TrackName: Mind Over Matter (Reprise)

        Time Running : 00:01:05.04
 Time Remaining Est. : -1:59:-0.03


In [None]:
# Check we have unique values and they equal NumberOfGenres
a = df[('Genre')].unique()
b = len(a)
print(a)
print(b)
assert b == NumberOfGenres

['pop' 'dance pop' 'rap' 'rock' 'latin' 'pop rap' 'hip hop' 'trap latino'
 'trap' 'modern rock']
10


In [None]:
# Check that every genre has exactly NumberOfSongs songs
display(df.groupby('Genre').size())

for s in df.groupby('Genre').size():
  assert s == NumberOfSongs

Genre
dance pop      200
hip hop        200
latin          200
modern rock    200
pop            200
pop rap        200
rap            200
rock           200
trap           200
trap latino    200
dtype: int64
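
The two assertion cells above stop at the first mismatch. As a minimal sketch of a check that reports every problem at once, the hypothetical helper below (not part of the notebook) counts genre labels with `collections.Counter`; it accepts any iterable of labels, so it could be given `df['Genre']`.

```python
from collections import Counter


def check_genre_counts(genres, expected_genres, expected_songs):
    """Return a list of readable problems; an empty list means the sample is balanced."""
    counts = Counter(genres)
    problems = []
    if len(counts) != expected_genres:
        problems.append("expected {} genres, found {}".format(expected_genres, len(counts)))
    for genre, n in sorted(counts.items()):
        if n != expected_songs:
            problems.append("{}: {} songs, expected {}".format(genre, n, expected_songs))
    return problems


print(check_genre_counts(["pop", "pop", "rock", "rock"], 2, 2))  # []
print(check_genre_counts(["pop", "pop", "rock"], 2, 2))          # ['rock: 1 songs, expected 2']
```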

In [None]:
# Download Spotify Preview Samples for Genre's and Songs
downloadRawData(df)

Progress: [░░░░░░░░░░░░░░░░░░░░] 100.0% (2000 of 2000)
Presisting to... 
    Path: /content/drive/MyDrive/W207/Data/Spotify/10-200-False-6
   Genre: modern rock
 TrackID: 77KnJc8o5G1eKVwX5ywMeZ
     Url: https://p.scdn.co/mp3-preview/b0f1a450736ed261fab3fd1f7d7aabd06c3b3f25?cid=fa54a9c868174f4e83b2616cc27443db

        Time Running : 00:03:55.06
 Time Remaining Est. : -1:59:-0.12


In [None]:
# Getting Features for KNN from Spotify
df2 = GetFeatures(SpotifyDataPath
                , AccessToken
                , NumberOfGenres
                , NumberOfSongs
                , RandomSeed)

Progress: [░░░░░░░░░░░░░░░░░░░░] 100.0% (1992 of 1992)
Extracting... 
       Genre: modern rock
   TrackName: 77KnJc8o5G1eKVwX5ywMeZ

        Time Running : 00:03:25.92
 Time Remaining Est. : -1:59:-0.10


In [None]:
df2.to_csv(SpotifyDataPath + '/AudioFeatures.csv')

In [None]:
data_spotify = pd.read_csv(SpotifyDataPath + '/AudioFeatures.csv')
data_spotify

Unnamed: 0.1,Unnamed: 0,Genre,TrackID,Danceability,Energy,Valence
0,0,pop,1058fW9H3fZA6QjYCdOBad,0.666,0.796,0.610
1,1,pop,65Xycqq3KSaLmEIbWszegR,0.702,0.664,0.570
2,2,pop,3G0T9gfKq0Wo6ofdWAOEa5,0.622,0.618,0.477
3,3,pop,2nZq5WQOW4FEPxCVTdNGfB,0.528,0.343,0.342
4,4,pop,1WnKdiEPEvd1Meo3gUipPe,0.594,0.355,0.628
...,...,...,...,...,...,...
1987,1987,modern rock,1qMKtlvhNujDZ3G4vdFSBF,0.533,0.500,0.335
1988,1988,modern rock,3EYx0Sw78e5ByIx0HmsJTM,0.551,0.715,0.345
1989,1989,modern rock,4ozBdncjTFpDqzpREZ4LR4,0.486,0.900,0.533
1990,1990,modern rock,5CN4Al76NntcR1vj7tGO2V,0.369,0.879,0.326


# Get GTZAN Genre DataSet

Source: http://opihi.cs.uvic.ca/sound/genres.tar

Info: http://marsyas.info/downloads/datasets.html


In [None]:
# Avg execution time: expect 10 to 20 minutes for this one, as the source server and Colab are both slow to send and receive.
url = 'http://opihi.cs.uvic.ca/sound/genres.tar'
filename = RootPath + "/Data/GTZAN.tar"

with open(filename, 'wb') as file:
    file.write(requests.get(url).content) 
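
The cell above reads the whole archive into memory through `.content` before writing it. A hedged alternative is to write the file in chunks. The sketch below shows only the writing side with a hypothetical `write_chunks` helper, so it runs without a network connection; with `requests`, the chunks would typically come from `requests.get(url, stream=True)` followed by `response.iter_content(chunk_size=...)`.

```python
import os
import tempfile


def write_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`; return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Stand-in for a streamed response body: three chunks, one of them empty.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "demo.bin")
    total = write_chunks([b"abc", b"", b"defg"], target)
    size = os.path.getsize(target)

print(total, size)  # 7 7
```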

In [None]:
# Avg execution time: expect 1 to 2 minutes for this one.
# open file
file = tarfile.open(filename)

# extracting file
file.extractall(GTZANDataPath)  
file.close()
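
After extraction, it can be worth checking that each genre folder holds the expected number of files. The sketch below is a minimal, hypothetical `files_per_folder` helper (not part of the notebook) built on `os.walk`; the demo uses a throwaway directory tree rather than the real GTZAN data, and the folder names in it are illustrative only.

```python
import os
import tempfile


def files_per_folder(root):
    """Map each folder name under `root` to the number of files it holds."""
    counts = {}
    for folder, _dirs, files in os.walk(root):
        if files:  # skip folders that only contain sub-folders
            counts[os.path.basename(folder)] = len(files)
    return counts


# Demo tree: genres/blues with two files and genres/jazz with one file.
with tempfile.TemporaryDirectory() as tmp:
    for genre, names in {"blues": ["a.wav", "b.wav"], "jazz": ["c.wav"]}.items():
        os.makedirs(os.path.join(tmp, "genres", genre))
        for name in names:
            open(os.path.join(tmp, "genres", genre, name), "wb").close()
    counts = files_per_folder(tmp)

print(sorted(counts.items()))  # [('blues', 2), ('jazz', 1)]
```

In the notebook, the same helper could be pointed at `GTZANDataPath` or `SpotifyDataPath`.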