In [1]:
# Magic Commands
# %load_ext lab_black
%load_ext dotenv
%dotenv ../brainstation_capstone_cfg.env

In [2]:
# Package Imports
import os
import sys
# import pymysql
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# from scipy import stats
# import spotipy
# from spotipy.oauth2 import SpotifyClientCredentials
import requests
from spotify_dl import spotify_dl
from pathlib import Path
import time
# import glob

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.options.display.float_format = "{:,.2f}".format

The `spotify_dl` package is used to download the tracks that make up the dataset. It relies on another package, `spotipy`, which requires the `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` environment variables before any download can begin. To obtain these, a Spotify Web API app was created, and its client ID and client secret were placed in the `brainstation_capstone_cfg.env` file. That file is loaded into the present environment by the `python-dotenv` package via the magic command `%dotenv ../brainstation_capstone_cfg.env` above. The file is not included in the repo for security reasons, so you will need your own credentials to run this code. To obtain them, go to https://developer.spotify.com/documentation/web-api and follow the instructions under the 'Getting Started' section.
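
Since a missing credential would otherwise only surface once a download is attempted, it can help to check for both variables up front. The following is a minimal sketch; `missing_credentials` is a hypothetical helper name and is not part of `spotipy` or `spotify_dl`.

```python
import os

REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with a stand-in mapping instead of the real environment.
example_missing = missing_credentials({"SPOTIPY_CLIENT_ID": "abc"})
print(example_missing)
```

Calling `missing_credentials()` with no argument checks the real environment, so an empty list means both variables were loaded from the `.env` file.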

In [3]:
CLIENT_ID = os.environ["SPOTIPY_CLIENT_ID"]
CLIENT_SECRET = os.environ["SPOTIPY_CLIENT_SECRET"]

The Kaggle dataset is read in below. It contains 232,725 rows covering 176,774 unique track_ids. A random sample of 30,000 rows is drawn from it, which yields 28,622 unique track_ids. The track_ids are stripped of leading and trailing whitespace, deduplicated, and placed in a list for looping. More investigation into the Kaggle dataset can be found in the notebook `20230719_kaggle_data_spotify_tracks.ipynb`, which is also in this directory.

In [4]:
kaggle_df = pd.read_csv("../data/SpotifyFeatures.csv")
track_ids = (
    kaggle_df.sample(30000, random_state=123)
    .track_id.str.strip()
    .unique()
    .tolist()
)
len(track_ids)

28622
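
The unique count is lower than the number of sampled rows because the same track_id can appear in more than one row. The toy example below illustrates the strip-then-deduplicate step; the IDs are made up and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows: one track_id appears twice, once with stray whitespace.
toy_df = pd.DataFrame({"track_id": ["aaa111", " aaa111 ", "bbb222", "ccc333"]})

# Strip whitespace first so that " aaa111 " and "aaa111" count as one track.
toy_ids = toy_df.track_id.str.strip().unique().tolist()
print(len(toy_df), len(toy_ids))
```

Stripping before calling `unique()` matters here: without it, the padded ID would be treated as a separate track.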

The following two cells are test code used to check whether the downloader works with a single song and whether the output can be stored on a different drive. The first test succeeded. However, writing the `.mp3` files to a storage drive from a desktop computer failed because `WSL` did not have write permissions for the drive in question. This may require further investigation if more space is needed.

In [5]:
# TEST CODE - This passed previously
# file_path = '/Users/vii/repos/brainstation_capstone/data/mp3s/'
# base_url = 'https://open.spotify.com/track/'
# track_id = '0BRjO6ga9RKCKjfDqeFgWV'
# url = base_url+track_id
# os.system("spotify_dl -s y --url {} -o ../data/mp3s/{}".format(url,track_id))

In [6]:
# TEST CODE - Check to see if files could be written to larger storage space
# This didn't work within WSL.
# file_path = '/Users/vii/repos/brainstation_capstone/data/mp3s/'
# base_url = 'https://open.spotify.com/track/'
# track_id = '0BRjO6ga9RKCKjfDqeFgWV'
# url = base_url+track_id
# os.system("spotify_dl -s y --url {} -o /mnt/d/data/mp3s/{}".format(url,track_id)) # replaced output directory with D drive
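
One way to confirm that a target directory accepts writes before starting a long download is to try creating a temporary file in it. This is a minimal sketch using only the standard library; `is_writable` is a hypothetical helper, not something provided by `spotify_dl`.

```python
import tempfile
from pathlib import Path


def is_writable(directory):
    """Return True if a temporary file can be created in `directory`."""
    try:
        # The file is removed automatically when the context exits.
        with tempfile.TemporaryFile(dir=str(directory)):
            return True
    except OSError:
        # Covers a missing directory as well as missing permissions.
        return False


with tempfile.TemporaryDirectory() as tmp:
    writable_ok = is_writable(tmp)
    writable_missing = is_writable(Path(tmp) / "missing")
```

For example, `is_writable('/mnt/d/data/mp3s/')` would report whether the D drive path used in the test above can be written to from the current environment.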


The cells below loop through the track_ids extracted from the Kaggle data above. Each track_id is concatenated with `base_url` to form a URL that is passed to the `spotify_dl` package. The package downloads each song as a `.webm` file and converts it to an `.mp3` file using `ffmpeg`, which was installed separately with the `conda install -c conda-forge ffmpeg` command. Once `ffmpeg` is installed, `spotify_dl` uses it automatically. The `.mp3` files are saved to the `mp3s` folder within the `data` directory.
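
The URL construction and downloader call described above can be sketched as follows. The flags (`--url`, `-s y`, `-o`) are the ones used in this notebook. `build_command` is a hypothetical helper, and passing a list of arguments to `subprocess.run` is a substitution for the `os.system` string call used here, chosen only because it avoids shell quoting.

```python
import subprocess

BASE_URL = "https://open.spotify.com/track/"


def build_command(track_id, out_root="../data/mp3s"):
    """Build the spotify_dl argument list for one track (hypothetical helper)."""
    url = BASE_URL + track_id
    return ["spotify_dl", "--url", url, "-s", "y", "-o", f"{out_root}/{track_id}"]


cmd = build_command("0BRjO6ga9RKCKjfDqeFgWV")
print(" ".join(cmd))

# To actually download (requires spotify_dl, ffmpeg and the credentials):
# subprocess.run(cmd, check=False)
```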

Before the download loop, a `for` loop uses the `os.walk` function to check the `mp3s` directory for track_ids that have already been downloaded. I included this check because downloading the sample needed to train the model required multiple days of runtime. In effect, it provides some measure of resume functionality, preventing existing files from being overwritten and avoiding repeated downloads of tracks that are already present.
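
The resume check amounts to removing, from the requested track_ids, every ID that already has a folder on disk. The following is a minimal sketch of that idea using `pathlib`; `pending_tracks` is a hypothetical helper, the IDs are made up, and a temporary directory stands in for `mp3s`.

```python
import tempfile
from pathlib import Path


def pending_tracks(track_ids, mp3_root):
    """Return the track_ids with no matching folder under `mp3_root`, in order."""
    root = Path(mp3_root)
    if not root.is_dir():
        return list(track_ids)
    done = {p.name for p in root.iterdir() if p.is_dir()}
    return [t for t in track_ids if t not in done]


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "aaa111").mkdir()  # pretend this track was already downloaded
    remaining = pending_tracks(["aaa111", "bbb222"], tmp)

print(remaining)
```

Note that this only checks for the presence of a folder, so a download that was interrupted partway would still be treated as complete.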

In [7]:
file_path = '/Users/vii/repos/brainstation_capstone/data/mp3s/'
base_url = 'https://open.spotify.com/track/'
# Collect the top-level folder names in mp3s/; each one is a track_id.
track_dirs = []
for root, dirnames, files in os.walk(file_path):
    track_dirs = dirnames.copy()
    break  # only the first level of the walk is needed
len(track_dirs)

2877

In [8]:
# An `export` issued through os.system only affects its own short-lived
# subshell, so the credentials are set in os.environ instead; child
# processes started by os.system inherit them from there.
os.environ["SPOTIPY_CLIENT_ID"] = CLIENT_ID
os.environ["SPOTIPY_CLIENT_SECRET"] = CLIENT_SECRET
downloaded = set(track_dirs)  # set membership checks are fast
for count, track_id in enumerate(track_ids):
    if track_id in downloaded:
        print(f'{count}: {track_id} already downloaded...skipping....')
    else:
        url = base_url + track_id
        os.system("spotify_dl --url {} -s y -o ../data/mp3s/{}".format(url, track_id))
        time.sleep(5)  # 5 second delay between requests

Starting spotify_dl v8.8.2                                     spotify_dl.py:143
Sponsorblock enabled?: y                                       spotify_dl.py:185
Saving songs to Alone directory                                spotify_dl.py

KeyboardInterrupt: 