# Music ETL Process Notebook

This is the notebook version of the `processor.py` file, the ETL script I run whenever I have a new batch of songs that needs to be processed, analyzed, and uploaded to my database.

Here I annotate each step of the process with explanations of what this code does.

In [1]:
#Imports
import sqlite3
import pandas as pd
import numpy as np
import matchering as mg
import json
from glob import glob
from tqdm import tqdm
from io import StringIO
import sys
import pathlib

import taglib
from datetime import datetime
import shutil
import os
from essentia.standard import MusicExtractor, YamlOutput,MetadataReader, PCA, YamlInput
import warnings
from zipfile import ZipFile
warnings.filterwarnings('ignore')
pd.set_option('max_colwidth', 100)

In [2]:
#Import utility functions from project_tools package
from project_tools.utils import effnet_config, json_opener, adapt_array, convert_array, tag_cleaner, digit2letters
from project_tools.models import Activator, Classifier

Since my sqlite database holds numpy arrays, which aren't a native data type, I need to register two functions `adapt_array` and `convert_array` which are used to handle the arrays.

In [3]:

sqlite3.register_adapter(np.ndarray, adapt_array)
sqlite3.register_converter("array", convert_array)
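
The helpers themselves live in `project_tools.utils` and are not reproduced in this notebook. As a rough sketch, a pair like this is commonly written by serializing the array with `np.save` into a BLOB. The function bodies, the in-memory database, and the `demo` table below are illustrative assumptions rather than the exact code in the package.

```python
import io
import sqlite3

import numpy as np


def adapt_array(arr):
    """Serialize a numpy array to bytes so sqlite can store it as a BLOB."""
    buffer = io.BytesIO()
    np.save(buffer, arr)
    return sqlite3.Binary(buffer.getvalue())


def convert_array(blob):
    """Rebuild the numpy array from the bytes stored in the database."""
    return np.load(io.BytesIO(blob))


sqlite3.register_adapter(np.ndarray, adapt_array)
sqlite3.register_converter("array", convert_array)

# PARSE_DECLTYPES makes sqlite3 look at the declared column type ("array")
# and pass stored values through the registered converter when reading.
demo_conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
demo_conn.execute("CREATE TABLE demo (sid TEXT, vec array)")
original = np.arange(6, dtype=np.float32).reshape(2, 3)
demo_conn.execute("INSERT INTO demo VALUES (?, ?)", ("abc", original))
restored = demo_conn.execute("SELECT vec FROM demo").fetchone()[0]
demo_conn.close()
```

The converter only fires for columns declared with the type name `array`, which is why the connection needs `detect_types=sqlite3.PARSE_DECLTYPES`.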

Connect to the `jaage.db` database

In [4]:
conn = sqlite3.connect("jaage.db", detect_types= sqlite3.PARSE_DECLTYPES)
cur = conn.cursor()

`Loading Dock` is the directory on my hard drive that serves as the staging area for newly downloaded songs.

`DJ Hub` is where the original files are moved after they have been processed.

In [5]:
load_path = "../../../../Volumes/LaCie/Loading Dock/"
dj_hub  = "../../../../Volumes/LaCie/DJ Hub/"

Sometimes downloaded songs are packaged as zip files, so if that is the case I unzip them.

In [6]:
zip_files = glob(load_path+"*.zip")
zip_files

[]

In [7]:
if len(zip_files) > 0:
    for z in zip_files:
        zf = ZipFile(z)
        zf.extractall(path=load_path)
        shutil.move(z,dj_hub)
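
The unzip step can be tried out safely on a throwaway directory. This sketch opens each `ZipFile` as a context manager so the archive is closed before it is moved. The temporary folders and the file names are made up for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path
from zipfile import ZipFile

# Throwaway stand-ins for the Loading Dock and DJ Hub folders.
root = Path(tempfile.mkdtemp())
demo_load = root / "Loading Dock"
demo_hub = root / "DJ Hub"
demo_load.mkdir()
demo_hub.mkdir()

# Build a small archive containing one fake track.
archive = demo_load / "batch.zip"
with ZipFile(archive, "w") as zf:
    zf.writestr("track.wav", b"not real audio")

for z in list(demo_load.glob("*.zip")):
    # The context manager closes the archive before it is moved.
    with ZipFile(z) as zf:
        zf.extractall(path=demo_load)
    shutil.move(str(z), str(demo_hub))

extracted_names = sorted(p.name for p in demo_load.iterdir())
archived_names = sorted(p.name for p in demo_hub.iterdir())
```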

Collect all the music files.

In [8]:
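# Brackets in a glob pattern form a character class, so filter on the file suffix instead.
audio_suffixes = {".wav", ".mp3", ".aiff"}
loading_files = [p for p in pathlib.Path(load_path).iterdir() if p.suffix.lower() in audio_suffixes]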

In [9]:
len_loading_files = len(list(loading_files))
print("There are {} files for the ETL pipeline".format(len_loading_files))

There are 7 files for the ETL pipeline


In [10]:
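# Brackets in a glob pattern form a character class, so filter on the file suffix instead.
audio_suffixes = {".wav", ".mp3", ".aiff"}
loading_files = [p for p in pathlib.Path(load_path).iterdir() if p.suffix.lower() in audio_suffixes]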
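
One caveat when collecting files by extension: square brackets in a glob pattern define a character class, not a list of alternatives. A pattern such as `*[.wav, .mp3, .aiff]` therefore matches any name whose last character appears inside the brackets. The sketch below uses made-up file names and `fnmatch` (which follows the same bracket rules) to show the difference from a suffix check.

```python
from fnmatch import fnmatch
from pathlib import Path

AUDIO_SUFFIXES = {".wav", ".mp3", ".aiff"}
names = ["song.wav", "song.mp3", "song.aiff", "notes.pdf", "cover.jpg"]

# Brackets form a character class: the pattern matches any name whose
# last character is one of the characters listed inside the brackets.
bracket_matches = [n for n in names if fnmatch(n, "*[.wav, .mp3, .aiff]")]

# Comparing suffixes keeps only the intended audio extensions.
suffix_matches = [n for n in names if Path(n).suffix.lower() in AUDIO_SUFFIXES]
```

Here `notes.pdf` slips through the bracket pattern because it ends in `f`, while the suffix check keeps only the three audio files.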

## Process Steps

1. Mastering
2. Essentia Features Extraction
3. Effnet Embeddings and Genre Activations
4. Style, Mood, and Genre Classification


### Mastering

Here I use the [matchering](https://github.com/sergree/matchering) package to adjust the sound and levels of the fresh batch of songs to make them more appropriate for DJing.


The tool takes an existing song as a reference and adjusts the new songs to match its sound.

In [11]:
#The song used to adjust other songs.
ref_file = '../../../../Volumes/LaCie/DJ Hub/Rayko - Magnetized (Rayko rework).wav'
#The directory where all my music is stored.
collection = "Collection"


- Loop over the new files in `loading_files`.

- Process each one with matchering, writing the mastered `.wav` file to the collection.

- Copy the tags from the original file onto the mastered file.

- Move the original file to DJ Hub.

In [12]:
new_file_paths = []
for f in tqdm(loading_files):
    # Mastered files are written as 24-bit WAV into the collection folder.
    out_path = (f.parent.parent / collection / f.stem).as_posix() + ".wav"

    mg.process(target=f.as_posix(),
               reference=ref_file,
               results=[mg.pcm24(out_path)])

    # Copy the tags from the original file onto the mastered file.
    load_tags = taglib.File(f.as_posix())
    mastered_tags = taglib.File(out_path)
    mastered_tags.tags = load_tags.tags
    mastered_tags.save()

    new_file_paths.append(out_path)

    # Move the original to DJ Hub; if a copy is already there, delete the duplicate.
    try:
        shutil.move(f.as_posix(), dj_hub)
    except shutil.Error:
        print(f, "already exists")
        os.remove(f.as_posix())
    

6it [03:43, 33.66s/it]

../../../../Volumes/LaCie/Loading Dock/Istanbul 70 by Barış K - ISTANBUL70 - Disco, Psych, Folk edits by Barış K Vol.5 - 01 Şenay - Dalkavuk (Barış K edit).aiff already exists


7it [04:13, 36.18s/it]

../../../../Volumes/LaCie/Loading Dock/Istanbul 70 by Barış K - ISTANBUL70 - Disco, Psych, Folk edits by Barış K Vol.5 - 02 Modern Folk Üçlüsü feat. Ayşegül Aldinç - Dönme Dolap (Barış K edit).aiff already exists
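
The "already exists" messages come from `shutil.move`, which raises `shutil.Error` when the destination directory already holds a file with the same name. A small sketch with temporary folders (the folder and file names are made up) shows the behaviour and the duplicate clean-up:

```python
import shutil
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
src_dir = root / "loading"
dst_dir = root / "hub"
src_dir.mkdir()
dst_dir.mkdir()

# The same file name exists in both folders.
(src_dir / "track.wav").write_bytes(b"new copy")
(dst_dir / "track.wav").write_bytes(b"old copy")

try:
    shutil.move(str(src_dir / "track.wav"), str(dst_dir))
    outcome = "moved"
except shutil.Error:
    # Raised because hub/track.wav is already there; drop the duplicate.
    (src_dir / "track.wav").unlink()
    outcome = "duplicate removed"

remaining = (dst_dir / "track.wav").read_bytes()
source_still_there = (src_dir / "track.wav").exists()
```

Note that the copy already sitting in the destination is left untouched; only the file in the staging folder is deleted.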





**Before moving on to the next step, I head over to RekordBox (the program I use to manage my library), import the new batch of songs, and edit their metadata tags.**

### Music Extraction

I use essentia to extract a variety of musical features and metadata from the songs.

The full description of what the `MusicExtractor` tool does can be found [here](https://essentia.upf.edu/tutorial_extractors_musicextractor.html) 

In [13]:
copied_paths = new_file_paths[:]
new_file_paths = []
for i in copied_paths:
    if os.path.exists(i):
        new_file_paths.append(i)
        
len(new_file_paths)

7

Initialize the extractor object

In [15]:
music_ext = MusicExtractor(lowlevelStats=['mean', 'stdev'],
                                    rhythmStats=['mean', 'stdev', "max", "min", "median"],
                                    tonalStats=['mean', 'stdev'],
                           mfccStats = ["mean", "cov"],
                           gfccStats = ["mean", "cov"])

Iterate over the new batch, extract the features for each song, and collect the results in a list called `extracted_files`.

In [16]:
out_dir = 'temp_features/'
extracted_files = []
id_2_paths = {}

for fil in tqdm(new_file_paths, total = len(new_file_paths)):
    try:
        features, _ = music_ext(fil)
        idd = features['metadata.audio_properties.md5_encoded']
        YamlOutput(filename= out_dir+"features.json", format="json")(features)
        json_data = json_opener(out_dir+"features.json")
        id_2_paths[idd] = fil
        extracted_files.append(json_data)
    except Exception as e:
        print(e)

  0%|                                                                                                                                           | 0/7 [00:00<?, ?it/s][   INFO   ] MusicExtractor: Read metadata
[   INFO   ] MusicExtractor: Compute md5 audio hash, codec, length, and EBU 128 loudness
[   INFO   ] MusicExtractor: Replay gain
[   INFO   ] MusicExtractor: Compute audio features
[   INFO   ] MusicExtractor: Compute aggregation
[   INFO   ] All done
 14%|██████████████████▋                                                                                                                | 1/7 [00:14<01:29, 14.92s/it][   INFO   ] MusicExtractor: Read metadata
[   INFO   ] MusicExtractor: Compute md5 audio hash, codec, length, and EBU 128 loudness
[   INFO   ] MusicExtractor: Replay gain
[   INFO   ] MusicExtractor: Compute audio features
[   INFO   ] MusicExtractor: Compute aggregation
[   INFO   ] All done
 29%|█████████████████████████████████████▍                                 

Convert `extracted_files` to a pandas dataframe for easier data handling.

In [17]:
extracted = pd.json_normalize(extracted_files)
# regex=False so that "." is treated as a literal dot, not as a regex wildcard.
extracted.columns = extracted.columns.str.replace(".", "_", regex=False)
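
To make the flattening concrete, here is a plain-Python sketch of what `pd.json_normalize` followed by the rename does to one nested feature dictionary. The sample values are invented, and the `flatten` helper exists only for this illustration.

```python
def flatten(nested, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            # Lists and scalars are kept as they are.
            flat[full_key] = value
    return flat


sample = {
    "metadata": {"audio_properties": {"md5_encoded": "abc123", "length": 383.28}},
    "rhythm": {"bpm": 124.0},
    "lowlevel": {"mfcc": {"mean": [1.0, 2.0, 3.0]}},
}
flat_sample = flatten(sample)
```

Each nested path becomes a single column name such as `metadata_audio_properties_md5_encoded`, and list-valued features like `lowlevel_mfcc_mean` stay as lists inside one cell.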

Rename the `metadata_audio_properties_md5_encoded` column to `sid` (short for song id). This MD5 hash serves as the unique id for each song.

In [18]:
extracted.rename(columns={"metadata_audio_properties_md5_encoded":"sid"}, inplace=True)

Load the array of column names used to drop unnecessary columns from the `extracted` dataframe.

In [19]:
drop_cols = np.load("drop_cols.pkl", allow_pickle=True).tolist()
extracted.drop(drop_cols, axis = 1, inplace=True, 
               errors="ignore"
              )
extracted.set_index("sid", inplace=True)
extracted.shape

(7, 158)

Separate the metadata from the rest of the dataset by creating a new dataframe called `meta_df`.

In [20]:
cols = extracted.columns

meta_cols = cols[cols.str.startswith("meta")]
non_meta_cols = cols[~cols.str.startswith("meta")]

meta_df = extracted[meta_cols].copy()
extracted.drop(meta_cols, axis = 1, inplace=True)

This step divides the remaining data in `extracted` into two dataframes: one whose values are lists and one whose values are not.

In [21]:
# Inspect the first row to find out which columns hold lists.
is_list_col = extracted.iloc[0].apply(lambda x: isinstance(x, list))
list_cols = extracted.columns[is_list_col]
no_list_cols = extracted.columns[~is_list_col]
list_data = extracted[list_cols]
no_list_data = extracted[no_list_cols]

`tag_cleaner` is used to deal with null values and empty lists in `meta_df`

In [22]:
meta_df = meta_df.applymap(tag_cleaner)

In [23]:
meta_df.columns = meta_df.columns.str.split("_").map(lambda x:x[-1])

In [24]:
meta_df.rename(columns={"name":"file_name"}, inplace=True)

The clean version of `meta_df`

In [25]:
meta_df.head()

| sid | length | gain | codec | file_name | artist | date | title | album | bpm | composer | genre | initialkey | tracktotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| de5515fdc54bd7cad645c5446b3e748f | 383.280182 | -13.44842 | pcm_s24le | COEO - Emergency Loop.wav | Gazzz696 | 2019.0 | COEO - Emergency Loop | | | | | | |
| a7cf765cb13c65ced228eb13383cba15 | 322.037537 | -13.170233 | pcm_s24le | Ezirk - Yabai Gang.wav | Funk'n Disco & Stuff | 2021.0 | Ezirk - Yabai Gang | | | | | | |
| 03617d81828c17cf15b7e25f05a6f9a1 | 265.785126 | -12.738928 | pcm_s24le | Hamid El Shaeri - Dari Demouek (V4YS Rework).wav | Hamid El Shaeri | | Dari Demouek (V4YS Rework) | | | | | | |
| dbe511f964f95849e05848d5dae6ee5d | 315.583984 | -11.940247 | pcm_s24le | Herbert Leonard - Chante Avec Moi (Les Aristos Remlx).wav | Herbert Leonard | 2022.0 | Chante Avec Moi (Les Aristos Remlx) | Collection 80's | 124.0 | Les Aristos | French Remix | 10A | 1.0 |
| 6b4bd633ec076c4d3e3046044bd647b7 | 284.165802 | -12.149185 | pcm_s24le | Herbert Leonard - Chante avec moi (Absolut Bibiche ReWork).wav | Herbert Leonard | 2022.0 | Chante avec moi (Absolut Bibiche ReWork) | Ced ReWrk Prod | | | Funk | | |


Before uploading `meta_df` to the `tags` table in the db, I need to make sure the column names are aligned. I read the column names from `tags` and use them to filter `meta_df`.

In [26]:
tags_cols = pd.read_sql("SELECT * FROM tags LIMIT 1", con = conn).set_index('sid').columns.tolist()

In [27]:
meta_cols = [i for i in meta_df.columns if i in tags_cols]
meta_cols

['length',
 'gain',
 'codec',
 'file_name',
 'artist',
 'date',
 'title',
 'album',
 'bpm',
 'genre',
 'initialkey']
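
The column names can also be read straight from sqlite with `PRAGMA table_info`, without going through pandas. The sketch below runs on an in-memory database with a made-up, shortened `tags` schema, so the column list is an assumption and not the real table definition.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE tags (sid TEXT PRIMARY KEY, artist TEXT, title TEXT, bpm REAL)"
)

# Each row of table_info is (cid, name, type, notnull, dflt_value, pk).
table_cols = [row[1] for row in demo_conn.execute("PRAGMA table_info(tags)")]
table_cols.remove("sid")

# Keep only the dataframe columns that the table knows about.
frame_cols = ["artist", "title", "composer", "bpm", "tracktotal"]
kept_cols = [c for c in frame_cols if c in table_cols]
demo_conn.close()
```

Columns such as `composer` and `tracktotal` are dropped because the table has no matching column, which is the same filtering that `meta_cols` performs.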

Append `meta_df` to the `tags` table in the db.

In [28]:
meta_df[meta_cols].to_sql("tags", con=conn, if_exists = "append")

Now it's time to update the `files` table which is how I connect the song paths to their unique ids.

In [29]:
files = pd.DataFrame(id_2_paths.items(), columns=["sid", "file_path"])

In [30]:
files.to_sql("files", con = conn, if_exists="append", index = False)

Divide the `no_list_data` dataframe into three sections: tonal, lowlevel, rhythm. These features are explained on the essentia website.

In [31]:
cols = no_list_data.columns
tonal_cols = cols[cols.str.startswith("tonal")]
lowlevel_cols = cols[cols.str.startswith("lowlevel")]
rhythm_cols = cols[cols.str.startswith("rhyt")]

tonal_df = no_list_data[tonal_cols]
lowlevel_df = no_list_data[lowlevel_cols]
rhythm_df = no_list_data[rhythm_cols]

Use those dataframes to update their corresponding tables in the db.

In [32]:
tonal_df.to_sql("tonal_features", con=conn, if_exists="append")
lowlevel_df.to_sql("lowlevel_features", con=conn, if_exists="append")
rhythm_df.to_sql("rhythm_features", con=conn, if_exists="append")

Upload the dataframes with list values to the db.

In [33]:
for col in tqdm(list_cols):
    ser = list_data[col].apply(pd.Series)
    ser.columns = col + "_"+ ser.columns.astype(str)
    ser.to_sql(col+"_tbl", con = conn,if_exists="append")

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 81.64it/s]


### EffNet Embeddings and Genre Classifications

I use the [discogs-effnet model](https://essentia.upf.edu/models.html#discogs-effnet) to generate Nx1280 embeddings and activation scores for 400 genres.

Those embeddings are then used later on as the input data for the other classification models.

The metadata on this model can be found [here](https://essentia.upf.edu/models/music-style-classification/discogs-effnet/discogs-effnet-bs64-1.json)
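
A quick shape check with random stand-in data helps keep the dimensions straight. The number of patches `n_patches` depends on the track length and is arbitrary here, and no model is run. The reshaping follows the same `np.expand_dims` calls used in the upload loop in this section.

```python
import numpy as np

n_patches = 5  # arbitrary; depends on how long the track is
rng = np.random.default_rng(0)

# Stand-ins for the model outputs: one row per patch.
embeddings = rng.random((n_patches, 1280), dtype=np.float32)
activations = rng.random((n_patches, 400), dtype=np.float32)

# One (1, n_patches) array per genre, so each genre column stores a single array.
per_genre = [np.expand_dims(activations[:, i], 0) for i in range(activations.shape[1])]

# The embedding gets a leading axis so the whole song is a single stored value.
stored_embedding = np.expand_dims(embeddings, 0)
```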

In [34]:
path2id = {v:k for k, v in id_2_paths.items()}

Initialize the `Activator`, the tool I use to preprocess the raw song data, generate the predictions, and upload them to the database.

In [35]:
act = Activator(input_length=2.05, 
                model_path="onnx_models/discogs-effnet-bsdynamic-1.onnx",
                   pathid_dict=path2id)

Grab the column names from the `effnet_genres` table.

In [36]:
gcols = pd.read_sql_query("SELECT * FROM effnet_genres LIMIT 1 ", con = conn).columns[1:].tolist()
# gcols[:5]

- Iterate over the new batch of songs

- Generate the activations and embeddings.

- Upload them to their corresponding tables in the database.

In [37]:
for song in act.batch_inference():
    with conn:
        sid, sf, output = song
        genre_acts = output["activations"]
        embeds = output["embeddings"]
        genre_acts = [np.expand_dims(genre_acts[:, i], 0) for i in range(400)]
        genre_acts = pd.DataFrame(index = [sid], data = [genre_acts], columns=gcols)
        cur.execute("INSERT INTO effnet_embeddings (sid, effnet_embedding) values (?,?)", 
                    (sid, np.expand_dims(embeds,0)))
        genre_acts.to_sql("effnet_genres", con=conn, if_exists="append", index=False)
    conn.commit()

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:10<00:00,  1.43s/it]
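
Using a sqlite3 connection as a context manager (`with conn:`) commits when the block finishes and rolls back if it raises. The sketch below shows that behaviour on an in-memory database with two made-up tables: when the second insert for a song fails, the first insert for that song is undone as well.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE embeddings (sid TEXT PRIMARY KEY)")
demo_conn.execute("CREATE TABLE genres (sid TEXT PRIMARY KEY)")
demo_conn.execute("INSERT INTO genres VALUES ('b')")  # leftover row that will clash
demo_conn.commit()

failed = []
for sid in ["a", "b"]:
    try:
        # Commits when the block exits cleanly, rolls back on an exception.
        with demo_conn:
            demo_conn.execute("INSERT INTO embeddings VALUES (?)", (sid,))
            demo_conn.execute("INSERT INTO genres VALUES (?)", (sid,))
    except sqlite3.IntegrityError:
        failed.append(sid)

embedding_ids = [r[0] for r in demo_conn.execute("SELECT sid FROM embeddings ORDER BY sid")]
genre_ids = [r[0] for r in demo_conn.execute("SELECT sid FROM genres ORDER BY sid")]
demo_conn.close()
```

Song `b` ends up in neither new insert, so a rerun after fixing the clash does not hit a half-written record.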


### Classification Head Models

This section is where I use the discogs-effnet embeddings to generate predictions from a variety of music [style and mood classification models](https://essentia.upf.edu/models.html#classification-heads).

These models classify attributes such as danceability, happiness, relaxedness, and more.


The `onnx_models` directory hosts all the downloaded models. The `json_info` subdirectory hosts all their corresponding metadata.

I collect all the models and their metadata here.

I have other models but for now I'm only working with the effnet ones.

In [38]:
model_paths = sorted(glob("onnx_models/*.onnx"))
model_infos = sorted(glob("onnx_models/json_info/*.json"))
effnet_models = [{"model": model_paths[i], 
                  "json":model_infos[i]} for i in range(len(model_paths)) if "effnet" in model_paths[i]]
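
Pairing the two sorted lists by position only works while every model has exactly one metadata file and both lists sort identically. A more defensive sketch pairs them by file stem instead. It assumes each JSON file shares its model's base name, which is a guess about the naming scheme, and the file names below are invented.

```python
import tempfile
from pathlib import Path

# Throwaway copy of the directory layout with empty placeholder files.
root = Path(tempfile.mkdtemp()) / "onnx_models"
(root / "json_info").mkdir(parents=True)
for name in ["danceability-effnet-discogs-1", "mood_happy-effnet-discogs-1", "mood_sad-vggish-1"]:
    (root / f"{name}.onnx").touch()
# Deliberately leave one effnet model without metadata.
(root / "json_info" / "danceability-effnet-discogs-1.json").touch()
(root / "json_info" / "mood_sad-vggish-1.json").touch()

# Look up metadata by stem so a missing file cannot shift the pairing.
infos_by_stem = {p.stem: p for p in (root / "json_info").glob("*.json")}
paired = [
    {"model": p.as_posix(), "json": infos_by_stem[p.stem].as_posix()}
    for p in sorted(root.glob("*.onnx"))
    if "effnet" in p.name and p.stem in infos_by_stem
]
```

With index-based pairing, the missing `mood_happy` metadata would have shifted every later pair by one; here that model is simply skipped.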

In [39]:
effnet_models = effnet_models[:2] + effnet_models[4:]

In [40]:
new_ids = list(path2id.values())

- Iterate over all the models.

- Generate batch inferences from them. The models are fed data queried from the database.

In [58]:
for em in effnet_models:
    cls = Classifier(em, new_ids=new_ids)
    cls.batch_inference()
    cls.conn.commit()
    print("Completed => ", cls.table_name, "\n\n")

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 2858.21it/s]


Completed =>  approachability_2c_effnet_discogs_1_activations 




100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 3084.17it/s]


Completed =>  danceability_effnet_discogs_1_activations 




100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 2784.60it/s]


Completed =>  engagement_2c_effnet_discogs_1_activations 




100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 2033.94it/s]


Completed =>  genre_electronic_effnet_discogs_1_activations 




100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 2920.16it/s]


Completed =>  mood_acoustic_effnet_discogs_1_activations 




100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 2603.60it/s]


Completed =>  mood_aggressive_effnet_discogs_1_activations 




100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 2482.26it/s]


Completed =>  mood_happy_effnet_discogs_1_activations 




100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 2885.58it/s]


Completed =>  mood_party_effnet_discogs_1_activations 




100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 2785.50it/s]


Completed =>  mood_sad_effnet_discogs_1_activations 




100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 322.41it/s]


Completed =>  mtg_jamendo_genre_effnet_discogs_1_activations 




100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 458.82it/s]


Completed =>  mtg_jamendo_moodtheme_effnet_discogs_1_activations 




100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 464.91it/s]


Completed =>  mtg_jamendo_top50tags_effnet_discogs_1_activations 




100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 2940.49it/s]

Completed =>  timbre_effnet_discogs_1_activations 





