# Creating Data For Modeling
This is the working notebook used to create the modeling data.

As the capstone project was iterative, we created several versions of the data to experiment with different approaches. This notebook keeps the last data-creation passes used in the analysis, both as a record and as an example.

It can also serve as a reference for anyone who wants to create their own data.

In [2]:
import sys
import numpy as np
import pandas as pd
sys.path.insert(0, '../../')

#the project's library houses the code that does most of the heavy lifting
#the notebook api's data loader provides the interface for combined source data,
# as well as modeling data, which is populated by the output of this process
from library.notebook_api.data_loader import CombinedDataLoader, ModelDataLoader
from library.source_data.parallel_processor import AudioParallelProcessor

## Balanced Extraction - For V006 data population
This scenario runs extraction on a subset of the large version of the dataset. It filters to rows matching a list of provided genres and tries to balance the dataset by drawing an equal-sized sample from each genre. Because data generation takes a while, running it in batches gives us some data to work with on modeling tasks while we continue to create more.

`CombinedDataLoader` is the primary class used to interact with the source data. Its first argument selects which FMA data source to use: large, medium, or small. Only small ships in our project_data_source folder, since large and medium are impractical for us to store given their size.
See the Readme for more details on the data. For `CombinedDataLoader` to work, the project_data_source folder must be populated with the source dataset assets.

In [3]:
#instantiate data_loader and the dataframes it makes available
in_scope_labels = ['rock', 'electronic', 'hiphop', 'classical', 'jazz','country']
data_loader = CombinedDataLoader('large', in_scope_labels)

#this data loader method samples the source data separately for each genre in the in_scope_labels list
#the argument controls the number of samples per genre
#if a genre has fewer tracks than requested, sampling stops at the number of tracks available
df_balanced = data_loader.get_data_sampled_by_label(5000)

len(df_balanced)

tracks in meta 29701
tracks with files available in project_data_path:  29701
tracks with top level genres available 29701
tracks with genres and files (df_filtered) 29701


15947
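
The per-genre sampling described above can be pictured with plain pandas. This is a minimal sketch, not the library's implementation: the `sample_by_label` helper, its `seed` argument, and the toy frame are hypothetical, and only the `label` and `track_id` column names come from this notebook. Capping each request at the group size mirrors the "stop at the max track" behaviour, which would be consistent with the output above being 15947 rather than 6 x 5000 = 30000.

```python
import pandas as pd

def sample_by_label(df, n_per_label, label_col="label", seed=0):
    """Draw up to n_per_label rows from each label group (hypothetical helper)."""
    parts = []
    for _, group in df.groupby(label_col):
        # Cap the request at the group size so small genres contribute all their rows.
        take = min(len(group), n_per_label)
        parts.append(group.sample(n=take, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)

# Toy data: 5 rock tracks and 2 jazz tracks.
toy = pd.DataFrame({
    "track_id": range(7),
    "label": ["rock"] * 5 + ["jazz"] * 2,
})
balanced = sample_by_label(toy, n_per_label=3)
counts = balanced["label"].value_counts().to_dict()
```

Here `counts` shows rock capped at the requested size while jazz contributes every row it has, so the result is only as balanced as the rarest genre allows.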

This notebook is currently set up to update in incremental batches.
The code below checks the current contents of the model data version and determines which tracks from the source have yet to be processed.

This only makes sense if you have already populated data into a version.
If the version is being reset (populated from scratch), you can skip that part and pass df_balanced or CombinedDataLoader.df_filtered to the audio processor in the next step.

Note: some files raised errors during processing and were skipped. We consider investigating and addressing those errors out of scope for the project, but they can be a reason why a "complete" run does not contain every track in the metadata.

In [4]:
#get dataframe of current existing track ids from ModelDataLoader
current_data = ModelDataLoader('006')
current_data_df = current_data.df
current_track_list = current_data_df[['track_id','file_available']].copy()
#rename file_available column to differentiate it 
current_track_list.rename(columns={'file_available':'file_available_in_input'}, inplace=True)
#left-join this to the balanced source sample to figure out tracks we don't have
df_filtered_with_current_track_list = pd.merge(df_balanced,current_track_list,on='track_id',how='left')
#boolean indexer for the incremental files
incremental = (df_filtered_with_current_track_list['file_available_in_input'].isnull()) &(df_filtered_with_current_track_list['file_available'] ==1)

df_filtered_incremental = df_filtered_with_current_track_list[incremental].copy()

print(len(df_filtered_incremental), 'files left to process ')

3494 files left to process 
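
The same "what is left to process" check can also be written as an anti-join with the merge indicator. This is a sketch on toy data rather than the notebook's frames: `source` and `processed` are hypothetical stand-ins for `df_balanced` and the existing model data, and it assumes `track_id` is unique in `processed` so the left join does not duplicate rows.

```python
import pandas as pd

# Toy stand-ins: 'source' plays the role of df_balanced, 'processed' the existing model data.
source = pd.DataFrame({"track_id": [1, 2, 3, 4], "file_available": [1, 1, 0, 1]})
processed = pd.DataFrame({"track_id": [2]})

# indicator=True adds a '_merge' column; 'left_only' marks rows with no match in 'processed'.
merged = source.merge(processed, on="track_id", how="left", indicator=True)
todo = merged[(merged["_merge"] == "left_only") & (merged["file_available"] == 1)]
todo = todo.drop(columns="_merge")

remaining_ids = todo["track_id"].tolist()
```

Using the indicator avoids renaming a column just to test it for nulls, at the cost of one extra column that is dropped afterwards.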


In this section we kept a record of the batches that were run.
As you can see, the AudioParallelProcessor takes arguments for version number, batch, and threads.
These all become part of the file names and paths generated in `project_data_source/model_input_data`.

Each batch also spawns multiple threads, according to the `threads` argument, so that processing can be parallelized across CPU cores.
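
To illustrate the idea of fanning one batch out over several workers, here is a small standard-library sketch. It is not how `AudioParallelProcessor` is implemented (its internals are not shown in this notebook); `split_into_chunks` and `process_chunk` are hypothetical names, and the per-track work is a placeholder. Note that pure-Python CPU-bound work is limited by the GIL, so real speedups from threads depend on how much time is spent in native code or I/O.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous, nearly equal slices."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first 'extra' chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def process_chunk(track_ids):
    # Placeholder for the real per-track audio extraction.
    return [f"processed-{t}" for t in track_ids]

track_ids = list(range(10))
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map returns results in the same order as the input chunks.
    results = list(pool.map(process_chunk, split_into_chunks(track_ids, 4)))

flat = [item for chunk in results for item in chunk]
```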

`SAMPLE_RATE` and `SECONDS` configure the audio extraction, which uses librosa. For this version we downsampled and truncated the audio to reduce vector size and keep vector lengths consistent.

Note: this is the only version where running the batches generates both MFCCs and log mel spectrograms.
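
The effect of `start_sample` and `end_sample` on vector length can be sketched with NumPy alone. The `fix_length` helper below is hypothetical, and zero-padding short clips is an assumption made for the example: the notebook only states that audio is truncated, not how tracks shorter than the window are handled.

```python
import numpy as np

SAMPLE_RATE = 22500
SECONDS = 25

def fix_length(signal, start_sample, end_sample):
    """Slice signal[start_sample:end_sample], zero-padding if the slice is short."""
    target = end_sample - start_sample
    clip = signal[start_sample:end_sample]
    if len(clip) < target:
        # Assumption: pad with trailing zeros so every clip has the same length.
        clip = np.pad(clip, (0, target - len(clip)))
    return clip

long_track = np.ones(SAMPLE_RATE * 30, dtype=np.float32)   # 30 s of audio
short_track = np.ones(SAMPLE_RATE * 10, dtype=np.float32)  # 10 s of audio

long_clip = fix_length(long_track, 0, SAMPLE_RATE * SECONDS)
short_clip = fix_length(short_track, 0, SAMPLE_RATE * SECONDS)
```

Both clips end up with `SAMPLE_RATE * SECONDS` samples, which is what makes the downstream feature arrays stackable into one batch.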

In [19]:
SAMPLE_RATE  = 22500
SECONDS = 25
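#these batches include mfccs and log_melspectrogram, downsampled to 22500, truncated to 25 seconds
#note: iloc slices are end-exclusive, so a pair like [0:2000] and [2001:4000] leaves row 2000 out of both batches;
#      a skipped row stays in the incremental list and can be picked up by a later run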
#batch_mfcc = AudioParallelProcessor(df_filtered_incremental.iloc[0:1200],version = '006', batch=2, threads = 4,sample_rate=SAMPLE_RATE,start_sample=0,end_sample=SAMPLE_RATE*SECONDS)
#batch_mfcc2 = AudioParallelProcessor(df_filtered_incremental.iloc[1201:],version = '006', batch=3, threads = 4,sample_rate=SAMPLE_RATE,start_sample=0,end_sample=SAMPLE_RATE*SECONDS)
#batch_mfcc = AudioParallelProcessor(df_filtered_incremental.iloc[0:2000],version = '006', batch=4, threads = 5,sample_rate=SAMPLE_RATE,start_sample=0,end_sample=SAMPLE_RATE*SECONDS)
#batch_mfcc2 = AudioParallelProcessor(df_filtered_incremental.iloc[2001:4000],version = '006', batch=5, threads = 5,sample_rate=SAMPLE_RATE,start_sample=0,end_sample=SAMPLE_RATE*SECONDS)

#batch_mfcc = AudioParallelProcessor(df_filtered_incremental.iloc[0:2000],version = '006', batch=6, threads = 5,sample_rate=SAMPLE_RATE,start_sample=0,end_sample=SAMPLE_RATE*SECONDS)
#batch_mfcc2 = AudioParallelProcessor(df_filtered_incremental.iloc[2001:4000],version = '006', batch=7, threads = 5,sample_rate=SAMPLE_RATE,start_sample=0,end_sample=SAMPLE_RATE*SECONDS)

#batch_mfcc = AudioParallelProcessor(df_filtered_incremental.iloc[0:2000],version = '006', batch=8, threads = 5,sample_rate=SAMPLE_RATE,start_sample=0,end_sample=SAMPLE_RATE*SECONDS)
#batch_mfcc2 = AudioParallelProcessor(df_filtered_incremental.iloc[2001:4000],version = '006', batch=9, threads = 5,sample_rate=SAMPLE_RATE,start_sample=0,end_sample=SAMPLE_RATE*SECONDS)

#batch_mfcc = AudioParallelProcessor(df_filtered_incremental.iloc[0:2000],version = '006', batch=10, threads = 5,sample_rate=SAMPLE_RATE,start_sample=0,end_sample=SAMPLE_RATE*SECONDS)
#batch_mfcc2 = AudioParallelProcessor(df_filtered_incremental.iloc[2001:4000],version = '006', batch=11, threads = 5,sample_rate=SAMPLE_RATE,start_sample=0,end_sample=SAMPLE_RATE*SECONDS)

batch_mfcc = AudioParallelProcessor(df_filtered_incremental,version = '006', batch=12, threads = 5,sample_rate=SAMPLE_RATE,start_sample=0,end_sample=SAMPLE_RATE*SECONDS)


In [None]:
batch_mfcc.execute()

In [None]:
#batch_mfcc2 only exists when one of the commented-out batch_mfcc2 lines above is active
#batch_mfcc2.execute()
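
Because `iloc` slices exclude their stop index, consecutive batches should start where the previous one stopped. The sketch below generates such bounds; `batch_bounds` is a hypothetical helper, not part of the project library, and the row count is taken from the "files left to process" output earlier in this notebook.

```python
def batch_bounds(n_rows, batch_size):
    """Return end-exclusive (start, stop) pairs that cover range(n_rows) exactly once."""
    return [(start, min(start + batch_size, n_rows))
            for start in range(0, n_rows, batch_size)]

bounds = batch_bounds(3494, 2000)
# Each pair can be used as df.iloc[start:stop]; the next batch starts at the previous stop.
```

With these bounds, every row lands in exactly one batch and the last batch is simply shorter.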

## Feature Extraction fma_large V005 data population

We completed this version after we downloaded and incorporated the fma_large data, but before we decided to narrow down to a subset of genres and before we added log mel spectrogram vectors.

As a result, this version has representation for all tracks and has metadata, numerical features, and MFCC vectors at full sample rate and length.

### Load Metadata
Pass the optional FMA audio size argument to `CombinedDataLoader` to reference the fma_large file directory when combining FMA data with GTZAN data.

In [5]:
#instantiate data_loader and the dataframes it makes available
data_loader = CombinedDataLoader('large')
df = data_loader.df
df_files_available = data_loader.df_files_available
df_genres_available = data_loader.df_genres_available
df_filtered = data_loader.df_filtered

#get dataframe of current existing track ids from ModelDataLoader
current_data = ModelDataLoader('005')
current_data_df = current_data.df
current_track_list = current_data_df[['track_id','file_available']].copy()
#rename file_available column to differentiate it 
current_track_list.rename(columns={'file_available':'file_available_in_input'}, inplace=True)
#left-join this to the filtered source data to figure out tracks we don't have
df_filtered_with_current_track_list = pd.merge(df_filtered,current_track_list,on='track_id',how='left')
#boolean indexer for the incremental files
incremental = (df_filtered_with_current_track_list['file_available_in_input'].isnull()) &(df_filtered_with_current_track_list['file_available'] ==1)

df_filtered_incremental = df_filtered_with_current_track_list[incremental].copy()

print(len(df_filtered_incremental), 'files left to process ')

tracks in meta 107574
tracks with files available in project_data_path:  107574
tracks with top level genres available 50598
tracks with genres and files (df_filtered) 50598
25720 files left to process 


In [None]:
in_scope_labels = ['rock', 'electronic', 'hiphop', 'classical', 'jazz','country']

#keep only the rows whose label is one of the in-scope genres
df_incremental_under_represented = df_filtered_incremental[df_filtered_incremental['label'].isin(in_scope_labels)]
df_incremental_under_represented['label'].value_counts()

label
hiphop       2173
classical     444
jazz          255
country       202
Name: count, dtype: int64

### Extract and save numerical features as Parquet and MFCCs as .npy ndarrays

In [10]:
batch_mfcc = AudioParallelProcessor(df_incremental_under_represented,version = '005', batch=12, threads = 4)

#batch_mfcc = AudioParallelProcessor(df_filtered_incremental.iloc[0:2000],version = '005', batch=11, threads = 4)
#batch_mfcc2 = AudioParallelProcessor(df_filtered_incremental.iloc[2001:4000],version = '005', batch=12, threads = 4)
#batch_mfcc3 = AudioParallelProcessor(df_filtered_incremental.iloc[4001:6000],version = '005', batch=13, threads = 4)

In [7]:
#batch_mfcc.execute()

In [8]:
#batch_mfcc2.execute()

In [9]:
#batch_mfcc3.execute()
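
For readers unfamiliar with the `.npy` format mentioned in this section's heading, the sketch below round-trips an array through `np.save` and `np.load`. The array shape and the file name are illustrative assumptions; the actual files written by `AudioParallelProcessor` follow the version, batch, and thread naming described earlier and are not reproduced here.

```python
import os
import tempfile

import numpy as np

# Fake MFCC batch: 3 tracks x 13 coefficients x 20 frames (shapes are illustrative).
mfcc_batch = np.random.default_rng(0).normal(size=(3, 13, 20)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_batch_mfcc.npy")  # hypothetical file name
    np.save(path, mfcc_batch)   # writes shape, dtype, and data in one binary file
    restored = np.load(path)    # reads the full array back into memory

round_trip_ok = np.array_equal(mfcc_batch, restored) and restored.dtype == mfcc_batch.dtype
```

The format preserves shape and dtype, so a saved batch can be reloaded without any separate schema.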

In [22]:
m = ModelDataLoader('005')

In [23]:
m.df.head()
len(m.df)

24878