# FMA: A Dataset For Music Analysis

Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson, EPFL LTS2.

## Features

The notebook generates:
* `features.csv`: common features extracted with librosa.
* `spotify.json`: audio features provided by Spotify (formerly Echonest). Not implemented yet.

TODO:
* features given for the Million Song Dataset (MSD)

All features are extracted using [librosa](https://github.com/librosa/librosa). Alternatives:
* [MARSYAS](https://github.com/marsyas/marsyas) (C++ with Python bindings)
* [RP extract](http://www.ifs.tuwien.ac.at/mir/downloads.html) (Matlab, Java, Python)
* [jMIR jAudio](http://jmir.sourceforge.net) (Java)

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import utils
import librosa
import pandas as pd
import numpy as np
import os
import ast

In [None]:
DATA_DIR = os.environ.get('DATA_DIR')
tracks = pd.read_csv(os.path.join(DATA_DIR, 'tracks.csv'), index_col=0, converters={'genres': ast.literal_eval})
#tracks = pd.read_json(os.path.join(DATA_DIR, 'tracks.json'), orient='split')
path = utils.build_path(tracks, os.path.join(DATA_DIR, 'fma_small'))

# Todo: fix dataset
#tracks.index.set_names('track_id', inplace=True)

# Keep only the first 10 tracks while prototyping.
tracks = tracks[:10]

In [None]:
n_mfcc = 13
columns = []
columns.extend(('mfcc', 'mean', '{:02d}'.format(i+1)) for i in range(n_mfcc))
columns.extend(('mfcc', 'std', '{:02d}'.format(i+1)) for i in range(n_mfcc))
columns = pd.MultiIndex.from_tuples(columns, names=('feature', 'statistics', 'number'))

features = pd.DataFrame(index=tracks.index, columns=columns, dtype=np.float32)
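The three-level column index makes it easy to write and read a whole block of statistics at once. A minimal sketch on a toy frame, assuming only 3 coefficients and two made-up track IDs (the names `toy` and `means` are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

n_mfcc = 3  # fewer coefficients than above, to keep the toy frame small
columns = []
columns.extend(('mfcc', 'mean', '{:02d}'.format(i+1)) for i in range(n_mfcc))
columns.extend(('mfcc', 'std', '{:02d}'.format(i+1)) for i in range(n_mfcc))
columns = pd.MultiIndex.from_tuples(columns, names=('feature', 'statistics', 'number'))

# Two made-up track IDs, all values initialized to zero.
toy = pd.DataFrame(0.0, index=[2, 5], columns=columns, dtype=np.float32)

# A partial key (feature, statistics) addresses all coefficients at once.
toy.loc[2, ('mfcc', 'mean')] = np.array([1.0, 2.0, 3.0], dtype=np.float32)

# Selecting with the same partial key drops the two outer levels.
means = toy['mfcc', 'mean']
print(means)
```

The same partial-key assignment is what fills `features` track by track in the extraction loop.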

## 1 Segmentation

## 2 Low-level features

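* Timbre (short-term): zero-crossing rate (ZCR), spectral centroid (SC), spectral rolloff (SR), spectral flux (SF), Mel-frequency cepstral coefficients (MFCC), Daubechies wavelet coefficient histograms (DWCH)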
* Temporal: SM, ARM, FP, AM
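
To make two of the timbre features concrete, here is a NumPy-only sketch of the zero-crossing rate and the spectral centroid, computed over a whole synthetic 440 Hz tone rather than per frame. It is an illustration under those assumptions, not librosa's implementation; the signal and all variable names are made up for the example.

```python
import numpy as np

sr = 22050                       # sampling rate in Hz (assumed for the example)
t = np.arange(sr) / sr           # one second of samples
x = np.sin(2 * np.pi * 440 * t)  # synthetic 440 Hz tone

# Zero-crossing rate: fraction of consecutive sample pairs whose sign differs.
signs = np.signbit(x)
zcr = np.mean(signs[1:] != signs[:-1])

# Spectral centroid: magnitude-weighted mean frequency of the spectrum.
mag = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / sr)
centroid = np.sum(freqs * mag) / np.sum(mag)

print(zcr, centroid)
```

For a pure tone, the zero-crossing rate should be close to twice the frequency divided by the sampling rate, and the centroid close to the tone's frequency.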

Todo:
* parallel implementation

In [None]:
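for i, tid in enumerate(tracks.index):
    # Load at the native sampling rate and downmix to mono.
    x, sr = librosa.load(path(i), sr=None, mono=True)  # res_type='kaiser_fast'
    # librosa 0.10 and later require keyword arguments here.
    m = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=n_mfcc, n_fft=2048, hop_length=512)
    # Summarize each coefficient over time (axis 1 indexes frames).
    features.loc[tid, ('mfcc', 'mean')] = m.mean(axis=1)
    features.loc[tid, ('mfcc', 'std')] = m.std(axis=1)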
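
The Todo above asks for a parallel implementation. One possible shape for it, sketched with `concurrent.futures` from the standard library: each track is processed by an independent function and the rows are assembled afterwards. `fake_frames` is a hypothetical stand-in for loading a track and computing its MFCCs (it returns random data), and the track IDs are made up. A thread pool keeps the sketch simple; a process pool may suit CPU-bound extraction better.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

N_COEFFS = 13  # same count as n_mfcc above


def fake_frames(tid):
    """Hypothetical stand-in for librosa.load + librosa.feature.mfcc.

    Returns random data of shape (N_COEFFS, n_frames), seeded by the track ID.
    """
    rng = np.random.default_rng(tid)
    return rng.normal(size=(N_COEFFS, 100))


def track_statistics(tid):
    """Return (tid, row) where row holds the per-coefficient means, then stds."""
    m = fake_frames(tid)
    return tid, np.concatenate([m.mean(axis=1), m.std(axis=1)])


track_ids = [2, 5, 10, 140]  # made-up track IDs

numbers = ['{:02d}'.format(i + 1) for i in range(N_COEFFS)]
columns = pd.MultiIndex.from_product(
    [['mfcc'], ['mean', 'std'], numbers],
    names=('feature', 'statistics', 'number'))

# Each track is independent, so the work can be mapped over a pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(track_statistics, track_ids))

rows = np.vstack([results[tid] for tid in track_ids])
parallel_features = pd.DataFrame(rows, index=track_ids, columns=columns)
parallel_features = parallel_features.astype(np.float32)
```

Returning plain arrays from the workers and building the frame once at the end avoids concurrent writes to a shared DataFrame.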

## 3 High-level features

* Pitch: PH/PCP, EPCP
* Rhythm: BH, BPM
* Harmony: CP, CH

## 4 Store features

In [None]:
# Slicing is faster when the indexes are sorted.
features.sort_index(axis=0, inplace=True)
features.sort_index(axis=1, inplace=True)

assert not features.isnull().values.any()

ndigits = 10
filename = os.path.join(DATA_DIR, 'features.csv')
features.to_csv(filename, float_format='%.{}e'.format(ndigits))

#features.to_json(os.path.join(DATA_DIR, 'features.json'), orient='split')
#features.to_hdf('features.hdf', 'features')
#features.to_hdf('features_zlib.hdf', 'features', complevel=9, complib='zlib')
#features.to_hdf('features_bzip2.hdf', 'features', complevel=9, complib='bzip2')
#features.to_hdf('features_lzo.hdf', 'features', complevel=9, complib='lzo')
#features.to_hdf('features_blosc.hdf', 'features', complevel=9, complib='blosc')

In [None]:
tmp = pd.read_csv(filename, index_col=0, header=[0, 1, 2])
np.testing.assert_allclose(tmp.values, features.values, rtol=10**-ndigits)
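
The same round trip can be tried without touching the disk. A small sketch, assuming a toy frame with two coefficients and made-up track IDs, that writes to an in-memory buffer and reads the three header rows back as a column `MultiIndex`:

```python
import io

import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [('mfcc', 'mean', '01'), ('mfcc', 'mean', '02'),
     ('mfcc', 'std', '01'), ('mfcc', 'std', '02')],
    names=('feature', 'statistics', 'number'))

# Toy data: 3 made-up tracks, random float32 values.
values = np.random.default_rng(0).normal(size=(3, 4)).astype(np.float32)
toy = pd.DataFrame(values, index=[2, 5, 10], columns=columns)

ndigits = 10
buffer = io.StringIO()
toy.to_csv(buffer, float_format='%.{}e'.format(ndigits))
buffer.seek(0)

# header=[0, 1, 2] tells pandas that the first three rows form the column index.
restored = pd.read_csv(buffer, index_col=0, header=[0, 1, 2])
np.testing.assert_allclose(restored.values, toy.values, rtol=10**-ndigits)
```

Writing `ndigits` digits after the decimal point in scientific notation is what makes the relative tolerance of `10**-ndigits` achievable.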

## 5 Spotify features

Todo: grab features through the Spotify API (formerly Echonest).