# FMA: A Dataset For Music Analysis

Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson, EPFL LTS2.

## Generation / Collection / Creation

Todo
* update counts

In [None]:
%load_ext autoreload
%autoreload 2

import pickle
import pandas as pd
import IPython.display as ipd

## 1 Retrieve metadata from FMA

* To add new tracks: iterate only from the largest stored `track_id` to the most recent one.
* To update user data: fetch all records again.
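
The incremental update described above can be sketched as follows. This is a minimal illustration, not the logic of `creation.py`: the helper `track_ids_to_query` is hypothetical, and it assumes that track IDs are integers assigned in increasing order.

```python
def track_ids_to_query(known_ids, most_recent_id):
    """Return the track IDs that have not been queried yet.

    Assumption: IDs are integers assigned in increasing order, so
    everything above the largest stored ID is new.
    """
    start = max(known_ids, default=0) + 1
    return range(start, most_recent_id + 1)


# Example: IDs up to 5 are stored and the newest track on the site is 8.
new_ids = list(track_ids_to_query([2, 3, 5], most_recent_id=8))
print(new_ids)
```

IDs below the stored maximum that were not found earlier are not retried by this sketch.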

In [None]:
# Script used to query the API for all tracks, albums and artists.
!cat creation.py

In [None]:
# converters={'genres': ast.literal_eval}
tracks = pd.read_csv('tracks_raw.csv', index_col=0)
albums = pd.read_csv('albums_raw.csv', index_col=0)
artists = pd.read_csv('artists_raw.csv', index_col=0)
genres = pd.read_csv('genres_raw.csv', index_col=0)

# Use a context manager so the file is closed after loading.
with open('not_found.pickle', 'rb') as f:
    not_found = pickle.load(f)

In [None]:
tmp = tracks.shape[0], len(not_found['tracks'])
print('tracks: {} collected ({} not found)'.format(*tmp))
tmp = albums.shape[0], len(not_found['albums'])
print('albums: {} collected ({} not found)'.format(*tmp))
tmp = artists.shape[0], len(not_found['artists'])
print('artists: {} collected ({} not found)'.format(*tmp))
print('genres: {} collected'.format(genres.shape[0]))

In [None]:
ipd.display(tracks.head(5))
ipd.display(albums.head(5))
ipd.display(artists.head(5))
ipd.display(genres.head(5))

## 2 Format metadata

* Columns that contain lists: genres, album_images, artist_images
* Fill `tracks.json` by iterating over all `track_id`.
* Fill `genres.json`

In [None]:
translation = {
    'track_id': 'track_id',
    'album_id': 'album_id',
    'artist_id': 'artist_id',
    'license_title': 'license'
}

In [None]:
# Root genres have no parent: encode the missing parent as 0, then cast to int.
genres['genre_parent_id'] = genres['genre_parent_id'].fillna(0).astype(int)

## 2 Download audio from FMA

1. Download the original `.mp3` for each stored `track_id`, only if it does not already exist.
    1. Verify the checksum for some random tracks.
1. Compute and store a checksum. With `sha1sum`, in a separate file?
1. Fill in metadata about the encoding: length, number of samples, sample rate, bit rate, channels (mono/stereo), bit depth (16 bits?).
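
The checksum step could look like the sketch below, which uses `hashlib` from the standard library and produces the same digest as the `sha1sum` command. The function name `sha1_of_file` and the file name `example.mp3` are placeholders chosen for illustration.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        # Read fixed-size chunks so large audio files are never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a temporary file standing in for a downloaded .mp3.
with tempfile.TemporaryDirectory() as tmp_dir:
    audio_path = os.path.join(tmp_dir, 'example.mp3')
    with open(audio_path, 'wb') as f:
        f.write(b'not really audio')
    checksum = sha1_of_file(audio_path)

print(checksum)
```

Storing the digests in a separate file keeps them usable for spot checks on randomly chosen tracks after later downloads.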

## 3 Data cleaning

* Missing audio or metadata (every audio file must have an entry in `tracks.csv` and vice versa)
* Duplicates
* Exclude songs that are not CC-licensed.
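
The first two checks can be expressed with plain set operations. The sketch below is an illustration only: `check_consistency` and `find_duplicates` are hypothetical helpers, and the ID lists are made up.

```python
from collections import Counter


def check_consistency(metadata_ids, audio_ids):
    """Compare the track IDs in the metadata with those found on disk."""
    metadata_ids, audio_ids = set(metadata_ids), set(audio_ids)
    return {
        # Listed in the metadata, but no audio file exists.
        'missing_audio': sorted(metadata_ids - audio_ids),
        # Audio file exists, but the metadata has no entry for it.
        'missing_metadata': sorted(audio_ids - metadata_ids),
    }


def find_duplicates(ids):
    """Return the IDs that occur more than once."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


report = check_consistency(metadata_ids=[1, 2, 3, 4], audio_ids=[2, 3, 4, 5])
duplicates = find_duplicates([1, 2, 2, 3, 3, 3])
print(report, duplicates)
```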

Genres
* Some genres have a `genre_parent_id` that refers to a genre which does not exist.

In [None]:
# 13 (Easy Listening) has parent 126 which is missing
# --> a root genre on the website, although not in the genre menu
genres.loc[13, 'genre_parent_id'] = 0

# 580 (Abstract Hip-Hop) has parent 1172 which is missing
# --> listed as child of Hip-Hop on the website
genres.loc[580, 'genre_parent_id'] = 21

# 810 (Nu-Jazz) has parent 51 which is missing
# --> listed as child of Easy Listening on website
genres.loc[810, 'genre_parent_id'] = 13
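
After such manual fixes, a quick check can confirm that no dangling parents remain. The sketch below runs on a toy table, not on the real `genres` data: it assumes the table is indexed by genre ID and that 0 marks a root genre, as in the cells above. The index name `genre_id` and the toy rows are illustrative.

```python
import pandas as pd

# Toy table with the same layout as `genres` (index: genre ID).
toy_genres = pd.DataFrame(
    {'genre_parent_id': [0, 0, 21, 13, 999]},
    index=pd.Index([13, 21, 580, 810, 42], name='genre_id'),
)

# A parent is valid if it is 0 (root) or an existing genre ID.
parents = toy_genres['genre_parent_id']
orphans = toy_genres[(parents != 0) & ~parents.isin(toy_genres.index)]
print(orphans)
```

An empty `orphans` frame means every non-root genre points to an existing parent.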

## 4 Splits: train, validation, test

Take into account:
* Each artist may appear in only one of the splits.
* Stratification: all characteristics (sampling rates) should be distributed equally across the splits.
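
The artist constraint can be met by assigning whole artists, not individual tracks, to a split. The following is a minimal sketch under stated assumptions: the 80/10/10 fractions, the split names and the helper `split_by_artist` are illustrative, and stratification is not handled.

```python
import random


def split_by_artist(track_artists, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign each track to 'training', 'validation' or 'test' by artist.

    `track_artists` maps track_id -> artist_id. All tracks of an artist
    end up in the same split. The fractions apply to artists, not tracks.
    """
    artists = sorted(set(track_artists.values()))
    random.Random(seed).shuffle(artists)
    n_train = round(fractions[0] * len(artists))
    n_val = round(fractions[1] * len(artists))
    split_of_artist = {}
    for i, artist in enumerate(artists):
        if i < n_train:
            split_of_artist[artist] = 'training'
        elif i < n_train + n_val:
            split_of_artist[artist] = 'validation'
        else:
            split_of_artist[artist] = 'test'
    return {t: split_of_artist[a] for t, a in track_artists.items()}


# Toy data: 10 artists with 3 tracks each.
track_artists = {track: track // 3 for track in range(30)}
splits = split_by_artist(track_artists)
```

Because artists differ in how many tracks they have, the track-level proportions only approximate the requested fractions.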

## 5 Subsets: large, medium, small

* Select the subsets.
* Clip all tracks.
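
Clipping needs a rule for where the excerpt starts. The sketch below takes the clip from the middle of the track, which is an assumption made here for illustration. The 30-second length and the helper name `clip_window` are likewise assumptions, not a description of the actual procedure.

```python
def clip_window(duration, clip_length=30.0):
    """Return (start, end) in seconds of a clip centred in the track.

    Returns None when the track is shorter than the clip, so that such
    tracks can be listed and handled separately.
    """
    if duration < clip_length:
        return None
    start = (duration - clip_length) / 2
    return start, start + clip_length


print(clip_window(100.0))
print(clip_window(20.0))
```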

In [None]:
# Songs shorter than 30s

## 6 Store

* Fill the archives and compute their checksum.
    * Tool: zipfile
* Set permissions and creation/modification/access times.
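
A sketch of these steps with the standard library is given below. The helper `make_archive`, the placeholder file contents, the `0o644` mode and the timestamp are all illustrative assumptions. Note that `os.utime` sets only the access and modification times; the creation time cannot be set portably.

```python
import hashlib
import os
import tempfile
import zipfile


def make_archive(archive_path, files, root):
    """Store `files` (paths relative to `root`) in a ZIP and return its SHA-1."""
    with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(files):
            archive.write(os.path.join(root, name), arcname=name)
    digest = hashlib.sha1()
    with open(archive_path, 'rb') as f:
        # Hash in chunks so large archives are never fully in memory.
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as root:
    # Placeholder files standing in for the real metadata tables.
    for name in ('tracks.csv', 'genres.csv'):
        with open(os.path.join(root, name), 'w') as f:
            f.write('placeholder\n')
    archive_path = os.path.join(root, 'fma_metadata.zip')
    checksum = make_archive(archive_path, ['tracks.csv', 'genres.csv'], root)
    os.chmod(archive_path, 0o644)  # illustrative permissions
    os.utime(archive_path, (1_500_000_000, 1_500_000_000))  # (atime, mtime)
    with zipfile.ZipFile(archive_path) as archive:
        stored = archive.namelist()
    mtime = os.path.getmtime(archive_path)

print(stored, checksum)
```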


* `fma_metadata.zip`
    * `tracks.csv`
    * `genres.csv`
* `fma_features.zip`
    * `features.csv`
    * `echonest.csv`
* `fma_full.zip`
* `fma_large.zip`
* `fma_medium.zip`
* `fma_small.zip` (30G full length --> 3.4GiB)

In [None]:
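# Sort rows and columns for a deterministic layout, then write each table to CSV.
datasets = {'tracks': tracks, 'albums': albums, 'artists': artists, 'genres': genres}
for name, df in datasets.items():
    df.sort_index(axis=0, inplace=True)
    df.sort_index(axis=1, inplace=True)
    df.to_csv(name + '.csv')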