# FMA: A Dataset For Music Analysis

Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson, EPFL LTS2.

## Generation / Collection / Creation

From `raw_*.csv`, this notebook generates:
* `tracks.csv`: per-track / album / artist metadata.
* `genres.csv`: genre hierarchy.

A companion script, `creation.py`, performs the following steps:
1. Download the metadata through the API and store it in `raw_tracks.csv`, `raw_albums.csv`, `raw_artists.csv` and `raw_genres.csv`.
2. Download the full audio for each track.
3. Trim the audio to 30s clips.
4. Normalize the permissions and modification / access times.
5. Create the `.zip` archives.
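
Step 4 can be sketched with the standard library alone. This is a minimal illustration, not the code of `creation.py`: the function name `normalize_tree` is hypothetical, and the mode and timestamp are parameters because the values actually used are not stated here.

```python
import os

def normalize_tree(root, mode, timestamp):
    """Give every regular file under root the same mode, atime and mtime.

    mode and timestamp are chosen by the caller (hypothetical values).
    Returns the number of files touched.
    """
    count = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            os.utime(path, (timestamp, timestamp))  # (access, modification)
            os.chmod(path, mode)
            count += 1
    return count
```

A call such as `normalize_tree('/path/to/fma', 0o444, 0)` would make every file read-only with a fixed timestamp, so that two archives built from identical content do not differ because of file metadata.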

In [None]:
%load_ext autoreload
%autoreload 2

import os
import ast
import pickle
import numpy as np
import pandas as pd
import IPython.display as ipd

In [None]:
import utils
AUDIO_DIR = os.environ.get('AUDIO_DIR')
BASE_DIR = os.path.abspath(os.path.dirname(AUDIO_DIR))
FMA_FULL = os.path.join(BASE_DIR, 'fma_full')
FMA_LARGE = os.path.join(BASE_DIR, 'fma_large')
FMA_MEDIUM = os.path.join(BASE_DIR, 'fma_medium')
FMA_SMALL = os.path.join(BASE_DIR, 'fma_small')

## 1 Retrieve metadata and audio from FMA

1. Crawl the tracks, albums and artists metadata through their [API](https://freemusicarchive.org/api).
2. Download the original `.mp3` over HTTPS for each track id (only if it does not already exist).

Todo:
* Scrape curators.
* Download images (`track_image_file`, `album_image_file`, `artist_image_file`). Beware the quality.
* Verify checksum for some random tracks.

Examples:
* To add new tracks: iterate from largest known track id to the most recent only.
* To update user data: get them all again.
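
The first example, adding new tracks, amounts to resuming the crawl above the largest id already collected. A minimal sketch, assuming ids are positive integers and that the most recent id is known from elsewhere (`ids_to_fetch` and `most_recent_id` are hypothetical names, not part of `creation.py`):

```python
def ids_to_fetch(known_ids, most_recent_id):
    """Return the candidate track ids above the largest known one.

    Not every id in the range necessarily exists; a crawler would
    record the missing ones rather than fail.
    """
    start = max(known_ids, default=0) + 1  # assumes ids start at 1
    return range(start, most_recent_id + 1)

print(list(ids_to_fetch([2, 3, 5, 10], 13)))  # [11, 12, 13]
```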

In [None]:
# Script used to query the API for all tracks, albums and artists.
# Then to download the audio through HTTPS.

# ./creation.py metadata
# ./creation.py data /path/to/fma/fma_full

#!cat creation.py

In [None]:
# converters={'genres': ast.literal_eval}
tracks = pd.read_csv('raw_tracks.csv', index_col=0)
albums = pd.read_csv('raw_albums.csv', index_col=0)
artists = pd.read_csv('raw_artists.csv', index_col=0)
genres = pd.read_csv('raw_genres.csv', index_col=0)

not_found = pickle.load(open('not_found.pickle', 'rb'))

In [None]:
def get_fs_tids(audio_dir):
    tids = []
    for _, dirnames, files in os.walk(audio_dir):
        if dirnames == []:
            tids.extend(int(file[:-4]) for file in files)
    return tids

audio_tids = get_fs_tids(FMA_FULL)
clips_tids = get_fs_tids(FMA_LARGE)

In [None]:
tmp = tracks.shape[0], len(not_found['tracks'])
print('tracks: {} collected ({} not found)'.format(*tmp))
tmp = albums.shape[0], len(not_found['albums']), len(tracks['album_id'].unique())
print('albums: {} collected ({} not found, {} in tracks)'.format(*tmp))
tmp = artists.shape[0], len(not_found['artists']), len(tracks['artist_id'].unique())
print('artists: {} collected ({} not found, {} in tracks)'.format(*tmp))
print('genres: {} collected'.format(genres.shape[0]))
print('audio: {} collected ({} not found)'.format(len(audio_tids), len(not_found['audio'])))
print('clips: {} collected ({} not found)'.format(len(clips_tids), len(not_found['clips'])))
# Count the matching tracks with sum(): len() of the boolean mask is always len(tracks.index).
assert sum(tracks.index.isin(audio_tids)) + len(not_found['audio']) == len(tracks.index)
assert sum(tracks.index.isin(clips_tids)) + len(not_found['clips']) == sum(tracks.index.isin(audio_tids))

In [None]:
N = 5
ipd.display(tracks.head(N))
ipd.display(albums.head(N))
ipd.display(artists.head(N))
ipd.display(genres.head(N))

## 2 Format metadata

* Columns that are lists: genres, album_images, artist_images.
* Fill `tracks.json` by iterating over all `track_id`.
* Fill `genres.json`.
* Fill metadata about the encoding: length, number of samples, sample rate, bit rate, channels (mono/stereo), 16 bits?

Todo:
* Sanitize values, e.g. list of words for tags, valid links in `artist_wikipedia_page`.
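
List-valued columns are stored in the CSV files as the string form of a Python list, so they have to be parsed back before use. The sketch below shows the idea with the standard library only; the rows are made up for illustration and `parse_genre_ids` is a hypothetical helper that mirrors what `convert_genres` does further down.

```python
import ast
import csv
import io

# Made-up rows that mimic how a list-valued column is serialized.
RAW = '''track_id,track_genres
2,"[{'genre_id': '21'}]"
3,"[]"
4,
'''

def parse_genre_ids(cell):
    """Turn the string form of a list of genre dicts into integer ids."""
    if not cell:  # empty cell: no genre information
        return []
    return [int(genre['genre_id']) for genre in ast.literal_eval(cell)]

rows = list(csv.DictReader(io.StringIO(RAW)))
genre_ids = {int(row['track_id']): parse_genre_ids(row['track_genres'])
             for row in rows}
print(genre_ids)  # {2: [21], 3: [], 4: []}
```

`ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for data that comes from an external API.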

In [None]:
df, column = tracks, 'tags'
null = sum(df[column].isnull())
print('{} null, {} non-null'.format(null, df.shape[0] - null))
df[column].value_counts().head(10)

### 2.1 Tracks

In [None]:
drop = [
    'license_image_file', 'license_image_file_large', 'license_parent_id', 'license_url',  # keep title only
    'track_file', 'track_image_file',  # used to download only
    'track_url', 'album_url', 'artist_url',  # only relevant on website
    'track_copyright_c', 'track_copyright_p',  # present for ~1000 tracks only
    # 'track_composer', 'track_lyricist', 'track_publisher',  # present for ~4000, <1000 and <2000 tracks
    'track_disc_number',  # different from 1 for <1000 tracks
    'track_explicit', 'track_explicit_notes',  # present for <4000 tracks
    'track_instrumental'  # ~6000 tracks have a 1, there is an instrumental genre
]
tracks.drop(drop, axis=1, inplace=True)
tracks.rename(columns={'license_title': 'track_license', 'tags': 'track_tags'}, inplace=True)

In [None]:
def convert_duration(x):
    times = x.split(':')
    seconds = int(times[-1])
    minutes = int(times[-2])
    try:
        minutes += 60 * int(times[-3])
    except IndexError:
        pass
    return seconds + 60 * minutes

tracks['track_duration'] = tracks['track_duration'].map(convert_duration)

In [None]:
def convert_datetime(df, column, format=None):
    df[column] = pd.to_datetime(df[column], infer_datetime_format=True, format=format)
convert_datetime(tracks, 'track_date_created')
convert_datetime(tracks, 'track_date_recorded')

In [None]:
tracks['album_id'].fillna(-1, inplace=True)
tracks['track_bit_rate'].fillna(-1, inplace=True)
tracks = tracks.astype({'album_id': int, 'track_bit_rate': int})

In [None]:
def convert_genres(genres):
    genres = ast.literal_eval(genres)
    return [int(genre['genre_id']) for genre in genres]

tracks['track_genres'].fillna('[]', inplace=True)
tracks['track_genres'] = tracks['track_genres'].map(convert_genres)

In [None]:
tracks.columns

### 2.2 Albums

In [None]:
drop = [
    'artist_name', 'album_url', 'artist_url',  # in tracks already (though it can be different)
    'album_handle',
    'album_image_file', 'album_images',  # todo: shall be downloaded
    #'album_producer', 'album_engineer',  # present for ~2400 albums only
]
albums.drop(drop, axis=1, inplace=True)
albums.rename(columns={'tags': 'album_tags'}, inplace=True)

In [None]:
convert_datetime(albums, 'album_date_created')
convert_datetime(albums, 'album_date_released')

In [None]:
albums.columns

### 2.3 Artists

In [None]:
drop = [
    'artist_website', 'artist_url',  # in tracks already (though it can be different)
    'artist_handle',
    'artist_image_file', 'artist_images',  # todo: shall be downloaded
    'artist_donation_url', 'artist_paypal_name', 'artist_flattr_name',  # ~1600 & ~400 & ~70, not relevant
    'artist_contact',  # ~1500, not very useful data
    # 'artist_active_year_begin', 'artist_active_year_end',  # ~1400, ~500 only
    # 'artist_associated_labels',  # ~1000
    # 'artist_related_projects',  # only ~800, but can be combined with bio
]
artists.drop(drop, axis=1, inplace=True)
artists.rename(columns={'tags': 'artist_tags'}, inplace=True)

In [None]:
convert_datetime(artists, 'artist_date_created')
for column in ['artist_active_year_begin', 'artist_active_year_end']:
    artists[column].replace(0.0, np.nan, inplace=True)
    convert_datetime(artists, column, format='%Y.0')

In [None]:
artists.columns

### 2.4 Merge DataFrames

In [None]:
not_found['albums'].remove(None)
not_found['albums'].append(-1)
not_found['albums'] = [int(i) for i in not_found['albums']]
not_found['artists'] = [int(i) for i in not_found['artists']]

In [None]:
tracks = tracks.merge(albums, left_on='album_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))

n = sum(tracks['album_title_dup'].isnull())
print('{} tracks without extended album information ({} tracks without album_id)'.format(
    n, sum(tracks['album_id'] == -1)))
assert sum(tracks['album_id'].isin(not_found['albums'])) == n
assert sum(tracks['album_title'] != tracks['album_title_dup']) == n

tracks.drop('album_title_dup', axis=1, inplace=True)
assert not any('dup' in col for col in tracks.columns)

In [None]:
# Album artist can be different than track artist. Keep track artist.
#tracks[tracks['artist_name'] != tracks['artist_name_dup']].select(lambda x: 'artist_name' in x, axis=1)

In [None]:
tracks = tracks.merge(artists, left_on='artist_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))

n = sum(tracks['artist_name_dup'].isnull())
print('{} tracks without extended artist information'.format(n))
assert sum(tracks['artist_id'].isin(not_found['artists'])) == n
assert sum(tracks['artist_name'] != tracks['artist_name_dup']) == n

tracks.drop('artist_name_dup', axis=1, inplace=True)
assert not any('dup' in col for col in tracks.columns)

In [None]:
columns = []
for name in tracks.columns:
    names = name.split('_')
    columns.append((names[0], '_'.join(names[1:])))
tracks.columns = pd.MultiIndex.from_tuples(columns)
assert all(label in ['track', 'album', 'artist'] for label in tracks.columns.get_level_values(0))

In [None]:
# Todo: fill other columns ?
tracks['album', 'tags'].fillna('[]', inplace=True)
tracks['artist', 'tags'].fillna('[]', inplace=True)

columns = [('album', 'favorites'), ('album', 'comments'), ('album', 'listens'), ('album', 'tracks'),
           ('artist', 'favorites'), ('artist', 'comments')]
for column in columns:
    tracks[column].fillna(-1, inplace=True)
columns = {column: int for column in columns}
tracks = tracks.astype(columns)

## 3 Data cleaning

Todo
* Duplicates (metadata and audio)

In [None]:
def keep(index, df):
    old = len(df)
    df = df.loc[index]
    new = len(df)
    print('{} lost, {} left'.format(old - new, new))
    return df

tracks = keep(tracks.index, tracks)

In [None]:
# Audio not found.
tracks = keep(tracks.index.difference(not_found['audio']), tracks)

In [None]:
# License forbids redistribution.
tracks = keep(tracks['track', 'license'] != 'FMA-Limited: Download Only', tracks)
print('{} licenses'.format(len(tracks[('track', 'license')].unique())))

In [None]:
#sum(tracks['track', 'title'].duplicated())

## 4 Genres

In [None]:
genres.drop(['genre_handle', 'genre_color'], axis=1, inplace=True)
genres.rename(columns={'genre_parent_id': 'parent_id', 'genre_title': 'title'}, inplace=True)

In [None]:
genres['parent_id'].fillna(0, inplace=True)
genres = genres.astype({'parent_id': int})

In [None]:
# 13 (Easy Listening) has parent 126 which is missing
# --> a root genre on the website, although not in the genre menu
genres.at[13, 'parent_id'] = 0

# 580 (Abstract Hip-Hop) has parent 1172 which is missing
# --> listed as child of Hip-Hop on the website
genres.at[580, 'parent_id'] = 21

# 810 (Nu-Jazz) has parent 51 which is missing
# --> listed as child of Easy Listening on website
genres.at[810, 'parent_id'] = 13

# 763 (Holiday) has parent 763 which is itself
# --> listed as child of Sound Effects on website
genres.at[763, 'parent_id'] = 16

# Todo: should Novelty be under Experimental? It is alone on the website.

In [None]:
# Genre 806 (hiphop) should not exist. Replace it by 21 (Hip-Hop).
print('{} tracks have genre 806'.format(
    sum(tracks['track', 'genres'].map(lambda genres: 806 in genres))))
def change_genre(genres):
    return [genre if genre != 806 else 21 for genre in genres]
tracks['track', 'genres'] = tracks['track', 'genres'].map(change_genre)
genres.drop(806, inplace=True)

In [None]:
# Be sure all genres are in genres.csv.
tracks_genres = set()
for row in tracks['track', 'genres'].items():
    tracks_genres.update(row[1])
print('{} genres in dataset'.format(len(tracks_genres)))
assert tracks_genres.issubset(genres.index)
# Genres that no track uses (.loc needs a list, not a set).
genres.loc[list(set(genres.index).difference(tracks_genres))]

In [None]:
# Number of tracks per genre.
genres['#tracks'] = 0
for row in tracks['track', 'genres'].items():
    for genre in row[1]:
        genres.at[genre, '#tracks'] += 1

In [None]:
def get_parent(genre):
    if genre != 0:
        get_parent(genres.at[genre, 'parent_id'])
        track_genres.append(genre)

# Cumulative number of tracks per genre.
genres['#tracks_cumulated'] = 0
for row in tracks['track', 'genres'].items():
    track_genres = list()
    for genre in row[1]:
        get_parent(genre)
    for genre in set(track_genres):
        genres.at[genre, '#tracks_cumulated'] += 1

In [None]:
def get_parent(genre):
    parent = genres.at[genre, 'parent_id']
    if parent == 0:
        return genre
    else:
        return get_parent(parent)

def get_top_genre(track_genres):
    top_genres = set(genres.at[genres.at[genre, 'top_level'], 'title'] for genre in track_genres)
    return top_genres.pop() if len(top_genres) == 1 else np.nan

# Top-level genre.
genres['top_level'] = genres.index.map(get_parent)
tracks['track', 'top_genre'] = tracks['track', 'genres'].map(get_top_genre)

In [None]:
genres.head(10)

## 5 Splits: train, validation, test

Take into account:
* Each artist may appear in only one split.
* Stratification: all characteristics (sampling rates) should be distributed equally.
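
The artist constraint can be met by assigning whole artists, rather than individual tracks, to a split. The following is a minimal sketch and not the method used for the dataset: `split_by_artist`, the 80/10/10 fractions and the greedy balancing are all assumptions, and stratification is not handled.

```python
import random
from collections import defaultdict

def split_by_artist(track_artists, fractions=(0.8, 0.1, 0.1), seed=0):
    """Map each track id to a split so that no artist spans two splits.

    track_artists maps track id -> artist id.
    """
    by_artist = defaultdict(list)
    for track_id, artist_id in track_artists.items():
        by_artist[artist_id].append(track_id)

    artists = sorted(by_artist)
    random.Random(seed).shuffle(artists)  # reproducible order

    names = ('training', 'validation', 'test')
    targets = {n: f * len(track_artists) for n, f in zip(names, fractions)}
    sizes = dict.fromkeys(names, 0)
    assignment = {}
    for artist in artists:
        # Greedy: pick the split that is furthest below its target size.
        name = max(names, key=lambda n: targets[n] - sizes[n])
        for track_id in by_artist[artist]:
            assignment[track_id] = name
        sizes[name] += len(by_artist[artist])
    return assignment
```

Because artists differ in how many tracks they have, the resulting split sizes only approximate the requested fractions.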

## 6 Subsets: large, medium, small

Todo:
* update duration from `ffmpeg`
* all files listed in `not_found['clips']` should have length 0 and can be removed
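
The second item could be checked on disk before removing anything. A minimal sketch, assuming that "length 0" means a zero-byte file (`empty_files` is a hypothetical helper):

```python
import os

def empty_files(root, extension='.mp3'):
    """Return the sorted paths of zero-byte files with the given extension."""
    found = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(extension) and os.path.getsize(path) == 0:
                found.append(path)
    return sorted(found)
```

The track ids recovered from these paths could then be compared with `not_found['clips']` before deleting the files.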

In [None]:
# ./creation.py clips /path/to/fma

### 6.1 Large

Main characteristic: the full set with clips trimmed to a manageable size.

In [None]:
fma_large = pd.DataFrame(tracks)
fma_large = keep(fma_large['track', 'duration'] > 30, fma_large)
fma_large = keep(fma_large.index.difference(not_found['clips']), fma_large)

### 6.2 Medium

Main characteristic: clean metadata (includes 1 top-level genre) and quality audio.

In [None]:
fma_medium = pd.DataFrame(fma_large)

In [None]:
# Missing meta-information.

# Missing extended album and artist information.
fma_medium = keep(~fma_medium['album', 'id'].isin(not_found['albums']), fma_medium)
fma_medium = keep(~fma_medium['artist', 'id'].isin(not_found['artists']), fma_medium)

# Untitled track or album.
fma_medium = keep(~fma_medium['track', 'title'].isnull(), fma_medium)
fma_medium = keep(fma_medium['track', 'title'].map(lambda x: 'untitled' in x.lower()) == False, fma_medium)
fma_medium = keep(fma_medium['album', 'title'].map(lambda x: 'untitled' in x.lower()) == False, fma_medium)

# One tag is often just the artist name. Tags too scarce for tracks and albums.
#keep(fma_medium['artist', 'tags'].map(len) >= 2, fma_medium)

# Too scarce.
#fma_medium = keep(~fma_medium['album', 'information'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'bio'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'website'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'wikipedia_page'].isnull(), fma_medium)

# Too scarce.
#fma_medium = keep(~fma_medium['artist', 'location'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'latitude'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'longitude'].isnull(), fma_medium)

In [None]:
# Technical quality.
# Todo: sample rate
fma_medium = keep(fma_medium['track', 'bit_rate'] > 100000, fma_medium)

# Choosing standard bit rates discards all VBR.
#fma_medium = keep(fma_medium['track', 'bit_rate'].isin([320000, 256000, 192000, 160000, 128000]), fma_medium)

In [None]:
fma_medium = keep(fma_medium['track', 'duration'] >= 60, fma_medium)
fma_medium = keep(fma_medium['track', 'duration'] <= 600, fma_medium)

fma_medium = keep(fma_medium['album', 'tracks'] >= 1, fma_medium)
fma_medium = keep(fma_medium['album', 'tracks'] <= 50, fma_medium)

In [None]:
# Lower popularity bound.
fma_medium = keep(fma_medium['track', 'listens'] >= 100, fma_medium)
fma_medium = keep(fma_medium['track', 'interest'] >= 200, fma_medium)
fma_medium = keep(fma_medium['album', 'listens'] >= 1000, fma_medium);

# Favorites and comments are very scarce.
#fma_medium = keep(fma_medium['artist', 'favorites'] >= 1, fma_medium)

In [None]:
# Targeted genre classification.
fma_medium = keep(~fma_medium['track', 'top_genre'].isnull(), fma_medium);
#keep(fma_medium['track', 'top_genres'].map(len) == 1, fma_medium);
#keep(fma_medium['track', 'genres'].map(len) == 1, fma_medium);

In [None]:
# Todo: kill some top-level genres ? Like Easy Listening.

In [None]:
# Adjust size with popularity measure. Should be of better quality.
N_TRACKS = 25000

# Observations
# * More albums killed than artists --> be sure not to kill diversity
# * Favors and penalizes genres differently --> do it per genre?
# Normalization
# * mean, median, std, max
# * tracks per album or artist
# Test
# * 4/5 of same tracks were selected with various set of measures
# * <5% diff with max and mean

popularity_measures = [('track', 'listens'), ('track', 'interest')]  # ('album', 'listens')
# ('track', 'favorites'), ('track', 'comments'),
# ('album', 'favorites'), ('album', 'comments'),
# ('artist', 'favorites'), ('artist', 'comments'),

normalization = {measure: fma_medium[measure].max() for measure in popularity_measures}
def popularity_measure(track):
    return sum(track[measure] / normalization[measure] for measure in popularity_measures)
fma_medium['popularity_measure'] = fma_medium.apply(popularity_measure, axis=1)
fma_medium = keep(fma_medium.sort_values('popularity_measure', ascending=False).index[:N_TRACKS], fma_medium)

In [None]:
tmp = genres[genres['parent_id'] == 0].reset_index().set_index('title')
tmp['#tracks_medium'] = fma_medium['track', 'top_genre'].value_counts()
tmp.sort_values('#tracks_medium', ascending=False)

### 6.3 Small

Main characteristic: genre balanced (and echonest features).

Choices:
* 8 genres with 1000 tracks --> 8,000 tracks
* 10 genres with 500 tracks --> 5,000 tracks

Todo:
* Download more echonest features so that all tracks can have them. Otherwise intersection of tracks with echonest features and one top-level genre is too small.
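
Either choice boils down to taking the same number of tracks from each of the largest genres. The sketch below illustrates that selection in plain Python on `(track_id, genre, popularity)` tuples; `balanced_subset` is a hypothetical helper, and ranking by popularity within a genre is an assumption taken from the way the medium subset is trimmed.

```python
from collections import defaultdict

def balanced_subset(tracks, n_genres, n_tracks):
    """Pick the n_tracks most popular tracks of each of the n_genres largest genres.

    tracks is an iterable of (track_id, genre, popularity) tuples.
    Returns a dict mapping genre -> list of selected track ids.
    """
    by_genre = defaultdict(list)
    for track_id, genre, popularity in tracks:
        by_genre[genre].append((popularity, track_id))

    largest = sorted(by_genre, key=lambda g: len(by_genre[g]), reverse=True)
    selected = {}
    for genre in largest[:n_genres]:
        ranked = sorted(by_genre[genre], reverse=True)[:n_tracks]
        selected[genre] = [track_id for _, track_id in ranked]
    return selected
```

With `n_genres=8` and `n_tracks=1000` this yields 8 × 1,000 = 8,000 tracks, provided every selected genre has at least 1,000 candidates.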

In [None]:
N_GENRES = 8
N_TRACKS = 1000

top_genres = tmp.sort_values('#tracks_medium', ascending=False)[:N_GENRES].index
fma_small = pd.DataFrame(fma_medium)
fma_small = keep(fma_small['track', 'top_genre'].isin(top_genres), fma_small)

In [None]:
for genre in top_genres:
    subset = fma_small[fma_small['track', 'top_genre'] == genre]
    drop = subset.sort_values('popularity_measure').index[:-N_TRACKS]
    fma_small.drop(drop, inplace=True)
assert len(fma_small) == N_GENRES * N_TRACKS

In [None]:
echonest = pd.read_csv('echonest.csv', index_col=0, header=[0, 1, 2])
echonest = keep(echonest.index, echonest)
echonest = keep(~echonest['echonest', 'temporal_features'].isnull().any(axis=1), echonest)
echonest = keep(~echonest['echonest', 'audio_features'].isnull().any(axis=1), echonest)
echonest = keep(~echonest['echonest', 'social_features'].isnull().any(axis=1), echonest)

keep(fma_large.index.isin(echonest.index), fma_large);
keep(fma_medium.index.isin(echonest.index), fma_medium);
keep(fma_small.index.isin(echonest.index), fma_small);

### 6.4 Subset indication

In [None]:
tracks['set', 'subset'] = 'full'
tracks.loc[fma_large.index, ('set', 'subset')] = 'large'
tracks.loc[fma_medium.index, ('set', 'subset')] = 'medium'
tracks.loc[fma_small.index, ('set', 'subset')] = 'small'

## 7 Store

* Fill the archives and compute their checksum.
    * Tool: zipfile
* Set permissions and creation/modification/access times.

Todo:
* Checksum for each individual file? Store output of sha1sum in another file.
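
A per-file checksum list can be produced with `hashlib` in the two-column layout that `sha1sum -c` reads. This is a minimal sketch of the todo item, not existing project code: `sha1_of_file` and `write_checksums` are hypothetical helpers, and the output file name `checksums` is taken from the README text below.

```python
import hashlib
import os

def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def write_checksums(root, output='checksums'):
    """Write one '<sha1>  <relative path>' line per file under root."""
    entries = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            if rel != output:  # do not checksum the list itself
                entries.append((rel, sha1_of_file(path)))
    with open(os.path.join(root, output), 'w') as f:
        for rel, digest in sorted(entries):
            f.write('{}  {}\n'.format(digest, rel))
    return len(entries)
```

Running `sha1sum -c checksums` from inside `root` would then verify every listed file.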

Directory structure:
* `fma_metadata.zip`
    * `tracks.csv`
    * `genres.csv`
* `fma_features.zip`
    * `features.csv`
    * `echonest.csv`
* `fma_full.zip` (967 GiB, last collection by Kirell in April 2016 was 752 GiB)
* `fma_large.zip` (97 GiB)
* `fma_medium.zip`
* `fma_small.zip` (30G full length --> 3.4GiB)

In [None]:
for name, dataset in ('tracks', tracks), ('genres', genres):
    dataset.sort_index(axis=0, inplace=True)
    dataset.sort_index(axis=1, inplace=True)
    dataset.to_csv(name + '.csv')

In [None]:
README = """This .zip archive is part of the FMA, a dataset for music analysis.
Code & data: https://github.com/mdeff/fma
Paper: https://arxiv.org/abs/1612.01840

Each .mp3 is licensed by the corresponding artist listed in tracks.csv.

You can verify the integrity of the uncompressed files with sha1sum -c checksums.
Its SHA-1 checksum should be XXXXXXXXXXXXXX
"""
#for dst in next(os.walk(BASE_DIR))[1]:
#    dst = os.path.join(BASE_DIR, dst, 'README')
#    try:
#        os.chmod(dst, 0o666)
#    except:
#        pass
#    with open(dst, 'w') as f:
#        f.write(README)

In [None]:
# ./creation.py normalize /path/to/fma
# ./creation.py zip

In [None]:
# Create .zip archives.
# TODO: use zipfile
#shutil.make_archive(DST_DIR, 'zip', par_dir, ARCHIVE)
#os.utime(DST_DIR + '.zip', (TIME, TIME))
#os.chmod(DST_DIR + '.zip', 0o444)

## 8 Description

Todo:
* verify all dtypes

In [None]:
tracks = utils.load('tracks.csv')
tracks.dtypes

In [None]:
for subset in tracks['set', 'subset'].unique():
    print('{:6} {:6} tracks'.format(subset, sum(tracks['set', 'subset'] <= subset)))

In [None]:
N = 5
ipd.display(tracks['track'].head(N))
ipd.display(tracks['album'].head(N))
ipd.display(tracks['artist'].head(N))