## Generating file lists for various Dunya collections (for mode and rhythm mode recognition experiments)

This notebook accesses Dunya data and collects file lists for several collections.

The main aim is to create subsets of data for mode and rhythm mode recognition experiments. The process is split into two notebooks. In this first notebook, we create a list of recordings and relevant metadata. For each recording, the following information is included:
- Files available for that recording
- MusicBrainz id (mbid)
- Mode information (raga, makam, etc.)
- Rhythm mode information (tala, usul, etc.)

The second notebook (formExpSubsets4ModeRecognition.ipynb) then reads this file and forms the subsets by grouping recordings by mode or rhythm mode, while also checking which files (e.g. tonic annotation) are available for each recording.
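
As a rough illustration of the grouping step described above, here is a minimal sketch. It assumes each record is a dict with an `mbid` key and optional `mode` and `rhythmMode` keys, which is the format produced later in this notebook. The function name `group_by_field` and the sample records are hypothetical and are not taken from formExpSubsets4ModeRecognition.ipynb.

```python
import collections


def group_by_field(records, field):
    """Group recording MBIDs by the value stored under `field`.

    Records that lack the field are skipped.
    """
    groups = collections.defaultdict(list)
    for rec in records:
        value = rec.get(field)
        if value is not None:
            groups[value].append(rec['mbid'])
    return dict(groups)


# Hypothetical records, for illustration only (not real Dunya data)
sample_records = [
    {'mbid': 'id-1', 'mode': 'modeA', 'rhythmMode': 'rhythmX'},
    {'mbid': 'id-2', 'mode': 'modeA'},
    {'mbid': 'id-3', 'rhythmMode': 'rhythmX'},
]

by_mode = group_by_field(sample_records, 'mode')
by_rhythm = group_by_field(sample_records, 'rhythmMode')
```

Grouping on MBIDs rather than whole records keeps the subsets small and lets a later step look up available files per recording.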

In [None]:
# Set your token here from https://dunya.compmusic.upf.edu/user/profile/
token = '...yourAPITokenGoesHere...'

In [None]:
import codecs
import json, os, sys
import pickle
import csv
import time
import datetime
import collections

import numpy as np

import compmusic
from compmusic import dunya as dn
from compmusic.dunya import hindustani as hi
from compmusic.dunya import carnatic as ca
from compmusic.dunya import makam as ma
from compmusic.dunya import docserver as ds
from compmusic import musicbrainz
from compmusic.dunya import conn

dn.set_token(token)

### Collecting files of three collections: Carnatic, Hindustani and Makam

In Dunya, data is stored according to a model specific to each culture. For cross-cultural studies (such as testing a mode recognition algorithm on all Dunya collections), one needs to access all collections in a unified way. We access the data of each culture-specific collection and arrange it in a consistent format for further analysis. This list can then be processed to create data subsets for automatic recognition experiments.

In [None]:
# Set to None to get all files
maxNumFiles = None 

These methods get only the mode and rhythm mode information that we require from each collection. We rename the attributes to be consistent across all collections.
We only consider the first value of each field. If several modes are available for a recording, you may want to alter the code to check all of them and to treat recordings with more than one distinct mode differently.
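
One possible way to check all modes rather than only the first is sketched below. It assumes each field is a list of dicts carrying a name key (`common_name` or `name`, as in the functions that follow). The helper names `distinct_names` and `summarize_modes`, the output keys `allModes` and `multiMode`, and the sample entries are hypothetical additions for illustration.

```python
def distinct_names(entries, key):
    """Return the distinct values stored under `key`, in first-seen order."""
    names = []
    for entry in entries:
        name = entry.get(key)
        if name is not None and name not in names:
            names.append(name)
    return names


def summarize_modes(entries, key):
    """Keep the first mode, but also record all distinct modes and flag
    recordings that carry more than one."""
    names = distinct_names(entries, key)
    return {
        'mode': names[0] if names else None,
        'allModes': names,
        'multiMode': len(names) > 1,
    }


# Hypothetical entries, for illustration only
single = summarize_modes([{'common_name': 'modeA'}], 'common_name')
multi = summarize_modes(
    [{'common_name': 'modeA'}, {'common_name': 'modeB'}, {'common_name': 'modeA'}],
    'common_name',
)
```

Recordings flagged with `multiMode` could then be excluded from single-label experiments or handled separately.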

In [None]:
def get_carnatic_metadata(maxNumFiles=None):
    """ Get Carnatic specific mode and rhythmic mode metadata for all recordings."""
    carnatic_recordings = ca.get_recordings(recording_detail=True)
    if maxNumFiles:
        carnatic_recordings = carnatic_recordings[:maxNumFiles]
        
    # Get only the information that we want for each collection. Rename the attributes to be
    # consistent for all collections.
    # Carnatic
    # mode -> raaga, rhythmMode -> taala
    carnatic_metadata = []
    for r in carnatic_recordings:
        if r['raaga'] or r['taala']:
            data = {'mbid': r['mbid']}
            if r['raaga']:
                data['mode'] = r['raaga'][0]['common_name']
            if r['taala']:
                data['rhythmMode'] = r['taala'][0]['common_name']
            carnatic_metadata.append(data)
    return carnatic_metadata

def get_hindustani_metadata(maxNumFiles=None):
    """ Get Hindustani specific mode and rhythmic mode metadata for all recordings."""
    hindustani_recordings = hi.get_recordings(recording_detail=True)
    if maxNumFiles:
        hindustani_recordings = hindustani_recordings[:maxNumFiles]
        
    # Get only the information that we want for each collection. Rename the attributes to be
    # consistent for all collections.
    # Hindustani
    # mode -> raag, rhythmMode -> taal
    # The Hindustani API returns some MBIDs twice, so we filter out duplicates here.
    seen_mbids = set()
    hindustani_metadata = []
    for r in hindustani_recordings:
        if r['raags'] or r['taals']:
            data = {'mbid': r['mbid']}
            if r['raags']:
                data['mode'] = r['raags'][0]['common_name']
            if r['taals']:
                data['rhythmMode'] = r['taals'][0]['common_name']
            if r['mbid'] not in seen_mbids:
                hindustani_metadata.append(data)
                seen_mbids.add(r['mbid'])
    return hindustani_metadata

def get_makam_metadata(maxNumFiles=None):
    """ Get Turkish-makam specific mode and rhythmic mode metadata for all recordings."""
    makam_recordings = ma.get_recordings(recording_detail=True)
    if maxNumFiles:
        makam_recordings = makam_recordings[:maxNumFiles]
    # Get only the information that we want for each collection. Rename the attributes to be
    # consistent for all collections.
    # Makam
    # mode -> makam, rhythmMode -> usul
    makam_metadata = []
    for r in makam_recordings:
        if r['makamlist'] or r['usullist']:
            data = {'mbid': r['mbid']}
            if r['makamlist']:
                data['mode'] = r['makamlist'][0]['name']
            if r['usullist']:
                data['rhythmMode'] = r['usullist'][0]['name']
            makam_metadata.append(data)
    return makam_metadata

This next step may take some time: these methods retrieve detailed information for all recordings in each collection, which requires a number of web service requests.
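
Because the retrieval is slow, it can help to cache each result on disk so that re-running the notebook does not repeat the requests. The sketch below is one possible approach and is not part of the original workflow. The helper `load_or_fetch`, the cache file name, and the stand-in `fake_fetch` function are all hypothetical.

```python
import os
import pickle
import tempfile


def load_or_fetch(cache_path, fetch_fn):
    """Return the pickled data at cache_path if it exists; otherwise call
    fetch_fn(), store its result at cache_path, and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    data = fetch_fn()
    with open(cache_path, 'wb') as f:
        pickle.dump(data, f)
    return data


# Stand-in for a slow web service call; counts how often it is invoked
calls = []


def fake_fetch():
    calls.append(1)
    return [{'mbid': 'id-1', 'mode': 'modeA'}]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, 'hindustani_metadata.pkl')
    first = load_or_fetch(cache_file, fake_fetch)   # calls fake_fetch
    second = load_or_fetch(cache_file, fake_fetch)  # read from the cache
```

In this notebook the call might look like `load_or_fetch('hindustani_metadata.pkl', lambda: get_hindustani_metadata(maxNumFiles))`. The cache file has to be deleted by hand whenever fresh data or a different `maxNumFiles` is wanted.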

In [None]:
print('Process start time: {}'.format(datetime.datetime.now()))
print('Starting Hindustani: {}'.format(datetime.datetime.now()))
hindustani_metadata = get_hindustani_metadata(maxNumFiles)
print('Starting Carnatic: {}'.format(datetime.datetime.now()))
carnatic_metadata = get_carnatic_metadata(maxNumFiles)
print('Starting Makam: {}'.format(datetime.datetime.now()))
makam_metadata = get_makam_metadata(maxNumFiles)

metaData_collections = {'hindustani': hindustani_metadata,
                        'carnatic': carnatic_metadata,
                        'makam': makam_metadata}

print('Process end time: {}'.format(datetime.datetime.now()))

# Save data to file; the with-block makes sure the file is closed after writing
with open('metaData_collections.pkl', 'wb') as f:
    pickle.dump(metaData_collections, f)
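
As a quick sanity check on the saved data, the following sketch counts how many records in each collection carry mode and rhythm mode information. It assumes the dict-of-lists structure built above. The function name `summarize_collections` and the sample data are hypothetical.

```python
def summarize_collections(collections_dict):
    """Count records, and records with 'mode' / 'rhythmMode', per collection."""
    summary = {}
    for name, records in collections_dict.items():
        summary[name] = {
            'recordings': len(records),
            'withMode': sum(1 for r in records if 'mode' in r),
            'withRhythmMode': sum(1 for r in records if 'rhythmMode' in r),
        }
    return summary


# Hypothetical data in the same shape as metaData_collections
sample_collections = {
    'carnatic': [
        {'mbid': 'id-1', 'mode': 'modeA', 'rhythmMode': 'rhythmX'},
        {'mbid': 'id-2', 'mode': 'modeB'},
    ],
    'makam': [
        {'mbid': 'id-3', 'rhythmMode': 'rhythmY'},
    ],
}

summary = summarize_collections(sample_collections)
for name, counts in summary.items():
    print(name, counts)
```

With the real data, the input would be the dict loaded back from `metaData_collections.pkl` using `pickle.load`.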