## Generating file lists for various Dunya collections (for mode and rhythm mode recognition experiments)

This notebook accesses Dunya data and collects file lists for several collections.

The main aim is to create subsets of data for mode and rhythm mode recognition experiments. The process is split into two notebooks. This first notebook only builds the list of files available in the metadata and saves it in a pickle file. For each recording, the following information is included:
- Files available for that recording
- MusicBrainz id (mbid)
- Mode information (raga, makam, etc)
- Rhythm mode information (tala, usul, etc)

The second notebook (formExpSubsets4ModeRecognition.ipynb) then reads this file and forms the subsets by grouping recordings by mode or rhythm mode, while also checking which files (e.g. tonic annotation) are available for each recording.
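As a rough illustration of the hand-off between the two notebooks, the sketch below builds a small dictionary shaped like the one this notebook pickles (collection name mapped to a list of per-recording dictionaries with `mbid`, `files`, `mode` and `rhythmMode` keys), writes it to a temporary pickle file, reads it back, and groups recordings by mode. The mbids, file names and mode labels are made up, and `group_by_mode` is a hypothetical helper, not the code of the second notebook.

```python
import os
import pickle
import tempfile
from collections import defaultdict

# Made-up records that mimic the structure saved by this notebook
sample = {
    'carnatic': [
        {'mbid': 'id-1', 'files': ['mp3', 'pitch'], 'mode': 'modeA', 'rhythmMode': 'rhythmX'},
        {'mbid': 'id-2', 'files': ['mp3'], 'mode': 'modeA'},
        {'mbid': 'id-3', 'files': ['mp3', 'pitch'], 'rhythmMode': 'rhythmX'},
    ],
}

def group_by_mode(records, required_file=None):
    '''Group mbids by mode label (hypothetical helper).

    Recordings without a mode are skipped. If required_file is given,
    recordings that lack that file are skipped as well.
    '''
    groups = defaultdict(list)
    for rec in records:
        if 'mode' not in rec:
            continue
        if required_file is not None and required_file not in rec.get('files', []):
            continue
        groups[rec['mode']].append(rec['mbid'])
    return dict(groups)

# Round trip through a pickle file, as the two notebooks would do
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'metaData_collections.pkl')
    with open(path, 'wb') as f:
        pickle.dump(sample, f)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

groups_all = group_by_mode(loaded['carnatic'])
groups_pitch = group_by_mode(loaded['carnatic'], required_file='pitch')
print(groups_all)
print(groups_pitch)
```

Filtering on a required file at grouping time keeps the saved pickle generic, so the same file can serve experiments that need different annotations.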

In [1]:
import codecs
import json, os, sys
import numpy as np
import compmusic
from compmusic import dunya as dn
from compmusic.dunya import hindustani as hi
from compmusic.dunya import carnatic as ca
from compmusic.dunya import docserver as ds
from compmusic.dunya import conn
from compmusic import musicbrainz
import pickle
import csv
import time
import datetime

token = '...yourTokenGoesHere...'  # put your Dunya token here
dn.set_token(token)  # set the token

#### Function definitions to retrieve file lists

In [2]:
def uniqueMbidList(recs):
    '''Collect the list of unique mbids for a given list of recordings
    
    Args:
        recs (list): list of recordings (dictionaries with an 'mbid' key)
    
    Returns:
        mbids (list): list of unique MusicBrainz ids for the recordings
    '''
    mbids=set()
    for rec in recs:
        mbids.add(rec['mbid'])
    return list(mbids)

def getAvailableFiles(mbid,collectionName):
    '''Returns a list of available files for a recording
    Args:
        mbid (str): recording's musicBrainz id 
        collectionName (str): name of the musicBrainz collection
    
    Outputs:
        allFiles4Mbid (list): list of available files for the recording
    '''
    allFiles4Mbid=[]
    try:
        files4ID=conn._dunya_query_json("/document/by-id/%s" % mbid)

        allFiles4Mbid+=files4ID['sourcefiles']

        if collectionName=='makam':
            allFiles4Mbid+=files4ID['derivedfiles']['audioanalysis'].keys()
        else:
            allFiles4Mbid+=files4ID['derivedfiles'].keys()
    except Exception:
        pass  # file content not available
        
    return allFiles4Mbid
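
To try the file-list logic without a network connection or a token, the sketch below applies the same extraction to hand-written dictionaries. It assumes, as `getAvailableFiles` does, that the document response has a `sourcefiles` list and a `derivedfiles` dictionary, with the makam file types nested under `audioanalysis`. The file type names are invented and `files_from_response` is a hypothetical helper.

```python
def files_from_response(response, collection_name):
    '''Extract file type names from a document response (hypothetical helper).

    Missing keys yield an empty contribution instead of raising.
    '''
    files = list(response.get('sourcefiles', []))
    derived = response.get('derivedfiles', {})
    if collection_name == 'makam':
        # Assumption taken from getAvailableFiles: makam file types sit one level deeper
        derived = derived.get('audioanalysis', {})
    files.extend(derived.keys())
    return files

# Invented responses, shaped the way getAvailableFiles expects them
makam_response = {
    'sourcefiles': ['mp3'],
    'derivedfiles': {'audioanalysis': {'metadata': {}, 'pitch': {}}},
}
carnatic_response = {
    'sourcefiles': ['mp3'],
    'derivedfiles': {'pitch': {}, 'tonic': {}},
}

makam_files = files_from_response(makam_response, 'makam')
carnatic_files = files_from_response(carnatic_response, 'carnatic')
empty_files = files_from_response({}, 'makam')
print(makam_files, carnatic_files, empty_files)
```

Using `dict.get` with defaults is one alternative to the broad `try`/`except` in the function above: a recording with no derived files still returns its source files.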

### Collecting files of three collections: Carnatic, Hindustani and Makam

In Dunya, data is stored in a culture-specific way. For cross-cultural studies (such as testing a mode recognition algorithm on all Dunya collections), one needs to access all collections in a unified way. The cell below accesses the data in a relatively simple and direct way and stores the list of recordings, together with their features and files, in a pickle file. This list can then be processed further to create data subsets for automatic recognition experiments.
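
One way to picture the unified access is a small lookup table that records, per collection, which key holds the mode list and which field carries its label. The key and field names are the ones used in this notebook; their assignment to collections (`raaga`/`taala` for Carnatic, `raags`/`taals` for Hindustani) is an assumption about the Dunya responses. The response dictionaries and the `first_label` helper are hypothetical stand-ins.

```python
# Per-collection location of mode and rhythm mode labels (assumed mapping)
FIELD_MAP = {
    'carnatic':   {'mode': 'raaga', 'rhythmMode': 'taala', 'label': 'name'},
    'hindustani': {'mode': 'raags', 'rhythmMode': 'taals', 'label': 'common_name'},
    'makam':      {'mode': 'makam', 'rhythmMode': 'usul',  'label': 'attribute_key'},
}

def first_label(response, collection_name, kind):
    '''Return the first mode or rhythm mode label, or None if absent.

    kind is either 'mode' or 'rhythmMode' (hypothetical helper).
    '''
    spec = FIELD_MAP[collection_name]
    entries = response.get(spec[kind]) or []
    if not entries:
        return None
    return entries[0].get(spec['label'])

# Invented responses standing in for what the API would return
carnatic_rec = {'raaga': [{'name': 'modeA'}], 'taala': []}
makam_rec = {'makam': [{'attribute_key': 'modeB'}],
             'usul': [{'attribute_key': 'rhythmY'}]}

carnatic_mode = first_label(carnatic_rec, 'carnatic', 'mode')
carnatic_rhythm = first_label(carnatic_rec, 'carnatic', 'rhythmMode')
makam_mode = first_label(makam_rec, 'makam', 'mode')
makam_rhythm = first_label(makam_rec, 'makam', 'rhythmMode')
print(carnatic_mode, carnatic_rhythm, makam_mode, makam_rhythm)
```

Like the cell below, this sketch keeps only the first listed mode; recordings with several distinct modes would need separate handling.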

In [3]:
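collections=['carnatic','hindustani','makam']
#Keys under which mode and rhythm mode are stored, listed in the same order as the collections
#(the loops below try every key, so the order does not affect the result)
modes=['raaga','raags','makam']
rhythmModes=['taala','taals','usul']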

ts = time.time()
print('Process start time:',datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))

maxNumFiles=10000#maximum number of files per collection; any value above 7000 effectively means 'all files'

metaData_collections={}
for collectionName in collections:
    mbids=uniqueMbidList(conn._get_paged_json('api/'+collectionName+'/recording'))
    print('Number of unique mbids for collection ',collectionName,' :',len(mbids))
    
    #Create subdirectory for collection
    dataDir=collectionName+'/'
    if not os.path.exists(dataDir):
        os.mkdir(dataDir)

    #For all recordings access metadata
    allMetaData=[]
    fileCnt=0
    fileCnt_metaDataRead=0
    for mbid in mbids:
        fileCnt+=1
        if fileCnt>maxNumFiles:
            break
        metaData={}
        metaData['mbid']=mbid
        
        #Mode and rhythm mode information is stored in different places in different collections.
        #For carnatic and hindustani, it is in the recording info accessed through the API;
        #for makam, it is in the metadata file (among other 'derivedfiles'/'audioanalysis').
        contentRead=False
        if collectionName=='carnatic' or collectionName=='hindustani':
            try:
                metaData_Dunya=conn._dunya_query_json('api/'+collectionName+'/recording/'+mbid)
                contentRead=True
            except Exception:
                print('Info unavailable for ',mbid)
        elif collectionName=='makam':
            try:
                content = ds.file_for_document(mbid,'audioanalysis','metadata')
                metaData_Dunya = json.loads(content.decode())
                if metaData_Dunya is not None:
                    contentRead=True
            except Exception:
                print('Info unavailable for ',mbid)
        
        if contentRead:
            fileCnt_metaDataRead+=1
            metaData['files']=getAvailableFiles(mbid,collectionName)#getting list of available files for the recording
            
            #Read the mode information and include it in the metadata to be saved.
            # When several modes are listed, this code takes the first one;
            # you may want to alter it to check all modes
            # and handle recordings with more than one distinct mode separately.
            for mode in modes:
                if mode in metaData_Dunya.keys():
                    if len(metaData_Dunya[mode])>0:#if there is a mode information
                        if collectionName=='carnatic':
                            metaData['mode']=metaData_Dunya[mode][0]['name']
                        elif collectionName=='hindustani':
                            metaData['mode']=metaData_Dunya[mode][0]['common_name']
                        elif collectionName=='makam':
                            metaData['mode']=metaData_Dunya[mode][0]['attribute_key']

            #Reading rhythm mode information and including in the meta data to be saved 
            for rhythmMode in rhythmModes:
                if rhythmMode in metaData_Dunya.keys():
                    if len(metaData_Dunya[rhythmMode])>0:#if there is a rhythmMode information
                        if collectionName=='carnatic':
                            metaData['rhythmMode']=metaData_Dunya[rhythmMode][0]['name']
                        elif collectionName=='hindustani':
                            metaData['rhythmMode']=metaData_Dunya[rhythmMode][0]['common_name']
                        elif collectionName=='makam':
                            metaData['rhythmMode']=metaData_Dunya[rhythmMode][0]['attribute_key']
            
            #Add the recording to the list of all files (recordings with neither mode nor rhythm mode are excluded)
            if ('rhythmMode' in metaData) or ('mode' in metaData):#add to the list only if at least one of them is available
                allMetaData.append(metaData)
        
    metaData_collections[collectionName]=allMetaData
    
    print('Number of files for which metadata exists',collectionName,' :',fileCnt_metaDataRead)

ts = time.time()
print('Process end time:',datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))

#Save data to file (the with-block closes the file once the data is written)
with open("metaData_collections.pkl", "wb") as f:
    pickle.dump(metaData_collections, f)

Process start time: 2018-04-16 15:34:43
Number of unique mbids for collection  carnatic  : 3521
Number of files for which metadata exists carnatic  : 3521
Number of unique mbids for collection  hindustani  : 1211
Number of files for which metadata exists hindustani  : 1211
Number of unique mbids for collection  makam  : 6295
Info unavailable for  6fdd0940-551b-45be-b2d9-e91e048ea251
Info unavailable for  a98a9f8d-66aa-4ca9-b92a-124c4e351fb7
Info unavailable for  b513ee3f-32cb-429c-9f1f-605706d2ed41
Info unavailable for  45edc774-8bb7-4f70-8ea0-43880962a5ce
Info unavailable for  f734c25b-6b15-4a20-9196-b1af6946e725
Info unavailable for  8d4deeb7-366c-4a06-8712-c9b26c2c239e
Info unavailable for  7d6b6322-4698-4c41-b589-a7ed8ef14217
Info unavailable for  7f931515-5e16-44c8-b311-8963ff8f9c38
Info unavailable for  2e96504c-7d84-44cd-b36e-b15fadffe444
Info unavailable for  97a94144-4383-4949-8b83-1a4fa5f03dd3
Info unavailable for  d7ae533b-6b86-4bcc-a5f6-290fd3776a07
Info unavailable for  8a