## Forming data subsets for mode and rhythm mode recognition experiments

This notebook forms data subsets for mode and rhythm mode recognition experiments. It starts from the list of files and metadata that `generateFileLists4Collections.ipynb` created and stored in a pickle file. For each recording, the following information is available:
- Files available for that recording
- MusicBrainz id (mbid)
- Mode information (raga, makam, etc)
- Rhythm mode information (tala, usul, etc)
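
As a rough illustration of the structure the functions below rely on, the following sketch builds a tiny stand-in for the loaded metadata and applies the same inclusion test. The keys `files`, `mbid`, `mode` and `rhythmMode` are the ones accessed by the code in this notebook; the helper `has_required_files` and every value shown (ids, file-type names, category names) are invented placeholders, not entries from the real collections.

```python
# Hypothetical stand-in for the object stored in the pickle file:
# a dict mapping collection name -> list of per-recording dicts.
example_meta = {
    'makam': [
        {'mbid': 'mbid-0001', 'files': ['tonic', 'pitch'],
         'mode': 'hicaz', 'rhythmMode': 'aksak'},
        {'mbid': 'mbid-0002', 'files': ['pitch'], 'mode': 'rast'},  # no tonic file
        {'mbid': 'mbid-0003', 'files': ['tonic']},                  # no mode label
    ],
}

def has_required_files(rec, files_needed):
    """Return True if every required file type is listed for the recording."""
    return all(name in rec['files'] for name in files_needed)

# A recording is usable for mode recognition only if it has both
# the required files and a mode label.
usable = [rec['mbid'] for rec in example_meta['makam']
          if has_required_files(rec, ['tonic']) and 'mode' in rec]
print(usable)
```

Only the first placeholder recording passes both checks, which mirrors how recordings without a tonic file or without a category label are left out of the subsets.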

This notebook reads the pickle file and forms the subsets by grouping recordings by mode or rhythm mode, while also checking that the required files (e.g. the tonic annotation) are available for each recording. The outputs are JSON files, one per culture, in the format of [this sample file](https://github.com/MTG/otmm_makam_recognition_dataset/blob/master/annotations.json), which can be used in mode recognition implementations such as [this repo](https://github.com/emirdemirel/Supervised_Mode_Recognition).
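
For orientation, here is a minimal sketch of one output record as the code in this notebook writes it: a category field (here `mode`), the recording's MusicBrainz URL and the tonic in Hz. The id, mode name and tonic value are placeholders, and the linked sample file may contain further fields that are not reproduced here.

```python
import json

# Placeholder record; the values are invented for illustration only.
record = {
    'mode': 'hicaz',
    'mbid': 'https://musicbrainz.org/recording/'
            + '00000000-0000-0000-0000-000000000000',
    'tonic': 220.0,  # tonic frequency in Hz
}

# The output file is a JSON list of such records.
serialized = json.dumps([record], indent=2)
print(serialized)

# Reading the data back gives the same list of dicts.
restored = json.loads(serialized)
```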

In [1]:
import codecs
import json, os, sys
import numpy as np
import pickle
import csv
import time
import datetime
import random
from compmusic.dunya import docserver as ds
from compmusic import dunya as dn
from compmusic.dunya import conn

token = '...yourTokenGoesHere...'#you should put your Dunya-token here
dn.set_token(token)#setting the token

#Reading metadata
with open("metaData_collections.pkl", 'rb') as f:
    metaData = pickle.load(f)

In [2]:
def getMbidList4modes(colMetaData,categoryType,filesNeeded):
    '''Function to group the mbids of a collection by the given category type,
    keeping only the recordings for which all needed files are available
    
    Args:
        colMetaData : collection meta data (iterable of per-recording dicts)
        categoryType (str): category type ('mode' or 'rhythmMode')
        filesNeeded (list): files a recording must have to be included in the subset
    
    Returns:
        mbidList (dict): dictionary mapping each category to a list of musicBrainz ids
    '''
    mbidList={}
    for rec in colMetaData:
        #checking if all requested files exist
        filesExist=True
        for requiredFile in filesNeeded:
            if not(requiredFile in rec['files']):
                filesExist=False
                print(requiredFile, ' does not exist for ',rec['mbid'])

        #If files exist and the category information is available, add the recording
        if filesExist and categoryType in rec:
            mbidList.setdefault(rec[categoryType],[]).append(rec['mbid'])
    return mbidList

def printSortedList(mbidList,numCategories=5):
    '''Print list of largest categories'''
    categories=list(mbidList.keys())
    
    numFiles_category=np.zeros((len(categories),), dtype=int)
    for index in range(len(categories)):
        numFiles_category[index]=len(mbidList[categories[index]])
    
    sortedIndexes=np.flipud(np.argsort(numFiles_category))#highest to lowest sorting
    #Never request more categories than are available (avoids an IndexError)
    for index in range(min(numCategories,len(categories))):
        print(categories[sortedIndexes[index]],' :\t',numFiles_category[sortedIndexes[index]])

def composeDataset(mbidList,categoryType,numCategories,numFilePerCategory=20):
    '''Composing the dataset (via random selection of recordings) with the following constraints:
        - Number of categories
        - Maximum number of files in each category

    Args:
        mbidList (dict): dictionary mapping categories to list of musicBrainz ids
        categoryType (str): category type ('mode' or 'rhythmMode')
        numCategories (int): number of categories (the largest ones are kept)
        numFilePerCategory (int): maximum number of recordings in each category
    
    Returns:
        allRecordings (list): list of dicts, each holding the category and the musicBrainz id of a recording
    '''
    categories=list(mbidList.keys())
    
    numFiles_category=np.zeros((len(categories),), dtype=int)
    for index in range(len(categories)):
        numFiles_category[index]=len(mbidList[categories[index]])
    
    sortedIndexes=np.flipud(np.argsort(numFiles_category))#highest to lowest sorting
    
    sortedIndexes=sortedIndexes[:numCategories]
    allRecordings=[]
    for index in range(len(sortedIndexes)):#fewer than numCategories categories may exist
        category=categories[sortedIndexes[index]]
        
        mbids=mbidList[category]
        #Choose random samples; random.sample raises ValueError if asked for more items than available
        numSelected=min(numFilePerCategory,len(mbids))
        selectedIndexes=random.sample(range(len(mbids)), numSelected)
        for selectedIndex in selectedIndexes:
            new_rec={}
            new_rec[categoryType]=category
            new_rec['mbid']=mbids[selectedIndex]
            allRecordings.append(new_rec)
    
    return allRecordings
        
    
def addTonicInformation(recs,collectionName):
    '''Adding tonic information to recording info 
    for all elements of the given recording list
    
    Args:
        recs (list): list of recordings (including features)
        collectionName (str): name of the collection
    
    Modifies:
        recs (list): adds 'tonic' field (containing the tonic in Hz) to each element
        for which a tonic could be read and turns its 'mbid' into a MusicBrainz URL
    '''
    for rec in recs:
        tonic=None#stays None for an unknown collection name
        if collectionName=='makam':#collection-specific access to tonic info
            content = ds.file_for_document(rec['mbid'],'audioanalysis','tonic')
            tonic = json.loads(content.decode())['value']
        elif collectionName=='carnatic' or collectionName=='hindustani':
            content = ds.file_for_document(rec['mbid'],'ctonic','tonic')#collection-specific access
            tonic = content.decode()
        
        if tonic is None:
            #float(None) would raise a TypeError, so report and skip this recording
            print('Tonic could not be read for ',rec['mbid'])
            continue
        rec['tonic']=float(tonic)
        rec['mbid']='https://musicbrainz.org/recording/'+rec['mbid']


### Printing table of most frequently used modes
Print, for each collection, the modes with the highest number of recordings that have a tonic annotation.
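
The ranking step can also be sketched with plain Python instead of NumPy index sorting. The function `rank_categories` and the toy dictionary are hypothetical names and data introduced only for this illustration; the sketch assumes the same input shape as in this notebook, a dict mapping each category to a list of ids.

```python
def rank_categories(mbid_list, num_categories=5):
    """Return (category, count) pairs, largest category first."""
    counts = {category: len(mbids) for category, mbids in mbid_list.items()}
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # Slicing never fails, even if fewer categories exist than requested.
    return ranked[:num_categories]

# Toy input with invented ids.
toy = {'hicaz': ['a', 'b', 'c'], 'rast': ['d'], 'segah': ['e', 'f']}
top = rank_categories(toy, num_categories=2)
for category, count in top:
    print(category, ' :\t', count)
```

Because the result is obtained by slicing, asking for more categories than exist simply returns all of them.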

In [3]:
collections=list(metaData.keys())
numModes=20
for collection in collections:
    #Group by mode the recordings of the collection that also have a tonic file
    if collection=='makam':
        modeMbids=getMbidList4modes(metaData[collection],'mode',['tonic'])
    elif collection=='carnatic' or collection=='hindustani':
        modeMbids=getMbidList4modes(metaData[collection],'mode',['ctonic'])
    print('-------------------------------------')
    print('Most frequently used modes in collection ',collection)
    printSortedList(modeMbids,numModes)


ctonic  does not exist for  1d0d2909-4c45-4d35-85fd-68cbcef9c605
ctonic  does not exist for  4f5a6a1b-7002-4538-b642-b0ed42e91140
ctonic  does not exist for  05341b83-b2c5-42e6-b26c-5f7ccd7c1768
-------------------------------------
Most frequently used modes in collection  hindustani
Bhairabi  :	 51
Yaman kalyan  :	 39
Khamaj  :	 30
Bageshree  :	 28
Malkauns  :	 28
Des  :	 25
Todi  :	 24
Miya malhar  :	 23
Marwa  :	 23
Lalat  :	 22
Ahir bhairav  :	 21
Bilaskhani todi  :	 20
Darbari  :	 19
Mishra piloo  :	 19
Bhairav  :	 18
Hamsadhvani  :	 18
Bihag  :	 17
Basant  :	 16
Mishra maand  :	 15
Puriya dhanashree  :	 15
-------------------------------------
Most frequently used modes in collection  makam
hicaz  :	 647
nihavent  :	 469
huzzam  :	 426
ussak  :	 383
kurdilihicazkar  :	 354
rast  :	 348
segah  :	 265
huseyni  :	 236
mahur  :	 151
hicazkar  :	 151
saba  :	 124
muhayyer  :	 115
suzinak  :	 104
karcigar  :	 90
beyati  :	 87
acemasiran  :	 81
muhayyerkurdi  :	 76
sultaniyegah  :	 75


### Composing a mode recognition dataset for each collection

Create an `annotations_<collection>.json` file for each culture to serve as an experiment dataset. These JSON files can be used as input to the supervised mode recognition tests in [this repo](https://github.com/emirdemirel/Supervised_Mode_Recognition).
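
The random selection behind the dataset can be sketched in isolation as follows. This is an illustration under assumptions, not the notebook's own function: `sample_up_to` is a hypothetical helper, the ids are invented, and the optional seed is an addition that makes a draw repeatable.

```python
import random

def sample_up_to(items, max_items, seed=None):
    """Randomly pick at most max_items elements without replacement."""
    rng = random.Random(seed)          # a local generator keeps the draw repeatable
    k = min(max_items, len(items))     # random.sample fails if k > len(items)
    return rng.sample(items, k)

mbids = ['id-%d' % i for i in range(7)]     # invented ids
subset = sample_up_to(mbids, 20, seed=0)    # only 7 available, so all 7 are returned
again = sample_up_to(mbids, 20, seed=0)     # same seed, same selection
small = sample_up_to(mbids, 3, seed=1)      # a proper subset of 3 ids
print(len(subset), len(small))
```

Capping the sample size at the number of available recordings keeps small categories from raising a `ValueError`, at the cost of categories that are not all the same size.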

In [4]:
numModes=10
numFilesPerCategory=20

for collection in collections:
    if collection=='makam':
        modeMbids=getMbidList4modes(metaData[collection],'mode',['tonic'])
    elif collection=='carnatic' or collection=='hindustani':
        modeMbids=getMbidList4modes(metaData[collection],'mode',['ctonic'])
    
    recs=composeDataset(modeMbids,'mode',numModes,numFilesPerCategory)
    addTonicInformation(recs,collection)
    with open('annotations_'+collection+'.json', mode='w') as f:
        json.dump(recs, f)


ctonic  does not exist for  1d0d2909-4c45-4d35-85fd-68cbcef9c605
ctonic  does not exist for  4f5a6a1b-7002-4538-b642-b0ed42e91140
ctonic  does not exist for  05341b83-b2c5-42e6-b26c-5f7ccd7c1768
ctonic  does not exist for  3dabbf57-d069-4b2c-b4cd-b1387f0460ce
ctonic  does not exist for  2ccabdce-3390-4b91-b882-46490741526c
