## Forming data subsets for mode and rhythm mode recognition experiments

This notebook forms data subsets for mode and rhythm mode recognition experiments. It starts from the list of files and metadata created by generateFileLists4Collections.ipynb and stored in a pickle file. For each recording, the following information is available:
- Files available for that recording
- MusicBrainz id (mbid)
- Mode information (raga, makam, etc)
- Rhythm mode information (tala, usul, etc)

This notebook reads the pickle file and forms the subsets by grouping recordings by mode or rhythm mode, while also checking that the required files (e.g. the tonic annotation) are available for each recording. The outputs are JSON files, one per culture (in the format of [this sample file](https://github.com/MTG/otmm_makam_recognition_dataset/blob/master/annotations.json)), which can be used in mode recognition implementations such as [this repo](https://github.com/emirdemirel/Supervised_Mode_Recognition).
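
As a minimal sketch of the grouping step described above, the following groups recordings by mode and keeps only those that list a required file. The record layout (`mbid`, `mode`, and a `files` list), the `required_file` parameter, and the sample values are assumptions made for illustration; the actual keys stored in the pickle file may differ.

```python
import collections

def group_by_mode(recordings, required_file=None):
    """Group recording dicts by their 'mode' value.

    `required_file` and the 'files' key are hypothetical: they stand in for
    whatever per-recording file listing the pickle file actually stores.
    """
    groups = collections.defaultdict(list)
    for rec in recordings:
        if 'mode' not in rec:
            continue  # skip recordings without mode information
        if required_file is not None and required_file not in rec.get('files', []):
            continue  # skip recordings that lack the required file
        groups[rec['mode']].append(rec['mbid'])
    return dict(groups)

# Made-up sample data (not real MBIDs)
sample = [
    {'mbid': 'id-1', 'mode': 'Hicaz', 'files': ['tonic', 'pitch']},
    {'mbid': 'id-2', 'mode': 'Hicaz', 'files': ['pitch']},
    {'mbid': 'id-3', 'mode': 'Rast', 'files': ['tonic']},
    {'mbid': 'id-4', 'files': ['tonic']},
]
grouped = group_by_mode(sample, required_file='tonic')
print(grouped)
```

Here `id-2` is dropped because it has no tonic file and `id-4` because it has no mode, so only recordings usable for a mode recognition experiment remain.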

In [None]:
# Set your Dunya API token here. You can find it at https://dunya.compmusic.upf.edu/user/profile/
token = '...yourAPITokenGoesHere...'

In [None]:
import codecs
import json, os, sys
import numpy as np
import pickle
import csv
import time
import datetime
import random
from compmusic.dunya import docserver as ds
from compmusic import dunya as dn
from compmusic.dunya import conn
import collections

dn.set_token(token)#setting the token

# Read metadata from the previous notebook
with open("metaData_collections.pkl", 'rb') as f:
    metaData = pickle.load(f)

## Most frequently used modes 
Modes in each collection, ordered by the number of recordings available for each.

In [None]:
numModes = 20

for collection, recordings in metaData.items():
    mode_counter = collections.Counter()
    for recording in recordings:
        if 'mode' in recording:
            mode_counter[recording['mode']] += 1
    print('Most frequently used modes in collection {}'.format(collection))
    common_modes = mode_counter.most_common(numModes)
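    # Pad to the longest mode name; default=0 avoids a ValueError when a collection has no modes
    max_length = max((len(mode) for mode, _ in common_modes), default=0)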
    for mode, count in common_modes:
        print('{mode:<{pad}} {count}'.format(mode=mode, pad=max_length, count=count))
    print('-'*50)
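
The same counting can be applied to rhythm modes. The sketch below generalizes the loop above to any metadata key. The key name `rhythm` and the sample records are hypothetical, since the key under which the rhythm mode is stored in the pickle file is not shown here.

```python
import collections

def most_common_values(recordings, key, n):
    """Return the `n` most frequent values stored under `key`.

    Recordings that do not contain `key` are ignored.
    """
    counter = collections.Counter(rec[key] for rec in recordings if key in rec)
    return counter.most_common(n)

# Made-up sample data; 'rhythm' is an assumed key name
sample = [
    {'mbid': 'id-1', 'rhythm': 'Aksak'},
    {'mbid': 'id-2', 'rhythm': 'Aksak'},
    {'mbid': 'id-3', 'rhythm': 'Düyek'},
    {'mbid': 'id-4'},
]
top = most_common_values(sample, 'rhythm', 2)
for value, count in top:
    print(value, count)
```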

### Composing mode recognition datasets for all collections

We create an `annotations_<collection>.json` file for each culture that can serve as an experimental dataset. For each of the most frequent modes, we collect the tonic frequency for a random selection of recordings. These JSON files can be used as input to the supervised mode recognition tests in [this repo](https://github.com/emirdemirel/Supervised_Mode_Recognition).

In [None]:
# Take the top `numModes` modes. Go through the recordings of each mode in the collection
# until we have `numFilesPerMode` recordings with a tonic (or run out of recordings).

def download_tonic(mbid, collection):
    """Retrieve the tonic value for a given recording
    
    Arguments:
        mbid: the recording MBID to retrieve
        collection: the name of the collection that this MBID comes from
                    (used to choose the download method)
                    
    Returns: The tonic of the recording, or None if this recording has no tonic computed
    """
    tonic = None
    try:
        if collection == 'makam':
            content = ds.get_document_as_json(mbid, 'audioanalysis', 'tonic')
            if content:
                tonic = content['value']
        elif collection in ('carnatic', 'hindustani'):
            # Use the `mbid` argument, not a `recording` variable from the enclosing scope
            content = ds.file_for_document(mbid, 'ctonic', 'tonic')
            tonic = content.decode()
    except dn.HTTPError as e:
        # A 404 means that no tonic has been computed for this recording
        if e.args[0].response.status_code != 404:
            raise
    return tonic

def get_tonics_for_recordings(recordings, collection, numModes, numFilesPerMode):
    # Count the modes in the recording list and group recordings by their mode
    mode_counter = collections.Counter()
    mode_recordings = collections.defaultdict(list)
    for recording in recordings:
        if 'mode' in recording:
            mode_counter[recording['mode']] += 1
            mode_recordings[recording['mode']].append(recording)
    selected_modes = [mode for mode, _ in mode_counter.most_common(numModes)]
    # for each mode, download tonic for `numFilesPerMode` random recordings
    collection_sample = []
    for mode in selected_modes:
        # Shuffle a copy so that the selection is random and the input list keeps its order
        candidates = list(mode_recordings[mode])
        random.shuffle(candidates)
        num_recordings = 0
        for recording in candidates:
            if num_recordings >= numFilesPerMode:
                break
            tonic = download_tonic(recording['mbid'], collection)
            # Some recordings may not have a tonic, only add those for which we do
            if tonic:
                recording['tonic'] = tonic
                collection_sample.append(recording)
                num_recordings += 1
    return collection_sample

In [None]:
numModes = 10
numFilesPerMode = 20

tonics = {}

for collection, recordings in metaData.items():
    print('Downloading Tonic values for collection {}'.format(collection))
    
    collection_sample = get_tonics_for_recordings(recordings, collection, numModes, numFilesPerMode)

    tonics[collection] = collection_sample

In [None]:
# Write tonic data to file
for collection, recordings in tonics.items():
    with open('annotations_{}.json'.format(collection), 'w') as f:
        json.dump(recordings, f)
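
As a quick sanity check, a written annotations file can be read back and the number of recordings per mode counted. The sketch below uses made-up records in a temporary file rather than real output; it assumes only that each record carries the `mode`, `mbid`, and `tonic` keys used above. The helper name `summarize_annotations` is hypothetical.

```python
import collections
import json
import os
import tempfile

def summarize_annotations(path):
    """Count the recordings per mode in an annotations JSON file."""
    with open(path) as f:
        recordings = json.load(f)
    return collections.Counter(rec['mode'] for rec in recordings)

# Made-up sample data (not real MBIDs or tonic values)
sample = [
    {'mbid': 'id-1', 'mode': 'Hicaz', 'tonic': 220.0},
    {'mbid': 'id-2', 'mode': 'Hicaz', 'tonic': 196.0},
    {'mbid': 'id-3', 'mode': 'Rast', 'tonic': 146.8},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'annotations_example.json')
    with open(path, 'w') as f:
        json.dump(sample, f)
    counts = summarize_annotations(path)

for mode, count in counts.most_common():
    print(mode, count)
```

A mode with fewer than `numFilesPerMode` entries indicates that not enough of its recordings had a tonic available.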