# Retrieving and Cutting Relevant Sanaa Scores + Audio
This notebook retrieves the score portions (and the audio, although audio is outside the scope of this project) for the relevant sanaas. For now, 'relevant' means a preset subset defined in this notebook.
In the future, score and audio download should be automated using the Dunya API. For now, the audio must be downloaded manually (following the instructions in this notebook), while the scores are downloaded automatically from the GitHub repository of the corpus.

For information about the data and the corpus, please refer to the accompanying paper. The notebooks are only meant to document the preprocessing and plot generation code for result reproducibility.

In [42]:
import sys, os
import pandas as pd
import urllib.request
from music21 import converter, stream  # only these two modules are used below

import csv, glob, re, copy
import essentia.standard as es
import fractions

audio_source_path = 'audio_source/'
audio_destination_path = 'audio_dest/'

score_source_path = 'score_source/'
score_destination_path = 'score_dest/'

dataset_file = "arab_andalusian_sanaas.csv"

corpus_repository = 'https://github.com/MTG/arab-andalusian-music/raw/master/'
metadata_file = 'metadata-all-nawbas.csv'
scores_directory = 'Scores-MusicXML/'

mbids = ['f7bcb9af-6abb-4192-ae3d-37fa811034ce', 
 '8842c1f0-e261-4069-bd59-768bb9a3315c', 
 'a451a7fc-c53f-462a-b3fc-4377bb588105',
 'b11237b9-d45b-4b3a-a97b-ab7d198f927f']

fs=44100 #sampling frequency

sys.path.append('../') # make the andalusianextractSanaa module in the parent directory importable
import andalusianextractSanaa as sa

## Preparing Directories (Manual Download Required; Instructions Below)
The following cells create the directory structure and contents expected by the remaining sections of this notebook.

In [38]:
directories = [audio_source_path, audio_destination_path, score_source_path, score_destination_path]

for directory in directories:
    # exist_ok=True makes a separate existence check unnecessary
    os.makedirs(directory, exist_ok=True)


First, a metadata file is downloaded from the Arab Andalusian Corpus repository on GitHub. This file links each recording MBID to its recording on archive.org, so the cell below can print, for each recording, where to download the audio from and what filename to save it under.

In [39]:
#Audio Download
urllib.request.urlretrieve(corpus_repository+metadata_file, metadata_file)
metadata = pd.read_csv(metadata_file)

for mbid in mbids:
    row = metadata.loc[metadata['RECORDING_MBID'] == mbid] 
    audio_url = row['RECORDING_INTERNET_ARCHIVE_URL'].item()
    print('Please download audio as MP3 from {}, and save it as \n{}.mp3 in {} directory\n'.format(str(audio_url), mbid, audio_source_path))

Please download audio as MP3 from https://archive.org/details/RTMOrchestra_RTM1960s_QuddamMaya, and save it as 
f7bcb9af-6abb-4192-ae3d-37fa811034ce.mp3 in audio_source/ directory

Please download audio as MP3 from https://archive.org/details/BrihiOrchestra_RTM1960s_DarjMaya, and save it as 
8842c1f0-e261-4069-bd59-768bb9a3315c.mp3 in audio_source/ directory

Please download audio as MP3 from https://archive.org/details/BrihiOrchestra_RTM1960s_BtayhiMaya, and save it as 
a451a7fc-c53f-462a-b3fc-4377bb588105.mp3 in audio_source/ directory



In [43]:
#Score Download
#https://raw.githubusercontent.com/MTG/arab-andalusian-music/master/Scores-MusicXML/01da143e-4224-4692-8e6c-1d55f6de8a6d.xml
for mbid in mbids:
    score_path = 'https://raw.githubusercontent.com/MTG/arab-andalusian-music/master/Scores-MusicXML/{}.musicxml'.format(mbid)
    print(score_path)
    urllib.request.urlretrieve(score_path, '{}{}.xml'.format(score_source_path, mbid))
    

https://raw.githubusercontent.com/MTG/arab-andalusian-music/master/Scores-MusicXML/f7bcb9af-6abb-4192-ae3d-37fa811034ce.musicxml
https://raw.githubusercontent.com/MTG/arab-andalusian-music/master/Scores-MusicXML/8842c1f0-e261-4069-bd59-768bb9a3315c.musicxml
https://raw.githubusercontent.com/MTG/arab-andalusian-music/master/Scores-MusicXML/a451a7fc-c53f-462a-b3fc-4377bb588105.musicxml
https://raw.githubusercontent.com/MTG/arab-andalusian-music/master/Scores-MusicXML/b11237b9-d45b-4b3a-a97b-ab7d198f927f.musicxml
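
Re-running the download cell fetches every score again. A minimal sketch of skipping files that are already on disk is given here; `download_if_missing` and `fake_fetch` are hypothetical names introduced only for illustration, and the example URL is a placeholder rather than a real score location.

```python
import os
import tempfile

def download_if_missing(url, dest, fetch):
    """Call fetch(url, dest) only if dest does not exist; return True if it was called."""
    if os.path.exists(dest):
        return False
    fetch(url, dest)
    return True

# Stand-in for a real downloader: records the call and writes a dummy file.
calls = []
def fake_fetch(url, dest):
    calls.append(url)
    with open(dest, 'w') as f:
        f.write('<score/>')

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'example.xml')
    first = download_if_missing('https://example.org/example.musicxml', target, fake_fetch)
    second = download_if_missing('https://example.org/example.musicxml', target, fake_fetch)
```

In the notebook, `urllib.request.urlretrieve` could be passed as `fetch`, since it accepts a URL and a destination filename in that order.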


In [35]:
#an overlapping element will be stored in the list as: (recording_mbid, sanaa_id), (recording_mbid, sanaa_id)
#eg, first tuple retrieved as overlaps[i][0]
overlaps = [
    (('8842c1f0-e261-4069-bd59-768bb9a3315c', 'mu.2'), ('f7bcb9af-6abb-4192-ae3d-37fa811034ce', 'mu.2')), #darj/quddam. sfrt l'ashiy
    (('8842c1f0-e261-4069-bd59-768bb9a3315c', 'in.2'), ('a451a7fc-c53f-462a-b3fc-4377bb588105', 'in.2')), #darj/btayhi. noqla
    (('8842c1f0-e261-4069-bd59-768bb9a3315c', 'in.3'), ('a451a7fc-c53f-462a-b3fc-4377bb588105', 'in.4')), #,, itha nathkor
    (('8842c1f0-e261-4069-bd59-768bb9a3315c', 'in.4'), ('a451a7fc-c53f-462a-b3fc-4377bb588105', 'in.5')), #,, fi kol el ghoroub
    (('8842c1f0-e261-4069-bd59-768bb9a3315c', 'in.5'), ('a451a7fc-c53f-462a-b3fc-4377bb588105', 'in.6')), #,, ya laha ashiyah
    (('b11237b9-d45b-4b3a-a97b-ab7d198f927f', 'mu.1'), ('f7bcb9af-6abb-4192-ae3d-37fa811034ce', 'mu.1')),
    (('b11237b9-d45b-4b3a-a97b-ab7d198f927f', 'ma.2'), ('f7bcb9af-6abb-4192-ae3d-37fa811034ce', 'ma.1')), #qult ya ashiyah (quddam)
    (('b11237b9-d45b-4b3a-a97b-ab7d198f927f', 'in.1'), ('f7bcb9af-6abb-4192-ae3d-37fa811034ce', 'in.1')), #atham ya ashiyah
    (('b11237b9-d45b-4b3a-a97b-ab7d198f927f', 'in.5'), ('f7bcb9af-6abb-4192-ae3d-37fa811034ce', 'in.2')), #safiiqo
    (('b11237b9-d45b-4b3a-a97b-ab7d198f927f', 'in.6'), ('f7bcb9af-6abb-4192-ae3d-37fa811034ce', 'in.3')), #ana kully milkon lakom
    (('b11237b9-d45b-4b3a-a97b-ab7d198f927f', 'in.7'), ('f7bcb9af-6abb-4192-ae3d-37fa811034ce', 'in.6')), #shams el ashiy rawnaqat
    (('f7bcb9af-6abb-4192-ae3d-37fa811034ce', 'in.7'), ('b11237b9-d45b-4b3a-a97b-ab7d198f927f', 'in.8'))  #wadda'tuki lillah
]
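
Identifiers typed or pasted by hand can pick up stray whitespace, which makes exact string comparisons against the CSV fail silently. The sketch below flattens a list shaped like `overlaps` into unique `(mbid, sanaa_id)` pairs with whitespace stripped. `unique_sanaa_keys` is a hypothetical helper, and the demo identifiers are placeholders, not real MBIDs.

```python
def unique_sanaa_keys(overlaps):
    """Collect every (mbid, sanaa_id) pair once, in order, with whitespace stripped."""
    seen = []
    for pair in overlaps:
        for mbid, sanaa_id in pair:
            key = (mbid.strip(), sanaa_id.strip())
            if key not in seen:
                seen.append(key)
    return seen

# Placeholder data with the same nesting as the overlaps list
demo_overlaps = [
    (('rec-a', 'mu.2'), ('rec-b', 'mu.2')),
    (('rec-a ', 'in.2'), ('rec-c', 'in.2')),  # trailing space on purpose
    (('rec-a', 'mu.2'), ('rec-c', 'in.4')),   # first pair repeats an earlier one
]
keys = unique_sanaa_keys(demo_overlaps)
```

Keeping the pairs in a list rather than a set preserves the order in which the sanaas were listed.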


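## Cutting the Score and Audio Fragments

Below is a set of utility functions that cut the scores and audio into smaller fragments. They are tightly coupled to the column layout of the dataset CSV file (columns 0 and 1 hold the recording MBID and the sanaa id, columns 3 and 4 the audio start and end in seconds, and columns 5 and 6 the score start and end offsets); this will be addressed in future versions of this notebook.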

In [33]:
def get_sanaa_row_indexes(mbid, sanaa_id,  dataset_file):
    found_sanaas = []
    row = []
    with open(dataset_file) as csvfile:
        csv_file = csv.reader(csvfile, delimiter=',', quotechar='|')
        for rows in csv_file:
            row.append(rows)
    for count in range(len(row)):
        if row[count][0] == mbid and row[count][1] == sanaa_id:
            found_sanaas.append(count)
    return found_sanaas

def extractSanaa_audio(found_sanaas, fs, audio_source_path, audio_destination_path, dataset_file):
    row = []
    with open(dataset_file) as csvfile:
        csv_file = csv.reader(csvfile, delimiter=',', quotechar='|')
        for rows in csv_file:
            row.append(rows)
    for i in range(len(found_sanaas)):
        loader = es.MonoLoader(filename = os.path.join(audio_source_path, row[found_sanaas[i]][0] + '.mp3'))
        audio = loader()
        # multiply by fs before truncating so fractional-second boundaries are kept
        to_extract = audio[int(float(row[found_sanaas[i]][3]) * fs):int(float(row[found_sanaas[i]][4]) * fs)]
        es.MonoWriter(filename= os.path.join(audio_destination_path,row[found_sanaas[i]][0]+'-'+row[found_sanaas[i]][1]+'-['+row[found_sanaas[i]][2]+'].wav'))(to_extract)
        
        
def extractSanaa_score(found_sanaas, score_source_path, score_destination_path, dataset_file):
    row = []
    with open(dataset_file) as csvfile:
        csv_file = csv.reader(csvfile, delimiter=',', quotechar='|')
        for rows in csv_file:
            row.append(rows)
    for i in range(len(found_sanaas)):
        fn = os.path.join(score_source_path, row[found_sanaas[i]][0]+ '.xml')
        s = converter.parse(fn)
        p = s.parts[0]
        if(row[found_sanaas[i]][5]!='-' and row[found_sanaas[i]][6]!='-'):
            segment = p.getElementsByOffset(float(row[found_sanaas[i]][5]), float(row[found_sanaas[i]][6]),
                              mustBeginInSpan=False,
                              includeElementsThatEndAtStart=False).stream()
            seg_measures = segment.getElementsByClass(stream.Measure)
            seg_timesigs = seg_measures[0].getTimeSignatures()[0]
            print(seg_timesigs)
            segment.insert(0, seg_timesigs)

            segment.write('musicxml', fp=os.path.join(score_destination_path,row[found_sanaas[i]][0]+'-'+row[found_sanaas[i]][1]+'-['+row[found_sanaas[i]][2]+'].xml'))
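
Slicing the audio array requires turning boundaries given in seconds into sample indexes. The following is a minimal sketch of that conversion, assuming the boundaries arrive as strings (as they do when read with `csv.reader`). `seconds_to_sample_range` is a hypothetical helper and is not part of the notebook or of Essentia.

```python
def seconds_to_sample_range(start_s, end_s, fs=44100):
    """Convert start/end times in seconds to (start, end) sample indexes."""
    # Multiply first, then round, so fractional seconds are not lost.
    start = int(round(float(start_s) * fs))
    end = int(round(float(end_s) * fs))
    if end <= start:
        raise ValueError('end must be later than start')
    return start, end

sample_range = seconds_to_sample_range('1.5', '2.25', fs=44100)
```

The returned pair can be used directly as slice bounds on a mono audio array sampled at `fs`.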

The following code maps each sanaa (recording MBID + sanaa id) to its row indexes in the CSV, so that score and audio segmentation stay organized. Note that this cell takes a while to run because the full audio file is loaded again for every fragment. It could be made more efficient by cutting all sanaas of the same recording in one pass; since the number of sanaas is currently small, this improvement has not been implemented.

In [44]:
ids2indexes = {}
for s1, s2 in overlaps:
    ids2indexes[s1[0]+ '-' + s1[1]] = get_sanaa_row_indexes(s1[0], s1[1], dataset_file)
    ids2indexes[s2[0]+ '-' + s2[1]] = get_sanaa_row_indexes(s2[0], s2[1], dataset_file)
    
for key, val in ids2indexes.items():
    extractSanaa_score(val, score_source_path, score_destination_path, dataset_file)
    extractSanaa_audio(val, fs, audio_source_path, audio_destination_path, dataset_file)

<music21.meter.TimeSignature 4/4>
<music21.meter.TimeSignature 3/4>
<music21.meter.TimeSignature 8/8>
<music21.meter.TimeSignature 8/8>
<music21.meter.TimeSignature 8/8>
<music21.meter.TimeSignature 8/8>
<music21.meter.TimeSignature 8/8>
<music21.meter.TimeSignature 8/8>
<music21.meter.TimeSignature 8/8>
<music21.meter.TimeSignature 8/8>
<music21.meter.TimeSignature 3/4>
<music21.meter.TimeSignature 3/4>
<music21.meter.TimeSignature 3/4>
<music21.meter.TimeSignature 6/8>
<music21.meter.TimeSignature 6/8>
<music21.meter.TimeSignature 6/8>
<music21.meter.TimeSignature 6/8>
<music21.meter.TimeSignature 6/8>
<music21.meter.TimeSignature 6/8>
<music21.meter.TimeSignature 6/8>
<music21.meter.TimeSignature 6/8>
<music21.meter.TimeSignature 6/8>
<music21.meter.TimeSignature 6/8>
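
The efficiency improvement mentioned above, cutting all sanaas of one recording in a single pass, starts with grouping row indexes by recording. A minimal sketch of that grouping step follows, assuming that column 0 of each row holds the recording MBID as in the functions above. `group_indexes_by_recording` is a hypothetical helper, and the demo rows are placeholders.

```python
from collections import defaultdict

def group_indexes_by_recording(rows, indexes):
    """Map each recording id (column 0) to the row indexes that belong to it."""
    groups = defaultdict(list)
    for idx in indexes:
        groups[rows[idx][0]].append(idx)
    return dict(groups)

# Placeholder rows with the same first two columns as the dataset CSV
demo_rows = [
    ['rec-a', 'mu.2'],
    ['rec-b', 'in.2'],
    ['rec-a', 'in.3'],
]
grouped = group_indexes_by_recording(demo_rows, [0, 1, 2])
```

With such a grouping, each recording would only need to be loaded once, and every fragment listed for it could be sliced from the same array.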
