<a href="https://colab.research.google.com/github/EA-Digifolk/EA-Digifolk-Dataset/blob/main/EADigifolk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EA-Digifolk Explorer



Links:

* [EA-Digifolk dataset](https://github.com/EA-Digifolk/EA-Digifolk-Dataset.git)
* [Extract Features from MEI Parser](https://github.com/EA-Digifolk/MEIParser_features)
* [Presentation](https://)

## Setup

This section downloads the [EA-Digifolk dataset](https://github.com/EA-Digifolk/EA-Digifolk-Dataset.git) and the [MEI parser](https://github.com/EA-Digifolk/MEIParser_features) used to extract features from MEI files, installs the libraries the parser requires, and installs [MuseScore](https://musescore.org), which is used to display the musical scores.
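
Before pointing music21 at MuseScore, it can help to confirm that the binary is actually installed. The following is a minimal sketch and not part of the original notebook: `find_musescore` is a hypothetical helper, and the candidate executable names are assumptions that may differ between distributions.

```python
import shutil

def find_musescore(candidates=("musescore", "mscore", "musescore3")):
    """Return the full path of the first candidate found on PATH, or None.

    The candidate names are assumptions; adjust them for your system.
    """
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None

musescore_path = find_musescore()
if musescore_path is None:
    print("MuseScore not found; score rendering will not work.")
else:
    print("MuseScore found at " + musescore_path)
```

If a path is found, it could be used in place of the hard-coded `/usr/bin/musescore` when configuring the music21 environment.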

In [None]:
%%capture
#@title Download the EA-Digifolk Dataset from Github
%cd /content

import os
if os.path.exists('EA-Digifolk-Dataset'):
  !git -C EA-Digifolk-Dataset pull
else:
  !git clone https://github.com/EA-Digifolk/EA-Digifolk-Dataset.git

In [None]:
%%capture
%cd /content
#@title Download the MEI Parser

import os
if os.path.exists('MEIParser_features'):
  !git -C MEIParser_features pull
else:
  !git clone https://github.com/EA-Digifolk/MEIParser_features


!pip install -r MEIParser_features/requirements.txt -q

import sys
if '/content/MEIParser_features' not in sys.path:
  sys.path.append('/content/MEIParser_features')

In [None]:
%%capture
#@title Install Musescore
!apt-get update -q && apt-get install -y -q musescore lilypond
%env QT_QPA_PLATFORM=offscreen

In [None]:
#@title Install Music21 and setup Musescore in the Music21 Environment
!pip install music21 -q

import music21
env = music21.environment.Environment()
env['pdfPath'] = '/usr/bin/musescore'
env['graphicsPath'] = '/usr/bin/musescore'
env['musicxmlPath'] = '/usr/bin/musescore'
env['musescoreDirectPNGPath'] = '/usr/bin/musescore'
env['autoDownload'] = 'allow'
env['warnings'] = 0

## Extract features from MEI files

This section processes the dataset: it extracts the features from the MEI files and saves them as a pandas DataFrame for easy exploration.

This section is optional, as the saved pandas dataframe is provided in the EA-Digifolk dataset folder by default.
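
The processing cell parses each file inside a `try`/`except` so that one malformed MEI file does not stop the whole run. A minimal sketch of that pattern follows, under stated assumptions: `collect_features` and `fake_parse` are hypothetical names, the file paths are made up, and the real notebook calls `MeiParser.parse_mei` instead of the stand-in parser.

```python
import os

def collect_features(paths, parse):
    """Parse every path, keeping successful records and failures separate."""
    records, errors = [], []
    for path in paths:
        try:
            records.append(parse(path))
        except Exception as exc:
            # Remember the failure and continue with the next file.
            errors.append((path, exc))
    return records, errors

def fake_parse(path):
    """Stand-in parser for demonstration only (not the real MEI parser)."""
    if "broken" in path:
        raise ValueError("could not parse " + path)
    song_id = os.path.splitext(os.path.basename(path))[0]
    return {"id": song_id}

records, errors = collect_features(
    ["Spanish/ES-0001.mei", "Spanish/broken.mei"], fake_parse
)
print(len(records), "parsed,", len(errors), "failed")
```

Collecting flat records in a list and building the table once at the end, for example with `pd.DataFrame(records)`, avoids repeated concatenation inside the loop. Whether this fits the nested dictionaries the real parser returns is an assumption to verify.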

In [None]:
# @title Process Dataset

# Import Libs from Python
import importlib
import glob
from fractions import Fraction
from tqdm import tqdm

# Import External Libs
import music21 as m21
import pandas as pd

# Import Parser
import parser_mei_features
from parser_mei_features import MeiParser

# MEI files that are skipped during processing
excluded = {f'EA-Digifolk-Dataset/Spanish/{s}' for s in ['ES-1948-AS-FP-006.mei', 'ES-1948-CB-CO-376.mei', 'ES-1948-CB-CO-418.mei', 'ES-1991-CL-KS-147.mei']}

# Collect all Spanish and Mexican MEI files in sorted order, minus the excluded ones
songs = sorted(glob.glob('EA-Digifolk-Dataset/Spanish/*.mei') + glob.glob('EA-Digifolk-Dataset/Mexican/*.mei'))
songs = [so for so in songs if so not in excluded]

errors = []
EADIGIFOLKNT = pd.DataFrame()

for song in tqdm(songs):

    try:
      mei_parser = MeiParser()
      song_features = mei_parser.parse_mei(song, verbose=False)
      EADIGIFOLKNT = pd.concat([EADIGIFOLKNT, pd.DataFrame.from_dict(song_features)], axis=1)
    except Exception as e:
      errors.append((song, e))

print('\n Files with errors:')
for err in errors:
  print(err)

# Transpose Dataframe so songs' IDs are now the index and create country column from ID
EADIGIFOLK = EADIGIFOLKNT.T
EADIGIFOLK.set_index('id', inplace=True)
EADIGIFOLK['country'] = EADIGIFOLK.index.to_series().apply(lambda x: x.split('-')[0])

# Save the DataFrame as a gzip-compressed pickle file
EADIGIFOLK.to_pickle('EADIGIFOLKT.gzip', compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1})

## Exploring the EA-Digifolk Dataset

This section covers possible ways of exploring the dataset.
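
Song IDs encode the country in their first dash-separated field, which the notebook uses both to build the `country` column and to locate a song's MEI file. A minimal sketch of that convention is shown here; `country_of`, `mei_path`, and `COUNTRY_FOLDERS` are hypothetical names introduced for illustration, and the IDs are taken from elsewhere in this notebook.

```python
from collections import Counter

# Mapping from ID prefix to dataset folder, as used in this notebook.
COUNTRY_FOLDERS = {"MX": "Mexican", "ES": "Spanish"}

def country_of(song_id):
    """Return the country prefix, i.e. the first dash-separated field."""
    return song_id.split("-")[0]

def mei_path(song_id, root="EA-Digifolk-Dataset"):
    """Build the path of the MEI file that belongs to a song ID."""
    folder = COUNTRY_FOLDERS[country_of(song_id)]
    return f"{root}/{folder}/{song_id}.mei"

ids = ["MX-1951-00-VM-00002", "ES-1948-AS-FP-006", "ES-1948-CB-CO-376"]
country_counts = Counter(country_of(i) for i in ids)
print(country_counts)
print(mei_path(ids[0]))
```

An unknown prefix raises a `KeyError` in this sketch, which makes a mistyped ID fail early instead of silently falling back to one folder.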

In [None]:
#@title Import saved pandas dataframe

import pandas as pd
EADIGIFOLK = pd.read_pickle("/content/EA-Digifolk-Dataset/EADIGIFOLKT.gzip", compression='gzip')

In [None]:
#@title List all songs in the dataset

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

EADIGIFOLK

In [None]:
#@title View Song

#@markdown

ID = "MX-1951-00-VM-00002" # @param {"type":"string","placeholder":"MX-1951-00-VM-00001"}

song_path = f"EA-Digifolk-Dataset/{'Mexican' if ID.startswith('MX') else 'Spanish'}/{ID}.mei"
print('Showing: ' + song_path + '\n')

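# Imported here as well, so this cell works when the optional processing section was skipped
from parser_mei_features import MeiParser

mei_parser = MeiParser()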
song_features = mei_parser.parse_mei(song_path, verbose=False)

print('Score: \n')
mei_parser.mtc_extractor.music_stream.show()

print('\nMetadata Features: \n')
metadata = pd.DataFrame.from_dict(song_features['v1'])
# drop() returns a new DataFrame, so the result must be assigned back
metadata = metadata.drop('features', axis=1)
display(metadata)

print('\nNote Features: \n')
features = pd.DataFrame().from_dict(song_features['v1']['features'])
display(features)

In [None]:
#@title Show Distributions of Feature In Dataset

#@markdown This code block generates descriptive statistics and distributions for various musical and metadata features in the EA-Digifolk dataset.
#@markdown >The user can select both a subset of the dataset and a feature whose distribution will be analyzed.

dataset = "all" # @param ["all", "mexican", "spanish"]
feature = "key-mode" # @param ["range", "number of phrases", "number of notes per phrase", "key", "mode", "key-mode", "meter", "country", "textual topics"]

temp_dataset = EADIGIFOLK.copy()
if dataset == 'spanish':
  temp_dataset = temp_dataset[temp_dataset['country'] == 'ES']
elif dataset == 'mexican':
  temp_dataset = temp_dataset[temp_dataset['country'] == 'MX']

if feature == 'range':
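  import music21 as m21  # also imported in the optional processing section
  temp_df = temp_dataset[['ambitus_highest', 'ambitus_lowest']].copy()
  temp_df['m21_H'] = temp_df['ambitus_highest'].apply(lambda x: m21.pitch.Pitch(x.replace(' flat','-')) if x is not None else None)
  temp_df['m21_L'] = temp_df['ambitus_lowest'].apply(lambda x: m21.pitch.Pitch(x.replace(' flat','-')) if x is not None else None)
  # Interval name between the highest and lowest pitch; None when either pitch is missing
  temp_df['range'] = temp_df.apply(lambda x: m21.interval.Interval(x['m21_H'], x['m21_L']).name if x['m21_H'] is not None and x['m21_L'] is not None else None, axis=1)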
  display(temp_df['range'].describe())
  display(temp_df['range'].value_counts())
  # histogram
elif feature == 'number of phrases':
  print(temp_dataset['phrases'])
  # @TODO
elif feature == 'number of notes per phrase':
  print(temp_dataset['phrases'])
  # @TODO
elif feature == 'key-mode':
  temp_df = temp_dataset[['key', 'mode']].copy()
  temp_df['key-mode'] = temp_df[['key','mode']].apply(lambda x: x['key'].capitalize().replace(' ','') + ' ' + x['mode'].capitalize().replace(' ',''), axis=1)
  display(temp_df['key-mode'].describe())
  display(temp_df['key-mode'].value_counts())
elif feature == 'textual topics':
  display(temp_dataset['textual_topics'].explode().str.capitalize().describe())
  display(temp_dataset['textual_topics'].explode().str.capitalize().value_counts())
elif feature in ['key', 'mode', 'meter', 'country']:
  display(temp_dataset[feature].str.capitalize().describe())
  display(temp_dataset[feature].str.capitalize().value_counts())
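
The `key-mode` branch above combines two columns into one normalized label before counting. The same normalization can be sketched without pandas, as shown here; `key_mode_label` is a hypothetical helper, and the sample key and mode spellings are made up, since the exact spellings stored in the dataset are an assumption.

```python
from collections import Counter

def key_mode_label(key, mode):
    """Capitalize each part and remove inner spaces, then join with a space."""
    def tidy(text):
        return text.capitalize().replace(" ", "")
    return tidy(key) + " " + tidy(mode)

# Illustrative values only; differently cased spellings collapse into one label.
songs = [("g", "major"), ("G", "Major"), ("e flat", "minor")]
counts = Counter(key_mode_label(key, mode) for key, mode in songs)
for label, count in counts.most_common():
    print(label, count)
```

Normalizing before counting keeps variants such as `g major` and `G Major` from appearing as separate categories in the distribution.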