2018 OpenEWLD

Open Enhanced Wikifonia Leadsheet Dataset

N.B. All content of this repository should be free of copyright. If you think that some score is under copyright and I shouldn't distribute it, please open an issue.

federicosimonetta.eu.org

What is EWLD

EWLD (Enhanced Wikifonia Leadsheet Dataset) is a music leadsheet dataset that comes with a lot of metadata about composers, works, lyrics and features. It is designed for musicological and research purposes.

What is OpenEWLD

OpenEWLD is a dataset extracted from EWLD containing only public domain scores. You can redistribuite this worl as you want.

What I have added

This dataset comes from the old Wikifonia archive that is available on the web. I just filtered its files to get only the ones compatible with algorithm descripted in my thesis that is a graph-based representation of musical scores aimed at computational music analysis, computational musicology and music information retrieval.

Moreover I added some notions taken from secondhandsongs.com and discogs.com, such as the correct title, authors, year of composition, authors' year of birth and death and nationality, language, musical genres and styles. Also, I added a lot of features relative to each music score computed through music21 and I separeted the lyrics where available.

What should be added

lyrics semantic tags (maybe using gensim)
delete path to empty lyrics files
add other features
use additional source info (google, musixmatch, etc.)
add path to authors directory
switch to extraction from data dumps no connection required
add auto language detection
correct measure numbering after removing emtpy measures
solve bug on key signatures setting: no key signature is saved if setted in highest object hierarchy
solve bug on time signatures: method getTimeSignatures() returns '4/4' by default
...

Database creation

The database was extracted with a python >=3.6 script from the old Wikifonia dataset. You can regenerate it simply launching the script with this command:

python3 EWLDcreation.py -d <path-to-wikifonia>

You will need a stable internet connection and some python libraries, that you can install using the following command:

sudo pip3 install discogs_client music21\\
requests argparse csv json operator os sys\\
traceback zipfile sqlite3 collections typing\\
time

Propably, most of them are already installed in your python3 distribution.

If internet connection goes down, you'll find some file in a directory called exception_dir. Delete it and rerun the script a few times without changing anything. If exceptions still persist, please, contact me.

Note that the file db_creation.sql must be in the working directory. It contains the SQL script needed to create the initial tables.

Finally, to create OpenEWLD you have to extract a subset containing only public domain scores by simply running this command from the same directory in which you find this README:

python3 OpenEWLDcreation.py

Dataset organization

The database is organized as follows:

in the directory called dataset you will find a directory for each composer and within it a directory for each score by that composer (or combination of composers). You will also find a directory called [Unknown] for all scores without a recognized composer.
in the directory except_dir, if it exist there will be scores and logs relative to files that cannot be parsed correctly
in the file EWLD.db you will find a SQLite3 database which contains all metatags and file path

Note that all scores are filtered and edited so that they have the following properties:

only one key signature and only one time signature (but they may not be there - see what should be added)
no strong modulations (see paragraph about tonality field in the features table of the db)
any triplet is allowed but not other tuplet, no nested and not between different measures (N.B. I thought to have removed tuplets with notes of different values, but actually they could still be there, see line 276 of the creation script)
the tonality is expressed in the MusicXML
chords symbol (harmony) are always annotated in the MusicXML
only one voice is present in the score, no secondary voices or chords are allowed
no multiple white measures at beginning or at the end of the segment

Database structure

The database EWLD.db is a SQLite3 database. You can use any software to read it, for example SQLiteStudio which is simple, tiny and portable. With it you can also extract XML, HTML, JSON, SQL and PDF files.

It contains 6 tables, each described in the following paragraphs.

`works` table

Each entry represents a work, as stated in secondhandsongs.com. A work is a music composition: it can have several recordings by different performers, but the music opera is only one.

Actually, because of the noisiness of beginning dataset, you could also find entries representing derived works. Most of them should have the authors marked as '[Unknown]'.

Fields are:

id: unique integer
title: if author is not 'Unknown' it is the title given by secondhandsongs.com, otherwise it is taken from the original score metadata or from the filename
first_performance_date: if available, the date of the first performance of the song, as given by secondhandsongs.com. Usually, it should be similar to the composition date
language: the language as given by secondhandsongs.com or as detected by music21
path_lyrics: the path to the txt file containing the lyrics (it could be empty)
path_leadsheet: the path to the compressed MusicXML file containing the leadsheet

`authors` table

This represents the authors. Fields are:

commonName: the name of the author as given by secondhandsongs.com (the real author name is not available through their API at now; homonimy are diversified with ending [1], [2], etc.
birth: date of birth if available (from secondhandsongs.com)
death: date of death if available (from secondhandsongs.com)
nationality nationality if available (from secondhandsongs.com)

`features` table

Each entry describes a work from a musical point of view.

id: the id of the work
metric: metric signature as stated in the original score N.B. This is affected by a bug, sometime 4/4 could appear but it could be wrong, see what should be added
tonality: the tonality as detected by krumhansl-schmuckler algorithm (to avoid erroneous notation given by wikifonia users). If the certainty computed by [music21] is not enough high (namely >= 0.9), the one provided in the score is used. This prevents by erroneouses key signatures provided by Wikifonia users. **N.B. a bug in saving score made key detection unuseful, see what should be added **
incipit_type: a string, it can be 'anacrusi', 'acefalo' o 'tetico'
has_triplets: a boolean that is true if there are triplets
features_path: path to the csv file containing features computed by music21. Each row is a different feature. You can find a feature given its index, e.g.:
with features.base.getIndex('<name>') you get the index of a certain feature, that is equal to the row index in the csv file;
with [x.id for x in features.extractorsById('all')] you can get the list of features id in the same order as in the csv;
read more here.

`work_author` table

This table is needed to join works table and author table

`work_genres` and `work_styles` tables

These tables give to each work a style and genre classification in a 2D space with a fuzzy approach. By example, a work genre could be identified by a 2D vector ('rock', 'pop'). Each entry of these tables represent an entry of the vector. Also, each entry of the vector is associated with a occurrences field that models the certainty of that genre for that work. A good way to infer genre and/or style of work is to check if one of the two vector entries has more than twice occurences of the other. If yes, you could consider it as genre or style of the work, otherwise you can represent its genre/style using two coordinates.

These tables create association between a work and genres derived by discogs.com following this procedure:

for each work creates a query made by title composer1 composer2 etc.
consider the top-10 releases ordered by year returned by discogs.com:
takes histogram of genres and styles, following discogs categorization
for each work, memorize the two most common genres and the two most common styles

Licenses

Part of the database coming from discogs.com (namely genres and styles info) is released under CC0 license.
Part of the database coming from secondhandsongs.com (namely authors info, works titles and performance date) is released under CC BY-NC 3.0.
Compressed MusicXML files and Lyrics files are intended to contain only Public Domain content.
The remaining part of this software, included database info, is released under MIT license:

Copyright (c) 2018, Federico Simonetta, federicosimonetta.eu.org

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Cite us

Simonetta, Federico and Carnovalini, Filippo and Orio, Nicola and Rodà, Antonio. Symbolic Music Similarity through a Graph-based Representation. Proceedings of the Audio Mostly 2018 on Sound in Immersion and Emotion - AM'18. ACM Press, Year 2018.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
dataset		dataset
.gitignore		.gitignore
EWLDcreation.py		EWLDcreation.py
LICENSE		LICENSE
OpenEWLD.db		OpenEWLD.db
OpenEWLDcreation.py		OpenEWLDcreation.py
README.md		README.md
db_creation.sql		db_creation.sql

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

2018 OpenEWLD

Open Enhanced Wikifonia Leadsheet Dataset

What is EWLD

What is OpenEWLD

What I have added

What should be added

Database creation

Dataset organization

Database structure

`works` table

`authors` table

`features` table

`work_author` table

`work_genres` and `work_styles` tables

Licenses

Cite us

About

Releases 1

Packages

Languages

License

00sapo/OpenEWLD

Folders and files

Latest commit

History

Repository files navigation

2018 OpenEWLD

Open Enhanced Wikifonia Leadsheet Dataset

What is EWLD

What is OpenEWLD

What I have added

What should be added

Database creation

Dataset organization

Database structure

works table

authors table

features table

work_author table

work_genres and work_styles tables

Licenses

Cite us

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

`works` table

`authors` table

`features` table

`work_author` table

`work_genres` and `work_styles` tables

Packages