In [1]:
from LyricsAnnot import LyricsAnnot
import json
import main
import utils

Below are the functions we used and the steps we followed to create our final dataset from DALI and DAMP, enriched with information from Genius and Spotify. Please note that this notebook is meant to keep track of what we did rather than to be rerun. However, if you need to add another dataset to ours, you could follow steps similar to those below to integrate it into our work.

# Conversion of the DALI and DAMP datasets

In [5]:
# Path to datasets
DAMP_dir = "../data/DAMP_MVP/sing_300x30x2"
DALI_dir = "../data/dali_json" 

id_file_path = "./id.txt" # List of the usable converted songs along with their new id

# List of the songs put aside from the conversion for different reasons
dali_avoided_songs_file_path = "./dali_avoided_songs.txt"
damp_avoided_songs_file_path = "./damp_avoided_songs.txt"

dali_already_converted_path = "../data/dali_json/dali_already_converted"
damp_already_converted_path = "../data/DAMP_MVP/damp_already_converted"

# Path of the folder where the final json files get saved once the song gets processed and converted
save_path = "./saved/"

In [None]:
main.create_damp_notations(DAMP_dir, save_path, damp_already_converted_path, id_file_path, damp_avoided_songs_file_path)
main.create_dali_notations(DALI_dir, save_path, dali_already_converted_path, id_file_path, dali_avoided_songs_file_path)

## Post-conversion steps

Did some songs get wrongly converted (for example, because of errors in the dataset content that prevent correct line matching) and hence need to be put aside? Usually, these songs have only a single paragraph.

In [21]:
song_ids = utils.get_single_paragraph_song_info("./saved/")
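
For readers without access to `utils`, the check above might look roughly like the sketch below. It is not the actual implementation of `utils.get_single_paragraph_song_info`: the file layout (one JSON file per song) and the keys `id`, `title`, `artist` and `paragraphs` are assumptions made purely for illustration.

```python
import json
from pathlib import Path


def find_single_paragraph_songs(folder):
    """Map song id -> (title, artist) for songs with exactly one paragraph.

    Hypothetical stand-in: the JSON keys used here are assumed, not taken
    from the real converted files.
    """
    flagged = {}
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            song = json.load(f)
        # A single paragraph usually signals that line matching failed
        if len(song.get("paragraphs", [])) == 1:
            flagged[song["id"]] = (song["title"], song["artist"])
    return flagged
```

The returned dictionary has the same shape as the `song_ids` output printed further down (id mapped to a title/artist pair), so it could be passed to the later clean-up steps in the same way.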

In [22]:
len(song_ids)

397

There are 397 songs we need to put aside. Let's use their ids to do so.

In [23]:
print(song_ids)

{'00000003': ('üÑºüÑ∞üÑªüÑ∞üÑº TERAKHIR Terakhir Terakhir', 'üá≥ üá¶\ufeff üá© üá¶\ufeff = üá≥ üá¶\ufeff üá© \ufeffüáÆ \u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000\u3000 MALAM'), '00000005': ('Me acostumbre', 'Arcangel feat. Bad Bunny'), '00000016': ('Shape Of You', 'Shape'), '0000001E': ('Muchacha ojos de papel', 'Spinetta'), '0000002F': ('Me Duele Amarte', 'Reik'), '00000039': ('Asi fue', 'Juan Gabriel'), '0000005A': ('Felices los 4', 'Maluma'), '0000008E': ('Let it go', 'James Bay'), '00000096': ('Tomorrow Starts Today - Andi Mack Theme Song', 'Sabrina Carpenter'), '000000A4': ("How Far I'll Go", 'Moana'), '000000A6': ('Strip That Down', 'Liam Payne ft. Quavo'), '000000C1': ('We Found Love', 'Rihanna'), '000000D0': ('Hallelujah', 'Jeff Buckley'), '000000E3': ('1Kilo-Deixe me ir', '1Kilo'), '000000E4': ('Bang - Anitta',

Let's remove the 397 songs from the usable converted songs.

In [24]:
utils.delete_dict_entries(song_ids, './id.txt')

Initial dictionary size: 4517
Dictionary size after deletions: 4120
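
As a rough illustration of this step, here is a minimal sketch of removing flagged ids from an id dictionary. It works on an in-memory dictionary only; how `./id.txt` is actually parsed and written by `utils.delete_dict_entries` is not shown in this notebook, and the function name `remove_entries` and the toy data are hypothetical.

```python
def remove_entries(id_map, ids_to_remove):
    """Return a copy of id_map without the given ids, reporting sizes."""
    print(f"Initial dictionary size: {len(id_map)}")
    kept = {k: v for k, v in id_map.items() if k not in ids_to_remove}
    print(f"Dictionary size after deletions: {len(kept)}")
    return kept


# Toy data (not taken from the real id file)
id_map = {"00000001": "song_a", "00000002": "song_b", "00000003": "song_c"}
kept = remove_entries(id_map, {"00000003"})
```

Returning a new dictionary rather than mutating the input keeps the original mapping available in case the deletion needs to be checked or undone.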


The 397 songs have been deleted from the id file (4517 - 4120 = 397). Now, let's delete the corresponding files from the saved folder, then update the ids of the usable songs (and their file names) so that they are all consecutive.

In [26]:
utils.delete_wrong_converted_songs(song_ids, './saved/')

Song 1/397 successfully deleted!
Song 2/397 successfully deleted!
Song 3/397 successfully deleted!
Song 4/397 successfully deleted!
Song 5/397 successfully deleted!
Song 6/397 successfully deleted!
Song 7/397 successfully deleted!
Song 8/397 successfully deleted!
Song 9/397 successfully deleted!
Song 10/397 successfully deleted!
Song 11/397 successfully deleted!
Song 12/397 successfully deleted!
Song 13/397 successfully deleted!
Song 14/397 successfully deleted!
Song 15/397 successfully deleted!
Song 16/397 successfully deleted!
Song 17/397 successfully deleted!
Song 18/397 successfully deleted!
Song 19/397 successfully deleted!
Song 20/397 successfully deleted!
Song 21/397 successfully deleted!
Song 22/397 successfully deleted!
Song 23/397 successfully deleted!
Song 24/397 successfully deleted!
Song 25/397 successfully deleted!
Song 26/397 successfully deleted!
Song 27/397 successfully deleted!
Song 28/397 successfully deleted!
Song 29/397 successfully deleted!
Song 30/397 successfull

In [27]:
utils.change_ids_and_rename_files('./id.txt', './saved/', 8)
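
The renumbering idea can be sketched as follows. This is only an illustration under stated assumptions: the ids printed earlier look like 8-character uppercase hexadecimal strings, so the sketch assumes the last argument (`8`) is the id width, that numbering starts at 1, and that the original order is preserved. None of this is confirmed by the `utils` source, and `renumber_ids` is a hypothetical name.

```python
def renumber_ids(old_ids, width=8):
    """Map each old id to a new consecutive, zero-padded uppercase hex id.

    Assumptions: ids are hexadecimal strings, numbering starts at 1,
    and the original ordering is preserved.
    """
    ordered = sorted(old_ids, key=lambda s: int(s, 16))
    return {old: format(new, f"0{width}X") for new, old in enumerate(ordered, start=1)}


# Toy example: ids with gaps left by deleted songs
mapping = renumber_ids(["00000001", "00000002", "00000004", "0000001F"])
```

Each saved file would then be renamed from its old id to the new one given by `mapping`; the actual file naming scheme is not visible here, so that part is left out.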

# Statistics about the datasets

In [28]:
id_file_path = "./id.txt"
damp_avoided_songs_file_path = "./damp_avoided_songs.txt"
dali_avoided_songs_file_path = "./dali_avoided_songs.txt"

## DALI

In [29]:
dali_usable_count = utils.count_sources(id_file_path)

In [30]:
print(dali_usable_count['count_dali'])

3051


In [31]:
dali_avoided_songs = utils.count_avoided_songs(dali_avoided_songs_file_path)

In [32]:
print(dali_avoided_songs)

{'avoided_counts': {'no_language_information': 26, 'wrongly_encoded_asian_song': 54, 'no_paragraphs': 1776, 'not_found_on_Genius': 29, 'total': 1885}}
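
A quick sanity check on this output is to verify that the per-reason counts add up to the reported total. The helper below is a small sketch written for this notebook (the name `reasons_sum_matches_total` is ours, not part of `utils`); it only relies on the dictionary structure printed above.

```python
def reasons_sum_matches_total(avoided):
    """Check that the per-reason counts add up to the 'total' entry."""
    counts = avoided["avoided_counts"]
    return sum(v for k, v in counts.items() if k != "total") == counts["total"]


# Values copied from the output above
dali_avoided = {
    "avoided_counts": {
        "no_language_information": 26,
        "wrongly_encoded_asian_song": 54,
        "no_paragraphs": 1776,
        "not_found_on_Genius": 29,
        "total": 1885,
    }
}
print(reasons_sum_matches_total(dali_avoided))  # True: 26 + 54 + 1776 + 29 = 1885
```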


In [33]:
dali_total = dali_usable_count['count_dali'] + dali_avoided_songs['avoided_counts']['total']
if dali_total != 0:  # guard against division by zero
    percentage = dali_usable_count['count_dali'] / dali_total * 100
    print(f"{percentage}% of the songs from the DALI dataset are usable to feed the knowledge graph.")

61.81118314424635% of the songs from the DALI dataset are usable to feed the knowledge graph.


## DAMP

We have far fewer statistics to present regarding our DAMP subset, as our avoided_songs file got overwritten. However, if we consider the 19 songs previously discarded (cf. post-conversion steps from the DAMP conversion section) as the number of avoided songs, and our total number of files to be the 264 files we got from the (sadly interrupted midway) conversion, we get:

In [34]:
damp_usable_count = utils.count_sources(id_file_path)

In [35]:
print(damp_usable_count['count_damp'])

1115


In [36]:
damp_avoided_songs = utils.count_avoided_songs(damp_avoided_songs_file_path)

In [37]:
print(damp_avoided_songs)

{'avoided_counts': {'no_language_information': 25, 'wrongly_encoded_asian_song': 0, 'no_paragraphs': 1876, 'not_found_on_Genius': 1933, 'total': 3834}}


We notice that this number is lower than the number of DAMP files we merged (245). Since it takes into account the overlaps between DAMP and DALI, the only possible reason is that some songs from the DAMP subset were duplicates of each other, which was the case for 14 of them. This is probably a consequence of our partially failed DAMP conversion. Fortunately, the steps we followed enabled us to clean this up before using the data to create the knowledge graph.

In [38]:
damp_total = damp_usable_count['count_damp'] + damp_avoided_songs['avoided_counts']['total']
if damp_total != 0:  # guard against division by zero
    # Usable share = usable / (usable + avoided), as in the DALI computation
    percentage = damp_usable_count['count_damp'] / damp_total * 100
    print(f"{percentage:.2f}% of the songs from our subset of the DAMP dataset are usable to feed the knowledge graph.")

22.53% of the songs from our subset of the DAMP dataset are usable to feed the knowledge graph.


## Common statistics

Finally, here is the number of songs common to DALI and our subset of DAMP:

In [39]:
print(damp_usable_count['count_double_sources'])

46


And here is the total number of usable files from our dataset:

In [40]:
print(damp_usable_count['count_dali']+damp_usable_count['count_damp']-damp_usable_count['count_double_sources'])

4120
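
The total above follows the inclusion-exclusion principle: songs present in both sources are counted once in each per-source count, so they are subtracted once. The sketch below reproduces the arithmetic with the values printed in this notebook; it assumes, as the computation above implies, that `count_dali` and `count_damp` each include the double-sourced songs. The function name `total_unique_songs` is hypothetical.

```python
def total_unique_songs(counts):
    """Combine per-source counts, subtracting songs counted in both sources."""
    return counts["count_dali"] + counts["count_damp"] - counts["count_double_sources"]


# Values copied from the outputs above
counts = {"count_dali": 3051, "count_damp": 1115, "count_double_sources": 46}
total = total_unique_songs(counts)
print(total)  # 4120
```

The result matches the size of the id dictionary reported after the deletions (4120), which is a useful cross-check that the id file and the per-source counts agree.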
