In [3]:
from LyricsAnnot import LyricsAnnot
import json
import main
import utils

Below are the functions we used and the steps we followed to create our final dataset from DALI and DAMP, enriching it with information from Genius and Spotify. Please note that this notebook is meant to keep track of what we did rather than to be rerun. However, if you need to add another dataset to ours, you could adapt the steps below to integrate it into our work.

# Conversion of the whole DALI dataset

In [5]:
# Path of the DALI json files folder (not present in the directory)
DALI_dir = "../dali/dali_json/" 

# Since some DAMP songs caused overwriting errors during conversion, we keep the id file and the
# avoided-songs file for DALI separate from the DAMP ones
id_file_path = "./dali_id.txt" # List of the usable converted songs along with their new id
avoided_songs_file_path = "./dali_avoided_songs.txt" # List of the songs put aside from the conversion for different reasons

# A temporary folder (no longer present in the repository) that receives the DALI json files once
# they have been processed; they are removed from the DALI folder at the same time.
# This way, if the function stops midway, we can rerun it without restarting the conversion from scratch.
already_converted_path = "../dali/dali_already_converted"

# Path of the folder where the final json files get saved once the song gets processed and converted
save_path = "./saved/"

In [None]:
main.create_dali_notations(DALI_dir, save_path, already_converted_path, id_file_path, avoided_songs_file_path)
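The resumable behaviour described in the comments above can be sketched as follows. This is only an illustration of the idea of moving each file out of the source folder once it has been processed; `process_resumably` and its `convert` argument are hypothetical names and do not reflect the actual code of `main.create_dali_notations`.

```python
import shutil
import tempfile
from pathlib import Path


def process_resumably(source_dir, done_dir, convert):
    """Convert every JSON file in source_dir, then move it to done_dir.

    Hypothetical helper: files that were already handled are no longer in
    source_dir, so a rerun only sees the files that are still pending.
    """
    source, done = Path(source_dir), Path(done_dir)
    done.mkdir(parents=True, exist_ok=True)
    processed = []
    for json_file in sorted(source.glob("*.json")):
        convert(json_file)  # if this raises, the file stays in source_dir
        shutil.move(str(json_file), str(done / json_file.name))
        processed.append(json_file.name)
    return processed


# Small demonstration on throwaway folders with a no-op conversion.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "todo"
    src.mkdir()
    for name in ("a.json", "b.json"):
        (src / name).write_text("{}")
    first_run = process_resumably(src, Path(tmp) / "done", lambda path: None)
    second_run = process_resumably(src, Path(tmp) / "done", lambda path: None)
```

In this sketch the second run finds nothing left to convert, which is the property that makes an interrupted conversion cheap to restart.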

## Post-conversion steps

Are there songs that were wrongly converted (for example, because errors in the dataset content prevent correct line matching) and hence need to be put aside? Usually, these songs have only a single paragraph.

In [None]:
song_ids = utils.get_single_paragraph_song_info("./saved/")
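As a rough illustration of this check, the sketch below flags converted files that contain exactly one paragraph. The JSON layout it assumes (`<song_id>.json` files with `title`, `artist` and `paragraphs` keys) and the function name `find_single_paragraph_songs` are hypothetical; the real logic lives in `utils.get_single_paragraph_song_info` and may differ.

```python
import json
import tempfile
from pathlib import Path


def find_single_paragraph_songs(folder):
    """Return {song_id: (title, artist)} for songs with exactly one paragraph.

    Assumed (hypothetical) layout: one <song_id>.json file per song with
    "title", "artist" and "paragraphs" keys.
    """
    flagged = {}
    for path in sorted(Path(folder).glob("*.json")):
        song = json.loads(path.read_text(encoding="utf-8"))
        if len(song.get("paragraphs", [])) == 1:
            flagged[path.stem] = (song.get("title"), song.get("artist"))
    return flagged


# Demonstration on two made-up songs written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "00000001": {"title": "Song A", "artist": "Artist A",
                     "paragraphs": [["line 1"], ["line 2"]]},
        "00000002": {"title": "Song B", "artist": "Artist B",
                     "paragraphs": [["line 1", "line 2"]]},
    }
    for song_id, content in samples.items():
        target = Path(tmp) / f"{song_id}.json"
        target.write_text(json.dumps(content), encoding="utf-8")
    flagged = find_single_paragraph_songs(tmp)
```

The returned dictionary has the same shape as the `song_ids` output printed in this notebook, mapping each id to a `(title, artist)` tuple.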

In [None]:
len(song_ids)

304

There are 304 songs we need to put aside. Let's use their ids to do so.

In [None]:
print(song_ids)

{'00000003': ('Geschwisterliebe', 'Die Ärzte'), '00000011': ('Adios', 'Rammstein'), '00000015': ('Kinky Afro', 'Happy Mondays'), '00000048': ('Bad Moon Rising', 'Creedence Clearwater Revival'), '0000006F': ('Hounds Of Love', 'Kate Bush'), '00000082': ('My Love', 'Westlife'), '00000085': ('Have You Ever Seen The Rain', 'Creedence Clearwater Revival'), '0000008A': ('Sweetest Thing', 'U2'), '00000091': ('The Invisible Man', 'Queen'), '0000009C': ('Mine', 'Taylor Swift'), '000000A5': ('Totale Finsternis', 'Tanz Der Vampire'), '000000A8': ('A Boy Named Sue', 'Johnny Cash'), '000000B5': ('Mirror Mirror', 'Blind Guardian'), '000000BC': ('Eet', 'Regina Spektor'), '000000BD': ('Freiheit', 'Marius Müller-Westernhagen'), '000000D0': ('Stray Heart', 'Green Day'), '000000D7': ('Hard Candy Christmas', 'Dolly Parton'), '000000DF': ('Last Man Standing', 'HammerFall'), '000000EB': ('Rosas', 'La Oreja De Van Gogh'), '000000FA': ('Du Entschuldige I Kenn Di', 'Peter Cornelius'), '00000106': ("'39", 'Queen

Let's remove the 304 songs from the usable converted songs.

In [None]:
utils.delete_dict_entries(song_ids, './dali_id.txt')

The 304 songs have been deleted from the id file. Now, let's delete their converted files and update the ids of the remaining usable songs (and their file names) so that they are all consecutive.

In [None]:
utils.delete_wrong_converted_songs(song_ids, './saved/')

In [None]:
utils.change_ids_and_rename_files('./dali_id.txt', './saved/', 8)
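To make the renumbering step concrete, here is a minimal sketch of how gaps in the ids can be closed. It assumes that ids are zero-padded uppercase hexadecimal strings (as in the output above), that the last argument `8` is the id width, and that numbering starts at 1. All three are assumptions, and `consecutive_ids` is a hypothetical name, not the implementation of `utils.change_ids_and_rename_files`.

```python
def consecutive_ids(old_ids, width=8):
    """Map each old id to a new consecutive, zero-padded uppercase hex id.

    Assumptions: ids are hexadecimal strings and numbering starts at 1.
    """
    ordered = sorted(old_ids, key=lambda value: int(value, 16))
    return {
        old: format(index, f"0{width}X")
        for index, old in enumerate(ordered, start=1)
    }


# Ids 3 and 5 to 10 are missing here, as if those songs had been removed.
mapping = consecutive_ids(["00000001", "00000002", "00000004", "0000000B"])
```

Such a mapping could then be used both to rewrite the id file and to rename each `<old_id>.json` file to `<new_id>.json`.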

Finally, let's add the 304 avoided songs' information to the avoided songs file from DALI.

In [1]:
song_ids = {'00000003': ('Geschwisterliebe', 'Die Ärzte'), '00000011': ('Adios', 'Rammstein'), '00000015': ('Kinky Afro', 'Happy Mondays'), '00000048': ('Bad Moon Rising', 'Creedence Clearwater Revival'), '0000006F': ('Hounds Of Love', 'Kate Bush'), '00000082': ('My Love', 'Westlife'), '00000085': ('Have You Ever Seen The Rain', 'Creedence Clearwater Revival'), '0000008A': ('Sweetest Thing', 'U2'), '00000091': ('The Invisible Man', 'Queen'), '0000009C': ('Mine', 'Taylor Swift'), '000000A5': ('Totale Finsternis', 'Tanz Der Vampire'), '000000A8': ('A Boy Named Sue', 'Johnny Cash'), '000000B5': ('Mirror Mirror', 'Blind Guardian'), '000000BC': ('Eet', 'Regina Spektor'), '000000BD': ('Freiheit', 'Marius Müller-Westernhagen'), '000000D0': ('Stray Heart', 'Green Day'), '000000D7': ('Hard Candy Christmas', 'Dolly Parton'), '000000DF': ('Last Man Standing', 'HammerFall'), '000000EB': ('Rosas', 'La Oreja De Van Gogh'), '000000FA': ('Du Entschuldige I Kenn Di', 'Peter Cornelius'), '00000106': ("'39", 'Queen'), '00000115': ('Piano Man', 'Billy Joel'), '00000121': ('Tu', 'Noelia'), '00000125': ('Tra Te E Il Mare', 'Laura Pausini'), '00000141': ("Cryin'", 'Aerosmith'), '00000150': ('Yo (Excuse Me Miss)', 'Chris Brown'), '00000169': ('Return To Innocence', 'Enigma'), '0000016D': ('Mindestens In 1000 Jahren', 'Frittenbude'), '00000174': ('Breathe Easy', 'Blue'), '00000178': ('Cassius', 'Foals'), '00000191': ('Change The World', 'The Offspring'), '00000199': ('Broken Angel', 'Arash'), '000001A0': ('DOA', 'Foo Fighters'), '000001A3': ('Our Song', 'Taylor Swift'), '000001A6': ('Proud Mary', 'Creedence Clearwater Revival'), '000001C7': ('Time Is On My Side', 'The Rolling Stones'), '000001CB': ("Without You I'm Nothing", 'Placebo'), '000001D4': ('Build A Bridge', 'Limp Bizkit'), '000001DF': ('Shining Light', 'Ash'), '000001E7': ('Danny Boy', 'Elvis Presley'), '000001EA': ('Allt På Ett Kort', 'Bob Hund'), '000001ED': ('Touch Of My Hand', 'Britney Spears'), '000001F0': ('Calling You', 
'Kat DeLuna'), '00000200': ('Leave It Alone', 'NOFX'), '00000201': ('Get Out Alive', 'Three Days Grace'), '00000205': ('National Anthem', 'LeAnn Rimes'), '0000020B': ('Saturday Night', 'Suede'), '0000020E': ('Bad Case Of Loving You (Doctor, Doctor)', 'Robert Palmer'), '00000217': ('Take Your Mama', 'Scissor Sisters'), '00000218': ('Desire', 'U2'), '0000022C': ('Stranger', 'Hilary Duff'), '00000236': ('(Do You) Get Excited', 'Roxette'), '0000023E': ('7 E 40', 'Lucio Battisti'), '00000246': ('Yes', 'Demi Lovato'), '0000025C': ('Stay', 'Tooji'), '00000260': ('En Algún Lugar', 'Duncan Dhu'), '00000265': ("You'll Be Mine", 'The Pierces'), '0000026C': ('Brown Eyed Girl', 'Van Morrison'), '0000026F': ('Missing', 'Evanescence'), '0000028B': ('Crush', 'David Archuleta'), '00000290': ('Antisocial', 'Anthrax'), '00000299': ('Woman', 'John Lennon'), '000002B0': ('The Greatest Man That Ever Lived (Variations On A Shaker Hymn)', 'Weezer'), '000002BB': ('You Sexy Thing', 'Hot Chocolate'), '000002BD': ('The Prayer', 'Bloc Party'), '000002C2': ('The Waiting One', 'All That Remains'), '000002CE': ('The Down Syndrome', 'Grey Daze'), '000002D0': ('Walk Like An Egyptian', 'The Bangles'), '000002EA': ('Dramatic Song', 'Toby Turner'), '000002ED': ('Deutschland', 'Die Prinzen'), '000002FD': ('Asereje', 'Las Ketchup'),
'00000308': ('Mr. Sandman', 'Emmylou Harris'), '00000318': ('Serenata Rap', 'Jovanotti'), '00000329': ('Sunshine', 'Aerosmith'), '00000332': ('(Do You Get) Excited?', 'Roxette'), '0000033F': ("Baba O'Riley", 'The Who'), '00000344': ('Vieni A Ballare In Puglia', 'CapaRezza'), '00000345': ('Una Carezza In Un Pugno', 'Adriano Celentano'), '00000350': ('Broken', 'Seether'), '00000352': ('Would You Love A Monsterman?', 'Lordi'), '00000353': ('Virtual Insanity', 'Jamiroquai'), '0000036E': ('Mil Horas', 'Los Abuelos De La Nada'), '00000376': ('Half A World Away', 'R.E.M.'), '0000037A': ('Time Bomb', 'Rancid'), '00000386': ('There Is A Light That Never Goes Out', 'The Smiths'), '00000387': ("Weus'd A Herz Hast Wia A Bergwerk", 'Rainhard Fendrich'), '00000399': ('You Are The One', 'Shiny Toy Guns'), '000003AF': ('Sex Bomb', 'Tom Jones'), '000003C4': ('RadioVideo', 'System Of A Down'), '000003D2': ('Mentira', 'João Pedro Pais'), '000003D3': ('New Soul', 'Yael Naim'), '000003DB': ('Bad Day', 'Darwin Deez'), '000003F8': ('Hostage Of Love', 'Razorlight'), '00000410': ('Rooftops (A Liberation Broadcast)', 'Lostprophets'), '0000041D': ('Jour 1', 'Louane'), '00000426': ('Vielleicht', 'Madsen'), '00000428': ('Down', 'Something Corporate'), '0000042F': ("Europe's Living A Celebration", 'Rosa López'), '0000043E': ('Easy To Ignore', 'Sixpence None The Richer'), '00000448': ('Black Sheep', 'Sonata Arctica'), '0000044F': ('Tourniquet', 'Evanescence'), '00000462': ('Deutsche Bahn', 'Wise Guys'), '0000046E': ("I'm Just A Kid", 'Simple Plan'), '00000477': ('Since U Been Gone', 'Kelly Clarkson'), '0000047A': ('Up!', 'Shania Twain'), '00000484': ('Der Graf', 'Die Ärzte'), '0000048F': ('Fast Car', 'Jonas Blue'), '0000049D': ('Yellow Ledbetter', 'Pearl Jam'), '000004B8': ('Heartbreaker', 'G-Dragon'), '000004DA': ('Omaboy', 'Die Ärzte'), '000004E8': ('Three Lions', 'Lightning Seeds'), '000004EC': ('Paper Gangsta', 'Lady Gaga'), '000004F5': ('Pocahontas', 'AnnenMayKantereit'), '000004F8': ('Porch', 'Pearl Jam'), 
'0000050F': ('Double Je', 'Christophe Willem'), '0000051C': ('50 Special', 'Lùnapop'), '0000051E': ('Born To Be Wild', 'Steppenwolf'), '00000533': ('Dear God', 'Avenged Sevenfold'), '00000536': ('Lo Que Pasó, Pasó', 'Daddy Yankee'), '00000553': ('When I Kissed The Teacher', 'ABBA'), '00000555': ('Yellow', 'Coldplay'), '00000560': ('Too Young', 'Phoenix'), '00000562': ('I Am What I Am', 'Gloria Gaynor'), '00000569': ('Dicono Di Me', 'Cesare Cremonini'), '0000056A': ('First Time', 'Lifehouse'), '00000571': ("Jesus Doesn't Want Me For A Sunbeam", 'Nirvana'), '00000579': ('Psycho', 'Puddle Of Mudd'), '00000597': ('Invece No', 'Laura Pausini'), '0000059F': ('Tomber La Chemise', 'Zebda'), '000005B8': ('The Wind Blows', 'The All-American Rejects'), '000005BC': ('Una Donna Per Amico', 'Lucio Battisti'), '000005C9': ('Mama Kin', 'Aerosmith'), '000005EF': ('Around The World (La La La La La)', 'ATC'), '000005F7': ('Fallen Angels', 'Black Veil Brides'), '000005FB': ('Lift U Up', 'Gotthard'), '00000607': ("Who's Going Home With You Tonight?", 'Trapt'), '00000614': ('Schickeria', 'Rainhard Fendrich'), '00000619': ('Strange World', 'Iron Maiden'), '0000061A': ('Show Me Your Colours', 'S Club 7'), '00000626': ('Never Wake Up', 'Sum 41'), '00000627': ('Break Stuff', 'Limp Bizkit'), '00000631': ("C'mon C'mon", 'The Von Bondies'), '00000633': ("God's Gonna Cut You Down", 'Johnny Cash'), '0000063E': ('If I Had A Hammer', 'Trini Lopez'), '00000643': ('Por Mujeres Como Tú', 'Pepe Aguilar'), '0000064F': ('Have I Told You Lately', 'Rod Stewart'), '00000655': ('From The Inside', 'Linkin Park'), '00000661': ('Sultans Of Swing', 'Dire Straits'), '00000683': ('Großvater', 'STS'), '000006A1': ('500 Miles', 'Peter'), '000006B1': ('Why Still Bother', 'Itchy Poopzkid'), '000006C1': ('On Melancholy Hill', 'Gorillaz'), '000006D8': ('Walk Away', 'Kelly Clarkson'), '000006DE': ('Inside Of You', 'Hoobastank'), '000006EE': ('Runaround Sue', 'Dion'), '000006F9': ('These Dreams', 'Heart'), '000006FE': 
('Days Without', 'All That Remains'), '00000719': ('Cherry Lips (Go Baby Go!)', 'Garbage'), '00000734': ('Just Give Me A Reason', 'P!nk'), '00000740': ('(Let Me Be Your) Teddy Bear', 'Elvis Presley'), '00000753': ('Moviestar', 'Harpo'), '00000755': ('Le Manège', 'Stanislas'), '0000076E': ('Georgia On My Mind', 'Ray Charles'), '00000781': ('Molinos De Viento', 'Mägo De Oz'), '00000785': ("'54, '74, '90, 2006", 'Sportfreunde Stiller'), '000007A3': ('Desperado', 'Eagles'), '000007B6': ('These Eyes', 'The Guess Who'), '000007C8': ('Go Tell It On The Mountain', 'Simon And Garfunkel'), '000007CD': ('Raggio Di Sole', 'Le Vibrazioni'), '000007E0': ("The Shoop Shoop Song (It's In His Kiss)", 'Cher'), '000007F0': ("I Don't Feel Like Dancin'", 'Scissor Sisters'), '000007F2': ('Demon Cleaner', 'Kyuss'), '000007F8': ('Free Your Mind', 'En Vogue'), '000007F9': ('We Are Who We Are', 'Little Mix'), '000007FB': ("Sgt. Pepper's Lonely Hearts Club Band", 'The Beatles'), '00000802': ('Doctor Jones', 'Aqua'), '00000813': ('Fallen Angel', 'Poison'), '0000082A': ('Summertime', 'DJ Jazzy Jeff'), '0000082E': ('Enjoy The Silence', 'Lacuna Coil'), '00000847': ('Élan', 'Nightwish'), '00000859': ('Kirche', 'Böhse Onkelz'), '0000086E': ("It's Hard To Know", 'Hot Water Music'), '00000880': ('Hey Tonight', 'Creedence Clearwater Revival'), '00000885': ("Macy's Day Parade", 'Green Day'), '0000088D': ('Chandelier', 'Sia'), '0000088E': ('Them Bones', 'Alice In Chains'), '00000897': ('Dance Of Death', 'Iron Maiden'), '000008AB': ("Vuonna '85", 'Eppu Normaali'), '000008AE': ("Don't Cry For Pain", 'Ana Johnsson'), '000008AF': ('Pushing Me Away', 'Linkin Park'), '000008BF': ('Spaceman', 'Babylon Zoo'), '000008D7': ('En Apesanteur', 'Calogero'), '000008DC': ('Blue (Da Ba Dee)', 'Eiffel 65'), '000008E2': ('Tu Seras', 'Emma Daumas'), '000008ED': ('Dile Al Sol', 'La Oreja De Van Gogh'), '000008FA': ('Cat And Mouse', 'The Red Jumpsuit Apparatus'), '00000916': ('Geil, Geil, Geil (Wir Sind Die Größten)', 
'Wolfgang Petry'), '00000918': ('Leave In Silence', 'Depeche Mode'), '0000091D': ('Vindicated', 'Dashboard Confessional'), '0000092E': ('Second Chance', 'Shinedown'), '00000936': ('Skandal Im Sperrbezirk', 'Spider Murphy Gang'), '0000093A': ('Gay Pirates', 'Cosmo Jarvis'), '0000093C': ('Working For The Weekend', 'Loverboy'), '0000093D': ('Linoleum', 'NOFX'), '0000093E': ('Non Passerai', 'Marco Mengoni'), '00000943': ('Pretty Fly (For A White Guy)', 'The Offspring'), '00000950': ('Schickeria', 'Spider Murphy Gang'), '00000953': ('Half The World Away', 'Oasis'), '00000955': ('Oerend Hard', 'Normaal'), '00000967': ('Little Things', 'One Direction'), '0000096D': ('Lonely Boy', 'The Black Keys'), '00000975': ('Going Nowhere', 'Little Mix'), '0000097B': ('Carta', 'Toranja'), '00000988': ('Friday On My Mind', 'Gary Moore'), '00000995': ('Take A Chance On Me', 'ABBA'), '000009A4': ('Shapeshifter', 'Celldweller'), '000009AF': ('Sei Parte Di Me', 'Zero Assoluto'), '000009BA': ('Strong', 'London Grammar'), '000009CD': ("Un'emozione Per Sempre", 'Eros Ramazzotti'), '000009D1': ('Animals', 'Maroon 5'), '000009E7': ('Laura', 'Scissor Sisters'), '000009F1': ('Shine', 'Imogen Heap'), '00000A02': ('Coma', "Guns N' Roses"), '00000A04': ('Dieser Weg', 'Xavier Naidoo'), '00000A06': ('Numb', 'Linkin Park'), '00000A09': ('Hit The Lights', 'Metallica'), '00000A0E': ('Cotton-Eye Joe', 'Rednex'), '00000A14': ('Pure Morning', 'Placebo'), '00000A1A': ("Ich Möcht' So Gerne Metal Hör'n", 'J.B.O.'), '00000A20': ("Sorry, You're Not A Winner", 'Enter Shikari'), '00000A28': ('Plug In Baby', 'Muse'), '00000A46': ('Girls & Boys', 'Blur'), '00000A47': ('El Del Medio De Los Chichos', 'Estopa'), '00000A4D': ('Vivir Así Es Morir De Amor', 'Camilo Sesto'), '00000A5B': ('This Will Make You Love Again', 'IAMX'), '00000A6A': ('I Remember You', 'Skid Row'), '00000A72': ('She Goes Nana', 'The Radios'), '00000A7C': ('Story Of A Lonely Guy', 'Blink-182'), '00000A8F': ('Run To The Hills', 'Iron Maiden'), 
'00000AA3': ("Give 'Em Hell, Kid", 'My Chemical Romance'), '00000AA6': ('Zoe Jane', 'Staind'), '00000AAE': ('Between Angels And Insects', 'Papa Roach'), '00000AB0': ("Gangsta's Paradise", 'Coolio'), '00000AB2': ('Ho Hey', 'The Lumineers'), '00000ABA': ('Soulweeper 2', 'Volbeat'), '00000ABC': ('Beweg Dein Arsch', 'LaFee'), '00000AC3': ('Kein Gerede', 'WIZO'), '00000AC6': ('Tu Mi Porti Su', 'Giorgia'), '00000AD1': ('Ricominciamo', 'Adriano Pappalardo'), '00000AD6': ('Tutto Il Resto È Noia', 'Franco Califano'), '00000ADE': ('Hail To The King', 'Avenged Sevenfold'), '00000AE9': ('99 Red Balloons', 'Goldfinger'), '00000AF5': ('Return To Sender', 'Elvis Presley'), '00000AF9': ('Going Missing', 'Maxïmo Park'), '00000B07': ('The Joker', 'Steve Miller Band'), '00000B09': ('Name', 'Goo Goo Dolls'), '00000B0A': ('Oh Yeah', 'Roxy Music'), '00000B16': ('Wrapped', 'Gloria Estefan'), '00000B1C': ('Sweet Home Alabama', 'Lynyrd Skynyrd'), '00000B39': ('Rusted From The Rain', 'Billy Talent'), '00000B49': ('DeMoliendo Hoteles', 'Charly García'), '00000B51': ('Do You Want To', 'Franz Ferdinand'), '00000B57': ('Schwinger', 'Seeed'), '00000B61': ('Cross My Heart', 'Marianas Trench'), '00000B6E': ('Juke Box Hero', 'Foreigner'), '00000B74': ('Say You, Say Me', 'Lionel Richie'), '00000B7B': ('I Miss You', 'Miley Cyrus'), '00000B90': ('Celebrity Status', 'Marianas Trench'), '00000B97': ("Parce Qu'on Vient De Loin", 'Corneille'), '00000B98': ('Jump', 'Van Halen'), '00000B9D': ('Certe Notti', 'Ligabue'), '00000BA4': ('Traffic', 'Stereophonics'), '00000BB1': ('Brizgalna Brizga', 'Atomik Harmonik'), '00000BB4': ("Rome Wasn't Built In A Day", 'Morcheeba'), '00000BB5': ('Our Velocity', 'Maxïmo Park'), '00000BBD': ('Touch Me', 'The Doors'), '00000BC3': ('E Ritorno Da Te', 'Laura Pausini'), '00000BC4': ('Denial', 'Sugababes'), '00000BC7': ('Hier Kommt Kurt', 'Frank Zander'), '00000BCE': ('Eve Of Destruction', 'Barry McGuire'), '00000BD5': ('Thinking Out Loud', 'Ed Sheeran'), '00000BDD': 
('Leuchtturm', 'Nena'), '00000C07': ('Real Things', 'Javine'), '00000C09': ('Cinderela', 'Carlos Paião'), '00000C16': ('Add It Up', 'Violent Femmes'), '00000C1B': ('Die Welt Kann Mich Nicht Mehr Verstehen', 'Tocotronic'), '00000C20': ('Crying In The Chapel', 'Elvis Presley'), '00000C4F': ('Guilty Of Love', 'Shanadoo'), '00000C75': ('Noches De Bohemia', 'Navajita Plateá'), '00000C87': ('4000 Rainy Nights', 'Stratovarius'), '00000C8E': ('The Trooper', 'Iron Maiden'), '00000C98': ('Gifts And Curses', 'Yellowcard'), '00000C9B': ('Tonight', 'Big Bang'), '00000CA0': ('Haschisch Kakalake', 'Creme De La Creme'), '00000CB0': ('My Hero', 'Foo Fighters'), '00000CB9': ('Torero', 'Chayanne'), '00000CC6': ('Everyday', 'Buddy Holly'), '00000CC7': ('Bad Day', 'Daniel Powter'), '00000CCC': ('Kiss This', 'Aaron Tippin'), '00000CD7': ('Dirty Deeds Done Dirt Cheap', 'AC/DC'), '00000CE4': ('El Alma En Pie', 'Chenoa'), '00000CEC': ('We Own The Night', 'The Wanted'), '00000CEF': ('Ghostbusters', 'Ray Parker'), '00000D20': ('...Baby One More Time', 'Britney Spears')}

## Statistics about the DALI dataset

In [7]:
dali_usable_count = utils.count_sources(id_file_path)

In [9]:
print(dali_usable_count['count_dali'])

3060


In [None]:
utils.calculate_percentage(id_file_path, avoided_songs_file_path, 'DALI')

'63.86% of the songs from DALI dataset are usable to create the feed the knowledge graph.'

In [None]:
dali_avoided_songs = utils.dali_count_avoided_songs(avoided_songs_file_path)

{'dali_counts': {'no_language_information': 28,
  'wrongly_encoded_asian_song': 48,
  'no_paragraphs': 1801,
  'not_found_on_Genius': 27,
  'total': 1904}}

In [None]:
print(dali_avoided_songs)

In [None]:
# Guard against a division by zero when no song has been counted at all
total = dali_usable_count['count_dali'] + dali_avoided_songs['dali_counts']['total']
if total != 0:
    percentage = dali_usable_count['count_dali'] / total * 100
    print(f"{percentage:.2f}% of the songs from the DALI dataset are usable to feed the knowledge graph.")
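As a worked example of this formula, the sketch below applies it to the counts printed in this section. The function name `usable_percentage` is ours and purely illustrative. With 3060 usable and 1904 avoided songs the formula gives about 61.64%. The 63.86% string shown above corresponds instead to 3364 usable songs (3060 + 304), so that output may have been produced before the single-paragraph songs were removed.

```python
def usable_percentage(usable, avoided):
    """Share of usable songs among all songs (usable + avoided), in percent."""
    total = usable + avoided
    if total == 0:
        return 0.0
    return usable / total * 100


# Counts taken from the outputs in this notebook.
after_cleanup = round(usable_percentage(3060, 1904), 2)   # 304 songs removed
before_cleanup = round(usable_percentage(3364, 1904), 2)  # 304 songs still counted
```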

# Conversion of the whole DAMP dataset

In [11]:
# Path of the DAMP json files folder (not present in the directory)
DAMP_dir = "D:/US/"

# Since some DAMP songs caused overwriting errors during conversion, we keep the id file and the
# avoided-songs file for DAMP separate from the DALI ones
id_file_path = "./damp_id.txt" # List of the usable converted songs along with their new id
avoided_songs_file_path = "./damp_avoided_songs.txt" # List of the songs put aside from the conversion for different reasons

# A temporary folder (no longer present in the repository) that receives the DAMP json files once
# they have been processed; they are removed from the DAMP folder at the same time.
# This way, if the function stops midway, we can rerun it without restarting the conversion from scratch.
already_converted_path = "D:/US_already/"

# Path of the folder where the final json files get saved once the song gets processed and converted
save_path = "./damp_saved/"

In [None]:
main.create_damp_notations(DAMP_dir, save_path, already_converted_path, id_file_path, avoided_songs_file_path)

## Post-conversion steps

Are there songs that were wrongly converted (for example, because errors in the dataset content prevent correct line matching) and hence need to be put aside? Usually, these songs have only a single paragraph.

In [None]:
song_ids = utils.get_single_paragraph_song_info("./damp_saved/")

In [None]:
len(song_ids)

304

There are 304 songs we need to put aside. Let's use their ids to do so.

In [None]:
print(song_ids)

{'00000003': ('Geschwisterliebe', 'Die Ärzte'), '00000011': ('Adios', 'Rammstein'), '00000015': ('Kinky Afro', 'Happy Mondays'), '00000048': ('Bad Moon Rising', 'Creedence Clearwater Revival'), '0000006F': ('Hounds Of Love', 'Kate Bush'), '00000082': ('My Love', 'Westlife'), '00000085': ('Have You Ever Seen The Rain', 'Creedence Clearwater Revival'), '0000008A': ('Sweetest Thing', 'U2'), '00000091': ('The Invisible Man', 'Queen'), '0000009C': ('Mine', 'Taylor Swift'), '000000A5': ('Totale Finsternis', 'Tanz Der Vampire'), '000000A8': ('A Boy Named Sue', 'Johnny Cash'), '000000B5': ('Mirror Mirror', 'Blind Guardian'), '000000BC': ('Eet', 'Regina Spektor'), '000000BD': ('Freiheit', 'Marius Müller-Westernhagen'), '000000D0': ('Stray Heart', 'Green Day'), '000000D7': ('Hard Candy Christmas', 'Dolly Parton'), '000000DF': ('Last Man Standing', 'HammerFall'), '000000EB': ('Rosas', 'La Oreja De Van Gogh'), '000000FA': ('Du Entschuldige I Kenn Di', 'Peter Cornelius'), '00000106': ("'39", 'Queen

3364 songs have been converted and are considered usable. We need to remove the 304 songs from them.

In [None]:
utils.delete_dict_entries(song_ids, './damp_id.txt')

The 304 songs have been deleted from the id file. Now, let's delete their converted files and update the ids of the remaining usable songs so that they are all consecutive.

In [None]:
utils.delete_wrong_converted_songs(song_ids, './damp_saved/')

In [None]:
utils.change_ids_and_rename_files('./damp_id.txt', './damp_saved/', 8)