### Loading the dataset in a big dict

In this notebook we load the txt token files into a single dictionary that stores, for each song, the sequence of its tokens.
For a song to be added to the dataset, it must contain both a bass guitar track and a rhythm guitar (RG) track.

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import guitarpro as pygp
import pathlib
import pickle

In [2]:
# Build the path from components so that no backslash escape sequences are involved
path_to_bass_folder = pathlib.Path("..", "..", "data", "BGTG", "BGTG_Bass")
thr_measures = 8
thr_tokens = 1000
# Iterate over all the alphabetical and group folders within the bass folder.
# For the first implementation we assume that if a song has bass, it also has RG.
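The assumption above depends on a naming convention: the next cell derives the RG file from the bass file by swapping `BGTG_Bass` for `BGTG_RG` and `_bass.txt` for `_rythmic.txt`. A minimal sketch of that mapping is given here, working on path components instead of on the whole string. `rg_path_for` is a hypothetical helper, not part of the notebook, and the example path is made up. Unlike `str.replace`, it only touches a folder named exactly `BGTG_Bass` and a file name ending in `_bass.txt`.

```python
import pathlib


def rg_path_for(bass_path: pathlib.PurePath) -> pathlib.PurePath:
    """Map a bass token file to its RG counterpart (hypothetical helper)."""
    # Swap only the folder component named exactly BGTG_Bass.
    parts = ["BGTG_RG" if part == "BGTG_Bass" else part for part in bass_path.parts]
    # Swap the file-name suffix; the "_rythmic" spelling follows the dataset files.
    name = parts[-1]
    if name.endswith("_bass.txt"):
        name = name[: -len("_bass.txt")] + "_rythmic.txt"
    parts[-1] = name
    # Rebuild a path of the same flavour (POSIX or Windows) as the input.
    return type(bass_path)(*parts)


# Made-up example path, used only to illustrate the mapping.
example = pathlib.PurePosixPath("data/BGTG/BGTG_Bass/A/Artist/Song_bass.txt")
mapped = rg_path_for(example)
print(mapped)
```

With a concrete `pathlib.Path`, calling `.exists()` on the result would check whether the assumption holds for a given song.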

In [3]:
# Loop on the files of the bass folder, for each file check if there is a corresponding file in the RG folder
# Then generate the bass sequence and the RG sequence 
path_errors=0
path_errors_list=[]
big_dict = {'All_Events':[], 'Encoder_RG':[], 'Decoder_Bass':[]}

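# big_dict has the following structure:
# keys: 'All_Events', 'Encoder_RG', 'Decoder_Bass'
# big_dict['Encoder_RG'][0] = list of RG txt tokens for the first song
# big_dict['Decoder_Bass'][0] = list of bass txt tokens for the first song
# big_dict['All_Events'][0] = list of txt tokens read from the BGTG_RG_Bass file for the first song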

for bass_file_path in tqdm(path_to_bass_folder.rglob("*.txt"), total=14480, desc="Generating sequences of ids"):
    # Replace _bass with _rythmic and BGTG_Bass by BGTG_RG to get the corresponding RG file
    rg_file_path = pathlib.Path((str(bass_file_path).replace("_bass.txt", "_rythmic.txt")).replace("BGTG_Bass", "BGTG_RG"))
    all_file_path = pathlib.Path((str(bass_file_path).replace("_bass.txt", "_rythmic.txt")).replace("BGTG_Bass", "BGTG_RG_Bass"))

    song_name = bass_file_path.stem.split("_")[0]
    
    if rg_file_path.exists() and all_file_path.exists():
        bass_sequence = []
        rg_sequence = []
        all_events_sequence = []
        remove_song = False
        # Truncate each token sequence once thr_measures "new_measure" tokens have been seen
        # Discard songs whose bass or RG sequence exceeds thr_tokens tokens
        
        # Open the bass file
        with open(bass_file_path, 'r') as bass_file:
            bass_lines = bass_file.readlines()
            count_measures = 0
            token_count = 0
            for line in bass_lines:
                token_count+=1
                if line.strip() == "new_measure":
                    count_measures+=1
                    if count_measures == thr_measures:
                        break
                
                if token_count > thr_tokens:
                    remove_song = True

                bass_sequence.append(line.strip())

        # Open the RG file
        with open(rg_file_path, 'r') as rg_file:
            rg_lines = rg_file.readlines()
            count_measures = 0
            token_count = 0
            for line in rg_lines:
                token_count+=1
                if line.strip() == "new_measure":
                    count_measures+=1
                    if count_measures == thr_measures:
                        break
                    
                if token_count > thr_tokens:
                    remove_song = True
            
                rg_sequence.append(line.strip())
                
        with open(all_file_path, 'r') as all_file:
            all_lines = all_file.readlines()
            count_measures = 0
            for line in all_lines:
                if line.strip() == "new_measure":
                    count_measures+=1
                    if count_measures == thr_measures:
                        break

                all_events_sequence.append(line.strip())
        
        if remove_song:
            print("Song removed, too much tokens", song_name, token_count)
        else:
            big_dict['Encoder_RG'].append(rg_sequence)
            big_dict['Decoder_Bass'].append(bass_sequence)
            big_dict['All_Events'].append(all_events_sequence)
        
    else:
        path_errors+=1
        path_errors_list.append((song_name, bass_file_path, rg_file_path, all_file_path))

print("Path errors: ", path_errors)

Generating sequences of ids:   8%|▊         | 1095/14480 [00:02<00:29, 446.25it/s]

Song removed, too much tokens Audioslave - Cochise (2) 1020


Generating sequences of ids:  26%|██▋       | 3827/14480 [00:07<00:21, 502.81it/s]

Song removed, too much tokens Dimmu Borgir - Tormentor Of Christian Souls (2) 1222


Generating sequences of ids:  57%|█████▋    | 8196/14480 [02:35<03:48, 27.49it/s] 

Song removed, too much tokens Mest - Rooftop 1274


Generating sequences of ids:  58%|█████▊    | 8452/14480 [02:46<03:54, 25.72it/s]

Song removed, too much tokens Millencolin - A - Ten 1004


Generating sequences of ids:  64%|██████▎   | 9211/14480 [03:18<03:47, 23.19it/s]

Song removed, too much tokens Nirvana - Molly's Lips (2) 1580


Generating sequences of ids:  67%|██████▋   | 9680/14480 [03:38<03:25, 23.31it/s]

Song removed, too much tokens Opium - Rest in Peace 1241


Generating sequences of ids:  81%|████████  | 11726/14480 [05:02<01:43, 26.60it/s]

Song removed, too much tokens Scissor Sisters - Take Your Mama Out 1062


Generating sequences of ids:  94%|█████████▍| 13621/14480 [06:19<00:38, 22.47it/s]

Song removed, too much tokens Uncommonmenfrommars - Pizzaman 1126


Generating sequences of ids: 100%|██████████| 14480/14480 [06:49<00:00, 35.37it/s] 

Path errors:  1480
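The three reading loops in the cell above share the same truncation rule, so they could be folded into one function. The sketch below is one possible refactoring, not the notebook's code: `truncate_tokens` is a hypothetical name, and the toy tokens `tok_a`, `tok_b` are placeholders. It mirrors the original logic: stop at the `max_measures`-th `new_measure` token (which is not kept) and flag the sequence when more than `max_tokens` lines were read.

```python
from typing import Iterable, List, Tuple


def truncate_tokens(
    lines: Iterable[str], max_measures: int, max_tokens: int
) -> Tuple[List[str], bool]:
    """Return (tokens, too_long) using the same rules as the loops above."""
    tokens: List[str] = []
    measures = 0
    too_long = False
    for count, line in enumerate(lines, start=1):
        token = line.strip()
        if token == "new_measure":
            measures += 1
            if measures == max_measures:
                break  # the new_measure token that triggers the stop is dropped
        if count > max_tokens:
            too_long = True  # the song would be discarded by the caller
        tokens.append(token)
    return tokens, too_long


# Toy input with placeholder tokens: two measures, truncated at the second one.
toy_lines = ["new_measure\n", "tok_a\n", "tok_b\n", "new_measure\n", "tok_c\n"]
toy_tokens, toy_too_long = truncate_tokens(toy_lines, max_measures=2, max_tokens=1000)
print(toy_tokens, toy_too_long)
```

In the notebook this would be called once per file, for example with `open(bass_file_path)` as the `lines` argument, and a song would be skipped when either the bass or the RG call reports `too_long`.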





In [6]:
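# Dump the big_dict to a pickle file
# The path is built from components so that no backslash escape sequences are involved
output_path = pathlib.Path("..", "..", "data", f"preprocessed_dadagp_{thr_measures}_{thr_tokens}.pickle")

with open(output_path, "wb") as handle:
    pickle.dump(big_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)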
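To check that the dump can be read back, a round trip is a cheap test. The sketch below uses a small stand-in dictionary and a temporary folder so that it runs on its own; `toy_dict`, the placeholder tokens and the file name are assumptions for illustration, and in the notebook the same `pickle.load` call would be applied to the file written above.

```python
import pathlib
import pickle
import tempfile

# Stand-in for big_dict with one toy song (placeholder tokens).
toy_dict = {
    "All_Events": [["new_measure", "tok_a", "tok_b"]],
    "Encoder_RG": [["new_measure", "tok_a"]],
    "Decoder_Bass": [["new_measure", "tok_b"]],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = pathlib.Path(tmp_dir) / "preprocessed_toy.pickle"
    with open(pickle_path, "wb") as handle:
        pickle.dump(toy_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(pickle_path, "rb") as handle:
        reloaded = pickle.load(handle)

# The three lists must stay aligned: index k refers to the same song in each.
song_counts = {key: len(value) for key, value in reloaded.items()}
aligned = len(set(song_counts.values())) == 1
print(song_counts, aligned)
```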

In [30]:
len(big_dict['Encoder_RG']), len(big_dict['Decoder_Bass']), len(big_dict['All_Events'])

(13000, 13000, 13000)

In [7]:
max_enc_length = 0
max_dec_length = 0
index_enc = 0
index_dec = 0

for k in range(0, len(big_dict['Encoder_RG'])):
    
    #get max seq_lengths
    if max_enc_length < len(big_dict['Encoder_RG'][k]):
        max_enc_length = len(big_dict['Encoder_RG'][k])
        index_enc = k
    if max_dec_length < len(big_dict['Decoder_Bass'][k]):
        max_dec_length = len(big_dict['Decoder_Bass'][k])
        index_dec = k
        
print("Max encoder length: ", max_enc_length)
print("Max decoder length: ", max_dec_length)

Max encoder length:  973
Max decoder length:  737
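The same maxima can be obtained more compactly with the built-in `max`. The sketch below is an alternative to the loop above, not a replacement taken from the notebook; `longest_sequence` is a hypothetical helper and the toy sequences are placeholders. Like the original loop, it returns the first index when several sequences share the maximum length.

```python
from typing import List, Optional, Sequence, Tuple


def longest_sequence(sequences: Sequence[List[str]]) -> Tuple[Optional[int], int]:
    """Return (index, length) of the longest sequence, or (None, 0) if empty."""
    if not sequences:
        return None, 0
    # max() keeps the first maximal element, matching the strict "<" in the loop.
    index = max(range(len(sequences)), key=lambda k: len(sequences[k]))
    return index, len(sequences[index])


# Toy sequences with placeholder tokens.
toy_sequences = [["tok_a"], ["tok_a", "tok_b", "tok_c"], ["tok_a", "tok_b"]]
toy_index, toy_length = longest_sequence(toy_sequences)
print(toy_index, toy_length)
```

Applied to the notebook data, `longest_sequence(big_dict['Encoder_RG'])` and `longest_sequence(big_dict['Decoder_Bass'])` would give the values of `index_enc`, `max_enc_length`, `index_dec` and `max_dec_length`.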
