# Preparing the Bangla ASR Dataset

This notebook defines how the entire dataset is split into train, evaluation and test sets. It creates text files containing the stems (file names without extension) of the wav files in each set.

**Note:** If the DatasetPreparations folder already contains the 6 output files, it is not necessary to run this notebook.

## Terminology

* Stem: The file name of a path, without its directory and extension. For example, the stem of "a/b/c/abc.wav" is "abc".
* Devset: The set of files used during development (training). It includes both the train set and the evaluation set.
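
As a quick illustration of the stem definition, `pathlib.Path.stem` returns the last path component without its final suffix. The path below is the example from the definition, not a file from the dataset.

```python
from pathlib import Path

# Example path taken from the definition above (not a real dataset file)
example_path = "a/b/c/abc.wav"
example_stem = Path(example_path).stem

print(example_stem)  # abc
```

Only the final suffix is removed, so a name with several dots keeps everything before the last one.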

Necessary imports:

In [1]:
import csv
import glob
import os  # needed for os.chdir, os.getcwd and os.path.normpath below
import random

from pathlib import Path

In [2]:
# Directories are assumed to have a trailing '/' or '\\' in all the subsequent code

CURRENT_WORKING_DIRECTORY = "W:/SpeakerRecognitionResearch"

BANGLA_ASR_DATASET_DIRECTORY = "data/BanglaASR/WavFiles/"
BANGLA_ASR_TSV_LOCATION = "data/BanglaASR/utt_spk_text.tsv"

# To avoid file location related errors, we make sure "SpeakerRecognitionResearch" root folder is the current working directory.
os.chdir(CURRENT_WORKING_DIRECTORY)
os.getcwd()

'W:\\SpeakerRecognitionResearch'

## Output files:

The following output files will be generated by this notebook:
1. trainset_list.txt: List of all stems that will be used for training
2. evalset_list.txt: List of all stems that will be used for evaluation in the training loop
3. testset_list.txt: List of all stems that will be used for testing after the training phase is complete
4. devset_classes.txt: Order of the classes (speakers) used for both trainset and evalset
5. testset_classes.txt: Order of the classes (speakers) used for testset
6. eval_trials.txt: One "expected utt1 utt2" line per trial to be scored during evaluation, where expected is 1 for a same-speaker pair and 0 otherwise
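
To make the two file layouts concrete, here is a minimal sketch of how the stem lists and the trial file could be read back. The helper names `read_stem_list` and `read_trials`, the toy file names and the toy stems are assumptions for illustration only; they are not part of this notebook or of any training code.

```python
import tempfile
from pathlib import Path

def read_stem_list(list_location):
    # Hypothetical helper: one stem per line, blank lines are skipped
    with open(list_location, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def read_trials(trials_location):
    # Hypothetical helper: each line is "<expected> <utt1 stem> <utt2 stem>"
    trials = []
    with open(trials_location, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            expected, utt_a, utt_b = line.split()
            trials.append((int(expected), utt_a, utt_b))
    return trials

# Toy files with made-up stems, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_list = Path(tmp_dir) / "toy_list.txt"
    toy_list.write_text("stem_a\nstem_b\n", encoding="utf-8")
    toy_trials = Path(tmp_dir) / "toy_trials.txt"
    toy_trials.write_text("1 stem_a stem_b\n0 stem_a stem_c\n", encoding="utf-8")

    toy_stems = read_stem_list(toy_list)
    toy_trial_tuples = read_trials(toy_trials)

print(toy_stems)
print(toy_trial_tuples)
```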

In [3]:
TRAINSET_LIST_LOCATION = "notebooks/TrainBdAsrOnRawNet2/DatasetPreparations/trainset_list.txt"
EVALSET_LIST_LOCATION = "notebooks/TrainBdAsrOnRawNet2/DatasetPreparations/evalset_list.txt"
TESTSET_LIST_LOCATION = "notebooks/TrainBdAsrOnRawNet2/DatasetPreparations/testset_list.txt"
DEVSET_CLASS_ORDER_LOCATION = "notebooks/TrainBdAsrOnRawNet2/DatasetPreparations/devset_classes.txt"
TESTSET_CLASS_ORDER_LOCATION = "notebooks/TrainBdAsrOnRawNet2/DatasetPreparations/testset_classes.txt"

EVAL_TRIALS_LOCATION = "notebooks/TrainBdAsrOnRawNet2/DatasetPreparations/eval_trials.txt"

In [4]:
# There are in total 508 speakers in the dataset
TEST_CLASS_NUMBERS = 100
DEV_CLASS_NUMBERS = 408

# Fraction of the devset utterances held out as the evaluation set
EVAL_TRAIN_RATIO = 0.10

# Don't change it, to keep the work reproducible
RANDOM_SEED = 99

In [5]:
def get_wav_list(dataset_dir):
    # Given a directory, return the paths of all wav files under it (searched recursively)
    pattern = '**/*.wav'
    files = glob.glob(dataset_dir + pattern, recursive=True)

    # Normalize the paths so that the separator ('/' or '\\') is consistent for the current OS
    wav_list = [os.path.normpath(i) for i in files]
    return wav_list

In [6]:
wav_paths_list = get_wav_list(BANGLA_ASR_DATASET_DIRECTORY)
len(wav_paths_list)

218703

## Label dictionaries

Three dictionaries will be helpful to us:

1. stem_to_speaker_dict: Given a stem, who is its speaker?
2. speaker_to_paths_dict: Given a speaker, what are their audio paths?
3. stem_to_path_dict: Given a stem, what is its path? (Each value is a list of paths, in case the same stem appears more than once.)
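
The shape of these mappings can be sketched on a toy example. The paths and speaker ids below are made up, and `collections.defaultdict` is used here only as a compact alternative for grouping; the notebook's own functions use plain dictionaries.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical paths and speaker ids, for illustration only
toy_paths = ["wavs/aa/utt1.wav", "wavs/aa/utt2.wav", "wavs/bb/utt3.wav"]
toy_stem_to_speaker = {"utt1": "spk_x", "utt2": "spk_x", "utt3": "spk_y"}

toy_speaker_to_paths = defaultdict(list)
toy_stem_to_paths = defaultdict(list)

for path in toy_paths:
    stem = Path(path).stem
    # Group each path under its speaker and under its own stem
    toy_speaker_to_paths[toy_stem_to_speaker[stem]].append(path)
    toy_stem_to_paths[stem].append(path)

print(dict(toy_speaker_to_paths))
print(dict(toy_stem_to_paths))
```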

In [7]:
def get_stem_to_speaker_dict(tsv_loc):
    # Reads the annotation tsv file provided with the dataset 
    # and returns a stem to speaker mapping dictionary
    stem_to_speaker_d = {}

    with open(tsv_loc, encoding="utf-8") as tsvfile:
        tsvreader = csv.reader(tsvfile, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line in tsvreader:
            wav_file_name = line[0]
            speaker_id = line[1]

            stem_to_speaker_d[wav_file_name] = speaker_id

    return stem_to_speaker_d

In [8]:
stem_to_speaker_dict = get_stem_to_speaker_dict(BANGLA_ASR_TSV_LOCATION)
print("Size:", len(stem_to_speaker_dict), "Speaker of 000020a912:", stem_to_speaker_dict["000020a912"])

Size: 218703 Speaker of 000020a912: 16cfb


In [9]:
def get_speaker_to_paths_dict(wav_list, stem_to_speaker_dict):
    
    spk_to_path_d = {}

    for wav_path in wav_list:
        wav_name = Path(wav_path).stem
        spk_id = stem_to_speaker_dict[wav_name]

        if spk_id in spk_to_path_d.keys():
            spk_to_path_d[spk_id].append(wav_path)
        else:
            spk_to_path_d[spk_id] = [wav_path]
    
    return spk_to_path_d

In [10]:
speaker_to_paths_dict = get_speaker_to_paths_dict(wav_paths_list, stem_to_speaker_dict)
len(speaker_to_paths_dict.keys())

508

In [11]:
total = 0
for key in speaker_to_paths_dict.keys():
    total += len(speaker_to_paths_dict[key])

total

218703

In [12]:
def get_stem_to_path_dict(stem_list, wav_paths_list):
    # Sets are faster to search
    stem_set = set(stem_list)
    
    stem_to_path_d = {}
    for wav_path in wav_paths_list:
        wav_stem = Path(wav_path).stem
        if wav_stem in stem_set:
            if wav_stem in stem_to_path_d:
                stem_to_path_d[wav_stem].append(wav_path)
            else:
                stem_to_path_d[wav_stem] = [wav_path]

    return stem_to_path_d          
    

In [13]:
stem_list = stem_to_speaker_dict.keys()

stem_to_path_dict = get_stem_to_path_dict(stem_list, wav_paths_list)

In [14]:
stem_to_path_dict['c47c6e6be0']

['data\\BanglaASR\\WavFiles\\asr_bengali_c\\asr_bengali\\data\\c4\\c47c6e6be0.wav']

## Split whole dataset into dev and test

In [15]:
random.seed(RANDOM_SEED)
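# random.sample needs a sequence; passing dict keys directly raises a TypeError on Python 3.11+
test_speakers_keys = random.sample(list(speaker_to_paths_dict.keys()), TEST_CLASS_NUMBERS)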
random.seed(RANDOM_SEED)

len(test_speakers_keys)

100

In [16]:
test_speakers_wavs_stems = []
dev_speakers_wavs_stems = []

total = 0

for key in speaker_to_paths_dict.keys():
    current_speaker_wavs = speaker_to_paths_dict[key]
    current_speaker_wavs_stems = [Path(x).stem for x in current_speaker_wavs]
    total += len(current_speaker_wavs_stems)

    if key in test_speakers_keys:
        test_speakers_wavs_stems += current_speaker_wavs_stems
    else:
        dev_speakers_wavs_stems += current_speaker_wavs_stems
    
print(total)

218703


In [17]:
print(len(test_speakers_wavs_stems), len(dev_speakers_wavs_stems))
print(len(test_speakers_wavs_stems) + len(dev_speakers_wavs_stems))

# 42075 176628
# 218703

42075 176628
218703


We have divided the dataset into dev and test sets.
Now we need to divide the dev set into train and eval sets.
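
The idea behind the train/eval split (shuffle with a fixed seed, then cut at a ratio) can be sketched on a toy list. The function name `split_eval_train` and the toy stems are assumptions for illustration. The sketch uses a private `random.Random` instance instead of the global seed, which is a design variation rather than what the notebook does.

```python
import random

def split_eval_train(stems, eval_ratio, seed):
    # Shuffle a copy with a private, seeded RNG so that the caller's list
    # and the global random state stay untouched
    shuffled = list(stems)
    random.Random(seed).shuffle(shuffled)
    num_eval = int(len(shuffled) * eval_ratio)
    return shuffled[:num_eval], shuffled[num_eval:]

# Made-up stems, for illustration only
toy_stems = [f"utt{i}" for i in range(20)]

toy_eval, toy_train = split_eval_train(toy_stems, eval_ratio=0.10, seed=99)
toy_eval_again, _ = split_eval_train(toy_stems, eval_ratio=0.10, seed=99)

print(len(toy_eval), len(toy_train))
print(toy_eval == toy_eval_again)
```

Because the split is made per utterance rather than per speaker, the same speaker can appear in both the train and eval parts, which matches a setup where both share one class order.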

In [18]:
total_dev_wavs = len(dev_speakers_wavs_stems)
num_eval_wavs = int(total_dev_wavs * EVAL_TRAIN_RATIO)

random.seed(RANDOM_SEED)
random.shuffle(dev_speakers_wavs_stems)
random.seed(RANDOM_SEED)

eval_stems, train_stems = dev_speakers_wavs_stems[:num_eval_wavs], dev_speakers_wavs_stems[num_eval_wavs:]

In [19]:
train_size = len(train_stems)
eval_size = len(eval_stems)
test_size = len(test_speakers_wavs_stems)

print("Train size:", train_size)
print("Eval size:", eval_size)
print("Test size:", test_size)

print("Total:", train_size+eval_size+test_size)

Train size: 158966
Eval size: 17662
Test size: 42075
Total: 218703
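
The reported sizes can be checked by hand from the counts printed earlier. The sketch below only redoes that arithmetic; the variable names are local to this example.

```python
# Counts reported earlier in this notebook
total_dev_wavs = 176628
test_wavs = 42075
eval_train_ratio = 0.10

# int() truncates 17662.8 down to 17662
num_eval = int(total_dev_wavs * eval_train_ratio)
num_train = total_dev_wavs - num_eval

print(num_train, num_eval, num_train + num_eval + test_wavs)
```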


## Writing to the output files

In [20]:
with open(TRAINSET_LIST_LOCATION, "w") as file:
    for stem in train_stems:
        file.write(stem+"\n")

with open(EVALSET_LIST_LOCATION, "w") as file:
    for stem in eval_stems:
        file.write(stem+"\n")

with open(TESTSET_LIST_LOCATION, "w") as file:
    for stem in test_speakers_wavs_stems:
        file.write(stem+"\n")

In [21]:
# A set removes duplicate speakers
# Sorting keeps the class order consistent across runs
dev_speakers_list = sorted(list(set([stem_to_speaker_dict[stem] for stem in dev_speakers_wavs_stems])))
test_speakers_list = sorted(list(set([stem_to_speaker_dict[stem] for stem in test_speakers_wavs_stems])))
len(dev_speakers_list)

408

In [22]:
with open(DEVSET_CLASS_ORDER_LOCATION, "w") as file:
    for speaker in dev_speakers_list:
        file.write(speaker+"\n")

with open(TESTSET_CLASS_ORDER_LOCATION, "w") as file:
    for speaker in test_speakers_list:
        file.write(speaker+"\n")

## Evaluation Trials

In [23]:
import numpy as np

NUM_TRIALS = 10000

In [24]:
# This function generates a file listing which two utterances should be compared using cosine similarity
# Format (label followed by the stems of the two utterances):
# 1 <stem of an utterance of spk 1> <stem of another utterance of spk 1>
# 0 <stem of an utterance of spk 1> <stem of an utterance of spk 2>

def generate_validation_trials(wav_list, nb_trial, val_trial_location, speaker_to_paths_d):
    # wav_list is currently unused; it is kept so that existing calls keep working

    # We define a same-speaker trial as a target trial
    # Target trial: 1; Non-target trial: 0

    # There will be equal numbers of target and non-target trials
    nb_target_trials = nb_trial // 2

    speakers_list = list(speaker_to_paths_d.keys())
    # A target trial needs two distinct utterances, so only speakers with at least two qualify
    target_speakers = [spk for spk in speakers_list if len(speaker_to_paths_d[spk]) >= 2]

    # The with-block closes the file even if an error occurs midway
    with open(val_trial_location, 'w') as val_trial_file:
        # Compose target trials
        selected_spks = np.random.choice(target_speakers, size=nb_target_trials, replace=True)
        for spk in selected_spks:
            wav_paths_of_speaker = speaker_to_paths_d[spk]
            utt_a, utt_b = np.random.choice(wav_paths_of_speaker, size=2, replace=False)
            utt_a_stem, utt_b_stem = Path(utt_a).stem, Path(utt_b).stem
            val_trial_file.write('1 %s %s\n' % (utt_a_stem, utt_b_stem))

        # Compose non-target trials
        for i in range(nb_target_trials):
            two_different_speakers = np.random.choice(speakers_list, size=2, replace=False)
            utt_a = np.random.choice(speaker_to_paths_d[two_different_speakers[0]], size=1)[0]
            utt_b = np.random.choice(speaker_to_paths_d[two_different_speakers[1]], size=1)[0]
            utt_a_stem, utt_b_stem = Path(utt_a).stem, Path(utt_b).stem
            val_trial_file.write('0 %s %s\n' % (utt_a_stem, utt_b_stem))

In [25]:
eval_speakers_to_path_dict = {}

eval_stem_set = set(eval_stems)

for speaker in speaker_to_paths_dict.keys():
    for path in speaker_to_paths_dict[speaker]:
        if Path(path).stem in eval_stem_set:
            if speaker in eval_speakers_to_path_dict.keys():
                eval_speakers_to_path_dict[speaker].append(path)
            else:
                eval_speakers_to_path_dict[speaker] = [path]

In [26]:
total = 0
for speaker in eval_speakers_to_path_dict:
    total += len(eval_speakers_to_path_dict[speaker])

total

17662

In [27]:
eval_paths = []
for stem in eval_stems:
    eval_paths.append(stem_to_path_dict[stem])
len(eval_paths)

17662

In [28]:
generate_validation_trials(eval_paths, NUM_TRIALS, EVAL_TRIALS_LOCATION, eval_speakers_to_path_dict)
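
Note that `generate_validation_trials` draws from `np.random` without a seed, so each run produces a different eval_trials.txt. Below is a minimal sketch of a seeded variant on toy data, under stated assumptions: the function name `sketch_trials` and the toy speakers are hypothetical, the standard library `random.Random` is swapped in for `np.random`, and trials are returned in memory instead of being written to a file.

```python
import random

def sketch_trials(speaker_to_stems, nb_trial, seed):
    # Private, seeded RNG: the same seed always gives the same trials
    rng = random.Random(seed)
    speakers = sorted(speaker_to_stems)
    # Target trials need two distinct utterances from one speaker
    target_speakers = [s for s in speakers if len(speaker_to_stems[s]) >= 2]

    trials = []
    # Target trials (label 1): two different utterances of the same speaker
    for _ in range(nb_trial // 2):
        spk = rng.choice(target_speakers)
        utt_a, utt_b = rng.sample(speaker_to_stems[spk], 2)
        trials.append((1, utt_a, utt_b))
    # Non-target trials (label 0): one utterance each from two different speakers
    for _ in range(nb_trial // 2):
        spk_a, spk_b = rng.sample(speakers, 2)
        utt_a = rng.choice(speaker_to_stems[spk_a])
        utt_b = rng.choice(speaker_to_stems[spk_b])
        trials.append((0, utt_a, utt_b))
    return trials

# Made-up speakers and stems; the first letter of a stem marks its speaker
toy_speaker_to_stems = {
    "spk_x": ["x1", "x2", "x3"],
    "spk_y": ["y1", "y2"],
    "spk_z": ["z1"],
}

toy_trials = sketch_trials(toy_speaker_to_stems, nb_trial=10, seed=99)
for label, utt_a, utt_b in toy_trials:
    print(label, utt_a, utt_b)
```

Calling `sketch_trials` again with the same seed returns the same list, which is the property the unseeded version lacks.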