# Preparing LARGE ASR DATASET

https://www.isca-speech.org/archive/sltu_2018/kjartansson18_sltu.html

This notebook defines how the entire dataset is split into train, validation (eval), and test sets. It creates text files containing the stems (file names without extension) of the wav files in each set.

Note: If the DatasetPreparations folder already contains the 7 output files listed under "Output files", it is not necessary to run this notebook.

## Terminology:

* Stem: The file name from a path, without its directory or extension. For example, the stem of "a/b/c/abc.wav" is "abc".
* devset: The set of files used during development (training). It includes both the train set and the evaluation set.
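
As a quick illustration of the stem definition, `pathlib` exposes it directly. The path below is the made-up example from the definition, not a file in the dataset.

```python
from pathlib import Path

# Hypothetical path, taken from the definition of "stem"
example_path = "a/b/c/abc.wav"

# .stem drops the directory part and the final extension
example_stem = Path(example_path).stem
print(example_stem)  # abc
```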

Necessary imports:

In [1]:
import glob
import csv
import os

from pathlib import Path

import random

import pandas as pd

In [2]:
# Directories are assumed to have a trailing '/' or '\\' in all the subsequent code

CURRENT_WORKING_DIRECTORY = "W:/SpeakerRecognitionResearch/"

LARGE_ASR_DATASET_DIRECTORY = "S:/Large ASR/WavFiles/"

# This is a combination of the 5 TSV files, and it will be generated in this notebook
LARGE_ASR_TSV_LOCATION = "S:/Large ASR/WavFiles/utt_spk_text.tsv"

# To avoid file location related errors, we make sure "SpeakerRecognitionResearch" root folder is the current working directory.
os.chdir(CURRENT_WORKING_DIRECTORY)
os.getcwd()

'W:\\SpeakerRecognitionResearch'

## Output files:

The following output files will be generated by this notebook:
1. trainset_list.txt: List of all stems that will be used for training
2. evalset_list.txt: List of all stems that will be used for evaluation in the training loop
3. testset_list.txt: List of all stems that will be used for testing after the training phase is complete
4. devset_classes.txt: Order of the classes (speakers) used for both trainset and evalset
5. testset_classes.txt: Order of the classes (speakers) used for testset
6. eval_trials.txt: One "label utt1 utt2" line per evaluation trial, where label is the expected result (1 for the same speaker, 0 for different speakers)
7. test_trials.txt: One "label utt1 utt2" line per test trial, in the same format
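
The two trial files hold one trial per line: a label followed by two utterance stems, separated by spaces. A minimal sketch of how such a file could be read back is given here; `parse_trials` and the sample stems are hypothetical and not part of this notebook.

```python
import io

def parse_trials(lines):
    """Parse 'label utt1 utt2' lines into (int label, utt1, utt2) tuples."""
    trials = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, utt_a, utt_b = line.split()
        trials.append((int(label), utt_a, utt_b))
    return trials

# Made-up trial lines standing in for the contents of eval_trials.txt
sample_file = io.StringIO("1 aaa111 aaa222\n0 aaa111 bbb333\n")
trials = parse_trials(sample_file)
print(trials)
```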

In [3]:
TRAINSET_LIST_LOCATION = "notebooks/TrainLargeAsrOnRawNet2/DatasetPreparations/trainset_list.txt"
EVALSET_LIST_LOCATION = "notebooks/TrainLargeAsrOnRawNet2/DatasetPreparations/evalset_list.txt"
TESTSET_LIST_LOCATION = "notebooks/TrainLargeAsrOnRawNet2/DatasetPreparations/testset_list.txt"
DEVSET_CLASS_ORDER_LOCATION = "notebooks/TrainLargeAsrOnRawNet2/DatasetPreparations/devset_classes.txt"
TESTSET_CLASS_ORDER_LOCATION = "notebooks/TrainLargeAsrOnRawNet2/DatasetPreparations/testset_classes.txt"
EVAL_TRIALS_LOCATION = "notebooks/TrainLargeAsrOnRawNet2/DatasetPreparations/eval_trials.txt"
TEST_TRIALS_LOCATION = "notebooks/TrainLargeAsrOnRawNet2/DatasetPreparations/test_trials.txt"

## Joining the TSV files into one

In [4]:
BENGALI_TSV = "S:/Large ASR/WavFiles/BengaliASR/ben_utt_spk_text.tsv"
JAVANESE_TSV = "S:/Large ASR/WavFiles/JavaneseASR/jav_utt_spk_text.tsv"
NEPALI_TSV = "S:/Large ASR/WavFiles/NepaliASR/nep_utt_spk_text.tsv"
SINHALA_TSV = "S:/Large ASR/WavFiles/SinhalaASR/sin_utt_spk_text.tsv"
SUNDANESE_TSV = "S:/Large ASR/WavFiles/SundaneseASR/sun_utt_spk_text.tsv"

In [5]:
bengali_tsv_df = pd.read_csv(BENGALI_TSV, quoting=csv.QUOTE_NONE, sep='\t', header=None)
bengali_tsv_df = bengali_tsv_df.iloc[:,:-1]

javanese_tsv_df = pd.read_csv(JAVANESE_TSV, quoting=csv.QUOTE_NONE, sep='\t', header=None)
javanese_tsv_df = javanese_tsv_df.iloc[:,:-1]

nepali_tsv_df = pd.read_csv(NEPALI_TSV, quoting=csv.QUOTE_NONE, sep='\t', header=None)
nepali_tsv_df = nepali_tsv_df.iloc[:,:-1]

sinhala_tsv_df = pd.read_csv(SINHALA_TSV, quoting=csv.QUOTE_NONE, sep='\t', header=None)
sinhala_tsv_df = sinhala_tsv_df.iloc[:,:-1]

sundanese_tsv_df = pd.read_csv(SUNDANESE_TSV, quoting=csv.QUOTE_NONE, sep='\t', header=None)
sundanese_tsv_df = sundanese_tsv_df.iloc[:,:-1]

print("Bengali: {}, Javanese: {}, Nepali: {}, Sinhala: {}, Sundanese: {}".format(
	len(bengali_tsv_df), len(javanese_tsv_df), len(nepali_tsv_df), len(sinhala_tsv_df), len(sundanese_tsv_df)
))

Bengali: 218703, Javanese: 185076, Nepali: 157905, Sinhala: 185293, Sundanese: 219156
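
The five reads in this cell follow the same pattern: load a headerless TSV without quote handling, then drop the last column (presumably the transcript text, going by the `utt_spk_text` file names). A sketch of the same idea as a loop, run on tiny in-memory stand-ins; `load_utt_spk` and the sample rows are assumptions made for illustration.

```python
import csv
import io

import pandas as pd

def load_utt_spk(tsv_source):
    """Read a headerless TSV and keep every column except the last one."""
    df = pd.read_csv(tsv_source, quoting=csv.QUOTE_NONE, sep="\t", header=None)
    return df.iloc[:, :-1]

# Tiny made-up stand-ins for two of the per-language TSV files
fake_tsvs = {
    "lang_a": io.StringIO("utt001\tspk01\tsome text\nutt002\tspk02\tmore \"text\"\n"),
    "lang_b": io.StringIO("utt003\tspk03\tother text\n"),
}

sample_frames = [load_utt_spk(src) for src in fake_tsvs.values()]
# ignore_index gives the combined frame a fresh 0..n-1 index
combined = pd.concat(sample_frames, ignore_index=True)
print(len(combined), list(combined.columns))
```

With the real files, the dictionary values would be the five TSV paths defined in the previous cell.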


In [6]:
frames = [bengali_tsv_df, javanese_tsv_df, nepali_tsv_df, sinhala_tsv_df, sundanese_tsv_df]
large_asr_tsv_df = pd.concat(frames)

print("Total utterances:", len(large_asr_tsv_df))
large_asr_tsv_df.head()

Total utterances: 966133


            0      1
0  000020a912  16cfb
1  000039928e  976b1
2  00005debc7  f83df
3  00009e687c  9813c
4  00012843bc  7ec1c


In [7]:
all_speakers = large_asr_tsv_df.iloc[:,1].unique()
print("Total number of speakers:", len(all_speakers))

Total number of speakers: 3071


In [8]:
# Save the combined dataframe as a TSV file

large_asr_tsv_df.to_csv(LARGE_ASR_TSV_LOCATION, index=False, header=False, sep="\t")


In [None]:
# There are in total 3071 speakers in the dataset

# In our splitting strategy, we keep roughly 10% of the speakers (300 of 3071) for TEST
TEST_CLASS_NUMBERS = 300
DEV_CLASS_NUMBERS = 2771

# 10% of the devset utterances will be used for evaluation in each epoch
EVAL_TRAIN_RATIO = 0.10

# Don't change it, to keep the work reproducible
RANDOM_SEED = 99
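
The two class counts are tied together by the speaker total. A small arithmetic check, using only numbers that appear in this notebook (`TOTAL_SPEAKERS` is a name introduced here for illustration):

```python
TOTAL_SPEAKERS = 3071      # from the speaker count printed earlier
TEST_CLASS_NUMBERS = 300

# Every speaker not held out for test belongs to the devset
DEV_CLASS_NUMBERS = TOTAL_SPEAKERS - TEST_CLASS_NUMBERS

test_fraction = TEST_CLASS_NUMBERS / TOTAL_SPEAKERS
print(DEV_CLASS_NUMBERS)          # 2771
print(round(test_fraction, 4))    # 0.0977
```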

In [None]:
def get_wav_list(dataset_dir):
    # Given a directory, return the list of paths of all wav files under it
    pattern = '**/*.wav'
    files = glob.glob(dataset_dir + pattern, recursive=True)

    # Normalize the file paths to get '/' or '\\' consistently, depending on the OS
    wav_list = [os.path.normpath(i) for i in files]
    return wav_list
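
For comparison, the same recursive search can be written with `pathlib`. This is an alternative sketch, not the function used in this notebook: `get_wav_list_pathlib` is a hypothetical name, and because it sorts the result, swapping it in could change the file order and therefore the random splits made later.

```python
import tempfile
from pathlib import Path

def get_wav_list_pathlib(dataset_dir):
    """Return sorted string paths of all .wav files under dataset_dir (recursive)."""
    return sorted(str(p) for p in Path(dataset_dir).rglob("*.wav"))

# Build a throwaway directory tree to try the function on
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "LangA" / "data").mkdir(parents=True)
    (Path(tmp_dir) / "LangA" / "data" / "utt001.wav").touch()
    (Path(tmp_dir) / "LangA" / "notes.txt").touch()
    (Path(tmp_dir) / "utt002.wav").touch()

    found = get_wav_list_pathlib(tmp_dir)
    found_stems = sorted(Path(p).stem for p in found)

print(found_stems)  # ['utt001', 'utt002']
```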

In [None]:
wav_paths_list = get_wav_list(LARGE_ASR_DATASET_DIRECTORY)
len(wav_paths_list)

966133

## Label dictionaries

Three dictionaries will be helpful to us:

1. stem_to_speaker_dict: Given a stem, who is its speaker?
2. speaker_to_paths_dict: Given a speaker, what are their audio paths?
3. stem_to_path_dict: Given a stem, what is its path? (Stored as a list of paths per stem.)
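
To make the three mappings concrete, here is a sketch that builds them from toy data. The stems, speaker ids, and paths are made up, and `defaultdict` is used here only for brevity.

```python
from collections import defaultdict
from pathlib import Path

# Made-up annotations (stem -> speaker) and wav paths
toy_stem_to_speaker = {"utt001": "spk01", "utt002": "spk01", "utt003": "spk02"}
toy_wav_paths = [
    "data/LangA/utt001.wav",
    "data/LangA/utt002.wav",
    "data/LangB/utt003.wav",
]

toy_speaker_to_paths = defaultdict(list)
toy_stem_to_path = defaultdict(list)
for wav_path in toy_wav_paths:
    stem = Path(wav_path).stem
    # Group paths by the speaker that the stem belongs to
    toy_speaker_to_paths[toy_stem_to_speaker[stem]].append(wav_path)
    # Keep a list per stem, in case a stem occurs under several directories
    toy_stem_to_path[stem].append(wav_path)

print(dict(toy_speaker_to_paths))
print(toy_stem_to_path["utt003"])
```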

In [9]:
def get_stem_to_speaker_dict(tsv_loc):
    # Reads the annotation tsv file provided with the dataset 
    # and returns a stem to speaker mapping dictionary
    stem_to_speaker_d = {}

    with open(tsv_loc, encoding="utf-8") as tsvfile:
        tsvreader = csv.reader(tsvfile, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line in tsvreader:
            wav_file_name = line[0]
            speaker_id = line[1]

            stem_to_speaker_d[wav_file_name] = speaker_id

    return stem_to_speaker_d

In [10]:
stem_to_speaker_dict = get_stem_to_speaker_dict(LARGE_ASR_TSV_LOCATION)
print("Size:", len(stem_to_speaker_dict), "Speaker of 000020a912:", stem_to_speaker_dict["000020a912"])

Size: 966133 Speaker of 000020a912: 16cfb


In [11]:
def get_speaker_to_paths_dict(wav_list, stem_to_speaker_dict):
    # Groups the wav paths by speaker
    # and returns a speaker to paths mapping dictionary
    spk_to_path_d = {}

    for wav_path in wav_list:
        wav_name = Path(wav_path).stem
        spk_id = stem_to_speaker_dict[wav_name]

        # Create the speaker's list on first sight, then append to it
        spk_to_path_d.setdefault(spk_id, []).append(wav_path)

    return spk_to_path_d

In [12]:
speaker_to_paths_dict = get_speaker_to_paths_dict(wav_paths_list, stem_to_speaker_dict)
len(speaker_to_paths_dict.keys())


In [None]:
total = 0
for key in speaker_to_paths_dict.keys():
    total += len(speaker_to_paths_dict[key])

total

966133

In [None]:
def get_stem_to_path_dict(stem_list, wav_paths_list):
    # Maps each stem to the list of wav paths that have that stem
    # Sets are faster to search
    stem_set = set(stem_list)

    stem_to_path_d = {}
    for wav_path in wav_paths_list:
        wav_stem = Path(wav_path).stem
        if wav_stem in stem_set:
            # Create the stem's list on first sight, then append to it
            stem_to_path_d.setdefault(wav_stem, []).append(wav_path)

    return stem_to_path_d

In [13]:
stem_list = stem_to_speaker_dict.keys()

stem_to_path_dict = get_stem_to_path_dict(stem_list, wav_paths_list)


In [14]:
stem_to_path_dict['c47c6e6be0']


## Split whole dataset into dev and test

In [15]:
random.seed(RANDOM_SEED)
# random.sample needs a sequence; a dict keys view raises TypeError on Python 3.11+
test_speakers_keys = random.sample(list(speaker_to_paths_dict.keys()), TEST_CLASS_NUMBERS)
random.seed(RANDOM_SEED)

len(test_speakers_keys)


In [34]:
test_speakers_wavs_stems = []
dev_speakers_wavs_stems = []

total = 0

for key in speaker_to_paths_dict.keys():
    current_speaker_wavs = speaker_to_paths_dict[key]
    current_speaker_wavs_stems = [Path(x).stem for x in current_speaker_wavs]
    total += len(current_speaker_wavs_stems)

    if key in test_speakers_keys:
        test_speakers_wavs_stems += current_speaker_wavs_stems
    else:
        dev_speakers_wavs_stems += current_speaker_wavs_stems
    
print(total)

966133


In [35]:
print(len(test_speakers_wavs_stems), len(dev_speakers_wavs_stems))
print(len(test_speakers_wavs_stems) + len(dev_speakers_wavs_stems))

# Counts from an earlier run on the Bengali subset only (218703 utterances):
# 42075 176628
# 218703

94982 871151
966133
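
The dev/test split is made at the speaker level, so no test speaker appears in the devset. A self-contained sketch of that idea on toy speaker ids; `split_speakers` is a hypothetical helper, and unlike the cells in this notebook it sorts the speakers and uses a local `random.Random` instance, so it will not reproduce the notebook's exact selection.

```python
import random

def split_speakers(speakers, num_test, seed):
    """Pick num_test speakers for test; the remaining speakers form the devset."""
    rng = random.Random(seed)       # local RNG, leaves the global random state untouched
    ordered = sorted(speakers)      # fixed order so the sample depends only on the seed
    test_speakers = set(rng.sample(ordered, num_test))
    dev_speakers = [s for s in ordered if s not in test_speakers]
    return dev_speakers, sorted(test_speakers)

toy_speakers = ["spk%02d" % i for i in range(10)]
dev_spk, test_spk = split_speakers(toy_speakers, num_test=3, seed=99)
print(len(dev_spk), len(test_spk))  # 7 3
```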


We have divided the dataset into dev and test.
Now we need to divide dev into train and eval.

In [36]:
total_dev_wavs = len(dev_speakers_wavs_stems)
num_eval_wavs = int(total_dev_wavs * EVAL_TRAIN_RATIO)

random.seed(RANDOM_SEED)
random.shuffle(dev_speakers_wavs_stems)
random.seed(RANDOM_SEED)

eval_stems, train_stems = dev_speakers_wavs_stems[:num_eval_wavs], dev_speakers_wavs_stems[num_eval_wavs:]
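
Unlike the dev/test split, this split is made at the utterance level, so the same speakers can appear in both train and eval. A sketch of the same shuffle-and-slice logic on toy stems; `split_eval_train` is a hypothetical helper that uses a local RNG and copies the list instead of shuffling it in place.

```python
import random

def split_eval_train(stems, eval_ratio, seed):
    """Shuffle a copy of stems and slice off the first eval_ratio share as eval."""
    rng = random.Random(seed)
    shuffled = list(stems)          # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    num_eval = int(len(shuffled) * eval_ratio)
    return shuffled[:num_eval], shuffled[num_eval:]

toy_stems = ["utt%03d" % i for i in range(25)]
toy_eval, toy_train = split_eval_train(toy_stems, eval_ratio=0.10, seed=99)
print(len(toy_eval), len(toy_train))  # 2 23
```

With the devset size of 871151 reported in this notebook, `int(871151 * 0.10)` gives 87115 eval utterances.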

In [37]:
train_size = len(train_stems)
eval_size = len(eval_stems)
test_size = len(test_speakers_wavs_stems)

print("Train size:", train_size)
print("Eval size:", eval_size)
print("Test size:", test_size)

print("Total:", train_size+eval_size+test_size)

Train size: 784036
Eval size: 87115
Test size: 94982
Total: 966133


## Writing to the output files

In [38]:
with open(TRAINSET_LIST_LOCATION, "w") as file:
    for stem in train_stems:
        file.write(stem+"\n")

with open(EVALSET_LIST_LOCATION, "w") as file:
    for stem in eval_stems:
        file.write(stem+"\n")

with open(TESTSET_LIST_LOCATION, "w") as file:
    for stem in test_speakers_wavs_stems:
        file.write(stem+"\n")

In [39]:
# Set for uniqueness
# Sorted list for consistency in the order
dev_speakers_list = sorted(set(stem_to_speaker_dict[stem] for stem in dev_speakers_wavs_stems))
test_speakers_list = sorted(set(stem_to_speaker_dict[stem] for stem in test_speakers_wavs_stems))
len(dev_speakers_list)

2771

In [40]:
with open(DEVSET_CLASS_ORDER_LOCATION, "w") as file:
    for speaker in dev_speakers_list:
        file.write(speaker+"\n")

with open(TESTSET_CLASS_ORDER_LOCATION, "w") as file:
    for speaker in test_speakers_list:
        file.write(speaker+"\n")
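
A fixed class order is useful because a speaker's position in the list can serve as its integer class label. The sketch below assumes that is how the class files are consumed downstream, which this notebook does not show; the speaker ids are made up.

```python
# Made-up speaker ids standing in for the lines of a class order file
toy_dev_speakers = sorted({"spk07", "spk01", "spk03"})

# Position in the sorted list becomes the class index
speaker_to_class = {speaker: index for index, speaker in enumerate(toy_dev_speakers)}
print(speaker_to_class)  # {'spk01': 0, 'spk03': 1, 'spk07': 2}
```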

## Evaluation Trials

In [52]:
import numpy as np
np.random.seed(0)

NUM_TRIALS = 50000

In [53]:
# This function generates a file listing which two utterances should be compared using cosine similarity
# Format (one trial per line; utterances are written as stems):
# 1 spk1_utt1 spk1_utt2
# 0 spk1_utt1 spk2_utt1

def generate_validation_trials(wav_list, nb_trial, val_trial_location, speaker_to_paths_d):
    # Note: wav_list is currently unused; utterances are drawn from speaker_to_paths_d

    # We define a same-speaker trial as a target trial
    # Target trial: 1; Non-target trial: 0

    # There will be equal numbers of target and non-target trials
    nb_target_trials = int(nb_trial / 2)

    speakers_list = list(speaker_to_paths_d.keys())

    with open(val_trial_location, 'w') as val_trial_file:
        # Compose target trials
        selected_spks = np.random.choice(speakers_list, size=nb_target_trials, replace=True)

        for spk in selected_spks:
            wav_paths_of_speaker = speaker_to_paths_d[spk]

            if len(wav_paths_of_speaker) < 2:
                # Only one utterance available: the trial pairs it with itself
                utt_a, utt_b = wav_paths_of_speaker[0], wav_paths_of_speaker[0]
            else:
                utt_a, utt_b = np.random.choice(wav_paths_of_speaker, size=2, replace=False)

            utt_a_stem, utt_b_stem = Path(utt_a).stem, Path(utt_b).stem
            val_trial_file.write('1 %s %s\n' % (utt_a_stem, utt_b_stem))

        # Compose non-target trials
        for i in range(nb_target_trials):
            two_different_speakers = np.random.choice(speakers_list, size=2, replace=False)
            utt_a = np.random.choice(speaker_to_paths_d[two_different_speakers[0]], size=1)[0]
            utt_b = np.random.choice(speaker_to_paths_d[two_different_speakers[1]], size=1)[0]
            utt_a_stem, utt_b_stem = Path(utt_a).stem, Path(utt_b).stem
            val_trial_file.write('0 %s %s\n' % (utt_a_stem, utt_b_stem))

In [54]:
eval_speakers_to_path_dict = {}

eval_stem_set = set(eval_stems)

for speaker in speaker_to_paths_dict.keys():
    for path in speaker_to_paths_dict[speaker]:
        if Path(path).stem in eval_stem_set:
            if speaker in eval_speakers_to_path_dict.keys():
                eval_speakers_to_path_dict[speaker].append(path)
            else:
                eval_speakers_to_path_dict[speaker] = [path]

In [55]:
total = 0
for speaker in eval_speakers_to_path_dict:
    total += len(eval_speakers_to_path_dict[speaker])

total

87115

In [56]:
eval_paths = []
for stem in eval_stems:
    eval_paths.append(stem_to_path_dict[stem])
len(eval_paths)

87115

In [57]:
generate_validation_trials(eval_paths, NUM_TRIALS, EVAL_TRIALS_LOCATION, eval_speakers_to_path_dict)
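
Each trial pairs two utterances whose embeddings are later compared with cosine similarity. A minimal sketch of that comparison in pure Python; the 3-dimensional embeddings and the trial tuples are made up, and how the real embeddings are produced is outside the scope of this notebook.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

# Hypothetical speaker embeddings keyed by stem
toy_embeddings = {
    "utt001": [1.0, 0.0, 0.0],
    "utt002": [2.0, 0.0, 0.0],
    "utt003": [0.0, 1.0, 0.0],
}
# (label, utt_a, utt_b): 1 = same speaker, 0 = different speakers
toy_trials = [(1, "utt001", "utt002"), (0, "utt001", "utt003")]

scores = [cosine_similarity(toy_embeddings[a], toy_embeddings[b]) for _, a, b in toy_trials]
print(scores)  # [1.0, 0.0]
```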

## Testset trials

In [58]:
NUM_TEST_TRIALS = 50000

In [59]:
test_speakers_to_path_dict = {}

test_stem_set = set(test_speakers_wavs_stems)

for speaker in speaker_to_paths_dict.keys():
    for path in speaker_to_paths_dict[speaker]:
        if Path(path).stem in test_stem_set:
            if speaker in test_speakers_to_path_dict.keys():
                test_speakers_to_path_dict[speaker].append(path)
            else:
                test_speakers_to_path_dict[speaker] = [path]

In [39]:
test_paths = []
for stem in test_speakers_wavs_stems:
    test_paths.append(stem_to_path_dict[stem])
len(test_paths)

94982

In [44]:
generate_validation_trials(test_paths, NUM_TEST_TRIALS, TEST_TRIALS_LOCATION, test_speakers_to_path_dict)