# Preparing the TIMIT Dataset

https://deepai.org/dataset/timit

This notebook defines how the entire dataset is split into train, evaluation, and test sets. It creates text files containing the stems (file names) of the wavs in each set.

[[[ If the DatasetPreparations folder already contains the 7 output files, it is not necessary to run this notebook. ]]]

## Terminology:

* Stem: The file name taken from a path, without its directory and final extension. For example, the stem of "a/b/c/abc.wav" is "abc".
* devset: The set of files used during development (training). It includes both the train set and the evaluation set.
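
As a quick illustration of the stem definition, `pathlib.Path.stem` drops the directories and only the last suffix. The second path is a made-up example that follows the `*.WAV.wav` naming seen in the outputs later in this notebook.

```python
from pathlib import Path

# Stem of the example path from the definition above
stem_simple = Path("a/b/c/abc.wav").stem
print(stem_simple)  # abc

# Only the final suffix is removed, so a double extension keeps its first part.
# "x/y/SI681.WAV.wav" is an illustrative path, not a real location.
stem_double = Path("x/y/SI681.WAV.wav").stem
print(stem_double)  # SI681.WAV
```

This is why the dictionary keys used later look like "SI681.WAV" rather than "SI681".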

Necessary imports:

In [3]:
import glob
import csv
import os

from pathlib import Path

import random

import pandas as pd

In [4]:
# Directories are assumed to have a trailing '/' or '\\' in all the subsequent code

CURRENT_WORKING_DIRECTORY = "/home/abdullah/Code/dl/SpeakerRecognitionResearch/"

TIMIT_DATASET_DIRECTORY = "/home/abdullah/Code/datasets/timit/data/"

# Annotation CSV files provided with the dataset; this notebook reads them, it does not generate them
TIMIT_TRAIN_CSV_LOCATION = "/home/abdullah/Code/datasets/timit/train_data.csv"
TIMIT_TEST_CSV_LOCATION = "/home/abdullah/Code/datasets/timit/test_data.csv"

# To avoid file location related errors, we make sure "SpeakerRecognitionResearch" root folder is the current working directory.
os.chdir(CURRENT_WORKING_DIRECTORY)
os.getcwd()

'/home/abdullah/Code/dl/SpeakerRecognitionResearch'

## Output files:

The following output files will be generated by this notebook:
1. trainset_list.txt: List of all stems that will be used for training
2. evalset_list.txt: List of all stems that will be used for evaluation in the training loop
3. testset_list.txt: List of all stems that will be used for testing after the training phase is complete
4. devset_classes.txt: Order of the classes (speakers) used for both trainset and evalset
5. testset_classes.txt: Order of the classes (speakers) used for testset
6. eval_trials.txt: (expected utt1 utt2) tuples, one per trial, to compute for evaluation
7. test_trials.txt: (expected utt1 utt2) tuples, one per trial, to compute for test
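
Each line of the two trial files holds a label and two stems separated by spaces. A minimal sketch of reading such a line back is shown here; `parse_trial_line` is a hypothetical helper that is not part of this notebook, and the stems in the example are made up.

```python
def parse_trial_line(line):
    # Split "label utt_a utt_b" into an integer label and two stems
    label, utt_a, utt_b = line.split()
    return int(label), utt_a, utt_b


# Illustrative line only; real lines are written by the trial generation cells
example_trial = parse_trial_line("1 SI681.WAV SX51.WAV\n")
print(example_trial)
```

Splitting on whitespace works here only because the stems themselves contain no spaces.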

In [5]:
TRAINSET_LIST_LOCATION = "notebooks/TrainTimitOnRawNet2/DatasetPreparations/trainset_list.txt"
EVALSET_LIST_LOCATION = "notebooks/TrainTimitOnRawNet2/DatasetPreparations/evalset_list.txt"
TESTSET_LIST_LOCATION = "notebooks/TrainTimitOnRawNet2/DatasetPreparations/testset_list.txt"
DEVSET_CLASS_ORDER_LOCATION = "notebooks/TrainTimitOnRawNet2/DatasetPreparations/devset_classes.txt"
TESTSET_CLASS_ORDER_LOCATION = "notebooks/TrainTimitOnRawNet2/DatasetPreparations/testset_classes.txt"
EVAL_TRIALS_LOCATION = "notebooks/TrainTimitOnRawNet2/DatasetPreparations/eval_trials.txt"
TEST_TRIALS_LOCATION = "notebooks/TrainTimitOnRawNet2/DatasetPreparations/test_trials.txt"

In [30]:
# There are in total 632 speakers in the dataset
# timit_train = pd.read_csv(TIMIT_TRAIN_CSV_LOCATION)
# timit_train_df = pd.DataFrame(timit_train)
# unique_speaker_train = timit_train_df["speaker_id"].unique()
# print(len(unique_speaker_train))

# timit_test = pd.read_csv(TIMIT_TEST_CSV_LOCATION)
# timit_test_df = pd.DataFrame(timit_test)
# unique_speaker_test = timit_test_df["speaker_id"].unique()
# print(len(unique_speaker_test))

# In our splitting strategy, we hold out 169 of the 632 speakers (about 27%) for TEST
TEST_CLASS_NUMBERS = 169
DEV_CLASS_NUMBERS = 463

# 10% of the devset will be used for evaluating each epoch
EVAL_TRAIN_RATIO = 0.10

# Don't change it, to keep the work reproducible
RANDOM_SEED = 99

In [31]:
def get_wav_list(dataset_dir):
    # Given a directory, return the list of paths of all wav files under it
    pattern = '**/*.wav'
    files = glob.glob(dataset_dir + pattern, recursive=True)

    # Normalize the file paths so the separators ('/' or '\\') are consistent for the OS
    wav_list = [os.path.normpath(i) for i in files]
    return wav_list

In [32]:
wav_paths_list = get_wav_list(TIMIT_DATASET_DIRECTORY)
len(wav_paths_list)

6300

## Label dictionaries

Three dictionaries will be helpful to us:

1. stem_to_speaker_dict: Given a stem, who is its speaker?
2. speaker_to_paths_dict: Given a speaker, what are their audio paths?
3. stem_to_path_dict: Given the stem, what is its path?
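
To make the three mappings concrete, here is a small sketch built from made-up paths and annotations. All names prefixed with `toy_` are illustrative and are not used elsewhere in this notebook.

```python
from pathlib import Path

# Made-up wav paths and stem-to-speaker annotations, for illustration only
toy_paths = [
    "data/TRAIN/DR1/SPK_A/U1.WAV.wav",
    "data/TRAIN/DR1/SPK_A/U2.WAV.wav",
    "data/TRAIN/DR2/SPK_B/U3.WAV.wav",
]
toy_stem_to_speaker = {"U1.WAV": "SPK_A", "U2.WAV": "SPK_A", "U3.WAV": "SPK_B"}

toy_speaker_to_paths = {}
toy_stem_to_path = {}
for path in toy_paths:
    stem = Path(path).stem
    speaker = toy_stem_to_speaker[stem]
    # Group paths by speaker and by stem
    toy_speaker_to_paths.setdefault(speaker, []).append(path)
    toy_stem_to_path.setdefault(stem, []).append(path)

print(toy_speaker_to_paths["SPK_A"])
print(toy_stem_to_path["U3.WAV"])
```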

In [33]:
def get_stem_to_speaker_dict(csv_loc):
    # Reads the annotation csv file provided with the dataset 
    # and returns a stem to speaker mapping dictionary
    stem_to_speaker_d = {}

    with open(csv_loc, encoding="utf-8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter=",", quoting=csv.QUOTE_NONE)
        for line in csvreader:
            wav_file_name = line[4]
            speaker_id = line[3]

            stem_to_speaker_d[wav_file_name] = speaker_id

    return stem_to_speaker_d

In [34]:
stem_to_speaker_dict = get_stem_to_speaker_dict(TIMIT_TRAIN_CSV_LOCATION)
print("Size:", len(stem_to_speaker_dict),
      "Speaker of SI681:", stem_to_speaker_dict["SI681.WAV"])

Size: 8592 Speaker of SI681: MMDM0


In [35]:
def get_speaker_to_paths_dict(wav_list, stem_to_speaker_dict):
    # Groups the wav paths by speaker, using each wav's stem to look up its speaker
    spk_to_path_d = {}

    for wav_path in wav_list:
        wav_name = Path(wav_path).stem

        # Skip wavs whose stem has no entry in the annotation CSV
        spk_id = stem_to_speaker_dict.get(wav_name)
        if spk_id is None:
            continue

        spk_to_path_d.setdefault(spk_id, []).append(wav_path)

    return spk_to_path_d

In [36]:
speaker_to_paths_dict = get_speaker_to_paths_dict(wav_paths_list, stem_to_speaker_dict)
len(speaker_to_paths_dict.keys())

462

In [37]:
total = 0
for key in speaker_to_paths_dict.keys():
    total += len(speaker_to_paths_dict[key])

total

4956

In [38]:
def get_stem_to_path_dict(stem_list, wav_paths_list):
    # Sets are faster to search
    stem_set = set(stem_list)
    
    stem_to_path_d = {}
    for wav_path in wav_paths_list:
        wav_stem = Path(wav_path).stem
        if wav_stem in stem_set:
            if wav_stem in stem_to_path_d:
                stem_to_path_d[wav_stem].append(wav_path)
            else:
                stem_to_path_d[wav_stem] = [wav_path]

    return stem_to_path_d          
    

In [39]:
stem_list = stem_to_speaker_dict.keys()

stem_to_path_dict = get_stem_to_path_dict(stem_list, wav_paths_list)

In [41]:
stem_to_path_dict['SI681.WAV']


['/home/abdullah/Code/datasets/timit/data/TRAIN/DR4/MMDM0/SI681.WAV.wav']
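
The lookup above returns a list because one stem can match more than one file. In TIMIT some sentences are read by several speakers (SA1 and SA2 are read by every speaker), so a bare stem may not identify a single speaker. One possible way around this, sketched here, is to key on speaker and stem together. The sketch assumes the speaker ID is the name of the wav's parent directory, as in the path shown above; `speaker_stem_key` is a hypothetical helper and is not used anywhere else in this notebook.

```python
from pathlib import Path


def speaker_stem_key(wav_path):
    # Build "SPEAKER/STEM" from a path shaped like .../DR4/MMDM0/SI681.WAV.wav
    p = Path(wav_path)
    return p.parent.name + "/" + p.stem


# Illustrative paths: the same sentence stem under two different speaker folders
key_a = speaker_stem_key("data/TRAIN/DR4/MMDM0/SA1.WAV.wav")
key_b = speaker_stem_key("data/TRAIN/DR1/FCJF0/SA1.WAV.wav")
print(key_a, key_b)
```

With a composite key like this, two speakers reading the same sentence no longer share a dictionary entry.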

## Split whole dataset into dev and test

In [42]:
random.seed(RANDOM_SEED)
# random.sample needs a sequence; dict views are rejected on Python 3.11+
test_speakers_keys = random.sample(list(speaker_to_paths_dict.keys()), TEST_CLASS_NUMBERS)
random.seed(RANDOM_SEED)

len(test_speakers_keys)

169

In [43]:
test_speakers_wavs_stems = []
dev_speakers_wavs_stems = []

total = 0

for key in speaker_to_paths_dict.keys():
    current_speaker_wavs = speaker_to_paths_dict[key]
    current_speaker_wavs_stems = [Path(x).stem for x in current_speaker_wavs]
    total += len(current_speaker_wavs_stems)

    if key in test_speakers_keys:
        test_speakers_wavs_stems += current_speaker_wavs_stems
    else:
        dev_speakers_wavs_stems += current_speaker_wavs_stems
    
print(total)

4956


In [44]:
print(len(test_speakers_wavs_stems), len(dev_speakers_wavs_stems))
print(len(test_speakers_wavs_stems) + len(dev_speakers_wavs_stems))


1354 3602
4956


We have divided the dataset into dev and test sets.
Next, we divide the dev set into train and eval sets.
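
The idea is a seeded shuffle followed by a slice. A minimal sketch on a toy list, assuming the same 10% ratio and seed value used in this notebook; the `toy_` names and stems are made up.

```python
import random

# Made-up stems standing in for the dev set
toy_stems = ["utt%d" % i for i in range(20)]
toy_eval_ratio = 0.10

# Seeding before the shuffle makes the split repeatable
random.seed(99)
random.shuffle(toy_stems)

# The first 10% go to eval, the rest to train
toy_num_eval = int(len(toy_stems) * toy_eval_ratio)
toy_eval, toy_train = toy_stems[:toy_num_eval], toy_stems[toy_num_eval:]
print(len(toy_eval), len(toy_train))
```

Because the slice is taken from one shuffled list, the two parts cannot overlap.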

In [45]:
total_dev_wavs = len(dev_speakers_wavs_stems)
num_eval_wavs = int(total_dev_wavs * EVAL_TRAIN_RATIO)

random.seed(RANDOM_SEED)
random.shuffle(dev_speakers_wavs_stems)
random.seed(RANDOM_SEED)

eval_stems, train_stems = dev_speakers_wavs_stems[:num_eval_wavs], dev_speakers_wavs_stems[num_eval_wavs:]

In [46]:
train_size = len(train_stems)
eval_size = len(eval_stems)
test_size = len(test_speakers_wavs_stems)

print("Train size:", train_size)
print("Eval size:", eval_size)
print("Test size:", test_size)

print("Total:", train_size+eval_size+test_size)

Train size: 3242
Eval size: 360
Test size: 1354
Total: 4956


## Writing to the output files

In [47]:
with open(TRAINSET_LIST_LOCATION, "w") as file:
    for stem in train_stems:
        file.write(stem+"\n")

with open(EVALSET_LIST_LOCATION, "w") as file:
    for stem in eval_stems:
        file.write(stem+"\n")

with open(TESTSET_LIST_LOCATION, "w") as file:
    for stem in test_speakers_wavs_stems:
        file.write(stem+"\n")

In [48]:
# A set removes duplicates; sorting keeps the class order consistent between runs
dev_speakers_list = sorted({stem_to_speaker_dict[stem] for stem in dev_speakers_wavs_stems})
test_speakers_list = sorted({stem_to_speaker_dict[stem] for stem in test_speakers_wavs_stems})
len(dev_speakers_list)

293

In [49]:
with open(DEVSET_CLASS_ORDER_LOCATION, "w") as file:
    for speaker in dev_speakers_list:
        file.write(speaker+"\n")

with open(TESTSET_CLASS_ORDER_LOCATION, "w") as file:
    for speaker in test_speakers_list:
        file.write(speaker+"\n")

## Evaluation Trials

In [50]:
import numpy as np
np.random.seed(0)

NUM_TRIALS = 50000

In [51]:
# This function generates a file listing which two utterances should be compared with cosine similarity
# Format (utterances are written as stems):
# 1 <stem of utt_a> <stem of utt_b>   (same speaker)
# 0 <stem of utt_a> <stem of utt_b>   (different speakers)

def generate_validation_trials(wav_list, nb_trial, val_trial_location, speaker_to_paths_d):
    # Note: wav_list is not used; trials are drawn from speaker_to_paths_d only

    # We define a same-speaker trial as a target trial
    # Target trial: 1; non-target trial: 0

    # There will be equal numbers of target and non-target trials
    nb_target_trials = int(nb_trial / 2)

    speakers_list = list(speaker_to_paths_d.keys())

    with open(val_trial_location, 'w') as val_trial_file:
        # Compose target trials
        selected_spks = np.random.choice(speakers_list, size=nb_target_trials, replace=True)

        for spk in selected_spks:
            wav_paths_of_speaker = speaker_to_paths_d[spk]

            if len(wav_paths_of_speaker) < 2:
                # A speaker with a single wav is paired with itself
                utt_a, utt_b = wav_paths_of_speaker[0], wav_paths_of_speaker[0]
            else:
                utt_a, utt_b = np.random.choice(wav_paths_of_speaker, size=2, replace=False)

            utt_a_stem, utt_b_stem = Path(utt_a).stem, Path(utt_b).stem
            val_trial_file.write('1 %s %s\n' % (utt_a_stem, utt_b_stem))

        # Compose non-target trials
        for i in range(nb_target_trials):
            two_different_speakers = np.random.choice(speakers_list, size=2, replace=False)
            utt_a = np.random.choice(speaker_to_paths_d[two_different_speakers[0]], size=1)[0]
            utt_b = np.random.choice(speaker_to_paths_d[two_different_speakers[1]], size=1)[0]
            utt_a_stem, utt_b_stem = Path(utt_a).stem, Path(utt_b).stem
            val_trial_file.write('0 %s %s\n' % (utt_a_stem, utt_b_stem))
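
The sampling logic above can be tried on a toy dictionary without touching any files. The sketch below is a simplified stand-in, not the notebook's function: it swaps `np.random.choice` for the standard library's `random` module to stay dependency-free, returns tuples instead of writing a file, and uses made-up speakers and stems. `sample_trials` is a hypothetical name.

```python
import random


def sample_trials(speaker_to_stems, nb_trial, seed=0):
    # Returns (label, utt_a, utt_b) tuples: half target (1), half non-target (0)
    rng = random.Random(seed)
    speakers = list(speaker_to_stems)
    half = nb_trial // 2
    trials = []

    # Target trials: two utterances from the same speaker
    for _ in range(half):
        stems = speaker_to_stems[rng.choice(speakers)]
        if len(stems) < 2:
            # A speaker with a single utterance is paired with itself
            utt_a = utt_b = stems[0]
        else:
            utt_a, utt_b = rng.sample(stems, 2)
        trials.append((1, utt_a, utt_b))

    # Non-target trials: one utterance from each of two different speakers
    for _ in range(half):
        spk_a, spk_b = rng.sample(speakers, 2)
        utt_a = rng.choice(speaker_to_stems[spk_a])
        utt_b = rng.choice(speaker_to_stems[spk_b])
        trials.append((0, utt_a, utt_b))

    return trials


toy_speakers = {"SPK_A": ["a1", "a2"], "SPK_B": ["b1", "b2"], "SPK_C": ["c1"]}
toy_trials = sample_trials(toy_speakers, 10)
print(len(toy_trials), sum(label for label, _, _ in toy_trials))
```

Speakers are drawn with replacement for target trials, so a speaker with many utterances is no more likely to be picked than one with few.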

In [52]:
eval_speakers_to_path_dict = {}

eval_stem_set = set(eval_stems)

for speaker in speaker_to_paths_dict.keys():
    for path in speaker_to_paths_dict[speaker]:
        if Path(path).stem in eval_stem_set:
            if speaker in eval_speakers_to_path_dict.keys():
                eval_speakers_to_path_dict[speaker].append(path)
            else:
                eval_speakers_to_path_dict[speaker] = [path]

In [53]:
total = 0
for speaker in eval_speakers_to_path_dict:
    total += len(eval_speakers_to_path_dict[speaker])

total

2166

In [54]:
eval_paths = []
for stem in eval_stems:
    eval_paths.append(stem_to_path_dict[stem])
len(eval_paths)

360

In [55]:
generate_validation_trials(eval_paths, NUM_TRIALS, EVAL_TRIALS_LOCATION, eval_speakers_to_path_dict)

## Testset trials

In [56]:
NUM_TEST_TRIALS = 50000

In [57]:
test_speakers_to_path_dict = {}

test_stem_set = set(test_speakers_wavs_stems)

for speaker in speaker_to_paths_dict.keys():
    for path in speaker_to_paths_dict[speaker]:
        if Path(path).stem in test_stem_set:
            if speaker in test_speakers_to_path_dict.keys():
                test_speakers_to_path_dict[speaker].append(path)
            else:
                test_speakers_to_path_dict[speaker] = [path]

In [58]:
test_paths = []
for stem in test_speakers_wavs_stems:
    test_paths.append(stem_to_path_dict[stem])
len(test_paths)

1354

In [59]:
generate_validation_trials(test_paths, NUM_TEST_TRIALS, TEST_TRIALS_LOCATION, test_speakers_to_path_dict)