# Split Data
- Splits the TIMIT dataset into `TRAIN`, `VALIDATION`, and `TEST` datasets
- Note: TIMIT already ships with a `TEST` folder, so we simply hold out a percentage of the `TEST` data for validation

## Step 1) Import Audio
- Create Training Dataset
- Create Test Dataset
- TIMIT does not distinguish between Test and Validation, so we split Test into Test and Validation manually in a downstream step. This keeps the upload here consistent with TIMIT's original naming conventions.
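The downstream Test/Validation split is not shown in this notebook. As a rough sketch of one way it could be done, the snippet below splits a list of utterance paths so that no speaker appears in both halves. The helper name `split_by_speaker`, the 50% fraction, and the seed are illustrative assumptions rather than the method used in this project; the only thing taken from the source is the `<split>/<dialect region>/<speaker>/<utterance>.wav` folder layout.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath


def split_by_speaker(paths, validation_fraction=0.5, seed=0):
    """Hypothetical helper: split paths into (validation, test) with disjoint speakers."""
    by_speaker = defaultdict(list)
    for path in paths:
        # The speaker ID is the name of the folder that contains the utterance.
        by_speaker[PurePosixPath(path).parent.name].append(path)

    # Sort before shuffling so the result depends only on the seed.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_validation = round(len(speakers) * validation_fraction)

    validation = [p for s in speakers[:n_validation] for p in by_speaker[s]]
    test = [p for s in speakers[n_validation:] for p in by_speaker[s]]
    return validation, test
```

Splitting by speaker rather than by utterance is a design choice: it keeps a speaker's voice from leaking between the validation and test sets, at the cost of slightly uneven split sizes when speakers have different numbers of utterances.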

### Motivation
We created this dataset because the existing `timit_asr` datasets on Hugging Face do not include all of the available TIMIT `TEST` and `TRAIN` data.

In addition, we add speaker gender and speech duration.

### Linguistic Distinctions Between Categories
- `SA` = "Speaker Accent" (dialect or shibboleth) sentences, designed to highlight differences between dialect regions
- `SX` = Phonetically compact sentences, designed to highlight pairs of phones of interest (e.g. voiced vs. unvoiced velar stops) in specific phonetic contexts (e.g. coda position, such as at the end of a word)
- `SI` = Phonetically diverse sentences, designed to cover many different phonemes and sentence types per speaker
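To make the naming scheme concrete, here is a minimal sketch of how a sentence ID and its category could be read from a TIMIT file name such as `SX139.wav`. The helper `parse_sentence_id` and the `SENTENCE_TYPES` labels are hypothetical and only for illustration; they are not the `get_sentence_info` function imported from `timit_metadata_extractors`, whose implementation is not shown here.

```python
import os
import re

# Short, illustrative labels for the three TIMIT sentence categories.
SENTENCE_TYPES = {
    "SA": "dialect",
    "SX": "phonetically compact",
    "SI": "phonetically diverse",
}


def parse_sentence_id(wav_path):
    """Hypothetical helper: return (sentence_id, sentence_type) for a TIMIT path."""
    # Drop the directories and the extension, e.g. ".../SX139.wav" becomes "SX139".
    stem = os.path.splitext(os.path.basename(wav_path))[0].upper()
    match = re.fullmatch(r"(SA|SX|SI)(\d+)", stem)
    if match is None:
        raise ValueError(f"Not a TIMIT sentence file name: {wav_path!r}")
    return stem, match.group(1)
```

For example, `parse_sentence_id("TEST/DR4/MGMM0/SX139.wav")` yields the ID `SX139` with type `SX`, which `SENTENCE_TYPES` maps to "phonetically compact".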

In [16]:
from datasets import Dataset, Audio, DatasetDict, load
import os

In [2]:
from timit_metadata_extractors import get_transcription_detail, get_speech_duration, get_replace_ending, get_timit_path, get_speaker_info, get_sentence_info, get_text

In [3]:
PATH_TO_TIMIT = "../data/input_data/TIMIT-Database/TIMIT"

In [4]:
def get_audio_files(dir: str, file_type: str = "wav") -> list[str]:
    """
    Walks through every directory under `dir` and returns the paths of all files whose names end in `file_type`
    """

    audio_paths = []

    # Walk through each directory
    for dirpath, _dirnames, filenames in os.walk(dir):
        # Check each file in the directory
        for file_name in filenames:
            # If the `file_type` matches
            if file_name.endswith(file_type):
                # Add that file
                full_local_path = os.path.join(dirpath, file_name)
                audio_paths.append(full_local_path)
    return audio_paths

In [5]:
# Quick sanity check that we get the expected local paths
paths = get_audio_files("../data/input_data/TIMIT-Database/TIMIT/TEST")
print("local paths", paths[:3])

local paths ['../data/input_data/TIMIT-Database/TIMIT/TEST/DR4/MGMM0/SX139.wav', '../data/input_data/TIMIT-Database/TIMIT/TEST/DR4/MGMM0/SA2.wav', '../data/input_data/TIMIT-Database/TIMIT/TEST/DR4/MGMM0/SX229.wav']


In [6]:
train_path = os.path.join(PATH_TO_TIMIT, "TRAIN")
testvalidation_path = os.path.join(PATH_TO_TIMIT, "TEST")

In [7]:
# TODO REMOVE AFTER DEBUGGING
# train_dataset = Dataset.from_dict(
#     {
#         "audio": [audio_path for audio_path in get_audio_files(train_path)][:1]
#     }
# ).cast_column("audio", Audio())

# testvalidation_dataset = Dataset.from_dict(
#      {
#         "audio": [audio_path for audio_path in get_audio_files(testvalidation_path)][:1]
#     }
# ).cast_column("audio", Audio())

train_dataset = Dataset.from_dict(
    {"audio": get_audio_files(train_path)}
).cast_column("audio", Audio())

testvalidation_dataset = Dataset.from_dict(
    {"audio": get_audio_files(testvalidation_path)}
).cast_column("audio", Audio())

In [8]:
train_dataset

Dataset({
    features: ['audio'],
    num_rows: 3629
})

In [9]:
# Quick sanity check to make sure we have valid audio files
train_dataset[0]["audio"]

{'path': '../data/input_data/TIMIT-Database/TIMIT/TRAIN/DR4/MMDM0/SI681.wav',
 'array': array([-2.13623047e-04,  6.10351562e-05,  3.05175781e-05, ...,
        -3.05175781e-05, -9.15527344e-05, -6.10351562e-05]),
 'sampling_rate': 16000}

## Step 2) Add Metadata
Not all of this metadata is needed for this specific project, but others in the open-source community may find it helpful.
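As a rough illustration of where the speaker and duration fields can come from, the sketch below derives them from a TIMIT-relative path and from the audio length. The helpers `parse_speaker_info` and `duration_in_seconds` are hypothetical stand-ins, not the `get_speaker_info` and `get_speech_duration` functions used in this notebook, and the `"male"`/`"female"` labels are an assumption about the output format. The sketch relies on two TIMIT conventions: the folder layout `<split>/<dialect region>/<speaker>/<utterance>.wav`, and speaker IDs that begin with `M` or `F`.

```python
from pathlib import PurePosixPath

# Dialect region names as listed in the TIMIT documentation.
DIALECT_REGIONS = {
    "DR1": "New England",
    "DR2": "Northern",
    "DR3": "North Midland",
    "DR4": "South Midland",
    "DR5": "Southern",
    "DR6": "New York City",
    "DR7": "Western",
    "DR8": "Army Brat (moved around)",
}


def parse_speaker_info(timit_path):
    """Hypothetical helper: read dialect region, speaker ID, and sex from a path."""
    parts = PurePosixPath(timit_path).parts
    dialect_region, speaker_id = parts[-3].upper(), parts[-2].upper()
    # The first letter of a TIMIT speaker ID encodes the speaker's sex.
    sex = {"M": "male", "F": "female"}[speaker_id[0]]
    return (dialect_region, DIALECT_REGIONS[dialect_region]), (speaker_id, sex)


def duration_in_seconds(n_samples, sampling_rate):
    """Hypothetical helper: duration is the sample count divided by the sampling rate."""
    return n_samples / sampling_rate
```

With the 16 kHz sampling rate seen in the sanity check above, an utterance of 48,000 samples lasts 3 seconds.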

In [10]:
def add_metadata(example):
    """
    Adds transcriptions and metadata to the TIMIT dataset

    Note: after `example["audio"]` is accessed, `example["audio"]["path"]` is
    automatically set to None for security reasons
    (see GitHub issue: https://github.com/huggingface/datasets/issues/5190)
    """
    wav_path = example["audio"]["path"]

    # Add Transcriptions
    phn_file = get_replace_ending(wav_path, new_extension=".phn")
    wrd_file = get_replace_ending(wav_path, new_extension=".wrd")
    text_file = get_replace_ending(wav_path, new_extension=".txt")

    example["phonetic_detail"] = get_transcription_detail(phn_file)
    example["word_detail"] = get_transcription_detail(wrd_file)
    example["text"] = get_text(text_file)

    # Speech Duration
    example["duration"] = get_speech_duration(
        example["audio"]["array"], sr=example["audio"]["sampling_rate"])

    # TIMIT Path
    example["timit_path"] = get_timit_path(
        abs_path=wav_path, base_path=PATH_TO_TIMIT)

    # Speaker Metadata
    (dialect_region, dialect_region_name), (speaker_id, sex) = get_speaker_info(
        example["timit_path"])
    example["dialect_region"] = dialect_region
    example["dialect_region_name"] = dialect_region_name
    example["speaker_id"] = speaker_id
    example["speaker_sex"] = sex

    # Sentence Metadata
    sentence_id, sentence_type = get_sentence_info(wav_path)
    example["id"] = sentence_id
    example["sentence_type"] = sentence_type
    return example

In [11]:
train_dataset_with_metadata = train_dataset.map(add_metadata)
testvalidation_dataset_with_metadata = testvalidation_dataset.map(add_metadata)

Map: 100%|██████████| 3629/3629 [01:25<00:00, 42.32 examples/s] 
Map: 100%|██████████| 1340/1340 [00:31<00:00, 42.04 examples/s]


## Step 3) Upload Dataset to Hugging Face Hub
- Note: we upload `testvalidation_dataset_with_metadata` as `test`, following the TIMIT convention
- We will also create a train/validation/test split and extract more features from the data, but in a different repository so that this one stays consistent with the original TIMIT dataset

In [12]:
timit_asr_dataset_base = DatasetDict({
    "train": train_dataset_with_metadata,
    "test": testvalidation_dataset_with_metadata
})

In [13]:
timit_asr_dataset_base

DatasetDict({
    train: Dataset({
        features: ['audio', 'phonetic_detail', 'word_detail', 'text', 'duration', 'timit_path', 'dialect_region', 'dialect_region_name', 'speaker_id', 'speaker_sex', 'id', 'sentence_type'],
        num_rows: 3629
    })
    test: Dataset({
        features: ['audio', 'phonetic_detail', 'word_detail', 'text', 'duration', 'timit_path', 'dialect_region', 'dialect_region_name', 'speaker_id', 'speaker_sex', 'id', 'sentence_type'],
        num_rows: 1340
    })
})

### Upload and Verify

In [14]:
timit_asr_dataset_base.push_to_hub("kylelovesllms/timit_asr")

Map: 100%|██████████| 3629/3629 [00:00<00:00, 9691.34 examples/s]
Creating parquet from Arrow format: 100%|██████████| 37/37 [00:00<00:00, 109.97ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:15<00:00, 15.23s/it]
Map: 100%|██████████| 1340/1340 [00:00<00:00, 12737.90 examples/s]
Creating parquet from Arrow format: 100%|██████████| 14/14 [00:00<00:00, 128.94ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:04<00:00,  4.95s/it]


CommitInfo(commit_url='https://huggingface.co/datasets/kylelovesllms/timit_asr/commit/29c837c76686cb61f3efe199b0ea7d959b12f343', commit_message='Upload dataset', commit_description='', oid='29c837c76686cb61f3efe199b0ea7d959b12f343', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/kylelovesllms/timit_asr', endpoint='https://huggingface.co', repo_type='dataset', repo_id='kylelovesllms/timit_asr'), pr_revision=None, pr_num=None)

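In [17]:
# `datasets.load` is a module, not a function; `load_dataset` is the call that reads from the Hub
from datasets import load_dataset

# Reload the repository pushed above to verify the upload
timit_asr_dataset_base_from_hub = load_dataset("kylelovesllms/timit_asr")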