# Split Data
- Splits the TIMIT dataset into `TRAIN`, `VALIDATION`, and `TEST` datasets
- Note: TIMIT already ships with a `TEST` folder, so we simply hold out a percentage of the `TEST` data for validation

## Step 1) Import Audio
- Create Training Dataset
- Create Test Dataset
- TIMIT does not provide a separate validation set, so we import Train and Test as-is to respect TIMIT's original naming conventions and split Test into Test and Validation manually as a downstream step
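
The downstream Test/Validation split mentioned above could look roughly like the sketch below. It uses only the standard library; the helper name `split_test_validation`, the validation fraction, the seed, and the example file names are all assumptions made for illustration, not values taken from this notebook.

```python
import random


def split_test_validation(paths, validation_fraction=0.5, seed=42):
    """Hypothetical helper: shuffle `paths` reproducibly and split them in two."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")

    # Sort first so the result does not depend on the order os.walk returned
    shuffled = sorted(paths)
    random.Random(seed).shuffle(shuffled)

    n_validation = int(len(shuffled) * validation_fraction)
    return {
        "validation": shuffled[:n_validation],
        "test": shuffled[n_validation:],
    }


# Made-up file names, for illustration only
example_paths = [f"TEST/DR1/SPKR0/SX{i}.wav" for i in range(10)]
splits = split_test_validation(example_paths, validation_fraction=0.3)
print(len(splits["validation"]), len(splits["test"]))
```

A random split like this can place the same speaker in both subsets; grouping by speaker directory before splitting would avoid that. When working with Hugging Face `datasets` objects directly, `Dataset.train_test_split` provides a comparable shuffle-and-split.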

### Motivation
We created this dataset because the existing `timit_asr` Hugging Face datasets do not use all of the available TIMIT `TEST` and `TRAIN` data.

In addition, we add speaker gender and speech duration.

### Linguistic Distinction in categories
- `SA` = "Speaker Accent"/Dialect or Shibboleth sentences, designed to highlight differences between dialect regions
- `SX` = Phonetically Compact sentences, designed to highlight pairs of phones of interest (e.g. voiced vs. unvoiced velar stops) in specific phonetic contexts (e.g. coda position, at the end of a word)
- `SI` = Phonetically Diverse sentences, designed to cover many different phonemes and sentence types per speaker
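
As a rough illustration of how these categories can be recovered from file names, the sketch below parses a TIMIT-style path. It assumes the standard layout `<split>/<dialect region>/<speaker ID>/<task>.wav` and that the first letter of the speaker ID encodes gender (`M` or `F`); `parse_timit_path` is a hypothetical helper, not a function from this notebook.

```python
from pathlib import PurePosixPath


def parse_timit_path(path):
    """Hypothetical helper: derive metadata from a TIMIT-style file path."""
    posix_path = PurePosixPath(path)
    # The last three components are <dialect region>/<speaker ID>/<file name>
    dialect_region, speaker_id = posix_path.parts[-3], posix_path.parts[-2]
    speech_task = posix_path.stem  # file name without the extension
    return {
        "dialect_region": dialect_region,
        "speaker_id": speaker_id,
        "speaker_gender": speaker_id[0],   # assumed: first letter is "M" or "F"
        "sentence_type": speech_task[:2],  # "SA", "SX", or "SI"
        "speech_task": speech_task,
    }


info = parse_timit_path(
    "../data/input_data/TIMIT-Database/TIMIT/TEST/DR4/MGMM0/SX139.wav"
)
print(info)
```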

In [None]:
from datasets import Dataset, Audio, DatasetDict
import os

In [26]:
PATH_TO_TIMIT = "../data/input_data/TIMIT-Database/TIMIT"

In [27]:
def get_audio_files(root_dir: str, file_type: str = "wav") -> dict[str, list[str]]:
    """
    Walks through every directory under `root_dir` and collects all files ending in `.{file_type}`.

    Returns a dict with two parallel lists: `speech_tasks` (file names without
    the extension) and `local_paths` (full paths to the files).
    """
    suffix = f".{file_type}"
    audio_paths = {"speech_tasks": [], "local_paths": []}

    # Walk through each directory
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        # Check each file in the directory
        for file_name in filenames:
            # Keep only files with the requested extension (including the dot)
            if file_name.endswith(suffix):
                audio_paths["local_paths"].append(os.path.join(dirpath, file_name))
                audio_paths["speech_tasks"].append(file_name.removesuffix(suffix))

    return audio_paths

In [None]:
# Quick sanity check that we get the appropriate local paths and task names
paths = get_audio_files(os.path.join(PATH_TO_TIMIT, "TEST"))
print("local paths", paths["local_paths"][:3])
print("speech tasks", paths["speech_tasks"][:3])

local paths ['../data/input_data/TIMIT-Database/TIMIT/TEST/DR4/MGMM0/SX139.wav', '../data/input_data/TIMIT-Database/TIMIT/TEST/DR4/MGMM0/SA2.wav', '../data/input_data/TIMIT-Database/TIMIT/TEST/DR4/MGMM0/SX229.wav']
speech tasks ['SX139', 'SA2', 'SX229']


In [29]:
train_path = os.path.join(PATH_TO_TIMIT, "TRAIN")

In [32]:
train_dataset = Dataset.from_dict(
    {"audio": get_audio_files(train_path)["local_paths"]}
).cast_column("audio", Audio())

In [None]:
# Quick sanity check to make sure we have valid audio files
train_dataset[0]["audio"]

{'path': '../data/input_data/TIMIT-Database/TIMIT/TRAIN/DR4/MMDM0/SI681.wav',
 'array': array([-2.13623047e-04,  6.10351562e-05,  3.05175781e-05, ...,
        -3.05175781e-05, -9.15527344e-05, -6.10351562e-05]),
 'sampling_rate': 16000}
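
Since each decoded entry carries both the sample array and the sampling rate, speech duration follows directly from the two. The sketch below assumes the entry has the dictionary shape shown in the output above; `duration_seconds` is a hypothetical helper, and the example entry is a stand-in rather than real TIMIT audio.

```python
def duration_seconds(audio):
    """Hypothetical helper: duration of a decoded audio entry in seconds."""
    # Number of samples divided by samples per second
    return len(audio["array"]) / audio["sampling_rate"]


# Stand-in for a decoded entry: 32,000 silent samples at 16 kHz
example_audio = {"array": [0.0] * 32000, "sampling_rate": 16000}
print(duration_seconds(example_audio))  # 2.0
```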

## Step 2) Add Metadata
Not all of this metadata is needed for this specific project, but others in the open-source community may find it helpful.

In [None]:
def get_