# Split Data
- Splits the TIMIT dataset into `TRAIN`, `VALIDATION`, and `TEST` sets
- Note: TIMIT already ships with a `TEST` folder, so we take a portion of the `TEST` data for validation

## Step 1) Import Audio
- Create Training Dataset
- Create Test Dataset
- TIMIT ships only `TRAIN` and `TEST` folders, with no separate validation set. To respect TIMIT's original naming conventions, we import `TEST` as is and split it into test and validation sets in a later step

### Motivation
We created this dataset because the existing `timit_asr` datasets on Hugging Face do not include all of the available TIMIT `TRAIN` and `TEST` data.

In addition, we add speaker gender and speech duration.

### Linguistic Distinctions Between Sentence Categories
- `SA` = "Speaker Accent" (dialect or shibboleth) sentences, designed to highlight differences between dialect regions
- `SX` = Phonetically compact sentences, designed to highlight pairs of phones of interest (e.g. voiced vs. unvoiced velar stops) in specific phonetic contexts (e.g. coda position, at the end of a syllable)
- `SI` = Phonetically diverse sentences, designed to cover many different phonemes and sentence types per speaker
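
As a rough illustration of how these categories show up in the data: TIMIT utterance files are named after their sentence, for example `SA1.WAV`, `SX37.WAV`, or `SI1027.WAV`, so the category can be read off the file name. The helper below is a hypothetical sketch, not the `get_sentence_info` function imported later in this notebook, and the short descriptions in `SENTENCE_TYPES` are our own wording.

```python
import os
import re

# Short descriptions of each category (illustrative wording, an assumption).
SENTENCE_TYPES = {
    "SA": "dialect",
    "SX": "phonetically compact",
    "SI": "phonetically diverse",
}


def sentence_type_from_path(path: str) -> tuple[str, str]:
    """Return (sentence_id, sentence_type) for a path such as '.../SX37.WAV'.

    Hypothetical helper: assumes the file stem is a category prefix
    (SA, SX, or SI) followed by digits.
    """
    stem = os.path.splitext(os.path.basename(path))[0].upper()
    match = re.fullmatch(r"(SA|SX|SI)(\d+)", stem)
    if match is None:
        raise ValueError(f"Unrecognized TIMIT sentence file name: {path!r}")
    return stem, match.group(1)


sentence_id, sentence_type = sentence_type_from_path("TRAIN/DR1/FCJF0/SX37.WAV")
print(sentence_id, sentence_type, SENTENCE_TYPES[sentence_type])
```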

In [None]:
from datasets import Dataset, Audio, DatasetDict, load_dataset
import os

In [None]:
from timit_metadata_extractors import get_transcription_detail, get_speech_duration, get_replace_ending, get_timit_path, get_speaker_info, get_sentence_info, get_text, get_ipa_transcription
from timit_dataset_splitter import stratify_timt_dataset

In [None]:
PATH_TO_TIMIT = "../data/input_data/TIMIT-Database/TIMIT"
UPLOAD_TIMIT_BASE_NAME = "kylelovesllms/timit_asr"
UPLOAD_TIMIT_IPA_NAME = "kylelovesllms/timit_asr_ipa"

In [None]:
def get_audio_files(dir: str, file_type: str = "wav") -> list[str]:
    """
    Walks through every directory under `dir` and returns the paths of all files
    whose names end in `file_type`
    """

    audio_paths = []

    # Walk through each directory
    for dirpath, _dirnames, filenames in os.walk(dir):
        # Check each file in the directory
        for file_name in filenames:
            # If the `file_type` matches
            if file_name.endswith(file_type):
                # Add that file
                full_local_path = os.path.join(dirpath, file_name)
                audio_paths.append(full_local_path)
    return audio_paths

In [None]:
# Quick sanity check that we get the appropriate local paths
paths = get_audio_files(os.path.join(PATH_TO_TIMIT, "TEST"))
print("local paths", paths[:3])

In [None]:
train_path = os.path.join(PATH_TO_TIMIT, "TRAIN")
testvalidation_path = os.path.join(PATH_TO_TIMIT, "TEST")

In [None]:
# Build each dataset from the list of local paths, then decode them as audio
train_dataset = Dataset.from_dict(
    {"audio": get_audio_files(train_path)}
).cast_column("audio", Audio())

testvalidation_dataset = Dataset.from_dict(
    {"audio": get_audio_files(testvalidation_path)}
).cast_column("audio", Audio())

In [None]:
train_dataset

In [None]:
# Quick sanity check to make sure we have valid audio files
train_dataset[0]["audio"]

## Step 2) Add Metadata
Not all of this metadata is needed for this specific project, but others in the open-source community may find it helpful.
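
The extractor functions imported from `timit_metadata_extractors` are not shown in this notebook. As a minimal sketch of what speaker metadata and duration extraction might involve, the example below assumes the standard TIMIT layout `<SPLIT>/<DIALECT_REGION>/<SPEAKER_DIR>/<SENTENCE>.WAV`, where the speaker directory starts with `F` or `M`. The function names `parse_speaker_info` and `duration_seconds` are hypothetical and may not match the real helpers.

```python
from pathlib import PurePosixPath

# Dialect region names as listed in the TIMIT documentation.
DIALECT_REGIONS = {
    "DR1": "New England",
    "DR2": "Northern",
    "DR3": "North Midland",
    "DR4": "South Midland",
    "DR5": "Southern",
    "DR6": "New York City",
    "DR7": "Western",
    "DR8": "Army Brat",
}


def parse_speaker_info(timit_path: str) -> dict[str, str]:
    """Parse a relative path such as 'TRAIN/DR1/FCJF0/SA1.WAV' (hypothetical helper)."""
    parts = PurePosixPath(timit_path.replace("\\", "/")).parts
    dialect_region = parts[-3].upper()
    speaker_dir = parts[-2].upper()
    return {
        "dialect_region": dialect_region,
        "dialect_region_name": DIALECT_REGIONS[dialect_region],
        "speaker_id": speaker_dir,
        # The first character of the speaker directory encodes sex: 'F' or 'M'.
        "speaker_sex": speaker_dir[0],
    }


def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Duration of a clip in seconds: number of samples divided by samples per second."""
    return num_samples / sampling_rate


print(parse_speaker_info("TRAIN/DR1/FCJF0/SA1.WAV"))
print(duration_seconds(48_000, 16_000))
```

Deriving everything from the relative path keeps the metadata independent of where the corpus is stored locally.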

In [None]:
def add_metadata(example):
    """
    Adds transcriptions and metadata to the TIMIT dataset

    Note: after using `example["audio"]`, the `example["audio"]["path"]` is 
    automatically set to None for security reasons
    (see Github Issue: https://github.com/huggingface/datasets/issues/5190)
    """
    wav_path = example["audio"]["path"]

    # Add Transcriptions
    phn_file = get_replace_ending(wav_path, new_extension=".phn")
    wrd_file = get_replace_ending(wav_path, new_extension=".wrd")
    text_file = get_replace_ending(wav_path, new_extension=".txt")

    example["phonetic_detail"] = get_transcription_detail(phn_file)
    example["word_detail"] = get_transcription_detail(wrd_file)
    example["text"] = get_text(text_file)

    # Speech Duration
    example["duration"] = get_speech_duration(
        example["audio"]["array"], sr=example["audio"]["sampling_rate"])

    # TIMIT Path
    example["timit_path"] = get_timit_path(
        abs_path=wav_path, base_path=PATH_TO_TIMIT)

    # Speaker Metadata
    (dialect_region, dialect_region_name), (speaker_id, sex) = get_speaker_info(
        example["timit_path"])
    example["dialect_region"] = dialect_region
    example["dialect_region_name"] = dialect_region_name
    example["speaker_id"] = speaker_id
    example["speaker_sex"] = sex

    # Sentence Metadata
    sentence_id, sentence_type = get_sentence_info(wav_path)
    example["id"] = sentence_id
    example["sentence_type"] = sentence_type
    return example

In [None]:
train_dataset_with_metadata = train_dataset.map(add_metadata)
testvalidation_dataset_with_metadata = testvalidation_dataset.map(add_metadata)

## Step 3) Upload Dataset to HuggingFace Hub
- Note: we will upload the `testvalidation_dataset_with_metadata` as `test` in the TIMIT tradition
- We will also create a train/validation/test split with additional extracted features, but upload it to a separate repository so that this one stays consistent with the original TIMIT dataset

In [None]:
timit_asr_dataset_base = DatasetDict({
    "train": train_dataset_with_metadata,
    "test": testvalidation_dataset_with_metadata
})

In [None]:
timit_asr_dataset_base

### Upload and Verify

In [None]:
timit_asr_dataset_base.push_to_hub(UPLOAD_TIMIT_BASE_NAME)

In [None]:
timit_asr_dataset_base_from_hub = load_dataset(UPLOAD_TIMIT_BASE_NAME)

In [None]:
# Verify that the repository is pulled down correctly
print(timit_asr_dataset_base_from_hub)

## Step 4) Adding Phonetic Transcriptions
- Although TIMIT provides a `phonetic_detail` transcription, whose symbols are documented in `TIMIT/DOC/PHONCODE.DOC`, some downstream use cases require transcriptions in the International Phonetic Alphabet (IPA)
- IPA is used across almost every tradition in phonetics
- Note: the transcription scheme used in the repository below is not peer reviewed; it is a broad transcription intended for evaluating a fine-tuned Wav2Vec2 model against Wav2Vec2-XLSR
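
To give a sense of what such a conversion involves, here is a minimal sketch that maps a handful of TIMIT phone codes to IPA symbols and drops non-speech markers. The mapping is deliberately partial and illustrative; it is not the scheme implemented by `get_ipa_transcription`, and the names `TIMIT_TO_IPA`, `NON_SPEECH`, and `to_ipa` are hypothetical.

```python
# Partial, illustrative mapping from TIMIT phone codes to IPA (an assumption,
# not the author's full transcription scheme).
TIMIT_TO_IPA = {
    # Vowels
    "iy": "i", "ih": "ɪ", "eh": "ɛ", "ae": "æ", "aa": "ɑ",
    "ah": "ʌ", "uw": "u", "uh": "ʊ", "ax": "ə",
    # Consonants whose code differs from the IPA symbol
    "sh": "ʃ", "zh": "ʒ", "th": "θ", "dh": "ð",
    "ng": "ŋ", "ch": "tʃ", "jh": "dʒ", "y": "j",
}

# Markers for silence and pauses carry no phonetic content in a broad transcription.
NON_SPEECH = {"h#", "pau", "epi"}


def to_ipa(phones: list[str]) -> str:
    """Convert a sequence of TIMIT phone codes into a broad IPA string."""
    symbols = []
    for phone in phones:
        if phone in NON_SPEECH:
            continue
        # Codes missing from the partial table are passed through unchanged.
        symbols.append(TIMIT_TO_IPA.get(phone, phone))
    return "".join(symbols)


print(to_ipa(["h#", "sh", "iy", "h#"]))
```

A full scheme also has to decide how to treat stop closures and allophonic variants, which is where broad transcriptions tend to differ from one another.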

In [None]:
def add_ipa_transcription(example):
    """
    Adds the metadata plus an IPA rendering of `phonetic_detail`, stored in the `ipa_transcription` property
    """
    example_with_metadata = add_metadata(example)
    example_with_metadata["ipa_transcription"] = get_ipa_transcription(example_with_metadata["phonetic_detail"])
    return example_with_metadata

In [None]:
train_dataset_with_ipa = train_dataset.map(add_ipa_transcription)
testvalidation_dataset_with_ipa = testvalidation_dataset.map(add_ipa_transcription)

In [None]:
# Sanity check the transcriptions
print(train_dataset_with_ipa[0]["text"])
print([seg["utterance"] for seg in train_dataset_with_ipa[0]["phonetic_detail"]])
print(train_dataset_with_ipa[0]["ipa_transcription"])

In [None]:
# Sanity check the transcriptions
print(testvalidation_dataset_with_ipa[0]["text"])
print([seg["utterance"] for seg in testvalidation_dataset_with_ipa[0]["phonetic_detail"]])
print(testvalidation_dataset_with_ipa[0]["ipa_transcription"])

## Step 5) Add Validation Split
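- The original TIMIT dataset has only `TRAIN` and `TEST`
- Thus, we split `TEST` in half to create a `TRAIN` / `VALIDATION` / `TEST` split. This is nominally 80-10-10; with the standard TIMIT counts of 4,620 `TRAIN` and 1,680 `TEST` utterances, the proportions are closer to 73-13-13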

### Stratification
- The TIMIT dataset is carefully split into `TRAIN` and `TEST` sets
- When creating the validation set by splitting `TEST` in half, we stratify, i.e. keep the proportion of speakers in each group the same in both halves, so that hyperparameter tuning and the final test evaluation are both representative
- The parameters that TIMIT stratifies on for `TRAIN` and `TEST` are `sex` and `dialect_region`
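
The implementation of `stratify_timt_dataset` is not shown here. As a sketch of one way to stratify, the example below groups speakers by (`speaker_sex`, `dialect_region`), shuffles each group with a fixed seed, and sends half of each group's speakers to validation and the rest to test. Splitting at the speaker level, so that no speaker appears in both halves, is our assumption and may differ from the actual splitter; `stratified_speaker_split` is a hypothetical name.

```python
import random
from collections import defaultdict


def stratified_speaker_split(rows, strata_keys=("speaker_sex", "dialect_region"), seed=0):
    """Split row indices into two halves, balanced per stratum at the speaker level."""
    # Collect the distinct speakers that belong to each stratum.
    speakers_by_stratum = defaultdict(set)
    for row in rows:
        stratum = tuple(row[key] for key in strata_keys)
        speakers_by_stratum[stratum].add(row["speaker_id"])

    # Within each stratum, assign half of the speakers to validation.
    rng = random.Random(seed)
    validation_speakers = set()
    for stratum in sorted(speakers_by_stratum):
        speakers = sorted(speakers_by_stratum[stratum])
        rng.shuffle(speakers)
        validation_speakers.update(speakers[: len(speakers) // 2])

    validation_idx = [i for i, r in enumerate(rows) if r["speaker_id"] in validation_speakers]
    test_idx = [i for i, r in enumerate(rows) if r["speaker_id"] not in validation_speakers]
    return validation_idx, test_idx


# Toy data: 2 sexes x 2 regions x 4 speakers x 2 utterances each.
rows = [
    {"speaker_id": f"{sex}{region}{n}", "speaker_sex": sex, "dialect_region": region}
    for sex in "FM"
    for region in ("DR1", "DR2")
    for n in range(4)
    for _ in range(2)
]
validation_idx, test_idx = stratified_speaker_split(rows)
print(len(validation_idx), len(test_idx))
```

With a Hugging Face `Dataset`, the returned index lists could be passed to `Dataset.select` to materialize the two splits.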

In [None]:
validation_dataset, test_dataset = stratify_timt_dataset(
    testvalidation_dataset_with_ipa
)

In [None]:
# Sanity check that the dialect regions are balanced in validation set
validation_dataset.to_pandas()["dialect_region_name"].value_counts()

In [None]:
# Sanity check that the dialect regions are balanced in test set
test_dataset.to_pandas()["dialect_region_name"].value_counts()

## Step 6) Upload IPA and Validation/Test Split to HF Hub

In [None]:
timit_ipa_dataset = DatasetDict({
    "train": train_dataset_with_ipa,
    "validation": validation_dataset,
    "test": test_dataset
})

In [None]:
timit_ipa_dataset

In [None]:
timit_ipa_dataset["train"].features

In [None]:
timit_ipa_dataset["test"].features

In [None]:
timit_ipa_dataset.push_to_hub(UPLOAD_TIMIT_IPA_NAME)

In [None]:
# Verify that upload succeeded
timit_ipa_from_hub = load_dataset(UPLOAD_TIMIT_IPA_NAME)

In [None]:
timit_ipa_from_hub