11/25 (Tue)

---

# Splitting the Target Audio Segment from 1st Read-Aloud Recordings

This notebook demonstrates how to extract the target audio segment from the first read-aloud recordings using Python's `pydub` library.
The target segment is the following sentence (cf. Udofot, 2003):
> "Mosquito went away humiliated"

Note that the timestamps for the target segment were initially annotated by forced alignment and then manually corrected.

The following code cell imports the necessary libraries and the project's data directories, and defines the words that make up the target segment.

In [1]:
from pathlib import Path

from copy import deepcopy

from pydub import AudioSegment
from textgrids import TextGrid, Tier

from l2speech_ree_group_proj import PROCESSED_DATA_DIR, EXTERNAL_DATA_DIR

TARGET_SEGMENT_WORDS = ["mosquito", "went", "away", "humiliated"]



The following code cell collects the audio files that have a corresponding manually corrected annotation file (`*_manual.TextGrid`). Recordings without one are skipped.

In [2]:
target_wav_path_list = []
target_textgrid_path_list = []

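# Sort so that the file order is reproducible across runs and platforms.
for wav_path in sorted(PROCESSED_DATA_DIR.glob("*.wav")):
    textgrid_path = EXTERNAL_DATA_DIR / f"{wav_path.stem}_manual.TextGrid"

    # Skip recordings that have no manually corrected annotation.
    if not textgrid_path.exists():
        continue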

    target_wav_path_list.append(wav_path)
    target_textgrid_path_list.append(textgrid_path)
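
The pairing rule can be tried out without the project data. The following is a minimal sketch, assuming the same `<stem>_manual.TextGrid` naming convention; the function name `pair_wavs_with_textgrids` and the file names `s01.wav`, `s02.wav`, and `s03.wav` are hypothetical and exist only for this illustration.

```python
import tempfile
from pathlib import Path


def pair_wavs_with_textgrids(wav_dir, textgrid_dir, suffix="_manual"):
    """Return (wav, TextGrid) path pairs, skipping recordings with no annotation."""
    pairs = []
    # sorted() keeps the order stable regardless of the file system.
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        textgrid_path = Path(textgrid_dir) / f"{wav_path.stem}{suffix}.TextGrid"
        if textgrid_path.exists():
            pairs.append((wav_path, textgrid_path))
    return pairs


# Demo on throwaway directories with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    wav_dir = Path(tmp) / "wav"
    textgrid_dir = Path(tmp) / "textgrid"
    wav_dir.mkdir()
    textgrid_dir.mkdir()

    for name in ("s02.wav", "s01.wav", "s03.wav"):
        (wav_dir / name).touch()
    # s02 has no annotation, so it should be dropped.
    for name in ("s01_manual.TextGrid", "s03_manual.TextGrid"):
        (textgrid_dir / name).touch()

    paired_names = [
        (wav.name, textgrid.name)
        for wav, textgrid in pair_wavs_with_textgrids(wav_dir, textgrid_dir)
    ]

print(paired_names)
```

Returning the pairs as one list, rather than two parallel lists, is a design choice of this sketch: it removes the need to check later that both lists have the same length.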

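The following code cell identifies the start and end times of the target segment in each annotation file by matching the target words, in order, on the `words` tier. Empty (silent) intervals between words are ignored.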

In [3]:
timestamp_list = []

for textgrid_path in target_textgrid_path_list:
    textgrid = TextGrid(filename=str(textgrid_path))

    tier = textgrid["words"]

    start_time = 0.0
    end_time = 0.0
    current_pointer = 0

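    for interval in tier:
        # Empty intervals are pauses; skip them without resetting the match.
        if interval.text == "":
            continue

        if interval.text != TARGET_SEGMENT_WORDS[current_pointer]:
            # Mismatch: restart the match. The current interval may itself be
            # the first target word (e.g. a repeated "mosquito"), so re-check it.
            # This simple restart suffices because the target words are all distinct.
            current_pointer = 0
            if interval.text != TARGET_SEGMENT_WORDS[0]:
                continue

        if current_pointer == 0:
            start_time = interval.xmin
        current_pointer += 1

        if current_pointer == len(TARGET_SEGMENT_WORDS):
            end_time = interval.xmax
            break
    else:
        # The loop ended without a full match; fail loudly instead of
        # recording a bogus timestamp.
        raise ValueError(f"Target segment not found in {textgrid_path}.")

    timestamp_list.append((start_time, end_time))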
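
The matching step can also be written as a small standalone function and checked on toy data. The sketch below uses a sliding window over the non-empty intervals, which is a different technique from the pointer loop in the cell above. `Interval` and `find_phrase_span` are hypothetical names, and a plain named tuple stands in for the interval objects of the `textgrids` package.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Interval(NamedTuple):
    """Hypothetical stand-in for one interval on the `words` tier."""
    text: str
    xmin: float
    xmax: float


def find_phrase_span(
    intervals: Sequence[Interval], words: Sequence[str]
) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the first run of `words`, or None if absent."""
    # Drop pauses so that silence between words does not break the match.
    labelled = [iv for iv in intervals if iv.text != ""]
    n = len(words)
    for i in range(len(labelled) - n + 1):
        window = labelled[i:i + n]
        if [iv.text for iv in window] == list(words):
            return window[0].xmin, window[-1].xmax
    return None


# Toy tier with a repeated first word (a false start) and a pause mid-phrase.
words_tier = [
    Interval("", 0.0, 0.4),
    Interval("mosquito", 0.4, 1.0),
    Interval("mosquito", 1.0, 1.8),
    Interval("", 1.8, 1.9),
    Interval("went", 1.9, 2.2),
    Interval("away", 2.2, 2.6),
    Interval("humiliated", 2.6, 3.5),
    Interval("", 3.5, 4.0),
]

span = find_phrase_span(words_tier, ["mosquito", "went", "away", "humiliated"])
missing = find_phrase_span(words_tier, ["went", "home"])
print(span)     # (1.0, 3.5)
print(missing)  # None
```

Returning `None` when the phrase is absent makes a failed match explicit, so the caller can decide whether to skip the file or stop.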

The following code cell extracts the target audio segment from each original recording based on the identified timestamps.

In [4]:
if len(timestamp_list) != len(target_wav_path_list):
    raise ValueError("Mismatch between number of timestamps and wav files.")

# Create the output directory once, before the loop.
output_dir = PROCESSED_DATA_DIR / "mosquito_away_segments"
output_dir.mkdir(exist_ok=True, parents=True)

for wav_path, timestamp in zip(target_wav_path_list, timestamp_list):
    if not wav_path.exists():
        raise FileNotFoundError(f"{wav_path} does not exist.")

    audio = AudioSegment.from_wav(wav_path)
    # pydub slices in milliseconds; round rather than truncate so that
    # floating-point error does not drop a millisecond.
    start_ms = round(timestamp[0] * 1000)
    end_ms = round(timestamp[1] * 1000)

    segment = audio[start_ms:end_ms]

    output_wav_path = output_dir / wav_path.name
    segment.export(output_wav_path, format="wav")
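
For readers who want to see what time-based slicing amounts to at the sample level, here is a minimal sketch that uses only the standard-library `wave` module instead of `pydub`. It is not the method used in this notebook. The helper names `make_silence` and `slice_wav_bytes` are hypothetical, and the demo runs on synthetic silence rather than on a real recording.

```python
import io
import wave


def make_silence(duration_s: float, rate: int = 16000) -> bytes:
    """Build a mono 16-bit WAV of silence, returned as bytes (test input only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * round(duration_s * rate))
    return buf.getvalue()


def slice_wav_bytes(wav_bytes: bytes, start_s: float, end_s: float) -> bytes:
    """Return a new WAV (as bytes) holding only the audio in [start_s, end_s)."""
    if not 0.0 <= start_s < end_s:
        raise ValueError(f"Invalid time range: {start_s}-{end_s}")

    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        n_frames = src.getnframes()
        # Convert seconds to frame indices, clamped to the file length.
        start_frame = min(round(start_s * rate), n_frames)
        end_frame = min(round(end_s * rate), n_frames)
        src.setpos(start_frame)
        frames = src.readframes(end_frame - start_frame)

    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        # The frame count in the header is corrected when the writer closes.
        dst.setparams(params)
        dst.writeframes(frames)
    return out.getvalue()


clip = slice_wav_bytes(make_silence(1.0), 0.25, 0.75)
with wave.open(io.BytesIO(clip), "rb") as r:
    clip_frames = r.getnframes()
    clip_rate = r.getframerate()

print(clip_frames / clip_rate)  # 0.5
```

The range check at the top rejects an empty or reversed time range, which would otherwise produce a zero-length output file.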

The following code cell crops each TextGrid file to the target segment so that it lines up with the extracted audio. Interval times are shifted so that the segment starts at 0.
It also adds two empty tiers, `syllables` and `stress`, to each cropped TextGrid file for further analysis.



In [6]:
if len(timestamp_list) != len(target_textgrid_path_list):
    raise ValueError("Mismatch between number of timestamps and TextGrid files.")

# Create the output directory once, before the loop.
output_dir = PROCESSED_DATA_DIR / "mosquito_away_segments"
output_dir.mkdir(exist_ok=True, parents=True)

for textgrid_path, timestamp in zip(target_textgrid_path_list, timestamp_list):
    if not textgrid_path.exists():
        raise FileNotFoundError(f"{textgrid_path} does not exist.")

    textgrid = TextGrid(filename=str(textgrid_path))

    tier = textgrid["words"]

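    segmented_intervals = []
    for interval in tier:
        # Skip intervals that end before the target segment starts.
        if interval.xmax < timestamp[0]:
            continue
        # Intervals are in time order, so stop once past the segment end.
        if interval.xmin > timestamp[1]:
            break

        # Clip the interval to the segment, then shift so the segment starts at 0.
        new_xmin = max(interval.xmin, timestamp[0]) - timestamp[0]
        new_xmax = min(interval.xmax, timestamp[1]) - timestamp[0]

        new_interval = deepcopy(interval)
        new_interval.xmin = new_xmin
        new_interval.xmax = new_xmax

        # Drop zero-length intervals that only touch a segment boundary.
        if new_interval.xmin == new_interval.xmax:
            continue

        segmented_intervals.append(new_interval)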

    # Length of the cropped segment in seconds.
    duration = timestamp[1] - timestamp[0]

    new_tier = Tier(data=segmented_intervals, xmin=0.0, xmax=duration)

    new_textgrid = TextGrid()
    new_textgrid.xmin = 0.0
    new_textgrid.xmax = duration

    # Empty placeholder tiers for further analysis.
    syllable_tier = Tier(data=[], xmin=0.0, xmax=duration)
    new_textgrid["syllables"] = syllable_tier

    stress_tier = Tier(data=[], xmin=0.0, xmax=duration)
    new_textgrid["stress"] = stress_tier

    new_textgrid["words"] = new_tier

    output_textgrid_path = output_dir / textgrid_path.name
    new_textgrid.write(str(output_textgrid_path))
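
The clip-and-shift arithmetic used for the `words` tier can be checked in isolation. The following is a minimal sketch under the assumption that an interval is just a label with a start and an end time; `Interval` and `crop_intervals` are hypothetical names and do not come from the `textgrids` package.

```python
from typing import List, NamedTuple, Sequence


class Interval(NamedTuple):
    """Hypothetical stand-in for one interval on the `words` tier."""
    text: str
    xmin: float
    xmax: float


def crop_intervals(
    intervals: Sequence[Interval], start: float, end: float
) -> List[Interval]:
    """Clip intervals to [start, end] and shift them so that `start` becomes 0."""
    cropped = []
    for iv in intervals:
        xmin = max(iv.xmin, start)
        xmax = min(iv.xmax, end)
        # No overlap with the segment, or only a zero-length touch at a boundary.
        if xmin >= xmax:
            continue
        cropped.append(Interval(iv.text, xmin - start, xmax - start))
    return cropped


tier = [
    Interval("the", 0.0, 1.0),       # ends exactly where the segment starts
    Interval("mosquito", 1.0, 2.0),
    Interval("went", 2.0, 2.5),
    Interval("", 2.5, 4.0),          # starts exactly where the segment ends
]

cropped = crop_intervals(tier, 1.0, 2.5)
print(cropped)
```

With this toy tier, only `mosquito` and `went` survive, shifted to 0.0-1.0 and 1.0-1.5; the two neighbours that merely touch the segment boundaries are dropped.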