11/25 (Tue)

---

# Forced Alignment: 1st Read-Aloud Speech & Text

This notebook demonstrates forced alignment of read-aloud speech recordings with their text transcriptions, using a `wav2vec 2.0` model trained with `CTC` (Connectionist Temporal Classification) loss.

The following code cell imports the required libraries, the project's processed-data directory (`PROCESSED_DATA_DIR`), and the `wav2vec_fa` alignment helper.

In [1]:
from pathlib import Path

import pandas as pd
from pydub import AudioSegment
from textgrids import Interval, TextGrid, Tier

from l2speech_ree_group_proj import PROCESSED_DATA_DIR
from l2speech_ree_group_proj.wav2vec_fa import wav2vec_fa




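The following code cell preprocesses the transcript of the read-aloud passage: it replaces punctuation with spaces, collapses repeated spaces, and trims the ends. Apostrophes are kept, so `one's` stays a single word.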

In [2]:
target_paragraph = """He stretched himself and scratched his thigh where a mosquito had bitten him while he slept. Another one was wailing near his ear. He slapped the ear and hoped he had killed it. "Why do they always go for one's ears?" When he was a child, his mother had told him a story about it. Mosquito, she had said, had asked Ear to marry him, whereupon Ear fell on the floor in uncontrollable laughter. "How much longer do you think you will live?" she asked. "You are already a skeleton!" Mosquito went away humiliated; and anytime he passed her way, he told Ear that he was still alive."""

punctuations = [".", ",", "!", "?", ";", "\""]

for punct in punctuations:
    target_paragraph = target_paragraph.replace(punct, " ")

while "  " in target_paragraph:
    target_paragraph = target_paragraph.replace("  ", " ")

target_paragraph = target_paragraph.strip()

print(target_paragraph)

He stretched himself and scratched his thigh where a mosquito had bitten him while he slept Another one was wailing near his ear He slapped the ear and hoped he had killed it Why do they always go for one's ears When he was a child his mother had told him a story about it Mosquito she had said had asked Ear to marry him whereupon Ear fell on the floor in uncontrollable laughter How much longer do you think you will live she asked You are already a skeleton Mosquito went away humiliated and anytime he passed her way he told Ear that he was still alive
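
The same normalization can be wrapped in a small reusable function. The sketch below is one possible equivalent, assuming the same punctuation set as the cell above; the name `normalize_transcript` is hypothetical and not part of the project.

```python
import re


def normalize_transcript(text: str, punctuation: str = '.,!?;"') -> str:
    """Replace the given punctuation with spaces and collapse whitespace."""
    # Build a character class such as [\.,!\?;"] from the punctuation string.
    pattern = "[" + re.escape(punctuation) + "]"
    text = re.sub(pattern, " ", text)
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return " ".join(text.split())


sample = '"Why do they always go for one\'s ears?" When he was a child, his mother'
print(normalize_transcript(sample))
```

Unlike the loop above, `split()` also collapses tabs and newlines, which may or may not be wanted for other transcripts.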


The following code cell writes the preprocessed transcript to one `.txt` file per recording, named after the matching `.wav` file. Every recording gets the same text.

In [3]:
for wav_path in PROCESSED_DATA_DIR.glob("*.wav"):
    txt_path = PROCESSED_DATA_DIR / f"{wav_path.stem}.txt"

    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(target_paragraph)

The following code cell runs forced alignment on each recording and its transcript. With `skip_aligned=True`, recordings that were already aligned are skipped, as the output below shows.

In [4]:
wav2vec_fa(
    input_dir=PROCESSED_DATA_DIR,
    output_dir=PROCESSED_DATA_DIR,
    skip_aligned=True
)

Skip aligned files!

Loading wav2vec... DONE!


Skip 248922-1-14564746-task-v7ri-50724318-readaloud1-7-1.wav: 100%|██████████| 2/2 [00:00<00:00, 1623.50it/s]
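
The internals of `wav2vec_fa` are not shown in this notebook. As a rough illustration of the idea behind CTC forced alignment, the sketch below runs a Viterbi search over per-frame log-probabilities and returns the frame span of each target token. Everything here is an assumption for illustration: the function name `ctc_forced_align`, the toy emission matrix, and the blank index `0` are hypothetical, and the project's helper may work differently.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Return (token, start_frame, end_frame) spans for the best CTC path.

    log_probs: list of T rows, each a list of per-class log-probabilities.
    tokens: target token ids, without blanks.
    """
    # Expanded state sequence: blank, tok1, blank, tok2, ..., blank.
    states = [blank]
    for tok in tokens:
        states += [tok, blank]
    n_frames, n_states = len(log_probs), len(states)
    neg_inf = -math.inf

    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    # A path may start in the first blank or in the first token.
    score[0][0] = log_probs[0][states[0]]
    if n_states > 1:
        score[0][1] = log_probs[0][states[1]]

    for t in range(1, n_frames):
        for s in range(n_states):
            # Allowed moves: stay, advance by one, or skip a blank
            # between two different tokens.
            candidates = [(score[t - 1][s], s)]
            if s >= 1:
                candidates.append((score[t - 1][s - 1], s - 1))
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                candidates.append((score[t - 1][s - 2], s - 2))
            best, prev = max(candidates)
            if best > neg_inf:
                score[t][s] = best + log_probs[t][states[s]]
                back[t][s] = prev

    # A valid path ends in the final blank or the final token.
    s = n_states - 1
    if n_states > 1 and score[-1][n_states - 2] > score[-1][s]:
        s = n_states - 2
    if score[-1][s] == neg_inf:
        raise ValueError("too few frames to align the token sequence")

    # Backtrack to recover the state occupied at each frame.
    path = [s]
    for t in range(n_frames - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()

    # Odd state indices are tokens; merge consecutive frames per token.
    spans = []
    for t, s in enumerate(path):
        if s % 2 == 1:
            idx = s // 2
            if spans and spans[-1][0] == idx:
                spans[-1][2] = t + 1
            else:
                spans.append([idx, t, t + 1])
    return [(tokens[i], start, end) for i, start, end in spans]


# Toy example: classes are 0 = blank, 1 = "a", 2 = "b"; five frames.
hi, lo = math.log(0.8), math.log(0.1)
emission = [
    [hi, lo, lo],  # frame 0: blank
    [lo, hi, lo],  # frame 1: "a"
    [lo, hi, lo],  # frame 2: "a"
    [hi, lo, lo],  # frame 3: blank
    [lo, lo, hi],  # frame 4: "b"
]
print(ctc_forced_align(emission, [1, 2]))  # [(1, 1, 3), (2, 4, 5)]
```

Frame indices convert to seconds by multiplying by the model's frame stride. For wav2vec 2.0 models on 16 kHz audio this is typically about 20 ms per frame, but the value should be checked against the model actually used.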


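The following code cell converts the forced alignment results (CSV files) into TextGrid format for further analysis. Gaps between aligned words, and any leading or trailing silence, become empty intervals so that the `words` tier covers the whole recording.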

In [5]:
for fa_csv_path in PROCESSED_DATA_DIR.glob("*.csv"):
    textgrid_path = PROCESSED_DATA_DIR / f"{fa_csv_path.stem}.TextGrid"
    # Skip recordings that already have a TextGrid.
    if textgrid_path.exists():
        continue

    wav_path = PROCESSED_DATA_DIR / f"{fa_csv_path.stem}.wav"

    df = pd.read_csv(fa_csv_path)
    textgrid = TextGrid(xmin=0.0)
    audio_segment = AudioSegment.from_wav(wav_path)

    # len() of a pydub AudioSegment is its duration in milliseconds.
    duration_in_sec = len(audio_segment) / 1000.0
    textgrid.xmax = duration_in_sec

    intervals = []
    prev_xmax = 0.0
    for _, row in df.iterrows():
        start_time = row["start_time"]
        end_time = row["end_time"]
        word = row["word"]

        # Fill any gap before this word with an empty interval.
        if start_time > prev_xmax:
            intervals.append(Interval(text="", xmin=prev_xmax, xmax=start_time))

        intervals.append(Interval(text=word, xmin=start_time, xmax=end_time))

        prev_xmax = end_time

    # Pad the end of the tier up to the full audio duration.
    if prev_xmax < duration_in_sec:
        intervals.append(Interval(text="", xmin=prev_xmax, xmax=duration_in_sec))

    tier = Tier(data=intervals, xmin=0.0, xmax=duration_in_sec)

    textgrid["words"] = tier
    textgrid.write(str(textgrid_path))
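
The gap-filling logic in the cell above can be checked on its own, without audio files or the `textgrids` library. The sketch below mirrors that logic on plain tuples; the function name `fill_gaps` and the sample timings are hypothetical and only serve as an illustration.

```python
def fill_gaps(words, duration):
    """Return contiguous (text, start, end) intervals covering [0, duration].

    words: list of (word, start, end) tuples sorted by start time.
    """
    intervals = []
    prev_end = 0.0
    for word, start, end in words:
        # Insert an empty interval for any silence before this word.
        if start > prev_end:
            intervals.append(("", prev_end, start))
        intervals.append((word, start, end))
        prev_end = end
    # Pad the end up to the full duration.
    if prev_end < duration:
        intervals.append(("", prev_end, duration))
    return intervals


aligned = [("he", 0.5, 0.7), ("slept", 0.7, 1.2)]
for interval in fill_gaps(aligned, duration=2.0):
    print(interval)
```

Like the original cell, this assumes the aligned words are sorted and do not overlap; overlapping rows would produce intervals that run backwards in time and would need separate handling.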