5/11 (Sat) | Preprocess

# Forced Alignment Preparation for Monologue Data (Manual Transcript)

## 1. Introduction

This notebook generates monologue transcripts for wav2vec forced alignment.
Transcript generation consists of the following stages.

1. Load a TextGrid file
2. Extract texts in the TextGrid file
3. Transform the texts
4. Save the result as a .txt file

Before starting the process, the following code block loads the required packages and defines global variables.

In [1]:
from typing import List, Generator
from pathlib import Path
import re

from textgrids import TextGrid

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA"]

PUNCTUATIONS = [".", ",", ":", "?", "!"]

---

## 2. Define Functions

This section defines the functions used for preprocessing.
The following code block defines a generator that yields the TextGrid paths for a given task.

In [2]:
def textgrid_path_generator(task: str) -> Generator[Path, None, None]:
    load_dir = DATA_DIR / f"{task}/01_Manual_TextGrid"

    for textgrid_path in load_dir.glob("*.TextGrid"):
        yield textgrid_path
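
`Path.glob` does not guarantee any particular order, so the files may be visited in a different order on another machine or file system. If a reproducible order matters (for example, when comparing logs between runs), wrapping the glob in `sorted` is one option. The sketch below uses a temporary directory and made-up filenames purely for illustration; `sorted_textgrid_paths` is a hypothetical helper, not part of this notebook.

```python
import tempfile
from pathlib import Path
from typing import List


def sorted_textgrid_paths(load_dir: Path) -> List[Path]:
    # Hypothetical helper: collect *.TextGrid files in a stable, sorted order
    return sorted(load_dir.glob("*.TextGrid"))


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Made-up filenames, created only for this demonstration
    for name in ["1002_demo.TextGrid", "1001_demo.TextGrid", "notes.txt"]:
        (tmp_dir / name).touch()

    found_names = [p.name for p in sorted_textgrid_paths(tmp_dir)]

print(found_names)
```

Files without the `.TextGrid` suffix, such as `notes.txt` here, are skipped by the glob pattern.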

The following code block defines a function to extract texts from a TextGrid file.

In [3]:
def extract_texts_from_textgrid(textgrid_path: Path) -> List[str]:
    textgrid = TextGrid(str(textgrid_path))
    transcript_tier = textgrid["Transcript"]
    
    texts = []
    for interval in transcript_tier:
        text = interval.text
        texts.append(text)

    return texts

The following code block defines a function to transform the extracted texts.

In [4]:
def transform_texts(texts: List[str]) -> str:
    # 1. join the list of texts into a single string
    fa_transcript = " ".join(texts)

    # 2. remove disfluency tags
    fa_transcript = fa_transcript.replace("{", " ")
    fa_transcript = fa_transcript.replace("}", " ")

    # 3. remove punctuation, drop hyphens, and normalize "é" to "e"
    for punct in PUNCTUATIONS:
        fa_transcript = fa_transcript.replace(punct, " ")
    fa_transcript = fa_transcript.replace("-", "")
    fa_transcript = fa_transcript.replace("é", "e")

    # 4. replace tokens containing digits (unsupported by FA) with star tokens
    number_pattern = r"\b\w*\d\w*\b"
    fa_transcript = re.sub(number_pattern, "*", fa_transcript)

    # 5. lower transcript
    fa_transcript = fa_transcript.lower()

    # 6. collapse repeated spaces and trim leading/trailing spaces
    while "  " in fa_transcript:
        fa_transcript = fa_transcript.replace("  ", " ")

    # strip(" ") is safe on an empty transcript, unlike indexing [0] and [-1]
    fa_transcript = fa_transcript.strip(" ")

    return fa_transcript
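
To make step 4 concrete, the short sketch below applies the same regular expression to a made-up sentence. Any whitespace-delimited token that contains at least one digit is replaced as a whole by a single star token, including mixed tokens such as `b2b`. The sample sentence is an illustration only and does not come from the dataset.

```python
import re

# Same pattern as step 4 of transform_texts
number_pattern = r"\b\w*\d\w*\b"

# Made-up sentence for illustration only
sample = "i have 2 cats and a b2b plan in 1990"
masked = re.sub(number_pattern, "*", sample)

print(masked)  # i have * cats and a * plan in *
```

Because punctuation is removed in step 3 before this substitution, a decimal such as `3.5` has already become `3 5` and therefore yields two star tokens rather than one.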

The following code block defines a function that builds the output filename from the participant ID (the first four characters of the TextGrid filename) and the task name.

In [5]:
def get_filename(textgrid_path: Path, task: str) -> str:
    participant_id = textgrid_path.stem[:4]
    filename = f"{participant_id}_{task}"

    return filename
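
As a quick illustration of the naming rule, the sketch below applies the same slicing to a hypothetical path. The filename `1001_Arg_Oly_manual.TextGrid` is an assumption made for this example; the actual naming scheme of the TextGrid files is not shown in this notebook.

```python
from pathlib import Path

# Hypothetical TextGrid path, used only for this example
textgrid_path = Path("1001_Arg_Oly_manual.TextGrid")
task = "Arg_Oly"

participant_id = textgrid_path.stem[:4]  # first four characters of the stem
filename = f"{participant_id}_{task}"

print(filename)  # 1001_Arg_Oly
```

The slice relies on every participant ID being exactly four characters long; a longer or shorter ID would be silently truncated or padded with part of the remaining filename.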

The following code block defines a function to save the transcript.

In [6]:
def save_transcript(fa_transcript: str, filename: str, task: str) -> None:
    save_path = DATA_DIR / f"{task}/05_FA_Audio_Transcript_Manu/{filename}.txt"
    
    with open(save_path, "w", encoding="utf-8") as f:
        f.write(fa_transcript)

---

## 3. Conduct Forced Alignment Preparation

The following code block runs the preparation for every task.

In [7]:
for task in TASK:
    for textgrid_path in textgrid_path_generator(task):
        try:
            texts = extract_texts_from_textgrid(textgrid_path)
            fa_transcript = transform_texts(texts)

            filename = get_filename(textgrid_path, task)
            save_transcript(fa_transcript, filename, task)
        except Exception as e:
            # report the failing file together with the reason, then continue
            print(f"{textgrid_path}: {e!r}")
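
After the loop finishes, it can be useful to check that the saved transcripts contain only characters the aligner can handle, since `transform_texts` removes a fixed list of symbols and anything else (parentheses, other accented letters) passes through. The sketch below is one possible check. The supported character set is an assumption (lowercase ASCII letters, apostrophe, space, and the star token) and should be adjusted to the vocabulary of the model actually used; `find_unsupported_chars` is a hypothetical helper.

```python
import string
from typing import Set

# Assumption: the aligner accepts lowercase ASCII letters, apostrophes,
# spaces, and the star token. Adjust to the vocabulary of the model in use.
ASSUMED_SUPPORTED_CHARS = set(string.ascii_lowercase) | {"'", " ", "*"}


def find_unsupported_chars(fa_transcript: str) -> Set[str]:
    # Hypothetical helper: return every character outside the assumed set
    return set(fa_transcript) - ASSUMED_SUPPORTED_CHARS


# Made-up transcripts for illustration only
print(find_unsupported_chars("i have * cats"))
print(sorted(find_unsupported_chars("well (laugh) i think so")))
```

An empty result means the transcript stays within the assumed character set; any returned characters point to symbols that step 2 or step 3 of `transform_texts` may need to cover.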