5/11 (Sat) | Preprocess

# Post-Process for Forced Alignment Outputs of Monologue Data (Manual Transcript)

## 1. Introduction

This notebook fills in the star tokens in the forced alignment outputs for monologue data.
The procedure consists of the following stages.

1. Load a TextGrid file
2. Load a corresponding forced alignment output file
3. Preprocess the TextGrid
4. Fill star tokens (i.e., unsupported characters in wav2vec forced alignment)
5. Save the result

The following code block loads the required packages and defines global variables.

In [1]:
from typing import List, Generator
from pathlib import Path
import re

import numpy as np
import pandas as pd
from textgrids import TextGrid

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA"]

PUNCTUATIONS = [".", ",", ":", "?", "!"]

---

## 2. Define Functions

This section defines the functions used in the post-process.
The following code block defines a generator that yields the TextGrid paths for a given task.

In [2]:
def textgrid_path_generator(task: str) -> Generator[Path, None, None]:
    load_dir = DATA_DIR / f"{task}/01_Manual_TextGrid"

    for textgrid_path in load_dir.glob("*.TextGrid"):
        yield textgrid_path

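The following code block defines a function that builds the output filename from the participant ID (the first four characters of the TextGrid filename) and the task name, dropping the rest of the original filename.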

In [3]:
def get_filename(textgrid_path: Path, task: str) -> str:
    participant_id = textgrid_path.stem[:4]
    filename = f"{participant_id}_{task}"

    return filename
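
As a quick illustration, the sketch below applies the same slicing to a made-up file name. The name `1001_Arg_Oly_manual.TextGrid` is an assumption for this example only; the real files may be named differently, and the only requirement is that the stem starts with a four-character participant ID.

```python
from pathlib import Path


def get_filename(textgrid_path: Path, task: str) -> str:
    # Same logic as the cell above: the first four characters of the stem
    # are treated as the participant ID.
    participant_id = textgrid_path.stem[:4]
    return f"{participant_id}_{task}"


# Hypothetical path, used only to show the effect of the slicing.
example_path = Path("1001_Arg_Oly_manual.TextGrid")
example_filename = get_filename(example_path, "Arg_Oly")
print(example_filename)
```

Note that a participant ID shorter or longer than four characters would be cut off or padded with unrelated characters, so this approach relies on the naming scheme being consistent.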

The following code block defines a function to load a forced alignment output file.

In [4]:
def load_fa_output(filename: str, task: str) -> pd.DataFrame:
    fa_output_path = DATA_DIR / f"{task}/06_FA_csv_Manu/{filename}.csv"
    df_fa = pd.read_csv(fa_output_path)

    return df_fa

The following code block defines a function to extract texts from a TextGrid file.

In [5]:
def extract_texts_from_textgrid(textgrid_path: Path) -> List[str]:
    textgrid = TextGrid(str(textgrid_path))
    transcript_tier = textgrid["Transcript"]
    
    texts = []
    for interval in transcript_tier:
        text = interval.text
        texts.append(text)

    return texts

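The following code block defines a function to transform the extracted texts into an array of words: it joins the intervals, strips the disfluency-tag braces and punctuation, lowercases the text, and splits it on spaces.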

In [6]:
def transform_texts(texts: List[str]) -> np.ndarray:
    # 1. join the list of interval texts into a single string
    fa_transcript = " ".join(texts)

    # 2. strip the braces of disfluency tags (the enclosed text is kept)
    fa_transcript = fa_transcript.replace("{", " ")
    fa_transcript = fa_transcript.replace("}", " ")

    # 3. replace punctuation with spaces, drop hyphens, and map "é" to "e"
    for punct in PUNCTUATIONS:
        fa_transcript = fa_transcript.replace(punct, " ")
    fa_transcript = fa_transcript.replace("-", "")
    fa_transcript = fa_transcript.replace("é", "e")

    # 4. lowercase the transcript
    fa_transcript = fa_transcript.lower()

    # 5. collapse repeated spaces, then trim leading and trailing spaces
    #    (strip also avoids an IndexError when the transcript is empty)
    while "  " in fa_transcript:
        fa_transcript = fa_transcript.replace("  ", " ")

    fa_transcript = fa_transcript.strip(" ")

    return np.array(fa_transcript.split(" "))
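
To make the effect of these steps concrete, here is a minimal, self-contained sketch that applies the same replacements to a made-up transcript. The function name `transform_texts_sketch` and the sample intervals are assumptions for illustration; they are not part of the dataset.

```python
import numpy as np

PUNCTUATIONS = [".", ",", ":", "?", "!"]


def transform_texts_sketch(texts):
    # Join the interval texts; empty intervals simply add extra spaces.
    transcript = " ".join(texts)
    # Replace tag braces and punctuation with spaces.
    for char in ["{", "}"] + PUNCTUATIONS:
        transcript = transcript.replace(char, " ")
    # Drop hyphens, map "é" to "e", and lowercase.
    transcript = transcript.replace("-", "").replace("é", "e").lower()
    # Collapse repeated spaces and trim both ends.
    while "  " in transcript:
        transcript = transcript.replace("  ", " ")
    return np.array(transcript.strip(" ").split(" "))


# Hypothetical intervals: a disfluency tag, an empty interval, a hyphenated word.
sample_texts = ["Well, I think {uh} the café", "", "is well-known."]
sample_words = transform_texts_sketch(sample_texts)
print(sample_words.tolist())
# → ['well', 'i', 'think', 'uh', 'the', 'cafe', 'is', 'wellknown']
```

Because hyphens are deleted rather than replaced with a space, a hyphenated word stays a single token, whereas the other punctuation marks act as word boundaries.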

The following code block defines a function to check whether the number of words in a TextGrid transcript matches the number of rows in the corresponding forced alignment output.

In [7]:
def is_same_length(word_list: np.ndarray, df_fa: pd.DataFrame) -> bool:
    return len(word_list) == len(df_fa)

The following code block defines a function to fill star tokens in a forced alignment output.

In [8]:
def fill_startokens(word_list: np.ndarray, df_fa: pd.DataFrame) -> pd.DataFrame:
    df_fa_filled = df_fa.copy(deep=True)

    mask_star = (df_fa["word"] == "*").to_numpy()

    df_fa_filled.loc[mask_star, "word"] = word_list[mask_star]

    return df_fa_filled
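
The positional masking can be illustrated on a toy example. In the sketch below, the `start` and `end` columns and all values are assumptions made up for this illustration; only the `word` column is known from the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical forced alignment output: "*" stands in for the word "3",
# which contains a character the aligner does not support.
df_fa = pd.DataFrame(
    {
        "word": ["i", "have", "*", "apples"],
        "start": [0.0, 0.4, 0.9, 1.3],
        "end": [0.3, 0.8, 1.2, 1.9],
    }
)
word_list = np.array(["i", "have", "3", "apples"])

# Boolean mask of the rows holding a star token.
mask_star = (df_fa["word"] == "*").to_numpy()

# Copy first so the original output stays untouched, then overwrite only
# the masked rows with the transcript words at the same positions.
df_fa_filled = df_fa.copy(deep=True)
df_fa_filled.loc[mask_star, "word"] = word_list[mask_star]

print(df_fa_filled["word"].tolist())
# → ['i', 'have', '3', 'apples']
```

The same mask indexes both the DataFrame rows and the word array, which is why the two must have equal length and matching word order before filling.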

The following code block defines a function to save a forced alignment output in which star tokens are filled.

In [9]:
def save_df_fa_filled(df_fa_filled: pd.DataFrame, filename: str, task: str) -> None:
    save_path = DATA_DIR / f"{task}/06_FA_csv_Manu/{filename}_filled.csv"

    df_fa_filled.to_csv(save_path, index=False)

---

## 3. Conduct Post-Process

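The following code block runs the post-process for every task. Files whose transcript word count does not match the forced alignment output are skipped and reported.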

In [10]:
for task in TASK:
    for textgrid_path in textgrid_path_generator(task):
        texts = extract_texts_from_textgrid(textgrid_path)
        
        filename = get_filename(textgrid_path, task)
        df_fa = load_fa_output(filename, task)

        word_list = transform_texts(texts)

        if is_same_length(word_list, df_fa):
            df_fa_filled = fill_startokens(word_list, df_fa)
            save_df_fa_filled(df_fa_filled, filename, task)

        else:
            print(f"[Error] {filename}: {len(word_list)} transcript words vs. {len(df_fa)} alignment rows")
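
For files reported in the error branch, it can help to see where the two word sequences start to diverge. The helper below is a hypothetical diagnostic, not part of the original notebook; it assumes the words in the forced alignment output are normalized the same way as the transcript words.

```python
from typing import Optional, Sequence


def first_mismatch(words: Sequence[str], fa_words: Sequence[str]) -> Optional[int]:
    """Return the first index where the two sequences disagree, or None."""
    for i, (word, fa_word) in enumerate(zip(words, fa_words)):
        # A star token can stand for any word, so it never counts as a mismatch.
        if fa_word != "*" and word != fa_word:
            return i
    # No disagreement in the shared prefix: the sequences differ only in length.
    if len(words) != len(fa_words):
        return min(len(words), len(fa_words))
    return None


# Hypothetical example: the transcript has an extra word ("red").
transcript_words = ["i", "have", "3", "red", "apples"]
alignment_words = ["i", "have", "*", "apples"]
mismatch_index = first_mismatch(transcript_words, alignment_words)
print(mismatch_index)
# → 3
```

In the notebook, such a helper could be called with `word_list` and `df_fa["word"].tolist()` to locate the transcript position that needs manual inspection.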