5/10 (Fri) | Preprocess

# Post-Process for Forced Alignment Outputs of Monologue Data (ASR Transcript)

## 1. Introduction

This notebook fills star tokens in forced alignment outputs, focusing on monologue data.
The procedure consists of the following stages.

1. Load a rev-format transcript
2. Load a corresponding forced alignment output file
3. Preprocess the rev-format transcript
4. Fill star tokens (i.e., placeholders for words with characters unsupported in wav2vec forced alignment)
5. Save the result

The following code block loads the required packages and defines global variables.

In [1]:
from typing import List, Generator
from pathlib import Path
import re

import numpy as np
import pandas as pd

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA"]

PUNCTUATIONS = [".", ",", ":", "?", "!"]

---

## 2. Define Functions

This section defines the functions used in the post-process.
The following code block defines a generator that yields rev-format transcript paths.

In [2]:
def rev_transcript_path_generator(task: str) -> Generator[Path, None, None]:
    load_dir = DATA_DIR / f"{task}/02_Rev_Transcript"

    for rev_transcript_path in load_dir.glob("*_long.csv"):
        yield rev_transcript_path

The following code blocks define functions to load a rev-format transcript and the corresponding forced alignment output.

In [3]:
def load_rev_transcript(rev_transcript_path: Path) -> pd.DataFrame:
    df_transcript = pd.read_csv(rev_transcript_path, index_col=0, na_values=["", " "], keep_default_na=False)

    return df_transcript

In [4]:
def load_fa_output(rev_transcript_path: Path, task: str) -> pd.DataFrame:
    filename = rev_transcript_path.stem.removesuffix("_long")

    fa_output_path = DATA_DIR / f"{task}/04_FA_csv_Auto/{filename}.csv"
    df_fa = pd.read_csv(fa_output_path)

    return df_fa
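
The mapping from a transcript path to its forced alignment file can be checked in isolation. The sketch below uses a hypothetical data root (`data`) and a hypothetical file name (`speaker001_long.csv`); only the directory names `02_Rev_Transcript` and `04_FA_csv_Auto` come from this notebook.

```python
from pathlib import Path

# Hypothetical data root and file name, used only for this illustration.
data_dir = Path("data")
task = "Cartoon"

rev_transcript_path = data_dir / task / "02_Rev_Transcript" / "speaker001_long.csv"

# str.removesuffix requires Python 3.9 or later.
filename = rev_transcript_path.stem.removesuffix("_long")
fa_output_path = data_dir / f"{task}/04_FA_csv_Auto/{filename}.csv"

print(filename)                    # speaker001
print(fa_output_path.as_posix())   # data/Cartoon/04_FA_csv_Auto/speaker001.csv
```

Unlike `str.rstrip("_long")`, `removesuffix` removes the exact suffix once rather than stripping any trailing characters from that set.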

The following code block defines functions to transform a rev-format transcript into the normalized word sequence that is compared with the forced alignment output.

In [5]:
def extract_word_list(df_transcript: pd.DataFrame) -> List[str]:
    punct_mask = (df_transcript["type"] == "punct")
    df_transcript_wo_punct = df_transcript[~punct_mask]

    word_list = df_transcript_wo_punct["text"].to_list()

    return word_list

def convert_word_list(word_list: List[str]) -> np.ndarray:
    # 1. build the word list, splitting multi-number entries
    word_list_conv = []
    number_pattern = r"\d"
    for word in word_list:
        if re.match(number_pattern, word) is None:
            word_list_conv.append(word)
            continue
        
        # if the entry starts with a digit, split it on spaces
        numbers = word.split(" ")
        for number in numbers:
            word_list_conv.append(number)
    
    # 2. temporarily replace inaudible tags with a star token
    fa_transcript = " ".join(word_list_conv)
    fa_transcript = fa_transcript.replace("<inaudible>", "*")

    # 3. remove other tags
    tag_pattern = r"\<.*?\>"
    fa_transcript = re.sub(tag_pattern, " ", fa_transcript) 

    # 4. restore inaudible tags from the star token
    fa_transcript = fa_transcript.replace("*", "<inaudible>")

    # 5. lowercase the transcript
    fa_transcript = fa_transcript.lower()

    # 6. remove punctuations
    for punct in PUNCTUATIONS:
        fa_transcript = fa_transcript.replace(punct, " ")

    # 7. remove extra spaces
    # str.split() without arguments collapses whitespace runs, trims both ends,
    # and returns an empty list (instead of raising) for an empty transcript
    return np.array(fa_transcript.split(), dtype=str)
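
To see what the normalization does to individual entries, the steps can be replayed on a small input. The following is a minimal stdlib-only sketch, not the notebook's function: `normalize_words` and the example words are hypothetical, and it returns a plain list rather than a NumPy array.

```python
import re
from typing import List

PUNCTUATIONS = [".", ",", ":", "?", "!"]


def normalize_words(word_list: List[str]) -> List[str]:
    """Stdlib-only sketch of the normalization steps in convert_word_list."""
    text = " ".join(word_list)
    # Protect <inaudible> with a star before stripping every other tag.
    text = text.replace("<inaudible>", "*")
    text = re.sub(r"<.*?>", " ", text)
    text = text.replace("*", "<inaudible>")
    text = text.lower()
    for punct in PUNCTUATIONS:
        text = text.replace(punct, " ")
    # split() with no argument collapses whitespace runs and trims both ends.
    return text.split()


example = ["Hello", "<laugh>", "10 20", "<inaudible>", "U.S.", "okay"]
print(normalize_words(example))
# ['hello', '10', '20', '<inaudible>', 'u', 's', 'okay']
```

Note that the multi-number entry `"10 20"` yields two tokens and `"U.S."` yields `u` and `s`, so such entries change the word count that is later compared against the forced alignment output.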

The following code block defines a function to check whether the number of words in a rev-format transcript matches the number of rows in the corresponding forced alignment output.

In [6]:
def is_same_length(word_list: np.ndarray, df_fa: pd.DataFrame) -> bool:
    return len(word_list) == len(df_fa)
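
When the lengths differ, it helps to know roughly where the two sequences stop agreeing. The sketch below is not part of the original procedure: `first_divergence` is a hypothetical helper, and it assumes the forced alignment words are available as a list of strings in the same normalized form as the transcript words, with `*` allowed to stand for any word.

```python
from typing import List, Optional


def first_divergence(transcript_words: List[str], fa_words: List[str]) -> Optional[int]:
    """Return the first index where the sequences disagree, or None.

    A "*" in fa_words is treated as matching any transcript word.
    """
    for i, (t_word, fa_word) in enumerate(zip(transcript_words, fa_words)):
        if fa_word != "*" and t_word != fa_word:
            return i
    if len(transcript_words) != len(fa_words):
        # One sequence is a prefix of the other; they diverge where the shorter ends.
        return min(len(transcript_words), len(fa_words))
    return None


transcript_words = ["i", "have", "u", "s", "friends"]
fa_words = ["i", "have", "*", "friends"]
print(first_divergence(transcript_words, fa_words))  # 3
```

Because a star matches anything, the reported index can lie slightly after the true point of divergence; it is a starting point for manual inspection rather than an exact diagnosis.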

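The following code block defines a function to fill star tokens in a forced alignment output. Each star token is replaced by the transcript word at the same position, which is why the two inputs must have the same length.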

In [7]:
def fill_startokens(word_list: np.ndarray, df_fa: pd.DataFrame) -> pd.DataFrame:
    df_fa_filled = df_fa.copy(deep=True)

    mask_star = (df_fa["word"] == "*").to_numpy()

    df_fa_filled.loc[mask_star, "word"] = word_list[mask_star]

    return df_fa_filled
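
The positional logic can be illustrated without pandas. The following is a minimal sketch on plain lists; `fill_stars` and the example words are hypothetical and only mirror the idea of the boolean mask used above.

```python
from typing import List


def fill_stars(transcript_words: List[str], fa_words: List[str]) -> List[str]:
    """Replace each "*" in fa_words with the transcript word at the same index."""
    if len(transcript_words) != len(fa_words):
        raise ValueError("sequences must have the same length")
    return [t if fa == "*" else fa for t, fa in zip(transcript_words, fa_words)]


fa_words = ["i", "paid", "*", "dollars", "*"]
transcript_words = ["i", "paid", "20", "dollars", "<inaudible>"]
print(fill_stars(transcript_words, fa_words))
# ['i', 'paid', '20', 'dollars', '<inaudible>']
```

Words that are not star tokens are taken from the forced alignment output unchanged, so the transcript only contributes at the masked positions.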

The following code block defines a function to save a forced alignment output in which star tokens are filled.

In [8]:
def save_df_fa_filled(df_fa_filled: pd.DataFrame, rev_transcript_path: Path, task: str) -> None:
    filename = rev_transcript_path.stem.removesuffix("_long")
    save_path = DATA_DIR / f"{task}/04_FA_csv_Auto/{filename}_filled.csv"

    df_fa_filled.to_csv(save_path, index=False)

---

## 3. Conduct Post-Process

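The following code block runs the post-process for every task. Files whose transcript and forced alignment output differ in length are skipped and reported.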

In [9]:
for task in TASK:
    for rev_transcript_path in rev_transcript_path_generator(task):
        df_transcript = load_rev_transcript(rev_transcript_path)
        df_fa = load_fa_output(rev_transcript_path, task)

        word_list = extract_word_list(df_transcript)
        word_list = convert_word_list(word_list)

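        if is_same_length(word_list, df_fa):
            df_fa_filled = fill_startokens(word_list, df_fa)
            save_df_fa_filled(df_fa_filled, rev_transcript_path, task)
        else:
            # report both counts so the mismatch can be investigated
            print(f"[Error] {rev_transcript_path.stem}: {len(word_list)} transcript words vs. {len(df_fa)} forced alignment rows")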