5/10 (Fri) | Preprocess

# Post-Process for Forced Alignment Outputs of Dialogue Data (ASR Transcript)

## 1. Introduction

This notebook fills in the star tokens (`*`) that remain in forced alignment outputs for dialogue data.
The procedure consists of the following stages.

1. Get user ids
2. Load the transcript corresponding to a user id
3. Get turn-level ids and transcripts
4. Load the corresponding forced alignment output
5. Does the forced alignment output include star tokens?

    a. True: Convert the transcript to a word list. If the number of words matches the forced alignment output, fill the star tokens and save the result with the suffix "_filled"; otherwise report an error

    b. False: Save it as is with the suffix "_filled"

Before starting the process, the following code block loads the required packages and defines global variables.

In [1]:
from typing import Tuple, Generator
from pathlib import Path
import re

import numpy as np
import pandas as pd

TRANSCRIPT_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data/WoZ_Interview/02_Rev_Transcript")
FA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data/WoZ_Interview/04_FA_csv_Auto")

PUNCTUATIONS = [".", ",", ":", "?", "!"]

---

## 2. Define Functions

This section defines functions to complete the post-process.
The following code block defines a generator of user ids.

In [2]:
def user_id_generator() -> Generator[str, None, None]:
    for uid in range(1, 86):
        uid = str(uid).zfill(3)

        yield uid

The following code block defines a function to load a transcript.

In [3]:
def load_transcript(uid: str) -> pd.DataFrame:
    csv_path = TRANSCRIPT_DIR / f"{uid}.csv"
    df_transcript = pd.read_csv(csv_path, index_col=0)

    return df_transcript

The following code block defines a generator of turn-level ids and transcripts. It keeps only user turns and excludes the "intro" and "closing" topics.

In [4]:
def turn_level_info_generator(
        df_transcript: pd.DataFrame
) -> Generator[Tuple[str, str], None, None]:
    user_mask = (df_transcript["speaker"] == "user")
    
    intro_mask = (df_transcript["topic"] == "intro")
    closing_mask = (df_transcript["topic"] == "closing")
    topic_mask = intro_mask | closing_mask

    mask = user_mask & (~topic_mask)

    df_transcript_masked = df_transcript[mask]
    
    for idx in df_transcript_masked.index:
        transcript = df_transcript_masked.at[idx, "transcript"]

        turn_id = str(idx).zfill(3)

        yield turn_id, transcript
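
As a minimal sketch of the masking above, the following example applies the same filter to a toy table. The column names follow the generator; the rows are made up for illustration and are not taken from the actual data.

```python
import pandas as pd

# Toy transcript table; column names follow the generator above,
# the rows themselves are hypothetical.
df_toy = pd.DataFrame(
    {
        "speaker": ["system", "user", "user", "system", "user"],
        "topic": ["intro", "intro", "hobby", "hobby", "closing"],
        "transcript": ["Hello.", "Hi.", "I like tennis.", "I see.", "Bye."],
    }
)

# Keep user turns that belong to neither the intro nor the closing.
user_mask = df_toy["speaker"] == "user"
topic_mask = df_toy["topic"].isin(["intro", "closing"])
df_kept = df_toy[user_mask & ~topic_mask]

# Turn ids are the zero-padded row indices of the kept rows.
turns = [(str(idx).zfill(3), text) for idx, text in df_kept["transcript"].items()]
print(turns)  # [('002', 'I like tennis.')]
```

Only the row at index 2 survives, so the turn id is "002". Because the id comes from the DataFrame index, it matches the turn number used in the forced alignment file names only if the transcript CSV is loaded with that index.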

The following code block defines a function to load a forced alignment output.

In [5]:
def load_fa_output(user_id: str, turn_id: str) -> pd.DataFrame:
    fa_output_path = FA_DIR / f"{user_id}_{turn_id}.csv"
    df_fa = pd.read_csv(fa_output_path)

    return df_fa

The following code block defines a function to check whether a forced alignment output includes star tokens.

In [6]:
def is_star_token_included(df_fa: pd.DataFrame) -> bool:
    mask_star_token = (df_fa["word"] == "*")
    
    return bool(mask_star_token.sum())

The following code block defines a function to convert an original transcript to a word list.

In [7]:
def convert_fa_transcript(transcript: str) -> np.ndarray:
    # 1. temporarily replace inaudible tags with star tokens
    fa_transcript = transcript.replace("<inaudible>", "*")

    # 2. remove punctuation marks and hyphens
    for punct in PUNCTUATIONS:
        fa_transcript = fa_transcript.replace(punct, "")
    fa_transcript = fa_transcript.replace("-", "")

    # 3. remove other tags
    tag_pattern = r"<.*?>"
    fa_transcript = re.sub(tag_pattern, " ", fa_transcript)

    # 4. restore inaudible tags from the temporary star tokens
    fa_transcript = fa_transcript.replace("*", "<inaudible>")

    # 5. lowercase the transcript
    fa_transcript = fa_transcript.lower()

    # 6. collapse repeated spaces and trim both ends
    while "  " in fa_transcript:
        fa_transcript = fa_transcript.replace("  ", " ")

    # strip() also handles an empty string, where indexing [0] would raise
    fa_transcript = fa_transcript.strip(" ")

    return np.array(fa_transcript.split(" "))
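
To make the conversion steps concrete, the following is a compact sketch of the same normalization applied to a made-up transcript. The function name `normalize_transcript` and the sample sentence are hypothetical. The sketch uses `str.split()` without arguments to collapse whitespace, which is a simplification of the loop above, not the notebook's exact code.

```python
import re

PUNCTUATIONS = [".", ",", ":", "?", "!"]


def normalize_transcript(transcript: str) -> list:
    """Hypothetical compact version of the conversion steps above."""
    # Protect inaudible tags so the tag removal below does not delete them.
    text = transcript.replace("<inaudible>", "*")
    # Drop punctuation marks and hyphens.
    for punct in PUNCTUATIONS + ["-"]:
        text = text.replace(punct, "")
    # Replace every remaining tag with a space.
    text = re.sub(r"<.*?>", " ", text)
    # Restore the protected inaudible tags.
    text = text.replace("*", "<inaudible>")
    # Lowercase, then split on any run of whitespace.
    return text.lower().split()


words = normalize_transcript("Well, <laugh> I went to <inaudible> last week-end.")
print(words)
# ['well', 'i', 'went', 'to', '<inaudible>', 'last', 'weekend']
```

The `<laugh>` tag disappears, the hyphen is removed so "week-end" becomes one word, and `<inaudible>` is kept as a single word.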

The following code block defines a function to check whether the number of words in a rev-format transcript equals the number of rows in the corresponding forced alignment output.

In [8]:
def is_same_length(word_list: np.ndarray, df_fa: pd.DataFrame) -> bool:
    return len(word_list) == len(df_fa)

The following code block defines a function to fill star tokens in a forced alignment output.

In [9]:
def fill_startokens(word_list: np.ndarray, df_fa: pd.DataFrame) -> pd.DataFrame:
    df_fa_filled = df_fa.copy(deep=True)

    mask_star = (df_fa["word"] == "*").to_numpy()

    df_fa_filled.loc[mask_star, "word"] = word_list[mask_star]

    return df_fa_filled
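
The filling step relies on positional correspondence: row i of the forced alignment output is assumed to hold the i-th word of the transcript, which is why the length check comes first. The following sketch shows this on toy data. Only the `word` column is known from the code above; the `start` and `end` columns and all values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy forced alignment output; "start" and "end" are hypothetical columns.
df_fa_toy = pd.DataFrame(
    {
        "word": ["i", "*", "to", "*"],
        "start": [0.0, 0.4, 0.9, 1.1],
        "end": [0.4, 0.9, 1.1, 1.8],
    }
)
# Word list from the transcript, one entry per alignment row.
word_list = np.array(["i", "went", "to", "kyoto"])

# Boolean mask of the rows that still hold a star token.
mask_star = (df_fa_toy["word"] == "*").to_numpy()

# Overwrite only the masked rows with the words at the same positions.
df_filled = df_fa_toy.copy(deep=True)
df_filled.loc[mask_star, "word"] = word_list[mask_star]

filled_words = [str(w) for w in df_filled["word"]]
print(filled_words)
```

After filling, the `word` column reads i, went, to, kyoto, and the timing columns are unchanged. If the two sequences had different lengths, the boolean index into `word_list` would fail or misalign, so such turns are not filled.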

The following code block defines a function to save a forced alignment output in which star tokens are filled.

In [10]:
def save_df_fa_filled(df_fa_filled: pd.DataFrame, user_id: str, turn_id: str) -> None:
    save_path = FA_DIR / f"{user_id}_{turn_id}_filled.csv"

    df_fa_filled.to_csv(save_path, index=False)

---

## 3. Conduct Post-Process

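The following code block runs the post-process for every user and turn. Turns whose transcript is not a string (a missing value is loaded as a float) are skipped, and turns whose word count does not match the forced alignment output are reported with an "[Error]" message instead of being saved.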

In [11]:
for uid in user_id_generator():
    df_transcript = load_transcript(uid)
    
    for tid, transcript in turn_level_info_generator(df_transcript):
        if isinstance(transcript, float):
            continue

        df_fa = load_fa_output(uid, tid)

        if not is_star_token_included(df_fa):
            save_df_fa_filled(df_fa, uid, tid)
            continue

        word_list = convert_fa_transcript(transcript)
        if is_same_length(word_list, df_fa):
            df_fa_filled = fill_startokens(word_list, df_fa)
            save_df_fa_filled(df_fa_filled, uid, tid)
            continue

        print(f"[Error] {uid}_{tid}")