5/13 (Mon) | SCTK

# Preprocessing Manual Annotations of a Monologic Corpus for SCTK Alignment

## 1. Introduction

This notebook preprocesses the manual annotation results for SCTK alignment.
The goal is to generate CSV files with the following four columns:

- start_time ... the start time of the row in sec
- end_time ... the end time of the row in sec
- type ... the type of the row (01_text, 02_pause, 03_disfl, 04_filler)
- text ... the text of the row (word, \<CI\>, \<CE\>, \<DISFLUENCY\>, \<FILLER\>)
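
To make the target format concrete, the following sketch builds a small table with these four columns and prints it as CSV. All times and words are invented for illustration and do not come from the corpus; `<CI>` is used here only as a sample pause label.

```python
import pandas as pd

# Hypothetical rows: the times and words are made up for illustration only.
example_rows = [
    {"start_time": 0.00, "end_time": 0.35, "type": "01_text", "text": "i"},
    {"start_time": 0.35, "end_time": 0.80, "type": "01_text", "text": "think"},
    {"start_time": 0.35, "end_time": 0.80, "type": "03_disfl", "text": "<DISFLUENCY>"},
    {"start_time": 0.80, "end_time": 1.40, "type": "02_pause", "text": "<CI>"},
    {"start_time": 1.40, "end_time": 1.40, "type": "04_filler", "text": "<FILLER>"},
]

df_example = pd.DataFrame(example_rows)

# Print the table in the same CSV layout the preprocessing is meant to produce.
print(df_example.to_csv(index=False))
```

Note that a disfluent word appears twice: once as a 01_text row and once as a 03_disfl row with the same time span.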

The preprocessing consists of the following steps.

1. Load a wav2vec forced alignment (FA) result csv file
2. Load a textgrid corresponding to the csv file
3. Extract texts from the textgrid
4. Get 01_text type rows from the FA result
5. Get 03_disfl type rows from the texts
6. Get 02_pause type rows from the textgrid
7. Get 04_filler type rows from the textgrid

Before preprocessing starts, the following code block loads the required packages and defines global variables.

In [1]:
import sys
from typing import List, Tuple, Generator
from pathlib import Path

import numpy as np
import pandas as pd
import pickle as pkl
from textgrids import TextGrid

sys.path.append(
    "/home/matsuura/Development/app/feature_extraction_api/app/modules"
)

from fluency import Turn

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA"]

PUNCTUATIONS = [".", ",", ":", "?", "!"]

---

## 2. Define Functions

This section defines the functions used for preprocessing.
The following code block defines a generator that yields paired paths to an FA CSV file and its TextGrid, and a function that loads both.

In [2]:
def fa_textgrid_path_generator(task: str) -> Generator[Tuple[Path, Path], None, None]:
    load_dir = DATA_DIR / f"{task}/01_Manual_TextGrid"

    for textgrid_path in load_dir.glob("*.TextGrid"):
        uid = textgrid_path.stem[:4]
        
        fa_filename = f"{uid}_{task}_filled.csv"
        fa_csv_path = DATA_DIR / f"{task}/06_FA_csv_Manu/{fa_filename}"

        yield fa_csv_path, textgrid_path

def load_fa_and_textgrid(fa_path: Path, textgrid_path: Path) -> Tuple[pd.DataFrame, TextGrid]:
    df_fa = pd.read_csv(fa_path)
    textgrid = TextGrid(str(textgrid_path))

    return df_fa, textgrid

The following code block defines functions to extract texts from a TextGrid, normalize them into a word array, and check that the array has the same length as the FA result.

In [3]:
def extract_texts_from_textgrid(textgrid_path: Path) -> List[str]:
    textgrid = TextGrid(str(textgrid_path))
    transcript_tier = textgrid["Transcript"]
    
    texts = []
    for interval in transcript_tier:
        text = interval.text
        texts.append(text)

    return texts

def transform_texts(texts: List[str]) -> np.ndarray:
    # 1. join the list into a single string
    fa_transcript = " ".join(texts)

    # 2. remove punctuations
    for punct in PUNCTUATIONS:
        fa_transcript = fa_transcript.replace(punct, " ")
    fa_transcript = fa_transcript.replace("-", "")
    fa_transcript = fa_transcript.replace("é", "e")

    # 3. remove spaces just inside curly brackets
    while "{ " in fa_transcript:
        fa_transcript = fa_transcript.replace("{ ", "{")

    while " }" in fa_transcript:
        fa_transcript = fa_transcript.replace(" }", "}")

    # 4. add a space before "{" and after "}"
    fa_transcript = fa_transcript.replace("{", " {")
    fa_transcript = fa_transcript.replace("}", "} ")

    # 5. lower transcript
    fa_transcript = fa_transcript.lower()

    # 6. collapse repeated spaces and trim both ends
    while "  " in fa_transcript:
        fa_transcript = fa_transcript.replace("  ", " ")

    # strip() is also safe on an empty transcript, unlike indexing [0] or [-1]
    fa_transcript = fa_transcript.strip(" ")

    return np.array(fa_transcript.split(" "))

def is_same_length(word_list: np.ndarray, df_fa: pd.DataFrame) -> bool:
    return len(word_list) == len(df_fa)
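
To illustrate what the normalization produces, here is a condensed, self-contained restatement of the steps above, applied to an invented sample. The name `normalize_transcript` and the sample sentences are hypothetical; the sketch uses regular expressions and `str.split()` instead of the loops in `transform_texts`, so it is an approximation of that function rather than a copy of it.

```python
import re
from typing import List

PUNCTUATIONS = [".", ",", ":", "?", "!"]

def normalize_transcript(texts: List[str]) -> List[str]:
    """Condensed sketch of the normalization steps (hypothetical helper)."""
    transcript = " ".join(texts)
    # Replace punctuation with spaces, drop hyphens, and fold the accented e.
    for punct in PUNCTUATIONS:
        transcript = transcript.replace(punct, " ")
    transcript = transcript.replace("-", "").replace("é", "e")
    # Attach each bracket to its neighbouring word ...
    transcript = re.sub(r"\{\s+", "{", transcript)
    transcript = re.sub(r"\s+\}", "}", transcript)
    # ... then separate bracketed spans from the surrounding words.
    transcript = transcript.replace("{", " {").replace("}", "} ")
    # split() without arguments also discards repeated and outer whitespace.
    return transcript.lower().split()

# Invented sample intervals, not taken from the corpus.
sample = ["I went to, {to the} café.", "It was { uh } well-known!"]
tokens = normalize_transcript(sample)
print(tokens)
# → ['i', 'went', 'to', '{to', 'the}', 'cafe', 'it', 'was', '{uh}', 'wellknown']
```

The brackets stay attached to the first and last word of each disfluent span, which is what the row extraction later relies on.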

The following code block defines a function that generates the 01_text and 03_disfl rows.

In [4]:
def extract_text_disfl_rows(df_fa: pd.DataFrame, texts: np.ndarray) -> List[dict]:
    # Assumes df_fa has the default RangeIndex, so idx also indexes texts
    is_in_curly_bracket = False
    rows = []
    for idx in df_fa.index:
        start_time = df_fa.at[idx, "start_time"]
        end_time = df_fa.at[idx, "end_time"]

        text_row = {
            "start_time": start_time,
            "end_time": end_time,
            "type": "01_text",
            "text": df_fa.at[idx, "word"]
        }
        rows.append(text_row)

        # The word lies inside a {} span opened by an earlier word of the TextGrid transcript
        if is_in_curly_bracket:
            disfl_row = {
                "start_time": start_time,
                "end_time": end_time,
                "type": "03_disfl",
                "text": "<DISFLUENCY>"
            }
            rows.append(disfl_row)

            if "}" in texts[idx]: # the {} span closes on this word
                is_in_curly_bracket = False

            continue

        # A {} span of the TextGrid transcript opens on this word
        if "{" in texts[idx]:
            disfl_row = {
                "start_time": start_time,
                "end_time": end_time,
                "type": "03_disfl",
                "text": "<DISFLUENCY>"
            }
            rows.append(disfl_row)

            if "}" not in texts[idx]: # the {} span continues past this word
                is_in_curly_bracket = True

    return rows
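
The bracket tracking in this function can be easier to follow in isolation. The sketch below reproduces only the open/close logic on an invented token list and returns one flag per token; `flag_disfluent_tokens` is a hypothetical helper, not part of the notebook.

```python
from typing import List

def flag_disfluent_tokens(tokens: List[str]) -> List[bool]:
    """Return True for every token that lies inside a {...} span."""
    inside = False
    flags = []
    for token in tokens:
        if inside:
            flags.append(True)
            if "}" in token:  # the span closes on this token
                inside = False
            continue
        if "{" in token:
            flags.append(True)
            if "}" not in token:  # the span continues past this token
                inside = True
            continue
        flags.append(False)
    return flags

# Invented tokens: one two-word span and one single-word span.
tokens = ["i", "went", "{to", "the}", "cafe", "{uh}", "yes"]
flags = flag_disfluent_tokens(tokens)
print(flags)
# → [False, False, True, True, False, True, False]
```

Every token flagged `True` corresponds to a word that receives an additional 03_disfl row.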

The following code block defines a function to get pause and filler rows from a TextGrid.

In [5]:
def extract_rows_from_textgrid(textgrid: TextGrid) -> List[dict]:
    pause_tier = textgrid["silences"]
    
    rows = []
    for interval in pause_tier:
        pause_type = interval.text

        if pause_type == "":
            continue
        
        row = {
            "start_time": interval.xmin,
            "end_time": interval.xmax,
            "type": "02_pause",
            "text": f"<{pause_type}>"
        }
        rows.append(row)

    # Each filler is marked by a single time point (xpos),
    # so its row has the same start and end time
    filler_tier = textgrid["Filled"]
    for point in filler_tier:
        row = {
            "start_time": point.xpos,
            "end_time": point.xpos,
            "type": "04_filler",
            "text": "<FILLER>"
        }
        rows.append(row)

    return rows
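
As a rough illustration of the rows this function yields, the sketch below runs the same logic on stand-in objects. The `Interval` and `Point` namedtuples are hypothetical substitutes that expose only the attributes used above (`text`, `xmin`, `xmax`, `xpos`), and the tier contents are invented.

```python
from collections import namedtuple

# Hypothetical stand-ins for TextGrid tier entries; not the library's classes.
Interval = namedtuple("Interval", ["text", "xmin", "xmax"])
Point = namedtuple("Point", ["text", "xpos"])

# Invented tier contents: unlabeled intervals are skipped as non-pauses.
silences = [Interval("", 0.0, 1.2), Interval("CI", 1.2, 1.6), Interval("", 1.6, 3.0)]
filled = [Point("uh", 2.1)]

rows = [
    {"start_time": iv.xmin, "end_time": iv.xmax, "type": "02_pause", "text": f"<{iv.text}>"}
    for iv in silences
    if iv.text != ""
]
rows += [
    {"start_time": pt.xpos, "end_time": pt.xpos, "type": "04_filler", "text": "<FILLER>"}
    for pt in filled
]

for row in rows:
    print(row)
```

Only the labeled interval becomes a pause row, and the filler row has zero duration because it is defined by a single time point.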

---

## 3. Preprocess

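This section runs the preprocessing for every task and writes one CSV file per TextGrid, skipping files that have already been generated.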

In [6]:
for task in TASK:
    save_dir = DATA_DIR / f"{task}/10_SCTK_Inputs"

    for fa_path, textgrid_path in fa_textgrid_path_generator(task):
        data = []
        filename = fa_path.stem.removesuffix("_filled")
        save_path = save_dir / f"{filename}_manu.csv"

        if save_path.exists():
            continue

        df_fa, textgrid = load_fa_and_textgrid(fa_path, textgrid_path)

        texts = extract_texts_from_textgrid(textgrid_path)
        texts = transform_texts(texts)

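        if not is_same_length(texts, df_fa):
            # Report which file is misaligned instead of raising a bare Exception
            raise ValueError(
                f"Word count mismatch for {textgrid_path.name}: "
                f"{len(texts)} transcript words vs {len(df_fa)} FA rows"
            )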

        data += extract_text_disfl_rows(df_fa, texts)
        data += extract_rows_from_textgrid(textgrid)

        df_annotation = pd.DataFrame.from_dict(data)
        df_annotation = df_annotation.sort_values("start_time").reset_index(drop=True)

        df_annotation.to_csv(save_path, index=False)     
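
Rows that share a start time, such as a word and its `<DISFLUENCY>` row, have no guaranteed relative order when sorting on `start_time` alone. One possible variation, which the loop above does not use, is to add `type` as a secondary key so that the numeric prefixes break ties. Whether the downstream SCTK step depends on this order is an assumption; the rows below are invented.

```python
import pandas as pd

# Invented rows in arbitrary order; two of them share a start time.
rows = [
    {"start_time": 0.5, "end_time": 0.9, "type": "03_disfl", "text": "<DISFLUENCY>"},
    {"start_time": 0.0, "end_time": 0.5, "type": "01_text", "text": "i"},
    {"start_time": 0.5, "end_time": 0.9, "type": "01_text", "text": "went"},
]
df = pd.DataFrame(rows)

# Sort by time first, then by the type prefix (01_ < 02_ < 03_ < 04_).
df_sorted = df.sort_values(["start_time", "type"]).reset_index(drop=True)
print(df_sorted["text"].tolist())
# → ['i', 'went', '<DISFLUENCY>']
```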