5/13 (Mon) | SCTK

# Preprocessing of Automatic Annotation Results for SCTK Alignment

## 1. Introduction

This notebook preprocesses automatic annotation results for SCTK alignment.
The goal is to generate one CSV file per turn with the following four columns:

- start_time ... the start time of the row in seconds
- end_time ... the end time of the row in seconds
- type ... the type of the row (01_text, 02_pause, 03_disfl, 04_filler)
- text ... the text of the row (word, \<CI\>, \<CE\>, \<DISFLUENCY\>, \<FILLER\>)

The preprocessing consists of the following steps.

1. Load a Turn object
2. Load the TextGrid corresponding to the turn
3. Get 01_text type rows from the Turn object
4. Get 03_disfl type rows from the Turn object
5. Get 04_filler type rows from the Turn object
6. Get 02_pause type rows from the TextGrid
7. Concatenate the rows into a pandas DataFrame and sort them by start_time
8. Save the DataFrame as a CSV file
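
Steps 7 and 8 can be sketched with a few hand-written rows. The values below are made up for illustration and do not come from the dataset; only the column names and type labels follow the lists above.

```python
import io

import pandas as pd

# Hypothetical rows, written by hand for illustration only.
rows = [
    {"start_time": 0.50, "end_time": 0.80, "type": "01_text", "text": "um"},
    {"start_time": 0.00, "end_time": 0.50, "type": "01_text", "text": "well"},
    {"start_time": 0.80, "end_time": 1.30, "type": "02_pause", "text": "<CI>"},
]

# Step 7: build a DataFrame and order the rows chronologically.
df = pd.DataFrame(rows).sort_values("start_time").reset_index(drop=True)

# Step 8: write CSV without the index column (to a buffer here, not to disk).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

The header line of `csv_text` lists the four columns in the order in which the dictionary keys were written.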

Before starting the preprocessing, the following code block loads the required packages and defines global variables.

In [1]:
import sys
from typing import List, Tuple, Generator
from pathlib import Path

import pandas as pd
import pickle as pkl
from textgrids import TextGrid

sys.path.append(
    "/home/matsuura/Development/app/feature_extraction_api/app/modules"
)

from fluency import Turn

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA", "WoZ_Interview"]

---

## 2. Define Functions

This section defines the functions used for the preprocessing.
The following code block defines a generator that yields the file paths of a Turn object and its TextGrid, and a function that loads both from those paths.

In [2]:
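def turn_textgrid_path_generator(task: str) -> Generator[Tuple[Path, Path], None, None]:
    """Yield (Turn pickle path, TextGrid path) pairs for the given task."""
    load_dir = DATA_DIR / f"{task}/08_Auto_Annotation"

    # Monologue tasks only use the "*_long.pkl" pickles; other tasks use every pickle.
    monologue_task = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA"]
    suffix = "*.pkl"
    if task in monologue_task:
        suffix = "*_long.pkl"

    for turn_path in load_dir.glob(suffix):
        # The TextGrid is expected in the same directory, under the same file stem.
        textgrid_path = load_dir / f"{turn_path.stem}.TextGrid"

        yield turn_path, textgrid_path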


def load_turn_and_textgrid(turn_path: Path, textgrid_path: Path) -> Tuple[Turn, TextGrid]:
    with open(turn_path, "rb") as f:
        turn = pkl.load(f)

    textgrid = TextGrid(str(textgrid_path))

    return turn, textgrid
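
The path pairing can be checked in isolation on a temporary directory. This is a minimal sketch: the file names below are hypothetical and the real naming scheme may differ. It also shows an existence check on the derived TextGrid path, which the generator itself does not perform.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    load_dir = Path(tmp)
    # Hypothetical file names, created empty just to exercise the glob patterns.
    for name in ["spk01_long.pkl", "spk01_001.pkl", "spk01_long.TextGrid"]:
        (load_dir / name).touch()

    # Pattern used for monologue tasks vs. the default pattern.
    long_only = sorted(p.name for p in load_dir.glob("*_long.pkl"))
    all_pkl = sorted(p.name for p in load_dir.glob("*.pkl"))

    # The TextGrid path is derived from the pickle's stem (file name without ".pkl").
    turn_path = load_dir / "spk01_long.pkl"
    textgrid_path = load_dir / f"{turn_path.stem}.TextGrid"
    textgrid_name = textgrid_path.name
    textgrid_exists = textgrid_path.exists()
```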

The following code block defines a function to get rows from a Turn object.

In [3]:
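def extract_rows_from_turn(turn: Turn) -> List[dict]:
    """Return a 01_text row for every word, plus a 03_disfl or 04_filler row for words with idx == -1."""
    turn.show_disfluency()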

    rows = []
    for word in turn.words:
        text_row = {
            "start_time": word.start_time,
            "end_time": word.end_time,
            "type": "01_text",
            "text": str(word)
        }

        rows.append(text_row)

        if word.idx != -1:
            continue

        disfl_type = word.disfluency.name
        if disfl_type == "FILLER":
            filler_row = {
                "start_time": word.start_time,
                "end_time": word.end_time,
                "type": "04_filler",
                "text": "<FILLER>"
            }
            rows.append(filler_row)
            continue

        disfl_row = {
            "start_time": word.start_time,
            "end_time": word.end_time,
            "type": "03_disfl",
            "text": "<DISFLUENCY>"
        }
        rows.append(disfl_row)

    return rows
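
The branching above can be tried without the `fluency` module. The sketch below is an assumption-laden stand-in: `FakeWord`, `FakeDisfluency`, `REPETITION`, and `rows_for_word` are hypothetical names that only mimic the attributes the function reads (`start_time`, `end_time`, `idx`, `disfluency.name`). Reading `idx == -1` as "disfluent" is inferred from the function, not confirmed by the `Turn` class.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FakeDisfluency(Enum):
    """Hypothetical stand-in; only the member names matter here."""
    FILLER = 1
    REPETITION = 2


@dataclass
class FakeWord:
    """Hypothetical stand-in for a word of a Turn object."""
    text: str
    start_time: float
    end_time: float
    idx: int
    disfluency: Optional[FakeDisfluency] = None

    def __str__(self) -> str:
        return self.text


def rows_for_word(word: FakeWord) -> List[dict]:
    """Mirror the per-word logic: always a text row, plus one tag row if idx == -1."""
    span = {"start_time": word.start_time, "end_time": word.end_time}
    rows = [{**span, "type": "01_text", "text": str(word)}]
    if word.idx == -1:
        if word.disfluency.name == "FILLER":
            rows.append({**span, "type": "04_filler", "text": "<FILLER>"})
        else:
            rows.append({**span, "type": "03_disfl", "text": "<DISFLUENCY>"})
    return rows


words = [
    FakeWord("hello", 0.0, 0.4, idx=0),
    FakeWord("um", 0.4, 0.7, idx=-1, disfluency=FakeDisfluency.FILLER),
    FakeWord("I", 0.7, 0.8, idx=-1, disfluency=FakeDisfluency.REPETITION),
]
rows = [row for word in words for row in rows_for_word(word)]
types = [row["type"] for row in rows]
```

Each word with `idx == -1` therefore contributes two rows that share the same start and end time.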

The following code block defines a function to get pause rows from a TextGrid.

In [4]:
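def extract_rows_from_textgrid(textgrid: TextGrid) -> List[dict]:
    """Return one 02_pause row per labelled interval in the "pause" tier."""
    pause_tier = textgrid["pause"]

    rows = []
    for interval in pause_tier:
        pause_type = interval.text

        # Intervals with an empty label are skipped.
        if pause_type == "":
            continue
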
        row = {
            "start_time": interval.xmin,
            "end_time": interval.xmax,
            "type": "02_pause",
            "text": f"<{pause_type}>"
        }
        rows.append(row)

    return rows
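
The filtering can be illustrated with plain tuples in place of a real tier. In this sketch `Interval` is a hypothetical stand-in exposing the same three attributes the function reads (`text`, `xmin`, `xmax`), and the times are invented. The labels `CI` and `CE` are the ones listed in the introduction.

```python
from collections import namedtuple

# Hypothetical stand-in for a TextGrid interval.
Interval = namedtuple("Interval", ["text", "xmin", "xmax"])

pause_tier = [
    Interval("", 0.0, 1.2),    # unlabelled interval, skipped
    Interval("CI", 1.2, 1.6),
    Interval("", 1.6, 3.0),    # unlabelled interval, skipped
    Interval("CE", 3.0, 3.9),
]

# Keep only labelled intervals and wrap the label in angle brackets.
pause_rows = [
    {
        "start_time": interval.xmin,
        "end_time": interval.xmax,
        "type": "02_pause",
        "text": f"<{interval.text}>",
    }
    for interval in pause_tier
    if interval.text != ""
]
```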

---

## 3. Preprocess

This section runs the preprocessing for every task and writes one CSV file per Turn object to the task's `10_SCTK_Inputs` directory.

In [5]:
for task in TASK:
    save_dir = DATA_DIR / f"{task}/10_SCTK_Inputs"
    save_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing directories

    for turn_path, textgrid_path in turn_textgrid_path_generator(task):
        data = []
        
        turn, textgrid = load_turn_and_textgrid(turn_path, textgrid_path)

        data += extract_rows_from_turn(turn)
        data += extract_rows_from_textgrid(textgrid)

        df_annotation = pd.DataFrame.from_dict(data)
        df_annotation = df_annotation.sort_values("start_time").reset_index(drop=True)

        # str.removesuffix requires Python 3.9 or later.
        filename = turn_path.stem.removesuffix("_long")
        save_path = save_dir / f"{filename}_auto.csv"
        df_annotation.to_csv(save_path, index=False)
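
One detail worth noting: a word with `idx == -1` yields a 01_text row and a 03_disfl or 04_filler row with identical `start_time`, and `sort_values` on a single column with its default sort kind does not guarantee the relative order of tied rows. If a fixed order is needed, one possible variation (not what the loop above does) is to add `type` as a secondary sort key, which works because the type labels carry numeric prefixes. The rows below are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; the filler tag and its word share the same start time.
rows = [
    {"start_time": 1.0, "end_time": 1.3, "type": "04_filler", "text": "<FILLER>"},
    {"start_time": 1.0, "end_time": 1.3, "type": "01_text", "text": "um"},
    {"start_time": 0.0, "end_time": 1.0, "type": "02_pause", "text": "<CI>"},
]

# Sort by time first, then by type label to break ties deterministically.
df = (
    pd.DataFrame(rows)
    .sort_values(["start_time", "type"])
    .reset_index(drop=True)
)
ordered_types = df["type"].tolist()
```

With this key the text row always precedes its tag row. Whether that order suits the downstream SCTK alignment is a separate question that the notebook does not address.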