5/12 (Sun) | UF Measures

# Shorten the Automatic Annotation Results for UF Measure Calculation in Monologic Data

## 1. Introduction

This notebook shortens the automatic annotation results so that UF measures can be calculated for the monologic speech corpus.
Because fluency in the monologic data was judged on the first minute of speech, I truncate the automatic annotation results to the same span.
The procedure consists of the following stages.

1. Get the pkl path of a Turn object
2. Load the short transcript corresponding to the Turn object
3. Get the end time of the transcript
4. Shorten the Turn object
5. Generate a TextGrid from the shortened Turn
6. Save the Turn and the TextGrid

Before starting the procedures, the following code block loads required packages and defines global variables.

In [1]:
from typing import Generator, List
import sys
from pathlib import Path

import pandas as pd
import pickle as pkl
from textgrids import TextGrid

sys.path.append(
    "/home/matsuura/Development/app/feature_extraction_api/app/modules"
)

from fluency import shorten_turn, Annotator, Turn
from fluency.pipeline.utils.pause_location import PauseLocation

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA"]

---

## 2. Define Functions

This section defines the functions used to shorten the automatic annotation results.
The following code block defines a generator that yields the pkl path of each Turn object.

In [2]:
def turn_path_generator(task: str) -> Generator[Path, None, None]:
    load_dir = DATA_DIR / f"{task}/08_Auto_Annotation"

    for turn_path in load_dir.glob(f"*_{task}_long_bert.pkl"):
        yield turn_path

The following code block defines a function to load a Turn object.

In [3]:
def load_turn(turn_path: Path) -> Turn:
    with open(turn_path, "rb") as f:
        turn = pkl.load(f)

    return turn

The following code block defines a function to load a short transcript as a pandas DataFrame.

In [4]:
def load_short_transcript(turn_path: Path, task: str) -> pd.DataFrame:
    filename = turn_path.stem.removesuffix("_long_bert")
    load_path = DATA_DIR / f"{task}/02_Rev_Transcript/{filename}.wav.csv"

    df_short_transcript = pd.read_csv(load_path, index_col=0)

    return df_short_transcript
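
As a quick illustration of the path handling above, the sketch below derives a transcript path from a Turn path. The root directory `data` and the file name `0001_Arg_Oly_long_bert.pkl` are hypothetical and used only for illustration; only the naming pattern is taken from the code above. Note that `str.removesuffix` requires Python 3.9 or later.

```python
from pathlib import Path

# Hypothetical root directory and file name, used only for illustration.
data_dir = Path("data")
task = "Arg_Oly"
turn_path = data_dir / task / "08_Auto_Annotation" / "0001_Arg_Oly_long_bert.pkl"

# Path.stem drops the ".pkl" extension; removesuffix drops the "_long_bert" tag.
filename = turn_path.stem.removesuffix("_long_bert")
transcript_path = data_dir / f"{task}/02_Rev_Transcript/{filename}.wav.csv"

print(filename)
print(transcript_path.name)
```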

The following code block defines a function to get the end time of the loaded short transcript.

In [5]:
def get_end_time(df_short_transcript: pd.DataFrame) -> float:
    # Keep only the rows that hold transcribed text
    mask_text = (df_short_transcript["type"] == "text")
    df_short_transcript_masked = df_short_transcript[mask_text]

    # The end time of the last text row marks where the short transcript stops
    end_time = df_short_transcript_masked.iloc[-1]["end_time"]

    return end_time
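
To make the behaviour of this function concrete, here is a minimal sketch on a toy DataFrame. The values and the `"pause"` row type are invented for illustration; only the `type` and `end_time` column names and the `"text"` label come from the code above. The function is repeated in compact form so that the sketch runs on its own.

```python
import pandas as pd

# Toy transcript: invented values; "pause" is a hypothetical non-text row type.
df_toy = pd.DataFrame(
    {
        "type": ["text", "pause", "text", "pause"],
        "start_time": [0.0, 1.2, 1.8, 59.1],
        "end_time": [1.2, 1.8, 59.1, 60.0],
    }
)


def get_end_time(df_short_transcript: pd.DataFrame) -> float:
    # Same logic as above: end time of the last row whose type is "text".
    mask_text = df_short_transcript["type"] == "text"
    return df_short_transcript[mask_text].iloc[-1]["end_time"]


end_time = get_end_time(df_toy)
print(end_time)  # end of the last "text" row, not of the trailing non-text row
```

A transcript with no `"text"` rows would raise an `IndexError` at `iloc[-1]`, so the function assumes every short transcript contains at least one text row.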

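The following code block defines two functions: find_pauses, which detects clause-external and clause-internal pauses in a Turn, and shorten_textgrid, which generates the corresponding TextGrid from the shortened Turn.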

In [6]:
def find_pauses(turn: Turn) -> List[dict]:
    pauses = []
    prev_clause_end = turn.start_time
    for clause in turn.clauses:
        # A gap of at least 0.25 before a clause is a clause-external pause
        if clause.start_time - prev_clause_end >= 0.25:
            p = {
                "location": PauseLocation.CLAUSE_EXTERNAL,
                "start_time": prev_clause_end,
                "end_time": clause.start_time
            }
            pauses.append(p)

        prev_word_end = clause.start_time
        for word in clause.words:
            # A gap of at least 0.25 between words of the same clause
            # is a clause-internal pause
            if word.start_time - prev_word_end >= 0.25:
                p = {
                    "location": PauseLocation.CLAUSE_INTERNAL,
                    "start_time": prev_word_end,
                    "end_time": word.start_time
                }
                pauses.append(p)

            prev_word_end = word.end_time

        prev_clause_end = clause.end_time

    return pauses


def shorten_textgrid(short_turn: Turn, annotator: Annotator, textgrid_path: Path) -> TextGrid:
    pauses = find_pauses(short_turn)
    short_textgrid = annotator.to_textgrid(short_turn, pauses, save_path=textgrid_path)

    return short_textgrid
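
The pause detection above can be tried out without the fluency package. The sketch below reuses the same gap logic on stand-in classes: `Word`, `Clause`, `ToyTurn`, and `find_pauses_toy` are hypothetical names, the string labels replace `PauseLocation`, and deriving a clause's start and end from its words is an assumption about how the real Turn object behaves.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    start_time: float
    end_time: float


@dataclass
class Clause:
    words: List[Word]

    # Assumption: a clause spans from its first word to its last word.
    @property
    def start_time(self) -> float:
        return self.words[0].start_time

    @property
    def end_time(self) -> float:
        return self.words[-1].end_time


@dataclass
class ToyTurn:
    clauses: List[Clause]
    start_time: float = 0.0


def find_pauses_toy(turn: ToyTurn, threshold: float = 0.25) -> List[Tuple[str, float, float]]:
    pauses = []
    prev_clause_end = turn.start_time
    for clause in turn.clauses:
        # Gap before the clause starts: clause-external pause.
        if clause.start_time - prev_clause_end >= threshold:
            pauses.append(("clause_external", prev_clause_end, clause.start_time))
        prev_word_end = clause.start_time
        for word in clause.words:
            # Gap between consecutive words in the clause: clause-internal pause.
            if word.start_time - prev_word_end >= threshold:
                pauses.append(("clause_internal", prev_word_end, word.start_time))
            prev_word_end = word.end_time
        prev_clause_end = clause.end_time
    return pauses


turn = ToyTurn(clauses=[
    Clause([Word(0.5, 1.0), Word(1.5, 2.0)]),
    Clause([Word(2.125, 2.5), Word(2.5, 3.0)]),
])
pauses = find_pauses_toy(turn)
print(pauses)
# [('clause_external', 0.0, 0.5), ('clause_internal', 1.0, 1.5)]
```

The 0.125 gap between the two clauses falls below the 0.25 threshold, so it is not reported as a pause.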

---

## 3. Shorten Turn

This section shortens the Turn objects and saves each shortened Turn together with its TextGrid.

In [7]:
annotator = Annotator(process=[])

for task in TASK:
    for turn_path in turn_path_generator(task):
        turn = load_turn(turn_path)
        df_short_transcript = load_short_transcript(turn_path, task)
        end_time = get_end_time(df_short_transcript)

        filename = turn_path.stem.removesuffix("_long_bert")
        short_turn_path = turn_path.parent / f"{filename}_bert.pkl"
        short_textgrid_path = turn_path.parent / f"{filename}_bert.TextGrid"

        short_turn = shorten_turn(turn, end_time)
        short_textgrid = shorten_textgrid(short_turn, annotator, short_textgrid_path)

        with open(short_turn_path, "wb") as f:
            pkl.dump(short_turn, f)
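
The internals of shorten_turn are not shown in this notebook. As a rough mental model only, truncation at a cutoff can be sketched as below. `truncate_words` and the tuple layout are hypothetical, and whether the real function drops or clips a word that straddles the cutoff is an assumption made here, not something the notebook confirms.

```python
from typing import List, Tuple

# (word, start_time, end_time): a simplified stand-in for the words in a Turn.
WordSpan = Tuple[str, float, float]


def truncate_words(words: List[WordSpan], end_time: float) -> List[WordSpan]:
    """Keep only the words that finish at or before end_time (hypothetical helper)."""
    return [w for w in words if w[2] <= end_time]


words = [("well", 0.5, 1.0), ("I", 58.0, 58.5), ("think", 59.5, 60.5)]
short_words = truncate_words(words, end_time=60.0)
print(short_words)
# [('well', 0.5, 1.0), ('I', 58.0, 58.5)]
```

Because end_time in this notebook is taken from the last text row of the short transcript, the cutoff is expected to coincide with a word boundary, so the straddling case should be rare in practice.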