8/24 (Sat) | Experiment

# Preliminary Analyses of Annotation

## 1. Introduction

This notebook conducts preliminary analyses of the manual and automatic annotation results.
The goal of the current analyses is to fill in the following table.

| Task | WER | N disfluency (Manual / Automatic) | N MCP (Manual / Automatic) | N ECP (Manual / Automatic) |
| - | - | - | - | - |
| Arg_Oly |  |  |  |  |
| Cartoon |  |  |  |  |
| RtSwithoutRAA |  |  |  |  |
| RtSwithRAA |  |  |  |  |
| Monologue |  |  |  |  |
| WoZ_Interview |  |  |  |  |
| ALL |  |  |  |  |
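
The WER column refers to word error rate. As a point of reference, the sketch below implements the standard definition: word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the toy sentences are hypothetical, and the WER values for the table may well come from an external scoring tool rather than from code like this.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    if not reference:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr

    return prev[-1] / len(reference)


# Toy example (not taken from the dataset): one reference word is missing.
reference = "i agree with this statement".split()
hypothesis = "i agree this statement".split()
wer = word_error_rate(reference, hypothesis)
print(wer)  # 0.2
```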

Before starting the analyses, the following code block loads the required packages and defines global variables.

In [1]:
from typing import List, Tuple, Dict, Generator, Optional
from pathlib import Path

import numpy as np
import pandas as pd

from utils.mfr import logit_2_rating

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

MONOLOGUE_TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA"]
DIALOGUE_TASK = ["WoZ_Interview"]

FILLER = {"uh", "ah", "um", "mm", "hmm", "oh", "mm-hmm", "er", "mhm", "uh-huh", "erm", "huh", "uhu", "mmhmm", "uhhuh"}
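
`FILLER` is not used elsewhere in this section. The sketch below shows one way such a set could be applied, assuming that filler tokens appear verbatim and in lowercase in the `text` column. The helper `count_fillers` and the toy DataFrame are hypothetical and are not part of the notebook.

```python
import pandas as pd

FILLER = {"uh", "ah", "um", "mm", "hmm", "oh", "mm-hmm", "er", "mhm", "uh-huh", "erm", "huh", "uhu", "mmhmm", "uhhuh"}


def count_fillers(df: pd.DataFrame) -> int:
    """Count rows whose text exactly matches one of the filler tokens (hypothetical helper)."""
    return int(df["text"].isin(FILLER).sum())


# Toy annotation rows in the same layout as the CSV files (made-up content)
toy = pd.DataFrame({
    "type": ["01_text", "01_text", "01_text", "01_text", "02_pause"],
    "text": ["um", "i", "agree", "uh", "<CE>"],
})
n_fillers = count_fillers(toy)
print(n_fillers)  # 2
```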

---

## 2. Define Functions

This section defines functions for the preliminary analyses.
The following code block defines two functions: one generates the CSV file paths of the manual and automatic annotation results, and the other loads them.

In [2]:
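def annotation_result_csv_path_generator(task: str, rating_filter: Optional[List[int]] = None) -> Generator[Tuple[Path, Path], None, None]:
    """Yield (manual, automatic) CSV path pairs for a task.

    If rating_filter is given, only files of speakers whose rating is in the list are yielded.
    The automatic CSV path is derived from the manual one and may not exist on disk.
    """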
    load_dir = DATA_DIR / f"{task}/10_SCTK_Inputs"

    if rating_filter is None:
        for manu_csv_path in load_dir.glob("*_manu.csv"):
            filename = manu_csv_path.stem.removesuffix("_manu")
            auto_csv_path = load_dir / f"{filename}_auto.csv"

            yield manu_csv_path, auto_csv_path
    else:
        pf_path = DATA_DIR / f"{task}/12_PF_Rating/pf_rating.csv"
        df_pf = pd.read_csv(pf_path)
        uid_list = df_pf["uid"].to_numpy()

        logit_path = pf_path.parent / "logit_all.csv"
        threshold_path = logit_path.parent / "threshold_all.csv"
        if task == "WoZ_Interview":
            logit_path = pf_path.parent / "logit.csv"
            threshold_path = logit_path.parent / "threshold.csv"
        
        df_logit = pd.read_csv(logit_path, index_col=0)
        rating_list = logit_2_rating(df_logit["theta"], threshold_path)

        mask = np.full(rating_list.shape, False, dtype=bool)
        for rating in rating_filter:
            mask = mask | (rating_list == rating)
        
        uid_list = uid_list[mask]

        for uid in uid_list:
            if task == "WoZ_Interview":
                uid = str(int(uid)).zfill(3)

            filename_pattern = f"{uid}*_manu.csv"
            for manu_csv_path in load_dir.glob(filename_pattern):
                filename = manu_csv_path.stem.removesuffix("_manu")
                auto_csv_path = load_dir / f"{filename}_auto.csv"

                yield manu_csv_path, auto_csv_path

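def load_dataset(
        rating_filter_monologue: Optional[List[int]] = None,
        rating_filter_dialogue: Optional[List[int]] = None,
) -> Dict[str, Dict[str, List[Dict[str, pd.DataFrame]]]]:
    """Load annotation pairs as dataset[mode][task] -> list of {"manual", "automatic"} DataFrames.

    A missing automatic CSV is replaced by an empty DataFrame with only a "text" column.
    """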
    dataset = {
        "monologue": {},
        "dialogue": {}
    }
    
    for monologue_task in MONOLOGUE_TASK:
        dataset["monologue"][monologue_task] = []
        
        for manu_csv_path, auto_csv_path in annotation_result_csv_path_generator(monologue_task, rating_filter=rating_filter_monologue):
            df_manu = pd.read_csv(manu_csv_path)
            df_auto = pd.DataFrame([], columns=["text"])
            if auto_csv_path.exists():
                df_auto = pd.read_csv(auto_csv_path)

            dataset["monologue"][monologue_task].append({
                "manual": df_manu,
                "automatic": df_auto
            })

    for dialogue_task in DIALOGUE_TASK:
        dataset["dialogue"][dialogue_task] = []

        for manu_csv_path, auto_csv_path in annotation_result_csv_path_generator(dialogue_task, rating_filter=rating_filter_dialogue):
            df_manu = pd.read_csv(manu_csv_path, na_values=["", " "], keep_default_na=False)
            df_auto = pd.DataFrame([], columns=["text"])
            if auto_csv_path.exists():
                df_auto = pd.read_csv(auto_csv_path, na_values=["", " "], keep_default_na=False)

            dataset["dialogue"][dialogue_task].append({
                "manual": df_manu,
                "automatic": df_auto
            })

    return dataset
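
The rating filter in `annotation_result_csv_path_generator` builds its boolean mask by OR-ing one comparison per rating. The sketch below shows that `np.isin` gives the same selection. The `uid` and rating arrays are made up for illustration and stand in for the real rating files. As in the function above, the approach assumes that the ratings and the uids are listed in the same row order.

```python
import numpy as np

# Made-up stand-ins for df_pf["uid"] and the output of logit_2_rating
uid_list = np.array(["001", "002", "003", "004", "005"])
rating_list = np.array([0, 3, 1, 6, 2])
rating_filter = [0, 1, 2]

# Loop form, as written in annotation_result_csv_path_generator
mask_loop = np.full(rating_list.shape, False, dtype=bool)
for rating in rating_filter:
    mask_loop = mask_loop | (rating_list == rating)

# Equivalent vectorized form
mask_isin = np.isin(rating_list, rating_filter)

selected = uid_list[mask_isin]
print(selected.tolist())  # ['001', '003', '005']
```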

The following code block defines a function to count the number of words, i.e., rows whose text is not an annotation tag such as `<CE>`.

In [3]:
def count_sample_size(annotation_results: List[Dict[str, pd.DataFrame]]) -> List[int]:
    n_words_manu = []

    for annotation_result in annotation_results:
        df_manu = annotation_result["manual"]

        # Rows whose text contains "<" are tags (e.g., <CE>); the remaining rows are words
        mask_word_manu = ~(df_manu["text"].str.contains("<"))

        n_words_manu.append(mask_word_manu.sum())

    return n_words_manu
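
To make the counting rule concrete, the sketch below applies the same mask to a toy DataFrame in the layout of the annotation CSV files. The rows are made up. Rows whose `text` contains `<` are treated as tags and excluded, so only the word rows are counted.

```python
import pandas as pd

# Toy annotation rows (made-up content, same columns as the CSV files)
toy = pd.DataFrame({
    "type": ["01_text", "01_text", "02_pause", "01_text"],
    "text": ["i", "agree", "<CE>", "this"],
})

# Same rule as in count_sample_size: keep rows that do not contain "<"
is_word = ~(toy["text"].str.contains("<"))
n_words = int(is_word.sum())
print(n_words)  # 3
```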

---

## 3. Preliminary Analyses

This section conducts the preliminary analyses.
The following code block loads the entire dataset.

In [4]:
dataset = load_dataset()

In [5]:
dataset["monologue"]["Arg_Oly"][0]["manual"].head()

Unnamed: 0,start_time,end_time,type,text
0,0.22,0.240063,01_text,i
1,0.300063,0.660125,01_text,agree
2,0.820187,1.04025,01_text,this
3,1.260312,2.080563,01_text,statement
4,2.132178,2.76438,02_pause,<CE>


In [6]:
dataset["monologue"]["Arg_Oly"][0]["automatic"].head()

Unnamed: 0,start_time,end_time,type,text
0,0.22,0.240063,01_text,i
1,0.300063,0.660125,01_text,agree
2,0.820187,1.04025,01_text,this
3,1.260312,2.080563,01_text,statement
4,2.080563,2.74075,02_pause,<CI>


### 3.2. Count Words (Sample Size)

The following code block counts the number of words (the sample size) in each monologue and dialogue task.

In [7]:
for monologue_task in MONOLOGUE_TASK:
    annotation_results = dataset["monologue"][monologue_task]
    n_tags_manu = count_sample_size(annotation_results)
    print(f"[Manual] N_sample_size of {monologue_task} = {sum(n_tags_manu)}")

for dialogue_task in DIALOGUE_TASK:
    annotation_results = dataset["dialogue"][dialogue_task]
    n_tags_manu = count_sample_size(annotation_results)
    print(f"[Manual] N_sample_size of {dialogue_task} = {sum(n_tags_manu)}")

[Manual] N_sample_size of Arg_Oly = 14623
[Manual] N_sample_size of Cartoon = 18818
[Manual] N_sample_size of RtSwithoutRAA = 19228
[Manual] N_sample_size of RtSwithRAA = 19251
[Manual] N_sample_size of WoZ_Interview = 38757
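
The summary table in the introduction also has pooled Monologue and ALL rows. The sketch below derives the pooled sample sizes from the counts printed above. The dictionary `manual_word_counts` is introduced here for illustration only, and the count 38757 is attributed to `WoZ_Interview` because it is produced by the dialogue loop.

```python
MONOLOGUE_TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA"]

# Counts copied from the printed output above
manual_word_counts = {
    "Arg_Oly": 14623,
    "Cartoon": 18818,
    "RtSwithoutRAA": 19228,
    "RtSwithRAA": 19251,
    "WoZ_Interview": 38757,
}

# Pool the four monologue tasks, then all five tasks
n_monologue = sum(manual_word_counts[task] for task in MONOLOGUE_TASK)
n_all = sum(manual_word_counts.values())

print(f"[Manual] N_sample_size of Monologue = {n_monologue}")  # 71920
print(f"[Manual] N_sample_size of ALL = {n_all}")  # 110677
```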


### 5.1. Beginner

The following code block loads beginners' speech.

In [8]:
beginner_dataset = load_dataset(rating_filter_monologue=[0, 1, 2], rating_filter_dialogue=[0, 1])

The following code block counts words in beginners' speech.

In [9]:
for monologue_task in MONOLOGUE_TASK:
    annotation_results = beginner_dataset["monologue"][monologue_task]

    n_tags_manu = count_sample_size(annotation_results)
    print(f"[Manual] N_sample_size of {monologue_task} = {sum(n_tags_manu)}")

for dialogue_task in DIALOGUE_TASK:
    annotation_results = beginner_dataset["dialogue"][dialogue_task]

    n_tags_manu = count_sample_size(annotation_results)
    print(f"[Manual] N_sample_size of {dialogue_task} = {sum(n_tags_manu)}")

[Manual] N_sample_size of Arg_Oly = 3109
[Manual] N_sample_size of Cartoon = 3891
[Manual] N_sample_size of RtSwithoutRAA = 4210
[Manual] N_sample_size of RtSwithRAA = 3930
[Manual] N_sample_size of WoZ_Interview = 7301


### 5.2. Intermediate

The following code block loads intermediate learners' speech.

In [10]:
intermediate_dataset = load_dataset(rating_filter_monologue=[3, 4, 5], rating_filter_dialogue=[2, 3])

The following code block counts words in intermediate learners' speech.

In [11]:
for monologue_task in MONOLOGUE_TASK:
    annotation_results = intermediate_dataset["monologue"][monologue_task]

    n_tags_manu = count_sample_size(annotation_results)
    print(f"[Manual] N_sample_size of {monologue_task} = {sum(n_tags_manu)}")

for dialogue_task in DIALOGUE_TASK:
    annotation_results = intermediate_dataset["dialogue"][dialogue_task]

    n_tags_manu = count_sample_size(annotation_results)
    print(f"[Manual] N_sample_size of {dialogue_task} = {sum(n_tags_manu)}")

[Manual] N_sample_size of Arg_Oly = 7066
[Manual] N_sample_size of Cartoon = 9492
[Manual] N_sample_size of RtSwithoutRAA = 9223
[Manual] N_sample_size of RtSwithRAA = 9622
[Manual] N_sample_size of WoZ_Interview = 24792


### 5.3. Advanced 

The following code block loads advanced learners' speech.

In [12]:
advanced_dataset = load_dataset(rating_filter_monologue=[6, 7, 8], rating_filter_dialogue=[4, 5])

The following code block counts words in advanced learners' speech.

In [13]:
for monologue_task in MONOLOGUE_TASK:
    annotation_results = advanced_dataset["monologue"][monologue_task]

    n_tags_manu = count_sample_size(annotation_results)
    print(f"[Manual] N_sample_size of {monologue_task} = {sum(n_tags_manu)}")

for dialogue_task in DIALOGUE_TASK:
    annotation_results = advanced_dataset["dialogue"][dialogue_task]

    n_tags_manu = count_sample_size(annotation_results)
    print(f"[Manual] N_sample_size of {dialogue_task} = {sum(n_tags_manu)}")

[Manual] N_sample_size of Arg_Oly = 4448
[Manual] N_sample_size of Cartoon = 5435
[Manual] N_sample_size of RtSwithoutRAA = 5795
[Manual] N_sample_size of RtSwithRAA = 5699
[Manual] N_sample_size of WoZ_Interview = 6664
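
The three rating filters cover ratings 0 to 8 for the monologue tasks and 0 to 5 for the dialogue task. Assuming these are all the rating levels, the beginner, intermediate, and advanced counts should add up to the counts for the entire dataset. The sketch below checks this with the printed numbers. The dictionaries are introduced here for illustration, and the last count of each output is attributed to `WoZ_Interview` because it comes from the dialogue loop.

```python
# Counts for the entire dataset, copied from the printed output
counts_all = {
    "Arg_Oly": 14623, "Cartoon": 18818, "RtSwithoutRAA": 19228,
    "RtSwithRAA": 19251, "WoZ_Interview": 38757,
}

# Counts per proficiency group, copied from the printed output
counts_by_level = {
    "beginner": {
        "Arg_Oly": 3109, "Cartoon": 3891, "RtSwithoutRAA": 4210,
        "RtSwithRAA": 3930, "WoZ_Interview": 7301,
    },
    "intermediate": {
        "Arg_Oly": 7066, "Cartoon": 9492, "RtSwithoutRAA": 9223,
        "RtSwithRAA": 9622, "WoZ_Interview": 24792,
    },
    "advanced": {
        "Arg_Oly": 4448, "Cartoon": 5435, "RtSwithoutRAA": 5795,
        "RtSwithRAA": 5699, "WoZ_Interview": 6664,
    },
}

# Collect tasks whose group counts do not add up to the full count
mismatches = [
    task
    for task, n_all in counts_all.items()
    if sum(level[task] for level in counts_by_level.values()) != n_all
]
print(mismatches)  # []
```

With the printed counts, every task passes this check, so the three groups partition the loaded files without gaps or overlaps.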
