5/21 (Tue) | Experiment

# Preliminary Analyses of Annotation

## 1. Introduction

This notebook conducts preliminary analyses of the manual and automatic annotation results.
The goal of the current analyses is to fill in the following table.

| Task | WER | N disfluency (Manual / Automatic) | N MCP (Manual / Automatic) | N ECP (Manual / Automatic) |
| - | - | - | - | - |
| WoZ_Interview |  |  |  |  |

Before starting the analyses, the following code block loads the required packages and defines global variables.

In [7]:
from typing import List, Tuple, Dict, Generator, Optional
from pathlib import Path

import numpy as np
import pandas as pd
from jiwer import wer

from utils.mfr import logit_2_rating

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

MONOLOGUE_TASK = []
DIALOGUE_TASK = ["WoZ_Interview"]

FILLER = {"uh", "ah", "um", "mm", "hmm", "oh", "mm-hmm", "er", "mhm", "uh-huh", "erm", "huh", "uhu", "mmhmm", "uhhuh"}

---

## 2. Define Functions

This section defines the functions used in the preliminary analyses.
The following code block defines two functions: one generates the CSV file paths of the manual and automatic annotation results, and the other loads them.

In [8]:
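def annotation_result_csv_path_generator(task: str, rating_filter: Optional[List[int]] = None) -> Generator[Tuple[Path, Path], None, None]:
    """Yield (manual, automatic) CSV path pairs for a task, optionally restricted to the given ratings."""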
    load_dir = DATA_DIR / f"{task}/10_SCTK_Inputs"

    if rating_filter is None:
        for manu_csv_path in load_dir.glob("*_manu.csv"):
            filename = manu_csv_path.stem.removesuffix("_manu")
            auto_csv_path = load_dir / f"{filename}_auto_roberta_L1.csv"

            yield manu_csv_path, auto_csv_path
    else:
        pf_path = DATA_DIR / f"{task}/12_PF_Rating/pf_rating.csv"
        df_pf = pd.read_csv(pf_path)
        uid_list = df_pf["uid"].to_numpy()

        logit_path = pf_path.parent / "logit.csv"
        threshold_path = logit_path.parent / "threshold.csv"
        
        df_logit = pd.read_csv(logit_path, index_col=0)
        rating_list = logit_2_rating(df_logit["theta"], threshold_path)

        mask = np.full(rating_list.shape, False, dtype=bool)
        for rating in rating_filter:
            mask = mask | (rating_list == rating)
        
        uid_list = uid_list[mask]

        for uid in uid_list:
            if task == "WoZ_Interview":
                uid = str(int(uid)).zfill(3)

            filename_pattern = f"{uid}*_manu.csv"
            for manu_csv_path in load_dir.glob(filename_pattern):
                filename = manu_csv_path.stem.removesuffix("_manu")
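                # NOTE: this branch pairs each manual file with `_auto_bert.csv`,
                # whereas the unfiltered branch above uses `_auto_roberta_L1.csv`.
                auto_csv_path = load_dir / f"{filename}_auto_bert.csv"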

                yield manu_csv_path, auto_csv_path

def load_dataset(
        rating_filter_monologue: Optional[List[int]] = None,
        rating_filter_dialogue: Optional[List[int]] = None,
) -> Dict[str, Dict[str, List[Dict[str, pd.DataFrame]]]]:
    """Load manual/automatic annotation pairs for every task, grouped by task type."""
    dataset = {
        "monologue": {},
        "dialogue": {}
    }
    
    for monologue_task in MONOLOGUE_TASK:
        dataset["monologue"][monologue_task] = []
        
        for manu_csv_path, auto_csv_path in annotation_result_csv_path_generator(monologue_task, rating_filter=rating_filter_monologue):
            df_manu = pd.read_csv(manu_csv_path)
            df_auto = pd.DataFrame([], columns=["text"])
            if auto_csv_path.exists():
                df_auto = pd.read_csv(auto_csv_path)

            dataset["monologue"][monologue_task].append({
                "manual": df_manu,
                "automatic": df_auto
            })

    for dialogue_task in DIALOGUE_TASK:
        dataset["dialogue"][dialogue_task] = []

        for manu_csv_path, auto_csv_path in annotation_result_csv_path_generator(dialogue_task, rating_filter=rating_filter_dialogue):
            df_manu = pd.read_csv(manu_csv_path, na_values=["", " "], keep_default_na=False)
            df_auto = pd.DataFrame([], columns=["text"])
            if auto_csv_path.exists():
                df_auto = pd.read_csv(auto_csv_path, na_values=["", " "], keep_default_na=False)

            dataset["dialogue"][dialogue_task].append({
                "manual": df_manu,
                "automatic": df_auto
            })

    return dataset
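
The file-selection logic in the filtered branch can be illustrated without any data on disk. The following is a minimal stdlib sketch, assuming file names of the form `007_01_manu.csv` (a made-up example); the helper names `select_uids`, `manu_pattern`, and `auto_name` are hypothetical and do not exist in the notebook, and `fnmatch` only approximates `Path.glob` for this simple pattern.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath
from typing import List, Sequence


def select_uids(uids: Sequence, ratings: Sequence[int], rating_filter: List[int]) -> list:
    """Keep the uids whose rating is in rating_filter (plain-Python analogue of the boolean mask)."""
    wanted = set(rating_filter)
    return [uid for uid, rating in zip(uids, ratings) if rating in wanted]


def manu_pattern(uid) -> str:
    """Zero-pad the uid to three digits and build the glob pattern for manual CSVs."""
    return f"{str(int(uid)).zfill(3)}*_manu.csv"


def auto_name(manu_name: str, suffix: str = "_auto_bert.csv") -> str:
    """Derive the automatic-annotation file name from a manual-annotation file name."""
    stem = PurePosixPath(manu_name).stem.removesuffix("_manu")
    return f"{stem}{suffix}"


# Toy values (assumptions): three speakers with ratings 0, 3, and 1.
selected = select_uids([7.0, 12.0, 105.0], [0, 3, 1], rating_filter=[0, 1])
patterns = [manu_pattern(uid) for uid in selected]

print(selected)
print(patterns)
print(fnmatch("007_01_manu.csv", patterns[0]))
print(auto_name("007_01_manu.csv"))
```

Note that `str.removesuffix` requires Python 3.9 or later, as does the notebook code itself.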

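The following code block defines a function to calculate the word error rate (WER), using the manual transcripts as references and the automatic transcripts as hypotheses.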

In [9]:
def calculate_wer(annotation_results: List[Dict[str, pd.DataFrame]], remove_filer: bool = False) -> float:
    """Compute WER after dropping tag rows (text ending with ">") and, optionally, filler words."""
    ref = []
    hyp = []

    for annotation_result in annotation_results:
        df_manu = annotation_result["manual"]
        df_auto = annotation_result["automatic"]

        mask_tag_manu = df_manu["text"].astype(str).str.endswith(">")
        mask_tag_auto = df_auto["text"].astype(str).str.endswith(">")

        df_manu = df_manu[~mask_tag_manu]
        df_auto = df_auto[~mask_tag_auto]

        if remove_filer:
            # Drop rows whose text is one of the filler words.
            df_manu = df_manu[~df_manu["text"].isin(FILLER)]
            df_auto = df_auto[~df_auto["text"].isin(FILLER)]

        text_manu = " ".join(df_manu["text"].astype(str))
        text_auto = " ".join(df_auto["text"].astype(str))

        if len(text_manu) == 0 or len(text_auto) == 0:
            continue

        ref.append(text_manu)
        hyp.append(text_auto)

    return wer(ref, hyp)
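
To make the metric concrete, here is a minimal pure-Python sketch of the same pipeline on toy token lists: tags and fillers are removed, then WER is computed as (substitutions + deletions + insertions) / (number of reference words). It assumes that errors and reference words are pooled over all transcript pairs, which is how `jiwer.wer` is understood to treat a list of sentences. The names `clean`, `edit_distance`, and `pooled_wer`, the reduced filler set, and the example tokens are all illustrative assumptions.

```python
from typing import List

TOY_FILLERS = {"uh", "um", "er"}  # toy subset of FILLER, for illustration only


def clean(tokens: List[str], remove_filler: bool = False) -> List[str]:
    """Drop tag tokens (ending with '>') and, optionally, filler words."""
    kept = [t for t in tokens if not t.endswith(">")]
    if remove_filler:
        kept = [t for t in kept if t not in TOY_FILLERS]
    return kept


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pooled_wer(refs: List[List[str]], hyps: List[List[str]]) -> float:
    """Total errors divided by total reference words, skipping empty pairs."""
    pairs = [(r, h) for r, h in zip(refs, hyps) if r and h]
    errors = sum(edit_distance(r, h) for r, h in pairs)
    n_ref_words = sum(len(r) for r, _ in pairs)
    return errors / n_ref_words


manual = [["i", "uh", "like", "<CI>", "cats"], ["it", "is", "<DISFLUENCY>", "fine", "<CE>"]]
automatic = [["i", "like", "<CI>", "cat"], ["it", "fine", "<CE>"]]

refs = [clean(tokens, remove_filler=True) for tokens in manual]
hyps = [clean(tokens, remove_filler=True) for tokens in automatic]
toy_wer = pooled_wer(refs, hyps)
print(f"toy WER = {toy_wer:.3f}")
```

In this toy case there is one substitution (`cats` vs. `cat`) and one deletion (`is`) against six reference words, so the WER is 2/6.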

The following code block defines a function to count the number of tags.

In [10]:
def count_tags(annotation_results: List[Dict[str, pd.DataFrame]], target_tag: str) -> Tuple[List[int], List[int]]:
    n_tag_manu = []
    n_tag_auto = []

    for annotation_result in annotation_results:
        df_manu = annotation_result["manual"]
        df_auto = annotation_result["automatic"]

        mask_tag_manu = (df_manu["text"] == target_tag)
        mask_tag_auto = (df_auto["text"] == target_tag)

        n_tag_manu.append(mask_tag_manu.sum())
        n_tag_auto.append(mask_tag_auto.sum())

    return n_tag_manu, n_tag_auto
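
The counting logic amounts to an exact string match against the tag, per transcript. A minimal stdlib sketch on toy token lists follows; the mapping of labels to tags reflects the tags used later in this notebook (`<DISFLUENCY>`, `<CI>` for MCP, `<CE>` for ECP), while the function name `count_tag` and the example tokens are illustrative assumptions.

```python
from typing import Dict, List

# Label-to-tag mapping as used in the analysis sections of this notebook.
TAGS: Dict[str, str] = {"disfluency": "<DISFLUENCY>", "MCP": "<CI>", "ECP": "<CE>"}


def count_tag(transcripts: List[List[str]], target_tag: str) -> List[int]:
    """Return the number of exact occurrences of target_tag in each transcript."""
    return [tokens.count(target_tag) for tokens in transcripts]


# Toy transcripts (assumptions), one token list per speaker.
toy_manual = [
    ["i", "<CI>", "went", "<DISFLUENCY>", "home", "<CE>"],
    ["well", "<CI>", "yes", "<CI>", "ok", "<CE>"],
]

per_label = {label: count_tag(toy_manual, tag) for label, tag in TAGS.items()}
for label, counts in per_label.items():
    print(f"N_{label} per transcript = {counts}, total = {sum(counts)}")
```

Returning per-transcript counts rather than a single total, as `count_tags` does, keeps the option of comparing manual and automatic counts speaker by speaker.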

---

## 3. Preliminary Analyses

This section conducts the preliminary analyses.
The following code block loads the entire dataset.

In [11]:
dataset = load_dataset()

### 3.1. WER

The following code block calculates the WER of each dialogue task.

In [12]:
dialogue_data = []
for dialogue_task in DIALOGUE_TASK:
    annotation_results = dataset["dialogue"][dialogue_task]

    res = calculate_wer(annotation_results, remove_filer=True)

    print(f"WER of {dialogue_task} = {res}")

    dialogue_data += annotation_results

WER of WoZ_Interview = 0.15021711724331138


### 3.2. Count Disfluency

The following code block counts the disfluency tags (`<DISFLUENCY>`) in dialogue tasks.

In [13]:
for dialogue_task in DIALOGUE_TASK:
    annotation_results = dataset["dialogue"][dialogue_task]

    n_tags_manu, n_tags_auto = count_tags(annotation_results, "<DISFLUENCY>")

    print(f"[Manual] N_disfluency of {dialogue_task} = {sum(n_tags_manu)}")
    print(f"[Automatic] N_disfluency of {dialogue_task} = {sum(n_tags_auto)}")

[Manual] N_disfluency of WoZ_Interview = 3935
[Automatic] N_disfluency of WoZ_Interview = 3178


### 3.3. Count Mid-Clause Pauses

The following code block counts the mid-clause pauses (MCP), tagged `<CI>`, in dialogue tasks.

In [14]:
for dialogue_task in DIALOGUE_TASK:
    annotation_results = dataset["dialogue"][dialogue_task]

    n_tags_manu, n_tags_auto = count_tags(annotation_results, "<CI>")

    print(f"[Manual] N_MCP of {dialogue_task} = {sum(n_tags_manu)}")
    print(f"[Automatic] N_MCP of {dialogue_task} = {sum(n_tags_auto)}")

[Manual] N_MCP of WoZ_Interview = 7574
[Automatic] N_MCP of WoZ_Interview = 10316


### 3.4. Count End-Clause Pauses

The following code block counts the end-clause pauses (ECP), tagged `<CE>`, in dialogue tasks.

In [16]:
for dialogue_task in DIALOGUE_TASK:
    annotation_results = dataset["dialogue"][dialogue_task]

    n_tags_manu, n_tags_auto = count_tags(annotation_results, "<CE>")

    print(f"[Manual] N_ECP of {dialogue_task} = {sum(n_tags_manu)}")
    print(f"[Automatic] N_ECP of {dialogue_task} = {sum(n_tags_auto)}")

[Manual] N_ECP of WoZ_Interview = 2414
[Automatic] N_ECP of WoZ_Interview = 2652


---

## 4. Summary

As a result, the following table was obtained.

| Task | WER | N disfluency (Manual / Auto_RoBERTa / Auto_RoBERTa_L1) | N MCP (Manual / Auto_RoBERTa / Auto_RoBERTa_L1) | N ECP (Manual / Auto_RoBERTa / Auto_RoBERTa_L1) |
| - | - | - | - | - |
| WoZ_Interview | 15.0% | 3,935 / 3,514 / 3,178 | 7,574 / 10,322 / 10,316 | 2,414 / 2,646 / 2,652 |
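
To read the table more easily, the automatic counts can be expressed as a ratio of the manual counts. The sketch below uses only the numbers in the table above; the dictionary and variable names are illustrative.

```python
# (manual, Auto_RoBERTa, Auto_RoBERTa_L1) counts copied from the summary table.
SUMMARY_COUNTS = {
    "disfluency": (3935, 3514, 3178),
    "MCP": (7574, 10322, 10316),
    "ECP": (2414, 2646, 2652),
}

ratios = {}
for label, (manual, auto_roberta, auto_roberta_l1) in SUMMARY_COUNTS.items():
    # A ratio below 1 means fewer automatic tags than manual tags, above 1 means more.
    ratios[label] = (auto_roberta / manual, auto_roberta_l1 / manual)
    print(f"{label}: Auto_RoBERTa/Manual = {ratios[label][0]:.3f}, "
          f"Auto_RoBERTa_L1/Manual = {ratios[label][1]:.3f}")
```

By this measure, both automatic systems produce fewer disfluency tags than the manual annotation (ratios of roughly 0.89 and 0.81), roughly 36% more MCP tags, and about 10% more ECP tags. These ratios compare totals only and say nothing about whether individual tags are placed correctly.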

## 5. Additional Analyses

This section conducts the same analyses for each PF group.

### 5.1. Beginners

The following code block loads beginners' speech.

In [17]:
beginner_dataset = load_dataset(rating_filter_monologue=[0, 1, 2], rating_filter_dialogue=[0, 1])

The following code block calculates WER of beginners' speech.

In [18]:
dialogue_data = []
for dialogue_task in DIALOGUE_TASK:
    annotation_results = beginner_dataset["dialogue"][dialogue_task]

    res = calculate_wer(annotation_results, remove_filer=True)

    print(f"WER of {dialogue_task} = {res}")

    dialogue_data += annotation_results

WER of WoZ_Interview = 0.2485251852972319


The following code block counts disfluency tags in beginners' speech.

In [19]:
for dialogue_task in DIALOGUE_TASK:
    annotation_results = beginner_dataset["dialogue"][dialogue_task]

    n_tags_manu, n_tags_auto = count_tags(annotation_results, "<DISFLUENCY>")

    print(f"[Manual] N_disfluency of {dialogue_task} = {sum(n_tags_manu)}")
    print(f"[Automatic] N_disfluency of {dialogue_task} = {sum(n_tags_auto)}")

[Manual] N_disfluency of WoZ_Interview = 955
[Automatic] N_disfluency of WoZ_Interview = 573


The following code block counts MCP tags in beginners' speech.

In [20]:
for dialogue_task in DIALOGUE_TASK:
    annotation_results = beginner_dataset["dialogue"][dialogue_task]

    n_tags_manu, n_tags_auto = count_tags(annotation_results, "<CI>")

    print(f"[Manual] N_MCP of {dialogue_task} = {sum(n_tags_manu)}")
    print(f"[Automatic] N_MCP of {dialogue_task} = {sum(n_tags_auto)}")

[Manual] N_MCP of WoZ_Interview = 1913
[Automatic] N_MCP of WoZ_Interview = 2061


The following code block counts ECP tags in beginners' speech.

In [21]:
for dialogue_task in DIALOGUE_TASK:
    annotation_results = beginner_dataset["dialogue"][dialogue_task]

    n_tags_manu, n_tags_auto = count_tags(annotation_results, "<CE>")

    print(f"[Manual] N_ECP of {dialogue_task} = {sum(n_tags_manu)}")
    print(f"[Automatic] N_ECP of {dialogue_task} = {sum(n_tags_auto)}")

[Manual] N_ECP of WoZ_Interview = 472
[Automatic] N_ECP of WoZ_Interview = 451


### 5.2. Intermediate

The following code block loads intermediate learners' speech.

In [22]:
intemediate_dataset = load_dataset(rating_filter_monologue=[3, 4, 5], rating_filter_dialogue=[2, 3])

The following code block calculates WER of intermediate learners' speech.

In [23]:
dialogue_data = []
for dialogue_task in DIALOGUE_TASK:
    annotation_results = intemediate_dataset["dialogue"][dialogue_task]

    res = calculate_wer(annotation_results, remove_filer=True)

    print(f"WER of {dialogue_task} = {res}")

    dialogue_data += annotation_results

WER of WoZ_Interview = 0.14034321645342998


The following code block counts disfluency tags in intermediate learners' speech.

In [24]:
for dialogue_task in DIALOGUE_TASK:
    annotation_results = intemediate_dataset["dialogue"][dialogue_task]

    n_tags_manu, n_tags_auto = count_tags(annotation_results, "<DISFLUENCY>")

    print(f"[Manual] N_disfluency of {dialogue_task} = {sum(n_tags_manu)}")
    print(f"[Automatic] N_disfluency of {dialogue_task} = {sum(n_tags_auto)}")

[Manual] N_disfluency of WoZ_Interview = 2662
[Automatic] N_disfluency of WoZ_Interview = 1790


The following code block counts MCP tags in intermediate learners' speech.

In [25]:
for dialogue_task in DIALOGUE_TASK:
    annotation_results = intemediate_dataset["dialogue"][dialogue_task]

    n_tags_manu, n_tags_auto = count_tags(annotation_results, "<CI>")

    print(f"[Manual] N_MCP of {dialogue_task} = {sum(n_tags_manu)}")
    print(f"[Automatic] N_MCP of {dialogue_task} = {sum(n_tags_auto)}")

[Manual] N_MCP of WoZ_Interview = 4911
[Automatic] N_MCP of WoZ_Interview = 6996


The following code block counts ECP tags in intermediate learners' speech.

In [26]:
for dialogue_task in DIALOGUE_TASK:
    annotation_results = intemediate_dataset["dialogue"][dialogue_task]

    n_tags_manu, n_tags_auto = count_tags(annotation_results, "<CE>")

    print(f"[Manual] N_ECP of {dialogue_task} = {sum(n_tags_manu)}")
    print(f"[Automatic] N_ECP of {dialogue_task} = {sum(n_tags_auto)}")

[Manual] N_ECP of WoZ_Interview = 1682
[Automatic] N_ECP of WoZ_Interview = 1839


### 5.3. Advanced

The following code block loads advanced learners' speech.

In [27]:
advanced_dataset = load_dataset(rating_filter_monologue=[6, 7, 8], rating_filter_dialogue=[4, 5])

The following code block calculates WER of advanced learners' speech.

In [28]:
dialogue_data = []
for dialogue_task in DIALOGUE_TASK:
    annotation_results = advanced_dataset["dialogue"][dialogue_task]

    res = calculate_wer(annotation_results, remove_filer=True)

    print(f"WER of {dialogue_task} = {res}")

    dialogue_data += annotation_results

WER of WoZ_Interview = 0.081675562024907


The following code block counts disfluency tags in advanced learners' speech.

In [29]:
for dialogue_task in DIALOGUE_TASK:
    annotation_results = advanced_dataset["dialogue"][dialogue_task]

    n_tags_manu, n_tags_auto = count_tags(annotation_results, "<DISFLUENCY>")

    print(f"[Manual] N_disfluency of {dialogue_task} = {sum(n_tags_manu)}")
    print(f"[Automatic] N_disfluency of {dialogue_task} = {sum(n_tags_auto)}")

[Manual] N_disfluency of WoZ_Interview = 318
[Automatic] N_disfluency of WoZ_Interview = 214


The following code block counts MCP tags in advanced learners' speech.

In [30]:
for dialogue_task in DIALOGUE_TASK:
    annotation_results = advanced_dataset["dialogue"][dialogue_task]

    n_tags_manu, n_tags_auto = count_tags(annotation_results, "<CI>")

    print(f"[Manual] N_MCP of {dialogue_task} = {sum(n_tags_manu)}")
    print(f"[Automatic] N_MCP of {dialogue_task} = {sum(n_tags_auto)}")

[Manual] N_MCP of WoZ_Interview = 750
[Automatic] N_MCP of WoZ_Interview = 1231


The following code block counts ECP tags in advanced learners' speech.

In [31]:
for dialogue_task in DIALOGUE_TASK:
    annotation_results = advanced_dataset["dialogue"][dialogue_task]

    n_tags_manu, n_tags_auto = count_tags(annotation_results, "<CE>")

    print(f"[Manual] N_ECP of {dialogue_task} = {sum(n_tags_manu)}")
    print(f"[Automatic] N_ECP of {dialogue_task} = {sum(n_tags_auto)}")

[Manual] N_ECP of WoZ_Interview = 260
[Automatic] N_ECP of WoZ_Interview = 390
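
As a sanity check on the per-group results, the group counts can be summed and compared with the overall counts printed in Section 3. If the three rating groups partition the speakers, the manual totals should agree. The sketch below uses only numbers printed in this notebook; the dictionary and function names are illustrative.

```python
# Counts copied from the notebook outputs (Section 3 and Sections 5.1 to 5.3).
OVERALL = {
    "manual": {"disfluency": 3935, "MCP": 7574, "ECP": 2414},
    "automatic": {"disfluency": 3178, "MCP": 10316, "ECP": 2652},
}
GROUPS = {
    "beginner": {
        "manual": {"disfluency": 955, "MCP": 1913, "ECP": 472},
        "automatic": {"disfluency": 573, "MCP": 2061, "ECP": 451},
    },
    "intermediate": {
        "manual": {"disfluency": 2662, "MCP": 4911, "ECP": 1682},
        "automatic": {"disfluency": 1790, "MCP": 6996, "ECP": 1839},
    },
    "advanced": {
        "manual": {"disfluency": 318, "MCP": 750, "ECP": 260},
        "automatic": {"disfluency": 214, "MCP": 1231, "ECP": 390},
    },
}


def group_totals(source: str) -> dict:
    """Sum the per-group counts for one annotation source ('manual' or 'automatic')."""
    return {
        label: sum(group[source][label] for group in GROUPS.values())
        for label in OVERALL[source]
    }


for source in ("manual", "automatic"):
    for label, total in group_totals(source).items():
        overall = OVERALL[source][label]
        status = "match" if total == overall else "differ"
        print(f"[{source}] {label}: groups = {total}, overall = {overall} ({status})")
```

The manual counts sum exactly to the overall counts, whereas the automatic counts do not. One possible explanation is that the filtered branch of `annotation_result_csv_path_generator` reads `_auto_bert.csv` files while the unfiltered branch reads `_auto_roberta_L1.csv` files, so the per-group automatic counts may come from a different model than the overall ones.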
