5/14 (Tue) | Experiment

# Evaluation of Reliability of Automatic Annotation

## 1. Introduction

This notebook evaluates the reliability of the proposed automatic temporal feature annotation system.
More specifically, I evaluate the system in terms of the following metrics:

- Cohen's kappa
- Accuracy score
- Precision
- Recall
- F1 Score
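
As a rough illustration of what the first two metrics measure, the sketch below computes Cohen's kappa and accuracy for two toy label sequences in plain Python. It is not the `RCohenKappa` wrapper used later in this notebook; the function name `toy_cohen_kappa` and the example labels are made up for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def toy_cohen_kappa(true: Sequence[str], pred: Sequence[str]) -> Tuple[float, float]:
    """Return (kappa, accuracy) for two equally long label sequences."""
    n = len(true)
    # Observed agreement is the same quantity as accuracy.
    p_observed = sum(t == p for t, p in zip(true, pred)) / n

    # Chance agreement from the marginal label frequencies of each sequence.
    true_counts = Counter(true)
    pred_counts = Counter(pred)
    p_expected = sum(true_counts[label] * pred_counts[label] for label in true_counts) / (n * n)

    kappa = (p_observed - p_expected) / (1 - p_expected)
    return kappa, p_observed


# Hypothetical toy labels, not taken from the experiment data.
true = ["<ci>", "<word>", "<word>", "<ci>"]
pred = ["<ci>", "<word>", "<ci>", "<ci>"]
kappa, acc = toy_cohen_kappa(true, pred)
print(f"kappa = {kappa:.3f}, accuracy = {acc:.3f}")
```

Kappa discounts the agreement that two annotators would reach by chance, which is why it can be much lower than accuracy when one label dominates.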

Before starting the evaluation, the following code block loads required packages and defines global variables.

In [1]:
from typing import List, Tuple, Generator, Optional
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

from utils.mfr import logit_2_rating
from utils.cohen_kappa import RCohenKappa
r_cohen_kappa = RCohenKappa(debug=False)

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA", "WoZ_Interview"]

---

## 2. Define Functions

This section defines functions for the analyses.
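The following code block defines a generator that yields, for each matching file in `11_SCTK_Outputs`, the true and predicted label sequences.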

In [2]:
def label_generator(
        task: str,
        ignore_tag: str, 
        tags: Tuple[str, ...] =("<disfluency>", "<ci>", "<ce>", "<filler>"), 
        word: str ="<word>",
        rating_filter: Optional[List[int]] =None,
        bert: bool =False
) -> Generator[Tuple[list, list], None, None]:
    load_dir = DATA_DIR / f"{task}/11_SCTK_Outputs"

    pf_path = DATA_DIR / f"{task}/12_PF_Rating/pf_rating.csv"
    df_pf = pd.read_csv(pf_path)
    uid_list = df_pf["uid"].to_numpy()

    if rating_filter is not None:
        logit_path = pf_path.parent / "logit_all.csv"
        threshold_path = logit_path.parent / "threshold_all.csv"
        
        df_logit = pd.read_csv(logit_path, index_col=0)
        rating_list = logit_2_rating(df_logit["theta"], threshold_path)

        mask = np.full(rating_list.shape, False, dtype=bool)
        for rating in rating_filter:
            mask = mask | (rating_list == rating)
        
        uid_list = uid_list[mask]

    for uid in uid_list:
        if task == "WoZ_Interview":
            uid = str(int(uid)).zfill(3)

        filename_pattern = f"{uid}*_ignore_{ignore_tag}.txt"
        if bert:
            filename_pattern = f"{uid}*_ignore_{ignore_tag}_bert.txt"
        for filename in load_dir.glob(filename_pattern):
            with open(filename, "r") as f:
                true = []
                pred = []
                flag = 0

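                for line in f:
                    # Skip lines that start with "<".
                    if line[0] == "<":
                        continue

                    line = line.replace("\n", "")

                    # A single uppercase letter marks the start of a new
                    # true/predicted pair.
                    if len(line) == 1 and line.isupper():
                        flag += 1
                        continue

                    # Strip quotes; any token that is not a target tag counts as a word.
                    line = line.replace("\"", "")
                    if line not in tags:
                        line = word

                    # The first token after a marker goes to `true`,
                    # the second goes to `pred`.
                    if flag == 1:
                        true.append(line)
                        flag += 1
                    elif flag == 2:
                        pred.append(line)
                        flag = 0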

            yield true, pred
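
To make the line-by-line parsing easier to follow, the sketch below applies the same flag logic to an in-memory list of lines. The helper name `parse_alignment_lines` and the toy input are assumptions made for illustration only; the real files in `11_SCTK_Outputs` may look different.

```python
from typing import Iterable, List, Tuple


def parse_alignment_lines(
        lines: Iterable[str],
        tags: Tuple[str, ...] = ("<disfluency>", "<ci>", "<ce>", "<filler>"),
        word: str = "<word>"
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper that mirrors the inner loop of label_generator."""
    true, pred = [], []
    flag = 0
    for line in lines:
        if line[0] == "<":
            continue  # lines starting with "<" are skipped
        line = line.replace("\n", "")
        if len(line) == 1 and line.isupper():
            flag += 1  # a single uppercase letter opens a new pair
            continue
        line = line.replace("\"", "")
        if line not in tags:
            line = word  # anything that is not a target tag counts as a word
        if flag == 1:
            true.append(line)
            flag += 1
        elif flag == 2:
            pred.append(line)
            flag = 0
    return true, pred


# Made-up input for illustration only.
toy_lines = ['<header>\n', 'C\n', '"hello"\n', '"hello"\n', 'S\n', '"<ci>"\n', '"<ce>"\n']
true, pred = parse_alignment_lines(toy_lines)
print(true)
print(pred)
```

With this toy input, the quoted word pair collapses to `<word>` on both sides, while the quoted tags are kept as they are.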

The following code block defines a function to convert labels to ids.

In [3]:
def tag_2_id(
        tag_list: List[str], 
        tags: List[str] =["<disfluency>", "<ci>", "<ce>", "<filler>", "<word>"]
) -> List[int]:
    tag_id_list = []
    for tag in tag_list:
        # "<ce>" is mapped onto "<ci>", so both pause tags share a single id.
        if tag == "<ce>":
            tag = "<ci>"

        # The id is the position of the tag in `tags`.
        i = tags.index(tag)
        tag_id_list.append(i)

    return tag_id_list
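
A short usage sketch may help here. It uses a condensed copy of `tag_2_id` so that the snippet runs on its own, and the input list is made up. With the pause-location tag set, `<ce>` ends up with the same id as `<ci>`, so id 1 never appears in the output.

```python
from typing import List


def tag_2_id(tag_list: List[str], tags: List[str]) -> List[int]:
    # Condensed copy of the function above so that this snippet runs on its own.
    return [tags.index("<ci>" if tag == "<ce>" else tag) for tag in tag_list]


pause_tags = ["<ci>", "<ce>", "<word>"]
ids = tag_2_id(["<ci>", "<ce>", "<word>", "<ce>"], tags=pause_tags)
print(ids)  # [0, 0, 2, 0]
```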

The following code block defines a function to calculate Cronbach's Alpha.

In [4]:
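def cronbach_alpha(true: List[int], pred: List[int]) -> float:
    # Rows are the two annotation sources (items); columns are tokens (observations).
    mtx = np.array([true, pred])
    n_items = mtx.shape[0]

    # Variance of each item across tokens, summed over items.
    var_by_items = np.var(mtx, axis=1)
    sum_var_by_items = np.sum(var_by_items)

    # Variance of the per-token total score.
    items_sum = np.sum(mtx, axis=0)
    var_items_sum = np.var(items_sum)

    alpha = n_items / (n_items - 1) * (1 - sum_var_by_items / var_items_sum)

    return alpha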
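
For reference, the standard definition of Cronbach's alpha is written out below. The symbols are not taken from the notebook: $k$ is the number of items, $\sigma^2_{Y_i}$ is the variance of item $i$ across observations, and $\sigma^2_X$ is the variance of the total score, that is, the sum over items for each observation. In an inter-annotator setting, one common reading treats each annotation source as an item and each token as an observation.

$$
\alpha = \frac{k}{k - 1} \left( 1 - \frac{\sum_{i=1}^{k} \sigma^2_{Y_i}}{\sigma^2_X} \right)
$$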

The following code block defines a function to calculate reliability metrics.

In [5]:
def evaluate_reliability(
        task_list: List[str], 
        ignore_tag: str,
        rating_filter: Optional[List[int]] =None
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    all_true = []
    all_pred = []

    if ignore_tag == "<CI>-<CE>-<FILLER>":
        tags = ["<disfluency>", "<word>"]
        print("--- Analysis of Disfluency ---")

    elif ignore_tag == "<CI>-<CE>":
        tags = ["<disfluency>", "<filler>", "<word>"]
        print("--- Analysis of Disfluency & Filler ---")

    elif ignore_tag == "<DISFLUENCY>-<FILLER>":
        tags = ["<ci>", "<ce>", "<word>"]
        print("--- Analysis of Pause Location ---")

    else:
        # Fail early instead of hitting an undefined `tags` further down.
        raise ValueError(f"Unsupported ignore_tag: {ignore_tag}")

    print("- Tasks ... ", end="")

    sample_size = 0
    n_data = 0
    for task in task_list:
        print(f"{task}, ", end="")
        for true, pred in label_generator(task, ignore_tag, rating_filter=rating_filter):
            true = tag_2_id(true, tags=tags)
            pred = tag_2_id(pred, tags=tags)

            all_true += true
            all_pred += pred
            sample_size += len(true)
            n_data += 1

    if rating_filter is not None:
        print(f"\n- Target rating ... {rating_filter}")
    else:
        print()
    print(f"- Sample size = {sample_size}")
    print(f"- Data size   = {n_data}")

    acc = accuracy_score(all_true, all_pred)
    p, r, f, _ = precision_recall_fscore_support(all_true, all_pred)
    # a = cronbach_alpha(all_true, all_pred)
    kappa, _, lower, upper = r_cohen_kappa.cohen_kappa(all_true, all_pred)

    print(f"\n- Metrics")
    # print(f"\tCronbach Alpha: {a}")
    print(f"\tCohen's Kappa:  {kappa:.03f} (|{lower:.03f} - {upper:.03f}|)")
    print(f"\tAccuracy:       {acc:.03f}\n")

    print(f"\tLabels    | {np.array(tags)}")
    print(f"\tPrecision | {p}") 
    print(f"\tRecall    | {r}")
    print(f"\tF1 score  | {f}")

    data_kappa = [[f"{n_data:,}", f"{sample_size:,}", f"{kappa:.03f}",f"[{lower:.03f}, {upper:.03f}]"]]
    df_kappa = pd.DataFrame(data_kappa, columns=["N_data", "Sample size", "Kappa", "95% CI"])
    if "<CI>-<CE>" in ignore_tag:
        data_cfmx = [[f"{p[0]:.03f}", f"{r[0]:.03f}", f"{f[0]:.03f}"]]
        df_cfmx = pd.DataFrame(data_cfmx, columns=["P_disfl", "R_disfl", "F1_disfl"])
    else:
        data_cfmx = [[f"{p[0]:.03f}", f"{r[0]:.03f}", f"{f[0]:.03f}", f"{p[1]:.03f}", f"{r[1]:.03f}", f"{f[1]:.03f}"]]
        df_cfmx = pd.DataFrame(data_cfmx, columns=["P_mcp", "R_mcp", "F1_mcp", "P_ecp", "R_ecp", "F1_ecp"])
    
    if rating_filter is None:
        idx_name = f"{task}_00All"
    elif rating_filter == [0, 1, 2] or rating_filter == [0, 1]:
        idx_name = f"{task}_01Low"
    elif rating_filter == [3, 4, 5] or rating_filter == [2, 3]:
        idx_name = f"{task}_02Mid"
    else:
        idx_name = f"{task}_03High"
    
    df_kappa.index = [idx_name]
    df_cfmx.index = [idx_name]

    return df_kappa, df_cfmx

---

## 3. Reliability Analyses

This section conducts reliability analyses.

### 3.2. Pause Location Classification

In [6]:
df_kappa_pl_list = []
df_cfmx_pl_list = []

#### 3.2.1. All Ratings

In [7]:
_, _ = evaluate_reliability(TASK, ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, Cartoon, RtSwithoutRAA, RtSwithRAA, WoZ_Interview, 
- Sample size = 161603
- Data size   = 2255

- Metrics
	Cohen's Kappa:  0.741 (|0.738 - 0.745|)
	Accuracy:       0.900

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.75216436 0.95770632]
	Recall    | [0.87455609 0.90789796]
	F1 score  | [0.80875595 0.93213724]
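
The label row above lists three tags, while each metric row holds only two values. One plausible reading is that `tag_2_id` maps `<ce>` onto `<ci>`, so only ids 0 and 2 occur, and `precision_recall_fscore_support` reports one value per label that is present in the data. The sketch below reproduces that behaviour in plain Python with a hypothetical `per_class_prf` helper and made-up ids; it is not scikit-learn's implementation.

```python
from typing import Dict, List, Tuple


def per_class_prf(true: List[int], pred: List[int]) -> Dict[int, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} for every label seen in either list."""
    scores = {}
    for label in sorted(set(true) | set(pred)):
        tp = sum(t == label and p == label for t, p in zip(true, pred))
        n_pred = sum(p == label for p in pred)
        n_true = sum(t == label for t in true)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true if n_true else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Ids as produced when "<ce>" has been merged into "<ci>": only 0 and 2 occur.
true = [0, 0, 2, 2, 2]
pred = [0, 2, 2, 2, 0]
scores = per_class_prf(true, pred)
for label, (p, r, f) in scores.items():
    print(label, f"{p:.3f} {r:.3f} {f:.3f}")
```

Under this reading, the first value in each metric row would describe the merged pause tag and the second would describe `<word>`.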
