5/21 (Tue) | Experiment

# Evaluation of Reliability of Automatic Annotation

## 1. Introduction

This notebook evaluates the reliability of the proposed automatic temporal feature annotation system.
More specifically, I evaluate the system in terms of the following metrics.

- Cohen's kappa
- Accuracy score
- Precision
- Recall
- F1 Score
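
As a reminder of how Cohen's kappa differs from plain accuracy, the sketch below computes it by hand on toy labels and compares the result with scikit-learn. The label arrays are hypothetical and are not taken from the experiment data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Toy labels (hypothetical): 1 = disfluency, 0 = ordinary word.
# The positive class is deliberately rare.
y_true = np.array([0] * 16 + [1] * 4)
y_pred = np.array([0] * 15 + [1] * 1 + [1] * 2 + [0] * 2)

# Observed agreement is the same quantity as accuracy.
p_o = np.mean(y_true == y_pred)

# Chance agreement: for each class, the product of the two marginal rates.
classes = np.unique(np.concatenate([y_true, y_pred]))
p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)

# Kappa rescales the observed agreement by what chance alone would give.
kappa = (p_o - p_e) / (1 - p_e)

print(f"accuracy (p_o): {p_o:.3f}")
print(f"chance (p_e):   {p_e:.3f}")
print(f"kappa (manual): {kappa:.3f}")
print(f"kappa (sklearn): {cohen_kappa_score(y_true, y_pred):.3f}")
```

When one class dominates, chance agreement is already high, so kappa can be moderate even when accuracy looks strong.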

Before starting the evaluation, the following code block loads required packages and defines global variables.

In [1]:
from typing import List, Tuple, Generator, Optional
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, cohen_kappa_score

from utils.mfr import logit_2_rating

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

TASK = ["WoZ_Interview"]

---

## 2. Define Functions

This section defines functions for the analyses.
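The following code block defines a generator that reads the SCTK alignment outputs and yields the reference and predicted label sequences for each file.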

In [2]:
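def label_generator(
        task: str,
        ignore_tag: str,
        tags: Tuple[str, ...] = ("<disfluency>", "<ci>", "<ce>", "<filler>"),
        word: str = "<word>",
        rating_filter: Optional[List[int]] = None,
        bert: bool = False
) -> Generator[Tuple[List[str], List[str]], None, None]:
    """Yield (true, pred) label sequences, one pair per SCTK output file.

    Tokens not listed in `tags` are collapsed into `word`. When `rating_filter`
    is given, only UIDs whose rating is in the list are read. When `bert` is
    True, the files with the `_roberta_L1` suffix are read instead.
    """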
    load_dir = DATA_DIR / f"{task}/11_SCTK_Outputs"

    pf_path = DATA_DIR / f"{task}/12_PF_Rating/pf_rating.csv"
    df_pf = pd.read_csv(pf_path)
    uid_list = df_pf["uid"].to_numpy()

    if rating_filter is not None:
        logit_path = pf_path.parent / "logit.csv"
        threshold_path = logit_path.parent / "threshold.csv"
        
        df_logit = pd.read_csv(logit_path, index_col=0)
        rating_list = logit_2_rating(df_logit["theta"], threshold_path)

        mask = np.full(rating_list.shape, False, dtype=bool)
        for rating in rating_filter:
            mask = mask | (rating_list == rating)
        
        uid_list = uid_list[mask]

    for uid in uid_list:
        if task == "WoZ_Interview":
            uid = str(int(uid)).zfill(3)

        filename_pattern = f"{uid}*_ignore_{ignore_tag}.txt"
        if bert:
            filename_pattern = f"{uid}*_ignore_{ignore_tag}_roberta_L1.txt"
        for filename in load_dir.glob(filename_pattern):
            with open(filename, "r") as f:
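                true = []
                pred = []
                # flag: 1 = next token goes to `true`, 2 = next token goes to `pred`
                flag = 0

                for line in f:
                    # Skip lines that start with "<" (tokens themselves are quoted)
                    if line[0] == "<":
                        continue

                    line = line.replace("\n", "")

                    # A single uppercase letter marks the start of an aligned pair
                    if len(line) == 1 and line.isupper():
                        flag += 1
                        continue

                    # Strip quotes and collapse every non-tag token into `word`
                    line = line.replace("\"", "")
                    if line not in tags:
                        line = word

                    if flag == 1:
                        true.append(line)
                        flag += 1
                    elif flag == 2:
                        pred.append(line)
                        flag = 0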

            yield true, pred
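
The parsing loop above can be tried without any files on disk. The sketch below repeats the same state machine on in-memory lines. The helper name `parse_alignment` and the sample lines are hypothetical, and the file layout (a one-letter marker line followed by a quoted reference token and a quoted predicted token) is an assumption inferred from the parsing logic, not a documented format.

```python
from typing import Iterable, List, Tuple


def parse_alignment(
    lines: Iterable[str],
    tags: Tuple[str, ...] = ("<disfluency>", "<ci>", "<ce>", "<filler>"),
    word: str = "<word>",
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper mirroring the loop inside label_generator."""
    true: List[str] = []
    pred: List[str] = []
    flag = 0
    for line in lines:
        if line.startswith("<"):  # unquoted "<...>" lines are skipped
            continue
        line = line.rstrip("\n")
        if len(line) == 1 and line.isupper():  # marker line opens a pair
            flag += 1
            continue
        line = line.replace('"', "")
        if line not in tags:  # every non-tag token becomes `word`
            line = word
        if flag == 1:
            true.append(line)
            flag += 1
        elif flag == 2:
            pred.append(line)
            flag = 0
    return true, pred


# Assumed layout: marker, quoted reference token, quoted predicted token.
sample = [
    "<header>\n",
    "C\n", '"hello"\n', '"hello"\n',
    "S\n", '"<filler>"\n', '"um"\n',
    "C\n", '"<ci>"\n', '"<ci>"\n',
]
true, pred = parse_alignment(sample)
print(true)
print(pred)
```

Both sequences always have one entry per aligned pair, which is what the metric functions later require.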

The following code block defines a function to convert labels to ids.

In [3]:
def tag_2_id(
        tag_list: List[str],
        tags: List[str] = ["<disfluency>", "<ci>", "<ce>", "<filler>", "<word>"]
) -> List[int]:
    # Map each tag to its position in `tags`; an unknown tag raises ValueError
    tag_id_list = []
    for tag in tag_list:
        i = tags.index(tag)
        tag_id_list.append(i)

    return tag_id_list

The following code block defines a function to calculate Cronbach's Alpha.

In [4]:
def cronbach_alpha(true: List[int], pred: List[int]):
    mtx = np.array([true, pred])
    var_by_items = np.var(mtx, axis=0)
    sum_var_by_items = np.sum(var_by_items)

    items_sum = np.sum(mtx, axis=1)
    var_items_sum = np.var(items_sum)

    n_items = len(true)

    alpha = n_items / (n_items - 1) * (1 - sum_var_by_items / var_items_sum)

    return alpha
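
Written out, the function above computes the following quantity, using the symbols below as shorthand for its variables:

$$
\alpha = \frac{k}{k - 1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)
$$

Here $k$ is `n_items` (the length of `true`), $\sigma_i^2$ is the $i$-th entry of `var_by_items` (the variance of position $i$ across the two label sequences), and $\sigma_X^2$ is `var_items_sum` (the variance of the two per-sequence sums). `np.var` uses the population variance (`ddof=0`) by default, so every variance here divides by the number of observations rather than by that number minus one.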

The following code block defines a function to calculate reliability metrics.

In [5]:
def evaluate_reliability(
        task_list: List[str], 
        ignore_tag: str,
        rating_filter: Optional[List[int]] =None
) -> None:
    all_true = []
    all_pred = []

    if ignore_tag == "<CI>-<CE>-<FILLER>":
        tags = ["<disfluency>", "<word>"]
        print("--- Analysis of Disfluency ---")

    elif ignore_tag == "<CI>-<CE>":
        tags = ["<disfluency>", "<filler>", "<word>"]
        print("--- Analysis of Disfluency & Filler ---")

    elif ignore_tag == "<DISFLUENCY>-<FILLER>":
        tags = ["<ci>", "<ce>", "<word>"]
        print("--- Analysis of Pause Location ---")

    else:
        # Fail early instead of hitting a NameError on `tags` below
        raise ValueError(f"Unsupported ignore_tag: {ignore_tag}")

    print("- Tasks ... ", end="")

    sample_size = 0
    for task in task_list:
        print(f"{task}, ", end="")
        for true, pred in label_generator(task, ignore_tag, rating_filter=rating_filter, bert=True):
            true = tag_2_id(true, tags=tags)
            pred = tag_2_id(pred, tags=tags)

            all_true += true
            all_pred += pred
            sample_size += 1

    if rating_filter is not None:
        print(f"\n- Target rating ... {rating_filter}")
    else:
        print()
    print(f"- Sample size = {sample_size}")

    acc = accuracy_score(all_true, all_pred)
    p, r, f, _ = precision_recall_fscore_support(all_true, all_pred)
    # a = cronbach_alpha(all_true, all_pred)
    kappa = cohen_kappa_score(all_true, all_pred)

    print(f"\n- Metrics")
    # print(f"\tCronbach Alpha: {a}")
    print(f"\tCohen's Kappa:  {kappa}")
    print(f"\tAccuracy:       {acc:.03f}\n")

    print(f"\tLabels    | {np.array(tags)}")
    print(f"\tPrecision | {p}") 
    print(f"\tRecall    | {r}")
    print(f"\tF1 score  | {f}")
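
One thing to watch when printing per-class scores next to a label list: `precision_recall_fscore_support` only reports classes that occur in the data unless `labels` is passed. The sketch below shows the difference on toy ID sequences, which are hypothetical and are not experiment data.

```python
from sklearn.metrics import precision_recall_fscore_support

tags = ["<ci>", "<ce>", "<word>"]

# Toy ID sequences (hypothetical): class 1 ("<ce>") never occurs.
y_true = [0, 2, 2, 0, 2]
y_pred = [0, 2, 0, 0, 2]

# Default call: one entry per class found in the data (two here).
p_default, r_default, f_default, _ = precision_recall_fscore_support(y_true, y_pred)

# Passing labels keeps the arrays aligned with `tags`;
# zero_division=0 reports 0 for the class that has no samples.
p, r, f, support = precision_recall_fscore_support(
    y_true, y_pred, labels=list(range(len(tags))), zero_division=0
)

print(len(p_default), len(p))
for tag, p_i, r_i, f_i, n_i in zip(tags, p, r, f, support):
    print(f"{tag:8s} P={p_i:.3f} R={r_i:.3f} F1={f_i:.3f} n={n_i}")
```

With the full data every class most likely occurs, but for a small subset selected by `rating_filter` this alignment may matter.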

---

## 3. Reliability Analyses

This section conducts reliability analyses.

### 3.1. Disfluency Detection

In [6]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>")

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Sample size = 1743

- Metrics
	Cohen's Kappa:  0.5964753840367035
	Accuracy:       0.937

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.70547514 0.95588965]
	Recall    | [0.57019329 0.97507921]
	F1 score  | [0.63066104 0.96538908]


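The update to the Disfluency Detector made it possible to annotate long disfluency sequences.
However, it also seems to flag spans that are not disfluencies more often (e.g., 008_031).
In fact, recall improved from 0.631 to 0.804 compared with the APSIPA results, whereas precision dropped sharply from 0.832 to 0.565. Fewer disfluencies are missed, but extra spans are now flagged as disfluencies, which has very likely lowered kappa.

**TODO: Check the performance when swapping back to the previous disfluency detector, or see what happens when switching to a model trained on L1.**
<br>→ Performance actually got worse with the BERT version...

At the very least, I would like kappa to be above 60%...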

#### 3.1.2. Beginners

In [7]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1])

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Target rating ... [0, 1]
- Sample size = 603

- Metrics
	Cohen's Kappa:  0.5962054564848722
	Accuracy:       0.926

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.76820208 0.94014036]
	Recall    | [0.54306723 0.97767602]
	F1 score  | [0.63630769 0.95854086]


#### 3.1.3. Intermediate

In [8]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[2, 3])

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Target rating ... [2, 3]
- Sample size = 979

- Metrics
	Cohen's Kappa:  0.6029322614015009
	Accuracy:       0.934

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.7040724  0.95496926]
	Recall    | [0.58452292 0.9728732 ]
	F1 score  | [0.63875205 0.96383809]


#### 3.1.4. Advanced

In [9]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[4, 5])

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Target rating ... [4, 5]
- Sample size = 161

- Metrics
	Cohen's Kappa:  0.5301774969006077
	Accuracy:       0.959

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.57288136 0.9770239 ]
	Recall    | [0.53144654 0.98050139]
	F1 score  | [0.55138662 0.97875956]
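
To compare the four disfluency runs side by side, the values printed above can be collected into one table. This is only a convenience sketch: the numbers are copied from the outputs in this section and rounded to four decimals, and the column names are my own.

```python
import pandas as pd

# Values copied from the outputs of In [6] to In [9], rounded to 4 decimals.
# "n" is the sample size printed by evaluate_reliability (number of files read).
summary = pd.DataFrame(
    {
        "group": ["All", "Beginners", "Intermediate", "Advanced"],
        "n": [1743, 603, 979, 161],
        "kappa": [0.5965, 0.5962, 0.6029, 0.5302],
        "accuracy": [0.937, 0.926, 0.934, 0.959],
        "precision_disfluency": [0.7055, 0.7682, 0.7041, 0.5729],
        "recall_disfluency": [0.5702, 0.5431, 0.5845, 0.5314],
        "f1_disfluency": [0.6307, 0.6363, 0.6388, 0.5514],
    }
).set_index("group")

print(summary)
```

The Advanced group has the highest accuracy but the lowest kappa and the lowest disfluency F1, and it is also by far the smallest group.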


### 3.2. Pause Location Classification

In [10]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Sample size = 1743

- Metrics
	Cohen's Kappa:  0.626287943022007
	Accuracy:       0.862

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.58229934 0.55693816 0.95992944]
	Recall    | [0.79405155 0.61286307 0.88981355]
	F1 score  | [0.67188636 0.58356381 0.9235426 ]


#### 3.2.2. Beginners

In [11]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1])

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Target rating ... [0, 1]
- Sample size = 603

- Metrics
	Cohen's Kappa:  0.6112853639687635
	Accuracy:       0.848

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.65553943 0.55730337 0.9195605 ]
	Recall    | [0.71165966 0.52653928 0.90278333]
	F1 score  | [0.68244775 0.54148472 0.91109468]


#### 3.2.3. Intermediate

In [12]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[2, 3])

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Target rating ... [2, 3]
- Sample size = 979

- Metrics
	Cohen's Kappa:  0.6350832839837025
	Accuracy:       0.862

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.5758223  0.58333333 0.96540893]
	Recall    | [0.82345754 0.62954139 0.88344301]
	F1 score  | [0.6777275  0.60555715 0.92260906]


#### 3.2.4. Advanced

In [13]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[4, 5])

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Target rating ... [4, 5]
- Sample size = 161

- Metrics
	Cohen's Kappa:  0.5943516673498126
	Accuracy:       0.884

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.4959217  0.43544304 0.98565638]
	Recall    | [0.81066667 0.66153846 0.89992883]
	F1 score  | [0.61538462 0.52519084 0.94084381]
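
Because `<word>` dominates the token counts, a single number that weights the three classes equally can make the groups easier to compare. The sketch below derives macro-F1 (the unweighted mean of the per-class F1 scores) from the values printed in this section. It is a derived summary and not an output of `evaluate_reliability`.

```python
import numpy as np

# Per-class F1 copied from the outputs of In [10] to In [13],
# in the order ['<ci>', '<ce>', '<word>'].
f1_by_group = {
    "All": [0.67188636, 0.58356381, 0.9235426],
    "Beginners": [0.68244775, 0.54148472, 0.91109468],
    "Intermediate": [0.6777275, 0.60555715, 0.92260906],
    "Advanced": [0.61538462, 0.52519084, 0.94084381],
}

# Macro-F1: unweighted mean over all three classes.
macro_f1 = {group: float(np.mean(f1)) for group, f1 in f1_by_group.items()}

# Mean over the two pause classes only, leaving out '<word>'.
pause_f1 = {group: float(np.mean(f1[:2])) for group, f1 in f1_by_group.items()}

for group in f1_by_group:
    print(f"{group:12s} macro-F1={macro_f1[group]:.3f} pause-only={pause_f1[group]:.3f}")
```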
