5/14 (Tue) | Experiment

# Evaluation of Reliability of Automatic Annotation

## 1. Introduction

This notebook evaluates the reliability of the proposed automatic temporal feature annotation system.
More specifically, I evaluate the system using the following metrics.

- Cohen's kappa
- Accuracy score
- Precision
- Recall
- F1 Score
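
As a reminder of what the first metric measures, the following is a minimal pure-Python sketch of unweighted Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$. It is not the `RCohenKappa` wrapper used in this notebook (which also returns a confidence interval); the function name `cohen_kappa` and the toy label lists are assumptions for illustration only.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(true: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two equally long label sequences."""
    if len(true) != len(pred) or len(true) == 0:
        raise ValueError("sequences must be non-empty and equally long")
    n = len(true)
    # Observed agreement: share of positions where both label sequences agree.
    p_o = sum(t == p for t, p in zip(true, pred)) / n
    # Chance agreement: product of the marginal label proportions, summed over labels.
    true_counts, pred_counts = Counter(true), Counter(pred)
    p_e = sum(true_counts[label] * pred_counts[label] for label in true_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)


# Toy data (hypothetical, not taken from the experiment).
manual = ["<word>", "<disfluency>", "<word>", "<word>", "<disfluency>", "<word>"]
automatic = ["<word>", "<disfluency>", "<word>", "<disfluency>", "<word>", "<word>"]
kappa = cohen_kappa(manual, automatic)
print(f"kappa = {kappa:.3f}")  # kappa = 0.250
```

Here the two sequences agree on 4 of 6 tokens ($p_o = 0.667$), while chance agreement is $p_e = 0.556$, so kappa is much lower than raw accuracy.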

Before starting the evaluation, the following code block loads required packages and defines global variables.

In [1]:
from typing import List, Tuple, Generator, Optional
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

from utils.mfr import logit_2_rating
from utils.cohen_kappa import RCohenKappa
r_cohen_kappa = RCohenKappa(debug=False)

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA", "WoZ_Interview"]

---

## 2. Define Functions

This section defines functions for the analyses.
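The following code block defines a generator that yields the reference (`true`) and predicted (`pred`) label sequences for each aligned output file of a task.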

In [2]:
def label_generator(
        task: str,
        ignore_tag: str,
        tags: Tuple[str, ...] = ("<disfluency>", "<ci>", "<ce>", "<filler>"),
        word: str = "<word>",
        rating_filter: Optional[List[int]] = None,
        bert: bool = False
) -> Generator[Tuple[list, list], None, None]:
    load_dir = DATA_DIR / f"{task}/11_SCTK_Outputs"

    pf_path = DATA_DIR / f"{task}/12_PF_Rating/pf_rating.csv"
    df_pf = pd.read_csv(pf_path)
    uid_list = df_pf["uid"].to_numpy()

    if rating_filter is not None:
        logit_path = pf_path.parent / "logit_all.csv"
        threshold_path = logit_path.parent / "threshold_all.csv"
        
        df_logit = pd.read_csv(logit_path, index_col=0)
        rating_list = logit_2_rating(df_logit["theta"], threshold_path)

        mask = np.full(rating_list.shape, False, dtype=bool)
        for rating in rating_filter:
            mask = mask | (rating_list == rating)
        
        uid_list = uid_list[mask]

    for uid in uid_list:
        if task == "WoZ_Interview":
            uid = str(int(uid)).zfill(3)

        if task == "RtSwithoutRAA":
            if uid == "2055_RtSwithoutRAA":
                continue

        filename_pattern = f"{uid}*_ignore_{ignore_tag}.txt"
        if bert:
            filename_pattern = f"{uid}*_ignore_{ignore_tag}_bert.txt"
        for filename in load_dir.glob(filename_pattern):
            with open(filename, "r") as f:
                true = []
                pred = []
                flag = 0

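                for line in f:
                    # Skip lines that start with "<" (quoted tags start with '"' instead).
                    if line[0] == "<":
                        continue

                    line = line.replace("\n", "")

                    # A one-character uppercase line acts as a marker and advances the state.
                    if len(line) == 1 and line.isupper():
                        flag += 1
                        continue

                    # Strip quotes and collapse every non-tag token into the generic word label.
                    line = line.replace("\"", "")
                    if line not in tags:
                        line = word

                    # First token after a marker goes to `true`, the second to `pred`.
                    if flag == 1:
                        true.append(line)
                        flag += 1
                    elif flag == 2:
                        pred.append(line)
                        flag = 0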

            yield true, pred
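
The parsing loop inside `label_generator` can be tried without any files. The sketch below mirrors that loop on an in-memory list of lines. The function name `parse_alignment` and the sample lines are hypothetical; the real SCTK output files are not shown in this notebook, so the sample only imitates the features the loop relies on (lines starting with `<`, one-letter markers, quoted tokens).

```python
from typing import Iterable, List, Sequence, Tuple


def parse_alignment(
    lines: Iterable[str],
    tags: Sequence[str] = ("<disfluency>", "<ci>", "<ce>", "<filler>"),
    word: str = "<word>",
) -> Tuple[List[str], List[str]]:
    """Collect (reference, predicted) labels using the same state machine as the notebook."""
    true: List[str] = []
    pred: List[str] = []
    state = 0  # 0: waiting for a marker, 1: next token is reference, 2: next is predicted
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("<"):
            continue  # skip blank lines and unquoted lines starting with "<"
        if len(line) == 1 and line.isupper():
            state += 1  # a one-letter marker advances the state, as in the notebook
            continue
        token = line.replace('"', "")
        label = token if token in tags else word  # collapse ordinary words
        if state == 1:
            true.append(label)
            state = 2
        elif state == 2:
            pred.append(label)
            state = 0
    return true, pred


# Hypothetical sample imitating the expected line layout.
sample = [
    "<alignment>",
    "C", '"well"', '"well"',
    "S", '"<ci>"', '"<ce>"',
    "C", '"<filler>"', '"<filler>"',
]
manual, automatic = parse_alignment(sample)
print(manual)     # ['<word>', '<ci>', '<filler>']
print(automatic)  # ['<word>', '<ce>', '<filler>']
```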

The following code block defines a function to convert labels to ids.

In [3]:
def tag_2_id(
        tag_list: List[str], 
        tags: List[str] =["<disfluency>", "<ci>", "<ce>", "<filler>", "<word>"]
) -> List[int]:
    tag_id_list = []
    for tag in tag_list:
        i = tags.index(tag)
        tag_id_list.append(i)

    return tag_id_list
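
For reference, the same mapping can be sketched with a dictionary lookup, which avoids the repeated linear search of `list.index`. The helper names `build_tag_index` and `tags_to_ids` are assumptions for illustration and are not used elsewhere in this notebook.

```python
from typing import Dict, List, Sequence


def build_tag_index(tags: Sequence[str]) -> Dict[str, int]:
    """Map each tag to its position in `tags`."""
    return {tag: i for i, tag in enumerate(tags)}


def tags_to_ids(tag_list: Sequence[str], tag_index: Dict[str, int]) -> List[int]:
    """Convert tags to integer ids, failing loudly on an unknown tag."""
    try:
        return [tag_index[tag] for tag in tag_list]
    except KeyError as err:
        raise ValueError(f"unknown tag: {err.args[0]!r}") from err


pause_tags = ["<ci>", "<ce>", "<word>"]
ids = tags_to_ids(["<word>", "<ci>", "<ce>", "<word>"], build_tag_index(pause_tags))
print(ids)  # [2, 0, 1, 2]
```

The id of a tag depends on the order of the `tags` list, so the same list must be used for both the reference and the predicted sequence.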

The following code block defines a function to calculate Cronbach's Alpha.

In [4]:
def cronbach_alpha(true: List[int], pred: List[int]):
    mtx = np.array([true, pred])
    var_by_items = np.var(mtx, axis=0)
    sum_var_by_items = np.sum(var_by_items)

    items_sum = np.sum(mtx, axis=1)
    var_items_sum = np.var(items_sum)

    n_items = len(true)

    alpha = n_items / (n_items - 1) * (1 - sum_var_by_items / var_items_sum)

    return alpha

The following code block defines a function to calculate reliability metrics.

In [5]:
def evaluate_reliability(
        task_list: List[str], 
        ignore_tag: str,
        rating_filter: Optional[List[int]] =None
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    all_true = []
    all_pred = []

    if ignore_tag == "<CI>-<CE>-<FILLER>":
        tags = ["<disfluency>", "<word>"]
        print("--- Analysis of Disfluency ---")

    elif ignore_tag == "<CI>-<CE>":
        tags = ["<disfluency>", "<filler>", "<word>"]
        print("--- Analysis of Disfluency & Filler ---")

    elif ignore_tag == "<DISFLUENCY>-<FILLER>":
        tags = ["<ci>", "<ce>", "<word>"]
        print("--- Analysis of Pause Location ---")

    else:
        # Without this guard, `tags` would be undefined further below.
        raise ValueError(f"Unsupported ignore_tag: {ignore_tag}")

    print("- Tasks ... ", end="")

    sample_size = 0
    n_data = 0
    for task in task_list:
        print(f"{task}, ", end="")
        for true, pred in label_generator(task, ignore_tag, rating_filter=rating_filter):
            true = tag_2_id(true, tags=tags)
            pred = tag_2_id(pred, tags=tags)

            all_true += true
            all_pred += pred
            sample_size += len(true)
            n_data += 1

    if rating_filter is not None:
        print(f"\n- Target rating ... {rating_filter}")
    else:
        print()
    print(f"- Sample size = {sample_size}")
    print(f"- Data size   = {n_data}")

    acc = accuracy_score(all_true, all_pred)
    p, r, f, _ = precision_recall_fscore_support(all_true, all_pred)
    # a = cronbach_alpha(all_true, all_pred)
    kappa, _, lower, upper = r_cohen_kappa.cohen_kappa(all_true, all_pred)

    print(f"\n- Metrics")
    # print(f"\tCronbach Alpha: {a}")
    print(f"\tCohen's Kappa:  {kappa:.03f} (|{lower:.03f} - {upper:.03f}|)")
    print(f"\tAccuracy:       {acc:.03f}\n")

    print(f"\tLabels    | {np.array(tags)}")
    print(f"\tPrecision | {p}") 
    print(f"\tRecall    | {r}")
    print(f"\tF1 score  | {f}")

    data_kappa = [[f"{n_data:,}", f"{sample_size:,}", f"{kappa:.03f}",f"[{lower:.03f}, {upper:.03f}]"]]
    df_kappa = pd.DataFrame(data_kappa, columns=["N_data", "Sample size", "Kappa", "95% CI"])
    if "<CI>-<CE>" in ignore_tag:
        data_cfmx = [[f"{p[0]:.03f}", f"{r[0]:.03f}", f"{f[0]:.03f}"]]
        df_cfmx = pd.DataFrame(data_cfmx, columns=["P_disfl", "R_disfl", "F1_disfl"])
    else:
        data_cfmx = [[f"{p[0]:.03f}", f"{r[0]:.03f}", f"{f[0]:.03f}", f"{p[1]:.03f}", f"{r[1]:.03f}", f"{f[1]:.03f}"]]
        df_cfmx = pd.DataFrame(data_cfmx, columns=["P_mcp", "R_mcp", "F1_mcp", "P_ecp", "R_ecp", "F1_ecp"])
    
    if rating_filter is None:
        idx_name = f"{task}_00All"
    elif rating_filter == [0, 1, 2] or rating_filter == [0, 1]:
        idx_name = f"{task}_01Low"
    elif rating_filter == [3, 4, 5] or rating_filter == [2, 3]:
        idx_name = f"{task}_02Mid"
    else:
        idx_name = f"{task}_03High"
    
    df_kappa.index = [idx_name]
    df_cfmx.index = [idx_name]

    return df_kappa, df_cfmx
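
The per-class arrays that `evaluate_reliability` prints and stores (index 0 is the first entry of `tags`, and so on) can be reproduced by hand. The following is a minimal pure-Python sketch, assuming integer ids as produced by `tag_2_id`. The function name `per_class_prf` and the toy ids are hypothetical, and returning 0.0 for an undefined ratio is a choice made here.

```python
from typing import List, Sequence, Tuple


def per_class_prf(
    true: Sequence[int], pred: Sequence[int], n_labels: int
) -> List[Tuple[float, float, float]]:
    """Return (precision, recall, F1) for each label id in 0..n_labels-1."""
    scores = []
    for label in range(n_labels):
        tp = sum(t == label and p == label for t, p in zip(true, pred))
        fp = sum(t != label and p == label for t, p in zip(true, pred))
        fn = sum(t == label and p != label for t, p in zip(true, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


# Toy ids (hypothetical): 0 = "<disfluency>", 1 = "<word>".
true_ids = [0, 0, 1, 1, 1, 1]
pred_ids = [0, 1, 1, 1, 1, 0]
scores = per_class_prf(true_ids, pred_ids, n_labels=2)
for label, (p, r, f) in enumerate(scores):
    print(f"label {label}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

With these toy ids, label 0 has one hit, one false alarm, and one miss, so precision, recall, and F1 are all 0.50; label 1 scores 0.75 on all three.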

---

## 3. Reliability Analyses

This section conducts reliability analyses.

### 3.1. Disfluency Detection

In [6]:
df_kappa_disfl_list = []
df_cfmx_disfl_list = []

#### 3.1.1. All Ratings

In [7]:
_, _ = evaluate_reliability(TASK, ignore_tag="<CI>-<CE>-<FILLER>")

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, Cartoon, RtSwithoutRAA, RtSwithRAA, WoZ_Interview, 
- Sample size = 129506
- Data size   = 2254

- Metrics
	Cohen's Kappa:  0.658 (|0.652 - 0.665|)
	Accuracy:       0.935

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.73551735 0.95758787]
	Recall    | [0.6575959  0.97032086]
	F1 score  | [0.69437742 0.96391232]


In [8]:
df_kappa, df_cfmx = evaluate_reliability(["Arg_Oly"], ignore_tag="<CI>-<CE>-<FILLER>")
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, 
- Sample size = 17557
- Data size   = 128

- Metrics
	Cohen's Kappa:  0.689 (|0.671 - 0.707|)
	Accuracy:       0.942

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.73786949 0.96492007]
	Recall    | [0.70522388 0.97002742]
	F1 score  | [0.72117743 0.96746701]


In [9]:
df_kappa, df_cfmx = evaluate_reliability(["Cartoon"], ignore_tag="<CI>-<CE>-<FILLER>")
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... Cartoon, 
- Sample size = 22850
- Data size   = 128

- Metrics
	Cohen's Kappa:  0.714 (|0.699 - 0.728|)
	Accuracy:       0.941

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.79102106 0.95957311]
	Recall    | [0.70778528 0.97374857]
	F1 score  | [0.74709193 0.96660887]


In [10]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<CI>-<CE>-<FILLER>")
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... RtSwithoutRAA, 
- Sample size = 23608
- Data size   = 127

- Metrics
	Cohen's Kappa:  0.680 (|0.665 - 0.695|)
	Accuracy:       0.935

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.77409163 0.9537863 ]
	Recall    | [0.66802999 0.9723324 ]
	F1 score  | [0.71716063 0.96297006]


In [11]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithRAA"], ignore_tag="<CI>-<CE>-<FILLER>")
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... RtSwithRAA, 
- Sample size = 23761
- Data size   = 128

- Metrics
	Cohen's Kappa:  0.646 (|0.630 - 0.661|)
	Accuracy:       0.928

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.73043818 0.95189814]
	Recall    | [0.64669207 0.96699243]
	F1 score  | [0.68601874 0.95938592]


In [12]:
df_kappa, df_cfmx = evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>")
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)
# TODO: correct the disfluency positions in df_annotation_manu

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Sample size = 41730
- Data size   = 1743

- Metrics
	Cohen's Kappa:  0.597 (|0.583 - 0.611|)
	Accuracy:       0.934

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.67046101 0.95876073]
	Recall    | [0.59918616 0.96936346]
	F1 score  | [0.63282299 0.96403294]


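The update to the Disfluency Detector made it possible to annotate long disfluency sequences.
However, it also seems to have increased the number of cases in which spans that are not disfluent are labeled as disfluencies (e.g., 008_031).
Indeed, recall rose from 0.631 to 0.804 relative to the APSIPA results, whereas precision fell sharply from 0.832 to 0.565. Fewer disfluencies are missed, but the detector now also labels unnecessary spans as disfluent, which has most likely lowered kappa.

**TODO: Check the performance after switching back to the previous version of the disfluency detector, or see what happens with a model trained on L1.**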
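
The reasoning in the note above (higher recall, lower precision, and a lower kappa as a result) can be illustrated with a small calculation. The sketch below uses two hypothetical 2x2 confusion tables over 1,000 tokens with 100 true disfluencies; the counts are invented for illustration and are not results from this experiment, and `binary_metrics` is an assumed helper name.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Precision, recall, and Cohen's kappa from a 2x2 confusion table."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n  # observed agreement
    # Chance agreement from the marginals of the reference and the prediction.
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "kappa": (p_o - p_e) / (1 - p_e),
    }


# Hypothetical detectors: one conservative, one that over-labels.
conservative = binary_metrics(tp=63, fp=13, fn=37, tn=887)
liberal = binary_metrics(tp=80, fp=62, fn=20, tn=838)

for name, m in (("conservative", conservative), ("liberal", liberal)):
    print(f"{name:>12}: P={m['precision']:.3f} R={m['recall']:.3f} kappa={m['kappa']:.3f}")
```

In this toy setting the over-labeling detector gains recall (0.630 to 0.800) but loses precision (0.829 to 0.563), and kappa drops from about 0.689 to about 0.616, because the extra false positives reduce observed agreement more than they change chance agreement.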

#### 3.1.2. Beginners

In [13]:
df_kappa, df_cfmx = evaluate_reliability(["Arg_Oly"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1, 2])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, 
- Target rating ... [0, 1, 2]
- Sample size = 3923
- Data size   = 33

- Metrics
	Cohen's Kappa:  0.675 (|0.640 - 0.710|)
	Accuracy:       0.925

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.75502008 0.94948905]
	Recall    | [0.6848816  0.96384114]
	F1 score  | [0.7182426  0.95661127]


In [14]:
df_kappa, df_cfmx = evaluate_reliability(["Cartoon"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1, 2])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... Cartoon, 
- Target rating ... [0, 1, 2]
- Sample size = 5000
- Data size   = 33

- Metrics
	Cohen's Kappa:  0.710 (|0.682 - 0.738|)
	Accuracy:       0.927

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.80403458 0.94658616]
	Recall    | [0.70812183 0.9677113 ]
	F1 score  | [0.75303644 0.95703217]


In [15]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1, 2])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [0, 1, 2]
- Sample size = 5466
- Data size   = 32

- Metrics
	Cohen's Kappa:  0.684 (|0.656 - 0.712|)
	Accuracy:       0.923

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.77322404 0.94613435]
	Recall    | [0.68940317 0.96426265]
	F1 score  | [0.72891178 0.95511249]


In [16]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1, 2])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... RtSwithRAA, 
- Target rating ... [0, 1, 2]
- Sample size = 5092
- Data size   = 33

- Metrics
	Cohen's Kappa:  0.642 (|0.611 - 0.674|)
	Accuracy:       0.917

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.70386905 0.94909502]
	Recall    | [0.67765043 0.95471097]
	F1 score  | [0.69051095 0.95189471]


In [17]:
df_kappa, df_cfmx = evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Target rating ... [0, 1]
- Sample size = 7967
- Data size   = 603

- Metrics
	Cohen's Kappa:  0.607 (|0.579 - 0.636|)
	Accuracy:       0.926

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.74657534 0.94376123]
	Recall    | [0.57247899 0.97362794]
	F1 score  | [0.64803805 0.95846197]


#### 3.1.3. Intermediate

In [18]:
df_kappa, df_cfmx = evaluate_reliability(["Arg_Oly"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[3, 4, 5])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, 
- Target rating ... [3, 4, 5]
- Sample size = 8571
- Data size   = 66

- Metrics
	Cohen's Kappa:  0.697 (|0.672 - 0.722|)
	Accuracy:       0.940

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.74038462 0.96489849]
	Recall    | [0.72112383 0.96806833]
	F1 score  | [0.73062731 0.96648081]


In [19]:
df_kappa, df_cfmx = evaluate_reliability(["Cartoon"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[3, 4, 5])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... Cartoon, 
- Target rating ... [3, 4, 5]
- Sample size = 11562
- Data size   = 66

- Metrics
	Cohen's Kappa:  0.734 (|0.715 - 0.754|)
	Accuracy:       0.944

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.79369842 0.96363281]
	Recall    | [0.73986014 0.97285827]
	F1 score  | [0.76583424 0.96822356]


In [20]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[3, 4, 5])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [3, 4, 5]
- Sample size = 11330
- Data size   = 66

- Metrics
	Cohen's Kappa:  0.677 (|0.655 - 0.698|)
	Accuracy:       0.934

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.7700579  0.95366071]
	Recall    | [0.665      0.97200403]
	F1 score  | [0.7136834 0.962745 ]


In [21]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[3, 4, 5])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... RtSwithRAA, 
- Target rating ... [3, 4, 5]
- Sample size = 11858
- Data size   = 66

- Metrics
	Cohen's Kappa:  0.635 (|0.612 - 0.657|)
	Accuracy:       0.927

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.71860095 0.95150943]
	Recall    | [0.63751763 0.96609195]
	F1 score  | [0.67563528 0.95874525]


In [22]:
df_kappa, df_cfmx = evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[2, 3])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Target rating ... [2, 3]
- Sample size = 26947
- Data size   = 979

- Metrics
	Cohen's Kappa:  0.597 (|0.580 - 0.613|)
	Accuracy:       0.931

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.66379664 0.95744247]
	Recall    | [0.60818933 0.9662343 ]
	F1 score  | [0.63477749 0.96181829]


#### 3.1.4. Advanced

In [23]:
df_kappa, df_cfmx = evaluate_reliability(["Arg_Oly"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[6, 7, 8])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, 
- Target rating ... [6, 7, 8]
- Sample size = 5063
- Data size   = 29

- Metrics
	Cohen's Kappa:  0.678 (|0.637 - 0.718|)
	Accuracy:       0.957

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.70752089 0.97619048]
	Recall    | [0.69398907 0.97764531]
	F1 score  | [0.70068966 0.97691735]


In [24]:
df_kappa, df_cfmx = evaluate_reliability(["Cartoon"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[6, 7, 8])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... Cartoon, 
- Target rating ... [6, 7, 8]
- Sample size = 6288
- Data size   = 29

- Metrics
	Cohen's Kappa:  0.662 (|0.629 - 0.696|)
	Accuracy:       0.947

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.76530612 0.96205588]
	Recall    | [0.6302521  0.97979975]
	F1 score  | [0.69124424 0.97084675]


In [25]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[6, 7, 8])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [6, 7, 8]
- Sample size = 6812
- Data size   = 29

- Metrics
	Cohen's Kappa:  0.680 (|0.650 - 0.710|)
	Accuracy:       0.945

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.78341794 0.95981353]
	Recall    | [0.64936886 0.97901295]
	F1 score  | [0.7101227  0.96931818]


In [26]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[6, 7, 8])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... RtSwithRAA, 
- Target rating ... [6, 7, 8]
- Sample size = 6811
- Data size   = 29

- Metrics
	Cohen's Kappa:  0.668 (|0.638 - 0.697|)
	Accuracy:       0.939

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.7827476 0.9545675]
	Recall    | [0.63553826 0.97748344]
	F1 score  | [0.70150322 0.96588957]


In [27]:
df_kappa, df_cfmx = evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[4, 5])
df_kappa_disfl_list.append(df_kappa)
df_cfmx_disfl_list.append(df_cfmx)

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Target rating ... [4, 5]
- Sample size = 6816
- Data size   = 161

- Metrics
	Cohen's Kappa:  0.558 (|0.511 - 0.604|)
	Accuracy:       0.959

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.55652174 0.98052851]
	Recall    | [0.60377358 0.97645429]
	F1 score  | [0.57918552 0.97848716]


MEMO: Precision and kappa decrease as the CEFR fluency score increases.

In [28]:
df_kappa_disfl = pd.concat(df_kappa_disfl_list)
df_kappa_disfl = df_kappa_disfl.sort_index()  # sort_index() returns a new DataFrame
df_kappa_disfl.to_csv(DATA_DIR / "kappa_disfl_detection.csv")

In [29]:
df_cfmx_disfl = pd.concat(df_cfmx_disfl_list)
df_cfmx_disfl = df_cfmx_disfl.sort_index()  # sort_index() returns a new DataFrame
df_cfmx_disfl.to_csv(DATA_DIR / "cfmx_disfl_detection.csv")
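
The row labels built in `evaluate_reliability` (`{task}_00All`, `{task}_01Low`, `{task}_02Mid`, `{task}_03High`) are designed so that a plain lexicographic sort groups rows by task and orders the proficiency bands within each task. The sketch below reproduces that naming rule in pure Python; `band_suffix` is a hypothetical helper that mirrors the `if`/`elif` chain of the function.

```python
from typing import List, Optional


def band_suffix(rating_filter: Optional[List[int]]) -> str:
    """Mirror the band naming used for the result index."""
    if rating_filter is None:
        return "00All"
    if rating_filter in ([0, 1, 2], [0, 1]):
        return "01Low"
    if rating_filter in ([3, 4, 5], [2, 3]):
        return "02Mid"
    return "03High"


# Labels in the order the cells are executed: band by band, task by task.
filters = (None, [0, 1, 2], [3, 4, 5], [6, 7, 8])
rows = [f"{task}_{band_suffix(f)}" for f in filters for task in ("Cartoon", "Arg_Oly")]
ordered = sorted(rows)
print(ordered[:4])
# ['Arg_Oly_00All', 'Arg_Oly_01Low', 'Arg_Oly_02Mid', 'Arg_Oly_03High']
```

The numeric prefixes (`00`, `01`, ...) are what keep the bands in proficiency order; sorting on `All`, `Low`, `Mid`, `High` alone would place `High` before `Low`.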

### 3.2. Pause Location Classification

In [30]:
df_kappa_pl_list = []
df_cfmx_pl_list = []

#### 3.2.1. All Ratings

In [31]:
_, _ = evaluate_reliability(TASK, ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, Cartoon, RtSwithoutRAA, RtSwithRAA, WoZ_Interview, 
- Sample size = 161539
- Data size   = 2254

- Metrics
	Cohen's Kappa:  0.707 (|0.704 - 0.711|)
	Accuracy:       0.880

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.67758925 0.69779267 0.95775995]
	Recall    | [0.83463751 0.65552057 0.90790205]
	F1 score  | [0.74795847 0.67599642 0.9321648 ]


In [32]:
df_kappa, df_cfmx = evaluate_reliability(["Arg_Oly"], ignore_tag="<DISFLUENCY>-<FILLER>")
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, 
- Sample size = 21808
- Data size   = 128

- Metrics
	Cohen's Kappa:  0.742 (|0.732 - 0.752|)
	Accuracy:       0.895

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.73166768 0.76323676 0.95410476]
	Recall    | [0.83728498 0.66900175 0.92611831]
	F1 score  | [0.78092141 0.71301913 0.93990325]


In [33]:
df_kappa, df_cfmx = evaluate_reliability(["Cartoon"], ignore_tag="<DISFLUENCY>-<FILLER>")
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... Cartoon, 
- Sample size = 28744
- Data size   = 128

- Metrics
	Cohen's Kappa:  0.749 (|0.741 - 0.757|)
	Accuracy:       0.892

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.70514913 0.78563536 0.96335727]
	Recall    | [0.85483001 0.7163728  0.91768868]
	F1 score  | [0.77280859 0.74940711 0.9399686 ]


In [34]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<DISFLUENCY>-<FILLER>")
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... RtSwithoutRAA, 
- Sample size = 29903
- Data size   = 127

- Metrics
	Cohen's Kappa:  0.732 (|0.724 - 0.740|)
	Accuracy:       0.884

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.71016465 0.74799754 0.95561607]
	Recall    | [0.84917019 0.66266376 0.9122919 ]
	F1 score  | [0.77347166 0.70274964 0.93345156]


In [35]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithRAA"], ignore_tag="<DISFLUENCY>-<FILLER>")
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... RtSwithRAA, 
- Sample size = 30134
- Data size   = 128

- Metrics
	Cohen's Kappa:  0.732 (|0.724 - 0.740|)
	Accuracy:       0.883

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.71634231 0.73998729 0.95336911]
	Recall    | [0.84870499 0.63192182 0.91425319]
	F1 score  | [0.77692641 0.68169839 0.93340152]


In [36]:
df_kappa, df_cfmx = evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>")
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Sample size = 50950
- Data size   = 1743

- Metrics
	Cohen's Kappa:  0.626 (|0.619 - 0.634|)
	Accuracy:       0.863

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.58244526 0.55706727 0.95992839]
	Recall    | [0.79471249 0.61161826 0.88981086]
	F1 score  | [0.67222005 0.58306962 0.92354066]


#### 3.2.2. Beginners

In [37]:
df_kappa, df_cfmx = evaluate_reliability(["Arg_Oly"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1, 2])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, 
- Target rating ... [0, 1, 2]
- Sample size = 4997
- Data size   = 33

- Metrics
	Cohen's Kappa:  0.731 (|0.711 - 0.751|)
	Accuracy:       0.880

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.78079009 0.71830986 0.92699685]
	Recall    | [0.8057554  0.6023622  0.92726231]
	F1 score  | [0.79307632 0.65524625 0.92712956]


In [38]:
df_kappa, df_cfmx = evaluate_reliability(["Cartoon"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1, 2])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... Cartoon, 
- Target rating ... [0, 1, 2]
- Sample size = 6444
- Data size   = 33

- Metrics
	Cohen's Kappa:  0.752 (|0.736 - 0.768|)
	Accuracy:       0.883

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.77183908 0.76190476 0.93874249]
	Recall    | [0.83780412 0.65753425 0.92232569]
	F1 score  | [0.80346994 0.70588235 0.93046168]


In [39]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1, 2])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [0, 1, 2]
- Sample size = 7062
- Data size   = 32

- Metrics
	Cohen's Kappa:  0.733 (|0.717 - 0.749|)
	Accuracy:       0.875

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.75       0.68668407 0.94306352]
	Recall    | [0.83891095 0.62470309 0.9100041 ]
	F1 score  | [0.79196787 0.65422886 0.92623891]


In [40]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1, 2])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... RtSwithRAA, 
- Target rating ... [0, 1, 2]
- Sample size = 6652
- Data size   = 33

- Metrics
	Cohen's Kappa:  0.746 (|0.730 - 0.762|)
	Accuracy:       0.881

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.76968399 0.7251462  0.93990547]
	Recall    | [0.85180794 0.59615385 0.91800396]
	F1 score  | [0.80866629 0.65435356 0.92882562]


In [41]:
df_kappa, df_cfmx = evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Target rating ... [0, 1]
- Sample size = 9886
- Data size   = 603

- Metrics
	Cohen's Kappa:  0.611 (|0.593 - 0.628|)
	Accuracy:       0.848

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.65490575 0.55530474 0.9194467 ]
	Recall    | [0.71165966 0.52229299 0.90267608]
	F1 score  | [0.6821042  0.53829322 0.91098421]


#### 3.2.3. Intermediate

In [42]:
df_kappa, df_cfmx = evaluate_reliability(["Arg_Oly"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[3, 4, 5])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, 
- Target rating ... [3, 4, 5]
- Sample size = 10743
- Data size   = 66

- Metrics
	Cohen's Kappa:  0.742 (|0.729 - 0.756|)
	Accuracy:       0.892

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.73107256 0.76771654 0.95376023]
	Recall    | [0.85006878 0.66213922 0.92098332]
	F1 score  | [0.78609286 0.71103008 0.93708525]


In [43]:
df_kappa, df_cfmx = evaluate_reliability(["Cartoon"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[3, 4, 5])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... Cartoon, 
- Target rating ... [3, 4, 5]
- Sample size = 14805
- Data size   = 66

- Metrics
	Cohen's Kappa:  0.742 (|0.731 - 0.754|)
	Accuracy:       0.886

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.69440655 0.78886555 0.9633883 ]
	Recall    | [0.85661393 0.72142171 0.90938571]
	F1 score  | [0.76702833 0.75363773 0.93560841]


In [44]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[3, 4, 5])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [3, 4, 5]
- Sample size = 14543
- Data size   = 66

- Metrics
	Cohen's Kappa:  0.735 (|0.723 - 0.746|)
	Accuracy:       0.883

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.70860215 0.76765083 0.95679012]
	Recall    | [0.86682012 0.65070729 0.9080601 ]
	F1 score  | [0.77976631 0.70435807 0.93178843]


In [45]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[3, 4, 5])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... RtSwithRAA, 
- Target rating ... [3, 4, 5]
- Sample size = 15218
- Data size   = 66

- Metrics
	Cohen's Kappa:  0.726 (|0.715 - 0.737|)
	Accuracy:       0.879

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.71097372 0.73391089 0.95231763]
	Recall    | [0.84896955 0.62951168 0.90938776]
	F1 score  | [0.77386794 0.67771429 0.93035772]


In [46]:
df_kappa, df_cfmx = evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[2, 3])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Target rating ... [2, 3]
- Sample size = 33028
- Data size   = 979

- Metrics
	Cohen's Kappa:  0.635 (|0.626 - 0.644|)
	Accuracy:       0.862

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.57604671 0.58246001 0.96540322]
	Recall    | [0.82366117 0.6289458  0.88342537]
	F1 score  | [0.6779519  0.604811   0.92259683]


#### 3.2.4. Advanced

In [47]:
df_kappa, df_cfmx = evaluate_reliability(["Arg_Oly"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[6, 7, 8])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, 
- Target rating ... [6, 7, 8]
- Sample size = 6068
- Data size   = 29

- Metrics
	Cohen's Kappa:  0.747 (|0.728 - 0.767|)
	Accuracy:       0.912

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.67518248 0.78928571 0.97485081]
	Recall    | [0.85057471 0.73913043 0.93365993]
	F1 score  | [0.75279756 0.76338515 0.95381086]


In [48]:
df_kappa, df_cfmx = evaluate_reliability(["Cartoon"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[6, 7, 8])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... Cartoon, 
- Target rating ... [6, 7, 8]
- Sample size = 7495
- Data size   = 29

- Metrics
	Cohen's Kappa:  0.756 (|0.739 - 0.772|)
	Accuracy:       0.911

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.64767616 0.79791667 0.98204541]
	Recall    | [0.87715736 0.756917   0.92921386]
	F1 score  | [0.74514877 0.77687627 0.95489944]


In [49]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[6, 7, 8])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [6, 7, 8]
- Sample size = 8298
- Data size   = 29

- Metrics
	Cohen's Kappa:  0.722 (|0.705 - 0.738|)
	Accuracy:       0.893

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.66626433 0.76572668 0.96326861]
	Recall    | [0.82265276 0.71747967 0.92094678]
	F1 score  | [0.73624542 0.74081847 0.94163239]


In [50]:
df_kappa, df_cfmx = evaluate_reliability(["RtSwithRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[6, 7, 8])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... RtSwithRAA, 
- Target rating ... [6, 7, 8]
- Sample size = 8264
- Data size   = 29

- Metrics
	Cohen's Kappa:  0.726 (|0.709 - 0.742|)
	Accuracy:       0.892

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.67142042 0.76359338 0.96501314]
	Recall    | [0.84433286 0.66735537 0.91998121]
	F1 score  | [0.74801398 0.71223815 0.94195928]


In [51]:
df_kappa, df_cfmx = evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[4, 5])
df_kappa_pl_list.append(df_kappa)
df_cfmx_pl_list.append(df_cfmx)

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Target rating ... [4, 5]
- Sample size = 8036
- Data size   = 161

- Metrics
	Cohen's Kappa:  0.596 (|0.575 - 0.618|)
	Accuracy:       0.885

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.49715678 0.44102564 0.9858145 ]
	Recall    | [0.816      0.66153846 0.9000854 ]
	F1 score  | [0.61786976 0.52923077 0.94100141]


In [52]:
df_kappa_pl = pd.concat(df_kappa_pl_list)
df_kappa_pl = df_kappa_pl.sort_index()  # sort_index() returns a new DataFrame
df_kappa_pl.to_csv(DATA_DIR / "kappa_pause_loc_detection.csv")

In [53]:
df_cfmx_pl = pd.concat(df_cfmx_pl_list)
df_cfmx_pl = df_cfmx_pl.sort_index()  # sort_index() returns a new DataFrame
df_cfmx_pl.to_csv(DATA_DIR / "cfmx_pause_loc_detection.csv")