5/14 (Tue) | Experiment

# Evaluation of Reliability of Automatic Annotation

## 1. Introduction

This notebook evaluates the reliability of the proposed automatic temporal feature annotation system.
More specifically, I evaluate the system in terms of the following metrics.

- Cohen's kappa
- Accuracy score
- Precision
- Recall
- F1 Score
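
As a rough illustration of what these metrics measure on token-level labels, the sketch below computes them by hand for a made-up pair of label sequences. The data and the helper `prf` are assumptions for illustration only; the notebook itself relies on scikit-learn for the actual evaluation.

```python
from collections import Counter

# Toy data (made up for illustration): 0 = <disfluency>, 1 = <word>
y_true = [0, 1, 1, 1, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 0, 1, 1, 1, 1, 1, 1]

n = len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

# Cohen's kappa: observed agreement corrected for chance agreement,
# where chance agreement comes from the marginal label frequencies.
true_counts = Counter(y_true)
pred_counts = Counter(y_pred)
p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
kappa = (accuracy - p_chance) / (1 - p_chance)


def prf(label):
    """Precision, recall, and F1 for one label (hypothetical helper)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    precision = tp / pred_counts[label]
    recall = tp / true_counts[label]
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


print(f"accuracy={accuracy:.3f}, kappa={kappa:.3f}")
print("disfluency P/R/F1:", prf(0))
print("word       P/R/F1:", prf(1))
```

Accuracy alone looks comfortable here because `<word>` dominates, while kappa and the per-label scores for `<disfluency>` show how often the minority label is actually recovered.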

Before starting the evaluation, the following code block loads required packages and defines global variables.

In [1]:
from typing import List, Tuple, Generator, Optional
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, cohen_kappa_score

from utils.mfr import logit_2_rating

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA", "WoZ_Interview"]

---

## 2. Define Functions

This section defines functions for the analyses.
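The following code block defines a generator that reads the SCTK output files of a task and yields the reference (true) and predicted label sequences for each file.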

In [2]:
def label_generator(
        task: str,
        ignore_tag: str, 
        tags: Tuple[str, ...] =("<disfluency>", "<ci>", "<ce>", "<filler>"), 
        word: str ="<word>",
        rating_filter: Optional[List[int]] =None,
        bert: bool =False
) -> Generator[Tuple[list, list], None, None]:
    load_dir = DATA_DIR / f"{task}/11_SCTK_Outputs"

    pf_path = DATA_DIR / f"{task}/12_PF_Rating/pf_rating.csv"
    df_pf = pd.read_csv(pf_path)
    uid_list = df_pf["uid"].to_numpy()

    if rating_filter is not None:
        logit_path = pf_path.parent / "logit.csv"
        threshold_path = logit_path.parent / "threshold.csv"
        
        df_logit = pd.read_csv(logit_path, index_col=0)
        rating_list = logit_2_rating(df_logit["theta"], threshold_path)

        mask = np.full(rating_list.shape, False, dtype=bool)
        for rating in rating_filter:
            mask = mask | (rating_list == rating)
        
        uid_list = uid_list[mask]

    for uid in uid_list:
        if task == "WoZ_Interview":
            uid = str(int(uid)).zfill(3)

        filename_pattern = f"{uid}*_ignore_{ignore_tag}.txt"
        if bert:
            filename_pattern = f"{uid}*_ignore_{ignore_tag}_bert.txt"
        for filename in load_dir.glob(filename_pattern):
            with open(filename, "r") as f:
                true = []
                pred = []
                flag = 0

                for line in f.readlines():
                    if line[0] == "<":
                        continue

                    line = line.replace("\n", "")

                    if len(line) == 1 and line.isupper():
                        flag += 1
                        continue

                    line = line.replace("\"", "")
                    if not(line in tags):
                        line = word

                    if flag == 1:
                        true.append(line)
                        flag += 1
                    elif flag == 2:
                        pred.append(line)
                        flag = 0

            yield true, pred
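
To make the parsing loop easier to follow, here is a minimal sketch of the same state machine applied to in-memory lines. The function name `parse_alignment` and the sample lines are hypothetical; the real SCTK output files may contain more fields than this toy input assumes.

```python
from typing import Iterable, List, Tuple


def parse_alignment(
        lines: Iterable[str],
        tags: Tuple[str, ...] = ("<disfluency>", "<ci>", "<ce>", "<filler>"),
        word: str = "<word>",
) -> Tuple[List[str], List[str]]:
    true, pred = [], []
    # flag: 0 = waiting for a marker, 1 = next token is the reference,
    #       2 = next token is the prediction
    flag = 0
    for line in lines:
        if line.startswith("<"):
            continue  # skip header-like lines that begin with "<"
        line = line.rstrip("\n")
        if len(line) == 1 and line.isupper():
            flag += 1  # single uppercase letter opens a reference/prediction pair
            continue
        line = line.replace('"', "")
        if line not in tags:
            line = word  # collapse every ordinary token into <word>
        if flag == 1:
            true.append(line)
            flag = 2
        elif flag == 2:
            pred.append(line)
            flag = 0
    return true, pred


# Assumed toy input: a marker line followed by a quoted reference and prediction
sample = [
    '<PATH id="demo">\n',
    'C\n', '"i"\n', '"i"\n',
    'S\n', '"<disfluency>"\n', '"um"\n',
    'C\n', '"<ci>"\n', '"<ci>"\n',
]
true, pred = parse_alignment(sample)
print(true)
print(pred)
```

Note that quoted tags such as `"<disfluency>"` are not skipped by the header check because the line starts with a quotation mark rather than `<`.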

The following code block defines a function to convert labels to ids.

In [3]:
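def tag_2_id(
        tag_list: List[str], 
        tags: List[str] =["<disfluency>", "<ci>", "<ce>", "<filler>", "<word>"]
) -> List[int]:
    # Map each label to its position in `tags`; the position serves as the class id.
    # `list.index` raises ValueError when a label is not in `tags`.
    tag_id_list = []
    for tag in tag_list:
        i = tags.index(tag)
        tag_id_list.append(i)

    return tag_id_list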
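
A short usage sketch of the same index-based mapping, written inline so it runs on its own. The label lists are made up for illustration.

```python
# Label set for the disfluency analysis: the id is the position in this list
tags = ["<disfluency>", "<word>"]
labels = ["<word>", "<disfluency>", "<word>"]

ids = [tags.index(tag) for tag in labels]
print(ids)

# A label outside the given tag list raises ValueError, so the label set
# passed in has to cover every label that the generator can yield.
try:
    tags.index("<filler>")
    missing_label_raises = False
except ValueError:
    missing_label_raises = True
print(missing_label_raises)
```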

The following code block defines a function to calculate Cronbach's Alpha.

In [4]:
def cronbach_alpha(true: List[int], pred: List[int]):
    mtx = np.array([true, pred])
    var_by_items = np.var(mtx, axis=0)
    sum_var_by_items = np.sum(var_by_items)

    items_sum = np.sum(mtx, axis=1)
    var_items_sum = np.var(items_sum)

    n_items = len(true)

    alpha = n_items / (n_items - 1) * (1 - sum_var_by_items / var_items_sum)

    return alpha
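
For reference, the function above follows the usual form of Cronbach's alpha. The mapping to the code is my reading of the implementation rather than something stated in the notebook: each token position is treated as an item, and the two label sequences are the observations.

```latex
\alpha = \frac{k}{k - 1} \left( 1 - \frac{\sum_{i=1}^{k} \sigma_i^{2}}{\sigma_X^{2}} \right)
```

Here $k$ corresponds to `n_items`, $\sigma_i^{2}$ to the entries of `var_by_items`, and $\sigma_X^{2}$ to `var_items_sum`.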

The following code block defines a function to calculate reliability metrics.

In [5]:
def evaluate_reliability(
        task_list: List[str], 
        ignore_tag: str,
        rating_filter: Optional[List[int]] =None
) -> None:
    all_true = []
    all_pred = []

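    # Select the label set that remains after the ignored tags are removed
    if ignore_tag == "<CI>-<CE>-<FILLER>":
        tags = ["<disfluency>", "<word>"]
        print("--- Analysis of Disfluency ---")

    elif ignore_tag == "<CI>-<CE>":
        tags = ["<disfluency>", "<filler>", "<word>"]
        print("--- Analysis of Disfluency & Filler ---")

    elif ignore_tag == "<DISFLUENCY>-<FILLER>":
        tags = ["<ci>", "<ce>", "<word>"]
        print("--- Analysis of Pause Location ---")

    else:
        # Fail early instead of hitting a NameError on `tags` further down
        raise ValueError(f"Unsupported ignore_tag: {ignore_tag}")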

    print("- Tasks ... ", end="")

    sample_size = 0
    for task in task_list:
        print(f"{task}, ", end="")
        for true, pred in label_generator(task, ignore_tag, rating_filter=rating_filter):
            true = tag_2_id(true, tags=tags)
            pred = tag_2_id(pred, tags=tags)

            all_true += true
            all_pred += pred
            sample_size += 1

    if rating_filter is not None:
        print(f"\n- Target rating ... {rating_filter}")
    else:
        print()
    print(f"- Sample size = {sample_size}")

    acc = accuracy_score(all_true, all_pred)
    p, r, f, _ = precision_recall_fscore_support(all_true, all_pred)
    # a = cronbach_alpha(all_true, all_pred)
    kappa = cohen_kappa_score(all_true, all_pred)

    print(f"\n- Metrics")
    # print(f"\tCronbach Alpha: {a}")
    print(f"\tCohen's Kappa:  {kappa}")
    print(f"\tAccuracy:       {acc:.03f}\n")

    print(f"\tLabels    | {np.array(tags)}")
    print(f"\tPrecision | {p}") 
    print(f"\tRecall    | {r}")
    print(f"\tF1 score  | {f}")

---

## 3. Reliability Analyses

This section conducts reliability analyses.

### 3.1. Disfluency Detection

#### 3.1.1. All Ratings

In [6]:
evaluate_reliability(TASK, ignore_tag="<CI>-<CE>-<FILLER>")

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, Cartoon, RtSwithoutRAA, RtSwithRAA, WoZ_Interview, 
- Sample size = 2255

- Metrics
	Cohen's Kappa:  0.6584217734421212
	Accuracy:       0.935

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.73564464 0.95755134]
	Recall    | [0.65753614 0.97032146]
	F1 score  | [0.69440082 0.9638941 ]


In [7]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<CI>-<CE>-<FILLER>")

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.6886630287763778
	Accuracy:       0.942

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.73786949 0.96492007]
	Recall    | [0.70522388 0.97002742]
	F1 score  | [0.72117743 0.96746701]


In [8]:
evaluate_reliability(["Cartoon"], ignore_tag="<CI>-<CE>-<FILLER>")

--- Analysis of Disfluency ---
- Tasks ... Cartoon, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.7138173312465022
	Accuracy:       0.941

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.79102106 0.95957311]
	Recall    | [0.70778528 0.97374857]
	F1 score  | [0.74709193 0.96660887]


In [9]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<CI>-<CE>-<FILLER>")

--- Analysis of Disfluency ---
- Tasks ... RtSwithoutRAA, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.6802750285020682
	Accuracy:       0.934

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.77458694 0.95359189]
	Recall    | [0.66768396 0.97233221]
	F1 score  | [0.71717356 0.96287087]


In [10]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<CI>-<CE>-<FILLER>")

--- Analysis of Disfluency ---
- Tasks ... RtSwithRAA, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.6455741987370316
	Accuracy:       0.928

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.73043818 0.95189814]
	Recall    | [0.64669207 0.96699243]
	F1 score  | [0.68601874 0.95938592]


In [11]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>")
# TODO: correct the disfluency positions in df_annotation_manu

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Sample size = 1743

- Metrics
	Cohen's Kappa:  0.596980345482176
	Accuracy:       0.934

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.67046101 0.95876073]
	Recall    | [0.59918616 0.96936346]
	F1 score  | [0.63282299 0.96403294]


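The update to the Disfluency Detector made it possible to annotate long disfluency sequences.
However, cases where it also marks unnecessary spans as disfluencies seem to have increased (e.g., 008_031).
In fact, recall improved from 0.631 → 0.804 compared with APSIPA, whereas precision dropped sharply from 0.832 → 0.565. Fewer disfluencies are missed, but extra spans are now judged as disfluencies, which very likely lowered kappa.

**TODO: Check the performance when switching back to the old disfluency detector. Alternatively, what happens if the model is replaced with one trained on L1 data?**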

#### 3.1.2. Beginners

In [12]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, 
- Target rating ... [0, 1, 2]
- Sample size = 35

- Metrics
	Cohen's Kappa:  0.6679425246554411
	Accuracy:       0.919

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.75181159 0.94524414]
	Recall    | [0.68144499 0.96069994]
	F1 score  | [0.71490095 0.95290938]


In [13]:
evaluate_reliability(["Cartoon"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Disfluency ---
- Tasks ... Cartoon, 
- Target rating ... [0, 1, 2]
- Sample size = 18

- Metrics
	Cohen's Kappa:  0.6778112798183122
	Accuracy:       0.920

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.76068376 0.94548786]
	Recall    | [0.69170984 0.96089385]
	F1 score  | [0.72455902 0.95312861]


In [14]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Disfluency ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [0, 1, 2]
- Sample size = 31

- Metrics
	Cohen's Kappa:  0.6980290721568795
	Accuracy:       0.926

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.7980226  0.94551351]
	Recall    | [0.69155447 0.96833481]
	F1 score  | [0.74098361 0.9567881 ]


In [15]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Disfluency ---
- Tasks ... RtSwithRAA, 
- Target rating ... [0, 1, 2]
- Sample size = 18

- Metrics
	Cohen's Kappa:  0.6598173488722505
	Accuracy:       0.917

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.71273713 0.95081967]
	Recall    | [0.70320856 0.95293073]
	F1 score  | [0.70794078 0.95187403]


In [16]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1])

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Target rating ... [0, 1]
- Sample size = 603

- Metrics
	Cohen's Kappa:  0.6073073609970681
	Accuracy:       0.926

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.74657534 0.94376123]
	Recall    | [0.57247899 0.97362794]
	F1 score  | [0.64803805 0.95846197]


#### 3.1.3. Intermediate

In [17]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, 
- Target rating ... [3, 4, 5]
- Sample size = 66

- Metrics
	Cohen's Kappa:  0.7081380226444488
	Accuracy:       0.944

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.74733475 0.967456  ]
	Recall    | [0.73173278 0.9699124 ]
	F1 score  | [0.73945148 0.96868264]


In [18]:
evaluate_reliability(["Cartoon"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Disfluency ---
- Tasks ... Cartoon, 
- Target rating ... [3, 4, 5]
- Sample size = 78

- Metrics
	Cohen's Kappa:  0.7362660769058755
	Accuracy:       0.943

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.80125392 0.96175697]
	Recall    | [0.73915558 0.9728107 ]
	F1 score  | [0.76895307 0.96725226]


In [19]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Disfluency ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [3, 4, 5]
- Sample size = 68

- Metrics
	Cohen's Kappa:  0.6609717820257859
	Accuracy:       0.931

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.76350245 0.95097477]
	Recall    | [0.64522822 0.97177734]
	F1 score  | [0.6994003  0.96126352]


In [20]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Disfluency ---
- Tasks ... RtSwithRAA, 
- Target rating ... [3, 4, 5]
- Sample size = 81

- Metrics
	Cohen's Kappa:  0.6432725135177209
	Accuracy:       0.927

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.72501552 0.95150912]
	Recall    | [0.6485286  0.96556281]
	F1 score  | [0.68464244 0.95848445]


In [21]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[2, 3])

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Target rating ... [2, 3]
- Sample size = 979

- Metrics
	Cohen's Kappa:  0.5966763725366623
	Accuracy:       0.931

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.66379664 0.95744247]
	Recall    | [0.60818933 0.9662343 ]
	F1 score  | [0.63477749 0.96181829]


#### 3.1.4. Advanced

In [22]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, 
- Target rating ... [6, 7, 8]
- Sample size = 27

- Metrics
	Cohen's Kappa:  0.6535615832283974
	Accuracy:       0.957

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.68316832 0.97641073]
	Recall    | [0.66990291 0.97776748]
	F1 score  | [0.67647059 0.97708864]


In [23]:
evaluate_reliability(["Cartoon"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Disfluency ---
- Tasks ... Cartoon, 
- Target rating ... [6, 7, 8]
- Sample size = 32

- Metrics
	Cohen's Kappa:  0.6732940430665222
	Accuracy:       0.946

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.78108581 0.96035866]
	Recall    | [0.63896848 0.97993579]
	F1 score  | [0.70291568 0.97004846]


In [24]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Disfluency ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [6, 7, 8]
- Sample size = 29

- Metrics
	Cohen's Kappa:  0.6961276535353389
	Accuracy:       0.946

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.76960784 0.96433311]
	Recall    | [0.68658892 0.97631844]
	F1 score  | [0.7257319  0.97028877]


In [25]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Disfluency ---
- Tasks ... RtSwithRAA, 
- Target rating ... [6, 7, 8]
- Sample size = 29

- Metrics
	Cohen's Kappa:  0.6417581153861716
	Accuracy:       0.936

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.75694444 0.95317272]
	Recall    | [0.61235955 0.975686  ]
	F1 score  | [0.67701863 0.96429797]


In [26]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[4, 5])

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Target rating ... [4, 5]
- Sample size = 161

- Metrics
	Cohen's Kappa:  0.5577101806033509
	Accuracy:       0.959

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.55652174 0.98052851]
	Recall    | [0.60377358 0.97645429]
	F1 score  | [0.57918552 0.97848716]


MEMO: Precision and kappa decrease as the CEFR fluency score increases.
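
One possible reading of the note above, offered as an assumption rather than a finding of this notebook: when the disfluency class becomes rarer, Cohen's kappa and the minority-class precision can drop even if accuracy stays the same. The sketch below uses made-up confusion-matrix counts (not taken from the data) and a hypothetical helper `kappa_from_counts` to show the effect.

```python
def kappa_from_counts(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa for a 2x2 confusion matrix (hypothetical helper)."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement from the marginal label frequencies
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)


# Made-up counts: both cases have 90% accuracy and the same number of errors
frequent = dict(tp=15, fp=5, fn=5, tn=75)  # disfluency in 20% of tokens
rare = dict(tp=5, fp=5, fn=5, tn=85)       # disfluency in 10% of tokens

for name, counts in [("frequent", frequent), ("rare", rare)]:
    precision = counts["tp"] / (counts["tp"] + counts["fp"])
    kappa = kappa_from_counts(**counts)
    print(f"{name}: precision={precision:.2f}, kappa={kappa:.3f}")
```

Whether this prevalence effect explains the trend in the outputs above would need to be checked against the actual label distributions per rating band.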

### 3.2. Pause Location Classification

#### 3.2.1. All Ratings

In [27]:
evaluate_reliability(TASK, ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, Cartoon, RtSwithoutRAA, RtSwithRAA, WoZ_Interview, 
- Sample size = 2255

- Metrics
	Cohen's Kappa:  0.7073288940256932
	Accuracy:       0.880

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.67766532 0.69758577 0.95770632]
	Recall    | [0.83449728 0.65541567 0.90789796]
	F1 score  | [0.74794849 0.67584355 0.93213724]


In [28]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.7420368964297315
	Accuracy:       0.895

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.73166768 0.76323676 0.95410476]
	Recall    | [0.83728498 0.66900175 0.92611831]
	F1 score  | [0.78092141 0.71301913 0.93990325]


In [29]:
evaluate_reliability(["Cartoon"], ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... Cartoon, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.7492145901987092
	Accuracy:       0.892

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.70514913 0.78563536 0.96335727]
	Recall    | [0.85483001 0.7163728  0.91768868]
	F1 score  | [0.77280859 0.74940711 0.9399686 ]


In [30]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... RtSwithoutRAA, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.7317320791341345
	Accuracy:       0.884

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.71046559 0.7467732  0.95532351]
	Recall    | [0.84843573 0.66212534 0.91226153]
	F1 score  | [0.77334515 0.70190641 0.93329607]


In [31]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... RtSwithRAA, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.7316777341914804
	Accuracy:       0.883

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.71634231 0.73998729 0.95336911]
	Recall    | [0.84870499 0.63192182 0.91425319]
	F1 score  | [0.77692641 0.68169839 0.93340152]


In [32]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Sample size = 1743

- Metrics
	Cohen's Kappa:  0.6263808866684425
	Accuracy:       0.863

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.58244526 0.55706727 0.95992839]
	Recall    | [0.79471249 0.61161826 0.88981086]
	F1 score  | [0.67222005 0.58306962 0.92354066]


#### 3.2.2. Beginners

In [33]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, 
- Target rating ... [0, 1, 2]
- Sample size = 35

- Metrics
	Cohen's Kappa:  0.7295661497803689
	Accuracy:       0.881

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.78332034 0.7281106  0.92470131]
	Recall    | [0.80015924 0.58955224 0.93090909]
	F1 score  | [0.79165026 0.65154639 0.92779481]


In [34]:
evaluate_reliability(["Cartoon"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Pause Location ---
- Tasks ... Cartoon, 
- Target rating ... [0, 1, 2]
- Sample size = 18

- Metrics
	Cohen's Kappa:  0.7214208120105148
	Accuracy:       0.870

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.76913876 0.75647668 0.91763653]
	Recall    | [0.79187192 0.62931034 0.92383957]
	F1 score  | [0.78033981 0.68705882 0.9207276 ]


In [35]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Pause Location ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [0, 1, 2]
- Sample size = 31

- Metrics
	Cohen's Kappa:  0.7331648133631077
	Accuracy:       0.876

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.74611399 0.68518519 0.94611693]
	Recall    | [0.84705882 0.64427861 0.90542797]
	F1 score  | [0.79338843 0.66410256 0.92532537]


In [36]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Pause Location ---
- Tasks ... RtSwithRAA, 
- Target rating ... [0, 1, 2]
- Sample size = 18

- Metrics
	Cohen's Kappa:  0.7473395340342042
	Accuracy:       0.880

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.77601586 0.72619048 0.93736312]
	Recall    | [0.85667396 0.57819905 0.91648822]
	F1 score  | [0.81435257 0.64379947 0.92680814]


In [37]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1])

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Target rating ... [0, 1]
- Sample size = 603

- Metrics
	Cohen's Kappa:  0.6105006084868569
	Accuracy:       0.848

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.65490575 0.55530474 0.9194467 ]
	Recall    | [0.71165966 0.52229299 0.90267608]
	F1 score  | [0.6821042  0.53829322 0.91098421]


#### 3.2.3. Intermediate

In [38]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, 
- Target rating ... [3, 4, 5]
- Sample size = 66

- Metrics
	Cohen's Kappa:  0.7470058316463517
	Accuracy:       0.894

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.73298033 0.7752809  0.95550263]
	Recall    | [0.85449735 0.68092105 0.92065217]
	F1 score  | [0.78908795 0.72504378 0.93775372]


In [39]:
evaluate_reliability(["Cartoon"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Pause Location ---
- Tasks ... Cartoon, 
- Target rating ... [3, 4, 5]
- Sample size = 78

- Metrics
	Cohen's Kappa:  0.7522469718587835
	Accuracy:       0.888

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.71051444 0.79248826 0.96419743]
	Recall    | [0.86940917 0.71163575 0.91070853]
	F1 score  | [0.78197169 0.74988894 0.93668999]


In [40]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Pause Location ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [3, 4, 5]
- Sample size = 68

- Metrics
	Cohen's Kappa:  0.7297208501433563
	Accuracy:       0.882

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.70403226 0.75772559 0.95544841]
	Recall    | [0.85560274 0.64188482 0.91041073]
	F1 score  | [0.77245244 0.69501134 0.93238602]


In [41]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Pause Location ---
- Tasks ... RtSwithRAA, 
- Target rating ... [3, 4, 5]
- Sample size = 81

- Metrics
	Cohen's Kappa:  0.7355249255130409
	Accuracy:       0.883

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.71997472 0.74527363 0.95403986]
	Recall    | [0.85403649 0.64904679 0.9120514 ]
	F1 score  | [0.78129644 0.69383974 0.93257324]


In [42]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[2, 3])

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Target rating ... [2, 3]
- Sample size = 979

- Metrics
	Cohen's Kappa:  0.6350759068690512
	Accuracy:       0.862

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.57604671 0.58246001 0.96540322]
	Recall    | [0.82366117 0.6289458  0.88342537]
	F1 score  | [0.6779519  0.604811   0.92259683]


#### 3.2.4. Advanced

In [43]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, 
- Target rating ... [6, 7, 8]
- Sample size = 27

- Metrics
	Cohen's Kappa:  0.7384761982939918
	Accuracy:       0.910

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.66164659 0.768      0.97608544]
	Recall    | [0.8470437  0.72180451 0.93235751]
	F1 score  | [0.74295378 0.74418605 0.95372051]


In [44]:
evaluate_reliability(["Cartoon"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Pause Location ---
- Tasks ... Cartoon, 
- Target rating ... [6, 7, 8]
- Sample size = 32

- Metrics
	Cohen's Kappa:  0.7494946005242763
	Accuracy:       0.907

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.65261727 0.7826087  0.97817014]
	Recall    | [0.85333333 0.76190476 0.92882183]
	F1 score  | [0.73959938 0.77211796 0.95285748]


In [45]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Pause Location ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [6, 7, 8]
- Sample size = 29

- Metrics
	Cohen's Kappa:  0.7310813020374732
	Accuracy:       0.894

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.68433314 0.77954545 0.96211995]
	Recall    | [0.83451705 0.71757322 0.92066773]
	F1 score  | [0.752      0.74727669 0.94093753]


In [46]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Pause Location ---
- Tasks ... RtSwithRAA, 
- Target rating ... [6, 7, 8]
- Sample size = 29

- Metrics
	Cohen's Kappa:  0.7109775808808132
	Accuracy:       0.884

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.67200916 0.7325     0.95814751]
	Recall    | [0.82851094 0.61425577 0.91829689]
	F1 score  | [0.74209861 0.668187   0.93779904]


In [47]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[4, 5])

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Target rating ... [4, 5]
- Sample size = 161

- Metrics
	Cohen's Kappa:  0.5964792972793034
	Accuracy:       0.885

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.49715678 0.44102564 0.9858145 ]
	Recall    | [0.816      0.66153846 0.9000854 ]
	F1 score  | [0.61786976 0.52923077 0.94100141]
