5/14 (Tue) | Experiment

# Evaluation of Reliability of Automatic Annotation

## 1. Introduction

This notebook evaluates the reliability of the proposed automatic temporal feature annotation system.
More specifically, I evaluate the system in terms of the following metrics.

- Cohen's kappa
- Accuracy score
- Precision
- Recall
- F1 Score
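
As a quick reference for how the per-class metrics are defined, the sketch below computes accuracy, precision, recall, and F1 by hand on a toy pair of label sequences. The helper `per_class_prf` and the sequences `y_true` and `y_pred` are hypothetical and made up for illustration; they are not part of the experiment code or data.

```python
from typing import List, Tuple


def per_class_prf(y_true: List[int], y_pred: List[int], label: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as `label`
    fp = sum(t != label and p == label for t, p in pairs)  # predicted `label`, but wrong
    fn = sum(t == label and p != label for t, p in pairs)  # missed occurrences of `label`
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy label ids (hypothetical): 0 = <disfluency>, 1 = <word>
y_true = [0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")
for label in (0, 1):
    p, r, f = per_class_prf(y_true, y_pred, label)
    print(f"label {label}: precision = {p:.3f}, recall = {r:.3f}, f1 = {f:.3f}")
```

In this toy case one of the two disfluencies is missed, so recall for label 0 is 0.5 while its precision is 1.0, which mirrors the high-precision, lower-recall pattern a conservative detector would produce.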

Before starting the evaluation, the following code block loads required packages and defines global variables.

In [1]:
from typing import List, Tuple, Generator, Optional
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, cohen_kappa_score

from utils.mfr import logit_2_rating

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA", "WoZ_Interview"]

---

## 2. Define Functions

This section defines functions for the analyses.
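The following code block defines a generator that yields the reference (true) and predicted label sequences for each alignment file of a task.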

In [2]:
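def label_generator(
        task: str,
        ignore_tag: str,
        tags: Tuple[str, ...] =("<disfluency>", "<ci>", "<ce>", "<filler>"),
        word: str ="<word>",
        rating_filter: Optional[List[int]] =None,
        bert: bool =False
) -> Generator[Tuple[list, list], None, None]:
    """Yield (true, pred) label lists for each alignment file of a task.

    Tokens that are not listed in `tags` are collapsed to `word`.
    If `rating_filter` is given, only uids whose rating is in the list are kept.
    """
    load_dir = DATA_DIR / f"{task}/11_SCTK_Outputs"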

    pf_path = DATA_DIR / f"{task}/12_PF_Rating/pf_rating.csv"
    df_pf = pd.read_csv(pf_path)
    uid_list = df_pf["uid"].to_numpy()

    if rating_filter is not None:
        logit_path = pf_path.parent / "logit.csv"
        threshold_path = logit_path.parent / "threshold.csv"
        
        df_logit = pd.read_csv(logit_path, index_col=0)
        rating_list = logit_2_rating(df_logit["theta"], threshold_path)

        mask = np.full(rating_list.shape, False, dtype=bool)
        for rating in rating_filter:
            mask = mask | (rating_list == rating)
        
        uid_list = uid_list[mask]

    for uid in uid_list:
        if task == "WoZ_Interview":
            uid = str(int(uid)).zfill(3)

        filename_pattern = f"{uid}*_ignore_{ignore_tag}.txt"
        if bert:
            filename_pattern = f"{uid}*_ignore_{ignore_tag}_bert.txt"
        for filename in load_dir.glob(filename_pattern):
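            with open(filename, "r") as f:
                true = []
                pred = []
                # 0: waiting for a marker line
                # 1: the next token is the reference (true) label
                # 2: the next token is the predicted label
                flag = 0

                for line in f:
                    # Skip lines that start with "<"
                    if line[0] == "<":
                        continue

                    line = line.replace("\n", "")

                    # A single uppercase letter marks the start of an aligned token pair
                    if len(line) == 1 and line.isupper():
                        flag += 1
                        continue

                    # Strip quotes and collapse every non-tag token to the generic word label
                    line = line.replace("\"", "")
                    if line not in tags:
                        line = word

                    if flag == 1:
                        true.append(line)
                        flag += 1
                    elif flag == 2:
                        pred.append(line)
                        flag = 0

            yield true, pred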
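
The parsing loop above is a small state machine, and it can be easier to follow on an in-memory example. The sketch below applies the same logic to a hand-written list of lines. The function name `parse_alignment` and the sample lines are hypothetical; the sample only mimics the layout that the loop expects (a one-letter marker line followed by a reference token and a predicted token) and is not a real SCTK output.

```python
from typing import List, Tuple

TAGS = ("<disfluency>", "<ci>", "<ce>", "<filler>")


def parse_alignment(lines: List[str], word: str = "<word>") -> Tuple[List[str], List[str]]:
    """Collect (true, pred) labels from marker / reference / prediction triples."""
    true: List[str] = []
    pred: List[str] = []
    flag = 0  # 0: waiting for marker, 1: expect reference, 2: expect prediction

    for line in lines:
        if line.startswith("<"):
            continue  # header-like lines are ignored
        line = line.rstrip("\n")
        if len(line) == 1 and line.isupper():
            flag += 1  # marker line: the next two tokens form one aligned pair
            continue
        token = line.replace('"', "")
        if token not in TAGS:
            token = word  # every ordinary word is collapsed to one label
        if flag == 1:
            true.append(token)
            flag = 2
        elif flag == 2:
            pred.append(token)
            flag = 0
    return true, pred


# Hypothetical sample in the layout assumed above
sample = [
    "<header>\n",
    "C\n", '"so"\n', '"so"\n',
    "S\n", '"<ci>"\n', '"<ce>"\n',
    "C\n", '"<filler>"\n', '"<filler>"\n',
]
true_labels, pred_labels = parse_alignment(sample)
print(true_labels)
print(pred_labels)
```

Because quoted tokens are at least three characters long before the quotes are stripped, a one-letter word such as `"I"` is not mistaken for a marker line.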

The following code block defines a function to convert labels to ids.

In [3]:
def tag_2_id(
        tag_list: List[str], 
        tags: List[str] =["<disfluency>", "<ci>", "<ce>", "<filler>", "<word>"]
) -> List[int]:
    tag_id_list = []
    for tag in tag_list:
        i = tags.index(tag)
        tag_id_list.append(i)

    return tag_id_list
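
An equivalent way to write the conversion is to build the tag-to-id mapping once and look each label up in it. This is a minimal sketch under the name `tag_to_id` (hypothetical, to avoid clashing with the function above); with at most five tags the speed difference is negligible, so the main benefit is readability.

```python
from typing import Dict, List, Sequence


def tag_to_id(tag_list: Sequence[str], tags: Sequence[str]) -> List[int]:
    """Map each label to its position in `tags`."""
    index: Dict[str, int] = {tag: i for i, tag in enumerate(tags)}
    # An unknown label raises KeyError here (list.index would raise ValueError)
    return [index[tag] for tag in tag_list]


ids = tag_to_id(["<word>", "<disfluency>", "<word>"], ["<disfluency>", "<word>"])
print(ids)  # [1, 0, 1]
```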

The following code block defines a function to calculate Cronbach's Alpha.

In [4]:
def cronbach_alpha(true: List[int], pred: List[int]):
    mtx = np.array([true, pred])
    var_by_items = np.var(mtx, axis=0)
    sum_var_by_items = np.sum(var_by_items)

    items_sum = np.sum(mtx, axis=1)
    var_items_sum = np.var(items_sum)

    n_items = len(true)

    alpha = n_items / (n_items - 1) * (1 - sum_var_by_items / var_items_sum)

    return alpha
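
Note that the function above treats each token as an item and the two label sequences as the observations. For comparison, the sketch below uses the more common layout for agreement between annotation sources, where each source (manual and automatic) is an item and each token is an observation, so the number of items is k = 2. The function name `cronbach_alpha_by_rater` and the toy sequences are hypothetical, and whether this layout matches the intended analysis is an assumption.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha_by_rater(*ratings: Sequence[float]) -> float:
    """Cronbach's alpha with each rater (annotation source) as one item."""
    k = len(ratings)
    if k < 2:
        raise ValueError("need at least two raters")
    # Sample variance of each rater's labels across tokens
    item_variances = [variance(r) for r in ratings]
    # Per-token total score summed over raters
    totals = [sum(values) for values in zip(*ratings)]
    return k / (k - 1) * (1 - sum(item_variances) / variance(totals))


# Toy label ids (hypothetical)
manual = [0, 1, 0, 1]
automatic = [0, 1, 1, 1]
print(f"alpha = {cronbach_alpha_by_rater(manual, automatic):.3f}")
```

With identical sequences the total variance is four times the item variance, so alpha is 1; any disagreement lowers it.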

The following code block defines a function to calculate reliability metrics.

In [5]:
def evaluate_reliability(
        task_list: List[str], 
        ignore_tag: str,
        rating_filter: Optional[List[int]] =None
) -> None:
    all_true = []
    all_pred = []

    if ignore_tag == "<CI>-<CE>-<FILLER>":
        tags = ["<disfluency>", "<word>"]
        print("--- Analysis of Disfluency ---")

    elif ignore_tag == "<CI>-<CE>":
        tags = ["<disfluency>", "<filler>", "<word>"]
        print("--- Analysis of Disfluency & Filler ---")

    elif ignore_tag == "<DISFLUENCY>-<FILLER>":
        tags = ["<ci>", "<ce>", "<word>"]
        print("--- Analysis of Pause Location ---")

    else:
        # Without this branch, an unknown value leaves `tags` undefined
        raise ValueError(f"Unsupported ignore_tag: {ignore_tag}")

    print("- Tasks ... ", end="")

    sample_size = 0
    for task in task_list:
        print(f"{task}, ", end="")
        for true, pred in label_generator(task, ignore_tag, rating_filter=rating_filter, bert=True):
            true = tag_2_id(true, tags=tags)
            pred = tag_2_id(pred, tags=tags)

            all_true += true
            all_pred += pred
            sample_size += 1

    if rating_filter is not None:
        print(f"\n- Target rating ... {rating_filter}")
    else:
        print()
    print(f"- Sample size = {sample_size}")

    acc = accuracy_score(all_true, all_pred)
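    # Pass labels explicitly so the per-class arrays stay aligned with `tags`
    # even if a label never occurs in the selected data
    p, r, f, _ = precision_recall_fscore_support(all_true, all_pred, labels=list(range(len(tags))))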
    # a = cronbach_alpha(all_true, all_pred)
    kappa = cohen_kappa_score(all_true, all_pred)

    print(f"\n- Metrics")
    # print(f"\tCronbach Alpha: {a}")
    print(f"\tCohen's Kappa:  {kappa}")
    print(f"\tAccuracy:       {acc:.03f}\n")

    print(f"\tLabels    | {np.array(tags)}")
    print(f"\tPrecision | {p}") 
    print(f"\tRecall    | {r}")
    print(f"\tF1 score  | {f}")

---

## 3. Reliability Analyses

This section conducts reliability analyses.

### 3.1. Disfluency Detection

#### 3.1.1. All Ratings

In [6]:
evaluate_reliability(TASK, ignore_tag="<CI>-<CE>-<FILLER>")

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, Cartoon, RtSwithoutRAA, RtSwithRAA, WoZ_Interview, 
- Sample size = 2255

- Metrics
	Cohen's Kappa:  0.5928225003650681
	Accuracy:       0.931

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.80300913 0.94104042]
	Recall    | [0.51684305 0.98382442]
	F1 score  | [0.62890329 0.96195694]


In [7]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<CI>-<CE>-<FILLER>")

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.618745919241853
	Accuracy:       0.937

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.80948617 0.94690927]
	Recall    | [0.54584222 0.98438816]
	F1 score  | [0.65202165 0.96528506]


In [8]:
evaluate_reliability(["Cartoon"], ignore_tag="<CI>-<CE>-<FILLER>")

--- Analysis of Disfluency ---
- Tasks ... Cartoon, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.6458304576321436
	Accuracy:       0.933

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.82942974 0.94273   ]
	Recall    | [0.57909705 0.98310214]
	F1 score  | [0.682018  0.9624929]


In [9]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<CI>-<CE>-<FILLER>")

--- Analysis of Disfluency ---
- Tasks ... RtSwithoutRAA, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.5856748522192039
	Accuracy:       0.924

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.82174393 0.93222542]
	Recall    | [0.50491692 0.98417055]
	F1 score  | [0.62549884 0.95749398]


In [10]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<CI>-<CE>-<FILLER>")

--- Analysis of Disfluency ---
- Tasks ... RtSwithRAA, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.5563248083041319
	Accuracy:       0.921

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.80912863 0.92962501]
	Recall    | [0.47280914 0.98423655]
	F1 score  | [0.59685177 0.95615161]


In [11]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>")
# TODO: correct the disfluency positions in df_annotation_manu

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Sample size = 1743

- Metrics
	Cohen's Kappa:  0.5713529006729258
	Accuracy:       0.937

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.76251455 0.94901503]
	Recall    | [0.49974568 0.98355944]
	F1 score  | [0.60377938 0.9659785 ]


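The update to the Disfluency Detector made it possible to annotate long disfluency sequences.
However, it also seems to have increased the number of cases where spans that should not be flagged are judged as disfluencies (e.g., 008_031).
In fact, recall has improved from 0.631 to 0.804 since APSIPA, whereas precision has dropped sharply from 0.832 to 0.565. Fewer disfluencies are missed, but extra spans are now judged as disfluencies, which is very likely why kappa decreased.

**TODO: Check the performance when swapping in the previous disfluency detector. Alternatively, what happens if we switch to a model trained on L1 data?**
<br>→ Using the BERT version actually lowered the accuracy...

At the very least, I would like kappa to be above 0.60...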
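
To see why kappa can stay near 0.6 while accuracy is above 0.9, it helps to compute kappa by hand as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from the label frequencies. The sketch below does this on a toy sequence in which most tokens are words, as in the outputs above. The function `cohen_kappa` and the toy counts are hypothetical and are not derived from the experiment data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Cohen's kappa from observed and chance agreement."""
    n = len(y_true)
    # Observed agreement: share of tokens with identical labels
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Toy data (hypothetical): 100 tokens, 10 true disfluencies (0) and 90 words (1).
# The detector finds 5 of the 10 disfluencies and raises 1 false alarm.
y_true = [0] * 10 + [1] * 90
y_pred = [0] * 5 + [1] * 5 + [0] * 1 + [1] * 89

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
kappa = cohen_kappa(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, kappa = {kappa:.3f}")
```

Here chance agreement alone is already 0.852 because both sequences are dominated by the word label, so an accuracy of 0.94 corresponds to a kappa of only about 0.59.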

#### 3.1.2. Beginners

In [12]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, 
- Target rating ... [0, 1, 2]
- Sample size = 35

- Metrics
	Cohen's Kappa:  0.6090367613622097
	Accuracy:       0.914

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.82382134 0.92356512]
	Recall    | [0.54515599 0.97922762]
	F1 score  | [0.65612648 0.95058222]


In [13]:
evaluate_reliability(["Cartoon"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Disfluency ---
- Tasks ... Cartoon, 
- Target rating ... [0, 1, 2]
- Sample size = 18

- Metrics
	Cohen's Kappa:  0.6150261367666208
	Accuracy:       0.910

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.77586207 0.92770543]
	Recall    | [0.58290155 0.96949789]
	F1 score  | [0.66568047 0.94814135]


In [14]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Disfluency ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [0, 1, 2]
- Sample size = 31

- Metrics
	Cohen's Kappa:  0.6077705504590072
	Accuracy:       0.911

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.82018349 0.9217759 ]
	Recall    | [0.54712362 0.97801705]
	F1 score  | [0.65638767 0.949064  ]


In [15]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Disfluency ---
- Tasks ... RtSwithRAA, 
- Target rating ... [0, 1, 2]
- Sample size = 18

- Metrics
	Cohen's Kappa:  0.6126385101629456
	Accuracy:       0.916

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.82329317 0.92645779]
	Recall    | [0.54812834 0.9797515 ]
	F1 score  | [0.65810594 0.95235965]


In [16]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[0, 1])

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Target rating ... [0, 1]
- Sample size = 603

- Metrics
	Cohen's Kappa:  0.5737943612703172
	Accuracy:       0.925

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.81500873 0.93367975]
	Recall    | [0.49054622 0.98471301]
	F1 score  | [0.61245902 0.95851758]


#### 3.1.3. Intermediate

In [17]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, 
- Target rating ... [3, 4, 5]
- Sample size = 66

- Metrics
	Cohen's Kappa:  0.6335498842932917
	Accuracy:       0.938

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.81137725 0.94827157]
	Recall    | [0.565762   0.98374613]
	F1 score  | [0.66666667 0.96568317]


In [18]:
evaluate_reliability(["Cartoon"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Disfluency ---
- Tasks ... Cartoon, 
- Target rating ... [3, 4, 5]
- Sample size = 78

- Metrics
	Cohen's Kappa:  0.6689322328419027
	Accuracy:       0.934

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.84671533 0.94294519]
	Recall    | [0.60381724 0.9835795 ]
	F1 score  | [0.7049291  0.96283382]


In [19]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Disfluency ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [3, 4, 5]
- Sample size = 68

- Metrics
	Cohen's Kappa:  0.58070961121569
	Accuracy:       0.924

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.84468339 0.93070792]
	Recall    | [0.48893499 0.98707239]
	F1 score  | [0.61936049 0.95806187]


In [20]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Disfluency ---
- Tasks ... RtSwithRAA, 
- Target rating ... [3, 4, 5]
- Sample size = 81

- Metrics
	Cohen's Kappa:  0.5501154901701164
	Accuracy:       0.919

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.8036225  0.92816437]
	Recall    | [0.46807329 0.98363001]
	F1 score  | [0.59157895 0.95509259]


In [21]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[2, 3])

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Target rating ... [2, 3]
- Sample size = 979

- Metrics
	Cohen's Kappa:  0.5743955205879954
	Accuracy:       0.934

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.75698324 0.94720472]
	Recall    | [0.50901578 0.98178697]
	F1 score  | [0.60871518 0.96418586]


#### 3.1.4. Advanced

In [22]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Disfluency ---
- Tasks ... Arg_Oly, 
- Target rating ... [6, 7, 8]
- Sample size = 27

- Metrics
	Cohen's Kappa:  0.574244810334049
	Accuracy:       0.956

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.77319588 0.9637152 ]
	Recall    | [0.48543689 0.98968831]
	F1 score  | [0.59642147 0.97652908]


In [23]:
evaluate_reliability(["Cartoon"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Disfluency ---
- Tasks ... Cartoon, 
- Target rating ... [6, 7, 8]
- Sample size = 32

- Metrics
	Cohen's Kappa:  0.6007790155894494
	Accuracy:       0.939

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.81632653 0.94752368]
	Recall    | [0.51575931 0.98690168]
	F1 score  | [0.63213345 0.96681188]


In [24]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Disfluency ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [6, 7, 8]
- Sample size = 29

- Metrics
	Cohen's Kappa:  0.5654290121499763
	Accuracy:       0.932

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.77906977 0.94289898]
	Recall    | [0.48833819 0.98387371]
	F1 score  | [0.60035842 0.96295066]


In [25]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Disfluency ---
- Tasks ... RtSwithRAA, 
- Target rating ... [6, 7, 8]
- Sample size = 29

- Metrics
	Cohen's Kappa:  0.5395400891690971
	Accuracy:       0.927

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.81491003 0.93408977]
	Recall    | [0.44522472 0.98730159]
	F1 score  | [0.57584015 0.95995884]


In [26]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<CI>-<CE>-<FILLER>", rating_filter=[4, 5])

--- Analysis of Disfluency ---
- Tasks ... WoZ_Interview, 
- Target rating ... [4, 5]
- Sample size = 161

- Metrics
	Cohen's Kappa:  0.5193073213706664
	Accuracy:       0.963

	Labels    | ['<disfluency>' '<word>']
	Precision | [0.6682243  0.97312241]
	Recall    | [0.44968553 0.98891837]
	F1 score  | [0.53759398 0.9809568 ]


### 3.2. Pause Location Classification

#### 3.2.1. All Ratings

In [27]:
evaluate_reliability(TASK, ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, Cartoon, RtSwithoutRAA, RtSwithRAA, WoZ_Interview, 
- Sample size = 2255

- Metrics
	Cohen's Kappa:  0.7036632694925453
	Accuracy:       0.878

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.67683477 0.6777591  0.95726591]
	Recall    | [0.82744102 0.65650098 0.90747183]
	F1 score  | [0.74459864 0.66696069 0.93170405]


In [28]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.7397392020769109
	Accuracy:       0.894

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.73170732 0.74162679 0.95403891]
	Recall    | [0.82984658 0.67863398 0.92605268]
	F1 score  | [0.77769306 0.70873342 0.9398375 ]


In [29]:
evaluate_reliability(["Cartoon"], ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... Cartoon, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.7433434831509741
	Accuracy:       0.889

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.70397004 0.75346852 0.9627649 ]
	Recall    | [0.84529592 0.71133501 0.91712655]
	F1 score  | [0.76818702 0.7317958  0.93939174]


In [30]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... RtSwithoutRAA, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.7267736383514194
	Accuracy:       0.882

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.70953164 0.71690307 0.95475566]
	Recall    | [0.83984438 0.66103542 0.9117232 ]
	F1 score  | [0.76920793 0.68783669 0.93274336]


In [31]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... RtSwithRAA, 
- Sample size = 128

- Metrics
	Cohen's Kappa:  0.7269164692421093
	Accuracy:       0.881

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.71509206 0.71542228 0.95256635]
	Recall    | [0.84033481 0.63463626 0.91348693]
	F1 score  | [0.77267117 0.6726122  0.93261744]


In [32]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>")

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Sample size = 1743

- Metrics
	Cohen's Kappa:  0.6251470365992093
	Accuracy:       0.862

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.58232893 0.55223881 0.959684  ]
	Recall    | [0.79204125 0.61410788 0.88955064]
	F1 score  | [0.6711853  0.58153242 0.92328739]


#### 3.2.2. Beginners

In [33]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, 
- Target rating ... [0, 1, 2]
- Sample size = 35

- Metrics
	Cohen's Kappa:  0.728115031673858
	Accuracy:       0.880

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.78329393 0.7012987  0.92553487]
	Recall    | [0.79140127 0.60447761 0.93174825]
	F1 score  | [0.78732673 0.6492986  0.92863117]


In [34]:
evaluate_reliability(["Cartoon"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Pause Location ---
- Tasks ... Cartoon, 
- Target rating ... [0, 1, 2]
- Sample size = 18

- Metrics
	Cohen's Kappa:  0.715874349905717
	Accuracy:       0.867

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.76820388 0.72682927 0.91670399]
	Recall    | [0.77955665 0.64224138 0.92290352]
	F1 score  | [0.77383863 0.6819222  0.9197933 ]


In [35]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Pause Location ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [0, 1, 2]
- Sample size = 31

- Metrics
	Cohen's Kappa:  0.7282912924578773
	Accuracy:       0.873

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.74440396 0.65374677 0.94609341]
	Recall    | [0.84117647 0.62935323 0.90538847]
	F1 score  | [0.78983706 0.64131812 0.92529349]


In [36]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1, 2])

--- Analysis of Pause Location ---
- Tasks ... RtSwithRAA, 
- Target rating ... [0, 1, 2]
- Sample size = 18

- Metrics
	Cohen's Kappa:  0.737504798100663
	Accuracy:       0.875

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.7758794  0.67582418 0.93435449]
	Recall    | [0.84463895 0.58293839 0.9135644 ]
	F1 score  | [0.80880042 0.6259542  0.92384249]


In [37]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[0, 1])

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Target rating ... [0, 1]
- Sample size = 603

- Metrics
	Cohen's Kappa:  0.6094342055232786
	Accuracy:       0.847

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.65502183 0.54545455 0.91948405]
	Recall    | [0.7094062  0.52229299 0.9025723 ]
	F1 score  | [0.68113017 0.53362256 0.91094969]


#### 3.2.3. Intermediate

In [38]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, 
- Target rating ... [3, 4, 5]
- Sample size = 66

- Metrics
	Cohen's Kappa:  0.7448269936539953
	Accuracy:       0.893

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.73351125 0.75495495 0.95525755]
	Recall    | [0.84832451 0.68914474 0.92042024]
	F1 score  | [0.78675118 0.7205503  0.93751538]


In [39]:
evaluate_reliability(["Cartoon"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Pause Location ---
- Tasks ... Cartoon, 
- Target rating ... [3, 4, 5]
- Sample size = 78

- Metrics
	Cohen's Kappa:  0.7460677632244399
	Accuracy:       0.885

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.71037003 0.7515639  0.96352169]
	Recall    | [0.85864163 0.70910624 0.91007889]
	F1 score  | [0.7775     0.729718   0.93603809]


In [40]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Pause Location ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [3, 4, 5]
- Sample size = 68

- Metrics
	Cohen's Kappa:  0.7237150315601042
	Accuracy:       0.879

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.70226096 0.72377622 0.95467695]
	Recall    | [0.84220843 0.65026178 0.90967564]
	F1 score  | [0.76589424 0.6850524  0.93163318]


In [41]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[3, 4, 5])

--- Analysis of Pause Location ---
- Tasks ... RtSwithRAA, 
- Target rating ... [3, 4, 5]
- Sample size = 81

- Metrics
	Cohen's Kappa:  0.7315934604213616
	Accuracy:       0.882

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.7191417  0.72222222 0.95349915]
	Recall    | [0.84603849 0.65337955 0.91153449]
	F1 score  | [0.77744603 0.68607825 0.9320447 ]


In [42]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[2, 3])

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Target rating ... [2, 3]
- Sample size = 979

- Metrics
	Cohen's Kappa:  0.6338831708702791
	Accuracy:       0.861

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.57632933 0.57748777 0.96503699]
	Recall    | [0.82101405 0.63251936 0.88310264]
	F1 score  | [0.67724868 0.60375213 0.9222536 ]


#### 3.2.4. Advanced

In [43]:
evaluate_reliability(["Arg_Oly"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Pause Location ---
- Tasks ... Arg_Oly, 
- Target rating ... [6, 7, 8]
- Sample size = 27

- Metrics
	Cohen's Kappa:  0.7348963787850904
	Accuracy:       0.909

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.66058764 0.74903475 0.97560976]
	Recall    | [0.83804627 0.72932331 0.93188374]
	F1 score  | [0.7388102  0.73904762 0.95324557]


In [44]:
evaluate_reliability(["Cartoon"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Pause Location ---
- Tasks ... Cartoon, 
- Target rating ... [6, 7, 8]
- Sample size = 32

- Metrics
	Cohen's Kappa:  0.7442576390283934
	Accuracy:       0.905

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.64901561 0.76727273 0.97784912]
	Recall    | [0.84977778 0.74426808 0.92851699]
	F1 score  | [0.73595073 0.75559534 0.95254476]


In [45]:
evaluate_reliability(["RtSwithoutRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Pause Location ---
- Tasks ... RtSwithoutRAA, 
- Target rating ... [6, 7, 8]
- Sample size = 29

- Metrics
	Cohen's Kappa:  0.7280804867143461
	Accuracy:       0.893

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.68596491 0.75838926 0.96148099]
	Recall    | [0.83309659 0.70920502 0.92008262]
	F1 score  | [0.75240539 0.73297297 0.94032638]


In [46]:
evaluate_reliability(["RtSwithRAA"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[6, 7, 8])

--- Analysis of Pause Location ---
- Tasks ... RtSwithRAA, 
- Target rating ... [6, 7, 8]
- Sample size = 29

- Metrics
	Cohen's Kappa:  0.7066849445665323
	Accuracy:       0.882

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.6693502  0.71568627 0.95763293]
	Recall    | [0.82145378 0.61215933 0.91780372]
	F1 score  | [0.73764259 0.65988701 0.93729539]


In [47]:
evaluate_reliability(["WoZ_Interview"], ignore_tag="<DISFLUENCY>-<FILLER>", rating_filter=[4, 5])

--- Analysis of Pause Location ---
- Tasks ... WoZ_Interview, 
- Target rating ... [4, 5]
- Sample size = 161

- Metrics
	Cohen's Kappa:  0.5947254523995277
	Accuracy:       0.884

	Labels    | ['<ci>' '<ce>' '<word>']
	Precision | [0.49471974 0.44102564 0.98565414]
	Recall    | [0.812      0.66153846 0.89991458]
	F1 score  | [0.61484099 0.52923077 0.94083501]
