# Introduction
From December 2022 to July 2023 we conducted several annotation studies on the data from the DWUG DE/EN/SV data sets. They fall into three studies:
- Study 1: annotate a small number of randomly selected edges
- Study 2: resample 25+25 uses and annotate them rather densely
- Study 3: an extra round on uncompared clusters

## Study 1 ('uses')
Study 1 was carried out on DWUG DE/EN/SV. For Swedish it was repeated because of low agreement. We uploaded all uses for all words from the most recent DWUG data sets to the DURel system and instructed each annotator to annotate each word for 30 minutes. One exception: in the repeated study on the Swedish data, annotators had 60 minutes for each word. The order of annotation instances (pairs) was randomized by the DURel tool. Some important points:

- Annotators JoshuaC99 and Frida did not finish the study and should be excluded if their agreement with the other annotators is low.
- In the Swedish studies some words were excluded from the annotation during the first annotation round because there seemed to be many "Cannot decide" judgments. Hence, we decided to instruct annotators to skip the words that we had already excluded for SemEval. These are:

anda
bäck
dun
fack
fången
framlida
gloria
ingående
jordisk
mode
sittning
stöt
uppslag

It is possible that some of these words have a small number of annotations because this instruction was sent out to annotators during the first round of annotation. See below for notes on Swedish data quality, agreement and the repeated rounds for Swedish.
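To see which of the excluded words nevertheless received judgments, one could count judgments per lemma. The sketch below is illustrative only: the `judgments` rows are toy data and the helper name `count_excluded` is hypothetical. Only the word list is taken from above.

```python
from collections import Counter

EXCLUDED_SV = ['anda', 'bäck', 'dun', 'fack', 'fången', 'framlida', 'gloria',
               'ingående', 'jordisk', 'mode', 'sittning', 'stöt', 'uppslag']

def count_excluded(judgments, excluded=EXCLUDED_SV):
    """Return the number of judgments per excluded lemma (0 if none)."""
    counts = Counter(row['lemma'] for row in judgments)
    return {lemma: counts.get(lemma, 0) for lemma in excluded}

# Toy rows in the shape of judgments.csv (only the 'lemma' column is used)
judgments = [{'lemma': 'anda'}, {'lemma': 'anda'}, {'lemma': 'mode'}, {'lemma': 'ledning'}]
counts = count_excluded(judgments)
print({lemma: n for lemma, n in counts.items() if n > 0})
```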


## Study 2 ('resampled')
For this study, we resampled (random without replacement) 25+25 uses for 15 words from the source corpora used for SemEval. Important: In resampling we did not apply the constraint we had in SemEval on sentence length. Excluded words from above were also excluded from the annotation. See below for Swedish data quality, agreement and multiple rounds for Swedish.
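The resampling step can be sketched as follows. This is a minimal illustration, not the script used for the study: it assumes that "25+25" means 25 uses drawn from each of two time periods, and the function name `resample_uses`, the period keys and the toy identifiers are hypothetical.

```python
import random

def resample_uses(uses_by_period, n_per_period=25, seed=0):
    """Draw n_per_period uses from each period, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = {}
    for period, uses in uses_by_period.items():
        # random.sample never returns the same element twice;
        # take everything if a period has fewer uses than requested
        k = min(n_per_period, len(uses))
        sample[period] = rng.sample(uses, k)
    return sample

# Toy pools of use identifiers (hypothetical)
uses_by_period = {
    'period1': [f'p1-use{i}' for i in range(100)],
    'period2': [f'p2-use{i}' for i in range(40)],
}
sample = resample_uses(uses_by_period)
print({period: len(uses) for period, uses in sample.items()})
```

No sentence-length filter is applied in the sketch, in line with the note above that the SemEval length constraint was not used during resampling.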

## Study 3 ('unc')
In this study, we took the aggregated graphs from the latest DWUG and DiscoWUG data sets and sampled (at most) 3 edges for clusters which had not been compared with each other (see cluster_find_uncompared_edges.ipynb). The aim is to make the annotation algorithm described in SemEval converge (all cluster combinations are annotated). All annotators annotated the same data (all sampled edges). See below for Swedish data quality and agreement.
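A minimal sketch of the sampling idea follows. It is not the code from `cluster_find_uncompared_edges.ipynb`: the function name `sample_uncompared_edges`, the input layout (a use-to-cluster mapping plus a list of annotated pairs) and the toy data are assumptions made for illustration.

```python
import random
from itertools import combinations, product

def sample_uncompared_edges(use2cluster, annotated_edges, max_edges=3, seed=0):
    """Sample up to max_edges use pairs for every pair of clusters
    that is not yet connected by any annotated edge."""
    rng = random.Random(seed)
    # Group uses by cluster
    clusters = {}
    for use, cluster in use2cluster.items():
        clusters.setdefault(cluster, []).append(use)
    # Cluster pairs that already share at least one annotated edge
    compared = {frozenset((use2cluster[u], use2cluster[v])) for u, v in annotated_edges}
    sampled = {}
    for c1, c2 in combinations(sorted(clusters), 2):
        if frozenset((c1, c2)) in compared:
            continue
        candidates = list(product(sorted(clusters[c1]), sorted(clusters[c2])))
        sampled[(c1, c2)] = rng.sample(candidates, min(max_edges, len(candidates)))
    return sampled

use2cluster = {'u1': 0, 'u2': 0, 'u3': 1, 'u4': 1, 'u5': 2}
annotated_edges = [('u1', 'u2'), ('u1', 'u3')]  # clusters 0 and 1 were compared
edges = sample_uncompared_edges(use2cluster, annotated_edges)
for cluster_pair, pairs in edges.items():
    print(cluster_pair, pairs)
```

In the toy data only the cluster pairs (0, 2) and (1, 2) lack an annotated edge, so only those receive sampled edges.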

### General notes
- Swedish data shows consistently low agreement. The data is also very dirty (many OCR errors). This is one explanation.
- Because of low agreement, study 1 and 2 have been repeated on Swedish with other annotators.
- During the annotation we observed several bugs in the annotation system. These were subsequently solved but may have had minor influences on the annotation:
 * In the first round of study 2 some of the indices of target words were erroneous (mostly off by a small number of characters). This led to a bug in the annotation system showing annotators a message that everything was annotated. After the bug was solved annotators were instructed to continue the annotation.
 * In June we discovered some bugs concerning randomization:
     * randomization within pairs: the randomization of order within annotation pairs did not work for projects uploaded with the "upload pairs" function (concerns only study 2) and the tutorials. This was solved in early July, so it (probably) did not affect dwug_sv_resampled2. This was (probably) not present in the first round of studies.
     * randomization of sequence: the underlying annotation sequence was not fixed, so some pairs were annotated multiple times and sometimes not all pairs uploaded with the "upload pairs" function were annotated. This was solved in early July, so it did not affect dwug_sv_resampled2, but maybe dwug_sv_uses2. For the first round of studies this was solved with a hot fix.
     * skipped instances: the annotation index was increased twice when clicking once. This (probably) had no effect on the annotated data. Couldn't be reproduced, so no fix was applied.
- todo Dominik: validate the statistics calculated in this notebook with previous calculations on earlier data versions

#### Prerequisites
- install packages as described in the readme of this repository
- if you observe any problems, please open an issue in this repository

In [None]:
# import scripts from parent folder
import sys
sys.path.append('../')

In [None]:
# Download re-annotated data
import requests
data = 'https://drive.google.com/uc?export=download&id=13FS46AVAFnEwx1NwJAYgbd0DbozmrS4q'
r = requests.get(data, allow_redirects=True)
r.raise_for_status()  # stop early if the download failed
f = 'dwug_de_en_sv_reannotation_studies-from-annotation-workspace.zip'
with open(f, 'wb') as zip_file:  # make sure the file is closed before unzipping
    zip_file.write(r.content)

import zipfile
with zipfile.ZipFile(f) as z:
    z.extractall()

In [None]:
from pathlib import Path
import csv
# Preprocess annotated data
datasets = ['discowug_unc', 'dwug_de_resampled', 'dwug_de_unc', 'dwug_de_uses', 'dwug_en_resampled', 'dwug_en_uses', 'dwug_sv_resampled', 'dwug_sv_resampled2', 'dwug_sv_unc', 'dwug_sv_uses', 'dwug_sv_uses2']

# Make output directory
input_path = 'dwug_de_en_sv_reannotation_studies-from-annotation-workspace'
output_path = 'data_output'
Path(output_path).mkdir(parents=True, exist_ok=True)

utility = ['random', 'xlmr+mlp+binary', 'tuozhang', 'garrafao']
low_agreement = ['Frida', 'JoshuaC99'] + ['Tegehall', 'Maria Rumar']
undesired_annotators = utility + low_agreement  # They need to be filtered out

# map identifiers of uses (data still has DURel-internal identifiers)
for dataset in datasets:
    for p in Path(input_path+'/'+dataset+'/data').glob('*/'):
        print(p)
        lemma = str(p).split('/')[-1]
        with open(str(p)+'/'+'uses.csv', encoding='utf-8') as csvfile: 
            reader = csv.DictReader(csvfile, delimiter='\t',quoting=csv.QUOTE_NONE,strict=True)
            uses = [row for row in reader]
        with open(str(p)+'/'+'judgments.csv', encoding='utf-8') as csvfile: 
            reader = csv.DictReader(csvfile, delimiter='\t',quoting=csv.QUOTE_NONE,strict=True)
            judgments = [row for row in reader]

        # Get mapping of system identifiers to original identifiers
        identifiersystem2identifier = {row['identifier_system']:row['identifier'] for row in uses}
        
        uses_out = []
        for row in uses:
            row_out = {key:val for key, val in row.items() if not key in ['identifier_system','project','lang','user']}
            uses_out.append(row_out)
        
        judgments_out = []
        for row in judgments:
            if row['annotator'] in undesired_annotators: # filter out undesired annotators
                continue
            row_out = {key:(val if not key in ['identifier1','identifier2'] else identifiersystem2identifier[val]) for key, val in row.items()}            
            judgments_out.append(row_out)

        # Continue if word was not annotated    
        if judgments_out == []:
            continue        
        
        output_path_lemma = output_path+'/'+dataset+'/data/'+lemma+'/'
        Path(output_path_lemma).mkdir(parents=True, exist_ok=True)

        # output_path_lemma already ends with '/'; write UTF-8 explicitly
        # quotechar=None because an empty quotechar is not allowed from Python 3.11 on
        with open(output_path_lemma+'judgments.csv', 'w', encoding='utf-8', newline='') as f:
            w = csv.DictWriter(f, judgments_out[0].keys(), delimiter='\t', quoting=csv.QUOTE_NONE, quotechar=None)
            w.writeheader()
            w.writerows(judgments_out)

        with open(output_path_lemma+'uses.csv', 'w', encoding='utf-8', newline='') as f:
            w = csv.DictWriter(f, uses_out[0].keys(), delimiter='\t', quoting=csv.QUOTE_NONE, quotechar=None)
            w.writeheader()
            w.writerows(uses_out)
                
                
# todo: clean the comment column
# todo: rename lemma folders in English data

In [None]:
# Download original data for comparison
path_original = 'datasets_original'
datasets_original = ['dwug_de', 'dwug_en', 'dwug_sv', 'discowug'] # versions: 'dwug_de_230', 'dwug_en_201', 'dwug_sv_201', 'discowug_111'
datasets_original_links = ['https://zenodo.org/record/7441645/files/dwug_de.zip?download=1', 'https://zenodo.org/record/7387261/files/dwug_en.zip?download=1', 'https://zenodo.org/record/7389506/files/dwug_sv.zip?download=1', 'https://zenodo.org/record/7396225/files/discowug.zip?download=1']

Path(path_original).mkdir(parents=True, exist_ok=True)
for name, link in zip(datasets_original, datasets_original_links):
    r = requests.get(link, allow_redirects=True)
    f = path_original+'/'+name+'.zip'
    with open(f, 'wb') as zip_file:
        zip_file.write(r.content)

    with zipfile.ZipFile(f) as z:
        z.extractall(path_original)

In [None]:
import pandas as pd
# Load new datasets into data frame
df_judgments = pd.DataFrame()
for dataset in datasets:
    for p in Path(output_path+'/'+dataset+'/data').glob('*/judgments.csv'):
        #print(p)
        df = pd.read_csv(p, delimiter='\t', quoting=csv.QUOTE_NONE, na_filter=False)
        df['dataset'] = dataset
        df_judgments = pd.concat([df_judgments, df])
display(df_judgments)

In [None]:
# Load old/original datasets into data frame
df_judgments_original = pd.DataFrame()
for dataset in datasets_original:
    for p in Path(path_original+'/'+dataset+'/data').glob('*/judgments.csv'):
        #print(p)
        df = pd.read_csv(p, delimiter='\t', quoting=csv.QUOTE_NONE, na_filter=False)
        df['dataset'] = dataset
        df_judgments_original = pd.concat([df_judgments_original, df])
display(df_judgments_original)

In [None]:
# Aggregate data frames and validate
import numpy as np
# Get all annotators
annotators = df_judgments.annotator.unique()
#display(annotators)

# Get mapping from studies to annotators
dataset2annotators = df_judgments.groupby(['dataset']).annotator.unique()
#display(dataset2annotators)
#display(dataset2annotators['dwug_de_resampled'])

# Get aggregated data as instance versus annotator
df_judgments[['identifier1','identifier2']] = np.sort(df_judgments[['identifier1','identifier2']], axis=1) # sort within pairs to be able to aggregate
df_judgments_pair_vs_ann = pd.DataFrame()
for annotator in annotators:
    judgments_annotator = df_judgments[df_judgments['annotator'] == annotator][['identifier1', 'identifier2', 'lemma', 'dataset', 'judgment']].rename(columns={'judgment': annotator}, inplace=False)
    #display(judgments_annotator)
    df_judgments_pair_vs_ann = pd.concat([df_judgments_pair_vs_ann, judgments_annotator])

#display(df_judgments_pair_vs_ann)
df_judgments_pair_vs_ann_aggregated = df_judgments_pair_vs_ann.groupby(['identifier1','identifier2']).first().reset_index()  
display(df_judgments_pair_vs_ann_aggregated)
# Sanity check number of rows
df_judgments_aggregated = df_judgments.groupby(['identifier1','identifier2']).agg({'judgment':lambda x: list(x)})
#display(df_judgments_aggregated)
assert len(df_judgments_aggregated.index) == len(df_judgments_pair_vs_ann_aggregated.index)
# Check for duplicate rows in non-aggregated (but sorted) data frame
assert len(df_judgments.index) == len(df_judgments.drop_duplicates().index)
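The reshaping done in the cell above (sort identifiers within each pair, then collect one column per annotator) can be illustrated without pandas. This is a toy sketch on made-up rows: `pair_vs_annotator` is a hypothetical helper name, and the row keys simply mirror the column names used above.

```python
def pair_vs_annotator(judgments):
    """Build {(id1, id2): {annotator: judgment}} with ids sorted within pairs."""
    table = {}
    for row in judgments:
        # ('use2', 'use1') and ('use1', 'use2') are the same pair, so sort the identifiers
        pair = tuple(sorted((row['identifier1'], row['identifier2'])))
        table.setdefault(pair, {})[row['annotator']] = row['judgment']
    return table

judgments = [
    {'identifier1': 'use2', 'identifier2': 'use1', 'annotator': 'A', 'judgment': 4.0},
    {'identifier1': 'use1', 'identifier2': 'use2', 'annotator': 'B', 'judgment': 3.0},
    {'identifier1': 'use1', 'identifier2': 'use3', 'annotator': 'A', 'judgment': 1.0},
]
table = pair_vs_annotator(judgments)
for pair, by_annotator in table.items():
    print(pair, by_annotator)
```

One difference to keep in mind: in this sketch a repeated judgment by the same annotator overwrites the earlier one, whereas `groupby(...).first()` keeps the first non-missing value.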

In [None]:
# Calculate agreement for each study (new data)
import krippendorff_ as krippendorff
from itertools import combinations

gb = df_judgments_pair_vs_ann_aggregated.groupby('dataset')
groups = gb.groups
for dataset in groups.keys():
    
    df_group = gb.get_group(dataset)
    annotators_dataset = dataset2annotators[dataset]
    df_group = df_group[annotators_dataset]
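    # 0.0 judgments mean "cannot decide"; convert to numbers first so that both '0.0' strings and 0.0 floats are caught
    df_group = df_group.apply(pd.to_numeric, errors='coerce').replace(0.0, np.nan)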

    data = np.transpose(df_group.values)
    kri = krippendorff.alpha(reliability_data=data, level_of_measurement='ordinal')
    #print(data)
    print(dataset, kri)

    # Pairwise
    for a, b in combinations(annotators_dataset, 2):
        data = [df_group[a].values, df_group[b].values]
        kri = krippendorff.alpha(reliability_data=data, level_of_measurement='ordinal')
        print('  ', a, b, kri)
        
# Per lemma

# Evaluate annotators for exclusion

In [None]:
# make annotator names unique across all datasets
df_judgments_original['annotator'] = df_judgments_original.annotator + '-' + df_judgments_original.dataset
df_judgments_all = pd.concat([df_judgments, df_judgments_original], axis=0)
annotators = df_judgments_all.annotator.unique()
info = df_judgments_all[['annotator', 'dataset']].drop_duplicates()
info['lang'] = info.dataset.apply(lambda x: 'en' if '_en' in x else 'sv' if '_sv' in x else 'de')
dataset_annotators = info.groupby('dataset')['annotator'].apply(lambda x: list(set(x)))
dataset_language = info[['dataset', 'lang']].drop_duplicates().set_index('dataset')['lang']
language_annotators = info[['annotator', 'lang']].drop_duplicates().groupby('lang').annotator.apply(list)
annotator_language = info[['annotator', 'lang']].drop_duplicates().set_index('annotator')

In [None]:
# for duplicate judgments (same item, same annotator), take the median (86 total, mostly pairs)
# unless the median is out of domain, in which case ignore
value_domain = [float(v) for v in range(1, 4+1)]
index_cols = ['identifier1' ,'identifier2', 'annotator', 'lemma', 'dataset']
display(df_judgments_all[df_judgments_all.duplicated(subset=index_cols, keep=False)])
# the first positional argument of median() is numeric_only, so pass it explicitly
df_judgments_all = df_judgments_all.groupby(index_cols).median(numeric_only=True).reset_index()
df_judgments_all = df_judgments_all[df_judgments_all.judgment.apply(lambda x: x in value_domain)]
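The duplicate rule in the cell above can be illustrated with plain Python. This is a toy sketch with made-up judgments, not a replacement for the pandas code, and `resolve_duplicates` is a hypothetical helper name.

```python
from statistics import median

VALUE_DOMAIN = {1.0, 2.0, 3.0, 4.0}

def resolve_duplicates(judgments):
    """Collapse repeated judgments of one annotator on one pair to their
    median; return None if the median is not a valid scale value."""
    m = float(median(judgments))
    return m if m in VALUE_DOMAIN else None

print(resolve_duplicates([3.0]))            # a single judgment is kept as is
print(resolve_duplicates([2.0, 2.0, 4.0]))  # odd count: the middle value
print(resolve_duplicates([1.0, 2.0]))       # median 1.5 is off the scale
```

With an even number of judgments the median is the mean of the two middle values, so two conflicting judgments such as 1 and 2 are dropped rather than resolved. A lone 0.0 ("cannot decide") is also outside the domain and is dropped by the same filter.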

In [None]:
# Get aggregated data as instance versus annotator
df_judgments_all[['identifier1','identifier2']] = np.sort(df_judgments_all[['identifier1','identifier2']], axis=1) # sort within pairs to be able to aggregate
df_judgments_all_pair_vs_ann = pd.DataFrame()
annotator2dataset = df_judgments_all[['annotator', 'dataset']].drop_duplicates().set_index('annotator')
for annotator in annotators:
    judgments_annotator = df_judgments_all[df_judgments_all['annotator'] == annotator][['identifier1', 'identifier2', 'lemma', 'dataset', 'judgment']].rename(columns={'judgment': annotator}, inplace=False)
    #display(judgments_annotator)
    df_judgments_all_pair_vs_ann = pd.concat([df_judgments_all_pair_vs_ann, judgments_annotator])

#display(df_judgments_pair_vs_ann)
df_judgments_all_pair_vs_ann_aggregated = df_judgments_all_pair_vs_ann.groupby(['identifier1','identifier2']).first().reset_index()  
# display(df_judgments_all_pair_vs_ann_aggregated)
# Sanity check number of rows
df_judgments_all_aggregated = df_judgments_all.groupby(['identifier1','identifier2']).agg({'judgment':lambda x: list(x)})
#display(df_judgments_aggregated)
assert len(df_judgments_all_aggregated.index) == len(df_judgments_all_pair_vs_ann_aggregated.index)
# Check for duplicate rows in non-aggregated (but sorted) data frame (we cheated so there definitely aren't...)
assert len(df_judgments_all.index) == len(df_judgments_all.drop_duplicates().index)


In [None]:
from modules import get_agreements
value_domain = [float(v) for v in range(1, 4+1)]
level_of_measurement = 'ordinal'
df = df_judgments_all_pair_vs_ann_aggregated.copy()
df = df.replace(0, np.nan)
df = df[annotators] 
stats = get_agreements(df, value_domain=value_domain, level_of_measurement=level_of_measurement, metrics=['spr', 'kri'])

In [None]:
stats_lang = {}
for lang in language_annotators.keys():
    df_lang = df[language_annotators.loc[lang]].dropna(axis=0, how='all')
    stats_lang[lang] = get_agreements(df_lang, value_domain=value_domain, level_of_measurement=level_of_measurement, metrics=['spr', 'kri'])

In [None]:
for metric, agg in [('kri', 'full'), ('spr', 'mean_weighted')]:
    print(metric, agg)
    for lang in language_annotators.keys():
        print(lang, stats_lang[lang][metric][agg])


In [None]:
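# Pairwise permutations (per-annotator block analysis)
from itertools import permutations  # only `combinations` was imported above
from scipy.stats import spearmanr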
annotator_agg = {a: {'krippendorff': [], 'spearman': [], 'support': []} for a in annotators}
lang_agg = {l: {'krippendorff': [], 'spearman': [], 'support': []} for l in ['de', 'en', 'sv']}
for a, b in permutations(annotators, 2):
    data = [df[a].values, df[b].values]
    n_common = (~np.isnan(data[0]) * ~np.isnan(data[1])).sum()
    if n_common == 0:
        continue
    lang = annotator_language.loc[a].item()
    assert(lang == annotator_language.loc[b].item())
    kri = krippendorff.alpha(reliability_data=data, level_of_measurement='ordinal')
    if n_common >= 3:
        spr, spr_p = spearmanr(data[0], data[1], nan_policy='omit')
    else:
        spr, spr_p = np.nan, np.nan
    # only record for 'a' since we duplicate with roles switched (permutations)
    annotator_agg[a]['krippendorff'].append(kri)
    annotator_agg[a]['spearman'].append(spr)
    annotator_agg[a]['support'].append(n_common)
    lang_agg[lang]['krippendorff'].append(kri)
    lang_agg[lang]['spearman'].append(spr)
    lang_agg[lang]['support'].append(n_common)
    print(f'  {a:<20} {b:<20} {n_common: 5d} {kri: 0.3f} {spr: 0.3f}(p={spr_p:0.3f})') 

In [None]:
for lang in ['de', 'sv', 'en']:
    for metric in ['krippendorff', 'spearman', 'support']:
        print(lang, metric, np.nanmean(np.array(lang_agg[lang][metric])))

In [None]:
# unweighted summaries of pairwise statistics
stats = pd.DataFrame.from_dict(annotator_agg, orient="index").stack().to_frame()
stats = pd.DataFrame(stats[0].values.tolist(), index=stats.index).T
summary = stats.describe()
annotators_ordered = summary.xs('krippendorff', level=1, axis=1).sort_values('mean', axis=1).columns
for a in annotators_ordered:
    print(a)
    display(summary[a])

In [None]:
# mean of pairwise scores, weighted by number of items annotated by both in the pair
weights = stats.xs('support', level=1, axis=1) / stats.xs('support', level=1, axis=1).sum(axis=0)
stats = stats.drop('support', axis=1, level=1).mul(weights, level=0).sum().unstack().sort_values('krippendorff')
display(stats)
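The weighting used above can be shown on a small example. The numbers are made up and `weighted_mean` is a hypothetical helper; the sketch only illustrates how weighting by the number of shared items changes the picture compared to a plain mean.

```python
def weighted_mean(scores, supports):
    """Mean of pairwise scores, weighted by the number of shared items."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

# Toy pairwise scores of one annotator against three others (made-up numbers)
scores = [0.8, 0.2, 0.5]
supports = [80, 10, 10]  # number of items both annotators in the pair judged

print(round(weighted_mean(scores, supports), 3))  # weighted by support
print(round(sum(scores) / len(scores), 3))        # unweighted, for comparison
```

The pair with 80 shared items dominates the weighted mean, so a low score based on only a handful of shared items has little influence on an annotator's overall value.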

In [None]:
# annotator weighted pairwise agreements per dataset / per language 
import plotly.express as px
df = stats.unstack().reset_index().rename({'level_0': 'metric', 'level_1': 'annotator', 0: 'value'}, axis=1).set_index('annotator')
df['dataset'] = False
for dataset in dataset_annotators.index:  
    dataset_df = df.loc[dataset_annotators.loc[dataset]].copy()
    dataset_df['slice']='dataset'
    lang_df = df.loc[info[info.lang == dataset_language.loc[dataset]].annotator].copy()
    lang_df['slice']='language'
    tmp = pd.concat([dataset_df.reset_index(), lang_df.drop_duplicates().reset_index()], axis=0)
    fig = px.box(tmp, y='value', x='slice', color='metric', points='all', height=600,  hover_name=tmp.annotator, title=dataset)
    fig.show()

In [None]:
# inspect some particular pairs of annotators
# note: both names must be columns of the aggregated frame; 'JoshuaC99' is only
# present if it is not listed in `low_agreement` in the preprocessing cell
a1, a2 = 'JoshuaC99', 'annotator8-dwug_en'
wecare = df_judgments_all_pair_vs_ann_aggregated[[a1, a2]].notna().all(axis=1)  # rows judged by both
df_judgments_all_pair_vs_ann_aggregated[wecare][['identifier1', 'identifier2', 'lemma', 'dataset', a1, a2]]

# Run WUG pipeline
An example of how to run the WUG pipeline. Check the readme of this repository to see what the pipeline can do. Many parameters can currently only be modified by defining your own parameter file and providing it as an input parameter to the pipeline (see below).
General form:

`bash -e scripts/run_system2.sh $dir $algorithm $position $parameterfile`

Output is written to `$dir`.

In [None]:
!bash -e ../run_system2.sh data_output/dwug_en_resampled correlation spring ../parameters_system2.sh

In [None]:
# Calculate agreement when merging studies
# Check whether 'unc' data leads to more connected clusters
# Compare clusterings on 'uses' data to clustering on previous data
# Compare change scores on 'resampled' studies to previous data
# Compare agreement across time periods (use grouping information from uses files)

In [None]:
# Postprocessing
## filter out bad annotators (low agreement)
## export filtered data again
## prepare data for publishing