#### Statistical analysis of the lexicographer's judgements of different cluster groups (round 2)
- based on numbers from "281_lemmas_overview. 23 december 2024.xlsx"

| Item            | 1 cluster | >1 cluster     | Row total |
|-----------------|-----------|----------------|-----------|
| interesting     | 1         | 17+7+4+1+1=30  | 31        |
| not interesting | 27        | 32+2+2+0+0=36  | 63        |
| Column total    | 28        | 66             | 94        |
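
The totals in the table above can be re-derived from the individual cell counts. The following is a minimal sketch that checks only the arithmetic, not the counts in the spreadsheet; the variable and function names are illustrative.

```python
# Cell counts copied from the round 2 table; the >1-cluster column is kept
# as the list of summands shown in the table.
interesting = {"1 cluster": 1, ">1 cluster": [17, 7, 4, 1, 1]}
not_interesting = {"1 cluster": 27, ">1 cluster": [32, 2, 2, 0, 0]}


def row_cells(row):
    """Return (1-cluster count, summed >1-cluster count) for one table row."""
    return row["1 cluster"], sum(row[">1 cluster"])


int_1, int_many = row_cells(interesting)
not_1, not_many = row_cells(not_interesting)

row_totals = (int_1 + int_many, not_1 + not_many)
col_totals = (int_1 + not_1, int_many + not_many)
grand_total = sum(row_totals)

print("row totals:", row_totals)
print("column totals:", col_totals)
print("grand total:", grand_total)
```

If the printed totals match the last row and column of the table, the table is at least internally consistent.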

to do:
- verify numbers (Emma? Second pair of eyes.)
- includes some non-randomly chosen words for one cluster (done)
- some words have no decision (done)

- Fisher's exact test: https://en.wikipedia.org/wiki/Fisher%27s_exact_test
- SciPy tutorial: https://docs.scipy.org/doc/scipy/tutorial/stats/hypothesis_fisher_exact.html#hypothesis-fisher-exact
- Null hypothesis: "The odds of finding a semantically interesting word are the same within the 1-cluster group as they are within the >1-cluster group."

In [1]:
from scipy.stats import fisher_exact
from decimal import Decimal

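# Rows: not interesting, interesting; columns: 1 cluster, >1 cluster
# (row order is swapped relative to the table above).
table = [[27, 1], [36, 30]]
# One-sided test: is the odds ratio (27*30)/(1*36) greater than 1?
res = fisher_exact(table, alternative='greater')
print('%.8f' % Decimal(res.pvalue))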

0.00002412


- Using a significance level of 1%, we reject the null hypothesis in favor of the alternative hypothesis: "The odds of finding a semantically interesting word are greater within the >1-cluster group than they are within the 1-cluster group."
- Additionally, the 1-cluster group is much larger than the >1-cluster group (215 versus 66 lemmas).
- Hence, inspecting only the >1-cluster group increases the chance of finding semantically interesting words compared to a random selection.
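
The size of that gain can be estimated from the numbers above. This is a rough sketch, assuming the proportions observed in the judged sample carry over to each whole group, which may not hold exactly since some words were not chosen at random.

```python
# Round 2 sample counts per group: (interesting, judged), from the table above.
sample = {"1 cluster": (1, 28), ">1 cluster": (30, 66)}
# Group sizes among all lemmas, as quoted in the text (215 versus 66).
population = {"1 cluster": 215, ">1 cluster": 66}

# Share of interesting words within each group of the judged sample.
rates = {group: hits / n for group, (hits, n) in sample.items()}

# Estimated hit rate when picking a lemma at random from all lemmas:
# the group rates weighted by group size (assumption: sample rates generalize).
total = sum(population.values())
random_rate = sum(rates[g] * population[g] for g in population) / total

for group, rate in rates.items():
    print(f"{group}: {rate:.3f}")
print(f"random selection (estimated): {random_rate:.3f}")
```

Under these assumptions, roughly 45% of the >1-cluster lemmas are interesting, against an estimated 13% for a lemma drawn at random.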

#### Statistical analysis of the lexicographer's judgements of different cluster groups (round 3)
- based on numbers from "1. Lemmas_with_cluster_round3_lex.judgements_4 juni.ods"

| Item            | 1 cluster | >1 cluster       | Row total |
|-----------------|-----------|------------------|-----------|
| interesting     | 0         | 58+16+11+1+1=87  | 87        |
| not interesting | 21        | 43+17+4+2+0=66   | 87        |
| Column total    | 21        | 153              | 174       |

to do:
- verify numbers (Emma? Second pair of eyes.). Done.

In [2]:
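# Rows: not interesting, interesting; columns: 1 cluster, >1 cluster
# (row order is swapped relative to the table above).
table = [[21, 0], [66, 87]]
# One-sided test: is the odds ratio of this table greater than 1?
res = fisher_exact(table, alternative='greater')
print('%.8f' % Decimal(res.pvalue))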

0.00000012


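- same conclusion as above: at the 1% level the null hypothesis is rejected
- the p-value is smaller than in round 2, but that alone does not show a stronger effect, since the sample is also larger (174 versus 94 lemmas); here no 1-cluster lemma was judged interesting (0 of 21, versus 1 of 28 in round 2)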
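
To compare the two rounds by effect size rather than by p-value, one option is the sample odds ratio of each table. The sketch below is an illustration, not part of the original analysis: the helper `odds_ratio` is hypothetical, and adding 0.5 to every cell (the Haldane-Anscombe correction) is an assumption made here to obtain a finite value when a cell is zero.

```python
def odds_ratio(table, correction=0.0):
    """Sample odds ratio (a*d)/(b*c) of a 2x2 table.

    `correction` is added to every cell first; 0.5 gives the
    Haldane-Anscombe corrected estimate.
    """
    (a, b), (c, d) = table
    a, b, c, d = (x + correction for x in (a, b, c, d))
    if b * c == 0:
        return float("inf")  # undefined ratio, reported as infinite
    return (a * d) / (b * c)


# Same layout as in the cells above:
# rows = not interesting, interesting; columns = 1 cluster, >1 cluster.
tables = {
    "round 2": [[27, 1], [36, 30]],
    "round 3": [[21, 0], [66, 87]],
}

for name, table in tables.items():
    raw = odds_ratio(table)
    corrected = odds_ratio(table, correction=0.5)
    print(f"{name}: raw={raw:.2f}, corrected={corrected:.2f}")
```

The raw odds ratio for round 3 is infinite because of the zero cell, so only the corrected values are comparable across rounds, and they remain rough estimates given the small 1-cluster counts.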

#### Additional experiments parallel to experiment 1

In [3]:
from pathlib import Path
import unicodedata

import numpy as np
import pandas as pd
from sklearn.metrics.cluster import adjusted_mutual_info_score, adjusted_rand_score, rand_score

experiment_folder='../../data/lexicographer_project/Euralex_sv_experiment_1_v2.02/xllexeme'
#experiment_folder='../../data/lexicographer_project/Euralex_sv_experiment_1_v2.02/xldurel'

is_load_data = False

if is_load_data:
    df_experiments = pd.read_pickle(experiment_folder + "/clusterings.pkl")
else:
    df_experiments = pd.DataFrame()
    paths = Path(experiment_folder+'/').glob('*/clusters/*.csv')
    for i, p in enumerate(paths):

        #if not 'wsbm' in str(p):
        #    continue
        #if 'old' in str(p):
        #    continue
        #print(p)
        #if i==1000:
        #    break

        # Expected layout: <experiment_folder>/<model>/clusters/<lemma>.csv
        model = p.parts[-3]
        # Normalize to NFC so lemmas match the ground-truth file names
        lemma = unicodedata.normalize('NFC', p.stem)
        # Tab-separated, no quote handling (quoting=3 is csv.QUOTE_NONE)
        df = pd.read_csv(p, delimiter='\t', quoting=3, na_filter=False)
        df['model'] = model
        df['lemma'] = lemma
        df_experiments = pd.concat([df_experiments, df])
    df_experiments.to_pickle(experiment_folder + "/clusterings.pkl")
display(df_experiments)

Unnamed: 0,identifier,cluster,model,lemma
0,svt-2019-9a093e72-21,0,correlation_0.45,tysta
1,svt-2012-e28bee60-15,0,correlation_0.45,tysta
2,svt-2011-91a33298-8,0,correlation_0.45,tysta
3,svt-2014-152158f0-12,0,correlation_0.45,tysta
4,svt-2011-e2825434-8,0,correlation_0.45,tysta
...,...,...,...,...
45,svt-2016-f54f3f49-1,0,correlation_0.675,fotavtryck
46,svt-2016-d697bc6e-7,1,correlation_0.675,fotavtryck
47,svt-2019-9a09d01d-16,1,correlation_0.675,fotavtryck
48,svt-2017-1188d57d-34,0,correlation_0.675,fotavtryck
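
The loading cell above derives the model name and the lemma from each file path. The sketch below shows one way to express that step with `pathlib` so that it behaves the same for POSIX and Windows separators. The paths and the helper `parse_cluster_path` are hypothetical; only the `<model>/clusters/<lemma>.csv` layout is taken from the glob pattern.

```python
from pathlib import PurePosixPath, PureWindowsPath
import unicodedata


def parse_cluster_path(path):
    """Return (model, lemma) for a path shaped like <model>/clusters/<lemma>.csv."""
    model = path.parts[-3]
    # NFC merges a base letter and a combining mark into one code point,
    # so visually identical lemmas compare equal.
    lemma = unicodedata.normalize("NFC", path.stem)
    return model, lemma


# Hypothetical example paths following the assumed layout.
posix = PurePosixPath("xllexeme/correlation_0.45/clusters/tysta.csv")
windows = PureWindowsPath(r"xllexeme\correlation_0.45\clusters\tysta.csv")
print(parse_cluster_path(posix))
print(parse_cluster_path(windows))

# A file name stored in decomposed form: "a" followed by a combining ring.
decomposed = PurePosixPath("xllexeme/correlation_0.45/clusters/ma\u030ala.csv")
model, lemma = parse_cluster_path(decomposed)
print(lemma == "m\u00e5la")
```

The normalization matters mainly when file names come from different operating systems, some of which store accented characters in decomposed form.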


In [4]:
# load ground truth
truth_folder='../../data/lexicographer_project/lexicographer_judgments'

df_truth = pd.DataFrame()
paths = Path(truth_folder+'/').glob('*.csv')
for i, p in enumerate(paths):
    #display(p)
    lemma = str(p).replace('\\', '/').split('/')[-1].replace('.csv','').split('_')[1]
    lemma = unicodedata.normalize('NFC', lemma)
    df = pd.read_csv(p, delimiter=';')
    df['lemma'] = lemma
    df = df.rename(columns={'lex.judgement': 'lex_judgement', 'lex. judgement': 'lex_judgement', 'lex. Jugement': 'lex_judgement'})
    df_truth = pd.concat([df_truth, df])
#display(df_truth)
#print(df_truth.columns.tolist())

# Filter
df_truth = df_truth[(~df_truth['lex_judgement'].eq('unclear'))]
df_truth = df_truth[~df_truth['lex_judgement'].isnull()]
df_experiments = df_experiments[(df_experiments['lemma'].isin(df_truth['lemma'].unique()))]
#display(df_truth['lex_judgement'].unique())
gb_truth_lemma = df_truth.groupby('lemma')    
#gb_truth = gb_truth_lemma.groups
#display(sorted(df_experiments['lemma'].unique()))
#display(sorted(df_truth['lemma'].unique()))
assert set(df_truth['lemma'].unique()) == set(df_experiments['lemma'].unique())
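
The evaluation that follows scores each clustering against the lexicographer judgements with the Rand index (RI) and the adjusted Rand index (ARI). As a reminder of how the two differ, here is a hand-written pair-counting sketch on a toy example. It is an illustration only, not scikit-learn's implementation, and the labels are made up.

```python
from collections import Counter
from math import comb


def rand_scores(labels_true, labels_pred):
    """Return (RI, ARI) computed from pair counts."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    # Pairs grouped together in both labelings, in the truth, in the prediction.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Agreements: pairs together in both, plus pairs separated in both.
    agreements = both + (total_pairs - same_true - same_pred + both)
    ri = agreements / total_pairs
    # ARI rescales the pair count by its expectation under random labeling.
    expected = same_true * same_pred / total_pairs
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        # Only happens when both labelings are all-in-one or all-singletons,
        # i.e. identical partitions.
        return ri, 1.0
    return ri, (both - expected) / (max_index - expected)


# Toy example with made-up labels: two senses, three predicted clusters.
truth = ["old", "old", "old", "new", "new", "new"]
pred = [0, 0, 1, 1, 2, 2]
ri, ari = rand_scores(truth, pred)
print(f"RI={ri:.3f} ARI={ari:.3f}")
```

On this toy example RI is about 0.67 while ARI is about 0.24: RI rewards every pair that is kept apart in both labelings, whereas ARI subtracts the agreement expected by chance.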

In [5]:
gb_model = df_experiments.groupby('model')    
groups_model = gb_model.groups
results = []
for model in groups_model.keys():
    df_model = gb_model.get_group(model)
    gb_model_lemma = df_model.groupby('lemma')    
    groups_model_lemma = gb_model_lemma.groups
    aris = []
    ris = []
    for lemma in groups_model_lemma.keys():
        df_truth_lemma = gb_truth_lemma.get_group(lemma)
        df_truth_lemma_dict = pd.Series(df_truth_lemma.lex_judgement.values,index=df_truth_lemma.identifier).to_dict()
        #print(df_truth_lemma_dict)
        df_model_lemma = gb_model_lemma.get_group(lemma)
        df_model_lemma_dict = pd.Series(df_model_lemma.cluster.values,index=df_model_lemma.identifier).to_dict()
        #print(df_model_lemma_dict)
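        # Align both labelings on the judged identifiers; usages without a judgement are ignored
        identifiers = list(df_truth_lemma_dict)
        data1 = [df_truth_lemma_dict[identifier] for identifier in identifiers]
        data2 = [df_model_lemma_dict[identifier] for identifier in identifiers]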
        #print(data2)
        ari = adjusted_rand_score(data1, data2)
        ri = rand_score(data1, data2)
        #print(' ', lemma, ari)
        results.append({'model':model, 'lemma':lemma, 'ARI':ari, 'RI':ri})
        aris.append(ari)
        ris.append(ri)

    mean_ari = np.mean(aris) 
    mean_ri = np.mean(ris) 
    print('\n', model, 'mean', mean_ari, mean_ri)

df_results = pd.DataFrame(results)


 correlation_0.3 mean 0.19671879294415545 0.6855276965643725

 correlation_0.325 mean 0.24544926169415549 0.6866722543875017

 correlation_0.35 mean 0.25551646343936996 0.6972407422912258

 correlation_0.375 mean 0.313371108410046 0.7205662444451154

 correlation_0.4 mean 0.3191706017699908 0.7197159043090611

 correlation_0.425 mean 0.31630965110183323 0.7167877104620664

 correlation_0.45 mean 0.411899076121062 0.7438446755994794

 correlation_0.475 mean 0.4213088920993869 0.7667968646358376

 correlation_0.5 mean 0.3898254254624668 0.7356454930788893

 correlation_0.525 mean 0.4213205971685345 0.7456192284789165

 correlation_0.55 mean 0.415823071276606 0.7404938784314045

 correlation_0.575 mean 0.4487771469993344 0.7508595246899078

 correlation_0.6 mean 0.4887712468774944 0.7690422463424741

 correlation_0.625 mean 0.48226494073660914 0.7632520451556164

 correlation_0.65 mean 0.4853282424592054 0.7644052825837977

 correlation_0.675 mean 0.46915612118502664 0.7489216437773322
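
The printed means can be ranked to see which threshold agrees best with the lexicographer judgements. This is a small sketch with the values copied from the output above and rounded to four decimals.

```python
# Mean (ARI, RI) per model, copied from the printed output and rounded.
means = {
    "correlation_0.3": (0.1967, 0.6855),
    "correlation_0.325": (0.2454, 0.6867),
    "correlation_0.35": (0.2555, 0.6972),
    "correlation_0.375": (0.3134, 0.7206),
    "correlation_0.4": (0.3192, 0.7197),
    "correlation_0.425": (0.3163, 0.7168),
    "correlation_0.45": (0.4119, 0.7438),
    "correlation_0.475": (0.4213, 0.7668),
    "correlation_0.5": (0.3898, 0.7356),
    "correlation_0.525": (0.4213, 0.7456),
    "correlation_0.55": (0.4158, 0.7405),
    "correlation_0.575": (0.4488, 0.7509),
    "correlation_0.6": (0.4888, 0.7690),
    "correlation_0.625": (0.4823, 0.7633),
    "correlation_0.65": (0.4853, 0.7644),
    "correlation_0.675": (0.4692, 0.7489),
}

best_by_ari = max(means, key=lambda m: means[m][0])
best_by_ri = max(means, key=lambda m: means[m][1])
top3_by_ari = sorted(means, key=lambda m: means[m][0], reverse=True)[:3]

print("best mean ARI:", best_by_ari, means[best_by_ari][0])
print("best mean RI: ", best_by_ri, means[best_by_ri][1])
print("top 3 by ARI: ", top3_by_ari)
```

Both metrics peak at `correlation_0.6`, with the neighbouring thresholds 0.625 and 0.65 close behind on ARI. Inside the notebook, the same ranking could be obtained from `df_results` with `df_results.groupby('model')[['ARI', 'RI']].mean().sort_values('ARI')`.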

