DisagreementMeasure agreement calculation wrong for 100% agreement? #35

Open
reckart opened this issue Apr 13, 2021 · 8 comments
Labels
🐛Bug Something isn't working

Comments


reckart commented Apr 13, 2021

The KrippendorffAlphaAgreement is a disagreement measure. If there is full agreement, then the expected and the observed disagreement are both calculated as 0.0 in

    @Override
    public double calculateAgreement()
    {
        double D_O = calculateObservedDisagreement();
        double D_E = calculateExpectedDisagreement();
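        // Note: this branch also matches the case D_O == D_E == 0.0 (i.e. full
        // agreement with no expected disagreement), which is then reported as 0.0.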
        if (D_O == D_E) {
            return 0.0;
        }
        else {
            return 1.0 - (D_O / D_E);
        }
    }

However, in this case a disagreement of 0 does not yield an agreement of 1.0 but instead an agreement of 0.0... that seems wrong?

Suggested fix:

        double D_O = calculateObservedDisagreement();
        double D_E = calculateExpectedDisagreement();
        if (D_O == 0.0 && D_E == 0.0) {
            return 1.0;
        }
        return 1.0 - (D_O / D_E);
reckart added the 🐛Bug label Apr 13, 2021

reckart commented Apr 13, 2021

@chmeyer are you still watching this repo? Is this a bug or a feature?


chmeyer commented Apr 17, 2021

@reckart sort of, where time permits :)

I would not recommend this fix. Although Krippendorff's measure internally uses disagreement modeling (i.e., observed disagreement D_O and expected disagreement D_E), it is still defined as an agreement measure. This is achieved by the "1 - " term in the result calculation "1 - (D_O / D_E)". That's why the method is called "calculateAgreement" rather than "calculateDisagreement".

Example: Imagine we see an observed disagreement of 0.5 (~ half of the annotations are "wrong") and, given the annotations, we would expect a disagreement of 0.57 (this happens, for example, in a 2-rater, 4-item study with the annotations AA, AB, BA, BB). This means that the raters did only slightly better than chance, so we get alpha = 1 - (0.5 / 0.57) = 1 - 0.88 = 0.12. If the raters produce an observed disagreement of only 0.25, then they do clearly better than chance and we obtain alpha = 1 - (0.25 / 0.57) = 1 - 0.43 = 0.57. To reach acceptable agreement levels, the raters would need to produce even less observed disagreement (or the expected disagreement would need to rise).
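To make it easy to retrace where the 0.5 and 0.57 come from, here is a small self-contained sketch (plain Java, independent of the library; all names are illustrative) that computes the nominal D_O and D_E from the coincidence matrix of the 2-rater AA, AB, BA, BB study:

    public class AlphaByHand
    {
        public static void main(String[] args)
        {
            // 2 raters, 4 items: AA, AB, BA, BB -> coincidence matrix over {A, B}.
            // With m = 2 raters, each item contributes its ordered value pairs
            // weighted by 1/(m - 1) = 1.
            double[][] o = {
                    { 2.0, 2.0 },   // o_AA, o_AB
                    { 2.0, 2.0 } }; // o_BA, o_BB
            double[] n = { 4.0, 4.0 }; // marginals n_A, n_B
            double total = 8.0;        // total number of pairable values

            // Observed disagreement: off-diagonal coincidences, normalized by the total.
            double observed = (o[0][1] + o[1][0]) / total; // = 0.5

            // Expected disagreement for nominal data: mismatching value pairs
            // drawn from the marginals.
            double expected = (n[0] * n[1] + n[1] * n[0]) / (total * (total - 1)); // = 32/56 ≈ 0.57

            double alpha = 1.0 - observed / expected; // = 1 - 0.875 = 0.125 (≈ 0.12 above)
            System.out.printf("D_O = %.3f, D_E = %.3f, alpha = %.3f%n", observed, expected, alpha);
        }
    }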

Coming back to your question: if there is no observed disagreement and no expected disagreement, this would mean that we have an empty study and thus nothing to judge. Returning alpha = 1 would be misleading IMHO. Returning alpha = 0, as it is now, is also debatable, as there is no clear definition for this situation. Thus, NaN would be an option, but for practicality reasons (e.g., writing numbers to a database, computing averages, etc.) we chose 0 in the first place, and that is probably fine to keep.

What do you think? Best wishes!


reckart commented Apr 17, 2021

In my case, I found that if I have two annotators who both annotate the same unit with the same label, then the expected and the observed disagreement are both 0. In the current code this causes the agreement to be reported as 0 - but it is full agreement and thus should be reported as 1.
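A minimal sketch of that scenario, assuming the dkpro-statistics coding API (CodingAnnotationStudy, KrippendorffAlphaAgreement, NominalDistanceFunction; the exact package names may differ between versions):

    // Sketch only - package names may differ between dkpro-statistics versions.
    import org.dkpro.statistics.agreement.coding.CodingAnnotationStudy;
    import org.dkpro.statistics.agreement.coding.KrippendorffAlphaAgreement;
    import org.dkpro.statistics.agreement.distance.NominalDistanceFunction;

    public class FullAgreementExample
    {
        public static void main(String[] args)
        {
            // Two raters, a single unit, both assign the same label.
            CodingAnnotationStudy study = new CodingAnnotationStudy(2);
            study.addItem("PER", "PER");

            KrippendorffAlphaAgreement alpha = new KrippendorffAlphaAgreement(
                    study, new NominalDistanceFunction());

            // D_O and D_E are both 0.0 here, so the current implementation
            // reports 0.0 although the raters agree completely.
            System.out.println(alpha.calculateAgreement());
        }
    }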


reckart commented Apr 17, 2021

> Coming back to your question: if there is no observed disagreement and no expected disagreement, this would mean that we have an empty study and thus nothing to judge.

So the study is not necessarily empty if expected/observed disagreement are both 0.


reckart commented Apr 17, 2021

Maybe?

if (D_O == D_E) {
  return study.isEmpty() ? 0.0 : 1.0;
}


chmeyer commented Apr 17, 2021

Well, D_O can be 0.0 if the raters agree on all items. But D_E does not drop to 0 in a proper study. It can be 0 if there is only a single label, but then there is nothing to agree on, i.e. no question. In my opinion, this would not be a real annotation study. But if we want to support this use case, then, yes, the study.isEmpty solution would be a way to do it.


reckart commented Apr 17, 2021

I assume that by "real" study you mean that there is a significant number of annotations :) In INCEpTION/WebAnno, we calculate pairwise agreement between annotators. It is not uncommon to have cases where

  • the annotators did not find any item to be labelled in their documents
  • the annotators found only very few items to be labelled and agreed on all of them

We can and probably should handle the first case (no items) directly in our code, telling the users that there was no data to compare.
However, I think the second case would be better handled here.

Thanks for the feedback!


chmeyer commented Apr 17, 2021

Having only a few labels is not the actual problem: the simplest case AA BB (2 raters agree on 2 items) returns alpha = 1. Cases with 4 items also work well and give a good notion of agreement that captures the uncertainty, e.g., AA AB AA BB (alpha = 0.53). But if there is nothing to decide, i.e. there is only a single label, then we could have 1000 items that are annotated with A by both raters without being able to tell the agreement, as there is no expectation model. I am OK with setting such cases to 1, but they should still be taken with a grain of salt.
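For reference, the two cases mentioned above could be reproduced roughly like this (same class and package assumptions as the sketch further up; the stated alpha values are the ones from this thread):

    // Sketch only - same class/package assumptions as the earlier example.
    import org.dkpro.statistics.agreement.coding.CodingAnnotationStudy;
    import org.dkpro.statistics.agreement.coding.KrippendorffAlphaAgreement;
    import org.dkpro.statistics.agreement.distance.NominalDistanceFunction;

    public class SmallStudyExamples
    {
        public static void main(String[] args)
        {
            // Two raters agree on two items with two distinct labels (AA, BB):
            // D_O = 0, D_E > 0, so alpha = 1.0.
            CodingAnnotationStudy twoLabels = new CodingAnnotationStudy(2);
            twoLabels.addItem("A", "A");
            twoLabels.addItem("B", "B");
            System.out.println(new KrippendorffAlphaAgreement(
                    twoLabels, new NominalDistanceFunction()).calculateAgreement());

            // Four items with one disagreement (AA, AB, AA, BB): alpha ≈ 0.53.
            CodingAnnotationStudy fourItems = new CodingAnnotationStudy(2);
            fourItems.addItem("A", "A");
            fourItems.addItem("A", "B");
            fourItems.addItem("A", "A");
            fourItems.addItem("B", "B");
            System.out.println(new KrippendorffAlphaAgreement(
                    fourItems, new NominalDistanceFunction()).calculateAgreement());
        }
    }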

reckart added a commit to inception-project/inception that referenced this issue Apr 17, 2021