multi span handling for sciarg #103

Merged
merged 31 commits into from
Mar 18, 2024

Conversation

ArneBinder
Owner

@ArneBinder ArneBinder commented Jan 31, 2024

This adds a new dataset variant resolve_parts_of_same in which all spans connected via parts_of_same relations are merged into LabeledMultiSpans. It also adds the respective document converters for TextDocumentWithLabeledMultiSpansAndBinaryRelations and TextDocumentWithLabeledMultiSpansBinaryRelationsAndLabeledPartitions.
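The idea of grouping spans that are linked by parts_of_same relations into one multi-span can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR: the tuple-based `Span` and `MultiSpan` types and the function `merge_parts_of_same` are hypothetical, and it assumes that all fragments of one argumentative unit carry the same label.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical plain-tuple stand-ins for the real annotation classes.
Span = Tuple[int, int, str]  # (start, end, label)
MultiSpan = Tuple[Tuple[Tuple[int, int], ...], str]  # (sorted slices, label)


def merge_parts_of_same(
    spans: List[Span], parts_of_same: List[Tuple[int, int]]
) -> List[MultiSpan]:
    """Group spans connected via parts_of_same (given as index pairs) into multi-spans."""
    parent = list(range(len(spans)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Each relation joins the components of its two arguments.
    for head, tail in parts_of_same:
        parent[find(head)] = find(tail)

    groups: Dict[int, List[int]] = defaultdict(list)
    for idx in range(len(spans)):
        groups[find(idx)].append(idx)

    result: List[MultiSpan] = []
    for members in groups.values():
        labels = {spans[i][2] for i in members}
        if len(labels) != 1:
            # Assumption of this sketch: fragments must share one label.
            raise ValueError(f"conflicting labels in one group: {sorted(labels)}")
        slices = tuple(sorted((spans[i][0], spans[i][1]) for i in members))
        result.append((slices, labels.pop()))
    return sorted(result)
```

Spans that take part in no parts_of_same relation end up as multi-spans with a single slice, so every entity has the same shape after the conversion.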

IMPORTANT:

  • We set TEST_FULL_DATASET = True.
  • This also adds code which deduplicates the relations for the default dataset variant, so the annotation counts slightly change from:

| | len | max | sum |
|---|---|---|---|
| contradicts | 40 | 37 | 697 |
| parts_of_same | 40 | 100 | 1298 |
| semantically_same | 18 | 9 | 44 |
| supports | 40 | 300 | 5791 |

to:

| | len | max | sum |
|---|---|---|---|
| contradicts | 40 | 37 | 696 |
| parts_of_same | 40 | 100 | 1298 |
| semantically_same | 18 | 9 | 44 |
| supports | 40 | 300 | 5789 |

We warn about the annotation removal:

WARNING  datasets_modules.datasets.sciarg.e539c4247581f3861d74f668d91edfad8b298991b2026aa2d8fc629bc3b89350.sciarg:sciarg.py:108 doc_id=A18: Removing duplicate relation: BinaryRelation(head=LabeledSpan(start=13659, end=13660, label='data', score=1.0), tail=LabeledSpan(start=13720, end=13763, label='background_claim', score=1.0), label='supports', score=1.0)
WARNING  datasets_modules.datasets.sciarg.e539c4247581f3861d74f668d91edfad8b298991b2026aa2d8fc629bc3b89350.sciarg:sciarg.py:108 doc_id=A21: Removing duplicate relation: BinaryRelation(head=LabeledSpan(start=2339, end=2349, label='data', score=1.0), tail=LabeledSpan(start=2175, end=2238, label='background_claim', score=1.0), label='supports', score=1.0)
WARNING  datasets_modules.datasets.sciarg.e539c4247581f3861d74f668d91edfad8b298991b2026aa2d8fc629bc3b89350.sciarg:sciarg.py:108 doc_id=A30: Removing duplicate relation: BinaryRelation(head=LabeledSpan(start=2109, end=2217, label='background_claim', score=1.0), tail=LabeledSpan(start=1965, end=2098, label='background_claim', score=1.0), label='contradicts', score=1.0)
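A deduplication step that emits warnings of this shape could look roughly like the sketch below. It is an illustration under assumptions, not the code in sciarg.py: the frozen dataclasses `Span` and `Relation` and the function `deduplicate_relations` are hypothetical stand-ins for the real annotation classes, and two relations count as duplicates when head, tail, and label are all equal.

```python
import logging
from dataclasses import dataclass
from typing import Iterable, List, Set

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class Span:
    # Hypothetical stand-in for a labeled span annotation.
    start: int
    end: int
    label: str


@dataclass(frozen=True)
class Relation:
    # Hypothetical stand-in for a binary relation annotation.
    head: Span
    tail: Span
    label: str


def deduplicate_relations(relations: Iterable[Relation], doc_id: str) -> List[Relation]:
    """Keep the first occurrence of each relation and warn about every dropped repeat."""
    seen: Set[Relation] = set()
    unique: List[Relation] = []
    for relation in relations:
        if relation in seen:
            logger.warning("doc_id=%s: Removing duplicate relation: %s", doc_id, relation)
            continue
        seen.add(relation)
        unique.append(relation)
    return unique
```

Frozen dataclasses are hashable, so membership in `seen` is a constant-time check and the original order of the remaining relations is preserved. A relation with swapped head and tail is treated as a different relation here.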

This PR requires:

TODO:

  • use the 0.10.8 release of pie-modules when available
  • update the docs
  • investigate why test_tokenize_documents_all fails
  • increase test coverage, if necessary
  • update documentation (dataset card)

@ArneBinder ArneBinder added the enhancement New feature or request label Jan 31, 2024

codecov bot commented Jan 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.55%. Comparing base (a12858f) to head (54018b3).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #103      +/-   ##
==========================================
+ Coverage   95.23%   95.55%   +0.31%     
==========================================
  Files          22       22              
  Lines        1407     1439      +32     
==========================================
+ Hits         1340     1375      +35     
+ Misses         67       64       -3     

☔ View full report in Codecov by Sentry.

@ArneBinder ArneBinder changed the title multi span handling for sciarg [WIP] multi span handling for sciarg Mar 4, 2024
@idalr
Collaborator

idalr commented Mar 11, 2024

Label counts statistics

EDIT: adjusted for counts calculated with latest commit e2698bc

dataset variant default

document type BratDocumentWithMergedSpans

i.e. no conversion

| | len | max | sum |
|---|---|---|---|
| background_claim | 40 | 182 | 3291 |
| data | 40 | 214 | 4297 |
| own_claim | 40 | 285 | 6004 |

command line
python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_entity_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=default  \
metric.document_type=pie_datasets.builders.brat.BratDocumentWithMergedSpans \
metric.field=spans

| | len | max | sum |
|---|---|---|---|
| contradicts | 40 | 37 | 696 |
| parts_of_same | 40 | 100 | 1298 |
| semantically_same | 18 | 9 | 44 |
| supports | 40 | 300 | 5789 |

command line
python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_relation_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=default \
metric.document_type=pie_datasets.builders.brat.BratDocumentWithMergedSpans \
metric.field=relations

document type TextDocumentWithLabeledSpansAndBinaryRelations

| | len | max | sum |
|---|---|---|---|
| background_claim | 40 | 182 | 3291 |
| data | 40 | 214 | 4297 |
| own_claim | 40 | 285 | 6004 |

command line
python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_entity_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=default  \
metric.document_type=pie_modules.documents.TextDocumentWithLabeledSpansAndBinaryRelations \
metric.field=labeled_spans

| | len | max | sum |
|---|---|---|---|
| contradicts | 40 | 37 | 696 |
| parts_of_same | 40 | 100 | 1298 |
| semantically_same | 18 | 9 | 44 |
| supports | 40 | 300 | 5789 |

command line
python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_relation_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=default \
metric.document_type=pie_modules.documents.TextDocumentWithLabeledSpansAndBinaryRelations \
metric.field=binary_relations

dataset variant resolve_parts_of_same

document type BratDocument

i.e. no conversion

| | len | max | sum |
|---|---|---|---|
| background_claim | 40 | 153 | 2752 |
| data | 40 | 171 | 4093 |
| own_claim | 40 | 262 | 5450 |

command line
python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_entity_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=resolve_parts_of_same  \
metric.document_type=pie_datasets.builders.brat.BratDocument \
metric.field=spans

| | len | max | sum |
|---|---|---|---|
| contradicts | 40 | 37 | 696 |
| semantically_same | 18 | 9 | 44 |
| supports | 40 | 300 | 5788 |

command line
python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_relation_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=resolve_parts_of_same  \
metric.document_type=pie_datasets.builders.brat.BratDocument \
metric.field=relations

document type TextDocumentWithLabeledMultiSpansAndBinaryRelations

i.e. with conversion

| entities | len | max | sum |
|---|---|---|---|
| background_claim | 40 | 153 | 2752 |
| data | 40 | 171 | 4093 |
| own_claim | 40 | 262 | 5450 |

command line
python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_entity_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=resolve_parts_of_same  \
metric.document_type=pie_modules.documents.TextDocumentWithLabeledMultiSpansAndBinaryRelations \
metric.field=labeled_multi_spans

| relations | len | max | sum |
|---|---|---|---|
| contradicts | 40 | 37 | 696 |
| semantically_same | 18 | 9 | 44 |
| supports | 40 | 300 | 5788 |

command line
python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_relation_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=resolve_parts_of_same \
metric.document_type=pie_modules.documents.TextDocumentWithLabeledMultiSpansAndBinaryRelations \
metric.field=binary_relations
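
For readers of the tables above, the following sketch shows one way the len, max, and sum columns could be derived. It rests on an assumption that is not stated in this thread: that len is the number of documents in which a label occurs at least once, max is the highest per-document count, and sum is the total over all documents. The function `label_count_stats` is hypothetical and is not the metric invoked by the commands above.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def label_count_stats(docs_labels: Iterable[List[str]]) -> Dict[str, Dict[str, int]]:
    """Aggregate per-document label counts into len / max / sum per label."""
    per_label: Dict[str, List[int]] = defaultdict(list)
    for labels in docs_labels:
        # One entry per document, only for labels that occur in that document.
        for label, count in Counter(labels).items():
            per_label[label].append(count)
    return {
        label: {"len": len(counts), "max": max(counts), "sum": sum(counts)}
        for label, counts in per_label.items()
    }
```

Under this reading, a len of 18 for semantically_same would mean that the label appears in 18 of the 40 documents.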

@ArneBinder ArneBinder merged commit c0891d5 into main Mar 18, 2024
4 checks passed
@ArneBinder ArneBinder deleted the sciarg_multi_span_handling branch March 18, 2024 11:12