multi span handling for `sciarg` #103

ArneBinder · 2024-01-31T17:47:55Z

This adds a new dataset variant `resolve_parts_of_same` in which all spans connected via `parts_of_same` relations are merged ~~by using the smallest start index as the new start and the largest end index as the new end, i.e. the maximum-coverage span is used~~ into `LabeledMultiSpan`s. It also adds the respective document converters for `TextDocumentWithLabeledMultiSpansAndBinaryRelations` and `TextDocumentWithLabeledMultiSpansBinaryRelationsAndLabeledPartitions`.
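To illustrate the merging idea, here is a minimal sketch that groups spans linked by `parts_of_same`-style relations into multi-spans using a small union-find. It is not the code from this PR or from pie-modules: the function name `merge_spans_via_relations`, the plain `(start, end)` tuples, and the omission of labels are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def merge_spans_via_relations(
    spans: List[Span], links: List[Tuple[Span, Span]]
) -> List[Tuple[Span, ...]]:
    """Group spans that are connected via link relations into multi-spans.

    Hypothetical helper for illustration only; labels are ignored here.
    """
    parent: Dict[Span, Span] = {span: span for span in spans}

    def find(span: Span) -> Span:
        # Follow parent pointers up to the representative of the group.
        while parent[span] != span:
            parent[span] = parent[parent[span]]  # path halving
            span = parent[span]
        return span

    # Every link joins the groups of its two arguments.
    for head, tail in links:
        parent[find(head)] = find(tail)

    groups: Dict[Span, List[Span]] = defaultdict(list)
    for span in spans:
        groups[find(span)].append(span)

    # Each multi-span keeps all of its slices, sorted by start offset.
    return sorted(tuple(sorted(group)) for group in groups.values())


spans = [(0, 5), (10, 15), (20, 25), (30, 35)]
links = [((0, 5), (10, 15)), ((10, 15), (20, 25))]
multi_spans = merge_spans_via_relations(spans, links)
print(multi_spans)
# [((0, 5), (10, 15), (20, 25)), ((30, 35),)]
```

The three linked spans end up as one multi-span with three slices, while the unlinked span stays a single-slice multi-span. The earlier, struck-through approach would instead have collapsed the group to one span from the smallest start to the largest end.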

IMPORTANT:

We set TEST_FULL_DATASET = True.
This also adds code which deduplicates the relations for the default dataset variant, so the annotation counts slightly change from:

|                   | len | max |  sum |
| :---------------- | --: | --: | ---: |
| contradicts       |  40 |  37 |  697 |
| parts_of_same     |  40 | 100 | 1298 |
| semantically_same |  18 |   9 |   44 |
| supports          |  40 | 300 | 5791 |

to:

|                   | len | max |  sum |
| :---------------- | --: | --: | ---: |
| contradicts       |  40 |  37 |  696 |
| parts_of_same     |  40 | 100 | 1298 |
| semantically_same |  18 |   9 |   44 |
| supports          |  40 | 300 | 5789 |

We warn about the annotation removal:

WARNING  datasets_modules.datasets.sciarg.e539c4247581f3861d74f668d91edfad8b298991b2026aa2d8fc629bc3b89350.sciarg:sciarg.py:108 doc_id=A18: Removing duplicate relation: BinaryRelation(head=LabeledSpan(start=13659, end=13660, label='data', score=1.0), tail=LabeledSpan(start=13720, end=13763, label='background_claim', score=1.0), label='supports', score=1.0)
WARNING  datasets_modules.datasets.sciarg.e539c4247581f3861d74f668d91edfad8b298991b2026aa2d8fc629bc3b89350.sciarg:sciarg.py:108 doc_id=A21: Removing duplicate relation: BinaryRelation(head=LabeledSpan(start=2339, end=2349, label='data', score=1.0), tail=LabeledSpan(start=2175, end=2238, label='background_claim', score=1.0), label='supports', score=1.0)
WARNING  datasets_modules.datasets.sciarg.e539c4247581f3861d74f668d91edfad8b298991b2026aa2d8fc629bc3b89350.sciarg:sciarg.py:108 doc_id=A30: Removing duplicate relation: BinaryRelation(head=LabeledSpan(start=2109, end=2217, label='background_claim', score=1.0), tail=LabeledSpan(start=1965, end=2098, label='background_claim', score=1.0), label='contradicts', score=1.0)
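The deduplication described above can be sketched as follows. This is only an illustration under stated assumptions, not the code added in `sciarg.py`: the `Span` and `Relation` dataclasses and the function name `drop_duplicate_relations` are hypothetical stand-ins, and the example values are taken from the first warning above (doc_id=A18).

```python
import logging
from dataclasses import dataclass
from typing import List

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class Span:
    # Simplified stand-in for a labeled span (hypothetical, for illustration).
    start: int
    end: int
    label: str


@dataclass(frozen=True)
class Relation:
    # Simplified stand-in for a binary relation (hypothetical, for illustration).
    head: Span
    tail: Span
    label: str


def drop_duplicate_relations(relations: List[Relation], doc_id: str) -> List[Relation]:
    """Keep the first occurrence of each relation and warn about dropped ones."""
    seen = set()
    unique = []
    for relation in relations:
        if relation in seen:
            logger.warning("doc_id=%s: Removing duplicate relation: %s", doc_id, relation)
            continue
        seen.add(relation)
        unique.append(relation)
    return unique


head = Span(13659, 13660, "data")
tail = Span(13720, 13763, "background_claim")
relations = [Relation(head, tail, "supports"), Relation(head, tail, "supports")]
unique = drop_duplicate_relations(relations, doc_id="A18")
print(len(relations), "->", len(unique))
# 2 -> 1
```

Frozen dataclasses are hashable and compare by value, so two relations with the same head, tail, and label count as duplicates; the original order of the remaining relations is preserved.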

This PR requires:

TODO:

use ~~0.10.6~~ 0.10.8 release of pie-modules when available
update the docs
investigate why test_tokenize_documents_all fails
increase test coverage, if necessary
update documentation (dataset card)

codecov · 2024-01-31T17:54:12Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.55%. Comparing base (a12858f) to head (54018b3).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #103      +/-   ##
==========================================
+ Coverage   95.23%   95.55%   +0.31%     
==========================================
  Files          22       22              
  Lines        1407     1439      +32     
==========================================
+ Hits         1340     1375      +35     
+ Misses         67       64       -3


idalr · 2024-03-11T15:08:12Z

Label counts statistics

EDIT: adjusted for counts calculated with latest commit e2698bc

dataset variant `default`

document type `BratDocumentWithMergedSpans`

i.e. no conversion

|                  | len | max |  sum |
| :--------------- | --: | --: | ---: |
| background_claim |  40 | 182 | 3291 |
| data             |  40 | 214 | 4297 |
| own_claim        |  40 | 285 | 6004 |

command line

python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_entity_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=default  \
metric.document_type=pie_datasets.builders.brat.BratDocumentWithMergedSpans \
metric.field=spans

|                   | len | max |  sum |
| :---------------- | --: | --: | ---: |
| contradicts       |  40 |  37 |  696 |
| parts_of_same     |  40 | 100 | 1298 |
| semantically_same |  18 |   9 |   44 |
| supports          |  40 | 300 | 5789 |

command line

python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_relation_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=default \
metric.document_type=pie_datasets.builders.brat.BratDocumentWithMergedSpans \
metric.field=relations

document type `TextDocumentWithLabeledSpansAndBinaryRelations`

|                  | len | max |  sum |
| :--------------- | --: | --: | ---: |
| background_claim |  40 | 182 | 3291 |
| data             |  40 | 214 | 4297 |
| own_claim        |  40 | 285 | 6004 |

command line

python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_entity_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=default  \
metric.document_type=pie_modules.documents.TextDocumentWithLabeledSpansAndBinaryRelations \
metric.field=labeled_spans

|                   | len | max |  sum |
| :---------------- | --: | --: | ---: |
| contradicts       |  40 |  37 |  696 |
| parts_of_same     |  40 | 100 | 1298 |
| semantically_same |  18 |   9 |   44 |
| supports          |  40 | 300 | 5789 |

command line

python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_relation_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=default \
metric.document_type=pie_modules.documents.TextDocumentWithLabeledSpansAndBinaryRelations \
metric.field=binary_relations

dataset variant `resolve_parts_of_same`

document type `BratDocument`

i.e. no conversion

|                  | len | max |  sum |
| :--------------- | --: | --: | ---: |
| background_claim |  40 | 153 | 2752 |
| data             |  40 | 171 | 4093 |
| own_claim        |  40 | 262 | 5450 |

command line

python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_entity_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=resolve_parts_of_same  \
metric.document_type=pie_datasets.builders.brat.BratDocument \
metric.field=spans

|                   | len | max |  sum |
| :---------------- | --: | --: | ---: |
| contradicts       |  40 |  37 |  696 |
| semantically_same |  18 |   9 |   44 |
| supports          |  40 | 300 | 5788 |

command line

python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_relation_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=resolve_parts_of_same  \
metric.document_type=pie_datasets.builders.brat.BratDocument \
metric.field=relations

document type `TextDocumentWithLabeledMultiSpansAndBinaryRelations`

i.e. with conversion

| entities         | len | max |  sum |
| :--------------- | --: | --: | ---: |
| background_claim |  40 | 153 | 2752 |
| data             |  40 | 171 | 4093 |
| own_claim        |  40 | 262 | 5450 |

command line

python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_entity_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=resolve_parts_of_same  \
metric.document_type=pie_modules.documents.TextDocumentWithLabeledMultiSpansAndBinaryRelations \
metric.field=labeled_multi_spans

| relations         | len | max |  sum |
| :---------------- | --: | --: | ---: |
| contradicts       |  40 |  37 |  696 |
| semantically_same |  18 |   9 |   44 |
| supports          |  40 | 300 | 5788 |

command line

python src/evaluate_documents.py \
dataset=sciarg_base \
metric=count_relation_labels \
+metric.show_as_markdown=true \
dataset.input.revision=caab546e03606630984140917a5e86979caba6e9 \
+dataset.input.name=resolve_parts_of_same \
metric.document_type=pie_modules.documents.TextDocumentWithLabeledMultiSpansAndBinaryRelations \
metric.field=binary_relations
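For readers unfamiliar with the `len` / `max` / `sum` columns, the sketch below shows one plausible way such per-label statistics can be derived. It assumes that `len` is the number of documents in which a label occurs at least once, `max` the highest count in a single document, and `sum` the total over all documents. That reading is consistent with `semantically_same` having a smaller `len` than the other labels, but it is an assumption, and `label_count_stats` is a hypothetical helper, not the metric used by the commands above.

```python
from collections import Counter
from typing import Dict, List


def label_count_stats(docs: List[List[str]]) -> Dict[str, Dict[str, int]]:
    """Aggregate per-document label counts into len / max / sum per label."""
    per_doc = [Counter(labels) for labels in docs]
    all_labels = sorted({label for counts in per_doc for label in counts})
    stats = {}
    for label in all_labels:
        # Only documents that contain the label contribute to its statistics.
        counts = [doc_counts[label] for doc_counts in per_doc if label in doc_counts]
        stats[label] = {"len": len(counts), "max": max(counts), "sum": sum(counts)}
    return stats


# Toy input: one list of relation labels per document.
docs = [
    ["supports", "supports", "contradicts"],
    ["supports"],
    ["supports", "semantically_same"],
]
stats = label_count_stats(docs)
for label, row in stats.items():
    print(label, row)
# contradicts {'len': 1, 'max': 1, 'sum': 1}
# semantically_same {'len': 1, 'max': 1, 'sum': 1}
# supports {'len': 3, 'max': 2, 'sum': 4}
```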

ArneBinder added the enhancement New feature or request label Jan 31, 2024

ArneBinder and others added 19 commits February 20, 2024 11:51

implement some multi span handling (mode=merge, but config still missing)

e051ccf

rename parameter

894cb28

add resolve_annotation(), sort_annotations(), and resolve_annotations to tests.dataset_builders.common

60808c9

add dataset variant "resolve_parts_of_same" for sciarg and streamline tests

17c6838

improve tests: check that parts_of_same gets merged

296f112

upgrade pie-modules to 0.10.5

7349bb5

add SpansWithRelationsMerger

b7aff6c

add LabeledMultiSpan to resolve_annotation() and sort_annotations()

a791684

sort slices in _merge_spans_via_relation()

a355388

add converters for TextDocumentWithLabeledMultiSpansAndBinaryRelations and TextDocumentWithLabeledMultiSpansBinaryRelationsAndLabeledPartitions to sciarg

ca97b21

use SpansViaRelationMerger from pie_modules

e185081

streamline tests and test tokenization

6cb23cf

fix TestTokenDocumentWithLabeledMultiSpansAndBinaryRelations

804988f

import annotations from pie_modules

db68636

adjust requirements.txt for sciarg

8c7a42b

add some documentation to the dataset card

3cb8e12

adjusted DOCUMENT_TYPES = {'resolve_parts_of_same': BratDocument}

c83abcb

debugging

2e1da3a

edited pie/sciarg/readme.md

e1f8bc7

ArneBinder force-pushed the sciarg_multi_span_handling branch from a599403 to e1f8bc7 Compare February 20, 2024 10:54

ArneBinder added 8 commits February 20, 2024 12:58

minor change

d6e3d29

set TEST_FULL_DATASET=True

e1109c7

add trim_adus and sort_symmetric_relation_arguments steps to get_common_converter_pipeline_steps_with_resolve_parts_of_same() (requires pie-modules 0.10.8)

cc47858

improve documentation

fce8f22

fix test_tokenize_documents_all()

35441ad

add test_generate_document()

78e4634

update README.md

62adfbc

add statistics (label counts) for resolve_parts_of_same and fix default relation count

9a48d7c

ArneBinder changed the title ~~multi span handling for sciarg [WIP]~~ multi span handling for sciarg Mar 4, 2024

ArneBinder added 4 commits March 14, 2024 15:16

begin writing tests for label counts

b99d934

add tests for label counts

71238c9

remove duplicate relations for default dataset variant

e2698bc

add test_remove_duplicate_relations()

54018b3

ArneBinder merged commit c0891d5 into main Mar 18, 2024
4 checks passed

ArneBinder deleted the sciarg_multi_span_handling branch March 18, 2024 11:12
