Support doc_id column as the third item in a `DOCSTART-` line discarding any surrounding single hyphens #13316

jfernandrezj · 2023-01-09T16:09:42Z

Description

Supports doc_id information through a third column in the `-DOCSTART-` line:

-DOCSTART- -X- -{doc_id}- O

For example:

-DOCSTART- -X- -1- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

-DOCSTART- -X- 2 O

Rare NNP B-NP O
Hendrix NNP I-NP B-PER

-DOCSTART- -X- -3-1- O

China NNP B-NP B-LOC
says VBZ B-VP O

-DOCSTART-

China NNP B-NP B-LOC
says VBZ B-VP O

would produce:

+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            sentence|               token|                 pos|               label|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     1|EU rejects German...|[{document, 0, 28...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
|     2|Rare Hendrix song...|[{document, 0, 97...|[{document, 0, 50...|[{token, 0, 3, Ra...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|   3-1|China says Taiwan...|[{document, 0, 13...|[{document, 0, 46...|[{token, 0, 4, Ch...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
|     X|China says Taiwan...|[{document, 0, 13...|[{document, 0, 46...|[{token, 0, 4, Ch...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
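As a rough illustration of the parsing rule described above, here is a minimal pure-Python sketch; it is not the Scala code in this PR. The function name `parse_doc_id` and the `"X"` fallback are assumptions. The fallback is only inferred from the last row of the table, where a bare `-DOCSTART-` line yields `X`.

```python
import re

DEFAULT_DOC_ID = "X"  # assumption: inferred from the last row of the table above


def parse_doc_id(line: str, default: str = DEFAULT_DOC_ID) -> str:
    """Return the third whitespace-separated item of a -DOCSTART- line,
    with at most one leading and one trailing hyphen removed."""
    items = line.split()
    if not items or items[0] != "-DOCSTART-":
        raise ValueError(f"not a DOCSTART line: {line!r}")
    if len(items) < 3:
        # No third column (e.g. a bare "-DOCSTART-"): use the fallback.
        return default
    # Strip a single surrounding hyphen on each side; inner hyphens are kept.
    return re.sub(r"^-|-$", "", items[2]) or default


for header in ["-DOCSTART- -X- -1- O", "-DOCSTART- -X- 2 O",
               "-DOCSTART- -X- -3-1- O", "-DOCSTART-"]:
    print(parse_doc_id(header))  # prints 1, 2, 3-1, X (one per line)
```

Only one hyphen is removed per side, so an id such as `3-1` keeps its inner hyphen.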

Motivation and Context

Flexibility already offered in the implementation of the CoNLL 2003 standard.

How Has This Been Tested?

All existing tests passing

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

…ing any surrounding single hyphens

maziyarpanahi · 2023-01-09T16:43:27Z

Thanks for the contribution @jfernandrezj

Can we add a param like enableDocId inside the CoNLL() class and make it false by default? We would only do this if the user explicitly sets it to true. Most CoNLL (2003) files don't have this id, so we would end up with extra computation that only produces an unusable X:

+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            sentence|               token|                 pos|               label|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     X|EU rejects German...|[{document, 0, 28...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

PS: We should have this param in both Scala and Python and both have to be set to false/False to be backward compatible

PS2: The unit test fails because of an ArrayIndexOutOfBoundsException; we need a safer assumption there: https://github.com/JohnSnowLabs/spark-nlp/actions/runs/3875840482/jobs/6608932290#step:7:1267
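The opt-in flag suggested in this comment could look roughly like the sketch below. This is plain Python over a list of lines, not the Spark NLP `CoNLL()` reader. The function `read_conll`, its `enable_doc_id` keyword, and the `"X"` fallback are all hypothetical names and choices made for illustration.

```python
import re
from typing import Dict, Iterable, List


def read_conll(lines: Iterable[str], enable_doc_id: bool = False) -> List[Dict]:
    """Group CoNLL token rows into documents; attach doc_id only on request."""
    docs: List[Dict] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("-DOCSTART-"):
            doc: Dict = {"tokens": []}
            if enable_doc_id:
                items = line.split()
                # Assumption: fall back to "X" when the third column is absent.
                raw_id = items[2] if len(items) > 2 else "X"
                doc["doc_id"] = re.sub(r"^-|-$", "", raw_id)
            docs.append(doc)
        elif line:
            if not docs:
                docs.append({"tokens": []})  # token rows before any DOCSTART
            docs[-1]["tokens"].append(line.split()[0])
    return docs


sample = ["-DOCSTART- -X- -1- O", "", "EU NNP B-NP B-ORG", "rejects VBZ B-VP O",
          "", "-DOCSTART-", "", "China NNP B-NP B-LOC"]
print(read_conll(sample))                      # no doc_id key: default behaviour
print(read_conll(sample, enable_doc_id=True))  # doc_id "1" and "X"
```

With the flag left at its default, the header line is never split, so existing callers pay no extra cost and see no new column.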

jfernandrezj · 2023-01-09T17:41:38Z

Hello @maziyarpanahi, yes, will add that param, setting it to false by default.

maziyarpanahi · 2023-01-09T17:43:24Z

> Hello @maziyarpanahi, yes, will add that param, setting it to false by default.

Thanks @jfernandrezj, that'd be great. We also need to see how we can enable it safely, so that no exception is thrown when docId = items(2) doesn't exist.
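To make the failure mode concrete, here is a small Python analogue. It is an assumption-laden illustration, not the Scala fix. Indexing the third item of a bare `-DOCSTART-` header raises `IndexError` (Python's counterpart to the `ArrayIndexOutOfBoundsException` seen in CI), and a length check avoids it. The helper name `third_item` is hypothetical.

```python
from typing import List, Optional


def third_item(items: List[str]) -> Optional[str]:
    """Return items[2] when it exists, otherwise None instead of raising."""
    return items[2] if len(items) > 2 else None


bare_header = "-DOCSTART-".split()
try:
    bare_header[2]  # unguarded access, as in docId = items(2)
    unguarded_failed = False
except IndexError:
    unguarded_failed = True

print(unguarded_failed)                            # True
print(third_item(bare_header))                     # None
print(third_item("-DOCSTART- -X- -1- O".split()))  # -1-
```

Returning `None` leaves the caller to choose between skipping the column and substituting a default.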

jfernandrezj · 2023-01-09T18:04:02Z

Yes @maziyarpanahi, I'll also add that.

…s in the DOCSTART row

jfernandrezj · 2023-01-09T18:48:27Z

Hello @maziyarpanahi the pushed changes should handle the requirements

… into conll-reader-with-docid-parsing

jfernandrezj · 2023-01-09T18:56:30Z

Currently adding the same support while reading the dataset with .../*

maziyarpanahi · 2023-01-09T19:40:30Z

Thanks @jfernandrezj for these changes

I added @albertoandreottiATgmail as a reviewer since he recently made some improvements to these files

jfernandrezj · 2023-01-09T19:46:10Z

Thank you @maziyarpanahi ! Thank you @albertoandreottiATgmail !

maziyarpanahi · 2023-01-10T09:43:01Z

@jfernandrezj Some unit tests failed. It seems the order of columns in the CoNLL file has changed, so instead of NER it reads the POS column:

jfernandrezj · 2023-01-10T11:32:58Z

> @jfernandrezj Some unit tests failed. It seems the order of columns in the CoNLL file has changed, so instead of NER it reads the POS column:

@maziyarpanahi can you please point me to the Test file in the repo?

maziyarpanahi · 2023-01-10T11:36:23Z

> > @jfernandrezj Some unit tests failed. It seems the order of columns in the CoNLL file has changed, so instead of NER it reads the POS column:
>
> @maziyarpanahi can you please point me to the Test file in the repo?

Sure, it's here:

spark-nlp/src/test/scala/com/johnsnowlabs/nlp/annotators/ner/dl/NerDLSpec.scala

Line 195 in 87bbf31

    
"NerDLApproach" should "validate against part of the training dataset" taggedAs FastTest in {

PS: Since this format is not the usual CoNLL 2003 schema and our readDataset was originally created for the conll03 format, we could just create another method, like readDatasetWithDocId or any other name, that targets only this kind of CoNLL file.
This way we don't have to worry about how current usages of readDataset would behave after this change, and we have more flexibility to develop more readers in the CoNLL() class, like 2017 etc. What do you think @jfernandrezj? (We do lack readers, so this way we can just keep adding new ones.)
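One way to picture the separate-method design is the toy class below. It is a hedged sketch in plain Python, not the Spark NLP `CoNLL()` class: `ToyCoNLL`, `read_dataset`, `read_dataset_with_doc_id`, and the `"X"` fallback are hypothetical. The existing reader is left untouched and the doc_id logic lives only in the new method.

```python
import re
from typing import Iterable, List, Tuple


class ToyCoNLL:
    """Hypothetical stand-in for a CoNLL reader; not the Spark NLP class."""

    def _documents(self, lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
        # Pair each -DOCSTART- header with the first column of its token rows.
        docs: List[Tuple[str, List[str]]] = []
        for raw in lines:
            line = raw.strip()
            if line.startswith("-DOCSTART-"):
                docs.append((line, []))
            elif line and docs:
                docs[-1][1].append(line.split()[0])
        return docs

    def read_dataset(self, lines: Iterable[str]) -> List[List[str]]:
        # Unchanged behaviour: the header is never inspected.
        return [tokens for _, tokens in self._documents(lines)]

    def read_dataset_with_doc_id(self, lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
        result = []
        for header, tokens in self._documents(lines):
            items = header.split()
            raw_id = items[2] if len(items) > 2 else "X"  # assumed fallback
            result.append((re.sub(r"^-|-$", "", raw_id), tokens))
        return result


sample = ["-DOCSTART- -X- -3-1- O", "", "China NNP B-NP B-LOC", "says VBZ B-VP O"]
print(ToyCoNLL().read_dataset(sample))              # [['China', 'says']]
print(ToyCoNLL().read_dataset_with_doc_id(sample))  # [('3-1', ['China', 'says'])]
```

Because both methods share the private splitting step, a later reader for another CoNLL variant could be added the same way without touching existing callers.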