Skip to content

Conversation

@jfernandrezj
Copy link
Contributor

@jfernandrezj jfernandrezj commented Jan 9, 2023

Description

Supports doc_id information through:

DOCSTART- -X- -{doc_id}- O

i.e.

-DOCSTART- -X- -1- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

-DOCSTART- -X- 2 O

Rare NNP B-NP O
Hendrix NNP I-NP B-PER

-DOCSTART- -X- -3-1- O

China NNP B-NP B-LOC
says VBZ B-VP O

-DOCSTART-

China NNP B-NP B-LOC
says VBZ B-VP O

would produce:

+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            sentence|               token|                 pos|               label|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     1|EU rejects German...|[{document, 0, 28...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
|     2|Rare Hendrix song...|[{document, 0, 97...|[{document, 0, 50...|[{token, 0, 3, Ra...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|   3-1|China says Taiwan...|[{document, 0, 13...|[{document, 0, 46...|[{token, 0, 4, Ch...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
|     X|China says Taiwan...|[{document, 0, 13...|[{document, 0, 46...|[{token, 0, 4, Ch...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

Motivation and Context

Flexibility already offered in the implementation of the CoNLL 2003 standard.

How Has This Been Tested?

All existing tests passing

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@jfernandrezj jfernandrezj self-assigned this Jan 9, 2023
@maziyarpanahi
Copy link
Contributor

maziyarpanahi commented Jan 9, 2023

Thanks for the contribution @jfernandrezj

Can we add a param like enableDocId inside CoNLL() class and make it by default false? If it was set to true explicitly by the user then we can do this. Most CoNLL files (2003) don't have this id so we end up with more computation that results in none usable X

+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            sentence|               token|                 pos|               label|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     X|EU rejects German...|[{document, 0, 28...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

PS: We should have this param in both Scala and Python and both have to be set to false/False to be backward compatible

PS2: The unit test fails because of the .ArrayIndexOutOfBoundsException, we need a safer assumption there https://github.com/JohnSnowLabs/spark-nlp/actions/runs/3875840482/jobs/6608932290#step:7:1267

@maziyarpanahi maziyarpanahi added enhancement DON'T MERGE Do not merge this PR labels Jan 9, 2023
@maziyarpanahi maziyarpanahi changed the base branch from master to release/427-release-candidate January 9, 2023 16:43
@jfernandrezj
Copy link
Contributor Author

Hello @maziyarpanahi, yes, will add that param, setting it to false by default.

@maziyarpanahi
Copy link
Contributor

Hello @maziyarpanahi, yes, will add that param, setting it to false by default.

Thanks @jfernandrezj that'd be great. We also need to see how safely we can enable it and not to have exception if docId = items(2) doesn't exist

@jfernandrezj
Copy link
Contributor Author

Yes @maziyarpanahi Ill also add that

@jfernandrezj
Copy link
Contributor Author

Hello @maziyarpanahi the pushed changes should handle the requirements

@jfernandrezj
Copy link
Contributor Author

Currently adding the same support while reading the dataset with .../*

@maziyarpanahi
Copy link
Contributor

Thanks @jfernandrezj for these changes

I added @albertoandreottiATgmail as a reviewer since he recently made some improvements to these files

@jfernandrezj
Copy link
Contributor Author

Thank you @maziyarpanahi ! Thank you @albertoandreottiATgmail !

@maziyarpanahi
Copy link
Contributor

maziyarpanahi commented Jan 10, 2023

@jfernandrezj Some unit tests failed, it seems the order of columns in CoNLL file has changed so instead of NER it reds the POS column:

image

@maziyarpanahi maziyarpanahi added the on-hold cannot be merged right away label Jan 10, 2023
@jfernandrezj
Copy link
Contributor Author

@jfernandrezj Some unit tests failed, it seems the order of columns in CoNLL file has changed so instead of NER it reds the POS column:

image

@maziyarpanahi can you please point me to the Test file in the repo?

@maziyarpanahi
Copy link
Contributor

@jfernandrezj Some unit tests failed, it seems the order of columns in CoNLL file has changed so instead of NER it reds the POS column:
image

@maziyarpanahi can you please point me to the Test file in the repo?

Sure, it's here:

"NerDLApproach" should "validate against part of the training dataset" taggedAs FastTest in {

"NerDLApproach" should "validate against part of the training dataset" taggedAs FastTest in {

PS: Since this format is not a usual CoNLL 2003 schema and our readDataset was originally created for conll03 format, what we can do is to just create another method like readDatasetWithDocId or any other name to just target this kind of CoNLL files.
This way we don't have to be worried about what current usage of readDataset would be after this change and we have more flexibility in developing more readers in CoNLL() class like 2017 etc. What do you think @jfernandrezj? (we do lack of readers so this way we can just keep adding new ones)

@maziyarpanahi maziyarpanahi removed the on-hold cannot be merged right away label Jan 11, 2023
def closeDocument = {

val result = (doc.toString, sentences.toList)
val result = (doc.toString, sentences.toList, docId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you pack everything into a tuple, just to use the first element result._1, and then assemble the tuple back from its pieces again? Why don't you just,

doc.clear()
sentences.clear()

if (doc.toString.nonEmpty)
  Some(doc.toString, sentences.toList, if(incldeDocId) docId else None)
else None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. The pattern was already there though, didnt give it any thought while changing it.

addSentence()
closeDocument
val closedDoc = closeDocument
docId = items.lift(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't modify external state within your map/flatMap.
Make sure the condition in the if is met only once per pass.
I would do this in 2 steps, first consume all lines until I get to the "-DOCSTART-", once I got that docId problem solved, I would continue with the rest of the document, and merge the 2 things at the end.
Disclaimer: consider I don't remember exactly the schema of CONLL right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ill try to check this

}

def removeSurroundingHyphens(text: String) =
"-(.+)-".r.findFirstMatchIn(text).map(_.group(1)).getOrElse(text)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have done text.replace("-", ""), if you're sure you're just after a number. Or,
test.tail.dropRight(1).
Probably yours is more robust.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case the ids actually have hyphens

val label = getAnnotationType(labelCol, AnnotatorType.NAMED_ENTITY)

StructType(Seq(text, doc, sentence, token, pos, label))
if (includeDocId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep the schema always the same, consider consumers of this data. Changing output schema according to input data probably is not a good idea. If docId is not available, then make them all docId==0. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the original idea in the very first commit. But then the optionality was required.

import spark.implicits._
val rows = docs.map(coreTransformation).toDF.rdd
spark.createDataFrame(rows, schema)
val preDf = if (includeDocId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep a single version of everything, always return the column for the docId, if there's no doc ID in the data, return 0. Simplifies the code, keeps a single version of the coreTransformation() function... easier to maintain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the original idea in the very first commit. But then the optionality was required.

@maziyarpanahi maziyarpanahi changed the base branch from release/427-release-candidate to release/430-release-candidate January 12, 2023 12:34
@maziyarpanahi maziyarpanahi changed the base branch from release/430-release-candidate to release/427-release-candidate January 12, 2023 12:38
@maziyarpanahi
Copy link
Contributor

closing this for a fresh branch, but keep the PR for the history.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

DON'T MERGE Do not merge this PR enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants