-
Notifications
You must be signed in to change notification settings - Fork 739
Support doc_id column as the third item in a DOCSTART- line discarding any surrounding single hyphens
#13316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ing any surrounding single hyphens
|
Thanks for the contribution @jfernandrezj Can we add a param like PS: We should have this param in both Scala and Python and both have to be set to false/False to be backward compatible PS2: The unit test fails because of the .ArrayIndexOutOfBoundsException, we need a safer assumption there https://github.com/JohnSnowLabs/spark-nlp/actions/runs/3875840482/jobs/6608932290#step:7:1267 |
|
Hello @maziyarpanahi, yes, will add that param, setting it to false by default. |
Thanks @jfernandrezj that'd be great. We also need to see how safely we can enable it and not to have exception if |
|
Yes @maziyarpanahi Ill also add that |
…s in the DOCSTART row
|
Hello @maziyarpanahi the pushed changes should handle the requirements |
… into conll-reader-with-docid-parsing
|
Currently adding the same support while reading the dataset with .../* |
|
Thanks @jfernandrezj for these changes I added @albertoandreottiATgmail as a reviewer since he recently made some improvements to these files |
|
Thank you @maziyarpanahi ! Thank you @albertoandreottiATgmail ! |
|
@jfernandrezj Some unit tests failed, it seems the order of columns in CoNLL file has changed so instead of NER it reds the POS column: |
@maziyarpanahi can you please point me to the Test file in the repo? |
Sure, it's here:
PS: Since this format is not a usual CoNLL 2003 schema and our readDataset was originally created for conll03 format, what we can do is to just create another method like readDatasetWithDocId or any other name to just target this kind of CoNLL files. |
| def closeDocument = { | ||
|
|
||
| val result = (doc.toString, sentences.toList) | ||
| val result = (doc.toString, sentences.toList, docId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you pack everything into a tuple, just to use the first element result._1, and then assemble the tuple back from its pieces again? Why don't you just,
doc.clear()
sentences.clear()
if (doc.toString.nonEmpty)
Some(doc.toString, sentences.toList, if(incldeDocId) docId else None)
else None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. The pattern was already there though, didnt give it any thought while changing it.
| addSentence() | ||
| closeDocument | ||
| val closedDoc = closeDocument | ||
| docId = items.lift(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't modify external state within your map/flatMap.
Make sure the condition in the if is met only once per pass.
I would do this in 2 steps, first consume all lines until I get to the "-DOCSTART-", once I got that docId problem solved, I would continue with the rest of the document, and merge the 2 things at the end.
Disclaimer: consider I don't remember exactly the schema of CONLL right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ill try to check this
| } | ||
|
|
||
| def removeSurroundingHyphens(text: String) = | ||
| "-(.+)-".r.findFirstMatchIn(text).map(_.group(1)).getOrElse(text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have done text.replace("-", ""), if you're sure you're just after a number. Or,
test.tail.dropRight(1).
Probably yours is more robust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just in case the ids actually have hyphens
| val label = getAnnotationType(labelCol, AnnotatorType.NAMED_ENTITY) | ||
|
|
||
| StructType(Seq(text, doc, sentence, token, pos, label)) | ||
| if (includeDocId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep the schema always the same, consider consumers of this data. Changing output schema according to input data probably is not a good idea. If docId is not available, then make them all docId==0. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the original idea in the very first commit. But then the optionality was required.
| import spark.implicits._ | ||
| val rows = docs.map(coreTransformation).toDF.rdd | ||
| spark.createDataFrame(rows, schema) | ||
| val preDf = if (includeDocId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep a single version of everything, always return the column for the docId, if there's no doc ID in the data, return 0. Simplifies the code, keeps a single version of the coreTransformation() function... easier to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the original idea in the very first commit. But then the optionality was required.
|
closing this for a fresh branch, but keep the PR for the history. |

Description
Supports
doc_idinformation through:i.e.
would produce:
Motivation and Context
Flexibility already offered in the implementation of the CoNLL 2003 standard.
How Has This Been Tested?
All existing tests passing
Screenshots (if appropriate):
Types of changes
Checklist: