Adding Control for Annotators with One Column #12997

danilojsl · 2022-10-27T22:49:57Z

Description

Adds inputAnnotatorTypes in Python to identify annotators that expect only one column. This allows annotators that process a single column to handle inputs with more than one column by taking only the first one and ignoring the other columns.

Motivation and Context

Check issue #12981

How Has This Been Tested?

Screenshots (if appropriate):

Local Unit Tests
Google Colab

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

dataanalyst4lyfe · 2022-10-31T15:22:02Z

Understanding the Current Solution
Wait, if I am understanding you correctly:

by taking only the first one and ignoring the other columns.

That means if I did something like this:

annotator.setInputCols(["col_1", "col_2"])

When that is executed, only results for col_1 will be returned.
^ If that is the case, I think that is a very bad idea.

My Proposed Solution
Throw an exception if the user calls setInputCols() (the plural, multi-column setter) on one of these annotators. The exception message should tell the user to use the single-column version, setInputCol().

Why?
I think this implementation would reduce the risk of "silent failures" and make it explicit what should be done. The user will know that they can only pass in one column at a time, and they won't have to debug and wonder why only one column is being processed.
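
The difference between the two behaviors discussed here can be sketched in a few lines. This is a minimal illustration, not Spark NLP code: the classes SilentAnnotator and StrictAnnotator and the input_col attribute are hypothetical, and ValueError is an arbitrary choice of exception type.

```python
class SilentAnnotator:
    """Hypothetical annotator that mimics the behavior questioned above."""

    def setInputCols(self, cols):
        # Keeps the first column and drops the rest without any warning.
        self.input_col = cols[0]
        return self


class StrictAnnotator:
    """Hypothetical annotator that follows the proposal instead."""

    def setInputCol(self, col):
        self.input_col = col
        return self

    def setInputCols(self, cols):
        # Fail loudly and point the user at the single-column setter.
        raise ValueError(
            "This annotator accepts exactly one input column; "
            "use setInputCol() instead of setInputCols()."
        )


silent = SilentAnnotator().setInputCols(["col_1", "col_2"])
print(silent.input_col)  # col_2 was ignored without notice

try:
    StrictAnnotator().setInputCols(["col_1", "col_2"])
except ValueError as err:
    print(f"rejected: {err}")
```

With the silent variant the caller gets no signal that col_2 was dropped, while the strict variant stops at configuration time, before any data is processed.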

@danilojsl - Let me know what you think.

maziyarpanahi · 2022-11-01T11:38:51Z

Hi @dataanalyst4lyfe
Yes, your assumption was correct: it will silently skip the other inputCols. We actually discussed bringing this in sync with the Scala/Java behavior, which is a hard exception with an error saying that this annotator only accepts 1 inputCol. We were going to implement the exception for the reason you explained perfectly. Thanks for reminding me.

danilojsl · 2022-11-08T13:15:08Z

@maziyarpanahi, I made the changes to raise an exception when the number of columns differs from the expected number. The behavior is now in sync with Scala.

maziyarpanahi · 2022-11-08T13:16:51Z

Thanks @danilojsl, the new raise TypeError will make it behave exactly like Scala, so we won't have use cases where we are not sure what goes in and out of the pipeline.
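
A check along these lines could look like the following sketch. It is not the code merged in this PR: the class SingleColumnAnnotator, the helper _validate_input_cols, and the error wording are assumptions made for illustration. Only the inputAnnotatorTypes name and the use of TypeError come from this thread.

```python
class SingleColumnAnnotator:
    """Hypothetical annotator that declares one expected input type."""

    # One entry per expected input column (assumed convention).
    inputAnnotatorTypes = ["document"]

    def _validate_input_cols(self, cols):
        # Compare the number of provided columns with the declared types.
        expected = len(self.inputAnnotatorTypes)
        if len(cols) != expected:
            raise TypeError(
                f"setInputCols expects {expected} column(s) "
                f"but received {len(cols)}: {cols}"
            )

    def setInputCols(self, *value):
        # Accept both setInputCols("a", "b") and setInputCols(["a", "b"]).
        if len(value) == 1 and isinstance(value[0], (list, tuple)):
            cols = list(value[0])
        else:
            cols = list(value)
        self._validate_input_cols(cols)
        self.inputCols = cols
        return self


annotator = SingleColumnAnnotator().setInputCols(["col_1"])
print(annotator.inputCols)

try:
    SingleColumnAnnotator().setInputCols(["col_1", "col_2"])
except TypeError as err:
    print(f"rejected: {err}")
```

Deriving the expected count from the declared types keeps the rule in one place, so an annotator that takes two inputs would only need a longer inputAnnotatorTypes list.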

coveralls · 2022-11-08T13:51:43Z

Pull Request Test Coverage Report for Build 3421214687

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 69.036%

Totals
Change from base Build 3412079753:	0.0%
Covered Lines:	8800
Relevant Lines:	12747

💛 - Coveralls

SPARKNLP-632 Adding control for annotators with only one expected col…

7f78d0b

…umn in Python side

danilojsl requested a review from maziyarpanahi October 27, 2022 22:50

danilojsl added the bug-fix and DON'T MERGE (Do not merge this PR) labels Oct 27, 2022

maziyarpanahi self-assigned this Oct 28, 2022

maziyarpanahi linked an issue Oct 28, 2022 that may be closed by this pull request

Multiple Documents Do Not Get Chunked Properly #12981

Closed

SPARKNLP-632 Adding inputAnnotatorTypes and optionalInputAnnotatorTyp…

5bc3918

…es in annotators for Python

danilojsl changed the base branch from master to release/423-release-candidate November 8, 2022 13:11

maziyarpanahi approved these changes Nov 8, 2022

SPARKNLP-632 Removing useless prints to terminal

74cd9aa

danilojsl added 2 commits November 8, 2022 10:43

SPARKNLP-632 Adding AnnotatorType initialization in Python

1b4c6c2

SPARKNLP-632 Removing wrong imports

c72d9ef

maziyarpanahi merged commit 902b07d into release/423-release-candidate Nov 10, 2022

maziyarpanahi mentioned this pull request Nov 10, 2022

Release/423 release candidate #13036

Merged

KshitizGIT deleted the feature/SPARKNLP-632-multiple-documents-do-not-get-chunked-properly branch March 2, 2023 10:37
