Adding Control for Annotators with One Column #12997

@danilojsl
Contributor

danilojsl commented Oct 27, 2022

Description

Adds inputAnnotatorTypes to the Python annotators so that annotators expecting only one input column can be identified. This lets an annotator that processes a single column accept multi-column input by taking only the first one and ignoring the other columns.
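
The behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the class `SingleColumnAnnotatorSketch` is hypothetical, and the assumption that `inputAnnotatorTypes` holds one entry per expected input column is mine.

```python
class SingleColumnAnnotatorSketch:
    """Hypothetical stand-in for an annotator that processes one input column."""

    # Assumption: one entry per expected input column.
    inputAnnotatorTypes = ["document"]

    def __init__(self):
        self.inputCols = []

    def setInputCols(self, *cols):
        # Accept both setInputCols("a", "b") and setInputCols(["a", "b"]).
        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
            cols = tuple(cols[0])
        expected = len(self.inputAnnotatorTypes)
        # Keep only as many columns as expected; extra columns are dropped silently.
        self.inputCols = list(cols[:expected])
        return self


annotator = SingleColumnAnnotatorSketch().setInputCols(["col_1", "col_2"])
print(annotator.inputCols)  # ['col_1']
```

Note that "col_2" disappears without any warning, which is the point raised later in the conversation.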

Motivation and Context

Check issue #12981

How Has This Been Tested?

  • Local Unit Tests
  • Google Colab

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl added the bug-fix and DON'T MERGE labels Oct 27, 2022
@maziyarpanahi self-assigned this Oct 28, 2022
@maziyarpanahi linked an issue Oct 28, 2022 that may be closed by this pull request
@dataanalyst4lyfe

dataanalyst4lyfe commented Oct 31, 2022

Understanding the Current Solution
Wait, if I am understanding you correctly:

by taking only the first one and ignoring the other columns.

That means if I did something like this:

annotator.setInputCols(["col_1", "col_2"])

When that is executed, only the results for col_1 will be returned.
If that is the case, I think that is a very bad idea.

My Proposed Solution
Throw an exception if the user calls setInputCols() (the multi-column setter) on an annotator that accepts only one column. The exception message should tell the user to use the single-column version, setInputCol().

Why?
I think this implementation would reduce the risk of "silent failures" and make explicit what should be done. Users will know that they can pass in only one column at a time, and they won't have to debug and wonder why only one column is being processed.
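
A minimal sketch of this proposal, under stated assumptions: the class `OneColumnAnnotatorSketch` is hypothetical, and the choice of `ValueError` and the message wording are illustrative rather than taken from the library.

```python
class OneColumnAnnotatorSketch:
    """Hypothetical annotator that accepts exactly one input column."""

    def __init__(self):
        self.inputCol = None

    def setInputCol(self, col):
        # Single-column setter: the only supported way to configure the input.
        self.inputCol = col
        return self

    def setInputCols(self, *cols):
        # Fail loudly and point the user at the single-column setter.
        # ValueError is an assumption made for this sketch.
        raise ValueError(
            f"{type(self).__name__} accepts exactly one input column, "
            f"got {cols!r}. Use setInputCol() instead."
        )


annotator = OneColumnAnnotatorSketch().setInputCol("col_1")

try:
    OneColumnAnnotatorSketch().setInputCols(["col_1", "col_2"])
except ValueError as err:
    message = str(err)
    print(message)
```

With this shape the mistake surfaces at configuration time instead of showing up later as missing results.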

@danilojsl - Let me know what you think.

@maziyarpanahi
Member

Hi @dataanalyst4lyfe
Yes, your assumption was correct: it will silently skip the other inputCols. We actually discussed bringing this in sync with the Scala/Java behavior, which is a hard exception with an error saying that this annotator only accepts 1 inputCol. We were going to implement the exception for exactly the reason you explained, thanks for reminding me.

@danilojsl changed the base branch from master to release/423-release-candidate November 8, 2022 13:11
@danilojsl
Contributor Author

@maziyarpanahi, I made the changes to raise an exception when the number of columns differs from what is expected, so the behavior is now in sync with Scala.

@maziyarpanahi
Member

Thanks @danilojsl, the new raise TypeError will make it behave exactly like Scala, so we won't have cases where we are unsure what goes in and out of the pipeline.
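
For readers following along, the check discussed here might look roughly like the sketch below. It is not the merged code: the classes are hypothetical, the error message wording is invented for illustration, and deriving the expected count from the length of `inputAnnotatorTypes` is an assumption. Only the use of `TypeError` comes from the conversation.

```python
class AnnotatorSketch:
    """Hypothetical base class; subclasses declare their expected input types."""

    # Assumption: one entry per expected input column.
    inputAnnotatorTypes = []

    def __init__(self):
        self.inputCols = []

    def setInputCols(self, *cols):
        # Accept both setInputCols("a", "b") and setInputCols(["a", "b"]).
        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
            cols = tuple(cols[0])
        expected = len(self.inputAnnotatorTypes)
        if len(cols) != expected:
            # Message wording is illustrative, not the library's.
            raise TypeError(
                f"setInputCols in {type(self).__name__} expects {expected} "
                f"column(s), but {len(cols)} were provided."
            )
        self.inputCols = list(cols)
        return self


class OneInputSketch(AnnotatorSketch):
    inputAnnotatorTypes = ["document"]


class TwoInputSketch(AnnotatorSketch):
    inputAnnotatorTypes = ["document", "token"]


# A matching column count is accepted.
ok = TwoInputSketch().setInputCols("document", "token")

# A mismatched column count raises instead of being trimmed silently.
try:
    OneInputSketch().setInputCols(["col_1", "col_2"])
    raised = False
except TypeError as err:
    raised = True
    print(err)
```

Because the expected count comes from the class declaration, the same check covers annotators that take one column and those that take several.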

@coveralls

coveralls commented Nov 8, 2022

Pull Request Test Coverage Report for Build 3421214687

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 69.036%

Totals Coverage Status
Change from base Build 3412079753: 0.0%
Covered Lines: 8800
Relevant Lines: 12747

💛 - Coveralls

@maziyarpanahi merged commit 902b07d into release/423-release-candidate Nov 10, 2022
@KshitizGIT deleted the feature/SPARKNLP-632-multiple-documents-do-not-get-chunked-properly branch March 2, 2023 10:37
Labels
bug-fix DON'T MERGE Do not merge this PR
Development

Successfully merging this pull request may close these issues.

Multiple Documents Do Not Get Chunked Properly
4 participants