
SPARKNLP-828: Raise error when exceeding max input length #13774

Conversation


@DevinTDHa DevinTDHa commented Apr 27, 2023

Description

This PR introduces an exception in transformer-based annotators: if an invalid value is used to set the maximum input length, an exception is thrown (the limit is 512 tokens, or 4096 for Longformer). This check already existed on the Scala side but was missing in Python; the two sides are now in sync.

This is done by introducing a new mix-in class, HasMaxSentenceLengthLimit:

```python
class HasMaxSentenceLengthLimit:
    # Default value, can be overridden by subclasses
    max_length_limit = 512

    maxSentenceLength = Param(Params._dummy(),
                              "maxSentenceLength",
                              "Max sentence length to process",
                              typeConverter=TypeConverters.toInt)

    def setMaxSentenceLength(self, value):
        """Sets max sentence length to process.

        Note that a maximum limit exists depending on the model. If you are working with long single
        sequences, consider splitting up the input first with another annotator e.g. SentenceDetector.

        Parameters
        ----------
        value : int
            Max sentence length to process
        """
        if value > self.max_length_limit:
            raise ValueError(
                f"{self.__class__.__name__} models do not support token sequences longer than {self.max_length_limit}.\n"
                f"Consider splitting up the input first with another annotator e.g. SentenceDetector.")
        return self._set(maxSentenceLength=value)

    def getMaxSentenceLength(self):
        """Gets max sentence length of the model.

        Returns
        -------
        int
            Max sentence length to process
        """
        return self.getOrDefault("maxSentenceLength")


class HasLongMaxSentenceLengthLimit(HasMaxSentenceLengthLimit):
    max_length_limit = 4096
```

A note regarding this has also been added to the documentation.
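The validation behavior can be illustrated with a standalone sketch. The pyspark `Param` machinery is stubbed with a plain dict here so the example runs without Spark, and the annotator class names below are illustrative stand-ins, not the real Spark NLP classes:

```python
# Standalone sketch of the mix-in's validation logic. The pyspark Param
# machinery is replaced by a plain dict so the example runs without Spark.
class HasMaxSentenceLengthLimit:
    max_length_limit = 512  # default limit, overridden by long-input models

    def __init__(self):
        self._params = {}

    def _set(self, **kwargs):  # stand-in for pyspark's Params._set
        self._params.update(kwargs)
        return self

    def setMaxSentenceLength(self, value):
        # Reject values above the model's hard limit, mirroring the PR's check
        if value > self.max_length_limit:
            raise ValueError(
                f"{self.__class__.__name__} models do not support token sequences "
                f"longer than {self.max_length_limit}.")
        return self._set(maxSentenceLength=value)


class HasLongMaxSentenceLengthLimit(HasMaxSentenceLengthLimit):
    max_length_limit = 4096  # e.g. Longformer


# Hypothetical annotators picking up the mix-ins
class BertLikeAnnotator(HasMaxSentenceLengthLimit):
    pass


class LongformerLikeAnnotator(HasLongMaxSentenceLengthLimit):
    pass


BertLikeAnnotator().setMaxSentenceLength(512)          # within the 512 limit
LongformerLikeAnnotator().setMaxSentenceLength(1024)   # within the 4096 limit
try:
    BertLikeAnnotator().setMaxSentenceLength(1024)     # exceeds 512
except ValueError as e:
    print(e)
```

Overriding a single class attribute is all a long-input annotator needs to opt into the larger limit, which keeps the check in one place.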

Motivation and Context

Users have been experiencing issues when trying to translate long texts. This change makes it clear that very long inputs are not intended for these models: users should first split the text into smaller chunks with a SentenceDetector. Rejecting values larger than 512 for the maxSentenceLength parameter prevents exceptions later in the pipeline (this validation was previously missing only on the Python side).

How Has This Been Tested?

Existing tests have been amended to cover this behaviour. Missing tests were added for some annotators.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

- Python side now also throws an exception if max length exceeds 512
…ators

- Added HasMaxSentenceLengthLimit mix-in to check for valid value for maxSentenceLength
- Appended tests with new test case for this
- Added missing tests for some annotators
@DevinTDHa DevinTDHa changed the title SPARKNLP-828: MarianTransformer - Raise error when exceeding max input length SPARKNLP-828: Raise error when exceeding max input length May 1, 2023
@DevinTDHa DevinTDHa marked this pull request as draft May 2, 2023 08:36
@DevinTDHa DevinTDHa marked this pull request as ready for review May 2, 2023 15:07
@maziyarpanahi maziyarpanahi changed the base branch from master to release/442-release-candidate May 10, 2023 09:45
@maziyarpanahi maziyarpanahi merged commit 7612d98 into JohnSnowLabs:release/442-release-candidate May 10, 2023
7 of 8 checks passed
Labels: bug-fix, DON'T MERGE (Do not merge this PR)