Enable RegexMatchSpan to concatenate words via the sep="(separator)" option #492
Description of the problems or issues
Is your pull request related to a problem? Please describe.
The sentence "123 456 789" is parsed into three words: "123", "456", and "789".
I'd like to match the whole number with a matcher like
RegexMatchSpan(rgx=r"\d{9}", sep=" ")
but sep=" " currently has no effect.
Does your pull request fix any issue?
Fix #270
Description of the proposed changes
Enable the sep="(separator)" option of RegexMatchSpan.
When sep is set, the matcher concatenates the words of a mention span into a single string and applies the regex match without considering the separator.
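The intended behavior can be sketched with the standard library alone. This is a minimal illustration, not Fonduer's implementation: the helper name matches_ignoring_sep is hypothetical, and full-match semantics (re.fullmatch) are assumed here for simplicity.

```python
import re


def matches_ignoring_sep(span_text: str, rgx: str, sep: str = "") -> bool:
    """Hypothetical helper: remove `sep` from the span text, then full-match `rgx`.

    Assumption: the whole concatenated string must match the pattern.
    """
    # str.split("") raises ValueError, so only split when a separator is given.
    joined = "".join(span_text.split(sep)) if sep else span_text
    return re.fullmatch(rgx, joined) is not None


# Without a separator, the spaces are part of the text and \d{9} cannot match.
without_sep = matches_ignoring_sep("123 456 789", r"\d{9}")

# With sep=" ", the words are joined into "123456789" before matching.
with_sep = matches_ignoring_sep("123 456 789", r"\d{9}", sep=" ")
```

Here without_sep is False and with_sep is True, which mirrors the problem described above and the behavior this change proposes.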
Test plan
Add test code to 'fonduer/tests/candidates/test_matchers.py'.
The sentence "This is apple" is parsed into two 2-grams, "This is" and "is apple".
The 2-gram "is apple" can then be matched with the following rgx and the sep=" " (space) option:
RegexMatchSpan(rgx=r"isapple", sep=" ")
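The test scenario can be reproduced outside Fonduer as a rough check of the expected outcome. The ngrams helper and the filtering step below are assumptions made for illustration; they stand in for Fonduer's mention extraction and are not its API.

```python
import re


def ngrams(words, n):
    """Hypothetical helper: return all contiguous n-grams, joined by a space."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "This is apple".split()
bigrams = ngrams(words, 2)

# Drop the separator from each 2-gram, then require a full match on "isapple".
sep = " "
matched = [g for g in bigrams if re.fullmatch(r"isapple", g.replace(sep, ""))]
```

Under these assumptions, bigrams holds "This is" and "is apple", and only "is apple" ends up in matched.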
Checklist