RegexMatchSpan with sep="" concatenates words with sep="(space)" #270

Closed
HiromuHota opened this issue May 21, 2019 · 1 comment · Fixed by #492
Labels
bug Something isn't working

Comments

@HiromuHota
Contributor

Describe the bug

The sentence "123 456 789" is parsed into three words: "123", "456", and "789".
I'd like to match the whole nine-digit number with

RegexMatchSpan(rgx=r"\d{9}", sep="")

but sep="" has no effect.

To Reproduce
Steps to reproduce the behavior:

  1. Have a sentence "123 456 789"
  2. Parse it
  3. Try to match it with RegexMatchSpan(rgx=r"\d{9}", sep="")

Expected behavior

RegexMatchSpan(rgx=r"\d{9}", sep="") should match the sentence "123 456 789".

Environment (please complete the following information):

  • Fonduer Version: 0.6.2

Additional context

I think the root cause of this issue is the following implementation.

    def get_attrib_span(self, a, sep=" "):
        """Get the span of sentence attribute *a*.

        Intuitively, like calling::

            sep.join(span.a)

        :param a: The attribute to get a span for.
        :type a: str
        :param sep: The separator to use for the join.
        :type sep: str
        :return: The joined tokens, or text if a="words".
        :rtype: str
        """
        # NOTE: Special behavior for words currently (due to correspondence
        # with char_offsets)
        if a == "words":
            return self.sentence.text[self.char_start : self.char_end + 1]
        else:
            return sep.join(self.get_attrib_tokens(a))

where a is words by default.
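
To make the effect of that branch concrete, here is a minimal, self-contained sketch. It does not use Fonduer: `ToySpan` and `get_attrib_span_joined` are hypothetical names invented for this illustration, and the "joined" variant only shows one possible way to honor `sep`, not necessarily what the eventual fix does.

```python
import re


class ToySpan:
    """Hypothetical stand-in for a span; not Fonduer's actual class."""

    def __init__(self, text, words, char_start, char_end):
        self.text = text              # full sentence text
        self.words = words            # tokens produced by the parser
        self.char_start = char_start  # inclusive start offset
        self.char_end = char_end      # inclusive end offset

    def get_attrib_span(self, a="words", sep=" "):
        # Mirrors the branch quoted above: for "words" the raw text slice
        # is returned, so sep is never consulted.
        if a == "words":
            return self.text[self.char_start : self.char_end + 1]
        return sep.join(getattr(self, a))

    def get_attrib_span_joined(self, a="words", sep=" "):
        # Hypothetical alternative: always join the tokens with sep.
        return sep.join(getattr(self, a))


span = ToySpan("123 456 789", ["123", "456", "789"], 0, 10)

current = span.get_attrib_span("words", sep="")
joined = span.get_attrib_span_joined("words", sep="")

# The raw slice keeps the spaces, so a nine-digit pattern cannot match it.
current_matches = re.fullmatch(r"\d{9}", current) is not None
# The joined tokens form one nine-digit string, which does match.
joined_matches = re.fullmatch(r"\d{9}", joined) is not None
```

With the quoted logic, `current` still contains the spaces from the sentence text, which is why `sep=""` appears to have no effect on the match.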

@HiromuHota
Contributor Author

This "Special behavior" dates back to snorkel-team/snorkel@7b57927.

@senwu senwu added the bug Something isn't working label Sep 30, 2019
@lukehsiao lukehsiao removed their assignment Nov 13, 2019
@senwu senwu closed this as completed in #492 Aug 5, 2020
senwu pushed a commit that referenced this issue Aug 5, 2020
…tion

Fix #270
Enable RegexMatchSpan with sep="(separator)" option.
It concatenates mention spans to one word and does RegexMatch without consideration of the separator.