[stdlib] Make `String.split()` default to whitespace & fix behavior to be pythonic #2711
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
`atol_safe() -> Optional[Int]` and make `String.split()` default to whitespace and not raise & fix behavior to be pythonic
Maybe this should be split into two separate PRs so that they can be reviewed independently? I feel like you address multiple things at once here.
ok I'll get the
`atol_safe() -> Optional[Int]` and make `String.split()` default to whitespace and not raise & fix behavior to be pythonic
`String.split()` default to whitespace and not raise & fix behavior to be pythonic
Thanks for working on this! I have a couple asks before I do a more thorough review, but I like the direction!
@laszlokindrat added your suggestions, though I just put the `isspace_python` logic inside the `String.isspace` method, since it's basically just checking for a String inside a List. I changed the tests to target that method and added another one for cases that shouldn't be a space.
stdlib/src/builtin/string.mojo
Outdated
        String(List[UInt8](0x20, 0x5C, 0x75, 0x32, 0x30, 0x32, 0x39)),
    )

fn split(self, sep: String = "", maxsplit: Int = -1) -> List[String]:
The behavior of returning an empty string for `sep=""` is kind of strange and actually breaks an invariant: `sep.join(some_str.split(sep)) == some_str` should be true for any `sep` and any `some_str`. Python simply rejects empty separators (with `ValueError: empty separator`), and we could certainly do this. The (IMO better) alternative is to return a list of the characters. WDYT?
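For reference, here is a minimal sketch of the Python behaviour this comment describes. It only illustrates CPython's `str.split`, not the Mojo implementation under review; the sample string and variable names are made up for illustration.

```python
# Python reference behaviour only (not the Mojo implementation).
some_str = "a,b,,c"   # hypothetical sample input
sep = ","

# For a non-empty separator, joining the pieces restores the input.
parts = some_str.split(sep)
round_trip = sep.join(parts)

# Python rejects an empty separator outright.
try:
    some_str.split("")
    empty_sep_error = None
except ValueError as err:
    empty_sep_error = str(err)

print(parts, round_trip == some_str, empty_sep_error)
```

The round trip holds for any non-empty separator in Python, which is why an empty separator is the one case that needs an explicit decision.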
You can check this by running `" ".split(" ")` and `" ".split()` in Python.
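A quick sketch of that check in Python. The whitespace inside the quoted strings may have been collapsed when this page was captured, so the three-space input below is an assumption chosen to make the difference visible.

```python
# Python reference behaviour; the three-space input is an assumed example.
spaces = " " * 3

# An explicit separator treats every single space as a boundary,
# so three spaces yield four empty pieces.
explicit = spaces.split(" ")

# With no separator, runs of whitespace collapse and empty pieces are dropped.
default = spaces.split()

print(explicit, default)
```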
to return a list of the characters. WDYT?
I think it makes a lot of sense, though it adds yet another branch to the code that could be parametrized... but yeah, it's inevitable. I hadn't actually tested `.split("")` with Python, so thanks for the save.
`.split("")` shouldn't be possible IMO, since there's an infinite number of empty strings between every symbol. If you want to get a list of the characters of a string, I would opt for `list(some_string)`.
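As a small illustration of the `list(some_string)` route in Python (the sample string is hypothetical): Python iterates a `str` by code point, so a multi-byte character stays in one piece.

```python
# Python reference behaviour; the sample string is a made-up example.
some_string = "héllo"

# list() iterates by code point, not by byte.
chars = list(some_string)

# The UTF-8 encoding is one byte longer because of the accented character.
n_chars = len(chars)
n_bytes = len(some_string.encode("utf-8"))

print(chars, n_chars, n_bytes)
```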
I understand, but how would we deal with it then? IMO we shouldn't raise in a func that is just splitting; the other options are to return an empty list, or a list with the whole string as the first and only item. And I do like the idea of returning chars, since it would also provide an API for splitting a string into its chars, instead of iterating and building them or using `List[String](String.as_bytes())` (I'm not sure this implicit casting even works or if the user would have to build a String for each item in the byte List).
I'm on board with returning a list of the characters from `.split("")`. I'm not typically a fan of deviating from Python behaviour, but it's not really a useful safeguard to begin with and it's not worth "colouring" this method with `raises` to avoid an altered behaviour from exactly one possible input.
Splitting on an empty string is ambiguous and conceptually wrong. If the goal is to be a superset of Python then I don't see why we should change Python's behaviour (unless Python changes it too).
As a fix to avoid `raises` when splitting, we could add an overload like:
fn split[sep: StringRef](maxsplit: Int = -1):
    constrained[sep != "", "empty separator"]()
I gave this some thought, and I think I mostly agree with @siitron and @bgreni: splitting on an empty string is ambiguous, and we shouldn't risk unexpected behavior by allowing it (even when it might feel natural to some of us). Let's keep it simple, and make this function raising. If this turns out to be a perf problem, there are ways we can deal with it, but our default should be following Python and providing safe, unsurprising APIs.
stdlib/test/builtin/test_string.mojo
Outdated
    var d = String("hello world").split("")
    assert_true(d[0] == "hello", d[1] == "world")
    d = String("hello \t\n\n\v\fworld").split("\n")
    assert_true(d[0] == "hello \t" and d[1] == "" and d[2] == "\v\fworld")
While explicit tests are good, a great way to improve coverage for this would be to check the invariant `sep.join(some_str.split(sep)) == some_str`.
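One way such an invariant check could look, sketched in Python rather than Mojo. The helper name `check_split_join_invariant` and the input lists are hypothetical; a Mojo test would need the equivalent `join` and `split` calls.

```python
import itertools


def check_split_join_invariant(strings, separators):
    """Return the (string, sep) pairs for which the round trip fails."""
    failures = []
    for s, sep in itertools.product(strings, separators):
        if sep.join(s.split(sep)) != s:
            failures.append((s, sep))
    return failures


# Hypothetical inputs, partly borrowed from the test strings in this PR.
strings = ["", "hello world", "hello \t\n\n\v\fworld", "abababaaba", "   "]
separators = [" ", "\n", "aba", "ab"]  # non-empty separators only

failures = check_split_join_invariant(strings, separators)
print(failures)
```

Returning the failing pairs instead of asserting inside the loop makes it easier to see every offending combination in a single run.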
reverted in the
@laszlokindrat any idea why this keeps timing out with no error?
Not sure. I can look at this next week, but in the interest of keeping things moving, could you split out
1514525 to b569ac1
e78c9fe to b78fb4b
Signed-off-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com>
@martinvuyk Can you confirm that the timeout issue is resolved?
It's solved. Using `self.endswith(sep)` removed the issue. I think it might be a problem with
fn __getitem__[IndexerType: Indexer](self, i: IndexerType) -> String:
    var idx = index(i)
    if idx < 0:
        return self.__getitem__(len(self) + idx)
    debug_assert(0 <= idx < len(self), "index must be in range")
    var buf = Self._buffer_type(capacity=1)
    buf.append(self._buffer[idx])
    buf.append(0)
    return String(buf^)
Probably
#2793 has landed internally and will be included in the next nightly. Could you please rebase after that and use
@laszlokindrat I just used `_isspace` temporarily since it's iterating byte by byte. I think we first need to have a `_StringIter` that returns a `StringRef` according to the full UTF-8 multi-byte length. I have a few ideas for a PR on that, but can we land this in the meantime?
stdlib/test/builtin/test_string.mojo
Outdated
    d = String("abababaaba").split("aba")
    assert_true(d[0] == "" and d[1] == "b" and d[2] == "" and d[3] == "")

    # separator = "" returns all char split
Does this currently fail? If it's a known limitation, please add a TODO, otherwise you can uncomment or remove it.
    assert_true(len(String(" ").split(" ")) == 4)

    d = String("abababaaba").split("aba")
    assert_true(d[0] == "" and d[1] == "b" and d[2] == "" and d[3] == "")
Probably should also check the size of `d` as well, right? Otherwise, what if `d` has a fifth element?
        _ = String("").split() # []
        # Splitting a string with leading, trailing, and middle whitespaces
        _ = String(" hello world ").split() # ["hello", "world"]
        # Splitting adjacent universal newlines: # TODO
Please move this TODO out of the docstring and into the body.
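For context, the docstring examples quoted above follow Python's no-argument `split()`. A minimal Python sketch of those cases, plus the mixed-whitespace string from the earlier test, is below; the exact spacing of the padded input is an assumption, since the page capture may have collapsed whitespace.

```python
# Python reference behaviour; input spacing is assumed for illustration.
empty = "".split()
padded = "  hello   world  ".split()

# \t, \n, \v and \f all count as whitespace for the no-argument form.
mixed = "hello \t\n\n\v\fworld".split()

# With an explicit "\n" separator, adjacent newlines yield an empty piece.
newline_only = "hello \t\n\n\v\fworld".split("\n")

print(empty, padded, mixed, newline_only)
```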
Makes sense, thanks for pushing this forward!
Very nice, thanks! Just a couple more small things, but looks great otherwise! Could you please also update the PR description and add this to the changelog?
`String.split()` default to whitespace and not raise & fix behavior to be pythonic
`String.split()` default to whitespace & fix behavior to be pythonic
see #2868
Just did, the markdownlint keeps failing.
Co-authored-by: Laszlo Kindrat <laszlokindrat@gmail.com> Signed-off-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com>
Looks really good, can you address the remaining two comments, or should I do it when I bring it in?
!sync
✅🟣 This contribution has been merged 🟣✅ Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours. We use Copybara to merge external contributions.
Landed in 10b1ee3! Thank you for your contribution 🎉
… behavior to be pythonic (#40714)
[External] [stdlib] Make `String.split()` default to whitespace & fix behavior to be pythonic
Closes #2686
`String.split()` now defaults to whitespace and has pythonic behavior in that it removes all adjacent whitespaces by default.
ORIGINAL_AUTHOR=martinvuyk <110240700+martinvuyk@users.noreply.github.com>
PUBLIC_PR_LINK=#2711
---------
Co-authored-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com>
Closes #2711
MODULAR_ORIG_COMMIT_REV_ID: b6a05e97f8de09a2c77b272b6cd8b96da3c5c782
Just a heads up: I also found an edge case that doesn't work correctly: The output is wrong when
Closes #2686
`String.split()` now defaults to whitespace and has pythonic behavior in that it removes all adjacent whitespaces by default.