Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Make String.split() default to whitespace & fix behavior to be pythonic #2711

Closed
wants to merge 16 commits into from

Conversation

martinvuyk
Copy link

@martinvuyk martinvuyk commented May 17, 2024

Closes #2686
String.split() now defaults to whitespace and has pythonic behavior in that it removes all adjacent whitespaces by default.

Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
@martinvuyk martinvuyk requested a review from a team as a code owner May 17, 2024 14:49
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
@martinvuyk martinvuyk changed the title [stdlib] Add atol_safe and make split default to whitespace and not raise [stdlib] Add atol_safe() -> Optional[Int] and make String.split() default to whitespace and not raise & fix behavior to be pythonic May 17, 2024
@siitron
Copy link

siitron commented May 19, 2024

Maybe this should be split into two separate PRs so that they can be reviewed independently? I feel like you address multiple things at once here.

@martinvuyk
Copy link
Author

martinvuyk commented May 19, 2024

Maybe this should be split into two separate PRs so that they can be reviewed independently? I feel like you address multiple things at once here.

ok I'll get the atol_safe() -> Optional[Int] into another PR. Though the isspace_python() and the isspace(self) and split(self) funcs I think belong in the same problematic category that they don't adress all of python's unicode separators and only do utf8. isspace(self: String) could even be made to use isspace_python() instead of self in String.WHITESPACE, but that may make performance worse since comparing to a StringRef is probably faster (does memcmp) than iterating over a List[String]

@martinvuyk martinvuyk changed the title [stdlib] Add atol_safe() -> Optional[Int] and make String.split() default to whitespace and not raise & fix behavior to be pythonic [stdlib] Make String.split() default to whitespace and not raise & fix behavior to be pythonic May 19, 2024
@laszlokindrat laszlokindrat self-assigned this May 20, 2024
Copy link
Contributor

@laszlokindrat laszlokindrat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! I have a couple asks before I do a more thorough review, but I like the direction!

stdlib/src/builtin/string.mojo Outdated Show resolved Hide resolved
stdlib/src/builtin/string.mojo Outdated Show resolved Hide resolved
stdlib/src/builtin/string.mojo Outdated Show resolved Hide resolved
stdlib/src/builtin/string.mojo Outdated Show resolved Hide resolved
stdlib/src/builtin/string.mojo Outdated Show resolved Hide resolved
@martinvuyk
Copy link
Author

@laszlokindrat added you suggestions though I just put the isspace_python inside the String.isspace method since it's basically just checking for a String inside a List. Changed the tests to target that method and added another one for cases that shouldn't be a space

String(List[UInt8](0x20, 0x5C, 0x75, 0x32, 0x30, 0x32, 0x39)),
)

fn split(self, sep: String = "", maxsplit: Int = -1) -> List[String]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior of returning an empty string for sep="" is kind of strange and actually breaks an invariant: sep.join(some_str.split(sep)) == some_str should be true for any sep and any some_str. Python simply rejects empty separators (with ValueError: empty separator), and we could certainly do this. The (IMO better) alternative is to return a list of the characters. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can check this by running " ".split(" ") and " ".split() in python.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to return a list of the characters. WDYT?

I think it makes a lot of sense, though it adds yet another branch to the code that could be parametrized... but yeah it's inevitable. I hadn't actually tested the .split("") with python so thanks for the save.

Copy link

@siitron siitron May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.split("") shouldn't be possible IMO, since there's an infinite number of empty strings between every symbol. If you want to get a list of the characters of a string I would opt for list(some_string)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand but how would we deal with it then? IMO we shouldn't raise in a func that is just splitting, the other option is to return an empty list, or a list with the whole string as the first and only item. And I do like the idea of returning chars since it would also provide an API for splitting a string into its chars, instead of iterating and building them or using List[String](String.as_bytes()) (I'm not sure this implicit casting even works or if the user would have to build a String for each item in the byte List)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on board with returning a list of the characters from .split(""). I'm not typically a fan of deviating from Python behaviour but it's not really a useful safeguard to begin with and it's not worth "colouring" this method with raises to avoid an altered behaviour from exactly one possible input.

Copy link

@siitron siitron May 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting on an empty string is ambiguous and conceptually wrong. If the goal is to be a superset of Python then I don't see why we should change Python's behaviour (unless Python changes it too).

As a fix to avoid raises when splitting we could add an overload like:

fn split[sep: StringRef](maxsplit: Int = -1):
    constrained[sep != "", "empty separator"]()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this some thought, and I think I mostly agree with @siitron and @bgreni: splitting on an empty string is ambiguous, and we shouldn't risk unexpected behavior by allowing it (even when it might feel natural to some of us). Let's keep it simple, and make this function raising. If this turns out to be a perf problem, there are ways we can deal with it, but our default should be following Python and providing safe, unsurprising APIs.

Comment on lines 623 to 532
var d = String("hello world").split("")
assert_true(d[0] == "hello", d[1] == "world")
d = String("hello \t\n\n\v\fworld").split("\n")
assert_true(d[0] == "hello \t" and d[1] == "" and d[2] == "\v\fworld")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While explicitly tests are good, a great way to improve coverage for this would be to check for the invariant sep.join(some_str.split(sep)) == some_str.

@martinvuyk
Copy link
Author

martinvuyk commented May 20, 2024

@laszlokindrat had to add an exception to the test. If the separator to test for sep.join(some_str.split(sep)) == some_str is " " it would eat up all adjacent whitespace-like chars

for i in range(len(text)):
    var sep = text[i]
    if sep == " ":
        continue
    var joined = String("")
    for it in text.split(sep):
        joined += sep + it[]
    assert_true(text == joined[len(sep) :])

reverted in the fix whitespace default commit

@martinvuyk
Copy link
Author

@laszlokindrat any idea why this keeps timing out with no error?

@laszlokindrat
Copy link
Contributor

@laszlokindrat any idea why this keeps timing out with no error?

Note sure. I can look at this next week, but in the interest of keeping things moving, could you split out isspace into a standalone PR? Small patches are easier to marshal through on our end as well.

@martinvuyk martinvuyk marked this pull request as draft May 23, 2024 04:16
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com>
@martinvuyk martinvuyk marked this pull request as ready for review May 24, 2024 21:46
@laszlokindrat
Copy link
Contributor

@martinvuyk Can you confirm that the timeout issue is resolved?

@martinvuyk
Copy link
Author

@martinvuyk Can you confirm that the timeout issue is resolved?

it's solved. Using self.endswith(sep) removed the issue. I think it might be a problem with fn __getitem__, btw. why is this recursive? should I open a PR fixing this?

fn __getitem__[IndexerType: Indexer](self, i: IndexerType) -> String:
    var idx = index(i)
    if idx < 0:
        return self.__getitem__(len(self) + idx)

    debug_assert(0 <= idx < len(self), "index must be in range")
    var buf = Self._buffer_type(capacity=1)
    buf.append(self._buffer[idx])
    buf.append(0)
    return String(buf^)

@laszlokindrat
Copy link
Contributor

btw. why is this recursive? should I open a PR fixing this?

fn __getitem__[IndexerType: Indexer](self, i: IndexerType) -> String:
    var idx = index(i)
    if idx < 0:
        return self.__getitem__(len(self) + idx)

    debug_assert(0 <= idx < len(self), "index must be in range")
    var buf = Self._buffer_type(capacity=1)
    buf.append(self._buffer[idx])
    buf.append(0)
    return String(buf^)

Probably debug_assert, since it uses print under the hood. This is a known issue, for now we need to work around it. Also, please use Int directly now instead of IndexerType (we now have implicit conversions).

@laszlokindrat
Copy link
Contributor

#2793 has landed internally and will be included in the next nightly. Could you please rebase after that and use isspace where appropriate?

@martinvuyk
Copy link
Author

@laszlokindrat I just used _isspace temporarily since it's iterating byte per byte. I think we first need to have a _StringIter that returns a StringRef according to the full utf8 multi byte length. I have a few ideas for a PR on that but can we land this meanwhile?

d = String("abababaaba").split("aba")
assert_true(d[0] == "" and d[1] == "b" and d[2] == "" and d[3] == "")

# separator = "" returns all char split
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this currently fail? If it's a known limitation, please add a TODO, otherwise you can uncomment or remove it.

assert_true(len(String(" ").split(" ")) == 4)

d = String("abababaaba").split("aba")
assert_true(d[0] == "" and d[1] == "b" and d[2] == "" and d[3] == "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should also check the size of d as well, right? Otherwise, what if d has a fifth element?

_ = String("").split() # []
# Splitting a string with leading, trailing, and middle whitespaces
_ = String(" hello world ").split() # ["hello", "world"]
# Splitting adjacent universal newlines: # TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this TODO out of the docstring and into the body.

@laszlokindrat
Copy link
Contributor

@laszlokindrat I just used _isspace temporarily since it's iterating byte per byte. I think we first need to have a _StringIter that returns a StringRef according to the full utf8 multi byte length. I have a few ideas for a PR on that but can we land this meanwhile?

Makes sense, thanks for pushing this forward!

Copy link
Contributor

@laszlokindrat laszlokindrat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thanks! Just a couple more small things, but looks great otherwise! Could you please also update the PR description and add this to the changelog?

Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
@martinvuyk martinvuyk changed the title [stdlib] Make String.split() default to whitespace and not raise & fix behavior to be pythonic [stdlib] Make String.split() default to whitespace & fix behavior to be pythonic May 28, 2024
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
@martinvuyk martinvuyk requested a review from a team as a code owner May 28, 2024 13:41
@martinvuyk
Copy link
Author

martinvuyk commented May 28, 2024

we first need to have a _StringIter

see #2868

Very nice, thanks! Just a couple more small things, but looks great otherwise! Could you please also update the PR description and add this to the changelog?

Just did, the markdownlint keeps failing.

docs/changelog.md Outdated Show resolved Hide resolved
Co-authored-by: Laszlo Kindrat <laszlokindrat@gmail.com>
Signed-off-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com>
@laszlokindrat
Copy link
Contributor

Looks really good, can you address the remaining two comments, or should I do it when I bring it in?

@laszlokindrat
Copy link
Contributor

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label May 28, 2024
@modularbot
Copy link
Collaborator

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions, click here to learn more.

@modularbot modularbot added the merged-internally Indicates that this pull request has been merged internally label May 28, 2024
@modularbot
Copy link
Collaborator

Landed in 10b1ee3! Thank you for your contribution 🎉

@modularbot modularbot added the merged-externally Merged externally in public mojo repo label May 29, 2024
modularbot added a commit that referenced this pull request May 29, 2024
… behavior to be pythonic (#40714)

[External] [stdlib] Make `String.split()` default to whitespace & fix
behavior to be pythonic

Closes #2686
`String.split()` now defaults to whitespace and has pythonic behavior in
that it removes all adjacent whitespaces by default.

ORIGINAL_AUTHOR=martinvuyk
<110240700+martinvuyk@users.noreply.github.com>
PUBLIC_PR_LINK=#2711

---------

Co-authored-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com>
Closes #2711
MODULAR_ORIG_COMMIT_REV_ID: b6a05e97f8de09a2c77b272b6cd8b96da3c5c782
@modularbot modularbot closed this May 29, 2024
@martinvuyk martinvuyk deleted the string-less-raises branch May 29, 2024 13:29
@siitron
Copy link

siitron commented May 29, 2024

Just a heads up: .split() still doesn't match Python's behaviour fully as it currently doesn't accept None as the sep. E.g. we can't do some_string.split(None) or some_string.split(None, maxsplit=1) or some_string.split(None, 2) in Mojo. (#2880)

I also found an edge case that doesn't work correctly: The output is wrong when maxsplit is less than the total number of separators found in the string and sep is the suffix of the string getting split (you always get an empty string added at the end of the list).
Example: String("abbaaaabbba").split("a", maxsplit=5) incorrectly results in ['', 'bb', '', '', '', 'bbba', '']
(#2879)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imported-internally Signals that a given pull request has been imported internally. merged-externally Merged externally in public mojo repo merged-internally Indicates that this pull request has been merged internally
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants