split /\A/ works like /^/m, matches embedded newlines #14086
Comments
From @mauke

perldoc perlrebackslash:

    \A  "\A" only matches at the beginning of the string.

perldoc -f split:

    Empty leading fields are produced when there are positive-width matches at the beginning of the string; a zero-width match at the beginning of the string does not produce an empty field.

Therefore split /\A/ should return the input string as is. \A can only match once (at offset 0), which (logically speaking) should turn "foo" into ("", "foo"), but because of the special case in split of not producing empty leading fields for zero-width matches at the beginning, we just get "foo" again.

What actually happens:

    $ perl -wE 'say "[$_]" for split /\A/, "foo\nbar\nbaz"'

Apparently split thinks /\A/ is the same as /^/m, matching after every embedded newline in the input string. I think this is a bug in split.

The test above was with: ... but an IRC bot running 5.20.0 produces the same results so I assume it's still present in 5.20. |
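A minimal sketch for checking the report on whichever perl is at hand. It uses only core Perl, and the variable names are illustrative. The /^/ result follows the documented special case in split; the /\A/ result depends on the perl version, so only the field counts are printed and no expected output is claimed for it.

```perl
use strict;
use warnings;

my $text = "foo\nbar\nbaz";

# Documented special case: split /^/ is treated like split /^/m,
# so this yields one field per line, with the newlines kept.
my @by_caret = split /^/, $text;

# \A can only match at offset 0. Per the report above, affected
# perls also split after every embedded newline here.
my @by_A = split /\A/, $text;

printf "perl %s\n", $];
printf "split /^/  gives %d field(s)\n", scalar @by_caret;
printf "split /\\A/ gives %d field(s)\n", scalar @by_A;
```

If the second count equals the first, the perl in use shows the behaviour described in this ticket.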
From @demerphq
On 11 September 2014 14:28, l.mai@web.de <perlbug-followup@perl.org> wrote:
Yes this is still in blead. I was party to breaking this in 7bd1e61 in This code does NOT use the regex engine for anything other parsing the Part of the problem is that way way way back in the history of Perl, To explain more /^/m produces a MBOL op, "multi-beginning-of-line", and /^/ And split will and has always treated both the same, as an MBOL, when the Later on in history /\A/ was added as a synonym for /^/, and produces an When I upgraded the logic in 7bd1e61 to not look at the pattern So now we have a problem. There is LOADS of code out there that assumes that split /^/, $string; is the correct way to split a string into lines. However it was only true because of the optimisation in split // did not So we are now in a jam. I can do some kind of workaround that makes /\A/ not trigger this The naive obvious fix would be to document that split // operates with the For instance split /^x/ does not act as though there is an implicit /m flag perl -le'my $str="foo\nxbar\nxbaz\n"; print ">>$_<<" for split /^x/, $str'
Compare with just plain /^/: perl -le'my $str="foo\nxbar\nxbaz\n"; print ">>$_<<" for split /^/m, $str'
IOW, split /^/ and split /^/m do the same thing, which they definitely Given all this I really can't decide what to do. I *could* change the code A simple workaround btw would be to write: /\A|\A/. But that would suck. I really don't know what to do here. Basically the root of this bug was Another alternative would be to introduce a multiline version of \A, say \L Yves -- |
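The asymmetry described in this message can be checked with a short script. This is a sketch using the same test string as the one-liners above; the variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "foo\nxbar\nxbaz\n";

# Without /m, ^ only matches at the start of the string, and the
# string does not start with "x": no split happens.
my @caret_x = split /^x/, $str;

# With /m, ^x matches each line-initial "x" and consumes it.
my @caret_x_m = split /^x/m, $str;

# A bare /^/ is special-cased by split and behaves like /^/m.
my @caret   = split /^/,  $str;
my @caret_m = split /^/m, $str;

printf "/^x/ : %d field(s)\n", scalar @caret_x;
printf "/^x/m: %d field(s)\n", scalar @caret_x_m;
printf "/^/  : %d field(s)\n", scalar @caret;
printf "/^/m : %d field(s)\n", scalar @caret_m;
```

So /^x/ and /^x/m differ, while /^/ and /^/m give the same three lines, which is the inconsistency under discussion.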
The RT System itself - Status changed from 'new' to 'open' |
From @Abigail
On Thu, Sep 11, 2014 at 04:29:02PM +0200, demerphq wrote:
I first thought having "split /^/" mean the same as "split /^/m" was A PATTERN of "/^/" is treated as if it were "/^/m", since it The third edition of "Programming Perl" documents this behaviour as well -- But looking at some old commits, this may actually not be the case. Commit 2cdd06f (Aug 4, And then commit 1ec9456 I haven't checked the mail archives to see whether that was any discussion.
As I said, /^/ implying a /m has already been documented for 14 years.
My suggestion: leave it as is, and document it. How useful is it to be split /\A/ => $foo; when you could have written $foo; instead? Fixing it to do the "right thing" seems like a whole lot of work for little Abigail |
From @mauke
On Thu 11 Sep 2014, 07:29:35, demerphq wrote:
My first idea would be to revert the opcode checking and go back to the pattern source; i.e. do the equivalent of $src eq "^". That would keep backwards compatibility with existing code and the letter of the documentation ("If PATTERN is /^/, ..."). It would also make \A "work" (i.e. do nothing) again. Then I'd add a deprecation note to the documentation; something like: If PATTERN is /^/, then it is treated as if it used the multiline That leaves the door open to actual deprecation warnings if we decide to remove this feature in a future release. |
From @demerphq
On 11 September 2014 18:27, l.mai@web.de via RT <perlbug-followup@perl.org>
FWIW, I am really against using the raw pattern. For instance I expect: split /(?:)^/ to be the same as split /^(?:)/ to be the same as split /^/ to be the same as split /#this splits lines out without capturing the line break I fixed a bunch of issues like this when I redid this code. I am really Yves |
From @ikegami
split already says:

    If PATTERN is /^/, then it is treated as if it used the multiline modifier

How about

    If PATTERN is /^/ or /\A/, then it is treated as if it used the multiline

On Thu, Sep 11, 2014 at 10:29 AM, demerphq <demerphq@gmail.com> wrote:
|
From @Abigail
On Thu, Sep 11, 2014 at 09:27:36AM -0700, l.mai@web.de via RT wrote:
As can be seen in my other post, we did this back in 1999. Then quickly Considering the uselessness of splitting on just the beginning of the string Abigail |
From @demerphq
On 11 September 2014 18:59, Eric Brine <ikegami@adaelis.com> wrote:
I really really really hate the idea that prepending (?:) to the pattern foo((),$thing); being different from foo($thing); And like I said we had a bunch of bug reports along those lines. The real issue here is that the /^/ implies /m in split thing was not This is yet another example of how "ooh neat" features, especially in the Yves |
From @rjbs
* demerphq <demerphq@gmail.com> [2014-09-11T10:29:02]
First off: thanks for this post, which was interesting and useful. It seems to me that the above is a subset of the below:
That is: you need to distinguish ^ from \A, whether or not you add \L, for such -- |
From @demerphq
On 11 September 2014 19:54, Ricardo Signes <perl.p5p@rjbs.manxome.org>
No problem. Especially as I was indirectly responsible for part of the mess.
Er, sort of. What you describe is option 2 below. Thinking about this more I think there are two reasonable options: 1. document that all patterns to split are compiled under /m by default. At 2. use the flag field of the regop to store whether the SBOL comes from \A Personally the more I think about this the more I think that 1 is better, even Consider what option 1 would result in: split /^/, $string would behave the same as far as the ^ operator goes. And it would mean that split /^/, $string would behave similarly (that is match all the beginning or end of lines in And it would fix the problem with /\A/ behaving like /^/m (which is And when I think about what it would break I struggle to think of Also the other nice thing about option one is it doesn't need an \L split /(?-m:^)/, $string would disable the default /m flag. The reason I proposed the \L In fact the process of writing this email I have become sufficiently Yves -- |
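The (?-m:^) opt-out mentioned here relies on inline modifiers, which already work in ordinary matching. The sketch below only shows that existing mechanism with m//g; how split would treat it under option 1 is a proposal in this thread and is not demonstrated. Variable names are illustrative.

```perl
use strict;
use warnings;

my $string = "foo\nbar\nbaz";

# Under /m, ^ matches at the start of every line: 3 matches here.
my $multi = () = $string =~ /^/mg;

# (?-m:...) switches /m off for the enclosed part, so ^ only
# matches at the start of the string again: 1 match.
my $single = () = $string =~ /(?-m:^)/mg;

print "with /m: $multi, with (?-m:^): $single\n";
```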
From @ap
* demerphq <demerphq@gmail.com> [2014-09-11 20:25]:
Would writing it `split qr/^/, $string` also work? (I would hope yes.) Regards, |
From @demerphq
On 11 September 2014 22:45, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
No, when Karl changed qr/^/ to reduce down to (?^:^) he changed the IOW, (?^:^) means "match /^/ under whatever rules the pattern is compiled In the older perls it would turn into (?-msix:^) and then yes I think it Win-some, lose-some. cheers, -- |
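The stringification mentioned here can be inspected directly. This is a small sketch; it assumes a script with no `use locale`, `use utf8` or `unicode_strings` feature in effect, since those add extra flag letters to the output.

```perl
use strict;
use warnings;

my $plain = qr/^/;
my $multi = qr/^/m;

# On perl 5.14 and later these print with the (?^...) prefix;
# older perls spell out the disabled flags instead.
print "qr/^/  stringifies as $plain\n";
print "qr/^/m stringifies as $multi\n";
```

On a 5.14 or later perl this should show `(?^:^)` for the first pattern and `(?^m:^)` for the second.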
From @ap
* demerphq <demerphq@gmail.com> [2014-09-11 22:55]:
It does. I am assuming that this special compilation context applies at the time If so – is that doable with reasonable effort? It would go some ways toward regularising split’s behaviour further. Regards, |
From @Abigail
On Thu, Sep 11, 2014 at 08:20:28PM +0200, demerphq wrote:
To do that, you would first have to change the behavior of split, as $ perl -E 'say "[$_]" for split /^a/m => "foo\nabar\nabaz"' $ perl -E 'say "[$_]" for split /^a/ => "foo\nabar\nabaz"' Abigail |
From @demerphq
On 12 September 2014 00:58, Abigail <abigail@abigail.be> wrote:
Yes, I have said exactly the same thing multiple times in this thread. And to me it's actually exactly the reason we *should* do this. I consider As I said elsewhere in this thread, why should split /$/ not have the same $ perl -le'my $str="foo\nxbar\nxbaz\n"; print ">>$_<<" for split /^/, $str'
<< |
From @Abigail
On Fri, Sep 12, 2014 at 01:09:27AM +0200, demerphq wrote:
Because no one uses /$/m to split a multiline string into individual lines, I'm still figuring out what problem needs solving. Is it really a problem split /\A/, "multiline string"; splits as /^/m? Is splitting on the beginning of the string, resulting in a Can't we just document this exception? Abigail |
From @demerphq
On 12 September 2014 01:47, Abigail <abigail@abigail.be> wrote:
I don't like the exceptions here, and I find the inconsistency to be very My intent is to make split default to /m enabled which I believe is the Yves -- |
From @Abigail
On Fri, Sep 12, 2014 at 05:19:49AM +0200, demerphq wrote:
But split is already full of exceptions: * Any pattern matching the empty string is special cased.
Really? For what purpose? You'd potentially break code, and it won't Abigail |
From @Abigail
On Fri, Sep 12, 2014 at 10:17:03AM +0200, Abigail wrote:
Having said that, the only code affected by such a change is a split Abigail |
From @demerphq
On 12 September 2014 13:07, Abigail <abigail@abigail.be> wrote:
Indeed. Exactly the same conclusion I came to as well. Yves -- |
From @demerphq
On 12 September 2014 10:17, Abigail <abigail@abigail.be> wrote:
I don't know if I agree here. Part of this behavior is the default perl -le'my $str="abcdef"; while($str=~//g) { print substr(
Perhaps ENOTENOUGHCOFFEE, but can you expand on that, I don't recall what
I consider the inability to simulate " " using a qr// or // a bug, and it
Consistency in behaviour of things like split /^/ and split /^x/ at the
Lets find out if that is FUD or Fact. As far as I can tell the only code
Yes it does. /^/ => SBOL The equivalence of /^/ and /^/m is afforded by the following code: else if (PL_regkind[fop] == BOL && nop == END) If we change the default of split to /m and that code is changed to: else if (fop == MBOL && nop == END) then split /^/, => MBOL which fixes the bug in this thread, and makes split's behaviour consistent Yves |
From @Abigail
On Fri, Sep 12, 2014 at 02:21:56PM +0200, demerphq wrote:
From the split doc entry: As a special case for "split", the empty pattern given in match So it's special cased to get to not mean the last successful match,
How do you propose to "fix" that? Both C<< split " " >> and C<< split / / >> Abigail |
From @ap
* Abigail <abigail@abigail.be> [2014-09-11 18:15]:
There are quite a few APIs that use regexps as a sort of DSL. It’s not * Abigail <abigail@abigail.be> [2014-09-12 10:20]:
Well yes, the entire point of this thread is the idea that split /\A/ * demerphq <demerphq@gmail.com> [2014-09-12 14:25]:
split //, "foobar" # yields qw( f o o b a r ) Normally an empty match reuses the last pattern but here it really means Regards, |
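The two meanings of an empty pattern discussed in the last few messages can be seen side by side. This is a sketch in core Perl; the variable names are illustrative.

```perl
use strict;
use warnings;

# In split, an empty pattern splits between every character.
my @chars = split //, "foobar";
print join(",", @chars), "\n";

# In a plain match, an empty pattern reuses the last successfully
# matched pattern (here /o/), so the outcome depends on earlier code.
"foo" =~ /o/ or die "setup match failed";
my $reused_hits  = ("boo" =~ //) ? 1 : 0;   # acts like /o/: matches
my $reused_fails = ("bar" =~ //) ? 1 : 0;   # acts like /o/: no match

print "boo: $reused_hits, bar: $reused_fails\n";
```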
From @Abigail
On Fri, Sep 12, 2014 at 02:57:01PM +0200, Aristotle Pagaltzis wrote:
Sure, a niche case, and one for which /\A/ isn't the only option.
Double special cased as in "not acting like the normal //, but acting as Abigail |
From @demerphq
On 12 September 2014 14:50, Abigail <abigail@abigail.be> wrote:
Oh that. Right. That isn't a special case in split, it's a special case in $ perl -le'"foo"=~/(.*)/ and print $1; print qr//' Also I thought we decided that that feature wasn't very useful and were
split qr/(*SPLIT_WHITE)/, $string is my working plan. I fixed one issue related to this, I think in 5.18, ./perl -Ilib -le'my $str=" "; my $foo="foo\n\n\nbar\n\n"; print ">$_<" for
But I think one should be able to do this with a qr// object as well. Basically (*SPLIT_WHITE) would be semantically equivalent to \s+ in a Yves -- |
From @demerphq
On 12 September 2014 15:52, Abigail <abigail@abigail.be> wrote:
FWIW, I am only moderately interested in fixing the split /\A/ behaviour in
Sorry to repeat a previous mail, but IMO it is not that split // is special if ($str= which I admit I am not sure how it would be used, but I am pretty sure Yves Yves |
From @Abigail
On Fri, Sep 12, 2014 at 04:16:13PM +0200, demerphq wrote:
Are you saying you want to change the meaning of split " ", "string"; and people should write split qr /(*SPLIT_WHITE)/, "string"; instead? That would not be very programmer friendly. Abigail |
From @demerphq
On 12 September 2014 16:27, Abigail <abigail@abigail.be> wrote:
No no. I mean that I think code like this: my $pat= qr/$user_pat/; my @things= split /$pat/, $input; should be capable of producing split white semantics. I have no intention of removing the split " ", $string semantics, and on $ ./perl -Ilib -le'print $]; my $pat=" "; my $foo="foo\n\n\nbar\n\n"; print
$ perl -le'print $]; my $pat=" "; my $foo="foo\n\n\nbar\n\n"; print ">$_<"
bar < Previously the *only* way to get split white behavior was to write So before that patch if you wanted to parametrically control the split, you my @things= $pat eq " " ? split " ", $input : split $pat, $input; you couldn't write this even: my @things= split $pat eq " " ? " " : $pat, $input To summarize, I would like to make it so you can use a qr// object to cheers, -- |
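The parametric workaround quoted in this message can be wrapped in a small helper. This is a sketch: `split_param` is a hypothetical name, not an existing function, and it dispatches to a literal " " so that it does not rely on how a given perl treats a variable holding a single space.

```perl
use strict;
use warnings;

# Hypothetical helper: use "split white" semantics when the caller
# passes a single space, otherwise split on the given pattern.
sub split_param {
    my ($pat, $input) = @_;
    return $pat eq " " ? split(" ", $input) : split($pat, $input);
}

my $input = "  foo bar  baz ";

# Leading whitespace is skipped and runs of whitespace collapse.
my @white = split_param(" ", $input);

# A real single-space pattern keeps the empty fields.
my @literal = split / /, $input;

my @commas = split_param(",", "a,b,c");

print scalar(@white), " vs ", scalar(@literal), " fields\n";
```

The difference in field counts is the "split white" special case that the proposed qr// construct is meant to make expressible.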
From @Abigail
On Fri, Sep 12, 2014 at 04:37:21PM +0200, demerphq wrote:
Excellent. I think adding a qr /(*SPLIT_WHITE)/, while keeping the existing behaviour I presume $str =~ s/(*SPLIT_WHITE)/.../; and $str =~ /(*SPLIT_WHITE)/; will be meaningless, just as $pat = qr /(*SPLIT_WHITE)/; Abigail |
From @demerphq
On 12 September 2014 16:45, Abigail <abigail@abigail.be> wrote:
Well, no, I think making them illegal in normal patterns would be nearly Yves Yves |
From @cpansprout
On Fri Sep 12 05:22:28 2014, demerphq wrote:
Omitting initial empty fields is more a feature of split than of the regexp engine. Making a special pattern that does that makes as much sense to me as qr//c. If we were to consider the // to be part of the split operator (and I generally do), then we could introduce a m// modifier that only applies in split (and is an error otherwise). -- Father Chrysostomos |
From @cpansprout
On Fri Sep 12 07:16:50 2014, demerphq wrote:
I use it.
But if we are going to generalise it, it would be useful to skip initial null fields with other separators, such as /,/, too. -- Father Chrysostomos |
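Until something like that exists, skipping initial null fields for an arbitrary separator has to be done by hand. A minimal sketch follows; `split_skip_leading` is a hypothetical helper written for illustration, not a split feature or anything proposed verbatim in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: split on the pattern, then drop leading
# empty fields manually. Empty fields in the middle are kept.
sub split_skip_leading {
    my ($re, $input) = @_;
    my @fields = split $re, $input;
    shift @fields while @fields && $fields[0] eq '';
    return @fields;
}

my @plain   = split /,/, ",,a,,b";
my @skipped = split_skip_leading(qr/,/, ",,a,,b");

printf "plain: %d field(s), skipped: %d field(s)\n",
    scalar @plain, scalar @skipped;
```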
From @cpansprout
On Fri Sep 12 07:37:43 2014, demerphq wrote:
To my mind, that just doesn’t add up. How is that much different from having a way to specify the second half of s/// with qr//? -- Father Chrysostomos |
From @cpansprout
On Fri Sep 12 07:57:02 2014, demerphq wrote:
Oh, and what would split /(,)(?(1)(*SPLIT_WHITE))/ do? I just can’t wrap my mind around this \s+-and-a-split-flag construct. Maybe what we want is qr//k, where the /k flag is ignored by m// and s///, but is taken by split to mean sKip initial null fields. But then what would split /foo${that_qr}bar/ do? -- Father Chrysostomos |
From @demerphq
On 12 September 2014 17:20, Father Chrysostomos via RT <
qr//c is obviously useless. On the other hand two *very* experienced regex
I consider that wrong. Split is a function which uses a pattern as an
I don't think a modifier is required, or even a particularly elegant Yves -- |
From @demerphq
On 12 September 2014 17:22, Father Chrysostomos via RT <
Yes, I think I have used it once or twice in my career. However the fact
Then we can create a pattern that does it. (*EAT_EMPTY) maybe. Yves |
From @demerphq
On 12 September 2014 17:24, Father Chrysostomos via RT <
Completely different. As different as jet-planes and penguins. Yves |
From @demerphq
On 12 September 2014 17:28, Father Chrysostomos via RT <
Not sure yet. Maybe nothing.
I couldn't possibly comment on your inability to wrap your mind around this.
/k is unavailable to us due to Regexp::Common. Although I retract an earlier comment, *maybe* a modifier is appropriate
Probably just revert to its "normal" regex behaviour. cheers, |
From @bulk88
I don't think this ticket is productive anymore. 20 posts in half a day between just 2, or maybe 3 people. -- |
From @cpansprout
On Fri Sep 12 08:33:15 2014, demerphq wrote:
It’s not that I do not see its utility. It just seems like too much of a special case, and I thought we were trying to get away from those. If it’s something that goes in a pattern, but affects the behaviour of one specific operator that acts on the pattern, then what is its scope? etc., etc. Now, if we want to add a thingy that goes in a pattern and flags the pattern to tell split not skip initial fields, then let’s make it general. E.g., your /(*SPLIT_WHITE)/ could be written /(*EAT_EMPTY)\s+/ or /(?q)\s+/ or /\s+/q (with q only because q is available). -- Father Chrysostomos |
From @hvds
I've completely lost track of the bifurcating paths of the discussion, Somewhere in there were references to making split patterns act as if Not sure that I've used /^/ or /\A/ much in split patterns, but I've I was never aware of an implied //m, so I've never knowingly used that. Hugo |
From @rjbs
* demerphq <demerphq@gmail.com> [2014-09-11T14:20:28]
I am sitting here making my "I am so nervous face," but I also can't really https://metacpan.org/source/ANDYA/TAP-Parser-0.54/t/040-parse.t#L630 Anyway, on one hand and in one way this is a big scary change that makes me This is not to say that I'm saying "do it!" But it sounds like you want to A lot of other stuff came up in this thread about /other/ changes to split and -- |
From @demerphq
On 11 September 2014 16:29, demerphq <demerphq@gmail.com> wrote:
I have fixed this with: 1645b83 Perl RT #122761 - split /\A/ Note that this does NOT make split // default to /m enabled. It simply Related to this I did some cleanup, freeing up bits, reducing object size, /me puts away the chainsaw. I still plan to try the "default to /m in split" and see what happens, so Yves -- |
From @cpansprout
On Tue Sep 16 20:13:15 2014, demerphq wrote:
Did the porting tests fail before you ran make regen to regenerate the table in perldebguts.pod? **Duck** -- Father Chrysostomos |
From @demerphq
On 17 September 2014 06:36, Father Chrysostomos via RT <
No. d3d47aa includes changes to regen/regcomp.pl and regcomp.sym which I did however leak some warning/diagnostics into the porting tests, which Smarty pants. :-) Yves -- |
From @mauke
On Tue Sep 16 20:13:15 2014, demerphq wrote:
Shouldn't this be done in a new ticket then? (Also, is this still happening?) |
From @jkeenan
On Fri Feb 26 10:53:38 2016, mauke- wrote:
I recommend closing this ticket and having anyone pursuing this open a new ticket. -- |
@mauke - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#122761 (status was 'resolved')
Searchable as RT122761$