Stop parsing on first syntax error. #20168

demerphq · 2022-08-26T17:19:25Z

We try to keep parsing after many types of errors, up to a (current)
maximum of 10 errors. Continuing after a semantic error (like
undeclared variables) can be helpful, for instance showing a set of
common errors, but continuing after a syntax error isn't helpful
most of the time as the internal state of the parser can get confused
and is not reliably restored in between attempts. This can sometimes
produce completely bizarre errors which just obscure the true error,
and has resulted in security tickets being filed in the past.

This patch makes the parser stop after the first syntax error, while
preserving the current behavior for other errors. An error is considered
a syntax error if the error message from our internals is the literal
text "syntax error". This may not catch every true syntax error;
we can iterate on that in the future.
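The policy described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code from the patch: `parse_state`, `is_syntax_error`, `record_error` and `MAX_QUEUED_ERRORS` are hypothetical names, and the exact string comparison is one reading of "the literal text" in the description.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical names throughout; this is not perl's actual parser code. */
#define MAX_QUEUED_ERRORS 10   /* the "(current) maximum of 10 errors" */

typedef struct {
    int  error_count;   /* errors recorded so far */
    bool aborted;       /* set once parsing should stop */
} parse_state;

/* Assumption: an error counts as a syntax error only when the internal
 * message is exactly the literal text "syntax error". */
static bool is_syntax_error(const char *msg) {
    return strcmp(msg, "syntax error") == 0;
}

/* Record one error; return true if the parser may keep going.
 * Syntax errors stop the parse at once, other errors only once
 * the error limit is reached. */
static bool record_error(parse_state *st, const char *msg) {
    st->error_count++;
    if (is_syntax_error(msg) || st->error_count >= MAX_QUEUED_ERRORS)
        st->aborted = true;
    return !st->aborted;
}
```

With this sketch a semantic error such as an undeclared variable lets the parse continue, while the first "syntax error" ends it regardless of how many errors came before.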

This fixes the segfaults reported in Issues #17397 and #16944, and
likely fixes other "segfault due to compiler continuation after syntax
error" bugs that we have on record, which have been a recurring issue
over the years.

The PR also includes a fix to another segfault/assert (Issue #16057)
related to prototypes on BEGIN blocks, which is in this PR because it
originally looked related to the stop on first error problem, and given it
/is/ related to "stopping segfaults during compilation" it seems reasonable
to save some work and keep it in this PR.

Sorry for the weird wrapping of this ticket.

pp_ctl.c

t/comp/retainedlines.t

khwilliamson · 2022-08-28T18:04:59Z

I'm somewhat leery of this.

I wrote a Lisp interpreter in Snobol for a school assignment. Snobol stops at the first error. This was using punch cards and the turnaround time was 8-12 hours during the day, dropping to .5 hr at 3am. It was awful. I have hated compilers that don't try to recover ever since.

I can see giving up the current section of code if there are several errors in a few adjacent lines. But why not then skip ahead some looking for a semi colon immediately followed by a new line, and continue trying from there?

demerphq · 2022-08-29T12:22:41Z

Hey @khwilliamson can I ask you to try it before you pass judgement? My experience is that the storm of error messages from perl getting confused is really unhelpful, and results in old-timers teaching newcomers to "go through it and look for the lowest line number" and such things. I have been using it for some hacking and I have been quite pleased with it. At first it was a little jarring: I'm used to having to sift through a heap of ridiculous and meaningless errors to find the one that is relevant, and with the patch it "throws your eyeballs" a bit not having all that crap there. But I got used to it quickly, and when I go back to the old rules it throws me the other way now. :-)

If you have thoughts on how to safely restart parsing at a semicolon then I think you could do a follow-up patch, but given the vagaries of parsing perl IMO that might not be quite as useful as you think. Consider code like: if if (/;/) { ... }. It seems like something that is pretty darn hard to do well in perl, so we shouldn't try, and should instead stop giving users bogus errors.
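To make the difficulty concrete, here is a minimal sketch of the "skip ahead to a semicolon followed by a newline" heuristic. It is an illustration in the same spirit as the example above, not code from perl: `naive_resync` and `sample` are hypothetical, and the sample uses a multi-line string rather than a regex because the heuristic under discussion keys on a semicolon at end of line.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the offset just past the first ";\n" at or
 * after `from`, or -1 if there is none. It knows nothing about Perl
 * quoting, so it cannot tell code from string or regex contents. */
static long naive_resync(const char *src, size_t from) {
    const char *hit = strstr(src + from, ";\n");
    return hit ? (long)(hit - src) + 2 : -1;
}

/* Sample Perl source: a stray ')' (the syntax error, at offset 10),
 * followed by a multi-line string whose first line ends in ';'. */
static const char sample[] =
    "my $msg = ) \"first;\n"
    "second\";\n"
    "print $msg;\n";
```

Calling `naive_resync(sample, 10)` returns an offset that points at `second";`, which is in the middle of the string literal. A parser restarted there would treat the closing quote as the start of a new string and go on to report errors that have nothing to do with the stray parenthesis.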

Also, I am happy to make this a configurable option with whatever default we want. If people like the storm of hallucinatory errors that perl produces from common syntax errors then they are welcome to build with them, so long as I am welcome to build without them. :-)

demerphq · 2022-08-29T12:24:10Z

Also note that this patch includes a revert of a patch from #16300 which caused breakage with Module::Install. We need to decide what to do about that.

demerphq · 2022-08-29T14:02:04Z

I have added a workaround patch for the issue in Module::Install::DSL. We convert INIT blocks from that namespace to be BEGIN blocks. I thought about the added restriction of "INIT blocks in an eval", but it didn't seem necessary. With that I think in theory this PR should be "ok" to go and not cause havoc in the CPAN river.

This converts INIT {} blocks from the Module::Install::DSL namespace into BEGIN blocks. This works around the bug reported in GH Issue #16300. (Hopefully, not fully tested yet.) Which in turn should allow us to close the bug in #2754. See also PR: #20168 and Issue: #20161 both of which are blocked by this.

hvds · 2022-08-29T14:56:45Z

I'm somewhat leery of this.
[...]
I can see giving up the current section of code if there are several errors in a few adjacent lines. But why not then skip ahead some looking for a semi colon immediately followed by a new line, and continue trying from there?

@khw note that "stop on first error" has for quite a while been advocated by @iabyn, if I remember correctly - we're not good at ensuring everything is restored to a valid state after an error, and the attempt to continue after errors has been the source of numerous security issues in the past. (That said, I think they were all rejected as security issues, because they needed code from an untrusted source to exploit - but they also cost us a lot of effort to analyse.)

For me it is second nature to use the strategy @demerphq mentions - to scan a screed of garbage on the screen for the lowest-mentioned line number - but I'm always aware when doing so that a) I'm making up for perl's failings in doing so, and b) that someone new to perl probably won't know about that strategy.

demerphq · 2022-08-29T15:00:11Z

@khw note that "stop on first error" has for quite a while been advocated by @iabyn, if I remember correctly

Thanks @hvds - he actually pointed me at the code that started me down this rabbit hole. This task has led to a bunch of bugs and old issues being identified and fixed.

demerphq · 2022-08-29T15:10:23Z

FWIW #20181 and #20182 need to be applied in that order, and then the code from this can be applied on top. I'll rebase as things get merged.

demerphq · 2022-09-05T07:25:35Z

@leonerd you expressed some interest in this, it is now out of draft and ready for merge.

toke.c

demerphq · 2022-09-06T14:55:48Z

squashed them down to a single patch now.

This fixes Issue #16057, where prototypes on BEGIN blocks cause segfaults. This patch warns about the use of either.

leonerd

Overall this seems a reasonable direction. I'm not hugely a fan of special-casing the exception message "syntax error", but I gather this is just a first step in the direction of having a better mechanism, such as a dedicated abort-the-parse function.

demerphq · 2022-09-09T16:48:39Z

but I gather this is just a first-step in the direction of having a better mechanism - such as a dedicated abort-the-parse function.

Yes, indeed, that will come once I start getting some feedback. Please let me know if you encounter something that should stop the parse but doesn't.

I will merge this!

demerphq · 2022-10-11T08:35:36Z

On Sun, 28 Aug 2022 at 20:05, Karl Williamson ***@***.***> wrote:

I'm somewhat leery of this. I wrote a Lisp interpreter in Snobol for a school assignment. Snobol stops at the first error. This was using punch cards and the turnaround time was 8-12 hours during the day, dropping to .5 hr at 3am. It was awful. I have hated compilers that don't try to recover ever since.

I can see giving up the current section of code if there are several errors in a few adjacent lines. But why not then skip ahead some looking for a semi colon immediately followed by a new line, and continue trying from there?

I think you misunderstand the patch. It doesn't stop on first error. It stops on first *syntax* error, which is where the compiler gets totally confused. For other errors the old behavior is preserved.

Personally I *strongly* disagree with continuing after a syntax error. The storm of bogus errors from restarting the parse buries the true error in hallucinatory rubbish. I don't think trying to find the first semicolon is reliable. Perl syntax is too messed up.

At my old job I had to do a lot of hand holding of people new to perl and one of the things they complained about was the rubbish errors. It's a bit embarrassing when you have to train people to ignore most of the messages and root through them to find the gem hidden in a pile of dung.

Why don't you try the branch and see what you think? So far I have found it quite nice. When I mess up and make a syntax error there is one error message, and it's always correct. I don't have to stare at 9 other errors which are figments of perl's imagination. In fact I found it a little confusing at first, as I am so used to ignoring most of the error messages Perl produces that when it gave me a single error that was correct it threw me a bit (in a good way): "Where did all the garbage go?".

I think until you have tried the branch you should reserve judgement. I don't think extrapolation from Snobol is a reasonable thing to do. Perl is a very different language.

Anyway, if you have ideas on how to sanely restart the parser somewhere else then go for it.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

demerphq marked this pull request as draft August 26, 2022 17:20