Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with syntax #35

Open
orolyn opened this issue Jul 24, 2018 · 23 comments
Open

Issues with syntax #35

orolyn opened this issue Jul 24, 2018 · 23 comments

Comments

@orolyn
Copy link

orolyn commented Jul 24, 2018

@morrisonlevi You mentioned a few issues with ambiguity regarding the syntax within statements.

The only one I am currently aware of is:

class A
{
}

define('B', 1);
define('C', 1);
define('D', 1);

function foo($x, $y)
{
    var_dump(func_get_args());
}

foo(new A<B,C>(D));

/*
PHP Notice:  Object of class A could not be converted to int in test.php on line 17
array(2) {
  [0]=>
  bool(false)
  [1]=>
  bool(false)
}
*/

Did you have any ideas on how to solve it?

@morrisonlevi
Copy link
Collaborator

The fact that the parentheses on constructors are optional is the culprit here: foo( new A < B, C > (D) ) . However, even if we fixed that we still have a conflict in the function-call variant of this: foo( f < B, C > (D) ).

I don't remember how C# solved this; need to find that paper again.

@orolyn
Copy link
Author

orolyn commented Jul 30, 2018

I read about it a while ago, C# handles it case by case, since A is a known class at compile time the parser can make the decision to call it a new instance of a generic class. Its also pure OO so it doesn't have to deal with constants.

Adding something innocuous to the syntax might be a compromise. What do you think of this?

new A<use B, C>(D);
foo(new A<use B, C>(D));

@jorissteyn
Copy link

I believe other languages don't have this problem because in java, c#, etc there is no ambiguity in using '<>' in expressions like this. And hack, which already implements generics, infers the type from the arguments so they don't have this issue to begin with.

Another approach would be to simply allow for the ambiguity, introducing a BC break for all valid expressions which are also valid new- or instanceof-statements using the '<>' type alias list.

This can not be done with the LR(1) parser currently used, but bison supports GLR parsers. It's a minor change in code really. The shift/reduce conflict generated by the '<>' construct will cause the parser to branch off, following both paths until one path fails, or allowing to specify which branch has precedence when both paths succeed.

For example, below statements would still work (no BC break):

class Test {
  public function __toString() {
    return 'A';
  }
}

var_dump(new Test<stdClass);
var_dump(new Test<stdClass, stdClass> 'A');
var_dump(new Test<stdClass && stdClass > 'A');
echo new Test<stdClass;
echo new Test<stdClass, stdClass> 'A';
echo new Test<stdClass && stdClass > 'A';
bool(true)
bool(true)
bool(true)
bool(true)
1111

And these statements would be BC breakages:

class Test<X, Y> {
  public function __toString() {
    return 'A';
  }
}

define('stdClass', 'B');

var_dump(new Test<stdClass, stdClass>(1));
echo new Test<stdClass, stdClass>(1);
object(Test)#1 (0) {
}
A

I believe the chance of BC breakage in the wild is very small, because the existing code would have to use an argument- or echo-list with a new/instanceof statement without parenthesis in one argument and be comparing it with the LT operator against a global constant, and a second argument comparing a global constant with GT against an expression enclosed in parenthesis. What are the odds?

Porting old code to support the new rules would be easy:

var_dump(new Test()<stdClass, stdClass>(1));

Or:

var_dump((new Test<stdClass), stdClass>(1));

Does this make any sense?

@mindplay-dk

This comment has been minimized.

@nikic
Copy link
Collaborator

nikic commented Nov 7, 2019

PHP does not have a comma operator.

@someniatko
Copy link

I think he meant comma as separator for function args in construction like this:

call (new Date < FOO, BAR > ('now') )

#42 (comment)

@mindplay-dk
Copy link
Collaborator

PHP does not have a comma operator.

Ha! 😆 Well, that explains why nobody's using it.

@mindplay-dk
Copy link
Collaborator

We may need to make this bit of syntax whitespace-sensitive then?

call( new X<Y, Z> ('now') );    // generics
call( new X < Y, Z > ('now') ); // comparison operators

Obviously that kind of goes against the nature of the language, but considering...

X<Y>             // looks like generics
X < Y ...        // looks like comparison

X<Y, Z> ...      // looks like generics
X < Y, Z > ...   // looks like comparison

X< Y, Z > ...    // probably looks more like generics?

That is X< definitely looks like the start of a generic type-expression, and X < looks like the start of an expression - so, if there's no space between the name and < operator, we can make the parser prioritize generics over expressions maybe?

I mean, there might be existing code that looks like this:

call(new Date<FOO, BAR>('now'));

But who would put extra parens around a string and why?

The worst-case realistic version of that probably is this:

call(new Date<FOO, BAR>'now');

If we prioritize a valid generic expression, there's no ambiguity here, right? I mean, there's a pretty serious visual ambiguity - but once we add generics, the parse tree where the start of this expression is interpreted as generics doesn't result in a valid parse tree for the whole expression, so it should be possible for this particular case, right?

If there's a concern about this breaking something in real-world code, I'd have to see a better, more realistic example - the extra parens around a string is so contrived, I can't imagine that will really affect anyone. (and if it were to affect one or two people around the world, it is really easy to fix.)

@nikic
Copy link
Collaborator

nikic commented Jan 7, 2020

We discussed our options here with @bwoebi yesterday. Some notes.

There are, I think, basically three issues with A<B> style generics:

  • The A<B<C>> lexer ambiguity on >>. @bwoebi suggested that we resolve this by replacing << with two <, parser it as such and then verify that for the normal << operator, both < are adjacent.
  • The A<B, C>::Foo parser ambiguity. This is again not a true ambiguity, but unlimited lookahead is required to resolve it.
  • The foo(new A<B, C>(1)) true ambiguity. We would want this to be resolved in favor of generics, as the alternative is rather unlikely.

Assuming we want to stick with A<B> style generics, we generally have two options to resolve ambiguities: The first is lexer lookahead, the other is switching to a GLR parser.

Lexer lookahead has the same issues as already discussed with arrow functions. Doing lookahead for the basic A<B, C> syntax would not be terrible, but we need to keep in mind that we already have A<B, C|D> as well, and in the future might have fun stuff like A<(function(): C) & D>. All of these would have to be specially pre-parsed inside the lexer.

The other alternative is to switch to a GLR parser. I have created a prototype for this. The GLR parser works by pursuing both possible parse trees and then discarding the one that generates an error. For the true ambiguity in the new case we additionally perform a merge conflict resolution in favor of having less function arguments, which conversely means having more generic arguments, thus effectively favoring generics.

As we don't have any potential for nesting, GLR should not exhibit any worst case (exponential) performance here (though might make parsing overall slower). However, the main disadvantage of GLR is that all PHP tooling will have to also switch to GLR or equivalent. This is going to be a pain for PHP-Parser at least, as we don't have a GLR parser implemented in PHP currently.

@nikic
Copy link
Collaborator

nikic commented Jan 7, 2020

Next to A<B>, I believe the main other contender for generics syntax is A{B}, as used by Julia for example. Assuming that we drop the alternative array access syntax (which we can without issues, thanks for the deprecation in 7.4), the A{B} syntax is completely unambiguous when it comes to uses in expressions. There are neither any real syntactical ambiguities, nor any parser lookahead ambiguities, nor any lexer ambiguities.

However, the {} syntax does have some (much less severe) lookahead issues when it comes to declarations, as of course {} is also used for code blocks.

//        v-- Need to distinguish between code block and placeholder here
class Foo{T}
{
    // ...
}

Doing this is possible, but comes at the cost of avoiding any zero-length productions between the class name and the start of the class statements. This effectively means that we need to spell out all the possible combinations of generic params, extends and interfaces (making for 6 rules per root class kind). A prototype for this is available as well.

@bwoebi's official position on using {} generics is "what about just no" :P

@natebrunette
Copy link
Collaborator

What about a look ahead for now and look and switching the parser in the future?

@bwoebi
Copy link

bwoebi commented Jan 7, 2020

It is not that trivial, e.g. you'll have to ignore comments in between, any whitespace, etc.
Where the GLR parser is just working.

As to userland matters (PHP-Parser ...), they may opt to do the lexer lookahead.
But I prefer the least hacky way for core.

@morrisonlevi
Copy link
Collaborator

morrisonlevi commented Jan 9, 2020

I personally prefer {}, but I've used a language that uses {} before so I'm probably more accepting of such things. Additionally, I'd like to stay away from a GLR parser, though I do admit that the grammar work to avoid it is quite tedious. Neither situation is particularly great, which is probably why this has stagnated.

@bwoebi
Copy link

bwoebi commented Jan 9, 2020

While my purist part agrees with you, I think it's a bit better to move forward with <> despite GLR, I don't want contention about the grammar in RFC phase.
The RFC itself will already be controversial enough I guess.
We should just stick with <> for now (we can change later...) and move on.

@jkufner
Copy link

jkufner commented Jan 9, 2020

Do you remember the fuss when PHP introduced the backslash as namespace separator instead of :: ?

Just stick with < > because it is already established by C++, Java, C#, Rust, Pascal, Swift, Typescript, …

@mindplay-dk
Copy link
Collaborator

As we don't have any potential for nesting, GLR should not exhibit any worst case (exponential) performance here (though might make parsing overall slower). However, the main disadvantage of GLR is that all PHP tooling will have to also switch to GLR or equivalent. This is going to be a pain for PHP-Parser at least, as we don't have a GLR parser implemented in PHP currently.

Could we alleviate that issue by exposing an API that provides access to the AST?

I know this doesn't address all use-cases (such as wanting to parse older PHP code, but PHP-Parser could continue to do that) but maybe it alleviates the pressure to maintain an entire PHP parser in PHP itself?

@jkufner
Copy link

jkufner commented Jan 9, 2020

Is it really necessary to use GLR? What is we simply define that < following after new is generics and not "lesser than" operator? The "lesser than" operator does not make any sense with objects anyway. Thus, the valid syntax would be te following:

$list_of_a = new List<A>;
$list_of_a = new List<A>();
$list_of_a = (new List<A>);
$a_lt_n = (new A) < $n;    // mandatory parenthesis around new, otherwise syntax error
$a_lt_n = new A() < $n;    // parenthesis separate the operator
$b = call((new A) < $n, $m > 0);
$b = call(new A() < $n, $m > 0);

@mindplay-dk
Copy link
Collaborator

@jkufner that is actually quite simple and elegant. 🙂

@nikic
Copy link
Collaborator

nikic commented Jan 9, 2020

GLR is needed mainly to solve the finite-lookahead issue. The actual new ambiguity is a secondary concern from an implementation point of view.

@mindplay-dk
Copy link
Collaborator

Looks like someone is working on exposing the native AST now!

https://twitter.com/lisachenko/status/1217887229470822400?s=19

@brzuchal
Copy link

Would there be no issues exposing the generic/template type name as the usual type using the ::class magic constant?
If I have \Framework\Colleciton<\App\Foo>::class will it return the class name like "Framework\Colleciton<App\Foo>"?
The same would then have to be available via $collection::class.
Both cases require parsing to extract the template type and parameter types.

@mindplay-dk
Copy link
Collaborator

I think the ::class constant should continue to work as it does, returning the class name only - no the generic class syntax, as this is clearly more than just the name.

As for $collection::class, I would expect this would continue to return \Framework\Collection as always.

On a related note, PHP being dynamic, would something like new Collection<$item_type>([]) be expected to work?

@brzuchal
Copy link

@mindplay-dk Thanks, that makes sense.
Additionally, how a generic type can know if a parameter type is a scalar type or object/enum type since T::class can fail if given a string and not narrowed down by additional constraints on declaration?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants