support new PHP features (flexible heredoc/nowdoc syntaxes, an underscore inside numbers) #19

ivan-u7n · 2021-07-17T13:32:53Z

As written at https://sourceforge.net/p/scintilla/feature-requests/1378/ "This is a reasonable addition but it will require someone to provide an implementation." The someone in question turned out to be me.

The patch for the issue is below. Its more elaborate explanation, a downloadable file, and a test php file are available in the comment to the similar Notepad++ issue.

--- LexHTML~orig.cxx	2021-07-17 18:23:40.714103918 +0600
+++ LexHTML.cxx	2021-07-17 18:23:47.766164407 +0600
@@ -531,7 +531,7 @@
 Sci_Position FindPhpStringDelimiter(std::string &phpStringDelimiter, Sci_Position i, const Sci_Position lengthDoc, Accessor &styler, bool &isSimpleString) {
 	Sci_Position j;
 	const Sci_Position beginning = i - 1;
-	bool isValidSimpleString = false;
+	bool isQuoted = false;
 
 	while (i < lengthDoc && (styler[i] == ' ' || styler[i] == '\t'))
 		i++;
@@ -539,10 +539,11 @@
 	const char chNext = styler.SafeGetCharAt(i + 1);
 	phpStringDelimiter.clear();
 	if (!IsPhpWordStart(ch)) {
-		if (ch == '\'' && IsPhpWordStart(chNext)) {
+		if ((ch == '\'' || ch == '\"') && IsPhpWordStart(chNext)) {
+			isSimpleString = ch == '\'';
+			isQuoted = true;
 			i++;
 			ch = chNext;
-			isSimpleString = true;
 		} else {
 			return beginning;
 		}
@@ -550,9 +551,9 @@
 	phpStringDelimiter.push_back(ch);
 	i++;
 	for (j = i; j < lengthDoc && !isLineEnd(styler[j]); j++) {
-		if (!IsPhpWordChar(styler[j])) {
-			if (isSimpleString && (styler[j] == '\'') && isLineEnd(styler.SafeGetCharAt(j + 1))) {
-				isValidSimpleString = true;
+		if (!IsPhpWordChar(styler[j]) && isQuoted) {
+			if (((isSimpleString && styler[j] == '\'') || (!isSimpleString && styler[j] == '\"')) && isLineEnd(styler.SafeGetCharAt(j + 1))) {
+				isQuoted = false;
 				j++;
 				break;
 			} else {
@@ -562,7 +563,7 @@
 		}
 		phpStringDelimiter.push_back(styler[j]);
 	}
-	if (isSimpleString && !isValidSimpleString) {
+	if (isQuoted) {
 		phpStringDelimiter.clear();
 		return beginning;
 	}
@@ -2310,7 +2311,7 @@
 		case SCE_HPHP_NUMBER:
 			// recognize bases 8,10 or 16 integers OR floating-point numbers
 			if (!IsADigit(ch)
-				&& strchr(".xXabcdefABCDEF", ch) == NULL
+				&& strchr(".xXabcdefABCDEF_", ch) == NULL
 				&& ((ch != '-' && ch != '+') || (chPrev != 'e' && chPrev != 'E'))) {
 				styler.ColourTo(i - 1, SCE_HPHP_NUMBER);
 				if (IsOperator(ch))
@@ -2352,13 +2353,10 @@
 				if (phpStringDelimiter == "\"") {
 					styler.ColourTo(i, StateToPrint);
 					state = SCE_HPHP_DEFAULT;
-				} else if (isLineEnd(chPrev)) {
+				} else if (lineStartVisibleChars == 1) {
 					const int psdLength = static_cast<int>(phpStringDelimiter.length());
-					const char chAfterPsd = styler.SafeGetCharAt(i + psdLength);
-					const char chAfterPsd2 = styler.SafeGetCharAt(i + psdLength + 1);
-					if (isLineEnd(chAfterPsd) ||
-						(chAfterPsd == ';' && isLineEnd(chAfterPsd2))) {
-							i += (((i + psdLength) < lengthDoc) ? psdLength : lengthDoc) - 1;
+					if (!IsPhpWordChar(styler.SafeGetCharAt(i + psdLength))) {
+						i += (((i + psdLength) < lengthDoc) ? psdLength : lengthDoc) - 1;
 						styler.ColourTo(i, StateToPrint);
 						state = SCE_HPHP_DEFAULT;
 						if (foldHeredoc) levelCurrent--;
@@ -2375,12 +2373,9 @@
 					styler.ColourTo(i, StateToPrint);
 					state = SCE_HPHP_DEFAULT;
 				}
-			} else if (isLineEnd(chPrev) && styler.Match(i, phpStringDelimiter.c_str())) {
+			} else if (lineStartVisibleChars == 1 && styler.Match(i, phpStringDelimiter.c_str())) {
 				const int psdLength = static_cast<int>(phpStringDelimiter.length());
-				const char chAfterPsd = styler.SafeGetCharAt(i + psdLength);
-				const char chAfterPsd2 = styler.SafeGetCharAt(i + psdLength + 1);
-				if (isLineEnd(chAfterPsd) ||
-				(chAfterPsd == ';' && isLineEnd(chAfterPsd2))) {
+				if (!IsPhpWordChar(styler.SafeGetCharAt(i + psdLength))) {
 					i += (((i + psdLength) < lengthDoc) ? psdLength : lengthDoc) - 1;
 					styler.ColourTo(i, StateToPrint);
 					state = SCE_HPHP_DEFAULT;

nyamatongwe · 2021-07-18T02:47:16Z

Broke this up into two commits: one for digit separator and one for heredoc/nowdoc.

Added example file test/examples/hypertext/Issue19.php. You should check that the results in test/examples/hypertext/Issue19.php.styled (with style numbers in brackets) is exactly what is intended.

The patches were credited to "ivan-u7n". If you would prefer a different name, then please replay with your preferred name.

ivan-u7n · 2021-07-18T07:03:46Z

Thanks for a swift response! Everything in test/examples/hypertext/Issue19.php.styled looks as it should.

My last name is in Cyrillic and has no true form in Latin script, so either Ivan Ustûžanin or Ivan U7n will do. The latter is in case the diacritics aren't possible.

nyamatongwe · 2021-07-18T12:05:46Z

The credits are reasonably Unicode clean - LexillaHistory.html and SciTEHistory.html are UTF-8 and browsers are well behaved with UTF-8. For the SciTE about box, its also UTF-8 but any character with the 8th bit set is represented in the source code with an octal escape so it doesn't get mangled by the compiler:

"Ivan Ust\303\273\305\276anin"

Which then shows correctly. 'Иван Устюжанин' (or similar) should also survive.

Where it breaks down is trying to represent names in git or other source code control systems and all the source code control clients. Sometimes it works and sometimes the bits fall on the floor, broken. Its understandable that after seeing this people just go with something ASCII that can be recognized.

ivan-u7n · 2021-07-18T12:45:04Z

Yes, my name in Cyrillic is “Иван Устюжанин”. But I'm fine with the current credits, for not many people know how to read Cyrillic. The issue can now be closed.

BTW, I have an additional trivial fix for annotations in PHP 8 not being highlighted as comments: lines 2287 and 2411 should be changed from } else if (ch == '#') { to } else if (ch == '#' && chNext != '[') {. The test is something like

<?php
#[
	MultiLineAnnotation('string', 1, null)
]
#[SingleLineAnnotation('string', 1, null)]

which should not contain any line comments.

nyamatongwe · 2021-07-18T23:54:56Z

Committed annotation change.

Issues are normally closed when the fix goes in a release.

ivan-u7n · 2021-07-21T14:07:21Z

Are you interested in extending the PHP part of the lexer with more styles for stuff like annotations, errors in syntax and in increasing the number of keywords lists for tokens with special meaning? I mean the implementation is on me, I'm just asking wether it will be accepted and it is something worth doing or just a basic lexer like the current one will do.

Although the support for nesting lexers will be event greater, I don't think I can tackle that one yet.

Edit: My first thought was to remake the categorization of digits to be in accordance with the language grammar, and in the case of invalid tokens it'll be much better to highlight them somehow. I had an idea to just style them with the default style, but if we can detect them, why not to style them accordingly.

nyamatongwe · 2021-07-23T04:09:32Z

There are only 9 available styles in the 0-127 range and there is a potential for incompatibility with styles > 127. Other aspects of HTML (like JavaScript with template literals) may need these more than PHP. Adding more styles should only be done where there is strong justification.

The current "hypertext" lexer is difficult to work with so should be replaced with a more modular design. Client-side and server-side languages should be extracted into their own modules that are then combined with the markup language code.

categorization of digits ... invalid ... highlight them

This doesn't seem to me to be worth another style. Using the default style would be visible enough.

ivan-u7n · 2021-07-25T23:34:23Z

I agree that the "HTML" lexer is rather complicated and needs overhaul. However, do you have in mind compile-time or run-time combining?

Nevertheless, I've prepared the update to PHP's numeric literals: php-numbers.patch.txt. It's greedier than PHP's own lexer — it doesn't stop on the first invalid character, but goes on. It was made this way on purpose: to visually show, by applying the default style, the invalid “numeric words” which will result in parser errors.

The test for this patch (“+” denotes a valid syntax that should be styled as a number, “-” — an invalid one that should have the default style):

123456; // +
123_456; // +
1234z6; // -
123456_; // -
123__456; // -

0x89Ab; // +
0x89_aB; // +
0x89zB; // -
0x89AB_; // -
0x_89AB; // -
0_x89AB; // -
0x89__AB; // -

1234.; // +
1234.e-0; // +
1234e+0; // +
1234e0; // +
1234.e-; // -
1234e+; // -
1234.-e; // -
1234+e; // -
1234e; // -

.1234; // +
.12e0; // +
.12.0e0; // -
.12e0.0; // -
.12e0e0; // -

1.234e-10; // +
1.2_34e-1_0; // +
1.234e-_10; // -
1.234e_-10; // -
1.234_e-10; // -
1._234e-10; // -
1_.234e-10; // -


01234567; // +
0_1234567; // +
012345678; // -

0...0; // +

nyamatongwe · 2021-07-26T04:18:25Z

Moved numeric literals into a new issue #20 so this issue can be closed.

nyamatongwe added the php Caused by the hypertext lexer for PHP label Jul 17, 2021

nyamatongwe pushed a commit that referenced this issue Jul 18, 2021

#19 Enable '_' digit separator in numbers.

44c3794

nyamatongwe pushed a commit that referenced this issue Jul 18, 2021

#19 Support flexible heredoc and nowdoc syntax.

cda350d

nyamatongwe pushed a commit that referenced this issue Jul 18, 2021

#19 Stop styling attributes as comments for PHP.

0751489

nyamatongwe mentioned this issue Jul 26, 2021

PHP Numeric literals #20

Closed

nyamatongwe closed this as completed Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support new PHP features (flexible heredoc/nowdoc syntaxes, an underscore inside numbers) #19

support new PHP features (flexible heredoc/nowdoc syntaxes, an underscore inside numbers) #19

ivan-u7n commented Jul 17, 2021

nyamatongwe commented Jul 18, 2021

ivan-u7n commented Jul 18, 2021

nyamatongwe commented Jul 18, 2021

ivan-u7n commented Jul 18, 2021 •

edited

Loading

nyamatongwe commented Jul 18, 2021

ivan-u7n commented Jul 21, 2021 •

edited

Loading

nyamatongwe commented Jul 23, 2021

ivan-u7n commented Jul 25, 2021

nyamatongwe commented Jul 26, 2021

support new PHP features (flexible heredoc/nowdoc syntaxes, an underscore inside numbers) #19

support new PHP features (flexible heredoc/nowdoc syntaxes, an underscore inside numbers) #19

Comments

ivan-u7n commented Jul 17, 2021

nyamatongwe commented Jul 18, 2021

ivan-u7n commented Jul 18, 2021

nyamatongwe commented Jul 18, 2021

ivan-u7n commented Jul 18, 2021 • edited Loading

nyamatongwe commented Jul 18, 2021

ivan-u7n commented Jul 21, 2021 • edited Loading

nyamatongwe commented Jul 23, 2021

ivan-u7n commented Jul 25, 2021

nyamatongwe commented Jul 26, 2021

ivan-u7n commented Jul 18, 2021 •

edited

Loading

ivan-u7n commented Jul 21, 2021 •

edited

Loading