Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support new PHP features (flexible heredoc/nowdoc syntaxes, an underscore inside numbers) #19

Closed
ivan-u7n opened this issue Jul 17, 2021 · 9 comments
Labels
php Caused by the hypertext lexer for PHP

Comments

@ivan-u7n
Copy link
Contributor

As written at https://sourceforge.net/p/scintilla/feature-requests/1378/ "This is a reasonable addition but it will require someone to provide an implementation." The someone in question turned out to be me.

The patch for the issue is below. Its more elaborate explanation, a downloadable file, and a test php file are available in the comment to the similar Notepad++ issue.

--- LexHTML~orig.cxx	2021-07-17 18:23:40.714103918 +0600
+++ LexHTML.cxx	2021-07-17 18:23:47.766164407 +0600
@@ -531,7 +531,7 @@
 Sci_Position FindPhpStringDelimiter(std::string &phpStringDelimiter, Sci_Position i, const Sci_Position lengthDoc, Accessor &styler, bool &isSimpleString) {
 	Sci_Position j;
 	const Sci_Position beginning = i - 1;
-	bool isValidSimpleString = false;
+	bool isQuoted = false;
 
 	while (i < lengthDoc && (styler[i] == ' ' || styler[i] == '\t'))
 		i++;
@@ -539,10 +539,11 @@
 	const char chNext = styler.SafeGetCharAt(i + 1);
 	phpStringDelimiter.clear();
 	if (!IsPhpWordStart(ch)) {
-		if (ch == '\'' && IsPhpWordStart(chNext)) {
+		if ((ch == '\'' || ch == '\"') && IsPhpWordStart(chNext)) {
+			isSimpleString = ch == '\'';
+			isQuoted = true;
 			i++;
 			ch = chNext;
-			isSimpleString = true;
 		} else {
 			return beginning;
 		}
@@ -550,9 +551,9 @@
 	phpStringDelimiter.push_back(ch);
 	i++;
 	for (j = i; j < lengthDoc && !isLineEnd(styler[j]); j++) {
-		if (!IsPhpWordChar(styler[j])) {
-			if (isSimpleString && (styler[j] == '\'') && isLineEnd(styler.SafeGetCharAt(j + 1))) {
-				isValidSimpleString = true;
+		if (!IsPhpWordChar(styler[j]) && isQuoted) {
+			if (((isSimpleString && styler[j] == '\'') || (!isSimpleString && styler[j] == '\"')) && isLineEnd(styler.SafeGetCharAt(j + 1))) {
+				isQuoted = false;
 				j++;
 				break;
 			} else {
@@ -562,7 +563,7 @@
 		}
 		phpStringDelimiter.push_back(styler[j]);
 	}
-	if (isSimpleString && !isValidSimpleString) {
+	if (isQuoted) {
 		phpStringDelimiter.clear();
 		return beginning;
 	}
@@ -2310,7 +2311,7 @@
 		case SCE_HPHP_NUMBER:
 			// recognize bases 8,10 or 16 integers OR floating-point numbers
 			if (!IsADigit(ch)
-				&& strchr(".xXabcdefABCDEF", ch) == NULL
+				&& strchr(".xXabcdefABCDEF_", ch) == NULL
 				&& ((ch != '-' && ch != '+') || (chPrev != 'e' && chPrev != 'E'))) {
 				styler.ColourTo(i - 1, SCE_HPHP_NUMBER);
 				if (IsOperator(ch))
@@ -2352,13 +2353,10 @@
 				if (phpStringDelimiter == "\"") {
 					styler.ColourTo(i, StateToPrint);
 					state = SCE_HPHP_DEFAULT;
-				} else if (isLineEnd(chPrev)) {
+				} else if (lineStartVisibleChars == 1) {
 					const int psdLength = static_cast<int>(phpStringDelimiter.length());
-					const char chAfterPsd = styler.SafeGetCharAt(i + psdLength);
-					const char chAfterPsd2 = styler.SafeGetCharAt(i + psdLength + 1);
-					if (isLineEnd(chAfterPsd) ||
-						(chAfterPsd == ';' && isLineEnd(chAfterPsd2))) {
-							i += (((i + psdLength) < lengthDoc) ? psdLength : lengthDoc) - 1;
+					if (!IsPhpWordChar(styler.SafeGetCharAt(i + psdLength))) {
+						i += (((i + psdLength) < lengthDoc) ? psdLength : lengthDoc) - 1;
 						styler.ColourTo(i, StateToPrint);
 						state = SCE_HPHP_DEFAULT;
 						if (foldHeredoc) levelCurrent--;
@@ -2375,12 +2373,9 @@
 					styler.ColourTo(i, StateToPrint);
 					state = SCE_HPHP_DEFAULT;
 				}
-			} else if (isLineEnd(chPrev) && styler.Match(i, phpStringDelimiter.c_str())) {
+			} else if (lineStartVisibleChars == 1 && styler.Match(i, phpStringDelimiter.c_str())) {
 				const int psdLength = static_cast<int>(phpStringDelimiter.length());
-				const char chAfterPsd = styler.SafeGetCharAt(i + psdLength);
-				const char chAfterPsd2 = styler.SafeGetCharAt(i + psdLength + 1);
-				if (isLineEnd(chAfterPsd) ||
-				(chAfterPsd == ';' && isLineEnd(chAfterPsd2))) {
+				if (!IsPhpWordChar(styler.SafeGetCharAt(i + psdLength))) {
 					i += (((i + psdLength) < lengthDoc) ? psdLength : lengthDoc) - 1;
 					styler.ColourTo(i, StateToPrint);
 					state = SCE_HPHP_DEFAULT;
@nyamatongwe nyamatongwe added the php Caused by the hypertext lexer for PHP label Jul 17, 2021
@nyamatongwe
Copy link
Member

Broke this up into two commits: one for digit separator and one for heredoc/nowdoc.

Added example file test/examples/hypertext/Issue19.php. You should check that the results in test/examples/hypertext/Issue19.php.styled (with style numbers in brackets) is exactly what is intended.

The patches were credited to "ivan-u7n". If you would prefer a different name, then please replay with your preferred name.

@ivan-u7n
Copy link
Contributor Author

Thanks for a swift response! Everything in test/examples/hypertext/Issue19.php.styled looks as it should.

My last name is in Cyrillic and has no true form in Latin script, so either Ivan Ustûžanin or Ivan U7n will do. The latter is in case the diacritics aren't possible.

@nyamatongwe
Copy link
Member

The credits are reasonably Unicode clean - LexillaHistory.html and SciTEHistory.html are UTF-8 and browsers are well behaved with UTF-8. For the SciTE about box, its also UTF-8 but any character with the 8th bit set is represented in the source code with an octal escape so it doesn't get mangled by the compiler:

"Ivan Ust\303\273\305\276anin"

Which then shows correctly. 'Иван Устюжанин' (or similar) should also survive.

Where it breaks down is trying to represent names in git or other source code control systems and all the source code control clients. Sometimes it works and sometimes the bits fall on the floor, broken. Its understandable that after seeing this people just go with something ASCII that can be recognized.

@ivan-u7n
Copy link
Contributor Author

ivan-u7n commented Jul 18, 2021

Yes, my name in Cyrillic is “Иван Устюжанин”. But I'm fine with the current credits, for not many people know how to read Cyrillic. The issue can now be closed.

BTW, I have an additional trivial fix for annotations in PHP 8 not being highlighted as comments: lines 2287 and 2411 should be changed from } else if (ch == '#') { to } else if (ch == '#' && chNext != '[') {. The test is something like

<?php
#[
	MultiLineAnnotation('string', 1, null)
]
#[SingleLineAnnotation('string', 1, null)]

which should not contain any line comments.

@nyamatongwe
Copy link
Member

Committed annotation change.

Issues are normally closed when the fix goes in a release.

@ivan-u7n
Copy link
Contributor Author

ivan-u7n commented Jul 21, 2021

Are you interested in extending the PHP part of the lexer with more styles for stuff like annotations, errors in syntax and in increasing the number of keywords lists for tokens with special meaning? I mean the implementation is on me, I'm just asking wether it will be accepted and it is something worth doing or just a basic lexer like the current one will do.

Although the support for nesting lexers will be event greater, I don't think I can tackle that one yet.

Edit: My first thought was to remake the categorization of digits to be in accordance with the language grammar, and in the case of invalid tokens it'll be much better to highlight them somehow. I had an idea to just style them with the default style, but if we can detect them, why not to style them accordingly.

@nyamatongwe
Copy link
Member

There are only 9 available styles in the 0-127 range and there is a potential for incompatibility with styles > 127. Other aspects of HTML (like JavaScript with template literals) may need these more than PHP. Adding more styles should only be done where there is strong justification.

The current "hypertext" lexer is difficult to work with so should be replaced with a more modular design. Client-side and server-side languages should be extracted into their own modules that are then combined with the markup language code.

categorization of digits ... invalid ... highlight them

This doesn't seem to me to be worth another style. Using the default style would be visible enough.

@ivan-u7n
Copy link
Contributor Author

I agree that the "HTML" lexer is rather complicated and needs overhaul. However, do you have in mind compile-time or run-time combining?

Nevertheless, I've prepared the update to PHP's numeric literals: php-numbers.patch.txt. It's greedier than PHP's own lexer — it doesn't stop on the first invalid character, but goes on. It was made this way on purpose: to visually show, by applying the default style, the invalid “numeric words” which will result in parser errors.

The test for this patch (“+” denotes a valid syntax that should be styled as a number, “-” — an invalid one that should have the default style):

123456; // +
123_456; // +
1234z6; // -
123456_; // -
123__456; // -

0x89Ab; // +
0x89_aB; // +
0x89zB; // -
0x89AB_; // -
0x_89AB; // -
0_x89AB; // -
0x89__AB; // -

1234.; // +
1234.e-0; // +
1234e+0; // +
1234e0; // +
1234.e-; // -
1234e+; // -
1234.-e; // -
1234+e; // -
1234e; // -

.1234; // +
.12e0; // +
.12.0e0; // -
.12e0.0; // -
.12e0e0; // -

1.234e-10; // +
1.2_34e-1_0; // +
1.234e-_10; // -
1.234e_-10; // -
1.234_e-10; // -
1._234e-10; // -
1_.234e-10; // -


01234567; // +
0_1234567; // +
012345678; // -

0...0; // +

@nyamatongwe
Copy link
Member

Moved numeric literals into a new issue #20 so this issue can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
php Caused by the hypertext lexer for PHP
Projects
None yet
Development

No branches or pull requests

2 participants