Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid (X/HT)ML entity at line end causes "different styles between \r and \n" #192

Closed
rdipardo opened this issue Jul 22, 2023 · 16 comments
Labels
committed Issue fixed in repository but not in release html Caused by the hypertext lexer

Comments

@rdipardo
Copy link
Contributor

SCE_H_TAGUNKNOWN bleeds into the first non-printing character adjacent to a malformed entity:

lexilla/lexers/LexHTML.cxx

Lines 1876 to 1883 in 5ffca56

if (ch != '#' && !(IsASCII(ch) && isalnum(ch)) // Should check that '#' follows '&', but it is unlikely anyway...
&& ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
if (!IsASCII(ch)) // Possibly start of a multibyte character so don't allow this byte to be in entity style
styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
else
styler.ColourTo(i, SCE_H_TAGUNKNOWN);
state = SCE_H_DEFAULT;
}

In CRLF mode, this will split the EOL into different styles:

Text (SC_EOL_CRLF)
&
Computed Styles
{2}&
{0}
Visual Styles

SCE_H_TAGUNKNOWN-eol-split

Test Output
~\lexilla\test\examples\hypertext\SCE_H_TAGUNKNOWN.html:1: different styles between \r and \n at 2: 2, 0
~\lexilla\test\examples\hypertext\SCE_H_TAGUNKNOWN.html:1: has different styles with \n versus \r\n line ends

This patch introduces back-tracking to only style the visible characters of the malformed entity:

@nyamatongwe nyamatongwe added the html Caused by the hypertext lexer label Jul 22, 2023
@zufuliu
Copy link
Contributor

zufuliu commented Jul 22, 2023

what about following fix (added else before if, the condition could be extracted as a function):

else if (chNext != ';' && chNext != '#' && !(IsASCII(chNext) && isalnum(chNext))	// Should check that '#' follows '&', but it is unlikely anyway...
	&& chNext != '.' && chNext != '-' && chNext != '_' && chNext != ':') { // valid in XML
	styler.ColourTo(i, SCE_H_TAGUNKNOWN);
	state = SCE_H_DEFAULT;
}

SCE_H_TAGUNKNOWN can be changed to SCE_H_DEFAULT as this is not tag and using & with invalid entity name is valid.

@zufuliu
Copy link
Contributor

zufuliu commented Jul 22, 2023

what about following fix

needs extra check for chNext:

lexilla/lexers/LexHTML.cxx

Lines 1713 to 1716 in 5ffca56

} else if (ch == '&') {
styler.ColourTo(i - 1, SCE_H_DEFAULT);
state = SCE_H_ENTITY;
}

@nyamatongwe
Copy link
Member

To fix just the initial problem, a minimum change appears to be.

@@ -1875,7 +1875,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 			}
 			if (ch != '#' && !(IsASCII(ch) && isalnum(ch))	// Should check that '#' follows '&', but it is unlikely anyway...
 				&& ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
-				if (!IsASCII(ch))	// Possibly start of a multibyte character so don't allow this byte to be in entity style
+				if (!IsASCII(ch) || isLineEnd(ch))	// Possibly start of a multibyte character so don't allow this byte to be in entity style
 					styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
 				else
 					styler.ColourTo(i, SCE_H_TAGUNKNOWN);

There may be a case for additional checks.

@nyamatongwe
Copy link
Member

nyamatongwe commented Jul 23, 2023

Looked up the definition of entities at https://www.w3.org/TR/2008/REC-xml-20081126/#sec-references.

Simplifying a bit it is:

  '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';'
| '&' [A-Za-z:_] [A-Za-z0-9:_.-]* ';'

It also appears possible in XML to define non-ASCII entity names but that isn't done for HTML.

@zufuliu
Copy link
Contributor

zufuliu commented Jul 23, 2023

@nyamatongwe
Copy link
Member

@@ -1162,6 +1162,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 	const CharacterSet setAttributeContinue(CharacterSet::setAlphaNum, ".-_:!#/", true);
 	// TODO: also handle + and - (except if they're part of ++ or --) and return keywords
 	const CharacterSet setOKBeforeJSRE(CharacterSet::setNone, "([{=,:;!%^&*|?~");
+	const CharacterSet setEntity(CharacterSet::setAlphaNum, ".#-_:");
 
 	int levelPrev = styler.LevelAt(lineCurrent) & SC_FOLDLEVELNUMBERMASK;
 	int levelCurrent = levelPrev;
@@ -1872,13 +1873,9 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 			if (ch == ';') {
 				styler.ColourTo(i, StateToPrint);
 				state = SCE_H_DEFAULT;
-			}
-			if (ch != '#' && !(IsASCII(ch) && isalnum(ch))	// Should check that '#' follows '&', but it is unlikely anyway...
-				&& ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
-				if (!IsASCII(ch))	// Possibly start of a multibyte character so don't allow this byte to be in entity style
-					styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
-				else
-					styler.ColourTo(i, SCE_H_TAGUNKNOWN);
+			} else if (!setEntity.Contains(ch)) {
+				// Only allow [A-Za-z0-9.#-_:] in entities
+				styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
 				state = SCE_H_DEFAULT;
 			}
 			break;

@nyamatongwe
Copy link
Member

Scintilla bug 810 affected this code but the &— case seems OK with this change.
https://sourceforge.net/p/scintilla/bugs/810/

@zufuliu
Copy link
Contributor

zufuliu commented Jul 23, 2023

how about adding continue after SCE_H_TAGUNKNOWN:

state = SCE_H_DEFAULT;
--i;
continue;

test case:

&amp<p>
&&amp;

@rdipardo
Copy link
Contributor Author

how about adding continue after SCE_H_TAGUNKNOWN:

state = SCE_H_DEFAULT;
--i;
continue;

That works:

{2}&amp<p>{0}
{2}&{10}&amp;{0}

@nyamatongwe
Copy link
Member

After experimenting a bit, I prefer the current styling that shows the first invalid character in 'red' so it can be seen as the problem. If it is the start of a valid tag like <p> then showing that all in blue makes it harder to see where the problem is.

Tests:

&
&amp<p>
&&amp;
&lt;
&lt;<b>
&b.epsi;
&b.epsi!
&カタナ;
&—;

Treating line ends and non-ASCII as early ends (due to character/CRLF slicing):

@@ -1162,6 +1162,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 	const CharacterSet setAttributeContinue(CharacterSet::setAlphaNum, ".-_:!#/", true);
 	// TODO: also handle + and - (except if they're part of ++ or --) and return keywords
 	const CharacterSet setOKBeforeJSRE(CharacterSet::setNone, "([{=,:;!%^&*|?~");
+	const CharacterSet setEntity(CharacterSet::setAlphaNum, ".#-_:");
 
 	int levelPrev = styler.LevelAt(lineCurrent) & SC_FOLDLEVELNUMBERMASK;
 	int levelCurrent = levelPrev;
@@ -1872,13 +1873,12 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 			if (ch == ';') {
 				styler.ColourTo(i, StateToPrint);
 				state = SCE_H_DEFAULT;
-			}
-			if (ch != '#' && !(IsASCII(ch) && isalnum(ch))	// Should check that '#' follows '&', but it is unlikely anyway...
-				&& ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
-				if (!IsASCII(ch))	// Possibly start of a multibyte character so don't allow this byte to be in entity style
-					styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
-				else
-					styler.ColourTo(i, SCE_H_TAGUNKNOWN);
+			} else if (isLineEnd(ch) || !IsASCII(ch)) {
+				styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
+				state = SCE_H_DEFAULT;
+			} else if (!setEntity.Contains(ch)) {
+				// Only allow [A-Za-z0-9.#-_:] in entities
+				styler.ColourTo(i, SCE_H_TAGUNKNOWN);
 				state = SCE_H_DEFAULT;
 			}
 			break;
{2}&{0}
{2}&amp<{0}p>
{2}&&{0}amp;
{10}&lt;{0}
{10}&lt;{1}<b>{0}
{10}&b.epsi;{0}
{2}&b.epsi!{0}
{2}&{0}カタナ;
{2}&{0}—;

@zufuliu
Copy link
Contributor

zufuliu commented Jul 23, 2023

with

} else if (!setEntity.Contains(ch)) {
	styler.ColourTo(i - 1, SCE_H_TAGUNKNOWN);
	state = SCE_H_DEFAULT;
	--i;
	continue;
}
<html>
&
&1
&A
&中
&amp<p>
&1<p>
&A<p>
&中<p>
&&amp;
&#1;
&#1;中
&A;<p>
&#1;<p>
&#1;中<p>
&amp
</html>

image

PS. I think it's might better to treat <p> as void tag, https://html.spec.whatwg.org/multipage/grouping-content.html#the-p-element. also from https://www.w3.org/TR/xml-entity-names/#htmlmathml

the HTML (but not XML) parser allows the trailing semicolon to be omitted. So &amp may be used as well as &amp; when using HTML.

@nyamatongwe
Copy link
Member

From https://html.spec.whatwg.org/multipage/syntax.html#character-references:

Named character references
The name must be one that is terminated by a U+003B SEMICOLON character (;)

I think it is better to be strict here.

@rdipardo
Copy link
Contributor Author

rdipardo commented Jul 28, 2023

"Strict" really means "easier to implement" in this case. It does appear that some entities in the HTML 5 specification have no semicolon in their canonical form.

$ curl -sL 'https://www.w3.org/TR/html5/entities.json' | jq -r '. | keys[] | select(index(";") | not)'
&AElig
&AMP
&Aacute
&Acirc
&Agrave
&Aring
&Atilde
&Auml
&COPY
&Ccedil
&ETH
&Eacute
&Ecirc
&Egrave
&Euml
&GT
&Iacute
&Icirc
&Igrave
&Iuml
&LT
&Ntilde
&Oacute
&Ocirc
&Ograve
&Oslash
&Otilde
&Ouml
&QUOT
&REG
&THORN
&Uacute
&Ucirc
&Ugrave
&Uuml
&Yacute
&aacute
&acirc
&acute
&aelig
&agrave
&amp
&aring
&atilde
&auml
&brvbar
&ccedil
&cedil
&cent
&copy
&curren
&deg
&divide
&eacute
&ecirc
&egrave
&eth
&euml
&frac12
&frac14
&frac34
&gt
&iacute
&icirc
&iexcl
&igrave
&iquest
&iuml
&laquo
&lt
&macr
&micro
&middot
&nbsp
&not
&ntilde
&oacute
&ocirc
&ograve
&ordf
&ordm
&oslash
&otilde
&ouml
&para
&plusmn
&pound
&quot
&raquo
&reg
&sect
&shy
&sup1
&sup2
&sup3
&szlig
&thorn
&times
&uacute
&ucirc
&ugrave
&uml
&uuml
&yacute
&yen
&yuml

But then, each of these also has a fully-delimited variant, so the most the implementation would lack is an optionally briefer syntax.

@zufuliu
Copy link
Contributor

zufuliu commented Jul 28, 2023

strict is fine. I prefer highlight only &amp (instead of &amp<) as SCE_H_TAGUNKNOWN: less code than checking eol and multi-byte, better error recovery (with --i).

@nyamatongwe
Copy link
Member

@rdipardo It does appear that some entities in the HTML 5 specification have no semicolon in their canonical form.

The HTML 5 specification is the same as WHATWG:
https://dev.w3.org/html5/spec-LC/syntax.html#character-references

While an entity may be defined without a semicolon, the syntax doesn't appear to allow that to be used within HTML. If it was allowed then the syntax should be defined somewhere so lexing code knows where to stop and thus what to discard and where to restart.

nyamatongwe added a commit that referenced this issue Aug 4, 2023
@nyamatongwe nyamatongwe added the committed Issue fixed in repository but not in release label Aug 4, 2023
@nyamatongwe
Copy link
Member

Folding is more stable by retreating for an invalid entity character with --i; continue; when adding an entity before a tag. Without this, the tag is temporarily invalid which means its fold level changes which may then lead to automatic unfolding. Because of this, the retreat code is included in 55c0061.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
committed Issue fixed in repository but not in release html Caused by the hypertext lexer
Projects
None yet
Development

No branches or pull requests

3 participants