Invalid (X/HT)ML entity at line end causes "different styles between \r and \n" #192

rdipardo · 2023-07-22T19:33:13Z

SCE_H_TAGUNKNOWN bleeds into the first non-printing character adjacent to a malformed entity:

Lines 1876 to 1883 in 5ffca56

    
           if (ch != '#' && !(IsASCII(ch) && isalnum(ch))	// Should check that '#' follows '&', but it is unlikely anyway... 
        
           	&& ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML 
        
           	if (!IsASCII(ch))	// Possibly start of a multibyte character so don't allow this byte to be in entity style 
        
           		styler.ColourTo(i-1, SCE_H_TAGUNKNOWN); 
        
           	else 
        
           		styler.ColourTo(i, SCE_H_TAGUNKNOWN); 
        
           	state = SCE_H_DEFAULT; 
        
           }

In CRLF mode, this will split the EOL into different styles:

Text (`SC_EOL_CRLF`)

Computed Styles

{2}&
{0}

Visual Styles

Test Output

~\lexilla\test\examples\hypertext\SCE_H_TAGUNKNOWN.html:1: different styles between \r and \n at 2: 2, 0
~\lexilla\test\examples\hypertext\SCE_H_TAGUNKNOWN.html:1: has different styles with \n versus \r\n line ends

This patch introduces back-tracking to only style the visible characters of the malformed entity:

The text was updated successfully, but these errors were encountered:

zufuliu · 2023-07-22T22:07:03Z

what about following fix (added else before if, the condition could be extracted as a function):

else if (chNext != ';' && chNext != '#' && !(IsASCII(chNext) && isalnum(chNext))	// Should check that '#' follows '&', but it is unlikely anyway...
	&& chNext != '.' && chNext != '-' && chNext != '_' && chNext != ':') { // valid in XML
	styler.ColourTo(i, SCE_H_TAGUNKNOWN);
	state = SCE_H_DEFAULT;
}

SCE_H_TAGUNKNOWN can be changed to SCE_H_DEFAULT as this is not tag and using & with invalid entity name is valid.

zufuliu · 2023-07-22T22:32:14Z

what about following fix

needs extra check for chNext:

lexilla/lexers/LexHTML.cxx

Lines 1713 to 1716 in 5ffca56

    
           } else if (ch == '&') { 
        
           	styler.ColourTo(i - 1, SCE_H_DEFAULT); 
        
           	state = SCE_H_ENTITY; 
        
           }

nyamatongwe · 2023-07-22T23:29:52Z

To fix just the initial problem, a minimum change appears to be.

@@ -1875,7 +1875,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 			}
 			if (ch != '#' && !(IsASCII(ch) && isalnum(ch))	// Should check that '#' follows '&', but it is unlikely anyway...
 				&& ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
-				if (!IsASCII(ch))	// Possibly start of a multibyte character so don't allow this byte to be in entity style
+				if (!IsASCII(ch) || isLineEnd(ch))	// Possibly start of a multibyte character so don't allow this byte to be in entity style
 					styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
 				else
 					styler.ColourTo(i, SCE_H_TAGUNKNOWN);

There may be a case for additional checks.

nyamatongwe · 2023-07-23T05:20:29Z

Looked up the definition of entities at https://www.w3.org/TR/2008/REC-xml-20081126/#sec-references.

Simplifying a bit it is:

  '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';'
| '&' [A-Za-z:_] [A-Za-z0-9:_.-]* ';'

It also appears possible in XML to define non-ASCII entity names but that isn't done for HTML.

zufuliu · 2023-07-23T06:11:44Z

Third rule (named character reference) could be simplified to only allow alpha numeric + dot for all predefined entities
https://html.spec.whatwg.org/multipage/syntax.html#character-references
https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references
https://www.w3.org/TR/xml-entity-names/
https://www.w3.org/2003/entities/2007/w3centities-f.ent

nyamatongwe · 2023-07-23T12:15:07Z

@@ -1162,6 +1162,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 	const CharacterSet setAttributeContinue(CharacterSet::setAlphaNum, ".-_:!#/", true);
 	// TODO: also handle + and - (except if they're part of ++ or --) and return keywords
 	const CharacterSet setOKBeforeJSRE(CharacterSet::setNone, "([{=,:;!%^&*|?~");
+	const CharacterSet setEntity(CharacterSet::setAlphaNum, ".#-_:");
 
 	int levelPrev = styler.LevelAt(lineCurrent) & SC_FOLDLEVELNUMBERMASK;
 	int levelCurrent = levelPrev;
@@ -1872,13 +1873,9 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 			if (ch == ';') {
 				styler.ColourTo(i, StateToPrint);
 				state = SCE_H_DEFAULT;
-			}
-			if (ch != '#' && !(IsASCII(ch) && isalnum(ch))	// Should check that '#' follows '&', but it is unlikely anyway...
-				&& ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
-				if (!IsASCII(ch))	// Possibly start of a multibyte character so don't allow this byte to be in entity style
-					styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
-				else
-					styler.ColourTo(i, SCE_H_TAGUNKNOWN);
+			} else if (!setEntity.Contains(ch)) {
+				// Only allow [A-Za-z0-9.#-_:] in entities
+				styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
 				state = SCE_H_DEFAULT;
 			}
 			break;

nyamatongwe · 2023-07-23T12:26:59Z

Scintilla bug 810 affected this code but the &— case seems OK with this change.
https://sourceforge.net/p/scintilla/bugs/810/

zufuliu · 2023-07-23T12:49:45Z

how about adding continue after SCE_H_TAGUNKNOWN:

state = SCE_H_DEFAULT;
--i;
continue;

test case:

&amp<p>
&&amp;

rdipardo · 2023-07-23T17:52:35Z

how about adding continue after SCE_H_TAGUNKNOWN:
state = SCE_H_DEFAULT;
--i;
continue;

That works:

{2}&amp<p>{0}
{2}&{10}&amp;{0}

nyamatongwe · 2023-07-23T22:27:05Z

After experimenting a bit, I prefer the current styling that shows the first invalid character in 'red' so it can be seen as the problem. If it is the start of a valid tag like <p> then showing that all in blue makes it harder to see where the problem is.

Tests:

&
&amp<p>
&&amp;
&lt;
&lt;<b>
&b.epsi;
&b.epsi!
&カタナ;
&—;

Treating line ends and non-ASCII as early ends (due to character/CRLF slicing):

@@ -1162,6 +1162,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 	const CharacterSet setAttributeContinue(CharacterSet::setAlphaNum, ".-_:!#/", true);
 	// TODO: also handle + and - (except if they're part of ++ or --) and return keywords
 	const CharacterSet setOKBeforeJSRE(CharacterSet::setNone, "([{=,:;!%^&*|?~");
+	const CharacterSet setEntity(CharacterSet::setAlphaNum, ".#-_:");
 
 	int levelPrev = styler.LevelAt(lineCurrent) & SC_FOLDLEVELNUMBERMASK;
 	int levelCurrent = levelPrev;
@@ -1872,13 +1873,12 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 			if (ch == ';') {
 				styler.ColourTo(i, StateToPrint);
 				state = SCE_H_DEFAULT;
-			}
-			if (ch != '#' && !(IsASCII(ch) && isalnum(ch))	// Should check that '#' follows '&', but it is unlikely anyway...
-				&& ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
-				if (!IsASCII(ch))	// Possibly start of a multibyte character so don't allow this byte to be in entity style
-					styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
-				else
-					styler.ColourTo(i, SCE_H_TAGUNKNOWN);
+			} else if (isLineEnd(ch) || !IsASCII(ch)) {
+				styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
+				state = SCE_H_DEFAULT;
+			} else if (!setEntity.Contains(ch)) {
+				// Only allow [A-Za-z0-9.#-_:] in entities
+				styler.ColourTo(i, SCE_H_TAGUNKNOWN);
 				state = SCE_H_DEFAULT;
 			}
 			break;

{2}&{0}
{2}&amp<{0}p>
{2}&&{0}amp;
{10}&lt;{0}
{10}&lt;{1}<b>{0}
{10}&b.epsi;{0}
{2}&b.epsi!{0}
{2}&{0}カタナ;
{2}&{0}—;

zufuliu · 2023-07-23T23:36:37Z

with

} else if (!setEntity.Contains(ch)) {
	styler.ColourTo(i - 1, SCE_H_TAGUNKNOWN);
	state = SCE_H_DEFAULT;
	--i;
	continue;
}

<html>
&
&1
&A
&中
&amp<p>
&1<p>
&A<p>
&中<p>
&&amp;
&#1;
&#1;中
&A;<p>
&#1;<p>
&#1;中<p>
&amp
</html>

PS. I think it's might better to treat <p> as void tag, https://html.spec.whatwg.org/multipage/grouping-content.html#the-p-element. also from https://www.w3.org/TR/xml-entity-names/#htmlmathml

the HTML (but not XML) parser allows the trailing semicolon to be omitted. So &amp may be used as well as & when using HTML.

nyamatongwe · 2023-07-27T22:47:45Z

From https://html.spec.whatwg.org/multipage/syntax.html#character-references:

Named character references
The name must be one that is terminated by a U+003B SEMICOLON character (;)

I think it is better to be strict here.

rdipardo · 2023-07-28T03:52:32Z

"Strict" really means "easier to implement" in this case. It does appear that some entities in the HTML 5 specification have no semicolon in their canonical form.

$ curl -sL 'https://www.w3.org/TR/html5/entities.json' | jq -r '. | keys[] | select(index(";") | not)'
&AElig
&AMP
&Aacute
&Acirc
&Agrave
&Aring
&Atilde
&Auml
&COPY
&Ccedil
&ETH
&Eacute
&Ecirc
&Egrave
&Euml
&GT
&Iacute
&Icirc
&Igrave
&Iuml
&LT
&Ntilde
&Oacute
&Ocirc
&Ograve
&Oslash
&Otilde
&Ouml
&QUOT
&REG
&THORN
&Uacute
&Ucirc
&Ugrave
&Uuml
&Yacute
&aacute
&acirc
&acute
&aelig
&agrave
&amp
&aring
&atilde
&auml
&brvbar
&ccedil
&cedil
&cent
&copy
&curren
&deg
&divide
&eacute
&ecirc
&egrave
&eth
&euml
&frac12
&frac14
&frac34
&gt
&iacute
&icirc
&iexcl
&igrave
&iquest
&iuml
&laquo
&lt
&macr
&micro
&middot
&nbsp
&not
&ntilde
&oacute
&ocirc
&ograve
&ordf
&ordm
&oslash
&otilde
&ouml
&para
&plusmn
&pound
&quot
&raquo
&reg
&sect
&shy
&sup1
&sup2
&sup3
&szlig
&thorn
&times
&uacute
&ucirc
&ugrave
&uml
&uuml
&yacute
&yen
&yuml

But then, each of these also has a fully-delimited variant, so the most the implementation would lack is an optionally briefer syntax.

zufuliu · 2023-07-28T11:08:33Z

strict is fine. I prefer highlight only &amp (instead of &amp<) as SCE_H_TAGUNKNOWN: less code than checking eol and multi-byte, better error recovery (with --i).

nyamatongwe · 2023-07-28T22:19:49Z

@rdipardo It does appear that some entities in the HTML 5 specification have no semicolon in their canonical form.

The HTML 5 specification is the same as WHATWG:
https://dev.w3.org/html5/spec-LC/syntax.html#character-references

While an entity may be defined without a semicolon, the syntax doesn't appear to allow that to be used within HTML. If it was allowed then the syntax should be defined somewhere so lexing code knows where to stop and thus what to discard and where to restart.

…re invalid character.

nyamatongwe · 2023-08-04T22:55:31Z

Folding is more stable by retreating for an invalid entity character with --i; continue; when adding an entity before a tag. Without this, the tag is temporarily invalid which means its fold level changes which may then lead to automatic unfolding. Because of this, the retreat code is included in 55c0061.

nyamatongwe added the html Caused by the hypertext lexer label Jul 22, 2023

nyamatongwe added a commit that referenced this issue Aug 4, 2023

#192 Fix invalid entity at line end and terminate invalid entity befo…

55c0061

…re invalid character.

nyamatongwe added the committed Issue fixed in repository but not in release label Aug 4, 2023

nyamatongwe closed this as completed Sep 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Invalid (X/HT)ML entity at line end causes "different styles between \r and \n" #192

Invalid (X/HT)ML entity at line end causes "different styles between \r and \n" #192

rdipardo commented Jul 22, 2023

zufuliu commented Jul 22, 2023 •

edited

zufuliu commented Jul 22, 2023

nyamatongwe commented Jul 22, 2023

nyamatongwe commented Jul 23, 2023 •

edited

zufuliu commented Jul 23, 2023

nyamatongwe commented Jul 23, 2023

nyamatongwe commented Jul 23, 2023

zufuliu commented Jul 23, 2023

rdipardo commented Jul 23, 2023

nyamatongwe commented Jul 23, 2023

zufuliu commented Jul 23, 2023

nyamatongwe commented Jul 27, 2023

rdipardo commented Jul 28, 2023 •

edited

zufuliu commented Jul 28, 2023

nyamatongwe commented Jul 28, 2023

nyamatongwe commented Aug 4, 2023

Invalid (X/HT)ML entity at line end causes "different styles between \r and \n" #192

Invalid (X/HT)ML entity at line end causes "different styles between \r and \n" #192

Comments

rdipardo commented Jul 22, 2023

Text (SC_EOL_CRLF)

Computed Styles

Visual Styles

Test Output

zufuliu commented Jul 22, 2023 • edited

zufuliu commented Jul 22, 2023

nyamatongwe commented Jul 22, 2023

nyamatongwe commented Jul 23, 2023 • edited

zufuliu commented Jul 23, 2023

nyamatongwe commented Jul 23, 2023

nyamatongwe commented Jul 23, 2023

zufuliu commented Jul 23, 2023

rdipardo commented Jul 23, 2023

nyamatongwe commented Jul 23, 2023

zufuliu commented Jul 23, 2023

nyamatongwe commented Jul 27, 2023

rdipardo commented Jul 28, 2023 • edited

zufuliu commented Jul 28, 2023

nyamatongwe commented Jul 28, 2023

nyamatongwe commented Aug 4, 2023

Text (`SC_EOL_CRLF`)

zufuliu commented Jul 22, 2023 •

edited

nyamatongwe commented Jul 23, 2023 •

edited

rdipardo commented Jul 28, 2023 •

edited