Invalid (X/HT)ML entity at line end causes "different styles between \r and \n" #192
Comments
What about the following fix (added):
else if (chNext != ';' && chNext != '#' && !(IsASCII(chNext) && isalnum(chNext)) // Should check that '#' follows '&', but it is unlikely anyway...
        && chNext != '.' && chNext != '-' && chNext != '_' && chNext != ':') { // valid in XML
    styler.ColourTo(i, SCE_H_TAGUNKNOWN);
    state = SCE_H_DEFAULT;
}
|
Needs an extra check for lines 1713 to 1716 in 5ffca56
|
To fix just the initial problem, a minimum change appears to be:
@@ -1875,7 +1875,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 }
 if (ch != '#' && !(IsASCII(ch) && isalnum(ch)) // Should check that '#' follows '&', but it is unlikely anyway...
 && ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
- if (!IsASCII(ch)) // Possibly start of a multibyte character so don't allow this byte to be in entity style
+ if (!IsASCII(ch) || isLineEnd(ch)) // Possibly start of a multibyte character so don't allow this byte to be in entity style
 styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
 else
 styler.ColourTo(i, SCE_H_TAGUNKNOWN);
There may be a case for additional checks. |
Looked up the definition of entities at https://www.w3.org/TR/2008/REC-xml-20081126/#sec-references. Simplifying a bit it is:
It also appears possible in XML to define non-ASCII entity names but that isn't done for HTML. |
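For orientation, the reference productions in that section of the XML 1.0 specification are roughly `EntityRef ::= '&' Name ';'` and `CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'`. The following is a minimal sketch of that grammar, not code from Lexilla. `RefKind`, `IsAsciiNameChar` and `ClassifyReference` are hypothetical names, and the name check is an ASCII-only approximation that ignores the non-ASCII name characters XML also permits.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical classification of a candidate that starts at '&'.
enum class RefKind { NotAReference, EntityRef, DecimalCharRef, HexCharRef };

// ASCII-only approximation of an XML Name character (assumption: the
// non-ASCII ranges allowed by the specification are deliberately ignored).
inline bool IsAsciiNameChar(char ch, bool first) {
    const unsigned char uch = static_cast<unsigned char>(ch);
    if (uch >= 0x80)
        return false;
    if (std::isalpha(uch) || ch == '_' || ch == ':')
        return true;
    // Digits, '.' and '-' are allowed only after the first character.
    return !first && (std::isdigit(uch) || ch == '.' || ch == '-');
}

// Classify a complete candidate such as "&amp;" or "&#x4E2D;".
inline RefKind ClassifyReference(std::string_view s) {
    if (s.size() < 3 || s.front() != '&' || s.back() != ';')
        return RefKind::NotAReference;
    std::string_view body = s.substr(1, s.size() - 2);  // text between '&' and ';'
    assert(!body.empty());
    if (body.front() == '#') {
        body.remove_prefix(1);
        const bool hex = !body.empty() && body.front() == 'x';
        if (hex)
            body.remove_prefix(1);
        if (body.empty())
            return RefKind::NotAReference;  // "&#;" or "&#x;"
        for (const char ch : body) {
            const unsigned char uch = static_cast<unsigned char>(ch);
            if (hex ? !std::isxdigit(uch) : !std::isdigit(uch))
                return RefKind::NotAReference;
        }
        return hex ? RefKind::HexCharRef : RefKind::DecimalCharRef;
    }
    for (std::size_t i = 0; i < body.size(); i++) {
        if (!IsAsciiNameChar(body[i], i == 0))
            return RefKind::NotAReference;
    }
    return RefKind::EntityRef;
}
```

Under these assumptions `&b.epsi;` classifies as an entity reference, while `&1a;` and an unterminated `&amp` do not.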
The third rule (named character reference) could be simplified to allow only alphanumerics plus dot for all predefined entities:
@@ -1162,6 +1162,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
const CharacterSet setAttributeContinue(CharacterSet::setAlphaNum, ".-_:!#/", true);
// TODO: also handle + and - (except if they're part of ++ or --) and return keywords
const CharacterSet setOKBeforeJSRE(CharacterSet::setNone, "([{=,:;!%^&*|?~");
+ const CharacterSet setEntity(CharacterSet::setAlphaNum, ".#-_:");
int levelPrev = styler.LevelAt(lineCurrent) & SC_FOLDLEVELNUMBERMASK;
int levelCurrent = levelPrev;
@@ -1872,13 +1873,9 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
if (ch == ';') {
styler.ColourTo(i, StateToPrint);
state = SCE_H_DEFAULT;
- }
- if (ch != '#' && !(IsASCII(ch) && isalnum(ch)) // Should check that '#' follows '&', but it is unlikely anyway...
- && ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
- if (!IsASCII(ch)) // Possibly start of a multibyte character so don't allow this byte to be in entity style
- styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
- else
- styler.ColourTo(i, SCE_H_TAGUNKNOWN);
+ } else if (!setEntity.Contains(ch)) {
+ // Only allow [A-Za-z0-9.#-_:] in entities
+ styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
state = SCE_H_DEFAULT;
}
break; |
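One detail of the diff above is easy to misread: the comment `[A-Za-z0-9.#-_:]` looks like a regular-expression class, where `#-_` would be a range that includes `<` and `&`. The sketch below assumes, as the diff suggests, that the set is built from the literal characters of the string `".#-_:"`. `AsciiSet` is a hypothetical stand-in written for this illustration and is not the Lexilla `CharacterSet` class.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a lexer character set: ASCII alphanumerics plus
// every character of `extra`, each taken literally (no regex-style ranges).
class AsciiSet {
    std::bitset<128> bits;
public:
    explicit AsciiSet(std::string_view extra) {
        for (int ch = '0'; ch <= '9'; ch++) bits.set(static_cast<std::size_t>(ch));
        for (int ch = 'A'; ch <= 'Z'; ch++) bits.set(static_cast<std::size_t>(ch));
        for (int ch = 'a'; ch <= 'z'; ch++) bits.set(static_cast<std::size_t>(ch));
        for (const char ch : extra) {
            const unsigned char uch = static_cast<unsigned char>(ch);
            assert(uch < 128);  // this sketch stores ASCII members only
            bits.set(uch);
        }
    }
    // Values outside 0..127 (for example UTF-8 lead or trail bytes) are never members.
    bool Contains(int ch) const {
        return ch >= 0 && ch < 128 && bits.test(static_cast<std::size_t>(ch));
    }
};
```

With `AsciiSet setEntity(".#-_:")`, the characters `-` and `_` are members but `<`, `&`, `;`, line ends and non-ASCII bytes are not, so each of those ends the entity in the branch shown in the diff.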
Scintilla bug 810 affected this code but the |
How about adding
state = SCE_H_DEFAULT;
--i;
continue;
Test case:
&<p>
&& |
That works:
|
After experimenting a bit, I prefer the current styling that shows the first invalid character in 'red' so it can be seen as the problem. If it is the start of a valid tag like
Tests:
&
&<p>
&&
<
<<b>
&b.epsi;
&b.epsi!
&カタナ;
&—;
Treating line ends and non-ASCII as early ends (due to character/CRLF slicing):
@@ -1162,6 +1162,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
const CharacterSet setAttributeContinue(CharacterSet::setAlphaNum, ".-_:!#/", true);
// TODO: also handle + and - (except if they're part of ++ or --) and return keywords
const CharacterSet setOKBeforeJSRE(CharacterSet::setNone, "([{=,:;!%^&*|?~");
+ const CharacterSet setEntity(CharacterSet::setAlphaNum, ".#-_:");
int levelPrev = styler.LevelAt(lineCurrent) & SC_FOLDLEVELNUMBERMASK;
int levelCurrent = levelPrev;
@@ -1872,13 +1873,12 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
if (ch == ';') {
styler.ColourTo(i, StateToPrint);
state = SCE_H_DEFAULT;
- }
- if (ch != '#' && !(IsASCII(ch) && isalnum(ch)) // Should check that '#' follows '&', but it is unlikely anyway...
- && ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
- if (!IsASCII(ch)) // Possibly start of a multibyte character so don't allow this byte to be in entity style
- styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
- else
- styler.ColourTo(i, SCE_H_TAGUNKNOWN);
+ } else if (isLineEnd(ch) || !IsASCII(ch)) {
+ styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
+ state = SCE_H_DEFAULT;
+ } else if (!setEntity.Contains(ch)) {
+ // Only allow [A-Za-z0-9.#-_:] in entities
+ styler.ColourTo(i, SCE_H_TAGUNKNOWN);
state = SCE_H_DEFAULT;
}
break;
|
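The three branches of the diff above can be exercised outside the lexer. The following is a simplified, self-contained simulation rather than the real `LexerHTML::Lex`: `Style`, `IsEntityChar` and `StyleEntities` are hypothetical names, the style numbers are placeholders rather than the real `SCE_H_*` values, and only `&` entities are handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

// Placeholder style numbers; the real SCE_H_* values are not reproduced here.
enum Style { Default = 0, Entity = 1, TagUnknown = 2 };

// ASCII alphanumerics plus ".#-_:", mirroring the set used in the diff.
inline bool IsEntityChar(unsigned char ch) {
    return ch < 0x80 && (std::isalnum(ch) || ch == '.' || ch == '#' ||
                         ch == '-' || ch == '_' || ch == ':');
}

// Assign `style` to the half-open byte range [from, to).
static void Fill(std::vector<Style> &styles, std::size_t from, std::size_t to, Style style) {
    assert(from <= to && to <= styles.size());
    for (std::size_t p = from; p < to; p++)
        styles[p] = style;
}

// Returns one style per byte of `text`.
inline std::vector<Style> StyleEntities(std::string_view text) {
    std::vector<Style> styles(text.size(), Default);
    bool inEntity = false;
    std::size_t start = 0;  // position of the '&' that opened the current entity
    for (std::size_t i = 0; i < text.size(); i++) {
        const unsigned char ch = static_cast<unsigned char>(text[i]);
        if (!inEntity) {
            if (ch == '&') {
                inEntity = true;
                start = i;
            }
        } else if (ch == ';') {
            Fill(styles, start, i + 1, Entity);      // well-formed: include the ';'
            inEntity = false;
        } else if (ch == '\r' || ch == '\n' || ch >= 0x80) {
            Fill(styles, start, i, TagUnknown);      // stop before the line end or non-ASCII byte
            inEntity = false;
        } else if (!IsEntityChar(ch)) {
            Fill(styles, start, i + 1, TagUnknown);  // include the first invalid character
            inEntity = false;
        }
    }
    if (inEntity)
        Fill(styles, start, text.size(), Entity);    // simplification: unterminated at end of text
    return styles;
}
```

For the input `"&\r\n"` this sketch gives the `\r` and `\n` bytes the same style, which is the property the original report is about, and for `"&b!x"` the `!` is included in the unknown-tag run so the offending character stays visible.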
with
} else if (!setEntity.Contains(ch)) {
    styler.ColourTo(i - 1, SCE_H_TAGUNKNOWN);
    state = SCE_H_DEFAULT;
    --i;
    continue;
}
<html>
&
&1
&A
&中
&<p>
&1<p>
&A<p>
&中<p>
&&

中
&A;<p>
<p>
中<p>
&
</html>
PS. I think it might be better to treat
|
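The effect of the `--i; continue;` retreat in the snippet above can be shown with a toy state machine. This is a sketch under stated assumptions, not the Lexilla lexer: `St`, `IsEntityChar` and `LexSketch` are hypothetical names, tags are reduced to "everything from `<` to `>`", and the `retreat` flag exists only to compare the two behaviours.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

enum class St { Default, Entity, EntityBad, Tag };

inline bool IsEntityChar(unsigned char ch) {
    return ch < 0x80 && (std::isalnum(ch) || ch == '.' || ch == '#' ||
                         ch == '-' || ch == '_' || ch == ':');
}

// Returns one style per byte. With retreat == true, the byte that ends a
// malformed entity is re-examined in the default state.
inline std::vector<St> LexSketch(std::string_view text, bool retreat) {
    std::vector<St> styles(text.size(), St::Default);
    St state = St::Default;
    std::size_t start = 0;  // start of the current entity or tag
    const auto colour = [&](std::size_t end, St style) {
        assert(start <= end && end <= styles.size());
        for (std::size_t p = start; p < end; p++)
            styles[p] = style;
    };
    for (std::size_t i = 0; i < text.size(); i++) {
        const unsigned char ch = static_cast<unsigned char>(text[i]);
        if (state == St::Default) {
            if (ch == '&' || ch == '<') {
                state = (ch == '&') ? St::Entity : St::Tag;
                start = i;
            }
        } else if (state == St::Tag) {
            if (ch == '>') {
                colour(i + 1, St::Tag);
                state = St::Default;
            }
        } else if (ch == ';') {
            colour(i + 1, St::Entity);
            state = St::Default;
        } else if (!IsEntityChar(ch)) {
            state = St::Default;
            if (retreat) {
                colour(i, St::EntityBad);  // style only the bytes before ch
                --i;                       // safe: i > start here, so i >= 1
                continue;                  // the loop's i++ revisits the same byte
            }
            colour(i + 1, St::EntityBad);  // no retreat: ch is swallowed
        }
    }
    if (state != St::Default)
        colour(text.size(), state);        // simplification for unterminated input
    return styles;
}
```

With the retreat, `&<p>` yields one bad-entity byte followed by a three-byte tag, and a line end after `&` is left entirely in the default style. Without it, the `<` is consumed by the malformed entity and the rest of the tag is lexed as plain text.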
From https://html.spec.whatwg.org/multipage/syntax.html#character-references:
I think it is better to be strict here. |
"Strict" really means "easier to implement" in this case. It does appear that some entities in the HTML 5 specification have no semicolon in their canonical form. $ curl -sL 'https://www.w3.org/TR/html5/entities.json' | jq -r '. | keys[] | select(index(";") | not)'
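&AElig
&AMP
&Aacute
&Acirc
&Agrave
&Aring
&Atilde
&Auml
&COPY
&Ccedil
&ETH
&Eacute
&Ecirc
&Egrave
&Euml
&GT
&Iacute
&Icirc
&Igrave
&Iuml
&LT
&Ntilde
&Oacute
&Ocirc
&Ograve
&Oslash
&Otilde
&Ouml
&QUOT
&REG
&THORN
&Uacute
&Ucirc
&Ugrave
&Uuml
&Yacute
&aacute
&acirc
&acute
&aelig
&agrave
&amp
&aring
&atilde
&auml
&brvbar
&ccedil
&cedil
&cent
&copy
&curren
&deg
&divide
&eacute
&ecirc
&egrave
&eth
&euml
&frac12
&frac14
&frac34
&gt
&iacute
&icirc
&iexcl
&igrave
&iquest
&iuml
&laquo
&lt
&macr
&micro
&middot
&nbsp
&not
&ntilde
&oacute
&ocirc
&ograve
&ordf
&ordm
&oslash
&otilde
&ouml
&para
&plusmn
&pound
&quot
&raquo
&reg
&sect
&shy
&sup1
&sup2
&sup3
&szlig
&thorn
&times
&uacute
&ucirc
&ugrave
&uml
&uuml
&yacute
&yen
&yuml
But then, each of these also has a fully-delimited variant, so the most the implementation would lack is an optionally briefer syntax. |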
strict is fine. I prefer highlight only |
The HTML 5 specification is the same as WHATWG: While an entity may be defined without a semicolon, the syntax doesn't appear to allow that to be used within HTML. If it were allowed, the syntax would need to be defined somewhere so that lexing code knows where to stop, and thus what to discard and where to restart. |
Folding is more stable by retreating for an invalid entity character with |
SCE_H_TAGUNKNOWN bleeds into the first non-printing character adjacent to a malformed entity:
lexilla/lexers/LexHTML.cxx, lines 1876 to 1883 in 5ffca56
In CRLF mode, this will split the EOL into different styles:
Text (SC_EOL_CRLF) | Computed Styles | Visual Styles | Test Output
This patch introduces back-tracking to only style the visible characters of the malformed entity:
0001-192-Prevent-styling-EOLs-as-SCE_H_TAGUNKNOWN.patch.txt
0001-Test-Prevent-styling-EOLs-as-SCE_H_TAGUNKNOWN.diff.txt