Links followed by escaped quote breaks JSON parsing. #72

JC3 · 2022-04-03T05:09:04Z

Originally reported at notepad-plus-plus/notepad-plus-plus#11467 but was told to post it here instead:

It seems that the JSON lexer chokes on the following JSON document:

{
    "a": "1",
    "problem": "http://link\"}",
    "b": "2",
    "c": "3"
}

It misparses the } after the \" as a JSON token (e.g. in Notepad++):

At a glance it seems like it thinks that backslashes are part of links, and so it treats the \ as part of the link rather than as an escape character for the following ", and so it ends the string early. (Real reason is in comments.)

I ran into this in the wild in a HAR file that contained a value that ended with a similar sequence (value is escaped JSON, and that JSON contains a value that is a URL):

"value": "{\"group\":\"hosted-libraries-pushers\",\"max_age\":2592000,\"endpoints\":[{\"url\":\"https://csp.withgoogle.com/csp/report-to/hosted-libraries-pushers\"}]}"

Version

It's the version that ships with Notepad++ 8.3.3, 64-bit (Windows 10).

SciLexer.dll (I'm assuming that's the Scintilla lib) says 4.4.6.0, product version field says "4.4.6 for Notepad++".

The text was updated successfully, but these errors were encountered:

nyamatongwe · 2022-04-03T06:24:25Z

The \ breaks out of the link as its not in setURL but doesn't do the context.Forward(); from line 347 that throws away the meaning of the following ". It also doesn't try to highlight the \" escape sequence as SCE_JSON_ESCAPESEQUENCE which it does if http: is removed.. Handling this case is complicated by sharing code with SCE_JSON_LDKEYWORD.

JC3 · 2022-04-03T17:58:56Z

Fwiw (if it influences a decision to look into it -- note I'm not familiar with lexilla at all and don't have a concept of how complex a fix would be), while it's an edge case, it's also not completely uncommon.

Since it's not dependent on the }, it'll happen anywhere a link is followed by an escaped quote (i.e. not limited to when the offending value is the last one in an object), which means, for example, any JSON document that contains a value that is a quoted URL escaped for JSON storage. Common examples include (but are not limited to):

Any JSON document that contains a value that is itself escaped JSON, where the escaped JSON contains a value that is a URL. This is not uncommon these days with the popularity and increased complexity of web apps, node.js, etc.
Any JSON document that contains a string value that is, say, an HTML element with a URL attribute (like an <a> tag or something).
Perhaps a JSON document contains large amounts of user-specified text content, and the user typed a link in quotes as part of the content.

Also, there isn't really a workaround. E.g. the HAR file I mentioned in my OP, which I was viewing in Notepad++, had 8 or 9 occurrences of it, and I actually ended up having to just not use Notepad++ for that case, as formatting / navigating the large document was too difficult. And since the issue happens on the lexer level, there isn't a way for a client to influence it (e.g. Notepad++'s "clickable link" option does not -- and can not -- affect the issue), i.e. there's no clean way for Notepad++ to implement a workaround even if it wanted to.

So, if it matters, because it isn't super rare, and it can't be worked around, I think it's ... moderately important. 🤷‍♂️

There's also the philosophical reason that, edge case or not, it is currently misparsing valid JSON (and does so when parsing URLs, which aren't recognized as special in the JSON spec). So if that's important, it's another reason.

mpheath · 2022-04-04T05:22:48Z

The \ breaks out of the link as its not in setURL but doesn't do the context.Forward(); from line 347 that throws away the meaning of the following ". It also doesn't try to highlight the \" escape sequence as SCE_JSON_ESCAPESEQUENCE which it does if http: is removed.. Handling this case is complicated by sharing code with SCE_JSON_LDKEYWORD.

Complicated, like " are not used in urls, so the escape would not be there, instead would use %22 which would not display the issue.

http:// is matched, sets state SCE_JSON_URI, then later finds the escaped double quote which is malformed for a url. Passes past the escape \, sees the double quote and considers it as now a closed string. Perhaps it should display red to indicate a malformed url... oh, it does.

"value": "{\"group\":\"hosted-libraries-pushers\",\"max_age\":2592000,\"endpoints\":[{\"url\":\"https://csp.withgoogle.com/csp/report-to/hosted-libraries-pushers\"}]}"

Does not display the issue as the url is valid. Am I missing something here?

I see the escaped double quote issue generated, though the ideas behind it's creation based on a malformed url seems flawed.

JC3 · 2022-04-04T12:18:19Z

"value": "{\"group\":\"hosted-libraries-pushers\",\"max_age\":2592000,\"endpoints\":[{\"url\":\"https://csp.withgoogle.com/csp/report-to/hosted-libraries-pushers\"}]}"
Does not display the issue as the url is valid. Am I missing something here?

I don't know if you are missing something; I do know that it shows the issue for me, in Notepad++ anyways:

The first two show the issue. In the third, I removed the 'p' from 'https' in the URL (so it is not parsed as a link), and the issue disappears.

Note the syntax coloring:

And the effect when collapsed (the highlighted lines should not be visible) (not shown: top level collapse also fails):

I see the escaped double quote issue generated, though the ideas behind it's creation based on a malformed url seems flawed.

It could be a red herring but... the examples above show that when a link is not parsed, the issue does not appear.

JC3 · 2022-04-04T12:21:48Z

~~@mpheath PS: It's Scintilla 5.2.2, lexilla 5.1.6 (so I've been told -- I should probably verify that...), just to make sure we're all on the same page. Sorry for omitting that in the OP.~~

rdipardo · 2022-04-04T20:42:18Z

It's Scintilla 5.2.2, lexilla 5.1.6 (so I've been told . . . )

To clarify, that was a capture of SciTE. The OP didn't specify a Notepad++ version, but all current releases consume Scintilla 4.4.6.

While that should not invalidate the comparison (LexJSON hasn't changed much in several years), I do notice a difference in how the trailing object tokens are styled.

Per SciTE, the state is consistently SCE_JSON_ERROR (the reason I chose that editor for illustration).
But in Notepad++ it's a dull gray that matches the SCE_JSON_STRINGEOL state in the default style palette:

There are 3 code paths resulting in SCE_JSON_STRINGEOL ¹,²,³; all of them depend on StyleContext::atLineEnd being true. That value is known to be inaccurately determined in the presence of DOS EOLs (i.e., \r\n, \x0d\x0a, or CRLF, depending on the representation). I briefly thought it was DOS EOLs causing SCE_JSON_STRINGEOL to be applied, with Unix ones (LF) precluding a match at those 3 branches and falling through to SCE_JSON_ERROR; but that's evidently not the case, at least for the current version of Notepad++:

To be sure, the application acknowledges the SCE_JSON_ERROR lexical class; it's just not applied in this particular case, for reasons that are unique to Notepad++.

JC3 · 2022-04-04T21:22:57Z

To clarify, that was a capture of SciTE. The OP didn't specify a Notepad++ version, but all current releases consume Scintilla 4.4.6.

I'm sorry. I swear I'm usually better at bug reports... 😂

Notepad++ 8.3.3, 64-bit, Windows 10

And SciLexer.dll (I'm assuming that's the Scintilla lib) says 4.4.6.0, product version field says "4.4.6 for Notepad++".

nyamatongwe · 2022-04-04T22:04:55Z

There is a JSON specification available. It describes strings but not URLs. URL recognition within strings adds value but should not continue past the boundary of the string since strings may include invalid URL fragments.

Reasonable approaches for the invalid URL are to treat it as:

error: the invalid URL and the following text within the string display as SCE_JSON_ERROR
text: revert the invalid URL to SCE_JSON_STRING

mpheath · 2022-04-05T05:15:05Z

@JC3 , @rdipardo

Thanks for the images and details. I see differences between the grey and the black text in Notepad++.

The appearance in SciTE looks deceptive in comparison:

as it displays as string style at the end. To my eyes, cannot see an issue. Edit: Now seeing red EOLFilled with testing a new compiled build (facepalm).

Testlexers shows a problem with styling in Lexilla 5.1.6.

{8}{{0}
    {4}"k"{8}:{0} {2}"v"{8},{0}
    {4}"value"{8}:{0} {2}"{\"group\":\"hosted-libraries-pushers\",\"max_age\":2592000,\"endpoints\":[{\"url\":\"{9}https://csp.withgoogle.com/csp/report-to/hosted-libraries-pushers{2}\"{8}}]}{3}"
{8}}{0}

Note {8}}]}{3} as operator style set in the string, then unclosed string style.

ID	Description
{0}	default
{2}	string
{3}	unclosed string
{4}	property name
{8}	operator

@nyamatongwe

The URL is decoration only styling, so it is a feature. What matters most is the escaping required for JSON. To be more precise as far as the lexer in involved, is the escaped double quote \". The option between 1 and 2 is 2 in my view. It is an error which the URL feature currently does unfortunately. If the URL code was commented out, then probably would be no issue and no error condition. This leaves option 1 as an error that in comparison with the JSON specification considers as good syntax and as such, should not be an error set by the lexer. This means invalid URLs IMO should be styled as a string. Well, that is my current opinion.

As the Testlexers example shows, this is not just about invalid URLs. This will need some code scutiny to fix.

JC3 · 2022-04-05T12:42:42Z

@nyamatongwe

There is a JSON specification available. It describes strings but not URLs. URL recognition within strings adds value but should not continue past the boundary of the string since strings may include invalid URL fragments.

Reasonable approaches for the invalid URL are to treat it as:

error: the invalid URL and the following text within the string display as SCE_JSON_ERROR

text: revert the invalid URL to SCE_JSON_STRING

Well, the issue actually only appears with valid URLs (well, URLs recognized as links), and if I've understood correctly, it's more about the escape sequence handling than the URL itself. When the URL is not parseable as a link, it seems OK.

As an aside though, your option 2 would be the appropriate choice in the cases you describe, since malformed URLs are not a JSON error.

@mpheath

This means invalid URLs IMO should be styled as a string. Well, that is my current opinion.

Well, any other opinions besides that one are demonstrably inferior so... good choice of opinion. 😂

JC3 · 2022-04-05T13:06:45Z

Also, fwiw, philosophically, link parsing shouldn't be done on the lexer level at all if links aren't part of the language's vocabulary. That's more a post-lexer text-processing step, really. When the two are entangled, it'll always be work to discover / test for / avoid a conflict, whereas in a postprocessing, metadata-adding step on stringish tokens, there can't be any conflicts.

But, I acknowledge that it is convenient and also already implemented.

mpheath · 2022-04-05T15:20:03Z

This is my attempt at a fix:

LexJSON.zip

URLs do not get processed with property lexer.json.escape.sequence=1 set. Didn't before, does not now. Looked into it, could not change it to process URLs. Probably never implemented as escapes can be used almost anywhere in the JSON document. This would probably make the code very complicated.

Note:

lexilla/lexers/LexJSON.cxx

Line 314 in 0c84f25

context.ForwardSetState(SCE_C_DEFAULT);

Looks like an strange constant for the JSON lexer. Perhaps should be SCE_JSON_DEFAULT. Not sure if was added as intentional or was a typo.

rdipardo · 2022-04-05T19:44:39Z

Perhaps should be SCE_JSON_DEFAULT. Not sure if was added as intentional or was a typo.

The latter, and a harmless one at that: SCE_C_DEFAULT == 0 == SCE_JSON_DEFAULT

rdipardo · 2022-04-05T22:04:58Z

Here's a more minimalistic attempt:

diff --git a/lexers/LexJSON.cxx b/lexers/LexJSON.cxx
index 92616715..967e9b93 100644
--- a/lexers/LexJSON.cxx
+++ b/lexers/LexJSON.cxx
@@ -310,8 +310,7 @@ void SCI_METHOD LexerJSON::Lex(Sci_PositionU startPos,
 					break;
 				}
 				if (context.ch == '"') {
-					context.SetState(stringStyleBefore);
-					context.ForwardSetState(SCE_C_DEFAULT);
+					context.ForwardSetState(SCE_JSON_STRING);
 				} else if (context.ch == '\\') {
 					if (!escapeSeq.newSequence(context.chNext)) {
 						context.SetState(SCE_JSON_ERROR);
@@ -370,7 +369,11 @@ void SCI_METHOD LexerJSON::Lex(Sci_PositionU startPos,
 				if ((!setKeywordJSONLD.Contains(context.ch) &&
 					 (context.state == SCE_JSON_LDKEYWORD)) ||
 					(!setURL.Contains(context.ch))) {
-					context.SetState(stringStyleBefore);
+					if (context.ch == '\\') {
+						context.SetState(options.escapeSequence ? SCE_JSON_ESCAPESEQUENCE : SCE_JSON_STRING);
+						break;
+					} else
+						context.SetState(stringStyleBefore);
 				}
 				if (context.ch == '"') {
 					context.ForwardSetState(SCE_JSON_DEFAULT);

This does however overload the significance of lexer.json.escape.sequence to include: "Allow improper URLs when 1, disallow them when 0":

It also doesn't get around the mutual exclusivity illustrated by comparison of @mpheath's embedded.json (distinct URL) and embedded_1.json (everything a string) — although invalid URLs are more obviously so:

mpheath · 2022-04-06T03:11:16Z

I tracked down the difference between views in SciTE and Notepad++

SciTE with json.properties enables the lexer.json... options by default.
Notepad++ does not use json.properties so it is probably uses the options as disabled.

The options have an affect with the cases.

			case SCE_JSON_LDKEYWORD:
			case SCE_JSON_URI:

These cases are not selected with lexer.json.escape.sequence=1.

@rdipardo

The latter, and a harmless one at that: SCE_C_DEFAULT == 0 == SCE_JSON_DEFAULT

I would option to change to SCE_JSON_DEFAULT. Using some other lexers constants is a little naughty :) The maintainer might want to do this separately.

This does however overload the significance of lexer.json.escape.sequence to include: "Allow improper URLs when 1, disallow them when 0":

Nice for the url, though the trailing double quote perhaps makes the url invalid. What I sort of try to do is handle no double quotes, paired double quotes and regard the trailing quote on it's own as an invalid url. The leading double quotes need to be recognized to determine if the trailing double quotes are valid, which determines if the url is valid to style as string or as url.

It also doesn't get around the mutual exclusivity illustrated by comparison of @mpheath's embedded.json (distinct URL) and embedded_1.json (everything a string) — although invalid URLs are more obviously so:

I appreciate the pros/cons mentioned. Any improvements are welcome.

Changing the ForwardSetState() parameter from SCE_C_DEFAULT to SCE_JSON_STRING is a bold move. Lots of code happens in case SCE_JSON_DEFAULT that may cause side effects if not selected like the original author of the lexer intended. I chose not to go there as these lexers can be like a house of cards and can collapse quite easy.

IIRC with my debug output, the escape with \ sets SCE_JSON_ESCAPESEQUENCE and the " is consumed, like Neil mentioned at the start. So the if (context.ch == '"') { condition is apparently for closing the string. This is why the following ForwardSetState() sets SCE_C_DEFAULT. The test file processed_esc.json shows the ill effect causing an unclosed string with \"" if SCE_JSON_STRING is set instead.

rdipardo · 2022-04-06T05:11:44Z

@mpheath,

What I sort of try to do is handle no double quotes, paired double quotes and regard the trailing quote on it's own as an invalid url.

That's a job for a struct that can reason about it's state, the way CompactIRI and EscapeSequence already do.

It also doesn't get around the mutual exclusivity illustrated by comparison of @mpheath's embedded.json (distinct URL) and embedded_1.json (everything a string) — although invalid URLs are more obviously so:

I appreciate the pros/cons mentioned. Any improvements are welcome.

The observation was in reference to the constraints of the problem, not your solution. At some point a choice has to made between pedantic correctness and the kind of JSON that real programs like NPM consume. I don't have any idea how both can be satisfied at once, and the lexer already suffers from trying to do too much.

…ine end characters so both CR and LF are in DEFAULT.

nyamatongwe · 2022-04-07T00:07:08Z

Added an AllStyles.json file to examples to show all styles and act as a basic test. Fixed code so that there is no split CRLF styling on comment and error lines.

Switched SCE_C_DEFAULT to ~~SCE_JSON_STRING~~ (edited) SCE_JSON_DEFAULT.

@rdipardo, the 'minimalistic attempt' changes interpretation of (enabled) escape sequence in AllStyles.json to not return to string state for ending ". The start and end " styles should match as sometimes code tries to search for delimited elements.

nyamatongwe · 2022-04-07T00:15:25Z

Some of the unusual elements identified by this lexer (SCE_JSON_LDKEYWORD, SCE_JSON_COMPACTIRI) are defined by JSON-LD.

JC3 · 2022-04-08T18:06:59Z

Some of the unusual elements identified by this lexer (SCE_JSON_LDKEYWORD, SCE_JSON_COMPACTIRI) are defined by JSON-LD.

JSON-LD is neither an update nor an extension to JSON. It is a separate specification of a JSON-based schema. Its relation to JSON is the same as, say, the relation of SVG to XML.

Therefore the JSON lexer shouldn't deal with JSON-LD at all, just as it doesn't deal with LIME, JSON-RPC, etc.

Those are all useful schemas, but really shouldn't be handled at the lexing level anyways (ideally, they're post-lex/parse steps) -- although if they were to be handled during lexing, they would certainly demand their own lexers rather than being part of the general JSON lexer.

JC3 mentioned this issue Apr 3, 2022

Hyperlinks followed by escaped quote breaks parsing of valid JSON notepad-plus-plus/notepad-plus-plus#11467

Open

nyamatongwe added the json Caused by the json lexer label Apr 3, 2022

nyamatongwe added a commit that referenced this issue Apr 6, 2022

#72 Add test for json lexer. Terminate LINECOMMENT and ERROR before l…

63e0ea8

…ine end characters so both CR and LF are in DEFAULT.

nyamatongwe added a commit that referenced this issue Apr 6, 2022

#72 Replace cpp lexer constant with correct json lexer constant.

9a9c395

nyamatongwe added a commit that referenced this issue Apr 6, 2022

#72 Treat json files as text so see native line ends and match tests.

1024b08

nyamatongwe added a commit that referenced this issue Apr 6, 2022

#72 Fix line ends for example file.

97eb056

rdipardo mentioned this issue Apr 27, 2022

Add Prolog filetype support geany/geany#3171

Open

rdipardo mentioned this issue May 30, 2022

JSON syntax highlighting error when escaping a quote right after a link notepad-plus-plus/notepad-plus-plus#11746

Closed

rdipardo mentioned this issue Apr 14, 2023

HTML link in JSON notepad-plus-plus/notepad-plus-plus#13522

Closed

rdipardo mentioned this issue Oct 25, 2023

Convert JSON to builtin filetype and use Scintilla JSON lexer for it geany/geany#3647

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Links followed by escaped quote breaks JSON parsing. #72

Links followed by escaped quote breaks JSON parsing. #72

JC3 commented Apr 3, 2022 •

edited

nyamatongwe commented Apr 3, 2022

JC3 commented Apr 3, 2022 •

edited

mpheath commented Apr 4, 2022

JC3 commented Apr 4, 2022 •

edited

JC3 commented Apr 4, 2022 •

edited

rdipardo commented Apr 4, 2022

JC3 commented Apr 4, 2022 •

edited

nyamatongwe commented Apr 4, 2022

mpheath commented Apr 5, 2022 •

edited

JC3 commented Apr 5, 2022 •

edited

JC3 commented Apr 5, 2022 •

edited

mpheath commented Apr 5, 2022

rdipardo commented Apr 5, 2022

rdipardo commented Apr 5, 2022

mpheath commented Apr 6, 2022

rdipardo commented Apr 6, 2022

nyamatongwe commented Apr 7, 2022 •

edited

nyamatongwe commented Apr 7, 2022

JC3 commented Apr 8, 2022 •

edited

Links followed by escaped quote breaks JSON parsing. #72

Links followed by escaped quote breaks JSON parsing. #72

Comments

JC3 commented Apr 3, 2022 • edited

Version

nyamatongwe commented Apr 3, 2022

JC3 commented Apr 3, 2022 • edited

mpheath commented Apr 4, 2022

JC3 commented Apr 4, 2022 • edited

JC3 commented Apr 4, 2022 • edited

rdipardo commented Apr 4, 2022

Footnotes

JC3 commented Apr 4, 2022 • edited

nyamatongwe commented Apr 4, 2022

mpheath commented Apr 5, 2022 • edited

JC3 commented Apr 5, 2022 • edited

JC3 commented Apr 5, 2022 • edited

mpheath commented Apr 5, 2022

rdipardo commented Apr 5, 2022

rdipardo commented Apr 5, 2022

mpheath commented Apr 6, 2022

rdipardo commented Apr 6, 2022

nyamatongwe commented Apr 7, 2022 • edited

nyamatongwe commented Apr 7, 2022

JC3 commented Apr 8, 2022 • edited

JC3 commented Apr 3, 2022 •

edited

JC3 commented Apr 3, 2022 •

edited

JC3 commented Apr 4, 2022 •

edited

JC3 commented Apr 4, 2022 •

edited

JC3 commented Apr 4, 2022 •

edited

mpheath commented Apr 5, 2022 •

edited

JC3 commented Apr 5, 2022 •

edited

JC3 commented Apr 5, 2022 •

edited

nyamatongwe commented Apr 7, 2022 •

edited

JC3 commented Apr 8, 2022 •

edited