Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Links followed by escaped quote breaks JSON parsing. #72

Open
JC3 opened this issue Apr 3, 2022 · 19 comments
Open

Links followed by escaped quote breaks JSON parsing. #72

JC3 opened this issue Apr 3, 2022 · 19 comments
Labels
json Caused by the json lexer

Comments

@JC3
Copy link

JC3 commented Apr 3, 2022

Originally reported at notepad-plus-plus/notepad-plus-plus#11467 but was told to post it here instead:


It seems that the JSON lexer chokes on the following JSON document:

{
    "a": "1",
    "problem": "http://link\"}",
    "b": "2",
    "c": "3"
}

It misparses the } after the \" as a JSON token (e.g. in Notepad++):

image

At a glance it seems like it thinks that backslashes are part of links, and so it treats the \ as part of the link rather than as an escape character for the following ", and so it ends the string early. (Real reason is in comments.)


I ran into this in the wild in a HAR file that contained a value that ended with a similar sequence (value is escaped JSON, and that JSON contains a value that is a URL):

"value": "{\"group\":\"hosted-libraries-pushers\",\"max_age\":2592000,\"endpoints\":[{\"url\":\"https://csp.withgoogle.com/csp/report-to/hosted-libraries-pushers\"}]}"

Version

It's the version that ships with Notepad++ 8.3.3, 64-bit (Windows 10).

SciLexer.dll (I'm assuming that's the Scintilla lib) says 4.4.6.0, product version field says "4.4.6 for Notepad++".

@nyamatongwe
Copy link
Member

The \ breaks out of the link as its not in setURL but doesn't do the context.Forward(); from line 347 that throws away the meaning of the following ". It also doesn't try to highlight the \" escape sequence as SCE_JSON_ESCAPESEQUENCE which it does if http: is removed.. Handling this case is complicated by sharing code with SCE_JSON_LDKEYWORD.

@nyamatongwe nyamatongwe added the json Caused by the json lexer label Apr 3, 2022
@JC3
Copy link
Author

JC3 commented Apr 3, 2022

Fwiw (if it influences a decision to look into it -- note I'm not familiar with lexilla at all and don't have a concept of how complex a fix would be), while it's an edge case, it's also not completely uncommon.

Since it's not dependent on the }, it'll happen anywhere a link is followed by an escaped quote (i.e. not limited to when the offending value is the last one in an object), which means, for example, any JSON document that contains a value that is a quoted URL escaped for JSON storage. Common examples include (but are not limited to):

  • Any JSON document that contains a value that is itself escaped JSON, where the escaped JSON contains a value that is a URL. This is not uncommon these days with the popularity and increased complexity of web apps, node.js, etc.
  • Any JSON document that contains a string value that is, say, an HTML element with a URL attribute (like an <a> tag or something).
  • Perhaps a JSON document contains large amounts of user-specified text content, and the user typed a link in quotes as part of the content.

Also, there isn't really a workaround. E.g. the HAR file I mentioned in my OP, which I was viewing in Notepad++, had 8 or 9 occurrences of it, and I actually ended up having to just not use Notepad++ for that case, as formatting / navigating the large document was too difficult. And since the issue happens on the lexer level, there isn't a way for a client to influence it (e.g. Notepad++'s "clickable link" option does not -- and can not -- affect the issue), i.e. there's no clean way for Notepad++ to implement a workaround even if it wanted to.

So, if it matters, because it isn't super rare, and it can't be worked around, I think it's ... moderately important. 🤷‍♂️

There's also the philosophical reason that, edge case or not, it is currently misparsing valid JSON (and does so when parsing URLs, which aren't recognized as special in the JSON spec). So if that's important, it's another reason.

@mpheath
Copy link
Contributor

mpheath commented Apr 4, 2022

The \ breaks out of the link as its not in setURL but doesn't do the context.Forward(); from line 347 that throws away the meaning of the following ". It also doesn't try to highlight the \" escape sequence as SCE_JSON_ESCAPESEQUENCE which it does if http: is removed.. Handling this case is complicated by sharing code with SCE_JSON_LDKEYWORD.

Complicated, like " are not used in urls, so the escape would not be there, instead would use %22 which would not display the issue.

http:// is matched, sets state SCE_JSON_URI, then later finds the escaped double quote which is malformed for a url. Passes past the escape \, sees the double quote and considers it as now a closed string. Perhaps it should display red to indicate a malformed url... oh, it does.

"value": "{\"group\":\"hosted-libraries-pushers\",\"max_age\":2592000,\"endpoints\":[{\"url\":\"https://csp.withgoogle.com/csp/report-to/hosted-libraries-pushers\"}]}"

Does not display the issue as the url is valid. Am I missing something here?

I see the escaped double quote issue generated, though the ideas behind it's creation based on a malformed url seems flawed.

@JC3
Copy link
Author

JC3 commented Apr 4, 2022

"value": "{\"group\":\"hosted-libraries-pushers\",\"max_age\":2592000,\"endpoints\":[{\"url\":\"https://csp.withgoogle.com/csp/report-to/hosted-libraries-pushers\"}]}"

Does not display the issue as the url is valid. Am I missing something here?

I don't know if you are missing something; I do know that it shows the issue for me, in Notepad++ anyways:

image

The first two show the issue. In the third, I removed the 'p' from 'https' in the URL (so it is not parsed as a link), and the issue disappears.

Note the syntax coloring:

image

And the effect when collapsed (the highlighted lines should not be visible) (not shown: top level collapse also fails):

image

I see the escaped double quote issue generated, though the ideas behind it's creation based on a malformed url seems flawed.

It could be a red herring but... the examples above show that when a link is not parsed, the issue does not appear.

@JC3
Copy link
Author

JC3 commented Apr 4, 2022

@mpheath PS: It's Scintilla 5.2.2, lexilla 5.1.6 (so I've been told -- I should probably verify that...), just to make sure we're all on the same page. Sorry for omitting that in the OP.

@rdipardo
Copy link
Contributor

rdipardo commented Apr 4, 2022

It's Scintilla 5.2.2, lexilla 5.1.6 (so I've been told . . . )

To clarify, that was a capture of SciTE. The OP didn't specify a Notepad++ version, but all current releases consume Scintilla 4.4.6.

While that should not invalidate the comparison (LexJSON hasn't changed much in several years), I do notice a difference in how the trailing object tokens are styled.

Per SciTE, the state is consistently SCE_JSON_ERROR (the reason I chose that editor for illustration).
But in Notepad++ it's a dull gray that matches the SCE_JSON_STRINGEOL state in the default style palette:

npp_sce_json_str_eol

There are 3 code paths resulting in SCE_JSON_STRINGEOL 1,2,3; all of them depend on StyleContext::atLineEnd being true. That value is known to be inaccurately determined in the presence of DOS EOLs (i.e., \r\n, \x0d\x0a, or CRLF, depending on the representation). I briefly thought it was DOS EOLs causing SCE_JSON_STRINGEOL to be applied, with Unix ones (LF) precluding a match at those 3 branches and falling through to SCE_JSON_ERROR; but that's evidently not the case, at least for the current version of Notepad++:

npp_sce_json_string_eol_crlf

npp_sce_json_string_eol_lf

To be sure, the application acknowledges the SCE_JSON_ERROR lexical class; it's just not applied in this particular case, for reasons that are unique to Notepad++.

Footnotes

  1. https://github.com/ScintillaOrg/lexilla/blob/0c84f2542965911df68888b23c5b3b1d6f019852/lexers/LexJSON.cxx#L322-L324

  2. https://github.com/ScintillaOrg/lexilla/blob/0c84f2542965911df68888b23c5b3b1d6f019852/lexers/LexJSON.cxx#L337-L339

  3. https://github.com/ScintillaOrg/lexilla/blob/0c84f2542965911df68888b23c5b3b1d6f019852/lexers/LexJSON.cxx#L375-L379

@JC3
Copy link
Author

JC3 commented Apr 4, 2022

To clarify, that was a capture of SciTE. The OP didn't specify a Notepad++ version, but all current releases consume Scintilla 4.4.6.

I'm sorry. I swear I'm usually better at bug reports... 😂

Notepad++ 8.3.3, 64-bit, Windows 10

And SciLexer.dll (I'm assuming that's the Scintilla lib) says 4.4.6.0, product version field says "4.4.6 for Notepad++".

@nyamatongwe
Copy link
Member

There is a JSON specification available. It describes strings but not URLs. URL recognition within strings adds value but should not continue past the boundary of the string since strings may include invalid URL fragments.

Reasonable approaches for the invalid URL are to treat it as:

  1. error: the invalid URL and the following text within the string display as SCE_JSON_ERROR
  2. text: revert the invalid URL to SCE_JSON_STRING

@mpheath
Copy link
Contributor

mpheath commented Apr 5, 2022

@JC3 , @rdipardo

Thanks for the images and details. I see differences between the grey and the black text in Notepad++.

The appearance in SciTE looks deceptive in comparison:

json3

as it displays as string style at the end. To my eyes, cannot see an issue. Edit: Now seeing red EOLFilled with testing a new compiled build (facepalm).

Testlexers shows a problem with styling in Lexilla 5.1.6.

{8}{{0}
    {4}"k"{8}:{0} {2}"v"{8},{0}
    {4}"value"{8}:{0} {2}"{\"group\":\"hosted-libraries-pushers\",\"max_age\":2592000,\"endpoints\":[{\"url\":\"{9}https://csp.withgoogle.com/csp/report-to/hosted-libraries-pushers{2}\"{8}}]}{3}"
{8}}{0}

Note {8}}]}{3} as operator style set in the string, then unclosed string style.

ID Description
{0} default
{2} string
{3} unclosed string
{4} property name
{8} operator

@nyamatongwe

The URL is decoration only styling, so it is a feature. What matters most is the escaping required for JSON. To be more precise as far as the lexer in involved, is the escaped double quote \". The option between 1 and 2 is 2 in my view. It is an error which the URL feature currently does unfortunately. If the URL code was commented out, then probably would be no issue and no error condition. This leaves option 1 as an error that in comparison with the JSON specification considers as good syntax and as such, should not be an error set by the lexer. This means invalid URLs IMO should be styled as a string. Well, that is my current opinion.

As the Testlexers example shows, this is not just about invalid URLs. This will need some code scutiny to fix.

@JC3
Copy link
Author

JC3 commented Apr 5, 2022

@nyamatongwe

There is a JSON specification available. It describes strings but not URLs. URL recognition within strings adds value but should not continue past the boundary of the string since strings may include invalid URL fragments.

Reasonable approaches for the invalid URL are to treat it as:

  1. error: the invalid URL and the following text within the string display as SCE_JSON_ERROR
  2. text: revert the invalid URL to SCE_JSON_STRING

Well, the issue actually only appears with valid URLs (well, URLs recognized as links), and if I've understood correctly, it's more about the escape sequence handling than the URL itself. When the URL is not parseable as a link, it seems OK.

As an aside though, your option 2 would be the appropriate choice in the cases you describe, since malformed URLs are not a JSON error.

@mpheath

This means invalid URLs IMO should be styled as a string. Well, that is my current opinion.

Well, any other opinions besides that one are demonstrably inferior so... good choice of opinion. 😂

@JC3
Copy link
Author

JC3 commented Apr 5, 2022

Also, fwiw, philosophically, link parsing shouldn't be done on the lexer level at all if links aren't part of the language's vocabulary. That's more a post-lexer text-processing step, really. When the two are entangled, it'll always be work to discover / test for / avoid a conflict, whereas in a postprocessing, metadata-adding step on stringish tokens, there can't be any conflicts.

But, I acknowledge that it is convenient and also already implemented.

@mpheath
Copy link
Contributor

mpheath commented Apr 5, 2022

This is my attempt at a fix:

json

LexJSON.zip

URLs do not get processed with property lexer.json.escape.sequence=1 set. Didn't before, does not now. Looked into it, could not change it to process URLs. Probably never implemented as escapes can be used almost anywhere in the JSON document. This would probably make the code very complicated.

Note:

context.ForwardSetState(SCE_C_DEFAULT);

Looks like an strange constant for the JSON lexer. Perhaps should be SCE_JSON_DEFAULT. Not sure if was added as intentional or was a typo.

@rdipardo
Copy link
Contributor

rdipardo commented Apr 5, 2022

Perhaps should be SCE_JSON_DEFAULT. Not sure if was added as intentional or was a typo.

The latter, and a harmless one at that: SCE_C_DEFAULT == 0 == SCE_JSON_DEFAULT

@rdipardo
Copy link
Contributor

rdipardo commented Apr 5, 2022

Here's a more minimalistic attempt:

diff --git a/lexers/LexJSON.cxx b/lexers/LexJSON.cxx
index 92616715..967e9b93 100644
--- a/lexers/LexJSON.cxx
+++ b/lexers/LexJSON.cxx
@@ -310,8 +310,7 @@ void SCI_METHOD LexerJSON::Lex(Sci_PositionU startPos,
 					break;
 				}
 				if (context.ch == '"') {
-					context.SetState(stringStyleBefore);
-					context.ForwardSetState(SCE_C_DEFAULT);
+					context.ForwardSetState(SCE_JSON_STRING);
 				} else if (context.ch == '\\') {
 					if (!escapeSeq.newSequence(context.chNext)) {
 						context.SetState(SCE_JSON_ERROR);
@@ -370,7 +369,11 @@ void SCI_METHOD LexerJSON::Lex(Sci_PositionU startPos,
 				if ((!setKeywordJSONLD.Contains(context.ch) &&
 					 (context.state == SCE_JSON_LDKEYWORD)) ||
 					(!setURL.Contains(context.ch))) {
-					context.SetState(stringStyleBefore);
+					if (context.ch == '\\') {
+						context.SetState(options.escapeSequence ? SCE_JSON_ESCAPESEQUENCE : SCE_JSON_STRING);
+						break;
+					} else
+						context.SetState(stringStyleBefore);
 				}
 				if (context.ch == '"') {
 					context.ForwardSetState(SCE_JSON_DEFAULT);

This does however overload the significance of lexer.json.escape.sequence to include: "Allow improper URLs when 1, disallow them when 0":

scite_5222_issue_72_patch_demo

It also doesn't get around the mutual exclusivity illustrated by comparison of @mpheath's embedded.json (distinct URL) and embedded_1.json (everything a string) — although invalid URLs are more obviously so:

scite_522_issue_72_embedded_and_no_compared

@mpheath
Copy link
Contributor

mpheath commented Apr 6, 2022

I tracked down the difference between views in SciTE and Notepad++

  • SciTE with json.properties enables the lexer.json... options by default.
  • Notepad++ does not use json.properties so it is probably uses the options as disabled.

The options have an affect with the cases.

			case SCE_JSON_LDKEYWORD:
			case SCE_JSON_URI:

These cases are not selected with lexer.json.escape.sequence=1.


@rdipardo

The latter, and a harmless one at that: SCE_C_DEFAULT == 0 == SCE_JSON_DEFAULT

I would option to change to SCE_JSON_DEFAULT. Using some other lexers constants is a little naughty :) The maintainer might want to do this separately.


This does however overload the significance of lexer.json.escape.sequence to include: "Allow improper URLs when 1, disallow them when 0":

Nice for the url, though the trailing double quote perhaps makes the url invalid. What I sort of try to do is handle no double quotes, paired double quotes and regard the trailing quote on it's own as an invalid url. The leading double quotes need to be recognized to determine if the trailing double quotes are valid, which determines if the url is valid to style as string or as url.

It also doesn't get around the mutual exclusivity illustrated by comparison of @mpheath's embedded.json (distinct URL) and embedded_1.json (everything a string) — although invalid URLs are more obviously so:

I appreciate the pros/cons mentioned. Any improvements are welcome.

Changing the ForwardSetState() parameter from SCE_C_DEFAULT to SCE_JSON_STRING is a bold move. Lots of code happens in case SCE_JSON_DEFAULT that may cause side effects if not selected like the original author of the lexer intended. I chose not to go there as these lexers can be like a house of cards and can collapse quite easy.

IIRC with my debug output, the escape with \ sets SCE_JSON_ESCAPESEQUENCE and the " is consumed, like Neil mentioned at the start. So the if (context.ch == '"') { condition is apparently for closing the string. This is why the following ForwardSetState() sets SCE_C_DEFAULT. The test file processed_esc.json shows the ill effect causing an unclosed string with \"" if SCE_JSON_STRING is set instead.

@rdipardo
Copy link
Contributor

rdipardo commented Apr 6, 2022

@mpheath,

What I sort of try to do is handle no double quotes, paired double quotes and regard the trailing quote on it's own as an invalid url.

That's a job for a struct that can reason about it's state, the way CompactIRI and EscapeSequence already do.

It also doesn't get around the mutual exclusivity illustrated by comparison of @mpheath's embedded.json (distinct URL) and embedded_1.json (everything a string) — although invalid URLs are more obviously so:

I appreciate the pros/cons mentioned. Any improvements are welcome.

The observation was in reference to the constraints of the problem, not your solution. At some point a choice has to made between pedantic correctness and the kind of JSON that real programs like NPM consume. I don't have any idea how both can be satisfied at once, and the lexer already suffers from trying to do too much.

@nyamatongwe
Copy link
Member

nyamatongwe commented Apr 7, 2022

Added an AllStyles.json file to examples to show all styles and act as a basic test. Fixed code so that there is no split CRLF styling on comment and error lines.

Switched SCE_C_DEFAULT to SCE_JSON_STRING (edited) SCE_JSON_DEFAULT.

@rdipardo, the 'minimalistic attempt' changes interpretation of (enabled) escape sequence in AllStyles.json to not return to string state for ending ". The start and end " styles should match as sometimes code tries to search for delimited elements.

@nyamatongwe
Copy link
Member

Some of the unusual elements identified by this lexer (SCE_JSON_LDKEYWORD, SCE_JSON_COMPACTIRI) are defined by JSON-LD.

@JC3
Copy link
Author

JC3 commented Apr 8, 2022

Some of the unusual elements identified by this lexer (SCE_JSON_LDKEYWORD, SCE_JSON_COMPACTIRI) are defined by JSON-LD.

JSON-LD is neither an update nor an extension to JSON. It is a separate specification of a JSON-based schema. Its relation to JSON is the same as, say, the relation of SVG to XML.

Therefore the JSON lexer shouldn't deal with JSON-LD at all, just as it doesn't deal with LIME, JSON-RPC, etc.

Those are all useful schemas, but really shouldn't be handled at the lexing level anyways (ideally, they're post-lex/parse steps) -- although if they were to be handled during lexing, they would certainly demand their own lexers rather than being part of the general JSON lexer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
json Caused by the json lexer
Projects
None yet
Development

No branches or pull requests

4 participants