Links followed by escaped quote breaks JSON parsing. #72
Comments
Fwiw (if it influences a decision to look into it -- note I'm not familiar with Lexilla at all and don't have a concept of how complex a fix would be), while it's an edge case, it's also not completely uncommon.

Also, there isn't really a workaround. E.g. the HAR file I mentioned in my OP, which I was viewing in Notepad++, had 8 or 9 occurrences of it, and I actually ended up having to stop using Notepad++ for that case, as formatting and navigating the large document was too difficult. And since the issue happens at the lexer level, there isn't a way for a client to influence it (e.g. Notepad++'s "clickable link" option does not -- and cannot -- affect the issue), i.e. there's no clean way for Notepad++ to implement a workaround even if it wanted to. So, if it matters: because it isn't super rare, and it can't be worked around, I think it's moderately important. 🤷♂️ There's also the philosophical reason that, edge case or not, it is currently misparsing valid JSON (and does so when parsing URLs, which aren't recognized as special in the JSON spec). So if that's important, it's another reason.
Does not display the issue, as the URL is valid. Am I missing something here? I see the escaped double quote issue generated, though the idea behind its creation based on a malformed URL seems flawed.
I don't know if you are missing something; I do know that it shows the issue for me, in Notepad++ anyway. The first two show the issue. In the third, I removed the 'p' from 'https' in the URL (so it is not parsed as a link), and the issue disappears. Note the syntax coloring, and the effect when collapsed (the highlighted lines should not be visible; not shown: top-level collapse also fails).

It could be a red herring, but the examples above show that when a link is not parsed, the issue does not appear.
To clarify, that was a capture of SciTE. The OP didn't specify a Notepad++ version, but all current releases consume Scintilla 4.4.6. While that should not invalidate the comparison (LexJSON hasn't changed much in several years), I do notice a difference in how the trailing object tokens are styled. Per SciTE, the state is consistently the same; there are 3 code paths resulting in that state. To be sure, the application acknowledges it.
I'm sorry, I swear I'm usually better at bug reports... 😂 Notepad++ 8.3.3, 64-bit, Windows 10. And SciLexer.dll (I'm assuming that's the Scintilla lib) says 4.4.6.0; the product version field says "4.4.6 for Notepad++".
There is a JSON specification available. It describes strings but not URLs. URL recognition within strings adds value but should not continue past the boundary of the string, since strings may include invalid URL fragments. Reasonable approaches for the invalid URL are to treat it as: (1) an error, or (2) ordinary string content.
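One way to respect that boundary can be sketched as follows (a sketch with hypothetical helper names, not LexJSON's actual code): the URL scan simply refuses to consume the characters that matter to the string scanner, namely the closing `"` and the escape-introducing `\`, so link recognition can never run past the end of the string.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical predicate: characters that must terminate a URL span
// inside a JSON string. '"' closes the string, '\\' starts an escape
// sequence; both belong to the string scanner, never to the URL.
bool isUrlStop(char ch) {
    return ch == '"' || ch == '\\' || ch == ' ' || ch == '\0';
}

// Hypothetical helper: length of the URL span starting at s, stopping
// strictly before any character the string scanner needs to see.
size_t urlSpanLength(const char *s) {
    size_t n = 0;
    while (!isUrlStop(s[n]))
        ++n;
    return n;
}
```

With this rule, a URL followed by `\"` ends at the backslash, so the escape sequence is still handed to the string scanner and the string is not terminated early.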
Thanks for the images and details. I see differences between the grey and the black text in Notepad++. The appearance in SciTE looks deceptive in comparison, as it displays string style at the end. To my eyes, I cannot see an issue. Edit: Now seeing red EOLFilled after testing with a newly compiled build (facepalm). TestLexers shows a problem with styling in Lexilla 5.1.6.
The URL styling is decoration only, so it is a feature. What matters most is the escaping required for JSON. To be more precise, as far as the lexer is involved, it is the escaped double quote `\"`. As the TestLexers example shows, this is not just about invalid URLs. This will need some code scrutiny to fix.
Well, the issue actually only appears with valid URLs (well, URLs recognized as links), and if I've understood correctly, it's more about the escape sequence handling than the URL itself. When the URL is not parseable as a link, it seems OK. As an aside though, your option 2 would be the appropriate choice in the cases you describe, since malformed URLs are not a JSON error.
Well, any other opinions besides that one are demonstrably inferior so... good choice of opinion. 😂
Also, fwiw, philosophically, link parsing shouldn't be done at the lexer level at all if links aren't part of the language's vocabulary. That's really a post-lexer text-processing step. When the two are entangled, it will always be work to discover, test for, and avoid a conflict, whereas in a post-processing, metadata-adding step over string-ish tokens, there can't be any conflicts. But I acknowledge that it is convenient and also already implemented.
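The post-lexing idea above could be sketched like this (hypothetical `findLinks` helper and `Span` type, not an existing Lexilla or Scintilla API): the lexer emits only string tokens, and a later pass scans string-styled spans for URLs and records them as metadata, so link detection can never disturb tokenisation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical metadata record: a URL span inside already-lexed text.
struct Span { size_t pos, len; };

// Post-lexing pass: given the text and a per-character style array
// (styles.size() >= text.size()), find URL spans that lie entirely
// within string-styled runs. Escapes and closing quotes end a span,
// so the pass can never conflict with the lexer's own decisions.
std::vector<Span> findLinks(const std::string &text,
                            const std::vector<int> &styles,
                            int stringStyle) {
    std::vector<Span> links;
    size_t i = 0;
    while (i < text.size()) {
        if (styles[i] == stringStyle &&
            text.compare(i, 8, "https://") == 0) {
            size_t j = i;
            // Stay inside the same string-styled run; stop before any
            // backslash or quote so escape sequences are untouched.
            while (j < text.size() && styles[j] == stringStyle &&
                   text[j] != '\\' && text[j] != '"' && text[j] != ' ')
                ++j;
            links.push_back({i, j - i});
            i = j;
        } else {
            ++i;
        }
    }
    return links;
}
```

In Scintilla terms, a client could apply the resulting spans with indicators rather than lexical styles, which is exactly the separation argued for above.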
This is my attempt at a fix: URLs do not get processed with the escape sequence property enabled. (Note: LexJSON.cxx, line 314 in 0c84f25.)
Looks like a strange constant for the JSON lexer. Perhaps it should be SCE_JSON_DEFAULT.
The latter, and a harmless one at that:
Here's a more minimalistic attempt:

```diff
diff --git a/lexers/LexJSON.cxx b/lexers/LexJSON.cxx
index 92616715..967e9b93 100644
--- a/lexers/LexJSON.cxx
+++ b/lexers/LexJSON.cxx
@@ -310,8 +310,7 @@ void SCI_METHOD LexerJSON::Lex(Sci_PositionU startPos,
 					break;
 				}
 				if (context.ch == '"') {
-					context.SetState(stringStyleBefore);
-					context.ForwardSetState(SCE_C_DEFAULT);
+					context.ForwardSetState(SCE_JSON_STRING);
 				} else if (context.ch == '\\') {
 					if (!escapeSeq.newSequence(context.chNext)) {
 						context.SetState(SCE_JSON_ERROR);
@@ -370,7 +369,11 @@ void SCI_METHOD LexerJSON::Lex(Sci_PositionU startPos,
 				if ((!setKeywordJSONLD.Contains(context.ch) &&
 				     (context.state == SCE_JSON_LDKEYWORD)) ||
 				    (!setURL.Contains(context.ch))) {
-					context.SetState(stringStyleBefore);
+					if (context.ch == '\\') {
+						context.SetState(options.escapeSequence ? SCE_JSON_ESCAPESEQUENCE : SCE_JSON_STRING);
+						break;
+					} else
+						context.SetState(stringStyleBefore);
 				}
 				if (context.ch == '"') {
 					context.ForwardSetState(SCE_JSON_DEFAULT);
```
This does however overload the significance of the escape sequence style. It also doesn't get around the mutual exclusivity illustrated by comparison with @mpheath's approach.
I tracked down the difference between views in SciTE and Notepad++. The options have an effect with these cases:

`case SCE_JSON_LDKEYWORD:`
`case SCE_JSON_URI:`

These cases are not selected with the options disabled.
I would opt to change to
Nice for the URL, though the trailing double quote perhaps makes the URL invalid. What I sort of try to do is handle no double quotes and paired double quotes, and regard the trailing quote on its own as an invalid URL. The leading double quotes need to be recognized to determine if the trailing double quotes are valid, which determines whether the URL is valid to style as string or as URL.
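That pairing rule might be sketched as follows (a hypothetical helper, assuming the rule is that any double quotes inside a URL candidate must come in pairs, so a lone trailing quote marks the candidate invalid and the quote is treated as the string terminator instead):

```cpp
#include <cassert>
#include <string>

// Hypothetical check for the pairing rule described above: a URL
// candidate is acceptable only if its double quotes come in pairs.
// An odd count means the final quote really closes the JSON string,
// so the candidate should fall back to string styling.
bool quotesArePaired(const std::string &url) {
    int quotes = 0;
    for (char ch : url)
        if (ch == '"')
            ++quotes;
    return quotes % 2 == 0;
}
```

So `https://a` and `https://x/"y"` would pass, while `https://a"` would fail and be styled as string content, not as a link.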
I appreciate the pros/cons mentioned. Any improvements are welcome. Changing the constant seems reasonable. IIRC, with my debug output, the escape within the URL is where it goes wrong.
That's a job for a
The observation was in reference to the constraints of the problem, not your solution. At some point a choice has to be made between pedantic correctness and the kind of JSON that real programs like NPM consume. I don't have any idea how both can be satisfied at once, and the lexer already suffers from trying to do too much.
…line end characters so both CR and LF are in DEFAULT.

Added an AllStyles.json file to examples to show all styles and act as a basic test. Fixed code so that there is no split CRLF styling on comment and error lines. Switched SCE_C_DEFAULT to SCE_JSON_DEFAULT.

@rdipardo, the 'minimalistic attempt' changes interpretation of (enabled) escape sequences in AllStyles.json to not return to string state for the ending double quote.
Some of the unusual elements identified by this lexer (the JSON-LD keywords, for example) come from JSON-LD rather than JSON itself.
JSON-LD is neither an update nor an extension to JSON. It is a separate specification of a JSON-based schema. Its relation to JSON is the same as, say, the relation of SVG to XML. Therefore the JSON lexer shouldn't deal with JSON-LD at all, just as it doesn't deal with LIME, JSON-RPC, etc. Those are all useful schemas, but really shouldn't be handled at the lexing level anyway (ideally, they're post-lex/parse steps) -- although if they were to be handled during lexing, they would certainly demand their own lexers rather than being part of the general JSON lexer.
Originally reported at notepad-plus-plus/notepad-plus-plus#11467 but was told to post it here instead:
It seems that the JSON lexer chokes on the following JSON document:
It misparses the `}` after the `\"` as a JSON token (e.g. in Notepad++). At a glance, it seems like it thinks that backslashes are part of links, and so it treats the `\` as part of the link rather than as an escape character for the following `"`, and so it ends the string early. (The real reason is in the comments.)

I ran into this in the wild in a HAR file that contained a value that ended with a similar sequence (the value is escaped JSON, and that JSON contains a value that is a URL):
Version
It's the version that ships with Notepad++ 8.3.3, 64-bit (Windows 10).
SciLexer.dll (I'm assuming that's the Scintilla lib) says 4.4.6.0, product version field says "4.4.6 for Notepad++".