-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Higlighting issue with strings & comments #307
Comments
I have used SyntaxHighlighter in the past which doesn't have this problem. It would be a good idea to look at its core. |
Maybe this can be solved by using
|
It works now :-)
|
This was my first thought, but I didn't have any chance to look at it yet. How SyntaxHighlighter fixes it is by simple checking for nested matches and removing these: https://github.com/alexgorbatchev/syntaxhighlighter/blob/master/src/js/shCore.js#L1325 |
You should format your comment as code, otherwise it is not really readable 😉 my only problem is that this regex is pretty slow. |
Yeah, I’d rather not slow down the general case to work around an edge case either… |
Ok, so we basically have two issues:
1. PerformanceLet's take a look at some numbers. The current regex for clike goes like this: It does some pretty basic lookbehind testing to ensure that we match only if there is no If we skip this requirement (to keep it simple and it doesn't cause a performance hit) we end up with: I will take the canonical example of a failing test case - the grunt config: grunt.initConfig({
sass: {
dist: {
files: [{
test: 5, //a
src: "scss/**/*.scss"
}]
}
}
}); So, the current issue ("known failure") is that The first match is the first comment (
That's a pretty heavy performance impact - and what is pretty ugly: it depends on the length of the input (so it might get even worse). 2. Partial matches with global contextWell, this is the main problem. There is no global context available. 3. Solution?Well, I will try to look at the way SyntaxHighlighter has implemented it (by transforming the content in a "sort-of-AST" and optimizing it). I have the hopes up that this could be implementable without much effort and complication. But we will see when the prototype stands. edit: typo in new regex fixed |
That would be the best. Currently I feel that PrismJS locks the string which is highlighted once and then prevents it from matching successive tokens. Thus results are not what you expect. A problem that I am stuck with in the last few days is purely because of this fact. I am in need to highlight the first word of the first line of code. |
I use a large 1600 lined code for testing my prism definition for autohotkey language. Before implementing the fix for this issue, it took about 4 seconds to load on my computer and now too takes almost 4 seconds. I know this is not precise, and so I tried to use regex101.com to measure the performance impact but I can't find where they display number of "STEPS". |
Regarding your last question: in the left sidebar you need to click on |
Ok, update. After sleeping a night over it I realized, that making all the quantified selectors in the lookbehind non-greedy could speed up the regex a lot, because we dramatically reduce the branching factor of the regex matching.
improved & optimized solution: http://regex101.com/r/yS6dY5/1 While this looks promising we need to be very careful not to break existing language definitions. PS: if this works out we can try to profile and possibly optimize regex'es of other existing language definitions. |
PS2: we still need to fix the global lookbehind issue. |
Thanks @apfelbox msgbox % ";str" " /*str"" " ;comment
msgbox % "Loading ;str" "" ;comment Old - (38|65|4) - http://regex101.com/r/nY6iY0/1 Updated |
What is it ? |
See |
I have used the |
I know that, but there are several problems:
Your regex won't match that. |
Oh sorry. Multi-line strings are not there in AutoHotkey. I should have appended "in case C syntax allows that".
What if I keep the comments regex right at the top.Then the whole of haystack will be available as input. |
There are languages which extend If your language just allows single linge comments it is perfectly fine to do it with the regex - there won't be any problems and by anchoring the regex in the line with
This is true. I will evaluate all possible solutions in the coming time, so maybe I can tell a bit more about it in the future. |
This is just a joke, so don't take it too seriously :). But it does actually work and it solves most of the above problems. I don't know how good the performance is though. I use a fake RegExp object and do some extra checking. It handles single quotes, regexes, double quotes and escaped quotes. I don't think your proposed solutions can do that. Prism.languages.clike = {
'comment':{
pattern: {
pattern: /(?:\/\/.*|\/\*[\w\W]*?\*\/)/g,
exec: function(str) {
var p = this.pattern,
match;
p.lastIndex = 0;
match = p.exec(str);
while (match) {
var inDoubleQuote = false,
inSingleQuote = false,
inRegex = false,
isEscaped = false;
for (var i = 0; i < match.index; ++i) {
switch(str[i]) {
case '"':
inDoubleQuote = (!inDoubleQuote || isEscaped) &&
!inSingleQuote &&
!inRegex;
break;
case '\'':
inSingleQuote = !inDoubleQuote &&
(!inSingleQuote || isEscaped) &&
!inRegex;
break;
case '/':
inRegex = !inDoubleQuote &&
!inSingleQuote &&
(!inRegex || isEscaped);
break;
case '\n':
inRegex = false;
break;
case '\\':
isEscaped = !isEscaped;
break;
default:
isEscaped = false;
}
}
if (!inDoubleQuote && !inSingleQuote && !inRegex)
return match;
p.lastIndex = match.index + 1;
match = p.exec(str);
}
return match;
}
}
},
... |
That looks like a parser of sorts. Of course it would be correct, but I doubt the performance will be decent, especially for larger snippets. |
Check prism-grammar external parsing add-on for Prism, which uses a grammar to highlight solving all these issues (and more) |
chiming in with a proof of concept that I've no idea if is worth exploring. Basically, it removes strings, highlight the content, then it puts strings back. function safeHighlight(content, highgligh) {
var
uniqueID = '_@no\x01shenanigans@_',
restore = new RegExp('(["\'])' + uniqueID + '(.+?)\\1', 'g'),
str = []
;
function pre(content) {
return content.replace(
/(["'])(?:(?=(\\?))\2.)*?\1/g,
($0, $1) => $1 + uniqueID + (str.push($0.slice(1, -1)) - 1) + $1
);
}
function post(content) {
return content.replace(
restore,
($0, $1, $2) => $1 + str[$2] + $1
);
}
return post(highgligh(pre(content)));
}
var content = 'var str = "hello/**/world//!";';
// use prism or the actual highlighter instead of String
safeHighlight(content, String); I haven't done many benchmarks but it seems like a simple/safe solution for strings in comments and/or in code: it will put them back in their context, if that was highlighted, strings will result highlighted too. Best Regards [edit] it might be worth using a unique token to avoid conflicts with generated html so that any |
@WebReflection your solution would certainly work for strings, although it seems more like a hack. Do you have any proposal for turning it into a general purpose solution? We have recently merged a partial solution for that problem #779. It can correctly handle strings with one comment inside but not more (e.g.: |
well, the general purpose solution is what you call the hack:
A char-by-char parser wouldn't have this problem because it would know it's inside a string ... my work-around does exactly the same making the "inside a string" a no-brainer for the parser. I could write a more robust example with a better placeholder but I honestly don't know how to improve every language parsed by [edit] this would be my best pick var safeHighlight = (function () {
var
uid = '_@no\x01shenanigans@_',
find = /(["'])(?:(?=(\\?))\2.)*?\1/g,
restore = new RegExp('(["\'])' + uid + '(\\d+)\\1', 'g'),
rp1 = function ($0, $1) {
return $1 + uid + (str.push($0.slice(1, -1)) - 1) + $1;
},
rp2 = function ($0, $1, $2) {
return $1 + str[$2] + $1;
},
str
;
return function safeHighlight(content, highgligh) {
str = [];
console.log(restore);
var out = highgligh(content.replace(find, rp1)).replace(restore, rp2);
str = [];
return out;
};
}()); |
I understand how it works and it is a nice solution. With general purpose solution I mean something that works for any grammar and not only for strings but also for other stuff. This problem is not limited to strings. For example you can have a regex with strings inside |
not an issue, regexp will be highlighted as such, and strings will be put back. TL;DR it should work universally but the order obviously matters.
This would be my order/solution ... but yet it's crafted for JS comments and JS strings. |
to better explain:
If needed, drop strings entirely (remove $1 too and return just a placeholder) and put them back at the end using a Does this make sense? |
Since i happen to get notified in this discussion, let me enter my 2 cents Yes using placeholders partialy solves the problem, so in this sense it is a hack, a nice hack though. For example multiply escaped strings, inside strings e.g Then other combinations, e.g strings inside comments, or vice-versa, or comments inside regexps and so on.. Or tags inside CDATA and so on.. So this is not a good solution, only partial and even this in some cases. i still suggest using a special (delimited) tokenizer which handles only delimited tokens (e.g comments, strings, regexps, cdata and what not, which are all over the place). it is not difficult, it is not slower (in fact it will be faster) and is actually intuitive to define in a specification e.g "comment" : ["/*","*/"],
"string": ["'","'","\\"],
"regexp": ["/","/","\\"]
// and so on.. Cheers |
I've already implemented exactly that some time ago, but the code size got too big. The problem is that |
Well it seems to get quite complicated. The constraints for performance and code size are quite challenging here. We already had a few proposed solutions that didn't cut the mustard. Nevertheless I am happy to test and review any proof of concept implementations... |
@foo123 your example is broken, while this string doesn't create a Moreover, that's not an issue at all with the var code = 'var s = "a string with \\\\\\\"escaped string\\\\\\\\\\\" inside";';
code.replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '$1hello$1');
// "var s = "hello";" So I'm not sure what are your concerns about correctly escaped strings. For non correctly escaped strings though, well ... that's a "who cares" case because the code is broken and cannot be possibly highlighted correctly, right? |
The fact remains that regexps cannot count (e.g number of parentheses, or number of escapes and so on..) So if they cannot count, all delimited tokens that can be escaped , meaning they can contain their own end-pattern if it is escaped in a perfectly valid way, cannot be parsed correctly with regexps. Maybe i phrased it better now. |
I agree char-by-char is a rock-solid solution but still I'd like to underline the It's not a universal solution, I agree, but it works with strings and with literal Here an improved example that make the intermediate content easily parsable via var safeHighlight = (function (Prism) {
// v3 without shenanigans
var
html = {
chars: {
'&': '&',
'<': '<',
'>': '>',
"'": ''',
'"': '"'
},
escape: function (content) {
return content.replace(html.find, html.replace);
},
find: /[&<>'"]/g,
replace: function ($0) {
return html.chars[$0];
}
},
uid = (Math.random() + '').slice(2),
uidFinder = uidFinder = new RegExp(
'(?:<[^>]+?>)?' +
uid + '\\.(\\d+)' +
'(?:<\\/[^>]+?>)?', 'g'
),
uidReplacer = function ($0, $1) {
return html.escape(chunks[$1]);
},
commonStringRegExp =
/(["'])(?:(?=(\\?))\2.)*?\1/g,
commonReplacer = function ($0, $1) {
var len = $1.length;
return $1 + uid + '.' + (
chunks.push($0.slice(len, -len)) - 1
) + $1;
},
languageWrap = {
javascript: {
find: [
commonStringRegExp,
/(\/)[^/*](?:(?=(\\?))\2.)*\1([^/=*]|$)/g
],
replace: [
commonReplacer,
function ($0, $1, $2, $3) {
var len = $1.length;
return $1 + uid + '.' + (
chunks.push($0.slice(len, -(len + $3.length))) - 1
) + $1 + $3;
}
]
},
python: {
find: [commonStringRegExp],
replace: [commonReplacer]
},
ruby: {
find: [commonStringRegExp],
replace: [commonReplacer]
}
},
langMap = new Map(
Object.keys(Prism.languages).map(function (key) {
return [Prism.languages[key], key];
})
),
chunks
;
// some language has a shortcut equivalent.
languageWrap.js = languageWrap.javascript;
languageWrap.py = languageWrap.python;
return function safeHighlight(content, language) {
var
languageName = langMap.get(language) || '',
wrap = languageWrap.hasOwnProperty(languageName) ?
languageWrap[languageName] : null,
out = content,
i = 0
;
chunks = [];
if (wrap) {
while (i < wrap.find.length) {
out = out.replace(wrap.find[i], wrap.replace[i]);
i++;
}
}
out = Prism.highlight(out, language);
if (wrap) {
while (i--) out = out.replace(uidFinder, uidReplacer);
}
chunks = [];
return out;
};
}(Prism));
var code = `
/* this is a multiline comment */
var s = "a string with \\\\\\\"escaped string\\\\\\\\\\\" inside";
var c = "a string with /*ml comment*/ and a //sl too";
// this is a single line comment
var re = /this is a re" with a string "inside/g;
var rec = /this is a re with a [^/*ml*/] inside/g;
`;
safeHighlight(code, Prism.languages.js); How does the intermediate code look like? /* this is a multiline comment */
var s = "123123882183.0";
var c = "123123882183.1";
// this is a single line comment
var re = /123123882183.2/g;
var rec = /123123882183.3/g; Is that middle-content easy to parse for |
Yes having the regexp take account of any preceding escapes does fix the thing, however i;m not sure what would happen with multiple escaped strings inside a single string, does it work the same? It should be tested. true not usual case, but lets do it for science :) Btw even ace editor (if i'm not mistaken) which github uses to highlight code has a hard time with these things even though it uses a state machine for parsing tokens (ace javascript mode) |
dunno, what I think is good, is that I've created a jsbin for this case. At least, if nothing, there is a playground to test nasty code and see how to proceed. There are surely edge cases, but since edge cases are a "won't fix" here, due nature of the parser which is Hopefully worth exploring. |
@WebReflection Your expanded solution works as a proof of concept, but you have to fully integrate it into Also the string inside of a regex example
We have recently merged a general purpose, small solution with minimal performance impact, which reduces these edge cases significantly #779. |
nice catch about the regExp, it's because I've adde the HTML escaper without retesting (stupid me). I understand it should work as general purpose solution but I don't think it will be accepted here (because of skepticism) so I won't bother with a PR but I'll fix my proof of concept. If there is real interest in having it in I'll try to improve the proof of concept and submit a PR with the feature in. |
FYI I've fixed the issue with strings in regexps (with strings in general) https://jsbin.com/qazida / @zeitgeist87 |
The problem with removing strings and using placeholders is that it doesn't handle all other kinds of nested tokens, and if we handle those too we're basically parsing the input, so we may as well do some proper lightweight parsing first. |
I'm not sure you should handle any nested token within single or double quoted strings but I've improved the example with better code, Python and Ruby compatible too (easy to make it safer with others) and improved a bit in terms of API The previous version was suffering nested tokens indeed, now in the worst case scenario you gonna have a number highlight instead of string. Still not perfect but we all know with RegExp only it's hard to be so :-) |
Hi there, please help |
This is the combined issue for all related issues (#187, #183, #155, #153).
There is a fight between the highlighting of strings vs. comments.
Currently only one of them can work
I will take a look whether this issue can be "fixed" (it can't be fixed for all cases due to the limited power of regular expressions though).
The text was updated successfully, but these errors were encountered: