Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect coloring for html comment across multiple lines #63

Closed
Shu-Ting-Huang opened this issue Aug 30, 2020 · 77 comments
Closed

incorrect coloring for html comment across multiple lines #63

Shu-Ting-Huang opened this issue Aug 30, 2020 · 77 comments

Comments

@Shu-Ting-Huang
Copy link

Shu-Ting-Huang commented Aug 30, 2020

I am printing an html file in which there are comments for the form

  1. <!--comment-->

  2. /*comment*/

But only the first line of the comment is correctly colored as comment.

@PeterWone
Copy link
Collaborator

This is a shortcoming of the syntax colouring library rather than VS Code Printing.

If you can give me a sample that demonstrates the problem I'll be happy to log a bug with the syntax colouring project.

@Shu-Ting-Huang
Copy link
Author

Below is an html sample together with the PDF printout that demonstrate the problem.

https://drive.google.com/drive/folders/1SuTVZryAV9MVRucQ3zmbocVW_Od7WVet?usp=sharing

@Neohiro79
Copy link

Neohiro79 commented Nov 20, 2020

Just saw this conversation as a matter of interest. Since I cannot access your HTML Sample, this is just a blind guess here, but you might want to check:

This is the HTML Code:

VSC-Code

This would be the output (and the wrong /* HTML-comment */ won't show up colored):
Printout-Style


It might solve your request - in that case this is NOT a bug - this is a wrong HTML comment.


@Neohiro79
Copy link

Neohiro79 commented Nov 20, 2020

Sorry, my fault, I have overlooked that you meant "over multiple lines", which also triggers the same no-syntax problem as meantioned above (the wrong HTML comment).

I've made a JSFiddle to first see what happens when using "highlight.js" and if this is a bug related to highlight.js:

http://jsfiddle.net/neohiro79/zbfdhxjy/4/


The output there seems fine to me - except the fact that it won't show HTML comments - even when HTML is set as class.

But when used through the plugin, it shows this (which looks like a bug inbetween somewhere):

VSC Input VS Extension Output:

__

@PeterWone
Copy link
Collaborator

PeterWone commented Nov 20, 2020

There are two modes, manual and auto. I use manual whenever VSCode has a definite langId, and for an HTML file that is the case. Manual mode basically means I don't let highlight.js guess, I explicitly tell it "this is HTML"

But that doesn't explain the behaviour you're seeing so I suppose I'll have to spin it up in the debugger and seen exactly what's being sent to highlight.js

@Neohiro79
Copy link

Neohiro79 commented Nov 20, 2020

Peter I've also tried to do it with Prism, but I've ran into similar troubles.

I've already opened a ticket:

PrismJS/prism#2651

You can see my tryouts in this JSFiddle here (now it works, but it is a quirky workaroud, since comments in HTML will require a plugin, which in my opinion is not the best way of having to deal with):

http://jsfiddle.net/neohiro79/zbfdhxjy/3/

Again, see my comments in here:

PrismJS/prism#2651

At least the multi-line comments worked for Javascript and CSS.

@Neohiro79
Copy link

Digging into the rabbithole ...

Even this nice looking approach

https://asvd.github.io/microlight/

won't work as expected:

http://jsfiddle.net/neohiro79/zbfdhxjy/5/

Issue has been submitted, but I guess he might not answer, seems like a dead git repo to me:

asvd/microlight#12

... to be continued ...

@PeterWone
Copy link
Collaborator

PeterWone commented Nov 21, 2020

Without wishing to rain on your parade, highlight.js is probably as good as it gets. Stackoverflow just switched to using it.

This is why, despite otherwise being a bit perfectionist about vscode-printing, I think we should let the open-source division of problems thing work for us. Gather your evidence and just log a case with the highlight.js people. We can't do everything and they know their codebase.

@Neohiro79
Copy link

Finally I got some feedback for rainbow.js:

This works PERFECTLY NOW:

http://jsfiddle.net/neohiro79/zbfdhxjy/6/

If you want to include all you can use this command here for a build:
gulp build --languages=all

That might also address potential performance issues ...
I know you like highlight.js but it might be worth a try ...

@Neohiro79
Copy link

To bundle Craig told me this:

The best bet is using gulp build to create a custom bundle and passing whatever languages you want in a list.
You can also use the bundler from the rainbow website, but it is using a slightly out of date version of rainbow (long story)
http://rainbowco.de

@Neohiro79
Copy link

Neohiro79 commented Nov 21, 2020

So Peter, as thought, don't expect 'help' from the guys from highlight.js, as seen in the comment here:

highlightjs/highlight.js#2884

It seems to me they have the very same attitude than we're well aware of the superior VSC forum feedback in our case.

Since I've made that research for various reasons so comprehensive, I'll personally stick with rainbow.js since it not only seems the most reliable code syntax highlighter on the market right now, it also has the nicest and most helpful developer feedback I've encountered so far!

Hope you can find a solution which finally works - give it a try!

@Neohiro79
Copy link

Neohiro79 commented Nov 21, 2020

The answer came promptly after telling them that it obviously works in rainbow.js, so this is how it "only-works" in highlight.js:

https://jsfiddle.net/neohiro79/zbfdhxjy/12/

That said, I still prefer the rainbow.js solution, since you do not need to "mask" anything to make it work, since I don't get the whole "I am concerned about security issue" in this specific regard:

ccampbell/rainbow#249


@edit: since I've had a really good chat with josh about this whole security issue, I see the problem now which might occur when you're using unmasked code.


@joshgoebel
Copy link

joshgoebel commented Nov 22, 2020

So Peter, as thought, don't expect 'help' from the guys from highlight.js, as seen in the comment here:

@Neohiro79 FYI: This is the attitude I picked up on that very slightly miffed me. :-)

Just to explain here we do not highlight unescaped HTML code because it is a HUGE security risk and a very bad idea. This is a major feature, not a bug. If you want HTML code to be treated as "text" and highlighted, it must be properly escaped. Otherwise google "html injection", "javascript injection", etc...

IE, correct:

<code>
&lt;code&gt;this is a code tag to be highlithed&lt;/code&gt;
</code>

Incorrect:

<code>
<code>hey code here!!!</code>
</code>

I personally believe Rainbow is buggy in this regard or at the very least encouraging/enabling people to make some very poor security choices. I've explained this as clearly as I can to the author and we'll see what they decide. :-) With a very small change to the library they could close this security hole and prevent people from shooting themselves in their own foot.

@joshgoebel
Copy link

joshgoebel commented Nov 22, 2020

@PeterWone I'm not sure if this affects your plugin or not... if you are asking us [Highlight.js] to highlight code inside a block you need that code to be properly HTML escaped (which should be pretty trivial to do with JS). The block should not include RAW unescaped HTML - which we are NOT going to highlight.

@Neohiro79
Copy link

Neohiro79 commented Nov 22, 2020

@josh :-) big hug for all your effort !!!

@Neohiro79 FYI: This is the attitude I picked up on that very slightly miffed me. :-)

Yeah, I got the point, that's what I thought ... but you also opened up for us 'beginners' now, which makes me happy :-)

@PeterWone
Copy link
Collaborator

PeterWone commented Nov 22, 2020 via email

@joshgoebel
Copy link

joshgoebel commented Nov 22, 2020

That is my current understanding, yes. Highlight.js (currently) preserves HTML - it's passed thru silently - while the actual TEXT content is highlighted... and the HTML remains, doing whatever it would do... this is legacy behavior and in the future we'll likely drop it completely but we can't do that until version 11. If you want all the content to be treated as text (and highlighted) then just escape it.

There is no reason for your block to include anything other than text (ie, properly escaped HTML which is "text" to the browser/runtime) unless you purposely WANT to use HTML to do something:

<code>
<span style="color:red !important;">var x = 5;</span>
var y = 9;
<code>

For example you can do tricks like this... and we'll highlight the text but preserve the HTML so line 1 will always be RED. Some people use this to highlight/emphasize parts of their code, etc...

@Neohiro79
Copy link

Oh, that explains why some of them won't show the html tags but the content, they used innerText only to get the markup of that specific container right?

@joshgoebel
Copy link

joshgoebel commented Nov 22, 2020

Oh, that explains why some of them won't show the html tags but the content, they used innerText only to get the markup of that specific container right?

Yes, or textContent. (which is honestly the correct behavior IMHO - vs innerHTML)

@Neohiro79
Copy link

Neohiro79 commented Nov 22, 2020

So the issue with rainbow.js is that they use innerHTML instead of textContent to "get all the code" they want to highlight and inject it back into the given container - so basically if the HTML is malicious in first place it will keep beeing malicious with the additional highlighting, correct?

And the fact that they change the HTML during the process won't change the fact they read from innerHTML the whole code ... ?

Because in that case it makes sense to me to definately warn the user once any input of code is not masked properly in a very clear way (and not in the console) that you can see the mistake and directly put a link to one of those converters to finally get the result someone might expect, just as you proposed ... just my two cents...

@joshgoebel
Copy link

joshgoebel commented Nov 22, 2020

So the issue with rainbow.js is that they use innerHTML instead of textContent to "get all the code"

Yes. You want "the text content in a code block" not "all the HTML code in a code block".

so basically if the HTML is malicious in first place it will keep beeing malicious with the additional highlighting, correct?

Well, no.. :-) It's quite likely Rainbow would actually "fix it"... by turning it into harmless highlighted code (albeit in weird ways)... but it's too late... the HTML is loaded BEFORE Rainbow sees it. The JS runs BEFORE Rainbow sees it. Plus you can't guarantee Rainbow will run anyways - perhaps the server (or internet) has a glitch and your page loads without Rainbow.

<pre><code>
<script>
// this code runs BEFORE Rainbow
// this code could even hack Rainbow and prevent it from loading if it wanted to
</script>
</code></pre>

@Neohiro79
Copy link

Neohiro79 commented Nov 22, 2020

Yeah, got it.

But this also means (to me) that weather Rainbow.js nor Highlight.js can ever prevent anyone from injecting "bad javascript" (of course) which will anyway beeing executed BEFORE any of those methods through the browser engine, weather innerHTML nor textContent nor any masking techniques beeing applied by any javascript library afterwards, will ever come into play.

Meaning that the main problem I still see is that it can only "warn" a user AFTER potential malicious code was injected already, or in other terms, it is much more important to warn users BEFOREHAND that they shall never use untested or unknown php code which might be used in a contact form or a comment form of any plugin since they do not know weather or not this code is beeing properly masked BEFORE it even hits the html page - which in turn will execute the script BEFORE EVEN ANY highlighter script can detect something via innerHTML or innerText - or am I wrong?

@joshgoebel
Copy link

joshgoebel commented Nov 22, 2020

@Neohiro79 Right, we can't STOP an injection attack like this "in the wild"... but if someone is building their project and testing things out and they do this accidentally we can yell at them very loudly then - so they fix it early - hopefully before their code every gets out "in the wild"... or even if it's in the wild we can still warm them if they are using raw HTML even innocently. Perhaps they can fix it before they are exploited maliciously.

And we can use textContent (not innerHTML) so people never even get the idea of using unescaped HTML.

@Neohiro79
Copy link

Yeah, got you.

@Neohiro79
Copy link

Neohiro79 commented Nov 22, 2020

@josh I'll work on a html warning page and provide it to you once I finished which can be used to be injected into the document or as a popup for all to be aware of once they use HTML code in plain form - since I really think this is a problem since not everyone is aware of this shit.

You can sent me the text you might think or come up with, I'll see what I can do, to make it strikingly clear for others.

@joshgoebel
Copy link

joshgoebel commented Nov 22, 2020

We don't need anything from you. We have an issue to track this, but we can't do anything about it until 2021 anyways - because we can't break the existing behavior in v10 (which passes HTML thru). If you wanted to drop some text on that issue you could, but we'll likely think about this carefully when we get around to doing it. So you can save the time.

@Neohiro79
Copy link

Sure, if there's no help needed I won't help of course. It was just a proposal with a good intension after all.

@PeterWone
Copy link
Collaborator

PeterWone commented Nov 22, 2020

OK so looking at the code (it's been a while) I have a function getSourceCode() which ultimately either loads the content of a file or grabs the buffer of the active editor. Two strings come back as an array, some source code and a type. It could contain anthing and it could be toxic.

I give highlight.js a crack at it

  let sourceCode = await getSourceCode();
  let renderedCode = "";
  try {
    renderedCode = hljs.highlight(sourceCode[0], sourceCode[1]).value;
  }
  catch (err) {
    renderedCode = hljs.highlightAuto(sourceCode[1]).value;
  }

and as you can see if I cannot map the language to something highlight.js can digest then I let highlight.js off the leash and let it make up its own mind.

This does have the vulnerability we discussed, both for accidentally toxic code and for handling the work of others.

@joshgoebel do I understand correctly that for HTML you want me to escape all angle brackets to to &lt; and &gt; before passing the HTML to highlight.js?

I can't do this for unsaved editor buffers because I don't know what they are, these will always be processed in auto mode.

@joshgoebel
Copy link

joshgoebel commented Nov 22, 2020

This does have the vulnerability we discussed, both for accidentally toxic code and for handling the work of others.

I was never asking you to do anything magic, just upgrade to v10, which fixes many of the issues. :-) If our library still has regex issues, then yeah it's possible you'll find them - but that's on us, not you. :-)

do I understand correctly that for HTML you want me to escape all angle brackets to to < and > before passing the HTML to highlight.js?

That's only if your HTML is already on HTML page inside a block, etc... if you're using the API functions directly (highlight, highlightAuto, etc) then you pass them regular text/code whatever... there is zero need to do any type of escaping inside JS. And we escape the output.

Escaping matters if you are using highlightBlock.

@Neohiro79
Copy link

@ josh I really don't want to be pedantic here, but when I've searched in the CDN for html I found nothing than this vbscript-html.min.js which I used (and it works, I know). In no way I would've come to the conclusion to specifically search for xml - eventhough that might be the 100% correct technical term for that specific syntax style - but it would also make really sense to me to rename this into html-xml or xml-html at the very least,

I mean this way one can find it when searching for it.

Who looks into a language description file when searching for the language - and even more - at the bottom of it, just to see if there is a hint for any potential language-alias?

But now you've mentioned it, yes, sure, I know where to look when I am not sure about the usecase of a specific language file.

@Neohiro79
Copy link

@ peter from @ josh

do you support APL?
Can you expand that acronym please?

@joshgoebel
Copy link

This is why we have:

https://github.com/highlightjs/highlight.js/blob/master/SUPPORTED_LANGUAGES.md :-)

but it would also make really sense to me to rename this into html-xml or xml-html at the very least,

That isn't something we could do until v11 in any case since it would break things for existing users. There are perhaps some other things worth discussing around HTML vs XML though - could you file an issue with that suggestion?

Who looks into a language description file when searching for the language - and even more - at the bottom of it, just to see if there is a hint for any potential language-alias?

That was more to show you how it works. :) Then I went and pasted the list for you instead.

@Neohiro79
Copy link

Thanx, yes - but I saw even in the list someone thought the same than me:

HTML-XML

:-)

Sure, I'll will file an issue on highlight.js for that specific request.

@Neohiro79
Copy link

And this list would of course IMHO make sense to be put in the same folder here:

https://github.com/highlightjs/highlight.js/tree/master/src/languages

since I didn't know of that list before that it even exists.

@Neohiro79
Copy link

Just as an idea, having for example three basic subsets, one that is optimised for xml-html-css-js usage, one that is bundled for some-broad-but-special-programming-languages-used and one that is the full metal jacket release - which should only be used in rare cases of course - would be so much more convenient ...

and for those geeks and nerds who really love to look into each file and tweak the shit out of everything there still is the option to bundle the language-files individually.

Just my two cents.

@joshgoebel
Copy link

joshgoebel commented Nov 23, 2020

I don't think the uses cases are that clear cut at all - that's why we make it so easy for people to download what they need instead of making that choice for them:

https://highlightjs.org/download/

And if you want to common on the 'common' list especially there is an issue open for any changes we might make there in 2021.

@Neohiro79
Copy link

Neohiro79 commented Nov 23, 2020

But it wouldn't hurt either to promote at least a carefully chosen web-preset package, I mean ...

You can keep the page as it is - just 'help' someone to find the right 'choice' - beginners love easy answers ;-)
I think jQuery (yes, I know, not the best example) did it pretty well to help in that regard with 'basic packages' ...

@Neohiro79
Copy link

It's like HTML in the beginning - "strict" "transitional" and "frameset", three basic choices

for example for web
HTML - CSS - JS - PHP - SQL

for example the most used ones (for programming):
HTML - CSS - JS - PHP - SQL - PYTHON - JAVA - BASH/SHELL/POWERSHELL / C# / C++ / C / Typescript

and the full package

AND THE INDIVIDUAL CHOICES / BUILDS

@joshgoebel
Copy link

The existing "common" build is small enough to serve for "both" the cases you just list at just 36kb.

@Neohiro79
Copy link

Neohiro79 commented Nov 23, 2020

Oh I see. I haven't recognised those "headlines". Yeah, but it's kinda huge package for "just the web" ...

It for sure serves well for the programming fraction like you - sure - but not for "just the simple web package" (seems like a huge language overkill to me).

I mean I don't want to spend hours of finding out which language packages to use to just highlight my basic code examples in my webpage and even then I later figure out, that I might used the wrong ones or could've used better ones - I hope you get what I mean or want to tell you here.

It's kinda like this first 'contact' feeling:

You want to start quickly and have some experience of success?

  • include highlight.js
  • include highlight-basic-web-language-pack.js

USE PROPER HTML MASKING TO YOUR OWN SAFETY !!! (link see here and here)

Enjoy.

@Neohiro79
Copy link

Neohiro79 commented Nov 23, 2020

For example, right now I use this in my first testcases, because I didn't even realised common existed:

<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.18.5/highlight.min.js"></script>
<srcipt src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.18.5/languages/vbscript-html.min.js"></srcipt>
<srcipt src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.18.5/languages/css.min.js"></srcipt>
<srcipt src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.18.5/languages/javascript.min.js"></srcipt>

<script>hljs.initHighlightingOnLoad();</script>

and it took me more than a minute to figure it out and find the right choices ...

@joshgoebel
Copy link

joshgoebel commented Nov 23, 2020

There might be a docfix here (to better explain the CDN and default built-ins), but right now I'm not convinced more and more variants is the answer. :)

In that case you only need the first line.

@Neohiro79
Copy link

The fun-fact is, I didn't even know that there are variants. I just thought you need the basic lib and then the specific language packs named after their language.

@Neohiro79
Copy link

Neohiro79 commented Nov 23, 2020

By the way, see, even after 15 minutes research of how to implement highlight.js properly, I used it wrong - or at least implemented useless scripts.

Please consider at least a web-package. You will see in the stats, how often it will be picked and used.

Or just include this basic web-package in the highlight.js file itself and mention that you don't need any .css/.js/.xml language pack at all to start with. Only those "additional" languages - I don't know. It's way to hard to figure it out as a beginner.

@Neohiro79
Copy link

@ josh Am I understanding this right:

highlight.js already packages this 'common' bundle within itself???

In that case you only need the first line.

At least it seems to work here right now:
http://jsfiddle.net/neohiro79/zbfdhxjy/32/

common-bundle

The existing "common" build is small enough to serve for "both" the cases you just list at just 36kb.

So all this is already bundled within the highlight.js file itself?

common-content

@PeterWone
Copy link
Collaborator

PeterWone commented Nov 23, 2020

Yes, they are all bundled by webpack.

APL is the name of a programming language. It stands for A Programming Language. Really. https://en.wikipedia.org/wiki/APL_(programming_language)

Back in the day you needed a special keyboard and a terminal that supported user defined characters. Nowadays editors like VS Code convert chords to the appropriate character codes and use an APL compatible font.

APL is extremely obscure and I was basically making a joke about the extraordinary number of languages highlight.js supports. It is probably the most exotic, illegible and unmaintainable programming language ever genuinely intended to be used. Perl doesn't count because it isn't Turing complete.

@Neohiro79
Copy link

Well, than I guess, my whole effort was for nothing to convince @josh here to make a webpack-bundle ;-)

... was worth a try when not knowing that it already includes all that languages from start.

But what if I want only those three languages (like a light version of your full-blown-highlighter - in that case I just use the language pack and no highlight.js at all? - because else it makes no sense to me to provide those single language files via CDN ...

@Neohiro79
Copy link

Did you find the problem @peter ?

It might be a setting of the correct "classname" - maybe?

Can you show the code to @josh that he might find out what the problem is?
Or paste the questionable part on his service for 24h - http://pastie.org?

@PeterWone
Copy link
Collaborator

Your efforts to improve the quality of the product are never a bad idea.

To understand why I bundled all of it you must consider the fact that VS Code Printing is designed to work offline.

To install a user selection would require either a VSIX that supports some mechanism for choosing during the installation process (I don't think this is possible) or a fancy download page that builds a custom VSIX according to selections on a web page.

Due to the overheads of handling loads of small files it's actually faster to load an 835KB containing all of highlight.js than it is to demand load parts of it and all its dependencies. No I haven't time today. It's my first day at a new job.

@Neohiro79
Copy link

Sorry, might has been a bit chaotic, the proposal for the bundling was originally meant for @josh (the whole webbundle thing).

Yeah sure, no hurry, good luck on your new job!

@PeterWone PeterWone reopened this Mar 17, 2021
@PeterWone
Copy link
Collaborator

I have a strong suspicion that this is actually caused by the problem in #78 which is about to be resolved.

@PeterWone
Copy link
Collaborator

PeterWone commented Mar 17, 2021

We combine "hanging indent" for line numbers with PRE for code indent by using a table and putting the PRE in a TD.

This assumes that the code is strictly line based and that any markup tags introduced by highlight.js will both open and close on the same line, a line being defined by \n rather than wrapping. Mutli-line comments and string literals break this assumption and to fix it in a language-agnostic way we have to parse the rendered HTML looking for unbalanced tags, close them at the end of the line and re-open them at the start of the next.

@PeterWone
Copy link
Collaborator

@joshgoebel note the above comment -- this is not a highlight.js problem.

@joshgoebel
Copy link

joshgoebel commented Mar 17, 2021

we have to parse the rendered HTML looking for unbalanced tags, close them at the end of the line and re-open them at the start of the next.

Yep. You got the idea. If you do it abstractly enough you should make a plugin for it. :) It shouldn't be a lot of code, it's just never been a priority for me.

@PeterWone
Copy link
Collaborator

@alan7872 thanks for opening this can of worms. We finally got to the bottom of things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants