[html] prevent that embedded grammars can corrupt syntax highlighting of rest of document #20488
Comments
At the moment I'm not sure whether the grammar can be modified to make this happen, but the whole tokenization relies on it. The grammar is really fragile: even if you write a complete CSS style, right now we are still seeing the closing angle bracket @aeschli's HTML extension doesn't rely on the TM grammar to figure out the embedded language boundary, which is not supported yet in the extension host either. The extension parses the file and calculates the boundaries itself, and it's perfectly correct. That's why IntelliSense works reasonably well even though the colors are wrong. So the general feature request could be: if the extension itself knows the tokens better than the grammar file does (very likely if there is a language service behind it), can the extension feed Code the token information? |
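As a rough illustration of computing the embedded boundaries outside the grammar, here is a minimal Python sketch. It is not the HTML extension's actual implementation; `embedded_css_regions`, `OPEN_TAG`, and `CLOSE_TAG` are hypothetical names, and the regular expressions are a simplification that ignores comments, CDATA, and attribute values containing `>`.

```python
import re

# Assumption: a simplified notion of the tags; a real HTML scanner handles more cases.
OPEN_TAG = re.compile(r"<style\b[^>]*>", re.IGNORECASE)
CLOSE_TAG = re.compile(r"</style\s*>", re.IGNORECASE)


def embedded_css_regions(text):
    """Return (start, end) offsets of the content between <style> and </style>.

    The boundaries come from the host document alone, so broken CSS inside
    a region cannot move them.
    """
    regions = []
    pos = 0
    while True:
        opened = OPEN_TAG.search(text, pos)
        if opened is None:
            break
        closed = CLOSE_TAG.search(text, opened.end())
        # An unclosed <style> runs to the end of the document.
        end = closed.start() if closed else len(text)
        regions.append((opened.end(), end))
        if closed is None:
            break
        pos = closed.end()
    return regions


doc = "<style>\nh1 {\n</style>\n<p>text</p>\n"
regions = embedded_css_regions(doc)
print(regions)
```

With boundaries found this way, the incomplete `h1 {` stays confined to its region and the trailing `<p>text</p>` is untouched by whatever the CSS tokenizer does.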
Not really, for this issue I just want:
|
This has to do with the grammars involved here: |
@aeschli The grammar can behave the same. But I believe embedded grammar support differs from editor to editor (Sublime seems to have its own implementation). So I'm wondering if VSCode can change embedded grammar to this way:
Now it's like:
However, some grammars, when embedded, can get "stuck" in step 2 and cause this flickering. @alexandrudima Correct me if I'm wrong... |
Remember that the TextMate grammar sees one line at a time. Mostly it can't see the What could be done is writing an HTML grammar that contains rules for embedded CSS and JavaScript, in particular the appropriate end rules for |
@aeschli 👍 TM grammars are parsed according to TM semantics. TM has no notion of an embedded language, it simply has the notion of including rules from other scopes (from other files). There are TM grammars split into multiple files that do not represent embedded languages (there is one I think for perl regular expressions or one for C platform functions). Our only option to stay TM grammar compatible is to do what @aeschli suggests, where we have a special CSS grammar that expects it to be embedded inside HTML and has end rules for |
That's what Sublime has: https://www.sublimetext.com/docs/3/syntax.html As an alternative, is it possible for us to have such a "pop" rule?
It's possible to modify the CSS grammar so that it expects to be embedded. But I don't know of any way to do that in VSCode today by modifying only the HTML grammar. And I believe this is a problem shared by all embedded language support. I don't think it's reasonable to ask each language's grammar to expect to be embedded. For example, in Vue's single file components, one can embed css/sass/scss/less/stylus/postcss. I hope there is a way to mark the end of the embedded region in HTML, instead of adding a negative lookahead to each grammar rule in these 6 languages. |
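To make the "stuck" behavior concrete, here is a toy Python model. It is not TextMate's or VSCode's real engine: it only mimics the idea that, inside a begin/end rule, just that rule's end pattern and its own nested rules are tried, one line at a time. `Rule` and `open_rules_per_line` are hypothetical names, and the guarded variant uses a lookahead in the end pattern as one possible form of the workaround discussed above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    begin: str  # regex that opens the rule
    end: str    # regex that closes the rule
    children: list = field(default_factory=list)


def open_rules_per_line(root_rules, lines):
    """Return the names of the rules still open after each line."""
    stack, states = [], []
    for line in lines:
        pos = 0
        while True:
            candidates = []  # (match start, priority, match, rule to push or None)
            if stack:
                # Only the innermost rule's end pattern is ever tried.
                m = re.compile(stack[-1].end).search(line, pos)
                if m:
                    candidates.append((m.start(), 0, m, None))
                children = stack[-1].children
            else:
                children = root_rules
            for child in children:
                m = re.compile(child.begin).search(line, pos)
                if m:
                    candidates.append((m.start(), 1, m, child))
            if not candidates:
                break
            # Earliest match wins; on a tie the end pattern wins.
            _, _, m, child = min(candidates, key=lambda c: (c[0], c[1]))
            if child is None:
                stack.pop()
            else:
                stack.append(child)
            pos = m.end()
        states.append([rule.name for rule in stack])
    return states


def html_with(block):
    return [Rule("style", r"<style>", r"</style>", [block])]


lines = ["<style>", "h1 {", "</style>", "<p>text</p>"]
eager = open_rules_per_line(html_with(Rule("block", r"\{", r"\}")), lines)
guarded = open_rules_per_line(
    html_with(Rule("block", r"\{", r"\}|(?=</style>)")), lines
)
print(eager)
print(guarded)
```

In the eager run the unclosed `{` rule is still open on the `</style>` line, so the outer end pattern is never tried and the rest of the document stays inside CSS. In the guarded run the lookahead lets the block rule give up at the closing tag, at the cost of every embeddable grammar having to know its host's end tag.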
The problem bothers me too. It is so easy to corrupt the syntax highlighting of the rest of the document when an embedded language is involved. |
Markdown had a similar problem for fenced codeblocks. The solution was to switch from a This case in HTML is a bit more complicated as you can also have single line |
@mjbvz Thanks, I'll give it a try. BTW I think a better way to write a TM grammar is to write it in YAML: https://github.com/octref/vetur/blob/master/syntaxes/vue.YAML Then use https://github.com/SublimeText/PackageDev to compile it to XML. |
Closing this issue as we don't want to diverge from TextMate. |
Reopening to collect issues around this issue. |
In my opinion, if VSCode cannot make TextMate grammars work as expected for HTML, the VSCode team should consider other workarounds, or even use a different grammar altogether. Any HTML coder expects syntax highlighting to be a main feature of an editor, and it is sometimes the easiest and simplest way to debug code, so breaking the highlighting when an embedded grammar is used is a serious issue. |
Just a constant reminder of why this is indispensable for writing embedded grammars: |
VSCode has always had serious problems with syntax highlighting, not only in PHP/HTML. This annoying issue prevents me from switching from ST3. |
I see Sublime's solution mentioned, but fyi Atom is addressing this with their new Tree-sitter grammar, which supports injection.
|
Working on an HTML embedded syntax of my own, I have now had the (mis?)fortune to discover this ticket. I think the most compelling example is the incomplete CSS one from #20488 (comment):
<style>
h1 {
</style>
Any eager rule in the injected syntax definition disregards the end-embed tag. This is also true for C#, where There are at least two ways to fix this.
The Markdown syntax takes advantage of the fact that it knows the end tag will be the only thing on the line for its
@KamasamaK Can you elaborate on why you thought tree-sitter would solve this problem and why it turns out not to?
|
@michaelblyons Sorry, it actually does work as intended for what is implemented. I was testing in Atom v1.29.0 but it wasn't implemented until v1.30, which is still in beta. Also, embedded CSS still does not work at all. The reason I thought it would work is because the tree-sitter grammars are used to actually parse the full code into a syntax tree. For embedded languages, it would be using distinct grammars to parse those languages and (I think) maintaining those trees independently which should mean that they do not affect each other directly. Also, tree-sitter has inherent error recovery. |
Here is what Atom provides: https://flight-manual.atom.io/hacking-atom/sections/creating-a-grammar/#language-injection. The embedded language can specify an |
I want to add that this problem is not unique to embedding grammars, such as JS or CSS in HTML. It also occurs with complex and quirky languages such as Batch Script. It's related in that some commands, like the comment, consume the rest of the line. This can be handled with an exception to the line end, but usually it's a hack that doesn't fully conform to the language. For instance:
IF DEFINED A (
echo hello
echo a=012) else echo no a
echo no next
The echo can consume the entire line, up to certain characters, but only up to the
echo a=012) else echo no a
and to demonstrate no
echo a=012 else echo no a
The colorizer here has no way to know if it's in a block, but assumes it is, and that
The following one actually tripped me up, and proper handling (coloring or a syntax checker, but even a syntax checker in batchfile is difficult) in the editor would have saved me some trouble:
:: * Check if already performed copy today, if so, no more copies, else clean-up any previous day flags
IF EXIST "%~5\Replication.Flags\%YEARMONTHDAY%" (
ECHO:%DATE% %TIME%-Replication already completed for today! (%YEARMONTHDAY%)>>"%LOGFILE%"
GOTO EXITFLAG
)
The closing And to make matters worse, this doesn't happen everywhere. a The I tried using such a rule as to have the block rule capture all the line, up to but not including another open block, or string quote, or the group end, parsing the capture, and then including only additional group or string items, but this hasn't worked so far, because it will prevent other block constructs ( |
Having spent a few more days trying to make the current batchfile (batch script) syntax closer to how 'CMD' actually reacts, I am pretty sure this is what TextMate and the tmLanguage system need:
Maybe use BEGIN_SUBPARSE/END_SUBPARSE/SUBPARSE_PATTERNS, with PATTERNS being just the exception patterns that consume the document to prevent the end from matching, and SUBPARSE_PATTERNS being the scopes to apply to the content. Standard repository items could be used for the inclusion patterns, but any scoping they would do would be ignored. This actually matches how CMD processes input, specifically where grouping is involved. It normally reads in one line at a time, unless in that line it finds a group that hasn't ended; then it reads ahead further lines, only checking for exceptions to finding the group end, and then returns to processing the collected lines one command at a time. A command on a line by itself may process its line completely differently than when a preceding group consumes it. For instance, by itself:
the SET command tokenizes the line using the outer quotes, stripping them, resulting in But
will, if 'avar' is not defined, cause 'avar' to be erased (but it's not even defined) as the set is restricted to tokenizing Now, if you take the arithmetic option, you get a new issue, as parentheses are permitted for arithmetic grouping. But these are not to be confused with the code groups, and in fact, CMD only accepts opening of groups at certain places, so the arithmetic grouping of the SET /A command does not count, and thus the first closing parenthesis will be taken as the closing of the command group, rather than the closing of an arithmetic set. |
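The read-ahead behavior described in this comment can be sketched roughly as follows. This is a Python illustration of the description only, not CMD's real parser. `collect_groups` is a hypothetical helper, and two simplifications are assumptions of the sketch: a group opens only when `(` is the last non-blank character on a line, and any `)` outside double quotes closes the innermost open group.

```python
def collect_groups(lines):
    """Gather lines into units: a line on its own, or a whole ( ... ) group."""
    groups, current, depth = [], [], 0
    for line in lines:
        in_quotes = False
        for ch in line:
            if ch == '"':
                in_quotes = not in_quotes
            elif ch == ")" and not in_quotes and depth > 0:
                # Any unquoted ")" ends a group, even in the middle of an echo.
                depth -= 1
        # Assumption: a group opens only with a trailing "(".
        if line.rstrip().endswith("("):
            depth += 1
        current.append(line)
        if depth == 0:
            groups.append(current)
            current = []
    if current:
        groups.append(current)  # unterminated group at end of input
    return groups


script = [
    "IF DEFINED A (",
    "echo hello",
    "echo a=012) else echo no a",
    "echo no next",
]
groups = collect_groups(script)
for group in groups:
    print(group)
```

Under these assumptions the first three lines are collected as one unit because the `)` in the echo line ends the group, and the last line is processed on its own. A line-at-a-time grammar has no equivalent of this collection step, which is the mismatch the comment describes.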
Any progress on this please? |
I'll chip in that I have had huge problems regarding this as well and for the life of me haven't understood why. Disappointingly enough, it seems to boil down to the fact that embedded languages are inherently broken because the devs don't want to make additions on top of the TextMate grammar. That is a real shame because it turns embedded languages into a random mess, sometimes it works, sometimes it doesn't, you never know which. There have been several valid proposals (copying |
Currently, when writing an embedded language, the embedded grammar can often disrupt the rest of the document:
This is html grammar.
Imagine editing css in a <style> at the <head> of a long html file. The rest of document will be constantly re-tokenized and (wrongly) syntax highlighted. This doesn't look good UX-wise and will probably have some perf impact.
For the following grammar and setting:
I hope VSCode can only apply CSS grammar to the section between <style> and </style>, so when I'm typing out my incomplete CSS VSCode would just re-tokenize and highlight until </style>, but not rest of the document.
/cc @alexandrudima
Also would like to hear @aeschli's opinion, as you are writing the html grammar.