Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[html] prevent that embedded grammars can corrupt syntax highlighting of rest of document #20488

Open
octref opened this issue Feb 12, 2017 · 35 comments
Assignees
Labels
bug Issue identified by VS Code Team member as probable bug debt Code quality issues languages-basic Basic language support issues upstream Issue identified as 'upstream' component related (exists outside of VS Code)
Milestone

Comments

@octref
Copy link
Contributor

octref commented Feb 12, 2017

Currently, when writing embedded language, a lot of the times the embedded grammar can cause disruption to the rest of DOCUMENT:

This is html grammar.

image

Imagine editing css in a <style> at the <head> of a long html file. The rest of document will be constantly re-tokenized and (wrongly) syntax highlighted. This doesn't look good UX-wise and will probably have some perf impact.

For the following grammar and setting:

- name: source.css.embedded.html
   begin: <style>
   end: </style>
   patterns:
   - include: source.css
"embeddedLanguages": {
  "source.css": "css"
}

I hope VSCode can only apply CSS grammar to the section between <style> and </style>, so when I'm typing out my incomplete CSS VSCode would just re-tokenize and highlight until </style>, but not rest of the document.

/cc @alexandrudima

Also would like to hear @aeschli's opinion, as you are writing the html grammar.

@joaomoreno joaomoreno assigned octref, alexdima, aeschli and rebornix and unassigned octref Feb 14, 2017
@rebornix
Copy link
Member

At the moment I'm not sure if it's possible to modify the grammar to make this happen, but the whole tokenization relies on it. The grammar is really fragile, like even if you write a complete CSS style, right now we are still seeing the closing angle bracket <style> as part of the CSS (it can be fixed by #20610).

@aeschli 's HTML extension doesn't rely on the TM Grammar to figure out the embedded language boundary, which is not supported yet in the extension host as well. The extension parses the file and calculate the boundaries itself and it's perfectly correct. That's why the intellisense works reasonably even though the color is wrong. So the general feature request can be, if the extension itself knows about the tokens better than Grammar file does (very likely if there is a language service behind it), can the extension feed Code with the tokens information ?

@octref
Copy link
Contributor Author

octref commented Feb 14, 2017

So the general feature request can be, if the extension itself knows about the tokens better than Grammar file does (very likely if there is a language service behind it), can the extension feed Code with the tokens information ?

Not really, for this issue I just want:

  • Embedded grammar shouldn't tokenize anything after end.
  • So when I have autoclose off, this kind of flicking color wouldn't happen. This especially bothers me on large html files.

grammar

@aeschli
Copy link
Contributor

aeschli commented Feb 15, 2017

This has to do with the grammars involved here:
https://github.com/textmate/html.tmbundle and https://github.com/atom/language-css.

I just checked that TextMate behaves the same.
image

@octref
Copy link
Contributor Author

octref commented Feb 15, 2017

@aeschli The grammar can behave the same. But I believe embedded grammar support differs from editor to editor (Sublime seems to have its own implementation). So I'm wondering if VSCode can change embedded grammar to this way:

  1. For a rule with end, match end first
  2. If end is found, tokenize section between start and end with embedded grammar
  3. If end is not found, match for next rule

Now it's like:

  1. Match the start of a rule
  2. If that rule indicates embedded grammar, use embedded grammar to tokenize forward on
  3. When no rule match, return to match the rule's end

However, some grammars when embedded could get "stuck" in step 2 and cause this flicking.

@alexandrudima Correct me if I'm wrong...

@aeschli
Copy link
Contributor

aeschli commented Feb 20, 2017

Remember that the text mate grammar sees a line at a time. Mostly it can't see the </style> tag at the time it enters <style>.
Once it starts matching tokens in between and new states are entered, the end rule is no longer active, that's just how TextMate is designed. I don't think we should try to 'improve' that. We want to be 100% TextMate compatible.
The HTML language server has a proper parser that reads the whole file and knows the exact rules on when embedded ranges are ended. But running the parser is more expensive and runs asynchronously in its own process. That doesn't work well with tokenizing, which runs in the renderer and needs to update quickly, as you type.

What could be done is writing a HTML grammar that contains rules for embedded css and JavaScript, in particular the appropriate end rules for </style> and </script> tags.

@aeschli aeschli added this to the Backlog milestone Feb 20, 2017
@alexdima
Copy link
Member

@aeschli 👍

TM grammars are parsed according to TM semantics.

TM has no notion of an embedded language, it simply has the notion of including rules from other scopes (from other files).

There are TM grammars split into multiple files that do not represent embedded languages (there is one I think for perl regular expressions or one for C platform functions).

Our only option to stay TM grammar compatible is to do what @aeschli suggests, where we have a special CSS grammar that expects it to be embedded inside HTML and has end rules for </style> everywhere. The alternative would be to fork TM parsing semantics, in which case I would do something entirely different...

@octref
Copy link
Contributor Author

octref commented Feb 20, 2017

That's what sublime have:

https://www.sublimetext.com/docs/3/syntax.html

image

As an alternative, is it possible for us to have such a "pop" rule?

What could be done is writing a HTML grammar that contains rules for embedded css and JavaScript, in particular the appropriate end rules for </style> and </script> tags.

Our only option to stay TM grammar compatible is to do what @aeschli suggests, where we have a special CSS grammar that expects it to be embedded inside HTML and has end rules for </style> everywhere.

It's possible to modify the CSS grammar to make it expects itself to be embedded. But I don't know any way to do that by only modifying HTML grammar now in VSCode.

And I believe this is a problem shared by embedded language support. I don't think it's reasonable to ask each language's grammar to expect itself to be embedded. For example in Vue's single file component, one can embed css/sass/scss/less/stylus/postcss. I hope there is a way to mark the end of embedded region in HTML, instead of adding a negative lookahead to each grammar rule in these 6 languages.

@aeschli aeschli added the feature-request Request for new features or functionality label Feb 21, 2017
@alexdima alexdima removed their assignment Mar 28, 2017
@onixie
Copy link

onixie commented Mar 29, 2017

The problem bothers me too. It is so easy to corrupt the syntax highlighting of the whole rest document when embedded language involves.

@mjbvz mjbvz added the languages-basic Basic language support issues label May 2, 2017
@mjbvz mjbvz self-assigned this May 2, 2017
@mjbvz
Copy link
Collaborator

mjbvz commented May 2, 2017

Markdown had a similar problem for fenced codeblocks. The solution was to switch from a begin +end rule to begin+while rule. Here's an example rule: https://github.com/Microsoft/vscode/blob/e105d5cc998b18e35ec8b2ca8b340a2bf2bdb3db/extensions/markdown/syntaxes/markdown.tmLanguage#L584

This case in HTML is a bit more complicated as you can also have single line style / script blocks. I believe these would have to be handled separately since a while rule would not work in that case

@octref
Copy link
Contributor Author

octref commented May 3, 2017

@mjbvz Thanks I'll give it a try.

BTW I think a better way to write TM grammar is to write in JSON: https://github.com/octref/vetur/blob/master/syntaxes/vue.YAML

Then use https://github.com/SublimeText/PackageDev to compile them to XML.
Just in case you get tired of editing XML.

@aeschli
Copy link
Contributor

aeschli commented Sep 12, 2017

Closing this issue as we don't want to diverge from TextMate.

@aeschli aeschli closed this as completed Sep 12, 2017
@aeschli aeschli changed the title Apply embedded language grammar only to section between begin and end [html] embedded grammars corrupt syntax highlighting of rest of document Sep 13, 2017
@aeschli aeschli reopened this Sep 13, 2017
@aeschli
Copy link
Contributor

aeschli commented Sep 13, 2017

Reopening to collect issues around this issue.

@aeschli aeschli added the debt Code quality issues label Sep 13, 2017
@aeschli aeschli changed the title [html] embedded grammars corrupt syntax highlighting of rest of document [html] prevent that embedded grammars can corrupt syntax highlighting of rest of document Sep 13, 2017
@SDP190
Copy link

SDP190 commented Jun 3, 2018

To my opinion if VSCode cannot make TextMate grammars work as expected for html , VSCode team should consider other workarounds, or even use a different grammar at all.

That is, any html coder would expect to see syntax highlighting as a main feature of an editor, it is sometimes the easiest and simplest way to debug the code, breaking the highlight when using embedded grammar is a tough issue.

@rebornix rebornix removed their assignment Jun 4, 2018
@octref
Copy link
Contributor Author

octref commented Jun 6, 2018

Just constant reminder why this is indispensable for writing embedded grammars:

vuejs/vetur#811
vuejs/vetur#812

image
image
image

@andremacola
Copy link

VsCode always had serious problems with syntax highlight not only in PHP/HTML. This is annoying issue that prevent me to switch from ST3.

@KamasamaK
Copy link

KamasamaK commented Aug 21, 2018

I see Sublime's solution mentioned, but fyi Atom is addressing this with their new Tree-sitter grammar, which supports injection.

EDIT: Actually, it doesn't seem that their implementation solves this issue.

@michaelblyons
Copy link

michaelblyons commented Aug 25, 2018

Working on an HTML embedded syntax of my own, I have now had the (mis?)fortune to discover this ticket. I think the most compelling1 example is the incomplete CSS one from #20488 (comment):

<style>
h1 {
</style>

Any eager rule in the injected syntax definition disregards the end-embed tag. This is also true for C#, where <% this.worksFine(); %> but opening a brace clobbers the end-injection tag: <% if(you.haveProblems()) { %> <p>Sorry!</p>

There are at least two ways to fix this.

  1. In Sublime Text, there are two constructs2 that make an effective end regex take precedence over any included patterns. This means that it's injected to the left in all nested patterns until the stack recedes to uncover it. I understand why this is annoying for the engine design, but it is critical for injectable languages.

    This does have a limitation, though, wherein contexts where the end pattern should not apply still jump back to the parent: <%= foo["uh-oh%>"] %>

  2. In my specific case, if there were meta (or other) scopes specific to C# brace blocks, I could set up injections to push a grandchild text.html.basic when encountering the closing tag %> in text.html meta.embedded source.cs meta.block. If the code is valid, the closing braces <% } %> would all eventually rectify themselves.

    In a sense, this is a more accurate representation of the scope stack, but it doesn't help the folks with dangling braces like the CSS above. (It also doesn't help me since the C# syntax definition doesn't put curly-brace blocks on the scope stack.)

The Markdown syntax takes advantage of the fact that it knows the end tag will be the only thing on the line for its while escape.

@KamasamaK Can you elaborate on why you thought tree-sitter would solve this problem and why it turns out not to?

Footnotes

  1. I suspect that the PHP one is less significant. It might be possible to avert it by adding a L:text.html.php source.css scope to the end of line 112.

  2. with_prototype and embed/escape

@KamasamaK
Copy link

@michaelblyons Sorry, it actually does work as intended for what is implemented. I was testing in Atom v1.29.0 but it wasn't implemented until v1.30, which is still in beta. Also, embedded CSS still does not work at all.

The reason I thought it would work is because the tree-sitter grammars are used to actually parse the full code into a syntax tree. For embedded languages, it would be using distinct grammars to parse those languages and (I think) maintaining those trees independently which should mean that they do not affect each other directly. Also, tree-sitter has inherent error recovery.

@octref
Copy link
Contributor Author

octref commented Aug 28, 2018

Here is what Atom provides: https://flight-manual.atom.io/hacking-atom/sections/creating-a-grammar/#language-injection. The embedded language can specify an injectionRegex that's being read by the outer language grammar.

@msftrncs
Copy link

I want to add, that this problem is not unique to embedding grammers, such as JS or CSS in HTML.

It also occurs with complex and quirky languages such as Batch Script. Its related in that some commands, like the comment, consume the rest of the line. This can be handled with an exception to the line end, but usually its a hack that doesn't fully conform with the language.

Instance:

IF DEFINED A (
	echo hello
	echo a=012) else echo no a
echo no next

The echo can consume the entire line, up to certain characters, but only up to the ) character if the echo appears inside a block as shown. Notice how the following doesn't color correctly here:

echo a=012) else echo no a

and to demonstrate no )

echo a=012 else echo no a

The colorizer here has no way to know if its in a block, but assumes it is, and that else echo are keywords. Myself I would prefer the scoping of the entire contents of the echo as 'string.unquoted' because that is what it is. This statement outside of a block (and not part of an IF block) would echo:

a=012) else echo no a

The following one actually tripped me up, and proper handling (coloring or a syntax checker, but even a syntax checker in batchfile is difficult) in the editor would have saved me some trouble:

:: * Check if already performed copy today, if so, no more copies, else clean-up any previous day flags
IF EXIST "%~5\Replication.Flags\%YEARMONTHDAY%" (
    ECHO:%DATE% %TIME%-Replication already completed for today! (%YEARMONTHDAY%)>>"%LOGFILE%"
    GOTO EXITFLAG
)

The closing ) in the ECHO statement closed the block, and the script always jumped to EXITFLAG regardless of the condition. This isn't the only nearly identical ECHO statement in the script, but its the only one in a block, and they all looked exactly alike. (again, my preference for making ECHO scope as an unquoted string, it just might have been more apparent if the block ending ) would have appeared differently than the ones treated as echo text)

And to make matters worse, this doesn't happen everywhere. a REM cannot be broken by any of the special characters, but REM: (among others) can be (unless they're in double quotes).

The SET statement is even better. It also consumes an entire line, and you can start string or block constructs (with /A), but they always terminate at the end of the line, whether the end construct came or not, except lack of a block close causes an error, which can cause confusion if you are already in a block. Not hard to handle, since you just use SET specific block and quoting rules, but handling of outer blocks is the same problem.

I tried using such a rule as to have the block rule capture all the line, up to but not including another open block, or string quote, or the group end, parsing the capture, and then including only additional group or string items, but this hasn't worked so far, because it will prevent other block constructs (IF or FOR) from scoping their whole block. Even if I could get this to work right, it will probably break line continuation, if not other things.

@msftrncs
Copy link

Having spent a few more days trying to make the current batchfile (batch script) syntax closer to how 'CMD' actually reacts, and I am pretty sure this is what TextMate and the tmLanguage system needs:

  • a new block construct meant for the purpose of capturing multiple lines (still processed one line at a time) for sub-tokenizing together in a specified scope(s).
  • allow for inclusion of patterns that further capture lines, preventing the matching the blocks end point.
  • Then tokenize this captured content via the specified scope(s) all the captured document, from end of the beginning match to the beginning of the end match, again 1 line at a time.

Maybe use BEGIN_SUBPARSE/END_SUBPARSE/SUBPARSE_PATTERNS with just PATTERNS being the exception patterns that consume the document to prevent the end from matching, and SUBPARSE_PATTERNS being the scopes to apply to the content.

Standard repository items could be used for the inclusion patterns, but any scoping they would do would be ignored.

This actually matches how CMD processes input, specifically where grouping is involved. It normally reads in 1 line a time, unless in that line it finds a group that hasn't ended, then it reads ahead further lines only checking for exceptions to finding the group end, and then returns to processing the collected lines one command at a time. A command on a line by itself may process its line completely different than when a preceding group consumes it.

For instance, by itself:

SET "avar=" hello) & echo !!!"

the SET command tokenizes the line using the outer quotes, stripping them, resulting in
avar, =, and " hello) & echo !!!

But

IF DEFINED avar THEN (echo Yes) ELSE (
SET "avar=" hello) & echo !!!"

will, if 'avar' is not defined, cause 'avar' to be erased (but its not even defined) as the set is restricted to tokenizing avar, =, and then echoing !!!"

Now, if you take the arithmetic option, you get a new issue, as parenthesis are permitted for arithmetic grouping. But these are not to be confused with the code groups, and in fact, CMD only accepts opening of groups at certain places, so the arithmetic grouping of the SET /A command do not count, and thus the first closing parenthesis will be taken as the closing of the command group, rather than the closing of an arithmetic set.

@tomByrer
Copy link

tomByrer commented May 2, 2019

Any progress on this please?

@HugoGranstrom
Copy link

I'll chip in that I have had huge problems regarding this as well and for the life of me haven't understood why. Disappointingly enough, it seems to boil down to the fact that embedded languages are inherently broken because the devs don't want to make additions on top of the TextMate grammar. That is a real shame because it turns embedded languages into a random mess, sometimes it works, sometimes it doesn't, you never know which. There have been several valid proposals (copying pop from sublime for example) that just seems to be ignored just to "conform to Textmate grammar no matter what even if it means having a broken feature"...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue identified by VS Code Team member as probable bug debt Code quality issues languages-basic Basic language support issues upstream Issue identified as 'upstream' component related (exists outside of VS Code)
Projects
None yet
Development

No branches or pull requests