Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve LexJulia #13

Closed
wants to merge 13 commits into from
Closed

Improve LexJulia #13

wants to merge 13 commits into from

Conversation

getzze
Copy link
Contributor

@getzze getzze commented May 26, 2021

From https://sourceforge.net/p/scintilla/feature-requests/1380/

  • Prefix lexer options name with lexer.julia.
  • [Edit] Prefix fold options name with fold.julia.
  • Remove unsafe glib calls to isdigit and isspace
  • Add string.interpolation option
  • [Edit] Add fold.julia.syntax.based option
  • Strip trailing spaces
  • [Edit] correct bug with inline comments

The fact that the end of the line of a single line comment is not styled as a comment, is a bug, all the line should be styled as comment. But I don't manage to make it work. It is likely a problem at line 970 with the condition if (sc.atLineEnd || sc.ch == '\r' || sc.ch == '\n').

@nyamatongwe
Copy link
Member

To follow the naming convention, "fold.docstring" should be "fold.julia.docstring".

@nyamatongwe
Copy link
Member

These have been merged.

@nyamatongwe nyamatongwe added the julia Caused by the julia lexer label Jun 2, 2021
@nyamatongwe
Copy link
Member

Closing as equivalents merged into 5.0.3 release.

@nyamatongwe nyamatongwe closed this Jun 2, 2021
@nyamatongwe nyamatongwe reopened this Jun 12, 2021
@nyamatongwe
Copy link
Member

There is a problem with docstring folding.

  1. Open the example file lexilla/test/examples/julia/x.jl. Folding is correct.
  2. Move the caret inside the docstring - the triple-double-quoted region.
  3. Type a character.

Result: folding region for docstring now extends to end of file.

It appears this is because of the test for the end of the docstring which requires isdocstring to be true but its initialized to false. The code immediately above that sets isdocstring doesn't run as the fold processing restarts at the line before the typed character, not before the whole docstring.

@getzze
Copy link
Contributor Author

getzze commented Jun 14, 2021

Good catch, I don't remember why I introduced this variable, but it seems that it's working well removing it.
There will be the same problem with the syntax based folding, though. I am keeping track if the current line is inside parentheses or brackets with the list_comprehension and indexing_level variables. That is because for and if keyword do not need an end inside brackets.
Can we tell the Folder to start again at the opening bracket and not one line above?

@getzze
Copy link
Contributor Author

getzze commented Jun 14, 2021

Minimal example:

a = [2 
       for i in randn(10)
       if i > 1]

This code should fold between the 2 brackets.
Modify the second line and the folding is broken.
Any idea how to fix it?

@zufuliu
Copy link
Contributor

zufuliu commented Jun 14, 2021

How about track nested levels for braces/brackets/parentheses on each line, and given keywords inside braces/brackets/parentheses a different style.

@nyamatongwe
Copy link
Member

Its common for lexers and folders to backtrack to a safe consistent starting point before performing their main loop. Perhaps LexerRaku::Fold can be used as an example, where it goes looking for a line starting with the default style as a starting point. I'm currently fixing some similar bugs in LexerRaku.

If backtracking doesn't look reasonable as its too expensive or uncertain, LineState (or similar) can be used to leave breadcrumbs of which lines are good starting points or to count braces (for example).

@nyamatongwe
Copy link
Member

This bug was found by a new automatic test in TestLexers: if lexing / folding the whole file matches the known-good files, then the file is lexed and folded line-by-line. If the result of this is different then that is reported. This test found bugs in 5 folders and 1 lexer.

This test is temporarily disabled for Julia by the testlexers.per.line.disable property in SciTE.properties. Remove the line (it can't be set to 0 as the properties parsing in TestLexers is primitive) and a diagnostic appear along with a .folded.new file.

@zufuliu
Copy link
Contributor

zufuliu commented Jun 15, 2021

I believe current Julia lexer has bugs for type keyword:
from https://docs.julialang.org/en/v1/base/base/#Keywords

The following two-word sequences are reserved: abstract type, mutable struct, primitive type. However, you can create variables with names: abstract, mutable, primitive and type.

@getzze
Copy link
Contributor Author

getzze commented Jun 22, 2021

I just pushed two commits:

  • add a keyword for colorizing builtin functions
  • use LineState to correctly fold and lex line-by-line (and put a block of code inside a function to clean up the code)

Regarding 2-word keywords, as @zufuliu noticed, do you have any idea if this was implemented for another language?
It could be hard-coded, but to make it configurable, i can think of providing a list of 2-word keywords separated by a non-word character, like abstract/type primitive/type mutable/struct.

@zufuliu
Copy link
Contributor

zufuliu commented Jun 22, 2021

There are other languages has keyword pairs (e.g. While ... End While in VB), but unlike Julia, both word are reserved.

Copy link
Member

@nyamatongwe nyamatongwe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lexilla aims at binary compatibility between releases. One aspect of that is that lexical style numbers should remain stable. This change set adds a new style SCE_JULIA_KEYWORD4 at 6 and then moves SCE_JULIA_CHAR and all styles from SCE_JULIA_STRING.

Instead, SCE_JULIA_KEYWORD4 should be added as 20. This allows current applications to work with a new Lexilla.DLL or .so.

@nyamatongwe
Copy link
Member

Regarding 2-word keywords, as @zufuliu noticed, do you have any idea if this was implemented for another language?
It could be hard-coded, but to make it configurable, i can think of providing a list of 2-word keywords separated by a non-word character, like abstract/type primitive/type mutable/struct.

Handling two word keywords completely is more work, particularly when they are split by a line end. Some languages handle this by retaining the previous word and hard-coding some sequences: see LexFortran around line 642 for "type is".

The character currently used for abbreviations or abridgements in word lists is "~" so "mutable~struct".

Copy link
Member

@nyamatongwe nyamatongwe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an unused function that should be removed:

../lexers/LexJulia.cxx:518:13: warning: 'bool IsTripleStringState(Lexilla::LexAccessor&, Sci_Position)' defined but not used [-Wunused-function]
  518 | static bool IsTripleStringState(LexAccessor &styler, Sci_Position i) {
      |             ^~~~~~~~~~~~~~~~~~~

Copy link
Member

@nyamatongwe nyamatongwe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an unused function that should be removed.

../lexers/LexJulia.cxx:518:13: warning: 'bool IsTripleStringState(Lexilla::LexAccessor&, Sci_Position)' defined but not used [-Wunused-function]
  518 | static bool IsTripleStringState(LexAccessor &styler, Sci_Position i) {
      |             ^~~~~~~~~~~~~~~~~~~

@getzze
Copy link
Contributor Author

getzze commented Jun 23, 2021

Lexilla aims at binary compatibility between releases. One aspect of that is that lexical style numbers should remain stable. This change set adds a new style SCE_JULIA_KEYWORD4 at 6 and then moves SCE_JULIA_CHAR and all styles from SCE_JULIA_STRING.

Instead, SCE_JULIA_KEYWORD4 should be added as 20. This allows current applications to work with a new Lexilla.DLL or .so.

Is this a recommendation or is it mandatory?
It's just that it's making the code a bit more ugly, it's better if all the SCE_JULIA_KEYWORDs are consecutive numbers. As it is the first release with LexJulia, very few people will be affected by this change.
Sorry, if I had known this before I would have done like in LexRust and assigned lots of consecutive KEYWORDs just in case it was needed.

@getzze
Copy link
Contributor Author

getzze commented Jun 23, 2021

Regarding 2-word keywords, as @zufuliu noticed, do you have any idea if this was implemented for another language?
It could be hard-coded, but to make it configurable, i can think of providing a list of 2-word keywords separated by a non-word character, like abstract/type primitive/type mutable/struct.

Handling two word keywords completely is more work, particularly when they are split by a line end. Some languages handle this by retaining the previous word and hard-coding some sequences: see LexFortran around line 642 for "type is".

The character currently used for abbreviations or abridgements in word lists is "" so "mutablestruct".

I'm actually thinking if it's worth the effort. In the geany thread, somebody commented that it is an acceptable bug :)

@nyamatongwe
Copy link
Member

Is this a recommendation or is it mandatory?

I regard it as important. How would the change to LexJulia be communicated to downstream projects? I suppose a second Julia lexer could be added with a new name so there was no possibility of mismatches.

It's just that it's making the code a bit more ugly, it's better if all the SCE_JULIA_KEYWORDs are consecutive numbers. As it is the first release with LexJulia,

It will be the third release with LexJulia. The first was 5.0.3 and 5.1.0 was just released.

Sorry, if I had known this before I would have done like in LexRust and assigned lots of consecutive KEYWORDs just in case it was needed.

It appears that the Julia code for keywords is just choosing from sets of lexemes that look like identifiers. A better mechanism for expandable keyword sets is 'substyles' as is implemented in LexCPP and LexPython starting with AllocateSubStyles which allocates style numbers dynamically. Distinct keyword lists are needed more when sets of keywords are also lexically distinct like comment documentation keywords or where there are sublanguages like embedding SQL in C or JavaScript in HTML.

Naming a word list "Raw string literals" was also a bit strange since the code matches identifiers.

@nyamatongwe
Copy link
Member

nyamatongwe commented Jun 23, 2021

From the application point-of-view, substyles are documented at https://www.scintilla.org/ScintillaDoc.html#Substyles .

From the lexer's perspective, the ILexer4 methods from AllocateSubStyles to GetSubStyleBases may need to be implemented. The SubStyles class provides a basic implementation of the required methods as can be seen in LexPython. 'Secondary' styles are probably not needed - they allow greying out inactive code for C/C++.

@getzze
Copy link
Contributor Author

getzze commented Jun 24, 2021

I ended up not using substyles, I didn't understand how they work and just added KEYWORD4 at the end of the style list.
I'm not going to implement the 2-word keywords, it seems like too much work, so this can be merged if there is no other problems.

@nyamatongwe
Copy link
Member

just added KEYWORD4 at the end of the style list.

The change also moved the "Raw string literal prefixes" (previously named "Raw string literals") from keywords[3] to keywords[4]. This is another incompatible change. It has less effect so isn't as much of a problem as renumbering styles but stability is an important attribute of library code. Libraries should be dependable and should strive for compatibility unless there are strong benefits to a change in interface.

@getzze
Copy link
Contributor Author

getzze commented Jun 24, 2021

It's a bit better to have this ordering, because the string prefixes are not identifers, whereas the other lists are. But I understand the compatibility issue.

@nyamatongwe
Copy link
Member

The tags in LexicalClass definitions are more likely to be useful when they are shared as they enable applications to define appearance and behaviour based on particular tags. For example, an application may define a spell-check command that only looks inside styles with tag "comment".

Look in other lexers for the tags they use and only invent new tags when there is no good existing value. The best examples are LexCPP, LexPython, and LexHTML.

3,  "SCE_JULIA_KEYWORD1", "keyword1", "Reserved keywords",
Should just be "keyword" when they are keywords and "identifier" when they are used to highlight functions from a library, perhaps the language's standard library.

6,  "SCE_JULIA_CHAR", "char", "Character",
Should be "literal string character" as this allows operations over all "literal" styles or over more restricted domains.

8,  "SCE_JULIA_BRACKET", "bracket", "Bracket",
Most lexers treat brackets as "operator" and don't have a separate bracket class. Perhaps "operator bracket".

10, "SCE_JULIA_STRING", "string", "String",
"literal string".

11, "SCE_JULIA_SYMBOL", "symbol", "Symbol",
Unsure but "symbol" is likely OK. In different languages, the equivalent of symbols could be thought of as "literal string" or "identifier" and sometimes they are unique or interned.

12, "SCE_JULIA_MACRO", "macro", "Macro",
Could include "preprocessor" (as could SCE_JULIA_SYMBOL) depending on whether you want to treat metaprogramming as a preprocessor.

13, "SCE_JULIA_STRINGINTERP", "stringinterp", "String interpolation",
 "literal string interpolated".

14, "SCE_JULIA_DOCSTRING", "docstring", "Docstring",
 "literal string documentation".

16, "SCE_JULIA_COMMAND", "command", "Command",
Perhaps "literal string command".

18, "SCE_JULIA_TYPEANNOT", "typeannot", "Type annotation",
I don't think there is a tag currently used for types as they are often identifiers. Perhaps "identifier type".

19, "SCE_JULIA_LEXERROR", "lexerror", "Lexing error",
Should include separated "error".

@nyamatongwe
Copy link
Member

Folding fix pushed. Also enabled testing of per-line folding with 2f03abb and documented in change log with 4dd405c.

@getzze
Copy link
Contributor Author

getzze commented Jun 25, 2021

The tags in LexicalClass definitions are more likely to be useful when they are shared as they enable applications to define appearance and behaviour based on particular tags. For example, an application may define a spell-check command that only looks inside styles with tag "comment".

Look in other lexers for the tags they use and only invent new tags when there is no good existing value. The best examples are LexCPP, LexPython, and LexHTML.

3,  "SCE_JULIA_KEYWORD1", "keyword1", "Reserved keywords",
Should just be "keyword" when they are keywords and "identifier" when they are used to highlight functions from a library, perhaps the language's standard library.

6,  "SCE_JULIA_CHAR", "char", "Character",
Should be "literal string character" as this allows operations over all "literal" styles or over more restricted domains.

8,  "SCE_JULIA_BRACKET", "bracket", "Bracket",
Most lexers treat brackets as "operator" and don't have a separate bracket class. Perhaps "operator bracket".

10, "SCE_JULIA_STRING", "string", "String",
"literal string".

11, "SCE_JULIA_SYMBOL", "symbol", "Symbol",
Unsure but "symbol" is likely OK. In different languages, the equivalent of symbols could be thought of as "literal string" or "identifier" and sometimes they are unique or interned.

12, "SCE_JULIA_MACRO", "macro", "Macro",
Could include "preprocessor" (as could SCE_JULIA_SYMBOL) depending on whether you want to treat metaprogramming as a preprocessor.

13, "SCE_JULIA_STRINGINTERP", "stringinterp", "String interpolation",
 "literal string interpolated".

14, "SCE_JULIA_DOCSTRING", "docstring", "Docstring",
 "literal string documentation".

16, "SCE_JULIA_COMMAND", "command", "Command",
Perhaps "literal string command".

18, "SCE_JULIA_TYPEANNOT", "typeannot", "Type annotation",
I don't think there is a tag currently used for types as they are often identifiers. Perhaps "identifier type".

19, "SCE_JULIA_LEXERROR", "lexerror", "Lexing error",
Should include separated "error".

For 18 SCE_JULIA_TYPEANNOT, it is used in a::Int, by default, it highlights the operator for type annotation ::, but there is an option to make it highlight ::Int. So "operator type" is more appropriate, but it could also mean "identifier"... What do you think is best?

For 15 and 17, SCE_JULIA_STRINGLITERAL and SCE_JULIA_COMMANDLITERAL, it is used in string literal like raw"str{}" to highlight raw, like the f in python f-string. What do you think of "literal string (command) prefix" ?

@getzze
Copy link
Contributor Author

getzze commented Jun 25, 2021

I just realized that the raw string prefixes was not needed as all the prefixed strings are treated as raw strings.
I also changed the tags.

@nyamatongwe
Copy link
Member

As an example of tag usage there is a screen shot of an unfinished "search in tag" option for SciTE at the end.

For 18 SCE_JULIA_TYPEANNOT, it is used in a::Int, by default, it highlights the operator for type annotation ::, but there is an option to make it highlight ::Int. So "operator type" is more appropriate, but it could also mean "identifier"... What do you think is best?

Tags were seen as constant so they aren't such a good fit where a property changes the meaning. If its likely that either will be chosen then "operator type". If one choice is likely to predominate then just use its tag.

For 15 and 17, SCE_JULIA_STRINGLITERAL and SCE_JULIA_COMMANDLITERAL, it is used in string literal like raw"str{}" to highlight raw, like the f in python f-string. What do you think of "literal string (command) prefix" ?

Most lexers include any prefix text as part of the string, so "literal string" would likely be most similar to other lexers.

FindTags

@getzze
Copy link
Contributor Author

getzze commented Jun 26, 2021

Tags were seen as constant so they aren't such a good fit where a property changes the meaning. If its likely that either will be chosen then "operator type". If one choice is likely to predominate then just use its tag.

I actually wrote the lexer just to have syntax colorization, but I understand now that this is more powerful. So I added another style 8(), SCE_JULIA_TYPEOPERATOR to tag the operator before type annotation (::) and SCE_JULIA_TYPEANNOT is now used only to tag the type annotation identifier after the operator, only if the option is set to 1.

Thanks!

nyamatongwe pushed a commit that referenced this pull request Jun 28, 2021
nyamatongwe pushed a commit that referenced this pull request Jun 28, 2021
nyamatongwe pushed a commit that referenced this pull request Jun 28, 2021
Standardise tags in lexical class information to follow other lexers where possible.
nyamatongwe pushed a commit that referenced this pull request Jun 28, 2021
@nyamatongwe
Copy link
Member

nyamatongwe commented Jun 28, 2021

The changes have been merged over another change 44f6ac8 which added the include of <functional> needed for a change to OptionSet. I also amended the changes to reference this pull request, remove an unused variable sizeJuliaLexicalClasses, change the order of header includes, and add change log entries.

You should synchronize to the current repository state before making more changes so the changes will apply.

@getzze
Copy link
Contributor Author

getzze commented Jun 28, 2021

Thanks, I think this can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
julia Caused by the julia lexer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants