Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base.isidentifier(::Symbol) should check normalization #52641

Open
stevengj opened this issue Dec 27, 2023 · 5 comments · May be fixed by #52735
Open

Base.isidentifier(::Symbol) should check normalization #52641

stevengj opened this issue Dec 27, 2023 · 5 comments · May be fixed by #52735
Labels
bug Indicates an unexpected problem or unintended behavior display and printing Aesthetics and correctness of printed representations of objects. good first issue Indicates a good issue for first-time contributors to Julia parser Language parsing and surface syntax

Comments

@stevengj
Copy link
Member

stevengj commented Dec 27, 2023

As discussed on discourse, while the Symbol("...") foo constructor allows one to construct invalid Julia identifiers and in particular identifiers lacking normalization (an intentional decision way back in #5462), the Base.isidentifier function doesn't check for normalization of Symbol arguments.

This also has the consequence that show is incorrect for non-normalized Symbols, since it checks isidentifier to see whether they can be printed as :foo rather than as Symbol("foo"):

julia> "e\u0301" # e with acute accent, not NFC normalized
""

julia> Symbol("e\u0301") == :é   # correct: :é gets normalized
false

julia> length(String(Symbol("e\u0301"))), length(String(:é))
(2, 1)

julia> Base.isidentifier(Symbol("e\u0301"))   # incorrect: should check normalization
true

julia> Symbol("e\u0301") # incorrect display: not the same as :é
:é

My suggestion is that

isidentifier(s::Symbol) = isidentifier(string(s))
should be changed to something like:

function isidentifier(s::Symbol)
    str = string(s)
    ct(codepoint::UInt32) = get(Unicode._julia_charmap, codepoint, codepoint)
    isidentifier(str) && str == Unicode.normalize(str; stable=true, compose=true, chartransform=ct)
end

(Note that isidentifier should continue to not check normalization for ::String arguments, since in that case it has a slightly different meaning — it checks whether the string can be parsed into a valid identifier.)

@stevengj stevengj added bug Indicates an unexpected problem or unintended behavior parser Language parsing and surface syntax display and printing Aesthetics and correctness of printed representations of objects. good first issue Indicates a good issue for first-time contributors to Julia labels Dec 27, 2023
@stevengj
Copy link
Member Author

stevengj commented Dec 27, 2023

(Marking as "good first issue" since someone needs only to make a 5-line patch from the above suggestion, clarify the isidentifier docs, then add a test.)

@KristofferC
Copy link
Sponsor Member

Note that Symbol(::String) can currently be constant propagated so it would be good to make sure that is still the case.

@stevengj
Copy link
Member Author

stevengj commented Jan 3, 2024

@KristofferC, the proposed fix won't affect Symbol(::String) at all, only isidentifier(::Symbol).

@re1san
Copy link
Contributor

re1san commented Jan 4, 2024

Hey @stevengj I made the above changes as suggested and added a test, but the build process fails due to segmentation fault. I have attached the log below.
log.txt

@stevengj
Copy link
Member Author

stevengj commented Jan 4, 2024

Maybe the chartransform argument needs to be in global scope, so that it is a constant. No, that shouldn't matter. May be some kind of bootstrapping thing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior display and printing Aesthetics and correctness of printed representations of objects. good first issue Indicates a good issue for first-time contributors to Julia parser Language parsing and surface syntax
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants