Clang 14 rejects certain Unicode characters in identifiers that are accepted by Clang 13 and the C++ Standard #54732

Closed
tttapa opened this issue Apr 3, 2022 · 57 comments
Labels
c++23 clang:frontend Language frontend issues, e.g. anything involving "Sema"

Comments

@tttapa

tttapa commented Apr 3, 2022

Some Unicode characters like ₊ (U+208A) and other subscripts are rejected by Clang 14. These characters are in the allowed ranges for identifiers in the [lex.name] section of the C++ Standard. Recent versions of GCC and older versions of Clang do not raise any errors.

For example:

double foo(double xₖ, double xₖ₊₁) {
  return xₖ₊₁ - xₖ;
}
$ clang++-14 -c unicode.cpp -std=c++20
unicode.cpp:1:36: error: character <U+208A> not allowed in an identifier
double foo(double xₖ, double xₖ₊₁) {
                               ^
unicode.cpp:1:39: error: character <U+2081> not allowed in an identifier
double foo(double xₖ, double xₖ₊₁) {
                                ^
unicode.cpp:2:14: error: character <U+208A> not allowed in an identifier
  return xₖ₊₁ - xₖ;
           ^
unicode.cpp:2:17: error: character <U+2081> not allowed in an identifier
  return xₖ₊₁ - xₖ;
            ^
4 errors generated.
$ clang++-14 --version
Ubuntu clang version 14.0.1-++20220402053234+23d08271a4b2-1~exp1~20220402053315.111
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Is this a deliberate change or a regression bug from Clang 13 to 14?

@EugeneZelenko EugeneZelenko added clang:frontend Language frontend issues, e.g. anything involving "Sema" and removed new issue labels Apr 3, 2022
@llvmbot
Collaborator

llvmbot commented Apr 3, 2022

@llvm/issue-subscribers-clang-frontend

@cor3ntin
Contributor

cor3ntin commented Apr 4, 2022

This is a deliberate change.
Clang 14 implemented https://wg21.link/p1949 which was standardized for all C++ versions and limits the set of allowed codepoints in identifiers to what is recommended by Unicode.

I do wonder if we could make that clearer in the diagnostic though
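
For readers unfamiliar with the rule P1949 adopts, its shape is roughly: an identifier starts with `_` or a code point that has the Unicode XID_Start property, and continues with code points that have XID_Continue. The sketch below shows only that structure. The two `toy_*` predicates are hypothetical stand-ins that know ASCII and a handful of code points from this thread, not the real Unicode tables and not Clang's lexer, and the paper's normalization requirement is ignored.

```cpp
#include <cstddef>
#include <string_view>

// Toy stand-ins for the Unicode XID_Start / XID_Continue properties.
// Assumption: a real implementation consults the full Unicode tables;
// only ASCII and a few code points mentioned in this thread are covered.
constexpr bool toy_xid_start(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
           c == 0x03C6 ||  // φ GREEK SMALL LETTER PHI
           c == 0x03A9;    // Ω GREEK CAPITAL LETTER OMEGA
}

constexpr bool toy_xid_continue(char32_t c) {
    return toy_xid_start(c) || (c >= U'0' && c <= U'9') || c == U'_' ||
           c == 0x2096;    // ₖ LATIN SUBSCRIPT SMALL LETTER K
}

// Returns the index of the first code point that breaks the
// "(XID_Start | '_') XID_Continue*" shape, or npos if none does.
// An empty string is not an identifier and is reported at index 0.
constexpr std::size_t first_invalid(std::u32string_view id) {
    if (id.empty()) return 0;
    if (!(id[0] == U'_' || toy_xid_start(id[0]))) return 0;
    for (std::size_t i = 1; i < id.size(); ++i)
        if (!toy_xid_continue(id[i])) return i;
    return std::u32string_view::npos;
}
```

With these toy tables, `first_invalid(U"x\u2096\u208A\u2081")` (that is, `xₖ₊₁`) stops at index 2, the `₊`, which matches the first character Clang 14 complains about in the report above.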

@tttapa
Author

tttapa commented Apr 4, 2022

Thanks for the quick reply!

That's unfortunate, these subscripts were really handy to use the mathematical notation from papers, formulas and pseudocode in the C++ implementation. Is there an option to get back the old behavior? (I didn't find one in https://clang.llvm.org/docs/ClangCommandLineReference.html but I might be searching for the wrong keywords.)

@intractabilis

For some inexplicable reason ∂, 𝜕 are now not allowed. Dear Steve Downey, Zach Laine, Tom Honermann, Peter Bindels, and Jens Maurer why do you hate derivatives so much?

Dear Clang, please, don't adopt P1949R7 because it is ridiculous, unnecessary, and breaks existing code.

@intractabilis

I posted a comment on the corresponding review page: https://reviews.llvm.org/D104975#3486313

@AaronBallman
Collaborator

I agree that this behavior is intentional and some amount of broken code is expected as a result. I'm sorry you've been caught by that!

Is there an option to get back the old behavior?

There is not. We could perhaps elect to not implement this paper in older language modes (so it only happens in -std=c++2b and later) and we could elect to add a feature flag so you can opt into a non-conforming mode in C++23 and later. However, such a change is somewhat risky and something I'd like to avoid unless we see significant code breakage in the wild (system headers, major third-party library headers, a ton of individual user projects, etc). Previously, neither the C nor the C++ committee had a principled reason for what was or wasn't a valid character in an identifier when it came to Unicode characters. This caused real problems (including a high-score CVE in the same space) and so the committees both decided to defer to the Unicode consortium as to what is and isn't a valid character for an identifier (with one exception for _ as the low-line character was necessary for backwards compatibility). We do not wish to go back to ad hoc allowance/rejection of characters for identifiers, so we're most likely going to wait for the Unicode consortium to change their rules or the C or C++ committees to add exceptions.

That said, the fact that this code was broken and it's causing you pain is helpful for the standards bodies to understand the impact of the changes. We'll make sure this information is fed back to the standards bodies (it's already generated some discussion from the original report). And if it starts to look like more people are getting caught by this, we'll certainly consider what changes we can make to ease the burdens.

Dear Clang, please, don't adopt P1949R7 because it is ridiculous, unnecessary, and breaks existing code.

While you might be frustrated by the situation, please do not disparage the hard work of others as being ridiculous or unnecessary, and please follow our Code of Conduct: https://llvm.org/docs/CodeOfConduct.html

@intractabilis

intractabilis commented May 2, 2022

I insist that excluding math symbols like the partial derivative is, in my opinion, unnecessary and strange. I would argue that 𝜕 makes more sense than supporting emoji in identifiers. I am really sorry and regret that the hard work of others was directed at something unnecessary and arbitrary, but the hard work of others can hardly be a reason to change my opinion.

GCC supports 𝜕, and I hope it's not going to change.

@intractabilis

I guess the reason why 𝜕 was excluded is somebody confused a mathematical operator, which is a function that acts on other functions or on some structured objects, with a programming language operator, which is professional lingo for a mathematical operation, or for some other language-specific weird operation. 𝜕 is a mathematical operator, for any purpose of a programmer it's just a letter.

However, the existence of an explanation doesn't make this decision good, logical, sane, or necessary. 𝜕 is much more useful than emoji. If I implement an algorithm from a paper, and the paper says 𝜕Ω, this is what would be absolutely natural to use in the code.

There is neither benefit nor sense in removing 𝜕.

@FrankHB

FrankHB commented May 7, 2022

I guess the reason why 𝜕 was excluded is somebody confused a mathematical operator, which is a function that acts on other functions or on some structured objects, with a programming language operator, which is professional lingo for a mathematical operation, or for some other language-specific weird operation. 𝜕 is a mathematical operator, for any purpose of a programmer it's just a letter.

I'd like to add some notes.

Generally speaking, a mathematical operator is an operator in the sense of programming languages, which derive the rules from math systems.

Placing operators in a specific syntactic category is a language-specific design choice. For example, operators are punctuators (rather than identifiers) in C and C++, while many Lisp dialects treat operators the same as elements in the head subform of a function application (i.e. operators = the functions being applied), and such operators can be named by identifiers. (To be accurate, punctuators here are the "operator-or-punctuator" category in C++; operators and punctuators will be in different categories later.)

I don't mean to change C/C++, but I'd argue the latter ("the Lisp style") is better than the former ("the ALGOL style") when it comes to a language-agnostic meaning of the notion "operator". (I use "ALGOL" to suggest that whether an operator can be defined by the user is irrelevant here.)

Traditionally, math systems do not distinguish syntactic and semantic forms of elements like operators, as they are always self-evaluating. That is, they will not change to anything else during deduction in the system, unless combined with something else (the operands). This is OK in traditional uses (where one is just interested in getting the results of some computations), but formally problematic when you want to describe the underlying system in more detail (say, with operational semantics). To describe the precise behavior of the system, you have to differentiate whether an element (not necessarily a self-evaluating one) is evaluated (more formally, reduced to normal form) or not, and mixing different contextual meanings of such elements would be a mess.

Languages like C and C++ are not totally formally defined. However, they still imply rules like deduction systems in their formal grammars. Specifically, C and C++ have the notion of phases of translation, so lexically identical elements (even ones that are "self-evaluating" during translation) can actually be different: identifiers as preprocessing tokens are not the same as identifiers as tokens.

This example is important because it concerns exactly the meaning of "identifiers" handled here. I don't think it is the Unicode consortium's work to clarify the mapping from the meaning in specific programming languages to the definition in the Unicode specification. In particular, not all programming languages need the distinction. (C and C++ need it, because some identifiers that are preprocessing tokens are converted to keyword tokens instead of identifier tokens.) So the goal "to establish conventions that will be followed by most/all programming languages" would hardly be achieved without further effort and more careful analysis (which would be closer to "the Lisp style" than to "the ALGOL style").

The syntactic element 𝜕 is traditionally used to construct some specific kind of operators in the calculus. A single letter 𝜕 has no specific formal semantics without combining to other syntactic elements. Nevertheless, in more modern views, it can be treated as a higher-order constructor of the operators (to combine other symbols of variables together). This is essentially the same role of the lexeme lambda (or λ) in Lisp dialects: a single lambda names no special entity because special forms in Lisp do not reduce to Lisp values, only well-formed lambda forms are meaningful. So, lambda in such languages will not name a thing by user.

However, in some extended calculi, lambda can actually be derived as a first-class object. So I suggest that whether such a syntactic element can be used on its own in well-formed programs is a language-specific detail, and a language-agnostic treatment should allow such elements to be plain variables in general (as λ does not necessarily represent the constructor of lambda abstractions). To minimize surprise, assigning the same lexical category (compared to other identifiers naming variables) to such elements seems necessary, whatever it would be as an element of identifiers on the Unicode side.

@intractabilis

intractabilis commented May 7, 2022

Generally speaking, a mathematical operator is an operator in the sense of programming languages, which derive the rules from math systems.

It is not. What we call a math operator in a programming language is, in real math, called an operation. Never in my life have I heard anyone call addition a "plus operator" in a calculus class. Wikipedia even has two separate articles for math operators and for programming language operators.

I don't think it is the Unicode consortium's work to clarify the mapping from the meaning in specific programming languages to the definition in the Unicode specification.

Yes, I agree. I think that P1949 helps nobody.

The syntactic element 𝜕 is traditionally used to construct some specific kind of operators in the calculus.

Not necessarily. It's common to denote the boundary of Ω as 𝜕Ω. With some effort, I guess you can develop a theory where 𝜕 in this case will be an operator, but in most papers it's just syntactic sugar to denote a boundary.

@tttapa
Author

tttapa commented May 7, 2022

@AaronBallman, thank you for the detailed reply.

While I agree that serious issues and security vulnerabilities should be addressed and fixed retroactively for older standards, I feel this change goes way beyond that, for two reasons:

  1. It potentially breaks a lot of working code, including e.g. code in papers or other publications, which will now forever be broken, even when explicitly compiling for older language standards.
  2. There is a big difference between disallowing risky control characters because of security issues and disallowing normal, well-behaved mathematical characters and subscripts.

I'm in favor of ironing out some of the inconsistencies in the ranges of allowed characters, and addressing things like normalization may be important, but in my opinion, by suddenly disallowing perfectly reasonable characters, the current implementation of P1949 creates many more problems than it actually solves.


Aside from the issue of backwards compatibility, I'd like to motivate the use of certain Unicode characters:

I'd argue that it is perfectly reasonable to use variable names such as 𝜕Ω, as this 1. improves readability of the code, and 2. allows you to closely follow the notation used in the field (e.g. mathematical papers).

As an example from the code I'm currently working on:

// Compute forward-backward envelope
φₖ₊₁ = ψₖ₊₁ + 1 / (2 * γₖ₊₁) * pₖ₊₁ᵀpₖ₊₁ + grad_ψₖ₊₁ᵀpₖ₊₁;

With limited Unicode support, I have to write something like this:

// Compute forward-backward envelope
φ_k_plus_1 = ψ_k_plus_1 + 1 / (2 * γ_k_plus_1) * p_k_plus_1ᵀp_k_plus_1
           + grad_ψ_k_plus_1ᵀp_k_plus_1;

What could previously be effortlessly parsed and easily matched to the formula in the paper has now become an unreadable mess of letters and underscores.

In the first expression, the variable names are distinct and concise, and can be recognized at a glance. As a result, you can easily focus on the operations that are actually carried out on them. Thanks to the subscripts, you automatically focus on the actual names rather than on the k+1, which is less important.

In the second expression, variable names share the _k_plus_1 part and are harder to tell apart at a glance. Furthermore, because of the long variable names, the operators are farther apart, making it much harder to parse the expression as a whole.

Searching the code base of my current project, I found over a thousand matches for the subscripts 0 through 9.
Personally, I will not be refactoring the code, because 1. it is an unnecessary change to code that works perfectly fine as is, and 2. it would seriously hurt readability.
That unfortunately means I've had to downgrade Clang to version 13.


I noticed that P1949R7 states (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1949r7.html#what-will-this-proposal-not-change):

5 What will this proposal not change?

5.1 The validity of “extended” characters in identifiers

All current compilers allow characters outside the basic source character set directly in source today.

Does this leave room for Clang and other compilers to still allow these characters even though they might be outside of XID_*?

We'll make sure this information is fed back to the standards bodies

I appreciate that, thank you.
I realize that this is not an issue for the majority of programmers, but this is something that is quite important when dealing with mathematical expressions. For example, in Julia, another language that's often used in our field, φₖ₊₁, grad_ψₖ₊₁ᵀpₖ₊₁ and 𝜕Ω are all valid identifiers.

@intractabilis

intractabilis commented May 7, 2022

@tttapa I totally agree with everything you said. Btw, GCC has also implemented P1949 in version 12, but it still allows math symbols like 𝜕Ω and φₖ₊₁.

Here is a demonstration with a snapshot of GCC from the release 12 branch:

#include <iostream>

int main(int argc, char* argv[]) {
    auto 𝜕Ω = 4;
    auto φₖ₊₁ = 5;
    std::cout << "𝜕Ω = " << 𝜕Ω << std::endl;
    std::cout << "φₖ₊₁ = " << φₖ₊₁ << std::endl;
}
$ g++ --version
g++ (GCC) 12.0.1 20220504 (prerelease)
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ g++ -std=c++23 -o test test.cpp
$ ./test
𝜕Ω = 4
φₖ₊₁ = 5

Btw, before you, I didn't know there is Unicode for subscripts and superscripts. Now I am totally going to use it.

@llvmbot
Collaborator

llvmbot commented May 9, 2022

@llvm/issue-subscribers-c-2b

@AaronBallman
Collaborator

Does this leave room for Clang and other compilers to still allow these characters even though they might be outside of XID_*?

Yes, we have the wiggle room to do this (for example, with a feature flag to let users opt out of the P1949 behavior), but there's some space between "can" and "should" we need to be careful to consider. That feature flag makes it far more likely you'll run into portability issues with your code, but if it's something you explicitly opt into, then that's your decision to make.

I'm coming around to the idea of giving a feature flag for this -- if for no other reason, than because there was no deprecation period before this change broke code for some folks. Giving people an ability to upgrade to newer Clang versions while transitioning their code base to valid identifiers (or whatever other portability measure they want to take, if any) has value that may be worth the maintenance costs. I'm still not certain of the shape of the flag though -- does it allow any Unicode character no matter how dangerous it is in an identifier, does it allow only math symbols, something else? We'd have to figure out what the right behavior is, which sort of brings us right back to the challenge of "what's the principle behind whether a character is in or out?". As a strawman, I think we could say "allows math symbols too", but I'd definitely want input from @cor3ntin @tahonermann when deciding whether to add such a flag and what its behavior should be. (Note, one of the difficulties I hope we can avoid with the flag is compile time performance impact given that this involves lexing every character from a translation unit.)

@cor3ntin
Contributor

cor3ntin commented May 9, 2022

Does this leave room for Clang and other compilers to still allow these characters even though they might be outside of XID_*?

Yes, we have the wiggle room to do this (for example, with a feature flag to let users opt out of the P1949 behavior), but there's some space between "can" and "should" we need to be careful to consider. That feature flag makes it far more likely you'll run into portability issues with your code, but if it's something you explicitly opt into, then that's your decision to make.

I'm coming around to the idea of giving a feature flag for this -- if for no other reason, than because there was no deprecation period before this change broke code for some folks. Giving people an ability to upgrade to newer Clang versions while transitioning their code base to valid identifiers (or whatever other portability measure they want to take, if any) has value that may be worth the maintenance costs. I'm still not certain of the shape of the flag though -- does it allow any Unicode character no matter how dangerous it is in an identifier, does it allow only math symbols, something else? We'd have to figure out what the right behavior is, which sort of brings us right back to the challenge of "what's the principle behind whether a character is in or out?". As a strawman, I think we could say "allows math symbols too", but I'd definitely want input from @cor3ntin @tahonermann when deciding whether to add such a flag and what its behavior should be. (Note, one of the difficulties I hope we can avoid with the flag is compile time performance impact given that this involves lexing every character from a translation unit.)

I would be strongly against that, as it seems perfectly reasonable for either C or C++ or other C derived language to want to support maths symbols as operator or as some syntax element of sort in the future, and so it would be a pretty big grab.
It is also very defeating for P1949, the goal being to give people the confidence that Unicode works portably, consistently and unsurprisingly.
There needs to be a very strong motivation for that and I'm not convinced "I like this character" is a sufficiently motivated cause for deviation.
A flag doesn't help. Do we really want identifier lexing to depend on non portable disabled-by-default options? How many users would enable that?

And as you hinted, for clang to decide what else to allow would take a lot of resources for an unsatisfactory result, as any effort not involving actual Unicode experts would just be extremely opinionated.

Identifiers are elements usable in words. Math symbols are not. But what about electronic symbols, engineering symbols, music notation, etc? All of these can be equally justified by "someone might use them"

@AaronBallman
Collaborator

From Corentin:
... as it seems perfectly reasonable for either C or C++ or other C derived language to want to support maths symbols as operator or as some syntax element of sort in the future, and so it would be a pretty big grab.
...
And as you hinted, for clang to decide what else to allow would take a lot of resources for an unsatisfactory result, as any effort not involving actual Unicode experts would just be extremely opinionated.

Agreed, which is why I've been resistant to this as much as I have been. However:

From Aaron:
I'm coming around to the idea of giving a feature flag for this -- if for no other reason, than because there was no deprecation period before this change broke code for some folks. Giving people an ability to upgrade to newer Clang versions while transitioning their code base to valid identifiers (or whatever other portability measure they want to take, if any) has value that may be worth the maintenance costs.

This point is still valid -- there's no deprecation period and we're adding constraints that are impacting our users. The constraints added are mildly related to security (see trojan source as an example), so on the one hand, no deprecation period is understandable. This happened for other things in C and C++ as well (implicit function declarations and gets() both immediately come to mind). On the other hand, it's not so strongly related to security that we shouldn't consider a transition period as we have for other hard breaking changes.

From Corentin: There needs to be a very strong motivation for that and I'm not convinced "I like this character" is a sufficiently motivated cause for deviation.

I think the motivation here is less "I like this character" and more "I want to upgrade to the latest Clang but can't because none of my code compiles.", which is reasonably strong motivation depending on how many folks are in that situation. We've gotten reports from two different users against Clang 14, which suggests this is causing more problems than anticipated.

@intractabilis

Identifiers are elements usable in words.

Nope, they are not. Identifiers are lexical tokens that name objects, period. Other requirements can be either technical (terminal in this system doesn't support Unicode), or personal preferences. "Elements usable in words" is a personal preference.

@intractabilis

compile time performance impact

This is infinitesimally small compared to other phases of compilation, for any language, because building the AST takes much more time. For C++ it is ten times smaller still.

@tahonermann
Contributor

Given the lack of a deprecation period and the number of reports we've already seen, I am leaning in the same direction as Aaron. I am not in favor of the approach gcc took of allowing previously allowed (and not disallowed) characters to be used in non-pedantic modes by default, but I think an option that achieves the same result would be fine. This would ensure opt-in backward compatibility without having to make difficult (probably poor) choices regarding character allowances and ensure Clang has the ability to match gcc behavior. I don't have a great suggestion for a new option name. Perhaps -frelaxed-identifiers.

@AaronBallman
Collaborator

I don't have a great suggestion for a new option name. Perhaps -frelaxed-identifiers.

I think that's fairly reasonable for an option name. Do you envision it going back to the old Clang behavior pre-P1949, or do you envision it being P1949 + additional allowances for only some characters (or class of characters)?

@tahonermann
Contributor

@AaronBallman, I envision it matching what gcc does now; that it allows the union of pre-P1949 identifiers and P1949 identifiers. I don't think it should behave as though that set includes -fdollars-in-identifiers though; that is a separate option for good reason.

@tttapa
Author

tttapa commented May 9, 2022

it seems perfectly reasonable for either C or C++ or other C derived language to want to support maths symbols as operator or as some syntax element of sort in the future, and so it would be a pretty big grab.

I don't think this is very plausible, given the committee's reluctance to introduce keywords. However, if they did decide to assign special meanings to some symbols, it would be similar to adding new keywords like co_await.

There needs to be a very strong motivation for that and I'm not convinced "I like this character" is a sufficiently motivated cause for deviation.

True, but I could easily turn this around: there is no strong motivation for suddenly disallowing harmless characters such as mathematical symbols and subscripts, after over a decade of allowing them.
I'd argue that removing characters that were previously allowed demands stronger motivation than adding support for a new character.

Don't get me wrong, I believe that problematic control characters should be disallowed, but I don't think this should necessarily mean that unrelated characters have to be removed as well, especially if this is a breaking change.

Identifiers are elements usable in words.

I strongly disagree. This doesn't match the definition of Unicode's XID_Start and XID_Continue either: they contain 131974 and 135072 code points respectively, and these are certainly not all usable in words. They include punctuation signs, characters from the phonetic alphabet, iteration marks, Arabic mathematical characters, etc.

Even though they might not be usable in words, symbols like 𝜕 are useful as identifiers. E.g. one could argue that locally defining a function 𝛛 eases notation and is a sensible thing to do in some mathematical contexts:

const auto 𝛛 = [](auto expression, auto variable) {
    return partial_derivative(expression, variable);
};

But what about electronic symbols, engineering symbols, music notation, etc? All of these can be equally justified by "someone might use them"

I am not asking to add arbitrary characters to the allowed set, I'm solely requesting not to break code by suddenly removing harmless characters from the set.


Regarding security: it should be noted that P1949 does not solve Trojan source problems, and does not address homoglyph attacks.

Clang 14 is still vulnerable to CVE-2021-42574, because control characters are still allowed in other contexts like string literals. E.g. the original example from https://www.openwall.com/lists/oss-security/2021/11/01/1 compiles without warnings:

https://godbolt.org/z/njvoEGd38

#include <string>
#include <iostream>

int main() {
    std::string access_level = "user";
    if (access_level != "user‮ ⁦// Check if admin⁩ ⁦") {
        std::cout << "not a user\n";
    }
}
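
The invisible characters inside the string literal above are bidirectional formatting characters, and they can be found mechanically. The following is a minimal sketch of such a scan, assuming the source text has already been decoded to UTF-32 elsewhere. The function names are made up for illustration, and this is not how Clang or GCC implement their diagnostics.

```cpp
#include <cstddef>
#include <string_view>

// Bidirectional formatting characters of the kind used by Trojan Source:
// embeddings/overrides U+202A..U+202E and isolates U+2066..U+2069.
constexpr bool is_bidi_control(char32_t c) {
    return (c >= 0x202A && c <= 0x202E) || (c >= 0x2066 && c <= 0x2069);
}

// Counts bidi formatting characters in already-decoded source text.
// Assumption: decoding from UTF-8 to UTF-32 happens elsewhere.
constexpr std::size_t count_bidi_controls(std::u32string_view text) {
    std::size_t n = 0;
    for (char32_t c : text)
        if (is_bidi_control(c)) ++n;
    return n;
}
```

A scan like this is independent of the identifier rules: it looks at every code point in the file, including those in comments and string literals, which is where the example above hides them.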

9.2 Does not exclude homoglyph attack

Homoglyph attacks, where visually indistinguishable characters from different scripts are used to create confusion, such as between latin letter c and cyrillic letter c. This is covered by Unicode Technical Report #36 UNICODE SECURITY CONSIDERATIONS[UAX36]. It requires much more extensive analysis of text, using the full Unicode database, and for a compiled language would provide limited benefit.

@tahonermann
Contributor

tahonermann commented May 9, 2022

Hi @tttapa. I don't think this is the right forum for some of the issues that you are raising. This issue is best used to focus on mitigating the impact of the P1949 changes.

Concerns about what characters should or should not be allowed in identifiers would be best directed to WG21. I recommend you share your concerns with SG16. WG21 is now following Unicode guidance (pre-P1949, the character allowances included ranges of unassigned code points; that isn't a good strategy). If Unicode guidance changes (and it may as a result of the impact P1949 has had; there is a working group meeting regularly), then I'm sure WG21 will follow along.

You are correct regarding the Trojan Source concerns and UAX #36. There is on-going work to address those concerns though I don't expect to have actionable guidance for quite some time.

@termi-official

Any update on how users should deal with this in short term? Clang 15 release is on the way and I could not find any patch introducing a flag to either allow pre-P1949 identifiers or equalize the allowed characters with those in gcc.

@AaronBallman
Collaborator

Any update on how users should deal with this in short term? Clang 15 release is on the way and I could not find any patch introducing a flag to either allow pre-P1949 identifiers or equalize the allowed characters with those in gcc.

Unfortunately, nobody proposed a patch for Clang 15 introducing the option discussed.

Further, WG14 adopted the same restrictions from https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2836.pdf at our Feb 2022 meeting. It's worth noting that https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2932.htm was discussed at our May 2022 meeting and there was consensus against adoption but some weak sentiment to consider it for a TS.

@intractabilis

I disagree. We need a long-term maintenance plan for features or we risk unmanageable complexity that lowers the quality of the product. Having some idea of what the expected usage pattern is for the feature based on discussions with people who intend to use it is very valuable for coming up with that plan.

When I worked for NVIDIA, we tried to switch to Clang several times, every time switching back to GCC after a couple of months because of maintenance headaches. You got it backwards: it's the shenanigans like this one make Clang a low-quality software. Don't worry, I can guarantee you: adding a command line option for backward compatibility will not produce unmanageable complexity and will improve the quality of software.

The same as required today: ask the third-party to please fix their code to conform to the standard.

And if they disagree, apply to The International Court of Justice? What you've done would be an equivalent of removing goto from the standard and the compiler, and then say: we don't care your code doesn't work anymore.

@AaronBallman
Collaborator

You got it backwards: it's the shenanigans like this one make Clang a low-quality software.
And if they disagree, apply to The International Court of Justice? What you've done would be an equivalent of removing goto from the standard and the compiler, and then say: we don't care your code doesn't work anymore.

This sort of hyperbole is not productive or appreciated, please re-familiarize yourself with our Code of Conduct.

@AaronBallman
Collaborator

Definitely understand and share the concern. I have put quite some thought into this over the last few months, but cannot find a satisfactory migration path for my codebases. A big part of the problem is that there is no simple automation.

Thanks for this feedback! If there was a tool, such as a clang-tidy check, which would allow you to automatically modify problematic identifiers, would that automation be sufficient for you to migrate your code base? (I'm imagining something that would do simple renames, like replacing the problematic characters with a placeholder such as _ rather than anything overly clever like trying to map subscript N to a numerical form.)
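
The simple rename described here can be sketched as a standalone function. This is only an illustration, not a clang-tidy check: `is_problematic` and `replace_problematic` are hypothetical names, and the range U+2080..U+208E (subscript digits and operators) is an assumed stand-in for whatever set of characters a real tool would flag.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical predicate: treats U+2080..U+208E (subscript digits and
// operators) as "problematic". The range is an assumption chosen for
// illustration; it is not Clang's actual identifier table.
constexpr bool is_problematic(char32_t cp) {
    return cp >= 0x2080 && cp <= 0x208E;
}

// Replaces every problematic code point in a UTF-8 string with '_'.
// Assumes well-formed UTF-8; stray or truncated bytes are copied through.
std::string replace_problematic(std::string_view in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 1;
        char32_t cp = lead;
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        if (i + len > in.size()) {  // truncated sequence: copy one byte
            len = 1;
            cp = lead;
        }
        // Fold the continuation bytes into the code point.
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        if (is_problematic(cp)) {
            out += '_';
        } else {
            out.append(in.substr(i, len));
        }
        i += len;
    }
    return out;
}
```

Under these assumptions, an identifier such as `xₖ₊₁` would become `xₖ__`: the subscript k (U+2096) lies outside the assumed range and is kept, while U+208A and U+2081 are each replaced by an underscore.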

For context, I am also rather against introducing such a flag, since we should call it what it is: a pretty ugly hotfix. If we decide to introduce such a flag, then it should also be removed again at some point once the discussion about the matter is settled, raising the question of whether it is even worth the time investment in the first place.

The other issue is when removing the flag, how many people's build systems break as a result. (Flag deprecation is tricky but certainly not impossible.)

On the other hand I am also not seeing an alternative short/mid term, because this discussion might take some time. I also cannot say how strong the impact of P1949 on other existing code bases is, because the trend to utilize the now disallowed identifiers is rather recent (<10 years) while most code bases in scientific computing are older. Happy for any suggestions and ideas.

That's why I am still okay with exploring the idea of adding a flag. Removal without a deprecation period is rather harsh and we have plenty of precedent for flags to allow people to migrate. However, our experience with quite a few of those flags is that they're detrimental in the long run unless we are aggressive about removing the flag (implicit int and implicit function declarations both come to mind as recent examples).

@h-vetinari
Contributor

Also, the tropes I used are not hyperboles.

It's still a mighty stretch to compare the breakage of something that was more or less accidentally working in a field as messy as human text (due to intentional clean-ups based on the work of bodies charged with producing guidance on that very subject), with the removal of a central & clearly delineated feature such as goto.

@AaronBallman
Collaborator

This sort of hyperbole is not productive or appreciated, please re-familiarize yourself with our Code of Conduct.

You are again trying to be a moral judge. In this sense, you are not adhering to the code of conduct yourself. My straightforwardness is just biology, I am slightly on the autism spectrum and say things as they are very bluntly. However, I do it respectfully. I didn't make any remarks about your personality, I didn't call names. There is no reason to throw the code of conduct at me just because you don't like the way I speak. Being autistic doesn't make me less intelligent than you. Also, the tropes I used are not hyperboles.

It's not about intelligence or straightforwardness, so I'm very sorry if I've given you that impression! It's about "Be careful in the words that you choose and be kind to others" and "Be respectful" specifically. Calling an open source tool "low-quality software" or saying we "don't care your code doesn't work anymore" when you disagree with a behavior mandated by the standards comes across as denigrating a lot of people's hard work, including the people interacting with you on this thread in efforts to find a positive way forward.

@termi-official

Thanks for the response Aaron.

If there was a tool, such as a clang-tidy check, which would allow you to automatically modify problematic identifiers, would that automation be sufficient for you to migrate your code base? (I'm imagining something that would do simple renames, like replacing the problematic characters with a placeholder such as _ rather than anything overly clever like trying to map subscript N to a numerical form.)

That is basically what I did for some of the code bases to explore migration paths, but I am not a huge fan of this, because from my perspective it sometimes hurts readability. Ω₀ and Ω_0 just look different and parse differently. I know, minor issues. I hope I am not giving the impression that migrating such code bases manually is impossible. While I absolutely like providing automated migration paths via clang-tidy, I think for this task it is not the right tool.

It's still a mighty stretch to compare the breakage of something that was more or less accidentally working in a field as messy as human text (due to intentional clean-ups based on the work of bodies charged with producing guidance on that very subject), with the removal of a central & clearly delineated feature such as goto.

Let me quickly comment on this, too, because I think this is wrong. From my perspective, it is not a stretch to compare this to removing goto statements. Developing clear code that is easy to understand should be a high priority for any project, because it helps in the long term by making the code more maintainable. In the scientific computing community we have, over the last decade, moved towards using Unicode in identifiers to write code that is as close as possible to the theoretical formulas, and it seems to help developers and students better understand what is going on in less time, so it has become a good practice. Luckily, larger C++ projects do not seem to have caught up with this trend yet, so I think the impact (globally speaking, not on my code bases) is not as severe as I initially expected. Goto, on the other hand, is just bad practice in 99% of the cases where it is deployed. There are legitimate use cases, but even in such cases we can always rewrite the control flow to eliminate all gotos with higher-level constructs.

Still, I believe this breakage is more on the accidental side, and I am looking forward to use cases for which we want to block the removed characters, as mentioned previously in the thread.

@AaronBallman
Collaborator

While I absolutely like providing automated migration paths via clang-tidy, I think for this task it is not the right tool.

Thanks! I was kind of thinking the same thing, but confirmation is helpful. :-)

My current thinking on this is that we don't want to expose a feature flag like -fallow-awesome-characters to enable a language subset. I think a better approach is to turn err_character_not_allowed_identifier into a warning which defaults to an error. This way, users who want to downgrade the diagnostic to a warning can do so via -Wno-error=whatever-we-name-the-warning-group, and users who want to ignore the diagnostic entirely can use -Wno-whatever-we-name-the-warning-group. However, it continues to appropriately signal that this code is in error, so users will get the correct behavior by default. If/when UAX31 is updated, we can turn the diagnostic back into an error-only diagnostic if we think it no longer serves a need (Clang treats unknown feature flags as an error, but it treats unknown warning flags as a warning, so this should be less disruptive on build systems). If UAX31 is determined to be correct as-is, we can leave the diagnostic as a warning for "sufficient time" for people to migrate their code to new identifiers before turning it back into an error (no firm opinion on what constitutes sufficient time).
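
The intended behavior of a warning that defaults to an error can be modeled as a small severity lookup. This is only a sketch of the proposal as described here, not Clang's diagnostic engine: `Severity`, `WarningGroupFlags`, and `resolve` are hypothetical names, and the warning group itself is still unnamed at this point in the discussion.

```cpp
// Possible outcomes for the "character not allowed in an identifier"
// diagnostic under the proposal.
enum class Severity { Ignored, Warning, Error };

// Hypothetical model of the two user-facing switches for one warning group.
struct WarningGroupFlags {
    bool no_error = false;  // models -Wno-error=<group>
    bool disabled = false;  // models -Wno-<group>
};

// With no flags the diagnostic stays an error, so the default behavior
// still signals that the code is in error. Assumes that disabling the
// group takes priority when both switches are given.
constexpr Severity resolve(WarningGroupFlags flags) {
    if (flags.disabled) return Severity::Ignored;
    if (flags.no_error) return Severity::Warning;
    return Severity::Error;
}
```

In this model, `resolve({})` yields `Severity::Error`, setting only `no_error` yields `Severity::Warning`, and setting `disabled` yields `Severity::Ignored`.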

@termi-official

For me this seems to be a great compromise. Thanks for taking the time!

@tahonermann
Contributor

I spent some time reading back through this discussion and would like to correct a possible misconception that readers might have come away with. The Unicode standard, via UAX#31, specifies, and will continue to specify, three models of identifier syntax for language designers to follow. One of these (hashtag identifiers) is not relevant for C++. The other two are. The change made for C23 and C++23 was to migrate C++ from immutable identifiers to default identifiers. This change was made solely by the C and C++ standardization committees and not at the recommendation of the Unicode Consortium; as stated earlier, the Unicode standard will continue to specify multiple models of identifier syntax for language designers to use at their discretion. The possible misconception that I want to correct is that the change was the result of a Unicode Consortium recommendation; it wasn't.

That all being said, the Unicode maintainers are aware of the issues we are encountering in migrating from immutable identifier syntax to default identifier syntax and will be reviewing the default identifier syntax character allowances for a future Unicode standard.

@cor3ntin
Contributor

@AaronBallman what you suggest sounds okay to me. I'm thinking about whether we want to allow people to disable the error in C++23/C23. I think we do because discouraging upgrade over this feature sounds like a net negative.

But I would like us to have a long term plan to make sure there is no confusion that this is a temporary solution and not an allowance to deviate from the standard ad aeternam.

@AaronBallman
Collaborator

But I would like us to have a long term plan to make sure there is no confusion that this is a temporary solution and not an allowance to deviate from the standard ad aeternam.

I think that a long-term plan at least partially depends on whether UAX#31 expands the default identifiers tables sufficiently for the folks running into problems (or WG14/WG21 change the identifier set back to immutable, etc). I suspect the long-term plan will be to eventually speculatively turn the diagnostic back into an error-only diagnostic either after UAX#31 has been modified or after some number of Clang releases (whichever comes first), and see how much pain that causes folks in practice with pre-release testing. If there's still significant pain, we'd revert back to warning-defaults-to-error for a while longer.

@AaronBallman
Collaborator

I've posted https://reviews.llvm.org/D132877 as the review for implementing what I proposed above. If it lands and there aren't concerns about backporting, I will try to get this backported to Clang 15. No promises about it making Clang 15 though, as the release is set to go out next Monday (so there are no more release candidates planned).

@nomennescio

I understand Clang wanting to follow (draft) standards, but the rationale behind it is completely wrong; Unicode defining code points as being "identifier characters" or "mathematical characters" should have no bearing on any programming language; it's the programming language that decides what is a valid identifier character. As function names are identifiers, having so-called "mathematical characters" in them is not only sane, it's the better thing to do. Luckily, other programming languages still understand this perfectly fine. Also, breaking existing code without a true necessity is always a bad idea. I think Clang should pull some weight and stand up for its user base.

@rayfalling

Does Clang 15 have any option to accept invalid characters?
Our build system uses Clang-based reflection code generation, but Clang now shows Error GB070C32F: character <U+FF0C> not allowed in an identifier. This is part of the Chinese characters in the Unicode set. (All Chinese characters are emitted as raw strings in the code by our build system.) Does P1949 reject Chinese UTF-8 characters? Or what should I do to make this reflection work correctly?
@AaronBallman

@AaronBallman
Collaborator

Does Clang 15 have any option to accept invalid characters?

Not currently as of 15.0.4.

Our build system uses Clang-based reflection code generation, but Clang now shows Error GB070C32F: character <U+FF0C> not allowed in an identifier. This is part of the Chinese characters in the Unicode set. (All Chinese characters are emitted as raw strings in the code by our build system.) Does P1949 reject Chinese UTF-8 characters? Or what should I do to make this reflection work correctly?

I'll leave it to @tahonermann and @cor3ntin to correct me if I'm wrong, but <U+FF0C> is a fullwidth comma (https://www.fileformat.info/info/unicode/char/ff0c/index.htm), which is the kind of character that UAX #31 intentionally disallows due to the significant potential for malicious confusion in source code (the comma can look like a separator between arguments in a function call, for example).

@rayfalling

rayfalling commented Nov 10, 2022

The error checker reports the wrong position :)
Clang's output is

0>uri(str): 资源后缀名，后缀名称可以不用带点，例如gim等等
0>              ^~

The correct output should be

0>uri(str): 资源后缀名，后缀名称可以不用带点，例如gim等等
0>                   ^~

Yes, there is probably a bug in the width estimation code, I'll look into it.

@AaronBallman
Collaborator

I think I'm a bit confused, @rayfalling.

Our build system uses Clang-based reflection code generation, but Clang now shows Error GB070C32F: character <U+FF0C> not allowed in an identifier.

So you're getting an error about use of that character in an identifier.

Yes, this is a fullwidth comma in our comment.

But this is about a comment and not an identifier. Can you attach a reduced test case that reproduces the issue for you so I can be sure we're considering the same situation?

@rayfalling

rayfalling commented Nov 11, 2022

@AaronBallman
I'll paste the minimal code below:

#include <iostream>

#define NCLASS(...)
#define NFUNCTION(...)

#define REFLECTION_NCLASS(...) NCLASS(has_reflection=true, __VA_ARGS__)

REFLECTION_NCLASS(alias=start_record, ops.scope=unit, ops.hotkey=Alt+1, ops.comment=(
根据上下文，开始当前编辑器的录制操作
))
class TestClang
{
public:
	NFUNCTION()
	virtual bool Recordable() const { return false; }
};

int main()
{
	const TestClang testClang;
    std::cout << "Hello World!\n" << testClang.Recordable();
}

Our build system uses a Clang visitor to analyze the reflection macro REFLECTION_NCLASS; ops.comment is treated as an annotate attribute and is written to another location as a raw string literal, R"""()""".

In addition, the macro expansion should be empty in minimal code.

@tahonermann
Contributor

I think the case @rayfalling is reporting is a lexing defect. Consider the following example (https://godbolt.org/z/j71Kjdr1P):

#define M(X) #X
auto v1 = M(，);
auto v2 = M(x，);
auto v3 = "，";

Clang issues the following diagnostics:

<source>:2:13: error: unexpected character <U+FF0C>
auto v1 = M(，);
            ^~
<source>:3:14: error: character <U+FF0C> not allowed in an identifier
auto v2 = M(x，);
             ^~
2 errors generated.

According to [lex.pptoken], the U+FF0C character should become its own preprocessing-token since it doesn't combine with any of the other token kinds. The initializers for v1 and v3 should be equivalent (they each are a string literal containing the U+FF0C character), but Clang errors when attempting to produce a pp-token in the v1 case.

@AaronBallman
Collaborator

Thank you for the example @rayfalling and thank you for the analysis @tahonermann!

Tom, doesn't Clang's behavior match this: https://eel.is/c++draft/lex.pptoken#2.sentence-5 ?

By my reading, <U+FF0C> runs into:
"The categories of preprocessing token are: header names, placeholder tokens produced by preprocessing import and module directives (import-keyword, module-keyword, and export-keyword), identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single non-whitespace characters that do not lexically match the other preprocessing token categories."

and it qualifies as a single non-whitespace character that does not lexically match the other preprocessing token categories. Then we skip "If a U+0027 APOSTROPHE or a U+0022 QUOTATION MARK character matches the last category, the behavior is undefined." as it does not apply, bringing us to:

"If any character not in the basic character set matches the last category, the program is ill-formed." <U+FF0C> is not in the basic character set, so the program is ill-formed.

So to me, I think it's a case where the diagnostic is misleading and low-quality, but is actually correct. However, you have far more expertise on how to interpret this part of the standard. What am I misunderstanding?
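
The reading above can be sketched as a small classifier. This is an illustrative model of the quoted sentences only, not Clang's lexer: `in_basic_character_set`, `StrayVerdict`, and `classify_stray` are hypothetical names, and the character list assumes the basic character set as enumerated in [lex.charset] before P2558 (so `$`, `@`, and the backtick are excluded).

```cpp
// Assumed membership test for the basic character set: horizontal tab,
// vertical tab, form feed, new-line, and printable ASCII except '$',
// '@', and '`' (the pre-P2558 list).
constexpr bool in_basic_character_set(char32_t c) {
    if (c == U'\t' || c == U'\v' || c == U'\f' || c == U'\n') return true;
    if (c < 0x20 || c > 0x7E) return false;
    return c != U'$' && c != U'@' && c != U'`';
}

enum class StrayVerdict { Ok, Undefined, IllFormed };

// Applies the two quoted rules to a character that has ALREADY been
// determined to match only the last pp-token category (a single
// non-whitespace character that matches no other category).
constexpr StrayVerdict classify_stray(char32_t c) {
    if (c == U'\'' || c == U'"') return StrayVerdict::Undefined;
    if (!in_basic_character_set(c)) return StrayVerdict::IllFormed;
    return StrayVerdict::Ok;
}
```

Under these assumptions, `classify_stray(0xFF0C)` yields `StrayVerdict::IllFormed`, which matches the conclusion that the program is ill-formed, while a stray backslash yields `StrayVerdict::Ok`.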

@cor3ntin
Contributor

@AaronBallman This matches my understanding.
As for the quality of the wording, it is a difficult problem: how do we distinguish "the user wanted to use that code point in an identifier, and it is not allowed" from "the user wanted to input Unicode outside of an identifier, which is not allowed"?
The common case being the former, the diagnostics make sense in most cases. That explains the discrepancy observed by @tahonermann.

@tahonermann
Contributor

Thanks @AaronBallman and @cor3ntin, it looks like you are right. A couple of interesting observations:

  • Some other characters not in the basic character set are not diagnosed, for example, $, @, and ` (Unless P2558 has already been implemented for Clang? Or perhaps these are common enough that they are accepted as an extension).
  • A diagnostic is not issued when just preprocessing. https://godbolt.org/z/xYdYcah1x.

@rayfalling

rayfalling commented Nov 12, 2022

It looks like Clang won't change its behavior when preprocessing fullwidth characters, even though this is not consistent with MSVC's behavior. I will change our code and build system to match Clang's preprocessing.

Thanks for your answers.

@cor3ntin
Contributor

@rayfalling I realized I accidentally edited your reply instead of quoting it. I blame my phone. Very sorry about that.

I've looked into that caret bug you reported, and Clang actually behaves correctly, but the output will render incorrectly if the terminal you are using is not a proper column terminal.
Godbolt does not appear to behave correctly, for example (i.e. if the fonts used for different scripts have different widths).
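
To illustrate why caret placement depends on the terminal, here is a minimal sketch of column-width estimation. It is not Clang's implementation: `column_width` and `display_columns` are hypothetical names, and the two "wide" ranges (CJK Unified Ideographs U+4E00..U+9FFF and fullwidth forms U+FF01..U+FF60) are a simplifying assumption rather than a complete width table.

```cpp
#include <cstddef>
#include <string_view>

// Assumed width rule: code points in the two ranges below occupy two
// terminal columns; everything else occupies one.
constexpr int column_width(char32_t cp) {
    const bool cjk_ideograph = cp >= 0x4E00 && cp <= 0x9FFF;
    const bool fullwidth_form = cp >= 0xFF01 && cp <= 0xFF60;
    return (cjk_ideograph || fullwidth_form) ? 2 : 1;
}

// Number of terminal columns occupied by the text that precedes a caret.
std::size_t display_columns(std::u32string_view text) {
    std::size_t columns = 0;
    for (char32_t cp : text) {
        columns += static_cast<std::size_t>(column_width(cp));
    }
    return columns;
}
```

With this rule, the ten ASCII characters of `uri(str): ` followed by five ideographs occupy 20 columns. A caret computed this way only lines up if the terminal font really renders each ideograph two columns wide.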

@cor3ntin
Contributor

Clang now supports additional mathematical symbols in identifiers, as an extension for backward portability.
This is based on https://www.unicode.org/L2/L2022/22230-math-profile.pdf
This should be sufficient to support the code that has been reported broken here.

Thanks for reporting this issue!

GuilhermeValarini pushed a commit to GuilhermeValarini/llvm-project that referenced this issue Dec 24, 2022
Implement the proposed UAX Profile
"Mathematical notation profile for default identifiers".

This implements a not-yet-approved Unicode proposal for a vetted
UAX31 identifier profile:
https://www.unicode.org/L2/L2022/22230-math-profile.pdf

This change mitigates the reported disruption caused
by the implementation of UAX31 in C++ and C2x,
as these mathematical symbols are commonly used in the
scientific community.

Fixes llvm#54732

Reviewed By: tahonermann, #clang-language-wg

Differential Revision: https://reviews.llvm.org/D137051
@llvm llvm deleted a comment from xhawk18 Jan 25, 2024