Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect initialization of global arrays that are 2^32 bytes or larger #57353

Closed
nlsandler opened this issue Aug 25, 2022 · 23 comments · Fixed by #92473
Closed

Incorrect initialization of global arrays that are 2^32 bytes or larger #57353

nlsandler opened this issue Aug 25, 2022 · 23 comments · Fixed by #92473
Labels
clang:codegen good first issue https://github.com/llvm/llvm-project/contribute miscompilation

Comments

@nlsandler
Copy link

C programs with partially initialized arrays that are larger than 2^32 bytes aren't compiled correctly:

  • char arrays get initialized entirely to zero, even if they have a non-zero initializer in the original source code
  • when initializing arrays of other types, LLVM emits a .space or .zero directive with a negative byte count, which is rejected by the assembler.
char bad_char[4294967296] = {1};
char ok_char[4294967295u] = {1};

int bad_int[1073741824] = {1};
int ok_int[1073741823] = {1};

Compiling this with clang -S -arch x86_64 bad_static_arrays.c produces:

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 10, 15	sdk_version 10, 15
	.globl	_bad_char               ## @bad_char
.zerofill __DATA,__common,_bad_char,4294967296,4
	.section	__DATA,__data
	.globl	_ok_char                ## @ok_char
	.p2align	4
_ok_char:
	.byte	1                       ## 0x1
	.space	4294967294

	.globl	_bad_int                ## @bad_int
	.p2align	4
_bad_int:
	.long	1                       ## 0x1
	.space	4294967292
	.space	-4294967296

	.globl	_ok_int                 ## @ok_int
	.p2align	4
_ok_int:
	.long	1                       ## 0x1
	.space	4294967288


.subsections_via_symbols

That's from compiling with Clang 11.0, but it looks like the issue exists in the most recent version too: https://godbolt.org/z/eobvhbf7s

@OfekShochat
Copy link

OfekShochat commented Aug 25, 2022

just a note, seems like this isnt a problem within clang but llvm codegen, https://godbolt.org/z/dEo36dW43 at least with bad_int, with bad_char it is initializing to zeros. I'll have to investigate deeper because I don't know the codebase yet, but this seems relevant. I don't know actually, you might want to ignore this comment, but that means somehow VectorType is doing something fishy, because all it does (understandably) is extending Type with methods to get the number of elements etc

@efriedma-quic
Copy link
Collaborator

See

unsigned Size = DL.getTypeAllocSize(CS->getType());

@efriedma-quic efriedma-quic added good first issue https://github.com/llvm/llvm-project/contribute llvm:codegen miscompilation and removed new issue labels Aug 25, 2022
@llvmbot
Copy link
Collaborator

llvmbot commented Aug 25, 2022

@llvm/issue-subscribers-good-first-issue

@OfekShochat
Copy link

Hello! Thanks for pointing that out, i had to do something else (if you're wondering). Currently I don't have an environment to actually fix this, but if this isn't in a very high priority, can I fix this as I actually set up my machine?

@efriedma-quic
Copy link
Collaborator

Sure; this appears to be a longstanding issue, so no rush.

@OfekShochat
Copy link

Thanks!

@msebor
Copy link

msebor commented Aug 25, 2022

See also #54766.

@OfekShochat
Copy link

OfekShochat commented Aug 27, 2022

interestingly, both master and 14.0.6 fail on my machine in the parsing part, and seems like godbolt does too https://godbolt.org/z/59oqh7na7 (this is c++, not c)

@OfekShochat
Copy link

OfekShochat commented Aug 28, 2022

Hello! Today I found both problems. Of course that I just didn't pay attention to where exactly I was pointed to.
Question now is, what should it do? Should it fail? In codegen (the bad_char one, that clang's codegen is the problem) it tries to reserve NumElements in a SmallVector which it can't really handle.. The alternative to failing is having a check there and not trying to reserve all the elements, but giving EmitArrayConstant an argument so it knows to emit the space but doesn't allocate anything.
Is any of my conclusions wrong (they might be, I still didn't go over EmitArrayConstant thoroughly)?

@shafik
Copy link
Collaborator

shafik commented Aug 29, 2022

I think this is related to #57317

@efriedma-quic
Copy link
Collaborator

This was originally reported as the bug for C code, not the issue for C++, so it should probably remain the bug for C code. The C++ crash should be a different bug (not sure if there's an existing report for it).

@OfekShochat
Copy link

Of course. Would you be willing to reply to my comment?

@efriedma-quic
Copy link
Collaborator

I don't want to discuss the C++ bug too much on the "wrong" bug report. File that report if it doesn't exist, and CC me on it. But probably if we can, we want to split the initializer the same way we end up doing for C code. And in the case where we can't split, we probably run out of memory on most computers, so probably fine to just report an error.

@nlsandler
Copy link
Author

As an LLVM user, I'd find it clearest and least confusing to just reject any array that exceeds this 4GB size limit earlier on, in semantic analysis/type checking, regardless of how it's initialized.

Right now, it looks like ConstantArrayType::getMaxSizeBits enforces a maximum array size of 2^61 bytes (see https://godbolt.org/z/cPbnbnb5z for an example.) Would it make sense to just reduce this number from 61 to 32?

This seems appropriate especially because from the discussion on #57317 it looks like multiple spots in the code base assume that array sizes can fit into 32 bits.

(I don't mean to extend the discussion of the C++ problem here but I think that fix would apply to both C and C++)

@OfekShochat
Copy link

OfekShochat commented Aug 29, 2022

Yea, on both occasions that made the problems appear (that assumption). And it would fix both
But I don't know how people feel about limiting that, is there a problem? That was my question

@efriedma-quic
Copy link
Collaborator

I suspect enforcing a 32-bit limit on array sizes would break existing code. And "it's too much work to make LLVM clean for 64-bit array sizes" doesn't really feel like a compelling argument; we've only seen a few scattered issues with accidentally truncating 64-bit sizes to 32 bits.

@OfekShochat
Copy link

Makes sense, I agree.
Would you like to continue this discussion in irc to not clutter the issue any more than we (I) have?

@rjmccall
Copy link
Contributor

rjmccall commented Sep 6, 2022

If you're interested in working on this, patches are welcome. IR generation in Clang will probably have to completely change how we emit constant initialization of uninitialized elements of large arrays in order to produce code with acceptable resource usage here. Right now, we have a special case for fully bitwise-zero initializers, but when that doesn't match we'll need to produce something like { i32 1, [1073741823 x i32] zeroinitializer }. That means we won't necessarily be producing an initializer that looks structurally like the type does in C, but that's already true for other patterns.

That's assuming that the AST is already reasonable, of course. IIRC, the AST doesn't make any effort to use a sparse representation in general, but it does have a special representation for trailing zero initializers. As a result, we'd probably blow out memory if you used a designator to initialize just the billionth element of an array like this. But we can tackle that separately.

@OfekShochat
Copy link

OfekShochat commented Sep 9, 2022

Hello, I have a branch that does compile (correctly) the example above! https://github.com/OfekShochat/llvm-project/blob/big-test/clang/lib/CodeGen/CGExprConstant.cpp#L1227
the file wasnt formatted, so I accidentally did (with git-clang-format). also, there are multiple other things I imagine I should add for this to be mergable (logic wise)
cc @rjmccall

@OfekShochat
Copy link

Opened a diff for this issue, it was just a unsigned/uint64_t issue, expectedly. Thanks, https://reviews.llvm.org/D133648

@gxyd
Copy link

gxyd commented Feb 13, 2024

Hi @OfekShochat , can I maybe try to work on this issue?

@efriedma-quic
Copy link
Collaborator

Current state:

MaskRay pushed a commit to MaskRay/llvm-project that referenced this issue May 16, 2024
MaskRay added a commit that referenced this issue May 16, 2024
MaskRay added a commit to MaskRay/llvm-project that referenced this issue May 16, 2024
Based on @OfekShochat's https://reviews.llvm.org/D133648

init.c is the primary test for array initialization, but it uses a
32-bit triple. Add the new test to array-init.c instead.

Fix llvm#57353
MaskRay added a commit to MaskRay/llvm-project that referenced this issue May 16, 2024
Based on @OfekShochat's https://reviews.llvm.org/D133648

init.c is the primary test for array initialization, but it uses a
32-bit triple, which would lead to an "array is too large" error. Add
the new test to array-init.c instead.

Fix llvm#57353
MaskRay added a commit that referenced this issue May 17, 2024
Based on @OfekShochat's https://reviews.llvm.org/D133648

init.c is the primary test for array initialization, but it uses a
32-bit triple, which would lead to an "array is too large" error. Add
the new test to array-init.c instead.

Fix #57353

Pull Request: #92473
@llvmbot
Copy link
Collaborator

llvmbot commented May 17, 2024

@llvm/issue-subscribers-clang-codegen

Author: None (nlsandler)

C programs with partially initialized arrays that are larger than 2^32 bytes aren't compiled correctly: * `char` arrays get initialized entirely to zero, even if they have a non-zero initializer in the original source code * when initializing arrays of other types, LLVM emits a `.space` or `.zero` directive with a negative byte count, which is rejected by the assembler.
char bad_char[4294967296] = {1};
char ok_char[4294967295u] = {1};

int bad_int[1073741824] = {1};
int ok_int[1073741823] = {1};

Compiling this with clang -S -arch x86_64 bad_static_arrays.c produces:

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 10, 15	sdk_version 10, 15
	.globl	_bad_char               ## @<!-- -->bad_char
.zerofill __DATA,__common,_bad_char,4294967296,4
	.section	__DATA,__data
	.globl	_ok_char                ## @<!-- -->ok_char
	.p2align	4
_ok_char:
	.byte	1                       ## 0x1
	.space	4294967294

	.globl	_bad_int                ## @<!-- -->bad_int
	.p2align	4
_bad_int:
	.long	1                       ## 0x1
	.space	4294967292
	.space	-4294967296

	.globl	_ok_int                 ## @<!-- -->ok_int
	.p2align	4
_ok_int:
	.long	1                       ## 0x1
	.space	4294967288


.subsections_via_symbols

That's from compiling with Clang 11.0, but it looks like the issue exists in the most recent version too: https://godbolt.org/z/eobvhbf7s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:codegen good first issue https://github.com/llvm/llvm-project/contribute miscompilation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants