Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Variables with the same name as registers break Intel assembly #50

Closed
Valetoriy opened this issue Feb 18, 2023 · 18 comments
Closed

Variables with the same name as registers break Intel assembly #50

Valetoriy opened this issue Feb 18, 2023 · 18 comments

Comments

@Valetoriy
Copy link
Contributor

rax :: 69
putchar : ext integer(ch : integer)
putchar(rax)

When compiling with --dialect intel no warnings/errors are produced and the following code is generated:

.section .data
rax: .byte 69,0,0,0,0,0,0,0
.intel_syntax noprefix
.section .text

.global main
.global putchar

main:
    push rbp
    mov rbp, rsp
    sub rsp, 0
.L0:
    lea rax, [rip + rax]
    mov rdi, [rip + rax]
    sub rsp, 8
    push rcx
    push rsi
    push rdi
    call putchar
    pop rdi
    pop rsi
    pop rcx
    add rsp, 8
    mov rcx, rax
    mov rax, rcx
    mov rsp, rbp
    pop rbp
    ret

Same as in #46, the issue only occurs during assembly:

code.S: Assembler messages:
code.S:14: Error: `[rip+rax]' is not a valid base/index expression
code.S:15: Error: `[rip+rax]' is not a valid base/index expression
@Valetoriy Valetoriy changed the title Variables named after x64 registers break Intel assembly Variables named after registers break Intel assembly Feb 18, 2023
@Valetoriy Valetoriy changed the title Variables named after registers break Intel assembly Variables with the same name as registers break Intel assembly Feb 18, 2023
@LensPlaysGames
Copy link
Owner

Interesting; it seems the intel dialect has quite a few of these cases due to it's ambiguous nature (no way in the syntax to determine whether something is a register, immediate, symbol, etc).

I believe it's in the plans to mangle not only function names, but also variable names, with the caveat of anything ext being left unmangled. This means that eventually we'll end up with something like _Xv3rax, which wouldn't have such conflicts.

@Sirraide
Copy link
Contributor

I find the best solution to this problem is to ask GCC for advice

@Sirraide
Copy link
Contributor

image

@Sirraide
Copy link
Contributor

... thanks GCC

@Sirraide
Copy link
Contributor

It seems that GCC avoids dealing w/ this by simply emitting object files instead 👁️

@Sirraide
Copy link
Contributor

Sirraide commented Feb 19, 2023

It’s worth noting that GCC uses AT&T by default, so it doesn’t normally suffer from this problem unless you use intel syntax

@Sirraide
Copy link
Contributor

Clang straight-up just generates incorrect assembly:
image

@Sirraide
Copy link
Contributor

NASM can deal with this problem it seems:

An identifier may also be prefixed with a $ to indicate that it is intended to be read as an identifier and not a reserved word; thus, if some other module you are linking with defines a symbol called eax, you can refer to $eax in NASM code to distinguish the symbol from the register.

@Sirraide
Copy link
Contributor

I guess the ‘solution’ to this is to disallow .intel_syntax noprefix and require .intel_syntax prefix instead when emitting assembly for GAS as GAS doesn’t seem to have a way of handling these kinds of symbols... If we ever add a NASM backend, we can solve this problem as described in the previous paragraph.

I guess the ‘easiest’ way of solving this issue would be to simply add a flag to the codegen context (bool intel_syntax_use_prefix or sth like that) and then depending on that, we choose whether to emit a prefix in femit_*(). That flag can then be set or not depending on a command line option. Speaking of, we should probably add an option like --assembler or sth like that that accepts different values (GAS being the only one atm), and depending on that and on the assembly dialect, we can choose whether to use prefixes or not, as well as what directives to emit etc.

@Sirraide
Copy link
Contributor

@LensPlaysGames Not sure how familiar you are w/ NASM, but I can take a look at implementing a NASM backend if you’d like 👀

@Sirraide
Copy link
Contributor

This means that eventually we'll end up with something like _Xv3rax, which wouldn't have such conflicts.

Finally, this is definitely true and I doubt conflicts like this one would be common, but that doesn’t change the fact that 1. they’re possible, and 2. more assembler backends and options is probably a good idea anyway, because maybe someone would prefer to just use NASM or some other assembler...

@LensPlaysGames
Copy link
Owner

I guess the ‘solution’ to this is to disallow .intel_syntax noprefix and require .intel_syntax prefix instead when emitting assembly for GAS

This is probably the first step to take; it would fix this issue for the default backend, and therefore for most users of the Intercept compiler.

Not sure how familiar you are w/ NASM

I learned NASM first, then GAS. That's why LensorOS has a mix of NASM .asm and GAS .S, haha. But yeah: I'm familiar :).

I can take a look at implementing a NASM backend if you'd like

I mean, don't let me stop you (support for more backends is always highly appreciated), but this isn't necessarily top priority to me. Personally, I envision that what will be most useful is an object file backend, as those object files can then be used by any assembler, compiler, linker, etc.; but that is an entire project in itself, hehe. However that's just my perspective: someone who doesn't really know C all that well would probably greatly appreciate a backend for an easier-to-obtain assembler like NASM (it has prebuilt binaries readily available, whereas binutils... yeah).

  1. they’re possible, and 2. more assembler backends and options is probably a good idea anyway

💯. It's only a bad thing to support more backends when it is something that won't be maintained. I think NASM is close enough to the existing default backend (GAS) that it really won't be too difficult to maintain; it'll basically be a translation job more than anything, when implementing new codegen.

@Sirraide
Copy link
Contributor

Personally, I envision that what will be most useful is an object file backend

on it

Seriously tho, I’m sort of familiar w/ object file layout, so that shouldn’t be too bad. ELF files that is. I happen to know that you’re at least somewhat familiar w/ COFF, so I’ll leave that up to you 👀

@Sirraide
Copy link
Contributor

And adding an asm backend would just entail changing the directives, so that should be fairly simple imo

@Sirraide
Copy link
Contributor

so that shouldn’t be too bad

I forgot for a second that this also entails assembling the code, so er, that might take a bit longer haha

@LensPlaysGames
Copy link
Owner

The machine_code branch takes care of this the same way GCC does; by emitting object files instead :P

#66

  D:/Programming/strema/2023/Interceptmachine_code  24                            Garry 2023-05-27 12:14:28
╰─> .\bld\intc examples\register_variable_name.int -t coff_object -v -o code.obj
GObj dump:
  Sections:
    .text Data: 34 bytes:
      55 48 89 e5 48 8b 05 00  00 00 00 48 89 c1 51 48      |UH..H......H..QH|
      83 ec 28 e8 00 00 00 00  48 83 c4 28 59 48 89 ec      |..(.....H..(YH..|
      5d c3                                                 |].|
    .rodata EMPTY
    .data Data: 8 bytes:
      45 00 00 00 00 00 00 00                               |E.......|
    .bss Fill 0 bytes with '0'
  Relocations:
    0: rax in .text at offset 7 : None
    0: putchar in .text at offset 20 : Function
  Symbols:
    rax in .data at offset 0 : Static
    main in .text at offset 0 : Function
    putchar in .text at offset 34 : External

Generated code at output filepath "code.obj"
  D:/Programming/strema/2023/Interceptmachine_code  24                            Garry 2023-05-27 12:16:38
╰─> gcc code.obj -o code.exe
  D:/Programming/strema/2023/Interceptmachine_code  24                            Garry 2023-05-27 12:16:41
╰─> .\code.exe
E
  D:/Programming/strema/2023/Interceptmachine_code  24                            Garry 2023-05-27 12:16:44
╰─> echo $env.LAST_EXIT_CODE
69

@Sirraide
Copy link
Contributor

I feel like if GCC doesn’t handle this case, then we don’t really need to care too much about handling it either. I recall LLVM-generated assembly not being particularly compilable either if you use... less orthodox label names...

@LensPlaysGames
Copy link
Owner

#66 has closed, main now behaves as GCC does (and LLVM's clang, presumably), so we are good to go with closing this bad boi.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants