JIT build crashes on x86_64-linux with LLVM 18 #130673

Closed
rennsax opened this issue Feb 28, 2025 · 12 comments · Fixed by #130906
Labels
build (the build process and cross-build), topic-JIT, type-bug (an unexpected behavior, bug, or error)

Comments

@rennsax
Contributor

rennsax commented Feb 28, 2025

Bug report

Bug description:

I'm trying to build CPython 3.13.2 with JIT support (--enable-experimental-jit=yes-off), but the build crashes with the following traceback:

python3>     | Traceback (most recent call last):
python3>     |   File "/build/Python-3.13.2/Tools/jit/_targets.py", line 181, in _compile
python3>     |     return await self._parse(o)
python3>     |            ^^^^^^^^^^^^^^^^^^^^
python3>     |   File "/build/Python-3.13.2/Tools/jit/_targets.py", line 89, in _parse
python3>     |     self._handle_section(wrapped_section["Section"], group)
python3>     |   File "/build/Python-3.13.2/Tools/jit/_targets.py", line 330, in _handle_section
python3>     |     value, base = group.symbols[section["Info"]]
python3>     |                   ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
python3>     | KeyError: 5

I've traced the build process and found that the _parse routine in Tools/jit/_targets.py fails for almost all of the object files produced by the _compile step. When handling an ELF section of type SHT_PROGBITS, if SHF_ALLOC is not among its flags, the stencil group's symbol table is not updated. If a later section then refers to that section, a KeyError occurs. For example, consider an object file (_NOP.o) like this:

[
  {
    "Section": {
      "Index": 5,
      "Name": { "Name": ".debug_info", "Value": 108 },
      "Type": { "Name": "SHT_PROGBITS", "Value": 1 },
      "Flags": {
        "Value": 2048,
        "Flags": [{ "Name": "SHF_COMPRESSED", "Value": 2048 }]
      },
      "Address": 0,
      "Offset": 463,
      "Size": 29486,
      "Link": 0,
      "Info": 0
      // ...
    }
  },
  {
    "Section": {
      "Index": 6,
      "Name": { "Name": ".rela.debug_info", "Value": 103 },
      "Type": { "Name": "SHT_RELA", "Value": 4 },
      "Flags": {
        "Value": 64,
        "Flags": [{ "Name": "SHF_INFO_LINK", "Value": 64 }]
      },
      "Address": 0,
      "Offset": 44968,
      "Size": 96,
      "Link": 20,
      "Info": 5
      // ...
    }
  }
]

When handling section 5 (.debug_info), the function returns early because SHF_ALLOC is not set, so L349 is not executed:

elif section_type == "SHT_PROGBITS":
    if "SHF_ALLOC" not in flags:
        return
    if "SHF_EXECINSTR" in flags:
        value = _stencils.HoleValue.CODE
        stencil = group.code
    else:
        value = _stencils.HoleValue.DATA
        stencil = group.data
    group.symbols[section["Index"]] = value, len(stencil.body)
    for wrapped_symbol in section["Symbols"]:
        symbol = wrapped_symbol["Symbol"]
        offset = len(stencil.body) + symbol["Value"]
        name = symbol["Name"]["Name"]
        name = name.removeprefix(self.prefix)
        group.symbols[name] = value, offset
    stencil.body.extend(section["SectionData"]["Bytes"])
    assert not section["Relocations"]

Then, when handling section 6 (.rela.debug_info), L330 tries to look up group.symbols[5]:

if section_type == "SHT_RELA":
    assert "SHF_INFO_LINK" in flags, flags
    assert not section["Symbols"]
    value, base = group.symbols[section["Info"]]
    if value is _stencils.HoleValue.CODE:
        stencil = group.code
    else:
        assert value is _stencils.HoleValue.DATA
        stencil = group.data
    for wrapped_relocation in section["Relocations"]:
        relocation = wrapped_relocation["Relocation"]
        hole = self._handle_relocation(base, relocation, stencil.body)
        stencil.holes.append(hole)

where the error occurs.
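
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the real Tools/jit code: `handle_section` and `reproduce` are hypothetical stand-ins that keep only the two branches quoted in this report, and the two section dicts are trimmed copies of the excerpt (index 5 without SHF_ALLOC, index 6 pointing back at it through "Info").

```python
# Toy model of the two branches quoted above; NOT the real Tools/jit code.
def handle_section(section: dict, symbols: dict) -> None:
    flags = {flag["Name"] for flag in section["Flags"]["Flags"]}
    section_type = section["Type"]["Name"]
    if section_type == "SHT_RELA":
        # Look up the section that this relocation table applies to.
        value, base = symbols[section["Info"]]
    elif section_type == "SHT_PROGBITS":
        if "SHF_ALLOC" not in flags:
            # Non-allocated sections (e.g. .debug_info) are never registered.
            return
        symbols[section["Index"]] = ("DATA", 0)


# Trimmed versions of sections 5 and 6 from the excerpt in this report.
sections = [
    {"Index": 5, "Type": {"Name": "SHT_PROGBITS"},
     "Flags": {"Flags": [{"Name": "SHF_COMPRESSED"}]}, "Info": 0},
    {"Index": 6, "Type": {"Name": "SHT_RELA"},
     "Flags": {"Flags": [{"Name": "SHF_INFO_LINK"}]}, "Info": 5},
]


def reproduce() -> str:
    symbols: dict = {}
    try:
        for section in sections:
            handle_section(section, symbols)
    except KeyError as error:
        return f"KeyError: {error}"
    return "ok"


print(reproduce())  # KeyError: 5
```

Section 5 is skipped before it is registered, so the lookup for section 6 has nothing to find, which matches the traceback.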

I'm not sure whether it's because of the version of LLVM (18.1.8) I'm using.

CPython versions tested on:

3.13.2

Operating systems tested on:

GNU/Linux

Build Toolchains

  • Python 3.12.4
  • LLVM 18.1.8


@rennsax added the type-bug label on Feb 28, 2025
@picnixz added the build and topic-JIT labels on Feb 28, 2025
@ZeroIntensity
Member

I thought we needed LLVM 19 to build the JIT, not 18. If that's the issue, it's probably worth adding a better error here.

cc @savannahostrowski

@rennsax
Contributor Author

rennsax commented Mar 3, 2025

> I thought we needed LLVM 19 to build the JIT, not 18. If that's the issue, it's probably worth adding a better error here.
>
> cc @savannahostrowski

What kind of error message do you need? I think I've attached enough context to show how the problem happens. If you need more information, feel free to ask me :)


Hmm, I ran into the same error when building v3.14.0a5 with LLVM 19:

  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/xxx/python313-jit/python-cpython-v3.14.0a5/Tools/jit/_targets.py", line 146, in _compile
    |     return await self._parse(o)
    |            ^^^^^^^^^^^^^^^^^^^^
    |   File "/home/xxx/python313-jit/python-cpython-v3.14.0a5/Tools/jit/_targets.py", line 89, in _parse
    |     self._handle_section(wrapped_section["Section"], group)
    |   File "/home/xxx/python313-jit/python-cpython-v3.14.0a5/Tools/jit/_targets.py", line 314, in _handle_section
    |     value, base = group.symbols[section["Info"]]
    |                   ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
    | KeyError: 5
    +------------------------------------

LLVM version:

$ clang --version
clang version 19.1.0-rc1
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /nix/store/mgsrjph55ykck58v3aq049z6bx32bhs2-clang-19.1.0-rc1/bin
$ llvm-readobj --version
LLVM (http://llvm.org/):
  LLVM version 19.1.0-rc1
  Optimized build.

@ZeroIntensity
Member

> Which kind of error message do you need?

Oh, don't worry, you're all good! I was saying we should add an error if the LLVM version was the culprit, which seemingly it's not. Your report was perfect :)

@savannahostrowski
Member

Thanks for testing with LLVM 19 as well (good one, @ZeroIntensity).

Just to be clear, are you testing against main or the 3.13 branch? I'm a little confused, because the trace shows

File "/home/xxx/python313-jit/python-cpython-v3.14.0a5/Tools/jit/_targets.py", line 146, in _compile

but the issue report says 3.13.

@rennsax
Contributor Author

rennsax commented Mar 4, 2025

> Thanks for testing with LLVM 19 as well (good one, @ZeroIntensity).
>
> Just to be clear, are you testing against main or the 3.13 branch? I'm just a little confused by File "/home/xxx/python313-jit/python-cpython-v3.14.0a5/Tools/jit/_targets.py", line 146, in _compile in the trace but then 3.13 in the issue report.

Originally I was testing 3.13.2 with LLVM 18. After Peter told me LLVM 19 may be necessary, I switched to 3.14.0a5 with LLVM 19. Both builds fail.

We can also focus only on v3.14.0a5 and LLVM 19, if you wish.

By the way, you may find that the line numbers do not match your code. That's because I've modified the build code to make testing easier. My modification is simple:

159,160c159,160
--- Tools/jit/_targets.py       2025-03-04 10:11:58.781927617 +0800
+++ _targets-2.py       2025-03-03 14:09:09.611480315 +0800
@@ -156,8 +156,8 @@
         with tempfile.TemporaryDirectory() as tempdir:
             work = pathlib.Path(tempdir).resolve()
             async with asyncio.TaskGroup() as group:
-                coro = self._compile("shim", TOOLS_JIT / "shim.c", work)
-                tasks.append(group.create_task(coro, name="shim"))
+                # coro = self._compile("shim", TOOLS_JIT / "shim.c", work)
+                # tasks.append(group.create_task(coro, name="shim"))
                 template = TOOLS_JIT_TEMPLATE_C.read_text()
                 for case, opname in cases_and_opnames:
                     # Write out a copy of the template with *only* this case
@@ -165,6 +165,8 @@
                     # of executor_cases.c.h each time we compile (since the C
                     # compiler wastes a bunch of time parsing the dead code for
                     # all of the other cases):
+                    if opname != "_NOP":
+                        continue
                     c = work / f"{opname}.c"
                     c.write_text(template.replace("CASE", case))
                     coro = self._compile(opname, c, work)

This compiles only _NOP, with no other concurrent tasks, so the error can be reproduced the same way every time.

@savannahostrowski
Member

Are you able to build successfully if you don't modify the code? I cannot reproduce this from main on my x86_64 Linux machine.

Can you share more about why you're trying to make this modification?

@rennsax
Contributor Author

rennsax commented Mar 4, 2025

> Are you able to build successfully if you don't modify the code?

No, of course not.

> Can you share more about why you're trying to make this modification?

I made this modification because the build process compiles and parses multiple object files asynchronously, which makes it hard for me to reproduce the error message. However, as I mentioned before, the _parse subroutine fails to parse almost every object file produced by the _compile subroutine.


More information: I'm trying to port CPython with the JIT to Nixpkgs. The build works on aarch64-darwin but fails on x86_64-linux. I suspect the problem is caused by the LLVM toolchain behaving differently on Nixpkgs. Could you attach the output of llvm-readobj from your build here? Maybe I can find something useful in it. The output can be generated by a simple shell script like:

#!/usr/bin/env bash

opname=_NOP
ll="${opname}.ll"
o="${opname}.o"

CPYTHON=$PWD

### _compile

_compile() {
    clang --target=x86_64-unknown-linux-gnu -DPy_BUILD_CORE_MODULE -DNDEBUG  \
          -D_JIT_OPCODE="${opname}" -D_PyJIT_ACTIVE -D_Py_JIT \
          -I. \
          -I"${CPYTHON}"/Include \
          -I"${CPYTHON}"/Include/internal \
          -I"${CPYTHON}"/Include/internal/mimalloc \
          -I"${CPYTHON}"/Python \
          -O3 \
          -c \
          -fno-asynchronous-unwind-tables \
          -fno-builtin \
          -fno-plt \
          -fno-stack-protector \
          -std=c11 \
          -fpic \
          -S -emit-llvm -fomit-frame-pointer \
          -o ${ll} \
          "${CPYTHON}"/Tools/jit/template.c

    sed -i.bak -E 's/((noalias|nonnull|noundef )*ptr @_JIT_\w+\()/ghccc \1/; s/musttail call/musttail call ghccc/; s/ghccc ghccc/ghccc/' $ll

    clang --target=x86_64-unknown-linux-gnu -DPy_BUILD_CORE_MODULE -DNDEBUG  \
          -D_JIT_OPCODE="${opname}" -D_PyJIT_ACTIVE -D_Py_JIT \
          -I. \
          -I"${CPYTHON}"/Include \
          -I"${CPYTHON}"/Include/internal \
          -I"${CPYTHON}"/Include/internal/mimalloc \
          -I"${CPYTHON}"/Python \
          -O3 \
          -c \
          -fno-asynchronous-unwind-tables \
          -fno-builtin \
          -fno-plt \
          -fno-stack-protector \
          -std=c11 \
          -fpic \
          -Wno-unused-command-line-argument \
          -o ${o} \
          ${ll}
}

### _parse

_parse() {
    llvm-readobj --elf-output-style=JSON \
                 --expand-relocs \
                 --section-data \
                 --section-relocations \
                 --section-symbols \
                 --sections \
                 ${o}
}

_compile
_parse

I wrote this script for CPython 3.13.2 to reproduce the problem. The compile and parse commands are copied from _targets.py.
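
To compare outputs from different toolchains, a small check over the parsed section list may help. This is only a sketch under stated assumptions: `find_dangling_relocation_sections` is a hypothetical helper, and its input is assumed to be an already-parsed list of `{"Section": {...}}` wrappers in the shape of the excerpt earlier in this report (how that list is extracted from the raw llvm-readobj output is left out). It walks the sections in order, as a single-pass parser would, and reports every SHT_RELA section whose "Info" target was not an allocated SHT_PROGBITS section.

```python
def find_dangling_relocation_sections(wrapped_sections: list) -> list:
    """Return names of SHT_RELA sections whose target is not an SHF_ALLOC SHT_PROGBITS section."""
    allocated = set()  # indices of sections a single-pass parser would have registered
    dangling = []
    for wrapped in wrapped_sections:
        section = wrapped["Section"]
        flags = {flag["Name"] for flag in section["Flags"]["Flags"]}
        kind = section["Type"]["Name"]
        if kind == "SHT_PROGBITS" and "SHF_ALLOC" in flags:
            allocated.add(section["Index"])
        elif kind == "SHT_RELA" and section["Info"] not in allocated:
            dangling.append(section["Name"]["Name"])
    return dangling


# Trimmed versions of the two sections from the excerpt in this report.
example = [
    {"Section": {"Index": 5, "Name": {"Name": ".debug_info"},
                 "Type": {"Name": "SHT_PROGBITS"},
                 "Flags": {"Flags": [{"Name": "SHF_COMPRESSED"}]}, "Info": 0}},
    {"Section": {"Index": 6, "Name": {"Name": ".rela.debug_info"},
                 "Type": {"Name": "SHT_RELA"},
                 "Flags": {"Flags": [{"Name": "SHF_INFO_LINK"}]}, "Info": 5}},
]
print(find_dangling_relocation_sections(example))  # ['.rela.debug_info']
```

An empty result would suggest the toolchain is not emitting relocation tables for non-allocated sections such as .debug_info.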

@brandtbucher
Member

Thanks for opening the issue. Regardless of how the .debug_info section is making it into your build (I'm guessing you might have a downstream-patched LLVM or some other way that different default flags are being chosen), we should probably handle this gracefully.

Does someone want to open a PR to replace this group.symbols[section["Info"]] with group.symbols.get(section["Info"]) (or something) and just return if it's not in there?
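
The suggestion above could look roughly like the following. This is a simplified sketch, not the change that was eventually merged: `handle_rela` is a hypothetical stand-in for the SHT_RELA branch, the symbol table is a plain dict, and each "hole" is reduced to a `(value, offset)` tuple for illustration.

```python
def handle_rela(section: dict, symbols: dict) -> list:
    """Return simplified holes for a relocation section, or [] if its target was skipped."""
    target = symbols.get(section["Info"])
    if target is None:
        # The target section was never registered (e.g. a non-SHF_ALLOC
        # .debug_info section), so its relocations are ignored.
        return []
    value, base = target
    return [
        (value, base + wrapped["Relocation"]["Offset"])
        for wrapped in section["Relocations"]
    ]


rela = {"Info": 5, "Relocations": [{"Relocation": {"Offset": 8}}]}
print(handle_rela(rela, {}))                  # []
print(handle_rela(rela, {5: ("DATA", 16)}))   # [('DATA', 24)]
```

Returning early keeps relocations against skipped sections out of the stencils while leaving the handling of registered sections unchanged.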

@rennsax
Contributor Author

rennsax commented Mar 6, 2025

> I'm guessing you might have a downstream-patched LLVM or some other way that different default flags are being chosen

I suspect that too, and I also think we should handle it gracefully.

> Does someone want to open a PR to replace this group.symbols[section["Info"]] with group.symbols.get(section["Info"]) (or something) and just return if it's not in there?

I can test it now. Can I open the PR when I'm done?

@brandtbucher
Member

Yep, that’d be great. Thanks!

@rennsax
Contributor Author

rennsax commented Mar 6, 2025

@brandtbucher By the way, are there any JIT-related tests in the CPython repo? I can build it now, but I'm not sure whether the JIT compiler behaves correctly.

@brandtbucher
Member

> @brandtbucher By the way, are there any JIT-related tests in the CPython repo? I can build it now, but I'm not sure whether the JIT compiler behaves correctly.

We have tests for trace collection and optimization passes, but not really for the machine code backend (which you've fixed here). We "test" it in CI by building with the JIT enabled and running the test suite, which compiles and runs lots of stuff. If something was badly broken, we'd know pretty quickly.
