Merge master updates#4

Merged
tmm77 merged 21 commits into amd-develop from master
Aug 15, 2025

Conversation


@tmm77 tmm77 commented Aug 15, 2025

Brief Summary

merging latest master updates to amd-develop

hughperkins and others added 21 commits April 30, 2025 18:21
Issue: #

### Brief Summary

Add SNode.snode_tree_id

copilot:summary

### Walkthrough

Add SNode.snode_tree_id

This will be useful for diagnosing crash bugs in Debug.

copilot:walkthrough
…v#8709)

Issue: #

### Brief Summary

Factorize cpp examples into separate executables:
- cpp_examples_run_snode
- cpp_examples_autograd
- cpp_examples_aot_save

The benefit is that we can then run each example on its own, and easily
add new examples.

Also:
- replace `Arch::x64` with `host_arch()`
- run aot_save once for cpu, then once for vulkan, if available; and
once for dx12, if available

copilot:summary

### Walkthrough

copilot:walkthrough
Issue: #

### Brief Summary

Use brew clang compiler on mac

copilot:summary

### Walkthrough

When building on the latest Macs, e.g. Sequoia, the system clang is 16.0.0
or 17.0.0, depending on whether Xcode (16.0.0) or the command line
tools (17.0.0) are used. Either way, the bytecode generated at
python/taichi/_lib/runtime/runtime_arm64.bc is not loadable by llvm 15.
Trying to load the clang-16/17 compiled runtime_arm64.bc with llvm 15
gives an error about an unknown attribute (86). llvm 15 is the version of
llvm used by taichi currently. Therefore, building with the system clang on
macOS Sequoia prevents taichi from loading/running.

Example failure: 


https://github.com/taichi-dev/taichi/actions/runs/14742813933/job/41384420135?pr=8688

![Screenshot 2025-04-30 at 7 28 39 AM](https://github.com/user-attachments/assets/9fe577b6-1879-4bc8-b99c-cf9909565083)


To fix this, we use the clang from the brew installed llvm@15 instead.
This then runs ok.

We assume in compile.py that brew has already been used earlier in the
script to install llvm@15. We use `brew config` to locate the Homebrew
prefix, and then assume the clang path is
`{HOMEBREW_PREFIX}/opt/llvm@15/bin/clang`, and similarly for clang++.
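The path construction described above can be sketched as follows. `brew_clang_paths` is a hypothetical helper, not the actual code in compile.py; the `opt/llvm@15/bin` layout is Homebrew's standard convention for keg-only formulae:

```python
from pathlib import Path

def brew_clang_paths(prefix: str) -> tuple[str, str]:
    # Given HOMEBREW_PREFIX, build the expected clang/clang++ paths
    # under the keg-only llvm@15 install.
    base = Path(prefix) / "opt" / "llvm@15" / "bin"
    return str(base / "clang"), str(base / "clang++")

# In compile.py, the prefix is derived from `brew config` output
# (its HOMEBREW_PREFIX line); hard-coded here for illustration.
clang, clangxx = brew_clang_paths("/opt/homebrew")
```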

copilot:walkthrough
…llocations as dupes (taichi-dev#8705)

Issue: #

### Brief Summary

The unified allocator always returned the same memory address for its
first two allocations, so they clobbered each other.

copilot:summary

# New walkthrough

Looking at the unified allocator code, we can see that when we first
allocate a memory chunk, we do not add `size` to `head`, so the
next allocation receives exactly the same address. Thus two structs,
or similar, end up overlapping in memory and clobbering each other,
plausibly leading to a plethora of hard-to-debug crashes.

## High level overview of how allocator works

The allocator can work with two types of request:
- `exclusive`
- `not exclusive`

For exclusive requests:
- a buffer is allocated from the system:
-
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.cpp#L74-L75
- the size of the buffer matches the requested bytes (to within
alignment bytes)
- a new chunk is created
    - `chunk.data` is set to the start of this buffer
    - `chunk.head` is too, but it's not really used for exclusive access
- `chunk.tail` is set to the end of the buffer, but again not really
used for exclusive access

Exclusive access requests are thus fairly straightforward

For non-exclusive requests, it is slightly more complex:
- for the first request, we allocate a much larger buffer than the request
    - by default 1GB
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.cpp#L9-L10
- we create a new chunk
    - `chunk.data` is set to the start of this buffer
    - `chunk.head` is set to the start of unused space in this buffer
        - it should be set to `chunk.data + size`
        - prior to this PR, it was incorrectly being set to `chunk.data`, meaning that the next request would incorrectly return the start of this chunk again
    - then we return `chunk.head`
- for subsequent requests, we look for a chunk that has available space (`head + requested size <= tail`)
    - when we find such a chunk:
        - we add `size` to `head` (to within alignment)
        - we return the old `head` (to within alignment)
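The non-exclusive bump-pointer behaviour above, including the proposed fix, can be sketched in Python. All names here are illustrative, not Taichi's actual C++ API, and alignment is ignored:

```python
# Minimal sketch of a bump-pointer chunk allocator, with the fix applied.
class Chunk:
    def __init__(self, base, capacity):
        self.data = base            # start of the buffer (address as an int)
        self.head = base            # start of unused space
        self.tail = base + capacity # end of the buffer

def allocate(chunks, size, capacity=1 << 30):
    # Look for an existing chunk with room: head + size <= tail.
    for c in chunks:
        ret = c.head
        if ret + size <= c.tail:
            c.head = ret + size     # advance head past this allocation
            return ret
    # No chunk fits: create a new, much larger, chunk.
    c = Chunk(base=len(chunks) * capacity, capacity=capacity)
    chunks.append(c)
    c.head = c.data + size          # the fix: advance head on the first allocation too
    return c.data

chunks = []
a = allocate(chunks, 64)
b = allocate(chunks, 64)
assert a != b  # without the `c.head = c.data + size` line, a == b
```

Dropping the `c.head = c.data + size` line reproduces the bug: both allocations return the chunk's base address.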

## Proposed fix

The proposed fix is to set `head` to `data + size` for newly allocated
chunks.
- thinking about it, an alternative fix is to split the function into two parts:
    - the first part searches for an existing chunk, or makes a new one
        - it does not return the allocated address
        - it does not update `head`, etc.
    - the second part is always executed
        - it updates `head`
        - it returns the old `head`

I don't have a strong opinion on which fix we prefer. The second
approach seems mildly cleaner, since it decouples 'finding/creating
a chunk' from 'updating the chunk and returning the requested memory
pointer'.

## Low level details

In more detail, and assuming non-exclusive mode:
- let's say client requests `size` bytes
- we allocate a chunk much larger than that, `default_allocator_size`
bytes
-
https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L65-L75
    - the address of this chunk is stored in `ptr`
- we create a `chunk` structure to store information about the chunk we
just allocated
-
https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L63
-
https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L76-L79
    - ptr is stored in chunk.data
    - head is set to ptr too, via chunk.data
    - tail is set to ptr + allocation size, via chunk.data
- we return ptr
-
https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L85
- we should have added allocation_size to chunk.head here, but we do not
- we can look at what happens when we re-use this chunk later, to
confirm this:

When we re-use a chunk:
- we loop over all allocated chunks, looking for non-exclusive chunks
-
https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L39-L45
- we add allocation size to head, adjusting for alignment, store that in
ret, and check if ret is less than tail
-
https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L49-L53
- if ret is less than or equal to tail, then we:
    - update head to be equal to ret (so, it becomes old head +
allocation_size, adjusted for alignment)
    - return ret
    - (and break out of the loop, by virtue of the return)
- otherwise, we ignore, and keep looping over available chunks
- (if no suitable chunks found, then we will allocate a brand new chunk)

# Original Walkthrough

High level summary:
===============

- both the LLVMRuntime and the result_buffer are allocated to the same
memory address
- this results in the return error code from running a kernel
overwriting the address of snode tree 19
- this results in subsequent access to any field having snode tree 19
crashing Taichi

Reproducing the bug
===================

This bug was initially reproduced in
taichi-dev#8569 , but knowing what the
bug is, we can reproduce it using the following much simpler code:

```python
import taichi as ti

ti.init(arch=ti.arm64, debug=True)

fields = []
for i in range(20):
    fields.append(ti.field(float, shape=()))
    ti.sync()

@ti.kernel
def foo():
    fields[19][None] = 1

foo()
foo()
```

What this code does:
- allocates snode trees 0 through 19, by creating fields indexed 0
through 19, and immediately calling ti.sync, to materialize the snode
tree
- you can optionally print out the snode tree ids as long as you have a
version of master that includes the PR at
taichi-dev#8697, to verify this assertion
- following the creation of snode trees 0 through 19, we call a kernel
twice
    - the first kernel call runs without issue
    - however, the address of snode tree 19 will be set to 0 following
this call, because it is overwritten by the call's return code
    - when we run the second kernel call, it uses the address of snode
tree 19 - which is now 0 - to access values from snode tree 19, causing
a segmentation fault:

[E 04/30/25 19:00:30.022 3136495] Received signal 11 (Segmentation
fault: 11)

Detailed walkthrough
====================

1. LLVMRuntime and result_buffer are allocated the same memory address

- When we first initialize the LLVMRuntime, we:
- allocate a result_buffer from the unified allocator, via the host
allocator
- result_buffer allocated here
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/llvm_runtime_executor.cpp#L699-L700
   - call runtime_initialize
- here
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/llvm_runtime_executor.cpp#L706-L711
       - passing in the result_buffer
       - and the host allocator
   - inside runtime_initialize, we:
       - allocate the LLVMRuntime, using the same allocator
- here
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L932-L933
- interestingly, the address allocated for the LLVMRuntime memory is
identical to the address of the result_buffer memory
- verifiable by printing out the two addresses. Over multiple runs, they
consistently have the same address as each other (though the exact
addresses vary between runs)
   - these are both allocated from the exact same allocator
- if you print out the address of the allocator in each location, they
are identical
       - and no deallocations take place between the allocations
       - so, how is this possible?
   - looking at the unified allocator, there is a concept of 'exclusive'
- here
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.cpp#L32
- if a request for memory is not marked as exclusive, previously
allocated buffers can be re-used, and allocated to new requests
- here
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.cpp#L57
   - the default is exclusive = false
- here
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.h#L31
- therefore, by default, memory chunks allocated can be
re-used/returned/allocated across multiple requests

Let's first walk through the effects of LLVMRuntime and result_buffer
occupying the same space.

2. The return code of a kernel overwrites snode tree address 19

- following a kernel launch, the method
runtime_retrieve_and_reset_error_code is run on runtime.cpp
- here
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L727-L730
- this method calls `runtime->set_result(taichi_result_buffer_error_id,
runtime->error_code);`
- the first parameter is a constant
- defined here
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/inc/constants.h#L21
    - `constexpr std::size_t taichi_max_num_ret_value = 30;`
- `set_result`:
- is here
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L600-L604
    - sets result_buffer[i] to t
- here
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L602
    - in this case, i is taichi_max_num_ret_value
        - which is 30
    - t is the return code
- empirically this has a value of 0, in the test cases described above
    - i is used to index onto an array of i64
- here
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L602
    - therefore each element of the array has 8 bytes
- and therefore to get the address of the element which will be set to
0, we should multiply the index, which is 30, by 8
        - thus, we will zero out 8 bytes at byte offset 30 * 8 = 240
    - the base address for this offset is result_buffer
        - however, result_buffer has the same address as LLVMRuntime
        - (as discussed in the first section)
- so we are going to clobber 8 bytes in LLVMRuntime with zeros, at
offset 240
    - let's now look at where byte offset 240 is in LLVMRuntime
    - LLVMRuntime struct:
- is here
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L552-L562
- there are two PreallocatedMemoryChunks, each of which contains two
pointers and a size_t
-
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L546-L550
- each pointer is 8 bytes, and size_t likely 8 bytes, for 24 bytes each
            - 48 bytes total
        - host_allocator_type is a pointer to function -> 8 more bytes
- assert_failed_type, host_printf_type, host_vsnprintf_type, and Ptr are
also all pointers, so 8 bytes each, for a total for them of 32 bytes
- now we arrive at roots, which is the snode tree roots address array
            - at this point, we are at an offset of 48 + 8 + 32 = 88
            - so our offset into roots will be 240 - 88 = 152
            - each element of roots is also a pointer
            - so size 8 bytes
            - 152 bytes / 8 bytes = 19
- thus when we write the return code of 0 to result_buffer[30], we
clobber the address of tree snode 19 with 0
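The offset arithmetic above can be checked with a few lines of Python, assuming 8-byte pointers and size_t as in the walkthrough:

```python
# Check the offset arithmetic from the walkthrough (8-byte pointers/size_t assumed).
PTR = SIZE_T = 8

offset_roots = 2 * (2 * PTR + SIZE_T)  # two PreallocatedMemoryChunks     -> 48 bytes
offset_roots += PTR                    # host_allocator_type (fn pointer) -> 56 bytes
offset_roots += 4 * PTR                # assert_failed_type, host_printf_type,
                                       # host_vsnprintf_type, Ptr         -> 88 bytes

write_offset = 30 * 8                  # result_buffer[30], i64 elements  -> byte 240
clobbered_index = (write_offset - offset_roots) // PTR
print(clobbered_index)                 # 19: the snode tree root that gets zeroed
```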

3. kernel access of tree snode 19

- when a kernel is initialized, and that kernel uses a field that is
allocated on snode tree 19:
- the lowered kernel calls `%10 = call ptr @LLVMRuntime_get_roots(ptr
%9, i32 19)` (the exact statement index varies depending on the kernel,
of course)
        - this will return 0
    - then when we access a field based on offset 0, we crash.

~Proposed fix~
============

~We need to ensure that the allocator does not allocate the same memory
block twice, both to the LLVMRuntime and to the result_buffer~
- ~my proposed fix is to expose the `exclusive` parameter to the
LLVMRuntime~
- ~and to set this parameter to `true`, when used from the runtime~

~Questions~
=========

~A question in my mind is, why we would ever want exclusive to not be
true. And by default, it is in fact set to false. I feel like there is
some knowledge or insight that is missing to me.~

copilot:walkthrough
### Brief Summary
This small PR resolves the `threading` library warnings, which you can
find in the [CI
logs](https://github.com/taichi-dev/taichi/actions/runs/15284687653/job/42992007742#step:7:6782):
```python
  C:\Users\buildbot\actions-runner\_work\taichi\taichi\tests\python\test_offline_cache.py:46: DeprecationWarning: currentThread() is deprecated, use current_thread() instead
    return join(OFFLINE_CACHE_TEMP_DIR, str(threading.currentThread().ident))
```
It also fixes a small typo along the way.
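For reference, the deprecated spelling and its replacement side by side (the camelCase alias was deprecated in Python 3.10):

```python
import threading

# Deprecated spelling (emits DeprecationWarning):
#   ident = threading.currentThread().ident
# Replacement:
ident = threading.current_thread().ident
assert ident == threading.get_ident()
```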

### Walkthrough

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
…at… (taichi-dev#8717)

upgrade sccache

---------

Co-authored-by: Proton <feisuzhu@163.com>
Issue: #

### Brief Summary

- Vulkan 1.3.236.0 has been removed. 1.3.296.0 (The final 1.3.x) should
be used instead.
- Allow clang 16.
- CMake 3.x should be specified as CMake 4 is not compatible.

### Walkthrough

---------

Co-authored-by: Proton <feisuzhu@163.com>
Issue: taichi-dev#8673 

### Brief Summary

copilot:summary
This PR resolves CMake 4.0 compilation failures by standardizing the
minimum required version to 3.17 across all submodules
(TaichiExamples/TaichiCAPITests/TaichiCAPI/TaichiTests), ensuring
compatibility with modern Linux distributions.

### Walkthrough

copilot:walkthrough
#### Context
- CMake 4.0 dropped support for projects declaring a minimum version
older than 3.5, causing build failures on updated distros
- Reference: [CMake Version
Policy](https://cmake.org/cmake/help/latest/command/cmake_minimum_required.html)

#### Changes Made
1. Updated version requirement in:
   - `TaichiExamples.cmake`
   - `TaichiCAPITests.cmake`
   - `TaichiCAPI.cmake` 
   - `TaichiTests.cmake`
2. Unified requirement to CMake 3.17 (already the de facto standard)
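In each of those files the change amounts to a one-line declaration like:

```cmake
cmake_minimum_required(VERSION 3.17)
```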

#### Impact Analysis
- No breaking changes (3.17 was already the effective minimum)
- Improves forward compatibility
- Affects only build system configuration

#### Verification
- No new tests needed (version requirement change only)
- Confirmed via manual build testing

#### Additional Notes
- Aligns with CMake's modern version policy
- Prevents future issues on rolling-release distros
- Maintains backward compatibility


I sincerely apologize for the additional PR noise (taichi-dev#8701, taichi-dev#8678). Due to
my initial lack of Git proficiency, I inadvertently created redundant
PRs while attempting to sync with upstream. This new PR (taichi-dev#8703)
consolidates all changes with proper rebasing.

Thank you for your patience, and I appreciate your guidance throughout
this process.
Issue: #

### Brief Summary

copilot:summary

### Walkthrough

copilot:walkthrough

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@tmm77 tmm77 merged commit 712d405 into amd-develop Aug 15, 2025