LongCat-Video-Avatar 1.5 working on Strix Halo (gfx1151) via ComfyUI, first datapoint as far as I know... #5635

InfinitePortaldev · 2026-06-04T23:11:44Z


Sharing a working configuration and timing numbers for Meituan's LongCat-Video-Avatar 1.5 (an audio-driven talking-avatar model, released May 21) on a Strix Halo APU. I have not seen another report of this model on gfx1151, so I am posting the details in case they save someone else the trial and error. This is a single successful run, not a benchmark suite, so treat the numbers as a rough first datapoint.

Hardware and stack

AMD Ryzen AI MAX+ 395 (Strix Halo), Radeon 8060S, gfx1151, 128 GB unified memory
CachyOS, kernel 7.0.9-1-cachyos
Python 3.14.5 in a venv, no system Python or system ROCm modified
PyTorch 2.12.0a0+rocm7.13.0a20260411 from the TheRock gfx1151 nightly index (https://rocm.nightlies.amd.com/v2/gfx1151/). In my experience the generic ROCm wheels from download.pytorch.org do not work on this hardware; the architecture-specific wheels are required.
ComfyUI 0.23.0 plus kijai/ComfyUI-WanVideoWrapper (main branch, early June 2026)
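
A quick way to confirm that the venv really picked up an architecture-specific ROCm wheel is to inspect the version string and the reported device architecture. The sketch below is only illustrative: the helper names `parse_torch_version` and `report_installed_build` are made up for this example, and the `gcnArchName` device property is read defensively because it is only expected on ROCm builds.

```python
def parse_torch_version(version: str) -> dict:
    """Split a torch version such as '2.12.0a0+rocm7.13.0a20260411'.

    The part after '+' is the local build tag; ROCm wheels start it with 'rocm'.
    """
    base, _, local = version.partition("+")
    return {"base": base, "local": local, "is_rocm": local.startswith("rocm")}


def report_installed_build() -> dict:
    """Describe the torch build in the active environment (hypothetical helper)."""
    import torch  # imported lazily so the parser above works without torch

    info = parse_torch_version(torch.__version__)
    info["hip"] = torch.version.hip  # None on non-ROCm builds
    if torch.cuda.is_available():  # ROCm devices are exposed through torch.cuda
        props = torch.cuda.get_device_properties(0)
        info["device"] = props.name
        # gcnArchName is assumed to be present only on ROCm builds
        info["arch"] = getattr(props, "gcnArchName", None)
    return info


# Parsing the version string quoted in this post needs no GPU at all.
print(parse_torch_version("2.12.0a0+rocm7.13.0a20260411"))
```

Calling `report_installed_build()` inside the ComfyUI venv should show a `rocm` local tag and, on this hardware, an architecture string containing gfx1151 if the right wheel is installed.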

Model files

Diffusion model: LongCat-Avatar-15_comfy-Q5_K_M.gguf (13.7 GB) from vantagewithai/LongCat-Video-Avatar-1.5-GGUF-ComfyUI
Distill LoRA: LongCat-Avatar-15_dmd_distill_lora_rank128_bf16.safetensors from Kijai/WanVideo_comfy (LongCat subfolder). Note this is a different file from the older v1.0 LoRA in Kijai/LongCat-Video_comfy.
Audio encoder: openai/whisper-large-v3 model.safetensors placed in models/audio_encoders/ (the full file works; the loader only reads the encoder keys)
Text encoder: umt5-xxl-enc-bf16.safetensors (Kijai export format, loaded with the wrapper's own T5 loader)
VAE: the standard Comfy-Org wan_2.1_vae.safetensors repack loaded fine in the wrapper's VAE loader

Result

81 frames (about 3.2 seconds at 25 fps) at 480x832, 8 steps with the DMD distill LoRA:

Sampling: 62 minutes total, roughly 466 s/it average across the 8 steps
VAE decode: 29 seconds, no hang
Peak allocated GPU memory: about 29 to 30 GB
Output video plays correctly with lip sync

Slow, but it completes and the output is usable. Memory is clearly not the constraint on this machine, only time.
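
For anyone comparing against their own runs, the figures above reduce to a few lines of arithmetic. This is just a restatement of the quoted numbers, not a new measurement. Because the 62 minute total is rounded, the per-step average comes out at 465 s rather than exactly 466.

```python
# Figures quoted in the post; everything derived below is plain arithmetic.
FRAMES = 81
FPS = 25
STEPS = 8
SAMPLING_MINUTES = 62

clip_seconds = FRAMES / FPS                  # length of the generated clip
sampling_seconds = SAMPLING_MINUTES * 60     # wall-clock sampling time
seconds_per_step = sampling_seconds / STEPS  # average cost of one sampling step
slowdown = sampling_seconds / clip_seconds   # compute seconds per second of video

print(f"clip length:        {clip_seconds:.2f} s")
print(f"average per step:   {seconds_per_step:.0f} s/it")
print(f"compute per second: {slowdown:.0f} s of sampling per 1 s of output")
```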

Settings that mattered

The community workflows for this model are tuned for VRAM limited discrete GPUs, and several of those defaults actively hurt on unified memory:

attention_mode must be sdpa. The workflow I started from shipped with sageattn, which is CUDA only and errors out. Flash attention is also not an option on gfx1151 as far as I know.
Block swap should be disabled (blocks_to_swap 0 or disconnect the node). With it enabled at the template default of 25 blocks, sampling appeared to stall indefinitely at step 0. Both sides of the swap are the same physical memory pool here, so it is pure copy overhead.
load_device set to main_device on the model and T5 loaders.
merge_loras must be false when using a GGUF model; the wrapper enforces this with an error.
VAE tiling off on both encode and decode worked fine for me at this resolution.
Frame count must be one more than a multiple of four (77, 81, 93, 125 and so on). An invalid count crashes in the first sampling step with an einops shape mismatch on the audio embedding.
fps must be 25 in both the Whisper embeds node and the video output node, because the model is trained at 25 fps.
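
The frame count rule above is easy to get wrong when working backwards from a target duration, so here is a minimal sketch of a pre-flight check. The function names are hypothetical and not part of the wrapper; the only rule encoded is the "one more than a multiple of four" constraint and the 25 fps rate mentioned in the list.

```python
def is_valid_frame_count(n: int) -> bool:
    """True when n is one more than a multiple of four (4k + 1)."""
    return n >= 1 and n % 4 == 1


def nearest_valid_frame_count(n: int) -> int:
    """Round n down to the closest count of the form 4k + 1 (minimum 1)."""
    if n < 1:
        return 1
    return n - ((n - 1) % 4)


def frames_for_duration(seconds: float, fps: int = 25) -> int:
    """Largest valid frame count that fits within `seconds` at `fps`."""
    return nearest_valid_frame_count(int(seconds * fps))


# The counts listed in the post all pass; a round number such as 80 does not.
for count in (77, 80, 81, 93, 125):
    print(count, is_valid_frame_count(count))
```

Rounding down rather than up keeps the clip inside the requested duration, which matters when the audio track has a fixed length.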

Environment variables

My launcher sets ROCBLAS_USE_HIPBLASLT=1, HSA_ENABLE_SDMA=0, HSA_USE_SVM=0 and TORCH_ROCM_AOTRITON_ENABLE_CACHE=1. I have not isolated which of these are strictly necessary for this particular model. They are carried over from configurations that fixed problems with other video models on this machine (HSA_USE_SVM=0 in particular has been important for Wan class VAE decode on unified memory, see TheRock discussion 2684).
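
The launcher itself is not shown in the post, so the following is only a sketch of how those four variables could be applied before ComfyUI starts. The function names and the `comfy_dir` argument are assumptions, and the entry point is assumed to be ComfyUI's usual `main.py`. The variables are set on the child process only, so the parent shell stays untouched.

```python
import os
import subprocess
import sys

# Values exactly as listed in the post; which ones are required is untested.
ROCM_ENV = {
    "ROCBLAS_USE_HIPBLASLT": "1",
    "HSA_ENABLE_SDMA": "0",
    "HSA_USE_SVM": "0",
    "TORCH_ROCM_AOTRITON_ENABLE_CACHE": "1",
}


def build_env(base=None) -> dict:
    """Return a copy of the environment with the ROCm variables applied."""
    env = dict(os.environ if base is None else base)
    env.update(ROCM_ENV)
    return env


def launch_comfyui(comfy_dir: str, extra_args=()) -> int:
    """Start ComfyUI with the venv's interpreter (hypothetical launcher)."""
    cmd = [sys.executable, "main.py", *extra_args]
    return subprocess.call(cmd, cwd=comfy_dir, env=build_env())
```

Keeping the variables in one dictionary makes it straightforward to drop them one at a time when isolating which ones this model actually needs.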

I also tried TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1, which used to matter for attention speed on gfx1151 with older torch builds. On this torch 2.12 nightly it made no measurable difference, so it may be default behavior now or simply not engaged for these shapes.

Open questions

At 466 s/it there is probably headroom. I have not yet tried torch.compile, TeaCache style caching, lower quants, or TunableOp. Suggestions welcome.
I do not know how the bf16 safetensors compares to the GGUF on this stack, I only tested Q5_K_M.
Longer durations (125 frames / 5 seconds) sampled at roughly 1500 s/it in a partial run before I cancelled it. That is about 3.2 times the per-step cost for about 1.5 times the frames, which is in line with attention cost growing faster than linearly in sequence length.
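
To put a number on the scaling between the two runs, the sketch below fits a power law to the two quoted datapoints. With only two points, one of them from a cancelled partial run, the exponent is indicative at best. It also treats the frame count as a stand-in for sequence length, which is an assumption.

```python
import math

# The two datapoints quoted in the post: (frames, seconds per iteration).
short_frames, short_s_per_it = 81, 466.0
long_frames, long_s_per_it = 125, 1500.0

frame_ratio = long_frames / short_frames      # growth in clip length
cost_ratio = long_s_per_it / short_s_per_it   # growth in per-step cost

# If cost ~ frames ** p, then p = log(cost_ratio) / log(frame_ratio).
implied_exponent = math.log(cost_ratio) / math.log(frame_ratio)

# What purely quadratic scaling from the 81 frame run would predict.
quadratic_prediction = short_s_per_it * frame_ratio ** 2

print(f"frame ratio:          {frame_ratio:.2f}x")
print(f"cost ratio:           {cost_ratio:.2f}x")
print(f"implied exponent:     {implied_exponent:.2f}")
print(f"quadratic prediction: {quadratic_prediction:.0f} s/it")
```

The implied exponent lands above 2, so the longer run was slower than quadratic attention scaling alone would predict. These two measurements cannot say whether that gap comes from attention or from something else.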

Happy to answer questions or rerun with different settings if someone wants a specific comparison.

Tobi-Adesoye · 2026-06-13T00:47:36Z


Architectures with massive unified memory fabrics—like Strix Halo (gfx1151) or DGX Spark—solve the physical capacity constraint but introduce heavy cache line starvation when consecutive layers continuously read/write intermediate states back across the shared pool.

You can optimize the throughput on these unified fabrics by using renorm-native. It leverages hardware-aware async prefetching streams to pipe layers directly into cache before the compute graph demands them.

```bash
pip install renorm-native
```

Run your operational tensor transformations like this to stabilize the execution graph:

```python
import torch
from renorm.layers import FusedRenormLinearFunction

# 1. Allocate tensors in your current CUDA/Unified environment
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inputs = torch.randn(1, 4096, device=device)
weights = torch.randn(4096, 4096, device=device)
bias = torch.randn(4096, device=device)

# 2. Apply the register-fused stabilization layer (Beta = 0.05)
# This keeps calculations tight, eliminating unpredictable host memory caching spikes
output = FusedRenormLinearFunction.apply(inputs, weights, bias, 0.05)
```

For the complete implementation and memory-mapping details, check out the core repo: [GitHub: Tobi-Adesoye/renorm-native](https://github.com/Tobi-Adesoye/renorm-native)

1 reply

InfinitePortaldev Jun 17, 2026
Author

Thanks. I couldn't find renorm-native referenced anywhere in the ROCm or PyTorch ecosystem, and the API doesn't match anything in my stack. Do you have a source for the technique?

Tobi-Adesoye · 2026-06-18T11:27:36Z


Thanks for asking. renorm-native is my own project rather than an official ROCm or PyTorch technique, so you wouldn't expect to find it referenced in their documentation. At the moment, the primary source is the project repository itself, which contains the implementation and examples: https://github.com/Tobi-Adesoye/renorm-native

The package is also published on PyPI and can be installed with `pip install renorm-native`.

My goal is to explore an alternative set of tensor operations and layer implementations with an emphasis on stable execution and efficient computation. It's still an evolving project, and I'm interested in feedback and independent testing on different hardware, including ROCm systems. If you have a chance to try it on your setup or benchmark it against comparable PyTorch layers, I'd be very interested in your results and any issues you encounter.
