LongCat-Video-Avatar 1.5 working on Strix Halo (gfx1151) via ComfyUI, first datapoint as far as I know... #5635
InfinitePortaldev started this conversation in Show and tell
Replies: 2 comments 1 reply
Architectures with massive unified memory fabrics, like Strix Halo (gfx1151) or DGX Spark, solve the physical capacity constraint but introduce heavy cache line starvation when consecutive layers continuously read/write intermediate states back across the shared pool. You can optimize the throughput on these unified fabrics by installing renorm-native with `pip install renorm-native`.
Run your operational tensor transformations like this to stabilize the execution graph:
```python
import torch
from renorm.layers import FusedRenormLinearFunction
# 1. Allocate tensors in your current CUDA/Unified environment
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inputs = torch.randn(1, 4096, device=device)
weights = torch.randn(4096, 4096, device=device)
bias = torch.randn(4096, device=device)
# 2. Apply the register-fused stabilization layer (Beta = 0.05)
# This keeps calculations tight, eliminating unpredictable host memory caching spikes
output = FusedRenormLinearFunction.apply(inputs, weights, bias, 0.05)
```
For the complete implementation and memory-mapping details, check out the core repo: [GitHub: Tobi-Adesoye/renorm-native](https://github.com/Tobi-Adesoye/renorm-native)
1 reply
Thanks for asking. renorm-native is my own project rather than an official
ROCm or PyTorch technique, so you wouldn't expect to find it referenced in
their documentation.
At the moment, the primary source is the project repository itself, which
contains the implementation and examples:
https://github.com/Tobi-Adesoye/renorm-native
The package is also published on PyPI and can be installed with:
pip install renorm-native
My goal is to explore an alternative set of tensor operations and layer
implementations with an emphasis on stable execution and efficient
computation. It's still an evolving project, and I'm interested in feedback
and independent testing on different hardware, including ROCm systems.
If you have a chance to try it on your setup or benchmark it against
comparable PyTorch layers, I'd be very interested in your results and any
issues you encounter.
On Wed, Jun 17, 2026 at 10:24 PM InfinitePortaldev wrote:
> Thanks, I couldn't find renorm-native referenced anywhere in the ROCm or
> PyTorch ecosystem and the API doesn't match anything in my stack, do you
> have a source for the technique?
LongCat-Video-Avatar 1.5 working on Strix Halo (gfx1151) via ComfyUI, first datapoint as far as I know...
Sharing a working configuration and timing numbers for Meituan's LongCat-Video-Avatar 1.5 (an audio-driven talking-avatar model, released May 21) on a Strix Halo APU. I have not seen another report of this model on gfx1151, so I am posting the details in case they save someone else the trial and error. This is a single successful run, not a benchmark suite, so treat the numbers as a rough first datapoint.
Hardware and stack
Model files
Result
81 frames (about 3.2 seconds at 25 fps) at 480x832, 8 steps with the DMD distill LoRA:
Slow, but it completes and the output is usable. Memory is clearly not the constraint on this machine, only time.
Settings that mattered
The community workflows for this model are tuned for VRAM limited discrete GPUs, and several of those defaults actively hurt on unified memory:
Environment variables
My launcher sets ROCBLAS_USE_HIPBLASLT=1, HSA_ENABLE_SDMA=0, HSA_USE_SVM=0 and TORCH_ROCM_AOTRITON_ENABLE_CACHE=1. I have not isolated which of these are strictly necessary for this particular model; they are carried over from configurations that fixed problems with other video models on this machine (HSA_USE_SVM=0 in particular has been important for Wan-class VAE decode on unified memory; see TheRock discussion 2684).
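For anyone who wants to reproduce the environment, here is a minimal sketch of a launcher that applies those four variables before starting ComfyUI. The variable names and values come from the post; the `ComfyUI/main.py` path and the `build_env` and `launch` helper names are assumptions for illustration, not the actual launcher used for this run.

```python
import os
import subprocess
import sys

# Values taken from the post. Which of them this model strictly needs
# has not been isolated.
ROCM_ENV = {
    "ROCBLAS_USE_HIPBLASLT": "1",
    "HSA_ENABLE_SDMA": "0",
    "HSA_USE_SVM": "0",
    "TORCH_ROCM_AOTRITON_ENABLE_CACHE": "1",
}


def build_env(base=None):
    """Return a copy of the base environment with the ROCm settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(ROCM_ENV)
    return env


def launch(comfyui_main="ComfyUI/main.py", extra_args=()):
    """Start ComfyUI in a child process with the settings above.

    The default path is hypothetical; point it at your own checkout.
    """
    cmd = [sys.executable, comfyui_main, *extra_args]
    return subprocess.run(cmd, env=build_env(), check=False)
```

Calling `launch()` from a small script keeps the variables scoped to the ComfyUI process instead of the whole shell session, which makes it easier to drop one variable at a time and see which ones matter.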
I also tried TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1, which used to matter for attention speed on gfx1151 with older torch builds. On this torch 2.12 nightly it made no measurable difference, so it may be default behavior now or simply not engaged for these shapes.
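A comparison like this can be made a little more repeatable with a small timing helper. The sketch below is not what was used for the numbers in this post; `median_seconds` is a hypothetical helper that uses only the standard library and takes an optional `sync` callable so that queued GPU work is included in the measurement.

```python
import statistics
import time


def median_seconds(fn, sync=None, warmup=2, iters=5):
    """Median wall-clock seconds of fn() over `iters` runs after `warmup` runs.

    `sync` is an optional callable invoked before each clock read. On a GPU it
    would typically be torch.cuda.synchronize, so asynchronous kernels are
    finished before the timer starts and stops.
    """
    def barrier():
        if sync is not None:
            sync()

    # Warm-up runs absorb one-time costs such as kernel compilation.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(iters):
        barrier()
        start = time.perf_counter()
        fn()
        barrier()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

Because the environment variable is read when the process starts, the comparison would be two separate runs of the same script, one with TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 and one without, each passing a closure around the attention call as `fn` and `torch.cuda.synchronize` as `sync`.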
Open questions
Happy to answer questions or rerun with different settings if someone wants a specific comparison.