[Feature] iluvatar platforms support #1045
Conversation
Code Review
This pull request introduces support for the Iluvatar CUDA platform, adding new configuration files, device initialization logic, and optimized implementations for Flash Attention, INT8 quantized linear layers, RMS Norm, and RoPE. The review feedback identifies several critical issues, including instances where `NotImplementedError` is instantiated but not raised, a potential `NameError` and logic bugs in the Flash Attention implementation, and uninitialized attributes in the RoPE class. Additionally, there are inconsistencies in API usage for the quantization functions and minor documentation typos that require correction.
```python
device = q.device

def half(x):
    return x if x.dtype in half_dtypes else x.to(dtype)
```
There are two issues here:

- `dtype` is undefined if `len(q.shape) == 3`, because it is only assigned inside the `elif len(q.shape) == 4` block (line 30). This will cause a `NameError` if `half()` is ever called for 3D inputs.
- The logic `x if x.dtype in half_dtypes else x.to(dtype)` does not actually ensure half precision if the input `q` is `float32` (since `dtype` would then also be `float32`). Flash attention kernels typically require `float16` or `bfloat16`.
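A minimal sketch of one way to address both issues, assuming `half_dtypes = (torch.float16, torch.bfloat16)` as in the surrounding code (the `attention` wrapper signature here is illustrative, not the PR's exact one): compute `dtype` once before any shape branching, with a half-precision fallback for `float32` inputs.

```python
import torch

half_dtypes = (torch.float16, torch.bfloat16)

def attention(q, k, v):
    # Pick the working dtype up front so both the 3D and 4D branches see it;
    # fall back to bfloat16 when the input is float32, since the flash
    # attention kernels require a half-precision dtype.
    dtype = q.dtype if q.dtype in half_dtypes else torch.bfloat16

    def half(x):
        # Now always returns a half-precision tensor.
        return x if x.dtype in half_dtypes else x.to(dtype)

    q, k, v = half(q), half(k), half(v)
    ...
```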
```python
    return_softmax_lse=False,
    causal=False,
)
return x.reshape(bs * max_seqlen_q, -1)
```
The reshape operation will fail for variable-length 4D inputs. When `cu_seqlens_q` is provided, the output `x` from `flash_attn_varlen_func` is a packed tensor with shape `[sum(seqlens), nheads, head_dim]`. Reshaping it to `[bs * max_seqlen_q, -1]` (the padded size) will raise a `RuntimeError` because the total number of elements won't match if any padding was present in the original 4D input.
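One way to handle this, sketched below under the assumption that `cu_seqlens_q` holds the usual `bs + 1` cumulative sequence lengths: scatter the packed output into a zero-padded buffer before the final reshape. The helper name `unpack_varlen_output` is hypothetical, not part of the PR.

```python
import torch

def unpack_varlen_output(x_packed, cu_seqlens_q, bs, max_seqlen_q):
    # x_packed: [sum(seqlens), nheads, head_dim] from flash_attn_varlen_func.
    # Scatter each sequence into a padded buffer so element counts match.
    nheads, head_dim = x_packed.shape[1], x_packed.shape[2]
    out = x_packed.new_zeros(bs, max_seqlen_q, nheads * head_dim)
    for i in range(bs):
        start = int(cu_seqlens_q[i])
        end = int(cu_seqlens_q[i + 1])
        out[i, : end - start] = x_packed[start:end].reshape(end - start, -1)
    return out.reshape(bs * max_seqlen_q, -1)
```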
```python
device = x.device
input_tensor_quant = torch.empty(x.shape, dtype=torch.int8, device=device)
input_tensor_scale = torch.empty(x.shape[:-1], dtype=torch.float32, device=device)
ixf.dynamic_scaled_int8_quant(output=input_tensor_quant, input=x, scale=input_tensor_scale)
```
The API usage of `ixf.dynamic_scaled_int8_quant` is inconsistent with its usage in `lightx2v_platform/ops/mm/iluvatar_cuda/q_linear.py`. Here it is used as an in-place function with keyword arguments (`output=`, `input=`, `scale=`), while in `q_linear.py` it is used as a function returning two values with a single positional argument. Please verify the correct ixformer API and ensure consistency.
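For illustration, the two conflicting call styles described above look like this; at most one can match the real ixformer signature, which should be verified against the library itself:

```python
# Style in this file: in-place, keyword arguments, preallocated outputs.
ixf.dynamic_scaled_int8_quant(output=input_tensor_quant, input=x, scale=input_tensor_scale)

# Style in q_linear.py (as described in the review): out-of-place, a single
# positional argument, returning the quantized tensor and its scale.
input_tensor_quant, input_tensor_scale = ixf.dynamic_scaled_int8_quant(x)
```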
On Iluvatar, the output uses a preallocated tensor (hence the in-place style here).
```python
if xq.dim() == 4:
    xq = xq.squeeze(0)
    xk = xk.squeeze(0)
return xq.to(self.infer_dtype), xk.to(self.infer_dtype)
```
Declare the class as `class IluvatarWanRope(RopeTemplate)`, i.e. inherit from `RopeTemplate`: `RopeTemplate` initializes `infer_dtype`.
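A minimal sketch of the suggested fix, assuming `RopeTemplate` sets `self.infer_dtype` in its `__init__` (the method name `apply_rope` is hypothetical):

```python
class IluvatarWanRope(RopeTemplate):
    def __init__(self, *args, **kwargs):
        # Delegating to RopeTemplate ensures self.infer_dtype is initialized.
        super().__init__(*args, **kwargs)

    def apply_rope(self, xq, xk):
        if xq.dim() == 4:
            xq = xq.squeeze(0)
            xk = xk.squeeze(0)
        return xq.to(self.infer_dtype), xk.to(self.infer_dtype)
```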
```python
Quant MM:
Weight: int8 perchannel sym
Act: int8 perchannel dynamic sym
Kernel: mlu
```
```python
# Need this, otherwise Triton tries to launch from cuda:0 and we get
# ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
with torch.cuda.device(x.device.index):
```
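The pattern this guards, sketched with a hypothetical kernel launch: `torch.cuda.device` sets the current CUDA device so Triton resolves the pointer on the GPU where `x` actually lives rather than on `cuda:0`.

```python
import torch

def run_on_tensor_device(x: torch.Tensor) -> torch.Tensor:
    # Without this context manager, launching a Triton kernel on a tensor
    # living on, say, cuda:1 fails with the pointer-argument ValueError
    # quoted above, because the launch happens from cuda:0.
    with torch.cuda.device(x.device.index):
        out = torch.empty_like(x)
        # my_triton_kernel[grid](x, out, ...)  # hypothetical kernel launch
        return out
```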
Update lightx2v/models/input_encoders/hf/wan/t5/model.py