[QST] How to efficiently convert fp32 Tensor to fp16 Tensor in Cutlass 3.x

I'd like to convert an fp32 Tensor (in registers) to an fp16 Tensor (in registers), ideally using the `__float22half2_rn` function, which converts two values at a time, for efficiency.
Cutlass 2.x has `NumericArrayConverter`, which has a [specialization](https://github.com/NVIDIA/cutlass/blob/66d9cddc832c1cdc2b30a8755274f7f74640cfe6/include/cutlass/numeric_conversion.h#L816) for `fp32 -> fp16` conversion that uses this function.

For Cutlass 3.x, I'm currently doing:
```
// accum is an fp32 Tensor in registers
Tensor acc_fp16 = make_tensor<cutlass::half_t>(shape(accum));
for (int i = 0; i < size(accum); ++i) { acc_fp16(i) = accum(i); }
```
However, this may be less efficient because it converts one element at a time and does not use the float2 -> half2 conversion function. I looked at the generated PTX, and it contains nothing like `cvt.rn.f16x2.f32`.

Should I cast the fp32 Tensor to an `Array`, use `NumericArrayConverter`, and then cast the result back to a Tensor?
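
To make the question concrete, here is a rough sketch of what I have in mind. `convert_rmem` is a hypothetical helper name of my own, not something in CUTLASS. It assumes the source is a register tensor with a fully static, compact layout (so its storage can be viewed as a `cutlass::Array`), and that `cute::raw_pointer_cast` accepts whatever `data()` returns for a register tensor in the version being used. I have not confirmed that this actually produces the packed conversion in PTX.

```cuda
#include <cute/tensor.hpp>
#include <cutlass/array.h>
#include <cutlass/numeric_conversion.h>

// Hypothetical helper (not part of CUTLASS): converts a register-resident
// CuTe tensor to element type To by viewing its storage as a cutlass::Array
// and running NumericArrayConverter over the whole array at once.
template <class To, class Engine, class Layout>
__device__ __forceinline__
auto convert_rmem(cute::Tensor<Engine, Layout> const& src) {
  using From = typename Engine::value_type;

  // Assumption: the layout is static and compact, so the N logical elements
  // occupy exactly N contiguous storage slots.
  static_assert(cute::is_static<Layout>::value, "layout must be static");
  constexpr int N = decltype(cute::size(src))::value;
  static_assert(decltype(cute::cosize(src.layout()))::value == N,
                "layout must be compact");

  // Owning register tensor with the same layout and the new element type.
  auto dst = cute::make_tensor<To>(src.layout());

  // Reinterpret both register buffers as cutlass::Array views.
  auto const& in  = *reinterpret_cast<cutlass::Array<From, N> const*>(
      cute::raw_pointer_cast(src.data()));
  auto&       out = *reinterpret_cast<cutlass::Array<To, N>*>(
      cute::raw_pointer_cast(dst.data()));

  // The float -> half_t specialization linked above is the one expected
  // to be selected here when From = float and To = cutlass::half_t.
  cutlass::NumericArrayConverter<To, From, N> convert;
  out = convert(in);
  return dst;
}
```

Usage would then be `Tensor acc_fp16 = convert_rmem<cutlass::half_t>(accum);`. Returning an owning tensor (rather than a tensor that points at a local `Array`) avoids handing back a pointer to storage that has gone out of scope.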

[QST] How to efficiently convert fp32 Tensor to fp16 Tensor in Cutlass 3.x #802
