I'd like to convert an fp32 Tensor (in registers) to an fp16 Tensor (in registers), ideally using the __float22half2_rn intrinsic so that two values are converted per instruction.
CUTLASS 2.x has NumericArrayConverter, whose fp32 -> fp16 specialization uses this intrinsic.
For Cutlass 3.x, I'm currently doing:
// accum is a fp32 Tensor in register
Tensor acc_fp16 = make_tensor<cutlass::half_t>(shape(accum));
for (int i = 0; i < size(accum); ++i) { acc_fp16(i) = accum(i); }
However, this is probably less efficient because it converts one element at a time instead of using the packed float2 -> half2 conversion. I looked at the generated PTX and it doesn't contain anything like cvt.rn.f16x2.f32.
Should I cast the fp32 Tensor to Array, then use NumericArrayConverter, then cast the result back to Tensor?