
Ignore Resize ops when validating all ID uses are exactly mapped. #64

Merged
1 commit, merged Mar 24, 2023

Conversation

naoyam (Collaborator) commented Mar 23, 2023

Resize ops are not replayed, so they don't need to be exactly mapped

Previously, FusionSliceForNanoGPT3_CUDA was segmented because the resize ops are not exactly mapped, since they have different expansion arguments. Since those resize ops are part of rfactor transformations, they were detected as conflicting rfactor transformations. However, unlike the split and merge ops used by reshape, resize ops are not replayed, so they don't need to be uniform.

This is also part of the fix for #58. It looks like the Python example is no longer segmented, although I suspect there's still something that needs to be done for permute.
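To illustrate the idea, here is a minimal sketch of the skip logic. All names here (ExprKind, RFactorExpr, validateIdUses) are hypothetical stand-ins for this explanation, not the actual nvFuser validation code: when checking that every use of an ID in the rfactor transformations is exactly mapped, uses that are resize ops can simply be skipped, because they are never replayed.

  // Hypothetical sketch: ignore Resize ops when validating exact mapping.
  #include <vector>

  enum class ExprKind { Split, Merge, Resize };

  struct RFactorExpr {
    ExprKind kind;
    bool exactly_mapped;  // result of the exact-map check for this use
  };

  // Returns true if all uses that actually require exact mapping are mapped.
  bool validateIdUses(const std::vector<RFactorExpr>& uses) {
    for (const auto& use : uses) {
      // Resize ops are not replayed, so differing expansion arguments
      // (e.g. two slices with different extents) are allowed.
      if (use.kind == ExprKind::Resize) {
        continue;
      }
      // Split/merge from reshape must still be uniform across all uses.
      if (!use.exactly_mapped) {
        return false;
      }
    }
    return true;
  }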

Commit: Resize ops are not replayed, so they don't need to be exactly mapped
naoyam requested a review from zasdfgbnm on Mar 23, 2023
zasdfgbnm (Collaborator)

Hmmm, if I have multiple slices with different result extents, should I reject the fusion?

  auto tv1 = slice(
      tv0,
      {{IrBuilder::create<Int>(0), IrBuilder::create<Int>(16)},
       {IrBuilder::create<Int>(0), IrBuilder::create<Int>(128)},
       {IrBuilder::create<Int>(0), IrBuilder::create<Int>(1024)}});
  auto tv2 = slice(
      tv0,
      {{IrBuilder::create<Int>(0), IrBuilder::create<Int>(16)},
       {IrBuilder::create<Int>(0), IrBuilder::create<Int>(128)},
       {IrBuilder::create<Int>(1024), IrBuilder::create<Int>(2049)}}); // Note: not 2048

naoyam (Collaborator, Author) commented Mar 23, 2023

For correctness, it doesn't need to be rejected.

%kernel {
T4_l[ iblockIdx.x37{( ceilDiv(( ceilDiv(( 128 * 1024 ), blockDim.x) ), 4) )}, iblockIdx.y39{( ceilDiv(16, 1) )}, iUS40{1}, iS38{4}, ithreadIdx.x36{blockDim.x} ] ca_pos( 5 )
   = slice( T0_g[ iS58{( ceilDiv(( ceilDiv(( i2 * i3 ), blockDim.x) ), 4) )}, iS60{( ceilDiv(i0, 1) )}, iS61{1}, iS59{4}, iS57{blockDim.x} ], { {0, 16, 1} {0, 128, 1} {0, 1024, 1} } )
T1_g[ iblockIdx.x30{( ceilDiv(( ceilDiv(( 128 * 1024 ), blockDim.x) ), 4) )}, iblockIdx.y32{( ceilDiv(16, 1) )}, iUS33{1}, iS31{4}, ithreadIdx.x29{blockDim.x} ] ca_pos( 3 ) produce_pos( 5 )
   = T4_l[ iblockIdx.x37{( ceilDiv(( ceilDiv(( 128 * 1024 ), blockDim.x) ), 4) )}, iblockIdx.y39{( ceilDiv(16, 1) )}, iUS40{1}, iS38{4}, ithreadIdx.x36{blockDim.x} ] ca_pos( 5 );
T5_l[ iblockIdx.x65{( ceilDiv(( ceilDiv(( 128 * 1025 ), blockDim.x) ), 4) )}, iblockIdx.y67{( ceilDiv(16, 1) )}, iUS68{1}, iS66{4}, ithreadIdx.x64{blockDim.x} ] ca_pos( 5 )
   = slice( T0_g[ iS58{( ceilDiv(( ceilDiv(( i2 * i3 ), blockDim.x) ), 4) )}, iS60{( ceilDiv(i0, 1) )}, iS61{1}, iS59{4}, iS57{blockDim.x} ], { {0, 16, 1} {0, 128, 1} {1024, 2049, 1} } )
T2_g[ iblockIdx.x72{( ceilDiv(( ceilDiv(( 128 * 1025 ), blockDim.x) ), 4) )}, iblockIdx.y74{( ceilDiv(16, 1) )}, iUS75{1}, iS73{4}, ithreadIdx.x71{blockDim.x} ] produce_pos( 5 )
   = T5_l[ iblockIdx.x65{( ceilDiv(( ceilDiv(( 128 * 1025 ), blockDim.x) ), 4) )}, iblockIdx.y67{( ceilDiv(16, 1) )}, iUS68{1}, iS66{4}, ithreadIdx.x64{blockDim.x} ] ca_pos( 5 );

The two slices are scheduled in the same way, although each axis may have a different extent.

T4_l[ iblockIdx.x37{( ceilDiv(( ceilDiv(( 128 * 1024 ), blockDim.x) ), 4) )},
T5_l[ iblockIdx.x65{( ceilDiv(( ceilDiv(( 128 * 1025 ), blockDim.x) ), 4) )},

This means that blockIdx.x is no longer unique in ParallelDimensionMap, but it should still work correctly (if not, there must be a bug).
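Conceptually (a hand-written illustration only, not the generated nvFuser kernel; the names out_small and out_large are made up for this sketch), a non-unique parallel extent stays correct because the grid can be sized for the larger slice while each tensor's work is predicated on its own extent:

  // Grid sized for the larger slice (128 * 1025); the smaller slice
  // predicates out the extra threads.
  __global__ void two_slices(const float* in, float* out_small, float* out_large,
                             int extent_small, int extent_large) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < extent_small) {
      out_small[i] = in[i];         // slice [0, 1024)
    }
    if (i < extent_large) {
      out_large[i] = in[1024 + i];  // slice [1024, 2049)
    }
  }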

For performance, I'm not sure whether it's always better to fuse them or reject them. The overall performance is likely determined by the larger slice output, so we might want to pick that as the reference for scheduling.

I'd say it's too early to worry too much about performance. Since it should be fine for correctness, I'd like to fuse them opportunistically and revisit if perf problems are found.

zasdfgbnm (Collaborator) left a comment

It's good to know that there is no correctness issue. Thanks for explaining!
