Tessellate performance #2438
-
TL;DR: Tessellate backprop is very slow. My LBANN StyleGAN implementation is noticeably slower than the PyTorch implementation. I just profiled it with nvprof (for 1000 epochs) and found that the largest share of time is spent on backprop for the tessellate function.
At several points in my code I need to multiply a 1D set of activations (dim = 512) by the weights of a 2D conv (e.g. 512x512x3x3), so I tessellate the 1D array to the same shape as the weights and then apply elementwise operations. Is there a better way to do this that bypasses this large performance overhead? Here is an example of what I mean:

```python
# apply weight demodulation, based on styles * weights.
styles_demod_reshaped = lbann.Reshape(styles, dims=[in_channels, 1, 1])
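# Tessellate repeats the [in_channels, 1, 1] styles tensor until it
# matches the flattened conv weight shape, so it can be multiplied
# with the weights elementwise.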
styles_shape_weights = lbann.Tessellate(
    styles_demod_reshaped,
    dims=[in_channels * out_channels, kernel_size, kernel_size],
)
w = lbann.Multiply(w, styles_shape_weights)
```

I also create the biases like this, so they need to be tessellated before being added onto the activations:

```python
b = lbann.WeightsLayer(
    weights=lbann.Weights(
        initializer=lbann.ConstantInitializer(value=0.0),
        name=name + "bias",
    ),
    dims=[out_channels, 1, 1],
    name=name + "biaslayer",
)
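# Tile the [out_channels, 1, 1] bias across the spatial dims to match
# the activation shape before the elementwise Add.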
b = lbann.Tessellate(
    b,
    dims=[
        self.out_channels,
        self.resolution,
        self.resolution,
    ],
)
x = lbann.Add(x, b)
```
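For reference, here is a minimal NumPy sketch of what these two Tessellate patterns compute (the channel and kernel sizes are taken from the example above; the resolution value is just a placeholder). In plain NumPy the same results come from broadcasting, without materializing a weight-sized copy of styles:

```python
import numpy as np

# Placeholder sizes matching the example above; resolution is assumed.
in_channels, out_channels, kernel_size, resolution = 512, 512, 3, 64

styles = np.random.rand(in_channels).astype(np.float32)
w = np.random.rand(
    out_channels * in_channels, kernel_size, kernel_size
).astype(np.float32)

# Reshape + Tessellate + Multiply: tile the [in_channels, 1, 1] styles
# tensor up to the flattened weight shape, then multiply elementwise.
styles_tiled = np.tile(
    styles.reshape(in_channels, 1, 1),
    (out_channels, kernel_size, kernel_size),
)
w_demod = w * styles_tiled

# Broadcast equivalent: no weight-sized intermediate for styles.
w4 = w.reshape(out_channels, in_channels, kernel_size, kernel_size)
w_demod_bcast = (w4 * styles.reshape(1, in_channels, 1, 1)).reshape(w.shape)
assert np.allclose(w_demod, w_demod_bcast)

# Same story for the bias: Tessellate + Add vs. a broadcast add.
b = np.zeros((out_channels, 1, 1), dtype=np.float32)
x = np.random.rand(out_channels, resolution, resolution).astype(np.float32)
assert np.allclose(x + np.tile(b, (1, resolution, resolution)), x + b)
```

Thanks,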
-
Hi @jvwilliams23, I just uploaded PR #2460 to try to address the issue. Please let us know if that performs better.