
[Speed]concat operator need to be enhanced #8567

Closed
1 of 3 tasks
dzhwinter opened this issue Feb 26, 2018 · 3 comments · Fixed by #8585 or #8669
dzhwinter commented Feb 26, 2018

Benchmark script: https://github.com/dzhwinter/benchmark/blob/master/fluid/machine_translation.py

dzhwinter (author) commented:
This is the result of a single forward pass.

------------------------->     Profiling Report     <-------------------------

Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread

Event                                     Calls       Total       Min.        Max.        Ave.
thread0::while                            47          55496.6     831.934     1633.26     1180.78
thread0::concat                           6001        46348.9     0.029568    37.7211     7.72353
thread0::sequence_softmax                 2977        2385.51     0.039552    2.60496     0.801313
thread0::mul                              32935       2383.93     0.02848     4.30915     0.0723829
thread0::sequence_pool                    3024        1020.04     0.03024     2.97926     0.337316
thread0::array_to_lod_tensor              47          944.105     11.8905     24.6831     20.0873
thread0::lod_tensor_to_array              47          893.578     11.1313     23.2719     19.0123
thread0::sequence_expand                  2977        598.336     0.027776    3.84378     0.200986
thread0::softmax                          2977        568.94      0.039488    3.61206     0.191112
thread0::elementwise_mul                  11908       535.098     0.019712    87.0525     0.044936
thread0::sum                              14885       490.149     0.026688    4.02086     0.0329291
thread0::elementwise_add                  14885       485.046     0.020832    3.86573     0.0325862
thread0::shrink_rnn_memory                11908       467.859     0.003008    4.13501     0.0392895
thread0::lstm                             94          395.678     3.06426     4.88291     4.20934
thread0::write_to_array                   9025        207.213     0.003456    6.32        0.0229599
thread0::tanh                             8978        200.484     0.016064    3.97434     0.0223306
thread0::sigmoid                          8931        181.16      0.015904    4.16406     0.0202844
thread0::reorder_lod_tensor_by_rank       141         171.073     0.376928    1.6017      1.21328
thread0::read_from_array                  8931        139.829     0.004288    3.612       0.0156566
thread0::reshape                          2977        83.6286     0.024672    0.057792    0.0280916
thread0::less_than                        3024        36.6551     0.00176     7.4663      0.0121214
thread0::increment                        2977        16.9588     0.00176     3.81085     0.00569662
thread0::lookup_table                     94          11.0348     0.07136     0.157152    0.117391
thread0::mean                             47          4.64058     0.065664    0.116576    0.0987357
thread0::lod_rank_table                   47          2.56218     0.034176    0.065024    0.0545144
thread0::cross_entropy                    47          1.64195     0.030368    0.039296    0.0349351
thread0::fetch                            47          1.08454     0.020288    0.032416    0.0230754
thread0::fill_constant_batch_size_like    47          1.03235     0.019456    0.027424    0.0219649
thread0::feed                             141         1.00586     0.00448     0.015456    0.00713373
thread0::fill_constant                    94          0.642144    0.00496     0.01248     0.00683132
thread0::max_sequence_len                 47          0.266528    0.00448     0.00912     0.00567081

pass_id=0, test_loss: 6.739109, words/s: 2238.753745, sec/pass: 310.397694

chengduoZH commented Feb 27, 2018

Analysis of the concat operation

The input is a list of tensors plus an axis that indicates the concatenation axis. The input tensors may have any shape, but they may differ only along that axis.
For example, suppose the input is two tensors.

  • case 1:
    • t_a's shape: [9,2,3,4]
    • t_b's shape:[3,2,3,4]
    • axis = 0,

Obviously, the output's shape is [12,2,3,4]. To solve this case simply, we can reshape t_a to [9, 24] and t_b to [3, 24], then concatenate the two tensors vertically. The output's shape is [12, 24]. In this case, only two copies are needed.
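The reshape-then-stack trick for case 1 can be checked with NumPy (a standalone sketch; the tensor names follow the example above):

```python
import numpy as np

# Inputs matching case 1: concat along axis 0.
t_a = np.arange(9 * 2 * 3 * 4).reshape(9, 2, 3, 4)
t_b = np.arange(3 * 2 * 3 * 4).reshape(3, 2, 3, 4)

# Flatten all trailing dims, then stack vertically: each input is one
# contiguous block in the output, so only two memcpy-style copies are needed.
out2d = np.vstack([t_a.reshape(9, -1), t_b.reshape(3, -1)])  # shape [12, 24]
out = out2d.reshape(12, 2, 3, 4)

assert np.array_equal(out, np.concatenate([t_a, t_b], axis=0))
```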

  • case 2:
    • t_a's shape: [9,2,3,4]
    • t_b's shape:[9,3,3,4]
    • axis = 2,

Similarly, we can reshape t_a to [9, 2, 12] and t_b to [9, 3, 12], then concatenate the two tensors along the second axis. The output's shape is [9, 5, 12]. In this case, 18 copies are needed (one contiguous block from each input for every one of the 9 outer rows).
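A quick NumPy check of case 2 (standalone sketch):

```python
import numpy as np

# Inputs matching case 2: concat along axis 1.
t_a = np.random.rand(9, 2, 3, 4)
t_b = np.random.rand(9, 3, 3, 4)

# Flatten the trailing dims, concat on the second axis, then restore shape.
a3 = t_a.reshape(9, 2, 12)
b3 = t_b.reshape(9, 3, 12)
out = np.concatenate([a3, b3], axis=1).reshape(9, 5, 3, 4)

# Copy count: each of the 9 outer rows needs one contiguous copy from
# t_a and one from t_b, i.e. 9 * 2 = 18 copies.
assert np.array_equal(out, np.concatenate([t_a, t_b], axis=1))
```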

  • case 3:
    • t_a's shape: [9,2,3,4]
    • t_b's shape:[9,2,3,3]
    • axis = 3,

First, we reshape t_a to [54, 4] and t_b to [54, 3], then concatenate the two tensors horizontally. The output's shape is [54, 7]. This is the worst case: 108 copies are needed (one from each input for every one of the 54 rows).
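And the worst case, case 3, where the concat axis is the last one (standalone sketch):

```python
import numpy as np

# Inputs matching case 3: concat along the last axis.
t_a = np.random.rand(9, 2, 3, 4)
t_b = np.random.rand(9, 2, 3, 3)

# Flatten the leading dims, stack horizontally, then restore shape.
out = np.hstack([t_a.reshape(54, 4), t_b.reshape(54, 3)]).reshape(9, 2, 3, 7)

# Worst case: every one of the 54 rows needs one copy from t_a and one
# from t_b, i.e. 54 + 54 = 108 separate copies.
assert np.array_equal(out, np.concatenate([t_a, t_b], axis=3))
```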

chengduoZH commented Feb 27, 2018

I have a plan to optimize it:

  • use a single CUDA kernel to perform all of these copies.
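The fused-kernel idea can be illustrated in pure Python (a hypothetical stand-in, not the actual CUDA implementation: one flat pass over all output elements computes, for each element, which input it comes from and at what offset, instead of issuing one memcpy per block):

```python
import numpy as np

def fused_concat(inputs, axis):
    """Simulate a single fused concat kernel (hypothetical sketch)."""
    # Flatten each input to [rows, cols_i] around the concat axis.
    rows = int(np.prod(inputs[0].shape[:axis], dtype=int))
    cols = [int(np.prod(t.shape[axis:], dtype=int)) for t in inputs]
    flats = [t.reshape(rows, c) for t, c in zip(inputs, cols)]
    starts = np.cumsum([0] + cols)  # column offset of each input in the output
    out = np.empty((rows, int(starts[-1])), dtype=inputs[0].dtype)

    # One logical "thread" per output element, as a single CUDA grid would
    # launch; each thread maps its output index back to a source element.
    for r in range(rows):
        for c in range(int(starts[-1])):
            i = int(np.searchsorted(starts, c, side="right")) - 1
            out[r, c] = flats[i][r, c - starts[i]]

    # Restore the full output shape.
    shape = list(inputs[0].shape)
    shape[axis] = sum(t.shape[axis] for t in inputs)
    return out.reshape(shape)

a = np.arange(24).reshape(2, 3, 4)
b = np.arange(8).reshape(2, 1, 4)
assert np.array_equal(fused_concat([a, b], 1), np.concatenate([a, b], axis=1))
```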

@chengduoZH chengduoZH self-assigned this Feb 27, 2018
@chengduoZH chengduoZH mentioned this issue Mar 1, 2018
1 task
@dzhwinter dzhwinter changed the title concat operator need to be enhanced [Speed]concat operator need to be enhanced Mar 6, 2018
@dzhwinter dzhwinter added this to Done in Performance Tuning Mar 6, 2018
@chengduoZH chengduoZH removed this from Done in Performance Tuning Mar 8, 2018