
[Speed]concat operator need to be enhanced #8567

Closed
1 of 3 tasks
dzhwinter opened this issue Feb 26, 2018 · 3 comments · Fixed by #8585 or #8669
dzhwinter commented Feb 26, 2018

Benchmark script: https://github.com/dzhwinter/benchmark/blob/master/fluid/machine_translation.py

dzhwinter (author) commented:
This is the result of a single forward pass.

------------------------->     Profiling Report     <-------------------------

Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread

Event                                     Calls       Total       Min.        Max.        Ave.
thread0::while                            47          55496.6     831.934     1633.26     1180.78
thread0::concat                           6001        46348.9     0.029568    37.7211     7.72353
thread0::sequence_softmax                 2977        2385.51     0.039552    2.60496     0.801313
thread0::mul                              32935       2383.93     0.02848     4.30915     0.0723829
thread0::sequence_pool                    3024        1020.04     0.03024     2.97926     0.337316
thread0::array_to_lod_tensor              47          944.105     11.8905     24.6831     20.0873
thread0::lod_tensor_to_array              47          893.578     11.1313     23.2719     19.0123
thread0::sequence_expand                  2977        598.336     0.027776    3.84378     0.200986
thread0::softmax                          2977        568.94      0.039488    3.61206     0.191112
thread0::elementwise_mul                  11908       535.098     0.019712    87.0525     0.044936
thread0::sum                              14885       490.149     0.026688    4.02086     0.0329291
thread0::elementwise_add                  14885       485.046     0.020832    3.86573     0.0325862
thread0::shrink_rnn_memory                11908       467.859     0.003008    4.13501     0.0392895
thread0::lstm                             94          395.678     3.06426     4.88291     4.20934
thread0::write_to_array                   9025        207.213     0.003456    6.32        0.0229599
thread0::tanh                             8978        200.484     0.016064    3.97434     0.0223306
thread0::sigmoid                          8931        181.16      0.015904    4.16406     0.0202844
thread0::reorder_lod_tensor_by_rank       141         171.073     0.376928    1.6017      1.21328
thread0::read_from_array                  8931        139.829     0.004288    3.612       0.0156566
thread0::reshape                          2977        83.6286     0.024672    0.057792    0.0280916
thread0::less_than                        3024        36.6551     0.00176     7.4663      0.0121214
thread0::increment                        2977        16.9588     0.00176     3.81085     0.00569662
thread0::lookup_table                     94          11.0348     0.07136     0.157152    0.117391
thread0::mean                             47          4.64058     0.065664    0.116576    0.0987357
thread0::lod_rank_table                   47          2.56218     0.034176    0.065024    0.0545144
thread0::cross_entropy                    47          1.64195     0.030368    0.039296    0.0349351
thread0::fetch                            47          1.08454     0.020288    0.032416    0.0230754
thread0::fill_constant_batch_size_like    47          1.03235     0.019456    0.027424    0.0219649
thread0::feed                             141         1.00586     0.00448     0.015456    0.00713373
thread0::fill_constant                    94          0.642144    0.00496     0.01248     0.00683132
thread0::max_sequence_len                 47          0.266528    0.00448     0.00912     0.00567081

pass_id=0, test_loss: 6.739109, words/s: 2238.753745, sec/pass: 310.397694

chengduoZH commented Feb 27, 2018

Analysis of the concat operation

The input is a list of tensors plus an axis that indicates the concatenation axis. The input tensors may have any shape, but they may differ only along that axis.
For example, suppose the input is two tensors.

  • case 1:
    • t_a's shape: [9,2,3,4]
    • t_b's shape:[3,2,3,4]
    • axis = 0,

Obviously, the output's shape is [12,2,3,4]. To solve this case simply, we can reshape t_a to [9, 24] and t_b to [3, 24], then concatenate the two tensors vertically. The output's shape is [12, 24]. In this case, only two copies are needed.
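The reshape-then-stack trick for case 1 can be checked with NumPy (a standalone sketch; the tensor names follow the example above):

```python
import numpy as np

# Inputs matching case 1: concat along axis 0.
t_a = np.arange(9 * 2 * 3 * 4).reshape(9, 2, 3, 4)
t_b = np.arange(3 * 2 * 3 * 4).reshape(3, 2, 3, 4)

# Flatten all trailing dims, then stack vertically: each input is one
# contiguous block in the output, so only two memcpy-style copies are needed.
out2d = np.vstack([t_a.reshape(9, -1), t_b.reshape(3, -1)])  # shape [12, 24]
out = out2d.reshape(12, 2, 3, 4)

assert np.array_equal(out, np.concatenate([t_a, t_b], axis=0))
```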

  • case 2:
    • t_a's shape: [9,2,3,4]
    • t_b's shape:[9,3,3,4]
    • axis = 2,

Similarly, we can reshape t_a to [9, 2, 12] and t_b to [9, 3, 12], then concatenate the two tensors along the second axis. The output's shape is [9, 5, 12]. In this case, 18 copies are needed (one contiguous block from each input for every one of the 9 outer rows).
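A quick NumPy check of case 2 (standalone sketch):

```python
import numpy as np

# Inputs matching case 2: concat along axis 1.
t_a = np.random.rand(9, 2, 3, 4)
t_b = np.random.rand(9, 3, 3, 4)

# Flatten the trailing dims, concat on the second axis, then restore shape.
a3 = t_a.reshape(9, 2, 12)
b3 = t_b.reshape(9, 3, 12)
out = np.concatenate([a3, b3], axis=1).reshape(9, 5, 3, 4)

# Copy count: each of the 9 outer rows needs one contiguous copy from
# t_a and one from t_b, i.e. 9 * 2 = 18 copies.
assert np.array_equal(out, np.concatenate([t_a, t_b], axis=1))
```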

  • case 3:
    • t_a's shape: [9,2,3,4]
    • t_b's shape:[9,2,3,3]
    • axis = 3,

First, we reshape t_a to [54, 4] and t_b to [54, 3], then concatenate the two tensors horizontally. The output's shape is [54, 7]. This is the worst case: 108 copies are needed (one from each input for every one of the 54 rows).
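And the worst case, case 3, where the concat axis is the last one (standalone sketch):

```python
import numpy as np

# Inputs matching case 3: concat along the last axis.
t_a = np.random.rand(9, 2, 3, 4)
t_b = np.random.rand(9, 2, 3, 3)

# Flatten the leading dims, stack horizontally, then restore shape.
out = np.hstack([t_a.reshape(54, 4), t_b.reshape(54, 3)]).reshape(9, 2, 3, 7)

# Worst case: every one of the 54 rows needs one copy from t_a and one
# from t_b, i.e. 54 + 54 = 108 separate copies.
assert np.array_equal(out, np.concatenate([t_a, t_b], axis=3))
```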

chengduoZH commented Feb 27, 2018

I have a plan to optimize it:

  • use a single CUDA kernel to perform all of these copies.
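The fused-kernel idea can be illustrated in pure Python (a hypothetical stand-in, not the actual CUDA implementation: one flat pass over all output elements computes, for each element, which input it comes from and at what offset, instead of issuing one memcpy per block):

```python
import numpy as np

def fused_concat(inputs, axis):
    """Simulate a single fused concat kernel (hypothetical sketch)."""
    # Flatten each input to [rows, cols_i] around the concat axis.
    rows = int(np.prod(inputs[0].shape[:axis], dtype=int))
    cols = [int(np.prod(t.shape[axis:], dtype=int)) for t in inputs]
    flats = [t.reshape(rows, c) for t, c in zip(inputs, cols)]
    starts = np.cumsum([0] + cols)  # column offset of each input in the output
    out = np.empty((rows, int(starts[-1])), dtype=inputs[0].dtype)

    # One logical "thread" per output element, as a single CUDA grid would
    # launch; each thread maps its output index back to a source element.
    for r in range(rows):
        for c in range(int(starts[-1])):
            i = int(np.searchsorted(starts, c, side="right")) - 1
            out[r, c] = flats[i][r, c - starts[i]]

    # Restore the full output shape.
    shape = list(inputs[0].shape)
    shape[axis] = sum(t.shape[axis] for t in inputs)
    return out.reshape(shape)

a = np.arange(24).reshape(2, 3, 4)
b = np.arange(8).reshape(2, 1, 4)
assert np.array_equal(fused_concat([a, b], 1), np.concatenate([a, b], axis=1))
```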

@chengduoZH chengduoZH self-assigned this Feb 27, 2018
@chengduoZH chengduoZH mentioned this issue Mar 1, 2018
1 task
@dzhwinter dzhwinter changed the title concat operator need to be enhanced [Speed]concat operator need to be enhanced Mar 6, 2018
@dzhwinter dzhwinter added this to Done in Performance Tuning Mar 6, 2018
@chengduoZH chengduoZH removed this from Done in Performance Tuning Mar 8, 2018