[ROCm] revert cat operator performance work-around #988

jeffdaily · 2022-04-07T21:23:41Z

revert d5ca53c (pytorch#46097). The changes only affect ROCm. Reverts a work-around for a compiler performance issue that is no longer needed.

python -m pt.cat_test --tag_filter all --device cuda

# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
OLD Forward Execution Time (us) : 48.833
NEW Forward Execution Time (us) : 8.318

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 54.508
NEW Forward Execution Time (us) : 23.824

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.117
NEW Forward Execution Time (us) : 14.942

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
OLD Forward Execution Time (us) : 98.790
NEW Forward Execution Time (us) : 74.334

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 102.063
NEW Forward Execution Time (us) : 76.008

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
OLD Forward Execution Time (us) : 167.786
NEW Forward Execution Time (us) : 123.679

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1b1dec7b00>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1b1dec7b00>, 111, 65], N: 5, dim: 0, device: cuda
OLD Forward Execution Time (us) : 98.320
NEW Forward Execution Time (us) : 67.436

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f1b1dec7a70>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f1b1dec7a70>, 64], N: 5, dim: 1, device: cuda
OLD Forward Execution Time (us) : 91.484
NEW Forward Execution Time (us) : 59.230

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f18db09d290>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f18db09d290>], N: 5, dim: 2, device: cuda
OLD Forward Execution Time (us) : 109.569
NEW Forward Execution Time (us) : 76.557

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d560>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d560>, 32, 64], N: 50, dim: 0, device: cuda
OLD Forward Execution Time (us) : 106.603
NEW Forward Execution Time (us) : 87.635

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f18db09d5f0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f18db09d5f0>, 64], N: 50, dim: 1, device: cuda
OLD Forward Execution Time (us) : 106.693
NEW Forward Execution Time (us) : 88.902

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f18db09d680>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f18db09d680>], N: 50, dim: 2, device: cuda
OLD Forward Execution Time (us) : 110.881
NEW Forward Execution Time (us) : 94.361

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
OLD Forward Execution Time (us) : 122.925
NEW Forward Execution Time (us) : 123.046

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
OLD Forward Execution Time (us) : 272.442
NEW Forward Execution Time (us) : 271.932

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
OLD Forward Execution Time (us) : 457.329
NEW Forward Execution Time (us) : 456.767

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d710>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d710>], N: 100, dim: 0, device: cuda
OLD Forward Execution Time (us) : 117.688
NEW Forward Execution Time (us) : 87.133

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d7a0>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d7a0>], N: 1000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 873.764
NEW Forward Execution Time (us) : 865.075

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d830>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d830>], N: 2000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 1746.831
NEW Forward Execution Time (us) : 1730.252

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d8c0>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d8c0>], N: 3000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 2619.303
NEW Forward Execution Time (us) : 2598.717

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cuda
# Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.063
NEW Forward Execution Time (us) : 7.904

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cuda
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.275
NEW Forward Execution Time (us) : 8.118

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cuda
# Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.896
NEW Forward Execution Time (us) : 7.938

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cuda
# Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.745
NEW Forward Execution Time (us) : 7.922

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cuda
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.575
NEW Forward Execution Time (us) : 13.299

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cuda
# Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.090
NEW Forward Execution Time (us) : 8.015

Pull Request resolved: pytorch#74129
Approved by: https://github.com/ngimel

Fixes #ISSUE_NUMBER

revert d5ca53c (pytorch#46097). The changes only affect ROCm. Reverts a work-around for a compiler performance issue that is no longer needed. `python -m pt.cat_test --tag_filter all --device cuda` ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : all # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes(1,1,1)_N2_dim0_cuda # Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda OLD Forward Execution Time (us) : 48.833 NEW Forward Execution Time (us) : 8.318 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes(512,512,2)_N2_dim1_cuda # Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda OLD Forward Execution Time (us) : 54.508 NEW Forward Execution Time (us) : 23.824 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes(128,1024,2)_N2_dim1_cuda # Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda OLD Forward Execution Time (us) : 52.117 NEW Forward Execution Time (us) : 14.942 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes(1024,1024,2)_N2_dim0_cuda # Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda OLD Forward Execution Time (us) : 98.790 NEW Forward Execution Time (us) : 74.334 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes(1025,1023,2)_N2_dim1_cuda # Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda OLD Forward Execution Time (us) : 102.063 NEW Forward Execution Time (us) : 76.008 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes(1024,1024,2)_N2_dim2_cuda # Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda OLD Forward Execution Time (us) : 167.786 NEW Forward Execution Time (us) : 123.679 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[<function<lambda>at0x7f1b1dec7b00>,111,65]_N5_dim0_cuda # Input: sizes: [<function <lambda> at 0x7f1b1dec7b00>, 111, 65], N: 5, dim: 0, device: cuda OLD Forward Execution Time (us) : 98.320 NEW Forward Execution Time (us) : 67.436 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[96,<function<lambda>at0x7f1b1dec7a70>,64]_N5_dim1_cuda # Input: sizes: [96, <function <lambda> at 0x7f1b1dec7a70>, 64], N: 5, dim: 1, device: cuda OLD Forward Execution Time (us) : 91.484 NEW Forward Execution Time (us) : 59.230 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[128,64,<function<lambda>at0x7f18db09d290>]_N5_dim2_cuda # Input: sizes: [128, 64, <function <lambda> at 0x7f18db09d290>], N: 5, dim: 2, device: cuda OLD Forward Execution Time (us) : 109.569 NEW Forward Execution Time (us) : 76.557 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[<function<lambda>at0x7f18db09d560>,32,64]_N50_dim0_cuda # Input: sizes: [<function <lambda> at 0x7f18db09d560>, 32, 64], N: 50, dim: 0, device: cuda OLD Forward Execution Time (us) : 106.603 NEW Forward Execution Time (us) : 87.635 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[32,<function<lambda>at0x7f18db09d5f0>,64]_N50_dim1_cuda # Input: sizes: [32, <function <lambda> at 0x7f18db09d5f0>, 64], N: 50, dim: 1, device: cuda OLD Forward Execution Time (us) : 106.693 NEW Forward Execution Time (us) : 88.902 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[33,65,<function<lambda>at0x7f18db09d680>]_N50_dim2_cuda # Input: sizes: [33, 65, <function <lambda> at 0x7f18db09d680>], N: 50, dim: 2, device: cuda OLD Forward Execution Time (us) : 110.881 NEW Forward Execution Time (us) : 94.361 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda # Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda OLD Forward Execution Time (us) : 122.925 NEW Forward Execution Time (us) : 123.046 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda # Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda OLD Forward Execution Time (us) : 272.442 NEW Forward Execution Time (us) : 271.932 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda # Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda OLD Forward Execution Time (us) : 457.329 NEW Forward Execution Time (us) : 456.767 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[<function<lambda>at0x7f18db09d710>]_N100_dim0_cuda # Input: sizes: [<function <lambda> at 0x7f18db09d710>], N: 100, dim: 0, device: cuda OLD Forward Execution Time (us) : 117.688 NEW Forward Execution Time (us) : 87.133 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[<function<lambda>at0x7f18db09d7a0>]_N1000_dim0_cuda # Input: sizes: [<function <lambda> at 0x7f18db09d7a0>], N: 1000, dim: 0, device: cuda OLD Forward Execution Time (us) : 873.764 NEW Forward Execution Time (us) : 865.075 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[<function<lambda>at0x7f18db09d830>]_N2000_dim0_cuda # Input: sizes: [<function <lambda> at 0x7f18db09d830>], N: 2000, dim: 0, device: cuda OLD Forward Execution Time (us) : 1746.831 NEW Forward Execution Time (us) : 1730.252 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[<function<lambda>at0x7f18db09d8c0>]_N3000_dim0_cuda # Input: sizes: [<function <lambda> at 0x7f18db09d8c0>], N: 3000, dim: 0, device: cuda OLD Forward Execution Time (us) : 2619.303 NEW Forward Execution Time (us) : 2598.717 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cuda # Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cuda OLD Forward Execution Time (us) : 52.063 NEW Forward Execution Time (us) : 7.904 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cuda # Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cuda OLD Forward Execution Time (us) : 52.275 NEW Forward Execution Time (us) : 8.118 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cuda # Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cuda OLD Forward Execution Time (us) : 51.896 NEW Forward Execution Time (us) : 7.938 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cuda # Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cuda OLD Forward Execution Time (us) : 51.745 NEW Forward Execution Time (us) : 7.922 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cuda # Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cuda OLD Forward Execution Time (us) : 52.575 NEW Forward Execution Time (us) : 13.299 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cuda # Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cuda OLD Forward Execution Time (us) : 52.090 NEW Forward Execution Time (us) : 8.015 ``` Pull Request resolved: pytorch#74129 Approved by: https://github.com/ngimel

jeffdaily merged commit adc5c89 into release/1.11 Apr 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ROCm] revert cat operator performance work-around #988

[ROCm] revert cat operator performance work-around #988

jeffdaily commented Apr 7, 2022

[ROCm] revert cat operator performance work-around #988

[ROCm] revert cat operator performance work-around #988

Conversation

jeffdaily commented Apr 7, 2022