Performance issue with complicated loops in function #984

@zhangcx93

Description

Describe the bug
When running my script, I found that the first run of the main function containing the for loops is unreasonably slow.
Moving the code out of the main function to global scope makes the first iteration of the loop much faster and the following iterations as fast as warmed-up code.

To reproduce

The Minimal Working Example (MWE) for this bug:

using CUDA
using Statistics

function some_func(x, y)
    a = CUDA.rand(1000)
    b = CUDA.rand(1000)
    c = a .* b
    return c .* x .* y
end

function main()
    result = []
    count = 0
    for i in 1:3
        t = time()
        for j in 0:20
            for k in 1:20
                count += 1
                for m in 1:20
                    a = CUDA.rand(1000)
                    b = CUDA.rand(1000)
                    c = some_func(a, b)
                    push!(result, mean(c))
                end
            end
        end
        println(time() - t)
    end
end

main()

The printed timings are:
First run:

42.087000131607056
33.35199999809265
34.365999937057495

Second run:

1.1009998321533203
0.8480000495910645
1.4190001487731934

Without main, with the loop at global scope (sketched below):

10.562000036239624
0.935999870300293
1.316999912261963
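
For reference, the global-scope variant is roughly the following sketch (the same body as main() above, moved to the top level; some_func is as defined in the MWE, and count needs a global annotation when assigned inside a top-level loop):

using CUDA
using Statistics

result = []
count = 0
for i in 1:3
    t = time()
    for j in 0:20
        for k in 1:20
            global count += 1  # assignment to a global inside a top-level loop
            for m in 1:20
                a = CUDA.rand(1000)
                b = CUDA.rand(1000)
                c = some_func(a, b)
                push!(result, mean(c))
            end
        end
    end
    println(time() - t)
end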

Expected behavior

The outer loop in main should take around 10 s for the first iteration and around 1 s for each subsequent one, not 42 s for the first iteration and around 33 s for the following ones. 42 s and 33 s are far too long to be explained by compilation alone.
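
One rough way to separate compilation overhead from steady-state execution (assuming the MWE above is loaded in a fresh session) is to time two consecutive calls; the first includes compilation, the second should not:

@time main()   # first call: includes compilation of main() and the broadcast kernels
@time main()   # second call: mostly steady-state execution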

Version info

Details on Julia:

Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_GPU_ALLOWSCALAR = false

Details on CUDA:

CUDA toolkit 11.3.1, artifact installation
CUDA driver 11.3.0
NVIDIA driver 466.77.0

Libraries:
- CUBLAS: 11.5.1
- CURAND: 10.2.4
- CUFFT: 10.4.2
- CUSOLVER: 11.1.2
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+466.77
- CUDNN: 8.20.0 (for CUDA 11.3.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:
- Julia: 1.6.1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA GeForce RTX 2080 Ti (sm_75, 8.582 GiB / 11.000 GiB available)

Metadata

Assignees

No one assigned

    Labels

    performance (How fast can we go?), regression (Something that used to work, doesn't anymore.)
