The sm_efficiency of this raytracing program is just 1.79% #1

Open
WilliamWangPeng opened this issue Sep 3, 2021 · 1 comment

@WilliamWangPeng

Hi dear author,
It's an honor to open an issue here. I compiled your "raytracing" program successfully and used nvprof to measure sm_efficiency, which is only 1.79% for the render kernel.

==17595== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla P100-PCIE-16GB (0)"
    Kernel: render_init(int, int, curandStateXORWOW*)
          1                             sm_efficiency                   Multiprocessor Activity       1.75%       1.75%       1.75%
    Kernel: rand_init(curandStateXORWOW*)
          1                             sm_efficiency                   Multiprocessor Activity       1.00%       1.00%       1.00%
    Kernel: render(Vec3*, int, int, int, Camera**, Entity**, curandStateXORWOW*)
          1                             sm_efficiency                   Multiprocessor Activity       1.79%       1.79%       1.79%
    Kernel: texture_init(unsigned char*, int, int, ImageTexture**)
          1                             sm_efficiency                   Multiprocessor Activity       1.60%       1.60%       1.60%
    Kernel: create_cornell_box(Entity**, Entity**, Camera**, int, int, ImageTexture**, curandStateXORWOW*)
          1                             sm_efficiency                   Multiprocessor Activity       1.78%       1.78%       1.78%

thank you
Best Regards
William

@Belval
Owner

Belval commented Sep 3, 2021

That's actually super interesting, but not very surprising. My code only uses a single SM, and the P100 has 56 of those. Some quick back-of-the-napkin math tells us that the one SM we are using is being used at 100%, because 100 / 56 = 1.7857, or 1.79%.
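To make that arithmetic explicit, here is a minimal sketch. It assumes nvprof's sm_efficiency is averaged over all multiprocessors on the device, so one fully busy SM out of 56 caps the reported value at 1/56. The function name `max_sm_efficiency_percent` is made up for illustration and is not part of the repository or of any CUDA API.

```cpp
#include <cassert>

// Upper bound on the device-wide sm_efficiency (in percent) when only
// `busy_sms` of the `total_sms` multiprocessors have any work and each
// of those is active 100% of the time. Hypothetical helper for illustration.
double max_sm_efficiency_percent(int busy_sms, int total_sms) {
    assert(total_sms > 0 && busy_sms >= 0 && busy_sms <= total_sms);
    return 100.0 * busy_sms / total_sms;
}
```

Calling `max_sm_efficiency_percent(1, 56)` gives roughly 1.7857, which matches the 1.79% reported for the render kernel.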

Now, an interesting question would be: how can we change the code to use all available SMs? To be clear, I don't have an answer, but as far as I could tell from Googling around, this is not something you want to do by hand, as the CUDA scheduler is smart enough to figure out a good allocation for us. My intuition is that since every pixel takes a while to render (by default I run with a depth of 400), there is no need for multiple SMs because we already achieve max utilization.
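For anyone who wants to explore that question: a thread block runs on a single SM, so a launch needs at least as many blocks as there are SMs before every SM can receive work. The sketch below shows only the sizing arithmetic, assuming the image is tiled into 2D blocks. The names `Grid`, `blocks_for_image` and `has_enough_blocks_for_all_sms` are hypothetical, and nothing here reflects how the repository actually launches its kernels.

```cpp
#include <cassert>

// Number of blocks along each image axis (hypothetical helper type).
struct Grid {
    int x;
    int y;
};

// Blocks needed to cover a width x height image when each block handles
// a tile_x x tile_y tile of pixels (ceiling division on both axes).
Grid blocks_for_image(int width, int height, int tile_x, int tile_y) {
    assert(width > 0 && height > 0 && tile_x > 0 && tile_y > 0);
    return Grid{(width + tile_x - 1) / tile_x,
                (height + tile_y - 1) / tile_y};
}

// Necessary (not sufficient) condition for every SM to get a block:
// the grid must contain at least `sm_count` blocks.
bool has_enough_blocks_for_all_sms(Grid g, int sm_count) {
    return static_cast<long long>(g.x) * g.y >= sm_count;
}
```

With an illustrative 1200 x 800 image and 8 x 8 tiles, `blocks_for_image` yields a 150 x 100 grid, far more than the 56 SMs of a P100. A launch with a single block can never occupy more than one SM. Whether spreading the work this way actually speeds up the render would have to be measured.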

That being said, if you figure out a way to make the code faster I am always interested!
