TPCClusterFinder: Improve performance of noisy pad filter on CPU. #5249
Conversation
@fweig
I just tested this on CPU and GPU. On the NVIDIA GPU the output is the same as before and the performance is 50% better, which is quite nice.
But on the CPU something must be wrong. The performance does not seem to have improved; it is more or less identical to before, and sometimes worse.
Also, on some events I see that the number of clusters changes significantly. This seems to happen when I change the number of threads; if I run with OMP_NUM_THREADS=1 the output seems correct. I also believe this might happen when there are empty TPC sectors. Could you have a look?
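For context on why output that varies with the thread count points at the parallel code: below is a minimal, purely hypothetical illustration of how an unsynchronized update to shared state makes a result depend on OMP_NUM_THREADS. (The cause eventually found later in this thread was different; this only shows the symptom class.)

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration only, not the TPCClusterFinder code: concurrent
// unsynchronized increments of a shared counter lose updates, so the cluster
// count would change nondeterministically with the number of OMP threads.
std::size_t countClusters(const std::vector<int>& peaks)
{
  std::size_t nClusters = 0;
#pragma omp parallel for // BUG: missing reduction(+ : nClusters)
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(peaks.size()); ++i) {
    if (peaks[i] > 0) {
      ++nClusters; // data race on the shared counter
    }
  }
  return nClusters;
}
```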
For reference, the performance output I see here (obtained with
The baseline filter takes 3.5 seconds, which I would say is much too long compared to the rest. If we cannot get this down significantly, we have to think about a different algorithm.
@davidrohr Ugh, it should be much faster than that. I'll have a closer look tomorrow.
thx, simulation is
Force-pushed from 054dde9 to 4214815
@davidrohr Just pushed another update. The CPU pad filter is now vectorized. I'm seeing a speedup of ~6x compared to the version in
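As a rough picture of what a Vc-vectorized pad filter can look like, here is a sketch under assumed names (`chargeSum`, `threshold`) and a simplified noisy-pad criterion; it is not the actual O2 implementation, only the general loop structure.

```cpp
#include <Vc/Vc.h>
#include <cstddef>
#include <vector>

// Simplified sketch: flag a pad as noisy when its accumulated charge exceeds
// a threshold, processing float_v::Size pads per iteration instead of one.
// The names and the criterion are assumptions for illustration.
std::vector<char> findNoisyPads(const float* chargeSum, std::size_t nPads, float threshold)
{
  using Vc::float_v;
  std::vector<char> isNoisy(nPads, 0);
  std::size_t i = 0;
  for (; i + float_v::Size <= nPads; i += float_v::Size) {
    float_v charges(chargeSum + i, Vc::Unaligned); // explicit unaligned load
    auto noisy = charges > float_v(threshold);     // per-lane comparison mask
    for (std::size_t l = 0; l < float_v::Size; ++l) {
      isNoisy[i + l] = noisy[l];
    }
  }
  for (; i < nPads; ++i) { // scalar tail for the leftover pads
    isNoisy[i] = chargeSum[i] > threshold;
  }
  return isNoisy;
}
```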
Thx, will check tomorrow. Did you run a benchmark with the pp event simulation that I sent? What performance should be expected with such data?
@davidrohr I used your commands from above for the simulation, and also copied your command to run the reco-workflow.
Performance output from
@fweig : With the new version, I see a significant speedup in the CPU version, and the same speedup in the GPU version as before. The GPU results seem correct, but the CPU version still loses clusters; now I see it even with 1 OMP thread. I have also tested it in the standalone benchmark, with the same result.
After:
I have uploaded the standalone data I used for the test here: https://qon.jwdt.org/nmls/tmp/o2-pp-10.tar.bz2
Ok, thx. I'm on it.
Force-pushed from 4214815 to aedcd43
I was able to reproduce the missing clusters and hopefully fix the problem. Apparently I was initializing the Vc vectors the wrong way, which caused undefined behavior. When compiled with gcc 7, which I used, it still worked fine, but it broke with newer gcc versions. I tested your dataset with gcc 7 and gcc 10 on the standalone benchmark and got the correct number of clusters with both compilers, with one and with multiple OMP threads, as well as with the GPU version.
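To illustrate the class of bug (my reading of the description, not the actual diff): Vc's pointer constructor performs an aligned load by default, so passing a pointer that is not aligned to the vector width is undefined behavior; one compiler may emit code that happens to work, another may not.

```cpp
#include <Vc/Vc.h>
#include <cstddef>

// Hypothetical example of the pitfall, not the actual PR code.
void loadPadCharges(const float* charges, std::size_t offset)
{
  using Vc::float_v;
  // float_v bad(charges + offset);              // aligned load by default:
  //                                             // UB if the pointer is misaligned
  float_v good(charges + offset, Vc::Unaligned); // explicit unaligned load: valid
  (void)good; // 'good' would feed the real computation
}
```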
@fweig : Thx, this seems correct now. One unfortunate point is that there is no fallback without Vc, while the standalone benchmark can currently still work without Vc. Could you add a fallback CPU implementation with a
protection, such that it also works without Vc? A sketch of such a guard follows.
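A minimal sketch of such a compile-time guard with a scalar fallback; the macro name `HAVE_VC`, the function name, and the placeholder filter body are assumptions, not the actual O2 build flag or interfaces.

```cpp
#ifdef HAVE_VC
#include <Vc/Vc.h>
#endif
#include <cstddef>

// Placeholder body (pass-through copy); the real filter math would go here.
void filterPads(const float* in, float* out, std::size_t n)
{
#ifdef HAVE_VC
  using Vc::float_v;
  std::size_t i = 0;
  for (; i + float_v::Size <= n; i += float_v::Size) {
    float_v v(in + i, Vc::Unaligned); // vectorized path when Vc is available
    v.store(out + i, Vc::Unaligned);
  }
  for (; i < n; ++i) {
    out[i] = in[i]; // scalar tail
  }
#else
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = in[i]; // plain scalar fallback when Vc is unavailable
  }
#endif
}
```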
Force-pushed from aedcd43 to d655404
@davidrohr Ah, forgot to ask about that. A fallback for when Vc is not available is now up.
This PR adds a separate, faster CPU implementation of the noisy pad filter.