This repository was archived by the owner on Dec 9, 2024. It is now read-only.

Updated/refactored T5 DNN #304

Merged
VourMa merged 11 commits into SegmentLinking:master from jkguiang:t5-dnn
Jul 25, 2023

Conversation

@jkguiang
Contributor

Summary

Made the following updates to the existing T5 DNN (#291):

  • Trained slightly longer
  • Saved more working points
  • Added ifdef toggles for the chi-squared and DNN cuts:
    • -DUSE_RZCHI2 toggles the r-z chi-squared cut
    • -DUSE_T5_DNN toggles the T5 DNN cuts
    • -DUSE_RPHICHI2 toggles the r-phi chi-squared cuts
    • Currently, -DUSE_RZCHI2 and -DUSE_T5_DNN are enabled (see SDL/Makefile)
  • Refactored the neural network code so that the matrix multiplication is now done within T5DNN::runInference

Timing

This PR (measured July 11th, 2023 around 9:30am PDT)

Total Timing Summary
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Event      Short           Rate
   avg      1.7      2.0      1.4      3.4      3.7      1.4      1.6      1.0      5.1      21.3      18.3+/-  5.1      22.9   explicit_cache[s=1]
   avg      4.4      3.7      1.9      5.2      5.3      1.5      3.6      2.1      6.9      34.8      28.8+/-  7.7      19.3   explicit_cache[s=2]
   avg      8.0      5.5      3.1      9.6      8.8      1.9      7.3      4.2     11.2      59.6      49.8+/- 11.9      16.9   explicit_cache[s=4]
   avg     13.9      7.4      4.8     14.0     14.2      2.4     12.4      6.5     15.7      91.3      75.0+/- 14.1      15.8   explicit_cache[s=6]
   avg     24.9     10.2      6.4     19.6     20.8      3.2     17.2      9.4     22.3     134.0     105.9+/- 25.8      17.6   explicit_cache[s=8]

Pre-DNN (measured July 11th, 2023 around 9:40am PDT)

Total Timing Summary
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Event      Short           Rate
   avg      1.6      2.0      1.4      3.4      3.4      1.4      1.3      1.0      2.2      17.8      14.8+/-  3.5      19.4   explicit_cache[s=1]
   avg      2.4      3.1      1.8      5.0      4.6      1.4      2.6      1.5      3.5      25.9      22.1+/-  4.1      13.8   explicit_cache[s=2]
   avg      6.1      5.7      3.3     10.1      8.5      1.7      6.3      3.4      6.4      51.4      43.6+/-  7.6      13.3   explicit_cache[s=4]
   avg      9.6      6.7      4.2     12.5     12.5      2.0      9.1      4.8      9.0      70.5      58.8+/- 10.2      12.2   explicit_cache[s=6]
   avg     14.7      8.0      5.1     16.1     16.9      2.4     12.3      6.0     11.6      93.3      76.1+/- 15.9      12.6   explicit_cache[s=8]

@VourMa
Contributor

VourMa commented Jul 11, 2023

I notice a major slowdown in TCs. Is this understood/expected?

@GNiendorf
Member

Could you add a command-line toggle for the DNN, like we do for the caching allocator? For example, "sdl_make_tracklooper -mcd", where -d toggles the DNN, or something like that.

Comment thread SDL/Quintuplet.cu Outdated
@jkguiang
Contributor Author

Somehow, changing the SDL Makefile fixed the TC timing increase:

This PR (measured July 14th, 2023 around 7:30am)

Total Timing Summary
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Event      Short           Rate
   avg      1.6      2.0      1.5      3.4      4.0      1.4      1.3      1.0      2.1      18.2      15.2+/-  3.5      19.7   explicit_cache[s=1]
   avg      3.7      3.4      1.9      5.4      5.4      1.5      2.8      1.7      3.5      29.3      24.1+/-  4.7      15.2   explicit_cache[s=2]
   avg      5.5      5.3      3.2      9.0      9.0      1.7      6.2      3.1      6.5      49.6      42.4+/-  8.3      12.9   explicit_cache[s=4]
   avg     10.5      6.5      4.3     12.9     12.9      2.2      9.2      4.7      8.5      71.8      59.1+/- 11.7      13.0   explicit_cache[s=6]
   avg     14.6      8.4      6.0     17.0     18.6      2.6     13.4      6.6     11.8      99.1      81.8+/- 12.0      12.8   explicit_cache[s=8]

Pre-DNN (measured July 14th, 2023 around 7:40am)

Total Timing Summary
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Event      Short           Rate
   avg      1.6      2.0      1.5      3.4      3.5      1.4      1.3      1.0      2.2      17.9      14.9+/-  3.6      19.4   explicit_cache[s=1]
   avg      2.3      3.4      1.9      5.1      4.6      1.5      2.8      1.5      3.5      26.5      22.7+/-  4.8      14.0   explicit_cache[s=2]
   avg      5.0      5.1      3.2      9.2      8.2      1.7      6.2      3.2      6.1      47.9      41.2+/-  7.7      12.5   explicit_cache[s=4]
   avg      9.1      6.8      4.4     12.8     12.3      2.1      9.6      4.8      8.9      70.7      59.5+/- 11.0      12.2   explicit_cache[s=6]
   avg     15.2      8.3      5.0     16.3     17.0      2.5     12.4      6.2     11.9      94.7      77.0+/- 13.4      12.2   explicit_cache[s=8]

I have also merged NeuralNetwork.cu with NeuralNetwork.cuh as requested in the last LST meeting. As for adding a toggle for the DNN in the SDL CLI, I have not worked on that yet. Do we need it? I do not know how trivial it is to add.

@GNiendorf
Member

For the toggle, it should only take a couple more lines of code. See how it's done for the other toggles we have: https://github.com/SegmentLinking/TrackLooper/blob/master/bin/sdl_make_tracklooper

@GNiendorf
Member

@VourMa or @slava77 Can I get your take on whether you think a toggle would be beneficial here? My concern is that after this gets pushed, the only way someone would know how to turn the DNN off properly would be to go to this PR and find the two relevant flags that need to be turned on when the DNN is toggled off. I think it is more likely that people will simply make the mistake of turning off the DNN flag without turning on the -DUSE_RPHICHI2 flag, although I'm not sure how big of a difference that flag makes. Am I missing something here @jkguiang? A toggle would do this automatically, avoid potential confusion in the future, and make things less complicated for new users. And it should take fewer than 15 lines of code to accomplish, since we already make use of other toggles, right?

Contributor

@VourMa VourMa left a comment


The PR looks good and the comments are generally minor. There are two open issues I see, beyond the current code changes:

  1. There has been a request for a command-line toggle for the usage of the DNN within the code. Even though it is not mandatory, it could be useful, since dealing with more than one compilation flag for a single operation quickly gets complicated and forgotten (hence my comment about the cleanup of the Makefile, which needs a separate PR to be brought up to date).
    Adding a toggle should not be that hard: first, one creates a separate make target, as here, with the proper extra compilation flag enabled. Then, this make target is turned on using the command line argument like this.
  2. In my opinion, the increase in timing is still there, and it was not at all clear to me what happened in the meantime such that 3 ms went down to ~0.5 ms. Could you please comment on the investigations you did to understand this? I feel it is OK to take a sub-ms hit to the timing, but we should know it and we should know why.

Comment thread SDL/Makefile
Comment thread SDL/NeuralNetwork.cuh Outdated
Comment thread SDL/NeuralNetwork.cuh Outdated
Comment thread SDL/NeuralNetwork.cuh
@slava77
Contributor

slava77 commented Jul 18, 2023

I think more likely people will just make the mistake of turning off the DNN flag without turning on the -DUSE_RPHICHI2 flag, although I'm not sure how big of a difference this flag makes. Am I missing something here @jkguiang? A toggle would do this automatically and avoid potential confusion in the future, as well as making it less complicated for new users.

is the point to be sure that both are not used? (a "toggle")

@GNiendorf
Member

GNiendorf commented Jul 18, 2023

is the point to be sure that both are not used? (a "toggle")

I think the correct behavior is that if the DNN is being used, -DUSE_RZCHI2 and -DUSE_T5_DNN should be turned on, and if the DNN is not being used, then -DUSE_RZCHI2 and -DUSE_RPHICHI2 should be turned on. At least, that keeps it consistent with the cuts that were being applied before the DNN. A toggle would do this automatically by just including a -d (for example) or not when running sdl_make_tracklooper. For example, sdl_make_tracklooper -mcd, with no need to change the Makefile yourself.
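
A sketch of how such a command-line toggle could be parsed (the option letters and the exact flag mapping are illustrative, not the actual sdl_make_tracklooper interface):

```shell
# Hypothetical sketch of a -d toggle choosing between the two flag sets
# described above. Not the real sdl_make_tracklooper script.
cat > pick_flags.sh <<'EOF'
#!/bin/bash
# Default: pre-DNN chi-squared cuts
CUTFLAGS="-DUSE_RZCHI2 -DUSE_RPHICHI2"
while getopts "mcd" opt; do
    case $opt in
        d) CUTFLAGS="-DUSE_RZCHI2 -DUSE_T5_DNN" ;;  # DNN replaces the r-phi cut
        *) ;;  # -m, -c would map to the existing toggles (omitted here)
    esac
done
echo "$CUTFLAGS"
EOF
chmod +x pick_flags.sh

./pick_flags.sh        # prints "-DUSE_RZCHI2 -DUSE_RPHICHI2"
./pick_flags.sh -mcd   # prints "-DUSE_RZCHI2 -DUSE_T5_DNN"
```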

@jkguiang
Contributor Author

jkguiang commented Jul 19, 2023

Adding a toggle should not be that hard: First, one creates a separate make target, as here, with the proper extra compilation flag enabled. Then, this make target is turned on using the command line argument like this.

If we add a make target as above, I believe we will just need to be careful about cases where we want more than one make target to be used, since I am guessing we will want the DNN-toggling make targets to work alongside any of the make targets already defined.

In any case, I agree adding a flag to sdl_make_tracklooper would make things easier.

@VourMa
Contributor

VourMa commented Jul 19, 2023

I am guessing we will want to be able to use the DNN-toggling make targets alongside any of the make targets already defined.

You're right. Then this is probably a better example?

@jkguiang
Contributor Author

In my opinion, the increase in timing is still there, and it was not at all clear to me what happened in the meantime such that 3 ms went down to ~0.5 ms. Could you please comment on the investigations you did to understand this? I feel it is OK to take a sub-ms hit to the timing, but we should know it and we should know why.

For this, I made only a brief mention of what I had done:

Somehow, by changing the SDL Makefile, the TC timing increase is fixed

Expanding on this, I had originally put the flags in the Makefile as follows:

CUTVALUEFLAG_FLAGS = -DCUT_VALUE_DEBUG -DUSE_RZCHI2 -DUSE_T5_DNN
%_cuda.o : %.cu %.cuh
	$(LD) -x cu $(PT0P8) ... $(CUTVALUEFLAG) $(DUPLICATES) $< -o $@

With the Makefile configured as above, I got the strangely long runtimes. Then, I moved the flags:

CUTVALUEFLAG_FLAGS = -DCUT_VALUE_DEBUG
CUTFLAGS = -DUSE_RZCHI2 -DUSE_T5_DNN
%_cuda.o : %.cu %.cuh
	$(LD) -x cu $(PT0P8) ... $(CUTVALUEFLAG) $(CUTFLAGS) $(DUPLICATES) $< -o $@

This somehow fixed the issue I was having. I have no idea why this changed the runtime.

@VourMa
Contributor

VourMa commented Jul 19, 2023

This somehow fixed the issue I was having. I have no idea why this changed the runtime.

In the "slow" configuration:

CUTVALUEFLAG_FLAGS = -DCUT_VALUE_DEBUG -DUSE_RZCHI2 -DUSE_T5_DNN
%_cuda.o : %.cu %.cuh
	$(LD) -x cu $(PT0P8) ... $(CUTVALUEFLAG) $(DUPLICATES) $< -o $@

I am not even sure you included the flags in the compilation. You were changing the CUTVALUEFLAG_FLAGS variable, while the one used in the nominal make target is CUTVALUEFLAG. CUTVALUEFLAG_FLAGS is only picked up by the explicit_cache_cutvalue make target:

explicit_cache_cutvalue: CUTVALUEFLAG = $(CUTVALUEFLAG_FLAGS)

If what I am saying is correct, I am not even sure what you were running: with no DNN and no RZCHI2 cut for the T5, I do not know what that does or how the validations came out fine.

In the "fast" configuration:

CUTVALUEFLAG_FLAGS = -DCUT_VALUE_DEBUG
CUTFLAGS = -DUSE_RZCHI2 -DUSE_T5_DNN
%_cuda.o : %.cu %.cuh
	$(LD) -x cu $(PT0P8) ... $(CUTVALUEFLAG) $(CUTFLAGS) $(DUPLICATES) $< -o $@

you are now correctly including the USE_RZCHI2 and USE_T5_DNN flags, so I expect this configuration to give the right results. However, when the explicit_cache_cutvalue is run, these flags are overwritten:

explicit_cache_cutvalue: CUTVALUEFLAG = $(CUTVALUEFLAG_FLAGS)

Could you please fix the above issue and double check that we are loading the flags in all cases we want them loaded?
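
The override behavior described above can be reproduced with a toy Makefile (shortened target and variable names; this is not the real SDL/Makefile): flags placed only in CUTVALUEFLAG_FLAGS never reach the nominal compile line, and the cutvalue-style target replaces CUTVALUEFLAG wholesale.

```shell
# Toy reproduction of the pitfall: the compile recipe uses CUTVALUEFLAG,
# while CUTVALUEFLAG_FLAGS only takes effect through the target-specific
# assignment on the cutvalue target (mirroring explicit_cache_cutvalue).
printf 'CUTVALUEFLAG =\nCUTVALUEFLAG_FLAGS = -DCUT_VALUE_DEBUG -DUSE_T5_DNN\ncutvalue: CUTVALUEFLAG = $(CUTVALUEFLAG_FLAGS)\ncutvalue: compile\ncompile:\n\t@echo "FLAGS=$(CUTVALUEFLAG)"\n' > Makefile.pitfall

make -f Makefile.pitfall compile    # prints "FLAGS=" (the flags are silently dropped)
make -f Makefile.pitfall cutvalue   # prints "FLAGS=-DCUT_VALUE_DEBUG -DUSE_T5_DNN"
```

This matches the observation above: flags appended only to CUTVALUEFLAG_FLAGS are left out of the nominal build entirely, and only appear when the cutvalue-style target is built.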

@jkguiang
Contributor Author

I believe I have addressed the comments made so far.

  1. I have added a toggle -N to sdl_make_tracklooper that toggles the T5 DNN
  2. In order to implement (1), I no longer put the DNN and chi-squared flags in CUTVALUEFLAG, so they should still be toggled when the make target is explicit_cache_cutvalue
  3. I verified that the toggles indeed work by putting ifdef statements for each flag in Event.cu in createQuintuplets that printed a message to stdout for each flag
  4. I have made the arguments to T5DNN::runInference const where appropriate
  5. I re-ran the timing and am now seeing smaller differences with respect to the pre-DNN timing:
Total Timing Summary
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Event      Short           Rate
   avg      1.6      2.0      1.4      3.3      3.4      1.4      1.4      1.1      2.2      17.7      14.8+/-  3.4      19.3   explicit_cache[s=1]
   avg      3.8      3.4      1.9      5.2      4.8      1.4      2.9      1.7      3.6      28.7      23.4+/-  5.1      15.0   explicit_cache[s=2]
   avg      7.8      5.0      3.0      8.9      7.8      1.7      5.8      3.3      6.6      49.8      40.3+/-  8.6      13.6   explicit_cache[s=4]
   avg      8.9      6.7      4.3     13.3     12.5      2.1      9.5      4.6      8.7      70.5      59.5+/- 10.8      12.2   explicit_cache[s=6]
   avg     14.2      8.6      5.7     17.7     17.8      2.6     13.3      6.4     11.4      97.6      80.8+/- 15.4      12.6   explicit_cache[s=8]

Finally, per the last comment:

I am not even sure you included the flags in the compilation.

This concerned me as well; however, I had verified several times that the flags were indeed being used. From (3), I now know for a fact that the flags are being toggled, and the plots do not change. We also know empirically that the flags were being used before, because the fake rate changed significantly with respect to the baseline. Nevertheless, I do not understand how this was possible.

@VourMa
Contributor

VourMa commented Jul 21, 2023

I think that during today's meeting we agreed that my explanation above makes sense from a technical point of view as to why there was a timing increase, and your solution with the toggle covers the issue I had mentioned.

I noticed two comments not addressed:

  1. The move of constant variables to the appropriate file (comment)
  2. The check of the profiling report that the declaration of the extra variable does not increase the register usage (comment)

Let me know if you plan to address them. Thanks!

@jkguiang
Contributor Author

I have moved the constants per (1). For (2), I had not used the profiler before (I have been focused on ML development), so I do not know exactly how to make the comparison. If it is a big worry, I can just remove the extra variable declarations; otherwise, I will not have time to look into how to use the profiler until later this week at the earliest. In the meantime, I have run the profiler (per the command on the wiki) and put the files here in case someone else has time to look:

http://uaf-10.t2.ucsd.edu/~jguiang/dump/PR304.nsys-rep
http://uaf-10.t2.ucsd.edu/~jguiang/dump/PR304.sqlite

@slava77
Contributor

slava77 commented Jul 24, 2023

I have moved the constants per (1). For (2), I have not used the profiler (as I have been focused on ML development), so I do not know exactly how to make the comparison. If it is a huge worry, I can just remove the extra variable declarations. Otherwise, I will not have time to look into how to use the profiler until later this week at the earliest. I have run the profiler (per the command on the wiki) and put the files here if someone else has time to look:

http://uaf-10.t2.ucsd.edu/~jguiang/dump/PR304.nsys-rep http://uaf-10.t2.ucsd.edu/~jguiang/dump/PR304.sqlite

I'd like to see ncu outputs to see code line details.

@jkguiang
Contributor Author

I'd like to see ncu outputs to see code line details.

I ran the following command on cgpu-1:

/opt/nvidia/nsight-compute/2022.2.1/ncu --set full -o PR304 -f --import-source on ./bin/sdl -n 1 -v 0 -i PU200

and put the output here:
http://uaf-10.t2.ucsd.edu/~jguiang/dump/PR304.ncu-rep

@GNiendorf
Member

GNiendorf commented Jul 24, 2023

I don't see any register usage from the is_endcap variables. @slava77

[Screenshot 2023-07-24 at 3:06:06 PM: ncu report view]

@slava77
Contributor

slava77 commented Jul 24, 2023

I don't see any register usage from the is_endcap variables

👍

Contributor

@VourMa VourMa left a comment


Thank you all for following up on this. I am merging the PR.

@VourMa VourMa merged commit a8c352a into SegmentLinking:master Jul 25, 2023
Comment thread SDL/NeuralNetwork.cuh
Comment on lines +91 to +95
mdsInGPU.anchorEta[mdIndex3], // outer T3 anchor hit 4 eta (t3_0_eta)
mdsInGPU.anchorPhi[mdIndex3], // outer T3 anchor hit 4 phi (t3_0_phi)
mdsInGPU.anchorZ[mdIndex3], // outer T3 anchor hit 3 eta (t3_0_z)
sqrtf(x3*x3 + y3*y3), // outer T3 anchor hit 3 r (t3_0_r)
float(modulesInGPU.layers[lowerModuleIndex3] + 6*is_endcap3), // outer T3 anchor hit 3 layer (t3_0_layer)
Contributor


@jkguiang
is this intentionally identical to L85-90?
we could have saved a bit on the number of weights and matrix operations, and avoided having the network learn identities.

Please check and perhaps keep a todo somewhere to possibly clean this up

Contributor Author


It was not a design decision, but it is something we could indeed clean up.

